Scoring Methodology
Standing prose reference for the scoring approach. Distinct from the
methodology page in the product (esg-screen.org/methodology), which
is generated from live DB joins and shows per-rule status. This page is
the why — the methodology in prose, with the decision history.
For the formal decision, see ADR-0001.
For the data model the scores live in, see Data model.
What the score is
For each institution, the system produces:
- A Stage 1 ESG composite (0–100) — universal, all institutions
- Three pillar scores (E, S, G, each 0–100)
- A coverage percentage (0–100%) — how much of the applicable rule catalogue has live signals
- A confidence percentage (0–100%) — aggregated data quality
- A peer rank (0–100%) — percentile within resolved peer group
- For financials: a Stage 2 composite combining ESG with Credit and Returns (currently placeholder)
All scores are stored at four grains in the DB: sub-criterion / pillar / stage 1 / stage 2.
How the score is computed
Step 1 — signal → sub-criterion score
A scraper writes a signal row for a rule. The scoring engine
(src/scoring/sub_criterion.js) converts that signal to a 0–100 score
according to the rule's scoring logic:
- Boolean rules (e.g. "is the institution an NZBA member?"): positive = 100, negative = 0.
- 0–100 direct rules (e.g. "what fraction of revenue is fossil-fuel exposed?"): value scaled directly.
- Deduction rules (e.g. "controversies, capped at −3"): base value of 100, deducted per finding, floored at 0.
The 0–5 form (raw_score post-migration 008) is the v0.4 native form;
0–100 is the form used for display and aggregation.
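A minimal sketch of the three scoring branches above, working on the 0–100 scale. This is not the contents of src/scoring/sub_criterion.js: the type names, the findings field, and deductionPerFinding are hypothetical, and the deduction cap and the 0–5 to 0–100 conversion are left out because their exact form isn't given here.

```javascript
// Sketch only: rule.type values, signal.findings and
// rule.deductionPerFinding are hypothetical names for illustration.
function scoreSubCriterion(rule, signal) {
  switch (rule.type) {
    case "boolean":
      // Positive finding = 100, negative finding = 0.
      return signal.value ? 100 : 0;
    case "direct":
      // Value is already on the 0-100 scale; clamp defensively.
      return Math.min(100, Math.max(0, signal.value));
    case "deduction": {
      // Start from 100, subtract per finding, floor at 0.
      const remaining = 100 - signal.findings * rule.deductionPerFinding;
      return Math.max(0, remaining);
    }
    default:
      throw new Error(`Unknown rule type: ${rule.type}`);
  }
}
```

A negative boolean finding scores 0 but still counts as a signal, which matters for coverage later on.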
Step 2 — sub-criterion → pillar
Pillar scores are weighted averages of sub-criterion scores by
rule_weight, computed over rules applicable to the institution per
GICS classification.
The key methodology choice (ADR-0001): the average is computed only over rules with live signals, not over all applicable rules.
displayed_pillar_score = weighted_avg(covered_rules_in_pillar)
Rules without signals (uncovered rules) are excluded from the headline average. They still appear in the rule catalogue with their base values (50 for direct rules, 100 for deduction rules), but the headline score doesn't blend those base values in.
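The coverage-weighted average can be sketched as follows, assuming each applicable rule arrives as a hypothetical { weight, score } pair with score set to null when there is no live signal. Base values never enter the calculation.

```javascript
// Sketch only: the { weight, score } shape is an assumption.
// score === null means "no live signal for this rule".
function displayedPillarScore(rules) {
  const covered = rules.filter((r) => r.score !== null);
  const totalWeight = covered.reduce((sum, r) => sum + r.weight, 0);
  // A pillar with no covered rules has no score at all.
  if (totalWeight === 0) return null;
  const weightedSum = covered.reduce((sum, r) => sum + r.weight * r.score, 0);
  return weightedSum / totalWeight;
}

const pillar = [
  { weight: 2, score: 100 },
  { weight: 1, score: 40 },
  { weight: 5, score: null }, // uncovered: ignored, however heavy
];
console.log(displayedPillarScore(pillar)); // 80
```

The uncovered rule carries the largest weight in this illustration and still has no effect on the result, which is the point of the ADR-0001 choice.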
Step 3 — pillar → composite
The Stage 1 composite is the pillar-weighted average of displayed
pillar scores, using blend_weight pillar weights.
When a pillar has zero covered rules, its score is null and the
composite is computed across remaining pillars with weights
re-normalised. When all three pillars have zero coverage, the composite
itself is null and the UI shows "no covered rules."
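The re-normalisation described above can be sketched like this. The function name and object shapes are hypothetical; the pillar weights are the universal blend_weight values quoted later on this page (E=0.40, S=0.30, G=0.30), and the pillar scores are illustrative.

```javascript
// Sketch only: null pillars are skipped and the remaining weights
// are re-normalised by dividing by their sum.
function stage1Composite(pillarScores, pillarWeights) {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const [pillar, score] of Object.entries(pillarScores)) {
    if (score === null) continue;
    weightedSum += pillarWeights[pillar] * score;
    totalWeight += pillarWeights[pillar];
  }
  // All pillars null: the composite itself is null.
  return totalWeight === 0 ? null : weightedSum / totalWeight;
}

const weights = { E: 0.4, S: 0.3, G: 0.3 };
// S is null, so E and G are effectively weighted 0.4/0.7 and 0.3/0.7.
const composite = stage1Composite({ E: 60, S: null, G: 80 }, weights);
console.log(composite.toFixed(1)); // "68.6"
```

Dividing by the sum of the surviving weights is equivalent to rescaling them to add up to 1 before averaging.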
Step 4 — composite → peer rank
The institution's composite is ranked within its peer group. Peer groups resolve via a GICS fallback ladder (sub-industry → industry → industry group → sector) with size thresholds 10/15/25: the first level with enough peers wins.
A peer_distribution row records the distribution per peer group per
run for percentile computation.
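A hedged sketch of the ladder walk and the percentile step. How the 10/15/25 thresholds map onto the four GICS levels isn't spelled out here, so the ladder is passed in as a parameter. The field names, the "at or below" percentile definition, and the inclusion of the institution in its own peer group are all assumptions.

```javascript
// Sketch only: ladder is an ordered list of { level, minPeers },
// narrowest level first. Field names are hypothetical.
function resolvePeerGroup(institution, universe, ladder) {
  for (const { level, minPeers } of ladder) {
    const peers = universe.filter((u) => u[level] === institution[level]);
    // First level with enough peers wins.
    if (peers.length >= minPeers) return { level, peers };
  }
  return null; // no level had enough peers
}

// Percentile as the share of peer composites at or below the value.
function percentileRank(value, peerValues) {
  if (peerValues.length === 0) return null;
  const atOrBelow = peerValues.filter((v) => v <= value).length;
  return (100 * atOrBelow) / peerValues.length;
}
```

Passing the ladder in keeps the thresholds out of the code path, so they can be changed without touching the resolution logic.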
Why coverage-weighted (the ADR-0001 reasoning)
The v0.4 framework assigns base values to rules without live signals: 50 for direct rules (neutral assumption), 100 for deduction rules (no-deduction-by-default). Pillar and composite scores were specified as weighted averages over all applicable rules, base values included.
The implicit assumption was that base values represent a small uncovered tail of a mostly-covered system. At May 2026 the system has ~3 covered rules of ~27 applicable for the financials pilot — coverage ~11%. The headline composite under v0.4 methodology was therefore ~90% base-value noise and ~10% real signal.
The displayed number was mathematically valid per v0.4 but semantically meaningless: a user couldn't tell whether 19.2 meant "genuinely poor on ESG" or "we mostly haven't measured this institution yet."
ADR-0001 resolves this by computing the displayed average only over rules with live signals. Coverage becomes a co-equal first-class metric in the UI (per ADR-0002 and ADR-0004), structurally visible.
The 50/100 base values are retained in the data model for forward compatibility, per-rule display in the rule catalogue, and methodology audit trails. They're simply no longer included in the headline averages.
Reading the two numbers
The hero of the institution detail page shows two numbers with equal visual weight:
| What you see | What it means |
|---|---|
| Composite ESG: 66.7 | Of the rules we have live signals for, this institution sits at 66.7 / 100. |
| Coverage: 11% | We have live signals for 3 of 27 applicable rules. |
Both are RAG colour-coded per ADR-0004:
- Score: Red <40 · Amber 40–69 · Green ≥70
- Coverage: Red <25% · Amber 25–59% · Green ≥60%
So "amber score, red coverage" reads as "uncertain, mostly because we don't know enough yet." That's a defensible thing to say to a client; the v0.4 raw number (~19.2) wasn't.
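The two threshold sets can be expressed as a pair of small functions. This is a sketch: the function names and colour strings are hypothetical, and treating fractional values such as 69.5 as amber (anything below 70) is an assumption about how the published bands are applied.

```javascript
// Score bands: red < 40, amber 40 up to (not including) 70, green >= 70.
function scoreRag(score) {
  if (score === null) return null; // no covered rules, nothing to colour
  if (score < 40) return "red";
  if (score < 70) return "amber";
  return "green";
}

// Coverage bands: red < 25%, amber 25% up to (not including) 60%, green >= 60%.
function coverageRag(coveragePct) {
  if (coveragePct < 25) return "red";
  if (coveragePct < 60) return "amber";
  return "green";
}

console.log(scoreRag(66.7), coverageRag(11)); // amber red
```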
Convergence
As scrapers land and coverage rises, the displayed score and the v0.4 raw score converge. Above ~80% coverage they are within rounding.
Both are stored in the DB (composite_raw_v04 and
composite_coverage_weighted per migration 011). The displayed one is
the coverage-weighted; the raw is for audit and convergence checks.
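To see why the two stored values approach each other, here is a deliberately flattened sketch that averages over rules directly instead of going pillar by pillar. It assumes hypothetical { weight, score } rows and a single base value of 50. With covered weight share c, the raw value is c × covered average + (1 − c) × base, so the gap between the two is (1 − c) × (covered average − base).

```javascript
// Sketch only: single-level average, one base value for all rules.
const BASE_VALUE = 50; // v0.4 base value for direct rules

function compositePair(rules) {
  let rawSum = 0, rawWeight = 0, covSum = 0, covWeight = 0;
  for (const { weight, score } of rules) {
    const covered = score !== null;
    // Raw v0.4 blends in the base value for uncovered rules.
    rawSum += weight * (covered ? score : BASE_VALUE);
    rawWeight += weight;
    if (covered) {
      covSum += weight * score;
      covWeight += weight;
    }
  }
  return {
    rawV04: rawWeight > 0 ? rawSum / rawWeight : null,
    coverageWeighted: covWeight > 0 ? covSum / covWeight : null,
  };
}

// Ten equally weighted rules that each score 80 once covered.
// Call with coveredCount >= 1; at zero coverage there is no score.
function gapAtCoverage(coveredCount) {
  const rules = Array.from({ length: 10 }, (_, i) => ({
    weight: 1,
    score: i < coveredCount ? 80 : null,
  }));
  const { rawV04, coverageWeighted } = compositePair(rules);
  return coverageWeighted - rawV04;
}
```

In this sketch the gap shrinks in proportion to the uncovered weight and vanishes at full coverage. How quickly it closes in practice depends on how far the covered scores sit from the base values.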
The ADR-0001 layout thresholds (1–49% → equal weight, 50–79% → score primary, ≥80% → single composite) drive UI structural changes. The ADR-0004 colour thresholds (25, 60) are independent and can be more granular without restructuring the page.
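The layout bands could be selected with something like the sketch below. The mode names are hypothetical labels, and treating coverage between 0% and 1% as equal weight is an assumption, since the bands as stated start at 1%.

```javascript
// Sketch only: maps a coverage percentage to a layout mode.
function layoutMode(coveragePct) {
  if (coveragePct <= 0) return "no-covered-rules"; // headline is not a number
  if (coveragePct < 50) return "equal-weight";     // 1-49%
  if (coveragePct < 80) return "score-primary";    // 50-79%
  return "single-composite";                       // >= 80%
}

console.log(layoutMode(11)); // equal-weight
```

Keeping this separate from the colour functions reflects the independence noted above: the colour thresholds can change without affecting which layout is chosen.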
Confidence vs coverage
The two are routinely confused. They are separate dimensions:
| Dimension | Question it answers | Computed how |
|---|---|---|
| Coverage | Did we look at this rule at all? | EXISTS-based: is there a signal row for this rule × institution × run? |
| Confidence | How sure are we about what we found? | Per-signal value (0.0–1.0), aggregated through the score tree as a weighted mean |
A rule with high-confidence boolean=0 (e.g. NatWest's withdrawn SBTi
commitment with conf=1.0 because the source explicitly says
"Commitment removed") is fully covered even though the finding is
negative. A rule with low-confidence not-found (e.g. Barclays missing
from SBTi with conf=0.5 because the absence could be a name-matching
issue) is also fully covered — the scraper looked, the answer was
ambiguous.
A rule that was not scraped at all (no scraper exists yet for its source) is uncovered.
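The distinction can be sketched in code. The row shape { ruleId, value, confidence } and the rule ids are hypothetical, and the weighted mean here is a single flat level standing in for aggregation through the full score tree.

```javascript
// Coverage is existence-based: a rule is covered if any signal row
// exists for it, whatever the value or confidence of that signal.
function coveragePct(applicableRuleIds, signals) {
  if (applicableRuleIds.length === 0) return null;
  const seen = new Set(signals.map((s) => s.ruleId));
  const covered = applicableRuleIds.filter((id) => seen.has(id)).length;
  return (100 * covered) / applicableRuleIds.length;
}

// Confidence is a weighted mean of per-signal confidence values.
// ruleWeights is a hypothetical lookup; missing weights default to 1.
function aggregateConfidence(signals, ruleWeights) {
  let sum = 0;
  let total = 0;
  for (const s of signals) {
    const w = ruleWeights[s.ruleId] ?? 1;
    sum += w * s.confidence;
    total += w;
  }
  return total === 0 ? null : sum / total;
}

const applicable = ["sbti", "unscraped_rule"];
// Same value (negative finding), different certainty.
const natwest = [{ ruleId: "sbti", value: 0, confidence: 1.0 }];
const barclays = [{ ruleId: "sbti", value: 0, confidence: 0.5 }];
```

Both banks come out at the same coverage in this illustration, because a signal row exists for the same rule in each case. Only the aggregated confidence differs.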
The score inflation problem (historical)
For reference, why the v0.4 methodology produced misleading numbers at low coverage:
The four UK pilot banks (Barclays, NatWest, Lloyds, HSBC) had identical ESG composites in each run: 18.7 in cycle 2 and 19.2 in cycle 3. The number moved between runs but never differed between banks. This was because:
- Only 2–3 rules out of 26 applicable had live signals
- The 23+ uncovered rules contributed base values
- The 4 banks have identical signals where there are signals (all NZBA founding members, none with current SBTi targets, all PRB signatories)
The signals do differentiate the banks — NatWest specifically has a high-confidence "Commitment removed" status on SBTi where others have low-confidence "not found" — but that differentiation only shows up in confidence, not in score, because the score values are the same.
ADR-0001's coverage-weighted methodology fixes this: with only NZBA, SBTi, and PRB covered, the displayed composite is the average of just those three rules' scores, weighted, per pillar. The banks come out ~66.7 on coverage-weighted composite — a defensible "looks reasonable on what we've measured, but we've only measured 11% of the catalogue."
When coverage is zero
At zero coverage the score is null, not zero. UI renders "no covered rules" rather than a number. This is the only display case where the headline isn't a number.
When a pillar has zero covered rules but other pillars do, the pillar shows null and the composite re-normalises across the non-null pillars. A common case during the pilot ramp-up.
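A sketch of the headline rule, assuming a hypothetical formatter that receives the composite (or null) and the coverage percentage. The one-decimal score and whole-number coverage follow the "66.7 · 11%" style used on this page.

```javascript
// Sketch only: null composite means no pillar had a covered rule.
function formatHeadline(composite, coveragePct) {
  if (composite === null) return "no covered rules";
  return `${composite.toFixed(1)} · ${Math.round(coveragePct)}%`;
}

console.log(formatHeadline(66.7, 11.1)); // 66.7 · 11%
console.log(formatHeadline(null, 0));    // no covered rules
```

Checking for null explicitly, and not for falsiness, keeps a genuine composite of 0 displayable as a number.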
Worked example — Barclays, run 3
(Hypothetical exact numbers — directionally accurate to current pilot state.)
| Pillar | Applicable rules | Covered rules | Coverage | Displayed score |
|---|---|---|---|---|
| E | 12 | 2 (NZBA, SBTi) | 17% | ~62 (weighted avg of two ~60ish scores) |
| S | 8 | 0 | 0% | null |
| G | 7 | 1 (PRB) | 14% | ~70 (PRB positive = 100, weighted) |
| Stage 1 | 27 | 3 | 11% | ~66.7 (composite of E and G, re-normalised because S is null) |
Headline display: 66.7 · 11%. Amber on score, red on coverage. Reads as: "looks fine on what we've checked, but we've only checked about a tenth."
Under v0.4 raw methodology, the same Barclays would display ~19.2 · 11%. Same coverage; very different headline. The 19.2 is the average of 3 real scores (around 60–70) plus 24 base values (50 each, plus 100 each for the deduction rules) — heavily dragged down toward the base values.
Open methodology questions
Not pinned by an ADR yet:
- Confidence-weighted scoring. Should the displayed score weight by confidence as well as by rule weight? Currently no. Argument for: low-confidence signals should pull the score less. Argument against: the existing confidence column already surfaces this independently.
- Sector-varied pillar weights. The blend_weight table currently has pillar weights (E=0.40, S=0.30, G=0.30) that apply universally. SASB materiality would say tech should weight S/G higher than energy. Deferred until v0.6.
- Negative findings vs absence. When a scraper returns conf=0.5 "not found", the rule is covered (signal exists) but the finding is ambiguous. The score uses the value (0 = negative finding), the confidence reflects the uncertainty. Is this the right split? Open.
- Stage 2 (credit, returns) construction. Placeholders today. When data sources land, the combination logic with Stage 1 ESG needs an ADR.