Scoring Methodology

Standing prose reference for the scoring approach. Distinct from the methodology page in the product (esg-screen.org/methodology), which is generated from live DB joins and shows per-rule status. This page is the why — the methodology in prose, with the decision history.

For the formal decision, see ADR-0001.

For the data model the scores live in, see Data model.

What the score is

For each institution, the system produces:

A Stage 1 ESG composite (0–100) — universal, all institutions
Three pillar scores (E, S, G, each 0–100)
A coverage percentage (0–100%) — how much of the applicable rule catalogue has live signals
A confidence percentage (0–100%) — aggregated data quality
A peer rank (0–100%) — percentile within resolved peer group
~~For financials: a Stage 2 composite combining ESG with Credit and Returns~~ (Stage 2 de-scoped 2026-05-21; score_stage2_composite columns retained as inert placeholders)

All scores are stored at four grains in the DB: sub-criterion, pillar, stage 1 and stage 2 (stage 2 inert).

How the score is computed

Scoring model (ADR-0031, live run_id=53). The screen uses a symmetric baseline model: each pillar starts at a neutral baseline, confirmed problems deduct and credible positives add, with positives deliberately weighted below deductions and independent assessments weighted above self-reported ones. A weapons-financing finding forces the Social pillar to zero as a definitional override. Full rationale, weighting magnitudes and the agency-tag rulings are in ADR-0031 on /decisions/.

How a score is built

Every company is scored on three areas — Environmental, Social and Governance — and those three combine into a single headline score. Environmental carries the most weight, with Social and Governance weighted equally behind it.

Each area starts from a neutral baseline of 100. From there:

Confirmed problems pull the area down. A single serious finding takes a meaningful cut; two findings cut roughly twice as much; three or more drive the area to the floor. Findings come from independent sources — controversy trackers, regulators, NGO assessments — not from the company's own reporting.
Credible positives push the area up — but count for less. A positive signal adds to the area, but a single positive is deliberately worth about half of what a single confirmed problem subtracts. This is by design: a company should not be able to offset a real, evidenced harm simply by joining an initiative or publishing a pledge. One genuine problem always outweighs one positive signal.
Independent positives count for more than self-reported ones. A positive assessment made by an outside body that has independently evaluated the company carries full weight. A company's own actions — memberships it has joined, pledges it has published, disclosures it is required to make — carry the lower, half-weight contribution. Mandatory disclosures and a company's own commissioned assurance fall into this self-reported group. Simply appearing on a coverage list, with no positive judgement attached, carries no weight at all.
Areas with no evidence sit at the baseline. If an area has no findings and no positive signals, it holds at the neutral 100 rather than being treated as a problem. This keeps a thin-evidence company from being unfairly marked down for gaps in available data.

The weapons override. Any company found to finance controversial weapons has its Social area set to zero. This is absolute: it is applied after everything else on the Social area, and no positive signal — however strong — lifts it back up. The override reflects that this is a definitional red line for the screen, not a factor to be weighed against others.
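The rules above can be sketched in code. This is a minimal illustration, not the production scorer: the real magnitudes are in ADR-0031, so every constant below (a 40-point deduction, a floor of 0, a ceiling of 100) is a hypothetical stand-in chosen only to respect the stated ratios. The function name and its parameters are also assumptions. One reading of the prose is taken here: an independent positive is worth half a deduction, and a self-reported positive is worth half of that.

```python
BASELINE = 100.0
FLOOR = 0.0      # assumed: the "floor" is the bottom of the 0-100 scale
CEILING = 100.0  # assumed: scores are clamped to the 0-100 scale

# Hypothetical magnitudes; the real ones are defined in ADR-0031.
DEDUCTION_PER_FINDING = 40.0
FINDINGS_TO_FLOOR = 3                               # three or more findings floor the area
INDEPENDENT_POSITIVE = DEDUCTION_PER_FINDING / 2    # about half of one deduction
SELF_REPORTED_POSITIVE = INDEPENDENT_POSITIVE / 2   # half-weight contribution


def pillar_score(
    pillar: str,
    findings: int = 0,
    independent_positives: int = 0,
    self_reported_positives: int = 0,
    finances_controversial_weapons: bool = False,
) -> float:
    """Score one area (E, S or G) from counts of findings and positives."""
    # Confirmed problems deduct from the neutral baseline.
    if findings >= FINDINGS_TO_FLOOR:
        score = FLOOR
    else:
        score = BASELINE - findings * DEDUCTION_PER_FINDING

    # Credible positives add, at a lower weight than deductions.
    score += independent_positives * INDEPENDENT_POSITIVE
    score += self_reported_positives * SELF_REPORTED_POSITIVE
    score = max(FLOOR, min(CEILING, score))

    # Weapons override: applied last, Social area only, never lifted.
    if pillar == "S" and finances_controversial_weapons:
        score = 0.0
    return score


print(pillar_score("E"))                                        # 100.0 (no evidence)
print(pillar_score("E", findings=1))                            # 60.0
print(pillar_score("E", findings=1, independent_positives=1))   # 80.0
print(pillar_score("S", independent_positives=5,
                   finances_controversial_weapons=True))        # 0.0
```

Under these assumed numbers, one finding plus one independent positive still lands below the baseline, which is the property the prose asks for. Whether positives can lift an area that has already hit the floor is not stated in this page; the sketch allows it.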

Why coverage-weighted (the ADR-0001 reasoning)

The v0.4 framework assigns base values to rules without live signals: 50 for direct rules (neutral assumption), 100 for deduction rules (no-deduction-by-default). Pillar and composite scores were specified as weighted averages over all applicable rules, base values included.

The implicit assumption was that base values represent a small uncovered tail of a mostly-covered system. As of May 2026 the system has ~3 covered rules of ~27 applicable for the financials pilot, i.e. coverage of ~11%. The headline composite under v0.4 methodology was therefore ~90% base-value noise and ~10% real signal.

The displayed number was mathematically valid per v0.4 but semantically meaningless: a user couldn't tell whether 19.2 meant "genuinely poor on ESG" or "we mostly haven't measured this institution yet."

ADR-0001 resolves this by computing the displayed average only over rules with live signals. Coverage becomes a co-equal first-class metric in the UI (per ADR-0002 and ADR-0004), structurally visible.

The 50/100 base values are retained in the data model for forward compatibility, per-rule display in the rule catalogue, and methodology audit trails. They're simply no longer included in the headline averages.
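The difference between the two averages can be shown with a small sketch. The `Rule` type, the function names and the four-rule catalogue are illustrative assumptions, not the production schema; only the base values (50 for direct rules, 100 for deduction rules) and the "average over live signals only" idea come from this page.

```python
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    weight: float
    base_value: float            # 50 for direct rules, 100 for deduction rules
    live_score: Optional[float]  # None when the rule has no live signal


def raw_v04(rules: List[Rule]) -> float:
    """v0.4 style: weighted average over all rules, base values included."""
    total_weight = sum(r.weight for r in rules)
    weighted = sum(
        r.weight * (r.live_score if r.live_score is not None else r.base_value)
        for r in rules
    )
    return weighted / total_weight


def coverage_weighted(rules: List[Rule]) -> Optional[float]:
    """ADR-0001 style: weighted average over rules with live signals only."""
    live = [r for r in rules if r.live_score is not None]
    if not live:
        return None  # zero coverage: no number to show
    return sum(r.weight * r.live_score for r in live) / sum(r.weight for r in live)


def coverage_pct(rules: List[Rule]) -> float:
    """Share of applicable rules that have a live signal, in percent."""
    if not rules:
        return 0.0
    covered = sum(1 for r in rules if r.live_score is not None)
    return 100.0 * covered / len(rules)


# Illustrative catalogue: one covered rule, three uncovered.
rules = [
    Rule(weight=1.0, base_value=50.0, live_score=80.0),
    Rule(weight=1.0, base_value=50.0, live_score=None),
    Rule(weight=1.0, base_value=50.0, live_score=None),
    Rule(weight=1.0, base_value=100.0, live_score=None),
]
print(raw_v04(rules))            # 70.0
print(coverage_weighted(rules))  # 80.0
print(coverage_pct(rules))       # 25.0
```

The raw figure mixes one measured score with three assumed ones; the coverage-weighted figure reports only what was measured and leaves the coverage percentage to say how little that is.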

Reading the two numbers

The hero of the institution detail page shows two numbers with equal visual weight:

What you see	What it means
Composite ESG: 66.7	Of the rules we have live signals for, this institution sits at 66.7 / 100.
Coverage: 11%	We have live signals for 3 of 27 applicable rules.

Both are RAG colour-coded per ADR-0004:

Score: Red <40 · Amber 40–69 · Green ≥70
Coverage: Red <25% · Amber 25–59% · Green ≥60%

So "amber score, red coverage" reads as "uncertain, mostly because we don't know enough yet." That's a defensible thing to say to a client; the v0.4 raw number (~19.2) wasn't.
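As a sketch, the two RAG bands reduce to a pair of threshold lookups. The function names are hypothetical, and the published bands do not say how a fractional value such as 69.5 is treated; the assumption here is that Amber runs up to, but not including, the Green threshold.

```python
def rag(value: float, amber_from: float, green_from: float) -> str:
    """Map a 0-100 value to a RAG colour using two lower bounds."""
    if value >= green_from:
        return "green"
    if value >= amber_from:
        return "amber"
    return "red"


def score_rag(score: float) -> str:
    # Score: Red <40, Amber 40-69, Green >=70
    return rag(score, amber_from=40.0, green_from=70.0)


def coverage_rag(coverage_pct: float) -> str:
    # Coverage: Red <25%, Amber 25-59%, Green >=60%
    return rag(coverage_pct, amber_from=25.0, green_from=60.0)


print(score_rag(66.7), coverage_rag(11.0))  # amber red
```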

Convergence

As scrapers land and coverage rises, the displayed score and the v0.4 raw score converge. Above ~80% coverage they are within rounding.

Both are stored in the DB (composite_raw_v04 and composite_coverage_weighted per migration 011). The displayed value is the coverage-weighted one; the raw value is kept for audit and convergence checks.

The ADR-0001 layout thresholds (1–49% → equal weight, 50–79% → score primary, ≥80% → single composite) drive UI structural changes. The ADR-0004 colour thresholds (25, 60) are independent and can be more granular without restructuring the page.
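A rough sketch of the layout decision follows. The function name and the returned labels are hypothetical; the thresholds are the ADR-0001 ones quoted above. Two assumptions are made: zero coverage is treated as its own case (no headline number), and any non-zero coverage below 50% falls into the equal-weight layout.

```python
def layout_mode(coverage_pct: float) -> str:
    """Pick the hero layout from the coverage percentage."""
    if coverage_pct <= 0:
        return "no_covered_rules"   # nothing to display as a score
    if coverage_pct < 50:
        return "equal_weight"       # score and coverage side by side
    if coverage_pct < 80:
        return "score_primary"      # score leads, coverage secondary
    return "single_composite"       # coverage high enough for one headline


print(layout_mode(11.0))  # equal_weight
print(layout_mode(80.0))  # single_composite
```

Keeping this separate from the colour banding mirrors the point above: the colour thresholds can change without touching the page structure.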

Confidence vs coverage

The two are routinely confused. They are separate dimensions:

Dimension	Question it answers	Computed how
Coverage	Did we look at this rule at all?	EXISTS-based: is there a signal row for this rule × institution × run?
Confidence	How sure are we about what we found?	Per-signal value (0.0–1.0), aggregated through the score tree as weighted mean

A rule with high-confidence boolean=0 (e.g. NatWest's withdrawn SBTi commitment with conf=1.0 because the source explicitly says "Commitment removed") is fully covered even though the finding is negative. A rule with low-confidence not-found (e.g. Barclays missing from SBTi with conf=0.5 because the absence could be a name-matching issue) is also fully covered — the scraper looked, the answer was ambiguous.

A rule that was not scraped at all (no scraper exists yet for its source) is uncovered.
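The distinction can be made concrete with a small sketch. The `Signal` type, the rule identifiers, the weights and the function names are illustrative assumptions, and the confidence aggregation is shown as a single-level weighted mean rather than the full score tree. The two SBTi confidence values are the ones quoted above; the second signal in the aggregation example is invented purely to show the arithmetic.

```python
from typing import Dict, List, NamedTuple, Optional


class Signal(NamedTuple):
    rule_id: str
    value: float       # the finding, e.g. 0.0 for a negative boolean
    confidence: float  # 0.0-1.0, how sure we are about the finding


def is_covered(rule_id: str, signals: List[Signal]) -> bool:
    """Coverage is existence-based: any signal row for the rule counts."""
    return any(s.rule_id == rule_id for s in signals)


def aggregate_confidence(
    signals: List[Signal], weights: Dict[str, float]
) -> Optional[float]:
    """Weighted mean of per-signal confidence; None when there are no signals."""
    total = sum(weights[s.rule_id] for s in signals)
    if total == 0:
        return None
    return sum(weights[s.rule_id] * s.confidence for s in signals) / total


natwest = [Signal("sbti", value=0.0, confidence=1.0)]   # "Commitment removed"
barclays = [Signal("sbti", value=0.0, confidence=0.5)]  # not found, ambiguous

# Both are covered on the SBTi rule, despite different confidence.
print(is_covered("sbti", natwest), is_covered("sbti", barclays))  # True True
# A rule with no scraper yet has no signal row, so it is uncovered.
print(is_covered("unscraped_rule", natwest))                      # False

# Illustrative aggregation over two equally weighted signals.
mixed = [Signal("sbti", 0.0, 0.5), Signal("rule_x", 1.0, 1.0)]
print(aggregate_confidence(mixed, {"sbti": 1.0, "rule_x": 1.0}))  # 0.75
```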

The score inflation problem (historical)

For reference, why the v0.4 methodology produced misleading numbers at low coverage:

The four UK pilot banks (Barclays, NatWest, Lloyds, HSBC) had identical ESG composites of 18.7 (cycle 2) → 19.2 (cycle 3) — different runs, same scores. This was because:

Only 2–3 rules out of 26 applicable had live signals
The 23+ uncovered rules contributed base values
The 4 banks have identical signals where there are signals (all NZBA founding members, none with current SBTi targets, all PRB signatories)

The signals do differentiate the banks — NatWest specifically has a high-confidence "Commitment removed" status on SBTi where others have low-confidence "not found" — but that differentiation only shows up in confidence, not in score, because the score values are the same.

ADR-0001's coverage-weighted methodology fixes this: with only NZBA, SBTi, and PRB covered, the displayed composite is the weighted average of just those three rules' scores, computed per pillar. The banks come out at ~66.7 on the coverage-weighted composite, which reads as a defensible "looks reasonable on what we've measured, but we've only measured 11% of the catalogue."

When coverage is zero

At zero coverage the score is null, not zero. The UI renders "no covered rules" rather than a number. This is the only display case where the headline isn't a number.

When a pillar has zero covered rules but other pillars do, the pillar shows null and the composite re-normalises across the non-null pillars. This is a common case during the pilot ramp-up.
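A minimal sketch of the null handling, assuming the universal pillar weights currently held in blend_weight (E=0.40, S=0.30, G=0.30). The function name is hypothetical, and the pillar scores in the usage lines are illustrative values, not pilot data.

```python
from typing import Dict, Optional

PILLAR_WEIGHTS: Dict[str, float] = {"E": 0.40, "S": 0.30, "G": 0.30}


def composite(pillars: Dict[str, Optional[float]]) -> Optional[float]:
    """Weighted mean over non-null pillars; None when every pillar is null."""
    live = {p: s for p, s in pillars.items() if s is not None}
    if not live:
        return None  # zero coverage: UI shows "no covered rules"
    # Re-normalise: divide by the weight of the pillars that actually have scores.
    total_weight = sum(PILLAR_WEIGHTS[p] for p in live)
    return sum(PILLAR_WEIGHTS[p] * s for p, s in live.items()) / total_weight


# S has no covered rules, so E and G share the full weight at 4:3.
print(round(composite({"E": 50.0, "S": None, "G": 100.0}), 1))  # 71.4
print(composite({"E": None, "S": None, "G": None}))             # None
```

Dividing by the live weight rather than by 1.0 is what keeps a null pillar from being silently scored as zero.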

Worked example — Barclays, run 3

(Hypothetical exact numbers — directionally accurate to current pilot state.)

Pillar	Applicable rules	Covered rules	Coverage	Displayed score
E	12	2 (NZBA, SBTi)	17%	~62 (weighted avg of two ~60ish scores)
S	8	0	0%	null
G	7	1 (PRB)	14%	~70 (PRB positive = 100, weighted)
Stage 1	27	3	11%	~66.7 (composite of E and G, re-normalised because S is null)

Headline display: 66.7 · 11%. Amber on score, red on coverage. Reads as: "looks fine on what we've checked, but we've only checked about a tenth."

Under v0.4 raw methodology, the same Barclays would display ~19.2 · 11%. Same coverage; very different headline. The 19.2 is the average of 3 real scores (around 60–70) plus 24 base values (50 for direct rules, 100 for deduction rules), so it is dominated by the base values rather than by anything measured.

Open methodology questions

Not pinned by an ADR yet:

Confidence-weighted scoring. Should the displayed score weight by confidence as well as by rule weight? Currently no. Argument for: low-confidence signals should pull the score less. Argument against: the existing confidence column already surfaces this independently.
Sector-varied pillar weights. The blend_weight table currently holds pillar weights (E=0.40, S=0.30, G=0.30) that apply universally. SASB materiality would say tech should weight S/G higher than energy does. Deferred until v0.6.
Negative findings vs absence. When a scraper returns conf=0.5 "not found", the rule is covered (signal exists) but the finding is ambiguous. The score uses the value (0 = negative finding), the confidence reflects the uncertainty. Is this the right split? Open.
Stage 2 (credit, returns) construction. De-scoped 2026-05-21. Preserved as a future idea on the backlog; not on any roadmap. If revived, would require a new ADR and credit/returns data source identification.