Scraper Health

The standing knowledge about scraper health monitoring lives on this page. The live status lives on the product's /methodology page, generated from DB joins to scrape_run outcomes.

For the register (intent and design), see Register. For decisions, see ADR-0003.

How health is computed

A scraper's effective status comes from its recent scrape_run outcomes:

Status	Condition
`live · healthy`	Last 3 runs successful
`live · degraded`	1–2 failures in last 5 runs
`live · broken`	3+ consecutive failures
`pending`	Not yet implemented (per register)
`planned`	On roadmap, not scheduled (per register)
`deprecated`	No longer used (per register)

A nightly job (per ADR-0003) recomputes effective status. The signal_source.status column stores the canonical intent (live / pending / planned / deprecated); the effective sub-status is derived at render time.
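The derivation in the table above can be sketched as a small function. This is a minimal illustration, not the project's code: the `ok` field, the newest-first ordering of `runs`, and the precedence (broken, then healthy, then degraded) are all assumptions. With fewer than three runs on record, the sketch reports degraded.

```javascript
// Sketch only: field names and rule precedence are assumptions.
// `runs` is ordered newest first; each entry has a boolean `ok`.
function effectiveStatus(intent, runs) {
  // pending / planned / deprecated come straight from the register.
  if (intent !== "live") return intent;

  // Count failures at the head of the list (the most recent runs).
  let consecutiveFailures = 0;
  for (const run of runs) {
    if (run.ok) break;
    consecutiveFailures += 1;
  }
  if (consecutiveFailures >= 3) return "live · broken";

  // Healthy requires three recorded runs, all successful.
  const lastThree = runs.slice(0, 3);
  if (lastThree.length === 3 && lastThree.every((r) => r.ok)) {
    return "live · healthy";
  }

  // Everything else, including 1-2 failures in the last 5 runs.
  return "live · degraded";
}
```

Checking the broken condition first means a source with three straight failures never reads as merely degraded, which seems closest to the intent of the table.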

What to watch for

A weekly run cycle looks something like this when healthy (79 institutions as of obs-1, 2026-05-24):

Sunday 02:00 UTC — esg-scheduler triggers
  ├── npm run scrape
  │   ├── NZBA-MEMBERS        ─ ~31 signals  (financials only)
  │   ├── SBTI-VALIDATED      ─ ~40 signals  (financials)
  │   ├── SBTI-VALIDATED-CORP ─ ~79 signals  (all sectors)
  │   ├── PRB-SIGNATORIES     ─ ~31 signals  (financials)
  │   ├── BHRRC               ─ ~79 signals  (all sectors + watchlist)
  │   ├── UK-MSA-REGISTRY     ─ ~31 signals  (GB financials)
  │   ├── UK-GPG-REGISTRY     ─ ~31 signals  (GB financials)
  │   ├── CBI-DB              ─ ~31 signals  (financials)
  │   ├── PAX-WEAPONS         ─ ~31 signals  (financials)
  │   ├── BANKTRACK-PROFILES  ─ ~31 signals  (financials, intermittently failing)
  │   ├── BOCC-REPORT         ─ ~31 signals  (pending: 2026 report not yet published)
  │   ├── GOOGLE-NEWS-RSS     ─ ~237 signals (79×3 rules)
  │   ├── NGO-TARGETS         ─ ~62 signals  (31 financials × 2 rules)
  │   ├── CHRB                ─ ~55 signals  (55 NF × 1 rule, seed-backed)
  │   ├── FOREST500           ─ ~79 signals  (FI+NF, seed-backed)
  │   ├── TNFD-ADOPTERS       ─ ~31 signals  (financials, seed-backed)
  │   ├── IFRS-S2-MANDATE     ─ ~79 signals  (all sectors, jurisdiction lookup)
  │   └── TPI-SECTOR          ─ ~47 signals  (matched institutions, seed-backed MQ)
  │   Total: ~1000+ signals, ~2-5min wall time
  ├── npm run score [run_id]
  │   ├── Sub-criterion scores
  │   ├── Pillar scores
  │   ├── Stage 1 composites
  │   ├── Stage 2 composites (financials only — inert; de-scoped 2026-05-21)
  │   └── Peer rankings
  └── Done — handoff posted to Slack (#esg-screening digest)

Signs the run isn't healthy:

scrape_run.error_count > 0 — a scraper threw an error
signals_written significantly below the expected count for this institution × scraper combination
A specific source's last_successful_at timestamp drifts beyond the source's expected cadence
The same signal.value for a specific (rule, institution) flips between runs without an obvious reason in the upstream source
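The first three signs lend themselves to an automated check. The sketch below assumes a hypothetical row shape (`expectedSignals`, `cadenceDays`, `lastSuccessfulAt`) and an arbitrary 80% shortfall threshold; none of these come from the schema. Value flips are not covered because they need the previous run's signals for comparison.

```javascript
// Sketch only: row shapes and the 0.8 threshold are assumptions.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function runWarnings(run, source, now) {
  const warnings = [];

  // Any scraper error is worth a look.
  if (run.error_count > 0) warnings.push("errors");

  // Signals written well below the expected count for this source.
  if (run.signals_written < source.expectedSignals * 0.8) {
    warnings.push("low_signal_count");
  }

  // Last success is older than the source's expected cadence.
  const ageDays = (now.getTime() - source.lastSuccessfulAt.getTime()) / MS_PER_DAY;
  if (ageDays > source.cadenceDays) warnings.push("stale");

  return warnings;
}
```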

When a scraper breaks

The diagnostic sequence:

Check the upstream. Has the source page moved? Changed format? Stopped publishing? (TCFD is the canonical example — the source ceased to exist.)
Check the scraper logs. Run pm2 logs esg-scheduler, or inspect the error message column of the scrape_run row directly.
Try a manual run. npm run scrape -- --source <SOURCE_ID> to reproduce in isolation.
Check for upstream rate-limiting / IP blocks. Unlikely at our request volume but possible.
Decide: fix, swap, or deprecate.

Decision criteria:

Fix if the upstream still exists and the format change is parseable.
Swap if the upstream still exists but is now too costly to use (e.g. it went paid-only). Find an equivalent source if one exists.
Deprecate if the upstream ceased to exist. The TCFD pattern: mark deprecated, find the replacement signal shape, ADR the replacement if it's structurally different.
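The three criteria reduce to a short decision function. The input flags (`exists`, `parseable`, `tooCostly`) are hypothetical names for what the diagnostic sequence establishes; the fallthrough to "swap" for an upstream that exists but cannot be parsed is an assumption.

```javascript
// Sketch only: flag names are hypothetical.
function triageDecision(upstream) {
  // The source is gone: the TCFD pattern.
  if (!upstream.exists) return "deprecate";
  // Still there, still parseable, still affordable.
  if (upstream.parseable && !upstream.tooCostly) return "fix";
  // Still there, but no longer practical to scrape.
  return "swap";
}
```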

Per ADR-0001, a scraper that breaks does not silently inject default scores: the rules it feeds become uncovered, the coverage percentage visibly drops, and the Methodology page shows the source as broken. This is the intended behaviour. Score inflation from silent defaults is the exact failure mode ADR-0001 is designed to prevent.

Expected signal counts per source (79-institution universe, obs-1)

A useful sanity check when verifying a scraper run. Counts are approximate floors; multi-rule sources write one signal per (rule, institution) pair.

Source	Institutions in scope	Expected signals per run	Notes
`NZBA-MEMBERS`	~31 (financials)	~31	seed-backed
`SBTI-VALIDATED`	~31 (financials)	~31	Excel download
`SBTI-VALIDATED-CORP`	79 (all)	~79	Excel download
`PRB-SIGNATORIES`	~31 (financials)	~31	seed-backed
`BHRRC`	79 (all)	~79 + watchlist	live HTML
`UK-MSA-REGISTRY`	~10 (GB financials)	~10	live HTML
`UK-GPG-REGISTRY`	~10 (GB financials)	~10	live HTML
`CBI-DB`	~31 (financials)	~31	live HTML
`PAX-WEAPONS`	~31 (financials)	~31	seed-backed
`BANKTRACK-PROFILES`	~31 (financials with slug)	~124 (4 rules)	intermittently failing
`BOCC-REPORT`	~31 (financials)	~31	pending 2026 report
`GOOGLE-NEWS-RSS`	79 (all)	~237 (3 rules each)	live RSS
`NGO-TARGETS`	~31 (financials)	~62 (2 rules)	seed-backed quarterly
`CHRB`	~55 (non-financials)	~110 (2 rules)	seed-backed biennial
`FOREST500`	~79 (FI+NF)	~79	seed-backed annual
`TNFD-ADOPTERS`	~31 (financials)	~31	seed-backed
`IFRS-S2-MANDATE`	79 (all)	~79	jurisdiction lookup
`TPI-SECTOR`	~47 (matched)	~470 (10 rules)	seed-backed MQ staircase

Each expected signal is one row in the signal table per (rule, institution) pair. Some sources feed multiple rules, so signals_written in scrape_run is the authoritative count.
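For multi-rule sources the expected count is a product of institutions in scope and rules fed. A minimal sketch with hypothetical helper names:

```javascript
// Sketch only: hypothetical helpers, not part of the scraper codebase.
// One signal row is written per (rule, institution) pair.
function expectedSignals(institutionsInScope, rulesFed) {
  return institutionsInScope * rulesFed;
}

// How far a run fell short of the expected count (never negative).
function signalShortfall(signalsWritten, expected) {
  return Math.max(0, expected - signalsWritten);
}
```

For example, `expectedSignals(79, 3)` gives 237 for GOOGLE-NEWS-RSS and `expectedSignals(31, 2)` gives 62 for NGO-TARGETS, matching the rows above.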

Run history (actuals on the live page)

The product's /runs page shows the actual per-run history with timings, error counts, and per-source signal counts. The illustrative pilot-era run history (4 UK banks, 3 sources) is no longer representative. As of obs-1 (run_id=43, 2026-05-24, 79 institutions):

934 signals written, 64 failed (all from the pre-existing BankTrack site outage)
Status: partial (BankTrack only; all other sources healthy)
Duration: several minutes (TPI seed + multi-rule sources)

Idempotency

Scrapers are idempotent within a run: running the same scraper twice against the same run_id produces the same signal rows. The signal.signal_id is a deterministic hash of (run_id, rule_id, institution_id).

A second scrape of the same source against a new run_id produces fresh signal rows with new signal_ids but the same values (assuming the upstream hasn't changed). This is normal weekly-cron behaviour.
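The idempotency property follows from keying each row on a deterministic ID. The sketch below uses 32-bit FNV-1a over the key string as a stand-in, since the actual hash function is not specified here; a production ID would more likely use a cryptographic hash to keep collisions negligible. The `Map` stands in for the signal table.

```javascript
// Sketch only: FNV-1a is a stand-in for whatever hash the project uses.
function signalId(runId, ruleId, institutionId) {
  const key = `${runId}|${ruleId}|${institutionId}`;
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < key.length; i += 1) {
    hash ^= key.charCodeAt(i);
    // Multiply by the FNV prime and keep the result as unsigned 32 bits.
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Upsert keyed on signal_id: a repeat write replaces the row
// instead of adding a second one.
function writeSignal(store, runId, ruleId, institutionId, value) {
  const id = signalId(runId, ruleId, institutionId);
  store.set(id, { runId, ruleId, institutionId, value });
  return id;
}
```

Writing the same (run_id, rule_id, institution_id) twice leaves one row; writing it under a new run_id adds a second row with a different ID.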

Performance budget

At obs-1 scale (79 institutions, 19 live scrapers), a full scrape run takes several minutes. Most of that is GOOGLE-NEWS-RSS (a 1,500 ms rate limit per institution × 79 ≈ 2 min) and BHRRC (live HTTP requests per institution). Seed-backed scrapers are effectively instant.
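The rate-limit floor for a per-institution scraper is simple arithmetic, sketched here with a hypothetical helper. It ignores request latency, so real wall time will be somewhat higher.

```javascript
// Sketch only: lower bound on wall time from the rate limit alone.
function rateLimitFloorSeconds(institutions, delayMs) {
  return (institutions * delayMs) / 1000;
}
```

`rateLimitFloorSeconds(79, 1500)` gives 118.5 seconds, just under two minutes. The same helper shows how the floor grows if the institution universe expands.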

BankTrack remains the primary failure source (a pre-existing site outage). Its 64 failures don't block the rest of the run (status=partial, not failed).

Source-specific gotchas

Recording known pitfalls so they aren't rediscovered:

NZBA (seed-backed)

The upstream page (unepfi.org/banking/net-zero-banking) is WordPress 6.9.4 with AJAX-loaded content. Direct fetch returns no useful HTML. Don't try to make this scraper live-fetch. The seed file is the right answer; refresh quarterly.

SBTi (Excel)

The data is published as .xlsx, not .csv (despite the earlier signal source name SBTI-DASHBOARD-CSV). The xlsx npm package handles it cleanly.

LEI matching

GLEIF LEIs are 20 characters. Holding-company LEIs are preferred. Chat-side LEI guesses are wrong about half the time, so always verify via the GLEIF API.

For sources that match by name (most NGO databases), the scraper_config_json column on institution holds alternate names. IKEA's SBTi match name is "INGKA Holding B.V. (IKEA Group)" — different from the institution display name.
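A cheap local check can reject malformed LEIs before they reach the database. LEIs carry two trailing check digits (ISO 17442, using the ISO 7064 MOD 97-10 scheme), so a format and checksum test is possible offline. The function names below are hypothetical. This only catches typos and invented strings; a well-formed LEI can still belong to the wrong entity, so the GLEIF lookup remains necessary.

```javascript
// Sketch only: helper names are hypothetical.
// Map each character to its numeric value (0-9 unchanged, A=10 ... Z=35)
// and reduce modulo 97 digit by digit, avoiding big-number arithmetic.
function leiMod97(text) {
  let remainder = 0;
  for (const ch of text) {
    for (const digit of String(parseInt(ch, 36))) {
      remainder = (remainder * 10 + Number(digit)) % 97;
    }
  }
  return remainder;
}

// 18 alphanumeric characters plus 2 check digits; the whole string
// must reduce to 1 modulo 97.
function isWellFormedLei(lei) {
  return /^[0-9A-Z]{18}[0-9]{2}$/.test(lei) && leiMod97(lei) === 1;
}

// Check digits for an 18-character prefix: 98 minus the remainder
// of the prefix followed by "00".
function leiCheckDigits(prefix18) {
  return String(98 - leiMod97(prefix18 + "00")).padStart(2, "0");
}
```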

Companies House numbers

UK statutory registers (UK-MSA, UK-GPG) match by Companies House number, which belongs to the operating subsidiary, not the holding company. For Barclays, the MSA statement is filed under 01026167 (Barclays Bank PLC), not the holding company's number. The number is stored in institution.modern_slavery_company_number.

TPI-SECTOR

TPI publishes assessments as Excel files (.xlsx), not CSV. (The SBTI-DASHBOARD-CSV naming history belongs to a different scraper, SBTi.) The seed file data/seed/tpi-sector-mq.json is compiled from the Excel download. Refresh script: scripts/refresh-tpi-seed.js. Identity map via staircase: a company at MQ Level 3 fires E1.6/E1.7/E1.8 (≥1, ≥2, ≥3) but not E1.9/E1.10. See ADR-0025.

47 of 79 institutions matched as of the 2026-05-24 seed. Unmatched institutions receive OMIT (Case A: no live signals) for E1.6–E1.10.
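The staircase can be sketched as a threshold table. The thresholds for E1.6 to E1.8 come from the text above; those for E1.9 and E1.10 (≥4, ≥5) are extrapolated from the pattern and should be checked against ADR-0025. Representing an unmatched institution as `null` is also an assumption.

```javascript
// Sketch only: E1.9 / E1.10 thresholds are extrapolated, not confirmed.
const TPI_STAIRCASE = [
  { rule: "E1.6", minLevel: 1 },
  { rule: "E1.7", minLevel: 2 },
  { rule: "E1.8", minLevel: 3 },
  { rule: "E1.9", minLevel: 4 },
  { rule: "E1.10", minLevel: 5 },
];

// Returns the rules fired for a matched institution, or "OMIT" when the
// institution has no entry in the seed (mqLevel is null or undefined).
function tpiRulesFired(mqLevel) {
  if (mqLevel === null || mqLevel === undefined) return "OMIT";
  return TPI_STAIRCASE
    .filter((step) => mqLevel >= step.minLevel)
    .map((step) => step.rule);
}
```

`tpiRulesFired(3)` returns `["E1.6", "E1.7", "E1.8"]`, matching the MQ Level 3 example.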

TNFD-ADOPTERS

TNFD adopter list is seed-backed. Seed: data/seed/tnfd-adopters.json. The live upstream page (tnfd.global/engage/adopters/) is dynamically loaded. Feeds E5.1 (financials only). Quarterly refresh cadence expected.

IFRS-S2-MANDATE

Jurisdiction proxy — NOT a corporate PDF scraper. The signal is derived from the institution's domicile country: if the country has a live IFRS S2 mandate (GB, AU, NZ, BR, SG as of 2026-05-24), the signal is boolean=1, conf=1.0. Non-mandate countries receive boolean=0, conf=0.5. The jurisdiction list is updated manually in the scraper when new mandates come into force. No seed file — logic is in the scraper code directly.
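The lookup is small enough to show in full. This is a sketch of the logic as described, not the scraper's actual code; the function name and the result shape are hypothetical, and the country list is the one stated above as of 2026-05-24.

```javascript
// Sketch only: names and result shape are hypothetical.
// ISO 3166-1 alpha-2 codes with a live IFRS S2 mandate (as of 2026-05-24).
const IFRS_S2_MANDATE_COUNTRIES = new Set(["GB", "AU", "NZ", "BR", "SG"]);

function ifrsS2Signal(domicileCountry) {
  return IFRS_S2_MANDATE_COUNTRIES.has(domicileCountry)
    ? { boolean: 1, confidence: 1.0 } // mandate in force
    : { boolean: 0, confidence: 0.5 }; // no mandate known
}
```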

The source_status was briefly set to 'live' in migration 054 (mistake corrected in 056). A trigger pair (migration 056) enforces the vocabulary check on source_status.

NGO-TARGETS

Three-NGO compound source: Friends of the Earth (FoE) banking campaigns, Rainforest Action Network (RAN) bank accountability, ShareAction banking campaigns. Seed-backed — the three NGO sites publish campaign targets in varying formats (web pages, PDFs, press releases); human curation is required. Seed file: data/seed/ngo-targets.json. Update quarterly via scripts/refresh-ngo-targets-seed.js.

Feeds E7.3 (environmental NGO campaign target, deduct 1) and S6.2 (social NGO campaign target, deduct 1). Financials only (applicable_sectors='40'). Non-financials receive a scrape_skip record rather than signals.

Empty-seed behaviour: with an empty seed, all 17 active financials receive boolean=0, confidence=0.5 for both E7.3 and S6.2. This is correct — the empty seed is operationally valid. The source-level status=pending (visible in /scrapers/health) reflects the pre-population state; individual institution coverage is silent (scraper ran, no match), not failed (scraper error). Once the seed is populated and a run completes cleanly, source_status transitions to healthy via the nightly source-status job.

Cadence: 91 days. Expected floor: 17 signals per run. Expected ceiling: 34 (17 × 2 rules).
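The empty-seed and sector-skip behaviour can be sketched as follows. The seed entry shape (`institutionId`, `rule`), the `sector` field, and the confidence of 1.0 on a match are all assumptions; only the no-match values (boolean=0, confidence=0.5) and the skip for non-financials come from the description above.

```javascript
// Sketch only: seed shape, field names and match confidence are assumptions.
const NGO_RULES = ["E7.3", "S6.2"];

function ngoTargetSignals(institution, seedEntries) {
  // Non-financials get a skip record rather than signals.
  if (institution.sector !== "40") {
    return { skip: "scrape_skip", signals: [] };
  }
  const signals = NGO_RULES.map((rule) => {
    const matched = seedEntries.some(
      (entry) => entry.institutionId === institution.id && entry.rule === rule
    );
    // No match (including an empty seed) is a valid, low-confidence negative.
    return { rule, boolean: matched ? 1 : 0, confidence: matched ? 1.0 : 0.5 };
  });
  return { skip: null, signals };
}
```

With an empty seed every financial still gets one row per rule, which is why an empty seed shows up as silent coverage rather than a failure.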

Name variation across NGO databases

Each NGO database uses subtly different naming. BHRRC uses "Barclays Bank Plc" without the dot; BankTrack uses "Barclays" alone; Ethical Consumer uses "Barclays Group". Per-source match names will live in scraper_config_json.
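One way per-source match names could be resolved is sketched below. The `match_names` key inside scraper_config_json, the `display_name` field, and the fallback to the display name are hypothetical; the actual layout of the column is not specified here.

```javascript
// Sketch only: the JSON layout and field names are hypothetical.
function matchNameFor(institution, sourceId) {
  const config = JSON.parse(institution.scraper_config_json || "{}");
  const names = config.match_names || {};
  // Fall back to the display name when no per-source override exists.
  return names[sourceId] || institution.display_name;
}
```

A usage example under the same assumptions:

```javascript
const barclays = {
  display_name: "Barclays",
  scraper_config_json: JSON.stringify({
    match_names: { BHRRC: "Barclays Bank Plc" },
  }),
};
matchNameFor(barclays, "BHRRC"); // "Barclays Bank Plc"
matchNameFor(barclays, "BANKTRACK-PROFILES"); // "Barclays" (fallback)
```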