Scraper Health
The standing knowledge about scraper health monitoring lives on
this page. The live status lives on the product's /methodology
page, generated from DB joins to scrape_run outcomes.
For the register (intent and design), see Register. For decisions, see ADR-0003.
How health is computed
A scraper's effective status comes from its recent scrape_run outcomes:
| Status | Condition |
|---|---|
| live · healthy | Last 3 runs successful |
| live · degraded | 1–2 failures in last 5 runs |
| live · broken | 3+ consecutive failures |
| pending | Not yet implemented (per register) |
| planned | On roadmap, not scheduled (per register) |
| deprecated | No longer used (per register) |
A nightly job (per ADR-0003) recomputes effective status. The
signal_source.status column stores the canonical intent (live /
pending / planned / deprecated); the effective sub-status is derived
at render time.
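The derivation can be sketched as a small pure function. This is a minimal sketch, not the project's implementation: the type names, the newest-first ordering, and the precedence between overlapping conditions (broken is checked first, then healthy, with everything else degraded) are assumptions, as is the handling of sources with fewer than three recorded runs.

```typescript
// Hypothetical types: the real scrape_run and signal_source schemas may differ.
type Intent = "live" | "pending" | "planned" | "deprecated";
type EffectiveStatus =
  | "live · healthy"
  | "live · degraded"
  | "live · broken"
  | "pending"
  | "planned"
  | "deprecated";

/**
 * Derive the effective status from the canonical intent and recent outcomes.
 * `recentSuccess` is ordered newest first; true means the run succeeded.
 */
function effectiveStatus(intent: Intent, recentSuccess: boolean[]): EffectiveStatus {
  // Non-live intents come straight from the register.
  if (intent !== "live") return intent;

  // Count the failures at the head of the list (most recent consecutive failures).
  let consecutiveFailures = 0;
  for (let i = 0; i < recentSuccess.length && !recentSuccess[i]; i++) {
    consecutiveFailures += 1;
  }
  if (consecutiveFailures >= 3) return "live · broken";

  // Assumption: with fewer than 3 recorded runs, all of them must have succeeded.
  const lastThree = recentSuccess.slice(0, 3);
  if (lastThree.length > 0 && lastThree.every((ok) => ok)) return "live · healthy";

  // Assumption: everything else, including no history at all, is degraded.
  return "live · degraded";
}
```

Keeping the function pure (intent and outcomes in, status out) means the same logic can serve both a scheduled recompute and a render-time derivation.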
What to watch for
A weekly run cycle looks something like this when healthy:
Sunday 02:00 UTC — esg-scheduler triggers
├── npm run scrape
│ ├── NZBA scraper ─ writes 4 signals (signals_written = 4)
│ ├── SBTi scraper ─ writes 8 signals (one per institution)
│ └── PRB scraper ─ writes 4 signals
│ Total: 16 signals, 0 failures, ~2s wall time
├── npm run score [run_id]
│ ├── Sub-criterion scores
│ ├── Pillar scores
│ ├── Stage 1 composites
│ ├── Stage 2 composites (financials only)
│ └── Peer rankings
└── Done — handoff posted to Slack (manual at present; auto v2)
Signs the run isn't healthy:
- `scrape_run.error_count > 0`: a scraper threw an error
- `signals_written` significantly below the expected count for this institution × scraper combination
- A specific source's `last_successful_at` timestamp drifts beyond the source's expected cadence
- The same `signal.value` for a specific (rule, institution) flips between runs without an obvious reason in the upstream source
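The first three signs lend themselves to an automated check. The sketch below is illustrative only: the interfaces, the `cadenceDays` field, and the `shortfallTolerance` parameter are hypothetical, and only the column names in the comments come from this page. Value flips are left out because spotting them requires comparing two runs against the upstream source.

```typescript
// Hypothetical shapes: field names follow the text, everything else is assumed.
interface ScrapeRunSummary {
  errorCount: number; // scrape_run.error_count
  signalsWritten: number; // signals_written
  expectedSignals: number; // expected count for this institution × scraper combination
}

interface SourceFreshness {
  lastSuccessfulAt: Date; // last_successful_at
  cadenceDays: number; // expected cadence of the source, in days (assumed field)
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

/** Return human-readable warnings; an empty array means no sign of trouble. */
function runWarnings(
  run: ScrapeRunSummary,
  source: SourceFreshness,
  now: Date,
  shortfallTolerance: number = 0, // assumption: by default any shortfall is flagged
): string[] {
  const warnings: string[] = [];
  if (run.errorCount > 0) {
    warnings.push(`error_count is ${run.errorCount}`);
  }
  if (run.signalsWritten < run.expectedSignals - shortfallTolerance) {
    warnings.push(`signals_written ${run.signalsWritten} below expected ${run.expectedSignals}`);
  }
  const ageDays = (now.getTime() - source.lastSuccessfulAt.getTime()) / MS_PER_DAY;
  if (ageDays > source.cadenceDays) {
    warnings.push(`last_successful_at is ${ageDays.toFixed(1)} days old`);
  }
  return warnings;
}
```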
When a scraper breaks
The diagnostic sequence:
- Check the upstream. Has the source page moved? Changed format? Stopped publishing? (TCFD is the canonical example — the source ceased to exist.)
- Check the scraper logs. Run `pm2 logs esg-scheduler`, or directly inspect the `scrape_run` row's error message column.
- Try a manual run. Use `npm run scrape -- --source <SOURCE_ID>` to reproduce in isolation.
- Check for upstream rate-limiting / IP blocks. Unlikely at our request volume but possible.
- Decide: fix, swap, or deprecate.
Decision criteria:
- Fix if the upstream still exists and the format change is parseable.
- Swap if the upstream still exists but the format is now too costly (e.g. went paid-only) — find an equivalent source if one exists.
- Deprecate if the upstream ceased to exist. The TCFD pattern: mark deprecated, find the replacement signal shape, ADR the replacement if it's structurally different.
A scraper that breaks does not silently inject default scores per ADR-0001 — the rules it feeds become uncovered, the coverage percentage visibly drops, the Methodology page shows the source as broken. This is the intended behaviour. Score inflation from silent defaults is the exact failure mode ADR-0001 is designed to prevent.
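To make the intended behaviour concrete, here is a minimal sketch of coverage reporting in which a missing signal leaves its rule uncovered and no default is substituted. The function name, the report shape, and the rule IDs in the usage example are hypothetical; the real scoring code is not described on this page.

```typescript
// Hypothetical report shape, for illustration only.
interface CoverageReport {
  covered: string[];
  uncovered: string[];
  percent: number;
}

/**
 * A rule counts as covered only if a signal actually exists for it.
 * Missing signals are reported, never replaced with a default value.
 */
function coverageReport(ruleIds: string[], signalsByRule: Map<string, number>): CoverageReport {
  const covered = ruleIds.filter((id) => signalsByRule.has(id));
  const uncovered = ruleIds.filter((id) => !signalsByRule.has(id));
  const percent = ruleIds.length === 0 ? 0 : (covered.length / ruleIds.length) * 100;
  return { covered, uncovered, percent };
}

// Usage: one of four rules loses its source, so coverage drops to 75%.
const report = coverageReport(
  ["RULE-A", "RULE-B", "RULE-C", "RULE-D"],
  new Map([
    ["RULE-A", 1],
    ["RULE-B", 0],
    ["RULE-C", 1],
  ]),
);
console.log(report.percent, report.uncovered);
```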
Expected signal counts per source (pilot)
A useful sanity check when verifying a scraper run:
| Source | Institutions in scope | Expected signals per run |
|---|---|---|
| NZBA-MEMBERS | 4 (UK banks) | 4 |
| SBTI-VALIDATED | 8 (all pilot) | 8 |
| PRB-SIGNATORIES | 4 (UK banks) | 4 |
| BHRRC | 8 (all pilot) | 8 + N watchlist findings |
| UK-MSA | 4 (UK financials) | 4 |
| UK-GPG | 4 (UK financials) | 4 |
| BankTrack-BOCC | 4 (UK banks) | 4 |
| PAX-DBOTB | 4 (UK banks) | 4 |
Each expected signal is one row in the signal table per (rule, institution) pair.
Some sources feed multiple rules and therefore write multiple signal rows per
institution; BHRRC, for example, feeds 4 rules plus the watchlist.
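The pilot table above can be encoded as a lookup and compared with the per-source counts of a run. This is a sketch under assumptions: the `atLeast` flag is a hypothetical way to model BHRRC's "8 + N" floor, and treating an unknown source ID as a mismatch is a design choice, not documented behaviour.

```typescript
// Expected signals per run, copied from the pilot table.
// `atLeast` marks a count that is a floor rather than an exact value.
interface Expected {
  count: number;
  atLeast: boolean;
}

const EXPECTED_SIGNALS = new Map<string, Expected>([
  ["NZBA-MEMBERS", { count: 4, atLeast: false }],
  ["SBTI-VALIDATED", { count: 8, atLeast: false }],
  ["PRB-SIGNATORIES", { count: 4, atLeast: false }],
  ["BHRRC", { count: 8, atLeast: true }], // 8 + N watchlist findings
  ["UK-MSA", { count: 4, atLeast: false }],
  ["UK-GPG", { count: 4, atLeast: false }],
  ["BankTrack-BOCC", { count: 4, atLeast: false }],
  ["PAX-DBOTB", { count: 4, atLeast: false }],
]);

/** Return the source IDs whose actual count does not match the table. */
function countMismatches(actual: Record<string, number>): string[] {
  const mismatched: string[] = [];
  for (const source of Object.keys(actual)) {
    const written = actual[source];
    const expected = EXPECTED_SIGNALS.get(source);
    if (expected === undefined) {
      mismatched.push(source); // assumption: unknown sources are worth flagging
    } else if (expected.atLeast ? written < expected.count : written !== expected.count) {
      mismatched.push(source);
    }
  }
  return mismatched;
}
```

Only sources present in `actual` are checked, so scrapers that are not yet live do not raise false alarms.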
Pilot run history (illustrative — actuals on the live page)
This section is illustrative. The product's /runs page shows the
actual per-run history with timings, error counts, and per-source
signal counts.
Run 3 — 13 May 2026 — duration 2.1s — 0 errors
NZBA-MEMBERS: 4 signals
SBTI-VALIDATED: 8 signals
PRB-SIGNATORIES: 4 signals
Total: 16 signals
Run 2 — 13 May 2026 — duration 2.3s — 0 errors
NZBA-MEMBERS: 4 signals
SBTI-VALIDATED: 8 signals
Total: 12 signals (PRB scraper not yet live)
Run 1 — 12 May 2026 — duration 2.5s — 0 errors
NZBA-MEMBERS: 4 signals
SBTI-VALIDATED: 8 signals
Total: 12 signals
Idempotency
Scrapers are idempotent within a run: running the same scraper
twice against the same run_id produces the same signal rows. The
signal.signal_id is a deterministic hash of (run_id, rule_id,
institution_id).
A second scrape of the same source against a new run_id produces
fresh signal rows with new signal_ids but the same values (assuming
the upstream hasn't changed). This is normal weekly-cron behaviour.
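A sketch of how a deterministic signal_id makes re-runs idempotent. The page does not say which hash the project uses, so 32-bit FNV-1a stands in here purely to keep the example self-contained; the function names and the `|` separator are likewise assumptions.

```typescript
// FNV-1a (32-bit) is a stand-in; the real hash function is not documented here.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i) & 0xff; // assumes ASCII identifiers
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime in 32 bits
  }
  return hash >>> 0; // force an unsigned result
}

/** Same (run_id, rule_id, institution_id) always yields the same ID. */
function signalId(runId: string, ruleId: string, institutionId: string): string {
  // The separator keeps ("ab", "c") distinct from ("a", "bc").
  const key = [runId, ruleId, institutionId].join("|");
  return fnv1a32(key).toString(16).padStart(8, "0");
}

// Usage: writing the same signal twice within a run overwrites one row
// instead of creating a duplicate. The IDs below are made up.
const rows = new Map<string, boolean>();
rows.set(signalId("run-3", "rule-x", "inst-1"), true);
rows.set(signalId("run-3", "rule-x", "inst-1"), true);
console.log(rows.size);
```

With an ID keyed this way, a repeat scrape within a run can be written as an upsert, while a new run_id naturally produces new rows.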
Performance budget
At pilot scale (8 institutions, 3 live scrapers), a full scrape run completes in ~2 seconds.
When all backlog scrapers are live (8 sources), a run is expected to take ~10–30 seconds, depending on PDF parsing for BankTrack and PAX. That is still trivial for a weekly cadence.
The performance constraint to watch is the corporate-PDF scraper
(IFRS-S2-CORP) when it lands: it could take minutes per institution,
depending on PDF size. It is worth running as a separate job, not in the
same window as the lighter scrapers.
Source-specific gotchas
Recording known pitfalls so they aren't rediscovered:
NZBA (seed-backed)
The upstream page (unepfi.org/banking/net-zero-banking) is WordPress
6.9.4 with AJAX-loaded content. Direct fetch returns no useful HTML.
Don't try to make this scraper live-fetch. The seed file is the
right answer; refresh quarterly.
SBTi (Excel)
The data is published as .xlsx, not .csv (despite the earlier signal
source name SBTI-DASHBOARD-CSV). The xlsx npm package handles it
cleanly.
LEI matching
GLEIF LEIs are 20 characters. Holding-company LEIs are preferred. Chat-side LEI guesses are wrong about half the time, so always verify via the GLEIF API.
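A cheap local pre-filter can reject malformed LEIs before the GLEIF lookup. LEIs are 18 alphanumeric characters followed by two check digits computed with ISO 7064 MOD 97-10. The sketch below checks only that format; it does not replace verification against GLEIF, since a well-formed LEI can still belong to the wrong entity. The function names and the prefix in the usage example are made up.

```typescript
/** Remainder mod 97 of an alphanumeric string, reading A=10 ... Z=35. */
function mod97(value: string): number {
  let remainder = 0;
  for (let i = 0; i < value.length; i++) {
    const n = parseInt(value.charAt(i), 36); // 0-9 stay as-is, A-Z become 10-35
    // Letters contribute two decimal digits, digits contribute one.
    remainder = (remainder * (n > 9 ? 100 : 10) + n) % 97;
  }
  return remainder;
}

/** True if the string has LEI shape and its check digits are consistent. */
function isWellFormedLei(lei: string): boolean {
  return /^[A-Z0-9]{18}[0-9]{2}$/.test(lei) && mod97(lei) === 1;
}

/** Compute the two check digits for an 18-character prefix. */
function leiCheckDigits(prefix18: string): string {
  return String(98 - mod97(prefix18 + "00")).padStart(2, "0");
}

// Usage with a made-up prefix (not a real institution's LEI).
const examplePrefix = "ABCDEFGHIJ12345678";
const exampleLei = examplePrefix + leiCheckDigits(examplePrefix);
console.log(exampleLei, isWellFormedLei(exampleLei));
```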
For sources that match by name (most NGO databases), the
scraper_config_json column on institution holds alternate names.
IKEA's SBTi match name is "INGKA Holding B.V. (IKEA Group)" — different
from the institution display name.
Companies House numbers
UK statutory registers (UK-MSA, UK-GPG) match by Companies House number,
which is that of the operating subsidiary, not the holding company. For Barclays,
the MSA statement is filed under 01026167 (Barclays Bank PLC), not the
holding company's number. The number is stored in institution.modern_slavery_company_number.
Name variation across NGO databases
Each NGO database uses subtly different naming. BHRRC uses
"Barclays Bank Plc" without the dot; BankTrack uses "Barclays" alone;
Ethical Consumer uses "Barclays Group". Per-source match names will
live in scraper_config_json.
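Per-source matching could look roughly like the sketch below. The JSON shape (`matchNames` keyed by source ID), the interface names, and the display names in the usage example are assumptions for illustration; only the match names themselves come from this page.

```typescript
// Hypothetical shape for institution.scraper_config_json; the real schema may differ.
interface ScraperConfig {
  matchNames?: Record<string, string[]>; // source ID -> names used by that source
}

interface Institution {
  displayName: string;
  scraperConfig: ScraperConfig;
}

/** Normalise for comparison: trim, lower-case, collapse whitespace. */
function normalise(name: string): string {
  return name.trim().toLowerCase().replace(/\s+/g, " ");
}

/** Find the institution a source refers to, using that source's match names. */
function findInstitution(
  institutions: Institution[],
  sourceId: string,
  upstreamName: string,
): Institution | undefined {
  const wanted = normalise(upstreamName);
  return institutions.find((inst) => {
    // Fall back to the display name when no per-source names are configured.
    const names = inst.scraperConfig.matchNames?.[sourceId] ?? [inst.displayName];
    return names.some((n) => normalise(n) === wanted);
  });
}

// Usage with the examples from this page (display names are assumed).
const pilot: Institution[] = [
  {
    displayName: "Barclays",
    scraperConfig: {
      matchNames: { "BHRRC": ["Barclays Bank Plc"], "BankTrack-BOCC": ["Barclays"] },
    },
  },
  {
    displayName: "IKEA",
    scraperConfig: { matchNames: { "SBTI-VALIDATED": ["INGKA Holding B.V. (IKEA Group)"] } },
  },
];
console.log(findInstitution(pilot, "SBTI-VALIDATED", "INGKA Holding B.V. (IKEA Group)")?.displayName);
```

Exact matching on configured names is deliberate in this sketch: a fuzzy match could silently attach a signal to the wrong institution, whereas a miss surfaces as a missing signal.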