Scraper Health

This page holds the standing knowledge about scraper health monitoring. Live status is shown on the product's /methodology page, which is generated from database joins against scrape_run outcomes.

For the register (intent and design), see Register. For decisions, see ADR-0003.


How health is computed

A scraper's effective status comes from its recent scrape_run outcomes:

Status             Condition
live · healthy     Last 3 runs successful
live · degraded    1–2 failures in last 5 runs
live · broken      3+ consecutive failures
pending            Not yet implemented (per register)
planned            On roadmap, not scheduled (per register)
deprecated         No longer used (per register)

A nightly job (per ADR-0003) recomputes effective status. The signal_source.status column stores the canonical intent (live / pending / planned / deprecated); the effective sub-status is derived at render time.
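
The derivation above can be sketched as a small pure function. This is a minimal sketch, not the actual implementation: the type names and the newest-first ordering of outcomes are assumptions, and so is the precedence (broken, then healthy, then degraded) used where the table's conditions overlap or leave a gap.

```typescript
// Hypothetical types; the real schema may differ.
type Intent = "live" | "pending" | "planned" | "deprecated";
type Outcome = "success" | "failure";
type EffectiveStatus =
  | "live · healthy"
  | "live · degraded"
  | "live · broken"
  | "pending"
  | "planned"
  | "deprecated";

/**
 * Derive the effective status from the canonical intent and the recent
 * scrape_run outcomes. `outcomes` is assumed to be ordered newest first.
 */
function effectiveStatus(intent: Intent, outcomes: Outcome[]): EffectiveStatus {
  // Non-live intents pass straight through from the register.
  if (intent !== "live") return intent;

  // Assumption: "consecutive failures" means an unbroken streak of failures
  // ending at the most recent run.
  let streak = 0;
  for (const outcome of outcomes) {
    if (outcome !== "failure") break;
    streak++;
  }
  if (streak >= 3) return "live · broken";

  // Assumption: with fewer than 3 recorded runs, all of them must succeed.
  const lastThree = outcomes.slice(0, 3);
  if (lastThree.length > 0 && lastThree.every((o) => o === "success")) {
    return "live · healthy";
  }

  // Assumption: anything that is neither broken nor healthy is degraded.
  return "live · degraded";
}
```

Under these assumptions, a scraper whose most recent run failed but whose earlier runs succeeded is reported as degraded rather than broken.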


What to watch for

A weekly run cycle looks something like this when healthy:

Sunday 02:00 UTC — esg-scheduler triggers
  ├── npm run scrape
  │   ├── NZBA scraper        ─ writes 4 signals  (signals_written = 4)
  │   ├── SBTi scraper        ─ writes 8 signals  (one per institution)
  │   └── PRB scraper         ─ writes 4 signals
  │   Total: 16 signals, 0 failures, ~2s wall time
  ├── npm run score [run_id]
  │   ├── Sub-criterion scores
  │   ├── Pillar scores
  │   ├── Stage 1 composites
  │   ├── Stage 2 composites (financials only)
  │   └── Peer rankings
  └── Done — handoff posted to Slack (manual at present; auto v2)

Signs the run isn't healthy:

  • scrape_run.error_count > 0 — a scraper threw an error
  • signals_written significantly below the expected count for this institution × scraper combination
  • A specific source's last_successful_at timestamp drifts beyond the source's expected cadence
  • signal.value for the same (rule, institution) pair flips between runs with no corresponding change in the upstream source
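
Three of the signs above (errors, cadence drift, and value flips) lend themselves to an automated check. The following is a rough sketch only: the interfaces, the expectedCadenceDays field, and the function names are hypothetical and merely mirror the columns named above. The signal-count comparison is left out because it needs per-source expectations.

```typescript
// Hypothetical shapes; field names loosely mirror the columns named above.
interface SourceState {
  sourceId: string;
  lastSuccessfulAt: Date;
  expectedCadenceDays: number; // assumed per-source setting
}

interface SignalValue {
  ruleId: string;
  institutionId: string;
  value: string;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Sources whose last success is older than their expected cadence.
function sourcesPastCadence(sources: SourceState[], now: Date): string[] {
  return sources
    .filter((s) => now.getTime() - s.lastSuccessfulAt.getTime() > s.expectedCadenceDays * DAY_MS)
    .map((s) => s.sourceId);
}

// (rule, institution) keys whose value changed between two runs.
function flippedSignals(previous: SignalValue[], current: SignalValue[]): string[] {
  const keyOf = (s: SignalValue): string => `${s.ruleId}|${s.institutionId}`;
  const before = new Map<string, string>();
  for (const s of previous) before.set(keyOf(s), s.value);
  return current
    .filter((s) => before.has(keyOf(s)) && before.get(keyOf(s)) !== s.value)
    .map(keyOf);
}

// Collect human-readable warnings for one run; an empty list means no sign fired.
function runWarnings(
  errorCount: number,
  sources: SourceState[],
  previous: SignalValue[],
  current: SignalValue[],
  now: Date
): string[] {
  const warnings: string[] = [];
  if (errorCount > 0) warnings.push(`error_count = ${errorCount}`);
  for (const id of sourcesPastCadence(sources, now)) {
    warnings.push(`${id}: last_successful_at is past its expected cadence`);
  }
  for (const key of flippedSignals(previous, current)) {
    warnings.push(`${key}: value changed since the previous run`);
  }
  return warnings;
}
```

A flipped value is not necessarily a fault; the check only surfaces it so that someone can compare it against the upstream source.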

When a scraper breaks

The diagnostic sequence:

  1. Check the upstream. Has the source page moved? Changed format? Stopped publishing? (TCFD is the canonical example — the source ceased to exist.)
  2. Check the scraper logs. Run pm2 logs esg-scheduler, or inspect the error message column of the scrape_run row directly.
  3. Try a manual run. Use npm run scrape -- --source <SOURCE_ID> to reproduce the failure in isolation.
  4. Check for upstream rate-limiting / IP blocks. Unlikely at our request volume but possible.
  5. Decide: fix, swap, or deprecate.

Decision criteria:

  • Fix if the upstream still exists and the format change is parseable.
  • Swap if the upstream still exists but has become too costly to use (e.g. it went paid-only); find an equivalent source if one exists.
  • Deprecate if the upstream has ceased to exist. The TCFD pattern: mark the source deprecated, find the replacement signal shape, and write an ADR for the replacement if it is structurally different.

Per ADR-0001, a scraper that breaks does not silently inject default scores: the rules it feeds become uncovered, the coverage percentage visibly drops, and the Methodology page shows the source as broken. This is the intended behaviour. Score inflation from silent defaults is the exact failure mode ADR-0001 is designed to prevent.
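
To illustrate the no-silent-defaults behaviour, here is a minimal sketch of how coverage could be computed. The function name, the shape of its inputs, and the percentage formula (covered rules divided by all rules) are assumptions for illustration; ADR-0001 remains the authority on the real behaviour.

```typescript
interface CoverageResult {
  scores: Map<string, number>; // only rules that actually have a signal
  uncovered: string[];         // rules whose source produced nothing
  coveragePct: number;
}

/**
 * Score only the rules that have a signal. A rule without a signal is
 * reported as uncovered; it is never given a default score.
 */
function scoreCoveredRules(
  ruleIds: string[],
  signalScores: Map<string, number>
): CoverageResult {
  const scores = new Map<string, number>();
  const uncovered: string[] = [];
  for (const ruleId of ruleIds) {
    const score = signalScores.get(ruleId);
    if (score === undefined) {
      uncovered.push(ruleId); // no default injected here, by design
    } else {
      scores.set(ruleId, score);
    }
  }
  const coveragePct = ruleIds.length === 0 ? 0 : (scores.size / ruleIds.length) * 100;
  return { scores, uncovered, coveragePct };
}
```

With four rules and one broken source feeding a single rule, this reports 75% coverage and lists that rule as uncovered, which is the visible drop described above.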


Expected signal counts per source (pilot)

A useful sanity check when verifying a scraper run:

Source             Institutions in scope    Expected signals per run
NZBA-MEMBERS       4 (UK banks)             4
SBTI-VALIDATED     8 (all pilot)            8
PRB-SIGNATORIES    4 (UK banks)             4
BHRRC              8 (all pilot)            8 + N watchlist findings
UK-MSA             4 (UK financials)        4
UK-GPG             4 (UK financials)        4
BankTrack-BOCC     4 (UK banks)             4
PAX-DBOTB          4 (UK banks)             4

Each expected signal is one row in the signal table, keyed by (rule, institution). Some sources feed multiple rules and therefore write multiple signal rows per institution; BHRRC, for example, feeds 4 rules plus the watchlist.
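
The table above can double as a lookup for an automated sanity check. The sketch below is illustrative only: the constant, the function, and the idea of passing in the list of live sources are assumptions, and each expected count is treated as a minimum so that BHRRC's extra watchlist findings do not raise a false alarm.

```typescript
// Expected signals per run, copied from the table above (pilot figures).
const EXPECTED_SIGNALS: Record<string, number> = {
  "NZBA-MEMBERS": 4,
  "SBTI-VALIDATED": 8,
  "PRB-SIGNATORIES": 4,
  "BHRRC": 8, // plus N watchlist findings, so this is a floor
  "UK-MSA": 4,
  "UK-GPG": 4,
  "BankTrack-BOCC": 4,
  "PAX-DBOTB": 4,
};

interface Shortfall {
  sourceId: string;
  expected: number;
  actual: number;
}

/**
 * Compare per-source signal counts for one run against the expected floor.
 * Only sources listed in `liveSources` are checked, so scrapers that are not
 * yet live do not count as shortfalls. A live source missing from `actual`
 * is treated as having written zero signals.
 */
function findShortfalls(
  liveSources: string[],
  actual: Record<string, number>
): Shortfall[] {
  const shortfalls: Shortfall[] = [];
  for (const sourceId of liveSources) {
    const expected = EXPECTED_SIGNALS[sourceId];
    if (expected === undefined) continue; // unknown source: nothing to compare
    const written = actual[sourceId] ?? 0;
    if (written < expected) shortfalls.push({ sourceId, expected, actual: written });
  }
  return shortfalls;
}
```

For a run with the three pilot scrapers live, passing counts of 4, 8 and 4 returns an empty list.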


Pilot run history (illustrative — actuals on the live page)

This section is illustrative. The product's /runs page shows the actual per-run history with timings, error counts, and per-source signal counts.

Run 3 — 13 May 2026 — duration 2.1s — 0 errors
  NZBA-MEMBERS:    4 signals
  SBTI-VALIDATED:  8 signals
  PRB-SIGNATORIES: 4 signals
  Total: 16 signals

Run 2 — 13 May 2026 — duration 2.3s — 0 errors
  NZBA-MEMBERS:    4 signals
  SBTI-VALIDATED:  8 signals
  Total: 12 signals (PRB scraper not yet live)

Run 1 — 12 May 2026 — duration 2.5s — 0 errors
  NZBA-MEMBERS:    4 signals
  SBTI-VALIDATED:  8 signals
  Total: 12 signals

Idempotency

Scrapers are idempotent within a run: running the same scraper twice against the same run_id produces the same signal rows. The signal.signal_id is a deterministic hash of (run_id, rule_id, institution_id).

A second scrape of the same source against a new run_id produces fresh signal rows with new signal_ids but the same values (assuming the upstream hasn't changed). This is normal weekly-cron behaviour.
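
The idempotency property can be shown with a small sketch. The source does not say which hash function is used, so a 32-bit FNV-1a hash stands in here purely to keep the example dependency-free; a production system would more likely use a cryptographic hash such as SHA-256. The Signal shape, the "|" separator, and the in-memory Map standing in for the signal table are also assumptions.

```typescript
// 32-bit FNV-1a over UTF-16 code units; a stand-in for the real hash function.
function fnv1a32(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Deterministic ID: the same (run_id, rule_id, institution_id) always maps to the same ID.
function signalId(runId: string, ruleId: string, institutionId: string): string {
  return fnv1a32([runId, ruleId, institutionId].join("|"));
}

interface Signal {
  signalId: string;
  runId: string;
  ruleId: string;
  institutionId: string;
  value: string;
}

// Upsert keyed by signalId: re-running a scraper for the same run overwrites
// the existing row instead of adding a duplicate.
function upsertSignal(
  store: Map<string, Signal>,
  runId: string,
  ruleId: string,
  institutionId: string,
  value: string
): Signal {
  const row: Signal = {
    signalId: signalId(runId, ruleId, institutionId),
    runId,
    ruleId,
    institutionId,
    value,
  };
  store.set(row.signalId, row);
  return row;
}
```

Writing the same (run_id, rule_id, institution_id) twice leaves one row in the store, while the same rule and institution under a new run_id yields a second row with a different ID.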


Performance budget

At pilot scale (8 institutions, 3 live scrapers), a full scrape run completes in ~2 seconds.

Once all backlog scrapers are live (8 sources), a full run is expected to take ~10–30 seconds, depending on PDF parsing for BankTrack and PAX. That is still trivial for a weekly cadence.

The performance constraint to watch is the corporate-PDF scraper (IFRS-S2-CORP) when it lands: it could take minutes per institution, depending on PDF size. It is worth running as a separate job rather than in the same window as the lighter scrapers.


Source-specific gotchas

Recording known pitfalls so they aren't rediscovered:

NZBA (seed-backed)

The upstream page (unepfi.org/banking/net-zero-banking) is WordPress 6.9.4 with AJAX-loaded content. Direct fetch returns no useful HTML. Don't try to make this scraper live-fetch. The seed file is the right answer; refresh quarterly.

SBTi (Excel)

The data is published as .xlsx, not .csv (despite the earlier signal source name SBTI-DASHBOARD-CSV). The xlsx npm package handles it cleanly.

LEI matching

GLEIF LEIs are 20 characters. Holding-company LEIs are preferred. Chat-side LEI guesses are wrong about half the time, so always verify via the GLEIF API.

For sources that match by name (most NGO databases), the scraper_config_json column on institution holds alternate names. IKEA's SBTi match name is "INGKA Holding B.V. (IKEA Group)" — different from the institution display name.
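
A cheap local check can reject malformed LEIs before any lookup. The sketch below validates the 20-character format and the two trailing check digits, which under ISO 17442 follow the ISO/IEC 7064 MOD 97-10 scheme. The function names are hypothetical. Passing this check only shows that the string is well formed; it does not show that the LEI belongs to the intended entity, so it does not replace verification via GLEIF.

```typescript
// Remainder mod 97 of the LEI read as a decimal number, where each letter
// expands to two digits (A = 10 ... Z = 35). Computed incrementally to stay
// within safe integer range.
function leiRemainder(lei: string): number {
  let remainder = 0;
  for (let i = 0; i < lei.length; i++) {
    const value = parseInt(lei.charAt(i), 36); // "0"-"9" -> 0-9, "A"-"Z" -> 10-35
    remainder = (value < 10 ? remainder * 10 + value : remainder * 100 + value) % 97;
  }
  return remainder;
}

// True when the string is 18 uppercase alphanumerics plus 2 check digits
// and the MOD 97-10 remainder is 1.
function isWellFormedLei(lei: string): boolean {
  if (!/^[A-Z0-9]{18}[0-9]{2}$/.test(lei)) return false;
  return leiRemainder(lei) === 1;
}
```

A mistyped or invented LEI will usually fail this check, which makes it a useful first filter ahead of the authoritative lookup.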

Companies House numbers

UK statutory registers (UK-MSA, UK-GPG) match by Companies House number, which belongs to the operating subsidiary, not the holding company. For Barclays, the MSA statement is filed under 01026167 (Barclays Bank PLC), not the holding company's number. The number is stored in institution.modern_slavery_company_number.

Name variation across NGO databases

Each NGO database uses subtly different naming. BHRRC uses "Barclays Bank Plc" without the dot; BankTrack uses "Barclays" alone; Ethical Consumer uses "Barclays Group". Per-source match names will live in scraper_config_json.
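
Per-source name matching might look roughly like the sketch below. The JSON shape of scraper_config_json (a matchNames object keyed by source ID) is an assumption made for illustration, as are the function names and the normalisation rule; the real column layout may differ.

```typescript
// Hypothetical view of an institution row.
interface Institution {
  displayName: string;
  scraperConfigJson: string; // raw JSON from institution.scraper_config_json
}

// Assumed config shape: { "matchNames": { "<SOURCE_ID>": ["Name A", "Name B"] } }
interface ScraperConfig {
  matchNames?: Record<string, string[]>;
}

// Lower-case and collapse punctuation so "Barclays Bank PLC." and
// "Barclays Bank Plc" compare equal.
function normalizeName(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Names to try for a given source; falls back to the display name when the
// config has no entry for that source.
function namesForSource(inst: Institution, sourceId: string): string[] {
  const config = JSON.parse(inst.scraperConfigJson) as ScraperConfig;
  return config.matchNames?.[sourceId] ?? [inst.displayName];
}

// True when the upstream name equals one of the source's match names
// after normalisation.
function matchesInstitution(
  inst: Institution,
  sourceId: string,
  upstreamName: string
): boolean {
  const target = normalizeName(upstreamName);
  return namesForSource(inst, sourceId).some((n) => normalizeName(n) === target);
}
```

Matching is exact after normalisation rather than fuzzy, so each naming variant has to be listed explicitly for its source. That is a deliberate choice in this sketch to avoid false positives between similarly named entities.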