
Scraper Register

The system's data sources, their state, and what rules they feed.

This page is the standing reference of intent and design. The live source-status surface is the /methodology page in the product, which renders the same register from the DB joined to recent scrape_run data.

For the runbook on adding a new scraper, see Add a scraper.

For the decision background, see ADR-0003.


Status legend

| Status | Meaning |
| --- | --- |
| live | Scraper exists, runs successfully on the weekly cron, produces signals |
| pending | Scheduled for an upcoming cycle, prioritised on the backlog |
| planned | On the roadmap but not yet scheduled |
| deprecated | Was live or planned, no longer used |

Live scrapers carry an effective sub-status from recent run history:

| Sub-status | Definition |
| --- | --- |
| live · healthy | Last 3 runs successful |
| live · degraded | 1–2 failures in last 5 runs |
| live · broken | 3+ consecutive failures |

The product page computes effective status from scrape_run outcomes nightly. This page records the canonical intent.
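The three sub-status definitions can overlap (the last 3 runs can all succeed while one of the last 5 failed), so any implementation has to pick a precedence. The sketch below is one possible reading, not the product's actual logic. It assumes runs are ordered newest first, reads "3+ consecutive failures" as the current failure streak, checks broken first, then healthy, and lets everything else fall through to degraded. The names `RunOutcome`, `SubStatus` and `effectiveSubStatus` are hypothetical.

```typescript
type RunOutcome = "success" | "failure";
type SubStatus = "healthy" | "degraded" | "broken";

// Returns null when fewer than 3 runs exist (assumption: not enough history to judge).
function effectiveSubStatus(runsNewestFirst: RunOutcome[]): SubStatus | null {
  if (runsNewestFirst.length < 3) {
    return null;
  }

  // Count the unbroken run of failures starting from the most recent run.
  let failureStreak = 0;
  for (const outcome of runsNewestFirst) {
    if (outcome !== "failure") {
      break;
    }
    failureStreak += 1;
  }
  if (failureStreak >= 3) {
    return "broken";
  }

  // Healthy takes precedence over degraded when the last 3 runs all succeeded.
  const lastThree = runsNewestFirst.slice(0, 3);
  if (lastThree.every((outcome) => outcome === "success")) {
    return "healthy";
  }

  // Assumption: any other mix of recent failures counts as degraded.
  return "degraded";
}
```

Putting healthy ahead of degraded means a scraper recovers its healthy label after three clean runs, even if an older failure is still inside the five-run window.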


Register

| Code | Name | Type | Access | Feeds rules | Status | Priority |
| --- | --- | --- | --- | --- | --- | --- |
| NZBA-MEMBERS | UNEP FI Net-Zero Banking Alliance | membership list | seed-backed | E1.1 | live | |
| SBTI-VALIDATED | Science Based Targets initiative | membership list | free Excel download | E1.5 | live | |
| PRB-SIGNATORIES | UN Principles for Responsible Banking | membership list | seed-backed | G3-PRB.1 | live | |
| BHRRC-RESOURCE-CENTRE | Business & Human Rights Resource Centre | NGO database | free HTML + RSS | S2, S5, E7, G5 + watchlist | pending | 1 |
| UK-MSA-REGISTRY | UK Modern Slavery Act Registry | regulatory register | free HTML | S2.1 | pending | 2 |
| UK-GPG-REGISTRY | UK Gender Pay Gap Service | regulatory register | free HTML | S3.1 | pending | 3 |
| BANKTRACK-BOCC | BankTrack "Banking on Climate Chaos" | NGO database | free PDF + data | E5.1 | pending | 4 |
| PAX-DBOTB | PAX "Don't Bank on the Bomb" | NGO database | free HTML + PDF | E6.1 | pending | 5 |
| ETHICAL-CONSUMER | Ethical Consumer ratings | NGO database | mixed (paid for detail) | watchlist only | planned | 7 |
| IFRS-S2-CORP | Corporate sustainability disclosure (IFRS S2) | corporate disclosure | corporate PDF | E4.1 | planned | 8 |
| PROFUNDO-REPORTS | Profundo published research | commercial reports | free PDF | E5, E6, S5 | planned | 10 |
| TCFD-SUPPORTERS | TCFD supporters list | membership list | | E4.1 (historical) | deprecated | |
| GOOGLE-NEWS-RSS | Google News RSS | news feed | RSS | E7.2, S6.1, G5.2, G7.1–G7.3 | planned | |
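As an illustration of how the Status and Priority columns combine, the sketch below orders the not-yet-built sources into a build queue. The `RegisterEntry` shape and the `buildQueue` function are hypothetical, and placing unprioritised rows (such as GOOGLE-NEWS-RSS) last is an assumption rather than a documented rule.

```typescript
type SourceStatus = "live" | "pending" | "planned" | "deprecated";

// Hypothetical, trimmed-down view of one register row.
interface RegisterEntry {
  code: string;
  status: SourceStatus;
  priority?: number; // blank in the register for live, deprecated and some planned rows
}

// Returns the codes of pending and planned sources, lowest priority number first.
function buildQueue(entries: RegisterEntry[]): string[] {
  // Assumption: rows without a priority sort after every prioritised row.
  const rank = (entry: RegisterEntry): number =>
    entry.priority ?? Number.MAX_SAFE_INTEGER;

  return entries
    .filter((entry) => entry.status === "pending" || entry.status === "planned")
    .sort((a, b) => rank(a) - rank(b))
    .map((entry) => entry.code);
}
```

Because pending rows currently hold priorities 1 to 5 and planned rows start at 7, a single sort on priority keeps pending sources ahead of planned ones without a separate status comparison.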

Per-source notes

Live sources

NZBA-MEMBERS

UNEP FI Net-Zero Banking Alliance. Seed-backed — the upstream page at unepfi.org/banking/net-zero-banking is AJAX-loaded and direct HTTP fetch returns no useful HTML. Members list is maintained in data/seed/nzba-members.json, updated manually ~quarterly. is_seed_backed = 1.

Pilot: all 4 UK banks are founding members (April 2021), so this source does not differentiate them.

SBTI-VALIDATED

Science Based Targets initiative. Live Excel download from sciencebasedtargets.org. Parsed via xlsx npm package. Columns: company_name, near_term_status, long_term_status, net_zero_status, lei, sector. Only "Targets set" treated as validated.

Pilot finding: NatWest has "Commitment removed" status, a genuine high-confidence negative. The other 3 UK banks were not found (conf=0.5).
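The "Targets set" rule and the not-found fallback can be sketched as below, working on rows that have already been parsed from the spreadsheet. Several details are assumptions: matching institutions by `lei`, reading the status from `near_term_status`, and the value 0.9 standing in for "high confidence". Only the column names, the "Targets set" rule and conf=0.5 for not-found come from this page. `classifySbti` and `SbtiSignal` are hypothetical names.

```typescript
// Columns as listed above for the SBTi download.
interface SbtiRow {
  company_name: string;
  near_term_status: string;
  long_term_status: string;
  net_zero_status: string;
  lei: string;
  sector: string;
}

interface SbtiSignal {
  validated: boolean | null; // null means the institution was not found
  confidence: number;
  detail: string;
}

const HIGH_CONFIDENCE = 0.9; // assumed value, not taken from the codebase
const NOT_FOUND_CONFIDENCE = 0.5; // from the pilot notes

function classifySbti(rows: SbtiRow[], lei: string): SbtiSignal {
  // Assumption: institutions are matched on LEI rather than on name.
  const row = rows.find((candidate) => candidate.lei === lei);
  if (row === undefined) {
    return { validated: null, confidence: NOT_FOUND_CONFIDENCE, detail: "not found" };
  }

  // Assumption: near_term_status is the column that carries "Targets set".
  const status = row.near_term_status.trim();
  return {
    validated: status === "Targets set",
    confidence: HIGH_CONFIDENCE,
    detail: status,
  };
}
```

Any status other than "Targets set", including "Commitment removed", comes back as a high-confidence negative, while a missing row stays at the lower not-found confidence.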

Naming legacy: the signal_source id is still SBTI-DASHBOARD or SBTI-CORPORATE in places. The rename to SBTI-VALIDATED is a known parked item.

PRB-SIGNATORIES

UN Principles for Responsible Banking. Seed-backed, same pattern as NZBA — upstream is AJAX-loaded. Seed file at data/seed/prb-signatories.json covers the 4 UK banks; expansion to ~350 banks (canonical UNEP FI list) is parked until pilot graduates.

Pilot: all 4 UK banks are founding signatories (September 2019), so this source does not differentiate them. It does lift the G3 evaluator's reachable tier from 3 to 4 (see Data model on G3 evaluator).

Pending sources (in priority order)

BHRRC-RESOURCE-CENTRE (priority 1)

Business & Human Rights Resource Centre. NGO database covering corporate human-rights allegations and responses. Highest-priority new source — first scraper that will actually differentiate the UK banks.

Feeds:

  • S2 (human rights policy / labour standards)
  • S5 (conflict-affected and high-risk areas)
  • E7 (environmental controversies)
  • G5 (corporate misconduct)
  • Watchlist (general allegations that don't fit a single rule)

Access: per-company HTML + RSS. Each pilot institution has a BHRRC profile URL; the scraper fetches and parses allegations.

UK-MSA-REGISTRY (priority 2)

UK Modern Slavery Act Statement Registry. Government register of modern-slavery statements filed by UK companies.

Feeds S2.1 (modern slavery statement filed).

Access: free HTML at modernslaveryregister.gov.uk. Lookup is by Companies House number, using the operating subsidiary's number (not the holding company's LEI). The number is held in institution.modern_slavery_company_number.
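A minimal sketch of deriving the lookup key from institution.modern_slavery_company_number follows. Companies House numbers are 8 characters long, with purely numeric ones left-padded with zeros. Whether the registry expects the padded form is an assumption, and the `Institution` shape and `registryLookupKey` name are hypothetical.

```typescript
// Hypothetical, trimmed-down institution record.
interface Institution {
  name: string;
  modern_slavery_company_number?: string | null;
}

// Returns a normalised Companies House number, or null when none is recorded.
function registryLookupKey(institution: Institution): string | null {
  const raw = institution.modern_slavery_company_number?.trim().toUpperCase();
  if (!raw) {
    return null; // no operating-subsidiary number held for this institution
  }
  // Purely numeric numbers are zero-padded to 8 digits; prefixed ones
  // (for example "SC" for Scottish companies) are left as they are.
  return /^\d+$/.test(raw) ? raw.padStart(8, "0") : raw;
}
```

Returning null rather than throwing lets a scraper skip institutions with no recorded number instead of failing the whole run.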

UK-GPG-REGISTRY (priority 3)

UK Gender Pay Gap Service. Government register of gender pay gap reports.

Feeds S3.1 (gender pay gap reporting).

Access: free HTML at gender-pay-gap.service.gov.uk. Lookup by Companies House number, same as UK-MSA.

BANKTRACK-BOCC (priority 4)

BankTrack's "Banking on Climate Chaos" report — annual analysis of bank fossil-fuel financing.

Feeds E5.1 (fossil project financing).

Access: free PDF + structured data on banktrack.org. Annual cadence; the scraper parses the latest published edition's tables.

Financials-only — the report covers banks. Non-financial institutions skip this source.

PAX-DBOTB (priority 5)

PAX's "Don't Bank on the Bomb" report — annual analysis of bank investment in producers of nuclear weapons and controversial weapons.

Feeds E6.1 (controversial weapons financing).

Access: free HTML + PDF on paxforpeace.nl. Financials-only.
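BANKTRACK-BOCC and PAX-DBOTB are both financials-only, so non-financial institutions skip them. One way to express that rule is sketched below. The `is_financial` flag and the `sourceApplies` function are hypothetical; this page does not say how institutions are actually classified.

```typescript
// Sources described on this page as financials-only.
const FINANCIALS_ONLY_SOURCES = new Set<string>(["BANKTRACK-BOCC", "PAX-DBOTB"]);

// Hypothetical institution shape; is_financial is an assumed field.
interface ScreenedInstitution {
  name: string;
  is_financial: boolean;
}

// True when the source should be scraped for this institution.
function sourceApplies(sourceCode: string, institution: ScreenedInstitution): boolean {
  return institution.is_financial || !FINANCIALS_ONLY_SOURCES.has(sourceCode);
}
```

Skipping the source outright, rather than recording a "not found" result, avoids treating a non-financial institution's absence from a bank-only report as evidence.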

Planned sources

ETHICAL-CONSUMER (priority 7)

Ethical Consumer's published ratings.

Used as a watchlist source only, not a score input. Their published ratings are valuable as third-party context but their methodology is opinionated (blanket negative treatment of fossil and tax-haven exposure). Folding their judgements into our composite would make our screen a thin veneer over theirs.

The Methodology page in the product displays Ethical Consumer ratings alongside our own in a "Third-party views" panel — never blended into the composite.

IFRS-S2-CORP (priority 8)

Corporate sustainability disclosure scraping against the IFRS S2 standard — the successor to TCFD.

Feeds E4.1 (climate disclosure quality).

This is the heavy-engineering workstream. The data lives in each institution's sustainability reports (corporate PDF). The signal could also be inferred from jurisdictional mandates: the UK, EU, Brazil, Japan and others have begun mandating IFRS S2 reporting, so jurisdiction + year gives a high-confidence positive for in-scope institutions. A hybrid of the two approaches is probably right.
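The hybrid idea can be sketched as follows: prefer evidence parsed from the institution's own report, and fall back to the jurisdictional mandate when no report has been parsed. This is an illustration under stated assumptions, not a design decision. The precedence order, the `inferIfrsS2` function and the `DisclosureEvidence` type are all hypothetical, and mandate start years must be supplied by the caller because none are recorded on this page.

```typescript
// How the E4.1 signal was derived for one institution and reporting year.
type DisclosureEvidence =
  | { basis: "report"; positive: boolean }
  | { basis: "mandate"; positive: true }
  | { basis: "none" };

function inferIfrsS2(
  jurisdiction: string,
  reportingYear: number,
  mandateStartYears: Map<string, number>, // jurisdiction code to first mandated year
  parsedReportResult?: boolean, // undefined when no report has been parsed
): DisclosureEvidence {
  // Assumption: direct evidence from the report outranks the mandate inference.
  if (parsedReportResult !== undefined) {
    return { basis: "report", positive: parsedReportResult };
  }

  const startYear = mandateStartYears.get(jurisdiction);
  if (startYear !== undefined && reportingYear >= startYear) {
    return { basis: "mandate", positive: true };
  }

  return { basis: "none" };
}

// Usage with a placeholder jurisdiction "XX" and a made-up start year.
const placeholderMandates = new Map<string, number>([["XX", 2030]]);
const inferred = inferIfrsS2("XX", 2031, placeholderMandates);
```

Keeping the basis alongside the result makes it possible to show, per institution, whether a positive came from a parsed report or was only inferred from a mandate.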

PROFUNDO-REPORTS (priority 10)

Profundo published research.

Lower priority because each report is a one-off PDF parse, and underlying data often overlaps with BankTrack. Useful for deep enrichment on specific issues, not for headline coverage growth.

GOOGLE-NEWS-RSS

Used as the catch-all source for several rules where no dedicated source exists:

  • E7.2 (environmental controversies)
  • S6.1 (social controversies)
  • G5.2 (DOJ enforcement actions — no dedicated DOJ source in catalogue)
  • G7.1–G7.3 (regulatory and conduct controversies)

The scraper is not yet built. When built, it will poll Google News RSS with per-institution query strings.
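Since the scraper does not exist yet, the following is only a sketch of what a per-institution query string might look like. It assumes the commonly used `https://news.google.com/rss/search?q=` endpoint, which is not formally documented and should be verified before use. The `googleNewsRssUrl` function and the quoted-name-plus-keywords query shape are hypothetical.

```typescript
// Builds a Google News RSS search URL for one institution.
// keywords narrow the feed towards a rule (for example enforcement terms for G5.2).
function googleNewsRssUrl(institutionName: string, keywords: string[]): string {
  // Quote the name so it is searched as a phrase; OR the keywords together.
  const keywordClause = keywords.length > 0 ? ` (${keywords.join(" OR ")})` : "";
  const query = `"${institutionName}"${keywordClause}`;
  return `https://news.google.com/rss/search?q=${encodeURIComponent(query)}`;
}

// Usage with a placeholder institution name.
const exampleFeedUrl = googleNewsRssUrl("Example Bank", ["fine", "lawsuit"]);
```

Results from a feed like this would still need the high-noise filtering noted in the source-type taxonomy before they could feed a rule.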

Deprecated sources

TCFD-SUPPORTERS

The TCFD task force was disbanded in October 2023, and the supporters list was removed from fsb-tcfd.org in November 2023. Monitoring transferred to the IFRS Foundation / ISSB under the IFRS S2 standard.

The framework's successor (IFRS S2) is a different shape of signal: disclosure quality, not framework membership. See IFRS-S2-CORP (priority 8) for the replacement.

This source remains in the register as deprecated for historical context. The G3 evaluator previously read TCFD membership as its third framework input; PRB replaced it during the 13 May cycle.


Notes on source-type taxonomy

The source_type field uses these values:

| Value | Examples |
| --- | --- |
| membership_list | NZBA, SBTi, PRB |
| regulatory_register | UK MSA, UK GPG |
| ngo_database | BHRRC, BankTrack, PAX, Ethical Consumer |
| news | Google News RSS |
| corporate_disclosure | IFRS-S2-CORP (corporate PDFs) |
| commercial_reports | Profundo |

The distinction matters because different source types have different reliability profiles and cadences. Membership lists are high-confidence boolean; NGO databases are noisier and need allegation-parsing; news feeds are high-noise and require LLM-grade filtering (deferred).
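The taxonomy maps naturally onto a string-literal union. The sketch below pairs it with the handling differences described above. The `Handling` labels and the `handlingFor` function are hypothetical, and the three types this page does not characterise are returned as "unspecified" rather than guessed.

```typescript
// Values of the source_type field, as listed in the table above.
type SourceType =
  | "membership_list"
  | "regulatory_register"
  | "ngo_database"
  | "news"
  | "corporate_disclosure"
  | "commercial_reports";

// Hypothetical labels for the handling each source type needs.
type Handling =
  | "boolean_signal" // high-confidence member / not-member
  | "allegation_parsing" // noisier, needs allegations extracted
  | "filtering_deferred" // high-noise, LLM-grade filtering not yet built
  | "unspecified"; // not characterised on this page

function handlingFor(sourceType: SourceType): Handling {
  switch (sourceType) {
    case "membership_list":
      return "boolean_signal";
    case "ngo_database":
      return "allegation_parsing";
    case "news":
      return "filtering_deferred";
    default:
      return "unspecified";
  }
}
```

Typing the field as a union means a misspelt or newly invented source type fails at compile time instead of silently receiving default handling.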


Access-mechanism taxonomy

| Value | Examples |
| --- | --- |
| free_html | Most regulatory registers, NZBA (if scrapeable), PRB (if scrapeable) |
| free_api | (none current) |
| free_pdf | BankTrack, PAX, IFRS-S2-CORP |
| free_excel | SBTi |
| rss | Google News RSS, BHRRC RSS feed |
| seed_backed | NZBA (AJAX-loaded upstream), PRB (AJAX-loaded upstream) |
| paid_subscription | Ethical Consumer (partial; paid for detail) |
| mixed | Ethical Consumer (some free, some paid) |