Scraper Register
The system's data sources, their state, and what rules they feed.
This page is the standing reference of intent and design. The
live source-status surface is the /methodology page in the
product, which renders the same register from the DB joined to recent
scrape_run data.
For the runbook on adding a new scraper, see Add a scraper.
For the decision background, see ADR-0003.
Status legend
| Status | Meaning |
|---|---|
| live | Scraper exists, runs successfully on the weekly cron, produces signals |
| pending | Scheduled for an upcoming cycle, prioritised on the backlog |
| planned | On the roadmap but not yet scheduled |
| deprecated | Was live or planned, no longer used |
Live scrapers carry an effective sub-status from recent run history:
| Sub-status | Definition |
|---|---|
| live · healthy | Last 3 runs successful |
| live · degraded | 1–2 failures in last 5 runs |
| live · broken | 3+ consecutive failures |
The product page computes effective status from scrape_run outcomes
nightly. This page records the canonical intent.
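The sub-status table above leaves some overlaps open (for example, three recent successes preceded by two failures), so the following is only a minimal sketch of one possible reading. The `RunOutcome` type, the newest-first ordering, and the precedence (broken, then healthy, then degraded) are assumptions for illustration; the real scrape_run schema is not shown on this page.

```typescript
// Hypothetical shapes: the real scrape_run columns are not documented here.
type RunOutcome = "success" | "failure";
type EffectiveStatus = "live · healthy" | "live · degraded" | "live · broken";

/**
 * Derive a live scraper's sub-status from its run outcomes, newest first.
 * Assumed precedence: broken, then healthy, then degraded as the fallback.
 */
function effectiveStatus(outcomes: RunOutcome[]): EffectiveStatus {
  // Length of the failure streak that ends at the most recent run.
  let streak = 0;
  for (const outcome of outcomes) {
    if (outcome !== "failure") break;
    streak++;
  }
  if (streak >= 3) return "live · broken";

  // Healthy when none of the last (up to) three runs failed.
  const lastThree = outcomes.slice(0, 3);
  if (lastThree.every((o) => o === "success")) return "live · healthy";

  // Everything else, including cases the table does not cover, is degraded.
  return "live · degraded";
}
```

Under this reading, a scraper that failed twice and then succeeded three times is already healthy again, and a history with no runs at all also counts as healthy. Whether the nightly job makes the same choices is not stated here.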
Register
| Code | Name | Type | Access | Feeds rules | Status | Priority |
|---|---|---|---|---|---|---|
| NZBA-MEMBERS | UNEP FI Net-Zero Banking Alliance | membership list | seed-backed | E1.1 | live | — |
| SBTI-VALIDATED | Science Based Targets initiative | membership list | free Excel download | E1.5 | live | — |
| PRB-SIGNATORIES | UN Principles for Responsible Banking | membership list | seed-backed | G3-PRB.1 | live | — |
| BHRRC-RESOURCE-CENTRE | Business & Human Rights Resource Centre | NGO database | free HTML + RSS | S2, S5, E7, G5 + watchlist | pending | 1 |
| UK-MSA-REGISTRY | UK Modern Slavery Act Registry | regulatory register | free HTML | S2.1 | pending | 2 |
| UK-GPG-REGISTRY | UK Gender Pay Gap Service | regulatory register | free HTML | S3.1 | pending | 3 |
| BANKTRACK-BOCC | BankTrack "Banking on Climate Chaos" | NGO database | free PDF + data | E5.1 | pending | 4 |
| PAX-DBOTB | PAX "Don't Bank on the Bomb" | NGO database | free HTML + PDF | E6.1 | pending | 5 |
| ETHICAL-CONSUMER | Ethical Consumer ratings | NGO database | mixed (paid for detail) | watchlist only | planned | 7 |
| IFRS-S2-CORP | Corporate sustainability disclosure (IFRS S2) | corporate disclosure | corporate PDF | E4.1 | planned | 8 |
| PROFUNDO-REPORTS | Profundo published research | commercial reports | free PDF | E5, E6, S5 | planned | 10 |
| TCFD-SUPPORTERS | TCFD supporters list | membership list | — | E4.1 (historical) | deprecated | — |
| GOOGLE-NEWS-RSS | Google News RSS | news feed | RSS | E7.2, S6.1, G5.2, G7.1–G7.3 | planned | — |
Per-source notes
Live sources
NZBA-MEMBERS
UNEP FI Net-Zero Banking Alliance. Seed-backed: the upstream page
at unepfi.org/banking/net-zero-banking is AJAX-loaded, so a direct HTTP
fetch returns no useful HTML. The members list is maintained in
data/seed/nzba-members.json and updated manually, roughly quarterly.
is_seed_backed = 1.
Pilot: all 4 UK banks are founding members (April 2021), so this source does not differentiate them.
SBTI-VALIDATED
Science Based Targets initiative. Live Excel download from
sciencebasedtargets.org, parsed with the xlsx npm package. Columns:
company_name, near_term_status, long_term_status, net_zero_status,
lei, sector. Only the "Targets set" status is treated as validated.
Pilot finding: NatWest has "Commitment removed" status, a genuine high-confidence negative. The other 3 UK banks were not found in the list (conf=0.5).
Naming legacy: the signal_source id is still SBTI-DASHBOARD or
SBTI-CORPORATE in places. The rename to SBTI-VALIDATED is a known
parked item.
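The validation rule for SBTI-VALIDATED can be sketched as below. The column names come from this page; everything else is an assumption made for illustration: matching on `lei` rather than `company_name`, checking all three status columns for "Targets set", the `SbtiSignal` result shape, and the 0.9 value used for "high confidence" (this page only specifies 0.5 for "not found"). The sketch works on rows that have already been parsed out of the spreadsheet.

```typescript
// Column names are taken from this page; the row is assumed to be already parsed.
interface SbtiRow {
  company_name: string;
  near_term_status: string;
  long_term_status: string;
  net_zero_status: string;
  lei: string;
  sector: string;
}

// Hypothetical result shape, not the project's actual signal type.
interface SbtiSignal {
  validated: boolean;
  confidence: number;
  reason: string;
}

// Assumed value: the page states only that "not found" carries conf=0.5.
const HIGH_CONFIDENCE = 0.9;

function classifySbti(rows: SbtiRow[], lei: string): SbtiSignal {
  // Assumption: institutions are matched on LEI.
  let match: SbtiRow | undefined;
  for (const row of rows) {
    if (row.lei === lei) {
      match = row;
      break;
    }
  }
  if (!match) {
    return { validated: false, confidence: 0.5, reason: "not found" };
  }

  const statuses = [match.near_term_status, match.long_term_status, match.net_zero_status];
  // Only "Targets set" counts as validated.
  if (statuses.some((s) => s === "Targets set")) {
    return { validated: true, confidence: HIGH_CONFIDENCE, reason: "Targets set" };
  }
  // A listed company with any other status (e.g. "Commitment removed") is a negative.
  const reason = statuses.filter((s) => s !== "").join("; ");
  return { validated: false, confidence: HIGH_CONFIDENCE, reason };
}
```

Keeping "not found" at a lower confidence than a listed negative reflects the pilot finding: absence from the list is weaker evidence than an explicit "Commitment removed".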
PRB-SIGNATORIES
UN Principles for Responsible Banking. Seed-backed, same pattern as
NZBA: the upstream page is AJAX-loaded. The seed file at
data/seed/prb-signatories.json covers the 4 UK banks; expansion to
~350 banks (canonical UNEP FI list) is parked until the pilot graduates.
Pilot: all 4 UK banks are founding signatories (September 2019), so this source does not differentiate them. It does lift the G3 evaluator's reachable tier from 3 to 4 (see Data model on G3 evaluator).
Pending sources (in priority order)
BHRRC-RESOURCE-CENTRE (priority 1)
Business & Human Rights Resource Centre. NGO database covering corporate human-rights allegations and responses. This is the highest-priority new source: the first scraper expected to actually differentiate the UK banks.
Feeds:
- S2 (human rights policy / labour standards)
- S5 (conflict-affected and high-risk areas)
- E7 (environmental controversies)
- G5 (corporate misconduct)
- Watchlist (general allegations that don't fit a single rule)
Access: per-company HTML + RSS. Each pilot institution has a BHRRC profile URL; the scraper fetches and parses allegations.
UK-MSA-REGISTRY (priority 2)
UK Modern Slavery Act Statement Registry. Government register of modern-slavery statements filed by UK companies.
Feeds S2.1 (modern slavery statement filed).
Access: free HTML at modernslaveryregister.gov.uk. Lookup by
Companies House number, which is the operating subsidiary's number
(not holding company LEI). Held in institution.modern_slavery_company_number.
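A minimal sketch of choosing the lookup key, assuming a hypothetical `Institution` shape in which only modern_slavery_company_number is taken from this page. It relies on Companies House numbers being 8 characters long, with purely numeric numbers zero-padded and some numbers carrying a two-letter prefix such as SC. Other prefixed formats exist and are deliberately not handled here.

```typescript
// Hypothetical row shape; only modern_slavery_company_number is named on this page.
interface Institution {
  name: string;
  lei: string;
  modern_slavery_company_number: string | null;
}

/** Returns the 8-character Companies House number to look up, or null to skip the source. */
function registryLookupKey(inst: Institution): string | null {
  const raw = (inst.modern_slavery_company_number || "").trim().toUpperCase();
  // No operating-subsidiary number recorded: skip, never fall back to the LEI.
  if (raw === "") return null;
  // Purely numeric numbers are left-padded with zeros to 8 characters.
  if (/^\d{1,8}$/.test(raw)) return ("00000000" + raw).slice(-8);
  // Common prefixed form: two letters followed by six digits.
  if (/^[A-Z]{2}\d{6}$/.test(raw)) return raw;
  // Unrecognised format: skip rather than guess.
  return null;
}
```

Returning null instead of substituting the holding company's identifier keeps a missing number visible as a data gap rather than producing a lookup against the wrong entity.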
UK-GPG-REGISTRY (priority 3)
UK Gender Pay Gap Service. Government register of gender pay gap reports.
Feeds S3.1 (gender pay gap reporting).
Access: free HTML at gender-pay-gap.service.gov.uk. Lookup by Companies
House number, same as UK-MSA.
BANKTRACK-BOCC (priority 4)
BankTrack's "Banking on Climate Chaos" report — annual analysis of bank fossil-fuel financing.
Feeds E5.1 (fossil project financing).
Access: free PDF + structured data on banktrack.org. Annual cadence; the scraper parses the latest published edition's tables.
Financials-only: the report covers banks, so non-financial institutions skip this source.
PAX-DBOTB (priority 5)
PAX's "Don't Bank on the Bomb" report — annual analysis of bank investment in producers of nuclear weapons and controversial weapons.
Feeds E6.1 (controversial weapons financing).
Access: free HTML + PDF on paxforpeace.nl. Financials-only.
Planned sources
ETHICAL-CONSUMER (priority 7)
Ethical Consumer's published ratings.
Used as a watchlist source only, not a score input. Their published ratings are valuable as third-party context but their methodology is opinionated (blanket negative treatment of fossil and tax-haven exposure). Folding their judgements into our composite would make our screen a thin veneer over theirs.
The Methodology page in the product displays Ethical Consumer ratings alongside our own in a "Third-party views" panel — never blended into the composite.
IFRS-S2-CORP (priority 8)
Corporate sustainability disclosure scraping against the IFRS S2 standard — the successor to TCFD.
Feeds E4.1 (climate disclosure quality).
This is the heavy-engineering workstream. The data lives in each institution's sustainability reports (corporate PDFs). It could also be inferred from jurisdictional mandates: the UK, EU, Brazil, Japan and others have begun mandating IFRS S2 reporting, so jurisdiction + year gives a high-confidence positive for in-scope institutions. A hybrid approach is probably right.
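One way the hybrid approach could be shaped is sketched below: a jurisdictional mandate, when it applies, takes precedence, and the PDF-derived finding is used otherwise. All names, the `Evidence` shape, and the confidence values are hypothetical. The mandate table is supplied by the caller, and no real mandate dates are asserted here.

```typescript
// Hypothetical evidence shape for rule E4.1.
interface Evidence {
  positive: boolean;
  confidence: number;
  basis: "mandate" | "disclosure" | "none";
}

/**
 * mandateFrom maps a jurisdiction code to the first reporting year covered by a mandate.
 * pdfFinding is the result of parsing the institution's report, or null if none was parsed.
 * Confidence values are placeholders, not calibrated figures.
 */
function ifrsS2Signal(
  jurisdiction: string,
  reportingYear: number,
  mandateFrom: Record<string, number>,
  pdfFinding: boolean | null
): Evidence {
  const start = mandateFrom[jurisdiction];
  // Jurisdiction + year: an in-scope institution is assumed to be reporting.
  if (start !== undefined && reportingYear >= start) {
    return { positive: true, confidence: 0.9, basis: "mandate" };
  }
  // Otherwise fall back to what the PDF parse found, at lower confidence.
  if (pdfFinding !== null) {
    return { positive: pdfFinding, confidence: 0.7, basis: "disclosure" };
  }
  return { positive: false, confidence: 0.5, basis: "none" };
}

// Usage with a placeholder jurisdiction code "XX" (not a real mandate).
const example = ifrsS2Signal("XX", 2026, { XX: 2025 }, null);
```

This sketch treats the mandate as sufficient on its own. A stricter variant could require the PDF finding to agree before reporting a positive.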
PROFUNDO-REPORTS (priority 10)
Profundo published research.
Lower priority because each report is a one-off PDF parse, and underlying data often overlaps with BankTrack. Useful for deep enrichment on specific issues, not for headline coverage growth.
GOOGLE-NEWS-RSS
Used as the catch-all source for several rules where no dedicated source exists:
- E7.2 (environmental controversies)
- S6.1 (social controversies)
- G5.2 (DOJ enforcement actions — no dedicated DOJ source in catalogue)
- G7.1–G7.3 (regulatory and conduct controversies)
The scraper is not yet built. When built, it will poll Google News over RSS with per-institution query strings.
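A per-institution query string might be assembled along these lines. This is a sketch only: it assumes the commonly used (and not formally documented) `https://news.google.com/rss/search` endpoint with `q`, `hl`, `gl` and `ceid` parameters, and the institution name and search terms are placeholders rather than the project's actual queries.

```typescript
/**
 * Build a Google News RSS search URL for one institution.
 * The endpoint and parameter names are assumptions; the terms are illustrative.
 */
function googleNewsRssUrl(institutionName: string, terms: string[]): string {
  // Quote the institution name so it is matched as a phrase, then OR the rule terms.
  const termClause = terms.length > 0 ? ` (${terms.join(" OR ")})` : "";
  const query = `"${institutionName}"${termClause}`;

  const url = new URL("https://news.google.com/rss/search");
  url.searchParams.set("q", query);
  url.searchParams.set("hl", "en-GB"); // interface language
  url.searchParams.set("gl", "GB"); // country edition
  url.searchParams.set("ceid", "GB:en");
  return url.toString();
}

// Usage with a placeholder institution and hypothetical terms for a controversy rule.
const feedUrl = googleNewsRssUrl("Example Bank", ["fine", "lawsuit"]);
```

Using `URL.searchParams` leaves the encoding of quotes, spaces and parentheses to the platform instead of hand-built string concatenation.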
Deprecated sources
TCFD-SUPPORTERS
The TCFD task force was disbanded in October 2023. The supporters list
was explicitly removed from fsb-tcfd.org in November 2023. Monitoring
transferred to the IFRS Foundation / ISSB under the IFRS S2 standard.
The framework's successor (IFRS S2) is a different shape of signal:
disclosure quality, not framework membership. See IFRS-S2-CORP
(priority 8) for the replacement.
This source remains in the register as deprecated for historical
context. The G3 evaluator previously read TCFD membership as its
third framework input; PRB replaced it during the 13 May cycle.
Notes on source-type taxonomy
The source_type field uses these values:
| Value | Examples |
|---|---|
| membership_list | NZBA, SBTi, PRB |
| regulatory_register | UK MSA, UK GPG |
| ngo_database | BHRRC, BankTrack, PAX, Ethical Consumer |
| news | Google News RSS |
| corporate_disclosure | IFRS-S2-CORP (corporate PDFs) |
| commercial_reports | Profundo |
The distinction matters because different source types have different reliability profiles and cadences. Membership lists are high-confidence boolean; NGO databases are noisier and need allegation-parsing; news feeds are high-noise and require LLM-grade filtering (deferred).
Access-mechanism taxonomy
| Value | Examples |
|---|---|
| free_html | Most regulatory registers, NZBA (if scrapeable), PRB (if scrapeable) |
| free_api | (none current) |
| free_pdf | BankTrack, PAX, IFRS-S2-CORP |
| free_excel | SBTi |
| rss | Google News RSS, BHRRC RSS feed |
| seed_backed | NZBA (AJAX-loaded upstream), PRB (AJAX-loaded upstream) |
| paid_subscription | Ethical Consumer (partial; paid for detail) |
| mixed | Ethical Consumer (some free, some paid) |
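The two taxonomies could be mirrored in code as closed sets of string literals, so that a register row with an unknown value is rejected at load time. The sketch below uses the values from the tables above. The `RegisterEntry` shape, the `access` field name and the `parseEntry` helper are hypothetical (only source_type is named on this page).

```typescript
// Values copied from the taxonomy tables on this page.
const SOURCE_TYPES = [
  "membership_list", "regulatory_register", "ngo_database",
  "news", "corporate_disclosure", "commercial_reports",
] as const;
type SourceType = typeof SOURCE_TYPES[number];

const ACCESS_MECHANISMS = [
  "free_html", "free_api", "free_pdf", "free_excel",
  "rss", "seed_backed", "paid_subscription", "mixed",
] as const;
type AccessMechanism = typeof ACCESS_MECHANISMS[number];

function isSourceType(value: string): value is SourceType {
  return (SOURCE_TYPES as readonly string[]).indexOf(value) !== -1;
}

function isAccessMechanism(value: string): value is AccessMechanism {
  return (ACCESS_MECHANISMS as readonly string[]).indexOf(value) !== -1;
}

// Hypothetical entry shape; "access" is an assumed field name.
interface RegisterEntry {
  code: string;
  source_type: SourceType;
  access: AccessMechanism;
}

/** Validate raw strings (e.g. from a DB row) into a typed register entry. */
function parseEntry(code: string, sourceType: string, access: string): RegisterEntry {
  if (!isSourceType(sourceType)) {
    throw new Error(`${code}: unknown source_type "${sourceType}"`);
  }
  if (!isAccessMechanism(access)) {
    throw new Error(`${code}: unknown access mechanism "${access}"`);
  }
  return { code, source_type: sourceType, access };
}

const sbti = parseEntry("SBTI-VALIDATED", "membership_list", "free_excel");
```

Note that the register table above uses display labels such as "news feed" and "free Excel download", which would need mapping to these values before validation.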