Add a Scraper
Runbook for adding a new scraper. Follow these steps in order. Each step ends with a verification check; pass it before moving on.
For the source register, see Register. For source-specific gotchas, see Health.
This runbook assumes a CC-on-VM cycle. Steps 1–2 may be done in a chat session ahead of time as artefact preparation; steps 3+ require live schema and module access on the VM.
Decision: live scrape vs seed-backed
The first question is whether the upstream can be fetched live. Some
upstream pages don't permit it (AJAX-loaded, behind auth, rate-limited).
For those, the scraper reads from a versioned JSON seed file at
data/seed/<source>.json instead.
Rule of thumb:
- Live scrape if the upstream is a static HTML page, a structured download (CSV / Excel / JSON), an RSS feed, or an API.
- Seed-backed if the upstream is AJAX-loaded WordPress, requires auth, or changes infrequently (≤1× per month) and a manual update is cheaper than reverse-engineering the AJAX.
When in doubt, try live first with a quick curl of the URL. If
the returned HTML contains the data, scrape live. If not, use a seed.
Set signal_source.is_seed_backed = 1 when seed-backed.
Step 1 — Define the source
Decide and document:
- Source code (SCREAMING-KEBAB-CASE, e.g. BHRRC-RESOURCE-CENTRE)
- Display name (human-readable, e.g. "Business & Human Rights Resource Centre")
- Source type (one of membership_list / regulatory_register / ngo_database / news / corporate_disclosure / commercial_reports)
- Access mechanism (one of free_html / free_api / free_pdf / free_excel / rss / seed_backed / paid_subscription / mixed)
- Refresh cadence (one of Daily / Weekly / Monthly / Quarterly / Annual; these are the CHECK constraint values)
- External URL (the canonical upstream link)
- Rules it feeds (list of rule IDs)
If feeding rules that don't yet exist, define those too. New rules may need a new sub-criterion, new pillar weighting, etc. — that's an ADR-grade decision if it's structural, just a migration if it's incremental.
Artefact: a one-page note or ADR draft naming all the above. Hand to CC-on-VM.
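The checklist above can be checked mechanically before handoff. The following is a minimal sketch, assuming the note is captured as a plain object. The field names (code, displayName, sourceType and so on) and the validateSourceDefinition helper are hypothetical; the allowed values are copied from the list above, and the URL and access mechanism in the example are placeholders.

```javascript
// Allowed values copied from the Step 1 checklist.
const SOURCE_TYPES = ['membership_list', 'regulatory_register', 'ngo_database',
  'news', 'corporate_disclosure', 'commercial_reports'];
const ACCESS_MECHANISMS = ['free_html', 'free_api', 'free_pdf', 'free_excel',
  'rss', 'seed_backed', 'paid_subscription', 'mixed'];
const REFRESH_CADENCES = ['Daily', 'Weekly', 'Monthly', 'Quarterly', 'Annual'];

// SCREAMING-KEBAB-CASE: upper-case alphanumeric words joined by single hyphens.
const SOURCE_CODE_PATTERN = /^[A-Z0-9]+(-[A-Z0-9]+)*$/;

// Hypothetical helper: returns a list of problems; an empty list means the
// definition is complete enough to hand over.
function validateSourceDefinition(def) {
  const problems = [];
  if (!SOURCE_CODE_PATTERN.test(def.code || '')) problems.push('code must be SCREAMING-KEBAB-CASE');
  if (!def.displayName) problems.push('displayName is required');
  if (!SOURCE_TYPES.includes(def.sourceType)) problems.push('unknown sourceType');
  if (!ACCESS_MECHANISMS.includes(def.accessMechanism)) problems.push('unknown accessMechanism');
  if (!REFRESH_CADENCES.includes(def.refreshCadence)) problems.push('unknown refreshCadence');
  if (!/^https?:\/\//.test(def.externalUrl || '')) problems.push('externalUrl must be an http(s) URL');
  if (!Array.isArray(def.ruleIds) || def.ruleIds.length === 0) problems.push('ruleIds must list at least one rule');
  return problems;
}

const example = {
  code: 'BHRRC-RESOURCE-CENTRE',
  displayName: 'Business & Human Rights Resource Centre',
  sourceType: 'ngo_database',
  accessMechanism: 'free_html',               // placeholder choice for illustration
  refreshCadence: 'Weekly',
  externalUrl: 'https://example.com/members', // placeholder URL
  ruleIds: ['S2.1'],
};
console.log(validateSourceDefinition(example)); // prints []
```

Cadence values are case-sensitive here because the runbook says they mirror the CHECK constraint, so 'weekly' would be reported as a problem.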
Step 2 — Verify upstream
Before writing code, prove the data is reachable:
# For an HTML page
curl -s https://example.com/members | head -200
# For an Excel/CSV download
curl -L -o /tmp/source.xlsx https://example.com/data.xlsx
extract-text /tmp/source.xlsx | head -100
# For an RSS feed
curl -s https://example.com/rss | head -100
Confirm the data is present in the fetched bytes. If it isn't — if the data is loaded via JavaScript — switch to seed-backed.
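The "is the data in the fetched bytes" check can be made less eyeball-dependent. This is a rough sketch, not part of the existing tooling: missingMarkers and suggestMode are hypothetical helpers that take the text returned by curl and a few values you can see on the rendered page, then report whether they are all present.

```javascript
// Hypothetical helper: given the fetched text and a few values visible on the
// rendered page, return the markers that do not appear in the fetched text.
function missingMarkers(fetchedText, expectedMarkers) {
  const haystack = fetchedText.toLowerCase();
  return expectedMarkers.filter((m) => !haystack.includes(m.toLowerCase()));
}

// If every marker is present, a live scrape is plausible; otherwise the data
// is probably injected client-side and the source should be seed-backed.
function suggestMode(fetchedText, expectedMarkers) {
  return missingMarkers(fetchedText, expectedMarkers).length === 0 ? 'live' : 'seed_backed';
}

// Made-up page bodies for illustration.
const staticPage = '<ul><li>Barclays PLC</li><li>NatWest Group</li></ul>';
const ajaxShell = '<div id="members"></div><script src="/app.js"></script>';
console.log(suggestMode(staticPage, ['Barclays', 'NatWest'])); // live
console.log(suggestMode(ajaxShell, ['Barclays', 'NatWest']));  // seed_backed
```

Choose markers without characters that HTML escapes (such as "&"), since an escaped entity in the raw bytes would cause a false "missing" result.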
For seed-backed, manually compile the seed JSON. Use the format:
{
"source": "BHRRC-RESOURCE-CENTRE",
"fetched_at": "2026-05-19T12:00:00Z",
"fetched_by": "rob",
"data": [
{
"lei": "213800LBQA1Y9L22JB70",
"name": "Barclays PLC",
"..."
}
]
}
Verify LEIs via GLEIF before committing the seed. Do not trust chat-side LEI guesses.
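A local checksum test can catch typos before the GLEIF lookup. The sketch below assumes the standard LEI check-digit scheme (ISO 17442, which uses ISO 7064 MOD 97-10) and the seed layout shown above; hasValidLeiChecksum, appendCheckDigits and findBadLeis are hypothetical helper names. A passing checksum only shows the string is well-formed. It does not show that the LEI belongs to the named entity, so the GLEIF check is still required.

```javascript
// Map each character to its base-36 value (0-9 stay, A=10 ... Z=35) and
// concatenate the decimal values, as MOD 97-10 requires.
function toNumericString(text) {
  return text.split('').map((ch) => parseInt(ch, 36).toString()).join('');
}

// An LEI is 18 upper-case alphanumeric characters plus two check digits;
// its numeric form must leave remainder 1 when divided by 97.
function hasValidLeiChecksum(lei) {
  if (typeof lei !== 'string' || !/^[A-Z0-9]{18}[0-9]{2}$/.test(lei)) return false;
  return BigInt(toNumericString(lei)) % 97n === 1n;
}

// Compute check digits for an 18-character prefix. Used here only to build a
// self-consistent test value, not to mint real LEIs.
function appendCheckDigits(prefix18) {
  const remainder = BigInt(toNumericString(prefix18 + '00')) % 97n;
  return prefix18 + (98n - remainder).toString().padStart(2, '0');
}

// Hypothetical seed check: names of entries whose LEI fails the checksum.
function findBadLeis(seed) {
  return seed.data.filter((row) => !hasValidLeiChecksum(row.lei)).map((row) => row.name);
}

const synthetic = appendCheckDigits('TESTPREFIX00000000'); // made-up, not a real LEI
console.log(hasValidLeiChecksum(synthetic)); // true
```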
Step 3 — Migration: register the source and rules
New migration NNN_<source_slug>.sql. Two halves:
-- Register the source
INSERT INTO signal_source (
source_id, name, source_type, refresh_cadence,
is_seed_backed, base_url, ...
) VALUES (
'BHRRC-RESOURCE-CENTRE', 'Business & Human Rights Resource Centre',
'ngo_database', 'Weekly',
0, 'https://www.business-humanrights.org/en/companies/...', ...
);
-- Register any new rules
INSERT INTO rule (
rule_id, pillar, sub_criterion, description, source_id,
applicable_sectors, rule_weight, ...
) VALUES
('S2.1', 'S', 'S2', 'Human rights allegation count',
'BHRRC-RESOURCE-CENTRE', 'ALL', 0.4, ...);
INSERT INTO audit_log (actor, entity, entity_id, action, after_json)
VALUES ('migration', 'signal_source', 'BHRRC-RESOURCE-CENTRE',
'insert', '{"reason": "ADR-NNNN"}');
Note: chat-session SQL drafts have known column-name uncertainty. Reconciliation in CC-on-VM is expected. See Session protocol ("CC writes code, chat writes spec").
Apply the migration:
node src/db/migrate.js
sqlite3 data/esg.db "SELECT * FROM schema_migrations ORDER BY id DESC LIMIT 5;"
Verify the row counts:
SELECT COUNT(*) FROM signal_source WHERE source_id = 'BHRRC-RESOURCE-CENTRE';
SELECT COUNT(*) FROM rule WHERE source_id = 'BHRRC-RESOURCE-CENTRE';
PRAGMA foreign_key_check;
Step 4 — Scraper module
Create src/scrapers/<source>.js. Pattern follows existing scrapers
(nzba.js, sbti.js, prb.js):
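const { makeSignal, writeSignals } = require('./base');
// fetchOrLoadSeed and matchInstitution are not defined in this snippet; they
// are assumed to be defined or imported elsewhere in the module.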
async function scrape(institution, runId, sourceConfig) {
// 1. Fetch upstream (or read seed file)
const data = await fetchOrLoadSeed(sourceConfig);
// 2. Match against this institution (LEI-first, name fallback)
const match = matchInstitution(data, institution);
// 3. Convert to signal(s)
const signals = [];
if (match) {
signals.push(makeSignal({
runId,
ruleId: 'S2.1',
institutionId: institution.institution_id,
value: match.allegation_count,
confidence: 1.0,
sourceUrl: match.source_url,
notes: `${match.allegation_count} allegations in BHRRC database`,
}));
} else {
// Absence — is it confirmed or ambiguous?
signals.push(makeSignal({
runId,
ruleId: 'S2.1',
institutionId: institution.institution_id,
value: 0,
confidence: 0.5, // ambiguous: could be name-match issue
sourceUrl: sourceConfig.base_url,
notes: 'Not found in BHRRC',
}));
}
return signals;
}
module.exports = { scrape, sourceId: 'BHRRC-RESOURCE-CENTRE' };
Key patterns:
- LEI-first matching, name fallback: try LEI lookup if the source supports it; fall back to fuzzy name match if not.
- Confidence semantics: 1.0 for definitive positive or negative, 0.5 for ambiguous absence (could be a name-match issue).
- Source URL on every signal: for audit trail per ADR-0003.
- Notes: short human-readable explanation of what the signal means.
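The matching and confidence patterns above might look like the following. This is a sketch under stated assumptions, not the existing implementation: the matchInstitution body, the lei and name field names, and the suffix list are all hypothetical, and an exact match on normalised names stands in for whatever fuzzy matching the real scrapers use.

```javascript
// Normalise a legal name for comparison: lower-case, strip punctuation and
// common legal suffixes. The suffix list is illustrative, not exhaustive.
function normaliseName(name) {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9 ]+/g, ' ')
    .replace(/\b(plc|ltd|limited|group|inc|sa|ag)\b/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Hypothetical matchInstitution: LEI first, then normalised-name fallback.
// Returns the matched row plus a matchedBy field, or null when nothing matches.
function matchInstitution(rows, institution) {
  if (institution.lei) {
    const byLei = rows.find((row) => row.lei && row.lei === institution.lei);
    if (byLei) return { ...byLei, matchedBy: 'lei' };
  }
  const wanted = normaliseName(institution.name);
  const byName = rows.find((row) => normaliseName(row.name) === wanted);
  return byName ? { ...byName, matchedBy: 'name' } : null;
}

// Confidence per the semantics above: 1.0 for a definitive result, 0.5 for an
// absence that might just be a failed name match.
function confidenceFor(match) {
  return match ? 1.0 : 0.5;
}

const rows = [ // made-up rows, not real data
  { lei: 'LEI-A', name: 'Example Bank PLC', allegation_count: 2 },
  { lei: null, name: 'Sample Holdings Ltd.', allegation_count: 0 },
];
console.log(matchInstitution(rows, { lei: 'LEI-A', name: 'Renamed Bank' }).matchedBy);            // lei
console.log(matchInstitution(rows, { lei: 'LEI-B', name: 'sample holdings limited' }).matchedBy); // name
```

Recording how the match was made (matchedBy) makes name-fallback matches easy to review later, which is where the per-source gotchas tend to surface.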
Step 5 — Runner registration
Add the new scraper to src/scrapers/run.js:
const bhrrc = require('./bhrrc');
const SCRAPERS = [
nzba,
sbti,
prb,
bhrrc, // new
];
Routing (which institutions this scraper applies to) is determined by
the rules it feeds and their applicable_sectors. BHRRC feeds
universal rules (S2, S5, E7, G5 — applicable_sectors='ALL'), so it
runs against all 8 pilot institutions. BankTrack feeds financials-only
rules ('40'), so it only runs against the 4 UK banks.
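The routing rule described above can be sketched as a pure function. The real runner's logic is not shown in this runbook, so treat this as an illustration only: it assumes applicable_sectors is either 'ALL' or a comma-separated list of sector codes, and that each institution carries a sector_code field. Both are assumptions about the schema, and the second source, rule and institutions are made up.

```javascript
// Hypothetical: does a rule apply to an institution's sector?
function ruleApplies(rule, institution) {
  if (rule.applicable_sectors === 'ALL') return true;
  const sectors = rule.applicable_sectors.split(',').map((s) => s.trim());
  return sectors.includes(String(institution.sector_code));
}

// A scraper runs against an institution if at least one rule it feeds applies.
function institutionsForScraper(sourceId, rules, institutions) {
  const fed = rules.filter((rule) => rule.source_id === sourceId);
  return institutions.filter((inst) => fed.some((rule) => ruleApplies(rule, inst)));
}

const rules = [
  { rule_id: 'S2.1', source_id: 'BHRRC-RESOURCE-CENTRE', applicable_sectors: 'ALL' },
  { rule_id: 'X1.1', source_id: 'FINANCIALS-ONLY-SOURCE', applicable_sectors: '40' }, // made-up rule
];
const institutions = [ // made-up institutions
  { institution_id: 1, sector_code: '40' },
  { institution_id: 2, sector_code: '10' },
];
console.log(institutionsForScraper('BHRRC-RESOURCE-CENTRE', rules, institutions).length);  // 2
console.log(institutionsForScraper('FINANCIALS-ONLY-SOURCE', rules, institutions).length); // 1
```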
Step 6 — Manual run + verify
npm run scrape -- --source BHRRC-RESOURCE-CENTRE
Then inspect:
SELECT * FROM signal
WHERE run_id = (SELECT MAX(run_id) FROM scrape_run)
AND rule_id IN ('S2.1', 'S5.1', 'E7.1', 'G5.1');
Expected: one signal row per (rule, institution) for the institutions the scraper applies to.
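The "one row per (rule, institution)" expectation can be checked in code once the rows are loaded. A minimal sketch, assuming the rows have been read into plain objects with rule_id and institution_id properties; checkCoverage is a hypothetical helper and the sample rows are made up.

```javascript
// Hypothetical check: list every expected (rule, institution) pair that has
// no signal row, and every pair that has more than one.
function checkCoverage(signals, ruleIds, institutionIds) {
  const counts = new Map();
  for (const s of signals) {
    const key = `${s.rule_id}|${s.institution_id}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  const missing = [];
  const duplicated = [];
  for (const ruleId of ruleIds) {
    for (const institutionId of institutionIds) {
      const n = counts.get(`${ruleId}|${institutionId}`) || 0;
      if (n === 0) missing.push({ ruleId, institutionId });
      if (n > 1) duplicated.push({ ruleId, institutionId });
    }
  }
  return { missing, duplicated };
}

const sampleSignals = [ // made-up rows
  { rule_id: 'S2.1', institution_id: 1 },
  { rule_id: 'S2.1', institution_id: 1 },
  { rule_id: 'S5.1', institution_id: 1 },
];
const report = checkCoverage(sampleSignals, ['S2.1', 'S5.1'], [1, 2]);
console.log(report.missing.length, report.duplicated.length); // 2 1
```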
Re-score:
npm run score
Verify the scoring CLI output shows the source's contribution. For BHRRC specifically, coverage should rise (more covered rules) and the banks should start to differentiate (different allegation counts).
Step 7 — Idempotency check
Run the scraper twice in succession (different run_id each time):
npm run scrape -- --source BHRRC-RESOURCE-CENTRE
npm run scrape -- --source BHRRC-RESOURCE-CENTRE
Compare the two runs' signal rows for the same rule × institution. The
values should match (assuming the upstream hasn't changed). Different
signal_ids, same value. If they don't match, there's
non-determinism in the scraper — fix before committing.
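The comparison can be sketched as a small diff over the two runs' rows. It assumes the rows are loaded as plain objects with rule_id, institution_id and value properties; diffRuns is a hypothetical helper, and signal_id is deliberately ignored because it is expected to differ between runs.

```javascript
// Hypothetical diff of two runs: index each run by rule × institution and
// report pairs whose value differs or that exist in only one run.
function diffRuns(firstRun, secondRun) {
  const index = (rows) =>
    new Map(rows.map((r) => [`${r.rule_id}|${r.institution_id}`, r.value]));
  const a = index(firstRun);
  const b = index(secondRun);
  const differences = [];
  for (const key of new Set([...a.keys(), ...b.keys()])) {
    if (a.get(key) !== b.get(key)) {
      differences.push({ key, first: a.get(key), second: b.get(key) });
    }
  }
  return differences;
}

// Made-up rows: same value, different signal_id, so the runs are equivalent.
const run1 = [{ signal_id: 1, rule_id: 'S2.1', institution_id: 7, value: 3 }];
const run2 = [{ signal_id: 2, rule_id: 'S2.1', institution_id: 7, value: 3 }];
console.log(diffRuns(run1, run2)); // prints []
```

An empty result means the scraper was deterministic for this pair of runs; any entry names the rule × institution pair to investigate.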
Step 8 — Commit + push
Single commit, conventional message:
git add src/db/migrations/NNN_bhrrc.sql \
src/scrapers/bhrrc.js \
src/scrapers/run.js \
data/seed/bhrrc-pilot.json # if seed-backed
git commit -m "feat(scraper): BHRRC Resource Centre scraper
Migration NNN registers BHRRC-RESOURCE-CENTRE source and rules
S2.1, S5.1, E7.1, G5.1. Scraper at src/scrapers/bhrrc.js.
Verified: 8 signals written per run, pilot banks now show
differentiated allegation counts (Barclays N, NatWest N, ...).
Per ADR-0003."
git push origin main
Step 9 — Handoff
Post a CC-on-VM handoff to #esg-screening per the
handoff template, including:
- What landed (commit hash, migration number, scraper module)
- Pilot results (signal counts, expected differentiation observed)
- Any surprises (LEI corrections, upstream format changes, match-name issues — all the per-source gotcha patterns)
- Whether the source is now live · healthy or has issues
Anti-patterns
- Adding a scraper without an ADR for a structural choice. Adding BHRRC needs no ADR — it's an instance of existing patterns. Adding a corporate-PDF parser does need an ADR — it's structurally new.
- Hand-coding LEIs. Always verify via GLEIF. Two LEIs in the PRB seed were wrong because they were typed from memory.
- Skipping the idempotency check. Non-deterministic scrapers cause spurious "score changes" between runs that aren't real changes. Always run twice and diff.
- Returning silent no-data on scraper failure. If the scraper cannot fetch the upstream, throw, and let error_count on the scrape_run increment. Silent successes mask broken scrapers.
- Hardcoding institution data in scraper code. Per-institution config goes in institution.scraper_config_json, not in the scraper module. Match names, alternate IDs, exclusion flags: all live with the institution.
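To illustrate the "throw, don't return silent no-data" point, here is a small guard that a scraper could call on every upstream response. It is a sketch only: UpstreamFetchError and requireUpstreamBody are hypothetical names, and how the runner turns the exception into an error_count increment is not shown in this runbook.

```javascript
// Hypothetical error type so the runner can tell fetch failures apart.
class UpstreamFetchError extends Error {
  constructor(sourceId, reason) {
    super(`${sourceId}: upstream fetch failed (${reason})`);
    this.name = 'UpstreamFetchError';
    this.sourceId = sourceId;
  }
}

// Hypothetical guard: turn a failed or empty upstream response into an
// exception so the failure is counted instead of looking like "no data".
function requireUpstreamBody(sourceId, status, body) {
  if (status < 200 || status >= 300) {
    throw new UpstreamFetchError(sourceId, `HTTP ${status}`);
  }
  if (typeof body !== 'string' || body.trim() === '') {
    throw new UpstreamFetchError(sourceId, 'empty body');
  }
  return body;
}

try {
  requireUpstreamBody('BHRRC-RESOURCE-CENTRE', 503, '');
} catch (err) {
  console.log(err.message); // BHRRC-RESOURCE-CENTRE: upstream fetch failed (HTTP 503)
}
```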
What about scraping a corporate website / PDF?
Per the source register, IFRS-S2-CORP (corporate sustainability
disclosure parsing) is the heavy-engineering workstream. PDF parsing is
its own challenge — multi-page documents, mixed text/tables, varying
quality of OCR. Don't follow this runbook for it — it needs its own
ADR and design conversation. See ADR-0003 priority-8 note.