Add a Scraper

Runbook for adding a new scraper. Follow these steps in order. Each step ends with a verification check; pass it before moving on.

For the source register, see Register. For source-specific gotchas, see Health.

This runbook assumes a CC-on-VM cycle. Steps 1–2 may be done in a chat session ahead of time as artefact preparation; steps 3+ require live schema and module access on the VM.

Decision: live scrape vs seed-backed

The first question is whether the source can be scraped live. Some upstream pages don't permit live fetch (AJAX-loaded, behind auth, rate-limited). For those, the scraper reads from a versioned JSON seed file at data/seed/<source>.json instead.

Rule of thumb:

Live scrape if the upstream is a static HTML page, a structured download (CSV / Excel / JSON), an RSS feed, or an API.
Seed-backed if the upstream is AJAX-loaded WordPress, requires auth, or changes infrequently (≤1× per month) and a manual update is cheaper than reverse-engineering the AJAX.

When in doubt, try live first with a quick curl of the URL. If the returned HTML contains the data, scrape live. If not, go seed-backed.

Set signal_source.is_seed_backed = 1 when seed-backed.
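The rule of thumb above can be written down as a small decision function. This is a minimal sketch, not existing project code: the `upstream` descriptor and its field names (`kind`, `requiresAuth`, `ajaxLoaded`, `rateLimited`) are hypothetical, chosen only to mirror the criteria listed in this section.

```javascript
// Hypothetical helper: encodes the live-vs-seed rule of thumb.
// The fields on `upstream` are illustrative assumptions, not an existing schema.
const LIVE_KINDS = new Set(['static_html', 'download', 'rss', 'api']);

function chooseAccessMode(upstream) {
    // Anything that blocks a plain fetch forces the seed path.
    if (upstream.requiresAuth || upstream.ajaxLoaded || upstream.rateLimited) {
        return 'seed';
    }
    // Fetchable, structured upstreams are scraped live.
    if (LIVE_KINDS.has(upstream.kind)) {
        return 'live';
    }
    // When in doubt, try live first; the curl check in Step 2 has the final say.
    return 'live';
}

// Maps the decision onto the signal_source.is_seed_backed flag (1 or 0).
function isSeedBackedFlag(upstream) {
    return chooseAccessMode(upstream) === 'seed' ? 1 : 0;
}
```

The function only captures the cases that block a live fetch. The "changes infrequently and a manual update is cheaper" case remains a judgment call and is deliberately left out.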

Step 1 — Define the source

Decide and document:

Source code (SCREAMING-KEBAB-CASE, e.g. BHRRC-RESOURCE-CENTRE)
Display name (human-readable, e.g. "Business & Human Rights Resource Centre")
Source type (one of membership_list / regulatory_register / ngo_database / news / corporate_disclosure / commercial_reports)
Access mechanism (one of free_html / free_api / free_pdf / free_excel / rss / seed_backed / paid_subscription / mixed)
Refresh cadence (one of Daily / Weekly / Monthly / Quarterly / Annual — these are the CHECK constraint values)
External URL (the canonical upstream link)
Rules it feeds (list of rule IDs)

If feeding rules that don't yet exist, define those too. New rules may need a new sub-criterion, new pillar weighting, etc. — that's an ADR-grade decision if it's structural, just a migration if it's incremental.

Artefact: a one-page note or ADR draft naming all the above. Hand to CC-on-VM.
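Before handing the note over, it can help to check it mechanically against the allowed values listed in this step. The sketch below is an assumption-laden illustration: the validator and the property names on `def` (`sourceCode`, `displayName`, and so on) are hypothetical, while the enumerated values are copied from the list above.

```javascript
// Allowed values copied from the Step 1 list.
const SOURCE_TYPES = ['membership_list', 'regulatory_register', 'ngo_database',
    'news', 'corporate_disclosure', 'commercial_reports'];
const ACCESS_MECHANISMS = ['free_html', 'free_api', 'free_pdf', 'free_excel',
    'rss', 'seed_backed', 'paid_subscription', 'mixed'];
const REFRESH_CADENCES = ['Daily', 'Weekly', 'Monthly', 'Quarterly', 'Annual'];

// SCREAMING-KEBAB-CASE: upper-case alphanumeric words joined by single hyphens.
const SOURCE_CODE_PATTERN = /^[A-Z0-9]+(-[A-Z0-9]+)*$/;

// Hypothetical validator; the property names on `def` are illustrative.
// Returns the names of the fields that fail; an empty list means the note is complete.
function validateSourceDefinition(def) {
    const problems = [];
    if (!SOURCE_CODE_PATTERN.test(def.sourceCode || '')) problems.push('sourceCode');
    if (!def.displayName) problems.push('displayName');
    if (!SOURCE_TYPES.includes(def.sourceType)) problems.push('sourceType');
    if (!ACCESS_MECHANISMS.includes(def.accessMechanism)) problems.push('accessMechanism');
    if (!REFRESH_CADENCES.includes(def.refreshCadence)) problems.push('refreshCadence');
    if (!/^https?:\/\//.test(def.externalUrl || '')) problems.push('externalUrl');
    if (!Array.isArray(def.ruleIds) || def.ruleIds.length === 0) problems.push('ruleIds');
    return problems;
}
```

Note that the refresh cadence check is case-sensitive on purpose, since the CHECK constraint values are capitalised.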

Step 2 — Verify upstream

Before writing code, prove the data is reachable:

# For an HTML page
curl -s https://example.com/members | head -200

# For an Excel/CSV download
curl -L -o /tmp/source.xlsx https://example.com/data.xlsx
extract-text /tmp/source.xlsx | head -100

# For an RSS feed
curl -s https://example.com/rss | head -100

Confirm the data is present in the fetched bytes. If it isn't — if the data is loaded via JavaScript — switch to seed-backed.
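The "is the data in the fetched bytes" check can also be scripted once the body is in hand. A minimal sketch, assuming you already know a few strings that must appear in real data (for example the names of two or three pilot institutions); `bodyContainsData` is a hypothetical helper, not part of the codebase.

```javascript
// Hypothetical check: does the raw response body contain the values we expect to scrape?
// `expectedMarkers` are strings known to appear in the real data.
function bodyContainsData(body, expectedMarkers) {
    const haystack = body.toLowerCase();
    // Collect every marker that is absent so the caller can see what was missed.
    const missing = expectedMarkers.filter(
        (marker) => !haystack.includes(marker.toLowerCase())
    );
    return { ok: missing.length === 0, missing };
}
```

A body that contains only an empty container element and script tags will fail this check, which is the usual sign that the page is rendered client-side.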

For seed-backed, manually compile the seed JSON. Use the format:

{
  "source": "BHRRC-RESOURCE-CENTRE",
  "fetched_at": "2026-05-19T12:00:00Z",
  "fetched_by": "rob",
  "data": [
    {
      "lei": "213800LBQA1Y9L22JB70",
      "name": "Barclays PLC",
      "..."
    }
  ]
}

Verify LEIs via GLEIF before committing the seed. Do not trust chat-side LEI guesses.
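A local pre-check can catch typos before the GLEIF lookup. The sketch below validates the seed structure shown above and the LEI check digits (ISO 17442 uses the ISO 7064 MOD 97-10 scheme). The function names are hypothetical. A passing checksum only shows the LEI is well-formed, not that it belongs to the named entity, so the GLEIF verification is still required.

```javascript
// ISO 17442 check: map letters to 10..35, then the full number mod 97 must equal 1.
function hasValidLeiChecksum(lei) {
    // 18 alphanumeric characters followed by 2 numeric check digits.
    if (!/^[0-9A-Z]{18}[0-9]{2}$/.test(lei)) return false;
    let remainder = 0;
    for (const ch of lei) {
        const value = parseInt(ch, 36);          // '0'..'9' -> 0..9, 'A'..'Z' -> 10..35
        const shift = value >= 10 ? 100 : 10;    // letters expand to two decimal digits
        remainder = (remainder * shift + value) % 97;
    }
    return remainder === 1;
}

// Structural check of a parsed seed object against the format shown above.
// Returns a list of problems; an empty list means the seed looks well-formed.
function validateSeed(seed) {
    const problems = [];
    for (const key of ['source', 'fetched_at', 'fetched_by']) {
        if (typeof seed[key] !== 'string' || seed[key] === '') problems.push(`missing ${key}`);
    }
    if (Number.isNaN(Date.parse(seed.fetched_at))) problems.push('fetched_at is not a date');
    if (!Array.isArray(seed.data)) {
        problems.push('data is not an array');
        return problems;
    }
    seed.data.forEach((row, i) => {
        if (!hasValidLeiChecksum(String(row.lei || ''))) problems.push(`data[${i}]: bad LEI`);
        if (!row.name) problems.push(`data[${i}]: missing name`);
    });
    return problems;
}
```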

Step 3 — Migration: register the source and rules

New migration NNN_<source_slug>.sql. It registers the source, registers any new rules, and writes an audit_log entry:

-- Register the source
INSERT INTO signal_source (
    source_id, name, source_type, refresh_cadence,
    is_seed_backed, base_url, ...
) VALUES (
    'BHRRC-RESOURCE-CENTRE', 'Business & Human Rights Resource Centre',
    'ngo_database', 'Weekly',
    0, 'https://www.business-humanrights.org/en/companies/...', ...
);

-- Register any new rules
INSERT INTO rule (
    rule_id, pillar, sub_criterion, description, source_id,
    applicable_sectors, rule_weight, ...
) VALUES
    ('S2.1', 'S', 'S2', 'Human rights allegation count',
     'BHRRC-RESOURCE-CENTRE', 'ALL', 0.4, ...);

-- Record the change in the audit log
INSERT INTO audit_log (actor, entity, entity_id, action, after_json)
VALUES ('migration', 'signal_source', 'BHRRC-RESOURCE-CENTRE',
        'insert', '{"reason": "ADR-NNNN"}');

Note: SQL drafted in a chat session may use the wrong column names, because the chat side has no live schema access. Expect to reconcile the draft against the real schema in CC-on-VM. See Session protocol ("CC writes code, chat writes spec").

Apply the migration:

node src/db/migrate.js
sqlite3 data/esg.db "SELECT * FROM schema_migrations ORDER BY id DESC LIMIT 5;"

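Verify the row counts. The first query should return 1, the second should return the number of rules registered for this source, and PRAGMA foreign_key_check should return no rows: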

SELECT COUNT(*) FROM signal_source WHERE source_id = 'BHRRC-RESOURCE-CENTRE';
SELECT COUNT(*) FROM rule WHERE source_id = 'BHRRC-RESOURCE-CENTRE';
PRAGMA foreign_key_check;

Step 4 — Scraper module

Create src/scrapers/<source>.js. Pattern follows existing scrapers (nzba.js, sbti.js, prb.js):

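const { makeSignal, writeSignals } = require('./base');
// fetchOrLoadSeed and matchInstitution are helpers assumed to be defined
// elsewhere (in this module or a shared one); they are not shown here.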

async function scrape(institution, runId, sourceConfig) {
    // 1. Fetch upstream (or read seed file)
    const data = await fetchOrLoadSeed(sourceConfig);

    // 2. Match against this institution (LEI-first, name fallback)
    const match = matchInstitution(data, institution);

    // 3. Convert to signal(s)
    const signals = [];
    if (match) {
        signals.push(makeSignal({
            runId,
            ruleId: 'S2.1',
            institutionId: institution.institution_id,
            value: match.allegation_count,
            confidence: 1.0,
            sourceUrl: match.source_url,
            notes: `${match.allegation_count} allegations in BHRRC database`,
        }));
    } else {
        // No match. Decide whether the absence is confirmed (1.0) or ambiguous (0.5).
        signals.push(makeSignal({
            runId,
            ruleId: 'S2.1',
            institutionId: institution.institution_id,
            value: 0,
            confidence: 0.5,  // ambiguous: could be name-match issue
            sourceUrl: sourceConfig.base_url,
            notes: 'Not found in BHRRC',
        }));
    }

    return signals;
}

module.exports = { scrape, sourceId: 'BHRRC-RESOURCE-CENTRE' };

Key patterns:

LEI-first matching, name fallback — try LEI lookup if the source supports it; fall back to fuzzy name match if not.
Confidence semantics — 1.0 for definitive positive or negative, 0.5 for ambiguous absence (could be name-match issue).
Source URL on every signal — for audit trail per ADR-0003.
Notes — short human-readable explanation of what the signal means.
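The LEI-first, name-fallback pattern can be sketched as follows. This is an illustration under stated assumptions, not the matcher the existing scrapers use: `findMatch` and `normaliseName` are hypothetical, the `lei` and `name` properties on institutions and upstream rows are assumed, and `match_names` is an assumed key inside scraper_config_json. Normalised-name equality stands in here for a real fuzzy match.

```javascript
// Normalise a company name for comparison: lower-case, strip punctuation
// and common legal suffixes. The suffix list is illustrative, not exhaustive.
function normaliseName(name) {
    return name
        .toLowerCase()
        .replace(/&/g, ' and ')
        .replace(/[^a-z0-9 ]/g, ' ')
        .replace(/\b(plc|ltd|limited|group|holdings)\b/g, ' ')
        .replace(/\s+/g, ' ')
        .trim();
}

// Hypothetical matcher. Returns { row, matchedBy } or null when nothing matches.
function findMatch(rows, institution) {
    // 1. LEI-first: an exact LEI hit is definitive.
    const byLei = rows.find((row) => row.lei && row.lei === institution.lei);
    if (byLei) return { row: byLei, matchedBy: 'lei' };

    // 2. Name fallback: compare normalised names, including configured aliases.
    //    `match_names` is an assumed key in institution.scraper_config_json.
    const config = JSON.parse(institution.scraper_config_json || '{}');
    const candidates = [institution.name, ...(config.match_names || [])]
        .map(normaliseName);
    const byName = rows.find(
        (row) => row.name && candidates.includes(normaliseName(row.name))
    );
    if (byName) return { row: byName, matchedBy: 'name' };

    return null;
}
```

Returning `matchedBy` alongside the row lets the caller mention in the signal notes how the match was made, which is useful when reviewing an ambiguous absence later.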

Step 5 — Runner registration

Add the new scraper to src/scrapers/run.js:

const bhrrc = require('./bhrrc');

const SCRAPERS = [
    nzba,
    sbti,
    prb,
    bhrrc,  // new
];

Routing (which institutions this scraper applies to) is determined by the rules it feeds and their applicable_sectors. BHRRC feeds universal rules (S2, S5, E7, G5 — applicable_sectors='ALL'), so it runs against all 8 pilot institutions. BankTrack feeds financials-only rules ('40'), so it only runs against the 4 UK banks.
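The routing rule described above can be illustrated with a short sketch. It is not the logic in run.js: it assumes applicable_sectors is either 'ALL' or a comma-separated list of sector codes, and that each institution exposes its code as `sector_code`, which is an assumed column name.

```javascript
// Hypothetical routing check for a single rule and institution.
function ruleAppliesTo(rule, institution) {
    if (rule.applicable_sectors === 'ALL') return true;
    const codes = rule.applicable_sectors.split(',').map((code) => code.trim());
    return codes.includes(String(institution.sector_code));
}

// A scraper runs against an institution if any rule it feeds applies to it.
function institutionsFor(rules, institutions) {
    return institutions.filter(
        (inst) => rules.some((rule) => ruleAppliesTo(rule, inst))
    );
}
```

With one 'ALL' rule every institution is selected; with only '40' rules, only institutions whose sector code is 40 are.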

Step 6 — Manual run + verify

npm run scrape -- --source BHRRC-RESOURCE-CENTRE

Then inspect:

SELECT * FROM signal
WHERE run_id = (SELECT MAX(run_id) FROM scrape_run)
  AND rule_id IN ('S2.1', 'S5.1', 'E7.1', 'G5.1');

Expected: one signal row per (rule, institution) for the institutions the scraper applies to.
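The "one row per (rule, institution)" expectation can be checked in code once the query results are loaded. A minimal sketch, assuming the signal rows expose `rule_id` and `institution_id` (the latter is an assumed column name); `checkSignalCoverage` is a hypothetical helper.

```javascript
// Hypothetical post-run check: every (rule, institution) pair should have exactly one row.
// `signals` are rows from the latest run; only rule_id and institution_id are read.
function checkSignalCoverage(signals, ruleIds, institutionIds) {
    // Count rows per (rule, institution) pair.
    const counts = new Map();
    for (const s of signals) {
        const key = `${s.rule_id}|${s.institution_id}`;
        counts.set(key, (counts.get(key) || 0) + 1);
    }
    // Walk the expected grid and report pairs with zero or multiple rows.
    const missing = [];
    const duplicated = [];
    for (const ruleId of ruleIds) {
        for (const institutionId of institutionIds) {
            const n = counts.get(`${ruleId}|${institutionId}`) || 0;
            if (n === 0) missing.push([ruleId, institutionId]);
            if (n > 1) duplicated.push([ruleId, institutionId]);
        }
    }
    return { missing, duplicated };
}
```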

Re-score:

npm run score

Verify the scoring CLI output shows the source's contribution. For BHRRC specifically, coverage should rise (more covered rules) and the banks should start to differentiate (different allegation counts).

Step 7 — Idempotency check

Run the scraper twice in succession (different run_id each time):

npm run scrape -- --source BHRRC-RESOURCE-CENTRE
npm run scrape -- --source BHRRC-RESOURCE-CENTRE

Compare the two runs' signal rows for the same rule × institution. The values should match (assuming the upstream hasn't changed). Different signal_ids, same value. If they don't match, there's non-determinism in the scraper — fix before committing.
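The comparison can be done by keying each run's rows on rule × institution and diffing the values. A minimal sketch, assuming each row carries `rule_id`, `institution_id` and `value` (column names assumed); `diffRuns` is a hypothetical helper, and loading the rows for each run_id is left to the caller.

```javascript
// Hypothetical diff of two runs. Each argument is an array of signal rows for one run_id.
function diffRuns(firstRun, secondRun) {
    // Index a run by "rule|institution" so signal_id differences are ignored.
    const index = (rows) => new Map(
        rows.map((r) => [`${r.rule_id}|${r.institution_id}`, r.value])
    );
    const a = index(firstRun);
    const b = index(secondRun);
    const differences = [];
    // Visit every key seen in either run, so rows missing from one side are reported too.
    for (const key of new Set([...a.keys(), ...b.keys()])) {
        if (a.get(key) !== b.get(key)) {
            differences.push({ key, first: a.get(key), second: b.get(key) });
        }
    }
    return differences; // an empty array means the two runs agree
}
```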

Step 8 — Commit + push

Single commit, conventional message:

git add src/db/migrations/NNN_bhrrc.sql \
        src/scrapers/bhrrc.js \
        src/scrapers/run.js \
        data/seed/bhrrc-pilot.json  # if seed-backed
git commit -m "feat(scraper): BHRRC Resource Centre scraper

Migration NNN registers BHRRC-RESOURCE-CENTRE source and rules
S2.1, S5.1, E7.1, G5.1. Scraper at src/scrapers/bhrrc.js.

Verified: 8 signals written per run, pilot banks now show
differentiated allegation counts (Barclays N, NatWest N, ...).

Per ADR-0003."
git push origin main

Step 9 — Handoff

Post a CC-on-VM handoff to #esg-screening per the handoff template, including:

What landed (commit hash, migration number, scraper module)
Pilot results (signal counts, expected differentiation observed)
Any surprises (LEI corrections, upstream format changes, match-name issues — all the per-source gotcha patterns)
Whether the source is now live · healthy or has issues

Anti-patterns

Adding a scraper without an ADR for a structural choice. Adding BHRRC needs no ADR — it's an instance of existing patterns. Adding a corporate-PDF parser does need an ADR — it's structurally new.
Hand-coding LEIs. Always verify via GLEIF. Two LEIs in the PRB seed were wrong because they were typed from memory.
Skipping the idempotency check. Non-deterministic scrapers cause spurious "score changes" between runs that aren't real changes. Always run twice and diff.
Returning silent no-data on scraper failure. If the scraper cannot fetch the upstream, throw — let error_count on the scrape_run increment. Silent successes mask broken scrapers.
Hardcoding institution data in scraper code. Per-institution config goes in institution.scraper_config_json, not in the scraper module. Match names, alternate IDs, exclusion flags — all live with the institution.
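To illustrate the "throw, don't return silent no-data" point, a scraper can pass every fetch result through a guard like the one below. This is a sketch only: `requireUsableBody` is a hypothetical helper, and it assumes the runner turns a thrown error into an error_count increment on the scrape_run, as described above.

```javascript
// Hypothetical guard: turn a bad fetch result into an exception instead of "no data".
// `status` is the HTTP status code, `body` the response text, `url` is used in the message.
function requireUsableBody(status, body, url) {
    if (status < 200 || status >= 300) {
        throw new Error(`Upstream fetch failed for ${url}: HTTP ${status}`);
    }
    if (typeof body !== 'string' || body.trim() === '') {
        throw new Error(`Upstream returned an empty body for ${url}`);
    }
    return body;
}
```

Because the guard returns the body on success, it can wrap the fetch call directly, and a broken upstream can never be mistaken for an institution that is genuinely absent from the source.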

What about scraping a corporate website / PDF?

Per the source register, IFRS-S2-CORP (corporate sustainability disclosure parsing) is the heavy-engineering workstream. PDF parsing is its own challenge: multi-page documents, mixed text and tables, and varying OCR quality. Don't follow this runbook for it; it needs its own ADR and design conversation. See the priority-8 note in ADR-0003.