Data Model

The system's persistence layer. SQLite at data/esg.db on the VM. This page is the dense reference; the Architecture overview is the lighter-touch entry point.

For terminology disambiguation (source vs signal_source, etc.), see Glossary.

Layered shape

The DB has five layers, each writing to or reading from the layer above.

┌────────────────────────────────────────────────────────────────┐
│                                                                 │
│  1. Reference (seeded once from v0.4 workbook)                  │
│     rule, signal_source, peer_group, return_profile,            │
│     blend_weight, exclusion, gics_classification, config        │
│                                                                 │
│  2. Institutions (seeded then occasionally updated)             │
│     institution (LEI-keyed, with GICS classification)           │
│                                                                 │
│  3. Run lifecycle (one row per scraper execution)               │
│     scrape_run                                                  │
│                                                                 │
│  4. Signals (one row per run × institution × rule when found)   │
│     signal                                                      │
│                                                                 │
│  5. Scores (one row per institution × run at each grain)        │
│     score_sub_criterion, score_pillar,                          │
│     score_stage1_esg, score_stage2_composite (inert),           │
│     peer_distribution                                           │
│                                                                 │
└────────────────────────────────────────────────────────────────┘

Scrapers write to layer 4 (signals) only. Scoring engine reads layers 1–4 and writes layer 5. UI reads from all layers.

Reference data is append-only-via-migration. Direct UPDATEs against reference tables don't happen in normal operation; changes go through a new migration.

Reference tables (layer 1)

Seeded from the v0.4 workbook via migrations 002, 004, 008. Total ~150 rows of reference data.

`rule`

The atomic scoreable unit. 148 rows: 89 financial-sector (applicable_sectors='40') + 24 universal ('ALL') + 35 non-financial (specific GICS codes).

Key columns:

Column	Type	Notes
`rule_id`	TEXT PK	e.g. `E1.1`, `G3-PRB.1`, `UN-E1.1`
`pillar`	TEXT	`E` / `S` / `G`
`sub_criterion`	TEXT	e.g. `E1`, `G3`, `UN-E1`
`description`	TEXT	Short one-liner
`source_id`	TEXT FK → `signal_source`	The data source this rule reads from
`applicable_sectors`	TEXT	GICS code or `'ALL'`
`rule_weight`	REAL	Within sub-criterion
`confidence_tier`	TEXT	Tier label
`base_score`	INTEGER	5 for deduction rules (post-migration 008)
`notes`	TEXT	Free-text — deduction cap logic etc.

Aggregate rules (G3.1, G5.1) have source_id = NULL because they read other rules' signals rather than producing their own.

`signal_source`

The upstream data sources. ~20 rows post-migration 002, +12 from migration 004. (Migration 012 adds register-display columns per ADR-0003 — pending.)

Key columns:

Column	Type	Notes
`source_id`	TEXT PK	e.g. `NZBA-MEMBERS`, `SBTI-VALIDATED` (still `SBTI-DASHBOARD` until renamed)
`name`	TEXT	Human-readable display name
`source_type`	TEXT	`membership_list` / `regulatory_register` / `ngo_database` / etc.
`refresh_cadence`	TEXT	CHECK constraint: `'Daily'`/`'Weekly'`/`'Monthly'`/`'Quarterly'`/`'Annual'`
`is_seed_backed`	INTEGER	1 if served from a `data/seed/*.json` file rather than scraped live
`base_url`	TEXT	The canonical upstream URL

Pending columns (migration 012, per ADR-0003): type, access_mechanism, status, external_url, priority, feeds_rules.

`peer_group`, `return_profile`, `blend_weight`, `exclusion`

Reference catalogues from the workbook. 19/8/6/8 rows respectively.

blend_weight schema: (weight_group, dimension, weight) PK (weight_group, dimension). No institution_type or applies_to scoping — the existing pillar rows (E=0.40, S=0.30, G=0.30) apply universally. Non-financial-specific weights are a v0.6 question (see design note open question #2).

`gics_classification`

GICS taxonomy. 19 rows: 4 sectors + 5 industry groups + 5 industries + 5 sub-industries. Hand-classified at v0.5 seeding; not the full MSCI catalogue.

`config`

Tunable system parameters. 5 rows post-migration 003 including universe_confidence_threshold = 0.5.

Institutions (layer 2)

`institution`

The screened entities. 8 pilot rows: 4 financials + 4 non-financials.

Key columns:

Column	Type	Notes
`institution_id`	TEXT PK	LEI (holding company, not subsidiary)
`name`	TEXT	Display name
`institution_type`	TEXT	`financial` / `non_financial`
`gics_sector`, `gics_industry_group`, `gics_industry`, `gics_sub_industry`	TEXT	Full hierarchy
`peer_group_id`	TEXT FK
`return_profile_id`	TEXT FK
`scraper_config_json`	TEXT	Per-institution scraper config — Companies House numbers, alternate names, etc.
`modern_slavery_company_number`	TEXT	Companies House number for UK-MSA-REGISTRY lookups (operating subsidiary, not holding)

LEI verification is mandatory via GLEIF before seeding — chat-side LEI guesses cannot be trusted (see Session protocol).

Run lifecycle (layer 3)

`scrape_run`

One row per scraper execution. Created by the runner (src/scrapers/run.js).

Key columns:

Column	Type	Notes
`run_id`	INTEGER PK	Auto-incrementing. This is the canonical run identifier, not `scrape_run_id`.
`started_at`, `finished_at`	TEXT	ISO 8601 timestamps
`status`	TEXT	`running` / `success` / `failed`
`error_count`	INTEGER
`signals_written`	INTEGER

Naming pattern note: earlier chat-session drafts referenced scrape_run_id as a column name in FK references. The actual column is run_id. Migrations 003, 004, and others were reconciled to fix this.

Signals (layer 4)

`signal`

The raw output of scrapers. One row per (rule_id × institution_id × run_id) when the scraper found something for that combination.

Key columns:

Column	Type	Notes
`signal_id`	TEXT PK	Deterministic hash of (run_id, rule_id, institution_id)
`run_id`	INTEGER FK → `scrape_run`
`rule_id`	TEXT FK → `rule`
`institution_id`	TEXT FK → `institution`	LEI
`value`	TEXT or REAL	The actual finding — boolean, numeric, or short text
`confidence`	REAL	0.0–1.0. Default 1.0.
`source_url`	TEXT	Link to the source page/document where the signal was found
`observed_at`	TEXT	When the scrape happened
`notes`	TEXT	Free-form

EXISTS-based coverage: a rule is "covered" iff at least one signal row exists for it in a given (run × institution). Coverage = data collection presence, distinct from confidence which = data quality.

Scores (layer 5)

Four score tables at four grains. Written by the scoring engine (src/scoring/*.js), one row per institution × run at each grain.

`score_sub_criterion`

Per-rule per-run scores.

Key columns:

Column	Type	Notes
`run_id`	INTEGER FK
`institution_id`	TEXT FK
`rule_id`	TEXT FK
`score`	REAL	0–100 form
`raw_score`	INTEGER	0–5 form (post-migration 008)
`confidence`	REAL	Aggregated from underlying signal(s)
`coverage_pct`	REAL	1.0 if a signal exists, 0.0 otherwise (binary at this grain)

`score_pillar`

Per-pillar (E / S / G) per-institution per-run.

Key columns:

Column	Type	Notes
`run_id`	INTEGER FK
`institution_id`	TEXT FK
`pillar`	TEXT	`E` / `S` / `G`
`score`	REAL	Weighted average over applicable rules
`confidence`	REAL	Propagated
`coverage_pct`	REAL	Unweighted ratio of covered rules to applicable rules

Pending (migration 011, per ADR-0001): pillar_raw_v04, pillar_coverage_weighted, covered_rule_count, applicable_rule_count.

`score_stage1_esg`

Universal Stage 1 ESG composite. One row per institution per run.

Key columns:

Column	Type	Notes
`run_id`	INTEGER FK
`institution_id`	TEXT FK
`composite`	REAL	The ESG score (post migration 011, alias of `composite_coverage_weighted`)
`confidence`	REAL	Aggregated
`coverage_pct`	REAL	Across all applicable E+S+G rules
`peer_rank_pct`	REAL	0–100 percentile within peer group
`peer_count`	INTEGER	Size of resolved peer group

Pending (migration 011): composite_raw_v04, composite_coverage_weighted, coverage_pct_overall.

`score_stage2_composite`

Financials-only triangulation composite. Stage 1 ESG + Credit + Returns. (Inert; Stage 2 de-scoped 2026-05-21; columns retained for now.)

Currently ESG dimension populated; credit and returns columns hold placeholder/null values. Stage 2 has been de-scoped — see the backlog "Future ideas" section for the preserved rationale.

`peer_distribution`

Helper table for peer normalisation. Records the min/max/mean/stddev of each pillar within each resolved peer group per run.

Migration history

Chronological. Each is a single .sql file in src/db/migrations/.

#	What it added	Date
001	Init — base schema	(project start)
002	Seed six reference tables from v0.4 workbook	12 May
003	Two-stage schema (institution_type, GICS cols, applicability, confidence cols, stage 1/2 split, peer_distribution, config)	13 May
004	Non-financial rule catalogue (18 rules, 59 sub-rules) + 12 new signal sources	13 May
005	8 pilot institutions with hand-classified GICS	13 May
006	`is_seed_backed` column on `signal_source`; NZBA flagged	13 May
007	Fix IKEA `sbti_match_name` for SBTi Excel	13 May
008	`base_score` on rule, `raw_score` on `score_sub_criterion`, score floor/ceiling	13 May
009	`coverage_pct` on all four score tables (EXISTS-based)	13 May
010	PRB signatories — source, rule, seed	13 May
011	Coverage-weighted scoring (per ADR-0001)	pending
012	Source register columns (per ADR-0003)	pending

Naming conventions

Settling these here so they aren't re-derived per migration:

Table names: snake_case, singular (institution not institutions).
Column names: snake_case, full words preferred over abbreviations (institution_id not inst_id).
Primary keys: usually <table>_id (so signal_source.source_id, institution.institution_id), with run_id as the exception for scrape_run (legacy from migration 001).
Foreign keys: same name as the referenced primary key when possible (signal.rule_id references rule.rule_id).
Source identifiers: SCREAMING-KEBAB-CASE (NZBA-MEMBERS, BHRRC-RESOURCE-CENTRE).
Rule identifiers: pillar letter + sub-criterion number + dot + rule number, e.g. E1.1. New-source rules: pillar + sub-criterion + hyphen + source code, e.g. G3-PRB.1.

Retention

Currently: indefinite. Everything kept forever. No archive_signal or similar.

At the current scrape rate (8 institutions × 3 scrapers × weekly = ~1,200 signal rows per year), retention is not a problem in SQLite for the foreseeable future. Revisit if scraper count or institution count grows by an order of magnitude.

The audit reason for keeping everything is that score-change forensics require historical signal data — "why did Barclays' score change between run 12 and run 13?" needs both runs' signal rows.

What's NOT in the DB

Recording for completeness:

Raw scrape payloads. Scrapers parse upstream HTML/PDF/Excel into signal rows; the raw source content is not stored. Provenance is via signal.source_url. If a source URL goes dead, that historical signal cannot be reconstructed. Acceptable v1 trade-off; if it becomes a problem, a scrape_artefact table can be added.
User accounts. Cloudflare Access handles auth; no app-level user table.
Watchlist findings. Per ADR-0003 the watchlist is mentioned but the schema is pending. Expected as part of the BHRRC scraper cycle.
Themes (UI grouping). Per ADR-0005, themes are a UI-only config in src/config/rule-themes.js. Not stored in DB.

Data Model

Layered shape

Reference tables (layer 1)

rule

signal_source

peer_group, return_profile, blend_weight, exclusion

gics_classification

config

Institutions (layer 2)

institution

Run lifecycle (layer 3)

scrape_run

Signals (layer 4)

signal

Scores (layer 5)

score_sub_criterion

score_pillar

score_stage1_esg

score_stage2_composite

peer_distribution

Migration history

Naming conventions

Retention

What's NOT in the DB

`rule`

`signal_source`

`peer_group`, `return_profile`, `blend_weight`, `exclusion`

`gics_classification`

`config`

`institution`

`scrape_run`

`signal`

`score_sub_criterion`

`score_pillar`

`score_stage1_esg`

`score_stage2_composite`

`peer_distribution`