Data Model
The system's persistence layer. SQLite at data/esg.db on the VM. This
page is the dense reference; the Architecture overview
is the lighter-touch entry point.
For terminology disambiguation (source vs signal_source, etc.), see
Glossary.
Layered shape
The DB has five layers, each writing to or reading from the layer above.
┌────────────────────────────────────────────────────────────────┐
│ │
│ 1. Reference (seeded once from v0.4 workbook) │
│ rule, signal_source, peer_group, return_profile, │
│ blend_weight, exclusion, gics_classification, config │
│ │
│ 2. Institutions (seeded then occasionally updated) │
│ institution (LEI-keyed, with GICS classification) │
│ │
│ 3. Run lifecycle (one row per scraper execution) │
│ scrape_run │
│ │
│ 4. Signals (one row per run × institution × rule when found) │
│ signal │
│ │
│ 5. Scores (one row per institution × run at each grain) │
│ score_sub_criterion, score_pillar, │
│ score_stage1_esg, score_stage2_composite, │
│ peer_distribution │
│ │
└────────────────────────────────────────────────────────────────┘
Scrapers write to layer 4 (signals) only. Scoring engine reads layers 1–4 and writes layer 5. UI reads from all layers.
Reference data is append-only-via-migration. Direct UPDATEs against reference tables don't happen in normal operation; changes go through a new migration.
Reference tables (layer 1)
Seeded from the v0.4 workbook via migrations 002, 004, 008. Total ~150 rows of reference data.
rule
The atomic scoreable unit. 148 rows: 89 financial-sector (applicable_sectors='40')
+ 24 universal ('ALL') + 35 non-financial (specific GICS codes).
Key columns:
| Column | Type | Notes |
|---|---|---|
rule_id |
TEXT PK | e.g. E1.1, G3-PRB.1, UN-E1.1 |
pillar |
TEXT | E / S / G |
sub_criterion |
TEXT | e.g. E1, G3, UN-E1 |
description |
TEXT | Short one-liner |
source_id |
TEXT FK → signal_source |
The data source this rule reads from |
applicable_sectors |
TEXT | GICS code or 'ALL' |
rule_weight |
REAL | Within sub-criterion |
confidence_tier |
TEXT | Tier label |
base_score |
INTEGER | 5 for deduction rules (post-migration 008) |
notes |
TEXT | Free-text — deduction cap logic etc. |
Aggregate rules (G3.1, G5.1) have source_id = NULL because they
read other rules' signals rather than producing their own.
signal_source
The upstream data sources. ~20 rows post-migration 002, +12 from migration 004. (Migration 012 adds register-display columns per ADR-0003 — pending.)
Key columns:
| Column | Type | Notes |
|---|---|---|
source_id |
TEXT PK | e.g. NZBA-MEMBERS, SBTI-VALIDATED (still SBTI-DASHBOARD until renamed) |
name |
TEXT | Human-readable display name |
source_type |
TEXT | membership_list / regulatory_register / ngo_database / etc. |
refresh_cadence |
TEXT | CHECK constraint: 'Daily'/'Weekly'/'Monthly'/'Quarterly'/'Annual' |
is_seed_backed |
INTEGER | 1 if served from a data/seed/*.json file rather than scraped live |
base_url |
TEXT | The canonical upstream URL |
Pending columns (migration 012, per ADR-0003):
type, access_mechanism, status, external_url, priority, feeds_rules.
peer_group, return_profile, blend_weight, exclusion
Reference catalogues from the workbook. 19/8/6/8 rows respectively.
blend_weight schema: (weight_group, dimension, weight) PK
(weight_group, dimension). No institution_type or applies_to
scoping — the existing pillar rows (E=0.40, S=0.30, G=0.30) apply
universally. Non-financial-specific weights are a v0.6 question (see
design note open question #2).
gics_classification
GICS taxonomy. 19 rows: 4 sectors + 5 industry groups + 5 industries + 5 sub-industries. Hand-classified at v0.5 seeding; not the full MSCI catalogue.
config
Tunable system parameters. 5 rows post-migration 003 including
universe_confidence_threshold = 0.5.
Institutions (layer 2)
institution
The screened entities. 8 pilot rows: 4 financials + 4 non-financials.
Key columns:
| Column | Type | Notes |
|---|---|---|
institution_id |
TEXT PK | LEI (holding company, not subsidiary) |
name |
TEXT | Display name |
institution_type |
TEXT | financial / non_financial |
gics_sector, gics_industry_group, gics_industry, gics_sub_industry |
TEXT | Full hierarchy |
peer_group_id |
TEXT FK | |
return_profile_id |
TEXT FK | |
scraper_config_json |
TEXT | Per-institution scraper config — Companies House numbers, alternate names, etc. |
modern_slavery_company_number |
TEXT | Companies House number for UK-MSA-REGISTRY lookups (operating subsidiary, not holding) |
LEI verification is mandatory via GLEIF before seeding — chat-side LEI guesses cannot be trusted (see Session protocol).
Run lifecycle (layer 3)
scrape_run
One row per scraper execution. Created by the runner (src/scrapers/run.js).
Key columns:
| Column | Type | Notes |
|---|---|---|
run_id |
INTEGER PK | Auto-incrementing. This is the canonical run identifier, not scrape_run_id. |
started_at, finished_at |
TEXT | ISO 8601 timestamps |
status |
TEXT | running / success / failed |
error_count |
INTEGER | |
signals_written |
INTEGER |
Naming pattern note: earlier chat-session drafts referenced
scrape_run_id as a column name in FK references. The actual column is
run_id. Migrations 003, 004, and others were reconciled to fix this.
Signals (layer 4)
signal
The raw output of scrapers. One row per (rule_id × institution_id × run_id) when the scraper found something for that combination.
Key columns:
| Column | Type | Notes |
|---|---|---|
signal_id |
TEXT PK | Deterministic hash of (run_id, rule_id, institution_id) |
run_id |
INTEGER FK → scrape_run |
|
rule_id |
TEXT FK → rule |
|
institution_id |
TEXT FK → institution |
LEI |
value |
TEXT or REAL | The actual finding — boolean, numeric, or short text |
confidence |
REAL | 0.0–1.0. Default 1.0. |
source_url |
TEXT | Link to the source page/document where the signal was found |
observed_at |
TEXT | When the scrape happened |
notes |
TEXT | Free-form |
EXISTS-based coverage: a rule is "covered" iff at least one signal
row exists for it in a given (run × institution). Coverage = data
collection presence, distinct from confidence which = data quality.
Scores (layer 5)
Four score tables at four grains. Written by the scoring engine
(src/scoring/*.js), one row per institution × run at each grain.
score_sub_criterion
Per-rule per-run scores.
Key columns:
| Column | Type | Notes |
|---|---|---|
run_id |
INTEGER FK | |
institution_id |
TEXT FK | |
rule_id |
TEXT FK | |
score |
REAL | 0–100 form |
raw_score |
INTEGER | 0–5 form (post-migration 008) |
confidence |
REAL | Aggregated from underlying signal(s) |
coverage_pct |
REAL | 1.0 if a signal exists, 0.0 otherwise (binary at this grain) |
score_pillar
Per-pillar (E / S / G) per-institution per-run.
Key columns:
| Column | Type | Notes |
|---|---|---|
run_id |
INTEGER FK | |
institution_id |
TEXT FK | |
pillar |
TEXT | E / S / G |
score |
REAL | Weighted average over applicable rules |
confidence |
REAL | Propagated |
coverage_pct |
REAL | Unweighted ratio of covered rules to applicable rules |
Pending (migration 011, per ADR-0001): pillar_raw_v04,
pillar_coverage_weighted, covered_rule_count, applicable_rule_count.
score_stage1_esg
Universal Stage 1 ESG composite. One row per institution per run.
Key columns:
| Column | Type | Notes |
|---|---|---|
run_id |
INTEGER FK | |
institution_id |
TEXT FK | |
composite |
REAL | The ESG score (post migration 011, alias of composite_coverage_weighted) |
confidence |
REAL | Aggregated |
coverage_pct |
REAL | Across all applicable E+S+G rules |
peer_rank_pct |
REAL | 0–100 percentile within peer group |
peer_count |
INTEGER | Size of resolved peer group |
Pending (migration 011): composite_raw_v04,
composite_coverage_weighted, coverage_pct_overall.
score_stage2_composite
Financials-only triangulation composite. Stage 1 ESG + Credit + Returns.
Currently ESG dimension populated; credit and returns columns hold placeholder/null values awaiting their respective data sources.
peer_distribution
Helper table for peer normalisation. Records the min/max/mean/stddev of each pillar within each resolved peer group per run.
Migration history
Chronological. Each is a single .sql file in src/db/migrations/.
| # | What it added | Date |
|---|---|---|
| 001 | Init — base schema | (project start) |
| 002 | Seed six reference tables from v0.4 workbook | 12 May |
| 003 | Two-stage schema (institution_type, GICS cols, applicability, confidence cols, stage 1/2 split, peer_distribution, config) | 13 May |
| 004 | Non-financial rule catalogue (18 rules, 59 sub-rules) + 12 new signal sources | 13 May |
| 005 | 8 pilot institutions with hand-classified GICS | 13 May |
| 006 | is_seed_backed column on signal_source; NZBA flagged |
13 May |
| 007 | Fix IKEA sbti_match_name for SBTi Excel |
13 May |
| 008 | base_score on rule, raw_score on score_sub_criterion, score floor/ceiling |
13 May |
| 009 | coverage_pct on all four score tables (EXISTS-based) |
13 May |
| 010 | PRB signatories — source, rule, seed | 13 May |
| 011 | Coverage-weighted scoring (per ADR-0001) | pending |
| 012 | Source register columns (per ADR-0003) | pending |
Naming conventions
Settling these here so they aren't re-derived per migration:
- Table names:
snake_case, singular (institutionnotinstitutions). - Column names:
snake_case, full words preferred over abbreviations (institution_idnotinst_id). - Primary keys: usually
<table>_id(sosignal_source.source_id,institution.institution_id), withrun_idas the exception forscrape_run(legacy from migration 001). - Foreign keys: same name as the referenced primary key when
possible (
signal.rule_idreferencesrule.rule_id). - Source identifiers:
SCREAMING-KEBAB-CASE(NZBA-MEMBERS,BHRRC-RESOURCE-CENTRE). - Rule identifiers: pillar letter + sub-criterion number + dot + rule
number, e.g.
E1.1. New-source rules: pillar + sub-criterion + hyphen + source code, e.g.G3-PRB.1.
Retention
Currently: indefinite. Everything kept forever. No archive_signal or
similar.
At the current scrape rate (8 institutions × 3 scrapers × weekly = ~1,200 signal rows per year), retention is not a problem in SQLite for the foreseeable future. Revisit if scraper count or institution count grows by an order of magnitude.
The audit reason for keeping everything is that score-change forensics require historical signal data — "why did Barclays' score change between run 12 and run 13?" needs both runs' signal rows.
What's NOT in the DB
Recording for completeness:
- Raw scrape payloads. Scrapers parse upstream HTML/PDF/Excel into
signalrows; the raw source content is not stored. Provenance is viasignal.source_url. If a source URL goes dead, that historical signal cannot be reconstructed. Acceptable v1 trade-off; if it becomes a problem, ascrape_artefacttable can be added. - User accounts. Cloudflare Access handles auth; no app-level user table.
- Watchlist findings. Per ADR-0003 the watchlist is mentioned but the schema is pending. Expected as part of the BHRRC scraper cycle.
- Themes (UI grouping). Per ADR-0005, themes are a UI-only config
in
src/config/rule-themes.js. Not stored in DB.