Skip to content

Data Model

The system's persistence layer. SQLite at data/esg.db on the VM. This page is the dense reference; the Architecture overview is the lighter-touch entry point.

For terminology disambiguation (source vs signal_source, etc.), see Glossary.


Layered shape

The DB has five layers, each writing to or reading from the layer above.

┌────────────────────────────────────────────────────────────────┐
│                                                                 │
│  1. Reference (seeded once from v0.4 workbook)                  │
│     rule, signal_source, peer_group, return_profile,            │
│     blend_weight, exclusion, gics_classification, config        │
│                                                                 │
│  2. Institutions (seeded then occasionally updated)             │
│     institution (LEI-keyed, with GICS classification)           │
│                                                                 │
│  3. Run lifecycle (one row per scraper execution)               │
│     scrape_run                                                  │
│                                                                 │
│  4. Signals (one row per run × institution × rule when found)   │
│     signal                                                      │
│                                                                 │
│  5. Scores (one row per institution × run at each grain)        │
│     score_sub_criterion, score_pillar,                          │
│     score_stage1_esg, score_stage2_composite,                   │
│     peer_distribution                                           │
│                                                                 │
└────────────────────────────────────────────────────────────────┘

Scrapers write to layer 4 (signals) only. Scoring engine reads layers 1–4 and writes layer 5. UI reads from all layers.

Reference data is append-only-via-migration. Direct UPDATEs against reference tables don't happen in normal operation; changes go through a new migration.


Reference tables (layer 1)

Seeded from the v0.4 workbook via migrations 002, 004, 008. Total ~150 rows of reference data.

rule

The atomic scoreable unit. 148 rows: 89 financial-sector (applicable_sectors='40') + 24 universal ('ALL') + 35 non-financial (specific GICS codes).

Key columns:

Column Type Notes
rule_id TEXT PK e.g. E1.1, G3-PRB.1, UN-E1.1
pillar TEXT E / S / G
sub_criterion TEXT e.g. E1, G3, UN-E1
description TEXT Short one-liner
source_id TEXT FK → signal_source The data source this rule reads from
applicable_sectors TEXT GICS code or 'ALL'
rule_weight REAL Within sub-criterion
confidence_tier TEXT Tier label
base_score INTEGER 5 for deduction rules (post-migration 008)
notes TEXT Free-text — deduction cap logic etc.

Aggregate rules (G3.1, G5.1) have source_id = NULL because they read other rules' signals rather than producing their own.

signal_source

The upstream data sources. ~20 rows post-migration 002, +12 from migration 004. (Migration 012 adds register-display columns per ADR-0003 — pending.)

Key columns:

Column Type Notes
source_id TEXT PK e.g. NZBA-MEMBERS, SBTI-VALIDATED (still SBTI-DASHBOARD until renamed)
name TEXT Human-readable display name
source_type TEXT membership_list / regulatory_register / ngo_database / etc.
refresh_cadence TEXT CHECK constraint: 'Daily'/'Weekly'/'Monthly'/'Quarterly'/'Annual'
is_seed_backed INTEGER 1 if served from a data/seed/*.json file rather than scraped live
base_url TEXT The canonical upstream URL

Pending columns (migration 012, per ADR-0003): type, access_mechanism, status, external_url, priority, feeds_rules.

peer_group, return_profile, blend_weight, exclusion

Reference catalogues from the workbook. 19/8/6/8 rows respectively.

blend_weight schema: (weight_group, dimension, weight) PK (weight_group, dimension). No institution_type or applies_to scoping — the existing pillar rows (E=0.40, S=0.30, G=0.30) apply universally. Non-financial-specific weights are a v0.6 question (see design note open question #2).

gics_classification

GICS taxonomy. 19 rows: 4 sectors + 5 industry groups + 5 industries + 5 sub-industries. Hand-classified at v0.5 seeding; not the full MSCI catalogue.

config

Tunable system parameters. 5 rows post-migration 003 including universe_confidence_threshold = 0.5.


Institutions (layer 2)

institution

The screened entities. 8 pilot rows: 4 financials + 4 non-financials.

Key columns:

Column Type Notes
institution_id TEXT PK LEI (holding company, not subsidiary)
name TEXT Display name
institution_type TEXT financial / non_financial
gics_sector, gics_industry_group, gics_industry, gics_sub_industry TEXT Full hierarchy
peer_group_id TEXT FK
return_profile_id TEXT FK
scraper_config_json TEXT Per-institution scraper config — Companies House numbers, alternate names, etc.
modern_slavery_company_number TEXT Companies House number for UK-MSA-REGISTRY lookups (operating subsidiary, not holding)

LEI verification is mandatory via GLEIF before seeding — chat-side LEI guesses cannot be trusted (see Session protocol).


Run lifecycle (layer 3)

scrape_run

One row per scraper execution. Created by the runner (src/scrapers/run.js).

Key columns:

Column Type Notes
run_id INTEGER PK Auto-incrementing. This is the canonical run identifier, not scrape_run_id.
started_at, finished_at TEXT ISO 8601 timestamps
status TEXT running / success / failed
error_count INTEGER
signals_written INTEGER

Naming pattern note: earlier chat-session drafts referenced scrape_run_id as a column name in FK references. The actual column is run_id. Migrations 003, 004, and others were reconciled to fix this.


Signals (layer 4)

signal

The raw output of scrapers. One row per (rule_id × institution_id × run_id) when the scraper found something for that combination.

Key columns:

Column Type Notes
signal_id TEXT PK Deterministic hash of (run_id, rule_id, institution_id)
run_id INTEGER FK → scrape_run
rule_id TEXT FK → rule
institution_id TEXT FK → institution LEI
value TEXT or REAL The actual finding — boolean, numeric, or short text
confidence REAL 0.0–1.0. Default 1.0.
source_url TEXT Link to the source page/document where the signal was found
observed_at TEXT When the scrape happened
notes TEXT Free-form

EXISTS-based coverage: a rule is "covered" iff at least one signal row exists for it in a given (run × institution). Coverage = data collection presence, distinct from confidence which = data quality.


Scores (layer 5)

Four score tables at four grains. Written by the scoring engine (src/scoring/*.js), one row per institution × run at each grain.

score_sub_criterion

Per-rule per-run scores.

Key columns:

Column Type Notes
run_id INTEGER FK
institution_id TEXT FK
rule_id TEXT FK
score REAL 0–100 form
raw_score INTEGER 0–5 form (post-migration 008)
confidence REAL Aggregated from underlying signal(s)
coverage_pct REAL 1.0 if a signal exists, 0.0 otherwise (binary at this grain)

score_pillar

Per-pillar (E / S / G) per-institution per-run.

Key columns:

Column Type Notes
run_id INTEGER FK
institution_id TEXT FK
pillar TEXT E / S / G
score REAL Weighted average over applicable rules
confidence REAL Propagated
coverage_pct REAL Unweighted ratio of covered rules to applicable rules

Pending (migration 011, per ADR-0001): pillar_raw_v04, pillar_coverage_weighted, covered_rule_count, applicable_rule_count.

score_stage1_esg

Universal Stage 1 ESG composite. One row per institution per run.

Key columns:

Column Type Notes
run_id INTEGER FK
institution_id TEXT FK
composite REAL The ESG score (post migration 011, alias of composite_coverage_weighted)
confidence REAL Aggregated
coverage_pct REAL Across all applicable E+S+G rules
peer_rank_pct REAL 0–100 percentile within peer group
peer_count INTEGER Size of resolved peer group

Pending (migration 011): composite_raw_v04, composite_coverage_weighted, coverage_pct_overall.

score_stage2_composite

Financials-only triangulation composite. Stage 1 ESG + Credit + Returns.

Currently ESG dimension populated; credit and returns columns hold placeholder/null values awaiting their respective data sources.

peer_distribution

Helper table for peer normalisation. Records the min/max/mean/stddev of each pillar within each resolved peer group per run.


Migration history

Chronological. Each is a single .sql file in src/db/migrations/.

# What it added Date
001 Init — base schema (project start)
002 Seed six reference tables from v0.4 workbook 12 May
003 Two-stage schema (institution_type, GICS cols, applicability, confidence cols, stage 1/2 split, peer_distribution, config) 13 May
004 Non-financial rule catalogue (18 rules, 59 sub-rules) + 12 new signal sources 13 May
005 8 pilot institutions with hand-classified GICS 13 May
006 is_seed_backed column on signal_source; NZBA flagged 13 May
007 Fix IKEA sbti_match_name for SBTi Excel 13 May
008 base_score on rule, raw_score on score_sub_criterion, score floor/ceiling 13 May
009 coverage_pct on all four score tables (EXISTS-based) 13 May
010 PRB signatories — source, rule, seed 13 May
011 Coverage-weighted scoring (per ADR-0001) pending
012 Source register columns (per ADR-0003) pending

Naming conventions

Settling these here so they aren't re-derived per migration:

  • Table names: snake_case, singular (institution not institutions).
  • Column names: snake_case, full words preferred over abbreviations (institution_id not inst_id).
  • Primary keys: usually <table>_id (so signal_source.source_id, institution.institution_id), with run_id as the exception for scrape_run (legacy from migration 001).
  • Foreign keys: same name as the referenced primary key when possible (signal.rule_id references rule.rule_id).
  • Source identifiers: SCREAMING-KEBAB-CASE (NZBA-MEMBERS, BHRRC-RESOURCE-CENTRE).
  • Rule identifiers: pillar letter + sub-criterion number + dot + rule number, e.g. E1.1. New-source rules: pillar + sub-criterion + hyphen + source code, e.g. G3-PRB.1.

Retention

Currently: indefinite. Everything kept forever. No archive_signal or similar.

At the current scrape rate (8 institutions × 3 scrapers × weekly = ~1,200 signal rows per year), retention is not a problem in SQLite for the foreseeable future. Revisit if scraper count or institution count grows by an order of magnitude.

The audit reason for keeping everything is that score-change forensics require historical signal data — "why did Barclays' score change between run 12 and run 13?" needs both runs' signal rows.


What's NOT in the DB

Recording for completeness:

  • Raw scrape payloads. Scrapers parse upstream HTML/PDF/Excel into signal rows; the raw source content is not stored. Provenance is via signal.source_url. If a source URL goes dead, that historical signal cannot be reconstructed. Acceptable v1 trade-off; if it becomes a problem, a scrape_artefact table can be added.
  • User accounts. Cloudflare Access handles auth; no app-level user table.
  • Watchlist findings. Per ADR-0003 the watchlist is mentioned but the schema is pending. Expected as part of the BHRRC scraper cycle.
  • Themes (UI grouping). Per ADR-0005, themes are a UI-only config in src/config/rule-themes.js. Not stored in DB.