Initial SEO-INTEL documentation: architecture, scoring, code structure

Add comprehensive documentation for the dual-engine performance evaluation system:

- System architecture and data flow
- Score calculation methodology (0-100 approximation from CWV thresholds)
- Detailed metrics reference (LCP, FCP, CLS, TBT, TTFB)
- Testing engines comparison (Sitespeed vs PSI)
- Complete code structure map (file-by-file breakdown)
- Case study: rds.ink 77 score with actionable fixes
- Quick reference guides for interpreting results

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-14 05:56:49 +10:00
commit 335d9a76e1
6 changed files with 1385 additions and 0 deletions
--- a/code-refs/file-structure.md
+++ b/code-refs/file-structure.md
@@ -0,0 +1,373 @@
+# Code Structure Map
+
+Complete file-by-file breakdown of the seo-intel repository.
+
+## Directory Layout
+
+```
+/home/help4bis/seo-intel/
+├── README.md                    # Project overview (v1.1.0)
+├── pyproject.toml              # Python project config (dependencies, build)
+├── requirements.txt            # Python package list
+├── run.sh                       # Launch script (runs main.py)
+├── .env                         # Secrets: PSI_API_KEY, DB path, etc.
+│
+├── src/                         # Python package
+│   ├── __init__.py
+│   ├── main.py                  # FastAPI app entry point
+│   ├── config.py                # Settings, site list (SITES config)
+│   ├── db.py                    # SQLAlchemy setup, migrations, session factory
+│   │
+│   ├── models/
+│   │   ├── __init__.py
+│   │   ├── perf.py              # ORM models: PerfRun, PerfAudit, PerfOpportunity, PerfResource
+│   │   ├── site.py              # Site model (name, domain, priority)
+│   │   ├── ranking.py           # Ranking snapshot model (SEO keyword rankings)
+│   │   └── ...                  # Other models (not perf-related)
+│   │
+│   ├── routers/
+│   │   ├── __init__.py
+│   │   ├── performance.py       # GET /performance/, /performance/<site_id>, POST /api/perf/test, /api/perf/sweep
+│   │   ├── dashboard.py         # GET / (main dashboard)
+│   │   ├── keywords.py          # Keyword ranking pages
+│   │   └── ...                  # Other routers (not perf-related)
+│   │
+│   ├── perf/                    # Performance testing engines
+│   │   ├── __init__.py
+│   │   ├── runner.py            # Orchestrator: run_full_test() — runs engines × devices
+│   │   ├── sitespeed.py         # Sitespeed.io Docker wrapper + HAR parser
+│   │   ├── psi.py               # Google PageSpeed Insights API client
+│   │   └── batch.py             # Weekly sweep logic
+│   │
+│   ├── playbook/                # SEO playbook generation (not perf-related)
+│   │   ├── __init__.py
+│   │   ├── rules.py
+│   │   └── llm.py
+│   │
+│   └── ...                      # Other modules (keyword analysis, etc.)
+│
+├── templates/                   # Jinja2 HTML templates
+│   ├── base.html                # Base template (nav, styling)
+│   ├── performance.html         # Portfolio scorecard
+│   ├── performance_site.html    # Per-site detail dashboard
+│   ├── dashboard.html           # Main dashboard
+│   └── ...                      # Other templates
+│
+├── data/
+│   └── seo-intel.db            # SQLite database (perf_runs, perf_audits, etc.)
+│
+├── docs/                        # Documentation (this repo)
+│
+└── ops/                         # Operations scripts
+    ├── schema.sql              # Database schema
+    └── ...
+```
+
+---
+
+## Performance System Files (Perf Tier)
+
+### src/routers/performance.py
+
+**Purpose:** FastAPI routes for the performance dashboard
+
+**Key functions:**
+- `performance_home(request, db)` — `GET /performance/` → portfolio scorecard
+- `performance_site(site_id, request, db)` — `GET /performance/<site_id>` → per-site detail
+- `api_perf_test(body, background_tasks, db)` — `POST /api/perf/test` → trigger single URL test
+- `api_perf_sweep(background_tasks)` — `POST /api/perf/sweep` → trigger portfolio sweep
+- `_portfolio_rows(db)` — SQL: latest scores per site × device
+- `_site_url_rows(db, site_id)` — SQL: latest score per URL
+- `_site_latest_audit(db, site_id, device)` — SQL: full metrics for latest run
+- `_site_trend(db, site_id, weeks)` — SQL: weekly AVG scores (12 weeks)
+- `_site_opportunities(db, site_id, device)` — SQL: top PSI opportunities
+- `_site_slow_resources(db, site_id)` — SQL: top 10 slowest resources
+
+**Key imports:**
+```python
+from fastapi import APIRouter, BackgroundTasks, Depends
+from sqlalchemy import text
+from fastapi.templating import Jinja2Templates
+from ..perf.runner import run_full_test          # src/perf/ is a sibling of src/routers/
+from ..perf.batch import run_weekly_perf_sweep
+```
+
+**Size:** ~545 lines
+
+---
+
+### src/perf/runner.py
+
+**Purpose:** Orchestrates test runs across engines and devices
+
+**Key functions:**
+- `run_full_test(site_id, url, db, engines, devices)` — Main orchestrator
+  - Loops: for engine in engines: for device in devices:
+  - Calls appropriate engine (sitespeed or psi)
+  - Persists each result via `_persist_run()`
+  - Returns summary dict
+- `_persist_run(db, site_id, url, engine, result)` — Writes one test result to database
+  - Inserts: perf_runs (1), perf_audits (1), perf_opportunities (0+), perf_resources (0+)
+  - Commits transaction
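+
+The engines × devices loop can be sketched as below. This is a minimal illustration, not the repository's code: `run_matrix`, the `EngineFn` alias, and the choice to record a failed run instead of aborting are assumptions, and persistence via `_persist_run()` is left out.
+
+```python
+from typing import Callable, Dict, List, Sequence
+
+# Hypothetical engine signature: (url, device) -> result dict
+EngineFn = Callable[[str, str], dict]
+
+
+def run_matrix(url: str, engines: Dict[str, EngineFn], devices: Sequence[str]) -> List[dict]:
+    """Run every engine against every device and collect one result per pair."""
+    results = []
+    for name, engine_fn in engines.items():
+        for device in devices:
+            try:
+                result = engine_fn(url, device)
+            except Exception as exc:
+                # Assumption: a failing engine is recorded and the loop continues
+                result = {"success": False, "error_message": str(exc)}
+            results.append({"engine": name, "device": device, **result})
+    return results
+```
+
+With two engines and two devices this yields four results, matching the one `perf_runs` row per engine and device pair described above.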
+
+**Key imports:**
+```python
+from sqlalchemy.orm import Session
+from ..models.perf import PerfRun, PerfAudit, PerfOpportunity, PerfResource
+from .sitespeed import run_sitespeed_test
+from .psi import run_psi_test
+```
+
+**Size:** ~200 lines
+
+---
+
+### src/perf/sitespeed.py
+
+**Purpose:** Wraps sitespeed.io Docker container, parses HAR output
+
+**Key functions:**
+- `run_sitespeed_test(url, device)` — Execute sitespeed in Docker
+  - Builds Docker command with device-specific args (--mobile vs desktop UA)
+  - Runs `docker run sitespeedio/sitespeed.io:40.4.0 {url} --n 3 ...`
+  - Waits for output (60s)
+  - Calls `_parse_har()` to extract metrics
+  - Calls `_approx_score()` to calculate performance score
+  - Returns: success, performance_score, metrics, resources
+- `_parse_har(har_path)` — Parse `/tmp/sitespeed-output/{run_id}/.../browsertime.har`
+  - Extracts _googleWebVitals from pages[] (LCP, FCP, CLS, TTFB)
+  - Extracts _cpu.longTasks.totalBlockingTime from pages[] (TBT)
+  - Sums resource sizes by type (image, script, stylesheet, font)
+  - Returns: metrics dict, resources list
+- `_approx_score(lcp_ms, fcp_ms, cls, tbt_ms, ttfb_ms)` — Calculate 0-100 score
+  - Uses _THRESHOLDS (lines 53–60)
+  - Linear interpolation between good/poor for each metric
+  - Returns: int(mean(all_metric_scores))
+- `_guess_resource_type(url, content_type)` — Classify resource (script, image, etc.)
+
+**Key constants:**
+- `SITESPEED_IMAGE = "sitespeedio/sitespeed.io:40.4.0"` (pinned version)
+- `OUTPUT_BASE = Path("/tmp/sitespeed-output")` (Docker output mount point)
+- `_THRESHOLDS` dict (lines 53–60): (good, poor) for LCP, FCP, CLS, TBT, TTFB
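+
+The scoring approach can be sketched as follows. The threshold values are an assumption taken from Google's published good/poor boundaries, not copied from `sitespeed.py`, and the keyword-argument signature and the handling of missing metrics are illustrative choices.
+
+```python
+from statistics import mean
+
+# Assumed (good, poor) pairs; the real _THRESHOLDS may differ
+_THRESHOLDS = {
+    "lcp_ms": (2500, 4000),
+    "fcp_ms": (1800, 3000),
+    "cls": (0.1, 0.25),
+    "tbt_ms": (200, 600),
+    "ttfb_ms": (800, 1800),
+}
+
+
+def _metric_score(value, good, poor):
+    """100 at or below `good`, 0 at or above `poor`, linear in between."""
+    if value <= good:
+        return 100.0
+    if value >= poor:
+        return 0.0
+    return 100.0 * (poor - value) / (poor - good)
+
+
+def approx_score(**metrics):
+    """Average the per-metric scores, skipping metrics that are missing."""
+    scores = [
+        _metric_score(value, *_THRESHOLDS[name])
+        for name, value in metrics.items()
+        if value is not None
+    ]
+    return int(mean(scores)) if scores else None
+```
+
+For example, an LCP of 3250 ms sits halfway between its assumed good and poor thresholds and scores 50; with the other four metrics at "good", the overall result is 90.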
+
+**Size:** ~450 lines
+
+---
+
+### src/perf/psi.py
+
+**Purpose:** Calls Google PageSpeed Insights API, parses Lighthouse results
+
+**Key functions:**
+- `run_psi_test(url, device)` — Call PageSpeed Insights API
+  - GET `https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=...&strategy={device}`
+  - Parses response.lighthouseResult
+  - Calls `_parse_lighthouse_audits()` (shared with sitespeed)
+  - Returns: success, performance_score (official), metrics, opportunities
+- `_parse_lighthouse_audits(audits)` — Extract metrics + opportunities from Lighthouse JSON
+  - Maps audit keys (largest-contentful-paint, etc.) to metric values
+  - Extracts opportunities (audit.details.type == "opportunity")
+  - Calculates savings_ms and savings_bytes for each opportunity
+  - Returns: metrics dict, opportunities list
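+
+A rough sketch of this parsing step is shown below. The audit-to-field mapping and the output field names are assumptions for illustration; `numericValue`, `details.type`, `overallSavingsMs`, and `overallSavingsBytes` are fields of the Lighthouse result JSON.
+
+```python
+from typing import Dict, List, Tuple
+
+# Assumed mapping from Lighthouse audit keys to metric field names
+METRIC_AUDITS = {
+    "largest-contentful-paint": "lcp_ms",
+    "first-contentful-paint": "fcp_ms",
+    "cumulative-layout-shift": "cls",
+    "total-blocking-time": "tbt_ms",
+    "server-response-time": "ttfb_ms",
+}
+
+
+def parse_lighthouse_audits(audits: Dict[str, dict]) -> Tuple[dict, List[dict]]:
+    """Split a Lighthouse `audits` object into metrics and opportunities."""
+    metrics = {
+        field: audits[key].get("numericValue")
+        for key, field in METRIC_AUDITS.items()
+        if key in audits
+    }
+    opportunities = []
+    for key, audit in audits.items():
+        details = audit.get("details") or {}
+        if details.get("type") != "opportunity":
+            continue  # only opportunity-type audits carry savings estimates
+        opportunities.append({
+            "opportunity_key": key,
+            "display_label": audit.get("title", key),
+            "savings_ms": details.get("overallSavingsMs", 0),
+            "savings_bytes": details.get("overallSavingsBytes", 0),
+        })
+    # Largest estimated time saving first
+    opportunities.sort(key=lambda o: o["savings_ms"], reverse=True)
+    return metrics, opportunities
+```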
+
+**Key constants:**
+- `PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"`
+- `PSI_TIMEOUT = 90` (Google's API can be slow)
+
+**Size:** ~150 lines
+
+---
+
+### src/perf/batch.py
+
+**Purpose:** Weekly portfolio performance sweep
+
+**Key functions:**
+- `run_weekly_perf_sweep(db)` — Main sweep orchestrator
+  - Loops: for each site in SITES:
+  - Calls `resolve_url_list()` to get up to 6 URLs (homepage plus the top 5 by impressions)
+  - For each URL: calls `run_full_test()` (sitespeed + psi, mobile + desktop)
+  - Logs completion summary
+- `resolve_url_list(db, domain)` — Get URLs for a site
+  - Always: homepage
+  - Plus: top 5 URLs from ranking_snapshots (last 30 days, sorted by impressions)
+  - Returns: list of 6 URLs max
+- `_get_top_urls(db, site_id, limit)` — Query ranking_snapshots for impressions
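+
+The URL selection rule (homepage first, then ranked URLs, capped at six) can be sketched as a pure function. `build_url_list` is a hypothetical helper: it assumes the ranked URLs have already been fetched, that the homepage is `https://{domain}/`, and that duplicates are dropped.
+
+```python
+from typing import Iterable, List
+
+
+def build_url_list(domain: str, top_urls: Iterable[str], limit: int = 6) -> List[str]:
+    """Return the homepage followed by ranked URLs, de-duplicated and capped."""
+    urls = [f"https://{domain}/"]  # the homepage is always tested
+    for url in top_urls:
+        if url not in urls:
+            urls.append(url)
+    return urls[:limit]
+```
+
+Keeping the database query separate from the list-building step makes the cap and the de-duplication easy to check without a database.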
+
+**Size:** ~150 lines
+
+---
+
+### src/models/perf.py
+
+**Purpose:** SQLAlchemy ORM models for performance data
+
+**Models:**
+- `PerfRun` — Test execution record
+  - Fields: id, site_id, url, engine, device, started_at, completed_at, success, error_message
+  - Relations: audits (1-to-many), opportunities (1-to-many), resources (1-to-many)
+- `PerfAudit` — Core Web Vitals metrics for one run
+  - Fields: id, perf_run_id, performance_score, lcp_ms, cls, inp_ms, tbt_ms, fcp_ms, ttfb_ms, total_byte_weight, image_bytes, js_bytes, css_bytes, font_bytes, requests_count, dom_size
+  - Relations: run (many-to-1)
+- `PerfOpportunity` — Lighthouse audit opportunity
+  - Fields: id, perf_run_id, opportunity_key, display_label, savings_ms, savings_bytes, details_json
+  - Relations: run (many-to-1)
+- `PerfResource` — HAR resource entry
+  - Fields: id, perf_run_id, resource_url, resource_type, size_bytes, transfer_size_bytes, start_time_ms, end_time_ms, is_render_blocking
+  - Relations: run (many-to-1)
+
+**Size:** ~100 lines
+
+---
+
+## Templates
+
+### templates/performance.html
+
+**Purpose:** Portfolio performance scorecard
+
+**Features:**
+- Table of all sites (13 rows)
+- Columns: domain, score_mobile, score_desktop, lcp_ms, cls, slowest_url, last_tested
+- Colour-coded scores (green ≥90, amber ≥50, red <50)
+- "Run portfolio sweep now" button (HTMX POST to /api/perf/sweep)
+- Sweep status display (idle | running | ok | error)
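+
+The colour bands amount to a small lookup, sketched here in Python. The function name and the `"grey"` fallback for untested sites are assumptions; the template may implement the bands directly in Jinja2.
+
+```python
+from typing import Optional
+
+
+def score_colour(score: Optional[int]) -> str:
+    """Map a 0-100 performance score to a badge colour."""
+    if score is None:
+        return "grey"   # assumption: no run recorded yet
+    if score >= 90:
+        return "green"
+    if score >= 50:
+        return "amber"
+    return "red"
+```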
+
+**Size:** ~200 lines
+
+---
+
+### templates/performance_site.html
+
+**Purpose:** Per-site performance detail dashboard
+
+**Features:**
+- Latest CWV metrics (mobile + desktop side-by-side)
+- 12-week trend sparkline chart (mobile + desktop bars per week)
+- Top 5 optimisation opportunities (PSI)
+- Top 10 slowest resources (sitespeed HAR)
+- Per-URL breakdown table with test buttons
+  - Columns: URL, score, LCP, CLS, requests, tested_at, test_now_buttons
+  - Test buttons: Both (mobile+desktop), Mob, Dsk
+
+**Interactive elements:**
+- HTMX buttons that queue tests
+- Coloured metric badges (green/amber/red)
+- Tooltips for long URLs
+
+**Size:** ~390 lines
+
+---
+
+## Supporting Files
+
+### src/config.py
+
+**What it contains:**
+- `Settings` class (Pydantic)
+- `SITES` — list of 13 sites to monitor
+  - Each site: domain, priority (sorting order)
+
+**Size:** ~50 lines
+
+---
+
+### src/db.py
+
+**What it contains:**
+- SQLAlchemy engine + session factory
+- `Base` (declarative base for all models)
+- Database URI from .env
+- Migration logic (auto-create tables on startup)
+
+**Size:** ~60 lines
+
+---
+
+### requirements.txt
+
+Key dependencies for performance testing:
+- fastapi, uvicorn (web framework)
+- sqlalchemy (ORM)
+- httpx (for PSI API calls)
+- docker (for sitespeed execution)
+- jinja2 (templates)
+
+---
+
+## File Interaction Map
+
+```
+FastAPI Request
+        ↓
+performance.py (routers)
+        ↓
+[Query] perf_audits table via SQL
+        ├─→ db.py (SQLAlchemy session)
+        │
+[Render] templates (Jinja2)
+        ├─→ performance_site.html
+        └─→ performance.html
+
+[Background Task] api_perf_test()
+        ↓
+runner.py:run_full_test()
+        ├─ For each engine:
+        │  ├─ sitespeed.py:run_sitespeed_test() → Docker
+        │  │  ├─ subprocess.run("docker run sitespeedio/...")
+        │  │  ├─ _parse_har(browsertime.har)
+        │  │  └─ _approx_score(metrics) → 0-100
+        │  │
+        │  └─ psi.py:run_psi_test() → Google API
+        │     ├─ httpx.get(googleapis.com/...)
+        │     ├─ _parse_lighthouse_audits(audits)
+        │     └─ opportunities + official_score
+        │
+        ├─ runner.py:_persist_run() for each result
+        │  ├─ INSERT perf_runs
+        │  ├─ INSERT perf_audits
+        │  ├─ INSERT perf_opportunities
+        │  └─ INSERT perf_resources
+        │
+        └─ models/perf.py (ORM objects)
+           └─ db.py (commit via SQLAlchemy session)
+```
+
+---
+
+## Deployment
+
+All files live in `/home/help4bis/seo-intel/` on george (192.168.0.117).
+
+**To start the service:**
+```bash
+cd /home/help4bis/seo-intel
+./run.sh
+# or
+uvicorn src.main:app --host 0.0.0.0 --port 8765 --reload
+```
+
+**To run a performance test manually:**
+```bash
+cd /home/help4bis/seo-intel
+python -c "
+from src.perf.runner import run_full_test
+from src.db import SessionLocal
+
+db = SessionLocal()
+try:
+    # Run both engines on both device profiles for one URL
+    result = run_full_test(
+        site_id=3,
+        url='https://rds.ink/endangered',
+        db=db,
+        engines=['sitespeed', 'psi'],
+        devices=['mobile', 'desktop'],
+    )
+    print(result)
+finally:
+    db.close()  # release the session even if the test raises
+"
+```
+
+---
+
+See also:
+- [Database Schema](database-schema.md) — All tables and fields
+- [API Endpoints](api-endpoints.md) — HTTP routes and payloads