# ENRICHMENT MASTER PLAN
# CPS Platform — spiritualmessage.org
# Created: March 19, 2026
# Author: Junaid Shaikh + Claude (strategy session)
# Type: Strategic Vision + Actionable Roadmap
# Location: /root/critique/RESEARCH/ENRICHMENT_MASTER_PLAN.md
# Update trigger: When a phase completes or strategy shifts

---

## The Core Insight — Maulana's Teaching Pattern

Every lecture Maulana delivers and every book passage he writes follows one consistent structure:

```
TOPIC (e.g. Patience / Sabr)
    ↓
Quran verse that establishes the principle
    ↓
Hadith that elaborates it
    ↓
Prophet's life event that demonstrates it
    ↓
Companion's experience (Abu Bakr, Omar, Ayesha...)
    ↓
Historical event (Battle of Uhud, Hijra, Ottoman decline...)
    ↓
Personal anecdote (Maulana's own experience)
    ↓
Contemporary application (today's Muslim world, Pakistan, India...)
```

This is not random. It is **the same chain in almost every passage**.

The chain is: `Divine Text → Human Example → Historical Proof → Contemporary Relevance`

**If you tag this chain for every paragraph, you unlock everything.**

---

## The Single Pillar: SOURCE CHAIN

### What it is

A new enrichment field — `source_chain` — added to every paragraph during Phase B enrichment.

```json
{
  "source_chain": {
    "primary_source_type": "quran|hadith|prophet_event|companion_story|historical_event|personal_anecdote|contemporary_example|reasoning_only",
    "primary_ref": "2:153",
    "supporting_personalities": ["Prophet_Muhammad", "Abu_Bakr_al_Siddiq"],
    "supporting_events": ["Battle_of_Uhud", "Hijra_to_Madina"],
    "teaching_chain": "quran → prophet_event → contemporary_application",
    "chain_completeness": 3,
    "contemporary_context": "political_islam|pakistan|india|global_muslims|general"
  }
}
```

### Why this is the single pillar

Every other enrichment either already exists OR becomes computable FROM this field:

| Enrichment Type | Depends on | Status |
|-----------------|--------------------------|--------|
| Verse-centric retrieval | references table (done) | ✅ Done |
| Hadith-centric retrieval | references table (done) | ✅ Done |
| Topic-centric retrieval | creation_plan (done) | ✅ Done |
| **Personality-centric** | source_chain.supporting_personalities | ❌ Missing |
| **Event-centric (Islamic)** | source_chain.supporting_events | ❌ Missing |
| **Teaching chain analysis** | source_chain.teaching_chain | ❌ Missing |
| Multi-hop retrieval | All of above | ❌ Missing |
| Sefaria-style connections | All of above | ❌ Missing |

**One field → 4 new retrieval dimensions → multi-hop graph traversal.**

### What Sefaria does that this enables

Sefaria's 3.3 million links are essentially: for every text, what other texts does it reference? Their "connections" panel shows everything linked to a verse across their entire corpus.

With source_chain:
- User asks about Quran 2:153 → get all paragraphs where Maulana cites this verse ✅ (already works)
- User asks about Battle of Uhud → get all paragraphs where Maulana uses Uhud as an example ❌ → ✅ with source_chain
- User asks about Abu Bakr → get all paragraphs and videos where Maulana references Abu Bakr's example ❌ → ✅ with source_chain
- User asks about patience AND Abu Bakr → multi-hop: find paragraphs that cite Quran 2:153 AND mention Abu Bakr ❌ → ✅ with source_chain

---

## What Already Exists (Don't Rebuild)

### Phase B enrichment fields — already on every enriched paragraph:

| Field | What it captures | Coverage |
|-------|-----------------|---------|
| `glance_text` | One-line summary | 75% of paras |
| `viral_score` | Quality score 0-100, best_quote | 75% of paras |
| `emotional_tags` | Input→output emotion, transformation | 75% of paras |
| `creation_plan` | 14 theological aspects + application | 75% of paras |
| `argument_mapping` | 10 contemporary issues | 75% of paras |
| `reasoning_flow` | How Maulana builds the argument (type + steps) | 75% of paras |
| `references` table | Quran verses, hadiths, years — linked to paragraph | 9 books |
| `entities` table | PERSON, PLACE extracted from text | 9 books |

### What reasoning_flow already captures:

```json
{
  "type": "verse_to_conclusion",
  "steps": [
    {"type": "quran_citation", "content": "...4:171..."},
    {"type": "conclusion", "content": "extremism is prohibited"}
  ],
  "citation_path": {
    "primary_verse": "4:171",
    "creation_plan_aspect": "intellectual_development_through_adversity"
  }
}
```

**reasoning_flow already captures the logical chain.** source_chain adds the NAMED ENTITIES dimension — who was mentioned, what event was described.

### 6 Centric Exporters already built:
- `verse_centric` — groups by Quran surah:ayah ✅ ACTIVE in chatbot
- `hadith_centric` — groups by hadith collection+number ✅ ACTIVE in chatbot
- `topic_centric` — groups by concept/topic ✅ ACTIVE in chatbot
- `argument_centric` — 10 contemporary issues ✅ ACTIVE in chatbot
- `creation_plan_centric` — 14 theological aspects ✅ ACTIVE in chatbot
- `timeline_centric` — 5 historical eras ✅ ACTIVE in chatbot

**Two NEW exporters needed after source_chain:**
- `personality_centric` — group by named Islamic personalities
- `event_centric` — group by Islamic historical events

---

## Step-by-Step Plan

### Phase 0 — Fix the retrieval quality NOW (1 day, this week)
**Problem:** Raw unenriched video segments flood the retrieval pool, causing irrelevant answers.
**Fix:** In chat_engine.py, when building FTS5/semantic candidates, skip video segments where `glance_text IS NULL`. Only enriched content competes.
**Impact:** Immediate quality improvement on all queries. Zero new enrichment needed.
**Effort:** 30 min code change.
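
A minimal sketch of the filter, assuming the candidate pool is built from a SQLite FTS5 index over video segments. The table names (`video_segments`, `video_segments_fts`) and the rowid join are assumptions about the existing schema; only `glance_text` and `soniox_text` come from the plan above.

```python
import sqlite3

# Sketch only: keep book paragraphs untouched, but let only enriched video
# segments (glance_text populated) enter the FTS5 candidate pool.
# Table names are assumed, not confirmed schema.
def fetch_enriched_video_candidates(db_path: str, fts_query: str) -> list[tuple]:
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            """
            SELECT vs.id, vs.soniox_text, vs.glance_text
            FROM video_segments_fts
            JOIN video_segments vs ON vs.id = video_segments_fts.rowid
            WHERE video_segments_fts MATCH ?
              AND vs.glance_text IS NOT NULL   -- skip raw, unenriched ASR segments
            """,
            (fts_query,),
        ).fetchall()
    finally:
        conn.close()
```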

---

### Phase 1 — Enrich 64 Soniox Videos (this weekend)
**What:** Run Phase B enrichment on the 64 Soniox-transcribed videos that have `soniox_text` but no `glance_text`.
**Why Soniox first:** Word-level timestamps (millisecond precision), Urdu/Hindi accuracy (6.3% WER), no cleanup step needed (`soniox_text` is the clean transcript).
**No cleanup:** `soniox_text` is used directly. `text` (Fireflies) is untouched.
**After this:** 75 Soniox videos fully enriched → quality video content in retrieval pool.
**Effort:** Run existing enrichment script on 64 videos. No new code.

Videos to enrich (64 pending):
- See: `SELECT v.title FROM videos v JOIN video_segments vs ON vs.video_id=v.id WHERE vs.soniox_text IS NOT NULL AND vs.glance_text IS NULL GROUP BY v.id`

---

### Phase 2 — Add source_chain to Phase B enrichment prompt (next weekend)
**What:** Add `source_chain` as a new JSON field in the existing Phase B enrichment prompt (the "enrich" step that runs Claude Opus).
**The field:**
```
For this paragraph, identify:
- primary_source_type: what kind of source does Maulana primarily use?
  Options: quran|hadith|prophet_event|companion_story|historical_event|personal_anecdote|contemporary_example|reasoning_only
- primary_ref: the specific reference (e.g. "2:153", "Battle of Uhud", "Abu Bakr's migration")
- supporting_personalities: list of Islamic figures mentioned by name
  Use canonical names: Prophet_Muhammad, Abu_Bakr_al_Siddiq, Umar_ibn_al_Khattab, Uthman, Ali_ibn_Abi_Talib, Ayesha, etc.
- supporting_events: list of named Islamic events
  Use canonical names: Battle_of_Badr, Battle_of_Uhud, Hijra, Treaty_of_Hudaybiyya, etc.
- teaching_chain: sequence of source types Maulana uses (e.g. "quran → prophet_event → contemporary_application")
- chain_completeness: how many distinct source types appear (1-5)
- contemporary_context: which contemporary frame the paragraph applies the teaching to
  Options: political_islam|pakistan|india|global_muslims|general
```
**DB change:** Add `source_chain` TEXT column to paragraphs table.
**After this:** Every new book enriched AND all 9 existing books re-enriched with source_chain.
**Effort:** 1 day (prompt change + DB migration + re-enrich 7,397 paragraphs overnight).
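
A minimal migration sketch for the DB change, assuming the paragraphs table lives in SQLite (implied by the FTS5 usage elsewhere in the plan). The column check makes the migration safe to re-run.

```python
import sqlite3

# Idempotent migration sketch: add the source_chain column only if it is
# not already present. Assumes a SQLite database with a paragraphs table.
def add_source_chain_column(db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    try:
        existing = {row[1] for row in conn.execute("PRAGMA table_info(paragraphs)")}
        if "source_chain" not in existing:
            conn.execute("ALTER TABLE paragraphs ADD COLUMN source_chain TEXT")
            conn.commit()
    finally:
        conn.close()
```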

---

### Phase 3 — Build personality_centric and event_centric exporters (weekend after Phase 2)
**What:** Two new exporters that read `source_chain.supporting_personalities` and `source_chain.supporting_events` from paragraphs.
**personality_centric:** One JSON file per Islamic personality. Groups ALL paragraphs (and enriched video segments) where that person is mentioned.
```
/data/exports/personality_centric/Prophet_Muhammad.json
/data/exports/personality_centric/Abu_Bakr_al_Siddiq.json
/data/exports/personality_centric/Umar_ibn_al_Khattab.json
...
```
**event_centric:** One JSON file per Islamic event.
```
/data/exports/event_centric/Battle_of_Uhud.json
/data/exports/event_centric/Hijra.json
/data/exports/event_centric/Treaty_of_Hudaybiyya.json
...
```
**Chatbot integration:** Add Strategy 3e (personality_match) and 3f (event_match) to chat_engine.py using the same semantic matching pattern as argument_centric.
**After this:** User asks "What does Maulana say using the example of Abu Bakr?" → chatbot pulls from personality_centric/Abu_Bakr.json.
**Effort:** 1 weekend (2 exporters + 2 new chat strategies).
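
A sketch of what personality_centric could look like, assuming source_chain is stored as JSON text on each paragraph row (per Phase 2). The output directory follows the plan; column names such as `text` are assumptions about the schema. event_centric would be the same loop over `supporting_events`.

```python
import json
import sqlite3
from collections import defaultdict
from pathlib import Path

# Sketch of the personality_centric exporter: group paragraphs by the
# canonical personality names found in source_chain.supporting_personalities.
def export_personality_centric(db_path: str,
                               out_dir: str = "/data/exports/personality_centric") -> None:
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    groups: dict[str, list[dict]] = defaultdict(list)
    for row in conn.execute(
        "SELECT id, text, glance_text, source_chain FROM paragraphs "
        "WHERE source_chain IS NOT NULL"
    ):
        chain = json.loads(row["source_chain"])
        for person in chain.get("supporting_personalities", []):
            groups[person].append({
                "paragraph_id": row["id"],
                "glance_text": row["glance_text"],
                "text": row["text"],
            })
    conn.close()

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for person, paragraphs in groups.items():
        (out / f"{person}.json").write_text(
            json.dumps({"personality": person, "paragraphs": paragraphs},
                       ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
```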

---

### Phase 4 — Include enriched VideoSegments in all centric exporters (after Phase 1 + 2 complete)
**What:** Update all 8 centric exporters to also query enriched VideoSegments (where `glance_text IS NOT NULL`).
**Why:** Currently all centric JSON files only contain book paragraphs. Video segments are invisible to centric retrieval even when enriched.
**After this:** User asks about Battle of Uhud → gets both book paragraphs AND video clips at exact timestamps where Maulana discusses it.
**Effort:** 1 day (update 8 exporters, re-export).
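
The query change each exporter needs is roughly a UNION over both tables, keeping only enriched video segments. This is a sketch: it assumes the two tables expose compatible enrichment columns, which may require aliasing in practice.

```python
# Sketch of the shared Phase 4 source query for the centric exporters.
# Column names are assumed to line up across paragraphs and video_segments.
CENTRIC_SOURCE_QUERY = """
SELECT 'paragraph' AS kind, id, glance_text, source_chain
FROM paragraphs
WHERE glance_text IS NOT NULL

UNION ALL

SELECT 'video_segment' AS kind, id, glance_text, source_chain
FROM video_segments
WHERE glance_text IS NOT NULL   -- only enriched segments enter the exports
"""
```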

---

### Phase 5 — Multi-hop retrieval in chatbot (after Phase 3 complete)
**What:** Enable two-hop graph traversal.
  - Hop 1: Find paragraphs for the query topic (existing strategies 1-6)
  - Hop 2: For each found paragraph, look up its source_chain and also retrieve the paragraphs that share the same verse / personality / event
**Example:** User asks about "patience" → finds 10 patience paragraphs → notices 3 of them cite Quran 2:153 AND mention Abu Bakr → adds 5 more paragraphs that cite the same combination → answer covers the full depth of Maulana's teaching on the topic.
**This is Sefaria's core differentiator** — their 3.3M links enable exactly this traversal.
**Effort:** 1 weekend (new retrieval strategy in chat_engine.py).
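
A two-hop sketch, assuming the first-hop results carry their `id` and `source_chain` JSON. The LIKE scan stands in for whatever index chat_engine.py already uses; it is illustrative, not the intended production query.

```python
import json
import sqlite3

# Two-hop expansion sketch: collect verses, personalities and events from the
# first-hop paragraphs' source_chain, then pull other paragraphs that share
# any of them. The LIKE scan is a placeholder for a proper index.
def expand_by_source_chain(conn: sqlite3.Connection,
                           first_hop: list[dict], limit: int = 5) -> list[dict]:
    shared: set[str] = set()
    for para in first_hop:
        chain = json.loads(para.get("source_chain") or "{}")
        shared.update(chain.get("supporting_personalities", []))
        shared.update(chain.get("supporting_events", []))
        if chain.get("primary_ref"):
            shared.add(chain["primary_ref"])

    seen = {para["id"] for para in first_hop}
    extra: list[dict] = []
    for key in shared:
        rows = conn.execute(
            "SELECT id, glance_text, source_chain FROM paragraphs "
            "WHERE source_chain LIKE ?",
            (f"%{key}%",),
        )
        for pid, glance, chain_json in rows:
            if pid in seen:
                continue
            seen.add(pid)
            extra.append({"id": pid, "glance_text": glance, "source_chain": chain_json})
            if len(extra) >= limit:
                return extra
    return extra
```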

---

### Phase 6 — Process 1,300 videos with no transcript (ongoing, parallel)
**What:** Submit unenriched videos to Soniox API for transcription.
**Priority order** (mirrored in the selection sketch below):
  1. English language videos (enrichment model works best)
  2. Recent 2024-2025 content (most relevant)
  3. Urdu/Hindi content (Soniox's strongest language for this)
**Rate:** Soniox API → process 10-20 videos/day in background cron.
**After this:** Corpus grows from 75 enriched videos to 1,375 enriched videos.
**Effort:** Enable video cron (after 3-video end-to-end test passes).
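
A selection sketch for the daily cron batch. The `transcript_status`, `language`, and `published_at` column names are guesses about the videos table, and the Soniox submission itself is left to the existing pipeline; only the ordering mirrors the priority list above.

```python
# Sketch of the nightly batch selection (column names assumed, not confirmed).
PENDING_VIDEO_BATCH = """
SELECT id, title
FROM videos
WHERE transcript_status IS NULL
ORDER BY
  CASE WHEN language = 'en' THEN 0 ELSE 1 END,  -- English first
  published_at DESC                             -- then most recent content
LIMIT 15                                        -- 10-20 per day per the plan
"""
```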

---

### Phase 7 — Process remaining 136 books (ongoing, parallel)
**What:** Upload books from Google Drive using the new MCP automation (process_book_from_drive).
**After Phase 2:** source_chain is automatically generated for every new book.
**Target:** 50 books indexed by end of April 2026 → 100 books by June 2026.
**Effort:** 30 min/week via Claude.ai MCP interface (no SSH needed).

---

## The Compound Effect — What You Get At Each Phase

| After Phase | User experience |
|------------|-----------------|
| Phase 0 | Answers are more relevant (no raw ASR noise) |
| Phase 1 | 75 Soniox videos give quality video results |
| Phase 2 | Each paragraph tagged with WHO and WHAT EVENT |
| Phase 3 | "What does Maulana say using Abu Bakr example?" works |
| Phase 4 | Video clips appear in verse/event/personality searches |
| Phase 5 | Answers are multi-dimensional — verse + personality + event combined |
| Phase 6+7 | Scale: 1,375 videos + 145 books fully indexed |

---

## Sefaria Comparison — Where We'll Stand After Each Phase

| Capability | Sefaria | Us now | Us after Phase 5 |
|-----------|---------|--------|------------------|
| Stable Ref IDs | ✅ Genesis.1.1 | ✅ quranic-wisdom:0:45 | ✅ |
| Verse-centric retrieval | ✅ | ✅ | ✅ |
| Personality mentions | ✅ 17,730 entities | ❌ | ✅ personality_centric |
| Event mentions | ✅ | ❌ | ✅ event_centric |
| Multi-hop traversal | ✅ 3.3M links | ❌ | ✅ source_chain |
| Emotional journey | ❌ | ✅ | ✅ |
| Creation plan (14 aspects) | ❌ | ✅ | ✅ |
| Video + books combined | ❌ | partial | ✅ Phase 4 |
| Daily habit mechanism | ✅ Daf Yomi | ❌ deferred | build after Phase 3 |
| Eval dataset | ✅ benchmarked | ❌ P9 pending | ✅ P9 this month |

---

## What NOT to build (scope control)

- ❌ NER model training (Sefaria took 2 years — use LLM extraction instead)
- ❌ Embedding benchmark (P8) — do after 50+ books indexed
- ❌ Public API (P17) — do after 70+ books indexed  
- ❌ External Linker JS (P18) — do after public API
- ❌ Scholar review UI (P11) — do after 50+ books + source_chain

---

## Critical Decision: Re-enriching 7,397 existing paragraphs

When Phase 2 is ready (source_chain in enrichment prompt):
- Run a background re-enrichment job on all 7,397 paragraphs to add source_chain
- Estimated cost: ~$2-4 USD (Claude Haiku at batch prices — source_chain extraction doesn't need Opus)
- Estimated time: 4-6 hours overnight
- No downtime required — chatbot runs on existing data while re-enrichment runs
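
A minimal sketch of the backfill loop, assuming the enrichment stack already uses the Anthropic Python SDK. The prompt constant, the model id, and the `text` column name are placeholders; the production run would go through the batch API (which the cost estimate assumes) rather than a sequential loop.

```python
import json
import sqlite3
import anthropic  # assumed to be the SDK the existing enrichment scripts use

# Placeholder prompt: the real text is the Phase 2 source_chain spec above.
SOURCE_CHAIN_PROMPT = "Extract the source_chain JSON object for this paragraph:"

# Backfill sketch. Model id is a placeholder; a batch-API run replaces this loop.
def backfill_source_chain(db_path: str, model: str = "claude-haiku-model-id") -> None:
    client = anthropic.Anthropic()
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT id, text FROM paragraphs WHERE source_chain IS NULL"
        ).fetchall()
        for pid, text in rows:
            reply = client.messages.create(
                model=model,
                max_tokens=1024,
                messages=[{"role": "user",
                           "content": f"{SOURCE_CHAIN_PROMPT}\n\n{text}"}],
            )
            source_chain = reply.content[0].text  # expected to be the JSON object
            json.loads(source_chain)              # fail fast on malformed output
            conn.execute(
                "UPDATE paragraphs SET source_chain = ? WHERE id = ?",
                (source_chain, pid),
            )
            conn.commit()
    finally:
        conn.close()
```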

---

## Document Control

| Version | Date | Change |
|---------|------|--------|
| 1.0 | March 19, 2026 | Initial creation — strategy session with Claude |

Next update trigger: When Phase 0 is complete OR when source_chain spec is finalized.
