# Deep Exploration Plan — Point by Point
# Each comparison point: what they do, what we do, exact gap, exact steps, done criteria
# Total: 18 exploration points across 5 domains
# Source: direct code reads + sefaria.org/ai + developer docs + GitHub wiki
# March 2026

---

## HOW TO USE THIS DOCUMENT

Each point has 5 sections:
- SEFARIA: what they actually do (evidence-sourced)
- US: what we actually do (code-sourced)
- GAP: honest delta
- ACTION: exact steps to close the gap
- DONE WHEN: verifiable completion criteria

Points are ordered by priority (high impact + low effort first).

---

## POINT 1 — STABLE HUMAN-READABLE REFERENCE IDs

### SEFARIA
Every segment has a canonical Ref string — "Genesis 1:1", "Berakhot 2a:3", "Rashi on Genesis 4:5:2".
- Defined in Index schema via sectionNames array
- Permanent: "Genesis 1:1" will always mean the same thing
- Human-readable: tells you exactly what the text is without looking it up
- Publicly addressable: sefaria.org/Genesis.1.1 works for everyone
- Used in citations, links, the Linker, the API, source sheets, everywhere
- Cross-platform: any app, any website, any scholar can cite the same Ref

### US
Each paragraph has an auto-increment integer: paragraph.id = 2341.
- Not human-readable: 2341 tells you nothing
- Not stable across environments: dev DB vs prod DB have different IDs
- Not publicly addressable: no URL exists for paragraph 2341
- Chatbot citations show [P2341] — user cannot verify what this refers to
- Fields that COULD build a stable ref already exist in DB:
  - book.slug (e.g. "patience-and-positive-thinking")
  - chapter.order_index (e.g. 3)
  - paragraph.order_index (e.g. 15)
  - → Could produce: "patience-and-positive-thinking:3:15"

### GAP
Our citations are opaque integers. Sefaria's citations are universally verifiable human-readable addresses. Anyone reading our chatbot answer cannot independently verify what [P2341] says. This destroys trust at the foundation.

### ACTION
1. Add computed property or DB column `ref_id` to Paragraph:
   - Format: `{book.slug}:{chapter.order_index}:{paragraph.order_index}`
   - Example: `patience-and-positive-thinking:3:15`
   - Computed at export time (no migration needed — all source fields exist)
2. Update chat_engine.py: when building sources dict, include ref_id alongside para.id
3. Update generate_answer prompt: cite as `[patience-and-positive-thinking:3:15]` not `[P2341]`
4. Build public URL: GET /read/{book-slug}/{chapter-order}/{para-order} → returns paragraph + context
5. Update centric exporters: include ref_id in every JSON export
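
A minimal sketch of step 1, as pure functions (the ORM wiring in models.py would wrap these; names here are illustrative):

```python
# Sketch: build/parse the stable ref_id from fields that already exist in the DB.
def build_ref_id(book_slug: str, chapter_order: int, para_order: int) -> str:
    """Stable, human-readable address: {book.slug}:{chapter.order_index}:{paragraph.order_index}."""
    return f"{book_slug}:{chapter_order}:{para_order}"

def parse_ref_id(ref_id: str) -> tuple[str, int, int]:
    """Inverse of build_ref_id; raises ValueError on malformed refs."""
    slug, chapter, para = ref_id.rsplit(":", 2)  # rsplit keeps any ':' inside the slug intact
    return slug, int(chapter), int(para)

assert build_ref_id("patience-and-positive-thinking", 3, 15) == "patience-and-positive-thinking:3:15"
assert parse_ref_id("patience-and-positive-thinking:3:15") == ("patience-and-positive-thinking", 3, 15)
```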

### DONE WHEN
- Every paragraph in DB has a computable ref_id from existing fields
- Chatbot answer shows ref like "patience-and-positive-thinking:3:15" (not P2341)
- URL /read/patience-and-positive-thinking/3/15 returns that paragraph in a browser
- Can paste any ref_id into URL bar and get the exact paragraph

---

## POINT 2 — DAILY HABIT / RETURN MECHANISM (Equivalent of Daf Yomi)

### SEFARIA
Daf Yomi: all 2,711 pages of Babylonian Talmud, one page per day, 7.5 year cycle.
- Hundreds of thousands of participants worldwide complete this cycle together
- Sefaria shows "Today's Daf" on homepage and mobile app
- Calendar API (GET /api/calendars) returns the exact daily learning for any date + timezone
- Also: weekly Parasha, daily Mishnah, Nach Yomi, Mishna Yomi — parallel tracks
- Users don't come to Sefaria because the search is best. They come because today is their day.
- Mobile push notifications remind users of daily learning
- A century-old global habit infrastructure (Daf Yomi began in 1923) that Sefaria plugged into

### US
- Telegram bot exists (token in memory, chat_id 1301565858)
- No daily cron runs
- viral_score.best_quote exists on every enriched paragraph (ready to use)
- No daily learning schedule designed
- No calendar awareness (Ramadan, Juma, etc.)
- Users come only when they have a question — no return mechanism

### GAP
We have all the ingredients for a daily habit loop, but none of them have been assembled. Sefaria plugged into a long-established Jewish learning cycle. We have an equivalent: daily Quran tilawat, the Juma (Friday) khutba tradition, Ramadan daily tafsir, the Islamic new year. We are not plugged into any of it.

### ACTION
Phase 1 (this weekend, 2 hours):
1. Write `/root/annotation_tool_v2/scripts/daily_wisdom.py` (sketched after this phase):
   - Query: paragraphs p JOIN chapters c JOIN books b
     WHERE p.viral_score IS NOT NULL AND p.shareable = 1
   - Select one paragraph by weighted random draw, weight = viral_score.composite
     (plain ORDER BY RANDOM() ignores quality; ORDER BY composite DESC repeats the same top quote)
   - Extract best_quote from viral_score JSON
   - Format message: "[Best Quote]\n— Maulana Wahiduddin Khan, {book_title}\n\nRead more: ask.spiritualmessage.org"
   - Send via Telegram bot API (token already in HANDOVER_DOCUMENTATION.md)
2. Add cron job: `0 7 * * * python /root/annotation_tool_v2/scripts/daily_wisdom.py`
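
A minimal sketch of daily_wisdom.py. DB_PATH, the chapter_id/book_id column names, and the JSON shape of viral_score are assumptions; the Telegram sendMessage call is the standard Bot API:

```python
import json
import random
import sqlite3
import urllib.request

DB_PATH = "/root/annotation_tool_v2/app.db"  # assumption: actual path may differ
BOT_TOKEN = "..."                            # from HANDOVER_DOCUMENTATION.md
CHAT_ID = "1301565858"

def pick_paragraph(conn):
    rows = conn.execute("""
        SELECT p.id, p.viral_score, b.title
        FROM paragraphs p
        JOIN chapters c ON c.id = p.chapter_id
        JOIN books b ON b.id = c.book_id
        WHERE p.viral_score IS NOT NULL AND p.shareable = 1
    """).fetchall()
    # Weighted random draw: high-composite paragraphs surface more often,
    # but every shareable paragraph still has a chance (no repeats of one top quote).
    weights = [json.loads(r[1]).get("composite", 1) for r in rows]
    return random.choices(rows, weights=weights, k=1)[0]

def send_telegram(text):
    url = f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage"
    payload = json.dumps({"chat_id": CHAT_ID, "text": text}).encode()
    req = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

if __name__ == "__main__":
    conn = sqlite3.connect(DB_PATH)
    _, viral_json, book_title = pick_paragraph(conn)
    quote = json.loads(viral_json)["best_quote"]
    send_telegram(f"{quote}\n— Maulana Wahiduddin Khan, {book_title}\n\nRead more: ask.spiritualmessage.org")
```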

Phase 2 (one weekend):
3. Create Islamic calendar schedule in a YAML file:
   - Ramadan days: send Quran-linked paragraphs (where creation_plan includes "discovering_god_through_creation")
   - Juma (every Friday): send paragraphs with argument_mapping to "peace_conflict" or "meaning_purpose"
   - Dhul Hijjah days 1-10: send paragraphs about sacrifice, gratitude, hereafter
4. Modify daily_wisdom.py to read from calendar schedule

Phase 3 (future):
5. Plan "Maulana's 365 Wisdom" — structured reading plan covering all 145 books proportionally
   - 145 books × average ~400 enriched paragraphs = ~58,000 paragraphs
   - At 3 paragraphs/day = 53-year plan. Need to curate to top 365 × 3 = 1,095 best_quotes.
   - This is the Islamic equivalent of a multi-year Daf Yomi cycle

### DONE WHEN
- Telegram bot sends one wisdom message at 7am IST every day (test for 7 days)
- Message includes best_quote, book title, and link
- Juma messages are different from weekday messages (Friday filter active)
- Ramadan messages draw from Quran-connected paragraphs

---

## POINT 3 — REFERENCE DETECTION ACCURACY BENCHMARK

### SEFARIA
- Trained NER model: `he_torah_ner` (published on HuggingFace, MIT license)
- Two-stage pipeline: span detection (he_ref_ner) then structure parsing (he_subref_ner)
- Published F-score: 82.96% on Hebrew citation detection
- Handles: Biblical refs, Talmud refs, Mishnah, Midrash, abbreviations, ibid, Hebrew numerals
- Rabbinic Embedding Leaderboard: benchmarks 18 embedding models on domain-specific retrieval
- They know exactly how good their detection is

### US
- Custom regex in quran_detector.py (~21KB — many patterns)
- Custom regex in hadith_detector.py for 7 collections
- hadith_matcher.py: fuzzy match against local JSON database, match_score 0-100
- No published accuracy metric
- No test set exists
- We do not know our F-score
- We do not know how many Quran verses we miss
- We do not know how many false positives we generate

### GAP
We are operating blind. We claim to detect references but have never measured whether we are right. Every decision about improving the detector is based on intuition, not data.

### ACTION
Phase 1 — Build ground truth (1 weekend):
1. Pick 3 already-processed books
2. Manually read 100 paragraphs from these books
3. For each paragraph, manually note every Quran verse and hadith cited in the text
4. Record ground truth as JSON: {paragraph_id, expected_refs: [{type, surah, ayah, collection, hadith_number}]}
5. Run our detector on these 100 paragraphs
6. Compare detector output vs ground truth
7. Calculate precision (share of detected refs that are correct), recall (share of actual refs that were detected), and F-score (see the sketch below)
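
A minimal sketch of steps 6-7, assuming each ref is normalized to a comparable tuple before scoring:

```python
# Precision / recall / F1 for the detector vs. the manual ground truth.
# A ref is a normalized tuple, e.g. ("quran", 2, 153) or ("hadith", "bukhari", 6018);
# detector output must be normalized the same way before comparison.
def score(ground_truth: dict, detected: dict):
    tp = fp = fn = 0
    for para_id, expected in ground_truth.items():
        expected = set(expected)
        found = set(detected.get(para_id, []))
        tp += len(expected & found)  # correct detections
        fp += len(found - expected)  # false positives
        fn += len(expected - found)  # missed refs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gt = {101: [("quran", 2, 153)], 102: [("hadith", "bukhari", 6018)]}
det = {101: [("quran", 2, 153), ("quran", 3, 200)], 102: []}
print(score(gt, det))  # (0.5, 0.5, 0.5)
```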

Phase 2 — Measure and fix (next weekend):
8. If F-score > 80%: our regex is good, move on
9. If F-score 60-80%: identify which pattern types fail, add regex patterns
10. If F-score < 60%: consider training a small NER model using our detected+verified refs as training data
11. Document final F-score in LEARNINGS.md

Phase 3 — Urdu detection (future):
12. Test detection on Urdu books (source_language='ur')
13. Quran references in Urdu text have different patterns: "سورة البقرة" etc.
14. Measure separately — Urdu F-score is likely much lower than English F-score

### DONE WHEN
- We have a published F-score for our Quran detector (precision, recall, F)
- We have a published F-score for our Hadith detector
- Both scores are documented in LEARNINGS.md as L-NEW
- Scores updated every time we change the detector

---

## POINT 4 — DUAL-LLM VALIDATION PIPELINE

### SEFARIA (Pirkei Avot Learning Guide)
Pipeline confirmed from sefaria.org/ai:
1. Each mishnah + commentaries → Gemini
2. Gemini: identifies discussion questions → generates summary of each commentator's answer
3. ALL Gemini output → Claude 3 Opus
4. Claude 3 Opus scores each output on two dimensions (1-10):
   - Relevance: how relevant is the answer to the question?
   - Accuracy: how accurately does the summary reflect the commentary?
5. Anything scoring ≤5 on either → additional human review
6. 14,000 of 15,500 source introductions reviewed by Sefaria staff
This is AI generating → AI validating → humans reviewing worst cases.

### US
- chat_engine.py: single LLM generates answer (Gemini proxy → DeepSeek → OpenRouter)
- No validation LLM
- No scoring of answer quality
- No flag for low-confidence answers
- query_log table records all answers but no quality dimension
- No way to know which answers are wrong without user reporting it

### GAP
We have no internal quality signal. Every answer looks the same to us whether it's excellent or hallucinated. Sefaria's dual-LLM catches the worst answers before they reach users. We catch nothing.

### ACTION
Add to chat_engine.py after generate_answer():

1. Create a validate_answer(question, answer_sections, sources) function (sketched after this list):
   - For each cited paragraph [P{id}] in the answer:
     - Load paragraph.text
     - Load the specific claim the answer makes about it (extract surrounding sentence)
     - Call cheap LLM (Haiku or DeepSeek): "Does this paragraph support this claim? Score 1-10 for relevance and 1-10 for accuracy."
     - Store scores in query_log: {para_id, relevance_score, accuracy_score}
   - Overall answer_quality_score = min of all individual scores

2. Add to query_log table:
   - answer_quality_score (Float, nullable) — computed by validator
   - flagged_for_review (Boolean, default False)
   - validation_ran_at (DateTime)

3. Flagging rule:
   - If answer_quality_score ≤ 5 → set flagged_for_review = True
   - Admin dashboard shows flagged answers with filter option

4. Cost: one cheap LLM call per citation in the answer (usually 3-5 citations)
   - Haiku: ~$0.001 per validation call
   - Run on every answer initially, then consider sampling at 20% after baseline established

5. Future: scholar_review_queue table pulls from flagged_for_review=True
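
A minimal sketch of step 1, assuming a call_llm(prompt) helper that wraps the existing Haiku/DeepSeek client and returns plain text (production code would need a JSON-parse guard on the reply):

```python
import json
import re

VALIDATION_PROMPT = """Paragraph:
{paragraph}

Claim made about it:
{claim}

Does this paragraph support this claim? Score 1-10 for relevance and 1-10 for accuracy.
Reply as JSON: {{"relevance": N, "accuracy": N}}"""

def validate_answer(answer_text: str, sources: dict, call_llm) -> dict:
    """Score every [P{id}] citation; overall quality = min of all individual scores."""
    scores = []
    for match in re.finditer(r"\[P(\d+)\]", answer_text):
        para_id = int(match.group(1))
        if para_id not in sources:
            continue
        # The "claim" is the sentence surrounding the citation marker.
        start = answer_text.rfind(".", 0, match.start()) + 1
        end = answer_text.find(".", match.end())
        claim = answer_text[start:end if end != -1 else None].strip()
        reply = call_llm(VALIDATION_PROMPT.format(paragraph=sources[para_id]["text"], claim=claim))
        s = json.loads(reply)
        scores.append({"para_id": para_id,
                       "relevance_score": s["relevance"],
                       "accuracy_score": s["accuracy"]})
    overall = min((min(s["relevance_score"], s["accuracy_score"]) for s in scores), default=None)
    return {"per_citation": scores, "answer_quality_score": overall}
```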

### DONE WHEN
- validate_answer() function exists in chat_engine.py
- query_log table has answer_quality_score and flagged_for_review columns
- Chat dashboard shows flagged answers in a filterable view
- At least 100 answers have been validated and scores reviewed

---

## POINT 5 — TOPIC SOURCE RANKING (Equivalent of Linear Integer Programming)

### SEFARIA
- Developer docs explicitly list: "Linear Integer Programming For Topic Pages Sources Selection"
- Algorithm selects which sources appear on each of the 5,000 topic pages
- Goal: balance diversity (not 10 paragraphs from one book) + importance (higher-cited sources first) + representation (different time periods, different authors)
- PageRank-style signals from 3.3 million link graph inform importance
- Result: topic pages show the canonical best sources, not just any matching sources
- User experience: "which 5 paragraphs best represent what Judaism says about prayer?"

### US
- topic_centric_exporter.py groups all CONCEPT entity paragraphs by concept name
- All citations in a topic JSON are weighted EQUALLY
- No diversity enforcement: if one book has 40 paragraphs about "patience", they all appear
- No importance ranking: a substantive paragraph with glance_text ranks the same as a heading paragraph
- viral_score.composite already exists on every paragraph (0-100) but not used for ranking
- No deduplication by book: same book can dominate a topic

### GAP
Our topic JSONs return raw matches in DB order. Sefaria's topic pages return the best 5-10 sources with diversity and quality filtering. A user browsing our topics gets an overwhelming raw list. Sefaria's user gets a curated highlight reel.

### ACTION
Update topic_centric_exporter.py when rebuilding topic JSONs:

1. Load viral_score.composite for each citation
2. Sort citations within each topic by viral_score.composite descending
3. Apply diversity cap: maximum 3 citations per book_slug per topic
4. Apply quality filter: only include citations where shareable=True AND type='paragraph' (exclude headings)
5. Apply PageRank proxy: add citation_count field to Paragraph (how many centric exports it appears in)
   - Run once: for each paragraph, count appearances across all 6 centric export types
   - Store in Paragraph.citation_count
   - In topic ranking: blend 60% viral_score + 40% citation_count
6. Rebuild all topic_centric JSONs with new ranking

Result: topic pages show the 10 best paragraphs about "patience", diverse across books, with the most cited and most viral-scored paragraphs first.
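
A minimal sketch of steps 2-5 for one topic, assuming each citation dict carries the fields named above (shareable, type, book_slug, viral_score_composite, citation_count):

```python
import math
from collections import defaultdict

def rank_topic_citations(citations, max_citation_count, max_per_book=3):
    # Quality filter (step 4): shareable paragraphs only, headings excluded.
    pool = [c for c in citations if c.get("shareable") and c.get("type") == "paragraph"]

    # Blend (step 5): 60% viral score (0-100 scaled to 0-1) + 40% normalized citation_count.
    def blended(c):
        cc_norm = math.log(c.get("citation_count", 0) + 1) / math.log(max_citation_count + 1)
        return 0.6 * c["viral_score_composite"] / 100 + 0.4 * cc_norm

    pool.sort(key=blended, reverse=True)

    # Diversity cap (step 3): at most 3 citations per book.
    per_book, ranked = defaultdict(int), []
    for c in pool:
        if per_book[c["book_slug"]] < max_per_book:
            per_book[c["book_slug"]] += 1
            ranked.append(c)
    return ranked
```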

### DONE WHEN
- topic_centric_exporter.py applies viral_score ranking + 3-per-book cap
- Paragraph.citation_count field exists and is populated
- All 143 topic JSONs rebuilt with ranked, diverse citations
- Can verify: open patience.json and confirm citations are from multiple books + sorted by score

---

## POINT 6 — ACTIVATE 3 UNUSED CENTRIC TYPES IN CHATBOT

### SEFARIA
Not applicable — Sefaria does not have these. This is a gap WE have internally (not vs Sefaria).

### US
We have 6 centric export types. Only 3 are wired into the chatbot retrieval:
- ✅ verse_centric → Strategy 4 (cross-ref lookup) + Strategy 5 (interlink hop)
- ✅ hadith_centric → Strategy 4 + Strategy 5
- ✅ topic_centric → Strategy 3 (semantic topic match)

3 are built and exported but NEVER used by chatbot:
- ❌ timeline_centric → grouped by 7 historical eras
- ❌ argument_centric → grouped by 10 contemporary issues
- ❌ creation_plan_centric → grouped by 14 theological aspects

This means when a user asks:
- "What does Maulana say about political Islam?" → argument_centric/political_islam.json would answer this perfectly. Instead, only FTS5 + semantic search run.
- "What happened in early Islamic era?" → timeline_centric/early_islamic.json has this. Not used.
- "Where does Maulana discuss patience as strategy?" → creation_plan_centric/patience_as_strategy.json. Not used.

### GAP
Half of our centric export types (3 of 6) produce no retrieval benefit. We built the exports but never connected them to the chatbot. These are the richest domain-specific signals we have.

### ACTION
Update chat_engine.py retrieve_chunks():

Strategy 3 currently: match query to topic_centric JSON only.
Extend Strategy 3 to query all 6 centric types:

1. argument detection: scan query for contemporary issue keywords
   - "political Islam", "terrorism", "gender", "interfaith", "modernity", "meaning", 
     "peace", "freedom", "science religion", "suffering"
   - If detected: load argument_centric/{issue_key}.json → add paragraph_ids at score 0.65

2. creation plan detection: scan query for Maulana's theological aspect keywords
   - "purpose of life", "patience", "dawah", "hereafter", "free will", "nature", 
     "test of life", "positive thinking", "paradise"
   - If detected: load creation_plan_centric/{aspect_key}.json → add paragraph_ids at score 0.65

3. timeline detection: scan query for era keywords or year ranges
   - "early Islamic", "Prophet's time", "medieval", "colonial", "modern", "1900s"
   - Also: if year reference detected (e.g. "622 CE") → load timeline_centric/early_islamic.json
   - Add paragraph_ids at score 0.6

4. Add to retrieval_meta: which centric types were activated per query
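
A minimal sketch of the routing logic, with an illustrative EXPORT_DIR and a deliberately trimmed keyword map (the full lists are above):

```python
import json
import re
from pathlib import Path

EXPORT_DIR = Path("/root/annotation_tool_v2/exports")  # assumption: actual path may differ

CENTRIC_ROUTES = {
    ("argument_centric", 0.65): {
        "political_islam": ["political islam"],
        "peace_conflict": ["peace"],
    },
    ("creation_plan_centric", 0.65): {
        "patience_as_strategy": ["patience"],
        "purpose_of_life": ["purpose of life"],
    },
    ("timeline_centric", 0.6): {
        "early_islamic": ["early islamic", "prophet's time", re.compile(r"\b\d{3}\s*CE\b", re.I)],
    },
}

def centric_hits(query: str):
    """Yield (paragraph_id, score, centric_key) for every centric export the query activates."""
    q = query.lower()
    for (centric_type, score), routes in CENTRIC_ROUTES.items():
        for key, patterns in routes.items():
            hit = any(p.search(query) if hasattr(p, "search") else p in q for p in patterns)
            if not hit:
                continue
            path = EXPORT_DIR / centric_type / f"{key}.json"
            if path.exists():
                for pid in json.loads(path.read_text()).get("paragraph_ids", []):
                    yield pid, score, f"{centric_type}/{key}"  # goes into retrieval_meta (step 4)
```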

### DONE WHEN
- Query "What does Maulana say about political Islam?" activates argument_centric/political_islam.json
- Query "What does patience as strategy mean?" activates creation_plan_centric/patience_as_strategy.json
- Query about early Islamic history activates timeline_centric/early_islamic.json
- Retrieval meta shows centric type activations per query
- Test with 10 queries covering all 3 new types

---

## POINT 7 — CITATION PUBLIC VERIFIABILITY

### SEFARIA
When Sefaria cites "Genesis 1:1" in any answer:
- User can type sefaria.org/Genesis.1.1 and see the exact text
- The citation IS the verification — it's a globally unique address
- linkFailed:true flag tells the user if a citation could not be verified
- Any person, anywhere, any device, can verify any Sefaria citation independently
- Trust is structural, not institutional: you don't need to trust Sefaria, you can check yourself

### US
When our chatbot cites [P2341]:
- User sees an integer that means nothing
- No URL exists to verify what P2341 says
- User must trust us that P2341 actually supports the claim
- Even if we show the paragraph text inline, the user cannot verify it is real
- Nothing links the citation to any independently verifiable source
- If we have a bug and cite the wrong paragraph, no one can detect it

### GAP
Our citations cannot be externally verified. Trust is institutional (you trust us) not structural (you can check yourself). This is the single biggest long-term credibility problem.

### ACTION
This is solved by Point 1 (stable Ref IDs) + one additional step:

1. Build /read/{book-slug}/{chapter-order}/{para-order} route (sketched after this list):
   - Returns: paragraph text + book info + chapter title + surrounding paragraphs (context)
   - No login required — fully public
   - Simple HTML page, no JavaScript required
   - Shows: "From '{book_title}' by Maulana Wahiduddin Khan, Chapter '{chapter_title}', paragraph {order_index}"

2. Update chatbot answer format:
   - Inline: [P2341] → becomes [patience-and-positive-thinking:3:15]
   - In source cards below answer: show book + chapter + first 150 chars + "Read in context →" link

3. Add linkFailed equivalent:
   - When chatbot cites [P{id}], verify para.id exists in DB and para.deleted = False
   - If not found: show "[Citation unavailable]" instead of broken link
   - Log to query_log: citation_verified = False

4. Optional future: submit our public paragraph URLs to Google for indexing
   - Maulana's wisdom becomes searchable via Google
   - Each paragraph becomes a landing page
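
A minimal sketch of the /read route, shown as a standalone Flask app for brevity; Book/Chapter/Paragraph are the existing models from models.py (their exact import path and relationship columns are assumptions):

```python
from flask import Flask, abort, render_template_string
from models import Book, Chapter, Paragraph  # existing models.py (import path assumed)

app = Flask(__name__)

PAGE = """<h1>{{ book.title }}</h1>
<h2>Chapter: {{ chapter.title }}</h2>
<p><em>Paragraph {{ para.order_index }}, by Maulana Wahiduddin Khan</em></p>
{% for p in context %}<p{% if p.id == para.id %} style="background:#ffc"{% endif %}>{{ p.text }}</p>{% endfor %}"""

@app.get("/read/<slug>/<int:chapter_order>/<int:para_order>")
def read_paragraph(slug, chapter_order, para_order):
    book = Book.query.filter_by(slug=slug).first() or abort(404)
    chapter = Chapter.query.filter_by(book_id=book.id, order_index=chapter_order).first() or abort(404)
    para = Paragraph.query.filter_by(chapter_id=chapter.id, order_index=para_order).first() or abort(404)
    # Context: the cited paragraph plus its neighbours, so claims can be read in place.
    context = (Paragraph.query
               .filter(Paragraph.chapter_id == chapter.id,
                       Paragraph.order_index.between(para_order - 1, para_order + 1))
               .order_by(Paragraph.order_index).all())
    return render_template_string(PAGE, book=book, chapter=chapter, para=para, context=context)
```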

### DONE WHEN
- /read/patience-and-positive-thinking/3/15 returns a readable webpage
- Chatbot source cards link to /read/ URLs
- Broken citation IDs show "[Citation unavailable]" not broken link
- Can share any chatbot citation URL with anyone and they see the source

---

## POINT 8 — EMBEDDING MODEL BENCHMARK

### SEFARIA (Rabbinic Embedding Leaderboard, Jan 2026)
Tested 18 models on Hebrew/Aramaic religious text retrieval:
1. Gemini Embedding 001: 93.9% recall@1 (#1)
2. Qwen3-Embedding-8B: 89.4% (#2)
3. Voyage multilingual-2: 84.2% (#3)
4. OpenAI text-embedding-3-large: 69.9%
5. Hebrew-specific BERT models: 1-2% (worst!)

Key finding: domain-specific models lose badly to good general multilingual models.
They now have evidence for which model to use. Zero guessing.

### US
Current model: intfloat/multilingual-e5-base
- 768-dim, ~278M params, 512-token limit
- Supports 100+ languages including Urdu/Hindi/Arabic
- Chosen for multilingual support, not because it was benchmarked
- We do not know our recall@1 on Islamic text retrieval
- We do not know if Gemini Embedding or Qwen3 would be better

### GAP
We are using a model we chose by intuition. Sefaria has evidence that Gemini Embedding 001 is 93.9% recall@1 on their corpus. If the same pattern holds for Islamic text (general multilingual > domain-specific), we may be leaving 20-30% recall on the table.

### ACTION
Phase 1 — Build Islamic evaluation dataset (prerequisite, covered in Point 9):
- Need 50+ questions with known-relevant paragraphs before we can benchmark

Phase 2 — Run benchmark (after eval set exists):
1. Set up evaluation script (sketched below):
   - Input: 50 questions, each with ground-truth paragraph IDs
   - For each model: embed all 7,400 paragraph texts, embed each question
   - Measure: recall@1 (did the top-1 result match ground truth?), recall@5, recall@10
2. Test models:
   - Current: intfloat/multilingual-e5-base (baseline)
   - Candidate 1: Gemini Embedding 001 (via API) — best on Sefaria leaderboard
   - Candidate 2: Qwen3-Embedding-8B (open source, self-hostable, #2 on Sefaria)
   - Candidate 3: jina-embeddings-v3 (explicitly lists Urdu in supported languages)
3. If winner beats E5 by ≥10 recall@1 points → switch and re-index all paragraphs
4. Document results in LEARNINGS.md

Estimated cost:
- Gemini Embedding 001 via API: ~$0.00002/token × 7,400 paragraphs × avg 200 tokens = ~$30 one-time
- Qwen3-Embedding-8B: free (self-host on our Hetzner server, 8B params needs ~16GB RAM — fits)
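
A minimal sketch of the benchmark loop for the E5 baseline, using sentence-transformers (other candidates plug in by swapping the encode calls); the toy inputs stand in for the Point 9 eval set:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def recall_at_k(q_vecs, p_vecs, ground_truth, para_ids, k):
    """ground_truth[i] = set of relevant para_ids for question i; vectors pre-normalized."""
    sims = q_vecs @ p_vecs.T
    hits = sum(bool(gt & {para_ids[j] for j in np.argsort(-sims[i])[:k]})
               for i, gt in enumerate(ground_truth))
    return hits / len(ground_truth)

# Toy inputs; the real ones come from the Point 9 eval set and the paragraphs table.
questions = ["What does Maulana say about patience?"]
para_texts = ["Patience is the essence of the divine plan...", "On gratitude..."]
para_ids = [101, 102]
ground_truth = [{101}]

model = SentenceTransformer("intfloat/multilingual-e5-base")
# E5 models expect "query: " / "passage: " prefixes.
q_vecs = model.encode([f"query: {q}" for q in questions], normalize_embeddings=True)
p_vecs = model.encode([f"passage: {t}" for t in para_texts], normalize_embeddings=True)
for k in (1, 5, 10):
    print(f"recall@{k} =", recall_at_k(q_vecs, p_vecs, ground_truth, para_ids, k))
```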

### DONE WHEN
- Islamic eval set of 50+ questions exists
- Benchmark script runs on all 4 candidate models
- Results table: {model, recall@1, recall@5, recall@10, cost}
- Decision documented: switch or keep current model
- If switching: all paragraph embeddings re-indexed

---

## POINT 9 — EVALUATION DATASET (Islamic RAG Benchmark)

### SEFARIA
Rabbinic Embedding Leaderboard: a benchmark dataset for Hebrew/Aramaic religious text retrieval.
- Tests: given a query, which passage is most relevant?
- Published on HuggingFace, openly available
- Used to compare 18 models objectively
- They know exactly what "better retrieval" means in measurable terms
- Any improvement to their system can be measured before deploying

### US
- No evaluation dataset exists
- No way to measure if a change to chat_engine.py improves or degrades retrieval
- Every "improvement" is based on subjective impression from 3-5 test queries
- Cannot confidently change the embedding model, reranker blend, or strategy weights
- Cannot prove the chatbot is getting better over time

### GAP
We are flying blind on quality. Every retrieval change is a guess. This is our most expensive technical debt, because it means the other improvements (Points 4 and 8) cannot be validated.

### ACTION
Phase 1 — Question design (1 day):
1. Write 50 questions covering:
   - 10 factual questions ("What does Maulana say about prayer?")
   - 10 emotional questions ("I feel lost and purposeless, what wisdom does Maulana offer?")
   - 10 Quran-based questions ("Which books discuss Quran 2:153?")
   - 10 contemporary issue questions ("What does Maulana say about terrorism?")
   - 10 cross-book questions ("Where does Maulana use historical examples?")

Phase 2 — Ground truth annotation (1 weekend):
2. For each question, manually find the 3-5 best paragraphs in our DB
3. Read actual paragraph text to confirm relevance
4. Record: {question, relevant_para_ids: [list], notes}
5. Save as /root/critique/EVAL/islamic-eval-set-v1.json

Phase 3 — Run baseline (1 evening):
6. Run all 50 questions through current chat_engine.py retrieve_chunks()
7. Check: did relevant para_ids appear in top 15?
8. Calculate: recall@1, recall@5, recall@15
9. This is our baseline — all future changes are measured against it

Phase 4 — Automate (1 day):
10. eval_runner.py: takes eval JSON → runs all questions → computes metrics → saves report
11. Run after every significant change to chat_engine.py
12. Keep history of scores in /root/critique/EVAL/results/
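
A minimal sketch of eval_runner.py, assuming retrieve_chunks(question) is importable and returns a ranked list of objects carrying a para_id (the real signature may differ):

```python
import json
import time
from pathlib import Path

from chat_engine import retrieve_chunks  # assumption: importable as a module function

EVAL_SET = Path("/root/critique/EVAL/islamic-eval-set-v1.json")
RESULTS_DIR = Path("/root/critique/EVAL/results")

def run():
    eval_set = json.loads(EVAL_SET.read_text())
    hits = {1: 0, 5: 0, 15: 0}
    for item in eval_set:
        ranked = [c.para_id for c in retrieve_chunks(item["question"])]
        relevant = set(item["relevant_para_ids"])
        for k in hits:
            hits[k] += bool(relevant & set(ranked[:k]))
    report = {f"recall@{k}": round(v / len(eval_set), 3) for k, v in hits.items()}
    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
    (RESULTS_DIR / f"run-{int(time.time())}.json").write_text(json.dumps(report, indent=2))
    print(report)

if __name__ == "__main__":
    run()
```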

### DONE WHEN
- 50 questions with ground-truth para_ids exist in islamic-eval-set-v1.json
- Baseline scores computed: recall@1, recall@5, recall@15
- eval_runner.py exists and produces a report in < 5 minutes
- All future chat_engine improvements measured before/after using eval_runner

---

## POINT 10 — AI TRUST & QUALITY LAYER

### SEFARIA
7 mechanisms confirmed from sefaria.org/ai:
1. Transparency: AI icon visible on all AI-generated pages
2. Learning First: AI is learning aid, not replacement for rabbis
3. Continued Evaluation: ongoing monitoring of AI content
4. Feedback: dedicated AI feedback form (public URL)
5. Rome Call for AI Ethics (Microsoft, Google, IBM, Chief Rabbinate all signed)
6. Tone policy: "Never authoritative. Always shows differing views."
7. 14,000 of 15,500 AI source introductions reviewed by staff before publishing

### US
- No AI content marking
- No stated AI policy
- No feedback mechanism in chatbot UI
- No scholar review workflow
- No tone policy defined
- Users do not know they are reading AI-generated enrichments
- Chatbot answers carry no disclaimer or quality signal

### GAP
We have zero trust infrastructure. Sefaria has 7 layers. This matters especially for religious content — users need to know: (1) this is AI, (2) it is grounded in actual books, (3) they can report errors, (4) it is not a religious authority.

### ACTION

Step 1 (30 minutes — this weekend):
Write our AI policy — 4 sentences:
"We use AI to discover and present Maulana Wahiduddin Khan's wisdom. All answers are grounded in his actual books — every citation links to the original text. AI enrichments (emotional tags, topic labels, key quotes) are clearly marked. To report an error, click 'Report an issue' below any answer."
Place at: /about-ai or in chatbot sidebar

Step 2 (1 hour):
Add "Report an issue" button to chatbot UI:
- Below every chatbot answer: 👎 "Report an issue" → opens a text field
- Submits to new DB table: chatbot_feedback (id, session_id, question, answer, feedback_text, created_at)
- No login required
- Admin dashboard shows feedback queue
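
A minimal sketch of the endpoint behind the button, assuming the Flask-SQLAlchemy `db` object and a ChatbotFeedback model mirroring the table above (both assumptions):

```python
from datetime import datetime
from flask import Blueprint, jsonify, request

feedback_api = Blueprint("feedback_api", __name__)

@feedback_api.post("/api/feedback")
def submit_feedback():
    data = request.get_json(silent=True) or {}
    row = ChatbotFeedback(                 # model mirroring the chatbot_feedback table
        session_id=data.get("session_id"),
        question=data.get("question"),
        answer=data.get("answer"),
        feedback_text=data.get("feedback_text", ""),
        created_at=datetime.utcnow(),
    )
    db.session.add(row)                    # no login required
    db.session.commit()
    return jsonify({"ok": True})
```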

Step 3 (2 hours):
Mark AI-generated content in chatbot UI:
- When showing emotional_tags, creation_plan aspects, or glance_text in the UI: add ✨ AI label
- Tooltip: "Generated by AI from Maulana's text. Click the citation to verify."

Step 4 (1 day, recurring monthly):
Scholar spot-check system:
- Add scholar_review_queue table: id, session_id, question, answer, status (pending/approved/flagged)
- Every 50th chatbot session: copy that session's Q&A to scholar_review_queue
- Monthly: export 10 pending reviews → email to a CPS scholar for review
- Log: scholar name, review date, verdict (accurate/inaccurate/partially accurate), notes

Step 5 (future):
Define tone policy in system prompt:
"Present Maulana's view as his scholarly interpretation, not as absolute religious authority.
Show nuance where it exists. Never say 'Islam says X' — say 'Maulana argues X, citing Quran Y:Z'."

### DONE WHEN
- AI policy exists at a public URL
- "Report an issue" button works and saves to DB
- AI-generated enrichments are visually marked in UI
- scholar_review_queue table exists and is populated
- At least one CPS scholar has reviewed at least 10 answers

---

## POINT 11 — SCHOLAR REVIEW WORKFLOW

### SEFARIA
- 14,000 of 15,500 AI-generated source introductions reviewed by Sefaria staff (confirmed)
- Team of "seasoned scholars, editors, and translators" built the framework + prompts
- Dual-LLM pipeline (Claude 3 Opus; score ≤5) flags the worst outputs for human review
- AI content marked with icon AND overview paragraph written by learning team
- Corrections accepted at corrections@sefaria.org
- This is not optional — it is built into their production pipeline

### US
- No scholar has ever reviewed our AI-generated enrichments
- creation_plan, emotional_tags, argument_mapping all generated by LLM with no human review
- Maulana's theological positions are subtle and context-dependent
- LLM may misattribute arguments, misread context, or over-generalize
- A single well-known Islamic scholar seeing a wrong enrichment could discredit the entire platform

### GAP
Our AI enrichments are live and unreviewed. For a religious scholar's content, this is a credibility risk. We need a minimum viable review system — not reviewing everything, but reviewing enough to catch systematic errors.

### ACTION
Phase 1 — Identify what needs review most urgently:
1. argument_mapping: most risk of error — LLM mapping Maulana's views to "terrorism" or "political Islam" requires theological precision
2. creation_plan: less risk — maps to explicit aspects Maulana himself defined
3. emotional_tags: low risk — emotional classification is not theologically loaded

Phase 2 — Build review interface (1 weekend):
1. Add /review/enrichment route in Flask
2. Shows: paragraph text + all enrichments (creation_plan, emotional_tags, argument_mapping, glance_text)
3. Reviewer can: approve, flag as incorrect, edit, add notes
4. Stores in enrichment_review table: para_id, reviewer, status, notes, reviewed_at
5. Login-gated (admin role)
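
A minimal sketch of the enrichment_review table (step 4) as a Flask-SQLAlchemy model, assuming the existing `db` object and the paragraphs table name:

```python
from datetime import datetime

class EnrichmentReview(db.Model):
    __tablename__ = "enrichment_review"
    id = db.Column(db.Integer, primary_key=True)
    para_id = db.Column(db.Integer, db.ForeignKey("paragraphs.id"), nullable=False)
    reviewer = db.Column(db.String(100), nullable=False)
    # pending / approved / incorrect / edited (maps to the reviewer actions above)
    status = db.Column(db.String(20), default="pending", nullable=False)
    notes = db.Column(db.Text)
    reviewed_at = db.Column(db.DateTime, default=datetime.utcnow)
```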

Phase 3 — Recruit first reviewer:
1. Identify one CPS-connected scholar comfortable with technology
2. Provide /review/enrichment access
3. Goal: review 50 paragraphs per month
4. Focus first on argument_mapping (highest risk)
5. Use their corrections to improve the LLM prompt

### DONE WHEN
- enrichment_review table exists
- /review/enrichment UI works for admin users
- At least one CPS scholar has account access
- 50 enrichments reviewed and verdict logged
- argument_mapping prompt updated based on corrections

---

## POINT 12 — INTERLINKING POLICY: LINK TYPE TAXONOMY

### SEFARIA
Link objects have explicit types:
- commentary: Text A is a commentary on Text B
- quotation: Text A explicitly quotes Text B
- allusion: Text A alludes to Text B without explicit citation
- midrash: classical midrashic connection
- related: topically related but not formally connected
- sheet-source: used together in a user source sheet
Each link type has different trust weight and different UI treatment.

### US
All references stored in one Reference table with no relationship type taxonomy.
- A paragraph that quotes Quran 2:153 is treated the same as a paragraph that alludes to it
- We detect the presence of a reference but not HOW the author uses it
- citation_context field exists on Reference (JSON: pattern, confidence, intro/commentary/discussion text)
- But this is not structured into a formal type

### GAP
We know WHAT Maulana cited but not HOW he used it. Sefaria distinguishes quotation from allusion from commentary. We treat all as equal. This matters for retrieval quality: a direct quotation of Quran 2:153 is stronger evidence than a passing allusion.

### ACTION
1. Read citation_context field on existing References — it already has pattern and confidence data
2. Map existing patterns to a type taxonomy:
   - pattern="introduction" + confidence>0.8 → type="quotation" (explicitly cited with text)
   - pattern="commentary" → type="commentary" (Maulana is explaining a verse)
   - pattern="discussion" → type="discussion" (verse mentioned in argument)
   - no citation_context → type="mention" (detected but context unclear)
3. Add ref_relationship_type column to Reference table (quotation/commentary/discussion/mention)
4. Run backfill: classify all existing References using citation_context data
5. Update Strategy 4 in chat_engine.py: weight quotation refs higher than mention refs
   - quotation: score 0.9 (highest trust)
   - commentary: score 0.8
   - discussion: score 0.7
   - mention: score 0.5
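
A minimal sketch of the classifier used for the backfill (steps 2-4); citation_context here is the parsed JSON from the existing Reference rows:

```python
TYPE_SCORES = {"quotation": 0.9, "commentary": 0.8, "discussion": 0.7, "mention": 0.5}

def classify_reference(citation_context):
    if not citation_context:
        return "mention"                   # detected, but context unclear
    pattern = citation_context.get("pattern")
    confidence = citation_context.get("confidence", 0)
    if pattern == "introduction" and confidence > 0.8:
        return "quotation"                 # explicitly cited with text
    if pattern == "commentary":
        return "commentary"                # Maulana explaining a verse
    if pattern == "discussion":
        return "discussion"                # verse mentioned in an argument
    return "mention"

# Backfill (step 4): iterate Reference rows, parse citation_context JSON, and
# store classify_reference(ctx) in the new ref_relationship_type column.
```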

### DONE WHEN
- ref_relationship_type column exists on Reference table
- All existing refs classified into 4 types
- Strategy 4 in chat_engine uses different scores per type
- Can query: "how many quotations of Quran 2:153 exist across our corpus?"

---

## POINT 13 — PAGERANK-STYLE PARAGRAPH IMPORTANCE

### SEFARIA (Virtual Havruta)
- Neo4j knowledge graph with 3.3 million links
- PageRank computed across the link graph
- A verse cited by 500 other texts gets high PageRank → appears first in topic results
- Graph distance from query to paragraphs used in multi-signal ranking
- Result: the most important/central passages surface first

### US
- No importance signal based on cross-paragraph citation patterns
- A paragraph cited in 5 different topic/verse/hadith exports = same weight as one cited in 1
- We have all the data needed: each centric export contains paragraph_ids
- But we don't count how many exports each paragraph appears in

### GAP
We have no signal for "how central is this paragraph to the corpus?" Sefaria uses PageRank on 3.3M links. We can approximate this with a simple citation_count across our 6 centric exports.

### ACTION
1. Add Paragraph.citation_count (Integer, default 0) to DB
2. Write populate_citation_counts.py:
   - For each topic_centric JSON: count how many times each para_id appears
   - For each verse_centric JSON: same
   - For each hadith_centric JSON: same
   - For argument_centric: same
   - For creation_plan_centric: same
   - Sum across all 6 export types → Paragraph.citation_count
3. Run script after each full export rebuild
4. Update retrieve_chunks() blend formula:
   - Current: 0.4 × original_score + 0.6 × reranker_score
   - New: 0.35 × original_score + 0.55 × reranker_score + 0.10 × citation_count_normalized
   - citation_count_normalized: log(citation_count + 1) / log(max_citation_count + 1) → 0-1 range
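
A worked sketch of the new blend (inputs assumed normalized to 0-1; max_citation_count=40 below is illustrative):

```python
import math

def blended_score(original, reranker, citation_count, max_citation_count):
    cc_norm = math.log(citation_count + 1) / math.log(max_citation_count + 1)
    return 0.35 * original + 0.55 * reranker + 0.10 * cc_norm

# A paragraph cited in 5 exports gets a modest, log-damped boost:
# 0.35*0.7 + 0.55*0.8 + 0.10*(log 6 / log 41) ≈ 0.245 + 0.440 + 0.048 ≈ 0.73
print(blended_score(0.7, 0.8, 5, 40))
```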

### DONE WHEN
- Paragraph.citation_count exists and populated for all paragraphs
- retrieve_chunks() blends citation_count into final scoring
- Top-cited paragraphs verified to appear at higher rank in retrieval
- Documented: which paragraphs have highest citation_count across our corpus

---

## POINT 14 — AI ETHICS TONE POLICY

### SEFARIA
Confirmed from eJewishPhilanthropy interview with Chief Data Officer:
"The tone of Sefaria's AI answers will not be authoritative: 'God said do X.' Instead, the library will often offer differing views on how commentators wrestled with a particular question, pushing users to dive deeper into the texts."
This is a deliberate design decision with religious sensitivity implications.

### US
- No explicitly defined tone policy
- Current system prompt: not reviewed for religious authority vs scholarly presentation
- Risk: chatbot might present Maulana's view as "Islam says X" when it should be "Maulana argues X"
- Maulana himself was careful to distinguish his interpretation from divine command
- This distinction matters enormously for scholarly and orthodox Muslim users

### GAP
We present Maulana's views without framing them as scholarly interpretation. This risks misrepresenting his humility and could alienate orthodox users who find such framings doctrinally problematic.

### ACTION
Update chat_engine.py build_prompt() system section:

Add these constraints to every system prompt:
"You are presenting Maulana Wahiduddin Khan's scholarly interpretation of Islamic texts.
RULES:
1. Never say 'Islam says X' or 'Allah commands X' based on Maulana's words alone
2. Always attribute: 'Maulana argues...', 'According to Maulana...', 'In Maulana's view...'
3. When Maulana cites Quran, present the verse as Quranic; present his interpretation as his reading
4. Never claim Maulana's view is the only valid Islamic position
5. If the question is about religious practice (fiqh), note that users should consult a scholar
6. Present emotional and spiritual guidance as wisdom, not commandment"

Test on 10 sample answers before and after to confirm tone change.

### DONE WHEN
- Tone rules added to system prompt in chat_engine.py
- 10 test answers reviewed: none say "Islam commands X"
- All answers attribute Maulana by name: "Maulana argues..."
- Fiqh/practical questions include "consult a scholar" note

---

## POINT 15 — FAILED CITATION FLAG (Equivalent of linkFailed)

### SEFARIA
When the Linker API detects a citation but cannot find it in the Sefaria corpus:
- Returns linkFailed: true
- User sees the citation text but it does not become a link
- Prevents hallucinated or incorrect citations from appearing as valid verified sources
- This is structural hallucination detection — runs at serving time, not generation time

### US
- chat_engine.py: _verify_citations() checks that cited para_ids exist in DB
- But: does not check whether the paragraph actually SUPPORTS the claim being made
- No linkFailed equivalent shown to user
- If a citation ID does not exist: currently silently removed, not flagged

### GAP
We verify citations exist but do not signal to users when they don't. Sefaria makes failed citations visible. We hide them. Users deserve to know if a citation could not be verified.

### ACTION
1. Update _verify_citations() in chat_engine.py:
   - Current: removes invalid citation IDs from answer
   - New: replaces invalid citation ID with [CITATION_UNVERIFIED] in answer text
   - Also: check if para.deleted = True → same treatment
2. Add to sources metadata: {verified: true/false, failure_reason: "not_found" | "deleted" | null}
3. Frontend: show [CITATION_UNVERIFIED] with a different color/icon (grey instead of green)
4. Log to query_log: citation_failures (JSON array of failed para_ids per session)
5. Monitor: if citation failure rate > 5% of answers → investigate why (likely bug in para IDs)
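
A minimal sketch of the updated verifier (steps 1-2), assuming a get_paragraph(para_id) lookup that returns None or an object with a deleted flag:

```python
import re

def verify_citations(answer_text, get_paragraph):
    failures = []

    def check(match):
        para_id = int(match.group(1))
        para = get_paragraph(para_id)
        if para is None:
            failures.append({"para_id": para_id, "failure_reason": "not_found"})
            return "[CITATION_UNVERIFIED]"
        if para.deleted:
            failures.append({"para_id": para_id, "failure_reason": "deleted"})
            return "[CITATION_UNVERIFIED]"
        return match.group(0)              # verified: keep the citation marker as-is

    verified_text = re.sub(r"\[P(\d+)\]", check, answer_text)
    return verified_text, failures         # failures go to query_log.citation_failures
```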

### DONE WHEN
- _verify_citations() marks failures as [CITATION_UNVERIFIED] instead of silently removing
- Frontend shows unverified citations differently from verified ones
- query_log tracks citation_failures per session
- Have 1 week of data on citation failure rate

---

## POINT 16 — ISLAMIC CALENDAR INTEGRATION

### SEFARIA
Calendar API: GET /api/calendars?timezone=Asia/Kolkata&date=2026-03-15
Returns:
- Today's Daf Yomi: Tractate Menachot 62
- Today's Parasha: Vayikra (weekly Torah portion)
- Daily Mishnah: Kelim 13:7
- Nach Yomi: Isaiah 28
- Jewish holidays if applicable

This creates a structured daily reason to return. Daf Yomi alone has hundreds of thousands of global participants. The calendar is the habit engine.

### US
- No calendar awareness
- Telegram bot sends nothing daily
- No Ramadan content strategy
- No Juma (Friday) content
- No Islamic new year awareness
- No connection to Islamic learning cycles

### GAP
Sefaria plugged into a 2,000-year-old Jewish learning tradition. We have an equivalent: the Islamic calendar has structured learning cycles we could plug into.

### ACTION
Phase 1 — Ramadan content (most impactful, 1.8 billion Muslims):
1. Detect Ramadan dates programmatically (Hijri calendar library in Python: `hijridate`)
2. During Ramadan: daily_wisdom.py selects paragraphs where:
   - creation_plan includes "hereafter_as_motivation" OR "patience_as_strategy"
   - OR References includes Quran verses from Surah 2 (Al-Baqarah, major Ramadan surah)
3. Add "🌙 Ramadan Wisdom" prefix to daily message

Phase 2 — Juma (every Friday):
1. Every Friday: select paragraph from argument_centric/peace_conflict.json or meaning_purpose.json
2. These topics align with Juma khutba themes
3. Add "🕌 Friday Reflection" prefix

Phase 3 — Islamic new year, Eid, Dhul Hijjah:
1. Build YAML: islamic_calendar_2026.yaml with key dates
2. For Eid al-Fitr: send paragraphs about gratitude, community
3. For Eid al-Adha: send paragraphs about sacrifice, purpose_of_life
4. For Dhul Hijjah days 1-10: send paragraphs about hereafter, pilgrimage significance

Phase 4 — Long-term: "Maulana's 365 Daily Wisdom"
1. Curate 365 best best_quotes from all processed books
2. Assign one to each day of year
3. Structure as multi-year cycle (like Daf Yomi)
4. This becomes the Islamic learning calendar we plug into

### DONE WHEN
- Ramadan detection works (test with mock date)
- Ramadan messages use Quran-linked paragraphs
- Friday messages use peace/meaning topic exports
- Key Islamic dates in 2026 have appropriate content mapped
- 365-day plan is in draft (even if content is thin)

---

## POINT 17 — READ-ONLY PUBLIC API

### SEFARIA
15+ REST endpoints, no authentication for read access:
- GET /api/v3/texts/{ref} → full text + metadata
- GET /api/links/{ref} → all connections
- GET /api/related/{ref} → related content
- GET /api/calendars → daily learning schedule
- GET /api/topics/{slug} → topic page content
- POST /api/search-wrapper → Elasticsearch search
- POST /api/find-refs → Linker citation detection

Result: 200+ apps built on this. Other developers solved problems Sefaria never thought of.

### US
- No public API
- No documentation
- Only our annotation tool and chatbot use the data
- CPS chapters worldwide could use an API but cannot
- Islamic educators could build learning tools but cannot
- Other developers could create things we haven't imagined but cannot

### GAP
We are a closed system. Sefaria became a platform by opening their data. We have richer per-paragraph enrichment than Sefaria (emotional tags, viral scores, argument mapping) that other developers would find valuable. Zero leverage right now.

### ACTION (After 100% corpus processed):
Phase 1 — Design (1 day planning):
Define what to expose:
- GET /api/paragraphs/{book-slug}/{chapter}/{para} → text + enrichments
- GET /api/books → list of all books with status
- GET /api/topics/{topic-slug} → topic JSON (already exists as file)
- GET /api/search?q={query} → calls our FTS5 + semantic search
- GET /api/daily → today's best_quote (from daily Telegram cron)

Phase 2 — Build (1 weekend after corpus complete):
1. Create /api/v1/ Blueprint in Flask
2. Auth: no auth for reads, API key for writes (future)
3. Rate limiting: 100 requests/hour per IP
4. Response format: JSON with enrichment fields
5. CORS: allow all origins (so web apps can use it)
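
A minimal sketch of one Blueprint endpoint (Phase 2); lookup_paragraph is assumed to be the shared helper behind the Point 7 /read route, and the commented wiring names flask-cors and flask-limiter as plausible (unconfirmed) choices:

```python
from flask import Blueprint, abort, jsonify

api_v1 = Blueprint("api_v1", __name__, url_prefix="/api/v1")

@api_v1.get("/paragraphs/<slug>/<int:chapter>/<int:para>")
def paragraph(slug, chapter, para):
    p = lookup_paragraph(slug, chapter, para)  # shared helper from Point 7 (assumption)
    if p is None:
        abort(404)
    return jsonify({
        "ref_id": f"{slug}:{chapter}:{para}",
        "text": p.text,
        "enrichments": {
            "emotional_tags": p.emotional_tags,
            "creation_plan": p.creation_plan,
            "glance_text": p.glance_text,
            "viral_score": p.viral_score,
        },
    })

# Wiring in the existing app (flask-cors + flask-limiter, both assumptions):
#   CORS(app, resources={r"/api/v1/*": {"origins": "*"}})
#   limiter.limit("100/hour")(api_v1)
#   app.register_blueprint(api_v1)
```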

Phase 3 — Documentation (1 day):
1. Simple /developers page with 5 example API calls
2. curl examples for each endpoint
3. Link to it from homepage

### DONE WHEN (future milestone):
- /api/v1/paragraphs/{ref} returns paragraph + enrichment JSON
- /api/v1/daily returns today's best_quote
- /api/v1/topics/{slug} returns ranked topic citations
- /developers page exists with documentation
- Rate limiting active

---

## POINT 18 — EXTERNAL LINKER (Equivalent of linker.v3.js)

### SEFARIA
linker.v3.js: any website adds 2 lines of code → Sefaria auto-detects all Torah citations and links them:
```html
<script src="https://www.sefaria.org/linker.v3.js"></script>
<script>sefaria.link();</script>
```
- Detects Hebrew and English citations
- Popup shows bilingual text inline
- Tracks which websites cite Sefaria content
- 150+ websites use this: Jewish education sites, blogs, newsletters
- Creates reverse traffic: people on other sites discover Sefaria

### US
- No equivalent exists
- Many Islamic websites quote Quran verses and hadiths
- If those sites detected those references and linked them to Maulana's commentary, that would be a major discovery channel
- Our Quran detector already does the hard work of detection
- We just need to expose it as a JavaScript service

### GAP
We have zero ecosystem presence. Sefaria is embedded on 150 websites. Every visitor to those sites can discover Sefaria via citation links. We are invisible to the internet.

### ACTION (Long-term, after Ref IDs + public API + 50%+ corpus):
Phase 1 — Backend (Linker API):
1. POST /api/find-refs endpoint:
   - Input: {text: "...any text with Quran or hadith citations..."}
   - Runs quran_detector.py + hadith_detector.py on input text
   - Returns: {refs: [{ref_type: "quran", surah: 2, ayah: 153, display_text: "Quran 2:153", url: "https://ask.spiritualmessage.org/read/quran/2/153", maulana_commentary_url: "..."}]}
2. Host this publicly, no auth required
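
A minimal sketch of the endpoint, assuming the existing detectors expose detect(text) functions returning dicts with the fields shown (their real names and signatures may differ):

```python
from flask import Blueprint, jsonify, request

from quran_detector import detect as detect_quran    # assumption: actual API may differ
from hadith_detector import detect as detect_hadith  # assumption: actual API may differ

linker_api = Blueprint("linker_api", __name__)

@linker_api.post("/api/find-refs")
def find_refs():
    text = (request.get_json(silent=True) or {}).get("text", "")
    refs = []
    for r in detect_quran(text):
        refs.append({
            "ref_type": "quran",
            "surah": r["surah"],
            "ayah": r["ayah"],
            "display_text": f"Quran {r['surah']}:{r['ayah']}",
            "url": f"https://ask.spiritualmessage.org/read/quran/{r['surah']}/{r['ayah']}",
        })
    for r in detect_hadith(text):
        refs.append({"ref_type": "hadith",
                     "collection": r["collection"],
                     "number": r["number"]})
    return jsonify({"refs": refs})
```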

Phase 2 — JavaScript snippet (linker.js):
1. Small vanilla JS file hosted at ask.spiritualmessage.org/linker.js
2. Detects Arabic text patterns for Quran verses
3. Detects "bukhari", "muslim" etc. patterns for hadith
4. Creates clickable popup: shows verse text + links to our commentary
5. Any Islamic website adds 2 lines → citations auto-link to Maulana's interpretation

Phase 3 — Outreach:
1. Contact 10 Islamic education websites to pilot the linker
2. CPS newsletter uses the linker
3. Spiritualmessage.org itself uses the linker

### DONE WHEN (future):
- POST /api/find-refs returns structured citation data
- linker.js hosted publicly
- At least 1 external website using the linker
- Citations on that website link to our platform

---

## EXECUTION SUMMARY

| Point | Description | Effort | This Weekend? |
|-------|-------------|--------|---------------|
| 1 | Stable Ref IDs | Low | YES |
| 2 | Daily Telegram cron | Low | YES |
| 3 | Detection accuracy benchmark | Medium | Start eval set |
| 4 | Dual-LLM validation | Medium | Plan only |
| 5 | Topic source ranking | Low | YES |
| 6 | Activate 3 inactive centrics | Medium | YES (chat_engine edit) |
| 7 | Citation verifiability | Low | YES (needs Point 1) |
| 8 | Embedding benchmark | Medium | After eval set |
| 9 | Evaluation dataset | Medium | Start this weekend |
| 10 | AI trust layer | Low | YES |
| 11 | Scholar review workflow | Medium | Plan only |
| 12 | Link type taxonomy | Low | No — next weekend |
| 13 | PageRank paragraph weight | Medium | No — next weekend |
| 14 | AI ethics tone policy | Low | YES (30 min) |
| 15 | Failed citation flag | Low | YES |
| 16 | Islamic calendar | Medium | Start Ramadan detection |
| 17 | Read-only public API | High | No — after full corpus |
| 18 | External Linker | High | No — later |

**This weekend: Points 1, 2, 5, 6, 7, 10, 14, 15 + start 3, 9, and 16**
**Next weekend: Points 4, 11, 12, 13**
**After full corpus: Points 17, 18**

---

Evidence: all 18 points grounded in:
- Direct code reads: chat_engine.py, models.py, centric exporters, embedding_service.py
- sefaria.org/ai (read directly): Pirkei Avot dual-LLM pipeline, 14K/15.5K reviews, AI ethics
- developers.sefaria.org/docs/linker-v3 (read directly): linker.v3.js, linkFailed, options
- github.com/Sefaria/AppliedAI (read directly): Virtual Havruta, PageRank, KG traversal
- huggingface.co/Sefaria: Rabbinic Embedding Leaderboard, F-scores, model rankings
