# Sefaria vs spiritualmessage.org — Master Comparison Document
# Based on: direct code reading (our side) + sefaria.org/ai, developer docs, GitHub wiki, HuggingFace (their side)
# March 2026 | For strategic planning and gap analysis

---

## SECTION 1 — DOCUMENT PROCESSING PIPELINE

### 1A. Atomic Unit (What is one "piece" of text?)

| Dimension | Sefaria | spiritualmessage.org | Copy from Sefaria? |
|-----------|---------|---------------------|-------------------|
| Unit name | Segment | Paragraph | No — our paragraph is semantically richer |
| How defined | Explicit Index schema: sectionNames array (["Chapter","Verse"]) | Detected from DOCX paragraph breaks | No — our approach is fine |
| Unique ID | Canonical Ref string: "Genesis 1:1", "Berakhot 2a:3" | Auto-increment integer (paragraph.id = 2341) | **YES — CRITICAL** |
| ID is human-readable | Yes — "Genesis 1:1" tells you exactly what it is | No — 2341 tells you nothing | **YES — build slug-based refs** |
| Publicly addressable | Yes — sefaria.org/Genesis.1.1 works for every segment | No public URL per paragraph | **YES — CRITICAL** |
| Multiple versions | Up to 18 versions per text (Hebrew, English, French, etc.) | One version per book | No — not needed yet |
| Offline access | Full library downloadable on iOS/Android | No | No — later |
| Hierarchy | Flexible: 2–4 levels depending on text type | Book → Chapter → Paragraph (3 fixed levels) | No — ours is adequate |

**What to copy:** Build stable human-readable reference IDs for every paragraph. Format: `{book-slug}:{chapter-order}:{para-order}`, e.g. `patience-positive-thinking:3:15`. Then build a public URL `/read/{book-slug}/{ref}`. All required fields already exist in the DB — this is routing code only.
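A minimal sketch of the ref scheme, assuming book title, chapter order, and paragraph order are already queryable from the Paragraph table. The `slugify` helper and the exact URL shape are illustrative, not a committed contract:

```python
import re

def slugify(title: str) -> str:
    """Lowercase, drop punctuation, hyphenate:
    'Patience & Positive Thinking' -> 'patience-positive-thinking'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def make_ref(book_title: str, chapter_order: int, para_order: int) -> str:
    """Stable human-readable ref: {book-slug}:{chapter-order}:{para-order}."""
    return f"{slugify(book_title)}:{chapter_order}:{para_order}"

def ref_to_url(ref: str) -> str:
    """Public reading URL for a ref, mirroring sefaria.org/Genesis.1.1."""
    return "/read/" + ref.replace(":", "/")
```

Because the ref is derived from fields that never change after ingestion, it can be backfilled once with a single migration and then computed for new paragraphs at insert time.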

---

### 1B. Document Ingestion

| Dimension | Sefaria | spiritualmessage.org | Copy from Sefaria? |
|-----------|---------|---------------------|-------------------|
| Input sources | OCR of print books + publisher deals + volunteer contributions | DOCX/PDF upload by admin | No — our source is clean |
| Parsing | Custom Python scripts per text (public on GitHub) | python-docx + pdf_matcher.py | No |
| Quality control | Volunteer corrections, editorial team, corrections@sefaria.org | Manual review in annotation tool | Partial — add error reporting |
| Storage | MongoDB with JaggedArray nested documents | SQLite WAL mode, 19 tables | No — SQLite is sufficient at our scale |
| Data export | GPL-3.0 code, CC data, MongoDB dump, HuggingFace datasets, GitHub | Private, no export | No — later, after 100% corpus |
| Community contribution | Volunteers add texts, translations, corrections | None | No — not applicable (one author) |
| Translation handling | Separate Version per language, same structure | One language per book | No — English first, then expand |
| Versioning | Full JaggedArray per version | Snapshot JSON blobs in Version table | No |

**What to copy:** Nothing urgent here. Our ingestion is fine for a single-author corpus.

---

### 1C. Reference Detection

| Dimension | Sefaria | spiritualmessage.org | Copy from Sefaria? |
|-----------|---------|---------------------|-------------------|
| Detection method | Trained NER model: `he_torah_ner` (HuggingFace, MIT) | Custom regex (quran_detector.py, hadith_detector.py) | Partially — build eval set first |
| NER accuracy | F-score 82.96% on Hebrew citations (published benchmark) | Not benchmarked | **YES — benchmark ours** |
| Reference types detected | All rabbinic literature: Bible, Talmud, Mishnah, Midrash, Responsa | Quran verse, Hadith (7 collections), Year, Other books | No — ours matches our corpus |
| Cross-lingual detection | Hebrew citations in English text and vice versa | English + some Arabic/Urdu patterns | Partial — improve Urdu detection |
| Output format | Link objects: {refs: ["Genesis 1:1", "Berakhot 5a"], type: "commentary"} | Reference rows: surah, ayah, collection, hadith_number, verified flag | No — our schema is richer |
| Link types | commentary, quotation, allusion, midrash, related, sheet-source | All stored in one Reference table (no type taxonomy) | **YES — add link type to Reference** |
| Verification | Community review + editorial approval | sunnah_url_verified (HTTP 200), match_score 0–100 | No — our verification is equivalent |
| Total links | 3.3 million (full corpus, 13 years) | ~50,000 estimated (6% corpus) | Scale problem — not a method problem |
| External linking | Linker v3 JS: embed on any site → auto-detects all citations → bilingual popup | None | No — later, after 100% corpus |
| Linker API | POST /api/find-refs → async → character positions + linkFailed flag | None | No — build after stable Ref IDs |
| linkFailed flag | Yes — when citation recognized but not in corpus | No explicit fail flag | **YES — add failed_to_verify flag** |
| Public citation API | GET /api/links/{ref} → all known connections | No public API | No — later |

**What to copy:**
1. Benchmark our quran_detector + hadith_detector accuracy (build 50-question test set)
2. Add `link_type` field to Reference table (quotation vs commentary vs allusion)
3. Add `failed_to_verify` flag for citations our detector found but could not confirm
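For item 1, the benchmark itself is simple once a gold set exists: hand-annotate the 50 test paragraphs with their true citations, run `quran_detector` + `hadith_detector` over them, and score the two sets. A sketch of the scoring harness (the gold/predicted pairs below are made-up examples):

```python
def prf(gold: set, predicted: set) -> dict:
    """Precision / recall / F1 over sets of (paragraph_id, canonical_ref) pairs."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Illustrative: human annotations vs detector output for two paragraphs
gold = {(1, "Quran 2:286"), (1, "Bukhari 6114"), (2, "Quran 94:5")}
pred = {(1, "Quran 2:286"), (2, "Quran 94:5"), (2, "Quran 94:6")}
scores = prf(gold, pred)
```

This gives a number directly comparable to Sefaria's published 82.96% F-score, and a regression check every time a detector regex changes.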

---

### 1D. Semantic Enrichment

| Dimension | Sefaria | spiritualmessage.org | Copy from Sefaria? |
|-----------|---------|---------------------|-------------------|
| Per-segment LLM enrichment | No — Sefaria stores text + links only | Yes — 7 LLM steps per paragraph | We are ahead |
| Emotional journey mapping | None | 21 input → 19 output emotions per paragraph | We are ahead |
| Theological taxonomy | None (topics are text-level, not segment-level) | 14 creation plan aspects per paragraph | We are ahead |
| Audience level | None | scholar / universal / seeker / secular per paragraph | We are ahead |
| Viral scoring | None | 7 dimensions + composite + best_quote per paragraph | We are ahead |
| Argument mapping | None | 10 contemporary issues per paragraph | We are ahead |
| Reasoning flow | None | Logical argument structure per paragraph | We are ahead |
| Glance text | None | Best standalone sentence per paragraph | We are ahead |
| Healing snippet | None | 90-second standalone wisdom piece per paragraph | We are ahead |
| AI topic page content | Yes — AI-generated heading + introduction per source on topic page | No AI-generated topic summaries yet | **YES — add glance_text to topic pages** |
| Dual-LLM review pipeline | **Gemini generates → Claude 3 Opus scores (relevance + accuracy 1–10) → ≤5 gets human review** | Single LLM generates, no validation LLM | **YES — ADD THIS** |
| Scholar review of AI content | 14,000 of 15,500 AI source introductions reviewed by Sefaria staff | No scholar review system | **YES — plan scholar review workflow** |
| AI content marking | Visible AI icon on all AI-generated pages | No AI content marking | **YES — mark AI-generated content** |
| Commentary layer | Classical commentaries are separate linked texts (Rashi, Ibn Ezra, etc.) | No commentary layer | No — not applicable (one author) |
| AI ethics policy | Rome Call for AI Ethics. Transparency, Learning First, Feedback form. | None | **YES — adopt explicit AI policy** |

**What to copy:**
1. **Dual-LLM validation**: After chat_engine generates an answer, run a second cheap LLM call scoring each cited paragraph on: relevance to query (1-10) + accuracy of claim (1-10). Log scores. Flag answers with ≤5. This is their Pirkei Avot pipeline adapted to our chatbot.
2. **Mark AI content**: Any AI-generated enrichment shown to users should have a visible "AI" marker with explanation.
3. **AI ethics statement**: Write a simple 4-point policy page: Transparency, Learning Aid (not a replacement for scholars), Feedback form, Continued evaluation.
4. **Scholar review workflow**: When 10 books are processed, invite a CPS scholar to review sample AI enrichments. Log corrections.
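The dual-LLM step in item 1 reduces to a prompt, a parser, and a threshold. The LLM call itself goes through our existing `llm_client` (stubbed here); the prompt wording, field names, and JSON shape below are assumptions, not a spec:

```python
import json
import re

REVIEW_PROMPT = (
    "For each cited paragraph, rate 1-10: relevance to the query, and "
    "accuracy of the claim made about it. Reply as a JSON array: "
    '[{"para_id": 2341, "relevance": 8, "accuracy": 7}, ...]'
)

FLAG_THRESHOLD = 5  # Sefaria routes scores of 5 or below to human review

def parse_review(raw: str) -> list[dict]:
    """Extract the JSON score array from a possibly chatty model reply."""
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    return json.loads(match.group(0)) if match else []

def flag_low_scores(scores: list[dict], threshold: int = FLAG_THRESHOLD) -> list[int]:
    """Return para_ids whose relevance OR accuracy is at/below the threshold."""
    return [s["para_id"] for s in scores
            if s["relevance"] <= threshold or s["accuracy"] <= threshold]
```

Flagged answers go to a log table for review rather than being blocked, so the validation pass adds no user-facing latency.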

---

### 1E. Interlinking Architecture

| Dimension | Sefaria | spiritualmessage.org | Copy from Sefaria? |
|-----------|---------|---------------------|-------------------|
| How links form | Explicit link objects (NER + community + scholars) | Centric exporter JSONs group paragraphs by shared anchor | Different approaches — both valid |
| Link direction | Bidirectional — both segments know about the link | One-directional via JSON files | No |
| Total links | 3.3 million | ~50,000 estimated | Scale problem |
| Topic ontology size | 17,730 topic entities (ontology graph) | 143 topics | Scale problem — but quality > quantity |
| Topic web pages | 5,000 pages | 143 JSON files | Scale problem |
| AI on topic pages | 1,000 of 5,000 pages have AI-generated source headings + introductions | No AI topic summaries | **YES — add glance_text as intro** |
| Topic source ranking | **Linear Integer Programming** selects best sources balancing diversity + importance | All citations weighted equally | **YES — weight by viral_score composite** |
| Community contribution to topics | 450,000 source sheets with tags feed topic ontology | None | No equivalent (one author) |
| Video cross-linking | No video in library | Books ↔ videos via shared entity names | We are ahead |
| External ecosystem | Linker on 150+ sites, 200+ apps built on API | No external apps, no API | No — later |
| MCP integration | Sefaria MCPs (confirmed in developer docs) | Our own MCP (write-enabled, Claude.ai) | Tie |
| Centric exports active in chatbot | Elasticsearch + topic graph | verse_centric + hadith_centric + topic_centric (3 of 6 used) | We need to activate 3 more |
| Timeline/argument/creation-plan active | Not applicable | Built but NOT in chatbot retrieval | **YES — activate these 3 in chat engine** |

**What to copy:**
1. **Weight topic citations by viral_score**: Instead of treating all citations in a topic JSON equally, rank by `viral_score.composite`. Top 10 per topic shows best passages.
2. **Activate 3 unused centric types** in chat engine: timeline_centric, argument_centric, creation_plan_centric — add as Strategy 3 variants or a Strategy 8.
3. **Add AI-generated summary to topic pages**: Use each paragraph's `glance_text` as the source introduction on topic pages. This mirrors Sefaria's AI heading + introduction without extra LLM cost.
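Item 1 is a one-line sort in the topic exporter. A sketch, assuming each citation entry can carry its paragraph's precomputed `viral_score.composite` (field name `viral_composite` is illustrative):

```python
def top_citations(citations: list[dict], n: int = 10) -> list[dict]:
    """Rank a topic's citations by viral composite score, best first."""
    ranked = sorted(citations,
                    key=lambda c: c.get("viral_composite", 0.0),
                    reverse=True)
    return ranked[:n]

# Illustrative topic JSON entries; unscored paragraphs sink to the bottom
topic = [
    {"para_id": 12, "viral_composite": 0.81},
    {"para_id": 40, "viral_composite": 0.93},
    {"para_id": 7},
]
best = top_citations(topic, n=2)
```

This is a coarse stand-in for Sefaria's Linear Integer Programming selection: it optimizes only for quality, not diversity, which is acceptable while topics average a few hundred citations.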

---

## SECTION 2 — RETRIEVAL PIPELINE

| Dimension | Sefaria (main site) | Sefaria (Virtual Havruta) | spiritualmessage.org | Copy? |
|-----------|--------------------|--------------------------|--------------------|-------|
| Product type | Library browser + search | Research Slack bot (MIT) | Conversational chatbot | — |
| Primary search | Elasticsearch 8.8 | LangChain + OpenAI embeddings | FTS5 + BM25 + E5 | No |
| KG traversal | Sidebar connections (click) | Up to 2-hop Neo4j + PageRank | Strategy 5: verse/hadith JSON hop | Partial |
| PageRank on links | Not in main site | Yes — Neo4j PageRank for ranking | Not implemented | **YES — weight centric results by paragraph citation count** |
| Emotional routing | None | None | Strategy 7: 12 emotion categories, 3 blend modes | We are ahead |
| Cross-modal | None | None | Books + videos in one pool | We are ahead |
| Retrieval strategy count | 1 (Elasticsearch) | 3 (semantic + Linker + KG) | 7 strategies | We are ahead |
| Reranking | Not documented | Multi-signal: semantic + graph distance + PageRank | BGE cross-encoder (0.4 original + 0.6 reranker) | We are ahead |
| Dual LLM review | No | No | No | **YES — add this (see enrichment section)** |
| Answer format | N/A (search results list) | Free-form with inline Ref citations | Recipe-based 6 sections | — |
| Citation format | Canonical Ref string ("Genesis 1:1") | Canonical Ref string | [P{id}] / [V{id}] opaque integer | **YES — add human-readable ref to citations** |
| Citation public verifiability | Yes — anyone can check sefaria.org/Genesis.1.1 | Yes | No — [P2341] is not verifiable externally | **YES — CRITICAL** |
| Embedding model | Gemini Embedding 001 (93.9% recall@1 on rabbinic text) | OpenAI embeddings | intfloat/multilingual-e5-base | **YES — benchmark ours** |
| Embedding benchmarked | Yes — Rabbinic Embedding Leaderboard (18 models tested) | — | No | **YES — build Islamic eval set** |
| Evaluation dataset | Rabbinic Embedding Leaderboard | — | None | **YES — CRITICAL** |

**What to copy:**
1. **Make citations human-readable**: When the chatbot shows [P2341], also show the book title + chapter title + a short reference like "Patience book, Ch 3". This improves further once stable Ref IDs exist.
2. **PageRank-style paragraph weighting**: Paragraphs cited in more verse/hadith/topic exports should rank higher. Add `citation_count` field to Paragraph table — increment when it appears in any centric export.
3. **Build evaluation dataset**: 50 questions across 10 topics with human-rated relevant paragraphs. Test current E5 vs Gemini Embedding. Evidence-based model selection.
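For item 3, the metric that matters is recall@k over the human-rated eval set, computed identically for every candidate model. A sketch, where `retrieve(question)` is whichever pipeline is under test (current E5 vs Gemini Embedding):

```python
def recall_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Fraction of human-rated relevant paragraphs found in the top k results."""
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved[:k])) / len(relevant)

def evaluate(eval_set: list[dict], retrieve, k: int = 5) -> float:
    """Mean recall@k over the eval set (50 questions across 10 topics)."""
    scores = [recall_at_k(set(q["relevant_ids"]), retrieve(q["question"]), k)
              for q in eval_set]
    return sum(scores) / len(scores)
```

Running both embedding models through the same `evaluate` call is what turns "model selection" from a preference into evidence, mirroring the Rabbinic Embedding Leaderboard methodology at small scale.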

---

## SECTION 3 — HABIT & RETURN MECHANISM

| Dimension | Sefaria | spiritualmessage.org | Copy? |
|-----------|---------|---------------------|-------|
| Daily learning schedule | Daf Yomi (daily Talmud page, 7.5 year cycle), Weekly Parasha, Daily Mishnah, Nach Yomi, Mishna Yomi | None | **YES — CREATE EQUIVALENT** |
| Calendar API | GET /api/calendars → daily schedule with timezone support | None | **YES — build daily content endpoint** |
| Daily push | Mobile push notifications, email subscriptions | Telegram bot exists but no daily cron | **YES — add daily cron TODAY** |
| Islamic calendar hooks | N/A | None — but hooks exist: Ramadan, Juma, daily wird/dhikr, Quran khatm | **YES — plug into Islamic calendar** |
| Offline app | Full iOS + Android library, downloadable | No | No — later |
| Social features | Follow sheet creators, public source sheets | None | No — later |
| User accounts | Free public accounts, notes, saved texts | Admin only | No — later |
| Streak/habit | Implicit through Daf Yomi cycle | None | **YES — create daily Islamic learning cycle** |
| Content for habit | Today's Daf is fixed by the cycle | viral_score.best_quote per book already in DB | **YES — use existing best_quote field** |

**What to copy (immediate):**
1. **Daily Telegram cron at 7am IST**: Select paragraph with highest `viral_score.composite` from most recently processed book, send `best_quote` + book title + link. Telegram bot token + chat_id already in memory. Zero new infrastructure.
2. **Islamic calendar integration**: Ramadan → send Quran-related best_quotes. Juma (Friday) → send Maulana's Friday-relevant wisdom. These are natural recurring hooks: Juma weekly, Ramadan annually.
3. **Plan "Maulana's Daily Wisdom"**: Structured 365-day reading plan covering all 145 books proportionally. This is the Islamic equivalent of Daf Yomi — a multi-year learning cycle.
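The daily cron in item 1 needs only message formatting plus one HTTP POST to the Telegram Bot API `sendMessage` method; scheduling is a one-line system crontab entry. Token and chat_id are assumed to come from existing config:

```python
import urllib.parse
import urllib.request

def format_daily_message(best_quote: str, book_title: str, ref_url: str) -> str:
    """Compose the daily-wisdom message from fields already in the DB."""
    return f"\u201c{best_quote}\u201d\n\n\u2014 {book_title}\n{ref_url}"

def send_telegram(token: str, chat_id: str, text: str) -> None:
    """POST to the Telegram Bot API sendMessage endpoint."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    urllib.request.urlopen(urllib.request.Request(url, data=data))
```

Crontab sketch for 7am IST (assuming the server runs on IST): `0 7 * * * python daily_wisdom.py`. The quote-selection query is a single `ORDER BY viral_composite DESC LIMIT 1` against the most recently processed book.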

---

## SECTION 4 — TRUST & QUALITY LAYER

| Dimension | Sefaria | spiritualmessage.org | Copy? |
|-----------|---------|---------------------|-------|
| Scholar review of AI content | Nearly all AI content reviewed by Sefaria staff (14,000 of 15,500 introductions) | None | **YES — plan CPS scholar review** |
| AI content marking | Visible AI icon on all AI-generated pages. Explained at sefaria.org/ai. | None | **YES — mark AI content** |
| Citation verification | Linker API verifies citation exists. linkFailed:true if not. | sunnah_url_verified + match_score | Partial — our verification is good but not transparent |
| Public verifiability | Anyone can click any citation and read the source | Citations are opaque IDs — not externally checkable | **YES — build public paragraph URLs** |
| Community error reporting | corrections@sefaria.org + AI feedback form + GitHub issues | None | **YES — add feedback button to chatbot** |
| AI ethics statement | Rome Call for AI Ethics (public page at sefaria.org/ai). 4 explicit commitments. | None | **YES — write our AI policy** |
| AI tone policy | Never authoritative "God said X". Always shows differing views. | Not explicitly defined | **YES — define our tone policy** |
| Hallucination prevention | Linker marks unverifiable citations. Citation-grounded answers only. | Strict citation prompt [P{id}] only | Our approach is equivalent |
| Open source for auditing | GPL-3.0 code, CC data. Anyone can verify. | Private | No — not required now |

**What to copy:**
1. **AI feedback button**: Add "Was this helpful? Report an issue" button to chatbot UI. Store feedback in a new `chatbot_feedback` table. No LLM cost.
2. **AI policy page**: 4 sentences at /about-ai: "We use AI to discover and present Maulana's wisdom. All content is grounded in his actual books. AI-generated summaries are clearly marked. We welcome corrections."
3. **Scholar spot-check system**: Every 50th chatbot answer, save it to a `scholar_review_queue` table. Monthly email to a CPS scholar with 10 answers to review. Log corrections. This is the minimum viable scholar oversight.
4. **Transparent citation display**: When chatbot shows [P2341], show it as "Patience and Positive Thinking, Ch 3 — paragraph 15" so users understand what they're reading.
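Items 1 and 3 together are two small tables plus a modulo check at answer time. A sketch against SQLite (our existing store); column names are suggestions, not final schema:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS chatbot_feedback (
    id INTEGER PRIMARY KEY,
    answer_id INTEGER,
    helpful INTEGER,            -- 1 = helpful, 0 = issue reported
    comment TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS scholar_review_queue (
    id INTEGER PRIMARY KEY,
    answer_id INTEGER,
    answer_text TEXT,
    reviewed INTEGER DEFAULT 0,
    correction TEXT
);
"""

def maybe_queue_for_review(conn, answer_id: int, answer_text: str, every: int = 50):
    """Save every Nth chatbot answer for the monthly scholar spot-check."""
    if answer_id % every == 0:
        conn.execute(
            "INSERT INTO scholar_review_queue (answer_id, answer_text) VALUES (?, ?)",
            (answer_id, answer_text))
        conn.commit()

# Demo: 100 answers -> exactly answers 50 and 100 land in the queue
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
for i in range(1, 101):
    maybe_queue_for_review(conn, i, f"answer {i}")
queued = conn.execute("SELECT answer_id FROM scholar_review_queue ORDER BY id").fetchall()
```

The monthly scholar email is then a `SELECT ... WHERE reviewed = 0 LIMIT 10` plus any mail client; no new infrastructure.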

---

## SECTION 5 — OPENNESS & ECOSYSTEM

| Dimension | Sefaria | spiritualmessage.org | Copy? |
|-----------|---------|---------------------|-------|
| Source code | GPL-3.0, fully public on GitHub | Private | No — not required |
| Data | CC-licensed, MongoDB dump, HuggingFace datasets | Private | No — future |
| Public API | 15+ REST endpoints, no auth for reads | None | **YES — plan read-only API** |
| MCP server | Sefaria MCPs (confirmed) | Our own MCP (write-enabled for Claude.ai) | Partial — ours is more powerful |
| External linker | linker.v3.js: any website embeds → citations auto-link | None | No — after 100% corpus |
| NER model published | he_torah_ner on HuggingFace (MIT) | Not published | No — later |
| AI research published | Rabbinic Embedding Leaderboard, Virtual Havruta (MIT) | Nothing published | No — later |
| Apps built on it | 200+ third-party apps | 0 | No — later |
| Tutorial provided | Official "Build a Torah-Powered AI Chatbot" tutorial | None | **YES — write a simple API tutorial** |
| Developer docs | Full developer portal (developers.sefaria.org) | None | **YES — create /developers page after API** |

**What to copy:**
1. **Plan the read-only API**: After 100% corpus processed, expose `/api/texts/{book-slug}/{para-id}`, `/api/topics/{topic-slug}`, `/api/search`. This enables other Islamic apps to build on Maulana's corpus.
2. **Write the CPS Developer page**: Even before an API, document what's available. Other CPS volunteers and developers worldwide could use this.
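To make the planned API concrete in the developer page before any endpoint ships, it helps to document the intended response shape. A sketch of the payload for `GET /api/texts/{book-slug}/{para-id}`; the field names are illustrative, not a committed contract:

```python
import json

def text_endpoint_payload(paragraph: dict) -> str:
    """JSON body for GET /api/texts/{book-slug}/{para-id}."""
    return json.dumps({
        "ref": paragraph["ref"],          # e.g. "patience-positive-thinking:3:15"
        "book": paragraph["book_title"],
        "chapter": paragraph["chapter_order"],
        "text": paragraph["text"],
        "url": "/read/" + paragraph["ref"].replace(":", "/"),
    }, ensure_ascii=False)

example = json.loads(text_endpoint_payload({
    "ref": "patience-positive-thinking:3:15",
    "book_title": "Patience and Positive Thinking",
    "chapter_order": 3,
    "text": "Sample paragraph text.",
}))
```

Following Sefaria's lead, reads would need no auth key, which keeps the barrier for Islamic-app developers at zero.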

---

## SECTION 6 — WHAT SEFARIA DOES THAT WE CANNOT DO (YET)

These are areas where Sefaria's 13-year head start creates gaps we cannot close in the near term:

| Gap | Why We Cannot Close It Quickly |
|-----|-------------------------------|
| 3.3 million intertextual links | 13 years of NER + community contribution. Ours will grow as corpus grows. |
| 17,730 topic ontology | Built from Aspaklaria encyclopedia + WikiData + 450K source sheets. Needs scholars. |
| 450,000 source sheets | User-generated content from global Jewish learning community. No equivalent yet. |
| 5,000 topic pages | Requires full corpus indexed first. We have 143. |
| 775,000 monthly users | 13 years of growth + global Jewish community + marketing. |
| iOS/Android offline app | Engineering effort of 6+ months. |
| 200+ ecosystem apps | Requires open API + developer community. |
| Bilingual (Hebrew/English) NLP | Sefaria + Dicta partnership for years. We have no Urdu NLP partner. |

---

## SECTION 7 — WHAT WE DO THAT SEFARIA CANNOT DO

These are genuine advantages that Sefaria does not have and cannot easily replicate:

| Our Advantage | What It Enables |
|--------------|----------------|
| Per-paragraph emotional journey (21 input → 19 output emotions) | Route users to Maulana's response to their emotional state (anxiety, grief, doubt, etc.) |
| 14 creation plan aspects per paragraph | Theological topical drill-down ("show me everything about patience as strategy") |
| Audience-level routing (scholar/universal/seeker/secular) | Same question answered differently for a scholar vs a new Muslim |
| Viral scoring + best_quote per paragraph | Daily social media content, shareability ranking, content selection |
| Argument mapping to 10 contemporary issues | "What does Maulana say about political Islam?" answered instantly |
| Reasoning flow per paragraph | Show the logical structure, not just the conclusion |
| Healing snippet per paragraph | 90-second readable standalone piece for quick discovery |
| 7-strategy retrieval | FTS5 + semantic + topic + cross-ref + KG hop + cross-modal + emotional |
| Cross-modal bridge (books ↔ videos) | Same query retrieves both written and spoken wisdom |
| Dual-mode prompting (emotional vs factual) | Different temperature, different prompt structure based on query type |
| Urdu video content | 2,000 videos, Soniox word-level timestamps, Urdu speakers |

---

## SECTION 8 — PRIORITY ACTION TABLE

Ranked by impact and effort. Do in this order.

| Priority | Action | What to Build | Effort | Impact | Inspired By |
|----------|--------|--------------|--------|--------|-------------|
| 1 | **Corpus scale** | Process remaining 136 books through full 7-step pipeline | High — ongoing | Critical | — |
| 2 | **Daily Telegram cron** | 7am IST daily: pick best_quote from processed books, send via Telegram bot | Low — 1 day | High | Sefaria Daf Yomi |
| 3 | **Stable paragraph Ref** | `{book-slug}:{ch-order}:{para-order}` format on each Paragraph | Low — 1 day | High | Sefaria Ref system |
| 4 | **Public paragraph URL** | `/read/{book-slug}/{para-id}` route in Flask | Low — 1 day | High | sefaria.org/Genesis.1.1 |
| 5 | **Human-readable citations** | Show "Patience Book, Ch 3" alongside [P2341] in chatbot | Low — 1 day | High | Sefaria Ref citations |
| 6 | **Build eval dataset** | 50 questions × human-rated relevant paragraphs | Medium — 2 weekends | High | Rabbinic Embedding Leaderboard |
| 7 | **Activate 3 inactive centric types** | Add timeline/argument/creation_plan to chat_engine.py Strategy 3 | Medium — 1 weekend | High | Our own exports |
| 8 | **Dual-LLM validation** | After answer generation, score each cited para: relevance + accuracy 1-10. Flag ≤5. | Medium — 1 weekend | High | Sefaria Pirkei Avot pipeline |
| 9 | **Weight topics by viral_score** | In topic_centric JSONs, sort citations by viral_score.composite. Top 10 = best. | Low — half day | Medium | Sefaria Linear Programming |
| 10 | **AI feedback button** | "Report an issue" in chatbot UI → save to feedback table | Low — 1 day | Medium | Sefaria AI feedback form |
| 11 | **AI policy page** | 4-sentence /about-ai page. Mark AI-generated enrichment with icon. | Low — half day | Medium | sefaria.org/ai |
| 12 | **Scholar spot-check** | Save every 50th answer to review queue. Email CPS scholar monthly. | Low — 1 day | Medium | Sefaria staff review |
| 13 | **PageRank paragraph weight** | Count how many centric exports each para appears in. Add weight to reranking. | Medium — 1 weekend | Medium | Virtual Havruta PageRank |
| 14 | **Embedding benchmark** | Test E5 vs Gemini Embedding on Islamic eval set | Medium — 1 weekend | Medium | Rabbinic Embedding Leaderboard |
| 15 | **Failed citation flag** | When chatbot cannot verify a [P{id}] claim, mark it as unverified | Low — 1 day | Medium | Sefaria linkFailed flag |
| 16 | **Islamic calendar integration** | Ramadan + Juma content calendar in daily Telegram cron | Medium — 1 weekend | Medium | Sefaria calendar API |
| 17 | **Open read-only API** | After 100% corpus: /api/texts, /api/topics, /api/search | High — 1 month | High (long-term) | Sefaria public API |
| 18 | **External Linker** | JS snippet that auto-links Quran/Hadith citations on any Islamic website | High — 2 months | High (long-term) | Sefaria linker.v3.js |

---

## SECTION 9 — QUICK WINS THIS WEEKEND (under 1 day each)

| # | Task | Inspired by | Effort |
|---|------|-------------|--------|
| W1 | Add daily Telegram cron: best_quote at 7am IST | Sefaria Daf Yomi | 2 hours |
| W2 | Add stable ref_id column to paragraphs: `{book-slug}:{ch}:{para}` | Sefaria Ref | 2 hours |
| W3 | Sort topic_centric JSONs by viral_score.composite | Sefaria LP ranking | 1 hour |
| W4 | Show book+chapter title alongside [P{id}] in chatbot response | Sefaria Ref citations | 1 hour |
| W5 | Add "Report an issue" link to chatbot UI | Sefaria feedback form | 1 hour |
| W6 | Write 4-sentence AI policy in /about or README | sefaria.org/ai | 30 min |

Total: one focused weekend = 6 concrete improvements, all inspired by Sefaria.

---

## SECTION 10 — THE ONE PHILOSOPHICAL LESSON

**Sefaria's strategy in one sentence:**
Build the data layer first. Make it open. Let the community build the applications.

**Our strategy in one sentence:**
Build the deepest per-paragraph understanding of one scholar's wisdom so that no other platform can match the quality of what we surface.

**The tension:**
Sefaria opened everything → 200+ apps built on their data. We are keeping everything closed → 0 apps built on our data. Long-term, opening the read-only API after 100% corpus is the highest-leverage action we can take. It would allow CPS chapters worldwide, Islamic educators, and developers to build tools we cannot imagine.

**The non-negotiable lesson:**
Sefaria spent 13 years building 3.3 million links. We have been running for 2 years with 50,000. The gap is not method — it is time × corpus. The only thing that closes the gap is processing the remaining 136 books. Everything else is secondary.

---

Sources:
- Our code: direct read, March 2026 (models.py, chat_engine.py, all 6 centric exporters, embedding_service.py, llm_client.py, 14+ files)
- sefaria.org/ai: read directly — Pirkei Avot pipeline, AI ethics, topic anatomy, 14,000/15,500 reviewed
- developers.sefaria.org/docs/linker-v3: read directly — Linker embedding code, options, citation format
- github.com/Sefaria/AppliedAI: read directly — Virtual Havruta Slack bot, MIT license
- github.com/Sefaria/Sefaria-Project/wiki: data model, Index schema, JaggedArray, API docs
- voices.sefaria.org/sheets/244437: Topics behind-the-scenes blog (17,730 ontology, 5,000 pages)
- huggingface.co/Sefaria: he_torah_ner, Rabbinic Embedding Leaderboard, F-score 82.96%
- ejewishphilanthropy.com: AI tone policy ("never God said X"), mid-2024 integration plan
- Wikipedia, Times of Israel: 775K users, 234 countries, 322M words, 3.3M links, 450K sheets, 18 engineers
