# Platform 01 — Sefaria: Deep Research
# Version 2 — Updated March 2026
# Added: AI page details, Pirkei Avot dual-LLM pipeline, topic page anatomy,
#        Rome Call ethics, 5,000 topic pages vs 17,730 topic ontology distinction,
#        ecosystem apps, corrected corpus numbers, MCP

---

## Overview

Sefaria is the foundation of the entire Jewish AI ecosystem. Founded in 2011 by Brett Lockspeiser (a former Google PM) and Joshua Foer (a journalist). Every major Jewish AI platform — RavGPT, ChavrutAI, DafBuddy, GoTorah!, DarshAI, TzadekAI, and 200+ others — is built on top of Sefaria's corpus and API. Sefaria did not build a chatbot first; it built the data layer first. Everything else followed.

---

## Q1: What Is Their Corpus?

- Torah (Tanakh), Talmud (Bavli + Yerushalmi), Mishnah, Midrash, Kabbalah, Responsa, Philosophy, Liturgy, Jewish Law (Halacha), Historical texts
- As of 2024: **775,000 monthly active users** from **234 countries**
- **322 million words** (Times of Israel 2023 figure)
- **76.5 million translated words** (subset of above)
- **3.3 million intertextual links** (as of 2023; was 1.5M in 2017, 2.5M in 2020)
- **450,000+ source sheets** created by users
- Multiple versions per text: up to 18 versions of Genesis (Hebrew, English, French, Spanish, German, etc.)
- Complete English Talmud (William Davidson translation, 2017) — the first complete English Talmud ever made freely available
- Open source. Full MongoDB database dump downloadable from Google Cloud Storage and GitHub (Sefaria-Export) under Creative Commons license.

**CORRECTION from V1:** Earlier research quoted "10 million links" — this was wrong. The confirmed figure is 3.3 million intertextual links as of 2023. The "10 million" claim was not sourced.

**How texts enter corpus:**
1. Public domain texts — scanned, OCR processed, manually corrected by staff + volunteers
2. Publisher partnerships — Sefaria fundraises to pay publishers to release under CC licenses (e.g., Ibn Ezra 2021, JPS Tanakh 2022)
3. Volunteer community contributions — open source, corrections@sefaria.org
4. Direct digitization (e.g., Steinsaltz Talmud, Rashi on Tanakh)

---

## Q2: How Did They Make It AI-Friendly?

**The Index Schema:**
- sectionNames: array of structural levels. Bereishit = ["Chapter","Verse"]. Mishneh Torah = ["Book","Topic","Chapter","Halacha"]
- Every text reference (called a "Ref") follows a standardized canonical format: "Genesis 1:1", "Berakhot 2a:3", "Rashi on Genesis 4:5:2"
- JaggedArray storage: nested arrays with strings at lowest level (segments)
- Every segment is publicly addressable at sefaria.org/{Ref}
- Commentary texts inherit base text structure + one additional "Comment" level
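The JaggedArray model above can be illustrated with a toy sketch. The nested data and the `resolve_ref` helper below are invented for illustration — they mimic the addressing scheme, not Sefaria's actual implementation:

```python
# Minimal model of JaggedArray storage: nested lists with text segments
# (strings) at the lowest level. Data here is hypothetical placeholder text.
genesis = [
    ["In the beginning...", "And the earth was unformed..."],  # Chapter 1
    ["Thus the heavens...", "And on the seventh day..."],      # Chapter 2
]

def resolve_ref(jagged, ref):
    """Resolve a 'Book C:V'-style Ref to its segment (addresses are 1-indexed)."""
    _, address = ref.rsplit(" ", 1)  # split off the trailing "C:V" part
    node = jagged
    for part in address.split(":"):
        node = node[int(part) - 1]
    return node

print(resolve_ref(genesis, "Genesis 1:2"))  # → "And the earth was unformed..."
```

The same descent works for deeper structures (e.g. a four-level Mishneh Torah Ref) because the address simply has more colon-separated parts, one per level in sectionNames.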

**The Linker API (v3):**
- POST /api/find-refs → async task → returns all citations found with character positions
- Returns: ref (canonical string), URL, Hebrew Ref, primary category, linkFailed flag
- Supports Hebrew AND English input
- JavaScript snippet (linker.v3.js): embed on any website → auto-detects and links all citations
- Mode options: popup-click (shows bilingual text inline) or link (opens tab)
- Used by 150+ external websites
- Tracks usage back to Sefaria for analytics
- linkFailed:true when citation recognized but cannot be verified in corpus → AI hallucination detection
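A sketch of how the `linkFailed` flag could drive hallucination detection, assuming only the response fields listed above; the sample results are made up:

```python
# Filter Linker results down to citations that were recognized but could
# not be verified in the corpus — candidate AI hallucinations.
def unverified_citations(linker_results):
    """Return refs where the Linker set linkFailed=true."""
    return [r["ref"] for r in linker_results if r.get("linkFailed")]

sample = [
    {"ref": "Berakhot 2a:3", "linkFailed": False},
    {"ref": "Berakhot 99z:1", "linkFailed": True},  # fabricated page
]
print(unverified_citations(sample))  # → ["Berakhot 99z:1"]
```

In an AI pipeline, a non-empty result from this check would block or flag the generated answer before it reaches the user.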

**The Connection Graph:**
- 3.3 million text-to-text links (citations, cross-references, allusions, commentary, midrash)
- Link types: commentary, quotation, allusion, midrash, related, sheet-source
- Clicking any passage opens side panel with every related text across entire canon

**Source Sheets:**
- 450,000+ user-created study compilations (educators + learners + rabbis)
- Mix: text from library + outside sources + images + videos + user comments
- Tags assigned by users → data feeds into topic ontology and ranking
- Source sheet data = human-curated training signal for "which texts are thematically related?"
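The training-signal idea can be sketched as a simple co-occurrence count: refs that appear together on many user-created sheets are likely thematically related. Sheet contents here are invented for the example:

```python
from collections import Counter
from itertools import combinations

# Toy sheets; each is a list of refs a user compiled together.
sheets = [
    ["Genesis 1:1", "Rashi on Genesis 1:1", "Berakhot 2a"],
    ["Genesis 1:1", "Rashi on Genesis 1:1"],
    ["Berakhot 2a", "Mishnah Berakhot 1:1"],
]

# Count how often each pair of refs co-occurs on a sheet.
cooccurrence = Counter()
for sheet in sheets:
    for a, b in combinations(sorted(set(sheet)), 2):
        cooccurrence[(a, b)] += 1

print(cooccurrence.most_common(1))  # the most strongly related pair
```

At Sefaria's scale (450,000+ sheets) these counts become a dense, human-curated relatedness signal feeding the topic ontology and ranking.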

**IMPORTANT DISTINCTION — confirmed from sefaria.org/ai:**
- Topic **ontology**: 17,730 topic entities (from Aspaklaria encyclopedia + WikiData + user tags)
- Topic **pages** (web pages): 5,000 pages
- AI-generated content: 1,000 of the 5,000 topic pages (20%)
- The 17,730 is the size of the ontology graph. The 5,000 is the number of topic landing pages.

---

## Q3: What Is Their AI Strategy?

**NEWLY CONFIRMED from sefaria.org/ai (read directly March 2026):**

**AI on Topic Pages:**
- 1,000 of 5,000 topic pages now have AI-generated content (marked with visible AI icon)
- Each source on AI pages has: AI-generated heading + AI-generated introduction
- Of 15,500 source introductions on AI pages: 14,000 reviewed and edited by Sefaria staff
- Many AI topic pages include a human-written overview paragraph from the learning team
- Users can report issues via dedicated AI feedback form

**Pirkei Avot Learning Guide (PUBLIC BETA — CONFIRMED DUAL-LLM PIPELINE):**
This is the most technically detailed thing they have published about their AI pipeline:
1. Each mishnah + its associated commentaries → input to **Gemini**
2. Gemini: identifies questions that are a matter of discussion among the commentaries → generates summaries of each commentator's answer
3. All Gemini output → input to **Claude 3 Opus**
4. Claude: reviews each commentary against questions and summaries → gives two scores (1-10 each):
   - Relevance: how relevant is the answer to the question?
   - Accuracy: how accurately does the summary reflect the commentary?
5. Anything scored ≤5 on either dimension → additional human review
6. All content received some human review; low scores get a closer look
7. Result: interactive interface where clicking any Pirkei Avot teaching shows questions + summarized answers from multiple commentators + links to original sources
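The steps above can be sketched as a pipeline skeleton. The LLM calls are stubbed, and the structure is an interpretation of Sefaria's published description, not their code:

```python
REVIEW_THRESHOLD = 5  # scores at or below this trigger additional human review

def generate_summaries(mishnah, commentaries):
    # Stub for steps 1-2: Gemini extracts disputed questions and
    # summarizes each commentator's answer. Real code would call Gemini.
    return [{"question": "What is 'a good heart'?", "commentator": "Rashi",
             "summary": "..."}]

def score_summary(item):
    # Stub for steps 3-4: Claude scores relevance and accuracy, 1-10 each.
    return {"relevance": 8, "accuracy": 4}

def run_pipeline(mishnah, commentaries):
    flagged, passed = [], []
    for item in generate_summaries(mishnah, commentaries):
        scores = score_summary(item)
        bucket = flagged if min(scores.values()) <= REVIEW_THRESHOLD else passed
        bucket.append({**item, **scores})
    # Everything gets some human review; "flagged" items get the close look.
    return passed, flagged

passed, flagged = run_pipeline("Avot 2:9", ["Rashi", "Bartenura"])
```

Note the `min()` over both dimensions: a summary that is relevant but inaccurate (or vice versa) still gets flagged, which matches their "≤5 on either dimension" rule.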

**THIS IS DIRECTLY COMPARABLE:** Their Gemini→Claude review pipeline is analogous to our use of different LLM tiers. Key difference: they use Claude to *validate and score* Gemini's outputs. We use Gemini as primary with DeepSeek/OpenRouter as fallback, but have no separate validation LLM.

**AI Ethics Policy (Rome Call for AI Ethics — confirmed):**
- Transparency: all AI content marked with visible AI icon
- Learning First: AI is learning aid, not replacement for rabbis/halakhic authorities
- Continued Evaluation: ongoing monitoring
- Feedback: proactive user feedback form
- Adherence: Rome Call for AI Ethics (signed by Microsoft, Google, IBM, Cisco, Chief Rabbinate of Israel, Yeshiva University, Rabbinical Alliance of America)

**AI tone policy (confirmed from eJewishPhilanthropy interview):**
"The tone of Sefaria's AI answers will not be authoritative: 'God said do X.' Instead, the library will often offer differing views on how commentators wrestled with a particular question, pushing users to dive deeper into the texts."

---

## Q4: What Is Their Retrieval Strategy?

**Main site retrieval (library browser):**
1. Citation-based: Linker API detects exact citations → highest confidence
2. Elasticsearch 8.8: full-text search, bilingual (Hebrew + English simultaneously), variant spellings
3. Topic graph: source sheet tags + connection graph → topic clusters

**Virtual Havruta (research project, open source, MIT, Slack bot):**
- LangChain + OpenAI embeddings + Neo4j + YAML config
- retrieve_docs_linker(): queries Linker API
- select_useful_references(): LLM filters for relevance
- generate_kg_deeplink(): KG deep link for every citation
- Up to 2-hop Neo4j KG traversal with PageRank
- NOT a production chatbot on sefaria.org — it is a research/demo Slack bot
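A minimal stand-in for the 2-hop traversal, on a plain adjacency dict instead of Neo4j; graph contents are invented:

```python
# Toy citation graph: each ref maps to refs it links to.
graph = {
    "Genesis 1:1": ["Rashi on Genesis 1:1", "Berakhot 2a"],
    "Rashi on Genesis 1:1": ["Midrash Tanchuma Bereshit 1"],
    "Berakhot 2a": ["Mishnah Berakhot 1:1"],
}

def k_hop(graph, start, k=2):
    """Collect all nodes reachable within k hops of start (excluding start)."""
    frontier, seen = {start}, {start}
    for _ in range(k):
        frontier = {n for node in frontier for n in graph.get(node, [])} - seen
        seen |= frontier
    return seen - {start}

print(sorted(k_hop(graph, "Genesis 1:1")))
```

Virtual Havruta then ranks this expanded candidate set (PageRank in their case) before handing it to the LLM filter, since a 2-hop neighborhood on a 3.3M-link graph is far too large to pass through unranked.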

**Topic source selection (Linear Integer Programming):**
- Algorithm selects which sources appear on topic pages
- Developer docs list this as a documented feature ("Linear Integer Programming For Topic Pages Sources Selection")
- Balances diversity + importance + representation across texts
- Uses PageRank-style signals from link graph
- This is why Sefaria's topic pages show the "best" sources, not just any matching source
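Sefaria's exact objective and constraints are not published, but a tiny brute-force stand-in with a one-source-per-book diversity constraint illustrates the shape of the problem (a real implementation would hand this to an ILP solver rather than enumerate subsets):

```python
from itertools import combinations

# Invented candidate sources with importance scores.
sources = [
    {"ref": "Berakhot 5a", "book": "Berakhot", "score": 0.9},
    {"ref": "Berakhot 6b", "book": "Berakhot", "score": 0.8},
    {"ref": "Genesis 18:1", "book": "Genesis", "score": 0.7},
    {"ref": "Sotah 14a", "book": "Sotah", "score": 0.6},
]

def select_sources(sources, k):
    """Pick k sources maximizing total score, at most one per book."""
    best, best_score = None, -1.0
    for combo in combinations(sources, k):
        books = [s["book"] for s in combo]
        if len(set(books)) < len(books):
            continue  # violates the one-per-book diversity constraint
        total = sum(s["score"] for s in combo)
        if total > best_score:
            best, best_score = combo, total
    return [s["ref"] for s in best]

print(select_sources(sources, 2))  # → ["Berakhot 5a", "Genesis 18:1"]
```

Note that the second-highest-scoring source (Berakhot 6b, 0.8) loses to a lower-scoring one from a different book — exactly the behavior that keeps a topic page from showing ten paragraphs of the same text.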

**Rabbinic Embedding Leaderboard (HuggingFace, Jan 2026):**
- Evaluates Hebrew/Aramaic embedding models for cross-lingual retrieval
- Result: Gemini Embedding 001 = 93.9% recall@1 (#1), Qwen3-Embedding-8B (#2), Voyage multilingual-2 (#3), OpenAI text-embedding-3-large = 69.9%, Hebrew-specific BERT = worst (~1-2%)
- Key finding: domain-specific models lose to good general multilingual models
- Direct implication for us: our intfloat/multilingual-e5-base may be beatable by Gemini Embedding or Qwen3
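Recall@1 as reported by the leaderboard can be computed like this; the vectors are tiny made-up stand-ins for real embeddings:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def recall_at_1(query_vecs, passage_vecs, gold):
    """gold[i] = index of the single relevant passage for query i."""
    hits = 0
    for i, q in enumerate(query_vecs):
        best = max(range(len(passage_vecs)), key=lambda j: cosine(q, passage_vecs[j]))
        hits += (best == gold[i])
    return hits / len(query_vecs)

queries = [[1.0, 0.1], [0.0, 1.0]]
passages = [[0.9, 0.2], [0.1, 0.8], [0.5, 0.5]]
print(recall_at_1(queries, passages, gold=[0, 1]))  # → 1.0 on this toy data
```

This is the metric to reuse for an Islamic eval set: embed the 50 questions and the human-rated relevant paragraphs with each candidate model (E5, Gemini Embedding, Qwen3) and compare recall@1 directly.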

---

## Q5: How Do They Handle Language Complexity?

Hebrew + Aramaic challenges:
- No vowels in standard text
- Words fuse prefixes/prepositions into single tokens
- Same root appears in dozens of morphological forms
- Medieval texts use inconsistent spelling and abbreviations
- Rashi script (different letterforms)

**Sefaria's approach:**
- Partner with Dicta for Hebrew NLP (vocalization, abbreviation expansion, citation extraction)
- Multiple text versions — users choose the version they can read
- Bilingual search — accepts Hebrew or English transliteration input
- Linker v3 works on both Hebrew and English citations
- Nikud (vowel points) optional — Dicta provides vowelized versions of Berakhot and others

**Parallel to our Urdu challenge:**
- Urdu: no standard romanization, script complexity, code-switching Arabic/Persian/Hindi
- Our solution: English enrichment first, then index English (LEARNINGS.md L4)
- Sefaria's equivalent: English translations alongside Hebrew
- Key difference: Sefaria has Dicta building Hebrew-specific NLP. We have no equivalent Urdu NLP foundation.
- Sefaria NER model (he_torah_ner, HuggingFace): F-score 82.96% on Hebrew citations. We have no equivalent metric for our Arabic/Hadith detector.

---

## Q6: What Is Their Trust/Quality Layer?

**Three mechanisms:**

1. **Scholar review before AI content goes live:**
   - 14,000 of 15,500 AI-generated source introductions reviewed by staff
   - Low-scoring content (Claude 3 Opus score ≤5) gets additional review
   - AI-generated pages marked with visible AI icon
   - Human-written overview paragraphs on many pages
   - Dedicated feedback form for corrections

2. **Citation verification by design:**
   - Linker API verifies every citation exists in corpus
   - linkFailed:true flag when citation not verifiable
   - AI cannot hallucinate a citation that Linker will then mark as failed
   - Public verifiability: anyone can click any citation and check the source

3. **Open source transparency:**
   - All code GPL-3.0 on GitHub
   - All data CC-licensed on GitHub + Google Cloud
   - Community can audit and fix errors

**Critical lesson for us:**
Our [P{id}] citations are structurally similar — they verify the paragraph exists. But our citations are not publicly verifiable. Anyone checking our citation has to trust us. Sefaria's citations link to text anyone on the internet can read and cross-check. This is the single biggest trust gap.

---

## Q7: What Is Their Habit Strategy?

- **Daf Yomi**: daily Talmud page, 7.5-year cycle, tens of thousands of participants worldwide, dedicated mobile page, split-screen view
- **Parashat HaShavua**: weekly Torah portion
- **Daily Mishnah, Nach Yomi, Mishna Yomi**: multiple parallel daily schedules
- **Calendar API** (GET /api/calendars): returns daily/weekly schedule with timezone support
- **Source Sheet Builder**: educators create and share materials → community of content producers
- **Follow system**: users follow sheet creators they trust
- **Email subscriptions**: daily texts
- **Mobile apps** (iOS + Android): full library downloadable for offline use, push notifications
- **Torah Tab**: browser extension for Chrome

**Key insight:**
Sefaria does not rely on AI to bring users back. The Jewish learning calendar is the habit. Daf Yomi alone has tens of thousands of participants globally completing a 7.5-year cycle. Sefaria plugged into an existing 2,000-year-old habit system.

**Implication for us:**
Islamic equivalent habits exist: daily Quran tilawat, weekly Juma reflection, Ramadan khatm, daily wird/dhikr, Islamic calendar (Ramadan, Dhul Hijjah). We are not plugged into any of these yet. Our Telegram bot exists but has no daily cron. This is the lowest-hanging fruit for retention.
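A minimal sketch of the daily-cron selection, assuming the hypothetical `viral_score`/`best_quote` fields named in Q9 (not an existing schema):

```python
from datetime import date

# Hypothetical enriched paragraph records.
paragraphs = [
    {"id": "P101", "best_quote": "…", "viral_score": 0.72},
    {"id": "P205", "best_quote": "…", "viral_score": 0.91},
    {"id": "P318", "best_quote": "…", "viral_score": 0.85},
]

def pick_daily_quote(paragraphs, today=None):
    """Deterministic daily rotation through the top-scored quotes."""
    today = today or date.today()
    top = sorted(paragraphs, key=lambda p: p["viral_score"], reverse=True)[:30]
    return top[today.toordinal() % len(top)]

quote = pick_daily_quote(paragraphs, date(2026, 3, 1))
```

Deterministic date-based rotation (rather than random choice) means the bot sends the same quote to every subscriber on a given day and never repeats until the pool cycles — no state to store between cron runs.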

---

## Q8: Ecosystem (what was missing from V1)

**Apps built on Sefaria (confirmed from developers.sefaria.org/docs/powered-by-sefaria):**
- ChavrutAI — Talmud accessible through modern technology
- DafBuddy — AI-powered Gemara study companion using Sefaria text+translation API
- Dafyomi AI Summary — explores/summarizes/translates Talmud insights
- Darshan AI — creates full lessons with cited sources linked to Sefaria
- GoTorah! — intelligent chat with sage-specific dialogue, chavruta study
- Bina v'Da'at — Hebrew AI chatbot for Torah learning
- Tikkun.io — Torah reading preparation
- Besasefer — analytical search engine for Tanakh
- Tutorial: "Build a Torah-Powered AI Chatbot" — official Sefaria tutorial for RAG chatbot
- 200+ apps total

**Developer tools:**
- 15+ REST API endpoints, no auth for read access
- Elasticsearch proxy API for full-text search
- Sefaria MCPs — Model Context Protocol servers (listed in developer docs nav as "The Sefaria MCPs") — confirmed
- HuggingFace org: datasets (3.55M Hebrew segments, 886K English segments, 3.74M links), NER models, Leaderboard
- Download entire DB as MongoDB dump
- Sefaria-Export GitHub repo: structured text + links under CC license

---

## Q9: What We Can Directly Apply (updated)

**Immediately actionable:**

1. **Dual-LLM validation pipeline** — use their Pirkei Avot pattern: Gemini generates, separate LLM (could be same Gemini with different prompt, or Claude) reviews on relevance + accuracy. Anything ≤5 gets flagged. We can apply this to our chatbot answers, not just enrichment.

2. **Build citation verification layer** — [P{para_id}] already verifies the paragraph exists. Next step: spot-check whether the paragraph actually supports the claim being made in the answer.

3. **Mark AI-generated content visibly** — if we add AI-generated topic summaries or enrichment data to a public page, mark it visibly (similar to Sefaria's AI icon). Builds trust.

4. **Topic ranking with diversity criteria** — our 143 topic JSONs weight all citations equally. Sefaria uses Linear Integer Programming to balance diversity + importance. A simpler version: weight by viral_score composite + deduplicate by book (don't show 10 paragraphs from one book on a topic page).

5. **Telegram daily cron** — plug into existing Islamic calendar habits. Daily Maulana quote using viral_score to select best_quote, cron at 7am IST. Zero new infrastructure.

6. **Build the eval set** — Sefaria has the Rabbinic Embedding Leaderboard. We need an Islamic equivalent: 50 questions with human-rated relevant paragraphs. Test E5 vs Gemini Embedding vs Qwen3.

7. **Study Virtual Havruta source code** — open source on GitHub. Their retrieve_docs_linker() + select_useful_references() pipeline directly applicable to chat_engine.py improvements.

**Structural lessons (unchanged):**
- Data first, chatbot second. We are already doing this.
- Citation as trust mechanism, not scholar review of every answer.
- Open source creates a community that multiplies effort (200+ apps built by outside developers).
- Habit loop beats retrieval quality every time.

---

## Gaps Found in V1 Research (what this update fixes)

| V1 Error | Corrected |
|----------|-----------|
| "10 million links" | 3.3 million confirmed |
| "17,730 topics" | 17,730 = topic ontology entities. 5,000 = topic web pages. Different things. |
| AI on topics understated | 1,000 of 5,000 pages have AI content. 14,000 introductions reviewed by staff. |
| Pirkei Avot pipeline missing | Gemini generates → Claude 3 Opus scores → ≤5 gets extra human review. Now documented. |
| Rome Call ethics missing | Confirmed: Microsoft, Google, IBM, Cisco, Chief Rabbinate, Yeshiva University all signed. |
| AI tone policy missing | Never authoritative "God said X". Always shows differing commentator views. |
| Ecosystem understated | 200+ apps. ChavrutAI, DafBuddy, DarshAI, GoTorah!, official RAG tutorial, etc. |
| MCP not mentioned | Sefaria MCPs listed in developer docs — confirmed. |
| Feedback mechanism missing | Dedicated AI feedback form. Corrections by community. corrections@sefaria.org. |

---

## Sources

- https://www.sefaria.org/ai — read directly, March 2026 (Pirkei Avot pipeline, AI ethics, topic page anatomy)
- https://developers.sefaria.org — official developer portal
- https://developers.sefaria.org/docs/linker-v3 — read directly
- https://developers.sefaria.org/docs/powered-by-sefaria — ecosystem apps
- https://github.com/Sefaria/Sefaria-Project/wiki — data model, index schema
- https://github.com/Sefaria/AppliedAI — Virtual Havruta source code (read directly)
- https://github.com/Sefaria/Sefaria-Export — structured data exports
- https://huggingface.co/Sefaria — AI/ML work, Embedding Leaderboard, he_torah_ner
- https://en.wikipedia.org/wiki/Sefaria — 775K users, 234 countries, 18 engineers
- https://voices.sefaria.org/sheets/244437 — Topics behind-the-scenes blog
- https://blogs.timesofisrael.com — 322M words, 3.3M links, 450K source sheets
- https://ejewishphilanthropy.com/sefaria-to-integrate-ai-into-its-text-library-by-mid-2024/ — AI tone policy interview
- https://jweekly.com/2025/06/18 — 14,000 of 15,500 AI introductions reviewed

---

## Status

DONE: Sefaria (Platform 01) — Version 2 complete
NEXT: Dicta — Hebrew NLP + Dicta-LM 3.0
