# Complete File Directory — annotation_tool_v2
# What every file does, based on code reading + filename inference
# Evidence key: [READ] = file directly read | [INFERRED] = inferred from name + context
# Last updated: March 2026

---

## CORE APP FILES

| File | Size | Evidence | What It Does |
|------|------|----------|--------------|
| app/models.py | 34.5KB | READ | ALL database tables: Book, Chapter, Paragraph, Group, Reference, Entity, Relationship, Video, VideoSegment, VideoEntity, VideoRelationship, VideoSnippet, SegmentTranslation, SegmentAudio, CitationVerification, ParagraphTranslation, Version. The single source of truth for the entire data model. |
| app/__init__.py | 75.4KB | INFERRED | Flask app factory. Registers all blueprints (routes), initializes DB, login manager, starts background services. The largest core file and unusually large for an app factory, so it likely contains significant route logic or app setup. |
| app/config.py | 3.9KB | INFERRED | App configuration: database URL, file paths, DATA_DIR, logging setup, environment variable loading. |
| app/main.py | 290B | INFERRED | Entry point. Creates Flask app and runs it. Tiny — just calls the factory. |
| app/highlight_config.py | 2.1KB | INFERRED | Configuration for text highlighting in the editor (colors per reference type). |

---

## SERVICES — BOOKS PIPELINE (ingestion → enrichment → export)

| File | Size | Evidence | What It Does |
|------|------|----------|--------------|
| upload_processor.py | 15.9KB | READ (partial) | Orchestrates book ingestion: receives uploaded DOCX/PDF, calls docx_parser, creates Book/Chapter/Paragraph rows, triggers classification pipeline. Entry point for all book processing. |
| docx_parser.py | 5.2KB | READ | Parses DOCX using python-docx. Extracts paragraphs, headings (level 1/2/3), quotes. Preserves document structure. Returns list of paragraph dicts. |
| para_classifier.py | 6.9KB | READ (partial) | Assigns type to each paragraph: paragraph / heading / subheading / quote. Sets heading level 1/2/3. Rule-based + pattern-based, no LLM. |
| pattern_classifier.py | 8.8KB | READ (partial) | Identifies structural patterns within paragraphs (e.g. numbered lists, dialogue, verse citations). Complements para_classifier. |
| pdf_matcher.py | 9.8KB | INFERRED | Matches parsed paragraphs back to original PDF pages. Stores page_number + page_confidence on each Paragraph. Enables page-level citation. |
| pdf_toc_extractor.py | 14.0KB | INFERRED | Extracts table of contents from PDF. Used to build Chapter structure automatically from PDF without DOCX. |
| grouping.py | 6.2KB | READ | Chunks paragraphs into 512-800 token groups for entity extraction. Token count = word count + punctuation/2 (approximate, not tiktoken). Groups stored in Group table with order_index. |
| heading_classifier_llm.py | 3.4KB | READ (partial) | Step 1 of /process: sends all headings to LLM (Sonnet) to classify as L1/L2/L3. Distinguishes chapter headings from section headings from sub-section headings. |
| group_entity_extractor.py | 9.3KB | READ (partial) | Step 2 of /process: sends 5 groups at a time to LLM (Sonnet). Extracts CONCEPT/PERSON/PLACE/VERSE_ASPECT entities + relationships. Stores in Entity + Relationship tables. |
| paragraph_enricher.py | 9.1KB | READ | Step 3 of /process (Opus): ONE LLM call per paragraph. Produces: creation_plan (14 aspects of Maulana's worldview), emotional_journey (21 input → 19 output emotions), audience_level, shareable flag, glance_text (best standalone sentence). |
| viral_scorer.py | 4.2KB | READ | Step 4 of /process (Sonnet, batches of 5): scores each paragraph on 7 dimensions (emotional_punch, quotability, universality, relatability, novelty, brevity, actionability) with weights. Weighted composite 0-100. Also extracts best_quote, caption_suggestion, platform_fit. |
| argument_mapper.py | 5.8KB | READ | Step 5 of /process (Sonnet): maps paragraphs to contemporary issues from issues_taxonomy.yaml. Issue types: refutation, reframing, evidence, analogy, historical_example, logical_reasoning, practical_guidance. |
| reasoning_flow_extractor.py | 6.7KB | READ (partial) | Step 6 of /process (Sonnet): extracts reasoning_flow JSON {type, steps} showing the logical structure of Maulana's argument in each paragraph. |
| timeline_enricher.py | 4.0KB | READ (partial) | Step 7 of /process (Haiku): enriches existing year Reference rows with event_data JSON {event_title, event_description, maulana_lesson, significance, category}. |
| section_helper.py | 5.1KB | INFERRED | Helper for managing section structure during book editing. Likely handles section_title assignment to paragraphs. |
| backfill_referred_text.py | 2.2KB | INFERRED | One-time backfill script: goes through existing Quran references, loads canonical verse text, populates referred_text field that was empty. |
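
The chunking rule described for grouping.py (token count = word count + punctuation/2, groups of 512-800 tokens) can be illustrated with a minimal sketch. The function names, the integer division, and the greedy packing strategy are assumptions for illustration, not the actual code in grouping.py; only the upper bound is enforced here.

```python
import re

MAX_TOKENS = 800  # upper bound quoted in the grouping.py row


def estimate_tokens(text: str) -> int:
    """Approximate token count: words plus half the punctuation marks."""
    words = len(text.split())
    punctuation = len(re.findall(r"[^\w\s]", text))
    return words + punctuation // 2  # integer division is an assumption


def group_paragraphs(paragraphs):
    """Greedily pack paragraphs, closing a group before it would exceed MAX_TOKENS."""
    groups, current, current_tokens = [], [], 0
    for para in paragraphs:
        tokens = estimate_tokens(para)
        if current and current_tokens + tokens > MAX_TOKENS:
            groups.append(current)
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += tokens
    if current:
        groups.append(current)
    # order_index mirrors the column named in the Group table description
    return [{"order_index": i, "paragraphs": g} for i, g in enumerate(groups)]
```

A single paragraph longer than 800 estimated tokens ends up alone in its own group in this sketch; how the real module handles that case is not known from the reading.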

---

## SERVICES — REFERENCE DETECTION (Quran + Hadith)

| File | Size | Evidence | What It Does |
|------|------|----------|--------------|
| quran_detector.py | 21.4KB | READ (partial) | Regex patterns for: "2:153", "Quran 67:3", "(Al-Baqarah 2:255)", Arabic surah names, etc. Creates Reference rows with ref_type='quran'. Stores surah, ayah_start, ayah_end, surah_name, raw_text, display_text, classification_method. Shared by books AND videos. |
| hadith_detector.py | 7.8KB | READ (partial) | Regex patterns for hadith collections: bukhari, muslim, tirmidhi, ahmad, nasai, ibn majah, abu dawud. Creates Reference rows with ref_type='hadith'. |
| hadith_matcher.py | 22.8KB | READ (partial) | Fuzzy-matches detected hadith references against local JSON hadith database. Produces match_score (0-100), sunnah_url, match_candidates (top 5 cross-collection). Sets sunnah_url_verified after HTTP 200 check. |
| quran_data.py | 5.4KB | INFERRED | Local data file: surah names, ayah counts, chapter metadata. Used by quran_detector and quran_lookup for validation and canonical text lookup. |
| quran_lookup.py | 1.3KB | INFERRED | Simple lookup: given surah + ayah, returns canonical verse text. Tiny file — just a lookup table or API call wrapper. |
| year_detector.py | 6.7KB | INFERRED | Detects year references in text (CE/AH/BC/BCE, ranges, decades, centuries). Creates Reference rows with ref_type='year'. |
| book_detector.py | 26.6KB | INFERRED | Detects book/publication citations in parentheses. The largest file in this group. Classifies subtypes: actual_book, news, islamic_book, encyclopedia, bible, explanation, unknown. Likely uses an LLM for ambiguous cases. |
| footnote_linker.py | 6.3KB | INFERRED | Links footnote markers in text to their footnote content. Creates Reference rows with ref_type='footnote'. |
| citation_extraction_engine.py | 5.6KB | INFERRED | MVP5 feature: extracts citation_context JSON for each reference — pattern, confidence, intro/commentary/discussion text, csv_match. Stored on Reference.citation_context. |
| cross_reference_enricher.py | 11.1KB | INFERRED | Post-processing: enriches detected references by cross-referencing across books. Finds other paragraphs that cite the same Quran verse or hadith. |
| csv_verifier.py | 3.8KB | INFERRED | Verifies reference detection against a CSV of known references. Quality check tool for the detection pipeline. |
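
As an illustration of the citation shapes listed for quran_detector.py ("2:153", "Quran 67:3", "(Al-Baqarah 2:255)"), here is a minimal regex sketch. The pattern, the function name, and the "regex" value for classification_method are hypothetical; the real detector is far larger and presumably validates ayah numbers against quran_data.py, which this sketch does not do.

```python
import re

# Hypothetical pattern: optional "Quran" or capitalised surah-name prefix,
# then surah:ayah with an optional ayah range such as 2:153-155.
QURAN_REF = re.compile(
    r"(?:(?P<prefix>Quran|[A-Z][A-Za-z'-]+)\s+)?"
    r"(?P<surah>\d{1,3}):(?P<start>\d{1,3})(?:\s*-\s*(?P<end>\d{1,3}))?"
)


def detect_quran_refs(text):
    """Return dicts shaped like the Reference fields named in the table."""
    refs = []
    for m in QURAN_REF.finditer(text):
        surah = int(m.group("surah"))
        start = int(m.group("start"))
        end = int(m.group("end")) if m.group("end") else start
        # The Quran has 114 surahs; anything else is not a verse reference.
        if not 1 <= surah <= 114 or start < 1 or end < start:
            continue
        prefix = m.group("prefix")
        refs.append({
            "ref_type": "quran",
            "surah": surah,
            "ayah_start": start,
            "ayah_end": end,
            "surah_name": prefix if prefix and prefix != "Quran" else None,
            "raw_text": m.group(0),
            "classification_method": "regex",  # assumed value
        })
    return refs
```

A pattern this loose also matches clock times such as "At 10:30", which is one reason a production detector needs context checks beyond the regex itself.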

---

## SERVICES — VIDEOS PIPELINE

| File | Size | Evidence | What It Does |
|------|------|----------|--------------|
| fireflies_pipeline.py | 26.7KB | READ | Full Fireflies.ai pipeline: yt-dlp download → catbox.moe upload (bypasses Hetzner IP block) → Fireflies GraphQL submission → async webhook handling → segment import → entity extraction → LightRAG push. Handles status lifecycle: none→downloading→submitted→transcribing→completed, with failed as the error state. |
| video_processor.py | 12.5KB | READ (partial) | Imports video transcript into DB: merges Fireflies sentences into ~30s VideoSegment rows. Runs quran_detector + hadith_detector on segment text. Manages video metadata. |
| soniox_pipeline.py | 21.6KB | READ (partial) | Alternative STT via Soniox API: per-word timestamps, confidence scores, language detection, Roman Urdu transliteration. Dual-source comparison workflow vs Fireflies. auto_skip_cleanup for high-confidence segments. |
| channel_scanner.py | 8.2KB | INFERRED | Scans a YouTube channel via API/RSS to discover video URLs. Populates the video queue from @CPSInternational or other channels. |
| video_snippet_creator.py | 6.8KB | INFERRED | Creates VideoSnippet rows (topic-level ~3-5 min clips) from VideoSegments. Combines adjacent segments around a topic. Used by BM25 index for discovery. |
| transcript_compare.py | 24.6KB | READ (partial) | Side-by-side comparison of Fireflies vs Soniox transcripts. Highlights differences. Used in transcript verification workflow (transcript_editor.html). |
| video_context_helper.py | 1.4KB | INFERRED | Small helper: given a segment, loads surrounding context (prev/next segments). Used to build context window for LLM calls on video content. |
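
The merge step described for video_processor.py (Fireflies sentences combined into ~30s VideoSegment rows) could look roughly like the sketch below. The dict keys `start`, `end`, and `text` and the rule "close a segment once it spans at least 30 seconds" are assumptions; the actual field names and boundary logic were only partially read.

```python
TARGET_SECONDS = 30.0  # approximate segment length from the table


def merge_sentences(sentences, target=TARGET_SECONDS):
    """Merge consecutive timed sentences into segments of roughly `target` seconds."""
    segments, current = [], None
    for s in sentences:
        if current is None:
            # Start a new segment from this sentence.
            current = {"start": s["start"], "end": s["end"], "text": s["text"]}
        else:
            # Extend the open segment.
            current["end"] = s["end"]
            current["text"] += " " + s["text"]
        if current["end"] - current["start"] >= target:
            segments.append(current)
            current = None
    if current is not None:
        segments.append(current)  # trailing short segment
    return segments
```

Each resulting segment text can then be passed to the same Quran and hadith detectors used for book paragraphs, which matches the shared-detector design noted in the reference detection table.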

---

## SERVICES — RETRIEVAL + SEARCH

| File | Size | Evidence | What It Does |
|------|------|----------|--------------|
| chat_engine.py | 53.4KB | READ (complete) | The entire chatbot brain. 7-strategy retrieval (FTS5 + E5 semantic + topic + cross-ref + KG hop + cross-modal + emotional), BGE cross-encoder reranking (blend: 0.4 original + 0.6 reranker), context building, recipe-based prompt generation, 3-tier LLM calls, citation injection + verification. |
| discovery_service.py | 22.7KB | READ | Content discovery for video/book browsing. 4-signal hybrid: FTS5 + BM25 + E5 + BGE reranker. Urdu/English concept bridge (30+ pairs). Mood → emotional_journey mapping (8 moods). Topic + duration filtering. Cross-lingual search. |
| embedding_service.py | 12.2KB | READ | Manages all embeddings. Model: intfloat/multilingual-e5-base (768-dim). Reranker: BAAI/bge-reranker-v2-m3. In-memory caches: paragraph (~22MB), segment (~328MB), topic (~0.4MB). E5 prefix handling: "query: " / "passage: ". Cosine similarity search. |
| bm25_service.py | 2.5KB | READ | BM25Okapi (k1=1.5, b=0.75) index over VideoSnippet title + description. NOT raw transcript. Built lazily at first search call (~100ms). ~1-2MB RAM. Used by discovery_service for keyword matching on clean AI-written summaries. |
| cross_linker.py | 4.5KB | READ | Books ↔ Videos bridge via shared entity names. get_related_book_passages(segment_id): finds book paragraphs sharing entity names. get_related_video_segments(paragraph_id): reverse. Sorted by count of shared entities. |
| topic_tracker.py | 6.9KB | INFERRED | Manages the 143-topic taxonomy. Creates/updates topic embeddings. Assigns topic_tags to VideoSegments. Powers strategy 3 (topic matching) in chat engine. |
| concept_extractor.py | 3.4KB | INFERRED | Lightweight concept extraction (probably regex or small model) for quick tagging. Different from group_entity_extractor (which is full LLM). |
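
The reranking blend noted for chat_engine.py (0.4 original + 0.6 reranker) can be sketched as follows. The min-max normalisation step and the tuple layout are assumptions added so that the two score scales are comparable; the table only confirms the two weights.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def blend_and_rank(candidates, w_original=0.4, w_reranker=0.6):
    """candidates: list of (doc_id, original_score, reranker_score) tuples."""
    if not candidates:
        return []
    original = min_max([c[1] for c in candidates])
    reranked = min_max([c[2] for c in candidates])
    blended = [
        (c[0], w_original * o + w_reranker * r)
        for c, o, r in zip(candidates, original, reranked)
    ]
    # Highest blended score first.
    return sorted(blended, key=lambda pair: pair[1], reverse=True)
```

Weighting the cross-encoder more heavily lets it reorder the retrieval candidates while the original score still breaks near-ties.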

---

## SERVICES — EXPORT LAYER

| File | Size | Evidence | What It Does |
|------|------|----------|--------------|
| exporter.py | 12.3KB | READ (partial) | Core export: book JSON export, LightRAG JSON export, enriched JSON. Called by export routes. Produces data/exports/ files. |
| lightrag_exporter.py | 12.7KB | READ (partial) | Builds the custom_kg payload for LightRAG: formats entities + relationships from DB into {entity_name, entity_type, description, source_id} format. Called before HTTP POST to /insert_custom_kg. |
| topic_centric_exporter.py | 27.7KB | READ (partial) | Largest exporter. Builds 143 topic JSON files in data/exports/topic_centric/. Each file: {topic_slug, topic_name, citations: [{paragraph_id, book, chapter, text, emotional_tags, creation_plan...}]}. Used by strategy 3 in chat engine. |
| verse_centric_exporter.py | 11.4KB | INFERRED | Builds verse_centric JSON files (one per Quran surah:ayah). Each file: all paragraphs + video segments citing that verse. Used by strategy 5 (KG interlink hop) in chat engine. |
| hadith_centric_exporter.py | 13.1KB | INFERRED | Same as verse_centric but for hadith. Builds hadith_centric JSON files. Used by strategy 5 interlink hop. |
| timeline_centric_exporter.py | 15.0KB | INFERRED | Builds timeline JSON exports: all year references with event_data, sorted chronologically. Powers the timeline view in article templates. |
| argument_centric_exporter.py | 7.2KB | INFERRED | Builds argument JSON exports: paragraphs grouped by contemporary issue. Powers the argument article templates. |
| creation_plan_exporter.py | 7.6KB | INFERRED | Builds creation_plan JSON exports: paragraphs grouped by theological aspect. Powers the creation plan article templates. |
| healing_feed_exporter.py | 6.5KB | INFERRED | Builds healing feed: paragraphs with healing_snippet + snippet_score, sorted by score. Powers the heal.html discovery page. |
| video_centric_exporter.py | 4.7KB | INFERRED | Builds video summary exports: video metadata + top segments + entity list. Used for video article pages. |
| incremental_exporter.py | 10.3KB | INFERRED | Exports only paragraphs changed since last export. Avoids re-exporting entire book on every edit. Uses timestamps to detect changes. |
| export_tracker.py | 11.8KB | READ (partial) | Tracks which exports are complete/stale for each book. Shows export status on dashboard. Checks prerequisites (e.g. enrichment must be done before lightrag_kg export). |
| export_manifest.py | 7.0KB | INFERRED | Generates a manifest of all export files: paths, sizes, timestamps. Used to validate that all required exports exist before enabling chatbot for a book. |
| memgraph_exporter.py | 14.6KB | INFERRED | Formats entity + relationship data for Memgraph (graph database). Builds Cypher CREATE statements. Used to push knowledge graph to Memgraph for visualization. |
| memgraph_ingester.py | 16.7KB | INFERRED | Executes Cypher statements against Memgraph via Bolt protocol. Handles batching, error recovery. Companion to memgraph_exporter. |
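
To make the lightrag_exporter.py description concrete, here is a minimal sketch of building a custom_kg payload. The entity keys (entity_name, entity_type, description, source_id) come from the table; the relationship keys (src_id, tgt_id) and the input row shapes are assumptions and should be checked against the LightRAG documentation and the real exporter.

```python
import json


def build_custom_kg(entities, relationships):
    """Format DB-like rows (plain dicts here) into a custom_kg dictionary."""
    entity_payload = [
        {
            "entity_name": e["name"],
            "entity_type": e["entity_type"],
            "description": e.get("description", ""),
            "source_id": e["source_id"],
        }
        for e in entities
    ]
    relationship_payload = [
        {
            "src_id": r["source"],   # assumed key name
            "tgt_id": r["target"],   # assumed key name
            "description": r.get("description", ""),
            "source_id": r["source_id"],
        }
        for r in relationships
    ]
    return {"entities": entity_payload, "relationships": relationship_payload}


# Hypothetical sample rows, for illustration only.
sample = build_custom_kg(
    [{"name": "patience", "entity_type": "CONCEPT", "source_id": "group-1"}],
    [{"source": "patience", "target": "gratitude", "source_id": "group-1"}],
)
payload_json = json.dumps(sample, ensure_ascii=False)
```

The serialized `payload_json` is what would be sent in the HTTP POST to /insert_custom_kg; the request itself is left out because its exact headers and wrapper fields were not read.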

---

## SERVICES — LANGUAGE + TRANSLATION

| File | Size | Evidence | What It Does |
|------|------|----------|--------------|
| translator.py | 17.3KB | INFERRED | Translates book content (Paragraph.text) and segment content between languages. Stores results in ParagraphTranslation / SegmentTranslation tables. LLM-based with confidence scoring. |
| urdu_to_hindi.py | 10.0KB | INFERRED | Converts Urdu script (Arabic script) → Hindi script (Devanagari). Rule-based character mapping + normalization. Used for cross-lingual display. |
| tts_service.py | 33.1KB | INFERRED | Text-to-speech: converts translated segment text to audio. Supports multiple voices (sarvam:meera, google:hi-IN-*, etc.). Speaker-aware: different voice per speaker_id. Stores in SegmentAudio table as base64. |

---

## SERVICES — MISC / UTILITIES

| File | Size | Evidence | What It Does |
|------|------|----------|--------------|
| llm_client.py | 7.2KB | READ | Centralized LLM client. 3-tier fallback: Gemini proxy (free, host port 5001) → DeepSeek API (cheapest paid) → OpenRouter (Gemini 2.5 Flash). Circuit breaker: pauses paid calls for 1 hour if both paid providers return HTTP 402 (Payment Required). parse_json_response() strips markdown code blocks. |
| pipeline_tracker.py | 12.9KB | READ (partial) | Tracks each book's progress through the pipeline stages (classified → refs_extracted → entities_extracted → enriched → viral_scored → argument_mapped → pushed_to_lightrag). Returns status strings like 'done 12 Mar' or 'pending'. Used by dashboard. |
| snippet_synthesizer.py | 10.2KB | INFERRED | LLM-synthesizes healing_snippet for each paragraph: a standalone 90-second readable piece from the paragraph's wisdom. Fully self-contained: the snippet needs no additional context to be read. |
| snippet_scorer.py | 8.6KB | INFERRED | Scores healing_snippet quality: snippet_score (0-100 emotional resonance). Different from viral_score — specifically for healing/consolation potential. |
| pg_articles.py | 7.2KB | INFERRED | Generates article-style pages from export JSONs. Probably builds topic_article, verse_article, hadith_article content. Powers the /articles/ routes. |
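
The fallback and circuit-breaker behaviour described for llm_client.py can be sketched as below. The class, the `PaymentRequired` exception, and the callable-provider interface are hypothetical stand-ins; only the tier order (free first, then paid) and the one-hour pause after all paid providers report HTTP 402 come from the table.

```python
import time


class PaymentRequired(Exception):
    """Raised by a provider stub when the upstream API answers HTTP 402."""


class LLMClient:
    PAUSE_SECONDS = 3600  # one hour, as stated in the table

    def __init__(self, free_provider, paid_providers, clock=time.time):
        self.free_provider = free_provider
        self.paid_providers = paid_providers
        self.clock = clock  # injectable so the pause can be tested
        self.paid_paused_until = 0.0

    def complete(self, prompt):
        try:
            return self.free_provider(prompt)
        except Exception:
            pass  # free tier failed, fall through to the paid tiers
        if self.clock() < self.paid_paused_until:
            raise RuntimeError("paid providers paused by circuit breaker")
        payment_failures = 0
        for provider in self.paid_providers:
            try:
                return provider(prompt)
            except PaymentRequired:
                payment_failures += 1
            except Exception:
                continue
        # Open the breaker only when every paid provider reported 402.
        if self.paid_providers and payment_failures == len(self.paid_providers):
            self.paid_paused_until = self.clock() + self.PAUSE_SECONDS
        raise RuntimeError("all LLM tiers failed")
```

Pausing only on 402 responses keeps transient network errors from disabling the paid tiers, while an exhausted balance stops repeated calls that cannot succeed.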

---

## ROUTES (Web Interface + API)

| File | Size | Evidence | What It Does |
|------|------|----------|--------------|
| routes/api.py | 383.6KB | INFERRED | The largest file in the entire project. Main REST API: all CRUD for books, paragraphs, references, entities, groups. All enrichment trigger endpoints (/api/book/{slug}/enrich, /api/book/{slug}/viral_score, etc.). All export endpoints. All video pipeline endpoints. The bulk of the admin UI API surface. |
| routes/chat.py | 56.0KB | INFERRED | Chat UI routes + chat API endpoints. Handles: ask question → retrieve_chunks → generate_answer flow. Chat session management. Analytics endpoints. Recipe management. Query logging. |
| routes/mobile_api.py | 29.6KB | INFERRED | Mobile-optimized API subset. Likely lighter responses for the mobile reader app. Authentication, book reading, verse lookup, chat for mobile clients. |
| routes/dashboard.py | 21.4KB | INFERRED | Dashboard routes: book list, pipeline status, upload page, overall stats. Uses pipeline_tracker.py to show per-book progress. |
| routes/editor.py | 21.4KB | INFERRED | Book editor routes: paragraph editing, reference management, heading classification, annotation tools. Powers the editor.html template. |
| routes/videos.py | 5.5KB | INFERRED | Video management routes: upload YouTube URL, trigger processing, list videos, view transcript, push to LightRAG. |
| routes/articles.py | 19.9KB | INFERRED | Article page routes: topic articles, verse articles, hadith articles, timeline articles, argument articles, healing feed. All powered by export JSONs. |
| routes/quranreader.py | 10.9KB | READ (partial) | Quran reader: verse lookup, LightRAG query for a specific verse, returns related book passages and video segments. Hosts the admin.html + chat.html + quran.html interfaces. |
| routes/explore.py | 4.2KB | INFERRED | Explore/discovery routes: video search page, content discovery by mood/topic/duration. Uses discovery_service.py. |
| routes/verify.py | 2.9KB | INFERRED | Reference verification routes: shows queue of unverified Quran/Hadith references, allows manual approve/reject. Updates CitationVerification table. |
| routes/exports.py | 2.7KB | INFERRED | Export trigger routes: /api/book/{slug}/export/lightrag, /export/topic_centric, etc. Calls exporter.py functions. |
| routes/auth.py | 1.4KB | INFERRED | Login/logout routes. Flask-Login integration. Checks password_hash via werkzeug. |
| routes/admin.py | 4.9KB | INFERRED | Admin-only routes: user management, system stats, bulk operations. Role check (role='admin'). |
| routes/monitor.py | 6.5KB | INFERRED | Server monitoring routes: docker status, disk usage, service health. Powers monitor.html. |
| routes/pipeline.py | 2.4KB | INFERRED | Pipeline management routes: trigger auto_pipeline, check pipeline status. |
| routes/dev.py | 4.3KB | INFERRED | Developer tools: debug endpoints, test LLM calls, inspect DB state. Not shown in production. |
| routes/share.py | 3.2KB | INFERRED | Share routes: generate shareable links for paragraphs/quotes. Used by viral content sharing. |
| routes/reader.py | 1.0KB | INFERRED | Public reader: serves book.spiritualmessage.org reader. Minimal — mostly redirects to static React app. |
| routes/topics.py | 790B | INFERRED | Topic listing route. Very small — just serves the topics.html index page. |
| routes/showcase.py | 365B | INFERRED | Showcase page route. Tiny — just renders showcase.html. |

---

## SCRIPTS (Run Manually / Cron)

| File | Size | Evidence | What It Does |
|------|------|----------|--------------|
| scripts/auto_pipeline.py | 6.1KB | INFERRED | Automated pipeline runner: picks up books in 'in_progress' status, runs enrichment steps sequentially. Designed for background/cron execution. |
| scripts/soniox_batch.py | 16.0KB | INFERRED | Batch submits videos to Soniox STT. Reads video list, uploads audio, polls for completion, imports results. For bulk video transcription. |
| scripts/soniox_retranscribe.py | 4.6KB | INFERRED | Re-transcribes specific videos through Soniox with language hints (e.g. force Urdu). For fixing poor auto-detection results. |
| scripts/backfill_word_spacing.py | 1.9KB | INFERRED | One-time fix script: adds proper word spacing to Urdu text that was stored without spaces (common OCR issue). |

---

## DATA REFERENCE FILES (not code)

| Location | Evidence | What It Contains |
|----------|----------|-----------------|
| taxonomy/issues_taxonomy.yaml | READ (partial) | Contemporary issues taxonomy with direct_signals and indirect_signals keywords per issue. Loaded by argument_mapper.py. |
| data/exports/topic_centric/ | INFERRED | 143 JSON files, one per topic. Each has citations list with paragraph_ids. |
| data/exports/verse_centric/ | INFERRED | JSON files per Quran verse (surah-ayah). Paragraphs + segments citing each verse. |
| data/exports/hadith_centric/ | INFERRED | JSON files per hadith. Paragraphs + segments citing each hadith. |
| data/audio/ | INFERRED | Downloaded M4A audio files from YouTube. Named {video_id}.m4a. |
| data/youtube_cookies.txt | INFERRED | Browser cookies for yt-dlp to bypass YouTube age-restriction/bot detection. |

---

## PIPELINE STAGE → FILE MAPPING

| Stage | Files Involved |
|-------|---------------|
| Book upload | upload_processor.py, docx_parser.py, pdf_matcher.py |
| Classification | para_classifier.py, pattern_classifier.py, heading_classifier_llm.py |
| Reference detection | quran_detector.py, hadith_detector.py, hadith_matcher.py, year_detector.py, book_detector.py, footnote_linker.py |
| Chunking | grouping.py |
| Entity extraction | group_entity_extractor.py, concept_extractor.py |
| Enrichment (Phase B) | paragraph_enricher.py, reasoning_flow_extractor.py, viral_scorer.py, argument_mapper.py, timeline_enricher.py |
| Healing content | snippet_synthesizer.py, snippet_scorer.py |
| Export | exporter.py, lightrag_exporter.py, topic_centric_exporter.py, verse_centric_exporter.py, hadith_centric_exporter.py, argument_centric_exporter.py, creation_plan_exporter.py, timeline_centric_exporter.py, healing_feed_exporter.py, incremental_exporter.py |
| LightRAG push | lightrag_exporter.py (format) + api.py/fireflies_pipeline.py (HTTP POST) |
| Memgraph push | memgraph_exporter.py, memgraph_ingester.py |
| Video ingestion | fireflies_pipeline.py, video_processor.py, soniox_pipeline.py, channel_scanner.py |
| Video enrichment | Same as book enrichment but on VideoSegment |
| Chat retrieval | chat_engine.py, embedding_service.py, bm25_service.py, discovery_service.py, cross_linker.py |
| Chat generation | chat_engine.py, llm_client.py |
| Translation | translator.py, urdu_to_hindi.py |
| TTS / Audio | tts_service.py |
| Tracking | pipeline_tracker.py, export_tracker.py, export_manifest.py |
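
The status strings reported by pipeline_tracker.py ('done 12 Mar' or 'pending') suggest a per-stage lookup along these lines. The stage order is taken from the pipeline_tracker.py row; the function name and the input mapping of stage to completion date are assumptions.

```python
from datetime import date

# Stage order as listed in the pipeline_tracker.py description.
STAGES = [
    "classified",
    "refs_extracted",
    "entities_extracted",
    "enriched",
    "viral_scored",
    "argument_mapped",
    "pushed_to_lightrag",
]

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def stage_status(completed):
    """completed: dict mapping stage name to the date it finished."""
    status = {}
    for stage in STAGES:
        done_on = completed.get(stage)
        if done_on is None:
            status[stage] = "pending"
        else:
            # Explicit month table avoids locale-dependent strftime output.
            status[stage] = f"done {done_on.day} {MONTHS[done_on.month - 1]}"
    return status
```

For example, `stage_status({"classified": date(2026, 3, 12)})` marks the first stage as done and every later stage as pending, which is the shape a dashboard row needs.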

---

## FILE COUNT SUMMARY

| Category | Count | Total Size |
|----------|-------|-----------|
| Services | 63 files | ~700KB |
| Routes | 20 files | ~600KB |
| Templates | 50+ HTML files | ~3MB |
| Scripts | 4 files | ~29KB |
| Core (models, config, init) | 5 files | ~116KB |

---

Evidence: READ = directly read during March 2026 code review session
         INFERRED = derived from filename, size, imports, and system context
