# PaperCast Implementation Plan
This plan outlines the steps to build **PaperCast**, an AI agent that converts research papers into podcast-style conversations using MCP, Gradio, and LLMs.
## 1. Infrastructure & Dependencies
- [ ] **Update `requirements.txt`**
- Add `transformers`, `accelerate`, `bitsandbytes` (for 4-bit LLM loading).
- Add `pymupdf` (PyMuPDF, for PDF text extraction — used in Section 2.1).
- Add `scipy` (for audio processing).
- Add `beautifulsoup4` (for web parsing).
- Add `python-multipart` (for API handling).
- Ensure `mcp` and `gradio` versions are pinned.
- [ ] **Project Structure Setup**
- Create `app.py` (entry point).
- Ensure an `__init__.py` exists in every package subdirectory.
- Create `config.py` in `utils/` for global settings (LLM model names, paths).
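The `utils/config.py` module could be as small as a frozen dataclass. A minimal sketch — the model IDs come from Sections 2.2 and 2.3 below, but the path and sample rate are illustrative assumptions:

```python
# Sketch of utils/config.py: one central place for the settings this plan names.
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    script_model: str = "unsloth/Phi-4-mini-instruct-unsloth-bnb-4bit"
    tts_model: str = "maya-research/maya1"
    download_dir: Path = Path("/tmp/papercast")  # illustrative default
    sample_rate: int = 24_000  # assumed TTS output rate; verify against Maya1


SETTINGS = Settings()
```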
## 2. Core Processing Modules
### 2.1. PDF Processing (`processing/`)
- [ ] **Implement `pdf_reader.py`**
- Function `extract_text_from_pdf(pdf_path) -> str`.
- Use `PyMuPDF` (fitz) for fast extraction.
- Implement basic cleaning (remove headers/footers/references if possible).
- [ ] **Implement `url_fetcher.py`**
- Function `fetch_paper_from_url(url) -> str`.
- Handle arXiv URLs (convert `/abs/` to `/pdf/` or scrape abstract).
- Download PDF to temporary storage.
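The arXiv handling can be sketched with only the standard library; `to_pdf_url` is the pure `/abs/` → `/pdf/` rewrite, and the download step is a plain `urllib` fetch into a temp file:

```python
# Sketch of processing/url_fetcher.py.
import tempfile
import urllib.request


def to_pdf_url(url: str) -> str:
    """Rewrite an arXiv abstract URL to the direct-PDF form."""
    return url.replace("/abs/", "/pdf/")


def fetch_paper_from_url(url: str) -> str:
    """Download the PDF to temporary storage and return the local path."""
    pdf_url = to_pdf_url(url)
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        with urllib.request.urlopen(pdf_url) as resp:
            tmp.write(resp.read())
        return tmp.name
```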
### 2.2. Generation Logic (`generation/`)
- [ ] **Implement `script_generator.py`**
- **Model**: `unsloth/Phi-4-mini-instruct-unsloth-bnb-4bit`.
- Define System Prompts for "Host" and "Guest" personas.
- Function `generate_podcast_script(paper_text) -> List[Dict]`.
- Output format: `[{"speaker": "Host", "text": "...", "emotion": "excited"}, {"speaker": "Guest", "text": "...", "emotion": "neutral"}]`.
- **Key Logic**: Prompt the model to include emotion tags (e.g. `[laugh]`, `[sigh]`) supported by Maya1.
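Only the prompt assembly and output parsing are sketched here; wiring Phi-4-mini in via `transformers` is left out, and the exact emotion/tag vocabulary is an assumption to verify against Maya1:

```python
# Sketch of generation/script_generator.py (prompt + parsing only).
import json
from typing import Dict, List

SYSTEM_PROMPT = (
    "You are writing a two-person podcast. Reply ONLY with a JSON list of "
    'turns: [{"speaker": "Host"|"Guest", "text": "...", "emotion": "..."}]. '
    'Inline tags like [laugh] and [sigh] are allowed inside "text".'
)


def build_prompt(paper_text: str, max_chars: int = 8000) -> str:
    # Truncation is a crude context-window guard; chunking would be better.
    return f"{SYSTEM_PROMPT}\n\nPaper:\n{paper_text[:max_chars]}"


def parse_script(raw: str) -> List[Dict]:
    """Validate the model's JSON reply into the plan's turn format."""
    turns = json.loads(raw)
    for turn in turns:
        if turn.get("speaker") not in ("Host", "Guest"):
            raise ValueError(f"unknown speaker: {turn.get('speaker')}")
        turn.setdefault("emotion", "neutral")
    return turns
```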
### 2.3. Audio Synthesis (`synthesis/`)
- [ ] **Implement `tts_engine.py`**
- **Model**: `maya-research/maya1`.
- Function `synthesize_dialogue(script_json) -> audio_path`.
- Parse the script for emotion tags and pass them to Maya1.
- Combine audio segments into a single file using `pydub` or `scipy`.
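Maya1's inference call is not shown here (its real API needs checking); the segment-combining step, however, can be sketched on raw 16-bit sample lists with the standard-library `wave` module. The sample rate is an assumption:

```python
# Sketch of the stitching half of synthesis/tts_engine.py.
import array
import wave
from typing import List

SAMPLE_RATE = 24_000  # assumed; match whatever Maya1 actually outputs


def stitch(segments: List[List[int]], gap_ms: int = 300) -> List[int]:
    """Concatenate per-turn samples with a short silence between turns."""
    silence = [0] * (SAMPLE_RATE * gap_ms // 1000)
    out: List[int] = []
    for i, seg in enumerate(segments):
        if i:
            out.extend(silence)
        out.extend(seg)
    return out


def write_wav(samples: List[int], path: str) -> str:
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit PCM
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(array.array("h", samples).tobytes())
    return path
```

`synthesize_dialogue` would then loop over the parsed turns, call Maya1 per turn, and feed the results through `stitch` and `write_wav`.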
## 3. MCP Server Integration (`mcp_servers/`)
To satisfy the "MCP in Action" requirement, we will expose our core tools as MCP resources/tools.
- [ ] **Create `paper_tools_server.py`**
- Implement an MCP server that provides:
- Tool: `read_pdf(path)`
- Tool: `fetch_arxiv(url)`
- Tool: `synthesize_podcast(script)`
- This allows the "Agent" to call these tools via the MCP protocol.
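A sketch of the server, assuming the official `mcp` Python SDK's `FastMCP` helper. The tool bodies delegate to the modules planned above; only the arXiv URL rewrite is implemented inline, and the imports are deferred so the plain functions stay importable without MCP installed:

```python
# Sketch of mcp_servers/paper_tools_server.py.
def fetch_arxiv(url: str) -> str:
    """Rewrite an arXiv /abs/ URL to its /pdf/ form (download handled elsewhere)."""
    return url.replace("/abs/", "/pdf/")


def main() -> None:
    from mcp.server.fastmcp import FastMCP
    from processing.pdf_reader import extract_text_from_pdf
    from synthesis.tts_engine import synthesize_dialogue

    server = FastMCP("paper-tools")
    server.tool(name="read_pdf")(extract_text_from_pdf)
    server.tool(name="fetch_arxiv")(fetch_arxiv)
    server.tool(name="synthesize_podcast")(synthesize_dialogue)
    server.run()  # stdio transport by default


if __name__ == "__main__":
    main()
```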
## 4. Agent Orchestration (`agents/`)
- [ ] **Implement `podcast_agent.py`**
- Create a `PodcastAgent` class.
- **Planning Loop**:
1. Receive User Input.
2. **Plan**: Decide to fetch/read paper.
3. **Analyze**: Extract key topics.
4. **Draft**: Generate script using Phi-4-mini.
5. **Synthesize**: Create audio using Maya1.
- Use `sequential_thinking` pattern (simulated) to show "Agentic" behavior in the logs/UI.
- *Crucial*: The Agent should use the MCP Client to call the tools defined in Step 3, demonstrating "Autonomous reasoning using MCP tools".
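The planning loop above can be sketched with tools injected as plain callables — in the real app these would be MCP client calls to the Step-3 server — and a `thoughts` list recording the plan/execute trace for the UI:

```python
# Sketch of agents/podcast_agent.py.
from typing import Callable, Dict, List


class PodcastAgent:
    def __init__(self, tools: Dict[str, Callable]):
        self.tools = tools
        self.thoughts: List[str] = []  # trace shown in the Gradio logs pane

    def _think(self, msg: str) -> None:
        self.thoughts.append(msg)

    def run(self, source: str) -> str:
        self._think(f"Plan: fetch paper from {source}, then draft and synthesize.")
        path = self.tools["fetch_arxiv"](source)
        self._think("Analyze: extracting text.")
        text = self.tools["read_pdf"](path)
        self._think("Draft: generating script with Phi-4-mini.")
        script = self.tools["generate_script"](text)
        self._think("Synthesize: rendering audio with Maya1.")
        return self.tools["synthesize_podcast"](script)
```

The tool names here mirror the Step-3 server; `generate_script` is assumed to be exposed the same way.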
## 5. User Interface (`app.py`)
- [ ] **Build Gradio UI**
- Input: Textbox (URL) or File Upload (PDF).
- Output: Audio Player, Transcript Textbox, Status/Logs Markdown.
- **Agent Visualization**: Show the "Thoughts" of the agent as it plans and executes (e.g., "Fetching paper...", "Analyzing structure...", "Generating script...").
- [ ] **Deployment Config**
- Create `Dockerfile` (if needed for custom deps) or rely on HF Spaces default.
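The Gradio wiring might look like the sketch below. `run_papercast` is a hypothetical bridge to the agent, and `format_thoughts` renders the agent trace as the Status/Logs Markdown; building the UI is deferred so the module imports without Gradio installed:

```python
# Sketch of the UI portion of app.py.
from typing import List


def format_thoughts(thoughts: List[str]) -> str:
    """Render the agent's trace as a Markdown bullet list."""
    return "\n".join(f"- {t}" for t in thoughts)


def build_ui(run_papercast):
    import gradio as gr

    with gr.Blocks(title="PaperCast") as demo:
        url = gr.Textbox(label="arXiv URL")
        pdf = gr.File(label="...or upload a PDF", file_types=[".pdf"])
        go = gr.Button("Generate podcast")
        audio = gr.Audio(label="Podcast")
        transcript = gr.Textbox(label="Transcript", lines=12)
        logs = gr.Markdown("Agent thoughts will appear here.")
        go.click(run_papercast, inputs=[url, pdf], outputs=[audio, transcript, logs])
    return demo
```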
## 6. Verification & Polish
- [ ] **Test Run**
- Run with a real arXiv paper.
- Verify audio quality and script coherence.
- [ ] **Documentation**
- Update `README.md` with usage instructions and "MCP in Action" details.
- Record Demo Video.
## 7. Bonus Features (Time Permitting)
- [ ] **RAG Integration**: Use a vector store to answer questions about the paper after the podcast.
- [ ] **Background Music**: Mix in intro/outro music.