Architecture Overview
=====================

This document describes the system architecture of PhD Hunter.

System Design
-------------

PhD Hunter adopts a clean, modular design with four core parts: crawlers, database, web frontend, and command-line interface.

.. code-block:: text

   +------------------------------------------------------+
   |          Web Frontend (Flask + HTML/CSS/JS)          |
   |  - Professor cards with priority & filters           |
   |  - Real-time filtering & sorting                     |
   |  - AI chat for analysis & cold email generation      |
   |  - Profile page for CV/PS and arXiv papers           |
   +----------------------+-------------------------------+
                          | REST API
         +----------------+----------------+
         |                |                |
         v                v                v
    +---------+      +----------+     +---------+
    |   CLI   |      |  Arxiv   |     | SQLite  |
    |(main.py)|      | Crawler  |     |Database |
    +---------+      +----------+     +---------+
                          |
                          v
                 +------------------+
                 |    CSRankings    |
                 |     Crawler      |
                 +------------------+
                 |     Homepage     |
                 |     Crawler      |
                 +------------------+

Core Components
---------------

1. **Web Frontend** (frontend/)

   Visualization interface built with Flask + vanilla HTML/CSS/JavaScript:

   - ``app.py``: Flask API server providing JSON data endpoints
   - ``index.html``: Main page with navigation bar, filter bar, professor list, and detail modal
   - ``styles.css``: Black-and-white minimalist stylesheet
   - ``app.js``: Frontend logic (data loading, filtering, priority updates, modal display, chat)

   Main features:

   - Professor card display (score, paper count, research areas, priority color bar)
   - Multi-dimensional filtering (Priority / Research Area / University / Score)
   - Priority dropdown modification (changes saved to the database in real time)
   - Professor detail modal (basic info, metrics, paper list with arXiv links)
   - AI Chat page (auto-generated analysis report + cold email draft)
   - Profile page (CV/PS upload, arXiv paper management, research preferences)
   - Settings modal (LLM API key, model, and temperature configuration)

2. **CLI Entry** (main.py)

   Command-line main program providing subcommands:

   - ``crawl``: Crawl professor information from CSRankings
   - ``fetch-papers``: Fetch professor papers from arXiv
   - ``stats`` / ``list``: View database contents

3. **Crawler Module** (crawlers/)

   - ``CSRankingsCrawler``: Uses Selenium to crawl CSRankings.org university rankings and professor lists
   - ``ArxivCrawler``: Uses the arXiv API to search papers by author
   - ``HomepageCrawler``: Uses Selenium to scrape professor homepages and generate AI summaries

   All crawlers inherit from ``BaseCrawler``, which provides caching support.

4. **Analyzer Module** (analyzer/)

   - ``analyzer.py``: Core logic for professor analysis and cold email generation
   - ``prompts.py``: Prompt templates for the LLM

   Based on the user's profile (CV/PS/papers) and the professor's data (homepage summary/papers), it generates:

   - Professor research direction analysis
   - Matching points analysis
   - Cold email writing guidelines
   - A complete cold email draft

5. **Scorer Module** (hound/)

   - ``scorer.py``: LLM-based professor matching scoring
   - ``scorer_daemon.py``: Background daemon for automatic scoring

   Produces two scores (each 1-5):

   - Direction Match: degree of research-direction alignment
   - Admission Difficulty: assessment of admission difficulty

   The scorer daemon runs a persistent event loop inside its background thread to avoid asyncio lifecycle issues when scoring many professors in sequence. It polls for unscored professors every 30 seconds and spaces API calls 5 seconds apart between professors to respect upstream rate limits.
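   A minimal sketch of that pattern is shown below. This is an illustration under assumed interfaces rather than the actual ``scorer_daemon.py``; in particular, the constructor arguments and the exact signatures of ``list_unscored_professors()``, ``Scorer.run()``, and ``update_professor_scores()`` are assumptions made for the example.

   .. code-block:: python

      import asyncio
      import threading


      class ScorerDaemonSketch:
          """Illustrative only: one background thread owning one long-lived event loop."""

          POLL_INTERVAL = 30  # seconds between database polls
          CALL_SPACING = 5    # seconds between LLM calls for consecutive professors

          def __init__(self, database, scorer):
              # `database` and `scorer` stand in for the real collaborators;
              # their interfaces are assumed for this sketch.
              self.database = database
              self.scorer = scorer
              self._loop = asyncio.new_event_loop()
              self._thread = threading.Thread(target=self._run, daemon=True)

          def start(self):
              self._thread.start()

          def _run(self):
              # Bind a single event loop to this thread for its whole lifetime,
              # instead of creating and tearing down a loop per professor.
              asyncio.set_event_loop(self._loop)
              self._loop.run_until_complete(self._poll_forever())

          async def _poll_forever(self):
              while True:
                  for professor in self.database.list_unscored_professors():
                      scores = await self.scorer.run(professor)      # LLM evaluation
                      self.database.update_professor_scores(professor, scores)
                      await asyncio.sleep(self.CALL_SPACING)         # space out API calls
                  await asyncio.sleep(self.POLL_INTERVAL)            # polling interval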
6. **Database** (database.py)

   SQLite database containing:

   - ``professors`` table: Professor basic information, scores, homepage summary, chat messages
   - ``papers`` table: Paper metadata
   - ``applicant_profile`` table: User CV/PS, research preferences, arXiv papers

   Provides complete CRUD operations and data export functionality.

7. **Data Models** (models.py)

   Pydantic model definitions:

   - ``Professor``: Professor information
   - ``Paper``: Paper information
   - ``University``: University information

8. **Utility Module** (utils/)

   - ``logger.py``: Structured logging configuration
   - ``helpers.py``: General helper functions
   - ``pdf_extract.py``: PDF text extraction and profile building

9. **API Infrastructure** (api_infra/)

   - ``core/client.py``: Unified LLM client supporting multiple providers

Data Flow
---------

1. **Crawl Phase**

   .. code-block:: text

      User -> main.py crawl
           -> CSRankingsCrawler.fetch()
           -> Selenium opens browser
           -> Parse HTML to extract professor list
           -> Database.upsert_professor()
           -> SQLite save

2. **Paper Fetch Phase**

   .. code-block:: text

      User -> main.py fetch-papers
           -> Database.list_professors()
           -> For each professor: ArxivCrawler.fetch(professor)
           -> arxiv.Search query
           -> Parse returned results
           -> Database.upsert_paper()
           -> SQLite save

3. **Homepage Crawl Phase**

   .. code-block:: text

      User -> HomepageCrawler
           -> Selenium opens professor homepage
           -> Extract page content
           -> LLM generates summary
           -> Database.update_homepage_summary()

4. **Scoring Phase**

   .. code-block:: text

      ScorerDaemon -> Database.list_unscored_professors()
                   -> For each professor: Scorer.run()
                   -> LLM evaluates direction match & difficulty
                   -> Database.update_professor_scores()

5. **Web Interface Query Phase**

   .. code-block:: text

      Browser -> Flask app.py (GET /api/professors)
              -> Database.list_professors()
              -> Return professor list as JSON
              -> JavaScript renders cards + filters

      Browser -> Flask app.py (POST /api/chat/)
              -> Analyzer.chat_with_professor()
              -> LLM generates response
              -> Database.update_professor_messages()

6. **Command Line Query Phase**

   .. code-block:: text

      User -> main.py stats / list
           -> Database.get_stats() / list_professors()
           -> Formatted output

Configuration
-------------

LLM configuration is stored in ``hound_config.json``:

.. code-block:: json

   {
     "api_key": "your-api-key",
     "model": "deepseek-v3.2",
     "provider": "yunwu",
     "url": "https://yunwu.ai/v1",
     "temperature": 0.6,
     "max_tokens": 800,
     "scoring_iterations": 3,
     "nickname": "YourName"
   }

Extensibility
-------------

To add a new crawler:

1. Create ``crawlers/newsource.py``
2. Inherit from ``BaseCrawler``
3. Implement the ``fetch()`` method
4. Register it in ``crawlers/__init__.py``
5. Add the corresponding command in ``main.py``

Example:

.. code-block:: python

   from .base import BaseCrawler


   class DBLPCrawler(BaseCrawler):
       def fetch(self, query: str):
           # Implement the crawling logic here
           pass

Development Workflow
--------------------

1. Modify code
2. Run validation: ``python main.py ...``
3. Submit changes

See Also
--------

- :doc:`crawlers`
- :doc:`api`
- :doc:`contributing`