Architecture Overview

This document describes the system architecture of PhD Hunter.

System Design

PhD Hunter follows a modular design built around four core parts: crawlers, a database, a web frontend, and a command-line interface.

+------------------------------------------------------+
|    Web Frontend (Flask + HTML/CSS/JS)                |
|    - Professor cards with priority & filters         |
|    - Real-time filtering & sorting                   |
|    - AI chat for analysis & cold email generation    |
|    - Profile page for CV/PS and arXiv papers         |
+----------------------+-------------------------------+
                       | REST API
      +----------------+----------------+
      |                |                |
      v                v                v
+---------+    +----------+    +---------+
| CLI     |    | Arxiv    |    | SQLite  |
|(main.py)|    | Crawler  |    |Database |
+---------+    +----------+    +---------+
      |
      v
+------------------+
| CSRankings       |
| Crawler          |
+------------------+
| Homepage         |
| Crawler          |
+------------------+

Core Components

  1. Web Frontend (frontend/)

    Visualization interface built with Flask + vanilla HTML/CSS/JavaScript:

    • app.py: Flask API server providing JSON data endpoints

    • index.html: Main page with navigation bar, filter bar, professor list, and detail modal

    • styles.css: Black-and-white minimalist stylesheet

    • app.js: Frontend logic (data loading, filtering, priority update, modal display, chat)

    Main features:

    - Professor card display (score, paper count, research areas, priority color bar)
    - Multi-dimensional filtering (Priority / Research Area / University / Score)
    - Priority dropdown modification (real-time save to database)
    - Professor detail modal (basic info, metrics, paper list with arXiv links)
    - AI Chat page (auto-generated analysis report + cold email draft)
    - Profile page (CV/PS upload, arXiv paper management, research preferences)
    - Settings modal (LLM API key, model, temperature configuration)
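
    The sketch below illustrates the kind of JSON endpoint app.py exposes to this page (the /api/professors route and Database.list_professors() appear in the Data Flow section); the database path, constructor arguments, and return format are assumptions.

# Minimal sketch of a JSON endpoint like the one app.py provides.
# The route and Database.list_professors() come from this document;
# the database path and return format are illustrative assumptions.
from flask import Flask, jsonify

from database import Database

app = Flask(__name__)
db = Database("phd_hunter.db")  # path is an assumption

@app.route("/api/professors", methods=["GET"])
def list_professors():
    professors = db.list_professors()
    # Assumes list_professors() returns JSON-serializable rows; adjust if it
    # returns Pydantic models (e.g. call model_dump() on each).
    return jsonify(professors)

if __name__ == "__main__":
    app.run(debug=True)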

  2. CLI Entry (main.py)

    Command-line main program providing subcommands:

    • crawl: Crawl professor information from CSRankings

    • fetch-papers: Fetch professor papers from arXiv

    • stats / list: View database contents
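
    A minimal sketch of how these subcommands could be wired together with argparse; only the subcommand names are taken from this document, and the real dispatch logic in main.py may differ.

# Illustrative sketch of main.py's subcommand structure using argparse.
# Only the subcommand names come from this document; the dispatch is a stub.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(prog="main.py", description="PhD Hunter CLI")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("crawl", help="Crawl professor information from CSRankings")
    sub.add_parser("fetch-papers", help="Fetch professor papers from arXiv")
    sub.add_parser("stats", help="Show database statistics")
    sub.add_parser("list", help="List professors in the database")

    args = parser.parse_args()
    # The real main.py dispatches to the crawlers and database here.
    print(f"Selected subcommand: {args.command}")

if __name__ == "__main__":
    main()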

  3. Crawler Module (crawlers/)

    • CSRankingsCrawler: Uses Selenium to crawl CSRankings.org university rankings and professor lists

    • ArxivCrawler: Uses the arXiv API to search papers by author

    • HomepageCrawler: Uses Selenium to scrape professor homepages and generates AI summaries

    All crawlers inherit from BaseCrawler with caching support.
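
    A minimal sketch of the inherit-and-implement-fetch() pattern with a simple file cache; the real BaseCrawler interface and caching mechanism may differ.

# Sketch of a BaseCrawler with simple file-based caching (illustrative only).
import hashlib
import json
from pathlib import Path

class BaseCrawler:
    def __init__(self, cache_dir: str = ".cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def fetch(self, query: str):
        # Subclasses implement the actual crawling logic.
        raise NotImplementedError

    def fetch_cached(self, query: str):
        # Hash the query so it can be used as a cache filename.
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        path = self.cache_dir / f"{key}.json"
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        result = self.fetch(query)
        path.write_text(json.dumps(result), encoding="utf-8")
        return result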

  4. Analyzer Module (analyzer/)

    • analyzer.py: Core logic for professor analysis and cold email generation

    • prompts.py: Prompt templates for LLM

    Based on the user profile (CV/PS/papers) and professor data (homepage summary/papers), the analyzer generates:

    - An analysis of the professor's research direction
    - An analysis of matching points between the applicant and the professor
    - Guidelines for writing the cold email
    - A complete cold email draft
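
    As a hedged illustration, a template in prompts.py could look like the sketch below; the placeholder names and wording are assumptions, and only the four output items come from this document.

# Illustrative prompt template for the analysis / cold-email generation step.
# Placeholder names (applicant_profile, professor_summary, papers) are assumptions.
ANALYSIS_PROMPT = """\
You are helping a PhD applicant evaluate a potential advisor.

Applicant profile (CV/PS/papers):
{applicant_profile}

Professor data (homepage summary and recent papers):
{professor_summary}
{papers}

Produce, in order:
1. An analysis of the professor's research direction.
2. The strongest matching points with the applicant.
3. Guidelines for writing a cold email.
4. A complete cold email draft.
"""

def build_analysis_prompt(applicant_profile: str, professor_summary: str, papers: str) -> str:
    return ANALYSIS_PROMPT.format(
        applicant_profile=applicant_profile,
        professor_summary=professor_summary,
        papers=papers,
    )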

  5. Scorer Module (hound/)

    • scorer.py: LLM-based professor matching scoring

    • scorer_daemon.py: Background daemon for automatic scoring

    Two scores, each on a 1-5 scale:

    - Direction Match: how closely the professor's research direction matches the applicant's profile
    - Admission Difficulty: an assessment of how difficult admission to this professor's group is likely to be

    The scorer daemon uses a persistent event loop within its background thread to avoid asyncio lifecycle issues when scoring multiple professors. It polls every 30 seconds and spaces out API calls with a 5-second delay between professors to respect upstream rate limits.
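
    A sketch of that pattern (one long-lived event loop owned by the daemon's background thread, a 30-second poll, and a 5-second pause between professors); the Database and Scorer calls are placeholders for the project's real APIs.

# Sketch of the daemon's threading/asyncio pattern described above.
# Scorer.run() and Database.list_unscored_professors() are stand-ins for
# the real APIs; whether Scorer.run() is a coroutine is an assumption.
import asyncio
import threading

POLL_INTERVAL = 30       # seconds between polls for unscored professors
PER_PROFESSOR_DELAY = 5  # seconds between LLM calls to respect rate limits

class ScorerDaemon:
    def __init__(self, db, scorer):
        self.db = db
        self.scorer = scorer
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        # The loop lives for the thread's whole lifetime, so repeated scoring
        # rounds reuse it instead of creating and closing a loop per call.
        asyncio.set_event_loop(self._loop)
        self._loop.run_until_complete(self._poll_forever())

    async def _poll_forever(self) -> None:
        while True:
            for professor in self.db.list_unscored_professors():
                await self.scorer.run(professor)          # LLM scoring call
                await asyncio.sleep(PER_PROFESSOR_DELAY)  # pace API calls
            await asyncio.sleep(POLL_INTERVAL)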

  6. Database (database.py)

    SQLite database containing:

    • professors table: Professor basic information, scores, homepage summary, chat messages

    • papers table: Paper metadata

    • applicant_profile table: User CV/PS, research preferences, arXiv papers

    Provides complete CRUD operations and data export functionality.
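
    A rough illustration of the upsert pattern behind Database.upsert_professor(), assuming a simplified schema; the real professors table carries scores, homepage summary, chat messages, and more.

# Illustrative sqlite3 upsert against a simplified professors table.
# Column names and the database path are assumptions.
import sqlite3

conn = sqlite3.connect("phd_hunter.db")  # path is an assumption
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS professors (
        name TEXT NOT NULL,
        university TEXT NOT NULL,
        UNIQUE (name, university)
    )
    """
)
conn.execute(
    "INSERT OR REPLACE INTO professors (name, university) VALUES (?, ?)",
    ("Example Professor", "Example University"),
)
conn.commit()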

  7. Data Models (models.py)

    Pydantic model definitions:

    • Professor: Professor information

    • Paper: Paper information

    • University: University information
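
    An illustrative sketch of what these Pydantic models might contain; the field names below are guesses based on features described elsewhere in this document, not the exact models.py definitions.

# Illustrative Pydantic models; field names are assumptions based on the
# card/score/paper features described above, not the real models.py.
from typing import Optional
from pydantic import BaseModel, Field

class Paper(BaseModel):
    title: str
    arxiv_id: Optional[str] = None
    url: Optional[str] = None

class Professor(BaseModel):
    name: str
    university: str
    research_areas: list[str] = Field(default_factory=list)
    direction_match: Optional[int] = None       # 1-5 score
    admission_difficulty: Optional[int] = None  # 1-5 score
    priority: Optional[str] = None
    homepage_summary: Optional[str] = None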

  8. Utility Module (utils/)

    • logger.py: Structured logging configuration

    • helpers.py: General helper functions

    • pdf_extract.py: PDF text extraction and Profile building
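
    As an example of the kind of helper pdf_extract.py provides, the snippet below extracts raw text with pypdf; the project may use a different PDF library, and the profile-building step is omitted.

# Illustrative PDF text extraction (the project may use a different library).
from pypdf import PdfReader

def extract_pdf_text(path: str) -> str:
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)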

  9. API Infrastructure (api_infra/)

    • core/client.py: Unified LLM client supporting multiple providers
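
    Since the hound_config.json settings shown under Configuration point at a "/v1"-style endpoint, the unified client can be imagined as a thin wrapper over an OpenAI-compatible chat API; this is an assumption about the interface, not the actual core/client.py.

# Sketch of a provider-agnostic chat wrapper, assuming an OpenAI-compatible
# endpoint (suggested by the "/v1" URL in hound_config.json). Illustrative only.
from openai import OpenAI

class LLMClient:
    def __init__(self, api_key: str, base_url: str, model: str,
                 temperature: float = 0.6, max_tokens: int = 800):
        self._client = OpenAI(api_key=api_key, base_url=base_url)
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens

    def chat(self, messages: list[dict]) -> str:
        response = self._client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )
        return response.choices[0].message.content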

Data Flow

  1. Crawl Phase

    User -> main.py crawl
       -> CSRankingsCrawler.fetch()
       -> Selenium opens browser
       -> Parse HTML to extract professor list
       -> Database.upsert_professor()
       -> SQLite save
    
  2. Paper Fetch Phase

    User -> main.py fetch-papers
       -> Database.list_professors()
       -> For each professor:
          ArxivCrawler.fetch(professor)
          -> arxiv.Search query
          -> Parse returned results
          -> Database.upsert_paper()
       -> SQLite save
    
  3. Homepage Crawl Phase

    User -> HomepageCrawler
       -> Selenium opens professor homepage
       -> Extract page content
       -> LLM generates summary
       -> Database.update_homepage_summary()
    
  4. Scoring Phase

    ScorerDaemon -> Database.list_unscored_professors()
       -> For each professor:
          Scorer.run()
          -> LLM evaluates direction match & difficulty
          -> Database.update_professor_scores()
    
  5. Web Interface Query Phase

    Browser -> Flask app.py (GET /api/professors)
       -> Database.list_professors()
       -> JSON returns professor list
       -> JavaScript renders cards + filters
    
    Browser -> Flask app.py (POST /api/chat/<id>)
       -> Analyzer.chat_with_professor()
       -> LLM generates response
       -> Database.update_professor_messages()
    
  6. Command Line Query Phase

    User -> main.py stats / list
       -> Database.get_stats() / list_professors()
       -> Formatted output
    

Configuration

LLM configuration is stored in hound_config.json:

{
    "api_key": "your-api-key",
    "model": "deepseek-v3.2",
    "provider": "yunwu",
    "url": "https://yunwu.ai/v1",
    "temperature": 0.6,
    "max_tokens": 800,
    "scoring_iterations": 3,
    "nickname": "YourName"
}
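
These settings can be read with the standard library; a minimal, illustrative loader (the project's own loading code may differ):

# Load hound_config.json (illustrative; the project's loader may differ).
import json

with open("hound_config.json", encoding="utf-8") as f:
    config = json.load(f)

print(config["model"], config["temperature"])  # e.g. deepseek-v3.2 0.6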

Extensibility

Adding new crawlers:

  1. Create crawlers/newsource.py

  2. Inherit from BaseCrawler

  3. Implement fetch() method

  4. Register in crawlers/__init__.py

  5. Add corresponding command in main.py

Example:

from .base import BaseCrawler

class DBLPCrawler(BaseCrawler):
    def fetch(self, query: str):
        # Implement crawling logic
        pass

Development Workflow

  1. Modify code

  2. Run validation: python main.py ...

  3. Submit changes

See Also