# Architecture Overview
This document describes the system architecture of PhD Hunter.
## System Design
PhD Hunter uses a clean, modular design with four core parts: crawlers, a database, a web frontend, and a command-line interface.
```
+------------------------------------------------------+
|          Web Frontend (Flask + HTML/CSS/JS)          |
|  - Professor cards with priority & filters           |
|  - Real-time filtering & sorting                     |
|  - AI chat for analysis & cold email generation      |
|  - Profile page for CV/PS and arXiv papers           |
+--------------------------+---------------------------+
                           | REST API
          +----------------+----------------+
          |                |                |
          v                v                v
     +---------+      +----------+     +----------+
     |   CLI   |      |  Arxiv   |     |  SQLite  |
     |(main.py)|      | Crawler  |     | Database |
     +---------+      +----------+     +----------+
          |
          v
  +------------------+
  |    CSRankings    |
  |     Crawler      |
  +------------------+
  |     Homepage     |
  |     Crawler      |
  +------------------+
```
## Core Components
### Web Frontend (`frontend/`)
Visualization interface built with Flask + vanilla HTML/CSS/JavaScript:
- `app.py`: Flask API server providing JSON data endpoints
- `index.html`: Main page with navigation bar, filter bar, professor list, and detail modal
- `styles.css`: Black-and-white minimalist stylesheet
- `app.js`: Frontend logic (data loading, filtering, priority updates, modal display, chat)
Main features:

- Professor card display (score, paper count, research areas, priority color bar)
- Multi-dimensional filtering (Priority / Research Area / University / Score)
- Priority dropdown modification (real-time save to database)
- Professor detail modal (basic info, metrics, paper list with arXiv links)
- AI Chat page (auto-generated analysis report + cold email draft)
- Profile page (CV/PS upload, arXiv paper management, research preferences)
- Settings modal (LLM API key, model, temperature configuration)
### CLI Entry (`main.py`)
Command-line entry point providing these subcommands (a minimal CLI sketch follows the list):

- `crawl`: Crawl professor information from CSRankings
- `fetch-papers`: Fetch professor papers from arXiv
- `stats` / `list`: View database contents
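A hedged sketch of how these subcommands could be wired up with `argparse`; the optional flags shown (`--university`, `--limit`) are illustrative assumptions, not documented options of `main.py`:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py", description="PhD Hunter CLI")
    sub = parser.add_subparsers(dest="command", required=True)

    crawl = sub.add_parser("crawl", help="Crawl professor information from CSRankings")
    crawl.add_argument("--university", help="Hypothetical filter: crawl one university only")

    fetch = sub.add_parser("fetch-papers", help="Fetch professor papers from arXiv")
    fetch.add_argument("--limit", type=int, default=50, help="Hypothetical cap per professor")

    sub.add_parser("stats", help="Show database statistics")
    sub.add_parser("list", help="List professors in the database")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Dispatching subcommand: {args.command}")  # real handlers live in main.py
```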
### Crawler Module (`crawlers/`)
- `CSRankingsCrawler`: Uses Selenium to crawl CSRankings.org university rankings and professor lists
- `ArxivCrawler`: Uses the arXiv API to search papers by author
- `HomepageCrawler`: Uses Selenium to scrape professor homepages and generate AI summaries
All crawlers inherit from `BaseCrawler`, which provides caching support; a minimal sketch of this pattern follows.
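A minimal sketch of the inheritance-plus-caching pattern, assuming a JSON file cache; everything except the `BaseCrawler` name and the `fetch()` method is an illustrative assumption:

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path

class BaseCrawler(ABC):
    """Base class: subclasses implement fetch(); results can be cached as JSON."""

    def __init__(self, cache_dir: str = ".cache"):  # cache location is an assumption
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    @abstractmethod
    def fetch(self, query: str):
        """Fetch raw records for the given query."""

    def fetch_cached(self, query: str):
        """Return cached results when present, otherwise fetch and cache."""
        cache_file = self.cache_dir / f"{type(self).__name__}_{query}.json"
        if cache_file.exists():
            return json.loads(cache_file.read_text(encoding="utf-8"))
        results = self.fetch(query)
        cache_file.write_text(json.dumps(results), encoding="utf-8")
        return results
```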
### Analyzer Module (`analyzer/`)

- `analyzer.py`: Core logic for professor analysis and cold email generation
- `prompts.py`: Prompt templates for the LLM
Based on the user's profile (CV/PS/papers) and professor data (homepage summary/papers), the analyzer generates (a hypothetical template shape follows the list):

- Professor research direction analysis
- Matching points analysis
- Cold email writing guidelines
- A complete cold email draft
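One hypothetical shape for a template in `prompts.py`; the wording and placeholder names are assumptions for illustration, not the actual prompts:

```python
# Hypothetical template; the real prompts.py wording is not reproduced here.
ANALYSIS_PROMPT = """You are helping a PhD applicant evaluate a potential advisor.

Applicant profile (CV/PS/papers):
{profile}

Professor data (homepage summary and recent papers):
{professor_data}

Produce: (1) a research direction analysis, (2) matching points,
(3) cold email writing guidelines, and (4) a complete cold email draft.
"""

def build_analysis_prompt(profile: str, professor_data: str) -> str:
    """Fill the template; the analyzer would send the result to the LLM client."""
    return ANALYSIS_PROMPT.format(profile=profile, professor_data=professor_data)
```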
### Scorer Module (`hound/`)
- `scorer.py`: LLM-based professor matching scoring
- `scorer_daemon.py`: Background daemon for automatic scoring
Each professor receives two scores (1-5):

- Direction Match: how well the professor's research direction matches the user's profile
- Admission Difficulty: an assessment of how difficult admission to this group is likely to be
The scorer daemon uses a persistent event loop within its background thread to avoid asyncio lifecycle issues when scoring multiple professors. It polls every 30 seconds and spaces out API calls with a 5-second delay between professors to respect upstream rate limits.
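A minimal sketch of the persistent-event-loop pattern described above; only the 30-second poll and the 5-second spacing come from this document, and the function and parameter names are assumptions:

```python
import asyncio
import threading
import time

POLL_INTERVAL_S = 30       # poll for unscored professors every 30 seconds
PER_PROFESSOR_DELAY_S = 5  # space out API calls to respect rate limits

async def score_professor(professor_id: int) -> None:
    """Placeholder: the real scorer awaits an LLM call here."""
    await asyncio.sleep(0)

def daemon_loop(list_unscored) -> None:
    # One event loop for the thread's entire lifetime, so repeated scoring runs
    # never trip over "event loop is closed"-style asyncio lifecycle issues.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        while True:
            for professor_id in list_unscored():
                loop.run_until_complete(score_professor(professor_id))
                time.sleep(PER_PROFESSOR_DELAY_S)
            time.sleep(POLL_INTERVAL_S)
    finally:
        loop.close()

# Usage sketch: run the daemon in a background thread.
# threading.Thread(target=daemon_loop, args=(lambda: [],), daemon=True).start()
```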
### Database (`database.py`)
SQLite database containing:
- `professors` table: Professor basic information, scores, homepage summary, chat messages
- `papers` table: Paper metadata
- `applicant_profile` table: User CV/PS, research preferences, arXiv papers
Provides complete CRUD operations and data export functionality.
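A hypothetical `sqlite3` sketch of the three tables; the column names and types are assumptions inferred from the descriptions above, not the actual schema in `database.py`:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS professors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    university TEXT,
    direction_match INTEGER,       -- 1-5 score from the scorer
    admission_difficulty INTEGER,  -- 1-5 score from the scorer
    homepage_summary TEXT,
    chat_messages TEXT             -- serialized chat history
);
CREATE TABLE IF NOT EXISTS papers (
    id INTEGER PRIMARY KEY,
    professor_id INTEGER REFERENCES professors(id),
    title TEXT NOT NULL,
    arxiv_id TEXT
);
CREATE TABLE IF NOT EXISTS applicant_profile (
    id INTEGER PRIMARY KEY,
    cv_text TEXT,
    ps_text TEXT,
    research_preferences TEXT
);
"""

with sqlite3.connect("phd_hunter.db") as conn:  # database filename is an assumption
    conn.executescript(SCHEMA)
```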
### Data Models (`models.py`)
Pydantic model definitions (a field-level sketch follows the list):

- `Professor`: Professor information
- `Paper`: Paper information
- `University`: University information
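A minimal Pydantic sketch of what these models might contain; every field shown is an illustrative assumption based on the rest of this document:

```python
from typing import Optional
from pydantic import BaseModel

class University(BaseModel):
    name: str
    csrankings_rank: Optional[int] = None  # assumed field

class Paper(BaseModel):
    title: str
    arxiv_id: Optional[str] = None
    year: Optional[int] = None

class Professor(BaseModel):
    name: str
    university: str
    research_areas: list[str] = []
    direction_match: Optional[int] = None       # 1-5, filled in by the scorer
    admission_difficulty: Optional[int] = None  # 1-5, filled in by the scorer
```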
### Utility Module (`utils/`)
- `logger.py`: Structured logging configuration
- `helpers.py`: General helper functions
- `pdf_extract.py`: PDF text extraction and profile building
### API Infrastructure (`api_infra/`)
- `core/client.py`: Unified LLM client supporting multiple providers (sketch below)
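A minimal sketch of such a client, assuming the configured providers expose OpenAI-compatible `/chat/completions` endpoints; the class shape is an illustration, not the actual `core/client.py` API:

```python
import requests

class LLMClient:
    """Thin wrapper over an OpenAI-compatible chat completions endpoint."""

    def __init__(self, url: str, api_key: str, model: str,
                 temperature: float = 0.6, max_tokens: int = 800):
        self.url = url.rstrip("/")
        self.api_key = api_key
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens

    def chat(self, messages: list[dict]) -> str:
        response = requests.post(
            f"{self.url}/chat/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={
                "model": self.model,
                "messages": messages,
                "temperature": self.temperature,
                "max_tokens": self.max_tokens,
            },
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
```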
## Data Flow
### Crawl Phase
```
User -> main.py crawl -> CSRankingsCrawler.fetch() -> Selenium opens browser
     -> Parse HTML to extract professor list -> Database.upsert_professor() -> SQLite save
```
### Paper Fetch Phase
```
User -> main.py fetch-papers -> Database.list_professors()
     -> For each professor: ArxivCrawler.fetch(professor) -> arxiv.Search query
     -> Parse returned results -> Database.upsert_paper() -> SQLite save
```
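A minimal sketch of the `arxiv.Search` step using the `arxiv` Python package; the author-query format and the surrounding function are assumptions:

```python
import arxiv

def fetch_papers_for_author(author_name: str, max_results: int = 20) -> list[dict]:
    """Search arXiv for papers by one author and return basic metadata."""
    client = arxiv.Client()
    search = arxiv.Search(
        query=f'au:"{author_name}"',
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate,
    )
    return [
        {
            "title": result.title,
            "arxiv_id": result.get_short_id(),
            "published": result.published.isoformat(),
        }
        for result in client.results(search)
    ]
```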
### Homepage Crawl Phase
```
User -> HomepageCrawler -> Selenium opens professor homepage -> Extract page content
     -> LLM generates summary -> Database.update_homepage_summary()
```
### Scoring Phase
```
ScorerDaemon -> Database.list_unscored_professors()
     -> For each professor: Scorer.run() -> LLM evaluates direction match & difficulty
     -> Database.update_professor_scores()
```
### Web Interface Query Phase
```
Browser -> Flask app.py (GET /api/professors) -> Database.list_professors()
        -> JSON returns professor list -> JavaScript renders cards + filters

Browser -> Flask app.py (POST /api/chat/<id>) -> Analyzer.chat_with_professor()
        -> LLM generates response -> Database.update_professor_messages()
```
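A minimal Flask sketch of the `GET /api/professors` endpoint from the flow above; the `Database` stand-in mirrors the method name in the flow and is an assumption, not the real `database.py`:

```python
from flask import Flask, jsonify

app = Flask(__name__)

class Database:
    """Stand-in for database.py; list_professors() is the method named above."""
    def list_professors(self) -> list[dict]:
        return []  # the real implementation queries SQLite

db = Database()

@app.route("/api/professors")
def get_professors():
    # Serialize the professor list to JSON; the frontend renders it as cards.
    return jsonify(db.list_professors())

if __name__ == "__main__":
    app.run(debug=True)
```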
### Command Line Query Phase
```
User -> main.py stats / list -> Database.get_stats() / list_professors() -> Formatted output
```
## Configuration
LLM configuration is stored in `hound_config.json`:
```json
{
  "api_key": "your-api-key",
  "model": "deepseek-v3.2",
  "provider": "yunwu",
  "url": "https://yunwu.ai/v1",
  "temperature": 0.6,
  "max_tokens": 800,
  "scoring_iterations": 3,
  "nickname": "YourName"
}
```
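A small sketch of loading and sanity-checking this file; whether the real code validates keys this way is an assumption:

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"api_key", "model", "provider", "url",
                 "temperature", "max_tokens", "scoring_iterations", "nickname"}

def load_config(path: str = "hound_config.json") -> dict:
    """Load hound_config.json and check that the expected keys are present."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"hound_config.json is missing keys: {sorted(missing)}")
    return config
```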
## Extensibility
Adding new crawlers:

1. Create `crawlers/newsource.py`
2. Inherit from `BaseCrawler`
3. Implement the `fetch()` method
4. Register in `crawlers/__init__.py` (example below)
5. Add the corresponding command in `main.py`
Example:

```python
from .base import BaseCrawler

class DBLPCrawler(BaseCrawler):
    def fetch(self, query: str):
        # Implement crawling logic
        pass
```
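Step 4 might then look like this in `crawlers/__init__.py`; the existing module names and export list are assumptions about the current file contents:

```python
# crawlers/__init__.py -- hypothetical registration; actual contents may differ
from .base import BaseCrawler
from .csrankings import CSRankingsCrawler
from .arxiv import ArxivCrawler
from .homepage import HomepageCrawler
from .newsource import DBLPCrawler  # the new crawler from the example above

__all__ = [
    "BaseCrawler",
    "CSRankingsCrawler",
    "ArxivCrawler",
    "HomepageCrawler",
    "DBLPCrawler",
]
```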
## Development Workflow
1. Modify code
2. Run validation: `python main.py ...`
3. Submit changes