Crawlers

This module fetches data from academic sources. The project also includes a web frontend for displaying and interacting with that data; see the note below.

Overview

Crawlers fetch data from different sources:

  • Get professor lists from CSRankings

  • Fetch papers via OpenAlex (primary source — institution + author matching)

  • Enrich abstracts via arXiv (by ID — accurate abstracts and PDF links)

  • Scrape professor homepages for AI summary and recent paper titles

All crawlers follow rate limits and include retry logic.

Note

For the web frontend, see the “Web Frontend” chapter in the Architecture Overview, or check the Flask application code in the src/phd_hunter/frontend/ directory.

CSRankings Crawler

File: crawlers/csrankings.py

Extracts professor data from https://csrankings.org.

Features:

  • Select specific institutions and CS sub-areas

  • Extract professor names, homepages, and affiliations

  • Use Selenium to handle dynamic page content
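
Because csrankings.org renders its ranking table with JavaScript, a plain HTTP fetch is not enough. The following is a minimal headless Selenium sketch, illustrative rather than the crawler's exact code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

opts = Options()
opts.add_argument("--headless=new")        # run Chrome without a window
driver = webdriver.Chrome(options=opts)
driver.implicitly_wait(30)                 # wait up to 30 s when locating elements
driver.get("https://csrankings.org")
driver.find_element(By.TAG_NAME, "table")  # blocks until the JS-rendered table appears
html = driver.page_source                  # professor rows are parsed from this HTML
driver.quit()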

Usage:

from phd_hunter.crawlers.csrankings import CSRankingsCrawler

crawler = CSRankingsCrawler(headless=True)
universities, professors = crawler.fetch(
    areas=["ai"],
    region="world",
    max_professors=5
)
# Returns lists of University and Professor objects

Extracted data:

  • University name, rank, score

  • Professor name

  • University URL

  • Professor homepage (extracted from ranking page)

OpenAlex Crawler

File: crawlers/openalex_crawler.py

Primary paper source. Fetches professor papers via the OpenAlex API, using institution + author matching for accurate identification.

Features:

  • Search institution by name, then author within that institution (flow sketched after this list)

  • Select author with the most works (highest confidence)

  • Fetch works sorted by publication date

  • Extract arXiv links from locations and open_access fields

  • Handle non-arXiv papers gracefully (skip if no arXiv ID — DB schema requires it)
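
For reference, the matching flow above maps onto the public OpenAlex REST API. This is a minimal sketch, not the crawler's actual code; endpoint and filter names follow the OpenAlex docs, and error handling (e.g. empty results) is omitted:

import requests

BASE = "https://api.openalex.org"

def find_author_id(name: str, institution: str) -> str:
    # Resolve the institution by name; take the top search hit
    inst = requests.get(f"{BASE}/institutions",
                        params={"search": institution}).json()["results"][0]
    # Search authors affiliated with it; pick the one with the most works
    authors = requests.get(
        f"{BASE}/authors",
        params={"search": name,
                "filter": f"affiliations.institution.id:{inst['id']}"},
    ).json()["results"]
    return max(authors, key=lambda a: a["works_count"])["id"]

def recent_works(author_id: str, n: int = 10) -> list:
    # Fetch the author's works, newest first
    return requests.get(
        f"{BASE}/works",
        params={"filter": f"author.id:{author_id}",
                "sort": "publication_date:desc",
                "per-page": n},
    ).json()["results"]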

Usage:

from phd_hunter.crawlers.openalex_crawler import OpenAlexCrawler
from phd_hunter.models import Professor

crawler = OpenAlexCrawler(delay=1.0)
prof = Professor(name="Bingsheng He", university="National University of Singapore")
papers = crawler.fetch(prof, max_papers=10)
# Returns list of Paper objects (arxiv_id set when arXiv link found)
crawler.close()

Extracted data:

  • Paper title

  • Author list

  • Abstract (from OpenAlex, may be incomplete)

  • Publication year / month

  • arXiv ID (extracted from locations/open_access)

  • arXiv URL and PDF URL

  • Citation count

  • Venue name

arXiv Crawler

File: crawlers/arxiv_crawler.py

Abstract enrichment and manual addition. Supplements OpenAlex papers with accurate arXiv abstracts, and supports manually adding papers by URL or title.

Features:

  • fetch_by_ids(): batch query arXiv by ID list for accurate abstracts and PDF links

  • fetch_by_titles(): search by paper title with progressive query degradation and Jaccard similarity filtering (similarity check sketched after this list)

  • fetch(): author-name search (legacy, not used in main flow)

  • Author verification: fuzzy name matching (handles initials, last-name-only)
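
The Jaccard filter can be illustrated in a few lines. A minimal sketch on word sets; the 0.5 threshold here is illustrative, not the crawler's actual value:

def jaccard_similarity(a: str, b: str) -> float:
    # Jaccard similarity between the word sets of two titles
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

wanted = "A Survey of Large Language Model Agents"
candidates = ["A Survey of LLM Agents", "Graph Neural Networks for Code"]
# Keep only search hits whose title closely matches the requested one
matches = [t for t in candidates if jaccard_similarity(t, wanted) >= 0.5]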

Usage — abstract enrichment (main flow):

from phd_hunter.crawlers.arxiv_crawler import ArxivCrawler

crawler = ArxivCrawler(delay=3.0)
results = crawler.fetch_by_ids(["2412.11483", "2512.02589"])
# Returns dict: arxiv_id -> Paper with accurate abstract and pdf_url
crawler.close()
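
Under the hood, batch lookup by ID corresponds to arXiv's public export API, which accepts a comma-separated id_list. A minimal sketch of the raw request (the crawler adds parsing, rate limiting, and retries):

import urllib.request

ids = ["2412.11483", "2512.02589"]
url = "http://export.arxiv.org/api/query?id_list=" + ",".join(ids)
feed = urllib.request.urlopen(url).read().decode()  # Atom feed, one <entry> per paper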

Usage — manual add by title (for homepage-extracted titles):

from phd_hunter.crawlers.arxiv_crawler import ArxivCrawler
from phd_hunter.models import Professor

crawler = ArxivCrawler()
prof = Professor(name="Yangqiu Song")
titles = ["Paper Title 1", "Paper Title 2"]
papers = crawler.fetch_by_titles(prof, titles=titles, max_papers=10)
# Returns list of Paper objects where the professor is a confirmed author

Extracted data:

  • Paper title

  • Author list

  • Abstract (accurate, from arXiv)

  • Publication year

  • arXiv ID

  • PDF URL

Homepage Crawler

File: crawlers/homepage_crawler.py

Scrape professor homepages, generate AI summaries, and extract recent paper titles.

Features:

  • Fetch professor homepage via HTTP

  • Extract plain text from HTML (one approach sketched after this list)

  • Use LLM to generate summary (research focus, recruiting status, content summary)

  • Extract recent paper titles from the homepage (used for precise arXiv search)
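
The fetch-and-extract steps are plain HTTP plus HTML parsing. A minimal sketch using requests and BeautifulSoup, assumed here for illustration; the crawler's actual parsing may differ:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://cs.stanford.edu/~prof/", timeout=30).text
# Strip markup and collapse the page into plain text for the LLM prompt
text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)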

Usage:

import asyncio

from phd_hunter.crawlers.homepage_crawler import (
    fetch_and_summarize_homepage,
    load_homepage_papers,
)

async def main() -> None:
    # Fetch homepage, generate the AI summary, and store the results
    success = await fetch_and_summarize_homepage(
        professor_id=1,
        homepage_url="https://cs.stanford.edu/~prof/",
        professor_name="John Doe",
        db_path="phd_hunter.db",
    )

asyncio.run(main())  # fetch_and_summarize_homepage is a coroutine

# Retrieve extracted paper titles
titles = load_homepage_papers(1)
# Returns ["Paper Title 1", "Paper Title 2", ...]

Configuration

Configuration is currently passed via command-line parameters. Key parameters:

CSRankingsCrawler:
  --headless / --no-headless   # Headless mode
  --timeout 30                 # Timeout (seconds)
  --max-professors 5           # Max professors per university

OpenAlexCrawler:
  --delay 1.0                  # Request interval (seconds)
  --max-retries 3              # Retry attempts on failure

ArxivCrawler:
  --delay 3.0                  # Request interval (seconds, be respectful)
  --max-papers 10              # Max papers per professor

HomepageCrawler:
  Requires LLM configuration in hound_config.json

Caching

All crawler results are cached to avoid redundant requests:

  • Cache location: Memory cache (in-process)

  • Cache key: Parameter hash

  • TTL: Default 1 day
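
A minimal sketch of such a cache (in-process dict, parameter hash as key, one-day TTL); the names here are illustrative, not the project's actual helpers:

import hashlib
import json
import time

_CACHE: dict = {}
TTL_SECONDS = 24 * 60 * 60  # one day

def _key(**params) -> str:
    # Stable hash of the call parameters
    return hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()

def cached(fetch_fn, **params):
    key = _key(**params)
    entry = _CACHE.get(key)
    if entry and time.time() - entry[0] < TTL_SECONDS:
        return entry[1]                  # fresh hit: skip the request
    result = fetch_fn(**params)
    _CACHE[key] = (time.time(), result)  # store with a timestamp
    return result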

Rate Limiting

To respect data sources:

  • Automatic delay between requests

  • arXiv: Default 3-second interval (configurable; see --delay above)

Error Handling

Crawlers handle:

  • Network timeouts (retry)

  • Page layout changes (fault tolerance)

  • Missing data (return partial results)
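
Both behaviors reduce to a small wrapper around each HTTP call. A minimal sketch combining the fixed delay with timeout retries (names and values are illustrative):

import time

import requests

def polite_get(url: str, delay: float = 3.0, max_retries: int = 3):
    for _ in range(max_retries):
        time.sleep(delay)  # respect the source's rate limit
        try:
            return requests.get(url, timeout=30)
        except requests.Timeout:
            continue  # network timeout: try again
    return None  # missing data: caller keeps partial results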

Adding New Crawlers

To add a new data source:

  1. Create crawlers/newsource.py

  2. Inherit BaseCrawler

  3. Implement fetch() method

  4. Register in crawlers/__init__.py

  5. Add command in main.py

Example:

from phd_hunter.crawlers.base import BaseCrawler

class DBLPCrawler(BaseCrawler):
    def fetch(self, query: str):
        # Implement the crawling logic here: query the source,
        # parse the response, and return model objects (e.g. Paper)
        pass

See Also