Crawlers

This module fetches data from academic sources. The project also includes a web frontend for displaying and interacting with that data; see the note below.

Overview

Crawlers fetch data from different sources:

  • Get professor lists from CSRankings

  • Fetch papers via OpenAlex (primary source — institution + author matching)

  • Enrich abstracts via arXiv (by ID — accurate abstracts and PDF links)

  • Scrape professor homepages for AI summary and recent paper titles

All crawlers follow rate limits and include retry logic.

Note

For the web frontend, see the “Web Frontend” chapter in the Architecture Overview, or check the Flask application code in the src/phd_hunter/frontend/ directory.

CSRankings Crawler

File: crawlers/csrankings.py

Extracts professor data from https://csrankings.org.

Features:

  • Select specific institutions and CS sub-areas

  • Extract professor names, homepages, and affiliations

  • Use Selenium to handle dynamic page content
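
Because csrankings.org renders its ranking table with JavaScript, a plain HTTP fetch is not enough. The following is a minimal headless Selenium sketch, illustrative rather than the crawler's exact code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

opts = Options()
opts.add_argument("--headless=new")        # run Chrome without a window
driver = webdriver.Chrome(options=opts)
driver.implicitly_wait(30)                 # wait up to 30 s when locating elements
driver.get("https://csrankings.org")
driver.find_element(By.TAG_NAME, "table")  # blocks until the JS-rendered table appears
html = driver.page_source                  # professor rows are parsed from this HTML
driver.quit()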

Usage:

from phd_hunter.crawlers.csrankings import CSRankingsCrawler

crawler = CSRankingsCrawler(headless=True)
universities, professors = crawler.fetch(
    areas=["ai"],
    region="world",
    max_professors=5
)
# Returns lists of University and Professor objects

Extracted data:

  • University name, rank, score

  • Professor name

  • University URL

  • Professor homepage (extracted from ranking page)

OpenAlex Crawler

File: crawlers/openalex_crawler.py

Primary paper source. Fetches professor papers via the OpenAlex API, using institution + author matching for accurate identification.

Features:

  • Search institution by name, then author within that institution (flow sketched after this list)

  • Select author with the most works (highest confidence)

  • Fetch works sorted by publication date

  • Extract arXiv links from locations and open_access fields

  • Handle non-arXiv papers gracefully (skip if no arXiv ID — DB schema requires it)
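
For reference, the matching flow above maps onto the public OpenAlex REST API. This is a minimal sketch, not the crawler's actual code; endpoint and filter names follow the OpenAlex docs, and error handling (e.g. empty results) is omitted:

import requests

BASE = "https://api.openalex.org"

def find_author_id(name: str, institution: str) -> str:
    # Resolve the institution by name; take the top search hit
    inst = requests.get(f"{BASE}/institutions",
                        params={"search": institution}).json()["results"][0]
    # Search authors affiliated with it; pick the one with the most works
    authors = requests.get(
        f"{BASE}/authors",
        params={"search": name,
                "filter": f"affiliations.institution.id:{inst['id']}"},
    ).json()["results"]
    return max(authors, key=lambda a: a["works_count"])["id"]

def recent_works(author_id: str, n: int = 10) -> list:
    # Fetch the author's works, newest first
    return requests.get(
        f"{BASE}/works",
        params={"filter": f"author.id:{author_id}",
                "sort": "publication_date:desc",
                "per-page": n},
    ).json()["results"]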

Usage:

from phd_hunter.crawlers.openalex_crawler import OpenAlexCrawler
from phd_hunter.models import Professor

crawler = OpenAlexCrawler(delay=1.0)
prof = Professor(name="Bingsheng He", university="National University of Singapore")
papers = crawler.fetch(prof, max_papers=10)
# Returns list of Paper objects (arxiv_id set when arXiv link found)
crawler.close()

Extracted data:

  • Paper title

  • Author list

  • Abstract (from OpenAlex, may be incomplete)

  • Publication year / month

  • arXiv ID (extracted from locations/open_access)

  • arXiv URL and PDF URL

  • Citation count

  • Venue name

arXiv Crawler

File: crawlers/arxiv_crawler.py

Abstract enrichment and manual addition. Supplements OpenAlex papers with accurate arXiv abstracts, and supports manually adding papers by URL or title.

Features:

  • fetch_by_ids(): batch query arXiv by ID list for accurate abstracts and PDF links

  • fetch_by_titles(): search by paper title with progressive query degradation and Jaccard similarity filtering (similarity check sketched after this list)

  • fetch(): author-name search (legacy, not used in main flow)

  • Author verification: fuzzy name matching (handles initials, last-name-only)
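
The Jaccard filter can be illustrated in a few lines. A minimal sketch on word sets; the 0.5 threshold here is illustrative, not the crawler's actual value:

def jaccard_similarity(a: str, b: str) -> float:
    # Jaccard similarity between the word sets of two titles
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

wanted = "A Survey of Large Language Model Agents"
candidates = ["A Survey of LLM Agents", "Graph Neural Networks for Code"]
# Keep only search hits whose title closely matches the requested one
matches = [t for t in candidates if jaccard_similarity(t, wanted) >= 0.5]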

Usage — abstract enrichment (main flow):

from phd_hunter.crawlers.arxiv_crawler import ArxivCrawler

crawler = ArxivCrawler(delay=3.0)
results = crawler.fetch_by_ids(["2412.11483", "2512.02589"])
# Returns dict: arxiv_id -> Paper with accurate abstract and pdf_url
crawler.close()
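
Under the hood, batch lookup by ID corresponds to arXiv's public export API, which accepts a comma-separated id_list. A minimal sketch of the raw request (the crawler adds parsing, rate limiting, and retries):

import urllib.request

ids = ["2412.11483", "2512.02589"]
url = "http://export.arxiv.org/api/query?id_list=" + ",".join(ids)
feed = urllib.request.urlopen(url).read().decode()  # Atom feed, one <entry> per paper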

Usage — manual add by title (for homepage-extracted titles):

from phd_hunter.crawlers.arxiv_crawler import ArxivCrawler
from phd_hunter.models import Professor

crawler = ArxivCrawler()
prof = Professor(name="Yangqiu Song")
titles = ["Paper Title 1", "Paper Title 2"]
papers = crawler.fetch_by_titles(prof, titles=titles, max_papers=10)
# Returns list of Paper objects where the professor is a confirmed author

Extracted data:

  • Paper title

  • Author list

  • Abstract (accurate, from arXiv)

  • Publication year

  • arXiv ID

  • PDF URL

Homepage Crawler

File: crawlers/homepage_crawler.py

Scrape professor homepages, generate AI summaries, and extract recent paper titles.

Features:

  • Fetch professor homepage via HTTP

  • Extract plain text from HTML (one approach sketched after this list)

  • Use LLM to generate summary (research focus, recruiting status, content summary)

  • Extract recent paper titles from the homepage (used for precise arXiv search)
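
The fetch-and-extract steps are plain HTTP plus HTML parsing. A minimal sketch using requests and BeautifulSoup, assumed here for illustration; the crawler's actual parsing may differ:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://cs.stanford.edu/~prof/", timeout=30).text
# Strip markup and collapse the page into plain text for the LLM prompt
text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)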

Usage:

import asyncio

from phd_hunter.crawlers.homepage_crawler import (
    fetch_and_summarize_homepage,
    load_homepage_papers,
)

async def main() -> None:
    # Fetch homepage, generate the AI summary, and store the results
    success = await fetch_and_summarize_homepage(
        professor_id=1,
        homepage_url="https://cs.stanford.edu/~prof/",
        professor_name="John Doe",
        db_path="phd_hunter.db",
    )

asyncio.run(main())  # fetch_and_summarize_homepage is a coroutine

# Retrieve extracted paper titles
titles = load_homepage_papers(1)
# Returns ["Paper Title 1", "Paper Title 2", ...]

Configuration

Configuration is currently passed via command-line parameters. Key parameters:

CSRankingsCrawler:
  --headless / --no-headless   # Headless mode
  --timeout 30                 # Timeout (seconds)
  --max-professors 5           # Max professors per university

OpenAlexCrawler:
  --delay 1.0                  # Request interval (seconds)
  --max-retries 3              # Retry attempts on failure

ArxivCrawler:
  --delay 3.0                  # Request interval (seconds, be respectful)
  --max-papers 10              # Max papers per professor

HomepageCrawler:
  Requires LLM configuration in hound_config.json

Caching

All crawler results are cached to avoid redundant requests:

  • Cache location: Memory cache (in-process)

  • Cache key: Parameter hash

  • TTL: Default 1 day
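
A minimal sketch of such a cache (in-process dict, parameter hash as key, one-day TTL); the names here are illustrative, not the project's actual helpers:

import hashlib
import json
import time

_CACHE: dict = {}
TTL_SECONDS = 24 * 60 * 60  # one day

def _key(**params) -> str:
    # Stable hash of the call parameters
    return hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()

def cached(fetch_fn, **params):
    key = _key(**params)
    entry = _CACHE.get(key)
    if entry and time.time() - entry[0] < TTL_SECONDS:
        return entry[1]                  # fresh hit: skip the request
    result = fetch_fn(**params)
    _CACHE[key] = (time.time(), result)  # store with a timestamp
    return result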

Rate Limiting

To respect data sources:

  • Automatic delay between requests

  • arXiv: Default 3-second interval (configurable; see --delay above)

Error Handling

Crawlers handle:

  • Network timeouts (retry)

  • Page layout changes (fault tolerance)

  • Missing data (return partial results)
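
Both behaviors reduce to a small wrapper around each HTTP call. A minimal sketch combining the fixed delay with timeout retries (names and values are illustrative):

import time

import requests

def polite_get(url: str, delay: float = 3.0, max_retries: int = 3):
    for _ in range(max_retries):
        time.sleep(delay)  # respect the source's rate limit
        try:
            return requests.get(url, timeout=30)
        except requests.Timeout:
            continue  # network timeout: try again
    return None  # missing data: caller keeps partial results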

Adding New Crawlers

To add a new data source:

  1. Create crawlers/newsource.py

  2. Inherit BaseCrawler

  3. Implement fetch() method

  4. Register in crawlers/__init__.py

  5. Add command in main.py

Example:

from phd_hunter.crawlers.base import BaseCrawler

class DBLPCrawler(BaseCrawler):
    def fetch(self, query: str):
        # Implement the crawling logic here: query the source,
        # parse the response, and return model objects (e.g. Paper)
        pass

See Also