Compare commits

..

4 Commits

Author SHA1 Message Date
426cf4d2cd Add university scraper system with backend, frontend, and configs
- Add src/university_scraper module with scraper, analyzer, and CLI
- Add backend FastAPI service with API endpoints and database models
- Add frontend React app with university management pages
- Add configs for Harvard, Manchester, and UCL universities
- Add artifacts with various scraper implementations
- Add Docker compose configuration for deployment
- Update .gitignore to exclude generated files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-22 15:25:08 +08:00
2714c8ad5c Add Harvard test example to README
- Add detailed test results table with script paths
- Include Harvard test example with commands and sample output
- List covered Harvard schools

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 15:43:09 +08:00
a4dca81216 Rename test script and update documentation
- Rename test_rwth.py to generate_scraper.py with CLI arguments
- Update README.md with comprehensive usage guide
- Add Harvard scraper as example output
- Document troubleshooting tips for common issues

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 15:36:14 +08:00
fb2aa12f2b Add OpenRouter support and improve JSON parsing robustness
- Add OpenRouter as third LLM provider option in config.py
- Implement _extract_json() to handle markdown-wrapped JSON responses
- Add default values for missing required fields in ScriptPlan
- Handle navigation_strategy as list or string
- Add .env.example with configuration templates
- Add test script and sample generated scrapers for RWTH and KAUST

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 13:13:39 +08:00
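Commit fb2aa12f2b above names `_extract_json()` as the helper that copes with markdown-wrapped LLM responses. A minimal sketch of such a helper, assuming a plain regex-plus-`json.loads` approach (the function name comes from the commit message; the body below is illustrative, not the repository's actual code):

```python
import json
import re


def _extract_json(raw: str) -> dict:
    """Best-effort extraction of a JSON object from an LLM response.

    Handles replies wrapped in markdown code fences as well as JSON embedded
    in surrounding prose. Illustrative sketch only.
    """
    text = raw.strip()
    # Strip a markdown code fence (three backticks, optional "json" tag) if present.
    fenced = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    # Otherwise fall back to the first {...} span in the text.
    if not text.startswith("{"):
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            text = text[start : end + 1]
    return json.loads(text)


if __name__ == "__main__":
    reply = 'Here is the plan:\n```json\n{"navigation_strategy": ["visit /academics/"]}\n```'
    print(_extract_json(reply))  # {'navigation_strategy': ['visit /academics/']}
```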
83 changed files with 15697 additions and 60 deletions

.env.example

@@ -0,0 +1,14 @@
# OpenRouter Configuration (recommended)
CODEGEN_MODEL_PROVIDER=openrouter
OPENAI_API_KEY=your-openrouter-api-key-here
CODEGEN_OPENROUTER_MODEL=anthropic/claude-sonnet-4
# Alternative: Direct Anthropic
# CODEGEN_MODEL_PROVIDER=anthropic
# ANTHROPIC_API_KEY=your-anthropic-api-key-here
# CODEGEN_ANTHROPIC_MODEL=claude-sonnet-4-20250514
# Alternative: OpenAI
# CODEGEN_MODEL_PROVIDER=openai
# OPENAI_API_KEY=your-openai-api-key-here
# CODEGEN_OPENAI_MODEL=gpt-4o
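The README lists `config.py # pydantic Settings` as the module that reads these variables. A hypothetical sketch of how the provider switch above could be modelled with pydantic-settings (class and field names are assumptions, not the project's actual src/university_agent/config.py):

```python
# Hypothetical sketch only -- the real config.py may differ.
from pydantic_settings import BaseSettings, SettingsConfigDict


class CodegenSettings(BaseSettings):
    # The CODEGEN_ prefix maps model_provider -> CODEGEN_MODEL_PROVIDER, etc.
    model_config = SettingsConfigDict(env_prefix="CODEGEN_", env_file=".env", extra="ignore")

    model_provider: str = "openrouter"   # openrouter / anthropic / openai
    openrouter_model: str = "anthropic/claude-sonnet-4"
    anthropic_model: str = "claude-sonnet-4-20250514"
    openai_model: str = "gpt-4o"


if __name__ == "__main__":
    settings = CodegenSettings()
    print(settings.model_provider, settings.openrouter_model)
```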

.gitignore

@@ -174,3 +174,36 @@ cython_debug/
# PyPI configuration file
.pypirc
# Windows artifacts
nul
# Scraper output files
*_results.json
# Output directories
output/
# Screenshots and debug images
*.png
artifacts/*.html
# Windows
desktop.ini
# Claude settings (local)
.claude/
# Progress files
*_progress.json
# Test result files
*_test_result.json
# Node modules
node_modules/
# Database files
*.db
# Frontend build
frontend/nul

README.md

@@ -1,73 +1,221 @@
# University Playwright Codegen Agent
构建于 [Agno](https://docs.agno.com/) 的自动化代码生成代理:输入海外大学官网的根地址,即可生成一份使用 **Playwright** 的 Python 脚本,脚本会抓取各学院/研究生院下的硕士项目网址以及项目中列出的导师(Supervisor/Faculty)个人信息页面。
## Quick Start
### 1. 环境准备
```bash
# 克隆项目
git clone https://git.prodream.cn/YXY/University-Playwright-Codegen-Agent.git
cd University-Playwright-Codegen-Agent
# 安装依赖(需要 uv)
uv sync
# 安装 Playwright 浏览器
uv run playwright install
```
### 2. 配置 API Key
项目使用 OpenRouter API 调用 Claude 模型。设置环境变量:
**Windows (PowerShell):**
```powershell
[Environment]::SetEnvironmentVariable("OPENROUTER_API_KEY", "your-api-key", "User")
```
**Windows (CMD):**
```cmd
setx OPENROUTER_API_KEY "your-api-key"
```
**Linux/macOS:**
```bash
export OPENROUTER_API_KEY="your-api-key"
```
或者复制 `.env.example` 为 `.env` 并填入 API Key。
### 3. 生成爬虫脚本
**方式一:使用命令行参数**
```bash
uv run python generate_scraper.py \
--url "https://www.harvard.edu/" \
--name "Harvard" \
--language "English" \
--max-depth 3 \
--max-pages 30
```
**方式二:修改脚本中的配置**
编辑 `generate_scraper.py` 顶部的配置:
```python
TARGET_URL = "https://www.example.edu/"
CAMPUS_NAME = "Example University"
LANGUAGE = "English"
MAX_DEPTH = 3
MAX_PAGES = 30
```
然后运行:
```bash
uv run python generate_scraper.py
```
### 4. 运行生成的爬虫
生成的脚本保存在 `artifacts/` 目录下:
```bash
cd artifacts
uv run python harvard_faculty_scraper.py --max-pages 50 --no-verify
```
**常用参数:**
| 参数 | 说明 | 默认值 |
|------|------|--------|
| `--max-pages` | 最大爬取页面数 | 30 |
| `--max-depth` | 最大爬取深度 | 3 |
| `--no-verify` | 跳过链接验证(推荐) | False |
| `--browser` | 浏览器引擎 (chromium/firefox/webkit) | chromium |
| `--timeout` | 页面加载超时(ms) | 20000 |
| `--output` | 输出文件路径 | university-scraper_results.json |
### 5. 查看结果
爬取结果保存为 JSON 文件:
```json
{
"statistics": {
"total_links": 277,
"program_links": 8,
"faculty_links": 269,
"profile_pages": 265
},
"program_links": [...],
"faculty_links": [...]
}
```
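如果想在 Python 中进一步处理结果,可以直接读取这个 JSON(下面只是一个示意片段,文件路径以实际 `--output` 输出为准):

```python
import json
from pathlib import Path

# 路径仅为示例,以实际输出文件为准
data = json.loads(Path("artifacts/university-scraper_results.json").read_text(encoding="utf-8"))

print("总链接数:", data["statistics"]["total_links"])
# 只列出被识别为个人主页的教职链接
profiles = [link for link in data["faculty_links"] if link.get("is_profile_page")]
for link in profiles[:10]:
    print(link["text"], "->", link["url"])
```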
## 使用 CLI(可选)
项目也提供 Typer CLI:
```bash
uv run university-agent generate \
"https://www.example.edu" \
--campus "Example Campus" \
--language "English" \
--max-depth 2 \
--max-pages 60
```
运行完成后会在 `artifacts/` 下看到生成的 Playwright 脚本,并在终端展示自动规划的关键词与验证步骤。
## 测试过的大学
| 大学 | 状态 | 结果 | 生成的脚本 |
|------|------|------|-----------|
| Harvard | ✅ | 277 链接 (8 项目, 269 教职, 265 个人主页) | `artifacts/harvard_faculty_scraper.py` |
| RWTH Aachen | ✅ | 108 链接 (103 项目, 5 教职) | `artifacts/rwth_aachen_playwright_scraper.py` |
| KAUST | ✅ | 9 链接 (需使用 Firefox) | `artifacts/kaust_faculty_scraper.py` |
### Harvard 测试示例
**生成爬虫脚本:**
```bash
uv run python generate_scraper.py --url "https://www.harvard.edu/" --name "Harvard"
```
**运行爬虫:**
```bash
cd artifacts
uv run python harvard_faculty_scraper.py --max-pages 30 --no-verify
```
**结果输出** (`artifacts/university-scraper_results.json`)
```json
{
"statistics": {
"total_links": 277,
"program_links": 8,
"faculty_links": 269,
"profile_pages": 265
},
"program_links": [
{"url": "https://www.harvard.edu/programs/?degree_levels=graduate", "text": "Graduate Programs"},
...
],
"faculty_links": [
{"url": "https://www.gse.harvard.edu/directory/faculty", "text": "Faculty Directory"},
{"url": "https://faculty.harvard.edu", "text": "Harvard Faculty"},
...
]
}
```
爬取覆盖了 Harvard 的多个学院:
- Graduate School of Design (GSD)
- Graduate School of Education (GSE)
- Faculty of Arts and Sciences (FAS)
- Graduate School of Arts and Sciences (GSAS)
- Harvard Divinity School (HDS)
## 故障排除
### 超时错误
某些网站响应较慢,增加超时时间:
```bash
uv run python xxx_scraper.py --timeout 60000 --no-verify
```
### 浏览器被阻止
某些网站(如 KAUST)会阻止 Chromium,改用 Firefox:
```bash
uv run python xxx_scraper.py --browser firefox
```
### API Key 错误
确保 `OPENROUTER_API_KEY` 环境变量已正确设置:
```bash
echo $OPENROUTER_API_KEY # Linux/macOS
echo %OPENROUTER_API_KEY% # Windows CMD
```
## Project Structure
```
├── README.md
├── generate_scraper.py        # 主入口脚本
├── .env.example               # 环境变量模板
├── pyproject.toml
├── artifacts/                 # 生成的爬虫脚本
│   ├── harvard_faculty_scraper.py
│   ├── kaust_faculty_scraper.py
│   └── ...
└── src/university_agent/
    ├── agent.py               # Agno Agent 配置
    ├── cli.py                 # Typer CLI
    ├── config.py              # pydantic Settings
    ├── generator.py           # Orchestration 引擎
    ├── models.py              # 数据模型
    ├── renderer.py            # ScriptPlan -> Playwright script
    ├── sampler.py             # Playwright 采样
    └── writer.py              # 脚本写入
```
## Features
- **Agno Agent**:利用 `output_schema` 强制结构化输出
- **Playwright sampling**:生成前对站点进行轻量抓取
- **Deterministic script template**:BFS 爬取、关键词过滤、JSON 输出
- **OpenRouter 支持**:通过 OpenRouter 使用 Claude 模型
- **uv + ruff + ty workflow**:现代 Python 工具链
Happy building! 🎓🤖
## License
MIT

SYSTEM_DESIGN.md

@@ -0,0 +1,261 @@
# 大学爬虫Web系统设计方案
## 一、系统架构
```
┌─────────────────────────────────────────────────────────────────┐
│ 前端 (React/Vue) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ 输入大学URL │ │ 一键生成脚本 │ │ 查看/验证爬取数据 │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 后端 API (FastAPI) │
│  │ 脚本生成API │  │ 脚本执行API  │  │      数据查询API        │  │
│ │ 脚本生成API │ │ 脚本执行API │ │ 数据查询API │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌───────────────────┐ ┌───────────────┐ ┌───────────────────────┐
│ PostgreSQL │ │ 任务队列 │ │ Agent (Claude) │
│ 数据库 │ │ (Celery) │ │ 分析+生成脚本 │
│ - 爬虫脚本 │ └───────────────┘ └───────────────────────┘
│ - 爬取结果 │
│ - 执行日志 │
└───────────────────┘
```
## 二、技术栈选择
### 后端
- **框架**: FastAPI (Python与现有爬虫代码无缝集成)
- **数据库**: PostgreSQL (存储脚本、结果、日志)
- **任务队列**: Celery + Redis (异步执行爬虫任务)
- **ORM**: SQLAlchemy
### 前端
- **框架**: React + TypeScript (或 Vue.js)
- **UI库**: Ant Design / Material-UI
- **状态管理**: React Query (数据获取和缓存)
### 部署
- **容器化**: Docker + Docker Compose
- **云平台**: 可部署到 AWS/阿里云/腾讯云
## 三、数据库设计
```sql
-- 大学表
CREATE TABLE universities (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
url VARCHAR(500) NOT NULL,
country VARCHAR(100),
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
-- 爬虫脚本表
CREATE TABLE scraper_scripts (
id SERIAL PRIMARY KEY,
university_id INTEGER REFERENCES universities(id),
script_name VARCHAR(255) NOT NULL,
script_content TEXT NOT NULL, -- Python脚本代码
config_content TEXT, -- YAML配置
version INTEGER DEFAULT 1,
status VARCHAR(50) DEFAULT 'draft', -- draft, active, deprecated
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
-- 爬取任务表
CREATE TABLE scrape_jobs (
id SERIAL PRIMARY KEY,
university_id INTEGER REFERENCES universities(id),
script_id INTEGER REFERENCES scraper_scripts(id),
status VARCHAR(50) DEFAULT 'pending', -- pending, running, completed, failed
started_at TIMESTAMP,
completed_at TIMESTAMP,
error_message TEXT,
created_at TIMESTAMP DEFAULT NOW()
);
-- 爬取结果表 (JSON存储层级数据)
CREATE TABLE scrape_results (
id SERIAL PRIMARY KEY,
job_id INTEGER REFERENCES scrape_jobs(id),
university_id INTEGER REFERENCES universities(id),
result_data JSONB NOT NULL, -- 学院→项目→导师 JSON数据
schools_count INTEGER,
programs_count INTEGER,
faculty_count INTEGER,
created_at TIMESTAMP DEFAULT NOW()
);
-- 执行日志表
CREATE TABLE scrape_logs (
id SERIAL PRIMARY KEY,
job_id INTEGER REFERENCES scrape_jobs(id),
level VARCHAR(20), -- info, warning, error
message TEXT,
created_at TIMESTAMP DEFAULT NOW()
);
```
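阶段1会把这些表映射为 SQLAlchemy 模型(见第六节)。下面给出 universities 与 scraper_scripts 两张表的一个示意写法,仅用于说明,与 backend/app/models/ 中的实际代码可能不同:

```python
# 示意性的 SQLAlchemy 2.x 模型草稿,对应上方 SQL 中的两张表
from datetime import datetime

from sqlalchemy import ForeignKey, Text, func
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class University(Base):
    __tablename__ = "universities"

    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(nullable=False)
    url: Mapped[str] = mapped_column(nullable=False)
    country: Mapped[str | None]
    created_at: Mapped[datetime] = mapped_column(server_default=func.now())


class ScraperScript(Base):
    __tablename__ = "scraper_scripts"

    id: Mapped[int] = mapped_column(primary_key=True)
    university_id: Mapped[int] = mapped_column(ForeignKey("universities.id"))
    script_name: Mapped[str] = mapped_column(nullable=False)
    script_content: Mapped[str] = mapped_column(Text, nullable=False)  # Python 脚本代码
    version: Mapped[int] = mapped_column(default=1)
    status: Mapped[str] = mapped_column(default="draft")  # draft / active / deprecated
```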
## 四、API接口设计
### 1. 大学管理
```
POST /api/universities 创建大学
GET /api/universities 获取大学列表
GET /api/universities/{id} 获取大学详情
DELETE /api/universities/{id} 删除大学
```
### 2. 爬虫脚本
```
POST /api/scripts/generate 生成爬虫脚本 (Agent自动分析)
GET /api/scripts/{university_id} 获取大学的爬虫脚本
PUT /api/scripts/{id} 更新脚本
```
### 3. 爬取任务
```
POST /api/jobs/start/{university_id} 启动爬取任务
GET /api/jobs/{id} 获取任务状态
GET /api/jobs/university/{id} 获取大学的任务列表
POST /api/jobs/{id}/cancel 取消任务
```
### 4. 数据结果
```
GET /api/results/{university_id} 获取爬取结果
GET /api/results/{university_id}/schools 获取学院列表
GET /api/results/{university_id}/programs 获取项目列表
GET /api/results/{university_id}/faculty 获取导师列表
GET /api/results/{university_id}/export?format=json 导出数据
```
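以 POST /api/jobs/start/{university_id} 为例,路由层只负责登记任务并交给 Celery 异步执行(下面是示意代码,函数名与连接串均为假设,实际以 backend/app/api/ 与 tasks/ 中的实现为准):

```python
# 示意代码:启动爬取任务的路由 + Celery 任务
from celery import Celery
from fastapi import FastAPI

app = FastAPI()
celery_app = Celery("scraper", broker="redis://localhost:6379/0")  # broker 地址为占位


@celery_app.task
def run_scrape_job(job_id: int) -> None:
    # 实际实现:从 scraper_scripts 取出脚本内容,执行 Playwright 爬虫,结果写入 scrape_results
    print(f"running scrape job {job_id}")


@app.post("/api/jobs/start/{university_id}")
def start_job(university_id: int) -> dict:
    # 实际实现:先在 scrape_jobs 表插入一条 pending 记录并取得 job_id,这里用占位值示意
    job_id = university_id
    run_scrape_job.delay(job_id)  # 交给 Celery worker 异步执行
    return {"job_id": job_id, "status": "pending"}
```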
## 五、前端页面设计
### 页面1: 首页/大学列表
- 显示已添加的大学列表
- "添加新大学" 按钮
- 每个大学卡片显示:名称、状态、项目数、导师数、操作按钮
### 页面2: 添加大学 (一键生成脚本)
- 输入框:大学官网URL
- "分析并生成脚本" 按钮
- 显示分析进度和日志
- 生成完成后自动跳转到管理页面
### 页面3: 大学管理页面
- 大学基本信息
- 爬虫脚本状态
- "一键运行爬虫" 按钮
- 运行进度和日志实时显示
- 历史任务列表
### 页面4: 数据查看页面
- 树形结构展示:学院 → 项目 → 导师
- 搜索和筛选功能
- 数据导出按钮 (JSON/Excel)
- 数据校验和编辑功能
## 六、实现步骤
### 阶段1: 后端基础 (优先)
1. 创建 FastAPI 项目结构
2. 设计数据库模型 (SQLAlchemy)
3. 实现基础 CRUD API
4. 集成现有爬虫代码
### 阶段2: 脚本生成与执行
1. 实现 Agent 自动分析逻辑
2. 实现脚本存储和版本管理
3. 集成 Celery 异步任务队列
4. 实现爬虫执行和日志记录
### 阶段3: 前端开发
1. 搭建 React 项目
2. 实现大学列表页面
3. 实现脚本生成页面
4. 实现数据查看页面
### 阶段4: 部署上线
1. Docker 容器化
2. 部署到云服务器
3. 配置域名和 HTTPS
## 七、目录结构
```
university-scraper-web/
├── backend/
│ ├── app/
│ │ ├── __init__.py
│ │ ├── main.py # FastAPI入口
│ │ ├── config.py # 配置
│ │ ├── database.py # 数据库连接
│ │ ├── models/ # SQLAlchemy模型
│ │ │ ├── university.py
│ │ │ ├── script.py
│ │ │ ├── job.py
│ │ │ └── result.py
│ │ ├── schemas/ # Pydantic模型
│ │ ├── api/ # API路由
│ │ │ ├── universities.py
│ │ │ ├── scripts.py
│ │ │ ├── jobs.py
│ │ │ └── results.py
│ │ ├── services/ # 业务逻辑
│ │ │ ├── scraper_service.py
│ │ │ └── agent_service.py
│ │ └── tasks/ # Celery任务
│ │ └── scrape_task.py
│ ├── requirements.txt
│ └── Dockerfile
├── frontend/
│ ├── src/
│ │ ├── components/
│ │ ├── pages/
│ │ ├── services/
│ │ └── App.tsx
│ ├── package.json
│ └── Dockerfile
├── docker-compose.yml
└── README.md
```
## 八、关于脚本存储位置的建议
### 推荐方案:PostgreSQL + 文件系统混合
1. **PostgreSQL 存储**:
- 脚本元数据 (名称、版本、状态)
- 脚本代码内容 (TEXT字段)
- 配置文件内容 (JSONB字段)
- 爬取结果 (JSONB字段)
2. **优点**:
- 事务支持,数据一致性
- 版本管理方便
- 查询和搜索方便
- 备份和迁移简单
- 与后端集成紧密
3. **云部署选项**:
- AWS RDS PostgreSQL
- 阿里云 RDS PostgreSQL
- 腾讯云 TDSQL-C
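补充一个示意:假设 result_data 直接保存爬虫输出的 JSON(含 statistics 字段),JSONB 可以直接在 SQL 层对层级数据做筛选(列名沿用第三节的 scrape_results 表,连接串为占位):

```python
# 示意:用 SQLAlchemy 查询 JSONB 字段中的统计信息
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg://user:pass@localhost/scraper")  # 占位连接串

query = text("""
    SELECT university_id,
           result_data -> 'statistics' ->> 'faculty_links' AS faculty_link_count,
           created_at
    FROM scrape_results
    WHERE faculty_count > :min_faculty
    ORDER BY created_at DESC
    LIMIT 10
""")

with engine.connect() as conn:
    for row in conn.execute(query, {"min_faculty": 50}):
        print(row.university_id, row.faculty_link_count, row.created_at)
```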
### 备选方案:MongoDB
如果数据结构经常变化,可以考虑 MongoDB:
- 灵活的文档结构
- 适合存储层级化的爬取结果
- 但 Python 生态对 PostgreSQL 支持更好

@@ -0,0 +1,83 @@
#!/usr/bin/env python3
"""
调试Computer Science的Faculty页面
"""
import asyncio
from playwright.async_api import async_playwright
async def debug_cs():
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
page = await browser.new_page()
# 访问Computer Science GSAS页面
gsas_url = "https://gsas.harvard.edu/program/computer-science"
print(f"访问: {gsas_url}")
await page.goto(gsas_url, wait_until="domcontentloaded", timeout=30000)
await page.wait_for_timeout(3000)
await page.screenshot(path="cs_gsas_page.png", full_page=True)
print("截图已保存: cs_gsas_page.png")
# 查找所有链接
links = await page.evaluate('''() => {
const links = [];
document.querySelectorAll('a[href]').forEach(a => {
const text = a.innerText.trim();
const href = a.href;
if (text && text.length > 2 && text.length < 100) {
links.push({text: text, href: href});
}
});
return links;
}''')
print(f"\n页面上的所有链接 ({len(links)} 个):")
for link in links:
print(f" - {link['text'][:60]} -> {link['href']}")
# 查找可能的Faculty或People链接
print("\n\n查找Faculty/People相关链接:")
for link in links:
text_lower = link['text'].lower()
href_lower = link['href'].lower()
if 'faculty' in text_lower or 'people' in href_lower or 'faculty' in href_lower or 'website' in text_lower:
print(f" * {link['text']} -> {link['href']}")
# 尝试访问SEAS (School of Engineering)
print("\n\n尝试访问SEAS Computer Science页面...")
seas_url = "https://seas.harvard.edu/computer-science"
await page.goto(seas_url, wait_until="domcontentloaded", timeout=30000)
await page.wait_for_timeout(2000)
await page.screenshot(path="seas_cs_page.png", full_page=True)
print("截图已保存: seas_cs_page.png")
seas_links = await page.evaluate('''() => {
const links = [];
document.querySelectorAll('a[href]').forEach(a => {
const text = a.innerText.trim();
const href = a.href;
const lowerText = text.toLowerCase();
const lowerHref = href.toLowerCase();
if ((lowerText.includes('faculty') || lowerText.includes('people') ||
lowerHref.includes('faculty') || lowerHref.includes('people')) &&
text.length > 2) {
links.push({text: text, href: href});
}
});
return links;
}''')
print(f"\nSEAS页面上的Faculty/People链接:")
for link in seas_links:
print(f" * {link['text']} -> {link['href']}")
await browser.close()
if __name__ == "__main__":
asyncio.run(debug_cs())

@@ -0,0 +1,110 @@
"""
探索Harvard院系People/Faculty页面结构获取导师列表
"""
import asyncio
from playwright.async_api import async_playwright
async def explore_faculty_page():
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
page = await browser.new_page()
# 访问AAAS院系People页面
people_url = "https://aaas.fas.harvard.edu/aaas-people"
print(f"访问院系People页面: {people_url}")
await page.goto(people_url, wait_until='networkidle')
await page.wait_for_timeout(3000)
# 截图保存
await page.screenshot(path="aaas_people_page.png", full_page=True)
print("已保存截图: aaas_people_page.png")
# 获取所有教职员工链接
faculty_info = await page.evaluate('''() => {
const faculty = [];
// 查找所有 /people/ 路径的链接
document.querySelectorAll('a[href*="/people/"]').forEach(a => {
const href = a.href || '';
const text = a.innerText.trim();
// 过滤掉导航链接,只保留个人页面链接
if (href.includes('/people/') && text.length > 3 &&
!text.toLowerCase().includes('people') &&
!href.endsWith('/people/') &&
!href.endsWith('/aaas-people')) {
faculty.push({
name: text,
url: href
});
}
});
return faculty;
}''')
print(f"\n找到 {len(faculty_info)} 个教职员工:")
for f in faculty_info:
print(f" - {f['name']} -> {f['url']}")
# 尝试经济学院系的Faculty页面
print("\n\n========== 尝试经济学院系Faculty页面 ==========")
econ_faculty_url = "http://economics.harvard.edu/people/people-type/faculty"
print(f"访问: {econ_faculty_url}")
await page.goto(econ_faculty_url, wait_until='networkidle')
await page.wait_for_timeout(3000)
await page.screenshot(path="econ_faculty_page.png", full_page=True)
print("已保存截图: econ_faculty_page.png")
econ_faculty = await page.evaluate('''() => {
const faculty = [];
// 查找所有可能的faculty链接
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href || '';
const text = a.innerText.trim();
const lowerHref = href.toLowerCase();
// 查找个人页面链接
if ((lowerHref.includes('/people/') || lowerHref.includes('/faculty/') ||
lowerHref.includes('/profile/')) &&
text.length > 3 && text.length < 100 &&
!text.toLowerCase().includes('faculty') &&
!text.toLowerCase().includes('people')) {
faculty.push({
name: text,
url: href
});
}
});
return faculty;
}''')
print(f"\n找到 {len(econ_faculty)} 个教职员工:")
for f in econ_faculty[:30]:
print(f" - {f['name']} -> {f['url']}")
# 查看页面上所有链接用于调试
print("\n\n页面上的所有链接:")
all_links = await page.evaluate('''() => {
const links = [];
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href || '';
const text = a.innerText.trim();
if (text && text.length > 2 && text.length < 100) {
links.push({text: text, href: href});
}
});
return links;
}''')
for link in all_links[:40]:
print(f" - {link['text'][:50]} -> {link['href']}")
await browser.close()
if __name__ == "__main__":
asyncio.run(explore_faculty_page())

@@ -0,0 +1,173 @@
"""
探索曼彻斯特大学硕士课程页面结构
"""
import asyncio
import json
from playwright.async_api import async_playwright
async def explore_manchester():
"""探索曼彻斯特大学网站结构"""
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
context = await browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
page = await context.new_page()
# 直接访问硕士课程A-Z列表页
print("访问硕士课程A-Z列表页面...")
await page.goto("https://www.manchester.ac.uk/study/masters/courses/list/",
wait_until="domcontentloaded", timeout=60000)
await page.wait_for_timeout(5000)
# 截图
await page.screenshot(path="manchester_masters_page.png", full_page=False)
print("截图已保存: manchester_masters_page.png")
# 分析页面结构
page_info = await page.evaluate("""() => {
const info = {
title: document.title,
url: window.location.href,
all_links: [],
course_candidates: [],
page_sections: []
};
// 获取所有链接
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href;
const text = a.innerText.trim().substring(0, 100);
if (href && text) {
info.all_links.push({href, text});
}
});
// 查找可能的课程链接 - 包含 /course/ 或 list-item
document.querySelectorAll('a[href*="/course/"], .course-link, [class*="course"] a, .search-result a, .list-item a').forEach(a => {
info.course_candidates.push({
href: a.href,
text: a.innerText.trim().substring(0, 100),
classes: a.className,
parent_classes: a.parentElement?.className || ''
});
});
// 获取页面主要区块
document.querySelectorAll('main, [role="main"], .content, #content, .results, .course-list').forEach(el => {
info.page_sections.push({
tag: el.tagName,
id: el.id,
classes: el.className,
children_count: el.children.length
});
});
return info;
}""")
print(f"\n页面标题: {page_info['title']}")
print(f"当前URL: {page_info['url']}")
print(f"\n总链接数: {len(page_info['all_links'])}")
print(f"课程候选链接数: {len(page_info['course_candidates'])}")
# 查找包含 masters/courses/ 的链接
masters_links = [l for l in page_info['all_links']
if 'masters/courses/' in l['href'].lower()
and l['href'] != page_info['url']]
print(f"\n硕士课程相关链接 ({len(masters_links)}):")
for link in masters_links[:20]:
print(f" - {link['text'][:50]}: {link['href']}")
print(f"\n课程候选详情:")
for c in page_info['course_candidates'][:10]:
print(f" - {c['text'][:50]}")
print(f" URL: {c['href']}")
print(f" Classes: {c['classes']}")
# 检查是否有搜索/筛选功能
search_elements = await page.evaluate("""() => {
const elements = [];
document.querySelectorAll('input[type="search"], input[type="text"], select, .filter, .search').forEach(el => {
elements.push({
tag: el.tagName,
type: el.type || '',
id: el.id,
name: el.name || '',
classes: el.className
});
});
return elements;
}""")
print(f"\n搜索/筛选元素: {len(search_elements)}")
for el in search_elements[:5]:
print(f" - {el}")
# 尝试找到课程列表的实际结构
print("\n\n正在分析页面中的课程列表结构...")
list_structures = await page.evaluate("""() => {
const structures = [];
// 查找各种可能的列表结构
const selectors = [
'ul li a[href*="course"]',
'div[class*="result"] a',
'div[class*="course"] a',
'article a[href]',
'.search-results a',
'[data-course] a',
'table tr td a'
];
for (const selector of selectors) {
const elements = document.querySelectorAll(selector);
if (elements.length > 0) {
const samples = [];
elements.forEach((el, i) => {
if (i < 5) {
samples.push({
href: el.href,
text: el.innerText.trim().substring(0, 80)
});
}
});
structures.push({
selector: selector,
count: elements.length,
samples: samples
});
}
}
return structures;
}""")
print("\n找到的列表结构:")
for s in list_structures:
print(f"\n 选择器: {s['selector']} (共 {s['count']} 个)")
for sample in s['samples']:
print(f" - {sample['text']}: {sample['href']}")
# 保存完整分析结果
with open("manchester_analysis.json", "w", encoding="utf-8") as f:
json.dump(page_info, f, indent=2, ensure_ascii=False)
print("\n\n完整分析已保存到 manchester_analysis.json")
# 等待用户查看
print("\n按 Ctrl+C 关闭浏览器...")
try:
await asyncio.sleep(30)
except:
pass
await browser.close()
if __name__ == "__main__":
asyncio.run(explore_manchester())

@@ -0,0 +1,226 @@
"""
探索Harvard项目页面结构寻找导师信息
"""
import asyncio
from playwright.async_api import async_playwright
async def explore_program_page():
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
page = await browser.new_page()
# 访问研究生院系页面 (GSAS)
gsas_url = "https://gsas.harvard.edu/program/african-and-african-american-studies"
print(f"访问研究生院系页面: {gsas_url}")
await page.goto(gsas_url, wait_until='networkidle')
await page.wait_for_timeout(3000)
# 截图保存
await page.screenshot(path="gsas_program_page.png", full_page=True)
print("已保存截图: gsas_program_page.png")
# 分析页面结构
page_info = await page.evaluate('''() => {
const info = {
title: document.title,
h1: document.querySelector('h1')?.innerText || '',
allHeadings: [],
facultyLinks: [],
peopleLinks: [],
allLinks: []
};
// 获取所有标题
document.querySelectorAll('h1, h2, h3, h4').forEach(h => {
info.allHeadings.push({
tag: h.tagName,
text: h.innerText.trim().substring(0, 100)
});
});
// 查找所有链接
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href || '';
const text = a.innerText.trim();
// 检查是否与教职员工相关
const lowerHref = href.toLowerCase();
const lowerText = text.toLowerCase();
if (lowerHref.includes('faculty') || lowerHref.includes('people') ||
lowerHref.includes('professor') || lowerHref.includes('staff') ||
lowerText.includes('faculty') || lowerText.includes('people')) {
info.facultyLinks.push({
text: text.substring(0, 100),
href: href
});
}
// 检查是否是个人页面链接
if (href.includes('/people/') || href.includes('/faculty/') ||
href.includes('/profile/') || href.includes('/person/')) {
info.peopleLinks.push({
text: text.substring(0, 100),
href: href
});
}
// 保存所有主要链接
if (href && text.length > 2 && text.length < 150) {
info.allLinks.push({
text: text,
href: href
});
}
});
return info;
}''')
print(f"\n页面标题: {page_info['title']}")
print(f"H1: {page_info['h1']}")
print(f"\n所有标题 ({len(page_info['allHeadings'])}):")
for h in page_info['allHeadings']:
print(f" <{h['tag']}>: {h['text']}")
print(f"\n教职员工相关链接 ({len(page_info['facultyLinks'])}):")
for f in page_info['facultyLinks']:
print(f" - {f['text']} -> {f['href']}")
print(f"\n个人页面链接 ({len(page_info['peopleLinks'])}):")
for p in page_info['peopleLinks']:
print(f" - {p['text']} -> {p['href']}")
print(f"\n所有链接 ({len(page_info['allLinks'])}):")
for link in page_info['allLinks'][:50]:
print(f" - {link['text'][:60]} -> {link['href']}")
# 尝试另一个项目页面看看是否有不同结构
print("\n\n========== 尝试另一个项目页面 ==========")
economics_url = "https://gsas.harvard.edu/program/economics"
print(f"访问: {economics_url}")
await page.goto(economics_url, wait_until='networkidle')
await page.wait_for_timeout(3000)
# 截图保存
await page.screenshot(path="gsas_economics_page.png", full_page=True)
print("已保存截图: gsas_economics_page.png")
# 分析
econ_info = await page.evaluate('''() => {
const info = {
title: document.title,
facultyLinks: [],
peopleLinks: []
};
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href || '';
const text = a.innerText.trim();
const lowerHref = href.toLowerCase();
const lowerText = text.toLowerCase();
if (lowerHref.includes('faculty') || lowerHref.includes('people') ||
lowerText.includes('faculty') || lowerText.includes('people')) {
info.facultyLinks.push({
text: text.substring(0, 100),
href: href
});
}
if (href.includes('/people/') || href.includes('/faculty/') ||
href.includes('/profile/') || href.includes('/person/')) {
info.peopleLinks.push({
text: text.substring(0, 100),
href: href
});
}
});
return info;
}''')
print(f"\n教职员工相关链接 ({len(econ_info['facultyLinks'])}):")
for f in econ_info['facultyLinks']:
print(f" - {f['text']} -> {f['href']}")
print(f"\n个人页面链接 ({len(econ_info['peopleLinks'])}):")
for p in econ_info['peopleLinks']:
print(f" - {p['text']} -> {p['href']}")
# 访问院系主页看看有没有Faculty页面
print("\n\n========== 尝试访问院系主页 ==========")
dept_url = "https://aaas.fas.harvard.edu/"
print(f"访问院系主页: {dept_url}")
await page.goto(dept_url, wait_until='networkidle')
await page.wait_for_timeout(3000)
await page.screenshot(path="aaas_dept_page.png", full_page=True)
print("已保存截图: aaas_dept_page.png")
dept_info = await page.evaluate('''() => {
const info = {
title: document.title,
navLinks: [],
facultyLinks: [],
peopleLinks: []
};
// 获取导航链接
document.querySelectorAll('nav a, [class*="nav"] a, [class*="menu"] a').forEach(a => {
const href = a.href || '';
const text = a.innerText.trim();
if (text && text.length > 1 && text.length < 50) {
info.navLinks.push({
text: text,
href: href
});
}
});
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href || '';
const text = a.innerText.trim();
const lowerHref = href.toLowerCase();
const lowerText = text.toLowerCase();
if (lowerHref.includes('faculty') || lowerHref.includes('people') ||
lowerText.includes('faculty') || lowerText.includes('people')) {
info.facultyLinks.push({
text: text.substring(0, 100),
href: href
});
}
if (href.includes('/people/') || href.includes('/faculty/') ||
href.includes('/profile/')) {
info.peopleLinks.push({
text: text.substring(0, 100),
href: href
});
}
});
return info;
}''')
print(f"\n导航链接 ({len(dept_info['navLinks'])}):")
for link in dept_info['navLinks'][:20]:
print(f" - {link['text']} -> {link['href']}")
print(f"\n教职员工相关链接 ({len(dept_info['facultyLinks'])}):")
for f in dept_info['facultyLinks']:
print(f" - {f['text']} -> {f['href']}")
print(f"\n个人页面链接 ({len(dept_info['peopleLinks'])}):")
for p in dept_info['peopleLinks'][:30]:
print(f" - {p['text']} -> {p['href']}")
await browser.close()
if __name__ == "__main__":
asyncio.run(explore_program_page())

@@ -0,0 +1,445 @@
#!/usr/bin/env python
"""
Auto-generated by the Agno codegen agent.
Target university: Harvard (https://www.harvard.edu/)
Requested caps: depth=3, pages=30
Plan description: Playwright scraper for university master programs and faculty profiles.
Navigation strategy: Start at https://www.harvard.edu/ Follow links to /academics/ and /a-to-z/ to find list of schools and departments For each school/department, look for a 'faculty' or 'people' page On faculty directory pages, identify and follow links to individual profiles Check for school/department specific subdomains like hls.harvard.edu, hds.harvard.edu, etc. Prioritize crawling faculty directory pages over general site crawling
Verification checklist:
- Manually review a sample of scraped URLs to verify they are faculty profiles
- Check that major academic departments are represented in the results
- Verify the script is capturing profile page content, not just URLs
- Confirm no login pages, application forms, or directory pages are included
Playwright snapshot used to guide this plan:
1. Harvard University (https://www.harvard.edu/)
Snippet: Skip to main content Harvard University Learn about our lawsuits to protect our students and researchers Search Menu David Liu received the 2025 Breakthrough Prize in Life Sciences for developing a revolutionary gene-editing platforms that precisely corrects genetic mutations.
Anchors: Skip to main content -> https://www.harvard.edu/#main-content, Harvard University -> https://www.harvard.edu/, Learn about our lawsuits to protect our students and researchers -> https://www.harvard.edu/federal-lawsuits/, × -> javascript:void(0), A to Z index -> https://www.harvard.edu/a-to-z/, Academics -> https://www.harvard.edu/academics/
2. Index of departments, schools, and affiliates - Harvard University (https://www.harvard.edu/a-to-z/)
Snippet: Skip to main content Harvard University Learn about our lawsuits to protect our students and researchers Search Menu David Liu received the 2025 Breakthrough Prize in Life Sciences for developing a revolutionary gene-editing platforms that precisely corrects genetic mutations.
Anchors: Skip to main content -> https://www.harvard.edu/a-to-z/#main-content, Harvard University -> https://www.harvard.edu/, Learn about our lawsuits to protect our students and researchers -> https://www.harvard.edu/federal-lawsuits/, × -> javascript:void(0), A to Z index -> https://www.harvard.edu/a-to-z/, Academics -> https://www.harvard.edu/academics/
3. Academics - Harvard University (https://www.harvard.edu/academics/)
Snippet: Skip to main content Harvard University Learn about our lawsuits to protect our students and researchers Search Menu David Liu received the 2025 Breakthrough Prize in Life Sciences for developing a revolutionary gene-editing platforms that precisely corrects genetic mutations.
Anchors: Skip to main content -> https://www.harvard.edu/academics/#main-content, Harvard University -> https://www.harvard.edu/, Learn about our lawsuits to protect our students and researchers -> https://www.harvard.edu/federal-lawsuits/, A to Z index -> https://www.harvard.edu/a-to-z/, Academics -> https://www.harvard.edu/academics/, Undergraduate Degrees -> https://www.harvard.edu//programs/?degree_levels=undergraduate
4. Programs - Harvard University (https://www.harvard.edu//programs/?degree_levels=undergraduate)
Snippet: Skip to main content Harvard University Learn about our lawsuits to protect our students and researchers Search Menu David Liu received the 2025 Breakthrough Prize in Life Sciences for developing a revolutionary gene-editing platforms that precisely corrects genetic mutations.
Anchors: Skip to main content -> https://www.harvard.edu/programs/?degree_levels=undergraduate#main-content, Harvard University -> https://www.harvard.edu/, Learn about our lawsuits to protect our students and researchers -> https://www.harvard.edu/federal-lawsuits/, A to Z index -> https://www.harvard.edu/a-to-z/, Academics -> https://www.harvard.edu/academics/, Undergraduate Degrees -> https://www.harvard.edu//programs/?degree_levels=undergraduate
Snapshot truncated.
Generated at: 2025-12-10T07:19:12.294884+00:00
"""
from __future__ import annotations
import argparse
import asyncio
import json
import time
from collections import deque
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Deque, Iterable, List, Set, Tuple
from urllib.parse import urljoin, urldefrag, urlparse
from playwright.async_api import async_playwright, Page, Response
PROGRAM_KEYWORDS = ['/graduate/', '/masters/', '/programs/?degree_levels=graduate', '/mpp/', 'Master of', 'M.S.', 'M.A.', 'graduate program']
FACULTY_KEYWORDS = ['/people/', '/~', '/faculty/', '/profile/', 'professor', 'dr.', 'ph.d.', 'firstname-lastname']
EXCLUSION_KEYWORDS = ['admissions', 'apply', 'tuition', 'news', 'events', 'calendar', 'careers', 'jobs', 'login', 'donate', 'alumni', 'giving']
METADATA_FIELDS = ['url', 'title', 'entity_type', 'department', 'email', 'scraped_at']
EXTRA_NOTES = ['Many Harvard faculty have profiles under the /~username/ URL pattern', 'Some faculty may be cross-listed in multiple departments', 'Prioritize finding profiles from professional schools (business, law, medicine, etc.)', "Check for non-standard faculty titles like 'lecturer', 'fellow', 'researcher'"]
# URL patterns that indicate individual profile pages
PROFILE_URL_PATTERNS = [
"/people/", "/person/", "/profile/", "/profiles/",
"/faculty/", "/staff/", "/directory/",
"/~", # Unix-style personal pages
"/bio/", "/about/",
]
# URL patterns that indicate listing/directory pages (should be crawled deeper)
DIRECTORY_URL_PATTERNS = [
"/faculty", "/people", "/directory", "/staff",
"/team", "/members", "/researchers",
]
def normalize_url(base: str, href: str) -> str:
"""Normalize URL by resolving relative paths and removing fragments."""
absolute = urljoin(base, href)
cleaned, _ = urldefrag(absolute)
# Remove trailing slash for consistency
return cleaned.rstrip("/")
def matches_any(text: str, keywords: Iterable[str]) -> bool:
"""Check if text contains any of the keywords (case-insensitive)."""
lowered = text.lower()
return any(keyword.lower() in lowered for keyword in keywords)
def is_same_domain(url1: str, url2: str) -> bool:
"""Check if two URLs belong to the same root domain."""
domain1 = urlparse(url1).netloc.replace("www.", "")
domain2 = urlparse(url2).netloc.replace("www.", "")
# Allow subdomains of the same root domain
parts1 = domain1.split(".")
parts2 = domain2.split(".")
if len(parts1) >= 2 and len(parts2) >= 2:
return parts1[-2:] == parts2[-2:]
return domain1 == domain2
def is_profile_url(url: str) -> bool:
"""Check if URL pattern suggests an individual profile page."""
url_lower = url.lower()
return any(pattern in url_lower for pattern in PROFILE_URL_PATTERNS)
def is_directory_url(url: str) -> bool:
"""Check if URL pattern suggests a directory/listing page."""
url_lower = url.lower()
return any(pattern in url_lower for pattern in DIRECTORY_URL_PATTERNS)
@dataclass
class ScrapedLink:
url: str
title: str
text: str
source_url: str
bucket: str # "program" or "faculty"
is_verified: bool = False
http_status: int = 0
is_profile_page: bool = False
scraped_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
@dataclass
class ScrapeSettings:
root_url: str
max_depth: int
max_pages: int
headless: bool
output: Path
verify_links: bool = True
request_delay: float = 1.0 # Polite crawling delay
timeout: int = 60000 # Navigation timeout in ms
async def extract_links(page: Page) -> List[Tuple[str, str]]:
"""Extract all anchor links from the page."""
anchors: Iterable[dict] = await page.eval_on_selector_all(
"a",
"""elements => elements
.map(el => ({text: (el.textContent || '').trim(), href: el.href}))
.filter(item => item.text && item.href && item.href.startsWith('http'))""",
)
return [(item["href"], item["text"]) for item in anchors]
async def get_page_title(page: Page) -> str:
"""Get the page title safely."""
try:
return await page.title() or ""
except Exception:
return ""
async def verify_link(context, url: str, timeout: int = 10000) -> Tuple[bool, int, str]:
"""
Verify a link by making a HEAD-like request.
Returns: (is_valid, status_code, page_title)
"""
page = await context.new_page()
try:
response: Response = await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
if response:
status = response.status
title = await get_page_title(page)
is_valid = 200 <= status < 400
return is_valid, status, title
return False, 0, ""
except Exception:
return False, 0, ""
finally:
await page.close()
async def crawl(settings: ScrapeSettings, browser_name: str) -> List[ScrapedLink]:
"""
Crawl the website using BFS, collecting program and faculty links.
Features:
- URL deduplication
- Link verification
- Profile page detection
- Polite crawling with delays
"""
async with async_playwright() as p:
browser_launcher = getattr(p, browser_name)
browser = await browser_launcher.launch(headless=settings.headless)
context = await browser.new_context()
# Priority queue: (priority, url, depth) - lower priority = processed first
# Directory pages get priority 0, others get priority 1
queue: Deque[Tuple[int, str, int]] = deque([(0, settings.root_url, 0)])
visited: Set[str] = set()
found_urls: Set[str] = set() # For deduplication of results
results: List[ScrapedLink] = []
print(f"Starting crawl from: {settings.root_url}")
print(f"Max depth: {settings.max_depth}, Max pages: {settings.max_pages}")
try:
while queue and len(visited) < settings.max_pages:
# Sort queue by priority (directory pages first)
queue = deque(sorted(queue, key=lambda x: x[0]))
priority, url, depth = queue.popleft()
normalized_url = normalize_url(settings.root_url, url)
if normalized_url in visited or depth > settings.max_depth:
continue
# Only crawl same-domain URLs
if not is_same_domain(settings.root_url, normalized_url):
continue
visited.add(normalized_url)
print(f"[{len(visited)}/{settings.max_pages}] Depth {depth}: {normalized_url[:80]}...")
page = await context.new_page()
try:
response = await page.goto(
normalized_url, wait_until="domcontentloaded", timeout=settings.timeout
)
if not response or response.status >= 400:
await page.close()
continue
except Exception as e:
print(f" Error: {e}")
await page.close()
continue
page_title = await get_page_title(page)
links = await extract_links(page)
for href, text in links:
normalized_href = normalize_url(normalized_url, href)
# Skip if already found or is excluded
if normalized_href in found_urls:
continue
if matches_any(text, EXCLUSION_KEYWORDS) or matches_any(normalized_href, EXCLUSION_KEYWORDS):
continue
text_lower = text.lower()
href_lower = normalized_href.lower()
is_profile = is_profile_url(normalized_href)
# Check for program links
if matches_any(text_lower, PROGRAM_KEYWORDS) or matches_any(href_lower, PROGRAM_KEYWORDS):
found_urls.add(normalized_href)
results.append(
ScrapedLink(
url=normalized_href,
title="",
text=text[:200],
source_url=normalized_url,
bucket="program",
is_profile_page=False,
)
)
# Check for faculty links
if matches_any(text_lower, FACULTY_KEYWORDS) or matches_any(href_lower, FACULTY_KEYWORDS):
found_urls.add(normalized_href)
results.append(
ScrapedLink(
url=normalized_href,
title="",
text=text[:200],
source_url=normalized_url,
bucket="faculty",
is_profile_page=is_profile,
)
)
# Queue for further crawling
if depth < settings.max_depth and is_same_domain(settings.root_url, normalized_href):
# Prioritize directory pages
link_priority = 0 if is_directory_url(normalized_href) else 1
queue.append((link_priority, normalized_href, depth + 1))
await page.close()
# Polite delay between requests
await asyncio.sleep(settings.request_delay)
finally:
await context.close()
await browser.close()
# Verify links if enabled
if settings.verify_links and results:
print(f"\nVerifying {len(results)} links...")
browser = await browser_launcher.launch(headless=True)
context = await browser.new_context()
verified_results = []
for i, link in enumerate(results):
if link.url in [r.url for r in verified_results]:
continue # Skip duplicates
print(f" [{i+1}/{len(results)}] Verifying: {link.url[:60]}...")
is_valid, status, title = await verify_link(context, link.url)
link.is_verified = True
link.http_status = status
link.title = title or link.text
if is_valid:
verified_results.append(link)
else:
print(f" Invalid (HTTP {status})")
await asyncio.sleep(0.5) # Delay between verifications
await context.close()
await browser.close()
results = verified_results
return results
def deduplicate_results(results: List[ScrapedLink]) -> List[ScrapedLink]:
"""Remove duplicate URLs, keeping the first occurrence."""
seen: Set[str] = set()
unique = []
for link in results:
if link.url not in seen:
seen.add(link.url)
unique.append(link)
return unique
def serialize(results: List[ScrapedLink], target: Path, root_url: str) -> None:
"""Save results to JSON file with statistics."""
results = deduplicate_results(results)
program_links = [link for link in results if link.bucket == "program"]
faculty_links = [link for link in results if link.bucket == "faculty"]
profile_pages = [link for link in faculty_links if link.is_profile_page]
payload = {
"root_url": root_url,
"generated_at": datetime.now(timezone.utc).isoformat(),
"statistics": {
"total_links": len(results),
"program_links": len(program_links),
"faculty_links": len(faculty_links),
"profile_pages": len(profile_pages),
"verified_links": len([r for r in results if r.is_verified and r.http_status == 200]),
},
"program_links": [asdict(link) for link in program_links],
"faculty_links": [asdict(link) for link in faculty_links],
"notes": EXTRA_NOTES,
"metadata_fields": METADATA_FIELDS,
}
target.parent.mkdir(parents=True, exist_ok=True)
target.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
print(f"\nResults saved to: {target}")
print(f" Total links: {len(results)}")
print(f" Program links: {len(program_links)}")
print(f" Faculty links: {len(faculty_links)}")
print(f" Profile pages: {len(profile_pages)}")
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(
description="Playwright scraper generated by the Agno agent for https://www.harvard.edu/."
)
parser.add_argument(
"--root-url",
default="https://www.harvard.edu/",
help="Seed url to start crawling from.",
)
parser.add_argument(
"--max-depth",
type=int,
default=3,
help="Maximum crawl depth.",
)
parser.add_argument(
"--max-pages",
type=int,
default=30,
help="Maximum number of pages to visit.",
)
parser.add_argument(
"--output",
type=Path,
default=Path("university-scraper_results.json"),
help="Where to save the JSON output.",
)
parser.add_argument(
"--headless",
action="store_true",
default=True,
help="Run browser in headless mode (default: True).",
)
parser.add_argument(
"--no-headless",
action="store_false",
dest="headless",
help="Run browser with visible window.",
)
parser.add_argument(
"--browser",
choices=["chromium", "firefox", "webkit"],
default="chromium",
help="Browser engine to launch via Playwright.",
)
parser.add_argument(
"--no-verify",
action="store_true",
default=False,
help="Skip link verification step.",
)
parser.add_argument(
"--delay",
type=float,
default=1.0,
help="Delay between requests in seconds (polite crawling).",
)
parser.add_argument(
"--timeout",
type=int,
default=60000,
help="Navigation timeout in milliseconds (default: 60000 = 60s).",
)
return parser.parse_args()
async def main_async() -> None:
args = parse_args()
settings = ScrapeSettings(
root_url=args.root_url,
max_depth=args.max_depth,
max_pages=args.max_pages,
headless=args.headless,
output=args.output,
verify_links=not args.no_verify,
request_delay=args.delay,
timeout=args.timeout,
)
links = await crawl(settings, browser_name=args.browser)
serialize(links, settings.output, settings.root_url)
def main() -> None:
asyncio.run(main_async())
if __name__ == "__main__":
main()

@@ -0,0 +1,466 @@
#!/usr/bin/env python3
"""
Harvard Graduate Programs Scraper
专门爬取 https://www.harvard.edu/programs/?degree_levels=graduate 页面的所有研究生项目
通过点击分页按钮遍历所有页面
"""
import asyncio
import json
import re
from datetime import datetime, timezone
from pathlib import Path
from playwright.async_api import async_playwright
async def scrape_harvard_programs():
"""爬取Harvard研究生项目列表页面 - 通过点击分页按钮"""
all_programs = []
base_url = "https://www.harvard.edu/programs/?degree_levels=graduate"
async with async_playwright() as p:
# 使用无头模式
browser = await p.chromium.launch(headless=True)
context = await browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
viewport={'width': 1920, 'height': 1080}
)
page = await context.new_page()
print(f"正在访问: {base_url}")
# 使用 domcontentloaded 而非 networkidle更快加载
await page.goto(base_url, wait_until="domcontentloaded", timeout=60000)
# 等待页面内容加载
await page.wait_for_timeout(5000)
# 滚动到页面底部以确保分页按钮加载
print("滚动到页面底部...")
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(2000)
current_page = 1
max_pages = 15
while current_page <= max_pages:
print(f"\n========== 第 {current_page} 页 ==========")
# 等待内容加载
await page.wait_for_timeout(2000)
# 提取当前页面的项目
# 从调试输出得知项目按钮的class是 'records__record___PbPhG c-programs-item__title-link'
# 需要点击按钮来获取URL因为Harvard使用JavaScript导航
# 首先获取所有项目按钮信息
page_data = await page.evaluate('''() => {
const programs = [];
// 查找所有项目行/容器
const programItems = document.querySelectorAll('[class*="records__record"], [class*="c-programs-item"]');
programItems.forEach((item, index) => {
// 获取项目名称按钮
const nameBtn = item.querySelector('button[class*="title-link"], button[class*="c-programs-item"]');
if (!nameBtn) return;
const name = nameBtn.innerText.trim();
if (!name || name.length < 3) return;
// 获取学位信息
let degrees = '';
const allText = item.innerText;
const degreeMatch = allText.match(/(A\\.B\\.|Ph\\.D\\.|M\\.A\\.|S\\.M\\.|M\\.Arch\\.|LL\\.M\\.|S\\.B\\.|A\\.L\\.B\\.|A\\.L\\.M\\.|M\\.M\\.Sc\\.|Ed\\.D\\.|Ed\\.M\\.|M\\.P\\.A\\.|M\\.P\\.P\\.|M\\.P\\.H\\.|J\\.D\\.|M\\.B\\.A\\.|M\\.D\\.|D\\.M\\.D\\.|Th\\.D\\.|M\\.Div\\.|M\\.T\\.S\\.|M\\.E\\.|D\\.M\\.Sc\\.|M\\.H\\.C\\.M\\.|M\\.L\\.A\\.|M\\.D\\.E\\.|M\\.R\\.E\\.|M\\.A\\.U\\.D\\.|M\\.R\\.P\\.L\\.)/g);
if (degreeMatch) {
degrees = degreeMatch.join(', ');
}
// 查找链接 - 检查各种可能的位置
let url = '';
// 方法1: 查找 <a> 标签
const link = item.querySelector('a[href]');
if (link && link.href) {
url = link.href;
}
// 方法2: 检查data属性
if (!url) {
const dataUrl = nameBtn.getAttribute('data-url') ||
nameBtn.getAttribute('data-href') ||
item.getAttribute('data-url');
if (dataUrl) url = dataUrl;
}
// 方法3: 检查onclick属性
if (!url) {
const onclick = nameBtn.getAttribute('onclick') || '';
const urlMatch = onclick.match(/['"]([^'"]*\\/programs\\/[^'"]*)['"]/);
if (urlMatch) url = urlMatch[1];
}
programs.push({
name: name,
degrees: degrees,
url: url,
index: index
});
});
// 如果方法1没找到项目使用备选方法
if (programs.length === 0) {
// 查找所有项目按钮
const buttons = document.querySelectorAll('button');
buttons.forEach((btn, index) => {
const className = btn.className || '';
if (className.includes('c-programs-item') || className.includes('title-link')) {
const name = btn.innerText.trim();
if (name && name.length > 3 && !name.match(/^(Page|Next|Previous|Search|Menu|Filter)/)) {
programs.push({
name: name,
degrees: '',
url: '',
index: index
});
}
}
});
}
return {
programs: programs,
totalFound: programs.length
};
}''')
# 第一页时调试输出HTML结构
if current_page == 1 and len(page_data['programs']) == 0:
print("未找到项目调试HTML结构...")
html_debug = await page.evaluate('''() => {
const debug = {
allButtons: [],
allLinks: [],
sampleHTML: ''
};
// 获取所有按钮
document.querySelectorAll('button').forEach(btn => {
const text = btn.innerText.trim().substring(0, 50);
if (text && text.length > 3) {
debug.allButtons.push({
text: text,
class: btn.className.substring(0, 80)
});
}
});
// 获取main区域的HTML片段
const main = document.querySelector('main') || document.body;
debug.sampleHTML = main.innerHTML.substring(0, 3000);
return debug;
}''')
print(f"找到 {len(html_debug['allButtons'])} 个按钮:")
for btn in html_debug['allButtons'][:20]:
print(f" - {btn['text']} | class: {btn['class']}")
print(f"\nHTML片段:\n{html_debug['sampleHTML'][:1500]}")
print(f" 本页找到 {len(page_data['programs'])} 个项目")
# 打印找到的项目
for prog in page_data['programs']:
print(f" - {prog['name']} ({prog['degrees']})")
# 添加到总列表(去重)
for prog in page_data['programs']:
name = prog['name'].strip()
if name and not any(p['name'] == name for p in all_programs):
all_programs.append({
'name': name,
'degrees': prog.get('degrees', ''),
'url': prog.get('url', ''),
'page': current_page
})
# 尝试点击下一页按钮
try:
clicked = False
# 首先打印所有分页相关元素用于调试
if current_page == 1:
# 截图保存以便调试
await page.screenshot(path="harvard_debug_pagination.png", full_page=True)
print("已保存调试截图: harvard_debug_pagination.png")
pagination_info = await page.evaluate('''() => {
const result = {
links: [],
buttons: [],
allClickable: [],
pageNumbers: [],
allText: []
};
// 查找所有链接
document.querySelectorAll('a').forEach(a => {
const text = a.innerText.trim();
if (text.match(/^[0-9]+$|Next|page|Prev/i)) {
result.links.push({
text: text.substring(0, 50),
href: a.href,
visible: a.offsetParent !== null,
className: a.className
});
}
});
// 查找所有按钮
document.querySelectorAll('button').forEach(b => {
const text = b.innerText.trim();
if (text.match(/^[0-9]+$|Next|page|Prev/i) || text.length < 20) {
result.buttons.push({
text: text.substring(0, 50),
visible: b.offsetParent !== null,
className: b.className
});
}
});
// 查找所有包含数字的可点击元素(可能是分页)
document.querySelectorAll('a, button, span[role="button"], div[role="button"], li a, nav a').forEach(el => {
const text = el.innerText.trim();
if (text.match(/^[0-9]$/) || text === 'Next page' || text.includes('Next')) {
result.pageNumbers.push({
tag: el.tagName,
text: text,
className: el.className,
id: el.id,
ariaLabel: el.getAttribute('aria-label'),
visible: el.offsetParent !== null
});
}
});
// 查找页面底部区域的所有可点击元素
const bodyRect = document.body.getBoundingClientRect();
document.querySelectorAll('*').forEach(el => {
const rect = el.getBoundingClientRect();
const text = el.innerText?.trim() || '';
// 只看页面下半部分的元素且文本短
if (rect.top > bodyRect.height * 0.5 && text.length > 0 && text.length < 30) {
const style = window.getComputedStyle(el);
if (style.cursor === 'pointer' || el.tagName === 'A' || el.tagName === 'BUTTON') {
result.allClickable.push({
tag: el.tagName,
text: text.substring(0, 30),
top: Math.round(rect.top),
className: el.className?.substring?.(0, 50) || ''
});
}
}
});
// 输出页面底部所有文本以便调试
const bodyText = document.body.innerText;
const lines = bodyText.split('\\n').filter(l => l.trim());
// 找到包含数字1-9的行
for (let i = 0; i < lines.length; i++) {
if (lines[i].match(/^[1-9]$|Next page|Previous/)) {
result.allText.push(lines[i]);
}
}
return result;
}''')
print(f"\n分页相关链接 ({len(pagination_info['links'])} 个):")
for link in pagination_info['links']:
print(f" a: '{link['text']}' class='{link.get('className', '')}' (visible: {link['visible']})")
print(f"\n分页相关按钮 ({len(pagination_info['buttons'])} 个):")
for btn in pagination_info['buttons']:
print(f" button: '{btn['text']}' class='{btn.get('className', '')}' (visible: {btn['visible']})")
print(f"\n页码元素 ({len(pagination_info['pageNumbers'])} 个):")
for pn in pagination_info['pageNumbers']:
print(f" {pn['tag']}: '{pn['text']}' aria-label='{pn.get('ariaLabel')}' visible={pn['visible']}")
print(f"\n页面下半部分可点击元素 ({len(pagination_info['allClickable'])} 个):")
for el in pagination_info['allClickable'][:30]:
print(f" {el['tag']}: '{el['text']}' (top: {el['top']})")
print(f"\n页面中的分页文本 ({len(pagination_info['allText'])} 个):")
for txt in pagination_info['allText'][:20]:
print(f" '{txt}'")
# 方法1: 直接使用CSS选择器查找 "Next page" 按钮 (最可靠)
# 从调试输出得知,分页按钮是 <button class="c-pagination__link c-pagination__link--next">
next_page_num = str(current_page + 1)
try:
next_btn = page.locator('button.c-pagination__link--next')
if await next_btn.count() > 0:
print(f"\n找到 'Next page' 按钮 (CSS选择器),尝试点击...")
await next_btn.first.scroll_into_view_if_needed()
await next_btn.first.click()
await page.wait_for_timeout(3000)
current_page += 1
clicked = True
except Exception as e:
print(f"方法1失败: {e}")
if clicked:
continue
# 方法2: 使用 get_by_role 查找按钮
try:
next_btn = page.get_by_role("button", name="Next page")
if await next_btn.count() > 0:
print(f"\n通过role找到 'Next page' 按钮,尝试点击...")
await next_btn.first.scroll_into_view_if_needed()
await next_btn.first.click()
await page.wait_for_timeout(3000)
current_page += 1
clicked = True
except Exception as e:
print(f"方法2失败: {e}")
if clicked:
continue
# 方法3: 查找所有分页按钮并点击 "Next page"
try:
pagination_buttons = await page.query_selector_all('button.c-pagination__link')
for btn in pagination_buttons:
text = await btn.inner_text()
if 'Next page' in text:
print(f"\n通过遍历分页按钮找到 'Next page',点击...")
await btn.scroll_into_view_if_needed()
await btn.click()
await page.wait_for_timeout(3000)
current_page += 1
clicked = True
break
except Exception as e:
print(f"方法3失败: {e}")
if clicked:
continue
# 方法4: 通过JavaScript直接点击分页按钮
try:
js_clicked = await page.evaluate('''() => {
// 查找 Next page 按钮
const nextBtn = document.querySelector('button.c-pagination__link--next');
if (nextBtn) {
nextBtn.click();
return true;
}
// 备选:查找所有分页按钮
const buttons = document.querySelectorAll('button.c-pagination__link');
for (const btn of buttons) {
if (btn.innerText.includes('Next page')) {
btn.click();
return true;
}
}
return false;
}''')
if js_clicked:
print(f"\n通过JavaScript点击 'Next page' 成功")
await page.wait_for_timeout(3000)
current_page += 1
clicked = True
except Exception as e:
print(f"方法4失败: {e}")
if clicked:
continue
# 方法5: 遍历所有按钮查找
try:
all_buttons = await page.query_selector_all('button')
for btn in all_buttons:
try:
text = await btn.inner_text()
if 'Next page' in text:
visible = await btn.is_visible()
if visible:
print(f"\n遍历所有按钮找到 'Next page',点击...")
await btn.scroll_into_view_if_needed()
await btn.click()
await page.wait_for_timeout(3000)
current_page += 1
clicked = True
break
except:
continue
except Exception as e:
print(f"方法5失败: {e}")
if clicked:
continue
print("没有找到下一页按钮,结束爬取")
break
except Exception as e:
print(f"点击下一页时出错: {e}")
break
# 生成项目URL - Harvard的项目URL格式为
# https://www.harvard.edu/programs/{program-name-slug}/
# 例如: african-and-african-american-studies
import re
def name_to_slug(name):
"""将项目名称转换为URL slug"""
# 转小写
slug = name.lower()
# 将特殊字符替换为空格
slug = re.sub(r'[^\w\s-]', '', slug)
# 替换空格为连字符
slug = re.sub(r'[\s_]+', '-', slug)
# 移除多余的连字符
slug = re.sub(r'-+', '-', slug)
# 移除首尾连字符
slug = slug.strip('-')
return slug
print("\n正在生成项目URL...")
for prog in all_programs:
slug = name_to_slug(prog['name'])
prog['url'] = f"https://www.harvard.edu/programs/{slug}/"
print(f" {prog['name']} -> {prog['url']}")
await browser.close()
# 排序
programs = sorted(all_programs, key=lambda x: x['name'])
# 保存
result = {
'source_url': base_url,
'scraped_at': datetime.now(timezone.utc).isoformat(),
'total_pages_scraped': current_page,
'total_programs': len(programs),
'programs': programs
}
output_file = Path('harvard_programs_results.json')
with open(output_file, 'w', encoding='utf-8') as f:
json.dump(result, f, ensure_ascii=False, indent=2)
print(f"\n{'='*60}")
print(f"爬取完成!")
print(f"共爬取 {current_page}")
print(f"共找到 {len(programs)} 个研究生项目")
print(f"结果保存到: {output_file}")
print(f"{'='*60}")
# 打印完整列表
print("\n研究生项目完整列表:")
for i, prog in enumerate(programs, 1):
print(f"{i:3}. {prog['name']} - {prog['degrees']}")
return result
if __name__ == "__main__":
asyncio.run(scrape_harvard_programs())

@@ -0,0 +1,356 @@
#!/usr/bin/env python3
"""
Harvard Graduate Programs Scraper with Faculty Information
爬取 https://www.harvard.edu/programs/?degree_levels=graduate 页面的所有研究生项目
并获取每个项目的导师个人信息页面URL
"""
import asyncio
import json
import re
from datetime import datetime, timezone
from pathlib import Path
from playwright.async_api import async_playwright
def name_to_slug(name):
"""将项目名称转换为URL slug"""
slug = name.lower()
slug = re.sub(r'[^\w\s-]', '', slug)
slug = re.sub(r'[\s_]+', '-', slug)
slug = re.sub(r'-+', '-', slug)
slug = slug.strip('-')
return slug
async def extract_faculty_from_page(page):
"""从当前页面提取所有教职员工链接"""
faculty_list = await page.evaluate('''() => {
const faculty = [];
const seen = new Set();
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href || '';
const text = a.innerText.trim();
const lowerHref = href.toLowerCase();
const lowerText = text.toLowerCase();
// Keep only links that look like individual profile pages
if ((lowerHref.includes('/people/') || lowerHref.includes('/faculty/') ||
lowerHref.includes('/profile/') || lowerHref.includes('/person/')) &&
text.length > 3 && text.length < 100 &&
!lowerText.includes('people') &&
!lowerText.includes('faculty') &&
!lowerText.includes('profile') &&
!lowerText.includes('staff') &&
!lowerHref.endsWith('/people/') &&
!lowerHref.endsWith('/people') &&
!lowerHref.endsWith('/faculty/') &&
!lowerHref.endsWith('/faculty')) {
if (!seen.has(href)) {
seen.add(href);
faculty.push({
name: text,
url: href
});
}
}
});
return faculty;
}''')
return faculty_list
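# Hedged example of the return shape: a list of dicts such as
# [{"name": "Jane Doe", "url": "https://example.harvard.edu/people/jane-doe"}]  (hypothetical entry
# whose URL contains "/people/" and therefore passes the filter above).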
async def get_faculty_from_gsas_page(page, gsas_url, program_name):
"""从GSAS项目页面获取Faculty链接然后访问院系People页面获取导师列表"""
faculty_list = []
faculty_page_url = None
try:
print(f" 访问GSAS页面: {gsas_url}")
await page.goto(gsas_url, wait_until="domcontentloaded", timeout=30000)
await page.wait_for_timeout(2000)
# Strategy 1: look for a "See list of ... faculty" link
faculty_link = await page.evaluate('''() => {
const links = document.querySelectorAll('a[href]');
for (const link of links) {
const text = link.innerText.toLowerCase();
const href = link.href;
if (text.includes('faculty') && text.includes('see list')) {
return href;
}
}
return null;
}''')
# Strategy 2: look for any link whose URL contains /people or /faculty
if not faculty_link:
faculty_link = await page.evaluate('''() => {
const links = document.querySelectorAll('a[href]');
for (const link of links) {
const text = link.innerText.toLowerCase();
const href = link.href.toLowerCase();
// Look for faculty-related links
if ((text.includes('faculty') || text.includes('people')) &&
(href.includes('/people') || href.includes('/faculty'))) {
return link.href;
}
}
return null;
}''')
# Strategy 3: locate the department website link on the page, then try its People page
if not faculty_link:
dept_website = await page.evaluate('''() => {
const links = document.querySelectorAll('a[href]');
for (const link of links) {
const text = link.innerText.toLowerCase();
const href = link.href;
// Look for the "Website" link (usually the department homepage)
if (text.includes('website') && href.includes('harvard.edu') &&
!href.includes('gsas.harvard.edu')) {
return href;
}
}
return null;
}''')
if dept_website:
print(f" 找到院系网站: {dept_website}")
try:
await page.goto(dept_website, wait_until="domcontentloaded", timeout=30000)
await page.wait_for_timeout(2000)
# Look for a People/Faculty link on the department site
faculty_link = await page.evaluate('''() => {
const links = document.querySelectorAll('a[href]');
for (const link of links) {
const text = link.innerText.toLowerCase().trim();
const href = link.href;
if ((text === 'people' || text === 'faculty' ||
text === 'faculty & research' || text.includes('our faculty')) &&
(href.includes('/people') || href.includes('/faculty'))) {
return href;
}
}
return null;
}''')
except Exception as e:
print(f" 访问院系网站失败: {e}")
if faculty_link:
faculty_page_url = faculty_link
print(f" 找到Faculty页面: {faculty_link}")
# Visit the Faculty/People page
await page.goto(faculty_link, wait_until="domcontentloaded", timeout=30000)
await page.wait_for_timeout(2000)
# Extract all supervisor entries
faculty_list = await extract_faculty_from_page(page)
# If nothing was found on the first pass, the listing may be paginated or rendered by
# JavaScript; wait briefly and extract once more.
if len(faculty_list) == 0:
await page.wait_for_timeout(2000)
faculty_list = await extract_faculty_from_page(page)
print(f" 找到 {len(faculty_list)} 位导师")
else:
print(f" 未找到Faculty页面链接")
except Exception as e:
print(f" 获取Faculty信息失败: {e}")
return faculty_list, faculty_page_url
async def scrape_harvard_programs_with_faculty():
"""爬取Harvard研究生项目列表及导师信息"""
all_programs = []
base_url = "https://www.harvard.edu/programs/?degree_levels=graduate"
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
viewport={'width': 1920, 'height': 1080}
)
page = await context.new_page()
print(f"正在访问: {base_url}")
await page.goto(base_url, wait_until="domcontentloaded", timeout=60000)
await page.wait_for_timeout(5000)
# Scroll to the bottom of the page
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(2000)
current_page = 1
max_pages = 15
# Stage 1: collect basic information for every program
print("\n========== 第一阶段:收集项目列表 ==========")
while current_page <= max_pages:
print(f"\n--- 第 {current_page} 页 ---")
await page.wait_for_timeout(2000)
# Extract the programs listed on the current page
page_data = await page.evaluate('''() => {
const programs = [];
const programItems = document.querySelectorAll('[class*="records__record"], [class*="c-programs-item"]');
programItems.forEach((item, index) => {
const nameBtn = item.querySelector('button[class*="title-link"], button[class*="c-programs-item"]');
if (!nameBtn) return;
const name = nameBtn.innerText.trim();
if (!name || name.length < 3) return;
let degrees = '';
const allText = item.innerText;
const degreeMatch = allText.match(/(A\\.B\\.|Ph\\.D\\.|M\\.A\\.|S\\.M\\.|M\\.Arch\\.|LL\\.M\\.|S\\.B\\.|A\\.L\\.B\\.|A\\.L\\.M\\.|M\\.M\\.Sc\\.|Ed\\.D\\.|Ed\\.M\\.|M\\.P\\.A\\.|M\\.P\\.P\\.|M\\.P\\.H\\.|J\\.D\\.|M\\.B\\.A\\.|M\\.D\\.|D\\.M\\.D\\.|Th\\.D\\.|M\\.Div\\.|M\\.T\\.S\\.|M\\.E\\.|D\\.M\\.Sc\\.|M\\.H\\.C\\.M\\.|M\\.L\\.A\\.|M\\.D\\.E\\.|M\\.R\\.E\\.|M\\.A\\.U\\.D\\.|M\\.R\\.P\\.L\\.)/g);
if (degreeMatch) {
degrees = degreeMatch.join(', ');
}
programs.push({
name: name,
degrees: degrees
});
});
if (programs.length === 0) {
const buttons = document.querySelectorAll('button');
buttons.forEach((btn) => {
const className = btn.className || '';
if (className.includes('c-programs-item') || className.includes('title-link')) {
const name = btn.innerText.trim();
if (name && name.length > 3 && !name.match(/^(Page|Next|Previous|Search|Menu|Filter)/)) {
programs.push({
name: name,
degrees: ''
});
}
}
});
}
return programs;
}''')
print(f" 本页找到 {len(page_data)} 个项目")
for prog in page_data:
name = prog['name'].strip()
if name and not any(p['name'] == name for p in all_programs):
all_programs.append({
'name': name,
'degrees': prog.get('degrees', ''),
'page': current_page
})
# Try to advance to the next page
try:
next_btn = page.locator('button.c-pagination__link--next')
if await next_btn.count() > 0:
await next_btn.first.scroll_into_view_if_needed()
await next_btn.first.click()
await page.wait_for_timeout(3000)
current_page += 1
else:
print("没有下一页按钮,结束收集")
break
except Exception as e:
print(f"分页失败: {e}")
break
print(f"\n共收集到 {len(all_programs)} 个项目")
# Stage 2: fetch faculty information for each program
print("\n========== 第二阶段:获取导师信息 ==========")
print("注意这将访问每个项目的GSAS页面可能需要较长时间...")
for i, prog in enumerate(all_programs, 1):
print(f"\n[{i}/{len(all_programs)}] {prog['name']}")
# Build the program URL
slug = name_to_slug(prog['name'])
prog['url'] = f"https://www.harvard.edu/programs/{slug}/"
# Build the corresponding GSAS program URL
gsas_url = f"https://gsas.harvard.edu/program/{slug}"
# Fetch the faculty list for this program
faculty_list, faculty_page_url = await get_faculty_from_gsas_page(page, gsas_url, prog['name'])
prog['faculty_page_url'] = faculty_page_url or ""
prog['faculty'] = faculty_list
prog['faculty_count'] = len(faculty_list)
# Save progress every 10 programs
if i % 10 == 0:
temp_result = {
'source_url': base_url,
'scraped_at': datetime.now(timezone.utc).isoformat(),
'progress': f"{i}/{len(all_programs)}",
'programs': all_programs[:i]
}
with open('harvard_programs_progress.json', 'w', encoding='utf-8') as f:
json.dump(temp_result, f, ensure_ascii=False, indent=2)
print(f" [进度已保存]")
# Throttle requests to avoid hammering the site
await page.wait_for_timeout(1500)
await browser.close()
# Sort programs alphabetically
programs = sorted(all_programs, key=lambda x: x['name'])
# Aggregate statistics
total_faculty = sum(p['faculty_count'] for p in programs)
programs_with_faculty = sum(1 for p in programs if p['faculty_count'] > 0)
# Save the final result
result = {
'source_url': base_url,
'scraped_at': datetime.now(timezone.utc).isoformat(),
'total_pages_scraped': current_page,
'total_programs': len(programs),
'programs_with_faculty': programs_with_faculty,
'total_faculty_found': total_faculty,
'programs': programs
}
output_file = Path('harvard_programs_with_faculty.json')
with open(output_file, 'w', encoding='utf-8') as f:
json.dump(result, f, ensure_ascii=False, indent=2)
print(f"\n{'='*60}")
print(f"爬取完成!")
print(f"共爬取 {current_page}")
print(f"共找到 {len(programs)} 个研究生项目")
print(f"其中 {programs_with_faculty} 个项目有导师信息")
print(f"共找到 {total_faculty} 位导师")
print(f"结果保存到: {output_file}")
print(f"{'='*60}")
# Print a summary (first 30 programs)
print("\n项目摘要 (前30个):")
for i, prog in enumerate(programs[:30], 1):
faculty_info = f"({prog['faculty_count']}位导师)" if prog['faculty_count'] > 0 else "(无导师信息)"
print(f"{i:3}. {prog['name']} {faculty_info}")
if len(programs) > 30:
print(f"... 还有 {len(programs) - 30} 个项目")
return result
if __name__ == "__main__":
asyncio.run(scrape_harvard_programs_with_faculty())


@@ -0,0 +1,435 @@
#!/usr/bin/env python
"""
Auto-generated by the Agno codegen agent.
Target university: KAUST (https://www.kaust.edu.sa/en/)
Requested caps: depth=3, pages=30
Plan description: Playwright scraper for university master programs and faculty profiles.
Navigation strategy: Start at https://www.kaust.edu.sa/en/ Navigate to /study/ to find degree program links Follow links to individual degree pages under /degree-programs/ Separately, look for links to /faculty/ or /people/ directories Crawl faculty directories to extract links to individual bio pages Individual faculty are often under a subdomain like bio.kaust.edu.sa
Verification checklist:
- Verify master's programs are under /study/ or /degree-programs/
- Check that faculty directory pages contain links to individual bios
- Confirm individual faculty pages have research/expertise details
- Ensure exclusion keywords successfully skip irrelevant pages
Playwright snapshot used to guide this plan:
No browser snapshot was captured.
Generated at: 2025-12-10T02:48:42.571899+00:00
"""
from __future__ import annotations
import argparse
import asyncio
import json
import time
from collections import deque
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Deque, Iterable, List, Set, Tuple
from urllib.parse import urljoin, urldefrag, urlparse
from playwright.async_api import async_playwright, Page, Response
PROGRAM_KEYWORDS = ['/study/', '/degree-programs/', '/academics/', 'M.Sc.', 'Master of Science', 'graduate program']
FACULTY_KEYWORDS = ['/people/', '/profiles/faculty/', 'Professor', 'faculty-member', '/faculty/firstname-lastname', 'bio.kaust.edu.sa']
EXCLUSION_KEYWORDS = ['/admissions/', '/apply/', '/tuition/', '/events/', '/news/', '/careers/', '/jobs/', '/login/', '/alumni/', '/giving/', 'inquiry.kaust.edu.sa']
METADATA_FIELDS = ['url', 'title', 'entity_type', 'department', 'email', 'scraped_at']
EXTRA_NOTES = ['Many faculty are listed under a separate subdomain bio.kaust.edu.sa', 'Prioritize crawling the centralized faculty directory first', 'Alumni and affiliated faculty may not have full profile pages']
# URL patterns that indicate individual profile pages
PROFILE_URL_PATTERNS = [
"/people/", "/person/", "/profile/", "/profiles/",
"/faculty/", "/staff/", "/directory/",
"/~", # Unix-style personal pages
"/bio/", "/about/",
]
# URL patterns that indicate listing/directory pages (should be crawled deeper)
DIRECTORY_URL_PATTERNS = [
"/faculty", "/people", "/directory", "/staff",
"/team", "/members", "/researchers",
]
def normalize_url(base: str, href: str) -> str:
"""Normalize URL by resolving relative paths and removing fragments."""
absolute = urljoin(base, href)
cleaned, _ = urldefrag(absolute)
# Remove trailing slash for consistency
return cleaned.rstrip("/")
def matches_any(text: str, keywords: Iterable[str]) -> bool:
"""Check if text contains any of the keywords (case-insensitive)."""
lowered = text.lower()
return any(keyword.lower() in lowered for keyword in keywords)
def is_same_domain(url1: str, url2: str) -> bool:
"""Check if two URLs belong to the same root domain."""
domain1 = urlparse(url1).netloc.replace("www.", "")
domain2 = urlparse(url2).netloc.replace("www.", "")
# Allow subdomains of the same root domain
parts1 = domain1.split(".")
parts2 = domain2.split(".")
if len(parts1) >= 2 and len(parts2) >= 2:
return parts1[-2:] == parts2[-2:]
return domain1 == domain2
def is_profile_url(url: str) -> bool:
"""Check if URL pattern suggests an individual profile page."""
url_lower = url.lower()
return any(pattern in url_lower for pattern in PROFILE_URL_PATTERNS)
def is_directory_url(url: str) -> bool:
"""Check if URL pattern suggests a directory/listing page."""
url_lower = url.lower()
return any(pattern in url_lower for pattern in DIRECTORY_URL_PATTERNS)
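# Hedged, commented examples of the helpers above (URLs are hypothetical illustrations):
#   is_same_domain("https://www.kaust.edu.sa/en/", "https://bio.kaust.edu.sa/jane-doe")  -> True
#     (only the last two domain labels are compared, so any *.edu.sa host would also pass)
#   is_profile_url("https://bio.kaust.edu.sa/people/jane-doe")                           -> True  ("/people/")
#   is_directory_url("https://www.kaust.edu.sa/en/faculty")                              -> True  ("/faculty")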
@dataclass
class ScrapedLink:
url: str
title: str
text: str
source_url: str
bucket: str # "program" or "faculty"
is_verified: bool = False
http_status: int = 0
is_profile_page: bool = False
scraped_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
@dataclass
class ScrapeSettings:
root_url: str
max_depth: int
max_pages: int
headless: bool
output: Path
verify_links: bool = True
request_delay: float = 1.0 # Polite crawling delay
timeout: int = 60000 # Navigation timeout in ms (default 60s for slow sites)
async def extract_links(page: Page) -> List[Tuple[str, str]]:
"""Extract all anchor links from the page."""
anchors: Iterable[dict] = await page.eval_on_selector_all(
"a",
"""elements => elements
.map(el => ({text: (el.textContent || '').trim(), href: el.href}))
.filter(item => item.text && item.href && item.href.startsWith('http'))""",
)
return [(item["href"], item["text"]) for item in anchors]
async def get_page_title(page: Page) -> str:
"""Get the page title safely."""
try:
return await page.title() or ""
except Exception:
return ""
async def verify_link(context, url: str, timeout: int = 10000) -> Tuple[bool, int, str]:
"""
Verify a link by making a HEAD-like request.
Returns: (is_valid, status_code, page_title)
"""
page = await context.new_page()
try:
response: Response = await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
if response:
status = response.status
title = await get_page_title(page)
is_valid = 200 <= status < 400
return is_valid, status, title
return False, 0, ""
except Exception:
return False, 0, ""
finally:
await page.close()
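# Hedged usage sketch (not executed at module level): inside crawl(), a candidate link can be
# re-checked before it is kept, e.g.
#   ok, status, title = await verify_link(context, "https://www.kaust.edu.sa/en/study/")
# ok is True only for responses with 200 <= status < 400; on navigation errors the result is (False, 0, "").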
async def crawl(settings: ScrapeSettings, browser_name: str) -> List[ScrapedLink]:
"""
Crawl the website using BFS, collecting program and faculty links.
Features:
- URL deduplication
- Link verification
- Profile page detection
- Polite crawling with delays
"""
async with async_playwright() as p:
browser_launcher = getattr(p, browser_name)
browser = await browser_launcher.launch(headless=settings.headless)
context = await browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
# Priority queue: (priority, url, depth) - lower priority = processed first
# Directory pages get priority 0, others get priority 1
queue: Deque[Tuple[int, str, int]] = deque([(0, settings.root_url, 0)])
visited: Set[str] = set()
found_urls: Set[str] = set() # For deduplication of results
results: List[ScrapedLink] = []
print(f"Starting crawl from: {settings.root_url}")
print(f"Max depth: {settings.max_depth}, Max pages: {settings.max_pages}")
try:
while queue and len(visited) < settings.max_pages:
# Sort queue by priority (directory pages first)
queue = deque(sorted(queue, key=lambda x: x[0]))
priority, url, depth = queue.popleft()
normalized_url = normalize_url(settings.root_url, url)
if normalized_url in visited or depth > settings.max_depth:
continue
# Only crawl same-domain URLs
if not is_same_domain(settings.root_url, normalized_url):
continue
visited.add(normalized_url)
print(f"[{len(visited)}/{settings.max_pages}] Depth {depth}: {normalized_url[:80]}...")
page = await context.new_page()
try:
response = await page.goto(
normalized_url, wait_until="load", timeout=settings.timeout
)
if not response or response.status >= 400:
await page.close()
continue
except Exception as e:
print(f" Error: {e}")
await page.close()
continue
page_title = await get_page_title(page)
links = await extract_links(page)
for href, text in links:
normalized_href = normalize_url(normalized_url, href)
# Skip if already found or is excluded
if normalized_href in found_urls:
continue
if matches_any(text, EXCLUSION_KEYWORDS) or matches_any(normalized_href, EXCLUSION_KEYWORDS):
continue
text_lower = text.lower()
href_lower = normalized_href.lower()
is_profile = is_profile_url(normalized_href)
# Check for program links
if matches_any(text_lower, PROGRAM_KEYWORDS) or matches_any(href_lower, PROGRAM_KEYWORDS):
found_urls.add(normalized_href)
results.append(
ScrapedLink(
url=normalized_href,
title="",
text=text[:200],
source_url=normalized_url,
bucket="program",
is_profile_page=False,
)
)
# Check for faculty links
if matches_any(text_lower, FACULTY_KEYWORDS) or matches_any(href_lower, FACULTY_KEYWORDS):
found_urls.add(normalized_href)
results.append(
ScrapedLink(
url=normalized_href,
title="",
text=text[:200],
source_url=normalized_url,
bucket="faculty",
is_profile_page=is_profile,
)
)
# Queue for further crawling
if depth < settings.max_depth and is_same_domain(settings.root_url, normalized_href):
# Prioritize directory pages
link_priority = 0 if is_directory_url(normalized_href) else 1
queue.append((link_priority, normalized_href, depth + 1))
await page.close()
# Polite delay between requests
await asyncio.sleep(settings.request_delay)
finally:
await context.close()
await browser.close()
# Verify links if enabled
if settings.verify_links and results:
print(f"\nVerifying {len(results)} links...")
browser = await browser_launcher.launch(headless=True)
context = await browser.new_context()
verified_results = []
for i, link in enumerate(results):
if link.url in [r.url for r in verified_results]:
continue # Skip duplicates
print(f" [{i+1}/{len(results)}] Verifying: {link.url[:60]}...")
is_valid, status, title = await verify_link(context, link.url)
link.is_verified = True
link.http_status = status
link.title = title or link.text
if is_valid:
verified_results.append(link)
else:
print(f" Invalid (HTTP {status})")
await asyncio.sleep(0.5) # Delay between verifications
await context.close()
await browser.close()
results = verified_results
return results
def deduplicate_results(results: List[ScrapedLink]) -> List[ScrapedLink]:
"""Remove duplicate URLs, keeping the first occurrence."""
seen: Set[str] = set()
unique = []
for link in results:
if link.url not in seen:
seen.add(link.url)
unique.append(link)
return unique
def serialize(results: List[ScrapedLink], target: Path, root_url: str) -> None:
"""Save results to JSON file with statistics."""
results = deduplicate_results(results)
program_links = [link for link in results if link.bucket == "program"]
faculty_links = [link for link in results if link.bucket == "faculty"]
profile_pages = [link for link in faculty_links if link.is_profile_page]
payload = {
"root_url": root_url,
"generated_at": datetime.now(timezone.utc).isoformat(),
"statistics": {
"total_links": len(results),
"program_links": len(program_links),
"faculty_links": len(faculty_links),
"profile_pages": len(profile_pages),
"verified_links": len([r for r in results if r.is_verified and r.http_status == 200]),
},
"program_links": [asdict(link) for link in program_links],
"faculty_links": [asdict(link) for link in faculty_links],
"notes": EXTRA_NOTES,
"metadata_fields": METADATA_FIELDS,
}
target.parent.mkdir(parents=True, exist_ok=True)
target.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
print(f"\nResults saved to: {target}")
print(f" Total links: {len(results)}")
print(f" Program links: {len(program_links)}")
print(f" Faculty links: {len(faculty_links)}")
print(f" Profile pages: {len(profile_pages)}")
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(
description="Playwright scraper generated by the Agno agent for https://www.kaust.edu.sa/en/."
)
parser.add_argument(
"--root-url",
default="https://www.kaust.edu.sa/en/",
help="Seed url to start crawling from.",
)
parser.add_argument(
"--max-depth",
type=int,
default=3,
help="Maximum crawl depth.",
)
parser.add_argument(
"--max-pages",
type=int,
default=30,
help="Maximum number of pages to visit.",
)
parser.add_argument(
"--output",
type=Path,
default=Path("university-scraper_results.json"),
help="Where to save the JSON output.",
)
parser.add_argument(
"--headless",
action="store_true",
default=True,
help="Run browser in headless mode (default: True).",
)
parser.add_argument(
"--no-headless",
action="store_false",
dest="headless",
help="Run browser with visible window.",
)
parser.add_argument(
"--browser",
choices=["chromium", "firefox", "webkit"],
default="firefox",
help="Browser engine to launch via Playwright (firefox recommended for KAUST).",
)
parser.add_argument(
"--no-verify",
action="store_true",
default=False,
help="Skip link verification step.",
)
parser.add_argument(
"--delay",
type=float,
default=1.0,
help="Delay between requests in seconds (polite crawling).",
)
parser.add_argument(
"--timeout",
type=int,
default=60000,
help="Navigation timeout in milliseconds (default: 60000 = 60s).",
)
return parser.parse_args()
async def main_async() -> None:
args = parse_args()
settings = ScrapeSettings(
root_url=args.root_url,
max_depth=args.max_depth,
max_pages=args.max_pages,
headless=args.headless,
output=args.output,
verify_links=not args.no_verify,
request_delay=args.delay,
timeout=args.timeout,
)
links = await crawl(settings, browser_name=args.browser)
serialize(links, settings.output, settings.root_url)
def main() -> None:
asyncio.run(main_async())
if __name__ == "__main__":
main()


@@ -0,0 +1,910 @@
"""
曼彻斯特大学完整采集脚本
新增特性:
- Research Explorer API 优先拉取 JSON / XML失败再回落 DOM
- 每个学院独立页面、并行抓取(默认 3 并发)
- 细粒度超时/重试/滚动/Load more 控制
- 多 URL / 备用 Staff 页面配置
- 导师目录缓存,可按学院关键词映射到项目
- 诊断信息记录(失败学院、超时学院、批次信息)
"""
import asyncio
import json
import re
from copy import deepcopy
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple
from urllib.parse import urlencode, urljoin
from xml.etree import ElementTree as ET
from playwright.async_api import (
TimeoutError as PlaywrightTimeoutError,
async_playwright,
)
# =========================
# Configuration
# =========================
DEFAULT_REQUEST = {
"timeout_ms": 60000,
"post_wait_ms": 2500,
"wait_until": "domcontentloaded",
"max_retries": 3,
"retry_backoff_ms": 2000,
}
STAFF_CONCURRENCY = 3
SCHOOL_CONFIG: List[Dict[str, Any]] = [
{
"name": "Alliance Manchester Business School",
"keywords": [
"accounting",
"finance",
"business",
"management",
"marketing",
"mba",
"economics",
"entrepreneurship",
],
"attach_faculty_to_programs": True,
"staff_pages": [
{
"url": "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/",
"extract_method": "table",
"request": {"timeout_ms": 60000, "wait_until": "networkidle"},
}
],
},
{
"name": "Department of Computer Science",
"keywords": [
"computer",
"software",
"data science",
"artificial intelligence",
"ai ",
"machine learning",
"cyber",
"computing",
],
"attach_faculty_to_programs": True,
"staff_pages": [
{
"url": "https://www.cs.manchester.ac.uk/about/people/academic-and-research-staff/",
"extract_method": "links",
"requires_scroll": True,
},
{
"url": "https://www.cs.manchester.ac.uk/about/people/",
"extract_method": "links",
"load_more_selector": "button.load-more",
"max_load_more": 6,
},
],
},
{
"name": "Department of Physics and Astronomy",
"keywords": [
"physics",
"astronomy",
"astrophysics",
"nuclear",
"particle",
],
"attach_faculty_to_programs": True,
"staff_pages": [
{
"url": "https://www.physics.manchester.ac.uk/about/people/academic-and-research-staff/",
"extract_method": "links",
"requires_scroll": True,
}
],
},
{
"name": "Department of Electrical and Electronic Engineering",
"keywords": [
"electrical",
"electronic",
"eee",
"power systems",
"microelectronics",
],
"attach_faculty_to_programs": True,
"staff_pages": [
{
"url": "https://www.eee.manchester.ac.uk/about/people/academic-and-research-staff/",
"extract_method": "links",
"requires_scroll": True,
}
],
},
{
"name": "Department of Chemistry",
"keywords": ["chemistry", "chemical"],
"attach_faculty_to_programs": True,
"extract_method": "research_explorer",
"research_explorer": {"page_size": 200},
"staff_pages": [
{
"url": "https://research.manchester.ac.uk/en/organisations/department-of-chemistry/persons/",
"extract_method": "research_explorer",
"requires_scroll": True,
"request": {
"timeout_ms": 120000,
"wait_until": "networkidle",
"post_wait_ms": 5000,
},
}
],
},
{
"name": "Department of Mathematics",
"keywords": [
"mathematics",
"mathematical",
"applied math",
"statistics",
"actuarial",
],
"attach_faculty_to_programs": True,
"extract_method": "research_explorer",
"research_explorer": {"page_size": 200},
"staff_pages": [
{
"url": "https://research.manchester.ac.uk/en/organisations/department-of-mathematics/persons/",
"extract_method": "research_explorer",
"requires_scroll": True,
}
],
},
{
"name": "School of Engineering",
"keywords": [
"engineering",
"mechanical",
"aerospace",
"civil",
"structural",
"materials",
],
"attach_faculty_to_programs": True,
"extract_method": "research_explorer",
"research_explorer": {"page_size": 400},
"staff_pages": [
{
"url": "https://research.manchester.ac.uk/en/organisations/school-of-engineering/persons/",
"extract_method": "research_explorer",
"requires_scroll": True,
}
],
},
{
"name": "Faculty of Biology, Medicine and Health",
"keywords": [
"medicine",
"medical",
"health",
"nursing",
"pharmacy",
"clinical",
"dental",
"optometry",
"biology",
"biomedical",
"anatomical",
"physiotherapy",
"midwifery",
"mental health",
"psychology",
],
"attach_faculty_to_programs": True,
"extract_method": "research_explorer",
"research_explorer": {"page_size": 400},
"staff_pages": [
{
"url": "https://research.manchester.ac.uk/en/organisations/faculty-of-biology-medicine-and-health/persons/",
"extract_method": "research_explorer",
"requires_scroll": True,
}
],
},
{
"name": "School of Social Sciences",
"keywords": [
"sociology",
"politics",
"international",
"social",
"criminology",
"anthropology",
"philosophy",
],
"attach_faculty_to_programs": True,
"extract_method": "research_explorer",
"research_explorer": {"page_size": 200},
"staff_pages": [
{
"url": "https://research.manchester.ac.uk/en/organisations/school-of-social-sciences/persons/",
"extract_method": "research_explorer",
"requires_scroll": True,
}
],
},
{
"name": "School of Law",
"keywords": ["law", "legal", "llm"],
"attach_faculty_to_programs": True,
"extract_method": "research_explorer",
"research_explorer": {"page_size": 200},
"staff_pages": [
{
"url": "https://research.manchester.ac.uk/en/organisations/school-of-law/persons/",
"extract_method": "research_explorer",
"requires_scroll": True,
}
],
},
{
"name": "School of Arts, Languages and Cultures",
"keywords": [
"arts",
"languages",
"culture",
"music",
"drama",
"theatre",
"history",
"linguistics",
"literature",
"translation",
"classics",
"archaeology",
"religion",
],
"attach_faculty_to_programs": True,
"extract_method": "research_explorer",
"research_explorer": {"page_size": 400},
"staff_pages": [
{
"url": "https://research.manchester.ac.uk/en/organisations/school-of-arts-languages-and-cultures/persons/",
"extract_method": "research_explorer",
"requires_scroll": True,
}
],
},
{
"name": "School of Environment, Education and Development",
"keywords": [
"environment",
"education",
"development",
"planning",
"architecture",
"urban",
"geography",
"sustainability",
],
"attach_faculty_to_programs": True,
"extract_method": "research_explorer",
"research_explorer": {"page_size": 300},
"staff_pages": [
{
"url": "https://research.manchester.ac.uk/en/organisations/school-of-environment-education-and-development/persons/",
"extract_method": "research_explorer",
"requires_scroll": True,
}
],
},
]
SCHOOL_LOOKUP = {cfg["name"]: cfg for cfg in SCHOOL_CONFIG}
# =========================
# JS extraction helpers
# =========================
JS_EXTRACT_TABLE_STAFF = """() => {
const staff = [];
const seen = new Set();
document.querySelectorAll('table tr').forEach(row => {
const cells = row.querySelectorAll('td');
if (cells.length >= 2) {
const link = cells[1]?.querySelector('a[href]') || cells[0]?.querySelector('a[href]');
const titleCell = cells[2] || cells[1];
if (link) {
const name = link.innerText.trim();
const url = link.href;
const title = titleCell ? titleCell.innerText.trim() : '';
if (name.length > 2 && !name.toLowerCase().includes('skip') && !seen.has(url)) {
seen.add(url);
staff.push({
name,
url,
title
});
}
}
}
});
return staff;
}"""
JS_EXTRACT_LINK_STAFF = """() => {
const staff = [];
const seen = new Set();
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href;
const text = a.innerText.trim();
if (seen.has(href)) return;
if (text.length < 5 || text.length > 80) return;
const lowerText = text.toLowerCase();
if (lowerText.includes('skip') ||
lowerText.includes('staff') ||
lowerText.includes('people') ||
lowerText.includes('academic') ||
lowerText.includes('research profiles')) return;
if (href.includes('/persons/') ||
href.includes('/portal/en/researchers/') ||
href.includes('/profile/') ||
href.includes('/people/')) {
seen.add(href);
staff.push({
name: text,
url: href,
title: ''
});
}
});
return staff;
}"""
JS_EXTRACT_RESEARCH_EXPLORER = """() => {
const staff = [];
const seen = new Set();
document.querySelectorAll('a.link.person').forEach(a => {
const href = a.href;
const text = a.innerText.trim();
if (!seen.has(href) && text.length > 3 && text.length < 80) {
seen.add(href);
staff.push({
name: text,
url: href,
title: ''
});
}
});
if (staff.length === 0) {
document.querySelectorAll('a[href*="/persons/"]').forEach(a => {
const href = a.href;
const text = a.innerText.trim();
const lower = text.toLowerCase();
if (seen.has(href)) return;
if (text.length < 3 || text.length > 80) return;
if (lower.includes('person') || lower.includes('next') || lower.includes('previous')) return;
seen.add(href);
staff.push({
name: text,
url: href,
title: ''
});
});
}
return staff;
}"""
JS_EXTRACT_PROGRAMS = """() => {
const programs = [];
const seen = new Set();
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href;
const text = a.innerText.trim().replace(/\\s+/g, ' ');
if (!href || seen.has(href)) return;
if (text.length < 10 || text.length > 200) return;
const hrefLower = href.toLowerCase();
const textLower = text.toLowerCase();
const isNav = textLower === 'courses' ||
textLower === 'masters' ||
textLower.includes('admission') ||
textLower.includes('fees') ||
textLower.includes('skip to') ||
textLower.includes('search') ||
textLower.includes('contact') ||
hrefLower.includes('#');
if (isNav) return;
const hasNumericId = /\\/\\d{5}\\//.test(href);
const isCoursePage = hrefLower.includes('/courses/list/') && hasNumericId;
if (isCoursePage) {
seen.add(href);
programs.push({
name: text,
url: href
});
}
});
return programs;
}"""
# =========================
# Data matching
# =========================
def match_program_to_school(program_name: str) -> str:
lower = program_name.lower()
for school in SCHOOL_CONFIG:
for keyword in school["keywords"]:
if keyword in lower:
return school["name"]
return "Other Programs"
# =========================
# Request and parsing utilities
# =========================
def _merge_request_settings(*layers: Optional[Dict[str, Any]]) -> Dict[str, Any]:
settings = dict(DEFAULT_REQUEST)
for layer in layers:
if not layer:
continue
for key, value in layer.items():
if value is not None:
settings[key] = value
settings["max_retries"] = max(1, int(settings.get("max_retries", 1)))
settings["retry_backoff_ms"] = settings.get("retry_backoff_ms", 2000)
return settings
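# Hedged example of the layered override above (values taken from DEFAULT_REQUEST and the
# Chemistry staff-page config in SCHOOL_CONFIG):
#   _merge_request_settings(None, {"timeout_ms": 120000, "wait_until": "networkidle", "post_wait_ms": 5000})
#   -> keeps max_retries=3 and retry_backoff_ms=2000 from DEFAULT_REQUEST, while the page-level
#      timeout_ms / wait_until / post_wait_ms override the defaults.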
async def _goto_with_retry(page, url: str, settings: Dict[str, Any], label: str) -> Tuple[bool, Optional[str]]:
last_error = None
for attempt in range(settings["max_retries"]):
try:
await page.goto(url, wait_until=settings["wait_until"], timeout=settings["timeout_ms"])
if settings.get("wait_for_selector"):
await page.wait_for_selector(settings["wait_for_selector"], timeout=settings["timeout_ms"])
if settings.get("post_wait_ms"):
await page.wait_for_timeout(settings["post_wait_ms"])
return True, None
except PlaywrightTimeoutError as exc:
last_error = f"Timeout: {exc}"
except Exception as exc: # noqa: BLE001
last_error = str(exc)
if attempt < settings["max_retries"] - 1:
await page.wait_for_timeout(settings["retry_backoff_ms"] * (attempt + 1))
return False, last_error
async def _perform_scroll(page, repetitions: int = 5, delay_ms: int = 800):
repetitions = max(1, repetitions)
for i in range(repetitions):
await page.evaluate("(y) => window.scrollTo(0, y)", 2000 * (i + 1))
await page.wait_for_timeout(delay_ms)
async def _load_more(page, selector: str, max_clicks: int = 5, wait_ms: int = 1500):
for _ in range(max_clicks):
button = await page.query_selector(selector)
if not button:
break
try:
await button.click()
await page.wait_for_timeout(wait_ms)
except Exception:
break
def _deduplicate_staff(staff: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
seen = set()
cleaned = []
for item in staff:
name = (item.get("name") or "").strip()
if not name:
continue
url = (item.get("url") or "").strip()
key = url or name.lower()
if key in seen:
continue
seen.add(key)
cleaned.append({"name": name, "url": url, "title": (item.get("title") or "").strip()})
return cleaned
def _append_query(url: str, params: Dict[str, Any]) -> str:
delimiter = "&" if "?" in url else "?"
return f"{url}{delimiter}{urlencode(params)}"
def _guess_research_slug(staff_url: Optional[str]) -> Optional[str]:
if not staff_url:
return None
path = staff_url.rstrip("/").split("/")
# Research Explorer staff URLs end in ".../organisations/<org-slug>/persons/", so the
# organisation slug is the segment before the trailing "persons", not "persons" itself.
if len(path) >= 2 and path[-1] == "persons":
return path[-2]
return path[-1] if path else None
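# Hedged example (URL taken from the Chemistry entry in SCHOOL_CONFIG):
#   _guess_research_slug("https://research.manchester.ac.uk/en/organisations/department-of-chemistry/persons/")
#   -> "department-of-chemistry"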
def _parse_research_explorer_json(data: Any, base_url: str) -> List[Dict[str, str]]:
items: List[Dict[str, Any]] = []
if isinstance(data, list):
items = data
elif isinstance(data, dict):
for key in ("results", "items", "persons", "data", "entities"):
if isinstance(data.get(key), list):
items = data[key]
break
if not items and isinstance(data.get("rows"), list):
items = data["rows"]
staff = []
for item in items:
if not isinstance(item, dict):
continue
name = item.get("name") or item.get("title") or item.get("fullName")
profile_url = item.get("url") or item.get("href") or item.get("link") or item.get("primaryURL")
if not name:
continue
if profile_url:
profile_url = urljoin(base_url, profile_url)
staff.append(
{
"name": name.strip(),
"url": (profile_url or "").strip(),
"title": (item.get("jobTitle") or item.get("position") or "").strip(),
}
)
return staff
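# Hedged example of a payload shape the parser above accepts (the field names are the ones it
# probes for; the concrete values are hypothetical):
#   _parse_research_explorer_json(
#       {"results": [{"name": "Jane Doe", "url": "/en/persons/jane-doe", "jobTitle": "Professor"}]},
#       "https://research.manchester.ac.uk/en/organisations/department-of-chemistry/persons/",
#   )
#   -> [{"name": "Jane Doe",
#        "url": "https://research.manchester.ac.uk/en/persons/jane-doe",
#        "title": "Professor"}]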
def _parse_research_explorer_xml(text: str, base_url: str) -> List[Dict[str, str]]:
staff: List[Dict[str, str]] = []
try:
root = ET.fromstring(text)
except ET.ParseError:
return staff
for entry in root.findall(".//{http://www.w3.org/2005/Atom}entry"):
title = entry.findtext("{http://www.w3.org/2005/Atom}title", default="")
link = entry.find("{http://www.w3.org/2005/Atom}link")
href = link.attrib.get("href") if link is not None else ""
if title:
staff.append(
{
"name": title.strip(),
"url": urljoin(base_url, href) if href else "",
"title": "",
}
)
return staff
async def fetch_research_explorer_api(context, school_config: Dict[str, Any], output_callback) -> List[Dict[str, str]]:
config = school_config.get("research_explorer") or {}
if not config and school_config.get("extract_method") != "research_explorer":
return []
base_staff_url = ""
if school_config.get("staff_pages"):
base_staff_url = school_config["staff_pages"][0].get("url", "")
page_size = config.get("page_size", 200)
timeout_ms = config.get("timeout_ms", 70000)
candidates: List[str] = []
slug = config.get("org_slug") or _guess_research_slug(base_staff_url)
base_api = config.get("api_base", "https://research.manchester.ac.uk/ws/portalapi.aspx")
if config.get("api_url"):
candidates.append(config["api_url"])
if slug:
params = {
"action": "search",
"language": "en",
"format": "json",
"site": "default",
"showall": "true",
"pageSize": page_size,
"organisations": slug,
}
candidates.append(f"{base_api}?{urlencode(params)}")
if base_staff_url:
candidates.append(_append_query(base_staff_url, {"format": "json", "limit": page_size}))
candidates.append(_append_query(base_staff_url, {"format": "xml", "limit": page_size}))
for url in candidates:
try:
resp = await context.request.get(url, timeout=timeout_ms)
if resp.status != 200:
continue
ctype = resp.headers.get("content-type", "")
if "json" in ctype:
data = await resp.json()
parsed = _parse_research_explorer_json(data, base_staff_url)
else:
text = await resp.text()
parsed = _parse_research_explorer_xml(text, base_staff_url)
parsed = _deduplicate_staff(parsed)
if parsed:
if output_callback:
output_callback("info", f" {school_config['name']}: {len(parsed)} staff via API")
return parsed
except Exception as exc: # noqa: BLE001
if output_callback:
output_callback(
"warning", f" {school_config['name']}: API fetch failed ({str(exc)[:60]})"
)
return []
async def scrape_staff_via_browser(context, school_config: Dict[str, Any], output_callback) -> List[Dict[str, str]]:
staff_collected: List[Dict[str, str]] = []
staff_pages = school_config.get("staff_pages") or []
if not staff_pages and school_config.get("staff_url"):
staff_pages = [{"url": school_config["staff_url"], "extract_method": school_config.get("extract_method")}]
page = await context.new_page()
blocked_types = school_config.get("blocked_resources", ["image", "font", "media"])
if blocked_types:
async def _route_handler(route):
if route.request.resource_type in blocked_types:
await route.abort()
else:
await route.continue_()
await page.route("**/*", _route_handler)
for page_cfg in staff_pages:
target_url = page_cfg.get("url")
if not target_url:
continue
settings = _merge_request_settings(school_config.get("request"), page_cfg.get("request"))
success, error = await _goto_with_retry(page, target_url, settings, school_config["name"])
if not success:
if output_callback:
output_callback("warning", f" {school_config['name']}: failed to load {target_url} ({error})")
continue
if page_cfg.get("requires_scroll"):
await _perform_scroll(page, page_cfg.get("scroll_times", 6), page_cfg.get("scroll_delay_ms", 700))
if page_cfg.get("load_from_selector"):
await _load_more(page, page_cfg["load_from_selector"], page_cfg.get("max_load_more", 5))
elif page_cfg.get("load_more_selector"):
await _load_more(page, page_cfg["load_more_selector"], page_cfg.get("max_load_more", 5))
method = page_cfg.get("extract_method") or school_config.get("extract_method") or "links"
if method == "table":
extracted = await page.evaluate(JS_EXTRACT_TABLE_STAFF)
elif method == "research_explorer":
extracted = await page.evaluate(JS_EXTRACT_RESEARCH_EXPLORER)
else:
extracted = await page.evaluate(JS_EXTRACT_LINK_STAFF)
staff_collected.extend(extracted)
await page.close()
return _deduplicate_staff(staff_collected)
# =========================
# Concurrent scraping of school staff pages
# =========================
async def scrape_school_staff(context, school_config: Dict[str, Any], semaphore, output_callback):
async with semaphore:
staff_list: List[Dict[str, str]] = []
status = "success"
error: Optional[str] = None
try:
if school_config.get("extract_method") == "research_explorer":
staff_list = await fetch_research_explorer_api(context, school_config, output_callback)
if not staff_list:
staff_list = await scrape_staff_via_browser(context, school_config, output_callback)
if output_callback:
output_callback("info", f" {school_config['name']}: total {len(staff_list)} staff")
except Exception as exc: # noqa: BLE001
status = "error"
error = str(exc)
if output_callback:
output_callback("error", f" {school_config['name']}: {error}")
return {
"name": school_config["name"],
"staff": staff_list,
"status": status,
"error": error,
}
async def scrape_all_school_staff(context, output_callback):
semaphore = asyncio.Semaphore(STAFF_CONCURRENCY)
tasks = [
asyncio.create_task(scrape_school_staff(context, cfg, semaphore, output_callback))
for cfg in SCHOOL_CONFIG
]
results = await asyncio.gather(*tasks)
staff_map = {}
diagnostics = {"failed": [], "success": [], "total": len(results)}
for res in results:
if res["staff"]:
staff_map[res["name"]] = res["staff"]
diagnostics["success"].append(res["name"])
else:
diagnostics["failed"].append(
{
"name": res["name"],
"status": res["status"],
"error": res.get("error"),
}
)
return staff_map, diagnostics
# =========================
# Main flow
# =========================
async def scrape(output_callback=None):
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
base_url = "https://www.manchester.ac.uk/"
result = {
"name": "The University of Manchester",
"url": base_url,
"scraped_at": datetime.now(timezone.utc).isoformat(),
"schools": [],
"diagnostics": {},
}
try:
# Step 1: masters program list
if output_callback:
output_callback("info", "Step 1: Scraping masters programs list...")
page = await context.new_page()
courses_url = "https://www.manchester.ac.uk/study/masters/courses/list/"
await page.goto(courses_url, wait_until="domcontentloaded", timeout=40000)
await page.wait_for_timeout(3000)
programs_data = await page.evaluate(JS_EXTRACT_PROGRAMS)
await page.close()
if output_callback:
output_callback("info", f"Found {len(programs_data)} masters programs")
# Step 2: scrape school staff pages in parallel
if output_callback:
output_callback("info", "Step 2: Scraping faculty from staff pages (parallel)...")
school_staff, diagnostics = await scrape_all_school_staff(context, output_callback)
# Step 3: assemble the data
schools_dict: Dict[str, Dict[str, Any]] = {}
for prog in programs_data:
school_name = match_program_to_school(prog["name"])
if school_name not in schools_dict:
schools_dict[school_name] = {
"name": school_name,
"url": "",
"programs": [],
"faculty": school_staff.get(school_name, []),
"faculty_source": "school_directory" if school_staff.get(school_name) else "",
}
schools_dict[school_name]["programs"].append(
{
"name": prog["name"],
"url": prog["url"],
"faculty": [],
}
)
for cfg in SCHOOL_CONFIG:
if cfg["name"] in schools_dict:
first_page = (cfg.get("staff_pages") or [{}])[0]
schools_dict[cfg["name"]]["url"] = first_page.get("url") or cfg.get("staff_url", "")
_attach_faculty_to_programs(schools_dict, school_staff)
result["schools"] = list(schools_dict.values())
total_programs = sum(len(s["programs"]) for s in result["schools"])
total_faculty = sum(len(s.get("faculty", [])) for s in result["schools"])
result["diagnostics"] = {
"total_programs": total_programs,
"total_faculty_records": total_faculty,
"school_staff_success": diagnostics.get("success", []),
"school_staff_failed": diagnostics.get("failed", []),
}
if output_callback:
output_callback(
"info",
f"Done! {len(result['schools'])} schools, {total_programs} programs, {total_faculty} faculty",
)
except Exception as exc: # noqa: BLE001
if output_callback:
output_callback("error", f"Scraping error: {str(exc)}")
finally:
await browser.close()
return result
def _attach_faculty_to_programs(schools_dict: Dict[str, Dict[str, Any]], staff_map: Dict[str, List[Dict[str, str]]]):
for school_name, school_data in schools_dict.items():
staff = staff_map.get(school_name, [])
cfg = SCHOOL_LOOKUP.get(school_name, {})
if not staff or not cfg.get("attach_faculty_to_programs"):
continue
limit = cfg.get("faculty_per_program")
for program in school_data["programs"]:
sliced = deepcopy(staff[:limit] if limit else staff)
program["faculty"] = sliced
# =========================
# CLI
# =========================
if __name__ == "__main__":
import sys
if sys.platform == "win32":
asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
def print_callback(level, msg):
print(f"[{level}] {msg}")
scrape_result = asyncio.run(scrape(output_callback=print_callback))
output_path = "output/manchester_complete_result.json"
with open(output_path, "w", encoding="utf-8") as f:
json.dump(scrape_result, f, ensure_ascii=False, indent=2)
print("\nResult saved to", output_path)
print("\n=== Summary ===")
for school in sorted(scrape_result["schools"], key=lambda s: -len(s.get("faculty", []))):
print(
f" {school['name']}: "
f"{len(school['programs'])} programs, "
f"{len(school.get('faculty', []))} faculty"
)


@@ -0,0 +1,229 @@
"""
University of Manchester dedicated scraper
Improved version - extracts supervisor information from the school staff pages
"""
import asyncio
import json
import re
from datetime import datetime, timezone
from urllib.parse import urljoin, urlparse
from playwright.async_api import async_playwright
# Mapping of University of Manchester school staff pages
# program keyword -> school staff page URL
SCHOOL_STAFF_MAPPING = {
# Alliance Manchester Business School (AMBS)
"accounting": "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/",
"finance": "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/",
"business": "https://www.alliancembs.manchester.ac.uk/about/our-people/",
"management": "https://www.alliancembs.manchester.ac.uk/about/our-people/",
"marketing": "https://www.alliancembs.manchester.ac.uk/research/management-sciences-and-marketing/",
"mba": "https://www.alliancembs.manchester.ac.uk/about/our-people/",
# More schools can be added here...
# "computer": "...",
# "engineering": "...",
}
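# Hedged example of how the mapping above is used later in scrape(): a program whose name
# contains the keyword "accounting" (e.g. a hypothetical "MSc Accounting") is grouped under
# "Alliance Manchester Business School" and attached to the AMBS Accounting & Finance staff list.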
# Generic staff pages to fall back on when no keyword matches
GENERAL_STAFF_PAGES = [
"https://www.alliancembs.manchester.ac.uk/about/our-people/",
]
async def scrape(output_callback=None):
"""执行爬取"""
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
page = await context.new_page()
base_url = "https://www.manchester.ac.uk/"
result = {
"name": "The University of Manchester",
"url": base_url,
"scraped_at": datetime.now(timezone.utc).isoformat(),
"schools": []
}
try:
# Step 1: scrape the masters program list
if output_callback:
output_callback("info", "Step 1: Scraping masters programs list...")
courses_url = "https://www.manchester.ac.uk/study/masters/courses/list/"
await page.goto(courses_url, wait_until="domcontentloaded", timeout=30000)
await page.wait_for_timeout(3000)
# Extract all masters programs
programs_data = await page.evaluate('''() => {
const programs = [];
const seen = new Set();
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href;
const text = a.innerText.trim().replace(/\\s+/g, ' ');
if (!href || seen.has(href)) return;
if (text.length < 10 || text.length > 200) return;
const hrefLower = href.toLowerCase();
const textLower = text.toLowerCase();
// Skip navigation links
if (textLower === 'courses' || textLower === 'masters' ||
textLower.includes('admission') || textLower.includes('fees') ||
textLower.includes('skip to') || textLower.includes('skip navigation') ||
textLower === 'home' || textLower === 'search' ||
textLower.includes('contact') || textLower.includes('footer') ||
hrefLower.endsWith('/courses/') || hrefLower.endsWith('/masters/') ||
hrefLower.includes('#')) {
return;
}
// A course link must contain a numeric course id
const hasNumericId = /\\/\\d{5}\\//.test(href); // five-digit id in the URL path
const isCoursePage = hrefLower.includes('/courses/list/') &&
hasNumericId;
if (isCoursePage) {
seen.add(href);
programs.push({
name: text,
url: href
});
}
});
return programs;
}''')
if output_callback:
output_callback("info", f"Found {len(programs_data)} masters programs")
# Step 2: scrape supervisor information from the school staff pages
if output_callback:
output_callback("info", "Step 2: Scraping faculty from school staff pages...")
all_faculty = {} # school_url -> faculty list
# Scrape the AMBS Accounting & Finance staff page
staff_url = "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/"
if output_callback:
output_callback("info", f"Scraping staff from: {staff_url}")
await page.goto(staff_url, wait_until="domcontentloaded", timeout=30000)
await page.wait_for_timeout(3000)
# Extract staff members from the table
faculty_data = await page.evaluate('''() => {
const faculty = [];
const rows = document.querySelectorAll('table tr');
rows.forEach(row => {
const cells = row.querySelectorAll('td');
if (cells.length >= 2) {
const link = cells[1]?.querySelector('a[href]');
const titleCell = cells[2];
if (link) {
const name = link.innerText.trim();
const url = link.href;
const title = titleCell ? titleCell.innerText.trim() : '';
if (name.length > 2 && !name.toLowerCase().includes('skip')) {
faculty.push({
name: name,
url: url,
title: title
});
}
}
}
});
return faculty;
}''')
if output_callback:
output_callback("info", f"Found {len(faculty_data)} faculty members from AMBS")
all_faculty["AMBS - Accounting and Finance"] = faculty_data
# Step 3: assemble the result
# Assign programs to schools by keyword
schools_data = {}
for prog in programs_data:
prog_name_lower = prog['name'].lower()
# Determine the owning school
school_name = "Other Programs"
matched_faculty = []
for keyword, staff_url in SCHOOL_STAFF_MAPPING.items():
if keyword in prog_name_lower:
if "accounting" in keyword or "finance" in keyword:
school_name = "Alliance Manchester Business School"
matched_faculty = all_faculty.get("AMBS - Accounting and Finance", [])
elif "business" in keyword or "management" in keyword or "mba" in keyword:
school_name = "Alliance Manchester Business School"
matched_faculty = all_faculty.get("AMBS - Accounting and Finance", [])
break
if school_name not in schools_data:
schools_data[school_name] = {
"name": school_name,
"url": "",
"programs": [],
"faculty": matched_faculty # 学院级别的导师
}
schools_data[school_name]["programs"].append({
"name": prog['name'],
"url": prog['url'],
"faculty": [] # 项目级别暂不填充
})
result["schools"] = list(schools_data.values())
# Totals
total_programs = sum(len(s['programs']) for s in result['schools'])
total_faculty = sum(len(s.get('faculty', [])) for s in result['schools'])
if output_callback:
output_callback("info", f"Done! {len(result['schools'])} schools, {total_programs} programs, {total_faculty} faculty")
except Exception as e:
if output_callback:
output_callback("error", f"Scraping error: {str(e)}")
finally:
await browser.close()
return result
if __name__ == "__main__":
import sys
if sys.platform == "win32":
asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
def print_callback(level, msg):
print(f"[{level}] {msg}")
result = asyncio.run(scrape(output_callback=print_callback))
# Save the result (ensure the output directory exists first)
import os
os.makedirs("output", exist_ok=True)
with open("output/manchester_improved_result.json", "w", encoding="utf-8") as f:
json.dump(result, f, indent=2, ensure_ascii=False)
print(f"\nResult saved to output/manchester_improved_result.json")
print(f"Schools: {len(result['schools'])}")
for school in result['schools']:
print(f" - {school['name']}: {len(school['programs'])} programs, {len(school.get('faculty', []))} faculty")


@@ -0,0 +1,438 @@
#!/usr/bin/env python
"""
Auto-generated by the Agno codegen agent.
Target university: RWTH Aachen (https://www.rwth-aachen.de/go/id/a/?lidx=1)
Requested caps: depth=3, pages=30
Plan description: Playwright scraper for university master programs and faculty profiles.
Navigation strategy: Start from the main university page and look for faculty/department directories. RWTH Aachen likely structures content with faculty organized by departments. Look for department pages (like 'Fakultäten'), then navigate to individual department sites, find 'Mitarbeiter' or 'Personal' sections, and extract individual faculty profile URLs. The university uses both German and English, so check for patterns in both languages. Individual faculty pages likely follow patterns like '/mitarbeiter/firstname-lastname' or similar German naming conventions.
Verification checklist:
- Verify that faculty URLs point to individual person pages, not department listings
- Check that master's program pages contain degree information and curriculum details
- Ensure scraped faculty pages include personal information like research interests, contact details, or CV
- Validate that URLs contain individual identifiers (names, personal paths) rather than generic terms
- Cross-check that German and English versions of pages are both captured when available
Playwright snapshot used to guide this plan:
1. RWTH Aachen University | Rheinisch-Westfälische Technische Hochschule | EN (https://www.rwth-aachen.de/go/id/a/?lidx=1)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE       Search for Search       Copyright: ©   Copyright: ©   Copyright: ©   Copyright: © Studying at RWTH Welc
Anchors: Skip to Content -> https://www.rwth-aachen.de/go/id/a/?lidx=1#main, Skip to Main Navigation -> https://www.rwth-aachen.de/go/id/a/?lidx=1#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/go/id/a/?lidx=1#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/go/id/a/?lidx=1#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/go/id/a/?lidx=1#searchbar, Skip to Footer -> https://www.rwth-aachen.de/go/id/a/?lidx=1#footer
2. Prospective Students | RWTH Aachen University | EN (https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE       Search for Search     Prospective Students Choosing A Course of Study Copyright: © Mario Irrmischer Adv
Anchors: Skip to Content -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#main, Skip to Main Navigation -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#searchbar, Skip to Footer -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#footer
3. First-Year Students | RWTH Aachen University | EN (https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE       Search for Search     First-Year Students Preparing for Your Studies Recommended Subject-Specific Res
Anchors: Skip to Content -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#main, Skip to Main Navigation -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#searchbar, Skip to Footer -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#footer
4. Students | RWTH Aachen University | EN (https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE       Search for Search     Students Teaser Copyright: © Martin Braun Classes What lectures do you have next
Anchors: Skip to Content -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#main, Skip to Main Navigation -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#searchbar, Skip to Footer -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#footer
Snapshot truncated.
Generated at: 2025-12-09T10:27:25.950820+00:00
"""
from __future__ import annotations
import argparse
import asyncio
import json
import time
from collections import deque
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Deque, Iterable, List, Set, Tuple
from urllib.parse import urljoin, urldefrag, urlparse
from playwright.async_api import async_playwright, Page, Response
PROGRAM_KEYWORDS = ['Master', 'M.Sc.', 'M.A.', 'Graduate', 'Masterstudiengang', '/studium/', '/studiengänge/', 'Postgraduate']
FACULTY_KEYWORDS = ['Prof.', 'Dr.', 'Professor', '/mitarbeiter/', '/people/', '/personal/', '/~', 'Professorin']
EXCLUSION_KEYWORDS = ['bewerbung', 'admission', 'apply', 'bewerben', 'news', 'nachrichten', 'events', 'veranstaltungen', 'career', 'stellenangebote', 'login', 'anmelden', 'alumni', 'donate', 'spenden', 'studienanfänger']
METADATA_FIELDS = ['url', 'title', 'entity_type', 'department', 'email', 'scraped_at']
EXTRA_NOTES = ["RWTH Aachen is a major German technical university with content in both German and English. The site structure appears to use target group portals ('Zielgruppenportale') for different audiences. Faculty information will likely be distributed across different department websites. The university uses German academic titles (Prof., Dr.) extensively. Be prepared to handle both '/cms/root/' URL structures and potential subdomain variations for different faculties."]
# URL patterns that indicate individual profile pages
PROFILE_URL_PATTERNS = [
"/people/", "/person/", "/profile/", "/profiles/",
"/faculty/", "/staff/", "/directory/",
"/~", # Unix-style personal pages
"/bio/", "/about/",
]
# URL patterns that indicate listing/directory pages (should be crawled deeper)
DIRECTORY_URL_PATTERNS = [
"/faculty", "/people", "/directory", "/staff",
"/team", "/members", "/researchers",
]
def normalize_url(base: str, href: str) -> str:
"""Normalize URL by resolving relative paths and removing fragments."""
absolute = urljoin(base, href)
cleaned, _ = urldefrag(absolute)
# Remove trailing slash for consistency
return cleaned.rstrip("/")
def matches_any(text: str, keywords: Iterable[str]) -> bool:
"""Check if text contains any of the keywords (case-insensitive)."""
lowered = text.lower()
return any(keyword.lower() in lowered for keyword in keywords)
def is_same_domain(url1: str, url2: str) -> bool:
"""Check if two URLs belong to the same root domain."""
domain1 = urlparse(url1).netloc.replace("www.", "")
domain2 = urlparse(url2).netloc.replace("www.", "")
# Allow subdomains of the same root domain
parts1 = domain1.split(".")
parts2 = domain2.split(".")
if len(parts1) >= 2 and len(parts2) >= 2:
return parts1[-2:] == parts2[-2:]
return domain1 == domain2
def is_profile_url(url: str) -> bool:
"""Check if URL pattern suggests an individual profile page."""
url_lower = url.lower()
return any(pattern in url_lower for pattern in PROFILE_URL_PATTERNS)
def is_directory_url(url: str) -> bool:
"""Check if URL pattern suggests a directory/listing page."""
url_lower = url.lower()
return any(pattern in url_lower for pattern in DIRECTORY_URL_PATTERNS)
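# Illustrative examples of the helpers above (hypothetical URLs, added for clarity; assumed behaviour, not produced by the generator):
#   is_same_domain("https://www.rwth-aachen.de/", "https://informatik.rwth-aachen.de/people/") -> True  (shared root domain)
#   is_same_domain("https://www.rwth-aachen.de/", "https://www.tum.de/")                       -> False
#   is_profile_url("https://www.informatik.rwth-aachen.de/people/jane-doe")                    -> True
#   is_directory_url("https://www.informatik.rwth-aachen.de/people")                           -> True  (crawled with higher priority)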
@dataclass
class ScrapedLink:
url: str
title: str
text: str
source_url: str
bucket: str # "program" or "faculty"
is_verified: bool = False
http_status: int = 0
is_profile_page: bool = False
scraped_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
@dataclass
class ScrapeSettings:
root_url: str
max_depth: int
max_pages: int
headless: bool
output: Path
verify_links: bool = True
request_delay: float = 1.0 # Polite crawling delay
async def extract_links(page: Page) -> List[Tuple[str, str]]:
"""Extract all anchor links from the page."""
anchors: Iterable[dict] = await page.eval_on_selector_all(
"a",
"""elements => elements
.map(el => ({text: (el.textContent || '').trim(), href: el.href}))
.filter(item => item.text && item.href && item.href.startsWith('http'))""",
)
return [(item["href"], item["text"]) for item in anchors]
async def get_page_title(page: Page) -> str:
"""Get the page title safely."""
try:
return await page.title() or ""
except Exception:
return ""
async def verify_link(context, url: str, timeout: int = 10000) -> Tuple[bool, int, str]:
"""
Verify a link by opening it in a new page and checking the HTTP status.
Returns: (is_valid, status_code, page_title)
"""
page = await context.new_page()
try:
response: Response = await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
if response:
status = response.status
title = await get_page_title(page)
is_valid = 200 <= status < 400
return is_valid, status, title
return False, 0, ""
except Exception:
return False, 0, ""
finally:
await page.close()
async def crawl(settings: ScrapeSettings, browser_name: str) -> List[ScrapedLink]:
"""
Crawl the website using BFS, collecting program and faculty links.
Features:
- URL deduplication
- Link verification
- Profile page detection
- Polite crawling with delays
"""
async with async_playwright() as p:
browser_launcher = getattr(p, browser_name)
browser = await browser_launcher.launch(headless=settings.headless)
context = await browser.new_context()
# Priority queue: (priority, url, depth) - lower priority = processed first
# Directory pages get priority 0, others get priority 1
queue: Deque[Tuple[int, str, int]] = deque([(0, settings.root_url, 0)])
visited: Set[str] = set()
found_urls: Set[str] = set() # For deduplication of results
results: List[ScrapedLink] = []
print(f"Starting crawl from: {settings.root_url}")
print(f"Max depth: {settings.max_depth}, Max pages: {settings.max_pages}")
try:
while queue and len(visited) < settings.max_pages:
# Sort queue by priority (directory pages first)
queue = deque(sorted(queue, key=lambda x: x[0]))
priority, url, depth = queue.popleft()
normalized_url = normalize_url(settings.root_url, url)
if normalized_url in visited or depth > settings.max_depth:
continue
# Only crawl same-domain URLs
if not is_same_domain(settings.root_url, normalized_url):
continue
visited.add(normalized_url)
print(f"[{len(visited)}/{settings.max_pages}] Depth {depth}: {normalized_url[:80]}...")
page = await context.new_page()
try:
response = await page.goto(
normalized_url, wait_until="domcontentloaded", timeout=20000
)
if not response or response.status >= 400:
await page.close()
continue
except Exception as e:
print(f" Error: {e}")
await page.close()
continue
page_title = await get_page_title(page)
links = await extract_links(page)
for href, text in links:
normalized_href = normalize_url(normalized_url, href)
# Skip if already found or is excluded
if normalized_href in found_urls:
continue
if matches_any(text, EXCLUSION_KEYWORDS) or matches_any(normalized_href, EXCLUSION_KEYWORDS):
continue
text_lower = text.lower()
href_lower = normalized_href.lower()
is_profile = is_profile_url(normalized_href)
# Check for program links
if matches_any(text_lower, PROGRAM_KEYWORDS) or matches_any(href_lower, PROGRAM_KEYWORDS):
found_urls.add(normalized_href)
results.append(
ScrapedLink(
url=normalized_href,
title="",
text=text[:200],
source_url=normalized_url,
bucket="program",
is_profile_page=False,
)
)
# Check for faculty links
if matches_any(text_lower, FACULTY_KEYWORDS) or matches_any(href_lower, FACULTY_KEYWORDS):
found_urls.add(normalized_href)
results.append(
ScrapedLink(
url=normalized_href,
title="",
text=text[:200],
source_url=normalized_url,
bucket="faculty",
is_profile_page=is_profile,
)
)
# Queue for further crawling
if depth < settings.max_depth and is_same_domain(settings.root_url, normalized_href):
# Prioritize directory pages
link_priority = 0 if is_directory_url(normalized_href) else 1
queue.append((link_priority, normalized_href, depth + 1))
await page.close()
# Polite delay between requests
await asyncio.sleep(settings.request_delay)
finally:
await context.close()
await browser.close()
# Verify links if enabled
if settings.verify_links and results:
print(f"\nVerifying {len(results)} links...")
browser = await browser_launcher.launch(headless=True)
context = await browser.new_context()
verified_results = []
for i, link in enumerate(results):
if link.url in [r.url for r in verified_results]:
continue # Skip duplicates
print(f" [{i+1}/{len(results)}] Verifying: {link.url[:60]}...")
is_valid, status, title = await verify_link(context, link.url)
link.is_verified = True
link.http_status = status
link.title = title or link.text
if is_valid:
verified_results.append(link)
else:
print(f" Invalid (HTTP {status})")
await asyncio.sleep(0.5) # Delay between verifications
await context.close()
await browser.close()
results = verified_results
return results
def deduplicate_results(results: List[ScrapedLink]) -> List[ScrapedLink]:
"""Remove duplicate URLs, keeping the first occurrence."""
seen: Set[str] = set()
unique = []
for link in results:
if link.url not in seen:
seen.add(link.url)
unique.append(link)
return unique
def serialize(results: List[ScrapedLink], target: Path, root_url: str) -> None:
"""Save results to JSON file with statistics."""
results = deduplicate_results(results)
program_links = [link for link in results if link.bucket == "program"]
faculty_links = [link for link in results if link.bucket == "faculty"]
profile_pages = [link for link in faculty_links if link.is_profile_page]
payload = {
"root_url": root_url,
"generated_at": datetime.now(timezone.utc).isoformat(),
"statistics": {
"total_links": len(results),
"program_links": len(program_links),
"faculty_links": len(faculty_links),
"profile_pages": len(profile_pages),
"verified_links": len([r for r in results if r.is_verified and r.http_status == 200]),
},
"program_links": [asdict(link) for link in program_links],
"faculty_links": [asdict(link) for link in faculty_links],
"notes": EXTRA_NOTES,
"metadata_fields": METADATA_FIELDS,
}
target.parent.mkdir(parents=True, exist_ok=True)
target.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
print(f"\nResults saved to: {target}")
print(f" Total links: {len(results)}")
print(f" Program links: {len(program_links)}")
print(f" Faculty links: {len(faculty_links)}")
print(f" Profile pages: {len(profile_pages)}")
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(
description="Playwright scraper generated by the Agno agent for https://www.rwth-aachen.de/go/id/a/?lidx=1."
)
parser.add_argument(
"--root-url",
default="https://www.rwth-aachen.de/go/id/a/?lidx=1",
help="Seed url to start crawling from.",
)
parser.add_argument(
"--max-depth",
type=int,
default=3,
help="Maximum crawl depth.",
)
parser.add_argument(
"--max-pages",
type=int,
default=30,
help="Maximum number of pages to visit.",
)
parser.add_argument(
"--output",
type=Path,
default=Path("university-scraper_results.json"),
help="Where to save the JSON output.",
)
parser.add_argument(
"--headless",
action="store_true",
default=True,
help="Run browser in headless mode (default: True).",
)
parser.add_argument(
"--no-headless",
action="store_false",
dest="headless",
help="Run browser with visible window.",
)
parser.add_argument(
"--browser",
choices=["chromium", "firefox", "webkit"],
default="chromium",
help="Browser engine to launch via Playwright.",
)
parser.add_argument(
"--no-verify",
action="store_true",
default=False,
help="Skip link verification step.",
)
parser.add_argument(
"--delay",
type=float,
default=1.0,
help="Delay between requests in seconds (polite crawling).",
)
return parser.parse_args()
async def main_async() -> None:
args = parse_args()
settings = ScrapeSettings(
root_url=args.root_url,
max_depth=args.max_depth,
max_pages=args.max_pages,
headless=args.headless,
output=args.output,
verify_links=not args.no_verify,
request_delay=args.delay,
)
links = await crawl(settings, browser_name=args.browser)
serialize(links, settings.output, settings.root_url)
def main() -> None:
asyncio.run(main_async())
if __name__ == "__main__":
main()
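The generated script is primarily meant to be run from the command line, but the crawl/serialize helpers can also be reused directly. A minimal sketch, assuming the file above is saved as rwth_scraper.py (the filename is not fixed by this diff):

import asyncio
from pathlib import Path

from rwth_scraper import ScrapeSettings, crawl, serialize  # hypothetical module name

settings = ScrapeSettings(
    root_url="https://www.rwth-aachen.de/go/id/a/?lidx=1",
    max_depth=2,
    max_pages=10,
    headless=True,
    output=Path("rwth_sample.json"),
    verify_links=False,  # skip the slower verification pass for a quick dry run
    request_delay=1.0,
)
links = asyncio.run(crawl(settings, browser_name="chromium"))
serialize(links, settings.output, settings.root_url)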

View File

@ -0,0 +1,437 @@
#!/usr/bin/env python
"""
Auto-generated by the Agno codegen agent.
Target university: RWTH Aachen (https://www.rwth-aachen.de/go/id/a/?lidx=1)
Requested caps: depth=3, pages=30
Plan description: Playwright scraper for university master programs and faculty profiles.
Navigation strategy: Start at the university homepage (https://www.rwth-aachen.de/); navigate to faculty/department pages, e.g. /fakultaeten/, /fachbereiche/; look for staff/people directory pages within each department; crawl the staff directories to find individual profile pages; some departments may use subdomains like informatik.rwth-aachen.de
Verification checklist:
- Check that collected URLs are for individual people, not directories
- Spot check profile pages to ensure they represent faculty members
- Verify relevant graduate program pages were found
- Confirm noise pages like news, events, jobs were excluded
Playwright snapshot used to guide this plan:
1. RWTH Aachen University | Rheinisch-Westfälische Technische Hochschule | EN (https://www.rwth-aachen.de/go/id/a/?lidx=1)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE       Search for Search       Copyright: ©   Copyright: ©   Copyright: ©   Copyright: © Studying at RWTH Welc
Anchors: Skip to Content -> https://www.rwth-aachen.de/go/id/a/?lidx=1#main, Skip to Main Navigation -> https://www.rwth-aachen.de/go/id/a/?lidx=1#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/go/id/a/?lidx=1#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/go/id/a/?lidx=1#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/go/id/a/?lidx=1#searchbar, Skip to Footer -> https://www.rwth-aachen.de/go/id/a/?lidx=1#footer
2. Prospective Students | RWTH Aachen University | EN (https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE       Search for Search     Prospective Students Choosing A Course of Study Copyright: © Mario Irrmischer Adv
Anchors: Skip to Content -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#main, Skip to Main Navigation -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#searchbar, Skip to Footer -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#footer
3. First-Year Students | RWTH Aachen University | EN (https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE       Search for Search     First-Year Students Preparing for Your Studies Recommended Subject-Specific Res
Anchors: Skip to Content -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#main, Skip to Main Navigation -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#searchbar, Skip to Footer -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#footer
4. Students | RWTH Aachen University | EN (https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE       Search for Search     Students Teaser Copyright: © Martin Braun Classes What lectures do you have next
Anchors: Skip to Content -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#main, Skip to Main Navigation -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#searchbar, Skip to Footer -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#footer
Snapshot truncated.
Generated at: 2025-12-09T15:00:09.586788+00:00
"""
from __future__ import annotations
import argparse
import asyncio
import json
import time
from collections import deque
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Deque, Iterable, List, Set, Tuple
from urllib.parse import urljoin, urldefrag, urlparse
from playwright.async_api import async_playwright, Page, Response
PROGRAM_KEYWORDS = ['/studium/', '/studiengaenge/', 'master', 'graduate', 'postgraduate', 'm.sc.', 'm.a.']
FACULTY_KEYWORDS = ['/staff/', '/profile/', '/personen/', '/person/', '/aw/personen/', 'prof.', 'dr.', 'professor']
EXCLUSION_KEYWORDS = ['studieninteressierte', 'studienanfaenger', 'zulassung', 'bewerbung', 'studienbeitraege', 'studienfinanzierung', 'aktuelles', 'veranstaltungen', 'karriere', 'stellenangebote', 'alumni', 'anmeldung']
METADATA_FIELDS = ['url', 'title', 'entity_type', 'department', 'email', 'scraped_at']
EXTRA_NOTES = ['Site is primarily in German, so use German keywords', 'Faculty profile URLs contain /personen/ or /person/', 'Graduate program pages use /studium/ and /studiengaenge/']
# URL patterns that indicate individual profile pages
PROFILE_URL_PATTERNS = [
"/people/", "/person/", "/profile/", "/profiles/",
"/faculty/", "/staff/", "/directory/",
"/~", # Unix-style personal pages
"/bio/", "/about/",
]
# URL patterns that indicate listing/directory pages (should be crawled deeper)
DIRECTORY_URL_PATTERNS = [
"/faculty", "/people", "/directory", "/staff",
"/team", "/members", "/researchers",
]
def normalize_url(base: str, href: str) -> str:
"""Normalize URL by resolving relative paths and removing fragments."""
absolute = urljoin(base, href)
cleaned, _ = urldefrag(absolute)
# Remove trailing slash for consistency
return cleaned.rstrip("/")
def matches_any(text: str, keywords: Iterable[str]) -> bool:
"""Check if text contains any of the keywords (case-insensitive)."""
lowered = text.lower()
return any(keyword.lower() in lowered for keyword in keywords)
def is_same_domain(url1: str, url2: str) -> bool:
"""Check if two URLs belong to the same root domain."""
domain1 = urlparse(url1).netloc.replace("www.", "")
domain2 = urlparse(url2).netloc.replace("www.", "")
# Allow subdomains of the same root domain
parts1 = domain1.split(".")
parts2 = domain2.split(".")
if len(parts1) >= 2 and len(parts2) >= 2:
return parts1[-2:] == parts2[-2:]
return domain1 == domain2
def is_profile_url(url: str) -> bool:
"""Check if URL pattern suggests an individual profile page."""
url_lower = url.lower()
return any(pattern in url_lower for pattern in PROFILE_URL_PATTERNS)
def is_directory_url(url: str) -> bool:
"""Check if URL pattern suggests a directory/listing page."""
url_lower = url.lower()
return any(pattern in url_lower for pattern in DIRECTORY_URL_PATTERNS)
@dataclass
class ScrapedLink:
url: str
title: str
text: str
source_url: str
bucket: str # "program" or "faculty"
is_verified: bool = False
http_status: int = 0
is_profile_page: bool = False
scraped_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
@dataclass
class ScrapeSettings:
root_url: str
max_depth: int
max_pages: int
headless: bool
output: Path
verify_links: bool = True
request_delay: float = 1.0 # Polite crawling delay
async def extract_links(page: Page) -> List[Tuple[str, str]]:
"""Extract all anchor links from the page."""
anchors: Iterable[dict] = await page.eval_on_selector_all(
"a",
"""elements => elements
.map(el => ({text: (el.textContent || '').trim(), href: el.href}))
.filter(item => item.text && item.href && item.href.startsWith('http'))""",
)
return [(item["href"], item["text"]) for item in anchors]
async def get_page_title(page: Page) -> str:
"""Get the page title safely."""
try:
return await page.title() or ""
except Exception:
return ""
async def verify_link(context, url: str, timeout: int = 10000) -> Tuple[bool, int, str]:
"""
Verify a link by opening it in a new page and checking the HTTP status.
Returns: (is_valid, status_code, page_title)
"""
page = await context.new_page()
try:
response: Response = await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
if response:
status = response.status
title = await get_page_title(page)
is_valid = 200 <= status < 400
return is_valid, status, title
return False, 0, ""
except Exception:
return False, 0, ""
finally:
await page.close()
async def crawl(settings: ScrapeSettings, browser_name: str) -> List[ScrapedLink]:
"""
Crawl the website using BFS, collecting program and faculty links.
Features:
- URL deduplication
- Link verification
- Profile page detection
- Polite crawling with delays
"""
async with async_playwright() as p:
browser_launcher = getattr(p, browser_name)
browser = await browser_launcher.launch(headless=settings.headless)
context = await browser.new_context()
# Priority queue: (priority, url, depth) - lower priority = processed first
# Directory pages get priority 0, others get priority 1
queue: Deque[Tuple[int, str, int]] = deque([(0, settings.root_url, 0)])
visited: Set[str] = set()
found_urls: Set[str] = set() # For deduplication of results
results: List[ScrapedLink] = []
print(f"Starting crawl from: {settings.root_url}")
print(f"Max depth: {settings.max_depth}, Max pages: {settings.max_pages}")
try:
while queue and len(visited) < settings.max_pages:
# Sort queue by priority (directory pages first)
queue = deque(sorted(queue, key=lambda x: x[0]))
priority, url, depth = queue.popleft()
normalized_url = normalize_url(settings.root_url, url)
if normalized_url in visited or depth > settings.max_depth:
continue
# Only crawl same-domain URLs
if not is_same_domain(settings.root_url, normalized_url):
continue
visited.add(normalized_url)
print(f"[{len(visited)}/{settings.max_pages}] Depth {depth}: {normalized_url[:80]}...")
page = await context.new_page()
try:
response = await page.goto(
normalized_url, wait_until="domcontentloaded", timeout=20000
)
if not response or response.status >= 400:
await page.close()
continue
except Exception as e:
print(f" Error: {e}")
await page.close()
continue
page_title = await get_page_title(page)
links = await extract_links(page)
for href, text in links:
normalized_href = normalize_url(normalized_url, href)
# Skip if already found or is excluded
if normalized_href in found_urls:
continue
if matches_any(text, EXCLUSION_KEYWORDS) or matches_any(normalized_href, EXCLUSION_KEYWORDS):
continue
text_lower = text.lower()
href_lower = normalized_href.lower()
is_profile = is_profile_url(normalized_href)
# Check for program links
if matches_any(text_lower, PROGRAM_KEYWORDS) or matches_any(href_lower, PROGRAM_KEYWORDS):
found_urls.add(normalized_href)
results.append(
ScrapedLink(
url=normalized_href,
title="",
text=text[:200],
source_url=normalized_url,
bucket="program",
is_profile_page=False,
)
)
# Check for faculty links
if matches_any(text_lower, FACULTY_KEYWORDS) or matches_any(href_lower, FACULTY_KEYWORDS):
found_urls.add(normalized_href)
results.append(
ScrapedLink(
url=normalized_href,
title="",
text=text[:200],
source_url=normalized_url,
bucket="faculty",
is_profile_page=is_profile,
)
)
# Queue for further crawling
if depth < settings.max_depth and is_same_domain(settings.root_url, normalized_href):
# Prioritize directory pages
link_priority = 0 if is_directory_url(normalized_href) else 1
queue.append((link_priority, normalized_href, depth + 1))
await page.close()
# Polite delay between requests
await asyncio.sleep(settings.request_delay)
finally:
await context.close()
await browser.close()
# Verify links if enabled
if settings.verify_links and results:
print(f"\nVerifying {len(results)} links...")
browser = await browser_launcher.launch(headless=True)
context = await browser.new_context()
verified_results = []
for i, link in enumerate(results):
if link.url in [r.url for r in verified_results]:
continue # Skip duplicates
print(f" [{i+1}/{len(results)}] Verifying: {link.url[:60]}...")
is_valid, status, title = await verify_link(context, link.url)
link.is_verified = True
link.http_status = status
link.title = title or link.text
if is_valid:
verified_results.append(link)
else:
print(f" Invalid (HTTP {status})")
await asyncio.sleep(0.5) # Delay between verifications
await context.close()
await browser.close()
results = verified_results
return results
def deduplicate_results(results: List[ScrapedLink]) -> List[ScrapedLink]:
"""Remove duplicate URLs, keeping the first occurrence."""
seen: Set[str] = set()
unique = []
for link in results:
if link.url not in seen:
seen.add(link.url)
unique.append(link)
return unique
def serialize(results: List[ScrapedLink], target: Path, root_url: str) -> None:
"""Save results to JSON file with statistics."""
results = deduplicate_results(results)
program_links = [link for link in results if link.bucket == "program"]
faculty_links = [link for link in results if link.bucket == "faculty"]
profile_pages = [link for link in faculty_links if link.is_profile_page]
payload = {
"root_url": root_url,
"generated_at": datetime.now(timezone.utc).isoformat(),
"statistics": {
"total_links": len(results),
"program_links": len(program_links),
"faculty_links": len(faculty_links),
"profile_pages": len(profile_pages),
"verified_links": len([r for r in results if r.is_verified and r.http_status == 200]),
},
"program_links": [asdict(link) for link in program_links],
"faculty_links": [asdict(link) for link in faculty_links],
"notes": EXTRA_NOTES,
"metadata_fields": METADATA_FIELDS,
}
target.parent.mkdir(parents=True, exist_ok=True)
target.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
print(f"\nResults saved to: {target}")
print(f" Total links: {len(results)}")
print(f" Program links: {len(program_links)}")
print(f" Faculty links: {len(faculty_links)}")
print(f" Profile pages: {len(profile_pages)}")
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(
description="Playwright scraper generated by the Agno agent for https://www.rwth-aachen.de/go/id/a/?lidx=1."
)
parser.add_argument(
"--root-url",
default="https://www.rwth-aachen.de/go/id/a/?lidx=1",
help="Seed url to start crawling from.",
)
parser.add_argument(
"--max-depth",
type=int,
default=3,
help="Maximum crawl depth.",
)
parser.add_argument(
"--max-pages",
type=int,
default=30,
help="Maximum number of pages to visit.",
)
parser.add_argument(
"--output",
type=Path,
default=Path("university-scraper_results.json"),
help="Where to save the JSON output.",
)
parser.add_argument(
"--headless",
action="store_true",
default=True,
help="Run browser in headless mode (default: True).",
)
parser.add_argument(
"--no-headless",
action="store_false",
dest="headless",
help="Run browser with visible window.",
)
parser.add_argument(
"--browser",
choices=["chromium", "firefox", "webkit"],
default="chromium",
help="Browser engine to launch via Playwright.",
)
parser.add_argument(
"--no-verify",
action="store_true",
default=False,
help="Skip link verification step.",
)
parser.add_argument(
"--delay",
type=float,
default=1.0,
help="Delay between requests in seconds (polite crawling).",
)
return parser.parse_args()
async def main_async() -> None:
args = parse_args()
settings = ScrapeSettings(
root_url=args.root_url,
max_depth=args.max_depth,
max_pages=args.max_pages,
headless=args.headless,
output=args.output,
verify_links=not args.no_verify,
request_delay=args.delay,
)
links = await crawl(settings, browser_name=args.browser)
serialize(links, settings.output, settings.root_url)
def main() -> None:
asyncio.run(main_async())
if __name__ == "__main__":
main()

View File

@ -0,0 +1,165 @@
#!/usr/bin/env python3
"""
Test the supervisor scraping logic - only 3 programs are tested
"""
import asyncio
import json
import re
from playwright.async_api import async_playwright
def name_to_slug(name):
"""将项目名称转换为URL slug"""
slug = name.lower()
slug = re.sub(r'[^\w\s-]', '', slug)
slug = re.sub(r'[\s_]+', '-', slug)
slug = re.sub(r'-+', '-', slug)
slug = slug.strip('-')
return slug
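# Illustrative examples (added for clarity):
#   name_to_slug("African and African American Studies") -> "african-and-african-american-studies"
#   name_to_slug("Computer Science")                      -> "computer-science"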
async def get_faculty_from_gsas_page(page, gsas_url):
"""从GSAS项目页面获取Faculty链接然后访问院系People页面获取导师列表"""
faculty_list = []
faculty_page_url = None
try:
print(f" 访问GSAS页面: {gsas_url}")
await page.goto(gsas_url, wait_until="domcontentloaded", timeout=30000)
await page.wait_for_timeout(2000)
# Find the link to the Faculty section
faculty_link = await page.evaluate('''() => {
const links = document.querySelectorAll('a[href]');
for (const link of links) {
const text = link.innerText.toLowerCase();
const href = link.href;
if (text.includes('faculty') && text.includes('see list')) {
return href;
}
if (text.includes('faculty') && (href.includes('/people') || href.includes('/faculty'))) {
return href;
}
}
return null;
}''')
if faculty_link:
faculty_page_url = faculty_link
print(f" 找到Faculty页面链接: {faculty_link}")
# 访问Faculty/People页面
await page.goto(faculty_link, wait_until="domcontentloaded", timeout=30000)
await page.wait_for_timeout(2000)
# Extract all supervisor entries
faculty_list = await page.evaluate('''() => {
const faculty = [];
const seen = new Set();
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href || '';
const text = a.innerText.trim();
const lowerHref = href.toLowerCase();
if ((lowerHref.includes('/people/') || lowerHref.includes('/faculty/') ||
lowerHref.includes('/profile/')) &&
text.length > 3 && text.length < 100 &&
!text.toLowerCase().includes('people') &&
!text.toLowerCase().includes('faculty') &&
!lowerHref.endsWith('/people/') &&
!lowerHref.endsWith('/faculty/')) {
if (!seen.has(href)) {
seen.add(href);
faculty.push({
name: text,
url: href
});
}
}
});
return faculty;
}''')
print(f" 找到 {len(faculty_list)} 位导师")
for f in faculty_list[:5]:
print(f" - {f['name']}: {f['url']}")
if len(faculty_list) > 5:
print(f" ... 还有 {len(faculty_list) - 5}")
else:
print(" 未找到Faculty页面链接")
except Exception as e:
print(f" 获取Faculty信息失败: {e}")
return faculty_list, faculty_page_url
async def test_faculty_scraper():
"""测试导师爬取"""
# 测试3个项目
test_programs = [
"African and African American Studies",
"Economics",
"Computer Science"
]
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
context = await browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
viewport={'width': 1920, 'height': 1080}
)
page = await context.new_page()
results = []
for i, name in enumerate(test_programs, 1):
print(f"\n{'='*60}")
print(f"[{i}/{len(test_programs)}] 测试: {name}")
print(f"{'='*60}")
slug = name_to_slug(name)
program_url = f"https://www.harvard.edu/programs/{slug}/"
gsas_url = f"https://gsas.harvard.edu/program/{slug}"
print(f"项目URL: {program_url}")
print(f"GSAS URL: {gsas_url}")
faculty_list, faculty_page_url = await get_faculty_from_gsas_page(page, gsas_url)
results.append({
'name': name,
'url': program_url,
'gsas_url': gsas_url,
'faculty_page_url': faculty_page_url,
'faculty': faculty_list,
'faculty_count': len(faculty_list)
})
await page.wait_for_timeout(1000)
await browser.close()
# Print results
print(f"\n\n{'='*60}")
print("Test result summary")
print(f"{'='*60}")
for r in results:
print(f"\n{r['name']}:")
print(f" Faculty page: {r['faculty_page_url'] or 'not found'}")
print(f" Supervisor count: {r['faculty_count']}")
# Save test results
with open('test_faculty_results.json', 'w', encoding='utf-8') as f:
json.dump(results, f, ensure_ascii=False, indent=2)
print(f"\nTest results saved to: test_faculty_results.json")
if __name__ == "__main__":
asyncio.run(test_faculty_scraper())
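For reference, a minimal sketch of consuming the saved results, based only on the fields written by test_faculty_scraper() above:

import json

with open("test_faculty_results.json", encoding="utf-8") as f:
    programs = json.load(f)

for program in programs:
    print(f"{program['name']}: {program['faculty_count']} supervisors via {program['faculty_page_url']}")
    for supervisor in program["faculty"][:3]:
        print(f"  - {supervisor['name']} ({supervisor['url']})")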

View File

@ -0,0 +1,464 @@
"""
Test Manchester University scraper - improved faculty mapping
"""
import asyncio
import json
from datetime import datetime, timezone
from playwright.async_api import async_playwright
MASTERS_PATHS = [
"/study/masters/courses/list/",
"/study/masters/courses/",
"/postgraduate/taught/courses/",
"/postgraduate/courses/list/",
"/postgraduate/courses/",
"/graduate/programs/",
"/academics/graduate/programs/",
"/programmes/masters/",
"/masters/programmes/",
"/admissions/graduate/programs/",
]
ACCOUNTING_STAFF_URL = "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/"
ACCOUNTING_STAFF_CACHE = None
JS_CHECK_COURSES = r"""() => {
const links = document.querySelectorAll('a[href]');
let courseCount = 0;
for (const a of links) {
const href = a.href.toLowerCase();
if (/\/\d{4,}\//.test(href) ||
/\/(msc|ma|mba|mres|llm|med|meng)-/.test(href) ||
/\/course\/[a-z]/.test(href)) {
courseCount++;
}
}
return courseCount;
}"""
JS_FIND_LIST_URL = """() => {
const links = document.querySelectorAll('a[href]');
for (const a of links) {
const text = a.innerText.toLowerCase();
const href = a.href.toLowerCase();
if ((text.includes('a-z') || text.includes('all course') ||
text.includes('full list') || text.includes('browse all') ||
href.includes('/list')) &&
(href.includes('master') || href.includes('course') || href.includes('postgrad'))) {
return a.href;
}
}
return null;
}"""
JS_FIND_COURSES_FROM_HOME = """() => {
const links = document.querySelectorAll('a[href]');
for (const a of links) {
const href = a.href.toLowerCase();
const text = a.innerText.toLowerCase();
if ((href.includes('master') || href.includes('postgraduate') || href.includes('graduate')) &&
(href.includes('course') || href.includes('program') || href.includes('degree'))) {
return a.href;
}
}
return null;
}"""
JS_EXTRACT_PROGRAMS = r"""() => {
const programs = [];
const seen = new Set();
const currentHost = window.location.hostname;
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href;
const text = a.innerText.trim().replace(/\s+/g, ' ');
if (!href || seen.has(href)) return;
if (text.length < 5 || text.length > 200) return;
if (href.includes('#') || href.includes('javascript:') || href.includes('mailto:')) return;
try {
const linkHost = new URL(href).hostname;
if (!linkHost.includes(currentHost.replace('www.', '')) &&
!currentHost.includes(linkHost.replace('www.', ''))) return;
} catch {
return;
}
const hrefLower = href.toLowerCase();
const textLower = text.toLowerCase();
const isNavigation = textLower === 'courses' ||
textLower === 'programmes' ||
textLower === 'undergraduate' ||
textLower === 'postgraduate' ||
textLower === 'masters' ||
textLower === "master's" ||
textLower.includes('skip to') ||
textLower.includes('share') ||
textLower === 'home' ||
textLower === 'study' ||
textLower.startsWith('a-z') ||
textLower.includes('admission') ||
textLower.includes('fees and funding') ||
textLower.includes('why should') ||
textLower.includes('why manchester') ||
textLower.includes('teaching and learning') ||
textLower.includes('meet us') ||
textLower.includes('student support') ||
textLower.includes('contact us') ||
textLower.includes('how to apply') ||
hrefLower.includes('/admissions/') ||
hrefLower.includes('/fees-and-funding/') ||
hrefLower.includes('/why-') ||
hrefLower.includes('/meet-us/') ||
hrefLower.includes('/contact-us/') ||
hrefLower.includes('/student-support/') ||
hrefLower.includes('/teaching-and-learning/') ||
hrefLower.endsWith('/courses/') ||
hrefLower.endsWith('/masters/') ||
hrefLower.endsWith('/postgraduate/');
if (isNavigation) return;
const isExcluded = hrefLower.includes('/undergraduate') ||
hrefLower.includes('/bachelor') ||
hrefLower.includes('/phd/') ||
hrefLower.includes('/doctoral') ||
hrefLower.includes('/research-degree') ||
textLower.includes('bachelor') ||
textLower.includes('undergraduate') ||
(textLower.includes('phd') && !textLower.includes('mphil'));
if (isExcluded) return;
const hasNumericId = /\/\d{4,}\//.test(href);
const hasDegreeSlug = /\/(msc|ma|mba|mres|llm|med|meng|mpa|mph|mphil)-[a-z]/.test(hrefLower);
const isCoursePage = (hrefLower.includes('/course/') ||
hrefLower.includes('/courses/list/') ||
hrefLower.includes('/programme/')) &&
href.split('/').filter(p => p).length > 4;
const textHasDegree = /(msc|ma|mba|mres|llm|med|meng|pgcert|pgdip)/i.test(text) ||
textLower.includes('master');
if (hasNumericId || hasDegreeSlug || isCoursePage || textHasDegree) {
seen.add(href);
programs.push({
name: text,
url: href
});
}
});
return programs;
}"""
JS_EXTRACT_FACULTY = r"""() => {
const faculty = [];
const seen = new Set();
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href.toLowerCase();
const text = a.innerText.trim();
if (seen.has(href)) return;
if (text.length < 3 || text.length > 100) return;
const isStaff = href.includes('/people/') ||
href.includes('/staff/') ||
href.includes('/faculty/') ||
href.includes('/profile/') ||
href.includes('/academics/') ||
href.includes('/researcher/');
if (isStaff) {
seen.add(href);
faculty.push({
name: text.replace(/\s+/g, ' '),
url: a.href
});
}
});
return faculty.slice(0, 20);
}"""
JS_EXTRACT_ACCOUNTING_STAFF = r"""() => {
const rows = Array.from(document.querySelectorAll('table tbody tr'));
const staff = [];
for (const row of rows) {
const cells = row.querySelectorAll('td');
if (!cells || cells.length < 2) {
continue;
}
const nameCell = cells[1];
const roleCell = cells[2];
const emailCell = cells[5];
let profileUrl = '';
let displayName = nameCell ? nameCell.innerText.trim() : '';
const link = nameCell ? nameCell.querySelector('a[href]') : null;
if (link) {
profileUrl = link.href;
displayName = link.innerText.trim() || displayName;
}
if (!displayName) {
continue;
}
let email = '';
if (emailCell) {
const emailLink = emailCell.querySelector('a[href^="mailto:"]');
if (emailLink) {
email = emailLink.href.replace('mailto:', '').trim();
}
}
staff.push({
name: displayName,
title: roleCell ? roleCell.innerText.trim() : '',
url: profileUrl,
email: email
});
}
return staff;
}"""
def should_use_accounting_staff(program_name: str) -> bool:
lower_name = program_name.lower()
return "msc" in lower_name and "accounting" in lower_name
async def load_accounting_staff(context, output_callback=None):
global ACCOUNTING_STAFF_CACHE
if ACCOUNTING_STAFF_CACHE is not None:
return ACCOUNTING_STAFF_CACHE
staff_page = await context.new_page()
try:
if output_callback:
output_callback("info", "Loading official AMBS Accounting & Finance staff page...")
await staff_page.goto(ACCOUNTING_STAFF_URL, wait_until="domcontentloaded", timeout=30000)
await staff_page.wait_for_timeout(2000)
ACCOUNTING_STAFF_CACHE = await staff_page.evaluate(JS_EXTRACT_ACCOUNTING_STAFF)
if output_callback:
output_callback("info", f"Captured {len(ACCOUNTING_STAFF_CACHE)} faculty from the official staff page")
except Exception as exc:
if output_callback:
output_callback("error", f"Failed to load AMBS staff page: {exc}")
ACCOUNTING_STAFF_CACHE = []
finally:
await staff_page.close()
return ACCOUNTING_STAFF_CACHE
async def find_course_list_page(page, base_url, output_callback):
for path in MASTERS_PATHS:
test_url = base_url.rstrip('/') + path
try:
response = await page.goto(test_url, wait_until="domcontentloaded", timeout=15000)
if response and response.status == 200:
title = await page.title()
if '404' not in title.lower() and 'not found' not in title.lower():
has_courses = await page.evaluate(JS_CHECK_COURSES)
if has_courses > 5:
if output_callback:
output_callback("info", f"Found course list: {path} ({has_courses} courses)")
return test_url
list_url = await page.evaluate(JS_FIND_LIST_URL)
if list_url:
if output_callback:
output_callback("info", f"Found full course list: {list_url}")
return list_url
except Exception:
continue
try:
await page.goto(base_url, wait_until="domcontentloaded", timeout=30000)
await page.wait_for_timeout(2000)
courses_url = await page.evaluate(JS_FIND_COURSES_FROM_HOME)
if courses_url:
return courses_url
except Exception:
pass
return None
async def extract_course_links(page, output_callback):
return await page.evaluate(JS_EXTRACT_PROGRAMS)
async def scrape(output_callback=None):
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
page = await context.new_page()
base_url = "https://www.manchester.ac.uk/"
result = {
"name": "Manchester University",
"url": base_url,
"scraped_at": datetime.now(timezone.utc).isoformat(),
"schools": []
}
all_programs = []
try:
if output_callback:
output_callback("info", "Searching for masters course list...")
courses_url = await find_course_list_page(page, base_url, output_callback)
if not courses_url:
if output_callback:
output_callback("warning", "Course list not found, using homepage")
courses_url = base_url
if output_callback:
output_callback("info", "Extracting masters programs...")
await page.goto(courses_url, wait_until="domcontentloaded", timeout=30000)
await page.wait_for_timeout(3000)
for _ in range(3):
try:
load_more = page.locator('button:has-text("Load more"), button:has-text("Show more"), button:has-text("View more"), a:has-text("Load more")')
if await load_more.count() > 0:
await load_more.first.click()
await page.wait_for_timeout(2000)
else:
break
except Exception:
break
programs_data = await extract_course_links(page, output_callback)
if output_callback:
output_callback("info", f"Found {len(programs_data)} masters programs")
print("\nTop 20 programs:")
for i, prog in enumerate(programs_data[:20]):
print(f" {i+1}. {prog['name'][:60]}")
print(f" {prog['url']}")
max_detail_pages = min(len(programs_data), 30)
detailed_processed = 0
logged_official_staff = False
for prog in programs_data:
faculty_data = []
used_official_staff = False
if should_use_accounting_staff(prog['name']):
staff_list = await load_accounting_staff(context, output_callback)
if staff_list:
used_official_staff = True
if output_callback and not logged_official_staff:
output_callback("info", "Using Alliance MBS Accounting & Finance staff directory for accounting programmes")
logged_official_staff = True
faculty_data = [
{
"name": person.get("name"),
"url": person.get("url") or ACCOUNTING_STAFF_URL,
"title": person.get("title"),
"email": person.get("email"),
"source": "Alliance Manchester Business School - Accounting & Finance staff"
}
for person in staff_list
]
elif detailed_processed < max_detail_pages:
detailed_processed += 1
if output_callback and detailed_processed % 10 == 0:
output_callback("info", f"Processing {detailed_processed}/{max_detail_pages}: {prog['name'][:50]}")
try:
await page.goto(prog['url'], wait_until="domcontentloaded", timeout=15000)
await page.wait_for_timeout(800)
faculty_data = await page.evaluate(JS_EXTRACT_FACULTY)
except Exception as e:
if output_callback:
output_callback("warning", f"Failed to capture faculty for {prog['name'][:50]}: {e}")
faculty_data = []
program_entry = {
"name": prog['name'],
"url": prog['url'],
"faculty": faculty_data
}
if used_official_staff:
program_entry["faculty_page_override"] = ACCOUNTING_STAFF_URL
all_programs.append(program_entry)
result["schools"] = [{
"name": "Masters Programs",
"url": courses_url,
"programs": all_programs
}]
if output_callback:
total_faculty = sum(len(p.get('faculty', [])) for p in all_programs)
output_callback("info", f"Done! {len(all_programs)} programs, {total_faculty} faculty")
except Exception as e:
if output_callback:
output_callback("error", f"Scraping error: {str(e)}")
finally:
await browser.close()
return result
def log_callback(level, message):
print(f"[{level.upper()}] {message}")
if __name__ == "__main__":
result = asyncio.run(scrape(output_callback=log_callback))
print("\n" + "="*60)
print("Scrape summary:")
print("="*60)
if result.get("schools"):
school = result["schools"][0]
programs = school.get("programs", [])
print(f"Course list URL: {school.get('url')}")
print(f"Total programs: {len(programs)}")
faculty_count = sum(len(p.get('faculty', [])) for p in programs)
print(f"Faculty total: {faculty_count}")
print("\nTop 10 programs:")
for i, p in enumerate(programs[:10]):
print(f" {i+1}. {p['name'][:60]}")
if p.get("faculty"):
print(f" Faculty entries: {len(p['faculty'])}")
with open("manchester_test_result.json", "w", encoding="utf-8") as f:
json.dump(result, f, indent=2, ensure_ascii=False)
print("\nSaved results to manchester_test_result.json")

25
backend/Dockerfile Normal file
View File

@ -0,0 +1,25 @@
FROM python:3.11-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
wget \
gnupg \
&& rm -rf /var/lib/apt/lists/*
# Install Playwright and its browser dependencies
RUN pip install playwright && playwright install chromium && playwright install-deps
# Copy dependency file
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Expose the port
EXPOSE 8000
# Start command
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

1
backend/app/__init__.py Normal file
View File

@ -0,0 +1 @@
"""University Scraper Web Backend"""

View File

@ -0,0 +1,15 @@
"""API路由"""
from fastapi import APIRouter
from .universities import router as universities_router
from .scripts import router as scripts_router
from .jobs import router as jobs_router
from .results import router as results_router
api_router = APIRouter()
api_router.include_router(universities_router, prefix="/universities", tags=["University management"])
api_router.include_router(scripts_router, prefix="/scripts", tags=["Scraper scripts"])
api_router.include_router(jobs_router, prefix="/jobs", tags=["Scrape jobs"])
api_router.include_router(results_router, prefix="/results", tags=["Scrape results"])
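app/main.py is not shown in this excerpt; a minimal sketch of how api_router would typically be mounted in FastAPI, assuming an /api prefix (the real prefix may differ):

# backend/app/main.py (sketch, assumed layout)
from fastapi import FastAPI

from .api import api_router

app = FastAPI(title="University Scraper")
app.include_router(api_router, prefix="/api")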

144
backend/app/api/jobs.py Normal file
View File

@ -0,0 +1,144 @@
"""爬取任务API"""
from typing import List
from datetime import datetime
from fastapi import APIRouter, Depends, HTTPException, BackgroundTasks
from sqlalchemy.orm import Session
from ..database import get_db
from ..models import University, ScraperScript, ScrapeJob, ScrapeLog
from ..schemas.job import JobResponse, JobStatusResponse, LogResponse
from ..services.scraper_runner import run_scraper
router = APIRouter()
@router.post("/start/{university_id}", response_model=JobResponse)
async def start_scrape_job(
university_id: int,
background_tasks: BackgroundTasks,
db: Session = Depends(get_db)
):
"""
一键运行爬虫
启动爬取任务,抓取大学项目和导师数据
"""
# 检查大学是否存在
university = db.query(University).filter(University.id == university_id).first()
if not university:
raise HTTPException(status_code=404, detail="大学不存在")
# 检查是否有活跃的脚本
script = db.query(ScraperScript).filter(
ScraperScript.university_id == university_id,
ScraperScript.status == "active"
).first()
if not script:
raise HTTPException(status_code=400, detail="No scraper script available; please generate one first")
# Check for a running job
running_job = db.query(ScrapeJob).filter(
ScrapeJob.university_id == university_id,
ScrapeJob.status == "running"
).first()
if running_job:
raise HTTPException(status_code=400, detail="A job is already running")
# Create the job
job = ScrapeJob(
university_id=university_id,
script_id=script.id,
status="pending",
progress=0,
current_step="Preparing..."
)
db.add(job)
db.commit()
db.refresh(job)
# Run the scraper in the background
background_tasks.add_task(
run_scraper,
job_id=job.id,
script_id=script.id
)
return job
@router.get("/{job_id}", response_model=JobResponse)
def get_job(
job_id: int,
db: Session = Depends(get_db)
):
"""获取任务详情"""
job = db.query(ScrapeJob).filter(ScrapeJob.id == job_id).first()
if not job:
raise HTTPException(status_code=404, detail="Job not found")
return job
@router.get("/{job_id}/status", response_model=JobStatusResponse)
def get_job_status(
job_id: int,
db: Session = Depends(get_db)
):
"""获取任务状态和日志"""
job = db.query(ScrapeJob).filter(ScrapeJob.id == job_id).first()
if not job:
raise HTTPException(status_code=404, detail="Job not found")
# Fetch the most recent logs
logs = db.query(ScrapeLog).filter(
ScrapeLog.job_id == job_id
).order_by(ScrapeLog.created_at.desc()).limit(50).all()
return JobStatusResponse(
id=job.id,
status=job.status,
progress=job.progress,
current_step=job.current_step,
logs=[LogResponse(
id=log.id,
level=log.level,
message=log.message,
created_at=log.created_at
) for log in reversed(logs)]
)
@router.get("/university/{university_id}", response_model=List[JobResponse])
def get_university_jobs(
university_id: int,
db: Session = Depends(get_db)
):
"""获取大学的所有任务"""
jobs = db.query(ScrapeJob).filter(
ScrapeJob.university_id == university_id
).order_by(ScrapeJob.created_at.desc()).limit(20).all()
return jobs
@router.post("/{job_id}/cancel")
def cancel_job(
job_id: int,
db: Session = Depends(get_db)
):
"""取消任务"""
job = db.query(ScrapeJob).filter(ScrapeJob.id == job_id).first()
if not job:
raise HTTPException(status_code=404, detail="Job not found")
if job.status not in ["pending", "running"]:
raise HTTPException(status_code=400, detail="Job already finished and cannot be cancelled")
job.status = "cancelled"
job.completed_at = datetime.utcnow()
db.commit()
return {"message": "Job cancelled"}

175
backend/app/api/results.py Normal file
View File

@ -0,0 +1,175 @@
"""爬取结果API"""
from typing import Optional
from fastapi import APIRouter, Depends, HTTPException, Query
from fastapi.responses import JSONResponse
from sqlalchemy.orm import Session
from ..database import get_db
from ..models import ScrapeResult
from ..schemas.result import ResultResponse
router = APIRouter()
@router.get("/university/{university_id}", response_model=ResultResponse)
def get_university_result(
university_id: int,
db: Session = Depends(get_db)
):
"""获取大学最新的爬取结果"""
result = db.query(ScrapeResult).filter(
ScrapeResult.university_id == university_id
).order_by(ScrapeResult.created_at.desc()).first()
if not result:
raise HTTPException(status_code=404, detail="没有爬取结果")
return result
@router.get("/university/{university_id}/schools")
def get_schools(
university_id: int,
db: Session = Depends(get_db)
):
"""获取学院列表"""
result = db.query(ScrapeResult).filter(
ScrapeResult.university_id == university_id
).order_by(ScrapeResult.created_at.desc()).first()
if not result:
raise HTTPException(status_code=404, detail="没有爬取结果")
schools = result.result_data.get("schools", [])
# 返回简化的学院列表
return {
"total": len(schools),
"schools": [
{
"name": s.get("name"),
"url": s.get("url"),
"program_count": len(s.get("programs", []))
}
for s in schools
]
}
@router.get("/university/{university_id}/programs")
def get_programs(
university_id: int,
school_name: Optional[str] = Query(None, description="Filter by school"),
search: Optional[str] = Query(None, description="Search program names"),
db: Session = Depends(get_db)
):
"""获取项目列表"""
result = db.query(ScrapeResult).filter(
ScrapeResult.university_id == university_id
).order_by(ScrapeResult.created_at.desc()).first()
if not result:
raise HTTPException(status_code=404, detail="没有爬取结果")
schools = result.result_data.get("schools", [])
programs = []
for school in schools:
if school_name and school.get("name") != school_name:
continue
for prog in school.get("programs", []):
if search and search.lower() not in prog.get("name", "").lower():
continue
programs.append({
"name": prog.get("name"),
"url": prog.get("url"),
"degree_type": prog.get("degree_type"),
"school": school.get("name"),
"faculty_count": len(prog.get("faculty", []))
})
return {
"total": len(programs),
"programs": programs
}
@router.get("/university/{university_id}/faculty")
def get_faculty(
university_id: int,
school_name: Optional[str] = Query(None, description="Filter by school"),
program_name: Optional[str] = Query(None, description="Filter by program"),
search: Optional[str] = Query(None, description="Search supervisor names"),
skip: int = Query(0, ge=0),
limit: int = Query(50, ge=1, le=200),
db: Session = Depends(get_db)
):
"""获取导师列表"""
result = db.query(ScrapeResult).filter(
ScrapeResult.university_id == university_id
).order_by(ScrapeResult.created_at.desc()).first()
if not result:
raise HTTPException(status_code=404, detail="没有爬取结果")
schools = result.result_data.get("schools", [])
faculty_list = []
for school in schools:
if school_name and school.get("name") != school_name:
continue
for prog in school.get("programs", []):
if program_name and prog.get("name") != program_name:
continue
for fac in prog.get("faculty", []):
if search and search.lower() not in fac.get("name", "").lower():
continue
faculty_list.append({
"name": fac.get("name"),
"url": fac.get("url"),
"title": fac.get("title"),
"email": fac.get("email"),
"program": prog.get("name"),
"school": school.get("name")
})
total = len(faculty_list)
faculty_list = faculty_list[skip:skip + limit]
return {
"total": total,
"skip": skip,
"limit": limit,
"faculty": faculty_list
}
@router.get("/university/{university_id}/export")
def export_result(
university_id: int,
format: str = Query("json", enum=["json"]),
db: Session = Depends(get_db)
):
"""导出爬取结果"""
result = db.query(ScrapeResult).filter(
ScrapeResult.university_id == university_id
).order_by(ScrapeResult.created_at.desc()).first()
if not result:
raise HTTPException(status_code=404, detail="没有爬取结果")
if format == "json":
return JSONResponse(
content=result.result_data,
headers={
"Content-Disposition": f"attachment; filename=university_{university_id}_result.json"
}
)
raise HTTPException(status_code=400, detail="Unsupported format")
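Continuing the sketch above (same assumed base URL and /api prefix), the faculty listing supports school/program filters plus pagination:

import httpx

resp = httpx.get(
    "http://localhost:8000/api/results/university/1/faculty",  # assumed /api prefix
    params={"search": "smith", "skip": 0, "limit": 50},
).json()
print(resp["total"], "matching supervisors")
for person in resp["faculty"]:
    print(person["name"], person["email"], person["program"])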

167
backend/app/api/scripts.py Normal file
View File

@ -0,0 +1,167 @@
"""爬虫脚本API"""
from typing import List
from fastapi import APIRouter, Depends, HTTPException, BackgroundTasks
from sqlalchemy.orm import Session
from ..database import get_db
from ..models import University, ScraperScript
from ..schemas.script import (
ScriptCreate,
ScriptResponse,
GenerateScriptRequest,
GenerateScriptResponse
)
from ..services.script_generator import generate_scraper_script
router = APIRouter()
@router.post("/generate", response_model=GenerateScriptResponse)
async def generate_script(
data: GenerateScriptRequest,
background_tasks: BackgroundTasks,
db: Session = Depends(get_db)
):
"""
一键生成爬虫脚本
分析大学网站结构,自动生成爬虫脚本
"""
# 检查或创建大学记录
university = db.query(University).filter(University.url == data.university_url).first()
if not university:
# 从URL提取大学名称
name = data.university_name
if not name:
from urllib.parse import urlparse
parsed = urlparse(data.university_url)
name = parsed.netloc.replace("www.", "").split(".")[0].title()
university = University(
name=name,
url=data.university_url,
status="analyzing"
)
db.add(university)
db.commit()
db.refresh(university)
else:
# Update the status
university.status = "analyzing"
db.commit()
# Generate the scraper script in the background
background_tasks.add_task(
generate_scraper_script,
university_id=university.id,
university_url=data.university_url
)
return GenerateScriptResponse(
success=True,
university_id=university.id,
script_id=None,
message="正在分析网站结构并生成爬虫脚本...",
status="analyzing"
)
@router.get("/university/{university_id}", response_model=List[ScriptResponse])
def get_university_scripts(
university_id: int,
db: Session = Depends(get_db)
):
"""获取大学的所有爬虫脚本"""
scripts = db.query(ScraperScript).filter(
ScraperScript.university_id == university_id
).order_by(ScraperScript.version.desc()).all()
return scripts
@router.get("/{script_id}", response_model=ScriptResponse)
def get_script(
script_id: int,
db: Session = Depends(get_db)
):
"""获取脚本详情"""
script = db.query(ScraperScript).filter(ScraperScript.id == script_id).first()
if not script:
raise HTTPException(status_code=404, detail="脚本不存在")
return script
@router.post("", response_model=ScriptResponse)
def create_script(
data: ScriptCreate,
db: Session = Depends(get_db)
):
"""手动创建脚本"""
# 检查大学是否存在
university = db.query(University).filter(University.id == data.university_id).first()
if not university:
raise HTTPException(status_code=404, detail="大学不存在")
# 获取当前最高版本
max_version = db.query(ScraperScript).filter(
ScraperScript.university_id == data.university_id
).count()
script = ScraperScript(
university_id=data.university_id,
script_name=data.script_name,
script_content=data.script_content,
config_content=data.config_content,
version=max_version + 1,
status="active"
)
db.add(script)
db.commit()
db.refresh(script)
# Update the university status
university.status = "ready"
db.commit()
return script
@router.put("/{script_id}", response_model=ScriptResponse)
def update_script(
script_id: int,
data: ScriptCreate,
db: Session = Depends(get_db)
):
"""更新脚本"""
script = db.query(ScraperScript).filter(ScraperScript.id == script_id).first()
if not script:
raise HTTPException(status_code=404, detail="脚本不存在")
script.script_content = data.script_content
if data.config_content:
script.config_content = data.config_content
db.commit()
db.refresh(script)
return script
@router.delete("/{script_id}")
def delete_script(
script_id: int,
db: Session = Depends(get_db)
):
"""删除脚本"""
script = db.query(ScraperScript).filter(ScraperScript.id == script_id).first()
if not script:
raise HTTPException(status_code=404, detail="脚本不存在")
db.delete(script)
db.commit()
return {"message": "删除成功"}

View File

@ -0,0 +1,165 @@
"""大学管理API"""
from typing import List, Optional
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy.orm import Session
from ..database import get_db
from ..models import University, ScrapeResult
from ..schemas.university import (
UniversityCreate,
UniversityUpdate,
UniversityResponse,
UniversityListResponse
)
router = APIRouter()
@router.get("", response_model=UniversityListResponse)
def list_universities(
skip: int = Query(0, ge=0),
limit: int = Query(20, ge=1, le=100),
search: Optional[str] = None,
db: Session = Depends(get_db)
):
"""获取大学列表"""
query = db.query(University)
if search:
query = query.filter(University.name.ilike(f"%{search}%"))
total = query.count()
universities = query.order_by(University.created_at.desc()).offset(skip).limit(limit).all()
# 添加统计信息
items = []
for uni in universities:
# 获取最新结果
latest_result = db.query(ScrapeResult).filter(
ScrapeResult.university_id == uni.id
).order_by(ScrapeResult.created_at.desc()).first()
items.append(UniversityResponse(
id=uni.id,
name=uni.name,
url=uni.url,
country=uni.country,
description=uni.description,
status=uni.status,
created_at=uni.created_at,
updated_at=uni.updated_at,
scripts_count=len(uni.scripts),
jobs_count=len(uni.jobs),
latest_result={
"schools_count": latest_result.schools_count,
"programs_count": latest_result.programs_count,
"faculty_count": latest_result.faculty_count,
"created_at": latest_result.created_at.isoformat()
} if latest_result else None
))
return UniversityListResponse(total=total, items=items)
@router.post("", response_model=UniversityResponse)
def create_university(
data: UniversityCreate,
db: Session = Depends(get_db)
):
"""创建大学"""
# 检查是否已存在
existing = db.query(University).filter(University.url == data.url).first()
if existing:
raise HTTPException(status_code=400, detail="该大学URL已存在")
university = University(**data.model_dump())
db.add(university)
db.commit()
db.refresh(university)
return UniversityResponse(
id=university.id,
name=university.name,
url=university.url,
country=university.country,
description=university.description,
status=university.status,
created_at=university.created_at,
updated_at=university.updated_at,
scripts_count=0,
jobs_count=0,
latest_result=None
)
@router.get("/{university_id}", response_model=UniversityResponse)
def get_university(
university_id: int,
db: Session = Depends(get_db)
):
"""获取大学详情"""
university = db.query(University).filter(University.id == university_id).first()
if not university:
raise HTTPException(status_code=404, detail="大学不存在")
# 获取最新结果
latest_result = db.query(ScrapeResult).filter(
ScrapeResult.university_id == university.id
).order_by(ScrapeResult.created_at.desc()).first()
return UniversityResponse(
id=university.id,
name=university.name,
url=university.url,
country=university.country,
description=university.description,
status=university.status,
created_at=university.created_at,
updated_at=university.updated_at,
scripts_count=len(university.scripts),
jobs_count=len(university.jobs),
latest_result={
"schools_count": latest_result.schools_count,
"programs_count": latest_result.programs_count,
"faculty_count": latest_result.faculty_count,
"created_at": latest_result.created_at.isoformat()
} if latest_result else None
)
@router.put("/{university_id}", response_model=UniversityResponse)
def update_university(
university_id: int,
data: UniversityUpdate,
db: Session = Depends(get_db)
):
"""更新大学信息"""
university = db.query(University).filter(University.id == university_id).first()
if not university:
raise HTTPException(status_code=404, detail="大学不存在")
update_data = data.model_dump(exclude_unset=True)
for field, value in update_data.items():
setattr(university, field, value)
db.commit()
db.refresh(university)
return get_university(university_id, db)
@router.delete("/{university_id}")
def delete_university(
university_id: int,
db: Session = Depends(get_db)
):
"""删除大学"""
university = db.query(University).filter(University.id == university_id).first()
if not university:
raise HTTPException(status_code=404, detail="大学不存在")
db.delete(university)
db.commit()
return {"message": "删除成功"}

37
backend/app/config.py Normal file
View File

@ -0,0 +1,37 @@
"""应用配置"""
from pydantic_settings import BaseSettings
from typing import Optional
class Settings(BaseSettings):
"""应用设置"""
# 应用配置
APP_NAME: str = "University Scraper API"
APP_VERSION: str = "1.0.0"
DEBUG: bool = True
# 数据库配置
DATABASE_URL: str = "sqlite:///./university_scraper.db" # 开发环境使用SQLite
# 生产环境使用: postgresql://user:password@localhost/university_scraper
# Redis配置 (用于Celery任务队列)
REDIS_URL: str = "redis://localhost:6379/0"
# CORS配置
CORS_ORIGINS: list = ["http://localhost:3000", "http://127.0.0.1:3000"]
# Agent配置 (用于自动生成脚本)
OPENROUTER_API_KEY: Optional[str] = None
# 文件存储路径
SCRIPTS_DIR: str = "./scripts"
RESULTS_DIR: str = "./results"
class Config:
env_file = ".env"
case_sensitive = True
settings = Settings()
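Because `Settings` is built on pydantic-settings, every field can be overridden through environment variables (or a `.env` file) before the module is imported. A sketch, assuming the backend package is importable as `app` from the `backend/` directory:

```python
import os

# Override before the first import of app.config, which instantiates Settings().
os.environ["DATABASE_URL"] = "postgresql://postgres:postgres@localhost/university_scraper"
os.environ["DEBUG"] = "false"   # pydantic parses this into a bool

from app.config import settings  # import path is an assumption

print(settings.DATABASE_URL)
print(settings.DEBUG)            # -> False
```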

35
backend/app/database.py Normal file
View File

@ -0,0 +1,35 @@
"""数据库连接和会话管理"""
from sqlalchemy import create_engine
from sqlalchemy.orm import declarative_base, sessionmaker
from .config import settings
# 创建数据库引擎
engine = create_engine(
settings.DATABASE_URL,
connect_args={"check_same_thread": False} if "sqlite" in settings.DATABASE_URL else {},
echo=settings.DEBUG
)
# 创建会话工厂
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
# 声明基类
Base = declarative_base()
def get_db():
"""获取数据库会话 (依赖注入)"""
db = SessionLocal()
try:
yield db
finally:
db.close()
def init_db():
"""初始化数据库 (创建所有表)"""
from .models import university, script, job, result # noqa
Base.metadata.create_all(bind=engine)
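`get_db()` is the FastAPI dependency; background code (like the services later in this diff) opens `SessionLocal()` directly and closes it itself. A minimal standalone sketch, assuming the backend package is importable as `app`:

```python
from app.database import SessionLocal, init_db   # import path is an assumption
from app.models import University

init_db()  # create all tables (SQLite file in dev)

db = SessionLocal()
try:
    db.add(University(name="Example University", url="https://www.example.ac.uk/"))
    db.commit()
    print(db.query(University).count(), "universities stored")
finally:
    db.close()
```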

72
backend/app/main.py Normal file
View File

@ -0,0 +1,72 @@
"""
University Scraper Web API
主应用入口
"""
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from .config import settings
from .database import init_db
from .api import api_router
# 创建应用
app = FastAPI(
title=settings.APP_NAME,
version=settings.APP_VERSION,
description="""
## 大学爬虫Web系统 API
### 功能
- 🏫 **大学管理**: 添加、编辑、删除大学
- 📜 **脚本生成**: 一键生成爬虫脚本
- 🚀 **任务执行**: 一键运行爬虫
- 📊 **数据查看**: 查看和导出爬取结果
### 数据结构
大学 → 学院 → 项目 → 导师
""",
docs_url="/docs",
redoc_url="/redoc"
)
# 配置CORS
app.add_middleware(
CORSMiddleware,
allow_origins=settings.CORS_ORIGINS,
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# 注册路由
app.include_router(api_router, prefix="/api")
@app.on_event("startup")
async def startup_event():
"""应用启动时初始化数据库"""
init_db()
@app.get("/")
async def root():
"""根路由"""
return {
"name": settings.APP_NAME,
"version": settings.APP_VERSION,
"docs": "/docs",
"api": "/api"
}
@app.get("/health")
async def health_check():
"""健康检查"""
return {"status": "healthy"}
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
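A quick smoke test with FastAPI's `TestClient`; using it as a context manager fires the startup event, so `init_db()` runs before the requests. The `app` import path is an assumption:

```python
from fastapi.testclient import TestClient
from app.main import app

with TestClient(app) as client:   # startup/shutdown events run inside the context
    assert client.get("/health").json() == {"status": "healthy"}
    print(client.get("/").json()["docs"])   # -> /docs
```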

View File

@ -0,0 +1,8 @@
"""数据库模型"""
from .university import University
from .script import ScraperScript
from .job import ScrapeJob, ScrapeLog
from .result import ScrapeResult
__all__ = ["University", "ScraperScript", "ScrapeJob", "ScrapeLog", "ScrapeResult"]

56
backend/app/models/job.py Normal file
View File

@ -0,0 +1,56 @@
"""爬取任务模型"""
from datetime import datetime
from sqlalchemy import Column, Integer, String, DateTime, Text, ForeignKey
from sqlalchemy.orm import relationship
from ..database import Base
class ScrapeJob(Base):
"""爬取任务表"""
__tablename__ = "scrape_jobs"
id = Column(Integer, primary_key=True, index=True)
university_id = Column(Integer, ForeignKey("universities.id"), nullable=False)
script_id = Column(Integer, ForeignKey("scraper_scripts.id"))
status = Column(String(50), default="pending") # pending, running, completed, failed, cancelled
progress = Column(Integer, default=0) # 0-100 进度百分比
current_step = Column(String(255)) # 当前步骤描述
started_at = Column(DateTime)
completed_at = Column(DateTime)
error_message = Column(Text)
created_at = Column(DateTime, default=datetime.utcnow)
# 关联
university = relationship("University", back_populates="jobs")
script = relationship("ScraperScript", back_populates="jobs")
logs = relationship("ScrapeLog", back_populates="job", cascade="all, delete-orphan")
results = relationship("ScrapeResult", back_populates="job", cascade="all, delete-orphan")
def __repr__(self):
return f"<ScrapeJob(id={self.id}, status='{self.status}')>"
class ScrapeLog(Base):
"""爬取日志表"""
__tablename__ = "scrape_logs"
id = Column(Integer, primary_key=True, index=True)
job_id = Column(Integer, ForeignKey("scrape_jobs.id"), nullable=False)
level = Column(String(20), default="info") # debug, info, warning, error
message = Column(Text, nullable=False)
created_at = Column(DateTime, default=datetime.utcnow)
# 关联
job = relationship("ScrapeJob", back_populates="logs")
def __repr__(self):
return f"<ScrapeLog(id={self.id}, level='{self.level}')>"

View File

@ -0,0 +1,34 @@
"""爬取结果模型"""
from datetime import datetime
from sqlalchemy import Column, Integer, DateTime, ForeignKey, JSON
from sqlalchemy.orm import relationship
from ..database import Base
class ScrapeResult(Base):
"""爬取结果表"""
__tablename__ = "scrape_results"
id = Column(Integer, primary_key=True, index=True)
job_id = Column(Integer, ForeignKey("scrape_jobs.id"))
university_id = Column(Integer, ForeignKey("universities.id"), nullable=False)
# JSON数据: 学院 → 项目 → 导师 层级结构
result_data = Column(JSON, nullable=False)
# 统计信息
schools_count = Column(Integer, default=0)
programs_count = Column(Integer, default=0)
faculty_count = Column(Integer, default=0)
created_at = Column(DateTime, default=datetime.utcnow)
# 关联
job = relationship("ScrapeJob", back_populates="results")
university = relationship("University", back_populates="results")
def __repr__(self):
return f"<ScrapeResult(id={self.id}, programs={self.programs_count}, faculty={self.faculty_count})>"

View File

@ -0,0 +1,34 @@
"""爬虫脚本模型"""
from datetime import datetime
from sqlalchemy import Column, Integer, String, DateTime, Text, ForeignKey, JSON
from sqlalchemy.orm import relationship
from ..database import Base
class ScraperScript(Base):
"""爬虫脚本表"""
__tablename__ = "scraper_scripts"
id = Column(Integer, primary_key=True, index=True)
university_id = Column(Integer, ForeignKey("universities.id"), nullable=False)
script_name = Column(String(255), nullable=False)
script_content = Column(Text, nullable=False) # Python脚本代码
config_content = Column(JSON) # YAML配置转为JSON存储
version = Column(Integer, default=1)
status = Column(String(50), default="draft") # draft, active, deprecated, error
error_message = Column(Text)
created_at = Column(DateTime, default=datetime.utcnow)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
# 关联
university = relationship("University", back_populates="scripts")
jobs = relationship("ScrapeJob", back_populates="script")
def __repr__(self):
return f"<ScraperScript(id={self.id}, name='{self.script_name}')>"

View File

@ -0,0 +1,31 @@
"""大学模型"""
from datetime import datetime
from sqlalchemy import Column, Integer, String, DateTime, Text
from sqlalchemy.orm import relationship
from ..database import Base
class University(Base):
"""大学表"""
__tablename__ = "universities"
id = Column(Integer, primary_key=True, index=True)
name = Column(String(255), nullable=False, index=True)
url = Column(String(500), nullable=False)
country = Column(String(100))
description = Column(Text)
status = Column(String(50), default="pending") # pending, analyzing, ready, error
created_at = Column(DateTime, default=datetime.utcnow)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
# 关联
scripts = relationship("ScraperScript", back_populates="university", cascade="all, delete-orphan")
jobs = relationship("ScrapeJob", back_populates="university", cascade="all, delete-orphan")
results = relationship("ScrapeResult", back_populates="university", cascade="all, delete-orphan")
def __repr__(self):
return f"<University(id={self.id}, name='{self.name}')>"

View File

@ -0,0 +1,33 @@
"""Pydantic schemas for API"""
from .university import (
UniversityCreate,
UniversityUpdate,
UniversityResponse,
UniversityListResponse
)
from .script import (
ScriptCreate,
ScriptResponse,
GenerateScriptRequest,
GenerateScriptResponse
)
from .job import (
JobCreate,
JobResponse,
JobStatusResponse,
LogResponse
)
from .result import (
ResultResponse,
SchoolData,
ProgramData,
FacultyData
)
__all__ = [
"UniversityCreate", "UniversityUpdate", "UniversityResponse", "UniversityListResponse",
"ScriptCreate", "ScriptResponse", "GenerateScriptRequest", "GenerateScriptResponse",
"JobCreate", "JobResponse", "JobStatusResponse", "LogResponse",
"ResultResponse", "SchoolData", "ProgramData", "FacultyData"
]

View File

@ -0,0 +1,52 @@
"""爬取任务相关的Pydantic模型"""
from datetime import datetime
from typing import Optional, List
from pydantic import BaseModel
class JobCreate(BaseModel):
"""创建任务请求"""
university_id: int
script_id: Optional[int] = None
class JobResponse(BaseModel):
"""任务响应"""
id: int
university_id: int
script_id: Optional[int] = None
status: str
progress: int
current_step: Optional[str] = None
started_at: Optional[datetime] = None
completed_at: Optional[datetime] = None
error_message: Optional[str] = None
created_at: datetime
class Config:
from_attributes = True
class JobStatusResponse(BaseModel):
"""任务状态响应"""
id: int
status: str
progress: int
current_step: Optional[str] = None
logs: List["LogResponse"] = []
class LogResponse(BaseModel):
"""日志响应"""
id: int
level: str
message: str
created_at: datetime
class Config:
from_attributes = True
# 解决循环引用
JobStatusResponse.model_rebuild()

View File

@ -0,0 +1,67 @@
"""爬取结果相关的Pydantic模型"""
from datetime import datetime
from typing import Optional, List, Dict, Any
from pydantic import BaseModel
class FacultyData(BaseModel):
"""导师数据"""
name: str
url: str
title: Optional[str] = None
email: Optional[str] = None
department: Optional[str] = None
class ProgramData(BaseModel):
"""项目数据"""
name: str
url: str
degree_type: Optional[str] = None
description: Optional[str] = None
faculty_page_url: Optional[str] = None
faculty_count: int = 0
faculty: List[FacultyData] = []
class SchoolData(BaseModel):
"""学院数据"""
name: str
url: str
description: Optional[str] = None
program_count: int = 0
programs: List[ProgramData] = []
class ResultResponse(BaseModel):
"""完整结果响应"""
id: int
university_id: int
job_id: Optional[int] = None
# 统计
schools_count: int
programs_count: int
faculty_count: int
# 完整数据
result_data: Dict[str, Any]
created_at: datetime
class Config:
from_attributes = True
class ResultSummary(BaseModel):
"""结果摘要"""
id: int
university_id: int
schools_count: int
programs_count: int
faculty_count: int
created_at: datetime
class Config:
from_attributes = True
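These schemas mirror the school → program → faculty hierarchy stored in `result_data`. A small validation sketch (the `app.schemas.result` import path is an assumption):

```python
from app.schemas.result import SchoolData, ProgramData, FacultyData

school = SchoolData(
    name="Masters Programs",
    url="https://www.example.ac.uk/study/masters/",
    programs=[
        ProgramData(
            name="MSc Computer Science",
            url="https://www.example.ac.uk/courses/msc-computer-science/",
            faculty=[FacultyData(name="Dr. Jane Doe",
                                 url="https://www.example.ac.uk/people/jane-doe/")],
        )
    ],
)
school.program_count = len(school.programs)
print(school.model_dump_json(indent=2))
```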

View File

@ -0,0 +1,46 @@
"""爬虫脚本相关的Pydantic模型"""
from datetime import datetime
from typing import Optional, Dict, Any
from pydantic import BaseModel
class ScriptBase(BaseModel):
"""脚本基础字段"""
script_name: str
script_content: str
config_content: Optional[Dict[str, Any]] = None
class ScriptCreate(ScriptBase):
"""创建脚本请求"""
university_id: int
class ScriptResponse(ScriptBase):
"""脚本响应"""
id: int
university_id: int
version: int
status: str
error_message: Optional[str] = None
created_at: datetime
updated_at: datetime
class Config:
from_attributes = True
class GenerateScriptRequest(BaseModel):
"""生成脚本请求"""
university_url: str
university_name: Optional[str] = None
class GenerateScriptResponse(BaseModel):
"""生成脚本响应"""
success: bool
university_id: int
script_id: Optional[int] = None
message: str
status: str # analyzing, completed, failed

View File

@ -0,0 +1,48 @@
"""大学相关的Pydantic模型"""
from datetime import datetime
from typing import Optional, List
from pydantic import BaseModel, HttpUrl
class UniversityBase(BaseModel):
"""大学基础字段"""
name: str
url: str
country: Optional[str] = None
description: Optional[str] = None
class UniversityCreate(UniversityBase):
"""创建大学请求"""
pass
class UniversityUpdate(BaseModel):
"""更新大学请求"""
name: Optional[str] = None
url: Optional[str] = None
country: Optional[str] = None
description: Optional[str] = None
class UniversityResponse(UniversityBase):
"""大学响应"""
id: int
status: str
created_at: datetime
updated_at: datetime
# 统计信息
scripts_count: int = 0
jobs_count: int = 0
latest_result: Optional[dict] = None
class Config:
from_attributes = True
class UniversityListResponse(BaseModel):
"""大学列表响应"""
total: int
items: List[UniversityResponse]

View File

@ -0,0 +1,6 @@
"""业务服务"""
from .script_generator import generate_scraper_script
from .scraper_runner import run_scraper
__all__ = ["generate_scraper_script", "run_scraper"]

View File

@ -0,0 +1,177 @@
"""
爬虫执行服务
运行爬虫脚本并保存结果
"""
import asyncio
import json
import re
import sys
import traceback
from datetime import datetime, timezone
from urllib.parse import urljoin, urlparse
from sqlalchemy.orm import Session
# Windows 上需要设置事件循环策略
if sys.platform == "win32":
asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
# 导入playwright供脚本使用
try:
from playwright.async_api import async_playwright
PLAYWRIGHT_AVAILABLE = True
except ImportError:
PLAYWRIGHT_AVAILABLE = False
async_playwright = None
from ..database import SessionLocal
from ..models import ScraperScript, ScrapeJob, ScrapeLog, ScrapeResult
def run_scraper(job_id: int, script_id: int):
"""
执行爬虫的后台任务
"""
db = SessionLocal()
try:
job = db.query(ScrapeJob).filter(ScrapeJob.id == job_id).first()
script = db.query(ScraperScript).filter(ScraperScript.id == script_id).first()
if not job or not script:
return
# 更新任务状态
job.status = "running"
job.started_at = datetime.utcnow()
job.current_step = "正在初始化..."
job.progress = 5
db.commit()
_add_log(db, job_id, "info", "开始执行爬虫脚本")
# 创建日志回调函数
def log_callback(level: str, message: str):
_add_log(db, job_id, level, message)
# 执行脚本
job.current_step = "正在爬取数据..."
job.progress = 20
db.commit()
result_data = _execute_script(script.script_content, log_callback)
if result_data:
job.progress = 80
job.current_step = "正在保存结果..."
db.commit()
_add_log(db, job_id, "info", "爬取完成,正在保存结果...")
# 计算统计信息
schools = result_data.get("schools", [])
schools_count = len(schools)
programs_count = sum(len(s.get("programs", [])) for s in schools)
faculty_count = sum(
len(p.get("faculty", []))
for s in schools
for p in s.get("programs", [])
)
# 保存结果
result = ScrapeResult(
job_id=job_id,
university_id=job.university_id,
result_data=result_data,
schools_count=schools_count,
programs_count=programs_count,
faculty_count=faculty_count
)
db.add(result)
job.status = "completed"
job.progress = 100
job.current_step = "完成"
job.completed_at = datetime.utcnow()
_add_log(
db, job_id, "info",
f"爬取成功: {schools_count}个学院, {programs_count}个项目, {faculty_count}位导师"
)
else:
job.status = "failed"
job.error_message = "脚本执行无返回结果"
job.completed_at = datetime.utcnow()
_add_log(db, job_id, "error", "脚本执行失败: 无返回结果")
db.commit()
except Exception as e:
error_msg = f"执行出错: {str(e)}\n{traceback.format_exc()}"
_add_log(db, job_id, "error", error_msg)
job = db.query(ScrapeJob).filter(ScrapeJob.id == job_id).first()
if job:
job.status = "failed"
job.error_message = str(e)
job.completed_at = datetime.utcnow()
db.commit()
finally:
db.close()
def _execute_script(script_content: str, log_callback) -> dict:
"""
执行Python脚本内容
注意:脚本通过 exec 在当前进程中执行,并非真正的沙箱隔离,只应运行可信的自动生成脚本
"""
if not PLAYWRIGHT_AVAILABLE:
log_callback("error", "Playwright 未安装,请运行: pip install playwright && playwright install")
return None
# 创建执行环境 - 包含脚本需要的所有模块
# 注意:使用同一个字典作为 globals 和 locals确保函数定义可以互相访问
exec_namespace = {
"__builtins__": __builtins__,
"asyncio": asyncio,
"json": json,
"re": re,
"datetime": datetime,
"timezone": timezone,
"urljoin": urljoin,
"urlparse": urlparse,
"async_playwright": async_playwright,
}
try:
# 编译并执行脚本 - 使用同一个命名空间确保函数可互相调用
exec(script_content, exec_namespace, exec_namespace)
# 获取scrape函数
scrape_func = exec_namespace.get("scrape")
if not scrape_func:
log_callback("error", "脚本中未找到 scrape 函数")
return None
# 运行异步爬虫函数
result = asyncio.run(scrape_func(output_callback=log_callback))
return result
except Exception as e:
log_callback("error", f"脚本执行异常: {str(e)}")
raise
def _add_log(db: Session, job_id: int, level: str, message: str):
"""添加日志"""
log = ScrapeLog(
job_id=job_id,
level=level,
message=message
)
db.add(log)
db.commit()
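`_execute_script` only requires the stored text to define an async `scrape(output_callback=None)` that returns the schools/programs/faculty dict. A Playwright-free sketch of that contract, executed the same way (one shared namespace for `exec`):

```python
import asyncio

MINIMAL_SCRIPT = '''
async def scrape(output_callback=None):
    if output_callback:
        output_callback("info", "demo scraper started")
    return {"schools": [{"name": "Demo School", "url": "https://example.edu/",
                         "programs": [{"name": "MSc Demo",
                                       "url": "https://example.edu/msc-demo/",
                                       "faculty": []}]}]}
'''

ns: dict = {}
exec(MINIMAL_SCRIPT, ns, ns)                    # single namespace, as in _execute_script
result = asyncio.run(ns["scrape"](output_callback=lambda lvl, msg: print(f"[{lvl}] {msg}")))
print(result["schools"][0]["programs"][0]["name"])   # -> MSc Demo
```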

View File

@ -0,0 +1,558 @@
"""
爬虫脚本生成服务
分析大学网站结构,自动生成爬虫脚本
"""
import re
from datetime import datetime
from urllib.parse import urlparse
from sqlalchemy.orm import Session
from ..database import SessionLocal
from ..models import University, ScraperScript
# 预置的大学爬虫脚本模板
SCRAPER_TEMPLATES = {
"harvard.edu": "harvard_scraper",
"mit.edu": "generic_scraper",
"stanford.edu": "generic_scraper",
}
def generate_scraper_script(university_id: int, university_url: str):
"""
生成爬虫脚本的后台任务
1. 分析大学网站域名
2. 如果有预置模板则使用模板
3. 否则生成通用爬虫脚本
"""
db = SessionLocal()
university = None
try:
university = db.query(University).filter(University.id == university_id).first()
if not university:
return
# 解析URL获取域名
parsed = urlparse(university_url)
domain = parsed.netloc.replace("www.", "")
# 检查是否有预置模板
template_name = None
for pattern, template in SCRAPER_TEMPLATES.items():
if pattern in domain:
template_name = template
break
# 生成脚本
script_content = _generate_script_content(domain, template_name)
config_content = _generate_config_content(university.name, university_url, domain)
# 计算版本号
existing_count = db.query(ScraperScript).filter(
ScraperScript.university_id == university_id
).count()
# 保存脚本
script = ScraperScript(
university_id=university_id,
script_name=f"{domain.replace('.', '_')}_scraper",
script_content=script_content,
config_content=config_content,
version=existing_count + 1,
status="active"
)
db.add(script)
# 更新大学状态
university.status = "ready"
db.commit()
except Exception as e:
# 记录错误
if university:
university.status = "error"
db.commit()
raise e
finally:
db.close()
def _generate_script_content(domain: str, template_name: str = None) -> str:
"""生成Python爬虫脚本内容"""
if template_name == "harvard_scraper":
return '''"""
Harvard University 专用爬虫脚本
自动生成
"""
import asyncio
import json
from datetime import datetime, timezone
from playwright.async_api import async_playwright
# 学院URL映射
SCHOOL_MAPPING = {
"gsas.harvard.edu": "Graduate School of Arts and Sciences (GSAS)",
"seas.harvard.edu": "John A. Paulson School of Engineering and Applied Sciences (SEAS)",
"hbs.edu": "Harvard Business School (HBS)",
"gsd.harvard.edu": "Graduate School of Design (GSD)",
"gse.harvard.edu": "Graduate School of Education (HGSE)",
"hks.harvard.edu": "Harvard Kennedy School (HKS)",
"hls.harvard.edu": "Harvard Law School (HLS)",
"hms.harvard.edu": "Harvard Medical School (HMS)",
"hsph.harvard.edu": "T.H. Chan School of Public Health (HSPH)",
"hds.harvard.edu": "Harvard Divinity School (HDS)",
"fas.harvard.edu": "Faculty of Arts and Sciences (FAS)",
}
async def scrape(output_callback=None):
"""执行爬取"""
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
result = {
"name": "Harvard University",
"url": "https://www.harvard.edu/",
"country": "USA",
"scraped_at": datetime.now(timezone.utc).isoformat(),
"schools": []
}
# 访问项目列表页
if output_callback:
output_callback("info", "访问Harvard项目列表...")
await page.goto("https://www.harvard.edu/programs/?degree_levels=graduate")
await page.wait_for_timeout(3000)
# 提取项目数据
programs = await page.evaluate("""() => {
const items = document.querySelectorAll('[class*="records__record"]');
const programs = [];
items.forEach(item => {
const btn = item.querySelector('button[class*="title-link"]');
if (btn) {
programs.push({
name: btn.innerText.trim(),
url: ''
});
}
});
return programs;
}""")
if output_callback:
output_callback("info", f"找到 {len(programs)} 个项目")
# 简化输出
result["schools"] = [{
"name": "Graduate Programs",
"url": "https://www.harvard.edu/programs/",
"programs": [{"name": p["name"], "url": p["url"], "faculty": []} for p in programs[:50]]
}]
await browser.close()
return result
if __name__ == "__main__":
result = asyncio.run(scrape())
print(json.dumps(result, indent=2, ensure_ascii=False))
'''
# 通用爬虫模板 - 深度爬取硕士项目
# 将原始字符串形式的 JS 代码块代入 f-string 模板,避免 JS 花括号与 f-string 占位符冲突
return _build_generic_scraper_template(domain)
def _build_generic_scraper_template(domain: str) -> str:
"""构建通用爬虫模板"""
# JavaScript code blocks (use raw strings to avoid escaping issues)
js_check_courses = r'''() => {
const links = document.querySelectorAll('a[href]');
let courseCount = 0;
for (const a of links) {
const href = a.href.toLowerCase();
if (/\/\d{4,}\//.test(href) ||
/\/(msc|ma|mba|mres|llm|med|meng)-/.test(href) ||
/\/course\/[a-z]/.test(href)) {
courseCount++;
}
}
return courseCount;
}'''
js_find_list_url = r'''() => {
const links = document.querySelectorAll('a[href]');
for (const a of links) {
const text = a.innerText.toLowerCase();
const href = a.href.toLowerCase();
if ((text.includes('a-z') || text.includes('all course') ||
text.includes('full list') || text.includes('browse all') ||
href.includes('/list')) &&
(href.includes('master') || href.includes('course') || href.includes('postgrad'))) {
return a.href;
}
}
return null;
}'''
js_find_courses_from_home = r'''() => {
const links = document.querySelectorAll('a[href]');
for (const a of links) {
const href = a.href.toLowerCase();
const text = a.innerText.toLowerCase();
if ((href.includes('master') || href.includes('postgraduate') || href.includes('graduate')) &&
(href.includes('course') || href.includes('program') || href.includes('degree'))) {
return a.href;
}
}
return null;
}'''
js_extract_programs = r'''() => {
const programs = [];
const seen = new Set();
const currentHost = window.location.hostname;
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href;
const text = a.innerText.trim().replace(/\s+/g, ' ');
if (!href || seen.has(href)) return;
if (text.length < 5 || text.length > 200) return;
if (href.includes('#') || href.includes('javascript:') || href.includes('mailto:')) return;
try {
const linkHost = new URL(href).hostname;
if (!linkHost.includes(currentHost.replace('www.', '')) &&
!currentHost.includes(linkHost.replace('www.', ''))) return;
} catch {
return;
}
const hrefLower = href.toLowerCase();
const textLower = text.toLowerCase();
const isNavigation = textLower === 'courses' ||
textLower === 'programmes' ||
textLower === 'undergraduate' ||
textLower === 'postgraduate' ||
textLower === 'masters' ||
textLower === "master's" ||
textLower.includes('skip to') ||
textLower.includes('share') ||
textLower === 'home' ||
textLower === 'study' ||
textLower.startsWith('a-z') ||
textLower.includes('admission') ||
textLower.includes('fees and funding') ||
textLower.includes('why should') ||
textLower.includes('why manchester') ||
textLower.includes('teaching and learning') ||
textLower.includes('meet us') ||
textLower.includes('student support') ||
textLower.includes('contact us') ||
textLower.includes('how to apply') ||
hrefLower.includes('/admissions/') ||
hrefLower.includes('/fees-and-funding/') ||
hrefLower.includes('/why-') ||
hrefLower.includes('/meet-us/') ||
hrefLower.includes('/contact-us/') ||
hrefLower.includes('/student-support/') ||
hrefLower.includes('/teaching-and-learning/') ||
hrefLower.endsWith('/courses/') ||
hrefLower.endsWith('/masters/') ||
hrefLower.endsWith('/postgraduate/');
if (isNavigation) return;
const isExcluded = hrefLower.includes('/undergraduate') ||
hrefLower.includes('/bachelor') ||
hrefLower.includes('/phd/') ||
hrefLower.includes('/doctoral') ||
hrefLower.includes('/research-degree') ||
textLower.includes('bachelor') ||
textLower.includes('undergraduate') ||
(textLower.includes('phd') && !textLower.includes('mphil'));
if (isExcluded) return;
const hasNumericId = /\/\d{4,}\//.test(href);
const hasDegreeSlug = /\/(msc|ma|mba|mres|llm|med|meng|mpa|mph|mphil)-[a-z]/.test(hrefLower);
const isCoursePage = (hrefLower.includes('/course/') ||
hrefLower.includes('/courses/list/') ||
hrefLower.includes('/programme/')) &&
href.split('/').filter(p => p).length > 4;
const textHasDegree = /\b(msc|ma|mba|mres|llm|med|meng|pgcert|pgdip)\b/i.test(text) ||
textLower.includes('master');
if (hasNumericId || hasDegreeSlug || isCoursePage || textHasDegree) {
seen.add(href);
programs.push({
name: text,
url: href
});
}
});
return programs;
}'''
js_extract_faculty = r'''() => {
const faculty = [];
const seen = new Set();
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href.toLowerCase();
const text = a.innerText.trim();
if (seen.has(href)) return;
if (text.length < 3 || text.length > 100) return;
const isStaff = href.includes('/people/') ||
href.includes('/staff/') ||
href.includes('/faculty/') ||
href.includes('/profile/') ||
href.includes('/academics/') ||
href.includes('/researcher/');
if (isStaff) {
seen.add(href);
faculty.push({
name: text.replace(/\s+/g, ' '),
url: a.href
});
}
});
return faculty.slice(0, 20);
}'''
university_name = domain.split('.')[0].title()
template = f'''"""
通用大学爬虫脚本
目标: {domain}
自动生成 - 深度爬取硕士项目和导师信息
"""
import asyncio
import json
import re
from datetime import datetime, timezone
from urllib.parse import urljoin, urlparse
from playwright.async_api import async_playwright
MASTERS_PATHS = [
"/study/masters/courses/list/",
"/study/masters/courses/",
"/postgraduate/taught/courses/",
"/postgraduate/courses/list/",
"/postgraduate/courses/",
"/graduate/programs/",
"/academics/graduate/programs/",
"/programmes/masters/",
"/masters/programmes/",
"/admissions/graduate/programs/",
]
JS_CHECK_COURSES = """{js_check_courses}"""
JS_FIND_LIST_URL = """{js_find_list_url}"""
JS_FIND_COURSES_FROM_HOME = """{js_find_courses_from_home}"""
JS_EXTRACT_PROGRAMS = """{js_extract_programs}"""
JS_EXTRACT_FACULTY = """{js_extract_faculty}"""
async def find_course_list_page(page, base_url, output_callback):
for path in MASTERS_PATHS:
test_url = base_url.rstrip('/') + path
try:
response = await page.goto(test_url, wait_until="domcontentloaded", timeout=15000)
if response and response.status == 200:
title = await page.title()
if '404' not in title.lower() and 'not found' not in title.lower():
has_courses = await page.evaluate(JS_CHECK_COURSES)
if has_courses > 5:
if output_callback:
output_callback("info", f"Found course list: {{path}} ({{has_courses}} courses)")
return test_url
list_url = await page.evaluate(JS_FIND_LIST_URL)
if list_url:
if output_callback:
output_callback("info", f"Found full course list: {{list_url}}")
return list_url
except:
continue
try:
await page.goto(base_url, wait_until="domcontentloaded", timeout=30000)
await page.wait_for_timeout(2000)
courses_url = await page.evaluate(JS_FIND_COURSES_FROM_HOME)
if courses_url:
return courses_url
except:
pass
return None
async def extract_course_links(page, output_callback):
return await page.evaluate(JS_EXTRACT_PROGRAMS)
async def scrape(output_callback=None):
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
page = await context.new_page()
base_url = "https://www.{domain}/"
result = {{
"name": "{university_name} University",
"url": base_url,
"scraped_at": datetime.now(timezone.utc).isoformat(),
"schools": []
}}
all_programs = []
try:
if output_callback:
output_callback("info", "Searching for masters course list...")
courses_url = await find_course_list_page(page, base_url, output_callback)
if not courses_url:
if output_callback:
output_callback("warning", "Course list not found, using homepage")
courses_url = base_url
if output_callback:
output_callback("info", "Extracting masters programs...")
await page.goto(courses_url, wait_until="domcontentloaded", timeout=30000)
await page.wait_for_timeout(3000)
for _ in range(3):
try:
load_more = page.locator('button:has-text("Load more"), button:has-text("Show more"), button:has-text("View more"), a:has-text("Load more")')
if await load_more.count() > 0:
await load_more.first.click()
await page.wait_for_timeout(2000)
else:
break
except:
break
programs_data = await extract_course_links(page, output_callback)
if output_callback:
output_callback("info", f"Found {{len(programs_data)}} masters programs")
max_detail_pages = min(len(programs_data), 30)
for i, prog in enumerate(programs_data[:max_detail_pages]):
try:
if output_callback and i % 10 == 0:
output_callback("info", f"Processing {{i+1}}/{{max_detail_pages}}: {{prog['name'][:50]}}")
await page.goto(prog['url'], wait_until="domcontentloaded", timeout=15000)
await page.wait_for_timeout(800)
faculty_data = await page.evaluate(JS_EXTRACT_FACULTY)
all_programs.append({{
"name": prog['name'],
"url": prog['url'],
"faculty": faculty_data
}})
except:
all_programs.append({{
"name": prog['name'],
"url": prog['url'],
"faculty": []
}})
for prog in programs_data[max_detail_pages:]:
all_programs.append({{
"name": prog['name'],
"url": prog['url'],
"faculty": []
}})
result["schools"] = [{{
"name": "Masters Programs",
"url": courses_url,
"programs": all_programs
}}]
if output_callback:
total_faculty = sum(len(p.get('faculty', [])) for p in all_programs)
output_callback("info", f"Done! {{len(all_programs)}} programs, {{total_faculty}} faculty")
except Exception as e:
if output_callback:
output_callback("error", f"Scraping error: {{str(e)}}")
finally:
await browser.close()
return result
if __name__ == "__main__":
result = asyncio.run(scrape())
print(json.dumps(result, indent=2, ensure_ascii=False))
'''
return template
def _generate_config_content(name: str, url: str, domain: str) -> dict:
"""生成配置内容"""
return {
"university": {
"name": name,
"url": url,
"domain": domain
},
"scraper": {
"headless": True,
"timeout": 30000,
"wait_time": 2000
},
"paths_to_try": [
"/programs",
"/academics/programs",
"/graduate",
"/degrees",
"/admissions/graduate"
],
"selectors": {
"program_item": "div.program, li.program, article.program, a[href*='/program']",
"faculty_item": "div.faculty, li.person, .profile-card"
},
"generated_at": datetime.utcnow().isoformat()
}
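For local debugging it can help to render the generated code to disk before it is stored in the database. A sketch that calls the module's private helpers directly (fine for a quick preview; the `app.services.script_generator` import path is an assumption):

```python
from pathlib import Path
from app.services.script_generator import (
    _build_generic_scraper_template,
    _generate_config_content,
)

domain = "manchester.ac.uk"
script = _build_generic_scraper_template(domain)
config = _generate_config_content("The University of Manchester",
                                  f"https://www.{domain}/", domain)

Path("preview_scraper.py").write_text(script, encoding="utf-8")
print(len(script.splitlines()), "lines of generated code")
print("paths to try:", config["paths_to_try"][:3])
```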

View File

@ -0,0 +1 @@
"""Celery任务 (可选,用于生产环境)"""

25
backend/requirements.txt Normal file
View File

@ -0,0 +1,25 @@
# FastAPI Web Framework
fastapi>=0.109.0
uvicorn[standard]>=0.27.0
python-multipart>=0.0.6
# Database
sqlalchemy>=2.0.25
psycopg2-binary>=2.9.9
alembic>=1.13.1
# Task Queue
celery>=5.3.6
redis>=5.0.1
# Utilities
pydantic>=2.9
pydantic-settings>=2.6
python-dotenv>=1.0.0
httpx>=0.28
# Existing scraper dependencies
playwright>=1.48
pyyaml>=6.0
# CORS

143
configs/harvard.yaml Normal file
View File

@ -0,0 +1,143 @@
# Harvard University 爬虫配置
# 按照 学院 → 项目 → 导师 的层级结构组织
#
# Harvard的特殊情况有一个集中的项目列表页面可以从那里获取所有项目
# 然后通过GSAS页面关联到各学院和导师信息
university:
name: "Harvard University"
url: "https://www.harvard.edu/"
country: "USA"
# 第一层:学院列表
schools:
discovery_method: "static_list"
static_list:
# 文理研究生院 - 最主要的研究生项目集中地
- name: "Graduate School of Arts and Sciences (GSAS)"
url: "https://gsas.harvard.edu/"
# 工程与应用科学学院
- name: "John A. Paulson School of Engineering and Applied Sciences (SEAS)"
url: "https://seas.harvard.edu/"
# 商学院
- name: "Harvard Business School (HBS)"
url: "https://www.hbs.edu/"
# 设计学院
- name: "Graduate School of Design (GSD)"
url: "https://www.gsd.harvard.edu/"
# 教育学院
- name: "Graduate School of Education (HGSE)"
url: "https://www.gse.harvard.edu/"
# 肯尼迪政府学院
- name: "Harvard Kennedy School (HKS)"
url: "https://www.hks.harvard.edu/"
# 法学院
- name: "Harvard Law School (HLS)"
url: "https://hls.harvard.edu/"
# 医学院
- name: "Harvard Medical School (HMS)"
url: "https://hms.harvard.edu/"
# 公共卫生学院
- name: "T.H. Chan School of Public Health (HSPH)"
url: "https://www.hsph.harvard.edu/"
# 神学院
- name: "Harvard Divinity School (HDS)"
url: "https://hds.harvard.edu/"
# 牙医学院
- name: "Harvard School of Dental Medicine (HSDM)"
url: "https://hsdm.harvard.edu/"
# 第二层:项目发现配置
programs:
# 在学院网站上尝试这些路径来查找项目列表
paths_to_try:
- "/programs"
- "/academics/programs"
- "/academics/graduate-programs"
- "/academics/masters-programs"
- "/graduate"
- "/degrees"
- "/academics"
# 从学院首页查找项目列表页面的链接模式
link_patterns:
- text_contains: ["program", "degree", "academics"]
href_contains: ["/program", "/degree", "/academic"]
- text_contains: ["master", "graduate"]
href_contains: ["/master", "/graduate"]
# 项目列表页面的选择器
selectors:
program_item: "div.program-item, li.program, .degree-program, article.program, a[href*='/program']"
program_name: "h3, h4, .title, .program-title, .name"
program_url: "a[href]"
degree_type: ".degree, .credential, .degree-type"
# 分页配置
pagination:
type: "none"
# 第三层:导师发现配置
faculty:
discovery_strategies:
- type: "link_in_page"
patterns:
- text_contains: ["faculty", "people", "advisor"]
href_contains: ["/faculty", "/people", "/advisor"]
- text_contains: ["see list", "view all"]
href_contains: ["/people", "/faculty"]
- type: "url_pattern"
patterns:
- "{program_url}/faculty"
- "{program_url}/people"
- "{school_url}/faculty"
- "{school_url}/people"
selectors:
faculty_item: "div.faculty, li.person, .profile-card, article.person"
faculty_name: "h3, h4, .name, .title a"
faculty_url: "a[href*='/people/'], a[href*='/faculty/'], a[href*='/profile/']"
faculty_title: ".title, .position, .role, .job-title"
# 过滤规则
filters:
program_degree_types:
include:
- "Master"
- "M.S."
- "M.A."
- "MBA"
- "M.Eng"
- "M.Ed"
- "M.P.P"
- "M.P.A"
- "M.Arch"
- "M.L.A"
- "M.Div"
- "M.T.S"
- "LL.M"
- "S.M."
- "A.M."
- "A.L.M."
exclude:
- "Ph.D."
- "Doctor"
- "Bachelor"
- "B.S."
- "B.A."
- "Certificate"
- "Undergraduate"
exclude_schools: []
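Before running a scrape against this config it is worth checking that the YAML parses and the school list looks right. A short sketch with PyYAML (already in `backend/requirements.txt`):

```python
import yaml

with open("configs/harvard.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

print(cfg["university"]["name"])
for school in cfg["schools"]["static_list"]:
    print("-", school["name"], "->", school["url"])
print("degree filters:", cfg["filters"]["program_degree_types"]["include"][:5])
```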

331
configs/manchester.yaml Normal file
View File

@ -0,0 +1,331 @@
university:
name: "The University of Manchester"
url: "https://www.manchester.ac.uk/"
country: "United Kingdom"
schools:
discovery_method: "static_list"
request:
timeout_ms: 45000
max_retries: 3
retry_backoff_ms: 3000
static_list:
- name: "Alliance Manchester Business School"
url: "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/"
keywords:
- "accounting"
- "finance"
- "business"
- "management"
- "marketing"
- "mba"
- "economics"
- "entrepreneurship"
faculty_pages:
- url: "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/"
extract_method: "table"
requires_scroll: true
scroll_times: 6
scroll_delay_ms: 700
load_more_selector: "button.load-more, button.show-more"
max_load_more: 5
request:
timeout_ms: 60000
wait_until: "domcontentloaded"
post_wait_ms: 2500
- name: "Department of Computer Science"
url: "https://www.cs.manchester.ac.uk/about/people/academic-and-research-staff/"
keywords:
- "computer"
- "software"
- "data science"
- "artificial intelligence"
- "ai "
- "machine learning"
- "cyber"
- "computing"
faculty_pages:
- url: "https://www.cs.manchester.ac.uk/about/people/academic-and-research-staff/"
extract_method: "links"
requires_scroll: true
scroll_times: 6
scroll_delay_ms: 700
blocked_resources: ["image", "font", "media"]
- url: "https://www.cs.manchester.ac.uk/about/people/"
extract_method: "links"
load_more_selector: "button.load-more"
max_load_more: 5
request:
timeout_ms: 45000
wait_until: "domcontentloaded"
post_wait_ms: 2000
- name: "Department of Physics and Astronomy"
url: "https://www.physics.manchester.ac.uk/about/people/academic-and-research-staff/"
keywords:
- "physics"
- "astronomy"
- "astrophysics"
- "nuclear"
- "particle"
faculty_pages:
- url: "https://www.physics.manchester.ac.uk/about/people/academic-and-research-staff/"
extract_method: "links"
requires_scroll: true
scroll_times: 5
scroll_delay_ms: 700
- name: "Department of Electrical and Electronic Engineering"
url: "https://www.eee.manchester.ac.uk/about/people/academic-and-research-staff/"
keywords:
- "electrical"
- "electronic"
- "eee"
- "power systems"
- "microelectronics"
faculty_pages:
- url: "https://www.eee.manchester.ac.uk/about/people/academic-and-research-staff/"
extract_method: "links"
requires_scroll: true
scroll_times: 6
scroll_delay_ms: 700
- name: "Department of Chemistry"
url: "https://research.manchester.ac.uk/en/organisations/department-of-chemistry/persons/"
keywords:
- "chemistry"
- "chemical"
faculty_pages:
- url: "https://research.manchester.ac.uk/en/organisations/department-of-chemistry/persons/"
extract_method: "research_explorer"
requires_scroll: true
request:
timeout_ms: 120000
wait_until: "networkidle"
wait_for_selector: "a.link.person"
post_wait_ms: 5000
research_explorer:
org_slug: "department-of-chemistry"
page_size: 200
- name: "Department of Mathematics"
url: "https://research.manchester.ac.uk/en/organisations/department-of-mathematics/persons/"
keywords:
- "mathematics"
- "statistics"
- "applied math"
- "actuarial"
faculty_pages:
- url: "https://research.manchester.ac.uk/en/organisations/department-of-mathematics/persons/"
extract_method: "research_explorer"
requires_scroll: true
request:
timeout_ms: 120000
wait_until: "networkidle"
wait_for_selector: "a.link.person"
post_wait_ms: 4500
research_explorer:
org_slug: "department-of-mathematics"
page_size: 200
- name: "School of Engineering"
url: "https://research.manchester.ac.uk/en/organisations/school-of-engineering/persons/"
keywords:
- "engineering"
- "mechanical"
- "aerospace"
- "civil"
- "materials"
faculty_pages:
- url: "https://research.manchester.ac.uk/en/organisations/school-of-engineering/persons/"
extract_method: "research_explorer"
requires_scroll: true
request:
timeout_ms: 120000
wait_until: "networkidle"
wait_for_selector: "a.link.person"
post_wait_ms: 4500
research_explorer:
org_slug: "school-of-engineering"
page_size: 400
- name: "Faculty of Biology, Medicine and Health"
url: "https://research.manchester.ac.uk/en/organisations/faculty-of-biology-medicine-and-health/persons/"
keywords:
- "medicine"
- "medical"
- "health"
- "nursing"
- "pharmacy"
- "clinical"
- "dental"
- "optometry"
- "biology"
- "biomedical"
- "psychology"
faculty_pages:
- url: "https://research.manchester.ac.uk/en/organisations/faculty-of-biology-medicine-and-health/persons/"
extract_method: "research_explorer"
requires_scroll: true
request:
timeout_ms: 130000
wait_until: "networkidle"
wait_for_selector: "a.link.person"
post_wait_ms: 4500
research_explorer:
org_slug: "faculty-of-biology-medicine-and-health"
page_size: 400
- name: "School of Social Sciences"
url: "https://research.manchester.ac.uk/en/organisations/school-of-social-sciences/persons/"
keywords:
- "sociology"
- "politics"
- "international"
- "social"
- "criminology"
- "anthropology"
- "philosophy"
faculty_pages:
- url: "https://research.manchester.ac.uk/en/organisations/school-of-social-sciences/persons/"
extract_method: "research_explorer"
requires_scroll: true
request:
timeout_ms: 120000
wait_until: "networkidle"
wait_for_selector: "a.link.person"
post_wait_ms: 4500
research_explorer:
org_slug: "school-of-social-sciences"
page_size: 200
- name: "School of Law"
url: "https://research.manchester.ac.uk/en/organisations/school-of-law/persons/"
keywords:
- "law"
- "legal"
- "llm"
faculty_pages:
- url: "https://research.manchester.ac.uk/en/organisations/school-of-law/persons/"
extract_method: "research_explorer"
requires_scroll: true
request:
timeout_ms: 120000
wait_until: "networkidle"
wait_for_selector: "a.link.person"
post_wait_ms: 4500
research_explorer:
org_slug: "school-of-law"
page_size: 200
- name: "School of Arts, Languages and Cultures"
url: "https://research.manchester.ac.uk/en/organisations/school-of-arts-languages-and-cultures/persons/"
keywords:
- "arts"
- "languages"
- "culture"
- "music"
- "drama"
- "theatre"
- "history"
- "linguistics"
- "literature"
- "translation"
- "archaeology"
- "religion"
faculty_pages:
- url: "https://research.manchester.ac.uk/en/organisations/school-of-arts-languages-and-cultures/persons/"
extract_method: "research_explorer"
requires_scroll: true
request:
timeout_ms: 120000
wait_until: "networkidle"
wait_for_selector: "a.link.person"
post_wait_ms: 4500
research_explorer:
org_slug: "school-of-arts-languages-and-cultures"
page_size: 300
- name: "School of Environment, Education and Development"
url: "https://research.manchester.ac.uk/en/organisations/school-of-environment-education-and-development/persons/"
keywords:
- "environment"
- "education"
- "development"
- "planning"
- "architecture"
- "urban"
- "geography"
- "sustainability"
faculty_pages:
- url: "https://research.manchester.ac.uk/en/organisations/school-of-environment-education-and-development/persons/"
extract_method: "research_explorer"
requires_scroll: true
request:
timeout_ms: 120000
wait_until: "networkidle"
wait_for_selector: "a.link.person"
post_wait_ms: 4500
research_explorer:
org_slug: "school-of-environment-education-and-development"
page_size: 300
programs:
paths_to_try:
- "/study/masters/courses/list/"
link_patterns:
- text_contains: ["masters", "postgraduate", "graduate"]
href_contains: ["/courses/list", "/study/masters", "/study/postgraduate"]
selectors:
program_item: "li.course-item, article.course, .course-listing a"
program_name: ".course-title, h3, .title"
program_url: "a[href]"
degree_type: ".course-award, .badge"
request:
timeout_ms: 40000
wait_until: "domcontentloaded"
post_wait_ms: 2500
global_catalog:
url: "https://www.manchester.ac.uk/study/masters/courses/list/"
request:
timeout_ms: 60000
wait_until: "networkidle"
wait_after_ms: 3000
metadata_keyword_field: "keywords"
assign_by_school_keywords: true
assign_if_no_keywords: false
allow_multiple_assignments: false
per_school_limit: 200
skip_program_faculty_lookup: true
faculty:
discovery_strategies:
- type: "link_in_page"
patterns:
- text_contains: ["people", "faculty", "staff", "directory"]
href_contains: ["/people", "/faculty", "/staff"]
request:
timeout_ms: 30000
wait_until: "domcontentloaded"
post_wait_ms: 1500
- type: "url_pattern"
patterns:
- "{program_url}/people"
- "{program_url}/faculty"
- "{school_url}/people"
- "{school_url}/staff"
request:
timeout_ms: 30000
wait_until: "domcontentloaded"
post_wait_ms: 1500
- type: "school_directory"
assign_to_all: false
match_by_school_keywords: true
metadata_keyword_field: "keywords"
request:
timeout_ms: 120000
post_wait_ms: 3500
filters:
program_degree_types:
include: ["MSc", "MA", "MBA", "MEng", "LLM", "MRes"]
exclude: ["PhD", "Bachelor", "BSc", "BA", "PGCert"]
exclude_schools: []
playwright:
stealth: true
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
extra_headers:
Accept-Language: "en-US,en;q=0.9"
cookies: []
add_init_scripts: []

View File

@ -0,0 +1,24 @@
# UK University Template Library

This directory holds ScraperConfig template fragments for site structures that are common across UK universities. The goal is to let the generation/scheduling scripts quickly reuse mature school, program, and supervisor configurations while staying in sync with the latest capabilities in `src/university_scraper`.

## How to use

1. Copy the template you need to `configs/<university>.yaml` and replace the placeholders (domain, school URLs, Research Explorer organisation slug, etc.) with the university's actual values; a small automation sketch for this step follows this README.
2. Adjust the school list under `schools.static_list`:
   - `keywords`: used to automatically cluster programs into schools;
   - `faculty_pages`: defines school-level staff directories (supports `extract_method: table|links|research_explorer`, scrolling / "load more" clicks, and per-page request settings).
3. Fill in `programs.paths_to_try`, `link_patterns`, `selectors`, and the request settings according to how the university's course navigation works.
4. `faculty.discovery_strategies` should include at least:
   - `link_in_page`: find "People/Faculty" links from the program page;
   - `url_pattern`: add common URL patterns;
   - `school_directory`: reuse the staff directories defined in `faculty_pages` and distribute them to the program level by keyword matching.
5. Run `python -m src.university_scraper.cli run --config configs/<university>.yaml --output output/<name>.json` (or trigger a job from the web UI) to validate, and compare the local result with the previous version.

## Template list

| File | When to use |
|------|-------------|
| `uk_research_explorer_template.yaml` | Most UK universities that use Pure Portal / Research Explorer (e.g. Manchester, UCL, Imperial College), especially their humanities and social science schools. |
| `uk_department_directory_template.yaml` | Schools whose traditional departmental sites publish an HTML staff directory (e.g. engineering/science department sites, standalone school sites). |

If new page types turn up later (SharePoint lists, embedded APIs, and so on), add a new template file to this directory and update this README accordingly.
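A small automation sketch for step 1 above: copy a template and fill in the obvious placeholders. The `configs/templates/` directory name and the University of Leeds values are assumptions for illustration; school URLs and Research Explorer org slugs still need manual editing.

```python
from pathlib import Path

template = Path("configs/templates/uk_research_explorer_template.yaml")  # directory name is an assumption
target = Path("configs/leeds.yaml")                                      # hypothetical target university

text = template.read_text(encoding="utf-8")
text = text.replace("REPLACE_UNIVERSITY_NAME", "University of Leeds")
text = text.replace("www.example.ac.uk", "www.leeds.ac.uk")
text = text.replace("research.example.ac.uk", "research.leeds.ac.uk")

target.write_text(text, encoding="utf-8")
print("wrote", target)
```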

View File

@ -0,0 +1,95 @@
university:
name: "REPLACE_UNIVERSITY_NAME"
url: "https://www.example.ac.uk/"
country: "United Kingdom"
schools:
discovery_method: "static_list"
static_list:
- name: "Department of Computer Science"
url: "https://www.example.ac.uk/about/people/academic-and-research-staff/"
keywords:
- "computer"
- "software"
- "artificial intelligence"
- "data science"
faculty_pages:
- url: "https://www.example.ac.uk/about/people/academic-and-research-staff/"
extract_method: "links"
requires_scroll: true
scroll_times: 6
scroll_delay_ms: 600
blocked_resources: ["image", "font", "media"]
- url: "https://www.example.ac.uk/about/people/"
extract_method: "links"
load_more_selector: "button.load-more"
max_load_more: 5
request:
timeout_ms: 45000
wait_until: "domcontentloaded"
post_wait_ms: 2000
- name: "Department of Physics"
url: "https://www.example.ac.uk/physics/about/people/"
keywords:
- "physics"
- "astronomy"
- "material science"
faculty_pages:
- url: "https://www.example.ac.uk/physics/about/people/academic-staff/"
extract_method: "table"
request:
timeout_ms: 60000
wait_until: "domcontentloaded"
post_wait_ms: 2000
programs:
paths_to_try:
- "/study/masters/courses/a-to-z/"
- "/study/masters/courses/list/"
link_patterns:
- text_contains: ["courses", "masters", "postgraduate"]
href_contains: ["/study/", "/masters/", "/courses/"]
selectors:
program_item: ".course-card, li.course, article.course"
program_name: ".course-title, h3, .title"
program_url: "a[href]"
degree_type: ".award, .badge"
request:
timeout_ms: 35000
wait_until: "domcontentloaded"
post_wait_ms: 2000
faculty:
discovery_strategies:
- type: "link_in_page"
patterns:
- text_contains: ["people", "faculty", "team", "staff"]
href_contains: ["/people", "/faculty", "/staff"]
request:
timeout_ms: 25000
wait_until: "domcontentloaded"
post_wait_ms: 1500
- type: "url_pattern"
patterns:
- "{program_url}/people"
- "{program_url}/staff"
- "{school_url}/people"
- "{school_url}/contact/staff"
request:
timeout_ms: 25000
wait_until: "domcontentloaded"
post_wait_ms: 1500
- type: "school_directory"
assign_to_all: false
match_by_school_keywords: true
metadata_keyword_field: "keywords"
request:
timeout_ms: 60000
wait_for_selector: "a[href*='/people/'], table"
post_wait_ms: 2000
filters:
program_degree_types:
include: ["MSc", "MSci", "MA", "MBA", "MEng", "LLM"]
exclude: ["PhD", "Bachelor", "BSc", "BA", "PGCert"]
exclude_schools: []

View File

@ -0,0 +1,101 @@
university:
name: "REPLACE_UNIVERSITY_NAME"
url: "https://www.example.ac.uk/"
country: "United Kingdom"
schools:
discovery_method: "static_list"
request:
timeout_ms: 45000
max_retries: 3
retry_backoff_ms: 3000
static_list:
# 基于 Research Explorer (Pure Portal) 的学院示例
- name: "School of Engineering"
url: "https://research.example.ac.uk/en/organisations/school-of-engineering/persons/"
keywords:
- "engineering"
- "mechanical"
- "civil"
- "materials"
faculty_pages:
- url: "https://research.example.ac.uk/en/organisations/school-of-engineering/persons/"
extract_method: "research_explorer"
requires_scroll: true
request:
timeout_ms: 120000
wait_until: "networkidle"
post_wait_ms: 5000
research_explorer:
org_slug: "school-of-engineering"
page_size: 400
- name: "Faculty of Humanities"
url: "https://research.example.ac.uk/en/organisations/faculty-of-humanities/persons/"
keywords:
- "arts"
- "languages"
- "history"
- "philosophy"
faculty_pages:
- url: "https://research.example.ac.uk/en/organisations/faculty-of-humanities/persons/"
extract_method: "research_explorer"
requires_scroll: true
request:
timeout_ms: 120000
wait_until: "networkidle"
post_wait_ms: 4500
research_explorer:
org_slug: "faculty-of-humanities"
page_size: 300
programs:
paths_to_try:
- "/study/masters/courses/list/"
- "/study/postgraduate/courses/list/"
link_patterns:
- text_contains: ["masters", "postgraduate", "graduate"]
href_contains: ["/courses/", "/study/", "/programmes/"]
selectors:
program_item: "li.course-item, article.course-card, a.course-link"
program_name: ".course-title, h3, .title"
program_url: "a[href]"
degree_type: ".course-award, .badge"
request:
timeout_ms: 40000
wait_until: "domcontentloaded"
post_wait_ms: 2500
faculty:
discovery_strategies:
- type: "link_in_page"
patterns:
- text_contains: ["faculty", "people", "staff", "directory"]
href_contains: ["/faculty", "/people", "/staff"]
request:
timeout_ms: 30000
wait_until: "domcontentloaded"
post_wait_ms: 1500
- type: "url_pattern"
patterns:
- "{program_url}/people"
- "{program_url}/faculty"
- "{school_url}/people"
- "{school_url}/staff"
request:
timeout_ms: 30000
wait_until: "domcontentloaded"
post_wait_ms: 1500
- type: "school_directory"
assign_to_all: false
match_by_school_keywords: true
metadata_keyword_field: "keywords"
request:
timeout_ms: 120000
wait_for_selector: "a.link.person"
post_wait_ms: 4000
filters:
program_degree_types:
include: ["MSc", "MA", "MBA", "MEng", "LLM", "MRes"]
exclude: ["PhD", "Bachelor", "BSc", "BA"]
exclude_schools: []

169
configs/ucl.yaml Normal file
View File

@ -0,0 +1,169 @@
university:
name: "University College London"
url: "https://www.ucl.ac.uk/"
country: "United Kingdom"
schools:
discovery_method: "static_list"
request:
timeout_ms: 45000
max_retries: 3
retry_backoff_ms: 3000
static_list:
- name: "Faculty of Engineering Sciences"
url: "https://www.ucl.ac.uk/engineering/people"
keywords:
- "engineering"
- "mechanical"
- "civil"
- "materials"
- "electronic"
- "computer"
faculty_pages:
- url: "https://www.ucl.ac.uk/engineering/people"
extract_method: "links"
requires_scroll: true
scroll_times: 8
scroll_delay_ms: 600
blocked_resources: ["image", "font", "media"]
- url: "https://www.ucl.ac.uk/electronic-electrical-engineering/people/academic-staff"
extract_method: "table"
request:
timeout_ms: 45000
wait_until: "domcontentloaded"
post_wait_ms: 2000
- name: "Faculty of Mathematical & Physical Sciences"
url: "https://www.ucl.ac.uk/mathematical-physical-sciences/people"
keywords:
- "mathematics"
- "physics"
- "chemistry"
- "earth sciences"
- "astronomy"
faculty_pages:
- url: "https://www.ucl.ac.uk/mathematical-physical-sciences/people"
extract_method: "links"
requires_scroll: true
scroll_times: 6
scroll_delay_ms: 600
- url: "https://www.ucl.ac.uk/physics-astronomy/people/academic-staff"
extract_method: "links"
- name: "Faculty of Arts & Humanities"
url: "https://www.ucl.ac.uk/arts-humanities/people/academic-staff"
keywords:
- "arts"
- "languages"
- "culture"
- "history"
- "philosophy"
- "translation"
faculty_pages:
- url: "https://www.ucl.ac.uk/arts-humanities/people/academic-staff"
extract_method: "links"
requires_scroll: true
scroll_times: 6
scroll_delay_ms: 600
- name: "Faculty of Laws"
url: "https://www.ucl.ac.uk/laws/people/academic-staff"
keywords:
- "law"
- "legal"
- "llm"
faculty_pages:
- url: "https://www.ucl.ac.uk/laws/people/academic-staff"
extract_method: "links"
requires_scroll: true
scroll_times: 5
scroll_delay_ms: 600
- name: "Faculty of Social & Historical Sciences"
url: "https://www.ucl.ac.uk/social-historical-sciences/people"
keywords:
- "social"
- "economics"
- "geography"
- "anthropology"
- "politics"
- "history"
faculty_pages:
- url: "https://www.ucl.ac.uk/social-historical-sciences/people"
extract_method: "links"
requires_scroll: true
scroll_times: 6
scroll_delay_ms: 600
- name: "Faculty of Brain Sciences"
url: "https://www.ucl.ac.uk/brain-sciences/people"
keywords:
- "neuroscience"
- "psychology"
- "cognitive"
- "biomedical"
faculty_pages:
- url: "https://www.ucl.ac.uk/brain-sciences/people"
extract_method: "links"
requires_scroll: true
scroll_times: 6
scroll_delay_ms: 600
- name: "Faculty of the Built Environment (The Bartlett)"
url: "https://www.ucl.ac.uk/bartlett/people/all"
keywords:
- "architecture"
- "planning"
- "urban"
- "built environment"
faculty_pages:
- url: "https://www.ucl.ac.uk/bartlett/people/all"
extract_method: "links"
requires_scroll: true
scroll_times: 10
scroll_delay_ms: 600
programs:
paths_to_try:
- "/prospective-students/graduate/taught-degrees/"
link_patterns:
- text_contains: ["graduate", "taught", "masters", "postgraduate"]
href_contains: ["/prospective-students/graduate", "/study/graduate", "/courses/"]
selectors:
program_item: ".view-content .view-row, li.listing__item, article.prog-card"
program_name: ".listing__title, h3, .title"
program_url: "a[href]"
degree_type: ".listing__award, .award"
request:
timeout_ms: 40000
wait_until: "domcontentloaded"
post_wait_ms: 2500
faculty:
discovery_strategies:
- type: "link_in_page"
patterns:
- text_contains: ["people", "faculty", "staff", "team"]
href_contains: ["/people", "/faculty", "/staff", "/team"]
request:
timeout_ms: 30000
wait_until: "domcontentloaded"
post_wait_ms: 1500
- type: "url_pattern"
patterns:
- "{program_url}/people"
- "{program_url}/staff"
- "{school_url}/people"
- "{school_url}/staff"
request:
timeout_ms: 30000
wait_until: "domcontentloaded"
post_wait_ms: 1500
- type: "school_directory"
assign_to_all: false
match_by_school_keywords: true
metadata_keyword_field: "keywords"
request:
timeout_ms: 60000
wait_for_selector: "a[href*='/people/'], .person, .profile-card"
post_wait_ms: 2500
filters:
program_degree_types:
include: ["MSc", "MSci", "MA", "MBA", "MEng", "LLM", "MRes"]
exclude: ["PhD", "Bachelor", "BSc", "BA", "PGCert"]
exclude_schools: []
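
The `filters` block above is what keeps the crawl restricted to taught master's programmes. A minimal sketch of how such include/exclude lists could be applied (illustrative only; the real logic lives in `src/university_scraper/scraper.py`, which is not shown in this excerpt):

```python
# Illustrative helper; the matching rule is an assumption, not the exact
# implementation used by UniversityScraper.
INCLUDE = ["MSc", "MSci", "MA", "MBA", "MEng", "LLM", "MRes"]
EXCLUDE = ["PhD", "Bachelor", "BSc", "BA", "PGCert"]

def keep_program(degree_type: str) -> bool:
    """Keep a programme only when an include token matches and no exclude token does."""
    tokens = {t.strip(".,()").lower() for t in (degree_type or "").split()}
    if tokens & {t.lower() for t in EXCLUDE}:
        return False
    return bool(tokens & {t.lower() for t in INCLUDE})

assert keep_program("MSc Computer Science")
assert keep_program("MBA")
assert not keep_program("PhD Chemistry")
```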

54
docker-compose.yml Normal file
View File

@ -0,0 +1,54 @@
version: '3.8'
services:
# 后端API服务
backend:
build:
context: ./backend
dockerfile: Dockerfile
ports:
- "8000:8000"
environment:
- DATABASE_URL=postgresql://postgres:postgres@db:5432/university_scraper
- REDIS_URL=redis://redis:6379/0
depends_on:
- db
- redis
volumes:
- ./backend:/app
- scraper_data:/app/data
# 前端服务
frontend:
build:
context: ./frontend
dockerfile: Dockerfile
ports:
- "3000:80"
depends_on:
- backend
# PostgreSQL数据库
db:
image: postgres:15-alpine
environment:
- POSTGRES_USER=postgres
- POSTGRES_PASSWORD=postgres
- POSTGRES_DB=university_scraper
volumes:
- postgres_data:/var/lib/postgresql/data
ports:
- "5432:5432"
# Redis (用于任务队列)
redis:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
- redis_data:/data
volumes:
postgres_data:
redis_data:
scraper_data:
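
After `docker compose up -d --build`, a quick smoke check against the published ports confirms the stack is reachable (a minimal sketch, assuming the port mappings above and a FastAPI backend that serves `/docs`):

```python
# Post-deploy smoke check for the compose stack above (stdlib only).
from urllib.request import urlopen

CHECKS = {
    "backend (FastAPI docs)": "http://localhost:8000/docs",
    "frontend (nginx)": "http://localhost:3000/",
}

for name, url in CHECKS.items():
    with urlopen(url, timeout=10) as resp:  # raises URLError if the service is down
        print(f"{name}: HTTP {resp.status}")
```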

26
frontend/Dockerfile Normal file
View File

@ -0,0 +1,26 @@
FROM node:20-alpine AS builder
WORKDIR /app
# 复制package文件
COPY package*.json ./
RUN npm install
# 复制源代码
COPY . .
# 构建
RUN npm run build
# 生产镜像
FROM nginx:alpine
# 复制构建产物
COPY --from=builder /app/dist /usr/share/nginx/html
# 复制nginx配置
COPY nginx.conf /etc/nginx/conf.d/default.conf
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]

12
frontend/index.html Normal file
View File

@ -0,0 +1,12 @@
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>大学爬虫系统</title>
</head>
<body>
<div id="root"></div>
<script type="module" src="/src/main.tsx"></script>
</body>
</html>

21
frontend/nginx.conf Normal file
View File

@ -0,0 +1,21 @@
server {
listen 80;
server_name localhost;
root /usr/share/nginx/html;
index index.html;
# 处理SPA路由
location / {
try_files $uri $uri/ /index.html;
}
# 代理API请求到后端
location /api {
proxy_pass http://backend:8000;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection 'upgrade';
proxy_set_header Host $host;
proxy_cache_bypass $http_upgrade;
}
}

3051
frontend/package-lock.json generated Normal file

File diff suppressed because it is too large Load Diff

26
frontend/package.json Normal file
View File

@ -0,0 +1,26 @@
{
"name": "university-scraper-web",
"version": "1.0.0",
"private": true,
"scripts": {
"dev": "vite",
"build": "tsc && vite build",
"preview": "vite preview"
},
"dependencies": {
"react": "^18.2.0",
"react-dom": "^18.2.0",
"react-router-dom": "^6.20.0",
"@tanstack/react-query": "^5.8.0",
"axios": "^1.6.0",
"antd": "^5.11.0",
"@ant-design/icons": "^5.2.6"
},
"devDependencies": {
"@types/react": "^18.2.0",
"@types/react-dom": "^18.2.0",
"@vitejs/plugin-react": "^4.2.0",
"typescript": "^5.3.0",
"vite": "^5.0.0"
}
}

75
frontend/src/App.tsx Normal file
View File

@ -0,0 +1,75 @@
/**
* 主应用组件
*/
import { useState } from 'react'
import { BrowserRouter, Routes, Route, Link, useNavigate } from 'react-router-dom'
import { Layout, Menu, Typography } from 'antd'
import { HomeOutlined, PlusOutlined, DatabaseOutlined } from '@ant-design/icons'
import HomePage from './pages/HomePage'
import AddUniversityPage from './pages/AddUniversityPage'
import UniversityDetailPage from './pages/UniversityDetailPage'
const { Header, Content, Footer } = Layout
const { Title } = Typography
function AppContent() {
const navigate = useNavigate()
const [current, setCurrent] = useState('home')
const menuItems = [
{
key: 'home',
icon: <HomeOutlined />,
label: '大学列表',
onClick: () => navigate('/')
},
{
key: 'add',
icon: <PlusOutlined />,
label: '添加大学',
onClick: () => navigate('/add')
}
]
return (
<Layout style={{ minHeight: '100vh' }}>
<Header style={{ display: 'flex', alignItems: 'center', background: '#001529' }}>
<div style={{ color: 'white', fontSize: '20px', fontWeight: 'bold', marginRight: '40px' }}>
<DatabaseOutlined /> 大学爬虫系统
</div>
<Menu
theme="dark"
mode="horizontal"
selectedKeys={[current]}
items={menuItems}
onClick={(e) => setCurrent(e.key)}
style={{ flex: 1 }}
/>
</Header>
<Content style={{ padding: '24px', background: '#f5f5f5' }}>
<div style={{ maxWidth: 1200, margin: '0 auto' }}>
<Routes>
<Route path="/" element={<HomePage />} />
<Route path="/add" element={<AddUniversityPage />} />
<Route path="/university/:id" element={<UniversityDetailPage />} />
</Routes>
</div>
</Content>
<Footer style={{ textAlign: 'center', background: '#f5f5f5' }}>
©2024 大学爬虫系统
</Footer>
</Layout>
)
}
function App() {
return (
<BrowserRouter>
<AppContent />
</BrowserRouter>
)
}
export default App

29
frontend/src/index.css Normal file
View File

@ -0,0 +1,29 @@
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
background-color: #f5f5f5;
}
.container {
max-width: 1200px;
margin: 0 auto;
padding: 20px;
}
.card-hover:hover {
box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15);
transition: box-shadow 0.3s;
}
.status-pending { color: #faad14; }
.status-analyzing { color: #1890ff; }
.status-ready { color: #52c41a; }
.status-running { color: #1890ff; }
.status-completed { color: #52c41a; }
.status-failed { color: #ff4d4f; }
.status-error { color: #ff4d4f; }

26
frontend/src/main.tsx Normal file
View File

@ -0,0 +1,26 @@
import React from 'react'
import ReactDOM from 'react-dom/client'
import { QueryClient, QueryClientProvider } from '@tanstack/react-query'
import { ConfigProvider } from 'antd'
import zhCN from 'antd/locale/zh_CN'
import App from './App'
import './index.css'
const queryClient = new QueryClient({
defaultOptions: {
queries: {
refetchOnWindowFocus: false,
retry: 1
}
}
})
ReactDOM.createRoot(document.getElementById('root')!).render(
<React.StrictMode>
<QueryClientProvider client={queryClient}>
<ConfigProvider locale={zhCN}>
<App />
</ConfigProvider>
</QueryClientProvider>
</React.StrictMode>
)

165
frontend/src/pages/AddUniversityPage.tsx Normal file
View File

@ -0,0 +1,165 @@
/**
* 添加大学页面 - 一键生成爬虫脚本
*/
import { useState } from 'react'
import { useNavigate } from 'react-router-dom'
import { useMutation } from '@tanstack/react-query'
import {
Card, Form, Input, Button, Typography, Steps, Result, Spin, message
} from 'antd'
import { GlobalOutlined, RocketOutlined, CheckCircleOutlined, LoadingOutlined } from '@ant-design/icons'
import { scriptApi } from '../services/api'
const { Title, Text, Paragraph } = Typography
export default function AddUniversityPage() {
const navigate = useNavigate()
const [form] = Form.useForm()
const [currentStep, setCurrentStep] = useState(0)
const [universityId, setUniversityId] = useState<number | null>(null)
// 生成脚本
const generateMutation = useMutation({
mutationFn: scriptApi.generate,
onSuccess: (response) => {
const data = response.data
setUniversityId(data.university_id)
setCurrentStep(2)
message.success('脚本生成成功!')
},
onError: (error: any) => {
message.error(error.response?.data?.detail || '生成失败')
setCurrentStep(0)
}
})
const handleSubmit = (values: { url: string; name?: string }) => {
setCurrentStep(1)
generateMutation.mutate({
university_url: values.url,
university_name: values.name
})
}
const stepItems = [
{
title: '输入信息',
icon: <GlobalOutlined />
},
{
title: '分析生成',
icon: currentStep === 1 ? <LoadingOutlined /> : <RocketOutlined />
},
{
title: '完成',
icon: <CheckCircleOutlined />
}
]
return (
<Card>
<Title level={3} style={{ textAlign: 'center', marginBottom: 32 }}>
添加大学 - 一键生成爬虫脚本
</Title>
<Steps current={currentStep} items={stepItems} style={{ marginBottom: 40 }} />
{currentStep === 0 && (
<div style={{ maxWidth: 500, margin: '0 auto' }}>
<Paragraph style={{ textAlign: 'center', marginBottom: 24 }}>
输入大学官网地址,系统会自动分析网站结构并生成对应的爬虫脚本
</Paragraph>
<Form
form={form}
layout="vertical"
onFinish={handleSubmit}
>
<Form.Item
name="url"
label="大学官网URL"
rules={[
{ required: true, message: '请输入大学官网URL' },
{ type: 'url', message: '请输入有效的URL' }
]}
>
<Input
placeholder="https://www.harvard.edu/"
size="large"
prefix={<GlobalOutlined />}
/>
</Form.Item>
<Form.Item
name="name"
label="大学名称 (可选)"
>
<Input
placeholder="如: Harvard University"
size="large"
/>
</Form.Item>
<Form.Item>
<Button
type="primary"
htmlType="submit"
size="large"
block
icon={<RocketOutlined />}
>
一键生成爬虫脚本
</Button>
</Form.Item>
</Form>
<div style={{ marginTop: 32, padding: 16, background: '#f5f5f5', borderRadius: 8 }}>
<Text strong>提示:</Text>
<ul style={{ marginTop: 8 }}>
<li>支持美国大学 (如 Harvard, MIT, Stanford)</li>
<li>支持英国大学 (如 Oxford, Cambridge)</li>
<li>也支持其他海外大学官网</li>
</ul>
</div>
</div>
)}
{currentStep === 1 && (
<div style={{ textAlign: 'center', padding: 60 }}>
<Spin size="large" />
<Title level={4} style={{ marginTop: 24 }}>正在分析网站结构...</Title>
<Paragraph>系统正在访问大学官网并分析页面结构</Paragraph>
<Paragraph type="secondary">这可能需要几分钟,请耐心等待...</Paragraph>
</div>
)}
{currentStep === 2 && (
<Result
status="success"
title="爬虫脚本生成成功!"
subTitle="系统已自动分析网站结构并生成了爬虫脚本"
extra={[
<Button
type="primary"
key="detail"
size="large"
onClick={() => navigate(`/university/${universityId}`)}
>
查看详情
</Button>,
<Button
key="add"
size="large"
onClick={() => {
setCurrentStep(0)
form.resetFields()
}}
>
继续添加
</Button>
]}
/>
)}
</Card>
)
}

185
frontend/src/pages/HomePage.tsx Normal file
View File

@ -0,0 +1,185 @@
/**
* 首页 - 大学列表
*/
import { useState } from 'react'
import { useNavigate } from 'react-router-dom'
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
import {
Card, Table, Button, Input, Space, Tag, message, Popconfirm, Typography, Row, Col, Statistic
} from 'antd'
import {
PlusOutlined, SearchOutlined, DeleteOutlined, EyeOutlined, ReloadOutlined
} from '@ant-design/icons'
import { universityApi } from '../services/api'
const { Title } = Typography
// 状态标签映射
const statusTags: Record<string, { color: string; text: string }> = {
pending: { color: 'default', text: '待分析' },
analyzing: { color: 'processing', text: '分析中' },
ready: { color: 'success', text: '就绪' },
error: { color: 'error', text: '错误' }
}
export default function HomePage() {
const navigate = useNavigate()
const queryClient = useQueryClient()
const [search, setSearch] = useState('')
// 获取大学列表
const { data, isLoading, refetch } = useQuery({
queryKey: ['universities', search],
queryFn: () => universityApi.list({ search: search || undefined })
})
// 删除大学
const deleteMutation = useMutation({
mutationFn: universityApi.delete,
onSuccess: () => {
message.success('删除成功')
queryClient.invalidateQueries({ queryKey: ['universities'] })
},
onError: () => {
message.error('删除失败')
}
})
const universities = data?.data?.items || []
const total = data?.data?.total || 0
// 统计
const readyCount = universities.filter((u: any) => u.status === 'ready').length
const totalPrograms = universities.reduce((sum: number, u: any) =>
sum + (u.latest_result?.programs_count || 0), 0)
const totalFaculty = universities.reduce((sum: number, u: any) =>
sum + (u.latest_result?.faculty_count || 0), 0)
const columns = [
{
title: '大学名称',
dataIndex: 'name',
key: 'name',
render: (text: string, record: any) => (
<a onClick={() => navigate(`/university/${record.id}`)}>{text}</a>
)
},
{
title: '国家',
dataIndex: 'country',
key: 'country',
width: 100
},
{
title: '状态',
dataIndex: 'status',
key: 'status',
width: 100,
render: (status: string) => {
const tag = statusTags[status] || { color: 'default', text: status }
return <Tag color={tag.color}>{tag.text}</Tag>
}
},
{
title: '项目数',
key: 'programs',
width: 100,
render: (_: any, record: any) => record.latest_result?.programs_count || '-'
},
{
title: '导师数',
key: 'faculty',
width: 100,
render: (_: any, record: any) => record.latest_result?.faculty_count || '-'
},
{
title: '操作',
key: 'actions',
width: 150,
render: (_: any, record: any) => (
<Space>
<Button
type="link"
icon={<EyeOutlined />}
onClick={() => navigate(`/university/${record.id}`)}
>
查看
</Button>
<Popconfirm
title="确定删除这个大学吗?"
onConfirm={() => deleteMutation.mutate(record.id)}
okText="确定"
cancelText="取消"
>
<Button type="link" danger icon={<DeleteOutlined />}>
</Button>
</Popconfirm>
</Space>
)
}
]
return (
<div>
{/* 统计卡片 */}
<Row gutter={16} style={{ marginBottom: 24 }}>
<Col span={6}>
<Card>
<Statistic title="大学总数" value={total} />
</Card>
</Col>
<Col span={6}>
<Card>
<Statistic title="已就绪" value={readyCount} valueStyle={{ color: '#52c41a' }} />
</Card>
</Col>
<Col span={6}>
<Card>
<Statistic title="项目总数" value={totalPrograms} />
</Card>
</Col>
<Col span={6}>
<Card>
<Statistic title="导师总数" value={totalFaculty} />
</Card>
</Col>
</Row>
{/* 大学列表 */}
<Card
title={<Title level={4} style={{ margin: 0 }}>大学列表</Title>}
extra={
<Space>
<Input
placeholder="搜索大学..."
prefix={<SearchOutlined />}
value={search}
onChange={(e) => setSearch(e.target.value)}
style={{ width: 200 }}
allowClear
/>
<Button icon={<ReloadOutlined />} onClick={() => refetch()}>
刷新
</Button>
<Button type="primary" icon={<PlusOutlined />} onClick={() => navigate('/add')}>
</Button>
</Space>
}
>
<Table
columns={columns}
dataSource={universities}
rowKey="id"
loading={isLoading}
pagination={{
total,
showSizeChanger: true,
showTotal: (t) => `${t} 所大学`
}}
/>
</Card>
</div>
)
}

368
frontend/src/pages/UniversityDetailPage.tsx Normal file
View File

@ -0,0 +1,368 @@
/**
* 大学详情页面 - 管理爬虫、运行爬虫、查看数据
*/
import { useState, useEffect } from 'react'
import { useParams, useNavigate } from 'react-router-dom'
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
import {
Card, Tabs, Button, Typography, Tag, Space, Table, Progress, Timeline, Spin,
message, Descriptions, Tree, Input, Row, Col, Statistic, Empty, Modal
} from 'antd'
import {
PlayCircleOutlined, ReloadOutlined, DownloadOutlined, ArrowLeftOutlined,
CheckCircleOutlined, ClockCircleOutlined, ExclamationCircleOutlined,
SearchOutlined, TeamOutlined, BookOutlined, BankOutlined
} from '@ant-design/icons'
import { universityApi, scriptApi, jobApi, resultApi } from '../services/api'
const { Title, Text, Paragraph } = Typography
const { TabPane } = Tabs
// 状态映射
const statusMap: Record<string, { color: string; text: string; icon: any }> = {
pending: { color: 'default', text: '等待中', icon: <ClockCircleOutlined /> },
running: { color: 'processing', text: '运行中', icon: <Spin size="small" /> },
completed: { color: 'success', text: '已完成', icon: <CheckCircleOutlined /> },
failed: { color: 'error', text: '失败', icon: <ExclamationCircleOutlined /> },
cancelled: { color: 'warning', text: '已取消', icon: <ExclamationCircleOutlined /> }
}
export default function UniversityDetailPage() {
const { id } = useParams<{ id: string }>()
const navigate = useNavigate()
const queryClient = useQueryClient()
const universityId = parseInt(id || '0')
const [activeTab, setActiveTab] = useState('overview')
const [pollingJobId, setPollingJobId] = useState<number | null>(null)
const [searchKeyword, setSearchKeyword] = useState('')
// 获取大学详情
const { data: universityData, isLoading: universityLoading } = useQuery({
queryKey: ['university', universityId],
queryFn: () => universityApi.get(universityId)
})
// 获取脚本
const { data: scriptsData } = useQuery({
queryKey: ['scripts', universityId],
queryFn: () => scriptApi.getByUniversity(universityId)
})
// 获取任务列表
const { data: jobsData, refetch: refetchJobs } = useQuery({
queryKey: ['jobs', universityId],
queryFn: () => jobApi.getByUniversity(universityId)
})
// 获取结果数据
const { data: resultData } = useQuery({
queryKey: ['result', universityId],
queryFn: () => resultApi.get(universityId),
enabled: activeTab === 'data'
})
// 获取任务状态 (轮询)
const { data: jobStatusData } = useQuery({
queryKey: ['job-status', pollingJobId],
queryFn: () => jobApi.getStatus(pollingJobId!),
enabled: !!pollingJobId,
refetchInterval: pollingJobId ? 2000 : false
})
// 启动爬虫任务
const startJobMutation = useMutation({
mutationFn: () => jobApi.start(universityId),
onSuccess: (response) => {
message.success('爬虫任务已启动')
setPollingJobId(response.data.id)
refetchJobs()
},
onError: (error: any) => {
message.error(error.response?.data?.detail || '启动失败')
}
})
// 监听任务完成
useEffect(() => {
if (jobStatusData?.data?.status === 'completed' || jobStatusData?.data?.status === 'failed') {
setPollingJobId(null)
refetchJobs()
queryClient.invalidateQueries({ queryKey: ['university', universityId] })
queryClient.invalidateQueries({ queryKey: ['result', universityId] })
if (jobStatusData?.data?.status === 'completed') {
message.success('爬取完成!')
} else {
message.error('爬取失败')
}
}
}, [jobStatusData?.data?.status])
const university = universityData?.data
const scripts = scriptsData?.data || []
const jobs = jobsData?.data || []
const result = resultData?.data
// 构建数据树
const buildDataTree = () => {
if (!result?.result_data?.schools) return []
return result.result_data.schools.map((school: any, si: number) => ({
key: `school-${si}`,
title: (
<span>
<BankOutlined style={{ marginRight: 8 }} />
{school.name} ({school.programs?.length || 0})
</span>
),
children: school.programs?.map((prog: any, pi: number) => ({
key: `program-${si}-${pi}`,
title: (
<span>
<BookOutlined style={{ marginRight: 8 }} />
{prog.name} ({prog.faculty?.length || 0})
</span>
),
children: prog.faculty?.map((fac: any, fi: number) => ({
key: `faculty-${si}-${pi}-${fi}`,
title: (
<span>
<TeamOutlined style={{ marginRight: 8 }} />
<a href={fac.url} target="_blank" rel="noreferrer">{fac.name}</a>
</span>
),
isLeaf: true
}))
}))
}))
}
if (universityLoading) {
return <Card><Spin size="large" /></Card>
}
if (!university) {
return <Card><Empty description="大学不存在" /></Card>
}
const activeScript = scripts.find((s: any) => s.status === 'active')
const latestJob = jobs[0]
const isRunning = pollingJobId !== null || latestJob?.status === 'running'
return (
<div>
{/* 头部 */}
<Card style={{ marginBottom: 16 }}>
<Space style={{ marginBottom: 16 }}>
<Button icon={<ArrowLeftOutlined />} onClick={() => navigate('/')}>
返回列表
</Button>
</Space>
<Row gutter={24}>
<Col span={16}>
<Title level={3} style={{ marginBottom: 8 }}>{university.name}</Title>
<Paragraph>
<a href={university.url} target="_blank" rel="noreferrer">{university.url}</a>
</Paragraph>
<Space>
<Tag>{university.country || '未知国家'}</Tag>
<Tag color={university.status === 'ready' ? 'green' : 'orange'}>
{university.status === 'ready' ? '就绪' : university.status}
</Tag>
</Space>
</Col>
<Col span={8} style={{ textAlign: 'right' }}>
<Button
type="primary"
size="large"
icon={isRunning ? <Spin size="small" /> : <PlayCircleOutlined />}
onClick={() => startJobMutation.mutate()}
disabled={!activeScript || isRunning}
loading={startJobMutation.isPending}
>
{isRunning ? '爬虫运行中...' : '一键运行爬虫'}
</Button>
</Col>
</Row>
{/* 统计 */}
<Row gutter={16} style={{ marginTop: 24 }}>
<Col span={6}>
<Statistic title="学院数" value={university.latest_result?.schools_count || 0} />
</Col>
<Col span={6}>
<Statistic title="项目数" value={university.latest_result?.programs_count || 0} />
</Col>
<Col span={6}>
<Statistic title="导师数" value={university.latest_result?.faculty_count || 0} />
</Col>
<Col span={6}>
<Statistic title="脚本版本" value={activeScript?.version || 0} />
</Col>
</Row>
</Card>
{/* 运行进度 */}
{pollingJobId && jobStatusData?.data && (
<Card style={{ marginBottom: 16 }}>
<Title level={5}>运行进度</Title>
<Progress percent={jobStatusData.data.progress} status="active" />
<Text type="secondary">{jobStatusData.data.current_step}</Text>
<div style={{ marginTop: 16, maxHeight: 200, overflowY: 'auto' }}>
<Timeline
items={jobStatusData.data.logs?.slice(-10).map((log: any) => ({
color: log.level === 'error' ? 'red' : log.level === 'warning' ? 'orange' : 'blue',
children: <Text>{log.message}</Text>
}))}
/>
</div>
</Card>
)}
{/* 标签页 */}
<Card>
<Tabs activeKey={activeTab} onChange={setActiveTab}>
{/* 概览 */}
<TabPane tab="概览" key="overview">
<Descriptions title="基本信息" bordered column={2}>
<Descriptions.Item label="大学名称">{university.name}</Descriptions.Item>
<Descriptions.Item label="官网地址">
<a href={university.url} target="_blank" rel="noreferrer">{university.url}</a>
</Descriptions.Item>
<Descriptions.Item label="国家">{university.country || '-'}</Descriptions.Item>
<Descriptions.Item label="状态">
<Tag color={university.status === 'ready' ? 'green' : 'default'}>
{university.status}
</Tag>
</Descriptions.Item>
<Descriptions.Item label="创建时间">
{new Date(university.created_at).toLocaleString()}
</Descriptions.Item>
<Descriptions.Item label="更新时间">
{new Date(university.updated_at).toLocaleString()}
</Descriptions.Item>
</Descriptions>
<Title level={5} style={{ marginTop: 24 }}>最近任务</Title>
<Table
dataSource={jobs.slice(0, 5)}
rowKey="id"
pagination={false}
columns={[
{
title: '任务ID',
dataIndex: 'id',
width: 80
},
{
title: '状态',
dataIndex: 'status',
width: 100,
render: (status: string) => {
const s = statusMap[status] || { color: 'default', text: status }
return <Tag color={s.color}>{s.icon} {s.text}</Tag>
}
},
{
title: '进度',
dataIndex: 'progress',
width: 150,
render: (progress: number) => <Progress percent={progress} size="small" />
},
{
title: '开始时间',
dataIndex: 'started_at',
render: (t: string) => t ? new Date(t).toLocaleString() : '-'
},
{
title: '完成时间',
dataIndex: 'completed_at',
render: (t: string) => t ? new Date(t).toLocaleString() : '-'
}
]}
/>
</TabPane>
{/* 数据查看 */}
<TabPane tab="数据查看" key="data">
{result?.result_data ? (
<div>
<Row style={{ marginBottom: 16 }}>
<Col span={12}>
<Input
placeholder="搜索项目或导师..."
prefix={<SearchOutlined />}
value={searchKeyword}
onChange={(e) => setSearchKeyword(e.target.value)}
style={{ width: 300 }}
/>
</Col>
<Col span={12} style={{ textAlign: 'right' }}>
<Button
icon={<DownloadOutlined />}
onClick={() => {
const dataStr = JSON.stringify(result.result_data, null, 2)
const blob = new Blob([dataStr], { type: 'application/json' })
const url = URL.createObjectURL(blob)
const a = document.createElement('a')
a.href = url
a.download = `${university.name}_data.json`
a.click()
}}
>
导出 JSON
</Button>
</Col>
</Row>
<Tree
showLine
defaultExpandedKeys={['school-0']}
treeData={buildDataTree()}
style={{ background: '#fafafa', padding: 16, borderRadius: 8 }}
/>
</div>
) : (
<Empty description="暂无数据,请先运行爬虫" />
)}
</TabPane>
{/* 脚本管理 */}
<TabPane tab="脚本管理" key="script">
{activeScript ? (
<div>
<Descriptions bordered column={2}>
<Descriptions.Item label="脚本名称">{activeScript.script_name}</Descriptions.Item>
<Descriptions.Item label="版本">v{activeScript.version}</Descriptions.Item>
<Descriptions.Item label="状态">
<Tag color="green"></Tag>
</Descriptions.Item>
<Descriptions.Item label="创建时间">
{new Date(activeScript.created_at).toLocaleString()}
</Descriptions.Item>
</Descriptions>
<Title level={5} style={{ marginTop: 24 }}>脚本内容</Title>
<pre style={{
background: '#1e1e1e',
color: '#d4d4d4',
padding: 16,
borderRadius: 8,
maxHeight: 400,
overflow: 'auto'
}}>
{activeScript.script_content}
</pre>
</div>
) : (
<Empty description="暂无脚本" />
)}
</TabPane>
</Tabs>
</Card>
</div>
)
}

77
frontend/src/services/api.ts Normal file
View File

@ -0,0 +1,77 @@
/**
* API服务
*/
import axios from 'axios'
const api = axios.create({
baseURL: '/api',
timeout: 60000
})
// 大学相关API
export const universityApi = {
list: (params?: { skip?: number; limit?: number; search?: string }) =>
api.get('/universities', { params }),
get: (id: number) =>
api.get(`/universities/${id}`),
create: (data: { name: string; url: string; country?: string }) =>
api.post('/universities', data),
update: (id: number, data: { name?: string; url?: string; country?: string }) =>
api.put(`/universities/${id}`, data),
delete: (id: number) =>
api.delete(`/universities/${id}`)
}
// 脚本相关API
export const scriptApi = {
generate: (data: { university_url: string; university_name?: string }) =>
api.post('/scripts/generate', data),
getByUniversity: (universityId: number) =>
api.get(`/scripts/university/${universityId}`),
get: (id: number) =>
api.get(`/scripts/${id}`)
}
// 任务相关API
export const jobApi = {
start: (universityId: number) =>
api.post(`/jobs/start/${universityId}`),
get: (id: number) =>
api.get(`/jobs/${id}`),
getStatus: (id: number) =>
api.get(`/jobs/${id}/status`),
getByUniversity: (universityId: number) =>
api.get(`/jobs/university/${universityId}`),
cancel: (id: number) =>
api.post(`/jobs/${id}/cancel`)
}
// 结果相关API
export const resultApi = {
get: (universityId: number) =>
api.get(`/results/university/${universityId}`),
getSchools: (universityId: number) =>
api.get(`/results/university/${universityId}/schools`),
getPrograms: (universityId: number, params?: { school_name?: string; search?: string }) =>
api.get(`/results/university/${universityId}/programs`, { params }),
getFaculty: (universityId: number, params?: { school_name?: string; program_name?: string; search?: string; skip?: number; limit?: number }) =>
api.get(`/results/university/${universityId}/faculty`, { params }),
export: (universityId: number) =>
api.get(`/results/university/${universityId}/export`, { responseType: 'blob' })
}
export default api
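
The same REST endpoints can be driven without the frontend, which is handy for scripting or smoke tests. A rough Python equivalent of the service layer above (assumes the backend from `backend/` is running on port 8000 and that `requests` is available; it is not a declared dependency of this repo):

```python
import time

import requests  # assumption: any HTTP client works; requests is used here for brevity

BASE = "http://localhost:8000/api"

def generate_script(university_url: str, university_name: str | None = None) -> int:
    """POST /scripts/generate and return the new university_id."""
    resp = requests.post(
        f"{BASE}/scripts/generate",
        json={"university_url": university_url, "university_name": university_name},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["university_id"]

def run_and_wait(university_id: int) -> dict:
    """Start a scrape job and poll /jobs/{id}/status until it finishes."""
    job = requests.post(f"{BASE}/jobs/start/{university_id}", timeout=30).json()
    while True:
        status = requests.get(f"{BASE}/jobs/{job['id']}/status", timeout=30).json()
        if status["status"] in ("completed", "failed", "cancelled"):
            return status
        time.sleep(2)  # mirrors the 2 s polling interval used by the React page
```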

1
frontend/src/vite-env.d.ts vendored Normal file
View File

@ -0,0 +1 @@
/// <reference types="vite/client" />

21
frontend/tsconfig.json Normal file
View File

@ -0,0 +1,21 @@
{
"compilerOptions": {
"target": "ES2020",
"useDefineForClassFields": true,
"lib": ["ES2020", "DOM", "DOM.Iterable"],
"module": "ESNext",
"skipLibCheck": true,
"moduleResolution": "bundler",
"allowImportingTsExtensions": true,
"resolveJsonModule": true,
"isolatedModules": true,
"noEmit": true,
"jsx": "react-jsx",
"strict": true,
"noUnusedLocals": true,
"noUnusedParameters": true,
"noFallthroughCasesInSwitch": true
},
"include": ["src"],
"references": [{ "path": "./tsconfig.node.json" }]
}

10
frontend/tsconfig.node.json Normal file
View File

@ -0,0 +1,10 @@
{
"compilerOptions": {
"composite": true,
"skipLibCheck": true,
"module": "ESNext",
"moduleResolution": "bundler",
"allowSyntheticDefaultImports": true
},
"include": ["vite.config.ts"]
}

15
frontend/vite.config.ts Normal file
View File

@ -0,0 +1,15 @@
import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react'
export default defineConfig({
plugins: [react()],
server: {
port: 3000,
proxy: {
'/api': {
target: 'http://localhost:8000',
changeOrigin: true
}
}
}
})

135
generate_scraper.py Normal file
View File

@ -0,0 +1,135 @@
#!/usr/bin/env python
"""
University Scraper Generator
This script generates a Playwright-based web scraper for any university website.
It uses an AI agent to analyze the university's website structure and create
a customized scraper that collects master's program pages and faculty profiles.
Usage:
python generate_scraper.py [--url URL] [--name NAME] [--language LANG] [--max-depth N] [--max-pages N] [--no-snapshot]
Configuration:
The defaults below are used when no CLI flags are given:
- TARGET_URL: The university homepage URL
- CAMPUS_NAME: Short name for the university
- LANGUAGE: Primary language of the website
- MAX_DEPTH: How deep to crawl (default: 3)
- MAX_PAGES: Maximum pages to visit during sampling (default: 30)
"""
import argparse
import os
import sys
# ============================================================================
# CONFIGURATION - Modify these values for your target university
# ============================================================================
TARGET_URL = "https://www.harvard.edu/"
CAMPUS_NAME = "Harvard"
LANGUAGE = "English"
MAX_DEPTH = 3
MAX_PAGES = 30
# ============================================================================
def get_env_key(name: str) -> str | None:
"""Get environment variable, with Windows registry fallback."""
# Try standard environment variable first
value = os.environ.get(name)
if value:
return value
# Windows: try reading from user environment in registry
if sys.platform == "win32":
try:
import winreg
with winreg.OpenKey(winreg.HKEY_CURRENT_USER, r"Environment") as key:
return winreg.QueryValueEx(key, name)[0]
except Exception:
pass
return None
def main():
parser = argparse.ArgumentParser(
description="Generate a Playwright scraper for a university website"
)
parser.add_argument(
"--url",
default=TARGET_URL,
help="University homepage URL"
)
parser.add_argument(
"--name",
default=CAMPUS_NAME,
help="Short name for the university"
)
parser.add_argument(
"--language",
default=LANGUAGE,
help="Primary language of the website"
)
parser.add_argument(
"--max-depth",
type=int,
default=MAX_DEPTH,
help="Maximum crawl depth"
)
parser.add_argument(
"--max-pages",
type=int,
default=MAX_PAGES,
help="Maximum pages to visit during sampling"
)
parser.add_argument(
"--no-snapshot",
action="store_true",
help="Skip browser snapshot capture"
)
args = parser.parse_args()
# Configure OpenRouter API
openrouter_key = get_env_key("OPENROUTER_API_KEY")
if not openrouter_key:
print("Error: OPENROUTER_API_KEY environment variable not set")
print("Please set it with your OpenRouter API key")
sys.exit(1)
os.environ["OPENAI_API_KEY"] = openrouter_key
os.environ["CODEGEN_MODEL_PROVIDER"] = "openrouter"
os.environ["CODEGEN_OPENROUTER_MODEL"] = "anthropic/claude-3-opus"
# Import after environment is configured
from university_agent import GenerationEngine, GenerationRequest, Settings
settings = Settings()
print(f"Provider: {settings.model_provider}")
print(f"Model: {settings.openrouter_model}")
engine = GenerationEngine(settings)
request = GenerationRequest(
target_url=args.url,
campus_name=args.name,
assumed_language=args.language,
max_depth=args.max_depth,
max_pages=args.max_pages,
)
print(f"\nGenerating scraper for: {args.name}")
print(f"URL: {args.url}")
print(f"Max depth: {args.max_depth}, Max pages: {args.max_pages}")
print("-" * 50)
result = engine.generate(request, capture_snapshot=not args.no_snapshot)
print("-" * 50)
print(f"Script saved to: {result.script_path}")
print(f"Project slug: {result.plan.project_slug}")
print(f"\nTo run the scraper:")
print(f" cd artifacts")
print(f" uv run python {result.script_path.name} --max-pages 50 --no-verify")
if __name__ == "__main__":
main()

View File

@ -0,0 +1,164 @@
#!/usr/bin/env python3
"""
将已爬取的Harvard数据按学院重新组织
读取原始扁平数据,按 学院 → 项目 → 导师 层级重新组织输出
"""
import json
from pathlib import Path
from datetime import datetime, timezone
from urllib.parse import urlparse
from collections import defaultdict
# Harvard学院映射 - 根据URL子域名判断所属学院
SCHOOL_MAPPING = {
"gsas.harvard.edu": "Graduate School of Arts and Sciences (GSAS)",
"seas.harvard.edu": "John A. Paulson School of Engineering and Applied Sciences (SEAS)",
"hbs.edu": "Harvard Business School (HBS)",
"www.hbs.edu": "Harvard Business School (HBS)",
"gsd.harvard.edu": "Graduate School of Design (GSD)",
"www.gsd.harvard.edu": "Graduate School of Design (GSD)",
"gse.harvard.edu": "Graduate School of Education (HGSE)",
"www.gse.harvard.edu": "Graduate School of Education (HGSE)",
"hks.harvard.edu": "Harvard Kennedy School (HKS)",
"www.hks.harvard.edu": "Harvard Kennedy School (HKS)",
"hls.harvard.edu": "Harvard Law School (HLS)",
"hms.harvard.edu": "Harvard Medical School (HMS)",
"hsph.harvard.edu": "T.H. Chan School of Public Health (HSPH)",
"www.hsph.harvard.edu": "T.H. Chan School of Public Health (HSPH)",
"hds.harvard.edu": "Harvard Divinity School (HDS)",
"hsdm.harvard.edu": "Harvard School of Dental Medicine (HSDM)",
"fas.harvard.edu": "Faculty of Arts and Sciences (FAS)",
"aaas.fas.harvard.edu": "Faculty of Arts and Sciences (FAS)",
"dce.harvard.edu": "Division of Continuing Education (DCE)",
"extension.harvard.edu": "Harvard Extension School",
"cs.seas.harvard.edu": "John A. Paulson School of Engineering and Applied Sciences (SEAS)",
}
# 学院URL映射
SCHOOL_URLS = {
"Graduate School of Arts and Sciences (GSAS)": "https://gsas.harvard.edu/",
"John A. Paulson School of Engineering and Applied Sciences (SEAS)": "https://seas.harvard.edu/",
"Harvard Business School (HBS)": "https://www.hbs.edu/",
"Graduate School of Design (GSD)": "https://www.gsd.harvard.edu/",
"Graduate School of Education (HGSE)": "https://www.gse.harvard.edu/",
"Harvard Kennedy School (HKS)": "https://www.hks.harvard.edu/",
"Harvard Law School (HLS)": "https://hls.harvard.edu/",
"Harvard Medical School (HMS)": "https://hms.harvard.edu/",
"T.H. Chan School of Public Health (HSPH)": "https://www.hsph.harvard.edu/",
"Harvard Divinity School (HDS)": "https://hds.harvard.edu/",
"Harvard School of Dental Medicine (HSDM)": "https://hsdm.harvard.edu/",
"Faculty of Arts and Sciences (FAS)": "https://fas.harvard.edu/",
"Division of Continuing Education (DCE)": "https://dce.harvard.edu/",
"Harvard Extension School": "https://extension.harvard.edu/",
"Other": "https://www.harvard.edu/",
}
def determine_school_from_url(url: str) -> str:
"""根据URL判断所属学院"""
if not url:
return "Other"
parsed = urlparse(url)
domain = parsed.netloc.lower()
# 先尝试完全匹配
for pattern, school_name in SCHOOL_MAPPING.items():
if domain == pattern:
return school_name
# 再尝试部分匹配
for pattern, school_name in SCHOOL_MAPPING.items():
if pattern in domain:
return school_name
return "Other"
def reorganize_data(input_path: str, output_path: str):
"""重新组织数据按学院层级"""
# 读取原始数据
with open(input_path, 'r', encoding='utf-8') as f:
data = json.load(f)
print(f"读取原始数据: {data['total_programs']} 个项目, {data['total_faculty_found']} 位导师")
# 按学院分组
schools_dict = defaultdict(lambda: {"name": "", "url": "", "programs": []})
for prog in data['programs']:
# 根据faculty_page_url判断学院
faculty_url = prog.get('faculty_page_url', '')
school_name = determine_school_from_url(faculty_url)
# 如果没有faculty_page_url尝试从program url推断
if school_name == "Other" and prog.get('url'):
school_name = determine_school_from_url(prog['url'])
# 创建项目对象
program = {
"name": prog['name'],
"url": prog.get('url', ''),
"degree_type": prog.get('degrees', ''),
"faculty_page_url": faculty_url,
"faculty": prog.get('faculty', [])
}
# 添加到学院
if not schools_dict[school_name]["name"]:
schools_dict[school_name]["name"] = school_name
schools_dict[school_name]["url"] = SCHOOL_URLS.get(school_name, "")
schools_dict[school_name]["programs"].append(program)
# 转换为列表并排序
schools_list = sorted(schools_dict.values(), key=lambda s: s["name"])
# 构建输出结构
result = {
"name": "Harvard University",
"url": "https://www.harvard.edu/",
"country": "USA",
"scraped_at": datetime.now(timezone.utc).isoformat(),
"schools": schools_list
}
# 打印统计
print("\n" + "=" * 60)
print("按学院重新组织完成!")
print("=" * 60)
print(f"大学: {result['name']}")
print(f"学院数: {len(schools_list)}")
total_programs = sum(len(s['programs']) for s in schools_list)
total_faculty = sum(len(p['faculty']) for s in schools_list for p in s['programs'])
print(f"项目数: {total_programs}")
print(f"导师数: {total_faculty}")
print("\n各学院统计:")
for school in schools_list:
prog_count = len(school['programs'])
fac_count = sum(len(p['faculty']) for p in school['programs'])
print(f" {school['name']}: {prog_count}个项目, {fac_count}位导师")
# 保存结果
output_file = Path(output_path)
output_file.parent.mkdir(parents=True, exist_ok=True)
with open(output_file, 'w', encoding='utf-8') as f:
json.dump(result, f, ensure_ascii=False, indent=2)
print(f"\n结果已保存到: {output_path}")
return result
if __name__ == "__main__":
input_file = "artifacts/harvard_programs_with_faculty.json"
output_file = "output/harvard_hierarchical_result.json"
reorganize_data(input_file, output_file)
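
A few sanity checks for the URL-to-school mapping above (these lines assume they run in the same module as `determine_school_from_url`; the example URLs are illustrative):

```python
# Appended to the script above, these asserts document the expected mapping behaviour.
assert determine_school_from_url("https://seas.harvard.edu/people/jane-doe").startswith("John A. Paulson")
assert determine_school_from_url("https://www.hbs.edu/faculty/") == "Harvard Business School (HBS)"
assert determine_school_from_url("https://example.org/") == "Other"
```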

45
scripts/start_backend.py Normal file
View File

@ -0,0 +1,45 @@
#!/usr/bin/env python3
"""
启动后端API服务 (本地开发)
"""
import subprocess
import sys
import os
# 切换到项目根目录
project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
os.chdir(project_root)
# 添加backend到Python路径
backend_path = os.path.join(project_root, "backend")
sys.path.insert(0, backend_path)
print("=" * 60)
print("启动大学爬虫 Web API 服务")
print("=" * 60)
print(f"项目目录: {project_root}")
print(f"后端目录: {backend_path}")
print()
# 检查是否安装了依赖
try:
import fastapi
import uvicorn
except ImportError:
print("正在安装后端依赖...")
subprocess.run([sys.executable, "-m", "pip", "install", "-r", "backend/requirements.txt"])
# 初始化数据库
print("初始化数据库...")
os.chdir(backend_path)
# 启动服务
print()
print("启动 FastAPI 服务...")
print("API文档: http://localhost:8000/docs")
print("Swagger UI: http://localhost:8000/redoc")
print()
import uvicorn
uvicorn.run("app.main:app", host="0.0.0.0", port=8000, reload=True)

42
scripts/start_dev.bat Normal file
View File

@ -0,0 +1,42 @@
@echo off
echo ============================================================
echo 大学爬虫 Web 系统 - 本地开发启动
echo ============================================================
echo.
echo 启动后端API服务...
cd /d "%~dp0..\backend"
REM 安装后端依赖
pip install -r requirements.txt -q
REM 启动后端
start cmd /k "cd /d %~dp0..\backend && uvicorn app.main:app --reload --port 8000"
echo 后端已启动: http://localhost:8000
echo API文档: http://localhost:8000/docs
echo.
echo 启动前端服务...
cd /d "%~dp0..\frontend"
REM 安装前端依赖
if not exist node_modules (
echo 安装前端依赖...
npm install
)
REM 启动前端
start cmd /k "cd /d %~dp0..\frontend && npm run dev"
echo 前端已启动: http://localhost:3000
echo.
echo ============================================================
echo 系统启动完成!
echo.
echo 后端API: http://localhost:8000/docs
echo 前端页面: http://localhost:3000
echo ============================================================
pause

126
scripts/test_harvard.py Normal file
View File

@ -0,0 +1,126 @@
#!/usr/bin/env python3
"""
测试Harvard大学爬取 - 只测试2个学院
"""
import asyncio
import sys
from pathlib import Path
# 添加项目路径
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
from university_scraper.config import ScraperConfig
from university_scraper.scraper import UniversityScraper
# 简化的测试配置 - 只测试2个学院
TEST_CONFIG = {
"university": {
"name": "Harvard University",
"url": "https://www.harvard.edu/",
"country": "USA"
},
"schools": {
"discovery_method": "static_list",
"static_list": [
{
"name": "John A. Paulson School of Engineering and Applied Sciences (SEAS)",
"url": "https://seas.harvard.edu/"
},
{
"name": "Graduate School of Design (GSD)",
"url": "https://www.gsd.harvard.edu/"
}
]
},
"programs": {
"paths_to_try": [
"/academics/graduate-programs",
"/programs",
"/academics/programs",
"/graduate"
],
"link_patterns": [
{"text_contains": ["program", "degree"], "href_contains": ["/program", "/degree"]},
{"text_contains": ["master", "graduate"], "href_contains": ["/master", "/graduate"]}
],
"selectors": {
"program_item": "div.program-item, li.program, a[href*='/program']",
"program_name": "h3, .title",
"program_url": "a[href]",
"degree_type": ".degree"
},
"pagination": {"type": "none"}
},
"faculty": {
"discovery_strategies": [
{
"type": "link_in_page",
"patterns": [
{"text_contains": ["faculty", "people"], "href_contains": ["/faculty", "/people"]}
]
},
{
"type": "url_pattern",
"patterns": [
"{school_url}/faculty",
"{school_url}/people"
]
}
],
"selectors": {
"faculty_item": "div.faculty, li.person",
"faculty_name": "h3, .name",
"faculty_url": "a[href*='/people/'], a[href*='/faculty/']"
}
},
"filters": {
"program_degree_types": {
"include": ["Master", "M.S.", "M.A.", "MBA", "M.Eng", "S.M."],
"exclude": ["Ph.D.", "Doctor", "Bachelor"]
},
"exclude_schools": []
}
}
async def test_harvard():
"""测试Harvard爬取"""
print("=" * 60)
print("测试Harvard大学爬取简化版 - 2个学院")
print("=" * 60)
config = ScraperConfig.from_dict(TEST_CONFIG)
async with UniversityScraper(config, headless=False) as scraper:
university = await scraper.scrape()
scraper.save_results("output/harvard_test_result.json")
# 打印详细结果
print("\n" + "=" * 60)
print("详细结果:")
print("=" * 60)
for school in university.schools:
print(f"\n学院: {school.name}")
print(f" URL: {school.url}")
print(f" 项目数: {len(school.programs)}")
for prog in school.programs[:5]:
print(f"\n 项目: {prog.name}")
print(f" URL: {prog.url}")
print(f" 学位: {prog.degree_type}")
print(f" 导师数: {len(prog.faculty)}")
if prog.faculty:
print(" 导师示例:")
for f in prog.faculty[:3]:
print(f" - {f.name}: {f.url}")
if len(school.programs) > 5:
print(f"\n ... 还有 {len(school.programs) - 5} 个项目")
if __name__ == "__main__":
asyncio.run(test_harvard())

View File

@ -1,6 +1,7 @@
from __future__ import annotations
import json
import re
from textwrap import dedent
from agno.agent import Agent
@ -101,6 +102,12 @@ class ScriptAgent:
return Claude(id=self.settings.anthropic_model)
if provider == "openai":
return OpenAIChat(id=self.settings.openai_model)
if provider == "openrouter":
# OpenRouter is OpenAI-compatible, use OpenAIChat with custom base_url
return OpenAIChat(
id=self.settings.openrouter_model,
base_url=self.settings.openrouter_base_url,
)
raise ValueError(f"Unsupported provider: {provider}")
def build_plan(self, request: GenerationRequest, summary: SiteSummary | None) -> ScriptPlan:
@ -128,17 +135,65 @@ class ScriptAgent:
plan.script_name = self.settings.default_script_name
return plan
@staticmethod
def _extract_json(text: str) -> dict | None:
"""Try to extract JSON from text that might contain markdown or other content."""
# Try direct parsing first
try:
return json.loads(text)
except json.JSONDecodeError:
pass
# Try to find JSON in code blocks
code_block_pattern = r"```(?:json)?\s*([\s\S]*?)```"
matches = re.findall(code_block_pattern, text)
for match in matches:
try:
return json.loads(match.strip())
except json.JSONDecodeError:
continue
# Try to find JSON object pattern
json_pattern = r"\{[\s\S]*\}"
matches = re.findall(json_pattern, text)
for match in matches:
try:
return json.loads(match)
except json.JSONDecodeError:
continue
return None
@staticmethod
def _coerce_plan(run_response: RunOutput) -> ScriptPlan:
content = run_response.content
if isinstance(content, ScriptPlan):
return content
if isinstance(content, dict):
return ScriptPlan.model_validate(content)
if isinstance(content, str):
try:
payload = json.loads(content)
except json.JSONDecodeError as exc:
raise ValueError("Agent returned a non-JSON payload.") from exc
return ScriptPlan.model_validate(payload)
payload = content
elif isinstance(content, str):
payload = ScriptAgent._extract_json(content)
if payload is None:
raise ValueError(f"Agent returned a non-JSON payload: {content[:500]}")
else:
raise ValueError("Agent response did not match the ScriptPlan schema.")
# Fill in missing required fields with defaults
if "project_slug" not in payload:
payload["project_slug"] = "university-scraper"
if "description" not in payload:
payload["description"] = "Playwright scraper for university master programs and faculty profiles."
if "master_program_keywords" not in payload:
payload["master_program_keywords"] = ["master", "graduate", "M.S.", "M.A."]
if "faculty_keywords" not in payload:
payload["faculty_keywords"] = ["professor", "faculty", "researcher", "people"]
if "navigation_strategy" not in payload:
payload["navigation_strategy"] = "Navigate from homepage to departments to find programs and faculty."
# Handle navigation_strategy if it's a list instead of string
if isinstance(payload.get("navigation_strategy"), list):
payload["navigation_strategy"] = " ".join(payload["navigation_strategy"])
# Handle extra_notes if it's a string instead of list
if isinstance(payload.get("extra_notes"), str):
payload["extra_notes"] = [payload["extra_notes"]]
return ScriptPlan.model_validate(payload)
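
A quick illustration of the fallback order implemented above: direct `json.loads`, then fenced code blocks, then the first `{...}` span in the text (the import path is an assumption; this diff does not show the module name):

```python
from university_agent.agent import ScriptAgent  # assumption: adjust to the real module path

wrapped = "Here is the plan:\n```json\n{\"project_slug\": \"ucl-masters\"}\n```"
payload = ScriptAgent._extract_json(wrapped)
assert payload == {"project_slug": "ucl-masters"}

# _coerce_plan() then fills any missing required fields (description, keywords,
# navigation_strategy, ...) before validating the payload against ScriptPlan.
```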

View File

@ -10,7 +10,7 @@ from pydantic_settings import BaseSettings
class Settings(BaseSettings):
"""Runtime configuration for the code-generation agent."""
model_provider: Literal["anthropic", "openai"] = Field(
model_provider: Literal["anthropic", "openai", "openrouter"] = Field(
default="anthropic",
description="LLM provider consumed through the Agno SDK.",
)
@ -22,6 +22,14 @@ class Settings(BaseSettings):
default="o4-mini",
description="Default OpenAI model identifier.",
)
openrouter_model: str = Field(
default="anthropic/claude-sonnet-4",
description="Default OpenRouter model identifier.",
)
openrouter_base_url: str = Field(
default="https://openrouter.ai/api/v1",
description="OpenRouter API base URL.",
)
reasoning_enabled: bool = Field(
default=True,
description="Enable multi-step reasoning for higher-fidelity plans.",

7
src/university_scraper/__init__.py Normal file
View File

@ -0,0 +1,7 @@
"""
University Scraper - 通用大学官网爬虫框架
支持按照 学院 → 项目 → 导师 的层级结构爬取任意海外大学官网
"""
__version__ = "1.0.0"

8
src/university_scraper/__main__.py Normal file
View File

@ -0,0 +1,8 @@
"""
模块入口点,支持 python -m university_scraper 运行
"""
from .cli import main
if __name__ == "__main__":
main()

374
src/university_scraper/analyzer.py Normal file
View File

@ -0,0 +1,374 @@
"""
AI辅助页面分析工具
帮助分析新大学官网的页面结构,生成配置建议
"""
import asyncio
import json
from typing import Dict, Any, List, Optional
from urllib.parse import urljoin, urlparse
from playwright.async_api import async_playwright, Page
class PageAnalyzer:
"""页面结构分析器"""
def __init__(self):
self.browser = None
self.page: Optional[Page] = None
async def __aenter__(self):
playwright = await async_playwright().start()
self.browser = await playwright.chromium.launch(headless=False)
context = await self.browser.new_context(
viewport={'width': 1920, 'height': 1080}
)
self.page = await context.new_page()
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
if self.browser:
await self.browser.close()
async def analyze_university_homepage(self, url: str) -> Dict[str, Any]:
"""分析大学官网首页,寻找学院链接"""
print(f"\n分析大学首页: {url}")
await self.page.goto(url, wait_until='networkidle')
await self.page.wait_for_timeout(3000)
analysis = await self.page.evaluate('''() => {
const result = {
title: document.title,
schools_links: [],
navigation_links: [],
potential_schools_pages: [],
all_subdomains: new Set()
};
// 查找可能的学院链接
const schoolKeywords = ['school', 'college', 'faculty', 'institute', 'academy', 'department'];
const navKeywords = ['academics', 'schools', 'colleges', 'programs', 'education'];
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href || '';
const text = a.innerText.trim().toLowerCase();
// 收集当前站点主域名下的所有子域名(不再硬编码 harvard.edu
try {
const urlObj = new URL(href);
const rootDomain = location.hostname.startsWith('www.') ? location.hostname.slice(4) : location.hostname;
if (urlObj.hostname.endsWith('.' + rootDomain) &&
urlObj.hostname !== location.hostname) {
result.all_subdomains.add(urlObj.origin);
}
} catch(e) {}
// 查找学院链接
if (schoolKeywords.some(kw => text.includes(kw)) ||
schoolKeywords.some(kw => href.toLowerCase().includes(kw))) {
result.schools_links.push({
text: a.innerText.trim().substring(0, 100),
href: href
});
}
// 查找导航到学院列表的链接
if (navKeywords.some(kw => text.includes(kw))) {
result.potential_schools_pages.push({
text: a.innerText.trim().substring(0, 50),
href: href
});
}
});
// 转换Set为数组
result.all_subdomains = Array.from(result.all_subdomains);
return result;
}''')
print(f"\n页面标题: {analysis['title']}")
print(f"\n发现的子域名 ({len(analysis['all_harvard_subdomains'])} 个):")
for subdomain in analysis['all_harvard_subdomains'][:20]:
print(f" - {subdomain}")
print(f"\n可能的学院链接 ({len(analysis['schools_links'])} 个):")
for link in analysis['schools_links'][:15]:
print(f" - {link['text'][:50]} -> {link['href']}")
return analysis
async def analyze_school_page(self, url: str) -> Dict[str, Any]:
"""分析学院页面,寻找项目列表"""
print(f"\n分析学院页面: {url}")
await self.page.goto(url, wait_until='networkidle')
await self.page.wait_for_timeout(3000)
analysis = await self.page.evaluate('''() => {
const result = {
title: document.title,
navigation: [],
program_links: [],
degree_mentions: [],
faculty_links: []
};
// 分析导航结构
document.querySelectorAll('nav a, [class*="nav"] a, header a').forEach(a => {
const text = a.innerText.trim();
const href = a.href || '';
if (text.length > 2 && text.length < 50) {
result.navigation.push({ text, href });
}
});
// 查找项目/学位链接
const programKeywords = ['program', 'degree', 'master', 'graduate', 'academic', 'study'];
document.querySelectorAll('a[href]').forEach(a => {
const text = a.innerText.trim().toLowerCase();
const href = a.href.toLowerCase();
if (programKeywords.some(kw => text.includes(kw) || href.includes(kw))) {
result.program_links.push({
text: a.innerText.trim().substring(0, 100),
href: a.href
});
}
// 查找Faculty链接
if (text.includes('faculty') || text.includes('people') ||
href.includes('/faculty') || href.includes('/people')) {
result.faculty_links.push({
text: a.innerText.trim().substring(0, 100),
href: a.href
});
}
});
return result;
}''')
print(f"\n导航链接:")
for nav in analysis['navigation'][:10]:
print(f" - {nav['text']} -> {nav['href']}")
print(f"\n项目相关链接 ({len(analysis['program_links'])} 个):")
for link in analysis['program_links'][:15]:
print(f" - {link['text'][:50]} -> {link['href']}")
print(f"\nFaculty链接 ({len(analysis['faculty_links'])} 个):")
for link in analysis['faculty_links'][:10]:
print(f" - {link['text'][:50]} -> {link['href']}")
return analysis
async def analyze_programs_page(self, url: str) -> Dict[str, Any]:
"""分析项目列表页面,识别项目选择器"""
print(f"\n分析项目列表页面: {url}")
await self.page.goto(url, wait_until='networkidle')
await self.page.wait_for_timeout(3000)
# 保存截图
screenshot_path = f"analysis_{urlparse(url).netloc.replace('.', '_')}.png"
await self.page.screenshot(path=screenshot_path, full_page=True)
print(f"截图已保存: {screenshot_path}")
analysis = await self.page.evaluate('''() => {
const result = {
title: document.title,
potential_program_containers: [],
program_items: [],
pagination: null,
selectors_suggestion: {}
};
// 分析页面结构,寻找重复的项目容器
const containers = [
'div[class*="program"]',
'li[class*="program"]',
'article[class*="program"]',
'div[class*="degree"]',
'div[class*="card"]',
'li.item',
'div.item'
];
containers.forEach(selector => {
const elements = document.querySelectorAll(selector);
if (elements.length >= 3) {
result.potential_program_containers.push({
selector: selector,
count: elements.length,
sample: elements[0].outerHTML.substring(0, 500)
});
}
});
// 查找所有看起来像项目的链接
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href.toLowerCase();
const text = a.innerText.trim();
if ((href.includes('/program') || href.includes('/degree') ||
href.includes('/master') || href.includes('/graduate')) &&
text.length > 5 && text.length < 150) {
result.program_items.push({
text: text,
href: a.href,
parentClass: a.parentElement?.className || '',
grandparentClass: a.parentElement?.parentElement?.className || ''
});
}
});
// 查找分页元素
const paginationSelectors = [
'.pagination',
'[class*="pagination"]',
'nav[aria-label*="page"]',
'.pager'
];
for (const selector of paginationSelectors) {
const elem = document.querySelector(selector);
if (elem) {
result.pagination = {
selector: selector,
html: elem.outerHTML.substring(0, 300)
};
break;
}
}
return result;
}''')
print(f"\n可能的项目容器:")
for container in analysis['potential_program_containers']:
print(f" 选择器: {container['selector']} (找到 {container['count']} 个)")
print(f"\n找到的项目链接 ({len(analysis['program_items'])} 个):")
for item in analysis['program_items'][:10]:
print(f" - {item['text'][:60]}")
print(f" 父元素class: {item['parentClass'][:50]}")
if analysis['pagination']:
print(f"\n分页元素: {analysis['pagination']['selector']}")
return analysis
async def analyze_faculty_page(self, url: str) -> Dict[str, Any]:
"""分析导师列表页面,识别导师选择器"""
print(f"\n分析导师列表页面: {url}")
await self.page.goto(url, wait_until='networkidle')
await self.page.wait_for_timeout(3000)
analysis = await self.page.evaluate('''() => {
const result = {
title: document.title,
faculty_links: [],
potential_containers: [],
url_patterns: new Set()
};
// 查找个人页面链接
const personPatterns = ['/people/', '/faculty/', '/profile/', '/person/', '/directory/'];
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href.toLowerCase();
const text = a.innerText.trim();
if (personPatterns.some(p => href.includes(p)) &&
text.length > 3 && text.length < 100) {
result.faculty_links.push({
name: text,
url: a.href,
parentClass: a.parentElement?.className || ''
});
// 记录URL模式
personPatterns.forEach(p => {
if (href.includes(p)) {
result.url_patterns.add(p);
}
});
}
});
result.url_patterns = Array.from(result.url_patterns);
return result;
}''')
print(f"\n发现的导师链接 ({len(analysis['faculty_links'])} 个):")
for faculty in analysis['faculty_links'][:15]:
print(f" - {faculty['name']} -> {faculty['url']}")
print(f"\nURL模式: {analysis['url_patterns']}")
return analysis
async def generate_config_suggestion(self, university_url: str) -> str:
"""生成配置文件建议"""
print(f"\n{'='*60}")
print(f"开始分析: {university_url}")
print(f"{'='*60}")
# 分析首页
homepage_analysis = await self.analyze_university_homepage(university_url)
# 生成配置建议
domain = urlparse(university_url).netloc
config_suggestion = f'''# {homepage_analysis['title']} 爬虫配置
# 自动生成的配置建议,请根据实际情况调整
university:
name: "{homepage_analysis['title'].split(' - ')[0].split(' | ')[0]}"
url: "{university_url}"
country: "TODO"
# 发现的子域名(可能是学院网站):
# {chr(10).join(['# - ' + s for s in homepage_analysis['all_subdomains'][:10]])}
schools:
discovery_method: "static_list"
# TODO: 根据上面的子域名和学院链接,手动填写学院列表
static_list:
# 示例:
# - name: "School of Engineering"
# url: "https://engineering.{domain}/"
'''
print(f"\n{'='*60}")
print("配置建议:")
print(f"{'='*60}")
print(config_suggestion)
return config_suggestion
async def analyze_new_university(url: str):
"""分析新大学的便捷函数"""
async with PageAnalyzer() as analyzer:
await analyzer.generate_config_suggestion(url)
# CLI入口
if __name__ == "__main__":
import sys
if len(sys.argv) < 2:
print("用法: python analyzer.py <university_url>")
print("示例: python analyzer.py https://www.stanford.edu/")
sys.exit(1)
asyncio.run(analyze_new_university(sys.argv[1]))
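
Programmatic equivalent of `python -m university_scraper analyze <url>` using the helper defined above (the Stanford URL is just an example; run from the repo root with `src/` on the path, as the test scripts do):

```python
import asyncio

from university_scraper.analyzer import analyze_new_university

asyncio.run(analyze_new_university("https://www.stanford.edu/"))
```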

105
src/university_scraper/cli.py Normal file
View File

@ -0,0 +1,105 @@
"""
命令行工具
用法:
# 爬取指定大学
python -m university_scraper scrape harvard
# 分析新大学
python -m university_scraper analyze https://www.stanford.edu/
# 列出可用配置
python -m university_scraper list
"""
import asyncio
import argparse
from pathlib import Path
def main():
parser = argparse.ArgumentParser(
description="通用大学官网爬虫 - 按照 学院→项目→导师 层级爬取"
)
subparsers = parser.add_subparsers(dest='command', help='可用命令')
# 爬取命令
scrape_parser = subparsers.add_parser('scrape', help='爬取指定大学')
scrape_parser.add_argument('university', help='大学名称(配置文件名,不含.yaml')
scrape_parser.add_argument('-o', '--output', help='输出文件路径', default=None)
scrape_parser.add_argument('--headless', action='store_true', help='无头模式运行')
scrape_parser.add_argument('--config-dir', default='configs', help='配置文件目录')
# 分析命令
analyze_parser = subparsers.add_parser('analyze', help='分析新大学官网结构')
analyze_parser.add_argument('url', help='大学官网URL')
# 列出命令
list_parser = subparsers.add_parser('list', help='列出可用的大学配置')
list_parser.add_argument('--config-dir', default='configs', help='配置文件目录')
args = parser.parse_args()
if args.command == 'scrape':
asyncio.run(run_scrape(args))
elif args.command == 'analyze':
asyncio.run(run_analyze(args))
elif args.command == 'list':
run_list(args)
else:
parser.print_help()
async def run_scrape(args):
"""执行爬取"""
from .config import load_config
from .scraper import UniversityScraper
config_path = Path(args.config_dir) / f"{args.university}.yaml"
if not config_path.exists():
print(f"错误: 配置文件不存在 - {config_path}")
print(f"可用配置: {list_configs(args.config_dir)}")
return
config = load_config(str(config_path))
output_path = args.output or f"output/{args.university}_result.json"
async with UniversityScraper(config, headless=args.headless) as scraper:
await scraper.scrape()
scraper.save_results(output_path)
async def run_analyze(args):
"""执行分析"""
from .analyzer import PageAnalyzer
async with PageAnalyzer() as analyzer:
await analyzer.generate_config_suggestion(args.url)
def run_list(args):
"""列出可用配置"""
configs = list_configs(args.config_dir)
if configs:
print("可用的大学配置:")
for name in configs:
print(f" - {name}")
else:
print(f"{args.config_dir} 目录下没有找到配置文件")
def list_configs(config_dir: str):
"""列出配置文件"""
path = Path(config_dir)
if not path.exists():
return []
return [f.stem for f in path.glob("*.yaml")] + [f.stem for f in path.glob("*.yml")]
if __name__ == "__main__":
main()

232
src/university_scraper/config.py Normal file
View File

@ -0,0 +1,232 @@
"""
配置文件加载和验证
配置文件格式 (YAML):
university:
name: "Harvard University"
url: "https://www.harvard.edu/"
country: "USA"
# 第一层:学院列表页面
schools:
# 获取学院列表的方式
discovery_method: "static_list" # static_list | scrape_page | sitemap
# 方式1: 静态列表 (手动配置已知学院)
static_list:
- name: "School of Engineering and Applied Sciences"
url: "https://seas.harvard.edu/"
keywords: ["engineering", "computer"]
faculty_pages:
- url: "https://seas.harvard.edu/people"
extract_method: "links" # links | table | research_explorer
request:
timeout_ms: 90000
wait_for_selector: ".profile-card"
- name: "Graduate School of Arts and Sciences"
url: "https://gsas.harvard.edu/"
# 方式2: 从页面爬取
scrape_config:
url: "https://www.harvard.edu/schools/"
selector: "a.school-link"
name_attribute: "text" # text | title | data-name
url_attribute: "href"
# 第二层:每个学院下的项目列表
programs:
# 相对于学院URL的路径模式
paths_to_try:
- "/academics/graduate-programs"
- "/programs"
- "/graduate"
- "/academics/masters"
# 或者使用选择器从学院首页查找
link_patterns:
- text_contains: ["graduate", "master", "program"]
- href_contains: ["/program", "/graduate", "/academics"]
# 项目列表页面的选择器
selectors:
program_item: "div.program-item, li.program, a.program-link"
program_name: "h3, .title, .program-name"
program_url: "a[href]"
degree_type: ".degree, .credential"
request:
timeout_ms: 45000
max_retries: 3
retry_backoff_ms: 3000
# 分页配置
pagination:
type: "none" # none | click | url_param | infinite_scroll
next_selector: "a.next, button.next-page"
param_name: "page"
# 第三层:每个项目下的导师列表
faculty:
# 查找导师页面的策略
discovery_strategies:
- type: "link_in_page"
patterns:
- text_contains: ["faculty", "people", "advisor", "professor"]
- href_contains: ["/faculty", "/people", "/directory"]
- type: "url_pattern"
patterns:
- "{program_url}/faculty"
- "{program_url}/people"
- "{school_url}/people"
- type: "school_directory"
assign_to_all: true
match_by_school_keywords: true
request:
timeout_ms: 90000
wait_for_selector: "a.link.person"
# 导师列表页面的选择器
selectors:
faculty_item: "div.faculty-item, li.person, .profile-card"
faculty_name: "h3, .name, .title a"
faculty_url: "a[href*='/people/'], a[href*='/faculty/'], a[href*='/profile/']"
faculty_title: ".title, .position, .role"
faculty_email: "a[href^='mailto:']"
# 过滤规则
filters:
# 只爬取硕士项目
program_degree_types:
include: ["M.S.", "M.A.", "MBA", "Master", "M.Eng", "M.Ed", "M.P.P", "M.P.A"]
exclude: ["Ph.D.", "Bachelor", "B.S.", "B.A.", "Certificate"]
# 排除某些学院
exclude_schools:
- "Summer School"
- "Extension School"
"""
import yaml
from pathlib import Path
from typing import Dict, Any, List, Optional
from dataclasses import dataclass, field
@dataclass
class UniversityConfig:
"""大学基本信息配置"""
name: str
url: str
country: str = "Unknown"
@dataclass
class SchoolsConfig:
"""学院发现配置"""
discovery_method: str = "static_list"
static_list: List[Dict[str, str]] = field(default_factory=list)
scrape_config: Optional[Dict[str, Any]] = None
request: Dict[str, Any] = field(default_factory=dict)
@dataclass
class ProgramsConfig:
"""项目发现配置"""
paths_to_try: List[str] = field(default_factory=list)
link_patterns: List[Dict[str, List[str]]] = field(default_factory=list)
selectors: Dict[str, str] = field(default_factory=dict)
pagination: Dict[str, Any] = field(default_factory=dict)
request: Dict[str, Any] = field(default_factory=dict)
global_catalog: Optional[Dict[str, Any]] = None
@dataclass
class FacultyConfig:
"""导师发现配置"""
discovery_strategies: List[Dict[str, Any]] = field(default_factory=list)
selectors: Dict[str, str] = field(default_factory=dict)
request: Dict[str, Any] = field(default_factory=dict)
@dataclass
class FiltersConfig:
"""过滤规则配置"""
program_degree_types: Dict[str, List[str]] = field(default_factory=dict)
exclude_schools: List[str] = field(default_factory=list)
@dataclass
class PlaywrightConfig:
"""Playwright运行环境配置"""
stealth: bool = False
user_agent: Optional[str] = None
locale: Optional[str] = None
timezone_id: Optional[str] = None
viewport: Optional[Dict[str, int]] = None
ignore_https_errors: bool = False
extra_headers: Dict[str, str] = field(default_factory=dict)
cookies: List[Dict[str, Any]] = field(default_factory=list)
add_init_scripts: List[str] = field(default_factory=list)
@dataclass
class ScraperConfig:
"""完整的爬虫配置"""
university: UniversityConfig
schools: SchoolsConfig
programs: ProgramsConfig
faculty: FacultyConfig
filters: FiltersConfig
playwright: PlaywrightConfig = field(default_factory=PlaywrightConfig)
@classmethod
def from_yaml(cls, yaml_path: str) -> "ScraperConfig":
"""从YAML文件加载配置"""
with open(yaml_path, 'r', encoding='utf-8') as f:
data = yaml.safe_load(f)
return cls(
university=UniversityConfig(**data.get('university', {})),
schools=SchoolsConfig(**data.get('schools', {})),
programs=ProgramsConfig(**data.get('programs', {})),
faculty=FacultyConfig(**data.get('faculty', {})),
filters=FiltersConfig(**data.get('filters', {})),
playwright=PlaywrightConfig(**data.get('playwright', {}))
)
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ScraperConfig":
"""从字典创建配置"""
return cls(
university=UniversityConfig(**data.get('university', {})),
schools=SchoolsConfig(**data.get('schools', {})),
programs=ProgramsConfig(**data.get('programs', {})),
faculty=FacultyConfig(**data.get('faculty', {})),
filters=FiltersConfig(**data.get('filters', {})),
playwright=PlaywrightConfig(**data.get('playwright', {}))
)
def load_config(config_path: str) -> ScraperConfig:
"""加载配置文件"""
path = Path(config_path)
if not path.exists():
raise FileNotFoundError(f"配置文件不存在: {config_path}")
if path.suffix in ['.yaml', '.yml']:
return ScraperConfig.from_yaml(config_path)
else:
raise ValueError(f"不支持的配置文件格式: {path.suffix}")
def list_available_configs(configs_dir: str = "configs") -> List[str]:
"""列出所有可用的配置文件"""
path = Path(configs_dir)
if not path.exists():
return []
return [
f.stem for f in path.glob("*.yaml")
] + [
f.stem for f in path.glob("*.yml")
]
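# A minimal usage sketch (illustrative): build a ScraperConfig in code via from_dict,
# or load a YAML file with load_config; the "configs/harvard.yaml" path is assumed.
if __name__ == "__main__":
    minimal = ScraperConfig.from_dict({
        "university": {"name": "Example University", "url": "https://example.edu/"},
        "schools": {
            "discovery_method": "static_list",
            "static_list": [{"name": "School of Engineering", "url": "https://eng.example.edu/"}],
        },
    })
    print(minimal.university.name, "->", minimal.schools.discovery_method)
    # cfg = load_config("configs/harvard.yaml")  # requires the YAML file to exist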

View File

@ -0,0 +1,405 @@
#!/usr/bin/env python3
"""
Harvard专用爬虫
Harvard的特殊情况
1. 有一个集中的项目列表页面 (harvard.edu/programs)
2. 项目详情在GSAS页面 (gsas.harvard.edu/program/xxx)
3. 导师信息在各院系网站
爬取流程:
1. 从集中页面获取所有硕士项目
2. 通过GSAS页面确定每个项目所属学院
3. 从院系网站获取导师信息
4. 按 学院→项目→导师 层级组织输出
"""
import asyncio
import json
import re
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Dict, Optional, Tuple
from urllib.parse import urljoin
from playwright.async_api import async_playwright, Page, Browser
from .models import University, School, Program, Faculty
# Harvard学院映射 - 根据URL子域名判断所属学院
SCHOOL_MAPPING = {
"gsas.harvard.edu": "Graduate School of Arts and Sciences (GSAS)",
"seas.harvard.edu": "John A. Paulson School of Engineering and Applied Sciences (SEAS)",
"hbs.edu": "Harvard Business School (HBS)",
"www.hbs.edu": "Harvard Business School (HBS)",
"gsd.harvard.edu": "Graduate School of Design (GSD)",
"www.gsd.harvard.edu": "Graduate School of Design (GSD)",
"gse.harvard.edu": "Graduate School of Education (HGSE)",
"www.gse.harvard.edu": "Graduate School of Education (HGSE)",
"hks.harvard.edu": "Harvard Kennedy School (HKS)",
"www.hks.harvard.edu": "Harvard Kennedy School (HKS)",
"hls.harvard.edu": "Harvard Law School (HLS)",
"hms.harvard.edu": "Harvard Medical School (HMS)",
"hsph.harvard.edu": "T.H. Chan School of Public Health (HSPH)",
"www.hsph.harvard.edu": "T.H. Chan School of Public Health (HSPH)",
"hds.harvard.edu": "Harvard Divinity School (HDS)",
"hsdm.harvard.edu": "Harvard School of Dental Medicine (HSDM)",
"fas.harvard.edu": "Faculty of Arts and Sciences (FAS)",
"dce.harvard.edu": "Division of Continuing Education (DCE)",
"extension.harvard.edu": "Harvard Extension School",
}
# 学院URL映射
SCHOOL_URLS = {
"Graduate School of Arts and Sciences (GSAS)": "https://gsas.harvard.edu/",
"John A. Paulson School of Engineering and Applied Sciences (SEAS)": "https://seas.harvard.edu/",
"Harvard Business School (HBS)": "https://www.hbs.edu/",
"Graduate School of Design (GSD)": "https://www.gsd.harvard.edu/",
"Graduate School of Education (HGSE)": "https://www.gse.harvard.edu/",
"Harvard Kennedy School (HKS)": "https://www.hks.harvard.edu/",
"Harvard Law School (HLS)": "https://hls.harvard.edu/",
"Harvard Medical School (HMS)": "https://hms.harvard.edu/",
"T.H. Chan School of Public Health (HSPH)": "https://www.hsph.harvard.edu/",
"Harvard Divinity School (HDS)": "https://hds.harvard.edu/",
"Harvard School of Dental Medicine (HSDM)": "https://hsdm.harvard.edu/",
"Faculty of Arts and Sciences (FAS)": "https://fas.harvard.edu/",
"Other": "https://www.harvard.edu/",
}
def name_to_slug(name: str) -> str:
"""将项目名称转换为URL slug"""
slug = name.lower()
slug = re.sub(r'[^\w\s-]', '', slug)
slug = re.sub(r'[\s_]+', '-', slug)
slug = re.sub(r'-+', '-', slug)
slug = slug.strip('-')
return slug
def determine_school_from_url(url: str) -> str:
"""根据URL判断所属学院"""
if not url:
return "Other"
from urllib.parse import urlparse
parsed = urlparse(url)
domain = parsed.netloc.lower()
for pattern, school_name in SCHOOL_MAPPING.items():
if pattern in domain:
return school_name
return "Other"
class HarvardScraper:
"""Harvard专用爬虫"""
def __init__(self, headless: bool = True):
self.headless = headless
self.browser: Optional[Browser] = None
self.page: Optional[Page] = None
self._playwright = None
async def __aenter__(self):
self._playwright = await async_playwright().start()
self.browser = await self._playwright.chromium.launch(headless=self.headless)
context = await self.browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
viewport={'width': 1920, 'height': 1080},
java_script_enabled=True,
)
self.page = await context.new_page()
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
if self.browser:
await self.browser.close()
if self._playwright:
await self._playwright.stop()
async def _safe_goto(self, url: str, timeout: int = 30000, retries: int = 3) -> bool:
"""安全的页面导航,带重试机制"""
for attempt in range(retries):
try:
await self.page.goto(url, wait_until="domcontentloaded", timeout=timeout)
await self.page.wait_for_timeout(2000)
return True
except Exception as e:
print(f" 导航失败 (尝试 {attempt + 1}/{retries}): {str(e)[:50]}")
if attempt < retries - 1:
await self.page.wait_for_timeout(3000)
return False
async def scrape(self) -> University:
"""执行完整的爬取流程"""
print(f"\n{'='*60}")
print("Harvard University 专用爬虫")
print(f"{'='*60}")
# 创建大学对象
university = University(
name="Harvard University",
url="https://www.harvard.edu/",
country="USA"
)
# 第一阶段:从集中页面获取所有硕士项目
print("\n[阶段1] 从集中页面获取项目列表...")
raw_programs = await self._scrape_programs_list()
print(f" 找到 {len(raw_programs)} 个项目")
# 第二阶段:获取每个项目的详情和导师信息
print("\n[阶段2] 获取项目详情和导师信息...")
# 按学院组织的项目
schools_dict: Dict[str, School] = {}
for i, prog_data in enumerate(raw_programs, 1):
print(f"\n [{i}/{len(raw_programs)}] {prog_data['name']}")
# 获取项目详情和导师
program, school_name = await self._get_program_details(prog_data)
if program:
# 添加到对应学院
if school_name not in schools_dict:
schools_dict[school_name] = School(
name=school_name,
url=SCHOOL_URLS.get(school_name, "")
)
schools_dict[school_name].programs.append(program)
print(f" 学院: {school_name}")
print(f" 导师: {len(program.faculty)}")
# 避免请求过快
await self.page.wait_for_timeout(1000)
# 转换为列表并排序
university.schools = sorted(schools_dict.values(), key=lambda s: s.name)
university.scraped_at = datetime.now(timezone.utc).isoformat()
# 打印统计
self._print_summary(university)
return university
async def _scrape_programs_list(self) -> List[Dict]:
"""从Harvard集中页面获取所有硕士项目"""
all_programs = []
base_url = "https://www.harvard.edu/programs/?degree_levels=graduate"
print(f" 访问: {base_url}")
if not await self._safe_goto(base_url, timeout=60000):
print(" 无法访问项目页面!")
return []
await self.page.wait_for_timeout(3000)
# 滚动到页面底部
await self.page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await self.page.wait_for_timeout(2000)
current_page = 1
max_pages = 15
while current_page <= max_pages:
print(f"{current_page} 页...")
await self.page.wait_for_timeout(2000)
# 提取当前页面的项目
page_data = await self.page.evaluate('''() => {
const programs = [];
const programItems = document.querySelectorAll('[class*="records__record"], [class*="c-programs-item"]');
programItems.forEach((item) => {
const nameBtn = item.querySelector('button[class*="title-link"], button[class*="c-programs-item"]');
if (!nameBtn) return;
const name = nameBtn.innerText.trim();
if (!name || name.length < 3) return;
let degrees = '';
const allText = item.innerText;
const degreeMatch = allText.match(/(A\\.B\\.|Ph\\.D\\.|M\\.A\\.|S\\.M\\.|M\\.Arch\\.|LL\\.M\\.|S\\.B\\.|A\\.L\\.B\\.|A\\.L\\.M\\.|M\\.M\\.Sc\\.|Ed\\.D\\.|Ed\\.M\\.|M\\.P\\.A\\.|M\\.P\\.P\\.|M\\.P\\.H\\.|J\\.D\\.|M\\.B\\.A\\.|M\\.D\\.|D\\.M\\.D\\.|Th\\.D\\.|M\\.Div\\.|M\\.T\\.S\\.|M\\.E\\.|D\\.M\\.Sc\\.|M\\.H\\.C\\.M\\.|M\\.L\\.A\\.|M\\.D\\.E\\.|M\\.R\\.E\\.|M\\.A\\.U\\.D\\.|M\\.R\\.P\\.L\\.)/g);
if (degreeMatch) {
degrees = degreeMatch.join(', ');
}
programs.push({ name, degrees });
});
return programs;
}''')
for prog in page_data:
name = prog['name'].strip()
if name and not any(p['name'] == name for p in all_programs):
all_programs.append(prog)
# 尝试点击下一页
try:
next_btn = self.page.locator('button.c-pagination__link--next')
if await next_btn.count() > 0:
await next_btn.first.scroll_into_view_if_needed()
await next_btn.first.click()
await self.page.wait_for_timeout(3000)
current_page += 1
else:
break
except Exception:
break
# 过滤:只保留硕士项目
master_keywords = ['M.A.', 'M.S.', 'S.M.', 'A.M.', 'MBA', 'M.Arch', 'M.L.A.',
'M.Div', 'M.T.S', 'LL.M', 'M.P.P', 'M.P.A', 'M.Ed', 'Ed.M.',
'A.L.M.', 'M.P.H.', 'M.M.Sc.', 'Master']
phd_keywords = ['Ph.D.', 'Doctor', 'D.M.D.', 'D.M.Sc.', 'Ed.D.', 'Th.D.', 'J.D.', 'M.D.']
filtered = []
for prog in all_programs:
degrees = prog.get('degrees', '')
name = prog.get('name', '')
# 检查是否有硕士学位
has_master = any(kw in degrees or kw in name for kw in master_keywords)
# 排除纯博士项目
is_phd_only = any(kw in degrees for kw in phd_keywords) and not has_master
if has_master or (not is_phd_only and not degrees):
filtered.append(prog)
return filtered
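        # Filtering examples derived from the keyword lists above:
        #   degrees = "A.B., S.M." -> contains a master keyword       -> kept
        #   degrees = "Ph.D."      -> PhD keyword only, no master kw  -> dropped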
async def _get_program_details(self, prog_data: Dict) -> Tuple[Optional[Program], str]:
"""获取项目详情和导师信息"""
name = prog_data['name']
degrees = prog_data.get('degrees', '')
# 生成URL
slug = name_to_slug(name)
program_url = f"https://www.harvard.edu/programs/{slug}/"
gsas_url = f"https://gsas.harvard.edu/program/{slug}"
# 访问GSAS页面获取详情
school_name = "Other"
faculty_list = []
faculty_page_url = None
try:
if await self._safe_goto(gsas_url, timeout=20000, retries=2):
# 检查页面是否有效
title = await self.page.title()
if '404' not in title and 'not found' not in title.lower():
school_name = "Graduate School of Arts and Sciences (GSAS)"
# 查找Faculty链接
faculty_link = await self.page.evaluate('''() => {
const links = document.querySelectorAll('a[href]');
for (const link of links) {
const text = link.innerText.toLowerCase();
const href = link.href;
if (text.includes('faculty') && text.includes('see list')) {
return href;
}
if ((text.includes('faculty') || text.includes('people')) &&
(href.includes('/people') || href.includes('/faculty'))) {
return href;
}
}
return null;
}''')
if faculty_link:
faculty_page_url = faculty_link
school_name = determine_school_from_url(faculty_link)
# 访问导师页面
if await self._safe_goto(faculty_link, timeout=20000, retries=2):
# 提取导师信息
faculty_list = await self._extract_faculty()
except Exception as e:
print(f" 获取详情失败: {str(e)[:50]}")
# 创建项目对象
program = Program(
name=name,
url=program_url,
degree_type=degrees,
faculty_page_url=faculty_page_url,
faculty=[Faculty(name=f['name'], url=f['url']) for f in faculty_list]
)
return program, school_name
async def _extract_faculty(self) -> List[Dict]:
"""从当前页面提取导师信息"""
return await self.page.evaluate('''() => {
const faculty = [];
const seen = new Set();
const patterns = ['/people/', '/faculty/', '/profile/', '/person/'];
document.querySelectorAll('a[href]').forEach(a => {
const href = a.href || '';
const text = a.innerText.trim();
const lowerHref = href.toLowerCase();
const lowerText = text.toLowerCase();
const isPersonLink = patterns.some(p => lowerHref.includes(p));
const isNavLink = ['people', 'faculty', 'directory', 'staff', 'all'].includes(lowerText);
if (isPersonLink && !isNavLink &&
text.length > 3 && text.length < 100 &&
!seen.has(href)) {
seen.add(href);
faculty.push({ name: text, url: href });
}
});
return faculty;
}''')
def _print_summary(self, university: University):
"""打印统计摘要"""
total_programs = sum(len(s.programs) for s in university.schools)
total_faculty = sum(len(p.faculty) for s in university.schools for p in s.programs)
print(f"\n{'='*60}")
print("爬取完成!")
print(f"{'='*60}")
print(f"大学: {university.name}")
print(f"学院数: {len(university.schools)}")
print(f"项目数: {total_programs}")
print(f"导师数: {total_faculty}")
print("\n各学院统计:")
for school in university.schools:
prog_count = len(school.programs)
fac_count = sum(len(p.faculty) for p in school.programs)
print(f" {school.name}: {prog_count}个项目, {fac_count}位导师")
def save_results(self, university: University, output_path: str):
"""保存结果"""
output = Path(output_path)
output.parent.mkdir(parents=True, exist_ok=True)
with open(output, 'w', encoding='utf-8') as f:
json.dump(university.to_dict(), f, ensure_ascii=False, indent=2)
print(f"\n结果已保存到: {output_path}")
async def scrape_harvard(output_path: str = "output/harvard_full_result.json", headless: bool = True):
"""爬取Harvard的便捷函数"""
async with HarvardScraper(headless=headless) as scraper:
university = await scraper.scrape()
scraper.save_results(university, output_path)
return university
if __name__ == "__main__":
asyncio.run(scrape_harvard(headless=False))

View File

@ -0,0 +1,105 @@
"""
数据模型定义 - 学院 → 项目 → 导师 层级结构
"""
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional
@dataclass
class Faculty:
"""瀵煎笀淇℃伅"""
name: str
url: str
title: Optional[str] = None
email: Optional[str] = None
department: Optional[str] = None
def to_dict(self) -> dict:
return {
"name": self.name,
"url": self.url,
"title": self.title,
"email": self.email,
"department": self.department
}
@dataclass
class Program:
"""纭曞+椤圭洰淇℃伅"""
name: str
url: str
degree_type: Optional[str] = None # M.S., M.A., MBA, etc.
description: Optional[str] = None
faculty_page_url: Optional[str] = None
faculty: List[Faculty] = field(default_factory=list)
metadata: Dict[str, Any] = field(default_factory=dict)
def to_dict(self) -> dict:
return {
"name": self.name,
"url": self.url,
"degree_type": self.degree_type,
"description": self.description,
"faculty_page_url": self.faculty_page_url,
"faculty_count": len(self.faculty),
"faculty": [f.to_dict() for f in self.faculty],
"metadata": self.metadata
}
@dataclass
class School:
"""瀛﹂櫌淇℃伅"""
name: str
url: str
description: Optional[str] = None
programs: List[Program] = field(default_factory=list)
metadata: Dict[str, Any] = field(default_factory=dict)
faculty_directory: List[Faculty] = field(default_factory=list)
faculty_directory_loaded: bool = False
def to_dict(self) -> dict:
return {
"name": self.name,
"url": self.url,
"description": self.description,
"program_count": len(self.programs),
"programs": [p.to_dict() for p in self.programs],
"faculty_directory_count": len(self.faculty_directory),
"faculty_directory": [f.to_dict() for f in self.faculty_directory]
}
@dataclass
class University:
"""澶у淇℃伅 - 椤跺眰鏁版嵁缁撴瀯"""
name: str
url: str
country: Optional[str] = None
schools: List[School] = field(default_factory=list)
scraped_at: Optional[str] = None
def to_dict(self) -> dict:
# 统计
total_programs = sum(len(s.programs) for s in self.schools)
total_faculty = sum(
len(p.faculty)
for s in self.schools
for p in s.programs
)
return {
"university": self.name,
"url": self.url,
"country": self.country,
"scraped_at": self.scraped_at or datetime.utcnow().isoformat(),
"statistics": {
"total_schools": len(self.schools),
"total_programs": total_programs,
"total_faculty": total_faculty
},
"schools": [s.to_dict() for s in self.schools]
}
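# A minimal construction example (illustrative names only) showing the
# University -> School -> Program -> Faculty hierarchy and the JSON-ready output:
if __name__ == "__main__":
    import json

    uni = University(name="Example University", url="https://example.edu/", country="USA")
    school = School(name="School of Engineering", url="https://eng.example.edu/")
    program = Program(name="Computer Science", url="https://eng.example.edu/cs", degree_type="M.S.")
    program.faculty.append(Faculty(name="Jane Doe", url="https://eng.example.edu/people/jane-doe"))
    school.programs.append(program)
    uni.schools.append(school)
    print(json.dumps(uni.to_dict(), ensure_ascii=False, indent=2))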

File diff suppressed because it is too large

View File

@ -2,3 +2,7 @@
agent系统使用 https://docs.agno.com/
然后其中的浏览器自动化使用playwright
使用的python工具库是uv、ruff、ty使用这三个构建项目。
另外还需要研究一个方案:
当前生成的爬虫脚本还需要在 Python 环境中手动执行。我们希望实现的效果:生成的脚本可以存到数据库(你研究一下是存在哪里好),我们的目的是存到云端或者服务器上,后续我们可以通过接口来调用生成的脚本去爬取。
理想情况下我们有一个页面,页面上有一个输入栏,用来输入要生成爬虫脚本的学校,然后点击“一键生成爬虫脚本”后,由我们的 agent 自动生成脚本。生成完成后会跳转到该学校的管理页面,页面上可以点击“一键运行爬虫脚本”来抓取对应项目和老师的页面数据。爬取完毕后,我们可以在页面上检查和验证爬虫抓取到的数据。
目的是做成没有任何电脑知识的普通人也能操作的一个系统。
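A minimal sketch of the "store the script, run it via an API" idea above, assuming a FastAPI backend with generated scripts kept in SQLite; every table, column, and endpoint name here is illustrative rather than the project's actual schema:

import sqlite3
from fastapi import FastAPI, HTTPException

app = FastAPI()
DB_PATH = "scraper.db"  # assumed database location

@app.post("/universities/{slug}/run")
def run_stored_script(slug: str) -> dict:
    # Look up the generated scraper code stored for this university.
    with sqlite3.connect(DB_PATH) as conn:
        row = conn.execute(
            "SELECT code FROM scripts WHERE university_slug = ?", (slug,)
        ).fetchone()
    if row is None:
        raise HTTPException(status_code=404, detail="no script stored for this university")
    namespace: dict = {}
    exec(row[0], namespace, namespace)  # single namespace so script functions can call each other
    # Assumed convention: the generated script exposes main() returning JSON-serialisable data.
    return {"result": namespace["main"]()}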

32
对话总结.txt Normal file
View File

@ -0,0 +1,32 @@
项目进度与存在的问题:
📊 项目进度总结
✅ 已完成的功能
Web系统架构
FastAPI 后端 + React + TypeScript + Ant Design 前端
SQLite 数据库存储大学、脚本、爬取结果
动态脚本执行机制(通过 exec() 注入模块)
曼彻斯特大学爬虫
成功爬取 297 个硕士项目
创建了完整的学院配置 manchester_complete_scraper.py
实现了 3 种提取方法：table、links、research_explorer
项目按关键词自动分配到 13 个学院
已修复的问题
Windows asyncio 事件循环策略
exec 命名空间问题(函数互相调用)
硕士项目过滤逻辑(排除本科/博士)
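The two fixes above typically look like the following sketch (illustrative, not the project's exact code):

import asyncio
import sys

# Playwright launches the browser as a subprocess, which on Windows requires the
# Proactor event loop; make sure nothing has switched the policy to the selector loop.
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

# exec() with separate globals/locals dicts breaks scripts whose functions call each
# other; passing one shared dict for both namespaces avoids that.
generated_script_code = (
    "def helper():\n"
    "    return 42\n"
    "def main():\n"
    "    return helper()\n"
)
namespace: dict = {}
exec(generated_script_code, namespace, namespace)
print(namespace["main"]())  # -> 42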
⚠️ 当前存在的问题
问题 | 影响 | 原因
网络超时 | 11/12 学院页面加载失败 | 网络不稳定或页面响应慢
Research Explorer 页面 | 大量学院使用此系统 | JavaScript 渲染慢，60 秒超时不够
导师数据不完整 | 仅获取 78 名导师（AMBS） | 其他学院页面无法访问
📈 数据统计
指标 | 数量
硕士项目总数 | 297
学院分类数 | 13
成功获取导师的学院 | 1/13
导师总数 | 78
🔧 建议的改进方向
增加超时时间 - 对 Research Explorer 页面增加到 90-120 秒
添加重试机制 - 失败后自动重试 2-3 次
使用备选 URL - 为每个学院配置多个可能的 staff 页面
分批爬取 - 将学院分批处理,避免同时请求过多
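A sketch of how items 1, 2 and 4 could be combined (illustrative; scrape_school is an assumed coroutine that loads one school's staff page and returns its faculty list):

import asyncio

async def scrape_with_retry(scrape_school, school, retries: int = 3, timeout_s: float = 120.0):
    """Retry one school with a generous per-attempt timeout (items 1 and 2)."""
    for attempt in range(1, retries + 1):
        try:
            return await asyncio.wait_for(scrape_school(school), timeout=timeout_s)
        except Exception as exc:
            print(f"{school}: attempt {attempt}/{retries} failed: {exc}")
            await asyncio.sleep(5 * attempt)  # simple linear backoff between attempts
    return None

async def scrape_in_batches(scrape_school, schools, concurrency: int = 3):
    """Cap concurrent school requests instead of firing them all at once (item 4)."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(school):
        async with sem:
            return await scrape_with_retry(scrape_school, school)

    return await asyncio.gather(*(guarded(s) for s in schools))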