Compare commits
4 commits: d80db75d4e ... main

| SHA1 |
|---|
| 426cf4d2cd |
| 2714c8ad5c |
| a4dca81216 |
| fb2aa12f2b |
.env.example (new file, 14 lines)

@@ -0,0 +1,14 @@
# OpenRouter Configuration (recommended)
CODEGEN_MODEL_PROVIDER=openrouter
OPENAI_API_KEY=your-openrouter-api-key-here
CODEGEN_OPENROUTER_MODEL=anthropic/claude-sonnet-4

# Alternative: Direct Anthropic
# CODEGEN_MODEL_PROVIDER=anthropic
# ANTHROPIC_API_KEY=your-anthropic-api-key-here
# CODEGEN_ANTHROPIC_MODEL=claude-sonnet-4-20250514

# Alternative: OpenAI
# CODEGEN_MODEL_PROVIDER=openai
# OPENAI_API_KEY=your-openai-api-key-here
# CODEGEN_OPENAI_MODEL=gpt-4o
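The variables above are presumably consumed by `src/university_agent/config.py`, which the README below describes only as "pydantic Settings". A minimal sketch of how such a settings class might look, assuming pydantic-settings is used and that the field names simply mirror the variable names in this file (the class name and defaults are illustrative, not the project's actual code):

```python
# Illustrative sketch only: how the CODEGEN_* variables might map onto a
# pydantic-settings class. Names and defaults are assumptions.
from pydantic_settings import BaseSettings, SettingsConfigDict


class CodegenSettings(BaseSettings):
    """Provider/model configuration read from the environment or a .env file."""

    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    codegen_model_provider: str = "openrouter"   # openrouter | anthropic | openai
    openai_api_key: str | None = None            # also holds the OpenRouter key
    anthropic_api_key: str | None = None
    codegen_openrouter_model: str = "anthropic/claude-sonnet-4"
    codegen_anthropic_model: str = "claude-sonnet-4-20250514"
    codegen_openai_model: str = "gpt-4o"


if __name__ == "__main__":
    settings = CodegenSettings()
    print(settings.codegen_model_provider, settings.codegen_openrouter_model)
```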
.gitignore (vendored, 33 lines added)

@@ -174,3 +174,36 @@ cython_debug/
# PyPI configuration file
.pypirc

# Windows artifacts
nul

# Scraper output files
*_results.json

# Output directories
output/

# Screenshots and debug images
*.png
artifacts/*.html

# Windows
desktop.ini

# Claude settings (local)
.claude/

# Progress files
*_progress.json

# Test result files
*_test_result.json

# Node modules
node_modules/

# Database files
*.db

# Frontend build
frontend/nul
README.md (248 lines changed)

@@ -1,73 +1,221 @@
# University Playwright Codegen Agent

-An automated code-generation agent built on [Agno](https://docs.agno.com/): given the root URL of a university website, it generates a Python script that uses **Playwright** to scrape the master's program pages under each school/graduate school and the personal profile pages of the supervisors (Supervisor/Faculty) listed in those programs. The project uses `uv` for dependency management, `ruff` for linting, `ty` for type checking, and ships a Typer-based CLI.
-
-## Features
-
-- ✅ **Agno Agent**: uses `output_schema` to enforce structured output, producing a `ScriptPlan` that is rendered into an executable script.
-- ✅ **Playwright sampling**: lightly samples the site with Playwright before planning, helping the agent find keywords and a navigation strategy.
-- ✅ **Deterministic script template**: the script template includes BFS crawling, keyword filtering and JSON output, covering the "master's programs + supervisors" requirement.
-- ✅ **uv + ruff + ty workflow**: a modern Python toolchain that works out of the box.
-
-## Getting started
-
-1. **Create a virtual environment and install dependencies**
-
-```bash
-uv venv --python 3.12
-uv pip install -r pyproject.toml
-playwright install  # install the browser engines
-```
-
-2. **Configure the LLM API key**
-
-- OpenAI: `export OPENAI_API_KEY=...`
-- Anthropic: `export ANTHROPIC_API_KEY=...`
-- The `CODEGEN_MODEL_PROVIDER` environment variable switches between `openai` and `anthropic`.
-
-3. **Run the CLI to generate a script**
-
-```bash
-uv run university-agent generate \
-  "https://www.example.edu" \
-  --campus "Example Campus" \
-  --language "English" \
-  --max-depth 2 \
-  --max-pages 60
-```
-
-When the run finishes, the generated Playwright script appears under `artifacts/` and the planned keywords and verification steps are printed in the terminal.
-
-4. **Run the Ruff & Ty checks**
-
-```bash
-uv run ruff check
-uvx ty check
-```
-
-## Project structure
-
-```
-├── README.md
-├── pyproject.toml
-├── src/university_agent
-│   ├── agent.py                  # Agno Agent configuration
-│   ├── cli.py                    # Typer CLI
-│   ├── config.py                 # pydantic Settings
-│   ├── generator.py              # Orchestration engine
-│   ├── models.py                 # Data models (request/plan/result)
-│   ├── renderer.py               # ScriptPlan -> Playwright script
-│   ├── sampler.py                # Playwright sampling
-│   ├── templates/
-│   │   └── playwright_script.py.jinja
-│   └── writer.py                 # Writes scripts to artifacts/
-└── 任务1.txt
-```
-
-## Tips
-
-- `university-agent generate --help` lists all CLI options, including skipping sampling or exporting the plan JSON.
-- If the Agno Agent needs additional tools, extend `agent.py` with custom `tool`s.
-- Playwright sampling needs extra browser dependencies in some environments; run `playwright install` as prompted.
-
-Happy building! 🎓🤖
+An automated code-generation agent built on [Agno](https://docs.agno.com/): given the root URL of a university website, it generates a Python script that uses **Playwright** to scrape the master's program pages under each school/graduate school and the personal profile pages of the supervisors (Supervisor/Faculty) listed in those programs.
+
+## Quick Start
+
+### 1. Environment setup
+
+```bash
+# Clone the project
+git clone https://git.prodream.cn/YXY/University-Playwright-Codegen-Agent.git
+cd University-Playwright-Codegen-Agent
+
+# Install dependencies (requires uv)
+uv sync
+
+# Install the Playwright browsers
+uv run playwright install
+```
+
+### 2. Configure the API key
+
+The project calls Claude models through the OpenRouter API. Set the environment variable:
+
+**Windows (PowerShell):**
+```powershell
+[Environment]::SetEnvironmentVariable("OPENROUTER_API_KEY", "your-api-key", "User")
+```
+
+**Windows (CMD):**
+```cmd
+setx OPENROUTER_API_KEY "your-api-key"
+```
+
+**Linux/macOS:**
+```bash
+export OPENROUTER_API_KEY="your-api-key"
+```
+
+Alternatively, copy `.env.example` to `.env` and fill in the API key.
+
+### 3. Generate a scraper script
+
+**Option 1: command-line arguments**
+
+```bash
+uv run python generate_scraper.py \
+  --url "https://www.harvard.edu/" \
+  --name "Harvard" \
+  --language "English" \
+  --max-depth 3 \
+  --max-pages 30
+```
+
+**Option 2: edit the configuration inside the script**
+
+Edit the settings at the top of `generate_scraper.py`:
+
+```python
+TARGET_URL = "https://www.example.edu/"
+CAMPUS_NAME = "Example University"
+LANGUAGE = "English"
+MAX_DEPTH = 3
+MAX_PAGES = 30
+```
+
+Then run:
+
+```bash
+uv run python generate_scraper.py
+```
+
+### 4. Run the generated scraper
+
+The generated scripts are saved under `artifacts/`:
+
+```bash
+cd artifacts
+uv run python harvard_faculty_scraper.py --max-pages 50 --no-verify
+```
+
+**Common options:**
+
+| Option | Description | Default |
+|------|------|--------|
+| `--max-pages` | Maximum number of pages to crawl | 30 |
+| `--max-depth` | Maximum crawl depth | 3 |
+| `--no-verify` | Skip link verification (recommended) | False |
+| `--browser` | Browser engine (chromium/firefox/webkit) | chromium |
+| `--timeout` | Page-load timeout (ms) | 20000 |
+| `--output` | Output file path | university-scraper_results.json |
+
+### 5. View the results
+
+The crawl results are saved as a JSON file:
+
+```json
+{
+  "statistics": {
+    "total_links": 277,
+    "program_links": 8,
+    "faculty_links": 269,
+    "profile_pages": 265
+  },
+  "program_links": [...],
+  "faculty_links": [...]
+}
+```
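As a quick sanity check, the statistics block above can be read back from Python. A small illustrative snippet (the file name matches the default `--output` value listed in the options table; everything else is just an example, not part of the project):

```python
# Illustrative only: summarise a results file written by a generated scraper.
import json
from pathlib import Path

results = json.loads(
    Path("artifacts/university-scraper_results.json").read_text(encoding="utf-8")
)

stats = results["statistics"]
print(f"total links:   {stats['total_links']}")
print(f"program links: {stats['program_links']}")
print(f"faculty links: {stats['faculty_links']}")
print(f"profile pages: {stats['profile_pages']}")

# Preview a few of the collected faculty links (each entry has "url" and "text").
for link in results["faculty_links"][:5]:
    print(link["url"], link["text"])
```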
+## Using the CLI (optional)
+
+The project also ships a Typer CLI:
+
+```bash
+uv run university-agent generate \
+  "https://www.example.edu" \
+  --campus "Example Campus" \
+  --language "English" \
+  --max-depth 2 \
+  --max-pages 60
+```
+
+## Tested universities
+
+| University | Status | Result | Generated script |
+|------|------|------|-----------|
+| Harvard | ✅ | 277 links (8 programs, 269 faculty, 265 profile pages) | `artifacts/harvard_faculty_scraper.py` |
+| RWTH Aachen | ✅ | 108 links (103 programs, 5 faculty) | `artifacts/rwth_aachen_playwright_scraper.py` |
+| KAUST | ✅ | 9 links (requires Firefox) | `artifacts/kaust_faculty_scraper.py` |
+
+### Harvard example
+
+**Generate the scraper script:**
+
+```bash
+uv run python generate_scraper.py --url "https://www.harvard.edu/" --name "Harvard"
+```
+
+**Run the scraper:**
+
+```bash
+cd artifacts
+uv run python harvard_faculty_scraper.py --max-pages 30 --no-verify
+```
+
+**Output** (`artifacts/university-scraper_results.json`):
+
+```json
+{
+  "statistics": {
+    "total_links": 277,
+    "program_links": 8,
+    "faculty_links": 269,
+    "profile_pages": 265
+  },
+  "program_links": [
+    {"url": "https://www.harvard.edu/programs/?degree_levels=graduate", "text": "Graduate Programs"},
+    ...
+  ],
+  "faculty_links": [
+    {"url": "https://www.gse.harvard.edu/directory/faculty", "text": "Faculty Directory"},
+    {"url": "https://faculty.harvard.edu", "text": "Harvard Faculty"},
+    ...
+  ]
+}
+```
+
+The crawl covered several Harvard schools:
+- Graduate School of Design (GSD)
+- Graduate School of Education (GSE)
+- Faculty of Arts and Sciences (FAS)
+- Graduate School of Arts and Sciences (GSAS)
+- Harvard Divinity School (HDS)
+
+## Troubleshooting
+
+### Timeout errors
+
+Some sites respond slowly; increase the timeout:
+
+```bash
+uv run python xxx_scraper.py --timeout 60000 --no-verify
+```
+
+### Browser blocked
+
+Some sites (such as KAUST) block Chromium; switch to Firefox:
+
+```bash
+uv run python xxx_scraper.py --browser firefox
+```
+
+### API key errors
+
+Make sure the `OPENROUTER_API_KEY` environment variable is set correctly:
+
+```bash
+echo $OPENROUTER_API_KEY     # Linux/macOS
+echo %OPENROUTER_API_KEY%    # Windows CMD
+```
+
+## Project Structure
+
+```
+├── README.md
+├── generate_scraper.py              # Main entry script
+├── .env.example                     # Environment variable template
+├── pyproject.toml
+├── artifacts/                       # Generated scraper scripts
+│   ├── harvard_faculty_scraper.py
+│   ├── kaust_faculty_scraper.py
+│   └── ...
+└── src/university_agent/
+    ├── agent.py                     # Agno Agent configuration
+    ├── cli.py                       # Typer CLI
+    ├── config.py                    # pydantic Settings
+    ├── generator.py                 # Orchestration engine
+    ├── models.py                    # Data models
+    ├── renderer.py                  # ScriptPlan -> Playwright script
+    ├── sampler.py                   # Playwright sampling
+    └── writer.py                    # Writes scripts to artifacts/
+```
+
+## Features
+
+- **Agno Agent**: uses `output_schema` to enforce structured output
+- **Playwright sampling**: lightly samples the site before generation
+- **Deterministic script template**: BFS crawling, keyword filtering, JSON output
+- **OpenRouter support**: use Claude models through OpenRouter
+- **uv + ruff + ty workflow**: a modern Python toolchain
+
+## License
+
+MIT
SYSTEM_DESIGN.md (new file, 261 lines)

@@ -0,0 +1,261 @@
# University Scraper Web System Design

## 1. System Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│                       Frontend (React/Vue)                        │
│   Enter university URL · One-click script generation ·            │
│   View / verify scraped data                                       │
└──────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│                       Backend API (FastAPI)                        │
│   Script generation API · Script execution API · Data query API    │
└──────────────────────────────────────────────────────────────────┘
                                 │
               ┌─────────────────┼──────────────────┐
               ▼                 ▼                  ▼
┌───────────────────┐   ┌───────────────┐   ┌───────────────────────┐
│    PostgreSQL     │   │  Task queue   │   │    Agent (Claude)     │
│     database      │   │   (Celery)    │   │  analyse + generate   │
│ - scraper scripts │   └───────────────┘   │       scripts         │
│ - crawl results   │                       └───────────────────────┘
│ - execution logs  │
└───────────────────┘
```

## 2. Technology Stack

### Backend
- **Framework**: FastAPI (Python, integrates directly with the existing scraper code)
- **Database**: PostgreSQL (stores scripts, results, logs)
- **Task queue**: Celery + Redis (runs scrape jobs asynchronously)
- **ORM**: SQLAlchemy

### Frontend
- **Framework**: React + TypeScript (or Vue.js)
- **UI library**: Ant Design / Material-UI
- **State management**: React Query (data fetching and caching)

### Deployment
- **Containerisation**: Docker + Docker Compose
- **Cloud**: deployable to AWS / Alibaba Cloud / Tencent Cloud
## 3. Database Design

```sql
-- Universities
CREATE TABLE universities (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    url VARCHAR(500) NOT NULL,
    country VARCHAR(100),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Scraper scripts
CREATE TABLE scraper_scripts (
    id SERIAL PRIMARY KEY,
    university_id INTEGER REFERENCES universities(id),
    script_name VARCHAR(255) NOT NULL,
    script_content TEXT NOT NULL,           -- Python script source
    config_content TEXT,                    -- YAML configuration
    version INTEGER DEFAULT 1,
    status VARCHAR(50) DEFAULT 'draft',     -- draft, active, deprecated
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Scrape jobs
CREATE TABLE scrape_jobs (
    id SERIAL PRIMARY KEY,
    university_id INTEGER REFERENCES universities(id),
    script_id INTEGER REFERENCES scraper_scripts(id),
    status VARCHAR(50) DEFAULT 'pending',   -- pending, running, completed, failed
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Scrape results (hierarchical data stored as JSON)
CREATE TABLE scrape_results (
    id SERIAL PRIMARY KEY,
    job_id INTEGER REFERENCES scrape_jobs(id),
    university_id INTEGER REFERENCES universities(id),
    result_data JSONB NOT NULL,             -- school -> program -> supervisor JSON data
    schools_count INTEGER,
    programs_count INTEGER,
    faculty_count INTEGER,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Execution logs
CREATE TABLE scrape_logs (
    id SERIAL PRIMARY KEY,
    job_id INTEGER REFERENCES scrape_jobs(id),
    level VARCHAR(20),                      -- info, warning, error
    message TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);
```
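Since the tech stack above names SQLAlchemy as the ORM, one of these tables could be mapped roughly as follows. This is a sketch only: the column names follow the DDL above, while the Base class, module layout and typing style are assumptions.

```python
# Sketch only: SQLAlchemy 2.0-style mapping for the scraper_scripts table above.
from datetime import datetime

from sqlalchemy import ForeignKey, Integer, String, Text, func
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class ScraperScript(Base):
    __tablename__ = "scraper_scripts"

    id: Mapped[int] = mapped_column(primary_key=True)
    university_id: Mapped[int] = mapped_column(ForeignKey("universities.id"))
    script_name: Mapped[str] = mapped_column(String(255))
    script_content: Mapped[str] = mapped_column(Text)           # Python source code
    config_content: Mapped[str | None] = mapped_column(Text)    # YAML configuration
    version: Mapped[int] = mapped_column(Integer, default=1)
    status: Mapped[str] = mapped_column(String(50), default="draft")
    created_at: Mapped[datetime] = mapped_column(server_default=func.now())
    updated_at: Mapped[datetime] = mapped_column(server_default=func.now())
```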
## 4. API Design

### 1. University management
```
POST    /api/universities              Create a university
GET     /api/universities              List universities
GET     /api/universities/{id}         Get university details
DELETE  /api/universities/{id}         Delete a university
```

### 2. Scraper scripts
```
POST    /api/scripts/generate          Generate a scraper script (automatic agent analysis)
GET     /api/scripts/{university_id}   Get a university's scraper script
PUT     /api/scripts/{id}              Update a script
```

### 3. Scrape jobs
```
POST    /api/jobs/start/{university_id}   Start a scrape job
GET     /api/jobs/{id}                    Get job status
GET     /api/jobs/university/{id}         List a university's jobs
POST    /api/jobs/{id}/cancel             Cancel a job
```

### 4. Result data
```
GET /api/results/{university_id}                       Get scrape results
GET /api/results/{university_id}/schools               List schools
GET /api/results/{university_id}/programs              List programs
GET /api/results/{university_id}/faculty               List supervisors
GET /api/results/{university_id}/export?format=json    Export data
```
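To make the job-start route concrete, the sketch below shows how `POST /api/jobs/start/{university_id}` could hand the long-running crawl to the Celery queue named in the tech stack. The broker URL, task body and response shape are illustrative assumptions, not an existing implementation.

```python
# Sketch only: minimal FastAPI + Celery wiring for the job-start endpoint above.
from celery import Celery
from fastapi import FastAPI

app = FastAPI()
celery_app = Celery("scraper", broker="redis://localhost:6379/0")


@celery_app.task
def run_scrape_job(university_id: int) -> str:
    # A real task would load the stored script, run Playwright, and write
    # results and logs back to PostgreSQL.
    return f"scraped university {university_id}"


@app.post("/api/jobs/start/{university_id}")
def start_job(university_id: int) -> dict:
    # Hand the crawl to a Celery worker and return immediately.
    async_result = run_scrape_job.delay(university_id)
    return {"task_id": async_result.id, "status": "pending"}
```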
## 五、前端页面设计
|
||||||
|
|
||||||
|
### 页面1: 首页/大学列表
|
||||||
|
- 显示已添加的大学列表
|
||||||
|
- "添加新大学" 按钮
|
||||||
|
- 每个大学卡片显示:名称、状态、项目数、导师数、操作按钮
|
||||||
|
|
||||||
|
### 页面2: 添加大学 (一键生成脚本)
|
||||||
|
- 输入框:大学官网URL
|
||||||
|
- "分析并生成脚本" 按钮
|
||||||
|
- 显示分析进度和日志
|
||||||
|
- 生成完成后自动跳转到管理页面
|
||||||
|
|
||||||
|
### 页面3: 大学管理页面
|
||||||
|
- 大学基本信息
|
||||||
|
- 爬虫脚本状态
|
||||||
|
- "一键运行爬虫" 按钮
|
||||||
|
- 运行进度和日志实时显示
|
||||||
|
- 历史任务列表
|
||||||
|
|
||||||
|
### 页面4: 数据查看页面
|
||||||
|
- 树形结构展示:学院 → 项目 → 导师
|
||||||
|
- 搜索和筛选功能
|
||||||
|
- 数据导出按钮 (JSON/Excel)
|
||||||
|
- 数据校验和编辑功能
|
||||||
|
|
||||||
|
## 6. Implementation Plan

### Phase 1: Backend foundations (first)
1. Create the FastAPI project structure
2. Design the database models (SQLAlchemy)
3. Implement the basic CRUD APIs
4. Integrate the existing scraper code

### Phase 2: Script generation and execution
1. Implement the automatic agent analysis logic
2. Implement script storage and versioning
3. Integrate the Celery asynchronous task queue
4. Implement scraper execution and log recording (see the sketch after this list)

### Phase 3: Frontend
1. Set up the React project
2. Build the university list page
3. Build the script generation page
4. Build the data view page

### Phase 4: Deployment
1. Containerise with Docker
2. Deploy to a cloud server
3. Configure the domain and HTTPS
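The "scraper execution and log recording" step in Phase 2 can be reduced to running the stored script in a subprocess and keeping its output as a log record. This is a sketch under assumptions: the CLI flags come from the generated scrapers' own argument parser, while the timeout, file-based log destination and function name are illustrative (the real system would write `scrape_jobs` / `scrape_logs` rows instead).

```python
# Sketch only: run one generated scraper and capture its output as a log record.
import json
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path


def run_generated_script(script_path: str, max_pages: int = 30) -> dict:
    """Run a generated scraper script and return a simple job/log record."""
    started = datetime.now(timezone.utc)
    proc = subprocess.run(
        [sys.executable, script_path, "--max-pages", str(max_pages), "--no-verify"],
        capture_output=True,
        text=True,
        timeout=1800,  # assumed 30-minute cap per job
    )
    record = {
        "script": script_path,
        "status": "completed" if proc.returncode == 0 else "failed",
        "started_at": started.isoformat(),
        "completed_at": datetime.now(timezone.utc).isoformat(),
        "stdout_tail": proc.stdout[-2000:],
        "stderr_tail": proc.stderr[-2000:],
    }
    Path("scrape_job_log.json").write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record
```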
## 7. Directory Layout

```
university-scraper-web/
├── backend/
│   ├── app/
│   │   ├── __init__.py
│   │   ├── main.py              # FastAPI entry point
│   │   ├── config.py            # Configuration
│   │   ├── database.py          # Database connection
│   │   ├── models/              # SQLAlchemy models
│   │   │   ├── university.py
│   │   │   ├── script.py
│   │   │   ├── job.py
│   │   │   └── result.py
│   │   ├── schemas/             # Pydantic models
│   │   ├── api/                 # API routes
│   │   │   ├── universities.py
│   │   │   ├── scripts.py
│   │   │   ├── jobs.py
│   │   │   └── results.py
│   │   ├── services/            # Business logic
│   │   │   ├── scraper_service.py
│   │   │   └── agent_service.py
│   │   └── tasks/               # Celery tasks
│   │       └── scrape_task.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── src/
│   │   ├── components/
│   │   ├── pages/
│   │   ├── services/
│   │   └── App.tsx
│   ├── package.json
│   └── Dockerfile
├── docker-compose.yml
└── README.md
```
## 8. Where to Store the Scripts

### Recommended: PostgreSQL + file system hybrid

1. **Stored in PostgreSQL**:
   - Script metadata (name, version, status)
   - Script source code (TEXT column)
   - Configuration content (JSONB column)
   - Scrape results (JSONB column)

2. **Advantages**:
   - Transactions and data consistency
   - Straightforward version management
   - Easy querying and search
   - Simple backup and migration
   - Tight integration with the backend

3. **Managed cloud options**:
   - AWS RDS PostgreSQL
   - Alibaba Cloud RDS PostgreSQL
   - Tencent Cloud TDSQL-C

### Alternative: MongoDB

If the data structure changes frequently, MongoDB is worth considering:
- Flexible document structure
- Well suited to hierarchical scrape results
- However, the Python ecosystem's PostgreSQL support is stronger
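One practical consequence of the JSONB recommendation is that result statistics can be queried directly in SQL. A sketch under assumptions: table and column names follow the DDL in section 3, the connection URL is a placeholder, and the `statistics` key inside `result_data` is assumed to mirror the scrapers' output JSON shown in the README.

```python
# Sketch only: reading the JSONB result_data column alongside the counter columns.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@localhost/scraper")

query = text(
    """
    SELECT u.name,
           r.faculty_count,
           (r.result_data -> 'statistics' ->> 'profile_pages')::int AS profile_pages
    FROM scrape_results AS r
    JOIN universities AS u ON u.id = r.university_id
    ORDER BY r.created_at DESC
    """
)

with engine.connect() as conn:
    for name, faculty_count, profile_pages in conn.execute(query):
        print(name, faculty_count, profile_pages)
```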
artifacts/debug_cs_faculty.py (new file, 83 lines)

@@ -0,0 +1,83 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
调试Computer Science的Faculty页面
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
|
||||||
|
|
||||||
|
async def debug_cs():
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser = await p.chromium.launch(headless=False)
|
||||||
|
page = await browser.new_page()
|
||||||
|
|
||||||
|
# 访问Computer Science GSAS页面
|
||||||
|
gsas_url = "https://gsas.harvard.edu/program/computer-science"
|
||||||
|
print(f"访问: {gsas_url}")
|
||||||
|
|
||||||
|
await page.goto(gsas_url, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
await page.screenshot(path="cs_gsas_page.png", full_page=True)
|
||||||
|
print("截图已保存: cs_gsas_page.png")
|
||||||
|
|
||||||
|
# 查找所有链接
|
||||||
|
links = await page.evaluate('''() => {
|
||||||
|
const links = [];
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
const href = a.href;
|
||||||
|
if (text && text.length > 2 && text.length < 100) {
|
||||||
|
links.push({text: text, href: href});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
return links;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f"\n页面上的所有链接 ({len(links)} 个):")
|
||||||
|
for link in links:
|
||||||
|
print(f" - {link['text'][:60]} -> {link['href']}")
|
||||||
|
|
||||||
|
# 查找可能的Faculty或People链接
|
||||||
|
print("\n\n查找Faculty/People相关链接:")
|
||||||
|
for link in links:
|
||||||
|
text_lower = link['text'].lower()
|
||||||
|
href_lower = link['href'].lower()
|
||||||
|
if 'faculty' in text_lower or 'people' in href_lower or 'faculty' in href_lower or 'website' in text_lower:
|
||||||
|
print(f" * {link['text']} -> {link['href']}")
|
||||||
|
|
||||||
|
# 尝试访问SEAS (School of Engineering)
|
||||||
|
print("\n\n尝试访问SEAS Computer Science页面...")
|
||||||
|
seas_url = "https://seas.harvard.edu/computer-science"
|
||||||
|
await page.goto(seas_url, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
await page.screenshot(path="seas_cs_page.png", full_page=True)
|
||||||
|
print("截图已保存: seas_cs_page.png")
|
||||||
|
|
||||||
|
seas_links = await page.evaluate('''() => {
|
||||||
|
const links = [];
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
const href = a.href;
|
||||||
|
const lowerText = text.toLowerCase();
|
||||||
|
const lowerHref = href.toLowerCase();
|
||||||
|
if ((lowerText.includes('faculty') || lowerText.includes('people') ||
|
||||||
|
lowerHref.includes('faculty') || lowerHref.includes('people')) &&
|
||||||
|
text.length > 2) {
|
||||||
|
links.push({text: text, href: href});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
return links;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f"\nSEAS页面上的Faculty/People链接:")
|
||||||
|
for link in seas_links:
|
||||||
|
print(f" * {link['text']} -> {link['href']}")
|
||||||
|
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
asyncio.run(debug_cs())
|
||||||
artifacts/explore_faculty_page.py (new file, 110 lines)

@@ -0,0 +1,110 @@
|
|||||||
|
"""
|
||||||
|
探索Harvard院系People/Faculty页面结构,获取导师列表
|
||||||
|
"""
|
||||||
|
import asyncio
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
|
||||||
|
async def explore_faculty_page():
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser = await p.chromium.launch(headless=False)
|
||||||
|
page = await browser.new_page()
|
||||||
|
|
||||||
|
# 访问AAAS院系People页面
|
||||||
|
people_url = "https://aaas.fas.harvard.edu/aaas-people"
|
||||||
|
print(f"访问院系People页面: {people_url}")
|
||||||
|
|
||||||
|
await page.goto(people_url, wait_until='networkidle')
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
# 截图保存
|
||||||
|
await page.screenshot(path="aaas_people_page.png", full_page=True)
|
||||||
|
print("已保存截图: aaas_people_page.png")
|
||||||
|
|
||||||
|
# 获取所有教职员工链接
|
||||||
|
faculty_info = await page.evaluate('''() => {
|
||||||
|
const faculty = [];
|
||||||
|
|
||||||
|
// 查找所有 /people/ 路径的链接
|
||||||
|
document.querySelectorAll('a[href*="/people/"]').forEach(a => {
|
||||||
|
const href = a.href || '';
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
|
||||||
|
// 过滤掉导航链接,只保留个人页面链接
|
||||||
|
if (href.includes('/people/') && text.length > 3 &&
|
||||||
|
!text.toLowerCase().includes('people') &&
|
||||||
|
!href.endsWith('/people/') &&
|
||||||
|
!href.endsWith('/aaas-people')) {
|
||||||
|
faculty.push({
|
||||||
|
name: text,
|
||||||
|
url: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return faculty;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f"\n找到 {len(faculty_info)} 个教职员工:")
|
||||||
|
for f in faculty_info:
|
||||||
|
print(f" - {f['name']} -> {f['url']}")
|
||||||
|
|
||||||
|
# 尝试经济学院系的Faculty页面
|
||||||
|
print("\n\n========== 尝试经济学院系Faculty页面 ==========")
|
||||||
|
econ_faculty_url = "http://economics.harvard.edu/people/people-type/faculty"
|
||||||
|
print(f"访问: {econ_faculty_url}")
|
||||||
|
|
||||||
|
await page.goto(econ_faculty_url, wait_until='networkidle')
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
await page.screenshot(path="econ_faculty_page.png", full_page=True)
|
||||||
|
print("已保存截图: econ_faculty_page.png")
|
||||||
|
|
||||||
|
econ_faculty = await page.evaluate('''() => {
|
||||||
|
const faculty = [];
|
||||||
|
|
||||||
|
// 查找所有可能的faculty链接
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href || '';
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
const lowerHref = href.toLowerCase();
|
||||||
|
|
||||||
|
// 查找个人页面链接
|
||||||
|
if ((lowerHref.includes('/people/') || lowerHref.includes('/faculty/') ||
|
||||||
|
lowerHref.includes('/profile/')) &&
|
||||||
|
text.length > 3 && text.length < 100 &&
|
||||||
|
!text.toLowerCase().includes('faculty') &&
|
||||||
|
!text.toLowerCase().includes('people')) {
|
||||||
|
faculty.push({
|
||||||
|
name: text,
|
||||||
|
url: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return faculty;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f"\n找到 {len(econ_faculty)} 个教职员工:")
|
||||||
|
for f in econ_faculty[:30]:
|
||||||
|
print(f" - {f['name']} -> {f['url']}")
|
||||||
|
|
||||||
|
# 查看页面上所有链接用于调试
|
||||||
|
print("\n\n页面上的所有链接:")
|
||||||
|
all_links = await page.evaluate('''() => {
|
||||||
|
const links = [];
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href || '';
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
if (text && text.length > 2 && text.length < 100) {
|
||||||
|
links.push({text: text, href: href});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
return links;
|
||||||
|
}''')
|
||||||
|
for link in all_links[:40]:
|
||||||
|
print(f" - {link['text'][:50]} -> {link['href']}")
|
||||||
|
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
asyncio.run(explore_faculty_page())
|
||||||
artifacts/explore_manchester.py (new file, 173 lines)

@@ -0,0 +1,173 @@
|
|||||||
|
"""
|
||||||
|
探索曼彻斯特大学硕士课程页面结构
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
|
||||||
|
|
||||||
|
async def explore_manchester():
|
||||||
|
"""探索曼彻斯特大学网站结构"""
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser = await p.chromium.launch(headless=False)
|
||||||
|
context = await browser.new_context(
|
||||||
|
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
|
||||||
|
)
|
||||||
|
page = await context.new_page()
|
||||||
|
|
||||||
|
# 直接访问硕士课程A-Z列表页
|
||||||
|
print("访问硕士课程A-Z列表页面...")
|
||||||
|
await page.goto("https://www.manchester.ac.uk/study/masters/courses/list/",
|
||||||
|
wait_until="domcontentloaded", timeout=60000)
|
||||||
|
await page.wait_for_timeout(5000)
|
||||||
|
|
||||||
|
# 截图
|
||||||
|
await page.screenshot(path="manchester_masters_page.png", full_page=False)
|
||||||
|
print("截图已保存: manchester_masters_page.png")
|
||||||
|
|
||||||
|
# 分析页面结构
|
||||||
|
page_info = await page.evaluate("""() => {
|
||||||
|
const info = {
|
||||||
|
title: document.title,
|
||||||
|
url: window.location.href,
|
||||||
|
all_links: [],
|
||||||
|
course_candidates: [],
|
||||||
|
page_sections: []
|
||||||
|
};
|
||||||
|
|
||||||
|
// 获取所有链接
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href;
|
||||||
|
const text = a.innerText.trim().substring(0, 100);
|
||||||
|
if (href && text) {
|
||||||
|
info.all_links.push({href, text});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
// 查找可能的课程链接 - 包含 /course/ 或 list-item
|
||||||
|
document.querySelectorAll('a[href*="/course/"], .course-link, [class*="course"] a, .search-result a, .list-item a').forEach(a => {
|
||||||
|
info.course_candidates.push({
|
||||||
|
href: a.href,
|
||||||
|
text: a.innerText.trim().substring(0, 100),
|
||||||
|
classes: a.className,
|
||||||
|
parent_classes: a.parentElement?.className || ''
|
||||||
|
});
|
||||||
|
});
|
||||||
|
|
||||||
|
// 获取页面主要区块
|
||||||
|
document.querySelectorAll('main, [role="main"], .content, #content, .results, .course-list').forEach(el => {
|
||||||
|
info.page_sections.push({
|
||||||
|
tag: el.tagName,
|
||||||
|
id: el.id,
|
||||||
|
classes: el.className,
|
||||||
|
children_count: el.children.length
|
||||||
|
});
|
||||||
|
});
|
||||||
|
|
||||||
|
return info;
|
||||||
|
}""")
|
||||||
|
|
||||||
|
print(f"\n页面标题: {page_info['title']}")
|
||||||
|
print(f"当前URL: {page_info['url']}")
|
||||||
|
print(f"\n总链接数: {len(page_info['all_links'])}")
|
||||||
|
print(f"课程候选链接数: {len(page_info['course_candidates'])}")
|
||||||
|
|
||||||
|
# 查找包含 masters/courses/ 的链接
|
||||||
|
masters_links = [l for l in page_info['all_links']
|
||||||
|
if 'masters/courses/' in l['href'].lower()
|
||||||
|
and l['href'] != page_info['url']]
|
||||||
|
|
||||||
|
print(f"\n硕士课程相关链接 ({len(masters_links)}):")
|
||||||
|
for link in masters_links[:20]:
|
||||||
|
print(f" - {link['text'][:50]}: {link['href']}")
|
||||||
|
|
||||||
|
print(f"\n课程候选详情:")
|
||||||
|
for c in page_info['course_candidates'][:10]:
|
||||||
|
print(f" - {c['text'][:50]}")
|
||||||
|
print(f" URL: {c['href']}")
|
||||||
|
print(f" Classes: {c['classes']}")
|
||||||
|
|
||||||
|
# 检查是否有搜索/筛选功能
|
||||||
|
search_elements = await page.evaluate("""() => {
|
||||||
|
const elements = [];
|
||||||
|
document.querySelectorAll('input[type="search"], input[type="text"], select, .filter, .search').forEach(el => {
|
||||||
|
elements.push({
|
||||||
|
tag: el.tagName,
|
||||||
|
type: el.type || '',
|
||||||
|
id: el.id,
|
||||||
|
name: el.name || '',
|
||||||
|
classes: el.className
|
||||||
|
});
|
||||||
|
});
|
||||||
|
return elements;
|
||||||
|
}""")
|
||||||
|
|
||||||
|
print(f"\n搜索/筛选元素: {len(search_elements)}")
|
||||||
|
for el in search_elements[:5]:
|
||||||
|
print(f" - {el}")
|
||||||
|
|
||||||
|
# 尝试找到课程列表的实际结构
|
||||||
|
print("\n\n正在分析页面中的课程列表结构...")
|
||||||
|
|
||||||
|
list_structures = await page.evaluate("""() => {
|
||||||
|
const structures = [];
|
||||||
|
|
||||||
|
// 查找各种可能的列表结构
|
||||||
|
const selectors = [
|
||||||
|
'ul li a[href*="course"]',
|
||||||
|
'div[class*="result"] a',
|
||||||
|
'div[class*="course"] a',
|
||||||
|
'article a[href]',
|
||||||
|
'.search-results a',
|
||||||
|
'[data-course] a',
|
||||||
|
'table tr td a'
|
||||||
|
];
|
||||||
|
|
||||||
|
for (const selector of selectors) {
|
||||||
|
const elements = document.querySelectorAll(selector);
|
||||||
|
if (elements.length > 0) {
|
||||||
|
const samples = [];
|
||||||
|
elements.forEach((el, i) => {
|
||||||
|
if (i < 5) {
|
||||||
|
samples.push({
|
||||||
|
href: el.href,
|
||||||
|
text: el.innerText.trim().substring(0, 80)
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
structures.push({
|
||||||
|
selector: selector,
|
||||||
|
count: elements.length,
|
||||||
|
samples: samples
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return structures;
|
||||||
|
}""")
|
||||||
|
|
||||||
|
print("\n找到的列表结构:")
|
||||||
|
for s in list_structures:
|
||||||
|
print(f"\n 选择器: {s['selector']} (共 {s['count']} 个)")
|
||||||
|
for sample in s['samples']:
|
||||||
|
print(f" - {sample['text']}: {sample['href']}")
|
||||||
|
|
||||||
|
# 保存完整分析结果
|
||||||
|
with open("manchester_analysis.json", "w", encoding="utf-8") as f:
|
||||||
|
json.dump(page_info, f, indent=2, ensure_ascii=False)
|
||||||
|
|
||||||
|
print("\n\n完整分析已保存到 manchester_analysis.json")
|
||||||
|
|
||||||
|
# 等待用户查看
|
||||||
|
print("\n按 Ctrl+C 关闭浏览器...")
|
||||||
|
try:
|
||||||
|
await asyncio.sleep(30)
|
||||||
|
except:
|
||||||
|
pass
|
||||||
|
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
asyncio.run(explore_manchester())
|
||||||
artifacts/explore_program_page.py (new file, 226 lines)

@@ -0,0 +1,226 @@
|
|||||||
|
"""
|
||||||
|
探索Harvard项目页面结构,寻找导师信息
|
||||||
|
"""
|
||||||
|
import asyncio
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
|
||||||
|
async def explore_program_page():
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser = await p.chromium.launch(headless=False)
|
||||||
|
page = await browser.new_page()
|
||||||
|
|
||||||
|
# 访问研究生院系页面 (GSAS)
|
||||||
|
gsas_url = "https://gsas.harvard.edu/program/african-and-african-american-studies"
|
||||||
|
print(f"访问研究生院系页面: {gsas_url}")
|
||||||
|
|
||||||
|
await page.goto(gsas_url, wait_until='networkidle')
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
# 截图保存
|
||||||
|
await page.screenshot(path="gsas_program_page.png", full_page=True)
|
||||||
|
print("已保存截图: gsas_program_page.png")
|
||||||
|
|
||||||
|
# 分析页面结构
|
||||||
|
page_info = await page.evaluate('''() => {
|
||||||
|
const info = {
|
||||||
|
title: document.title,
|
||||||
|
h1: document.querySelector('h1')?.innerText || '',
|
||||||
|
allHeadings: [],
|
||||||
|
facultyLinks: [],
|
||||||
|
peopleLinks: [],
|
||||||
|
allLinks: []
|
||||||
|
};
|
||||||
|
|
||||||
|
// 获取所有标题
|
||||||
|
document.querySelectorAll('h1, h2, h3, h4').forEach(h => {
|
||||||
|
info.allHeadings.push({
|
||||||
|
tag: h.tagName,
|
||||||
|
text: h.innerText.trim().substring(0, 100)
|
||||||
|
});
|
||||||
|
});
|
||||||
|
|
||||||
|
// 查找所有链接
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href || '';
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
|
||||||
|
// 检查是否与教职员工相关
|
||||||
|
const lowerHref = href.toLowerCase();
|
||||||
|
const lowerText = text.toLowerCase();
|
||||||
|
|
||||||
|
if (lowerHref.includes('faculty') || lowerHref.includes('people') ||
|
||||||
|
lowerHref.includes('professor') || lowerHref.includes('staff') ||
|
||||||
|
lowerText.includes('faculty') || lowerText.includes('people')) {
|
||||||
|
info.facultyLinks.push({
|
||||||
|
text: text.substring(0, 100),
|
||||||
|
href: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
// 检查是否是个人页面链接
|
||||||
|
if (href.includes('/people/') || href.includes('/faculty/') ||
|
||||||
|
href.includes('/profile/') || href.includes('/person/')) {
|
||||||
|
info.peopleLinks.push({
|
||||||
|
text: text.substring(0, 100),
|
||||||
|
href: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
// 保存所有主要链接
|
||||||
|
if (href && text.length > 2 && text.length < 150) {
|
||||||
|
info.allLinks.push({
|
||||||
|
text: text,
|
||||||
|
href: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return info;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f"\n页面标题: {page_info['title']}")
|
||||||
|
print(f"H1: {page_info['h1']}")
|
||||||
|
|
||||||
|
print(f"\n所有标题 ({len(page_info['allHeadings'])}):")
|
||||||
|
for h in page_info['allHeadings']:
|
||||||
|
print(f" <{h['tag']}>: {h['text']}")
|
||||||
|
|
||||||
|
print(f"\n教职员工相关链接 ({len(page_info['facultyLinks'])}):")
|
||||||
|
for f in page_info['facultyLinks']:
|
||||||
|
print(f" - {f['text']} -> {f['href']}")
|
||||||
|
|
||||||
|
print(f"\n个人页面链接 ({len(page_info['peopleLinks'])}):")
|
||||||
|
for p in page_info['peopleLinks']:
|
||||||
|
print(f" - {p['text']} -> {p['href']}")
|
||||||
|
|
||||||
|
print(f"\n所有链接 ({len(page_info['allLinks'])}):")
|
||||||
|
for link in page_info['allLinks'][:50]:
|
||||||
|
print(f" - {link['text'][:60]} -> {link['href']}")
|
||||||
|
|
||||||
|
# 尝试另一个项目页面看看是否有不同结构
|
||||||
|
print("\n\n========== 尝试另一个项目页面 ==========")
|
||||||
|
economics_url = "https://gsas.harvard.edu/program/economics"
|
||||||
|
print(f"访问: {economics_url}")
|
||||||
|
|
||||||
|
await page.goto(economics_url, wait_until='networkidle')
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
# 截图保存
|
||||||
|
await page.screenshot(path="gsas_economics_page.png", full_page=True)
|
||||||
|
print("已保存截图: gsas_economics_page.png")
|
||||||
|
|
||||||
|
# 分析
|
||||||
|
econ_info = await page.evaluate('''() => {
|
||||||
|
const info = {
|
||||||
|
title: document.title,
|
||||||
|
facultyLinks: [],
|
||||||
|
peopleLinks: []
|
||||||
|
};
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href || '';
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
const lowerHref = href.toLowerCase();
|
||||||
|
const lowerText = text.toLowerCase();
|
||||||
|
|
||||||
|
if (lowerHref.includes('faculty') || lowerHref.includes('people') ||
|
||||||
|
lowerText.includes('faculty') || lowerText.includes('people')) {
|
||||||
|
info.facultyLinks.push({
|
||||||
|
text: text.substring(0, 100),
|
||||||
|
href: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
if (href.includes('/people/') || href.includes('/faculty/') ||
|
||||||
|
href.includes('/profile/') || href.includes('/person/')) {
|
||||||
|
info.peopleLinks.push({
|
||||||
|
text: text.substring(0, 100),
|
||||||
|
href: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return info;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f"\n教职员工相关链接 ({len(econ_info['facultyLinks'])}):")
|
||||||
|
for f in econ_info['facultyLinks']:
|
||||||
|
print(f" - {f['text']} -> {f['href']}")
|
||||||
|
|
||||||
|
print(f"\n个人页面链接 ({len(econ_info['peopleLinks'])}):")
|
||||||
|
for p in econ_info['peopleLinks']:
|
||||||
|
print(f" - {p['text']} -> {p['href']}")
|
||||||
|
|
||||||
|
# 访问院系主页看看有没有Faculty页面
|
||||||
|
print("\n\n========== 尝试访问院系主页 ==========")
|
||||||
|
dept_url = "https://aaas.fas.harvard.edu/"
|
||||||
|
print(f"访问院系主页: {dept_url}")
|
||||||
|
|
||||||
|
await page.goto(dept_url, wait_until='networkidle')
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
await page.screenshot(path="aaas_dept_page.png", full_page=True)
|
||||||
|
print("已保存截图: aaas_dept_page.png")
|
||||||
|
|
||||||
|
dept_info = await page.evaluate('''() => {
|
||||||
|
const info = {
|
||||||
|
title: document.title,
|
||||||
|
navLinks: [],
|
||||||
|
facultyLinks: [],
|
||||||
|
peopleLinks: []
|
||||||
|
};
|
||||||
|
|
||||||
|
// 获取导航链接
|
||||||
|
document.querySelectorAll('nav a, [class*="nav"] a, [class*="menu"] a').forEach(a => {
|
||||||
|
const href = a.href || '';
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
if (text && text.length > 1 && text.length < 50) {
|
||||||
|
info.navLinks.push({
|
||||||
|
text: text,
|
||||||
|
href: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href || '';
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
const lowerHref = href.toLowerCase();
|
||||||
|
const lowerText = text.toLowerCase();
|
||||||
|
|
||||||
|
if (lowerHref.includes('faculty') || lowerHref.includes('people') ||
|
||||||
|
lowerText.includes('faculty') || lowerText.includes('people')) {
|
||||||
|
info.facultyLinks.push({
|
||||||
|
text: text.substring(0, 100),
|
||||||
|
href: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
if (href.includes('/people/') || href.includes('/faculty/') ||
|
||||||
|
href.includes('/profile/')) {
|
||||||
|
info.peopleLinks.push({
|
||||||
|
text: text.substring(0, 100),
|
||||||
|
href: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return info;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f"\n导航链接 ({len(dept_info['navLinks'])}):")
|
||||||
|
for link in dept_info['navLinks'][:20]:
|
||||||
|
print(f" - {link['text']} -> {link['href']}")
|
||||||
|
|
||||||
|
print(f"\n教职员工相关链接 ({len(dept_info['facultyLinks'])}):")
|
||||||
|
for f in dept_info['facultyLinks']:
|
||||||
|
print(f" - {f['text']} -> {f['href']}")
|
||||||
|
|
||||||
|
print(f"\n个人页面链接 ({len(dept_info['peopleLinks'])}):")
|
||||||
|
for p in dept_info['peopleLinks'][:30]:
|
||||||
|
print(f" - {p['text']} -> {p['href']}")
|
||||||
|
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
asyncio.run(explore_program_page())
|
||||||
artifacts/harvard_faculty_scraper.py (new file, 445 lines)

@@ -0,0 +1,445 @@
|
|||||||
|
#!/usr/bin/env python
|
||||||
|
"""
|
||||||
|
Auto-generated by the Agno codegen agent.
|
||||||
|
Target university: Harvard (https://www.harvard.edu/)
|
||||||
|
Requested caps: depth=3, pages=30
|
||||||
|
|
||||||
|
Plan description: Playwright scraper for university master programs and faculty profiles.
|
||||||
|
Navigation strategy: Start at https://www.harvard.edu/ Follow links to /academics/ and /a-to-z/ to find list of schools and departments For each school/department, look for a 'faculty' or 'people' page On faculty directory pages, identify and follow links to individual profiles Check for school/department specific subdomains like hls.harvard.edu, hds.harvard.edu, etc. Prioritize crawling faculty directory pages over general site crawling
|
||||||
|
Verification checklist:
|
||||||
|
- Manually review a sample of scraped URLs to verify they are faculty profiles
|
||||||
|
- Check that major academic departments are represented in the results
|
||||||
|
- Verify the script is capturing profile page content, not just URLs
|
||||||
|
- Confirm no login pages, application forms, or directory pages are included
|
||||||
|
Playwright snapshot used to guide this plan:
|
||||||
|
1. Harvard University (https://www.harvard.edu/)
|
||||||
|
Snippet: Skip to main content Harvard University Learn about our lawsuits to protect our students and researchers Search Menu David Liu received the 2025 Breakthrough Prize in Life Sciences for developing a revolutionary gene-editing platforms that precisely corrects genetic mutations.
|
||||||
|
Anchors: Skip to main content -> https://www.harvard.edu/#main-content, Harvard University -> https://www.harvard.edu/, Learn about our lawsuits to protect our students and researchers -> https://www.harvard.edu/federal-lawsuits/, × -> javascript:void(0), A to Z index -> https://www.harvard.edu/a-to-z/, Academics -> https://www.harvard.edu/academics/
|
||||||
|
2. Index of departments, schools, and affiliates - Harvard University (https://www.harvard.edu/a-to-z/)
|
||||||
|
Snippet: Skip to main content Harvard University Learn about our lawsuits to protect our students and researchers Search Menu David Liu received the 2025 Breakthrough Prize in Life Sciences for developing a revolutionary gene-editing platforms that precisely corrects genetic mutations.
|
||||||
|
Anchors: Skip to main content -> https://www.harvard.edu/a-to-z/#main-content, Harvard University -> https://www.harvard.edu/, Learn about our lawsuits to protect our students and researchers -> https://www.harvard.edu/federal-lawsuits/, × -> javascript:void(0), A to Z index -> https://www.harvard.edu/a-to-z/, Academics -> https://www.harvard.edu/academics/
|
||||||
|
3. Academics - Harvard University (https://www.harvard.edu/academics/)
|
||||||
|
Snippet: Skip to main content Harvard University Learn about our lawsuits to protect our students and researchers Search Menu David Liu received the 2025 Breakthrough Prize in Life Sciences for developing a revolutionary gene-editing platforms that precisely corrects genetic mutations.
|
||||||
|
Anchors: Skip to main content -> https://www.harvard.edu/academics/#main-content, Harvard University -> https://www.harvard.edu/, Learn about our lawsuits to protect our students and researchers -> https://www.harvard.edu/federal-lawsuits/, A to Z index -> https://www.harvard.edu/a-to-z/, Academics -> https://www.harvard.edu/academics/, Undergraduate Degrees -> https://www.harvard.edu//programs/?degree_levels=undergraduate
|
||||||
|
4. Programs - Harvard University (https://www.harvard.edu//programs/?degree_levels=undergraduate)
|
||||||
|
Snippet: Skip to main content Harvard University Learn about our lawsuits to protect our students and researchers Search Menu David Liu received the 2025 Breakthrough Prize in Life Sciences for developing a revolutionary gene-editing platforms that precisely corrects genetic mutations.
|
||||||
|
Anchors: Skip to main content -> https://www.harvard.edu/programs/?degree_levels=undergraduate#main-content, Harvard University -> https://www.harvard.edu/, Learn about our lawsuits to protect our students and researchers -> https://www.harvard.edu/federal-lawsuits/, A to Z index -> https://www.harvard.edu/a-to-z/, Academics -> https://www.harvard.edu/academics/, Undergraduate Degrees -> https://www.harvard.edu//programs/?degree_levels=undergraduate
|
||||||
|
Snapshot truncated.
|
||||||
|
|
||||||
|
Generated at: 2025-12-10T07:19:12.294884+00:00
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import time
|
||||||
|
from collections import deque
|
||||||
|
from dataclasses import asdict, dataclass, field
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Deque, Iterable, List, Set, Tuple
|
||||||
|
from urllib.parse import urljoin, urldefrag, urlparse
|
||||||
|
|
||||||
|
from playwright.async_api import async_playwright, Page, Response
|
||||||
|
|
||||||
|
PROGRAM_KEYWORDS = ['/graduate/', '/masters/', '/programs/?degree_levels=graduate', '/mpp/', 'Master of', 'M.S.', 'M.A.', 'graduate program']
|
||||||
|
FACULTY_KEYWORDS = ['/people/', '/~', '/faculty/', '/profile/', 'professor', 'dr.', 'ph.d.', 'firstname-lastname']
|
||||||
|
EXCLUSION_KEYWORDS = ['admissions', 'apply', 'tuition', 'news', 'events', 'calendar', 'careers', 'jobs', 'login', 'donate', 'alumni', 'giving']
|
||||||
|
METADATA_FIELDS = ['url', 'title', 'entity_type', 'department', 'email', 'scraped_at']
|
||||||
|
EXTRA_NOTES = ['Many Harvard faculty have profiles under the /~username/ URL pattern', 'Some faculty may be cross-listed in multiple departments', 'Prioritize finding profiles from professional schools (business, law, medicine, etc.)', "Check for non-standard faculty titles like 'lecturer', 'fellow', 'researcher'"]
|
||||||
|
|
||||||
|
# URL patterns that indicate individual profile pages
|
||||||
|
PROFILE_URL_PATTERNS = [
|
||||||
|
"/people/", "/person/", "/profile/", "/profiles/",
|
||||||
|
"/faculty/", "/staff/", "/directory/",
|
||||||
|
"/~", # Unix-style personal pages
|
||||||
|
"/bio/", "/about/",
|
||||||
|
]
|
||||||
|
|
||||||
|
# URL patterns that indicate listing/directory pages (should be crawled deeper)
|
||||||
|
DIRECTORY_URL_PATTERNS = [
|
||||||
|
"/faculty", "/people", "/directory", "/staff",
|
||||||
|
"/team", "/members", "/researchers",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def normalize_url(base: str, href: str) -> str:
|
||||||
|
"""Normalize URL by resolving relative paths and removing fragments."""
|
||||||
|
absolute = urljoin(base, href)
|
||||||
|
cleaned, _ = urldefrag(absolute)
|
||||||
|
# Remove trailing slash for consistency
|
||||||
|
return cleaned.rstrip("/")
|
||||||
|
|
||||||
|
|
||||||
|
def matches_any(text: str, keywords: Iterable[str]) -> bool:
|
||||||
|
"""Check if text contains any of the keywords (case-insensitive)."""
|
||||||
|
lowered = text.lower()
|
||||||
|
return any(keyword.lower() in lowered for keyword in keywords)
|
||||||
|
|
||||||
|
|
||||||
|
def is_same_domain(url1: str, url2: str) -> bool:
|
||||||
|
"""Check if two URLs belong to the same root domain."""
|
||||||
|
domain1 = urlparse(url1).netloc.replace("www.", "")
|
||||||
|
domain2 = urlparse(url2).netloc.replace("www.", "")
|
||||||
|
# Allow subdomains of the same root domain
|
||||||
|
parts1 = domain1.split(".")
|
||||||
|
parts2 = domain2.split(".")
|
||||||
|
if len(parts1) >= 2 and len(parts2) >= 2:
|
||||||
|
return parts1[-2:] == parts2[-2:]
|
||||||
|
return domain1 == domain2
|
||||||
|
|
||||||
|
|
||||||
|
def is_profile_url(url: str) -> bool:
|
||||||
|
"""Check if URL pattern suggests an individual profile page."""
|
||||||
|
url_lower = url.lower()
|
||||||
|
return any(pattern in url_lower for pattern in PROFILE_URL_PATTERNS)
|
||||||
|
|
||||||
|
|
||||||
|
def is_directory_url(url: str) -> bool:
|
||||||
|
"""Check if URL pattern suggests a directory/listing page."""
|
||||||
|
url_lower = url.lower()
|
||||||
|
return any(pattern in url_lower for pattern in DIRECTORY_URL_PATTERNS)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ScrapedLink:
|
||||||
|
url: str
|
||||||
|
title: str
|
||||||
|
text: str
|
||||||
|
source_url: str
|
||||||
|
bucket: str # "program" or "faculty"
|
||||||
|
is_verified: bool = False
|
||||||
|
http_status: int = 0
|
||||||
|
is_profile_page: bool = False
|
||||||
|
scraped_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ScrapeSettings:
|
||||||
|
root_url: str
|
||||||
|
max_depth: int
|
||||||
|
max_pages: int
|
||||||
|
headless: bool
|
||||||
|
output: Path
|
||||||
|
verify_links: bool = True
|
||||||
|
request_delay: float = 1.0 # Polite crawling delay
|
||||||
|
timeout: int = 60000 # Navigation timeout in ms
|
||||||
|
|
||||||
|
|
||||||
|
async def extract_links(page: Page) -> List[Tuple[str, str]]:
|
||||||
|
"""Extract all anchor links from the page."""
|
||||||
|
anchors: Iterable[dict] = await page.eval_on_selector_all(
|
||||||
|
"a",
|
||||||
|
"""elements => elements
|
||||||
|
.map(el => ({text: (el.textContent || '').trim(), href: el.href}))
|
||||||
|
.filter(item => item.text && item.href && item.href.startsWith('http'))""",
|
||||||
|
)
|
||||||
|
return [(item["href"], item["text"]) for item in anchors]
|
||||||
|
|
||||||
|
|
||||||
|
async def get_page_title(page: Page) -> str:
|
||||||
|
"""Get the page title safely."""
|
||||||
|
try:
|
||||||
|
return await page.title() or ""
|
||||||
|
except Exception:
|
||||||
|
return ""
|
||||||
|
|
||||||
|
|
||||||
|
async def verify_link(context, url: str, timeout: int = 10000) -> Tuple[bool, int, str]:
|
||||||
|
"""
|
||||||
|
Verify a link by making a HEAD-like request.
|
||||||
|
Returns: (is_valid, status_code, page_title)
|
||||||
|
"""
|
||||||
|
page = await context.new_page()
|
||||||
|
try:
|
||||||
|
response: Response = await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
|
||||||
|
if response:
|
||||||
|
status = response.status
|
||||||
|
title = await get_page_title(page)
|
||||||
|
is_valid = 200 <= status < 400
|
||||||
|
return is_valid, status, title
|
||||||
|
return False, 0, ""
|
||||||
|
except Exception:
|
||||||
|
return False, 0, ""
|
||||||
|
finally:
|
||||||
|
await page.close()
|
||||||
|
|
||||||
|
|
||||||
|
async def crawl(settings: ScrapeSettings, browser_name: str) -> List[ScrapedLink]:
|
||||||
|
"""
|
||||||
|
Crawl the website using BFS, collecting program and faculty links.
|
||||||
|
Features:
|
||||||
|
- URL deduplication
|
||||||
|
- Link verification
|
||||||
|
- Profile page detection
|
||||||
|
- Polite crawling with delays
|
||||||
|
"""
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser_launcher = getattr(p, browser_name)
|
||||||
|
browser = await browser_launcher.launch(headless=settings.headless)
|
||||||
|
context = await browser.new_context()
|
||||||
|
|
||||||
|
# Priority queue: (priority, url, depth) - lower priority = processed first
|
||||||
|
# Directory pages get priority 0, others get priority 1
|
||||||
|
queue: Deque[Tuple[int, str, int]] = deque([(0, settings.root_url, 0)])
|
||||||
|
visited: Set[str] = set()
|
||||||
|
found_urls: Set[str] = set() # For deduplication of results
|
||||||
|
results: List[ScrapedLink] = []
|
||||||
|
|
||||||
|
print(f"Starting crawl from: {settings.root_url}")
|
||||||
|
print(f"Max depth: {settings.max_depth}, Max pages: {settings.max_pages}")
|
||||||
|
|
||||||
|
try:
|
||||||
|
while queue and len(visited) < settings.max_pages:
|
||||||
|
# Sort queue by priority (directory pages first)
|
||||||
|
queue = deque(sorted(queue, key=lambda x: x[0]))
|
||||||
|
priority, url, depth = queue.popleft()
|
||||||
|
|
||||||
|
normalized_url = normalize_url(settings.root_url, url)
|
||||||
|
if normalized_url in visited or depth > settings.max_depth:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Only crawl same-domain URLs
|
||||||
|
if not is_same_domain(settings.root_url, normalized_url):
|
||||||
|
continue
|
||||||
|
|
||||||
|
visited.add(normalized_url)
|
||||||
|
print(f"[{len(visited)}/{settings.max_pages}] Depth {depth}: {normalized_url[:80]}...")
|
||||||
|
|
||||||
|
page = await context.new_page()
|
||||||
|
try:
|
||||||
|
response = await page.goto(
|
||||||
|
normalized_url, wait_until="domcontentloaded", timeout=settings.timeout
|
||||||
|
)
|
||||||
|
if not response or response.status >= 400:
|
||||||
|
await page.close()
|
||||||
|
continue
|
||||||
|
except Exception as e:
|
||||||
|
print(f" Error: {e}")
|
||||||
|
await page.close()
|
||||||
|
continue
|
||||||
|
|
||||||
|
page_title = await get_page_title(page)
|
||||||
|
links = await extract_links(page)
|
||||||
|
|
||||||
|
for href, text in links:
|
||||||
|
normalized_href = normalize_url(normalized_url, href)
|
||||||
|
|
||||||
|
# Skip if already found or is excluded
|
||||||
|
if normalized_href in found_urls:
|
||||||
|
continue
|
||||||
|
if matches_any(text, EXCLUSION_KEYWORDS) or matches_any(normalized_href, EXCLUSION_KEYWORDS):
|
||||||
|
continue
|
||||||
|
|
||||||
|
text_lower = text.lower()
|
||||||
|
href_lower = normalized_href.lower()
|
||||||
|
is_profile = is_profile_url(normalized_href)
|
||||||
|
|
||||||
|
# Check for program links
|
||||||
|
if matches_any(text_lower, PROGRAM_KEYWORDS) or matches_any(href_lower, PROGRAM_KEYWORDS):
|
||||||
|
found_urls.add(normalized_href)
|
||||||
|
results.append(
|
||||||
|
ScrapedLink(
|
||||||
|
url=normalized_href,
|
||||||
|
title="",
|
||||||
|
text=text[:200],
|
||||||
|
source_url=normalized_url,
|
||||||
|
bucket="program",
|
||||||
|
is_profile_page=False,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
# Check for faculty links
|
||||||
|
if matches_any(text_lower, FACULTY_KEYWORDS) or matches_any(href_lower, FACULTY_KEYWORDS):
|
||||||
|
found_urls.add(normalized_href)
|
||||||
|
results.append(
|
||||||
|
ScrapedLink(
|
||||||
|
url=normalized_href,
|
||||||
|
title="",
|
||||||
|
text=text[:200],
|
||||||
|
source_url=normalized_url,
|
||||||
|
bucket="faculty",
|
||||||
|
is_profile_page=is_profile,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
# Queue for further crawling
|
||||||
|
if depth < settings.max_depth and is_same_domain(settings.root_url, normalized_href):
|
||||||
|
# Prioritize directory pages
|
||||||
|
link_priority = 0 if is_directory_url(normalized_href) else 1
|
||||||
|
queue.append((link_priority, normalized_href, depth + 1))
|
||||||
|
|
||||||
|
await page.close()
|
||||||
|
|
||||||
|
# Polite delay between requests
|
||||||
|
await asyncio.sleep(settings.request_delay)
|
||||||
|
|
||||||
|
finally:
|
||||||
|
await context.close()
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
# Verify links if enabled
|
||||||
|
if settings.verify_links and results:
|
||||||
|
print(f"\nVerifying {len(results)} links...")
|
||||||
|
browser = await browser_launcher.launch(headless=True)
|
||||||
|
context = await browser.new_context()
|
||||||
|
|
||||||
|
verified_results = []
|
||||||
|
for i, link in enumerate(results):
|
||||||
|
if link.url in [r.url for r in verified_results]:
|
||||||
|
continue # Skip duplicates
|
||||||
|
|
||||||
|
print(f" [{i+1}/{len(results)}] Verifying: {link.url[:60]}...")
|
||||||
|
is_valid, status, title = await verify_link(context, link.url)
|
||||||
|
link.is_verified = True
|
||||||
|
link.http_status = status
|
||||||
|
link.title = title or link.text
|
||||||
|
|
||||||
|
if is_valid:
|
||||||
|
verified_results.append(link)
|
||||||
|
else:
|
||||||
|
print(f" Invalid (HTTP {status})")
|
||||||
|
|
||||||
|
await asyncio.sleep(0.5) # Delay between verifications
|
||||||
|
|
||||||
|
await context.close()
|
||||||
|
await browser.close()
|
||||||
|
results = verified_results
|
||||||
|
|
||||||
|
return results
|
||||||
|
|
||||||
|
|
||||||
|
def deduplicate_results(results: List[ScrapedLink]) -> List[ScrapedLink]:
|
||||||
|
"""Remove duplicate URLs, keeping the first occurrence."""
|
||||||
|
seen: Set[str] = set()
|
||||||
|
unique = []
|
||||||
|
for link in results:
|
||||||
|
if link.url not in seen:
|
||||||
|
seen.add(link.url)
|
||||||
|
unique.append(link)
|
||||||
|
return unique
|
||||||
|
|
||||||
|
|
||||||
|
def serialize(results: List[ScrapedLink], target: Path, root_url: str) -> None:
|
||||||
|
"""Save results to JSON file with statistics."""
|
||||||
|
results = deduplicate_results(results)
|
||||||
|
|
||||||
|
program_links = [link for link in results if link.bucket == "program"]
|
||||||
|
faculty_links = [link for link in results if link.bucket == "faculty"]
|
||||||
|
profile_pages = [link for link in faculty_links if link.is_profile_page]
|
||||||
|
|
||||||
|
payload = {
|
||||||
|
"root_url": root_url,
|
||||||
|
"generated_at": datetime.now(timezone.utc).isoformat(),
|
||||||
|
"statistics": {
|
||||||
|
"total_links": len(results),
|
||||||
|
"program_links": len(program_links),
|
||||||
|
"faculty_links": len(faculty_links),
|
||||||
|
"profile_pages": len(profile_pages),
|
||||||
|
"verified_links": len([r for r in results if r.is_verified and r.http_status == 200]),
|
||||||
|
},
|
||||||
|
"program_links": [asdict(link) for link in program_links],
|
||||||
|
"faculty_links": [asdict(link) for link in faculty_links],
|
||||||
|
"notes": EXTRA_NOTES,
|
||||||
|
"metadata_fields": METADATA_FIELDS,
|
||||||
|
}
|
||||||
|
target.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
target.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
|
||||||
|
|
||||||
|
print(f"\nResults saved to: {target}")
|
||||||
|
print(f" Total links: {len(results)}")
|
||||||
|
print(f" Program links: {len(program_links)}")
|
||||||
|
print(f" Faculty links: {len(faculty_links)}")
|
||||||
|
print(f" Profile pages: {len(profile_pages)}")
|
||||||
|
|
||||||
|
|
||||||
|
def parse_args() -> argparse.Namespace:
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
description="Playwright scraper generated by the Agno agent for https://www.harvard.edu/."
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--root-url",
|
||||||
|
default="https://www.harvard.edu/",
|
||||||
|
help="Seed url to start crawling from.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--max-depth",
|
||||||
|
type=int,
|
||||||
|
default=3,
|
||||||
|
help="Maximum crawl depth.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--max-pages",
|
||||||
|
type=int,
|
||||||
|
default=30,
|
||||||
|
help="Maximum number of pages to visit.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--output",
|
||||||
|
type=Path,
|
||||||
|
default=Path("university-scraper_results.json"),
|
||||||
|
help="Where to save the JSON output.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--headless",
|
||||||
|
action="store_true",
|
||||||
|
default=True,
|
||||||
|
help="Run browser in headless mode (default: True).",
|
||||||
|
)
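# Note: with action="store_true" and default=True, passing --headless changes nothing;
# headless mode stays on unless --no-headless (declared below) resets it.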
|
||||||
|
parser.add_argument(
|
||||||
|
"--no-headless",
|
||||||
|
action="store_false",
|
||||||
|
dest="headless",
|
||||||
|
help="Run browser with visible window.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--browser",
|
||||||
|
choices=["chromium", "firefox", "webkit"],
|
||||||
|
default="chromium",
|
||||||
|
help="Browser engine to launch via Playwright.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--no-verify",
|
||||||
|
action="store_true",
|
||||||
|
default=False,
|
||||||
|
help="Skip link verification step.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--delay",
|
||||||
|
type=float,
|
||||||
|
default=1.0,
|
||||||
|
help="Delay between requests in seconds (polite crawling).",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--timeout",
|
||||||
|
type=int,
|
||||||
|
default=60000,
|
||||||
|
help="Navigation timeout in milliseconds (default: 60000 = 60s).",
|
||||||
|
)
|
||||||
|
return parser.parse_args()
|
||||||
|
|
||||||
|
|
||||||
|
async def main_async() -> None:
|
||||||
|
args = parse_args()
|
||||||
|
settings = ScrapeSettings(
|
||||||
|
root_url=args.root_url,
|
||||||
|
max_depth=args.max_depth,
|
||||||
|
max_pages=args.max_pages,
|
||||||
|
headless=args.headless,
|
||||||
|
output=args.output,
|
||||||
|
verify_links=not args.no_verify,
|
||||||
|
request_delay=args.delay,
|
||||||
|
timeout=args.timeout,
|
||||||
|
)
|
||||||
|
links = await crawl(settings, browser_name=args.browser)
|
||||||
|
serialize(links, settings.output, settings.root_url)
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> None:
|
||||||
|
asyncio.run(main_async())
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
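A minimal sketch (not part of the generated artifact) of consuming the JSON that the `serialize()` function above writes. The key names follow the `payload` dict it builds; the path assumes the script's default `--output` value:

```python
import json
from pathlib import Path

# Default --output path of the generated scraper; adjust if another path was passed.
payload = json.loads(Path("university-scraper_results.json").read_text(encoding="utf-8"))

stats = payload["statistics"]
print(f"{stats['program_links']} program links, {stats['faculty_links']} faculty links "
      f"({stats['profile_pages']} look like individual profile pages)")

for link in payload["faculty_links"][:5]:
    # Each entry is an asdict()-serialized ScrapedLink.
    print(link["title"] or link["text"], "->", link["url"])
```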
466
artifacts/harvard_programs_scraper.py
Normal file
@@ -0,0 +1,466 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Harvard Graduate Programs Scraper
|
||||||
|
Scrapes every graduate program listed on https://www.harvard.edu/programs/?degree_levels=graduate
|
||||||
|
by walking through all result pages via the pagination buttons.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from pathlib import Path
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
|
||||||
|
|
||||||
|
async def scrape_harvard_programs():
|
||||||
|
"""爬取Harvard研究生项目列表页面 - 通过点击分页按钮"""
|
||||||
|
|
||||||
|
all_programs = []
|
||||||
|
base_url = "https://www.harvard.edu/programs/?degree_levels=graduate"
|
||||||
|
|
||||||
|
async with async_playwright() as p:
|
||||||
|
# 使用无头模式
|
||||||
|
browser = await p.chromium.launch(headless=True)
|
||||||
|
context = await browser.new_context(
|
||||||
|
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
|
||||||
|
viewport={'width': 1920, 'height': 1080}
|
||||||
|
)
|
||||||
|
page = await context.new_page()
|
||||||
|
|
||||||
|
print(f"正在访问: {base_url}")
|
||||||
|
# 使用 domcontentloaded 而非 networkidle,更快加载
|
||||||
|
await page.goto(base_url, wait_until="domcontentloaded", timeout=60000)
|
||||||
|
# 等待页面内容加载
|
||||||
|
await page.wait_for_timeout(5000)
|
||||||
|
|
||||||
|
# 滚动到页面底部以确保分页按钮加载
|
||||||
|
print("滚动到页面底部...")
|
||||||
|
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
current_page = 1
|
||||||
|
max_pages = 15
|
||||||
|
|
||||||
|
while current_page <= max_pages:
|
||||||
|
print(f"\n========== 第 {current_page} 页 ==========")
|
||||||
|
|
||||||
|
# 等待内容加载
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
# 提取当前页面的项目
|
||||||
|
# 从调试输出得知,项目按钮的class是 'records__record___PbPhG c-programs-item__title-link'
|
||||||
|
# 需要点击按钮来获取URL,因为Harvard使用JavaScript导航
|
||||||
|
|
||||||
|
# 首先获取所有项目按钮信息
|
||||||
|
page_data = await page.evaluate('''() => {
|
||||||
|
const programs = [];
|
||||||
|
|
||||||
|
// 查找所有项目行/容器
|
||||||
|
const programItems = document.querySelectorAll('[class*="records__record"], [class*="c-programs-item"]');
|
||||||
|
|
||||||
|
programItems.forEach((item, index) => {
|
||||||
|
// 获取项目名称按钮
|
||||||
|
const nameBtn = item.querySelector('button[class*="title-link"], button[class*="c-programs-item"]');
|
||||||
|
if (!nameBtn) return;
|
||||||
|
|
||||||
|
const name = nameBtn.innerText.trim();
|
||||||
|
if (!name || name.length < 3) return;
|
||||||
|
|
||||||
|
// 获取学位信息
|
||||||
|
let degrees = '';
|
||||||
|
const allText = item.innerText;
|
||||||
|
const degreeMatch = allText.match(/(A\\.B\\.|Ph\\.D\\.|M\\.A\\.|S\\.M\\.|M\\.Arch\\.|LL\\.M\\.|S\\.B\\.|A\\.L\\.B\\.|A\\.L\\.M\\.|M\\.M\\.Sc\\.|Ed\\.D\\.|Ed\\.M\\.|M\\.P\\.A\\.|M\\.P\\.P\\.|M\\.P\\.H\\.|J\\.D\\.|M\\.B\\.A\\.|M\\.D\\.|D\\.M\\.D\\.|Th\\.D\\.|M\\.Div\\.|M\\.T\\.S\\.|M\\.E\\.|D\\.M\\.Sc\\.|M\\.H\\.C\\.M\\.|M\\.L\\.A\\.|M\\.D\\.E\\.|M\\.R\\.E\\.|M\\.A\\.U\\.D\\.|M\\.R\\.P\\.L\\.)/g);
|
||||||
|
if (degreeMatch) {
|
||||||
|
degrees = degreeMatch.join(', ');
|
||||||
|
}
|
||||||
|
|
||||||
|
// 查找链接 - 检查各种可能的位置
|
||||||
|
let url = '';
|
||||||
|
|
||||||
|
// 方法1: 查找 <a> 标签
|
||||||
|
const link = item.querySelector('a[href]');
|
||||||
|
if (link && link.href) {
|
||||||
|
url = link.href;
|
||||||
|
}
|
||||||
|
|
||||||
|
// 方法2: 检查data属性
|
||||||
|
if (!url) {
|
||||||
|
const dataUrl = nameBtn.getAttribute('data-url') ||
|
||||||
|
nameBtn.getAttribute('data-href') ||
|
||||||
|
item.getAttribute('data-url');
|
||||||
|
if (dataUrl) url = dataUrl;
|
||||||
|
}
|
||||||
|
|
||||||
|
// 方法3: 检查onclick属性
|
||||||
|
if (!url) {
|
||||||
|
const onclick = nameBtn.getAttribute('onclick') || '';
|
||||||
|
const urlMatch = onclick.match(/['"]([^'"]*\\/programs\\/[^'"]*)['"]/);
|
||||||
|
if (urlMatch) url = urlMatch[1];
|
||||||
|
}
|
||||||
|
|
||||||
|
programs.push({
|
||||||
|
name: name,
|
||||||
|
degrees: degrees,
|
||||||
|
url: url,
|
||||||
|
index: index
|
||||||
|
});
|
||||||
|
});
|
||||||
|
|
||||||
|
// 如果方法1没找到项目,使用备选方法
|
||||||
|
if (programs.length === 0) {
|
||||||
|
// 查找所有项目按钮
|
||||||
|
const buttons = document.querySelectorAll('button');
|
||||||
|
buttons.forEach((btn, index) => {
|
||||||
|
const className = btn.className || '';
|
||||||
|
if (className.includes('c-programs-item') || className.includes('title-link')) {
|
||||||
|
const name = btn.innerText.trim();
|
||||||
|
if (name && name.length > 3 && !name.match(/^(Page|Next|Previous|Search|Menu|Filter)/)) {
|
||||||
|
programs.push({
|
||||||
|
name: name,
|
||||||
|
degrees: '',
|
||||||
|
url: '',
|
||||||
|
index: index
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
return {
|
||||||
|
programs: programs,
|
||||||
|
totalFound: programs.length
|
||||||
|
};
|
||||||
|
}''')
|
||||||
|
|
||||||
|
# 第一页时调试输出HTML结构
|
||||||
|
if current_page == 1 and len(page_data['programs']) == 0:
|
||||||
|
print("未找到项目,调试HTML结构...")
|
||||||
|
html_debug = await page.evaluate('''() => {
|
||||||
|
const debug = {
|
||||||
|
allButtons: [],
|
||||||
|
allLinks: [],
|
||||||
|
sampleHTML: ''
|
||||||
|
};
|
||||||
|
|
||||||
|
// 获取所有按钮
|
||||||
|
document.querySelectorAll('button').forEach(btn => {
|
||||||
|
const text = btn.innerText.trim().substring(0, 50);
|
||||||
|
if (text && text.length > 3) {
|
||||||
|
debug.allButtons.push({
|
||||||
|
text: text,
|
||||||
|
class: btn.className.substring(0, 80)
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
// 获取main区域的HTML片段
|
||||||
|
const main = document.querySelector('main') || document.body;
|
||||||
|
debug.sampleHTML = main.innerHTML.substring(0, 3000);
|
||||||
|
|
||||||
|
return debug;
|
||||||
|
}''')
|
||||||
|
print(f"找到 {len(html_debug['allButtons'])} 个按钮:")
|
||||||
|
for btn in html_debug['allButtons'][:20]:
|
||||||
|
print(f" - {btn['text']} | class: {btn['class']}")
|
||||||
|
print(f"\nHTML片段:\n{html_debug['sampleHTML'][:1500]}")
|
||||||
|
|
||||||
|
print(f" 本页找到 {len(page_data['programs'])} 个项目")
|
||||||
|
|
||||||
|
# 打印找到的项目
|
||||||
|
for prog in page_data['programs']:
|
||||||
|
print(f" - {prog['name']} ({prog['degrees']})")
|
||||||
|
|
||||||
|
# 添加到总列表(去重)
|
||||||
|
for prog in page_data['programs']:
|
||||||
|
name = prog['name'].strip()
|
||||||
|
if name and not any(p['name'] == name for p in all_programs):
|
||||||
|
all_programs.append({
|
||||||
|
'name': name,
|
||||||
|
'degrees': prog.get('degrees', ''),
|
||||||
|
'url': prog.get('url', ''),
|
||||||
|
'page': current_page
|
||||||
|
})
|
||||||
|
|
||||||
|
# 尝试点击下一页按钮
|
||||||
|
try:
|
||||||
|
clicked = False
|
||||||
|
|
||||||
|
# 首先打印所有分页相关元素用于调试
|
||||||
|
if current_page == 1:
|
||||||
|
# 截图保存以便调试
|
||||||
|
await page.screenshot(path="harvard_debug_pagination.png", full_page=True)
|
||||||
|
print("已保存调试截图: harvard_debug_pagination.png")
|
||||||
|
|
||||||
|
pagination_info = await page.evaluate('''() => {
|
||||||
|
const result = {
|
||||||
|
links: [],
|
||||||
|
buttons: [],
|
||||||
|
allClickable: [],
|
||||||
|
pageNumbers: [],
|
||||||
|
allText: []
|
||||||
|
};
|
||||||
|
|
||||||
|
// 查找所有链接
|
||||||
|
document.querySelectorAll('a').forEach(a => {
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
if (text.match(/^[0-9]+$|Next|page|Prev/i)) {
|
||||||
|
result.links.push({
|
||||||
|
text: text.substring(0, 50),
|
||||||
|
href: a.href,
|
||||||
|
visible: a.offsetParent !== null,
|
||||||
|
className: a.className
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
// 查找所有按钮
|
||||||
|
document.querySelectorAll('button').forEach(b => {
|
||||||
|
const text = b.innerText.trim();
|
||||||
|
if (text.match(/^[0-9]+$|Next|page|Prev/i) || text.length < 20) {
|
||||||
|
result.buttons.push({
|
||||||
|
text: text.substring(0, 50),
|
||||||
|
visible: b.offsetParent !== null,
|
||||||
|
className: b.className
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
// 查找所有包含数字的可点击元素(可能是分页)
|
||||||
|
document.querySelectorAll('a, button, span[role="button"], div[role="button"], li a, nav a').forEach(el => {
|
||||||
|
const text = el.innerText.trim();
|
||||||
|
if (text.match(/^[0-9]$/) || text === 'Next page' || text.includes('Next')) {
|
||||||
|
result.pageNumbers.push({
|
||||||
|
tag: el.tagName,
|
||||||
|
text: text,
|
||||||
|
className: el.className,
|
||||||
|
id: el.id,
|
||||||
|
ariaLabel: el.getAttribute('aria-label'),
|
||||||
|
visible: el.offsetParent !== null
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
// 查找页面底部区域的所有可点击元素
|
||||||
|
const bodyRect = document.body.getBoundingClientRect();
|
||||||
|
document.querySelectorAll('*').forEach(el => {
|
||||||
|
const rect = el.getBoundingClientRect();
|
||||||
|
const text = el.innerText?.trim() || '';
|
||||||
|
// 只看页面下半部分的元素且文本短
|
||||||
|
if (rect.top > bodyRect.height * 0.5 && text.length > 0 && text.length < 30) {
|
||||||
|
const style = window.getComputedStyle(el);
|
||||||
|
if (style.cursor === 'pointer' || el.tagName === 'A' || el.tagName === 'BUTTON') {
|
||||||
|
result.allClickable.push({
|
||||||
|
tag: el.tagName,
|
||||||
|
text: text.substring(0, 30),
|
||||||
|
top: Math.round(rect.top),
|
||||||
|
className: el.className?.substring?.(0, 50) || ''
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
// 输出页面底部所有文本以便调试
|
||||||
|
const bodyText = document.body.innerText;
|
||||||
|
const lines = bodyText.split('\\n').filter(l => l.trim());
|
||||||
|
// 找到包含数字1-9的行
|
||||||
|
for (let i = 0; i < lines.length; i++) {
|
||||||
|
if (lines[i].match(/^[1-9]$|Next page|Previous/)) {
|
||||||
|
result.allText.push(lines[i]);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return result;
|
||||||
|
}''')
|
||||||
|
print(f"\n分页相关链接 ({len(pagination_info['links'])} 个):")
|
||||||
|
for link in pagination_info['links']:
|
||||||
|
print(f" a: '{link['text']}' class='{link.get('className', '')}' (visible: {link['visible']})")
|
||||||
|
print(f"\n分页相关按钮 ({len(pagination_info['buttons'])} 个):")
|
||||||
|
for btn in pagination_info['buttons']:
|
||||||
|
print(f" button: '{btn['text']}' class='{btn.get('className', '')}' (visible: {btn['visible']})")
|
||||||
|
print(f"\n页码元素 ({len(pagination_info['pageNumbers'])} 个):")
|
||||||
|
for pn in pagination_info['pageNumbers']:
|
||||||
|
print(f" {pn['tag']}: '{pn['text']}' aria-label='{pn.get('ariaLabel')}' visible={pn['visible']}")
|
||||||
|
print(f"\n页面下半部分可点击元素 ({len(pagination_info['allClickable'])} 个):")
|
||||||
|
for el in pagination_info['allClickable'][:30]:
|
||||||
|
print(f" {el['tag']}: '{el['text']}' (top: {el['top']})")
|
||||||
|
print(f"\n页面中的分页文本 ({len(pagination_info['allText'])} 个):")
|
||||||
|
for txt in pagination_info['allText'][:20]:
|
||||||
|
print(f" '{txt}'")
|
||||||
|
|
||||||
|
# 方法1: 直接使用CSS选择器查找 "Next page" 按钮 (最可靠)
|
||||||
|
# 从调试输出得知,分页按钮是 <button class="c-pagination__link c-pagination__link--next">
|
||||||
|
next_page_num = str(current_page + 1)
|
||||||
|
|
||||||
|
try:
|
||||||
|
next_btn = page.locator('button.c-pagination__link--next')
|
||||||
|
if await next_btn.count() > 0:
|
||||||
|
print(f"\n找到 'Next page' 按钮 (CSS选择器),尝试点击...")
|
||||||
|
await next_btn.first.scroll_into_view_if_needed()
|
||||||
|
await next_btn.first.click()
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
current_page += 1
|
||||||
|
clicked = True
|
||||||
|
except Exception as e:
|
||||||
|
print(f"方法1失败: {e}")
|
||||||
|
|
||||||
|
if clicked:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# 方法2: 使用 get_by_role 查找按钮
|
||||||
|
try:
|
||||||
|
next_btn = page.get_by_role("button", name="Next page")
|
||||||
|
if await next_btn.count() > 0:
|
||||||
|
print(f"\n通过role找到 'Next page' 按钮,尝试点击...")
|
||||||
|
await next_btn.first.scroll_into_view_if_needed()
|
||||||
|
await next_btn.first.click()
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
current_page += 1
|
||||||
|
clicked = True
|
||||||
|
except Exception as e:
|
||||||
|
print(f"方法2失败: {e}")
|
||||||
|
|
||||||
|
if clicked:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# 方法3: 查找所有分页按钮并点击 "Next page"
|
||||||
|
try:
|
||||||
|
pagination_buttons = await page.query_selector_all('button.c-pagination__link')
|
||||||
|
for btn in pagination_buttons:
|
||||||
|
text = await btn.inner_text()
|
||||||
|
if 'Next page' in text:
|
||||||
|
print(f"\n通过遍历分页按钮找到 'Next page',点击...")
|
||||||
|
await btn.scroll_into_view_if_needed()
|
||||||
|
await btn.click()
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
current_page += 1
|
||||||
|
clicked = True
|
||||||
|
break
|
||||||
|
except Exception as e:
|
||||||
|
print(f"方法3失败: {e}")
|
||||||
|
|
||||||
|
if clicked:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# 方法4: 通过JavaScript直接点击分页按钮
|
||||||
|
try:
|
||||||
|
js_clicked = await page.evaluate('''() => {
|
||||||
|
// 查找 Next page 按钮
|
||||||
|
const nextBtn = document.querySelector('button.c-pagination__link--next');
|
||||||
|
if (nextBtn) {
|
||||||
|
nextBtn.click();
|
||||||
|
return true;
|
||||||
|
}
|
||||||
|
// 备选:查找所有分页按钮
|
||||||
|
const buttons = document.querySelectorAll('button.c-pagination__link');
|
||||||
|
for (const btn of buttons) {
|
||||||
|
if (btn.innerText.includes('Next page')) {
|
||||||
|
btn.click();
|
||||||
|
return true;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return false;
|
||||||
|
}''')
|
||||||
|
if js_clicked:
|
||||||
|
print(f"\n通过JavaScript点击 'Next page' 成功")
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
current_page += 1
|
||||||
|
clicked = True
|
||||||
|
except Exception as e:
|
||||||
|
print(f"方法4失败: {e}")
|
||||||
|
|
||||||
|
if clicked:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# 方法5: 遍历所有按钮查找
|
||||||
|
try:
|
||||||
|
all_buttons = await page.query_selector_all('button')
|
||||||
|
for btn in all_buttons:
|
||||||
|
try:
|
||||||
|
text = await btn.inner_text()
|
||||||
|
if 'Next page' in text:
|
||||||
|
visible = await btn.is_visible()
|
||||||
|
if visible:
|
||||||
|
print(f"\n遍历所有按钮找到 'Next page',点击...")
|
||||||
|
await btn.scroll_into_view_if_needed()
|
||||||
|
await btn.click()
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
current_page += 1
|
||||||
|
clicked = True
|
||||||
|
break
|
||||||
|
except:
|
||||||
|
continue
|
||||||
|
except Exception as e:
|
||||||
|
print(f"方法5失败: {e}")
|
||||||
|
|
||||||
|
if clicked:
|
||||||
|
continue
|
||||||
|
|
||||||
|
print("没有找到下一页按钮,结束爬取")
|
||||||
|
break
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"点击下一页时出错: {e}")
|
||||||
|
break
|
||||||
|
|
||||||
|
# 生成项目URL - Harvard的项目URL格式为:
|
||||||
|
# https://www.harvard.edu/programs/{program-name-slug}/
|
||||||
|
# 例如: african-and-african-american-studies
|
||||||
|
|
||||||
|
import re
|
||||||
|
|
||||||
|
def name_to_slug(name):
|
||||||
|
"""将项目名称转换为URL slug"""
|
||||||
|
# 转小写
|
||||||
|
slug = name.lower()
|
||||||
|
# 将特殊字符替换为空格
|
||||||
|
slug = re.sub(r'[^\w\s-]', '', slug)
|
||||||
|
# 替换空格为连字符
|
||||||
|
slug = re.sub(r'[\s_]+', '-', slug)
|
||||||
|
# 移除多余的连字符
|
||||||
|
slug = re.sub(r'-+', '-', slug)
|
||||||
|
# 移除首尾连字符
|
||||||
|
slug = slug.strip('-')
|
||||||
|
return slug
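# Worked example of the conversion above (the first program name comes from the comment
# near the top of this block; the second is hypothetical):
#   name_to_slug("African and African American Studies")
#     -> "african-and-african-american-studies"
#   name_to_slug("Regional Studies (East Asia)")
#     -> "regional-studies-east-asia"   # parentheses removed by the first re.sub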
|
||||||
|
|
||||||
|
print("\n正在生成项目URL...")
|
||||||
|
for prog in all_programs:
|
||||||
|
slug = name_to_slug(prog['name'])
|
||||||
|
prog['url'] = f"https://www.harvard.edu/programs/{slug}/"
|
||||||
|
print(f" {prog['name']} -> {prog['url']}")
|
||||||
|
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
# 排序
|
||||||
|
programs = sorted(all_programs, key=lambda x: x['name'])
|
||||||
|
|
||||||
|
# 保存
|
||||||
|
result = {
|
||||||
|
'source_url': base_url,
|
||||||
|
'scraped_at': datetime.now(timezone.utc).isoformat(),
|
||||||
|
'total_pages_scraped': current_page,
|
||||||
|
'total_programs': len(programs),
|
||||||
|
'programs': programs
|
||||||
|
}
|
||||||
|
|
||||||
|
output_file = Path('harvard_programs_results.json')
|
||||||
|
with open(output_file, 'w', encoding='utf-8') as f:
|
||||||
|
json.dump(result, f, ensure_ascii=False, indent=2)
|
||||||
|
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print(f"爬取完成!")
|
||||||
|
print(f"共爬取 {current_page} 页")
|
||||||
|
print(f"共找到 {len(programs)} 个研究生项目")
|
||||||
|
print(f"结果保存到: {output_file}")
|
||||||
|
print(f"{'='*60}")
|
||||||
|
|
||||||
|
# 打印完整列表
|
||||||
|
print("\n研究生项目完整列表:")
|
||||||
|
for i, prog in enumerate(programs, 1):
|
||||||
|
print(f"{i:3}. {prog['name']} - {prog['degrees']}")
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
asyncio.run(scrape_harvard_programs())
|
||||||
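A small sketch (separate from the artifact above) of reading the `harvard_programs_results.json` file this script writes; the field names follow the `result` dict assembled at the end of the script:

```python
import json
from pathlib import Path

data = json.loads(Path("harvard_programs_results.json").read_text(encoding="utf-8"))
print(f"{data['total_programs']} programs scraped from {data['source_url']}")

for prog in data["programs"][:5]:
    # Each program carries name, degrees, a slug-derived url, and the page it was found on.
    print(f"- {prog['name']} ({prog['degrees']}) -> {prog['url']}")
```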
356
artifacts/harvard_programs_with_faculty_scraper.py
Normal file
@@ -0,0 +1,356 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Harvard Graduate Programs Scraper with Faculty Information
|
||||||
|
Scrapes all graduate programs listed on https://www.harvard.edu/programs/?degree_levels=graduate
|
||||||
|
and collects the supervisor (faculty) profile-page URLs for each program.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from pathlib import Path
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
|
||||||
|
|
||||||
|
def name_to_slug(name):
|
||||||
|
"""将项目名称转换为URL slug"""
|
||||||
|
slug = name.lower()
|
||||||
|
slug = re.sub(r'[^\w\s-]', '', slug)
|
||||||
|
slug = re.sub(r'[\s_]+', '-', slug)
|
||||||
|
slug = re.sub(r'-+', '-', slug)
|
||||||
|
slug = slug.strip('-')
|
||||||
|
return slug
|
||||||
|
|
||||||
|
|
||||||
|
async def extract_faculty_from_page(page):
|
||||||
|
"""从当前页面提取所有教职员工链接"""
|
||||||
|
faculty_list = await page.evaluate('''() => {
|
||||||
|
const faculty = [];
|
||||||
|
const seen = new Set();
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href || '';
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
const lowerHref = href.toLowerCase();
|
||||||
|
const lowerText = text.toLowerCase();
|
||||||
|
|
||||||
|
// 检查是否是个人页面链接
|
||||||
|
if ((lowerHref.includes('/people/') || lowerHref.includes('/faculty/') ||
|
||||||
|
lowerHref.includes('/profile/') || lowerHref.includes('/person/')) &&
|
||||||
|
text.length > 3 && text.length < 100 &&
|
||||||
|
!lowerText.includes('people') &&
|
||||||
|
!lowerText.includes('faculty') &&
|
||||||
|
!lowerText.includes('profile') &&
|
||||||
|
!lowerText.includes('staff') &&
|
||||||
|
!lowerHref.endsWith('/people/') &&
|
||||||
|
!lowerHref.endsWith('/people') &&
|
||||||
|
!lowerHref.endsWith('/faculty/') &&
|
||||||
|
!lowerHref.endsWith('/faculty')) {
|
||||||
|
|
||||||
|
if (!seen.has(href)) {
|
||||||
|
seen.add(href);
|
||||||
|
faculty.push({
|
||||||
|
name: text,
|
||||||
|
url: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return faculty;
|
||||||
|
}''')
|
||||||
|
return faculty_list
|
||||||
|
|
||||||
|
|
||||||
|
async def get_faculty_from_gsas_page(page, gsas_url, program_name):
|
||||||
|
"""从GSAS项目页面获取Faculty链接,然后访问院系People页面获取导师列表"""
|
||||||
|
faculty_list = []
|
||||||
|
faculty_page_url = None
|
||||||
|
|
||||||
|
try:
|
||||||
|
print(f" 访问GSAS页面: {gsas_url}")
|
||||||
|
await page.goto(gsas_url, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
# 策略1: 查找 "See list of ... faculty" 链接
|
||||||
|
faculty_link = await page.evaluate('''() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const link of links) {
|
||||||
|
const text = link.innerText.toLowerCase();
|
||||||
|
const href = link.href;
|
||||||
|
if (text.includes('faculty') && text.includes('see list')) {
|
||||||
|
return href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
# 策略2: 查找任何包含 /people 或 /faculty 的链接
|
||||||
|
if not faculty_link:
|
||||||
|
faculty_link = await page.evaluate('''() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const link of links) {
|
||||||
|
const text = link.innerText.toLowerCase();
|
||||||
|
const href = link.href.toLowerCase();
|
||||||
|
// 查找Faculty相关链接
|
||||||
|
if ((text.includes('faculty') || text.includes('people')) &&
|
||||||
|
(href.includes('/people') || href.includes('/faculty'))) {
|
||||||
|
return link.href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
# 策略3: 从页面中查找院系网站链接,然后尝试访问其People页面
|
||||||
|
if not faculty_link:
|
||||||
|
dept_website = await page.evaluate('''() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const link of links) {
|
||||||
|
const text = link.innerText.toLowerCase();
|
||||||
|
const href = link.href;
|
||||||
|
// 查找 Website 链接 (通常指向院系主页)
|
||||||
|
if (text.includes('website') && href.includes('harvard.edu') &&
|
||||||
|
!href.includes('gsas.harvard.edu')) {
|
||||||
|
return href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
if dept_website:
|
||||||
|
print(f" 找到院系网站: {dept_website}")
|
||||||
|
try:
|
||||||
|
await page.goto(dept_website, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
# 在院系网站上查找People/Faculty链接
|
||||||
|
faculty_link = await page.evaluate('''() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const link of links) {
|
||||||
|
const text = link.innerText.toLowerCase().trim();
|
||||||
|
const href = link.href;
|
||||||
|
if ((text === 'people' || text === 'faculty' ||
|
||||||
|
text === 'faculty & research' || text.includes('our faculty')) &&
|
||||||
|
(href.includes('/people') || href.includes('/faculty'))) {
|
||||||
|
return href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}''')
|
||||||
|
except Exception as e:
|
||||||
|
print(f" 访问院系网站失败: {e}")
|
||||||
|
|
||||||
|
if faculty_link:
|
||||||
|
faculty_page_url = faculty_link
|
||||||
|
print(f" 找到Faculty页面: {faculty_link}")
|
||||||
|
|
||||||
|
# 访问Faculty/People页面
|
||||||
|
await page.goto(faculty_link, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
# 提取所有导师信息
|
||||||
|
faculty_list = await extract_faculty_from_page(page)
|
||||||
|
|
||||||
|
# 如果第一页没找到,尝试处理分页或其他布局
|
||||||
|
if len(faculty_list) == 0:
|
||||||
|
# 可能需要点击某些按钮或处理JavaScript加载
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
faculty_list = await extract_faculty_from_page(page)
|
||||||
|
|
||||||
|
print(f" 找到 {len(faculty_list)} 位导师")
|
||||||
|
else:
|
||||||
|
print(f" 未找到Faculty页面链接")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f" 获取Faculty信息失败: {e}")
|
||||||
|
|
||||||
|
return faculty_list, faculty_page_url
|
||||||
|
|
||||||
|
|
||||||
|
async def scrape_harvard_programs_with_faculty():
|
||||||
|
"""爬取Harvard研究生项目列表及导师信息"""
|
||||||
|
|
||||||
|
all_programs = []
|
||||||
|
base_url = "https://www.harvard.edu/programs/?degree_levels=graduate"
|
||||||
|
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser = await p.chromium.launch(headless=True)
|
||||||
|
context = await browser.new_context(
|
||||||
|
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
|
||||||
|
viewport={'width': 1920, 'height': 1080}
|
||||||
|
)
|
||||||
|
page = await context.new_page()
|
||||||
|
|
||||||
|
print(f"正在访问: {base_url}")
|
||||||
|
await page.goto(base_url, wait_until="domcontentloaded", timeout=60000)
|
||||||
|
await page.wait_for_timeout(5000)
|
||||||
|
|
||||||
|
# 滚动到页面底部
|
||||||
|
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
current_page = 1
|
||||||
|
max_pages = 15
|
||||||
|
|
||||||
|
# 第一阶段:收集所有项目基本信息
|
||||||
|
print("\n========== 第一阶段:收集项目列表 ==========")
|
||||||
|
while current_page <= max_pages:
|
||||||
|
print(f"\n--- 第 {current_page} 页 ---")
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
# 提取当前页面的项目
|
||||||
|
page_data = await page.evaluate('''() => {
|
||||||
|
const programs = [];
|
||||||
|
const programItems = document.querySelectorAll('[class*="records__record"], [class*="c-programs-item"]');
|
||||||
|
|
||||||
|
programItems.forEach((item, index) => {
|
||||||
|
const nameBtn = item.querySelector('button[class*="title-link"], button[class*="c-programs-item"]');
|
||||||
|
if (!nameBtn) return;
|
||||||
|
|
||||||
|
const name = nameBtn.innerText.trim();
|
||||||
|
if (!name || name.length < 3) return;
|
||||||
|
|
||||||
|
let degrees = '';
|
||||||
|
const allText = item.innerText;
|
||||||
|
const degreeMatch = allText.match(/(A\\.B\\.|Ph\\.D\\.|M\\.A\\.|S\\.M\\.|M\\.Arch\\.|LL\\.M\\.|S\\.B\\.|A\\.L\\.B\\.|A\\.L\\.M\\.|M\\.M\\.Sc\\.|Ed\\.D\\.|Ed\\.M\\.|M\\.P\\.A\\.|M\\.P\\.P\\.|M\\.P\\.H\\.|J\\.D\\.|M\\.B\\.A\\.|M\\.D\\.|D\\.M\\.D\\.|Th\\.D\\.|M\\.Div\\.|M\\.T\\.S\\.|M\\.E\\.|D\\.M\\.Sc\\.|M\\.H\\.C\\.M\\.|M\\.L\\.A\\.|M\\.D\\.E\\.|M\\.R\\.E\\.|M\\.A\\.U\\.D\\.|M\\.R\\.P\\.L\\.)/g);
|
||||||
|
if (degreeMatch) {
|
||||||
|
degrees = degreeMatch.join(', ');
|
||||||
|
}
|
||||||
|
|
||||||
|
programs.push({
|
||||||
|
name: name,
|
||||||
|
degrees: degrees
|
||||||
|
});
|
||||||
|
});
|
||||||
|
|
||||||
|
if (programs.length === 0) {
|
||||||
|
const buttons = document.querySelectorAll('button');
|
||||||
|
buttons.forEach((btn) => {
|
||||||
|
const className = btn.className || '';
|
||||||
|
if (className.includes('c-programs-item') || className.includes('title-link')) {
|
||||||
|
const name = btn.innerText.trim();
|
||||||
|
if (name && name.length > 3 && !name.match(/^(Page|Next|Previous|Search|Menu|Filter)/)) {
|
||||||
|
programs.push({
|
||||||
|
name: name,
|
||||||
|
degrees: ''
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
return programs;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f" 本页找到 {len(page_data)} 个项目")
|
||||||
|
|
||||||
|
for prog in page_data:
|
||||||
|
name = prog['name'].strip()
|
||||||
|
if name and not any(p['name'] == name for p in all_programs):
|
||||||
|
all_programs.append({
|
||||||
|
'name': name,
|
||||||
|
'degrees': prog.get('degrees', ''),
|
||||||
|
'page': current_page
|
||||||
|
})
|
||||||
|
|
||||||
|
# 尝试点击下一页
|
||||||
|
try:
|
||||||
|
next_btn = page.locator('button.c-pagination__link--next')
|
||||||
|
if await next_btn.count() > 0:
|
||||||
|
await next_btn.first.scroll_into_view_if_needed()
|
||||||
|
await next_btn.first.click()
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
current_page += 1
|
||||||
|
else:
|
||||||
|
print("没有下一页按钮,结束收集")
|
||||||
|
break
|
||||||
|
except Exception as e:
|
||||||
|
print(f"分页失败: {e}")
|
||||||
|
break
|
||||||
|
|
||||||
|
print(f"\n共收集到 {len(all_programs)} 个项目")
|
||||||
|
|
||||||
|
# 第二阶段:为每个项目获取导师信息
|
||||||
|
print("\n========== 第二阶段:获取导师信息 ==========")
|
||||||
|
print("注意:这将访问每个项目的GSAS页面,可能需要较长时间...")
|
||||||
|
|
||||||
|
for i, prog in enumerate(all_programs, 1):
|
||||||
|
print(f"\n[{i}/{len(all_programs)}] {prog['name']}")
|
||||||
|
|
||||||
|
# 生成项目URL
|
||||||
|
slug = name_to_slug(prog['name'])
|
||||||
|
prog['url'] = f"https://www.harvard.edu/programs/{slug}/"
|
||||||
|
|
||||||
|
# 生成GSAS URL
|
||||||
|
gsas_url = f"https://gsas.harvard.edu/program/{slug}"
|
||||||
|
|
||||||
|
# 获取导师信息
|
||||||
|
faculty_list, faculty_page_url = await get_faculty_from_gsas_page(page, gsas_url, prog['name'])
|
||||||
|
|
||||||
|
prog['faculty_page_url'] = faculty_page_url or ""
|
||||||
|
prog['faculty'] = faculty_list
|
||||||
|
prog['faculty_count'] = len(faculty_list)
|
||||||
|
|
||||||
|
# 每10个项目保存一次进度
|
||||||
|
if i % 10 == 0:
|
||||||
|
temp_result = {
|
||||||
|
'source_url': base_url,
|
||||||
|
'scraped_at': datetime.now(timezone.utc).isoformat(),
|
||||||
|
'progress': f"{i}/{len(all_programs)}",
|
||||||
|
'programs': all_programs[:i]
|
||||||
|
}
|
||||||
|
with open('harvard_programs_progress.json', 'w', encoding='utf-8') as f:
|
||||||
|
json.dump(temp_result, f, ensure_ascii=False, indent=2)
|
||||||
|
print(f" [进度已保存]")
|
||||||
|
|
||||||
|
# 避免请求过快
|
||||||
|
await page.wait_for_timeout(1500)
|
||||||
|
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
# 排序
|
||||||
|
programs = sorted(all_programs, key=lambda x: x['name'])
|
||||||
|
|
||||||
|
# 统计
|
||||||
|
total_faculty = sum(p['faculty_count'] for p in programs)
|
||||||
|
programs_with_faculty = sum(1 for p in programs if p['faculty_count'] > 0)
|
||||||
|
|
||||||
|
# 保存最终结果
|
||||||
|
result = {
|
||||||
|
'source_url': base_url,
|
||||||
|
'scraped_at': datetime.now(timezone.utc).isoformat(),
|
||||||
|
'total_pages_scraped': current_page,
|
||||||
|
'total_programs': len(programs),
|
||||||
|
'programs_with_faculty': programs_with_faculty,
|
||||||
|
'total_faculty_found': total_faculty,
|
||||||
|
'programs': programs
|
||||||
|
}
|
||||||
|
|
||||||
|
output_file = Path('harvard_programs_with_faculty.json')
|
||||||
|
with open(output_file, 'w', encoding='utf-8') as f:
|
||||||
|
json.dump(result, f, ensure_ascii=False, indent=2)
|
||||||
|
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print(f"爬取完成!")
|
||||||
|
print(f"共爬取 {current_page} 页")
|
||||||
|
print(f"共找到 {len(programs)} 个研究生项目")
|
||||||
|
print(f"其中 {programs_with_faculty} 个项目有导师信息")
|
||||||
|
print(f"共找到 {total_faculty} 位导师")
|
||||||
|
print(f"结果保存到: {output_file}")
|
||||||
|
print(f"{'='*60}")
|
||||||
|
|
||||||
|
# 打印摘要
|
||||||
|
print("\n项目摘要 (前30个):")
|
||||||
|
for i, prog in enumerate(programs[:30], 1):
|
||||||
|
faculty_info = f"({prog['faculty_count']}位导师)" if prog['faculty_count'] > 0 else "(无导师信息)"
|
||||||
|
print(f"{i:3}. {prog['name']} {faculty_info}")
|
||||||
|
|
||||||
|
if len(programs) > 30:
|
||||||
|
print(f"... 还有 {len(programs) - 30} 个项目")
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
asyncio.run(scrape_harvard_programs_with_faculty())
|
||||||
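One caveat with the second stage above: the GSAS URL is derived purely from the program-name slug, so programs hosted by professional schools may simply return 404. A lightweight pre-check along these lines could skip them before the slower Playwright visit (a sketch, not part of the artifact; `gsas_url_exists` is a hypothetical helper):

```python
import urllib.request

def gsas_url_exists(slug: str) -> bool:
    """Return True if https://gsas.harvard.edu/program/<slug> answers with a non-error status."""
    url = f"https://gsas.harvard.edu/program/{slug}"
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status < 400
    except Exception:
        # Covers HTTP errors (e.g. 404) as well as network timeouts.
        return False
```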
435
artifacts/kaust_faculty_scraper.py
Normal file
@@ -0,0 +1,435 @@
|
|||||||
|
#!/usr/bin/env python
|
||||||
|
"""
|
||||||
|
Auto-generated by the Agno codegen agent.
|
||||||
|
Target university: KAUST (https://www.kaust.edu.sa/en/)
|
||||||
|
Requested caps: depth=3, pages=30
|
||||||
|
|
||||||
|
Plan description: Playwright scraper for university master programs and faculty profiles.
|
||||||
|
Navigation strategy:
  1. Start at https://www.kaust.edu.sa/en/
  2. Navigate to /study/ to find degree program links
  3. Follow links to individual degree pages under /degree-programs/
  4. Separately, look for links to /faculty/ or /people/ directories
  5. Crawl faculty directories to extract links to individual bio pages
  6. Individual faculty are often under a subdomain like bio.kaust.edu.sa
|
||||||
|
Verification checklist:
|
||||||
|
- Verify master's programs are under /study/ or /degree-programs/
|
||||||
|
- Check that faculty directory pages contain links to individual bios
|
||||||
|
- Confirm individual faculty pages have research/expertise details
|
||||||
|
- Ensure exclusion keywords successfully skip irrelevant pages
|
||||||
|
Playwright snapshot used to guide this plan:
|
||||||
|
No browser snapshot was captured.
|
||||||
|
|
||||||
|
Generated at: 2025-12-10T02:48:42.571899+00:00
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import time
|
||||||
|
from collections import deque
|
||||||
|
from dataclasses import asdict, dataclass, field
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Deque, Iterable, List, Set, Tuple
|
||||||
|
from urllib.parse import urljoin, urldefrag, urlparse
|
||||||
|
|
||||||
|
from playwright.async_api import async_playwright, Page, Response
|
||||||
|
|
||||||
|
PROGRAM_KEYWORDS = ['/study/', '/degree-programs/', '/academics/', 'M.Sc.', 'Master of Science', 'graduate program']
|
||||||
|
FACULTY_KEYWORDS = ['/people/', '/profiles/faculty/', 'Professor', 'faculty-member', '/faculty/firstname-lastname', 'bio.kaust.edu.sa']
|
||||||
|
EXCLUSION_KEYWORDS = ['/admissions/', '/apply/', '/tuition/', '/events/', '/news/', '/careers/', '/jobs/', '/login/', '/alumni/', '/giving/', 'inquiry.kaust.edu.sa']
|
||||||
|
METADATA_FIELDS = ['url', 'title', 'entity_type', 'department', 'email', 'scraped_at']
|
||||||
|
EXTRA_NOTES = ['Many faculty are listed under a separate subdomain bio.kaust.edu.sa', 'Prioritize crawling the centralized faculty directory first', 'Alumni and affiliated faculty may not have full profile pages']
|
||||||
|
|
||||||
|
# URL patterns that indicate individual profile pages
|
||||||
|
PROFILE_URL_PATTERNS = [
|
||||||
|
"/people/", "/person/", "/profile/", "/profiles/",
|
||||||
|
"/faculty/", "/staff/", "/directory/",
|
||||||
|
"/~", # Unix-style personal pages
|
||||||
|
"/bio/", "/about/",
|
||||||
|
]
|
||||||
|
|
||||||
|
# URL patterns that indicate listing/directory pages (should be crawled deeper)
|
||||||
|
DIRECTORY_URL_PATTERNS = [
|
||||||
|
"/faculty", "/people", "/directory", "/staff",
|
||||||
|
"/team", "/members", "/researchers",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def normalize_url(base: str, href: str) -> str:
|
||||||
|
"""Normalize URL by resolving relative paths and removing fragments."""
|
||||||
|
absolute = urljoin(base, href)
|
||||||
|
cleaned, _ = urldefrag(absolute)
|
||||||
|
# Remove trailing slash for consistency
|
||||||
|
return cleaned.rstrip("/")
|
||||||
|
|
||||||
|
|
||||||
|
def matches_any(text: str, keywords: Iterable[str]) -> bool:
|
||||||
|
"""Check if text contains any of the keywords (case-insensitive)."""
|
||||||
|
lowered = text.lower()
|
||||||
|
return any(keyword.lower() in lowered for keyword in keywords)
|
||||||
|
|
||||||
|
|
||||||
|
def is_same_domain(url1: str, url2: str) -> bool:
|
||||||
|
"""Check if two URLs belong to the same root domain."""
|
||||||
|
domain1 = urlparse(url1).netloc.replace("www.", "")
|
||||||
|
domain2 = urlparse(url2).netloc.replace("www.", "")
|
||||||
|
# Allow subdomains of the same root domain
|
||||||
|
parts1 = domain1.split(".")
|
||||||
|
parts2 = domain2.split(".")
|
||||||
|
if len(parts1) >= 2 and len(parts2) >= 2:
|
||||||
|
return parts1[-2:] == parts2[-2:]
|
||||||
|
return domain1 == domain2
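# Example: is_same_domain("https://www.kaust.edu.sa/en/", "https://bio.kaust.edu.sa/x")
# returns True, because only the last two labels ("edu", "sa") are compared once "www."
# is stripped. That keeps the bio.kaust.edu.sa subdomain in scope, but the check is loose:
# any other *.edu.sa host would also pass.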
|
||||||
|
|
||||||
|
|
||||||
|
def is_profile_url(url: str) -> bool:
|
||||||
|
"""Check if URL pattern suggests an individual profile page."""
|
||||||
|
url_lower = url.lower()
|
||||||
|
return any(pattern in url_lower for pattern in PROFILE_URL_PATTERNS)
|
||||||
|
|
||||||
|
|
||||||
|
def is_directory_url(url: str) -> bool:
|
||||||
|
"""Check if URL pattern suggests a directory/listing page."""
|
||||||
|
url_lower = url.lower()
|
||||||
|
return any(pattern in url_lower for pattern in DIRECTORY_URL_PATTERNS)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ScrapedLink:
|
||||||
|
url: str
|
||||||
|
title: str
|
||||||
|
text: str
|
||||||
|
source_url: str
|
||||||
|
bucket: str # "program" or "faculty"
|
||||||
|
is_verified: bool = False
|
||||||
|
http_status: int = 0
|
||||||
|
is_profile_page: bool = False
|
||||||
|
scraped_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ScrapeSettings:
|
||||||
|
root_url: str
|
||||||
|
max_depth: int
|
||||||
|
max_pages: int
|
||||||
|
headless: bool
|
||||||
|
output: Path
|
||||||
|
verify_links: bool = True
|
||||||
|
request_delay: float = 1.0 # Polite crawling delay
|
||||||
|
timeout: int = 60000 # Navigation timeout in ms (default 60s for slow sites)
|
||||||
|
|
||||||
|
|
||||||
|
async def extract_links(page: Page) -> List[Tuple[str, str]]:
|
||||||
|
"""Extract all anchor links from the page."""
|
||||||
|
anchors: Iterable[dict] = await page.eval_on_selector_all(
|
||||||
|
"a",
|
||||||
|
"""elements => elements
|
||||||
|
.map(el => ({text: (el.textContent || '').trim(), href: el.href}))
|
||||||
|
.filter(item => item.text && item.href && item.href.startsWith('http'))""",
|
||||||
|
)
|
||||||
|
return [(item["href"], item["text"]) for item in anchors]
|
||||||
|
|
||||||
|
|
||||||
|
async def get_page_title(page: Page) -> str:
|
||||||
|
"""Get the page title safely."""
|
||||||
|
try:
|
||||||
|
return await page.title() or ""
|
||||||
|
except Exception:
|
||||||
|
return ""
|
||||||
|
|
||||||
|
|
||||||
|
async def verify_link(context, url: str, timeout: int = 10000) -> Tuple[bool, int, str]:
|
||||||
|
"""
|
||||||
|
Verify a link by making a HEAD-like request.
|
||||||
|
Returns: (is_valid, status_code, page_title)
|
||||||
|
"""
|
||||||
|
page = await context.new_page()
|
||||||
|
try:
|
||||||
|
response: Response = await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
|
||||||
|
if response:
|
||||||
|
status = response.status
|
||||||
|
title = await get_page_title(page)
|
||||||
|
is_valid = 200 <= status < 400
|
||||||
|
return is_valid, status, title
|
||||||
|
return False, 0, ""
|
||||||
|
except Exception:
|
||||||
|
return False, 0, ""
|
||||||
|
finally:
|
||||||
|
await page.close()
|
||||||
|
|
||||||
|
|
||||||
|
async def crawl(settings: ScrapeSettings, browser_name: str) -> List[ScrapedLink]:
|
||||||
|
"""
|
||||||
|
Crawl the website using BFS, collecting program and faculty links.
|
||||||
|
Features:
|
||||||
|
- URL deduplication
|
||||||
|
- Link verification
|
||||||
|
- Profile page detection
|
||||||
|
- Polite crawling with delays
|
||||||
|
"""
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser_launcher = getattr(p, browser_name)
|
||||||
|
browser = await browser_launcher.launch(headless=settings.headless)
|
||||||
|
context = await browser.new_context(
|
||||||
|
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Priority queue: (priority, url, depth) - lower priority = processed first
|
||||||
|
# Directory pages get priority 0, others get priority 1
|
||||||
|
queue: Deque[Tuple[int, str, int]] = deque([(0, settings.root_url, 0)])
|
||||||
|
visited: Set[str] = set()
|
||||||
|
found_urls: Set[str] = set() # For deduplication of results
|
||||||
|
results: List[ScrapedLink] = []
|
||||||
|
|
||||||
|
print(f"Starting crawl from: {settings.root_url}")
|
||||||
|
print(f"Max depth: {settings.max_depth}, Max pages: {settings.max_pages}")
|
||||||
|
|
||||||
|
try:
|
||||||
|
while queue and len(visited) < settings.max_pages:
|
||||||
|
# Sort queue by priority (directory pages first)
|
||||||
|
queue = deque(sorted(queue, key=lambda x: x[0]))
|
||||||
|
priority, url, depth = queue.popleft()
|
||||||
|
|
||||||
|
normalized_url = normalize_url(settings.root_url, url)
|
||||||
|
if normalized_url in visited or depth > settings.max_depth:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Only crawl same-domain URLs
|
||||||
|
if not is_same_domain(settings.root_url, normalized_url):
|
||||||
|
continue
|
||||||
|
|
||||||
|
visited.add(normalized_url)
|
||||||
|
print(f"[{len(visited)}/{settings.max_pages}] Depth {depth}: {normalized_url[:80]}...")
|
||||||
|
|
||||||
|
page = await context.new_page()
|
||||||
|
try:
|
||||||
|
response = await page.goto(
|
||||||
|
normalized_url, wait_until="load", timeout=settings.timeout
|
||||||
|
)
|
||||||
|
if not response or response.status >= 400:
|
||||||
|
await page.close()
|
||||||
|
continue
|
||||||
|
except Exception as e:
|
||||||
|
print(f" Error: {e}")
|
||||||
|
await page.close()
|
||||||
|
continue
|
||||||
|
|
||||||
|
page_title = await get_page_title(page)
|
||||||
|
links = await extract_links(page)
|
||||||
|
|
||||||
|
for href, text in links:
|
||||||
|
normalized_href = normalize_url(normalized_url, href)
|
||||||
|
|
||||||
|
# Skip if already found or is excluded
|
||||||
|
if normalized_href in found_urls:
|
||||||
|
continue
|
||||||
|
if matches_any(text, EXCLUSION_KEYWORDS) or matches_any(normalized_href, EXCLUSION_KEYWORDS):
|
||||||
|
continue
|
||||||
|
|
||||||
|
text_lower = text.lower()
|
||||||
|
href_lower = normalized_href.lower()
|
||||||
|
is_profile = is_profile_url(normalized_href)
|
||||||
|
|
||||||
|
# Check for program links
|
||||||
|
if matches_any(text_lower, PROGRAM_KEYWORDS) or matches_any(href_lower, PROGRAM_KEYWORDS):
|
||||||
|
found_urls.add(normalized_href)
|
||||||
|
results.append(
|
||||||
|
ScrapedLink(
|
||||||
|
url=normalized_href,
|
||||||
|
title="",
|
||||||
|
text=text[:200],
|
||||||
|
source_url=normalized_url,
|
||||||
|
bucket="program",
|
||||||
|
is_profile_page=False,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
# Check for faculty links
|
||||||
|
if matches_any(text_lower, FACULTY_KEYWORDS) or matches_any(href_lower, FACULTY_KEYWORDS):
|
||||||
|
found_urls.add(normalized_href)
|
||||||
|
results.append(
|
||||||
|
ScrapedLink(
|
||||||
|
url=normalized_href,
|
||||||
|
title="",
|
||||||
|
text=text[:200],
|
||||||
|
source_url=normalized_url,
|
||||||
|
bucket="faculty",
|
||||||
|
is_profile_page=is_profile,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
# Queue for further crawling
|
||||||
|
if depth < settings.max_depth and is_same_domain(settings.root_url, normalized_href):
|
||||||
|
# Prioritize directory pages
|
||||||
|
link_priority = 0 if is_directory_url(normalized_href) else 1
|
||||||
|
queue.append((link_priority, normalized_href, depth + 1))
|
||||||
|
|
||||||
|
await page.close()
|
||||||
|
|
||||||
|
# Polite delay between requests
|
||||||
|
await asyncio.sleep(settings.request_delay)
|
||||||
|
|
||||||
|
finally:
|
||||||
|
await context.close()
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
# Verify links if enabled
|
||||||
|
if settings.verify_links and results:
|
||||||
|
print(f"\nVerifying {len(results)} links...")
|
||||||
|
browser = await browser_launcher.launch(headless=True)
|
||||||
|
context = await browser.new_context()
|
||||||
|
|
||||||
|
verified_results = []
|
||||||
|
for i, link in enumerate(results):
|
||||||
|
if link.url in [r.url for r in verified_results]:
|
||||||
|
continue # Skip duplicates
|
||||||
|
|
||||||
|
print(f" [{i+1}/{len(results)}] Verifying: {link.url[:60]}...")
|
||||||
|
is_valid, status, title = await verify_link(context, link.url)
|
||||||
|
link.is_verified = True
|
||||||
|
link.http_status = status
|
||||||
|
link.title = title or link.text
|
||||||
|
|
||||||
|
if is_valid:
|
||||||
|
verified_results.append(link)
|
||||||
|
else:
|
||||||
|
print(f" Invalid (HTTP {status})")
|
||||||
|
|
||||||
|
await asyncio.sleep(0.5) # Delay between verifications
|
||||||
|
|
||||||
|
await context.close()
|
||||||
|
await browser.close()
|
||||||
|
results = verified_results
|
||||||
|
|
||||||
|
return results
|
||||||
|
|
||||||
|
|
||||||
|
def deduplicate_results(results: List[ScrapedLink]) -> List[ScrapedLink]:
|
||||||
|
"""Remove duplicate URLs, keeping the first occurrence."""
|
||||||
|
seen: Set[str] = set()
|
||||||
|
unique = []
|
||||||
|
for link in results:
|
||||||
|
if link.url not in seen:
|
||||||
|
seen.add(link.url)
|
||||||
|
unique.append(link)
|
||||||
|
return unique
|
||||||
|
|
||||||
|
|
||||||
|
def serialize(results: List[ScrapedLink], target: Path, root_url: str) -> None:
|
||||||
|
"""Save results to JSON file with statistics."""
|
||||||
|
results = deduplicate_results(results)
|
||||||
|
|
||||||
|
program_links = [link for link in results if link.bucket == "program"]
|
||||||
|
faculty_links = [link for link in results if link.bucket == "faculty"]
|
||||||
|
profile_pages = [link for link in faculty_links if link.is_profile_page]
|
||||||
|
|
||||||
|
payload = {
|
||||||
|
"root_url": root_url,
|
||||||
|
"generated_at": datetime.now(timezone.utc).isoformat(),
|
||||||
|
"statistics": {
|
||||||
|
"total_links": len(results),
|
||||||
|
"program_links": len(program_links),
|
||||||
|
"faculty_links": len(faculty_links),
|
||||||
|
"profile_pages": len(profile_pages),
|
||||||
|
"verified_links": len([r for r in results if r.is_verified and r.http_status == 200]),
|
||||||
|
},
|
||||||
|
"program_links": [asdict(link) for link in program_links],
|
||||||
|
"faculty_links": [asdict(link) for link in faculty_links],
|
||||||
|
"notes": EXTRA_NOTES,
|
||||||
|
"metadata_fields": METADATA_FIELDS,
|
||||||
|
}
|
||||||
|
target.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
target.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
|
||||||
|
|
||||||
|
print(f"\nResults saved to: {target}")
|
||||||
|
print(f" Total links: {len(results)}")
|
||||||
|
print(f" Program links: {len(program_links)}")
|
||||||
|
print(f" Faculty links: {len(faculty_links)}")
|
||||||
|
print(f" Profile pages: {len(profile_pages)}")
|
||||||
|
|
||||||
|
|
||||||
|
def parse_args() -> argparse.Namespace:
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
description="Playwright scraper generated by the Agno agent for https://www.kaust.edu.sa/en/."
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--root-url",
|
||||||
|
default="https://www.kaust.edu.sa/en/",
|
||||||
|
help="Seed url to start crawling from.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--max-depth",
|
||||||
|
type=int,
|
||||||
|
default=3,
|
||||||
|
help="Maximum crawl depth.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--max-pages",
|
||||||
|
type=int,
|
||||||
|
default=30,
|
||||||
|
help="Maximum number of pages to visit.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--output",
|
||||||
|
type=Path,
|
||||||
|
default=Path("university-scraper_results.json"),
|
||||||
|
help="Where to save the JSON output.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--headless",
|
||||||
|
action="store_true",
|
||||||
|
default=True,
|
||||||
|
help="Run browser in headless mode (default: True).",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--no-headless",
|
||||||
|
action="store_false",
|
||||||
|
dest="headless",
|
||||||
|
help="Run browser with visible window.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--browser",
|
||||||
|
choices=["chromium", "firefox", "webkit"],
|
||||||
|
default="firefox",
|
||||||
|
help="Browser engine to launch via Playwright (firefox recommended for KAUST).",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--no-verify",
|
||||||
|
action="store_true",
|
||||||
|
default=False,
|
||||||
|
help="Skip link verification step.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--delay",
|
||||||
|
type=float,
|
||||||
|
default=1.0,
|
||||||
|
help="Delay between requests in seconds (polite crawling).",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--timeout",
|
||||||
|
type=int,
|
||||||
|
default=60000,
|
||||||
|
help="Navigation timeout in milliseconds (default: 60000 = 60s).",
|
||||||
|
)
|
||||||
|
return parser.parse_args()
|
||||||
|
|
||||||
|
|
||||||
|
async def main_async() -> None:
|
||||||
|
args = parse_args()
|
||||||
|
settings = ScrapeSettings(
|
||||||
|
root_url=args.root_url,
|
||||||
|
max_depth=args.max_depth,
|
||||||
|
max_pages=args.max_pages,
|
||||||
|
headless=args.headless,
|
||||||
|
output=args.output,
|
||||||
|
verify_links=not args.no_verify,
|
||||||
|
request_delay=args.delay,
|
||||||
|
timeout=args.timeout,
|
||||||
|
)
|
||||||
|
links = await crawl(settings, browser_name=args.browser)
|
||||||
|
serialize(links, settings.output, settings.root_url)
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> None:
|
||||||
|
asyncio.run(main_async())
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
910
artifacts/manchester_complete_scraper.py
Normal file
@@ -0,0 +1,910 @@
|
|||||||
|
"""
|
||||||
|
University of Manchester complete scraping script
|
||||||
|
New features:
|
||||||
|
- Prefer the Research Explorer API (JSON / XML); fall back to DOM scraping on failure
|
||||||
|
- Each school has its own page and is scraped in parallel (default: 3 concurrent workers)
|
||||||
|
- Fine-grained timeout / retry / scroll / "Load more" controls
|
||||||
|
- Multiple URL / fallback staff-page configuration
|
||||||
|
- Supervisor directory cache that can be mapped to programs via school keywords
|
||||||
|
- Diagnostics recording (failed schools, timed-out schools, batch info)
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
from copy import deepcopy
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from typing import Any, Dict, List, Optional, Tuple
|
||||||
|
from urllib.parse import urlencode, urljoin
|
||||||
|
from xml.etree import ElementTree as ET
|
||||||
|
|
||||||
|
from playwright.async_api import (
|
||||||
|
TimeoutError as PlaywrightTimeoutError,
|
||||||
|
async_playwright,
|
||||||
|
)
|
||||||
|
|
||||||
|
# =========================
|
||||||
|
# Configuration section
|
||||||
|
# =========================
|
||||||
|
|
||||||
|
DEFAULT_REQUEST = {
|
||||||
|
"timeout_ms": 60000,
|
||||||
|
"post_wait_ms": 2500,
|
||||||
|
"wait_until": "domcontentloaded",
|
||||||
|
"max_retries": 3,
|
||||||
|
"retry_backoff_ms": 2000,
|
||||||
|
}
|
||||||
|
|
||||||
|
STAFF_CONCURRENCY = 3
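# Shape of each SCHOOL_CONFIG entry below (meanings inferred from the docstring above
# and the field names; the consuming code sits further down in this file):
#   name                       - school/department label used in the output
#   keywords                   - program-title keywords used to map supervisors to programs
#   attach_faculty_to_programs - attach the cached staff list to matching programs
#   extract_method / research_explorer - pull staff via the Research Explorer API
#                                        (page_size controls records per request)
#   staff_pages                - one or more staff directory URLs, each with its own
#                                extract_method ("table", "links", "research_explorer"),
#                                optional requires_scroll / load_more_selector /
#                                max_load_more, and per-page "request" overrides of
#                                DEFAULT_REQUEST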
|
||||||
|
|
||||||
|
SCHOOL_CONFIG: List[Dict[str, Any]] = [
    {
        "name": "Alliance Manchester Business School",
        "keywords": [
            "accounting",
            "finance",
            "business",
            "management",
            "marketing",
            "mba",
            "economics",
            "entrepreneurship",
        ],
        "attach_faculty_to_programs": True,
        "staff_pages": [
            {
                "url": "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/",
                "extract_method": "table",
                "request": {"timeout_ms": 60000, "wait_until": "networkidle"},
            }
        ],
    },
    {
        "name": "Department of Computer Science",
        "keywords": [
            "computer",
            "software",
            "data science",
            "artificial intelligence",
            "ai ",
            "machine learning",
            "cyber",
            "computing",
        ],
        "attach_faculty_to_programs": True,
        "staff_pages": [
            {
                "url": "https://www.cs.manchester.ac.uk/about/people/academic-and-research-staff/",
                "extract_method": "links",
                "requires_scroll": True,
            },
            {
                "url": "https://www.cs.manchester.ac.uk/about/people/",
                "extract_method": "links",
                "load_more_selector": "button.load-more",
                "max_load_more": 6,
            },
        ],
    },
    {
        "name": "Department of Physics and Astronomy",
        "keywords": [
            "physics",
            "astronomy",
            "astrophysics",
            "nuclear",
            "particle",
        ],
        "attach_faculty_to_programs": True,
        "staff_pages": [
            {
                "url": "https://www.physics.manchester.ac.uk/about/people/academic-and-research-staff/",
                "extract_method": "links",
                "requires_scroll": True,
            }
        ],
    },
    {
        "name": "Department of Electrical and Electronic Engineering",
        "keywords": [
            "electrical",
            "electronic",
            "eee",
            "power systems",
            "microelectronics",
        ],
        "attach_faculty_to_programs": True,
        "staff_pages": [
            {
                "url": "https://www.eee.manchester.ac.uk/about/people/academic-and-research-staff/",
                "extract_method": "links",
                "requires_scroll": True,
            }
        ],
    },
    {
        "name": "Department of Chemistry",
        "keywords": ["chemistry", "chemical"],
        "attach_faculty_to_programs": True,
        "extract_method": "research_explorer",
        "research_explorer": {"page_size": 200},
        "staff_pages": [
            {
                "url": "https://research.manchester.ac.uk/en/organisations/department-of-chemistry/persons/",
                "extract_method": "research_explorer",
                "requires_scroll": True,
                "request": {
                    "timeout_ms": 120000,
                    "wait_until": "networkidle",
                    "post_wait_ms": 5000,
                },
            }
        ],
    },
    {
        "name": "Department of Mathematics",
        "keywords": [
            "mathematics",
            "mathematical",
            "applied math",
            "statistics",
            "actuarial",
        ],
        "attach_faculty_to_programs": True,
        "extract_method": "research_explorer",
        "research_explorer": {"page_size": 200},
        "staff_pages": [
            {
                "url": "https://research.manchester.ac.uk/en/organisations/department-of-mathematics/persons/",
                "extract_method": "research_explorer",
                "requires_scroll": True,
            }
        ],
    },
    {
        "name": "School of Engineering",
        "keywords": [
            "engineering",
            "mechanical",
            "aerospace",
            "civil",
            "structural",
            "materials",
        ],
        "attach_faculty_to_programs": True,
        "extract_method": "research_explorer",
        "research_explorer": {"page_size": 400},
        "staff_pages": [
            {
                "url": "https://research.manchester.ac.uk/en/organisations/school-of-engineering/persons/",
                "extract_method": "research_explorer",
                "requires_scroll": True,
            }
        ],
    },
    {
        "name": "Faculty of Biology, Medicine and Health",
        "keywords": [
            "medicine",
            "medical",
            "health",
            "nursing",
            "pharmacy",
            "clinical",
            "dental",
            "optometry",
            "biology",
            "biomedical",
            "anatomical",
            "physiotherapy",
            "midwifery",
            "mental health",
            "psychology",
        ],
        "attach_faculty_to_programs": True,
        "extract_method": "research_explorer",
        "research_explorer": {"page_size": 400},
        "staff_pages": [
            {
                "url": "https://research.manchester.ac.uk/en/organisations/faculty-of-biology-medicine-and-health/persons/",
                "extract_method": "research_explorer",
                "requires_scroll": True,
            }
        ],
    },
    {
        "name": "School of Social Sciences",
        "keywords": [
            "sociology",
            "politics",
            "international",
            "social",
            "criminology",
            "anthropology",
            "philosophy",
        ],
        "attach_faculty_to_programs": True,
        "extract_method": "research_explorer",
        "research_explorer": {"page_size": 200},
        "staff_pages": [
            {
                "url": "https://research.manchester.ac.uk/en/organisations/school-of-social-sciences/persons/",
                "extract_method": "research_explorer",
                "requires_scroll": True,
            }
        ],
    },
    {
        "name": "School of Law",
        "keywords": ["law", "legal", "llm"],
        "attach_faculty_to_programs": True,
        "extract_method": "research_explorer",
        "research_explorer": {"page_size": 200},
        "staff_pages": [
            {
                "url": "https://research.manchester.ac.uk/en/organisations/school-of-law/persons/",
                "extract_method": "research_explorer",
                "requires_scroll": True,
            }
        ],
    },
    {
        "name": "School of Arts, Languages and Cultures",
        "keywords": [
            "arts",
            "languages",
            "culture",
            "music",
            "drama",
            "theatre",
            "history",
            "linguistics",
            "literature",
            "translation",
            "classics",
            "archaeology",
            "religion",
        ],
        "attach_faculty_to_programs": True,
        "extract_method": "research_explorer",
        "research_explorer": {"page_size": 400},
        "staff_pages": [
            {
                "url": "https://research.manchester.ac.uk/en/organisations/school-of-arts-languages-and-cultures/persons/",
                "extract_method": "research_explorer",
                "requires_scroll": True,
            }
        ],
    },
    {
        "name": "School of Environment, Education and Development",
        "keywords": [
            "environment",
            "education",
            "development",
            "planning",
            "architecture",
            "urban",
            "geography",
            "sustainability",
        ],
        "attach_faculty_to_programs": True,
        "extract_method": "research_explorer",
        "research_explorer": {"page_size": 300},
        "staff_pages": [
            {
                "url": "https://research.manchester.ac.uk/en/organisations/school-of-environment-education-and-development/persons/",
                "extract_method": "research_explorer",
                "requires_scroll": True,
            }
        ],
    },
]

SCHOOL_LOOKUP = {cfg["name"]: cfg for cfg in SCHOOL_CONFIG}

# =========================
# JS extraction functions
# =========================

JS_EXTRACT_TABLE_STAFF = """() => {
    const staff = [];
    const seen = new Set();

    document.querySelectorAll('table tr').forEach(row => {
        const cells = row.querySelectorAll('td');
        if (cells.length >= 2) {
            const link = cells[1]?.querySelector('a[href]') || cells[0]?.querySelector('a[href]');
            const titleCell = cells[2] || cells[1];

            if (link) {
                const name = link.innerText.trim();
                const url = link.href;
                const title = titleCell ? titleCell.innerText.trim() : '';

                if (name.length > 2 && !name.toLowerCase().includes('skip') && !seen.has(url)) {
                    seen.add(url);
                    staff.push({
                        name,
                        url,
                        title
                    });
                }
            }
        }
    });

    return staff;
}"""

JS_EXTRACT_LINK_STAFF = """() => {
    const staff = [];
    const seen = new Set();

    document.querySelectorAll('a[href]').forEach(a => {
        const href = a.href;
        const text = a.innerText.trim();

        if (seen.has(href)) return;
        if (text.length < 5 || text.length > 80) return;

        const lowerText = text.toLowerCase();
        if (lowerText.includes('skip') ||
            lowerText.includes('staff') ||
            lowerText.includes('people') ||
            lowerText.includes('academic') ||
            lowerText.includes('research profiles')) return;

        if (href.includes('/persons/') ||
            href.includes('/portal/en/researchers/') ||
            href.includes('/profile/') ||
            href.includes('/people/')) {
            seen.add(href);
            staff.push({
                name: text,
                url: href,
                title: ''
            });
        }
    });

    return staff;
}"""

JS_EXTRACT_RESEARCH_EXPLORER = """() => {
    const staff = [];
    const seen = new Set();

    document.querySelectorAll('a.link.person').forEach(a => {
        const href = a.href;
        const text = a.innerText.trim();

        if (!seen.has(href) && text.length > 3 && text.length < 80) {
            seen.add(href);
            staff.push({
                name: text,
                url: href,
                title: ''
            });
        }
    });

    if (staff.length === 0) {
        document.querySelectorAll('a[href*="/persons/"]').forEach(a => {
            const href = a.href;
            const text = a.innerText.trim();
            const lower = text.toLowerCase();

            if (seen.has(href)) return;
            if (text.length < 3 || text.length > 80) return;
            if (lower.includes('person') || lower.includes('next') || lower.includes('previous')) return;

            seen.add(href);
            staff.push({
                name: text,
                url: href,
                title: ''
            });
        });
    }

    return staff;
}"""

JS_EXTRACT_PROGRAMS = """() => {
    const programs = [];
    const seen = new Set();

    document.querySelectorAll('a[href]').forEach(a => {
        const href = a.href;
        const text = a.innerText.trim().replace(/\\s+/g, ' ');

        if (!href || seen.has(href)) return;
        if (text.length < 10 || text.length > 200) return;

        const hrefLower = href.toLowerCase();
        const textLower = text.toLowerCase();

        const isNav = textLower === 'courses' ||
            textLower === 'masters' ||
            textLower.includes('admission') ||
            textLower.includes('fees') ||
            textLower.includes('skip to') ||
            textLower.includes('search') ||
            textLower.includes('contact') ||
            hrefLower.includes('#');
        if (isNav) return;

        const hasNumericId = /\\/\\d{5}\\//.test(href);
        const isCoursePage = hrefLower.includes('/courses/list/') && hasNumericId;

        if (isCoursePage) {
            seen.add(href);
            programs.push({
                name: text,
                url: href
            });
        }
    });

    return programs;
}"""

# =========================
# Data matching
# =========================

def match_program_to_school(program_name: str) -> str:
    lower = program_name.lower()
    for school in SCHOOL_CONFIG:
        for keyword in school["keywords"]:
            if keyword in lower:
                return school["name"]
    return "Other Programs"

# =========================
# Request and parsing helpers
# =========================

def _merge_request_settings(*layers: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    settings = dict(DEFAULT_REQUEST)
    for layer in layers:
        if not layer:
            continue
        for key, value in layer.items():
            if value is not None:
                settings[key] = value
    settings["max_retries"] = max(1, int(settings.get("max_retries", 1)))
    settings["retry_backoff_ms"] = settings.get("retry_backoff_ms", 2000)
    return settings


async def _goto_with_retry(page, url: str, settings: Dict[str, Any], label: str) -> Tuple[bool, Optional[str]]:
    last_error = None
    for attempt in range(settings["max_retries"]):
        try:
            await page.goto(url, wait_until=settings["wait_until"], timeout=settings["timeout_ms"])
            if settings.get("wait_for_selector"):
                await page.wait_for_selector(settings["wait_for_selector"], timeout=settings["timeout_ms"])
            if settings.get("post_wait_ms"):
                await page.wait_for_timeout(settings["post_wait_ms"])
            return True, None
        except PlaywrightTimeoutError as exc:
            last_error = f"Timeout: {exc}"
        except Exception as exc:  # noqa: BLE001
            last_error = str(exc)

        if attempt < settings["max_retries"] - 1:
            await page.wait_for_timeout(settings["retry_backoff_ms"] * (attempt + 1))

    return False, last_error


async def _perform_scroll(page, repetitions: int = 5, delay_ms: int = 800):
    repetitions = max(1, repetitions)
    for i in range(repetitions):
        await page.evaluate("(y) => window.scrollTo(0, y)", 2000 * (i + 1))
        await page.wait_for_timeout(delay_ms)


async def _load_more(page, selector: str, max_clicks: int = 5, wait_ms: int = 1500):
    for _ in range(max_clicks):
        button = await page.query_selector(selector)
        if not button:
            break
        try:
            await button.click()
            await page.wait_for_timeout(wait_ms)
        except Exception:
            break


def _deduplicate_staff(staff: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    seen = set()
    cleaned = []
    for item in staff:
        name = (item.get("name") or "").strip()
        if not name:
            continue
        url = (item.get("url") or "").strip()
        key = url or name.lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "url": url, "title": (item.get("title") or "").strip()})
    return cleaned


def _append_query(url: str, params: Dict[str, Any]) -> str:
    delimiter = "&" if "?" in url else "?"
    return f"{url}{delimiter}{urlencode(params)}"


def _guess_research_slug(staff_url: Optional[str]) -> Optional[str]:
    if not staff_url:
        return None
    path = staff_url.rstrip("/").split("/")
    return path[-1] if path else None


def _parse_research_explorer_json(data: Any, base_url: str) -> List[Dict[str, str]]:
    items: List[Dict[str, Any]] = []
    if isinstance(data, list):
        items = data
    elif isinstance(data, dict):
        for key in ("results", "items", "persons", "data", "entities"):
            if isinstance(data.get(key), list):
                items = data[key]
                break
        if not items and isinstance(data.get("rows"), list):
            items = data["rows"]

    staff = []
    for item in items:
        if not isinstance(item, dict):
            continue
        name = item.get("name") or item.get("title") or item.get("fullName")
        profile_url = item.get("url") or item.get("href") or item.get("link") or item.get("primaryURL")
        if not name:
            continue
        if profile_url:
            profile_url = urljoin(base_url, profile_url)
        staff.append(
            {
                "name": name.strip(),
                "url": (profile_url or "").strip(),
                "title": (item.get("jobTitle") or item.get("position") or "").strip(),
            }
        )
    return staff


def _parse_research_explorer_xml(text: str, base_url: str) -> List[Dict[str, str]]:
    staff: List[Dict[str, str]] = []
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return staff

    for entry in root.findall(".//{http://www.w3.org/2005/Atom}entry"):
        title = entry.findtext("{http://www.w3.org/2005/Atom}title", default="")
        link = entry.find("{http://www.w3.org/2005/Atom}link")
        href = link.attrib.get("href") if link is not None else ""
        if title:
            staff.append(
                {
                    "name": title.strip(),
                    "url": urljoin(base_url, href) if href else "",
                    "title": "",
                }
            )
    return staff


async def fetch_research_explorer_api(context, school_config: Dict[str, Any], output_callback) -> List[Dict[str, str]]:
    config = school_config.get("research_explorer") or {}
    if not config and school_config.get("extract_method") != "research_explorer":
        return []

    base_staff_url = ""
    if school_config.get("staff_pages"):
        base_staff_url = school_config["staff_pages"][0].get("url", "")

    page_size = config.get("page_size", 200)
    timeout_ms = config.get("timeout_ms", 70000)

    candidates: List[str] = []
    slug = config.get("org_slug") or _guess_research_slug(base_staff_url)
    base_api = config.get("api_base", "https://research.manchester.ac.uk/ws/portalapi.aspx")

    if config.get("api_url"):
        candidates.append(config["api_url"])

    if slug:
        params = {
            "action": "search",
            "language": "en",
            "format": "json",
            "site": "default",
            "showall": "true",
            "pageSize": page_size,
            "organisations": slug,
        }
        candidates.append(f"{base_api}?{urlencode(params)}")

    if base_staff_url:
        candidates.append(_append_query(base_staff_url, {"format": "json", "limit": page_size}))
        candidates.append(_append_query(base_staff_url, {"format": "xml", "limit": page_size}))

    for url in candidates:
        try:
            resp = await context.request.get(url, timeout=timeout_ms)
            if resp.status != 200:
                continue
            ctype = resp.headers.get("content-type", "")
            if "json" in ctype:
                data = await resp.json()
                parsed = _parse_research_explorer_json(data, base_staff_url)
            else:
                text = await resp.text()
                parsed = _parse_research_explorer_xml(text, base_staff_url)
            parsed = _deduplicate_staff(parsed)
            if parsed:
                if output_callback:
                    output_callback("info", f" {school_config['name']}: {len(parsed)} staff via API")
                return parsed
        except Exception as exc:  # noqa: BLE001
            if output_callback:
                output_callback(
                    "warning", f" {school_config['name']}: API fetch failed ({str(exc)[:60]})"
                )
    return []


async def scrape_staff_via_browser(context, school_config: Dict[str, Any], output_callback) -> List[Dict[str, str]]:
    staff_collected: List[Dict[str, str]] = []
    staff_pages = school_config.get("staff_pages") or []
    if not staff_pages and school_config.get("staff_url"):
        staff_pages = [{"url": school_config["staff_url"], "extract_method": school_config.get("extract_method")}]

    page = await context.new_page()
    blocked_types = school_config.get("blocked_resources", ["image", "font", "media"])
    if blocked_types:
        async def _route_handler(route):
            if route.request.resource_type in blocked_types:
                await route.abort()
            else:
                await route.continue_()

        await page.route("**/*", _route_handler)

    for page_cfg in staff_pages:
        target_url = page_cfg.get("url")
        if not target_url:
            continue

        settings = _merge_request_settings(school_config.get("request"), page_cfg.get("request"))
        success, error = await _goto_with_retry(page, target_url, settings, school_config["name"])
        if not success:
            if output_callback:
                output_callback("warning", f" {school_config['name']}: failed to load {target_url} ({error})")
            continue

        if page_cfg.get("requires_scroll"):
            await _perform_scroll(page, page_cfg.get("scroll_times", 6), page_cfg.get("scroll_delay_ms", 700))

        if page_cfg.get("load_from_selector"):
            await _load_more(page, page_cfg["load_from_selector"], page_cfg.get("max_load_more", 5))
        elif page_cfg.get("load_more_selector"):
            await _load_more(page, page_cfg["load_more_selector"], page_cfg.get("max_load_more", 5))

        method = page_cfg.get("extract_method") or school_config.get("extract_method") or "links"
        if method == "table":
            extracted = await page.evaluate(JS_EXTRACT_TABLE_STAFF)
        elif method == "research_explorer":
            extracted = await page.evaluate(JS_EXTRACT_RESEARCH_EXPLORER)
        else:
            extracted = await page.evaluate(JS_EXTRACT_LINK_STAFF)

        staff_collected.extend(extracted)

    await page.close()
    return _deduplicate_staff(staff_collected)

# =========================
# Concurrent per-school staff scraping
# =========================

async def scrape_school_staff(context, school_config: Dict[str, Any], semaphore, output_callback):
    async with semaphore:
        staff_list: List[Dict[str, str]] = []
        status = "success"
        error: Optional[str] = None

        try:
            if school_config.get("extract_method") == "research_explorer":
                staff_list = await fetch_research_explorer_api(context, school_config, output_callback)
            if not staff_list:
                staff_list = await scrape_staff_via_browser(context, school_config, output_callback)

            if output_callback:
                output_callback("info", f" {school_config['name']}: total {len(staff_list)} staff")

        except Exception as exc:  # noqa: BLE001
            status = "error"
            error = str(exc)
            if output_callback:
                output_callback("error", f" {school_config['name']}: {error}")

        return {
            "name": school_config["name"],
            "staff": staff_list,
            "status": status,
            "error": error,
        }


async def scrape_all_school_staff(context, output_callback):
    semaphore = asyncio.Semaphore(STAFF_CONCURRENCY)
    tasks = [
        asyncio.create_task(scrape_school_staff(context, cfg, semaphore, output_callback))
        for cfg in SCHOOL_CONFIG
    ]
    results = await asyncio.gather(*tasks)

    staff_map = {}
    diagnostics = {"failed": [], "success": [], "total": len(results)}
    for res in results:
        if res["staff"]:
            staff_map[res["name"]] = res["staff"]
            diagnostics["success"].append(res["name"])
        else:
            diagnostics["failed"].append(
                {
                    "name": res["name"],
                    "status": res["status"],
                    "error": res.get("error"),
                }
            )
    return staff_map, diagnostics

# =========================
# Main flow
# =========================

async def scrape(output_callback=None):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        )

        base_url = "https://www.manchester.ac.uk/"
        result = {
            "name": "The University of Manchester",
            "url": base_url,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
            "schools": [],
            "diagnostics": {},
        }

        try:
            # Step 1: masters program list
            if output_callback:
                output_callback("info", "Step 1: Scraping masters programs list...")

            page = await context.new_page()
            courses_url = "https://www.manchester.ac.uk/study/masters/courses/list/"
            await page.goto(courses_url, wait_until="domcontentloaded", timeout=40000)
            await page.wait_for_timeout(3000)
            programs_data = await page.evaluate(JS_EXTRACT_PROGRAMS)
            await page.close()

            if output_callback:
                output_callback("info", f"Found {len(programs_data)} masters programs")

            # Step 2: scrape school staff pages concurrently
            if output_callback:
                output_callback("info", "Step 2: Scraping faculty from staff pages (parallel)...")
            school_staff, diagnostics = await scrape_all_school_staff(context, output_callback)

            # Step 3: organise the data
            schools_dict: Dict[str, Dict[str, Any]] = {}
            for prog in programs_data:
                school_name = match_program_to_school(prog["name"])
                if school_name not in schools_dict:
                    schools_dict[school_name] = {
                        "name": school_name,
                        "url": "",
                        "programs": [],
                        "faculty": school_staff.get(school_name, []),
                        "faculty_source": "school_directory" if school_staff.get(school_name) else "",
                    }

                schools_dict[school_name]["programs"].append(
                    {
                        "name": prog["name"],
                        "url": prog["url"],
                        "faculty": [],
                    }
                )

            for cfg in SCHOOL_CONFIG:
                if cfg["name"] in schools_dict:
                    first_page = (cfg.get("staff_pages") or [{}])[0]
                    schools_dict[cfg["name"]]["url"] = first_page.get("url") or cfg.get("staff_url", "")

            _attach_faculty_to_programs(schools_dict, school_staff)

            result["schools"] = list(schools_dict.values())

            total_programs = sum(len(s["programs"]) for s in result["schools"])
            total_faculty = sum(len(s.get("faculty", [])) for s in result["schools"])

            result["diagnostics"] = {
                "total_programs": total_programs,
                "total_faculty_records": total_faculty,
                "school_staff_success": diagnostics.get("success", []),
                "school_staff_failed": diagnostics.get("failed", []),
            }

            if output_callback:
                output_callback(
                    "info",
                    f"Done! {len(result['schools'])} schools, {total_programs} programs, {total_faculty} faculty",
                )

        except Exception as exc:  # noqa: BLE001
            if output_callback:
                output_callback("error", f"Scraping error: {str(exc)}")
        finally:
            await browser.close()

        return result


def _attach_faculty_to_programs(schools_dict: Dict[str, Dict[str, Any]], staff_map: Dict[str, List[Dict[str, str]]]):
    for school_name, school_data in schools_dict.items():
        staff = staff_map.get(school_name, [])
        cfg = SCHOOL_LOOKUP.get(school_name, {})
        if not staff or not cfg.get("attach_faculty_to_programs"):
            continue

        limit = cfg.get("faculty_per_program")
        for program in school_data["programs"]:
            sliced = deepcopy(staff[:limit] if limit else staff)
            program["faculty"] = sliced

# =========================
# CLI
# =========================

if __name__ == "__main__":
    import sys

    if sys.platform == "win32":
        asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

    def print_callback(level, msg):
        print(f"[{level}] {msg}")

    scrape_result = asyncio.run(scrape(output_callback=print_callback))

    output_path = "output/manchester_complete_result.json"
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(scrape_result, f, ensure_ascii=False, indent=2)

    print("\nResult saved to", output_path)
    print("\n=== Summary ===")
    for school in sorted(scrape_result["schools"], key=lambda s: -len(s.get("faculty", []))):
        print(
            f" {school['name']}: "
            f"{len(school['programs'])} programs, "
            f"{len(school.get('faculty', []))} faculty"
        )
229
artifacts/manchester_improved_scraper.py
Normal file
@ -0,0 +1,229 @@
"""
|
||||||
|
曼彻斯特大学专用爬虫脚本
|
||||||
|
改进版 - 从学院Staff页面提取导师信息
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from urllib.parse import urljoin, urlparse
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
|
||||||
|
|
||||||
|
# 曼彻斯特大学学院Staff页面映射
|
||||||
|
# 项目关键词 -> 学院Staff页面URL
|
||||||
|
SCHOOL_STAFF_MAPPING = {
|
||||||
|
# Alliance Manchester Business School (AMBS)
|
||||||
|
"accounting": "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/",
|
||||||
|
"finance": "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/",
|
||||||
|
"business": "https://www.alliancembs.manchester.ac.uk/about/our-people/",
|
||||||
|
"management": "https://www.alliancembs.manchester.ac.uk/about/our-people/",
|
||||||
|
"marketing": "https://www.alliancembs.manchester.ac.uk/research/management-sciences-and-marketing/",
|
||||||
|
"mba": "https://www.alliancembs.manchester.ac.uk/about/our-people/",
|
||||||
|
|
||||||
|
# 其他学院可以继续添加...
|
||||||
|
# "computer": "...",
|
||||||
|
# "engineering": "...",
|
||||||
|
}
|
||||||
|
|
||||||
|
# 通用学院Staff页面列表(如果没有匹配的关键词)
|
||||||
|
GENERAL_STAFF_PAGES = [
|
||||||
|
"https://www.alliancembs.manchester.ac.uk/about/our-people/",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
async def scrape(output_callback=None):
|
||||||
|
"""执行爬取"""
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser = await p.chromium.launch(headless=True)
|
||||||
|
context = await browser.new_context(
|
||||||
|
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
|
||||||
|
)
|
||||||
|
page = await context.new_page()
|
||||||
|
|
||||||
|
base_url = "https://www.manchester.ac.uk/"
|
||||||
|
|
||||||
|
result = {
|
||||||
|
"name": "The University of Manchester",
|
||||||
|
"url": base_url,
|
||||||
|
"scraped_at": datetime.now(timezone.utc).isoformat(),
|
||||||
|
"schools": []
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
# 第一步:爬取硕士项目列表
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", "Step 1: Scraping masters programs list...")
|
||||||
|
|
||||||
|
courses_url = "https://www.manchester.ac.uk/study/masters/courses/list/"
|
||||||
|
await page.goto(courses_url, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
# 提取所有硕士项目
|
||||||
|
programs_data = await page.evaluate('''() => {
|
||||||
|
const programs = [];
|
||||||
|
const seen = new Set();
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href;
|
||||||
|
const text = a.innerText.trim().replace(/\\s+/g, ' ');
|
||||||
|
|
||||||
|
if (!href || seen.has(href)) return;
|
||||||
|
if (text.length < 10 || text.length > 200) return;
|
||||||
|
|
||||||
|
const hrefLower = href.toLowerCase();
|
||||||
|
const textLower = text.toLowerCase();
|
||||||
|
|
||||||
|
// 排除导航链接
|
||||||
|
if (textLower === 'courses' || textLower === 'masters' ||
|
||||||
|
textLower.includes('admission') || textLower.includes('fees') ||
|
||||||
|
textLower.includes('skip to') || textLower.includes('skip navigation') ||
|
||||||
|
textLower === 'home' || textLower === 'search' ||
|
||||||
|
textLower.includes('contact') || textLower.includes('footer') ||
|
||||||
|
hrefLower.endsWith('/courses/') || hrefLower.endsWith('/masters/') ||
|
||||||
|
hrefLower.includes('#')) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
// 检查是否是课程链接 - 必须包含课程ID
|
||||||
|
const hasNumericId = /\\/\\d{5}\\//.test(href); // 5位数字ID
|
||||||
|
const isCoursePage = hrefLower.includes('/courses/list/') &&
|
||||||
|
hasNumericId;
|
||||||
|
|
||||||
|
if (isCoursePage) {
|
||||||
|
seen.add(href);
|
||||||
|
programs.push({
|
||||||
|
name: text,
|
||||||
|
url: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return programs;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Found {len(programs_data)} masters programs")
|
||||||
|
|
||||||
|
# 第二步:爬取学院Staff页面的导师信息
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", "Step 2: Scraping faculty from school staff pages...")
|
||||||
|
|
||||||
|
all_faculty = {} # school_url -> faculty list
|
||||||
|
|
||||||
|
# 爬取AMBS Accounting & Finance Staff
|
||||||
|
staff_url = "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/"
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Scraping staff from: {staff_url}")
|
||||||
|
|
||||||
|
await page.goto(staff_url, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
# 从表格提取教职员工
|
||||||
|
faculty_data = await page.evaluate('''() => {
|
||||||
|
const faculty = [];
|
||||||
|
const rows = document.querySelectorAll('table tr');
|
||||||
|
|
||||||
|
rows.forEach(row => {
|
||||||
|
const cells = row.querySelectorAll('td');
|
||||||
|
if (cells.length >= 2) {
|
||||||
|
const link = cells[1]?.querySelector('a[href]');
|
||||||
|
const titleCell = cells[2];
|
||||||
|
|
||||||
|
if (link) {
|
||||||
|
const name = link.innerText.trim();
|
||||||
|
const url = link.href;
|
||||||
|
const title = titleCell ? titleCell.innerText.trim() : '';
|
||||||
|
|
||||||
|
if (name.length > 2 && !name.toLowerCase().includes('skip')) {
|
||||||
|
faculty.push({
|
||||||
|
name: name,
|
||||||
|
url: url,
|
||||||
|
title: title
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return faculty;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Found {len(faculty_data)} faculty members from AMBS")
|
||||||
|
|
||||||
|
all_faculty["AMBS - Accounting and Finance"] = faculty_data
|
||||||
|
|
||||||
|
# 第三步:组装结果
|
||||||
|
# 将项目按关键词分配到学院
|
||||||
|
schools_data = {}
|
||||||
|
|
||||||
|
for prog in programs_data:
|
||||||
|
prog_name_lower = prog['name'].lower()
|
||||||
|
|
||||||
|
# 确定所属学院
|
||||||
|
school_name = "Other Programs"
|
||||||
|
matched_faculty = []
|
||||||
|
|
||||||
|
for keyword, staff_url in SCHOOL_STAFF_MAPPING.items():
|
||||||
|
if keyword in prog_name_lower:
|
||||||
|
if "accounting" in keyword or "finance" in keyword:
|
||||||
|
school_name = "Alliance Manchester Business School"
|
||||||
|
matched_faculty = all_faculty.get("AMBS - Accounting and Finance", [])
|
||||||
|
elif "business" in keyword or "management" in keyword or "mba" in keyword:
|
||||||
|
school_name = "Alliance Manchester Business School"
|
||||||
|
matched_faculty = all_faculty.get("AMBS - Accounting and Finance", [])
|
||||||
|
break
|
||||||
|
|
||||||
|
if school_name not in schools_data:
|
||||||
|
schools_data[school_name] = {
|
||||||
|
"name": school_name,
|
||||||
|
"url": "",
|
||||||
|
"programs": [],
|
||||||
|
"faculty": matched_faculty # 学院级别的导师
|
||||||
|
}
|
||||||
|
|
||||||
|
schools_data[school_name]["programs"].append({
|
||||||
|
"name": prog['name'],
|
||||||
|
"url": prog['url'],
|
||||||
|
"faculty": [] # 项目级别暂不填充
|
||||||
|
})
|
||||||
|
|
||||||
|
result["schools"] = list(schools_data.values())
|
||||||
|
|
||||||
|
# 统计
|
||||||
|
total_programs = sum(len(s['programs']) for s in result['schools'])
|
||||||
|
total_faculty = sum(len(s.get('faculty', [])) for s in result['schools'])
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Done! {len(result['schools'])} schools, {total_programs} programs, {total_faculty} faculty")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("error", f"Scraping error: {str(e)}")
|
||||||
|
|
||||||
|
finally:
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
import sys
|
||||||
|
if sys.platform == "win32":
|
||||||
|
asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
|
||||||
|
|
||||||
|
def print_callback(level, msg):
|
||||||
|
print(f"[{level}] {msg}")
|
||||||
|
|
||||||
|
result = asyncio.run(scrape(output_callback=print_callback))
|
||||||
|
|
||||||
|
# 保存结果
|
||||||
|
with open("output/manchester_improved_result.json", "w", encoding="utf-8") as f:
|
||||||
|
json.dump(result, f, indent=2, ensure_ascii=False)
|
||||||
|
|
||||||
|
print(f"\nResult saved to output/manchester_improved_result.json")
|
||||||
|
print(f"Schools: {len(result['schools'])}")
|
||||||
|
for school in result['schools']:
|
||||||
|
print(f" - {school['name']}: {len(school['programs'])} programs, {len(school.get('faculty', []))} faculty")
|
||||||
438
artifacts/rwth_aachen_playwright_scraper.py
Normal file
@ -0,0 +1,438 @@
#!/usr/bin/env python
"""
Auto-generated by the Agno codegen agent.
Target university: RWTH Aachen (https://www.rwth-aachen.de/go/id/a/?lidx=1)
Requested caps: depth=3, pages=30

Plan description: Playwright scraper for university master programs and faculty profiles.
Navigation strategy: Start from the main university page and look for faculty/department directories. RWTH Aachen likely structures content with faculty organized by departments. Look for department pages (like 'Fakultäten'), then navigate to individual department sites, find 'Mitarbeiter' or 'Personal' sections, and extract individual faculty profile URLs. The university uses both German and English, so check for patterns in both languages. Individual faculty pages likely follow patterns like '/mitarbeiter/firstname-lastname' or similar German naming conventions.
Verification checklist:
- Verify that faculty URLs point to individual person pages, not department listings
- Check that master's program pages contain degree information and curriculum details
- Ensure scraped faculty pages include personal information like research interests, contact details, or CV
- Validate that URLs contain individual identifiers (names, personal paths) rather than generic terms
- Cross-check that German and English versions of pages are both captured when available
Playwright snapshot used to guide this plan:
1. RWTH Aachen University | Rheinisch-Westfälische Technische Hochschule | EN (https://www.rwth-aachen.de/go/id/a/?lidx=1)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE Search for Search Copyright: © Copyright: © Copyright: © Copyright: © Studying at RWTH Welc
Anchors: Skip to Content -> https://www.rwth-aachen.de/go/id/a/?lidx=1#main, Skip to Main Navigation -> https://www.rwth-aachen.de/go/id/a/?lidx=1#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/go/id/a/?lidx=1#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/go/id/a/?lidx=1#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/go/id/a/?lidx=1#searchbar, Skip to Footer -> https://www.rwth-aachen.de/go/id/a/?lidx=1#footer
2. Prospective Students | RWTH Aachen University | EN (https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE Search for Search Prospective Students Choosing A Course of Study Copyright: © Mario Irrmischer Adv
Anchors: Skip to Content -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#main, Skip to Main Navigation -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#searchbar, Skip to Footer -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#footer
3. First-Year Students | RWTH Aachen University | EN (https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE Search for Search First-Year Students Preparing for Your Studies – Recommended Subject-Specific Res
Anchors: Skip to Content -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#main, Skip to Main Navigation -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#searchbar, Skip to Footer -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#footer
4. Students | RWTH Aachen University | EN (https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE Search for Search Students Teaser Copyright: © Martin Braun Classes What lectures do you have next
Anchors: Skip to Content -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#main, Skip to Main Navigation -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#searchbar, Skip to Footer -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#footer
Snapshot truncated.
|
||||||
|
|
||||||
|
Generated at: 2025-12-09T10:27:25.950820+00:00
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import time
|
||||||
|
from collections import deque
|
||||||
|
from dataclasses import asdict, dataclass, field
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Deque, Iterable, List, Set, Tuple
|
||||||
|
from urllib.parse import urljoin, urldefrag, urlparse
|
||||||
|
|
||||||
|
from playwright.async_api import async_playwright, Page, Response
|
||||||
|
|
||||||
|
PROGRAM_KEYWORDS = ['Master', 'M.Sc.', 'M.A.', 'Graduate', 'Masterstudiengang', '/studium/', '/studiengänge/', 'Postgraduate']
|
||||||
|
FACULTY_KEYWORDS = ['Prof.', 'Dr.', 'Professor', '/mitarbeiter/', '/people/', '/personal/', '/~', 'Professorin']
|
||||||
|
EXCLUSION_KEYWORDS = ['bewerbung', 'admission', 'apply', 'bewerben', 'news', 'nachrichten', 'events', 'veranstaltungen', 'career', 'stellenangebote', 'login', 'anmelden', 'alumni', 'donate', 'spenden', 'studienanfänger']
|
||||||
|
METADATA_FIELDS = ['url', 'title', 'entity_type', 'department', 'email', 'scraped_at']
|
||||||
|
EXTRA_NOTES = ["RWTH Aachen is a major German technical university with content in both German and English. The site structure appears to use target group portals ('Zielgruppenportale') for different audiences. Faculty information will likely be distributed across different department websites. The university uses German academic titles (Prof., Dr.) extensively. Be prepared to handle both '/cms/root/' URL structures and potential subdomain variations for different faculties."]
|
||||||
|
|
||||||
|
# URL patterns that indicate individual profile pages
|
||||||
|
PROFILE_URL_PATTERNS = [
|
||||||
|
"/people/", "/person/", "/profile/", "/profiles/",
|
||||||
|
"/faculty/", "/staff/", "/directory/",
|
||||||
|
"/~", # Unix-style personal pages
|
||||||
|
"/bio/", "/about/",
|
||||||
|
]
|
||||||
|
|
||||||
|
# URL patterns that indicate listing/directory pages (should be crawled deeper)
|
||||||
|
DIRECTORY_URL_PATTERNS = [
|
||||||
|
"/faculty", "/people", "/directory", "/staff",
|
||||||
|
"/team", "/members", "/researchers",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def normalize_url(base: str, href: str) -> str:
|
||||||
|
"""Normalize URL by resolving relative paths and removing fragments."""
|
||||||
|
absolute = urljoin(base, href)
|
||||||
|
cleaned, _ = urldefrag(absolute)
|
||||||
|
# Remove trailing slash for consistency
|
||||||
|
return cleaned.rstrip("/")
|
||||||
|
|
||||||
|
|
||||||
|
def matches_any(text: str, keywords: Iterable[str]) -> bool:
|
||||||
|
"""Check if text contains any of the keywords (case-insensitive)."""
|
||||||
|
lowered = text.lower()
|
||||||
|
return any(keyword.lower() in lowered for keyword in keywords)
|
||||||
|
|
||||||
|
|
||||||
|
def is_same_domain(url1: str, url2: str) -> bool:
|
||||||
|
"""Check if two URLs belong to the same root domain."""
|
||||||
|
domain1 = urlparse(url1).netloc.replace("www.", "")
|
||||||
|
domain2 = urlparse(url2).netloc.replace("www.", "")
|
||||||
|
# Allow subdomains of the same root domain
|
||||||
|
parts1 = domain1.split(".")
|
||||||
|
parts2 = domain2.split(".")
|
||||||
|
if len(parts1) >= 2 and len(parts2) >= 2:
|
||||||
|
return parts1[-2:] == parts2[-2:]
|
||||||
|
return domain1 == domain2
|
||||||
|
|
||||||
|
|
||||||
|
def is_profile_url(url: str) -> bool:
|
||||||
|
"""Check if URL pattern suggests an individual profile page."""
|
||||||
|
url_lower = url.lower()
|
||||||
|
return any(pattern in url_lower for pattern in PROFILE_URL_PATTERNS)
|
||||||
|
|
||||||
|
|
||||||
|
def is_directory_url(url: str) -> bool:
|
||||||
|
"""Check if URL pattern suggests a directory/listing page."""
|
||||||
|
url_lower = url.lower()
|
||||||
|
return any(pattern in url_lower for pattern in DIRECTORY_URL_PATTERNS)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ScrapedLink:
|
||||||
|
url: str
|
||||||
|
title: str
|
||||||
|
text: str
|
||||||
|
source_url: str
|
||||||
|
bucket: str # "program" or "faculty"
|
||||||
|
is_verified: bool = False
|
||||||
|
http_status: int = 0
|
||||||
|
is_profile_page: bool = False
|
||||||
|
scraped_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ScrapeSettings:
|
||||||
|
root_url: str
|
||||||
|
max_depth: int
|
||||||
|
max_pages: int
|
||||||
|
headless: bool
|
||||||
|
output: Path
|
||||||
|
verify_links: bool = True
|
||||||
|
request_delay: float = 1.0 # Polite crawling delay
|
||||||
|
|
||||||
|
|
||||||
|
async def extract_links(page: Page) -> List[Tuple[str, str]]:
|
||||||
|
"""Extract all anchor links from the page."""
|
||||||
|
anchors: Iterable[dict] = await page.eval_on_selector_all(
|
||||||
|
"a",
|
||||||
|
"""elements => elements
|
||||||
|
.map(el => ({text: (el.textContent || '').trim(), href: el.href}))
|
||||||
|
.filter(item => item.text && item.href && item.href.startsWith('http'))""",
|
||||||
|
)
|
||||||
|
return [(item["href"], item["text"]) for item in anchors]
|
||||||
|
|
||||||
|
|
||||||
|
async def get_page_title(page: Page) -> str:
|
||||||
|
"""Get the page title safely."""
|
||||||
|
try:
|
||||||
|
return await page.title() or ""
|
||||||
|
except Exception:
|
||||||
|
return ""
|
||||||
|
|
||||||
|
|
||||||
|
async def verify_link(context, url: str, timeout: int = 10000) -> Tuple[bool, int, str]:
|
||||||
|
"""
|
||||||
|
Verify a link by making a HEAD-like request.
|
||||||
|
Returns: (is_valid, status_code, page_title)
|
||||||
|
"""
|
||||||
|
page = await context.new_page()
|
||||||
|
try:
|
||||||
|
response: Response = await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
|
||||||
|
if response:
|
||||||
|
status = response.status
|
||||||
|
title = await get_page_title(page)
|
||||||
|
is_valid = 200 <= status < 400
|
||||||
|
return is_valid, status, title
|
||||||
|
return False, 0, ""
|
||||||
|
except Exception:
|
||||||
|
return False, 0, ""
|
||||||
|
finally:
|
||||||
|
await page.close()
|
||||||
|
|
||||||
|
|
||||||
|
async def crawl(settings: ScrapeSettings, browser_name: str) -> List[ScrapedLink]:
|
||||||
|
"""
|
||||||
|
Crawl the website using BFS, collecting program and faculty links.
|
||||||
|
Features:
|
||||||
|
- URL deduplication
|
||||||
|
- Link verification
|
||||||
|
- Profile page detection
|
||||||
|
- Polite crawling with delays
|
||||||
|
"""
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser_launcher = getattr(p, browser_name)
|
||||||
|
browser = await browser_launcher.launch(headless=settings.headless)
|
||||||
|
context = await browser.new_context()
|
||||||
|
|
||||||
|
# Priority queue: (priority, url, depth) - lower priority = processed first
|
||||||
|
# Directory pages get priority 0, others get priority 1
|
||||||
|
queue: Deque[Tuple[int, str, int]] = deque([(0, settings.root_url, 0)])
|
||||||
|
visited: Set[str] = set()
|
||||||
|
found_urls: Set[str] = set() # For deduplication of results
|
||||||
|
results: List[ScrapedLink] = []
|
||||||
|
|
||||||
|
print(f"Starting crawl from: {settings.root_url}")
|
||||||
|
print(f"Max depth: {settings.max_depth}, Max pages: {settings.max_pages}")
|
||||||
|
|
||||||
|
try:
|
||||||
|
while queue and len(visited) < settings.max_pages:
|
||||||
|
# Sort queue by priority (directory pages first)
|
||||||
|
queue = deque(sorted(queue, key=lambda x: x[0]))
|
||||||
|
priority, url, depth = queue.popleft()
|
||||||
|
|
||||||
|
normalized_url = normalize_url(settings.root_url, url)
|
||||||
|
if normalized_url in visited or depth > settings.max_depth:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Only crawl same-domain URLs
|
||||||
|
if not is_same_domain(settings.root_url, normalized_url):
|
||||||
|
continue
|
||||||
|
|
||||||
|
visited.add(normalized_url)
|
||||||
|
print(f"[{len(visited)}/{settings.max_pages}] Depth {depth}: {normalized_url[:80]}...")
|
||||||
|
|
||||||
|
page = await context.new_page()
|
||||||
|
try:
|
||||||
|
response = await page.goto(
|
||||||
|
normalized_url, wait_until="domcontentloaded", timeout=20000
|
||||||
|
)
|
||||||
|
if not response or response.status >= 400:
|
||||||
|
await page.close()
|
||||||
|
continue
|
||||||
|
except Exception as e:
|
||||||
|
print(f" Error: {e}")
|
||||||
|
await page.close()
|
||||||
|
continue
|
||||||
|
|
||||||
|
page_title = await get_page_title(page)
|
||||||
|
links = await extract_links(page)
|
||||||
|
|
||||||
|
for href, text in links:
|
||||||
|
normalized_href = normalize_url(normalized_url, href)
|
||||||
|
|
||||||
|
# Skip if already found or is excluded
|
||||||
|
if normalized_href in found_urls:
|
||||||
|
continue
|
||||||
|
if matches_any(text, EXCLUSION_KEYWORDS) or matches_any(normalized_href, EXCLUSION_KEYWORDS):
|
||||||
|
continue
|
||||||
|
|
||||||
|
text_lower = text.lower()
|
||||||
|
href_lower = normalized_href.lower()
|
||||||
|
is_profile = is_profile_url(normalized_href)
|
||||||
|
|
||||||
|
# Check for program links
|
||||||
|
if matches_any(text_lower, PROGRAM_KEYWORDS) or matches_any(href_lower, PROGRAM_KEYWORDS):
|
||||||
|
found_urls.add(normalized_href)
|
||||||
|
results.append(
|
||||||
|
ScrapedLink(
|
||||||
|
url=normalized_href,
|
||||||
|
title="",
|
||||||
|
text=text[:200],
|
||||||
|
source_url=normalized_url,
|
||||||
|
bucket="program",
|
||||||
|
is_profile_page=False,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
# Check for faculty links
|
||||||
|
                    if matches_any(text_lower, FACULTY_KEYWORDS) or matches_any(href_lower, FACULTY_KEYWORDS):
                        found_urls.add(normalized_href)
                        results.append(
                            ScrapedLink(
                                url=normalized_href,
                                title="",
                                text=text[:200],
                                source_url=normalized_url,
                                bucket="faculty",
                                is_profile_page=is_profile,
                            )
                        )

                    # Queue for further crawling
                    if depth < settings.max_depth and is_same_domain(settings.root_url, normalized_href):
                        # Prioritize directory pages
                        link_priority = 0 if is_directory_url(normalized_href) else 1
                        queue.append((link_priority, normalized_href, depth + 1))

                await page.close()

                # Polite delay between requests
                await asyncio.sleep(settings.request_delay)

        finally:
            await context.close()
            await browser.close()

        # Verify links if enabled
        if settings.verify_links and results:
            print(f"\nVerifying {len(results)} links...")
            browser = await browser_launcher.launch(headless=True)
            context = await browser.new_context()

            verified_results = []
            for i, link in enumerate(results):
                if link.url in [r.url for r in verified_results]:
                    continue  # Skip duplicates

                print(f"  [{i+1}/{len(results)}] Verifying: {link.url[:60]}...")
                is_valid, status, title = await verify_link(context, link.url)
                link.is_verified = True
                link.http_status = status
                link.title = title or link.text

                if is_valid:
                    verified_results.append(link)
                else:
                    print(f"    Invalid (HTTP {status})")

                await asyncio.sleep(0.5)  # Delay between verifications

            await context.close()
            await browser.close()
            results = verified_results

        return results


def deduplicate_results(results: List[ScrapedLink]) -> List[ScrapedLink]:
    """Remove duplicate URLs, keeping the first occurrence."""
    seen: Set[str] = set()
    unique = []
    for link in results:
        if link.url not in seen:
            seen.add(link.url)
            unique.append(link)
    return unique


def serialize(results: List[ScrapedLink], target: Path, root_url: str) -> None:
    """Save results to JSON file with statistics."""
    results = deduplicate_results(results)

    program_links = [link for link in results if link.bucket == "program"]
    faculty_links = [link for link in results if link.bucket == "faculty"]
    profile_pages = [link for link in faculty_links if link.is_profile_page]

    payload = {
        "root_url": root_url,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "statistics": {
            "total_links": len(results),
            "program_links": len(program_links),
            "faculty_links": len(faculty_links),
            "profile_pages": len(profile_pages),
            "verified_links": len([r for r in results if r.is_verified and r.http_status == 200]),
        },
        "program_links": [asdict(link) for link in program_links],
        "faculty_links": [asdict(link) for link in faculty_links],
        "notes": EXTRA_NOTES,
        "metadata_fields": METADATA_FIELDS,
    }
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")

    print(f"\nResults saved to: {target}")
    print(f"  Total links: {len(results)}")
    print(f"  Program links: {len(program_links)}")
    print(f"  Faculty links: {len(faculty_links)}")
    print(f"  Profile pages: {len(profile_pages)}")


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Playwright scraper generated by the Agno agent for https://www.rwth-aachen.de/go/id/a/?lidx=1."
    )
    parser.add_argument(
        "--root-url",
        default="https://www.rwth-aachen.de/go/id/a/?lidx=1",
        help="Seed url to start crawling from.",
    )
    parser.add_argument(
        "--max-depth",
        type=int,
        default=3,
        help="Maximum crawl depth.",
    )
    parser.add_argument(
        "--max-pages",
        type=int,
        default=30,
        help="Maximum number of pages to visit.",
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=Path("university-scraper_results.json"),
        help="Where to save the JSON output.",
    )
    parser.add_argument(
        "--headless",
        action="store_true",
        default=True,
        help="Run browser in headless mode (default: True).",
    )
    parser.add_argument(
        "--no-headless",
        action="store_false",
        dest="headless",
        help="Run browser with visible window.",
    )
    parser.add_argument(
        "--browser",
        choices=["chromium", "firefox", "webkit"],
        default="chromium",
        help="Browser engine to launch via Playwright.",
    )
    parser.add_argument(
        "--no-verify",
        action="store_true",
        default=False,
        help="Skip link verification step.",
    )
    parser.add_argument(
        "--delay",
        type=float,
        default=1.0,
        help="Delay between requests in seconds (polite crawling).",
    )
    return parser.parse_args()


async def main_async() -> None:
    args = parse_args()
    settings = ScrapeSettings(
        root_url=args.root_url,
        max_depth=args.max_depth,
        max_pages=args.max_pages,
        headless=args.headless,
        output=args.output,
        verify_links=not args.no_verify,
        request_delay=args.delay,
    )
    links = await crawl(settings, browser_name=args.browser)
    serialize(links, settings.output, settings.root_url)


def main() -> None:
    asyncio.run(main_async())


if __name__ == "__main__":
    main()
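The JSON written by `serialize()` above groups links into `program_links` and `faculty_links` plus a `statistics` block. A minimal sketch of consuming that file downstream (the filename matches the `--output` default of this script; any other path works the same way):

```python
import json
from pathlib import Path

# Load the results file produced by the generated scraper.
data = json.loads(Path("university-scraper_results.json").read_text(encoding="utf-8"))

print(data["statistics"])  # total_links, program_links, faculty_links, profile_pages, verified_links
for link in data["faculty_links"][:5]:
    # Each entry is an asdict()-ed ScrapedLink: url, title, text, source_url, bucket, ...
    print(link["url"], link["is_profile_page"], link["http_status"])
```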
artifacts/rwth_aachen_university_scraper.py (new file, +437 lines)
#!/usr/bin/env python
"""
Auto-generated by the Agno codegen agent.
Target university: RWTH Aachen (https://www.rwth-aachen.de/go/id/a/?lidx=1)
Requested caps: depth=3, pages=30

Plan description: Playwright scraper for university master programs and faculty profiles.
Navigation strategy: Start at the university homepage: https://www.rwth-aachen.de/ Navigate to faculty/department pages, e.g. /fakultaeten/, /fachbereiche/ Look for staff/people directory pages within each department Crawl the staff directories to find individual profile pages Some departments may use subdomains like informatik.rwth-aachen.de
Verification checklist:
- Check that collected URLs are for individual people, not directories
- Spot check profile pages to ensure they represent faculty members
- Verify relevant graduate program pages were found
- Confirm noise pages like news, events, jobs were excluded
Playwright snapshot used to guide this plan:
1. RWTH Aachen University | Rheinisch-Westfälische Technische Hochschule | EN (https://www.rwth-aachen.de/go/id/a/?lidx=1)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE Search for Search Copyright: © Copyright: © Copyright: © Copyright: © Studying at RWTH Welc
Anchors: Skip to Content -> https://www.rwth-aachen.de/go/id/a/?lidx=1#main, Skip to Main Navigation -> https://www.rwth-aachen.de/go/id/a/?lidx=1#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/go/id/a/?lidx=1#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/go/id/a/?lidx=1#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/go/id/a/?lidx=1#searchbar, Skip to Footer -> https://www.rwth-aachen.de/go/id/a/?lidx=1#footer
2. Prospective Students | RWTH Aachen University | EN (https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE Search for Search Prospective Students Choosing A Course of Study Copyright: © Mario Irrmischer Adv
Anchors: Skip to Content -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#main, Skip to Main Navigation -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#searchbar, Skip to Footer -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~svo/Studieninteressierte/lidx/1/#footer
3. First-Year Students | RWTH Aachen University | EN (https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE Search for Search First-Year Students Preparing for Your Studies – Recommended Subject-Specific Res
Anchors: Skip to Content -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#main, Skip to Main Navigation -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#searchbar, Skip to Footer -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~cgjnl/Studienanfaengerinnen-und-anfaenger/lidx/1/#footer
4. Students | RWTH Aachen University | EN (https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/)
Snippet: Skip to Content Skip to Main Navigation Skip to Landing Pages for Target Groups Skip to Quick Access Skip to Search Skip to Footer News Information for... Quick Access DE Search for Search Students Teaser Copyright: © Martin Braun Classes What lectures do you have next
Anchors: Skip to Content -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#main, Skip to Main Navigation -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#main-nav-control, Skip to Landing Pages for Target Groups -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#persona-control, Skip to Quick Access -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#quick-start-control, Skip to Search -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#searchbar, Skip to Footer -> https://www.rwth-aachen.de/cms/root/Zielgruppenportale/~tpi/Studierende/lidx/1/#footer
Snapshot truncated.

Generated at: 2025-12-09T15:00:09.586788+00:00
"""

from __future__ import annotations

import argparse
import asyncio
import json
import time
from collections import deque
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Deque, Iterable, List, Set, Tuple
from urllib.parse import urljoin, urldefrag, urlparse

from playwright.async_api import async_playwright, Page, Response

PROGRAM_KEYWORDS = ['/studium/', '/studiengaenge/', 'master', 'graduate', 'postgraduate', 'm.sc.', 'm.a.']
FACULTY_KEYWORDS = ['/staff/', '/profile/', '/personen/', '/person/', '/aw/personen/', 'prof.', 'dr.', 'professor']
EXCLUSION_KEYWORDS = ['studieninteressierte', 'studienanfaenger', 'zulassung', 'bewerbung', 'studienbeitraege', 'studienfinanzierung', 'aktuelles', 'veranstaltungen', 'karriere', 'stellenangebote', 'alumni', 'anmeldung']
METADATA_FIELDS = ['url', 'title', 'entity_type', 'department', 'email', 'scraped_at']
EXTRA_NOTES = ['Site is primarily in German, so use German keywords', 'Faculty profile URLs contain /personen/ or /person/', 'Graduate program pages use /studium/ and /studiengaenge/']

# URL patterns that indicate individual profile pages
PROFILE_URL_PATTERNS = [
    "/people/", "/person/", "/profile/", "/profiles/",
    "/faculty/", "/staff/", "/directory/",
    "/~",  # Unix-style personal pages
    "/bio/", "/about/",
]

# URL patterns that indicate listing/directory pages (should be crawled deeper)
DIRECTORY_URL_PATTERNS = [
    "/faculty", "/people", "/directory", "/staff",
    "/team", "/members", "/researchers",
]


def normalize_url(base: str, href: str) -> str:
    """Normalize URL by resolving relative paths and removing fragments."""
    absolute = urljoin(base, href)
    cleaned, _ = urldefrag(absolute)
    # Remove trailing slash for consistency
    return cleaned.rstrip("/")


def matches_any(text: str, keywords: Iterable[str]) -> bool:
    """Check if text contains any of the keywords (case-insensitive)."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def is_same_domain(url1: str, url2: str) -> bool:
    """Check if two URLs belong to the same root domain."""
    domain1 = urlparse(url1).netloc.replace("www.", "")
    domain2 = urlparse(url2).netloc.replace("www.", "")
    # Allow subdomains of the same root domain
    parts1 = domain1.split(".")
    parts2 = domain2.split(".")
    if len(parts1) >= 2 and len(parts2) >= 2:
        return parts1[-2:] == parts2[-2:]
    return domain1 == domain2


def is_profile_url(url: str) -> bool:
    """Check if URL pattern suggests an individual profile page."""
    url_lower = url.lower()
    return any(pattern in url_lower for pattern in PROFILE_URL_PATTERNS)


def is_directory_url(url: str) -> bool:
    """Check if URL pattern suggests a directory/listing page."""
    url_lower = url.lower()
    return any(pattern in url_lower for pattern in DIRECTORY_URL_PATTERNS)


@dataclass
class ScrapedLink:
    url: str
    title: str
    text: str
    source_url: str
    bucket: str  # "program" or "faculty"
    is_verified: bool = False
    http_status: int = 0
    is_profile_page: bool = False
    scraped_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


@dataclass
class ScrapeSettings:
    root_url: str
    max_depth: int
    max_pages: int
    headless: bool
    output: Path
    verify_links: bool = True
    request_delay: float = 1.0  # Polite crawling delay


async def extract_links(page: Page) -> List[Tuple[str, str]]:
    """Extract all anchor links from the page."""
    anchors: Iterable[dict] = await page.eval_on_selector_all(
        "a",
        """elements => elements
            .map(el => ({text: (el.textContent || '').trim(), href: el.href}))
            .filter(item => item.text && item.href && item.href.startsWith('http'))""",
    )
    return [(item["href"], item["text"]) for item in anchors]


async def get_page_title(page: Page) -> str:
    """Get the page title safely."""
    try:
        return await page.title() or ""
    except Exception:
        return ""


async def verify_link(context, url: str, timeout: int = 10000) -> Tuple[bool, int, str]:
    """
    Verify a link by making a HEAD-like request.
    Returns: (is_valid, status_code, page_title)
    """
    page = await context.new_page()
    try:
        response: Response = await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
        if response:
            status = response.status
            title = await get_page_title(page)
            is_valid = 200 <= status < 400
            return is_valid, status, title
        return False, 0, ""
    except Exception:
        return False, 0, ""
    finally:
        await page.close()


async def crawl(settings: ScrapeSettings, browser_name: str) -> List[ScrapedLink]:
    """
    Crawl the website using BFS, collecting program and faculty links.
    Features:
    - URL deduplication
    - Link verification
    - Profile page detection
    - Polite crawling with delays
    """
    async with async_playwright() as p:
        browser_launcher = getattr(p, browser_name)
        browser = await browser_launcher.launch(headless=settings.headless)
        context = await browser.new_context()

        # Priority queue: (priority, url, depth) - lower priority = processed first
        # Directory pages get priority 0, others get priority 1
        queue: Deque[Tuple[int, str, int]] = deque([(0, settings.root_url, 0)])
        visited: Set[str] = set()
        found_urls: Set[str] = set()  # For deduplication of results
        results: List[ScrapedLink] = []

        print(f"Starting crawl from: {settings.root_url}")
        print(f"Max depth: {settings.max_depth}, Max pages: {settings.max_pages}")

        try:
            while queue and len(visited) < settings.max_pages:
                # Sort queue by priority (directory pages first)
                queue = deque(sorted(queue, key=lambda x: x[0]))
                priority, url, depth = queue.popleft()

                normalized_url = normalize_url(settings.root_url, url)
                if normalized_url in visited or depth > settings.max_depth:
                    continue

                # Only crawl same-domain URLs
                if not is_same_domain(settings.root_url, normalized_url):
                    continue

                visited.add(normalized_url)
                print(f"[{len(visited)}/{settings.max_pages}] Depth {depth}: {normalized_url[:80]}...")

                page = await context.new_page()
                try:
                    response = await page.goto(
                        normalized_url, wait_until="domcontentloaded", timeout=20000
                    )
                    if not response or response.status >= 400:
                        await page.close()
                        continue
                except Exception as e:
                    print(f"  Error: {e}")
                    await page.close()
                    continue

                page_title = await get_page_title(page)
                links = await extract_links(page)

                for href, text in links:
                    normalized_href = normalize_url(normalized_url, href)

                    # Skip if already found or is excluded
                    if normalized_href in found_urls:
                        continue
                    if matches_any(text, EXCLUSION_KEYWORDS) or matches_any(normalized_href, EXCLUSION_KEYWORDS):
                        continue

                    text_lower = text.lower()
                    href_lower = normalized_href.lower()
                    is_profile = is_profile_url(normalized_href)

                    # Check for program links
                    if matches_any(text_lower, PROGRAM_KEYWORDS) or matches_any(href_lower, PROGRAM_KEYWORDS):
                        found_urls.add(normalized_href)
                        results.append(
                            ScrapedLink(
                                url=normalized_href,
                                title="",
                                text=text[:200],
                                source_url=normalized_url,
                                bucket="program",
                                is_profile_page=False,
                            )
                        )

                    # Check for faculty links
                    if matches_any(text_lower, FACULTY_KEYWORDS) or matches_any(href_lower, FACULTY_KEYWORDS):
                        found_urls.add(normalized_href)
                        results.append(
                            ScrapedLink(
                                url=normalized_href,
                                title="",
                                text=text[:200],
                                source_url=normalized_url,
                                bucket="faculty",
                                is_profile_page=is_profile,
                            )
                        )

                    # Queue for further crawling
                    if depth < settings.max_depth and is_same_domain(settings.root_url, normalized_href):
                        # Prioritize directory pages
                        link_priority = 0 if is_directory_url(normalized_href) else 1
                        queue.append((link_priority, normalized_href, depth + 1))

                await page.close()

                # Polite delay between requests
                await asyncio.sleep(settings.request_delay)

        finally:
            await context.close()
            await browser.close()

        # Verify links if enabled
        if settings.verify_links and results:
            print(f"\nVerifying {len(results)} links...")
            browser = await browser_launcher.launch(headless=True)
            context = await browser.new_context()

            verified_results = []
            for i, link in enumerate(results):
                if link.url in [r.url for r in verified_results]:
                    continue  # Skip duplicates

                print(f"  [{i+1}/{len(results)}] Verifying: {link.url[:60]}...")
                is_valid, status, title = await verify_link(context, link.url)
                link.is_verified = True
                link.http_status = status
                link.title = title or link.text

                if is_valid:
                    verified_results.append(link)
                else:
                    print(f"    Invalid (HTTP {status})")

                await asyncio.sleep(0.5)  # Delay between verifications

            await context.close()
            await browser.close()
            results = verified_results

        return results


def deduplicate_results(results: List[ScrapedLink]) -> List[ScrapedLink]:
    """Remove duplicate URLs, keeping the first occurrence."""
    seen: Set[str] = set()
    unique = []
    for link in results:
        if link.url not in seen:
            seen.add(link.url)
            unique.append(link)
    return unique


def serialize(results: List[ScrapedLink], target: Path, root_url: str) -> None:
    """Save results to JSON file with statistics."""
    results = deduplicate_results(results)

    program_links = [link for link in results if link.bucket == "program"]
    faculty_links = [link for link in results if link.bucket == "faculty"]
    profile_pages = [link for link in faculty_links if link.is_profile_page]

    payload = {
        "root_url": root_url,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "statistics": {
            "total_links": len(results),
            "program_links": len(program_links),
            "faculty_links": len(faculty_links),
            "profile_pages": len(profile_pages),
            "verified_links": len([r for r in results if r.is_verified and r.http_status == 200]),
        },
        "program_links": [asdict(link) for link in program_links],
        "faculty_links": [asdict(link) for link in faculty_links],
        "notes": EXTRA_NOTES,
        "metadata_fields": METADATA_FIELDS,
    }
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")

    print(f"\nResults saved to: {target}")
    print(f"  Total links: {len(results)}")
    print(f"  Program links: {len(program_links)}")
    print(f"  Faculty links: {len(faculty_links)}")
    print(f"  Profile pages: {len(profile_pages)}")


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Playwright scraper generated by the Agno agent for https://www.rwth-aachen.de/go/id/a/?lidx=1."
    )
    parser.add_argument(
        "--root-url",
        default="https://www.rwth-aachen.de/go/id/a/?lidx=1",
        help="Seed url to start crawling from.",
    )
    parser.add_argument(
        "--max-depth",
        type=int,
        default=3,
        help="Maximum crawl depth.",
    )
    parser.add_argument(
        "--max-pages",
        type=int,
        default=30,
        help="Maximum number of pages to visit.",
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=Path("university-scraper_results.json"),
        help="Where to save the JSON output.",
    )
    parser.add_argument(
        "--headless",
        action="store_true",
        default=True,
        help="Run browser in headless mode (default: True).",
    )
    parser.add_argument(
        "--no-headless",
        action="store_false",
        dest="headless",
        help="Run browser with visible window.",
    )
    parser.add_argument(
        "--browser",
        choices=["chromium", "firefox", "webkit"],
        default="chromium",
        help="Browser engine to launch via Playwright.",
    )
    parser.add_argument(
        "--no-verify",
        action="store_true",
        default=False,
        help="Skip link verification step.",
    )
    parser.add_argument(
        "--delay",
        type=float,
        default=1.0,
        help="Delay between requests in seconds (polite crawling).",
    )
    return parser.parse_args()


async def main_async() -> None:
    args = parse_args()
    settings = ScrapeSettings(
        root_url=args.root_url,
        max_depth=args.max_depth,
        max_pages=args.max_pages,
        headless=args.headless,
        output=args.output,
        verify_links=not args.no_verify,
        request_delay=args.delay,
    )
    links = await crawl(settings, browser_name=args.browser)
    serialize(links, settings.output, settings.root_url)


def main() -> None:
    asyncio.run(main_async())


if __name__ == "__main__":
    main()
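The script is normally run from the CLI, e.g. `python artifacts/rwth_aachen_university_scraper.py --max-pages 50 --no-verify`, but the same crawl can also be driven programmatically. A minimal sketch, assuming the artifact has been placed on the import path under its file name:

```python
import asyncio
from pathlib import Path

# Assumption: the artifact is importable as a module (e.g. copied next to this snippet).
from rwth_aachen_university_scraper import ScrapeSettings, crawl, serialize

settings = ScrapeSettings(
    root_url="https://www.rwth-aachen.de/go/id/a/?lidx=1",
    max_depth=2,                 # shallower than the CLI default of 3
    max_pages=10,
    headless=True,
    output=Path("rwth_results.json"),
    verify_links=False,          # same effect as --no-verify
    request_delay=1.0,
)

links = asyncio.run(crawl(settings, browser_name="chromium"))
serialize(links, settings.output, settings.root_url)
```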
artifacts/test_faculty_scraper.py (new file, +165 lines)
#!/usr/bin/env python3
"""
Test the faculty scraping logic - only 3 programs are tested
"""

import asyncio
import json
import re
from playwright.async_api import async_playwright


def name_to_slug(name):
    """Convert a program name into a URL slug"""
    slug = name.lower()
    slug = re.sub(r'[^\w\s-]', '', slug)
    slug = re.sub(r'[\s_]+', '-', slug)
    slug = re.sub(r'-+', '-', slug)
    slug = slug.strip('-')
    return slug


async def get_faculty_from_gsas_page(page, gsas_url):
    """Get the Faculty link from the GSAS program page, then visit the department People page to collect the faculty list"""
    faculty_list = []
    faculty_page_url = None

    try:
        print(f"  Visiting GSAS page: {gsas_url}")
        await page.goto(gsas_url, wait_until="domcontentloaded", timeout=30000)
        await page.wait_for_timeout(2000)

        # Find the link to the Faculty section
        faculty_link = await page.evaluate('''() => {
            const links = document.querySelectorAll('a[href]');
            for (const link of links) {
                const text = link.innerText.toLowerCase();
                const href = link.href;
                if (text.includes('faculty') && text.includes('see list')) {
                    return href;
                }
                if (text.includes('faculty') && (href.includes('/people') || href.includes('/faculty'))) {
                    return href;
                }
            }
            return null;
        }''')

        if faculty_link:
            faculty_page_url = faculty_link
            print(f"  Found Faculty page link: {faculty_link}")

            # Visit the Faculty/People page
            await page.goto(faculty_link, wait_until="domcontentloaded", timeout=30000)
            await page.wait_for_timeout(2000)

            # Extract all faculty entries
            faculty_list = await page.evaluate('''() => {
                const faculty = [];
                const seen = new Set();

                document.querySelectorAll('a[href]').forEach(a => {
                    const href = a.href || '';
                    const text = a.innerText.trim();
                    const lowerHref = href.toLowerCase();

                    if ((lowerHref.includes('/people/') || lowerHref.includes('/faculty/') ||
                         lowerHref.includes('/profile/')) &&
                        text.length > 3 && text.length < 100 &&
                        !text.toLowerCase().includes('people') &&
                        !text.toLowerCase().includes('faculty') &&
                        !lowerHref.endsWith('/people/') &&
                        !lowerHref.endsWith('/faculty/')) {

                        if (!seen.has(href)) {
                            seen.add(href);
                            faculty.push({
                                name: text,
                                url: href
                            });
                        }
                    }
                });

                return faculty;
            }''')

            print(f"  Found {len(faculty_list)} faculty members")
            for f in faculty_list[:5]:
                print(f"    - {f['name']}: {f['url']}")
            if len(faculty_list) > 5:
                print(f"    ... and {len(faculty_list) - 5} more")
        else:
            print("  No Faculty page link found")

    except Exception as e:
        print(f"  Failed to fetch Faculty info: {e}")

    return faculty_list, faculty_page_url


async def test_faculty_scraper():
    """Test faculty scraping"""

    # Test 3 programs
    test_programs = [
        "African and African American Studies",
        "Economics",
        "Computer Science"
    ]

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            viewport={'width': 1920, 'height': 1080}
        )
        page = await context.new_page()

        results = []

        for i, name in enumerate(test_programs, 1):
            print(f"\n{'='*60}")
            print(f"[{i}/{len(test_programs)}] Testing: {name}")
            print(f"{'='*60}")

            slug = name_to_slug(name)
            program_url = f"https://www.harvard.edu/programs/{slug}/"
            gsas_url = f"https://gsas.harvard.edu/program/{slug}"

            print(f"Program URL: {program_url}")
            print(f"GSAS URL: {gsas_url}")

            faculty_list, faculty_page_url = await get_faculty_from_gsas_page(page, gsas_url)

            results.append({
                'name': name,
                'url': program_url,
                'gsas_url': gsas_url,
                'faculty_page_url': faculty_page_url,
                'faculty': faculty_list,
                'faculty_count': len(faculty_list)
            })

            await page.wait_for_timeout(1000)

        await browser.close()

    # Print results
    print(f"\n\n{'='*60}")
    print("Test result summary")
    print(f"{'='*60}")

    for r in results:
        print(f"\n{r['name']}:")
        print(f"  Faculty page: {r['faculty_page_url'] or 'not found'}")
        print(f"  Faculty count: {r['faculty_count']}")

    # Save test results
    with open('test_faculty_results.json', 'w', encoding='utf-8') as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    print("\nTest results saved to: test_faculty_results.json")


if __name__ == "__main__":
    asyncio.run(test_faculty_scraper())
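The Harvard/GSAS URLs are derived purely from `name_to_slug()`, so the URL mapping can be sanity-checked without launching a browser. A small sketch, assuming the artifact is importable as a module:

```python
from test_faculty_scraper import name_to_slug  # assumption: artifact is on PYTHONPATH

assert name_to_slug("African and African American Studies") == "african-and-african-american-studies"
assert name_to_slug("Computer Science") == "computer-science"

# The GSAS program page is then built from the slug:
print(f"https://gsas.harvard.edu/program/{name_to_slug('Economics')}")
# -> https://gsas.harvard.edu/program/economics
```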
artifacts/test_manchester_scraper.py (new file, +464 lines)
"""
|
||||||
|
Test Manchester University scraper - improved faculty mapping
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
|
||||||
|
|
||||||
|
MASTERS_PATHS = [
|
||||||
|
"/study/masters/courses/list/",
|
||||||
|
"/study/masters/courses/",
|
||||||
|
"/postgraduate/taught/courses/",
|
||||||
|
"/postgraduate/courses/list/",
|
||||||
|
"/postgraduate/courses/",
|
||||||
|
"/graduate/programs/",
|
||||||
|
"/academics/graduate/programs/",
|
||||||
|
"/programmes/masters/",
|
||||||
|
"/masters/programmes/",
|
||||||
|
"/admissions/graduate/programs/",
|
||||||
|
]
|
||||||
|
|
||||||
|
ACCOUNTING_STAFF_URL = "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/"
|
||||||
|
ACCOUNTING_STAFF_CACHE = None
|
||||||
|
|
||||||
|
|
||||||
|
JS_CHECK_COURSES = r"""() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
let courseCount = 0;
|
||||||
|
for (const a of links) {
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
if (/\/\d{4,}\//.test(href) ||
|
||||||
|
/\/(msc|ma|mba|mres|llm|med|meng)-/.test(href) ||
|
||||||
|
/\/course\/[a-z]/.test(href)) {
|
||||||
|
courseCount++;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return courseCount;
|
||||||
|
}"""
|
||||||
|
|
||||||
|
JS_FIND_LIST_URL = """() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const a of links) {
|
||||||
|
const text = a.innerText.toLowerCase();
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
if ((text.includes('a-z') || text.includes('all course') ||
|
||||||
|
text.includes('full list') || text.includes('browse all') ||
|
||||||
|
href.includes('/list')) &&
|
||||||
|
(href.includes('master') || href.includes('course') || href.includes('postgrad'))) {
|
||||||
|
return a.href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}"""
|
||||||
|
|
||||||
|
JS_FIND_COURSES_FROM_HOME = """() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const a of links) {
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
const text = a.innerText.toLowerCase();
|
||||||
|
if ((href.includes('master') || href.includes('postgraduate') || href.includes('graduate')) &&
|
||||||
|
(href.includes('course') || href.includes('program') || href.includes('degree'))) {
|
||||||
|
return a.href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}"""
|
||||||
|
|
||||||
|
JS_EXTRACT_PROGRAMS = r"""() => {
|
||||||
|
const programs = [];
|
||||||
|
const seen = new Set();
|
||||||
|
const currentHost = window.location.hostname;
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href;
|
||||||
|
const text = a.innerText.trim().replace(/\s+/g, ' ');
|
||||||
|
|
||||||
|
if (!href || seen.has(href)) return;
|
||||||
|
if (text.length < 5 || text.length > 200) return;
|
||||||
|
if (href.includes('#') || href.includes('javascript:') || href.includes('mailto:')) return;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const linkHost = new URL(href).hostname;
|
||||||
|
if (!linkHost.includes(currentHost.replace('www.', '')) &&
|
||||||
|
!currentHost.includes(linkHost.replace('www.', ''))) return;
|
||||||
|
} catch {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
const hrefLower = href.toLowerCase();
|
||||||
|
const textLower = text.toLowerCase();
|
||||||
|
|
||||||
|
const isNavigation = textLower === 'courses' ||
|
||||||
|
textLower === 'programmes' ||
|
||||||
|
textLower === 'undergraduate' ||
|
||||||
|
textLower === 'postgraduate' ||
|
||||||
|
textLower === 'masters' ||
|
||||||
|
textLower === "master's" ||
|
||||||
|
textLower.includes('skip to') ||
|
||||||
|
textLower.includes('share') ||
|
||||||
|
textLower === 'home' ||
|
||||||
|
textLower === 'study' ||
|
||||||
|
textLower.startsWith('a-z') ||
|
||||||
|
textLower.includes('admission') ||
|
||||||
|
textLower.includes('fees and funding') ||
|
||||||
|
textLower.includes('why should') ||
|
||||||
|
textLower.includes('why manchester') ||
|
||||||
|
textLower.includes('teaching and learning') ||
|
||||||
|
textLower.includes('meet us') ||
|
||||||
|
textLower.includes('student support') ||
|
||||||
|
textLower.includes('contact us') ||
|
||||||
|
textLower.includes('how to apply') ||
|
||||||
|
hrefLower.includes('/admissions/') ||
|
||||||
|
hrefLower.includes('/fees-and-funding/') ||
|
||||||
|
hrefLower.includes('/why-') ||
|
||||||
|
hrefLower.includes('/meet-us/') ||
|
||||||
|
hrefLower.includes('/contact-us/') ||
|
||||||
|
hrefLower.includes('/student-support/') ||
|
||||||
|
hrefLower.includes('/teaching-and-learning/') ||
|
||||||
|
hrefLower.endsWith('/courses/') ||
|
||||||
|
hrefLower.endsWith('/masters/') ||
|
||||||
|
hrefLower.endsWith('/postgraduate/');
|
||||||
|
|
||||||
|
if (isNavigation) return;
|
||||||
|
|
||||||
|
const isExcluded = hrefLower.includes('/undergraduate') ||
|
||||||
|
hrefLower.includes('/bachelor') ||
|
||||||
|
hrefLower.includes('/phd/') ||
|
||||||
|
hrefLower.includes('/doctoral') ||
|
||||||
|
hrefLower.includes('/research-degree') ||
|
||||||
|
textLower.includes('bachelor') ||
|
||||||
|
textLower.includes('undergraduate') ||
|
||||||
|
(textLower.includes('phd') && !textLower.includes('mphil'));
|
||||||
|
|
||||||
|
if (isExcluded) return;
|
||||||
|
|
||||||
|
const hasNumericId = /\/\d{4,}\//.test(href);
|
||||||
|
const hasDegreeSlug = /\/(msc|ma|mba|mres|llm|med|meng|mpa|mph|mphil)-[a-z]/.test(hrefLower);
|
||||||
|
const isCoursePage = (hrefLower.includes('/course/') ||
|
||||||
|
hrefLower.includes('/courses/list/') ||
|
||||||
|
hrefLower.includes('/programme/')) &&
|
||||||
|
href.split('/').filter(p => p).length > 4;
|
||||||
|
const textHasDegree = /(msc|ma|mba|mres|llm|med|meng|pgcert|pgdip)/i.test(text) ||
|
||||||
|
textLower.includes('master');
|
||||||
|
|
||||||
|
if (hasNumericId || hasDegreeSlug || isCoursePage || textHasDegree) {
|
||||||
|
seen.add(href);
|
||||||
|
programs.push({
|
||||||
|
name: text,
|
||||||
|
url: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return programs;
|
||||||
|
}"""
|
||||||
|
|
||||||
|
JS_EXTRACT_FACULTY = r"""() => {
|
||||||
|
const faculty = [];
|
||||||
|
const seen = new Set();
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
|
||||||
|
if (seen.has(href)) return;
|
||||||
|
if (text.length < 3 || text.length > 100) return;
|
||||||
|
|
||||||
|
const isStaff = href.includes('/people/') ||
|
||||||
|
href.includes('/staff/') ||
|
||||||
|
href.includes('/faculty/') ||
|
||||||
|
href.includes('/profile/') ||
|
||||||
|
href.includes('/academics/') ||
|
||||||
|
href.includes('/researcher/');
|
||||||
|
|
||||||
|
if (isStaff) {
|
||||||
|
seen.add(href);
|
||||||
|
faculty.push({
|
||||||
|
name: text.replace(/\s+/g, ' '),
|
||||||
|
url: a.href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return faculty.slice(0, 20);
|
||||||
|
}"""
|
||||||
|
|
||||||
|
JS_EXTRACT_ACCOUNTING_STAFF = r"""() => {
|
||||||
|
const rows = Array.from(document.querySelectorAll('table tbody tr'));
|
||||||
|
const staff = [];
|
||||||
|
|
||||||
|
for (const row of rows) {
|
||||||
|
const cells = row.querySelectorAll('td');
|
||||||
|
if (!cells || cells.length < 2) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
const nameCell = cells[1];
|
||||||
|
const roleCell = cells[2];
|
||||||
|
const emailCell = cells[5];
|
||||||
|
|
||||||
|
let profileUrl = '';
|
||||||
|
let displayName = nameCell ? nameCell.innerText.trim() : '';
|
||||||
|
const link = nameCell ? nameCell.querySelector('a[href]') : null;
|
||||||
|
if (link) {
|
||||||
|
profileUrl = link.href;
|
||||||
|
displayName = link.innerText.trim() || displayName;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (!displayName) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
let email = '';
|
||||||
|
if (emailCell) {
|
||||||
|
const emailLink = emailCell.querySelector('a[href^="mailto:"]');
|
||||||
|
if (emailLink) {
|
||||||
|
email = emailLink.href.replace('mailto:', '').trim();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
staff.push({
|
||||||
|
name: displayName,
|
||||||
|
title: roleCell ? roleCell.innerText.trim() : '',
|
||||||
|
url: profileUrl,
|
||||||
|
email: email
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
return staff;
|
||||||
|
}"""
|
||||||
|
|
||||||
|
|
||||||
|
def should_use_accounting_staff(program_name: str) -> bool:
|
||||||
|
lower_name = program_name.lower()
|
||||||
|
return "msc" in lower_name and "accounting" in lower_name
|
||||||
|
|
||||||
|
|
||||||
|
async def load_accounting_staff(context, output_callback=None):
|
||||||
|
global ACCOUNTING_STAFF_CACHE
|
||||||
|
|
||||||
|
if ACCOUNTING_STAFF_CACHE is not None:
|
||||||
|
return ACCOUNTING_STAFF_CACHE
|
||||||
|
|
||||||
|
staff_page = await context.new_page()
|
||||||
|
try:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", "Loading official AMBS Accounting & Finance staff page...")
|
||||||
|
|
||||||
|
await staff_page.goto(ACCOUNTING_STAFF_URL, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await staff_page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
ACCOUNTING_STAFF_CACHE = await staff_page.evaluate(JS_EXTRACT_ACCOUNTING_STAFF)
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Captured {len(ACCOUNTING_STAFF_CACHE)} faculty from the official staff page")
|
||||||
|
|
||||||
|
except Exception as exc:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("error", f"Failed to load AMBS staff page: {exc}")
|
||||||
|
ACCOUNTING_STAFF_CACHE = []
|
||||||
|
finally:
|
||||||
|
await staff_page.close()
|
||||||
|
|
||||||
|
return ACCOUNTING_STAFF_CACHE
|
||||||
|
|
||||||
|
|
||||||
|
async def find_course_list_page(page, base_url, output_callback):
|
||||||
|
for path in MASTERS_PATHS:
|
||||||
|
test_url = base_url.rstrip('/') + path
|
||||||
|
try:
|
||||||
|
response = await page.goto(test_url, wait_until="domcontentloaded", timeout=15000)
|
||||||
|
if response and response.status == 200:
|
||||||
|
title = await page.title()
|
||||||
|
if '404' not in title.lower() and 'not found' not in title.lower():
|
||||||
|
has_courses = await page.evaluate(JS_CHECK_COURSES)
|
||||||
|
if has_courses > 5:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Found course list: {path} ({has_courses} courses)")
|
||||||
|
return test_url
|
||||||
|
|
||||||
|
list_url = await page.evaluate(JS_FIND_LIST_URL)
|
||||||
|
if list_url:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Found full course list: {list_url}")
|
||||||
|
return list_url
|
||||||
|
except:
|
||||||
|
continue
|
||||||
|
|
||||||
|
try:
|
||||||
|
await page.goto(base_url, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
courses_url = await page.evaluate(JS_FIND_COURSES_FROM_HOME)
|
||||||
|
if courses_url:
|
||||||
|
return courses_url
|
||||||
|
except:
|
||||||
|
pass
|
||||||
|
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
async def extract_course_links(page, output_callback):
|
||||||
|
return await page.evaluate(JS_EXTRACT_PROGRAMS)
|
||||||
|
|
||||||
|
|
||||||
|
async def scrape(output_callback=None):
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser = await p.chromium.launch(headless=True)
|
||||||
|
context = await browser.new_context(
|
||||||
|
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
|
||||||
|
)
|
||||||
|
page = await context.new_page()
|
||||||
|
|
||||||
|
base_url = "https://www.manchester.ac.uk/"
|
||||||
|
|
||||||
|
result = {
|
||||||
|
"name": "Manchester University",
|
||||||
|
"url": base_url,
|
||||||
|
"scraped_at": datetime.now(timezone.utc).isoformat(),
|
||||||
|
"schools": []
|
||||||
|
}
|
||||||
|
|
||||||
|
all_programs = []
|
||||||
|
|
||||||
|
try:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", "Searching for masters course list...")
|
||||||
|
|
||||||
|
courses_url = await find_course_list_page(page, base_url, output_callback)
|
||||||
|
|
||||||
|
if not courses_url:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("warning", "Course list not found, using homepage")
|
||||||
|
courses_url = base_url
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", "Extracting masters programs...")
|
||||||
|
|
||||||
|
await page.goto(courses_url, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
for _ in range(3):
|
||||||
|
try:
|
||||||
|
load_more = page.locator('button:has-text("Load more"), button:has-text("Show more"), button:has-text("View more"), a:has-text("Load more")')
|
||||||
|
if await load_more.count() > 0:
|
||||||
|
await load_more.first.click()
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
else:
|
||||||
|
break
|
||||||
|
except:
|
||||||
|
break
|
||||||
|
|
||||||
|
programs_data = await extract_course_links(page, output_callback)
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Found {len(programs_data)} masters programs")
|
||||||
|
|
||||||
|
print("\nTop 20 programs:")
|
||||||
|
for i, prog in enumerate(programs_data[:20]):
|
||||||
|
print(f" {i+1}. {prog['name'][:60]}")
|
||||||
|
print(f" {prog['url']}")
|
||||||
|
|
||||||
|
max_detail_pages = min(len(programs_data), 30)
|
||||||
|
detailed_processed = 0
|
||||||
|
logged_official_staff = False
|
||||||
|
|
||||||
|
for prog in programs_data:
|
||||||
|
faculty_data = []
|
||||||
|
used_official_staff = False
|
||||||
|
|
||||||
|
if should_use_accounting_staff(prog['name']):
|
||||||
|
staff_list = await load_accounting_staff(context, output_callback)
|
||||||
|
if staff_list:
|
||||||
|
used_official_staff = True
|
||||||
|
if output_callback and not logged_official_staff:
|
||||||
|
output_callback("info", "Using Alliance MBS Accounting & Finance staff directory for accounting programmes")
|
||||||
|
logged_official_staff = True
|
||||||
|
faculty_data = [
|
||||||
|
{
|
||||||
|
"name": person.get("name"),
|
||||||
|
"url": person.get("url") or ACCOUNTING_STAFF_URL,
|
||||||
|
"title": person.get("title"),
|
||||||
|
"email": person.get("email"),
|
||||||
|
"source": "Alliance Manchester Business School - Accounting & Finance staff"
|
||||||
|
}
|
||||||
|
for person in staff_list
|
||||||
|
]
|
||||||
|
|
||||||
|
elif detailed_processed < max_detail_pages:
|
||||||
|
detailed_processed += 1
|
||||||
|
if output_callback and detailed_processed % 10 == 0:
|
||||||
|
output_callback("info", f"Processing {detailed_processed}/{max_detail_pages}: {prog['name'][:50]}")
|
||||||
|
try:
|
||||||
|
await page.goto(prog['url'], wait_until="domcontentloaded", timeout=15000)
|
||||||
|
await page.wait_for_timeout(800)
|
||||||
|
|
||||||
|
faculty_data = await page.evaluate(JS_EXTRACT_FACULTY)
|
||||||
|
except Exception as e:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("warning", f"Failed to capture faculty for {prog['name'][:50]}: {e}")
|
||||||
|
faculty_data = []
|
||||||
|
|
||||||
|
program_entry = {
|
||||||
|
"name": prog['name'],
|
||||||
|
"url": prog['url'],
|
||||||
|
"faculty": faculty_data
|
||||||
|
}
|
||||||
|
|
||||||
|
if used_official_staff:
|
||||||
|
program_entry["faculty_page_override"] = ACCOUNTING_STAFF_URL
|
||||||
|
|
||||||
|
all_programs.append(program_entry)
|
||||||
|
|
||||||
|
result["schools"] = [{
|
||||||
|
"name": "Masters Programs",
|
||||||
|
"url": courses_url,
|
||||||
|
"programs": all_programs
|
||||||
|
}]
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
total_faculty = sum(len(p.get('faculty', [])) for p in all_programs)
|
||||||
|
output_callback("info", f"Done! {len(all_programs)} programs, {total_faculty} faculty")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("error", f"Scraping error: {str(e)}")
|
||||||
|
|
||||||
|
finally:
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
def log_callback(level, message):
|
||||||
|
print(f"[{level.upper()}] {message}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
result = asyncio.run(scrape(output_callback=log_callback))
|
||||||
|
|
||||||
|
print("\n" + "="*60)
|
||||||
|
print("Scrape summary:")
|
||||||
|
print("="*60)
|
||||||
|
|
||||||
|
if result.get("schools"):
|
||||||
|
school = result["schools"][0]
|
||||||
|
programs = school.get("programs", [])
|
||||||
|
print(f"Course list URL: {school.get('url')}")
|
||||||
|
print(f"Total programs: {len(programs)}")
|
||||||
|
|
||||||
|
faculty_count = sum(len(p.get('faculty', [])) for p in programs)
|
||||||
|
print(f"Faculty total: {faculty_count}")
|
||||||
|
|
||||||
|
print("\nTop 10 programs:")
|
||||||
|
for i, p in enumerate(programs[:10]):
|
||||||
|
print(f" {i+1}. {p['name'][:60]}")
|
||||||
|
if p.get("faculty"):
|
||||||
|
print(f" Faculty entries: {len(p['faculty'])}")
|
||||||
|
|
||||||
|
with open("manchester_test_result.json", "w", encoding="utf-8") as f:
|
||||||
|
json.dump(result, f, indent=2, ensure_ascii=False)
|
||||||
|
print("\nSaved results to manchester_test_result.json")
backend/Dockerfile (new file, +25 lines)
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    && rm -rf /var/lib/apt/lists/*

# Install Playwright and its browser dependencies
RUN pip install playwright && playwright install chromium && playwright install-deps

# Copy dependency manifest
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose the API port
EXPOSE 8000

# Start command
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
backend/app/__init__.py (new file, +1 line)
"""University Scraper Web Backend"""
|
||||||
backend/app/api/__init__.py (new file, +15 lines)
"""API路由"""
|
||||||
|
|
||||||
|
from fastapi import APIRouter
|
||||||
|
|
||||||
|
from .universities import router as universities_router
|
||||||
|
from .scripts import router as scripts_router
|
||||||
|
from .jobs import router as jobs_router
|
||||||
|
from .results import router as results_router
|
||||||
|
|
||||||
|
api_router = APIRouter()
|
||||||
|
|
||||||
|
api_router.include_router(universities_router, prefix="/universities", tags=["大学管理"])
|
||||||
|
api_router.include_router(scripts_router, prefix="/scripts", tags=["爬虫脚本"])
|
||||||
|
api_router.include_router(jobs_router, prefix="/jobs", tags=["爬取任务"])
|
||||||
|
api_router.include_router(results_router, prefix="/results", tags=["爬取结果"])
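For context, this aggregate router is typically mounted once in `app.main`. The actual `main.py` is not part of this diff, so the wiring below is a minimal sketch and the `/api` prefix is an assumption:

```python
# Hypothetical sketch of backend/app/main.py; only the include_router wiring is shown.
from fastapi import FastAPI

from app.api import api_router

app = FastAPI(title="University Scraper Web Backend")
app.include_router(api_router, prefix="/api")  # assumption: routes served under /api
```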
backend/app/api/jobs.py (new file, +144 lines)
"""爬取任务API"""
|
||||||
|
|
||||||
|
from typing import List
|
||||||
|
from datetime import datetime
|
||||||
|
from fastapi import APIRouter, Depends, HTTPException, BackgroundTasks
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from ..database import get_db
|
||||||
|
from ..models import University, ScraperScript, ScrapeJob, ScrapeLog
|
||||||
|
from ..schemas.job import JobResponse, JobStatusResponse, LogResponse
|
||||||
|
from ..services.scraper_runner import run_scraper
|
||||||
|
|
||||||
|
router = APIRouter()
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/start/{university_id}", response_model=JobResponse)
|
||||||
|
async def start_scrape_job(
|
||||||
|
university_id: int,
|
||||||
|
background_tasks: BackgroundTasks,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
一键运行爬虫
|
||||||
|
|
||||||
|
启动爬取任务,抓取大学项目和导师数据
|
||||||
|
"""
|
||||||
|
# 检查大学是否存在
|
||||||
|
university = db.query(University).filter(University.id == university_id).first()
|
||||||
|
if not university:
|
||||||
|
raise HTTPException(status_code=404, detail="大学不存在")
|
||||||
|
|
||||||
|
# 检查是否有活跃的脚本
|
||||||
|
script = db.query(ScraperScript).filter(
|
||||||
|
ScraperScript.university_id == university_id,
|
||||||
|
ScraperScript.status == "active"
|
||||||
|
).first()
|
||||||
|
|
||||||
|
if not script:
|
||||||
|
raise HTTPException(status_code=400, detail="没有可用的爬虫脚本,请先生成脚本")
|
||||||
|
|
||||||
|
# 检查是否有正在运行的任务
|
||||||
|
running_job = db.query(ScrapeJob).filter(
|
||||||
|
ScrapeJob.university_id == university_id,
|
||||||
|
ScrapeJob.status == "running"
|
||||||
|
).first()
|
||||||
|
|
||||||
|
if running_job:
|
||||||
|
raise HTTPException(status_code=400, detail="已有正在运行的任务")
|
||||||
|
|
||||||
|
# 创建任务
|
||||||
|
job = ScrapeJob(
|
||||||
|
university_id=university_id,
|
||||||
|
script_id=script.id,
|
||||||
|
status="pending",
|
||||||
|
progress=0,
|
||||||
|
current_step="准备中..."
|
||||||
|
)
|
||||||
|
db.add(job)
|
||||||
|
db.commit()
|
||||||
|
db.refresh(job)
|
||||||
|
|
||||||
|
# 在后台执行爬虫
|
||||||
|
background_tasks.add_task(
|
||||||
|
run_scraper,
|
||||||
|
job_id=job.id,
|
||||||
|
script_id=script.id
|
||||||
|
)
|
||||||
|
|
||||||
|
return job
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/{job_id}", response_model=JobResponse)
|
||||||
|
def get_job(
|
||||||
|
job_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取任务详情"""
|
||||||
|
job = db.query(ScrapeJob).filter(ScrapeJob.id == job_id).first()
|
||||||
|
if not job:
|
||||||
|
raise HTTPException(status_code=404, detail="任务不存在")
|
||||||
|
|
||||||
|
return job
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/{job_id}/status", response_model=JobStatusResponse)
|
||||||
|
def get_job_status(
|
||||||
|
job_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取任务状态和日志"""
|
||||||
|
job = db.query(ScrapeJob).filter(ScrapeJob.id == job_id).first()
|
||||||
|
if not job:
|
||||||
|
raise HTTPException(status_code=404, detail="任务不存在")
|
||||||
|
|
||||||
|
# 获取最近的日志
|
||||||
|
logs = db.query(ScrapeLog).filter(
|
||||||
|
ScrapeLog.job_id == job_id
|
||||||
|
).order_by(ScrapeLog.created_at.desc()).limit(50).all()
|
||||||
|
|
||||||
|
return JobStatusResponse(
|
||||||
|
id=job.id,
|
||||||
|
status=job.status,
|
||||||
|
progress=job.progress,
|
||||||
|
current_step=job.current_step,
|
||||||
|
logs=[LogResponse(
|
||||||
|
id=log.id,
|
||||||
|
level=log.level,
|
||||||
|
message=log.message,
|
||||||
|
created_at=log.created_at
|
||||||
|
) for log in reversed(logs)]
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/university/{university_id}", response_model=List[JobResponse])
|
||||||
|
def get_university_jobs(
|
||||||
|
university_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取大学的所有任务"""
|
||||||
|
jobs = db.query(ScrapeJob).filter(
|
||||||
|
ScrapeJob.university_id == university_id
|
||||||
|
).order_by(ScrapeJob.created_at.desc()).limit(20).all()
|
||||||
|
|
||||||
|
return jobs
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/{job_id}/cancel")
|
||||||
|
def cancel_job(
|
||||||
|
job_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""取消任务"""
|
||||||
|
job = db.query(ScrapeJob).filter(ScrapeJob.id == job_id).first()
|
||||||
|
if not job:
|
||||||
|
raise HTTPException(status_code=404, detail="任务不存在")
|
||||||
|
|
||||||
|
if job.status not in ["pending", "running"]:
|
||||||
|
raise HTTPException(status_code=400, detail="任务已结束,无法取消")
|
||||||
|
|
||||||
|
job.status = "cancelled"
|
||||||
|
job.completed_at = datetime.utcnow()
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return {"message": "任务已取消"}
backend/app/api/results.py (new file, +175 lines)
"""爬取结果API"""
|
||||||
|
|
||||||
|
from typing import Optional
|
||||||
|
from fastapi import APIRouter, Depends, HTTPException, Query
|
||||||
|
from fastapi.responses import JSONResponse
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from ..database import get_db
|
||||||
|
from ..models import ScrapeResult
|
||||||
|
from ..schemas.result import ResultResponse
|
||||||
|
|
||||||
|
router = APIRouter()
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/university/{university_id}", response_model=ResultResponse)
|
||||||
|
def get_university_result(
|
||||||
|
university_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取大学最新的爬取结果"""
|
||||||
|
result = db.query(ScrapeResult).filter(
|
||||||
|
ScrapeResult.university_id == university_id
|
||||||
|
).order_by(ScrapeResult.created_at.desc()).first()
|
||||||
|
|
||||||
|
if not result:
|
||||||
|
raise HTTPException(status_code=404, detail="没有爬取结果")
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/university/{university_id}/schools")
|
||||||
|
def get_schools(
|
||||||
|
university_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取学院列表"""
|
||||||
|
result = db.query(ScrapeResult).filter(
|
||||||
|
ScrapeResult.university_id == university_id
|
||||||
|
).order_by(ScrapeResult.created_at.desc()).first()
|
||||||
|
|
||||||
|
if not result:
|
||||||
|
raise HTTPException(status_code=404, detail="没有爬取结果")
|
||||||
|
|
||||||
|
schools = result.result_data.get("schools", [])
|
||||||
|
|
||||||
|
# 返回简化的学院列表
|
||||||
|
return {
|
||||||
|
"total": len(schools),
|
||||||
|
"schools": [
|
||||||
|
{
|
||||||
|
"name": s.get("name"),
|
||||||
|
"url": s.get("url"),
|
||||||
|
"program_count": len(s.get("programs", []))
|
||||||
|
}
|
||||||
|
for s in schools
|
||||||
|
]
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/university/{university_id}/programs")
|
||||||
|
def get_programs(
|
||||||
|
university_id: int,
|
||||||
|
school_name: Optional[str] = Query(None, description="按学院筛选"),
|
||||||
|
search: Optional[str] = Query(None, description="搜索项目名称"),
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取项目列表"""
|
||||||
|
result = db.query(ScrapeResult).filter(
|
||||||
|
ScrapeResult.university_id == university_id
|
||||||
|
).order_by(ScrapeResult.created_at.desc()).first()
|
||||||
|
|
||||||
|
if not result:
|
||||||
|
raise HTTPException(status_code=404, detail="没有爬取结果")
|
||||||
|
|
||||||
|
schools = result.result_data.get("schools", [])
|
||||||
|
programs = []
|
||||||
|
|
||||||
|
for school in schools:
|
||||||
|
if school_name and school.get("name") != school_name:
|
||||||
|
continue
|
||||||
|
|
||||||
|
for prog in school.get("programs", []):
|
||||||
|
if search and search.lower() not in prog.get("name", "").lower():
|
||||||
|
continue
|
||||||
|
|
||||||
|
programs.append({
|
||||||
|
"name": prog.get("name"),
|
||||||
|
"url": prog.get("url"),
|
||||||
|
"degree_type": prog.get("degree_type"),
|
||||||
|
"school": school.get("name"),
|
||||||
|
"faculty_count": len(prog.get("faculty", []))
|
||||||
|
})
|
||||||
|
|
||||||
|
return {
|
||||||
|
"total": len(programs),
|
||||||
|
"programs": programs
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/university/{university_id}/faculty")
|
||||||
|
def get_faculty(
|
||||||
|
university_id: int,
|
||||||
|
school_name: Optional[str] = Query(None, description="按学院筛选"),
|
||||||
|
program_name: Optional[str] = Query(None, description="按项目筛选"),
|
||||||
|
search: Optional[str] = Query(None, description="搜索导师姓名"),
|
||||||
|
skip: int = Query(0, ge=0),
|
||||||
|
limit: int = Query(50, ge=1, le=200),
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取导师列表"""
|
||||||
|
result = db.query(ScrapeResult).filter(
|
||||||
|
ScrapeResult.university_id == university_id
|
||||||
|
).order_by(ScrapeResult.created_at.desc()).first()
|
||||||
|
|
||||||
|
if not result:
|
||||||
|
raise HTTPException(status_code=404, detail="没有爬取结果")
|
||||||
|
|
||||||
|
schools = result.result_data.get("schools", [])
|
||||||
|
faculty_list = []
|
||||||
|
|
||||||
|
for school in schools:
|
||||||
|
if school_name and school.get("name") != school_name:
|
||||||
|
continue
|
||||||
|
|
||||||
|
for prog in school.get("programs", []):
|
||||||
|
if program_name and prog.get("name") != program_name:
|
||||||
|
continue
|
||||||
|
|
||||||
|
for fac in prog.get("faculty", []):
|
||||||
|
if search and search.lower() not in fac.get("name", "").lower():
|
||||||
|
continue
|
||||||
|
|
||||||
|
faculty_list.append({
|
||||||
|
"name": fac.get("name"),
|
||||||
|
"url": fac.get("url"),
|
||||||
|
"title": fac.get("title"),
|
||||||
|
"email": fac.get("email"),
|
||||||
|
"program": prog.get("name"),
|
||||||
|
"school": school.get("name")
|
||||||
|
})
|
||||||
|
|
||||||
|
total = len(faculty_list)
|
||||||
|
faculty_list = faculty_list[skip:skip + limit]
|
||||||
|
|
||||||
|
return {
|
||||||
|
"total": total,
|
||||||
|
"skip": skip,
|
||||||
|
"limit": limit,
|
||||||
|
"faculty": faculty_list
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/university/{university_id}/export")
|
||||||
|
def export_result(
|
||||||
|
university_id: int,
|
||||||
|
format: str = Query("json", enum=["json"]),
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""导出爬取结果"""
|
||||||
|
result = db.query(ScrapeResult).filter(
|
||||||
|
ScrapeResult.university_id == university_id
|
||||||
|
).order_by(ScrapeResult.created_at.desc()).first()
|
||||||
|
|
||||||
|
if not result:
|
||||||
|
raise HTTPException(status_code=404, detail="没有爬取结果")
|
||||||
|
|
||||||
|
if format == "json":
|
||||||
|
return JSONResponse(
|
||||||
|
content=result.result_data,
|
||||||
|
headers={
|
||||||
|
"Content-Disposition": f"attachment; filename=university_{university_id}_result.json"
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
raise HTTPException(status_code=400, detail="不支持的格式")
|
||||||
167
backend/app/api/scripts.py
Normal file
167
backend/app/api/scripts.py
Normal file
@ -0,0 +1,167 @@
|
|||||||
|
"""爬虫脚本API"""
|
||||||
|
|
||||||
|
from typing import List
|
||||||
|
from fastapi import APIRouter, Depends, HTTPException, BackgroundTasks
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from ..database import get_db
|
||||||
|
from ..models import University, ScraperScript
|
||||||
|
from ..schemas.script import (
|
||||||
|
ScriptCreate,
|
||||||
|
ScriptResponse,
|
||||||
|
GenerateScriptRequest,
|
||||||
|
GenerateScriptResponse
|
||||||
|
)
|
||||||
|
from ..services.script_generator import generate_scraper_script
|
||||||
|
|
||||||
|
router = APIRouter()
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/generate", response_model=GenerateScriptResponse)
|
||||||
|
async def generate_script(
|
||||||
|
data: GenerateScriptRequest,
|
||||||
|
background_tasks: BackgroundTasks,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
一键生成爬虫脚本
|
||||||
|
|
||||||
|
分析大学网站结构,自动生成爬虫脚本
|
||||||
|
"""
|
||||||
|
# 检查或创建大学记录
|
||||||
|
university = db.query(University).filter(University.url == data.university_url).first()
|
||||||
|
|
||||||
|
if not university:
|
||||||
|
# 从URL提取大学名称
|
||||||
|
name = data.university_name
|
||||||
|
if not name:
|
||||||
|
from urllib.parse import urlparse
|
||||||
|
parsed = urlparse(data.university_url)
|
||||||
|
name = parsed.netloc.replace("www.", "").split(".")[0].title()
|
||||||
|
|
||||||
|
university = University(
|
||||||
|
name=name,
|
||||||
|
url=data.university_url,
|
||||||
|
status="analyzing"
|
||||||
|
)
|
||||||
|
db.add(university)
|
||||||
|
db.commit()
|
||||||
|
db.refresh(university)
|
||||||
|
else:
|
||||||
|
# 更新状态
|
||||||
|
university.status = "analyzing"
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
# 在后台执行脚本生成
|
||||||
|
background_tasks.add_task(
|
||||||
|
generate_scraper_script,
|
||||||
|
university_id=university.id,
|
||||||
|
university_url=data.university_url
|
||||||
|
)
|
||||||
|
|
||||||
|
return GenerateScriptResponse(
|
||||||
|
success=True,
|
||||||
|
university_id=university.id,
|
||||||
|
script_id=None,
|
||||||
|
message="正在分析网站结构并生成爬虫脚本...",
|
||||||
|
status="analyzing"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/university/{university_id}", response_model=List[ScriptResponse])
|
||||||
|
def get_university_scripts(
|
||||||
|
university_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取大学的所有爬虫脚本"""
|
||||||
|
scripts = db.query(ScraperScript).filter(
|
||||||
|
ScraperScript.university_id == university_id
|
||||||
|
).order_by(ScraperScript.version.desc()).all()
|
||||||
|
|
||||||
|
return scripts
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/{script_id}", response_model=ScriptResponse)
|
||||||
|
def get_script(
|
||||||
|
script_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取脚本详情"""
|
||||||
|
script = db.query(ScraperScript).filter(ScraperScript.id == script_id).first()
|
||||||
|
if not script:
|
||||||
|
raise HTTPException(status_code=404, detail="脚本不存在")
|
||||||
|
|
||||||
|
return script
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("", response_model=ScriptResponse)
|
||||||
|
def create_script(
|
||||||
|
data: ScriptCreate,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""手动创建脚本"""
|
||||||
|
# 检查大学是否存在
|
||||||
|
university = db.query(University).filter(University.id == data.university_id).first()
|
||||||
|
if not university:
|
||||||
|
raise HTTPException(status_code=404, detail="大学不存在")
|
||||||
|
|
||||||
|
# 获取当前最高版本
|
||||||
|
max_version = db.query(ScraperScript).filter(
|
||||||
|
ScraperScript.university_id == data.university_id
|
||||||
|
).count()
|
||||||
|
|
||||||
|
script = ScraperScript(
|
||||||
|
university_id=data.university_id,
|
||||||
|
script_name=data.script_name,
|
||||||
|
script_content=data.script_content,
|
||||||
|
config_content=data.config_content,
|
||||||
|
version=max_version + 1,
|
||||||
|
status="active"
|
||||||
|
)
|
||||||
|
|
||||||
|
db.add(script)
|
||||||
|
db.commit()
|
||||||
|
db.refresh(script)
|
||||||
|
|
||||||
|
# 更新大学状态
|
||||||
|
university.status = "ready"
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return script
|
||||||
|
|
||||||
|
|
||||||
|
@router.put("/{script_id}", response_model=ScriptResponse)
|
||||||
|
def update_script(
|
||||||
|
script_id: int,
|
||||||
|
data: ScriptCreate,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""更新脚本"""
|
||||||
|
script = db.query(ScraperScript).filter(ScraperScript.id == script_id).first()
|
||||||
|
if not script:
|
||||||
|
raise HTTPException(status_code=404, detail="脚本不存在")
|
||||||
|
|
||||||
|
script.script_content = data.script_content
|
||||||
|
if data.config_content:
|
||||||
|
script.config_content = data.config_content
|
||||||
|
|
||||||
|
db.commit()
|
||||||
|
db.refresh(script)
|
||||||
|
|
||||||
|
return script
|
||||||
|
|
||||||
|
|
||||||
|
@router.delete("/{script_id}")
|
||||||
|
def delete_script(
|
||||||
|
script_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""删除脚本"""
|
||||||
|
script = db.query(ScraperScript).filter(ScraperScript.id == script_id).first()
|
||||||
|
if not script:
|
||||||
|
raise HTTPException(status_code=404, detail="脚本不存在")
|
||||||
|
|
||||||
|
db.delete(script)
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return {"message": "删除成功"}
|
||||||
165
backend/app/api/universities.py
Normal file
165
backend/app/api/universities.py
Normal file
@ -0,0 +1,165 @@
|
|||||||
|
"""大学管理API"""
|
||||||
|
|
||||||
|
from typing import List, Optional
|
||||||
|
from fastapi import APIRouter, Depends, HTTPException, Query
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from ..database import get_db
|
||||||
|
from ..models import University, ScrapeResult
|
||||||
|
from ..schemas.university import (
|
||||||
|
UniversityCreate,
|
||||||
|
UniversityUpdate,
|
||||||
|
UniversityResponse,
|
||||||
|
UniversityListResponse
|
||||||
|
)
|
||||||
|
|
||||||
|
router = APIRouter()
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("", response_model=UniversityListResponse)
|
||||||
|
def list_universities(
|
||||||
|
skip: int = Query(0, ge=0),
|
||||||
|
limit: int = Query(20, ge=1, le=100),
|
||||||
|
search: Optional[str] = None,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取大学列表"""
|
||||||
|
query = db.query(University)
|
||||||
|
|
||||||
|
if search:
|
||||||
|
query = query.filter(University.name.ilike(f"%{search}%"))
|
||||||
|
|
||||||
|
total = query.count()
|
||||||
|
universities = query.order_by(University.created_at.desc()).offset(skip).limit(limit).all()
|
||||||
|
|
||||||
|
# 添加统计信息
|
||||||
|
items = []
|
||||||
|
for uni in universities:
|
||||||
|
# 获取最新结果
|
||||||
|
latest_result = db.query(ScrapeResult).filter(
|
||||||
|
ScrapeResult.university_id == uni.id
|
||||||
|
).order_by(ScrapeResult.created_at.desc()).first()
|
||||||
|
|
||||||
|
items.append(UniversityResponse(
|
||||||
|
id=uni.id,
|
||||||
|
name=uni.name,
|
||||||
|
url=uni.url,
|
||||||
|
country=uni.country,
|
||||||
|
description=uni.description,
|
||||||
|
status=uni.status,
|
||||||
|
created_at=uni.created_at,
|
||||||
|
updated_at=uni.updated_at,
|
||||||
|
scripts_count=len(uni.scripts),
|
||||||
|
jobs_count=len(uni.jobs),
|
||||||
|
latest_result={
|
||||||
|
"schools_count": latest_result.schools_count,
|
||||||
|
"programs_count": latest_result.programs_count,
|
||||||
|
"faculty_count": latest_result.faculty_count,
|
||||||
|
"created_at": latest_result.created_at.isoformat()
|
||||||
|
} if latest_result else None
|
||||||
|
))
|
||||||
|
|
||||||
|
return UniversityListResponse(total=total, items=items)
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("", response_model=UniversityResponse)
|
||||||
|
def create_university(
|
||||||
|
data: UniversityCreate,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""创建大学"""
|
||||||
|
# 检查是否已存在
|
||||||
|
existing = db.query(University).filter(University.url == data.url).first()
|
||||||
|
if existing:
|
||||||
|
raise HTTPException(status_code=400, detail="该大学URL已存在")
|
||||||
|
|
||||||
|
university = University(**data.model_dump())
|
||||||
|
db.add(university)
|
||||||
|
db.commit()
|
||||||
|
db.refresh(university)
|
||||||
|
|
||||||
|
return UniversityResponse(
|
||||||
|
id=university.id,
|
||||||
|
name=university.name,
|
||||||
|
url=university.url,
|
||||||
|
country=university.country,
|
||||||
|
description=university.description,
|
||||||
|
status=university.status,
|
||||||
|
created_at=university.created_at,
|
||||||
|
updated_at=university.updated_at,
|
||||||
|
scripts_count=0,
|
||||||
|
jobs_count=0,
|
||||||
|
latest_result=None
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/{university_id}", response_model=UniversityResponse)
|
||||||
|
def get_university(
|
||||||
|
university_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取大学详情"""
|
||||||
|
university = db.query(University).filter(University.id == university_id).first()
|
||||||
|
if not university:
|
||||||
|
raise HTTPException(status_code=404, detail="大学不存在")
|
||||||
|
|
||||||
|
# 获取最新结果
|
||||||
|
latest_result = db.query(ScrapeResult).filter(
|
||||||
|
ScrapeResult.university_id == university.id
|
||||||
|
).order_by(ScrapeResult.created_at.desc()).first()
|
||||||
|
|
||||||
|
return UniversityResponse(
|
||||||
|
id=university.id,
|
||||||
|
name=university.name,
|
||||||
|
url=university.url,
|
||||||
|
country=university.country,
|
||||||
|
description=university.description,
|
||||||
|
status=university.status,
|
||||||
|
created_at=university.created_at,
|
||||||
|
updated_at=university.updated_at,
|
||||||
|
scripts_count=len(university.scripts),
|
||||||
|
jobs_count=len(university.jobs),
|
||||||
|
latest_result={
|
||||||
|
"schools_count": latest_result.schools_count,
|
||||||
|
"programs_count": latest_result.programs_count,
|
||||||
|
"faculty_count": latest_result.faculty_count,
|
||||||
|
"created_at": latest_result.created_at.isoformat()
|
||||||
|
} if latest_result else None
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.put("/{university_id}", response_model=UniversityResponse)
|
||||||
|
def update_university(
|
||||||
|
university_id: int,
|
||||||
|
data: UniversityUpdate,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""更新大学信息"""
|
||||||
|
university = db.query(University).filter(University.id == university_id).first()
|
||||||
|
if not university:
|
||||||
|
raise HTTPException(status_code=404, detail="大学不存在")
|
||||||
|
|
||||||
|
update_data = data.model_dump(exclude_unset=True)
|
||||||
|
for field, value in update_data.items():
|
||||||
|
setattr(university, field, value)
|
||||||
|
|
||||||
|
db.commit()
|
||||||
|
db.refresh(university)
|
||||||
|
|
||||||
|
return get_university(university_id, db)
|
||||||
|
|
||||||
|
|
||||||
|
@router.delete("/{university_id}")
|
||||||
|
def delete_university(
|
||||||
|
university_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""删除大学"""
|
||||||
|
university = db.query(University).filter(University.id == university_id).first()
|
||||||
|
if not university:
|
||||||
|
raise HTTPException(status_code=404, detail="大学不存在")
|
||||||
|
|
||||||
|
db.delete(university)
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return {"message": "删除成功"}
|
||||||
37
backend/app/config.py
Normal file
37
backend/app/config.py
Normal file
@ -0,0 +1,37 @@
|
|||||||
|
"""应用配置"""
|
||||||
|
|
||||||
|
from pydantic_settings import BaseSettings
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
|
||||||
|
class Settings(BaseSettings):
|
||||||
|
"""应用设置"""
|
||||||
|
|
||||||
|
# 应用配置
|
||||||
|
APP_NAME: str = "University Scraper API"
|
||||||
|
APP_VERSION: str = "1.0.0"
|
||||||
|
DEBUG: bool = True
|
||||||
|
|
||||||
|
# 数据库配置
|
||||||
|
DATABASE_URL: str = "sqlite:///./university_scraper.db" # 开发环境使用SQLite
|
||||||
|
# 生产环境使用: postgresql://user:password@localhost/university_scraper
|
||||||
|
|
||||||
|
# Redis配置 (用于Celery任务队列)
|
||||||
|
REDIS_URL: str = "redis://localhost:6379/0"
|
||||||
|
|
||||||
|
# CORS配置
|
||||||
|
CORS_ORIGINS: list = ["http://localhost:3000", "http://127.0.0.1:3000"]
|
||||||
|
|
||||||
|
# Agent配置 (用于自动生成脚本)
|
||||||
|
OPENROUTER_API_KEY: Optional[str] = None
|
||||||
|
|
||||||
|
# 文件存储路径
|
||||||
|
SCRIPTS_DIR: str = "./scripts"
|
||||||
|
RESULTS_DIR: str = "./results"
|
||||||
|
|
||||||
|
class Config:
|
||||||
|
env_file = ".env"
|
||||||
|
case_sensitive = True
|
||||||
|
|
||||||
|
|
||||||
|
settings = Settings()
|
||||||
35
backend/app/database.py
Normal file
35
backend/app/database.py
Normal file
@ -0,0 +1,35 @@
|
|||||||
|
"""数据库连接和会话管理"""
|
||||||
|
|
||||||
|
from sqlalchemy import create_engine
|
||||||
|
from sqlalchemy.ext.declarative import declarative_base
|
||||||
|
from sqlalchemy.orm import sessionmaker
|
||||||
|
|
||||||
|
from .config import settings
|
||||||
|
|
||||||
|
# 创建数据库引擎
|
||||||
|
engine = create_engine(
|
||||||
|
settings.DATABASE_URL,
|
||||||
|
connect_args={"check_same_thread": False} if "sqlite" in settings.DATABASE_URL else {},
|
||||||
|
echo=settings.DEBUG
|
||||||
|
)
|
||||||
|
|
||||||
|
# 创建会话工厂
|
||||||
|
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
|
||||||
|
|
||||||
|
# 声明基类
|
||||||
|
Base = declarative_base()
|
||||||
|
|
||||||
|
|
||||||
|
def get_db():
|
||||||
|
"""获取数据库会话 (依赖注入)"""
|
||||||
|
db = SessionLocal()
|
||||||
|
try:
|
||||||
|
yield db
|
||||||
|
finally:
|
||||||
|
db.close()
|
||||||
|
|
||||||
|
|
||||||
|
def init_db():
|
||||||
|
"""初始化数据库 (创建所有表)"""
|
||||||
|
from .models import university, script, job, result # noqa
|
||||||
|
Base.metadata.create_all(bind=engine)
|
||||||
72
backend/app/main.py
Normal file
72
backend/app/main.py
Normal file
@ -0,0 +1,72 @@
|
|||||||
|
"""
|
||||||
|
University Scraper Web API
|
||||||
|
|
||||||
|
主应用入口
|
||||||
|
"""
|
||||||
|
|
||||||
|
from fastapi import FastAPI
|
||||||
|
from fastapi.middleware.cors import CORSMiddleware
|
||||||
|
|
||||||
|
from .config import settings
|
||||||
|
from .database import init_db
|
||||||
|
from .api import api_router
|
||||||
|
|
||||||
|
# 创建应用
|
||||||
|
app = FastAPI(
|
||||||
|
title=settings.APP_NAME,
|
||||||
|
version=settings.APP_VERSION,
|
||||||
|
description="""
|
||||||
|
## 大学爬虫Web系统 API
|
||||||
|
|
||||||
|
### 功能
|
||||||
|
- 🏫 **大学管理**: 添加、编辑、删除大学
|
||||||
|
- 📜 **脚本生成**: 一键生成爬虫脚本
|
||||||
|
- 🚀 **任务执行**: 一键运行爬虫
|
||||||
|
- 📊 **数据查看**: 查看和导出爬取结果
|
||||||
|
|
||||||
|
### 数据结构
|
||||||
|
大学 → 学院 → 项目 → 导师
|
||||||
|
""",
|
||||||
|
docs_url="/docs",
|
||||||
|
redoc_url="/redoc"
|
||||||
|
)
|
||||||
|
|
||||||
|
# 配置CORS
|
||||||
|
app.add_middleware(
|
||||||
|
CORSMiddleware,
|
||||||
|
allow_origins=settings.CORS_ORIGINS,
|
||||||
|
allow_credentials=True,
|
||||||
|
allow_methods=["*"],
|
||||||
|
allow_headers=["*"],
|
||||||
|
)
|
||||||
|
|
||||||
|
# 注册路由
|
||||||
|
app.include_router(api_router, prefix="/api")
|
||||||
|
|
||||||
|
|
||||||
|
@app.on_event("startup")
|
||||||
|
async def startup_event():
|
||||||
|
"""应用启动时初始化数据库"""
|
||||||
|
init_db()
|
||||||
|
|
||||||
|
|
||||||
|
@app.get("/")
|
||||||
|
async def root():
|
||||||
|
"""根路由"""
|
||||||
|
return {
|
||||||
|
"name": settings.APP_NAME,
|
||||||
|
"version": settings.APP_VERSION,
|
||||||
|
"docs": "/docs",
|
||||||
|
"api": "/api"
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@app.get("/health")
|
||||||
|
async def health_check():
|
||||||
|
"""健康检查"""
|
||||||
|
return {"status": "healthy"}
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
import uvicorn
|
||||||
|
uvicorn.run(app, host="0.0.0.0", port=8000)
|
||||||
8
backend/app/models/__init__.py
Normal file
8
backend/app/models/__init__.py
Normal file
@ -0,0 +1,8 @@
|
|||||||
|
"""数据库模型"""
|
||||||
|
|
||||||
|
from .university import University
|
||||||
|
from .script import ScraperScript
|
||||||
|
from .job import ScrapeJob, ScrapeLog
|
||||||
|
from .result import ScrapeResult
|
||||||
|
|
||||||
|
__all__ = ["University", "ScraperScript", "ScrapeJob", "ScrapeLog", "ScrapeResult"]
|
||||||
56
backend/app/models/job.py
Normal file
56
backend/app/models/job.py
Normal file
@ -0,0 +1,56 @@
|
|||||||
|
"""爬取任务模型"""
|
||||||
|
|
||||||
|
from datetime import datetime
|
||||||
|
from sqlalchemy import Column, Integer, String, DateTime, Text, ForeignKey
|
||||||
|
from sqlalchemy.orm import relationship
|
||||||
|
|
||||||
|
from ..database import Base
|
||||||
|
|
||||||
|
|
||||||
|
class ScrapeJob(Base):
|
||||||
|
"""爬取任务表"""
|
||||||
|
|
||||||
|
__tablename__ = "scrape_jobs"
|
||||||
|
|
||||||
|
id = Column(Integer, primary_key=True, index=True)
|
||||||
|
university_id = Column(Integer, ForeignKey("universities.id"), nullable=False)
|
||||||
|
script_id = Column(Integer, ForeignKey("scraper_scripts.id"))
|
||||||
|
|
||||||
|
status = Column(String(50), default="pending") # pending, running, completed, failed, cancelled
|
||||||
|
progress = Column(Integer, default=0) # 0-100 进度百分比
|
||||||
|
current_step = Column(String(255)) # 当前步骤描述
|
||||||
|
|
||||||
|
started_at = Column(DateTime)
|
||||||
|
completed_at = Column(DateTime)
|
||||||
|
error_message = Column(Text)
|
||||||
|
|
||||||
|
created_at = Column(DateTime, default=datetime.utcnow)
|
||||||
|
|
||||||
|
# 关联
|
||||||
|
university = relationship("University", back_populates="jobs")
|
||||||
|
script = relationship("ScraperScript", back_populates="jobs")
|
||||||
|
logs = relationship("ScrapeLog", back_populates="job", cascade="all, delete-orphan")
|
||||||
|
results = relationship("ScrapeResult", back_populates="job", cascade="all, delete-orphan")
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
return f"<ScrapeJob(id={self.id}, status='{self.status}')>"
|
||||||
|
|
||||||
|
|
||||||
|
class ScrapeLog(Base):
|
||||||
|
"""爬取日志表"""
|
||||||
|
|
||||||
|
__tablename__ = "scrape_logs"
|
||||||
|
|
||||||
|
id = Column(Integer, primary_key=True, index=True)
|
||||||
|
job_id = Column(Integer, ForeignKey("scrape_jobs.id"), nullable=False)
|
||||||
|
|
||||||
|
level = Column(String(20), default="info") # debug, info, warning, error
|
||||||
|
message = Column(Text, nullable=False)
|
||||||
|
|
||||||
|
created_at = Column(DateTime, default=datetime.utcnow)
|
||||||
|
|
||||||
|
# 关联
|
||||||
|
job = relationship("ScrapeJob", back_populates="logs")
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
return f"<ScrapeLog(id={self.id}, level='{self.level}')>"
|
||||||
34
backend/app/models/result.py
Normal file
34
backend/app/models/result.py
Normal file
@ -0,0 +1,34 @@
|
|||||||
|
"""爬取结果模型"""
|
||||||
|
|
||||||
|
from datetime import datetime
|
||||||
|
from sqlalchemy import Column, Integer, DateTime, ForeignKey, JSON
|
||||||
|
from sqlalchemy.orm import relationship
|
||||||
|
|
||||||
|
from ..database import Base
|
||||||
|
|
||||||
|
|
||||||
|
class ScrapeResult(Base):
|
||||||
|
"""爬取结果表"""
|
||||||
|
|
||||||
|
__tablename__ = "scrape_results"
|
||||||
|
|
||||||
|
id = Column(Integer, primary_key=True, index=True)
|
||||||
|
job_id = Column(Integer, ForeignKey("scrape_jobs.id"))
|
||||||
|
university_id = Column(Integer, ForeignKey("universities.id"), nullable=False)
|
||||||
|
|
||||||
|
# JSON数据: 学院 → 项目 → 导师 层级结构
|
||||||
|
result_data = Column(JSON, nullable=False)
|
||||||
|
|
||||||
|
# 统计信息
|
||||||
|
schools_count = Column(Integer, default=0)
|
||||||
|
programs_count = Column(Integer, default=0)
|
||||||
|
faculty_count = Column(Integer, default=0)
|
||||||
|
|
||||||
|
created_at = Column(DateTime, default=datetime.utcnow)
|
||||||
|
|
||||||
|
# 关联
|
||||||
|
job = relationship("ScrapeJob", back_populates="results")
|
||||||
|
university = relationship("University", back_populates="results")
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
return f"<ScrapeResult(id={self.id}, programs={self.programs_count}, faculty={self.faculty_count})>"
|
||||||
34
backend/app/models/script.py
Normal file
34
backend/app/models/script.py
Normal file
@ -0,0 +1,34 @@
|
|||||||
|
"""爬虫脚本模型"""
|
||||||
|
|
||||||
|
from datetime import datetime
|
||||||
|
from sqlalchemy import Column, Integer, String, DateTime, Text, ForeignKey, JSON
|
||||||
|
from sqlalchemy.orm import relationship
|
||||||
|
|
||||||
|
from ..database import Base
|
||||||
|
|
||||||
|
|
||||||
|
class ScraperScript(Base):
|
||||||
|
"""爬虫脚本表"""
|
||||||
|
|
||||||
|
__tablename__ = "scraper_scripts"
|
||||||
|
|
||||||
|
id = Column(Integer, primary_key=True, index=True)
|
||||||
|
university_id = Column(Integer, ForeignKey("universities.id"), nullable=False)
|
||||||
|
|
||||||
|
script_name = Column(String(255), nullable=False)
|
||||||
|
script_content = Column(Text, nullable=False) # Python脚本代码
|
||||||
|
config_content = Column(JSON) # YAML配置转为JSON存储
|
||||||
|
|
||||||
|
version = Column(Integer, default=1)
|
||||||
|
status = Column(String(50), default="draft") # draft, active, deprecated, error
|
||||||
|
error_message = Column(Text)
|
||||||
|
|
||||||
|
created_at = Column(DateTime, default=datetime.utcnow)
|
||||||
|
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
|
||||||
|
|
||||||
|
# 关联
|
||||||
|
university = relationship("University", back_populates="scripts")
|
||||||
|
jobs = relationship("ScrapeJob", back_populates="script")
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
return f"<ScraperScript(id={self.id}, name='{self.script_name}')>"
|
||||||
31
backend/app/models/university.py
Normal file
31
backend/app/models/university.py
Normal file
@ -0,0 +1,31 @@
|
|||||||
|
"""大学模型"""
|
||||||
|
|
||||||
|
from datetime import datetime
|
||||||
|
from sqlalchemy import Column, Integer, String, DateTime, Text
|
||||||
|
from sqlalchemy.orm import relationship
|
||||||
|
|
||||||
|
from ..database import Base
|
||||||
|
|
||||||
|
|
||||||
|
class University(Base):
|
||||||
|
"""大学表"""
|
||||||
|
|
||||||
|
__tablename__ = "universities"
|
||||||
|
|
||||||
|
id = Column(Integer, primary_key=True, index=True)
|
||||||
|
name = Column(String(255), nullable=False, index=True)
|
||||||
|
url = Column(String(500), nullable=False)
|
||||||
|
country = Column(String(100))
|
||||||
|
description = Column(Text)
|
||||||
|
status = Column(String(50), default="pending") # pending, analyzing, ready, error
|
||||||
|
|
||||||
|
created_at = Column(DateTime, default=datetime.utcnow)
|
||||||
|
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
|
||||||
|
|
||||||
|
# 关联
|
||||||
|
scripts = relationship("ScraperScript", back_populates="university", cascade="all, delete-orphan")
|
||||||
|
jobs = relationship("ScrapeJob", back_populates="university", cascade="all, delete-orphan")
|
||||||
|
results = relationship("ScrapeResult", back_populates="university", cascade="all, delete-orphan")
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
return f"<University(id={self.id}, name='{self.name}')>"
|
||||||
33
backend/app/schemas/__init__.py
Normal file
33
backend/app/schemas/__init__.py
Normal file
@ -0,0 +1,33 @@
|
|||||||
|
"""Pydantic schemas for API"""
|
||||||
|
|
||||||
|
from .university import (
|
||||||
|
UniversityCreate,
|
||||||
|
UniversityUpdate,
|
||||||
|
UniversityResponse,
|
||||||
|
UniversityListResponse
|
||||||
|
)
|
||||||
|
from .script import (
|
||||||
|
ScriptCreate,
|
||||||
|
ScriptResponse,
|
||||||
|
GenerateScriptRequest,
|
||||||
|
GenerateScriptResponse
|
||||||
|
)
|
||||||
|
from .job import (
|
||||||
|
JobCreate,
|
||||||
|
JobResponse,
|
||||||
|
JobStatusResponse,
|
||||||
|
LogResponse
|
||||||
|
)
|
||||||
|
from .result import (
|
||||||
|
ResultResponse,
|
||||||
|
SchoolData,
|
||||||
|
ProgramData,
|
||||||
|
FacultyData
|
||||||
|
)
|
||||||
|
|
||||||
|
__all__ = [
|
||||||
|
"UniversityCreate", "UniversityUpdate", "UniversityResponse", "UniversityListResponse",
|
||||||
|
"ScriptCreate", "ScriptResponse", "GenerateScriptRequest", "GenerateScriptResponse",
|
||||||
|
"JobCreate", "JobResponse", "JobStatusResponse", "LogResponse",
|
||||||
|
"ResultResponse", "SchoolData", "ProgramData", "FacultyData"
|
||||||
|
]
|
||||||
52
backend/app/schemas/job.py
Normal file
52
backend/app/schemas/job.py
Normal file
@ -0,0 +1,52 @@
|
|||||||
|
"""爬取任务相关的Pydantic模型"""
|
||||||
|
|
||||||
|
from datetime import datetime
|
||||||
|
from typing import Optional, List
|
||||||
|
from pydantic import BaseModel
|
||||||
|
|
||||||
|
|
||||||
|
class JobCreate(BaseModel):
|
||||||
|
"""创建任务请求"""
|
||||||
|
university_id: int
|
||||||
|
script_id: Optional[int] = None
|
||||||
|
|
||||||
|
|
||||||
|
class JobResponse(BaseModel):
|
||||||
|
"""任务响应"""
|
||||||
|
id: int
|
||||||
|
university_id: int
|
||||||
|
script_id: Optional[int] = None
|
||||||
|
status: str
|
||||||
|
progress: int
|
||||||
|
current_step: Optional[str] = None
|
||||||
|
started_at: Optional[datetime] = None
|
||||||
|
completed_at: Optional[datetime] = None
|
||||||
|
error_message: Optional[str] = None
|
||||||
|
created_at: datetime
|
||||||
|
|
||||||
|
class Config:
|
||||||
|
from_attributes = True
|
||||||
|
|
||||||
|
|
||||||
|
class JobStatusResponse(BaseModel):
|
||||||
|
"""任务状态响应"""
|
||||||
|
id: int
|
||||||
|
status: str
|
||||||
|
progress: int
|
||||||
|
current_step: Optional[str] = None
|
||||||
|
logs: List["LogResponse"] = []
|
||||||
|
|
||||||
|
|
||||||
|
class LogResponse(BaseModel):
|
||||||
|
"""日志响应"""
|
||||||
|
id: int
|
||||||
|
level: str
|
||||||
|
message: str
|
||||||
|
created_at: datetime
|
||||||
|
|
||||||
|
class Config:
|
||||||
|
from_attributes = True
|
||||||
|
|
||||||
|
|
||||||
|
# 解决循环引用
|
||||||
|
JobStatusResponse.model_rebuild()
|
||||||
67
backend/app/schemas/result.py
Normal file
67
backend/app/schemas/result.py
Normal file
@ -0,0 +1,67 @@
|
|||||||
|
"""爬取结果相关的Pydantic模型"""
|
||||||
|
|
||||||
|
from datetime import datetime
|
||||||
|
from typing import Optional, List, Dict, Any
|
||||||
|
from pydantic import BaseModel
|
||||||
|
|
||||||
|
|
||||||
|
class FacultyData(BaseModel):
|
||||||
|
"""导师数据"""
|
||||||
|
name: str
|
||||||
|
url: str
|
||||||
|
title: Optional[str] = None
|
||||||
|
email: Optional[str] = None
|
||||||
|
department: Optional[str] = None
|
||||||
|
|
||||||
|
|
||||||
|
class ProgramData(BaseModel):
|
||||||
|
"""项目数据"""
|
||||||
|
name: str
|
||||||
|
url: str
|
||||||
|
degree_type: Optional[str] = None
|
||||||
|
description: Optional[str] = None
|
||||||
|
faculty_page_url: Optional[str] = None
|
||||||
|
faculty_count: int = 0
|
||||||
|
faculty: List[FacultyData] = []
|
||||||
|
|
||||||
|
|
||||||
|
class SchoolData(BaseModel):
|
||||||
|
"""学院数据"""
|
||||||
|
name: str
|
||||||
|
url: str
|
||||||
|
description: Optional[str] = None
|
||||||
|
program_count: int = 0
|
||||||
|
programs: List[ProgramData] = []
|
||||||
|
|
||||||
|
|
||||||
|
class ResultResponse(BaseModel):
|
||||||
|
"""完整结果响应"""
|
||||||
|
id: int
|
||||||
|
university_id: int
|
||||||
|
job_id: Optional[int] = None
|
||||||
|
|
||||||
|
# 统计
|
||||||
|
schools_count: int
|
||||||
|
programs_count: int
|
||||||
|
faculty_count: int
|
||||||
|
|
||||||
|
# 完整数据
|
||||||
|
result_data: Dict[str, Any]
|
||||||
|
|
||||||
|
created_at: datetime
|
||||||
|
|
||||||
|
class Config:
|
||||||
|
from_attributes = True
|
||||||
|
|
||||||
|
|
||||||
|
class ResultSummary(BaseModel):
|
||||||
|
"""结果摘要"""
|
||||||
|
id: int
|
||||||
|
university_id: int
|
||||||
|
schools_count: int
|
||||||
|
programs_count: int
|
||||||
|
faculty_count: int
|
||||||
|
created_at: datetime
|
||||||
|
|
||||||
|
class Config:
|
||||||
|
from_attributes = True
|
||||||
46
backend/app/schemas/script.py
Normal file
46
backend/app/schemas/script.py
Normal file
@ -0,0 +1,46 @@
|
|||||||
|
"""爬虫脚本相关的Pydantic模型"""
|
||||||
|
|
||||||
|
from datetime import datetime
|
||||||
|
from typing import Optional, Dict, Any
|
||||||
|
from pydantic import BaseModel
|
||||||
|
|
||||||
|
|
||||||
|
class ScriptBase(BaseModel):
|
||||||
|
"""脚本基础字段"""
|
||||||
|
script_name: str
|
||||||
|
script_content: str
|
||||||
|
config_content: Optional[Dict[str, Any]] = None
|
||||||
|
|
||||||
|
|
||||||
|
class ScriptCreate(ScriptBase):
|
||||||
|
"""创建脚本请求"""
|
||||||
|
university_id: int
|
||||||
|
|
||||||
|
|
||||||
|
class ScriptResponse(ScriptBase):
|
||||||
|
"""脚本响应"""
|
||||||
|
id: int
|
||||||
|
university_id: int
|
||||||
|
version: int
|
||||||
|
status: str
|
||||||
|
error_message: Optional[str] = None
|
||||||
|
created_at: datetime
|
||||||
|
updated_at: datetime
|
||||||
|
|
||||||
|
class Config:
|
||||||
|
from_attributes = True
|
||||||
|
|
||||||
|
|
||||||
|
class GenerateScriptRequest(BaseModel):
|
||||||
|
"""生成脚本请求"""
|
||||||
|
university_url: str
|
||||||
|
university_name: Optional[str] = None
|
||||||
|
|
||||||
|
|
||||||
|
class GenerateScriptResponse(BaseModel):
|
||||||
|
"""生成脚本响应"""
|
||||||
|
success: bool
|
||||||
|
university_id: int
|
||||||
|
script_id: Optional[int] = None
|
||||||
|
message: str
|
||||||
|
status: str # analyzing, completed, failed
|
||||||
48
backend/app/schemas/university.py
Normal file
48
backend/app/schemas/university.py
Normal file
@ -0,0 +1,48 @@
|
|||||||
|
"""大学相关的Pydantic模型"""
|
||||||
|
|
||||||
|
from datetime import datetime
|
||||||
|
from typing import Optional, List
|
||||||
|
from pydantic import BaseModel, HttpUrl
|
||||||
|
|
||||||
|
|
||||||
|
class UniversityBase(BaseModel):
|
||||||
|
"""大学基础字段"""
|
||||||
|
name: str
|
||||||
|
url: str
|
||||||
|
country: Optional[str] = None
|
||||||
|
description: Optional[str] = None
|
||||||
|
|
||||||
|
|
||||||
|
class UniversityCreate(UniversityBase):
|
||||||
|
"""创建大学请求"""
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
class UniversityUpdate(BaseModel):
|
||||||
|
"""更新大学请求"""
|
||||||
|
name: Optional[str] = None
|
||||||
|
url: Optional[str] = None
|
||||||
|
country: Optional[str] = None
|
||||||
|
description: Optional[str] = None
|
||||||
|
|
||||||
|
|
||||||
|
class UniversityResponse(UniversityBase):
|
||||||
|
"""大学响应"""
|
||||||
|
id: int
|
||||||
|
status: str
|
||||||
|
created_at: datetime
|
||||||
|
updated_at: datetime
|
||||||
|
|
||||||
|
# 统计信息
|
||||||
|
scripts_count: int = 0
|
||||||
|
jobs_count: int = 0
|
||||||
|
latest_result: Optional[dict] = None
|
||||||
|
|
||||||
|
class Config:
|
||||||
|
from_attributes = True
|
||||||
|
|
||||||
|
|
||||||
|
class UniversityListResponse(BaseModel):
|
||||||
|
"""大学列表响应"""
|
||||||
|
total: int
|
||||||
|
items: List[UniversityResponse]
|
||||||
6
backend/app/services/__init__.py
Normal file
6
backend/app/services/__init__.py
Normal file
@ -0,0 +1,6 @@
|
|||||||
|
"""业务服务"""
|
||||||
|
|
||||||
|
from .script_generator import generate_scraper_script
|
||||||
|
from .scraper_runner import run_scraper
|
||||||
|
|
||||||
|
__all__ = ["generate_scraper_script", "run_scraper"]
|
||||||
177
backend/app/services/scraper_runner.py
Normal file
177
backend/app/services/scraper_runner.py
Normal file
@ -0,0 +1,177 @@
|
|||||||
|
"""
|
||||||
|
爬虫执行服务
|
||||||
|
|
||||||
|
运行爬虫脚本并保存结果
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
import sys
|
||||||
|
import traceback
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from urllib.parse import urljoin, urlparse
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
# Windows 上需要设置事件循环策略
|
||||||
|
if sys.platform == "win32":
|
||||||
|
asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
|
||||||
|
|
||||||
|
# 导入playwright供脚本使用
|
||||||
|
try:
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
PLAYWRIGHT_AVAILABLE = True
|
||||||
|
except ImportError:
|
||||||
|
PLAYWRIGHT_AVAILABLE = False
|
||||||
|
async_playwright = None
|
||||||
|
|
||||||
|
from ..database import SessionLocal
|
||||||
|
from ..models import ScraperScript, ScrapeJob, ScrapeLog, ScrapeResult
|
||||||
|
|
||||||
|
|
||||||
|
def run_scraper(job_id: int, script_id: int):
|
||||||
|
"""
|
||||||
|
执行爬虫的后台任务
|
||||||
|
"""
|
||||||
|
db = SessionLocal()
|
||||||
|
|
||||||
|
try:
|
||||||
|
job = db.query(ScrapeJob).filter(ScrapeJob.id == job_id).first()
|
||||||
|
script = db.query(ScraperScript).filter(ScraperScript.id == script_id).first()
|
||||||
|
|
||||||
|
if not job or not script:
|
||||||
|
return
|
||||||
|
|
||||||
|
# 更新任务状态
|
||||||
|
job.status = "running"
|
||||||
|
job.started_at = datetime.utcnow()
|
||||||
|
job.current_step = "正在初始化..."
|
||||||
|
job.progress = 5
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
_add_log(db, job_id, "info", "开始执行爬虫脚本")
|
||||||
|
|
||||||
|
# 创建日志回调函数
|
||||||
|
def log_callback(level: str, message: str):
|
||||||
|
_add_log(db, job_id, level, message)
|
||||||
|
|
||||||
|
# 执行脚本
|
||||||
|
job.current_step = "正在爬取数据..."
|
||||||
|
job.progress = 20
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
result_data = _execute_script(script.script_content, log_callback)
|
||||||
|
|
||||||
|
if result_data:
|
||||||
|
job.progress = 80
|
||||||
|
job.current_step = "正在保存结果..."
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
_add_log(db, job_id, "info", "爬取完成,正在保存结果...")
|
||||||
|
|
||||||
|
# 计算统计信息
|
||||||
|
schools = result_data.get("schools", [])
|
||||||
|
schools_count = len(schools)
|
||||||
|
programs_count = sum(len(s.get("programs", [])) for s in schools)
|
||||||
|
faculty_count = sum(
|
||||||
|
len(p.get("faculty", []))
|
||||||
|
for s in schools
|
||||||
|
for p in s.get("programs", [])
|
||||||
|
)
|
||||||
|
|
||||||
|
# 保存结果
|
||||||
|
result = ScrapeResult(
|
||||||
|
job_id=job_id,
|
||||||
|
university_id=job.university_id,
|
||||||
|
result_data=result_data,
|
||||||
|
schools_count=schools_count,
|
||||||
|
programs_count=programs_count,
|
||||||
|
faculty_count=faculty_count
|
||||||
|
)
|
||||||
|
db.add(result)
|
||||||
|
|
||||||
|
job.status = "completed"
|
||||||
|
job.progress = 100
|
||||||
|
job.current_step = "完成"
|
||||||
|
job.completed_at = datetime.utcnow()
|
||||||
|
|
||||||
|
_add_log(
|
||||||
|
db, job_id, "info",
|
||||||
|
f"爬取成功: {schools_count}个学院, {programs_count}个项目, {faculty_count}位导师"
|
||||||
|
)
|
||||||
|
|
||||||
|
else:
|
||||||
|
job.status = "failed"
|
||||||
|
job.error_message = "脚本执行无返回结果"
|
||||||
|
job.completed_at = datetime.utcnow()
|
||||||
|
_add_log(db, job_id, "error", "脚本执行失败: 无返回结果")
|
||||||
|
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
error_msg = f"执行出错: {str(e)}\n{traceback.format_exc()}"
|
||||||
|
_add_log(db, job_id, "error", error_msg)
|
||||||
|
|
||||||
|
job = db.query(ScrapeJob).filter(ScrapeJob.id == job_id).first()
|
||||||
|
if job:
|
||||||
|
job.status = "failed"
|
||||||
|
job.error_message = str(e)
|
||||||
|
job.completed_at = datetime.utcnow()
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
finally:
|
||||||
|
db.close()
|
||||||
|
|
||||||
|
|
||||||
|
def _execute_script(script_content: str, log_callback) -> dict:
|
||||||
|
"""
|
||||||
|
执行Python脚本内容
|
||||||
|
|
||||||
|
安全地在隔离环境中执行脚本
|
||||||
|
"""
|
||||||
|
if not PLAYWRIGHT_AVAILABLE:
|
||||||
|
log_callback("error", "Playwright 未安装,请运行: pip install playwright && playwright install")
|
||||||
|
return None
|
||||||
|
|
||||||
|
# 创建执行环境 - 包含脚本需要的所有模块
|
||||||
|
# 注意:使用同一个字典作为 globals 和 locals,确保函数定义可以互相访问
|
||||||
|
exec_namespace = {
|
||||||
|
"__builtins__": __builtins__,
|
||||||
|
"asyncio": asyncio,
|
||||||
|
"json": json,
|
||||||
|
"re": re,
|
||||||
|
"datetime": datetime,
|
||||||
|
"timezone": timezone,
|
||||||
|
"urljoin": urljoin,
|
||||||
|
"urlparse": urlparse,
|
||||||
|
"async_playwright": async_playwright,
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
# 编译并执行脚本 - 使用同一个命名空间确保函数可互相调用
|
||||||
|
exec(script_content, exec_namespace, exec_namespace)
|
||||||
|
|
||||||
|
# 获取scrape函数
|
||||||
|
scrape_func = exec_namespace.get("scrape")
|
||||||
|
if not scrape_func:
|
||||||
|
log_callback("error", "脚本中未找到 scrape 函数")
|
||||||
|
return None
|
||||||
|
|
||||||
|
# 运行异步爬虫函数
|
||||||
|
result = asyncio.run(scrape_func(output_callback=log_callback))
|
||||||
|
return result
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
log_callback("error", f"脚本执行异常: {str(e)}")
|
||||||
|
raise
|
||||||
|
|
||||||
|
|
||||||
|
def _add_log(db: Session, job_id: int, level: str, message: str):
|
||||||
|
"""添加日志"""
|
||||||
|
log = ScrapeLog(
|
||||||
|
job_id=job_id,
|
||||||
|
level=level,
|
||||||
|
message=message
|
||||||
|
)
|
||||||
|
db.add(log)
|
||||||
|
db.commit()
|
||||||
558
backend/app/services/script_generator.py
Normal file
558
backend/app/services/script_generator.py
Normal file
@ -0,0 +1,558 @@
|
|||||||
|
"""
|
||||||
|
爬虫脚本生成服务
|
||||||
|
|
||||||
|
分析大学网站结构,自动生成爬虫脚本
|
||||||
|
"""
|
||||||
|
|
||||||
|
import re
|
||||||
|
from datetime import datetime
|
||||||
|
from urllib.parse import urlparse
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from ..database import SessionLocal
|
||||||
|
from ..models import University, ScraperScript
|
||||||
|
|
||||||
|
|
||||||
|
# 预置的大学爬虫脚本模板
|
||||||
|
SCRAPER_TEMPLATES = {
|
||||||
|
"harvard.edu": "harvard_scraper",
|
||||||
|
"mit.edu": "generic_scraper",
|
||||||
|
"stanford.edu": "generic_scraper",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def generate_scraper_script(university_id: int, university_url: str):
|
||||||
|
"""
|
||||||
|
生成爬虫脚本的后台任务
|
||||||
|
|
||||||
|
1. 分析大学网站域名
|
||||||
|
2. 如果有预置模板则使用模板
|
||||||
|
3. 否则生成通用爬虫脚本
|
||||||
|
"""
|
||||||
|
db = SessionLocal()
|
||||||
|
|
||||||
|
try:
|
||||||
|
university = db.query(University).filter(University.id == university_id).first()
|
||||||
|
if not university:
|
||||||
|
return
|
||||||
|
|
||||||
|
# 解析URL获取域名
|
||||||
|
parsed = urlparse(university_url)
|
||||||
|
domain = parsed.netloc.replace("www.", "")
|
||||||
|
|
||||||
|
# 检查是否有预置模板
|
||||||
|
template_name = None
|
||||||
|
for pattern, template in SCRAPER_TEMPLATES.items():
|
||||||
|
if pattern in domain:
|
||||||
|
template_name = template
|
||||||
|
break
|
||||||
|
|
||||||
|
# 生成脚本
|
||||||
|
script_content = _generate_script_content(domain, template_name)
|
||||||
|
config_content = _generate_config_content(university.name, university_url, domain)
|
||||||
|
|
||||||
|
# 计算版本号
|
||||||
|
existing_count = db.query(ScraperScript).filter(
|
||||||
|
ScraperScript.university_id == university_id
|
||||||
|
).count()
|
||||||
|
|
||||||
|
# 保存脚本
|
||||||
|
script = ScraperScript(
|
||||||
|
university_id=university_id,
|
||||||
|
script_name=f"{domain.replace('.', '_')}_scraper",
|
||||||
|
script_content=script_content,
|
||||||
|
config_content=config_content,
|
||||||
|
version=existing_count + 1,
|
||||||
|
status="active"
|
||||||
|
)
|
||||||
|
|
||||||
|
db.add(script)
|
||||||
|
|
||||||
|
# 更新大学状态
|
||||||
|
university.status = "ready"
|
||||||
|
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
# 记录错误
|
||||||
|
if university:
|
||||||
|
university.status = "error"
|
||||||
|
db.commit()
|
||||||
|
raise e
|
||||||
|
|
||||||
|
finally:
|
||||||
|
db.close()
|
||||||
|
|
||||||
|
|
||||||
|
def _generate_script_content(domain: str, template_name: str = None) -> str:
|
||||||
|
"""生成Python爬虫脚本内容"""
|
||||||
|
|
||||||
|
if template_name == "harvard_scraper":
|
||||||
|
return '''"""
|
||||||
|
Harvard University 专用爬虫脚本
|
||||||
|
自动生成
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
|
||||||
|
# 学院URL映射
|
||||||
|
SCHOOL_MAPPING = {
|
||||||
|
"gsas.harvard.edu": "Graduate School of Arts and Sciences (GSAS)",
|
||||||
|
"seas.harvard.edu": "John A. Paulson School of Engineering and Applied Sciences (SEAS)",
|
||||||
|
"hbs.edu": "Harvard Business School (HBS)",
|
||||||
|
"gsd.harvard.edu": "Graduate School of Design (GSD)",
|
||||||
|
"gse.harvard.edu": "Graduate School of Education (HGSE)",
|
||||||
|
"hks.harvard.edu": "Harvard Kennedy School (HKS)",
|
||||||
|
"hls.harvard.edu": "Harvard Law School (HLS)",
|
||||||
|
"hms.harvard.edu": "Harvard Medical School (HMS)",
|
||||||
|
"hsph.harvard.edu": "T.H. Chan School of Public Health (HSPH)",
|
||||||
|
"hds.harvard.edu": "Harvard Divinity School (HDS)",
|
||||||
|
"fas.harvard.edu": "Faculty of Arts and Sciences (FAS)",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
async def scrape(output_callback=None):
|
||||||
|
"""执行爬取"""
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser = await p.chromium.launch(headless=True)
|
||||||
|
page = await browser.new_page()
|
||||||
|
|
||||||
|
result = {
|
||||||
|
"name": "Harvard University",
|
||||||
|
"url": "https://www.harvard.edu/",
|
||||||
|
"country": "USA",
|
||||||
|
"scraped_at": datetime.now(timezone.utc).isoformat(),
|
||||||
|
"schools": []
|
||||||
|
}
|
||||||
|
|
||||||
|
# 访问项目列表页
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", "访问Harvard项目列表...")
|
||||||
|
|
||||||
|
await page.goto("https://www.harvard.edu/programs/?degree_levels=graduate")
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
# 提取项目数据
|
||||||
|
programs = await page.evaluate("""() => {
|
||||||
|
const items = document.querySelectorAll('[class*="records__record"]');
|
||||||
|
const programs = [];
|
||||||
|
items.forEach(item => {
|
||||||
|
const btn = item.querySelector('button[class*="title-link"]');
|
||||||
|
if (btn) {
|
||||||
|
programs.push({
|
||||||
|
name: btn.innerText.trim(),
|
||||||
|
url: ''
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
return programs;
|
||||||
|
}""")
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"找到 {len(programs)} 个项目")
|
||||||
|
|
||||||
|
# 简化输出
|
||||||
|
result["schools"] = [{
|
||||||
|
"name": "Graduate Programs",
|
||||||
|
"url": "https://www.harvard.edu/programs/",
|
||||||
|
"programs": [{"name": p["name"], "url": p["url"], "faculty": []} for p in programs[:50]]
|
||||||
|
}]
|
||||||
|
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
result = asyncio.run(scrape())
|
||||||
|
print(json.dumps(result, indent=2, ensure_ascii=False))
|
||||||
|
'''
|
||||||
|
|
||||||
|
# 通用爬虫模板 - 深度爬取硕士项目
|
||||||
|
# 使用字符串拼接来避免 f-string 和 JavaScript 引号冲突
|
||||||
|
return _build_generic_scraper_template(domain)
|
||||||
|
|
||||||
|
|
||||||
|
def _build_generic_scraper_template(domain: str) -> str:
|
||||||
|
"""构建通用爬虫模板"""
|
||||||
|
|
||||||
|
# JavaScript code blocks (use raw strings to avoid escaping issues)
|
||||||
|
js_check_courses = r'''() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
let courseCount = 0;
|
||||||
|
for (const a of links) {
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
if (/\/\d{4,}\//.test(href) ||
|
||||||
|
/\/(msc|ma|mba|mres|llm|med|meng)-/.test(href) ||
|
||||||
|
/\/course\/[a-z]/.test(href)) {
|
||||||
|
courseCount++;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return courseCount;
|
||||||
|
}'''
|
||||||
|
|
||||||
|
js_find_list_url = r'''() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const a of links) {
|
||||||
|
const text = a.innerText.toLowerCase();
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
if ((text.includes('a-z') || text.includes('all course') ||
|
||||||
|
text.includes('full list') || text.includes('browse all') ||
|
||||||
|
href.includes('/list')) &&
|
||||||
|
(href.includes('master') || href.includes('course') || href.includes('postgrad'))) {
|
||||||
|
return a.href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}'''
|
||||||
|
|
||||||
|
js_find_courses_from_home = r'''() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const a of links) {
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
const text = a.innerText.toLowerCase();
|
||||||
|
if ((href.includes('master') || href.includes('postgraduate') || href.includes('graduate')) &&
|
||||||
|
(href.includes('course') || href.includes('program') || href.includes('degree'))) {
|
||||||
|
return a.href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}'''
|
||||||
|
|
||||||
|
js_extract_programs = r'''() => {
|
||||||
|
const programs = [];
|
||||||
|
const seen = new Set();
|
||||||
|
const currentHost = window.location.hostname;
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href;
|
||||||
|
const text = a.innerText.trim().replace(/\s+/g, ' ');
|
||||||
|
|
||||||
|
if (!href || seen.has(href)) return;
|
||||||
|
if (text.length < 5 || text.length > 200) return;
|
||||||
|
if (href.includes('#') || href.includes('javascript:') || href.includes('mailto:')) return;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const linkHost = new URL(href).hostname;
|
||||||
|
if (!linkHost.includes(currentHost.replace('www.', '')) &&
|
||||||
|
!currentHost.includes(linkHost.replace('www.', ''))) return;
|
||||||
|
} catch {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
const hrefLower = href.toLowerCase();
|
||||||
|
const textLower = text.toLowerCase();
|
||||||
|
|
||||||
|
const isNavigation = textLower === 'courses' ||
|
||||||
|
textLower === 'programmes' ||
|
||||||
|
textLower === 'undergraduate' ||
|
||||||
|
textLower === 'postgraduate' ||
|
||||||
|
textLower === 'masters' ||
|
||||||
|
textLower === "master's" ||
|
||||||
|
textLower.includes('skip to') ||
|
||||||
|
textLower.includes('share') ||
|
||||||
|
textLower === 'home' ||
|
||||||
|
textLower === 'study' ||
|
||||||
|
textLower.startsWith('a-z') ||
|
||||||
|
textLower.includes('admission') ||
|
||||||
|
textLower.includes('fees and funding') ||
|
||||||
|
textLower.includes('why should') ||
|
||||||
|
textLower.includes('why manchester') ||
|
||||||
|
textLower.includes('teaching and learning') ||
|
||||||
|
textLower.includes('meet us') ||
|
||||||
|
textLower.includes('student support') ||
|
||||||
|
textLower.includes('contact us') ||
|
||||||
|
textLower.includes('how to apply') ||
|
||||||
|
hrefLower.includes('/admissions/') ||
|
||||||
|
hrefLower.includes('/fees-and-funding/') ||
|
||||||
|
hrefLower.includes('/why-') ||
|
||||||
|
hrefLower.includes('/meet-us/') ||
|
||||||
|
hrefLower.includes('/contact-us/') ||
|
||||||
|
hrefLower.includes('/student-support/') ||
|
||||||
|
hrefLower.includes('/teaching-and-learning/') ||
|
||||||
|
hrefLower.endsWith('/courses/') ||
|
||||||
|
hrefLower.endsWith('/masters/') ||
|
||||||
|
hrefLower.endsWith('/postgraduate/');
|
||||||
|
|
||||||
|
if (isNavigation) return;
|
||||||
|
|
||||||
|
const isExcluded = hrefLower.includes('/undergraduate') ||
|
||||||
|
hrefLower.includes('/bachelor') ||
|
||||||
|
hrefLower.includes('/phd/') ||
|
||||||
|
hrefLower.includes('/doctoral') ||
|
||||||
|
hrefLower.includes('/research-degree') ||
|
||||||
|
textLower.includes('bachelor') ||
|
||||||
|
textLower.includes('undergraduate') ||
|
||||||
|
(textLower.includes('phd') && !textLower.includes('mphil'));
|
||||||
|
|
||||||
|
if (isExcluded) return;
|
||||||
|
|
||||||
|
const hasNumericId = /\/\d{4,}\//.test(href);
|
||||||
|
const hasDegreeSlug = /\/(msc|ma|mba|mres|llm|med|meng|mpa|mph|mphil)-[a-z]/.test(hrefLower);
|
||||||
|
const isCoursePage = (hrefLower.includes('/course/') ||
|
||||||
|
hrefLower.includes('/courses/list/') ||
|
||||||
|
hrefLower.includes('/programme/')) &&
|
||||||
|
href.split('/').filter(p => p).length > 4;
|
||||||
|
const textHasDegree = /\b(msc|ma|mba|mres|llm|med|meng|pgcert|pgdip)\b/i.test(text) ||
|
||||||
|
textLower.includes('master');
|
||||||
|
|
||||||
|
if (hasNumericId || hasDegreeSlug || isCoursePage || textHasDegree) {
|
||||||
|
seen.add(href);
|
||||||
|
programs.push({
|
||||||
|
name: text,
|
||||||
|
url: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return programs;
|
||||||
|
}'''
|
||||||
|
|
||||||
|
js_extract_faculty = r'''() => {
|
||||||
|
const faculty = [];
|
||||||
|
const seen = new Set();
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
|
||||||
|
if (seen.has(href)) return;
|
||||||
|
if (text.length < 3 || text.length > 100) return;
|
||||||
|
|
||||||
|
const isStaff = href.includes('/people/') ||
|
||||||
|
href.includes('/staff/') ||
|
||||||
|
href.includes('/faculty/') ||
|
||||||
|
href.includes('/profile/') ||
|
||||||
|
href.includes('/academics/') ||
|
||||||
|
href.includes('/researcher/');
|
||||||
|
|
||||||
|
if (isStaff) {
|
||||||
|
seen.add(href);
|
||||||
|
faculty.push({
|
||||||
|
name: text.replace(/\s+/g, ' '),
|
||||||
|
url: a.href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return faculty.slice(0, 20);
|
||||||
|
}'''
|
||||||
|
|
||||||
|
    university_name = domain.split('.')[0].title()

    template = f'''"""
Generic university scraper script
Target: {domain}
Auto-generated - deep-crawls masters programs and supervisor (faculty) information
"""

import asyncio
import json
import re
from datetime import datetime, timezone
from urllib.parse import urljoin, urlparse
from playwright.async_api import async_playwright


MASTERS_PATHS = [
    "/study/masters/courses/list/",
    "/study/masters/courses/",
    "/postgraduate/taught/courses/",
    "/postgraduate/courses/list/",
    "/postgraduate/courses/",
    "/graduate/programs/",
    "/academics/graduate/programs/",
    "/programmes/masters/",
    "/masters/programmes/",
    "/admissions/graduate/programs/",
]

JS_CHECK_COURSES = """{js_check_courses}"""

JS_FIND_LIST_URL = """{js_find_list_url}"""

JS_FIND_COURSES_FROM_HOME = """{js_find_courses_from_home}"""

JS_EXTRACT_PROGRAMS = """{js_extract_programs}"""

JS_EXTRACT_FACULTY = """{js_extract_faculty}"""


async def find_course_list_page(page, base_url, output_callback):
    for path in MASTERS_PATHS:
        test_url = base_url.rstrip('/') + path
        try:
            response = await page.goto(test_url, wait_until="domcontentloaded", timeout=15000)
            if response and response.status == 200:
                title = await page.title()
                if '404' not in title.lower() and 'not found' not in title.lower():
                    has_courses = await page.evaluate(JS_CHECK_COURSES)
                    if has_courses > 5:
                        if output_callback:
                            output_callback("info", f"Found course list: {{path}} ({{has_courses}} courses)")
                        return test_url

                    list_url = await page.evaluate(JS_FIND_LIST_URL)
                    if list_url:
                        if output_callback:
                            output_callback("info", f"Found full course list: {{list_url}}")
                        return list_url
        except Exception:
            continue

    try:
        await page.goto(base_url, wait_until="domcontentloaded", timeout=30000)
        await page.wait_for_timeout(2000)
        courses_url = await page.evaluate(JS_FIND_COURSES_FROM_HOME)
        if courses_url:
            return courses_url
    except Exception:
        pass

    return None


async def extract_course_links(page, output_callback):
    return await page.evaluate(JS_EXTRACT_PROGRAMS)


async def scrape(output_callback=None):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        )
        page = await context.new_page()

        base_url = "https://www.{domain}/"

        result = {{
            "name": "{university_name} University",
            "url": base_url,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
            "schools": []
        }}

        all_programs = []

        try:
            if output_callback:
                output_callback("info", "Searching for masters course list...")

            courses_url = await find_course_list_page(page, base_url, output_callback)

            if not courses_url:
                if output_callback:
                    output_callback("warning", "Course list not found, using homepage")
                courses_url = base_url

            if output_callback:
                output_callback("info", "Extracting masters programs...")

            await page.goto(courses_url, wait_until="domcontentloaded", timeout=30000)
            await page.wait_for_timeout(3000)

            for _ in range(3):
                try:
                    load_more = page.locator('button:has-text("Load more"), button:has-text("Show more"), button:has-text("View more"), a:has-text("Load more")')
                    if await load_more.count() > 0:
                        await load_more.first.click()
                        await page.wait_for_timeout(2000)
                    else:
                        break
                except Exception:
                    break

            programs_data = await extract_course_links(page, output_callback)

            if output_callback:
                output_callback("info", f"Found {{len(programs_data)}} masters programs")

            max_detail_pages = min(len(programs_data), 30)

            for i, prog in enumerate(programs_data[:max_detail_pages]):
                try:
                    if output_callback and i % 10 == 0:
                        output_callback("info", f"Processing {{i+1}}/{{max_detail_pages}}: {{prog['name'][:50]}}")

                    await page.goto(prog['url'], wait_until="domcontentloaded", timeout=15000)
                    await page.wait_for_timeout(800)

                    faculty_data = await page.evaluate(JS_EXTRACT_FACULTY)

                    all_programs.append({{
                        "name": prog['name'],
                        "url": prog['url'],
                        "faculty": faculty_data
                    }})

                except Exception:
                    all_programs.append({{
                        "name": prog['name'],
                        "url": prog['url'],
                        "faculty": []
                    }})

            for prog in programs_data[max_detail_pages:]:
                all_programs.append({{
                    "name": prog['name'],
                    "url": prog['url'],
                    "faculty": []
                }})

            result["schools"] = [{{
                "name": "Masters Programs",
                "url": courses_url,
                "programs": all_programs
            }}]

            if output_callback:
                total_faculty = sum(len(p.get('faculty', [])) for p in all_programs)
                output_callback("info", f"Done! {{len(all_programs)}} programs, {{total_faculty}} faculty")

        except Exception as e:
            if output_callback:
                output_callback("error", f"Scraping error: {{str(e)}}")

        finally:
            await browser.close()

        return result


if __name__ == "__main__":
    result = asyncio.run(scrape())
    print(json.dumps(result, indent=2, ensure_ascii=False))
'''
    return template


def _generate_config_content(name: str, url: str, domain: str) -> dict:
    """Generate the scraper config content."""
    return {
        "university": {
            "name": name,
            "url": url,
            "domain": domain
        },
        "scraper": {
            "headless": True,
            "timeout": 30000,
            "wait_time": 2000
        },
        "paths_to_try": [
            "/programs",
            "/academics/programs",
            "/graduate",
            "/degrees",
            "/admissions/graduate"
        ],
        "selectors": {
            "program_item": "div.program, li.program, article.program, a[href*='/program']",
            "faculty_item": "div.faculty, li.person, .profile-card"
        },
        "generated_at": datetime.utcnow().isoformat()
    }
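The template string returned above is plain Python source and `_generate_config_content` returns a plain dict; a minimal sketch of how the two might be persisted side by side is shown below. The wrapper function and output layout are assumptions for illustration, not part of this commit.

```python
import json
from pathlib import Path


def write_generated_artifacts(name: str, url: str, domain: str, script_text: str, out_dir: str = "output") -> None:
    """Persist the generated scraper script and its JSON config side by side (hypothetical helper)."""
    target = Path(out_dir) / domain
    target.mkdir(parents=True, exist_ok=True)

    # Script produced by the template generator above, passed in as script_text.
    (target / "scraper.py").write_text(script_text, encoding="utf-8")

    # Config dict produced by _generate_config_content above.
    config = _generate_config_content(name, url, domain)
    (target / "config.json").write_text(
        json.dumps(config, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```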
1
backend/app/tasks/__init__.py
Normal file
@ -0,0 +1 @@
"""Celery tasks (optional, for production deployments)."""
25
backend/requirements.txt
Normal file
@ -0,0 +1,25 @@
# FastAPI Web Framework
fastapi>=0.109.0
uvicorn[standard]>=0.27.0
python-multipart>=0.0.6

# Database
sqlalchemy>=2.0.25
psycopg2-binary>=2.9.9
alembic>=1.13.1

# Task Queue
celery>=5.3.6
redis>=5.0.1

# Utilities
pydantic>=2.9
pydantic-settings>=2.6
python-dotenv>=1.0.0
httpx>=0.28

# Existing scraper dependencies
playwright>=1.48
pyyaml>=6.0

# CORS
143
configs/harvard.yaml
Normal file
@ -0,0 +1,143 @@
# Harvard University scraper config
# Organised by the hierarchy: school → program → supervisor
#
# Harvard's special case: a centralised program list page provides all programs,
# which are then linked to schools and supervisor information via the GSAS pages

university:
  name: "Harvard University"
  url: "https://www.harvard.edu/"
  country: "USA"

# Layer 1: list of schools
schools:
  discovery_method: "static_list"

  static_list:
    # Graduate School of Arts and Sciences - the main home of graduate programs
    - name: "Graduate School of Arts and Sciences (GSAS)"
      url: "https://gsas.harvard.edu/"

    # School of Engineering and Applied Sciences
    - name: "John A. Paulson School of Engineering and Applied Sciences (SEAS)"
      url: "https://seas.harvard.edu/"

    # Business School
    - name: "Harvard Business School (HBS)"
      url: "https://www.hbs.edu/"

    # School of Design
    - name: "Graduate School of Design (GSD)"
      url: "https://www.gsd.harvard.edu/"

    # School of Education
    - name: "Graduate School of Education (HGSE)"
      url: "https://www.gse.harvard.edu/"

    # Kennedy School of Government
    - name: "Harvard Kennedy School (HKS)"
      url: "https://www.hks.harvard.edu/"

    # Law School
    - name: "Harvard Law School (HLS)"
      url: "https://hls.harvard.edu/"

    # Medical School
    - name: "Harvard Medical School (HMS)"
      url: "https://hms.harvard.edu/"

    # School of Public Health
    - name: "T.H. Chan School of Public Health (HSPH)"
      url: "https://www.hsph.harvard.edu/"

    # Divinity School
    - name: "Harvard Divinity School (HDS)"
      url: "https://hds.harvard.edu/"

    # School of Dental Medicine
    - name: "Harvard School of Dental Medicine (HSDM)"
      url: "https://hsdm.harvard.edu/"

# Layer 2: program discovery config
programs:
  # Paths tried on each school site to locate the program listing
  paths_to_try:
    - "/programs"
    - "/academics/programs"
    - "/academics/graduate-programs"
    - "/academics/masters-programs"
    - "/graduate"
    - "/degrees"
    - "/academics"

  # Link patterns used to find the program listing from a school homepage
  link_patterns:
    - text_contains: ["program", "degree", "academics"]
      href_contains: ["/program", "/degree", "/academic"]
    - text_contains: ["master", "graduate"]
      href_contains: ["/master", "/graduate"]

  # Selectors on the program listing page
  selectors:
    program_item: "div.program-item, li.program, .degree-program, article.program, a[href*='/program']"
    program_name: "h3, h4, .title, .program-title, .name"
    program_url: "a[href]"
    degree_type: ".degree, .credential, .degree-type"

  # Pagination config
  pagination:
    type: "none"

# Layer 3: supervisor (faculty) discovery config
faculty:
  discovery_strategies:
    - type: "link_in_page"
      patterns:
        - text_contains: ["faculty", "people", "advisor"]
          href_contains: ["/faculty", "/people", "/advisor"]
        - text_contains: ["see list", "view all"]
          href_contains: ["/people", "/faculty"]

    - type: "url_pattern"
      patterns:
        - "{program_url}/faculty"
        - "{program_url}/people"
        - "{school_url}/faculty"
        - "{school_url}/people"

  selectors:
    faculty_item: "div.faculty, li.person, .profile-card, article.person"
    faculty_name: "h3, h4, .name, .title a"
    faculty_url: "a[href*='/people/'], a[href*='/faculty/'], a[href*='/profile/']"
    faculty_title: ".title, .position, .role, .job-title"

# Filtering rules
filters:
  program_degree_types:
    include:
      - "Master"
      - "M.S."
      - "M.A."
      - "MBA"
      - "M.Eng"
      - "M.Ed"
      - "M.P.P"
      - "M.P.A"
      - "M.Arch"
      - "M.L.A"
      - "M.Div"
      - "M.T.S"
      - "LL.M"
      - "S.M."
      - "A.M."
      - "A.L.M."
    exclude:
      - "Ph.D."
      - "Doctor"
      - "Bachelor"
      - "B.S."
      - "B.A."
      - "Certificate"
      - "Undergraduate"

  exclude_schools: []
331
configs/manchester.yaml
Normal file
@ -0,0 +1,331 @@
university:
|
||||||
|
name: "The University of Manchester"
|
||||||
|
url: "https://www.manchester.ac.uk/"
|
||||||
|
country: "United Kingdom"
|
||||||
|
|
||||||
|
schools:
|
||||||
|
discovery_method: "static_list"
|
||||||
|
request:
|
||||||
|
timeout_ms: 45000
|
||||||
|
max_retries: 3
|
||||||
|
retry_backoff_ms: 3000
|
||||||
|
static_list:
|
||||||
|
- name: "Alliance Manchester Business School"
|
||||||
|
url: "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/"
|
||||||
|
keywords:
|
||||||
|
- "accounting"
|
||||||
|
- "finance"
|
||||||
|
- "business"
|
||||||
|
- "management"
|
||||||
|
- "marketing"
|
||||||
|
- "mba"
|
||||||
|
- "economics"
|
||||||
|
- "entrepreneurship"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/"
|
||||||
|
extract_method: "table"
|
||||||
|
requires_scroll: true
|
||||||
|
scroll_times: 6
|
||||||
|
scroll_delay_ms: 700
|
||||||
|
load_more_selector: "button.load-more, button.show-more"
|
||||||
|
max_load_more: 5
|
||||||
|
request:
|
||||||
|
timeout_ms: 60000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 2500
|
||||||
|
- name: "Department of Computer Science"
|
||||||
|
url: "https://www.cs.manchester.ac.uk/about/people/academic-and-research-staff/"
|
||||||
|
keywords:
|
||||||
|
- "computer"
|
||||||
|
- "software"
|
||||||
|
- "data science"
|
||||||
|
- "artificial intelligence"
|
||||||
|
- "ai "
|
||||||
|
- "machine learning"
|
||||||
|
- "cyber"
|
||||||
|
- "computing"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://www.cs.manchester.ac.uk/about/people/academic-and-research-staff/"
|
||||||
|
extract_method: "links"
|
||||||
|
requires_scroll: true
|
||||||
|
scroll_times: 6
|
||||||
|
scroll_delay_ms: 700
|
||||||
|
blocked_resources: ["image", "font", "media"]
|
||||||
|
- url: "https://www.cs.manchester.ac.uk/about/people/"
|
||||||
|
extract_method: "links"
|
||||||
|
load_more_selector: "button.load-more"
|
||||||
|
max_load_more: 5
|
||||||
|
request:
|
||||||
|
timeout_ms: 45000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 2000
|
||||||
|
- name: "Department of Physics and Astronomy"
|
||||||
|
url: "https://www.physics.manchester.ac.uk/about/people/academic-and-research-staff/"
|
||||||
|
keywords:
|
||||||
|
- "physics"
|
||||||
|
- "astronomy"
|
||||||
|
- "astrophysics"
|
||||||
|
- "nuclear"
|
||||||
|
- "particle"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://www.physics.manchester.ac.uk/about/people/academic-and-research-staff/"
|
||||||
|
extract_method: "links"
|
||||||
|
requires_scroll: true
|
||||||
|
scroll_times: 5
|
||||||
|
scroll_delay_ms: 700
|
||||||
|
- name: "Department of Electrical and Electronic Engineering"
|
||||||
|
url: "https://www.eee.manchester.ac.uk/about/people/academic-and-research-staff/"
|
||||||
|
keywords:
|
||||||
|
- "electrical"
|
||||||
|
- "electronic"
|
||||||
|
- "eee"
|
||||||
|
- "power systems"
|
||||||
|
- "microelectronics"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://www.eee.manchester.ac.uk/about/people/academic-and-research-staff/"
|
||||||
|
extract_method: "links"
|
||||||
|
requires_scroll: true
|
||||||
|
scroll_times: 6
|
||||||
|
scroll_delay_ms: 700
|
||||||
|
- name: "Department of Chemistry"
|
||||||
|
url: "https://research.manchester.ac.uk/en/organisations/department-of-chemistry/persons/"
|
||||||
|
keywords:
|
||||||
|
- "chemistry"
|
||||||
|
- "chemical"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://research.manchester.ac.uk/en/organisations/department-of-chemistry/persons/"
|
||||||
|
extract_method: "research_explorer"
|
||||||
|
requires_scroll: true
|
||||||
|
request:
|
||||||
|
timeout_ms: 120000
|
||||||
|
wait_until: "networkidle"
|
||||||
|
wait_for_selector: "a.link.person"
|
||||||
|
post_wait_ms: 5000
|
||||||
|
research_explorer:
|
||||||
|
org_slug: "department-of-chemistry"
|
||||||
|
page_size: 200
|
||||||
|
- name: "Department of Mathematics"
|
||||||
|
url: "https://research.manchester.ac.uk/en/organisations/department-of-mathematics/persons/"
|
||||||
|
keywords:
|
||||||
|
- "mathematics"
|
||||||
|
- "statistics"
|
||||||
|
- "applied math"
|
||||||
|
- "actuarial"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://research.manchester.ac.uk/en/organisations/department-of-mathematics/persons/"
|
||||||
|
extract_method: "research_explorer"
|
||||||
|
requires_scroll: true
|
||||||
|
request:
|
||||||
|
timeout_ms: 120000
|
||||||
|
wait_until: "networkidle"
|
||||||
|
wait_for_selector: "a.link.person"
|
||||||
|
post_wait_ms: 4500
|
||||||
|
research_explorer:
|
||||||
|
org_slug: "department-of-mathematics"
|
||||||
|
page_size: 200
|
||||||
|
- name: "School of Engineering"
|
||||||
|
url: "https://research.manchester.ac.uk/en/organisations/school-of-engineering/persons/"
|
||||||
|
keywords:
|
||||||
|
- "engineering"
|
||||||
|
- "mechanical"
|
||||||
|
- "aerospace"
|
||||||
|
- "civil"
|
||||||
|
- "materials"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://research.manchester.ac.uk/en/organisations/school-of-engineering/persons/"
|
||||||
|
extract_method: "research_explorer"
|
||||||
|
requires_scroll: true
|
||||||
|
request:
|
||||||
|
timeout_ms: 120000
|
||||||
|
wait_until: "networkidle"
|
||||||
|
wait_for_selector: "a.link.person"
|
||||||
|
post_wait_ms: 4500
|
||||||
|
research_explorer:
|
||||||
|
org_slug: "school-of-engineering"
|
||||||
|
page_size: 400
|
||||||
|
- name: "Faculty of Biology, Medicine and Health"
|
||||||
|
url: "https://research.manchester.ac.uk/en/organisations/faculty-of-biology-medicine-and-health/persons/"
|
||||||
|
keywords:
|
||||||
|
- "medicine"
|
||||||
|
- "medical"
|
||||||
|
- "health"
|
||||||
|
- "nursing"
|
||||||
|
- "pharmacy"
|
||||||
|
- "clinical"
|
||||||
|
- "dental"
|
||||||
|
- "optometry"
|
||||||
|
- "biology"
|
||||||
|
- "biomedical"
|
||||||
|
- "psychology"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://research.manchester.ac.uk/en/organisations/faculty-of-biology-medicine-and-health/persons/"
|
||||||
|
extract_method: "research_explorer"
|
||||||
|
requires_scroll: true
|
||||||
|
request:
|
||||||
|
timeout_ms: 130000
|
||||||
|
wait_until: "networkidle"
|
||||||
|
wait_for_selector: "a.link.person"
|
||||||
|
post_wait_ms: 4500
|
||||||
|
research_explorer:
|
||||||
|
org_slug: "faculty-of-biology-medicine-and-health"
|
||||||
|
page_size: 400
|
||||||
|
- name: "School of Social Sciences"
|
||||||
|
url: "https://research.manchester.ac.uk/en/organisations/school-of-social-sciences/persons/"
|
||||||
|
keywords:
|
||||||
|
- "sociology"
|
||||||
|
- "politics"
|
||||||
|
- "international"
|
||||||
|
- "social"
|
||||||
|
- "criminology"
|
||||||
|
- "anthropology"
|
||||||
|
- "philosophy"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://research.manchester.ac.uk/en/organisations/school-of-social-sciences/persons/"
|
||||||
|
extract_method: "research_explorer"
|
||||||
|
requires_scroll: true
|
||||||
|
request:
|
||||||
|
timeout_ms: 120000
|
||||||
|
wait_until: "networkidle"
|
||||||
|
wait_for_selector: "a.link.person"
|
||||||
|
post_wait_ms: 4500
|
||||||
|
research_explorer:
|
||||||
|
org_slug: "school-of-social-sciences"
|
||||||
|
page_size: 200
|
||||||
|
- name: "School of Law"
|
||||||
|
url: "https://research.manchester.ac.uk/en/organisations/school-of-law/persons/"
|
||||||
|
keywords:
|
||||||
|
- "law"
|
||||||
|
- "legal"
|
||||||
|
- "llm"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://research.manchester.ac.uk/en/organisations/school-of-law/persons/"
|
||||||
|
extract_method: "research_explorer"
|
||||||
|
requires_scroll: true
|
||||||
|
request:
|
||||||
|
timeout_ms: 120000
|
||||||
|
wait_until: "networkidle"
|
||||||
|
wait_for_selector: "a.link.person"
|
||||||
|
post_wait_ms: 4500
|
||||||
|
research_explorer:
|
||||||
|
org_slug: "school-of-law"
|
||||||
|
page_size: 200
|
||||||
|
- name: "School of Arts, Languages and Cultures"
|
||||||
|
url: "https://research.manchester.ac.uk/en/organisations/school-of-arts-languages-and-cultures/persons/"
|
||||||
|
keywords:
|
||||||
|
- "arts"
|
||||||
|
- "languages"
|
||||||
|
- "culture"
|
||||||
|
- "music"
|
||||||
|
- "drama"
|
||||||
|
- "theatre"
|
||||||
|
- "history"
|
||||||
|
- "linguistics"
|
||||||
|
- "literature"
|
||||||
|
- "translation"
|
||||||
|
- "archaeology"
|
||||||
|
- "religion"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://research.manchester.ac.uk/en/organisations/school-of-arts-languages-and-cultures/persons/"
|
||||||
|
extract_method: "research_explorer"
|
||||||
|
requires_scroll: true
|
||||||
|
request:
|
||||||
|
timeout_ms: 120000
|
||||||
|
wait_until: "networkidle"
|
||||||
|
wait_for_selector: "a.link.person"
|
||||||
|
post_wait_ms: 4500
|
||||||
|
research_explorer:
|
||||||
|
org_slug: "school-of-arts-languages-and-cultures"
|
||||||
|
page_size: 300
|
||||||
|
- name: "School of Environment, Education and Development"
|
||||||
|
url: "https://research.manchester.ac.uk/en/organisations/school-of-environment-education-and-development/persons/"
|
||||||
|
keywords:
|
||||||
|
- "environment"
|
||||||
|
- "education"
|
||||||
|
- "development"
|
||||||
|
- "planning"
|
||||||
|
- "architecture"
|
||||||
|
- "urban"
|
||||||
|
- "geography"
|
||||||
|
- "sustainability"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://research.manchester.ac.uk/en/organisations/school-of-environment-education-and-development/persons/"
|
||||||
|
extract_method: "research_explorer"
|
||||||
|
requires_scroll: true
|
||||||
|
request:
|
||||||
|
timeout_ms: 120000
|
||||||
|
wait_until: "networkidle"
|
||||||
|
wait_for_selector: "a.link.person"
|
||||||
|
post_wait_ms: 4500
|
||||||
|
research_explorer:
|
||||||
|
org_slug: "school-of-environment-education-and-development"
|
||||||
|
page_size: 300
|
||||||
|
|
||||||
|
programs:
|
||||||
|
paths_to_try:
|
||||||
|
- "/study/masters/courses/list/"
|
||||||
|
link_patterns:
|
||||||
|
- text_contains: ["masters", "postgraduate", "graduate"]
|
||||||
|
href_contains: ["/courses/list", "/study/masters", "/study/postgraduate"]
|
||||||
|
selectors:
|
||||||
|
program_item: "li.course-item, article.course, .course-listing a"
|
||||||
|
program_name: ".course-title, h3, .title"
|
||||||
|
program_url: "a[href]"
|
||||||
|
degree_type: ".course-award, .badge"
|
||||||
|
request:
|
||||||
|
timeout_ms: 40000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 2500
|
||||||
|
global_catalog:
|
||||||
|
url: "https://www.manchester.ac.uk/study/masters/courses/list/"
|
||||||
|
request:
|
||||||
|
timeout_ms: 60000
|
||||||
|
wait_until: "networkidle"
|
||||||
|
wait_after_ms: 3000
|
||||||
|
metadata_keyword_field: "keywords"
|
||||||
|
assign_by_school_keywords: true
|
||||||
|
assign_if_no_keywords: false
|
||||||
|
allow_multiple_assignments: false
|
||||||
|
per_school_limit: 200
|
||||||
|
skip_program_faculty_lookup: true
|
||||||
|
|
||||||
|
faculty:
|
||||||
|
discovery_strategies:
|
||||||
|
- type: "link_in_page"
|
||||||
|
patterns:
|
||||||
|
- text_contains: ["people", "faculty", "staff", "directory"]
|
||||||
|
href_contains: ["/people", "/faculty", "/staff"]
|
||||||
|
request:
|
||||||
|
timeout_ms: 30000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 1500
|
||||||
|
- type: "url_pattern"
|
||||||
|
patterns:
|
||||||
|
- "{program_url}/people"
|
||||||
|
- "{program_url}/faculty"
|
||||||
|
- "{school_url}/people"
|
||||||
|
- "{school_url}/staff"
|
||||||
|
request:
|
||||||
|
timeout_ms: 30000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 1500
|
||||||
|
- type: "school_directory"
|
||||||
|
assign_to_all: false
|
||||||
|
match_by_school_keywords: true
|
||||||
|
metadata_keyword_field: "keywords"
|
||||||
|
request:
|
||||||
|
timeout_ms: 120000
|
||||||
|
post_wait_ms: 3500
|
||||||
|
|
||||||
|
filters:
|
||||||
|
program_degree_types:
|
||||||
|
include: ["MSc", "MA", "MBA", "MEng", "LLM", "MRes"]
|
||||||
|
exclude: ["PhD", "Bachelor", "BSc", "BA", "PGCert"]
|
||||||
|
exclude_schools: []
|
||||||
|
|
||||||
|
playwright:
|
||||||
|
stealth: true
|
||||||
|
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
|
||||||
|
extra_headers:
|
||||||
|
Accept-Language: "en-US,en;q=0.9"
|
||||||
|
cookies: []
|
||||||
|
add_init_scripts: []
|
||||||
24
configs/templates/README.md
Normal file
@ -0,0 +1,24 @@
# UK University Template Library

This directory holds ScraperConfig template fragments for site structures commonly found at UK universities. The goal is to let the generation/orchestration scripts quickly reuse proven school, program, and supervisor configurations while staying in sync with the latest capabilities in `src/university_scraper`.

## Usage

1. Copy the template you need to `configs/<university>.yaml` and replace the placeholders (domain, school URLs, Research Explorer organisation slug, etc.) with the university's real details.
2. Adjust the school list under `schools.static_list` (a minimal sketch follows this README):
   - `keywords`: used to cluster programs into schools automatically;
   - `faculty_pages`: defines school-level staff directories (supports `extract_method: table|links|research_explorer`, scrolling / "load more" clicks, and per-page request settings).
3. Fill in `programs.paths_to_try`, `link_patterns`, `selectors`, and the request settings according to how the university's course navigation works.
4. `faculty.discovery_strategies` should include at least:
   - `link_in_page`: look for "People/Faculty" links on the program page;
   - `url_pattern`: cover common URL patterns;
   - `school_directory`: reuse the staff directories from `faculty_pages` and distribute them to programs by keyword.
5. Run `python -m src.university_scraper.cli run --config configs/<university>.yaml --output output/<name>.json` (or trigger the job from the web UI) to verify, and compare the local output against the previous version.

## Template List

| File | When to use |
|------|-------------|
| `uk_research_explorer_template.yaml` | Most UK universities that use a Pure Portal / Research Explorer (e.g. Manchester, UCL, or Imperial's humanities and social science faculties). |
| `uk_department_directory_template.yaml` | Schools whose traditional departmental websites publish an HTML staff directory (e.g. individual science/engineering department sites, standalone school sites). |

If new page types are discovered later (for example SharePoint lists or embedded APIs), add a new template file to this directory and update this README accordingly.
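For orientation, here is a minimal, hypothetical skeleton of the fields the steps above refer to. The URLs and keyword values are placeholders; the full templates below carry the real request, selector, and Research Explorer detail.

```yaml
# Minimal sketch only - placeholder values, see the full templates below
schools:
  discovery_method: "static_list"
  static_list:
    - name: "Department of Computer Science"
      url: "https://www.example.ac.uk/cs/people/"
      keywords: ["computer", "software"]
      faculty_pages:
        - url: "https://www.example.ac.uk/cs/people/"
          extract_method: "links"

programs:
  paths_to_try:
    - "/study/masters/courses/list/"

faculty:
  discovery_strategies:
    - type: "link_in_page"
    - type: "url_pattern"
    - type: "school_directory"
```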
95
configs/templates/uk_department_directory_template.yaml
Normal file
@ -0,0 +1,95 @@
university:
|
||||||
|
name: "REPLACE_UNIVERSITY_NAME"
|
||||||
|
url: "https://www.example.ac.uk/"
|
||||||
|
country: "United Kingdom"
|
||||||
|
|
||||||
|
schools:
|
||||||
|
discovery_method: "static_list"
|
||||||
|
static_list:
|
||||||
|
- name: "Department of Computer Science"
|
||||||
|
url: "https://www.example.ac.uk/about/people/academic-and-research-staff/"
|
||||||
|
keywords:
|
||||||
|
- "computer"
|
||||||
|
- "software"
|
||||||
|
- "artificial intelligence"
|
||||||
|
- "data science"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://www.example.ac.uk/about/people/academic-and-research-staff/"
|
||||||
|
extract_method: "links"
|
||||||
|
requires_scroll: true
|
||||||
|
scroll_times: 6
|
||||||
|
scroll_delay_ms: 600
|
||||||
|
blocked_resources: ["image", "font", "media"]
|
||||||
|
- url: "https://www.example.ac.uk/about/people/"
|
||||||
|
extract_method: "links"
|
||||||
|
load_more_selector: "button.load-more"
|
||||||
|
max_load_more: 5
|
||||||
|
request:
|
||||||
|
timeout_ms: 45000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 2000
|
||||||
|
- name: "Department of Physics"
|
||||||
|
url: "https://www.example.ac.uk/physics/about/people/"
|
||||||
|
keywords:
|
||||||
|
- "physics"
|
||||||
|
- "astronomy"
|
||||||
|
- "material science"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://www.example.ac.uk/physics/about/people/academic-staff/"
|
||||||
|
extract_method: "table"
|
||||||
|
request:
|
||||||
|
timeout_ms: 60000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 2000
|
||||||
|
|
||||||
|
programs:
|
||||||
|
paths_to_try:
|
||||||
|
- "/study/masters/courses/a-to-z/"
|
||||||
|
- "/study/masters/courses/list/"
|
||||||
|
link_patterns:
|
||||||
|
- text_contains: ["courses", "masters", "postgraduate"]
|
||||||
|
href_contains: ["/study/", "/masters/", "/courses/"]
|
||||||
|
selectors:
|
||||||
|
program_item: ".course-card, li.course, article.course"
|
||||||
|
program_name: ".course-title, h3, .title"
|
||||||
|
program_url: "a[href]"
|
||||||
|
degree_type: ".award, .badge"
|
||||||
|
request:
|
||||||
|
timeout_ms: 35000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 2000
|
||||||
|
|
||||||
|
faculty:
|
||||||
|
discovery_strategies:
|
||||||
|
- type: "link_in_page"
|
||||||
|
patterns:
|
||||||
|
- text_contains: ["people", "faculty", "team", "staff"]
|
||||||
|
href_contains: ["/people", "/faculty", "/staff"]
|
||||||
|
request:
|
||||||
|
timeout_ms: 25000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 1500
|
||||||
|
- type: "url_pattern"
|
||||||
|
patterns:
|
||||||
|
- "{program_url}/people"
|
||||||
|
- "{program_url}/staff"
|
||||||
|
- "{school_url}/people"
|
||||||
|
- "{school_url}/contact/staff"
|
||||||
|
request:
|
||||||
|
timeout_ms: 25000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 1500
|
||||||
|
- type: "school_directory"
|
||||||
|
assign_to_all: false
|
||||||
|
match_by_school_keywords: true
|
||||||
|
metadata_keyword_field: "keywords"
|
||||||
|
request:
|
||||||
|
timeout_ms: 60000
|
||||||
|
wait_for_selector: "a[href*='/people/'], table"
|
||||||
|
post_wait_ms: 2000
|
||||||
|
|
||||||
|
filters:
|
||||||
|
program_degree_types:
|
||||||
|
include: ["MSc", "MSci", "MA", "MBA", "MEng", "LLM"]
|
||||||
|
exclude: ["PhD", "Bachelor", "BSc", "BA", "PGCert"]
|
||||||
|
exclude_schools: []
|
||||||
101
configs/templates/uk_research_explorer_template.yaml
Normal file
@ -0,0 +1,101 @@
university:
|
||||||
|
name: "REPLACE_UNIVERSITY_NAME"
|
||||||
|
url: "https://www.example.ac.uk/"
|
||||||
|
country: "United Kingdom"
|
||||||
|
|
||||||
|
schools:
|
||||||
|
discovery_method: "static_list"
|
||||||
|
request:
|
||||||
|
timeout_ms: 45000
|
||||||
|
max_retries: 3
|
||||||
|
retry_backoff_ms: 3000
|
||||||
|
static_list:
|
||||||
|
# 基于 Research Explorer (Pure Portal) 的学院示例
|
||||||
|
- name: "School of Engineering"
|
||||||
|
url: "https://research.example.ac.uk/en/organisations/school-of-engineering/persons/"
|
||||||
|
keywords:
|
||||||
|
- "engineering"
|
||||||
|
- "mechanical"
|
||||||
|
- "civil"
|
||||||
|
- "materials"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://research.example.ac.uk/en/organisations/school-of-engineering/persons/"
|
||||||
|
extract_method: "research_explorer"
|
||||||
|
requires_scroll: true
|
||||||
|
request:
|
||||||
|
timeout_ms: 120000
|
||||||
|
wait_until: "networkidle"
|
||||||
|
post_wait_ms: 5000
|
||||||
|
research_explorer:
|
||||||
|
org_slug: "school-of-engineering"
|
||||||
|
page_size: 400
|
||||||
|
- name: "Faculty of Humanities"
|
||||||
|
url: "https://research.example.ac.uk/en/organisations/faculty-of-humanities/persons/"
|
||||||
|
keywords:
|
||||||
|
- "arts"
|
||||||
|
- "languages"
|
||||||
|
- "history"
|
||||||
|
- "philosophy"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://research.example.ac.uk/en/organisations/faculty-of-humanities/persons/"
|
||||||
|
extract_method: "research_explorer"
|
||||||
|
requires_scroll: true
|
||||||
|
request:
|
||||||
|
timeout_ms: 120000
|
||||||
|
wait_until: "networkidle"
|
||||||
|
post_wait_ms: 4500
|
||||||
|
research_explorer:
|
||||||
|
org_slug: "faculty-of-humanities"
|
||||||
|
page_size: 300
|
||||||
|
|
||||||
|
programs:
|
||||||
|
paths_to_try:
|
||||||
|
- "/study/masters/courses/list/"
|
||||||
|
- "/study/postgraduate/courses/list/"
|
||||||
|
link_patterns:
|
||||||
|
- text_contains: ["masters", "postgraduate", "graduate"]
|
||||||
|
href_contains: ["/courses/", "/study/", "/programmes/"]
|
||||||
|
selectors:
|
||||||
|
program_item: "li.course-item, article.course-card, a.course-link"
|
||||||
|
program_name: ".course-title, h3, .title"
|
||||||
|
program_url: "a[href]"
|
||||||
|
degree_type: ".course-award, .badge"
|
||||||
|
request:
|
||||||
|
timeout_ms: 40000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 2500
|
||||||
|
|
||||||
|
faculty:
|
||||||
|
discovery_strategies:
|
||||||
|
- type: "link_in_page"
|
||||||
|
patterns:
|
||||||
|
- text_contains: ["faculty", "people", "staff", "directory"]
|
||||||
|
href_contains: ["/faculty", "/people", "/staff"]
|
||||||
|
request:
|
||||||
|
timeout_ms: 30000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 1500
|
||||||
|
- type: "url_pattern"
|
||||||
|
patterns:
|
||||||
|
- "{program_url}/people"
|
||||||
|
- "{program_url}/faculty"
|
||||||
|
- "{school_url}/people"
|
||||||
|
- "{school_url}/staff"
|
||||||
|
request:
|
||||||
|
timeout_ms: 30000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 1500
|
||||||
|
- type: "school_directory"
|
||||||
|
assign_to_all: false
|
||||||
|
match_by_school_keywords: true
|
||||||
|
metadata_keyword_field: "keywords"
|
||||||
|
request:
|
||||||
|
timeout_ms: 120000
|
||||||
|
wait_for_selector: "a.link.person"
|
||||||
|
post_wait_ms: 4000
|
||||||
|
|
||||||
|
filters:
|
||||||
|
program_degree_types:
|
||||||
|
include: ["MSc", "MA", "MBA", "MEng", "LLM", "MRes"]
|
||||||
|
exclude: ["PhD", "Bachelor", "BSc", "BA"]
|
||||||
|
exclude_schools: []
|
||||||
169
configs/ucl.yaml
Normal file
@ -0,0 +1,169 @@
university:
|
||||||
|
name: "University College London"
|
||||||
|
url: "https://www.ucl.ac.uk/"
|
||||||
|
country: "United Kingdom"
|
||||||
|
|
||||||
|
schools:
|
||||||
|
discovery_method: "static_list"
|
||||||
|
request:
|
||||||
|
timeout_ms: 45000
|
||||||
|
max_retries: 3
|
||||||
|
retry_backoff_ms: 3000
|
||||||
|
static_list:
|
||||||
|
- name: "Faculty of Engineering Sciences"
|
||||||
|
url: "https://www.ucl.ac.uk/engineering/people"
|
||||||
|
keywords:
|
||||||
|
- "engineering"
|
||||||
|
- "mechanical"
|
||||||
|
- "civil"
|
||||||
|
- "materials"
|
||||||
|
- "electronic"
|
||||||
|
- "computer"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://www.ucl.ac.uk/engineering/people"
|
||||||
|
extract_method: "links"
|
||||||
|
requires_scroll: true
|
||||||
|
scroll_times: 8
|
||||||
|
scroll_delay_ms: 600
|
||||||
|
blocked_resources: ["image", "font", "media"]
|
||||||
|
- url: "https://www.ucl.ac.uk/electronic-electrical-engineering/people/academic-staff"
|
||||||
|
extract_method: "table"
|
||||||
|
request:
|
||||||
|
timeout_ms: 45000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 2000
|
||||||
|
- name: "Faculty of Mathematical & Physical Sciences"
|
||||||
|
url: "https://www.ucl.ac.uk/mathematical-physical-sciences/people"
|
||||||
|
keywords:
|
||||||
|
- "mathematics"
|
||||||
|
- "physics"
|
||||||
|
- "chemistry"
|
||||||
|
- "earth sciences"
|
||||||
|
- "astronomy"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://www.ucl.ac.uk/mathematical-physical-sciences/people"
|
||||||
|
extract_method: "links"
|
||||||
|
requires_scroll: true
|
||||||
|
scroll_times: 6
|
||||||
|
scroll_delay_ms: 600
|
||||||
|
- url: "https://www.ucl.ac.uk/physics-astronomy/people/academic-staff"
|
||||||
|
extract_method: "links"
|
||||||
|
- name: "Faculty of Arts & Humanities"
|
||||||
|
url: "https://www.ucl.ac.uk/arts-humanities/people/academic-staff"
|
||||||
|
keywords:
|
||||||
|
- "arts"
|
||||||
|
- "languages"
|
||||||
|
- "culture"
|
||||||
|
- "history"
|
||||||
|
- "philosophy"
|
||||||
|
- "translation"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://www.ucl.ac.uk/arts-humanities/people/academic-staff"
|
||||||
|
extract_method: "links"
|
||||||
|
requires_scroll: true
|
||||||
|
scroll_times: 6
|
||||||
|
scroll_delay_ms: 600
|
||||||
|
- name: "Faculty of Laws"
|
||||||
|
url: "https://www.ucl.ac.uk/laws/people/academic-staff"
|
||||||
|
keywords:
|
||||||
|
- "law"
|
||||||
|
- "legal"
|
||||||
|
- "llm"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://www.ucl.ac.uk/laws/people/academic-staff"
|
||||||
|
extract_method: "links"
|
||||||
|
requires_scroll: true
|
||||||
|
scroll_times: 5
|
||||||
|
scroll_delay_ms: 600
|
||||||
|
- name: "Faculty of Social & Historical Sciences"
|
||||||
|
url: "https://www.ucl.ac.uk/social-historical-sciences/people"
|
||||||
|
keywords:
|
||||||
|
- "social"
|
||||||
|
- "economics"
|
||||||
|
- "geography"
|
||||||
|
- "anthropology"
|
||||||
|
- "politics"
|
||||||
|
- "history"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://www.ucl.ac.uk/social-historical-sciences/people"
|
||||||
|
extract_method: "links"
|
||||||
|
requires_scroll: true
|
||||||
|
scroll_times: 6
|
||||||
|
scroll_delay_ms: 600
|
||||||
|
- name: "Faculty of Brain Sciences"
|
||||||
|
url: "https://www.ucl.ac.uk/brain-sciences/people"
|
||||||
|
keywords:
|
||||||
|
- "neuroscience"
|
||||||
|
- "psychology"
|
||||||
|
- "cognitive"
|
||||||
|
- "biomedical"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://www.ucl.ac.uk/brain-sciences/people"
|
||||||
|
extract_method: "links"
|
||||||
|
requires_scroll: true
|
||||||
|
scroll_times: 6
|
||||||
|
scroll_delay_ms: 600
|
||||||
|
- name: "Faculty of the Built Environment (The Bartlett)"
|
||||||
|
url: "https://www.ucl.ac.uk/bartlett/people/all"
|
||||||
|
keywords:
|
||||||
|
- "architecture"
|
||||||
|
- "planning"
|
||||||
|
- "urban"
|
||||||
|
- "built environment"
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://www.ucl.ac.uk/bartlett/people/all"
|
||||||
|
extract_method: "links"
|
||||||
|
requires_scroll: true
|
||||||
|
scroll_times: 10
|
||||||
|
scroll_delay_ms: 600
|
||||||
|
|
||||||
|
programs:
|
||||||
|
paths_to_try:
|
||||||
|
- "/prospective-students/graduate/taught-degrees/"
|
||||||
|
link_patterns:
|
||||||
|
- text_contains: ["graduate", "taught", "masters", "postgraduate"]
|
||||||
|
href_contains: ["/prospective-students/graduate", "/study/graduate", "/courses/"]
|
||||||
|
selectors:
|
||||||
|
program_item: ".view-content .view-row, li.listing__item, article.prog-card"
|
||||||
|
program_name: ".listing__title, h3, .title"
|
||||||
|
program_url: "a[href]"
|
||||||
|
degree_type: ".listing__award, .award"
|
||||||
|
request:
|
||||||
|
timeout_ms: 40000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 2500
|
||||||
|
|
||||||
|
faculty:
|
||||||
|
discovery_strategies:
|
||||||
|
- type: "link_in_page"
|
||||||
|
patterns:
|
||||||
|
- text_contains: ["people", "faculty", "staff", "team"]
|
||||||
|
href_contains: ["/people", "/faculty", "/staff", "/team"]
|
||||||
|
request:
|
||||||
|
timeout_ms: 30000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 1500
|
||||||
|
- type: "url_pattern"
|
||||||
|
patterns:
|
||||||
|
- "{program_url}/people"
|
||||||
|
- "{program_url}/staff"
|
||||||
|
- "{school_url}/people"
|
||||||
|
- "{school_url}/staff"
|
||||||
|
request:
|
||||||
|
timeout_ms: 30000
|
||||||
|
wait_until: "domcontentloaded"
|
||||||
|
post_wait_ms: 1500
|
||||||
|
- type: "school_directory"
|
||||||
|
assign_to_all: false
|
||||||
|
match_by_school_keywords: true
|
||||||
|
metadata_keyword_field: "keywords"
|
||||||
|
request:
|
||||||
|
timeout_ms: 60000
|
||||||
|
wait_for_selector: "a[href*='/people/'], .person, .profile-card"
|
||||||
|
post_wait_ms: 2500
|
||||||
|
|
||||||
|
filters:
|
||||||
|
program_degree_types:
|
||||||
|
include: ["MSc", "MSci", "MA", "MBA", "MEng", "LLM", "MRes"]
|
||||||
|
exclude: ["PhD", "Bachelor", "BSc", "BA", "PGCert"]
|
||||||
|
exclude_schools: []
|
||||||
54
docker-compose.yml
Normal file
@ -0,0 +1,54 @@
version: '3.8'

services:
  # Backend API service
  backend:
    build:
      context: ./backend
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/university_scraper
      - REDIS_URL=redis://redis:6379/0
    depends_on:
      - db
      - redis
    volumes:
      - ./backend:/app
      - scraper_data:/app/data

  # Frontend service
  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile
    ports:
      - "3000:80"
    depends_on:
      - backend

  # PostgreSQL database
  db:
    image: postgres:15-alpine
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
      - POSTGRES_DB=university_scraper
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  # Redis (used for the task queue)
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:
  scraper_data:
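A typical way to bring this stack up locally is standard Docker Compose usage (nothing repo-specific beyond the service names defined above):

```bash
# Build and start backend, frontend, PostgreSQL and Redis in the background
docker compose up -d --build

# Tail the backend logs, then shut everything down when finished
docker compose logs -f backend
docker compose down
```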
26
frontend/Dockerfile
Normal file
@ -0,0 +1,26 @@
FROM node:20-alpine AS builder

WORKDIR /app

# Copy package manifests
COPY package*.json ./
RUN npm install

# Copy source code
COPY . .

# Build
RUN npm run build

# Production image
FROM nginx:alpine

# Copy build output
COPY --from=builder /app/dist /usr/share/nginx/html

# Copy nginx config
COPY nginx.conf /etc/nginx/conf.d/default.conf

EXPOSE 80

CMD ["nginx", "-g", "daemon off;"]
12
frontend/index.html
Normal file
@ -0,0 +1,12 @@
<!DOCTYPE html>
<html lang="zh-CN">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>大学爬虫系统</title>
  </head>
  <body>
    <div id="root"></div>
    <script type="module" src="/src/main.tsx"></script>
  </body>
</html>
21
frontend/nginx.conf
Normal file
@ -0,0 +1,21 @@
server {
    listen 80;
    server_name localhost;
    root /usr/share/nginx/html;
    index index.html;

    # Handle SPA routing
    location / {
        try_files $uri $uri/ /index.html;
    }

    # Proxy API requests to the backend
    location /api {
        proxy_pass http://backend:8000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
    }
}
3051
frontend/package-lock.json
generated
Normal file
File diff suppressed because it is too large
26
frontend/package.json
Normal file
@ -0,0 +1,26 @@
{
  "name": "university-scraper-web",
  "version": "1.0.0",
  "private": true,
  "scripts": {
    "dev": "vite",
    "build": "tsc && vite build",
    "preview": "vite preview"
  },
  "dependencies": {
    "react": "^18.2.0",
    "react-dom": "^18.2.0",
    "react-router-dom": "^6.20.0",
    "@tanstack/react-query": "^5.8.0",
    "axios": "^1.6.0",
    "antd": "^5.11.0",
    "@ant-design/icons": "^5.2.6"
  },
  "devDependencies": {
    "@types/react": "^18.2.0",
    "@types/react-dom": "^18.2.0",
    "@vitejs/plugin-react": "^4.2.0",
    "typescript": "^5.3.0",
    "vite": "^5.0.0"
  }
}
75
frontend/src/App.tsx
Normal file
@ -0,0 +1,75 @@
/**
|
||||||
|
* 主应用组件
|
||||||
|
*/
|
||||||
|
import { useState } from 'react'
|
||||||
|
import { BrowserRouter, Routes, Route, Link, useNavigate } from 'react-router-dom'
|
||||||
|
import { Layout, Menu, Typography } from 'antd'
|
||||||
|
import { HomeOutlined, PlusOutlined, DatabaseOutlined } from '@ant-design/icons'
|
||||||
|
import HomePage from './pages/HomePage'
|
||||||
|
import AddUniversityPage from './pages/AddUniversityPage'
|
||||||
|
import UniversityDetailPage from './pages/UniversityDetailPage'
|
||||||
|
|
||||||
|
const { Header, Content, Footer } = Layout
|
||||||
|
const { Title } = Typography
|
||||||
|
|
||||||
|
function AppContent() {
|
||||||
|
const navigate = useNavigate()
|
||||||
|
const [current, setCurrent] = useState('home')
|
||||||
|
|
||||||
|
const menuItems = [
|
||||||
|
{
|
||||||
|
key: 'home',
|
||||||
|
icon: <HomeOutlined />,
|
||||||
|
label: '大学列表',
|
||||||
|
onClick: () => navigate('/')
|
||||||
|
},
|
||||||
|
{
|
||||||
|
key: 'add',
|
||||||
|
icon: <PlusOutlined />,
|
||||||
|
label: '添加大学',
|
||||||
|
onClick: () => navigate('/add')
|
||||||
|
}
|
||||||
|
]
|
||||||
|
|
||||||
|
return (
|
||||||
|
<Layout style={{ minHeight: '100vh' }}>
|
||||||
|
<Header style={{ display: 'flex', alignItems: 'center', background: '#001529' }}>
|
||||||
|
<div style={{ color: 'white', fontSize: '20px', fontWeight: 'bold', marginRight: '40px' }}>
|
||||||
|
<DatabaseOutlined /> 大学爬虫系统
|
||||||
|
</div>
|
||||||
|
<Menu
|
||||||
|
theme="dark"
|
||||||
|
mode="horizontal"
|
||||||
|
selectedKeys={[current]}
|
||||||
|
items={menuItems}
|
||||||
|
onClick={(e) => setCurrent(e.key)}
|
||||||
|
style={{ flex: 1 }}
|
||||||
|
/>
|
||||||
|
</Header>
|
||||||
|
|
||||||
|
<Content style={{ padding: '24px', background: '#f5f5f5' }}>
|
||||||
|
<div style={{ maxWidth: 1200, margin: '0 auto' }}>
|
||||||
|
<Routes>
|
||||||
|
<Route path="/" element={<HomePage />} />
|
||||||
|
<Route path="/add" element={<AddUniversityPage />} />
|
||||||
|
<Route path="/university/:id" element={<UniversityDetailPage />} />
|
||||||
|
</Routes>
|
||||||
|
</div>
|
||||||
|
</Content>
|
||||||
|
|
||||||
|
<Footer style={{ textAlign: 'center', background: '#f5f5f5' }}>
|
||||||
|
大学爬虫系统 ©2024 - 一键生成 & 一键爬取
|
||||||
|
</Footer>
|
||||||
|
</Layout>
|
||||||
|
)
|
||||||
|
}
|
||||||
|
|
||||||
|
function App() {
|
||||||
|
return (
|
||||||
|
<BrowserRouter>
|
||||||
|
<AppContent />
|
||||||
|
</BrowserRouter>
|
||||||
|
)
|
||||||
|
}
|
||||||
|
|
||||||
|
export default App
|
||||||
29
frontend/src/index.css
Normal file
@ -0,0 +1,29 @@
* {
  margin: 0;
  padding: 0;
  box-sizing: border-box;
}

body {
  font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
  background-color: #f5f5f5;
}

.container {
  max-width: 1200px;
  margin: 0 auto;
  padding: 20px;
}

.card-hover:hover {
  box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15);
  transition: box-shadow 0.3s;
}

.status-pending { color: #faad14; }
.status-analyzing { color: #1890ff; }
.status-ready { color: #52c41a; }
.status-running { color: #1890ff; }
.status-completed { color: #52c41a; }
.status-failed { color: #ff4d4f; }
.status-error { color: #ff4d4f; }
26
frontend/src/main.tsx
Normal file
@ -0,0 +1,26 @@
import React from 'react'
import ReactDOM from 'react-dom/client'
import { QueryClient, QueryClientProvider } from '@tanstack/react-query'
import { ConfigProvider } from 'antd'
import zhCN from 'antd/locale/zh_CN'
import App from './App'
import './index.css'

const queryClient = new QueryClient({
  defaultOptions: {
    queries: {
      refetchOnWindowFocus: false,
      retry: 1
    }
  }
})

ReactDOM.createRoot(document.getElementById('root')!).render(
  <React.StrictMode>
    <QueryClientProvider client={queryClient}>
      <ConfigProvider locale={zhCN}>
        <App />
      </ConfigProvider>
    </QueryClientProvider>
  </React.StrictMode>
)
165
frontend/src/pages/AddUniversityPage.tsx
Normal file
@ -0,0 +1,165 @@
/**
|
||||||
|
* 添加大学页面 - 一键生成爬虫脚本
|
||||||
|
*/
|
||||||
|
import { useState } from 'react'
|
||||||
|
import { useNavigate } from 'react-router-dom'
|
||||||
|
import { useMutation } from '@tanstack/react-query'
|
||||||
|
import {
|
||||||
|
Card, Form, Input, Button, Typography, Steps, Result, Spin, message
|
||||||
|
} from 'antd'
|
||||||
|
import { GlobalOutlined, RocketOutlined, CheckCircleOutlined, LoadingOutlined } from '@ant-design/icons'
|
||||||
|
import { scriptApi } from '../services/api'
|
||||||
|
|
||||||
|
const { Title, Text, Paragraph } = Typography
|
||||||
|
|
||||||
|
export default function AddUniversityPage() {
|
||||||
|
const navigate = useNavigate()
|
||||||
|
const [form] = Form.useForm()
|
||||||
|
const [currentStep, setCurrentStep] = useState(0)
|
||||||
|
const [universityId, setUniversityId] = useState<number | null>(null)
|
||||||
|
|
||||||
|
// 生成脚本
|
||||||
|
const generateMutation = useMutation({
|
||||||
|
mutationFn: scriptApi.generate,
|
||||||
|
onSuccess: (response) => {
|
||||||
|
const data = response.data
|
||||||
|
setUniversityId(data.university_id)
|
||||||
|
setCurrentStep(2)
|
||||||
|
message.success('脚本生成成功!')
|
||||||
|
},
|
||||||
|
onError: (error: any) => {
|
||||||
|
message.error(error.response?.data?.detail || '生成失败')
|
||||||
|
setCurrentStep(0)
|
||||||
|
}
|
||||||
|
})
|
||||||
|
|
||||||
|
const handleSubmit = (values: { url: string; name?: string }) => {
|
||||||
|
setCurrentStep(1)
|
||||||
|
generateMutation.mutate({
|
||||||
|
university_url: values.url,
|
||||||
|
university_name: values.name
|
||||||
|
})
|
||||||
|
}
|
||||||
|
|
||||||
|
const stepItems = [
|
||||||
|
{
|
||||||
|
title: '输入信息',
|
||||||
|
icon: <GlobalOutlined />
|
||||||
|
},
|
||||||
|
{
|
||||||
|
title: '分析生成',
|
||||||
|
icon: currentStep === 1 ? <LoadingOutlined /> : <RocketOutlined />
|
||||||
|
},
|
||||||
|
{
|
||||||
|
title: '完成',
|
||||||
|
icon: <CheckCircleOutlined />
|
||||||
|
}
|
||||||
|
]
|
||||||
|
|
||||||
|
return (
|
||||||
|
<Card>
|
||||||
|
<Title level={3} style={{ textAlign: 'center', marginBottom: 32 }}>
|
||||||
|
添加大学 - 一键生成爬虫脚本
|
||||||
|
</Title>
|
||||||
|
|
||||||
|
<Steps current={currentStep} items={stepItems} style={{ marginBottom: 40 }} />
|
||||||
|
|
||||||
|
{currentStep === 0 && (
|
||||||
|
<div style={{ maxWidth: 500, margin: '0 auto' }}>
|
||||||
|
<Paragraph style={{ textAlign: 'center', marginBottom: 24 }}>
|
||||||
|
输入大学官网地址,系统将自动分析网站结构并生成爬虫脚本
|
||||||
|
</Paragraph>
|
||||||
|
|
||||||
|
<Form
|
||||||
|
form={form}
|
||||||
|
layout="vertical"
|
||||||
|
onFinish={handleSubmit}
|
||||||
|
>
|
||||||
|
<Form.Item
|
||||||
|
name="url"
|
||||||
|
label="大学官网URL"
|
||||||
|
rules={[
|
||||||
|
{ required: true, message: '请输入大学官网URL' },
|
||||||
|
{ type: 'url', message: '请输入有效的URL' }
|
||||||
|
]}
|
||||||
|
>
|
||||||
|
<Input
|
||||||
|
placeholder="https://www.harvard.edu/"
|
||||||
|
size="large"
|
||||||
|
prefix={<GlobalOutlined />}
|
||||||
|
/>
|
||||||
|
</Form.Item>
|
||||||
|
|
||||||
|
<Form.Item
|
||||||
|
name="name"
|
||||||
|
label="大学名称 (可选)"
|
||||||
|
>
|
||||||
|
<Input
|
||||||
|
placeholder="如: Harvard University"
|
||||||
|
size="large"
|
||||||
|
/>
|
||||||
|
</Form.Item>
|
||||||
|
|
||||||
|
<Form.Item>
|
||||||
|
<Button
|
||||||
|
type="primary"
|
||||||
|
htmlType="submit"
|
||||||
|
size="large"
|
||||||
|
block
|
||||||
|
icon={<RocketOutlined />}
|
||||||
|
>
|
||||||
|
一键生成爬虫脚本
|
||||||
|
</Button>
|
||||||
|
</Form.Item>
|
||||||
|
</Form>
|
||||||
|
|
||||||
|
<div style={{ marginTop: 32, padding: 16, background: '#f5f5f5', borderRadius: 8 }}>
|
||||||
|
<Text strong>支持的大学类型:</Text>
|
||||||
|
<ul style={{ marginTop: 8 }}>
|
||||||
|
<li>美国大学 (如 Harvard, MIT, Stanford)</li>
|
||||||
|
<li>英国大学 (如 Oxford, Cambridge)</li>
|
||||||
|
<li>其他海外大学</li>
|
||||||
|
</ul>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
)}
|
||||||
|
|
||||||
|
{currentStep === 1 && (
|
||||||
|
<div style={{ textAlign: 'center', padding: 60 }}>
|
||||||
|
<Spin size="large" />
|
||||||
|
<Title level={4} style={{ marginTop: 24 }}>正在分析网站结构...</Title>
|
||||||
|
<Paragraph>系统正在访问大学官网,分析页面结构并生成爬虫脚本</Paragraph>
|
||||||
|
<Paragraph type="secondary">这可能需要几秒钟,请稍候...</Paragraph>
|
||||||
|
</div>
|
||||||
|
)}
|
||||||
|
|
||||||
|
{currentStep === 2 && (
|
||||||
|
<Result
|
||||||
|
status="success"
|
||||||
|
title="爬虫脚本生成成功!"
|
||||||
|
subTitle="系统已自动分析网站结构并生成了爬虫脚本"
|
||||||
|
extra={[
|
||||||
|
<Button
|
||||||
|
type="primary"
|
||||||
|
key="detail"
|
||||||
|
size="large"
|
||||||
|
onClick={() => navigate(`/university/${universityId}`)}
|
||||||
|
>
|
||||||
|
进入大学管理页面
|
||||||
|
</Button>,
|
||||||
|
<Button
|
||||||
|
key="add"
|
||||||
|
size="large"
|
||||||
|
onClick={() => {
|
||||||
|
setCurrentStep(0)
|
||||||
|
form.resetFields()
|
||||||
|
}}
|
||||||
|
>
|
||||||
|
继续添加
|
||||||
|
</Button>
|
||||||
|
]}
|
||||||
|
/>
|
||||||
|
)}
|
||||||
|
</Card>
|
||||||
|
)
|
||||||
|
}
|
||||||
185
frontend/src/pages/HomePage.tsx
Normal file
@ -0,0 +1,185 @@
/**
|
||||||
|
* 首页 - 大学列表
|
||||||
|
*/
|
||||||
|
import { useState } from 'react'
|
||||||
|
import { useNavigate } from 'react-router-dom'
|
||||||
|
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
|
||||||
|
import {
|
||||||
|
Card, Table, Button, Input, Space, Tag, message, Popconfirm, Typography, Row, Col, Statistic
|
||||||
|
} from 'antd'
|
||||||
|
import {
|
||||||
|
PlusOutlined, SearchOutlined, DeleteOutlined, EyeOutlined, ReloadOutlined
|
||||||
|
} from '@ant-design/icons'
|
||||||
|
import { universityApi } from '../services/api'
|
||||||
|
|
||||||
|
const { Title } = Typography
|
||||||
|
|
||||||
|
// 状态标签映射
|
||||||
|
const statusTags: Record<string, { color: string; text: string }> = {
|
||||||
|
pending: { color: 'default', text: '待分析' },
|
||||||
|
analyzing: { color: 'processing', text: '分析中' },
|
||||||
|
ready: { color: 'success', text: '就绪' },
|
||||||
|
error: { color: 'error', text: '错误' }
|
||||||
|
}
|
||||||
|
|
||||||
|
export default function HomePage() {
|
||||||
|
const navigate = useNavigate()
|
||||||
|
const queryClient = useQueryClient()
|
||||||
|
const [search, setSearch] = useState('')
|
||||||
|
|
||||||
|
// 获取大学列表
|
||||||
|
const { data, isLoading, refetch } = useQuery({
|
||||||
|
queryKey: ['universities', search],
|
||||||
|
queryFn: () => universityApi.list({ search: search || undefined })
|
||||||
|
})
|
||||||
|
|
||||||
|
// 删除大学
|
||||||
|
  const deleteMutation = useMutation({
    mutationFn: universityApi.delete,
    onSuccess: () => {
      message.success('删除成功')
      queryClient.invalidateQueries({ queryKey: ['universities'] })
    },
    onError: () => {
      message.error('删除失败')
    }
  })

  const universities = data?.data?.items || []
  const total = data?.data?.total || 0

  // Summary statistics
  const readyCount = universities.filter((u: any) => u.status === 'ready').length
  const totalPrograms = universities.reduce((sum: number, u: any) =>
    sum + (u.latest_result?.programs_count || 0), 0)
  const totalFaculty = universities.reduce((sum: number, u: any) =>
    sum + (u.latest_result?.faculty_count || 0), 0)

  const columns = [
    {
      title: '大学名称',
      dataIndex: 'name',
      key: 'name',
      render: (text: string, record: any) => (
        <a onClick={() => navigate(`/university/${record.id}`)}>{text}</a>
      )
    },
    {
      title: '国家',
      dataIndex: 'country',
      key: 'country',
      width: 100
    },
    {
      title: '状态',
      dataIndex: 'status',
      key: 'status',
      width: 100,
      render: (status: string) => {
        const tag = statusTags[status] || { color: 'default', text: status }
        return <Tag color={tag.color}>{tag.text}</Tag>
      }
    },
    {
      title: '项目数',
      key: 'programs',
      width: 100,
      render: (_: any, record: any) => record.latest_result?.programs_count || '-'
    },
    {
      title: '导师数',
      key: 'faculty',
      width: 100,
      render: (_: any, record: any) => record.latest_result?.faculty_count || '-'
    },
    {
      title: '操作',
      key: 'actions',
      width: 150,
      render: (_: any, record: any) => (
        <Space>
          <Button
            type="link"
            icon={<EyeOutlined />}
            onClick={() => navigate(`/university/${record.id}`)}
          >
            查看
          </Button>
          <Popconfirm
            title="确定删除这个大学吗?"
            onConfirm={() => deleteMutation.mutate(record.id)}
            okText="确定"
            cancelText="取消"
          >
            <Button type="link" danger icon={<DeleteOutlined />}>
              删除
            </Button>
          </Popconfirm>
        </Space>
      )
    }
  ]

  return (
    <div>
      {/* Statistic cards */}
      <Row gutter={16} style={{ marginBottom: 24 }}>
        <Col span={6}>
          <Card>
            <Statistic title="大学总数" value={total} />
          </Card>
        </Col>
        <Col span={6}>
          <Card>
            <Statistic title="已就绪" value={readyCount} valueStyle={{ color: '#52c41a' }} />
          </Card>
        </Col>
        <Col span={6}>
          <Card>
            <Statistic title="项目总数" value={totalPrograms} />
          </Card>
        </Col>
        <Col span={6}>
          <Card>
            <Statistic title="导师总数" value={totalFaculty} />
          </Card>
        </Col>
      </Row>

      {/* University list */}
      <Card
        title={<Title level={4} style={{ margin: 0 }}>大学列表</Title>}
        extra={
          <Space>
            <Input
              placeholder="搜索大学..."
              prefix={<SearchOutlined />}
              value={search}
              onChange={(e) => setSearch(e.target.value)}
              style={{ width: 200 }}
              allowClear
            />
            <Button icon={<ReloadOutlined />} onClick={() => refetch()}>
              刷新
            </Button>
            <Button type="primary" icon={<PlusOutlined />} onClick={() => navigate('/add')}>
              添加大学
            </Button>
          </Space>
        }
      >
        <Table
          columns={columns}
          dataSource={universities}
          rowKey="id"
          loading={isLoading}
          pagination={{
            total,
            showSizeChanger: true,
            showTotal: (t) => `共 ${t} 所大学`
          }}
        />
      </Card>
    </div>
  )
}
368 frontend/src/pages/UniversityDetailPage.tsx Normal file
@@ -0,0 +1,368 @@
/**
 * University detail page - manage the scraper script, run crawl jobs, and browse results
 */
import { useState, useEffect } from 'react'
import { useParams, useNavigate } from 'react-router-dom'
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
import {
  Card, Tabs, Button, Typography, Tag, Space, Table, Progress, Timeline, Spin,
  message, Descriptions, Tree, Input, Row, Col, Statistic, Empty, Modal
} from 'antd'
import {
  PlayCircleOutlined, ReloadOutlined, DownloadOutlined, ArrowLeftOutlined,
  CheckCircleOutlined, ClockCircleOutlined, ExclamationCircleOutlined,
  SearchOutlined, TeamOutlined, BookOutlined, BankOutlined
} from '@ant-design/icons'
import { universityApi, scriptApi, jobApi, resultApi } from '../services/api'

const { Title, Text, Paragraph } = Typography
const { TabPane } = Tabs

// Status mapping
const statusMap: Record<string, { color: string; text: string; icon: any }> = {
  pending: { color: 'default', text: '等待中', icon: <ClockCircleOutlined /> },
  running: { color: 'processing', text: '运行中', icon: <Spin size="small" /> },
  completed: { color: 'success', text: '已完成', icon: <CheckCircleOutlined /> },
  failed: { color: 'error', text: '失败', icon: <ExclamationCircleOutlined /> },
  cancelled: { color: 'warning', text: '已取消', icon: <ExclamationCircleOutlined /> }
}

export default function UniversityDetailPage() {
  const { id } = useParams<{ id: string }>()
  const navigate = useNavigate()
  const queryClient = useQueryClient()
  const universityId = parseInt(id || '0')

  const [activeTab, setActiveTab] = useState('overview')
  const [pollingJobId, setPollingJobId] = useState<number | null>(null)
  const [searchKeyword, setSearchKeyword] = useState('')

  // Fetch university details
  const { data: universityData, isLoading: universityLoading } = useQuery({
    queryKey: ['university', universityId],
    queryFn: () => universityApi.get(universityId)
  })

  // Fetch scripts
  const { data: scriptsData } = useQuery({
    queryKey: ['scripts', universityId],
    queryFn: () => scriptApi.getByUniversity(universityId)
  })

  // Fetch job list
  const { data: jobsData, refetch: refetchJobs } = useQuery({
    queryKey: ['jobs', universityId],
    queryFn: () => jobApi.getByUniversity(universityId)
  })

  // Fetch result data
  const { data: resultData } = useQuery({
    queryKey: ['result', universityId],
    queryFn: () => resultApi.get(universityId),
    enabled: activeTab === 'data'
  })

  // Fetch job status (polling)
  const { data: jobStatusData } = useQuery({
    queryKey: ['job-status', pollingJobId],
    queryFn: () => jobApi.getStatus(pollingJobId!),
    enabled: !!pollingJobId,
    refetchInterval: pollingJobId ? 2000 : false
  })

  // Start a scraper job
  const startJobMutation = useMutation({
    mutationFn: () => jobApi.start(universityId),
    onSuccess: (response) => {
      message.success('爬虫任务已启动')
      setPollingJobId(response.data.id)
      refetchJobs()
    },
    onError: (error: any) => {
      message.error(error.response?.data?.detail || '启动失败')
    }
  })

  // Watch for job completion
  useEffect(() => {
    if (jobStatusData?.data?.status === 'completed' || jobStatusData?.data?.status === 'failed') {
      setPollingJobId(null)
      refetchJobs()
      queryClient.invalidateQueries({ queryKey: ['university', universityId] })
      queryClient.invalidateQueries({ queryKey: ['result', universityId] })

      if (jobStatusData?.data?.status === 'completed') {
        message.success('爬取完成!')
      } else {
        message.error('爬取失败')
      }
    }
  }, [jobStatusData?.data?.status])

  const university = universityData?.data
  const scripts = scriptsData?.data || []
  const jobs = jobsData?.data || []
  const result = resultData?.data

  // Build the data tree
  const buildDataTree = () => {
    if (!result?.result_data?.schools) return []

    return result.result_data.schools.map((school: any, si: number) => ({
      key: `school-${si}`,
      title: (
        <span>
          <BankOutlined style={{ marginRight: 8 }} />
          {school.name} ({school.programs?.length || 0}个项目)
        </span>
      ),
      children: school.programs?.map((prog: any, pi: number) => ({
        key: `program-${si}-${pi}`,
        title: (
          <span>
            <BookOutlined style={{ marginRight: 8 }} />
            {prog.name} ({prog.faculty?.length || 0}位导师)
          </span>
        ),
        children: prog.faculty?.map((fac: any, fi: number) => ({
          key: `faculty-${si}-${pi}-${fi}`,
          title: (
            <span>
              <TeamOutlined style={{ marginRight: 8 }} />
              <a href={fac.url} target="_blank" rel="noreferrer">{fac.name}</a>
            </span>
          ),
          isLeaf: true
        }))
      }))
    }))
  }

  if (universityLoading) {
    return <Card><Spin size="large" /></Card>
  }

  if (!university) {
    return <Card><Empty description="大学不存在" /></Card>
  }

  const activeScript = scripts.find((s: any) => s.status === 'active')
  const latestJob = jobs[0]
  const isRunning = pollingJobId !== null || latestJob?.status === 'running'

  return (
    <div>
      {/* Header */}
      <Card style={{ marginBottom: 16 }}>
        <Space style={{ marginBottom: 16 }}>
          <Button icon={<ArrowLeftOutlined />} onClick={() => navigate('/')}>
            返回列表
          </Button>
        </Space>

        <Row gutter={24}>
          <Col span={16}>
            <Title level={3} style={{ marginBottom: 8 }}>{university.name}</Title>
            <Paragraph>
              <a href={university.url} target="_blank" rel="noreferrer">{university.url}</a>
            </Paragraph>
            <Space>
              <Tag>{university.country || '未知国家'}</Tag>
              <Tag color={university.status === 'ready' ? 'green' : 'orange'}>
                {university.status === 'ready' ? '就绪' : university.status}
              </Tag>
            </Space>
          </Col>
          <Col span={8} style={{ textAlign: 'right' }}>
            <Button
              type="primary"
              size="large"
              icon={isRunning ? <Spin size="small" /> : <PlayCircleOutlined />}
              onClick={() => startJobMutation.mutate()}
              disabled={!activeScript || isRunning}
              loading={startJobMutation.isPending}
            >
              {isRunning ? '爬虫运行中...' : '一键运行爬虫'}
            </Button>
          </Col>
        </Row>

        {/* Statistics */}
        <Row gutter={16} style={{ marginTop: 24 }}>
          <Col span={6}>
            <Statistic title="学院数" value={university.latest_result?.schools_count || 0} />
          </Col>
          <Col span={6}>
            <Statistic title="项目数" value={university.latest_result?.programs_count || 0} />
          </Col>
          <Col span={6}>
            <Statistic title="导师数" value={university.latest_result?.faculty_count || 0} />
          </Col>
          <Col span={6}>
            <Statistic title="脚本版本" value={activeScript?.version || 0} />
          </Col>
        </Row>
      </Card>

      {/* Run progress */}
      {pollingJobId && jobStatusData?.data && (
        <Card style={{ marginBottom: 16 }}>
          <Title level={5}>爬虫运行中</Title>
          <Progress percent={jobStatusData.data.progress} status="active" />
          <Text type="secondary">{jobStatusData.data.current_step}</Text>

          <div style={{ marginTop: 16, maxHeight: 200, overflowY: 'auto' }}>
            <Timeline
              items={jobStatusData.data.logs?.slice(-10).map((log: any) => ({
                color: log.level === 'error' ? 'red' : log.level === 'warning' ? 'orange' : 'blue',
                children: <Text>{log.message}</Text>
              }))}
            />
          </div>
        </Card>
      )}

      {/* Tabs */}
      <Card>
        <Tabs activeKey={activeTab} onChange={setActiveTab}>
          {/* Overview */}
          <TabPane tab="概览" key="overview">
            <Descriptions title="基本信息" bordered column={2}>
              <Descriptions.Item label="大学名称">{university.name}</Descriptions.Item>
              <Descriptions.Item label="官网地址">
                <a href={university.url} target="_blank" rel="noreferrer">{university.url}</a>
              </Descriptions.Item>
              <Descriptions.Item label="国家">{university.country || '-'}</Descriptions.Item>
              <Descriptions.Item label="状态">
                <Tag color={university.status === 'ready' ? 'green' : 'default'}>
                  {university.status}
                </Tag>
              </Descriptions.Item>
              <Descriptions.Item label="创建时间">
                {new Date(university.created_at).toLocaleString()}
              </Descriptions.Item>
              <Descriptions.Item label="更新时间">
                {new Date(university.updated_at).toLocaleString()}
              </Descriptions.Item>
            </Descriptions>

            <Title level={5} style={{ marginTop: 24 }}>历史任务</Title>
            <Table
              dataSource={jobs.slice(0, 5)}
              rowKey="id"
              pagination={false}
              columns={[
                {
                  title: '任务ID',
                  dataIndex: 'id',
                  width: 80
                },
                {
                  title: '状态',
                  dataIndex: 'status',
                  width: 100,
                  render: (status: string) => {
                    const s = statusMap[status] || { color: 'default', text: status }
                    return <Tag color={s.color}>{s.icon} {s.text}</Tag>
                  }
                },
                {
                  title: '进度',
                  dataIndex: 'progress',
                  width: 150,
                  render: (progress: number) => <Progress percent={progress} size="small" />
                },
                {
                  title: '开始时间',
                  dataIndex: 'started_at',
                  render: (t: string) => t ? new Date(t).toLocaleString() : '-'
                },
                {
                  title: '完成时间',
                  dataIndex: 'completed_at',
                  render: (t: string) => t ? new Date(t).toLocaleString() : '-'
                }
              ]}
            />
          </TabPane>

          {/* Data view */}
          <TabPane tab="数据查看" key="data">
            {result?.result_data ? (
              <div>
                <Row style={{ marginBottom: 16 }}>
                  <Col span={12}>
                    <Input
                      placeholder="搜索项目或导师..."
                      prefix={<SearchOutlined />}
                      value={searchKeyword}
                      onChange={(e) => setSearchKeyword(e.target.value)}
                      style={{ width: 300 }}
                    />
                  </Col>
                  <Col span={12} style={{ textAlign: 'right' }}>
                    <Button
                      icon={<DownloadOutlined />}
                      onClick={() => {
                        const dataStr = JSON.stringify(result.result_data, null, 2)
                        const blob = new Blob([dataStr], { type: 'application/json' })
                        const url = URL.createObjectURL(blob)
                        const a = document.createElement('a')
                        a.href = url
                        a.download = `${university.name}_data.json`
                        a.click()
                      }}
                    >
                      导出JSON
                    </Button>
                  </Col>
                </Row>

                <Tree
                  showLine
                  defaultExpandedKeys={['school-0']}
                  treeData={buildDataTree()}
                  style={{ background: '#fafafa', padding: 16, borderRadius: 8 }}
                />
              </div>
            ) : (
              <Empty description="暂无数据,请先运行爬虫" />
            )}
          </TabPane>

          {/* Script management */}
          <TabPane tab="脚本管理" key="script">
            {activeScript ? (
              <div>
                <Descriptions bordered column={2}>
                  <Descriptions.Item label="脚本名称">{activeScript.script_name}</Descriptions.Item>
                  <Descriptions.Item label="版本">v{activeScript.version}</Descriptions.Item>
                  <Descriptions.Item label="状态">
                    <Tag color="green">活跃</Tag>
                  </Descriptions.Item>
                  <Descriptions.Item label="创建时间">
                    {new Date(activeScript.created_at).toLocaleString()}
                  </Descriptions.Item>
                </Descriptions>

                <Title level={5} style={{ marginTop: 24 }}>脚本代码</Title>
                <pre style={{
                  background: '#1e1e1e',
                  color: '#d4d4d4',
                  padding: 16,
                  borderRadius: 8,
                  maxHeight: 400,
                  overflow: 'auto'
                }}>
                  {activeScript.script_content}
                </pre>
              </div>
            ) : (
              <Empty description="暂无脚本" />
            )}
          </TabPane>
        </Tabs>
      </Card>
    </div>
  )
}
77 frontend/src/services/api.ts Normal file
@@ -0,0 +1,77 @@
/**
 * API service layer
 */
import axios from 'axios'

const api = axios.create({
  baseURL: '/api',
  timeout: 60000
})

// University APIs
export const universityApi = {
  list: (params?: { skip?: number; limit?: number; search?: string }) =>
    api.get('/universities', { params }),

  get: (id: number) =>
    api.get(`/universities/${id}`),

  create: (data: { name: string; url: string; country?: string }) =>
    api.post('/universities', data),

  update: (id: number, data: { name?: string; url?: string; country?: string }) =>
    api.put(`/universities/${id}`, data),

  delete: (id: number) =>
    api.delete(`/universities/${id}`)
}

// Script APIs
export const scriptApi = {
  generate: (data: { university_url: string; university_name?: string }) =>
    api.post('/scripts/generate', data),

  getByUniversity: (universityId: number) =>
    api.get(`/scripts/university/${universityId}`),

  get: (id: number) =>
    api.get(`/scripts/${id}`)
}

// Job APIs
export const jobApi = {
  start: (universityId: number) =>
    api.post(`/jobs/start/${universityId}`),

  get: (id: number) =>
    api.get(`/jobs/${id}`),

  getStatus: (id: number) =>
    api.get(`/jobs/${id}/status`),

  getByUniversity: (universityId: number) =>
    api.get(`/jobs/university/${universityId}`),

  cancel: (id: number) =>
    api.post(`/jobs/${id}/cancel`)
}

// Result APIs
export const resultApi = {
  get: (universityId: number) =>
    api.get(`/results/university/${universityId}`),

  getSchools: (universityId: number) =>
    api.get(`/results/university/${universityId}/schools`),

  getPrograms: (universityId: number, params?: { school_name?: string; search?: string }) =>
    api.get(`/results/university/${universityId}/programs`, { params }),

  getFaculty: (universityId: number, params?: { school_name?: string; program_name?: string; search?: string; skip?: number; limit?: number }) =>
    api.get(`/results/university/${universityId}/faculty`, { params }),

  export: (universityId: number) =>
    api.get(`/results/university/${universityId}/export`, { responseType: 'blob' })
}

export default api
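For reference, the same REST endpoints that `api.ts` wraps can be exercised directly against the backend. A minimal sketch, assuming the FastAPI service from `scripts/start_backend.py` is listening on `http://localhost:8000` (the target of the Vite proxy) and that the response fields (`id`, `progress`, `current_step`) match what the frontend reads; those field names are assumptions, not confirmed API contracts.

```python
# Minimal sketch: hitting the backend endpoints that api.ts wraps.
# Assumes the FastAPI app is listening on port 8000; response field names
# are assumptions based on how the frontend consumes them.
import requests

BASE = "http://localhost:8000/api"

# Create a university record (mirrors universityApi.create)
created = requests.post(f"{BASE}/universities", json={
    "name": "Harvard University",
    "url": "https://www.harvard.edu/",
    "country": "USA",
}).json()
university_id = created["id"]  # assumed response field

# Kick off a scraping job and poll its status (mirrors jobApi.start / jobApi.getStatus)
job = requests.post(f"{BASE}/jobs/start/{university_id}").json()
status = requests.get(f"{BASE}/jobs/{job['id']}/status").json()
print(status.get("progress"), status.get("current_step"))
```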
1 frontend/src/vite-env.d.ts vendored Normal file
@@ -0,0 +1 @@
/// <reference types="vite/client" />
21 frontend/tsconfig.json Normal file
@@ -0,0 +1,21 @@
{
  "compilerOptions": {
    "target": "ES2020",
    "useDefineForClassFields": true,
    "lib": ["ES2020", "DOM", "DOM.Iterable"],
    "module": "ESNext",
    "skipLibCheck": true,
    "moduleResolution": "bundler",
    "allowImportingTsExtensions": true,
    "resolveJsonModule": true,
    "isolatedModules": true,
    "noEmit": true,
    "jsx": "react-jsx",
    "strict": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true,
    "noFallthroughCasesInSwitch": true
  },
  "include": ["src"],
  "references": [{ "path": "./tsconfig.node.json" }]
}
10 frontend/tsconfig.node.json Normal file
@@ -0,0 +1,10 @@
{
  "compilerOptions": {
    "composite": true,
    "skipLibCheck": true,
    "module": "ESNext",
    "moduleResolution": "bundler",
    "allowSyntheticDefaultImports": true
  },
  "include": ["vite.config.ts"]
}
15 frontend/vite.config.ts Normal file
@@ -0,0 +1,15 @@
import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react'

export default defineConfig({
  plugins: [react()],
  server: {
    port: 3000,
    proxy: {
      '/api': {
        target: 'http://localhost:8000',
        changeOrigin: true
      }
    }
  }
})
135 generate_scraper.py Normal file
@@ -0,0 +1,135 @@
#!/usr/bin/env python
"""
University Scraper Generator

This script generates a Playwright-based web scraper for any university website.
It uses an AI agent to analyze the university's website structure and create
a customized scraper that collects master's program pages and faculty profiles.

Usage:
    python generate_scraper.py

Configuration:
    Set the following variables below:
    - TARGET_URL: The university homepage URL
    - CAMPUS_NAME: Short name for the university
    - LANGUAGE: Primary language of the website
    - MAX_DEPTH: How deep to crawl (default: 3)
    - MAX_PAGES: Maximum pages to visit during sampling (default: 30)
"""
import argparse
import os
import sys

# ============================================================================
# CONFIGURATION - Modify these values for your target university
# ============================================================================
TARGET_URL = "https://www.harvard.edu/"
CAMPUS_NAME = "Harvard"
LANGUAGE = "English"
MAX_DEPTH = 3
MAX_PAGES = 30
# ============================================================================


def get_env_key(name: str) -> str | None:
    """Get environment variable, with Windows registry fallback."""
    # Try standard environment variable first
    value = os.environ.get(name)
    if value:
        return value

    # Windows: try reading from user environment in registry
    if sys.platform == "win32":
        try:
            import winreg
            with winreg.OpenKey(winreg.HKEY_CURRENT_USER, r"Environment") as key:
                return winreg.QueryValueEx(key, name)[0]
        except Exception:
            pass

    return None


def main():
    parser = argparse.ArgumentParser(
        description="Generate a Playwright scraper for a university website"
    )
    parser.add_argument(
        "--url",
        default=TARGET_URL,
        help="University homepage URL"
    )
    parser.add_argument(
        "--name",
        default=CAMPUS_NAME,
        help="Short name for the university"
    )
    parser.add_argument(
        "--language",
        default=LANGUAGE,
        help="Primary language of the website"
    )
    parser.add_argument(
        "--max-depth",
        type=int,
        default=MAX_DEPTH,
        help="Maximum crawl depth"
    )
    parser.add_argument(
        "--max-pages",
        type=int,
        default=MAX_PAGES,
        help="Maximum pages to visit during sampling"
    )
    parser.add_argument(
        "--no-snapshot",
        action="store_true",
        help="Skip browser snapshot capture"
    )
    args = parser.parse_args()

    # Configure OpenRouter API
    openrouter_key = get_env_key("OPENROUTER_API_KEY")
    if not openrouter_key:
        print("Error: OPENROUTER_API_KEY environment variable not set")
        print("Please set it with your OpenRouter API key")
        sys.exit(1)

    os.environ["OPENAI_API_KEY"] = openrouter_key
    os.environ["CODEGEN_MODEL_PROVIDER"] = "openrouter"
    os.environ["CODEGEN_OPENROUTER_MODEL"] = "anthropic/claude-3-opus"

    # Import after environment is configured
    from university_agent import GenerationEngine, GenerationRequest, Settings

    settings = Settings()
    print(f"Provider: {settings.model_provider}")
    print(f"Model: {settings.openrouter_model}")

    engine = GenerationEngine(settings)
    request = GenerationRequest(
        target_url=args.url,
        campus_name=args.name,
        assumed_language=args.language,
        max_depth=args.max_depth,
        max_pages=args.max_pages,
    )

    print(f"\nGenerating scraper for: {args.name}")
    print(f"URL: {args.url}")
    print(f"Max depth: {args.max_depth}, Max pages: {args.max_pages}")
    print("-" * 50)

    result = engine.generate(request, capture_snapshot=not args.no_snapshot)

    print("-" * 50)
    print(f"Script saved to: {result.script_path}")
    print(f"Project slug: {result.plan.project_slug}")
    print("\nTo run the scraper:")
    print("  cd artifacts")
    print(f"  uv run python {result.script_path.name} --max-pages 50 --no-verify")


if __name__ == "__main__":
    main()
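The same generation flow can be driven from other Python code instead of the CLI. A minimal sketch, using only the names `generate_scraper.py` itself imports (`GenerationEngine`, `GenerationRequest`, `Settings`); the Stanford URL and the placeholder key are illustrative, and any other behaviour is an assumption.

```python
# Sketch: invoking the generation engine programmatically.
# Only the imports shown in generate_scraper.py are relied on; everything
# else (URL, key placeholder, parameter values) is illustrative.
import os

os.environ.setdefault("CODEGEN_MODEL_PROVIDER", "openrouter")
os.environ.setdefault("OPENAI_API_KEY", "<your-openrouter-key>")

from university_agent import GenerationEngine, GenerationRequest, Settings

engine = GenerationEngine(Settings())
result = engine.generate(
    GenerationRequest(
        target_url="https://www.stanford.edu/",
        campus_name="Stanford",
        assumed_language="English",
        max_depth=2,
        max_pages=20,
    ),
    capture_snapshot=False,  # skip the Playwright sampling pass, like --no-snapshot
)
print(result.script_path)
```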
164 scripts/reorganize_by_school.py Normal file
@@ -0,0 +1,164 @@
#!/usr/bin/env python3
"""
Reorganize the scraped Harvard data by school.

Reads the original flat data and regroups the output into the
School -> Program -> Faculty hierarchy.
"""

import json
from pathlib import Path
from datetime import datetime, timezone
from urllib.parse import urlparse
from collections import defaultdict

# Harvard school mapping - infer the owning school from the URL subdomain
SCHOOL_MAPPING = {
    "gsas.harvard.edu": "Graduate School of Arts and Sciences (GSAS)",
    "seas.harvard.edu": "John A. Paulson School of Engineering and Applied Sciences (SEAS)",
    "hbs.edu": "Harvard Business School (HBS)",
    "www.hbs.edu": "Harvard Business School (HBS)",
    "gsd.harvard.edu": "Graduate School of Design (GSD)",
    "www.gsd.harvard.edu": "Graduate School of Design (GSD)",
    "gse.harvard.edu": "Graduate School of Education (HGSE)",
    "www.gse.harvard.edu": "Graduate School of Education (HGSE)",
    "hks.harvard.edu": "Harvard Kennedy School (HKS)",
    "www.hks.harvard.edu": "Harvard Kennedy School (HKS)",
    "hls.harvard.edu": "Harvard Law School (HLS)",
    "hms.harvard.edu": "Harvard Medical School (HMS)",
    "hsph.harvard.edu": "T.H. Chan School of Public Health (HSPH)",
    "www.hsph.harvard.edu": "T.H. Chan School of Public Health (HSPH)",
    "hds.harvard.edu": "Harvard Divinity School (HDS)",
    "hsdm.harvard.edu": "Harvard School of Dental Medicine (HSDM)",
    "fas.harvard.edu": "Faculty of Arts and Sciences (FAS)",
    "aaas.fas.harvard.edu": "Faculty of Arts and Sciences (FAS)",
    "dce.harvard.edu": "Division of Continuing Education (DCE)",
    "extension.harvard.edu": "Harvard Extension School",
    "cs.seas.harvard.edu": "John A. Paulson School of Engineering and Applied Sciences (SEAS)",
}

# School URL mapping
SCHOOL_URLS = {
    "Graduate School of Arts and Sciences (GSAS)": "https://gsas.harvard.edu/",
    "John A. Paulson School of Engineering and Applied Sciences (SEAS)": "https://seas.harvard.edu/",
    "Harvard Business School (HBS)": "https://www.hbs.edu/",
    "Graduate School of Design (GSD)": "https://www.gsd.harvard.edu/",
    "Graduate School of Education (HGSE)": "https://www.gse.harvard.edu/",
    "Harvard Kennedy School (HKS)": "https://www.hks.harvard.edu/",
    "Harvard Law School (HLS)": "https://hls.harvard.edu/",
    "Harvard Medical School (HMS)": "https://hms.harvard.edu/",
    "T.H. Chan School of Public Health (HSPH)": "https://www.hsph.harvard.edu/",
    "Harvard Divinity School (HDS)": "https://hds.harvard.edu/",
    "Harvard School of Dental Medicine (HSDM)": "https://hsdm.harvard.edu/",
    "Faculty of Arts and Sciences (FAS)": "https://fas.harvard.edu/",
    "Division of Continuing Education (DCE)": "https://dce.harvard.edu/",
    "Harvard Extension School": "https://extension.harvard.edu/",
    "Other": "https://www.harvard.edu/",
}


def determine_school_from_url(url: str) -> str:
    """Determine the owning school from a URL."""
    if not url:
        return "Other"

    parsed = urlparse(url)
    domain = parsed.netloc.lower()

    # Try an exact match first
    for pattern, school_name in SCHOOL_MAPPING.items():
        if domain == pattern:
            return school_name

    # Then fall back to a partial match
    for pattern, school_name in SCHOOL_MAPPING.items():
        if pattern in domain:
            return school_name

    return "Other"


def reorganize_data(input_path: str, output_path: str):
    """Reorganize the data into the school hierarchy."""

    # Load the original data
    with open(input_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    print(f"读取原始数据: {data['total_programs']} 个项目, {data['total_faculty_found']} 位导师")

    # Group by school
    schools_dict = defaultdict(lambda: {"name": "", "url": "", "programs": []})

    for prog in data['programs']:
        # Determine the school from faculty_page_url
        faculty_url = prog.get('faculty_page_url', '')
        school_name = determine_school_from_url(faculty_url)

        # If there is no faculty_page_url, fall back to the program URL
        if school_name == "Other" and prog.get('url'):
            school_name = determine_school_from_url(prog['url'])

        # Build the program object
        program = {
            "name": prog['name'],
            "url": prog.get('url', ''),
            "degree_type": prog.get('degrees', ''),
            "faculty_page_url": faculty_url,
            "faculty": prog.get('faculty', [])
        }

        # Attach it to its school
        if not schools_dict[school_name]["name"]:
            schools_dict[school_name]["name"] = school_name
            schools_dict[school_name]["url"] = SCHOOL_URLS.get(school_name, "")

        schools_dict[school_name]["programs"].append(program)

    # Convert to a list and sort
    schools_list = sorted(schools_dict.values(), key=lambda s: s["name"])

    # Build the output structure
    result = {
        "name": "Harvard University",
        "url": "https://www.harvard.edu/",
        "country": "USA",
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        "schools": schools_list
    }

    # Print summary statistics
    print("\n" + "=" * 60)
    print("按学院重新组织完成!")
    print("=" * 60)
    print(f"大学: {result['name']}")
    print(f"学院数: {len(schools_list)}")

    total_programs = sum(len(s['programs']) for s in schools_list)
    total_faculty = sum(len(p['faculty']) for s in schools_list for p in s['programs'])

    print(f"项目数: {total_programs}")
    print(f"导师数: {total_faculty}")

    print("\n各学院统计:")
    for school in schools_list:
        prog_count = len(school['programs'])
        fac_count = sum(len(p['faculty']) for p in school['programs'])
        print(f"  {school['name']}: {prog_count}个项目, {fac_count}位导师")

    # Save the result
    output_file = Path(output_path)
    output_file.parent.mkdir(parents=True, exist_ok=True)

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(result, f, ensure_ascii=False, indent=2)

    print(f"\n结果已保存到: {output_path}")

    return result


if __name__ == "__main__":
    input_file = "artifacts/harvard_programs_with_faculty.json"
    output_file = "output/harvard_hierarchical_result.json"

    reorganize_data(input_file, output_file)
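To sanity-check the URL-to-school mapping above, `determine_school_from_url` can be exercised on its own. A minimal sketch, assuming the module is importable (e.g. a session started inside `scripts/`); the example URLs are illustrative, while the expected names come straight from `SCHOOL_MAPPING`.

```python
# Minimal sketch: exercising determine_school_from_url directly.
# Assumes scripts/ is on sys.path; the URLs below are illustrative only.
from reorganize_by_school import determine_school_from_url

# Exact subdomain match
assert determine_school_from_url("https://seas.harvard.edu/faculty") == (
    "John A. Paulson School of Engineering and Applied Sciences (SEAS)"
)
# Partial match covers hosts nested under a known subdomain
assert determine_school_from_url("https://blogs.seas.harvard.edu/some-page") == (
    "John A. Paulson School of Engineering and Applied Sciences (SEAS)"
)
# Anything unrecognised falls back to "Other"
assert determine_school_from_url("https://example.org/") == "Other"
```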
45 scripts/start_backend.py Normal file
@@ -0,0 +1,45 @@
#!/usr/bin/env python3
"""
Start the backend API service (local development).
"""

import subprocess
import sys
import os

# Change to the project root directory
project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
os.chdir(project_root)

# Add backend to the Python path
backend_path = os.path.join(project_root, "backend")
sys.path.insert(0, backend_path)

print("=" * 60)
print("启动大学爬虫 Web API 服务")
print("=" * 60)
print(f"项目目录: {project_root}")
print(f"后端目录: {backend_path}")
print()

# Check whether the dependencies are installed
try:
    import fastapi
    import uvicorn
except ImportError:
    print("正在安装后端依赖...")
    subprocess.run([sys.executable, "-m", "pip", "install", "-r", "backend/requirements.txt"])

# Initialize the database
print("初始化数据库...")
os.chdir(backend_path)

# Start the service
print()
print("启动 FastAPI 服务...")
print("API文档 (Swagger UI): http://localhost:8000/docs")
print("ReDoc: http://localhost:8000/redoc")
print()

import uvicorn
uvicorn.run("app.main:app", host="0.0.0.0", port=8000, reload=True)
42 scripts/start_dev.bat Normal file
@@ -0,0 +1,42 @@
@echo off
echo ============================================================
echo 大学爬虫 Web 系统 - 本地开发启动
echo ============================================================

echo.
echo 启动后端API服务...
cd /d "%~dp0..\backend"

REM Install backend dependencies
pip install -r requirements.txt -q

REM Start the backend
start cmd /k "cd /d %~dp0..\backend && uvicorn app.main:app --reload --port 8000"

echo 后端已启动: http://localhost:8000
echo API文档: http://localhost:8000/docs

echo.
echo 启动前端服务...
cd /d "%~dp0..\frontend"

REM Install frontend dependencies
if not exist node_modules (
    echo 安装前端依赖...
    npm install
)

REM Start the frontend
start cmd /k "cd /d %~dp0..\frontend && npm run dev"

echo 前端已启动: http://localhost:3000

echo.
echo ============================================================
echo 系统启动完成!
echo.
echo 后端API: http://localhost:8000/docs
echo 前端页面: http://localhost:3000
echo ============================================================

pause
126 scripts/test_harvard.py Normal file
@@ -0,0 +1,126 @@
#!/usr/bin/env python3
"""
Test the Harvard scrape - only two schools.
"""

import asyncio
import sys
from pathlib import Path

# Add the project source path
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))

from university_scraper.config import ScraperConfig
from university_scraper.scraper import UniversityScraper


# Simplified test configuration - only two schools
TEST_CONFIG = {
    "university": {
        "name": "Harvard University",
        "url": "https://www.harvard.edu/",
        "country": "USA"
    },
    "schools": {
        "discovery_method": "static_list",
        "static_list": [
            {
                "name": "John A. Paulson School of Engineering and Applied Sciences (SEAS)",
                "url": "https://seas.harvard.edu/"
            },
            {
                "name": "Graduate School of Design (GSD)",
                "url": "https://www.gsd.harvard.edu/"
            }
        ]
    },
    "programs": {
        "paths_to_try": [
            "/academics/graduate-programs",
            "/programs",
            "/academics/programs",
            "/graduate"
        ],
        "link_patterns": [
            {"text_contains": ["program", "degree"], "href_contains": ["/program", "/degree"]},
            {"text_contains": ["master", "graduate"], "href_contains": ["/master", "/graduate"]}
        ],
        "selectors": {
            "program_item": "div.program-item, li.program, a[href*='/program']",
            "program_name": "h3, .title",
            "program_url": "a[href]",
            "degree_type": ".degree"
        },
        "pagination": {"type": "none"}
    },
    "faculty": {
        "discovery_strategies": [
            {
                "type": "link_in_page",
                "patterns": [
                    {"text_contains": ["faculty", "people"], "href_contains": ["/faculty", "/people"]}
                ]
            },
            {
                "type": "url_pattern",
                "patterns": [
                    "{school_url}/faculty",
                    "{school_url}/people"
                ]
            }
        ],
        "selectors": {
            "faculty_item": "div.faculty, li.person",
            "faculty_name": "h3, .name",
            "faculty_url": "a[href*='/people/'], a[href*='/faculty/']"
        }
    },
    "filters": {
        "program_degree_types": {
            "include": ["Master", "M.S.", "M.A.", "MBA", "M.Eng", "S.M."],
            "exclude": ["Ph.D.", "Doctor", "Bachelor"]
        },
        "exclude_schools": []
    }
}


async def test_harvard():
    """Run the Harvard test scrape."""
    print("=" * 60)
    print("测试Harvard大学爬取(简化版 - 2个学院)")
    print("=" * 60)

    config = ScraperConfig.from_dict(TEST_CONFIG)

    async with UniversityScraper(config, headless=False) as scraper:
        university = await scraper.scrape()
        scraper.save_results("output/harvard_test_result.json")

    # Print detailed results
    print("\n" + "=" * 60)
    print("详细结果:")
    print("=" * 60)

    for school in university.schools:
        print(f"\n学院: {school.name}")
        print(f"  URL: {school.url}")
        print(f"  项目数: {len(school.programs)}")

        for prog in school.programs[:5]:
            print(f"\n  项目: {prog.name}")
            print(f"    URL: {prog.url}")
            print(f"    学位: {prog.degree_type}")
            print(f"    导师数: {len(prog.faculty)}")

            if prog.faculty:
                print("    导师示例:")
                for f in prog.faculty[:3]:
                    print(f"      - {f.name}: {f.url}")

        if len(school.programs) > 5:
            print(f"\n  ... 还有 {len(school.programs) - 5} 个项目")


if __name__ == "__main__":
    asyncio.run(test_harvard())
@@ -1,6 +1,7 @@
 from __future__ import annotations
 
 import json
+import re
 from textwrap import dedent
 
 from agno.agent import Agent
@@ -101,6 +102,12 @@ class ScriptAgent:
             return Claude(id=self.settings.anthropic_model)
         if provider == "openai":
             return OpenAIChat(id=self.settings.openai_model)
+        if provider == "openrouter":
+            # OpenRouter is OpenAI-compatible, use OpenAIChat with custom base_url
+            return OpenAIChat(
+                id=self.settings.openrouter_model,
+                base_url=self.settings.openrouter_base_url,
+            )
         raise ValueError(f"Unsupported provider: {provider}")
 
     def build_plan(self, request: GenerationRequest, summary: SiteSummary | None) -> ScriptPlan:
@@ -128,17 +135,65 @@ class ScriptAgent:
         plan.script_name = self.settings.default_script_name
         return plan
 
+    @staticmethod
+    def _extract_json(text: str) -> dict | None:
+        """Try to extract JSON from text that might contain markdown or other content."""
+        # Try direct parsing first
+        try:
+            return json.loads(text)
+        except json.JSONDecodeError:
+            pass
+
+        # Try to find JSON in code blocks
+        code_block_pattern = r"```(?:json)?\s*([\s\S]*?)```"
+        matches = re.findall(code_block_pattern, text)
+        for match in matches:
+            try:
+                return json.loads(match.strip())
+            except json.JSONDecodeError:
+                continue
+
+        # Try to find JSON object pattern
+        json_pattern = r"\{[\s\S]*\}"
+        matches = re.findall(json_pattern, text)
+        for match in matches:
+            try:
+                return json.loads(match)
+            except json.JSONDecodeError:
+                continue
+
+        return None
+
     @staticmethod
     def _coerce_plan(run_response: RunOutput) -> ScriptPlan:
         content = run_response.content
         if isinstance(content, ScriptPlan):
             return content
         if isinstance(content, dict):
-            return ScriptPlan.model_validate(content)
-        if isinstance(content, str):
-            try:
-                payload = json.loads(content)
-            except json.JSONDecodeError as exc:
-                raise ValueError("Agent returned a non-JSON payload.") from exc
-            return ScriptPlan.model_validate(payload)
-        raise ValueError("Agent response did not match the ScriptPlan schema.")
+            payload = content
+        elif isinstance(content, str):
+            payload = ScriptAgent._extract_json(content)
+            if payload is None:
+                raise ValueError(f"Agent returned a non-JSON payload: {content[:500]}")
+        else:
+            raise ValueError("Agent response did not match the ScriptPlan schema.")
+
+        # Fill in missing required fields with defaults
+        if "project_slug" not in payload:
+            payload["project_slug"] = "university-scraper"
+        if "description" not in payload:
+            payload["description"] = "Playwright scraper for university master programs and faculty profiles."
+        if "master_program_keywords" not in payload:
+            payload["master_program_keywords"] = ["master", "graduate", "M.S.", "M.A."]
+        if "faculty_keywords" not in payload:
+            payload["faculty_keywords"] = ["professor", "faculty", "researcher", "people"]
+        if "navigation_strategy" not in payload:
+            payload["navigation_strategy"] = "Navigate from homepage to departments to find programs and faculty."
+        # Handle navigation_strategy if it's a list instead of string
+        if isinstance(payload.get("navigation_strategy"), list):
+            payload["navigation_strategy"] = " ".join(payload["navigation_strategy"])
+        # Handle extra_notes if it's a string instead of list
+        if isinstance(payload.get("extra_notes"), str):
+            payload["extra_notes"] = [payload["extra_notes"]]
+
        return ScriptPlan.model_validate(payload)
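The new `_extract_json` helper is what lets `_coerce_plan` survive models that wrap their JSON in markdown instead of returning a bare object. A minimal standalone sketch of the same fallback logic, so it can be tried without the agent; the regexes mirror the diff above, and the sample model reply is made up.

```python
# Standalone sketch of the JSON-extraction fallback added in _extract_json.
# The patterns mirror the diff above; the sample reply text is illustrative.
import json
import re


def extract_json(text: str) -> dict | None:
    # 1. Plain JSON
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 2. JSON inside a fenced code block
    for match in re.findall(r"```(?:json)?\s*([\s\S]*?)```", text):
        try:
            return json.loads(match.strip())
        except json.JSONDecodeError:
            continue
    # 3. First {...} span anywhere in the text
    for match in re.findall(r"\{[\s\S]*\}", text):
        try:
            return json.loads(match)
        except json.JSONDecodeError:
            continue
    return None


reply = 'Here is the plan:\n```json\n{"project_slug": "stanford-masters"}\n```'
print(extract_json(reply))  # {'project_slug': 'stanford-masters'}
```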
@@ -10,7 +10,7 @@ from pydantic_settings import BaseSettings
 class Settings(BaseSettings):
     """Runtime configuration for the code-generation agent."""
 
-    model_provider: Literal["anthropic", "openai"] = Field(
+    model_provider: Literal["anthropic", "openai", "openrouter"] = Field(
         default="anthropic",
         description="LLM provider consumed through the Agno SDK.",
     )
@@ -22,6 +22,14 @@ class Settings(BaseSettings):
         default="o4-mini",
         description="Default OpenAI model identifier.",
     )
+    openrouter_model: str = Field(
+        default="anthropic/claude-sonnet-4",
+        description="Default OpenRouter model identifier.",
+    )
+    openrouter_base_url: str = Field(
+        default="https://openrouter.ai/api/v1",
+        description="OpenRouter API base URL.",
+    )
     reasoning_enabled: bool = Field(
         default=True,
         description="Enable multi-step reasoning for higher-fidelity plans.",
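With the OpenRouter fields in place, provider selection is driven entirely by settings. A minimal sketch of switching to OpenRouter via environment variables; the `CODEGEN_*` names follow what `generate_scraper.py` sets, so the exact env prefix used by `Settings` is an assumption, and the key value is a placeholder.

```python
# Sketch: selecting the OpenRouter provider through environment variables.
# The CODEGEN_* names mirror generate_scraper.py; the env prefix of Settings
# is an assumption, and the key below is a placeholder.
import os

os.environ["CODEGEN_MODEL_PROVIDER"] = "openrouter"
os.environ["CODEGEN_OPENROUTER_MODEL"] = "anthropic/claude-sonnet-4"
os.environ["OPENAI_API_KEY"] = "<your-openrouter-key>"  # OpenRouter key goes in the OpenAI slot

from university_agent import Settings

settings = Settings()
print(settings.model_provider)       # openrouter
print(settings.openrouter_base_url)  # https://openrouter.ai/api/v1 (default from the diff)
```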
7 src/university_scraper/__init__.py Normal file
@@ -0,0 +1,7 @@
"""
University Scraper - a general-purpose framework for scraping university websites.

Supports crawling any overseas university website along the
School -> Program -> Faculty hierarchy.
"""

__version__ = "1.0.0"
8 src/university_scraper/__main__.py Normal file
@@ -0,0 +1,8 @@
"""
Module entry point, so the package can be run as `python -m university_scraper`.
"""

from .cli import main

if __name__ == "__main__":
    main()
374
src/university_scraper/analyzer.py
Normal file
374
src/university_scraper/analyzer.py
Normal file
@ -0,0 +1,374 @@
|
|||||||
|
"""
|
||||||
|
AI辅助页面分析工具
|
||||||
|
|
||||||
|
帮助分析新大学官网的页面结构,生成配置建议
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
from typing import Dict, Any, List, Optional
|
||||||
|
from urllib.parse import urljoin, urlparse
|
||||||
|
|
||||||
|
from playwright.async_api import async_playwright, Page
|
||||||
|
|
||||||
|
|
||||||
|
class PageAnalyzer:
|
||||||
|
"""页面结构分析器"""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
self.browser = None
|
||||||
|
self.page: Optional[Page] = None
|
||||||
|
|
||||||
|
async def __aenter__(self):
|
||||||
|
playwright = await async_playwright().start()
|
||||||
|
self.browser = await playwright.chromium.launch(headless=False)
|
||||||
|
context = await self.browser.new_context(
|
||||||
|
viewport={'width': 1920, 'height': 1080}
|
||||||
|
)
|
||||||
|
self.page = await context.new_page()
|
||||||
|
return self
|
||||||
|
|
||||||
|
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||||
|
if self.browser:
|
||||||
|
await self.browser.close()
|
||||||
|
|
||||||
|
async def analyze_university_homepage(self, url: str) -> Dict[str, Any]:
|
||||||
|
"""分析大学官网首页,寻找学院链接"""
|
||||||
|
print(f"\n分析大学首页: {url}")
|
||||||
|
|
||||||
|
await self.page.goto(url, wait_until='networkidle')
|
||||||
|
await self.page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
analysis = await self.page.evaluate('''() => {
|
||||||
|
const result = {
|
||||||
|
title: document.title,
|
||||||
|
schools_links: [],
|
||||||
|
navigation_links: [],
|
||||||
|
potential_schools_pages: [],
|
||||||
|
all_harvard_subdomains: new Set()
|
||||||
|
};
|
||||||
|
|
||||||
|
// 查找可能的学院链接
|
||||||
|
const schoolKeywords = ['school', 'college', 'faculty', 'institute', 'academy', 'department'];
|
||||||
|
const navKeywords = ['academics', 'schools', 'colleges', 'programs', 'education'];
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href || '';
|
||||||
|
const text = a.innerText.trim().toLowerCase();
|
||||||
|
|
||||||
|
// 收集所有子域名
|
||||||
|
try {
|
||||||
|
const urlObj = new URL(href);
|
||||||
|
if (urlObj.hostname.includes('harvard.edu') &&
|
||||||
|
urlObj.hostname !== 'www.harvard.edu') {
|
||||||
|
result.all_harvard_subdomains.add(urlObj.origin);
|
||||||
|
}
|
||||||
|
} catch(e) {}
|
||||||
|
|
||||||
|
// 查找学院链接
|
||||||
|
if (schoolKeywords.some(kw => text.includes(kw)) ||
|
||||||
|
schoolKeywords.some(kw => href.toLowerCase().includes(kw))) {
|
||||||
|
result.schools_links.push({
|
||||||
|
text: a.innerText.trim().substring(0, 100),
|
||||||
|
href: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
// 查找导航到学院列表的链接
|
||||||
|
if (navKeywords.some(kw => text.includes(kw))) {
|
||||||
|
result.potential_schools_pages.push({
|
||||||
|
text: a.innerText.trim().substring(0, 50),
|
||||||
|
href: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
// 转换Set为数组
|
||||||
|
result.all_harvard_subdomains = Array.from(result.all_harvard_subdomains);
|
||||||
|
|
||||||
|
return result;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f"\n页面标题: {analysis['title']}")
|
||||||
|
print(f"\n发现的子域名 ({len(analysis['all_harvard_subdomains'])} 个):")
|
||||||
|
for subdomain in analysis['all_harvard_subdomains'][:20]:
|
||||||
|
print(f" - {subdomain}")
|
||||||
|
|
||||||
|
print(f"\n可能的学院链接 ({len(analysis['schools_links'])} 个):")
|
||||||
|
for link in analysis['schools_links'][:15]:
|
||||||
|
print(f" - {link['text'][:50]} -> {link['href']}")
|
||||||
|
|
||||||
|
return analysis
|
||||||
|
|
||||||
|
async def analyze_school_page(self, url: str) -> Dict[str, Any]:
|
||||||
|
"""分析学院页面,寻找项目列表"""
|
||||||
|
print(f"\n分析学院页面: {url}")
|
||||||
|
|
||||||
|
await self.page.goto(url, wait_until='networkidle')
|
||||||
|
await self.page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
analysis = await self.page.evaluate('''() => {
|
||||||
|
const result = {
|
||||||
|
title: document.title,
|
||||||
|
navigation: [],
|
||||||
|
program_links: [],
|
||||||
|
degree_mentions: [],
|
||||||
|
faculty_links: []
|
||||||
|
};
|
||||||
|
|
||||||
|
// 分析导航结构
|
||||||
|
document.querySelectorAll('nav a, [class*="nav"] a, header a').forEach(a => {
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
const href = a.href || '';
|
||||||
|
if (text.length > 2 && text.length < 50) {
|
||||||
|
result.navigation.push({ text, href });
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
// 查找项目/学位链接
|
||||||
|
const programKeywords = ['program', 'degree', 'master', 'graduate', 'academic', 'study'];
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const text = a.innerText.trim().toLowerCase();
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
|
||||||
|
if (programKeywords.some(kw => text.includes(kw) || href.includes(kw))) {
|
||||||
|
result.program_links.push({
|
||||||
|
text: a.innerText.trim().substring(0, 100),
|
||||||
|
href: a.href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
// 查找Faculty链接
|
||||||
|
if (text.includes('faculty') || text.includes('people') ||
|
||||||
|
href.includes('/faculty') || href.includes('/people')) {
|
||||||
|
result.faculty_links.push({
|
||||||
|
text: a.innerText.trim().substring(0, 100),
|
||||||
|
href: a.href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return result;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f"\n导航链接:")
|
||||||
|
for nav in analysis['navigation'][:10]:
|
||||||
|
print(f" - {nav['text']} -> {nav['href']}")
|
||||||
|
|
||||||
|
print(f"\n项目相关链接 ({len(analysis['program_links'])} 个):")
|
||||||
|
for link in analysis['program_links'][:15]:
|
||||||
|
print(f" - {link['text'][:50]} -> {link['href']}")
|
||||||
|
|
||||||
|
print(f"\nFaculty链接 ({len(analysis['faculty_links'])} 个):")
|
||||||
|
for link in analysis['faculty_links'][:10]:
|
||||||
|
print(f" - {link['text'][:50]} -> {link['href']}")
|
||||||
|
|
||||||
|
return analysis
|
||||||
|
|
||||||
|
async def analyze_programs_page(self, url: str) -> Dict[str, Any]:
|
||||||
|
"""分析项目列表页面,识别项目选择器"""
|
||||||
|
print(f"\n分析项目列表页面: {url}")
|
||||||
|
|
||||||
|
await self.page.goto(url, wait_until='networkidle')
|
||||||
|
await self.page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
# 保存截图
|
||||||
|
screenshot_path = f"analysis_{urlparse(url).netloc.replace('.', '_')}.png"
|
||||||
|
await self.page.screenshot(path=screenshot_path, full_page=True)
|
||||||
|
print(f"截图已保存: {screenshot_path}")
|
||||||
|
|
||||||
|
analysis = await self.page.evaluate('''() => {
|
||||||
|
const result = {
|
||||||
|
title: document.title,
|
||||||
|
potential_program_containers: [],
|
||||||
|
program_items: [],
|
||||||
|
pagination: null,
|
||||||
|
selectors_suggestion: {}
|
||||||
|
};
|
||||||
|
|
||||||
|
// 分析页面结构,寻找重复的项目容器
|
||||||
|
const containers = [
|
||||||
|
'div[class*="program"]',
|
||||||
|
'li[class*="program"]',
|
||||||
|
'article[class*="program"]',
|
||||||
|
'div[class*="degree"]',
|
||||||
|
'div[class*="card"]',
|
||||||
|
'li.item',
|
||||||
|
'div.item'
|
||||||
|
];
|
||||||
|
|
||||||
|
containers.forEach(selector => {
|
||||||
|
const elements = document.querySelectorAll(selector);
|
||||||
|
if (elements.length >= 3) {
|
||||||
|
result.potential_program_containers.push({
|
||||||
|
selector: selector,
|
||||||
|
count: elements.length,
|
||||||
|
sample: elements[0].outerHTML.substring(0, 500)
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
// 查找所有看起来像项目的链接
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
                const href = a.href.toLowerCase();
                const text = a.innerText.trim();

                if ((href.includes('/program') || href.includes('/degree') ||
                     href.includes('/master') || href.includes('/graduate')) &&
                    text.length > 5 && text.length < 150) {

                    result.program_items.push({
                        text: text,
                        href: a.href,
                        parentClass: a.parentElement?.className || '',
                        grandparentClass: a.parentElement?.parentElement?.className || ''
                    });
                }
            });

            // 查找分页元素
            const paginationSelectors = [
                '.pagination',
                '[class*="pagination"]',
                'nav[aria-label*="page"]',
                '.pager'
            ];

            for (const selector of paginationSelectors) {
                const elem = document.querySelector(selector);
                if (elem) {
                    result.pagination = {
                        selector: selector,
                        html: elem.outerHTML.substring(0, 300)
                    };
                    break;
                }
            }

            return result;
        }''')

        print(f"\n可能的项目容器:")
        for container in analysis['potential_program_containers']:
            print(f"  选择器: {container['selector']} (找到 {container['count']} 个)")

        print(f"\n找到的项目链接 ({len(analysis['program_items'])} 个):")
        for item in analysis['program_items'][:10]:
            print(f"  - {item['text'][:60]}")
            print(f"    父元素class: {item['parentClass'][:50]}")

        if analysis['pagination']:
            print(f"\n分页元素: {analysis['pagination']['selector']}")

        return analysis

    async def analyze_faculty_page(self, url: str) -> Dict[str, Any]:
        """分析导师列表页面,识别导师选择器"""
        print(f"\n分析导师列表页面: {url}")

        await self.page.goto(url, wait_until='networkidle')
        await self.page.wait_for_timeout(3000)

        analysis = await self.page.evaluate('''() => {
            const result = {
                title: document.title,
                faculty_links: [],
                potential_containers: [],
                url_patterns: new Set()
            };

            // 查找个人页面链接
            const personPatterns = ['/people/', '/faculty/', '/profile/', '/person/', '/directory/'];

            document.querySelectorAll('a[href]').forEach(a => {
                const href = a.href.toLowerCase();
                const text = a.innerText.trim();

                if (personPatterns.some(p => href.includes(p)) &&
                    text.length > 3 && text.length < 100) {

                    result.faculty_links.push({
                        name: text,
                        url: a.href,
                        parentClass: a.parentElement?.className || ''
                    });

                    // 记录URL模式
                    personPatterns.forEach(p => {
                        if (href.includes(p)) {
                            result.url_patterns.add(p);
                        }
                    });
                }
            });

            result.url_patterns = Array.from(result.url_patterns);

            return result;
        }''')

        print(f"\n发现的导师链接 ({len(analysis['faculty_links'])} 个):")
        for faculty in analysis['faculty_links'][:15]:
            print(f"  - {faculty['name']} -> {faculty['url']}")

        print(f"\nURL模式: {analysis['url_patterns']}")

        return analysis

    async def generate_config_suggestion(self, university_url: str) -> str:
        """生成配置文件建议"""
        print(f"\n{'='*60}")
        print(f"开始分析: {university_url}")
        print(f"{'='*60}")

        # 分析首页
        homepage_analysis = await self.analyze_university_homepage(university_url)

        # 生成配置建议
        domain = urlparse(university_url).netloc
        config_suggestion = f'''# {homepage_analysis['title']} 爬虫配置
# 自动生成的配置建议,请根据实际情况调整

university:
  name: "{homepage_analysis['title'].split(' - ')[0].split(' | ')[0]}"
  url: "{university_url}"
  country: "TODO"

# 发现的子域名(可能是学院网站):
# {chr(10).join(['# - ' + s for s in homepage_analysis['all_harvard_subdomains'][:10]])}

schools:
  discovery_method: "static_list"

  # TODO: 根据上面的子域名和学院链接,手动填写学院列表
  static_list:
    # 示例:
    # - name: "School of Engineering"
    #   url: "https://engineering.{domain}/"
'''

        print(f"\n{'='*60}")
        print("配置建议:")
        print(f"{'='*60}")
        print(config_suggestion)

        return config_suggestion


async def analyze_new_university(url: str):
    """分析新大学的便捷函数"""
    async with PageAnalyzer() as analyzer:
        await analyzer.generate_config_suggestion(url)


# CLI入口
if __name__ == "__main__":
    import sys

    if len(sys.argv) < 2:
        print("用法: python analyzer.py <university_url>")
        print("示例: python analyzer.py https://www.stanford.edu/")
        sys.exit(1)

    asyncio.run(analyze_new_university(sys.argv[1]))
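analyzer.py 末尾提供的是命令行入口;下面是一个等价的编程式调用示意(假设包能以 `university_scraper` 导入、且已通过 `uv run playwright install` 安装浏览器内核,这段代码并非 diff 中已有内容):

```python
# 在脚本或 notebook 中直接调用分析器(示意)
import asyncio

from university_scraper.analyzer import analyze_new_university

asyncio.run(analyze_new_university("https://www.stanford.edu/"))
```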
105 src/university_scraper/cli.py Normal file
@@ -0,0 +1,105 @@
"""
命令行工具

用法:
    # 爬取指定大学
    python -m university_scraper scrape harvard

    # 分析新大学
    python -m university_scraper analyze https://www.stanford.edu/

    # 列出可用配置
    python -m university_scraper list
"""

import asyncio
import argparse
from pathlib import Path


def main():
    parser = argparse.ArgumentParser(
        description="通用大学官网爬虫 - 按照 学院→项目→导师 层级爬取"
    )

    subparsers = parser.add_subparsers(dest='command', help='可用命令')

    # 爬取命令
    scrape_parser = subparsers.add_parser('scrape', help='爬取指定大学')
    scrape_parser.add_argument('university', help='大学名称(配置文件名,不含.yaml)')
    scrape_parser.add_argument('-o', '--output', help='输出文件路径', default=None)
    scrape_parser.add_argument('--headless', action='store_true', help='无头模式运行')
    scrape_parser.add_argument('--config-dir', default='configs', help='配置文件目录')

    # 分析命令
    analyze_parser = subparsers.add_parser('analyze', help='分析新大学官网结构')
    analyze_parser.add_argument('url', help='大学官网URL')

    # 列出命令
    list_parser = subparsers.add_parser('list', help='列出可用的大学配置')
    list_parser.add_argument('--config-dir', default='configs', help='配置文件目录')

    args = parser.parse_args()

    if args.command == 'scrape':
        asyncio.run(run_scrape(args))
    elif args.command == 'analyze':
        asyncio.run(run_analyze(args))
    elif args.command == 'list':
        run_list(args)
    else:
        parser.print_help()


async def run_scrape(args):
    """执行爬取"""
    from .config import load_config
    from .scraper import UniversityScraper

    config_path = Path(args.config_dir) / f"{args.university}.yaml"

    if not config_path.exists():
        print(f"错误: 配置文件不存在 - {config_path}")
        print(f"可用配置: {list_configs(args.config_dir)}")
        return

    config = load_config(str(config_path))

    output_path = args.output or f"output/{args.university}_result.json"

    async with UniversityScraper(config, headless=args.headless) as scraper:
        await scraper.scrape()
        scraper.save_results(output_path)


async def run_analyze(args):
    """执行分析"""
    from .analyzer import PageAnalyzer

    async with PageAnalyzer() as analyzer:
        await analyzer.generate_config_suggestion(args.url)


def run_list(args):
    """列出可用配置"""
    configs = list_configs(args.config_dir)

    if configs:
        print("可用的大学配置:")
        for name in configs:
            print(f"  - {name}")
    else:
        print(f"在 {args.config_dir} 目录下没有找到配置文件")


def list_configs(config_dir: str):
    """列出配置文件"""
    path = Path(config_dir)
    if not path.exists():
        return []

    return [f.stem for f in path.glob("*.yaml")] + [f.stem for f in path.glob("*.yml")]


if __name__ == "__main__":
    main()
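cli.py 的 docstring 以 `python -m university_scraper ...` 的形式给出用法;要让 `-m` 方式生效,包内通常还需要一个转发到 `main()` 的 `__main__.py`。这个文件没有出现在本 diff 片段里,下面只是一个假设性的示意:

```python
# src/university_scraper/__main__.py —— 假设性示意文件(本 diff 中未出现)
# 作用:让 `python -m university_scraper scrape harvard` 等命令调用 cli.py 中的 main()
from .cli import main

if __name__ == "__main__":
    main()
```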
232 src/university_scraper/config.py Normal file
@@ -0,0 +1,232 @@
"""
配置文件加载和验证

配置文件格式 (YAML):

university:
  name: "Harvard University"
  url: "https://www.harvard.edu/"
  country: "USA"

# 第一层:学院列表页面
schools:
  # 获取学院列表的方式
  discovery_method: "static_list"  # static_list | scrape_page | sitemap

  # 方式1: 静态列表 (手动配置已知学院)
  static_list:
    - name: "School of Engineering and Applied Sciences"
      url: "https://seas.harvard.edu/"
      keywords: ["engineering", "computer"]
      faculty_pages:
        - url: "https://seas.harvard.edu/people"
          extract_method: "links"  # links | table | research_explorer
          request:
            timeout_ms: 90000
            wait_for_selector: ".profile-card"
    - name: "Graduate School of Arts and Sciences"
      url: "https://gsas.harvard.edu/"

  # 方式2: 从页面爬取
  scrape_config:
    url: "https://www.harvard.edu/schools/"
    selector: "a.school-link"
    name_attribute: "text"  # text | title | data-name
    url_attribute: "href"

# 第二层:每个学院下的项目列表
programs:
  # 相对于学院URL的路径模式
  paths_to_try:
    - "/academics/graduate-programs"
    - "/programs"
    - "/graduate"
    - "/academics/masters"

  # 或者使用选择器从学院首页查找
  link_patterns:
    - text_contains: ["graduate", "master", "program"]
    - href_contains: ["/program", "/graduate", "/academics"]

  # 项目列表页面的选择器
  selectors:
    program_item: "div.program-item, li.program, a.program-link"
    program_name: "h3, .title, .program-name"
    program_url: "a[href]"
    degree_type: ".degree, .credential"
  request:
    timeout_ms: 45000
    max_retries: 3
    retry_backoff_ms: 3000

  # 分页配置
  pagination:
    type: "none"  # none | click | url_param | infinite_scroll
    next_selector: "a.next, button.next-page"
    param_name: "page"

# 第三层:每个项目下的导师列表
faculty:
  # 查找导师页面的策略
  discovery_strategies:
    - type: "link_in_page"
      patterns:
        - text_contains: ["faculty", "people", "advisor", "professor"]
        - href_contains: ["/faculty", "/people", "/directory"]

    - type: "url_pattern"
      patterns:
        - "{program_url}/faculty"
        - "{program_url}/people"
        - "{school_url}/people"
    - type: "school_directory"
      assign_to_all: true
      match_by_school_keywords: true
  request:
    timeout_ms: 90000
    wait_for_selector: "a.link.person"

  # 导师列表页面的选择器
  selectors:
    faculty_item: "div.faculty-item, li.person, .profile-card"
    faculty_name: "h3, .name, .title a"
    faculty_url: "a[href*='/people/'], a[href*='/faculty/'], a[href*='/profile/']"
    faculty_title: ".title, .position, .role"
    faculty_email: "a[href^='mailto:']"

# 过滤规则
filters:
  # 只爬取硕士项目
  program_degree_types:
    include: ["M.S.", "M.A.", "MBA", "Master", "M.Eng", "M.Ed", "M.P.P", "M.P.A"]
    exclude: ["Ph.D.", "Bachelor", "B.S.", "B.A.", "Certificate"]

  # 排除某些学院
  exclude_schools:
    - "Summer School"
    - "Extension School"
"""

import yaml
from pathlib import Path
from typing import Dict, Any, List, Optional
from dataclasses import dataclass, field


@dataclass
class UniversityConfig:
    """大学基本信息配置"""
    name: str
    url: str
    country: str = "Unknown"


@dataclass
class SchoolsConfig:
    """学院发现配置"""
    discovery_method: str = "static_list"
    static_list: List[Dict[str, str]] = field(default_factory=list)
    scrape_config: Optional[Dict[str, Any]] = None
    request: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ProgramsConfig:
    """项目发现配置"""
    paths_to_try: List[str] = field(default_factory=list)
    link_patterns: List[Dict[str, List[str]]] = field(default_factory=list)
    selectors: Dict[str, str] = field(default_factory=dict)
    pagination: Dict[str, Any] = field(default_factory=dict)
    request: Dict[str, Any] = field(default_factory=dict)
    global_catalog: Optional[Dict[str, Any]] = None


@dataclass
class FacultyConfig:
    """导师发现配置"""
    discovery_strategies: List[Dict[str, Any]] = field(default_factory=list)
    selectors: Dict[str, str] = field(default_factory=dict)
    request: Dict[str, Any] = field(default_factory=dict)


@dataclass
class FiltersConfig:
    """过滤规则配置"""
    program_degree_types: Dict[str, List[str]] = field(default_factory=dict)
    exclude_schools: List[str] = field(default_factory=list)


@dataclass
class PlaywrightConfig:
    """Playwright运行环境配置"""
    stealth: bool = False
    user_agent: Optional[str] = None
    locale: Optional[str] = None
    timezone_id: Optional[str] = None
    viewport: Optional[Dict[str, int]] = None
    ignore_https_errors: bool = False
    extra_headers: Dict[str, str] = field(default_factory=dict)
    cookies: List[Dict[str, Any]] = field(default_factory=list)
    add_init_scripts: List[str] = field(default_factory=list)


@dataclass
class ScraperConfig:
    """完整的爬虫配置"""
    university: UniversityConfig
    schools: SchoolsConfig
    programs: ProgramsConfig
    faculty: FacultyConfig
    filters: FiltersConfig
    playwright: PlaywrightConfig = field(default_factory=PlaywrightConfig)

    @classmethod
    def from_yaml(cls, yaml_path: str) -> "ScraperConfig":
        """从YAML文件加载配置"""
        with open(yaml_path, 'r', encoding='utf-8') as f:
            data = yaml.safe_load(f)

        return cls(
            university=UniversityConfig(**data.get('university', {})),
            schools=SchoolsConfig(**data.get('schools', {})),
            programs=ProgramsConfig(**data.get('programs', {})),
            faculty=FacultyConfig(**data.get('faculty', {})),
            filters=FiltersConfig(**data.get('filters', {})),
            playwright=PlaywrightConfig(**data.get('playwright', {}))
        )

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ScraperConfig":
        """从字典创建配置"""
        return cls(
            university=UniversityConfig(**data.get('university', {})),
            schools=SchoolsConfig(**data.get('schools', {})),
            programs=ProgramsConfig(**data.get('programs', {})),
            faculty=FacultyConfig(**data.get('faculty', {})),
            filters=FiltersConfig(**data.get('filters', {})),
            playwright=PlaywrightConfig(**data.get('playwright', {}))
        )


def load_config(config_path: str) -> ScraperConfig:
    """加载配置文件"""
    path = Path(config_path)
    if not path.exists():
        raise FileNotFoundError(f"配置文件不存在: {config_path}")

    if path.suffix in ['.yaml', '.yml']:
        return ScraperConfig.from_yaml(config_path)
    else:
        raise ValueError(f"不支持的配置文件格式: {path.suffix}")


def list_available_configs(configs_dir: str = "configs") -> List[str]:
    """列出所有可用的配置文件"""
    path = Path(configs_dir)
    if not path.exists():
        return []

    return [
        f.stem for f in path.glob("*.yaml")
    ] + [
        f.stem for f in path.glob("*.yml")
    ]
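config.py 对外提供 `load_config` 与 `list_available_configs` 两个入口;下面是一个最小的调用示意(`configs/harvard.yaml` 是假设存在的配置文件,字段按上面 docstring 的格式编写,注释中的返回值也仅为举例):

```python
from university_scraper.config import list_available_configs, load_config

# 列出 configs/ 目录下的所有配置(返回文件名去掉扩展名),例如 ['harvard', 'manchester']
print(list_available_configs("configs"))

# 加载某个大学的配置,得到强类型的 ScraperConfig 对象
config = load_config("configs/harvard.yaml")
print(config.university.name)            # 例如 "Harvard University"
print(config.schools.discovery_method)   # 例如 "static_list"
print(config.filters.program_degree_types)  # 例如 {'include': [...], 'exclude': [...]}
```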
405 src/university_scraper/harvard_scraper.py Normal file
@@ -0,0 +1,405 @@
#!/usr/bin/env python3
"""
Harvard专用爬虫

Harvard的特殊情况:
1. 有一个集中的项目列表页面 (harvard.edu/programs)
2. 项目详情在GSAS页面 (gsas.harvard.edu/program/xxx)
3. 导师信息在各院系网站

爬取流程:
1. 从集中页面获取所有硕士项目
2. 通过GSAS页面确定每个项目所属学院
3. 从院系网站获取导师信息
4. 按 学院→项目→导师 层级组织输出
"""

import asyncio
import json
import re
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Dict, Optional, Tuple
from urllib.parse import urljoin

from playwright.async_api import async_playwright, Page, Browser

from .models import University, School, Program, Faculty


# Harvard学院映射 - 根据URL子域名判断所属学院
SCHOOL_MAPPING = {
    "gsas.harvard.edu": "Graduate School of Arts and Sciences (GSAS)",
    "seas.harvard.edu": "John A. Paulson School of Engineering and Applied Sciences (SEAS)",
    "hbs.edu": "Harvard Business School (HBS)",
    "www.hbs.edu": "Harvard Business School (HBS)",
    "gsd.harvard.edu": "Graduate School of Design (GSD)",
    "www.gsd.harvard.edu": "Graduate School of Design (GSD)",
    "gse.harvard.edu": "Graduate School of Education (HGSE)",
    "www.gse.harvard.edu": "Graduate School of Education (HGSE)",
    "hks.harvard.edu": "Harvard Kennedy School (HKS)",
    "www.hks.harvard.edu": "Harvard Kennedy School (HKS)",
    "hls.harvard.edu": "Harvard Law School (HLS)",
    "hms.harvard.edu": "Harvard Medical School (HMS)",
    "hsph.harvard.edu": "T.H. Chan School of Public Health (HSPH)",
    "www.hsph.harvard.edu": "T.H. Chan School of Public Health (HSPH)",
    "hds.harvard.edu": "Harvard Divinity School (HDS)",
    "hsdm.harvard.edu": "Harvard School of Dental Medicine (HSDM)",
    "fas.harvard.edu": "Faculty of Arts and Sciences (FAS)",
    "dce.harvard.edu": "Division of Continuing Education (DCE)",
    "extension.harvard.edu": "Harvard Extension School",
}

# 学院URL映射
SCHOOL_URLS = {
    "Graduate School of Arts and Sciences (GSAS)": "https://gsas.harvard.edu/",
    "John A. Paulson School of Engineering and Applied Sciences (SEAS)": "https://seas.harvard.edu/",
    "Harvard Business School (HBS)": "https://www.hbs.edu/",
    "Graduate School of Design (GSD)": "https://www.gsd.harvard.edu/",
    "Graduate School of Education (HGSE)": "https://www.gse.harvard.edu/",
    "Harvard Kennedy School (HKS)": "https://www.hks.harvard.edu/",
    "Harvard Law School (HLS)": "https://hls.harvard.edu/",
    "Harvard Medical School (HMS)": "https://hms.harvard.edu/",
    "T.H. Chan School of Public Health (HSPH)": "https://www.hsph.harvard.edu/",
    "Harvard Divinity School (HDS)": "https://hds.harvard.edu/",
    "Harvard School of Dental Medicine (HSDM)": "https://hsdm.harvard.edu/",
    "Faculty of Arts and Sciences (FAS)": "https://fas.harvard.edu/",
    "Other": "https://www.harvard.edu/",
}


def name_to_slug(name: str) -> str:
    """将项目名称转换为URL slug"""
    slug = name.lower()
    slug = re.sub(r'[^\w\s-]', '', slug)
    slug = re.sub(r'[\s_]+', '-', slug)
    slug = re.sub(r'-+', '-', slug)
    slug = slug.strip('-')
    return slug


def determine_school_from_url(url: str) -> str:
    """根据URL判断所属学院"""
    if not url:
        return "Other"

    from urllib.parse import urlparse
    parsed = urlparse(url)
    domain = parsed.netloc.lower()

    for pattern, school_name in SCHOOL_MAPPING.items():
        if pattern in domain:
            return school_name

    return "Other"


class HarvardScraper:
    """Harvard专用爬虫"""

    def __init__(self, headless: bool = True):
        self.headless = headless
        self.browser: Optional[Browser] = None
        self.page: Optional[Page] = None
        self._playwright = None

    async def __aenter__(self):
        self._playwright = await async_playwright().start()
        self.browser = await self._playwright.chromium.launch(headless=self.headless)
        context = await self.browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            viewport={'width': 1920, 'height': 1080},
            java_script_enabled=True,
        )
        self.page = await context.new_page()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self.browser:
            await self.browser.close()
        if self._playwright:
            await self._playwright.stop()

    async def _safe_goto(self, url: str, timeout: int = 30000, retries: int = 3) -> bool:
        """安全的页面导航,带重试机制"""
        for attempt in range(retries):
            try:
                await self.page.goto(url, wait_until="domcontentloaded", timeout=timeout)
                await self.page.wait_for_timeout(2000)
                return True
            except Exception as e:
                print(f"    导航失败 (尝试 {attempt + 1}/{retries}): {str(e)[:50]}")
                if attempt < retries - 1:
                    await self.page.wait_for_timeout(3000)
        return False

    async def scrape(self) -> University:
        """执行完整的爬取流程"""
        print(f"\n{'='*60}")
        print("Harvard University 专用爬虫")
        print(f"{'='*60}")

        # 创建大学对象
        university = University(
            name="Harvard University",
            url="https://www.harvard.edu/",
            country="USA"
        )

        # 第一阶段:从集中页面获取所有硕士项目
        print("\n[阶段1] 从集中页面获取项目列表...")
        raw_programs = await self._scrape_programs_list()
        print(f"  找到 {len(raw_programs)} 个项目")

        # 第二阶段:获取每个项目的详情和导师信息
        print("\n[阶段2] 获取项目详情和导师信息...")

        # 按学院组织的项目
        schools_dict: Dict[str, School] = {}

        for i, prog_data in enumerate(raw_programs, 1):
            print(f"\n  [{i}/{len(raw_programs)}] {prog_data['name']}")

            # 获取项目详情和导师
            program, school_name = await self._get_program_details(prog_data)

            if program:
                # 添加到对应学院
                if school_name not in schools_dict:
                    schools_dict[school_name] = School(
                        name=school_name,
                        url=SCHOOL_URLS.get(school_name, "")
                    )
                schools_dict[school_name].programs.append(program)

                print(f"    学院: {school_name}")
                print(f"    导师: {len(program.faculty)}位")

            # 避免请求过快
            await self.page.wait_for_timeout(1000)

        # 转换为列表并排序
        university.schools = sorted(schools_dict.values(), key=lambda s: s.name)
        university.scraped_at = datetime.now(timezone.utc).isoformat()

        # 打印统计
        self._print_summary(university)

        return university

    async def _scrape_programs_list(self) -> List[Dict]:
        """从Harvard集中页面获取所有硕士项目"""
        all_programs = []
        base_url = "https://www.harvard.edu/programs/?degree_levels=graduate"

        print(f"  访问: {base_url}")
        if not await self._safe_goto(base_url, timeout=60000):
            print("  无法访问项目页面!")
            return []
        await self.page.wait_for_timeout(3000)

        # 滚动到页面底部
        await self.page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await self.page.wait_for_timeout(2000)

        current_page = 1
        max_pages = 15

        while current_page <= max_pages:
            print(f"  第 {current_page} 页...")
            await self.page.wait_for_timeout(2000)

            # 提取当前页面的项目
            page_data = await self.page.evaluate('''() => {
                const programs = [];
                const programItems = document.querySelectorAll('[class*="records__record"], [class*="c-programs-item"]');

                programItems.forEach((item) => {
                    const nameBtn = item.querySelector('button[class*="title-link"], button[class*="c-programs-item"]');
                    if (!nameBtn) return;

                    const name = nameBtn.innerText.trim();
                    if (!name || name.length < 3) return;

                    let degrees = '';
                    const allText = item.innerText;
                    const degreeMatch = allText.match(/(A\\.B\\.|Ph\\.D\\.|M\\.A\\.|S\\.M\\.|M\\.Arch\\.|LL\\.M\\.|S\\.B\\.|A\\.L\\.B\\.|A\\.L\\.M\\.|M\\.M\\.Sc\\.|Ed\\.D\\.|Ed\\.M\\.|M\\.P\\.A\\.|M\\.P\\.P\\.|M\\.P\\.H\\.|J\\.D\\.|M\\.B\\.A\\.|M\\.D\\.|D\\.M\\.D\\.|Th\\.D\\.|M\\.Div\\.|M\\.T\\.S\\.|M\\.E\\.|D\\.M\\.Sc\\.|M\\.H\\.C\\.M\\.|M\\.L\\.A\\.|M\\.D\\.E\\.|M\\.R\\.E\\.|M\\.A\\.U\\.D\\.|M\\.R\\.P\\.L\\.)/g);
                    if (degreeMatch) {
                        degrees = degreeMatch.join(', ');
                    }

                    programs.push({ name, degrees });
                });

                return programs;
            }''')

            for prog in page_data:
                name = prog['name'].strip()
                if name and not any(p['name'] == name for p in all_programs):
                    all_programs.append(prog)

            # 尝试点击下一页
            try:
                next_btn = self.page.locator('button.c-pagination__link--next')
                if await next_btn.count() > 0:
                    await next_btn.first.scroll_into_view_if_needed()
                    await next_btn.first.click()
                    await self.page.wait_for_timeout(3000)
                    current_page += 1
                else:
                    break
            except:
                break

        # 过滤:只保留硕士项目
        master_keywords = ['M.A.', 'M.S.', 'S.M.', 'A.M.', 'MBA', 'M.Arch', 'M.L.A.',
                           'M.Div', 'M.T.S', 'LL.M', 'M.P.P', 'M.P.A', 'M.Ed', 'Ed.M.',
                           'A.L.M.', 'M.P.H.', 'M.M.Sc.', 'Master']
        phd_keywords = ['Ph.D.', 'Doctor', 'D.M.D.', 'D.M.Sc.', 'Ed.D.', 'Th.D.', 'J.D.', 'M.D.']

        filtered = []
        for prog in all_programs:
            degrees = prog.get('degrees', '')
            name = prog.get('name', '')

            # 检查是否有硕士学位
            has_master = any(kw in degrees or kw in name for kw in master_keywords)

            # 排除纯博士项目
            is_phd_only = all(kw in degrees for kw in phd_keywords if kw in degrees) and not has_master

            if has_master or (not is_phd_only and not degrees):
                filtered.append(prog)

        return filtered

    async def _get_program_details(self, prog_data: Dict) -> Tuple[Optional[Program], str]:
        """获取项目详情和导师信息"""
        name = prog_data['name']
        degrees = prog_data.get('degrees', '')

        # 生成URL
        slug = name_to_slug(name)
        program_url = f"https://www.harvard.edu/programs/{slug}/"
        gsas_url = f"https://gsas.harvard.edu/program/{slug}"

        # 访问GSAS页面获取详情
        school_name = "Other"
        faculty_list = []
        faculty_page_url = None

        try:
            if await self._safe_goto(gsas_url, timeout=20000, retries=2):
                # 检查页面是否有效
                title = await self.page.title()
                if '404' not in title and 'not found' not in title.lower():
                    school_name = "Graduate School of Arts and Sciences (GSAS)"

                    # 查找Faculty链接
                    faculty_link = await self.page.evaluate('''() => {
                        const links = document.querySelectorAll('a[href]');
                        for (const link of links) {
                            const text = link.innerText.toLowerCase();
                            const href = link.href;
                            if (text.includes('faculty') && text.includes('see list')) {
                                return href;
                            }
                            if ((text.includes('faculty') || text.includes('people')) &&
                                (href.includes('/people') || href.includes('/faculty'))) {
                                return href;
                            }
                        }
                        return null;
                    }''')

                    if faculty_link:
                        faculty_page_url = faculty_link
                        school_name = determine_school_from_url(faculty_link)

                        # 访问导师页面
                        if await self._safe_goto(faculty_link, timeout=20000, retries=2):
                            # 提取导师信息
                            faculty_list = await self._extract_faculty()

        except Exception as e:
            print(f"    获取详情失败: {str(e)[:50]}")

        # 创建项目对象
        program = Program(
            name=name,
            url=program_url,
            degree_type=degrees,
            faculty_page_url=faculty_page_url,
            faculty=[Faculty(name=f['name'], url=f['url']) for f in faculty_list]
        )

        return program, school_name

    async def _extract_faculty(self) -> List[Dict]:
        """从当前页面提取导师信息"""
        return await self.page.evaluate('''() => {
            const faculty = [];
            const seen = new Set();
            const patterns = ['/people/', '/faculty/', '/profile/', '/person/'];

            document.querySelectorAll('a[href]').forEach(a => {
                const href = a.href || '';
                const text = a.innerText.trim();
                const lowerHref = href.toLowerCase();
                const lowerText = text.toLowerCase();

                const isPersonLink = patterns.some(p => lowerHref.includes(p));
                const isNavLink = ['people', 'faculty', 'directory', 'staff', 'all'].includes(lowerText);

                if (isPersonLink && !isNavLink &&
                    text.length > 3 && text.length < 100 &&
                    !seen.has(href)) {
                    seen.add(href);
                    faculty.push({ name: text, url: href });
                }
            });

            return faculty;
        }''')

    def _print_summary(self, university: University):
        """打印统计摘要"""
        total_programs = sum(len(s.programs) for s in university.schools)
        total_faculty = sum(len(p.faculty) for s in university.schools for p in s.programs)

        print(f"\n{'='*60}")
        print("爬取完成!")
        print(f"{'='*60}")
        print(f"大学: {university.name}")
        print(f"学院数: {len(university.schools)}")
        print(f"项目数: {total_programs}")
        print(f"导师数: {total_faculty}")

        print("\n各学院统计:")
        for school in university.schools:
            prog_count = len(school.programs)
            fac_count = sum(len(p.faculty) for p in school.programs)
            print(f"  {school.name}: {prog_count}个项目, {fac_count}位导师")

    def save_results(self, university: University, output_path: str):
        """保存结果"""
        output = Path(output_path)
        output.parent.mkdir(parents=True, exist_ok=True)

        with open(output, 'w', encoding='utf-8') as f:
            json.dump(university.to_dict(), f, ensure_ascii=False, indent=2)

        print(f"\n结果已保存到: {output_path}")


async def scrape_harvard(output_path: str = "output/harvard_full_result.json", headless: bool = True):
    """爬取Harvard的便捷函数"""
    async with HarvardScraper(headless=headless) as scraper:
        university = await scraper.scrape()
        scraper.save_results(university, output_path)
        return university


if __name__ == "__main__":
    asyncio.run(scrape_harvard(headless=False))
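harvard_scraper.py 自带 `__main__` 入口;也可以像下面这样在自己的脚本中调用(输出路径与 headless 取该函数的默认参数,这里只是示意,同样假设包以 `university_scraper` 可导入):

```python
import asyncio

from university_scraper.harvard_scraper import scrape_harvard

# 运行完整的 Harvard 爬取流程,结果同时写入 JSON 并作为 University 对象返回
university = asyncio.run(
    scrape_harvard(output_path="output/harvard_full_result.json", headless=True)
)
print(f"共抓取 {len(university.schools)} 个学院")
```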
105 src/university_scraper/models.py Normal file
@@ -0,0 +1,105 @@
"""
数据模型定义 - 学院 → 项目 → 导师 层级结构
"""

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class Faculty:
    """导师信息"""
    name: str
    url: str
    title: Optional[str] = None
    email: Optional[str] = None
    department: Optional[str] = None

    def to_dict(self) -> dict:
        return {
            "name": self.name,
            "url": self.url,
            "title": self.title,
            "email": self.email,
            "department": self.department
        }


@dataclass
class Program:
    """硕士项目信息"""
    name: str
    url: str
    degree_type: Optional[str] = None  # M.S., M.A., MBA, etc.
    description: Optional[str] = None
    faculty_page_url: Optional[str] = None
    faculty: List[Faculty] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> dict:
        return {
            "name": self.name,
            "url": self.url,
            "degree_type": self.degree_type,
            "description": self.description,
            "faculty_page_url": self.faculty_page_url,
            "faculty_count": len(self.faculty),
            "faculty": [f.to_dict() for f in self.faculty],
            "metadata": self.metadata
        }


@dataclass
class School:
    """学院信息"""
    name: str
    url: str
    description: Optional[str] = None
    programs: List[Program] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)
    faculty_directory: List[Faculty] = field(default_factory=list)
    faculty_directory_loaded: bool = False

    def to_dict(self) -> dict:
        return {
            "name": self.name,
            "url": self.url,
            "description": self.description,
            "program_count": len(self.programs),
            "programs": [p.to_dict() for p in self.programs],
            "faculty_directory_count": len(self.faculty_directory),
            "faculty_directory": [f.to_dict() for f in self.faculty_directory]
        }


@dataclass
class University:
    """大学信息 - 顶层数据结构"""
    name: str
    url: str
    country: Optional[str] = None
    schools: List[School] = field(default_factory=list)
    scraped_at: Optional[str] = None

    def to_dict(self) -> dict:
        # 统计
        total_programs = sum(len(s.programs) for s in self.schools)
        total_faculty = sum(
            len(p.faculty)
            for s in self.schools
            for p in s.programs
        )

        return {
            "university": self.name,
            "url": self.url,
            "country": self.country,
            "scraped_at": self.scraped_at or datetime.utcnow().isoformat(),
            "statistics": {
                "total_schools": len(self.schools),
                "total_programs": total_programs,
                "total_faculty": total_faculty
            },
            "schools": [s.to_dict() for s in self.schools]
        }
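models.py 中的数据类可以脱离爬虫单独使用;下面是一个示意,手工构造 学院→项目→导师 层级并序列化为 JSON(其中的学校、项目和人名均为虚构示例):

```python
import json

from university_scraper.models import Faculty, Program, School, University

# 自顶向下构造:University 持有 School 列表,School 持有 Program 列表,Program 持有 Faculty 列表
uni = University(name="Example University", url="https://example.edu/", country="USA")
program = Program(
    name="M.S. in Computer Science",
    url="https://example.edu/programs/ms-cs/",
    degree_type="M.S.",
    faculty=[Faculty(name="Jane Doe", url="https://example.edu/people/jane-doe")],
)
uni.schools.append(
    School(name="School of Engineering", url="https://example.edu/engineering/", programs=[program])
)

# to_dict() 会自动汇总 statistics(学院数、项目数、导师数)并补充 scraped_at
print(json.dumps(uni.to_dict(), ensure_ascii=False, indent=2))
```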
1360 src/university_scraper/scraper.py Normal file
File diff suppressed because it is too large
6 任务1.txt
@@ -1,4 +1,8 @@
构建一个自动化生成代码的agent,给定一个海外大学官网的网址,生成一套或者说一个python脚本能够爬取这个大学各级学院下的所有硕士项目的网址 和 硕士项目中各导师个人信息的网址
agent系统使用 https://docs.agno.com/
然后其中的浏览器自动化使用playwright
使用的python工具库是,uv、ruff、ty,使用这三个构建项目。
另外还需要研究一个方案:
当前生成的爬虫脚本还需要在 Python 环境中手动执行。我们希望实现的效果:生成的脚本可以存到数据库(你研究一下是存在哪里好),我们的目的是存到云端或者服务器上,后续我们可以通过接口来调用生成的脚本去爬取。
理想情况下我们有一个页面,页面上有一个输入栏,用来输入要生成爬虫脚本的学校,然后点击“一键生成爬虫脚本”后,由我们的 agent 自动生成脚本。生成完成后会跳转到该学校的管理页面,页面上可以点击“一键运行爬虫脚本”来抓取对应项目和老师的页面数据。爬取完毕后,我们可以在页面上检查和验证爬虫抓取到的数据。
目的是做成没有任何电脑知识的普通人也能操作的一个系统。
32 对话总结.txt Normal file
@@ -0,0 +1,32 @@
项目进度与存在的问题:
📊 项目进度总结
✅ 已完成的功能
Web系统架构
FastAPI 后端 + React + TypeScript + Ant Design 前端
SQLite 数据库存储大学、脚本、爬取结果
动态脚本执行机制(通过 exec() 注入模块)
曼彻斯特大学爬虫
成功爬取 297 个硕士项目
创建了完整的学院配置 manchester_complete_scraper.py
实现了 3 种提取方法:table、links、research_explorer
项目按关键词自动分配到 13 个学院
已修复的问题
Windows asyncio 事件循环策略
exec 命名空间问题(函数互相调用)
硕士项目过滤逻辑(排除本科/博士)
⚠️ 当前存在的问题
问题 | 影响 | 原因
网络超时 | 11/12 学院页面加载失败 | 网络不稳定或页面响应慢
Research Explorer 页面 | 大量学院使用此系统 | JavaScript 渲染慢,60秒超时不够
导师数据不完整 | 仅获取 78 名导师(AMBS) | 其他学院页面无法访问
📈 数据统计
指标 | 数量
硕士项目总数 | 297
学院分类数 | 13
成功获取导师的学院 | 1/13
导师总数 | 78
🔧 建议的改进方向
增加超时时间 - 对 Research Explorer 页面增加到 90-120 秒
添加重试机制 - 失败后自动重试 2-3 次
使用备选 URL - 为每个学院配置多个可能的 staff 页面
分批爬取 - 将学院分批处理,避免同时请求过多
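针对上面“建议的改进方向”中的前三条(放宽超时、自动重试、备选 URL),下面是一个可能的实现草稿;这并不是仓库中已有的代码,函数名 `goto_with_fallback` 及其参数均为假设:

```python
from playwright.async_api import Page


async def goto_with_fallback(
    page: Page,
    candidate_urls: list[str],
    timeout_ms: int = 120_000,   # Research Explorer 等慢页面放宽到 120 秒
    retries: int = 3,            # 每个 URL 最多重试 3 次
) -> bool:
    """依次尝试每个备选 URL;失败后按简单退避重试,全部失败时返回 False。"""
    for url in candidate_urls:
        for attempt in range(retries):
            try:
                await page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
                return True
            except Exception:
                # 简单线性退避;也可以改成指数退避
                await page.wait_for_timeout(3_000 * (attempt + 1))
    return False
```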