Add university scraper system with backend, frontend, and configs
- Add src/university_scraper module with scraper, analyzer, and CLI
- Add backend FastAPI service with API endpoints and database models
- Add frontend React app with university management pages
- Add configs for Harvard, Manchester, and UCL universities
- Add artifacts with various scraper implementations
- Add Docker compose configuration for deployment
- Update .gitignore to exclude generated files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
.gitignore (vendored) — 28 lines added

```
@@ -179,3 +179,31 @@ nul
# Scraper output files
*_results.json

# Output directories
output/

# Screenshots and debug images
*.png
artifacts/*.html

# Windows
desktop.ini

# Claude settings (local)
.claude/

# Progress files
*_progress.json

# Test result files
*_test_result.json

# Node modules
node_modules/

# Database files
*.db

# Frontend build
frontend/nul
```
SYSTEM_DESIGN.md (new file, 261 lines)

# University Scraper Web System Design

## 1. System Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                      Frontend (React/Vue)                        │
│  ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│  │ Enter university │ │ One-click script │ │ View / verify    │ │
│  │ URL              │ │ generation       │ │ scraped data     │ │
│  └──────────────────┘ └──────────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Backend API (FastAPI)                       │
│  ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│  │ Script generation│ │ Script execution │ │ Data query API   │ │
│  │ API              │ │ API              │ │                  │ │
│  └──────────────────┘ └──────────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                                │
              ┌─────────────────┼─────────────────┐
              ▼                 ▼                 ▼
┌───────────────────┐  ┌───────────────┐  ┌───────────────────────┐
│ PostgreSQL        │  │ Task queue    │  │ Agent (Claude)        │
│ database          │  │ (Celery)      │  │ analyzes sites and    │
│  - scraper scripts│  └───────────────┘  │ generates scripts     │
│  - scrape results │                     └───────────────────────┘
│  - execution logs │
└───────────────────┘
```

## 2. Technology Stack

### Backend
- **Framework**: FastAPI (Python; integrates directly with the existing scraper code)
- **Database**: PostgreSQL (stores scripts, results, and logs)
- **Task queue**: Celery + Redis (asynchronous execution of scrape jobs)
- **ORM**: SQLAlchemy

### Frontend
- **Framework**: React + TypeScript (or Vue.js)
- **UI library**: Ant Design / Material-UI
- **State management**: React Query (data fetching and caching)

### Deployment
- **Containerization**: Docker + Docker Compose
- **Cloud platform**: deployable to AWS / Alibaba Cloud / Tencent Cloud

## 3. Database Design

```sql
-- Universities table
CREATE TABLE universities (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    url VARCHAR(500) NOT NULL,
    country VARCHAR(100),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Scraper scripts table
CREATE TABLE scraper_scripts (
    id SERIAL PRIMARY KEY,
    university_id INTEGER REFERENCES universities(id),
    script_name VARCHAR(255) NOT NULL,
    script_content TEXT NOT NULL,       -- Python script code
    config_content TEXT,                -- YAML configuration
    version INTEGER DEFAULT 1,
    status VARCHAR(50) DEFAULT 'draft', -- draft, active, deprecated
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Scrape jobs table
CREATE TABLE scrape_jobs (
    id SERIAL PRIMARY KEY,
    university_id INTEGER REFERENCES universities(id),
    script_id INTEGER REFERENCES scraper_scripts(id),
    status VARCHAR(50) DEFAULT 'pending', -- pending, running, completed, failed
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Scrape results table (hierarchical data stored as JSON)
CREATE TABLE scrape_results (
    id SERIAL PRIMARY KEY,
    job_id INTEGER REFERENCES scrape_jobs(id),
    university_id INTEGER REFERENCES universities(id),
    result_data JSONB NOT NULL, -- school -> program -> faculty JSON data
    schools_count INTEGER,
    programs_count INTEGER,
    faculty_count INTEGER,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Execution logs table
CREATE TABLE scrape_logs (
    id SERIAL PRIMARY KEY,
    job_id INTEGER REFERENCES scrape_jobs(id),
    level VARCHAR(20), -- info, warning, error
    message TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);
```
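
For reference, a minimal sketch of how the first two tables could map onto SQLAlchemy 2.x models under `backend/app/models/`. This is illustrative only: it mirrors the columns above but is not code included in this commit.

```python
# Sketch of backend/app/models/university.py (assumed file, SQLAlchemy 2.x style)
from datetime import datetime

from sqlalchemy import ForeignKey, String, Text, func
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class University(Base):
    __tablename__ = "universities"

    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(String(255))
    url: Mapped[str] = mapped_column(String(500))
    country: Mapped[str | None] = mapped_column(String(100))
    created_at: Mapped[datetime] = mapped_column(server_default=func.now())
    updated_at: Mapped[datetime] = mapped_column(server_default=func.now(), onupdate=func.now())


class ScraperScript(Base):
    __tablename__ = "scraper_scripts"

    id: Mapped[int] = mapped_column(primary_key=True)
    university_id: Mapped[int] = mapped_column(ForeignKey("universities.id"))
    script_name: Mapped[str] = mapped_column(String(255))
    script_content: Mapped[str] = mapped_column(Text)         # Python script code
    config_content: Mapped[str | None] = mapped_column(Text)  # YAML configuration
    version: Mapped[int] = mapped_column(default=1)
    status: Mapped[str] = mapped_column(String(50), default="draft")  # draft, active, deprecated
    created_at: Mapped[datetime] = mapped_column(server_default=func.now())
    updated_at: Mapped[datetime] = mapped_column(server_default=func.now(), onupdate=func.now())
```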

## 4. API Design

### 1. University management
```
POST   /api/universities                 Create a university
GET    /api/universities                 List universities
GET    /api/universities/{id}            Get university details
DELETE /api/universities/{id}            Delete a university
```

### 2. Scraper scripts
```
POST /api/scripts/generate               Generate a scraper script (automatic agent analysis)
GET  /api/scripts/{university_id}        Get a university's scraper scripts
PUT  /api/scripts/{id}                   Update a script
```

### 3. Scrape jobs
```
POST /api/jobs/start/{university_id}     Start a scrape job
GET  /api/jobs/{id}                      Get job status
GET  /api/jobs/university/{id}           List a university's jobs
POST /api/jobs/{id}/cancel               Cancel a job
```

### 4. Data results
```
GET /api/results/{university_id}                       Get scrape results
GET /api/results/{university_id}/schools               List schools
GET /api/results/{university_id}/programs              List programs
GET /api/results/{university_id}/faculty               List faculty
GET /api/results/{university_id}/export?format=json    Export data
```
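
A hedged sketch of what the university-management endpoints could look like as a FastAPI router (`backend/app/api/universities.py` in the directory layout below); `get_db` and the `University` model refer to the modules sketched above and are assumptions, not code shipped in this commit.

```python
# Sketch of backend/app/api/universities.py — assumed module names throughout
from fastapi import APIRouter, Depends, HTTPException
from pydantic import BaseModel
from sqlalchemy.orm import Session

from app.database import get_db                 # assumed session dependency
from app.models.university import University    # assumed ORM model (see sketch above)

router = APIRouter(prefix="/api/universities", tags=["universities"])


class UniversityCreate(BaseModel):
    name: str
    url: str
    country: str | None = None


@router.post("")
def create_university(payload: UniversityCreate, db: Session = Depends(get_db)):
    # Persist a new university row and return its basic fields.
    university = University(**payload.model_dump())
    db.add(university)
    db.commit()
    db.refresh(university)
    return {"id": university.id, "name": university.name, "url": university.url}


@router.get("/{university_id}")
def get_university(university_id: int, db: Session = Depends(get_db)):
    university = db.get(University, university_id)
    if university is None:
        raise HTTPException(status_code=404, detail="University not found")
    return {"id": university.id, "name": university.name, "url": university.url}
```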

## 5. Frontend Page Design

### Page 1: Home / university list
- Shows the list of universities that have been added
- "Add new university" button
- Each university card shows: name, status, program count, faculty count, action buttons

### Page 2: Add university (one-click script generation)
- Input field for the university's official website URL
- "Analyze and generate script" button
- Shows analysis progress and logs
- Redirects to the management page automatically once generation completes

### Page 3: University management page
- Basic university information
- Scraper script status
- "Run scraper" button
- Real-time display of run progress and logs
- Job history list

### Page 4: Data view page
- Tree view: school → program → faculty
- Search and filtering
- Data export button (JSON/Excel)
- Data validation and editing

## 6. Implementation Steps

### Phase 1: Backend foundation (first priority)
1. Create the FastAPI project structure
2. Design the database models (SQLAlchemy)
3. Implement the basic CRUD API
4. Integrate the existing scraper code

### Phase 2: Script generation and execution
1. Implement the agent's automatic analysis logic
2. Implement script storage and versioning
3. Integrate the Celery asynchronous task queue (see the sketch after this list)
4. Implement scraper execution and logging
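
A minimal sketch of what the Celery integration in step 3 might look like (`backend/app/tasks/scrape_task.py` in the layout below). The broker URL, the task body, and the assumption that a stored script exposes an async `main()` coroutine are all illustrative, not part of this commit.

```python
# Sketch of backend/app/tasks/scrape_task.py — illustrative only
import asyncio

from celery import Celery

# Redis as broker and result backend, matching the stack above (URL is an assumption).
celery_app = Celery("scraper", broker="redis://redis:6379/0", backend="redis://redis:6379/1")


@celery_app.task(bind=True, max_retries=2)
def run_scrape_job(self, job_id: int, script_content: str) -> dict:
    """Execute one stored scraper script for a job and report the outcome."""
    try:
        # Assumed contract: the stored script defines an async `main()` coroutine.
        namespace: dict = {}
        exec(script_content, namespace)
        result = asyncio.run(namespace["main"]())
        return {"job_id": job_id, "status": "completed", "result": result}
    except Exception as exc:
        # Retry transient failures with a short back-off.
        raise self.retry(exc=exc, countdown=30)
```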

### Phase 3: Frontend development
1. Scaffold the React project
2. Implement the university list page
3. Implement the script generation page
4. Implement the data view page

### Phase 4: Deployment
1. Containerize with Docker
2. Deploy to a cloud server
3. Configure the domain and HTTPS

## 7. Directory Structure

```
university-scraper-web/
├── backend/
│   ├── app/
│   │   ├── __init__.py
│   │   ├── main.py              # FastAPI entry point
│   │   ├── config.py            # Configuration
│   │   ├── database.py          # Database connection
│   │   ├── models/              # SQLAlchemy models
│   │   │   ├── university.py
│   │   │   ├── script.py
│   │   │   ├── job.py
│   │   │   └── result.py
│   │   ├── schemas/             # Pydantic models
│   │   ├── api/                 # API routes
│   │   │   ├── universities.py
│   │   │   ├── scripts.py
│   │   │   ├── jobs.py
│   │   │   └── results.py
│   │   ├── services/            # Business logic
│   │   │   ├── scraper_service.py
│   │   │   └── agent_service.py
│   │   └── tasks/               # Celery tasks
│   │       └── scrape_task.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── src/
│   │   ├── components/
│   │   ├── pages/
│   │   ├── services/
│   │   └── App.tsx
│   ├── package.json
│   └── Dockerfile
├── docker-compose.yml
└── README.md
```

## 8. Where to Store the Scraper Scripts

### Recommended: PostgreSQL, combined with the file system

1. **Stored in PostgreSQL**:
   - Script metadata (name, version, status)
   - Script code content (TEXT column)
   - Configuration file content (JSONB column)
   - Scrape results (JSONB column; see the query sketch after this list)

2. **Advantages**:
   - Transactions and data consistency
   - Straightforward version management
   - Easy querying and search
   - Simple backup and migration
   - Tight integration with the backend

3. **Cloud deployment options**:
   - AWS RDS PostgreSQL
   - Alibaba Cloud RDS PostgreSQL
   - Tencent Cloud TDSQL-C
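
A small sketch of how the hierarchical `result_data` JSONB column could be read back, assuming the stored document has a top-level `schools` array; the connection string and key layout are assumptions for illustration only.

```python
# Sketch: querying the latest scrape_results row for a university via SQLAlchemy Core
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://scraper:scraper@localhost/scraper")  # assumed DSN

with engine.connect() as conn:
    row = conn.execute(
        text("""
            SELECT result_data -> 'schools' AS schools, faculty_count
            FROM scrape_results
            WHERE university_id = :uid
            ORDER BY created_at DESC
            LIMIT 1
        """),
        {"uid": 1},
    ).first()
    if row is not None:
        # psycopg2 deserializes JSONB, so row.schools arrives as a Python list/dict.
        print(row.faculty_count, len(row.schools or []))
```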

### Alternative: MongoDB

If the data structure changes frequently, MongoDB is worth considering:
- Flexible document structure
- Well suited to storing hierarchical scrape results
- However, the Python ecosystem has stronger support for PostgreSQL
artifacts/debug_cs_faculty.py (new file, 83 lines)

```python
#!/usr/bin/env python3
"""
Debug the Computer Science Faculty page.
"""

import asyncio
from playwright.async_api import async_playwright


async def debug_cs():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()

        # Visit the Computer Science GSAS page
        gsas_url = "https://gsas.harvard.edu/program/computer-science"
        print(f"Visiting: {gsas_url}")

        await page.goto(gsas_url, wait_until="domcontentloaded", timeout=30000)
        await page.wait_for_timeout(3000)

        await page.screenshot(path="cs_gsas_page.png", full_page=True)
        print("Screenshot saved: cs_gsas_page.png")

        # Collect all links on the page
        links = await page.evaluate('''() => {
            const links = [];
            document.querySelectorAll('a[href]').forEach(a => {
                const text = a.innerText.trim();
                const href = a.href;
                if (text && text.length > 2 && text.length < 100) {
                    links.push({text: text, href: href});
                }
            });
            return links;
        }''')

        print(f"\nAll links on the page ({len(links)} total):")
        for link in links:
            print(f"  - {link['text'][:60]} -> {link['href']}")

        # Look for likely Faculty or People links
        print("\n\nFaculty/People-related links:")
        for link in links:
            text_lower = link['text'].lower()
            href_lower = link['href'].lower()
            if 'faculty' in text_lower or 'people' in href_lower or 'faculty' in href_lower or 'website' in text_lower:
                print(f"  * {link['text']} -> {link['href']}")

        # Try the SEAS (School of Engineering) Computer Science page
        print("\n\nTrying the SEAS Computer Science page...")
        seas_url = "https://seas.harvard.edu/computer-science"
        await page.goto(seas_url, wait_until="domcontentloaded", timeout=30000)
        await page.wait_for_timeout(2000)

        await page.screenshot(path="seas_cs_page.png", full_page=True)
        print("Screenshot saved: seas_cs_page.png")

        seas_links = await page.evaluate('''() => {
            const links = [];
            document.querySelectorAll('a[href]').forEach(a => {
                const text = a.innerText.trim();
                const href = a.href;
                const lowerText = text.toLowerCase();
                const lowerHref = href.toLowerCase();
                if ((lowerText.includes('faculty') || lowerText.includes('people') ||
                     lowerHref.includes('faculty') || lowerHref.includes('people')) &&
                    text.length > 2) {
                    links.push({text: text, href: href});
                }
            });
            return links;
        }''')

        print(f"\nFaculty/People links on the SEAS page:")
        for link in seas_links:
            print(f"  * {link['text']} -> {link['href']}")

        await browser.close()


if __name__ == "__main__":
    asyncio.run(debug_cs())
```
artifacts/explore_faculty_page.py (new file, 110 lines)

```python
"""
Explore the structure of Harvard department People/Faculty pages to collect the faculty list.
"""
import asyncio
from playwright.async_api import async_playwright


async def explore_faculty_page():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()

        # Visit the AAAS department People page
        people_url = "https://aaas.fas.harvard.edu/aaas-people"
        print(f"Visiting department People page: {people_url}")

        await page.goto(people_url, wait_until='networkidle')
        await page.wait_for_timeout(3000)

        # Save a screenshot
        await page.screenshot(path="aaas_people_page.png", full_page=True)
        print("Screenshot saved: aaas_people_page.png")

        # Collect all faculty profile links
        faculty_info = await page.evaluate('''() => {
            const faculty = [];

            // Find all links under the /people/ path
            document.querySelectorAll('a[href*="/people/"]').forEach(a => {
                const href = a.href || '';
                const text = a.innerText.trim();

                // Filter out navigation links; keep only personal profile pages
                if (href.includes('/people/') && text.length > 3 &&
                    !text.toLowerCase().includes('people') &&
                    !href.endsWith('/people/') &&
                    !href.endsWith('/aaas-people')) {
                    faculty.push({
                        name: text,
                        url: href
                    });
                }
            });

            return faculty;
        }''')

        print(f"\nFound {len(faculty_info)} faculty members:")
        for f in faculty_info:
            print(f"  - {f['name']} -> {f['url']}")

        # Try the Economics department Faculty page
        print("\n\n========== Trying the Economics department Faculty page ==========")
        econ_faculty_url = "http://economics.harvard.edu/people/people-type/faculty"
        print(f"Visiting: {econ_faculty_url}")

        await page.goto(econ_faculty_url, wait_until='networkidle')
        await page.wait_for_timeout(3000)

        await page.screenshot(path="econ_faculty_page.png", full_page=True)
        print("Screenshot saved: econ_faculty_page.png")

        econ_faculty = await page.evaluate('''() => {
            const faculty = [];

            // Scan all links that might point at faculty members
            document.querySelectorAll('a[href]').forEach(a => {
                const href = a.href || '';
                const text = a.innerText.trim();
                const lowerHref = href.toLowerCase();

                // Look for personal profile links
                if ((lowerHref.includes('/people/') || lowerHref.includes('/faculty/') ||
                     lowerHref.includes('/profile/')) &&
                    text.length > 3 && text.length < 100 &&
                    !text.toLowerCase().includes('faculty') &&
                    !text.toLowerCase().includes('people')) {
                    faculty.push({
                        name: text,
                        url: href
                    });
                }
            });

            return faculty;
        }''')

        print(f"\nFound {len(econ_faculty)} faculty members:")
        for f in econ_faculty[:30]:
            print(f"  - {f['name']} -> {f['url']}")

        # Dump all links on the page for debugging
        print("\n\nAll links on the page:")
        all_links = await page.evaluate('''() => {
            const links = [];
            document.querySelectorAll('a[href]').forEach(a => {
                const href = a.href || '';
                const text = a.innerText.trim();
                if (text && text.length > 2 && text.length < 100) {
                    links.push({text: text, href: href});
                }
            });
            return links;
        }''')
        for link in all_links[:40]:
            print(f"  - {link['text'][:50]} -> {link['href']}")

        await browser.close()


if __name__ == "__main__":
    asyncio.run(explore_faculty_page())
```
artifacts/explore_manchester.py (new file, 173 lines)

```python
"""
Explore the structure of the University of Manchester master's course pages.
"""

import asyncio
import json
from playwright.async_api import async_playwright


async def explore_manchester():
    """Explore the structure of the University of Manchester website."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        )
        page = await context.new_page()

        # Go straight to the A-Z list of master's courses
        print("Visiting the master's courses A-Z list page...")
        await page.goto("https://www.manchester.ac.uk/study/masters/courses/list/",
                        wait_until="domcontentloaded", timeout=60000)
        await page.wait_for_timeout(5000)

        # Screenshot
        await page.screenshot(path="manchester_masters_page.png", full_page=False)
        print("Screenshot saved: manchester_masters_page.png")

        # Analyze the page structure
        page_info = await page.evaluate("""() => {
            const info = {
                title: document.title,
                url: window.location.href,
                all_links: [],
                course_candidates: [],
                page_sections: []
            };

            // Collect all links
            document.querySelectorAll('a[href]').forEach(a => {
                const href = a.href;
                const text = a.innerText.trim().substring(0, 100);
                if (href && text) {
                    info.all_links.push({href, text});
                }
            });

            // Find likely course links - containing /course/ or list-item classes
            document.querySelectorAll('a[href*="/course/"], .course-link, [class*="course"] a, .search-result a, .list-item a').forEach(a => {
                info.course_candidates.push({
                    href: a.href,
                    text: a.innerText.trim().substring(0, 100),
                    classes: a.className,
                    parent_classes: a.parentElement?.className || ''
                });
            });

            // Collect the main page sections
            document.querySelectorAll('main, [role="main"], .content, #content, .results, .course-list').forEach(el => {
                info.page_sections.push({
                    tag: el.tagName,
                    id: el.id,
                    classes: el.className,
                    children_count: el.children.length
                });
            });

            return info;
        }""")

        print(f"\nPage title: {page_info['title']}")
        print(f"Current URL: {page_info['url']}")
        print(f"\nTotal links: {len(page_info['all_links'])}")
        print(f"Course candidate links: {len(page_info['course_candidates'])}")

        # Find links containing masters/courses/
        masters_links = [l for l in page_info['all_links']
                         if 'masters/courses/' in l['href'].lower()
                         and l['href'] != page_info['url']]

        print(f"\nMaster's course related links ({len(masters_links)}):")
        for link in masters_links[:20]:
            print(f"  - {link['text'][:50]}: {link['href']}")

        print(f"\nCourse candidate details:")
        for c in page_info['course_candidates'][:10]:
            print(f"  - {c['text'][:50]}")
            print(f"    URL: {c['href']}")
            print(f"    Classes: {c['classes']}")

        # Check whether there are search/filter controls
        search_elements = await page.evaluate("""() => {
            const elements = [];
            document.querySelectorAll('input[type="search"], input[type="text"], select, .filter, .search').forEach(el => {
                elements.push({
                    tag: el.tagName,
                    type: el.type || '',
                    id: el.id,
                    name: el.name || '',
                    classes: el.className
                });
            });
            return elements;
        }""")

        print(f"\nSearch/filter elements: {len(search_elements)}")
        for el in search_elements[:5]:
            print(f"  - {el}")

        # Try to work out the actual structure of the course list
        print("\n\nAnalyzing the course list structure on the page...")

        list_structures = await page.evaluate("""() => {
            const structures = [];

            // Try several plausible list structures
            const selectors = [
                'ul li a[href*="course"]',
                'div[class*="result"] a',
                'div[class*="course"] a',
                'article a[href]',
                '.search-results a',
                '[data-course] a',
                'table tr td a'
            ];

            for (const selector of selectors) {
                const elements = document.querySelectorAll(selector);
                if (elements.length > 0) {
                    const samples = [];
                    elements.forEach((el, i) => {
                        if (i < 5) {
                            samples.push({
                                href: el.href,
                                text: el.innerText.trim().substring(0, 80)
                            });
                        }
                    });
                    structures.push({
                        selector: selector,
                        count: elements.length,
                        samples: samples
                    });
                }
            }

            return structures;
        }""")

        print("\nList structures found:")
        for s in list_structures:
            print(f"\n  Selector: {s['selector']} ({s['count']} matches)")
            for sample in s['samples']:
                print(f"    - {sample['text']}: {sample['href']}")

        # Save the full analysis
        with open("manchester_analysis.json", "w", encoding="utf-8") as f:
            json.dump(page_info, f, indent=2, ensure_ascii=False)

        print("\n\nFull analysis saved to manchester_analysis.json")

        # Leave the browser open briefly for inspection
        print("\nPress Ctrl+C to close the browser...")
        try:
            await asyncio.sleep(30)
        except:
            pass

        await browser.close()


if __name__ == "__main__":
    asyncio.run(explore_manchester())
```
artifacts/explore_program_page.py (new file, 226 lines)

```python
"""
Explore the structure of Harvard program pages to find faculty information.
"""
import asyncio
from playwright.async_api import async_playwright


async def explore_program_page():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()

        # Visit a graduate program page (GSAS)
        gsas_url = "https://gsas.harvard.edu/program/african-and-african-american-studies"
        print(f"Visiting graduate program page: {gsas_url}")

        await page.goto(gsas_url, wait_until='networkidle')
        await page.wait_for_timeout(3000)

        # Save a screenshot
        await page.screenshot(path="gsas_program_page.png", full_page=True)
        print("Screenshot saved: gsas_program_page.png")

        # Analyze the page structure
        page_info = await page.evaluate('''() => {
            const info = {
                title: document.title,
                h1: document.querySelector('h1')?.innerText || '',
                allHeadings: [],
                facultyLinks: [],
                peopleLinks: [],
                allLinks: []
            };

            // Collect all headings
            document.querySelectorAll('h1, h2, h3, h4').forEach(h => {
                info.allHeadings.push({
                    tag: h.tagName,
                    text: h.innerText.trim().substring(0, 100)
                });
            });

            // Inspect all links
            document.querySelectorAll('a[href]').forEach(a => {
                const href = a.href || '';
                const text = a.innerText.trim();

                // Check whether the link is faculty-related
                const lowerHref = href.toLowerCase();
                const lowerText = text.toLowerCase();

                if (lowerHref.includes('faculty') || lowerHref.includes('people') ||
                    lowerHref.includes('professor') || lowerHref.includes('staff') ||
                    lowerText.includes('faculty') || lowerText.includes('people')) {
                    info.facultyLinks.push({
                        text: text.substring(0, 100),
                        href: href
                    });
                }

                // Check whether it is a personal profile link
                if (href.includes('/people/') || href.includes('/faculty/') ||
                    href.includes('/profile/') || href.includes('/person/')) {
                    info.peopleLinks.push({
                        text: text.substring(0, 100),
                        href: href
                    });
                }

                // Keep all main links
                if (href && text.length > 2 && text.length < 150) {
                    info.allLinks.push({
                        text: text,
                        href: href
                    });
                }
            });

            return info;
        }''')

        print(f"\nPage title: {page_info['title']}")
        print(f"H1: {page_info['h1']}")

        print(f"\nAll headings ({len(page_info['allHeadings'])}):")
        for h in page_info['allHeadings']:
            print(f"  <{h['tag']}>: {h['text']}")

        print(f"\nFaculty-related links ({len(page_info['facultyLinks'])}):")
        for f in page_info['facultyLinks']:
            print(f"  - {f['text']} -> {f['href']}")

        print(f"\nPersonal profile links ({len(page_info['peopleLinks'])}):")
        for p in page_info['peopleLinks']:
            print(f"  - {p['text']} -> {p['href']}")

        print(f"\nAll links ({len(page_info['allLinks'])}):")
        for link in page_info['allLinks'][:50]:
            print(f"  - {link['text'][:60]} -> {link['href']}")

        # Try another program page to see whether the structure differs
        print("\n\n========== Trying another program page ==========")
        economics_url = "https://gsas.harvard.edu/program/economics"
        print(f"Visiting: {economics_url}")

        await page.goto(economics_url, wait_until='networkidle')
        await page.wait_for_timeout(3000)

        # Save a screenshot
        await page.screenshot(path="gsas_economics_page.png", full_page=True)
        print("Screenshot saved: gsas_economics_page.png")

        # Analyze
        econ_info = await page.evaluate('''() => {
            const info = {
                title: document.title,
                facultyLinks: [],
                peopleLinks: []
            };

            document.querySelectorAll('a[href]').forEach(a => {
                const href = a.href || '';
                const text = a.innerText.trim();
                const lowerHref = href.toLowerCase();
                const lowerText = text.toLowerCase();

                if (lowerHref.includes('faculty') || lowerHref.includes('people') ||
                    lowerText.includes('faculty') || lowerText.includes('people')) {
                    info.facultyLinks.push({
                        text: text.substring(0, 100),
                        href: href
                    });
                }

                if (href.includes('/people/') || href.includes('/faculty/') ||
                    href.includes('/profile/') || href.includes('/person/')) {
                    info.peopleLinks.push({
                        text: text.substring(0, 100),
                        href: href
                    });
                }
            });

            return info;
        }''')

        print(f"\nFaculty-related links ({len(econ_info['facultyLinks'])}):")
        for f in econ_info['facultyLinks']:
            print(f"  - {f['text']} -> {f['href']}")

        print(f"\nPersonal profile links ({len(econ_info['peopleLinks'])}):")
        for p in econ_info['peopleLinks']:
            print(f"  - {p['text']} -> {p['href']}")

        # Visit the department homepage to see whether it has a Faculty page
        print("\n\n========== Trying the department homepage ==========")
        dept_url = "https://aaas.fas.harvard.edu/"
        print(f"Visiting department homepage: {dept_url}")

        await page.goto(dept_url, wait_until='networkidle')
        await page.wait_for_timeout(3000)

        await page.screenshot(path="aaas_dept_page.png", full_page=True)
        print("Screenshot saved: aaas_dept_page.png")

        dept_info = await page.evaluate('''() => {
            const info = {
                title: document.title,
                navLinks: [],
                facultyLinks: [],
                peopleLinks: []
            };

            // Collect navigation links
            document.querySelectorAll('nav a, [class*="nav"] a, [class*="menu"] a').forEach(a => {
                const href = a.href || '';
                const text = a.innerText.trim();
                if (text && text.length > 1 && text.length < 50) {
                    info.navLinks.push({
                        text: text,
                        href: href
                    });
                }
            });

            document.querySelectorAll('a[href]').forEach(a => {
                const href = a.href || '';
                const text = a.innerText.trim();
                const lowerHref = href.toLowerCase();
                const lowerText = text.toLowerCase();

                if (lowerHref.includes('faculty') || lowerHref.includes('people') ||
                    lowerText.includes('faculty') || lowerText.includes('people')) {
                    info.facultyLinks.push({
                        text: text.substring(0, 100),
                        href: href
                    });
                }

                if (href.includes('/people/') || href.includes('/faculty/') ||
                    href.includes('/profile/')) {
                    info.peopleLinks.push({
                        text: text.substring(0, 100),
                        href: href
                    });
                }
            });

            return info;
        }''')

        print(f"\nNavigation links ({len(dept_info['navLinks'])}):")
        for link in dept_info['navLinks'][:20]:
            print(f"  - {link['text']} -> {link['href']}")

        print(f"\nFaculty-related links ({len(dept_info['facultyLinks'])}):")
        for f in dept_info['facultyLinks']:
            print(f"  - {f['text']} -> {f['href']}")

        print(f"\nPersonal profile links ({len(dept_info['peopleLinks'])}):")
        for p in dept_info['peopleLinks'][:30]:
            print(f"  - {p['text']} -> {p['href']}")

        await browser.close()


if __name__ == "__main__":
    asyncio.run(explore_program_page())
```
```
@@ -125,6 +125,7 @@ class ScrapeSettings:
     output: Path
     verify_links: bool = True
     request_delay: float = 1.0  # Polite crawling delay
+    timeout: int = 60000  # Navigation timeout in ms


 async def extract_links(page: Page) -> List[Tuple[str, str]]:
@@ -210,7 +211,7 @@ async def crawl(settings: ScrapeSettings, browser_name: str) -> List[ScrapedLink
     page = await context.new_page()
     try:
         response = await page.goto(
-            normalized_url, wait_until="domcontentloaded", timeout=20000
+            normalized_url, wait_until="domcontentloaded", timeout=settings.timeout
         )
         if not response or response.status >= 400:
             await page.close()
@@ -411,6 +412,12 @@ def parse_args() -> argparse.Namespace:
         default=1.0,
         help="Delay between requests in seconds (polite crawling).",
     )
+    parser.add_argument(
+        "--timeout",
+        type=int,
+        default=60000,
+        help="Navigation timeout in milliseconds (default: 60000 = 60s).",
+    )
     return parser.parse_args()


@@ -424,6 +431,7 @@ async def main_async() -> None:
         output=args.output,
         verify_links=not args.no_verify,
         request_delay=args.delay,
+        timeout=args.timeout,
     )
     links = await crawl(settings, browser_name=args.browser)
     serialize(links, settings.output, settings.root_url)
```
artifacts/harvard_programs_scraper.py (new file, 466 lines)

```python
#!/usr/bin/env python3
"""
Harvard Graduate Programs Scraper
Scrapes every graduate program listed at https://www.harvard.edu/programs/?degree_levels=graduate
by clicking through the pagination buttons.
"""

import asyncio
import json
import re
from datetime import datetime, timezone
from pathlib import Path
from playwright.async_api import async_playwright


async def scrape_harvard_programs():
    """Scrape the Harvard graduate program listing by clicking the pagination buttons."""

    all_programs = []
    base_url = "https://www.harvard.edu/programs/?degree_levels=graduate"

    async with async_playwright() as p:
        # Run headless
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            viewport={'width': 1920, 'height': 1080}
        )
        page = await context.new_page()

        print(f"Visiting: {base_url}")
        # Use domcontentloaded instead of networkidle so the page loads faster
        await page.goto(base_url, wait_until="domcontentloaded", timeout=60000)
        # Wait for the page content to load
        await page.wait_for_timeout(5000)

        # Scroll to the bottom of the page so the pagination buttons load
        print("Scrolling to the bottom of the page...")
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(2000)

        current_page = 1
        max_pages = 15

        while current_page <= max_pages:
            print(f"\n========== Page {current_page} ==========")

            # Wait for content to load
            await page.wait_for_timeout(2000)

            # Extract the programs on the current page.
            # From the debug output, the program buttons have the class
            # 'records__record___PbPhG c-programs-item__title-link'.
            # URLs have to be obtained by clicking, because Harvard uses JavaScript navigation.

            # First gather information about every program button
            page_data = await page.evaluate('''() => {
                const programs = [];

                // Find all program rows/containers
                const programItems = document.querySelectorAll('[class*="records__record"], [class*="c-programs-item"]');

                programItems.forEach((item, index) => {
                    // Program name button
                    const nameBtn = item.querySelector('button[class*="title-link"], button[class*="c-programs-item"]');
                    if (!nameBtn) return;

                    const name = nameBtn.innerText.trim();
                    if (!name || name.length < 3) return;

                    // Degree information
                    let degrees = '';
                    const allText = item.innerText;
                    const degreeMatch = allText.match(/(A\\.B\\.|Ph\\.D\\.|M\\.A\\.|S\\.M\\.|M\\.Arch\\.|LL\\.M\\.|S\\.B\\.|A\\.L\\.B\\.|A\\.L\\.M\\.|M\\.M\\.Sc\\.|Ed\\.D\\.|Ed\\.M\\.|M\\.P\\.A\\.|M\\.P\\.P\\.|M\\.P\\.H\\.|J\\.D\\.|M\\.B\\.A\\.|M\\.D\\.|D\\.M\\.D\\.|Th\\.D\\.|M\\.Div\\.|M\\.T\\.S\\.|M\\.E\\.|D\\.M\\.Sc\\.|M\\.H\\.C\\.M\\.|M\\.L\\.A\\.|M\\.D\\.E\\.|M\\.R\\.E\\.|M\\.A\\.U\\.D\\.|M\\.R\\.P\\.L\\.)/g);
                    if (degreeMatch) {
                        degrees = degreeMatch.join(', ');
                    }

                    // Look for a link in the various places it might live
                    let url = '';

                    // Method 1: an <a> tag
                    const link = item.querySelector('a[href]');
                    if (link && link.href) {
                        url = link.href;
                    }

                    // Method 2: data attributes
                    if (!url) {
                        const dataUrl = nameBtn.getAttribute('data-url') ||
                                        nameBtn.getAttribute('data-href') ||
                                        item.getAttribute('data-url');
                        if (dataUrl) url = dataUrl;
                    }

                    // Method 3: an onclick attribute
                    if (!url) {
                        const onclick = nameBtn.getAttribute('onclick') || '';
                        const urlMatch = onclick.match(/['"]([^'"]*\\/programs\\/[^'"]*)['"]/);
                        if (urlMatch) url = urlMatch[1];
                    }

                    programs.push({
                        name: name,
                        degrees: degrees,
                        url: url,
                        index: index
                    });
                });

                // If the container scan found nothing, fall back to scanning buttons
                if (programs.length === 0) {
                    // Find all program buttons
                    const buttons = document.querySelectorAll('button');
                    buttons.forEach((btn, index) => {
                        const className = btn.className || '';
                        if (className.includes('c-programs-item') || className.includes('title-link')) {
                            const name = btn.innerText.trim();
                            if (name && name.length > 3 && !name.match(/^(Page|Next|Previous|Search|Menu|Filter)/)) {
                                programs.push({
                                    name: name,
                                    degrees: '',
                                    url: '',
                                    index: index
                                });
                            }
                        }
                    });
                }

                return {
                    programs: programs,
                    totalFound: programs.length
                };
            }''')

            # On the first page, dump the HTML structure if nothing was found
            if current_page == 1 and len(page_data['programs']) == 0:
                print("No programs found, dumping the HTML structure for debugging...")
                html_debug = await page.evaluate('''() => {
                    const debug = {
                        allButtons: [],
                        allLinks: [],
                        sampleHTML: ''
                    };

                    // Collect all buttons
                    document.querySelectorAll('button').forEach(btn => {
                        const text = btn.innerText.trim().substring(0, 50);
                        if (text && text.length > 3) {
                            debug.allButtons.push({
                                text: text,
                                class: btn.className.substring(0, 80)
                            });
                        }
                    });

                    // Grab an HTML snippet from the main area
                    const main = document.querySelector('main') || document.body;
                    debug.sampleHTML = main.innerHTML.substring(0, 3000);

                    return debug;
                }''')
                print(f"Found {len(html_debug['allButtons'])} buttons:")
                for btn in html_debug['allButtons'][:20]:
                    print(f"  - {btn['text']} | class: {btn['class']}")
                print(f"\nHTML snippet:\n{html_debug['sampleHTML'][:1500]}")

            print(f"  Found {len(page_data['programs'])} programs on this page")

            # Print the programs that were found
            for prog in page_data['programs']:
                print(f"    - {prog['name']} ({prog['degrees']})")

            # Add to the overall list (deduplicated by name)
            for prog in page_data['programs']:
                name = prog['name'].strip()
                if name and not any(p['name'] == name for p in all_programs):
                    all_programs.append({
                        'name': name,
                        'degrees': prog.get('degrees', ''),
                        'url': prog.get('url', ''),
                        'page': current_page
                    })

            # Try to click the next-page button
            try:
                clicked = False

                # On the first page, print all pagination-related elements for debugging
                if current_page == 1:
                    # Save a screenshot for debugging
                    await page.screenshot(path="harvard_debug_pagination.png", full_page=True)
                    print("Debug screenshot saved: harvard_debug_pagination.png")

                    pagination_info = await page.evaluate('''() => {
                        const result = {
                            links: [],
                            buttons: [],
                            allClickable: [],
                            pageNumbers: [],
                            allText: []
                        };

                        // Find all links
                        document.querySelectorAll('a').forEach(a => {
                            const text = a.innerText.trim();
                            if (text.match(/^[0-9]+$|Next|page|Prev/i)) {
                                result.links.push({
                                    text: text.substring(0, 50),
                                    href: a.href,
                                    visible: a.offsetParent !== null,
                                    className: a.className
                                });
                            }
                        });

                        // Find all buttons
                        document.querySelectorAll('button').forEach(b => {
                            const text = b.innerText.trim();
                            if (text.match(/^[0-9]+$|Next|page|Prev/i) || text.length < 20) {
                                result.buttons.push({
                                    text: text.substring(0, 50),
                                    visible: b.offsetParent !== null,
                                    className: b.className
                                });
                            }
                        });

                        // Find all clickable elements containing digits (possible pagination)
                        document.querySelectorAll('a, button, span[role="button"], div[role="button"], li a, nav a').forEach(el => {
                            const text = el.innerText.trim();
                            if (text.match(/^[0-9]$/) || text === 'Next page' || text.includes('Next')) {
                                result.pageNumbers.push({
                                    tag: el.tagName,
                                    text: text,
                                    className: el.className,
                                    id: el.id,
                                    ariaLabel: el.getAttribute('aria-label'),
                                    visible: el.offsetParent !== null
                                });
                            }
                        });

                        // Find all clickable elements in the lower half of the page
                        const bodyRect = document.body.getBoundingClientRect();
                        document.querySelectorAll('*').forEach(el => {
                            const rect = el.getBoundingClientRect();
                            const text = el.innerText?.trim() || '';
                            // Only elements in the lower half of the page with short text
                            if (rect.top > bodyRect.height * 0.5 && text.length > 0 && text.length < 30) {
                                const style = window.getComputedStyle(el);
                                if (style.cursor === 'pointer' || el.tagName === 'A' || el.tagName === 'BUTTON') {
                                    result.allClickable.push({
                                        tag: el.tagName,
                                        text: text.substring(0, 30),
                                        top: Math.round(rect.top),
                                        className: el.className?.substring?.(0, 50) || ''
                                    });
                                }
                            }
                        });

                        // Dump any pagination-looking text lines for debugging
                        const bodyText = document.body.innerText;
                        const lines = bodyText.split('\\n').filter(l => l.trim());
                        // Keep lines that are a single digit 1-9 or pagination labels
                        for (let i = 0; i < lines.length; i++) {
                            if (lines[i].match(/^[1-9]$|Next page|Previous/)) {
                                result.allText.push(lines[i]);
                            }
                        }

                        return result;
                    }''')
                    print(f"\nPagination-related links ({len(pagination_info['links'])}):")
                    for link in pagination_info['links']:
                        print(f"  a: '{link['text']}' class='{link.get('className', '')}' (visible: {link['visible']})")
                    print(f"\nPagination-related buttons ({len(pagination_info['buttons'])}):")
                    for btn in pagination_info['buttons']:
                        print(f"  button: '{btn['text']}' class='{btn.get('className', '')}' (visible: {btn['visible']})")
                    print(f"\nPage-number elements ({len(pagination_info['pageNumbers'])}):")
                    for pn in pagination_info['pageNumbers']:
                        print(f"  {pn['tag']}: '{pn['text']}' aria-label='{pn.get('ariaLabel')}' visible={pn['visible']}")
                    print(f"\nClickable elements in the lower half of the page ({len(pagination_info['allClickable'])}):")
                    for el in pagination_info['allClickable'][:30]:
                        print(f"  {el['tag']}: '{el['text']}' (top: {el['top']})")
                    print(f"\nPagination text found on the page ({len(pagination_info['allText'])}):")
                    for txt in pagination_info['allText'][:20]:
                        print(f"  '{txt}'")

                # Method 1: find the "Next page" button with a CSS selector (most reliable).
                # From the debug output, the pagination button is
                # <button class="c-pagination__link c-pagination__link--next">
                next_page_num = str(current_page + 1)

                try:
                    next_btn = page.locator('button.c-pagination__link--next')
                    if await next_btn.count() > 0:
                        print(f"\nFound the 'Next page' button (CSS selector), clicking...")
                        await next_btn.first.scroll_into_view_if_needed()
                        await next_btn.first.click()
                        await page.wait_for_timeout(3000)
                        current_page += 1
                        clicked = True
                except Exception as e:
                    print(f"Method 1 failed: {e}")

                if clicked:
                    continue

                # Method 2: find the button via get_by_role
                try:
                    next_btn = page.get_by_role("button", name="Next page")
                    if await next_btn.count() > 0:
                        print(f"\nFound the 'Next page' button via role, clicking...")
                        await next_btn.first.scroll_into_view_if_needed()
                        await next_btn.first.click()
                        await page.wait_for_timeout(3000)
                        current_page += 1
                        clicked = True
                except Exception as e:
                    print(f"Method 2 failed: {e}")

                if clicked:
                    continue

                # Method 3: iterate over all pagination buttons and click "Next page"
                try:
                    pagination_buttons = await page.query_selector_all('button.c-pagination__link')
                    for btn in pagination_buttons:
                        text = await btn.inner_text()
                        if 'Next page' in text:
                            print(f"\nFound 'Next page' by iterating the pagination buttons, clicking...")
                            await btn.scroll_into_view_if_needed()
                            await btn.click()
                            await page.wait_for_timeout(3000)
                            current_page += 1
                            clicked = True
                            break
                except Exception as e:
                    print(f"Method 3 failed: {e}")

                if clicked:
                    continue

                # Method 4: click the pagination button directly via JavaScript
                try:
                    js_clicked = await page.evaluate('''() => {
                        // Find the Next page button
                        const nextBtn = document.querySelector('button.c-pagination__link--next');
                        if (nextBtn) {
                            nextBtn.click();
                            return true;
                        }
                        // Fallback: scan all pagination buttons
                        const buttons = document.querySelectorAll('button.c-pagination__link');
                        for (const btn of buttons) {
                            if (btn.innerText.includes('Next page')) {
                                btn.click();
                                return true;
                            }
                        }
                        return false;
                    }''')
                    if js_clicked:
                        print(f"\nClicked 'Next page' via JavaScript")
                        await page.wait_for_timeout(3000)
                        current_page += 1
                        clicked = True
                except Exception as e:
                    print(f"Method 4 failed: {e}")

                if clicked:
                    continue

                # Method 5: iterate over every button on the page
                try:
                    all_buttons = await page.query_selector_all('button')
                    for btn in all_buttons:
                        try:
                            text = await btn.inner_text()
                            if 'Next page' in text:
                                visible = await btn.is_visible()
                                if visible:
                                    print(f"\nFound 'Next page' by iterating all buttons, clicking...")
                                    await btn.scroll_into_view_if_needed()
                                    await btn.click()
                                    await page.wait_for_timeout(3000)
                                    current_page += 1
                                    clicked = True
                                    break
                        except:
                            continue
                except Exception as e:
                    print(f"Method 5 failed: {e}")

                if clicked:
                    continue

                print("No next-page button found, stopping")
                break

            except Exception as e:
                print(f"Error while clicking the next page: {e}")
                break

        # Build program URLs - Harvard program URLs look like:
        # https://www.harvard.edu/programs/{program-name-slug}/
        # e.g. african-and-african-american-studies

        import re

        def name_to_slug(name):
            """Convert a program name to a URL slug."""
            # Lowercase
            slug = name.lower()
            # Strip special characters
            slug = re.sub(r'[^\w\s-]', '', slug)
            # Replace whitespace with hyphens
            slug = re.sub(r'[\s_]+', '-', slug)
            # Collapse repeated hyphens
            slug = re.sub(r'-+', '-', slug)
            # Trim leading/trailing hyphens
            slug = slug.strip('-')
            return slug

        print("\nGenerating program URLs...")
        for prog in all_programs:
            slug = name_to_slug(prog['name'])
            prog['url'] = f"https://www.harvard.edu/programs/{slug}/"
            print(f"  {prog['name']} -> {prog['url']}")

        await browser.close()

    # Sort
    programs = sorted(all_programs, key=lambda x: x['name'])

    # Save
    result = {
        'source_url': base_url,
        'scraped_at': datetime.now(timezone.utc).isoformat(),
        'total_pages_scraped': current_page,
        'total_programs': len(programs),
        'programs': programs
    }

    output_file = Path('harvard_programs_results.json')
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(result, f, ensure_ascii=False, indent=2)

    print(f"\n{'='*60}")
    print(f"Scraping complete!")
    print(f"Scraped {current_page} pages")
    print(f"Found {len(programs)} graduate programs")
    print(f"Results saved to: {output_file}")
    print(f"{'='*60}")

    # Print the full list
    print("\nFull list of graduate programs:")
    for i, prog in enumerate(programs, 1):
        print(f"{i:3}. {prog['name']} - {prog['degrees']}")

    return result


if __name__ == "__main__":
    asyncio.run(scrape_harvard_programs())
```
356
artifacts/harvard_programs_with_faculty_scraper.py
Normal file
356
artifacts/harvard_programs_with_faculty_scraper.py
Normal file
@ -0,0 +1,356 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Harvard Graduate Programs Scraper with Faculty Information
|
||||||
|
爬取 https://www.harvard.edu/programs/?degree_levels=graduate 页面的所有研究生项目
|
||||||
|
并获取每个项目的导师个人信息页面URL
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from pathlib import Path
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
|
||||||
|
|
||||||
|
def name_to_slug(name):
|
||||||
|
"""将项目名称转换为URL slug"""
|
||||||
|
slug = name.lower()
|
||||||
|
slug = re.sub(r'[^\w\s-]', '', slug)
|
||||||
|
slug = re.sub(r'[\s_]+', '-', slug)
|
||||||
|
slug = re.sub(r'-+', '-', slug)
|
||||||
|
slug = slug.strip('-')
|
||||||
|
return slug
|
||||||
|
|
||||||
|
|
||||||
|
async def extract_faculty_from_page(page):
|
||||||
|
"""从当前页面提取所有教职员工链接"""
|
||||||
|
faculty_list = await page.evaluate('''() => {
|
||||||
|
const faculty = [];
|
||||||
|
const seen = new Set();
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href || '';
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
const lowerHref = href.toLowerCase();
|
||||||
|
const lowerText = text.toLowerCase();
|
||||||
|
|
||||||
|
// 检查是否是个人页面链接
|
||||||
|
if ((lowerHref.includes('/people/') || lowerHref.includes('/faculty/') ||
|
||||||
|
lowerHref.includes('/profile/') || lowerHref.includes('/person/')) &&
|
||||||
|
text.length > 3 && text.length < 100 &&
|
||||||
|
!lowerText.includes('people') &&
|
||||||
|
!lowerText.includes('faculty') &&
|
||||||
|
!lowerText.includes('profile') &&
|
||||||
|
!lowerText.includes('staff') &&
|
||||||
|
!lowerHref.endsWith('/people/') &&
|
||||||
|
!lowerHref.endsWith('/people') &&
|
||||||
|
!lowerHref.endsWith('/faculty/') &&
|
||||||
|
!lowerHref.endsWith('/faculty')) {
|
||||||
|
|
||||||
|
if (!seen.has(href)) {
|
||||||
|
seen.add(href);
|
||||||
|
faculty.push({
|
||||||
|
name: text,
|
||||||
|
url: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return faculty;
|
||||||
|
}''')
|
||||||
|
return faculty_list
|
||||||
|
|
||||||
|
|
||||||
|
async def get_faculty_from_gsas_page(page, gsas_url, program_name):
|
||||||
|
"""从GSAS项目页面获取Faculty链接,然后访问院系People页面获取导师列表"""
|
||||||
|
faculty_list = []
|
||||||
|
faculty_page_url = None
|
||||||
|
|
||||||
|
try:
|
||||||
|
print(f" 访问GSAS页面: {gsas_url}")
|
||||||
|
await page.goto(gsas_url, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
# 策略1: 查找 "See list of ... faculty" 链接
|
||||||
|
faculty_link = await page.evaluate('''() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const link of links) {
|
||||||
|
const text = link.innerText.toLowerCase();
|
||||||
|
const href = link.href;
|
||||||
|
if (text.includes('faculty') && text.includes('see list')) {
|
||||||
|
return href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
# 策略2: 查找任何包含 /people 或 /faculty 的链接
|
||||||
|
if not faculty_link:
|
||||||
|
faculty_link = await page.evaluate('''() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const link of links) {
|
||||||
|
const text = link.innerText.toLowerCase();
|
||||||
|
const href = link.href.toLowerCase();
|
||||||
|
// 查找Faculty相关链接
|
||||||
|
if ((text.includes('faculty') || text.includes('people')) &&
|
||||||
|
(href.includes('/people') || href.includes('/faculty'))) {
|
||||||
|
return link.href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
# 策略3: 从页面中查找院系网站链接,然后尝试访问其People页面
|
||||||
|
if not faculty_link:
|
||||||
|
dept_website = await page.evaluate('''() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const link of links) {
|
||||||
|
const text = link.innerText.toLowerCase();
|
||||||
|
const href = link.href;
|
||||||
|
// 查找 Website 链接 (通常指向院系主页)
|
||||||
|
if (text.includes('website') && href.includes('harvard.edu') &&
|
||||||
|
!href.includes('gsas.harvard.edu')) {
|
||||||
|
return href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
if dept_website:
|
||||||
|
print(f" 找到院系网站: {dept_website}")
|
||||||
|
try:
|
||||||
|
await page.goto(dept_website, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
# 在院系网站上查找People/Faculty链接
|
||||||
|
faculty_link = await page.evaluate('''() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const link of links) {
|
||||||
|
const text = link.innerText.toLowerCase().trim();
|
||||||
|
const href = link.href;
|
||||||
|
if ((text === 'people' || text === 'faculty' ||
|
||||||
|
text === 'faculty & research' || text.includes('our faculty')) &&
|
||||||
|
(href.includes('/people') || href.includes('/faculty'))) {
|
||||||
|
return href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}''')
|
||||||
|
except Exception as e:
|
||||||
|
print(f" 访问院系网站失败: {e}")
|
||||||
|
|
||||||
|
if faculty_link:
|
||||||
|
faculty_page_url = faculty_link
|
||||||
|
print(f" 找到Faculty页面: {faculty_link}")
|
||||||
|
|
||||||
|
# 访问Faculty/People页面
|
||||||
|
await page.goto(faculty_link, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
# 提取所有导师信息
|
||||||
|
faculty_list = await extract_faculty_from_page(page)
|
||||||
|
|
||||||
|
# 如果第一页没找到,尝试处理分页或其他布局
|
||||||
|
if len(faculty_list) == 0:
|
||||||
|
# 可能需要点击某些按钮或处理JavaScript加载
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
faculty_list = await extract_faculty_from_page(page)
|
||||||
|
|
||||||
|
print(f" 找到 {len(faculty_list)} 位导师")
|
||||||
|
else:
|
||||||
|
print(f" 未找到Faculty页面链接")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f" 获取Faculty信息失败: {e}")
|
||||||
|
|
||||||
|
return faculty_list, faculty_page_url
|
||||||
|
|
||||||
|
|
||||||
|
async def scrape_harvard_programs_with_faculty():
|
||||||
|
"""爬取Harvard研究生项目列表及导师信息"""
|
||||||
|
|
||||||
|
all_programs = []
|
||||||
|
base_url = "https://www.harvard.edu/programs/?degree_levels=graduate"
|
||||||
|
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser = await p.chromium.launch(headless=True)
|
||||||
|
context = await browser.new_context(
|
||||||
|
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
|
||||||
|
viewport={'width': 1920, 'height': 1080}
|
||||||
|
)
|
||||||
|
page = await context.new_page()
|
||||||
|
|
||||||
|
print(f"正在访问: {base_url}")
|
||||||
|
await page.goto(base_url, wait_until="domcontentloaded", timeout=60000)
|
||||||
|
await page.wait_for_timeout(5000)
|
||||||
|
|
||||||
|
# scroll to the bottom of the page
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
current_page = 1
|
||||||
|
max_pages = 15
|
||||||
|
|
||||||
|
# Phase 1: collect basic information for every program
print("\n========== Phase 1: collecting the program list ==========")
while current_page <= max_pages:
print(f"\n--- Page {current_page} ---")
await page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
# extract the programs on the current page
page_data = await page.evaluate('''() => {
|
||||||
|
const programs = [];
|
||||||
|
const programItems = document.querySelectorAll('[class*="records__record"], [class*="c-programs-item"]');
|
||||||
|
|
||||||
|
programItems.forEach((item, index) => {
|
||||||
|
const nameBtn = item.querySelector('button[class*="title-link"], button[class*="c-programs-item"]');
|
||||||
|
if (!nameBtn) return;
|
||||||
|
|
||||||
|
const name = nameBtn.innerText.trim();
|
||||||
|
if (!name || name.length < 3) return;
|
||||||
|
|
||||||
|
let degrees = '';
|
||||||
|
const allText = item.innerText;
|
||||||
|
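// match degree abbreviations (A.B., Ph.D., S.M., Ed.M., ...) anywhere in the item text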
const degreeMatch = allText.match(/(A\\.B\\.|Ph\\.D\\.|M\\.A\\.|S\\.M\\.|M\\.Arch\\.|LL\\.M\\.|S\\.B\\.|A\\.L\\.B\\.|A\\.L\\.M\\.|M\\.M\\.Sc\\.|Ed\\.D\\.|Ed\\.M\\.|M\\.P\\.A\\.|M\\.P\\.P\\.|M\\.P\\.H\\.|J\\.D\\.|M\\.B\\.A\\.|M\\.D\\.|D\\.M\\.D\\.|Th\\.D\\.|M\\.Div\\.|M\\.T\\.S\\.|M\\.E\\.|D\\.M\\.Sc\\.|M\\.H\\.C\\.M\\.|M\\.L\\.A\\.|M\\.D\\.E\\.|M\\.R\\.E\\.|M\\.A\\.U\\.D\\.|M\\.R\\.P\\.L\\.)/g);
|
||||||
|
if (degreeMatch) {
|
||||||
|
degrees = degreeMatch.join(', ');
|
||||||
|
}
|
||||||
|
|
||||||
|
programs.push({
|
||||||
|
name: name,
|
||||||
|
degrees: degrees
|
||||||
|
});
|
||||||
|
});
|
||||||
|
|
||||||
|
if (programs.length === 0) {
|
||||||
|
const buttons = document.querySelectorAll('button');
|
||||||
|
buttons.forEach((btn) => {
|
||||||
|
const className = btn.className || '';
|
||||||
|
if (className.includes('c-programs-item') || className.includes('title-link')) {
|
||||||
|
const name = btn.innerText.trim();
|
||||||
|
if (name && name.length > 3 && !name.match(/^(Page|Next|Previous|Search|Menu|Filter)/)) {
|
||||||
|
programs.push({
|
||||||
|
name: name,
|
||||||
|
degrees: ''
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
return programs;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f" 本页找到 {len(page_data)} 个项目")
|
||||||
|
|
||||||
|
for prog in page_data:
|
||||||
|
name = prog['name'].strip()
|
||||||
|
if name and not any(p['name'] == name for p in all_programs):
|
||||||
|
all_programs.append({
|
||||||
|
'name': name,
|
||||||
|
'degrees': prog.get('degrees', ''),
|
||||||
|
'page': current_page
|
||||||
|
})
|
||||||
|
|
||||||
|
# try to click through to the next page
try:
|
||||||
|
next_btn = page.locator('button.c-pagination__link--next')
|
||||||
|
if await next_btn.count() > 0:
|
||||||
|
await next_btn.first.scroll_into_view_if_needed()
|
||||||
|
await next_btn.first.click()
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
current_page += 1
|
||||||
|
else:
|
||||||
|
print("没有下一页按钮,结束收集")
|
||||||
|
break
|
||||||
|
except Exception as e:
|
||||||
|
print(f"分页失败: {e}")
|
||||||
|
break
|
||||||
|
|
||||||
|
print(f"\n共收集到 {len(all_programs)} 个项目")
|
||||||
|
|
||||||
|
# Phase 2: fetch faculty information for every program
print("\n========== Phase 2: fetching faculty information ==========")
print("Note: this visits each program's GSAS page and may take quite a while...")
|
||||||
|
for i, prog in enumerate(all_programs, 1):
|
||||||
|
print(f"\n[{i}/{len(all_programs)}] {prog['name']}")
|
||||||
|
|
||||||
|
# build the program URL
slug = name_to_slug(prog['name'])
|
||||||
|
prog['url'] = f"https://www.harvard.edu/programs/{slug}/"
|
||||||
|
|
||||||
|
# build the GSAS URL
gsas_url = f"https://gsas.harvard.edu/program/{slug}"
|
||||||
|
|
||||||
|
# fetch faculty information
faculty_list, faculty_page_url = await get_faculty_from_gsas_page(page, gsas_url, prog['name'])
|
||||||
|
|
||||||
|
prog['faculty_page_url'] = faculty_page_url or ""
|
||||||
|
prog['faculty'] = faculty_list
|
||||||
|
prog['faculty_count'] = len(faculty_list)
|
||||||
|
|
||||||
|
# save progress every 10 programs
if i % 10 == 0:
|
||||||
|
temp_result = {
|
||||||
|
'source_url': base_url,
|
||||||
|
'scraped_at': datetime.now(timezone.utc).isoformat(),
|
||||||
|
'progress': f"{i}/{len(all_programs)}",
|
||||||
|
'programs': all_programs[:i]
|
||||||
|
}
|
||||||
|
with open('harvard_programs_progress.json', 'w', encoding='utf-8') as f:
|
||||||
|
json.dump(temp_result, f, ensure_ascii=False, indent=2)
|
||||||
|
print(f" [进度已保存]")
|
||||||
|
|
||||||
|
# avoid sending requests too quickly
await page.wait_for_timeout(1500)
|
||||||
|
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
# sort programs by name
programs = sorted(all_programs, key=lambda x: x['name'])
|
||||||
|
|
||||||
|
# summary statistics
total_faculty = sum(p['faculty_count'] for p in programs)
|
||||||
|
programs_with_faculty = sum(1 for p in programs if p['faculty_count'] > 0)
|
||||||
|
|
||||||
|
# save the final result
result = {
|
||||||
|
'source_url': base_url,
|
||||||
|
'scraped_at': datetime.now(timezone.utc).isoformat(),
|
||||||
|
'total_pages_scraped': current_page,
|
||||||
|
'total_programs': len(programs),
|
||||||
|
'programs_with_faculty': programs_with_faculty,
|
||||||
|
'total_faculty_found': total_faculty,
|
||||||
|
'programs': programs
|
||||||
|
}
|
||||||
|
|
||||||
|
output_file = Path('harvard_programs_with_faculty.json')
|
||||||
|
with open(output_file, 'w', encoding='utf-8') as f:
|
||||||
|
json.dump(result, f, ensure_ascii=False, indent=2)
|
||||||
|
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print(f"爬取完成!")
|
||||||
|
print(f"共爬取 {current_page} 页")
|
||||||
|
print(f"共找到 {len(programs)} 个研究生项目")
|
||||||
|
print(f"其中 {programs_with_faculty} 个项目有导师信息")
|
||||||
|
print(f"共找到 {total_faculty} 位导师")
|
||||||
|
print(f"结果保存到: {output_file}")
|
||||||
|
print(f"{'='*60}")
|
||||||
|
|
||||||
|
# print a summary
print("\nProgram summary (first 30):")
for i, prog in enumerate(programs[:30], 1):
|
||||||
|
faculty_info = f"({prog['faculty_count']}位导师)" if prog['faculty_count'] > 0 else "(无导师信息)"
|
||||||
|
print(f"{i:3}. {prog['name']} {faculty_info}")
|
||||||
|
|
||||||
|
if len(programs) > 30:
|
||||||
|
print(f"... 还有 {len(programs) - 30} 个项目")
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
asyncio.run(scrape_harvard_programs_with_faculty())
|
||||||
910 artifacts/manchester_complete_scraper.py Normal file
@ -0,0 +1,910 @@
"""
|
||||||
|
曼彻斯特大学完整采集脚本
|
||||||
|
新增特性:
|
||||||
|
- Research Explorer API 优先拉取 JSON / XML,失败再回落 DOM
|
||||||
|
- 每个学院独立页面、并行抓取(默认 3 并发)
|
||||||
|
- 细粒度超时/重试/滚动/Load more 控制
|
||||||
|
- 多 URL / 备用 Staff 页面配置
|
||||||
|
- 导师目录缓存,可按学院关键词映射到项目
|
||||||
|
- 诊断信息记录(失败学院、超时学院、批次信息)
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
from copy import deepcopy
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from typing import Any, Dict, List, Optional, Tuple
|
||||||
|
from urllib.parse import urlencode, urljoin
|
||||||
|
from xml.etree import ElementTree as ET
|
||||||
|
|
||||||
|
from playwright.async_api import (
|
||||||
|
TimeoutError as PlaywrightTimeoutError,
|
||||||
|
async_playwright,
|
||||||
|
)
|
||||||
|
|
||||||
|
# =========================
|
||||||
|
# Configuration
# =========================
|
||||||
|
|
||||||
|
DEFAULT_REQUEST = {
|
||||||
|
"timeout_ms": 60000,
|
||||||
|
"post_wait_ms": 2500,
|
||||||
|
"wait_until": "domcontentloaded",
|
||||||
|
"max_retries": 3,
|
||||||
|
"retry_backoff_ms": 2000,
|
||||||
|
}
|
||||||
|
|
||||||
|
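# maximum number of school staff pages scraped in parallel (used as the asyncio.Semaphore limit in scrape_all_school_staff)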
STAFF_CONCURRENCY = 3
|
||||||
|
|
||||||
|
SCHOOL_CONFIG: List[Dict[str, Any]] = [
|
||||||
|
{
|
||||||
|
"name": "Alliance Manchester Business School",
|
||||||
|
"keywords": [
|
||||||
|
"accounting",
|
||||||
|
"finance",
|
||||||
|
"business",
|
||||||
|
"management",
|
||||||
|
"marketing",
|
||||||
|
"mba",
|
||||||
|
"economics",
|
||||||
|
"entrepreneurship",
|
||||||
|
],
|
||||||
|
"attach_faculty_to_programs": True,
|
||||||
|
"staff_pages": [
|
||||||
|
{
|
||||||
|
"url": "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/",
|
||||||
|
"extract_method": "table",
|
||||||
|
"request": {"timeout_ms": 60000, "wait_until": "networkidle"},
|
||||||
|
}
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "Department of Computer Science",
|
||||||
|
"keywords": [
|
||||||
|
"computer",
|
||||||
|
"software",
|
||||||
|
"data science",
|
||||||
|
"artificial intelligence",
|
||||||
|
"ai ",
|
||||||
|
"machine learning",
|
||||||
|
"cyber",
|
||||||
|
"computing",
|
||||||
|
],
|
||||||
|
"attach_faculty_to_programs": True,
|
||||||
|
"staff_pages": [
|
||||||
|
{
|
||||||
|
"url": "https://www.cs.manchester.ac.uk/about/people/academic-and-research-staff/",
|
||||||
|
"extract_method": "links",
|
||||||
|
"requires_scroll": True,
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"url": "https://www.cs.manchester.ac.uk/about/people/",
|
||||||
|
"extract_method": "links",
|
||||||
|
"load_more_selector": "button.load-more",
|
||||||
|
"max_load_more": 6,
|
||||||
|
},
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "Department of Physics and Astronomy",
|
||||||
|
"keywords": [
|
||||||
|
"physics",
|
||||||
|
"astronomy",
|
||||||
|
"astrophysics",
|
||||||
|
"nuclear",
|
||||||
|
"particle",
|
||||||
|
],
|
||||||
|
"attach_faculty_to_programs": True,
|
||||||
|
"staff_pages": [
|
||||||
|
{
|
||||||
|
"url": "https://www.physics.manchester.ac.uk/about/people/academic-and-research-staff/",
|
||||||
|
"extract_method": "links",
|
||||||
|
"requires_scroll": True,
|
||||||
|
}
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "Department of Electrical and Electronic Engineering",
|
||||||
|
"keywords": [
|
||||||
|
"electrical",
|
||||||
|
"electronic",
|
||||||
|
"eee",
|
||||||
|
"power systems",
|
||||||
|
"microelectronics",
|
||||||
|
],
|
||||||
|
"attach_faculty_to_programs": True,
|
||||||
|
"staff_pages": [
|
||||||
|
{
|
||||||
|
"url": "https://www.eee.manchester.ac.uk/about/people/academic-and-research-staff/",
|
||||||
|
"extract_method": "links",
|
||||||
|
"requires_scroll": True,
|
||||||
|
}
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "Department of Chemistry",
|
||||||
|
"keywords": ["chemistry", "chemical"],
|
||||||
|
"attach_faculty_to_programs": True,
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"research_explorer": {"page_size": 200},
|
||||||
|
"staff_pages": [
|
||||||
|
{
|
||||||
|
"url": "https://research.manchester.ac.uk/en/organisations/department-of-chemistry/persons/",
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"requires_scroll": True,
|
||||||
|
"request": {
|
||||||
|
"timeout_ms": 120000,
|
||||||
|
"wait_until": "networkidle",
|
||||||
|
"post_wait_ms": 5000,
|
||||||
|
},
|
||||||
|
}
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "Department of Mathematics",
|
||||||
|
"keywords": [
|
||||||
|
"mathematics",
|
||||||
|
"mathematical",
|
||||||
|
"applied math",
|
||||||
|
"statistics",
|
||||||
|
"actuarial",
|
||||||
|
],
|
||||||
|
"attach_faculty_to_programs": True,
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"research_explorer": {"page_size": 200},
|
||||||
|
"staff_pages": [
|
||||||
|
{
|
||||||
|
"url": "https://research.manchester.ac.uk/en/organisations/department-of-mathematics/persons/",
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"requires_scroll": True,
|
||||||
|
}
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "School of Engineering",
|
||||||
|
"keywords": [
|
||||||
|
"engineering",
|
||||||
|
"mechanical",
|
||||||
|
"aerospace",
|
||||||
|
"civil",
|
||||||
|
"structural",
|
||||||
|
"materials",
|
||||||
|
],
|
||||||
|
"attach_faculty_to_programs": True,
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"research_explorer": {"page_size": 400},
|
||||||
|
"staff_pages": [
|
||||||
|
{
|
||||||
|
"url": "https://research.manchester.ac.uk/en/organisations/school-of-engineering/persons/",
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"requires_scroll": True,
|
||||||
|
}
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "Faculty of Biology, Medicine and Health",
|
||||||
|
"keywords": [
|
||||||
|
"medicine",
|
||||||
|
"medical",
|
||||||
|
"health",
|
||||||
|
"nursing",
|
||||||
|
"pharmacy",
|
||||||
|
"clinical",
|
||||||
|
"dental",
|
||||||
|
"optometry",
|
||||||
|
"biology",
|
||||||
|
"biomedical",
|
||||||
|
"anatomical",
|
||||||
|
"physiotherapy",
|
||||||
|
"midwifery",
|
||||||
|
"mental health",
|
||||||
|
"psychology",
|
||||||
|
],
|
||||||
|
"attach_faculty_to_programs": True,
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"research_explorer": {"page_size": 400},
|
||||||
|
"staff_pages": [
|
||||||
|
{
|
||||||
|
"url": "https://research.manchester.ac.uk/en/organisations/faculty-of-biology-medicine-and-health/persons/",
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"requires_scroll": True,
|
||||||
|
}
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "School of Social Sciences",
|
||||||
|
"keywords": [
|
||||||
|
"sociology",
|
||||||
|
"politics",
|
||||||
|
"international",
|
||||||
|
"social",
|
||||||
|
"criminology",
|
||||||
|
"anthropology",
|
||||||
|
"philosophy",
|
||||||
|
],
|
||||||
|
"attach_faculty_to_programs": True,
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"research_explorer": {"page_size": 200},
|
||||||
|
"staff_pages": [
|
||||||
|
{
|
||||||
|
"url": "https://research.manchester.ac.uk/en/organisations/school-of-social-sciences/persons/",
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"requires_scroll": True,
|
||||||
|
}
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "School of Law",
|
||||||
|
"keywords": ["law", "legal", "llm"],
|
||||||
|
"attach_faculty_to_programs": True,
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"research_explorer": {"page_size": 200},
|
||||||
|
"staff_pages": [
|
||||||
|
{
|
||||||
|
"url": "https://research.manchester.ac.uk/en/organisations/school-of-law/persons/",
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"requires_scroll": True,
|
||||||
|
}
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "School of Arts, Languages and Cultures",
|
||||||
|
"keywords": [
|
||||||
|
"arts",
|
||||||
|
"languages",
|
||||||
|
"culture",
|
||||||
|
"music",
|
||||||
|
"drama",
|
||||||
|
"theatre",
|
||||||
|
"history",
|
||||||
|
"linguistics",
|
||||||
|
"literature",
|
||||||
|
"translation",
|
||||||
|
"classics",
|
||||||
|
"archaeology",
|
||||||
|
"religion",
|
||||||
|
],
|
||||||
|
"attach_faculty_to_programs": True,
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"research_explorer": {"page_size": 400},
|
||||||
|
"staff_pages": [
|
||||||
|
{
|
||||||
|
"url": "https://research.manchester.ac.uk/en/organisations/school-of-arts-languages-and-cultures/persons/",
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"requires_scroll": True,
|
||||||
|
}
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "School of Environment, Education and Development",
|
||||||
|
"keywords": [
|
||||||
|
"environment",
|
||||||
|
"education",
|
||||||
|
"development",
|
||||||
|
"planning",
|
||||||
|
"architecture",
|
||||||
|
"urban",
|
||||||
|
"geography",
|
||||||
|
"sustainability",
|
||||||
|
],
|
||||||
|
"attach_faculty_to_programs": True,
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"research_explorer": {"page_size": 300},
|
||||||
|
"staff_pages": [
|
||||||
|
{
|
||||||
|
"url": "https://research.manchester.ac.uk/en/organisations/school-of-environment-education-and-development/persons/",
|
||||||
|
"extract_method": "research_explorer",
|
||||||
|
"requires_scroll": True,
|
||||||
|
}
|
||||||
|
],
|
||||||
|
},
|
||||||
|
]
|
||||||
|
|
||||||
|
SCHOOL_LOOKUP = {cfg["name"]: cfg for cfg in SCHOOL_CONFIG}
|
||||||
|
|
||||||
|
# =========================
|
||||||
|
# JS extraction functions
# =========================
|
||||||
|
|
||||||
|
JS_EXTRACT_TABLE_STAFF = """() => {
|
||||||
|
const staff = [];
|
||||||
|
const seen = new Set();
|
||||||
|
|
||||||
|
document.querySelectorAll('table tr').forEach(row => {
|
||||||
|
const cells = row.querySelectorAll('td');
|
||||||
|
if (cells.length >= 2) {
|
||||||
|
const link = cells[1]?.querySelector('a[href]') || cells[0]?.querySelector('a[href]');
|
||||||
|
const titleCell = cells[2] || cells[1];
|
||||||
|
|
||||||
|
if (link) {
|
||||||
|
const name = link.innerText.trim();
|
||||||
|
const url = link.href;
|
||||||
|
const title = titleCell ? titleCell.innerText.trim() : '';
|
||||||
|
|
||||||
|
if (name.length > 2 && !name.toLowerCase().includes('skip') && !seen.has(url)) {
|
||||||
|
seen.add(url);
|
||||||
|
staff.push({
|
||||||
|
name,
|
||||||
|
url,
|
||||||
|
title
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return staff;
|
||||||
|
}"""
|
||||||
|
|
||||||
|
JS_EXTRACT_LINK_STAFF = """() => {
|
||||||
|
const staff = [];
|
||||||
|
const seen = new Set();
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href;
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
|
||||||
|
if (seen.has(href)) return;
|
||||||
|
if (text.length < 5 || text.length > 80) return;
|
||||||
|
|
||||||
|
const lowerText = text.toLowerCase();
|
||||||
|
if (lowerText.includes('skip') ||
|
||||||
|
lowerText.includes('staff') ||
|
||||||
|
lowerText.includes('people') ||
|
||||||
|
lowerText.includes('academic') ||
|
||||||
|
lowerText.includes('research profiles')) return;
|
||||||
|
|
||||||
|
if (href.includes('/persons/') ||
|
||||||
|
href.includes('/portal/en/researchers/') ||
|
||||||
|
href.includes('/profile/') ||
|
||||||
|
href.includes('/people/')) {
|
||||||
|
seen.add(href);
|
||||||
|
staff.push({
|
||||||
|
name: text,
|
||||||
|
url: href,
|
||||||
|
title: ''
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return staff;
|
||||||
|
}"""
|
||||||
|
|
||||||
|
JS_EXTRACT_RESEARCH_EXPLORER = """() => {
|
||||||
|
const staff = [];
|
||||||
|
const seen = new Set();
|
||||||
|
|
||||||
|
document.querySelectorAll('a.link.person').forEach(a => {
|
||||||
|
const href = a.href;
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
|
||||||
|
if (!seen.has(href) && text.length > 3 && text.length < 80) {
|
||||||
|
seen.add(href);
|
||||||
|
staff.push({
|
||||||
|
name: text,
|
||||||
|
url: href,
|
||||||
|
title: ''
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
if (staff.length === 0) {
|
||||||
|
document.querySelectorAll('a[href*="/persons/"]').forEach(a => {
|
||||||
|
const href = a.href;
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
const lower = text.toLowerCase();
|
||||||
|
|
||||||
|
if (seen.has(href)) return;
|
||||||
|
if (text.length < 3 || text.length > 80) return;
|
||||||
|
if (lower.includes('person') || lower.includes('next') || lower.includes('previous')) return;
|
||||||
|
|
||||||
|
seen.add(href);
|
||||||
|
staff.push({
|
||||||
|
name: text,
|
||||||
|
url: href,
|
||||||
|
title: ''
|
||||||
|
});
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
return staff;
|
||||||
|
}"""
|
||||||
|
|
||||||
|
JS_EXTRACT_PROGRAMS = """() => {
|
||||||
|
const programs = [];
|
||||||
|
const seen = new Set();
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href;
|
||||||
|
const text = a.innerText.trim().replace(/\\s+/g, ' ');
|
||||||
|
|
||||||
|
if (!href || seen.has(href)) return;
|
||||||
|
if (text.length < 10 || text.length > 200) return;
|
||||||
|
|
||||||
|
const hrefLower = href.toLowerCase();
|
||||||
|
const textLower = text.toLowerCase();
|
||||||
|
|
||||||
|
const isNav = textLower === 'courses' ||
|
||||||
|
textLower === 'masters' ||
|
||||||
|
textLower.includes('admission') ||
|
||||||
|
textLower.includes('fees') ||
|
||||||
|
textLower.includes('skip to') ||
|
||||||
|
textLower.includes('search') ||
|
||||||
|
textLower.includes('contact') ||
|
||||||
|
hrefLower.includes('#');
|
||||||
|
if (isNav) return;
|
||||||
|
|
||||||
|
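// Manchester course detail pages carry a 5-digit numeric id in the path, e.g. /courses/list/12345/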
const hasNumericId = /\\/\\d{5}\\//.test(href);
|
||||||
|
const isCoursePage = hrefLower.includes('/courses/list/') && hasNumericId;
|
||||||
|
|
||||||
|
if (isCoursePage) {
|
||||||
|
seen.add(href);
|
||||||
|
programs.push({
|
||||||
|
name: text,
|
||||||
|
url: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return programs;
|
||||||
|
}"""
|
||||||
|
|
||||||
|
|
||||||
|
# =========================
|
||||||
|
# Program-to-school matching
# =========================
|
||||||
|
|
||||||
|
def match_program_to_school(program_name: str) -> str:
|
||||||
|
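# e.g. "MSc Accounting and Finance" -> "Alliance Manchester Business School" (first keyword hit wins);
# returns "Other Programs" when no keyword matches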
lower = program_name.lower()
|
||||||
|
for school in SCHOOL_CONFIG:
|
||||||
|
for keyword in school["keywords"]:
|
||||||
|
if keyword in lower:
|
||||||
|
return school["name"]
|
||||||
|
return "Other Programs"
|
||||||
|
|
||||||
|
|
||||||
|
# =========================
|
||||||
|
# Request and parsing helpers
# =========================
|
||||||
|
|
||||||
|
def _merge_request_settings(*layers: Optional[Dict[str, Any]]) -> Dict[str, Any]:
|
||||||
|
settings = dict(DEFAULT_REQUEST)
|
||||||
|
for layer in layers:
|
||||||
|
if not layer:
|
||||||
|
continue
|
||||||
|
for key, value in layer.items():
|
||||||
|
if value is not None:
|
||||||
|
settings[key] = value
|
||||||
|
settings["max_retries"] = max(1, int(settings.get("max_retries", 1)))
|
||||||
|
settings["retry_backoff_ms"] = settings.get("retry_backoff_ms", 2000)
|
||||||
|
return settings
|
||||||
|
|
||||||
|
|
||||||
|
async def _goto_with_retry(page, url: str, settings: Dict[str, Any], label: str) -> Tuple[bool, Optional[str]]:
|
||||||
|
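# linear backoff between attempts: with the default settings this waits 2s, then 4s, before giving up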
last_error = None
|
||||||
|
for attempt in range(settings["max_retries"]):
|
||||||
|
try:
|
||||||
|
await page.goto(url, wait_until=settings["wait_until"], timeout=settings["timeout_ms"])
|
||||||
|
if settings.get("wait_for_selector"):
|
||||||
|
await page.wait_for_selector(settings["wait_for_selector"], timeout=settings["timeout_ms"])
|
||||||
|
if settings.get("post_wait_ms"):
|
||||||
|
await page.wait_for_timeout(settings["post_wait_ms"])
|
||||||
|
return True, None
|
||||||
|
except PlaywrightTimeoutError as exc:
|
||||||
|
last_error = f"Timeout: {exc}"
|
||||||
|
except Exception as exc: # noqa: BLE001
|
||||||
|
last_error = str(exc)
|
||||||
|
|
||||||
|
if attempt < settings["max_retries"] - 1:
|
||||||
|
await page.wait_for_timeout(settings["retry_backoff_ms"] * (attempt + 1))
|
||||||
|
|
||||||
|
return False, last_error
|
||||||
|
|
||||||
|
|
||||||
|
async def _perform_scroll(page, repetitions: int = 5, delay_ms: int = 800):
|
||||||
|
repetitions = max(1, repetitions)
|
||||||
|
for i in range(repetitions):
|
||||||
|
await page.evaluate("(y) => window.scrollTo(0, y)", 2000 * (i + 1))
|
||||||
|
await page.wait_for_timeout(delay_ms)
|
||||||
|
|
||||||
|
|
||||||
|
async def _load_more(page, selector: str, max_clicks: int = 5, wait_ms: int = 1500):
|
||||||
|
for _ in range(max_clicks):
|
||||||
|
button = await page.query_selector(selector)
|
||||||
|
if not button:
|
||||||
|
break
|
||||||
|
try:
|
||||||
|
await button.click()
|
||||||
|
await page.wait_for_timeout(wait_ms)
|
||||||
|
except Exception:
|
||||||
|
break
|
||||||
|
|
||||||
|
|
||||||
|
def _deduplicate_staff(staff: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
|
||||||
|
seen = set()
|
||||||
|
cleaned = []
|
||||||
|
for item in staff:
|
||||||
|
name = (item.get("name") or "").strip()
|
||||||
|
if not name:
|
||||||
|
continue
|
||||||
|
url = (item.get("url") or "").strip()
|
||||||
|
key = url or name.lower()
|
||||||
|
if key in seen:
|
||||||
|
continue
|
||||||
|
seen.add(key)
|
||||||
|
cleaned.append({"name": name, "url": url, "title": (item.get("title") or "").strip()})
|
||||||
|
return cleaned
|
||||||
|
|
||||||
|
|
||||||
|
def _append_query(url: str, params: Dict[str, Any]) -> str:
|
||||||
|
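# appends query parameters with "?" or "&" depending on whether the URL already has a query string,
# e.g. ".../persons/" + {"format": "json"} -> ".../persons/?format=json"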
delimiter = "&" if "?" in url else "?"
|
||||||
|
return f"{url}{delimiter}{urlencode(params)}"
|
||||||
|
|
||||||
|
|
||||||
|
def _guess_research_slug(staff_url: Optional[str]) -> Optional[str]:
|
||||||
|
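# returns the last path segment of the staff page URL, used as the organisation slug for the portal API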
if not staff_url:
|
||||||
|
return None
|
||||||
|
path = staff_url.rstrip("/").split("/")
|
||||||
|
return path[-1] if path else None
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_research_explorer_json(data: Any, base_url: str) -> List[Dict[str, str]]:
|
||||||
|
items: List[Dict[str, Any]] = []
|
||||||
|
if isinstance(data, list):
|
||||||
|
items = data
|
||||||
|
elif isinstance(data, dict):
|
||||||
|
for key in ("results", "items", "persons", "data", "entities"):
|
||||||
|
if isinstance(data.get(key), list):
|
||||||
|
items = data[key]
|
||||||
|
break
|
||||||
|
if not items and isinstance(data.get("rows"), list):
|
||||||
|
items = data["rows"]
|
||||||
|
|
||||||
|
staff = []
|
||||||
|
for item in items:
|
||||||
|
if not isinstance(item, dict):
|
||||||
|
continue
|
||||||
|
name = item.get("name") or item.get("title") or item.get("fullName")
|
||||||
|
profile_url = item.get("url") or item.get("href") or item.get("link") or item.get("primaryURL")
|
||||||
|
if not name:
|
||||||
|
continue
|
||||||
|
if profile_url:
|
||||||
|
profile_url = urljoin(base_url, profile_url)
|
||||||
|
staff.append(
|
||||||
|
{
|
||||||
|
"name": name.strip(),
|
||||||
|
"url": (profile_url or "").strip(),
|
||||||
|
"title": (item.get("jobTitle") or item.get("position") or "").strip(),
|
||||||
|
}
|
||||||
|
)
|
||||||
|
return staff
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_research_explorer_xml(text: str, base_url: str) -> List[Dict[str, str]]:
|
||||||
|
staff: List[Dict[str, str]] = []
|
||||||
|
try:
|
||||||
|
root = ET.fromstring(text)
|
||||||
|
except ET.ParseError:
|
||||||
|
return staff
|
||||||
|
|
||||||
|
for entry in root.findall(".//{http://www.w3.org/2005/Atom}entry"):
|
||||||
|
title = entry.findtext("{http://www.w3.org/2005/Atom}title", default="")
|
||||||
|
link = entry.find("{http://www.w3.org/2005/Atom}link")
|
||||||
|
href = link.attrib.get("href") if link is not None else ""
|
||||||
|
if title:
|
||||||
|
staff.append(
|
||||||
|
{
|
||||||
|
"name": title.strip(),
|
||||||
|
"url": urljoin(base_url, href) if href else "",
|
||||||
|
"title": "",
|
||||||
|
}
|
||||||
|
)
|
||||||
|
return staff
|
||||||
|
|
||||||
|
|
||||||
|
async def fetch_research_explorer_api(context, school_config: Dict[str, Any], output_callback) -> List[Dict[str, str]]:
|
||||||
|
config = school_config.get("research_explorer") or {}
|
||||||
|
if not config and school_config.get("extract_method") != "research_explorer":
|
||||||
|
return []
|
||||||
|
|
||||||
|
base_staff_url = ""
|
||||||
|
if school_config.get("staff_pages"):
|
||||||
|
base_staff_url = school_config["staff_pages"][0].get("url", "")
|
||||||
|
|
||||||
|
page_size = config.get("page_size", 200)
|
||||||
|
timeout_ms = config.get("timeout_ms", 70000)
|
||||||
|
|
||||||
|
candidates: List[str] = []
|
||||||
|
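# candidate endpoints are tried in order: an explicit api_url, the portal API keyed by the
# organisation slug, then the staff page itself with ?format=json / ?format=xml appended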
slug = config.get("org_slug") or _guess_research_slug(base_staff_url)
|
||||||
|
base_api = config.get("api_base", "https://research.manchester.ac.uk/ws/portalapi.aspx")
|
||||||
|
|
||||||
|
if config.get("api_url"):
|
||||||
|
candidates.append(config["api_url"])
|
||||||
|
|
||||||
|
if slug:
|
||||||
|
params = {
|
||||||
|
"action": "search",
|
||||||
|
"language": "en",
|
||||||
|
"format": "json",
|
||||||
|
"site": "default",
|
||||||
|
"showall": "true",
|
||||||
|
"pageSize": page_size,
|
||||||
|
"organisations": slug,
|
||||||
|
}
|
||||||
|
candidates.append(f"{base_api}?{urlencode(params)}")
|
||||||
|
|
||||||
|
if base_staff_url:
|
||||||
|
candidates.append(_append_query(base_staff_url, {"format": "json", "limit": page_size}))
|
||||||
|
candidates.append(_append_query(base_staff_url, {"format": "xml", "limit": page_size}))
|
||||||
|
|
||||||
|
for url in candidates:
|
||||||
|
try:
|
||||||
|
resp = await context.request.get(url, timeout=timeout_ms)
|
||||||
|
if resp.status != 200:
|
||||||
|
continue
|
||||||
|
ctype = resp.headers.get("content-type", "")
|
||||||
|
if "json" in ctype:
|
||||||
|
data = await resp.json()
|
||||||
|
parsed = _parse_research_explorer_json(data, base_staff_url)
|
||||||
|
else:
|
||||||
|
text = await resp.text()
|
||||||
|
parsed = _parse_research_explorer_xml(text, base_staff_url)
|
||||||
|
parsed = _deduplicate_staff(parsed)
|
||||||
|
if parsed:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f" {school_config['name']}: {len(parsed)} staff via API")
|
||||||
|
return parsed
|
||||||
|
except Exception as exc: # noqa: BLE001
|
||||||
|
if output_callback:
|
||||||
|
output_callback(
|
||||||
|
"warning", f" {school_config['name']}: API fetch failed ({str(exc)[:60]})"
|
||||||
|
)
|
||||||
|
return []
|
||||||
|
|
||||||
|
|
||||||
|
async def scrape_staff_via_browser(context, school_config: Dict[str, Any], output_callback) -> List[Dict[str, str]]:
|
||||||
|
staff_collected: List[Dict[str, str]] = []
|
||||||
|
staff_pages = school_config.get("staff_pages") or []
|
||||||
|
if not staff_pages and school_config.get("staff_url"):
|
||||||
|
staff_pages = [{"url": school_config["staff_url"], "extract_method": school_config.get("extract_method")}]
|
||||||
|
|
||||||
|
page = await context.new_page()
|
||||||
|
blocked_types = school_config.get("blocked_resources", ["image", "font", "media"])
|
||||||
|
if blocked_types:
|
||||||
|
async def _route_handler(route):
|
||||||
|
if route.request.resource_type in blocked_types:
|
||||||
|
await route.abort()
|
||||||
|
else:
|
||||||
|
await route.continue_()
|
||||||
|
|
||||||
|
await page.route("**/*", _route_handler)
|
||||||
|
|
||||||
|
for page_cfg in staff_pages:
|
||||||
|
target_url = page_cfg.get("url")
|
||||||
|
if not target_url:
|
||||||
|
continue
|
||||||
|
|
||||||
|
settings = _merge_request_settings(school_config.get("request"), page_cfg.get("request"))
|
||||||
|
success, error = await _goto_with_retry(page, target_url, settings, school_config["name"])
|
||||||
|
if not success:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("warning", f" {school_config['name']}: failed to load {target_url} ({error})")
|
||||||
|
continue
|
||||||
|
|
||||||
|
if page_cfg.get("requires_scroll"):
|
||||||
|
await _perform_scroll(page, page_cfg.get("scroll_times", 6), page_cfg.get("scroll_delay_ms", 700))
|
||||||
|
|
||||||
|
if page_cfg.get("load_from_selector"):
|
||||||
|
await _load_more(page, page_cfg["load_from_selector"], page_cfg.get("max_load_more", 5))
|
||||||
|
elif page_cfg.get("load_more_selector"):
|
||||||
|
await _load_more(page, page_cfg["load_more_selector"], page_cfg.get("max_load_more", 5))
|
||||||
|
|
||||||
|
method = page_cfg.get("extract_method") or school_config.get("extract_method") or "links"
|
||||||
|
if method == "table":
|
||||||
|
extracted = await page.evaluate(JS_EXTRACT_TABLE_STAFF)
|
||||||
|
elif method == "research_explorer":
|
||||||
|
extracted = await page.evaluate(JS_EXTRACT_RESEARCH_EXPLORER)
|
||||||
|
else:
|
||||||
|
extracted = await page.evaluate(JS_EXTRACT_LINK_STAFF)
|
||||||
|
|
||||||
|
staff_collected.extend(extracted)
|
||||||
|
|
||||||
|
await page.close()
|
||||||
|
return _deduplicate_staff(staff_collected)
|
||||||
|
|
||||||
|
|
||||||
|
# =========================
|
||||||
|
# Concurrent scraping of school staff pages
# =========================
|
||||||
|
|
||||||
|
async def scrape_school_staff(context, school_config: Dict[str, Any], semaphore, output_callback):
|
||||||
|
async with semaphore:
|
||||||
|
staff_list: List[Dict[str, str]] = []
|
||||||
|
status = "success"
|
||||||
|
error: Optional[str] = None
|
||||||
|
|
||||||
|
try:
|
||||||
|
if school_config.get("extract_method") == "research_explorer":
|
||||||
|
staff_list = await fetch_research_explorer_api(context, school_config, output_callback)
|
||||||
|
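# fall back to DOM scraping whenever the API path returned nothing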
if not staff_list:
|
||||||
|
staff_list = await scrape_staff_via_browser(context, school_config, output_callback)
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f" {school_config['name']}: total {len(staff_list)} staff")
|
||||||
|
|
||||||
|
except Exception as exc: # noqa: BLE001
|
||||||
|
status = "error"
|
||||||
|
error = str(exc)
|
||||||
|
if output_callback:
|
||||||
|
output_callback("error", f" {school_config['name']}: {error}")
|
||||||
|
|
||||||
|
return {
|
||||||
|
"name": school_config["name"],
|
||||||
|
"staff": staff_list,
|
||||||
|
"status": status,
|
||||||
|
"error": error,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
async def scrape_all_school_staff(context, output_callback):
|
||||||
|
semaphore = asyncio.Semaphore(STAFF_CONCURRENCY)
|
||||||
|
tasks = [
|
||||||
|
asyncio.create_task(scrape_school_staff(context, cfg, semaphore, output_callback))
|
||||||
|
for cfg in SCHOOL_CONFIG
|
||||||
|
]
|
||||||
|
results = await asyncio.gather(*tasks)
|
||||||
|
|
||||||
|
staff_map = {}
|
||||||
|
diagnostics = {"failed": [], "success": [], "total": len(results)}
|
||||||
|
for res in results:
|
||||||
|
if res["staff"]:
|
||||||
|
staff_map[res["name"]] = res["staff"]
|
||||||
|
diagnostics["success"].append(res["name"])
|
||||||
|
else:
|
||||||
|
diagnostics["failed"].append(
|
||||||
|
{
|
||||||
|
"name": res["name"],
|
||||||
|
"status": res["status"],
|
||||||
|
"error": res.get("error"),
|
||||||
|
}
|
||||||
|
)
|
||||||
|
return staff_map, diagnostics
|
||||||
|
|
||||||
|
|
||||||
|
# =========================
|
||||||
|
# Main flow
# =========================
|
||||||
|
|
||||||
|
async def scrape(output_callback=None):
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser = await p.chromium.launch(headless=True)
|
||||||
|
context = await browser.new_context(
|
||||||
|
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
|
||||||
|
)
|
||||||
|
|
||||||
|
base_url = "https://www.manchester.ac.uk/"
|
||||||
|
result = {
|
||||||
|
"name": "The University of Manchester",
|
||||||
|
"url": base_url,
|
||||||
|
"scraped_at": datetime.now(timezone.utc).isoformat(),
|
||||||
|
"schools": [],
|
||||||
|
"diagnostics": {},
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Step 1: masters course list
if output_callback:
|
||||||
|
output_callback("info", "Step 1: Scraping masters programs list...")
|
||||||
|
|
||||||
|
page = await context.new_page()
|
||||||
|
courses_url = "https://www.manchester.ac.uk/study/masters/courses/list/"
|
||||||
|
await page.goto(courses_url, wait_until="domcontentloaded", timeout=40000)
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
programs_data = await page.evaluate(JS_EXTRACT_PROGRAMS)
|
||||||
|
await page.close()
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Found {len(programs_data)} masters programs")
|
||||||
|
|
||||||
|
# Step 2: scrape school staff pages concurrently
if output_callback:
|
||||||
|
output_callback("info", "Step 2: Scraping faculty from staff pages (parallel)...")
|
||||||
|
school_staff, diagnostics = await scrape_all_school_staff(context, output_callback)
|
||||||
|
|
||||||
|
# Step 3: organise the data
schools_dict: Dict[str, Dict[str, Any]] = {}
|
||||||
|
for prog in programs_data:
|
||||||
|
school_name = match_program_to_school(prog["name"])
|
||||||
|
if school_name not in schools_dict:
|
||||||
|
schools_dict[school_name] = {
|
||||||
|
"name": school_name,
|
||||||
|
"url": "",
|
||||||
|
"programs": [],
|
||||||
|
"faculty": school_staff.get(school_name, []),
|
||||||
|
"faculty_source": "school_directory" if school_staff.get(school_name) else "",
|
||||||
|
}
|
||||||
|
|
||||||
|
schools_dict[school_name]["programs"].append(
|
||||||
|
{
|
||||||
|
"name": prog["name"],
|
||||||
|
"url": prog["url"],
|
||||||
|
"faculty": [],
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
for cfg in SCHOOL_CONFIG:
|
||||||
|
if cfg["name"] in schools_dict:
|
||||||
|
first_page = (cfg.get("staff_pages") or [{}])[0]
|
||||||
|
schools_dict[cfg["name"]]["url"] = first_page.get("url") or cfg.get("staff_url", "")
|
||||||
|
|
||||||
|
_attach_faculty_to_programs(schools_dict, school_staff)
|
||||||
|
|
||||||
|
result["schools"] = list(schools_dict.values())
|
||||||
|
|
||||||
|
total_programs = sum(len(s["programs"]) for s in result["schools"])
|
||||||
|
total_faculty = sum(len(s.get("faculty", [])) for s in result["schools"])
|
||||||
|
|
||||||
|
result["diagnostics"] = {
|
||||||
|
"total_programs": total_programs,
|
||||||
|
"total_faculty_records": total_faculty,
|
||||||
|
"school_staff_success": diagnostics.get("success", []),
|
||||||
|
"school_staff_failed": diagnostics.get("failed", []),
|
||||||
|
}
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback(
|
||||||
|
"info",
|
||||||
|
f"Done! {len(result['schools'])} schools, {total_programs} programs, {total_faculty} faculty",
|
||||||
|
)
|
||||||
|
|
||||||
|
except Exception as exc: # noqa: BLE001
|
||||||
|
if output_callback:
|
||||||
|
output_callback("error", f"Scraping error: {str(exc)}")
|
||||||
|
finally:
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
def _attach_faculty_to_programs(schools_dict: Dict[str, Dict[str, Any]], staff_map: Dict[str, List[Dict[str, str]]]):
|
||||||
|
for school_name, school_data in schools_dict.items():
|
||||||
|
staff = staff_map.get(school_name, [])
|
||||||
|
cfg = SCHOOL_LOOKUP.get(school_name, {})
|
||||||
|
if not staff or not cfg.get("attach_faculty_to_programs"):
|
||||||
|
continue
|
||||||
|
|
||||||
|
limit = cfg.get("faculty_per_program")
|
||||||
|
for program in school_data["programs"]:
|
||||||
|
sliced = deepcopy(staff[:limit] if limit else staff)
|
||||||
|
program["faculty"] = sliced
|
||||||
|
|
||||||
|
|
||||||
|
# =========================
|
||||||
|
# CLI
|
||||||
|
# =========================
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
import sys
|
||||||
|
|
||||||
|
if sys.platform == "win32":
|
||||||
|
asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
|
||||||
|
|
||||||
|
def print_callback(level, msg):
|
||||||
|
print(f"[{level}] {msg}")
|
||||||
|
|
||||||
|
scrape_result = asyncio.run(scrape(output_callback=print_callback))
|
||||||
|
|
||||||
|
output_path = "output/manchester_complete_result.json"
|
||||||
|
with open(output_path, "w", encoding="utf-8") as f:
|
||||||
|
json.dump(scrape_result, f, ensure_ascii=False, indent=2)
|
||||||
|
|
||||||
|
print("\nResult saved to", output_path)
|
||||||
|
print("\n=== Summary ===")
|
||||||
|
for school in sorted(scrape_result["schools"], key=lambda s: -len(s.get("faculty", []))):
|
||||||
|
print(
|
||||||
|
f" {school['name']}: "
|
||||||
|
f"{len(school['programs'])} programs, "
|
||||||
|
f"{len(school.get('faculty', []))} faculty"
|
||||||
|
)
|
||||||
|
|
||||||
229 artifacts/manchester_improved_scraper.py Normal file
@ -0,0 +1,229 @@
"""
|
||||||
|
曼彻斯特大学专用爬虫脚本
|
||||||
|
改进版 - 从学院Staff页面提取导师信息
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from urllib.parse import urljoin, urlparse
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
|
||||||
|
|
||||||
|
# Mapping of University of Manchester school staff pages
# program keyword -> school staff page URL
SCHOOL_STAFF_MAPPING = {
|
||||||
|
# Alliance Manchester Business School (AMBS)
|
||||||
|
"accounting": "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/",
|
||||||
|
"finance": "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/",
|
||||||
|
"business": "https://www.alliancembs.manchester.ac.uk/about/our-people/",
|
||||||
|
"management": "https://www.alliancembs.manchester.ac.uk/about/our-people/",
|
||||||
|
"marketing": "https://www.alliancembs.manchester.ac.uk/research/management-sciences-and-marketing/",
|
||||||
|
"mba": "https://www.alliancembs.manchester.ac.uk/about/our-people/",
|
||||||
|
|
||||||
|
# further schools can be added here...
# "computer": "...",
|
||||||
|
# "engineering": "...",
|
||||||
|
}
|
||||||
|
|
||||||
|
# generic school staff pages (used when no keyword matches)
GENERAL_STAFF_PAGES = [
|
||||||
|
"https://www.alliancembs.manchester.ac.uk/about/our-people/",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
async def scrape(output_callback=None):
|
||||||
|
"""执行爬取"""
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser = await p.chromium.launch(headless=True)
|
||||||
|
context = await browser.new_context(
|
||||||
|
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
|
||||||
|
)
|
||||||
|
page = await context.new_page()
|
||||||
|
|
||||||
|
base_url = "https://www.manchester.ac.uk/"
|
||||||
|
|
||||||
|
result = {
|
||||||
|
"name": "The University of Manchester",
|
||||||
|
"url": base_url,
|
||||||
|
"scraped_at": datetime.now(timezone.utc).isoformat(),
|
||||||
|
"schools": []
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Step 1: scrape the masters program list
if output_callback:
|
||||||
|
output_callback("info", "Step 1: Scraping masters programs list...")
|
||||||
|
|
||||||
|
courses_url = "https://www.manchester.ac.uk/study/masters/courses/list/"
|
||||||
|
await page.goto(courses_url, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
# extract all masters programs
programs_data = await page.evaluate('''() => {
|
||||||
|
const programs = [];
|
||||||
|
const seen = new Set();
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href;
|
||||||
|
const text = a.innerText.trim().replace(/\\s+/g, ' ');
|
||||||
|
|
||||||
|
if (!href || seen.has(href)) return;
|
||||||
|
if (text.length < 10 || text.length > 200) return;
|
||||||
|
|
||||||
|
const hrefLower = href.toLowerCase();
|
||||||
|
const textLower = text.toLowerCase();
|
||||||
|
|
||||||
|
// exclude navigation links
if (textLower === 'courses' || textLower === 'masters' ||
|
||||||
|
textLower.includes('admission') || textLower.includes('fees') ||
|
||||||
|
textLower.includes('skip to') || textLower.includes('skip navigation') ||
|
||||||
|
textLower === 'home' || textLower === 'search' ||
|
||||||
|
textLower.includes('contact') || textLower.includes('footer') ||
|
||||||
|
hrefLower.endsWith('/courses/') || hrefLower.endsWith('/masters/') ||
|
||||||
|
hrefLower.includes('#')) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
// check whether this is a course link - it must contain a course id
const hasNumericId = /\\/\\d{5}\\//.test(href); // 5-digit numeric id
const isCoursePage = hrefLower.includes('/courses/list/') &&
|
||||||
|
hasNumericId;
|
||||||
|
|
||||||
|
if (isCoursePage) {
|
||||||
|
seen.add(href);
|
||||||
|
programs.push({
|
||||||
|
name: text,
|
||||||
|
url: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return programs;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Found {len(programs_data)} masters programs")
|
||||||
|
|
||||||
|
# Step 2: scrape faculty information from school staff pages
if output_callback:
|
||||||
|
output_callback("info", "Step 2: Scraping faculty from school staff pages...")
|
||||||
|
|
||||||
|
all_faculty = {} # school_url -> faculty list
|
||||||
|
|
||||||
|
# scrape the AMBS Accounting & Finance staff page
staff_url = "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/"
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Scraping staff from: {staff_url}")
|
||||||
|
|
||||||
|
await page.goto(staff_url, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
# extract staff members from the table
faculty_data = await page.evaluate('''() => {
|
||||||
|
const faculty = [];
|
||||||
|
const rows = document.querySelectorAll('table tr');
|
||||||
|
|
||||||
|
rows.forEach(row => {
|
||||||
|
const cells = row.querySelectorAll('td');
|
||||||
|
if (cells.length >= 2) {
|
||||||
|
const link = cells[1]?.querySelector('a[href]');
|
||||||
|
const titleCell = cells[2];
|
||||||
|
|
||||||
|
if (link) {
|
||||||
|
const name = link.innerText.trim();
|
||||||
|
const url = link.href;
|
||||||
|
const title = titleCell ? titleCell.innerText.trim() : '';
|
||||||
|
|
||||||
|
if (name.length > 2 && !name.toLowerCase().includes('skip')) {
|
||||||
|
faculty.push({
|
||||||
|
name: name,
|
||||||
|
url: url,
|
||||||
|
title: title
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return faculty;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Found {len(faculty_data)} faculty members from AMBS")
|
||||||
|
|
||||||
|
all_faculty["AMBS - Accounting and Finance"] = faculty_data
|
||||||
|
|
||||||
|
# Step 3: assemble the result
# assign programs to schools by keyword
schools_data = {}
|
||||||
|
|
||||||
|
for prog in programs_data:
|
||||||
|
prog_name_lower = prog['name'].lower()
|
||||||
|
|
||||||
|
# determine which school the program belongs to
school_name = "Other Programs"
|
||||||
|
matched_faculty = []
|
||||||
|
|
||||||
|
for keyword, staff_url in SCHOOL_STAFF_MAPPING.items():
|
||||||
|
if keyword in prog_name_lower:
|
||||||
|
if "accounting" in keyword or "finance" in keyword:
|
||||||
|
school_name = "Alliance Manchester Business School"
|
||||||
|
matched_faculty = all_faculty.get("AMBS - Accounting and Finance", [])
|
||||||
|
elif "business" in keyword or "management" in keyword or "mba" in keyword:
|
||||||
|
school_name = "Alliance Manchester Business School"
|
||||||
|
matched_faculty = all_faculty.get("AMBS - Accounting and Finance", [])
|
||||||
|
break
|
||||||
|
|
||||||
|
if school_name not in schools_data:
|
||||||
|
schools_data[school_name] = {
|
||||||
|
"name": school_name,
|
||||||
|
"url": "",
|
||||||
|
"programs": [],
|
||||||
|
"faculty": matched_faculty # 学院级别的导师
|
||||||
|
}
|
||||||
|
|
||||||
|
schools_data[school_name]["programs"].append({
|
||||||
|
"name": prog['name'],
|
||||||
|
"url": prog['url'],
|
||||||
|
"faculty": [] # 项目级别暂不填充
|
||||||
|
})
|
||||||
|
|
||||||
|
result["schools"] = list(schools_data.values())
|
||||||
|
|
||||||
|
# summary statistics
total_programs = sum(len(s['programs']) for s in result['schools'])
|
||||||
|
total_faculty = sum(len(s.get('faculty', [])) for s in result['schools'])
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Done! {len(result['schools'])} schools, {total_programs} programs, {total_faculty} faculty")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("error", f"Scraping error: {str(e)}")
|
||||||
|
|
||||||
|
finally:
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
import sys
|
||||||
|
if sys.platform == "win32":
|
||||||
|
asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
|
||||||
|
|
||||||
|
def print_callback(level, msg):
|
||||||
|
print(f"[{level}] {msg}")
|
||||||
|
|
||||||
|
result = asyncio.run(scrape(output_callback=print_callback))
|
||||||
|
|
||||||
|
# save the result
with open("output/manchester_improved_result.json", "w", encoding="utf-8") as f:
|
||||||
|
json.dump(result, f, indent=2, ensure_ascii=False)
|
||||||
|
|
||||||
|
print(f"\nResult saved to output/manchester_improved_result.json")
|
||||||
|
print(f"Schools: {len(result['schools'])}")
|
||||||
|
for school in result['schools']:
|
||||||
|
print(f" - {school['name']}: {len(school['programs'])} programs, {len(school.get('faculty', []))} faculty")
|
||||||
165 artifacts/test_faculty_scraper.py Normal file
@ -0,0 +1,165 @@
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
测试导师信息爬取逻辑 - 只测试3个项目
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
|
||||||
|
|
||||||
|
def name_to_slug(name):
|
||||||
|
"""将项目名称转换为URL slug"""
|
||||||
|
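# e.g. "African and African American Studies" -> "african-and-african-american-studies"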
slug = name.lower()
|
||||||
|
slug = re.sub(r'[^\w\s-]', '', slug)
|
||||||
|
slug = re.sub(r'[\s_]+', '-', slug)
|
||||||
|
slug = re.sub(r'-+', '-', slug)
|
||||||
|
slug = slug.strip('-')
|
||||||
|
return slug
|
||||||
|
|
||||||
|
|
||||||
|
async def get_faculty_from_gsas_page(page, gsas_url):
|
||||||
|
"""从GSAS项目页面获取Faculty链接,然后访问院系People页面获取导师列表"""
|
||||||
|
faculty_list = []
|
||||||
|
faculty_page_url = None
|
||||||
|
|
||||||
|
try:
|
||||||
|
print(f" 访问GSAS页面: {gsas_url}")
|
||||||
|
await page.goto(gsas_url, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
# find the link in the Faculty section
faculty_link = await page.evaluate('''() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const link of links) {
|
||||||
|
const text = link.innerText.toLowerCase();
|
||||||
|
const href = link.href;
|
||||||
|
if (text.includes('faculty') && text.includes('see list')) {
|
||||||
|
return href;
|
||||||
|
}
|
||||||
|
if (text.includes('faculty') && (href.includes('/people') || href.includes('/faculty'))) {
|
||||||
|
return href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
if faculty_link:
|
||||||
|
faculty_page_url = faculty_link
|
||||||
|
print(f" 找到Faculty页面链接: {faculty_link}")
|
||||||
|
|
||||||
|
# visit the Faculty/People page
await page.goto(faculty_link, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
# extract all faculty information
faculty_list = await page.evaluate('''() => {
|
||||||
|
const faculty = [];
|
||||||
|
const seen = new Set();
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href || '';
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
const lowerHref = href.toLowerCase();
|
||||||
|
|
||||||
|
if ((lowerHref.includes('/people/') || lowerHref.includes('/faculty/') ||
|
||||||
|
lowerHref.includes('/profile/')) &&
|
||||||
|
text.length > 3 && text.length < 100 &&
|
||||||
|
!text.toLowerCase().includes('people') &&
|
||||||
|
!text.toLowerCase().includes('faculty') &&
|
||||||
|
!lowerHref.endsWith('/people/') &&
|
||||||
|
!lowerHref.endsWith('/faculty/')) {
|
||||||
|
|
||||||
|
if (!seen.has(href)) {
|
||||||
|
seen.add(href);
|
||||||
|
faculty.push({
|
||||||
|
name: text,
|
||||||
|
url: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return faculty;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f" 找到 {len(faculty_list)} 位导师")
|
||||||
|
for f in faculty_list[:5]:
|
||||||
|
print(f" - {f['name']}: {f['url']}")
|
||||||
|
if len(faculty_list) > 5:
|
||||||
|
print(f" ... 还有 {len(faculty_list) - 5} 位")
|
||||||
|
else:
|
||||||
|
print(" 未找到Faculty页面链接")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f" 获取Faculty信息失败: {e}")
|
||||||
|
|
||||||
|
return faculty_list, faculty_page_url
|
||||||
|
|
||||||
|
|
||||||
|
async def test_faculty_scraper():
|
||||||
|
"""测试导师爬取"""
|
||||||
|
|
||||||
|
# test 3 programs
test_programs = [
|
||||||
|
"African and African American Studies",
|
||||||
|
"Economics",
|
||||||
|
"Computer Science"
|
||||||
|
]
|
||||||
|
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser = await p.chromium.launch(headless=False)
|
||||||
|
context = await browser.new_context(
|
||||||
|
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
|
||||||
|
viewport={'width': 1920, 'height': 1080}
|
||||||
|
)
|
||||||
|
page = await context.new_page()
|
||||||
|
|
||||||
|
results = []
|
||||||
|
|
||||||
|
for i, name in enumerate(test_programs, 1):
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print(f"[{i}/{len(test_programs)}] 测试: {name}")
|
||||||
|
print(f"{'='*60}")
|
||||||
|
|
||||||
|
slug = name_to_slug(name)
|
||||||
|
program_url = f"https://www.harvard.edu/programs/{slug}/"
|
||||||
|
gsas_url = f"https://gsas.harvard.edu/program/{slug}"
|
||||||
|
|
||||||
|
print(f"项目URL: {program_url}")
|
||||||
|
print(f"GSAS URL: {gsas_url}")
|
||||||
|
|
||||||
|
faculty_list, faculty_page_url = await get_faculty_from_gsas_page(page, gsas_url)
|
||||||
|
|
||||||
|
results.append({
|
||||||
|
'name': name,
|
||||||
|
'url': program_url,
|
||||||
|
'gsas_url': gsas_url,
|
||||||
|
'faculty_page_url': faculty_page_url,
|
||||||
|
'faculty': faculty_list,
|
||||||
|
'faculty_count': len(faculty_list)
|
||||||
|
})
|
||||||
|
|
||||||
|
await page.wait_for_timeout(1000)
|
||||||
|
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
# print the results
print(f"\n\n{'='*60}")
|
||||||
|
print("测试结果汇总")
|
||||||
|
print(f"{'='*60}")
|
||||||
|
|
||||||
|
for r in results:
|
||||||
|
print(f"\n{r['name']}:")
|
||||||
|
print(f" Faculty页面: {r['faculty_page_url'] or '未找到'}")
|
||||||
|
print(f" 导师数量: {r['faculty_count']}")
|
||||||
|
|
||||||
|
# save the test results
with open('test_faculty_results.json', 'w', encoding='utf-8') as f:
|
||||||
|
json.dump(results, f, ensure_ascii=False, indent=2)
|
||||||
|
print(f"\n测试结果已保存到: test_faculty_results.json")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
asyncio.run(test_faculty_scraper())
|
||||||
464 artifacts/test_manchester_scraper.py Normal file
@ -0,0 +1,464 @@
"""
|
||||||
|
Test Manchester University scraper - improved faculty mapping
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
|
||||||
|
|
||||||
|
MASTERS_PATHS = [
|
||||||
|
"/study/masters/courses/list/",
|
||||||
|
"/study/masters/courses/",
|
||||||
|
"/postgraduate/taught/courses/",
|
||||||
|
"/postgraduate/courses/list/",
|
||||||
|
"/postgraduate/courses/",
|
||||||
|
"/graduate/programs/",
|
||||||
|
"/academics/graduate/programs/",
|
||||||
|
"/programmes/masters/",
|
||||||
|
"/masters/programmes/",
|
||||||
|
"/admissions/graduate/programs/",
|
||||||
|
]
|
||||||
|
|
||||||
|
ACCOUNTING_STAFF_URL = "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/"
|
||||||
|
ACCOUNTING_STAFF_CACHE = None
|
||||||
|
|
||||||
|
|
||||||
|
JS_CHECK_COURSES = r"""() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
let courseCount = 0;
|
||||||
|
for (const a of links) {
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
if (/\/\d{4,}\//.test(href) ||
|
||||||
|
/\/(msc|ma|mba|mres|llm|med|meng)-/.test(href) ||
|
||||||
|
/\/course\/[a-z]/.test(href)) {
|
||||||
|
courseCount++;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return courseCount;
|
||||||
|
}"""
|
||||||
|
|
||||||
|
JS_FIND_LIST_URL = """() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const a of links) {
|
||||||
|
const text = a.innerText.toLowerCase();
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
if ((text.includes('a-z') || text.includes('all course') ||
|
||||||
|
text.includes('full list') || text.includes('browse all') ||
|
||||||
|
href.includes('/list')) &&
|
||||||
|
(href.includes('master') || href.includes('course') || href.includes('postgrad'))) {
|
||||||
|
return a.href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}"""
|
||||||
|
|
||||||
|
JS_FIND_COURSES_FROM_HOME = """() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const a of links) {
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
const text = a.innerText.toLowerCase();
|
||||||
|
if ((href.includes('master') || href.includes('postgraduate') || href.includes('graduate')) &&
|
||||||
|
(href.includes('course') || href.includes('program') || href.includes('degree'))) {
|
||||||
|
return a.href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}"""
|
||||||
|
|
||||||
|
JS_EXTRACT_PROGRAMS = r"""() => {
|
||||||
|
const programs = [];
|
||||||
|
const seen = new Set();
|
||||||
|
const currentHost = window.location.hostname;
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href;
|
||||||
|
const text = a.innerText.trim().replace(/\s+/g, ' ');
|
||||||
|
|
||||||
|
if (!href || seen.has(href)) return;
|
||||||
|
if (text.length < 5 || text.length > 200) return;
|
||||||
|
if (href.includes('#') || href.includes('javascript:') || href.includes('mailto:')) return;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const linkHost = new URL(href).hostname;
|
||||||
|
if (!linkHost.includes(currentHost.replace('www.', '')) &&
|
||||||
|
!currentHost.includes(linkHost.replace('www.', ''))) return;
|
||||||
|
} catch {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
const hrefLower = href.toLowerCase();
|
||||||
|
const textLower = text.toLowerCase();
|
||||||
|
|
||||||
|
const isNavigation = textLower === 'courses' ||
|
||||||
|
textLower === 'programmes' ||
|
||||||
|
textLower === 'undergraduate' ||
|
||||||
|
textLower === 'postgraduate' ||
|
||||||
|
textLower === 'masters' ||
|
||||||
|
textLower === "master's" ||
|
||||||
|
textLower.includes('skip to') ||
|
||||||
|
textLower.includes('share') ||
|
||||||
|
textLower === 'home' ||
|
||||||
|
textLower === 'study' ||
|
||||||
|
textLower.startsWith('a-z') ||
|
||||||
|
textLower.includes('admission') ||
|
||||||
|
textLower.includes('fees and funding') ||
|
||||||
|
textLower.includes('why should') ||
|
||||||
|
textLower.includes('why manchester') ||
|
||||||
|
textLower.includes('teaching and learning') ||
|
||||||
|
textLower.includes('meet us') ||
|
||||||
|
textLower.includes('student support') ||
|
||||||
|
textLower.includes('contact us') ||
|
||||||
|
textLower.includes('how to apply') ||
|
||||||
|
hrefLower.includes('/admissions/') ||
|
||||||
|
hrefLower.includes('/fees-and-funding/') ||
|
||||||
|
hrefLower.includes('/why-') ||
|
||||||
|
hrefLower.includes('/meet-us/') ||
|
||||||
|
hrefLower.includes('/contact-us/') ||
|
||||||
|
hrefLower.includes('/student-support/') ||
|
||||||
|
hrefLower.includes('/teaching-and-learning/') ||
|
||||||
|
hrefLower.endsWith('/courses/') ||
|
||||||
|
hrefLower.endsWith('/masters/') ||
|
||||||
|
hrefLower.endsWith('/postgraduate/');
|
||||||
|
|
||||||
|
if (isNavigation) return;
|
||||||
|
|
||||||
|
const isExcluded = hrefLower.includes('/undergraduate') ||
|
||||||
|
hrefLower.includes('/bachelor') ||
|
||||||
|
hrefLower.includes('/phd/') ||
|
||||||
|
hrefLower.includes('/doctoral') ||
|
||||||
|
hrefLower.includes('/research-degree') ||
|
||||||
|
textLower.includes('bachelor') ||
|
||||||
|
textLower.includes('undergraduate') ||
|
||||||
|
(textLower.includes('phd') && !textLower.includes('mphil'));
|
||||||
|
|
||||||
|
if (isExcluded) return;
|
||||||
|
|
||||||
|
const hasNumericId = /\/\d{4,}\//.test(href);
|
||||||
|
const hasDegreeSlug = /\/(msc|ma|mba|mres|llm|med|meng|mpa|mph|mphil)-[a-z]/.test(hrefLower);
|
||||||
|
const isCoursePage = (hrefLower.includes('/course/') ||
|
||||||
|
hrefLower.includes('/courses/list/') ||
|
||||||
|
hrefLower.includes('/programme/')) &&
|
||||||
|
href.split('/').filter(p => p).length > 4;
|
||||||
|
const textHasDegree = /(msc|ma|mba|mres|llm|med|meng|pgcert|pgdip)/i.test(text) ||
|
||||||
|
textLower.includes('master');
|
||||||
|
|
||||||
|
if (hasNumericId || hasDegreeSlug || isCoursePage || textHasDegree) {
|
||||||
|
seen.add(href);
|
||||||
|
programs.push({
|
||||||
|
name: text,
|
||||||
|
url: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return programs;
|
||||||
|
}"""
|
||||||
|
|
||||||
|
JS_EXTRACT_FACULTY = r"""() => {
|
||||||
|
const faculty = [];
|
||||||
|
const seen = new Set();
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
|
||||||
|
if (seen.has(href)) return;
|
||||||
|
if (text.length < 3 || text.length > 100) return;
|
||||||
|
|
||||||
|
const isStaff = href.includes('/people/') ||
|
||||||
|
href.includes('/staff/') ||
|
||||||
|
href.includes('/faculty/') ||
|
||||||
|
href.includes('/profile/') ||
|
||||||
|
href.includes('/academics/') ||
|
||||||
|
href.includes('/researcher/');
|
||||||
|
|
||||||
|
if (isStaff) {
|
||||||
|
seen.add(href);
|
||||||
|
faculty.push({
|
||||||
|
name: text.replace(/\s+/g, ' '),
|
||||||
|
url: a.href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return faculty.slice(0, 20);
|
||||||
|
}"""
|
||||||
|
|
||||||
|
JS_EXTRACT_ACCOUNTING_STAFF = r"""() => {
|
||||||
|
const rows = Array.from(document.querySelectorAll('table tbody tr'));
|
||||||
|
const staff = [];
|
||||||
|
|
||||||
|
for (const row of rows) {
|
||||||
|
const cells = row.querySelectorAll('td');
|
||||||
|
if (!cells || cells.length < 2) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
const nameCell = cells[1];
|
||||||
|
const roleCell = cells[2];
|
||||||
|
const emailCell = cells[5];
|
||||||
|
|
||||||
|
let profileUrl = '';
|
||||||
|
let displayName = nameCell ? nameCell.innerText.trim() : '';
|
||||||
|
const link = nameCell ? nameCell.querySelector('a[href]') : null;
|
||||||
|
if (link) {
|
||||||
|
profileUrl = link.href;
|
||||||
|
displayName = link.innerText.trim() || displayName;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (!displayName) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
let email = '';
|
||||||
|
if (emailCell) {
|
||||||
|
const emailLink = emailCell.querySelector('a[href^="mailto:"]');
|
||||||
|
if (emailLink) {
|
||||||
|
email = emailLink.href.replace('mailto:', '').trim();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
staff.push({
|
||||||
|
name: displayName,
|
||||||
|
title: roleCell ? roleCell.innerText.trim() : '',
|
||||||
|
url: profileUrl,
|
||||||
|
email: email
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
return staff;
|
||||||
|
}"""
|
||||||
|
|
||||||
|
|
||||||
|
def should_use_accounting_staff(program_name: str) -> bool:
|
||||||
|
lower_name = program_name.lower()
|
||||||
|
return "msc" in lower_name and "accounting" in lower_name
|
||||||
|
|
||||||
|
|
||||||
|
async def load_accounting_staff(context, output_callback=None):
|
||||||
|
global ACCOUNTING_STAFF_CACHE
|
||||||
|
|
||||||
|
if ACCOUNTING_STAFF_CACHE is not None:
|
||||||
|
return ACCOUNTING_STAFF_CACHE
|
||||||
|
|
||||||
|
staff_page = await context.new_page()
|
||||||
|
try:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", "Loading official AMBS Accounting & Finance staff page...")
|
||||||
|
|
||||||
|
await staff_page.goto(ACCOUNTING_STAFF_URL, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await staff_page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
ACCOUNTING_STAFF_CACHE = await staff_page.evaluate(JS_EXTRACT_ACCOUNTING_STAFF)
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Captured {len(ACCOUNTING_STAFF_CACHE)} faculty from the official staff page")
|
||||||
|
|
||||||
|
except Exception as exc:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("error", f"Failed to load AMBS staff page: {exc}")
|
||||||
|
ACCOUNTING_STAFF_CACHE = []
|
||||||
|
finally:
|
||||||
|
await staff_page.close()
|
||||||
|
|
||||||
|
return ACCOUNTING_STAFF_CACHE
|
||||||
|
|
||||||
|
|
||||||
|
async def find_course_list_page(page, base_url, output_callback):
|
||||||
|
for path in MASTERS_PATHS:
|
||||||
|
test_url = base_url.rstrip('/') + path
|
||||||
|
try:
|
||||||
|
response = await page.goto(test_url, wait_until="domcontentloaded", timeout=15000)
|
||||||
|
if response and response.status == 200:
|
||||||
|
title = await page.title()
|
||||||
|
if '404' not in title.lower() and 'not found' not in title.lower():
|
||||||
|
has_courses = await page.evaluate(JS_CHECK_COURSES)
|
||||||
|
if has_courses > 5:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Found course list: {path} ({has_courses} courses)")
|
||||||
|
return test_url
|
||||||
|
|
||||||
|
list_url = await page.evaluate(JS_FIND_LIST_URL)
|
||||||
|
if list_url:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Found full course list: {list_url}")
|
||||||
|
return list_url
|
||||||
|
except Exception:
|
||||||
|
continue
|
||||||
|
|
||||||
|
try:
|
||||||
|
await page.goto(base_url, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
courses_url = await page.evaluate(JS_FIND_COURSES_FROM_HOME)
|
||||||
|
if courses_url:
|
||||||
|
return courses_url
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
async def extract_course_links(page, output_callback):
|
||||||
|
return await page.evaluate(JS_EXTRACT_PROGRAMS)
|
||||||
|
|
||||||
|
|
||||||
|
async def scrape(output_callback=None):
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser = await p.chromium.launch(headless=True)
|
||||||
|
context = await browser.new_context(
|
||||||
|
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
|
||||||
|
)
|
||||||
|
page = await context.new_page()
|
||||||
|
|
||||||
|
base_url = "https://www.manchester.ac.uk/"
|
||||||
|
|
||||||
|
result = {
|
||||||
|
"name": "Manchester University",
|
||||||
|
"url": base_url,
|
||||||
|
"scraped_at": datetime.now(timezone.utc).isoformat(),
|
||||||
|
"schools": []
|
||||||
|
}
|
||||||
|
|
||||||
|
all_programs = []
|
||||||
|
|
||||||
|
try:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", "Searching for masters course list...")
|
||||||
|
|
||||||
|
courses_url = await find_course_list_page(page, base_url, output_callback)
|
||||||
|
|
||||||
|
if not courses_url:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("warning", "Course list not found, using homepage")
|
||||||
|
courses_url = base_url
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", "Extracting masters programs...")
|
||||||
|
|
||||||
|
await page.goto(courses_url, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
await page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
for _ in range(3):
|
||||||
|
try:
|
||||||
|
load_more = page.locator('button:has-text("Load more"), button:has-text("Show more"), button:has-text("View more"), a:has-text("Load more")')
|
||||||
|
if await load_more.count() > 0:
|
||||||
|
await load_more.first.click()
|
||||||
|
await page.wait_for_timeout(2000)
|
||||||
|
else:
|
||||||
|
break
|
||||||
|
except Exception:
|
||||||
|
break
|
||||||
|
|
||||||
|
programs_data = await extract_course_links(page, output_callback)
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
output_callback("info", f"Found {len(programs_data)} masters programs")
|
||||||
|
|
||||||
|
print("\nTop 20 programs:")
|
||||||
|
for i, prog in enumerate(programs_data[:20]):
|
||||||
|
print(f" {i+1}. {prog['name'][:60]}")
|
||||||
|
print(f" {prog['url']}")
|
||||||
|
|
||||||
|
max_detail_pages = min(len(programs_data), 30)
|
||||||
|
detailed_processed = 0
|
||||||
|
logged_official_staff = False
|
||||||
|
|
||||||
|
for prog in programs_data:
|
||||||
|
faculty_data = []
|
||||||
|
used_official_staff = False
|
||||||
|
|
||||||
|
if should_use_accounting_staff(prog['name']):
|
||||||
|
staff_list = await load_accounting_staff(context, output_callback)
|
||||||
|
if staff_list:
|
||||||
|
used_official_staff = True
|
||||||
|
if output_callback and not logged_official_staff:
|
||||||
|
output_callback("info", "Using Alliance MBS Accounting & Finance staff directory for accounting programmes")
|
||||||
|
logged_official_staff = True
|
||||||
|
faculty_data = [
|
||||||
|
{
|
||||||
|
"name": person.get("name"),
|
||||||
|
"url": person.get("url") or ACCOUNTING_STAFF_URL,
|
||||||
|
"title": person.get("title"),
|
||||||
|
"email": person.get("email"),
|
||||||
|
"source": "Alliance Manchester Business School - Accounting & Finance staff"
|
||||||
|
}
|
||||||
|
for person in staff_list
|
||||||
|
]
|
||||||
|
|
||||||
|
elif detailed_processed < max_detail_pages:
|
||||||
|
detailed_processed += 1
|
||||||
|
if output_callback and detailed_processed % 10 == 0:
|
||||||
|
output_callback("info", f"Processing {detailed_processed}/{max_detail_pages}: {prog['name'][:50]}")
|
||||||
|
try:
|
||||||
|
await page.goto(prog['url'], wait_until="domcontentloaded", timeout=15000)
|
||||||
|
await page.wait_for_timeout(800)
|
||||||
|
|
||||||
|
faculty_data = await page.evaluate(JS_EXTRACT_FACULTY)
|
||||||
|
except Exception as e:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("warning", f"Failed to capture faculty for {prog['name'][:50]}: {e}")
|
||||||
|
faculty_data = []
|
||||||
|
|
||||||
|
program_entry = {
|
||||||
|
"name": prog['name'],
|
||||||
|
"url": prog['url'],
|
||||||
|
"faculty": faculty_data
|
||||||
|
}
|
||||||
|
|
||||||
|
if used_official_staff:
|
||||||
|
program_entry["faculty_page_override"] = ACCOUNTING_STAFF_URL
|
||||||
|
|
||||||
|
all_programs.append(program_entry)
|
||||||
|
|
||||||
|
result["schools"] = [{
|
||||||
|
"name": "Masters Programs",
|
||||||
|
"url": courses_url,
|
||||||
|
"programs": all_programs
|
||||||
|
}]
|
||||||
|
|
||||||
|
if output_callback:
|
||||||
|
total_faculty = sum(len(p.get('faculty', [])) for p in all_programs)
|
||||||
|
output_callback("info", f"Done! {len(all_programs)} programs, {total_faculty} faculty")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
if output_callback:
|
||||||
|
output_callback("error", f"Scraping error: {str(e)}")
|
||||||
|
|
||||||
|
finally:
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
def log_callback(level, message):
|
||||||
|
print(f"[{level.upper()}] {message}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
result = asyncio.run(scrape(output_callback=log_callback))
|
||||||
|
|
||||||
|
print("\n" + "="*60)
|
||||||
|
print("Scrape summary:")
|
||||||
|
print("="*60)
|
||||||
|
|
||||||
|
if result.get("schools"):
|
||||||
|
school = result["schools"][0]
|
||||||
|
programs = school.get("programs", [])
|
||||||
|
print(f"Course list URL: {school.get('url')}")
|
||||||
|
print(f"Total programs: {len(programs)}")
|
||||||
|
|
||||||
|
faculty_count = sum(len(p.get('faculty', [])) for p in programs)
|
||||||
|
print(f"Faculty total: {faculty_count}")
|
||||||
|
|
||||||
|
print("\nTop 10 programs:")
|
||||||
|
for i, p in enumerate(programs[:10]):
|
||||||
|
print(f" {i+1}. {p['name'][:60]}")
|
||||||
|
if p.get("faculty"):
|
||||||
|
print(f" Faculty entries: {len(p['faculty'])}")
|
||||||
|
|
||||||
|
with open("manchester_test_result.json", "w", encoding="utf-8") as f:
|
||||||
|
json.dump(result, f, indent=2, ensure_ascii=False)
|
||||||
|
print("\nSaved results to manchester_test_result.json")
|
||||||
25  backend/Dockerfile  Normal file
@@ -0,0 +1,25 @@
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    && rm -rf /var/lib/apt/lists/*

# Install Playwright and the Chromium browser dependencies
RUN pip install playwright && playwright install chromium && playwright install-deps

# Copy the dependency manifest and install Python packages
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose the API port
EXPOSE 8000

# Start command
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
1  backend/app/__init__.py  Normal file
@@ -0,0 +1 @@
"""University Scraper Web Backend"""
15  backend/app/api/__init__.py  Normal file
@@ -0,0 +1,15 @@
"""API routes"""

from fastapi import APIRouter

from .universities import router as universities_router
from .scripts import router as scripts_router
from .jobs import router as jobs_router
from .results import router as results_router

api_router = APIRouter()

api_router.include_router(universities_router, prefix="/universities", tags=["University management"])
api_router.include_router(scripts_router, prefix="/scripts", tags=["Scraper scripts"])
api_router.include_router(jobs_router, prefix="/jobs", tags=["Scrape jobs"])
api_router.include_router(results_router, prefix="/results", tags=["Scrape results"])
144  backend/app/api/jobs.py  Normal file
@@ -0,0 +1,144 @@
|
|||||||
|
"""爬取任务API"""
|
||||||
|
|
||||||
|
from typing import List
|
||||||
|
from datetime import datetime
|
||||||
|
from fastapi import APIRouter, Depends, HTTPException, BackgroundTasks
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from ..database import get_db
|
||||||
|
from ..models import University, ScraperScript, ScrapeJob, ScrapeLog
|
||||||
|
from ..schemas.job import JobResponse, JobStatusResponse, LogResponse
|
||||||
|
from ..services.scraper_runner import run_scraper
|
||||||
|
|
||||||
|
router = APIRouter()
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/start/{university_id}", response_model=JobResponse)
|
||||||
|
async def start_scrape_job(
|
||||||
|
university_id: int,
|
||||||
|
background_tasks: BackgroundTasks,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
一键运行爬虫
|
||||||
|
|
||||||
|
启动爬取任务,抓取大学项目和导师数据
|
||||||
|
"""
|
||||||
|
# 检查大学是否存在
|
||||||
|
university = db.query(University).filter(University.id == university_id).first()
|
||||||
|
if not university:
|
||||||
|
raise HTTPException(status_code=404, detail="大学不存在")
|
||||||
|
|
||||||
|
# 检查是否有活跃的脚本
|
||||||
|
script = db.query(ScraperScript).filter(
|
||||||
|
ScraperScript.university_id == university_id,
|
||||||
|
ScraperScript.status == "active"
|
||||||
|
).first()
|
||||||
|
|
||||||
|
if not script:
|
||||||
|
raise HTTPException(status_code=400, detail="没有可用的爬虫脚本,请先生成脚本")
|
||||||
|
|
||||||
|
# 检查是否有正在运行的任务
|
||||||
|
running_job = db.query(ScrapeJob).filter(
|
||||||
|
ScrapeJob.university_id == university_id,
|
||||||
|
ScrapeJob.status == "running"
|
||||||
|
).first()
|
||||||
|
|
||||||
|
if running_job:
|
||||||
|
raise HTTPException(status_code=400, detail="已有正在运行的任务")
|
||||||
|
|
||||||
|
# 创建任务
|
||||||
|
job = ScrapeJob(
|
||||||
|
university_id=university_id,
|
||||||
|
script_id=script.id,
|
||||||
|
status="pending",
|
||||||
|
progress=0,
|
||||||
|
current_step="准备中..."
|
||||||
|
)
|
||||||
|
db.add(job)
|
||||||
|
db.commit()
|
||||||
|
db.refresh(job)
|
||||||
|
|
||||||
|
# 在后台执行爬虫
|
||||||
|
background_tasks.add_task(
|
||||||
|
run_scraper,
|
||||||
|
job_id=job.id,
|
||||||
|
script_id=script.id
|
||||||
|
)
|
||||||
|
|
||||||
|
return job
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/{job_id}", response_model=JobResponse)
|
||||||
|
def get_job(
|
||||||
|
job_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取任务详情"""
|
||||||
|
job = db.query(ScrapeJob).filter(ScrapeJob.id == job_id).first()
|
||||||
|
if not job:
|
||||||
|
raise HTTPException(status_code=404, detail="任务不存在")
|
||||||
|
|
||||||
|
return job
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/{job_id}/status", response_model=JobStatusResponse)
|
||||||
|
def get_job_status(
|
||||||
|
job_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取任务状态和日志"""
|
||||||
|
job = db.query(ScrapeJob).filter(ScrapeJob.id == job_id).first()
|
||||||
|
if not job:
|
||||||
|
raise HTTPException(status_code=404, detail="任务不存在")
|
||||||
|
|
||||||
|
# 获取最近的日志
|
||||||
|
logs = db.query(ScrapeLog).filter(
|
||||||
|
ScrapeLog.job_id == job_id
|
||||||
|
).order_by(ScrapeLog.created_at.desc()).limit(50).all()
|
||||||
|
|
||||||
|
return JobStatusResponse(
|
||||||
|
id=job.id,
|
||||||
|
status=job.status,
|
||||||
|
progress=job.progress,
|
||||||
|
current_step=job.current_step,
|
||||||
|
logs=[LogResponse(
|
||||||
|
id=log.id,
|
||||||
|
level=log.level,
|
||||||
|
message=log.message,
|
||||||
|
created_at=log.created_at
|
||||||
|
) for log in reversed(logs)]
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/university/{university_id}", response_model=List[JobResponse])
|
||||||
|
def get_university_jobs(
|
||||||
|
university_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取大学的所有任务"""
|
||||||
|
jobs = db.query(ScrapeJob).filter(
|
||||||
|
ScrapeJob.university_id == university_id
|
||||||
|
).order_by(ScrapeJob.created_at.desc()).limit(20).all()
|
||||||
|
|
||||||
|
return jobs
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/{job_id}/cancel")
|
||||||
|
def cancel_job(
|
||||||
|
job_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""取消任务"""
|
||||||
|
job = db.query(ScrapeJob).filter(ScrapeJob.id == job_id).first()
|
||||||
|
if not job:
|
||||||
|
raise HTTPException(status_code=404, detail="任务不存在")
|
||||||
|
|
||||||
|
if job.status not in ["pending", "running"]:
|
||||||
|
raise HTTPException(status_code=400, detail="任务已结束,无法取消")
|
||||||
|
|
||||||
|
job.status = "cancelled"
|
||||||
|
job.completed_at = datetime.utcnow()
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return {"message": "任务已取消"}
|
||||||
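The job endpoints above are what the frontend's "one-click run" button drives: start a job for a university that already has an active script, then poll its status until it finishes. Below is a minimal client-side sketch using `requests`; the base URL and university id are assumptions, and the `/api` prefix comes from how `api_router` is mounted in `app/main.py`.

```python
import time
import requests

BASE = "http://localhost:8000/api"  # assumed local deployment
university_id = 1                   # assumed university with an active scraper script

# Kick off a scrape job for the university
job = requests.post(f"{BASE}/jobs/start/{university_id}").json()

# Poll the status endpoint until the job reaches a terminal state
while True:
    status = requests.get(f"{BASE}/jobs/{job['id']}/status").json()
    print(status["progress"], status["current_step"])
    if status["status"] in ("completed", "failed", "cancelled"):
        break
    time.sleep(2)
```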
175  backend/app/api/results.py  Normal file
@@ -0,0 +1,175 @@
|
|||||||
|
"""爬取结果API"""
|
||||||
|
|
||||||
|
from typing import Optional
|
||||||
|
from fastapi import APIRouter, Depends, HTTPException, Query
|
||||||
|
from fastapi.responses import JSONResponse
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from ..database import get_db
|
||||||
|
from ..models import ScrapeResult
|
||||||
|
from ..schemas.result import ResultResponse
|
||||||
|
|
||||||
|
router = APIRouter()
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/university/{university_id}", response_model=ResultResponse)
|
||||||
|
def get_university_result(
|
||||||
|
university_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取大学最新的爬取结果"""
|
||||||
|
result = db.query(ScrapeResult).filter(
|
||||||
|
ScrapeResult.university_id == university_id
|
||||||
|
).order_by(ScrapeResult.created_at.desc()).first()
|
||||||
|
|
||||||
|
if not result:
|
||||||
|
raise HTTPException(status_code=404, detail="没有爬取结果")
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/university/{university_id}/schools")
|
||||||
|
def get_schools(
|
||||||
|
university_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取学院列表"""
|
||||||
|
result = db.query(ScrapeResult).filter(
|
||||||
|
ScrapeResult.university_id == university_id
|
||||||
|
).order_by(ScrapeResult.created_at.desc()).first()
|
||||||
|
|
||||||
|
if not result:
|
||||||
|
raise HTTPException(status_code=404, detail="没有爬取结果")
|
||||||
|
|
||||||
|
schools = result.result_data.get("schools", [])
|
||||||
|
|
||||||
|
# 返回简化的学院列表
|
||||||
|
return {
|
||||||
|
"total": len(schools),
|
||||||
|
"schools": [
|
||||||
|
{
|
||||||
|
"name": s.get("name"),
|
||||||
|
"url": s.get("url"),
|
||||||
|
"program_count": len(s.get("programs", []))
|
||||||
|
}
|
||||||
|
for s in schools
|
||||||
|
]
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/university/{university_id}/programs")
|
||||||
|
def get_programs(
|
||||||
|
university_id: int,
|
||||||
|
school_name: Optional[str] = Query(None, description="按学院筛选"),
|
||||||
|
search: Optional[str] = Query(None, description="搜索项目名称"),
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取项目列表"""
|
||||||
|
result = db.query(ScrapeResult).filter(
|
||||||
|
ScrapeResult.university_id == university_id
|
||||||
|
).order_by(ScrapeResult.created_at.desc()).first()
|
||||||
|
|
||||||
|
if not result:
|
||||||
|
raise HTTPException(status_code=404, detail="没有爬取结果")
|
||||||
|
|
||||||
|
schools = result.result_data.get("schools", [])
|
||||||
|
programs = []
|
||||||
|
|
||||||
|
for school in schools:
|
||||||
|
if school_name and school.get("name") != school_name:
|
||||||
|
continue
|
||||||
|
|
||||||
|
for prog in school.get("programs", []):
|
||||||
|
if search and search.lower() not in prog.get("name", "").lower():
|
||||||
|
continue
|
||||||
|
|
||||||
|
programs.append({
|
||||||
|
"name": prog.get("name"),
|
||||||
|
"url": prog.get("url"),
|
||||||
|
"degree_type": prog.get("degree_type"),
|
||||||
|
"school": school.get("name"),
|
||||||
|
"faculty_count": len(prog.get("faculty", []))
|
||||||
|
})
|
||||||
|
|
||||||
|
return {
|
||||||
|
"total": len(programs),
|
||||||
|
"programs": programs
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/university/{university_id}/faculty")
|
||||||
|
def get_faculty(
|
||||||
|
university_id: int,
|
||||||
|
school_name: Optional[str] = Query(None, description="按学院筛选"),
|
||||||
|
program_name: Optional[str] = Query(None, description="按项目筛选"),
|
||||||
|
search: Optional[str] = Query(None, description="搜索导师姓名"),
|
||||||
|
skip: int = Query(0, ge=0),
|
||||||
|
limit: int = Query(50, ge=1, le=200),
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取导师列表"""
|
||||||
|
result = db.query(ScrapeResult).filter(
|
||||||
|
ScrapeResult.university_id == university_id
|
||||||
|
).order_by(ScrapeResult.created_at.desc()).first()
|
||||||
|
|
||||||
|
if not result:
|
||||||
|
raise HTTPException(status_code=404, detail="没有爬取结果")
|
||||||
|
|
||||||
|
schools = result.result_data.get("schools", [])
|
||||||
|
faculty_list = []
|
||||||
|
|
||||||
|
for school in schools:
|
||||||
|
if school_name and school.get("name") != school_name:
|
||||||
|
continue
|
||||||
|
|
||||||
|
for prog in school.get("programs", []):
|
||||||
|
if program_name and prog.get("name") != program_name:
|
||||||
|
continue
|
||||||
|
|
||||||
|
for fac in prog.get("faculty", []):
|
||||||
|
if search and search.lower() not in fac.get("name", "").lower():
|
||||||
|
continue
|
||||||
|
|
||||||
|
faculty_list.append({
|
||||||
|
"name": fac.get("name"),
|
||||||
|
"url": fac.get("url"),
|
||||||
|
"title": fac.get("title"),
|
||||||
|
"email": fac.get("email"),
|
||||||
|
"program": prog.get("name"),
|
||||||
|
"school": school.get("name")
|
||||||
|
})
|
||||||
|
|
||||||
|
total = len(faculty_list)
|
||||||
|
faculty_list = faculty_list[skip:skip + limit]
|
||||||
|
|
||||||
|
return {
|
||||||
|
"total": total,
|
||||||
|
"skip": skip,
|
||||||
|
"limit": limit,
|
||||||
|
"faculty": faculty_list
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/university/{university_id}/export")
|
||||||
|
def export_result(
|
||||||
|
university_id: int,
|
||||||
|
format: str = Query("json", enum=["json"]),
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""导出爬取结果"""
|
||||||
|
result = db.query(ScrapeResult).filter(
|
||||||
|
ScrapeResult.university_id == university_id
|
||||||
|
).order_by(ScrapeResult.created_at.desc()).first()
|
||||||
|
|
||||||
|
if not result:
|
||||||
|
raise HTTPException(status_code=404, detail="没有爬取结果")
|
||||||
|
|
||||||
|
if format == "json":
|
||||||
|
return JSONResponse(
|
||||||
|
content=result.result_data,
|
||||||
|
headers={
|
||||||
|
"Content-Disposition": f"attachment; filename=university_{university_id}_result.json"
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
raise HTTPException(status_code=400, detail="不支持的格式")
|
||||||
167  backend/app/api/scripts.py  Normal file
@@ -0,0 +1,167 @@
|
|||||||
|
"""爬虫脚本API"""
|
||||||
|
|
||||||
|
from typing import List
|
||||||
|
from fastapi import APIRouter, Depends, HTTPException, BackgroundTasks
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from ..database import get_db
|
||||||
|
from ..models import University, ScraperScript
|
||||||
|
from ..schemas.script import (
|
||||||
|
ScriptCreate,
|
||||||
|
ScriptResponse,
|
||||||
|
GenerateScriptRequest,
|
||||||
|
GenerateScriptResponse
|
||||||
|
)
|
||||||
|
from ..services.script_generator import generate_scraper_script
|
||||||
|
|
||||||
|
router = APIRouter()
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/generate", response_model=GenerateScriptResponse)
|
||||||
|
async def generate_script(
|
||||||
|
data: GenerateScriptRequest,
|
||||||
|
background_tasks: BackgroundTasks,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
一键生成爬虫脚本
|
||||||
|
|
||||||
|
分析大学网站结构,自动生成爬虫脚本
|
||||||
|
"""
|
||||||
|
# 检查或创建大学记录
|
||||||
|
university = db.query(University).filter(University.url == data.university_url).first()
|
||||||
|
|
||||||
|
if not university:
|
||||||
|
# 从URL提取大学名称
|
||||||
|
name = data.university_name
|
||||||
|
if not name:
|
||||||
|
from urllib.parse import urlparse
|
||||||
|
parsed = urlparse(data.university_url)
|
||||||
|
name = parsed.netloc.replace("www.", "").split(".")[0].title()
|
||||||
|
|
||||||
|
university = University(
|
||||||
|
name=name,
|
||||||
|
url=data.university_url,
|
||||||
|
status="analyzing"
|
||||||
|
)
|
||||||
|
db.add(university)
|
||||||
|
db.commit()
|
||||||
|
db.refresh(university)
|
||||||
|
else:
|
||||||
|
# 更新状态
|
||||||
|
university.status = "analyzing"
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
# 在后台执行脚本生成
|
||||||
|
background_tasks.add_task(
|
||||||
|
generate_scraper_script,
|
||||||
|
university_id=university.id,
|
||||||
|
university_url=data.university_url
|
||||||
|
)
|
||||||
|
|
||||||
|
return GenerateScriptResponse(
|
||||||
|
success=True,
|
||||||
|
university_id=university.id,
|
||||||
|
script_id=None,
|
||||||
|
message="正在分析网站结构并生成爬虫脚本...",
|
||||||
|
status="analyzing"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/university/{university_id}", response_model=List[ScriptResponse])
|
||||||
|
def get_university_scripts(
|
||||||
|
university_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取大学的所有爬虫脚本"""
|
||||||
|
scripts = db.query(ScraperScript).filter(
|
||||||
|
ScraperScript.university_id == university_id
|
||||||
|
).order_by(ScraperScript.version.desc()).all()
|
||||||
|
|
||||||
|
return scripts
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/{script_id}", response_model=ScriptResponse)
|
||||||
|
def get_script(
|
||||||
|
script_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取脚本详情"""
|
||||||
|
script = db.query(ScraperScript).filter(ScraperScript.id == script_id).first()
|
||||||
|
if not script:
|
||||||
|
raise HTTPException(status_code=404, detail="脚本不存在")
|
||||||
|
|
||||||
|
return script
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("", response_model=ScriptResponse)
|
||||||
|
def create_script(
|
||||||
|
data: ScriptCreate,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""手动创建脚本"""
|
||||||
|
# 检查大学是否存在
|
||||||
|
university = db.query(University).filter(University.id == data.university_id).first()
|
||||||
|
if not university:
|
||||||
|
raise HTTPException(status_code=404, detail="大学不存在")
|
||||||
|
|
||||||
|
# 获取当前最高版本
|
||||||
|
max_version = db.query(ScraperScript).filter(
|
||||||
|
ScraperScript.university_id == data.university_id
|
||||||
|
).count()
|
||||||
|
|
||||||
|
script = ScraperScript(
|
||||||
|
university_id=data.university_id,
|
||||||
|
script_name=data.script_name,
|
||||||
|
script_content=data.script_content,
|
||||||
|
config_content=data.config_content,
|
||||||
|
version=max_version + 1,
|
||||||
|
status="active"
|
||||||
|
)
|
||||||
|
|
||||||
|
db.add(script)
|
||||||
|
db.commit()
|
||||||
|
db.refresh(script)
|
||||||
|
|
||||||
|
# 更新大学状态
|
||||||
|
university.status = "ready"
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return script
|
||||||
|
|
||||||
|
|
||||||
|
@router.put("/{script_id}", response_model=ScriptResponse)
|
||||||
|
def update_script(
|
||||||
|
script_id: int,
|
||||||
|
data: ScriptCreate,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""更新脚本"""
|
||||||
|
script = db.query(ScraperScript).filter(ScraperScript.id == script_id).first()
|
||||||
|
if not script:
|
||||||
|
raise HTTPException(status_code=404, detail="脚本不存在")
|
||||||
|
|
||||||
|
script.script_content = data.script_content
|
||||||
|
if data.config_content:
|
||||||
|
script.config_content = data.config_content
|
||||||
|
|
||||||
|
db.commit()
|
||||||
|
db.refresh(script)
|
||||||
|
|
||||||
|
return script
|
||||||
|
|
||||||
|
|
||||||
|
@router.delete("/{script_id}")
|
||||||
|
def delete_script(
|
||||||
|
script_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""删除脚本"""
|
||||||
|
script = db.query(ScraperScript).filter(ScraperScript.id == script_id).first()
|
||||||
|
if not script:
|
||||||
|
raise HTTPException(status_code=404, detail="脚本不存在")
|
||||||
|
|
||||||
|
db.delete(script)
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return {"message": "删除成功"}
|
||||||
165  backend/app/api/universities.py  Normal file
@@ -0,0 +1,165 @@
|
|||||||
|
"""大学管理API"""
|
||||||
|
|
||||||
|
from typing import List, Optional
|
||||||
|
from fastapi import APIRouter, Depends, HTTPException, Query
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from ..database import get_db
|
||||||
|
from ..models import University, ScrapeResult
|
||||||
|
from ..schemas.university import (
|
||||||
|
UniversityCreate,
|
||||||
|
UniversityUpdate,
|
||||||
|
UniversityResponse,
|
||||||
|
UniversityListResponse
|
||||||
|
)
|
||||||
|
|
||||||
|
router = APIRouter()
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("", response_model=UniversityListResponse)
|
||||||
|
def list_universities(
|
||||||
|
skip: int = Query(0, ge=0),
|
||||||
|
limit: int = Query(20, ge=1, le=100),
|
||||||
|
search: Optional[str] = None,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取大学列表"""
|
||||||
|
query = db.query(University)
|
||||||
|
|
||||||
|
if search:
|
||||||
|
query = query.filter(University.name.ilike(f"%{search}%"))
|
||||||
|
|
||||||
|
total = query.count()
|
||||||
|
universities = query.order_by(University.created_at.desc()).offset(skip).limit(limit).all()
|
||||||
|
|
||||||
|
# 添加统计信息
|
||||||
|
items = []
|
||||||
|
for uni in universities:
|
||||||
|
# 获取最新结果
|
||||||
|
latest_result = db.query(ScrapeResult).filter(
|
||||||
|
ScrapeResult.university_id == uni.id
|
||||||
|
).order_by(ScrapeResult.created_at.desc()).first()
|
||||||
|
|
||||||
|
items.append(UniversityResponse(
|
||||||
|
id=uni.id,
|
||||||
|
name=uni.name,
|
||||||
|
url=uni.url,
|
||||||
|
country=uni.country,
|
||||||
|
description=uni.description,
|
||||||
|
status=uni.status,
|
||||||
|
created_at=uni.created_at,
|
||||||
|
updated_at=uni.updated_at,
|
||||||
|
scripts_count=len(uni.scripts),
|
||||||
|
jobs_count=len(uni.jobs),
|
||||||
|
latest_result={
|
||||||
|
"schools_count": latest_result.schools_count,
|
||||||
|
"programs_count": latest_result.programs_count,
|
||||||
|
"faculty_count": latest_result.faculty_count,
|
||||||
|
"created_at": latest_result.created_at.isoformat()
|
||||||
|
} if latest_result else None
|
||||||
|
))
|
||||||
|
|
||||||
|
return UniversityListResponse(total=total, items=items)
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("", response_model=UniversityResponse)
|
||||||
|
def create_university(
|
||||||
|
data: UniversityCreate,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""创建大学"""
|
||||||
|
# 检查是否已存在
|
||||||
|
existing = db.query(University).filter(University.url == data.url).first()
|
||||||
|
if existing:
|
||||||
|
raise HTTPException(status_code=400, detail="该大学URL已存在")
|
||||||
|
|
||||||
|
university = University(**data.model_dump())
|
||||||
|
db.add(university)
|
||||||
|
db.commit()
|
||||||
|
db.refresh(university)
|
||||||
|
|
||||||
|
return UniversityResponse(
|
||||||
|
id=university.id,
|
||||||
|
name=university.name,
|
||||||
|
url=university.url,
|
||||||
|
country=university.country,
|
||||||
|
description=university.description,
|
||||||
|
status=university.status,
|
||||||
|
created_at=university.created_at,
|
||||||
|
updated_at=university.updated_at,
|
||||||
|
scripts_count=0,
|
||||||
|
jobs_count=0,
|
||||||
|
latest_result=None
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/{university_id}", response_model=UniversityResponse)
|
||||||
|
def get_university(
|
||||||
|
university_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""获取大学详情"""
|
||||||
|
university = db.query(University).filter(University.id == university_id).first()
|
||||||
|
if not university:
|
||||||
|
raise HTTPException(status_code=404, detail="大学不存在")
|
||||||
|
|
||||||
|
# 获取最新结果
|
||||||
|
latest_result = db.query(ScrapeResult).filter(
|
||||||
|
ScrapeResult.university_id == university.id
|
||||||
|
).order_by(ScrapeResult.created_at.desc()).first()
|
||||||
|
|
||||||
|
return UniversityResponse(
|
||||||
|
id=university.id,
|
||||||
|
name=university.name,
|
||||||
|
url=university.url,
|
||||||
|
country=university.country,
|
||||||
|
description=university.description,
|
||||||
|
status=university.status,
|
||||||
|
created_at=university.created_at,
|
||||||
|
updated_at=university.updated_at,
|
||||||
|
scripts_count=len(university.scripts),
|
||||||
|
jobs_count=len(university.jobs),
|
||||||
|
latest_result={
|
||||||
|
"schools_count": latest_result.schools_count,
|
||||||
|
"programs_count": latest_result.programs_count,
|
||||||
|
"faculty_count": latest_result.faculty_count,
|
||||||
|
"created_at": latest_result.created_at.isoformat()
|
||||||
|
} if latest_result else None
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.put("/{university_id}", response_model=UniversityResponse)
|
||||||
|
def update_university(
|
||||||
|
university_id: int,
|
||||||
|
data: UniversityUpdate,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""更新大学信息"""
|
||||||
|
university = db.query(University).filter(University.id == university_id).first()
|
||||||
|
if not university:
|
||||||
|
raise HTTPException(status_code=404, detail="大学不存在")
|
||||||
|
|
||||||
|
update_data = data.model_dump(exclude_unset=True)
|
||||||
|
for field, value in update_data.items():
|
||||||
|
setattr(university, field, value)
|
||||||
|
|
||||||
|
db.commit()
|
||||||
|
db.refresh(university)
|
||||||
|
|
||||||
|
return get_university(university_id, db)
|
||||||
|
|
||||||
|
|
||||||
|
@router.delete("/{university_id}")
|
||||||
|
def delete_university(
|
||||||
|
university_id: int,
|
||||||
|
db: Session = Depends(get_db)
|
||||||
|
):
|
||||||
|
"""删除大学"""
|
||||||
|
university = db.query(University).filter(University.id == university_id).first()
|
||||||
|
if not university:
|
||||||
|
raise HTTPException(status_code=404, detail="大学不存在")
|
||||||
|
|
||||||
|
db.delete(university)
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return {"message": "删除成功"}
|
||||||
37  backend/app/config.py  Normal file
@@ -0,0 +1,37 @@
"""Application configuration"""

from pydantic_settings import BaseSettings
from typing import Optional


class Settings(BaseSettings):
    """Application settings"""

    # Application
    APP_NAME: str = "University Scraper API"
    APP_VERSION: str = "1.0.0"
    DEBUG: bool = True

    # Database
    DATABASE_URL: str = "sqlite:///./university_scraper.db"  # SQLite for development
    # For production use: postgresql://user:password@localhost/university_scraper

    # Redis (used by the Celery task queue)
    REDIS_URL: str = "redis://localhost:6379/0"

    # CORS
    CORS_ORIGINS: list = ["http://localhost:3000", "http://127.0.0.1:3000"]

    # Agent settings (for automatic script generation)
    OPENROUTER_API_KEY: Optional[str] = None

    # File storage paths
    SCRIPTS_DIR: str = "./scripts"
    RESULTS_DIR: str = "./results"

    class Config:
        env_file = ".env"
        case_sensitive = True


settings = Settings()
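Because `Settings` is a pydantic-settings model with `env_file = ".env"` and `case_sensitive = True`, any field can be overridden through an environment variable (or a `.env` entry) whose name matches the field exactly. A minimal sketch of this behaviour, with hypothetical values and assuming it is run from the `backend/` directory:

```python
import os

# Hypothetical overrides; the PostgreSQL DSN is only an illustrative value
os.environ["DATABASE_URL"] = "postgresql://scraper:secret@localhost/university_scraper"
os.environ["DEBUG"] = "False"

from app.config import Settings

settings = Settings()          # re-reads the environment with the overrides applied
print(settings.DATABASE_URL)   # -> the PostgreSQL DSN instead of the SQLite default
print(settings.DEBUG)          # -> False
```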
35  backend/app/database.py  Normal file
@@ -0,0 +1,35 @@
"""Database connection and session management"""

from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

from .config import settings

# Create the database engine
engine = create_engine(
    settings.DATABASE_URL,
    connect_args={"check_same_thread": False} if "sqlite" in settings.DATABASE_URL else {},
    echo=settings.DEBUG
)

# Session factory
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)

# Declarative base class
Base = declarative_base()


def get_db():
    """Yield a database session (dependency injection)"""
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()


def init_db():
    """Initialize the database (create all tables)"""
    from .models import university, script, job, result  # noqa
    Base.metadata.create_all(bind=engine)
72  backend/app/main.py  Normal file
@@ -0,0 +1,72 @@
"""
University Scraper Web API

Main application entry point
"""

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from .config import settings
from .database import init_db
from .api import api_router

# Create the application
app = FastAPI(
    title=settings.APP_NAME,
    version=settings.APP_VERSION,
    description="""
## University Scraper Web System API

### Features
- 🏫 **University management**: add, edit, and delete universities
- 📜 **Script generation**: generate a scraper script with one click
- 🚀 **Job execution**: run the scraper with one click
- 📊 **Data viewing**: browse and export scraped results

### Data structure
University → School → Program → Faculty
    """,
    docs_url="/docs",
    redoc_url="/redoc"
)

# Configure CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=settings.CORS_ORIGINS,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Register routes
app.include_router(api_router, prefix="/api")


@app.on_event("startup")
async def startup_event():
    """Initialize the database when the application starts"""
    init_db()


@app.get("/")
async def root():
    """Root route"""
    return {
        "name": settings.APP_NAME,
        "version": settings.APP_VERSION,
        "docs": "/docs",
        "api": "/api"
    }


@app.get("/health")
async def health_check():
    """Health check"""
    return {"status": "healthy"}


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
8  backend/app/models/__init__.py  Normal file
@@ -0,0 +1,8 @@
"""Database models"""

from .university import University
from .script import ScraperScript
from .job import ScrapeJob, ScrapeLog
from .result import ScrapeResult

__all__ = ["University", "ScraperScript", "ScrapeJob", "ScrapeLog", "ScrapeResult"]
56  backend/app/models/job.py  Normal file
@@ -0,0 +1,56 @@
"""Scrape job models"""

from datetime import datetime
from sqlalchemy import Column, Integer, String, DateTime, Text, ForeignKey
from sqlalchemy.orm import relationship

from ..database import Base


class ScrapeJob(Base):
    """Scrape job table"""

    __tablename__ = "scrape_jobs"

    id = Column(Integer, primary_key=True, index=True)
    university_id = Column(Integer, ForeignKey("universities.id"), nullable=False)
    script_id = Column(Integer, ForeignKey("scraper_scripts.id"))

    status = Column(String(50), default="pending")  # pending, running, completed, failed, cancelled
    progress = Column(Integer, default=0)  # progress percentage, 0-100
    current_step = Column(String(255))  # description of the current step

    started_at = Column(DateTime)
    completed_at = Column(DateTime)
    error_message = Column(Text)

    created_at = Column(DateTime, default=datetime.utcnow)

    # Relationships
    university = relationship("University", back_populates="jobs")
    script = relationship("ScraperScript", back_populates="jobs")
    logs = relationship("ScrapeLog", back_populates="job", cascade="all, delete-orphan")
    results = relationship("ScrapeResult", back_populates="job", cascade="all, delete-orphan")

    def __repr__(self):
        return f"<ScrapeJob(id={self.id}, status='{self.status}')>"


class ScrapeLog(Base):
    """Scrape log table"""

    __tablename__ = "scrape_logs"

    id = Column(Integer, primary_key=True, index=True)
    job_id = Column(Integer, ForeignKey("scrape_jobs.id"), nullable=False)

    level = Column(String(20), default="info")  # debug, info, warning, error
    message = Column(Text, nullable=False)

    created_at = Column(DateTime, default=datetime.utcnow)

    # Relationships
    job = relationship("ScrapeJob", back_populates="logs")

    def __repr__(self):
        return f"<ScrapeLog(id={self.id}, level='{self.level}')>"
34  backend/app/models/result.py  Normal file
@@ -0,0 +1,34 @@
"""Scrape result model"""

from datetime import datetime
from sqlalchemy import Column, Integer, DateTime, ForeignKey, JSON
from sqlalchemy.orm import relationship

from ..database import Base


class ScrapeResult(Base):
    """Scrape result table"""

    __tablename__ = "scrape_results"

    id = Column(Integer, primary_key=True, index=True)
    job_id = Column(Integer, ForeignKey("scrape_jobs.id"))
    university_id = Column(Integer, ForeignKey("universities.id"), nullable=False)

    # JSON data: school → program → faculty hierarchy
    result_data = Column(JSON, nullable=False)

    # Summary statistics
    schools_count = Column(Integer, default=0)
    programs_count = Column(Integer, default=0)
    faculty_count = Column(Integer, default=0)

    created_at = Column(DateTime, default=datetime.utcnow)

    # Relationships
    job = relationship("ScrapeJob", back_populates="results")
    university = relationship("University", back_populates="results")

    def __repr__(self):
        return f"<ScrapeResult(id={self.id}, programs={self.programs_count}, faculty={self.faculty_count})>"
34  backend/app/models/script.py  Normal file
@@ -0,0 +1,34 @@
"""Scraper script model"""

from datetime import datetime
from sqlalchemy import Column, Integer, String, DateTime, Text, ForeignKey, JSON
from sqlalchemy.orm import relationship

from ..database import Base


class ScraperScript(Base):
    """Scraper script table"""

    __tablename__ = "scraper_scripts"

    id = Column(Integer, primary_key=True, index=True)
    university_id = Column(Integer, ForeignKey("universities.id"), nullable=False)

    script_name = Column(String(255), nullable=False)
    script_content = Column(Text, nullable=False)  # Python script source code
    config_content = Column(JSON)  # YAML config stored as JSON

    version = Column(Integer, default=1)
    status = Column(String(50), default="draft")  # draft, active, deprecated, error
    error_message = Column(Text)

    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    # Relationships
    university = relationship("University", back_populates="scripts")
    jobs = relationship("ScrapeJob", back_populates="script")

    def __repr__(self):
        return f"<ScraperScript(id={self.id}, name='{self.script_name}')>"
31  backend/app/models/university.py  Normal file
@@ -0,0 +1,31 @@
"""University model"""

from datetime import datetime
from sqlalchemy import Column, Integer, String, DateTime, Text
from sqlalchemy.orm import relationship

from ..database import Base


class University(Base):
    """University table"""

    __tablename__ = "universities"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String(255), nullable=False, index=True)
    url = Column(String(500), nullable=False)
    country = Column(String(100))
    description = Column(Text)
    status = Column(String(50), default="pending")  # pending, analyzing, ready, error

    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    # Relationships
    scripts = relationship("ScraperScript", back_populates="university", cascade="all, delete-orphan")
    jobs = relationship("ScrapeJob", back_populates="university", cascade="all, delete-orphan")
    results = relationship("ScrapeResult", back_populates="university", cascade="all, delete-orphan")

    def __repr__(self):
        return f"<University(id={self.id}, name='{self.name}')>"
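With the engine, session factory, and the models above in place, creating a university together with its related records is a standard SQLAlchemy session workflow, and the `delete-orphan` cascades keep child rows in sync. A minimal sketch, assuming it is run from the `backend/` directory; the names and URL are illustrative values only:

```python
from app.database import SessionLocal, init_db
from app.models import University, ScraperScript

init_db()  # create all tables on first run
db = SessionLocal()

# Create a university and attach a first script version (illustrative values)
uni = University(name="Harvard", url="https://www.harvard.edu/", country="USA")
uni.scripts.append(ScraperScript(script_name="harvard_scraper.py",
                                 script_content="# generated scraper code here",
                                 version=1, status="active"))
db.add(uni)
db.commit()

# Deleting the university cascades to its scripts, jobs, and results
db.delete(uni)
db.commit()
db.close()
```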
33  backend/app/schemas/__init__.py  Normal file
@@ -0,0 +1,33 @@
"""Pydantic schemas for API"""

from .university import (
    UniversityCreate,
    UniversityUpdate,
    UniversityResponse,
    UniversityListResponse
)
from .script import (
    ScriptCreate,
    ScriptResponse,
    GenerateScriptRequest,
    GenerateScriptResponse
)
from .job import (
    JobCreate,
    JobResponse,
    JobStatusResponse,
    LogResponse
)
from .result import (
    ResultResponse,
    SchoolData,
    ProgramData,
    FacultyData
)

__all__ = [
    "UniversityCreate", "UniversityUpdate", "UniversityResponse", "UniversityListResponse",
    "ScriptCreate", "ScriptResponse", "GenerateScriptRequest", "GenerateScriptResponse",
    "JobCreate", "JobResponse", "JobStatusResponse", "LogResponse",
    "ResultResponse", "SchoolData", "ProgramData", "FacultyData"
]
52  backend/app/schemas/job.py  Normal file
@@ -0,0 +1,52 @@
"""Pydantic models for scrape jobs"""

from datetime import datetime
from typing import Optional, List
from pydantic import BaseModel


class JobCreate(BaseModel):
    """Request body for creating a job"""
    university_id: int
    script_id: Optional[int] = None


class JobResponse(BaseModel):
    """Job response"""
    id: int
    university_id: int
    script_id: Optional[int] = None
    status: str
    progress: int
    current_step: Optional[str] = None
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    error_message: Optional[str] = None
    created_at: datetime

    class Config:
        from_attributes = True


class JobStatusResponse(BaseModel):
    """Job status response"""
    id: int
    status: str
    progress: int
    current_step: Optional[str] = None
    logs: List["LogResponse"] = []


class LogResponse(BaseModel):
    """Log entry response"""
    id: int
    level: str
    message: str
    created_at: datetime

    class Config:
        from_attributes = True


# Resolve the forward reference
JobStatusResponse.model_rebuild()
67  backend/app/schemas/result.py  Normal file
@@ -0,0 +1,67 @@
"""Pydantic models for scrape results"""

from datetime import datetime
from typing import Optional, List, Dict, Any
from pydantic import BaseModel


class FacultyData(BaseModel):
    """Faculty member data"""
    name: str
    url: str
    title: Optional[str] = None
    email: Optional[str] = None
    department: Optional[str] = None


class ProgramData(BaseModel):
    """Program data"""
    name: str
    url: str
    degree_type: Optional[str] = None
    description: Optional[str] = None
    faculty_page_url: Optional[str] = None
    faculty_count: int = 0
    faculty: List[FacultyData] = []


class SchoolData(BaseModel):
    """School data"""
    name: str
    url: str
    description: Optional[str] = None
    program_count: int = 0
    programs: List[ProgramData] = []


class ResultResponse(BaseModel):
    """Full result response"""
    id: int
    university_id: int
    job_id: Optional[int] = None

    # Summary statistics
    schools_count: int
    programs_count: int
    faculty_count: int

    # Full data
    result_data: Dict[str, Any]

    created_at: datetime

    class Config:
        from_attributes = True


class ResultSummary(BaseModel):
    """Result summary"""
    id: int
    university_id: int
    schools_count: int
    programs_count: int
    faculty_count: int
    created_at: datetime

    class Config:
        from_attributes = True
46  backend/app/schemas/script.py  Normal file
@@ -0,0 +1,46 @@
"""Pydantic models for scraper scripts"""

from datetime import datetime
from typing import Optional, Dict, Any
from pydantic import BaseModel


class ScriptBase(BaseModel):
    """Shared script fields"""
    script_name: str
    script_content: str
    config_content: Optional[Dict[str, Any]] = None


class ScriptCreate(ScriptBase):
    """Request body for creating a script"""
    university_id: int


class ScriptResponse(ScriptBase):
    """Script response"""
    id: int
    university_id: int
    version: int
    status: str
    error_message: Optional[str] = None
    created_at: datetime
    updated_at: datetime

    class Config:
        from_attributes = True


class GenerateScriptRequest(BaseModel):
    """Request body for generating a script"""
    university_url: str
    university_name: Optional[str] = None


class GenerateScriptResponse(BaseModel):
    """Response for script generation"""
    success: bool
    university_id: int
    script_id: Optional[int] = None
    message: str
    status: str  # analyzing, completed, failed
48
backend/app/schemas/university.py
Normal file
@@ -0,0 +1,48 @@
"""Pydantic models for universities."""

from datetime import datetime
from typing import Optional, List
from pydantic import BaseModel, HttpUrl


class UniversityBase(BaseModel):
    """Base university fields."""
    name: str
    url: str
    country: Optional[str] = None
    description: Optional[str] = None


class UniversityCreate(UniversityBase):
    """Create-university request."""
    pass


class UniversityUpdate(BaseModel):
    """Update-university request."""
    name: Optional[str] = None
    url: Optional[str] = None
    country: Optional[str] = None
    description: Optional[str] = None


class UniversityResponse(UniversityBase):
    """University response."""
    id: int
    status: str
    created_at: datetime
    updated_at: datetime

    # Statistics
    scripts_count: int = 0
    jobs_count: int = 0
    latest_result: Optional[dict] = None

    class Config:
        from_attributes = True


class UniversityListResponse(BaseModel):
    """University list response."""
    total: int
    items: List[UniversityResponse]

6
backend/app/services/__init__.py
Normal file
@@ -0,0 +1,6 @@
"""Business services."""

from .script_generator import generate_scraper_script
from .scraper_runner import run_scraper

__all__ = ["generate_scraper_script", "run_scraper"]

177
backend/app/services/scraper_runner.py
Normal file
@@ -0,0 +1,177 @@
"""
Scraper execution service.

Runs scraper scripts and stores their results.
"""

import asyncio
import json
import re
import sys
import traceback
from datetime import datetime, timezone
from urllib.parse import urljoin, urlparse
from sqlalchemy.orm import Session

# On Windows the Proactor event loop policy is required
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

# Import playwright for use by executed scripts
try:
    from playwright.async_api import async_playwright
    PLAYWRIGHT_AVAILABLE = True
except ImportError:
    PLAYWRIGHT_AVAILABLE = False
    async_playwright = None

from ..database import SessionLocal
from ..models import ScraperScript, ScrapeJob, ScrapeLog, ScrapeResult


def run_scraper(job_id: int, script_id: int):
    """Background task that executes a scraper."""
    db = SessionLocal()

    try:
        job = db.query(ScrapeJob).filter(ScrapeJob.id == job_id).first()
        script = db.query(ScraperScript).filter(ScraperScript.id == script_id).first()

        if not job or not script:
            return

        # Update job status
        job.status = "running"
        job.started_at = datetime.utcnow()
        job.current_step = "正在初始化..."
        job.progress = 5
        db.commit()

        _add_log(db, job_id, "info", "开始执行爬虫脚本")

        # Logging callback passed into the executed script
        def log_callback(level: str, message: str):
            _add_log(db, job_id, level, message)

        # Execute the script
        job.current_step = "正在爬取数据..."
        job.progress = 20
        db.commit()

        result_data = _execute_script(script.script_content, log_callback)

        if result_data:
            job.progress = 80
            job.current_step = "正在保存结果..."
            db.commit()

            _add_log(db, job_id, "info", "爬取完成,正在保存结果...")

            # Compute statistics
            schools = result_data.get("schools", [])
            schools_count = len(schools)
            programs_count = sum(len(s.get("programs", [])) for s in schools)
            faculty_count = sum(
                len(p.get("faculty", []))
                for s in schools
                for p in s.get("programs", [])
            )

            # Save the result
            result = ScrapeResult(
                job_id=job_id,
                university_id=job.university_id,
                result_data=result_data,
                schools_count=schools_count,
                programs_count=programs_count,
                faculty_count=faculty_count
            )
            db.add(result)

            job.status = "completed"
            job.progress = 100
            job.current_step = "完成"
            job.completed_at = datetime.utcnow()

            _add_log(
                db, job_id, "info",
                f"爬取成功: {schools_count}个学院, {programs_count}个项目, {faculty_count}位导师"
            )

        else:
            job.status = "failed"
            job.error_message = "脚本执行无返回结果"
            job.completed_at = datetime.utcnow()
            _add_log(db, job_id, "error", "脚本执行失败: 无返回结果")

        db.commit()

    except Exception as e:
        error_msg = f"执行出错: {str(e)}\n{traceback.format_exc()}"
        _add_log(db, job_id, "error", error_msg)

        job = db.query(ScrapeJob).filter(ScrapeJob.id == job_id).first()
        if job:
            job.status = "failed"
            job.error_message = str(e)
            job.completed_at = datetime.utcnow()
            db.commit()

    finally:
        db.close()


def _execute_script(script_content: str, log_callback) -> dict:
    """
    Execute the Python script content.

    Runs the script in an isolated namespace.
    """
    if not PLAYWRIGHT_AVAILABLE:
        log_callback("error", "Playwright 未安装,请运行: pip install playwright && playwright install")
        return None

    # Build the execution environment with every module the script needs.
    # Note: the same dict is used as both globals and locals so that functions
    # defined by the script can see each other.
    exec_namespace = {
        "__builtins__": __builtins__,
        "asyncio": asyncio,
        "json": json,
        "re": re,
        "datetime": datetime,
        "timezone": timezone,
        "urljoin": urljoin,
        "urlparse": urlparse,
        "async_playwright": async_playwright,
    }

    try:
        # Compile and run the script in a single namespace so its functions can call one another
        exec(script_content, exec_namespace, exec_namespace)

        # Look up the scrape entry point
        scrape_func = exec_namespace.get("scrape")
        if not scrape_func:
            log_callback("error", "脚本中未找到 scrape 函数")
            return None

        # Run the async scraper
        result = asyncio.run(scrape_func(output_callback=log_callback))
        return result

    except Exception as e:
        log_callback("error", f"脚本执行异常: {str(e)}")
        raise


def _add_log(db: Session, job_id: int, level: str, message: str):
    """Append a log entry."""
    log = ScrapeLog(
        job_id=job_id,
        level=level,
        message=message
    )
    db.add(log)
    db.commit()

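> Editor's note: a minimal sketch (not part of the commit) of the contract `_execute_script` expects from a generated script: a module-level async `scrape(output_callback=None)` returning a dict with a `"schools"` list. The script body below is a stand-in, not a real scraper.

```python
import asyncio

# Stand-in script text; a real one is produced by script_generator.py.
SCRIPT = '''
async def scrape(output_callback=None):
    if output_callback:
        output_callback("info", "starting")
    return {"schools": [{"name": "Demo School", "programs": []}]}
'''

# Same dict as globals and locals, mirroring _execute_script.
namespace = {"__builtins__": __builtins__}
exec(SCRIPT, namespace, namespace)

result = asyncio.run(namespace["scrape"](output_callback=lambda lvl, msg: print(lvl, msg)))
print(result["schools"][0]["name"])  # -> Demo School
```
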
558
backend/app/services/script_generator.py
Normal file
@@ -0,0 +1,558 @@
"""
Scraper script generation service.

Analyzes a university website's structure and generates a scraper script.
"""

import re
from datetime import datetime
from urllib.parse import urlparse
from sqlalchemy.orm import Session

from ..database import SessionLocal
from ..models import University, ScraperScript


# Preset scraper templates per university domain
SCRAPER_TEMPLATES = {
    "harvard.edu": "harvard_scraper",
    "mit.edu": "generic_scraper",
    "stanford.edu": "generic_scraper",
}


def generate_scraper_script(university_id: int, university_url: str):
    """
    Background task that generates a scraper script.

    1. Analyze the university website's domain
    2. Use a preset template if one exists
    3. Otherwise generate a generic scraper script
    """
    db = SessionLocal()
    university = None

    try:
        university = db.query(University).filter(University.id == university_id).first()
        if not university:
            return

        # Parse the URL to get the domain
        parsed = urlparse(university_url)
        domain = parsed.netloc.replace("www.", "")

        # Check for a preset template
        template_name = None
        for pattern, template in SCRAPER_TEMPLATES.items():
            if pattern in domain:
                template_name = template
                break

        # Generate the script
        script_content = _generate_script_content(domain, template_name)
        config_content = _generate_config_content(university.name, university_url, domain)

        # Compute the version number
        existing_count = db.query(ScraperScript).filter(
            ScraperScript.university_id == university_id
        ).count()

        # Save the script
        script = ScraperScript(
            university_id=university_id,
            script_name=f"{domain.replace('.', '_')}_scraper",
            script_content=script_content,
            config_content=config_content,
            version=existing_count + 1,
            status="active"
        )

        db.add(script)

        # Update the university status
        university.status = "ready"

        db.commit()

    except Exception as e:
        # Record the error
        if university:
            university.status = "error"
            db.commit()
        raise e

    finally:
        db.close()


def _generate_script_content(domain: str, template_name: str = None) -> str:
    """Generate the Python scraper script content."""

    if template_name == "harvard_scraper":
        return '''"""
Harvard University dedicated scraper script
Auto-generated
"""

import asyncio
import json
from datetime import datetime, timezone
from playwright.async_api import async_playwright

# School URL mapping
SCHOOL_MAPPING = {
    "gsas.harvard.edu": "Graduate School of Arts and Sciences (GSAS)",
    "seas.harvard.edu": "John A. Paulson School of Engineering and Applied Sciences (SEAS)",
    "hbs.edu": "Harvard Business School (HBS)",
    "gsd.harvard.edu": "Graduate School of Design (GSD)",
    "gse.harvard.edu": "Graduate School of Education (HGSE)",
    "hks.harvard.edu": "Harvard Kennedy School (HKS)",
    "hls.harvard.edu": "Harvard Law School (HLS)",
    "hms.harvard.edu": "Harvard Medical School (HMS)",
    "hsph.harvard.edu": "T.H. Chan School of Public Health (HSPH)",
    "hds.harvard.edu": "Harvard Divinity School (HDS)",
    "fas.harvard.edu": "Faculty of Arts and Sciences (FAS)",
}


async def scrape(output_callback=None):
    """Run the scrape."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        result = {
            "name": "Harvard University",
            "url": "https://www.harvard.edu/",
            "country": "USA",
            "scraped_at": datetime.now(timezone.utc).isoformat(),
            "schools": []
        }

        # Visit the program listing page
        if output_callback:
            output_callback("info", "访问Harvard项目列表...")

        await page.goto("https://www.harvard.edu/programs/?degree_levels=graduate")
        await page.wait_for_timeout(3000)

        # Extract program data
        programs = await page.evaluate("""() => {
            const items = document.querySelectorAll('[class*="records__record"]');
            const programs = [];
            items.forEach(item => {
                const btn = item.querySelector('button[class*="title-link"]');
                if (btn) {
                    programs.push({
                        name: btn.innerText.trim(),
                        url: ''
                    });
                }
            });
            return programs;
        }""")

        if output_callback:
            output_callback("info", f"找到 {len(programs)} 个项目")

        # Simplified output
        result["schools"] = [{
            "name": "Graduate Programs",
            "url": "https://www.harvard.edu/programs/",
            "programs": [{"name": p["name"], "url": p["url"], "faculty": []} for p in programs[:50]]
        }]

        await browser.close()

        return result


if __name__ == "__main__":
    result = asyncio.run(scrape())
    print(json.dumps(result, indent=2, ensure_ascii=False))
'''

    # Generic scraper template - deep-crawls masters programs.
    # String blocks are composed separately to avoid quoting conflicts between
    # the f-string and the embedded JavaScript.
    return _build_generic_scraper_template(domain)


def _build_generic_scraper_template(domain: str) -> str:
    """Build the generic scraper template."""

    # JavaScript code blocks (use raw strings to avoid escaping issues)
    js_check_courses = r'''() => {
    const links = document.querySelectorAll('a[href]');
    let courseCount = 0;
    for (const a of links) {
        const href = a.href.toLowerCase();
        if (/\/\d{4,}\//.test(href) ||
            /\/(msc|ma|mba|mres|llm|med|meng)-/.test(href) ||
            /\/course\/[a-z]/.test(href)) {
            courseCount++;
        }
    }
    return courseCount;
}'''

    js_find_list_url = r'''() => {
    const links = document.querySelectorAll('a[href]');
    for (const a of links) {
        const text = a.innerText.toLowerCase();
        const href = a.href.toLowerCase();
        if ((text.includes('a-z') || text.includes('all course') ||
             text.includes('full list') || text.includes('browse all') ||
             href.includes('/list')) &&
            (href.includes('master') || href.includes('course') || href.includes('postgrad'))) {
            return a.href;
        }
    }
    return null;
}'''

    js_find_courses_from_home = r'''() => {
    const links = document.querySelectorAll('a[href]');
    for (const a of links) {
        const href = a.href.toLowerCase();
        const text = a.innerText.toLowerCase();
        if ((href.includes('master') || href.includes('postgraduate') || href.includes('graduate')) &&
            (href.includes('course') || href.includes('program') || href.includes('degree'))) {
            return a.href;
        }
    }
    return null;
}'''

    js_extract_programs = r'''() => {
    const programs = [];
    const seen = new Set();
    const currentHost = window.location.hostname;

    document.querySelectorAll('a[href]').forEach(a => {
        const href = a.href;
        const text = a.innerText.trim().replace(/\s+/g, ' ');

        if (!href || seen.has(href)) return;
        if (text.length < 5 || text.length > 200) return;
        if (href.includes('#') || href.includes('javascript:') || href.includes('mailto:')) return;

        try {
            const linkHost = new URL(href).hostname;
            if (!linkHost.includes(currentHost.replace('www.', '')) &&
                !currentHost.includes(linkHost.replace('www.', ''))) return;
        } catch {
            return;
        }

        const hrefLower = href.toLowerCase();
        const textLower = text.toLowerCase();

        const isNavigation = textLower === 'courses' ||
            textLower === 'programmes' ||
            textLower === 'undergraduate' ||
            textLower === 'postgraduate' ||
            textLower === 'masters' ||
            textLower === "master's" ||
            textLower.includes('skip to') ||
            textLower.includes('share') ||
            textLower === 'home' ||
            textLower === 'study' ||
            textLower.startsWith('a-z') ||
            textLower.includes('admission') ||
            textLower.includes('fees and funding') ||
            textLower.includes('why should') ||
            textLower.includes('why manchester') ||
            textLower.includes('teaching and learning') ||
            textLower.includes('meet us') ||
            textLower.includes('student support') ||
            textLower.includes('contact us') ||
            textLower.includes('how to apply') ||
            hrefLower.includes('/admissions/') ||
            hrefLower.includes('/fees-and-funding/') ||
            hrefLower.includes('/why-') ||
            hrefLower.includes('/meet-us/') ||
            hrefLower.includes('/contact-us/') ||
            hrefLower.includes('/student-support/') ||
            hrefLower.includes('/teaching-and-learning/') ||
            hrefLower.endsWith('/courses/') ||
            hrefLower.endsWith('/masters/') ||
            hrefLower.endsWith('/postgraduate/');

        if (isNavigation) return;

        const isExcluded = hrefLower.includes('/undergraduate') ||
            hrefLower.includes('/bachelor') ||
            hrefLower.includes('/phd/') ||
            hrefLower.includes('/doctoral') ||
            hrefLower.includes('/research-degree') ||
            textLower.includes('bachelor') ||
            textLower.includes('undergraduate') ||
            (textLower.includes('phd') && !textLower.includes('mphil'));

        if (isExcluded) return;

        const hasNumericId = /\/\d{4,}\//.test(href);
        const hasDegreeSlug = /\/(msc|ma|mba|mres|llm|med|meng|mpa|mph|mphil)-[a-z]/.test(hrefLower);
        const isCoursePage = (hrefLower.includes('/course/') ||
                              hrefLower.includes('/courses/list/') ||
                              hrefLower.includes('/programme/')) &&
                             href.split('/').filter(p => p).length > 4;
        const textHasDegree = /\b(msc|ma|mba|mres|llm|med|meng|pgcert|pgdip)\b/i.test(text) ||
                              textLower.includes('master');

        if (hasNumericId || hasDegreeSlug || isCoursePage || textHasDegree) {
            seen.add(href);
            programs.push({
                name: text,
                url: href
            });
        }
    });

    return programs;
}'''

    js_extract_faculty = r'''() => {
    const faculty = [];
    const seen = new Set();

    document.querySelectorAll('a[href]').forEach(a => {
        const href = a.href.toLowerCase();
        const text = a.innerText.trim();

        if (seen.has(href)) return;
        if (text.length < 3 || text.length > 100) return;

        const isStaff = href.includes('/people/') ||
            href.includes('/staff/') ||
            href.includes('/faculty/') ||
            href.includes('/profile/') ||
            href.includes('/academics/') ||
            href.includes('/researcher/');

        if (isStaff) {
            seen.add(href);
            faculty.push({
                name: text.replace(/\s+/g, ' '),
                url: a.href
            });
        }
    });

    return faculty.slice(0, 20);
}'''

    university_name = domain.split('.')[0].title()

    template = f'''"""
Generic university scraper script
Target: {domain}
Auto-generated - deep-crawls masters programs and faculty information
"""

import asyncio
import json
import re
from datetime import datetime, timezone
from urllib.parse import urljoin, urlparse
from playwright.async_api import async_playwright


MASTERS_PATHS = [
    "/study/masters/courses/list/",
    "/study/masters/courses/",
    "/postgraduate/taught/courses/",
    "/postgraduate/courses/list/",
    "/postgraduate/courses/",
    "/graduate/programs/",
    "/academics/graduate/programs/",
    "/programmes/masters/",
    "/masters/programmes/",
    "/admissions/graduate/programs/",
]

JS_CHECK_COURSES = """{js_check_courses}"""

JS_FIND_LIST_URL = """{js_find_list_url}"""

JS_FIND_COURSES_FROM_HOME = """{js_find_courses_from_home}"""

JS_EXTRACT_PROGRAMS = """{js_extract_programs}"""

JS_EXTRACT_FACULTY = """{js_extract_faculty}"""


async def find_course_list_page(page, base_url, output_callback):
    for path in MASTERS_PATHS:
        test_url = base_url.rstrip('/') + path
        try:
            response = await page.goto(test_url, wait_until="domcontentloaded", timeout=15000)
            if response and response.status == 200:
                title = await page.title()
                if '404' not in title.lower() and 'not found' not in title.lower():
                    has_courses = await page.evaluate(JS_CHECK_COURSES)
                    if has_courses > 5:
                        if output_callback:
                            output_callback("info", f"Found course list: {{path}} ({{has_courses}} courses)")
                        return test_url

                    list_url = await page.evaluate(JS_FIND_LIST_URL)
                    if list_url:
                        if output_callback:
                            output_callback("info", f"Found full course list: {{list_url}}")
                        return list_url
        except:
            continue

    try:
        await page.goto(base_url, wait_until="domcontentloaded", timeout=30000)
        await page.wait_for_timeout(2000)
        courses_url = await page.evaluate(JS_FIND_COURSES_FROM_HOME)
        if courses_url:
            return courses_url
    except:
        pass

    return None


async def extract_course_links(page, output_callback):
    return await page.evaluate(JS_EXTRACT_PROGRAMS)


async def scrape(output_callback=None):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        )
        page = await context.new_page()

        base_url = "https://www.{domain}/"

        result = {{
            "name": "{university_name} University",
            "url": base_url,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
            "schools": []
        }}

        all_programs = []

        try:
            if output_callback:
                output_callback("info", "Searching for masters course list...")

            courses_url = await find_course_list_page(page, base_url, output_callback)

            if not courses_url:
                if output_callback:
                    output_callback("warning", "Course list not found, using homepage")
                courses_url = base_url

            if output_callback:
                output_callback("info", "Extracting masters programs...")

            await page.goto(courses_url, wait_until="domcontentloaded", timeout=30000)
            await page.wait_for_timeout(3000)

            for _ in range(3):
                try:
                    load_more = page.locator('button:has-text("Load more"), button:has-text("Show more"), button:has-text("View more"), a:has-text("Load more")')
                    if await load_more.count() > 0:
                        await load_more.first.click()
                        await page.wait_for_timeout(2000)
                    else:
                        break
                except:
                    break

            programs_data = await extract_course_links(page, output_callback)

            if output_callback:
                output_callback("info", f"Found {{len(programs_data)}} masters programs")

            max_detail_pages = min(len(programs_data), 30)

            for i, prog in enumerate(programs_data[:max_detail_pages]):
                try:
                    if output_callback and i % 10 == 0:
                        output_callback("info", f"Processing {{i+1}}/{{max_detail_pages}}: {{prog['name'][:50]}}")

                    await page.goto(prog['url'], wait_until="domcontentloaded", timeout=15000)
                    await page.wait_for_timeout(800)

                    faculty_data = await page.evaluate(JS_EXTRACT_FACULTY)

                    all_programs.append({{
                        "name": prog['name'],
                        "url": prog['url'],
                        "faculty": faculty_data
                    }})

                except:
                    all_programs.append({{
                        "name": prog['name'],
                        "url": prog['url'],
                        "faculty": []
                    }})

            for prog in programs_data[max_detail_pages:]:
                all_programs.append({{
                    "name": prog['name'],
                    "url": prog['url'],
                    "faculty": []
                }})

            result["schools"] = [{{
                "name": "Masters Programs",
                "url": courses_url,
                "programs": all_programs
            }}]

            if output_callback:
                total_faculty = sum(len(p.get('faculty', [])) for p in all_programs)
                output_callback("info", f"Done! {{len(all_programs)}} programs, {{total_faculty}} faculty")

        except Exception as e:
            if output_callback:
                output_callback("error", f"Scraping error: {{str(e)}}")

        finally:
            await browser.close()

        return result


if __name__ == "__main__":
    result = asyncio.run(scrape())
    print(json.dumps(result, indent=2, ensure_ascii=False))
'''
    return template


def _generate_config_content(name: str, url: str, domain: str) -> dict:
    """Generate the config content."""
    return {
        "university": {
            "name": name,
            "url": url,
            "domain": domain
        },
        "scraper": {
            "headless": True,
            "timeout": 30000,
            "wait_time": 2000
        },
        "paths_to_try": [
            "/programs",
            "/academics/programs",
            "/graduate",
            "/degrees",
            "/admissions/graduate"
        ],
        "selectors": {
            "program_item": "div.program, li.program, article.program, a[href*='/program']",
            "faculty_item": "div.faculty, li.person, .profile-card"
        },
        "generated_at": datetime.utcnow().isoformat()
    }

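> Editor's note: a quick illustration (not part of the commit) of the brace escaping used in `_build_generic_scraper_template`: inside an f-string, `{{` and `}}` emit literal braces, so the *generated* script keeps its own f-string placeholders while `{domain}`-style fields are filled in at generation time.

```python
# Demonstrates why the template doubles braces around placeholders that must
# survive into the generated script.
domain = "example.ac.uk"
generated = f'output_callback("info", f"Found {{len(programs_data)}} masters programs on {domain}")'
print(generated)
# -> output_callback("info", f"Found {len(programs_data)} masters programs on example.ac.uk")
```
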
1
backend/app/tasks/__init__.py
Normal file
@@ -0,0 +1 @@
"""Celery tasks (optional, for production use)."""

25
backend/requirements.txt
Normal file
@@ -0,0 +1,25 @@
# FastAPI Web Framework
fastapi>=0.109.0
uvicorn[standard]>=0.27.0
python-multipart>=0.0.6

# Database
sqlalchemy>=2.0.25
psycopg2-binary>=2.9.9
alembic>=1.13.1

# Task Queue
celery>=5.3.6
redis>=5.0.1

# Utilities
pydantic>=2.9
pydantic-settings>=2.6
python-dotenv>=1.0.0
httpx>=0.28

# Existing scraper dependencies
playwright>=1.48
pyyaml>=6.0

# CORS

143
configs/harvard.yaml
Normal file
@@ -0,0 +1,143 @@
# Harvard University scraper configuration
# Organised as a School → Program → Faculty hierarchy
#
# Harvard special case: a central program listing page exposes all programs,
# which are then linked to individual schools and faculty via the GSAS pages

university:
  name: "Harvard University"
  url: "https://www.harvard.edu/"
  country: "USA"

# Layer 1: school list
schools:
  discovery_method: "static_list"

  static_list:
    # Graduate school of arts and sciences - home to most graduate programs
    - name: "Graduate School of Arts and Sciences (GSAS)"
      url: "https://gsas.harvard.edu/"

    # School of engineering and applied sciences
    - name: "John A. Paulson School of Engineering and Applied Sciences (SEAS)"
      url: "https://seas.harvard.edu/"

    # Business school
    - name: "Harvard Business School (HBS)"
      url: "https://www.hbs.edu/"

    # School of design
    - name: "Graduate School of Design (GSD)"
      url: "https://www.gsd.harvard.edu/"

    # School of education
    - name: "Graduate School of Education (HGSE)"
      url: "https://www.gse.harvard.edu/"

    # Kennedy school of government
    - name: "Harvard Kennedy School (HKS)"
      url: "https://www.hks.harvard.edu/"

    # Law school
    - name: "Harvard Law School (HLS)"
      url: "https://hls.harvard.edu/"

    # Medical school
    - name: "Harvard Medical School (HMS)"
      url: "https://hms.harvard.edu/"

    # School of public health
    - name: "T.H. Chan School of Public Health (HSPH)"
      url: "https://www.hsph.harvard.edu/"

    # Divinity school
    - name: "Harvard Divinity School (HDS)"
      url: "https://hds.harvard.edu/"

    # School of dental medicine
    - name: "Harvard School of Dental Medicine (HSDM)"
      url: "https://hsdm.harvard.edu/"

# Layer 2: program discovery configuration
programs:
  # Paths to try on each school site to find the program list
  paths_to_try:
    - "/programs"
    - "/academics/programs"
    - "/academics/graduate-programs"
    - "/academics/masters-programs"
    - "/graduate"
    - "/degrees"
    - "/academics"

  # Link patterns used to find the program listing page from a school homepage
  link_patterns:
    - text_contains: ["program", "degree", "academics"]
      href_contains: ["/program", "/degree", "/academic"]
    - text_contains: ["master", "graduate"]
      href_contains: ["/master", "/graduate"]

  # Selectors for the program listing page
  selectors:
    program_item: "div.program-item, li.program, .degree-program, article.program, a[href*='/program']"
    program_name: "h3, h4, .title, .program-title, .name"
    program_url: "a[href]"
    degree_type: ".degree, .credential, .degree-type"

  # Pagination configuration
  pagination:
    type: "none"

# Layer 3: faculty discovery configuration
faculty:
  discovery_strategies:
    - type: "link_in_page"
      patterns:
        - text_contains: ["faculty", "people", "advisor"]
          href_contains: ["/faculty", "/people", "/advisor"]
        - text_contains: ["see list", "view all"]
          href_contains: ["/people", "/faculty"]

    - type: "url_pattern"
      patterns:
        - "{program_url}/faculty"
        - "{program_url}/people"
        - "{school_url}/faculty"
        - "{school_url}/people"

  selectors:
    faculty_item: "div.faculty, li.person, .profile-card, article.person"
    faculty_name: "h3, h4, .name, .title a"
    faculty_url: "a[href*='/people/'], a[href*='/faculty/'], a[href*='/profile/']"
    faculty_title: ".title, .position, .role, .job-title"

# Filter rules
filters:
  program_degree_types:
    include:
      - "Master"
      - "M.S."
      - "M.A."
      - "MBA"
      - "M.Eng"
      - "M.Ed"
      - "M.P.P"
      - "M.P.A"
      - "M.Arch"
      - "M.L.A"
      - "M.Div"
      - "M.T.S"
      - "LL.M"
      - "S.M."
      - "A.M."
      - "A.L.M."
    exclude:
      - "Ph.D."
      - "Doctor"
      - "Bachelor"
      - "B.S."
      - "B.A."
      - "Certificate"
      - "Undergraduate"

  exclude_schools: []

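> Editor's note: a small sketch (not part of the commit) of reading one of these config files with PyYAML (already listed in backend/requirements.txt); the traversal mirrors the `schools.static_list` structure shown above.

```python
import yaml

# Load the Harvard config and list its statically configured schools.
with open("configs/harvard.yaml", encoding="utf-8") as fh:
    config = yaml.safe_load(fh)

print(config["university"]["name"])
for school in config["schools"]["static_list"]:
    print("-", school["name"], school["url"])
```
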
331
configs/manchester.yaml
Normal file
@@ -0,0 +1,331 @@
university:
  name: "The University of Manchester"
  url: "https://www.manchester.ac.uk/"
  country: "United Kingdom"

schools:
  discovery_method: "static_list"
  request:
    timeout_ms: 45000
    max_retries: 3
    retry_backoff_ms: 3000
  static_list:
    - name: "Alliance Manchester Business School"
      url: "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/"
      keywords:
        - "accounting"
        - "finance"
        - "business"
        - "management"
        - "marketing"
        - "mba"
        - "economics"
        - "entrepreneurship"
      faculty_pages:
        - url: "https://www.alliancembs.manchester.ac.uk/research/accounting-and-finance/staff/"
          extract_method: "table"
          requires_scroll: true
          scroll_times: 6
          scroll_delay_ms: 700
          load_more_selector: "button.load-more, button.show-more"
          max_load_more: 5
          request:
            timeout_ms: 60000
            wait_until: "domcontentloaded"
            post_wait_ms: 2500
    - name: "Department of Computer Science"
      url: "https://www.cs.manchester.ac.uk/about/people/academic-and-research-staff/"
      keywords:
        - "computer"
        - "software"
        - "data science"
        - "artificial intelligence"
        - "ai "
        - "machine learning"
        - "cyber"
        - "computing"
      faculty_pages:
        - url: "https://www.cs.manchester.ac.uk/about/people/academic-and-research-staff/"
          extract_method: "links"
          requires_scroll: true
          scroll_times: 6
          scroll_delay_ms: 700
          blocked_resources: ["image", "font", "media"]
        - url: "https://www.cs.manchester.ac.uk/about/people/"
          extract_method: "links"
          load_more_selector: "button.load-more"
          max_load_more: 5
          request:
            timeout_ms: 45000
            wait_until: "domcontentloaded"
            post_wait_ms: 2000
    - name: "Department of Physics and Astronomy"
      url: "https://www.physics.manchester.ac.uk/about/people/academic-and-research-staff/"
      keywords:
        - "physics"
        - "astronomy"
        - "astrophysics"
        - "nuclear"
        - "particle"
      faculty_pages:
        - url: "https://www.physics.manchester.ac.uk/about/people/academic-and-research-staff/"
          extract_method: "links"
          requires_scroll: true
          scroll_times: 5
          scroll_delay_ms: 700
    - name: "Department of Electrical and Electronic Engineering"
      url: "https://www.eee.manchester.ac.uk/about/people/academic-and-research-staff/"
      keywords:
        - "electrical"
        - "electronic"
        - "eee"
        - "power systems"
        - "microelectronics"
      faculty_pages:
        - url: "https://www.eee.manchester.ac.uk/about/people/academic-and-research-staff/"
          extract_method: "links"
          requires_scroll: true
          scroll_times: 6
          scroll_delay_ms: 700
    - name: "Department of Chemistry"
      url: "https://research.manchester.ac.uk/en/organisations/department-of-chemistry/persons/"
      keywords:
        - "chemistry"
        - "chemical"
      faculty_pages:
        - url: "https://research.manchester.ac.uk/en/organisations/department-of-chemistry/persons/"
          extract_method: "research_explorer"
          requires_scroll: true
          request:
            timeout_ms: 120000
            wait_until: "networkidle"
            wait_for_selector: "a.link.person"
            post_wait_ms: 5000
          research_explorer:
            org_slug: "department-of-chemistry"
            page_size: 200
    - name: "Department of Mathematics"
      url: "https://research.manchester.ac.uk/en/organisations/department-of-mathematics/persons/"
      keywords:
        - "mathematics"
        - "statistics"
        - "applied math"
        - "actuarial"
      faculty_pages:
        - url: "https://research.manchester.ac.uk/en/organisations/department-of-mathematics/persons/"
          extract_method: "research_explorer"
          requires_scroll: true
          request:
            timeout_ms: 120000
            wait_until: "networkidle"
            wait_for_selector: "a.link.person"
            post_wait_ms: 4500
          research_explorer:
            org_slug: "department-of-mathematics"
            page_size: 200
    - name: "School of Engineering"
      url: "https://research.manchester.ac.uk/en/organisations/school-of-engineering/persons/"
      keywords:
        - "engineering"
        - "mechanical"
        - "aerospace"
        - "civil"
        - "materials"
      faculty_pages:
        - url: "https://research.manchester.ac.uk/en/organisations/school-of-engineering/persons/"
          extract_method: "research_explorer"
          requires_scroll: true
          request:
            timeout_ms: 120000
            wait_until: "networkidle"
            wait_for_selector: "a.link.person"
            post_wait_ms: 4500
          research_explorer:
            org_slug: "school-of-engineering"
            page_size: 400
    - name: "Faculty of Biology, Medicine and Health"
      url: "https://research.manchester.ac.uk/en/organisations/faculty-of-biology-medicine-and-health/persons/"
      keywords:
        - "medicine"
        - "medical"
        - "health"
        - "nursing"
        - "pharmacy"
        - "clinical"
        - "dental"
        - "optometry"
        - "biology"
        - "biomedical"
        - "psychology"
      faculty_pages:
        - url: "https://research.manchester.ac.uk/en/organisations/faculty-of-biology-medicine-and-health/persons/"
          extract_method: "research_explorer"
          requires_scroll: true
          request:
            timeout_ms: 130000
            wait_until: "networkidle"
            wait_for_selector: "a.link.person"
            post_wait_ms: 4500
          research_explorer:
            org_slug: "faculty-of-biology-medicine-and-health"
            page_size: 400
    - name: "School of Social Sciences"
      url: "https://research.manchester.ac.uk/en/organisations/school-of-social-sciences/persons/"
      keywords:
        - "sociology"
        - "politics"
        - "international"
        - "social"
        - "criminology"
        - "anthropology"
        - "philosophy"
      faculty_pages:
        - url: "https://research.manchester.ac.uk/en/organisations/school-of-social-sciences/persons/"
          extract_method: "research_explorer"
          requires_scroll: true
          request:
            timeout_ms: 120000
            wait_until: "networkidle"
            wait_for_selector: "a.link.person"
            post_wait_ms: 4500
          research_explorer:
            org_slug: "school-of-social-sciences"
            page_size: 200
    - name: "School of Law"
      url: "https://research.manchester.ac.uk/en/organisations/school-of-law/persons/"
      keywords:
        - "law"
        - "legal"
        - "llm"
      faculty_pages:
        - url: "https://research.manchester.ac.uk/en/organisations/school-of-law/persons/"
          extract_method: "research_explorer"
          requires_scroll: true
          request:
            timeout_ms: 120000
            wait_until: "networkidle"
            wait_for_selector: "a.link.person"
            post_wait_ms: 4500
          research_explorer:
            org_slug: "school-of-law"
            page_size: 200
    - name: "School of Arts, Languages and Cultures"
      url: "https://research.manchester.ac.uk/en/organisations/school-of-arts-languages-and-cultures/persons/"
      keywords:
        - "arts"
        - "languages"
        - "culture"
        - "music"
        - "drama"
        - "theatre"
        - "history"
        - "linguistics"
        - "literature"
        - "translation"
        - "archaeology"
        - "religion"
      faculty_pages:
        - url: "https://research.manchester.ac.uk/en/organisations/school-of-arts-languages-and-cultures/persons/"
          extract_method: "research_explorer"
          requires_scroll: true
          request:
            timeout_ms: 120000
            wait_until: "networkidle"
            wait_for_selector: "a.link.person"
            post_wait_ms: 4500
          research_explorer:
            org_slug: "school-of-arts-languages-and-cultures"
            page_size: 300
    - name: "School of Environment, Education and Development"
      url: "https://research.manchester.ac.uk/en/organisations/school-of-environment-education-and-development/persons/"
      keywords:
        - "environment"
        - "education"
        - "development"
        - "planning"
        - "architecture"
        - "urban"
        - "geography"
        - "sustainability"
      faculty_pages:
        - url: "https://research.manchester.ac.uk/en/organisations/school-of-environment-education-and-development/persons/"
          extract_method: "research_explorer"
          requires_scroll: true
          request:
            timeout_ms: 120000
            wait_until: "networkidle"
            wait_for_selector: "a.link.person"
            post_wait_ms: 4500
          research_explorer:
            org_slug: "school-of-environment-education-and-development"
            page_size: 300

programs:
  paths_to_try:
    - "/study/masters/courses/list/"
  link_patterns:
    - text_contains: ["masters", "postgraduate", "graduate"]
      href_contains: ["/courses/list", "/study/masters", "/study/postgraduate"]
  selectors:
    program_item: "li.course-item, article.course, .course-listing a"
    program_name: ".course-title, h3, .title"
    program_url: "a[href]"
    degree_type: ".course-award, .badge"
  request:
    timeout_ms: 40000
    wait_until: "domcontentloaded"
    post_wait_ms: 2500
  global_catalog:
    url: "https://www.manchester.ac.uk/study/masters/courses/list/"
    request:
      timeout_ms: 60000
      wait_until: "networkidle"
      wait_after_ms: 3000
    metadata_keyword_field: "keywords"
    assign_by_school_keywords: true
    assign_if_no_keywords: false
    allow_multiple_assignments: false
    per_school_limit: 200
    skip_program_faculty_lookup: true

faculty:
  discovery_strategies:
    - type: "link_in_page"
      patterns:
        - text_contains: ["people", "faculty", "staff", "directory"]
          href_contains: ["/people", "/faculty", "/staff"]
      request:
        timeout_ms: 30000
        wait_until: "domcontentloaded"
        post_wait_ms: 1500
    - type: "url_pattern"
      patterns:
        - "{program_url}/people"
        - "{program_url}/faculty"
        - "{school_url}/people"
        - "{school_url}/staff"
      request:
        timeout_ms: 30000
        wait_until: "domcontentloaded"
        post_wait_ms: 1500
    - type: "school_directory"
      assign_to_all: false
      match_by_school_keywords: true
      metadata_keyword_field: "keywords"
      request:
        timeout_ms: 120000
        post_wait_ms: 3500

filters:
  program_degree_types:
    include: ["MSc", "MA", "MBA", "MEng", "LLM", "MRes"]
    exclude: ["PhD", "Bachelor", "BSc", "BA", "PGCert"]
  exclude_schools: []

playwright:
  stealth: true
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
  extra_headers:
    Accept-Language: "en-US,en;q=0.9"
  cookies: []
  add_init_scripts: []

24
configs/templates/README.md
Normal file
@@ -0,0 +1,24 @@
# UK University Template Library

This directory holds ScraperConfig template fragments for site structures commonly found at UK universities. The goal is to let the generation and scheduling scripts quickly reuse proven school, programme, and faculty configurations while staying in sync with the latest capabilities of `src/university_scraper`.

## How to use

1. Copy the template you need to `configs/<university>.yaml` and replace the placeholders (domain, school URLs, Research Explorer organisation slug, etc.) with the university's real details.
2. Adjust the school list under `schools.static_list`:
   - `keywords`: used to automatically cluster programmes under their school;
   - `faculty_pages`: defines school-level staff directories (supports `extract_method: table|links|research_explorer`, scrolling / "load more" clicks, and per-page request settings).
3. Fill in `programs.paths_to_try`, `link_patterns`, `selectors`, and the request settings according to how the university navigates its courses.
4. `faculty.discovery_strategies` should include at least:
   - `link_in_page`: find "People/Faculty" links from the programme page;
   - `url_pattern`: add common URL patterns;
   - `school_directory`: reuse the staff directories from `faculty_pages` and distribute them to the programme level by keyword.
5. Validate with `python -m src.university_scraper.cli run --config configs/<university>.yaml --output output/<name>.json` (or trigger a job from the web UI) and compare the local result with the previous version.

## Template list

| File | Use case |
|------|----------|
| `uk_research_explorer_template.yaml` | Most UK universities that use Pure Portal / Research Explorer (e.g. Manchester, UCL, or Imperial's humanities and social science schools). |
| `uk_department_directory_template.yaml` | Schools whose traditional departmental sites publish an HTML staff directory (e.g. engineering/science department sites, standalone school sites). |

If new page types show up later (e.g. SharePoint lists, embedded APIs), add a new template file to this directory and update this README.

95
configs/templates/uk_department_directory_template.yaml
Normal file
@@ -0,0 +1,95 @@
university:
  name: "REPLACE_UNIVERSITY_NAME"
  url: "https://www.example.ac.uk/"
  country: "United Kingdom"

schools:
  discovery_method: "static_list"
  static_list:
    - name: "Department of Computer Science"
      url: "https://www.example.ac.uk/about/people/academic-and-research-staff/"
      keywords:
        - "computer"
        - "software"
        - "artificial intelligence"
        - "data science"
      faculty_pages:
        - url: "https://www.example.ac.uk/about/people/academic-and-research-staff/"
          extract_method: "links"
          requires_scroll: true
          scroll_times: 6
          scroll_delay_ms: 600
          blocked_resources: ["image", "font", "media"]
        - url: "https://www.example.ac.uk/about/people/"
          extract_method: "links"
          load_more_selector: "button.load-more"
          max_load_more: 5
          request:
            timeout_ms: 45000
            wait_until: "domcontentloaded"
            post_wait_ms: 2000
    - name: "Department of Physics"
      url: "https://www.example.ac.uk/physics/about/people/"
      keywords:
        - "physics"
        - "astronomy"
        - "material science"
      faculty_pages:
        - url: "https://www.example.ac.uk/physics/about/people/academic-staff/"
          extract_method: "table"
          request:
            timeout_ms: 60000
            wait_until: "domcontentloaded"
            post_wait_ms: 2000

programs:
  paths_to_try:
    - "/study/masters/courses/a-to-z/"
    - "/study/masters/courses/list/"
  link_patterns:
    - text_contains: ["courses", "masters", "postgraduate"]
      href_contains: ["/study/", "/masters/", "/courses/"]
  selectors:
    program_item: ".course-card, li.course, article.course"
    program_name: ".course-title, h3, .title"
    program_url: "a[href]"
    degree_type: ".award, .badge"
  request:
    timeout_ms: 35000
    wait_until: "domcontentloaded"
    post_wait_ms: 2000

faculty:
  discovery_strategies:
    - type: "link_in_page"
      patterns:
        - text_contains: ["people", "faculty", "team", "staff"]
          href_contains: ["/people", "/faculty", "/staff"]
      request:
        timeout_ms: 25000
        wait_until: "domcontentloaded"
        post_wait_ms: 1500
    - type: "url_pattern"
      patterns:
        - "{program_url}/people"
        - "{program_url}/staff"
        - "{school_url}/people"
        - "{school_url}/contact/staff"
      request:
        timeout_ms: 25000
        wait_until: "domcontentloaded"
        post_wait_ms: 1500
    - type: "school_directory"
      assign_to_all: false
      match_by_school_keywords: true
      metadata_keyword_field: "keywords"
      request:
        timeout_ms: 60000
        wait_for_selector: "a[href*='/people/'], table"
        post_wait_ms: 2000

filters:
  program_degree_types:
    include: ["MSc", "MSci", "MA", "MBA", "MEng", "LLM"]
    exclude: ["PhD", "Bachelor", "BSc", "BA", "PGCert"]
  exclude_schools: []

101
configs/templates/uk_research_explorer_template.yaml
Normal file
@@ -0,0 +1,101 @@
university:
  name: "REPLACE_UNIVERSITY_NAME"
  url: "https://www.example.ac.uk/"
  country: "United Kingdom"

schools:
  discovery_method: "static_list"
  request:
    timeout_ms: 45000
    max_retries: 3
    retry_backoff_ms: 3000
  static_list:
    # Example schools based on Research Explorer (Pure Portal)
    - name: "School of Engineering"
      url: "https://research.example.ac.uk/en/organisations/school-of-engineering/persons/"
      keywords:
        - "engineering"
        - "mechanical"
        - "civil"
        - "materials"
      faculty_pages:
        - url: "https://research.example.ac.uk/en/organisations/school-of-engineering/persons/"
          extract_method: "research_explorer"
          requires_scroll: true
          request:
            timeout_ms: 120000
            wait_until: "networkidle"
            post_wait_ms: 5000
          research_explorer:
            org_slug: "school-of-engineering"
            page_size: 400
    - name: "Faculty of Humanities"
      url: "https://research.example.ac.uk/en/organisations/faculty-of-humanities/persons/"
      keywords:
        - "arts"
        - "languages"
        - "history"
        - "philosophy"
      faculty_pages:
        - url: "https://research.example.ac.uk/en/organisations/faculty-of-humanities/persons/"
          extract_method: "research_explorer"
          requires_scroll: true
          request:
            timeout_ms: 120000
            wait_until: "networkidle"
            post_wait_ms: 4500
          research_explorer:
            org_slug: "faculty-of-humanities"
            page_size: 300

programs:
  paths_to_try:
    - "/study/masters/courses/list/"
    - "/study/postgraduate/courses/list/"
  link_patterns:
    - text_contains: ["masters", "postgraduate", "graduate"]
      href_contains: ["/courses/", "/study/", "/programmes/"]
  selectors:
    program_item: "li.course-item, article.course-card, a.course-link"
    program_name: ".course-title, h3, .title"
    program_url: "a[href]"
    degree_type: ".course-award, .badge"
  request:
    timeout_ms: 40000
    wait_until: "domcontentloaded"
    post_wait_ms: 2500

faculty:
  discovery_strategies:
    - type: "link_in_page"
      patterns:
        - text_contains: ["faculty", "people", "staff", "directory"]
          href_contains: ["/faculty", "/people", "/staff"]
      request:
        timeout_ms: 30000
        wait_until: "domcontentloaded"
        post_wait_ms: 1500
    - type: "url_pattern"
      patterns:
        - "{program_url}/people"
        - "{program_url}/faculty"
        - "{school_url}/people"
        - "{school_url}/staff"
      request:
        timeout_ms: 30000
        wait_until: "domcontentloaded"
        post_wait_ms: 1500
    - type: "school_directory"
      assign_to_all: false
      match_by_school_keywords: true
      metadata_keyword_field: "keywords"
      request:
        timeout_ms: 120000
        wait_for_selector: "a.link.person"
        post_wait_ms: 4000

filters:
  program_degree_types:
    include: ["MSc", "MA", "MBA", "MEng", "LLM", "MRes"]
    exclude: ["PhD", "Bachelor", "BSc", "BA"]
  exclude_schools: []

169
configs/ucl.yaml
Normal file
@ -0,0 +1,169 @@
university:
  name: "University College London"
  url: "https://www.ucl.ac.uk/"
  country: "United Kingdom"

schools:
  discovery_method: "static_list"
  request:
    timeout_ms: 45000
    max_retries: 3
    retry_backoff_ms: 3000
  static_list:
    - name: "Faculty of Engineering Sciences"
      url: "https://www.ucl.ac.uk/engineering/people"
      keywords:
        - "engineering"
        - "mechanical"
        - "civil"
        - "materials"
        - "electronic"
        - "computer"
      faculty_pages:
        - url: "https://www.ucl.ac.uk/engineering/people"
          extract_method: "links"
          requires_scroll: true
          scroll_times: 8
          scroll_delay_ms: 600
          blocked_resources: ["image", "font", "media"]
        - url: "https://www.ucl.ac.uk/electronic-electrical-engineering/people/academic-staff"
          extract_method: "table"
          request:
            timeout_ms: 45000
            wait_until: "domcontentloaded"
            post_wait_ms: 2000
    - name: "Faculty of Mathematical & Physical Sciences"
      url: "https://www.ucl.ac.uk/mathematical-physical-sciences/people"
      keywords:
        - "mathematics"
        - "physics"
        - "chemistry"
        - "earth sciences"
        - "astronomy"
      faculty_pages:
        - url: "https://www.ucl.ac.uk/mathematical-physical-sciences/people"
          extract_method: "links"
          requires_scroll: true
          scroll_times: 6
          scroll_delay_ms: 600
        - url: "https://www.ucl.ac.uk/physics-astronomy/people/academic-staff"
          extract_method: "links"
    - name: "Faculty of Arts & Humanities"
      url: "https://www.ucl.ac.uk/arts-humanities/people/academic-staff"
      keywords:
        - "arts"
        - "languages"
        - "culture"
        - "history"
        - "philosophy"
        - "translation"
      faculty_pages:
        - url: "https://www.ucl.ac.uk/arts-humanities/people/academic-staff"
          extract_method: "links"
          requires_scroll: true
          scroll_times: 6
          scroll_delay_ms: 600
    - name: "Faculty of Laws"
      url: "https://www.ucl.ac.uk/laws/people/academic-staff"
      keywords:
        - "law"
        - "legal"
        - "llm"
      faculty_pages:
        - url: "https://www.ucl.ac.uk/laws/people/academic-staff"
          extract_method: "links"
          requires_scroll: true
          scroll_times: 5
          scroll_delay_ms: 600
    - name: "Faculty of Social & Historical Sciences"
      url: "https://www.ucl.ac.uk/social-historical-sciences/people"
      keywords:
        - "social"
        - "economics"
        - "geography"
        - "anthropology"
        - "politics"
        - "history"
      faculty_pages:
        - url: "https://www.ucl.ac.uk/social-historical-sciences/people"
          extract_method: "links"
          requires_scroll: true
          scroll_times: 6
          scroll_delay_ms: 600
    - name: "Faculty of Brain Sciences"
      url: "https://www.ucl.ac.uk/brain-sciences/people"
      keywords:
        - "neuroscience"
        - "psychology"
        - "cognitive"
        - "biomedical"
      faculty_pages:
        - url: "https://www.ucl.ac.uk/brain-sciences/people"
          extract_method: "links"
          requires_scroll: true
          scroll_times: 6
          scroll_delay_ms: 600
    - name: "Faculty of the Built Environment (The Bartlett)"
      url: "https://www.ucl.ac.uk/bartlett/people/all"
      keywords:
        - "architecture"
        - "planning"
        - "urban"
        - "built environment"
      faculty_pages:
        - url: "https://www.ucl.ac.uk/bartlett/people/all"
          extract_method: "links"
          requires_scroll: true
          scroll_times: 10
          scroll_delay_ms: 600

programs:
  paths_to_try:
    - "/prospective-students/graduate/taught-degrees/"
  link_patterns:
    - text_contains: ["graduate", "taught", "masters", "postgraduate"]
      href_contains: ["/prospective-students/graduate", "/study/graduate", "/courses/"]
  selectors:
    program_item: ".view-content .view-row, li.listing__item, article.prog-card"
    program_name: ".listing__title, h3, .title"
    program_url: "a[href]"
    degree_type: ".listing__award, .award"
  request:
    timeout_ms: 40000
    wait_until: "domcontentloaded"
    post_wait_ms: 2500

faculty:
  discovery_strategies:
    - type: "link_in_page"
      patterns:
        - text_contains: ["people", "faculty", "staff", "team"]
          href_contains: ["/people", "/faculty", "/staff", "/team"]
      request:
        timeout_ms: 30000
        wait_until: "domcontentloaded"
        post_wait_ms: 1500
    - type: "url_pattern"
      patterns:
        - "{program_url}/people"
        - "{program_url}/staff"
        - "{school_url}/people"
        - "{school_url}/staff"
      request:
        timeout_ms: 30000
        wait_until: "domcontentloaded"
        post_wait_ms: 1500
    - type: "school_directory"
      assign_to_all: false
      match_by_school_keywords: true
      metadata_keyword_field: "keywords"
      request:
        timeout_ms: 60000
        wait_for_selector: "a[href*='/people/'], .person, .profile-card"
        post_wait_ms: 2500

filters:
  program_degree_types:
    include: ["MSc", "MSci", "MA", "MBA", "MEng", "LLM", "MRes"]
    exclude: ["PhD", "Bachelor", "BSc", "BA", "PGCert"]
  exclude_schools: []
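For reference, a minimal sketch of driving a static config such as configs/ucl.yaml through the scraper package end to end. It reuses ScraperConfig.from_dict and UniversityScraper exactly as scripts/test_harvard.py does later in this commit; loading the YAML with PyYAML and the output path are assumptions for illustration, not part of the committed code.

```python
# Sketch only: run the UCL config through the scraper.
# ScraperConfig.from_dict / UniversityScraper mirror scripts/test_harvard.py;
# reading the YAML with PyYAML and the output path are assumptions.
import asyncio
import yaml

from university_scraper.config import ScraperConfig
from university_scraper.scraper import UniversityScraper


async def run_ucl() -> None:
    with open("configs/ucl.yaml", "r", encoding="utf-8") as f:
        config = ScraperConfig.from_dict(yaml.safe_load(f))

    # The scraper walks School -> Program -> Faculty and keeps that hierarchy.
    async with UniversityScraper(config, headless=True) as scraper:
        await scraper.scrape()
        scraper.save_results("output/ucl_result.json")


if __name__ == "__main__":
    asyncio.run(run_ucl())
```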
54
docker-compose.yml
Normal file
@ -0,0 +1,54 @@
version: '3.8'

services:
  # Backend API service
  backend:
    build:
      context: ./backend
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/university_scraper
      - REDIS_URL=redis://redis:6379/0
    depends_on:
      - db
      - redis
    volumes:
      - ./backend:/app
      - scraper_data:/app/data

  # Frontend service
  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile
    ports:
      - "3000:80"
    depends_on:
      - backend

  # PostgreSQL database
  db:
    image: postgres:15-alpine
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
      - POSTGRES_DB=university_scraper
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  # Redis (task queue broker)
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:
  scraper_data:
26
frontend/Dockerfile
Normal file
@ -0,0 +1,26 @@
FROM node:20-alpine as builder

WORKDIR /app

# Copy package manifests
COPY package*.json ./
RUN npm install

# Copy source code
COPY . .

# Build
RUN npm run build

# Production image
FROM nginx:alpine

# Copy build output
COPY --from=builder /app/dist /usr/share/nginx/html

# Copy nginx config
COPY nginx.conf /etc/nginx/conf.d/default.conf

EXPOSE 80

CMD ["nginx", "-g", "daemon off;"]
12
frontend/index.html
Normal file
@ -0,0 +1,12 @@
<!DOCTYPE html>
<html lang="zh-CN">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>大学爬虫系统</title>
  </head>
  <body>
    <div id="root"></div>
    <script type="module" src="/src/main.tsx"></script>
  </body>
</html>
21
frontend/nginx.conf
Normal file
@ -0,0 +1,21 @@
server {
    listen 80;
    server_name localhost;
    root /usr/share/nginx/html;
    index index.html;

    # Handle SPA routing
    location / {
        try_files $uri $uri/ /index.html;
    }

    # Proxy API requests to the backend
    location /api {
        proxy_pass http://backend:8000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
    }
}
3051
frontend/package-lock.json
generated
Normal file
File diff suppressed because it is too large
26
frontend/package.json
Normal file
@ -0,0 +1,26 @@
{
  "name": "university-scraper-web",
  "version": "1.0.0",
  "private": true,
  "scripts": {
    "dev": "vite",
    "build": "tsc && vite build",
    "preview": "vite preview"
  },
  "dependencies": {
    "react": "^18.2.0",
    "react-dom": "^18.2.0",
    "react-router-dom": "^6.20.0",
    "@tanstack/react-query": "^5.8.0",
    "axios": "^1.6.0",
    "antd": "^5.11.0",
    "@ant-design/icons": "^5.2.6"
  },
  "devDependencies": {
    "@types/react": "^18.2.0",
    "@types/react-dom": "^18.2.0",
    "@vitejs/plugin-react": "^4.2.0",
    "typescript": "^5.3.0",
    "vite": "^5.0.0"
  }
}
75
frontend/src/App.tsx
Normal file
@ -0,0 +1,75 @@
/**
 * Root application component
 */
import { useState } from 'react'
import { BrowserRouter, Routes, Route, useNavigate } from 'react-router-dom'
import { Layout, Menu } from 'antd'
import { HomeOutlined, PlusOutlined, DatabaseOutlined } from '@ant-design/icons'
import HomePage from './pages/HomePage'
import AddUniversityPage from './pages/AddUniversityPage'
import UniversityDetailPage from './pages/UniversityDetailPage'

const { Header, Content, Footer } = Layout

function AppContent() {
  const navigate = useNavigate()
  const [current, setCurrent] = useState('home')

  const menuItems = [
    {
      key: 'home',
      icon: <HomeOutlined />,
      label: '大学列表',
      onClick: () => navigate('/')
    },
    {
      key: 'add',
      icon: <PlusOutlined />,
      label: '添加大学',
      onClick: () => navigate('/add')
    }
  ]

  return (
    <Layout style={{ minHeight: '100vh' }}>
      <Header style={{ display: 'flex', alignItems: 'center', background: '#001529' }}>
        <div style={{ color: 'white', fontSize: '20px', fontWeight: 'bold', marginRight: '40px' }}>
          <DatabaseOutlined /> 大学爬虫系统
        </div>
        <Menu
          theme="dark"
          mode="horizontal"
          selectedKeys={[current]}
          items={menuItems}
          onClick={(e) => setCurrent(e.key)}
          style={{ flex: 1 }}
        />
      </Header>

      <Content style={{ padding: '24px', background: '#f5f5f5' }}>
        <div style={{ maxWidth: 1200, margin: '0 auto' }}>
          <Routes>
            <Route path="/" element={<HomePage />} />
            <Route path="/add" element={<AddUniversityPage />} />
            <Route path="/university/:id" element={<UniversityDetailPage />} />
          </Routes>
        </div>
      </Content>

      <Footer style={{ textAlign: 'center', background: '#f5f5f5' }}>
        大学爬虫系统 ©2024 - 一键生成 & 一键爬取
      </Footer>
    </Layout>
  )
}

function App() {
  return (
    <BrowserRouter>
      <AppContent />
    </BrowserRouter>
  )
}

export default App
29
frontend/src/index.css
Normal file
@ -0,0 +1,29 @@
* {
  margin: 0;
  padding: 0;
  box-sizing: border-box;
}

body {
  font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
  background-color: #f5f5f5;
}

.container {
  max-width: 1200px;
  margin: 0 auto;
  padding: 20px;
}

.card-hover:hover {
  box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15);
  transition: box-shadow 0.3s;
}

.status-pending { color: #faad14; }
.status-analyzing { color: #1890ff; }
.status-ready { color: #52c41a; }
.status-running { color: #1890ff; }
.status-completed { color: #52c41a; }
.status-failed { color: #ff4d4f; }
.status-error { color: #ff4d4f; }
26
frontend/src/main.tsx
Normal file
@ -0,0 +1,26 @@
import React from 'react'
import ReactDOM from 'react-dom/client'
import { QueryClient, QueryClientProvider } from '@tanstack/react-query'
import { ConfigProvider } from 'antd'
import zhCN from 'antd/locale/zh_CN'
import App from './App'
import './index.css'

const queryClient = new QueryClient({
  defaultOptions: {
    queries: {
      refetchOnWindowFocus: false,
      retry: 1
    }
  }
})

ReactDOM.createRoot(document.getElementById('root')!).render(
  <React.StrictMode>
    <QueryClientProvider client={queryClient}>
      <ConfigProvider locale={zhCN}>
        <App />
      </ConfigProvider>
    </QueryClientProvider>
  </React.StrictMode>
)
165
frontend/src/pages/AddUniversityPage.tsx
Normal file
@ -0,0 +1,165 @@
|
|||||||
|
/**
|
||||||
|
* 添加大学页面 - 一键生成爬虫脚本
|
||||||
|
*/
|
||||||
|
import { useState } from 'react'
|
||||||
|
import { useNavigate } from 'react-router-dom'
|
||||||
|
import { useMutation } from '@tanstack/react-query'
|
||||||
|
import {
|
||||||
|
Card, Form, Input, Button, Typography, Steps, Result, Spin, message
|
||||||
|
} from 'antd'
|
||||||
|
import { GlobalOutlined, RocketOutlined, CheckCircleOutlined, LoadingOutlined } from '@ant-design/icons'
|
||||||
|
import { scriptApi } from '../services/api'
|
||||||
|
|
||||||
|
const { Title, Text, Paragraph } = Typography
|
||||||
|
|
||||||
|
export default function AddUniversityPage() {
|
||||||
|
const navigate = useNavigate()
|
||||||
|
const [form] = Form.useForm()
|
||||||
|
const [currentStep, setCurrentStep] = useState(0)
|
||||||
|
const [universityId, setUniversityId] = useState<number | null>(null)
|
||||||
|
|
||||||
|
// 生成脚本
|
||||||
|
const generateMutation = useMutation({
|
||||||
|
mutationFn: scriptApi.generate,
|
||||||
|
onSuccess: (response) => {
|
||||||
|
const data = response.data
|
||||||
|
setUniversityId(data.university_id)
|
||||||
|
setCurrentStep(2)
|
||||||
|
message.success('脚本生成成功!')
|
||||||
|
},
|
||||||
|
onError: (error: any) => {
|
||||||
|
message.error(error.response?.data?.detail || '生成失败')
|
||||||
|
setCurrentStep(0)
|
||||||
|
}
|
||||||
|
})
|
||||||
|
|
||||||
|
const handleSubmit = (values: { url: string; name?: string }) => {
|
||||||
|
setCurrentStep(1)
|
||||||
|
generateMutation.mutate({
|
||||||
|
university_url: values.url,
|
||||||
|
university_name: values.name
|
||||||
|
})
|
||||||
|
}
|
||||||
|
|
||||||
|
const stepItems = [
|
||||||
|
{
|
||||||
|
title: '输入信息',
|
||||||
|
icon: <GlobalOutlined />
|
||||||
|
},
|
||||||
|
{
|
||||||
|
title: '分析生成',
|
||||||
|
icon: currentStep === 1 ? <LoadingOutlined /> : <RocketOutlined />
|
||||||
|
},
|
||||||
|
{
|
||||||
|
title: '完成',
|
||||||
|
icon: <CheckCircleOutlined />
|
||||||
|
}
|
||||||
|
]
|
||||||
|
|
||||||
|
return (
|
||||||
|
<Card>
|
||||||
|
<Title level={3} style={{ textAlign: 'center', marginBottom: 32 }}>
|
||||||
|
添加大学 - 一键生成爬虫脚本
|
||||||
|
</Title>
|
||||||
|
|
||||||
|
<Steps current={currentStep} items={stepItems} style={{ marginBottom: 40 }} />
|
||||||
|
|
||||||
|
{currentStep === 0 && (
|
||||||
|
<div style={{ maxWidth: 500, margin: '0 auto' }}>
|
||||||
|
<Paragraph style={{ textAlign: 'center', marginBottom: 24 }}>
|
||||||
|
输入大学官网地址,系统将自动分析网站结构并生成爬虫脚本
|
||||||
|
</Paragraph>
|
||||||
|
|
||||||
|
<Form
|
||||||
|
form={form}
|
||||||
|
layout="vertical"
|
||||||
|
onFinish={handleSubmit}
|
||||||
|
>
|
||||||
|
<Form.Item
|
||||||
|
name="url"
|
||||||
|
label="大学官网URL"
|
||||||
|
rules={[
|
||||||
|
{ required: true, message: '请输入大学官网URL' },
|
||||||
|
{ type: 'url', message: '请输入有效的URL' }
|
||||||
|
]}
|
||||||
|
>
|
||||||
|
<Input
|
||||||
|
placeholder="https://www.harvard.edu/"
|
||||||
|
size="large"
|
||||||
|
prefix={<GlobalOutlined />}
|
||||||
|
/>
|
||||||
|
</Form.Item>
|
||||||
|
|
||||||
|
<Form.Item
|
||||||
|
name="name"
|
||||||
|
label="大学名称 (可选)"
|
||||||
|
>
|
||||||
|
<Input
|
||||||
|
placeholder="如: Harvard University"
|
||||||
|
size="large"
|
||||||
|
/>
|
||||||
|
</Form.Item>
|
||||||
|
|
||||||
|
<Form.Item>
|
||||||
|
<Button
|
||||||
|
type="primary"
|
||||||
|
htmlType="submit"
|
||||||
|
size="large"
|
||||||
|
block
|
||||||
|
icon={<RocketOutlined />}
|
||||||
|
>
|
||||||
|
一键生成爬虫脚本
|
||||||
|
</Button>
|
||||||
|
</Form.Item>
|
||||||
|
</Form>
|
||||||
|
|
||||||
|
<div style={{ marginTop: 32, padding: 16, background: '#f5f5f5', borderRadius: 8 }}>
|
||||||
|
<Text strong>支持的大学类型:</Text>
|
||||||
|
<ul style={{ marginTop: 8 }}>
|
||||||
|
<li>美国大学 (如 Harvard, MIT, Stanford)</li>
|
||||||
|
<li>英国大学 (如 Oxford, Cambridge)</li>
|
||||||
|
<li>其他海外大学</li>
|
||||||
|
</ul>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
)}
|
||||||
|
|
||||||
|
{currentStep === 1 && (
|
||||||
|
<div style={{ textAlign: 'center', padding: 60 }}>
|
||||||
|
<Spin size="large" />
|
||||||
|
<Title level={4} style={{ marginTop: 24 }}>正在分析网站结构...</Title>
|
||||||
|
<Paragraph>系统正在访问大学官网,分析页面结构并生成爬虫脚本</Paragraph>
|
||||||
|
<Paragraph type="secondary">这可能需要几秒钟,请稍候...</Paragraph>
|
||||||
|
</div>
|
||||||
|
)}
|
||||||
|
|
||||||
|
{currentStep === 2 && (
|
||||||
|
<Result
|
||||||
|
status="success"
|
||||||
|
title="爬虫脚本生成成功!"
|
||||||
|
subTitle="系统已自动分析网站结构并生成了爬虫脚本"
|
||||||
|
extra={[
|
||||||
|
<Button
|
||||||
|
type="primary"
|
||||||
|
key="detail"
|
||||||
|
size="large"
|
||||||
|
onClick={() => navigate(`/university/${universityId}`)}
|
||||||
|
>
|
||||||
|
进入大学管理页面
|
||||||
|
</Button>,
|
||||||
|
<Button
|
||||||
|
key="add"
|
||||||
|
size="large"
|
||||||
|
onClick={() => {
|
||||||
|
setCurrentStep(0)
|
||||||
|
form.resetFields()
|
||||||
|
}}
|
||||||
|
>
|
||||||
|
继续添加
|
||||||
|
</Button>
|
||||||
|
]}
|
||||||
|
/>
|
||||||
|
)}
|
||||||
|
</Card>
|
||||||
|
)
|
||||||
|
}
|
||||||
185
frontend/src/pages/HomePage.tsx
Normal file
@ -0,0 +1,185 @@
|
|||||||
|
/**
|
||||||
|
* 首页 - 大学列表
|
||||||
|
*/
|
||||||
|
import { useState } from 'react'
|
||||||
|
import { useNavigate } from 'react-router-dom'
|
||||||
|
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
|
||||||
|
import {
|
||||||
|
Card, Table, Button, Input, Space, Tag, message, Popconfirm, Typography, Row, Col, Statistic
|
||||||
|
} from 'antd'
|
||||||
|
import {
|
||||||
|
PlusOutlined, SearchOutlined, DeleteOutlined, EyeOutlined, ReloadOutlined
|
||||||
|
} from '@ant-design/icons'
|
||||||
|
import { universityApi } from '../services/api'
|
||||||
|
|
||||||
|
const { Title } = Typography
|
||||||
|
|
||||||
|
// 状态标签映射
|
||||||
|
const statusTags: Record<string, { color: string; text: string }> = {
|
||||||
|
pending: { color: 'default', text: '待分析' },
|
||||||
|
analyzing: { color: 'processing', text: '分析中' },
|
||||||
|
ready: { color: 'success', text: '就绪' },
|
||||||
|
error: { color: 'error', text: '错误' }
|
||||||
|
}
|
||||||
|
|
||||||
|
export default function HomePage() {
|
||||||
|
const navigate = useNavigate()
|
||||||
|
const queryClient = useQueryClient()
|
||||||
|
const [search, setSearch] = useState('')
|
||||||
|
|
||||||
|
// 获取大学列表
|
||||||
|
const { data, isLoading, refetch } = useQuery({
|
||||||
|
queryKey: ['universities', search],
|
||||||
|
queryFn: () => universityApi.list({ search: search || undefined })
|
||||||
|
})
|
||||||
|
|
||||||
|
// 删除大学
|
||||||
|
const deleteMutation = useMutation({
|
||||||
|
mutationFn: universityApi.delete,
|
||||||
|
onSuccess: () => {
|
||||||
|
message.success('删除成功')
|
||||||
|
queryClient.invalidateQueries({ queryKey: ['universities'] })
|
||||||
|
},
|
||||||
|
onError: () => {
|
||||||
|
message.error('删除失败')
|
||||||
|
}
|
||||||
|
})
|
||||||
|
|
||||||
|
const universities = data?.data?.items || []
|
||||||
|
const total = data?.data?.total || 0
|
||||||
|
|
||||||
|
// 统计
|
||||||
|
const readyCount = universities.filter((u: any) => u.status === 'ready').length
|
||||||
|
const totalPrograms = universities.reduce((sum: number, u: any) =>
|
||||||
|
sum + (u.latest_result?.programs_count || 0), 0)
|
||||||
|
const totalFaculty = universities.reduce((sum: number, u: any) =>
|
||||||
|
sum + (u.latest_result?.faculty_count || 0), 0)
|
||||||
|
|
||||||
|
const columns = [
|
||||||
|
{
|
||||||
|
title: '大学名称',
|
||||||
|
dataIndex: 'name',
|
||||||
|
key: 'name',
|
||||||
|
render: (text: string, record: any) => (
|
||||||
|
<a onClick={() => navigate(`/university/${record.id}`)}>{text}</a>
|
||||||
|
)
|
||||||
|
},
|
||||||
|
{
|
||||||
|
title: '国家',
|
||||||
|
dataIndex: 'country',
|
||||||
|
key: 'country',
|
||||||
|
width: 100
|
||||||
|
},
|
||||||
|
{
|
||||||
|
title: '状态',
|
||||||
|
dataIndex: 'status',
|
||||||
|
key: 'status',
|
||||||
|
width: 100,
|
||||||
|
render: (status: string) => {
|
||||||
|
const tag = statusTags[status] || { color: 'default', text: status }
|
||||||
|
return <Tag color={tag.color}>{tag.text}</Tag>
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
title: '项目数',
|
||||||
|
key: 'programs',
|
||||||
|
width: 100,
|
||||||
|
render: (_: any, record: any) => record.latest_result?.programs_count || '-'
|
||||||
|
},
|
||||||
|
{
|
||||||
|
title: '导师数',
|
||||||
|
key: 'faculty',
|
||||||
|
width: 100,
|
||||||
|
render: (_: any, record: any) => record.latest_result?.faculty_count || '-'
|
||||||
|
},
|
||||||
|
{
|
||||||
|
title: '操作',
|
||||||
|
key: 'actions',
|
||||||
|
width: 150,
|
||||||
|
render: (_: any, record: any) => (
|
||||||
|
<Space>
|
||||||
|
<Button
|
||||||
|
type="link"
|
||||||
|
icon={<EyeOutlined />}
|
||||||
|
onClick={() => navigate(`/university/${record.id}`)}
|
||||||
|
>
|
||||||
|
查看
|
||||||
|
</Button>
|
||||||
|
<Popconfirm
|
||||||
|
title="确定删除这个大学吗?"
|
||||||
|
onConfirm={() => deleteMutation.mutate(record.id)}
|
||||||
|
okText="确定"
|
||||||
|
cancelText="取消"
|
||||||
|
>
|
||||||
|
<Button type="link" danger icon={<DeleteOutlined />}>
|
||||||
|
删除
|
||||||
|
</Button>
|
||||||
|
</Popconfirm>
|
||||||
|
</Space>
|
||||||
|
)
|
||||||
|
}
|
||||||
|
]
|
||||||
|
|
||||||
|
return (
|
||||||
|
<div>
|
||||||
|
{/* 统计卡片 */}
|
||||||
|
<Row gutter={16} style={{ marginBottom: 24 }}>
|
||||||
|
<Col span={6}>
|
||||||
|
<Card>
|
||||||
|
<Statistic title="大学总数" value={total} />
|
||||||
|
</Card>
|
||||||
|
</Col>
|
||||||
|
<Col span={6}>
|
||||||
|
<Card>
|
||||||
|
<Statistic title="已就绪" value={readyCount} valueStyle={{ color: '#52c41a' }} />
|
||||||
|
</Card>
|
||||||
|
</Col>
|
||||||
|
<Col span={6}>
|
||||||
|
<Card>
|
||||||
|
<Statistic title="项目总数" value={totalPrograms} />
|
||||||
|
</Card>
|
||||||
|
</Col>
|
||||||
|
<Col span={6}>
|
||||||
|
<Card>
|
||||||
|
<Statistic title="导师总数" value={totalFaculty} />
|
||||||
|
</Card>
|
||||||
|
</Col>
|
||||||
|
</Row>
|
||||||
|
|
||||||
|
{/* 大学列表 */}
|
||||||
|
<Card
|
||||||
|
title={<Title level={4} style={{ margin: 0 }}>大学列表</Title>}
|
||||||
|
extra={
|
||||||
|
<Space>
|
||||||
|
<Input
|
||||||
|
placeholder="搜索大学..."
|
||||||
|
prefix={<SearchOutlined />}
|
||||||
|
value={search}
|
||||||
|
onChange={(e) => setSearch(e.target.value)}
|
||||||
|
style={{ width: 200 }}
|
||||||
|
allowClear
|
||||||
|
/>
|
||||||
|
<Button icon={<ReloadOutlined />} onClick={() => refetch()}>
|
||||||
|
刷新
|
||||||
|
</Button>
|
||||||
|
<Button type="primary" icon={<PlusOutlined />} onClick={() => navigate('/add')}>
|
||||||
|
添加大学
|
||||||
|
</Button>
|
||||||
|
</Space>
|
||||||
|
}
|
||||||
|
>
|
||||||
|
<Table
|
||||||
|
columns={columns}
|
||||||
|
dataSource={universities}
|
||||||
|
rowKey="id"
|
||||||
|
loading={isLoading}
|
||||||
|
pagination={{
|
||||||
|
total,
|
||||||
|
showSizeChanger: true,
|
||||||
|
showTotal: (t) => `共 ${t} 所大学`
|
||||||
|
}}
|
||||||
|
/>
|
||||||
|
</Card>
|
||||||
|
</div>
|
||||||
|
)
|
||||||
|
}
|
||||||
368
frontend/src/pages/UniversityDetailPage.tsx
Normal file
@ -0,0 +1,368 @@
|
|||||||
|
/**
|
||||||
|
* 大学详情页面 - 管理爬虫、运行爬虫、查看数据
|
||||||
|
*/
|
||||||
|
import { useState, useEffect } from 'react'
|
||||||
|
import { useParams, useNavigate } from 'react-router-dom'
|
||||||
|
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
|
||||||
|
import {
|
||||||
|
Card, Tabs, Button, Typography, Tag, Space, Table, Progress, Timeline, Spin,
|
||||||
|
message, Descriptions, Tree, Input, Row, Col, Statistic, Empty, Modal
|
||||||
|
} from 'antd'
|
||||||
|
import {
|
||||||
|
PlayCircleOutlined, ReloadOutlined, DownloadOutlined, ArrowLeftOutlined,
|
||||||
|
CheckCircleOutlined, ClockCircleOutlined, ExclamationCircleOutlined,
|
||||||
|
SearchOutlined, TeamOutlined, BookOutlined, BankOutlined
|
||||||
|
} from '@ant-design/icons'
|
||||||
|
import { universityApi, scriptApi, jobApi, resultApi } from '../services/api'
|
||||||
|
|
||||||
|
const { Title, Text, Paragraph } = Typography
|
||||||
|
const { TabPane } = Tabs
|
||||||
|
|
||||||
|
// 状态映射
|
||||||
|
const statusMap: Record<string, { color: string; text: string; icon: any }> = {
|
||||||
|
pending: { color: 'default', text: '等待中', icon: <ClockCircleOutlined /> },
|
||||||
|
running: { color: 'processing', text: '运行中', icon: <Spin size="small" /> },
|
||||||
|
completed: { color: 'success', text: '已完成', icon: <CheckCircleOutlined /> },
|
||||||
|
failed: { color: 'error', text: '失败', icon: <ExclamationCircleOutlined /> },
|
||||||
|
cancelled: { color: 'warning', text: '已取消', icon: <ExclamationCircleOutlined /> }
|
||||||
|
}
|
||||||
|
|
||||||
|
export default function UniversityDetailPage() {
|
||||||
|
const { id } = useParams<{ id: string }>()
|
||||||
|
const navigate = useNavigate()
|
||||||
|
const queryClient = useQueryClient()
|
||||||
|
const universityId = parseInt(id || '0')
|
||||||
|
|
||||||
|
const [activeTab, setActiveTab] = useState('overview')
|
||||||
|
const [pollingJobId, setPollingJobId] = useState<number | null>(null)
|
||||||
|
const [searchKeyword, setSearchKeyword] = useState('')
|
||||||
|
|
||||||
|
// 获取大学详情
|
||||||
|
const { data: universityData, isLoading: universityLoading } = useQuery({
|
||||||
|
queryKey: ['university', universityId],
|
||||||
|
queryFn: () => universityApi.get(universityId)
|
||||||
|
})
|
||||||
|
|
||||||
|
// 获取脚本
|
||||||
|
const { data: scriptsData } = useQuery({
|
||||||
|
queryKey: ['scripts', universityId],
|
||||||
|
queryFn: () => scriptApi.getByUniversity(universityId)
|
||||||
|
})
|
||||||
|
|
||||||
|
// 获取任务列表
|
||||||
|
const { data: jobsData, refetch: refetchJobs } = useQuery({
|
||||||
|
queryKey: ['jobs', universityId],
|
||||||
|
queryFn: () => jobApi.getByUniversity(universityId)
|
||||||
|
})
|
||||||
|
|
||||||
|
// 获取结果数据
|
||||||
|
const { data: resultData } = useQuery({
|
||||||
|
queryKey: ['result', universityId],
|
||||||
|
queryFn: () => resultApi.get(universityId),
|
||||||
|
enabled: activeTab === 'data'
|
||||||
|
})
|
||||||
|
|
||||||
|
// 获取任务状态 (轮询)
|
||||||
|
const { data: jobStatusData } = useQuery({
|
||||||
|
queryKey: ['job-status', pollingJobId],
|
||||||
|
queryFn: () => jobApi.getStatus(pollingJobId!),
|
||||||
|
enabled: !!pollingJobId,
|
||||||
|
refetchInterval: pollingJobId ? 2000 : false
|
||||||
|
})
|
||||||
|
|
||||||
|
// 启动爬虫任务
|
||||||
|
const startJobMutation = useMutation({
|
||||||
|
mutationFn: () => jobApi.start(universityId),
|
||||||
|
onSuccess: (response) => {
|
||||||
|
message.success('爬虫任务已启动')
|
||||||
|
setPollingJobId(response.data.id)
|
||||||
|
refetchJobs()
|
||||||
|
},
|
||||||
|
onError: (error: any) => {
|
||||||
|
message.error(error.response?.data?.detail || '启动失败')
|
||||||
|
}
|
||||||
|
})
|
||||||
|
|
||||||
|
// 监听任务完成
|
||||||
|
useEffect(() => {
|
||||||
|
if (jobStatusData?.data?.status === 'completed' || jobStatusData?.data?.status === 'failed') {
|
||||||
|
setPollingJobId(null)
|
||||||
|
refetchJobs()
|
||||||
|
queryClient.invalidateQueries({ queryKey: ['university', universityId] })
|
||||||
|
queryClient.invalidateQueries({ queryKey: ['result', universityId] })
|
||||||
|
|
||||||
|
if (jobStatusData?.data?.status === 'completed') {
|
||||||
|
message.success('爬取完成!')
|
||||||
|
} else {
|
||||||
|
message.error('爬取失败')
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}, [jobStatusData?.data?.status])
|
||||||
|
|
||||||
|
const university = universityData?.data
|
||||||
|
const scripts = scriptsData?.data || []
|
||||||
|
const jobs = jobsData?.data || []
|
||||||
|
const result = resultData?.data
|
||||||
|
|
||||||
|
// 构建数据树
|
||||||
|
const buildDataTree = () => {
|
||||||
|
if (!result?.result_data?.schools) return []
|
||||||
|
|
||||||
|
return result.result_data.schools.map((school: any, si: number) => ({
|
||||||
|
key: `school-${si}`,
|
||||||
|
title: (
|
||||||
|
<span>
|
||||||
|
<BankOutlined style={{ marginRight: 8 }} />
|
||||||
|
{school.name} ({school.programs?.length || 0}个项目)
|
||||||
|
</span>
|
||||||
|
),
|
||||||
|
children: school.programs?.map((prog: any, pi: number) => ({
|
||||||
|
key: `program-${si}-${pi}`,
|
||||||
|
title: (
|
||||||
|
<span>
|
||||||
|
<BookOutlined style={{ marginRight: 8 }} />
|
||||||
|
{prog.name} ({prog.faculty?.length || 0}位导师)
|
||||||
|
</span>
|
||||||
|
),
|
||||||
|
children: prog.faculty?.map((fac: any, fi: number) => ({
|
||||||
|
key: `faculty-${si}-${pi}-${fi}`,
|
||||||
|
title: (
|
||||||
|
<span>
|
||||||
|
<TeamOutlined style={{ marginRight: 8 }} />
|
||||||
|
<a href={fac.url} target="_blank" rel="noreferrer">{fac.name}</a>
|
||||||
|
</span>
|
||||||
|
),
|
||||||
|
isLeaf: true
|
||||||
|
}))
|
||||||
|
}))
|
||||||
|
}))
|
||||||
|
}
|
||||||
|
|
||||||
|
if (universityLoading) {
|
||||||
|
return <Card><Spin size="large" /></Card>
|
||||||
|
}
|
||||||
|
|
||||||
|
if (!university) {
|
||||||
|
return <Card><Empty description="大学不存在" /></Card>
|
||||||
|
}
|
||||||
|
|
||||||
|
const activeScript = scripts.find((s: any) => s.status === 'active')
|
||||||
|
const latestJob = jobs[0]
|
||||||
|
const isRunning = pollingJobId !== null || latestJob?.status === 'running'
|
||||||
|
|
||||||
|
return (
|
||||||
|
<div>
|
||||||
|
{/* 头部 */}
|
||||||
|
<Card style={{ marginBottom: 16 }}>
|
||||||
|
<Space style={{ marginBottom: 16 }}>
|
||||||
|
<Button icon={<ArrowLeftOutlined />} onClick={() => navigate('/')}>
|
||||||
|
返回列表
|
||||||
|
</Button>
|
||||||
|
</Space>
|
||||||
|
|
||||||
|
<Row gutter={24}>
|
||||||
|
<Col span={16}>
|
||||||
|
<Title level={3} style={{ marginBottom: 8 }}>{university.name}</Title>
|
||||||
|
<Paragraph>
|
||||||
|
<a href={university.url} target="_blank" rel="noreferrer">{university.url}</a>
|
||||||
|
</Paragraph>
|
||||||
|
<Space>
|
||||||
|
<Tag>{university.country || '未知国家'}</Tag>
|
||||||
|
<Tag color={university.status === 'ready' ? 'green' : 'orange'}>
|
||||||
|
{university.status === 'ready' ? '就绪' : university.status}
|
||||||
|
</Tag>
|
||||||
|
</Space>
|
||||||
|
</Col>
|
||||||
|
<Col span={8} style={{ textAlign: 'right' }}>
|
||||||
|
<Button
|
||||||
|
type="primary"
|
||||||
|
size="large"
|
||||||
|
icon={isRunning ? <Spin size="small" /> : <PlayCircleOutlined />}
|
||||||
|
onClick={() => startJobMutation.mutate()}
|
||||||
|
disabled={!activeScript || isRunning}
|
||||||
|
loading={startJobMutation.isPending}
|
||||||
|
>
|
||||||
|
{isRunning ? '爬虫运行中...' : '一键运行爬虫'}
|
||||||
|
</Button>
|
||||||
|
</Col>
|
||||||
|
</Row>
|
||||||
|
|
||||||
|
{/* 统计 */}
|
||||||
|
<Row gutter={16} style={{ marginTop: 24 }}>
|
||||||
|
<Col span={6}>
|
||||||
|
<Statistic title="学院数" value={university.latest_result?.schools_count || 0} />
|
||||||
|
</Col>
|
||||||
|
<Col span={6}>
|
||||||
|
<Statistic title="项目数" value={university.latest_result?.programs_count || 0} />
|
||||||
|
</Col>
|
||||||
|
<Col span={6}>
|
||||||
|
<Statistic title="导师数" value={university.latest_result?.faculty_count || 0} />
|
||||||
|
</Col>
|
||||||
|
<Col span={6}>
|
||||||
|
<Statistic title="脚本版本" value={activeScript?.version || 0} />
|
||||||
|
</Col>
|
||||||
|
</Row>
|
||||||
|
</Card>
|
||||||
|
|
||||||
|
{/* 运行进度 */}
|
||||||
|
{pollingJobId && jobStatusData?.data && (
|
||||||
|
<Card style={{ marginBottom: 16 }}>
|
||||||
|
<Title level={5}>爬虫运行中</Title>
|
||||||
|
<Progress percent={jobStatusData.data.progress} status="active" />
|
||||||
|
<Text type="secondary">{jobStatusData.data.current_step}</Text>
|
||||||
|
|
||||||
|
<div style={{ marginTop: 16, maxHeight: 200, overflowY: 'auto' }}>
|
||||||
|
<Timeline
|
||||||
|
items={jobStatusData.data.logs?.slice(-10).map((log: any) => ({
|
||||||
|
color: log.level === 'error' ? 'red' : log.level === 'warning' ? 'orange' : 'blue',
|
||||||
|
children: <Text>{log.message}</Text>
|
||||||
|
}))}
|
||||||
|
/>
|
||||||
|
</div>
|
||||||
|
</Card>
|
||||||
|
)}
|
||||||
|
|
||||||
|
{/* 标签页 */}
|
||||||
|
<Card>
|
||||||
|
<Tabs activeKey={activeTab} onChange={setActiveTab}>
|
||||||
|
{/* 概览 */}
|
||||||
|
<TabPane tab="概览" key="overview">
|
||||||
|
<Descriptions title="基本信息" bordered column={2}>
|
||||||
|
<Descriptions.Item label="大学名称">{university.name}</Descriptions.Item>
|
||||||
|
<Descriptions.Item label="官网地址">
|
||||||
|
<a href={university.url} target="_blank" rel="noreferrer">{university.url}</a>
|
||||||
|
</Descriptions.Item>
|
||||||
|
<Descriptions.Item label="国家">{university.country || '-'}</Descriptions.Item>
|
||||||
|
<Descriptions.Item label="状态">
|
||||||
|
<Tag color={university.status === 'ready' ? 'green' : 'default'}>
|
||||||
|
{university.status}
|
||||||
|
</Tag>
|
||||||
|
</Descriptions.Item>
|
||||||
|
<Descriptions.Item label="创建时间">
|
||||||
|
{new Date(university.created_at).toLocaleString()}
|
||||||
|
</Descriptions.Item>
|
||||||
|
<Descriptions.Item label="更新时间">
|
||||||
|
{new Date(university.updated_at).toLocaleString()}
|
||||||
|
</Descriptions.Item>
|
||||||
|
</Descriptions>
|
||||||
|
|
||||||
|
<Title level={5} style={{ marginTop: 24 }}>历史任务</Title>
|
||||||
|
<Table
|
||||||
|
dataSource={jobs.slice(0, 5)}
|
||||||
|
rowKey="id"
|
||||||
|
pagination={false}
|
||||||
|
columns={[
|
||||||
|
{
|
||||||
|
title: '任务ID',
|
||||||
|
dataIndex: 'id',
|
||||||
|
width: 80
|
||||||
|
},
|
||||||
|
{
|
||||||
|
title: '状态',
|
||||||
|
dataIndex: 'status',
|
||||||
|
width: 100,
|
||||||
|
render: (status: string) => {
|
||||||
|
const s = statusMap[status] || { color: 'default', text: status }
|
||||||
|
return <Tag color={s.color}>{s.icon} {s.text}</Tag>
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
title: '进度',
|
||||||
|
dataIndex: 'progress',
|
||||||
|
width: 150,
|
||||||
|
render: (progress: number) => <Progress percent={progress} size="small" />
|
||||||
|
},
|
||||||
|
{
|
||||||
|
title: '开始时间',
|
||||||
|
dataIndex: 'started_at',
|
||||||
|
render: (t: string) => t ? new Date(t).toLocaleString() : '-'
|
||||||
|
},
|
||||||
|
{
|
||||||
|
title: '完成时间',
|
||||||
|
dataIndex: 'completed_at',
|
||||||
|
render: (t: string) => t ? new Date(t).toLocaleString() : '-'
|
||||||
|
}
|
||||||
|
]}
|
||||||
|
/>
|
||||||
|
</TabPane>
|
||||||
|
|
||||||
|
{/* 数据查看 */}
|
||||||
|
<TabPane tab="数据查看" key="data">
|
||||||
|
{result?.result_data ? (
|
||||||
|
<div>
|
||||||
|
<Row style={{ marginBottom: 16 }}>
|
||||||
|
<Col span={12}>
|
||||||
|
<Input
|
||||||
|
placeholder="搜索项目或导师..."
|
||||||
|
prefix={<SearchOutlined />}
|
||||||
|
value={searchKeyword}
|
||||||
|
onChange={(e) => setSearchKeyword(e.target.value)}
|
||||||
|
style={{ width: 300 }}
|
||||||
|
/>
|
||||||
|
</Col>
|
||||||
|
<Col span={12} style={{ textAlign: 'right' }}>
|
||||||
|
<Button
|
||||||
|
icon={<DownloadOutlined />}
|
||||||
|
onClick={() => {
|
||||||
|
const dataStr = JSON.stringify(result.result_data, null, 2)
|
||||||
|
const blob = new Blob([dataStr], { type: 'application/json' })
|
||||||
|
const url = URL.createObjectURL(blob)
|
||||||
|
const a = document.createElement('a')
|
||||||
|
a.href = url
|
||||||
|
a.download = `${university.name}_data.json`
|
||||||
|
a.click()
|
||||||
|
}}
|
||||||
|
>
|
||||||
|
导出JSON
|
||||||
|
</Button>
|
||||||
|
</Col>
|
||||||
|
</Row>
|
||||||
|
|
||||||
|
<Tree
|
||||||
|
showLine
|
||||||
|
defaultExpandedKeys={['school-0']}
|
||||||
|
treeData={buildDataTree()}
|
||||||
|
style={{ background: '#fafafa', padding: 16, borderRadius: 8 }}
|
||||||
|
/>
|
||||||
|
</div>
|
||||||
|
) : (
|
||||||
|
<Empty description="暂无数据,请先运行爬虫" />
|
||||||
|
)}
|
||||||
|
</TabPane>
|
||||||
|
|
||||||
|
{/* 脚本管理 */}
|
||||||
|
<TabPane tab="脚本管理" key="script">
|
||||||
|
{activeScript ? (
|
||||||
|
<div>
|
||||||
|
<Descriptions bordered column={2}>
|
||||||
|
<Descriptions.Item label="脚本名称">{activeScript.script_name}</Descriptions.Item>
|
||||||
|
<Descriptions.Item label="版本">v{activeScript.version}</Descriptions.Item>
|
||||||
|
<Descriptions.Item label="状态">
|
||||||
|
<Tag color="green">活跃</Tag>
|
||||||
|
</Descriptions.Item>
|
||||||
|
<Descriptions.Item label="创建时间">
|
||||||
|
{new Date(activeScript.created_at).toLocaleString()}
|
||||||
|
</Descriptions.Item>
|
||||||
|
</Descriptions>
|
||||||
|
|
||||||
|
<Title level={5} style={{ marginTop: 24 }}>脚本代码</Title>
|
||||||
|
<pre style={{
|
||||||
|
background: '#1e1e1e',
|
||||||
|
color: '#d4d4d4',
|
||||||
|
padding: 16,
|
||||||
|
borderRadius: 8,
|
||||||
|
maxHeight: 400,
|
||||||
|
overflow: 'auto'
|
||||||
|
}}>
|
||||||
|
{activeScript.script_content}
|
||||||
|
</pre>
|
||||||
|
</div>
|
||||||
|
) : (
|
||||||
|
<Empty description="暂无脚本" />
|
||||||
|
)}
|
||||||
|
</TabPane>
|
||||||
|
</Tabs>
|
||||||
|
</Card>
|
||||||
|
</div>
|
||||||
|
)
|
||||||
|
}
|
||||||
77
frontend/src/services/api.ts
Normal file
@ -0,0 +1,77 @@
/**
 * API client for the backend service
 */
import axios from 'axios'

const api = axios.create({
  baseURL: '/api',
  timeout: 60000
})

// University endpoints
export const universityApi = {
  list: (params?: { skip?: number; limit?: number; search?: string }) =>
    api.get('/universities', { params }),

  get: (id: number) =>
    api.get(`/universities/${id}`),

  create: (data: { name: string; url: string; country?: string }) =>
    api.post('/universities', data),

  update: (id: number, data: { name?: string; url?: string; country?: string }) =>
    api.put(`/universities/${id}`, data),

  delete: (id: number) =>
    api.delete(`/universities/${id}`)
}

// Script endpoints
export const scriptApi = {
  generate: (data: { university_url: string; university_name?: string }) =>
    api.post('/scripts/generate', data),

  getByUniversity: (universityId: number) =>
    api.get(`/scripts/university/${universityId}`),

  get: (id: number) =>
    api.get(`/scripts/${id}`)
}

// Job endpoints
export const jobApi = {
  start: (universityId: number) =>
    api.post(`/jobs/start/${universityId}`),

  get: (id: number) =>
    api.get(`/jobs/${id}`),

  getStatus: (id: number) =>
    api.get(`/jobs/${id}/status`),

  getByUniversity: (universityId: number) =>
    api.get(`/jobs/university/${universityId}`),

  cancel: (id: number) =>
    api.post(`/jobs/${id}/cancel`)
}

// Result endpoints
export const resultApi = {
  get: (universityId: number) =>
    api.get(`/results/university/${universityId}`),

  getSchools: (universityId: number) =>
    api.get(`/results/university/${universityId}/schools`),

  getPrograms: (universityId: number, params?: { school_name?: string; search?: string }) =>
    api.get(`/results/university/${universityId}/programs`, { params }),

  getFaculty: (universityId: number, params?: { school_name?: string; program_name?: string; search?: string; skip?: number; limit?: number }) =>
    api.get(`/results/university/${universityId}/faculty`, { params }),

  export: (universityId: number) =>
    api.get(`/results/university/${universityId}/export`, { responseType: 'blob' })
}

export default api
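The same flow the frontend drives can also be exercised directly against the FastAPI service; a rough sketch with `requests` is shown below. The endpoint paths mirror the client above (scripts/generate, jobs/start, jobs/{id}/status, results/university/{id}); the response field names are assumptions based on how the React pages consume them, not a documented contract.

```python
# Sketch only: call the backend the way the React pages do.
# Paths mirror frontend/src/services/api.ts; response field names
# (university_id, id, status, progress) are assumptions from the page code.
import time
import requests

BASE = "http://localhost:8000/api"

# 1) Generate a scraper script for a university
gen = requests.post(
    f"{BASE}/scripts/generate",
    json={"university_url": "https://www.ucl.ac.uk/", "university_name": "University College London"},
).json()
university_id = gen["university_id"]

# 2) Start a scraping job and poll its status every 2 seconds
job_id = requests.post(f"{BASE}/jobs/start/{university_id}").json()["id"]
while True:
    status = requests.get(f"{BASE}/jobs/{job_id}/status").json()
    print(status.get("progress"), status.get("current_step"))
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(2)

# 3) Fetch the hierarchical result (schools -> programs -> faculty)
result = requests.get(f"{BASE}/results/university/{university_id}").json()
```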
1
frontend/src/vite-env.d.ts
vendored
Normal file
@ -0,0 +1 @@
/// <reference types="vite/client" />
21
frontend/tsconfig.json
Normal file
@ -0,0 +1,21 @@
{
  "compilerOptions": {
    "target": "ES2020",
    "useDefineForClassFields": true,
    "lib": ["ES2020", "DOM", "DOM.Iterable"],
    "module": "ESNext",
    "skipLibCheck": true,
    "moduleResolution": "bundler",
    "allowImportingTsExtensions": true,
    "resolveJsonModule": true,
    "isolatedModules": true,
    "noEmit": true,
    "jsx": "react-jsx",
    "strict": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true,
    "noFallthroughCasesInSwitch": true
  },
  "include": ["src"],
  "references": [{ "path": "./tsconfig.node.json" }]
}
10
frontend/tsconfig.node.json
Normal file
@ -0,0 +1,10 @@
{
  "compilerOptions": {
    "composite": true,
    "skipLibCheck": true,
    "module": "ESNext",
    "moduleResolution": "bundler",
    "allowSyntheticDefaultImports": true
  },
  "include": ["vite.config.ts"]
}
15
frontend/vite.config.ts
Normal file
@ -0,0 +1,15 @@
import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react'

export default defineConfig({
  plugins: [react()],
  server: {
    port: 3000,
    proxy: {
      '/api': {
        target: 'http://localhost:8000',
        changeOrigin: true
      }
    }
  }
})
164
scripts/reorganize_by_school.py
Normal file
@ -0,0 +1,164 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
将已爬取的Harvard数据按学院重新组织
|
||||||
|
|
||||||
|
读取原始扁平数据,按 学院 → 项目 → 导师 层级重新组织输出
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
from pathlib import Path
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from urllib.parse import urlparse
|
||||||
|
from collections import defaultdict
|
||||||
|
|
||||||
|
# Harvard学院映射 - 根据URL子域名判断所属学院
|
||||||
|
SCHOOL_MAPPING = {
|
||||||
|
"gsas.harvard.edu": "Graduate School of Arts and Sciences (GSAS)",
|
||||||
|
"seas.harvard.edu": "John A. Paulson School of Engineering and Applied Sciences (SEAS)",
|
||||||
|
"hbs.edu": "Harvard Business School (HBS)",
|
||||||
|
"www.hbs.edu": "Harvard Business School (HBS)",
|
||||||
|
"gsd.harvard.edu": "Graduate School of Design (GSD)",
|
||||||
|
"www.gsd.harvard.edu": "Graduate School of Design (GSD)",
|
||||||
|
"gse.harvard.edu": "Graduate School of Education (HGSE)",
|
||||||
|
"www.gse.harvard.edu": "Graduate School of Education (HGSE)",
|
||||||
|
"hks.harvard.edu": "Harvard Kennedy School (HKS)",
|
||||||
|
"www.hks.harvard.edu": "Harvard Kennedy School (HKS)",
|
||||||
|
"hls.harvard.edu": "Harvard Law School (HLS)",
|
||||||
|
"hms.harvard.edu": "Harvard Medical School (HMS)",
|
||||||
|
"hsph.harvard.edu": "T.H. Chan School of Public Health (HSPH)",
|
||||||
|
"www.hsph.harvard.edu": "T.H. Chan School of Public Health (HSPH)",
|
||||||
|
"hds.harvard.edu": "Harvard Divinity School (HDS)",
|
||||||
|
"hsdm.harvard.edu": "Harvard School of Dental Medicine (HSDM)",
|
||||||
|
"fas.harvard.edu": "Faculty of Arts and Sciences (FAS)",
|
||||||
|
"aaas.fas.harvard.edu": "Faculty of Arts and Sciences (FAS)",
|
||||||
|
"dce.harvard.edu": "Division of Continuing Education (DCE)",
|
||||||
|
"extension.harvard.edu": "Harvard Extension School",
|
||||||
|
"cs.seas.harvard.edu": "John A. Paulson School of Engineering and Applied Sciences (SEAS)",
|
||||||
|
}
|
||||||
|
|
||||||
|
# 学院URL映射
|
||||||
|
SCHOOL_URLS = {
|
||||||
|
"Graduate School of Arts and Sciences (GSAS)": "https://gsas.harvard.edu/",
|
||||||
|
"John A. Paulson School of Engineering and Applied Sciences (SEAS)": "https://seas.harvard.edu/",
|
||||||
|
"Harvard Business School (HBS)": "https://www.hbs.edu/",
|
||||||
|
"Graduate School of Design (GSD)": "https://www.gsd.harvard.edu/",
|
||||||
|
"Graduate School of Education (HGSE)": "https://www.gse.harvard.edu/",
|
||||||
|
"Harvard Kennedy School (HKS)": "https://www.hks.harvard.edu/",
|
||||||
|
"Harvard Law School (HLS)": "https://hls.harvard.edu/",
|
||||||
|
"Harvard Medical School (HMS)": "https://hms.harvard.edu/",
|
||||||
|
"T.H. Chan School of Public Health (HSPH)": "https://www.hsph.harvard.edu/",
|
||||||
|
"Harvard Divinity School (HDS)": "https://hds.harvard.edu/",
|
||||||
|
"Harvard School of Dental Medicine (HSDM)": "https://hsdm.harvard.edu/",
|
||||||
|
"Faculty of Arts and Sciences (FAS)": "https://fas.harvard.edu/",
|
||||||
|
"Division of Continuing Education (DCE)": "https://dce.harvard.edu/",
|
||||||
|
"Harvard Extension School": "https://extension.harvard.edu/",
|
||||||
|
"Other": "https://www.harvard.edu/",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def determine_school_from_url(url: str) -> str:
|
||||||
|
"""根据URL判断所属学院"""
|
||||||
|
if not url:
|
||||||
|
return "Other"
|
||||||
|
|
||||||
|
parsed = urlparse(url)
|
||||||
|
domain = parsed.netloc.lower()
|
||||||
|
|
||||||
|
# 先尝试完全匹配
|
||||||
|
for pattern, school_name in SCHOOL_MAPPING.items():
|
||||||
|
if domain == pattern:
|
||||||
|
return school_name
|
||||||
|
|
||||||
|
# 再尝试部分匹配
|
||||||
|
for pattern, school_name in SCHOOL_MAPPING.items():
|
||||||
|
if pattern in domain:
|
||||||
|
return school_name
|
||||||
|
|
||||||
|
return "Other"
|
||||||
|
|
||||||
|
|
||||||
|
def reorganize_data(input_path: str, output_path: str):
|
||||||
|
"""重新组织数据按学院层级"""
|
||||||
|
|
||||||
|
# 读取原始数据
|
||||||
|
with open(input_path, 'r', encoding='utf-8') as f:
|
||||||
|
data = json.load(f)
|
||||||
|
|
||||||
|
print(f"读取原始数据: {data['total_programs']} 个项目, {data['total_faculty_found']} 位导师")
|
||||||
|
|
||||||
|
# 按学院分组
|
||||||
|
schools_dict = defaultdict(lambda: {"name": "", "url": "", "programs": []})
|
||||||
|
|
||||||
|
for prog in data['programs']:
|
||||||
|
# 根据faculty_page_url判断学院
|
||||||
|
faculty_url = prog.get('faculty_page_url', '')
|
||||||
|
school_name = determine_school_from_url(faculty_url)
|
||||||
|
|
||||||
|
# 如果没有faculty_page_url,尝试从program url推断
|
||||||
|
if school_name == "Other" and prog.get('url'):
|
||||||
|
school_name = determine_school_from_url(prog['url'])
|
||||||
|
|
||||||
|
# 创建项目对象
|
||||||
|
program = {
|
||||||
|
"name": prog['name'],
|
||||||
|
"url": prog.get('url', ''),
|
||||||
|
"degree_type": prog.get('degrees', ''),
|
||||||
|
"faculty_page_url": faculty_url,
|
||||||
|
"faculty": prog.get('faculty', [])
|
||||||
|
}
|
||||||
|
|
||||||
|
# 添加到学院
|
||||||
|
if not schools_dict[school_name]["name"]:
|
||||||
|
schools_dict[school_name]["name"] = school_name
|
||||||
|
schools_dict[school_name]["url"] = SCHOOL_URLS.get(school_name, "")
|
||||||
|
|
||||||
|
schools_dict[school_name]["programs"].append(program)
|
||||||
|
|
||||||
|
# 转换为列表并排序
|
||||||
|
schools_list = sorted(schools_dict.values(), key=lambda s: s["name"])
|
||||||
|
|
||||||
|
# 构建输出结构
|
||||||
|
result = {
|
||||||
|
"name": "Harvard University",
|
||||||
|
"url": "https://www.harvard.edu/",
|
||||||
|
"country": "USA",
|
||||||
|
"scraped_at": datetime.now(timezone.utc).isoformat(),
|
||||||
|
"schools": schools_list
|
||||||
|
}
|
||||||
|
|
||||||
|
# 打印统计
|
||||||
|
print("\n" + "=" * 60)
|
||||||
|
print("按学院重新组织完成!")
|
||||||
|
print("=" * 60)
|
||||||
|
print(f"大学: {result['name']}")
|
||||||
|
print(f"学院数: {len(schools_list)}")
|
||||||
|
|
||||||
|
total_programs = sum(len(s['programs']) for s in schools_list)
|
||||||
|
total_faculty = sum(len(p['faculty']) for s in schools_list for p in s['programs'])
|
||||||
|
|
||||||
|
print(f"项目数: {total_programs}")
|
||||||
|
print(f"导师数: {total_faculty}")
|
||||||
|
|
||||||
|
print("\n各学院统计:")
|
||||||
|
for school in schools_list:
|
||||||
|
prog_count = len(school['programs'])
|
||||||
|
fac_count = sum(len(p['faculty']) for p in school['programs'])
|
||||||
|
print(f" {school['name']}: {prog_count}个项目, {fac_count}位导师")
|
||||||
|
|
||||||
|
# 保存结果
|
||||||
|
output_file = Path(output_path)
|
||||||
|
output_file.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
with open(output_file, 'w', encoding='utf-8') as f:
|
||||||
|
json.dump(result, f, ensure_ascii=False, indent=2)
|
||||||
|
|
||||||
|
print(f"\n结果已保存到: {output_path}")
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
input_file = "artifacts/harvard_programs_with_faculty.json"
|
||||||
|
output_file = "output/harvard_hierarchical_result.json"
|
||||||
|
|
||||||
|
reorganize_data(input_file, output_file)
|
||||||
45
scripts/start_backend.py
Normal file
@ -0,0 +1,45 @@
#!/usr/bin/env python3
"""
Start the backend API service (local development)
"""

import subprocess
import sys
import os

# Switch to the project root directory
project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
os.chdir(project_root)

# Add the backend directory to the Python path
backend_path = os.path.join(project_root, "backend")
sys.path.insert(0, backend_path)

print("=" * 60)
print("启动大学爬虫 Web API 服务")
print("=" * 60)
print(f"项目目录: {project_root}")
print(f"后端目录: {backend_path}")
print()

# Check whether the dependencies are installed
try:
    import fastapi
    import uvicorn
except ImportError:
    print("正在安装后端依赖...")
    subprocess.run([sys.executable, "-m", "pip", "install", "-r", "backend/requirements.txt"])

# Initialize the database
print("初始化数据库...")
os.chdir(backend_path)

# Start the service
print()
print("启动 FastAPI 服务...")
print("API文档: http://localhost:8000/docs")
print("ReDoc: http://localhost:8000/redoc")
print()

import uvicorn
uvicorn.run("app.main:app", host="0.0.0.0", port=8000, reload=True)
42
scripts/start_dev.bat
Normal file
@ -0,0 +1,42 @@
@echo off
echo ============================================================
echo 大学爬虫 Web 系统 - 本地开发启动
echo ============================================================

echo.
echo 启动后端API服务...
cd /d "%~dp0..\backend"

REM Install backend dependencies
pip install -r requirements.txt -q

REM Start the backend
start cmd /k "cd /d %~dp0..\backend && uvicorn app.main:app --reload --port 8000"

echo 后端已启动: http://localhost:8000
echo API文档: http://localhost:8000/docs

echo.
echo 启动前端服务...
cd /d "%~dp0..\frontend"

REM Install frontend dependencies
if not exist node_modules (
    echo 安装前端依赖...
    npm install
)

REM Start the frontend
start cmd /k "cd /d %~dp0..\frontend && npm run dev"

echo 前端已启动: http://localhost:3000

echo.
echo ============================================================
echo 系统启动完成!
echo.
echo 后端API: http://localhost:8000/docs
echo 前端页面: http://localhost:3000
echo ============================================================

pause
126
scripts/test_harvard.py
Normal file
@ -0,0 +1,126 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
测试Harvard大学爬取 - 只测试2个学院
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
# 添加项目路径
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
|
||||||
|
|
||||||
|
from university_scraper.config import ScraperConfig
|
||||||
|
from university_scraper.scraper import UniversityScraper
|
||||||
|
|
||||||
|
|
||||||
|
# 简化的测试配置 - 只测试2个学院
|
||||||
|
TEST_CONFIG = {
|
||||||
|
"university": {
|
||||||
|
"name": "Harvard University",
|
||||||
|
"url": "https://www.harvard.edu/",
|
||||||
|
"country": "USA"
|
||||||
|
},
|
||||||
|
"schools": {
|
||||||
|
"discovery_method": "static_list",
|
||||||
|
"static_list": [
|
||||||
|
{
|
||||||
|
"name": "John A. Paulson School of Engineering and Applied Sciences (SEAS)",
|
||||||
|
"url": "https://seas.harvard.edu/"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "Graduate School of Design (GSD)",
|
||||||
|
"url": "https://www.gsd.harvard.edu/"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"programs": {
|
||||||
|
"paths_to_try": [
|
||||||
|
"/academics/graduate-programs",
|
||||||
|
"/programs",
|
||||||
|
"/academics/programs",
|
||||||
|
"/graduate"
|
||||||
|
],
|
||||||
|
"link_patterns": [
|
||||||
|
{"text_contains": ["program", "degree"], "href_contains": ["/program", "/degree"]},
|
||||||
|
{"text_contains": ["master", "graduate"], "href_contains": ["/master", "/graduate"]}
|
||||||
|
],
|
||||||
|
"selectors": {
|
||||||
|
"program_item": "div.program-item, li.program, a[href*='/program']",
|
||||||
|
"program_name": "h3, .title",
|
||||||
|
"program_url": "a[href]",
|
||||||
|
"degree_type": ".degree"
|
||||||
|
},
|
||||||
|
"pagination": {"type": "none"}
|
||||||
|
},
|
||||||
|
"faculty": {
|
||||||
|
"discovery_strategies": [
|
||||||
|
{
|
||||||
|
"type": "link_in_page",
|
||||||
|
"patterns": [
|
||||||
|
{"text_contains": ["faculty", "people"], "href_contains": ["/faculty", "/people"]}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"type": "url_pattern",
|
||||||
|
"patterns": [
|
||||||
|
"{school_url}/faculty",
|
||||||
|
"{school_url}/people"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"selectors": {
|
||||||
|
"faculty_item": "div.faculty, li.person",
|
||||||
|
"faculty_name": "h3, .name",
|
||||||
|
"faculty_url": "a[href*='/people/'], a[href*='/faculty/']"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"filters": {
|
||||||
|
"program_degree_types": {
|
||||||
|
"include": ["Master", "M.S.", "M.A.", "MBA", "M.Eng", "S.M."],
|
||||||
|
"exclude": ["Ph.D.", "Doctor", "Bachelor"]
|
||||||
|
},
|
||||||
|
"exclude_schools": []
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
async def test_harvard():
|
||||||
|
"""测试Harvard爬取"""
|
||||||
|
print("=" * 60)
|
||||||
|
print("测试Harvard大学爬取(简化版 - 2个学院)")
|
||||||
|
print("=" * 60)
|
||||||
|
|
||||||
|
config = ScraperConfig.from_dict(TEST_CONFIG)
|
||||||
|
|
||||||
|
async with UniversityScraper(config, headless=False) as scraper:
|
||||||
|
university = await scraper.scrape()
|
||||||
|
scraper.save_results("output/harvard_test_result.json")
|
||||||
|
|
||||||
|
# 打印详细结果
|
||||||
|
print("\n" + "=" * 60)
|
||||||
|
print("详细结果:")
|
||||||
|
print("=" * 60)
|
||||||
|
|
||||||
|
for school in university.schools:
|
||||||
|
print(f"\n学院: {school.name}")
|
||||||
|
print(f" URL: {school.url}")
|
||||||
|
print(f" 项目数: {len(school.programs)}")
|
||||||
|
|
||||||
|
for prog in school.programs[:5]:
|
||||||
|
print(f"\n 项目: {prog.name}")
|
||||||
|
print(f" URL: {prog.url}")
|
||||||
|
print(f" 学位: {prog.degree_type}")
|
||||||
|
print(f" 导师数: {len(prog.faculty)}")
|
||||||
|
|
||||||
|
if prog.faculty:
|
||||||
|
print(" 导师示例:")
|
||||||
|
for f in prog.faculty[:3]:
|
||||||
|
print(f" - {f.name}: {f.url}")
|
||||||
|
|
||||||
|
if len(school.programs) > 5:
|
||||||
|
print(f"\n ... 还有 {len(school.programs) - 5} 个项目")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
asyncio.run(test_harvard())
|
||||||
7
src/university_scraper/__init__.py
Normal file
@ -0,0 +1,7 @@
"""
University Scraper - a generic framework for scraping university websites

Scrapes any overseas university site following the School -> Program -> Faculty hierarchy
"""

__version__ = "1.0.0"
8 src/university_scraper/__main__.py Normal file
@@ -0,0 +1,8 @@
"""
模块入口点,支持 python -m university_scraper 运行
"""

from .cli import main

if __name__ == "__main__":
    main()
374 src/university_scraper/analyzer.py Normal file
@@ -0,0 +1,374 @@
"""
|
||||||
|
AI辅助页面分析工具
|
||||||
|
|
||||||
|
帮助分析新大学官网的页面结构,生成配置建议
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
from typing import Dict, Any, List, Optional
|
||||||
|
from urllib.parse import urljoin, urlparse
|
||||||
|
|
||||||
|
from playwright.async_api import async_playwright, Page
|
||||||
|
|
||||||
|
|
||||||
|
class PageAnalyzer:
|
||||||
|
"""页面结构分析器"""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
self.browser = None
|
||||||
|
self.page: Optional[Page] = None
|
||||||
|
|
||||||
|
async def __aenter__(self):
|
||||||
|
playwright = await async_playwright().start()
|
||||||
|
self.browser = await playwright.chromium.launch(headless=False)
|
||||||
|
context = await self.browser.new_context(
|
||||||
|
viewport={'width': 1920, 'height': 1080}
|
||||||
|
)
|
||||||
|
self.page = await context.new_page()
|
||||||
|
return self
|
||||||
|
|
||||||
|
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||||
|
if self.browser:
|
||||||
|
await self.browser.close()
|
||||||
|
|
||||||
|
async def analyze_university_homepage(self, url: str) -> Dict[str, Any]:
|
||||||
|
"""分析大学官网首页,寻找学院链接"""
|
||||||
|
print(f"\n分析大学首页: {url}")
|
||||||
|
|
||||||
|
await self.page.goto(url, wait_until='networkidle')
|
||||||
|
await self.page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
analysis = await self.page.evaluate('''() => {
|
||||||
|
const result = {
|
||||||
|
title: document.title,
|
||||||
|
schools_links: [],
|
||||||
|
navigation_links: [],
|
||||||
|
potential_schools_pages: [],
|
||||||
|
all_harvard_subdomains: new Set()
|
||||||
|
};
|
||||||
|
|
||||||
|
// 查找可能的学院链接
|
||||||
|
const schoolKeywords = ['school', 'college', 'faculty', 'institute', 'academy', 'department'];
|
||||||
|
const navKeywords = ['academics', 'schools', 'colleges', 'programs', 'education'];
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href || '';
|
||||||
|
const text = a.innerText.trim().toLowerCase();
|
||||||
|
|
||||||
|
// 收集所有子域名
|
||||||
|
try {
|
||||||
|
const urlObj = new URL(href);
|
||||||
|
if (urlObj.hostname.includes('harvard.edu') &&
|
||||||
|
urlObj.hostname !== 'www.harvard.edu') {
|
||||||
|
result.all_harvard_subdomains.add(urlObj.origin);
|
||||||
|
}
|
||||||
|
} catch(e) {}
|
||||||
|
|
||||||
|
// 查找学院链接
|
||||||
|
if (schoolKeywords.some(kw => text.includes(kw)) ||
|
||||||
|
schoolKeywords.some(kw => href.toLowerCase().includes(kw))) {
|
||||||
|
result.schools_links.push({
|
||||||
|
text: a.innerText.trim().substring(0, 100),
|
||||||
|
href: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
// 查找导航到学院列表的链接
|
||||||
|
if (navKeywords.some(kw => text.includes(kw))) {
|
||||||
|
result.potential_schools_pages.push({
|
||||||
|
text: a.innerText.trim().substring(0, 50),
|
||||||
|
href: href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
// 转换Set为数组
|
||||||
|
result.all_harvard_subdomains = Array.from(result.all_harvard_subdomains);
|
||||||
|
|
||||||
|
return result;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f"\n页面标题: {analysis['title']}")
|
||||||
|
print(f"\n发现的子域名 ({len(analysis['all_harvard_subdomains'])} 个):")
|
||||||
|
for subdomain in analysis['all_harvard_subdomains'][:20]:
|
||||||
|
print(f" - {subdomain}")
|
||||||
|
|
||||||
|
print(f"\n可能的学院链接 ({len(analysis['schools_links'])} 个):")
|
||||||
|
for link in analysis['schools_links'][:15]:
|
||||||
|
print(f" - {link['text'][:50]} -> {link['href']}")
|
||||||
|
|
||||||
|
return analysis
|
||||||
|
|
||||||
|
async def analyze_school_page(self, url: str) -> Dict[str, Any]:
|
||||||
|
"""分析学院页面,寻找项目列表"""
|
||||||
|
print(f"\n分析学院页面: {url}")
|
||||||
|
|
||||||
|
await self.page.goto(url, wait_until='networkidle')
|
||||||
|
await self.page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
analysis = await self.page.evaluate('''() => {
|
||||||
|
const result = {
|
||||||
|
title: document.title,
|
||||||
|
navigation: [],
|
||||||
|
program_links: [],
|
||||||
|
degree_mentions: [],
|
||||||
|
faculty_links: []
|
||||||
|
};
|
||||||
|
|
||||||
|
// 分析导航结构
|
||||||
|
document.querySelectorAll('nav a, [class*="nav"] a, header a').forEach(a => {
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
const href = a.href || '';
|
||||||
|
if (text.length > 2 && text.length < 50) {
|
||||||
|
result.navigation.push({ text, href });
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
// 查找项目/学位链接
|
||||||
|
const programKeywords = ['program', 'degree', 'master', 'graduate', 'academic', 'study'];
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const text = a.innerText.trim().toLowerCase();
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
|
||||||
|
if (programKeywords.some(kw => text.includes(kw) || href.includes(kw))) {
|
||||||
|
result.program_links.push({
|
||||||
|
text: a.innerText.trim().substring(0, 100),
|
||||||
|
href: a.href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
// 查找Faculty链接
|
||||||
|
if (text.includes('faculty') || text.includes('people') ||
|
||||||
|
href.includes('/faculty') || href.includes('/people')) {
|
||||||
|
result.faculty_links.push({
|
||||||
|
text: a.innerText.trim().substring(0, 100),
|
||||||
|
href: a.href
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return result;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f"\n导航链接:")
|
||||||
|
for nav in analysis['navigation'][:10]:
|
||||||
|
print(f" - {nav['text']} -> {nav['href']}")
|
||||||
|
|
||||||
|
print(f"\n项目相关链接 ({len(analysis['program_links'])} 个):")
|
||||||
|
for link in analysis['program_links'][:15]:
|
||||||
|
print(f" - {link['text'][:50]} -> {link['href']}")
|
||||||
|
|
||||||
|
print(f"\nFaculty链接 ({len(analysis['faculty_links'])} 个):")
|
||||||
|
for link in analysis['faculty_links'][:10]:
|
||||||
|
print(f" - {link['text'][:50]} -> {link['href']}")
|
||||||
|
|
||||||
|
return analysis
|
||||||
|
|
||||||
|
async def analyze_programs_page(self, url: str) -> Dict[str, Any]:
|
||||||
|
"""分析项目列表页面,识别项目选择器"""
|
||||||
|
print(f"\n分析项目列表页面: {url}")
|
||||||
|
|
||||||
|
await self.page.goto(url, wait_until='networkidle')
|
||||||
|
await self.page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
# 保存截图
|
||||||
|
screenshot_path = f"analysis_{urlparse(url).netloc.replace('.', '_')}.png"
|
||||||
|
await self.page.screenshot(path=screenshot_path, full_page=True)
|
||||||
|
print(f"截图已保存: {screenshot_path}")
|
||||||
|
|
||||||
|
analysis = await self.page.evaluate('''() => {
|
||||||
|
const result = {
|
||||||
|
title: document.title,
|
||||||
|
potential_program_containers: [],
|
||||||
|
program_items: [],
|
||||||
|
pagination: null,
|
||||||
|
selectors_suggestion: {}
|
||||||
|
};
|
||||||
|
|
||||||
|
// 分析页面结构,寻找重复的项目容器
|
||||||
|
const containers = [
|
||||||
|
'div[class*="program"]',
|
||||||
|
'li[class*="program"]',
|
||||||
|
'article[class*="program"]',
|
||||||
|
'div[class*="degree"]',
|
||||||
|
'div[class*="card"]',
|
||||||
|
'li.item',
|
||||||
|
'div.item'
|
||||||
|
];
|
||||||
|
|
||||||
|
containers.forEach(selector => {
|
||||||
|
const elements = document.querySelectorAll(selector);
|
||||||
|
if (elements.length >= 3) {
|
||||||
|
result.potential_program_containers.push({
|
||||||
|
selector: selector,
|
||||||
|
count: elements.length,
|
||||||
|
sample: elements[0].outerHTML.substring(0, 500)
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
// 查找所有看起来像项目的链接
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
|
||||||
|
if ((href.includes('/program') || href.includes('/degree') ||
|
||||||
|
href.includes('/master') || href.includes('/graduate')) &&
|
||||||
|
text.length > 5 && text.length < 150) {
|
||||||
|
|
||||||
|
result.program_items.push({
|
||||||
|
text: text,
|
||||||
|
href: a.href,
|
||||||
|
parentClass: a.parentElement?.className || '',
|
||||||
|
grandparentClass: a.parentElement?.parentElement?.className || ''
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
// 查找分页元素
|
||||||
|
const paginationSelectors = [
|
||||||
|
'.pagination',
|
||||||
|
'[class*="pagination"]',
|
||||||
|
'nav[aria-label*="page"]',
|
||||||
|
'.pager'
|
||||||
|
];
|
||||||
|
|
||||||
|
for (const selector of paginationSelectors) {
|
||||||
|
const elem = document.querySelector(selector);
|
||||||
|
if (elem) {
|
||||||
|
result.pagination = {
|
||||||
|
selector: selector,
|
||||||
|
html: elem.outerHTML.substring(0, 300)
|
||||||
|
};
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return result;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f"\n可能的项目容器:")
|
||||||
|
for container in analysis['potential_program_containers']:
|
||||||
|
print(f" 选择器: {container['selector']} (找到 {container['count']} 个)")
|
||||||
|
|
||||||
|
print(f"\n找到的项目链接 ({len(analysis['program_items'])} 个):")
|
||||||
|
for item in analysis['program_items'][:10]:
|
||||||
|
print(f" - {item['text'][:60]}")
|
||||||
|
print(f" 父元素class: {item['parentClass'][:50]}")
|
||||||
|
|
||||||
|
if analysis['pagination']:
|
||||||
|
print(f"\n分页元素: {analysis['pagination']['selector']}")
|
||||||
|
|
||||||
|
return analysis
|
||||||
|
|
||||||
|
async def analyze_faculty_page(self, url: str) -> Dict[str, Any]:
|
||||||
|
"""分析导师列表页面,识别导师选择器"""
|
||||||
|
print(f"\n分析导师列表页面: {url}")
|
||||||
|
|
||||||
|
await self.page.goto(url, wait_until='networkidle')
|
||||||
|
await self.page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
analysis = await self.page.evaluate('''() => {
|
||||||
|
const result = {
|
||||||
|
title: document.title,
|
||||||
|
faculty_links: [],
|
||||||
|
potential_containers: [],
|
||||||
|
url_patterns: new Set()
|
||||||
|
};
|
||||||
|
|
||||||
|
// 查找个人页面链接
|
||||||
|
const personPatterns = ['/people/', '/faculty/', '/profile/', '/person/', '/directory/'];
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href.toLowerCase();
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
|
||||||
|
if (personPatterns.some(p => href.includes(p)) &&
|
||||||
|
text.length > 3 && text.length < 100) {
|
||||||
|
|
||||||
|
result.faculty_links.push({
|
||||||
|
name: text,
|
||||||
|
url: a.href,
|
||||||
|
parentClass: a.parentElement?.className || ''
|
||||||
|
});
|
||||||
|
|
||||||
|
// 记录URL模式
|
||||||
|
personPatterns.forEach(p => {
|
||||||
|
if (href.includes(p)) {
|
||||||
|
result.url_patterns.add(p);
|
||||||
|
}
|
||||||
|
});
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
result.url_patterns = Array.from(result.url_patterns);
|
||||||
|
|
||||||
|
return result;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
print(f"\n发现的导师链接 ({len(analysis['faculty_links'])} 个):")
|
||||||
|
for faculty in analysis['faculty_links'][:15]:
|
||||||
|
print(f" - {faculty['name']} -> {faculty['url']}")
|
||||||
|
|
||||||
|
print(f"\nURL模式: {analysis['url_patterns']}")
|
||||||
|
|
||||||
|
return analysis
|
||||||
|
|
||||||
|
async def generate_config_suggestion(self, university_url: str) -> str:
|
||||||
|
"""生成配置文件建议"""
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print(f"开始分析: {university_url}")
|
||||||
|
print(f"{'='*60}")
|
||||||
|
|
||||||
|
# 分析首页
|
||||||
|
homepage_analysis = await self.analyze_university_homepage(university_url)
|
||||||
|
|
||||||
|
# 生成配置建议
|
||||||
|
domain = urlparse(university_url).netloc
|
||||||
|
config_suggestion = f'''# {homepage_analysis['title']} 爬虫配置
|
||||||
|
# 自动生成的配置建议,请根据实际情况调整
|
||||||
|
|
||||||
|
university:
|
||||||
|
name: "{homepage_analysis['title'].split(' - ')[0].split(' | ')[0]}"
|
||||||
|
url: "{university_url}"
|
||||||
|
country: "TODO"
|
||||||
|
|
||||||
|
# 发现的子域名(可能是学院网站):
|
||||||
|
# {chr(10).join(['# - ' + s for s in homepage_analysis['all_harvard_subdomains'][:10]])}
|
||||||
|
|
||||||
|
schools:
|
||||||
|
discovery_method: "static_list"
|
||||||
|
|
||||||
|
# TODO: 根据上面的子域名和学院链接,手动填写学院列表
|
||||||
|
static_list:
|
||||||
|
# 示例:
|
||||||
|
# - name: "School of Engineering"
|
||||||
|
# url: "https://engineering.{domain}/"
|
||||||
|
'''
|
||||||
|
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print("配置建议:")
|
||||||
|
print(f"{'='*60}")
|
||||||
|
print(config_suggestion)
|
||||||
|
|
||||||
|
return config_suggestion
|
||||||
|
|
||||||
|
|
||||||
|
async def analyze_new_university(url: str):
|
||||||
|
"""分析新大学的便捷函数"""
|
||||||
|
async with PageAnalyzer() as analyzer:
|
||||||
|
await analyzer.generate_config_suggestion(url)
|
||||||
|
|
||||||
|
|
||||||
|
# CLI入口
|
||||||
|
if __name__ == "__main__":
|
||||||
|
import sys
|
||||||
|
|
||||||
|
if len(sys.argv) < 2:
|
||||||
|
print("用法: python analyzer.py <university_url>")
|
||||||
|
print("示例: python analyzer.py https://www.stanford.edu/")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
asyncio.run(analyze_new_university(sys.argv[1]))
|
||||||
105 src/university_scraper/cli.py Normal file
@@ -0,0 +1,105 @@
"""
|
||||||
|
命令行工具
|
||||||
|
|
||||||
|
用法:
|
||||||
|
# 爬取指定大学
|
||||||
|
python -m university_scraper scrape harvard
|
||||||
|
|
||||||
|
# 分析新大学
|
||||||
|
python -m university_scraper analyze https://www.stanford.edu/
|
||||||
|
|
||||||
|
# 列出可用配置
|
||||||
|
python -m university_scraper list
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import argparse
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
description="通用大学官网爬虫 - 按照 学院→项目→导师 层级爬取"
|
||||||
|
)
|
||||||
|
|
||||||
|
subparsers = parser.add_subparsers(dest='command', help='可用命令')
|
||||||
|
|
||||||
|
# 爬取命令
|
||||||
|
scrape_parser = subparsers.add_parser('scrape', help='爬取指定大学')
|
||||||
|
scrape_parser.add_argument('university', help='大学名称(配置文件名,不含.yaml)')
|
||||||
|
scrape_parser.add_argument('-o', '--output', help='输出文件路径', default=None)
|
||||||
|
scrape_parser.add_argument('--headless', action='store_true', help='无头模式运行')
|
||||||
|
scrape_parser.add_argument('--config-dir', default='configs', help='配置文件目录')
|
||||||
|
|
||||||
|
# 分析命令
|
||||||
|
analyze_parser = subparsers.add_parser('analyze', help='分析新大学官网结构')
|
||||||
|
analyze_parser.add_argument('url', help='大学官网URL')
|
||||||
|
|
||||||
|
# 列出命令
|
||||||
|
list_parser = subparsers.add_parser('list', help='列出可用的大学配置')
|
||||||
|
list_parser.add_argument('--config-dir', default='configs', help='配置文件目录')
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
if args.command == 'scrape':
|
||||||
|
asyncio.run(run_scrape(args))
|
||||||
|
elif args.command == 'analyze':
|
||||||
|
asyncio.run(run_analyze(args))
|
||||||
|
elif args.command == 'list':
|
||||||
|
run_list(args)
|
||||||
|
else:
|
||||||
|
parser.print_help()
|
||||||
|
|
||||||
|
|
||||||
|
async def run_scrape(args):
|
||||||
|
"""执行爬取"""
|
||||||
|
from .config import load_config
|
||||||
|
from .scraper import UniversityScraper
|
||||||
|
|
||||||
|
config_path = Path(args.config_dir) / f"{args.university}.yaml"
|
||||||
|
|
||||||
|
if not config_path.exists():
|
||||||
|
print(f"错误: 配置文件不存在 - {config_path}")
|
||||||
|
print(f"可用配置: {list_configs(args.config_dir)}")
|
||||||
|
return
|
||||||
|
|
||||||
|
config = load_config(str(config_path))
|
||||||
|
|
||||||
|
output_path = args.output or f"output/{args.university}_result.json"
|
||||||
|
|
||||||
|
async with UniversityScraper(config, headless=args.headless) as scraper:
|
||||||
|
await scraper.scrape()
|
||||||
|
scraper.save_results(output_path)
|
||||||
|
|
||||||
|
|
||||||
|
async def run_analyze(args):
|
||||||
|
"""执行分析"""
|
||||||
|
from .analyzer import PageAnalyzer
|
||||||
|
|
||||||
|
async with PageAnalyzer() as analyzer:
|
||||||
|
await analyzer.generate_config_suggestion(args.url)
|
||||||
|
|
||||||
|
|
||||||
|
def run_list(args):
|
||||||
|
"""列出可用配置"""
|
||||||
|
configs = list_configs(args.config_dir)
|
||||||
|
|
||||||
|
if configs:
|
||||||
|
print("可用的大学配置:")
|
||||||
|
for name in configs:
|
||||||
|
print(f" - {name}")
|
||||||
|
else:
|
||||||
|
print(f"在 {args.config_dir} 目录下没有找到配置文件")
|
||||||
|
|
||||||
|
|
||||||
|
def list_configs(config_dir: str):
|
||||||
|
"""列出配置文件"""
|
||||||
|
path = Path(config_dir)
|
||||||
|
if not path.exists():
|
||||||
|
return []
|
||||||
|
|
||||||
|
return [f.stem for f in path.glob("*.yaml")] + [f.stem for f in path.glob("*.yml")]
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
232 src/university_scraper/config.py Normal file
@@ -0,0 +1,232 @@
"""
|
||||||
|
配置文件加载和验证
|
||||||
|
|
||||||
|
配置文件格式 (YAML):
|
||||||
|
|
||||||
|
university:
|
||||||
|
name: "Harvard University"
|
||||||
|
url: "https://www.harvard.edu/"
|
||||||
|
country: "USA"
|
||||||
|
|
||||||
|
# 第一层:学院列表页面
|
||||||
|
schools:
|
||||||
|
# 获取学院列表的方式
|
||||||
|
discovery_method: "static_list" # static_list | scrape_page | sitemap
|
||||||
|
|
||||||
|
# 方式1: 静态列表 (手动配置已知学院)
|
||||||
|
static_list:
|
||||||
|
- name: "School of Engineering and Applied Sciences"
|
||||||
|
url: "https://seas.harvard.edu/"
|
||||||
|
keywords: ["engineering", "computer"]
|
||||||
|
faculty_pages:
|
||||||
|
- url: "https://seas.harvard.edu/people"
|
||||||
|
extract_method: "links" # links | table | research_explorer
|
||||||
|
request:
|
||||||
|
timeout_ms: 90000
|
||||||
|
wait_for_selector: ".profile-card"
|
||||||
|
- name: "Graduate School of Arts and Sciences"
|
||||||
|
url: "https://gsas.harvard.edu/"
|
||||||
|
|
||||||
|
# 方式2: 从页面爬取
|
||||||
|
scrape_config:
|
||||||
|
url: "https://www.harvard.edu/schools/"
|
||||||
|
selector: "a.school-link"
|
||||||
|
name_attribute: "text" # text | title | data-name
|
||||||
|
url_attribute: "href"
|
||||||
|
|
||||||
|
# 第二层:每个学院下的项目列表
|
||||||
|
programs:
|
||||||
|
# 相对于学院URL的路径模式
|
||||||
|
paths_to_try:
|
||||||
|
- "/academics/graduate-programs"
|
||||||
|
- "/programs"
|
||||||
|
- "/graduate"
|
||||||
|
- "/academics/masters"
|
||||||
|
|
||||||
|
# 或者使用选择器从学院首页查找
|
||||||
|
link_patterns:
|
||||||
|
- text_contains: ["graduate", "master", "program"]
|
||||||
|
- href_contains: ["/program", "/graduate", "/academics"]
|
||||||
|
|
||||||
|
# 项目列表页面的选择器
|
||||||
|
selectors:
|
||||||
|
program_item: "div.program-item, li.program, a.program-link"
|
||||||
|
program_name: "h3, .title, .program-name"
|
||||||
|
program_url: "a[href]"
|
||||||
|
degree_type: ".degree, .credential"
|
||||||
|
request:
|
||||||
|
timeout_ms: 45000
|
||||||
|
max_retries: 3
|
||||||
|
retry_backoff_ms: 3000
|
||||||
|
|
||||||
|
# 分页配置
|
||||||
|
pagination:
|
||||||
|
type: "none" # none | click | url_param | infinite_scroll
|
||||||
|
next_selector: "a.next, button.next-page"
|
||||||
|
param_name: "page"
|
||||||
|
|
||||||
|
# 第三层:每个项目下的导师列表
|
||||||
|
faculty:
|
||||||
|
# 查找导师页面的策略
|
||||||
|
discovery_strategies:
|
||||||
|
- type: "link_in_page"
|
||||||
|
patterns:
|
||||||
|
- text_contains: ["faculty", "people", "advisor", "professor"]
|
||||||
|
- href_contains: ["/faculty", "/people", "/directory"]
|
||||||
|
|
||||||
|
- type: "url_pattern"
|
||||||
|
patterns:
|
||||||
|
- "{program_url}/faculty"
|
||||||
|
- "{program_url}/people"
|
||||||
|
- "{school_url}/people"
|
||||||
|
- type: "school_directory"
|
||||||
|
assign_to_all: true
|
||||||
|
match_by_school_keywords: true
|
||||||
|
request:
|
||||||
|
timeout_ms: 90000
|
||||||
|
wait_for_selector: "a.link.person"
|
||||||
|
|
||||||
|
# 导师列表页面的选择器
|
||||||
|
selectors:
|
||||||
|
faculty_item: "div.faculty-item, li.person, .profile-card"
|
||||||
|
faculty_name: "h3, .name, .title a"
|
||||||
|
faculty_url: "a[href*='/people/'], a[href*='/faculty/'], a[href*='/profile/']"
|
||||||
|
faculty_title: ".title, .position, .role"
|
||||||
|
faculty_email: "a[href^='mailto:']"
|
||||||
|
|
||||||
|
# 过滤规则
|
||||||
|
filters:
|
||||||
|
# 只爬取硕士项目
|
||||||
|
program_degree_types:
|
||||||
|
include: ["M.S.", "M.A.", "MBA", "Master", "M.Eng", "M.Ed", "M.P.P", "M.P.A"]
|
||||||
|
exclude: ["Ph.D.", "Bachelor", "B.S.", "B.A.", "Certificate"]
|
||||||
|
|
||||||
|
# 排除某些学院
|
||||||
|
exclude_schools:
|
||||||
|
- "Summer School"
|
||||||
|
- "Extension School"
|
||||||
|
"""
|
||||||
|
|
||||||
|
import yaml
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Dict, Any, List, Optional
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class UniversityConfig:
|
||||||
|
"""大学基本信息配置"""
|
||||||
|
name: str
|
||||||
|
url: str
|
||||||
|
country: str = "Unknown"
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class SchoolsConfig:
|
||||||
|
"""学院发现配置"""
|
||||||
|
discovery_method: str = "static_list"
|
||||||
|
static_list: List[Dict[str, str]] = field(default_factory=list)
|
||||||
|
scrape_config: Optional[Dict[str, Any]] = None
|
||||||
|
request: Dict[str, Any] = field(default_factory=dict)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ProgramsConfig:
|
||||||
|
"""项目发现配置"""
|
||||||
|
paths_to_try: List[str] = field(default_factory=list)
|
||||||
|
link_patterns: List[Dict[str, List[str]]] = field(default_factory=list)
|
||||||
|
selectors: Dict[str, str] = field(default_factory=dict)
|
||||||
|
pagination: Dict[str, Any] = field(default_factory=dict)
|
||||||
|
request: Dict[str, Any] = field(default_factory=dict)
|
||||||
|
global_catalog: Optional[Dict[str, Any]] = None
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class FacultyConfig:
|
||||||
|
"""导师发现配置"""
|
||||||
|
discovery_strategies: List[Dict[str, Any]] = field(default_factory=list)
|
||||||
|
selectors: Dict[str, str] = field(default_factory=dict)
|
||||||
|
request: Dict[str, Any] = field(default_factory=dict)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class FiltersConfig:
|
||||||
|
"""过滤规则配置"""
|
||||||
|
program_degree_types: Dict[str, List[str]] = field(default_factory=dict)
|
||||||
|
exclude_schools: List[str] = field(default_factory=list)
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class PlaywrightConfig:
|
||||||
|
"""Playwright运行环境配置"""
|
||||||
|
stealth: bool = False
|
||||||
|
user_agent: Optional[str] = None
|
||||||
|
locale: Optional[str] = None
|
||||||
|
timezone_id: Optional[str] = None
|
||||||
|
viewport: Optional[Dict[str, int]] = None
|
||||||
|
ignore_https_errors: bool = False
|
||||||
|
extra_headers: Dict[str, str] = field(default_factory=dict)
|
||||||
|
cookies: List[Dict[str, Any]] = field(default_factory=list)
|
||||||
|
add_init_scripts: List[str] = field(default_factory=list)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ScraperConfig:
|
||||||
|
"""完整的爬虫配置"""
|
||||||
|
university: UniversityConfig
|
||||||
|
schools: SchoolsConfig
|
||||||
|
programs: ProgramsConfig
|
||||||
|
faculty: FacultyConfig
|
||||||
|
filters: FiltersConfig
|
||||||
|
playwright: PlaywrightConfig = field(default_factory=PlaywrightConfig)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_yaml(cls, yaml_path: str) -> "ScraperConfig":
|
||||||
|
"""从YAML文件加载配置"""
|
||||||
|
with open(yaml_path, 'r', encoding='utf-8') as f:
|
||||||
|
data = yaml.safe_load(f)
|
||||||
|
|
||||||
|
return cls(
|
||||||
|
university=UniversityConfig(**data.get('university', {})),
|
||||||
|
schools=SchoolsConfig(**data.get('schools', {})),
|
||||||
|
programs=ProgramsConfig(**data.get('programs', {})),
|
||||||
|
faculty=FacultyConfig(**data.get('faculty', {})),
|
||||||
|
filters=FiltersConfig(**data.get('filters', {})),
|
||||||
|
playwright=PlaywrightConfig(**data.get('playwright', {}))
|
||||||
|
)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_dict(cls, data: Dict[str, Any]) -> "ScraperConfig":
|
||||||
|
"""从字典创建配置"""
|
||||||
|
return cls(
|
||||||
|
university=UniversityConfig(**data.get('university', {})),
|
||||||
|
schools=SchoolsConfig(**data.get('schools', {})),
|
||||||
|
programs=ProgramsConfig(**data.get('programs', {})),
|
||||||
|
faculty=FacultyConfig(**data.get('faculty', {})),
|
||||||
|
filters=FiltersConfig(**data.get('filters', {})),
|
||||||
|
playwright=PlaywrightConfig(**data.get('playwright', {}))
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def load_config(config_path: str) -> ScraperConfig:
|
||||||
|
"""加载配置文件"""
|
||||||
|
path = Path(config_path)
|
||||||
|
if not path.exists():
|
||||||
|
raise FileNotFoundError(f"配置文件不存在: {config_path}")
|
||||||
|
|
||||||
|
if path.suffix in ['.yaml', '.yml']:
|
||||||
|
return ScraperConfig.from_yaml(config_path)
|
||||||
|
else:
|
||||||
|
raise ValueError(f"不支持的配置文件格式: {path.suffix}")
|
||||||
|
|
||||||
|
|
||||||
|
def list_available_configs(configs_dir: str = "configs") -> List[str]:
|
||||||
|
"""列出所有可用的配置文件"""
|
||||||
|
path = Path(configs_dir)
|
||||||
|
if not path.exists():
|
||||||
|
return []
|
||||||
|
|
||||||
|
return [
|
||||||
|
f.stem for f in path.glob("*.yaml")
|
||||||
|
] + [
|
||||||
|
f.stem for f in path.glob("*.yml")
|
||||||
|
]
|
||||||
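补充一个 load_config 的最小使用示意(假设包已按 src 布局安装,configs/harvard.yaml 仅为示例文件名,请以仓库中实际存在的配置为准):

```python
from university_scraper.config import load_config

# 加载 YAML 配置,并访问 university / schools 等各层配置项
config = load_config("configs/harvard.yaml")
print(config.university.name)             # 大学名称
print(config.schools.discovery_method)    # static_list / scrape_page / sitemap
for school in config.schools.static_list:
    print(school["name"], school["url"])
```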
405 src/university_scraper/harvard_scraper.py Normal file
@@ -0,0 +1,405 @@
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Harvard专用爬虫
|
||||||
|
|
||||||
|
Harvard的特殊情况:
|
||||||
|
1. 有一个集中的项目列表页面 (harvard.edu/programs)
|
||||||
|
2. 项目详情在GSAS页面 (gsas.harvard.edu/program/xxx)
|
||||||
|
3. 导师信息在各院系网站
|
||||||
|
|
||||||
|
爬取流程:
|
||||||
|
1. 从集中页面获取所有硕士项目
|
||||||
|
2. 通过GSAS页面确定每个项目所属学院
|
||||||
|
3. 从院系网站获取导师信息
|
||||||
|
4. 按 学院→项目→导师 层级组织输出
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import List, Dict, Optional, Tuple
|
||||||
|
from urllib.parse import urljoin
|
||||||
|
|
||||||
|
from playwright.async_api import async_playwright, Page, Browser
|
||||||
|
|
||||||
|
from .models import University, School, Program, Faculty
|
||||||
|
|
||||||
|
|
||||||
|
# Harvard学院映射 - 根据URL子域名判断所属学院
|
||||||
|
SCHOOL_MAPPING = {
|
||||||
|
"gsas.harvard.edu": "Graduate School of Arts and Sciences (GSAS)",
|
||||||
|
"seas.harvard.edu": "John A. Paulson School of Engineering and Applied Sciences (SEAS)",
|
||||||
|
"hbs.edu": "Harvard Business School (HBS)",
|
||||||
|
"www.hbs.edu": "Harvard Business School (HBS)",
|
||||||
|
"gsd.harvard.edu": "Graduate School of Design (GSD)",
|
||||||
|
"www.gsd.harvard.edu": "Graduate School of Design (GSD)",
|
||||||
|
"gse.harvard.edu": "Graduate School of Education (HGSE)",
|
||||||
|
"www.gse.harvard.edu": "Graduate School of Education (HGSE)",
|
||||||
|
"hks.harvard.edu": "Harvard Kennedy School (HKS)",
|
||||||
|
"www.hks.harvard.edu": "Harvard Kennedy School (HKS)",
|
||||||
|
"hls.harvard.edu": "Harvard Law School (HLS)",
|
||||||
|
"hms.harvard.edu": "Harvard Medical School (HMS)",
|
||||||
|
"hsph.harvard.edu": "T.H. Chan School of Public Health (HSPH)",
|
||||||
|
"www.hsph.harvard.edu": "T.H. Chan School of Public Health (HSPH)",
|
||||||
|
"hds.harvard.edu": "Harvard Divinity School (HDS)",
|
||||||
|
"hsdm.harvard.edu": "Harvard School of Dental Medicine (HSDM)",
|
||||||
|
"fas.harvard.edu": "Faculty of Arts and Sciences (FAS)",
|
||||||
|
"dce.harvard.edu": "Division of Continuing Education (DCE)",
|
||||||
|
"extension.harvard.edu": "Harvard Extension School",
|
||||||
|
}
|
||||||
|
|
||||||
|
# 学院URL映射
|
||||||
|
SCHOOL_URLS = {
|
||||||
|
"Graduate School of Arts and Sciences (GSAS)": "https://gsas.harvard.edu/",
|
||||||
|
"John A. Paulson School of Engineering and Applied Sciences (SEAS)": "https://seas.harvard.edu/",
|
||||||
|
"Harvard Business School (HBS)": "https://www.hbs.edu/",
|
||||||
|
"Graduate School of Design (GSD)": "https://www.gsd.harvard.edu/",
|
||||||
|
"Graduate School of Education (HGSE)": "https://www.gse.harvard.edu/",
|
||||||
|
"Harvard Kennedy School (HKS)": "https://www.hks.harvard.edu/",
|
||||||
|
"Harvard Law School (HLS)": "https://hls.harvard.edu/",
|
||||||
|
"Harvard Medical School (HMS)": "https://hms.harvard.edu/",
|
||||||
|
"T.H. Chan School of Public Health (HSPH)": "https://www.hsph.harvard.edu/",
|
||||||
|
"Harvard Divinity School (HDS)": "https://hds.harvard.edu/",
|
||||||
|
"Harvard School of Dental Medicine (HSDM)": "https://hsdm.harvard.edu/",
|
||||||
|
"Faculty of Arts and Sciences (FAS)": "https://fas.harvard.edu/",
|
||||||
|
"Other": "https://www.harvard.edu/",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def name_to_slug(name: str) -> str:
|
||||||
|
"""将项目名称转换为URL slug"""
|
||||||
|
slug = name.lower()
|
||||||
|
slug = re.sub(r'[^\w\s-]', '', slug)
|
||||||
|
slug = re.sub(r'[\s_]+', '-', slug)
|
||||||
|
slug = re.sub(r'-+', '-', slug)
|
||||||
|
slug = slug.strip('-')
|
||||||
|
return slug
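# 用法示意(示例输入输出系根据上面的正则推断,仅供参考):
#   name_to_slug("Computer Science")     -> "computer-science"
#   name_to_slug("Data Science (S.M.)")  -> "data-science-sm"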
|
||||||
|
|
||||||
|
|
||||||
|
def determine_school_from_url(url: str) -> str:
|
||||||
|
"""根据URL判断所属学院"""
|
||||||
|
if not url:
|
||||||
|
return "Other"
|
||||||
|
|
||||||
|
from urllib.parse import urlparse
|
||||||
|
parsed = urlparse(url)
|
||||||
|
domain = parsed.netloc.lower()
|
||||||
|
|
||||||
|
for pattern, school_name in SCHOOL_MAPPING.items():
|
||||||
|
if pattern in domain:
|
||||||
|
return school_name
|
||||||
|
|
||||||
|
return "Other"
|
||||||
|
|
||||||
|
|
||||||
|
class HarvardScraper:
|
||||||
|
"""Harvard专用爬虫"""
|
||||||
|
|
||||||
|
def __init__(self, headless: bool = True):
|
||||||
|
self.headless = headless
|
||||||
|
self.browser: Optional[Browser] = None
|
||||||
|
self.page: Optional[Page] = None
|
||||||
|
self._playwright = None
|
||||||
|
|
||||||
|
async def __aenter__(self):
|
||||||
|
self._playwright = await async_playwright().start()
|
||||||
|
self.browser = await self._playwright.chromium.launch(headless=self.headless)
|
||||||
|
context = await self.browser.new_context(
|
||||||
|
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
|
||||||
|
viewport={'width': 1920, 'height': 1080},
|
||||||
|
java_script_enabled=True,
|
||||||
|
)
|
||||||
|
self.page = await context.new_page()
|
||||||
|
return self
|
||||||
|
|
||||||
|
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||||
|
if self.browser:
|
||||||
|
await self.browser.close()
|
||||||
|
if self._playwright:
|
||||||
|
await self._playwright.stop()
|
||||||
|
|
||||||
|
async def _safe_goto(self, url: str, timeout: int = 30000, retries: int = 3) -> bool:
|
||||||
|
"""安全的页面导航,带重试机制"""
|
||||||
|
for attempt in range(retries):
|
||||||
|
try:
|
||||||
|
await self.page.goto(url, wait_until="domcontentloaded", timeout=timeout)
|
||||||
|
await self.page.wait_for_timeout(2000)
|
||||||
|
return True
|
||||||
|
except Exception as e:
|
||||||
|
print(f" 导航失败 (尝试 {attempt + 1}/{retries}): {str(e)[:50]}")
|
||||||
|
if attempt < retries - 1:
|
||||||
|
await self.page.wait_for_timeout(3000)
|
||||||
|
return False
|
||||||
|
|
||||||
|
async def scrape(self) -> University:
|
||||||
|
"""执行完整的爬取流程"""
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print("Harvard University 专用爬虫")
|
||||||
|
print(f"{'='*60}")
|
||||||
|
|
||||||
|
# 创建大学对象
|
||||||
|
university = University(
|
||||||
|
name="Harvard University",
|
||||||
|
url="https://www.harvard.edu/",
|
||||||
|
country="USA"
|
||||||
|
)
|
||||||
|
|
||||||
|
# 第一阶段:从集中页面获取所有硕士项目
|
||||||
|
print("\n[阶段1] 从集中页面获取项目列表...")
|
||||||
|
raw_programs = await self._scrape_programs_list()
|
||||||
|
print(f" 找到 {len(raw_programs)} 个项目")
|
||||||
|
|
||||||
|
# 第二阶段:获取每个项目的详情和导师信息
|
||||||
|
print("\n[阶段2] 获取项目详情和导师信息...")
|
||||||
|
|
||||||
|
# 按学院组织的项目
|
||||||
|
schools_dict: Dict[str, School] = {}
|
||||||
|
|
||||||
|
for i, prog_data in enumerate(raw_programs, 1):
|
||||||
|
print(f"\n [{i}/{len(raw_programs)}] {prog_data['name']}")
|
||||||
|
|
||||||
|
# 获取项目详情和导师
|
||||||
|
program, school_name = await self._get_program_details(prog_data)
|
||||||
|
|
||||||
|
if program:
|
||||||
|
# 添加到对应学院
|
||||||
|
if school_name not in schools_dict:
|
||||||
|
schools_dict[school_name] = School(
|
||||||
|
name=school_name,
|
||||||
|
url=SCHOOL_URLS.get(school_name, "")
|
||||||
|
)
|
||||||
|
schools_dict[school_name].programs.append(program)
|
||||||
|
|
||||||
|
print(f" 学院: {school_name}")
|
||||||
|
print(f" 导师: {len(program.faculty)}位")
|
||||||
|
|
||||||
|
# 避免请求过快
|
||||||
|
await self.page.wait_for_timeout(1000)
|
||||||
|
|
||||||
|
# 转换为列表并排序
|
||||||
|
university.schools = sorted(schools_dict.values(), key=lambda s: s.name)
|
||||||
|
university.scraped_at = datetime.now(timezone.utc).isoformat()
|
||||||
|
|
||||||
|
# 打印统计
|
||||||
|
self._print_summary(university)
|
||||||
|
|
||||||
|
return university
|
||||||
|
|
||||||
|
async def _scrape_programs_list(self) -> List[Dict]:
|
||||||
|
"""从Harvard集中页面获取所有硕士项目"""
|
||||||
|
all_programs = []
|
||||||
|
base_url = "https://www.harvard.edu/programs/?degree_levels=graduate"
|
||||||
|
|
||||||
|
print(f" 访问: {base_url}")
|
||||||
|
if not await self._safe_goto(base_url, timeout=60000):
|
||||||
|
print(" 无法访问项目页面!")
|
||||||
|
return []
|
||||||
|
await self.page.wait_for_timeout(3000)
|
||||||
|
|
||||||
|
# 滚动到页面底部
|
||||||
|
await self.page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||||
|
await self.page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
current_page = 1
|
||||||
|
max_pages = 15
|
||||||
|
|
||||||
|
while current_page <= max_pages:
|
||||||
|
print(f" 第 {current_page} 页...")
|
||||||
|
await self.page.wait_for_timeout(2000)
|
||||||
|
|
||||||
|
# 提取当前页面的项目
|
||||||
|
page_data = await self.page.evaluate('''() => {
|
||||||
|
const programs = [];
|
||||||
|
const programItems = document.querySelectorAll('[class*="records__record"], [class*="c-programs-item"]');
|
||||||
|
|
||||||
|
programItems.forEach((item) => {
|
||||||
|
const nameBtn = item.querySelector('button[class*="title-link"], button[class*="c-programs-item"]');
|
||||||
|
if (!nameBtn) return;
|
||||||
|
|
||||||
|
const name = nameBtn.innerText.trim();
|
||||||
|
if (!name || name.length < 3) return;
|
||||||
|
|
||||||
|
let degrees = '';
|
||||||
|
const allText = item.innerText;
|
||||||
|
const degreeMatch = allText.match(/(A\\.B\\.|Ph\\.D\\.|M\\.A\\.|S\\.M\\.|M\\.Arch\\.|LL\\.M\\.|S\\.B\\.|A\\.L\\.B\\.|A\\.L\\.M\\.|M\\.M\\.Sc\\.|Ed\\.D\\.|Ed\\.M\\.|M\\.P\\.A\\.|M\\.P\\.P\\.|M\\.P\\.H\\.|J\\.D\\.|M\\.B\\.A\\.|M\\.D\\.|D\\.M\\.D\\.|Th\\.D\\.|M\\.Div\\.|M\\.T\\.S\\.|M\\.E\\.|D\\.M\\.Sc\\.|M\\.H\\.C\\.M\\.|M\\.L\\.A\\.|M\\.D\\.E\\.|M\\.R\\.E\\.|M\\.A\\.U\\.D\\.|M\\.R\\.P\\.L\\.)/g);
|
||||||
|
if (degreeMatch) {
|
||||||
|
degrees = degreeMatch.join(', ');
|
||||||
|
}
|
||||||
|
|
||||||
|
programs.push({ name, degrees });
|
||||||
|
});
|
||||||
|
|
||||||
|
return programs;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
for prog in page_data:
|
||||||
|
name = prog['name'].strip()
|
||||||
|
if name and not any(p['name'] == name for p in all_programs):
|
||||||
|
all_programs.append(prog)
|
||||||
|
|
||||||
|
# 尝试点击下一页
|
||||||
|
try:
|
||||||
|
next_btn = self.page.locator('button.c-pagination__link--next')
|
||||||
|
if await next_btn.count() > 0:
|
||||||
|
await next_btn.first.scroll_into_view_if_needed()
|
||||||
|
await next_btn.first.click()
|
||||||
|
await self.page.wait_for_timeout(3000)
|
||||||
|
current_page += 1
|
||||||
|
else:
|
||||||
|
break
|
||||||
|
except Exception:
|
||||||
|
break
|
||||||
|
|
||||||
|
# 过滤:只保留硕士项目
|
||||||
|
master_keywords = ['M.A.', 'M.S.', 'S.M.', 'A.M.', 'MBA', 'M.Arch', 'M.L.A.',
|
||||||
|
'M.Div', 'M.T.S', 'LL.M', 'M.P.P', 'M.P.A', 'M.Ed', 'Ed.M.',
|
||||||
|
'A.L.M.', 'M.P.H.', 'M.M.Sc.', 'Master']
|
||||||
|
phd_keywords = ['Ph.D.', 'Doctor', 'D.M.D.', 'D.M.Sc.', 'Ed.D.', 'Th.D.', 'J.D.', 'M.D.']
|
||||||
|
|
||||||
|
filtered = []
|
||||||
|
for prog in all_programs:
|
||||||
|
degrees = prog.get('degrees', '')
|
||||||
|
name = prog.get('name', '')
|
||||||
|
|
||||||
|
# 检查是否有硕士学位
|
||||||
|
has_master = any(kw in degrees or kw in name for kw in master_keywords)
|
||||||
|
|
||||||
|
# 排除纯博士项目
|
||||||
|
is_phd_only = any(kw in degrees for kw in phd_keywords) and not has_master  # 仅标注了博士类学位且无硕士学位
|
||||||
|
|
||||||
|
if has_master or (not is_phd_only and not degrees):
|
||||||
|
filtered.append(prog)
|
||||||
|
|
||||||
|
return filtered
|
||||||
|
|
||||||
|
async def _get_program_details(self, prog_data: Dict) -> Tuple[Optional[Program], str]:
|
||||||
|
"""获取项目详情和导师信息"""
|
||||||
|
name = prog_data['name']
|
||||||
|
degrees = prog_data.get('degrees', '')
|
||||||
|
|
||||||
|
# 生成URL
|
||||||
|
slug = name_to_slug(name)
|
||||||
|
program_url = f"https://www.harvard.edu/programs/{slug}/"
|
||||||
|
gsas_url = f"https://gsas.harvard.edu/program/{slug}"
|
||||||
|
|
||||||
|
# 访问GSAS页面获取详情
|
||||||
|
school_name = "Other"
|
||||||
|
faculty_list = []
|
||||||
|
faculty_page_url = None
|
||||||
|
|
||||||
|
try:
|
||||||
|
if await self._safe_goto(gsas_url, timeout=20000, retries=2):
|
||||||
|
# 检查页面是否有效
|
||||||
|
title = await self.page.title()
|
||||||
|
if '404' not in title and 'not found' not in title.lower():
|
||||||
|
school_name = "Graduate School of Arts and Sciences (GSAS)"
|
||||||
|
|
||||||
|
# 查找Faculty链接
|
||||||
|
faculty_link = await self.page.evaluate('''() => {
|
||||||
|
const links = document.querySelectorAll('a[href]');
|
||||||
|
for (const link of links) {
|
||||||
|
const text = link.innerText.toLowerCase();
|
||||||
|
const href = link.href;
|
||||||
|
if (text.includes('faculty') && text.includes('see list')) {
|
||||||
|
return href;
|
||||||
|
}
|
||||||
|
if ((text.includes('faculty') || text.includes('people')) &&
|
||||||
|
(href.includes('/people') || href.includes('/faculty'))) {
|
||||||
|
return href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
if faculty_link:
|
||||||
|
faculty_page_url = faculty_link
|
||||||
|
school_name = determine_school_from_url(faculty_link)
|
||||||
|
|
||||||
|
# 访问导师页面
|
||||||
|
if await self._safe_goto(faculty_link, timeout=20000, retries=2):
|
||||||
|
# 提取导师信息
|
||||||
|
faculty_list = await self._extract_faculty()
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f" 获取详情失败: {str(e)[:50]}")
|
||||||
|
|
||||||
|
# 创建项目对象
|
||||||
|
program = Program(
|
||||||
|
name=name,
|
||||||
|
url=program_url,
|
||||||
|
degree_type=degrees,
|
||||||
|
faculty_page_url=faculty_page_url,
|
||||||
|
faculty=[Faculty(name=f['name'], url=f['url']) for f in faculty_list]
|
||||||
|
)
|
||||||
|
|
||||||
|
return program, school_name
|
||||||
|
|
||||||
|
async def _extract_faculty(self) -> List[Dict]:
|
||||||
|
"""从当前页面提取导师信息"""
|
||||||
|
return await self.page.evaluate('''() => {
|
||||||
|
const faculty = [];
|
||||||
|
const seen = new Set();
|
||||||
|
const patterns = ['/people/', '/faculty/', '/profile/', '/person/'];
|
||||||
|
|
||||||
|
document.querySelectorAll('a[href]').forEach(a => {
|
||||||
|
const href = a.href || '';
|
||||||
|
const text = a.innerText.trim();
|
||||||
|
const lowerHref = href.toLowerCase();
|
||||||
|
const lowerText = text.toLowerCase();
|
||||||
|
|
||||||
|
const isPersonLink = patterns.some(p => lowerHref.includes(p));
|
||||||
|
const isNavLink = ['people', 'faculty', 'directory', 'staff', 'all'].includes(lowerText);
|
||||||
|
|
||||||
|
if (isPersonLink && !isNavLink &&
|
||||||
|
text.length > 3 && text.length < 100 &&
|
||||||
|
!seen.has(href)) {
|
||||||
|
seen.add(href);
|
||||||
|
faculty.push({ name: text, url: href });
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
return faculty;
|
||||||
|
}''')
|
||||||
|
|
||||||
|
def _print_summary(self, university: University):
|
||||||
|
"""打印统计摘要"""
|
||||||
|
total_programs = sum(len(s.programs) for s in university.schools)
|
||||||
|
total_faculty = sum(len(p.faculty) for s in university.schools for p in s.programs)
|
||||||
|
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print("爬取完成!")
|
||||||
|
print(f"{'='*60}")
|
||||||
|
print(f"大学: {university.name}")
|
||||||
|
print(f"学院数: {len(university.schools)}")
|
||||||
|
print(f"项目数: {total_programs}")
|
||||||
|
print(f"导师数: {total_faculty}")
|
||||||
|
|
||||||
|
print("\n各学院统计:")
|
||||||
|
for school in university.schools:
|
||||||
|
prog_count = len(school.programs)
|
||||||
|
fac_count = sum(len(p.faculty) for p in school.programs)
|
||||||
|
print(f" {school.name}: {prog_count}个项目, {fac_count}位导师")
|
||||||
|
|
||||||
|
def save_results(self, university: University, output_path: str):
|
||||||
|
"""保存结果"""
|
||||||
|
output = Path(output_path)
|
||||||
|
output.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
with open(output, 'w', encoding='utf-8') as f:
|
||||||
|
json.dump(university.to_dict(), f, ensure_ascii=False, indent=2)
|
||||||
|
|
||||||
|
print(f"\n结果已保存到: {output_path}")
|
||||||
|
|
||||||
|
|
||||||
|
async def scrape_harvard(output_path: str = "output/harvard_full_result.json", headless: bool = True):
|
||||||
|
"""爬取Harvard的便捷函数"""
|
||||||
|
async with HarvardScraper(headless=headless) as scraper:
|
||||||
|
university = await scraper.scrape()
|
||||||
|
scraper.save_results(university, output_path)
|
||||||
|
return university
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
asyncio.run(scrape_harvard(headless=False))
|
||||||
105 src/university_scraper/models.py Normal file
@@ -0,0 +1,105 @@
"""
|
||||||
|
数据模型定义 - 学院 → 项目 → 导师 层级结构
|
||||||
|
"""
|
||||||
|
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from typing import Any, Dict, List, Optional
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class Faculty:
|
||||||
|
"""瀵煎笀淇℃伅"""
|
||||||
|
name: str
|
||||||
|
url: str
|
||||||
|
title: Optional[str] = None
|
||||||
|
email: Optional[str] = None
|
||||||
|
department: Optional[str] = None
|
||||||
|
|
||||||
|
def to_dict(self) -> dict:
|
||||||
|
return {
|
||||||
|
"name": self.name,
|
||||||
|
"url": self.url,
|
||||||
|
"title": self.title,
|
||||||
|
"email": self.email,
|
||||||
|
"department": self.department
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class Program:
|
||||||
|
"""纭曞+椤圭洰淇℃伅"""
|
||||||
|
name: str
|
||||||
|
url: str
|
||||||
|
degree_type: Optional[str] = None # M.S., M.A., MBA, etc.
|
||||||
|
description: Optional[str] = None
|
||||||
|
faculty_page_url: Optional[str] = None
|
||||||
|
faculty: List[Faculty] = field(default_factory=list)
|
||||||
|
metadata: Dict[str, Any] = field(default_factory=dict)
|
||||||
|
|
||||||
|
def to_dict(self) -> dict:
|
||||||
|
return {
|
||||||
|
"name": self.name,
|
||||||
|
"url": self.url,
|
||||||
|
"degree_type": self.degree_type,
|
||||||
|
"description": self.description,
|
||||||
|
"faculty_page_url": self.faculty_page_url,
|
||||||
|
"faculty_count": len(self.faculty),
|
||||||
|
"faculty": [f.to_dict() for f in self.faculty],
|
||||||
|
"metadata": self.metadata
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class School:
|
||||||
|
"""瀛﹂櫌淇℃伅"""
|
||||||
|
name: str
|
||||||
|
url: str
|
||||||
|
description: Optional[str] = None
|
||||||
|
programs: List[Program] = field(default_factory=list)
|
||||||
|
metadata: Dict[str, Any] = field(default_factory=dict)
|
||||||
|
faculty_directory: List[Faculty] = field(default_factory=list)
|
||||||
|
faculty_directory_loaded: bool = False
|
||||||
|
|
||||||
|
def to_dict(self) -> dict:
|
||||||
|
return {
|
||||||
|
"name": self.name,
|
||||||
|
"url": self.url,
|
||||||
|
"description": self.description,
|
||||||
|
"program_count": len(self.programs),
|
||||||
|
"programs": [p.to_dict() for p in self.programs],
|
||||||
|
"faculty_directory_count": len(self.faculty_directory),
|
||||||
|
"faculty_directory": [f.to_dict() for f in self.faculty_directory]
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class University:
|
||||||
|
"""澶у淇℃伅 - 椤跺眰鏁版嵁缁撴瀯"""
|
||||||
|
name: str
|
||||||
|
url: str
|
||||||
|
country: Optional[str] = None
|
||||||
|
schools: List[School] = field(default_factory=list)
|
||||||
|
scraped_at: Optional[str] = None
|
||||||
|
|
||||||
|
def to_dict(self) -> dict:
|
||||||
|
# 统计
|
||||||
|
total_programs = sum(len(s.programs) for s in self.schools)
|
||||||
|
total_faculty = sum(
|
||||||
|
len(p.faculty)
|
||||||
|
for s in self.schools
|
||||||
|
for p in s.programs
|
||||||
|
)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"university": self.name,
|
||||||
|
"url": self.url,
|
||||||
|
"country": self.country,
|
||||||
|
"scraped_at": self.scraped_at or datetime.utcnow().isoformat(),
|
||||||
|
"statistics": {
|
||||||
|
"total_schools": len(self.schools),
|
||||||
|
"total_programs": total_programs,
|
||||||
|
"total_faculty": total_faculty
|
||||||
|
},
|
||||||
|
"schools": [s.to_dict() for s in self.schools]
|
||||||
|
}
|
||||||
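数据模型的一个最小使用示意(示例数据为虚构,仅用于说明 学院→项目→导师 的层级组织和 to_dict() 的统计输出):

```python
from university_scraper.models import University, School, Program, Faculty

uni = University(name="Harvard University", url="https://www.harvard.edu/", country="USA")
seas = School(name="SEAS", url="https://seas.harvard.edu/")
prog = Program(
    name="Computational Science and Engineering",
    url="https://seas.harvard.edu/programs/example",   # 虚构 URL,仅作示例
    degree_type="S.M.",
    faculty=[Faculty(name="Jane Doe", url="https://seas.harvard.edu/people/jane-doe")],
)
seas.programs.append(prog)
uni.schools.append(seas)

print(uni.to_dict()["statistics"])
# {'total_schools': 1, 'total_programs': 1, 'total_faculty': 1}
```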
1360 src/university_scraper/scraper.py Normal file
File diff suppressed because it is too large
6 任务1.txt
@@ -1,4 +1,8 @@
构建一个自动化生成代码的agent,给定一个海外大学官网的网址,生成一套或者说一个python脚本能够爬取这个大学各级学院下的所有硕士项目的网址 和 硕士项目中各导师个人信息的网址
agent系统使用 https://docs.agno.com/
然后其中的浏览器自动化使用playwright
使用的python工具库是,uv、ruff、ty,使用这三个构建项目。
另外还需要研究一个方案:
当前生成的爬虫脚本还需要在 Python 环境中手动执行。我们希望实现的效果:生成的脚本可以存到数据库(你研究一下是存在哪里好),我们的目的是存到云端或者服务器上,后续我们可以通过接口来调用生成的脚本去爬取。
理想情况下我们有一个页面,页面上有一个输入栏,用来输入要生成爬虫脚本的学校,然后点击“一键生成爬虫脚本”后,由我们的 agent 自动生成脚本。生成完成后会跳转到该学校的管理页面,页面上可以点击“一键运行爬虫脚本”来抓取对应项目和老师的页面数据。爬取完毕后,我们可以在页面上检查和验证爬虫抓取到的数据。
目的是做成没有任何电脑知识的普通人也能操作的一个系统。
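针对上面提出的“脚本存入数据库、通过接口调用执行”的需求,下面是一个极简的接口示意(并非仓库中 backend 的实际实现;端点路径、表结构以及脚本暴露 run() 入口的约定均为假设):

```python
import sqlite3

from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.post("/api/universities/{university_id}/run")
def run_scraper(university_id: int):
    """从数据库取出已生成的爬虫脚本,用 exec() 动态加载后执行"""
    conn = sqlite3.connect("scraper.db")
    row = conn.execute(
        "SELECT code FROM scripts WHERE university_id = ?", (university_id,)
    ).fetchone()
    conn.close()
    if row is None:
        raise HTTPException(status_code=404, detail="script not found")

    namespace: dict = {}
    # globals 和 locals 传同一个 dict,脚本内部的函数才能互相调用
    exec(row[0], namespace, namespace)
    result = namespace["run"]()   # 假设脚本约定暴露 run() 作为入口
    return {"university_id": university_id, "result": result}
```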
32 对话总结.txt Normal file
@@ -0,0 +1,32 @@
项目进度与存在的问题:
📊 项目进度总结
✅ 已完成的功能
Web系统架构
FastAPI 后端 + React + TypeScript + Ant Design 前端
SQLite 数据库存储大学、脚本、爬取结果
动态脚本执行机制(通过 exec() 注入模块)
曼彻斯特大学爬虫
成功爬取 297 个硕士项目
创建了完整的学院配置 manchester_complete_scraper.py
实现了 3 种提取方法:table、links、research_explorer
项目按关键词自动分配到 13 个学院
已修复的问题
Windows asyncio 事件循环策略
exec 命名空间问题(函数互相调用)
硕士项目过滤逻辑(排除本科/博士)
⚠️ 当前存在的问题
问题 | 影响 | 原因
网络超时 | 11/12 学院页面加载失败 | 网络不稳定或页面响应慢
Research Explorer 页面 | 大量学院使用此系统 | JavaScript 渲染慢,60秒超时不够
导师数据不完整 | 仅获取 78 名导师(AMBS) | 其他学院页面无法访问
📈 数据统计
指标 | 数量
硕士项目总数 | 297
学院分类数 | 13
成功获取导师的学院 | 1/13
导师总数 | 78
🔧 建议的改进方向
增加超时时间 - 对 Research Explorer 页面增加到 90-120 秒
添加重试机制 - 失败后自动重试 2-3 次
使用备选 URL - 为每个学院配置多个可能的 staff 页面
分批爬取 - 将学院分批处理,避免同时请求过多
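针对“增加超时时间 / 添加重试机制”这两条改进方向,给出一个可行的封装示意(超时与重试次数为假设的默认值,可按学院单独配置覆盖):

```python
import asyncio

from playwright.async_api import Page

async def goto_with_retry(page: Page, url: str,
                          timeout_ms: int = 90_000, retries: int = 3) -> bool:
    """加载较慢的页面(如 Research Explorer):加大超时,失败后线性退避重试"""
    for attempt in range(1, retries + 1):
        try:
            await page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            return True
        except Exception as exc:
            print(f"  第 {attempt}/{retries} 次访问失败: {exc}")
            if attempt < retries:
                await asyncio.sleep(3 * attempt)
    return False
```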