# University Playwright Codegen Agent
An automated code-generation agent built on [Agno](https://docs.agno.com/): given the root URL of an overseas university's website, it generates a Python script using **Playwright** that scrapes the master's program URLs under each school/graduate school, plus the personal profile pages of the supervisors/faculty listed in those programs.
## Quick Start
### 1. Environment Setup
```bash
# Clone the project
git clone https://git.prodream.cn/YXY/University-Playwright-Codegen-Agent.git
cd University-Playwright-Codegen-Agent
# Install dependencies (requires uv)
uv sync
# Install Playwright browsers
uv run playwright install
```
### 2. Configure the API Key
The project calls Claude models through the OpenRouter API. Set the environment variable:
**Windows (PowerShell):**
```powershell
[Environment]::SetEnvironmentVariable("OPENROUTER_API_KEY", "your-api-key", "User")
```
**Windows (CMD):**
```cmd
setx OPENROUTER_API_KEY "your-api-key"
```
**Linux/macOS:**
```bash
export OPENROUTER_API_KEY="your-api-key"
```
Alternatively, copy `.env.example` to `.env` and fill in your API key.
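A minimal `.env` needs only the key (assuming the template defines just this one variable):
```
OPENROUTER_API_KEY=your-api-key
```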
### 3. Generate a Scraper Script
**Option 1: Command-line arguments**
```bash
uv run python generate_scraper.py \
--url "https://www.harvard.edu/" \
--name "Harvard" \
--language "English" \
--max-depth 3 \
--max-pages 30
```
**Option 2: Edit the in-script configuration**
Edit the configuration block at the top of `generate_scraper.py`:
```python
TARGET_URL = "https://www.example.edu/"
CAMPUS_NAME = "Example University"
LANGUAGE = "English"
MAX_DEPTH = 3
MAX_PAGES = 30
```
Then run:
```bash
uv run python generate_scraper.py
```
### 4. Run the Generated Scraper
Generated scripts are saved in the `artifacts/` directory:
```bash
cd artifacts
uv run python harvard_faculty_scraper.py --max-pages 50 --no-verify
```
**Common parameters:**
| Parameter | Description | Default |
|------|------|--------|
| `--max-pages` | Maximum number of pages to crawl | 30 |
| `--max-depth` | Maximum crawl depth | 3 |
| `--no-verify` | Skip link verification (recommended) | False |
| `--browser` | Browser engine (chromium/firefox/webkit) | chromium |
| `--timeout` | Page-load timeout (ms) | 20000 |
| `--output` | Output file path | university-scraper_results.json |
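For example, a deeper crawl of a slow site, written to a custom output file (the flag values here are illustrative):
```bash
uv run python harvard_faculty_scraper.py \
  --max-depth 4 \
  --max-pages 100 \
  --timeout 60000 \
  --no-verify \
  --output harvard_results.json
```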
### 5. View the Results
Crawl results are saved as a JSON file:
```json
{
  "statistics": {
    "total_links": 277,
    "program_links": 8,
    "faculty_links": 269,
    "profile_pages": 265
  },
  "program_links": [...],
  "faculty_links": [...]
}
```
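A quick way to inspect the output is a short Python snippet; the field names below match the JSON structure shown above, and the file name is the default `--output` path:
```python
import json

# Load the scraper output (adjust the path if you used --output)
with open("university-scraper_results.json", encoding="utf-8") as f:
    results = json.load(f)

# Print the summary statistics
for key, value in results["statistics"].items():
    print(f"{key}: {value}")

# Preview the first few faculty links
for link in results["faculty_links"][:5]:
    print(link["url"], "-", link["text"])
```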
## Using the CLI (Optional)
The project also provides a Typer CLI:
```bash
uv run university-agent generate \
"https://www.example.edu" \
--campus "Example Campus" \
--language "English" \
--max-depth 2 \
--max-pages 60
```
## Tested Universities
| University | Status | Result | Generated Script |
|------|------|------|-----------|
| Harvard | ✅ | 277 links (8 programs, 269 faculty, 265 profile pages) | `artifacts/harvard_faculty_scraper.py` |
| RWTH Aachen | ✅ | 108 links (103 programs, 5 faculty) | `artifacts/rwth_aachen_playwright_scraper.py` |
| KAUST | ✅ | 9 links (requires Firefox) | `artifacts/kaust_faculty_scraper.py` |
### Harvard Test Example
**Generate the scraper script:**
```bash
uv run python generate_scraper.py --url "https://www.harvard.edu/" --name "Harvard"
```
**Run the scraper:**
```bash
cd artifacts
uv run python harvard_faculty_scraper.py --max-pages 30 --no-verify
```
**Result output** (`artifacts/university-scraper_results.json`):
```json
{
  "statistics": {
    "total_links": 277,
    "program_links": 8,
    "faculty_links": 269,
    "profile_pages": 265
  },
  "program_links": [
    {"url": "https://www.harvard.edu/programs/?degree_levels=graduate", "text": "Graduate Programs"},
    ...
  ],
  "faculty_links": [
    {"url": "https://www.gse.harvard.edu/directory/faculty", "text": "Faculty Directory"},
    {"url": "https://faculty.harvard.edu", "text": "Harvard Faculty"},
    ...
  ]
}
```
The crawl covered multiple Harvard schools:
- Graduate School of Design (GSD)
- Graduate School of Education (GSE)
- Faculty of Arts and Sciences (FAS)
- Graduate School of Arts and Sciences (GSAS)
- Harvard Divinity School (HDS)
## Troubleshooting
### Timeout Errors
Some sites respond slowly; increase the timeout:
```bash
uv run python xxx_scraper.py --timeout 60000 --no-verify
```
### Browser Blocked
Some sites (such as KAUST) block Chromium; switch to Firefox:
```bash
uv run python xxx_scraper.py --browser firefox
```
### API Key Errors
Make sure the `OPENROUTER_API_KEY` environment variable is set correctly:
```bash
echo $OPENROUTER_API_KEY # Linux/macOS
echo %OPENROUTER_API_KEY% # Windows CMD
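echo $env:OPENROUTER_API_KEY # Windows PowerShell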
```
## Project Structure
```
├── README.md
├── generate_scraper.py    # Main entry script
├── .env.example           # Environment variable template
├── pyproject.toml
├── artifacts/             # Generated scraper scripts
│   ├── harvard_faculty_scraper.py
│   ├── kaust_faculty_scraper.py
│   └── ...
└── src/university_agent/
    ├── agent.py           # Agno Agent configuration
    ├── cli.py             # Typer CLI
    ├── config.py          # pydantic Settings
    ├── generator.py       # Orchestration engine
    ├── models.py          # Data models
    ├── renderer.py        # ScriptPlan -> Playwright script
    ├── sampler.py         # Playwright sampling
    └── writer.py          # Script writer
```
## Features
- **Agno Agent**: uses `output_schema` to enforce structured output (see the sketch after this list)
- **Playwright sampling**: lightly crawls the target site before generation
- **Deterministic script template**: BFS crawling, keyword filtering, JSON output
- **OpenRouter support**: calls Claude models through OpenRouter
- **uv + ruff + ty workflow**: modern Python toolchain
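As a rough illustration of the structured-output approach, a minimal Agno agent using `output_schema` might look like the sketch below. The model id and the simplified `ScriptPlan` fields are assumptions for illustration; the real plan model lives in `src/university_agent/models.py`.
```python
from pydantic import BaseModel
from agno.agent import Agent
from agno.models.openrouter import OpenRouter


# Simplified stand-in for the real ScriptPlan in src/university_agent/models.py
class ScriptPlan(BaseModel):
    campus_name: str
    start_url: str
    program_keywords: list[str]
    faculty_keywords: list[str]


# output_schema forces the agent's reply to parse into a ScriptPlan
agent = Agent(
    model=OpenRouter(id="anthropic/claude-3.5-sonnet"),  # illustrative model id
    output_schema=ScriptPlan,
)

run = agent.run("Plan a faculty scraper for https://www.example.edu/")
plan = run.content  # a validated ScriptPlan instance
print(plan.program_keywords)
```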
## License
MIT