# University Playwright Codegen Agent

An automated code-generation agent built on [Agno](https://docs.agno.com/): give it the root URL of an overseas university's official website and it generates a Python script that uses **Playwright** to scrape the master's-program URLs under each school/graduate school, along with the profile pages of the supervisors (Supervisor/Faculty) listed in those programs.

## Quick Start

### 1. Environment Setup

```bash
# Clone the repository
git clone https://git.prodream.cn/YXY/University-Playwright-Codegen-Agent.git
cd University-Playwright-Codegen-Agent

# Install dependencies (requires uv)
uv sync

# Install the Playwright browsers
uv run playwright install
```

### 2. Configure the API Key

The project calls Claude models through the OpenRouter API. Set the environment variable:

**Windows (PowerShell):**
```powershell
[Environment]::SetEnvironmentVariable("OPENROUTER_API_KEY", "your-api-key", "User")
```

**Windows (CMD):**
```cmd
setx OPENROUTER_API_KEY "your-api-key"
```

**Linux/macOS:**
```bash
export OPENROUTER_API_KEY="your-api-key"
```

Alternatively, copy `.env.example` to `.env` and fill in your API key.
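
The repository's `src/university_agent/config.py` is described below as pydantic Settings, so the key is presumably loaded along these lines. This is a minimal sketch assuming `pydantic-settings`; the field name and `.env` handling here are illustrative, and the real definitions live in `config.py`:

```python
# Minimal sketch of reading OPENROUTER_API_KEY with pydantic-settings.
# Assumption: the project's config.py does something similar; this is not
# the actual project code.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Reads OPENROUTER_API_KEY from the environment or a local .env file."""

    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    openrouter_api_key: str  # maps to OPENROUTER_API_KEY (case-insensitive)


settings = Settings()  # raises a validation error if the key is missing
```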

### 3. Generate a Scraper Script

**Option 1: use command-line arguments**
```bash
uv run python generate_scraper.py \
  --url "https://www.harvard.edu/" \
  --name "Harvard" \
  --language "English" \
  --max-depth 3 \
  --max-pages 30
```

**Option 2: edit the configuration in the script**

Edit the configuration at the top of `generate_scraper.py`:
```python
TARGET_URL = "https://www.example.edu/"
CAMPUS_NAME = "Example University"
LANGUAGE = "English"
MAX_DEPTH = 3
MAX_PAGES = 30
```

Then run:
```bash
uv run python generate_scraper.py
```

### 4. Run the Generated Scraper

Generated scripts are saved in the `artifacts/` directory:

```bash
cd artifacts
uv run python harvard_faculty_scraper.py --max-pages 50 --no-verify
```

**Common options:**

| Option | Description | Default |
|------|------|--------|
| `--max-pages` | Maximum number of pages to crawl | 30 |
| `--max-depth` | Maximum crawl depth | 3 |
| `--no-verify` | Skip link verification (recommended) | False |
| `--browser` | Browser engine (chromium/firefox/webkit) | chromium |
| `--timeout` | Page load timeout (ms) | 20000 |
| `--output` | Output file path | university-scraper_results.json |
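
For orientation, the generated scripts' argument handling looks roughly like the sketch below, reconstructed from the table above rather than copied from an actual generated file; exact names and defaults may differ per script:

```python
# Rough sketch of the CLI a generated scraper exposes, based on the options
# table above; the actual generated code may differ in detail.
import argparse

parser = argparse.ArgumentParser(description="Generated university scraper")
parser.add_argument("--max-pages", type=int, default=30, help="maximum number of pages to crawl")
parser.add_argument("--max-depth", type=int, default=3, help="maximum crawl depth")
parser.add_argument("--no-verify", action="store_true", help="skip link verification")
parser.add_argument("--browser", choices=["chromium", "firefox", "webkit"], default="chromium")
parser.add_argument("--timeout", type=int, default=20000, help="page load timeout in ms")
parser.add_argument("--output", default="university-scraper_results.json", help="output file path")
args = parser.parse_args()
```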

### 5. View the Results

Crawl results are saved as a JSON file:

```json
{
  "statistics": {
    "total_links": 277,
    "program_links": 8,
    "faculty_links": 269,
    "profile_pages": 265
  },
  "program_links": [...],
  "faculty_links": [...]
}
```
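
To inspect a results file programmatically, something like the following works. This is a small standalone snippet, not part of the project; the filename is simply the documented default output path:

```python
# Load a results file and print the summary statistics.
import json

with open("university-scraper_results.json", encoding="utf-8") as f:
    results = json.load(f)

print(results["statistics"])                        # e.g. {'total_links': 277, ...}
print(len(results["program_links"]), "program links")
print(len(results["faculty_links"]), "faculty links")
```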

## Using the CLI (optional)

The project also provides a Typer CLI:

```bash
uv run university-agent generate \
  "https://www.example.edu" \
  --campus "Example Campus" \
  --language "English" \
  --max-depth 2 \
  --max-pages 60
```

## Tested Universities

| University | Status | Notes |
|------|------|------|
| Harvard | ✅ | 277 links found |
| RWTH Aachen | ✅ | 108 links found |
| KAUST | ✅ | Requires Firefox; the site is slow |

## Troubleshooting

### Timeout Errors
Some sites respond slowly; increase the timeout:
```bash
uv run python xxx_scraper.py --timeout 60000 --no-verify
```

### Browser Blocked
Some sites (such as KAUST) block Chromium; switch to Firefox:
```bash
uv run python xxx_scraper.py --browser firefox
```

### API Key Errors
Make sure the `OPENROUTER_API_KEY` environment variable is set correctly:
```bash
echo $OPENROUTER_API_KEY    # Linux/macOS
echo %OPENROUTER_API_KEY%   # Windows CMD
```
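
Or check from Python directly (a throwaway snippet, not part of the project):

```python
# Quick check that the key is visible to Python processes.
import os

print("OPENROUTER_API_KEY is set" if os.environ.get("OPENROUTER_API_KEY") else "OPENROUTER_API_KEY is missing")
```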

## Project Structure

```
├── README.md
├── generate_scraper.py          # Main entry script
├── .env.example                 # Environment variable template
├── pyproject.toml
├── artifacts/                   # Generated scraper scripts
│   ├── harvard_faculty_scraper.py
│   ├── kaust_faculty_scraper.py
│   └── ...
└── src/university_agent/
    ├── agent.py                 # Agno Agent configuration
    ├── cli.py                   # Typer CLI
    ├── config.py                # pydantic Settings
    ├── generator.py             # Orchestration engine
    ├── models.py                # Data models
    ├── renderer.py              # ScriptPlan -> Playwright script
    ├── sampler.py               # Playwright sampling
    └── writer.py                # Script writer
```
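
The structured output that the agent returns (the `ScriptPlan` that `renderer.py` turns into a Playwright script) is not documented here. As a purely illustrative guess based on the generation parameters above, it might carry fields along these lines; the real definition lives in `src/university_agent/models.py`:

```python
# Illustrative guess at the kind of structured plan the agent could return;
# these fields are NOT the project's actual ScriptPlan definition.
from pydantic import BaseModel


class ScriptPlan(BaseModel):
    campus_name: str               # e.g. "Harvard"
    target_url: str                # root URL to crawl
    language: str                  # site language, e.g. "English"
    max_depth: int = 3
    max_pages: int = 30
    program_keywords: list[str]    # used to filter master's-program links
    faculty_keywords: list[str]    # used to filter supervisor/faculty links
```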

## Features

- **Agno Agent**: enforces structured output via `output_schema`
- **Playwright sampling**: lightweight crawl of the site before generation
- **Deterministic script template**: BFS crawling, keyword filtering, JSON output (see the sketch below)
- **OpenRouter support**: Claude models via OpenRouter
- **uv + ruff + ty workflow**: modern Python toolchain
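
To make the script-template bullet concrete, the generated scripts follow the general shape sketched below, using the sync Playwright API. The keyword lists, limits, and function names here are placeholders, not the template's actual code:

```python
# Minimal illustration of the BFS-crawl + keyword-filter approach used by the
# generated scripts. Keywords, limits, and names are placeholders.
import json
from collections import deque
from urllib.parse import urljoin, urlparse

from playwright.sync_api import sync_playwright

START_URL = "https://www.example.edu/"
PROGRAM_KEYWORDS = ("master", "graduate", "program")
FACULTY_KEYWORDS = ("faculty", "people", "supervisor", "profile")
MAX_PAGES, MAX_DEPTH = 30, 3


def crawl() -> dict:
    seen, program_links, faculty_links = set(), [], []
    queue = deque([(START_URL, 0)])  # BFS frontier of (url, depth)
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        while queue and len(seen) < MAX_PAGES:
            url, depth = queue.popleft()
            if url in seen or depth > MAX_DEPTH:
                continue
            seen.add(url)
            try:
                page.goto(url, timeout=20_000)
            except Exception:
                continue  # skip pages that fail to load
            hrefs = page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
            for href in hrefs:
                absolute = urljoin(url, href)
                if urlparse(absolute).netloc != urlparse(START_URL).netloc:
                    continue  # stay on the university's own domain
                lowered = absolute.lower()
                if any(k in lowered for k in PROGRAM_KEYWORDS):
                    program_links.append(absolute)
                if any(k in lowered for k in FACULTY_KEYWORDS):
                    faculty_links.append(absolute)
                queue.append((absolute, depth + 1))
    return {"program_links": sorted(set(program_links)),
            "faculty_links": sorted(set(faculty_links))}


if __name__ == "__main__":
    print(json.dumps(crawl(), indent=2))
```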

## License

MIT