Rename test script and update documentation
- Rename test_rwth.py to generate_scraper.py with CLI arguments
- Update README.md with comprehensive usage guide
- Add Harvard scraper as example output
- Document troubleshooting tips for common issues

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
207
README.md
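@@ -1,73 +1,180 @@
 # University Playwright Codegen Agent

-An automated code-generation agent built on [Agno](https://docs.agno.com/): given the root URL of an overseas university's official website, it generates a Python script that uses **Playwright** to collect the URLs of master's programs under each school or graduate school, along with the personal profile pages of the supervisors/faculty listed in those programs. The project uses `uv` for dependency management, `ruff` for linting, and `ty` for type checking, and ships a Typer-based CLI.
+An automated code-generation agent built on [Agno](https://docs.agno.com/): given the root URL of an overseas university's official website, it generates a Python script that uses **Playwright** to collect the URLs of master's programs under each school or graduate school, along with the personal profile pages of the supervisors/faculty listed in those programs.
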
-## Features
+## Quick Start

-- ✅ **Agno Agent**: uses `output_schema` to enforce structured output, producing a `ScriptPlan` milestone by milestone and rendering it into an executable script.
-- ✅ **Playwright sampling**: before the plan is generated, Playwright lightly crawls the site to help the agent find keywords and a navigation strategy.
-- ✅ **Deterministic script template**: the script template includes BFS crawling, keyword filtering, and JSON output, so the "master's programs + supervisors" requirement is met.
-- ✅ **uv + ruff + ty workflow**: a modern Python toolchain that works out of the box.
+### 1. Environment setup

-## Getting started
+```bash
+# Clone the project
+git clone https://git.prodream.cn/YXY/University-Playwright-Codegen-Agent.git
+cd University-Playwright-Codegen-Agent

-1. **Create a virtual environment and install dependencies**
+# Install dependencies (requires uv)
+uv sync

-```bash
-uv venv --python 3.12
-uv pip install -r pyproject.toml
-playwright install  # install the browser binaries
-```
+# Install the Playwright browsers
+uv run playwright install
+```

-2. **Configure the LLM API key**
+### 2. Configure the API key

-- OpenAI: `export OPENAI_API_KEY=...`
-- Anthropic: `export ANTHROPIC_API_KEY=...`
-- Use the `CODEGEN_MODEL_PROVIDER` environment variable to switch between `openai` and `anthropic`.
+The project calls Claude models through the OpenRouter API. Set the environment variable:

-3. **Run the CLI to generate a script**
+**Windows (PowerShell):**
+```powershell
+[Environment]::SetEnvironmentVariable("OPENROUTER_API_KEY", "your-api-key", "User")
+```

-```bash
-uv run university-agent generate \
-  "https://www.example.edu" \
-  --campus "Example Campus" \
-  --language "English" \
-  --max-depth 2 \
-  --max-pages 60
-```
+**Windows (CMD):**
+```cmd
+setx OPENROUTER_API_KEY "your-api-key"
+```

-When the run finishes, the generated Playwright script appears under `artifacts/`, and the terminal shows the automatically planned keywords and verification steps.
+**Linux/macOS:**
+```bash
+export OPENROUTER_API_KEY="your-api-key"
+```

-4. **Run the Ruff and Ty checks**
+Alternatively, copy `.env.example` to `.env` and fill in the API key.

-```bash
-uv run ruff check
-uvx ty check
-```
+### 3. Generate the scraper script

-## Project structure
+**Option 1: pass command-line arguments**
+```bash
+uv run python generate_scraper.py \
+  --url "https://www.harvard.edu/" \
+  --name "Harvard" \
+  --language "English" \
+  --max-depth 3 \
+  --max-pages 30
+```
+
+**Option 2: edit the configuration in the script**
+
+Edit the configuration at the top of `generate_scraper.py`:
+```python
+TARGET_URL = "https://www.example.edu/"
+CAMPUS_NAME = "Example University"
+LANGUAGE = "English"
+MAX_DEPTH = 3
+MAX_PAGES = 30
+```
+
+Then run:
+```bash
+uv run python generate_scraper.py
+```
+
+### 4. Run the generated scraper
+
+Generated scripts are saved in the `artifacts/` directory:
+
+```bash
+cd artifacts
+uv run python harvard_faculty_scraper.py --max-pages 50 --no-verify
+```
+
+**Common options:**
+| Option | Description | Default |
+|------|------|--------|
+| `--max-pages` | Maximum number of pages to crawl | 30 |
+| `--max-depth` | Maximum crawl depth | 3 |
+| `--no-verify` | Skip link verification (recommended) | False |
+| `--browser` | Browser engine (chromium/firefox/webkit) | chromium |
+| `--timeout` | Page load timeout (ms) | 20000 |
+| `--output` | Output file path | university-scraper_results.json |
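The option table above can be read as an argument parser. The sketch below is an assumption about how such a parser might look with the standard-library `argparse`; the generated script's actual parser is not shown in this diff, and `build_parser` is a hypothetical name. Only the flag names and defaults are taken from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the README option table."""
    parser = argparse.ArgumentParser(description="Illustrative scraper options")
    parser.add_argument("--max-pages", type=int, default=30,
                        help="maximum number of pages to crawl")
    parser.add_argument("--max-depth", type=int, default=3,
                        help="maximum crawl depth")
    parser.add_argument("--no-verify", action="store_true",
                        help="skip link verification")
    parser.add_argument("--browser", choices=["chromium", "firefox", "webkit"],
                        default="chromium", help="browser engine")
    parser.add_argument("--timeout", type=int, default=20000,
                        help="page load timeout in milliseconds")
    parser.add_argument("--output", default="university-scraper_results.json",
                        help="output file path")
    return parser


# Parse an explicit list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--max-pages", "50", "--no-verify"])
print(args.max_pages, args.no_verify, args.browser, args.timeout)
```

Passing `--no-verify` flips the flag to `True`, which is consistent with the `False` default listed in the table.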
+
+### 5. View the results
+
+Crawl results are saved as a JSON file:
+
+```json
+{
+  "statistics": {
+    "total_links": 277,
+    "program_links": 8,
+    "faculty_links": 269,
+    "profile_pages": 265
+  },
+  "program_links": [...],
+  "faculty_links": [...]
+}
+```
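As a rough illustration of how the JSON layout above might be consumed, the following sketch summarizes a results document. The `summarize` helper and the sample data are hypothetical; the README elides the shape of the entries in `program_links` and `faculty_links`, so the code only counts them and makes no assumption about their fields.

```python
import json


def summarize(results: dict) -> str:
    """Return a one-line summary of a results document (hypothetical helper)."""
    stats = results.get("statistics", {})
    programs = results.get("program_links", [])
    faculty = results.get("faculty_links", [])
    # Fall back to counting entries when the statistics block is missing.
    total = stats.get("total_links", len(programs) + len(faculty))
    return f"{total} links: {len(programs)} program, {len(faculty)} faculty"


# Made-up sample with the same top-level keys as the README example.
sample = json.loads("""
{
  "statistics": {"total_links": 3, "program_links": 1,
                 "faculty_links": 2, "profile_pages": 2},
  "program_links": ["https://www.example.edu/masters"],
  "faculty_links": ["https://www.example.edu/people/a",
                    "https://www.example.edu/people/b"]
}
""")
print(summarize(sample))
```

To use it on a real run, replace the sample with `json.load` on the file passed to `--output`.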
+
+## Using the CLI (optional)
+
+The project also provides a Typer CLI:
+
+```bash
+uv run university-agent generate \
+  "https://www.example.edu" \
+  --campus "Example Campus" \
+  --language "English" \
+  --max-depth 2 \
+  --max-pages 60
+```
+
+## Tested universities
+
+| University | Status | Notes |
+|------|------|------|
+| Harvard | ✅ | Found 277 links |
+| RWTH Aachen | ✅ | Found 108 links |
+| KAUST | ✅ | Requires Firefox; the site is slow |
+
+## Troubleshooting
+
+### Timeout errors
+Some sites respond slowly; increase the timeout:
+```bash
+uv run python xxx_scraper.py --timeout 60000 --no-verify
+```
+
+### Browser blocked
+Some sites (such as KAUST) block Chromium; switch to Firefox:
+```bash
+uv run python xxx_scraper.py --browser firefox
+```
+
+### API key errors
+Make sure the `OPENROUTER_API_KEY` environment variable is set correctly:
+```bash
+echo $OPENROUTER_API_KEY  # Linux/macOS
+echo %OPENROUTER_API_KEY%  # Windows CMD
+```
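The same check can be done from Python before starting a run, which avoids the shell differences between platforms. This is a minimal sketch; `check_api_key` is a hypothetical helper that is not part of the project, and it only reports whether the variable is present without printing the key itself.

```python
import os


def check_api_key(env=None, name="OPENROUTER_API_KEY"):
    """Report whether the key is present (hypothetical helper)."""
    # Accept a mapping so the check can be exercised without touching os.environ.
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        return f"{name} is not set"
    return f"{name} is set ({len(value)} characters)"


print(check_api_key())
```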
+
+## Project Structure

 ```
 ├── README.md
+├── generate_scraper.py  # Main entry script
+├── .env.example  # Environment variable template
 ├── pyproject.toml
-├── src/university_agent
-│   ├── agent.py  # Agno Agent configuration
-│   ├── cli.py  # Typer CLI
-│   ├── config.py  # pydantic Settings
-│   ├── generator.py  # Orchestration engine
-│   ├── models.py  # Data models (request/plan/result)
-│   ├── renderer.py  # ScriptPlan -> Playwright script
-│   ├── sampler.py  # Playwright sampling
-│   ├── templates/
-│   │   └── playwright_script.py.jinja
-│   └── writer.py  # Writes scripts to artifacts/
-└── 任务1.txt
+├── artifacts/  # Generated scraper scripts
+│   ├── harvard_faculty_scraper.py
+│   ├── kaust_faculty_scraper.py
+│   └── ...
+└── src/university_agent/
+    ├── agent.py  # Agno Agent configuration
+    ├── cli.py  # Typer CLI
+    ├── config.py  # pydantic Settings
+    ├── generator.py  # Orchestration engine
+    ├── models.py  # Data models
+    ├── renderer.py  # ScriptPlan -> Playwright script
+    ├── sampler.py  # Playwright sampling
+    └── writer.py  # Script writer
 ```

-## Tips
+## Features

-- Run `university-agent generate --help` to see all CLI options; you can choose to skip sampling or export the plan as JSON.
-- If the Agno Agent needs other tools, add your own custom `tool` in `agent.py`.
-- Playwright sampling needs extra browser dependencies in some environments; run `playwright install` as the official prompts suggest.
+- **Agno Agent**: uses `output_schema` to enforce structured output
+- **Playwright sampling**: lightly crawls the site before generation
+- **Deterministic script template**: BFS crawling, keyword filtering, JSON output
+- **OpenRouter support**: uses Claude models through OpenRouter
+- **uv + ruff + ty workflow**: a modern Python toolchain
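The "BFS crawling, keyword filtering, JSON output" combination named above can be sketched without a browser. The code below is an illustration under stated assumptions, not the project's template: the keyword tuples, the `crawl` function, and the in-memory `SITE` map are all hypothetical, and a real script would obtain each page's links through Playwright instead of a dictionary lookup.

```python
from collections import deque

# Hypothetical keyword lists; the real ones are planned per site by the agent.
PROGRAM_KEYWORDS = ("master", "msc", "graduate")
FACULTY_KEYWORDS = ("faculty", "people", "supervisor")


def crawl(start, get_links, max_depth=3, max_pages=30):
    """Breadth-first crawl that buckets URLs by keyword."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = 0
    result = {"program_links": [], "faculty_links": []}
    while queue and visited < max_pages:
        url, depth = queue.popleft()
        visited += 1
        lowered = url.lower()
        if any(k in lowered for k in PROGRAM_KEYWORDS):
            result["program_links"].append(url)
        elif any(k in lowered for k in FACULTY_KEYWORDS):
            result["faculty_links"].append(url)
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return result


# Tiny in-memory "site" standing in for real page fetches.
SITE = {
    "https://www.example.edu/": [
        "https://www.example.edu/masters",
        "https://www.example.edu/about",
    ],
    "https://www.example.edu/masters": [
        "https://www.example.edu/faculty/jane-doe",
    ],
}
result = crawl("https://www.example.edu/", lambda u: SITE.get(u, []))
print(result)
```

The queue makes the traversal level by level, so `max_depth` bounds how far from the root a link may be, while `max_pages` bounds the total number of pages visited.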

-Happy building! 🎓🤖
+## License
+
+MIT