# University Playwright Codegen Agent
An automated code-generation agent built on [Agno](https://docs.agno.com/): given the root URL of an overseas university's website, it generates a Python script using **Playwright** that scrapes the master's program URLs under each school/graduate school, plus the personal profile pages of the supervisors/faculty listed in those programs.
## Quick Start
### 1. Environment Setup
```bash
# Clone the project
git clone https://git.prodream.cn/YXY/University-Playwright-Codegen-Agent.git
cd University-Playwright-Codegen-Agent
# Install dependencies (requires uv)
uv sync
# Install Playwright browsers
uv run playwright install
```
### 2. Configure the API Key
The project calls Claude models through the OpenRouter API. Set the environment variable:
**Windows (PowerShell):**
```powershell
[Environment]::SetEnvironmentVariable("OPENROUTER_API_KEY", "your-api-key", "User")
```
**Windows (CMD):**
```cmd
setx OPENROUTER_API_KEY "your-api-key"
```
**Linux/macOS:**
```bash
export OPENROUTER_API_KEY="your-api-key"
```
Alternatively, copy `.env.example` to `.env` and fill in your API key.
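A minimal `.env` needs only the key (assuming the template defines just this one variable):
```
OPENROUTER_API_KEY=your-api-key
```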
### 3. Generate a Scraper Script
**Option 1: Command-line arguments**
```bash
uv run python generate_scraper.py \
--url "https://www.harvard.edu/" \
--name "Harvard" \
--language "English" \
--max-depth 3 \
--max-pages 30
```
**Option 2: Edit the in-script configuration**
Edit the configuration block at the top of `generate_scraper.py`:
```python
TARGET_URL = "https://www.example.edu/"
CAMPUS_NAME = "Example University"
LANGUAGE = "English"
MAX_DEPTH = 3
MAX_PAGES = 30
```
Then run:
```bash
uv run python generate_scraper.py
```
### 4. Run the Generated Scraper
Generated scripts are saved in the `artifacts/` directory:
```bash
cd artifacts
uv run python harvard_faculty_scraper.py --max-pages 50 --no-verify
```
**Common parameters:**
| Parameter | Description | Default |
|------|------|--------|
| `--max-pages` | Maximum number of pages to crawl | 30 |
| `--max-depth` | Maximum crawl depth | 3 |
| `--no-verify` | Skip link verification (recommended) | False |
| `--browser` | Browser engine (chromium/firefox/webkit) | chromium |
| `--timeout` | Page-load timeout (ms) | 20000 |
| `--output` | Output file path | university-scraper_results.json |
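For example, a deeper crawl of a slow site, written to a custom output file (the flag values here are illustrative):
```bash
uv run python harvard_faculty_scraper.py \
  --max-depth 4 \
  --max-pages 100 \
  --timeout 60000 \
  --no-verify \
  --output harvard_results.json
```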
### 5. View the Results
Crawl results are saved as a JSON file:
```json
{
  "statistics": {
    "total_links": 277,
    "program_links": 8,
    "faculty_links": 269,
    "profile_pages": 265
  },
  "program_links": [...],
  "faculty_links": [...]
}
```
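A quick way to inspect the output is a short Python snippet; the field names below match the JSON structure shown above, and the file name is the default `--output` path:
```python
import json

# Load the scraper output (adjust the path if you used --output)
with open("university-scraper_results.json", encoding="utf-8") as f:
    results = json.load(f)

# Print the summary statistics
for key, value in results["statistics"].items():
    print(f"{key}: {value}")

# Preview the first few faculty links
for link in results["faculty_links"][:5]:
    print(link["url"], "-", link["text"])
```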
## Using the CLI (Optional)
The project also provides a Typer CLI:
```bash
uv run university-agent generate \
"https://www.example.edu" \
--campus "Example Campus" \
--language "English" \
--max-depth 2 \
--max-pages 60
```
## Tested Universities
| University | Status | Result | Generated Script |
|------|------|------|-----------|
| Harvard | ✅ | 277 links (8 programs, 269 faculty, 265 profile pages) | `artifacts/harvard_faculty_scraper.py` |
| RWTH Aachen | ✅ | 108 links (103 programs, 5 faculty) | `artifacts/rwth_aachen_playwright_scraper.py` |
| KAUST | ✅ | 9 links (requires Firefox) | `artifacts/kaust_faculty_scraper.py` |
### Harvard Test Example
**Generate the scraper script:**
```bash
uv run python generate_scraper.py --url "https://www.harvard.edu/" --name "Harvard"
```
**Run the scraper:**
```bash
cd artifacts
uv run python harvard_faculty_scraper.py --max-pages 30 --no-verify
```
**Result output** (`artifacts/university-scraper_results.json`):
```json
{
  "statistics": {
    "total_links": 277,
    "program_links": 8,
    "faculty_links": 269,
    "profile_pages": 265
  },
  "program_links": [
    {"url": "https://www.harvard.edu/programs/?degree_levels=graduate", "text": "Graduate Programs"},
    ...
  ],
  "faculty_links": [
    {"url": "https://www.gse.harvard.edu/directory/faculty", "text": "Faculty Directory"},
    {"url": "https://faculty.harvard.edu", "text": "Harvard Faculty"},
    ...
  ]
}
```
The crawl covered multiple Harvard schools:
- Graduate School of Design (GSD)
- Graduate School of Education (GSE)
- Faculty of Arts and Sciences (FAS)
- Graduate School of Arts and Sciences (GSAS)
- Harvard Divinity School (HDS)
## Troubleshooting
### Timeout Errors
Some sites respond slowly; increase the timeout:
```bash
uv run python xxx_scraper.py --timeout 60000 --no-verify
```
### Browser Blocked
Some sites (such as KAUST) block Chromium; switch to Firefox:
```bash
uv run python xxx_scraper.py --browser firefox
```
### API Key Errors
Make sure the `OPENROUTER_API_KEY` environment variable is set correctly:
```bash
echo $OPENROUTER_API_KEY # Linux/macOS
echo %OPENROUTER_API_KEY% # Windows CMD
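echo $env:OPENROUTER_API_KEY # Windows PowerShell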
```
## Project Structure
```
├── README.md
├── generate_scraper.py    # Main entry script
├── .env.example           # Environment variable template
├── pyproject.toml
├── artifacts/             # Generated scraper scripts
│   ├── harvard_faculty_scraper.py
│   ├── kaust_faculty_scraper.py
│   └── ...
└── src/university_agent/
    ├── agent.py           # Agno Agent configuration
    ├── cli.py             # Typer CLI
    ├── config.py          # pydantic Settings
    ├── generator.py       # Orchestration engine
    ├── models.py          # Data models
    ├── renderer.py        # ScriptPlan -> Playwright script
    ├── sampler.py         # Playwright sampling
    └── writer.py          # Script writer
```
## Features
- **Agno Agent**: uses `output_schema` to enforce structured output (see the sketch after this list)
- **Playwright sampling**: lightly crawls the target site before generation
- **Deterministic script template**: BFS crawling, keyword filtering, JSON output
- **OpenRouter support**: calls Claude models through OpenRouter
- **uv + ruff + ty workflow**: modern Python toolchain
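As a rough illustration of the structured-output approach, a minimal Agno agent using `output_schema` might look like the sketch below. The model id and the simplified `ScriptPlan` fields are assumptions for illustration; the real plan model lives in `src/university_agent/models.py`.
```python
from pydantic import BaseModel
from agno.agent import Agent
from agno.models.openrouter import OpenRouter


# Simplified stand-in for the real ScriptPlan in src/university_agent/models.py
class ScriptPlan(BaseModel):
    campus_name: str
    start_url: str
    program_keywords: list[str]
    faculty_keywords: list[str]


# output_schema forces the agent's reply to parse into a ScriptPlan
agent = Agent(
    model=OpenRouter(id="anthropic/claude-3.5-sonnet"),  # illustrative model id
    output_schema=ScriptPlan,
)

run = agent.run("Plan a faculty scraper for https://www.example.edu/")
plan = run.content  # a validated ScriptPlan instance
print(plan.program_keywords)
```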
## License
MIT