yangxiaoyu-crypto a4dca81216 Rename test script and update documentation

- Rename test_rwth.py to generate_scraper.py with CLI arguments
- Update README.md with comprehensive usage guide
- Add Harvard scraper as example output
- Document troubleshooting tips for common issues

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-10 15:36:14 +08:00

artifacts

Rename test script and update documentation

2025-12-10 15:36:14 +08:00

src

Add OpenRouter support and improve JSON parsing robustness

2025-12-10 13:13:39 +08:00

.env.example

Add OpenRouter support and improve JSON parsing robustness

2025-12-10 13:13:39 +08:00

.gitignore

Add OpenRouter support and improve JSON parsing robustness

2025-12-10 13:13:39 +08:00

generate_scraper.py

Rename test script and update documentation

2025-12-10 15:36:14 +08:00

LICENSE

Initial commit

2025-12-09 07:19:57 +00:00

pyproject.toml

Initial commit: University Playwright Codegen Agent

2025-12-09 16:38:33 +08:00

README.md

Rename test script and update documentation

2025-12-10 15:36:14 +08:00

stanford-masters-faculty_masters.json

Initial commit: University Playwright Codegen Agent

2025-12-09 16:38:33 +08:00

test_run.py

Initial commit: University Playwright Codegen Agent

2025-12-09 16:38:33 +08:00

uv.lock

Initial commit: University Playwright Codegen Agent

2025-12-09 16:38:33 +08:00

任务1.txt

Initial commit: University Playwright Codegen Agent

2025-12-09 16:38:33 +08:00

README.md

University Playwright Codegen Agent

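An automated code-generation agent built on Agno: given the root URL of an overseas university's website, it generates a Python script that uses Playwright to collect the URLs of master's programs under each school or graduate school, along with the personal profile pages of the supervisors and faculty listed in those programs.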

Quick Start

1. Environment setup

# Clone the repository
git clone https://git.prodream.cn/YXY/University-Playwright-Codegen-Agent.git
cd University-Playwright-Codegen-Agent

# Install dependencies (requires uv)
uv sync

# Install the Playwright browsers
uv run playwright install

2. Configure the API key

The project calls Claude models through the OpenRouter API. Set the environment variable:

Windows (PowerShell):

[Environment]::SetEnvironmentVariable("OPENROUTER_API_KEY", "your-api-key", "User")

Windows (CMD):

setx OPENROUTER_API_KEY "your-api-key"

Linux/macOS:

export OPENROUTER_API_KEY="your-api-key"

Alternatively, copy .env.example to .env and fill in the API key.
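The lookup order described above (environment variable first, then a .env file) can be sketched in plain Python. This is only an illustration: the helper name `load_api_key` is hypothetical, and the project's own loading (in `config.py`) may work differently.

```python
import os
from pathlib import Path


def load_api_key(env_file: str = ".env") -> str:
    """Return OPENROUTER_API_KEY from the environment, else from a .env file.

    Hypothetical helper for illustration; not the project's actual loader.
    """
    key = os.environ.get("OPENROUTER_API_KEY", "").strip()
    if key:
        return key

    path = Path(env_file)
    if path.is_file():
        for raw in path.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            # Skip blank lines, comments, and lines without KEY=VALUE form.
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            if name.strip() == "OPENROUTER_API_KEY":
                value = value.strip().strip('"').strip("'")
                if value:
                    return value

    raise RuntimeError(
        "OPENROUTER_API_KEY is not set; export it or add it to .env"
    )
```

Calling `load_api_key()` before generating a scraper makes a missing key fail early with a clear message instead of surfacing later as an API error.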

3. Generate the scraper script

Option 1: pass command-line arguments

uv run python generate_scraper.py \
  --url "https://www.harvard.edu/" \
  --name "Harvard" \
  --language "English" \
  --max-depth 3 \
  --max-pages 30

Option 2: edit the configuration in the script

Edit the configuration at the top of generate_scraper.py:

TARGET_URL = "https://www.example.edu/"
CAMPUS_NAME = "Example University"
LANGUAGE = "English"
MAX_DEPTH = 3
MAX_PAGES = 30

Then run:

uv run python generate_scraper.py
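One plausible way the two options fit together is for the module-level constants to act as defaults that the command-line flags override. The sketch below uses the flag and constant names shown above, but the wiring itself is an assumption; generate_scraper.py may be structured differently.

```python
import argparse

# Defaults mirroring the configuration block shown above.
TARGET_URL = "https://www.example.edu/"
CAMPUS_NAME = "Example University"
LANGUAGE = "English"
MAX_DEPTH = 3
MAX_PAGES = 30


def parse_args(argv=None):
    """Parse CLI flags, falling back to the module-level defaults.

    Assumed structure for illustration only.
    """
    parser = argparse.ArgumentParser(
        description="Generate a Playwright scraper for a university site."
    )
    parser.add_argument("--url", default=TARGET_URL)
    parser.add_argument("--name", default=CAMPUS_NAME)
    parser.add_argument("--language", default=LANGUAGE)
    parser.add_argument("--max-depth", type=int, default=MAX_DEPTH)
    parser.add_argument("--max-pages", type=int, default=MAX_PAGES)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.url, args.name, args.language, args.max_depth, args.max_pages)
```

With this layout, running the script with no flags uses the edited constants, while any flag given on the command line takes precedence.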

4. Run the generated scraper

Generated scripts are saved in the artifacts/ directory:

cd artifacts
uv run python harvard_faculty_scraper.py --max-pages 50 --no-verify

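Common flags:

| Flag | Description | Default |
| --- | --- | --- |
| `--max-pages` | Maximum number of pages to crawl | 30 |
| `--max-depth` | Maximum crawl depth | 3 |
| `--no-verify` | Skip link verification (recommended) | False |
| `--browser` | Browser engine (chromium/firefox/webkit) | chromium |
| `--timeout` | Page load timeout (ms) | 20000 |
| `--output` | Output file path | university-scraper_results.json |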

5. Inspect the results

The crawl results are saved as a JSON file:

{
  "statistics": {
    "total_links": 277,
    "program_links": 8,
    "faculty_links": 269,
    "profile_pages": 265
  },
  "program_links": [...],
  "faculty_links": [...]
}
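A short script can sanity-check a results file of this shape. In the sketch below, the function name `summarize` is hypothetical, the sample lists are left empty because the README elides them, and the rule that `total_links` equals `program_links` plus `faculty_links` is inferred from the sample numbers (8 + 269 = 277) rather than stated by the project.

```python
import json

# Sample mirroring the statistics shown above; the link lists are elided
# in the README, so they are left empty here.
SAMPLE = """
{
  "statistics": {
    "total_links": 277,
    "program_links": 8,
    "faculty_links": 269,
    "profile_pages": 265
  },
  "program_links": [],
  "faculty_links": []
}
"""


def summarize(results: dict) -> dict:
    """Derive a few quick checks from the statistics block."""
    stats = results["statistics"]
    faculty = stats["faculty_links"]
    return {
        "total": stats["total_links"],
        # Assumed invariant: total = program links + faculty links.
        "sum_matches_total": stats["program_links"] + faculty
        == stats["total_links"],
        # Share of faculty links that were identified as profile pages.
        "profile_share": round(stats["profile_pages"] / faculty, 3)
        if faculty
        else 0.0,
    }


if __name__ == "__main__":
    print(summarize(json.loads(SAMPLE)))
```

To check a real run, replace `json.loads(SAMPLE)` with `json.load(open(path, encoding="utf-8"))`, where `path` points at the file given by `--output`.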

Using the CLI (optional)

The project also provides a Typer CLI:

uv run university-agent generate \
  "https://www.example.edu" \
  --campus "Example Campus" \
  --language "English" \
  --max-depth 2 \
  --max-pages 60

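Tested universities

| University | Status | Notes |
| --- | --- | --- |
| Harvard | ✅ | Found 277 links |
| RWTH Aachen | ✅ | Found 108 links |
| KAUST | ✅ | Requires Firefox; the site is slow |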

Troubleshooting

Timeout errors

Some sites respond slowly; increase the timeout:

uv run python xxx_scraper.py --timeout 60000 --no-verify

Browser blocked

Some sites (such as KAUST) block Chromium; switch to Firefox:

uv run python xxx_scraper.py --browser firefox
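For context, the `--browser` and `--timeout` flags presumably map onto Playwright's browser types and the `timeout` argument of `page.goto`. The sketch below shows that mapping with the Playwright sync API; the function `fetch_title` is hypothetical and is not taken from the generated scrapers.

```python
SUPPORTED_BROWSERS = ("chromium", "firefox", "webkit")


def fetch_title(url: str, browser: str = "chromium", timeout_ms: int = 20000) -> str:
    """Open `url` in the chosen engine and return the page title.

    Hypothetical helper; the generated scripts may be organized differently.
    """
    if browser not in SUPPORTED_BROWSERS:
        raise ValueError(
            f"unsupported browser {browser!r}; choose from {SUPPORTED_BROWSERS}"
        )

    # Imported lazily so the argument check above works without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        engine = getattr(p, browser)  # p.chromium, p.firefox, or p.webkit
        instance = engine.launch(headless=True)
        try:
            page = instance.new_page()
            # A longer timeout helps with slow sites.
            page.goto(url, timeout=timeout_ms, wait_until="domcontentloaded")
            return page.title()
        finally:
            instance.close()
```

For a slow site that rejects Chromium, a call such as `fetch_title(url, browser="firefox", timeout_ms=60000)` combines both workarounds from this section.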

API key errors

Make sure the OPENROUTER_API_KEY environment variable is set correctly:

echo $OPENROUTER_API_KEY  # Linux/macOS
echo %OPENROUTER_API_KEY%  # Windows CMD

Project Structure

├── README.md
├── generate_scraper.py      # Main entry script
├── .env.example             # Environment variable template
├── pyproject.toml
├── artifacts/               # Generated scraper scripts
│   ├── harvard_faculty_scraper.py
│   ├── kaust_faculty_scraper.py
│   └── ...
└── src/university_agent/
    ├── agent.py             # Agno Agent configuration
    ├── cli.py               # Typer CLI
    ├── config.py            # pydantic Settings
    ├── generator.py         # Orchestration engine
    ├── models.py            # Data models
    ├── renderer.py          # ScriptPlan -> Playwright script
    ├── sampler.py           # Playwright sampling
    └── writer.py            # Script writer

Features

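Agno Agent: uses output_schema to enforce structured output
Playwright sampling: performs a lightweight crawl of the site before generation
Deterministic script template: BFS crawling, keyword filtering, JSON output
OpenRouter support: uses Claude models through OpenRouter
uv + ruff + ty workflow: a modern Python toolchain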
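The BFS crawl with keyword filtering named in the feature list can be illustrated without a browser. The sketch below runs over an in-memory link graph; the keyword lists, function names, and the `SITE` graph are all hypothetical, and the real template in `renderer.py` is not shown in this README.

```python
from collections import deque

# Hypothetical keyword lists; the template's actual keywords are not shown.
PROGRAM_KEYWORDS = ("master", "msc", "graduate")
FACULTY_KEYWORDS = ("faculty", "people", "supervisor")

# Tiny in-memory stand-in for a website: page URL -> outgoing links.
SITE = {
    "https://u.example/": ["https://u.example/graduate", "https://u.example/news"],
    "https://u.example/graduate": [
        "https://u.example/masters/cs",
        "https://u.example/faculty/alice",
    ],
    "https://u.example/masters/cs": [
        "https://u.example/faculty/bob",
        "https://u.example/faculty/alice",
    ],
}


def classify(url):
    """Return 'faculty', 'program', or None based on URL keywords."""
    lowered = url.lower()
    if any(word in lowered for word in FACULTY_KEYWORDS):
        return "faculty"
    if any(word in lowered for word in PROGRAM_KEYWORDS):
        return "program"
    return None


def bfs_crawl(start_url, get_links, max_depth=3, max_pages=30):
    """Breadth-first crawl bounded by depth and by number of pages visited."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    found = {"program_links": [], "faculty_links": []}
    pages_visited = 0

    while queue and pages_visited < max_pages:
        url, depth = queue.popleft()
        pages_visited += 1
        for link in get_links(url):
            if link in seen:
                continue  # each link is recorded and queued at most once
            seen.add(link)
            kind = classify(link)
            if kind is not None:
                found[f"{kind}_links"].append(link)
            if depth < max_depth:
                queue.append((link, depth + 1))

    found["statistics"] = {
        "total_links": len(found["program_links"]) + len(found["faculty_links"]),
        "program_links": len(found["program_links"]),
        "faculty_links": len(found["faculty_links"]),
    }
    return found


if __name__ == "__main__":
    print(bfs_crawl("https://u.example/", lambda u: SITE.get(u, [])))
```

In a real scraper, `get_links` would load the page with Playwright and return its anchors; keeping it as a parameter lets the traversal logic be tested offline.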

License

MIT

Description

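An agent that generates code automatically: given the URL of an overseas university's official website, it produces a Python script that crawls the URLs of all master's programs under the university's schools at every level, plus the profile-page URLs of the supervisors in each program.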

Readme MIT 313 KiB

Languages

Python 89.8%

TypeScript 6.4%

Jinja 3.2%

Batchfile 0.2%

Dockerfile 0.2%

Other 0.1%
