diff --git a/README.md b/README.md
index 47324ff..c806adc 100644
--- a/README.md
+++ b/README.md
@@ -118,11 +118,52 @@ uv run university-agent generate \
 
 ## Tested universities
 
-| University | Status | Notes |
-|------|------|------|
-| Harvard | ✅ | Found 277 links |
-| RWTH Aachen | ✅ | Found 108 links |
-| KAUST | ✅ | Requires Firefox; the site is slow |
+| University | Status | Results | Generated script |
+|------|------|------|-----------|
+| Harvard | ✅ | 277 links (8 programs, 269 faculty, 265 profile pages) | `artifacts/harvard_faculty_scraper.py` |
+| RWTH Aachen | ✅ | 108 links (103 programs, 5 faculty) | `artifacts/rwth_aachen_playwright_scraper.py` |
+| KAUST | ✅ | 9 links (requires Firefox) | `artifacts/kaust_faculty_scraper.py` |
+
+### Harvard test example
+
+**Generate the scraper script:**
+```bash
+uv run python generate_scraper.py --url "https://www.harvard.edu/" --name "Harvard"
+```
+
+**Run the scraper:**
+```bash
+cd artifacts
+uv run python harvard_faculty_scraper.py --max-pages 30 --no-verify
+```
+
+**Output** (`artifacts/university-scraper_results.json`):
+```json
+{
+  "statistics": {
+    "total_links": 277,
+    "program_links": 8,
+    "faculty_links": 269,
+    "profile_pages": 265
+  },
+  "program_links": [
+    {"url": "https://www.harvard.edu/programs/?degree_levels=graduate", "text": "Graduate Programs"},
+    ...
+  ],
+  "faculty_links": [
+    {"url": "https://www.gse.harvard.edu/directory/faculty", "text": "Faculty Directory"},
+    {"url": "https://faculty.harvard.edu", "text": "Harvard Faculty"},
+    ...
+  ]
+}
+```
+
+The crawl covered several Harvard schools:
+- Graduate School of Design (GSD)
+- Graduate School of Education (GSE)
+- Faculty of Arts and Sciences (FAS)
+- Graduate School of Arts and Sciences (GSAS)
+- Harvard Divinity School (HDS)
 
 ## Troubleshooting
 
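
The `statistics` block in the Harvard output above lends itself to a quick sanity check after a run. The following is a minimal sketch, not part of the repository: the helper names `load_results`, `summarize`, and `counts_add_up` are hypothetical, and the rule that `total_links` equals `program_links` plus `faculty_links` is an assumption inferred from the figures shown (277 = 8 + 269), not a documented guarantee of the scraper.

```python
import json
from pathlib import Path


def load_results(path: str) -> dict:
    """Read a results file such as artifacts/university-scraper_results.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize(results: dict) -> str:
    """Format the statistics block as a one-line summary."""
    stats = results["statistics"]
    return (
        f"{stats['total_links']} links "
        f"({stats['program_links']} programs, "
        f"{stats['faculty_links']} faculty, "
        f"{stats.get('profile_pages', 0)} profile pages)"
    )


def counts_add_up(results: dict) -> bool:
    """Assumed invariant: total = program links + faculty links."""
    stats = results["statistics"]
    return stats["total_links"] == stats["program_links"] + stats["faculty_links"]


# Statistics copied from the Harvard example; the link lists are omitted
# because the example elides them.
harvard = {
    "statistics": {
        "total_links": 277,
        "program_links": 8,
        "faculty_links": 269,
        "profile_pages": 265,
    }
}

print(summarize(harvard))      # 277 links (8 programs, 269 faculty, 265 profile pages)
print(counts_add_up(harvard))  # True
```

For a real run, `load_results("artifacts/university-scraper_results.json")` would replace the inline dictionary, assuming the file contains complete JSON rather than the abbreviated excerpt shown in the README.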