Add university scraper system with backend, frontend, and configs

- Add src/university_scraper module with scraper, analyzer, and CLI
- Add backend FastAPI service with API endpoints and database models
- Add frontend React app with university management pages
- Add configs for Harvard, Manchester, and UCL universities
- Add artifacts with various scraper implementations
- Add Docker compose configuration for deployment
- Update .gitignore to exclude generated files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
yangxiaoyu-crypto
2025-12-22 15:25:08 +08:00
parent 2714c8ad5c
commit 426cf4d2cd
75 changed files with 13527 additions and 2 deletions

@@ -0,0 +1,24 @@
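# UK University Template Library

This directory holds ScraperConfig template snippets for site structures common to UK universities. The goal is to let generation and scheduling scripts quickly apply proven school, program, and faculty configurations while staying in sync with the latest capabilities of `src/university_scraper`.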
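## Usage

1. Copy the template you need to `configs/<university>.yaml` and replace the placeholders with the university's actual values (domain, school URLs, Research Explorer organisation slug, and so on).
2. Adjust the school list under `schools.static_list`:
   - `keywords`: used to automatically cluster programs into schools;
   - `faculty_pages`: defines school-level faculty directories (supports `extract_method: table|links|research_explorer`, scrolling or click-to-load-more, and per-page request parameters).
3. Depending on how the university's course navigation works, fill in `programs.paths_to_try`, `link_patterns`, `selectors`, and the request settings.
4. `faculty.discovery_strategies` should preferably include at least:
   - `link_in_page`: looks for "People/Faculty" links on the program page;
   - `url_pattern`: adds common URL patterns;
   - `school_directory`: reuses the faculty directories from `faculty_pages` and distributes them to programs by keyword (the templates set `match_by_school_keywords: true` for this).
5. Run `python -m src.university_scraper.cli run --config configs/<university>.yaml --output output/<name>.json` (or trigger the job from the web UI) to verify, and compare the local result against the previous version.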
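
As a quick sanity check for step 4, the sketch below reports which of the recommended strategy types are absent from a config. It assumes the YAML has already been parsed into a plain dict and that each strategy entry carries a `type` key, as in the templates. `missing_strategies` is a hypothetical helper for illustration, not part of `src/university_scraper`.

```python
RECOMMENDED = ("link_in_page", "url_pattern", "school_directory")


def missing_strategies(config: dict) -> list[str]:
    """Return the recommended strategy types that the config does not declare."""
    faculty = config.get("faculty") or {}
    strategies = faculty.get("discovery_strategies") or []
    declared = {s.get("type") for s in strategies if isinstance(s, dict)}
    return [name for name in RECOMMENDED if name not in declared]


# Hypothetical parsed config that declares only two of the three strategies.
example = {
    "faculty": {
        "discovery_strategies": [
            {"type": "link_in_page"},
            {"type": "url_pattern"},
        ]
    }
}

print(missing_strategies(example))  # ['school_directory']
```

An empty result means all three recommended strategies are present; the check says nothing about whether their `patterns` or `request` settings are sensible.
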
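## Template List

| File | When to use |
|------|-------------|
| `uk_research_explorer_template.yaml` | Most UK universities that use Pure Portal / Research Explorer (e.g. Manchester, UCL, and the humanities and social sciences schools at Imperial College). |
| `uk_department_directory_template.yaml` | Schools whose traditional departmental sites list an HTML staff directory (e.g. science and engineering school sites, standalone school sites). |

If new page types turn up later (for example SharePoint lists or embedded APIs), add a new template file to this directory and update this README accordingly.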

@@ -0,0 +1,95 @@
university:
  name: "REPLACE_UNIVERSITY_NAME"
  url: "https://www.example.ac.uk/"
  country: "United Kingdom"

schools:
  discovery_method: "static_list"
  static_list:
    - name: "Department of Computer Science"
      url: "https://www.example.ac.uk/about/people/academic-and-research-staff/"
      keywords:
        - "computer"
        - "software"
        - "artificial intelligence"
        - "data science"
      faculty_pages:
        - url: "https://www.example.ac.uk/about/people/academic-and-research-staff/"
          extract_method: "links"
          requires_scroll: true
          scroll_times: 6
          scroll_delay_ms: 600
          blocked_resources: ["image", "font", "media"]
        - url: "https://www.example.ac.uk/about/people/"
          extract_method: "links"
          load_more_selector: "button.load-more"
          max_load_more: 5
          request:
            timeout_ms: 45000
            wait_until: "domcontentloaded"
            post_wait_ms: 2000
    - name: "Department of Physics"
      url: "https://www.example.ac.uk/physics/about/people/"
      keywords:
        - "physics"
        - "astronomy"
        - "material science"
      faculty_pages:
        - url: "https://www.example.ac.uk/physics/about/people/academic-staff/"
          extract_method: "table"
          request:
            timeout_ms: 60000
            wait_until: "domcontentloaded"
            post_wait_ms: 2000

programs:
  paths_to_try:
    - "/study/masters/courses/a-to-z/"
    - "/study/masters/courses/list/"
  link_patterns:
    - text_contains: ["courses", "masters", "postgraduate"]
      href_contains: ["/study/", "/masters/", "/courses/"]
  selectors:
    program_item: ".course-card, li.course, article.course"
    program_name: ".course-title, h3, .title"
    program_url: "a[href]"
    degree_type: ".award, .badge"
  request:
    timeout_ms: 35000
    wait_until: "domcontentloaded"
    post_wait_ms: 2000

faculty:
  discovery_strategies:
    - type: "link_in_page"
      patterns:
        - text_contains: ["people", "faculty", "team", "staff"]
          href_contains: ["/people", "/faculty", "/staff"]
      request:
        timeout_ms: 25000
        wait_until: "domcontentloaded"
        post_wait_ms: 1500
    - type: "url_pattern"
      patterns:
        - "{program_url}/people"
        - "{program_url}/staff"
        - "{school_url}/people"
        - "{school_url}/contact/staff"
      request:
        timeout_ms: 25000
        wait_until: "domcontentloaded"
        post_wait_ms: 1500
    - type: "school_directory"
      assign_to_all: false
      match_by_school_keywords: true
      metadata_keyword_field: "keywords"
      request:
        timeout_ms: 60000
        wait_for_selector: "a[href*='/people/'], table"
        post_wait_ms: 2000

filters:
  program_degree_types:
    include: ["MSc", "MSci", "MA", "MBA", "MEng", "LLM"]
    exclude: ["PhD", "Bachelor", "BSc", "BA", "PGCert"]
  exclude_schools: []
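
The `filters.program_degree_types` block reads as an allow/deny list over degree labels. The sketch below shows one plausible interpretation, using the lists from this template. It assumes that `exclude` takes precedence, that matching is exact and case-sensitive, and that an empty `include` means "no restriction"; the scraper's actual rule is not shown in this commit, and `keep_program` is a hypothetical name.

```python
def keep_program(degree_type: str, include: list[str], exclude: list[str]) -> bool:
    """Assumed rule: drop anything excluded, then keep only what is included.

    An empty include list is treated as "no restriction" (an assumption).
    """
    if degree_type in exclude:
        return False
    return not include or degree_type in include


# Values copied from the template's filters section.
include = ["MSc", "MSci", "MA", "MBA", "MEng", "LLM"]
exclude = ["PhD", "Bachelor", "BSc", "BA", "PGCert"]

for degree in ["MSc", "PhD", "MRes"]:
    print(degree, keep_program(degree, include, exclude))
```

Under these assumptions `MRes` is dropped by this template because it is not in `include`, even though it is not explicitly excluded.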

@@ -0,0 +1,101 @@
university:
  name: "REPLACE_UNIVERSITY_NAME"
  url: "https://www.example.ac.uk/"
  country: "United Kingdom"

schools:
  discovery_method: "static_list"
  request:
    timeout_ms: 45000
    max_retries: 3
    retry_backoff_ms: 3000
  static_list:
    # Example schools based on Research Explorer (Pure Portal)
    - name: "School of Engineering"
      url: "https://research.example.ac.uk/en/organisations/school-of-engineering/persons/"
      keywords:
        - "engineering"
        - "mechanical"
        - "civil"
        - "materials"
      faculty_pages:
        - url: "https://research.example.ac.uk/en/organisations/school-of-engineering/persons/"
          extract_method: "research_explorer"
          requires_scroll: true
          request:
            timeout_ms: 120000
            wait_until: "networkidle"
            post_wait_ms: 5000
          research_explorer:
            org_slug: "school-of-engineering"
            page_size: 400
    - name: "Faculty of Humanities"
      url: "https://research.example.ac.uk/en/organisations/faculty-of-humanities/persons/"
      keywords:
        - "arts"
        - "languages"
        - "history"
        - "philosophy"
      faculty_pages:
        - url: "https://research.example.ac.uk/en/organisations/faculty-of-humanities/persons/"
          extract_method: "research_explorer"
          requires_scroll: true
          request:
            timeout_ms: 120000
            wait_until: "networkidle"
            post_wait_ms: 4500
          research_explorer:
            org_slug: "faculty-of-humanities"
            page_size: 300

programs:
  paths_to_try:
    - "/study/masters/courses/list/"
    - "/study/postgraduate/courses/list/"
  link_patterns:
    - text_contains: ["masters", "postgraduate", "graduate"]
      href_contains: ["/courses/", "/study/", "/programmes/"]
  selectors:
    program_item: "li.course-item, article.course-card, a.course-link"
    program_name: ".course-title, h3, .title"
    program_url: "a[href]"
    degree_type: ".course-award, .badge"
  request:
    timeout_ms: 40000
    wait_until: "domcontentloaded"
    post_wait_ms: 2500

faculty:
  discovery_strategies:
    - type: "link_in_page"
      patterns:
        - text_contains: ["faculty", "people", "staff", "directory"]
          href_contains: ["/faculty", "/people", "/staff"]
      request:
        timeout_ms: 30000
        wait_until: "domcontentloaded"
        post_wait_ms: 1500
    - type: "url_pattern"
      patterns:
        - "{program_url}/people"
        - "{program_url}/faculty"
        - "{school_url}/people"
        - "{school_url}/staff"
      request:
        timeout_ms: 30000
        wait_until: "domcontentloaded"
        post_wait_ms: 1500
    - type: "school_directory"
      assign_to_all: false
      match_by_school_keywords: true
      metadata_keyword_field: "keywords"
      request:
        timeout_ms: 120000
        wait_for_selector: "a.link.person"
        post_wait_ms: 4000

filters:
  program_degree_types:
    include: ["MSc", "MA", "MBA", "MEng", "LLM", "MRes"]
    exclude: ["PhD", "Bachelor", "BSc", "BA"]
  exclude_schools: []
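
With `match_by_school_keywords: true`, the `school_directory` strategy is meant to hand each school's staff directory to the programs that match that school's `keywords`. The sketch below illustrates the idea with the two schools from this template. It assumes a case-insensitive substring match against the program name; the real matching logic in `src/university_scraper` may differ, and `schools_for_program` is a hypothetical helper.

```python
def schools_for_program(program_name: str, schools: list[dict]) -> list[str]:
    """Return names of schools with at least one keyword found in the program name."""
    name = program_name.lower()
    return [
        school["name"]
        for school in schools
        if any(keyword.lower() in name for keyword in school.get("keywords", []))
    ]


# School names and keywords copied from the template's static_list.
schools = [
    {"name": "School of Engineering",
     "keywords": ["engineering", "mechanical", "civil", "materials"]},
    {"name": "Faculty of Humanities",
     "keywords": ["arts", "languages", "history", "philosophy"]},
]

print(schools_for_program("MSc Mechanical Engineering", schools))  # ['School of Engineering']
print(schools_for_program("MA History", schools))                  # ['Faculty of Humanities']
```

Plain substring matching can over-match short keywords such as "arts", so keyword lists are worth keeping specific when adapting the template.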