feat: implement background scheduler for job scraping with Gunicorn support
Some checks failed:
- CI/CD Pipeline / test (push): failing after 13s
- CI/CD Pipeline / build-image (push): skipped

2026-01-21 19:07:43 +01:00
parent e8baeb3bcf
commit d84b8f128b
6 changed files with 84 additions and 6 deletions


@@ -46,7 +46,14 @@ This layered design makes it straightforward to extend the scraper to new source
## Scheduler Configuration
The application includes an automated scheduler that runs the job scraping process every hour. The scheduler is implemented in `web/craigslist.py` and can run alongside the web app when explicitly enabled.
**Enable background scheduling** by setting the environment variable `SCRAPE_SCHEDULER_ENABLED=1`.
- **Gunicorn**: the scheduler starts once in the Gunicorn master process (see `gunicorn.conf.py`). Worker processes skip scheduler startup to avoid duplicate runs.
- **Flask dev server**: the scheduler starts on the first request (to avoid the reloader starting it twice).
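The master-only startup described above can be sketched as follows. This is a hypothetical illustration, not the repo's exact code: `start_scheduler_once` and the body of the `when_ready` hook are assumptions about how `gunicorn.conf.py` wires things together.

```python
import os
import threading

def scheduler_enabled() -> bool:
    # The scheduler only starts when explicitly enabled via the env var.
    return os.environ.get("SCRAPE_SCHEDULER_ENABLED") == "1"

_started = threading.Event()

def start_scheduler_once(run) -> bool:
    """Start the background scrape loop at most once per process."""
    if scheduler_enabled() and not _started.is_set():
        _started.set()
        threading.Thread(target=run, daemon=True).start()
        return True
    return False

# gunicorn.conf.py sketch: when_ready runs once in the Gunicorn master
# process, before workers fork, so workers never duplicate the scheduler.
def when_ready(server):
    start_scheduler_once(lambda: None)  # hypothetical scrape loop
```

Guarding with a process-local flag also covers the Flask dev-server case, where the reloader would otherwise import the module twice.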
When enabled, the scheduler includes:
- **Automatic Scheduling**: Scraping runs every hour automatically
- **Failure Handling**: Retry logic with exponential backoff (up to 3 attempts)
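The retry policy above can be sketched as a small helper. This is an illustrative sketch of exponential backoff with a capped attempt count, not the repo's actual implementation; the function name and signature are assumptions.

```python
import time

def run_with_retries(task, attempts=3, base_delay=1.0):
    """Run `task`, retrying failures with exponential backoff.

    Delays double on each retry (base_delay, 2*base_delay, ...).
    The last failure is re-raised so callers can log it.
    """
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```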
@@ -107,15 +114,12 @@ When scraping individual job listings, the following contact information is extr
The `extract_contact_info()` function parses several types of reply URLs:
1. **Mailto Links**: `mailto:jobs@company.com?subject=...`
- Extracts the email address directly
2. **Phone Links**: `tel:+1234567890`
- Extracts the phone number
3. **URL Parameters**: `https://apply.company.com?email=hr@company.com&phone=555-1234&name=HR%20Team`
- Searches for common parameter names: `email`, `phone`, `contact_name`, etc.
4. **Graceful Fallback**: If contact information cannot be extracted, the fields are set to `"N/A"`
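The four cases above can be sketched with the standard library's URL utilities. This is a minimal illustration of the described behavior, not the repo's exact `extract_contact_info()` code, and the returned dict shape is an assumption.

```python
from urllib.parse import parse_qs, urlsplit

def extract_contact_info(reply_url: str) -> dict:
    """Sketch: pull contact details out of a reply URL, falling back to 'N/A'."""
    info = {"email": "N/A", "phone": "N/A", "contact_name": "N/A"}
    if reply_url.startswith("mailto:"):
        # Drop any ?subject=... query from the address.
        info["email"] = reply_url[len("mailto:"):].split("?", 1)[0]
    elif reply_url.startswith("tel:"):
        info["phone"] = reply_url[len("tel:"):]
    else:
        # Look for common query-parameter names (values are percent-decoded).
        params = parse_qs(urlsplit(reply_url).query)
        if "email" in params:
            info["email"] = params["email"][0]
        if "phone" in params:
            info["phone"] = params["phone"][0]
        for key in ("contact_name", "name"):
            if key in params:
                info["contact_name"] = params[key][0]
                break
    return info
```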