feat: implement background scheduler for job scraping with Gunicorn support
12
README.md
@@ -46,7 +46,14 @@ This layered design makes it straightforward to extend the scraper to new source
## Scheduler Configuration

The application includes an automated scheduler that runs the job scraping process every hour. The scheduler is implemented in `web/craigslist.py` and includes:
The application includes an automated scheduler that runs the job scraping process every hour. The scheduler is implemented in `web/craigslist.py` and can run alongside the web app when explicitly enabled.

**Enable background scheduling** by setting the environment variable `SCRAPE_SCHEDULER_ENABLED=1`.

- **Gunicorn**: the scheduler starts once in the Gunicorn master process (see `gunicorn.conf.py`). Worker processes skip scheduler startup to avoid duplicate runs.
- **Flask dev server**: the scheduler starts on the first request (to avoid the reloader starting it twice).

When enabled, the scheduler includes:

- **Automatic Scheduling**: Scraping runs every hour automatically
- **Failure Handling**: Retry logic with exponential backoff (up to 3 attempts)
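
For illustration, retrying with exponential backoff and a cap of 3 attempts might look like the sketch below. The function name `run_with_retries`, the 1-second base delay, and the doubling factor are assumptions; the README only states that there are up to 3 attempts with exponential backoff.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts is reached.

    The wait before retry n is base_delay * 2 ** (n - 1), so with the
    defaults the waits are 1s and 2s. The last failure is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so that the delays can be recorded in tests instead of actually waiting.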
@@ -107,15 +114,12 @@ When scraping individual job listings, the following contact information is extr
The `extract_contact_info()` function intelligently parses various types of reply URLs:

1. **Mailto Links**: `mailto:jobs@company.com?subject=...`
   - Extracts the email address directly
2. **Phone Links**: `tel:+1234567890`
   - Extracts the phone number
3. **URL Parameters**: `https://apply.company.com?email=hr@company.com&phone=555-1234&name=HR%20Team`
   - Searches for common parameter names: `email`, `phone`, `contact_name`, etc.
4. **Graceful Fallback**: If contact information cannot be extracted, the fields are set to `"N/A"`
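
The four cases above could be handled along these lines. This is a sketch under stated assumptions, not the project's `extract_contact_info()`: the name `extract_contact_info_sketch`, the returned dictionary shape, and the exact parameter names beyond `email`, `phone`, and `contact_name` are hypothetical.

```python
from urllib.parse import parse_qs, unquote, urlsplit

NA = "N/A"

# Query parameter names to try for each field (assumed lists).
PARAM_KEYS = {
    "email": ("email", "contact_email"),
    "phone": ("phone", "tel"),
    "name": ("contact_name", "name"),
}


def extract_contact_info_sketch(reply_url):
    """Return a dict with email, phone, and name, each defaulting to "N/A"."""
    info = {"email": NA, "phone": NA, "name": NA}
    if not reply_url:
        return info

    lowered = reply_url.lower()
    if lowered.startswith("mailto:"):
        # Drop the scheme and any ?subject=... suffix, then decode.
        address = unquote(reply_url[len("mailto:"):].split("?", 1)[0]).strip()
        if address:
            info["email"] = address
    elif lowered.startswith("tel:"):
        number = unquote(reply_url[len("tel:"):]).strip()
        if number:
            info["phone"] = number
    else:
        # parse_qs percent-decodes values, e.g. HR%20Team -> "HR Team".
        params = parse_qs(urlsplit(reply_url).query)
        for field, keys in PARAM_KEYS.items():
            for key in keys:
                values = params.get(key)
                if values:
                    info[field] = values[0]
                    break
    return info
```

Any field that no rule fills in keeps its `"N/A"` default, which corresponds to the graceful fallback in item 4.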