feat: implement background scheduler for job scraping with Gunicorn support
Some checks failed:
- CI/CD Pipeline / test (push): failing after 13s
- CI/CD Pipeline / build-image (push): skipped

2026-01-21 19:07:43 +01:00
parent e8baeb3bcf
commit d84b8f128b
6 changed files with 84 additions and 6 deletions


@@ -46,7 +46,14 @@ This layered design makes it straightforward to extend the scraper to new source
## Scheduler Configuration
The application includes an automated scheduler that runs the job scraping process every hour. The scheduler is implemented in `web/craigslist.py` and can run alongside the web app when explicitly enabled.
**Enable background scheduling** by setting the environment variable `SCRAPE_SCHEDULER_ENABLED=1`.
- **Gunicorn**: the scheduler starts once in the Gunicorn master process (see `gunicorn.conf.py`). Worker processes skip scheduler startup to avoid duplicate runs.
- **Flask dev server**: the scheduler starts on the first request (to avoid the reloader starting it twice).
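The master-only startup described above can be sketched as follows. This is a hypothetical illustration, not the repo's exact code: `start_scheduler_once` and the body of the `when_ready` hook are assumptions about how `gunicorn.conf.py` wires things together.

```python
import os
import threading

def scheduler_enabled() -> bool:
    # The scheduler only starts when explicitly enabled via the env var.
    return os.environ.get("SCRAPE_SCHEDULER_ENABLED") == "1"

_started = threading.Event()

def start_scheduler_once(run) -> bool:
    """Start the background scrape loop at most once per process."""
    if scheduler_enabled() and not _started.is_set():
        _started.set()
        threading.Thread(target=run, daemon=True).start()
        return True
    return False

# gunicorn.conf.py sketch: when_ready runs once in the Gunicorn master
# process, before workers fork, so workers never duplicate the scheduler.
def when_ready(server):
    start_scheduler_once(lambda: None)  # hypothetical scrape loop
```

Guarding with a process-local flag also covers the Flask dev-server case, where the reloader would otherwise import the module twice.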
When enabled, the scheduler includes:
- **Automatic Scheduling**: Scraping runs every hour automatically
- **Failure Handling**: Retry logic with exponential backoff (up to 3 attempts)
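The retry policy above can be sketched as a small helper. This is an illustrative sketch of exponential backoff with a capped attempt count, not the repo's actual implementation; the function name and signature are assumptions.

```python
import time

def run_with_retries(task, attempts=3, base_delay=1.0):
    """Run `task`, retrying failures with exponential backoff.

    Delays double on each retry (base_delay, 2*base_delay, ...).
    The last failure is re-raised so callers can log it.
    """
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```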
@@ -107,15 +114,12 @@ When scraping individual job listings, the following contact information is extr
The `extract_contact_info()` function parses several types of reply URLs:
1. **Mailto Links**: `mailto:jobs@company.com?subject=...`
- Extracts the email address directly
2. **Phone Links**: `tel:+1234567890`
- Extracts the phone number
3. **URL Parameters**: `https://apply.company.com?email=hr@company.com&phone=555-1234&name=HR%20Team`
- Searches for common parameter names: `email`, `phone`, `contact_name`, etc.
4. **Graceful Fallback**: If contact information cannot be extracted, the fields are set to `"N/A"`
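The four cases above can be sketched with the standard library's URL utilities. This is a minimal illustration of the described behavior, not the repo's exact `extract_contact_info()` code, and the returned dict shape is an assumption.

```python
from urllib.parse import parse_qs, urlsplit

def extract_contact_info(reply_url: str) -> dict:
    """Sketch: pull contact details out of a reply URL, falling back to 'N/A'."""
    info = {"email": "N/A", "phone": "N/A", "contact_name": "N/A"}
    if reply_url.startswith("mailto:"):
        # Drop any ?subject=... query from the address.
        info["email"] = reply_url[len("mailto:"):].split("?", 1)[0]
    elif reply_url.startswith("tel:"):
        info["phone"] = reply_url[len("tel:"):]
    else:
        # Look for common query-parameter names (values are percent-decoded).
        params = parse_qs(urlsplit(reply_url).query)
        if "email" in params:
            info["email"] = params["email"][0]
        if "phone" in params:
            info["phone"] = params["phone"][0]
        for key in ("contact_name", "name"):
            if key in params:
                info["contact_name"] = params[key][0]
                break
    return info
```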