# jobs

A job scraper.

## Features

- Scrapes job listings from websites (currently Craigslist, by region)
- Saves job listings to a database
- Users can search for job listings by keywords and region
- Selects job listings based on user preferences

## Architecture Overview

The application is built as a modular Flask-based service with clear separation of concerns:

| Layer | Module | Responsibility |
| --- | --- | --- |
| **Web UI** | `web/app.py` | Flask application that serves HTML pages, REST endpoints, and admin interfaces (users, taxonomy, health, email management). |
| **Orchestrator** | `web/craigslist.py` | Coordinates the scraping workflow: schedules runs, fetches listings, updates the DB, and triggers email alerts. |
| **Scraper** | `web/scraper.py` | Contains the low-level HTML parsing logic (`scrape_job_data`, `scrape_job_page`, `extract_contact_info`). |
| **Persistence** | `web/db.py` | SQLAlchemy ORM models (`User`, `JobListing`, `JobDescription`, `UserInteraction`, `Region`, `Keyword`, `EmailSubscription`, `EmailTemplate`) and helper functions for upserts, queries, and subscription management. |
| **Email Rendering** | `web/email_templates.py` | Renders job-alert emails using a pluggable template system. Supports default placeholders (`{count_label}`, `{scope}`, `{timestamp}`, `{jobs_section}`, `{jobs_message}`) and custom admin-defined templates. |
| **Email Delivery** | `web/email_service.py` | Sends rendered messages via SMTP, handling TLS/SSL, authentication, and graceful disabling. |
| **Configuration** | `config/settings.json` | Centralised JSON config for database, HTTP, scraper options, negative keywords, and email settings. |
| **Static Assets & Templates** | `web/static/`, `web/templates/` | Front-end resources (JS, CSS) and Jinja2 templates for the public UI and admin pages (including the **Email Templates** management UI). |
| **Scheduler** | `schedule` (used in `web/craigslist.py`) | Runs the scraper automatically at configurable intervals (default hourly). |
| **Testing** | `tests/` | Pytest suite covering the scheduler, scraper, DB helpers, email service, and the admin UI for email subscriptions and templates. |

**Key architectural notes**

- **Email Subscriptions** are stored in the `email_subscriptions` table and managed via `/admin/emails`.
- **Email Templates** are persisted in the `email_templates` table, editable through `/admin/email-templates`, and used by the alert system.
- The orchestrator (`fetch_listings`) returns a detailed result dict (`discovered`, `new`, `by_search`) that drives UI metrics and health checks.
- Contact information (`reply_url`, `contact_email`, `contact_phone`, `contact_name`) extracted by the scraper is saved in `job_descriptions`.
- Negative keyword filtering is applied early in the pipeline to prevent unwanted listings from reaching the DB or email alerts.

This layered design makes it straightforward to extend the scraper to new sources, swap out the email backend, or add admin features without impacting other components.

## Installation

1. Clone the repository
2. Create a virtual environment
3. Install dependencies
4. Set up environment variables
5. Run the application

## Scheduler Configuration

The application includes an automated scheduler that runs the job scraping process every hour. The scheduler is implemented in `web/craigslist.py` and includes:

- **Automatic Scheduling**: Scraping runs every hour automatically
- **Failure Handling**: Retry logic with exponential backoff (up to 3 attempts)
- **Background Operation**: Runs in a separate daemon thread
- **Graceful Error Recovery**: Continues running even if individual scraping attempts fail

### Scheduler Features

- **Retry Mechanism**: Automatically retries failed scraping attempts
- **Logging**: Comprehensive logging of scheduler operations and failures
- **Testing**: Comprehensive test suite in `tests/test_scheduler.py`

To modify the scheduling interval, edit the `start_scheduler()` function in `web/craigslist.py`.

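The real scheduler lives in `web/craigslist.py`; the sketch below is only an illustration of the pattern described above (hourly scheduling via the `schedule` library, a daemon thread, and retries with exponential backoff). Everything except `start_scheduler()` is a hypothetical name.

```python
# Illustrative sketch only -- the real scheduler is implemented in web/craigslist.py.
import logging
import threading
import time

import schedule  # third-party "schedule" package used by the project

logger = logging.getLogger(__name__)
MAX_ATTEMPTS = 3  # matches the "up to 3 attempts" retry policy above


def fetch_listings():
    """Placeholder for the real orchestrator entry point in web/craigslist.py."""


def run_scrape_with_retries():
    """Hypothetical wrapper: retry a failed scrape with exponential backoff."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            fetch_listings()
            return
        except Exception:
            logger.exception("Scrape attempt %d/%d failed", attempt, MAX_ATTEMPTS)
            if attempt < MAX_ATTEMPTS:
                time.sleep(2 ** attempt)  # back off: 2s, then 4s between retries
    logger.error("All scrape attempts failed; will try again at the next interval")


def start_scheduler():
    """Run the scraper hourly in a background daemon thread."""
    schedule.every().hour.do(run_scrape_with_retries)  # change the interval here

    def loop():
        while True:
            schedule.run_pending()
            time.sleep(60)  # check the schedule once a minute

    threading.Thread(target=loop, daemon=True).start()
```
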
## Job Scraping Output

The `fetch_listings()` function in `web/craigslist.py` returns detailed metrics about each scraping operation as a dictionary containing:

- **discovered**: Total number of unique job URLs discovered across all region/keyword combinations
- **new**: Total number of newly added jobs (jobs not previously in the database)
- **by_search**: List of dictionaries, each containing:
  - **region**: The region name for this search
  - **keyword**: The keyword used for this search
  - **count**: Number of jobs fetched for this specific region/keyword combination

### Example Output

```python
{
    "discovered": 150,
    "new": 42,
    "by_search": [
        {"region": "sfbay", "keyword": "python", "count": 25},
        {"region": "sfbay", "keyword": "java", "count": 18},
        {"region": "losangeles", "keyword": "python", "count": 45},
        {"region": "losangeles", "keyword": "java", "count": 62}
    ]
}
```

This per-search breakdown supports monitoring and debugging of the scraping process, making it easy to identify searches that are failing or returning fewer results than expected.

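As a quick illustration of that monitoring use (this consumer is hypothetical, not repository code), low or zero counts can be surfaced directly from `by_search`:

```python
# Hypothetical consumer of the fetch_listings() result -- not repository code.
result = {
    "discovered": 150,
    "new": 42,
    "by_search": [
        {"region": "sfbay", "keyword": "python", "count": 25},
        {"region": "losangeles", "keyword": "java", "count": 0},
    ],
}

print(f"{result['new']} new jobs out of {result['discovered']} discovered")
for search in result["by_search"]:
    if search["count"] == 0:  # likely a failing or misconfigured search
        print(f"warning: no results for '{search['keyword']}' in {search['region']}")
```
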
## Contact Information Extraction

The scraper automatically extracts contact information from job listing pages.

### Extracted Fields

When scraping individual job listings, the following contact information is extracted and stored:

- **contact_email**: Email address extracted from reply button or contact form links
- **contact_phone**: Phone number extracted from tel links or contact parameters
- **contact_name**: Contact person or department name, if available
- **reply_url**: The full reply/contact URL from the job listing

### How Contact Information is Extracted

The `extract_contact_info()` function parses several types of reply URLs:

1. **Mailto Links**: `mailto:jobs@company.com?subject=...`
   - Extracts the email address directly
2. **Phone Links**: `tel:+1234567890`
   - Extracts the phone number
3. **URL Parameters**: `https://apply.company.com?email=hr@company.com&phone=555-1234&name=HR%20Team`
   - Searches for common parameter names: `email`, `phone`, `contact_name`, etc.
4. **Graceful Fallback**: If contact information cannot be extracted, the fields are set to `"N/A"`

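The actual parser lives in `web/scraper.py`; the following is a minimal sketch of the strategy described above, using standard-library URL parsing. The function name and the exact set of query parameters it checks are assumptions, not the repository's implementation.

```python
# Illustrative sketch of the extraction strategy -- the real logic is in web/scraper.py.
from urllib.parse import parse_qs, urlparse


def extract_contact_info_sketch(reply_url):
    """Return contact fields parsed from a reply URL, defaulting to "N/A"."""
    info = {"reply_url": reply_url or "N/A", "contact_email": "N/A",
            "contact_phone": "N/A", "contact_name": "N/A"}
    if not reply_url:
        return info

    parsed = urlparse(reply_url)
    if parsed.scheme == "mailto":
        info["contact_email"] = parsed.path      # mailto:jobs@company.com?subject=...
    elif parsed.scheme == "tel":
        info["contact_phone"] = parsed.path      # tel:+1234567890
    else:
        params = parse_qs(parsed.query)          # https://...?email=...&phone=...&name=...
        info["contact_email"] = params.get("email", ["N/A"])[0]
        info["contact_phone"] = params.get("phone", ["N/A"])[0]
        info["contact_name"] = (params.get("contact_name") or params.get("name") or ["N/A"])[0]
    return info


print(extract_contact_info_sketch("mailto:hiring@acme.com?subject=Job%20Application"))
# {'reply_url': 'mailto:hiring@acme.com?subject=Job%20Application', 'contact_email': 'hiring@acme.com', ...}
```
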
### Database Storage

Contact information is stored in the `job_descriptions` table with the following columns:

- `reply_url` (VARCHAR(512)): The complete reply/contact URL
- `contact_email` (VARCHAR(255)): Extracted email address
- `contact_phone` (VARCHAR(255)): Extracted phone number
- `contact_name` (VARCHAR(255)): Extracted contact person/department name

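For readers unfamiliar with how such columns map to the ORM, a SQLAlchemy sketch matching the column list above might look like the following. The real `JobDescription` model is defined in `web/db.py` and certainly contains more than this; the surrogate key and everything else here is illustrative only.

```python
# Hypothetical excerpt -- the real JobDescription model lives in web/db.py.
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class JobDescription(Base):
    __tablename__ = "job_descriptions"

    id = Column(Integer, primary_key=True)   # assumed surrogate key
    reply_url = Column(String(512))          # complete reply/contact URL
    contact_email = Column(String(255))      # extracted email address
    contact_phone = Column(String(255))      # extracted phone number
    contact_name = Column(String(255))       # extracted contact person/department
```
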
### Example

For a job listing with reply button `mailto:hiring@acme.com?subject=Job%20Application`:

```python
{
    "reply_url": "mailto:hiring@acme.com?subject=Job%20Application",
    "contact_email": "hiring@acme.com",
    "contact_phone": "N/A",
    "contact_name": "N/A"
}
```

This contact information is automatically extracted during job page scraping and persisted to the database for easy access and filtering.

## Negative Keyword Filtering

The scraper inspects each job's title, company, location, and description for configurable "negative" keywords. When a keyword matches, the scraped result indicates the match so downstream workflows can skip or flag the job.

### Global Configuration

Define keywords in `config/settings.json` under `scraper.negative_keywords`. Keywords are matched case-insensitively and should be supplied without surrounding whitespace:

```json
{
    "scraper": {
        "negative_keywords": ["scam", "mlm", "unpaid"]
    }
}
```

### Scrape Output

Each `scrape_job_page` result contains three additional fields:

- `is_negative_match`: `True` when any keyword matches
- `negative_keyword_match`: the keyword that triggered the match
- `negative_match_field`: which field (title, company, location, or description) contained the keyword

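The real check is implemented in `web/scraper.py`; a minimal sketch of the behaviour described above (case-insensitive matching, first match wins, reporting both the keyword and the field it appeared in) might look like this. The function name and the values returned when nothing matches are assumptions.

```python
# Hypothetical illustration of the negative-keyword check -- not the repository code.
def check_negative_keywords(job, negative_keywords):
    """Return the three negative-filtering fields described above."""
    for field in ("title", "company", "location", "description"):
        text = (job.get(field) or "").lower()
        for keyword in negative_keywords:
            if keyword.lower() in text:
                return {
                    "is_negative_match": True,
                    "negative_keyword_match": keyword,
                    "negative_match_field": field,
                }
    return {
        "is_negative_match": False,
        "negative_keyword_match": None,
        "negative_match_field": None,
    }


job = {"title": "Work from home - unpaid internship", "company": "Acme", "location": "", "description": ""}
print(check_negative_keywords(job, ["scam", "mlm", "unpaid"]))
# {'is_negative_match': True, 'negative_keyword_match': 'unpaid', 'negative_match_field': 'title'}
```
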
### Processing Behavior

- `process_job_url` stops when `is_negative_match` is `True`, logging the match and calling `remove_job` so stale results never remain in `job_listings`.
- `upsert_job_details` returns immediately for negative matches, ensuring `job_descriptions` never stores filtered listings.
- Regression coverage lives in `tests/test_scraper.py::TestScraperPipelineNegativeFiltering` and `tests/test_db_negative_filtering.py::test_upsert_job_details_skips_negative_match`.

Together, these checks ensure negative matches are dropped before any persistence and never shown in the UI.

### User-Specific Negative Keywords

In addition to the global negative keywords defined in `settings.json`, users can define their own personal negative keywords via the **Preferences** page (`/settings`).

- **Management**: Users can add new negative keywords and remove existing ones.
- **Filtering**: Jobs matching any of the user's negative keywords are filtered out of the job listings view (`/` and `/jobs`).
- **Validation**: The UI prevents adding duplicate keywords.
- **Storage**: User-specific negative keywords are stored in the database (`negative_keywords` and `user_negative_keywords` tables).

## Email Notifications

Optional job-alert emails are generated whenever the scraper discovers new listings.

### Configuration

Edit `config/settings.json` under the `email` section:

```json
{
    "email": {
        "enabled": true,
        "from_address": "jobs@example.com",
        "recipients": ["alerts@example.com"],
        "smtp": {
            "host": "smtp.example.com",
            "port": 587,
            "username": "smtp-user",
            "password": "secret",
            "use_tls": true,
            "use_ssl": false,
            "timeout": 30
        }
    }
}
```

- Leave `enabled` set to `false` for local development or when credentials are unavailable.
- Provide at least one recipient; otherwise alerts are skipped with a log message.
- Omit real credentials from source control; inject them via environment variables or a secrets manager in production.

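One way to keep the password out of `settings.json` is to read it from the environment at startup. This is an illustrative pattern rather than a mechanism the repository necessarily provides, and the `SMTP_PASSWORD` variable name is an assumption:

```python
# Illustrative pattern for injecting SMTP credentials from the environment.
import json
import os

with open("config/settings.json") as fh:
    settings = json.load(fh)

# Overwrite the placeholder password with a value supplied at deploy time.
smtp_password = os.environ.get("SMTP_PASSWORD")  # hypothetical variable name
if smtp_password:
    settings["email"]["smtp"]["password"] = smtp_password
```
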
### How Alerts Are Sent

- After `fetch_listings()` completes, the scraper gathers the new listings and, when configured, renders a plaintext digest via `web.email_templates.render_job_alert_email`.
- Delivery is handled by `web.email_service.send_email`, which supports TLS/SSL SMTP connections and gracefully skips sending when disabled.
- Success or failure is reported in the scraper log output (`Job alert email sent.` or the reason for skipping).

### Managing Recipients

- Admin users can visit `/admin/emails` to add or deactivate subscription addresses through the web UI.
- Deactivated rows remain in the table so they can be reactivated later; the scraper only mails active recipients.
- After logging in as an admin user, the navigation bar exposes an **Email Alerts** link to the management screen.

### Customising Templates

- Use the **Email Templates** admin page (`/admin/email-templates`) to create, edit, preview, or delete alert templates.
- Templates support placeholder tokens such as `{count_label}`, `{scope}`, `{timestamp}`, `{jobs_section}`, and `{jobs_message}`; the UI lists all available tokens.
- Preview renders the selected template with sample data so changes can be reviewed before saving.

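The real rendering logic lives in `web/email_templates.py`; the snippet below only illustrates how placeholder tokens of this style can be filled in with `str.format`, using made-up sample values similar to what a preview might show. The template text and sample values are not taken from the repository.

```python
# Illustration of placeholder substitution -- template text and sample values are made up.
template_body = (
    "{count_label} found for {scope} at {timestamp}\n\n"
    "{jobs_section}\n"
    "{jobs_message}"
)

rendered = template_body.format(
    count_label="2 new jobs",
    scope="sfbay / python",
    timestamp="2024-01-01 12:00",
    jobs_section="- Senior Python Developer\n- Data Engineer",
    jobs_message="Visit the dashboard for full details.",
)
print(rendered)
```
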
### Tests

- `tests/test_email_templates.py` verifies the rendered subject/body for both populated and empty alerts.
- `tests/test_email_service.py` covers SMTP configuration, disabled mode, and login/send flows using fakes.
- `tests/test_admin_email.py` exercises the admin UI for listing, subscribing, and unsubscribing recipients.
- `tests/test_admin_email_templates.py` verifies CRUD operations and previews for template management.
- `tests/test_scraper.py::TestScraperEmailNotifications` ensures the scraping pipeline invokes the alert sender when new jobs are found.

## Docker Deployment

Please see [README-Docker.md](README-Docker.md) for instructions on deploying the application using Docker.