feat: Implement email sending utilities and templates for job notifications
Some checks failed
CI/CD Pipeline / test (push) Failing after 4m9s
- Added email_service.py for sending emails with SMTP configuration.
- Introduced email_templates.py to render job alert email subjects and bodies.
- Enhanced scraper.py to extract contact information from job listings.
- Updated settings.js to handle negative keyword input validation.
- Created email.html and email_templates.html for managing email subscriptions and templates in the admin interface.
- Modified base.html to include links for email alerts and templates.
- Expanded user settings.html to allow management of negative keywords.
- Updated utils.py to include functions for retrieving negative keywords and email settings.
- Enhanced job filtering logic to exclude jobs containing negative keywords.

README.md
@@ -9,11 +9,32 @@ job scraper

- Users can search for job listings by keywords and region
- Selection of job listings based on user preferences

## Requirements

- Database (MySQL/MariaDB)
- Python 3.x
- Required Python packages (see requirements.txt)

## Architecture Overview

The application is built as a modular Flask‑based service with clear separation of concerns:

| Layer | Module | Responsibility |
| --- | --- | --- |
| **Web UI** | `web/app.py` | Flask application that serves HTML pages, REST endpoints, and admin interfaces (users, taxonomy, health, email management). |
| **Orchestrator** | `web/craigslist.py` | Coordinates the scraping workflow: schedules runs, fetches listings, updates the DB, and triggers email alerts. |
| **Scraper** | `web/scraper.py` | Contains the low‑level HTML parsing logic (`scrape_job_data`, `scrape_job_page`, `extract_contact_info`). |
| **Persistence** | `web/db.py` | SQLAlchemy ORM models (`User`, `JobListing`, `JobDescription`, `UserInteraction`, `Region`, `Keyword`, `EmailSubscription`, **`EmailTemplate`**) and helper functions for upserts, queries, and subscription management. |
| **Email Rendering** | `web/email_templates.py` | Renders job‑alert emails using a pluggable template system. Supports default placeholders (`{count_label}`, `{scope}`, `{timestamp}`, `{jobs_section}`, `{jobs_message}`) and custom admin‑defined templates. |
| **Email Delivery** | `web/email_service.py` | Sends rendered messages via SMTP, handling TLS/SSL, authentication, and graceful disabling. |
| **Configuration** | `config/settings.json` | Centralised JSON config for database, HTTP, scraper options, negative keywords, and email settings. |
| **Static Assets & Templates** | `web/static/`, `web/templates/` | Front‑end resources (JS, CSS) and Jinja2 templates for the public UI and admin pages (including the new **Email Templates** management UI). |
| **Scheduler** | `schedule` (used in `web/craigslist.py`) | Runs the scraper automatically at configurable intervals (default hourly). |
| **Testing** | `tests/` | Pytest suite covering scheduler, scraper, DB helpers, email service, and the new admin UI for email subscriptions and templates. |

**Key architectural notes**

- **Email Subscriptions** are stored in the `email_subscriptions` table and managed via `/admin/emails`.
- **Email Templates** are persisted in the new `email_templates` table, editable through `/admin/email-templates`, and used by the alert system.
- The orchestrator (`fetch_listings`) returns a detailed result dict (`discovered`, `new`, `by_search`) that drives UI metrics and health checks.
- Contact information (`reply_url`, `contact_email`, `contact_phone`, `contact_name`) extracted by the scraper is saved in `job_descriptions`.
- Negative keyword filtering is applied early in the pipeline to prevent unwanted listings from reaching the DB or email alerts.

This layered design makes it straightforward to extend the scraper to new sources, swap out the email backend, or add additional admin features without impacting other components.
## Installation

@@ -40,6 +61,187 @@ The application includes an automated scheduler that runs the job scraping proce

To modify the scheduling interval, edit the `start_scheduler()` function in `web/craigslist.py`.
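
For orientation, here is a minimal sketch of what a `schedule`-based `start_scheduler()` could look like. It is an assumption, not the project's actual code: the real function in `web/craigslist.py` may be structured differently, and `fetch_listings()` is assumed to be defined in the same module.

```python
# Hypothetical sketch only; the actual start_scheduler() in web/craigslist.py
# may differ. Assumes the `schedule` package and a fetch_listings() callable
# defined elsewhere in the same module.
import time

import schedule


def start_scheduler(interval_hours: int = 1) -> None:
    """Run fetch_listings() every `interval_hours` hours (default: hourly)."""
    # Change this line (e.g. .minutes, .days, or a different interval) to adjust the cadence.
    schedule.every(interval_hours).hours.do(fetch_listings)
    while True:
        schedule.run_pending()  # execute any jobs whose scheduled time has arrived
        time.sleep(60)          # poll once per minute
```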

## Job Scraping Output

The `fetch_listings()` function in `web/craigslist.py` provides detailed metrics about each scraping operation. It returns a dictionary containing:

- **discovered**: Total number of unique job URLs discovered across all region/keyword combinations
- **new**: Total number of newly added jobs (jobs not previously in the database)
- **by_search**: List of dictionaries, each containing:
  - **region**: The region name for this search
  - **keyword**: The keyword used for this search
  - **count**: Number of jobs fetched for this specific region/keyword combination

### Example Output

```python
{
    "discovered": 150,
    "new": 42,
    "by_search": [
        {"region": "sfbay", "keyword": "python", "count": 25},
        {"region": "sfbay", "keyword": "java", "count": 18},
        {"region": "losangeles", "keyword": "python", "count": 45},
        {"region": "losangeles", "keyword": "java", "count": 62}
    ]
}
```

This per-search breakdown allows for better monitoring and debugging of the scraping process, enabling identification of searches that may be failing or returning fewer results than expected (see the sketch below).
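
As a small illustration (not part of the codebase), a caller could scan `by_search` for suspiciously empty searches; the warning threshold and print-based logging here are assumptions.

```python
# Illustrative only: flag region/keyword searches that returned nothing.
from web.craigslist import fetch_listings

result = fetch_listings()
for search in result["by_search"]:
    if search["count"] == 0:
        print(f"WARNING: no listings for '{search['keyword']}' in {search['region']}")
print(f"{result['new']} new out of {result['discovered']} discovered listings")
```
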
## Contact Information Extraction

The scraper now automatically extracts contact information from job listing pages:

### Extracted Fields

When scraping individual job listings, the following contact information is extracted and stored:

- **contact_email**: Email address extracted from reply button or contact form links
- **contact_phone**: Phone number extracted from tel links or contact parameters
- **contact_name**: Contact person or department name if available
- **reply_url**: The full reply/contact URL from the job listing

### How Contact Information is Extracted

The `extract_contact_info()` function parses various types of reply URLs (a simplified sketch follows the list):

1. **Mailto Links**: `mailto:jobs@company.com?subject=...`
   - Extracts the email address directly
2. **Phone Links**: `tel:+1234567890`
   - Extracts the phone number
3. **URL Parameters**: `https://apply.company.com?email=hr@company.com&phone=555-1234&name=HR%20Team`
   - Searches for common parameter names: `email`, `phone`, `contact_name`, etc.
4. **Graceful Fallback**: If contact information cannot be extracted, the fields are set to `"N/A"`
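
The sketch below illustrates that parsing strategy using only the standard library; it is not the project's implementation, and the actual `extract_contact_info()` in `web/scraper.py` may handle more cases.

```python
# Simplified sketch of the mailto:/tel:/query-parameter strategy described above.
# Parameter names and fallbacks are assumptions, not the project's exact logic.
from urllib.parse import urlparse, parse_qs


def extract_contact_info_sketch(reply_url: str) -> dict:
    info = {
        "reply_url": reply_url or "N/A",
        "contact_email": "N/A",
        "contact_phone": "N/A",
        "contact_name": "N/A",
    }
    if not reply_url:
        return info
    parsed = urlparse(reply_url)
    if parsed.scheme == "mailto":
        info["contact_email"] = parsed.path      # mailto:jobs@company.com?subject=...
    elif parsed.scheme == "tel":
        info["contact_phone"] = parsed.path      # tel:+1234567890
    else:
        params = parse_qs(parsed.query)          # e.g. ?email=...&phone=...&name=...
        if "email" in params:
            info["contact_email"] = params["email"][0]
        if "phone" in params:
            info["contact_phone"] = params["phone"][0]
        for key in ("contact_name", "name"):
            if key in params:
                info["contact_name"] = params[key][0]
                break
    return info
```
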
### Database Storage

Contact information is stored in the `job_descriptions` table with the following columns:

- `reply_url` (VARCHAR(512)): The complete reply/contact URL
- `contact_email` (VARCHAR(255)): Extracted email address
- `contact_phone` (VARCHAR(255)): Extracted phone number
- `contact_name` (VARCHAR(255)): Extracted contact person/department name
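
For reference, a rough sketch of how these columns might be declared on the `JobDescription` model in `web/db.py`; the real model contains additional fields, and the primary key shown here is an assumption.

```python
# Rough sketch only: column names/types follow the list above; the base class
# setup, primary key, and any other fields are assumptions.
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class JobDescription(Base):
    __tablename__ = "job_descriptions"

    id = Column(Integer, primary_key=True)    # assumed surrogate key
    reply_url = Column(String(512))           # complete reply/contact URL
    contact_email = Column(String(255))       # extracted email address
    contact_phone = Column(String(255))       # extracted phone number
    contact_name = Column(String(255))        # extracted contact person/department
```
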
### Example

For a job listing with reply button `mailto:hiring@acme.com?subject=Job%20Application`:

```python
{
    "reply_url": "mailto:hiring@acme.com?subject=Job%20Application",
    "contact_email": "hiring@acme.com",
    "contact_phone": "N/A",
    "contact_name": "N/A"
}
```

This contact information is automatically extracted during job page scraping and persisted to the database for easy access and filtering.
## Negative Keyword Filtering

The scraper inspects each job’s title, company, location, and description for configurable “negative” keywords. When a keyword matches, the scraped result indicates the match so downstream workflows can skip or flag the job.

### Configuration

Define keywords in `config/settings.json` under `scraper.negative_keywords`. Keywords are matched case-insensitively and should be supplied without surrounding whitespace:

```json
{
  "scraper": {
    "negative_keywords": ["scam", "mlm", "unpaid"]
  }
}
```
### Scrape Output

Each `scrape_job_page` result contains three new fields (see the sketch after this list):

- `is_negative_match`: `True` when any keyword matches
- `negative_keyword_match`: the keyword that triggered the match
- `negative_match_field`: which field (title, company, location, description) contained the keyword
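
A minimal sketch of the kind of case-insensitive check that could produce these fields; the real logic in `web/scraper.py` may be structured differently.

```python
# Illustrative sketch, not the project's implementation.
def check_negative_keywords(job: dict, negative_keywords: list) -> dict:
    """Return the three fields described above for a scraped job dict."""
    for keyword in negative_keywords:
        for field in ("title", "company", "location", "description"):
            if keyword.lower() in str(job.get(field, "")).lower():
                return {
                    "is_negative_match": True,
                    "negative_keyword_match": keyword,
                    "negative_match_field": field,
                }
    return {
        "is_negative_match": False,
        "negative_keyword_match": None,
        "negative_match_field": None,
    }
```
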
### Processing Behavior

- `process_job_url` stops when `is_negative_match` is `True`, yielding a log message and calling `remove_job` so stale results never remain in `job_listings`.
- `upsert_job_details` now returns immediately for negative matches, ensuring `job_descriptions` never stores filtered listings.
- Regression coverage lives in `tests/test_scraper.py::TestScraperPipelineNegativeFiltering` and `tests/test_db_negative_filtering.py::test_upsert_job_details_skips_negative_match`.

Together, these checks mean negative matches are dropped before any persistence and never shown in the UI.
### User-Specific Negative Keywords

In addition to the global negative keywords defined in `settings.json`, users can define their own personal negative keywords via the **Preferences** page (`/settings`).

- **Management**: Users can add new negative keywords and remove existing ones.
- **Filtering**: Jobs matching any of the user's negative keywords are filtered out from the job listings view (`/` and `/jobs`), as sketched after this list.
- **Validation**: The UI prevents adding duplicate keywords.
- **Storage**: User-specific negative keywords are stored in the database (`negative_keywords` and `user_negative_keywords` tables).
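
As a rough illustration of that filtering step (the actual view may push this into the SQL query instead), a job list could be trimmed like this; the helper name and in-memory approach are assumptions.

```python
# Hypothetical helper; not the project's implementation.
def filter_jobs_for_user(jobs: list, user_keywords: list) -> list:
    lowered = [kw.lower() for kw in user_keywords]

    def is_clean(job: dict) -> bool:
        text = " ".join(
            str(job.get(field, ""))
            for field in ("title", "company", "location", "description")
        ).lower()
        return not any(kw in text for kw in lowered)

    return [job for job in jobs if is_clean(job)]
```
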
## Email Notifications

Optional job-alert emails are generated whenever the scraper discovers new listings.

### Configuration

Edit `config/settings.json` under the `email` section:

```json
{
  "email": {
    "enabled": true,
    "from_address": "jobs@example.com",
    "recipients": ["alerts@example.com"],
    "smtp": {
      "host": "smtp.example.com",
      "port": 587,
      "username": "smtp-user",
      "password": "secret",
      "use_tls": true,
      "use_ssl": false,
      "timeout": 30
    }
  }
}
```

- Leave `enabled` set to `false` for local development or when credentials are unavailable.
- Provide at least one recipient; otherwise alerts are skipped with a log message.
- Omit real credentials from source control; inject them via environment variables or a secrets manager in production (see the sketch below).
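
One possible way to keep real credentials out of `settings.json` is to overlay environment variables at startup. The variable names below are invented for this example and are not defined by the project.

```python
# Hypothetical overlay; the JOBSCRAPER_SMTP_* variable names are made up here.
import os


def apply_smtp_env_overrides(settings: dict) -> dict:
    smtp = settings.setdefault("email", {}).setdefault("smtp", {})
    for key, env_var in (("username", "JOBSCRAPER_SMTP_USERNAME"),
                         ("password", "JOBSCRAPER_SMTP_PASSWORD")):
        value = os.environ.get(env_var)
        if value:
            smtp[key] = value  # environment wins over the checked-in config
    return settings
```
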
### How Alerts Are Sent

- After `fetch_listings()` completes, the scraper gathers new listings and, when configured, renders a plaintext digest via `web.email_templates.render_job_alert_email` (see the sketch after this list).
- Delivery is handled by `web.email_service.send_email`, which supports TLS/SSL SMTP connections and gracefully skips when disabled.
- Success or failure is streamed in the scraper log output (`Job alert email sent.` or the reason for skipping).
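
A condensed sketch of that flow is shown below. The exact signatures of `render_job_alert_email()` and `send_email()` are assumptions here; consult `web/email_templates.py` and `web/email_service.py` for the real interfaces.

```python
# Condensed sketch of the alert flow; function signatures are assumed.
from web.email_templates import render_job_alert_email
from web.email_service import send_email


def send_alert_for_new_jobs(new_jobs, email_settings):
    if not email_settings.get("enabled") or not new_jobs:
        return  # disabled or nothing new: skip quietly, as the service does
    subject, body = render_job_alert_email(new_jobs)                  # assumed return shape
    send_email(subject=subject, body=body, settings=email_settings)  # assumed kwargs
```
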
### Managing Recipients

- Admin users can visit `/admin/emails` to add or deactivate subscription addresses through the web UI.
- Deactivated rows remain in the table so they can be reactivated later; the scraper only mails active recipients.
- The navigation bar exposes an **Email Alerts** link to the management screen after logging in as an admin user.
### Customising Templates

- Use the **Email Templates** admin page (`/admin/email-templates`) to create, edit, preview, or delete alert templates.
- Templates support placeholder tokens such as `{count_label}`, `{scope}`, `{timestamp}`, `{jobs_section}`, and `{jobs_message}`; the UI lists all available tokens (see the sketch after this list).
- Preview renders the selected template with sample data so changes can be reviewed before saving.
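
As a quick illustration of the token syntax, a custom template body can be rendered with Python's `str.format()`. The sample values below are made up, and the project's renderer in `web/email_templates.py` may work differently.

```python
# Sample rendering of a custom template using the documented tokens.
template_body = (
    "{count_label} for {scope} as of {timestamp}\n\n"
    "{jobs_section}\n\n"
    "{jobs_message}\n"
)

sample_values = {
    "count_label": "3 new jobs",
    "scope": "all searches",
    "timestamp": "2024-01-01 12:00",
    "jobs_section": "- Backend Engineer (sfbay)\n- Data Analyst (losangeles)",
    "jobs_message": "Open the dashboard for full descriptions.",
}

print(template_body.format(**sample_values))
```
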
### Tests

- `tests/test_email_templates.py` verifies the rendered subject/body for both populated and empty alerts.
- `tests/test_email_service.py` covers SMTP configuration, disabled mode, and login/send flows using fakes.
- `tests/test_admin_email.py` exercises the admin UI for listing, subscribing, and unsubscribing recipients.
- `tests/test_admin_email_templates.py` verifies CRUD operations and previews for template management.
- `tests/test_scraper.py::TestScraperEmailNotifications` ensures the scraping pipeline invokes the alert sender when new jobs are found.
## Docker Deployment

Please see [README-Docker.md](README-Docker.md) for instructions on deploying the application using Docker.