feat: Implement email sending utilities and templates for job notifications
Some checks failed
CI/CD Pipeline / test (push) Failing after 4m9s
- Added email_service.py for sending emails with SMTP configuration.
- Introduced email_templates.py to render job alert email subjects and bodies.
- Enhanced scraper.py to extract contact information from job listings.
- Updated settings.js to handle negative keyword input validation.
- Created email.html and email_templates.html for managing email subscriptions and templates in the admin interface.
- Modified base.html to include links for email alerts and templates.
- Expanded user settings.html to allow management of negative keywords.
- Updated utils.py to include functions for retrieving negative keywords and email settings.
- Enhanced job filtering logic to exclude jobs containing negative keywords.

README.md
@@ -9,11 +9,32 @@ job scraper

- Users can search for job listings by keywords and region
- Selection of job listings based on user preferences

## Requirements

- Database (MySQL/MariaDB)
- Python 3.x
- Required Python packages (see requirements.txt)

## Architecture Overview

The application is built as a modular Flask‑based service with clear separation of concerns:

| Layer | Module | Responsibility |
| --- | --- | --- |
| **Web UI** | `web/app.py` | Flask application that serves HTML pages, REST endpoints, and admin interfaces (users, taxonomy, health, email management). |
| **Orchestrator** | `web/craigslist.py` | Coordinates the scraping workflow: schedules runs, fetches listings, updates the DB, and triggers email alerts. |
| **Scraper** | `web/scraper.py` | Contains the low‑level HTML parsing logic (`scrape_job_data`, `scrape_job_page`, `extract_contact_info`). |
| **Persistence** | `web/db.py` | SQLAlchemy ORM models (`User`, `JobListing`, `JobDescription`, `UserInteraction`, `Region`, `Keyword`, `EmailSubscription`, **`EmailTemplate`**) and helper functions for upserts, queries, and subscription management. |
| **Email Rendering** | `web/email_templates.py` | Renders job‑alert emails using a pluggable template system. Supports default placeholders (`{count_label}`, `{scope}`, `{timestamp}`, `{jobs_section}`, `{jobs_message}`) and custom admin‑defined templates. |
| **Email Delivery** | `web/email_service.py` | Sends rendered messages via SMTP, handling TLS/SSL, authentication, and graceful disabling. |
| **Configuration** | `config/settings.json` | Centralised JSON config for database, HTTP, scraper options, negative keywords, and email settings. |
| **Static Assets & Templates** | `web/static/`, `web/templates/` | Front‑end resources (JS, CSS) and Jinja2 templates for the public UI and admin pages (including the new **Email Templates** management UI). |
| **Scheduler** | `schedule` (used in `web/craigslist.py`) | Runs the scraper automatically at configurable intervals (default hourly). |
| **Testing** | `tests/` | Pytest suite covering scheduler, scraper, DB helpers, email service, and the new admin UI for email subscriptions and templates. |

**Key architectural notes**

- **Email Subscriptions** are stored in the `email_subscriptions` table and managed via `/admin/emails`.
- **Email Templates** are persisted in the new `email_templates` table, editable through `/admin/email-templates`, and used by the alert system.
- The orchestrator (`fetch_listings`) returns a detailed result dict (`discovered`, `new`, `by_search`) that drives UI metrics and health checks.
- Contact information (`reply_url`, `contact_email`, `contact_phone`, `contact_name`) extracted by the scraper is saved in `job_descriptions`.
- Negative keyword filtering is applied early in the pipeline to prevent unwanted listings from reaching the DB or email alerts.

This layered design makes it straightforward to extend the scraper to new sources, swap out the email backend, or add additional admin features without impacting other components.
## Installation

@@ -40,6 +61,187 @@ The application includes an automated scheduler that runs the job scraping proce

To modify the scheduling interval, edit the `start_scheduler()` function in `web/craigslist.py`.
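
For orientation, here is a minimal sketch of what a `schedule`-based `start_scheduler()` could look like. It is an assumption, not the project's actual code: the real function in `web/craigslist.py` may be structured differently, and `fetch_listings()` is assumed to be defined in the same module.

```python
# Hypothetical sketch only; the actual start_scheduler() in web/craigslist.py
# may differ. Assumes the `schedule` package and a fetch_listings() callable
# defined elsewhere in the same module.
import time

import schedule


def start_scheduler(interval_hours: int = 1) -> None:
    """Run fetch_listings() every `interval_hours` hours (default: hourly)."""
    # Change this line (e.g. .minutes, .days, or a different interval) to adjust the cadence.
    schedule.every(interval_hours).hours.do(fetch_listings)
    while True:
        schedule.run_pending()  # execute any jobs whose scheduled time has arrived
        time.sleep(60)          # poll once per minute
```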

## Job Scraping Output

The `fetch_listings()` function in `web/craigslist.py` provides detailed metrics about each scraping operation. It returns a dictionary containing:

- **discovered**: Total number of unique job URLs discovered across all region/keyword combinations
- **new**: Total number of newly added jobs (jobs not previously in the database)
- **by_search**: List of dictionaries, each containing:
  - **region**: The region name for this search
  - **keyword**: The keyword used for this search
  - **count**: Number of jobs fetched for this specific region/keyword combination

### Example Output

```python
{
    "discovered": 150,
    "new": 42,
    "by_search": [
        {"region": "sfbay", "keyword": "python", "count": 25},
        {"region": "sfbay", "keyword": "java", "count": 18},
        {"region": "losangeles", "keyword": "python", "count": 45},
        {"region": "losangeles", "keyword": "java", "count": 62}
    ]
}
```

This per-search breakdown allows for better monitoring and debugging of the scraping process, enabling identification of searches that may be failing or returning fewer results than expected (see the sketch below).
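
As a small illustration (not part of the codebase), a caller could scan `by_search` for suspiciously empty searches; the warning threshold and print-based logging here are assumptions.

```python
# Illustrative only: flag region/keyword searches that returned nothing.
from web.craigslist import fetch_listings

result = fetch_listings()
for search in result["by_search"]:
    if search["count"] == 0:
        print(f"WARNING: no listings for '{search['keyword']}' in {search['region']}")
print(f"{result['new']} new out of {result['discovered']} discovered listings")
```
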
## Contact Information Extraction

The scraper now automatically extracts contact information from job listing pages:

### Extracted Fields

When scraping individual job listings, the following contact information is extracted and stored:

- **contact_email**: Email address extracted from reply button or contact form links
- **contact_phone**: Phone number extracted from tel links or contact parameters
- **contact_name**: Contact person or department name if available
- **reply_url**: The full reply/contact URL from the job listing

### How Contact Information is Extracted

The `extract_contact_info()` function parses various types of reply URLs (a simplified sketch follows the list):

1. **Mailto Links**: `mailto:jobs@company.com?subject=...`
   - Extracts the email address directly
2. **Phone Links**: `tel:+1234567890`
   - Extracts the phone number
3. **URL Parameters**: `https://apply.company.com?email=hr@company.com&phone=555-1234&name=HR%20Team`
   - Searches for common parameter names: `email`, `phone`, `contact_name`, etc.
4. **Graceful Fallback**: If contact information cannot be extracted, the fields are set to `"N/A"`
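
The sketch below illustrates that parsing strategy using only the standard library; it is not the project's implementation, and the actual `extract_contact_info()` in `web/scraper.py` may handle more cases.

```python
# Simplified sketch of the mailto:/tel:/query-parameter strategy described above.
# Parameter names and fallbacks are assumptions, not the project's exact logic.
from urllib.parse import urlparse, parse_qs


def extract_contact_info_sketch(reply_url: str) -> dict:
    info = {
        "reply_url": reply_url or "N/A",
        "contact_email": "N/A",
        "contact_phone": "N/A",
        "contact_name": "N/A",
    }
    if not reply_url:
        return info
    parsed = urlparse(reply_url)
    if parsed.scheme == "mailto":
        info["contact_email"] = parsed.path      # mailto:jobs@company.com?subject=...
    elif parsed.scheme == "tel":
        info["contact_phone"] = parsed.path      # tel:+1234567890
    else:
        params = parse_qs(parsed.query)          # e.g. ?email=...&phone=...&name=...
        if "email" in params:
            info["contact_email"] = params["email"][0]
        if "phone" in params:
            info["contact_phone"] = params["phone"][0]
        for key in ("contact_name", "name"):
            if key in params:
                info["contact_name"] = params[key][0]
                break
    return info
```
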
### Database Storage

Contact information is stored in the `job_descriptions` table with the following columns:

- `reply_url` (VARCHAR(512)): The complete reply/contact URL
- `contact_email` (VARCHAR(255)): Extracted email address
- `contact_phone` (VARCHAR(255)): Extracted phone number
- `contact_name` (VARCHAR(255)): Extracted contact person/department name
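
For reference, a rough sketch of how these columns might be declared on the `JobDescription` model in `web/db.py`; the real model contains additional fields, and the primary key shown here is an assumption.

```python
# Rough sketch only: column names/types follow the list above; the base class
# setup, primary key, and any other fields are assumptions.
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class JobDescription(Base):
    __tablename__ = "job_descriptions"

    id = Column(Integer, primary_key=True)    # assumed surrogate key
    reply_url = Column(String(512))           # complete reply/contact URL
    contact_email = Column(String(255))       # extracted email address
    contact_phone = Column(String(255))       # extracted phone number
    contact_name = Column(String(255))        # extracted contact person/department
```
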
### Example

For a job listing with reply button `mailto:hiring@acme.com?subject=Job%20Application`:

```python
{
    "reply_url": "mailto:hiring@acme.com?subject=Job%20Application",
    "contact_email": "hiring@acme.com",
    "contact_phone": "N/A",
    "contact_name": "N/A"
}
```

This contact information is automatically extracted during job page scraping and persisted to the database for easy access and filtering.
## Negative Keyword Filtering

The scraper inspects each job’s title, company, location, and description for configurable “negative” keywords. When a keyword matches, the scraped result indicates the match so downstream workflows can skip or flag the job.

### Configuration

Define keywords in `config/settings.json` under `scraper.negative_keywords`. Keywords are matched case-insensitively and should be supplied without surrounding whitespace:

```json
{
  "scraper": {
    "negative_keywords": ["scam", "mlm", "unpaid"]
  }
}
```
### Scrape Output

Each `scrape_job_page` result contains three new fields (see the sketch after this list):

- `is_negative_match`: `True` when any keyword matches
- `negative_keyword_match`: the keyword that triggered the match
- `negative_match_field`: which field (title, company, location, description) contained the keyword
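
A minimal sketch of the kind of case-insensitive check that could produce these fields; the real logic in `web/scraper.py` may be structured differently.

```python
# Illustrative sketch, not the project's implementation.
def check_negative_keywords(job: dict, negative_keywords: list) -> dict:
    """Return the three fields described above for a scraped job dict."""
    for keyword in negative_keywords:
        for field in ("title", "company", "location", "description"):
            if keyword.lower() in str(job.get(field, "")).lower():
                return {
                    "is_negative_match": True,
                    "negative_keyword_match": keyword,
                    "negative_match_field": field,
                }
    return {
        "is_negative_match": False,
        "negative_keyword_match": None,
        "negative_match_field": None,
    }
```
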
### Processing Behavior

- `process_job_url` stops when `is_negative_match` is `True`, yielding a log message and calling `remove_job` so stale results never remain in `job_listings`.
- `upsert_job_details` now returns immediately for negative matches, ensuring `job_descriptions` never stores filtered listings.
- Regression coverage lives in `tests/test_scraper.py::TestScraperPipelineNegativeFiltering` and `tests/test_db_negative_filtering.py::test_upsert_job_details_skips_negative_match`.

Together, these checks mean negative matches are dropped before any persistence and never shown in the UI.
### User-Specific Negative Keywords

In addition to the global negative keywords defined in `settings.json`, users can define their own personal negative keywords via the **Preferences** page (`/settings`).

- **Management**: Users can add new negative keywords and remove existing ones.
- **Filtering**: Jobs matching any of the user's negative keywords are filtered out from the job listings view (`/` and `/jobs`), as sketched after this list.
- **Validation**: The UI prevents adding duplicate keywords.
- **Storage**: User-specific negative keywords are stored in the database (`negative_keywords` and `user_negative_keywords` tables).
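
As a rough illustration of that filtering step (the actual view may push this into the SQL query instead), a job list could be trimmed like this; the helper name and in-memory approach are assumptions.

```python
# Hypothetical helper; not the project's implementation.
def filter_jobs_for_user(jobs: list, user_keywords: list) -> list:
    lowered = [kw.lower() for kw in user_keywords]

    def is_clean(job: dict) -> bool:
        text = " ".join(
            str(job.get(field, ""))
            for field in ("title", "company", "location", "description")
        ).lower()
        return not any(kw in text for kw in lowered)

    return [job for job in jobs if is_clean(job)]
```
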
## Email Notifications

Optional job-alert emails are generated whenever the scraper discovers new listings.

### Configuration

Edit `config/settings.json` under the `email` section:

```json
{
  "email": {
    "enabled": true,
    "from_address": "jobs@example.com",
    "recipients": ["alerts@example.com"],
    "smtp": {
      "host": "smtp.example.com",
      "port": 587,
      "username": "smtp-user",
      "password": "secret",
      "use_tls": true,
      "use_ssl": false,
      "timeout": 30
    }
  }
}
```

- Leave `enabled` set to `false` for local development or when credentials are unavailable.
- Provide at least one recipient; otherwise alerts are skipped with a log message.
- Omit real credentials from source control; inject them via environment variables or a secrets manager in production (see the sketch below).
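
One possible way to keep real credentials out of `settings.json` is to overlay environment variables at startup. The variable names below are invented for this example and are not defined by the project.

```python
# Hypothetical overlay; the JOBSCRAPER_SMTP_* variable names are made up here.
import os


def apply_smtp_env_overrides(settings: dict) -> dict:
    smtp = settings.setdefault("email", {}).setdefault("smtp", {})
    for key, env_var in (("username", "JOBSCRAPER_SMTP_USERNAME"),
                         ("password", "JOBSCRAPER_SMTP_PASSWORD")):
        value = os.environ.get(env_var)
        if value:
            smtp[key] = value  # environment wins over the checked-in config
    return settings
```
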
### How Alerts Are Sent

- After `fetch_listings()` completes, the scraper gathers new listings and, when configured, renders a plaintext digest via `web.email_templates.render_job_alert_email` (see the sketch after this list).
- Delivery is handled by `web.email_service.send_email`, which supports TLS/SSL SMTP connections and gracefully skips when disabled.
- Success or failure is streamed in the scraper log output (`Job alert email sent.` or the reason for skipping).
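
A condensed sketch of that flow is shown below. The exact signatures of `render_job_alert_email()` and `send_email()` are assumptions here; consult `web/email_templates.py` and `web/email_service.py` for the real interfaces.

```python
# Condensed sketch of the alert flow; function signatures are assumed.
from web.email_templates import render_job_alert_email
from web.email_service import send_email


def send_alert_for_new_jobs(new_jobs, email_settings):
    if not email_settings.get("enabled") or not new_jobs:
        return  # disabled or nothing new: skip quietly, as the service does
    subject, body = render_job_alert_email(new_jobs)                  # assumed return shape
    send_email(subject=subject, body=body, settings=email_settings)  # assumed kwargs
```
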
### Managing Recipients

- Admin users can visit `/admin/emails` to add or deactivate subscription addresses through the web UI.
- Deactivated rows remain in the table so they can be reactivated later; the scraper only mails active recipients.
- The navigation bar exposes an **Email Alerts** link to the management screen after logging in as an admin user.
### Customising Templates

- Use the **Email Templates** admin page (`/admin/email-templates`) to create, edit, preview, or delete alert templates.
- Templates support placeholder tokens such as `{count_label}`, `{scope}`, `{timestamp}`, `{jobs_section}`, and `{jobs_message}`; the UI lists all available tokens (see the sketch after this list).
- Preview renders the selected template with sample data so changes can be reviewed before saving.
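
As a quick illustration of the token syntax, a custom template body can be rendered with Python's `str.format()`. The sample values below are made up, and the project's renderer in `web/email_templates.py` may work differently.

```python
# Sample rendering of a custom template using the documented tokens.
template_body = (
    "{count_label} for {scope} as of {timestamp}\n\n"
    "{jobs_section}\n\n"
    "{jobs_message}\n"
)

sample_values = {
    "count_label": "3 new jobs",
    "scope": "all searches",
    "timestamp": "2024-01-01 12:00",
    "jobs_section": "- Backend Engineer (sfbay)\n- Data Analyst (losangeles)",
    "jobs_message": "Open the dashboard for full descriptions.",
}

print(template_body.format(**sample_values))
```
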
### Tests

- `tests/test_email_templates.py` verifies the rendered subject/body for both populated and empty alerts.
- `tests/test_email_service.py` covers SMTP configuration, disabled mode, and login/send flows using fakes.
- `tests/test_admin_email.py` exercises the admin UI for listing, subscribing, and unsubscribing recipients.
- `tests/test_admin_email_templates.py` verifies CRUD operations and previews for template management.
- `tests/test_scraper.py::TestScraperEmailNotifications` ensures the scraping pipeline invokes the alert sender when new jobs are found.
## Docker Deployment

Please see [README-Docker.md](README-Docker.md) for instructions on deploying the application using Docker.