job scraper
Features
- Scrapes job listings from external websites (currently Craigslist, by region)
- Saves job listings to a database
- Users can search for job listings by keywords and region
- Selection of job listings based on user preferences
Architecture Overview
The application is built as a modular Flask‑based service with clear separation of concerns:
| Layer | Module | Responsibility |
|---|---|---|
| Web UI | web/app.py | Flask application that serves HTML pages, REST endpoints, and admin interfaces (users, taxonomy, health, email management). |
| Orchestrator | web/craigslist.py | Coordinates the scraping workflow: schedules runs, fetches listings, updates the DB, and triggers email alerts. |
| Scraper | web/scraper.py | Contains the low-level HTML parsing logic (scrape_job_data, scrape_job_page, extract_contact_info). |
| Persistence | web/db.py | SQLAlchemy ORM models (User, JobListing, JobDescription, UserInteraction, Region, Keyword, EmailSubscription, EmailTemplate) and helper functions for upserts, queries, and subscription management. |
| Email Rendering | web/email_templates.py | Renders job-alert emails using a pluggable template system. Supports default placeholders ({count_label}, {scope}, {timestamp}, {jobs_section}, {jobs_message}) and custom admin-defined templates. |
| Email Delivery | web/email_service.py | Sends rendered messages via SMTP, handling TLS/SSL, authentication, and graceful disabling. |
| Configuration | config/settings.json | Centralised JSON config for database, HTTP, scraper options, negative keywords, and email settings. |
| Static Assets & Templates | web/static/, web/templates/ | Front-end resources (JS, CSS) and Jinja2 templates for the public UI and admin pages (including the new Email Templates management UI). |
| Scheduler | schedule (used in web/craigslist.py) | Runs the scraper automatically at configurable intervals (default hourly). |
| Testing | tests/ | Pytest suite covering scheduler, scraper, DB helpers, email service, and the new admin UI for email subscriptions and templates. |
Key architectural notes
- Email Subscriptions are stored in the email_subscriptions table and managed via /admin/emails.
- Email Templates are persisted in the new email_templates table, editable through /admin/email-templates, and used by the alert system.
- The orchestrator (fetch_listings) returns a detailed result dict (discovered, new, by_search) that drives UI metrics and health checks.
- Contact information (reply_url, contact_email, contact_phone, contact_name) extracted by the scraper is saved in job_descriptions.
- Negative keyword filtering is applied early in the pipeline to prevent unwanted listings from reaching the DB or email alerts.
This layered design makes it straightforward to extend the scraper to new sources, swap out the email backend, or add additional admin features without impacting other components.
Installation
- Clone the repository
- Create a virtual environment
- Install dependencies
- Set up environment variables
- Run the application
Scheduler Configuration
The application includes an automated scheduler that runs the job scraping process every hour. The scheduler is implemented in web/craigslist.py and includes:
- Automatic Scheduling: Scraping runs every hour automatically
- Failure Handling: Retry logic with exponential backoff (up to 3 attempts)
- Background Operation: Runs in a separate daemon thread
- Graceful Error Recovery: Continues running even if individual scraping attempts fail
Scheduler Features
- Retry Mechanism: Automatically retries failed scraping attempts
- Logging: Comprehensive logging of scheduler operations and failures
- Testing: Comprehensive test suite in tests/test_scheduler.py
To modify the scheduling interval, edit the start_scheduler() function in web/craigslist.py.
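Below is a minimal sketch of how this hourly loop might look inside web/craigslist.py (where fetch_listings() is defined), using the schedule library. The helper name run_scraper_with_retries and the exact backoff values are illustrative assumptions, not the project's exact implementation.

```python
# Hypothetical sketch of the hourly scheduler; names and backoff values are assumptions.
import logging
import threading
import time

import schedule

logger = logging.getLogger(__name__)

MAX_ATTEMPTS = 3  # up to 3 attempts with exponential backoff


def run_scraper_with_retries():
    """Run one scrape, retrying with exponential backoff on failure."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            result = fetch_listings()  # defined elsewhere in web/craigslist.py
            logger.info("Scrape finished: %s new of %s discovered",
                        result["new"], result["discovered"])
            return
        except Exception:
            logger.exception("Scrape attempt %s/%s failed", attempt, MAX_ATTEMPTS)
            time.sleep(2 ** attempt)  # exponential backoff: 2s, 4s, 8s
    logger.error("All scrape attempts failed; waiting for the next interval.")


def start_scheduler():
    """Register the hourly job and process it in a background daemon thread."""
    schedule.every(1).hours.do(run_scraper_with_retries)

    def loop():
        while True:
            schedule.run_pending()
            time.sleep(60)  # check for due jobs once a minute

    threading.Thread(target=loop, daemon=True).start()
```

Changing schedule.every(1).hours to, for example, schedule.every(30).minutes is the kind of edit the paragraph above refers to.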
Job Scraping Output
The fetch_listings() function in web/craigslist.py provides detailed metrics about each scraping operation. It returns a dictionary containing:
- discovered: Total number of unique job URLs discovered across all region/keyword combinations
- new: Total number of newly added jobs (jobs not previously in the database)
- by_search: List of dictionaries, each containing:
- region: The region name for this search
- keyword: The keyword used for this search
- count: Number of jobs fetched for this specific region/keyword combination
Example Output
{
"discovered": 150,
"new": 42,
"by_search": [
{"region": "sfbay", "keyword": "python", "count": 25},
{"region": "sfbay", "keyword": "java", "count": 18},
{"region": "losangeles", "keyword": "python", "count": 45},
{"region": "losangeles", "keyword": "java", "count": 62}
]
}
This per-search breakdown allows for better monitoring and debugging of the scraping process, enabling identification of searches that may be failing or returning fewer results than expected.
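As an illustration (not project code), a monitoring hook could consume this dictionary to surface empty searches; the helper name below is hypothetical.

```python
# Hypothetical consumer of the fetch_listings() result dictionary.
import logging

logger = logging.getLogger(__name__)


def report_scrape_metrics(result: dict) -> None:
    """Log overall totals and flag region/keyword searches that returned nothing."""
    logger.info("Discovered %s listings, %s new", result["discovered"], result["new"])
    for search in result["by_search"]:
        if search["count"] == 0:
            # An empty search may indicate a broken selector or an overly narrow keyword.
            logger.warning("No results for %s / %s", search["region"], search["keyword"])
```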
Contact Information Extraction
The scraper now automatically extracts contact information from job listing pages:
Extracted Fields
When scraping individual job listings, the following contact information is extracted and stored:
- contact_email: Email address extracted from reply button or contact form links
- contact_phone: Phone number extracted from tel links or contact parameters
- contact_name: Contact person or department name if available
- reply_url: The full reply/contact URL from the job listing
How Contact Information is Extracted
The extract_contact_info() function intelligently parses various types of reply URLs:
- Mailto Links (e.g. mailto:jobs@company.com?subject=...): extracts the email address directly
- Phone Links (e.g. tel:+1234567890): extracts the phone number
- URL Parameters (e.g. https://apply.company.com?email=hr@company.com&phone=555-1234&name=HR%20Team): searches for common parameter names such as email, phone, and contact_name
- Graceful Fallback: if contact information cannot be extracted, the fields are set to "N/A"
Database Storage
Contact information is stored in the job_descriptions table with the following columns:
- reply_url (VARCHAR(512)): The complete reply/contact URL
- contact_email (VARCHAR(255)): Extracted email address
- contact_phone (VARCHAR(255)): Extracted phone number
- contact_name (VARCHAR(255)): Extracted contact person/department name
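In SQLAlchemy terms, the relevant part of the JobDescription model in web/db.py would look roughly like the sketch below; surrounding columns and relationships are omitted, and the base class and primary key shown are assumptions.

```python
# Abbreviated sketch of the contact columns on the JobDescription model (details assumed).
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class JobDescription(Base):
    __tablename__ = "job_descriptions"

    id = Column(Integer, primary_key=True)   # other columns omitted for brevity
    reply_url = Column(String(512))          # complete reply/contact URL
    contact_email = Column(String(255))      # extracted email address
    contact_phone = Column(String(255))      # extracted phone number
    contact_name = Column(String(255))       # extracted contact person/department name
```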
Example
For a job listing with reply button mailto:hiring@acme.com?subject=Job%20Application:
{
"reply_url": "mailto:hiring@acme.com?subject=Job%20Application",
"contact_email": "hiring@acme.com",
"contact_phone": "N/A",
"contact_name": "N/A"
}
This contact information is automatically extracted during job page scraping and persisted to the database for easy access and filtering.
Negative Keyword Filtering
The scraper inspects each job’s title, company, location, and description for configurable “negative” keywords. When a keyword matches, the scraped result indicates the match so downstream workflows can skip or flag the job.
Keyword Configuration
Define keywords in config/settings.json under scraper.negative_keywords. Keywords are matched case-insensitively and should be supplied without surrounding whitespace:
{
"scraper": {
"negative_keywords": ["scam", "mlm", "unpaid"]
}
}
Scrape Output
Each scrape_job_page result contains three new fields:
- is_negative_match: True when any keyword matches
- negative_keyword_match: the keyword that triggered the match
- negative_match_field: which field (title, company, location, description) contained the keyword
Processing Behavior
- process_job_url stops when is_negative_match is True, yielding a log message and calling remove_job so stale results never remain in job_listings.
- upsert_job_details returns immediately for negative matches, ensuring job_descriptions never stores filtered listings.
- Regression coverage lives in tests/test_scraper.py::TestScraperPipelineNegativeFiltering and tests/test_db_negative_filtering.py::test_upsert_job_details_skips_negative_match.
Together, these checks mean negative matches are dropped before any persistence and never shown in the UI.
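A minimal sketch of one way such a case-insensitive check could produce the three fields described above; the function name and the substring-matching approach are illustrative assumptions rather than the project's exact code.

```python
# Minimal sketch of a negative-keyword check; the function name is illustrative.
def check_negative_keywords(job: dict, negative_keywords: list[str]) -> dict:
    """Return the is_negative_match / negative_keyword_match / negative_match_field trio."""
    for field in ("title", "company", "location", "description"):
        text = (job.get(field) or "").lower()
        for keyword in negative_keywords:
            if keyword.lower() in text:
                return {"is_negative_match": True,
                        "negative_keyword_match": keyword,
                        "negative_match_field": field}
    return {"is_negative_match": False,
            "negative_keyword_match": None,
            "negative_match_field": None}
```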
User-Specific Negative Keywords
In addition to the global negative keywords defined in settings.json, users can define their own personal negative keywords via the Preferences page (/settings).
- Management: Users can add new negative keywords and remove existing ones.
- Filtering: Jobs matching any of the user's negative keywords are filtered out from the job listings views (/ and /jobs), as sketched after this list.
- Validation: The UI prevents adding duplicate keywords.
- Storage: User-specific negative keywords are stored in the database (negative_keywords and user_negative_keywords tables).
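As a rough illustration of that view-level filtering, listings whose title or description contains any of a user's keywords could be dropped as shown below; the helper name, the fields inspected, and the dict-based job representation are assumptions rather than the project's actual query.

```python
# Rough illustration of user-specific filtering; names and fields are assumptions.
def filter_jobs_for_user(jobs: list[dict], user_negative_keywords: list[str]) -> list[dict]:
    """Drop listings containing any of the user's personal negative keywords."""
    keywords = [k.lower() for k in user_negative_keywords]
    visible = []
    for job in jobs:
        haystack = f"{job.get('title', '')} {job.get('description', '')}".lower()
        if not any(k in haystack for k in keywords):
            visible.append(job)
    return visible
```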
Email Notifications
Optional job-alert emails are generated whenever the scraper discovers new listings.
Configuration
Edit config/settings.json under the email section:
{
"email": {
"enabled": true,
"from_address": "jobs@example.com",
"recipients": ["alerts@example.com"],
"smtp": {
"host": "smtp.example.com",
"port": 587,
"username": "smtp-user",
"password": "secret",
"use_tls": true,
"use_ssl": false,
"timeout": 30
}
}
}
- Leave enabled set to false for local development or when credentials are unavailable.
- Provide at least one recipient; otherwise alerts are skipped with a log message.
- Omit real credentials from source control; inject them via environment variables or a secrets manager in production.
How Alerts Are Sent
- After fetch_listings() completes, the scraper gathers new listings and, when configured, renders a plaintext digest via web.email_templates.render_job_alert_email.
- Delivery is handled by web.email_service.send_email, which supports TLS/SSL SMTP connections and gracefully skips sending when email is disabled.
- Success or failure is reported in the scraper log output (Job alert email sent. or the reason for skipping).
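A condensed sketch of what such an SMTP send can look like with smtplib follows; the real web.email_service.send_email has its own signature and additionally handles the disabled state, missing recipients, and logging.

```python
# Condensed smtplib sketch; web.email_service.send_email's actual signature may differ.
import smtplib
from email.message import EmailMessage


def send_email(subject: str, body: str, cfg: dict) -> None:
    """Send a plaintext message using the shape of the settings.json 'email' section."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = cfg["from_address"]
    msg["To"] = ", ".join(cfg["recipients"])
    msg.set_content(body)

    smtp_cfg = cfg["smtp"]
    smtp_cls = smtplib.SMTP_SSL if smtp_cfg.get("use_ssl") else smtplib.SMTP
    with smtp_cls(smtp_cfg["host"], smtp_cfg["port"], timeout=smtp_cfg.get("timeout", 30)) as server:
        if smtp_cfg.get("use_tls") and not smtp_cfg.get("use_ssl"):
            server.starttls()  # upgrade the connection when TLS (not implicit SSL) is configured
        if smtp_cfg.get("username"):
            server.login(smtp_cfg["username"], smtp_cfg["password"])
        server.send_message(msg)
```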
Managing Recipients
- Admin users can visit /admin/emails to add or deactivate subscription addresses through the web UI.
- Deactivated rows remain in the table so they can be reactivated later; the scraper only mails active recipients.
- The navigation bar exposes an Email Alerts link to the management screen after logging in as an admin user.
Customising Templates
- Use the Email Templates admin page (/admin/email-templates) to create, edit, preview, or delete alert templates.
- Templates support placeholder tokens such as {count_label}, {scope}, {timestamp}, {jobs_section}, and {jobs_message}; the UI lists all available tokens.
- Preview renders the selected template with sample data so changes can be reviewed before saving.
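As a rough illustration of how these tokens are filled (the real render_job_alert_email builds its values from the scrape results; the template text and values below are made-up sample data), a template body can be rendered with Python's str.format:

```python
# Illustrative rendering of placeholder tokens; template text and values are sample data.
template_body = (
    "Found {count_label} for {scope} at {timestamp}.\n\n"
    "{jobs_section}"
)

rendered = template_body.format(
    count_label="3 new jobs",
    scope="sfbay / python",
    timestamp="2024-01-01 12:00 UTC",
    jobs_section="* Backend Engineer: https://example.org/job/123\n",
)
print(rendered)
```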
Tests
- tests/test_email_templates.py verifies the rendered subject/body for both populated and empty alerts.
- tests/test_email_service.py covers SMTP configuration, disabled mode, and login/send flows using fakes.
- tests/test_admin_email.py exercises the admin UI for listing, subscribing, and unsubscribing recipients.
- tests/test_admin_email_templates.py verifies CRUD operations and previews for template management.
- tests/test_scraper.py::TestScraperEmailNotifications ensures the scraping pipeline invokes the alert sender when new jobs are found.
Docker Deployment
Please see README-Docker.md for instructions on deploying the application using Docker.