
jobs

job scraper

Features

  • Scrapes job listings from websites (currently Craigslist, by region)
  • Saves job listings to a database
  • Users can search for job listings by keywords and region
  • Selection of job listings based on user preferences

Architecture Overview

The application is built as a modular Flask-based service with clear separation of concerns:

  • Web UI (web/app.py): Flask application that serves HTML pages, REST endpoints, and admin interfaces (users, taxonomy, health, email management).
  • Orchestrator (web/craigslist.py): Coordinates the scraping workflow: schedules runs, fetches listings, updates the DB, and triggers email alerts.
  • Scraper (web/scraper.py): Contains the low-level HTML parsing logic (scrape_job_data, scrape_job_page, extract_contact_info).
  • Persistence (web/db.py): SQLAlchemy ORM models (User, JobListing, JobDescription, UserInteraction, Region, Keyword, EmailSubscription, EmailTemplate) and helper functions for upserts, queries, and subscription management.
  • Email Rendering (web/email_templates.py): Renders job-alert emails using a pluggable template system. Supports default placeholders ({count_label}, {scope}, {timestamp}, {jobs_section}, {jobs_message}) and custom admin-defined templates.
  • Email Delivery (web/email_service.py): Sends rendered messages via SMTP, handling TLS/SSL, authentication, and graceful disabling.
  • Configuration (config/settings.json): Centralised JSON config for database, HTTP, scraper options, negative keywords, and email settings.
  • Static Assets & Templates (web/static/, web/templates/): Frontend resources (JS, CSS) and Jinja2 templates for the public UI and admin pages (including the new Email Templates management UI).
  • Scheduler (schedule, used in web/craigslist.py): Runs the scraper automatically at configurable intervals (default hourly).
  • Testing (tests/): Pytest suite covering the scheduler, scraper, DB helpers, email service, and the new admin UI for email subscriptions and templates.

Key architectural notes

  • Email Subscriptions are stored in the email_subscriptions table and managed via /admin/emails.
  • Email Templates are persisted in the new email_templates table, editable through /admin/email-templates, and used by the alert system.
  • The orchestrator (fetch_listings) returns a detailed result dict (discovered, new, by_search) that drives UI metrics and health checks.
  • Contact information (reply_url, contact_email, contact_phone, contact_name) extracted by the scraper is saved in job_descriptions.
  • Negative keyword filtering is applied early in the pipeline to prevent unwanted listings from reaching the DB or email alerts.

This layered design makes it straightforward to extend the scraper to new sources, swap out the email backend, or add additional admin features without impacting other components.
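
For orientation, the repository layout implied by the table above looks roughly like this (derived from the module paths listed; the actual tree may contain additional files):

jobs/
├── config/
│   └── settings.json
├── web/
│   ├── app.py
│   ├── craigslist.py
│   ├── scraper.py
│   ├── db.py
│   ├── email_templates.py
│   ├── email_service.py
│   ├── static/
│   └── templates/
└── tests/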

Installation

  1. Clone the repository
  2. Create a virtual environment
  3. Install dependencies
  4. Set up environment variables
  5. Run the application

Scheduler Configuration

The application includes an automated scheduler that runs the job scraping process every hour. The scheduler is implemented in web/craigslist.py and includes:

  • Automatic Scheduling: Scraping runs every hour automatically
  • Failure Handling: Retry logic with exponential backoff (up to 3 attempts)
  • Background Operation: Runs in a separate daemon thread
  • Graceful Error Recovery: Continues running even if individual scraping attempts fail

Scheduler Features

  • Retry Mechanism: Automatically retries failed scraping attempts
  • Logging: Comprehensive logging of scheduler operations and failures
  • Testing: Comprehensive test suite in tests/test_scheduler.py

To modify the scheduling interval, edit the start_scheduler() function in web/craigslist.py.
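
As a rough illustration of this setup, the following sketch uses the schedule library with an hourly job, a retry loop, and a daemon thread. It is a simplified approximation of the approach, not the actual code in web/craigslist.py; run_scrape_with_retries and the 2 ** attempt backoff values are illustrative.

import logging
import threading
import time

import schedule  # third-party package used by web/craigslist.py

from web.craigslist import fetch_listings  # returns the metrics dict described below

logger = logging.getLogger(__name__)

MAX_ATTEMPTS = 3  # matches the "up to 3 attempts" retry limit described above


def run_scrape_with_retries():
    """Run one scraping pass, retrying with exponential backoff on failure."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            result = fetch_listings()
            logger.info("Scrape finished: %s new of %s discovered",
                        result["new"], result["discovered"])
            return result
        except Exception:
            logger.exception("Scrape attempt %d/%d failed", attempt, MAX_ATTEMPTS)
            if attempt < MAX_ATTEMPTS:
                time.sleep(2 ** attempt)  # illustrative exponential backoff
    logger.error("All scrape attempts failed; waiting for the next scheduled run")


def start_scheduler(interval_hours: int = 1):
    """Schedule the scrape and run it in a background daemon thread."""
    schedule.every(interval_hours).hours.do(run_scrape_with_retries)

    def loop():
        while True:
            schedule.run_pending()
            time.sleep(60)  # poll the schedule once a minute

    threading.Thread(target=loop, daemon=True).start()

Calling start_scheduler() once at application start-up is enough; because the loop runs in a daemon thread, it exits automatically when the main process stops.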

Job Scraping Output

The fetch_listings() function in web/craigslist.py returns detailed metrics about each scraping operation as a dictionary containing:

  • discovered: Total number of unique job URLs discovered across all region/keyword combinations
  • new: Total number of newly added jobs (jobs not previously in the database)
  • by_search: List of dictionaries, each containing:
    • region: The region name for this search
    • keyword: The keyword used for this search
    • count: Number of jobs fetched for this specific region/keyword combination

Example Output

{
    "discovered": 150,
    "new": 42,
    "by_search": [
        {"region": "sfbay", "keyword": "python", "count": 25},
        {"region": "sfbay", "keyword": "java", "count": 18},
        {"region": "losangeles", "keyword": "python", "count": 45},
        {"region": "losangeles", "keyword": "java", "count": 62}
    ]
}

This per-search breakdown allows for better monitoring and debugging of the scraping process, enabling identification of searches that may be failing or returning fewer results than expected.

Contact Information Extraction

The scraper now automatically extracts contact information from job listing pages:

Extracted Fields

When scraping individual job listings, the following contact information is extracted and stored:

  • contact_email: Email address extracted from reply button or contact form links
  • contact_phone: Phone number extracted from tel links or contact parameters
  • contact_name: Contact person or department name if available
  • reply_url: The full reply/contact URL from the job listing

How Contact Information is Extracted

The extract_contact_info() function intelligently parses various types of reply URLs:

  1. Mailto Links: mailto:jobs@company.com?subject=...
     • Extracts the email address directly
  2. Phone Links: tel:+1234567890
     • Extracts the phone number
  3. URL Parameters: https://apply.company.com?email=hr@company.com&phone=555-1234&name=HR%20Team
     • Searches for common parameter names: email, phone, contact_name, etc.
  4. Graceful Fallback: If contact information cannot be extracted, the fields are set to "N/A"
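
A minimal sketch of that parsing logic, assuming only the rules listed above (it is not the implementation in web/scraper.py; the handling of alternate parameter names is illustrative):

from urllib.parse import parse_qs, urlparse


def extract_contact_info(reply_url: str) -> dict:
    """Derive contact fields from a reply URL, falling back to "N/A"."""
    info = {
        "reply_url": reply_url or "N/A",
        "contact_email": "N/A",
        "contact_phone": "N/A",
        "contact_name": "N/A",
    }
    if not reply_url:
        return info

    parsed = urlparse(reply_url)

    if parsed.scheme == "mailto":
        # mailto:jobs@company.com?subject=... -> keep the address before "?"
        info["contact_email"] = parsed.path.split("?")[0]
    elif parsed.scheme == "tel":
        # tel:+1234567890 -> keep the number
        info["contact_phone"] = parsed.path
    else:
        # Ordinary http(s) URLs: look for common query parameters
        params = parse_qs(parsed.query)
        if "email" in params:
            info["contact_email"] = params["email"][0]
        if "phone" in params:
            info["contact_phone"] = params["phone"][0]
        if "contact_name" in params or "name" in params:
            info["contact_name"] = (params.get("contact_name") or params["name"])[0]

    return info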

Database Storage

Contact information is stored in the job_descriptions table with the following columns:

  • reply_url (VARCHAR(512)): The complete reply/contact URL
  • contact_email (VARCHAR(255)): Extracted email address
  • contact_phone (VARCHAR(255)): Extracted phone number
  • contact_name (VARCHAR(255)): Extracted contact person/department name
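
Assuming the declarative SQLAlchemy style used by web/db.py, the contact columns of the JobDescription model look roughly like this sketch (the primary key shown is illustrative):

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class JobDescription(Base):
    """Subset of the job_descriptions model showing the contact columns."""
    __tablename__ = "job_descriptions"

    id = Column(Integer, primary_key=True)   # illustrative primary key
    reply_url = Column(String(512))          # complete reply/contact URL
    contact_email = Column(String(255))      # extracted email address
    contact_phone = Column(String(255))      # extracted phone number
    contact_name = Column(String(255))       # contact person/department name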

Example

For a job listing whose reply button links to mailto:hiring@acme.com?subject=Job%20Application:

{
    "reply_url": "mailto:hiring@acme.com?subject=Job%20Application",
    "contact_email": "hiring@acme.com",
    "contact_phone": "N/A",
    "contact_name": "N/A"
}

This contact information is automatically extracted during job page scraping and persisted to the database for easy access and filtering.

Negative Keyword Filtering

The scraper inspects each job's title, company, location, and description for configurable “negative” keywords. When a keyword matches, the scraped result records the match so downstream workflows can skip or flag the job.

Keyword Configuration

Define keywords in config/settings.json under scraper.negative_keywords. Keywords are matched case-insensitively and should be supplied without surrounding whitespace:

{
  "scraper": {
    "negative_keywords": ["scam", "mlm", "unpaid"]
  }
}

Scrape Output

Each scrape_job_page result contains three new fields:

  • is_negative_match: True when any keyword matches
  • negative_keyword_match: the keyword that triggered the match
  • negative_match_field: which field (title, company, location, description) contained the keyword
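
A minimal sketch of how these three fields could be produced, assuming a plain dict of scraped fields (check_negative_keywords is a hypothetical helper name, and the values returned when nothing matches are an assumption):

def check_negative_keywords(job: dict, negative_keywords: list[str]) -> dict:
    """Return the negative-match fields for a scraped job dict."""
    fields = ("title", "company", "location", "description")
    for keyword in negative_keywords:
        needle = keyword.strip().lower()  # keywords are matched case-insensitively
        for field in fields:
            if needle and needle in (job.get(field) or "").lower():
                return {
                    "is_negative_match": True,
                    "negative_keyword_match": keyword,
                    "negative_match_field": field,
                }
    return {
        "is_negative_match": False,
        "negative_keyword_match": None,
        "negative_match_field": None,
    }

For example, a listing whose description contains "unpaid" would come back with is_negative_match set to True and negative_match_field set to "description".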

Processing Behavior

  • process_job_url stops when is_negative_match is True, yielding a log message and calling remove_job so stale results never remain in job_listings.
  • upsert_job_details now returns immediately for negative matches, ensuring job_descriptions never stores filtered listings.
  • Regression coverage lives in tests/test_scraper.py::TestScraperPipelineNegativeFiltering and tests/test_db_negative_filtering.py::test_upsert_job_details_skips_negative_match.

Together, these checks mean negative matches are dropped before any persistence and never shown in the UI.

User-Specific Negative Keywords

In addition to the global negative keywords defined in settings.json, users can define their own personal negative keywords via the Preferences page (/settings).

  • Management: Users can add new negative keywords and remove existing ones.
  • Filtering: Jobs matching any of the user's negative keywords are filtered out from the job listings view (/ and /jobs).
  • Validation: The UI prevents adding duplicate keywords.
  • Storage: User-specific negative keywords are stored in the database (negative_keywords and user_negative_keywords tables).
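
A rough sketch of the two tables as SQLAlchemy models; only the table names come from the description above, so every column name and the users.id foreign key are assumptions:

from sqlalchemy import Column, ForeignKey, Integer, String, UniqueConstraint
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class NegativeKeyword(Base):
    """A single negative keyword (negative_keywords table)."""
    __tablename__ = "negative_keywords"

    id = Column(Integer, primary_key=True)
    keyword = Column(String(255), unique=True, nullable=False)


class UserNegativeKeyword(Base):
    """Link between a user and a negative keyword (user_negative_keywords table)."""
    __tablename__ = "user_negative_keywords"

    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("users.id"), nullable=False)       # assumed FK target
    keyword_id = Column(Integer, ForeignKey("negative_keywords.id"), nullable=False)

    # The UI prevents duplicates; a unique constraint enforces the same rule in the DB.
    __table_args__ = (UniqueConstraint("user_id", "keyword_id"),)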

Email Notifications

Optional job-alert emails are generated whenever the scraper discovers new listings.

Configuration

Edit config/settings.json under the email section:

{
  "email": {
    "enabled": true,
    "from_address": "jobs@example.com",
    "recipients": ["alerts@example.com"],
    "smtp": {
      "host": "smtp.example.com",
      "port": 587,
      "username": "smtp-user",
      "password": "secret",
      "use_tls": true,
      "use_ssl": false,
      "timeout": 30
    }
  }
}
  • Leave enabled set to false for local development or when credentials are unavailable.
  • Provide at least one recipient; otherwise alerts are skipped with a log message.
  • Omit real credentials from source control—inject them via environment variables or a secrets manager in production.
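
One way to follow that advice is to overlay environment variables onto the JSON config at load time. This is a hedged sketch: load_settings, SMTP_USERNAME, and SMTP_PASSWORD are hypothetical names, not part of the project:

import json
import os


def load_settings(path: str = "config/settings.json") -> dict:
    """Load settings and let environment variables override SMTP credentials."""
    with open(path, encoding="utf-8") as fh:
        settings = json.load(fh)

    smtp = settings.setdefault("email", {}).setdefault("smtp", {})
    # Only override when the variable is actually set, so local defaults still work.
    if os.getenv("SMTP_USERNAME"):
        smtp["username"] = os.environ["SMTP_USERNAME"]
    if os.getenv("SMTP_PASSWORD"):
        smtp["password"] = os.environ["SMTP_PASSWORD"]
    return settings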

How Alerts Are Sent

  • After fetch_listings() completes, the scraper gathers new listings and, when configured, renders a plain-text digest via web.email_templates.render_job_alert_email.
  • Delivery is handled by web.email_service.send_email, which supports TLS/SSL SMTP connections and gracefully skips when disabled.
  • Success or failure is streamed in the scraper log output ("Job alert email sent." or the reason for skipping).
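
The hand-off between the two modules amounts to something like the sketch below; the exact signatures of render_job_alert_email and send_email are assumptions based on the description above, and notify_new_jobs is a hypothetical wrapper:

from web.email_service import send_email
from web.email_templates import render_job_alert_email


def notify_new_jobs(new_jobs, recipients, email_config):
    """Render a plain-text digest for newly discovered jobs and mail it out."""
    if not email_config.get("enabled") or not recipients:
        return  # alerts are skipped (with a log message) when disabled or empty

    subject, body = render_job_alert_email(new_jobs)      # assumed return shape
    send_email(subject=subject, body=body,
               recipients=recipients, config=email_config)  # assumed parameters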

Managing Recipients

  • Admin users can visit /admin/emails to add or deactivate subscription addresses through the web UI.
  • Deactivated rows remain in the table so they can be reactivated later; the scraper only mails active recipients.
  • The navigation bar exposes an Email Alerts link to the management screen after logging in as an admin user.

Customising Templates

  • Use the Email Templates admin page (/admin/email-templates) to create, edit, preview, or delete alert templates.
  • Templates support placeholder tokens such as {count_label}, {scope}, {timestamp}, {jobs_section}, and {jobs_message}; the UI lists all available tokens.
  • Preview renders the selected template with sample data so changes can be reviewed before saving.
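
Rendering boils down to substituting the placeholder tokens into the stored template text. The sketch below uses Python's str.format with sample data; the template body shown is illustrative, not the shipped default:

# Illustrative template using the documented placeholder tokens.
TEMPLATE_BODY = (
    "{count_label} found for {scope} at {timestamp}.\n\n"
    "{jobs_section}\n"
    "{jobs_message}\n"
)

sample = {
    "count_label": "2 new jobs",
    "scope": "sfbay / python",
    "timestamp": "2025-11-28 18:00",
    "jobs_section": "- Backend Engineer\n- Data Analyst",
    "jobs_message": "Visit the dashboard for full descriptions.",
}

print(TEMPLATE_BODY.format(**sample))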

Tests

  • tests/test_email_templates.py verifies the rendered subject/body for both populated and empty alerts.
  • tests/test_email_service.py covers SMTP configuration, disabled mode, and login/send flows using fakes.
  • tests/test_admin_email.py exercises the admin UI for listing, subscribing, and unsubscribing recipients.
  • tests/test_admin_email_templates.py verifies CRUD operations and previews for template management.
  • tests/test_scraper.py::TestScraperEmailNotifications ensures the scraping pipeline invokes the alert sender when new jobs are found.

Docker Deployment

Please see README-Docker.md for instructions on deploying the application using Docker.