# jobs

A job scraper.

## Features

- Scrapes job listings from websites (currently Craigslist, by region)
- Saves job listings to a database
- Users can search for job listings by keyword and region
- Selects job listings based on user preferences

## Architecture Overview

The application is built as a modular Flask-based service with a clear separation of concerns:

| Layer | Module | Responsibility |
| --- | --- | --- |
| **Web UI** | `web/app.py` | Flask application that serves HTML pages, REST endpoints, and admin interfaces (users, taxonomy, health, email management). |
| **Orchestrator** | `web/craigslist.py` | Coordinates the scraping workflow: schedules runs, fetches listings, updates the DB, and triggers email alerts. |
| **Scraper** | `web/scraper.py` | Contains the low-level HTML parsing logic (`scrape_job_data`, `scrape_job_page`, `extract_contact_info`). |
| **Persistence** | `web/db.py` | SQLAlchemy ORM models (`User`, `JobListing`, `JobDescription`, `UserInteraction`, `Region`, `Keyword`, `EmailSubscription`, **`EmailTemplate`**) and helper functions for upserts, queries, and subscription management. |
| **Email Rendering** | `web/email_templates.py` | Renders job-alert emails using a pluggable template system. Supports default placeholders (`{count_label}`, `{scope}`, `{timestamp}`, `{jobs_section}`, `{jobs_message}`) and custom admin-defined templates. |
| **Email Delivery** | `web/email_service.py` | Sends rendered messages via SMTP, handling TLS/SSL, authentication, and graceful disabling. |
| **Configuration** | `config/settings.json` | Centralised JSON config for database, HTTP, scraper options, negative keywords, and email settings. |
| **Static Assets & Templates** | `web/static/`, `web/templates/` | Front-end resources (JS, CSS) and Jinja2 templates for the public UI and admin pages (including the new **Email Templates** management UI). |
| **Scheduler** | `schedule` (used in `web/craigslist.py`) | Runs the scraper automatically at configurable intervals (default hourly). |
| **Testing** | `tests/` | Pytest suite covering the scheduler, scraper, DB helpers, email service, and the new admin UI for email subscriptions and templates. |

**Key architectural notes**

- **Email subscriptions** are stored in the `email_subscriptions` table and managed via `/admin/emails`.
- **Email templates** are persisted in the new `email_templates` table, editable through `/admin/email-templates`, and used by the alert system.
- The orchestrator (`fetch_listings`) returns a detailed result dict (`discovered`, `new`, `by_search`) that drives UI metrics and health checks.
- Contact information (`reply_url`, `contact_email`, `contact_phone`, `contact_name`) extracted by the scraper is saved in `job_descriptions`.
- Negative keyword filtering is applied early in the pipeline so unwanted listings never reach the DB or email alerts.

This layered design makes it straightforward to extend the scraper to new sources, swap out the email backend, or add admin features without impacting other components.

## Installation

1. Clone the repository
2. Create a virtual environment
3. Install dependencies
4. Set up environment variables
5. Run the application (see the sketch below)
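For step 5, a minimal local-run sketch is shown below. It assumes `web/app.py` exposes a module-level Flask `app` object; the README does not spell out the exact entry point, so adjust the import if the project uses an application factory instead.

```python
# Minimal sketch for running the web UI locally.
# Assumption: web/app.py exposes a module-level Flask instance named `app`.
from web.app import app

if __name__ == "__main__":
    # Debug mode is convenient for local development only.
    app.run(host="127.0.0.1", port=5000, debug=True)
```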
## Scheduler Configuration

The application includes an automated scheduler that runs the job scraping process every hour. The scheduler is implemented in `web/craigslist.py` and includes:

- **Automatic Scheduling**: Scraping runs every hour automatically
- **Failure Handling**: Retry logic with exponential backoff (up to 3 attempts)
- **Background Operation**: Runs in a separate daemon thread
- **Graceful Error Recovery**: Continues running even if individual scraping attempts fail

### Scheduler Features

- **Retry Mechanism**: Automatically retries failed scraping attempts
- **Logging**: Comprehensive logging of scheduler operations and failures
- **Testing**: Comprehensive test suite in `tests/test_scheduler.py`

To modify the scheduling interval, edit the `start_scheduler()` function in `web/craigslist.py`.

## Job Scraping Output

The `fetch_listings()` function in `web/craigslist.py` returns detailed metrics about each scraping operation as a dictionary containing:

- **discovered**: Total number of unique job URLs discovered across all region/keyword combinations
- **new**: Total number of newly added jobs (jobs not previously in the database)
- **by_search**: List of dictionaries, each containing:
  - **region**: The region name for this search
  - **keyword**: The keyword used for this search
  - **count**: Number of jobs fetched for this specific region/keyword combination

### Example Output

```python
{
    "discovered": 150,
    "new": 42,
    "by_search": [
        {"region": "sfbay", "keyword": "python", "count": 25},
        {"region": "sfbay", "keyword": "java", "count": 18},
        {"region": "losangeles", "keyword": "python", "count": 45},
        {"region": "losangeles", "keyword": "java", "count": 62}
    ]
}
```

This per-search breakdown allows for better monitoring and debugging of the scraping process, making it easier to spot searches that fail or return fewer results than expected.

## Contact Information Extraction

The scraper automatically extracts contact information from job listing pages.

### Extracted Fields

When scraping individual job listings, the following contact information is extracted and stored:

- **contact_email**: Email address extracted from reply button or contact form links
- **contact_phone**: Phone number extracted from tel links or contact parameters
- **contact_name**: Contact person or department name, if available
- **reply_url**: The full reply/contact URL from the job listing

### How Contact Information is Extracted

The `extract_contact_info()` function parses various types of reply URLs (see the sketch after this list):

1. **Mailto Links**: `mailto:jobs@company.com?subject=...` - extracts the email address directly
2. **Phone Links**: `tel:+1234567890` - extracts the phone number
3. **URL Parameters**: `https://apply.company.com?email=hr@company.com&phone=555-1234&name=HR%20Team` - searches for common parameter names such as `email`, `phone`, and `contact_name`
4. **Graceful Fallback**: If contact information cannot be extracted, the fields are set to `"N/A"`
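The real implementation lives in `web/scraper.py`; the snippet below is only a sketch of the parsing behaviour described above, built on the Python standard library. The function name and return keys mirror the documented fields but are otherwise assumptions.

```python
from urllib.parse import parse_qs, urlparse

def extract_contact_info_sketch(reply_url: str) -> dict:
    """Illustrative only: mirrors the documented behaviour, not web/scraper.py."""
    info = {
        "reply_url": reply_url or "N/A",
        "contact_email": "N/A",
        "contact_phone": "N/A",
        "contact_name": "N/A",
    }
    if not reply_url:
        return info
    if reply_url.startswith("mailto:"):
        # mailto:jobs@company.com?subject=... -> keep only the address part
        info["contact_email"] = reply_url[len("mailto:"):].split("?", 1)[0]
    elif reply_url.startswith("tel:"):
        info["contact_phone"] = reply_url[len("tel:"):]
    else:
        # Ordinary HTTP(S) reply URL: look for common query parameter names
        params = parse_qs(urlparse(reply_url).query)
        if "email" in params:
            info["contact_email"] = params["email"][0]
        if "phone" in params:
            info["contact_phone"] = params["phone"][0]
        for key in ("contact_name", "name"):
            if key in params:
                info["contact_name"] = params[key][0]
                break
    return info
```

Note that `parse_qs` already percent-decodes query values, so a parameter such as `name=HR%20Team` comes back as `HR Team`.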
### Database Storage

Contact information is stored in the `job_descriptions` table with the following columns:

- `reply_url` (VARCHAR(512)): The complete reply/contact URL
- `contact_email` (VARCHAR(255)): Extracted email address
- `contact_phone` (VARCHAR(255)): Extracted phone number
- `contact_name` (VARCHAR(255)): Extracted contact person/department name

### Example

For a job listing with the reply button `mailto:hiring@acme.com?subject=Job%20Application`:

```python
{
    "reply_url": "mailto:hiring@acme.com?subject=Job%20Application",
    "contact_email": "hiring@acme.com",
    "contact_phone": "N/A",
    "contact_name": "N/A"
}
```

This contact information is extracted automatically during job page scraping and persisted to the database for easy access and filtering.

## Negative Keyword Filtering

The scraper inspects each job's title, company, location, and description for configurable "negative" keywords. When a keyword matches, the scraped result records the match so downstream workflows can skip or flag the job.

### Configuration

Define keywords in `config/settings.json` under `scraper.negative_keywords`. Keywords are matched case-insensitively and should be supplied without surrounding whitespace:

```json
{
  "scraper": {
    "negative_keywords": ["scam", "mlm", "unpaid"]
  }
}
```

### Scrape Output

Each `scrape_job_page` result contains three additional fields:

- `is_negative_match`: `True` when any keyword matches
- `negative_keyword_match`: the keyword that triggered the match
- `negative_match_field`: which field (title, company, location, description) contained the keyword

### Processing Behavior

- `process_job_url` stops when `is_negative_match` is `True`, logging the match and calling `remove_job` so stale results never remain in `job_listings`.
- `upsert_job_details` returns immediately for negative matches, ensuring `job_descriptions` never stores filtered listings.
- Regression coverage lives in `tests/test_scraper.py::TestScraperPipelineNegativeFiltering` and `tests/test_db_negative_filtering.py::test_upsert_job_details_skips_negative_match`.

Together, these checks ensure negative matches are dropped before any persistence and never shown in the UI.

### User-Specific Negative Keywords

In addition to the global negative keywords defined in `settings.json`, users can define their own personal negative keywords via the **Preferences** page (`/settings`).

- **Management**: Users can add new negative keywords and remove existing ones.
- **Filtering**: Jobs matching any of the user's negative keywords are filtered out of the job listings views (`/` and `/jobs`).
- **Validation**: The UI prevents adding duplicate keywords.
- **Storage**: User-specific negative keywords are stored in the database (`negative_keywords` and `user_negative_keywords` tables).

## Email Notifications

Optional job-alert emails are generated whenever the scraper discovers new listings.

### Configuration

Edit `config/settings.json` under the `email` section:

```json
{
  "email": {
    "enabled": true,
    "from_address": "jobs@example.com",
    "recipients": ["alerts@example.com"],
    "smtp": {
      "host": "smtp.example.com",
      "port": 587,
      "username": "smtp-user",
      "password": "secret",
      "use_tls": true,
      "use_ssl": false,
      "timeout": 30
    }
  }
}
```

- Leave `enabled` set to `false` for local development or when credentials are unavailable.
- Provide at least one recipient; otherwise alerts are skipped with a log message.
- Omit real credentials from source control; inject them via environment variables or a secrets manager in production (see the sketch below).
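One way to handle the last point is to load `config/settings.json` and overlay SMTP credentials from the environment at startup. This is only a sketch: the helper and the variable names (`SMTP_USERNAME`, `SMTP_PASSWORD`) are assumptions, not part of the project.

```python
import json
import os

def load_email_settings(path: str = "config/settings.json") -> dict:
    """Sketch: read the email block and overlay secrets from the environment."""
    with open(path, encoding="utf-8") as fh:
        settings = json.load(fh)

    email_cfg = settings.get("email", {})
    smtp = email_cfg.setdefault("smtp", {})
    # Environment variables win over anything committed to settings.json.
    smtp["username"] = os.environ.get("SMTP_USERNAME", smtp.get("username"))
    smtp["password"] = os.environ.get("SMTP_PASSWORD", smtp.get("password"))
    return email_cfg
```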
### How Alerts Are Sent

- After `fetch_listings()` completes, the scraper gathers new listings and, when configured, renders a plaintext digest via `web.email_templates.render_job_alert_email`.
- Delivery is handled by `web.email_service.send_email`, which supports TLS/SSL SMTP connections and gracefully skips sending when disabled.
- Success or failure is reported in the scraper log output (`Job alert email sent.` or the reason for skipping).

### Managing Recipients

- Admin users can visit `/admin/emails` to add or deactivate subscription addresses through the web UI.
- Deactivated rows remain in the table so they can be reactivated later; the scraper only mails active recipients.
- After logging in as an admin user, the navigation bar exposes an **Email Alerts** link to the management screen.

### Customising Templates

- Use the **Email Templates** admin page (`/admin/email-templates`) to create, edit, preview, or delete alert templates.
- Templates support placeholder tokens such as `{count_label}`, `{scope}`, `{timestamp}`, `{jobs_section}`, and `{jobs_message}`; the UI lists all available tokens.
- Preview renders the selected template with sample data so changes can be reviewed before saving.

### Tests

- `tests/test_email_templates.py` verifies the rendered subject/body for both populated and empty alerts.
- `tests/test_email_service.py` covers SMTP configuration, disabled mode, and login/send flows using fakes.
- `tests/test_admin_email.py` exercises the admin UI for listing, subscribing, and unsubscribing recipients.
- `tests/test_admin_email_templates.py` verifies CRUD operations and previews for template management.
- `tests/test_scraper.py::TestScraperEmailNotifications` ensures the scraping pipeline invokes the alert sender when new jobs are found.

## Docker Deployment

Please see [README-Docker.md](README-Docker.md) for instructions on deploying the application using Docker.