job scraper
Features
- Scrapes job listings from external websites (currently Craigslist, by region)
- Saves job listings to a database
- Users can search for job listings by keywords and region
- Selection of job listings based on user preferences
Architecture Overview
The application is built as a modular Flask‑based service with clear separation of concerns:
| Layer | Module | Responsibility |
|---|---|---|
| Web UI | web/app.py | Flask application that serves HTML pages, REST endpoints, and admin interfaces (users, taxonomy, health, email management). |
| Orchestrator | web/craigslist.py | Coordinates the scraping workflow: schedules runs, fetches listings, updates the DB, and triggers email alerts. |
| Scraper | web/scraper.py | Contains the low-level HTML parsing logic (scrape_job_data, scrape_job_page, extract_contact_info). |
| Persistence | web/db.py | SQLAlchemy ORM models (User, JobListing, JobDescription, UserInteraction, Region, Keyword, EmailSubscription, EmailTemplate) and helper functions for upserts, queries, and subscription management. |
| Email Rendering | web/email_templates.py | Renders job-alert emails using a pluggable template system. Supports default placeholders ({count_label}, {scope}, {timestamp}, {jobs_section}, {jobs_message}) and custom admin-defined templates. |
| Email Delivery | web/email_service.py | Sends rendered messages via SMTP, handling TLS/SSL, authentication, and graceful disabling. |
| Configuration | config/settings.json | Centralised JSON config for database, HTTP, scraper options, negative keywords, and email settings. |
| Static Assets & Templates | web/static/, web/templates/ | Front-end resources (JS, CSS) and Jinja2 templates for the public UI and admin pages (including the new Email Templates management UI). |
| Scheduler | schedule (used in web/craigslist.py) | Runs the scraper automatically at configurable intervals (default hourly). |
| Testing | tests/ | Pytest suite covering scheduler, scraper, DB helpers, email service, and the new admin UI for email subscriptions and templates. |
Key architectural notes
- Email Subscriptions are stored in the email_subscriptions table and managed via /admin/emails.
- Email Templates are persisted in the new email_templates table, editable through /admin/email-templates, and used by the alert system.
- The orchestrator (fetch_listings) returns a detailed result dict (discovered, new, by_search) that drives UI metrics and health checks.
- Contact information (reply_url, contact_email, contact_phone, contact_name) extracted by the scraper is saved in job_descriptions.
- Negative keyword filtering is applied early in the pipeline to prevent unwanted listings from reaching the DB or email alerts.
This layered design makes it straightforward to extend the scraper to new sources, swap out the email backend, or add additional admin features without impacting other components.
Installation
- Clone the repository
- Create a virtual environment
- Install dependencies
- Set up environment variables
- Run the application
Scheduler Configuration
The application includes an automated scheduler that runs the job scraping process every hour. The scheduler is implemented in web/craigslist.py and includes:
- Automatic Scheduling: Scraping runs every hour automatically
- Failure Handling: Retry logic with exponential backoff (up to 3 attempts)
- Background Operation: Runs in a separate daemon thread
- Graceful Error Recovery: Continues running even if individual scraping attempts fail
Scheduler Features
- Retry Mechanism: Automatically retries failed scraping attempts
- Logging: Comprehensive logging of scheduler operations and failures
- Testing: Comprehensive test suite in tests/test_scheduler.py
To modify the scheduling interval, edit the start_scheduler() function in web/craigslist.py.
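Below is a minimal sketch of how this hourly loop might look inside web/craigslist.py (where fetch_listings() is defined), using the schedule library. The helper name run_scraper_with_retries and the exact backoff values are illustrative assumptions, not the project's exact implementation.

```python
# Hypothetical sketch of the hourly scheduler; names and backoff values are assumptions.
import logging
import threading
import time

import schedule

logger = logging.getLogger(__name__)

MAX_ATTEMPTS = 3  # up to 3 attempts with exponential backoff


def run_scraper_with_retries():
    """Run one scrape, retrying with exponential backoff on failure."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            result = fetch_listings()  # defined elsewhere in web/craigslist.py
            logger.info("Scrape finished: %s new of %s discovered",
                        result["new"], result["discovered"])
            return
        except Exception:
            logger.exception("Scrape attempt %s/%s failed", attempt, MAX_ATTEMPTS)
            time.sleep(2 ** attempt)  # exponential backoff: 2s, 4s, 8s
    logger.error("All scrape attempts failed; waiting for the next interval.")


def start_scheduler():
    """Register the hourly job and process it in a background daemon thread."""
    schedule.every(1).hours.do(run_scraper_with_retries)

    def loop():
        while True:
            schedule.run_pending()
            time.sleep(60)  # check for due jobs once a minute

    threading.Thread(target=loop, daemon=True).start()
```

Changing schedule.every(1).hours to, for example, schedule.every(30).minutes is the kind of edit the paragraph above refers to.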
Job Scraping Output
The fetch_listings() function in web/craigslist.py provides detailed metrics about each scraping operation. It returns a dictionary containing:
- discovered: Total number of unique job URLs discovered across all region/keyword combinations
- new: Total number of newly added jobs (jobs not previously in the database)
- by_search: List of dictionaries, each containing:
- region: The region name for this search
- keyword: The keyword used for this search
- count: Number of jobs fetched for this specific region/keyword combination
Example Output
{
"discovered": 150,
"new": 42,
"by_search": [
{"region": "sfbay", "keyword": "python", "count": 25},
{"region": "sfbay", "keyword": "java", "count": 18},
{"region": "losangeles", "keyword": "python", "count": 45},
{"region": "losangeles", "keyword": "java", "count": 62}
]
}
This per-search breakdown allows for better monitoring and debugging of the scraping process, enabling identification of searches that may be failing or returning fewer results than expected.
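As an illustration (not project code), a monitoring hook could consume this dictionary to surface empty searches; the helper name below is hypothetical.

```python
# Hypothetical consumer of the fetch_listings() result dictionary.
import logging

logger = logging.getLogger(__name__)


def report_scrape_metrics(result: dict) -> None:
    """Log overall totals and flag region/keyword searches that returned nothing."""
    logger.info("Discovered %s listings, %s new", result["discovered"], result["new"])
    for search in result["by_search"]:
        if search["count"] == 0:
            # An empty search may indicate a broken selector or an overly narrow keyword.
            logger.warning("No results for %s / %s", search["region"], search["keyword"])
```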
Contact Information Extraction
The scraper now automatically extracts contact information from job listing pages:
Extracted Fields
When scraping individual job listings, the following contact information is extracted and stored:
- contact_email: Email address extracted from reply button or contact form links
- contact_phone: Phone number extracted from tel links or contact parameters
- contact_name: Contact person or department name if available
- reply_url: The full reply/contact URL from the job listing
How Contact Information is Extracted
The extract_contact_info() function intelligently parses various types of reply URLs:
- Mailto Links (e.g. mailto:jobs@company.com?subject=...): extracts the email address directly
- Phone Links (e.g. tel:+1234567890): extracts the phone number
- URL Parameters (e.g. https://apply.company.com?email=hr@company.com&phone=555-1234&name=HR%20Team): searches for common parameter names such as email, phone, and contact_name
- Graceful Fallback: if contact information cannot be extracted, the fields are set to "N/A"
Database Storage
Contact information is stored in the job_descriptions table with the following columns:
- reply_url (VARCHAR(512)): The complete reply/contact URL
- contact_email (VARCHAR(255)): Extracted email address
- contact_phone (VARCHAR(255)): Extracted phone number
- contact_name (VARCHAR(255)): Extracted contact person/department name
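In SQLAlchemy terms, the relevant part of the JobDescription model in web/db.py would look roughly like the sketch below; surrounding columns and relationships are omitted, and the base class and primary key shown are assumptions.

```python
# Abbreviated sketch of the contact columns on the JobDescription model (details assumed).
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class JobDescription(Base):
    __tablename__ = "job_descriptions"

    id = Column(Integer, primary_key=True)   # other columns omitted for brevity
    reply_url = Column(String(512))          # complete reply/contact URL
    contact_email = Column(String(255))      # extracted email address
    contact_phone = Column(String(255))      # extracted phone number
    contact_name = Column(String(255))       # extracted contact person/department name
```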
Example
For a job listing with reply button mailto:hiring@acme.com?subject=Job%20Application:
{
"reply_url": "mailto:hiring@acme.com?subject=Job%20Application",
"contact_email": "hiring@acme.com",
"contact_phone": "N/A",
"contact_name": "N/A"
}
This contact information is automatically extracted during job page scraping and persisted to the database for easy access and filtering.
Negative Keyword Filtering
The scraper inspects each job’s title, company, location, and description for configurable “negative” keywords. When a keyword matches, the scraped result indicates the match so downstream workflows can skip or flag the job.
Keyword Configuration
Define keywords in config/settings.json under scraper.negative_keywords. Keywords are matched case-insensitively and should be supplied without surrounding whitespace:
{
"scraper": {
"negative_keywords": ["scam", "mlm", "unpaid"]
}
}
Scrape Output
Each scrape_job_page result contains three new fields:
- is_negative_match: True when any keyword matches
- negative_keyword_match: the keyword that triggered the match
- negative_match_field: which field (title, company, location, description) contained the keyword
Processing Behavior
- process_job_url stops when is_negative_match is True, yielding a log message and calling remove_job so stale results never remain in job_listings.
- upsert_job_details returns immediately for negative matches, ensuring job_descriptions never stores filtered listings.
- Regression coverage lives in tests/test_scraper.py::TestScraperPipelineNegativeFiltering and tests/test_db_negative_filtering.py::test_upsert_job_details_skips_negative_match.
Together, these checks mean negative matches are dropped before any persistence and never shown in the UI.
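A minimal sketch of one way such a case-insensitive check could produce the three fields described above; the function name and the substring-matching approach are illustrative assumptions rather than the project's exact code.

```python
# Minimal sketch of a negative-keyword check; the function name is illustrative.
def check_negative_keywords(job: dict, negative_keywords: list[str]) -> dict:
    """Return the is_negative_match / negative_keyword_match / negative_match_field trio."""
    for field in ("title", "company", "location", "description"):
        text = (job.get(field) or "").lower()
        for keyword in negative_keywords:
            if keyword.lower() in text:
                return {"is_negative_match": True,
                        "negative_keyword_match": keyword,
                        "negative_match_field": field}
    return {"is_negative_match": False,
            "negative_keyword_match": None,
            "negative_match_field": None}
```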
User-Specific Negative Keywords
In addition to the global negative keywords defined in settings.json, users can define their own personal negative keywords via the Preferences page (/settings).
- Management: Users can add new negative keywords and remove existing ones.
- Filtering: Jobs matching any of the user's negative keywords are filtered out from the job listings views (/ and /jobs), as sketched after this list.
- Validation: The UI prevents adding duplicate keywords.
- Storage: User-specific negative keywords are stored in the database (negative_keywords and user_negative_keywords tables).
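As a rough illustration of that view-level filtering, listings whose title or description contains any of a user's keywords could be dropped as shown below; the helper name, the fields inspected, and the dict-based job representation are assumptions rather than the project's actual query.

```python
# Rough illustration of user-specific filtering; names and fields are assumptions.
def filter_jobs_for_user(jobs: list[dict], user_negative_keywords: list[str]) -> list[dict]:
    """Drop listings containing any of the user's personal negative keywords."""
    keywords = [k.lower() for k in user_negative_keywords]
    visible = []
    for job in jobs:
        haystack = f"{job.get('title', '')} {job.get('description', '')}".lower()
        if not any(k in haystack for k in keywords):
            visible.append(job)
    return visible
```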
Email Notifications
Optional job-alert emails are generated whenever the scraper discovers new listings.
Configuration
Edit config/settings.json under the email section:
{
"email": {
"enabled": true,
"from_address": "jobs@example.com",
"recipients": ["alerts@example.com"],
"smtp": {
"host": "smtp.example.com",
"port": 587,
"username": "smtp-user",
"password": "secret",
"use_tls": true,
"use_ssl": false,
"timeout": 30
}
}
}
- Leave enabled set to false for local development or when credentials are unavailable.
- Provide at least one recipient; otherwise alerts are skipped with a log message.
- Omit real credentials from source control; inject them via environment variables or a secrets manager in production.
How Alerts Are Sent
- After fetch_listings() completes, the scraper gathers new listings and, when configured, renders a plaintext digest via web.email_templates.render_job_alert_email.
- Delivery is handled by web.email_service.send_email, which supports TLS/SSL SMTP connections and gracefully skips sending when email is disabled.
- Success or failure is reported in the scraper log output (Job alert email sent. or the reason for skipping).
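A condensed sketch of what such an SMTP send can look like with smtplib follows; the real web.email_service.send_email has its own signature and additionally handles the disabled state, missing recipients, and logging.

```python
# Condensed smtplib sketch; web.email_service.send_email's actual signature may differ.
import smtplib
from email.message import EmailMessage


def send_email(subject: str, body: str, cfg: dict) -> None:
    """Send a plaintext message using the shape of the settings.json 'email' section."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = cfg["from_address"]
    msg["To"] = ", ".join(cfg["recipients"])
    msg.set_content(body)

    smtp_cfg = cfg["smtp"]
    smtp_cls = smtplib.SMTP_SSL if smtp_cfg.get("use_ssl") else smtplib.SMTP
    with smtp_cls(smtp_cfg["host"], smtp_cfg["port"], timeout=smtp_cfg.get("timeout", 30)) as server:
        if smtp_cfg.get("use_tls") and not smtp_cfg.get("use_ssl"):
            server.starttls()  # upgrade the connection when TLS (not implicit SSL) is configured
        if smtp_cfg.get("username"):
            server.login(smtp_cfg["username"], smtp_cfg["password"])
        server.send_message(msg)
```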
Managing Recipients
- Admin users can visit /admin/emails to add or deactivate subscription addresses through the web UI.
- Deactivated rows remain in the table so they can be reactivated later; the scraper only mails active recipients.
- The navigation bar exposes an Email Alerts link to the management screen after logging in as an admin user.
Customising Templates
- Use the Email Templates admin page (/admin/email-templates) to create, edit, preview, or delete alert templates.
- Templates support placeholder tokens such as {count_label}, {scope}, {timestamp}, {jobs_section}, and {jobs_message}; the UI lists all available tokens.
- Preview renders the selected template with sample data so changes can be reviewed before saving.
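As a rough illustration of how these tokens are filled (the real render_job_alert_email builds its values from the scrape results; the template text and values below are made-up sample data), a template body can be rendered with Python's str.format:

```python
# Illustrative rendering of placeholder tokens; template text and values are sample data.
template_body = (
    "Found {count_label} for {scope} at {timestamp}.\n\n"
    "{jobs_section}"
)

rendered = template_body.format(
    count_label="3 new jobs",
    scope="sfbay / python",
    timestamp="2024-01-01 12:00 UTC",
    jobs_section="* Backend Engineer: https://example.org/job/123\n",
)
print(rendered)
```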
Tests
- tests/test_email_templates.py verifies the rendered subject/body for both populated and empty alerts.
- tests/test_email_service.py covers SMTP configuration, disabled mode, and login/send flows using fakes.
- tests/test_admin_email.py exercises the admin UI for listing, subscribing, and unsubscribing recipients.
- tests/test_admin_email_templates.py verifies CRUD operations and previews for template management.
- tests/test_scraper.py::TestScraperEmailNotifications ensures the scraping pipeline invokes the alert sender when new jobs are found.
Docker Deployment
Please see README-Docker.md for instructions on deploying the application using Docker.