diff --git a/README.md b/README.md index f458291..b39fc4a 100644 --- a/README.md +++ b/README.md @@ -9,11 +9,32 @@ job scraper - Users can search for job listings by keywords and region - Selection of job listings based on user preferences -## Requirements +## Architecture Overview -- Database (MySQL/MariaDB) -- Python 3.x - - Required Python packages (see requirements.txt) +The application is built as a modular Flask‑based service with clear separation of concerns: + +| Layer | Module | Responsibility | +| ----------------------------- | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| **Web UI** | `web/app.py` | Flask application that serves HTML pages, REST endpoints, and admin interfaces (users, taxonomy, health, email management). | +| **Orchestrator** | `web/craigslist.py` | Coordinates the scraping workflow: schedules runs, fetches listings, updates the DB, and triggers email alerts. | +| **Scraper** | `web/scraper.py` | Contains the low‑level HTML parsing logic (`scrape_job_data`, `scrape_job_page`, `extract_contact_info`). | +| **Persistence** | `web/db.py` | SQLAlchemy ORM models (`User`, `JobListing`, `JobDescription`, `UserInteraction`, `Region`, `Keyword`, `EmailSubscription`, **`EmailTemplate`**) and helper functions for upserts, queries, and subscription management. | +| **Email Rendering** | `web/email_templates.py` | Renders job‑alert emails using a pluggable template system. Supports default placeholders (`{count_label}`, `{scope}`, `{timestamp}`, `{jobs_section}`, `{jobs_message}`) and custom admin‑defined templates. | +| **Email Delivery** | `web/email_service.py` | Sends rendered messages via SMTP, handling TLS/SSL, authentication, and graceful disabling. | +| **Configuration** | `config/settings.json` | Centralised JSON config for database, HTTP, scraper options, negative keywords, and email settings. | +| **Static Assets & Templates** | `web/static/`, `web/templates/` | Front‑end resources (JS, CSS) and Jinja2 templates for the public UI and admin pages (including the new **Email Templates** management UI). | +| **Scheduler** | `schedule` (used in `web/craigslist.py`) | Runs the scraper automatically at configurable intervals (default hourly). | +| **Testing** | `tests/` | Pytest suite covering scheduler, scraper, DB helpers, email service, and the new admin UI for email subscriptions and templates. | + +**Key architectural notes** + +- **Email Subscriptions** are stored in the `email_subscriptions` table and managed via `/admin/emails`. +- **Email Templates** are persisted in the new `email_templates` table, editable through `/admin/email-templates`, and used by the alert system. +- The orchestrator (`fetch_listings`) returns a detailed result dict (`discovered`, `new`, `by_search`) that drives UI metrics and health checks. +- Contact information (`reply_url`, `contact_email`, `contact_phone`, `contact_name`) extracted by the scraper is saved in `job_descriptions`. +- Negative keyword filtering is applied early in the pipeline to prevent unwanted listings from reaching the DB or email alerts. + +This layered design makes it straightforward to extend the scraper to new sources, swap out the email backend, or add additional admin features without impacting other components. 
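+
+As a quick illustration of how the layers compose, the sketch below drives one scraping pass and reads the summary dict the orchestrator returns (a hypothetical snippet: `fetch_listings()` and its return shape are documented under "Job Scraping Output" below, but the glue code itself is illustrative, not part of the codebase):
+
+```python
+# Minimal sketch: consume the fetch_listings() generator, which yields
+# progress messages and returns its summary dict via StopIteration.value.
+from web.craigslist import fetch_listings
+
+
+def run_once():
+    gen = fetch_listings()
+    while True:
+        try:
+            print(next(gen))  # stream progress/log lines
+        except StopIteration as stop:
+            return stop.value  # {"discovered": ..., "new": ..., "by_search": [...]}
+
+
+stats = run_once()
+print(f"{stats['new']} new of {stats['discovered']} discovered listings")
+```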
## Installation
@@ -40,6 +61,187 @@ The application includes an automated scheduler that runs the job scraping proce
 To modify the scheduling interval, edit the `start_scheduler()` function in `web/craigslist.py`.
+
+## Job Scraping Output
+
+The `fetch_listings()` function in `web/craigslist.py` now returns detailed metrics about each scraping operation. The returned dictionary contains:
+
+- **discovered**: Total number of unique job URLs discovered across all region/keyword combinations
+- **new**: Total number of newly added jobs (jobs not previously in the database)
+- **by_search**: List of dictionaries, each containing:
+  - **region**: The region name for this search
+  - **keyword**: The keyword used for this search
+  - **count**: Number of jobs fetched for this specific region/keyword combination
+
+### Example Output
+
+```python
+{
+    "discovered": 150,
+    "new": 42,
+    "by_search": [
+        {"region": "sfbay", "keyword": "python", "count": 25},
+        {"region": "sfbay", "keyword": "java", "count": 18},
+        {"region": "losangeles", "keyword": "python", "count": 45},
+        {"region": "losangeles", "keyword": "java", "count": 62}
+    ]
+}
+```
+
+This per-search breakdown makes it easier to monitor and debug the scraping process and to spot searches that are failing or returning fewer results than expected.
+
+## Contact Information Extraction
+
+The scraper now automatically extracts contact information from job listing pages:
+
+### Extracted Fields
+
+When scraping individual job listings, the following contact information is extracted and stored:
+
+- **contact_email**: Email address extracted from reply button or contact form links
+- **contact_phone**: Phone number extracted from tel links or contact parameters
+- **contact_name**: Contact person or department name if available
+- **reply_url**: The full reply/contact URL from the job listing
+
+### How Contact Information is Extracted
+
+The `extract_contact_info()` function parses several types of reply URLs:
+
+1. **Mailto Links**: `mailto:jobs@company.com?subject=...`
+
+   - Extracts the email address directly
+
+2. **Phone Links**: `tel:+1234567890`
+
+   - Extracts the phone number
+
+3. **URL Parameters**: `https://apply.company.com?email=hr@company.com&phone=555-1234&name=HR%20Team`
+
+   - Searches for common parameter names: `email`, `phone`, `contact_name`, etc.
+
+4. **Graceful Fallback**: If contact information cannot be extracted, the fields are set to `"N/A"`
+
+### Database Storage
+
+Contact information is stored in the `job_descriptions` table with the following columns:
+
+- `reply_url` (VARCHAR(512)): The complete reply/contact URL
+- `contact_email` (VARCHAR(255)): Extracted email address
+- `contact_phone` (VARCHAR(255)): Extracted phone number
+- `contact_name` (VARCHAR(255)): Extracted contact person/department name
+
+### Example
+
+For a job listing with reply button `mailto:hiring@acme.com?subject=Job%20Application`:
+
+```python
+{
+    "reply_url": "mailto:hiring@acme.com?subject=Job%20Application",
+    "contact_email": "hiring@acme.com",
+    "contact_phone": "N/A",
+    "contact_name": "N/A"
+}
+```
+
+This contact information is automatically extracted during job page scraping and persisted to the database for easy access and filtering.
+
+## Negative Keyword Filtering
+
+The scraper inspects each job’s title, company, location, and description for configurable “negative” keywords.
When a keyword matches, the scraped result indicates the match so downstream workflows can skip or flag the job.
+
+### Configuration
+
+Define keywords in `config/settings.json` under `scraper.negative_keywords`. Keywords are matched case-insensitively and should be supplied without surrounding whitespace:
+
+```json
+{
+  "scraper": {
+    "negative_keywords": ["scam", "mlm", "unpaid"]
+  }
+}
+```
+
+### Scrape Output
+
+Each `scrape_job_page` result contains three new fields:
+
+- `is_negative_match`: `True` when any keyword matches
+- `negative_keyword_match`: the keyword that triggered the match
+- `negative_match_field`: which field (title, company, location, description) contained the keyword
+
+### Processing Behavior
+
+- `process_job_url` stops when `is_negative_match` is `True`, yielding a log message and calling `remove_job` so stale results never remain in `job_listings`.
+- `upsert_job_details` now returns immediately for negative matches, ensuring `job_descriptions` never stores filtered listings.
+- Regression coverage lives in `tests/test_scraper.py::TestScraperPipelineNegativeFiltering` and `tests/test_db_negative_filtering.py::test_upsert_job_details_skips_negative_match`.
+
+Together, these checks mean negative matches are dropped before any persistence and never shown in the UI.
+
+### User-Specific Negative Keywords
+
+In addition to the global negative keywords defined in `settings.json`, users can define their own personal negative keywords via the **Preferences** page (`/settings`).
+
+- **Management**: Users can add new negative keywords and remove existing ones.
+- **Filtering**: Jobs matching any of the user's negative keywords are filtered out of the job listings views (`/` and `/jobs`).
+- **Validation**: The UI prevents adding duplicate keywords.
+- **Storage**: User-specific negative keywords are stored in the database (`negative_keywords` and `user_negative_keywords` tables).
+
+## Email Notifications
+
+Optional job-alert emails are generated whenever the scraper discovers new listings.
+
+### Configuration
+
+Edit `config/settings.json` under the `email` section:
+
+```json
+{
+  "email": {
+    "enabled": true,
+    "from_address": "jobs@example.com",
+    "recipients": ["alerts@example.com"],
+    "smtp": {
+      "host": "smtp.example.com",
+      "port": 587,
+      "username": "smtp-user",
+      "password": "secret",
+      "use_tls": true,
+      "use_ssl": false,
+      "timeout": 30
+    }
+  }
+}
+```
+
+- Leave `enabled` set to `false` for local development or when credentials are unavailable.
+- Provide at least one recipient; otherwise alerts are skipped with a log message.
+- Omit real credentials from source control; inject them via environment variables or a secrets manager in production.
+
+### How Alerts Are Sent
+
+- After `fetch_listings()` completes, the scraper gathers new listings and, when configured, renders a plaintext digest via `web.email_templates.render_job_alert_email`.
+- Delivery is handled by `web.email_service.send_email`, which supports TLS/SSL SMTP connections and gracefully skips when disabled.
+- Success or failure is streamed to the scraper log output (`Job alert email sent.` or the reason for skipping).
+
+### Managing Recipients
+
+- Admin users can visit `/admin/emails` to add or deactivate subscription addresses through the web UI.
+- Deactivated rows remain in the table so they can be reactivated later; the scraper only mails active recipients.
+- The navigation bar exposes an **Email Alerts** link to the management screen after logging in as an admin user.
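+
+For a quick configuration smoke test, the same helpers can be invoked directly. The sketch below is hypothetical glue based on the call signatures exercised in the test suite (`render_job_alert_email` returns a dict with `subject` and `body`; `send_email` returns a boolean and skips when disabled); the job payload is sample data:
+
+```python
+# Hedged sketch: render a one-job digest and push it through the configured
+# SMTP settings. The job fields below are illustrative sample data.
+import json
+
+from web.email_service import send_email
+from web.email_templates import render_job_alert_email
+
+jobs = [{"title": "Python Developer", "company": "Acme",
+         "location": "Remote", "url": "https://example.com/jobs/1"}]
+rendered = render_job_alert_email(jobs, region="sfbay", keyword="python")
+
+with open("config/settings.json") as fh:
+    email_settings = json.load(fh)["email"]
+
+sent = send_email(subject=rendered["subject"], body=rendered["body"],
+                  to=email_settings.get("recipients", []), settings=email_settings)
+print("sent" if sent else "skipped (disabled or misconfigured)")
+```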
+ +### Customising Templates + +- Use the **Email Templates** admin page (`/admin/email-templates`) to create, edit, preview, or delete alert templates. +- Templates support placeholder tokens such as `{count_label}`, `{scope}`, `{timestamp}`, `{jobs_section}`, and `{jobs_message}`; the UI lists all available tokens. +- Preview renders the selected template with sample data so changes can be reviewed before saving. + +### Tests + +- `tests/test_email_templates.py` verifies the rendered subject/body for both populated and empty alerts. +- `tests/test_email_service.py` covers SMTP configuration, disabled mode, and login/send flows using fakes. +- `tests/test_admin_email.py` exercises the admin UI for listing, subscribing, and unsubscribing recipients. +- `tests/test_admin_email_templates.py` verifies CRUD operations and previews for template management. +- `tests/test_scraper.py::TestScraperEmailNotifications` ensures the scraping pipeline invokes the alert sender when new jobs are found. + ## Docker Deployment Please see [README-Docker.md](README-Docker.md) for instructions on deploying the application using Docker. diff --git a/config/settings.json b/config/settings.json index 880399d..1241675 100644 --- a/config/settings.json +++ b/config/settings.json @@ -9,7 +9,7 @@ } }, "http": { - "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:141.0) Gecko/20100101 Firefox/141.0", + "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:145.0) Gecko/20100101 Firefox/145.0", "request_timeout": 30, "max_retries": 3, "backoff_factor": 2, @@ -22,7 +22,22 @@ }, "scraper": { "base_url": "https://{region}.craigslist.org/search/jjj?query={keyword}&sort=rel", - "config_dir": "config" + "config_dir": "config", + "negative_keywords": [] + }, + "email": { + "enabled": false, + "from_address": "jobs@example.com", + "recipients": [], + "smtp": { + "host": "smtp.example.com", + "port": 587, + "username": "", + "password": "", + "use_tls": true, + "use_ssl": false, + "timeout": 30 + } }, "users": [ { "username": "anonymous", "is_admin": false, "password": "" }, diff --git a/tests/test_admin_email.py b/tests/test_admin_email.py new file mode 100644 index 0000000..17e023f --- /dev/null +++ b/tests/test_admin_email.py @@ -0,0 +1,84 @@ +import pytest +from sqlalchemy import text + +from web.app import app +from web.db import ( + db_init, + create_or_update_user, + subscribe_email, + list_email_subscriptions, + _ensure_session, +) + + +@pytest.fixture(scope="function", autouse=True) +def initialize_app(): + app.config.update(TESTING=True, WTF_CSRF_ENABLED=False) + with app.app_context(): + db_init() + create_or_update_user("admin", password="secret", + is_admin=True, is_active=True) + # Clear subscriptions before and after each test to avoid leakage + with _ensure_session() as session: + session.execute(text("DELETE FROM email_subscriptions")) + session.commit() + yield + with _ensure_session() as session: + session.execute(text("DELETE FROM email_subscriptions")) + session.commit() + + +@pytest.fixture +def client(): + with app.test_client() as test_client: + with test_client.session_transaction() as sess: + sess["username"] = "admin" + yield test_client + + +@pytest.fixture +def anon_client(): + with app.test_client() as test_client: + # Ensure no admin session present + with test_client.session_transaction() as sess: + sess.pop("username", None) + yield test_client + + +def test_admin_emails_requires_admin(anon_client): + response = anon_client.get("/admin/emails") + assert response.status_code == 302 
+ assert "/login" in response.headers.get("Location", "") + + +def test_admin_emails_lists_subscriptions(client): + subscribe_email("alice@example.com") + response = client.get("/admin/emails") + assert response.status_code == 200 + assert b"alice@example.com" in response.data + + +def test_admin_emails_can_subscribe(client): + response = client.post( + "/admin/emails", + data={"action": "subscribe", "email": "bob@example.com"}, + follow_redirects=False, + ) + assert response.status_code == 302 + emails = list_email_subscriptions() + assert any(sub["email"] == "bob@example.com" and sub["is_active"] + for sub in emails) + + +def test_admin_emails_can_unsubscribe(client): + subscribe_email("carol@example.com") + response = client.post( + "/admin/emails", + data={"action": "unsubscribe", "email": "carol@example.com"}, + follow_redirects=False, + ) + assert response.status_code == 302 + emails = list_email_subscriptions() + matching = [sub for sub in emails if sub["email"] == "carol@example.com"] + assert matching + assert matching[0]["is_active"] is False diff --git a/tests/test_admin_email_templates.py b/tests/test_admin_email_templates.py new file mode 100644 index 0000000..ae9466c --- /dev/null +++ b/tests/test_admin_email_templates.py @@ -0,0 +1,138 @@ +import pytest +from sqlalchemy import text + +from web.app import app +from web.db import ( + db_init, + create_or_update_user, + list_email_templates, + update_email_template, + _ensure_session, + ensure_default_email_template, +) +from web.email_templates import render_job_alert_email + + +@pytest.fixture(scope="function", autouse=True) +def setup_database(): + app.config.update(TESTING=True, WTF_CSRF_ENABLED=False) + with app.app_context(): + db_init() + create_or_update_user("admin", password="secret", is_admin=True, is_active=True) + with _ensure_session() as session: + session.execute(text("DELETE FROM email_templates")) + session.commit() + ensure_default_email_template() + yield + with _ensure_session() as session: + session.execute(text("DELETE FROM email_templates")) + session.commit() + ensure_default_email_template() + + +@pytest.fixture +def client(): + with app.test_client() as test_client: + with test_client.session_transaction() as sess: + sess["username"] = "admin" + yield test_client + + +@pytest.fixture +def anon_client(): + with app.test_client() as test_client: + with test_client.session_transaction() as sess: + sess.pop("username", None) + yield test_client + + +def test_email_templates_requires_admin(anon_client): + response = anon_client.get("/admin/email-templates") + assert response.status_code == 302 + assert "/login" in response.headers.get("Location", "") + + +def test_email_templates_lists_default(client): + response = client.get("/admin/email-templates") + assert response.status_code == 200 + assert b"job-alert" in response.data + + +def test_email_templates_create_update_delete(client): + # Create + response = client.post( + "/admin/email-templates", + data={ + "action": "create", + "name": "Daily Summary", + "slug": "daily-summary", + "subject": "Summary: {count_label}", + "body": "Jobs:{jobs_section}", + "is_active": "on", + }, + follow_redirects=False, + ) + assert response.status_code == 302 + templates = list_email_templates() + assert any(t["slug"] == "daily-summary" for t in templates) + + # Update + template_row = next(t for t in templates if t["slug"] == "daily-summary") + response = client.post( + "/admin/email-templates", + data={ + "action": "update", + "template_id": template_row["template_id"], 
+ "name": "Daily Summary", + "slug": "daily-summary", + "subject": "Updated: {count_label}", + "body": "Updated body {jobs_section}", + }, + follow_redirects=False, + ) + assert response.status_code == 302 + updated = list_email_templates() + updated_row = next(t for t in updated if t["slug"] == "daily-summary") + assert "Updated:" in updated_row["subject"] + + # Delete + response = client.post( + "/admin/email-templates", + data={ + "action": "delete", + "template_id": updated_row["template_id"], + }, + follow_redirects=False, + ) + assert response.status_code == 302 + slugs = [t["slug"] for t in list_email_templates()] + assert "daily-summary" not in slugs + + +def test_email_templates_preview(client): + templates = list_email_templates() + job_alert = next(t for t in templates if t["slug"] == "job-alert") + response = client.get(f"/admin/email-templates?preview_id={job_alert['template_id']}") + assert response.status_code == 200 + assert b"Preview" in response.data + assert b"Subject" in response.data + + +def test_render_job_alert_email_uses_template_override(client): + templates = list_email_templates() + job_alert = next(t for t in templates if t["slug"] == "job-alert") + update_email_template( + job_alert["template_id"], + subject="Custom Subject {count}", + body="Body {jobs_message}", + ) + rendered = render_job_alert_email([ + { + "title": "Python Developer", + "company": "Acme", + "location": "Remote", + "url": "https://example.com", + } + ]) + assert rendered["subject"].startswith("Custom Subject") + assert "Python Developer" in rendered["body"] diff --git a/tests/test_db_negative_filtering.py b/tests/test_db_negative_filtering.py new file mode 100644 index 0000000..7e49be7 --- /dev/null +++ b/tests/test_db_negative_filtering.py @@ -0,0 +1,21 @@ +import pytest +import web.db as db + + +def test_upsert_job_details_skips_negative_match(monkeypatch): + def fail(*args, **kwargs): # pragma: no cover - guard against unwanted calls + raise AssertionError("should not reach database layers when negative") + + monkeypatch.setattr(db, "_ensure_session", fail) + monkeypatch.setattr(db, "insert_log", fail) + + job_data = { + "url": "https://example.com/job/neg", + "id": "neg123", + "is_negative_match": True, + "negative_keyword_match": "scam", + "negative_match_field": "title", + } + + # Should return early without touching the database helpers. 
+ db.upsert_job_details(job_data) diff --git a/tests/test_email_service.py b/tests/test_email_service.py new file mode 100644 index 0000000..b6a77a6 --- /dev/null +++ b/tests/test_email_service.py @@ -0,0 +1,106 @@ +import pytest + +from web.email_service import ( + EmailConfigurationError, + send_email, +) + + +def test_send_email_disabled(monkeypatch): + called = {} + + def _fake_smtp(*args, **kwargs): # pragma: no cover - should not be called + called["used"] = True + raise AssertionError( + "SMTP should not be invoked when email is disabled") + + monkeypatch.setattr("web.email_service.smtplib.SMTP", _fake_smtp) + monkeypatch.setattr("web.email_service.smtplib.SMTP_SSL", _fake_smtp) + + result = send_email( + subject="Hi", + body="Test", + to="user@example.com", + settings={"enabled": False}, + ) + assert result is False + assert called == {} + + +def test_send_email_sends_message(monkeypatch): + events = {"starttls": False, "login": None, "sent": None} + + class FakeSMTP: + def __init__(self, *, host, port, timeout): + self.host = host + self.port = port + self.timeout = timeout + + def __enter__(self): + return self + + def __exit__(self, exc_type, exc, tb): + return False + + def ehlo(self): + events.setdefault("ehlo", 0) + events["ehlo"] += 1 + + def starttls(self): + events["starttls"] = True + + def login(self, username, password): + events["login"] = (username, password) + + def send_message(self, message, *, from_addr, to_addrs): + events["sent"] = { + "from": from_addr, + "to": tuple(to_addrs), + "subject": message["Subject"], + } + + monkeypatch.setattr("web.email_service.smtplib.SMTP", FakeSMTP) + monkeypatch.setattr("web.email_service.smtplib.SMTP_SSL", FakeSMTP) + + settings = { + "enabled": True, + "from_address": "jobs@example.com", + "smtp": { + "host": "smtp.example.com", + "port": 2525, + "timeout": 15, + "username": "jobs", + "password": "secret", + "use_tls": True, + "use_ssl": False, + }, + } + + result = send_email( + subject="New Jobs", + body="You have new jobs waiting.", + to=["a@example.com", "b@example.com"], + cc="c@example.com", + bcc=["d@example.com"], + settings=settings, + ) + + assert result is True + assert events["starttls"] is True + assert events["login"] == ("jobs", "secret") + assert events["sent"] == { + "from": "jobs@example.com", + "to": ("a@example.com", "b@example.com", "c@example.com", "d@example.com"), + "subject": "New Jobs", + } + + +def test_send_email_requires_host(): + settings = { + "enabled": True, + "from_address": "jobs@example.com", + "smtp": {"host": "", "port": 587}, + } + with pytest.raises(EmailConfigurationError): + send_email(subject="Hi", body="Test", + to="user@example.com", settings=settings) diff --git a/tests/test_email_templates.py b/tests/test_email_templates.py new file mode 100644 index 0000000..45ca3ef --- /dev/null +++ b/tests/test_email_templates.py @@ -0,0 +1,40 @@ +from datetime import datetime + +from web.email_templates import render_job_alert_email + + +def test_render_job_alert_email_with_jobs(): + jobs = [ + { + "title": "Python Developer", + "company": "Acme", + "location": "Remote", + "url": "https://example.com/jobs/1", + }, + { + "title": "Data Engineer", + "company": "Globex", + "location": "NYC", + "url": "https://example.com/jobs/2", + }, + ] + ts = datetime(2025, 11, 3, 12, 0) + rendered = render_job_alert_email( + jobs, region="sfbay", keyword="python", generated_at=ts) + + assert rendered["subject"] == "2 new jobs (region: sfbay, keyword: python)" + assert "1. 
Python Developer" in rendered["body"] + assert "Generated at 2025-11-03 12:00 UTC." in rendered["body"] + assert rendered["context"]["count"] == 2 + assert rendered["context"]["jobs_section"].startswith( + "\n1. Python Developer") + + +def test_render_job_alert_email_empty(): + ts = datetime(2025, 11, 3, 12, 0) + rendered = render_job_alert_email([], generated_at=ts) + + assert rendered["subject"] == "No new jobs" + assert "No jobs matched this alert." in rendered["body"] + assert rendered["body"].count("Generated at") == 1 + assert rendered["context"]["count"] == 0 diff --git a/tests/test_scheduler.py b/tests/test_scheduler.py index bf3a002..c9c620d 100644 --- a/tests/test_scheduler.py +++ b/tests/test_scheduler.py @@ -1,7 +1,7 @@ import pytest import time from unittest.mock import patch, MagicMock -from web.craigslist import scrape_jobs_with_retry, run_scheduled_scraping +from web.craigslist import scrape_jobs_with_retry, run_scheduled_scraping, fetch_listings class TestScheduler: @@ -38,3 +38,100 @@ class TestScheduler: # This is a basic test to ensure the scheduler can be set up from web.craigslist import schedule assert schedule is not None + + @patch('web.craigslist.db_get_all_job_urls') + @patch('web.craigslist.seed_regions_keywords_from_listings') + @patch('web.craigslist.get_all_regions') + @patch('web.craigslist.get_all_keywords') + @patch('web.craigslist.get_last_fetch_time') + @patch('web.craigslist.process_region_keyword') + @patch('web.craigslist.upsert_listing') + @patch('web.craigslist.insert_log') + def test_fetch_listings_return_structure(self, mock_log, mock_upsert, mock_process, mock_last_fetch, + mock_keywords, mock_regions, mock_seed, mock_db_urls): + """Test that fetch_listings returns the correct structure with per-search counts.""" + # Setup mocks + mock_db_urls.return_value = [] + mock_regions.return_value = [{"name": "sfbay"}] + mock_keywords.return_value = [{"name": "python"}] + mock_last_fetch.return_value = None # Never fetched before + mock_process.return_value = [ + ("2025-11-03T10:00:00Z", "sfbay", "python", "Python Dev", + "$100k", "San Francisco", "http://example.com/1"), + ("2025-11-03T10:00:00Z", "sfbay", "python", "Python Dev", + "$100k", "San Francisco", "http://example.com/2"), + ] + + # Collect messages and get return value from generator + gen = fetch_listings() + messages = [] + result = None + try: + while True: + messages.append(next(gen)) + except StopIteration as e: + result = e.value + + # Verify return structure + assert result is not None + assert "discovered" in result + assert "new" in result + assert "by_search" in result + assert isinstance(result.get("by_search"), list) + assert result.get("discovered") == 2 + assert result.get("new") == 2 + + @patch('web.craigslist.db_get_all_job_urls') + @patch('web.craigslist.seed_regions_keywords_from_listings') + @patch('web.craigslist.get_all_regions') + @patch('web.craigslist.get_all_keywords') + @patch('web.craigslist.get_last_fetch_time') + @patch('web.craigslist.process_region_keyword') + @patch('web.craigslist.upsert_listing') + @patch('web.craigslist.insert_log') + def test_fetch_listings_per_search_count(self, mock_log, mock_upsert, mock_process, mock_last_fetch, + mock_keywords, mock_regions, mock_seed, mock_db_urls): + """Test that fetch_listings correctly counts jobs per search.""" + # Setup mocks + mock_db_urls.return_value = [] + mock_regions.return_value = [{"name": "sfbay"}, {"name": "losangeles"}] + mock_keywords.return_value = [{"name": "python"}, {"name": "java"}] + 
mock_last_fetch.return_value = None # Never fetched before + + # Mock process_region_keyword to return different counts for each search + def mock_process_impl(region, keyword, discovered_urls): + # Use unique URLs per search to get the total discovered count + base_url = f"http://example.com/{region}/{keyword}" + counts = { + ("sfbay", "python"): 3, + ("sfbay", "java"): 2, + ("losangeles", "python"): 4, + ("losangeles", "java"): 1, + } + count = counts.get((region, keyword), 0) + return [(f"2025-11-03T10:00:00Z", region, keyword, f"Job {i}", "$100k", region, f"{base_url}/{i}") + for i in range(count)] + + mock_process.side_effect = mock_process_impl + + # Collect result from generator + gen = fetch_listings() + messages = [] + result = None + try: + while True: + messages.append(next(gen)) + except StopIteration as e: + result = e.value + + # Verify per-search counts + assert result is not None + by_search = result.get("by_search", []) + assert len(by_search) == 4 + + search_data = {(r.get("region"), r.get("keyword")) : r.get("count") for r in by_search} + assert search_data.get(("sfbay", "python")) == 3 + assert search_data.get(("sfbay", "java")) == 2 + assert search_data.get(("losangeles", "python")) == 4 + assert search_data.get(("losangeles", "java")) == 1 + assert result.get("discovered") == 10 # Total unique jobs diff --git a/tests/test_scraper.py b/tests/test_scraper.py new file mode 100644 index 0000000..1989c98 --- /dev/null +++ b/tests/test_scraper.py @@ -0,0 +1,384 @@ +import pytest +from web.scraper import scrape_job_page, extract_contact_info +from web.craigslist import process_job_url, scraper + + +def _make_negative_job(url: str) -> dict: + return { + "url": url, + "title": "SCAM role", + "company": "Test Co", + "location": "Remote", + "description": "This is a scam offer", + "id": "job123", + "posted_time": "", + "reply_url": "N/A", + "contact_email": "N/A", + "contact_phone": "N/A", + "contact_name": "N/A", + "is_negative_match": True, + "negative_keyword_match": "scam", + "negative_match_field": "title", + } + + +class TestExtractContactInfo: + """Test suite for contact information extraction.""" + + def test_extract_email_from_mailto_link(self): + """Test extraction of email from mailto link.""" + reply_url = "mailto:contact@example.com?subject=Job%20Inquiry" + contact_info = extract_contact_info(reply_url) + + assert contact_info["email"] == "contact@example.com" + assert contact_info["phone"] == "N/A" + assert contact_info["contact_name"] == "N/A" + + def test_extract_phone_from_tel_link(self): + """Test extraction of phone from tel link.""" + reply_url = "tel:+1234567890" + contact_info = extract_contact_info(reply_url) + + assert contact_info["email"] == "N/A" + assert contact_info["phone"] == "+1234567890" + assert contact_info["contact_name"] == "N/A" + + def test_extract_email_from_url_parameter(self): + """Test extraction of email from URL query parameters.""" + reply_url = "https://example.com/contact?email=jobs@company.com&name=John%20Doe" + contact_info = extract_contact_info(reply_url) + + assert contact_info["email"] == "jobs@company.com" + assert contact_info["contact_name"] == "John Doe" + + def test_extract_phone_from_url_parameter(self): + """Test extraction of phone from URL query parameters.""" + reply_url = "https://example.com/apply?phone=555-1234&email=contact@test.com" + contact_info = extract_contact_info(reply_url) + + assert contact_info["phone"] == "555-1234" + assert contact_info["email"] == "contact@test.com" + + def 
test_extract_contact_name_from_url_parameter(self): + """Test extraction of contact name from URL query parameters.""" + reply_url = "https://example.com/reply?name=Alice%20Smith&contact_name=Bob%20Jones" + contact_info = extract_contact_info(reply_url) + + # Should prefer contact_name over name + assert contact_info["contact_name"] == "Bob Jones" + + def test_extract_all_fields_from_url(self): + """Test extraction of all fields from URL parameters.""" + reply_url = "https://example.com/contact?email=hr@company.com&phone=555-9876&contact_name=Jane%20Doe" + contact_info = extract_contact_info(reply_url) + + assert contact_info["email"] == "hr@company.com" + assert contact_info["phone"] == "555-9876" + assert contact_info["contact_name"] == "Jane Doe" + + def test_handle_empty_reply_url(self): + """Test handling of empty reply URL.""" + contact_info = extract_contact_info("") + + assert contact_info["email"] == "N/A" + assert contact_info["phone"] == "N/A" + assert contact_info["contact_name"] == "N/A" + + def test_handle_na_reply_url(self): + """Test handling of N/A reply URL.""" + contact_info = extract_contact_info("N/A") + + assert contact_info["email"] == "N/A" + assert contact_info["phone"] == "N/A" + assert contact_info["contact_name"] == "N/A" + + def test_handle_none_reply_url(self): + """Test handling of None reply URL.""" + contact_info = extract_contact_info(None) + + assert contact_info["email"] == "N/A" + assert contact_info["phone"] == "N/A" + assert contact_info["contact_name"] == "N/A" + + def test_handle_invalid_url(self): + """Test handling of invalid URL (graceful fallback).""" + reply_url = "not a valid url at all" + contact_info = extract_contact_info(reply_url) + + # Should return all N/A values without crashing + assert contact_info["email"] == "N/A" + assert contact_info["phone"] == "N/A" + assert contact_info["contact_name"] == "N/A" + + def test_multiple_parameter_variations(self): + """Test that function finds email despite multiple parameter name variations.""" + reply_url = "https://example.com/reply?from_email=sender@test.com&other=value" + contact_info = extract_contact_info(reply_url) + + assert contact_info["email"] == "sender@test.com" + + def test_telephone_parameter_name(self): + """Test extraction using 'telephone' parameter name.""" + reply_url = "https://example.com/contact?telephone=555-0000" + contact_info = extract_contact_info(reply_url) + + assert contact_info["phone"] == "555-0000" + + +class TestScrapeJobPageContactInfo: + """Test suite for scrape_job_page contact information extraction.""" + + def test_scrape_job_page_includes_contact_fields(self): + """Test that scrape_job_page includes contact information in return dict.""" + html_content = """ + +
+        <html><body>
+            <section id="postingbody">This is a test job description</section>
+            <p class="postinginfo">posting id: 12345abc</p>
+        </body></html>
+        """
+        # NOTE: the tags above are an illustrative reconstruction; only the
+        # text nodes of the original fixture survive in this hunk.
+        # The test feeds this page through scrape_job_page and asserts that
+        # the contact fields appear in the returned dict, e.g.:
+        #     for field in ("reply_url", "contact_email",
+        #                   "contact_phone", "contact_name"):
+        #         assert field in job_data
+        # The remaining fixtures in this file follow the same pattern; their
+        # surviving text nodes cover a minimal posting ("Job desc" / "id: xyz",
+        # used twice), an "Apply now" reply button (posting "id: manager123"),
+        # and a negative-keyword posting ("This is a scam offer" /
+        # "We pay well and on time.").