A backend-focused system for tracking product listings, price movements, and scraping events with reliability, auditability, and scalability in mind.
Author’s note
I built MLScraper as a hands-on, practical system to track product listings and prices across multiple Mexican and international e-commerce sites. I use it to spot new listings and meaningful price drops, and to push alerts to Telegram and a small web UI. Below I explain what the project is, how it is structured, why I made the engineering choices I did, and what I would improve next.
Overview
MLScraper is a modular, asyncio-based web scraper designed to:
- Crawl search results and listing pages from multiple online retailers (e.g., MercadoLibre, Amazon MX, Liverpool, El Palacio de Hierro).
- Normalize results into a single `Article` model and keep per-search persistent history.
- Detect new listings and price drops and deliver notifications via Telegram and a small websocket-driven frontend.
- Persist state to JSON files in `./data/` to keep the system simple and portable.
The project is implemented in Python 3, uses `aiohttp` for HTTP operations, `FastAPI` + WebSockets for a lightweight dashboard, and simple JSON files for storage. The codebase is organized so you can add new providers (sites) by implementing a `Motor` subclass and adding it to the motor generator.
High-Level Goals
When I designed MLScraper I had these goals in mind:
- Modularity — Each provider (MercadoLibre, Amazon, Liverpool, Palacio de Hierro, etc.) should be encapsulated as a `Motor` so adding a new site is straightforward.
- Asynchronous efficiency — Use `asyncio`/`aiohttp` to keep I/O non-blocking and allow concurrent fetching/pagination.
- Simple persistence — Store results as JSON per-search so the system is easy to inspect, move, and debug.
- Actionable alerts — Notify on two important events:
  - A new item is discovered for a tracked search.
  - A significant price drop (the implementation uses a 14% threshold).
- Developer ergonomics — Include a small static UI served by `FastAPI` and a devcontainer for reproducible development.
System Architecture
Below is a simplified architecture diagram in ASCII to explain the moving parts at a glance:
+--------------------+ +-----------------+ +----------------------+
| provider motors | -----> | Scraper engine | -----> | Local JSON storage |
| (amazon, ml, etc.) | | (Motor orchestr)| | (./data/*.json) |
+--------------------+ +-----------------+ +----------------------+
| |
| +--> Notifier (utils/telegram.py) -> Telegram Bot
| |
|                 +--> Web API / WebSocket (FastAPI app.py) -> Browser UI

Key components and where they live in the repo:
- `app.py` — FastAPI app and WebSocket connection manager (web UI + live updates).
- `scrapper.py` — Orchestrates periodic scraping cycles across motors and handles global logic for broadcasting notifications.
- `provider/*` — Provider implementations: `provider/mercado_libre/*`, `provider/amazon/*`, `provider/liverpool/*`, `provider/palacio_de_hierro/*`. Each provider implements a `scrape_page` method that extracts listings and pagination links.
- `scraper/motor.py` — Abstract base `Motor` class (core scraping loop, pagination handling, state transitions).
- `scraper/article.py` — `Article` dataclass and `ArticleHistory` for price/time history.
- `scraper/stream.py` — `Stream` container for `active` and `finished` listings.
- `utils/` — Helpers: `file_manager.py`, `headers.py` (random user agents), `telegram.py`, and `secret.py` (credentials).
- `static/` — Minimal dashboard UI served by FastAPI (HTML, CSS, JS, notification sound).
- `.devcontainer/` — Dockerfile and devcontainer config to reproduce the dev environment.
How the Project Works
In this section I walk through the flow from configuration to alert delivery.
Configuration & Entry Points
To run the project a few configuration points must be defined.
Telegram credentials
Inside `utils/secret.py` the Telegram bot credentials are configured:

```python
apiToken, chatID = "", ""
```

Once these values are filled in, the scraper will automatically send notifications when events are detected.
Search configuration
All searches are defined in:
`provider/generator.py`
This module returns the list of motors that the system will run.
Example:
```python
from .mercado_libre.motor import MercadoLibre as ML
from .liverpool.motor import Liverpool as LV
from .amazon.motor import Amazon as AZ, Seller
from .palacio_de_hierro.motor import PalacioDeHierro as PH

def get_motors():
    return [
        ML('zelda wii'),
        LV(search_term='LV PS5', url='https://www.liverpool.com.mx/...'),
        AZ(search_term='pokemon tcg', seller=Seller.amazon_mx),
        PH(search_term='PH Electrodomesticos', url='https://www.elpalaciodehierro.com/...'),
    ]
```

Each element defines:
- the search label
- the initial search URL
- the provider motor responsible for scraping
Running the application
The project can be executed with:
```shell
python app.py
```

or with uvicorn:

```shell
uvicorn app:app --host 0.0.0.0 --port 8000
```

A development container is also included to run everything inside Docker with a reproducible environment.
The Motor Abstraction
The Motor class is the core abstraction of the scraper.
Every provider extends the base class located at:
`scraper/motor.py`
Each motor must implement a function with the following responsibility:
```python
def scrape_page(self, body):
    ...
```

This method receives the HTTP response body and must:
- Parse the HTML
- Extract the products
- Return structured results
The method returns:
```python
(items, next_url)
```

Where:

- `items` → list of parsed products
- `next_url` → optional pagination link
This architecture allows the main scraper loop to stay generic while each provider focuses only on parsing.
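To make the contract concrete, here is a minimal sketch of the base class plus a toy provider. The constructor shape and the `DemoMotor` parser are assumptions for illustration only; the real base class lives in `scraper/motor.py` and the real motors parse retailer HTML.

```python
from abc import ABC, abstractmethod

class Motor(ABC):
    """Sketch of the base Motor class (assumed shape, for illustration)."""
    def __init__(self, search_term, url=''):
        self.search_term = search_term
        self.url = url

    @abstractmethod
    def scrape_page(self, body):
        """Parse one HTML response body; return (items, next_url)."""

class DemoMotor(Motor):
    # Toy provider: real motors parse retailer HTML; this one reads a
    # trivial "title|price" line format purely to show the contract.
    def scrape_page(self, body):
        items = []
        for line in body.splitlines():
            title, sep, price = line.partition('|')
            if sep:
                items.append({'title': title, 'price': float(price)})
        return items, None  # None means there is no next page to crawl
```

Any subclass that honors the `(items, next_url)` return shape plugs straight into the generic loop.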
The Scraping Pipeline
The scraping process follows a consistent pipeline:
- Create an async HTTP session (`aiohttp`)
- Fetch the search page
- Parse results with the provider motor
- Normalize items into `Article` objects
- Compare them with previous results
- Detect new listings or price changes
- Persist results
- Trigger notifications
The core loop lives inside the Motor implementation.
Network requests include retry logic with exponential backoff to handle temporary failures.
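A retry helper along these lines is a common way to implement that; the function name, parameters, and defaults here are hypothetical, not the project's actual code.

```python
import asyncio
import random

async def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    """Retry an async fetch with exponential backoff plus a little jitter.
    Illustrative sketch -- `fetch` is any awaitable HTTP call."""
    for attempt in range(retries):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error to the caller
            # 1x, 2x, 4x... the base delay, with jitter to avoid lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)
```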
Data Model
Each product is represented by an `Article` object.
Example JSON representation:

```json
{
  "search_term": "AZ amazon_usa - DK books",
  "url": "https://www.amazon.com.mx/dp/1465482512/",
  "identifier": "1465482512",
  "title": "Zoology: Inside the Secret World of Animals",
  "price": 825.57,
  "datetime": "2024-11-16 07:23:23.193375",
  "status": "active",
  "history": [
    {
      "datetime": "2024-11-17 23:35:34.266925",
      "price": 374.36
    },
    {
      "datetime": "2024-11-15 11:08:11.463581",
      "price": 378.75
    }
  ]
}
```

Important characteristics:
- Each article contains a price history.
- Status can be `active` or `finished`.
- The identifier allows deduplication across scrapes.
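A minimal dataclass mirroring the JSON above might look like this, with a helper that applies the ~14% drop threshold. The real class in `scraper/article.py` may differ; the `record_price` method name is illustrative.

```python
from dataclasses import dataclass, field

DROP_THRESHOLD = 0.14  # the ~14% price-drop trigger mentioned above

@dataclass
class Article:
    """Sketch of the Article model; field names mirror the JSON example."""
    search_term: str
    url: str
    identifier: str
    title: str
    price: float
    status: str = 'active'
    history: list = field(default_factory=list)

    def record_price(self, price, when):
        """Append a history entry and report whether this is a notable drop."""
        dropped = self.price > 0 and (self.price - price) / self.price > DROP_THRESHOLD
        self.history.append({'datetime': when, 'price': price})
        self.price = price
        return dropped
```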
Persistence Layer
MLScraper intentionally keeps persistence extremely simple.
All results are stored as JSON files under `./data/`. Each search generates its own file.
Advantages of this approach:
- Human-readable
- No database required
- Easy debugging
- Portable
However, it does not scale well to large datasets or concurrent writers.
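The per-search layout can be sketched with two small helpers. `save_search` and `load_search` are hypothetical names for illustration; the real helpers live in `utils/file_manager.py`.

```python
import json
from pathlib import Path

DATA_DIR = Path('./data')

def _path_for(search_term):
    # one standalone file per search, spaces made filesystem-friendly
    return DATA_DIR / (search_term.replace(' ', '_') + '.json')

def save_search(search_term, articles):
    """Write one search's articles to its own JSON file."""
    DATA_DIR.mkdir(exist_ok=True)
    _path_for(search_term).write_text(
        json.dumps(articles, indent=2, ensure_ascii=False))

def load_search(search_term):
    """Read a search's articles back, or return [] on first run."""
    path = _path_for(search_term)
    return json.loads(path.read_text()) if path.exists() else []
```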
Notification System
The project currently implements two notification channels.
Telegram Alerts
Telegram integration is implemented in:
`utils/telegram.py`
Two alert types exist.
New item detected:
```python
send_new_to_telegram(article)
```

Price drop detected:

```python
send_price_drop_to_telegram(article)
```

The alert includes:
- product title
- link
- previous price
- new price
- timestamp
A price-drop notification triggers when the drop exceeds roughly 14%.
Web Dashboard
A small dashboard is served using FastAPI.
`app.py` exposes:

- the static frontend
- a websocket endpoint at `/ws/`
The UI listens to the websocket and displays real-time events.
When a new product is detected the page also plays a notification sound.
HTTP Strategy
To reduce scraping blocks the system uses randomized headers.
Located in:
`utils/headers.py`
It randomly selects modern browser User-Agent strings and sets additional client hints to mimic real browser traffic.
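In spirit, the helper does something like the following. The user-agent strings and extra header values below are illustrative examples, not the project's actual list.

```python
import random

# Hypothetical pool -- utils/headers.py maintains its own UA list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
]

def random_headers():
    """Build a browser-like header set, varying the fingerprint per request."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "es-MX,es;q=0.9,en;q=0.8",
        "Sec-CH-UA-Mobile": "?0",  # client hint: claim to be a desktop browser
    }
```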
Requests also include:
- retry logic
- exponential backoff
- timeout protection
This makes the scraper more resilient against temporary failures.
Engineering Decisions & Design Tradeoffs
Several design decisions were made intentionally to balance simplicity and capability.
JSON vs Database
Decision: Store results in JSON files.
Pros
- No external dependency
- Easy inspection
- Portable
- Great for small projects
Cons
- Poor scalability
- No indexing
- Not safe for concurrent writes
A future version will likely migrate to SQLite or Postgres.
HTML Scraping vs Official APIs
Decision: Parse HTML pages.
Pros
- Works with any public search page
- No API keys needed
- More flexible
Cons
- Fragile when page layouts change
- Potentially against terms of service
- Requires constant maintenance
Async Scraping with Sync Notifications
Scraping uses `aiohttp` while Telegram calls use `requests`.
This simplifies implementation but theoretically could block the event loop if many alerts fire simultaneously.
For a personal project the tradeoff was acceptable.
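If blocking ever became a problem, one low-effort fix is to push the synchronous call onto a thread via `run_in_executor`. This is a sketch of that idea, not the current implementation; `send_telegram_sync` stands in for the blocking `requests` call in `utils/telegram.py`.

```python
import asyncio

def send_telegram_sync(text):
    # stand-in for the blocking requests.post call to the Telegram API
    return f"sent: {text}"

async def send_telegram(text):
    """Run the blocking sender on the default thread pool so the event
    loop keeps scraping while the Telegram request round-trips."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, send_telegram_sync, text)
```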
Single Process Architecture
Everything runs inside one process:
- scraper
- API server
- notification logic
Advantages:
- easy debugging
- minimal deployment complexity
Disadvantages:
- limited scalability
- less fault isolation
Why This Project Matters
MLScraper started as a practical tool but it also became a valuable learning project.
It touches many real-world engineering concerns:
- asynchronous networking
- scraping reliability
- system architecture
- data persistence
- event notification
- containerized development
Beyond the technical aspects, it solves a real problem: tracking product availability and pricing across multiple marketplaces without constantly checking websites manually.
Future Improvements
There are many improvements I would like to implement.
Database Migration
Replace JSON storage with SQLite or Postgres to enable:
- indexing
- better queries
- analytics
- multi-process safety
Async Notification Pipeline
Introduce a queue-based notification system to prevent blocking.
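Such a pipeline could be as simple as an `asyncio.Queue` drained by a worker task. This is a sketch of the proposed design, not existing code; `send` is any async delivery callable.

```python
import asyncio

async def notifier_worker(queue, send):
    """Drain alerts from a queue so a slow channel never stalls scraping."""
    while True:
        alert = await queue.get()
        try:
            if alert is None:   # sentinel: shut the worker down
                return
            await send(alert)   # delivery happens off the scrape path
        finally:
            queue.task_done()
```

Producers (the motors) would just `await queue.put(alert)` and move on.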
Proxy Support
Add rotating proxies and optional headless browser support using Playwright for sites with stronger bot protection.
Configurable Searches
Move search configuration to YAML or a database so searches can be modified without editing code.
Historical Analytics
Add charts and analytics for:
- price trends
- average discounts
- best purchase windows
Automated Tests
Provider parsers should include unit tests using stored HTML snapshots to detect breakage.
Distributed Scraping
Move motors to worker processes coordinated by a scheduler.
Closing Thoughts
MLScraper is intentionally simple but surprisingly capable.
By combining a modular scraping architecture, asynchronous networking, and lightweight persistence, it provides a flexible platform for monitoring product listings across multiple online stores.
The system is easy to extend: adding a new provider typically means writing a single parser and registering it in the generator.
For me, the project strikes a nice balance between experimentation and real-world usefulness. It has already helped me discover listings and price drops that I would have otherwise missed.
If you are interested in scraping systems, event-driven architectures, or simply automating repetitive browsing tasks, building something like MLScraper is an excellent exercise.
And if you decide to extend it — adding a new provider, improving the architecture, or integrating analytics — you’ll quickly see how a small personal tool can evolve into a surprisingly sophisticated system.
