A backend-focused system for tracking product listings, price movements, and scraping events with reliability, auditability, and scalability in mind.
Author’s note
I built MLScraper as a hands-on, practical system to track product listings and prices across multiple Mexican and international e-commerce sites. I use it to spot new listings and meaningful price drops, and to push alerts to Telegram and a small web UI. Below I explain what the project is, how it is structured, why I made the engineering choices I did, and what I would improve next.
Overview
MLScraper is a modular, asyncio-based web scraper designed to:
- Crawl search results and listing pages from multiple online retailers (e.g., MercadoLibre, Amazon MX, Liverpool, El Palacio de Hierro).
- Normalize results into a single `Article` model and keep per-search persistent history.
- Detect new listings and price drops and deliver notifications via Telegram and a small websocket-driven frontend.
- Persist state to JSON files in `./data/` to keep the system simple and portable.
The project is implemented in Python 3, uses `aiohttp` for HTTP operations, `FastAPI` + WebSockets for a lightweight dashboard, and simple JSON files for storage. The codebase is organized so you can add new providers (sites) by implementing a `Motor` subclass and adding it to the motor generator.
High-Level Goals
When I designed MLScraper I had these goals in mind:
- Modularity — Each provider (MercadoLibre, Amazon, Liverpool, Palacio de Hierro, etc.) should be encapsulated as a `Motor` so adding a new site is straightforward.
- Asynchronous efficiency — Use `asyncio`/`aiohttp` to keep I/O non-blocking and allow concurrent fetching/pagination.
- Simple persistence — Store results as JSON per-search so the system is easy to inspect, move, and debug.
- Actionable alerts — Notify on two important events:
  - A new item is discovered for a tracked search.
  - A significant price drop (the implementation uses a 14% threshold).
- Developer ergonomics — Include a small static UI served by `FastAPI` and a devcontainer for reproducible development.
System Architecture
Below is a simplified architecture diagram in ASCII to explain the moving parts at a glance:
+--------------------+ +-----------------+ +----------------------+
| provider motors | -----> | Scraper engine | -----> | Local JSON storage |
| (amazon, ml, etc.) | | (Motor orchestr)| | (./data/*.json) |
+--------------------+ +-----------------+ +----------------------+
| |
| +--> Notifier (utils/telegram.py) -> Telegram Bot
| |
|                 +--> Web API / WebSocket (FastAPI app.py) -> Browser UI

Key components and where they live in the repo:
- `app.py` — FastAPI app and WebSocket connection manager (web UI + live updates).
- `scrapper.py` — Orchestrates periodic scraping cycles across motors and handles global logic for broadcasting notifications.
- `provider/*` — Provider implementations: `provider/mercado_libre/*`, `provider/amazon/*`, `provider/liverpool/*`, `provider/palacio_de_hierro/*`. Each provider implements a `scrape_page` method that extracts listings and pagination links.
- `scraper/motor.py` — Abstract base `Motor` class (core scraping loop, pagination handling, state transitions).
- `scraper/article.py` — `Article` dataclass and `ArticleHistory` for price/time history.
- `scraper/stream.py` — `Stream` container for `active` and `finished` listings.
- `utils/` — Helpers: `file_manager.py`, `headers.py` (random user agents), `telegram.py`, and `secret.py` (credentials).
- `static/` — Minimal dashboard UI served by FastAPI (HTML, CSS, JS, notification sound).
- `.devcontainer/` — Dockerfile and devcontainer config to reproduce the dev environment.
How the Project Works
In this section I walk through the flow from configuration to alert delivery.
Configuration & Entry Points
To run the project a few configuration points must be defined.
Telegram credentials
Inside `utils/secret.py` the Telegram bot credentials are configured:

```python
apiToken, chatID = "", ""
```

Once these values are filled in, the scraper will automatically send notifications when events are detected.
Search configuration
All searches are defined in:
`provider/generator.py`
This module returns the list of motors that the system will run.
Example:
```python
from .mercado_libre.motor import MercadoLibre as ML
from .liverpool.motor import Liverpool as LV
from .amazon.motor import Amazon as AZ, Seller
from .palacio_de_hierro.motor import PalacioDeHierro as PH

def get_motors():
    return [
        ML('zelda wii'),
        LV(search_term='LV PS5', url='https://www.liverpool.com.mx/...'),
        AZ(search_term='pokemon tcg', seller=Seller.amazon_mx),
        PH(search_term='PH Electrodomesticos', url='https://www.elpalaciodehierro.com/...'),
    ]
```

Each element defines:
- the search label
- the initial search URL
- the provider motor responsible for scraping
Running the application
The project can be executed with:
```shell
python app.py
```

or with uvicorn:

```shell
uvicorn app:app --host 0.0.0.0 --port 8000
```

A development container is also included to run everything inside Docker with a reproducible environment.
The Motor Abstraction
The Motor class is the core abstraction of the scraper.
Every provider extends the base class located at:
`scraper/motor.py`
Each motor must implement a function with the following responsibility:
```python
def scrape_page(self, body):
    ...
```

This method receives the HTTP response body and must:
- Parse the HTML
- Extract the products
- Return structured results
The method returns:
```python
(items, next_url)
```

Where:

- `items` → list of parsed products
- `next_url` → optional pagination link
This architecture allows the main scraper loop to stay generic while each provider focuses only on parsing.
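To make the contract concrete, here is a minimal sketch of the base class plus a toy provider. The constructor shape and the `DemoMotor` parser are assumptions for illustration only; the real base class lives in `scraper/motor.py` and the real motors parse retailer HTML.

```python
from abc import ABC, abstractmethod

class Motor(ABC):
    """Sketch of the base Motor class (assumed shape, for illustration)."""
    def __init__(self, search_term, url=''):
        self.search_term = search_term
        self.url = url

    @abstractmethod
    def scrape_page(self, body):
        """Parse one HTML response body; return (items, next_url)."""

class DemoMotor(Motor):
    # Toy provider: real motors parse retailer HTML; this one reads a
    # trivial "title|price" line format purely to show the contract.
    def scrape_page(self, body):
        items = []
        for line in body.splitlines():
            title, sep, price = line.partition('|')
            if sep:
                items.append({'title': title, 'price': float(price)})
        return items, None  # None means there is no next page to crawl
```

Any subclass that honors the `(items, next_url)` return shape plugs straight into the generic loop.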
The Scraping Pipeline
The scraping process follows a consistent pipeline:
- Create an async HTTP session (`aiohttp`)
- Fetch the search page
- Parse results with the provider motor
- Normalize items into `Article` objects
- Compare them with previous results
- Detect new listings or price changes
- Persist results
- Trigger notifications
The core loop lives inside the Motor implementation.
Network requests include retry logic with exponential backoff to handle temporary failures.
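A retry helper along these lines is a common way to implement that; the function name, parameters, and defaults here are hypothetical, not the project's actual code.

```python
import asyncio
import random

async def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    """Retry an async fetch with exponential backoff plus a little jitter.
    Illustrative sketch -- `fetch` is any awaitable HTTP call."""
    for attempt in range(retries):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error to the caller
            # 1x, 2x, 4x... the base delay, with jitter to avoid lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)
```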
Data Model
Each product is represented by an `Article` object.
Example JSON representation:

```json
{
  "search_term": "AZ amazon_usa - DK books",
  "url": "https://www.amazon.com.mx/dp/1465482512/",
  "identifier": "1465482512",
  "title": "Zoology: Inside the Secret World of Animals",
  "price": 825.57,
  "datetime": "2024-11-16 07:23:23.193375",
  "status": "active",
  "history": [
    {
      "datetime": "2024-11-17 23:35:34.266925",
      "price": 374.36
    },
    {
      "datetime": "2024-11-15 11:08:11.463581",
      "price": 378.75
    }
  ]
}
```

Important characteristics:
- Each article contains a price history.
- Status can be `active` or `finished`.
- The identifier allows deduplication across scrapes.
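A minimal dataclass mirroring the JSON above might look like this, with a helper that applies the ~14% drop threshold. The real class in `scraper/article.py` may differ; the `record_price` method name is illustrative.

```python
from dataclasses import dataclass, field

DROP_THRESHOLD = 0.14  # the ~14% price-drop trigger mentioned above

@dataclass
class Article:
    """Sketch of the Article model; field names mirror the JSON example."""
    search_term: str
    url: str
    identifier: str
    title: str
    price: float
    status: str = 'active'
    history: list = field(default_factory=list)

    def record_price(self, price, when):
        """Append a history entry and report whether this is a notable drop."""
        dropped = self.price > 0 and (self.price - price) / self.price > DROP_THRESHOLD
        self.history.append({'datetime': when, 'price': price})
        self.price = price
        return dropped
```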
Persistence Layer
MLScraper intentionally keeps persistence extremely simple.
All results are stored as JSON files under `./data/`. Each search generates its own file.
Advantages of this approach:
- Human-readable
- No database required
- Easy debugging
- Portable
However, it does not scale well to large datasets or concurrent writers.
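The per-search layout can be sketched with two small helpers. `save_search` and `load_search` are hypothetical names for illustration; the real helpers live in `utils/file_manager.py`.

```python
import json
from pathlib import Path

DATA_DIR = Path('./data')

def _path_for(search_term):
    # one standalone file per search, spaces made filesystem-friendly
    return DATA_DIR / (search_term.replace(' ', '_') + '.json')

def save_search(search_term, articles):
    """Write one search's articles to its own JSON file."""
    DATA_DIR.mkdir(exist_ok=True)
    _path_for(search_term).write_text(
        json.dumps(articles, indent=2, ensure_ascii=False))

def load_search(search_term):
    """Read a search's articles back, or return [] on first run."""
    path = _path_for(search_term)
    return json.loads(path.read_text()) if path.exists() else []
```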
Notification System
The project currently implements two notification channels.
Telegram Alerts
Telegram integration is implemented in:
`utils/telegram.py`
Two alert types exist.
New item detected:
```python
send_new_to_telegram(article)
```

Price drop detected:

```python
send_price_drop_to_telegram(article)
```

The alert includes:
- product title
- link
- previous price
- new price
- timestamp
A price-drop notification triggers when the drop exceeds roughly 14%.
Web Dashboard
A small dashboard is served using FastAPI.
`app.py` exposes:

- the static frontend
- a websocket endpoint at `/ws/`
The UI listens to the websocket and displays real-time events.
When a new product is detected the page also plays a notification sound.
HTTP Strategy
To reduce scraping blocks the system uses randomized headers.
Located in:
`utils/headers.py`
It randomly selects modern browser User-Agent strings and sets additional client hints to mimic real browser traffic.
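In spirit, the helper does something like the following. The user-agent strings and extra header values below are illustrative examples, not the project's actual list.

```python
import random

# Hypothetical pool -- utils/headers.py maintains its own UA list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
]

def random_headers():
    """Build a browser-like header set, varying the fingerprint per request."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "es-MX,es;q=0.9,en;q=0.8",
        "Sec-CH-UA-Mobile": "?0",  # client hint: claim to be a desktop browser
    }
```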
Requests also include:
- retry logic
- exponential backoff
- timeout protection
This makes the scraper more resilient against temporary failures.
Engineering Decisions & Design Tradeoffs
Several design decisions were made intentionally to balance simplicity and capability.
JSON vs Database
Decision: Store results in JSON files.
Pros
- No external dependency
- Easy inspection
- Portable
- Great for small projects
Cons
- Poor scalability
- No indexing
- Not safe for concurrent writes
A future version will likely migrate to SQLite or Postgres.
HTML Scraping vs Official APIs
Decision: Parse HTML pages.
Pros
- Works with any public search page
- No API keys needed
- More flexible
Cons
- Fragile when page layouts change
- Potentially against terms of service
- Requires constant maintenance
Async Scraping with Sync Notifications
Scraping uses `aiohttp` while Telegram calls use `requests`.
This simplifies implementation but theoretically could block the event loop if many alerts fire simultaneously.
For a personal project the tradeoff was acceptable.
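If blocking ever became a problem, one low-effort fix is to push the synchronous call onto a thread via `run_in_executor`. This is a sketch of that idea, not the current implementation; `send_telegram_sync` stands in for the blocking `requests` call in `utils/telegram.py`.

```python
import asyncio

def send_telegram_sync(text):
    # stand-in for the blocking requests.post call to the Telegram API
    return f"sent: {text}"

async def send_telegram(text):
    """Run the blocking sender on the default thread pool so the event
    loop keeps scraping while the Telegram request round-trips."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, send_telegram_sync, text)
```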
Single Process Architecture
Everything runs inside one process:
- scraper
- API server
- notification logic
Advantages:
- easy debugging
- minimal deployment complexity
Disadvantages:
- limited scalability
- less fault isolation
Why This Project Matters
MLScraper started as a practical tool but it also became a valuable learning project.
It touches many real-world engineering concerns:
- asynchronous networking
- scraping reliability
- system architecture
- data persistence
- event notification
- containerized development
Beyond the technical aspects, it solves a real problem: tracking product availability and pricing across multiple marketplaces without constantly checking websites manually.
Future Improvements
There are many improvements I would like to implement.
Database Migration
Replace JSON storage with SQLite or Postgres to enable:
- indexing
- better queries
- analytics
- multi-process safety
Async Notification Pipeline
Introduce a queue-based notification system to prevent blocking.
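Such a pipeline could be as simple as an `asyncio.Queue` drained by a worker task. This is a sketch of the proposed design, not existing code; `send` is any async delivery callable.

```python
import asyncio

async def notifier_worker(queue, send):
    """Drain alerts from a queue so a slow channel never stalls scraping."""
    while True:
        alert = await queue.get()
        try:
            if alert is None:   # sentinel: shut the worker down
                return
            await send(alert)   # delivery happens off the scrape path
        finally:
            queue.task_done()
```

Producers (the motors) would just `await queue.put(alert)` and move on.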
Proxy Support
Add rotating proxies and optional headless browser support using Playwright for sites with stronger bot protection.
Configurable Searches
Move search configuration to YAML or a database so searches can be modified without editing code.
Historical Analytics
Add charts and analytics for:
- price trends
- average discounts
- best purchase windows
Automated Tests
Provider parsers should include unit tests using stored HTML snapshots to detect breakage.
Distributed Scraping
Move motors to worker processes coordinated by a scheduler.
Closing Thoughts
MLScraper is intentionally simple but surprisingly capable.
By combining a modular scraping architecture, asynchronous networking, and lightweight persistence, it provides a flexible platform for monitoring product listings across multiple online stores.
The system is easy to extend: adding a new provider typically means writing a single parser and registering it in the generator.
For me, the project strikes a nice balance between experimentation and real-world usefulness. It has already helped me discover listings and price drops that I would have otherwise missed.
If you are interested in scraping systems, event-driven architectures, or simply automating repetitive browsing tasks, building something like MLScraper is an excellent exercise.
And if you decide to extend it — adding a new provider, improving the architecture, or integrating analytics — you’ll quickly see how a small personal tool can evolve into a surprisingly sophisticated system.
