Overview
Welcome to our Static HTML Scraper project! This app demonstrates how to efficiently scrape data from classic, static HTML websites using two lightweight tools: requests and BeautifulSoup.
Project Workflow
Many modern websites render their content with JavaScript, so scraping them requires browser automation tools. However, some websites still serve all product data directly in the HTML source. This project focuses on such sites to demonstrate fast, simple, and respectful scraping without the overhead of browser automation.
Data Collection
We start by parsing the site’s XML sitemap to discover all URLs, then filter and scrape relevant product pages to extract detailed product information such as name, price, description, SKU, and images.
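The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the "/product/" URL marker and every CSS selector (h1.product-title, span.price, div.description, span.sku, img.product-image) are hypothetical and would need to match the target site's markup. The sitemap is parsed with the standard library's ElementTree, and the product page with BeautifulSoup.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Union

from bs4 import BeautifulSoup

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(sitemap_xml: Union[str, bytes]) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def product_urls(urls: list[str], marker: str = "/product/") -> list[str]:
    """Keep URLs that look like product pages (the marker is an assumption)."""
    return [url for url in urls if marker in url]


def parse_product(html: str) -> dict:
    """Extract product fields from one page; all selectors are hypothetical."""
    soup = BeautifulSoup(html, "html.parser")

    def text_of(selector: str) -> Optional[str]:
        # Return the stripped text of the first match, or None if absent.
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else None

    return {
        "name": text_of("h1.product-title"),
        "price": text_of("span.price"),
        "description": text_of("div.description"),
        "sku": text_of("span.sku"),
        "images": [
            img["src"]
            for img in soup.select("img.product-image")
            if img.get("src")
        ],
    }
```

In a real run, the sitemap and page bodies would come from requests, for example requests.get(url, timeout=10).content for the sitemap and .text for each product page. If the site publishes a sitemap index, the URLs returned by urls_from_sitemap point to further sitemaps that need the same treatment.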
Interactive Analysis
The extracted product data is presented in an interactive Streamlit app where you can explore the dataset, view summaries, and visualize key insights such as the most expensive products and price distributions.
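The kind of summary the app displays can be illustrated in plain Python, independent of Streamlit. This is only a sketch of the underlying calculations, not the app's own code: the function names are hypothetical, and parse_price assumes prices use a dot as the decimal separator and commas only as thousands separators.

```python
import re
from collections import Counter
from heapq import nlargest
from statistics import median


def parse_price(raw: str) -> float:
    """Turn a scraped price string such as '$1,299.00' into a float."""
    # Drop everything except digits and the decimal point (assumed format).
    return float(re.sub(r"[^\d.]", "", raw))


def most_expensive(products: list[dict], n: int = 3) -> list[dict]:
    """Return the n products with the highest numeric 'price'."""
    return nlargest(n, products, key=lambda product: product["price"])


def price_histogram(prices: list[float], bin_width: float = 10.0) -> dict[float, int]:
    """Count prices per bin; each key is the lower edge of its bin."""
    counts = Counter((price // bin_width) * bin_width for price in prices)
    return dict(sorted(counts.items()))


def summarize(prices: list[float]) -> dict:
    """Basic descriptive statistics for a list of prices."""
    return {
        "count": len(prices),
        "min": min(prices),
        "max": max(prices),
        "median": median(prices),
    }
```

The histogram dictionary maps directly onto a bar chart, and the output of most_expensive onto a ranked table, which is roughly what an interactive view would render.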
Why This Project?
This project shows that many valuable data sources are accessible without any JavaScript handling, which makes lightweight scraping both effective and efficient.
Scalability
While the demo works on a subset of products to be respectful to the site, the approach scales to larger datasets and different static HTML websites.
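One way to keep a run limited and polite is to cap the number of pages and pause between requests. The sketch below is an assumption about how that could look, not the project's implementation: scrape_subset, its limit of 20 pages, and the one-second delay are all hypothetical. The fetch function is passed in so the same loop works with requests or with a stub.

```python
import time
from itertools import islice
from typing import Callable, Iterable


def scrape_subset(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    limit: int = 20,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Fetch at most `limit` pages, pausing between consecutive requests."""
    pages = {}
    for index, url in enumerate(islice(urls, limit)):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages
```

With requests, fetch could be a small function that returns requests.get(url, timeout=10).text. Raising limit is what scaling to a larger dataset amounts to in this sketch; the delay keeps the request rate constant regardless of how many pages are collected.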
Try the App
Explore the full workflow from sitemap parsing to data extraction and interactive visualization by launching the app.