Overview
Welcome to our Dirty Data application! The process starts with creating a synthetic dataset filled with a variety of shoe sales data. From there, we deliberately "dirty" this data by introducing inconsistencies, missing values, duplicates, and more. Once the data is intentionally flawed, we walk through the process of analyzing, cleaning, and transforming it into a reliable dataset for further analysis and modeling.
Data Creation
We begin by generating a synthetic dataset of shoe sales. The dataset includes Product ID, Age Group, Gender, Product Type, Size, Color, Price, and Stock Quantity. Each field is generated randomly, so every column covers a wide range of values and gives us a rich dataset to manipulate and clean.
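As a rough illustration of this step, the sketch below builds such a dataset with pandas and NumPy. The function name `make_shoe_sales`, the category values, and the numeric ranges are assumptions made for the example; the app's actual generator may differ.

```python
import numpy as np
import pandas as pd


def make_shoe_sales(n_rows: int = 100, seed: int = 0) -> pd.DataFrame:
    """Build a random shoe sales table (all value ranges are illustrative)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "Product ID": np.arange(1, n_rows + 1),
        "Age Group": rng.choice(["Kids", "Teens", "Adults"], size=n_rows),
        "Gender": rng.choice(["Male", "Female"], size=n_rows),
        "Product Type": rng.choice(["Sneakers", "Boots", "Sandals"], size=n_rows),
        "Size": rng.integers(5, 14, size=n_rows),          # whole sizes 5 to 13
        "Color": rng.choice(["Black", "White", "Red", "Blue"], size=n_rows),
        "Price": rng.uniform(20, 200, size=n_rows).round(2),
        "Stock Quantity": rng.integers(0, 500, size=n_rows),
    })


shoes = make_shoe_sales()
print(shoes.head())
```

Passing a fixed `seed` keeps the generated data reproducible between runs, which makes the later cleaning steps easier to compare.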
Adding Dirt to the Data
Once our synthetic dataset is ready, we introduce several kinds of "dirt" into it. This includes:
- Missing Values: We randomly introduce missing values in fields like Price and Stock Quantity.
- Duplicates: Some of the rows are duplicated to simulate repeated entries in the data.
- Inconsistencies: For example, we randomly rewrite some Gender values as 'male' or 'FEMALE' to create casing inconsistencies.
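The three kinds of dirt listed above could be injected roughly as follows. This is a minimal sketch: the helper `add_dirt`, its fraction parameters, and the small stand-in frame are hypothetical and only meant to show the idea.

```python
import numpy as np
import pandas as pd


def add_dirt(df, missing_frac=0.1, casing_frac=0.3, dup_frac=0.1, seed=0):
    """Return a copy of df with missing values, casing noise, and duplicate rows."""
    rng = np.random.default_rng(seed)
    dirty = df.copy()

    # Missing values: blank out a random share of Price and Stock Quantity.
    for col in ["Price", "Stock Quantity"]:
        dirty[col] = dirty[col].astype(float)  # float so the column can hold NaN
        mask = rng.random(len(dirty)) < missing_frac
        dirty.loc[mask, col] = np.nan

    # Inconsistencies: rewrite some Gender values as 'male' or 'FEMALE'.
    mask = rng.random(len(dirty)) < casing_frac
    dirty.loc[mask, "Gender"] = dirty.loc[mask, "Gender"].map(
        {"Male": "male", "Female": "FEMALE"}
    )

    # Duplicates: append exact copies of a random sample of rows.
    n_dups = int(len(dirty) * dup_frac)
    dups = dirty.sample(n=n_dups, random_state=seed)
    return pd.concat([dirty, dups], ignore_index=True)


# Small stand-in for the synthetic dataset (values are illustrative).
clean_df = pd.DataFrame({
    "Product ID": range(1, 21),
    "Gender": ["Male", "Female"] * 10,
    "Price": [49.99 + i for i in range(20)],
    "Stock Quantity": [10 * i for i in range(20)],
})
dirty_df = add_dirt(clean_df)
print(dirty_df.isna().sum())
```

The function works on a copy, so the original frame stays intact and can be used to check the cleaning results later.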
Analyzing the Dirty Data
After introducing dirt into our dataset, we analyze it to understand its condition. We generate a detailed report that includes:
- The number of rows and columns in the dataset
- The count of duplicates
- The number of missing values for each column
- A per-column analysis showing unique values, string lengths, and more
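A report with the items above can be assembled from a handful of pandas calls. The sketch below is one possible shape for it; the `summarize` function and the layout of the returned dictionary are assumptions, not the app's actual report format.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def summarize(df: pd.DataFrame) -> dict:
    """Collect basic data-quality statistics for a DataFrame."""
    report = {
        "rows": len(df),
        "columns": df.shape[1],
        "duplicates": int(df.duplicated().sum()),   # fully identical rows
        "missing": df.isna().sum().to_dict(),       # missing count per column
        "per_column": {},
    }
    for col in df.columns:
        series = df[col]
        info = {"unique": int(series.nunique(dropna=True))}
        # String lengths only make sense for text columns.
        if is_object_dtype(series) or is_string_dtype(series):
            lengths = series.dropna().astype(str).str.len()
            if not lengths.empty:
                info["min_len"] = int(lengths.min())
                info["max_len"] = int(lengths.max())
        report["per_column"][col] = info
    return report


# Tiny hand-made dirty sample (illustrative values).
sample = pd.DataFrame({
    "Product ID": [1, 2, 2, 3],
    "Gender": ["Male", "FEMALE", "FEMALE", "male"],
    "Price": [59.99, np.nan, np.nan, 120.0],
})
report = summarize(sample)
print(report)
```

Note that `nunique` treats 'Male' and 'male' as different values, which is exactly how casing inconsistencies show up in the unique counts.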
Cleaning the Data
After analyzing the dirty dataset, we move on to cleaning it. This includes:
- Changing Data Types: We handle the necessary type conversions to ensure that columns like Price and Stock Quantity are in the correct format (e.g., float and integer).
- Handling Duplicates: We either mark the duplicates or remove them from the dataset, depending on the user's choice.
- Replacing NaNs: We replace missing values with a placeholder such as -99999 and offer several options for handling them: removing the affected rows, or replacing the values with zeros or column averages.
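The cleaning steps above might look like this in pandas. It is a sketch under a few assumptions: the `clean` function, its option names, and the `Is Duplicate` column are hypothetical, and the -99999 placeholder is treated here as one strategy among the others rather than as a fixed first step.

```python
import pandas as pd

PLACEHOLDER = -99999
NUMERIC_COLS = ["Price", "Stock Quantity"]


def clean(df, duplicates="remove", nan_strategy="mean"):
    """Fix types, handle duplicates, and fill or drop missing values."""
    out = df.copy()

    # Type conversion: unparseable entries become NaN instead of raising.
    for col in NUMERIC_COLS:
        out[col] = pd.to_numeric(out[col], errors="coerce")

    # Duplicates: either drop them or flag them in a new column.
    if duplicates == "remove":
        out = out.drop_duplicates().reset_index(drop=True)
    elif duplicates == "mark":
        out["Is Duplicate"] = out.duplicated()
    else:
        raise ValueError(f"unknown duplicates option: {duplicates}")

    # Missing values: drop rows, or fill with zero, the column mean, or a placeholder.
    if nan_strategy == "drop":
        out = out.dropna(subset=NUMERIC_COLS).reset_index(drop=True)
    elif nan_strategy == "zero":
        out[NUMERIC_COLS] = out[NUMERIC_COLS].fillna(0)
    elif nan_strategy == "mean":
        out[NUMERIC_COLS] = out[NUMERIC_COLS].fillna(out[NUMERIC_COLS].mean())
    elif nan_strategy == "placeholder":
        out[NUMERIC_COLS] = out[NUMERIC_COLS].fillna(PLACEHOLDER)
    else:
        raise ValueError(f"unknown nan_strategy: {nan_strategy}")

    # Stock Quantity can only become an integer once no NaN remains.
    out["Stock Quantity"] = out["Stock Quantity"].round().astype(int)
    return out


# Tiny dirty sample: Price stored as text, one duplicate row, missing values.
dirty = pd.DataFrame({
    "Product ID": [1, 2, 2, 3],
    "Price": ["59.99", None, None, "120.0"],
    "Stock Quantity": [10, 4, 4, None],
})
cleaned = clean(dirty)
marked = clean(dirty, duplicates="mark", nan_strategy="placeholder")
print(cleaned)
```

Filling with the column mean keeps every row but flattens the distribution, while dropping rows keeps only observed values at the cost of losing data, so the right option depends on what the cleaned dataset will be used for.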
Final Cleaned Data
Once all cleaning steps are completed, we present the final cleaned dataset, ready to be used for deeper analysis, reporting, or machine learning.
Why This App?
This application helps demonstrate how we can turn messy, real-world data into a structured, usable format. Whether you're dealing with missing data, duplicates, or inconsistencies, this tool guides you through the process of cleaning and transforming your data into a reliable resource.
Deploying the App
The tool is deployed live on Streamlit, where it can be accessed at any time. The cleaning process is quick and interactive, making it a great learning tool for those looking to improve their data cleaning skills.
Scalability
While this app currently focuses on a synthetic dataset, the same process can be applied to real-world datasets that are much larger and more complex. The principles of analyzing, cleaning, and transforming the data stay the same regardless of dataset size.