King AI Capital

Overview

Welcome to our Dirty Data application! The app starts by creating a synthetic dataset of shoe sales. We then deliberately "dirty" this data by introducing inconsistencies, missing values, and duplicates. Once the data is intentionally flawed, we walk through analyzing, cleaning, and transforming it into a reliable dataset for further analysis and modeling.

Data Creation

We begin by generating a synthetic dataset of shoe sales. This dataset includes information such as Product ID, Age Group, Gender, Product Type, Size, Color, Price, and Stock Quantity. These fields are generated randomly, ensuring a wide range of possible values for every column, resulting in a rich dataset to manipulate and clean.
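
As a rough illustration of this step, the sketch below builds such a table with pandas and NumPy. The column names come from the list above; the function name `make_shoe_sales`, the category values, and the numeric ranges are assumptions chosen for the example, not the app's actual settings.

```python
import numpy as np
import pandas as pd


def make_shoe_sales(n_rows: int = 100, seed: int = 42) -> pd.DataFrame:
    """Build a random shoe-sales table (all value choices are illustrative)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        # Sequential IDs keep every generated row unique.
        "Product ID": [f"P{i:05d}" for i in range(1, n_rows + 1)],
        "Age Group": rng.choice(["Kids", "Teens", "Adults"], size=n_rows),
        "Gender": rng.choice(["Male", "Female", "Unisex"], size=n_rows),
        "Product Type": rng.choice(["Sneakers", "Boots", "Sandals"], size=n_rows),
        # integers() excludes the upper bound, so sizes run from 5 to 13.
        "Size": rng.integers(5, 14, size=n_rows),
        "Color": rng.choice(["Black", "White", "Red", "Blue"], size=n_rows),
        "Price": rng.uniform(20, 200, size=n_rows).round(2),
        "Stock Quantity": rng.integers(0, 500, size=n_rows),
    })


df = make_shoe_sales()
print(df.head())
```

Fixing the seed makes the "random" data reproducible, which is convenient when comparing the dataset before and after cleaning.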

Adding Dirt to the Data

Once our synthetic dataset is ready, we introduce several kinds of "dirt" into it:

Missing Values: We randomly introduce missing values in fields like Price and Stock Quantity.
Duplicates: Some of the rows are duplicated to simulate repeated entries in the data.
Inconsistencies: For example, we randomly change the gender values to 'male' or 'FEMALE' to create casing inconsistencies.

This simulates the messy data scenarios we often encounter in the real world, which need to be cleaned before further analysis.
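
The three kinds of dirt described above can be sketched as follows. This is a minimal example under stated assumptions: the helper name `add_dirt`, the fractions of affected rows, and the small input table are hypothetical, and mapping 'Male' to 'male' and 'Female' to 'FEMALE' is one possible reading of the casing step.

```python
import numpy as np
import pandas as pd


def add_dirt(df, missing_frac=0.1, casing_frac=0.3, dup_frac=0.1, seed=0):
    """Return a copy of df with missing values, casing noise, and duplicate rows."""
    rng = np.random.default_rng(seed)
    dirty = df.copy()
    n = len(dirty)

    def pick_rows(frac):
        # Choose at least one distinct row position to alter.
        return dirty.index[rng.choice(n, size=max(1, int(n * frac)), replace=False)]

    # Missing values: cast to float first so the columns can hold NaN.
    for col in ["Price", "Stock Quantity"]:
        dirty[col] = dirty[col].astype(float)
        dirty.loc[pick_rows(missing_frac), col] = np.nan

    # Inconsistencies: change the casing of some Gender values.
    rows = pick_rows(casing_frac)
    dirty.loc[rows, "Gender"] = dirty.loc[rows, "Gender"].replace(
        {"Male": "male", "Female": "FEMALE"}
    )

    # Duplicates: append copies of a random sample of rows.
    dupes = dirty.sample(frac=dup_frac, random_state=seed)
    return pd.concat([dirty, dupes], ignore_index=True)


# Small hypothetical clean table to demonstrate the helper.
clean = pd.DataFrame({
    "Product ID": [f"P{i:03d}" for i in range(20)],
    "Gender": ["Male", "Female"] * 10,
    "Price": [49.99 + i for i in range(20)],
    "Stock Quantity": list(range(10, 30)),
})
dirty = add_dirt(clean)
print(dirty.tail())
```

The dirt is applied to a copy, so the clean table stays available as a reference for checking the cleaning results later.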

Analyzing the Dirty Data

After introducing dirt into our dataset, we analyze it to understand its condition. We generate a detailed report that includes:

The number of rows and columns in the dataset
The count of duplicates
The number of missing values for each column
A per-column analysis showing unique values, string lengths, and more

This analysis helps us identify potential issues such as inconsistent casing, extra spaces, or special characters that need to be cleaned up.
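
A report along these lines could be assembled as in the sketch below. The function name `analyze`, the dictionary keys, and the sample rows are assumptions made for illustration; the app's real report may be structured differently.

```python
import numpy as np
import pandas as pd


def analyze(df):
    """Summarize shape, duplicates, missing values, and per-column details."""
    report = {
        "rows": len(df),
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_per_column": {c: int(n) for c, n in df.isna().sum().items()},
        "column_details": {},
    }
    for col in df.columns:
        detail = {"unique_values": int(df[col].nunique())}
        # Treat every non-numeric column as text and inspect its strings.
        if not pd.api.types.is_numeric_dtype(df[col]):
            text = df[col].dropna().astype(str)
            if not text.empty:
                lengths = text.str.len()
                detail["min_length"] = int(lengths.min())
                detail["max_length"] = int(lengths.max())
                detail["has_extra_spaces"] = bool((text != text.str.strip()).any())
                # Fewer distinct values after lowercasing means casing varies.
                detail["casing_varies"] = bool(text.str.lower().nunique() < text.nunique())
                detail["has_special_chars"] = bool(
                    text.str.contains(r"[^A-Za-z0-9\s]", regex=True).any()
                )
        report["column_details"][col] = detail
    return report


# Small hypothetical dirty table to demonstrate the report.
dirty = pd.DataFrame({
    "Product ID": ["P001", "P002", "P002", "P003"],
    "Gender": ["Male", "male", "male", "FEMALE "],
    "Price": [59.99, np.nan, np.nan, 120.0],
    "Stock Quantity": [10, 5, 5, np.nan],
})
report = analyze(dirty)
print(report["duplicate_rows"], report["missing_per_column"])
```

Returning a plain dictionary keeps the report easy to display in a UI or to serialize.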

Cleaning the Data

After analyzing the dirty dataset, we move on to cleaning it. This includes:

Changing Data Types: We handle the necessary type conversions to ensure that columns like Price and Stock Quantity are in the correct format (e.g., float for Price and integer for Stock Quantity).
Handling Duplicates: We either mark the duplicates or remove them from the dataset, depending on the user's choice.
Replacing NaNs: We replace missing values with a placeholder such as -99999 and offer several other ways to handle them, including removing the affected rows or filling the gaps with zeros or column averages.

This cleaning process helps transform the raw, dirty data into something much more usable and reliable.
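
The cleaning steps listed above might look roughly like this in pandas. The function names, the `action` and `strategy` option names, and the sample rows are hypothetical; only the -99999 placeholder and the column names are taken from the description. Using the nullable `Int64` dtype is a design choice of this sketch so that Stock Quantity can stay integer while values are still missing.

```python
import numpy as np
import pandas as pd

PLACEHOLDER = -99999  # placeholder value mentioned in the description


def convert_types(df):
    """Coerce Price to float and Stock Quantity to a nullable integer."""
    out = df.copy()
    out["Price"] = pd.to_numeric(out["Price"], errors="coerce").astype(float)
    out["Stock Quantity"] = pd.to_numeric(
        out["Stock Quantity"], errors="coerce"
    ).astype("Int64")
    return out


def handle_duplicates(df, action="remove"):
    """Either flag repeated rows or drop them, keeping the first occurrence."""
    if action == "mark":
        return df.assign(is_duplicate=df.duplicated(keep="first"))
    if action == "remove":
        return df.drop_duplicates(keep="first").reset_index(drop=True)
    raise ValueError(f"unknown action: {action}")


def handle_missing(df, strategy="placeholder", columns=("Price", "Stock Quantity")):
    """Drop rows with gaps, or fill gaps with a placeholder, zero, or the mean."""
    out = df.copy()
    if strategy == "drop":
        return out.dropna(subset=list(columns)).reset_index(drop=True)
    for col in columns:
        if strategy == "placeholder":
            fill = PLACEHOLDER
        elif strategy == "zero":
            fill = 0
        elif strategy == "mean":
            fill = out[col].mean()
            # Integer columns cannot store a fractional mean, so round it.
            if pd.api.types.is_integer_dtype(out[col]):
                fill = int(round(fill))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        out[col] = out[col].fillna(fill)
    return out


# Small hypothetical dirty table to demonstrate the three steps.
raw = pd.DataFrame({
    "Product ID": ["P001", "P002", "P002", "P003"],
    "Price": ["59.99", None, None, "120.0"],
    "Stock Quantity": [10, 6, 6, np.nan],
})
typed = convert_types(raw)
deduped = handle_duplicates(typed, action="remove")
cleaned = handle_missing(deduped, strategy="mean")
print(cleaned)
```

Keeping each step in its own function lets the user's choices (mark or remove, which fill strategy) be passed in as plain arguments.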

Final Cleaned Data

Once all cleaning steps are completed, we present the final cleaned dataset for analysis. The result is a dataset that's ready to be used for deeper analysis, reporting, or machine learning.

Why This App?

This application helps demonstrate how we can turn messy, real-world data into a structured, usable format. Whether you're dealing with missing data, duplicates, or inconsistencies, this tool guides you through the process of cleaning and transforming your data into a reliable resource.

Deploying the App

The tool is deployed live on Streamlit, where it can be accessed at any time. The cleaning process is quick and interactive, making it a useful learning tool for anyone looking to improve their data cleaning skills.

Scalability

While this app currently focuses on a synthetic dataset, the same process can be applied to real-world datasets that are much larger and more complex. The principles stay the same regardless of dataset size: analyze the data first, then clean and transform it step by step.

Dirty Data Create, Analyse, and Clean