In this project, we created an OLAP (online analytical processing) system that organized the data of car sales and car features to analyze trends in car sales.
The objective of this assignment was to use data science techniques and analytical tools to create a data mart, relevant queries, an interactive dashboard of the data, and use artificial intelligence algorithms to preform data mining. This project served as a comprehensive data science project that provided hands on experience to apply the teachings of the course. For our project specifically, we were interested in exploring the patterns in car pricing with respect to to car features and the CPI (consumer price index). CPI was chosed as a metric to indicate the state of the economy at the time of purchase.
- jupyter notebooks
- mySQL
- Power BI
For this assignment, we used three datasets which required preprocessing to before being entered into the database. During preprocessing steps, we handled null values, duplicate enteries, corrected errors in enteries, and reformatted data. These preprocessing steps can be viewed in the file CSI 4124 Part2.ipynb. Below is an example of how we identified null and duplicate values and addressed a null value in this particular dataset.
We created a mySQL database and enetered the data from the jupyter notebook. Below are some screenshots of the database and the process of inserting the data into the database.
Screenshot of the database schema
Screenshot of some eneteries in the CPI table
Screenshot of jupyter notebook code used to enter CPI data into the database
In this phase of the assignment, we were instructed to created roll up, drill down, slice, dice, iceberg, and window queries that were relevant to our dataset. These queries were then combined into combination queries, such as a slice and dice query. Below is a screenshot of a dice query that selects car sales enteries in the database where the car engine fuel type is "regular unleaded" and the customer income is >82000.
Screenshot of dice query that selects car sales enteries in the database where the car engine fuel type is "regular unleaded" and the customer income is >82000
For the interactive dashboard, we used Power BI. The dashboard can be viewed by opeing the BI_Dashboard.pbit file. Screenshots of the dashboard are below.
Screenshots of the dahsboard
In this phase of the project, we were intructed to use three AI models to classify our data in a meaningful way. We implemented a decision tree model, gradient boosting model, and random forest model to classify the prices of cars based on their features as this sort of classification could be useful to dealership, for example, when pricing their vehicles. We compared the precision, recall, and runtimes of the models. The data mining process can be viewed in the Data Mining.ipynb file.