This project applies unsupervised learning to analyze a dataset of airline passenger satisfaction surveys. The goal is to uncover hidden patterns, segment passengers, and understand what drives satisfaction using clustering and dimensionality reduction techniques.
- Python (Jupyter Notebook)
- pandas, numpy
- seaborn, matplotlib
- scikit-learn (KMeans, PCA, KernelPCA, Silhouette Score, Preprocessing Pipelines)
- scipy (Box-Cox transformation)
The dataset contains:
- Demographic Info: Gender, Age, Customer Type
- Travel Details: Flight Class, Distance, Travel Type
- Service Ratings: Seat Comfort, Boarding, Entertainment, etc.
- Timing: Departure/Arrival Delay
- Target: Passenger Satisfaction
Initial preprocessing steps:
- Removed unused columns
- Renamed columns for clarity
- Checked for missing values and outliers
- Analyzed distribution of features using histograms and
describe()
- Class vs Satisfaction: Created proportional stacked bar charts to explore satisfaction levels by travel class.
- Correlation Matrix:
- Strong correlation between Online Booking and Time Convenience (0.72)
- Extremely high correlation (0.97) between Departure Delay and Arrival Delay
- FacetGrid Visualizations: Used histograms to compare distributions of all categorical features against satisfaction.
- Applied log and Box-Cox transformations to address skewness in delay variables
- Created preprocessing pipelines using
ColumnTransformer
andOrdinalEncoder
- Scaled numerical features with
StandardScaler
- Started with 2-cluster and then 3-cluster KMeans models.
- Evaluated cluster quality using Inertia and the Elbow Method.
- Determined 3 clusters as optimal.
- Cluster 0: Majority neutral/dissatisfied (23,828 vs 4,679 satisfied)
- Cluster 1: Majority satisfied (32,207 vs 11,954 neutral/dissatisfied)
- Cluster 2: More neutral/dissatisfied than satisfied, but not as extreme as Cluster 0
Insights:
- Cluster 1 is the most satisfied group, ideal for retention strategies.
- Cluster 0 & 2 show dissatisfaction trends, indicating areas for improvement.
- Applied PCA with 14 components to calculate feature contributions
- Compared KernelPCA (3 components) vs PCA (14 components) using Silhouette Scores
- Visualized clusters with color gradients based on distance in PCA space
Conclusion:
- Both methods provide good separation, but PCA with 14 components slightly outperforms KernelPCA based on Silhouette Score
This project demonstrates end-to-end unsupervised learning using real-world survey data:
- Cleaned and explored data thoroughly
- Reduced skewness and scaled features
- Identified optimal clusters
- Applied and compared dimensionality reduction techniques
It highlights the power of clustering to understand customer satisfaction and create actionable insights.
- Deep dive into features within each cluster
- Try DBSCAN or Gaussian Mixture Models
- Visualize with t-SNE or UMAP
- Build interactive dashboard to explore segments
Project completed by Shiva Dorri as part of her unsupervised learning journey into data science.
Tags: #KMeans
#PCA
#KernelPCA
#CustomerSegmentation
#UnsupervisedLearning
#DataScience