A toolkit for quickly getting insights from pandas DataFrames via common regression methods and their visualizations. Effectively a combination of R's `glm` and `ggplot` functionality in a single Python model call, with the addition of one-liner clustering methods.
Currently supports:
- (multiple) linear regression with 2D, 3D plots
- (multiple) logistic regression with 2D, 3D plots; handles both binary and proportional (i.e. 0 < y < 1) regressands
- k-means clustering with inertia plots to determine the optimal number of clusters; plots in 1D, 2D, 3D, but clustering works for any number of variables
Example usage:
import pandas as pd
from explore_toolkit import lm
df = pd.read_csv("titanic.csv")
df = df.fillna({'Age': df['Age'].median()})
lm(df, 'Fare ~ Age', plot=True)
lm(df, 'Fare ~ Age + SibSp', plot=True)
2D | 3D |
---|---|
![]() | ![]() |
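For readers unfamiliar with R-style formulas, here is a minimal sketch of what a formula such as `Fare ~ Age + SibSp` expresses: an ordinary least-squares fit of `Fare` on an intercept plus the listed columns. This is not the toolkit's implementation; it uses only NumPy, and the small DataFrame is made-up toy data standing in for the Titanic columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the real Titanic file); Fare is built from a
# known linear rule so the recovered coefficients are easy to check.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "SibSp": [1, 1, 0, 1, 0, 3],
})
df["Fare"] = 5.0 + 0.5 * df["Age"] + 2.0 * df["SibSp"]

# 'Fare ~ Age + SibSp' means: intercept + Age + SibSp as predictors of Fare
X = np.column_stack([np.ones(len(df)), df["Age"], df["SibSp"]])
y = df["Fare"].to_numpy()

# Ordinary least squares; coef is ordered [intercept, Age, SibSp]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["Intercept", "Age", "SibSp"], coef.round(3))))
```

With one predictor the fit is a line (the 2D plot); with two it is a plane (the 3D plot).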
from explore_toolkit import logit
td = pd.read_csv('ReedfrogPred.csv') # propsurv is between 0 and 1, but also works if binary
logit(td, 'propsurv ~ surv', plot=True)
logit(td, 'propsurv ~ density + surv', plot=True)
2D | 3D |
---|---|
![]() | ![]() |
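A logistic fit can accept a proportional regressand because its estimating equations only require `y` to lie between 0 and 1, not to be exactly 0 or 1. The sketch below illustrates that idea with a plain Newton (IRLS) solver in NumPy. It is an assumption-laden illustration, not how `logit` is implemented in this toolkit; `fit_logistic` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Fit y ~ sigmoid(b0 + b1*x) by Newton's method; y may be binary or in (0, 1)."""
    X = np.column_stack([np.ones(len(x)), x])  # add intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # current fitted probabilities
        grad = X.T @ (y - p)                   # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # X' W X
        beta = beta + np.linalg.solve(hess, grad)    # Newton step
    return beta

# Synthetic proportions generated from known coefficients (-1, 2)
x = np.linspace(-2.0, 2.0, 50)
y_prop = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x)))

beta = fit_logistic(x, y_prop)
print(beta.round(4))
```

The same function works unchanged if `y` contains only 0s and 1s, provided the classes are not perfectly separable.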
from explore_toolkit import kmeansclusters, elbow
df = pd.read_csv('Iris.csv')
Here I am using the very popular `iris` dataset:
elbow(df, ['SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm'])
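As background on what an inertia ("elbow") plot shows: inertia is the total squared distance from each point to its nearest cluster centre, computed for a range of cluster counts. The sketch below computes those values with a small hand-written k-means in NumPy on synthetic blobs. It is an illustration under stated assumptions, not the toolkit's `elbow` implementation; `kmeans_inertia` and the blob data are hypothetical.

```python
import numpy as np

def kmeans_inertia(X, k, n_init=5, n_iter=50, seed=0):
    """Best inertia over a few random restarts of Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _restart in range(n_init):
        # Start from k distinct data points chosen at random
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _step in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            for j in range(k):
                members = X[labels == j]
                if len(members):  # leave a centre unchanged if its cluster is empty
                    centers[j] = members.mean(axis=0)
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, dist.min(axis=1).sum())
    return best

# Three well-separated synthetic blobs, so the "true" number of clusters is 3
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((40, 2)) for c in blob_centers])

inertias = [kmeans_inertia(X, k) for k in range(1, 7)]
for k, value in enumerate(inertias, start=1):
    print(k, round(float(value), 1))
```

Plotting `inertias` against `k` gives the elbow curve; the cluster count is usually chosen where the curve stops dropping sharply.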

And now clustering with the optimal `n_clusters = 3`:
kmeansclusters(df, ['SepalLengthCm', 'SepalWidthCm'], n_clusters=3, plot=True, append=True, spit=False)
kmeansclusters(df, ['SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm'], n_clusters=3, plot=True, append=True)
2D | 3D |
---|---|
![]() | ![]() |
Use `spit=True` to return the standalone column of cluster numbers, and `append=True` to insert that column into the analysed dataframe (at position 1).
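The two flags can be pictured with plain pandas. The sketch below is a hypothetical helper, not the toolkit's code: the function name `attach_clusters`, the column name `cluster`, and the reading of "position 1" as zero-based index 1 (the second column) are all assumptions made for illustration.

```python
import pandas as pd

def attach_clusters(df, labels, append=True, spit=False):
    """Hypothetical stand-in for the append/spit behaviour described above."""
    clusters = pd.Series(labels, index=df.index, name="cluster")
    if append:
        # Modifies df in place; assumes "position 1" means zero-based index 1
        df.insert(1, "cluster", clusters)
    if spit:
        return clusters  # standalone column of cluster numbers
    return None

df = pd.DataFrame({"a": [1.0, 2.0, 9.0], "b": [0.5, 0.7, 4.2]})
result = attach_clusters(df, [0, 0, 1], append=True, spit=True)

print(list(df.columns))
print(result.tolist())
```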