2020
Predicting subscription churn with PySpark ML
End-to-end churn modelling on a music streaming dataset, covering feature engineering, model selection, and evaluation at Spark scale.
Context
Churn is the textbook ML problem, but most write-ups use pandas on a toy dataset. I wanted to walk through it in PySpark to show how the workflow changes when the data doesn't fit on a laptop.
Approach
- Engineered features from the event log in Spark: session counts, listening time, page interactions, downgrade signals.
- Trained and compared logistic regression, random forest, and gradient-boosted trees in MLlib.
- Evaluated with F1 rather than raw accuracy, because of the heavy class imbalance.
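To make the metric choice concrete, here is a minimal plain-Python sketch of why accuracy can mislead when churners are rare. The confusion-matrix counts are hypothetical, chosen only for illustration, and are not results from this project. The helper name `classification_metrics` is also an assumption, and the F1 shown is the positive-class (churn) F1, which may differ from the variant used in the original walkthrough.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive (churn) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when the model never predicts churn.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical population: 1,000 users, 100 of whom churn.

# Baseline that predicts "no churn" for every user.
always_negative = classification_metrics(tp=0, fp=0, fn=100, tn=900)

# Hypothetical model that catches 60 churners and raises 40 false alarms.
some_model = classification_metrics(tp=60, fp=40, fn=40, tn=860)

print("always 'no churn':", always_negative)
print("hypothetical model:", some_model)
```

With these made-up counts, the do-nothing baseline scores 90% accuracy while finding no churners at all, so its F1 is 0. The second model is only two points better on accuracy, but its F1 reflects that it actually identifies churners.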
Outcome
A full PySpark ML walkthrough, published on Analytics Vidhya, that other practitioners can lift directly into their own Spark environment.
Stack
- PySpark
- MLlib
- Feature engineering
- Classification