2020

Predicting subscription churn with PySpark ML

End-to-end churn modelling on a music streaming dataset: feature engineering, model selection, and evaluation at Spark scale.

Context

Churn is the textbook ML problem, but most write-ups use pandas on a toy dataset. I wanted to walk through it in PySpark to show how the workflow changes when the data doesn't fit on a laptop.

Approach

Event-log feature engineering in Spark: session counts, listening time, page interactions, downgrade signals.
Trained and compared logistic regression, random forest, and gradient-boosted trees in MLlib.
Evaluated with F1 rather than raw accuracy, because the classes are heavily imbalanced.
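
To make the metric choice concrete, here is a minimal plain-Python sketch of why accuracy misleads when churners are rare. The confusion-matrix counts are hypothetical, chosen only for illustration, and are not results from this project. The helper names `f1_score` and `accuracy` are likewise illustrative and are not MLlib functions.

```python
def f1_score(tp, fp, fn):
    """F1 for the positive (churn) class; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(tp, fp, fn, tn):
    """Share of all users classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts: 1,000 users, 100 of whom churned.
# Baseline: always predict "no churn", so every churner is missed.
baseline = {"tp": 0, "fp": 0, "fn": 100, "tn": 900}
# A hypothetical classifier that catches 60 of the 100 churners.
model = {"tp": 60, "fp": 30, "fn": 40, "tn": 870}

for name, cm in [("baseline", baseline), ("model", model)]:
    acc = accuracy(**cm)
    f1 = f1_score(cm["tp"], cm["fp"], cm["fn"])
    print(f"{name}: accuracy={acc:.3f}, f1={f1:.3f}")
```

Under these assumed counts the do-nothing baseline scores 90% accuracy while identifying no churners at all, so its F1 is zero. F1 on the churn class separates the two where accuracy barely moves.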

Outcome

A full PySpark ML walkthrough, published on Analytics Vidhya, that other practitioners can lift directly into their own Spark environment.

Stack

PySpark
MLlib
Feature engineering
Classification
