2020

Predicting subscription churn with PySpark ML

End-to-end churn modelling on a music streaming dataset: feature engineering, model selection, and evaluation at Spark scale.

Context

Churn is the textbook ML problem, but most write-ups use pandas on a toy dataset. I wanted to walk through it in PySpark to show how the workflow changes when the data doesn't fit on a laptop.

Approach

  • Engineered features from the event log in Spark: session counts, listening time, page interactions, downgrade signals.
  • Trained and compared logistic regression, random forest, and gradient-boosted trees in MLlib.
  • Evaluated with F1 rather than raw accuracy, because churners are a small minority of users and accuracy rewards a model that never predicts churn.
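
To make the last point concrete, here is a minimal sketch in plain Python (no Spark needed) of why accuracy misleads on imbalanced labels. The data is made up for illustration and is not from the project: 100 hypothetical users, 10 of whom churned. The helper names (`confusion_counts`, `accuracy`, `f1_score`) are my own, and F1 here is computed for the churn class only.

```python
def confusion_counts(labels, predictions):
    """Count TP, FP, FN, TN, treating label 1 as 'churned'."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(labels, predictions):
    """Share of all users classified correctly."""
    tp, fp, fn, tn = confusion_counts(labels, predictions)
    return (tp + tn) / len(labels)


def f1_score(labels, predictions):
    """F1 for the churn class: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    tp, fp, fn, _ = confusion_counts(labels, predictions)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical labels: 10 churners followed by 90 retained users.
labels = [1] * 10 + [0] * 90

# Baseline that never predicts churn.
always_stay = [0] * 100

# Illustrative model: catches 6 of 10 churners and raises 4 false alarms.
some_model = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86

acc_baseline = accuracy(labels, always_stay)
f1_baseline = f1_score(labels, always_stay)
acc_model = accuracy(labels, some_model)
f1_model = f1_score(labels, some_model)

print(f"baseline: accuracy={acc_baseline:.2f}, F1={f1_baseline:.2f}")
print(f"model:    accuracy={acc_model:.2f}, F1={f1_model:.2f}")
```

On this toy data the two classifiers are nearly indistinguishable by accuracy, while F1 separates them clearly. In MLlib, my understanding is that `MulticlassClassificationEvaluator` with `metricName="f1"` reports an F1 weighted across both classes, which is not the same quantity as the churn-class F1 above, so it is worth checking which variant an evaluator returns before comparing numbers.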

Outcome

A full PySpark ML walkthrough, published on Analytics Vidhya, that other practitioners can lift directly into their own Spark environment.

Stack

  • PySpark
  • MLlib
  • Feature engineering
  • Classification

Links