XGBoost for Beginners – from CSV to Trustworthy Model

This is what we do in the YouTube video below:

  • take a tiny CSV and explore it with plain words
  • train a strong model using the XGBoost Python library (eXtreme Gradient Boosting)
  • pick a practical decision threshold and explain what matters in the data
  • add two simple business rules:
    • a higher debt_ratio should not make risk go down
    • a higher tenure_months should not make risk go up

Why does XGBoost matter, and what does the name mean exactly?

  • Boosting – adding small decision trees, one after another. Each new tree focuses on the rows the earlier trees got wrong and fixes part of their mistakes. The final prediction is the sum of all the small trees.
  • Gradient – we follow the direction that reduces the error the fastest. In a way, it is greedy. Think of an error hill: the gradient shows the downhill direction.
  • Extreme – it is tuned for speed and accuracy. It runs well on a laptop CPU, can use a GPU and can scale up if needed.

TLDR – many simple trees working together, trained in a smart direction, implemented for speed.

What did we build in the YouTube video?

  • Load a tiny customer churn CSV called churn.csv.
  • Do quick, safe checks – missing values and class balance.
  • Split data into train, validation, test – 60-20-20.
  • Train XGBoost with early stopping using the Booster API. The model is called bst.
  • Choose a practical decision threshold from validation – “a line in the sand”.
  • Explain results on the test set in plain terms – confusion matrix, precision, recall, ROC AUC, PR AUC.
  • See which column mattered most (a hint – if people start calling the call centre a lot, most probably there is a problem and they will quit using your service).
  • Add two business rules with monotonic constraints and build a new model – bst_cons.
  • Compare the quality of bst_cons and bst with a few lines.
  • Save both models in the UBJ format (Universal Binary JSON).

The dataset at a glance

These are really human-friendly columns:

  • tenure_months – how long the customer has been with us
  • monthly_spend – monthly spend
  • support_calls – number of support calls
  • promo_eligible – 1 if they can get a promo
  • debt_ratio – 0 to 1, higher means more financial stress
  • age – age in years
  • is_international – 1 if they are international
  • churn – 1 means they have already switched to another provider

There are about 2% missing values. XGBoost handles missing values natively, with no imputation needed (and with such a tiny dataset it is fast, too).

Minimal, Complete and Verifiable Example

Here is some complete and minimal Python. Cheers to all Stack Overflow users remembering [MCVE] and having fun there some 10 years ago.

The script prints the chosen threshold, the confusion matrix, the headline metrics, and the top features by gain.

What do the printed metrics mean? In simple words, as I am also a beginner in XGBoost 🙂

Confusion matrix

This is the simplest to explain. It counts 4 groups:

  • True negatives (tn) – predicted no churn, had no churn
  • True positives (tp) – predicted churn, had churn
  • False positives (fp) – predicted churn, had no churn
  • False negatives (fn) – predicted no churn, had churn

From these simple numbers we get all the other metrics. Really!
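A tiny worked example on eight hand-picked labels (scikit-learn's convention puts the counts in the order tn, fp, fn, tp when you ravel the matrix):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

# ravel() of the 2x2 matrix gives tn, fp, fn, tp for labels {0, 1}
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # → 3 1 1 3

# And from these four numbers come the other metrics:
print("precision:", tp / (tp + fp))  # → 0.75
print("recall:   ", tp / (tp + fn))  # → 0.75
```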

Precision

Of all the rows flagged as churn, how many actually churned?

Precision = tp / (tp + fp). High precision means few false alarms.

Recall

Of all the rows that truly churned, how many did we catch?

Recall = tp / (tp + fn). High recall means fewer missed churners. If our model simply predicted churn for everything, recall would be 100% – but that is not exactly a useful model.

F1

One number that balances precision and recall. We used F1 on the validation set to choose a practical decision threshold.
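The three formulas fit in a small helper (a sketch, with made-up counts). Note how the "flag everything as churn" model from the recall section gets a perfect recall but a terrible precision, which F1 punishes:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute the three metrics straight from the confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# A "flag everyone as churn" model on 500 customers, 50 of whom churn:
# every churner is caught (fn=0), but 450 loyal customers are false alarms.
p, r, f = precision_recall_f1(tp=50, fp=450, fn=0)
print(p, r, round(f, 3))  # recall is a perfect 1.0, precision only 0.1
```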

ROC AUC (Receiver Operating Characteristic Area Under Curve)

  • How well the model ranks positives higher than negatives across all thresholds.
  • It is the area under the ROC curve – true positive rate vs false positive rate.
  • Values range from 0.5 (random) to 1.0 (perfect).
  • ROC AUC is stable when classes are balanced, but can be overly optimistic when positives are very rare.

PR AUC (Precision Recall Area Under Curve)

  • Area under the precision recall curve. It focuses on the quality of catching positives.
  • Values range from the positive class share (baseline) up to 1.0.
  • PR AUC is more informative than ROC AUC when positives are rare, like churn in many datasets. (Ours is an exception.)
  • This is why eval_metric="aucpr" is used in the code.
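The "overly optimistic ROC AUC" effect is easy to demonstrate on synthetic scores. Below, a mediocre scorer on a dataset with ~2% positives gets a respectable-looking ROC AUC while its PR AUC stays much closer to the tiny baseline (all numbers here are synthetic, just to show the gap):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(3)
y = (rng.uniform(size=2000) < 0.02).astype(int)  # ~2% positives
# A mediocre scorer: positives score only slightly higher on average.
scores = rng.normal(0, 1, 2000) + y * 1.0

print("ROC AUC:    ", round(roc_auc_score(y, scores), 3))
print("PR AUC:     ", round(average_precision_score(y, scores), 3))
print("PR baseline:", round(y.mean(), 3))  # the positive class share
```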

Gain

We printed the top features by gain. Gain is the average error reduction from splits on that feature during training. Bigger gain means the feature was more useful overall. Use this as a starting point for decisions – which features drive risk, and is that expected? Does it make business sense?

The GitHub repository with both Python scripts (from YouTube and from this article) is here:

https://github.com/Vitosh/Python_personal/tree/master/YouTube/043_Python-XGBoost

Enjoy it! 🙂
