Abstract
This is my fourth independent project at Metis. The goal of this project is to help a banking institution build a classification model that predicts which customers are likely to subscribe to a term deposit, so that the institution can target those customers with direct marketing campaigns (phone calls).
I used data from the Kaggle Banking Dataset and started with a baseline Logistic Regression model (F-beta (beta = 2): 0.7582, ROC_AUC: 0.6502), then added complexity to improve its predictive ability. I trained 6 candidate models and selected the best-performing one, XGBoost. Finally, I retrained that model to get my final model, with F-beta (beta = 2): 0.9946 and ROC_AUC: 0.9998.
Problem
Our client is a banking institution for which term deposits are a major source of income. They are planning to run direct marketing campaigns on their customers, and they want us to build a classification model that predicts whether a customer will subscribe, so that they can target the right customers and improve the conversion rate.
Data
The data I used is from Kaggle — Banking Dataset - Marketing Targets. The training dataset contains 45,000+ rows and 16 features, while the test dataset contains 4,500+ rows and 16 features. The target is ‘y’ — has the client subscribed to a term deposit? (binary: “yes”, “no”). (I dropped the “duration” column based on the dataset instructions.)
EDA:
I checked the target first and noticed a class imbalance issue, so I chose an oversampling method to resample the training data before modeling.
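As a concrete illustration, here is a minimal sketch of that resampling step, assuming a pandas DataFrame loaded from the Kaggle CSV and imbalanced-learn’s RandomOverSampler (the actual library, file path, and split settings used in the project may differ):

```python
# Minimal oversampling sketch; file path, separator, and split settings are assumptions.
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv", sep=";")
X = df.drop(columns=["y", "duration"])      # 'duration' dropped per dataset instructions
y = (df["y"] == "yes").astype(int)          # binary target

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample only the training split so the validation set stays untouched.
ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)
print(pd.Series(y_train_res).value_counts())   # classes are now balanced
```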
Below is the count comparison between people who have housing and people who don’t. It turned out that people who don’t have housing have a higher chance of subscribing to a term deposit. One explanation is that people who don’t own houses have more cash on hand, which allows them to make some investments.
Here’s the pairplot of our numerical features. Looking at the diagonal distributions, no single feature appears to be a good classifier for our problem on its own, but they may work well when combined.
Design/Metric Choice:
For the metric, I chose the F-beta score as my hard-prediction metric. I want to minimize False Negatives — actual potential customers who were predicted as not going to subscribe. At the same time, I don’t want too many False Positives, because I don’t want our sales team to waste time calling people who are not going to subscribe. So I set beta = 2, which balances precision and recall with a favor toward recall. I also don’t want my soft predictions to be too far off, so I will use ROC_AUC as a secondary, reference metric.
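To make the beta = 2 choice concrete: F_beta = (1 + beta²) · precision · recall / (beta² · precision + recall), so larger beta weights recall more heavily. Here is a tiny illustration with made-up labels (not project data):

```python
# Toy example: beta = 2 pulls the score toward recall; the labels below are made up.
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # recall = 3/4 = 0.75, precision = 3/5 = 0.60

print(fbeta_score(y_true, y_pred, beta=1))   # ~0.667 — plain F1
print(fbeta_score(y_true, y_pred, beta=2))   # ~0.714 — weighted toward recall
```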
Baseline Model:
My baseline model is a logistic regression that only includes my main numerical features — age, balance, day, campaign, pdays, previous. From its confusion matrix, we can see that over 2,000 data points are mislabeled and 500 of them are False Negatives. F-beta (beta = 2): 0.7582, ROC_AUC: 0.6502.
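A minimal sketch of what that baseline might look like, reusing the resampled split from the earlier snippet (the exact C value and solver settings are my assumptions):

```python
# Baseline sketch: logistic regression on the six numerical features only.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, fbeta_score, roc_auc_score

num_cols = ["age", "balance", "day", "campaign", "pdays", "previous"]

baseline = LogisticRegression(C=1000, max_iter=1000)   # higher C, unscaled features
baseline.fit(X_train_res[num_cols], y_train_res)

val_pred = baseline.predict(X_val[num_cols])
val_prob = baseline.predict_proba(X_val[num_cols])[:, 1]

print(confusion_matrix(y_val, val_pred))
print("F-beta (beta=2):", fbeta_score(y_val, val_pred, beta=2))
print("ROC_AUC:", roc_auc_score(y_val, val_prob))
```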
Improving Baseline Model:
Then I tried to improve my baseline model. In the previous model I didn’t scale my data; I just used a higher C. To improve it, I added scaling and dummified categorical features, which raised my F-beta to 0.7933 and ROC_AUC to 0.7687.
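One way to wire that up is a ColumnTransformer that scales the numerical columns and one-hot encodes the categorical ones; the column lists below follow the dataset’s schema but are my assumption about what was included:

```python
# Improved-baseline sketch: scale numerical features, dummify categorical features.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

num_cols = ["age", "balance", "day", "campaign", "pdays", "previous"]
cat_cols = ["job", "marital", "education", "default", "housing",
            "loan", "contact", "month", "poutcome"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

logreg = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
logreg.fit(X_train_res, y_train_res)   # then score on the validation split as before
```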
Model Evaluation & Selection (6 candidate models):
- Logistic Regression: F-beta: 0.7933, ROC_AUC: 0.7687
- 20-NN: F-beta: 0.7632, ROC_AUC: 0.7392
- Random Forest: F-beta: 0.8333, ROC_AUC: 0.7928
- XGBoost: F-beta: 0.8329, ROC_AUC: 0.7992
- Naive Bayes: F-beta: 0.8016, ROC_AUC: 0.7487
- SVM: F-beta: 0.8388, ROC_AUC: N/A
XGBoost performed the best, so I tuned its hyper-parameters to get a model with better predictive power.
- XGBoost (with hyper-parameter tuning): F-beta: 0.8546, ROC_AUC: 0.7983
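As a sketch of what that tuning step could look like, here is a randomized search over an illustrative parameter grid (not the grid actually used in the project), reusing the `preprocess` transformer from the earlier pipeline sketch:

```python
# Hyper-parameter tuning sketch for XGBoost; the parameter ranges are illustrative.
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, fbeta_score

f2_scorer = make_scorer(fbeta_score, beta=2)

xgb_pipe = Pipeline([
    ("prep", preprocess),
    ("clf", XGBClassifier(eval_metric="logloss", random_state=42)),
])

param_dist = {
    "clf__n_estimators": [200, 400, 600],
    "clf__max_depth": [3, 5, 7],
    "clf__learning_rate": [0.01, 0.05, 0.1],
    "clf__subsample": [0.7, 0.9, 1.0],
}

search = RandomizedSearchCV(
    xgb_pipe, param_distributions=param_dist,
    n_iter=20, scoring=f2_scorer, cv=5, random_state=42,
)
search.fit(X_train_res, y_train_res)
print(search.best_params_, search.best_score_)
```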
Then, I retrained my model on (train + validation) and scored it on the test set. Final score (on test):
- Final XGBoost: F-beta: 0.9946, ROC_AUC: 0.9998
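A sketch of that last step, assuming the tuned `search` pipeline from above and a separate test CSV (whether the combined data was also oversampled before the final refit is not shown here):

```python
# Final-model sketch: refit on (train + validation), then score once on the test set.
import pandas as pd
from sklearn.metrics import fbeta_score, roc_auc_score

test_df = pd.read_csv("test.csv", sep=";")            # path and separator are assumptions
X_test = test_df.drop(columns=["y", "duration"])
y_test = (test_df["y"] == "yes").astype(int)

final_model = search.best_estimator_
final_model.fit(pd.concat([X_train, X_val]), pd.concat([y_train, y_val]))

test_pred = final_model.predict(X_test)
test_prob = final_model.predict_proba(X_test)[:, 1]
print("F-beta (beta=2):", fbeta_score(y_test, test_pred, beta=2))
print("ROC_AUC:", roc_auc_score(y_test, test_prob))
```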
Model Evaluation:
Based on my final model, out of the 4,500 data points in my test set, only 24 were mislabeled and only one of them is a False Negative, so it’s a close-to-perfect classifier. Was I just lucky to get this result? My guess is that refitting on the combined (training + validation) data — adding the validation set — is what made my final model perform so well.
Future Work:
1. Check feature importance to make sure the top predictors aren’t leaking information (see the sketch after this list).
2. Try different random_state to see if I can get a higher validation score on all of my candidate models.
3. Try refit entire data on Random Forest to see the performance.
4. Lastly, since SVM had an F-beta score similar to my two tree-based models, I could try to improve my SVM model to see if it can match XGBoost’s performance.
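For item 1, a quick way to eyeball feature importance from the fitted XGBoost pipeline (assuming the `search` object from the tuning sketch above):

```python
# Feature-importance sketch: suspiciously dominant features may hint at leakage.
import pandas as pd

best_pipe = search.best_estimator_
feature_names = best_pipe.named_steps["prep"].get_feature_names_out()
booster = best_pipe.named_steps["clf"]

importances = pd.Series(booster.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```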
Please see the complete code, Google Slides, and write-up here: GitHub Repo Link