Customer Segmentation and Lifetime Value Prediction using machine learning models

In this project, I use K-Means and DBSCAN to cluster customers, and BG/NBD, Gamma-Gamma and XGBoost to predict customer lifetime value.

Project Summary

Click here to access the full project on my GitHub
  1. Input: An online retail dataset from a UK-based online store, covering 2010-2011
  2. Goal:
    • Identify the better clustering algorithm for customer segmentation: K-Means or DBSCAN
    • Identify the better model for customer lifetime value prediction: BG/NBD + Gamma-Gamma or XGBoost
    • Propose business strategies
  3. Insights:
    • Clustering efficiency ranking: DBSCAN + K-Means > K-Means > DBSCAN
    • Customer lifetime value prediction efficiency ranking: BG/NBD + Gamma-Gamma > XGBoost
    • Business Implementation:
      Group 0: Loyal customers (rank 1)
      - Characteristics: second-highest average order value (AOV); highest CLTV.
      - Business strategies: increase their AOV by introducing promotional combos or promoting high-value products on special occasions.
      Group 1: Potential customers (rank 2)
      - Characteristics: highest AOV but few purchases, so their CLTV ranks 2nd.
      - Business strategies: encourage them to purchase multiple times.
        - Introduce a reward-point program for each purchase, or rank members by number of purchases: the higher the rank, the more attractive the promotions.
        - Prioritize advertising short-shelf-life products, such as food, drinks or cosmetics, to these customers to encourage repeat purchases.
      Group 2: Needing attention (rank 3)
      - Characteristics: low AOV, medium purchase count and low CLTV; they buy irregularly, mainly on special occasions or during discounts.
      - Business strategies: impress these customers by highlighting positive feedback and highly rated products.
  4. Project Workflow
    • Data description and preprocessing
      - Remove nulls and outliers
      - Create the RFM model
      - Standardize and scale the data
    • Clustering with K-Means and DBSCAN
      - Clustering with K-Means
      - Clustering with DBSCAN
      - Comparing the two models based on Silhouette score
    • Predicting customer lifetime value
      - Predicting customer lifetime value with BG/NBD and Gamma-Gamma
      - Predicting customer lifetime value with XGBoost
      - Comparing the two models based on RMSE
    • Proposing business strategies
Project details:
  • Data collection and description
    The dataset, retrieved from Kaggle, contains all transactions that occurred between December 1, 2010 and December 9, 2011 for a UK-based online retailer. The company's products are mainly gifts for the holidays, and many of its customers are wholesalers. The dataset has 541,909 rows and 8 columns, described below.
    Columns and descriptions:
    - InvoiceNo: invoice number. A 6-digit integer uniquely assigned to each transaction. If the code starts with the letter 'c', the invoice was cancelled.
    - StockCode: product code. A 5-digit integer assigned to each distinct product.
    - Description: product name.
    - Quantity: the number of products in the transaction.
    - InvoiceDate: the date and time the transaction was created.
    - UnitPrice: product unit price.
    - CustomerNo: customer code. A 5-digit integer uniquely assigned to each customer.
    - Country: the customer's country of residence.
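    Below is a minimal sketch of how the data can be loaded; the file name 'online_retail.csv' is my assumption, so adjust it to wherever the Kaggle CSV is stored.
    import pandas as pd
    import numpy as np

    # Load the Kaggle CSV (file name is an assumption)
    df1 = pd.read_csv('online_retail.csv')
    print(df1.shape)   # 541,909 rows and 8 columns are expected
    df1.head()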
  • Data preprocessing
    - We delete duplicate rows, rows with a null CustomerNo, and rows with non-positive Quantity or price values.
    # Remove duplicate rows
    df1 = df1.drop_duplicates()
    # Delete rows with a null CustomerNo
    df1 = df1[pd.notnull(df1['CustomerNo'])]
    # Remove orders with Quantity <= 0
    df1 = df1[df1['Quantity'] > 0]
    # Remove orders with a non-positive price
    df1 = df1[df1['Price'] > 0]
    - Convert the date column to datetime and keep only the date part.
    # Convert the date column to datetime
    df1["Date"] = pd.to_datetime(df1["Date"])
    # Keep only the date component (drop the time of day)
    df1["Date"] = df1["Date"].dt.date
    - Create a new dataframe based on the RFM model (Recency, Frequency and Monetary) for each CustomerNo.
    df2 = df1.copy()
    # Add TotalPrice column
    df2['TotalPrice'] = df2['Price'] * df2['Quantity']

    # Add Avg_Monetary (a copy of TotalPrice; its per-customer mean becomes the average order value)
    df2['Avg_Monetary'] = df2['TotalPrice']

    # Add Tenure column (a copy of Date; its per-customer minimum gives the first purchase date)
    df2['Tenure'] = df2['Date']
    df2.head()

    import datetime

    # Set current_date to 1 day after the last transaction date, i.e. 10/12/2011
    current_date = df2["Date"].max() + datetime.timedelta(days=1)
    print(current_date)

    # Aggregate per customer and country to build the RFM model
    rfm = df2.groupby(['CustomerNo','Country']).agg({
        'Date': lambda x: (current_date - x.max()).days,
        'TransactionNo': lambda x: x.nunique(),
        'TotalPrice': 'sum',
        'Avg_Monetary': 'mean',
        'Tenure': lambda y: (current_date - y.min()).days})

    # Assign names to the columns
    rfm.columns = ["Recency", "Frequency", "Monetary", "Avg_Monetary", "Tenure"]
    - Handle outliers using the standard score (z-score) and the interquartile range (IQR).
    Remove outliers in the Monetary column:
    from scipy import stats
    # Delete rows with Monetary's z-score greater than 3
    z = np.abs(stats.zscore(rfm[['Monetary']]))
    rfm1 = rfm[(z < 3).all(axis=1)]
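    As a companion to the z-score filter, a minimal IQR-based sketch is shown below; applying it to the Frequency column (rather than another column) is my assumption for illustration.
    # IQR-based outlier removal using the standard 1.5 * IQR rule
    Q1 = rfm1['Frequency'].quantile(0.25)
    Q3 = rfm1['Frequency'].quantile(0.75)
    IQR = Q3 - Q1
    rfm1 = rfm1[(rfm1['Frequency'] >= Q1 - 1.5 * IQR) & (rfm1['Frequency'] <= Q3 + 1.5 * IQR)]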
  • Customer Segmentation using K-Means and DBSCAN
    • K-Means
    • Choosing K using the Silhouette score:
      The elbow appears at K=3, where the Silhouette score reaches its highest value of about 0.597. The dataset is split into 3 clusters: 1,468 objects in cluster 0, 1,498 in cluster 1 and 1,349 in cluster 2. A sketch of the selection code is shown below.
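      The following is a minimal sketch of this step, assuming the RFM features are standardized with scikit-learn's StandardScaler; the exact feature set and parameter values are my assumptions.
      from sklearn.preprocessing import StandardScaler
      from sklearn.cluster import KMeans
      from sklearn.metrics import silhouette_score

      # Standardize the RFM features (the exact feature set is an assumption)
      X_scaled = StandardScaler().fit_transform(rfm1[['Recency', 'Frequency', 'Monetary']])

      # Compare Silhouette scores for several candidate values of K
      for k in range(2, 9):
          labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
          print(k, round(silhouette_score(X_scaled, labels), 3))

      # Refit with the chosen K = 3 and keep the cluster labels
      kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
      rfm1['KMeans_Cluster'] = kmeans.fit_predict(X_scaled)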
    • DBSCAN
    • Construct a k-distance graph to select an appropriate Eps distance for the DBSCAN model. Distances are Euclidean, and the minimum number of neighboring points MinPts is chosen based on the amount of data; here, three MinPts values are tested: 31, 62 and 7. For each MinPts, the corresponding k-distance graph is examined and eps is chosen at the 'elbow' region of the curve.
      Test results and indexes for each clustering configuration:
      - DBSCAN 1: 5 clusters, Silhouette index 0.487, CH index 10349.7, DB index 1.09
      - DBSCAN 2: 5 clusters, Silhouette index 0.487, CH index 10403.2, DB index 1.03
      - DBSCAN 3: 9 clusters, Silhouette index 0.457, CH index 6649.1, DB index 1.75
      The chosen number of clusters is 5.
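      A minimal sketch of the k-distance graph and the DBSCAN fit is shown below; the eps value of 0.5 is only an illustrative assumption, to be replaced by the value read off the elbow of the graph.
      from sklearn.neighbors import NearestNeighbors
      from sklearn.cluster import DBSCAN
      import matplotlib.pyplot as plt

      # k-distance graph for MinPts = 31: sort each point's distance to its MinPts-th neighbor
      min_pts = 31
      distances, _ = NearestNeighbors(n_neighbors=min_pts).fit(X_scaled).kneighbors(X_scaled)
      plt.plot(np.sort(distances[:, -1]))
      plt.ylabel(f'{min_pts}-NN distance')
      plt.show()

      # Fit DBSCAN with eps chosen at the 'elbow' of the graph (0.5 is illustrative)
      db = DBSCAN(eps=0.5, min_samples=min_pts).fit(X_scaled)
      labels = db.labels_  # label -1 marks noise points
      n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
      print('Number of clusters:', n_clusters)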
    • Combine K-Means and DBSCAN
    • After clustering with the K-Means algorithm (3 clusters), we run the DBSCAN algorithm inside each cluster to remove noise points. Removing these noise points makes the clusters more clearly separated from each other (a sketch follows the results below).
      The Silhouette score with K=3 after noise removal is 0.602922, the CH index is 15928.675631 and the DB index is 0.593439.
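      A minimal sketch of this noise-removal step is given below, assuming the K-Means labels from the previous step are stored in rfm1['KMeans_Cluster']; the per-cluster eps and min_samples values are illustrative assumptions.
      from sklearn.cluster import DBSCAN
      from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

      # Run DBSCAN inside each K-Means cluster and keep only the non-noise points
      keep = np.zeros(len(X_scaled), dtype=bool)
      for c in range(3):
          idx = np.where(rfm1['KMeans_Cluster'].values == c)[0]
          noise = DBSCAN(eps=0.5, min_samples=31).fit_predict(X_scaled[idx]) == -1
          keep[idx[~noise]] = True

      # Re-evaluate the K-Means clusters after removing the noise points
      X_clean = X_scaled[keep]
      labels_clean = rfm1['KMeans_Cluster'].values[keep]
      print('Silhouette score:', silhouette_score(X_clean, labels_clean))
      print('CH index:', calinski_harabasz_score(X_clean, labels_clean))
      print('DB index:', davies_bouldin_score(X_clean, labels_clean))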
  • Customer Lifetime Value Prediction using BG/NBD, Gamma Gamma and XGBoost
    • BG/NBD + Gamma Gamma
      - I use BG/NBD to predict the expected number of future purchases per customer
      - I use Gamma-Gamma to predict the AOV (average order value) per customer
      Model evaluation:
      MSE: 450.1
      RMSE: 21.2
      MAE: 5.8
      The low RMSE shows that the BG/NBD + Gamma-Gamma model predicts customer lifetime value quite accurately. A sketch of the fitting code is shown below.
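      The following is a minimal sketch using the lifetimes library; building the frequency/recency/T summary with summary_data_from_transaction_data, the penalizer values and the 3-month horizon are my assumptions about the setup.
      from lifetimes import BetaGeoFitter, GammaGammaFitter
      from lifetimes.utils import summary_data_from_transaction_data

      # Build the frequency / recency / T / monetary_value summary per customer
      summary = summary_data_from_transaction_data(
          df2, 'CustomerNo', 'Date', monetary_value_col='TotalPrice',
          observation_period_end=current_date)

      # Fit BG/NBD to predict the expected number of future purchases
      bgf = BetaGeoFitter(penalizer_coef=0.001)
      bgf.fit(summary['frequency'], summary['recency'], summary['T'])

      # Fit Gamma-Gamma on returning customers to model the average order value
      returning = summary[summary['frequency'] > 0]
      ggf = GammaGammaFitter(penalizer_coef=0.001)
      ggf.fit(returning['frequency'], returning['monetary_value'])

      # Combine both models into an expected CLTV over the next 3 months
      summary['predicted_cltv'] = ggf.customer_lifetime_value(
          bgf,
          summary['frequency'], summary['recency'], summary['T'],
          summary['monetary_value'],
          time=3, freq='D', discount_rate=0.01)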
    • XGBoost
      I use XGBoost to predict the business's revenue per customer with the xgb.XGBRegressor() model.
      Model evaluation:
      MSE: 44116.8
      RMSE: 210.04
      MAE: 54.80
      The RMSE is much higher than that of BG/NBD + Gamma-Gamma, so the latter is the better CLTV model here. A sketch of the XGBoost setup is shown below.
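      The following is a minimal sketch of the XGBoost setup; the feature set (RFM-style columns), the target (Monetary) and the hyperparameters are my assumptions for illustration.
      import xgboost as xgb
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import mean_squared_error, mean_absolute_error

      # Features and target (assumed): RFM-style columns predicting revenue per customer
      X = rfm1[['Recency', 'Frequency', 'Avg_Monetary', 'Tenure']]
      y = rfm1['Monetary']
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

      # Fit the regressor and evaluate it with MSE, RMSE and MAE
      model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42)
      model.fit(X_train, y_train)
      pred = model.predict(X_test)
      mse = mean_squared_error(y_test, pred)
      print('MSE :', mse)
      print('RMSE:', np.sqrt(mse))
      print('MAE :', mean_absolute_error(y_test, pred))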
  • Business Insights and Suggestions
    • Based on CLTV and geography
      • Insights
      • Japan has the highest average purchase value but a fairly low average purchase frequency, which suggests Japanese customers mainly buy high-value products on special occasions such as holidays or festivals.
      • The UK is the main market: it has the highest average total purchase value but is not among the top countries by average purchase value, meaning UK customers buy frequently but do not spend much per order.
      • Business Suggestions
      • Businesses should prioritize advertising luxury items and offer discounts on special occasions such as Lunar New Year or Christmas.
      • Increase the purchase value of UK customers by offering coupons or discounts when they buy combos.
    • Combine segmentation and CLTV suggestions
      Group 0: Loyal customers (rank 1)
      - Characteristics: second-highest average order value (AOV); highest CLTV.
      - Business strategies: increase their AOV by introducing promotional combos or promoting high-value products on special occasions.
      Group 1: Potential customers (rank 2)
      - Characteristics: highest AOV but few purchases, so their CLTV ranks 2nd.
      - Business strategies: encourage them to purchase multiple times.
        - Introduce a reward-point program for each purchase, or rank members by number of purchases: the higher the rank, the more attractive the promotions.
        - Prioritize advertising short-shelf-life products, such as food, drinks or cosmetics, to these customers to encourage repeat purchases.
      Group 2: Needing attention (rank 3)
      - Characteristics: low AOV, medium purchase count and low CLTV; they buy irregularly, mainly on special occasions or during discounts.
      - Business strategies: impress these customers by highlighting positive feedback and highly rated products.

Here are the full code and dataset I used:

Click here