Analyzing a marketing campaign
using Python

Project Summary

Click here to access my full project on GitHub

Input: A marketing campaign dataset
Goal:
- Identify the most effective marketing campaign
- Identify the customers who bring the most value
- Identify the factors most significantly related to the number of store purchases
Insight:
- Customers with a high AOV (average order value) tend to:
  - have few kids
  - rarely visit the website
  - buy through the catalog and in store
  - spend heavily on meat and wine
  - have a high income
- Total amount spent is the most influential factor for the number of store purchases.
- On average, the remaining customers spent more on fish than the Married PhD group.
- The last campaign was the most successful.
- The catalog is the most underperforming channel.
Project Outline
- Assessing and Cleaning data: Checking null values and data type
- Exploratory Data Analysis
  - Are there any outliers? How will you wrangle/handle them?
  - Are there any useful variables that you can engineer with the given data?
- Statistical Analysis
  - What factors are significantly related to the number of store purchases?
  - Fish has Omega 3 fatty acids, which are good for the brain. Accordingly, do "Married PhD candidates" have a significant relation with the amount spent on fish?
- Further Analysis
  - Which marketing campaign is most successful and unsuccessful?
  - Which channels are underperforming?

Context: Suppose the Chief Marketing Officer has told me that recent marketing campaigns have not been as effective as expected. I need to analyze the dataset to understand the problem and extract useful insights.

Dataset Description

Meaning of some fields:

Dt_Customer: Date of customer's enrollment with the company
MntWines: Amount spent on wine in the last 2 years
NumWebPurchases: Number of purchases made through the company's website
AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
Complain: 1 if customer complained in the last 2 years, 0 otherwise
Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

Project details:

Assessing and Cleaning data
- Checking null values: the dataset contains no null values.
- Checking data types (and changing them where necessary)
  Income is stored as object type. Solution: convert it to int:
```
df1['Income'] = df1['Income'].astype(int)
df1.info()
```
  Dt_Customer is stored as object type. Solution: convert it to datetime:
```
df1['Dt_Customer'] = pd.to_datetime(df1['Dt_Customer'])
df1['Dt_Customer'].info()
```
  The column name ' Income ' has a leading and a trailing space. Solution: rename the column (run this before the Income conversion, which refers to the cleaned name):
```
df1.rename(columns={' Income ': 'Income'}, inplace=True)
df1.columns
```
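
Taken together, the cleaning steps above might look like the sketch below. It runs on a small made-up DataFrame rather than the real dataset, and it assumes the Income values are stored as plain digit strings. The rename comes first because the later steps refer to the column as Income.

```python
import pandas as pd

# Hypothetical sample standing in for the real dataset (values are invented)
df1 = pd.DataFrame({
    ' Income ': ['84835', '57091', '67267'],
    'Dt_Customer': ['2014-06-16', '2014-06-15', '2014-05-13'],
})

# 1. Count null values in every column
null_counts = df1.isnull().sum()

# 2. Strip the stray spaces from the column name before using it
df1.rename(columns={' Income ': 'Income'}, inplace=True)

# 3. Convert Income from string to integer
#    (assumes plain digit strings; formatted values would need cleaning first)
df1['Income'] = df1['Income'].astype(int)

# 4. Convert Dt_Customer from string to datetime
df1['Dt_Customer'] = pd.to_datetime(df1['Dt_Customer'])

print(null_counts)
print(df1.dtypes)
```
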
Exploratory Data Analysis
- Are there any outliers? How will you wrangle/handle them?
- Are there any useful variables that you can engineer with the given data?
- What are the characteristics of customers with a high income, a high AOV, or kids?
  - Customers with a high income: buy infrequently (low recency), spend heavily on wine and meat, and rarely visit the website.
  - Customers with kids tend to: spend little on food, have a low AOV, buy with deals, and visit the website frequently.
  - Customers with a high AOV tend to: have few kids, rarely visit the website, buy through the catalog and in store, spend heavily on meat and wine, and have a high income.
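
The first two questions in this section (engineered variables and outliers) can be illustrated with a rough sketch. Everything in it is an assumption made for illustration: the sample values are invented, the column names `MntMeatProducts` and `NumCatalogPurchases` and the new columns `TotalAmount`, `TotalPurchases`, and `AOV` are not confirmed by this write-up, and `iqr_bounds` is a hypothetical helper. The 1.5 × IQR rule is one common convention for flagging outliers, not necessarily the one used in the project.

```python
import numpy as np
import pandas as pd

# Invented sample rows; the column set is an assumption about the dataset
df = pd.DataFrame({
    'Income': [30000, 42000, 51000, 58000, 65000, 72000, 666666],
    'MntWines': [10, 50, 200, 300, 400, 635, 20],
    'MntMeatProducts': [5, 30, 100, 150, 250, 546, 10],
    'NumCatalogPurchases': [0, 1, 2, 3, 4, 10, 0],
    'NumStorePurchases': [0, 3, 4, 5, 6, 4, 1],
})

# --- Engineered variables ---
mnt_cols = [c for c in df.columns if c.startswith('Mnt')]
num_cols = [c for c in df.columns if c.startswith('Num')]
df['TotalAmount'] = df[mnt_cols].sum(axis=1)
df['TotalPurchases'] = df[num_cols].sum(axis=1)

# AOV = total spend / number of orders. A customer with zero orders
# produces inf or NaN, which is set to 0 here.
aov = df['TotalAmount'] / df['TotalPurchases']
df['AOV'] = aov.replace([np.inf, -np.inf], np.nan).fillna(0)

# --- Outliers via the 1.5 * IQR rule ---
def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_bounds(df['Income'])
is_outlier = (df['Income'] < lower) | (df['Income'] > upper)
outliers = df[is_outlier]
cleaned = df[~is_outlier]

print(df[['TotalAmount', 'TotalPurchases', 'AOV']])
print(outliers[['Income']])
```

Whether flagged rows are dropped, capped, or kept is a judgment call that depends on the real data.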

Statistical Analysis

What factors are significantly related to the number of store purchases?

```
# What factors are significantly related to the number of store purchases?
# Approach: fit a random forest, then inspect its feature importance scores.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Build the modelling dataframe: drop the identifier and date columns
# (random forests handle both classification and regression tasks)
rd_df = new_df.drop(columns=['ID', 'Dt_Customer'])
# Replace infinite values (if any) with 0
rd_df.replace([np.inf, -np.inf], 0, inplace=True)
# One-hot encode the string columns with get_dummies
rd_df = pd.get_dummies(rd_df)
rd_df.head()

# Separate the features (X) from the target (y)
X = rd_df.drop(columns=['NumStorePurchases'])
y = rd_df['NumStorePurchases']
# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Create a forest of 200 trees, using all CPU cores
rg = RandomForestRegressor(n_estimators=200, n_jobs=-1)
# Train the model
rg.fit(X_train, y_train)
# Predict on the held-out test set
y_pred = rg.predict(X_test)
```
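
The code above stops at prediction, whereas the stated plan is to rank factors by importance score. A hedged sketch of that last step follows. It uses synthetic data in which the target is driven mostly by a made-up `TotalAmount` column, so the feature names, the values, and the fixed `random_state` are assumptions for illustration and do not reproduce the project's actual results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic features (all names and values are invented)
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    'TotalAmount': rng.uniform(0, 2000, n),
    'NumWebVisitsMonth': rng.integers(0, 10, n),
    'Noise': rng.normal(size=n),
})
# Synthetic target: depends mostly on TotalAmount, plus random noise
y = X['TotalAmount'] / 200 + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
rg = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
rg.fit(X_train, y_train)

# Check the fit on held-out data before trusting the importances
r2 = r2_score(y_test, rg.predict(X_test))

# Rank features by impurity-based importance (the scores sum to 1)
importances = pd.Series(
    rg.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(f"R^2 on test set: {r2:.2f}")
print(importances)
```

Impurity-based importances measure how much each feature contributes to the model's predictions, not statistical significance, so they are best read as a ranking.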

Fish has Omega 3 fatty acids, which are good for the brain. Accordingly, do "Married PhD candidates" have a significant relation with the amount spent on fish?

```
# Use a t-test to check whether the two groups have the same mean
from scipy.stats import ttest_ind
# This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values.
# By default, the test assumes that the populations have identical variances.
pval = ttest_ind(married_phd.MntFishProducts, the_rest.MntFishProducts).pvalue
print("T-test p-value: ", pval)
```
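
The code above does not show how `married_phd` and `the_rest` are built. One way they could be constructed is sketched below, assuming the dataset has `Marital_Status` and `Education` columns containing the values `Married` and `PhD` (these names and all sample values are assumptions). The sketch also passes `equal_var=False`, which runs Welch's t-test and drops the equal-variance assumption. That is a swapped-in variant, not what the original call does.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Invented sample; column names and category labels are assumptions
df = pd.DataFrame({
    'Marital_Status': ['Married', 'Married', 'Single', 'Together',
                       'Married', 'Divorced', 'Married', 'Single'],
    'Education': ['PhD', 'PhD', 'Graduation', 'Master',
                  'Graduation', 'PhD', 'PhD', 'Master'],
    'MntFishProducts': [10, 14, 40, 35, 50, 45, 12, 38],
})

# Split customers into the Married PhD group and everyone else
is_married_phd = (df['Marital_Status'] == 'Married') & (df['Education'] == 'PhD')
married_phd = df[is_married_phd]
the_rest = df[~is_married_phd]

# Welch's t-test: two-sided, does not assume equal variances
result = ttest_ind(
    married_phd.MntFishProducts,
    the_rest.MntFishProducts,
    equal_var=False,
)

print("Mean (Married PhD):", married_phd.MntFishProducts.mean())
print("Mean (the rest):   ", the_rest.MntFishProducts.mean())
print("Welch t-test p-value:", result.pvalue)
```

A small p-value only says the group means differ. Comparing the two means shows the direction of the difference.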

Further Analysis
- Which marketing campaign is most successful?
- Which channels are underperforming?
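
Both questions can be approached with simple aggregates. The sketch below assumes the campaign flags follow the `AcceptedCmp3` naming pattern, with `Response` for the last campaign, and that purchase counts per channel live in `NumWebPurchases`, `NumCatalogPurchases`, and `NumStorePurchases`. Only some of these names appear in this write-up, and the data is invented, so the printed ranking illustrates the method rather than the project's findings.

```python
import pandas as pd

# Invented sample; column names partly assumed from the naming pattern
df = pd.DataFrame({
    'AcceptedCmp1': [0, 0, 1, 0, 0, 0],
    'AcceptedCmp2': [0, 0, 0, 0, 0, 0],
    'AcceptedCmp3': [0, 1, 0, 0, 0, 0],
    'Response':     [1, 0, 1, 0, 1, 0],
    'NumWebPurchases':     [4, 6, 3, 5, 2, 4],
    'NumCatalogPurchases': [1, 2, 0, 3, 1, 2],
    'NumStorePurchases':   [6, 8, 5, 7, 4, 6],
})

campaign_cols = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'Response']
channel_cols = ['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']

# Acceptance rate per campaign: the mean of a 0/1 flag is the share of 1s
acceptance_rate = df[campaign_cols].mean().sort_values(ascending=False)

# Total purchases per channel, lowest first, to surface weak channels
channel_totals = df[channel_cols].sum().sort_values()

print(acceptance_rate)
print(channel_totals)
```
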

Here are the full code and the dataset I used:

Click here