In this project, I analyse an e-commerce company's website data, comparing the company's newly launched webpage with the old (existing) one. The company wants to know which of the two webpages attracts more customers.
The dataset has 294,478 rows (samples) and five columns.
To get started, let's import our libraries.
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
#Set the seeds so that the results are reproducible (the simulation below uses np.random)
random.seed(42)
np.random.seed(42)
1. Now, read in the ab_data.csv data. Store it in df.
a. Read in the dataset and take a look at the top few rows here:
df = pd.read_csv('ab_data.csv')
df.head()
b. The cell below shows the number of rows and columns in the dataset.
df.shape
c. The number of unique users in the dataset.
df['user_id'].nunique()
d. The proportion of users converted.
#here 1 represents True and 0 False.
df.groupby('converted')['user_id'].count()/df.shape[0]
e. The number of times the new_page and treatment don't match.
match = df[((df['group'] == 'treatment') == (df['landing_page'] == 'new_page')) == False].shape[0]
match
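The mismatch count above can be cross-checked with a contingency table. The following is a minimal sketch on a small made-up DataFrame (the rows are illustrative and not taken from ab_data.csv); it assumes only the `group` and `landing_page` column names used above.

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    'group':        ['treatment', 'treatment', 'control', 'control', 'control'],
    'landing_page': ['new_page', 'old_page', 'old_page', 'new_page', 'old_page'],
})

# A row is consistent when "is treatment" and "is new_page" agree
is_treatment = toy['group'] == 'treatment'
is_new_page = toy['landing_page'] == 'new_page'
n_mismatch = int((is_treatment != is_new_page).sum())

# The (treatment, old_page) and (control, new_page) cells hold the same mismatches
table = pd.crosstab(toy['group'], toy['landing_page'])
print(table)
print('Mismatched rows:', n_mismatch)
```

The cross-tabulation also shows which direction the mismatches go, which a single count does not.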
f. Do any of the rows have missing values?
df.info()
There are no missing values in the dataset.
2. In this part:
- Rows where treatment in the group column does not match new_page in the landing_page column are dropped.
- Rows where control in the group column does not match old_page in the landing_page column are dropped.
- All duplicate rows are also dropped.
a. Create a new dataframe and store your new dataframe in df2.
df2 = df.copy()
index = df2[(((df2['group'] == 'treatment') & (df2['landing_page'] == 'new_page')) ==False) & (((df2['group'] == 'control') & (df2['landing_page'] == 'old_page'))==False) ].index
df2.drop(index , inplace=True)
#df.tail(5)
# Double Check all of the correct rows were removed - this should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]
# Get indices of the duplicated user records
duplicated_user = df2.user_id[df2.user_id.duplicated()]
# Drop the duplicate user records
df2.drop(index=duplicated_user.index, inplace=True)
#Double check if duplicates are dropped
df2[df2['user_id'].duplicated()].shape[0]
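The same cleaning steps can be written more compactly with a boolean mask and `drop_duplicates`. This is only a sketch on made-up rows (the `toy` frame is hypothetical); it assumes the column names used above and, like the code above, keeps the first record of each repeated `user_id`.

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    'user_id':      [1, 2, 3, 3, 4],
    'group':        ['treatment', 'control', 'treatment', 'treatment', 'control'],
    'landing_page': ['new_page', 'new_page', 'new_page', 'new_page', 'old_page'],
    'converted':    [0, 1, 0, 0, 1],
})

# Keep rows where "is treatment" and "is new_page" agree
aligned = (toy['group'] == 'treatment') == (toy['landing_page'] == 'new_page')

# Then keep only the first record for each user_id
clean = toy[aligned].drop_duplicates(subset='user_id', keep='first')
print(clean)
```

Building a new frame this way avoids the in-place drops, so the original data stays untouched if a step needs to be re-run.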
a. How many unique user_ids are in df2?
df2['user_id'].nunique()
4. Use df2 in the cells below to answer the quiz questions related to Quiz 4 in the classroom.
a. What is the probability of an individual converting regardless of the page they receive?
prob_converted = (df2['converted'] == 1).mean()
prob_converted
b. Given that an individual was in the control group, what is the probability they converted?
#individuals from control group
control_ind = df2.query('group == "control"').shape[0]
#individuals who are in control group and converted
control_and_convert = df2.query('converted == 1 & group == "control"').shape[0]
#probability of converting, given the control group
prob_of_control_and_convert = control_and_convert/control_ind
print('The probability of an individual converting in the control group is: {0:.4f}'.format(prob_of_control_and_convert))
c. Given that an individual was in the treatment group, what is the probability they converted?
#individuals from treatment group
treatment_ind = df2.query('group == "treatment"').shape[0]
#individuals who are in treatment group and converted
treatment_and_convert = df2.query('converted == 1 & group == "treatment"').shape[0]
#probability of converting, given the treatment group
prob_of_treatment_and_convert = treatment_and_convert/treatment_ind
print('The probability of an individual converting in the treatment group is: {0:.4f}'.format(prob_of_treatment_and_convert))
d. What is the probability that an individual received the new page?
#individuals that received new page
new_page = df2.query('landing_page == "new_page"')
#probability of individuals who received new page
prob_new_page = new_page.shape[0]/df2.shape[0]
print('The probability that an individual received the new page is {0:.4f}'.format(prob_new_page))
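Because `converted` is a 0/1 column, the conditional probabilities above are simply group means. As a small sketch on made-up data (the `toy` frame and its values are hypothetical), a single `groupby` gives both rates at once:

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    'group':     ['control'] * 4 + ['treatment'] * 4,
    'converted': [1, 0, 0, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 column within each group is P(converted | group)
rates = toy.groupby('group')['converted'].mean()
print(rates)
```

On the real data, the equivalent call would presumably be `df2.groupby('group')['converted'].mean()`.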
From the above observations, both groups convert at roughly the same rate, about 12%. Therefore there is no evidence that the new page leads to more conversions.
Null and alternative hypotheses:
We assume the old page is at least as good as the new page unless the new page is shown to be better at a Type I error rate of 5%. The null and alternative hypotheses are therefore:
Null hypothesis ($H_0$): $p_{old} \geq p_{new}$
Alternative hypothesis ($H_1$): $p_{old} < p_{new}$
Equivalently:
Null: $p_{new} - p_{old} \leq 0$
Alternative: $p_{new} - p_{old} > 0$
NOTE: $p_{old}$ and $p_{new}$ are the conversion rates of the old and new pages respectively.
a. What is the conversion rate for $p_{new}$ under the null?
#observed conversion rate of the new page (treatment group)
p_new = prob_of_treatment_and_convert
#print the rate to four decimal places
print('Conversion rate of the new page is: {0:.4f}'.format(p_new))
b. What is the conversion rate for $p_{old}$ under the null?
#observed conversion rate of the old page (control group)
p_old = prob_of_control_and_convert
#print the rate to four decimal places
print('Conversion rate of the old page is: {0:.4f}'.format(p_old))
#The difference in conversion rate is
print('Difference in conversion rate is {0:.4f}.'.format(p_new - p_old))
c. What is $n_{new}$, the number of individuals in the treatment group?
new_ind = df2.query('landing_page == "new_page"')
n_new = len(new_ind)
print('The number of individuals in treatment group is: {}'.format(n_new))
d. What is $n_{old}$, the number of individuals in the control group?
old_ind = df2.query('landing_page == "old_page"')
n_old = len(old_ind)
print('The number of individuals in control group is: {}'.format(n_old))
#The number of individuals in the entire dataset (sample size)
sample_size = len(df2)
print('The sample size is {}'.format(sample_size))
Simulate 10,000 draws of $p_{new}$ - $p_{old}$ to approximate the sampling distribution of the difference in conversion rates.
p_diffs = []
# Compute the sampling distribution
for _ in range(10000):
    # Generate 0/1 conversions for the new/old page groups using their conversion rates
    new_page_converted = np.random.choice([0, 1], size=n_new, replace=True, p=[1-p_new, p_new])
    old_page_converted = np.random.choice([0, 1], size=n_old, replace=True, p=[1-p_old, p_old])
    # Calculate the difference in conversion rates
    p_diffs.append(new_page_converted.mean() - old_page_converted.mean())
p_diffs = np.array(p_diffs)
# Draw values from a normal distribution centred at 0 (the null hypothesis) with the same spread
p_diffs_null = np.random.normal(0, p_diffs.std(), size=sample_size)
p_diffs_null
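The loop above draws every individual outcome, which can be slow for large groups. As an alternative sketch (not the method used above), the number of conversions per simulated group can be drawn directly from a binomial distribution, here under a single common rate as the null would imply. All inputs below (`n_new_demo`, `n_old_demo`, `p_null_demo`) are illustrative assumptions, not values from ab_data.csv.

```python
import numpy as np

# Assumed, illustrative inputs (not the values from ab_data.csv)
n_new_demo, n_old_demo = 5000, 5000
p_null_demo = 0.12   # common conversion rate assumed under the null
n_sims = 10_000

rng = np.random.default_rng(42)

# Each binomial draw is the number of conversions in one simulated group
new_rates = rng.binomial(n_new_demo, p_null_demo, size=n_sims) / n_new_demo
old_rates = rng.binomial(n_old_demo, p_null_demo, size=n_sims) / n_old_demo
demo_diffs = new_rates - old_rates

print(demo_diffs.mean(), demo_diffs.std())
```

Because both groups share one rate here, `demo_diffs` is already centred near zero and needs no separate recentring step.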
# Plot the distribution under the null along with the mean of the simulated differences
plt.hist(p_diffs_null, alpha=0.5)
plt.axvline(x = p_diffs.mean(), color = 'r', linestyle = '--')
plt.axvline(x = p_diffs_null.mean(), color = 'k', linestyle = '-')
plt.title('Distribution of the difference in conversion rates under the null')
plt.ylabel('Frequency')
plt.xlabel('Difference in conversion rates')
plt.show()
j. What proportion of the p_diffs are greater than the actual difference observed in ab_data.csv?
observed_diff = p_new - p_old
# Calculate p-value
p_value = (p_diffs_null > observed_diff).mean()
print('The probability of observing this difference in conversion rate or a higher value,\n' +
      'given that the null hypothesis is true, = {0:.2f}.'.format(p_value))
Above, we calculated the p-value: the probability of observing a difference in conversion rates at least as large as the one in our data, given that the null hypothesis is true. A large p-value means the observed difference is consistent with the null hypothesis.
Under the null hypothesis, conversions through the old page are equal to or greater than conversions through the new page.
Since the p-value is above the 5% significance level, we fail to reject the null hypothesis.
l. We could also use a built-in to achieve similar results. Let n_old and n_new refer to the number of rows associated with the old and new pages, respectively.
import statsmodels.api as sm
# Number of conversions for each page
convert_old = df2.query('landing_page == "old_page" and converted == 1').shape[0]
convert_new = df2.query('landing_page == "new_page" and converted == 1').shape[0]
m. Now use sm.stats.proportions_ztest to compute your test statistic and p-value.
z_score, p_value = sm.stats.proportions_ztest([convert_new, convert_old], [n_new, n_old], alternative='larger')
print('Z-score is {0:.2f} and p-value is {1:.2f}.'.format(z_score, p_value))
n. What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages?
The test uses a Type I error rate of 5% (a 95% confidence level). Because the test is one-sided (alternative='larger'), the null hypothesis would be rejected only if the z-score exceeded the critical value of about 1.645. In this case our z-score is -1.31, so the null hypothesis is not rejected.
The p-value is also close to 1, far above 0.05, so we fail to reject the null hypothesis.
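As a rough cross-check on the built-in, the two-proportion z statistic can be computed by hand from the pooled conversion rate. The sketch below uses only the standard library; the function name `two_proportion_ztest_larger` and the counts are hypothetical, and it assumes the pooled-variance form of the test, which I believe matches the default behaviour of proportions_ztest.

```python
from math import sqrt, erfc

def two_proportion_ztest_larger(count_a, n_a, count_b, n_b):
    """One-sided z-test of H1: rate_a > rate_b, using the pooled proportion."""
    rate_a, rate_b = count_a / n_a, count_b / n_b
    pooled = (count_a + count_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Upper-tail probability of a standard normal: P(Z > z)
    p_value = 0.5 * erfc(z / sqrt(2))
    return z, p_value

# Hypothetical counts, for illustration only
z_demo, p_demo = two_proportion_ztest_larger(110, 1000, 120, 1000)
print('z = {0:.2f}, p = {1:.2f}'.format(z_demo, p_demo))
```

A negative z-score with a one-sided "larger" alternative always gives a p-value above 0.5, which is why a result like the one above cannot lead to rejecting the null.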
The analysis above compared two webpages for an e-commerce company that is unsure whether to keep the old page or implement the new one.
From these observations, the newly developed webpage does not prove to be better.
We therefore suggest that the company keep the old webpage.
from subprocess import call
call(['python', '-m', 'nbconvert', 'Analyze_ab_test_results_notebook.ipynb'])