Analyze A/B Test Results¶

Table of Contents¶

Introduction
Part I - Probability
Part II - A/B Test
Part III - Regression

Introduction¶

In this project, I analyse an e-commerce company's website data. The analysis compares the company's newly launched webpage with the existing (old) one, because the company wants to know which of the two webpages attracts more customers.

The dataset has 294,478 rows (samples) and five columns.

Part I - Probability¶

To get started, let's import our libraries.

import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
# Set the seed so that results are reproducible.
# Note: this seeds Python's random module only; it does not seed NumPy's np.random.
random.seed(42)

1. Now, read in the ab_data.csv data. Store it in df.

a. Read in the dataset and take a look at the top few rows here:

df = pd.read_csv('ab_data.csv')

df.head()

b. The cell below shows the number of rows and columns in the dataset.

df.shape

(294478, 5)

c. The number of unique users in the dataset.

df.nunique()

user_id         290584
timestamp       294478
group                2
landing_page         2
converted            2
dtype: int64

d. The proportion of users converted.

# Here 1 means the user converted and 0 means the user did not convert.
df.groupby('converted')['user_id'].count()/df.shape[0]

converted
0    0.880341
1    0.119659
Name: user_id, dtype: float64
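Because converted is a 0/1 column, its mean is the conversion rate directly. The sketch below illustrates this on a small made-up frame; `toy`, `conversion_rate` and `outcome_shares` are hypothetical names and the values are not taken from ab_data.csv.

```python
import pandas as pd

# Toy stand-in for ab_data.csv (hypothetical values, not the real data)
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "converted": [0, 1, 0, 0, 1],
})

# The mean of a 0/1 column is the share of ones, i.e. the conversion rate
conversion_rate = toy["converted"].mean()

# value_counts(normalize=True) gives the share of each outcome
outcome_shares = toy["converted"].value_counts(normalize=True)

print(conversion_rate)
print(outcome_shares)
```

On the real data the same idea would be `df['converted'].mean()`, which should agree with the groupby result above.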

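e. The number of times new_page and treatment don't line up (treatment shown old_page, or control shown new_page).

# Count rows where "is treatment" and "is new_page" disagree
mismatch = df[(df['group'] == 'treatment') != (df['landing_page'] == 'new_page')].shape[0]
mismatch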

3893
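A cross-tabulation makes it easier to see where such mismatches sit. This is a minimal sketch on made-up rows; the frame `toy` and the names `table` and `mismatches` are assumptions for illustration only.

```python
import pandas as pd

# Toy rows (hypothetical): two of the five are inconsistent
toy = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page", "new_page"],
})

# Rows are groups, columns are landing pages.
# (control, new_page) and (treatment, old_page) are the mismatched cells.
table = pd.crosstab(toy["group"], toy["landing_page"])

# A row is consistent only when "is treatment" and "is new_page" agree
is_treatment = toy["group"] == "treatment"
is_new_page = toy["landing_page"] == "new_page"
mismatches = int((is_treatment != is_new_page).sum())

print(table)
print(mismatches)
```

The mismatch count equals the sum of the two mismatched cells of the table.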

f. Do any of the rows have missing values?

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
user_id         294478 non-null int64
timestamp       294478 non-null object
group           294478 non-null object
landing_page    294478 non-null object
converted       294478 non-null int64
dtypes: int64(2), object(3)
memory usage: 11.2+ MB

There are no missing values in the dataset: every column has 294,478 non-null entries.
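The same check can be made explicitly by counting missing values per column. The sketch below uses a small made-up frame that does contain gaps, so the counts are visible; `toy`, `missing_per_column` and `has_missing` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical) with one missing group and one missing converted value
toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "group": ["control", None, "treatment"],
    "converted": [0, 1, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# True if any cell in the frame is missing
has_missing = bool(toy.isna().any().any())

print(missing_per_column)
print(has_missing)
```

Applied to df, `df.isna().sum()` would be expected to show zero in every column, in line with the df.info() output above.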

Data Cleaning¶

2. In this part:

    - Rows where treatment in the group column is not paired with new_page in the landing_page column are dropped.

    - Rows where control in the group column is not paired with old_page in the landing_page column are dropped.

    - Rows with a duplicated user_id are also dropped, keeping each user's first record.

a. Create a new dataframe and store your new dataframe in df2.

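df2 = df.copy()
# Rows that are neither (treatment, new_page) nor (control, old_page) are inconsistent
is_treatment_new = (df2['group'] == 'treatment') & (df2['landing_page'] == 'new_page')
is_control_old = (df2['group'] == 'control') & (df2['landing_page'] == 'old_page')
index = df2[~is_treatment_new & ~is_control_old].index
df2.drop(index, inplace=True)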

# Double Check all of the correct rows were removed - this should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]

0

# Get the indices of the duplicated user records
duplicated_user = df2.user_id[df2.user_id.duplicated()]

# Drop the duplicate user records
df2.drop(index=duplicated_user.index, inplace=True)

# Double-check that the duplicates were dropped - this should be 0
df2[df2['user_id'].duplicated()].shape[0]

0
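The two cleaning steps can also be written as a single filter followed by drop_duplicates. This is a sketch on made-up rows, not a replacement for the cells above; `toy`, `aligned` and `cleaned` are hypothetical names.

```python
import pandas as pd

# Toy rows (hypothetical): user 2 is misaligned, user 3 appears twice
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 3, 4],
    "group": ["control", "treatment", "treatment", "treatment", "control"],
    "landing_page": ["old_page", "old_page", "new_page", "new_page", "old_page"],
    "converted": [0, 1, 0, 0, 1],
})

# Keep only rows where group and landing page agree
aligned = (
    ((toy["group"] == "treatment") & (toy["landing_page"] == "new_page"))
    | ((toy["group"] == "control") & (toy["landing_page"] == "old_page"))
)

# Then keep the first record of each user_id
cleaned = toy[aligned].drop_duplicates(subset="user_id", keep="first")

print(cleaned)
```

Filtering with a boolean mask returns a new frame, so the original is left untouched and no inplace drop is needed.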

Data Analysis¶

PART 1: Probability¶

a. How many unique user_ids are in df2?

df2.nunique()

user_id         290584
timestamp       290584
group                2
landing_page         2
converted            2
dtype: int64

4. Use df2 in the cells below to answer the quiz questions related to Quiz 4 in the classroom.

a. What is the probability of an individual converting regardless of the page they receive?

prob_converted = (df2['converted'] == 1).mean()
prob_converted

0.11959708724499628

b. Given that an individual was in the control group, what is the probability they converted?

# Number of individuals in the control group
control_ind = df2.query('group == "control"').shape[0]
# Number of individuals in the control group who converted
control_and_convert = df2.query('converted == 1 & group == "control"').shape[0]
# Probability of converting, given the control group
prob_of_control_and_convert = control_and_convert/control_ind
print('The probability that an individual in the control group converted is: {0:.4f}'.format(prob_of_control_and_convert))

The probability that an individual in the control group converted is: 0.1204

c. Given that an individual was in the treatment group, what is the probability they converted?

# Number of individuals in the treatment group
treatment_ind = df2.query('group == "treatment"').shape[0]
# Number of individuals in the treatment group who converted
treatment_and_convert = df2.query('converted == 1 & group == "treatment"').shape[0]
# Probability of converting, given the treatment group
prob_of_treatment_and_convert = treatment_and_convert/treatment_ind
print('The probability that an individual in the treatment group converted is: {0:.4f}'.format(prob_of_treatment_and_convert))

The probability that an individual in the treatment group converted is: 0.1188

d. What is the probability that an individual received the new page?

#individuals that received new page
new_page = df2.query('landing_page == "new_page"')
#probability of individuals who received new page
prob_new_page = new_page.shape[0]/df2.shape[0]
print('The probability that an individual received the new page is {0:.4f}'.format(prob_new_page))

The probability that an individual received the new page is 0.5001

From the observations above, the two groups convert at nearly the same rate: 12.04% for control and 11.88% for treatment, both about 12%. Therefore, there is no evidence so far that the new page leads to more conversions.
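The conditional conversion rates for both groups can be obtained in one step with a groupby. The sketch below uses made-up rows; `toy`, `rate_by_group` and `overall_rate` are hypothetical names.

```python
import pandas as pd

# Toy rows (hypothetical): 1 of 4 control users and 2 of 4 treatment users convert
toy = pd.DataFrame({
    "group": ["control"] * 4 + ["treatment"] * 4,
    "converted": [1, 0, 0, 0, 1, 1, 0, 0],
})

# P(converted | group): the mean of the 0/1 column within each group
rate_by_group = toy.groupby("group")["converted"].mean()

# P(converted) regardless of group
overall_rate = toy["converted"].mean()

print(rate_by_group)
print(overall_rate)
```

On df2 the equivalent call would be `df2.groupby('group')['converted'].mean()`, which should reproduce the two probabilities printed above.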

Part II - A/B Test¶

Null and alternative hypotheses:

We assume the old page is at least as good as the new one unless the data show the new page to be better, with a Type I error rate of 5%. The null and alternative hypotheses are therefore:

  Null hypothesis ($H_0$): $p_{old} \geq p_{new}$
  Alternative hypothesis ($H_1$): $p_{old} < p_{new}$

Equivalently:

  Null: $p_{new} - p_{old} \leq 0$
  Alternative: $p_{new} - p_{old} > 0$

NOTE: $p_{old}$ and $p_{new}$ are the conversion rates of the old and new pages, respectively.

a. What is the conversion rate for $p_{new}$ under the null?

# Conversion rate of the new page (the observed rate in the treatment group)
p_new = prob_of_treatment_and_convert
# Print the rate to four decimal places
print('Conversion rate of the new page is: {0:.4f}'.format(p_new))

Conversion rate of the new page is: 0.1188

b. What is the conversion rate for $p_{old}$ under the null?

# Conversion rate of the old page (the observed rate in the control group)
p_old = prob_of_control_and_convert
# Print the rate to four decimal places
print('Conversion rate of the old page is: {0:.4f}'.format(p_old))

Conversion rate of the old page is: 0.1204

#The difference in conversion rate is
print('Difference in conversion rate is {0:.4f}.'.format(p_new - p_old))

Difference in conversion rate is -0.0016.

c. What is $n_{new}$, the number of individuals in the treatment group?

new_ind = df2.query('landing_page == "new_page"')
n_new = len(new_ind)
print('The number of individuals in treatment group is: {}'.format(n_new))

The number of individuals in treatment group is: 145310

d. What is $n_{old}$, the number of individuals in the control group?

old_ind = df2.query('landing_page == "old_page"')
n_old = len(old_ind)
print('The number of individuals in control group is: {}'.format(n_old))

The number of individuals in control group is: 145274

# The number of individuals in the entire dataset (sample size)
sample_size = len(df2)
print('The sample size is {}'.format(sample_size))

The sample size is 290584

Simulate 10,000 draws of $p_{new}$ - $p_{old}$ to approximate the sampling distribution of the difference in conversion rates.

p_diffs = np.array([])

# Compute the sampling distribution
for _ in range(10000):
    # Generate elements from the new/old page groups using their probability
    new_page_converted = np.random.choice([0, 1], size = n_new, replace = True, p = [1-p_new, p_new])
    old_page_converted = np.random.choice([0, 1], size = n_old, replace = True, p = [1-p_old, p_old])
    
    # Calculate the difference in conversion rates
    p_diffs = np.append(p_diffs, new_page_converted.mean() - old_page_converted.mean())

# Draw sample_size values from a normal distribution centred at 0 (the null value)
# with the same standard deviation as p_diffs, to represent the null hypothesis.

p_diffs_null = np.random.normal(0, p_diffs.std(), size = sample_size)
p_diffs_null

array([-0.00025746,  0.00096247,  0.00035378, ..., -0.00038651,
       -0.00012285,  0.00040741])

# Plot the distribution under the null, with the mean of the simulated differences
# (red dashed line) and the mean of the null distribution (black solid line)
plt.hist(p_diffs_null, alpha=0.5)
plt.axvline(x = p_diffs.mean(), color = 'r', linestyle = '--')
plt.axvline(x = p_diffs_null.mean(), color = 'k', linestyle = '-')
plt.title('Distribution of the difference in conversion rates under the null')
plt.ylabel('Frequency')
plt.xlabel('Difference in conversion rates')
plt.show()

j. What proportion of the p_diffs are greater than the actual difference observed in ab_data.csv?

observed_diff = p_new - p_old

# Calculate p-value
p_value = (p_diffs_null > observed_diff).mean()
print('The probability of observing the difference in conversion rate or higher values, \n' + 
      'given that the null hypothesis is true, = {0:.2f}.'.format(p_value))

The probability of observing the difference in conversion rate or higher values, 
given that the null hypothesis is true, = 0.91.
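A common alternative is to simulate the null directly by drawing both groups from one shared (pooled) conversion rate, instead of recentring a normal distribution at zero. The sketch below shows that variant. It is not the method used in the cells above, and every input (`n_new_demo`, `n_old_demo`, `p_pooled_demo`, `observed_diff_demo`) is a hypothetical value chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded NumPy generator for reproducibility

# Hypothetical inputs for illustration only (not the real counts)
n_new_demo, n_old_demo = 5000, 5000
p_pooled_demo = 0.12         # one shared rate, as the null assumes
observed_diff_demo = -0.002  # hypothetical observed p_new - p_old

n_sims = 10000

# Each binomial draw is the number of conversions in one simulated group,
# so dividing by the group size gives a simulated conversion rate.
new_rates = rng.binomial(n_new_demo, p_pooled_demo, size=n_sims) / n_new_demo
old_rates = rng.binomial(n_old_demo, p_pooled_demo, size=n_sims) / n_old_demo
null_diffs = new_rates - old_rates

# One-sided p-value: share of simulated differences above the observed one
p_value_demo = (null_diffs > observed_diff_demo).mean()
print(p_value_demo)
```

Drawing binomial counts avoids building an array in a Python loop, and using a seeded generator makes the simulated values repeatable.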

Above we calculated the p-value: the probability, assuming the null hypothesis is true, of observing a difference in conversion rates at least as large as the one in our sample. A p-value of 0.91 is far above the 5% Type I error rate, so the observed difference is entirely consistent with the null hypothesis.

Under the null hypothesis, conversions through the old page are equal to or greater than those through the new page.
We therefore fail to reject the null hypothesis.

l. We could also use a built-in function to achieve similar results. Let n_old and n_new refer to the number of rows associated with the old and new pages, respectively.

import statsmodels.api as sm

# Number of conversions on each page (old_ind and new_ind were defined above)
convert_old = len(old_ind.query('converted == 1'))
convert_new = len(new_ind.query('converted == 1'))

m. Now use sm.stats.proportions_ztest to compute the test statistic and p-value.

z_score, p_value = sm.stats.proportions_ztest([convert_new, convert_old], [n_new, n_old], alternative='larger')
print('Z-score is {0:.2f} and p-value is {1:.2f}.'.format(z_score, p_value))

Z-score is -1.31 and p-value is 0.91.
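To see what the built-in computes, here is a minimal hand-written version of a pooled two-proportion z-test with a one-sided ("larger") alternative. `two_proportion_z` is a hypothetical helper, not a statsmodels function, and the counts passed to it are toy numbers rather than the real conversion counts.

```python
import math

def two_proportion_z(count_a, n_a, count_b, n_b):
    """Pooled two-proportion z-test for the alternative p_a > p_b."""
    p_a = count_a / n_a
    p_b = count_b / n_b
    # Under the null both groups share one rate, estimated by pooling
    pooled = (count_a + count_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # One-sided p-value: 1 - Phi(z), written with the complementary error function
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Toy counts (hypothetical): 1188 of 10000 versus 1204 of 10000
z_demo, p_demo = two_proportion_z(1188, 10000, 1204, 10000)
print('Z-score is {0:.2f} and p-value is {1:.2f}.'.format(z_demo, p_demo))

# Identical rates give z = 0 and a one-sided p-value of 0.5
z_equal, p_equal = two_proportion_z(100, 1000, 100, 1000)
```

A negative z-score means the first group converted less often than the second, which pushes the one-sided p-value above 0.5.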

n. What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages?

The test uses a Type I error rate of 5% (a 95% confidence level). Because the alternative is one-sided ('larger'), the null hypothesis would be rejected only if the z-score exceeded the critical value of about 1.645. Our z-score is -1.31, so the null hypothesis is not rejected.

The p-value of 0.91 is far above 0.05, which likewise means we fail to reject the null hypothesis.
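Since the test above was run with alternative='larger', the relevant cut-off is the one-sided critical value rather than the two-sided one. The sketch below computes both from the standard normal distribution using only the standard library; the variable names are hypothetical, and -1.31 is the z-score reported above.

```python
from statistics import NormalDist

alpha = 0.05  # Type I error rate

# One-sided test: all of alpha sits in the upper tail
critical_one_sided = NormalDist().inv_cdf(1 - alpha)

# Two-sided test: alpha is split between both tails
critical_two_sided = NormalDist().inv_cdf(1 - alpha / 2)

z_score_reported = -1.31  # z-score from the proportions z-test above
reject_null = z_score_reported > critical_one_sided

print(round(critical_one_sided, 3), round(critical_two_sided, 3), reject_null)
```

With a one-sided alternative the z-score has to be large and positive to reject the null, so a negative z-score can never lead to rejection here.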

Conclusion¶

The analysis above compared two webpages from an e-commerce company that is unsure whether to keep the old page or adopt the new one.

The results show no evidence that the newly developed webpage performs better.

We therefore suggest that the company keep the old webpage.


Output of df.head():

   user_id                   timestamp      group landing_page  converted
0   851104  2017-01-21 22:11:48.556739    control     old_page          0
1   804228  2017-01-12 08:01:45.159739    control     old_page          0
2   661590  2017-01-11 16:55:06.154213  treatment     new_page          0
3   853541  2017-01-08 18:28:03.143765  treatment     new_page          0
4   864975  2017-01-21 01:52:26.210827    control     old_page          1