Project: Investigating hospital appointment

Table of Contents

Introduction

For this project, i use medical appointment No-shows provided on kaggle, which is the information collected from medical appointments in Brazil. The dataset has 110527 samples with 14 columns.

Note: It says 'No' if the patient show up and 'Yes' if the patient does'nt show up.

  • The questions raised are;
    - are non alcoholic patients more likely to show at their appointment?
    - could gender be related to showing up in the appointment?
    - can patients with diabetes also show up in their appointment?
In [1]:
#import useful packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Data Wrangling

General Properties

In [95]:
#load the dataset
df = pd.read_csv('No_show_appointment.csv')
In [54]:
#view first 5 rows 
df.head()
Out[54]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 No
1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 0 0 0 0 No
2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 0 0 0 0 No
3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 0 0 0 0 No
4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 1 0 0 0 No
In [55]:
df.shape
Out[55]:
(110527, 14)
In [56]:
df.describe()
Out[56]:
PatientId AppointmentID Age Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received
count 1.105270e+05 1.105270e+05 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000
mean 1.474963e+14 5.675305e+06 37.088874 0.098266 0.197246 0.071865 0.030400 0.022248 0.321026
std 2.560949e+14 7.129575e+04 23.110205 0.297675 0.397921 0.258265 0.171686 0.161543 0.466873
min 3.921784e+04 5.030230e+06 -1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 4.172614e+12 5.640286e+06 18.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 3.173184e+13 5.680573e+06 37.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 9.439172e+13 5.725524e+06 55.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
max 9.999816e+14 5.790484e+06 115.000000 1.000000 1.000000 1.000000 1.000000 4.000000 1.000000

The above table shows a description of the whole dataset, by counting the samples, showing each column mean, minimum value, and also maximum value as well.

In [57]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentID     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighbourhood     110527 non-null object
Scholarship       110527 non-null int64
Hipertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handcap           110527 non-null int64
SMS_received      110527 non-null int64
No-show           110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB

The above information clearly shows that there are no missing values in the dataset

Data Cleaning

In this dataset 'ScheduledDays' and 'AppointmentDay' datatypes are in a form of Object(string), therefore converting these columns to datetime will allow date and time manipulations.

Also in the dataset the 'Age' column have a value of -1, therefore the row with such data will be dropped.

Finaly, in the No-show column the values currently 'No' means the patient has shown up, and 'Yes' means the patient has'nt shown up in the appointment.

  • things to be done in this section include;

    • The rows where Age == -1 are dropped.
    • Replacing values (No, Yes) in No-show column to (Yes, No) respectively.
    • lastly, column names are renamed to lowercases and seperated by underscore (_).
In [96]:
#the rows of age == -1 are dropped
df.drop(df[df['Age'] == -1].index, inplace =True)

Droping column where Age == -1

In [97]:
#unnecessary coulumns are dropped
df.drop([ 'Neighbourhood','ScheduledDay','AppointmentDay','SMS_received','Scholarship', 'Handcap', 'Age'],axis=1,inplace=True)
In [98]:
#the description is checked to confirm the dropped row
df.describe()
Out[98]:
PatientId AppointmentID Hipertension Diabetes Alcoholism
count 1.105260e+05 1.105260e+05 110526.000000 110526.000000 110526.000000
mean 1.474934e+14 5.675304e+06 0.197248 0.071865 0.030400
std 2.560943e+14 7.129544e+04 0.397923 0.258266 0.171686
min 3.921784e+04 5.030230e+06 0.000000 0.000000 0.000000
25% 4.172536e+12 5.640285e+06 0.000000 0.000000 0.000000
50% 3.173184e+13 5.680572e+06 0.000000 0.000000 0.000000
75% 9.438963e+13 5.725523e+06 0.000000 0.000000 0.000000
max 9.999816e+14 5.790484e+06 1.000000 1.000000 1.000000
In [99]:
#check if the rows are dropped
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 110526 entries, 0 to 110526
Data columns (total 7 columns):
PatientId        110526 non-null float64
AppointmentID    110526 non-null int64
Gender           110526 non-null object
Hipertension     110526 non-null int64
Diabetes         110526 non-null int64
Alcoholism       110526 non-null int64
No-show          110526 non-null object
dtypes: float64(1), int64(4), object(2)
memory usage: 6.7+ MB

Replacing values (No, Yes) to (Yes, No) respectively.

Lastly columns names are renamed to lowercases

In [109]:
#column names are changed to lowercase
new_column_names = {'PatientId': 'patient_id', 'AppointmentID': 'appointment_id', 
                    'Gender': 'gender', 'ScheduledDay': 'scheduled_day', 
                    'AppointmentDay': 'appointment_day', 
                    'Scholarship': 'scholarship', 'Hipertension' :'hipertension', 'No-show': 'showed'}
In [110]:
df.rename(columns= new_column_names, inplace=True)
In [111]:
#the values yes,no are changed to no,yes respectively
df['showed'].replace(['Yes', 'No'], ['No','Yes'], inplace=True)
In [112]:
#view first rows to see the changes made
df.head()
Out[112]:
patient_id appointment_id gender hipertension Diabetes Alcoholism showed
0 2.987250e+13 5642903 F 1 0 0 Yes
1 5.589978e+14 5642503 M 0 0 0 Yes
2 4.262962e+12 5642549 F 0 0 0 Yes
3 8.679512e+11 5642828 F 0 0 0 Yes
4 8.841186e+12 5642494 F 1 1 0 Yes

Exploratory Data Analysis

QUESTION 1( are non alcoholic patients more likely to show at their appointment?)

In [9]:
#copy the data to df_1
df_1 = df.copy()
In [10]:
df.Alcoholism.value_counts()
Out[10]:
0    107166
1      3360
Name: Alcoholism, dtype: int64

The above column shows the number of alcoholic and non-alcoholic patients in the dataset

In [11]:
#for alcoholic patients and showing up
alcoholic = df.query('Alcoholism == 1')
alcoholic_showedup_count = alcoholic.groupby('showed')['patient_id'].count()
alcoholic_showedup_proportion = alcoholic_showedup_count/alcoholic_showedup_count.sum()
print(alcoholic_showedup_proportion)
showed
No     0.201488
Yes    0.798512
Name: patient_id, dtype: float64
In [12]:
#the proportion of alcoholic patients
alcoholic_showedup_proportion.plot(kind='bar')
plt.title('Proportion of alcoholic patients')
plt.ylabel('Proportion')
plt.xlabel('Alcoholism')
plt.xticks(rotation=0)
plt.show();

The above query shows that about 79.9% of alcoholic patients show up in their appointment, and only 20.1% doesnt show up in their appointment.

In [13]:
#for non patients and showing up
non_alcoholic = df.query('Alcoholism == 0')
non_alcoholic_showedup_count = non_alcoholic.groupby('showed')['patient_id'].count()
non_alcoholic_showedup_proportion = non_alcoholic_showedup_count/non_alcoholic_showedup_count.sum()
print(non_alcoholic_showedup_proportion)
showed
No     0.201948
Yes    0.798052
Name: patient_id, dtype: float64
In [14]:
non_alcoholic_showedup_proportion.plot(kind='bar')
plt.title('Non_Proportion of alcoholic patients')
plt.ylabel('Proportion')
plt.xlabel('Alcoholism')
plt.xticks(rotation=0)
plt.show();

The above query also shows that around 79.8% of non-alcoholic show in their appointment, and only 20.2% doesnt show up

In [15]:
df.gender.value_counts()
Out[15]:
F    71839
M    38687
Name: gender, dtype: int64

The above column shows that Female patient almost doubles the number of male patients in the dataset

In [16]:
# proprtion of each gender for those who showed up
showed_up = df_1.query('showed == "Yes"')
showedup_gender_count = showed_up.groupby('gender')['patient_id'].count()
showedup_gender_prop = showedup_gender_count/showedup_gender_count.sum()
showedup_gender_prop.rename(index = {'F': 'Female', 'M': 'Male'}, inplace=True)
showedup_gender_prop.name = 'Gender'

print(showedup_gender_prop)
gender
Female    0.648985
Male      0.351015
Name: Gender, dtype: float64
In [17]:
# Plot proportion of gender groups in showing up
showedup_gender_prop.plot(kind='pie', figsize=(7,5))
plt.title('Showed proportion of Gender Groups')
plt.xticks(rotation=0)
plt.legend();
In [18]:
print(showedup_gender_prop)
gender
Female    0.648985
Male      0.351015
Name: Gender, dtype: float64

From the above distribution about 64.9% of those who showed up where Females, while 35.1 where Male.

In [46]:
df.groupby('gender')['hipertension'].value_counts()
Out[46]:
gender  hipertension
F       0               56500
        1               15339
M       0               32225
        1                6462
Name: hipertension, dtype: int64

Research Question 3 (can patients with diabetes also show up in their appointment?)

In [47]:
df.Diabetes.value_counts()
Out[47]:
0    102583
1      7943
Name: Diabetes, dtype: int64
In [48]:
# Calculate show up proportion of patients with health_scholarship
#for alcoholic patients and showing up
Diabetes = df.query('Diabetes == 1')
Diabetes_showedup_count = Diabetes.groupby('showed')['patient_id'].count()
Diabetes_showedup_proportion = Diabetes_showedup_count/Diabetes_showedup_count.sum()
Diabetes_showedup_proportion.name = 'Diabetes & showed up?'
#no_health_scholarship_showedup_proportion.name = 'No health_scholarship & showed up?'

# Plot the proportion
Diabetes_showedup_proportion.plot(kind='bar', figsize=(8, 5))
plt.title('Proportions of patients with diabetes & showed up')
plt.xticks(rotation=0)
plt.show();
In [49]:
print(Diabetes_showedup_proportion)
showed
No     0.180033
Yes    0.819967
Name: Diabetes & showed up?, dtype: float64

The above query clearly shows that around 82.0% of patients with diabetes show up, and only 18.00% did not show up.

In [50]:
non_Diabetes = df.query('Diabetes == 0')
non_Diabetes_showedup_count = non_Diabetes.groupby('showed')['patient_id'].count()
non_Diabetes_showedup_proportion = non_Diabetes_showedup_count/non_Diabetes_showedup_count.sum()
non_Diabetes_showedup_proportion.name = 'non_diabetes & showed up?'
#no_health_scholarship_showedup_proportion.name = 'No health_scholarship & showed up?'

# Plot the proportion
non_Diabetes_showedup_proportion.plot(kind='bar', figsize=(8, 5))
plt.title('Proportions of patients with diabetes & showed up')
plt.xticks(rotation=0)
plt.show();
In [51]:
print(non_Diabetes_showedup_proportion)
showed
No     0.20363
Yes    0.79637
Name: non_diabetes & showed up?, dtype: float64

The above query clearly shows that around 79.6% of patients without diabetes show up in their appointment, and 20.4% did not show up.

Limitations

  • The 'Age' column has a negative value of age, so i had to exclude rows(samples) with such values in the investigation.
  • This generalization does not include neighbourhood, appointmentDay, sms_received, scholarship and handcap columns(datas).

Conclusions

QUESTION 1( are non alcoholic patients more likely to show at their appointment?)

Around 79.9% of alcoholic patients showed up at their appointment and on the other hand, around 79.8% of non-alcoholic patients showed up. Hence; non alcoholic patients are not likely the ones that show up at their appointment, which means alcohlism is not to considered to showing up in appointment.

From the above distributions, the average male patients that showed up in their appointment is 35.1%, while on the other hand the average females patients that showed up in thier appointment is 64.9%. Which simply means that female patients have the highest percentage of showing up that the male patients. Hence gender can also be related in showing up.

QUESTION 3(can patients with diabetes also show up in their appointment?)

Around 80.0% of patients with diabetes showed up at their appointment which makes only 18.0% of patients with diabetes that didnt show up. And on the other hand, around 79.6% of patients without diabetes showed up, and only 20.4 of patiets without diabetes didnt show up.

Hence; patients with diabetes are more likely to miss their appointment that those without diabetes.