Project: Investigating hospital appointment¶

Table of Contents¶

Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions

Introduction¶

For this project, i use medical appointment No-shows provided on kaggle, which is the information collected from medical appointments in Brazil. The dataset has 110527 samples with 14 columns.

Note: It says 'No' if the patient show up and 'Yes' if the patient does'nt show up.

The questions raised are;

- are non alcoholic patients more likely to show at their appointment?
- could gender be related to showing up in the appointment?
- can patients with diabetes also show up in their appointment?

#import useful packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Data Wrangling¶

General Properties¶

#load the dataset
df = pd.read_csv('No_show_appointment.csv')

#view first 5 rows 
df.head()

df.shape

(110527, 14)

df.describe()

The above table shows a description of the whole dataset, by counting the samples, showing each column mean, minimum value, and also maximum value as well.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentID     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighbourhood     110527 non-null object
Scholarship       110527 non-null int64
Hipertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handcap           110527 non-null int64
SMS_received      110527 non-null int64
No-show           110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB

The above information clearly shows that there are no missing values in the dataset

Data Cleaning¶

In this dataset 'ScheduledDays' and 'AppointmentDay' datatypes are in a form of Object(string), therefore converting these columns to datetime will allow date and time manipulations.

Also in the dataset the 'Age' column have a value of -1, therefore the row with such data will be dropped.

Finaly, in the No-show column the values currently 'No' means the patient has shown up, and 'Yes' means the patient has'nt shown up in the appointment.

things to be done in this section include;
- The rows where Age == -1 are dropped.
- Replacing values (No, Yes) in No-show column to (Yes, No) respectively.
- lastly, column names are renamed to lowercases and seperated by underscore (_).

#the rows of age == -1 are dropped
df.drop(df[df['Age'] == -1].index, inplace =True)

Droping column where Age == -1

#unnecessary coulumns are dropped
df.drop([ 'Neighbourhood','ScheduledDay','AppointmentDay','SMS_received','Scholarship', 'Handcap', 'Age'],axis=1,inplace=True)

#the description is checked to confirm the dropped row
df.describe()

#check if the rows are dropped
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 110526 entries, 0 to 110526
Data columns (total 7 columns):
PatientId        110526 non-null float64
AppointmentID    110526 non-null int64
Gender           110526 non-null object
Hipertension     110526 non-null int64
Diabetes         110526 non-null int64
Alcoholism       110526 non-null int64
No-show          110526 non-null object
dtypes: float64(1), int64(4), object(2)
memory usage: 6.7+ MB

Replacing values (No, Yes) to (Yes, No) respectively.

Lastly columns names are renamed to lowercases

#column names are changed to lowercase
new_column_names = {'PatientId': 'patient_id', 'AppointmentID': 'appointment_id', 
                    'Gender': 'gender', 'ScheduledDay': 'scheduled_day', 
                    'AppointmentDay': 'appointment_day', 
                    'Scholarship': 'scholarship', 'Hipertension' :'hipertension', 'No-show': 'showed'}

df.rename(columns= new_column_names, inplace=True)

#the values yes,no are changed to no,yes respectively
df['showed'].replace(['Yes', 'No'], ['No','Yes'], inplace=True)

#view first rows to see the changes made
df.head()

¶

Exploratory Data Analysis¶

QUESTION 1( are non alcoholic patients more likely to show at their appointment?)¶

#copy the data to df_1
df_1 = df.copy()

df.Alcoholism.value_counts()

0    107166
1      3360
Name: Alcoholism, dtype: int64

The above column shows the number of alcoholic and non-alcoholic patients in the dataset

#for alcoholic patients and showing up
alcoholic = df.query('Alcoholism == 1')
alcoholic_showedup_count = alcoholic.groupby('showed')['patient_id'].count()
alcoholic_showedup_proportion = alcoholic_showedup_count/alcoholic_showedup_count.sum()
print(alcoholic_showedup_proportion)

showed
No     0.201488
Yes    0.798512
Name: patient_id, dtype: float64

#the proportion of alcoholic patients
alcoholic_showedup_proportion.plot(kind='bar')
plt.title('Proportion of alcoholic patients')
plt.ylabel('Proportion')
plt.xlabel('Alcoholism')
plt.xticks(rotation=0)
plt.show();

The above query shows that about 79.9% of alcoholic patients show up in their appointment, and only 20.1% doesnt show up in their appointment.

#for non patients and showing up
non_alcoholic = df.query('Alcoholism == 0')
non_alcoholic_showedup_count = non_alcoholic.groupby('showed')['patient_id'].count()
non_alcoholic_showedup_proportion = non_alcoholic_showedup_count/non_alcoholic_showedup_count.sum()
print(non_alcoholic_showedup_proportion)

showed
No     0.201948
Yes    0.798052
Name: patient_id, dtype: float64

non_alcoholic_showedup_proportion.plot(kind='bar')
plt.title('Non_Proportion of alcoholic patients')
plt.ylabel('Proportion')
plt.xlabel('Alcoholism')
plt.xticks(rotation=0)
plt.show();

The above query also shows that around 79.8% of non-alcoholic show in their appointment, and only 20.2% doesnt show up

Research Question 2 (could gender be related to showing up in the appointment?)¶

df.gender.value_counts()

F    71839
M    38687
Name: gender, dtype: int64

The above column shows that Female patient almost doubles the number of male patients in the dataset

# proprtion of each gender for those who showed up
showed_up = df_1.query('showed == "Yes"')
showedup_gender_count = showed_up.groupby('gender')['patient_id'].count()
showedup_gender_prop = showedup_gender_count/showedup_gender_count.sum()
showedup_gender_prop.rename(index = {'F': 'Female', 'M': 'Male'}, inplace=True)
showedup_gender_prop.name = 'Gender'

print(showedup_gender_prop)

gender
Female    0.648985
Male      0.351015
Name: Gender, dtype: float64

# Plot proportion of gender groups in showing up
showedup_gender_prop.plot(kind='pie', figsize=(7,5))
plt.title('Showed proportion of Gender Groups')
plt.xticks(rotation=0)
plt.legend();

print(showedup_gender_prop)

gender
Female    0.648985
Male      0.351015
Name: Gender, dtype: float64

From the above distribution about 64.9% of those who showed up where Females, while 35.1 where Male.

could hipertension of either male or female patient be related to showing up in their appoinment ?¶

df.groupby('gender')['hipertension'].value_counts()

gender  hipertension
F       0               56500
        1               15339
M       0               32225
        1                6462
Name: hipertension, dtype: int64

Research Question 3 (can patients with diabetes also show up in their appointment?)¶

df.Diabetes.value_counts()

0    102583
1      7943
Name: Diabetes, dtype: int64

# Calculate show up proportion of patients with health_scholarship
#for alcoholic patients and showing up
Diabetes = df.query('Diabetes == 1')
Diabetes_showedup_count = Diabetes.groupby('showed')['patient_id'].count()
Diabetes_showedup_proportion = Diabetes_showedup_count/Diabetes_showedup_count.sum()
Diabetes_showedup_proportion.name = 'Diabetes & showed up?'
#no_health_scholarship_showedup_proportion.name = 'No health_scholarship & showed up?'

# Plot the proportion
Diabetes_showedup_proportion.plot(kind='bar', figsize=(8, 5))
plt.title('Proportions of patients with diabetes & showed up')
plt.xticks(rotation=0)
plt.show();

print(Diabetes_showedup_proportion)

showed
No     0.180033
Yes    0.819967
Name: Diabetes & showed up?, dtype: float64

The above query clearly shows that around 82.0% of patients with diabetes show up, and only 18.00% did not show up.

non_Diabetes = df.query('Diabetes == 0')
non_Diabetes_showedup_count = non_Diabetes.groupby('showed')['patient_id'].count()
non_Diabetes_showedup_proportion = non_Diabetes_showedup_count/non_Diabetes_showedup_count.sum()
non_Diabetes_showedup_proportion.name = 'non_diabetes & showed up?'
#no_health_scholarship_showedup_proportion.name = 'No health_scholarship & showed up?'

# Plot the proportion
non_Diabetes_showedup_proportion.plot(kind='bar', figsize=(8, 5))
plt.title('Proportions of patients with diabetes & showed up')
plt.xticks(rotation=0)
plt.show();

print(non_Diabetes_showedup_proportion)

showed
No     0.20363
Yes    0.79637
Name: non_diabetes & showed up?, dtype: float64

The above query clearly shows that around 79.6% of patients without diabetes show up in their appointment, and 20.4% did not show up.

Limitations¶

The 'Age' column has a negative value of age, so i had to exclude rows(samples) with such values in the investigation.
This generalization does not include neighbourhood, appointmentDay, sms_received, scholarship and handcap columns(datas).

Conclusions¶

QUESTION 1( are non alcoholic patients more likely to show at their appointment?)¶

Around 79.9% of alcoholic patients showed up at their appointment and on the other hand, around 79.8% of non-alcoholic patients showed up. Hence; non alcoholic patients are not likely the ones that show up at their appointment, which means alcohlism is not to considered to showing up in appointment.

From the above distributions, the average male patients that showed up in their appointment is 35.1%, while on the other hand the average females patients that showed up in thier appointment is 64.9%. Which simply means that female patients have the highest percentage of showing up that the male patients. Hence gender can also be related in showing up.

QUESTION 3(can patients with diabetes also show up in their appointment?)¶

Around 80.0% of patients with diabetes showed up at their appointment which makes only 18.0% of patients with diabetes that didnt show up. And on the other hand, around 79.6% of patients without diabetes showed up, and only 20.4 of patiets without diabetes didnt show up.

Hence; patients with diabetes are more likely to miss their appointment that those without diabetes.

	PatientId	AppointmentID	Gender	ScheduledDay	AppointmentDay	Age	Neighbourhood	Hipertension	Diabetes	No-show
0	2.987250e+13	5642903	F	2016-04-29T18:38:08Z	2016-04-29T00:00:00Z	62	JARDIM DA PENHA	1	0	No
1	5.589978e+14	5642503	M	2016-04-29T16:08:27Z	2016-04-29T00:00:00Z	56	JARDIM DA PENHA	0	0	No
2	4.262962e+12	5642549	F	2016-04-29T16:19:04Z	2016-04-29T00:00:00Z	62	MATA DA PRAIA	0	0	No
3	8.679512e+11	5642828	F	2016-04-29T17:29:31Z	2016-04-29T00:00:00Z	8	PONTAL DE CAMBURI	0	0	No
4	8.841186e+12	5642494	F	2016-04-29T16:07:23Z	2016-04-29T00:00:00Z	56	JARDIM DA PENHA	1	1	No

	PatientId	AppointmentID	Age	Scholarship	Hipertension	Diabetes	Alcoholism	Handcap	SMS_received
count	1.105270e+05	1.105270e+05	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000
mean	1.474963e+14	5.675305e+06	37.088874	0.098266	0.197246	0.071865	0.030400	0.022248	0.321026
std	2.560949e+14	7.129575e+04	23.110205	0.297675	0.397921	0.258265	0.171686	0.161543	0.466873
min	3.921784e+04	5.030230e+06	-1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	4.172614e+12	5.640286e+06	18.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	3.173184e+13	5.680573e+06	37.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	9.439172e+13	5.725524e+06	55.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000
max	9.999816e+14	5.790484e+06	115.000000	1.000000	1.000000	1.000000	1.000000	4.000000	1.000000

	PatientId	AppointmentID	Hipertension	Diabetes	Alcoholism
count	1.105260e+05	1.105260e+05	110526.000000	110526.000000	110526.000000
mean	1.474934e+14	5.675304e+06	0.197248	0.071865	0.030400
std	2.560943e+14	7.129544e+04	0.397923	0.258266	0.171686
min	3.921784e+04	5.030230e+06	0.000000	0.000000	0.000000
25%	4.172536e+12	5.640285e+06	0.000000	0.000000	0.000000
50%	3.173184e+13	5.680572e+06	0.000000	0.000000	0.000000
75%	9.438963e+13	5.725523e+06	0.000000	0.000000	0.000000
max	9.999816e+14	5.790484e+06	1.000000	1.000000	1.000000

Project: Investigating hospital appointment¶

Table of Contents¶

Introduction¶

Data Wrangling¶

General Properties¶

Data Cleaning¶

¶

Exploratory Data Analysis¶

QUESTION 1( are non alcoholic patients more likely to show at their appointment?)¶

Research Question 2 (could gender be related to showing up in the appointment?)¶

could hipertension of either male or female patient be related to showing up in their appoinment ?¶

Research Question 3 (can patients with diabetes also show up in their appointment?)¶

Limitations¶

Conclusions¶

QUESTION 1( are non alcoholic patients more likely to show at their appointment?)¶

QUESTION 2(could gender be related to showing up in the appointment?)¶

QUESTION 3(can patients with diabetes also show up in their appointment?)¶