Let's Save A Life with AI 🤖

A comprehensive guide to building a supervised machine learning model


About our dataset

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.

Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

Data Source: kaggle.com/andrewmvd/heart-failure-clinical..

Task

Create a model for predicting mortality caused by Heart Failure.
12 clinical features for predicting death events.

Our Machine Learning WorkFlow

I feel a lot more comfortable defining my workflow for solving a machine learning problem before I ever start solving it, as that gives me a sense of direction. However, this may be different for other people ✍🏻

Below are the steps we are going to take to solve this machine learning problem:

  1. Problem Definition and Data Collection

  2. Get the data ready for use (Data Preprocessing)

    • Check for missing values
    • Fill missing values, if any.
    • Turn categorical features into numerical ones
  3. Feature Engineering: Feature engineering is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. A feature is a property shared by independent units on which analysis or prediction is to be done. Features are used by predictive models and influence results.

  4. Modeling

  5. Make Predictions
  6. Evaluate model performance using appropriate metrics
  7. See if we need to improve our model
  8. Export our trained model
  9. Load our trained model

1. Problem Definition

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

Data Collection

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
df = pd.read_csv('Datasets/heart_failure_clinical_records_dataset.csv')
df.head()
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT
0 75.0 0 582.0 0.0 20.0 1 265000 1.9 130.0 1.0 0.0 4.0 1
1 55.0 0 7861.0 0.0 38.0 0 263358.03 1.1 136.0 1.0 0.0 6.0 1
2 65.0 0 146.0 NaN NaN 0 162000 1.3 129.0 1.0 1.0 7.0 1
3 50.0 1 111.0 0.0 20.0 0 210000 1.9 137.0 1.0 0.0 7.0 1
4 65.0 1 NaN 1.0 20.0 0 327000 2.7 116.0 0.0 0.0 8.0 1
# Let's get a summary of our data set
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       298 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  292 non-null    float64
 3   diabetes                  294 non-null    float64
 4   ejection_fraction         294 non-null    float64
 5   high_blood_pressure       296 non-null    object 
 6   platelets                 288 non-null    object 
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              294 non-null    float64
 9   sex                       298 non-null    float64
 10  smoking                   298 non-null    float64
 11  time                      297 non-null    float64
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(9), int64(2), object(2)
memory usage: 30.5+ KB
# Check for duplicates
df.duplicated().sum()
0
# Let's check how correlated each feature is to another
correlation = df.corr()
correlation
age anaemia creatinine_phosphokinase diabetes ejection_fraction serum_creatinine serum_sodium sex smoking time DEATH_EVENT
age 1.000000 0.091852 -0.088883 -0.096539 0.061344 0.155650 -0.052726 0.059484 0.018416 -0.219635 0.252176
anaemia 0.091852 1.000000 -0.184689 -0.015766 0.026149 0.054518 0.060957 -0.099176 -0.109526 -0.133430 0.066270
creatinine_phosphokinase -0.088883 -0.184689 1.000000 -0.022312 -0.054024 -0.012482 0.057265 0.074000 0.005451 -0.025688 0.073627
diabetes -0.096539 -0.015766 -0.022312 1.000000 -0.015566 -0.043697 -0.092627 -0.154765 -0.135563 0.028746 0.001362
ejection_fraction 0.061344 0.026149 -0.054024 -0.015566 1.000000 -0.010466 0.199457 -0.146827 -0.057282 0.028818 -0.261605
serum_creatinine 0.155650 0.054518 -0.012482 -0.043697 -0.010466 1.000000 -0.181161 0.010253 -0.026469 -0.147806 0.290386
serum_sodium -0.052726 0.060957 0.057265 -0.092627 0.199457 -0.181161 1.000000 -0.042738 0.001440 0.057033 -0.175385
sex 0.059484 -0.099176 0.074000 -0.154765 -0.146827 0.010253 -0.042738 1.000000 0.446947 -0.005693 -0.007482
smoking 0.018416 -0.109526 0.005451 -0.135563 -0.057282 -0.026469 0.001440 0.446947 1.000000 -0.019119 -0.014233
time -0.219635 -0.133430 -0.025688 0.028746 0.028818 -0.147806 0.057033 -0.005693 -0.019119 1.000000 -0.522918
DEATH_EVENT 0.252176 0.066270 0.073627 0.001362 -0.261605 0.290386 -0.175385 -0.007482 -0.014233 -0.522918 1.000000
df.describe()
age anaemia creatinine_phosphokinase diabetes ejection_fraction serum_creatinine serum_sodium sex smoking time DEATH_EVENT
count 298.000000 299.000000 292.000000 294.000000 294.000000 299.000000 294.000000 298.000000 298.000000 297.000000 299.00000
mean 60.870248 0.431438 577.688356 0.418367 38.227891 1.391104 136.697279 0.651007 0.322148 130.208754 0.32107
std 11.898166 0.496107 972.468942 0.494132 11.852295 1.034449 4.338123 0.477454 0.468085 77.365687 0.46767
min 40.000000 0.000000 23.000000 0.000000 14.000000 0.500000 113.000000 0.000000 0.000000 4.000000 0.00000
25% 51.000000 0.000000 115.000000 0.000000 30.000000 0.900000 134.000000 0.000000 0.000000 73.000000 0.00000
50% 60.000000 0.000000 249.500000 0.000000 38.000000 1.100000 137.000000 1.000000 0.000000 115.000000 0.00000
75% 70.000000 1.000000 582.000000 1.000000 45.000000 1.400000 140.000000 1.000000 1.000000 201.000000 1.00000
max 95.000000 1.000000 7861.000000 1.000000 80.000000 9.400000 148.000000 1.000000 1.000000 285.000000 1.00000

Definition of terms from the DataFrame .describe() method above

  • Count: the number of non-null items in a particular feature/column
  • Mean: the average value (the sum of the values divided by their count)
  • Std: the standard deviation (the square root of the variance, i.e. how much values differ from the mean)
  • Min: the minimum value in each feature or column
  • Max: the maximum value in each feature or column
  • Percentiles (25%, 50%, 75%): the values below which 25%, 50% and 75% of the data fall (see the short sketch below)
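As a quick illustration of those percentile rows, the same quartiles can be computed directly for a single column. A minimal sketch using the age column:

# The 25%, 50% and 75% rows of .describe() are simply the column quantiles
df['age'].quantile([0.25, 0.5, 0.75])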
# Check number of columns and rows
df.shape # This shows that our dataset contains 299 rows and 13 columns
(299, 13)
# Let's check out the column names
j = 0
for i in df.columns:
    j+=1
    print(j, i.upper())
1 AGE
2 ANAEMIA
3 CREATININE_PHOSPHOKINASE
4 DIABETES
5 EJECTION_FRACTION
6 HIGH_BLOOD_PRESSURE
7 PLATELETS
8 SERUM_CREATININE
9 SERUM_SODIUM
10 SEX
11 SMOKING
12 TIME
13 DEATH_EVENT
df['age'].hist(bins=50);

[Output: histogram of the age column]

# Let's Visualize our correlations better
# sns.set(font_scale=1)
fig, ax = plt.subplots(figsize=(15,10))
ax = sns.heatmap(correlation, annot=True, fmt='.2f', cmap='YlGnBu', linewidths=.05);

[Output: heatmap of the feature correlations]

pd.crosstab(df.age, df.DEATH_EVENT)
DEATH_EVENT 0 1
age
40.000 7 0
41.000 1 0
42.000 6 1
43.000 1 0
44.000 2 0
45.000 13 6
46.000 2 1
47.000 1 0
48.000 0 2
49.000 3 1
50.000 18 8
51.000 3 1
52.000 5 0
53.000 9 1
54.000 1 1
55.000 14 3
56.000 1 0
57.000 1 1
58.000 8 2
59.000 1 3
60.000 20 13
60.667 1 1
61.000 4 0
62.000 4 1
63.000 8 0
64.000 3 0
65.000 18 8
66.000 2 0
67.000 2 0
68.000 3 2
69.000 1 2
70.000 18 7
72.000 2 5
73.000 3 1
75.000 5 6
77.000 1 1
78.000 2 0
79.000 1 0
80.000 2 5
81.000 1 0
82.000 0 3
85.000 3 3
86.000 0 1
87.000 0 1
90.000 1 2
94.000 0 1
95.000 0 2
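The raw age-by-outcome crosstab above is a little hard to scan. One optional way to make the trend clearer (a minimal sketch, not part of the original analysis; the age_bins name and bin edges are just illustrative) is to bin the ages first and plot the counts per bin:

# Bin ages into ranges and plot survivals vs. deaths per bin (illustrative only)
age_bins = pd.cut(df['age'], bins=[39, 50, 60, 70, 80, 96])
pd.crosstab(age_bins, df['DEATH_EVENT']).plot(kind='bar', figsize=(10, 5));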

2. Data Preprocessing

Check if there are missing values

df.isna()
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT
0 False False False False False False False False False False False False False
1 False False False False False False False False False False False False False
2 False False False True True False False False False False False False False
3 False False False False False False False False False False False False False
4 False False True False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
294 False False False False False False True False False False False False False
295 False False False False False False False False False False False False False
296 False False False False False False True False False False False False False
297 False False False False False False True False False False False False False
298 False False False False False False True False False False False False False

299 rows × 13 columns

df.isna().sum()
age                          1
anaemia                      0
creatinine_phosphokinase     7
diabetes                     5
ejection_fraction            5
high_blood_pressure          3
platelets                   11
serum_creatinine             0
serum_sodium                 5
sex                          1
smoking                      1
time                         2
DEATH_EVENT                  0
dtype: int64

โœ๐Ÿป Now, with the help of pandas .isna() method we are able to find out which columns have empty values in them

Fill the missing values

There are different ways to handle missing values during your data exploration and feature engineering (a quick sketch of the first two options follows the list below):

  1. Fill them with the mean, mode or median of their parent column
  2. Remove the samples (rows) with missing data
  3. Use an unsupervised learning approach to predict and fill the missing values
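Here is a minimal sketch of options 1 and 2, done on copies so the original df stays untouched (the df_option1/df_option2 names are only for illustration). In this post we'll use scikit-learn's SimpleImputer instead, as shown below.

# Option 1: fill a numeric column with its median (on a copy, so df is unchanged)
df_option1 = df.copy()
df_option1['serum_sodium'] = df_option1['serum_sodium'].fillna(df_option1['serum_sodium'].median())

# Option 2: drop every row that has at least one missing value
df_option2 = df.dropna()
df_option2.shape # fewer rows than the original (299, 13)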
list(df.columns)
['age',
 'anaemia',
 'creatinine_phosphokinase',
 'diabetes',
 'ejection_fraction',
 'high_blood_pressure',
 'platelets',
 'serum_creatinine',
 'serum_sodium',
 'sex',
 'smoking',
 'time',
 'DEATH_EVENT']

Split our data into Features and Label (Independent and Dependent Variables)

# Split the data into features and labels
X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']
# We can use the SimpleImputer class in scikit-learn to do this
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer


# Let's Define the columns to fill
categoricals = ['platelets', 'high_blood_pressure']
numerical = ['age',
             'anaemia',
             'creatinine_phosphokinase',
             'diabetes',
             'ejection_fraction',
             'serum_creatinine',
             'serum_sodium',
             'sex',
             'smoking',
             'time'] # Removed the two categorical features (high_blood_pressure and platelets)

# Define the way SimpleImputer will fill the missing values
categorical_imputer = SimpleImputer(strategy='constant', fill_value='None')
numerical_imputer = SimpleImputer(strategy='mean')

# Create the Imputer 
transformer = ColumnTransformer([('categoricals', categorical_imputer, categoricals),
                                 ('numerical', numerical_imputer, numerical)])

new_X = transformer.fit_transform(X)

Congratulations!! 😎 💃
We've successfully filled and converted all our data to numbers

4. Modelling

Based on our problem and data, what machine learning model should we use? Let the audience decide 😉

Okay, you got it right 🤣. This is a classification problem because we are predicting whether our output is one thing or another, i.e. whether it's 1 or 0, True or False, rice or beans, etc. In other words, we call this binary classification.
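Before modelling, it can also help to confirm that the target really is binary and to check how balanced the two classes are (from the .describe() output above, the mean of DEATH_EVENT is about 0.32, so roughly a third of the patients died). A minimal check:

# Confirm the target is binary and inspect the class balance
y.value_counts(normalize=True) # proportion of 0s (survived) and 1s (died)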

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Instantiate our model classes
random_forest_model, ada_boost = RandomForestClassifier(), AdaBoostClassifier()
# Split our data into train and test splits
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(new_X, y, train_size=.8, random_state=21)

# Let's Checkout the shape of our data to understand how our data is being split
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((239, 12), (60, 12), (239,), (60,))

Now let's train our machine learning model to find patterns in our data

random_forest_model.fit(X_train, y_train)
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-306-7b70e76ee335> in <module>
----> 1 random_forest_model.fit(X_train, y_train)


/opt/anaconda3/lib/python3.8/site-packages/sklearn/ensemble/_forest.py in fit(self, X, y, sample_weight)
    302                 "sparse multilabel-indicator for y is not supported."
    303             )
--> 304         X, y = self._validate_data(X, y, multi_output=True,
    305                                    accept_sparse="csc", dtype=DTYPE)
    306         if sample_weight is not None:


/opt/anaconda3/lib/python3.8/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    431                 y = check_array(y, **check_y_params)
    432             else:
--> 433                 X, y = check_X_y(X, y, **check_params)
    434             out = X, y
    435 


/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0


/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    812         raise ValueError("y cannot be None")
    813 
--> 814     X = check_array(X, accept_sparse=accept_sparse,
    815                     accept_large_sparse=accept_large_sparse,
    816                     dtype=dtype, order=order, copy=copy,


/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0


/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    614                     array = array.astype(dtype, casting="unsafe", copy=False)
    615                 else:
--> 616                     array = np.asarray(array, order=order, dtype=dtype)
    617             except ComplexWarning as complex_warning:
    618                 raise ValueError("Complex data not supported\n"


/opt/anaconda3/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order, like)
    100         return _asarray_with_like(a, dtype=dtype, order=order, like=like)
    101 
--> 102     return array(a, dtype, copy=False, order=order)
    103 
    104 


ValueError: could not convert string to float: 'None'

Now, why are we getting an error? What did we do wrong? What can we do about it?

# Let's check out the unique values in our X, y set
y.unique(), new_X[::-1]
(array([1, 0]),
 array([['None', '0', 50.0, ..., 1.0, 1.0, 285.0],
        ['None', '0', 45.0, ..., 1.0, 1.0, 280.0],
        ['None', '0', 45.0, ..., 0.0, 0.0, 278.0],
        ...,
        ['162000', '0', 65.0, ..., 1.0, 1.0, 7.0],
        ['263358.03', '0', 55.0, ..., 1.0, 0.0, 6.0],
        ['265000', '1', 75.0, ..., 1.0, 0.0, 4.0]], dtype=object))

Now, because our new_X array still contains strings rather than numbers (the 'None' fill value and the values from the two object-typed columns), our machine learning model has refused to work with such data and has raised an error.

def fix_object_data(X):
    """
        This function will help us fix the string value errors in our dataset by turning string dtypes to numeric
        X: DataFrame or dict
    """
    for label, content in X.items():
        if pd.api.types.is_string_dtype(content):
            X[label] = pd.to_numeric(X[label], errors='coerce')
    return X

Let's remind ourselves of the data types involved in our dataset

X.dtypes
age                         float64
anaemia                       int64
creatinine_phosphokinase    float64
diabetes                    float64
ejection_fraction           float64
high_blood_pressure          object
platelets                    object
serum_creatinine            float64
serum_sodium                float64
sex                         float64
smoking                     float64
time                        float64
dtype: object
fix_object_data(X).dtypes
age                         float64
anaemia                       int64
creatinine_phosphokinase    float64
diabetes                    float64
ejection_fraction           float64
high_blood_pressure         float64
platelets                   float64
serum_creatinine            float64
serum_sodium                float64
sex                         float64
smoking                     float64
time                        float64
dtype: object
X = fix_object_data(X)

One thing I recommend you do before filling missing values is to visualize the data distribution and check for outliers, so you know which averaging method is better suited for filling the missing values (a quick sketch follows the helper below).

def value_counts(data:pd.DataFrame, key='age'):
    columns = data.columns
    val_count = {}

    # Loop through our data
    for col in columns:
        val_count[col] = data[col].value_counts()
    return pd.DataFrame(val_count[key])
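For example, a heavily skewed column such as creatinine_phosphokinase (its mean of ~578 sits far above its median of ~249 in the .describe() output) is safer to fill with the median, while a roughly symmetric column like age can reasonably take the mean. A minimal sketch of that check, using the helper above:

# Compare mean vs. median and eyeball the distribution before choosing a fill strategy
print(X['creatinine_phosphokinase'].mean(), X['creatinine_phosphokinase'].median())
X['creatinine_phosphokinase'].hist(bins=50); # long right tail -> prefer the median
value_counts(X, key='age') # frequency table for age using the helper above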
# Now let's fill these values one after the other, since we don't have many features
X['age'] = X['age'].fillna(X['age'].mean())
X['creatinine_phosphokinase'].fillna(X['creatinine_phosphokinase'].median(), inplace=True)
X['diabetes'].fillna(X['diabetes'].median(), inplace=True)
X['ejection_fraction'].fillna(X['ejection_fraction'].mean(), inplace=True)
X['high_blood_pressure'].fillna(X['high_blood_pressure'].mean(), inplace=True)
X['platelets'].fillna(X['platelets'].mean(), inplace=True)
X['serum_sodium'].fillna(X['serum_sodium'].median(), inplace=True)
X['sex'].fillna(X['sex'].median(), inplace=True)
X['smoking'].fillna(X['smoking'].median(), inplace=True)
X['time'].fillna(X['time'].median(), inplace=True)
X.isna().sum()
age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
dtype: int64

Now that we've fixed our data, let's split it and fit our model for training

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=.8, random_state=21)

# Now let's train our model
model = RandomForestClassifier(n_estimators=120,max_depth=10)
model.fit(X_train, y_train)
RandomForestClassifier(max_depth=10, n_estimators=120)

5. Make predictions using our model

model.predict(X_test)
array([1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0])
# Have a look at what our X_test looks like
X_test.head()
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
61 50.0 0 318.0 0.0 40.0 1.0 216000.000000 2.3 131.0 0.0 0.0 60.0
297 45.0 0 2413.0 0.0 38.0 0.0 262269.408112 1.4 140.0 1.0 1.0 280.0
55 95.0 1 371.0 0.0 30.0 0.0 461000.000000 2.0 132.0 1.0 0.0 50.0
243 73.0 1 1185.0 0.0 40.0 1.0 220000.000000 0.9 141.0 0.0 0.0 213.0
95 58.0 1 133.0 0.0 60.0 1.0 219000.000000 1.0 141.0 1.0 0.0 83.0

6. Evaluate our model's performance

There are different classification evaluation metrics available to evaluate our model's performance; however, which metrics to use depends heavily on the problem you are solving. They include the following:

  • Accuracy: the default metric for classification problems. Not the best for imbalanced classes
  • Precision: higher precision means fewer false positives
  • Recall: higher recall means fewer false negatives
  • F1 Score: usually a good overall metric for classification models
  • Confusion Matrix: a table of correct and incorrect predictions for each class
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9
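Accuracy alone (0.9 here) doesn't tell the whole story, especially when the classes are imbalanced. A minimal sketch of the other metrics listed above, all of which are available in sklearn.metrics:

from sklearn.metrics import recall_score, f1_score, classification_report

print(confusion_matrix(y_test, y_pred)) # rows: actual, columns: predicted
print(precision_score(y_test, y_pred)) # fewer false positives -> higher precision
print(recall_score(y_test, y_pred)) # fewer false negatives -> higher recall
print(f1_score(y_test, y_pred)) # harmonic mean of precision and recall
print(classification_report(y_test, y_pred)) # all of the above, per class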

8. Export our model

There are 2 ways we can save our model

  1. Using the pickle module
  2. Using the joblib module

1. Using the pickle module

import pickle as pkl

# Export with pickle
pkl.dump(model, open('your_first_model_pkl.pkl','wb')) #wb -> write binary

# Load with pickle
loaded_model_pkl = pkl.load(open('your_first_model_pkl.pkl', 'rb')) #rb -> read binary

# Make prediction
loaded_model_pkl.predict(X_test)
array([1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0])

2. Using the Joblib module

import joblib as jbl

# Export with joblib
jbl.dump(model, 'your_first_model_pkl.joblib')

# Load with joblib
loaded_model_jbl = jbl.load('your_first_model_pkl.joblib')

# Make prediction
loaded_model_jbl.predict(X_test)
array([1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0])