Let's Save A Life with AI 🤖

A comprehensive guide to building a supervised machine learning model


About our dataset

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.

Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

Data Source: kaggle.com/andrewmvd/heart-failure-clinical..

Task

Create a model for predicting mortality caused by Heart Failure.
12 clinical features for predicting death events.

Our Machine Learning WorkFlow

I feel a lot more comfortable defining my workflow for solving a machine learning problem before I ever start solving it, as that gives me a sense of direction. However, this may be different for other people ✍🏻

Below are the steps we are going to take to solve this machine learning problem:

  1. Problem Definition and Data Collection

  2. Get the data ready for use (Data Preprocessing)

    • Check for missing values
    • Fill missing values, if any.
    • Turn categorical features into numerical ones
  3. Feature Engineering: Feature engineering is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. A feature is a property shared by independent units on which analysis or prediction is to be done. Features are used by predictive models and influence results.

  4. Modeling

  5. Make Predictions
  6. Evaluate model performance using appropriate metrics
  7. See if we need to improve our model
  8. Export our trained model
  9. Load our trained model

1. Problem Definition

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

Data Collection

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
df = pd.read_csv('Datasets/heart_failure_clinical_records_dataset.csv')
df.head()
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT
0 75.0 0 582.0 0.0 20.0 1 265000 1.9 130.0 1.0 0.0 4.0 1
1 55.0 0 7861.0 0.0 38.0 0 263358.03 1.1 136.0 1.0 0.0 6.0 1
2 65.0 0 146.0 NaN NaN 0 162000 1.3 129.0 1.0 1.0 7.0 1
3 50.0 1 111.0 0.0 20.0 0 210000 1.9 137.0 1.0 0.0 7.0 1
4 65.0 1 NaN 1.0 20.0 0 327000 2.7 116.0 0.0 0.0 8.0 1
# Let's get a summary of our data set
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       298 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  292 non-null    float64
 3   diabetes                  294 non-null    float64
 4   ejection_fraction         294 non-null    float64
 5   high_blood_pressure       296 non-null    object 
 6   platelets                 288 non-null    object 
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              294 non-null    float64
 9   sex                       298 non-null    float64
 10  smoking                   298 non-null    float64
 11  time                      297 non-null    float64
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(9), int64(2), object(2)
memory usage: 30.5+ KB
# Check for duplicates
df.duplicated().sum()
0
# Let's check how correlated each feature is to another
correlation = df.corr()
correlation
age anaemia creatinine_phosphokinase diabetes ejection_fraction serum_creatinine serum_sodium sex smoking time DEATH_EVENT
age 1.000000 0.091852 -0.088883 -0.096539 0.061344 0.155650 -0.052726 0.059484 0.018416 -0.219635 0.252176
anaemia 0.091852 1.000000 -0.184689 -0.015766 0.026149 0.054518 0.060957 -0.099176 -0.109526 -0.133430 0.066270
creatinine_phosphokinase -0.088883 -0.184689 1.000000 -0.022312 -0.054024 -0.012482 0.057265 0.074000 0.005451 -0.025688 0.073627
diabetes -0.096539 -0.015766 -0.022312 1.000000 -0.015566 -0.043697 -0.092627 -0.154765 -0.135563 0.028746 0.001362
ejection_fraction 0.061344 0.026149 -0.054024 -0.015566 1.000000 -0.010466 0.199457 -0.146827 -0.057282 0.028818 -0.261605
serum_creatinine 0.155650 0.054518 -0.012482 -0.043697 -0.010466 1.000000 -0.181161 0.010253 -0.026469 -0.147806 0.290386
serum_sodium -0.052726 0.060957 0.057265 -0.092627 0.199457 -0.181161 1.000000 -0.042738 0.001440 0.057033 -0.175385
sex 0.059484 -0.099176 0.074000 -0.154765 -0.146827 0.010253 -0.042738 1.000000 0.446947 -0.005693 -0.007482
smoking 0.018416 -0.109526 0.005451 -0.135563 -0.057282 -0.026469 0.001440 0.446947 1.000000 -0.019119 -0.014233
time -0.219635 -0.133430 -0.025688 0.028746 0.028818 -0.147806 0.057033 -0.005693 -0.019119 1.000000 -0.522918
DEATH_EVENT 0.252176 0.066270 0.073627 0.001362 -0.261605 0.290386 -0.175385 -0.007482 -0.014233 -0.522918 1.000000
df.describe()
age anaemia creatinine_phosphokinase diabetes ejection_fraction serum_creatinine serum_sodium sex smoking time DEATH_EVENT
count 298.000000 299.000000 292.000000 294.000000 294.000000 299.000000 294.000000 298.000000 298.000000 297.000000 299.00000
mean 60.870248 0.431438 577.688356 0.418367 38.227891 1.391104 136.697279 0.651007 0.322148 130.208754 0.32107
std 11.898166 0.496107 972.468942 0.494132 11.852295 1.034449 4.338123 0.477454 0.468085 77.365687 0.46767
min 40.000000 0.000000 23.000000 0.000000 14.000000 0.500000 113.000000 0.000000 0.000000 4.000000 0.00000
25% 51.000000 0.000000 115.000000 0.000000 30.000000 0.900000 134.000000 0.000000 0.000000 73.000000 0.00000
50% 60.000000 0.000000 249.500000 0.000000 38.000000 1.100000 137.000000 1.000000 0.000000 115.000000 0.00000
75% 70.000000 1.000000 582.000000 1.000000 45.000000 1.400000 140.000000 1.000000 1.000000 201.000000 1.00000
max 95.000000 1.000000 7861.000000 1.000000 80.000000 9.400000 148.000000 1.000000 1.000000 285.000000 1.00000

Definition of terms from the DataFrame .describe() method above

  • Count: the number of non-null items in a particular feature/column
  • Mean: the average value (the sum of the values divided by their count)
  • Std: the standard deviation (the square root of the variance, i.e. how much values differ from the mean)
  • Min: the minimum value in each feature or column
  • Max: the maximum value in each feature or column
  • Percentiles (25%, 50%, 75%): the values below which 25%, 50% and 75% of the data fall (see the short sketch below)
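As a quick illustration of those percentile rows, the same quartiles can be computed directly for a single column. A minimal sketch using the age column:

# The 25%, 50% and 75% rows of .describe() are simply the column quantiles
df['age'].quantile([0.25, 0.5, 0.75])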
# Check number of columns and rows
df.shape # This shows that our dataset contains 299 rows and 13 columns
(299, 13)
# Let's check out the column names
j = 0
for i in df.columns:
    j+=1
    print(j, i.upper())
1 AGE
2 ANAEMIA
3 CREATININE_PHOSPHOKINASE
4 DIABETES
5 EJECTION_FRACTION
6 HIGH_BLOOD_PRESSURE
7 PLATELETS
8 SERUM_CREATININE
9 SERUM_SODIUM
10 SEX
11 SMOKING
12 TIME
13 DEATH_EVENT
df['age'].hist(bins=50);

[Output: histogram of the age column]

# Let's Visualize our correlations better
# sns.set(font_scale=1)
fig, ax = plt.subplots(figsize=(15,10))
ax = sns.heatmap(correlation, annot=True, fmt='.2f', cmap='YlGnBu', linewidths=.05);

[Output: heatmap of the feature correlations]

pd.crosstab(df.age, df.DEATH_EVENT)
DEATH_EVENT 0 1
age
40.000 7 0
41.000 1 0
42.000 6 1
43.000 1 0
44.000 2 0
45.000 13 6
46.000 2 1
47.000 1 0
48.000 0 2
49.000 3 1
50.000 18 8
51.000 3 1
52.000 5 0
53.000 9 1
54.000 1 1
55.000 14 3
56.000 1 0
57.000 1 1
58.000 8 2
59.000 1 3
60.000 20 13
60.667 1 1
61.000 4 0
62.000 4 1
63.000 8 0
64.000 3 0
65.000 18 8
66.000 2 0
67.000 2 0
68.000 3 2
69.000 1 2
70.000 18 7
72.000 2 5
73.000 3 1
75.000 5 6
77.000 1 1
78.000 2 0
79.000 1 0
80.000 2 5
81.000 1 0
82.000 0 3
85.000 3 3
86.000 0 1
87.000 0 1
90.000 1 2
94.000 0 1
95.000 0 2
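The raw age-by-outcome crosstab above is a little hard to scan. One optional way to make the trend clearer (a minimal sketch, not part of the original analysis; the age_bins name and bin edges are just illustrative) is to bin the ages first and plot the counts per bin:

# Bin ages into ranges and plot survivals vs. deaths per bin (illustrative only)
age_bins = pd.cut(df['age'], bins=[39, 50, 60, 70, 80, 96])
pd.crosstab(age_bins, df['DEATH_EVENT']).plot(kind='bar', figsize=(10, 5));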

2. Data Preprocessing

Check if there are missing values

df.isna()
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT
0 False False False False False False False False False False False False False
1 False False False False False False False False False False False False False
2 False False False True True False False False False False False False False
3 False False False False False False False False False False False False False
4 False False True False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
294 False False False False False False True False False False False False False
295 False False False False False False False False False False False False False
296 False False False False False False True False False False False False False
297 False False False False False False True False False False False False False
298 False False False False False False True False False False False False False

299 rows × 13 columns

df.isna().sum()
age                          1
anaemia                      0
creatinine_phosphokinase     7
diabetes                     5
ejection_fraction            5
high_blood_pressure          3
platelets                   11
serum_creatinine             0
serum_sodium                 5
sex                          1
smoking                      1
time                         2
DEATH_EVENT                  0
dtype: int64

โœ๐Ÿป Now, with the help of pandas .isna() method we are able to find out which columns have empty values in them

Fill the missing values

There are different ways to handle missing values during your data exploration and feature engineering (a quick sketch of the first two options follows the list below):

  1. Fill them with the mean, mode or median of their parent column
  2. Remove the samples (rows) with missing data
  3. Use an unsupervised learning approach to predict and fill the missing values
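Here is a minimal sketch of options 1 and 2, done on copies so the original df stays untouched (the df_option1/df_option2 names are only for illustration). In this post we'll use scikit-learn's SimpleImputer instead, as shown below.

# Option 1: fill a numeric column with its median (on a copy, so df is unchanged)
df_option1 = df.copy()
df_option1['serum_sodium'] = df_option1['serum_sodium'].fillna(df_option1['serum_sodium'].median())

# Option 2: drop every row that has at least one missing value
df_option2 = df.dropna()
df_option2.shape # fewer rows than the original (299, 13)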
list(df.columns)
['age',
 'anaemia',
 'creatinine_phosphokinase',
 'diabetes',
 'ejection_fraction',
 'high_blood_pressure',
 'platelets',
 'serum_creatinine',
 'serum_sodium',
 'sex',
 'smoking',
 'time',
 'DEATH_EVENT']

Split our data into Features and Label (Independent and Dependent Variables)

# Split the data into features and labels
X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']
# We can use the SimpleImputer class in scikit-learn to do this
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer


# Let's Define the columns to fill
categoricals = ['platelets', 'high_blood_pressure']
numerical = ['age',
             'anaemia',
             'creatinine_phosphokinase',
             'diabetes',
             'ejection_fraction',
             'serum_creatinine',
             'serum_sodium',
             'sex',
             'smoking',
             'time'] # Removed the two categorical features (high_blood_pressure and platelets)

# Define the way SimpleImputer will fill the missing values
categorical_imputer = SimpleImputer(strategy='constant', fill_value='None')
numerical_imputer = SimpleImputer(strategy='mean')

# Create the Imputer 
transformer = ColumnTransformer([('categoricals', categorical_imputer, categoricals),
                                 ('numerical', numerical_imputer, numerical)])

new_X = transformer.fit_transform(X)

Congratulations!! 😎 💃
We've successfully filled and converted all our data to numbers

4. Modelling

Based on our problem and data, what machine learning model should we use? Let the audience decide 😉

Okay, you got it right 🤣. This is a classification problem because we are predicting whether our output is one thing or another, i.e. whether it's 1 or 0, True or False, rice or beans, etc. In other words, we call this binary classification.
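Before modelling, it can also help to confirm that the target really is binary and to check how balanced the two classes are (from the .describe() output above, the mean of DEATH_EVENT is about 0.32, so roughly a third of the patients died). A minimal check:

# Confirm the target is binary and inspect the class balance
y.value_counts(normalize=True) # proportion of 0s (survived) and 1s (died)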

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Instantiate our model classes
random_forest_model, ada_boost = RandomForestClassifier(), AdaBoostClassifier()
# Split our data into train and test splits
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(new_X, y, train_size=.8, random_state=21)

# Let's Checkout the shape of our data to understand how our data is being split
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((239, 12), (60, 12), (239,), (60,))

Now let's train our machine learning model to find patterns in our data

random_forest_model.fit(X_train, y_train)
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-306-7b70e76ee335> in <module>
----> 1 random_forest_model.fit(X_train, y_train)


/opt/anaconda3/lib/python3.8/site-packages/sklearn/ensemble/_forest.py in fit(self, X, y, sample_weight)
    302                 "sparse multilabel-indicator for y is not supported."
    303             )
--> 304         X, y = self._validate_data(X, y, multi_output=True,
    305                                    accept_sparse="csc", dtype=DTYPE)
    306         if sample_weight is not None:


/opt/anaconda3/lib/python3.8/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    431                 y = check_array(y, **check_y_params)
    432             else:
--> 433                 X, y = check_X_y(X, y, **check_params)
    434             out = X, y
    435 


/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0


/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    812         raise ValueError("y cannot be None")
    813 
--> 814     X = check_array(X, accept_sparse=accept_sparse,
    815                     accept_large_sparse=accept_large_sparse,
    816                     dtype=dtype, order=order, copy=copy,


/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0


/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    614                     array = array.astype(dtype, casting="unsafe", copy=False)
    615                 else:
--> 616                     array = np.asarray(array, order=order, dtype=dtype)
    617             except ComplexWarning as complex_warning:
    618                 raise ValueError("Complex data not supported\n"


/opt/anaconda3/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order, like)
    100         return _asarray_with_like(a, dtype=dtype, order=order, like=like)
    101 
--> 102     return array(a, dtype, copy=False, order=order)
    103 
    104 


ValueError: could not convert string to float: 'None'

Now, why are we getting an error? What did we do wrong? What can we do about it?

# Let's check out the unique values in our X, y set
y.unique(), new_X[::-1]
(array([1, 0]),
 array([['None', '0', 50.0, ..., 1.0, 1.0, 285.0],
        ['None', '0', 45.0, ..., 1.0, 1.0, 280.0],
        ['None', '0', 45.0, ..., 0.0, 0.0, 278.0],
        ...,
        ['162000', '0', 65.0, ..., 1.0, 1.0, 7.0],
        ['263358.03', '0', 55.0, ..., 1.0, 0.0, 6.0],
        ['265000', '1', 75.0, ..., 1.0, 0.0, 4.0]], dtype=object))

Now, because our new_X array still contains strings rather than numbers (the 'None' fill value and the values from the two object-typed columns), our machine learning model has refused to work with such data and has raised an error.

def fix_object_data(X):
    """
        This function will help us fix the string value errors in our dataset by turning string dtypes to numeric
        X: DataFrame or dict
    """
    for label, content in X.items():
        if pd.api.types.is_string_dtype(content):
            X[label] = pd.to_numeric(X[label], errors='coerce')
    return X

Let's remind ourselves of the data types involved in our dataset

X.dtypes
age                         float64
anaemia                       int64
creatinine_phosphokinase    float64
diabetes                    float64
ejection_fraction           float64
high_blood_pressure          object
platelets                    object
serum_creatinine            float64
serum_sodium                float64
sex                         float64
smoking                     float64
time                        float64
dtype: object
fix_object_data(X).dtypes
age                         float64
anaemia                       int64
creatinine_phosphokinase    float64
diabetes                    float64
ejection_fraction           float64
high_blood_pressure         float64
platelets                   float64
serum_creatinine            float64
serum_sodium                float64
sex                         float64
smoking                     float64
time                        float64
dtype: object
X = fix_object_data(X)

One thing I recommend you do before filling missing values is to visualize the data distribution and check for outliers, so you know which averaging method is better suited for filling the missing values (a quick sketch follows the helper below).

def value_counts(data:pd.DataFrame, key='age'):
    columns = data.columns
    val_count = {}

    # Loop through our data
    for col in columns:
        val_count[col] = data[col].value_counts()
    return pd.DataFrame(val_count[key])
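For example, a heavily skewed column such as creatinine_phosphokinase (its mean of ~578 sits far above its median of ~249 in the .describe() output) is safer to fill with the median, while a roughly symmetric column like age can reasonably take the mean. A minimal sketch of that check, using the helper above:

# Compare mean vs. median and eyeball the distribution before choosing a fill strategy
print(X['creatinine_phosphokinase'].mean(), X['creatinine_phosphokinase'].median())
X['creatinine_phosphokinase'].hist(bins=50); # long right tail -> prefer the median
value_counts(X, key='age') # frequency table for age using the helper above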
# Now let's fill these values one after the other, since we don't have many features
X['age'] = X['age'].fillna(X['age'].mean())
X['creatinine_phosphokinase'].fillna(X['creatinine_phosphokinase'].median(), inplace=True)
X['diabetes'].fillna(X['diabetes'].median(), inplace=True)
X['ejection_fraction'].fillna(X['ejection_fraction'].mean(), inplace=True)
X['high_blood_pressure'].fillna(X['high_blood_pressure'].mean(), inplace=True)
X['platelets'].fillna(X['platelets'].mean(), inplace=True)
X['serum_sodium'].fillna(X['serum_sodium'].median(), inplace=True)
X['sex'].fillna(X['sex'].median(), inplace=True)
X['smoking'].fillna(X['smoking'].median(), inplace=True)
X['time'].fillna(X['time'].median(), inplace=True)
X.isna().sum()
age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
dtype: int64

Now that we've fixed our data, let's split it and fit our model for training

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=.8, random_state=21)

# Now let's train our model
model = RandomForestClassifier(n_estimators=120,max_depth=10)
model.fit(X_train, y_train)
RandomForestClassifier(max_depth=10, n_estimators=120)

5. Make predictions using our model

model.predict(X_test)
array([1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0])
# Have a look at what our X_test looks like
X_test.head()
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
61 50.0 0 318.0 0.0 40.0 1.0 216000.000000 2.3 131.0 0.0 0.0 60.0
297 45.0 0 2413.0 0.0 38.0 0.0 262269.408112 1.4 140.0 1.0 1.0 280.0
55 95.0 1 371.0 0.0 30.0 0.0 461000.000000 2.0 132.0 1.0 0.0 50.0
243 73.0 1 1185.0 0.0 40.0 1.0 220000.000000 0.9 141.0 0.0 0.0 213.0
95 58.0 1 133.0 0.0 60.0 1.0 219000.000000 1.0 141.0 1.0 0.0 83.0

6. Evaluate our model's performance

There are different classification evaluation metrics available to evaluate our model's performance; however, which metrics to use depends heavily on the problem you are solving. They include the following:

  • Accuracy: the default metric for classification problems. Not the best for imbalanced classes
  • Precision: higher precision means fewer false positives
  • Recall: higher recall means fewer false negatives
  • F1 Score: usually a good overall metric for classification models
  • Confusion Matrix: a table of correct and incorrect predictions for each class
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9
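Accuracy alone (0.9 here) doesn't tell the whole story, especially when the classes are imbalanced. A minimal sketch of the other metrics listed above, all of which are available in sklearn.metrics:

from sklearn.metrics import recall_score, f1_score, classification_report

print(confusion_matrix(y_test, y_pred)) # rows: actual, columns: predicted
print(precision_score(y_test, y_pred)) # fewer false positives -> higher precision
print(recall_score(y_test, y_pred)) # fewer false negatives -> higher recall
print(f1_score(y_test, y_pred)) # harmonic mean of precision and recall
print(classification_report(y_test, y_pred)) # all of the above, per class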

8. Export our model

There are 2 ways we can save our model

  1. Using the pickle module
  2. Using the joblib module

1. Using the pickle module

import pickle as pkl

# Export with pickle
pkl.dump(model, open('your_first_model_pkl.pkl','wb')) #wb -> write binary

# Load with pickle
loaded_model_pkl = pkl.load(open('your_first_model_pkl.pkl', 'rb')) #rb -> read binary

# Make prediction
loaded_model_pkl.predict(X_test)
array([1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0])

2. Using the Joblib module

import joblib as jbl

# Export with joblib
jbl.dump(model, 'your_first_model_pkl.joblib')

# Load with joblib
loaded_model_jbl = jbl.load('your_first_model_pkl.joblib')

# Make prediction
loaded_model_jbl.predict(X_test)
array([1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0])