
How neural nets work
Deep learning is the use of especially powerful neural networks that perform well on a wide range of prediction problems. Their ability to capture extremely complex interactions between the features in a data set allows them to perform well on text, images, video and source code. Instead of evaluating the effect of each feature on the outcome individually, as in regression models, neural nets account for interactions between features. They map relationships between different combinations of variables to the desired output, so a function is learned over all inputs together with their interactions with the other features in the dataset. As a result, neural networks work well even in areas such as computer vision and natural language processing, where we lack explicit domain knowledge. The difference between modern deep learning and historical neural networks is the use of many successive hidden layers rather than a single hidden layer, made possible by the increase in computational power over time.

The input layer represents the predictive features in the dataset, such as age or income. The output layer is the prediction from our model. All the layers that are neither input nor output layers are hidden layers: while the inputs and outputs correspond to information we observe in the dataset, the hidden layers are not something we have data about; they represent an aggregation of information. Each node in the hidden layers adds to the model's ability to capture interactions, so the more nodes we have, the more interactions we can capture.
For each value in the input layer, there is a weight indicating how strongly that input affects the hidden node. The value of each input is multiplied by its weight and the products are summed (a dot product), and this happens at every node. However, for neural networks to achieve their maximum predictive power, an activation function is applied in the hidden layers. The activation function allows the model to capture non-linearities in the data. There are multiple activation functions, but today's standard in both industry and research is ReLU (the Rectified Linear Unit).
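As an illustration, a minimal forward pass through one hidden layer might look like the sketch below; the input values and weights are made up for the example and are not taken from the dataset used later in this post.

```python
import numpy as np

# Toy input: two features, e.g. age and income (values are illustrative)
inputs = np.array([45.0, 60000.0])

# Made-up weights: one column per hidden node
hidden_weights = np.array([[0.02, -0.01],
                           [0.00001, 0.00002]])
output_weights = np.array([1.5, -0.7])

def relu(x):
    """Rectified Linear Unit: returns x for positive values, 0 otherwise."""
    return np.maximum(0, x)

# Each hidden node is a dot product of the inputs and its weights,
# followed by the activation function.
hidden_values = relu(inputs @ hidden_weights)

# The output is a dot product of the hidden values and the output weights.
prediction = hidden_values @ output_weights
print(prediction)
```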
An interesting property of deep networks is that they internally build up representations of the patterns in the data that are useful for making predictions. Neural nets find increasingly complex patterns as the data passes through successive hidden layers, so later layers build increasingly sophisticated representations of the raw data. For example, when a neural network tries to classify an image, the first hidden layers pick up patterns or interactions that are conceptually simple. A simple interaction looks at groups of nearby pixels and finds patterns such as diagonal, horizontal or vertical lines. Once the network has identified where these lines are, subsequent layers combine that information to find larger patterns, like big squares. A later layer might put together the locations of squares and other geometric shapes to identify a checkerboard pattern, a face, a car, or the full image. We therefore do not need to specify those interactions ourselves; the network learns the weights that find the relevant patterns and make better predictions.

The goal is to find the weights that give the lowest value of the loss function, using an algorithm called gradient descent. The algorithm starts at a random point, finds the slope and moves downhill towards a local minimum. In the graph above, the loss function is shown on the vertical axis and the weight on the horizontal axis. The slope of the tangent line captures the slope of the loss function at the current weight, and this slope corresponds to the derivative. The slope decides which direction to move in: if the slope is positive, we want to go downhill, in the direction opposite to the slope, towards lower values. Small steps are taken repeatedly in the direction opposite to the slope, recalculating the slope each time, until a minimum is reached.
Subtracting the raw slope from the current weight may lead us to take too big a step and overshoot the minimum. Therefore, instead of subtracting the slope directly, we multiply the slope by a small number called the learning rate and change the weight by that product. This ensures we take small steps and reliably move towards the optimal weights. The learning rate is a hyperparameter of the model and can be tuned to achieve optimal results.
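A minimal sketch of this update rule, using a made-up one-dimensional loss function whose slope is easy to compute by hand:

```python
# Gradient descent on a toy loss function: loss(w) = (w - 3) ** 2,
# whose derivative (slope) is 2 * (w - 3). The minimum is at w = 3.
def slope(w):
    return 2 * (w - 3)

w = 0.0             # starting weight
learning_rate = 0.1

for step in range(100):
    # Move in the direction opposite to the slope, scaled by the learning rate.
    w = w - learning_rate * slope(w)

print(w)  # close to 3, the weight that minimizes the loss
```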
Furthermore, backpropagation is needed to optimize more complex deep learning models. Just as forward propagation sends input data through the hidden layers to the output layer, backpropagation takes the error from the output layer and propagates it backward through the hidden layers towards the input layer. Backpropagation calculates the necessary slopes sequentially, starting from the weights closest to the prediction, moving back through the hidden layers, and eventually reaching the weights coming from the inputs.
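To make the chain of calculations concrete, here is a minimal sketch of forward and backward propagation for a tiny network with one hidden layer and a squared-error loss; the numbers are illustrative and this is not the model built later in this post.

```python
import numpy as np

# Tiny network: 2 inputs -> 2 hidden nodes (ReLU) -> 1 output, squared-error loss.
x = np.array([1.0, 2.0])            # input features (illustrative)
W1 = np.array([[0.5, -0.3],
               [0.8,  0.2]])        # input-to-hidden weights
W2 = np.array([0.4, -0.6])          # hidden-to-output weights
target = 1.0

# Forward propagation
z = x @ W1                          # pre-activation values of the hidden nodes
h = np.maximum(0, z)                # ReLU activation
y_hat = h @ W2                      # prediction
loss = (y_hat - target) ** 2

# Backpropagation: start from the error at the output and apply the chain rule
# backwards through the layers.
d_y_hat = 2 * (y_hat - target)      # d loss / d prediction
d_W2 = d_y_hat * h                  # gradients for the output weights
d_h = d_y_hat * W2                  # error propagated back to the hidden layer
d_z = d_h * (z > 0)                 # ReLU derivative: 1 where z > 0, else 0
d_W1 = np.outer(x, d_z)             # gradients for the input-to-hidden weights

# One gradient descent step using the slopes computed above
learning_rate = 0.01
W1 -= learning_rate * d_W1
W2 -= learning_rate * d_W2
```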
The data
The data was obtained from the Kaggle website and contains 12 different features to predict mortality by heart failure. For more information about the dataset or to download the data, please visit https://www.kaggle.com/andrewmvd/heart-failure-clinical-data
Identifying individuals at high risk for cardiovascular disease is crucial, as early detection is key to the prevention and management of the illness. Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet, obesity, physical inactivity and harmful use of alcohol. Hence, early detection of compromised individuals can potentially prevent heart failure in vulnerable populations.
The data in the Jupyter notebook looks as follows:
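For reference, loading and previewing the data in the notebook looks roughly like this; the file name is assumed from the Kaggle download.

```python
import pandas as pd

# File name assumed from the Kaggle dataset download
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")
df.head()
```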

Data exploration

Some of the variables, such as creatinine phosphokinase and platelets, have much higher variability than the others, so we will standardize the values for better model performance. We can investigate this further by calculating the variances, as shown below.
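The variances can be computed directly from the dataframe, assuming it is named df as in the loading sketch above:

```python
# Variance of each column; large differences in scale suggest standardization is needed
df.var().sort_values(ascending=False)
```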

Each row represents an individual, and the image below describes all the different columns in the dataset. We also see that all the values are either ints or floats.
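The column types can be confirmed in the notebook, for example:

```python
# Column names, non-null counts and dtypes (all int64 or float64)
df.info()
```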

We plot histograms to check the distributions of our variables; this also helps us quickly distinguish numerical variables from categorical ones. From the histograms below, we see that anaemia, sex, smoking, high blood pressure and diabetes are categorical variables.
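One way to produce such histograms, again assuming the dataframe is named df:

```python
import matplotlib.pyplot as plt

# One histogram per column; binary-looking distributions indicate categorical variables
df.hist(figsize=(12, 10))
plt.tight_layout()
plt.show()
```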

We can also check for correlations between our variables, since strong correlations affect model performance and introduce redundancy in the data. We see no strong correlations (as indicated by the lack of dark red or green).
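The correlation matrix can be visualised as a heatmap, for example with seaborn; the exact colour map used in the original plot is an assumption here.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations between all columns, plotted as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="RdYlGn")
plt.show()
```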

Data preprocessing
Before we go ahead and create the model, the categorical variables are converted to a machine-learning-friendly format using pd.get_dummies from pandas. Furthermore, the numerical variables are standardized for better model performance, so that variables with larger values are not inappropriately given more importance by the model. StandardScaler from the scikit-learn library was used for this.
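A sketch of this step, assuming the column names from the Kaggle dataset:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Categorical and numerical columns (names taken from the dataset description)
categorical_cols = ["anaemia", "diabetes", "high_blood_pressure", "sex", "smoking"]
numerical_cols = [c for c in df.columns if c not in categorical_cols + ["DEATH_EVENT"]]

# Encode the categorical variables and standardize the numerical ones
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
df[numerical_cols] = StandardScaler().fit_transform(df[numerical_cols])
```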
As part of the preprocessing, before the model is applied to the data, the predictor variables were separated from the target variable (heart disease as 1, no heart disease as 0). The predictor variables are stored in a variable X, whereas the target variable (the variable we want our neural net to predict) is stored as y. Both X and y are subsequently converted from a pandas dataframe to a NumPy matrix. The target variable is also converted into 2 separate columns (the format required by Keras) using to_categorical.
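A sketch of the split, assuming the target column is named DEATH_EVENT as in the Kaggle data and a TensorFlow-backed Keras installation:

```python
from tensorflow.keras.utils import to_categorical

# Separate the predictors from the target and convert both to NumPy arrays
X = df.drop(columns=["DEATH_EVENT"]).to_numpy()
y = to_categorical(df["DEATH_EVENT"].to_numpy())  # two columns: [no event, event]
```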
The model
To create the model, the "adam" optimizer was used due to its known ability to perform well across many different scenarios, and the loss was calculated using "categorical_crossentropy" because the output is a classification problem. Different models were tried sequentially, increasing or decreasing the number of hidden layers and/or the nodes in the hidden layers, in order to maximize the accuracy score and minimize the validation loss. I initially started with a smaller network and gradually increased capacity until the validation score no longer improved. Early stopping was also implemented to stop training if the validation score did not improve.
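The general shape of this workflow in Keras is sketched below; the layer sizes, validation split and early-stopping patience are placeholders rather than the final configuration, which is shown in the model summary further down.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

# Layer sizes here are placeholders; X and y come from the preprocessing step above
model = Sequential([
    Dense(32, activation="relu", input_shape=(X.shape[1],)),
    Dense(16, activation="relu"),
    Dense(2, activation="softmax"),   # two output columns, matching to_categorical
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Stop training when the validation score stops improving
early_stopping = EarlyStopping(patience=5, restore_best_weights=True)
model.fit(X, y,
          validation_split=0.3,
          epochs=100,
          callbacks=[early_stopping])
```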
Therefore, the final model obtained was as follows:

The final validation loss was 0.2261 and the validation accuracy was 0.9222.

The final model summary is as follows:
