# GTN Training - Machine Learning - Deep Learning 1: Feedforward Neural Networks

## Jun 25, 2021 16:13 · 8690 words · 41 minute read

Hello everyone my name is Kaivan Kamali. I’m going to be giving a presentation on feed forward neural networks. This is part one of a three-part series on deep learning, part of the Galaxy training network tutorials.

00:22 - So, the requirements: it would be a good idea if you’re familiar with the Galaxy platform; Introduction to Galaxy Analysis is a good tutorial. Also, there’s an introduction to deep learning tutorial which is helpful but it’s not required. So, the questions we’re trying to answer are: what is a feed forward neural network, and what are some applications of it? Our objectives are to understand the inspiration for neural networks, learn activation functions and various problems solved by neural networks, discuss various loss and cost functions and the backpropagation learning algorithm, learn how to create a neural network using Galaxy’s deep learning tools, and then solve a sample regression problem via feed forward neural networks in Galaxy.

The last two bullets there are basically the hands-on section of this tutorial, and that will be done in a separate video. So, what is an artificial neural network? Artificial neural networks are a machine learning discipline that was roughly inspired by how neurons in the human brain work. There has been a huge resurgence of neural networks in the past 10-15 years due to the vast availability of data, increases in compute capacity, and improvements in how neural network weights are initialized and in the activation functions.

We’ll get to those in a few slides. There are various types of neural networks. There are feed-forward neural networks, where the signals only move in one direction. There are recurrent neural networks where you have loops, and there are convolutional neural networks that are mostly applied to images and video problems. Feed forward neural networks are applied to classification, clustering, regression, association problems and they have a lot of real world applications.

The inspiration for neural networks is how the human brain works, roughly. A neuron is a special biological cell with information processing ability. It receives signals from other neurons through its dendrites, shown up here, and if the signal received exceeds a certain threshold the neuron fires and transmits signals to other neurons via its axon, which is over here. And the synapse is the connection where the axon of one neuron meets the dendrite of another neuron.

It can either enhance or inhibit the signal that’s passing through it, and the theory is that learning occurs by changing the effectiveness of the synapses in our brain. The cerebral cortex is the outermost layer of the brain. It’s two to three millimeters thick and has a surface area of 2200 square centimeters. It has about 10 to the power of 11 neurons and each neuron is connected to 10 to the power of 3 to 10 to the power of 4 neurons, between 1000 and 10000 neurons, so the human brain has around 10 to the power of 14 to 10 to the power of 15 connections.

Just so you get an idea. The neurons in the brain communicate by signals that are milliseconds in duration. Okay. However, complex tasks like face recognition are done within a few hundred milliseconds, so what does that mean? This means that the computation involved cannot take more than 100 serial steps, roughly. So, that’s a very interesting observation. The other thing is that the information that is sent from one neuron to the other is very small.

So, this means that the critical information is not being transmitted but rather it is captured by the interconnections. So the brain has distributed computation and representation, which allows slow computing elements to perform complex tasks quickly, because the signal transmission frequency in the brain is several hundred hertz at most, whereas computer chips could be millions of times faster than that. So, even though computer chips are much faster than the neurons in the brain, the brain, with its distributed representation and distributed processing, somehow seems to take care of very complex tasks in very few serial steps.

So, that’s a very nice observation. Okay, so we’re going to discuss the Perceptron now. The Perceptron is basically the first neural network, and it’s still in use. I think Rosenblatt came up with the idea of the Perceptron in the 1950s. You have an input layer and an output layer. Each input is connected to the output layer via a weight; every input is multiplied by its weight and the products of weights and inputs are added up together. We also have a bias down here, for which the input is always one and the weight is b1. The bias lets the function that this neuron represents be shifted to the right and left, so that’s just a matter of mathematical sugar coating, if you will.

So, if the sum of inputs multiplied by weights plus the bias multiplied by one is greater than a certain threshold the neuron fires. If it’s not, then the neuron does not fire. So fire means it has an output of one. Does not fire means it has an output of zero. So this is the simplest neural network. I think this was proposed by Rosenblatt after he studied how vision works in flies hence the name perceptron and it was implemented in hardware. It’s still used.
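The firing rule just described can be sketched in a few lines of Python. This is only an illustration: the function name, the weights, and the bias below are hypothetical values chosen so that the Perceptron behaves like a logical AND; they do not come from the slides.

```python
def perceptron_output(inputs, weights, bias, threshold=0.0):
    """Return 1 (fires) if the weighted sum plus bias exceeds the threshold, else 0."""
    # The bias input is always one, so the bias weight is simply added to the sum.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias * 1.0
    return 1 if weighted_sum > threshold else 0


# Hypothetical weights and bias that make the Perceptron act like a logical AND.
and_weights = [1.0, 1.0]
and_bias = -1.5
and_outputs = [
    perceptron_output([a, b], and_weights, and_bias)
    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]
]
print(and_outputs)
```

Only the input (1, 1) produces a weighted sum above the threshold, so it is the only input for which the neuron fires.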

It’s just that different activation functions are used instead of a step activation function, which is a kind of threshold function. But the Perceptron also has a learning algorithm. That is, we are given a training set, a set of input output pairs, and the goal of the learning algorithm is to iteratively adjust the model parameters, which in this case are the weights and the bias, so that the model can accurately map inputs to outputs. This is called the Perceptron learning algorithm. It’s actually very simple: you make a prediction with the Perceptron, and if the value that you got is more than what you expected, you reduce the weight by a small amount scaled by a factor called the learning rate.

If it’s less than what you expected, you increase the weight by a small amount scaled by the learning rate, so that’s the simple Perceptron learning algorithm. But the problem, and this was highlighted in a book by Minsky and Papert, is that a Perceptron or a single-layer feed-forward neural network cannot solve problems in which the data is not linearly separable. So if you have data that’s linearly separable your Perceptron can solve it. If the data is not linearly separable, the Perceptron learning algorithm basically fails.
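Here is a minimal sketch of the Perceptron learning algorithm, assuming the standard form of the update rule in which the change to each weight is also scaled by the corresponding input (that detail is not spelled out in the talk). The function names, the learning rate, and the number of epochs are illustrative choices. The sketch trains on OR, which is linearly separable, and on XOR, which is not.

```python
def step(value):
    """Binary step activation: fire (1) only if the value is above zero."""
    return 1 if value > 0 else 0


def predict(inputs, weights, bias):
    return step(sum(x * w for x, w in zip(inputs, weights)) + bias)


def train_perceptron(examples, learning_rate=0.1, epochs=20):
    """Train a two-input Perceptron on (inputs, expected_output) pairs."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, expected in examples:
            # error is +1 if the prediction was too low, -1 if too high, 0 if correct
            error = expected - predict(inputs, weights, bias)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


or_examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
xor_examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

or_weights, or_bias = train_perceptron(or_examples)
xor_weights, xor_bias = train_perceptron(xor_examples)

or_predictions = [predict(x, or_weights, or_bias) for x, _ in or_examples]
xor_predictions = [predict(x, xor_weights, xor_bias) for x, _ in xor_examples]
print("OR :", or_predictions)
print("XOR:", xor_predictions)
```

The OR predictions match the expected outputs, while for XOR at least one prediction is always wrong, no matter how long the training runs, because no single straight line separates the two classes.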

So a simple problem is the XOR problem. You can’t use a straight line to classify, to solve, the XOR problem. This caused what has been named the AI winter: interest in neural network research, and in AI in general, was reduced significantly, and so was the government funding for research on AI and neural networks. This went on for a while until the multi-layer feed forward neural network was proposed. I mean, researchers knew about multi-layer feed forward neural networks, they just didn’t know how to train them, so that was still a big problem.

So the idea was that adding one or more hidden layers enables the feed forward neural network to approximate any function. That’s called the Universal Approximation Theorem, and that’s great, but how are you going to train it? In the 80s an algorithm called backpropagation became popular that allowed training of multi-layer feed forward neural networks, and the interest in neural networks was revived again. So, this is a multi-layer feed forward neural network.

As you can see there’s an input layer as before and an output layer, but we also have a hidden layer. We could have one or more hidden layers. I mean, based on the universal approximation theorem, if you have one hidden layer you should be able to approximate any function, but in reality training such a network would be very difficult. So generally we have more than one hidden layer, and in deep neural networks we have many, many hidden layers, hence the name deep.

Also, I think there are some restrictions in the universal approximation theorem: the function that you want to approximate should be continuous, and there are some other restrictions. So, in reality you usually have a neural network with multiple hidden layers. There are good things and bad things about having more layers. More layers means more weights. More weights means that you’re increasing the dimension of your search, because you’re searching for the optimal weights that can map the inputs to the outputs as well as possible, and if you have more weights then your search space dimension goes up.

So, that increases the training time and the difficulty of the problem. There’s also the problem of overfitting. If you have way too many parameters you are more likely to learn the peculiarities of the training data, and you come up with a model that works great on the training data and does not generalize well to unseen data. So you train your model. You’re doing great. You provide an input that was not part of the training set and suddenly your model collapses.

So that’s a typical overfitting scenario, and generally the more parameters we have, the more data we should have. So if our data is fixed and you increase the number of parameters, you know, we’re exposing ourselves to overfitting. So as I said, in the Perceptron we have a binary step function at the output layer: if the sum of all of the weighted inputs is greater than zero, which is the threshold, the output is one, otherwise it’s zero.

So that activation function is called the binary step. It’s used in the Perceptron. It has a range between zero and one. Well, actually it has just two values, zero and one, and the derivative is zero everywhere except at zero, where it is undefined because the function is not continuous there. There is also the identity activation function. Its range is minus infinity to infinity and its derivative is simply one. It’s usually used in the output layer of a regression problem, so we’ll get to that later in this tutorial.

Now some more useful activation functions. There are three that should be discussed. The first is the logistic or sigmoid activation function. It’s basically a soft step function. It goes from zero to one, but as you can see there’s no sudden jump, so this function is differentiable, unlike the binary step function, and the range is still between zero and one. The second is the hyperbolic tangent or tanh. It’s similar to sigmoid except that the range is between minus one and one. Again it’s differentiable. And the third is the more recent ReLU, rectified linear unit.

This is a function where if the input is negative the output is zero, and if the input is positive the output is just whatever the input is. So the range is from zero to infinity. For negative numbers the derivative is zero, for positive numbers the derivative is one, and at zero itself it’s undefined because the function is not differentiable at that point (it is continuous, but it has a kink there). But that’s not a problem. We can deal with a function not being differentiable at a specific point. That doesn’t stop us from using ReLU.
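The three activation functions and their derivatives can be sketched with the Python standard library as follows. The function names are illustrative, and treating the ReLU derivative at exactly zero as zero is a common convention that is assumed here, not something stated in the talk.

```python
import math


def sigmoid(z):
    """Logistic function: a soft step with outputs between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_derivative(z):
    """Derivative of math.tanh, whose outputs lie between -1 and 1."""
    return 1.0 - math.tanh(z) ** 2


def relu(z):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, z)


def relu_derivative(z):
    # Assumed convention: use 0 at z == 0, where the derivative is undefined.
    return 1.0 if z > 0 else 0.0


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), math.tanh(z), relu(z))
```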

So the problem of supervised learning is that we have a training set of size m. A training set is a set of training examples, and each training example is a pair of a feature vector and a label. The feature vector could be n-dimensional, and each of its elements is called a feature. There’s a label associated with each feature vector; we call it y. So the assumption is that there is a function that maps the feature vector to the label, and given the data we want to learn that function or approximate it.

So that’s the goal of supervised learning. In supervised learning, you know, if the label is a real or continuous number it’s called a regression problem. If it’s a categorical value it’s called a classification problem. For classification problems there are three cases. Binary classification means our class label can have two possible values. A classic example is when we are provided patient data and we want to predict whether the patient has a disease or not.

There are two cases, so it’s a binary classification. Then there are multi-class classification problems. Let’s say you are given a bunch of images and you want to predict whether an image contains a dog, a cat, or a panda. That’s a multi-class classification problem. A less popular problem is multi-label classification. There, let’s say you’re given some images and you want to see which of these animals are in each image. It could be more than one, could be zero, could be all of them, so that’s called a multi-label classification problem.

So as you can see, all three cases are classification problems, and our labels are categorical. You can’t say one of the labels is bigger than another label, and you can’t measure the distance between any of the labels. There’s no relationship. Each label is just a nominal value or category. So the activation functions in the output layer for each of these problems are as follows. If you have a binary classification problem we usually have a single neuron in the output layer and we use a sigmoid activation function.

As you remember, with sigmoid the output is between zero and one. So we say if the activation of the neuron in the output layer is greater than 0.5 the output is 1, let’s say there is a disease. If it’s less than or equal to 0.5, we say the output is 0, no disease. So that’s how we model a binary classification problem. For multi-label classification, it’s the same thing except that we have as many neurons as needed. So for a multi-label classification with three possible labels we would have three neurons with sigmoid activation functions, and again the activation works the same way.

If the output of a neuron is greater than 0.5 we assume it fired. If it’s not, we assume it did not fire.
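The thresholding just described can be sketched as follows for a multi-label problem with three output neurons. The label names come from the earlier dog, cat, panda example; the input values to the output neurons are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fired(neuron_inputs, threshold=0.5):
    """Return 1 for each output neuron whose sigmoid activation exceeds the threshold."""
    return [1 if sigmoid(z) > threshold else 0 for z in neuron_inputs]


labels = ["dog", "cat", "panda"]
# Hypothetical inputs to the three output neurons for one image.
outputs = fired([1.2, -0.7, 0.3])
animals_present = [name for name, flag in zip(labels, outputs) if flag == 1]
print(outputs, animals_present)
```

A binary classifier is the special case with a single output neuron, so the same function applies with a one-element list.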

18:26 - So for multi-class classification it’s a little bit different. We have as many neurons in the output layer as the number of classes. So if you have, I don’t know, five classes we would have five neurons, but we use an activation function called softmax. What softmax does is take the inputs to the neurons in the output layer and produce a probability distribution, such that the sum of all those probabilities adds up to one. Usually the neuron with the highest probability is the predicted label. So in the case of dog, cat, panda we get three probabilities for the image being a dog, being a cat, or being a panda, and we say the network predicted, for example, a dog if the probability of a dog was the largest among those three.
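As a small sketch of what softmax does, the following code turns three made-up output-layer inputs into probabilities and picks the most likely class. Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result; it is an implementation choice here, not something from the slides.

```python
import math


def softmax(scores):
    """Convert a list of real-valued scores into probabilities that sum to one."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["dog", "cat", "panda"]
scores = [2.0, 1.0, 0.1]  # hypothetical inputs to the three output neurons
probabilities = softmax(scores)
predicted = class_names[probabilities.index(max(probabilities))]
print(probabilities, predicted)
```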

For regression problems we usually have a single neuron in the output layer and we use a linear activation function. That’s what we’re going to use in this training, actually. So, loss and cost functions. What are they and why do we need them? During training, for each training example x_i, y_i, we present x_i to the neural network and the neural network makes a prediction. We’re going to call it y hat of i. We have to compare the predicted output with the expected output, and we need a way to objectively compare these two and see how much we’re off.

So for classification problems the main loss function which measures the difference between predicted and actual/desired output is called cross entropy. But for regression problems there’s something called a quadratic loss function which is also called mean squared error. So we’re going to get to each of those in turn. So cross entropy loss function and cross entropy cost function are defined as follows. So the loss between y_j that means the desired output for the j’th training example and y hat of j which is the predicted output for the jth training example is defined as such.

We multiply the predicted output and the desired output. This is a multi-class classification problem. These are vectors so they have multiple elements so what we do is we go through the elements of those vectors and one by one multiply the elements of the desired output by the natural logarithm of the element of the predicted output and we add them up with a negative sign. So that’s a loss. So that’s for a training example j but we have m of those. Our training set is of size m.

So the cost function is basically the average of all those losses. We calculate the loss function on the first line for every training example, we have m of those, we add them up and divide by m to get the average. So that is the cost function, and the cost function is a function of the parameters of the neural network, which are the weights and biases. So these are all the weights between all the neurons in the network and all the biases for all the neurons in the network.
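The cross entropy loss and cost described above can be sketched as follows. The desired outputs are assumed to be one-hot vectors and the predicted outputs are assumed to be probability vectors, for example from softmax; the numbers are invented for illustration.

```python
import math


def cross_entropy_loss(desired, predicted):
    """Negative sum of desired[k] * ln(predicted[k]) over the vector elements."""
    # Elements where desired[k] is zero contribute nothing, so they are skipped;
    # this also avoids taking the logarithm of a zero probability.
    return -sum(y * math.log(p) for y, p in zip(desired, predicted) if y > 0)


def cross_entropy_cost(desired_outputs, predicted_outputs):
    """Average of the per-example losses over the m training examples."""
    losses = [cross_entropy_loss(y, p) for y, p in zip(desired_outputs, predicted_outputs)]
    return sum(losses) / len(losses)


# Two hypothetical training examples with three classes each.
desired_outputs = [[1, 0, 0], [0, 1, 0]]
predicted_outputs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
cost_value = cross_entropy_cost(desired_outputs, predicted_outputs)
print(cost_value)
```

The closer the predicted probability of the correct class is to one, the closer its loss is to zero.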

So if you have a regression problem we’re going to be using the quadratic loss and cost functions. The loss function is again between the predicted output and the desired output. If the output is multi-dimensional, we subtract these two vectors from each other, calculate the length of the resulting vector, raise it to the power of 2, and multiply it by one-half. That is the loss. We have m training examples, and the cost function is just adding up the losses for those m training examples and then dividing by m, so we have an average.
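Written out, the quadratic loss and cost just described take the following form. The symbols are a notational assumption, since the slide formulas are not reproduced in this transcript: $\hat{y}^{(j)}$ is the predicted output and $y^{(j)}$ the desired output for training example $j$, and $W$, $b$ stand for all weights and biases.

```latex
\begin{align}
\mathcal{L}\left(\hat{y}^{(j)}, y^{(j)}\right) &= \frac{1}{2} \left\lVert \hat{y}^{(j)} - y^{(j)} \right\rVert^{2} \\
J(W, b) &= \frac{1}{m} \sum_{j=1}^{m} \mathcal{L}\left(\hat{y}^{(j)}, y^{(j)}\right)
\end{align}
```

For a single output neuron the norm reduces to the absolute difference, so the loss is simply half the squared error.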

Okay so now we want to know, we know how to calculate the loss and cost functions, now we want to know how to update the weights in the network so it can learn. So, there’s an algorithm called back propagation. It was proposed in the 80s. Multiple people independently came up with the algorithm. It’s basically a gradient descent technique in that your goal is to find a local minimum of a function by iteratively moving in the opposite direction of the gradient of the function at the current point.

So it’s like, think of it as walking on a hill: if the slope is positive and you want to find the local minimum, you walk backwards. If the slope is negative and you want to find the minimum, you walk forward. That’s all it is. The slope is given by the gradient, and we usually move in small steps so we don’t overshoot. So the goal of learning is to minimize the cost function given a training set. Our cost function, for example for a multi-class classification problem, is given by J, which is a function of w and b.
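The idea of walking against the slope in small steps can be sketched on a one-dimensional function. The function f(x) = (x - 3)^2, the starting point, and the step size are all illustrative choices; a real cost function depends on many weights and biases instead of a single x.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly move a small step in the direction opposite to the gradient."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


# f(x) = (x - 3)^2 has its minimum at x = 3 and gradient 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=10.0)
print(minimum)
```

Starting to the right of the minimum, the slope is positive and every step moves left; starting to the left, the slope is negative and every step moves right.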

So the goal is to minimize that given the training set, and the cost function is a function of the network weights and biases for all neurons in all layers. Backpropagation iteratively computes the gradient of the cost function relative to each weight and bias. So let’s say we have a network with multiple layers and these layers are fully connected, which means every neuron in one layer is connected to every neuron in the next layer. This is one example. Here we have three times four, that’s twelve, plus six, so eighteen weights, and we’re not showing the biases here, so that would be five more biases as well.

So all of these need to be updated by the backpropagation algorithm. So we update weights and biases in the opposite direction of gradient because we want to minimize the cost function and the gradients are used to update weights and biases. The goal is to find a local minimum. So the derivation of the back propagation algorithm is somewhat involved. If you go to the tutorial, these are the slides for the feed forward neural network, but there’s also a tutorial for feed forward neural networks.

I do cite a reference that has the derivation for backpropagation algorithm. It’s basically the chain rule in calculus. That’s the bottom line but you have to get creative in order to do all these calculations. I’m not going to discuss the derivation. I’m just going to give you the results and if you’re interested you can look at that paper to see how these values are derived. So the whole point is in order to do the backpropagation we define this term called error.

It’s displayed by small delta and delta of ilj means the error of neuron i at layer l for training example j. So what that means is that’s a partial derivative of the loss function that we defined up here, first line for a multi-class classification, relative to zilj and zilj is the input to neuron i layer l for training example j. So that’s how we define error. Then if you read that paper you will see that the error at the last layer or output layer of a neural network can be calculated by the first formula and it’s simplified to basically on the right hand side to the predicted output minus the expected output.

These are vectors because it’s a multi-class classification, it’s multi-dimensional, so we can calculate the delta at the output layer then the second formula we can calculate delta at the layer l given delta at layer l plus one. So the way it works is we calculate the error values for the output layer, then we use the second formula to calculate the error values for the layer prior to the output layer. Again recursively we use that to calculate the error two layers before the output layer and we go all the way to the input layer.

So as you can see the errors propagate backward hence the name back propagation. So after we do this multiple times we get to the input layer and all the delta error values are calculated. So what good are they? well the same paper will show you that the partial derivative of the loss function relative to the bias is just the value of delta. There’s an i subscript missing here, sorry for that, and the partial derivative of loss with respect to the weight is given by delta multiplied by the activation of the neuron in the previous layer.

So in summary we define this term called delta or error we know how to calculate delta in the output layer by the first formula, by the second formula we know how to calculate the error in a previous layer given the current layer and if we use that from the output layer we can go all the way back to the input layer iteratively. Hence, the name backpropagation. Having the value of deltas we can calculate the partial derivative of loss with respect to biases and also with respect to the weights and that’s the gradient that tells us whether we should increase the weight or decrease the weight in the gradient descent algorithm.
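The relationships summarized above can be written compactly as follows. The notation is an assumption, since the slide formulas are not reproduced in this transcript: $L$ denotes the output layer, $W^{l+1}$ the weight matrix between layer $l$ and layer $l+1$, $g_l$ the activation function of layer $l$, $a_k^{l-1,j}$ the activation of neuron $k$ in the previous layer, $\odot$ element-wise multiplication, and $\mathcal{L}_j$ the loss for training example $j$. The simple form of the output-layer error is the one that holds for a softmax output layer with cross entropy loss.

```latex
\begin{align}
\delta_i^{l,j} &= \frac{\partial \mathcal{L}_j}{\partial z_i^{l,j}}
  && \text{(definition of the error)} \\
\delta^{L,j} &= \hat{y}^{(j)} - y^{(j)}
  && \text{(error at the output layer)} \\
\delta^{l,j} &= \left( \left(W^{l+1}\right)^{\top} \delta^{l+1,j} \right) \odot g_l'\!\left(z^{l,j}\right)
  && \text{(error at layer $l$ from layer $l+1$)} \\
\frac{\partial \mathcal{L}_j}{\partial b_i^{l}} &= \delta_i^{l,j}
  && \text{(gradient for a bias)} \\
\frac{\partial \mathcal{L}_j}{\partial w_{ik}^{l}} &= \delta_i^{l,j} \, a_k^{l-1,j}
  && \text{(gradient for a weight)}
\end{align}
```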

So there are different types of gradient descent algorithm: in batch gradient descent you calculate the gradient for each weight and bias for all of the samples because as I said our training sample is of size m so we can calculate the derivative of the loss function relative to weights and biases m times and then what we do is we average those gradients and then we update the weights and biases. So this is good but it’s slow. If we have too many samples, so if let’s say you have ten thousand a hundred thousand samples you have to calculate the derivative of the loss function for all the weights and biases for all of these one hundred thousand samples then average them then update them and then repeat.

So this is going to be very slow so an alternative is the stochastic gradient descent and the idea is that after one training example is provided to the network you get the predicted output calculate the derivative of the loss function relative to weights and biases and use that one derivative, not the average of all derivatives, for the training examples to update the weights, so this has the benefit of being fast but that single gradient may not be representative and you know you may not get good results.

So the middle ground is something called mini-batch gradient descent. Instead of using, say, all ten thousand training examples to update the weights, we break them down into mini-batches of, say, 500 samples. We average the gradients for the 500 samples in a mini-batch and then update the weights. That is more accurate than using just one sample to update the weights, and it is faster than using all the samples to update the weights. So mini-batch gradient descent is the preferred way to train a neural network.
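A minimal sketch of mini-batch gradient descent, assuming a toy model with one weight and one bias (a straight line) and the quadratic loss, so that the gradients can be written by hand. The data, batch size, learning rate, and number of epochs are made up; a batch size of 1 would give stochastic gradient descent and a batch size equal to the data set size would give batch gradient descent.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size=5, learning_rate=0.5, epochs=500, seed=0):
    """Fit y = w * x + b by averaging quadratic-loss gradients over each mini-batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients of 1/2 * (w * x + b - y)^2, averaged over the mini-batch.
            grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [i / 19 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = minibatch_gradient_descent(xs, ys)
print(w, b)
```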

So neural networks suffer from a problem called the vanishing gradient problem. As you can see, the second backpropagation equation is recursive. If you go up here you see the second one: the delta at layer l is calculated via the delta at layer l plus one. So if we have a network that has, like, five layers, we have to repeat this five times. We go from layer five to four, four to three, three to two, and two to one. Actually it’s four times. But if you look at the formula, the second term is gl prime of zlj.

zlj is the input at layer l and g is the activation function at layer l, so we have to calculate the derivative of the activation function at layer l for this input, and that gives us a value that is multiplied by the value on the left hand side. But the point is that on the right hand side of the formula for calculating delta we have a derivative, and this is a recursive formula. So in order to calculate the delta on the first layer in the five layer network that I told you about, we have to multiply four derivatives by each other. The derivatives of, for example, sigmoid are generally very small numbers, so if you multiply multiple small numbers with each other you’re going to have a very, very small number, and that is the delta that you want to use to update the weights.
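To get a feel for how quickly such a product shrinks, here is a small sketch. It uses the fact that the sigmoid derivative is at most 0.25, which it reaches at an input of zero, and it ignores the weight terms in the recursive formula; that is a simplifying assumption made only to isolate the effect of the derivatives.

```python
import math


def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def product_of_derivatives(num_factors, z=0.0):
    """Best case: every layer contributes the largest possible sigmoid derivative."""
    return sigmoid_derivative(z) ** num_factors


max_derivative = sigmoid_derivative(0.0)        # the largest value the derivative can take
five_layer_factor = product_of_derivatives(4)   # four multiplications for a five-layer network
deep_factor = product_of_derivatives(20)        # a much deeper network
print(max_derivative, five_layer_factor, deep_factor)
```

Even in this best case the factor for a five-layer network is below 0.004, and it keeps shrinking exponentially as layers are added.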

So if the value that you use to update the weights is very, very small, then the weights don’t get updated and your gradient descent algorithm does not converge, or it will take forever to converge. That’s a problem that is more frequent in networks that have many layers. This actually prevented neural networks from being applied to many complex problems and resulted in a loss of interest in neural networks in the late 90s and early 2000s. But since then we’ve had much better activation functions proposed. For example, ReLU does not suffer from this problem in the same way, because its derivative is one for positive inputs, so you can have a network that has tens of layers, and if you use the ReLU activation function for the hidden layers you can avoid the vanishing gradient problem. Sigmoid and tanh are still used in the output layers, mostly sigmoid, but for the hidden layers, if you have a very deep network with many, many hidden layers, it’s better to use ReLU to avoid the vanishing gradient problem.

So in this tutorial we’re going to solve a regression problem. It’s car purchase price prediction. We have a sample data set: given five features of an individual (their age, gender, miles driven per day, personal debt, and monthly income) and the money that they spent buying a car, we want to train a feedforward neural network to predict how much someone will spend on buying a car. Then we’re going to evaluate this feedforward neural network on a test data set and plot graphs to assess the model’s performance.

The training data set has 723 training examples and the test data set has 242 examples. The examples in the test and training data sets are mutually exclusive, so when we test our model on the test data none of those examples were actually used in training the model. The input features are scaled to be between zero and one. That’s a common pre-processing step before the data is presented to the network. So these are the slides for the tutorial. The tutorial itself has a references section, and you can find the references for the material used here there.

It’s just that it’s not easy to put references in the slides here. Some general information: the Galaxy training material can be found at training.galaxyproject.org. There are various bioinformatics topics and a growing number of machine learning topics, many tutorials, and many contributors. If you need help you can go to help.galaxyproject.org. There are also Gitter channels: there’s a Gitter channel for Galaxy training, there’s one for Galaxy, the main chat room, and there are also domain specific chat rooms.

There are various events that you can find out about at galaxyproject.org/events. The upcoming one is, well, I guess this one, GCC 2021. So thank you so much. The next video will be the hands-on section for feed forward neural networks, in which we use Galaxy’s neural network facilities to create a feed forward neural network to solve this car purchase price prediction problem, which is a regression problem. So I’m going to be seeing you there soon.

Thank you!

Hello everyone, I’m back. We’re going to be doing the hands-on section of the feed forward neural network tutorial. What you need to do is go to training.galaxyproject.org; that’s the main website for Galaxy training. If you scroll down to Statistics and Machine Learning here and click, there’s Deep Learning Part 1: Feedforward Neural Networks. If you click on this monitor sign you’re going to be taken to the tutorial.

So we already went through the slides in the previous video. Here we’re going to do the hands-on section so we’re going to click on the get data and then we’re going to solve a simple regression problem. Just a note you could download the workflow that’s used in this tutorial here then import the workflow into galaxy and run it. However, here I’m going to, like, start from scratch. I’m not going to use this workflow but if you have any issues it would be a good idea to just use the workflow first and look at workflow and see what it’s doing and to figure out what you’re doing wrong.

Okay, so let’s get the data. This is the section of the tutorial on getting the data. I’m going to copy the Zenodo links for the data that we want to upload to Galaxy. Then you have to go to usegalaxy.eu; this is the Galaxy website for Europe. There’s also usegalaxy.org, but I think some of the tools that are installed on the EU server are not on the org website, so let’s stick with Galaxy EU. You need to register. I’m registered as me, obviously. After that, this is the page; it has three panels.

On the left hand side we have the tools. This is the main panel. On the right side we have the history. We start by creating a new history, and you do that by clicking on this plus sign up here and then giving your history a name. This is so we can basically refer to everything that we do as part of the hands-on section of this tutorial. I’m going to call it GCC 2021 feedforward neural networks, so that’s that. And remember, I copied the URLs of the files that we need for this tutorial.

We’re going to go back here and click on Upload Data in the top left corner. We’re going to get this page. I’ll click on Paste/Fetch data, paste the links here, and hit Start, so this is going to fetch the files from Zenodo. These are four files, two for training and two for testing. One of the files is the feature vector, the other one is the label. So we have a feature vector and label for training and a feature vector and label for testing, and looking at my notes we have 723 training examples and 242 test examples.

That’s the size of our training and test sets. As you can see, the jobs went from gray to a kind of orange and then green, so they were waiting to be executed, then they were executing, and now they are finished. So we have to rename these files. You click on this Edit attributes button here. What we’re going to do is remove the extension. We’re also going to remove the URL, which is automatically included in the name of the file, and then we’re going to save this.

The other thing that we do is that we look at the data types. Sometimes the data types are not detected correctly so all of these files are of type tabular. So select new type tabular and click change data type so we’re going to repeat this process for all the other files and just so you know everything is documented in the tutorial so if at any point you think you don’t understand something you can pause the video go to the tutorial and look it up. So this tells you how to create a new history, this tells you how to rename a data set, this tells you how to change the data type and so on.

So I’m just following the instructions in the tutorial. Let’s rename and change the type of the remaining files. So I get rid of the extension and the URL, save it, change the data type to tabular, and then I’ll do that for the next one, and finally the last one. Okay, so two of the jobs are still running. Well, one completed. We’ll wait for the last one. So x_train is the training feature set and y_train is the training label. Similarly, x_test and y_test are the input features and the labels for testing.

So we’re going to use the training data to train our model, and the testing data to test and evaluate our model. What we’re going to do next is follow the tutorial, and there are multiple steps. Step one is create a deep learning model architecture, so I’m going to do that in the Galaxy page. Almost all of the tools that we’re using, all of the tools that Galaxy has, are on this pane. You can search for them.

That’s one way. The other way is that I know all of these tools are under the Machine Learning header, so if you scroll down, you see the Machine Learning header. If you click here, you will see all of the tools that you need for machine learning. Okay, so let’s go back. The tool that we need is create a deep learning model architecture, so I’m going to find it: create a deep learning model architecture with Keras. That’s actually a good point, because Galaxy has wrapped Keras under the hood for all of our neural network facilities, and Keras is basically a higher-level library on top of TensorFlow, which is a Google library.

So, anyway, Keras has a very nice interface. The API is very clean, it’s easy to use, and it doesn’t take a lot of code to get a lot of things done. There are other libraries like PyTorch that are also very popular, but I really like Keras because it’s very concise and I like the way they’ve implemented the library. Anyway, so here we are. We’re going to pick version 0.4.2 of this tool. That’s done by clicking on this version button and selecting 0.4.2.

The model type is obviously sequential. We have a feed forward neural network. The input shape is five because we have five attributes in our dataset. Looking at my notes, it’s age, gender, average miles driven per day, personal debt, and monthly income, so these are the inputs to our model, and the output is how much a person spent buying a car. So the inputs are given for training in x_train, those five, and the outputs are given in y_train, the amount that they spent buying a car. So that’s the shape of the input.

So here we define a layer, and if we have multiple layers we can click on the plus insert layer button here to add a new one. For the first layer, choose the type of layer as core dense. It’s a fully connected layer. This layer is going to have 12 units, and for the activation function we’re going to pick ReLU. Then we are going to insert the second layer, so it’s going to be dense as well. The number of units is going to be eight and the activation function is going to be ReLU. And finally we’re going to add the output layer.

It’s going to be core dense again for the type of the layer. The number of units is going to be one because this is a regression problem, and in a regression problem we have one neuron in the output layer, and the activation function is linear. So let’s select linear here, if we can find it. Okay, so we are done defining our neural network architecture and everything seems fine, so we’re going to click on execute, and this is going to start a job called Keras model config.
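As a rough illustration of the architecture just configured, here is a plain-Python sketch of a fully connected network with 5 inputs, hidden layers of 12 and 8 ReLU units, and one linear output. This is not the Keras code that Galaxy runs under the hood; the function names, the random initialization, and the sample values are assumptions made only for this example.

```python
import random

# Layer sizes from the walkthrough: 5 inputs, then 12, 8, and 1 units.
LAYER_SIZES = [5, 12, 8, 1]
ACTIVATIONS = ["relu", "relu", "linear"]


def count_parameters(layer_sizes):
    """Count weights plus biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total


def init_network(layer_sizes, seed=0):
    """Small random weights and zero biases (illustrative initialization only)."""
    rng = random.Random(seed)
    network = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        network.append((weights, biases))
    return network


def forward(network, activations, x):
    """Push one feature vector through every dense layer in turn."""
    for (weights, biases), activation in zip(network, activations):
        z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        x = [max(0.0, value) for value in z] if activation == "relu" else z
    return x


network = init_network(LAYER_SIZES)
# Hypothetical normalized sample: age, gender, miles per day, debt, income.
sample = [0.4, 1.0, 0.3, 0.2, 0.6]
prediction = forward(network, ACTIVATIONS, sample)

print(count_parameters(LAYER_SIZES))  # (5*12+12) + (12*8+8) + (8*1+1) = 185
print(len(prediction))                # a single output value for regression
```

With untrained random weights the predicted value itself is meaningless; the point is only the shape of the computation and the number of trainable parameters.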

It’s in the wait mode while it’s gray. When it turns kind of yellowish it means it’s being run. When it’s green it’s complete. That’s the color coding for Galaxy. So, this model can be downloaded as a JSON file. While that model is building, I’m going to show you that if you click on this eye button you can view the data. So if you click here, this is going to be our input feature vector for training. As you can see it includes age, gender, miles driven per day or month, the amount of debt, and the amount of income. And if you look at the labels, each is just a value that is normalized, so, I don’t know, the first line could represent 32000 or whatever. It’s a normalized value.

Okay, so we have this model, and as you can see it’s a JSON object. Next, if you look at the tutorial, we’re going to create a deep learning model. What that does is take the JSON file that we just created, the model architecture, and then we have to specify a loss function, as we discussed in the lecture, and also an optimizer and some metric, and we also need to define a few other parameters. So I’m going to go here, and again I know that all the tools are under Machine Learning, so if you scroll down on the left, these are the Machine Learning tools, and I’m looking for create a deep learning model with an optimizer, loss function and fit parameters, so you click here.

Let me get my notes here just to make sure I don’t deviate from the tutorial. So for build a training model, we’ll leave it as it is, and here it says select the dataset containing model configuration. That’s the JSON file that was created by job number five, and it’s been pre-selected correctly. That’s fine. Here, for do classification or regression, the default is a Keras classifier. Because we’re doing a regression problem we have to change that to Keras regressor. Next is select a loss function. As we said, for regression we use mean squared error, so let’s select mean squared error.
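The mean squared error loss selected here is just the average of the squared differences between the true and predicted values. A minimal sketch follows; the function name and the sample numbers are made up for illustration and are not taken from the tutorial dataset.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical normalized car prices and model outputs.
y_true = [0.5, 0.2, 0.9]
y_pred = [0.4, 0.3, 0.6]
loss = mean_squared_error(y_true, y_pred)
print(loss)
```

Because it is an average of squares, this quantity can never be negative, and it is zero only when every prediction matches its label exactly.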

For select an optimizer, we’re going to pick the Adam optimizer. I think Adam should be the preferred one because it has two benefits over the vanilla optimizer. One is that it uses momentum, so the step that you take also depends on the steps that you took previously. It’s weighted, so there’s less weight for the steps way back in time. The other is that there are different learning rates for different dimensions of the search space. So these are the two benefits of Adam, and it’s pretty good. For select metrics, we can select mean squared error again, and then we can go and pick the number of epochs and the batch size.
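The two properties of Adam mentioned here, momentum and a separate effective step size per dimension, can be seen in a small plain-Python sketch of the update rule. The toy loss function, the starting point, and the learning rate of 0.1 are assumptions chosen for the example; they are not the settings Galaxy or Keras use.

```python
import math


def adam_minimize(grad, start, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a function with the Adam update rule, given its gradient."""
    params = list(start)
    m = [0.0] * len(params)  # running average of gradients (momentum)
    v = [0.0] * len(params)  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad(params)
        for i in range(len(params)):
            # Older gradients get exponentially less weight.
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            # Bias correction for the zero-initialized averages.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            # Dividing by sqrt(v_hat) scales the step separately per dimension.
            params[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return params


def toy_loss(p):
    """Toy bowl-shaped loss that is much steeper in the second dimension."""
    return p[0] ** 2 + 10 * p[1] ** 2


def toy_loss_grad(p):
    return [2 * p[0], 20 * p[1]]


start = [3.0, -2.0]
final = adam_minimize(toy_loss_grad, start)
print(toy_loss(start), toy_loss(final))
```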

So the number of epochs is how many times we want to use the whole training set to train the model. We have 723 examples in our training set. We use those 723 examples, and we can still use them again to train the model, and it still improves the performance of the model. So the number of epochs basically tells us how many times we’re going to use the training set. I’m going to say 150. We have a very small dataset so that’s not going to be a problem. And the batch size determines how often we’re going to update the weights, or the parameters, of our model.

You know, usually we don’t want to use the whole training set to do one update of the weights because it’s going to be too slow, so we’re using mini-batch gradient descent, and for batch size we’re just going to pick 50. So that’s done too. We just click on execute, and that’s the model builder, which takes the JSON file from the previous step and all the new parameters that we specified, and it’s going to build a model.
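To make the epoch and batch size settings concrete, the arithmetic below counts how many weight updates these choices imply. It assumes that all 723 training examples are used in every epoch (no validation split) and that the final, smaller batch is kept rather than dropped; the variable names are just for illustration.

```python
import math

n_train = 723     # training examples, from the walkthrough
batch_size = 50   # examples per weight update
epochs = 150      # full passes over the training set

# Each epoch needs enough batches to cover every example once.
updates_per_epoch = math.ceil(n_train / batch_size)
# The last batch of an epoch holds whatever is left over.
last_batch_size = n_train - (updates_per_epoch - 1) * batch_size
total_updates = updates_per_epoch * epochs

print(updates_per_epoch)  # 15
print(last_batch_size)    # 23
print(total_updates)      # 2250
```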

So while this is building, let’s look at the tutorial. The next step we need to do is run deep learning training and evaluation. So let’s see if this is complete. Okay, it just started running. Excuse me. So while that’s running I’m going to look for the deep learning training and evaluation tool. Again, I know that all of these tools are under the Machine Learning header, so let’s see if we can find deep learning training and evaluation, which we do right here. So let’s wait for the model builder to finish.

Okay, so the Keras model builder job just completed. I’m now going to do the next step, which is deep learning training and evaluation. You can find that tool under the Machine Learning header. It’s here. Let me look up my notes and make sure I’m not doing anything differently. I thought I may have picked the wrong tool, but, okay, sorry, I got my notes wrong. So, for train and validate, select a scheme remains as it is, and the next step is choose the dataset containing pipeline/estimator object. That’s the output of the Keras model builder in step number six up here.

We’ll just leave it as it is. The input type is tabular data, obviously. The training samples dataset is going to be x_train, so we’re going to click here and select x_train. That’s the feature set. Does the training data have a header? It does, so we’re going to change this to yes, and we’re going to say we want all the columns. Now we’re going to worry about the labels. For the labels, the dataset name is y_train. That’s the correct one, and it’s been pre-populated.

It does have a header so we’re going to change it to yes, and again we’re going to select all columns, and I think that’s it. We can click execute and this is going to train and evaluate our feed forward neural network model, so let’s wait for this to complete. Okay, this job completed. I paused the video while it was running. But anyway, we get three things: one is the model, the second is the weights of the model, and the third is the metric, so you can look at the metric.

That’s the mean squared error. I have to double check to see what this is, because this should not have been negative. That’s why I have to check it. But anyway, these are the evaluation result, the model, and the weights of the model. So this is the tutorial. The next step is model prediction. What we did was we had a training dataset. We provided this training dataset to our neural network and we used it to update the weights of our neural network. So when the training is done, we have a neural network that ideally is able to predict the price paid for a car given five attributes of an individual, like age, income, debt, etc.

So now it’s time to test this model. What we do is pass the test data to the model, compare the predictions of the model with the expected outputs, and see how our model does. So let’s do that. The next task, or job, is model prediction. Again we go to the Galaxy page, we click on Machine Learning, and we look for model prediction. It’s up here. Model prediction takes a few parameters. The first one is choose the dataset containing pipeline/estimator object. That’s the result of job number eight.

That’s pre-populated correctly. The second one is choose a dataset containing weights for the estimator above, and the weights are the result of job number nine, so we’re going to pick that from the drop-down. Next is select invocation method. We want to do prediction so we’ll leave it as it is. The input data type is tabular. Then there is the training samples dataset. Oh, this is not training, this is the test set that we want to use, and that would be x_test.

We pick it here from the drop-down. It does have a header and we select all columns, so we’re ready to execute this. This would basically take all the data in x_test, see what the model predicts, and provide us with those values. So it is running now, because the color changed on the job, and it completed. Okay, as you can see the model prediction job completed. If we click on view data we will have the predictions by our model for the car prices. What we want to do now is plot the output of our model so we get an idea of its performance.

If we go to the tutorial, the next step would be plot actual versus predicted curves and residual plots. All of the tools that we have used so far were under Machine Learning. If you scroll down we have this Statistics and Visualization section. The headers are Statistics, Machine Learning, and Graph/Display Data. This plot tool is under Graph/Display Data. If you scroll down, you will find it.

Plot actual versus predicted curves and residual plots of tabular data. So you click there. Select the input data file: this is what we’re comparing against, which would be the labels for the test data, which is y_test. The predicted data file is the output of the previous step, right here, and it is pre-selected correctly. So we’re going to click execute, and this is going to create three plots. Let’s look at them here. The first plot is the true versus predicted values plot.

The second one is a scatter plot of true versus predicted values, and the third one is the residual versus predicted values plot. So we’ll let these three jobs complete and then we’re going to go over all three graphs one by one. So the first one, this is true versus predicted values. True values are given by the color blue and predicted values are given by the color orange, and the more overlap you have the better. If they completely overlap, that means our predictions are 100% accurate.

The second graph that we get is a true versus predicted values scatter plot. On the x-axis we have the true values; on the y-axis we have the predicted values. So if our predictions are 100% accurate we’re going to have a line with a 45-degree slope. As you can see we’re slightly off that, which means that our predictions are not 100% accurate. The root mean squared error for our neural network is 0.11. Obviously, if it was zero that would have been a perfect neural net. And our R squared metric is 0.87.

These two are given up here. For the R squared metric, the closer it is to one the better, and we’re pretty close at 0.87. And the last graph has the predicted value on the x-axis and the residual value on the y-axis. What is the residual value? It’s the difference between the predicted and the true value. So if you predict, I don’t know, 0.6, and the true value is also 0.6, you would have a point on this line where y equals zero.
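The three quantities read off these plots, the residuals, the root mean squared error, and R squared, can be computed from the true and predicted values as in the sketch below. The function name and the sample numbers are hypothetical, and taking the residual as predicted minus true is an assumption about the sign convention; the plotting tool may use the opposite sign.

```python
import math


def regression_report(y_true, y_pred):
    """Return residuals, root mean squared error, and R squared."""
    n = len(y_true)
    # Residual: predicted minus true (sign convention assumed here).
    residuals = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(r ** 2 for r in residuals)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r_squared = 1 - ss_res / ss_tot
    return residuals, rmse, r_squared


# Hypothetical normalized values, not taken from the tutorial dataset.
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.25, 0.35, 0.6, 0.9]
residuals, rmse, r_squared = regression_report(y_true, y_pred)
print(residuals)
print(rmse, r_squared)
```

A perfect model gives residuals that are all zero, a root mean squared error of zero, and an R squared of one, which matches the reading of the plots described here.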

So the further the points are from the zero y value, the worse we are, and the closer they are to the line that represents y equals zero, the better we are. So finally, the conclusion. In this tutorial we discussed the inspiration behind neural networks, explained the perceptron, one of the earliest neural network designs that’s still in use today, and we discussed different activation functions, what supervised learning is, and what loss and cost functions are. We also discussed the backpropagation learning algorithm that minimizes the cost function by updating the weights and biases in the network.

We implemented a feedforward neural network in Galaxy to solve a simple regression problem: predicting the purchase price of a car given a dataset that we uploaded. So this completes part one of this three-part series, the feed forward neural networks. The subsequent tutorials are recurrent neural networks and convolutional neural networks, which we’ll cover later. Okay, thank you and see you soon.