Walking Through Training Models

Lucasvittal
10 min read · May 28, 2022

Most machine learning algorithms use a generic optimization algorithm called Gradient Descent (GD) or one of its variants. The common goal is to minimize a cost function, which is usually the Mean Squared Error (MSE):

MSE(θ) = (1/m) ∑_{i=1}^{m} (θ^T x^{(i)} − y^{(i)})^2

To explain how gradient descent works, let's imagine that you are blindfolded in a room with an irregular floor, and your goal is to reach the lowest point in the room. So what do you do? You take one step in some direction and evaluate whether your level increased or decreased; if it decreased, you guess you are going in the right direction and keep walking, and if it increased you try another direction, and so on until you reach the minimum. This is the main principle of gradient descent: the step size is given by the learning rate, and the sense of whether the level is increasing or decreasing is given by the cost function's derivatives:

Obviously, real cost surfaces are not as regular as shown above, but in the linear regression case the MSE cost function is always convex. A good general picture can be seen below:

When dealing with real data, each feature can have its own scale, which can be very bad for ML models, especially during cost function optimization, because unscaled data makes reaching the minimum slower. Therefore, scaling the data is essential for ML models to perform well.

Note that with scaled data GD goes straight toward the minimum, while with unscaled data the algorithm follows a kind of long, curved trajectory; the first case is faster than the second, which shows the importance of scaling the data.
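As a minimal sketch of standardization with scikit-learn's StandardScaler (the toy values below are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales (hypothetical values)
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance

print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```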

As said before, gradient descent has variants. The first type is Batch Gradient Descent. This algorithm is characterized by using the entire dataset in each iteration in order to compute the partial derivatives:

∂MSE(θ)/∂θ_j = (2/m) ∑_{i=1}^{m} (θ^T x^{(i)} − y^{(i)}) x_j^{(i)}

These can all be computed in one go by calculating the gradient vector of the MSE function:

∇_θ MSE(θ) = (2/m) X^T (X θ − y)

That is a vector containing the partial derivatives of the MSE function with respect to each model parameter. In the blindfolded-man analogy, the gradient vector would be someone telling the blind guy which direction goes uphill; since his goal is to go downhill, he only has to walk in the opposite direction, taking steps of a given size:

θ^{(next step)} = θ − η ∇_θ MSE(θ)

Hence, this is exactly how the model parameters are adjusted, where η (eta) is the learning rate.
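A minimal NumPy sketch of this update rule for linear regression, assuming the bias term is included as a column of ones in X (the data and hyperparameters here are illustrative):

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iterations=1000):
    """Batch GD for linear regression; X must include a bias column of ones."""
    m, n = X.shape
    theta = np.random.randn(n, 1)                   # random initialization
    for _ in range(n_iterations):
        gradients = 2 / m * X.T @ (X @ theta - y)   # gradient of the MSE
        theta = theta - eta * gradients             # step against the gradient
    return theta

# Toy example: y = 4 + 3x plus noise
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]                     # add the bias column
theta = batch_gradient_descent(X_b, y)
print(theta)                                        # should be close to [[4.], [3.]]
```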

The other guy is Stochastic Gradient Descent (SGD).

As said before, batch gradient descent uses the entire dataset to perform the optimization. Its big problem is that when the dataset is large it takes an enormously long time to complete. The solution brought by SGD is to pick one instance at random at every step instead of using the entire dataset, so this training algorithm fits very well when the machine learning context involves a large dataset. Evidently this comes at a cost: it reaches an approximation of the global minimum, but not exactly the minimum.

That is because at every step the algorithm does not go straight toward the minimum but bounces its way down, and it never settles exactly at the minimum, only at an approximation of it. So when using this training algorithm you should accept a value close to the minimum, knowing that it is practically impossible to hit it exactly. One way to improve this is to decrease the learning rate as the algorithm approaches the minimum; it still never reaches the exact global minimum, but it gets very close to it.
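As a minimal sketch of this idea, the snippet below (reusing the bias-augmented X_b and y from the previous example) picks one random instance per step and decays the learning rate with a simple schedule; the constants t0 and t1 are arbitrary illustrative choices:

```python
import numpy as np

n_epochs = 50
t0, t1 = 5, 50                               # learning-schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)                     # learning rate decays over time

m, n = X_b.shape
theta = np.random.randn(n, 1)                # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        idx = np.random.randint(m)           # pick ONE instance at random
        xi = X_b[idx:idx + 1]
        yi = y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)  # gradient on a single instance
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
print(theta)
```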

In order to reduce the bouncing, a middle ground was created: Mini-batch Gradient Descent. Instead of using the entire dataset in each iteration or one instance picked at random, multiple instances are sampled at random from the dataset to train the model. As a result, mini-batch gradient descent reaches the minimum faster than batch gradient descent and bounces less than SGD.
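A minimal mini-batch variant under the same assumptions (X_b and y from the earlier snippets; the batch size of 20 and the fixed learning rate are illustrative choices):

```python
import numpy as np

n_epochs = 50
batch_size = 20
eta = 0.05                                   # fixed learning rate for simplicity

m, n = X_b.shape
theta = np.random.randn(n, 1)

for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)      # shuffle once per epoch
    for start in range(0, m, batch_size):
        batch = shuffled[start:start + batch_size]
        X_batch, y_batch = X_b[batch], y[batch]
        gradients = 2 / len(batch) * X_batch.T @ (X_batch @ theta - y_batch)
        theta = theta - eta * gradients
print(theta)
```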

If you want precision, i.e., to get the exact minimum, you should use batch gradient descent, but you pay with time. If you want to spend less time and your training data is large, then you should use SGD, but you will lose precision. If you want more precision than SGD and can pay a little more time, you should use mini-batch gradient descent. The best algorithm will depend on your context. It is worth saying that all of them can reach very similar results if we are patient enough to wait through a large number of iterations and, in the case of mini-batch GD and SGD, apply a learning schedule, i.e., decrease the learning rate as the algorithm gets closer to the minimum (in some cases the random nature of these algorithms even helps them avoid getting stuck in a local minimum).

Learning Curves

Learning curves are a very useful tool when developing a machine learning model, because they tell us whether the model under development is overfitting or underfitting. When an ML model is being developed, a dataset is used to build it, and to reach this goal the dataset is divided into three parts:

The training and validation sets are used to train and evaluate the model during development, and the test set is used for the final evaluation. Since we are focusing on training, our sets of interest are the training and validation sets. The training set is used to train supervised models: the model is fed the instance features, some gradient descent variant is applied, and the model parameters are adjusted against the labels at each iteration until the whole training set has been consumed. After that, the validation set instances are used: the features feed the model and the labels are used to compute the error, the difference being that the parameters are not adjusted here, only the errors are calculated. Once the error is stored at each iteration, we can generate the learning curves by plotting the error on the validation set and on the training set as a function of the iteration:

With these curves you can evaluate whether your model is too complex for your problem (overfitting), not complex enough (underfitting), or fits it well.

For example, when both learning curves plateau at a high error value and stay close to each other, it means that the model is underfitting:

When the learning curves stay far from each other beyond some point, it means that the model is overfitting:

A good model has learning curves with low error values that stay very close to each other.
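As a sketch of how such curves can be generated with scikit-learn and matplotlib (here plotted against training set size rather than iteration count, and with made-up data):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])        # train on the first m instances
        y_train_pred = model.predict(X_train[:m])
        y_val_pred = model.predict(X_val)          # evaluate, but never fit, on validation
        train_errors.append(mean_squared_error(y_train[:m], y_train_pred))
        val_errors.append(mean_squared_error(y_val, y_val_pred))
    plt.plot(np.sqrt(train_errors), "r-+", label="train")
    plt.plot(np.sqrt(val_errors), "b-", label="validation")
    plt.xlabel("training set size")
    plt.ylabel("RMSE")
    plt.legend()
    plt.show()

# Toy usage with made-up quadratic data
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X ** 2 + X + 2 + np.random.randn(100, 1)
plot_learning_curves(LinearRegression(), X, y.ravel())
```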

Summarizing everything:

Regularized models

When an ML model is developed, it is usually preferable to use some kind of regularization, because a regularized model usually performs better than a non-regularized one. Regularization means constraining the model parameters, and in the linear model context it is done through some kind of penalty on the MSE: an extra term is summed to the MSE, which directly impacts the determination of the parameters θ while gradient descent runs. These penalty terms can be defined in different ways, as we will see right below.

Ridge regression

This guy applies an ℓ2 penalty: imagining the parameters as a vector, this is the squared norm of that vector, i.e., the dot product of the parameter vector with itself. So ridge regression is defined as:

J(θ) = MSE(θ) + α ∑_{i=1}^{n} θ_i^2

The main characteristic of this regularization is that it shrinks the θ parameter values, which makes the model less sensitive to data noise. The greater the hyperparameter α, the less sensitive to noise the model will be:

Reducing the sensitivity to noise is excellent, mainly during training, because GD performs much faster when the model is less sensitive to data noise, but it should be used with parsimony: you don't want to underfit your model.
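A minimal scikit-learn sketch; the toy data and the alpha value are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data: y = 4 + 3x plus noise (made up for illustration)
X = 2 * np.random.rand(100, 1)
y = (4 + 3 * X + np.random.randn(100, 1)).ravel()

ridge_reg = Ridge(alpha=1.0)   # alpha controls the strength of the l2 penalty
ridge_reg.fit(X, y)
print(ridge_reg.intercept_, ridge_reg.coef_)
```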

Lasso regression

Lasso regression applies an ℓ1 penalty: the sum of the absolute values of each parameter in the parameter vector. So it is defined as:

J(θ) = MSE(θ) + α ∑_{i=1}^{n} |θ_i|

This regularization performs an automatic feature selection, pushing the parameters of irrelevant features down to 0; the greater α is, the more aggressive this filtering will be, and an exaggerated value of α can make the model underfit.
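A minimal sketch under the same toy data as the ridge example:

```python
from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)   # larger alpha drives more coefficients to exactly 0
lasso_reg.fit(X, y)            # reusing the toy X, y from the ridge snippet
print(lasso_reg.intercept_, lasso_reg.coef_)
```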

Elastic net

This regularization is a combination of lasso and ridge regression, with the addition of a new hyperparameter r (the mix ratio). When r = 0, elastic net is equivalent to ridge regression; when r = 1, it is equivalent to lasso regression:

J(θ) = MSE(θ) + r α ∑_{i=1}^{n} |θ_i| + ((1 − r)/2) α ∑_{i=1}^{n} θ_i^2

We would prefer elastic net when lasso may behave erratically, which happens when the number of features is greater than the number of training instances or when several features are strongly correlated.
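A minimal sketch; scikit-learn's l1_ratio parameter plays the role of r:

```python
from sklearn.linear_model import ElasticNet

# l1_ratio is r: 0 -> pure ridge, 1 -> pure lasso
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)          # reusing the toy X, y from the ridge snippet
print(elastic_net.intercept_, elastic_net.coef_)
```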

Classification models based on linear models

The models we will discuss here are not linear by themselves, but they use linear regression as the base of their algorithms. This means that both models start with a linear regression step. For more details, see right below.

Logistic regression

As said previously, this classification algorithm starts with a linear model:

t = θ^T x

This guy will serve as the input of a sigmoid function, which in turn is used to estimate a probability:

p̂ = σ(t) = 1 / (1 + e^{−t})

The function's graph may clarify how it behaves:

This function varies between 0 and 1, exactly the range any probability should lie in, so it is used to estimate the probability that an instance belongs to a certain class. In this manner, here is the rule to classify an instance:

ŷ = 0 if p̂ < 0.5; ŷ = 1 if p̂ ≥ 0.5

Note that this only serves binary classifications. If you want multiclass classification, you will need several models like this one; this solution is preferred when you want to build a classification out of a combination of binary, non-exclusive classifications. For example, when you have a picture and want to classify it as daytime/nighttime and outdoor/indoor.

This model is usually trained using some GD algorithm, but the cost function is not the MSE; it is something that makes more sense for binary classification. When the model misclassifies a positive instance (1), the desired behavior is that the closer p̂ is to 0, the greater the error should be. When the instance belongs to class 0, the logic must be the opposite: the closer p̂ is to 1, the greater the error, and the closer it is to 0, the lower the error. The log function does this pretty well, so a good way to measure the error of a single instance can be seen below:

c(θ) = −log(p̂) if y = 1; c(θ) = −log(1 − p̂) if y = 0

We must have a single cost function for the whole training set, so a good way is to average how well each instance was classified, which is done perfectly by this function:

J(θ) = −(1/m) ∑_{i=1}^{m} [ y^{(i)} log(p̂^{(i)}) + (1 − y^{(i)}) log(1 − p̂^{(i)}) ]
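A minimal scikit-learn sketch on a made-up binary problem:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary data: class 1 when the feature is above 1.0 (plus noise)
X = 2 * np.random.rand(200, 1)
y = ((X.ravel() + 0.1 * np.random.randn(200)) > 1.0).astype(int)

log_reg = LogisticRegression()
log_reg.fit(X, y)
print(log_reg.predict([[0.3], [1.7]]))        # predicted classes
print(log_reg.predict_proba([[0.3], [1.7]]))  # estimated probabilities
```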

SoftMax Regression

Softmax regression is a classification algorithm made for multiclass problems. It classifies between mutually exclusive classes, i.e., it assigns an instance to a single class out of a group of classes.

It starts with a linear model computing a score for each class, and the softmax function turns these scores into probabilities:

s_k(x) = x^T θ^{(k)}

σ(s(x))_k = exp(s_k(x)) / ∑_{j=1}^{K} exp(s_j(x))

Where

  • K is the number of classes.
  • s(x) is the vector containing the scores of each class for the instance x.
  • σ(s(x))_k is the estimated probability that the instance x belongs to class k, given the scores of each class for that instance.

So, using these scores, a classification is made by picking the class with the highest estimated probability:

ŷ = argmax_k σ(s(x))_k

Since the model estimates the probability that an instance belongs to some class k, the closer to 1 the estimated probability for the true class is, the lower the error should be; the log function does this very well, but a little bit of expertise is required. Given a label y_k^{(i)} provided by the dataset, we just multiply it by the log: as y_k^{(i)} is 1 if the instance belongs to class k and 0 if not, it filters out the log probabilities of the classes the instance does not belong to. We repeat this reasoning over all classes and all instances, so we calculate the error function through:

J(Θ) = −(1/m) ∑_{i=1}^{m} ∑_{k=1}^{K} y_k^{(i)} log(p̂_k^{(i)})

This is called the cross-entropy error function, and note that when K = 2 it is perfectly equivalent to the logistic regression cost.

The gradient of this function with respect to the parameter vector of class k can be calculated as:

∇_{θ^{(k)}} J(Θ) = (1/m) ∑_{i=1}^{m} (p̂_k^{(i)} − y_k^{(i)}) x^{(i)}

Then the model can be trained with some GD variant, as we saw above.
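A minimal scikit-learn sketch on the classic iris dataset; with the default lbfgs solver, LogisticRegression applies softmax to multiclass targets:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)   # 3 mutually exclusive classes

# With the lbfgs solver, multiclass targets are handled with softmax
softmax_reg = LogisticRegression(C=10, max_iter=1000)
softmax_reg.fit(X, y)

sample = X[:1]
print(softmax_reg.predict(sample))         # predicted class
print(softmax_reg.predict_proba(sample))   # one probability per class, summing to 1
```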

Well, that was a long path, but that is it for today.

I hope it has been useful for you.

See you soon. 😄
