Model Training

What we have discussed so far is how to build the model. Technically, our model can already take an input and compute an output, but it cannot yet learn from data. In this chapter, we will discuss how to train the model.

Loss Function

There is a famous saying, often attributed to Lord Kelvin: "If you cannot measure it, you cannot improve it." In the context of machine learning, we can say: "If I cannot compute the loss, I cannot optimize the model." The loss function measures how far the model's predictions are from the targets; the lower the loss, the better the model fits the data.
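As a concrete illustration, here is a minimal sketch of computing a loss, assuming a PyTorch-style workflow (the prediction and target values below are made up for the example):

```python
import torch
import torch.nn as nn

# Hypothetical model outputs and ground-truth targets.
predictions = torch.tensor([2.5, 0.0, 2.1])
targets = torch.tensor([3.0, -0.5, 2.0])

# Mean squared error: the average of (prediction - target)^2.
loss_fn = nn.MSELoss()
loss = loss_fn(predictions, targets)
print(loss.item())  # A single scalar measuring how far off the model is.
```

The key point is that the loss reduces the model's performance to a single number, which is exactly what we need in order to improve it.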

Optimizer

In modern deep learning, models are improved using gradient information. But how do we use the gradient information to update the model? This is where the optimizer comes in. Many optimizers exist, each with its own settings and its own strategy for updating the model. Different optimizers, or different settings of the same optimizer, can lead to very different training results; in bad cases, training may fail to converge or even diverge. This is why deep learning frameworks provide so many optimizers. In fact, in the days before the LLM era, "tuning the model" usually started with choosing a good optimizer and setting its parameters.
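To make this concrete, here is a sketch of a single training step, again assuming PyTorch (the model, data, and learning rate here are hypothetical placeholders):

```python
import torch
import torch.nn as nn

# A hypothetical one-layer model and some dummy data for illustration.
model = nn.Linear(10, 1)
inputs = torch.randn(8, 10)
targets = torch.randn(8, 1)

loss_fn = nn.MSELoss()
# SGD is one of the simplest optimizers; its settings (here, the
# learning rate lr) strongly affect whether training converges.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step: compute the loss, backpropagate gradients,
# then let the optimizer update the parameters using those gradients.
optimizer.zero_grad()  # Clear gradients from the previous step.
loss = loss_fn(model(inputs), targets)
loss.backward()        # Compute gradients of the loss w.r.t. the parameters.
optimizer.step()       # Update the parameters using the gradients.
```

Swapping `torch.optim.SGD` for another optimizer such as `torch.optim.Adam` changes only the construction line; the step itself stays the same, which is what makes experimenting with optimizers cheap.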

Learning Rate Scheduler

Optimization is central to deep learning, and the field has produced many innovations. Among them, the learning rate scheduler is one of the most commonly used techniques.

Why do we need a learning rate scheduler? Experiments show that in the early stage of training, a larger learning rate helps the model learn faster, while in the later stage, a smaller learning rate makes training more stable. The optimizer cannot do this by itself (in later chapters, we will see why). This is where the learning rate scheduler comes in. As with optimizers, there are many learning rate schedulers to choose from.
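As a small sketch, here is how a scheduler typically wraps an optimizer in PyTorch. `StepLR` is just one example scheduler, and the `step_size` and `gamma` values are placeholders chosen for illustration:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# StepLR multiplies the learning rate by gamma every step_size epochs,
# so training starts fast and gradually becomes more stable.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... run one epoch of training with `optimizer` here ...
    scheduler.step()  # Advance the schedule once per epoch.
    if epoch % 10 == 0:
        print(epoch, scheduler.get_last_lr())
```

Note the division of labor: the optimizer decides how to apply gradients, while the scheduler only adjusts the learning rate over time.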