Implementation of the Loss Function
The networks we defined in the previous section are not complete. They have inputs and outputs, so they are fine for inference (if you already know the appropriate weights), but they are not yet ready for training.
As Lord Kelvin said, "If I cannot measure it, I cannot improve it". In the context of machine learning: if I cannot compute the loss, I cannot optimize the model. The loss function measures how bad the model is. That is why we call it a loss function, and why we want to reduce it rather than increase it.
Essentially, the loss function is just an operator, or a group of operators, that takes the output of the model and the expected output and computes the difference between them. Both the inputs and the output of the loss function are tensors. Generally, the output is a scalar that represents the loss.
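To make this concrete, here is a minimal sketch of mean squared error built from element-wise operators followed by a reduction, using plain Python lists as stand-in 1-D tensors (the function name is just for illustration):

```python
def mse_loss(predicted, expected):
    """Mean squared error: a composition of operators.

    Element-wise subtract and square, then reduce to a scalar mean.
    """
    diffs = [(p - e) ** 2 for p, e in zip(predicted, expected)]  # element-wise operators
    return sum(diffs) / len(diffs)                               # reduce to a scalar

print(round(mse_loss([0.9, 0.2, 0.1], [1.0, 0.0, 0.0]), 6))  # → 0.02
```

The inputs are tensors (here, vectors of predictions and targets), and the output is a single scalar, exactly as described above.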
Whatever you think of TensorFlow's current popularity, its API design here is quite good. After building the model, TensorFlow asks users to `compile` the model. The `compile` function takes a loss function, an optimizer and a metric to build a complete model. Leaving the optimizer and metric for later chapters, let's focus on the loss function.
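A sketch of what such a `compile` step might do in a framework of our own. This is a hypothetical simplification, not TensorFlow's actual implementation: `compile` runs nothing by itself; it just attaches the three pieces needed for training to the model.

```python
class Model:
    """A hypothetical model that becomes trainable only after compile()."""

    def __init__(self):
        self.loss = None
        self.optimizer = None
        self.metrics = None

    def compile(self, loss, optimizer, metrics):
        # Attach the training components; nothing is executed here.
        self.loss = loss
        self.optimizer = optimizer
        self.metrics = metrics


model = Model()
model.compile(loss="mse", optimizer="sgd", metrics=["accuracy"])
print(model.loss, model.optimizer, model.metrics)
```

Before `compile`, the model can still run inference; after it, the training loop knows which loss to minimize, which optimizer to step, and which metrics to report.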
Although loss functions are essentially a group of operators, they still have some special requirements. Most loss functions have a parameter called `reduction`, which specifies how to reduce the output of the loss function. The most common values are `none`, `mean` and `sum`:

- `none`: no reduction is applied; the output of the loss function is a tensor with the same shape as the input.
- `mean`: the output of the loss function is a scalar, the mean of the element-wise losses.
- `sum`: the output of the loss function is a scalar, the sum of the element-wise losses.
This is because, in practice, we usually compute the loss over a batch of data. Without reduction, the output of the loss function is a tensor with the same shape as the input, and we need to reduce it to a scalar that represents the loss of the whole batch. Which reduction to use depends on the task: for example, in classification tasks we often use the mean of the per-example losses, while in regression tasks the sum is sometimes used instead.
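The three reduction modes can be sketched as follows. This is a hypothetical squared-error loss in the style of Keras or PyTorch `reduction` parameters, again using plain lists as stand-in tensors:

```python
def squared_error(predicted, expected, reduction="mean"):
    """Per-element squared error with a selectable reduction."""
    losses = [(p - e) ** 2 for p, e in zip(predicted, expected)]  # per-element loss
    if reduction == "none":
        return losses                     # same shape as the input
    if reduction == "mean":
        return sum(losses) / len(losses)  # scalar, independent of batch size
    if reduction == "sum":
        return sum(losses)                # scalar, grows with batch size
    raise ValueError(f"unknown reduction: {reduction}")


batch_pred, batch_true = [1.0, 2.0, 4.0], [1.0, 0.0, 1.0]
print(squared_error(batch_pred, batch_true, reduction="none"))  # → [0.0, 4.0, 9.0]
print(squared_error(batch_pred, batch_true, reduction="sum"))   # → 13.0
print(squared_error(batch_pred, batch_true, reduction="mean"))  # → 4.333...
```

Note that `mean` keeps the loss on the same scale regardless of batch size, which is one reason it is the most common default.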