In the last post, we learned what an optimisation algorithm and a cost function are. In this post, we will see how Gradient Descent works and how we can use it to reduce the cost function.
To understand Gradient Descent, I want you to imagine a scenario. Imagine yourself in Kashmir. You are among the mountains. You finally pulled off the trip with your friends that you've been planning for years.
You sit there on the balcony with cold air brushing against your face. What could make this experience even better? I'd say a cup of warm coffee. So you have this cup of coffee.
Now think about this: the warmth of the coffee determines how pleasurable the experience is. If it is not warm enough, or too warm, the experience suffers.
So what does all this have to do with Gradient Descent?
Let's plot a graph of how enjoyable the coffee is against its temperature. The ideal point is the temperature at which the coffee is most pleasurable.
Now, the way gradient descent works is that we flip the graph upside down, so the peak of pleasure becomes a valley, and we try to find the lowest point on the curve, which corresponds to the minimum of the cost function.
The whole curve is unknown to us, so we do not know where this lowest point lies. To find it, we have two options: an exhaustive search, or gradient descent.
In an exhaustive search, you drink the coffee at every temperature until you find the one with the least suffering, that is, the lowest cost. This method is very reliable, but it takes a lot of time and computation, which makes it expensive. So instead we can use the other method: gradient descent.
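The exhaustive search can be sketched in a few lines of Python. Everything here is made up for illustration: the `pleasure` curve and its ideal temperature of 60 °C are invented, standing in for actually tasting the coffee.

```python
# A made-up pleasure curve that peaks at 60 °C (purely illustrative).
def pleasure(temp):
    return -(temp - 60) ** 2

# Flipping the curve upside down turns pleasure into cost (suffering).
def cost(temp):
    return -pleasure(temp)

# Exhaustive search: "taste" the coffee at every whole-degree temperature.
temps = range(0, 101)
best = min(temps, key=cost)
print(best)  # → 60
```

Note that the search had to evaluate the cost at all 101 temperatures, even though only one of them was the answer.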
In this method, imagine taking a ball and letting it roll along the curve until it settles at the lowest point. But we do not know which direction to move, towards a higher or a lower temperature, because the whole curve is unknown to us.
So we pick a starting point, decrease the temperature slightly, and check. If the coffee becomes less pleasurable when the temperature decreases, we know we have to increase it instead.
We keep increasing the temperature up to the point after which any further increase reduces the pleasure. That point is the lowest point on the flipped curve.
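This increase-and-check procedure can be written as a simple loop. It is a toy sketch, not a real training loop: the `cost` curve below is invented, with its minimum placed at 60 °C.

```python
# Made-up cost curve: suffering is lowest at 60 °C.
def cost(temp):
    return (temp - 60) ** 2

temp = 20.0   # start at some arbitrary temperature
step = 1.0    # fixed probe size

# Probe a slightly warmer temperature; keep moving while cost keeps dropping.
while cost(temp + step) < cost(temp):
    temp += step

print(temp)  # → 60.0, the lowest point of the curve
```

The loop stops exactly when one more step would make things worse, which is the stopping rule described above.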
So in Gradient Descent, we try to move toward the lowest point. But how is this faster than the exhaustive search?
Gradient descent can be faster because you don't have to check each and every point. You can take bigger steps in temperature while you know you are still far from the ideal experience. As you get close, you slow down and take smaller and smaller steps until you reach the ideal point. This method is used in deep learning algorithms to minimize the cost function.
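This big-steps-far-away, small-steps-up-close behaviour is exactly what using the gradient gives you: each step is proportional to the slope, which is steep far from the minimum and flat near it. A minimal sketch, again with an invented cost curve whose minimum sits at 60 °C (so its gradient is `2 * (temp - 60)`):

```python
# Gradient of the made-up cost curve (temp - 60) ** 2.
def gradient(temp):
    return 2 * (temp - 60)

temp = 20.0          # arbitrary starting temperature
learning_rate = 0.1  # how far we move per unit of slope

# Step opposite the slope: big steps while far from the minimum,
# automatically shrinking steps as the slope flattens near it.
for _ in range(100):
    temp -= learning_rate * gradient(temp)

print(round(temp, 2))  # → 60.0
```

Notice that no temperature grid is ever enumerated: the slope alone tells us both which direction to move and roughly how far, which is why this beats the exhaustive search.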
If you want to learn in more detail, I recommend you check out this amazing lecture by Brandon Rohrer.