$$ w^* = \arg\min_w L(w) $$

Gradient Descent

$$ \frac{df(x)}{dx}=\lim_{h\rightarrow 0}\frac{f(x+h)-f(x)}{h} $$
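This limit can be approximated directly with finite differences. Below is a minimal sketch of such a numerical gradient; the loss function `loss_fn` and step size `h` are illustrative assumptions, and it uses the centered variant of the difference quotient, which is more accurate in practice than the one-sided formula above:

```python
import numpy as np

def numerical_gradient(loss_fn, w, h=1e-5):
    """Approximate dL/dw with centered finite differences.

    Slow: two loss evaluations per dimension of w, so this is
    typically used only to sanity-check an analytic gradient.
    """
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + h
        loss_plus = loss_fn(w)
        w.flat[i] = old - h
        loss_minus = loss_fn(w)
        w.flat[i] = old                              # restore original value
        grad.flat[i] = (loss_plus - loss_minus) / (2 * h)
    return grad
```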

Instead of evaluating this limit numerically for every dimension of $W$ (slow and approximate), we can derive the analytic gradient of the loss, $\nabla_W L$, in closed form.

$$ L=\frac{1}{N}\sum^N_{i=1}L_i+\lambda\sum_kW^2_k \\ L_i=\sum_{j\neq y_i}\max(0,\, s_j-s_{y_i}+1) \\ s=f(x;W)=Wx $$
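For this linear SVM (hinge) loss, the analytic gradient has a simple closed form. The sketch below computes both the loss and $\nabla_W L$ in vectorized NumPy; the array names and shapes (`W`, `X`, `y`, `reg`) are assumptions for illustration, not fixed by the notes:

```python
import numpy as np

def svm_loss(W, X, y, reg):
    """Multiclass SVM loss and analytic gradient for s = Wx.

    W: (C, D) weights, X: (N, D) data, y: (N,) integer labels,
    reg: L2 regularization strength lambda.
    """
    N = X.shape[0]
    scores = X @ W.T                               # (N, C), s = Wx per example
    correct = scores[np.arange(N), y][:, None]     # s_{y_i}, shape (N, 1)
    margins = np.maximum(0, scores - correct + 1)  # max(0, s_j - s_{y_i} + 1)
    margins[np.arange(N), y] = 0                   # keep only j != y_i terms
    loss = margins.sum() / N + reg * np.sum(W * W)

    # Each positive margin contributes +x_i to the row of class j
    # and -x_i to the row of the correct class y_i.
    mask = (margins > 0).astype(X.dtype)           # (N, C)
    mask[np.arange(N), y] = -mask.sum(axis=1)
    dW = mask.T @ X / N + 2 * reg * W              # (C, D)
    return loss, dW
```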

Batch Gradient Descent

Once we have the gradient, we can train the model with gradient descent: repeatedly step the weights in the direction of the negative gradient.

w = initialize_weights()
for t in range(num_steps):
	dw = compute_gradient(loss_fn, data, w)   # gradient of the loss over the full dataset
	w -= learning_rate * dw                   # step against the gradient
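As a self-contained toy run of this loop, the sketch below minimizes a least-squares loss $L(w)=\frac{1}{N}\sum_i (w\cdot x_i - y_i)^2$; the data, loss, and hyperparameters are made up for illustration and are not the SVM loss from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # N = 100 examples, D = 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w                             # noiseless targets

w = np.zeros(3)                            # initialize_weights()
learning_rate = 0.1
for t in range(200):                       # num_steps
    dw = 2 * X.T @ (X @ w - y) / len(X)    # analytic gradient of L(w)
    w -= learning_rate * dw                # gradient descent update

print(w)                                   # approaches true_w = [2.0, -1.0, 0.5]
```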

$$ L(W)=\frac{1}{N}\sum^N_{i=1}L_i(x_i,y_i,W)+\lambda R(W) \\ \nabla_WL(W) = \frac{1}{N}\sum^N_{i=1}\nabla_WL_i(x_i,y_i,W)+\lambda\nabla_WR(W) $$
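Both the loss and its gradient decompose into an average over the $N$ training examples plus a regularization term. A sketch of this decomposition, with hypothetical callables `grad_i` and `grad_R` standing in for $\nabla_W L_i$ and $\nabla_W R$:

```python
import numpy as np

def full_gradient(W, data, labels, reg, grad_i, grad_R):
    """Full-batch gradient: average the per-example gradients,
    then add the regularization gradient."""
    dW = np.zeros_like(W)
    for x_i, y_i in zip(data, labels):
        dW += grad_i(x_i, y_i, W)    # nabla_W L_i(x_i, y_i, W)
    dW = dW / len(data)              # (1/N) * sum_i nabla_W L_i
    dW += reg * grad_R(W)            # + lambda * nabla_W R(W)
    return dW
```

Summing over all $N$ examples for every update is expensive when $N$ is large, which motivates the stochastic variant below.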

Stochastic Gradient Descent (SGD)