Softmax classifier and it's gradient
The function itself:
The softmax function can be seen as:
$$ S_i = \frac{e^{a_{i}}}{\sum_{k}{e^{a_{k}}}}$$
$$ S_i = \frac{e^{a_{i}}}{\sum_{k}{e^{a_{k}}}}$$
and the loss function can be seen as:
$$L_i = -log(S_i)$$
$$L_i = -log(S_i)$$
The partial derivative as taken from here :
If you do a partial derivative of $S_i$ w.r.t the $a_{i}^{th}$ element, then, remember the quotient rule:
If $$f(x) = \frac{g(x)}{h(x)} $$
Then,
$$f'(x) = \frac{g'(x) \times h(x) - h'(x)\times g(x)}{h(x)^2} $$
Here, $g(x) = e^{a_i}$, and $h(x) = \sum_{k}{e^{a_k}}$,
So $g'(x) = e^{a_i}$, when partially derived w.r.t $a_i$, and $g'(x) = 0$, when partially derived w.r.t $a_j$, where $j \neq i$, because in that case it'll be a constant.
$h'(x) = e^{a_j}$ always, because only one of all the terms in the summation will not be treated as a constant. All others will be treated as a constant, and thus they'll be diminished to zero after being derived.
Thus when $i=j$, our
$$\frac{\partial {S_i}}{\partial a_i} = \frac {e^{a_i}\times \sum - e^{a_i}\times e^{a_i}}{\sum^2} = S_i(1-S_i)$$
And when $i \neq j$
$$\frac{\partial S_i}{\partial a_j} = \frac {0 \times \sum - e^{a_i} \times e^{a_i}} {\sum^2}= -S_i \times S_j$$
Now keep these in mind. Now following this stackoverflow solution :
$$\frac{\partial L}{\partial o_i}=-\sum_ky_k\frac{\partial \log p_k}{\partial o_i}=-\sum_ky_k\frac{1}{p_k}\frac{\partial p_k}{\partial o_i}\\=-y_i(1-p_i)-\sum_{k\neq i}y_k\frac{1}{p_k}({\color{red}{-p_kp_i}})\\=-y_i(1-p_i)+\sum_{k\neq i}y_k({\color{red}{p_i}})\\=-y_i+\color{blue}{y_ip_i+\sum_{k\neq i}y_k({p_i})}\\=\color{blue}{p_i\left(\sum_ky_k\right)}-y_i=p_i-y_i$$
And thus we have our solution!
I took some time to understand this, so a good I decided to create some resource for me too look back to.