Saturday 25 July 2020

Gradient of a softmax classifier and its loss function

Softmax classifier and its gradient


The function itself:

The softmax function can be seen as:

$$ S_i = \frac{e^{a_{i}}}{\sum_{k}{e^{a_{k}}}}$$

and the loss function can be seen as:

$$L_i = -\log(S_i)$$
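As a quick sanity check, here is a minimal NumPy sketch of the two definitions above (the max-subtraction inside `softmax` is a standard numerical-stability trick, not part of the math; it cancels in the ratio):

```python
import numpy as np

def softmax(a):
    # Shift by the max for numerical stability; the result is unchanged
    # because e^{a_i - c} / sum_k e^{a_k - c} = e^{a_i} / sum_k e^{a_k}.
    e = np.exp(a - np.max(a))
    return e / e.sum()

def loss(a, y):
    # Cross-entropy loss L_y = -log(S_y) for the correct class index y.
    return -np.log(softmax(a)[y])

a = np.array([2.0, 1.0, 0.1])
s = softmax(a)  # entries lie in (0, 1) and sum to 1
```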

The partial derivative, as taken from here:

If you take the partial derivative of $S_i$ w.r.t. the $i^{th}$ element $a_i$, remember the quotient rule:

If $$f(x) = \frac{g(x)}{h(x)} $$

Then,

$$f'(x) = \frac{g'(x) \times h(x) - h'(x)\times g(x)}{h(x)^2} $$

Here, $g(x) = e^{a_i}$, and $h(x) = \sum_{k}{e^{a_k}}$,

So $g'(x) = e^{a_i}$ when differentiating w.r.t. $a_i$, and $g'(x) = 0$ when differentiating w.r.t. $a_j$ with $j \neq i$, because in that case $e^{a_i}$ is a constant.

When differentiating w.r.t. $a_j$, $h'(x) = e^{a_j}$, because only the $e^{a_j}$ term in the summation depends on $a_j$. All the other terms are constants, so they vanish after differentiation.

Thus when $i = j$,

$$\frac{\partial S_i}{\partial a_i} = \frac {e^{a_i}\sum_k e^{a_k} - e^{a_i}\, e^{a_i}}{\left(\sum_k e^{a_k}\right)^2} = S_i(1-S_i)$$

And when $i \neq j$,

$$\frac{\partial S_i}{\partial a_j} = \frac {0 \cdot \sum_k e^{a_k} - e^{a_j}\, e^{a_i}} {\left(\sum_k e^{a_k}\right)^2}= -S_i \times S_j$$
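The two cases together form the Jacobian $\frac{\partial S_i}{\partial a_j}$, which can be written compactly as $\mathrm{diag}(S) - S S^\top$. A small sketch that checks the analytic formulas against central finite differences (the step size `eps` and test vector `a` are arbitrary choices):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def softmax_jacobian(a):
    # Analytic Jacobian: dS_i/da_j = S_i(1 - S_i) when i == j, else -S_i S_j.
    # diag(s) - outer(s, s) encodes both cases at once.
    s = softmax(a)
    return np.diag(s) - np.outer(s, s)

a = np.array([0.5, -1.0, 2.0])
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    d = np.zeros(3)
    d[j] = eps
    # Central difference in the j-th coordinate approximates column j.
    J_num[:, j] = (softmax(a + d) - softmax(a - d)) / (2 * eps)
```

If the derivation above is right, `J_num` and `softmax_jacobian(a)` should agree to roughly the finite-difference error.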

Keep these in mind. Now, following this Stack Overflow solution (where $p_k$ plays the role of $S_k$, $o_i$ of $a_i$, and $y$ is the one-hot target vector):

$$\frac{\partial L}{\partial o_i}=-\sum_ky_k\frac{\partial \log p_k}{\partial o_i}=-\sum_ky_k\frac{1}{p_k}\frac{\partial p_k}{\partial o_i}\\=-y_i(1-p_i)-\sum_{k\neq i}y_k\frac{1}{p_k}({\color{red}{-p_kp_i}})\\=-y_i(1-p_i)+\sum_{k\neq i}y_k({\color{red}{p_i}})\\=-y_i+\color{blue}{y_ip_i+\sum_{k\neq i}y_k({p_i})}\\=\color{blue}{p_i\left(\sum_ky_k\right)}-y_i=p_i-y_i$$
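The final result $\frac{\partial L}{\partial o_i} = p_i - y_i$ is easy to verify numerically. A minimal sketch, using a one-hot target and central finite differences (the particular `a`, `y`, and `eps` values are arbitrary):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def cross_entropy(a, y):
    # y is a one-hot target vector; L = -sum_k y_k log(p_k).
    return -np.sum(y * np.log(softmax(a)))

a = np.array([1.0, 2.0, -0.5])
y = np.array([0.0, 1.0, 0.0])  # true class is index 1

grad_analytic = softmax(a) - y  # the p_i - y_i result derived above

# Central finite differences on the loss, coordinate by coordinate.
eps = 1e-6
grad_num = np.zeros(3)
for i in range(3):
    d = np.zeros(3)
    d[i] = eps
    grad_num[i] = (cross_entropy(a + d, y) - cross_entropy(a - d, y)) / (2 * eps)
```

`grad_analytic` and `grad_num` should match to within the finite-difference error, confirming that the whole softmax-plus-cross-entropy gradient collapses to the simple $p - y$ form.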

And thus we have our solution! 

I took some time to understand this, so I decided to create a resource to look back on.
