Some Notes on Label Smoothing 🌱
Most datasets have some number of mistakes in the $y$ labels. It can be harmful to maximize $\log p(y | \mathbf{x})$ when $y$ is a mistake. One way to prevent this is to explicitly model the noise on the labels. For example, we can assume that for some small constant $\epsilon$, the training set label $y$ is correct with probability $1-\epsilon$, and otherwise any of the other possible labels might be correct. This assumption is easy to incorporate into the cost function analytically, rather than by explicitly drawing noise samples.

Label smoothing is one such approach: it regularizes a model based on a softmax with $k$ output values by replacing the hard 0 and 1 classification targets with targets of $\frac{\epsilon}{k-1}$ and $1-\epsilon$, respectively. The standard cross-entropy loss may then be used with these soft targets.

Maximum likelihood learning with a softmax classifier and hard targets may actually never converge: the softmax can never predict a probability of exactly 0 or exactly 1, so it will continue to learn larger and larger weights, making more extreme predictions forever. It is possible to prevent this scenario using other regularization strategies like weight decay. Label smoothing has the advantage of preventing the pursuit of hard probabilities without discouraging correct classification. This strategy has been used since the 1980s.
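As a sketch of how this looks in practice, here is a minimal NumPy implementation. The function names, the choice of $\epsilon = 0.1$, and the toy logits are illustrative assumptions, not something from the note itself:

```python
import numpy as np

def smooth_labels(y, k, eps=0.1):
    """Replace hard one-hot targets with label-smoothed targets.

    The correct class gets probability 1 - eps; each of the other
    k - 1 classes gets eps / (k - 1), so each row still sums to 1.
    """
    targets = np.full((len(y), k), eps / (k - 1))
    targets[np.arange(len(y)), y] = 1 - eps
    return targets

def cross_entropy(logits, soft_targets):
    """Standard softmax cross-entropy, averaged over the batch."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(soft_targets * log_probs).sum(axis=1).mean()

# Tiny example: 2 samples, k = 3 classes, eps = 0.1
y = np.array([0, 2])
targets = smooth_labels(y, k=3, eps=0.1)
# targets -> [[0.9, 0.05, 0.05], [0.05, 0.05, 0.9]]
logits = np.array([[2.0, 0.1, -1.0], [0.0, 0.5, 3.0]])
print(cross_entropy(logits, targets))
```

Because the soft targets are never exactly 0 or 1, the loss is minimized at finite logits rather than by growing the weights without bound. Modern frameworks expose this directly; for example, PyTorch's `torch.nn.CrossEntropyLoss` accepts a `label_smoothing` argument that applies the same transformation to the targets.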