Most datasets have some number of mistakes in the $y$ labels. It can be harmful to maximize $\log p(y \mid \mathbf{x})$ when $y$ is a mistake. One way to prevent this is to explicitly model the noise on the labels. For example, we can assume that for some small constant $\epsilon$, the training set label $y$ is correct with probability $1-\epsilon$, and otherwise any of the other possible labels might be correct. This assumption is easy to incorporate into the cost function analytically, rather than by explicitly drawing noise samples. For example, label smoothing regularizes a model based on a softmax with $k$ output values by replacing the hard $0$ and $1$ classification targets with targets of $\frac{\epsilon}{k-1}$ and $1-\epsilon$, respectively. The standard cross-entropy loss may then be used with these soft targets. Maximum likelihood learning with a softmax classifier and hard targets may actually never converge: the softmax can never predict a probability of exactly $0$ or exactly $1$, so it will continue to learn larger and larger weights, making more extreme predictions forever. It is possible to prevent this scenario using other regularization strategies like weight decay. Label smoothing has the advantage of preventing the pursuit of hard probabilities without discouraging correct classification. This strategy has been used since the 1980s.
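As a concrete illustration, the sketch below constructs label-smoothed targets for a $k$-class problem and evaluates the standard cross-entropy loss on them. The function names (`smooth_labels`, `cross_entropy`) and the NumPy implementation are illustrative assumptions, not part of the text.

```python
import numpy as np

def smooth_labels(y, k, epsilon=0.1):
    # Replace each hard one-hot target with 1 - epsilon on the labeled
    # class and epsilon / (k - 1) on each of the other k - 1 classes.
    targets = np.full((len(y), k), epsilon / (k - 1))
    targets[np.arange(len(y)), y] = 1.0 - epsilon
    return targets

def cross_entropy(logits, soft_targets):
    # Numerically stable log-softmax, then the usual cross-entropy
    # averaged over the batch.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(soft_targets * log_probs).sum(axis=1).mean()

# Example: k = 3 classes, labels 0 and 2, epsilon = 0.1.
y = np.array([0, 2])
targets = smooth_labels(y, k=3, epsilon=0.1)
# targets == [[0.90, 0.05, 0.05],
#             [0.05, 0.05, 0.90]]
logits = np.array([[2.0, 0.1, -1.0],
                   [-0.5, 0.0, 3.0]])
loss = cross_entropy(logits, targets)
```

Because every target entry lies strictly between $0$ and $1$, the loss is minimized at finite logits, so the weights no longer need to grow without bound to chase a probability of exactly $0$ or $1$.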