ABSTRACT
Deep learning has received extensive attention in recent years. During the training of a deep learning model, the loss function is a critical objective that measures the discrepancy between the predicted values and the distribution of the real data; it is also an important indicator for evaluating model performance. The most popular loss functions in deep learning include the mean squared error (MSE) and the cross-entropy error. The choice of loss function therefore has a non-negligible influence on the optimizer. The most common optimizers include stochastic gradient descent (SGD), mini-batch stochastic gradient descent (MBGD), and adaptive moment estimation (Adam). Among them, MBGD is widely used because it balances accuracy and speed. However, setting the batch size is a challenge: if the batch size is too large, computation and memory costs increase accordingly, while with a small batch size the gradient descent process oscillates more. This paper therefore proposes an improved loss function, named truncated cross-entropy, to stabilize the convergence of the optimizer. Experiments show that the proposed method speeds up training convergence, reduces oscillation, and achieves performance similar to large-batch training with a relatively small batch size.
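To make the terms above concrete, the sketch below contrasts the standard multi-class cross-entropy with one plausible reading of a "truncated" variant that simply caps the per-sample loss. The abstract does not specify how the truncation is defined, so the cap, the threshold `tau`, and the function name `truncated_cross_entropy` are illustrative assumptions rather than the paper's actual formulation.

```python
# Minimal NumPy sketch: standard cross-entropy vs. a hypothetical truncated
# variant that caps the per-sample loss at a threshold tau.
# NOTE: the truncation rule and `tau` are assumptions for illustration only;
# the paper's definition of truncated cross-entropy is not given in the abstract.
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # Standard multi-class cross-entropy: -log p(correct class) per sample.
    probs = softmax(logits)
    n = logits.shape[0]
    return -np.log(probs[np.arange(n), labels] + 1e-12)

def truncated_cross_entropy(logits, labels, tau=4.0):
    # Hypothetical truncation: clip each per-sample loss at tau so that a few
    # hard (or mislabeled) samples in a small mini-batch cannot dominate the
    # mini-batch gradient and make the descent direction oscillate.
    return np.minimum(cross_entropy(logits, labels), tau)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(8, 10))       # mini-batch of 8 samples, 10 classes
    labels = rng.integers(0, 10, size=8)
    print("plain CE mean loss    :", cross_entropy(logits, labels).mean())
    print("truncated CE mean loss:", truncated_cross_entropy(logits, labels).mean())
```

Under this reading, capping the loss limits the gradient contribution of outlier samples, which is one way a small mini-batch could be made less prone to oscillation; it is offered only as a sketch of the general idea, not as the method described in the paper.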