Learning two-layer ReLU networks is nearly as easy as learning linear classifiers on separable data

Qiuling Yang; Alireza Sadeghi; Gang Wang; Jian Sun

doi:10.1109/TSP.2021.3094911

Corresponding author: Gang Wang

School of Automation

Research output: Contribution to journal › Article › peer-review

14 Citations (Scopus)

Abstract

Neural networks with non-linear rectified linear unit (ReLU) activation functions have demonstrated remarkable performance in many fields. It has been observed that a sufficiently wide and/or deep ReLU network can accurately fit the training data, with a small generalization error on the testing data. Nevertheless, existing analytical results on provably training ReLU networks are mostly limited to over-parameterized cases, or they require assumptions on the data distribution. In this paper, training a two-layer ReLU network for binary classification of linearly separable data is revisited. Adopting the hinge loss as classification criterion yields a non-convex objective function with infinite local minima and saddle points. Instead, a modified loss is proposed which enables (stochastic) gradient descent to attain a globally optimal solution. Enticingly, the solution found is globally optimal for the hinge loss too. In addition, an upper bound on the number of iterations required to find a global minimum is derived. To ensure generalization performance, a convex max-margin formulation for two-layer ReLU network classifiers is discussed. Connections between the sought max-margin ReLU network and the max-margin support vector machine are drawn. Finally, an algorithm-dependent theoretical quantification of the generalization performance is developed using classical compression bounds. Numerical tests using synthetic and real data validate the analytical results.

Original language	English
Article number	9477126
Pages (from-to)	4416-4427
Number of pages	12
Journal	IEEE Transactions on Signal Processing
Volume	69
DOIs	https://doi.org/10.1109/TSP.2021.3094911
Publication status	Published - 2021

Keywords

Convex loss
Finite iterations
Generalization
Global optimality
Max-margin
ReLU network

Cite this

@article{2f3f641b95924d25ae3b061dbf8d6ee8,
title = "Learning two-layer ReLU networks is nearly as easy as learning linear classifiers on separable data",
abstract = "Neural networks with non-linear rectified linear unit (ReLU) activation functions have demonstrated remarkable performance in many fields. It has been observed that a sufficiently wide and/or deep ReLU network can accurately fit the training data, with a small generalization error on the testing data. Nevertheless, existing analytical results on provably training ReLU networks are mostly limited to over-parameterized cases, or they require assumptions on the data distribution. In this paper, training a two-layer ReLU network for binary classification of linearly separable data is revisited. Adopting the hinge loss as classification criterion yields a non-convex objective function with infinite local minima and saddle points. Instead, a modified loss is proposed which enables (stochastic) gradient descent to attain a globally optimal solution. Enticingly, the solution found is globally optimal for the hinge loss too. In addition, an upper bound on the number of iterations required to find a global minimum is derived. To ensure generalization performance, a convex max-margin formulation for two-layer ReLU network classifiers is discussed. Connections between the sought max-margin ReLU network and the max-margin support vector machine are drawn. Finally, an algorithm-dependent theoretical quantification of the generalization performance is developed using classical compression bounds. Numerical tests using synthetic and real data validate the analytical results.",
keywords = "Convex loss, Finite iterations, Generalization, Global optimality, Max-margin, ReLU network",
author = "Qiuling Yang and Alireza Sadeghi and Gang Wang and Jian Sun",
note = "Publisher Copyright: {\textcopyright} 1991-2012 IEEE.",
year = "2021",
doi = "10.1109/TSP.2021.3094911",
language = "English",
volume = "69",
pages = "4416--4427",
journal = "IEEE Transactions on Signal Processing",
issn = "1053-587X",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
}

TY - JOUR
T1 - Learning two-layer ReLU networks is nearly as easy as learning linear classifiers on separable data
AU - Yang, Qiuling
AU - Sadeghi, Alireza
AU - Wang, Gang
AU - Sun, Jian
PY - 2021
Y1 - 2021
N2 - Neural networks with non-linear rectified linear unit (ReLU) activation functions have demonstrated remarkable performance in many fields. It has been observed that a sufficiently wide and/or deep ReLU network can accurately fit the training data, with a small generalization error on the testing data. Nevertheless, existing analytical results on provably training ReLU networks are mostly limited to over-parameterized cases, or they require assumptions on the data distribution. In this paper, training a two-layer ReLU network for binary classification of linearly separable data is revisited. Adopting the hinge loss as classification criterion yields a non-convex objective function with infinite local minima and saddle points. Instead, a modified loss is proposed which enables (stochastic) gradient descent to attain a globally optimal solution. Enticingly, the solution found is globally optimal for the hinge loss too. In addition, an upper bound on the number of iterations required to find a global minimum is derived. To ensure generalization performance, a convex max-margin formulation for two-layer ReLU network classifiers is discussed. Connections between the sought max-margin ReLU network and the max-margin support vector machine are drawn. Finally, an algorithm-dependent theoretical quantification of the generalization performance is developed using classical compression bounds. Numerical tests using synthetic and real data validate the analytical results.
AB - Neural networks with non-linear rectified linear unit (ReLU) activation functions have demonstrated remarkable performance in many fields. It has been observed that a sufficiently wide and/or deep ReLU network can accurately fit the training data, with a small generalization error on the testing data. Nevertheless, existing analytical results on provably training ReLU networks are mostly limited to over-parameterized cases, or they require assumptions on the data distribution. In this paper, training a two-layer ReLU network for binary classification of linearly separable data is revisited. Adopting the hinge loss as classification criterion yields a non-convex objective function with infinite local minima and saddle points. Instead, a modified loss is proposed which enables (stochastic) gradient descent to attain a globally optimal solution. Enticingly, the solution found is globally optimal for the hinge loss too. In addition, an upper bound on the number of iterations required to find a global minimum is derived. To ensure generalization performance, a convex max-margin formulation for two-layer ReLU network classifiers is discussed. Connections between the sought max-margin ReLU network and the max-margin support vector machine are drawn. Finally, an algorithm-dependent theoretical quantification of the generalization performance is developed using classical compression bounds. Numerical tests using synthetic and real data validate the analytical results.
KW - Convex loss
KW - Finite iterations
KW - Generalization
KW - Global optimality
KW - Max-margin
KW - ReLU network
UR - http://www.scopus.com/inward/record.url?scp=85113369853&partnerID=8YFLogxK
U2 - 10.1109/TSP.2021.3094911
DO - 10.1109/TSP.2021.3094911
M3 - Article
AN - SCOPUS:85113369853
SN - 1053-587X
VL - 69
SP - 4416
EP - 4427
JO - IEEE Transactions on Signal Processing
JF - IEEE Transactions on Signal Processing
M1 - 9477126
ER -
