TY - JOUR
T1 - Human parsing by weak structural label
AU - Chen, Zhiyong
AU - Liu, Si
AU - Zhai, Yanlong
AU - Lin, Jia
AU - Cao, Xiaochun
AU - Yang, Liang
N1 - Publisher Copyright: © 2017, Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2018/8/1
Y1 - 2018/8/1
N2 - Human parsing, which decomposes a human-centric image into several semantic labels, e.g., face and skin, has been an active topic in recent years. Traditional human parsing methods are conducted in a supervised setting, i.e., pixel-wise labels are available during training, which requires tedious human labeling effort. In this paper, we propose a weakly supervised deep parsing method to relieve humans of this time-consuming labeling. More specifically, we train a robust human parser with structural image-level labels, e.g., “red jeans”. A structural label contains an attribute, e.g., “red”, as well as a class label, e.g., “jeans”. Our framework is based on the Fully Convolutional Network (FCN) (Pathak et al. 2014) with two critical differences. First, the pixel-wise loss function of FCN (Pathak et al. 2014) is replaced by an image-level loss that aggregates the pixel-wise predictions over the whole image in a multiple instance learning manner. Second, we develop a novel logistic pooling layer that constrains the pixels responding to the color label and those responding to the corresponding category label to be the same, thereby interpreting the structural label. Extensive experiments on a publicly available dataset (Liu et al. IEEE Trans Multimedia 16(1):253–265, 2014) show the effectiveness of the proposed method.
AB - Human parsing, which decomposes a human-centric image into several semantic labels, e.g., face and skin, has been an active topic in recent years. Traditional human parsing methods are conducted in a supervised setting, i.e., pixel-wise labels are available during training, which requires tedious human labeling effort. In this paper, we propose a weakly supervised deep parsing method to relieve humans of this time-consuming labeling. More specifically, we train a robust human parser with structural image-level labels, e.g., “red jeans”. A structural label contains an attribute, e.g., “red”, as well as a class label, e.g., “jeans”. Our framework is based on the Fully Convolutional Network (FCN) (Pathak et al. 2014) with two critical differences. First, the pixel-wise loss function of FCN (Pathak et al. 2014) is replaced by an image-level loss that aggregates the pixel-wise predictions over the whole image in a multiple instance learning manner. Second, we develop a novel logistic pooling layer that constrains the pixels responding to the color label and those responding to the corresponding category label to be the same, thereby interpreting the structural label. Extensive experiments on a publicly available dataset (Liu et al. IEEE Trans Multimedia 16(1):253–265, 2014) show the effectiveness of the proposed method.
KW - Deep learning
KW - Human parsing
UR - http://www.scopus.com/inward/record.url?scp=85035121395&partnerID=8YFLogxK
U2 - 10.1007/s11042-017-5368-4
DO - 10.1007/s11042-017-5368-4
M3 - Article
AN - SCOPUS:85035121395
SN - 1380-7501
VL - 77
SP - 19795
EP - 19809
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
IS - 15
ER -