基于图像三元组挖掘的无监督视觉表示学习

Guo Cai He; Xia Bi Liu

doi:10.11897/SP.J.1016.2018.02787

Translated title of the contribution: Unsupervised Visual Representation Learning with Image Triplets Mining

Guo Cai He, Xia Bi Liu

School of Computer Science and Technology

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)

Abstract

Feature representation is one of the key problems in computer vision: a good feature representation can improve the performance of machine learning algorithms. Deep learning is currently one of the best methods for learning visual representations. Supervised methods can provide rich features for classification and recognition algorithms. However, owing to the massive growth of visual data and the high cost of manual annotation, unsupervised learning of visual representations has gradually received more attention. This paper presents an unsupervised deep learning method, based on image triplet mining, for learning visual representations of images. The method consists of two stages: mining image triplets and learning feature representations of images. In the first stage, we construct a convolutional neural network (CNN) for binary classification and sample data from the original dataset to train it. The first class of data is obtained by data augmentation: we apply a series of visual transformations to an image, and the resulting images make up the first class. The second class is randomly sampled from the remaining images. We train the CNN on these two classes and use the properties of the softmax activation function to mine a large number of image triplets from the original image dataset. An image triplet consists of an image, an image that is similar to it, and an image that is dissimilar to it. In the second stage, we design a Triplet CNN consisting of three CNN channels that share parameters, and feed the image triplet samples into it. These image triplets provide supervisory information to the Triplet CNN for representation learning. We optimize the Triplet model with an appropriate triplet loss.
After training the Triplet CNN, we feed all images of the original dataset into the Triplet model to obtain their visual representations. The method uses no annotation information at any point in the process. To evaluate the proposed method, we apply the learned feature representations to clustering and classification on commonly used image datasets. In clustering tasks on multiple image datasets, the learned representation improves benchmark clustering algorithms by up to 15.3% on average in normalized mutual information (NMI), and it improves on the traditional visual feature mining method by about 12.7% in NMI. For classification, we use only shallow classifiers on top of the learned feature representation. We still obtain performance competitive with the best classification results on several benchmark datasets, and on other benchmark datasets we obtain the best results known to us so far. Visualizations of the features for several datasets also show that the feature representations learned by our method have good discriminability. The experimental results demonstrate that the proposed method is effective.
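
The abstract does not say which triplet loss is used, so the following is only a minimal sketch of one common choice: a hinge loss on squared Euclidean distances between the embeddings of the image, its similar image, and its dissimilar image. The function names, the distance, and the margin value of 1.0 are assumptions made for illustration, not details taken from the paper.

```python
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value; the abstract gives no margin
) -> float:
    """Hinge-style triplet loss for a single triplet of embeddings.

    The loss is zero once the dissimilar image is farther from the anchor
    than the similar image by at least `margin`; otherwise it grows with
    the size of the violation.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Well-separated triplet: the similar image is close, the dissimilar one far.
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
    # Violated triplet: the "similar" image is farther away than the dissimilar one.
    print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))
```

In a three-channel network with shared parameters, the three vectors would be the outputs of the same CNN applied to the three images of a mined triplet, and the loss would be averaged over a batch of triplets.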

Original language	Chinese (Traditional)
Pages (from-to)	2787-2803
Number of pages	17
Journal	Jisuanji Xuebao/Chinese Journal of Computers
Volume	41
Issue number	12
DOIs	https://doi.org/10.11897/SP.J.1016.2018.02787
Publication status	Published - 1 Dec 2018

Cite this

He, G. C., & Liu, X. B. (2018). 基于图像三元组挖掘的无监督视觉表示学习. Jisuanji Xuebao/Chinese Journal of Computers, 41(12), 2787-2803. https://doi.org/10.11897/SP.J.1016.2018.02787

@article{87be4a7f89134223af3b2649f650814b,

title = "基于图像三元组挖掘的无监督视觉表示学习",

keywords = "Convolutional neural networks, Deep learning, Image triplets, Unsupervised learning, Visual representation learning",

author = "He, {Guo Cai} and Liu, {Xia Bi}",

year = "2018",

month = dec,

day = "1",

doi = "10.11897/SP.J.1016.2018.02787",

language = "Chinese (Traditional)",

volume = "41",

pages = "2787--2803",

journal = "Jisuanji Xuebao/Chinese Journal of Computers",

issn = "0254-4164",

publisher = "Science Press",

number = "12",

}

TY - JOUR

T1 - 基于图像三元组挖掘的无监督视觉表示学习

AU - He, Guo Cai

AU - Liu, Xia Bi

PY - 2018/12/1

Y1 - 2018/12/1

KW - Convolutional neural networks

KW - Deep learning

KW - Image triplets

KW - Unsupervised learning

KW - Visual representation learning

UR - http://www.scopus.com/inward/record.url?scp=85062272052&partnerID=8YFLogxK

U2 - 10.11897/SP.J.1016.2018.02787

DO - 10.11897/SP.J.1016.2018.02787

M3 - Article

AN - SCOPUS:85062272052

SN - 0254-4164

VL - 41

SP - 2787

EP - 2803

JO - Jisuanji Xuebao/Chinese Journal of Computers

JF - Jisuanji Xuebao/Chinese Journal of Computers

IS - 12

ER -
