Abstract
Feature representation is one of the key problems in computer vision: a good feature representation can improve the performance of machine learning algorithms. Deep learning is currently one of the best methods for learning visual representations. Supervised methods can provide rich features for classification and recognition algorithms. However, because of the massive growth of visual data and the high cost of manual annotation, unsupervised learning of visual representations has received increasing attention. This paper presents an unsupervised deep learning method, based on mining image triplets, for learning visual representations of images. Our method consists of two stages: mining image triplets and learning feature representations of images. In the first stage, we construct a convolutional neural network (CNN) for binary classification and sample its training data from the original dataset. The first class is obtained by data augmentation: we apply a series of visual transformations to an image, and the resulting images make up the first class. The second class is randomly sampled from the remaining images. We train the CNN on these two classes and use the characteristics of the softmax activation function to mine a large number of image triplets from the original image dataset. An image triplet consists of an image, an image similar to it, and an image dissimilar to it. In the second stage, we design a Triplet CNN consisting of three CNN channels that share parameters, and feed the image triplets into it. These triplets provide supervisory information to the Triplet CNN for representation learning. We optimize the Triplet model with an appropriate triplet loss.
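The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how the softmax output is turned into triplets or which triplet loss is used, so the probability thresholds `hi` and `lo`, the squared Euclidean distance, the hinge form of the loss, and the `margin` value are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mine_triplets(anchor_id, candidate_logits, hi=0.9, lo=0.1):
    """Build (anchor, similar, dissimilar) triplets from binary-CNN outputs.

    candidate_logits maps an image id to [logit_class1, logit_class2], where
    class 1 is the augmented-anchor class. The thresholds hi and lo are
    assumed values, not taken from the paper.
    """
    positives, negatives = [], []
    for image_id, logits in candidate_logits.items():
        p_similar = softmax(logits)[0]
        if p_similar >= hi:
            positives.append(image_id)
        elif p_similar <= lo:
            negatives.append(image_id)
    return [(anchor_id, p, n) for p in positives for n in negatives]


def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (an assumed, commonly used form)."""
    return max(
        0.0,
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + margin,
    )
```

In this sketch the loss is zero once the dissimilar image is farther from the anchor than the similar image by at least `margin`, which is the ordering the mined triplets are meant to enforce in the learned feature space.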
After the Triplet CNN has been trained, we feed all images of the original dataset into the Triplet model to obtain their visual representations. The method uses no annotation information at any point in the process. To evaluate the proposed method, we applied the learned feature representations to clustering and classification on commonly used image datasets. In clustering tasks on multiple image datasets, the learned representation improves benchmark clustering algorithms by an average of up to 15.3% in normalized mutual information (NMI), and compared with the traditional visual feature mining method, the proposed method achieves an improvement of about 12.7% in NMI. For classification, we used only shallow classifiers on top of the learned feature representation. On several benchmark datasets we still obtained performance competitive with the best reported classification results, and on other benchmark datasets we obtained the best results known to us so far. Visualizations of the features on several datasets also show that the feature representations learned by our method are highly discriminative. The experimental results demonstrate that the proposed method is effective.
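For readers unfamiliar with the clustering metric cited above, the following Python sketch computes NMI between ground-truth labels and cluster assignments. The abstract does not say which normalization the authors used; this sketch assumes normalization by the arithmetic mean of the two label entropies, and the function name is hypothetical.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_true, labels_pred):
    """NMI = 2 * I(T; P) / (H(T) + H(P)), using natural logarithms.

    The arithmetic-mean normalization is an assumption; other variants
    (for example the geometric mean) give different values.
    """
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    # Mutual information between the two labelings.
    mi = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )
    # Entropy of each labeling.
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())

    if h_true + h_pred == 0.0:
        # Both labelings put every sample in a single group.
        return 1.0
    return 2.0 * mi / (h_true + h_pred)
```

NMI is invariant to how cluster ids are named, so a clustering that reproduces the true grouping under different labels still scores 1, while a clustering that is statistically independent of the true labels scores 0.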
Translated title of the contribution | Unsupervised Visual Representation Learning with Image Triplets Mining |
---|---|
Original language | Chinese (Traditional) |
Pages (from-to) | 2787-2803 |
Number of pages | 17 |
Journal | Jisuanji Xuebao/Chinese Journal of Computers |
Volume | 41 |
Issue number | 12 |
DOIs | |
Publication status | Published - 1 Dec 2018 |