LARGE-SCALE DEEP LEARNING ON THE YFCC100M DATASET
Karl Ni, Roger Pearce, Eric Wang, Kofi Boakye, Brian Van Essen, Damian Borth, Barry Chen
Lawrence Livermore National Laboratory
Computational Engineering Division
7000 East Avenue, Livermore, CA 94550
ABSTRACT
We present a work-in-progress snapshot of learning with a 15 billion parameter deep learning network on HPC architectures applied to the largest publicly available natural image and video dataset released to date. Recent advancements in unsupervised deep neural networks suggest that scaling up such networks in both model and training dataset size can yield significant improvements in the learning of concepts at the highest layers. We train our three-layer deep neural network on the Yahoo! Flickr Creative Commons 100M dataset. The dataset comprises approximately 99.2 million images and 800,000 user-created videos from Yahoo’s Flickr image and video sharing platform. Training of our network takes eight days on 98 GPU nodes at the High Performance Computing Center at Lawrence Livermore National Laboratory. Encouraging preliminary results and future research directions are presented and discussed.
Index Terms: Deep Learning, Autoencoders, High Performance Computing
1. INTRODUCTION
The field of deep learning via stacked neural networks has received renewed interest in the last decade [1, 2, 3]. Neural networks have been shown to perform well in a wide variety of tasks, including text analysis [4], speech recognition [5, 6, 7], various classification tasks [8, 9], and most notably unsupervised and supervised feature learning on natural imagery [1, 2, 3].
Deep neural networks applied to natural images have demonstrated state-of-the-art performance in supervised object recognition tasks [10, 1] as well as in unsupervised feature learning [2, 3]. The classical approach to training neural networks for computer vision is via a large dataset of labeled data. However, sufficiently large and accurately labeled data is difficult and expensive to acquire. Motivated by this, [3] explored the application of deep neural networks in unsupervised deep learning and discovered that sufficiently large deep networks are capable of learning highly complex concept-level features at the top level without labels.
Spurred by this advancement, [2] set out to construct very large networks on the order of $10^9$ to $10^{10}$ parameters. A key advancement was the highly efficient multi-GPU architecture of their model. [2] employed a high degree of model parallelism and was able to process 10 million YouTube thumbnails in a few days of processing time on a medium-sized cluster. A notable result was the unsupervised learning of various faces, including those of humans and cats. Ultimately, improved feature learning at larger scales can improve downstream capabilities such as scene or object classification and additional unsupervised learning (e.g., via topic modeling [11] or natural language processing algorithms [12]).
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory in part under Contract W-7405-Eng-48 and in part under Contract DE-AC52-07NA27344.
In collaboration with the authors of [2], we have scaled a similar model and architecture to over 15 billion parameters on the Lawrence Livermore National Laboratory’s (LLNL) Edge High Performance Computing (HPC) system. Our long-term goal is two-fold: (1) explore at-the-limit performance of massive networks (> 10 billion parameters) and (2) train on and analyze datasets on the order of 100 million images.
As the number of network parameters grows, datasets need to be scaled accordingly to avoid overfitting the models. We take advantage of a brand-new dataset released jointly by Yahoo!, LLNL, and the International Computer Science Institute (ICSI) called the Yahoo! Flickr Creative Commons 100M (YFCC100M) dataset. The dataset is, to the authors’ knowledge, the largest single publicly available image and video dataset ever published. In addition to the raw images and video, the YFCC100M also contains metadata for each entry including locations, camera types, keywords, titles, etc. Although beyond the scope of this paper, this rich associated metadata potentially offers researchers additional avenues of semantic multi-modality learning to explore.
Working with the large-scale datasets, models, and computing architectures considered in this paper presents several daunting engineering challenges. For example, the significantly greater number of GPUs and compute nodes used in our system versus [2] creates communication issues in MPI. In addition, a typical model takes up over 40 GB of memory, making simple offline analysis tasks such as visualization challenging. Various network architectures were tested, balancing performance and computational constraints, before we arrived at our current model. Finally, as in [2], data throughput presents a bottleneck to model training. We present a novel pipeline approach to address this problem.
The rest of this paper is organized as follows. In Section 2 we give a brief overview of the YFCC100M dataset. The network architecture and computational framework employed are described in Section 3. We present preliminary results and visualizations of our network in Section 4. Finally, we summarize and discuss future research directions in Section 5.
2. OVERVIEW OF THE YFCC100M DATASET
In late June 2014, Yahoo! released the Yahoo! Flickr Creative Commons dataset (YFCC100M). This dataset consists of 100 million Flickr user-uploaded images and videos (99,206,564 images and 793,436 videos) along with their corresponding metadata including title, description, camera type, tags, and geotags when available. All of the data is under Creative Commons licensing and is freely provided to scientists for the advancement of multimedia research¹. In addition to the raw images, videos, and metadata, Yahoo!, in collaboration with ICSI and LLNL, will be computing and providing standard computer vision and audio features using LLNL’s supercomputing resources.
arXiv:1502.03409v1 [cs.LG] 11 Feb 2015
Wang et al. [13] have used YFCC100M data to build systems that associate images with more natural annotations like those found in user-generated captions. Others are interested in using the YFCC100M imagery and audio to geolocate where a photo or video was taken [14]. In fact, the 2014 MediaEval Placing Task is using YFCC100M as the source of benchmark data [15]. We are interested in using YFCC100M as our sandbox dataset for learning image features using massive unsupervised neural networks, repeating the experiment of [3] on an order of magnitude more data and neural network parameters. In particular, we want to see what other “grandmother neurons” [3] our network would automatically learn from YFCC100M.
Table 1. Top 60 Tags in YFCC100M Images
square | iphoneography | square format | instagram app | california | travel
nikon | usa | canon | london | japan | france
nature | art | music | europe | beach | united states
england | wedding | italy | new york | canada | city
vacation | germany | party | park | water | people
uk | spain | architecture | summer | festival | nyc
taiwan | paris | san francisco | australia | winter | sky
snow | concert | night | family | china | museum
food | street | live | washington | landscape | flower
sunset | photo | flowers | holiday | trip | photography
The 99,206,564 images were created and posted by 578,268 different Flickr users. 76%, 20%, and 4% of the images have titles, auto-titles, or no titles, respectively. The average number of words per title is 3.08. 32% of the images have descriptions, with an average of 22.52 words per description. Finally, 69% of the images have tags, with an average of 7.07 tags per image. The top 60 tags are shown in Table 1. In Fig. 1 we show example images and associated metadata for several YFCC100M images.
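As a quick consistency check on these figures, the following Python sketch converts the reported percentages into approximate image counts. The variable names are illustrative only, the derived counts are approximate because the percentages in the text are rounded, and the 69% figure is read here as the share of images that carry at least one tag.

```python
# Counts reported in the text.
NUM_IMAGES = 99_206_564
NUM_VIDEOS = 793_436
NUM_USERS = 578_268

# Rounded fractions reported in the text.
# "tagged" is an interpretation: the share of images that carry tags.
FRACTIONS = {
    "titled": 0.76,
    "auto_titled": 0.20,
    "untitled": 0.04,
    "described": 0.32,
    "tagged": 0.69,
}


def approximate_counts(num_images, fractions):
    """Convert rounded fractions into approximate image counts."""
    return {name: round(num_images * frac) for name, frac in fractions.items()}


total_items = NUM_IMAGES + NUM_VIDEOS
images_per_user = NUM_IMAGES / NUM_USERS
counts = approximate_counts(NUM_IMAGES, FRACTIONS)

if __name__ == "__main__":
    print(f"total items: {total_items:,}")
    print(f"mean images per user: {images_per_user:.1f}")
    for name, count in counts.items():
        print(f"{name}: ~{count:,}")
```

The image and video counts sum to exactly 100 million items, which matches the name of the dataset.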
3. ANALYSIS WITH LARGE SCALE NEURAL NETWORKS
3.1. Network Architecture
For the large set of image data, we employed a three-layer, large-scale deep neural network with a reconstruction independent component analysis (RICA) cost function,

\[
\min_{W,\alpha,b} \sum_i \left\| W^T (\alpha W x^{(i)}) + b - x^{(i)} \right\|_2^2 + \lambda \sqrt{(\alpha W x^{(i)})^2}
\quad \text{subject to } \|W^{(k)}\|_2 = 1, \ \forall k,
\]

where, as in [2], $W$ is a weighting matrix, $\alpha$ is a scaling value, and $x^{(i)}$ are the data points at the beginning of each layer. In addition, we introduce an offset, $b$, for increased model flexibility. The parameter $\lambda$ controls the relative sparsity, and is set to 0.1 at the first two layers and 0.01 at the final layer. Unlike [2], we do not presently include a pooling layer, as we believe the scale of the network and training data allows a similar translational invariance to be automatically learned. A particular advantage conferred by the RICA construction is that the sparseness term $\lambda \sqrt{(\alpha W x^{(i)})^2}$ can be computed in-situ with the rest of the model parameters. This is in contrast to the conventional sparse autoencoder construction that requires a second pass through the data to compute a sparseness-specific gradient contribution.

¹ Available at http://research.yahoo.com/Academic Relations
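To make the RICA cost function above concrete, here is a minimal pure-Python sketch that evaluates it on a toy problem. It is an illustration under stated assumptions, not the training code used for this work: the square root of the squared activations is applied elementwise and summed over hidden units, the sparsity term is included inside the sum over data points, and no smoothing constant is added because the text does not give one. All function names are hypothetical.

```python
import math


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    """Return the transpose of matrix M as a list of rows."""
    return [list(col) for col in zip(*M)]


def normalize_rows(W):
    """Rescale each row W^(k) to unit L2 norm (the constraint in the text)."""
    normalized = []
    for row in W:
        norm = math.sqrt(sum(w * w for w in row))
        normalized.append([w / norm for w in row])
    return normalized


def rica_cost(W, alpha, b, X, lam):
    """Reconstruction error plus sparsity penalty, summed over data points.

    ASSUMPTION: sqrt((alpha W x)^2) is taken elementwise and summed over
    hidden units, and the penalty sits inside the sum over data points.
    """
    Wt = transpose(W)
    total = 0.0
    for x in X:
        h = [alpha * a for a in matvec(W, x)]                 # alpha * W x
        recon = [r + bj for r, bj in zip(matvec(Wt, h), b)]   # W^T h + b
        recon_err = sum((r - xj) ** 2 for r, xj in zip(recon, x))
        sparsity = sum(math.sqrt(hk ** 2) for hk in h)
        total += recon_err + lam * sparsity
    return total


# Toy usage: after normalization W is the 2 x 2 identity, so the
# reconstruction is exact and only the sparsity term contributes.
W = normalize_rows([[1.0, 0.0], [0.0, 2.0]])
cost = rica_cost(W, alpha=1.0, b=[0.0, 0.0], X=[[1.0, -2.0]], lam=0.1)
```

In this toy case the cost is approximately 0.3, i.e. λ times the sum of absolute activations (1 + 2).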
Fig. 2 illustrates the structure of our network. The three layers are composed of two untied convolutional layers, and a third fully-connected layer. The first convolutional layer utilizes 5184 filters² of input size 16 × 16 × 3 with stride 4 and output size 4 × 4 × 24. The second layer takes 16 spatially contiguous³ 4 × 4 × 24 outputs of the first layer and connects them fully to a 4 × 4 × 24 output. The stride length of the second layer is 4. The third layer is dense, and fully connects the 62 × 62 × 24 outputs of the second layer to 4096 top-level neurons. The total number of parameters trained is 15 billion. After each layer, local contrast normalization (LCN) is applied prior to continuing onto the next layer. Though no pooling is applied, the window sizes at the next layer are large enough to incorporate spatial information from neighboring blocks.
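The 72 × 72 arrangement of the 5184 first-layer filters is consistent with sliding a 16-pixel window at stride 4 across a 300-pixel image side (the image size given in Section 4). The Python sketch below checks this arithmetic. It assumes no padding, and the helper name `num_positions` is ours rather than part of the training software.

```python
def num_positions(input_size, window, stride):
    """Number of window placements along one dimension, without padding."""
    if input_size < window:
        return 0
    return (input_size - window) // stride + 1


IMAGE_SIDE = 300   # pixels per side, from the preprocessing in Section 4
FILTER_SIDE = 16   # first-layer receptive field side
STRIDE = 4         # first-layer stride

grid_side = num_positions(IMAGE_SIDE, FILTER_SIDE, STRIDE)
num_filters = grid_side ** 2

# Sizes of the receptive fields described in the text.
inputs_per_filter = 16 * 16 * 3                  # first-layer input field
outputs_per_filter = 4 * 4 * 24                  # first-layer output block
second_layer_inputs = 16 * outputs_per_filter    # 16 contiguous blocks

if __name__ == "__main__":
    print(grid_side, num_filters)
    print(inputs_per_filter, outputs_per_filter, second_layer_inputs)
```

The sketch only verifies the grid and field sizes stated in the text; it does not attempt to reproduce the total parameter count.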
Fig. 2. Network topology of large scale, trained network. Approximately 15 billion parameters.
Fig. 3. Pipeline for semi-parallel training of sparse autoencoders from a single data source.
² Arranged in a 72 × 72 grid.
³ Arranged in a 4 × 4 grid.
Fig. 1. Examples of YFCC data and the associated metadata. Photo credits to Yahoo! users “Dougtone”, “ascaro41”, “mlaaker”, “Ingy The Wingy”, “monoprixgourmet bis”.
Training data is arranged into 99,207 data blocks of 960 images. Each data block consists of 5 mini-batches, where each mini-batch contains 192 images. Due to the scale of the data, the proposed algorithm reduces training time by employing a pipeline technique where the next layer begins training before the previous layer has finished. Analogous to the example shown in Fig. 3, after a layer L has trained on an initial set of data blocks (in our case, 1000), the next layer, L + 1, starts training. To accomplish this, two instances of the layer L are run simultaneously: one that continues training and one that uses up-to-date parameters to forward propagate data from Block 0 to the layer L + 1. The parameters of the forward-propagating layer L instance are periodically synchronized with the layer L instance that continued training. We observed that our model was not sensitive to the choice of synchronization frequency. As a rule of thumb, we wait to train layer L + 1 until the objective of layer L stabilizes, which typically occurs after approximately one million images.
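The pipelined schedule described above can be sketched as a small simulation. This is a simplified illustration, not the actual training driver: it assumes every layer processes exactly one data block per time step, and it does not model the periodic parameter synchronization between the two instances of a layer. The function name `pipeline_schedule` is hypothetical.

```python
MINI_BATCHES_PER_BLOCK = 5
MINI_BATCH_SIZE = 192
BLOCK_SIZE = MINI_BATCHES_PER_BLOCK * MINI_BATCH_SIZE  # images per data block


def pipeline_schedule(num_layers, num_blocks, warmup_blocks):
    """Yield (step, layer, block) events for a pipelined layer-wise schedule.

    Layer 0 starts at step 0. Layer l + 1 starts once layer l has trained
    on `warmup_blocks` blocks, then consumes blocks from Block 0 onward
    while layer l keeps training.
    ASSUMPTION: each active layer handles one block per step.
    """
    total_steps = num_blocks + (num_layers - 1) * warmup_blocks
    for step in range(total_steps):
        for layer in range(num_layers):
            block = step - layer * warmup_blocks
            if 0 <= block < num_blocks:
                yield step, layer, block


# Toy usage: 3 layers, 6 blocks, next layer starts after 2 blocks.
events = list(pipeline_schedule(num_layers=3, num_blocks=6, warmup_blocks=2))
start_step = {}
for step, layer, block in events:
    start_step.setdefault(layer, step)  # first step at which each layer runs
last_step = max(step for step, _, _ in events)
```

In the toy run the three layers finish after 10 steps instead of the 18 steps a strictly sequential layer-by-layer schedule would need. With the values in the text, a warm-up of 1000 blocks of 960 images corresponds to roughly one million images, which agrees with the rule of thumb above.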
3.2. HPC Architecture
To train the neural network at scale, we used 98 nodes of the Edge HPC cluster at Lawrence Livermore National Laboratory. The Edge cluster consists of 206 nodes with 12-core Intel Xeon EP X5660 processors running at 2.8 GHz. Each node has 96 GB of DRAM and a Tesla M2050 (Fermi) NVIDIA GPU with 3 GB of GDDR5. The training algorithm is model parallel as described in [2], with the nodes and GPUs processing each mini-batch across the system and distributing the model across the GPUs. Communication was provided by MPI over Mellanox QDR InfiniBand cards. The GPU accelerators were used with CUDA 5.5 and MPI-direct communication, and the operating system was a RHEL 6 derivative with a 2.6.32 kernel.
The dataset was stored in a Lustre file system with a peak bandwidth of 10 GB/s. Each mini-batch was copied from Lustre into memory and then streamed into the GPU’s memory. Each GPU is responsible for computing its section of the model parameters for the current mini-batch. Communication within the algorithm occurs when a layer’s input (or output) field spans multiple GPUs. The communication is handled by a distributed array data structure (using MPI) within the training algorithm. Global communication is minimized by using untied local receptive fields, and allowing receptive fields to be trained independently.
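A back-of-the-envelope estimate shows why distributing the model across GPUs is necessary on this hardware. The Python sketch below assumes single-precision (4-byte) parameters and an even split across the 98 GPUs; neither assumption is stated in the text, and the estimate ignores gradients, activations, and communication buffers.

```python
NUM_PARAMS = 15_000_000_000   # "15 billion" parameters, from the text
NUM_GPUS = 98                 # one Tesla M2050 per node, 98 nodes
GPU_MEMORY_GB = 3             # GDDR5 per GPU, from the text
BYTES_PER_PARAM = 4           # ASSUMPTION: single-precision floats


def per_gpu_param_gb(num_params, num_gpus, bytes_per_param):
    """Parameter storage per GPU in GB (10^9 bytes), assuming an even split."""
    return num_params * bytes_per_param / num_gpus / 1e9


total_gb = NUM_PARAMS * BYTES_PER_PARAM / 1e9
share_gb = per_gpu_param_gb(NUM_PARAMS, NUM_GPUS, BYTES_PER_PARAM)
fits_on_one_gpu = total_gb < GPU_MEMORY_GB
share_fits = share_gb < GPU_MEMORY_GB

if __name__ == "__main__":
    print(f"whole model: {total_gb:.0f} GB, fits on one GPU: {fits_on_one_gpu}")
    print(f"per-GPU share: {share_gb:.2f} GB, fits: {share_fits}")
```

Under these assumptions the parameters alone far exceed the memory of a single GPU, while an even 98-way split leaves each GPU with well under 1 GB of parameters.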
4. PRELIMINARY RESULTS
Fig. 4. Visualization of a selection of typical first layer weights. The right figure is a zoomed-in crop of the left.
Fig. 5. Top-5 stimuli of example layer 3 neurons. Images have been whitened.
We trained the network using all images from the YFCC100M dataset. Images were preprocessed as in [2], and subsequently resized to 300 × 300 pixels by first centering, then scaling the smallest dimension to 300 pixels, and finally cropping. After training all three layers, we forward propagated 2 million images through the network in order to obtain activation values for visualization. Note that in this paper, the test set is significantly noisier than the benchmark Labeled Faces in the Wild [16] and ImageNet [17] datasets considered in previous works such as [3].
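The resizing step can be illustrated with a short Python sketch that computes the scaled image size and the centered crop window. It only works out the geometry and does not resample pixels. The rounding rule and the function name `resize_and_crop_box` are our assumptions, since the text does not specify them.

```python
def resize_and_crop_box(width, height, target=300):
    """Return (scaled_width, scaled_height, crop_box) for a centered crop.

    The smaller dimension is scaled to `target` pixels with the aspect
    ratio preserved, then a centered target x target window is selected.
    crop_box is (left, top, right, bottom) in scaled-image coordinates.
    ASSUMPTION: scaled sizes are rounded to the nearest pixel.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target / min(width, height)
    scaled_w = max(target, round(width * scale))
    scaled_h = max(target, round(height * scale))
    left = (scaled_w - target) // 2
    top = (scaled_h - target) // 2
    return scaled_w, scaled_h, (left, top, left + target, top + target)


# Toy usage: a 640 x 480 landscape image and its portrait counterpart.
landscape = resize_and_crop_box(640, 480)
portrait = resize_and_crop_box(480, 640)
```

For the 640 × 480 example, the image is scaled to 400 × 300 and the crop window spans columns 50 to 350, which discards 50 pixels on each side.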
In Fig. 5, we show the top 5 stimuli for some example neurons. We observe that our network is capable of learning significant structure, identifying buildings, aircraft, text, cityscapes, and tower-like buildings, among many others. The network seems to cue in on distinctive textures such as the edges of text, sides of buildings, and the sharp edge of airplanes against the smooth gradation of the sky. Moreover, the network seems to activate on large-scale structures within an image rather than local features. We believe that a significant contributor to our network’s performance is its large size, which allows it to capture complex concepts.
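Selecting the top stimuli for a neuron amounts to ranking images by that neuron's activation. The Python sketch below shows one way to do this with the standard library. The data layout (a dictionary from image id to a list of top-layer activations) and the function name are hypothetical; at the scale of 2 million images and 4096 neurons a real implementation would likely stream activations rather than hold them all in memory.

```python
import heapq


def top_k_stimuli(activations, neuron, k=5):
    """Return the k image ids with the largest activation for one neuron.

    `activations` maps an image id to a list of top-layer activation values
    (a hypothetical layout chosen for illustration).
    """
    return heapq.nlargest(
        k, activations, key=lambda image_id: activations[image_id][neuron]
    )


# Toy usage with three images and two neurons.
toy = {"a": [0.1, 0.9], "b": [0.7, 0.2], "c": [0.4, 0.5]}
top_for_neuron_0 = top_k_stimuli(toy, neuron=0, k=2)
top_for_neuron_1 = top_k_stimuli(toy, neuron=1, k=2)
```

In the toy data, neuron 0 ranks image "b" first and "c" second, while neuron 1 ranks "a" first and "c" second.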
While our results are encouraging, we believe that significant improvements in learning can be achieved through improved network architecture and increased depth. As was demonstrated in [1], network architecture has a significant impact on the performance of deep networks. While the networks described in [3] were able to learn complex features in just three layers, our results suggest that extremely large datasets such as the YFCC100M can support (and possibly benefit from) deeper networks with improved high-level concept learning.
5. SUMMARY AND FUTURE WORK
The results discussed in this paper present a snapshot of the work in progress at Lawrence Livermore National Laboratory in scaling up deep neural networks. Such networks offer enormous potential to researchers in both supervised and unsupervised computer vision tasks, from object recognition and classification to unsupervised feature extraction.
To date, we see highly encouraging results from training our large 15 billion parameter three-layer neural network on the YFCC100M dataset in an unsupervised manner. The results suggest that the network is capable of learning highly complex concepts such as cityscapes, aircraft, buildings, and text, all without labels or other guidance. That this structure is visible upon examination is made all the more remarkable by the noisiness of our test set (taken at random from the YFCC100M dataset itself).
Future work on our networks will focus on two main thrusts: (1) improving the high-level concept learning by increasing the depth of our network, and (2) scaling our network’s width in the middle layers. On the first thrust, we aim for improved high-level summarization and scene understanding. Challenges on this front include careful tuning of parameters to combat the “vanishing gradient” problem and design of the connectivity structure of the higher-level layers to maximize learning. On the second thrust, our challenges are primarily engineering focused. Memory and message passing constraints become a serious concern, even on the large HPC systems fielded by LLNL. As we move beyond our current large neural network, we plan to explore the use of memory hierarchies for staging intermediate/input data to minimize the amount of node-to-node communication, enabling the efficient training and analysis of even larger networks.
6. ACKNOWLEDGMENTS
We would like to thank Adam Coates, Brody Huval, and Andrew Ng for providing their COTS HPC Deep Learning software and helpful advice. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
7. REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012.
[2] A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro, “Deep learning with COTS HPC,” in International Conference on Machine Learning, 2013.
[3] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng, “Building high-level features using large scale unsupervised learning,” in International Conference on Machine Learning, 2012.
[4] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in Proceedings of Workshop at ICLR, 2013.
[5] O. Abdel-Hamid, L. Deng, and D. Yu, “Exploring convolutional neural network structures and optimization techniques for speech recognition,” in Interspeech 2013, 2013.
[6] G. Hinton, L. Deng, D. Yu, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, G. Dahl, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, November 2012.
[7] H. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, 1993.
[8] D. Claudiu Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Convolutional neural network committees for handwritten character classification,” in International Conference on Document Analysis and Recognition, 2011.
[9] D. Reby, S. Lek, I. Dimopoulos, J. Joachim, J. Lauga, and S. Aulagnier, “Artificial neural networks as a classification method in the behavioural sciences,” Behavioural Processes, vol. 40, pp. 35–43, 1997.
[10] R. Uetz and S. Behnke, “Large-scale object recognition with CUDA-accelerated hierarchical neural networks,” in IEEE International Conference on Intelligent Computing and Intelligent Systems, 2009.
[11] L. Cao and L. Fei-Fei, “Spatially coherent latent topic model for concurrent object segmentation and classification,” in Proceedings of International Conference on Computer Vision, 2007.
[12] R. Socher, M. Ganjoo, C. D. Manning, and A. Y. Ng, “Zero shot learning through cross-modal transfer,” in Advances in Neural Information Processing Systems 26, 2013.
[13] J. K. Wang, F. Yan, A. Aker, and R. Gaizauskas, “A poodle or a dog? Evaluating automatic image annotation using human descriptions at different levels of granularity,” in Proceedings of the Workshop on Vision and Language, 2014.
[14] J. Hays and A. A. Efros, “im2gps: estimating geographic information from a single image,” in Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2008.
[15] J. Choi, B. Thomee, G. Friedland, L. Cao, K. Ni, D. Borth, B. Elizalde, L. Gottlieb, C. Carrano, R. Pearce, and D. Poland, “The Placing Task: A large scale geo-estimation challenge for social-media videos and images,” in 3rd ACM Multimedia Workshop on GeoTagging and Its Applications in Multimedia.
[16] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments.”
[17] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.