Terabyte-scale image similarity search with Hadoop
Denis Shestakov
Hadoop Summit Europe 2014, Amsterdam, Netherlands, 02.04.2014
About me
● Big Data researcher/engineer
  ○ recent projects: large-scale image retrieval
  ○ before: web crawling
● Hadoop/MapReduce contractor
  ○ design/development/tuning of Hadoop applications
Denis Shestakov, denshe at gmail.com
linkedin: linkedin.com/in/dshestakov
Talk Outline
● Intro to image search
● Image retrieval with MapReduce
● Image indexing/searching workloads
● Hadoop tools for large joins
● Smart Hadoop configuration
● Misc & conclusions
Intro to Image Search
● Finding images given a text
  ○ dog → (images of dogs)
Intro to Image Search
● Finding images given an image
  ○ By content similarity
Image Search Applications
● Regular image search
  ○ Google Images, Bing Images, TinEye, etc.
● Product search (by image)
● Object recognition
  ○ Face, logo, vehicle, etc.
● Computer vision
● Augmented reality
● Medical imaging
● Astrophysics
Intro to Image Search
How does it work?
● Images resized to smaller size
● Then transformed to chosen feature description representation
  ○ image → set of feature descriptors (= high-dimensional vectors)
  ○ Many transformations exist
    ■ SIFT (Scale-invariant feature transform) used by us
Intro to Image Search
How does it work?
Typical: several hundred feature descriptors per image
image_id   SIFT descriptor
10011      21, 143, 5, …, 201, 186
10011      121, 14, 75, …, 20, 109
10011      37, 40, 0, …, 213, 96
...        ...
10011      81, 235, 67, …, 102, 63
Intro to Image Search
How does it work?
● Compare (e.g., by calculating Euclidean distance) feature descriptors of a query image with descriptors of images in the collection to search
● Images with ‘closest’ descriptors are similar to a query image
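The comparison step can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the system described in the talk: the 4-dimensional toy vectors, the `search` function and the one-vote-per-descriptor scheme are assumptions made for the example (real SIFT descriptors have 128 dimensions).

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(query_descriptors, collection):
    """Each query descriptor votes for the image owning its nearest descriptor.

    `collection` is a list of (image_id, descriptor) pairs.
    Returns (image_id, votes) pairs, most voted first.
    """
    votes = Counter()
    for q in query_descriptors:
        image_id, _ = min(collection, key=lambda row: euclidean(q, row[1]))
        votes[image_id] += 1
    return votes.most_common()

# Toy collection: two images with two descriptors each (hypothetical values).
collection = [
    (10011, [21, 143, 5, 201]),
    (10011, [121, 14, 75, 20]),
    (10012, [37, 40, 0, 213]),
    (10012, [81, 235, 67, 102]),
]
query = [[20, 140, 6, 200], [120, 15, 70, 22], [80, 230, 60, 100]]
ranking = search(query, collection)  # image 10011 gets 2 votes, 10012 gets 1
```

Every query descriptor is compared against every collection descriptor here, which is exactly the cost that indexing is meant to avoid.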
Intro to Image Search
Why MapReduce?
● Direct comparisons of descriptors costly even for very small collections
● Lots of approaches to ‘organize’ feature descriptors for fast search
  ○ Build an index
  ○ Index all the descriptors
  ○ At search, check query descriptors only against certain groups of descriptors
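A rough sketch of the index idea follows: descriptors are grouped by their nearest centroid, and a query descriptor is compared only against the group it falls into. The flat list of centroids, the function names and the 2-dimensional toy data are assumptions for illustration, not the index structure used in the talk.

```python
import math
from collections import defaultdict

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(vec, centroids):
    """Index of the centroid closest to `vec`."""
    return min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))

def build_index(collection, centroids):
    """Group (image_id, descriptor) pairs by their nearest centroid."""
    index = defaultdict(list)
    for image_id, desc in collection:
        index[nearest_centroid(desc, centroids)].append((image_id, desc))
    return index

def search_one(query_desc, index, centroids):
    """Compare the query only against descriptors in its own group.

    Returns (best image_id, number of descriptors actually compared).
    """
    group = index.get(nearest_centroid(query_desc, centroids), [])
    if not group:
        return None, 0
    best = min(group, key=lambda row: dist(query_desc, row[1]))
    return best[0], len(group)

centroids = [[0, 0], [100, 100]]  # hypothetical group centers
collection = [(1, [1, 2]), (2, [3, 1]), (3, [98, 99]), (4, [101, 97])]
index = build_index(collection, centroids)
match = search_one([99, 99], index, centroids)  # compares against 2 of 4 descriptors
```

The trade-off is that a true nearest neighbour sitting in a different group is missed, so the result is approximate.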
Image Retrieval with MapReduce
Why MapReduce?
● Existing approaches scale poorly
  ○ up to ~10-20 million images
● But multimedia grows exponentially
● Scaling is required …
Image Retrieval with MapReduce
Use case:
● Copyright violation detection in large image databank
  ○ >100 million images
● Searching for batch of images
  ○ Thousands of images in one query
  ○ Focus on throughput, not on response time for individual image
● SIFT features
Image Retrieval with MapReduce
Indexing images
● Generating index tree
● Clustering images into a large set of clusters (max cluster size = 5000)
  ○ Mapper input:
    ■ unsorted SIFT descriptors
    ■ index tree (loaded by every mapper)
  ○ Mapper output:
    ■ (cluster_id, SIFT)
  ○ Reducer output:
    ■ SIFTs sorted by cluster_id
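The indexing job can be imitated in plain Python to show the data flow. This is a single-process sketch under stated assumptions: the index tree is replaced by a flat dictionary of leaf centroids, `shuffle` stands in for the sort-and-group step that the framework performs, and all names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def assign_cluster(sift, index_tree):
    """Stand-in for index-tree traversal: pick the nearest leaf centroid."""
    return min(
        index_tree,
        key=lambda cid: sum((a - b) ** 2 for a, b in zip(sift, index_tree[cid])),
    )

def index_mapper(records, index_tree):
    """Emit (cluster_id, (image_id, sift)) for every input descriptor."""
    for image_id, sift in records:
        yield assign_cluster(sift, index_tree), (image_id, sift)

def shuffle(pairs):
    """Sort by key and group values, as happens between map and reduce."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def index_reducer(cluster_id, values):
    """Write out the descriptors of one cluster."""
    for image_id, sift in values:
        yield cluster_id, image_id, sift

index_tree = {0: [0, 0], 1: [100, 100]}  # hypothetical leaf centroids
records = [(7, [99, 98]), (5, [1, 2]), (7, [2, 2]), (5, [97, 101])]
indexed = [
    row
    for cid, vals in shuffle(index_mapper(records, index_tree))
    for row in index_reducer(cid, vals)
]
```

The output lists all descriptors of cluster 0 first and then those of cluster 1, which is the cluster-sorted layout the search phase reads.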
Image Retrieval with MapReduce
Searching
● Generating lookup table
  ○ indexing query SIFTs
● Finding best matches for query SIFTs
  ○ Mapper input:
    ■ sorted SIFT descriptors
    ■ lookup table (loaded by every mapper)
  ○ Mapper output:
    ■ (query-sift-id, knn of image-ids)
  ○ Reducer output:
    ■ Best votes (image-ids) for query-image-id
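A single-process sketch of the search job is given below. It assumes that the lookup table maps each cluster_id to the query SIFTs assigned to that cluster, and that a query-sift-id is a (query_image_id, n) pair; both are illustrative choices, as are the function names and toy data.

```python
import heapq
from collections import Counter, defaultdict

def sq_dist(a, b):
    """Squared Euclidean distance (enough for ranking neighbours)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search_mapper(indexed_block, lookup_table, k=1):
    """Emit (query_sift_id, [image_id, ...]) with the k nearest descriptors.

    indexed_block: {cluster_id: [(image_id, sift), ...]} for one input split.
    lookup_table:  {cluster_id: [(query_sift_id, sift), ...]}, held by every mapper.
    """
    for cluster_id, rows in indexed_block.items():
        for query_sift_id, q in lookup_table.get(cluster_id, []):
            nearest = heapq.nsmallest(k, rows, key=lambda row: sq_dist(q, row[1]))
            yield query_sift_id, [image_id for image_id, _ in nearest]

def search_reducer(mapper_output):
    """Aggregate neighbour votes per query image."""
    votes = defaultdict(Counter)
    for (query_image_id, _), image_ids in mapper_output:
        votes[query_image_id].update(image_ids)
    return {q: counter.most_common() for q, counter in votes.items()}

indexed_block = {
    0: [(5, [1, 2]), (7, [2, 2]), (9, [10, 10])],
    1: [(7, [99, 98]), (5, [97, 101])],
}
lookup_table = {
    0: [(("q1", 0), [1, 1])],
    1: [(("q1", 1), [98, 100])],
}
result = search_reducer(search_mapper(indexed_block, lookup_table, k=1))
```

Both descriptors of query image `q1` find their nearest neighbour in image 5, so image 5 is returned as the top vote.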
Image Retrieval with MapReduce
In a nutshell:
● Indexing phase
  ○ Clustering SIFTs with one-pass k-means
● Searching phase
  ○ Map-side join of clustered SIFTs and lookup table (query SIFTs)
Image search workloads
Time to discuss Hadoop specifics:
● Standard Apache Hadoop distribution, ver. 1.0.1
  ○ (!) No changes in Hadoop internals
    ■ Easy to migrate
● Around 100 nodes from Grid5000
  ○ 8/24 cores, 24/32/48GB RAM per node
  ○ capacity/performance varied
Image search workloads
Dataset:
● 110 million images
  ○ ~30 billion SIFT descriptors
  ○ 4TB
  ○ Largest reported in literature
  ○ Images resized to 150px on largest side
  ○ Also worked with a 1TB subset
  ○ Used as distracting dataset
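As a quick sanity check on these figures, the averages can be worked out directly. The calculation assumes decimal terabytes and that the 4TB holds only descriptors; neither is stated on the slide.

```python
images = 110_000_000
descriptors = 30_000_000_000
dataset_bytes = 4 * 10**12  # assumption: 4TB read as decimal terabytes

descriptors_per_image = descriptors / images       # about 273
bytes_per_descriptor = dataset_bytes / descriptors  # about 133

print(f"{descriptors_per_image:.0f} descriptors per image")
print(f"{bytes_per_descriptor:.0f} bytes per descriptor")
```

Roughly 273 descriptors per image agrees with "several hundred" per image, and about 133 bytes per descriptor is in line with a 128-dimensional SIFT vector stored at one byte per dimension plus a few bytes of identifier.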
Image search workloads
Queries:
● Query batches
  ○ Up to 250k query images in one batch
  ○ Batch includes original images and their distorted variants
    ■ Some variants are very hard to find
      ● e.g., print-crumple-scan
● Check if original images are returned as top votes
  ○ (out of scope) state-of-the-art search quality
Image search workloads
Indexing workload characteristics
● computationally-intensive (map phase)
● data-intensive (at map & reduce phases)
● large auxiliary data structure (i.e., index tree)
  ○ grows as dataset grows
  ○ e.g., 1.8GB for 110M images (4TB)
● map input < map output
● network is heavily utilized during shuffling
Image search workloads
Indexing workload
Image search workloads
Searching workload
● large auxiliary data structure (e.g., lookup table)
Image search workloads
● Basic settings:
  ○ 512MB HDFS block size
  ○ 3 replicas
  ○ 8 map slots
  ○ 2 reduce slots
● 4TB dataset:
  ○ 4 map slots
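To see what these settings imply, the sketch below writes them as Hadoop 1.x property names and estimates the number of map tasks and map "waves" for the 4TB run. It assumes one map task per HDFS block, binary units, and 100 identical nodes; the real cluster was heterogeneous, so the numbers are only indicative.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4

# Basic settings expressed with Hadoop 1.x property names.
settings = {
    "dfs.block.size": 512 * MB,
    "dfs.replication": 3,
    "mapred.tasktracker.map.tasks.maximum": 8,
    "mapred.tasktracker.reduce.tasks.maximum": 2,
}

def map_waves(dataset_bytes, block_bytes, nodes, map_slots):
    """Number of input splits and of rounds needed to run them all."""
    splits = math.ceil(dataset_bytes / block_bytes)
    waves = math.ceil(splits / (nodes * map_slots))
    return splits, waves

# 4TB run: 4 map slots per node instead of 8.
splits, waves = map_waves(4 * TB, settings["dfs.block.size"], nodes=100, map_slots=4)
print(splits, waves)
```

Under these assumptions the 4TB dataset yields 8192 map tasks, processed in 21 waves of up to 400 concurrent tasks.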
Hadoop tools for large joins
● Some workloads require all mappers to load a large data structure
  ○ Like image indexing/searching workloads
● Spreading data file across all nodes
  ○ Hadoop DistributedCache
● Not efficient if the structure is gigabytes in size
  ○ Partial solution: increase HDFS block sizes → decrease #mappers
● Another approach: multithreaded mappers
  ○ Not well documented
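The cost of loading a gigabyte-size structure in every mapper can be estimated with simple arithmetic. The sketch assumes that each map task loads its own copy of a 1.8GB structure and that there is one map task per HDFS block; the helper names are hypothetical, and 64MB is the Hadoop 1.x default block size.

```python
def aux_memory_per_node_gb(map_slots, aux_gb):
    """RAM used by auxiliary-structure copies when every slot is busy."""
    return map_slots * aux_gb

def total_aux_loads(dataset_mb, block_mb):
    """How many times the structure is loaded over the whole job."""
    return -(-dataset_mb // block_mb)  # ceiling division

dataset_mb = 4 * 1024 * 1024  # 4TB in MB, binary units assumed

mem_8_slots = aux_memory_per_node_gb(8, 1.8)   # about 14.4 GB per node
mem_4_slots = aux_memory_per_node_gb(4, 1.8)   # about 7.2 GB per node
loads_64mb = total_aux_loads(dataset_mb, 64)   # 65536 loads
loads_512mb = total_aux_loads(dataset_mb, 512) # 8192 loads
```

Larger blocks cut the number of loads eightfold, but each running mapper still holds a private copy, which is the motivation for sharing the structure between threads.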
Hadoop tools for large joins
● Multithreaded mapper spawns a configured number of threads, each thread executes a map task
● Mapper threads share the RAM
● Downsides:
  ○ synchronization when reading input
  ○ synchronization when writing output
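The sharing pattern can be illustrated with plain Python threads. This is a sketch of the idea only, not Hadoop's `MultithreadedMapper` API: the function name, the locks and the toy `map_fn` are assumptions. One copy of the auxiliary structure is shared by all threads, and locks guard the input iterator and the output list.

```python
import threading

_END = object()  # sentinel marking the end of input

def run_multithreaded_mapper(records, shared_index, map_fn, num_threads=2):
    """Apply map_fn to every record using threads that share one index."""
    source = iter(records)
    in_lock = threading.Lock()
    out_lock = threading.Lock()
    output = []

    def worker():
        while True:
            with in_lock:  # synchronized read of the next input record
                record = next(source, _END)
            if record is _END:
                return
            result = map_fn(record, shared_index)  # no lock: index is read-only
            with out_lock:  # synchronized write of the result
                output.append(result)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output

shared_index = {"a": 0, "b": 1}  # stands in for the large index tree
results = run_multithreaded_mapper(
    ["a", "b", "a", "b"], shared_index, lambda rec, idx: (idx[rec], rec)
)
```

The order of `results` depends on thread scheduling; only the two lock-protected sections serialize the threads, which corresponds to the downsides listed above.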
Hadoop tools for large joins
Indexing 4TB with 4 map slots, each running two threads
● index tree size: 1.8GB
Indexing time on 100 nodes
● 8h27min → 6h8min
Hadoop tools for large joins
● In some workloads mappers require only a part of the auxiliary data structure
  ○ I.e., relevant to the data block processed
  ○ Image searching workflow
● Approach: Hadoop MapFile
  ○ Very efficient
    ■ Big batches, >10000 query images
    ■ ~2 times faster on batches including around 25000 images
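A MapFile keeps key/value pairs sorted by key together with an index, so a reader can fetch single entries without loading everything. The sketch below imitates that access pattern in memory with `bisect`; the `SortedLookup` class, the cluster ids and the query ids are hypothetical, and no Hadoop API is used.

```python
import bisect

class SortedLookup:
    """In-memory stand-in for a sorted key/value file with keyed lookups."""

    def __init__(self, items):
        pairs = sorted(items, key=lambda kv: kv[0])
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def get(self, key):
        """Binary-search the sorted keys; return None if the key is absent."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

def lookup_for_block(block_cluster_ids, table):
    """Fetch only the lookup-table entries relevant to one data block."""
    found = {}
    for cid in block_cluster_ids:
        value = table.get(cid)
        if value is not None:
            found[cid] = value
    return found

# Lookup table: cluster_id -> ids of query SIFTs assigned to that cluster.
table = SortedLookup([(42, ["q7"]), (3, ["q1", "q2"]), (17, ["q5"])])
needed = lookup_for_block([3, 4, 17], table)
```

A mapper whose block covers clusters 3, 4 and 17 retrieves two entries and never touches the entry for cluster 42.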
Smart Hadoop configuration
Here is the problem:
● Apache Hadoop, v. 1.0.1
● Capacity/performance of nodes varied
  ○ 8/24 cores, 24-48GB RAM, etc.
● One config file (#mappers, #reducers, max. map/reduce memory, ...) for all nodes
● Issue for memory-intensive workloads!
Smart Hadoop configuration
Solution (hack):
● deploy Hadoop on all nodes with settings addressing the least equipped nodes
● create sub-cluster configuration files adjusted to better equipped nodes
  ○ substitute original config file with the new one on better equipped nodes
● restart tasktrackers with new configuration files on better equipped nodes
Call it smart deployment
● Or known under another name?
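One way to derive such per-node settings is sketched below. The sizing rule, the 3GB task heap, the 4GB reserve and the node classes are assumptions chosen for illustration; only the property name `mapred.tasktracker.map.tasks.maximum` comes from Hadoop 1.x.

```python
def slots_for_node(cores, ram_gb, task_heap_gb, reserved_gb=4, reduce_slots=2):
    """Largest number of map slots that fits both CPU and memory on a node."""
    usable_gb = ram_gb - reserved_gb - reduce_slots * task_heap_gb
    by_memory = int(usable_gb // task_heap_gb)
    return max(1, min(cores, by_memory))

# Hypothetical node classes: name -> (cores, RAM in GB).
node_classes = {"small": (8, 24), "large": (24, 48)}

slots = {
    name: slots_for_node(cores, ram, task_heap_gb=3)
    for name, (cores, ram) in node_classes.items()
}

# Step 1: the cluster-wide config targets the least equipped nodes.
baseline = min(slots.values())

# Step 2: better equipped nodes get an override file; their tasktrackers
# are then restarted with it.
overrides = {
    name: {"mapred.tasktracker.map.tasks.maximum": n}
    for name, n in slots.items()
    if n > baseline
}
```

With these example numbers the baseline is 4 map slots, and only the "large" class receives an override (12 slots).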
Smart Hadoop configuration
Indexing 1TB on 106 nodes: 75min → 65min
Conclusions
● Several directions for further optimization
● Presented techniques applicable to video and audio datasets
  ○ Given a transformation into feature vectors
  ○ Only small changes expected (e.g., new Writable)
● Hadoop smart deployment trick
● (Wanted) Best practices for Hadoop job history log analysis
Things to share
Hadoop job history logs available on request:
● Describe indexing/searching 4TB dataset
● Insights on better analysis/visualization are welcome
● Get cbmi13 example-set at http://goo.gl/e06wE
Supporting publications
Check full-texts of our publications:
● D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Indexing and searching 100M images with Map-Reduce. In Proc. ACM ICMR'13, 2013.
● D. Shestakov, D. Moise, G. Gudmundsson, L. Amsaleg. Scalable high-dimensional indexing with Hadoop. In Proc. CBMI'13, 2013.
● D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Terabyte-scale image similarity search: experience and best practice. In Proc. IEEE BigData'13, 2013.
Acknowledgements
● My colleagues at INRIA Rennes
● Aalto University
● Grid5000 infrastructure
That’s it!
Thanks!