Indexing Big Data in the Cloud
Me
Scott Stults
Co-Founder of OpenSource Connections
Solr / Lucene
Bash / Python / Java
Eric
Big Data
Big Data Wrangler
How?
Address a Real Project
Be Agile
Make Small Mistaeks Fast
Succeed BIG
USPTO Goals
Prototype Search UX
Prove Solr:
Scales
Integrates
Excels
Scale?
Our Approach
KISS
YAGNI
(This space intentionally left blank)
Minimal Flair
Record Everything!
Some Numbers
Doc Count 1.1 Million
Zip Files 313
Docs per Zip File 4,000
Zip File Size 75M
File Size 300M
Testing
Start some servers
Process a batch
Check the clock
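The three steps above amount to a wall-clock measurement around one batch. A minimal sketch, assuming a hypothetical `time_batch` helper (not from the talk) that wraps whatever command processes the batch; `sleep 1` is only a stand-in for real indexing work:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and report elapsed wall-clock seconds.
time_batch() {
  local start end status
  start=$(date +%s)
  "$@"                      # the batch command (placeholder for the real indexing step)
  status=$?
  end=$(date +%s)
  echo "elapsed_seconds=$((end - start))"
  return "$status"          # pass the batch's exit status through to the caller
}

# Stand-in batch: sleep replaces the real work for illustration.
time_batch sleep 1
```

Appending each result to a log file, along with the node count used, would give the raw data for comparing runs.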
start_nodes
start_nodes() {
  ec2-run-instances ami-1b814f72 \
    --block-device-mapping '/dev/sdb=snap-48adde35::true' \
    --block-device-mapping '/dev/sdi1=:10:false' \
    --block-device-mapping '/dev/sdi2=:10:false' \
    --block-device-mapping '/dev/sdi3=:20:false' \
    --instance-type m1.large \
    --key uspto-proto \
    --instance-count $MAX_NODES \
    --group default > ~/run-output
}
Gut Check
How fast can we do this?
What can we do in parallel?
Scaling
Raise our instance limit
xargs -P
GNU parallel
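Both tools fan a list of inputs out across concurrent workers. A minimal sketch using `xargs -P`, assuming the zip files sit in a local directory; the scratch directory, file names, worker count, and the `echo` stand-in for the real unzip-and-index step are all illustrative, not taken from the talk:

```shell
#!/usr/bin/env bash
# Assumption: a scratch directory with empty stand-in zip files.
workdir=$(mktemp -d)
touch "$workdir/a.zip" "$workdir/b.zip" "$workdir/c.zip"

# Run up to 4 workers at once, one file per invocation.
# The sh -c body is a placeholder for the real unzip-and-index step.
find "$workdir" -name '*.zip' -print0 |
  xargs -0 -P 4 -n 1 sh -c 'echo "processed $(basename "$1")"' _ > "$workdir/log"

# Workers finish in arbitrary order, so sort before reading the log.
sort "$workdir/log"
```

GNU parallel accepts a similar pipeline (for example `parallel -0 -j 4`), and its `-j` option plays the role of `-P` here.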
Shortcomings
SSH?
Error recovery
One Solr
Alternatives
CloudFormation
Puppet / Chef
Multiple Cores / Shards
Hadoop
Success
Victory Lap
Instances / Time
Thank You
https://github.com/sstults/patent-indexing
@scottstults
#o19s