Mozscape: NoSQL at Terabyte Scale
Phil Smith, Software Engineer
What We Do
SEO & Inbound Marketing Metrics
www.opensiteexplorer.org
Collect back links across the web
Compute metrics estimating value
Serve links and metrics with API and OSE
How We Do
Crawl the Web
~25-30 billion pages per month
20 Crawler machines
~256 MB/sec aggregate download rate
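A quick back-of-envelope check of how the crawl figures relate to each other. This is only a sketch: it assumes the aggregate rate is sustained for the whole month, a 30-day month, and 1 MB = 10^6 bytes, none of which the slides state.

```python
# Back-of-envelope arithmetic on the crawl figures.
# Assumptions (not from the slides): sustained rate, 30-day month, 1 MB = 10**6 bytes.
AGGREGATE_MB_PER_SEC = 256
CRAWLERS = 20
PAGES_PER_MONTH = (25e9, 30e9)          # low and high end of the quoted range
SECONDS_PER_MONTH = 30 * 24 * 3600

per_machine_mb_per_sec = AGGREGATE_MB_PER_SEC / CRAWLERS
bytes_per_month = AGGREGATE_MB_PER_SEC * 1e6 * SECONDS_PER_MONTH
pages_per_sec = [p / SECONDS_PER_MONTH for p in PAGES_PER_MONTH]
avg_page_kb = [bytes_per_month / p / 1e3 for p in PAGES_PER_MONTH]

print(f"per crawler:   {per_machine_mb_per_sec:.1f} MB/sec")
print(f"per month:     {bytes_per_month / 1e12:.0f} TB downloaded")
print(f"pages per sec: {pages_per_sec[0]:.0f} to {pages_per_sec[1]:.0f}")
print(f"avg page size: {avg_page_kb[1]:.1f} to {avg_page_kb[0]:.1f} KB")
```

Under these assumptions each crawler sustains 12.8 MB/sec, and the average download is roughly 22 to 27 KB per page.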
Compute Aggregates and Metrics
1:5 to 1:50 Compression Ratios
Aggregates are Parallelized Linear Scans
Communication Avoided where Possible
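One way to picture "parallelized linear scans" with little communication: each worker scans its own chunk from start to finish and returns a small partial result, and the only data exchanged is the final merge. The sketch below is illustrative, not the Mozscape code; the (source, target) link records are hypothetical, and threads stand in for what would more plausibly be separate processes or machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def scan_chunk(chunk):
    """Linear scan over one chunk: count inbound links per target."""
    counts = Counter()
    for _source, target in chunk:
        counts[target] += 1
    return counts

def inbound_link_counts(chunks, workers=4):
    # Each chunk is scanned independently; workers never talk to each other.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(scan_chunk, chunks))
    # The only communication is merging the small per-chunk partial results.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical link records, already split into two chunks.
chunks = [
    [("a.com", "x.com"), ("b.com", "x.com")],
    [("c.com", "y.com"), ("a.com", "x.com")],
]
totals = inbound_link_counts(chunks)
print(totals.most_common())
```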
Surface with a Read-Only API
~12 TB per Release in Amazon S3
6 m2.4xlarge Instances for Cache
~28k Requests per Minute
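The slides do not say how the cache layer works. As a rough illustration of a cache sitting in front of a large read-only store, here is a minimal read-through cache with least-recently-used eviction. The eviction policy, the class name, and the `fetch` callable (a stand-in for an S3 read) are all assumptions made for the example.

```python
from collections import OrderedDict

class ReadThroughCache:
    """Keeps recently used blocks in memory; misses fall through to `fetch`."""

    def __init__(self, fetch, capacity):
        self._fetch = fetch            # hypothetical stand-in for an S3 read
        self._capacity = capacity      # maximum number of cached blocks
        self._blocks = OrderedDict()
        self.misses = 0

    def get(self, key):
        if key in self._blocks:
            self._blocks.move_to_end(key)       # mark as most recently used
            return self._blocks[key]
        self.misses += 1
        value = self._fetch(key)
        self._blocks[key] = value
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)    # evict least recently used
        return value

# A dict plays the role of the read-only backing store.
backing_store = {"chunk-0": b"aaa", "chunk-1": b"bbb", "chunk-2": b"ccc"}
cache = ReadThroughCache(backing_store.__getitem__, capacity=2)

for key in ["chunk-0", "chunk-1", "chunk-0", "chunk-2", "chunk-1"]:
    cache.get(key)
print("misses:", cache.misses)
```

Because the data is read-only within a release, a cache like this never needs invalidation; entries only leave when they are evicted.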
Observations and Strategy
Billions of Small, Similar Records
De-normalization Avoids Complex Joins
Batch-style Emphasizes Spatial Locality
Data Layout
Column-Orientation exploits Locality
Broken into 5GB chunks for S3
~64KB Compression Runs within each chunk
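To make the layout concrete, the sketch below turns row-style records into one array per column and cuts a serialized column into ~64KB runs. The field names and the fixed-width integer encoding are invented for illustration, and the slides do not say whether ~64KB refers to compressed or uncompressed size; this example splits uncompressed bytes.

```python
import struct

RUN_SIZE = 64 * 1024   # ~64KB, per the slide

def to_columns(records, fields):
    """Turn a list of row dicts into one list of values per column."""
    return {f: [r[f] for r in records] for f in fields}

def split_runs(column_bytes, run_size=RUN_SIZE):
    """Cut one serialized column into fixed-size runs."""
    return [column_bytes[i:i + run_size]
            for i in range(0, len(column_bytes), run_size)]

# Hypothetical records: billions of small, similar rows in the real system.
records = [{"page_id": i, "status": 200, "links": i % 7} for i in range(100_000)]
columns = to_columns(records, ["page_id", "status", "links"])

# Values of one column now sit next to each other, so a scan that needs
# only page_id reads only page_id bytes.
page_id_bytes = struct.pack(f"<{len(columns['page_id'])}I", *columns["page_id"])
runs = split_runs(page_id_bytes)
print(len(page_id_bytes), "bytes in", len(runs), "runs")
```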
Compression
Tuned to Overcome Disk Read Bound
By-Column, Run & Gap Encoding on LZO
Customized Pipelines per Column
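A small sketch of what a per-column pipeline could look like: a column-specific integer transform (gap encoding for sorted IDs, run-length encoding for repetitive values) followed by a general-purpose byte compressor. LZO is not in the Python standard library, so `zlib` is swapped in here purely as a stand-in. The column names and the `PIPELINES` table are hypothetical.

```python
import struct
import zlib

def gap_encode(sorted_ids):
    """Store each ID as the difference from its predecessor."""
    prev, gaps = 0, []
    for value in sorted_ids:
        gaps.append(value - prev)
        prev = value
    return gaps

def gap_decode(gaps):
    total, ids = 0, []
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids

def run_encode(values):
    """Collapse repeats into (value, count) pairs, flattened into one list."""
    pairs = []
    for v in values:
        if pairs and pairs[-1][0] == v:
            pairs[-1][1] += 1
        else:
            pairs.append([v, 1])
    return [x for pair in pairs for x in pair]

def run_decode(flat):
    out = []
    for value, count in zip(flat[0::2], flat[1::2]):
        out.extend([value] * count)
    return out

def pack(ints):
    return struct.pack(f"<{len(ints)}I", *ints)

def unpack(data):
    return list(struct.unpack(f"<{len(data) // 4}I", data))

# One (encode, decode) pair per column; names are illustrative only.
PIPELINES = {
    "page_id": (gap_encode, gap_decode),
    "status": (run_encode, run_decode),
}

def compress_column(name, values):
    encode, _ = PIPELINES[name]
    return zlib.compress(pack(encode(values)))   # zlib stands in for LZO

def decompress_column(name, blob):
    _, decode = PIPELINES[name]
    return decode(unpack(zlib.decompress(blob)))

page_ids = list(range(1000, 200_000, 3))
statuses = [200] * 50_000 + [404] * 500 + [200] * 10_000
for name, values in (("page_id", page_ids), ("status", statuses)):
    blob = compress_column(name, values)
    assert decompress_column(name, blob) == values
    print(name, len(pack(values)), "->", len(blob), "bytes")
```

The transforms matter because they turn each column into something the byte compressor handles well: sorted IDs become long stretches of identical small gaps, and repetitive values shrink to a handful of pairs.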
Job Control
Each Stage has Parallel, Idempotent Tasks
Tasks are Processes with a simple Command Line
stdout and exit code are logged to track state
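The points above can be sketched as a tiny task runner: each task is a child process started from a command line, its stdout and exit code are written to a log, and the log is what tells a later run whether the task already succeeded. The log format, file naming, and `run_task` helper are assumptions for the example, not the actual job-control system.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

def run_task(name, argv, log_dir):
    """Run one task as a child process and log its stdout and exit code."""
    log_path = Path(log_dir) / f"{name}.json"
    if log_path.exists() and json.loads(log_path.read_text())["exit_code"] == 0:
        # Already succeeded. Re-running would also be safe because tasks
        # are idempotent; skipping just saves the work.
        return 0
    proc = subprocess.run(argv, capture_output=True, text=True)
    log_path.write_text(json.dumps({
        "argv": argv,
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
    }))
    return proc.returncode

log_dir = tempfile.mkdtemp()
# A trivial stand-in for a real task's command line.
task = [sys.executable, "-c", "print('scanned chunk 0')"]
first = run_task("scan-0", task, log_dir)
second = run_task("scan-0", task, log_dir)   # skipped: the log shows success
print(first, second)
```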
Checkpoints
[Diagram: a time axis showing Table Scan stages separated by Barriers, with a Checkpoint written to S3.]
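Only the diagram labels survive here, so the following is one plausible reading rather than the system's actual behavior: every task in a stage must finish before the next stage starts (the barrier), and a marker written after each stage (the checkpoint) lets a restarted job resume from the last completed stage. A local directory stands in for S3, and the stage names and `run_stages` helper are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def run_stages(stages, checkpoint_dir):
    """Run stages in order; each stage is a (name, [task callables]) pair."""
    executed = []
    for name, tasks in stages:
        marker = Path(checkpoint_dir) / f"{name}.done"
        if marker.exists():
            continue                     # resume past a completed stage
        with ThreadPoolExecutor() as pool:
            # Barrier: collecting every result blocks until all tasks in
            # the stage finish, and re-raises if any task failed.
            for future in [pool.submit(task) for task in tasks]:
                future.result()
        marker.write_text("ok")          # checkpoint (local stand-in for S3)
        executed.append(name)
    return executed

results = []
stages = [
    ("scan-1", [lambda i=i: results.append(("scan-1", i)) for i in range(4)]),
    ("scan-2", [lambda i=i: results.append(("scan-2", i)) for i in range(4)]),
]
checkpoints = tempfile.mkdtemp()
first_run = run_stages(stages, checkpoints)    # runs both stages
second_run = run_stages(stages, checkpoints)   # nothing left to do
print(first_run, second_run)
```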
Indexing
Columns have BDBs indexing by ID
Subset of IDs map to Compression Runs
Decompress Run and Scan to find Record
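The lookup path described above amounts to a sparse index: only some IDs are indexed, each pointing at a compression run, and the record is found by decompressing that run and scanning it. In this sketch a sorted list with `bisect` stands in for the Berkeley DB index and `zlib` stands in for the column compression; the record format and `RECORDS_PER_RUN` are invented for illustration.

```python
import bisect
import struct
import zlib

RECORDS_PER_RUN = 1000   # hypothetical; the slides size runs in bytes instead

def build(records):
    """records: (id, value) pairs sorted by id. Returns (first_ids, runs)."""
    first_ids, runs = [], []
    for start in range(0, len(records), RECORDS_PER_RUN):
        batch = records[start:start + RECORDS_PER_RUN]
        first_ids.append(batch[0][0])    # only the first ID of each run is indexed
        raw = b"".join(struct.pack("<QI", i, v) for i, v in batch)
        runs.append(zlib.compress(raw))
    return first_ids, runs

def lookup(first_ids, runs, wanted_id):
    # Find the last run whose first ID is <= wanted_id.
    pos = bisect.bisect_right(first_ids, wanted_id) - 1
    if pos < 0:
        return None
    # Decompress just that run and scan it for the record.
    raw = zlib.decompress(runs[pos])
    for record_id, value in struct.iter_unpack("<QI", raw):
        if record_id == wanted_id:
            return value
        if record_id > wanted_id:
            break                        # sorted, so the ID is not present
    return None

records = [(i * 10, i % 256) for i in range(5000)]
index, runs = build(records)
print(len(index), "index entries for", len(records), "records")
print(lookup(index, runs, 12340), lookup(index, runs, 12345))
```

The trade-off is a small index (one entry per run rather than one per record) in exchange for a bounded amount of decompression and scanning on every lookup.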
Physical Deployment
Crawlers run in Colo for white-listed IPs
Batch Process and API layer in EC2
The API could be in a colo too, but ELB + Autoscaling are nice
Questions?
We’re Hiring!