Date posted: 20-Oct-2014
Category: Technology
Seshubabu Simhadri, Chief Technology Officer, GCE
Lucene in the Cloud: Leveraging the Power of Search and Big Data to Shed Light on Government Spending
Confidential, Do Not Disclose. Property of Global Computer Enterprises, Inc.
Overview
• Background
  − What is USASpending.gov?
• Moving to Our Big Data Cloud
• Some of the design decisions
  − Tool Selection
  − Cluster Design
  − Hardware Design
• Limitations and enhancements
What is USASpending.gov?
U.S. Government Spending vs. Other Entities
Distribution of U.S. Government Spending
What can users do on the site?
• Analytics
• Stats
• Top-K
• Free Text Search (with auto-suggestions)
• Large Data Feeds
• APIs
Who are the users of the site?
• Public
• Media
• Congress
• Value Added Resellers
GCE Big Data and Analytics Cloud
Leveraging the industry-leading open source platform to deliver cost savings and scalability within a cloud computing model
What’s Inside the GCE Cloud?
• Hadoop
  − For indexing and downloads
• Distributed Solr
  − Analytics
  − Free text search
• Drupal static content
• Visualization
Start by Looking at the Usual Suspects
Solr Node Sizing
The greatest challenge is how to optimally design a node: which combination of CPUs, memory, and shard size delivers the desired performance?
Solr Node Sizing
• Multiple index types
  − Different types of spending
  − Varying sizes
• Break the complete dataset into shards as small as required to meet the response times
• Choose shard size based on response times
• A single Solr instance with multiple cores, or multiple Solr instances each with a single core?
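The sizing rule above (pick the shard size from measured response times) can be sketched as follows. This is a minimal illustration, not GCE's procedure: the function names, the benchmark figures, and the 500 ms target are all hypothetical.

```python
import math


def choose_shard_size(measurements, target_ms):
    """Return the largest tested shard size whose latency meets the target.

    measurements maps documents-per-shard to an observed query latency in
    milliseconds. The values are assumed to come from your own benchmarks.
    """
    acceptable = [size for size, latency in measurements.items()
                  if latency <= target_ms]
    if not acceptable:
        raise ValueError("no tested shard size meets the response-time target")
    return max(acceptable)


def shard_count(total_docs, docs_per_shard):
    """Number of shards needed to hold the complete dataset."""
    return math.ceil(total_docs / docs_per_shard)


# Hypothetical benchmark results: documents per shard -> latency in ms.
measurements = {1_000_000: 120, 5_000_000: 450, 10_000_000: 1100}
size = choose_shard_size(measurements, target_ms=500)
shards = shard_count(42_000_000, size)
```

Choosing the largest size that still meets the target keeps the shard count, and so the fan-out per query, as low as the response-time requirement allows.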
Solr Cluster Design
How do you design the cluster: which ones are individual nodes and which ones are aggregators?
Solr Cluster Design
• Should all shards be treated equally?
• User → Aggregator Nodes → Shards
• Different requirements for nodes collecting the data and nodes serving a specific dataset
• Aggregator Nodes 1, 2, 3, …, m
  − Large Solr instances, no local index
• Shard Nodes 1, 2, 3, …, 100, …, n
  − Small Solr instances with index
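The User → Aggregator → Shards flow maps onto Solr's distributed search, where the receiving node fans a query out to the hosts listed in the `shards` request parameter and merges the results. The sketch below only builds such a request URL. The host names, port, and `/solr/select` path are placeholders, and the deck does not show GCE's actual request format.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_distributed_query(aggregator, shard_nodes, query, rows=10):
    """Build a query URL for an aggregator that fans out to shard nodes.

    aggregator and shard_nodes are "host:port" style strings. They are
    hypothetical here and would come from your cluster configuration.
    """
    if not shard_nodes:
        raise ValueError("at least one shard node is required")
    params = {
        "q": query,
        "rows": rows,
        "wt": "json",
        # Solr's distributed search reads a comma-separated shard list.
        "shards": ",".join(shard_nodes),
    }
    return "http://" + aggregator + "/solr/select?" + urlencode(params)


def shards_in_url(url):
    """Recover the shard list from a URL built above (useful for checks)."""
    return parse_qs(urlparse(url).query)["shards"][0].split(",")


# Placeholder hosts: one aggregator without a local index, three shard nodes.
shard_nodes = ["shard%02d.example.internal:8983/solr" % i for i in range(1, 4)]
url = build_distributed_query("aggregator01.example.internal:8983",
                              shard_nodes, "contracts")
```

Because the aggregator only merges results, it can be sized for memory and network throughput, while the shard nodes are sized for their local index.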
What configuration did we choose?
• Separate Solr instances
• Multiple hard drives per server
• Solid state disks
• InfiniBand
Solr Enhancements
• Enhanced faceting: enabling aggregation by more than one field
• Will be contributed to the Solr project
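To make "aggregation by more than one field" concrete, the sketch below groups records by a tuple of field values and sums an amount per group. It illustrates the concept in plain Python only. It is not the Solr enhancement described above, and the field names (`agency`, `state`, `amount`) are hypothetical.

```python
from collections import defaultdict


def facet_by_fields(records, fields, amount_field="amount"):
    """Sum amount_field for every distinct combination of the given fields."""
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[field] for field in fields)
        totals[key] += record[amount_field]
    return dict(totals)


# Hypothetical spending records.
records = [
    {"agency": "DOD", "state": "VA", "amount": 100.0},
    {"agency": "DOD", "state": "VA", "amount": 50.0},
    {"agency": "DOE", "state": "CA", "amount": 25.0},
]
totals = facet_by_fields(records, ["agency", "state"])
```

A single-field facet would return one total per agency. Keying on the tuple returns one total per agency and state pair.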
Solr Data Importer: Why Not?
• As the number of shards increases, managing the SQL queries inside Solr becomes a challenge
• External data importer using Hadoop
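An external importer has to decide which shard each record belongs to. The deck does not say how GCE's Hadoop-based importer does this, so the sketch below shows one common approach as an assumption: a stable hash of the record ID modulo the shard count. The function names and the `id` field are hypothetical.

```python
import zlib


def shard_for(record_id, num_shards):
    """Map a record ID to a shard index with a hash that is stable across runs."""
    return zlib.crc32(record_id.encode("utf-8")) % num_shards


def partition(records, num_shards, id_field="id"):
    """Split records into one bucket per shard, ready to index separately."""
    buckets = [[] for _ in range(num_shards)]
    for record in records:
        buckets[shard_for(record[id_field], num_shards)].append(record)
    return buckets


# Hypothetical records; each bucket would be indexed into its own shard.
records = [{"id": "award-%d" % i} for i in range(1000)]
buckets = partition(records, num_shards=8)
```

Keeping the routing rule outside Solr means one job definition covers every shard, instead of one SQL import configuration per shard.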
Utilizing Large Commodity Servers
• Solr in the Cloud required building a cost-effective, high-performance infrastructure
• Small vs. large commodity servers
Disadvantages of higher-capacity servers
• Failure of one node results in failure of multiple shards; careful design is required
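One way to approach that design problem is to keep copies of each shard on different servers, so that losing one server never removes every copy of a shard. The deck does not describe GCE's replication scheme, so the placement rule, function names, and server names below are assumptions for illustration.

```python
def place_replicas(num_shards, servers, replicas=2):
    """Assign each shard to `replicas` distinct servers, round-robin."""
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for shard in range(num_shards):
        # Consecutive offsets modulo the server count are always distinct
        # as long as replicas <= len(servers).
        placement[shard] = [servers[(shard + r) % len(servers)]
                            for r in range(replicas)]
    return placement


def shards_lost(placement, failed_server):
    """Shards with no surviving copy after one server fails."""
    return [shard for shard, hosts in placement.items()
            if all(host == failed_server for host in hosts)]


servers = ["server-a", "server-b", "server-c", "server-d"]  # hypothetical
replicated = place_replicas(8, servers, replicas=2)
unreplicated = place_replicas(8, servers, replicas=1)
```

With one copy per shard, a failure of `server-a` takes shards 0 and 4 offline. With two copies on distinct servers, the same failure leaves every shard available.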
Summary
• Sharded architecture
• Multiple Solr instances per server, each handling small datasets
• Aggregator nodes + shards
• Hadoop for data indexing and data feeds
• Large commodity servers
  − 48-core
  − 256 GB RAM
  − SSD
  − InfiniBand
Come build the future of Big Data
GCECloud.com
We’re hiring!
Questions? ssimhadri at GCECloud.com
Visit us at www.GCECloud.com