Page 1: Case Study - How Rackspace Queries Terabytes of Data

How Rackspace Queries Terabytes of Log Data

Using MapReduce and Hadoop

Case Study, by Schubert Zhang, 2009-04-30

Page 2: Case Study - How Rackspace Queries Terabytes of Data

Rackspace
• Rackspace has more than 50K devices and 7 data centers.
• The mail system and logging servers are currently in 3 of the Rackspace data centers.
• The system stores over 800 million objects (an object = a user event, such as receiving an email or logging into IMAP) within Solr, and 9.6 billion records within Hadoop, which equals 6.3 TB compressed.
• Several hundred gigabytes of email log data are generated each day (roughly 140 GB after cleanup).

Page 3: Case Study - How Rackspace Queries Terabytes of Data

Background on Mailtrust
• Email hosting company.
• Founded in 1999 (previously named Webmail.us), merged with Rackspace in 2007.
• 80K business customers, 700K mailboxes.
• 2 hosted mail products: Noteworthy and MS Exchange.
• The Noteworthy system:
– Homegrown, Linux based: POP3, IMAP, webmail, RSS feeds, shared calendaring, Outlook sync, Blackberry sync.
– ~600 servers, commodity hardware, designed to work around frequent failures.
• The MS Exchange system:
– MAPI, POP, IMAP, OWA, Blackberry, Goodmail, ActiveSync.
– ~100 servers, higher-end hardware, SAN & DAS storage.

Page 4: Case Study - How Rackspace Queries Terabytes of Data

Problems
• Hundreds of gigabytes of new data each day, streaming in from over 600 hyperactive servers.
• The log processing system went through three stages:
– (1) Flat text files stored on each machine. They had to be manually searched by engineers logging into each individual machine.
– (2) A relational database solution (MySQL) that just couldn't compete:
• Inserts quickly became the bottleneck, with a lot of index churn.
• Data was then broken into Merge Tables based on time so index updates weren't a problem, but load and operational problems remained.
– (3) A Hadoop-based solution that works well and has virtually unlimited scalability potential, built on Hadoop, Lucene and Solr.
• The familiar problem: lots and lots of data streaming in.
– Where do you store all that data?
– How do you do anything useful with it?
– How do you retrieve the data you want from the data sea?
• The goal: examine mail logs in order to troubleshoot problems for customers, with queries/searches that are fast and accurate.

Page 5: Case Study - How Rackspace Queries Terabytes of Data

Now the New System
• The advantage of the new system is that they can now look at their data any way they want:
– Nightly MapReduce jobs collect statistics about the mail system, such as spam counts by domain, bytes transferred and number of logins.
– When they wanted to find out which part of the world their customers logged in from, a quick MapReduce job was created and they had the answer within a few hours. That is not really possible in a typical ETL system.
• "Now whenever we think of a complex question about our customers’ usage patterns, we can pull the answer from our logs within hours via MapReduce. This is powerful stuff."
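
To make the idea concrete, here is a minimal, hedged sketch of what one such nightly statistics job (logins per domain) could look like as a Hadoop MapReduce program. The tab-separated log layout, the field positions and the "LOGIN" event name are assumptions for illustration, not Rackspace's actual schema:

```java
// Sketch of a nightly "logins per domain" statistics job.
// Assumed log layout (tab-separated): timestamp, server, user@domain, event, ...
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LoginsByDomain {

  public static class LoginMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text domain = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      // Emit (domain, 1) for every login event.
      if (fields.length > 3 && "LOGIN".equals(fields[3])) {
        String user = fields[2];
        int at = user.indexOf('@');
        if (at >= 0) {
          domain.set(user.substring(at + 1));
          ctx.write(domain, ONE);
        }
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) sum += c.get();
      ctx.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "logins-by-domain");
    job.setJarByClass(LoginsByDomain.class);
    job.setMapperClass(LoginMapper.class);
    job.setCombinerClass(SumReducer.class); // pre-aggregate on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Reusing the reducer as a combiner pre-aggregates counts on each mapper, which keeps shuffle traffic small at these log volumes.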

Page 6: Case Study - How Rackspace Queries Terabytes of Data

The Platform
• Hadoop MapReduce
• Hadoop Distributed File System (HDFS)
• Lucene
• Solr
• Tomcat

Page 7: Case Study - How Rackspace Queries Terabytes of Data

The Architecture
• Raw logs get streamed from hundreds of mail servers to the Hadoop Distributed File System ("HDFS") in real time.
• MapReduce jobs are scheduled to run to index the new data using Apache Lucene and Solr.
• Once the indexes have been built, they are compressed and stored away in HDFS.
• Each Hadoop datanode runs a Tomcat servlet container, which hosts a number of Solr instances that pull and merge the new indexes and provide really fast search results to the support team.

Page 8: Case Study - How Rackspace Queries Terabytes of Data

The System Evolution: Logging v1.0
• Logs were stored in flat text files on the local disk of each mail server and were kept for 14 days.
• Our support techs did not have login access to the servers, so in order to search the logs they would have to escalate a ticket to our engineers. The engineers would then have to ssh into each mail server and grep /var/log/maillog.
• Problems: once we grew much past a dozen servers, this manual process of logging into each server became too time-consuming for our engineers.

Page 9: Case Study - How Rackspace Queries Terabytes of Data

Logging v1.1
• Sped up the search process by writing a script that would search multiple servers via one command run from a centralized server (see the sketch below).
• Under the hood it was still a remote grep.
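
A hedged sketch, in Java, of what such a centralized fan-out search could look like. The host names, the log path and the fixed thread pool are illustrative assumptions based only on the description above:

```java
// Sketch of a centralized fan-out log search: run "ssh host grep ..." against
// each mail server in parallel and print the combined matches.
// Host names and the log path are illustrative assumptions.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FanOutGrep {
  public static void main(String[] args) throws Exception {
    String pattern = args[0];
    List<String> hosts = List.of("mail01", "mail02", "mail03"); // assumed fleet
    ExecutorService pool = Executors.newFixedThreadPool(hosts.size());
    for (String host : hosts) {
      pool.submit(() -> {
        try {
          Process p = new ProcessBuilder(
              "ssh", host, "grep", "-h", pattern, "/var/log/maillog").start();
          try (BufferedReader r = new BufferedReader(
              new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
              System.out.println(host + ": " + line);
            }
          }
        } catch (Exception e) {
          System.err.println(host + ": " + e.getMessage());
        }
      });
    }
    pool.shutdown();
  }
}
```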

• Problems: The support techs still had to escalate a ticket to the engineers in order to perform a search. As the number of customers and servers increased, this began to take too much of our engineers' scarce time. Also, storing and searching the logs on a live server was negatively affecting the performance of the servers. To make matters worse, the engineering team had grown and we started running into the problem where two engineers would perform a search at the same time, which really slowed things down.

Page 10: Case Study - How Rackspace Queries Terabytes of Data

Logging v2.0
• A web-based tool where support techs could search the logs.
• It allowed searching by the sender's or recipient's email address, domain name or IP address.
• All of these were indexed fields in a MySQL database on the centralized log server (a hedged schema sketch follows this list).
• Each day's logs were stored in a separate table, so that old data could be cleaned up by simply dropping and recreating MySQL tables.
• Log data was only kept for 3 days in order to keep the MySQL database down to a reasonable size.
• Wildcard text searches (i.e. MySQL "LIKE" statements) were not allowed, because the data set was very large and these queries would have been horribly slow.
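
For illustration only, a hedged sketch of the per-day indexed table approach, driven here through JDBC. All table and column names are assumptions; the original schema was never published:

```java
// Sketch: one indexed table per day, so expiring old data is a cheap DROP.
// Column and table names are illustrative assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DailyLogTable {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://loghost/maillogs", "logs", "secret");
         Statement st = conn.createStatement()) {
      st.executeUpdate(
          "CREATE TABLE IF NOT EXISTS maillog_20090430 ("
        + "  ts DATETIME NOT NULL,"
        + "  sender VARCHAR(255),"
        + "  recipient VARCHAR(255),"
        + "  domain VARCHAR(255),"
        + "  ip VARCHAR(45),"
        + "  message TEXT,"
        + "  INDEX (sender), INDEX (recipient),"
        + "  INDEX (domain), INDEX (ip)"
        + ") ENGINE=MyISAM"); // MyISAM also enables MERGE tables later (v2.2)
      // Expiring a day of logs is a metadata operation, not a million DELETEs:
      st.executeUpdate("DROP TABLE IF EXISTS maillog_20090427");
    }
  }
}
```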

• Problems: We quickly realized that we had a bottleneck with the MySQL inserts. As the tables grew, indexing each entry as it was inserted became slow. Within the first hours of testing, the inserts began slowing and could not keep up with the rate at which data was received. Version 2.0 of the logging system was never used in production.

Page 11: Case Study - How Rackspace Queries Terabytes of Data

Logging v2.1
• Fixed the MySQL INSERT bottleneck by queuing up the log entries in local text files on the centralized log server and periodically bulk loading them into the database. As syslog-ng received logs on its 6 ports, the data would be streamed to 6 separate text files. Every 10 minutes a script would rotate those text files and execute a MySQL LOAD to load the data into the database (sketched below). This was orders of magnitude faster than inserting the log data one record at a time.
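
A hedged sketch of the 10-minute rotate-and-LOAD cycle. The paths and table name are illustrative, and it assumes syslog-ng reopens its output file after rotation:

```java
// Sketch of the periodic bulk load: rotate the queued text file, then load it
// with LOAD DATA INFILE instead of row-by-row INSERTs.
// Paths and the table name are illustrative assumptions.
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkLoader {
  public static void main(String[] args) throws Exception {
    Path live = Path.of("/var/spool/maillog/port1.log");
    Path batch = Path.of("/var/spool/maillog/port1.batch");
    Files.move(live, batch); // rotate; assumes syslog-ng reopens its file
    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://loghost/maillogs", "logs", "secret");
         Statement st = conn.createStatement()) {
      // One bulk LOAD per batch file, far cheaper than per-record INSERTs.
      st.executeUpdate(
          "LOAD DATA INFILE '" + batch + "' INTO TABLE maillog_20090430"
        + " FIELDS TERMINATED BY '\\t'");
    }
    Files.delete(batch);
  }
}
```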

• Problems: The LOADs would get progressively slower as the database grew because MySQL indexing performance decreases as the table you are inserting into gets larger. This version was fast enough to be released into production, but we knew the system would not scale too far without additional work.

Page 12: Case Study - How Rackspace Queries Terabytes of Data

Logging v2.2
• Introduced Merge Tables in order to speed up loading the log data into the database.
• Every 10 minutes the script would create a new database table and then load the text logs into the empty table.
• After the data was loaded, the script would modify a set of Merge Tables that combined all of the 10-minute tables together.
• The web search tool was modified to allow searching within different time ranges. Corresponding Merge Tables existed for each of those time ranges, and were modified every 10 minutes as new tables were created. (A hedged sketch of this rolling process follows.)
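
A hedged sketch of the 10-minute Merge Table roll. It assumes MyISAM underlying tables (which the MERGE engine requires) and illustrative table names:

```java
// Sketch of the 10-minute cycle: load a fresh 10-minute table, then redefine
// a MERGE table over the current set so searches span the whole time range.
// Table names are illustrative assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MergeTableRoller {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://loghost/maillogs", "logs", "secret");
         Statement st = conn.createStatement()) {
      // 1. New empty 10-minute table (same schema as its siblings).
      st.executeUpdate("CREATE TABLE maillog_0430_1050 LIKE maillog_template");
      // 2. Bulk load the rotated text logs into it.
      st.executeUpdate("LOAD DATA INFILE '/var/spool/maillog/batch'"
          + " INTO TABLE maillog_0430_1050");
      // 3. Point the "last hour" MERGE table at the six newest tables.
      st.executeUpdate("ALTER TABLE maillog_last_hour"
          + " UNION=(maillog_0430_1000, maillog_0430_1010, maillog_0430_1020,"
          + " maillog_0430_1030, maillog_0430_1040, maillog_0430_1050)");
    }
  }
}
```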

• Problems: the database LOAD operations would take 2-3 minutes to run, and the server was now always under heavy CPU and disk I/O load.

• Searches were being performed more frequently and were becoming slow. We started to see some strange problems such as random errors while trying to create new tables or modify the Merge Tables. These errors progressively became more frequent, resulting in missing log data. The support team began to lose confidence in the system's accuracy.

• The logging system had no redundancy.

• We needed a new solution that would be fast, reliable, and able to scale indefinitely with our growth: something truly scalable.

Page 13: Case Study - How Rackspace Queries Terabytes of Data

Logging v3+
• Avoid limiting our ability to build new features down the road. For example, we wanted to build a tool that would allow our customers to search their logs directly.
• It scales its workload out horizontally by adding servers and distributing the data and MapReduce jobs amongst them.
• In about 3 months we built a fresh new log processing system using Hadoop, Lucene and Solr.

• Put the log search tool in the hands of our customers.

Page 14: Case Study - How Rackspace Queries Terabytes of Data

Stu Hood’s Detailed Comments
• The loading of data is streaming, but the indexing is not. We write to a file in Hadoop until it reaches a size just below the block size, or until it times out, and then we close it and move it to where it will be processed. (A hedged writer sketch follows.)
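
A hedged sketch of such a rolling writer using the HDFS FileSystem API. The size threshold, the 10-minute timeout and the directory layout are illustrative assumptions:

```java
// Sketch of a rolling HDFS log writer: append incoming lines to an open HDFS
// file until it nears the block size or a timeout elapses, then close it and
// move it into the directory the indexing jobs consume.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingHdfsWriter {
  private static final long ROLL_BYTES = 60L << 20;        // stay under a 64 MB block
  private static final long ROLL_MILLIS = 10 * 60 * 1000;  // 10-minute timeout

  private final FileSystem fs;
  private Path current;
  private FSDataOutputStream out;
  private long openedAt;

  public RollingHdfsWriter(Configuration conf) throws IOException {
    this.fs = FileSystem.get(conf);
    roll();
  }

  public synchronized void write(String logLine) throws IOException {
    out.write((logLine + "\n").getBytes("UTF-8"));
    if (out.getPos() >= ROLL_BYTES
        || System.currentTimeMillis() - openedAt >= ROLL_MILLIS) {
      roll();
    }
  }

  private void roll() throws IOException {
    if (out != null) {
      out.close();
      // Hand the closed file to the indexing pipeline.
      fs.rename(current, new Path("/logs/ready/" + current.getName()));
    }
    current = new Path("/logs/open/log-" + System.currentTimeMillis());
    out = fs.create(current);
    openedAt = System.currentTimeMillis();
  }
}
```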

• Our processing jobs run every 10 minutes or so, meaning that the logs become available for Customer Care after about 15 minutes. We've executed around 150K jobs on this cluster with 3 restarts.

• We create the indexes on local disk in our reducer, and compress them into HDFS after they are complete (sketched below).
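
A hedged sketch of shipping a finished local index into HDFS as a single compressed archive. The zip container and the paths are assumptions; the actual compression format was not stated:

```java
// Sketch: after the reducer finishes building a Lucene index on local disk,
// pack the index directory into one compressed archive inside HDFS.
import java.io.File;
import java.io.FileInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IndexShipper {
  public static void ship(File localIndexDir, String hdfsDest) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    File[] files = localIndexDir.listFiles();
    if (files == null) return; // not a directory; nothing to ship
    try (ZipOutputStream zip =
             new ZipOutputStream(fs.create(new Path(hdfsDest)))) {
      for (File f : files) { // Lucene index files live flat in one directory
        zip.putNextEntry(new ZipEntry(f.getName()));
        try (FileInputStream in = new FileInputStream(f)) {
          in.transferTo(zip); // Java 9+ stream copy
        }
        zip.closeEntry();
      }
    }
  }
}
```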

• When we pull the index to make it available for search, we decompress it to local disk and merge it using the Lucene IndexWriter.addIndexes method before calling /commit on the Solr instance (see the sketch below). The Nutch project created an IndexReader that can do read-only access on HDFS, but for speed reasons we decided not to take that approach.
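
A hedged sketch of the pull-decompress-merge-commit step. It uses the Lucene 2.x-era IndexWriter constructor that was current in 2009; the local paths and the Solr URL are assumptions:

```java
// Sketch: merge a freshly decompressed index into the serving index with
// IndexWriter.addIndexes, then ask the Solr core to commit (reopen searcher).
import java.io.File;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexMerger {
  public static void merge(File servingIndex, File newIndexDir) throws Exception {
    Directory serving = FSDirectory.open(servingIndex);
    Directory fresh = FSDirectory.open(newIndexDir);
    // Lucene 2.x-era API; modern Lucene takes an IndexWriterConfig instead.
    IndexWriter writer = new IndexWriter(serving, new StandardAnalyzer(), false);
    writer.addIndexes(new Directory[] { fresh });
    writer.close();

    // Tell the Solr instance that serves this index to commit.
    URL commit = new URL("http://localhost:8080/solr/core0/update?commit=true");
    HttpURLConnection conn = (HttpURLConnection) commit.openConnection();
    conn.getResponseCode();
    conn.disconnect();
  }
}
```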

• Since we are indexing to local disk, we use an embedded SolrCore, in the same JVM as the reducer (sketched below).
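
A hedged sketch of indexing through SolrJ's EmbeddedSolrServer inside the reducer JVM, which writes to local disk without going over HTTP. The solr home path, core name and fields are illustrative assumptions, and the exact API differs across Solr versions:

```java
// Sketch: an embedded Solr core running in-process, fed directly by the
// reducer instead of a remote HTTP Solr instance.
import java.nio.file.Paths;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class EmbeddedIndexer {
  public static void main(String[] args) throws Exception {
    try (EmbeddedSolrServer solr =
             new EmbeddedSolrServer(Paths.get("/opt/solr/home"), "logcore")) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "msg-0001");              // assumed field names
      doc.addField("sender", "user@example.com");
      solr.add(doc);
      solr.commit();
    }
  }
}
```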

• We have 10 Hadoop data nodes, each with 3.5 TB of hard drives, for 35 TB of raw capacity.
• We are currently indexing an average of 140 GB per day.

• The merged indexes are not replicated at all: only one Solr node has a copy of each index, so failover involves a brief downtime for queries. If we lose a node, other nodes (chosen by consistent hashing) become responsible and merge the indexes from the copies we always have in Hadoop. (A minimal ring sketch follows.)
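
A minimal, hedged sketch of consistent hashing for index ownership, matching the failover scheme described above: each index is owned by the first node clockwise from its hash on a ring, so when a node dies its indexes fall to the next live node, which can rebuild them from the copies kept in HDFS. Node and index names are illustrative; the real assignment logic is not public:

```java
// Minimal consistent-hash ring assigning Solr indexes to nodes.
import java.util.SortedMap;
import java.util.TreeMap;

public class IndexRing {
  private final SortedMap<Integer, String> ring = new TreeMap<>();

  public void addNode(String node) {
    // A few virtual points per node smooth the distribution.
    for (int i = 0; i < 16; i++) ring.put((node + "#" + i).hashCode(), node);
  }

  public void removeNode(String node) {
    ring.values().removeIf(n -> n.equals(node));
  }

  /** The Solr node responsible for serving a given index. */
  public String ownerOf(String indexName) {
    int h = indexName.hashCode();
    SortedMap<Integer, String> tail = ring.tailMap(h);
    return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
  }

  public static void main(String[] args) {
    IndexRing ring = new IndexRing();
    ring.addNode("solr01"); ring.addNode("solr02"); ring.addNode("solr03");
    String owner = ring.ownerOf("index-2009-04-30-1050");
    System.out.println("owner: " + owner);
    ring.removeNode(owner); // node failure: ownership moves to the next node
    System.out.println("failover owner: " + ring.ownerOf("index-2009-04-30-1050"));
  }
}
```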

Page 15: Case Study - How Rackspace Queries Terabytes of Data

Future

• Creating reports and doing ad-hoc queries.
• More MapReduce jobs to answer new questions as they come up.

Page 16: Case Study - How Rackspace Queries Terabytes of Data

References

• How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data

• MapReduce at Rackspace

