
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory Research

Description:
Mignify is a platform for collecting, storing and analyzing Big Data harvested from the web. It aims at providing easy access to focused and structured information extracted from Web data flows. It consists of a distributed crawler, a resource-oriented storage layer based on HDFS and HBase, and an extraction framework that produces filtered, enriched, and aggregated data from large document collections, including their temporal aspect. The whole system is deployed on an innovative hardware architecture comprising a high number of small (low-consumption) nodes. This talk covers the decisions made during the design and development of the platform, from both a technical and a functional perspective. It introduces the cloud infrastructure, the ETL-like ingestion of the crawler output into HBase/HDFS, and the triggering mechanism for analytics based on a declarative filter/extraction specification. The design choices are illustrated with a pilot application targeting daily Web monitoring in the context of a national domain.
Transcript
Page 1

A Big Data Refinery Built on HBase

Stanislav Barton, Internet Memory Research

Page 2

A Content-oriented platform

• Mignify = a content-oriented platform which:
– Continuously (almost) ingests Web documents
– Stores and preserves these documents as such, AND
– Produces structured content, extracted (from single documents) and aggregated (from groups of documents)
– Organizes both raw documents and extracted/aggregated information in a consistent information space
=> Original documents and extracted information are uniformly stored in HBase

Page 3

A Service-oriented platform

• Mignify = a service-oriented platform:
– Physical storage layer on “custom hardware”
– Crawl on demand, with sophisticated navigation options
– Able to host third-party extractors, aggregators and classifiers
– Runs new algorithms on existing collections as they arrive
– Supports search, navigation, on-demand query and extraction

=> A « Web Observatory » built around Hadoop/HBase.

Page 4

Customers/Users

• Web archivists – store and organize, live search
• Search engineers – organize and refine, big throughput
• Data miners – refine, secondary indices
• Researchers – store, organize and/or refine

Page 5

Talk Outline

• Architecture Overview
• Use Cases/Scenarios
• Data Model
• Queries / Query Language / Query Processing
• Usage Examples
• Alternative HW Platform

Page 6

Overview of Mignify

Page 7

Typical scenarios

• Full-text indexers
– Collect documents, with metadata and graph information
• Wrapper extraction
– Get structured information from web sites
• Entity annotation
– Annotate documents with entity references
• Classification
– Aggregate subsets (e.g., domains); assign them to topics

Page 8

Data in HBase

• HBase as a first-choice data store:
– Inherent versioning (timestamped values)
– Real-time access (cache, index, key ordering)
– Column-oriented storage
– Seamless integration with Hadoop
– Big community
– Production-ready/mature implementation

Page 9

Data model

• Data stored in one big table, along with metadata and extraction results
• Though separated into column families – CFs as secondary indices
• Raw data stored in HBase (< 10 MB; HDFS otherwise)
• Data stored as rows (versioned)
• Values are typed
(a table-layout sketch follows below)
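
A minimal sketch of what this one-big-table layout could look like through the HBase 0.9x-era Java API. The table name is taken from the query example on page 25; the column family names follow the “basic set” described on page 28 and are otherwise illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// One big table: raw content, metadata and extraction results live
// side by side, separated into column families that double as
// secondary indices (scan only the CF you need).
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("web_collection");
desc.addFamily(new HColumnDescriptor("content"));   // raw payload (< 10 MB)
desc.addFamily(new HColumnDescriptor("meta"));      // mime, IP, ...
desc.addFamily(new HColumnDescriptor("analytics")); // plain text, detected mime, ...
admin.createTable(desc);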

Page 10

Types and schema, where and why

• Initially, our collections consist purely of unstructured data
• The whole point of Mignify is to produce a backbone of structured information
• Done through an iterative process which progressively builds a rich set of interrelated annotations
• Typing = important to ensure the safety of the whole process, and its automation

Page 11

Data model II

(Diagram: how a Mignify Resource maps onto an HTable. Each cell is a triple <(CF, Qualifier, timestamp), value> – e.g. <(CF1,Qa,t'),v1>, <(CF1,Qb,t''),v2>, ..., <(CFn,Qz,t'''),vm> – with the values v1, ..., vm stored as byte[], and each column family CF1 ... CFn backed by its own HFiles. Versions group a row's values by timestamp: <t',{V1, ..., Vk}>, <t'',{Vk+1, ..., Vm}>, ... The schema types each attribute A = <CF, Qualifier> : Type, where Type is a Writable or raw byte[]. A typed-value sketch follows below.)
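
A minimal sketch of what the Writable-typing convention implies on the read path. The column names and the surrounding Get are illustrative; Mignify's actual Resource wrapper is not shown in the slides:

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

// The schema declares <"analytics", "plainText"> : Text (a Writable),
// so the raw byte[] fetched from HBase can be decoded safely.
// 'table' is an HTable handle and 'rowKey' a resource key (assumed in scope).
Result result = table.get(new Get(rowKey));
byte[] raw = result.getValue(Bytes.toBytes("analytics"), Bytes.toBytes("plainText"));
Text plainText = new Text();
plainText.readFields(new DataInputStream(new ByteArrayInputStream(raw)));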

Page 12

Extraction Platform – Main Principles

 A framework for specifying data extraction from very large datasets
 Easy integration and application of new extractors
 High level of genericity in terms of (i) data sources, (ii) extractors, and (iii) data sinks
 An extraction process specification combines these elements
 [Currently] a single extractor engine: based on the specification, data extraction is processed by a single, generic MapReduce job

Page 13

Extraction Platform – Main Concepts

 Important: typing (we care about types and schemas!)
 Input and Output Pipes
 Declare data sources (e.g., an HBase collection) and data sinks (e.g., HBase, HDFS, CSV, ...)
 Filters (Boolean operators that apply to input data)
 Extractors
 Take an input Resource, produce Features
 Views
 Combination of input and output pipes, filters and extractors

Page 14

Data Queries

• Various data sources (HTable, data files, ...)
• Projections using column families and qualifiers
• Selections by HBase filters:

// Needs org.apache.hadoop.hbase.filter.* and org.apache.hadoop.hbase.util.Bytes
FilterList ret = new FilterList(FilterList.Operator.MUST_PASS_ONE);
Filter f1 = new SingleColumnValueFilter(
    Bytes.toBytes("meta"), Bytes.toBytes("detectedMime"),
    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("text/html"));
ret.addFilter(f1);

• Query results are either flushed to files or back to HBase -> materialized views
• Views are defined per collection as a set of pairs of Extractors (user-defined functions) and Filters

Page 15

Query Language (DML)

• For each employee with a salary of at least 2,000, compute total costs:
– SELECT f(hire_date, salary) FROM employees WHERE salary >= 2000
– f(hire_date, salary) = mon(today - hire_date) * salary
• For each web page, detect mime type and language
– For each RSS feed, get a summary
– For each HTML page, extract plain text
• Currently a wizard producing a JSON doc

Page 16

User functions

• List<byte[]> f(Row r)• May calculate new attribute values, stored

with Row and reused by other function • Execution plan: order matters!• Function associates description of input and

output fields– Fields dependencies give order
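
A sketch of what such a user function could look like. Only the signature List<byte[]> f(Row r) comes from the slide; the Extractor interface, the Row accessor and the field-description methods are illustrative guesses, not the platform's actual API:

import java.util.Arrays;
import java.util.List;

// A user function consumes a Row and produces new attribute values.
// Declaring input/output fields lets the engine order the plan of
// function applications so that dependencies are computed first.
interface Extractor {
    List<String> inputFields();   // e.g. ["content:payload"]
    List<String> outputFields();  // e.g. ["analytics:plainText"]
    List<byte[]> f(Row r);
}

class PlainTextExtractor implements Extractor {
    public List<String> inputFields()  { return Arrays.asList("content:payload"); }
    public List<String> outputFields() { return Arrays.asList("analytics:plainText"); }
    public List<byte[]> f(Row r) {
        byte[] html = r.get("content:payload");   // hypothetical accessor
        return Arrays.asList(stripTags(html));
    }
    private byte[] stripTags(byte[] html) { /* tag stripping omitted */ return html; }
}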

Page 17

Input Pipes

• Defines how to get the data:
– Archive files, text files, HBase table
– Format
– The Mappers always have a Resource on input; several custom InputFormats and RecordReaders (skeleton below)
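
A skeleton of what one of these custom InputFormats could look like (Resource is Mignify's own wrapper type; the class names and bodies here are illustrative, not the project's actual code):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// An input pipe over (W)ARC archive files: whatever the source,
// the RecordReader always hands the Mapper a Resource.
public class ArchiveInputFormat extends FileInputFormat<LongWritable, Resource> {
    @Override
    public RecordReader<LongWritable, Resource> createRecordReader(
            InputSplit split, TaskAttemptContext ctx)
            throws IOException, InterruptedException {
        // Reads one archive record per call and wraps it as a Resource
        // (ArchiveRecordReader is a hypothetical RecordReader implementation).
        return new ArchiveRecordReader();
    }
}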

Page 18

Output pipes

• Defines what to do with (where to store) the query result:
– File, Table
– Format
– Which columns to materialize
• Most of the time PutSortReducer is used, so the OP defines the OutputFormat and what to emit (configuration sketch below)
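
A minimal sketch of the HBase bulk-load wiring this implies, using the standard HBase MapReduce API of the time (the job and mapper names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

// Output pipe targeting an HTable via HFiles: the mapper emits
// <ImmutableBytesWritable, Put> pairs, and configureIncrementalLoad()
// plugs in PutSortReducer plus a partitioner aligned on region boundaries.
Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "mignify-extraction");
job.setMapperClass(ExtractionMapper.class);            // illustrative mapper
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "web_collection"));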

Page 19

Query Processing

• With typed data, the Resource wrapper, and IPs and OPs – one universal MapReduce job to execute/process queries!!
• Most efficient (for data insertion): MapReduce job with a custom Mapper and PutSortReducer
• Job init: build the combined filter; the IP and OP define the input and output formats
• Mapper set-up: init the plan of user function applications, init the functions themselves
• Mapper map: apply functions on a row according to the plan, use the OP to emit values (mapper sketch below)
• Not all combinations can leverage the PutSortReducer (writing to one table at a time, ...)
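
A sketch of how such a universal mapper could be shaped, building on the illustrative Extractor interface from the “User functions” slide (none of these names – ExtractionMapper, PlanBuilder, Row, Resource – are the platform's actual ones):

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class ExtractionMapper
        extends Mapper<LongWritable, Resource, ImmutableBytesWritable, Put> {
    private List<Extractor> plan;   // user functions, ordered by field dependencies

    @Override
    protected void setup(Context ctx) {
        // Build the execution plan from the view specification in the job conf.
        plan = PlanBuilder.fromConf(ctx.getConfiguration());
    }

    @Override
    protected void map(LongWritable key, Resource res, Context ctx)
            throws IOException, InterruptedException {
        Row row = res.asRow();
        if (!row.passesFilters()) return;          // combined filter built at job init
        for (Extractor e : plan)
            row.store(e.outputFields(), e.f(row)); // new values reusable downstream
        ctx.write(new ImmutableBytesWritable(row.key()),
                  row.toPut());                    // OP decides which columns to emit
    }
}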

Page 20

Query Processing II

(Diagram: input pipes read from archive files, data files, or HBase; each record is wrapped as a Resource for the Map phase, where Views apply Filters and Extractors; the Reduce phase writes out to HBase, HFiles, or plain files, with a Co-scanner traversing the processed region alongside.)

Page 21

Data Subscriptions / Views

• Data-in-a-view satisfaction can be checked at ingestion time, before the data is inserted
• Mimicking client-side coprocessors – allowing the use of bulk loading (no coprocessors for bulk load at the moment)
• When new data arrives, user functions/actions are triggered
– On-demand crawls, focused crawls

Page 22

Second Level Triggers

• User code run in the reduce phase (when ingesting):
– Put f(Put p, Row r)
– Previous versions are on the input of the code; it can alter the processed Put object before the final flush to the HFile
• Co-scanner: a user-defined scanner traversing the processed region, aligned with the keys in the created HFile
• Example: change detection on a re-crawled resource (sketch below)
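
A sketch of what such a second-level trigger could look like for the change-detection example. Only the Put f(Put p, Row r) signature comes from the slide; the digest logic, column names and Row accessor are illustrative:

import java.security.MessageDigest;
import java.util.Arrays;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Runs in the reduce phase at ingestion time: sees the previous
// versions (Row r) and may alter the Put before it is flushed to the HFile.
public class ChangeDetectionTrigger {
    public Put f(Put p, Row r) throws Exception {
        byte[] payload = p.get(Bytes.toBytes("content"), Bytes.toBytes("payload"))
                          .get(0).getValue();
        byte[] oldDigest = r.get("meta:digest");   // from the previous version
        byte[] newDigest = MessageDigest.getInstance("SHA-1").digest(payload);
        p.add(Bytes.toBytes("meta"), Bytes.toBytes("digest"), newDigest);
        p.add(Bytes.toBytes("meta"), Bytes.toBytes("changed"),
              Bytes.toBytes(!Arrays.equals(oldDigest, newDigest)));
        return p;
    }
}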

Page 23

Aggregation Queries

• Compute the frequency of mime types in the collection
• For a web domain, compute spammicity using word distribution and out-links
• Based on a two-step aggregation process:
– Data is extracted in the Map(), emitted with the View signature
– Data is collected in the Reduce(), grouped on the combination (view sig., aggr. key) and aggregated

Page 24

Aggregation Queries Processing

• Processed using MapReduce, multiple (compatible) aggregations at once (reading is the most expensive part)
• Aggregation map phase: List<Pair> map(Row r), with Pair = <Agg_key, value> and Agg_key = <agg_id, agg_on_value, ...>
• Aggregation reduce phase: reduce(Agg_key, values, context) – see the sketch below
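
A sketch of the reduce side these signatures imply, instantiated for the mime-per-PLD aggregation of the next slide (the key encoding and class name are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Map side emits <Agg_key, value> pairs; the key carries the aggregation
// id plus the grouping values, so several compatible aggregations can
// share a single (expensive) scan of the collection.
// e.g. key = "mimePerPld|example.com|text/html", value = 1
public class AggregationReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text aggKey, Iterable<LongWritable> values, Context ctx)
            throws IOException, InterruptedException {
        long count = 0;
        for (LongWritable v : values) count += v.get();  // count(*)
        ctx.write(aggKey, new LongWritable(count));      // one line per (pld, mime)
    }
}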

Page 25

Aggregation Processing Numbers

Compute the mime type distribution of web pages per PLD:

SELECT pld, mime, count(*) FROM web_collection GROUP BY extract_pld(url), mime

Page 26

Data Ingestion

Our crawler asynchronously writes to HBase:

Input: archive files (ARC, WARC) in HDFS
Output: HTable

SELECT *, f1(*), ..., fn(*) FROM hdfs://path/*.warc.gz

1. Pre-compute the split region boundaries on a data sample
– MapReduce on a data input sample (see the pre-splitting sketch below)
2. Process a batch (~0.5 TB) by MapReduce ingestion
3. Manually split regions that got too big (or are candidates to)
4. If there is still input, go to 2.
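
A minimal sketch of what step 1 produces: the table created pre-split at the boundaries computed from the sample (standard HBase admin API; the split keys are illustrative, hard-coded stand-ins for the sampled boundaries):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Region boundaries pre-computed by a MapReduce job over a sample of
// the crawl output; pre-splitting avoids hotspotting a single region
// during the ~0.5 TB ingestion batches.
byte[][] splits = new byte[][] {
    Bytes.toBytes("com.example"),
    Bytes.toBytes("org.example"),
};
HTableDescriptor desc = new HTableDescriptor("web_collection");
desc.addFamily(new HColumnDescriptor("content"));
HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
admin.createTable(desc, splits);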

Page 27

Data Ingestion Numbers

• Store indexable web resources from WARC files into HBase; detect mime type and language, extract plain text, and analyze RSS feeds
• Reaching a steady 40 MB/s including extraction
• Upper bound 170 MB/s (distributed reading of archive files in HDFS)
• HBase is idle most of the time!
– Allows compacting store files in the meantime

Page 28

Web Archive Collection

• Column families (basic set):
1. Raw content (payload)
2. Meta data (mime, IP, ...)
3. Baseline analytics (plain text, detected mime, ...)
• Usually one additional CF per analytics result
• CFs as secondary indices:
– All analyzed feeds in one place (no need for a filter if I am interested in all such rows)

Page 29

Web Archive Collection II

• More than 3,000 regions (in one collection)
• 12 TB of compressed indexable data (and counting)
• Crawl to store/process machine ratio is 1:1.2
• Storage scales out

Page 30

HW Architecture

• Tens of small low-consumption nodes with a lot of disk space:
– 15 TB per node, 8 GB RAM, dual-core CPU
– No enclosure -> no active cooling -> no expensive datacenter-ish environment needed
• Low per-PB storage price (70 nodes/PB); car batteries as UPS; commodity (really low-cost) HW (esp. disks)
• Still some reasonable computational power

Page 31

Conclusions

• Conclusions
– Data refinery platform
– Customizable, extensible
– Large scale
• Future work
– Incorporating external secondary indices to filter HBase rows/cells
• Full-text index filtering
• Temporal filtering
– Larger (100s of TBs) scale deployment

Page 32

Acknowledgments

• European Union projects:
– LAWA: Longitudinal Analytics of Web Archive data
– SCAPE: SCAlable Preservation Environments

