Building Data Pipelines for Solr with Apache NiFi

Transcript

Page 1

Building Data Pipelines for Solr with Apache NiFi
Bryan Bende – Member of Technical Staff

Page 2

Outline

• Introduction to Apache NiFi

• Solr Indexing & Update Handlers

• NiFi/Solr Integration

• Use Cases

Page 3

About Me

• Member of Technical Staff at Hortonworks

• Apache NiFi Committer & PMC Member since June 2015

• Solr/Lucene user for several years

• Developed Solr integration for Apache NiFi 0.1.0 release

• Twitter: @bbende / Blog: bryanbende.com

Page 4

Introduction

Installing Solr and getting started - easy (extract, bin/solr start)

Defining a schema and configuring Solr - easy

Getting all of your incoming data into Solr - not as easy

A lot of time spent…
• Cleaning and parsing data
• Writing custom code/scripts
• Building approaches for monitoring and debugging
• Deploying updates to code/scripts for small changes

Need something to make this easier…

Page 5

Introduction to Apache NiFi

Page 6

Apache NiFi

• Powerful and reliable system to process and distribute data

• Directed graphs of data routing and transformation

• Web-based User Interface for creating, monitoring, & controlling data flows

• Highly configurable - modify data flow at runtime, dynamically prioritize data

• Data Provenance tracks data through entire system

• Easily extensible through development of custom components

[1] https://nifi.apache.org/

Page 7

NiFi - Terminology

FlowFile
• Unit of data moving through the system
• Content + Attributes (key/value pairs)

Processor
• Performs the work, can access FlowFiles

Connection
• Links between processors
• Queues that can be dynamically prioritized

Process Group
• Set of processors and their connections
• Receive data via input ports, send data via output ports

Page 8

NiFi - User Interface

• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processors & connections

Page 9

NiFi - Provenance

• Tracks data at each point as it flows through the system

• Records, indexes, and makes events available for display

• Handles fan-in/fan-out, i.e. merging and splitting data

• View attributes and content at given points in time

Page 10

NiFi - Queue Prioritization

• Configure a prioritizer per connection

• Determine what is important for your data – time based, arrival order, importance of a data set

• Funnel many connections down to a single connection to prioritize across data sets

• Develop your own prioritizer if needed
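
As a rough sketch of that last bullet: a custom prioritizer only needs to implement the FlowFilePrioritizer interface from nifi-api, which is a Comparator over FlowFiles. The "priority" attribute name below is hypothetical.

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.flowfile.FlowFilePrioritizer;

// Hypothetical prioritizer: FlowFiles carrying a lower numeric "priority"
// attribute are offered to the next processor first.
public class AttributePriorityPrioritizer implements FlowFilePrioritizer {

    @Override
    public int compare(final FlowFile f1, final FlowFile f2) {
        return Long.compare(priorityOf(f1), priorityOf(f2));
    }

    // FlowFiles with a missing or malformed attribute sort last
    private long priorityOf(final FlowFile flowFile) {
        final String value = flowFile.getAttribute("priority");
        try {
            return value == null ? Long.MAX_VALUE : Long.parseLong(value);
        } catch (final NumberFormatException e) {
            return Long.MAX_VALUE;
        }
    }
}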

Page 11

NiFi - Extensibility

Built from the ground up with extensions in mind

Service-loader pattern for…
• Processors
• Controller Services
• Reporting Tasks
• Prioritizers

Extensions packaged as NiFi Archives (NARs)
• Deploy to the NiFi lib directory and restart
• Provides ClassLoader isolation
• Same model as standard components
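
To make the extension model concrete, here is a minimal sketch of a custom processor written against the public nifi-api (the class name and attribute are invented for illustration). Packaged in a NAR and dropped into the lib directory, it is discovered through the service-loader mechanism above.

import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

// Minimal processor: stamps each incoming FlowFile with an attribute and
// routes it to the "success" relationship.
public class StampAttributeProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("FlowFiles that were stamped")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return; // nothing queued on the incoming connection
        }
        flowFile = session.putAttribute(flowFile, "stamped.by", "StampAttributeProcessor");
        session.transfer(flowFile, REL_SUCCESS);
    }
}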

Page 12

NiFi - Architecture

[Diagram: a standalone NiFi instance is a single JVM on one OS/host; a Web Server and Flow Controller drive Processor 1 through Extension N, backed by the FlowFile Repository, Content Repository, and Provenance Repository on local storage.]

[Diagram: a NiFi cluster pairs a master NiFi Cluster Manager (NCM), a JVM hosting a Web Server and Request Replicator, with slave NiFi Nodes that each run the full standalone stack above.]

Page 13

Solr Indexing & Update Handlers

Page 14

Solr – Indexing Data

Update Handlers
• XML, JSON, CSV
• https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers

Clients
• Java, PHP, Python, Ruby, Scala, Perl, and more
• https://wiki.apache.org/solr/IntegratingSolr

Page 15

Solr Update Handlers - XML

Adding documents

<add>
  <doc>
    <field name="foo">bad</field>
  </doc>
</add>

Deleting documents

<delete>
  <id>1234567</id>
  <query>foo:bar</query>
</delete>

Other Operations

<commit waitSearcher="false"/>

<commit waitSearcher="false" expungeDeletes="true"/>

<optimize waitSearcher="false"/>
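
The same operations can be issued from SolrJ instead of raw XML. A minimal sketch, assuming a stand-alone core at the placeholder URL below (the single-argument HttpSolrClient constructor matches the SolrJ 5.x era of this deck):

import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class UpdateCommands {
    public static void main(String[] args) throws SolrServerException, IOException {
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");

        solr.deleteById("1234567");       // <delete><id>1234567</id></delete>
        solr.deleteByQuery("foo:bar");    // <delete><query>foo:bar</query></delete>
        solr.commit(false, false);        // roughly <commit waitSearcher="false"/>
        solr.optimize(false, false);      // roughly <optimize waitSearcher="false"/>

        solr.close();
    }
}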

Page 16

Solr Update Handlers - JSON

Solr-Style JSON…

Add Documents

[
  { "id": "1", "title": "Doc 1" },
  { "id": "2", "title": "Doc 2" }
]

Commands

{
  "add": {
    "doc": {
      "id": "1",
      "title": { "boost": 2.3, "value": "Doc1" }
    }
  }
}

Page 17

Solr Update Handlers - JSON

Custom JSON
• Transform custom JSON based on Solr schema
• Define paths to split JSON into multiple Solr documents
• Field mappings from JSON field name to Solr field name

Request parameters:

split=/exams&f=name:/name&f=subject:/exams/subject&f=test:/exams/test&f=marks:/exams/marks

Input document:

{
  "name": "John",
  "exams": [
    { "subject": "Math", "test": "term1", "marks": 90 },
    { "subject": "Biology", "test": "term1", "marks": 86 }
  ]
}

Produces two Solr documents:
- John, Math, term1, 90
- John, Biology, term1, 86

Page 18

Solr Update Handlers - CSV

/update with Content-Type: application/csv

Important parameters:
• separator
• trim
• header
• fieldnames
• skip
• rowid
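
A hedged SolrJ sketch of posting a CSV file to /update with a few of these parameters (the file name and Solr URL are placeholders):

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvUpdate {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");

        ContentStreamUpdateRequest request = new ContentStreamUpdateRequest("/update");
        request.addFile(new File("people.csv"), "application/csv");
        request.setParam("separator", ",");   // field delimiter
        request.setParam("header", "true");   // first row holds field names
        request.setParam("trim", "true");     // strip surrounding whitespace
        request.setParam("commit", "true");

        solr.request(request);
        solr.close();
    }
}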

Page 19

SolrJ Client

SolrDocument Update

SolrInputDocument doc = new SolrInputDocument();
doc.addField("first", "bob");
doc.addField("last", "smith");

solrClient.add(doc);

ContentStream Update

ContentStreamUpdateRequest request =
    new ContentStreamUpdateRequest("/update/json/docs");

request.setParam("json.command", "false");
request.setParam("split", "/exams");
request.getParams().add("f", "name:/name");
request.getParams().add("f", "subject:/exams/subject");
request.getParams().add("f", "test:/exams/test");
request.getParams().add("f", "marks:/exams/marks");

request.addContentStream(new ContentStream...);
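
One possible way to finish and execute the ContentStream request above (a sketch that continues the slide's code; StringStream lives in org.apache.solr.common.util.ContentStreamBase, and the URL is a placeholder):

// Supply the JSON content elided above and send the request to /update/json/docs
String json = "{\"name\":\"John\",\"exams\":[{\"subject\":\"Math\",\"test\":\"term1\",\"marks\":90}]}";
request.addContentStream(new ContentStreamBase.StringStream(json));

SolrClient solrClient = new HttpSolrClient("http://localhost:8983/solr/collection1");
solrClient.request(request);
solrClient.commit();
solrClient.close();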

Page 20

NiFi/Solr Integration

Page 21

NiFi Solr Processors

• Support Solr Cloud and stand-alone Solr instances

• Leverage SolrJ – CloudSolrClient & HttpSolrClient

• Extract new documents based on a date/time field – GetSolr

• Stream FlowFile content to an update handler - PutSolrContentStream

Page 22

PutSolrContentStream

• Choose Solr Type - Cloud or Standard

• Specify ZooKeeper hosts, or the Solr URL

• Specify a collection if using Solr Cloud

• Specify the Solr path for the ContentStream

• Dynamic properties sent as key/value pairs on the request

• Relationships for success, failure, and connection_failure
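
As a rough illustration, one hypothetical configuration for streaming JSON to a SolrCloud collection might look like the following (property names approximately as they appear in the processor; hosts, collection, and path are placeholders):

Solr Type: Cloud
Solr Location: zk1:2181,zk2:2181,zk3:2181
Collection: tweets
Content Stream Path: /update/json/docs
json.command: false    (dynamic property, sent as a request parameter)
split: /               (dynamic property, sent as a request parameter)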

Page 23

GetSolr

• Solr Type, Solr Location, and Collection are the same as PutSolrContentStream

• Specify a query to run on each execution of the processor

• Specify a sort clause and a date field used to filter results

• Schedule processor to run on a cron, or timer

• Retrieves documents with ‘Date Field’ greater than time of last execution

• Produces output in SolrJ XML

Page 24

Use Cases

Page 25

Use Cases – Index JSON

1. Pull in Tweets using the Twitter API

2. Extract language and text into FlowFile attributes

3. Get non-empty English tweets:
   ${twitter.text:isEmpty():not():and(${twitter.lang:equals("en")})}

4. Merge together JSON documents based on quantity, or time

5. Use dynamic field mappings to select fields for indexing:
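
The field-mapping screenshot from the original slide is not reproduced here. As a purely hypothetical illustration, following the split/f convention shown on the Custom JSON slide, the mappings sent to /update/json/docs might translate to parameters such as:

split=/
f=id:/id_str
f=twitter_text_t:/text
f=twitter_lang_s:/lang
f=twitter_username_s:/user/screen_name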

Page 26

Use Cases – Issue Commands

1. Generate a FlowFile on a cron, or timer, to initiate an action

2. Replace the contents of the FlowFile with a Solr command

<delete>
  <query>timestamp:[* TO NOW-1HOUR]</query>
</delete>

3. Send the command to the appropriate update handler

Page 27

Use Cases – Multiple Collections

1. Set a FlowFile attribute containing the name of a Solr collection

2. Use expression language when setting the Collection property on the Solr processor:

${solr.collection}

Note:

• If merging documents, merge per collection in this case

• Current bug preventing this scenario from working:

https://issues.apache.org/jira/browse/NIFI-959

Page 28

Use Cases – Log Aggregation

1. Listen for log events over UDP on a given port
   • Set ‘Flow File Per Datagram’ to true

2. Send JSON log events
   • Syslog UDP forwarding
   • Logback/log4j UDP appenders

3. Merge JSON events together based on size, or time

4. Stream JSON update to Solr

http://bryanbende.com/development/2015/05/17/collecting-logs-with-apache-nifi/

Page 29

Use Cases – Index Avro

1. Receive an Avro datafile with binary encoding

2. Convert Avro to JSON using the built-in ConvertAvroToJSON processor

3. Stream JSON documents to Solr
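
For reference, a short sketch of producing such a binary-encoded Avro datafile with the Avro Java library (the schema and field names are invented):

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteAvroDatafile {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
          + "{\"name\":\"first\",\"type\":\"string\"},"
          + "{\"name\":\"last\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("first", "bob");
        record.put("last", "smith");

        // DataFileWriter writes the schema header plus binary-encoded records
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("people.avro"));
            writer.append(record);
        }
    }
}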

Page 30

Use Cases – Index a Relational Database

1. GenerateFlowFile acts as a timer to trigger ExecuteSQL
   (future plans to not require an incoming FlowFile for ExecuteSQL - NIFI-932)

2. ExecuteSQL performs a SQL query and streams the results as an Avro datafile.
   Use expression language to construct a dynamic date range (an example query is sketched after this list):

   ${now():toNumber():minus(60000):format('yyyy-MM-dd')}

3. Convert Avro to JSON using the built-in ConvertAvroToJSON processor

4. Stream JSON update to Solr
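
For step 2, a hypothetical 'SQL select query' value for ExecuteSQL that embeds the expression above (table and column names are invented):

SELECT id, title, body, updated_at
FROM documents
WHERE updated_at >= '${now():toNumber():minus(60000):format("yyyy-MM-dd")}'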

Page 31

Use Case – Extraction in a Cluster

1. Schedule GetSolr to run on Primary Node

2. Send results to a Remote Process Group pointing back to self

3. Data gets redistributed to “Solr XML Docs” Input Ports across cluster

4. Perform further processing on each node

Page 32

Future Work

Unofficial ideas…

PutSolrDocument
• Parse FlowFile InputStream into one or more SolrDocuments
• Allow developers to provide “FlowFile to SolrDocument” converter

PutSolrAttributes
• Create a SolrDocument from FlowFile attributes
• Processor properties specify attributes to include/exclude

Distribute & Execute Solr Commands
• DistributeSolrCommand learns about Solr shards and produces commands per shard
• ExecuteSolrCommand performs action based on the incoming command

Page 33

Summary

Resources
• Apache NiFi Mailing Lists
  – https://nifi.apache.org/mailing_lists.html
• Apache NiFi Documentation
  – https://nifi.apache.org/docs.html
• Getting started developing extensions
  – https://cwiki.apache.org/confluence/display/NIFI/Maven+Projects+for+Extensions
  – https://nifi.apache.org/developer-guide.html

Contact Info:
• Email: [email protected]
• Twitter: @bbende

Page 34

Sources

[1] https://nifi.apache.org/

[2] https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers

[3] https://wiki.apache.org/solr/IntegratingSolr

[4] http://lucidworks.com/blog/indexing-custom-json-data/

Page 35

Thank you

