Integration of Cloudera Navigator Enables Data Governance with StreamAnalytix
WHITE PAPER
As organizations increasingly rely on data as a core asset for decision making, pressures arise to manage growing volumes of data, monitor access to sensitive assets, and seamlessly enforce policies across the enterprise. Therefore, data governance today plays a key role in formulating effective data strategies.
Data governance involves everything that ensures your data is well managed, from securing it to making it accessible. One of the fundamental requirements in any data governance strategy is being able to trace data back to its origin. However, with complex operations taking place across multiple batch and real-time data flows, tracing data origins is increasingly difficult.
Therefore, understanding data lifecycles and the ability to visualize complete data flows is critical. This allows self-service access and offers full visibility to multiple users such as security teams, compliance groups, business users, etc., establishing data integrity in the system.
Integrating Cloudera Navigator within StreamAnalytix enables a unified view of all incoming data and allows monitoring access to sensitive assets, including the ability to identify the data entity lineage of all the streaming and batch complex data pipelines.
Summary
Current data governance challenges
Disparate data
Petabytes of data in various formats and structures are coming into the Hadoop ecosystem. This data can be structured data such as CSV, or semi-structured data such as XML or JSON (in the form of logs, image files, and more). Additionally, the data is often transformed and stored in other formats along the way. Multiple data flows may each transform the data into their own custom form and store it.
StreamAnalytix Cloudera Navigator Lineage integration offers a solution
Integrating StreamAnalytix with Cloudera Navigator Lineage provides an optimal
solution to address the data governance challenges stated above.
This integration combines a visual UI and end-to-end big data analytics functionality
with a complete view of the entire data lifecycle. The integration allows the following:
• Extraction of technical, managed and custom metadata for all the data pipelines running in the cluster on top of Cloudera Navigator features
• Data cataloging and tagging of data entities in the pipelines with relevant metadata properties
• A global view of the cluster data flow in tune with the StreamAnalytix implemented data pipelines
The system allows you to have an at-a-glance view from the Cloudera Navigator
console of all the cluster entities, including drill down options into details such as
multiple metadata views. The following figure provides an architectural overview of
the integration.
Data provenance
Managing the control and flow of sensitive data access within the workflows is essential. For example, a sensitive data field such as a social security number requires stringent access control.
Visibility
As data flows into the system, visibility is required into its movement, frequency, quality thresholds, applicable rules, and more. Similarly, after the data flows have been processed, the enterprise requires the ability to review and analyze the data flow effect on the system as well as the overall life cycle of the data.
To enable the data navigator for a data pipeline, do the following:
1. Start with a blank canvas, and build a pipeline. StreamAnalytix offers a visual
pipeline designer that includes a blank canvas, a plethora of operators, and drag
and drop functionality to stitch the pipeline visually. Every data pipeline running in
the system is associated with complex data operations on the input data,
including data ingestion, transformations, analytics, machine learning, actions and
alerts, visualization, and data persisting.
Figure 1: Architectural overview of StreamAnalytix and Cloudera integration
Steps to enable the Cloudera Data Navigator within StreamAnalytix
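[Figure 1 components: the StreamAnalytix visual big data analytics platform (channels, processors, analytics, and emitters, with analytics options such as QL, ML, MLlib, and H2O) running on a Hadoop distribution; metadata fetched from multiple StreamAnalytix data pipelines through the Navigator API into the Cloudera Navigator Metadata Server and the Cloudera Navigator Console.]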
Figure 2: StreamAnalytix visual pipeline designer and operators
Save the pipeline. This is where you can enable the navigator option.
Figure 3: Configure pipelines and operators
2. After you submit the pipeline, you can go to the Cloudera Navigator UI to search all
the pipelines under the tags category. All the operators in the data pipelines, which
we will refer to as ‘Data Entity,’ have been tagged with the pipeline name and can
be searched from the Navigator UI.
Figure 4: Cloudera Navigator UI
The screenshot displays all the data entities for the data pipelines with the tags
alerttest and hive_lineage.
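The same tag-based search can also be expressed programmatically. The following is a minimal sketch, assuming the Navigator Metadata Server exposes an entity search endpoint of the form /api/<version>/entities that accepts a Solr-style query parameter. The host name, port, API version, and the helper name build_entity_search_url are illustrative assumptions, not values taken from this paper; check them against the API documentation for your deployment.

```python
from urllib.parse import urlencode


def build_entity_search_url(base_url, api_version, tags, limit=100):
    """Build a search URL for entities carrying any of the given tags.

    Assumption: the server accepts a Solr-style `query` parameter on an
    /api/<version>/entities endpoint. Verify this for your installation.
    """
    if not tags:
        raise ValueError("at least one tag is required")
    # Produces a query such as: tags:(alerttest OR hive_lineage)
    query = "tags:(" + " OR ".join(tags) + ")"
    params = urlencode({"query": query, "limit": limit, "offset": 0})
    return f"{base_url.rstrip('/')}/api/{api_version}/entities?{params}"


# Hypothetical host and API version; the tags are the pipeline names
# shown in the Navigator UI screenshot.
url = build_entity_search_url(
    "http://navigator.example.com:7187", "v9", ["alerttest", "hive_lineage"]
)
print(url)
```

The sketch only constructs the request URL. Sending it (with whatever authentication your cluster requires) is left out so that the example stays independent of any particular HTTP client.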
3. The next step is viewing the lineage associated with the pipelines and operators.
If you click any of the operations, you will be able to view the lineage and the
complete life cycle of the data flow associated with it.
Lineage of enricher data entity
The following screenshot is an example of the lineage of the enricher data entity and the complete life cycle of the data flow associated with it. In this example, data is emitted to multiple data files on HDFS (cricket_input & parq in this case) and other data pipelines are consuming and applying filter transformations to it.
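To make the idea of tracing an entity back to its origin concrete, the sketch below walks a small lineage graph upstream from a chosen entity. It is an illustration only and not how Cloudera Navigator or StreamAnalytix implement lineage. The edge list is hypothetical: the names enricher, cricket_input, and parq come from the example above, while the rabbitmq_channel source and the filter consumer are assumed for the sake of a complete chain.

```python
from collections import deque

# Hypothetical lineage edges as (upstream, downstream) pairs.
LINEAGE_EDGES = [
    ("rabbitmq_channel", "enricher"),
    ("enricher", "cricket_input"),
    ("enricher", "parq"),
    ("cricket_input", "filter"),
]


def trace_to_sources(entity, edges):
    """Return every upstream entity of `entity`, nearest first (breadth-first)."""
    # Index the edges by downstream entity so parents can be looked up directly.
    parents = {}
    for upstream, downstream in edges:
        parents.setdefault(downstream, []).append(upstream)

    seen = {entity}
    order = []
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


print(trace_to_sources("filter", LINEAGE_EDGES))
# → ['cricket_input', 'enricher', 'rabbitmq_channel']
```

A breadth-first walk is used so that the nearest upstream entities are listed first, which mirrors reading a lineage diagram from the selected entity back toward its source.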
Figure 5: Lineage of enricher data entity and the data flow lifecycle
Figure 6: View of the RabbitMQ metadata
The entity lineage view allows you to trace the entity back to the source and precisely
evaluate any transformations. You can also view metadata information for each entity.
For example, if you click the Details option tab, you can view the following details
about RabbitMQ:
StreamAnalytix is an enterprise grade, visual, big data analytics platform for unified streaming and batch data processing based on best-of-breed open source technologies. It supports the end-to-end functionality of data ingestion, enrichment, machine learning, action triggers, and visualization. StreamAnalytix offers an intuitive drag-and-drop visual interface to build and operationalize big data applications five to ten times faster, across industries, data formats, and use cases.
Visit www.streamanalytix.com or write to us at [email protected]
© 2018 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies. July 2018
StreamAnalytix is an enterprise-grade visual platform for all your batch and stream processing and analytics needs.
Ingest, blend, and process high-velocity big data streams as they arrive, run machine learning models, visualize results on real-time dashboards, and train and refresh models in real-time or in batch mode.
Build and operationalize big data applications five to ten times faster using a visual drag-and-drop interface, an exhaustive set of pre-built operators, full application lifecycle support, and one-click options for on-premise and cloud deployments.
With support for multiple big data engines and built-in extensibility, StreamAnalytix gives you full flexibility and control to work with the technology stack of your choice.
About StreamAnalytix