8/12/2019 5 Pitfalls to Avoid With Hadoop
Contents
Intro: Maximizing the fourth V of Big Data
Pitfall #1: Hadoop is not a data integration tool
Pitfall #2: MapReduce programmers are hard to find
Pitfall #3: Most data integration tools don't run natively within Hadoop
Pitfall #4: Hadoop may cost more than you think
Pitfall #5: Elephants don't thrive in isolation
Benchmark
Conclusion
Intro: Maximizing the fourth V of Big Data
Traditional business intelligence architectures are struggling to efficiently process Big Data sets, particularly
massive semi-structured and unstructured data, so it has been difficult to realize the full potential of Big Data.
Hadoop allows organizations to overcome these architectural limitations in managing Big Data, but care needs
to be taken to make the most of what Hadoop has to offer.
Big Data is commonly characterized in terms of the three Vs (high-volume, high-velocity, and high-variety
data assets), but what really matters is the fourth V: value. Value is the positive impact on the business
from gaining actionable insight from massive amounts of data. Big Data can uncover significant value for
organizations, for example: new revenue streams, new customer insights, improved decision making, better-quality
products, improved customer experience, and so on.
Hadoop has emerged as the de facto Big Data analytics operating system, helping to deal with the avalanche of data
coming from logs, email, sensor devices, mobile devices, social media, and more. While business intelligence systems
are typically the last stop in extracting value from Big Data, the first stop is commonly manipulation of the data in
a process called Extract, Transform, Load (ETL). ETL is the process by which data is extracted from source systems,
transformed into a consumable format, and loaded into a target system for advanced analytics, analysis, and
reporting. In fact, industry analyst firm Gartner recognizes that most organizations will adapt their data integration
strategy, using Hadoop as a form of preprocessor for Big Data integration in the data warehouse.
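The extract-transform-load pattern described above can be sketched in a few lines of plain Python. This is an illustrative toy, not Hadoop code; the field names, the sample data, and the filtering rule are all hypothetical.

```python
import csv
import io

# Hypothetical source: raw CSV order records (the "extract" side)
raw = io.StringIO("order_id,amount,region\n1,100.50,EMEA\n2,85.00,AMER\n")

def extract(stream):
    # Read rows from the source system into dictionaries
    return list(csv.DictReader(stream))

def transform(rows):
    # Convert types and filter: keep only orders of 90 or more (hypothetical rule)
    out = []
    for r in rows:
        r["amount"] = float(r["amount"])
        if r["amount"] >= 90:
            out.append(r)
    return out

def load(rows, target):
    # "Load" into an in-memory list standing in for a warehouse table
    target.extend(rows)

warehouse = []
load(transform(extract(raw)), warehouse)
print(warehouse)  # only order 1 survives the filter
```

The point of the sketch is the shape of the pipeline, not the logic: at Big Data scale, each of these three steps becomes a distributed job rather than a function call.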
However, as organizations begin to deploy this new framework, there are some pitfalls to avoid in successfully
performing ETL with Hadoop. First, businesses need to know the pitfalls, and then how to overcome the challenges.
We will offer some guiding principles to address these challenges, as well as specific details on how to leverage
Syncsort's data integration tool for Hadoop, DMX-h, to drive sustainable success with your Hadoop deployment.
Pitfall #1: Hadoop is not a data integration tool
A data integration tool provides an environment that makes it easier for a broad audience to develop and maintain
ETL jobs. Typical capabilities of a data integration tool include: an intuitive graphical interface; pre-built data
transformation functions (aggregations, joins, change data capture [CDC], cleansing, filtering, reformatting,
lookups, data type conversions, and so on); metadata management to enable re-use and data lineage; powerful
connectivity to source and target systems; and advanced features that make data integration easily accessible to
data analysts.
Although the primary use case of Hadoop is ETL, Hadoop is not a data integration tool itself. Rather, Hadoop is
a reliable, scale-out parallel processing framework, meaning servers (nodes) can be easily added as workloads
increase. It frees the programmer from concerns about how to physically manage large data sets when spreading
processing across multiple nodes. There is a rich ecosystem of Hadoop utilities that can be used to create ETL
jobs, but they are all separately evolving projects and require specific new skills. For example, Sqoop development
(to move data into and out of HDFS from RDBMSs) requires programmers skilled in the Sqoop
command-line syntax. Flume is used for moving data from a variety of systems into Hadoop; Oozie helps with
workflows; and Pig is a scripting platform for more easily creating Hadoop jobs. However, they all require much
hand-coding, as well as specialized skills and knowledge of Hadoop and MapReduce.
Finally, basic ETL operations such as data transformations are easy within a mature data integration tool. However,
trying to accomplish the same task with Hadoop alone can quickly become complex and take a great deal of expertise
and effort. For example, building a simple CDC process can easily translate into hundreds of lines of code that not only
take several days to develop, but also require resources to maintain and tune as needs evolve in the future.
A preferred alternative is to use a data integration tool that makes it easy to create and maintain
Hadoop ETL jobs.
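For contrast, here is roughly what the core of a snapshot-based CDC comparison looks like; a minimal Python sketch with hypothetical keys and rows. A production version, which must handle schemas, nulls, late-arriving data, and distributed scale, is where the hundreds of lines of hand-written code come from.

```python
# Hypothetical before/after snapshots keyed by primary key: {id: row}
previous = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
current  = {1: {"name": "Ada"}, 2: {"name": "Rob"}, 3: {"name": "Cy"}}

def cdc(prev, curr):
    """Classify each key as an insert, update, or delete between snapshots."""
    inserts = {k: v for k, v in curr.items() if k not in prev}
    deletes = {k: v for k, v in prev.items() if k not in curr}
    updates = {k: v for k, v in curr.items() if k in prev and prev[k] != v}
    return inserts, updates, deletes

ins, upd, dels = cdc(previous, current)
print(sorted(ins), sorted(upd), sorted(dels))  # [3] [2] []
```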
ETL is emerging as the key use case for Hadoop implementations. However,
Hadoop alone lacks many attributes needed for successful ETL deployments.
Therefore, it's important to choose a data integration tool that can fill the ETL
gaps.
Choose a user-friendly graphical interface to easily build ETL jobs without
writing MapReduce code.
Ensure that the solution has a large library of pre-built data integration
functions that can be easily reused.
Include a metadata repository to enable re-use of developments, as well as
data lineage tracking.
Select a tool with a wide variety of connectors to source and target
systems.
Syncsort DMX-h is high-performance data integration software that provides a smarter
approach to Hadoop ETL, including: an intuitive graphical interface for easily creating and
maintaining jobs, a wide range of productivity features, metadata facilities for development
re-use and data lineage, high-performance connectivity capabilities, and the ability to run
natively, avoiding code generation.
Pitfall #2: MapReduce programmers are hard to find
Programming with the MapReduce processing paradigm in Hadoop requires not only Java programming skills, but
also a deep understanding of how to develop the appropriate Mappers, Reducers, Partitioners, Combiners, etc. A
typical Hadoop task often has multiple steps (as shown in the figure below), and a typical application
can have multiple tasks. Most of these steps need to be coded by a Java developer (or written in Pig script). With
hand-coding, these steps can quickly become unwieldy to create and maintain.
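The map, sort/shuffle, and reduce stages can be simulated in-process to show what each one does. The following is a minimal Python word-count sketch of the paradigm, not actual Hadoop MapReduce code:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # MAP: emit a (word, 1) pair for every word in the input split
    for word in line.split():
        yield word.lower(), 1

def shuffle_sort(pairs):
    # SORT/SHUFFLE: sort intermediate pairs and group them by key,
    # as the MapReduce framework does between the map and reduce phases
    pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [v for _, v in group]

def reducer(key, values):
    # REDUCE: aggregate all values seen for one key
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
intermediate = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, vs) for k, vs in shuffle_sort(intermediate))
print(counts["the"])  # 2
```

Even in this toy form, the framework-supplied machinery (the sort and the grouping) sits between the two functions the developer writes; in real Hadoop jobs, the Partitioner and Combiner add further steps to code and tune.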
Even with expert MapReduce programmers building jobs successfully, MapReduce code has limited metadata
associated with it. This makes impact analysis and data lineage difficult to perform and thus creates an
overall lack of transparency into the ETL execution flow. Ultimately, thousands of lines of Java code with no
metadata and limited documentation produce major risks for organizations, specifically hindering business agility,
complicating data governance, and jeopardizing regulatory compliance.
Not only does MapReduce programming require specialized skills that are hard to find and expensive; hand-
coding also does not scale well in terms of job creation productivity, job re-use, and job maintenance. That's where
data integration tools excel, with intuitive graphical interfaces, pre-built functions, and facilities to easily create, re-
use, and maintain ETL jobs. With data integration tools, business analysts can graphically create, maintain, and
re-use jobs in minutes or hours, a task that would otherwise take days or weeks with a developer writing
thousands of lines of code. Easy job creation and maintenance are critical in preventing bottlenecks that reduce
an organization's ability to extract the full value of Big Data.
Hadoop ETL requires organizations to acquire a completely new set of advanced
programming skills that are expensive and difficult to find. To overcome this pitfall,
it's critical to choose a data integration tool that both complements Hadoop and
leverages skills organizations already have.
Select a tool with a graphical user interface (GUI) that abstracts the
complexities of MapReduce programming.
Look for pre-built templates designed to create MapReduce jobs without
manually writing code.
Insist on the ability to re-use previously created MapReduce flows as a
means to increase developer productivity.
Avoid code generation, since it frequently requires tuning and maintenance.
Visually track data flows with metadata and lineage.
[Figure: typical MapReduce processing flow. Each map task reads input through an Input Formatter, runs MAP with an optional Partitioner and optional Combiner, and sorts intermediate output to local disk; the framework then merges and sorts the map outputs, and each reduce task runs SORT and REDUCE, writing results to HDFS through an Output Formatter.]
Using DMX-h reduces or eliminates the need for costly, hard-to-find MapReduce programmers.
With DMX-h, Mappers and Reducers are all built through an easy-to-use graphical development
environment, eliminating the need to write any code. DMX-h provides powerful and highly
efficient out-of-the-box capabilities for all key ETL functions and transformations. DMX-h
Mapper and Reducer steps can optionally perform processing that eliminates the need for
other steps in the MapReduce processing flow (including the InputFormatter, Partitioner,
Combiner, and OutputFormatter) simply by checking options in the DMX-h graphical user
interface.
There are a number of other benefits inherent in DMX-h as a powerful data integration tool
that make MapReduce programming more efficient. First, it's easy to develop ETL jobs that
execute within MapReduce by using pre-defined templates and accelerators for common
transformations such as CDC, joins, and more. Second, jobs can be easily re-used to create
new data flows in less time, improving developer productivity. Additionally, built-in metadata
capabilities enable greater transparency into impact analysis, data lineage, and execution
flow, thereby facilitating data governance and regulatory compliance. No code generation
means there is no code to maintain or tune. As a result, organizations can minimize or even
eliminate the need to find and acquire new MapReduce skills. Instead, they can leverage the
ETL expertise of their existing staff to quickly learn and implement ETL processes in Hadoop
using DMX-h.
Pitfall #3: Most data integration tools don't run natively within Hadoop
Most data integration solutions offered for Hadoop do not run natively and generate hundreds of lines of code to
accomplish even simple tasks. This can have a significant impact on the overall time it takes to load and process
data. That's why it's critical to choose a data integration tool that is tightly integrated with Hadoop and can run
natively within the MapReduce framework. Moreover, it's important to consider not only the horizontal scalability
inherent to Hadoop, but also the vertical scalability within each node. Remember, vertical scalability is about the
processing efficiency of each node. A good example of vertical scalability is sorting, a key component of every
MapReduce process (equally important is connectivity efficiency, covered in Pitfall #5). When vertical scalability is
most efficient, it also delivers the fastest job processing time, thereby reducing overall time to value.
Unfortunately, many data integration tools add a layer of overhead
that hurts performance. Most data integration tools are peripheral to
Hadoop. They simply interact with Hadoop from the outside, treating
it as just another target engine to which processing is pushed. They take
the same approach as with relational databases: so-called push-
down optimizations. This means they generate code, in most cases
Java, Pig, or HiveQL, which then needs to be compiled before it is
executed in Hadoop. Generating optimal code is not trivial, and
most of these tools can end up generating very inefficient code
that developers then need to understand, fine-tune, and maintain.
It is better to run natively within Hadoop with no need to
pre-compile, which is both easier to maintain and more efficient,
eliminating processing overhead.
Most data integration tools are simply code generators that add extra overhead to
the Hadoop framework. A smarter approach must fully integrate with Hadoop and
provide means to seamlessly optimize performance without adding complexity.
Understand how different solutions are specifically interacting with Hadoop
and the amount of code that they are generating.
Choose solutions with the ability to run natively within each Hadoop node
without generating code.
Run performance benchmarks and study which tools deliver the best
combination of price and performance for your most common use cases.
Select an approach with built-in optimizations to maximize Hadoop's
vertical scalability.
DMX-h provides a truly integrated approach to Hadoop ETL. DMX-h is not a code generator.
Instead, Hadoop automatically invokes the highly efficient DMX-h runtime engine, which
executes on all nodes as an integral part of the Hadoop framework. DMX-h automatically
optimizes resource utilization (e.g., CPU, memory, and I/O) on each node to deliver the
highest levels of performance, scalability, and throughput, with no manual tuning needed.
Compared with Java or Pig, DMX-h execution is typically 2 to 3x faster, which means it can
process more data in the same amount of time without the need for additional nodes.
DMX-h has a very small footprint, with no dependencies on third-party systems such as a relational
database, compiler, or application server for design or runtime. As a result, DMX-h can be
easily installed and deployed on every data node in a Hadoop cluster or on virtualized
environments in the cloud.
Syncsort accomplishes these performance differentiators by leveraging a number of
contributions the company has made to the Apache Hadoop open source community,
including a new feature that allows for an external sort implementation within the MapReduce
framework (MAPREDUCE-2454). Organizations using Hadoop therefore no longer have to
rely on the standard Hadoop sort, but can plug in their own sort as well.
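For illustration, Hadoop's pluggable shuffle and sort mechanism is wired in through job configuration properties along the following lines. The property names come from Hadoop's pluggable shuffle/sort configuration; the plugin class names shown here are hypothetical placeholders, not Syncsort's actual implementation.

```xml
<!-- Illustrative job configuration: swapping in an external sort/shuffle
     implementation via Hadoop's pluggable shuffle and sort properties.
     The com.example class names below are hypothetical placeholders. -->
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>com.example.ExternalSortMapOutputCollector</value>
</property>
<property>
  <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
  <value>com.example.ExternalShuffleConsumerPlugin</value>
</property>
```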
The pluggable sort option also enables development of MapReduce jobs within the DMX-h
graphical interface. Additionally, it allows the DMX-h engine to run natively within the Hadoop
cluster nodes. This approach makes it much easier to implement common tasks that are
difficult to execute in Hadoop (e.g., joins). For all Hadoop users, this new feature enables
more sophisticated manipulation of data within Hadoop, such as hash aggregations, hash joins,
sampling, N matches, or even a no-sort option (i.e., the ability to bypass the sort when it is not
needed or redundant).
Pitfall #4: Hadoop may cost more than you think
Hadoop is significantly disrupting the cost structure of processing data at scale. However, deploying Hadoop is
not free, and significant costs can add up. Vladimir Boroditsky, a director of software engineering at Google's
Motorola Mobility Holdings Inc., recognized in a Wall Street Journal article that there is a very substantial cost
to "free" software, noting that Hadoop comes with the additional costs of hiring in-house expertise and consultants.
In all, the primary costs to consider for a complete enterprise data integration solution powered by Hadoop
include: software, technical support, skills, hardware, and time-to-value.
The first three factors (software, support, and skills) should be considered together. While the Hadoop software
itself is open source and free, it's typically desirable to purchase a support subscription with an enterprise service
level agreement (SLA). Likewise, it's important to consider software and subscription costs as a whole when
choosing the data integration tool to work in tandem with Hadoop. In terms of skills, the Wall Street Journal reports
that a Hadoop programmer, sometimes also referred to as a data scientist, can easily command at least $300,000
per year. Although a data integration tool may add costs on the software and support side, using the right tool
can reduce the overall costs of development and maintenance by dramatically reducing the time to build and manage
Hadoop jobs. Finally, data integration tool skills are much more broadly available and much less expensive than
specialized Hadoop MapReduce developer skills.
While Hadoop leverages commodity hardware, associated costs can still be significant. When dealing with dozens
of nodes over months and years, hardware costs add up, whether commodity or not. Therefore, it is still important to
use hardware in the most efficient manner. Unfortunately, Hadoop's core mechanics of MapReduce are inefficient
with respect to processing data on each individual node. The strategy with Hadoop is to spread the processing
and data across many nodes so that inefficiencies such as sorting are minimized. However, the inefficiencies are
still there and add up as the number of nodes grows. Vertical scalability is critical to contain the costs associated
with growing Hadoop clusters. Therefore, it's important to consider data integration tools that can complement
Hadoop with the ability to maximize processing efficiency on each node, for example, by enabling Hadoop to call
more efficient sort algorithms and seamlessly optimize MapReduce operations.
Time-to-value is the difference between the time needed to create and deploy jobs and the time when an organization
may start extracting value from Big Data. This dimension is another benefit of using a data integration tool with a
graphical interface to speed development and maintenance. The time to create ETL jobs and deploy them into
production is dramatically lower when using the right data integration tool as opposed to using Hadoop utilities
such as Pig, Hive, and Sqoop.
Hadoop provides virtually unlimited horizontal scalability. However, hardware
and development costs can quickly hinder sustainable growth. Therefore, it's
important to maximize developer productivity and per-node efficiency to contain
costs.
Choose cost-effective software and support, including both the Hadoop
distribution and the data integration tool.
Ensure tools include features to reduce the development and maintenance
effort of MapReduce jobs.
Look for optimizations that enhance Hadoop's vertical scalability to reduce
hardware requirements.
DMX-h dramatically reduces the costs of leveraging Hadoop in a number of ways. First, DMX-h
reduces time-to-value by making the development of Hadoop jobs much faster and easier than
manual coding. With DMX-h, there is no need to hire additional programmers to implement
Hadoop ETL. For the most part, you can leverage existing skills within the organization, or
more easily find data integration tool developers at a more reasonable cost.
In terms of hardware, a rule-of-thumb cost for one Hadoop node is about $5,000.
However, when adding the operating system (for example, a support subscription), cooling,
maintenance, power, rack space, etc., the total cost can grow to $12,000, and that does
not include administration costs. DMX-h enables Hadoop clusters to scale more efficiently
and cost-effectively by maximizing the vertical scalability of each individual node. With more
efficient hardware utilization, organizations can reduce capital and operational expenses by
eliminating the need for additional compute nodes on the cluster.
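A back-of-the-envelope model using the per-node figure above shows how per-node efficiency translates into hardware savings. The workload sizing numbers (100 TB, 2 vs. 4 TB per node) are hypothetical, chosen only to illustrate the arithmetic.

```python
# Rule-of-thumb all-in cost per node from the text: hardware plus OS
# support, cooling, maintenance, power, and rack space.
ALL_IN_COST_PER_NODE = 12_000  # USD

def cluster_cost(nodes):
    return nodes * ALL_IN_COST_PER_NODE

def nodes_needed(data_tb, tb_per_node):
    # Hypothetical sizing: nodes required for a workload, given per-node
    # capacity; ceiling division so partial nodes round up.
    return -(-data_tb // tb_per_node)

# If per-node efficiency doubles, the same workload needs half the nodes:
baseline = nodes_needed(100, 2)    # 50 nodes
optimized = nodes_needed(100, 4)   # 25 nodes
print(cluster_cost(baseline) - cluster_cost(optimized))  # 300000 saved
```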
15/20
One of Hadoop's hallmark strengths is its ability to process massive data volumes of nearly any type. But that
strength cannot be fully utilized unless the Hadoop cluster is adequately connected to all available data sources
and targets, including relational databases, files, CRM systems, social media, mainframes and so on. However,
moving data in and out of Hadoop is not trivial. Moreover, with the birth of new categories of data management
technologies, broadly generalized as NoSQL and NewSQL, mission-critical systems like mainframes can all
too often be neglected. The fact is that at least 70% of the world's transactional production applications run on
mainframe platforms. The ability to process and analyze mainframe data with Hadoop could open up a wealth of
opportunities by delivering deeper analytics, at lower cost, for many organizations.
Shortening the time it takes to get data into the Hadoop Distributed File System (HDFS) can be critical for many
companies, such as those that must load billions of records each day. Reducing load times can also be important
for organizations that plan to increase the amount and types of data they load into Hadoop as their
application or business grows. Finally, pre-processing data before loading it into Hadoop is vital in order to filter out
the noise of irrelevant data, achieve significant storage savings, and optimize performance.
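The pre-processing step described above can be sketched in plain Python. The field names and the "noise" rule here are hypothetical examples; a real pipeline would apply equivalent logic inside the ETL tool or a distributed job before the data ever reaches HDFS:

```python
import csv
import io

def preprocess(raw_lines):
    """Drop noise records and keep only the fields needed downstream,
    shrinking the data before it is loaded into HDFS."""
    reader = csv.DictReader(raw_lines)
    for row in reader:
        # Hypothetical noise rule: drop heartbeat events and rows
        # with no user attached.
        if row["event_type"] == "heartbeat" or not row["user_id"]:
            continue
        yield {"user_id": row["user_id"],
               "event_type": row["event_type"],
               "ts": row["ts"]}

raw = io.StringIO(
    "user_id,event_type,ts\n"
    "u1,click,100\n"
    ",click,101\n"
    "u2,heartbeat,102\n"
    "u3,purchase,103\n"
)
kept = list(preprocess(raw))
print(len(kept))  # 2 of 4 records survive the filter
```

Filtering before the load both cuts HDFS storage and spares every downstream MapReduce task from rescanning records that would be discarded anyway.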
Without the right connectivity, Hadoop risks becoming another data silo within the
enterprise. Tools to get the needed data in and out of Hadoop at the right time
are critical to maximize the value of Big Data.
- Select tools with a wide range of native connectors, particularly for popular relational databases, appliances, files and systems.
- Don't forget to include mainframe data in your Hadoop and Big Data strategies.
- Make sure connectivity is provided not only from a stand-alone data integration server to Hadoop, but also directly from the Hadoop cluster itself to a variety of sources and targets.
- Look for connectors that don't require writing additional code.
- Ensure high-performance connectivity in both loading and extracting data from various sources and targets.
DMX-h offers a range of high-performance connectors for every major RDBMS, appliances,
XML, flat files, legacy sources and even mainframes.
DMX-h writes data directly to HDFS using native Hadoop interfaces. DMX-h can partition
the data and parallelize the loading processes to load multiple streams simultaneously into
HDFS, reducing the time to load data into HDFS by up to 6x.
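The partition-and-parallelize idea can be illustrated with a small sketch: split the input files into N streams and push each stream into HDFS concurrently. The `hdfs dfs -put` command is the standard Hadoop shell; the file names, destination path, and stream count below are made up for illustration, and DMX-h itself uses native interfaces rather than this CLI:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def partition(files, streams):
    """Round-robin the input files into `streams` roughly equal groups."""
    groups = [[] for _ in range(streams)]
    for i, f in enumerate(files):
        groups[i % streams].append(f)
    return groups

def load_stream(group, dest="/data/landing"):
    # One `hdfs dfs -put` per stream; the streams run concurrently.
    subprocess.run(["hdfs", "dfs", "-put", *group, dest], check=True)

def parallel_load(files, streams=4):
    groups = [g for g in partition(files, streams) if g]
    with ThreadPoolExecutor(max_workers=streams) as pool:
        list(pool.map(load_stream, groups))

# Example (requires a Hadoop client on PATH):
# parallel_load([f"part-{i:05d}.gz" for i in range(16)], streams=4)
```

The speedup comes from keeping several ingest streams busy at once instead of serializing the whole load through a single writer; actual gains depend on network and disk bandwidth on the edge and data nodes.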
[Figure: DMX-h connectivity to HDFS. Sources and targets include file-based sources (flat files, mainframe, legacy sources), RDBMS (Oracle, DB2, SQL Server, Teradata, Sybase, ODBC), appliances (Netezza, Greenplum, Vertica), and other sources (XML, MQ, Salesforce.com).]
DMX-h can also connect directly from each data node in the cluster, to virtually any source
and target for even greater efficiency and faster data movement.
Finally, Syncsort is commonly used to pre-process data prior to loading it into Hadoop. By
integrating and structuring the data with Syncsort before loading it to HDFS, load times
are reduced, downstream MapReduce tasks execute faster and more efficiently, and storage
requirements on the cluster are reduced.
A leading global financial services organization with trillions of dollars in assets is looking to improve the performance of its Hadoop ETL jobs.
As the de facto standard for Big Data processing and analytics, Hadoop represents a tremendous vehicle for extracting value
from Big Data. However, relying only on Hadoop and common scripting tools like Pig, Hive and Sqoop to achieve
a complete ETL solution can limit the overall value of Big Data. Syncsort DMX-h provides a smarter approach,
making Hadoop a more mature environment for enterprise ETL. Development and maintenance are eased, overall costs are
dramatically reduced, performance is multiplied, every data source can be leveraged, and time-to-value is minimized.
As a high-performance leader in the data integration space, Syncsort has worked with early-adopter Hadoop customers to
identify and solve the most common pitfalls organizations face. Regardless of the approach you take, it's important to
recognize and address these pitfalls prior to deploying ETL on Hadoop:
#1: Hadoop is not a data integration tool
Select a data integration tool that can dramatically speed development and maintenance efforts
by providing all the capabilities to make Hadoop ETL-ready, including connectivity, breadth of
transformations and data processing functions, metadata, reusability and ease-of-use.

#2: MapReduce programmers are hard to find
Make sure your data integration tool includes specialized facilities to ease MapReduce job
development. Also minimize the need to acquire MapReduce programming skills by selecting a tool
that allows you to leverage the same data integration expertise your organization already has to
develop MapReduce jobs without hand-coding.

#3: Most data integration tools don't run natively within Hadoop
Choose a data integration tool that runs natively within the Hadoop framework to minimize data
movement and maximize data processing performance within each node. Avoid code generators
altogether, as their code output frequently requires tedious tuning and maintenance.

#4: Hadoop may cost more than you think
Do not underestimate the cost of using Hadoop, including software, support, hardware, and skills.
Choose a data integration tool that complements Hadoop's horizontal scalability with greater
performance and efficiency on each node to minimize hardware costs.

#5: Elephants don't thrive in isolation
Unleash Hadoop's potential by making sure your data integration tool provides high-performance
connectivity to move data into and out of Hadoop from virtually any system, particularly major
relational databases, appliances, files and mainframes.
Simplifying and accelerating ETL use cases with Hadoop
Hadoop MapReduce: To Sort or Not to Sort
2013: The Year Big Data Gets Bigger
Syncsort provides data-intensive organizations across the big data continuum with a smarter
way to collect and process the ever-expanding data avalanche. With thousands of deployments
across all major platforms, including mainframe, Syncsort helps customers around the world
overcome the architectural limits of today's ETL and Hadoop environments, empowering
their organizations to drive better business outcomes in less time, with fewer resources and
lower TCO. For more information visit www.syncsort.com.
© 2013 Syncsort Incorporated. All rights reserved. DMExpress is a trademark of Syncsort Incorporated. All other company and product names may be trademarks of their respective owners.