Big Data Working Group TF1. Technological Maturity

White Paper

Prepared by: Big Data CoE
Promoting partners:

Version 1.4, Date: 2017-07-11


INDEX

1. BIG DATA PAST, PRESENT AND FUTURE
   1.1. State of the art
        Databases (Relational and NoSQL)
        Distributed Computing
        Advanced analytics
        Integrated data platforms
2. CHALLENGES AND TRENDS
   2.1. Challenges for 2017
        Data Governance, interoperability and integration
        Security, Privacy, and Validity of Data
        Talent Shortage
        A new culture is needed: the distributed organization
        The monolith vs the micro-services architectures
        The dataflow renaissance
   2.2. Trends for 2017
        Big data goes enterprise
        Advanced Analytics
        Moving to the cloud and the provider lockup
        Internet of Things (IoT)
        Streaming analytics
        Trends in Databases
        Trends in Architectures
3. CONCLUSIONS
4. REFERENCES


1. BIG DATA PAST, PRESENT AND FUTURE

1.1. State of the art

Big Data technologies are evolving and their ecosystem is maturing. The first wave of companies that developed Big Data solutions scaled their organizations, learned from early deployments (not only from failures) and now offer more mature products that are being implemented in many companies. The Big Data landscape is still growing, and new companies keep defining and deploying solutions to cover specific business gaps (NoSQL databases, advanced analytics, data visualization, etc.). For that reason, the number of big data companies keeps increasing, as shown in the next figure:

Figure 1: Big Data Landscape (Source: Firstmark)

Open source solutions are widespread and coexist with commercial products; both enjoy a good reputation and have proven to be powerful tools. This section gives a compact description of the current state of the art of frameworks, libraries and tools in the big data ecosystem. In particular, it describes the following main categories related to big data: Databases, Distributed Computing, Advanced Analytics and Integrated Platforms (including visualization tools).


For the first two main categories, Databases and Distributed Computing, we will describe the most relevant projects, identifying their strengths and weaknesses and providing an overview of their current capabilities. These technologies are described in terms of:

• Volume: This refers to the size of the datasets being used and how a particular solution deals with that size.

• Velocity: How fast fresh data is expected to be fed into the system and how quickly the results are expected to be generated.

• Variety: This refers to the wide range of data types or data sources that a solution can support.

• Variability: This refers to the range of different things you can do with the project: different configurations of data, complexity of processes, variety of analytics, etc.

The other two categories, Advanced Analytics and Integrated Platforms, focus on reviewing and classifying the wide variety of approaches, techniques and technologies. The analysis is conducted from different perspectives, i.e. according to application goals, application problems and the algorithmic perspective. Some of the most recent use cases relevant to Advanced Analytics will also be summarized. Finally, the most common and popular Integrated Platforms (commercial and open source) will be reviewed.

Databases (Relational and NoSQL)

For many organizations, databases are the reliable source feeding the analytical stores of their data-related projects. This section gives an overview of the state of the art of the two main categories of databases:

• Relational: Relational Database Management Systems (RDBMS) are those based on the relational model developed by E. F. Codd in the 1970s. Nearly every relational database uses the SQL language. An RDBMS works with tables and the relationships among them.

• NoSQL: Triggered by the needs of the Web 2.0, this category of databases does not strictly follow the relational model. The motivations for NoSQL databases are based on better scalability to sustain the needs of big data and the real-time web. There are different types of NoSQL databases, such as:

o Key/Value data stores: A paradigm designed for storing data structures such as hash tables, lists, sets, etc. This kind of store offers great flexibility as an intermediate cache or middleware system, with great performance and flexibility from a programmer's point of view.

o Column data stores: Columnar databases organize data in columns, in contrast with traditional RDBMSs, where data is organized in rows. This approach offers great performance increases for data access, enabling faster aggregation query processing.

o Document data stores: In this model data is organized as documents, also known as semi-structured data; these databases usually handle XML or JSON documents natively. These stores are usually designed to give their users greater flexibility.


o Graph data stores: A graph database is a store where data is organized as a graph, in contrast to traditional RDBMS systems. A graph database can handle highly connected data with very good performance.

o Multi-model data stores: This kind of database has the special ability to handle different models within the same store, allowing users to choose the best one within a single product, or even to interconnect data from two different models. This provides a high degree of flexibility from the user's point of view.

The main solutions related to databases are the following:

• MySQL (https://www.mysql.com/): Open Source Relational DBMS engine, originally developed by MySQL AG, now owned by Oracle Corporation. MySQL is available under the GNU GPL, as well as under several proprietary agreements.

o Volume: It can operate at massive scale, although it needs a proper setup and usage to achieve it.

o Velocity: MySQL's velocity can be very high, but it will depend on the model layout and the queries executed [1].

o Variety: Provides connectors for the most common programming languages through ODBC/JDBC, covering the most frequent needs.

o Variability: Well suited to transactional data, as it provides full ACID support, guaranteeing that data is safely stored (see the sketch below).
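
As a minimal illustration of that ACID behaviour, the following Python sketch uses the mysql-connector-python driver; the connection details, the accounts table and the transferred amounts are hypothetical:

```python
import mysql.connector  # pip install mysql-connector-python

# Hypothetical connection details and schema, for illustration only.
conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="shop")
cur = conn.cursor()
try:
    # With autocommit off (the driver default) both updates belong to
    # one transaction and are applied atomically on commit().
    cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s", (100, 1))
    cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s", (100, 2))
    conn.commit()
except mysql.connector.Error:
    conn.rollback()  # on any failure the transfer is fully undone
finally:
    cur.close()
    conn.close()
```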

• MongoDB (https://www.mongodb.com/): MongoDB is a document-oriented NoSQL DBMS developed by 10gen. It focuses on ease of use, performance and high scalability [2]. MongoDB is available for Windows and Unix-like environments at no cost under the GNU Affero General Public License. The language drivers are available under an Apache License. In addition, MongoDB Inc. offers proprietary licenses.

o Volume: MongoDB can work with petabytes of data, powered by its versatile sharding methodologies.

o Velocity: Through its rich index and query support, including its aggregation framework, it provides powerful real-time analytics [2].

o Variety: It has drivers and connectors for the most common programming languages.

o Variability: The document data model and the adaptable schema of MongoDB offer flexibility, handling key-value data, graphs, etc., as well as schemas that change over time (a short sketch follows). However, MongoDB imposes restrictions on integrity, because transactions are not supported across documents.
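
The flexible document model and the aggregation framework can be illustrated with a minimal PyMongo sketch; the server address, database, collection and documents are hypothetical:

```python
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017/")
events = client["demo"]["events"]

# The adaptable schema lets documents in one collection differ in shape.
events.insert_many([
    {"user": "alice", "action": "login", "device": {"os": "linux"}},
    {"user": "bob", "action": "purchase", "amount": 12.5, "items": ["a", "b"]},
])

# The aggregation framework expresses analytics as a pipeline of stages.
pipeline = [
    {"$match": {"action": "purchase"}},
    {"$group": {"_id": "$user", "total": {"$sum": "$amount"}}},
]
for row in events.aggregate(pipeline):
    print(row)
```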

• Redis (https://redis.io/): Redis is an in-memory data structure store [3], used as a database, cache and message broker. It is available as an open source product under the BSD license.


o Volume: Redis can work at massive scale, on the order of 100 TB; as with other in-memory systems, the most common limitation is the amount of RAM available [3].

o Velocity: Working with Redis can be very fast when the right data structure is used; it can be slower if extra memory or post-processing is needed after retrieving the data.

o Variety: Redis has multiple clients for the most popular languages; however, applications must comply with its data structure formats and expectations.

o Variability: It provides plenty of flexibility if the data matches, or can be decomposed into, the available structures (see the sketch below).
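
As a sketch of matching data to the available structures, this redis-py example uses a counter, a list, a set and a hash; the key names and values are hypothetical:

```python
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)  # hypothetical local instance

r.set("page:home:hits", 0)         # plain key/value
r.incr("page:home:hits")           # atomic counter, no read-modify-write round trip
r.lpush("recent:logins", "alice")  # list used as an activity feed
r.ltrim("recent:logins", 0, 99)    # keep only the newest 100 entries
r.sadd("tags:post:42", "bigdata", "nosql")                  # set of unique tags
r.hset("user:1", mapping={"name": "alice", "plan": "pro"})  # hash as a small record

print(r.get("page:home:hits"), r.smembers("tags:post:42"))
```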

• Neo4j (https://neo4j.com/): Neo4j is a graph database developed by Neo Technology, Inc. It is available as GPL3-licensed open source, with extensions licensed under the terms of the Affero General Public License. Neo also licenses Neo4j with these extensions under closed-source commercial terms.

o Volume: It can operate at massive scale (billions of nodes) but also at smaller scales when needed.

o Velocity: This system outperforms relational databases in graph-specific use cases, when highly related data is analyzed.

o Variety: It provides integrations with the most popular programming languages and REST API interfaces.

o Variability: Neo4j is a good fit for graph-like data with a high density of relationships (see the sketch below).
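
A minimal sketch with the official Python driver shows how relationship-heavy queries are expressed in Cypher; the URI, credentials and data are hypothetical:

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # Relationships are stored explicitly, so traversals need no joins.
    session.run("CREATE (:Person {name:'Ana'})-[:KNOWS]->(:Person {name:'Joan'})"
                "-[:KNOWS]->(:Person {name:'Marta'})")
    # Friends-of-friends: follow KNOWS exactly two hops out from Ana.
    result = session.run("MATCH (:Person {name:'Ana'})-[:KNOWS*2]->(fof) "
                         "RETURN fof.name AS name")
    for record in result:
        print(record["name"])

driver.close()
```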

• Sparksee (http://www.sparsity-technologies.com/): Sparksee is a high-performance graph database developed by Sparsity Technologies. One of its main features is its query performance for the retrieval and exploitation of large networks. An implementation with very light, specialized structures allows analyzing and querying billions of objects at very low storage cost. Sparksee is available for Java, .Net, Python and C++ developers.

o Volume: It is designed for managing and querying extremely large graph data: large-scale labeled and attributed multigraphs. It is based on vertical partitioning, with collections of object identifiers stored as bitmaps, and has a capacity of more than 100 billion vertices and edges on a single multicore computer.

o Velocity: Sparksee is a high-performance, out-of-core graph database management system. It uses bitmaps for a very compact representation and highly compressible data structures, and delivers sub-second responses to recommendation queries.

o Variety: It provides integrations with Java, .Net, Python and C++ and multiplatform solutions for Windows, Linux, MacOSX and Mobile.

o Variability: Sparksee is suitable for social network analysis, bibliographical networks, media analysis (recommending new content or relating social data with media data), security networks and fraud detection, physical network exploration and optimization, or biological networks.


• MarkLogic (http://www.marklogic.com/): MarkLogic is a multi-model NoSQL database [5] that has evolved from its XML database roots to also natively store JSON documents and RDF triples (the data model for semantics). In addition to a flexible data model, MarkLogic uses a distributed, scale-out architecture that can handle hundreds of billions of documents and hundreds of terabytes of data. MarkLogic is developed by the company of the same name and distributed under closed-source commercial terms.

o Volume: It’s developed to work on extremely large datasets measured in hundreds of terabytes. The server can scale to 100 or more nodes and hundreds of billions of documents [4].

o Velocity: MarkLogic is designed to search extremely large content sets while providing fine-grained control over the search and access of the content. In many cases, applications will be extremely fast with no tuning, and there are many tools and techniques that help make queries faster.

o Variety: It has connectors for Java, .NET, Hadoop, NodeJS and ODBC. If a project's programming language is not covered, there is a REST API available [5].

o Variability: MarkLogic has a multi-model scheme, with support for JSON, XML and RDF triples. This makes it very approachable for use cases ranging from document-oriented applications to the semantic web [5] (see the REST sketch below).
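
Because of the REST API, any language with an HTTP client can store and retrieve documents. Here is a hedged Python sketch against MarkLogic's standard /v1/documents endpoint; the host, port, credentials and document are hypothetical, and digest authentication (the server default) is assumed:

```python
import requests  # pip install requests
from requests.auth import HTTPDigestAuth

base = "http://localhost:8000/v1/documents"  # hypothetical REST app server
auth = HTTPDigestAuth("admin", "secret")

# Store a JSON document under a chosen URI...
requests.put(base, params={"uri": "/people/ana.json"},
             json={"name": "Ana", "city": "Barcelona"}, auth=auth)

# ...and read it back; XML or RDF payloads go through the same endpoint.
doc = requests.get(base, params={"uri": "/people/ana.json"}, auth=auth)
print(doc.json())
```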

• Cassandra (http://cassandra.apache.org/): The Apache Cassandra database [26] is a NoSQL columnar data store mainly used for analytical purposes with big datasets that require linear scalability and highly fault-tolerant availability without compromising performance. Cassandra's support for replicating across multiple datacenters is best in class, providing lower latency for users and the peace of mind of knowing you can survive regional outages.

o Volume: Large volumes of data in Cassandra can be accessed and managed through RPC-style APIs. Cassandra also provides basic query language support through CQL, which is similar to SQL. Linear scalability means that capacity can be increased simply by adding new nodes.

o Velocity: Cassandra handles high incoming data velocity. It supplies a true read/write-anywhere design with a "location independent" architecture, meaning any node in a Cassandra cluster may be read from or written to.

o Variety: Cassandra combines structured, semi-structured, and unstructured data and supports language drivers for Python, C#/.NET, C++, Ruby, Java, Node.JS, PHP, Go, Scala, Clojure, Erlang, Haskell, Rust and Perl.

o Variability: Cassandra remains stable under variable workloads. Its default memtable size can lead to more frequent flushes and compactions, but, on the other hand, those compactions operate on smaller files. A CQL sketch follows.
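
A short sketch with the DataStax Python driver illustrates the SQL-like CQL mentioned above; the keyspace, table and data are hypothetical, and a single local node is assumed:

```python
from datetime import datetime
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])  # hypothetical single-node cluster
session = cluster.connect()

# CQL looks like SQL, but the PRIMARY KEY drives the data layout: rows are
# partitioned across nodes by sensor_id and ordered within a partition by ts.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': 1}""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings (
        sensor_id text, ts timestamp, value double,
        PRIMARY KEY (sensor_id, ts))""")

session.execute(
    "INSERT INTO demo.readings (sensor_id, ts, value) VALUES (%s, %s, %s)",
    ("s1", datetime(2017, 7, 11, 10, 0), 21.5))

for row in session.execute(
        "SELECT * FROM demo.readings WHERE sensor_id = %s", ("s1",)):
    print(row.sensor_id, row.ts, row.value)

cluster.shutdown()
```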


Besides these, there are other popular solutions, such as HBase [27], another columnar store mainly used for analytical purposes, or Riak, a distributed key/value store.

Distributed Computing

With the rise in popularity of the Web 2.0 and big data, the need to process huge amounts of data became a real, mandatory requirement. Single machines were no longer sufficient to perform the required computations, so systems able to carry out calculations in a distributed environment were created. These systems give most users access to as much computational power as they require. In this section we describe the strengths and weaknesses of several distributed computing systems, taking into account the four Vs described before (Volume, Velocity, Variety and Variability).

Apache Hadoop (http://hadoop.apache.org/)

Apache Hadoop [6] is an open source software framework used for distributed storage and processing of big data sets using the MapReduce programming model. It is developed under the umbrella of the Apache Software Foundation with contributions from many committers. Hadoop is an ecosystem with several installable modules: not only the distributed system itself, but also additional software packages such as databases, advanced analytics tools, etc.

• Volume: A big Hadoop deployment can deal with several hundred petabytes. However, Hadoop's major problem and challenge is dealing with small data, i.e. files smaller than a hundred megabytes [6].

• Velocity: Hadoop is a batch-oriented system. It is capable of very high throughput; its model is centered on "jobs", executions that run in batches.

• Variety: Hadoop is a Java-based project with interfaces that read from files or from databases in its ecosystem. It can be "programmed" using alternative interfaces such as Apache Pig, Scalding, etc., which simplify the original MapReduce-oriented API.

• Variability: It scales out effectively. However, it can be hardware-demanding, due to the space required by HDFS and its memory usage. The way Hadoop runs is not suited to processing highly connected data (such as graph traversals), because there is no data locality guarantee. A word-count sketch for the batch model follows.
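
To make the batch "job" model concrete, here is the classic word count written for Hadoop Streaming, which lets plain scripts act as the map and reduce phases; all paths and file names are hypothetical:

```python
#!/usr/bin/env python
# mapper.py -- Hadoop Streaming pipes each input split to this script on
# stdin; it emits one tab-separated (word, 1) pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- the shuffle/sort phase groups lines by key, so equal words
# arrive consecutively and can be summed in a single pass.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))

# Submitted as a batch job, e.g.:
# hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py \
#     -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out
```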


Apache Spark (https://spark.apache.org/)

Apache Spark is an open source cluster computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

• Volume: Spark has been designed to handle huge amounts of data thanks to its distribution capabilities, with the ability to work on top of HDFS, Parquet files, databases (e.g. Hive, JDBC), etc.

• Velocity: It is designed for speed, operating both in memory and on disk. In 2014, Spark won the Daytona GraySort benchmarking challenge, sorting a 100 TB dataset in 23 minutes.

• Variety: Spark is accessible via a set of rich APIs, all designed specifically for interacting quickly and easily with data. These APIs are well documented and structured. Java, Scala (its native language), Python and R can be used as primary languages.

• Variability: The project offers several different APIs, from raw streaming features to graph and machine learning specifics, including Hadoop compatibility. A minimal DataFrame sketch follows.
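
A minimal PySpark sketch showing the DataFrame API; the HDFS path and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# Hypothetical dataset on HDFS; the same code runs unchanged on a laptop
# with a local path or on a large cluster.
sales = spark.read.parquet("hdfs:///data/sales.parquet")

# Transformations are lazy: Spark builds an execution plan and only runs
# it, in parallel and in memory where possible, when show() is called.
top = (sales.groupBy("product")
            .sum("amount")
            .withColumnRenamed("sum(amount)", "total")
            .orderBy("total", ascending=False)
            .limit(10))
top.show()

spark.stop()
```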

Apache Flink (https://flink.apache.org/)

Apache Flink is a community-driven open source framework for distributed big data analytics, like Hadoop and Spark. Its core is a distributed streaming dataflow engine, and it aims to bridge the gap between MapReduce systems and parallel database systems. It is developed under the umbrella of the Apache Software Foundation with contributions from many committers. Apache Flink is the successor of the Stratosphere project [28] (a European research project).

• Volume: Apache Flink is designed to handle huge volumes of continuous data with its distributed streaming engine.

• Velocity: Specifically designed for streaming workflows, Apache Flink can be the fastest in its ecosystem, with throughputs near 15 million events per second, or faster, at very low latency.

• Variety: Apache Flink can be accessed through a collection of rich APIs for batch, streaming, graph and machine learning workloads. However, compared with Spark, programs can only be developed in Java or Scala.

• Variability: It has APIs for working with streaming data but also with graph data, including Hadoop compatibility. This makes the system easy to integrate into the big data ecosystem.

Apart from the solutions described above, there are frameworks that specialize in making interaction with such platforms easier, such as Pig [29] or Cascading [30].


Advanced analytics

According to Gartner, Advanced Analytics is defined as follows: "Advanced Analytics is the autonomous or semi-autonomous examination of data using sophisticated techniques and tools, typically beyond those of traditional business intelligence (BI), to discover deeper insights, make predictions, or generate recommendations."

Although advanced analytic techniques have been around for decades, they have received renewed interest as data volumes have increased and storage and processing costs have gone down. In this scenario, organizations now look at these techniques as an opportunity to perform valuable tasks such as predictive analytics, geospatial analytics, image and text analytics, etc. For instance, retail organizations use advanced analytics to gain insights into consumer behavior in order to create customer-driven marketing strategies.

Nowadays there is a clear shift from quantity to quality of data, which could also be called "a shift from big data to smart data". In other words, advanced analytics is not only employed by enterprises that hold the high-volume or high-velocity data commonly associated with Big Data. Interest in these techniques is also shown by organizations that have relatively small data sets (under 1 GB) but want to extract valuable insights from the data available. Thus, having lots of data is no longer enough, or even necessary. The key is to question the data: can it be easily extracted and analyzed? How reliable is a data set? Here again, mathematical and statistical solutions have been applied to solve data quality and standardization challenges, with pre-processing algorithms based on Big Data analytics.

The techniques belonging to the branch of Advanced Analytics can be classified in many different ways. Below, classifications according to application goals, application problems and the algorithmic perspective are presented. Several real-world use cases are summarized at the end of this section.

Classification of Advanced Analytics techniques according to application goals:

• Descriptive analytics investigates data and information to answer the question "What happened?" It defines the current state of a business situation in a way that developments, patterns and exceptions become evident, in the form of reports, dashboards, alerts, etc.;

• Diagnostic analytics examines data or content to answer the question “Why did it happen?”, and is characterized by techniques such as drill-down, data discovery, data mining and correlations;

• Predictive analytics is concerned with forecasting and statistical modelling to determine the future possibilities, thus answering the question “What will happen?”

• Prescriptive analytics is dedicated to answering the question "How can we make it happen?" It is about optimization, simulation and randomized testing to assess how enterprises can achieve their goals, e.g. enhancing customer service levels while decreasing expenses.


Classification of Advanced Analytics according to application problems and types of analysis:

Advanced Analytics is applied in numerous fields, such as data security, financial trading, healthcare, marketing personalization, fraud detection, recommendations, sentiment analysis, etc. Particular application problems within these fields can be mapped to the following types of analysis:

• Geospatial analysis gathers, displays, and manipulates imagery, GPS, satellite photography and historical data, described explicitly in terms of geographic coordinates or implicitly, in terms of a street address or postal code.

• Natural Language Processing (NLP) and Text mining extract information from textual data (social network feeds, emails, blogs, online forums, survey responses, corporate documents, news, logs, etc.). For instance, insurance companies take advantage of text mining technologies by combining the results of text analysis with structured data to prevent fraud.

• Speech analysis is used to unlock hidden insights from voice communications in order to, for example, improve customer satisfaction, customer loyalty, operational efficiency, as well as agent performance. It might consist of speaker separation, emotion detection, talk-over analysis (those moments when a customer and an agent are talking simultaneously, an indicator of customer dissatisfaction), root cause analysis (drilling down into trending customer reactions and hot topics for their most specific underlying drivers).

• Image analysis extracts meaningful information from digital images (e.g. medical or biological ones) by means of image processing techniques (noise filtering, image segmentation, etc.)

• Video content analysis is used to monitor and analyze video streams in order to detect temporal and spatial events (e.g. in sport applications for summarizing soccer matches).

• Time series analysis and forecasting is used to investigate time series data in order to extract meaningful statistics and to predict future values of data points indexed in time order based on analyzing previously observed values. This type of analysis is widely applied in application fields, such as stock market forecasting, economic forecasting, sales forecasting, etc.

• Semantic analysis is the use of ontologies to analyze content in web resources. This field of research combines text analytics and Semantic Web technologies that allow data to be shared and reused across application, enterprise, and community boundaries.

• Complex networks analysis is mainly aimed at a better understanding, description and prediction of the behavior of real-world systems, such as, for example, social networks, genetics, air transportation systems, etc. The systems investigated within this type of analysis can be characterized by emergent behavior, a large number of interacting components, self-organization, and dynamism.

• Transfer learning focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. It is especially important considering that statistical models need to be rebuilt from scratch using newly collected training data whenever the data distribution changes. In many real-world applications, it is expensive or impossible to re-collect the needed training data and rebuild the models; transfer learning aims to reduce the need and effort to re-collect it.


Classification of Advanced Analytics from the perspective of problem-solving methods:

Advanced Analytics projects may require the application of various solving methods that belong to different subfields of computer science. The most frequently used methods are the following:

• Machine learning is a method of data analysis used to find patterns in data and to build analytical models based on historical data. In the case of large-scale data, distributed machine learning (i.e. data-parallel and model-parallel) is applied to improve the ability of machine learning algorithms to learn from data within a reasonable time.

• Optimization (especially heuristic optimization) is applied when it is necessary to obtain the best possible solution from a given input dataset, a list of constraints and one or more objective functions to be optimized. More formally, it can be defined as the process of finding the conditions that give the minimum or maximum value of a function, where the function represents the effort required or the desired benefit. These methods can, for instance, be applied to find optimal or near-optimal parameters of machine learning algorithms (a process called hyper-parameter optimization). Moreover, heuristic optimization is useful for solving real-world problems such as vehicle routing and scheduling in distribution companies, workforce planning in ground handling companies, cargo load planning in retail companies, etc. In all these cases the use of heuristics (e.g. evolutionary algorithms, tabu search, simulated annealing, etc.) can find a good solution (though not necessarily the optimal one) in a reasonable amount of time; a small sketch is given after this list.

• Modeling & Simulation is used to represent complex real-world systems in the form of computer models for assessing designs, improving processes (e.g. based on what-if analysis), and making critical business decisions in a virtual setting without affecting the real system. Depending on the type of a system, one can apply discrete-event simulation, continuous simulation, (Timed) Colored Petri Nets, etc. Commercial licenses of simulation software (Simio, Arena, ProModel, AnyLogic, etc.) are quite expensive, which often limits the usage of these tools to specific domains (mainly manufacturing and logistics).
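
As referenced in the optimization item above, here is a self-contained sketch of one such heuristic, simulated annealing, applied to a toy one-dimensional function; a real routing or scheduling problem would only swap in its own cost and neighbourhood functions:

```python
import math
import random

def anneal(cost, start, neighbour, t0=10.0, cooling=0.995, steps=10000):
    """Minimize cost() starting from start, exploring via neighbour()."""
    x, best = start, start
    t = t0
    for _ in range(steps):
        candidate = neighbour(x)
        delta = cost(candidate) - cost(x)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops -- this escapes local optima.
        if delta < 0 or random.random() < math.exp(-delta / t):
            x = candidate
        if cost(x) < cost(best):
            best = x
        t *= cooling  # cooling schedule
    return best

# Toy problem: a bumpy function whose local minima trap plain hill climbing.
f = lambda x: (x - 3) ** 2 + 2 * math.sin(5 * x)
print(anneal(f, start=0.0, neighbour=lambda x: x + random.uniform(-0.5, 0.5)))
```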


Machine learning tools commonly applied in Advanced Analytics projects

Among the most powerful recent tools implementing machine learning algorithms, it is possible to mention the following (this list does not pretend to be exhaustive):

• XGBoost (http://xgboost.readthedocs.io/en/latest/) implements machine learning algorithms under the Gradient Boosting framework in an optimized, distributed manner. In particular, by employing multiple threads and imposing regularization (a technique used to avoid overfitting), XGBoost is able to use more computational power and produce more accurate predictions. XGBoost has been extensively used in winning solutions of Kaggle competitions, and it outperforms the Random Forest algorithm of scikit-learn (sklearn) and the Gradient Boosting Machine library (gbm) available in the CRAN repository, especially in terms of run time.

Figure 2: XGBoost Technique - Example.

• Vowpal Wabbit (http://hunch.net/~vw/) is another widely used tool, developed at the Yahoo! Research lab. It has been used to learn a sparse terafeature dataset (i.e. 10^12 sparse features) on 1000 nodes in one hour, which beats all current linear machine learning algorithms. The default learning algorithm is a variant of online gradient descent; various extensions, such as conjugate gradient, mini-batch, and data-dependent learning rates, are included.

• H2O (http://h2o.ai) was written from scratch in Java and integrates with the most popular open source products like Apache Hadoop® and Spark™. It includes state-of-the-art machine learning algorithms such as Random Forest, Gradient Boosting Machine, K-Means, Deep learning, etc.

• TensorFlow (http://www.tensorflow.org) was developed by the Google Brain team and is currently used for both research and production at that company. It applies deep learning to different types of application problems. TensorFlow can be combined with Spark for hyper-parameter optimization and for deploying deep learning models at scale using Spark's built-in broadcasting mechanism.

• scikit-learn (http://scikit-learn.org) is a Python library aimed at data mining and data analysis. A short comparison sketch follows this list.
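
A hedged sketch comparing XGBoost with scikit-learn's Random Forest on a small built-in dataset; the hyper-parameters are illustrative rather than tuned, and a toy dataset of this size says nothing about the large-scale run-time advantage cited above:

```python
import xgboost as xgb  # pip install xgboost
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Gradient boosting with regularization (reg_lambda) and multi-threading (n_jobs).
booster = xgb.XGBClassifier(n_estimators=200, max_depth=4,
                            reg_lambda=1.0, n_jobs=4)
booster.fit(X_tr, y_tr)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_tr, y_tr)

print("xgboost:      ", accuracy_score(y_te, booster.predict(X_te)))
print("random forest:", accuracy_score(y_te, forest.predict(X_te)))
```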


Use Cases of Advanced Analytics

The following use cases are recent examples of deploying Advanced Analytics techniques (machine learning, optimization and simulation modelling) in real-world settings on the basis of large amounts of data:

• The Neural Machine Translation system announced by Google in 2016 improves the quality of translation by applying neural networks to the Google Translate service. The new system can translate phrases for language pairs where no explicit training or mapping exists. The system consists of an encoder network, a decoder network, and an attention network. The encoder transforms a source sentence into a list of vectors, one vector per input symbol. Given this list of vectors, the decoder produces one symbol at a time, until the special end-of-sentence symbol (EOS) is produced. The encoder and decoder are connected through an attention module which allows the decoder to focus on different regions of the source sentence during the course of decoding (https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html).

• Hands-free speakers like Google Home and Amazon Echo use voice recognition and natural language processing to find news, search for music, switch on the lamp before getting out of bed, turn on the heater and more, without lifting a finger. Since these systems are built in the cloud, they are always getting smarter: the more customers use them, the more they adapt to speech patterns, vocabulary, and personal preferences. For instance, Amazon provides the Alexa developer kit for building voice-enabled smart-home applications (currently only in the UK, Germany and the US) (https://developer.amazon.com/alexa).

• Tesla's Autopilot system is able to constantly learn and improve by using machine learning algorithms, the car's wireless connection, detailed mapping and the sensor data that Tesla collects. Each time a driver corrects the Autopilot, the event goes back to the central database tagged with GPS coordinates. This information becomes available to the rest of the fleet, which is quite distinctive from other similar autopilot systems.

• The Emotion API for video (Microsoft) recognizes the facial expressions of people in a video and returns a summary of their emotions. It is possible to use this API to track how a person or a crowd responds to content over time, distinguishing between anger, contempt, disgust, fear, happiness, neutral, sadness, and surprise (https://www.microsoft.com/cognitive-services/en-us/emotion-api).

• Delivery route planning and fleet scheduling tools (e.g. Telogis, Routific, etc.) are widely used by logistic companies (e.g. DHL, Asda, etc.) in order to optimize the planning of resources, while considering numerous constraints and inputs that discard the possibility of manual calculations. Such tools use the mixture of optimization algorithms in order to find near-optimal solutions in a short time for highly complex optimization problems known as NP-hard problems (the time required to solve such problems using any currently known algorithm increases very quickly as the size of the problem grows).

• WITNESS simulation software is used by Nissan Motor Manufacturing to optimize production processes for the manufacture of the Qashqai model. In particular, the enterprise needed to understand and map the role of its new suspension plant, installed specifically for the manufacture of the Qashqai. Simulation techniques allowed testing ideas to improve the plant without having to stop it running, which simply isn't an option given the volume requirements.


Integrated data platforms

This category covers the most common and popular products (commercial and open source) designed to make common big data workflows approachable for everyone, integrating visualization and analytics capabilities. The list below does not pretend to be exhaustive.

Tableau (https://www.tableau.com/)

Developed by the company of the same name, Tableau aims to make data understandable for everyone. Tableau comes in different versions depending on users' needs: desktop, server, and cloud. Tableau features include:

• An interactive chart builder to help users discover hidden insights while playing with data.

• Ability to connect to multiple data sources (big data, SQL database, spreadsheet, or cloud apps like Google Analytics and Salesforce), on premises or in the cloud.

• A quick way to build powerful calculations from existing data, add reference lines and forecasts, and review statistical summaries.

• Sharing the generated analysis and collaborating with others.

Oracle Data Visualization (https://www.oracle.com/solutions/business-analytics/data-visualization.html)

Oracle Data Visualization (ODV) makes easy yet powerful visual analytics accessible to everyone. ODV lets users drag and drop to see their data visualized automatically, change layouts, and present new insights. Its key features are the following:

• Rich, dynamic visual analytics. Automatically connected visualizations and highlighting.
• Seamless self-service discovery and dashboarding. Self-service data loading and blending.
• Powerful search, guided navigation, and filtering. Automatic matching across data sets.
• Visual storytelling with narrative and snapshots. Easy sharing of live insights.
• Secure and scalable.

Tibco Spotfire (http://spotfire.tibco.com/)

Spotfire is data visualization and analytics software developed by TIBCO Software Inc. with the mission of providing self-service analytics to any user. The stack is currently based on the Spotfire Platform and a SaaS version. Its most important features are:

• Ability to access and combine multiple data sources, structured and unstructured data, from internal and external origins in a single analysis.

• Easy and accessible analytics capabilities, such as interactive visualization, data augmentation, and predictive, content-based and location analytics.

• Sharing the generated analysis and collaborating with others. Centrally locating and reusing analytics assets, such as dashboards or statistical scripts, to maximize productivity and share expertise.


RapidMiner (https://rapidminer.com/)

RapidMiner is a complete open source suite of products that aims to unify the data science workflow. The platform offers a large collection of extensions and plugins, either commercially supported by RapidMiner or developed by the community. The platform includes products such as:

• RapidMiner Studio: a visual design tool to build complete workflows, from data preparation to modelling (machine learning) and validation, with powerful visualization and cross-validation methods.

• RapidMiner Server: the centralized place in the stack where a team shares, reuses and operationalizes the predictive models and results created with the Studio. It includes server-side execution of models, scheduling of workflows, monitoring, access control and easy integration with the web.

• RapidMiner Radoop: the integration between the RapidMiner stack and the Hadoop ecosystem, enabling users to run workflows directly on Hadoop and unlock its full power.

SPSS (http://www.ibm.com/analytics/us/en/technology/spss/)

SPSS is a unified data platform developed by IBM with the mission of providing advanced analytics in an easy-to-use package that improves efficiency and minimizes risk. SPSS comprises a collection of products with features such as:

• Statistical analysis and reporting with SPSS Statistics, addressing the entire analytical process: planning, data collection, analysis, reporting, and deployment.

• A predictive analytics platform with SPSS Modeler that includes model-building, evaluation, and automation capabilities.

• An integration with the Hadoop ecosystem through SPSS Analytics Server.

SAS (https://www.sas.com/en_us/software/business-intelligence.html)

SAS is a software suite developed by the company of the same name, focused on advanced analytics, business intelligence and data management. The business intelligence product focuses on bringing the right information to anyone who might need it, through an integrated, easy-to-use discovery platform. SAS BI features include, among others:

• Advanced Analytics, including data mining, statistical analysis, forecasting, text analysis, optimization and simulation.

• Business intelligence with visual data exploration facilities, as well as easy analytics, interactive reporting and dashboards to explore data, discover new patterns, create rich visuals and share insights.

• Cloud analytics services, delivered as software or as hosting.

• Pre-configured analytic solutions for customer analytics (analytical marketing, customer journey, customer experience, etc.), fraud or risk.


PowerBI (https://powerbi.microsoft.com/es-es/)

Power BI is a suite of business analytics tools by Microsoft that delivers insights by connecting data throughout the organization. It plugs into many data sources, simplifies data preparation, and drives ad hoc analysis, transforming transactional data into visual objects. Reports can be published across the organization and consumed on the web and on mobile devices with the Power BI Desktop and Mobile solutions. According to Gartner's 2017 Magic Quadrant, Microsoft Power BI is positioned as a leader in business intelligence and analytics platforms. Power BI is based on Microsoft Azure Analysis Services, which provides enterprise-grade data modeling in the cloud. It is a scalable solution whose performance can easily be matched to business needs. It integrates with Excel spreadsheets, on-premises data sources, big data, streaming data, and cloud services. Direct connections are available to other sources and tools such as Google Analytics, Salesforce or MailChimp.

Qlik (http://www.qlik.com/es-es)

Qlik is a platform designed for visual data analysis and decision making in business environments. According to Gartner, it is also positioned as a leader in the 2017 Magic Quadrant for business intelligence and analytics platforms. It is possible to create visualizations, dashboards, and apps that answer business questions by exploring vast amounts of data to reveal hidden relations. The main solutions are Qlik Sense and QlikView:

• Qlik® Sense is a stand-alone tool for data visualization and integrated analytics. It allows the generation of reports and dashboards with drag-and-drop technology, so everyone in the organization can easily create flexible, interactive visualizations and make meaningful decisions.

• QlikView® is a solution devoted to data analysis and finding the hidden value behind data. It creates business-driven data discoveries with fully customizable guided analysis paths.

Other products from the same suite are: Qlik Analytics Platform, to embed visual analytics within a common governance and security framework; Qlik GeoAnalytics, for map visualizations and location-based analytics; and Qlik DataMarket, devoted to finding, connecting and managing data from external sources.


2. CHALLENGES AND TRENDS

In this section we take a brief look at the current challenges and trends in the big data ecosystem, trying to answer the following questions:

• What is currently seen as difficult to adopt or expand?
• What will the next steps be in the coming years?

2.1. Challenges for 2017

Data Governance, interoperability and integration

A common problem in organizations of all sizes is the interoperability of data, internally or externally, across multiple sources, organizations or domains. A promising application of big data is the integration of data across different sources (internal and external); although the technology may be able to enable this process without excessive risk, problems usually arise when managing the expectations of the different stakeholders [8]. To overcome these issues, more and more organizations are implementing data governance programs, including up-front agreements to manage expectations.

Security, Privacy, and Validity of Data

Regarding big data privacy and validity, the main points considered are the data source and the way the data is used. The most challenging factor is keeping a balance between the value received by users and the benefit for the business, while maintaining a good level of privacy and data protection [7]. Businesses face the challenge of data anonymization, implementing complex methods to prevent re-identification and preserve privacy. For example, retail companies can obtain a sales boost by targeting women who are expecting a baby [10]. In 2012 a company was able to assign a pregnancy score to its clients [9] and predict the expected due date; with this information the retailer sent them coupons for baby products. The problem arose when Target knew that a Minneapolis teenager was pregnant before her father did: he walked into the nearest Target branch, asked for the manager, and demanded to know why they were sending his daughter those coupons and whether they were encouraging her to get pregnant. Such situations can also cause problems of discrimination, for example with employers or health insurers, wrongful credit scores [11], profiling by police, etc. This has been a work-in-progress challenge for many years and will require organizational improvements to policies and procedures while creating intelligent products.


Talent Shortage

In the IT world there is currently a shortage of trained personnel, from web developers to more specialized roles, so the workforce needs to grow to meet demand. Global consulting firms like McKinsey & Company forecast that by 2018 there will be 4 million big data-related jobs in the US, and a shortage of 140,000 to 190,000 data scientists. The pattern is similar for Europe and other areas of the world [13]. A study based on LinkedIn profiles by the consulting company stitchdata [12] predicts that in 2017 the demand for data scientists will continue, but the talent gap will increasingly be framed in terms of data engineers.

Figure 3. Number and Growth of Data Workers in EU28 (source: IDC).

The most requested profiles are as follows:

• Chief Data Officer (CDO): in charge of maintaining the spirit of a data-driven organization and fostering that all decisions in every department are based on data evidence. Leads data management and data analytics from a business-oriented point of view and is responsible for the teams of data scientists and data architects that specialize in data management.

• Data Scientists: key members of the data management team, committed to extracting knowledge and valuable information from data. A Data Scientist must have an end-to-end vision of the processes and be able to solve problems related to data science, build analytical models and apply algorithms. They need a combination of mathematical, statistical, programming and visualization skills, but communication competences are also important to explain the results and the benefits obtained by the organization.

• Data Architect: leads a team of data engineers devoted to the design, implementation, deployment and management of the data infrastructure needed by the users, data analysts and data scientists.


• Data Engineer: a profile specialized in Big Data infrastructure whose main responsibility is to supply access to data in the most convenient way for data scientists and end users. A data engineer develops and selects the techniques, processes, tools and methods used in the development of Big Data applications. He or she must have deep knowledge of database management, cluster architectures, programming languages and data processing systems.

A new culture is needed: the distributed organization

Business adoption of big data requires addressing issues of organizational alignment, change management, business process design, coordination, and communication. These are issues that involve people, communication and understanding. Big data has become mainstream, drawing the attention of larger and more complex organizations. A business must start by identifying and asking the critical questions that will drive business value, but also by identifying and addressing the critical assets that will ensure successful adoption; technology choices follow from those questions and can differ for each organization. In 2016, and even more in 2017, data analytics is being used across many organizations. Success stories reveal that neither small independent teams nor big centralized team structures are encouraged any longer; the reality is hybrid and distributed, spanning all data usage in the organization.

The monolith vs the micro-services architectures

Classical enterprise applications are built from three components: a database, a server-side layer and a client-side application (the monolith). By contrast, the micro-services approach encourages the creation of a collection of business-centered APIs. While in the past scaling was handled vertically, where the monolith fits well, current approaches tend towards scaling out over a cluster of machines. In such environments, specialized micro-components provide better abstractions and help the overall organization of the project. Despite the current trend towards the latter type of architecture, there is an ongoing debate about the best choice for each situation. Benefits such as operational agility and maintainability should be weighed, together with the implementation effort, before deciding on the best solution. An architecture that suits the first stages of a project very well might not be the best when considering the project's whole life.


The dataflow renaissance

(Data)flow-based programming is a programming paradigm first introduced back in the 1960s. The basic idea is to model each program as a graph of data flowing between operations, making the development of parallel applications easier, as the sketch below illustrates. This model has seen a renaissance in recent years, driven by the need to deal with IoT systems and Big Data, and by the adoption of stream processing solutions in the available technology stack. Over the coming years, this programming trend is expected to grow and develop towards a real standardization of the paradigm and its stack. It will probably become a must-know approach for data engineering, much as object orientation is for software development.
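
A toy illustration of the dataflow idea: the program is written as a graph of operations that data flows through, simulated here with Python generators; the stages and data are hypothetical:

```python
# Each function is a node in the dataflow graph; data streams through them
# one item at a time, so nothing is materialized in full and stages could,
# in a real stream-processing engine, run in parallel on different machines.
def source(lines):
    for line in lines:
        yield line.strip()

def parse(stream):
    for line in stream:
        yield float(line)

def smooth(stream, alpha=0.3):
    avg = None
    for x in stream:
        # Exponential moving average as a stateful streaming operator.
        avg = x if avg is None else alpha * x + (1 - alpha) * avg
        yield avg

# Wiring the nodes together is what defines the program.
pipeline = smooth(parse(source(["1.0", "2.0", "6.0", "3.0"])))
for value in pipeline:
    print(value)
```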

2.2. Trends for 2017

In this last section, the most relevant big data trends for the coming years are analyzed, with highlights on advanced analytics, the move to the cloud, IoT and streaming analytics.

Big data goes enterprise

Security

As more companies adopt big data technologies, the goal is a "single version of the truth", where all users work with the same data but don't necessarily have access to the complete datasets [14]. Companies want to make sure that each data user has the correct access permissions in place. To achieve that, platform providers, like Cloudera for Hadoop, have introduced new features to strengthen security in their systems.

Self-service analytics

Self-service big data analytics is the next step in the process of data democratization at the organization level [14]. These solutions do not require months of planning, preparation and education; instead, you can "simply" connect the structured data, open a panel and start extracting knowledge from it [16]. Such platforms enable agility and offer increased productivity, characteristics especially valuable for any enterprise.

Data democratization, simpler tools

The popularization of self-service analytics will drive a collateral improvement in the development of simpler tools that enable non-expert users to carry out a variety of data analysis tasks by themselves. This has been happening over the last year with tools like Microsoft Azure, Google Cloud, or Amazon Machine Learning; however, important improvements in visualization, processing and AI systems as a service are still expected in order to properly reach such a broad audience.


Data Lakes

During the last few years many companies have allocated resources to having a single relational data source instead of multiple silos [17], making it easier to share insights across the organization. From now on, enterprises will implement data lakes for large and unstructured data sets to ensure that all the available information is retained and becomes properly governed and operational.

Advanced Analytics

The last year has shown a strong trend of technology becoming smarter [16]; in 2017 this trend will continue with the development of new applications and algorithms. Deep learning and artificial intelligence will be adopted by organizations that expect great benefits from data insights; they are becoming an affordable choice now that computing power and data storage prices are no longer a big issue. This year the democratization of AI and ML through cloud technologies [18] [15], open standards and the algorithm economy will also continue. The growing trend of deploying prebuilt ML algorithms to enable self-service business intelligence and analytics is a positive step towards the democratization of ML.
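As a hedged illustration of how prebuilt algorithms lower the entry barrier, the following sketch applies scikit-learn's off-the-shelf logistic regression to one of its bundled datasets; nothing has to be implemented by hand (the dataset and parameters are chosen purely for the example):

```python
# Sketch: a prebuilt ML algorithm applied in a few lines of code
# (scikit-learn's bundled iris dataset, for illustration only).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=200)  # off-the-shelf algorithm
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```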

Moving to the cloud and the provider lockup

Every day more organizations move their computing infrastructure to the cloud to turn fixed costs into variable costs and forget about technical maintenance [14]. Traditional vendors like Cloudera and Hortonworks have reacted by pricing their offerings on a consumption basis in cloud environments and by creating packages that focus on targeted subsets of the projects in their full distributions. According to Gartner's Market Guide for Hadoop Distributions 2016 study, end-user inquiries regarding Hadoop and Microsoft Azure were up 57% year over year in 2016, while inquiries about Hadoop and AWS were up 171% over 2015 [15]. In an interview, the CTO of Databricks, the company created by the authors of Apache Spark, reported that in a user survey run that summer the percentage of users running Spark on the public cloud (61%) was higher than the percentage using Hadoop YARN (36%); furthermore, the share of cloud users grew from 2015 (51% to 61%) while the share of YARN decreased (40% to 36%) [15]. But not everything will move to the cloud [19]: legacy systems, sensitive data, security, compliance and privacy issues will require a mix of cloud, on-premises and hybrid applications. There will also be applications that rely on specialized or even private cloud providers. Organizations will need solution architects who understand how to leverage the best of both worlds.


Internet of Things (IoT)

In a recent interview, the current CEO of Pentaho stated that "Big data and IoT systems will evolve in 2017 to help businesses prosper during uncertain times in five ways. Self-service data prep will unlock big data's full value; organizations will replace self-service reporting with embedded analytics; IoT's adoption and convergence with big data will make automated data on-boarding a requirement." [15] This answer captures the next step very well: the integration of IoT with big data, together with streaming analytics, will provide organizations with more detailed data points to complement their algorithmic decision engines.

Streaming analytics

Streaming analytics [17] is the practice of monitoring data as it streams into the organization, instead of analyzing it in traditional batches. The practice is decades old, but open source technology has lowered the entry barriers. With the proliferation of connected devices and IoT in organizations, especially in the manufacturing and healthcare industries, streaming analytics will become mainstream in 2017. Recent years have already seen big improvements in technology stacks for dealing with streaming data, such as Apache Flink and Apache Spark.
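A minimal sketch of streaming analytics with Apache Spark's Structured Streaming API, computing a running word count over a text stream (the socket source on localhost:9999 is an assumption for the example; a Kafka or file source could take its place):

```python
# Sketch: continuous word count over a text stream with PySpark.
# The socket source/host/port are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# The query runs continuously, updating results as data arrives.
query = (counts.writeStream.outputMode("complete")
         .format("console").start())
query.awaitTermination()
```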

Trends in Databases

The next figure shows a snapshot of the DB-Engines1 popularity ranking across the years and database models.

Figure 4. DB-Engines Ranking (source: DB-engines)

1 http://db-engines.com/


This ranking system is used by vendors and users to follow trends in the database environment. It is worth reviewing the main trends over the last year (most of them will continue in 2017):

• Oracle is still the most popular database; however, we can see a descending trend over the last few years.

• Relational databases, despite the popularity of NoSQL, are here to stay. Some traditional relational vendors are also introducing NoSQL features (see the sketch after this list).

• Open source is rising in popularity, with more and more organizations relying on it for their daily operations. We can see it in the tremendous rise of MySQL, PostgreSQL, MongoDB, Cassandra, Elasticsearch, Redis and others.
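
As an example of a relational vendor exposing NoSQL features, the sketch below stores and queries schemaless JSON documents in PostgreSQL through its JSONB type (the connection parameters and the "events" table are illustrative assumptions):

```python
# Sketch: document-style (NoSQL-like) storage inside a relational
# database, using PostgreSQL's JSONB type via psycopg2.
# Connection string and table name are illustrative assumptions.
import json
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS events (id serial, doc jsonb)")
cur.execute("INSERT INTO events (doc) VALUES (%s::jsonb)",
            [json.dumps({"type": "click", "page": "/home"})])

# Query inside the schemaless document with the ->> JSON operator.
cur.execute("SELECT doc FROM events WHERE doc->>'type' = %s", ["click"])
print(cur.fetchall())
conn.commit()
```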

Figure 5: Gartner Magic Quadrant for Operational Databases

As a complement to the DB-Engines ranking, the previous figure shows Gartner's Magic Quadrant, which charts the positions of the most relevant vendors in the database ecosystem. In summary, Microsoft, Oracle, SAP and IBM are still the leaders of the market, but they are closely followed by a group of open source companies such as DataStax, MongoDB, Couchbase and EnterpriseDB [20].


Trends in Architectures

In the previous sections we introduced the different components that typically make up a big data solution, without considering how they are usually glued together. In this section we briefly introduce a couple of reference approaches commonly used when engineering this kind of solution.

Lambda architecture

The Lambda architecture [31] is a data processing architecture initially designed to handle large volumes of data while keeping the advantages of both batch- and stream-oriented processing methods. This approach attempts to balance latency, throughput and fault tolerance by using batch processing to provide an accurate view of the data while simultaneously using real-time (streaming) processing to give fast access to data insights. As shown in the next figure, the Lambda architecture is organized in three components: the batch layer, the speed (streaming/real-time) layer and the serving layer. Data processing is performed from an immutable source.

Figure 6. Lambda architecture blocks

The most common problem with this solution is its inherent complexity: the logic serving the batch and speed layers usually has to be replicated across different code bases, which increases the maintenance costs of the approach. Over the years there have been many discussions about its pros and cons; the Kappa architecture offers a purer and more flexible streaming configuration.

Kappa architecture

Born from the problems identified in the Lambda architecture [32], the Kappa approach is a simplification of it: a Kappa-based system is like the Lambda with the batch layer removed. All data is fed through the streaming component straight to the serving layer.


Figure 7. Kappa Architecture

By removing the batch layer, this architecture simplifies operation, gets rid of the main impediments identified in the Lambda architecture and is able to achieve similar results at a fraction of the cost. A minimal sketch contrasting the two approaches follows.
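The sketch below is purely illustrative, in plain Python with toy in-memory data (all names and figures are assumptions): the Lambda style merges a precomputed batch view with a real-time view, while the Kappa style recomputes answers from a single replayable log.

```python
# Illustrative contrast between the two architectures (toy data).

# Lambda style: the serving layer merges a precomputed batch view
# with a real-time view built from events after the last batch run.
batch_view = {"clicks": 1000}     # computed by the batch layer
realtime_view = {"clicks": 25}    # computed by the speed layer

def lambda_query(metric):
    return batch_view.get(metric, 0) + realtime_view.get(metric, 0)

# Kappa style: one streaming job folds the whole replayable log,
# so there is a single code path to maintain instead of two.
event_log = [{"metric": "clicks"}] * 1025  # immutable, replayable log

def kappa_query(metric):
    return sum(1 for e in event_log if e["metric"] == metric)

print(lambda_query("clicks"), kappa_query("clicks"))  # 1025 1025
```

The trade-off the sketch makes visible: Lambda duplicates logic across two layers for low-latency accuracy, while Kappa keeps one code path at the price of relying entirely on the replayable stream.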


3. CONCLUSIONS

In this document we have reviewed where current big data technology stands and the expected challenges and trends for the near future. In recent years there has been a tremendous adoption of big data technologies by companies of all sizes, turning this innovative approach into a common toolset for solving new business needs. Since large companies started adopting big data solutions, IT vendors have improved capabilities and added important features in areas such as security, role-based access, reliability and self-management.

Real-time data and stream processing, together with IoT, also represent big progress in the ecosystem. In recent years very powerful and innovative platforms such as Apache Spark and Apache Flink have appeared to deal with large amounts of data in real time. These tools are going to be adopted by IoT projects in many different scenarios that will benefit from their simplicity of use, API capabilities and reliability.

In the area of databases, relational databases are here to stay despite the fast rise of NoSQL systems; some relational databases have even adopted NoSQL features such as key-value stores.

There is an increasing interest among organizations in applying advanced analytics to the available historical data. Even if only a relatively small data set is available, companies can retrieve useful information from raw data. However, the formulation of specific goals and questions is still one of the common challenges of advanced analytics projects, so more effort should be put into the initial phase of such projects to better formulate the goals, rather than simply trying out modern technologies. Another tendency is the growing use of deep learning and transfer learning in industrial applications. Special attention should be given to accelerating the training of machine learning models through the use of distributed computing.

It is also important to mention the relevance of open source solutions in the big data ecosystem, where some of the most challenging projects use the Apache, MIT or BSD licenses. This does not mean they are completely run by the community, as they usually have one or several companies sponsoring the development.

Technology is becoming robust and enterprise ready. Big data analytics can now offer very interesting features that companies of all sizes need, and strong companies are supporting the projects and development efforts. In the upcoming years we are going to see the final round of adoption, and these technologies will become a common toolset for everyday data pipelines.

The implementation of a data culture is a very important challenge for any organization willing to make better decisions based on data evidence. The establishment of a culture that fosters data sharing between the different departments is key to being able to extract this knowledge. Data interoperability requires pre-processing, data treatment and standardization, and in some cases machine learning techniques are being used for that purpose. With new methodologies and vertical solutions as essential instruments, the users of a data-driven organization will get better insights from their customers and operations and will be in a better position than their competitors to make profitable use of data analytics.


4. REFERENCES

[1] Peter Zaitsev. Why MySQL could be slow with large tables? percona.com, June 9, 2006. [https://www.percona.com/blog/2006/06/09/why-mysql-could-be-slow-with-large-tables/]
[2] MongoDB use cases brochure. [https://www.mongodb.com/use-cases/real-time-analytics]
[3] Todd Hoff. How Twitter Uses Redis To Scale - 105TB RAM, 39MM QPS, 10,000+ Instances. highscalability.com, September 8, 2014. [http://highscalability.com/blog/2014/9/8/how-twitter-uses-redis-to-scale-105tb-ram-39mm-qps-10000-ins.html]
[4] MarkLogic cluster documentation. Scalability, Availability and Failover. [https://docs.marklogic.com/guide/cluster/scalability]
[5] MarkLogic feature comparison, December 2015. [http://cdn.marklogic.com/wp-content/uploads/2015/12/Product-Feature-Comparison.pdf]
[6] Matt Asay. Why the world's largest Hadoop installation may soon become the norm. techrepublic.com. [http://www.techrepublic.com/article/why-the-worlds-largest-hadoop-installation-may-soon-become-the-norm/]
[7] Andy Oram. If prejudice lurks among us, can our analytics do any better? oreilly.com/ideas, December 12, 2016. [https://www.oreilly.com/ideas/if-prejudice-lurks-among-us-can-our-analytics-do-any-better]
[8] Josh Helms. Challenges in Adopting a Big Data Strategy. IBM Center for the Business of Government, February 11, 2015. [http://www.businessofgovernment.org/blog/business-government/challenges-adopting-big-data-strategy-part-1-2]
[9] Kashmir Hill. How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did. forbes.com, February 16, 2012. [http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/#60ca01ea34c6]
[10] Ashley Carman. First Response's Bluetooth pregnancy test is intriguing and a privacy nightmare. theverge.com, April 25, 2016. [http://www.theverge.com/2016/4/25/11503718/first-response-pregnancy-pro-test-bluetooth-app-security]
[11] Redaktion. Facebook-Freunde könnten bald über die Kreditwürdigkeit bestimmen ("Facebook friends could soon determine creditworthiness"). finanzen.net, August 9, 2015. [http://www.finanzen.net/nachricht/aktien/Facebook-als-Schufa-Facebook-Freunde-koennten-bald-ueber-die-Kreditwuerdigkeit-bestimmen-4465941]
[12] Asim Jalis. The State of Data Engineering. stitchdata.com. [https://www.stitchdata.com/resources/reports/the-state-of-data-engineering/]


[13] Gabriella Cattaneo, Mike Glennon, Rosanna Lifonti, Giorgio Micheletti, Alys Woodward, Marianne Kolding, Angela Vacca, Carla La Croce (IDC), David Osimo (Open Evidence). European Data Market SMART 2013/0063. IDC and Open Evidence, October 16, 2015.
[14] Mary Shacklett. 6 big data trends to watch in 2017. techrepublic.com, December 23, 2016. [http://www.techrepublic.com/article/6-big-data-trends-to-watch-in-2017/]
[15] Matthew Mayo. Big Data: Main developments in 2016 and key trends for 2017. kdnuggets.com, December 2016. [http://www.kdnuggets.com/2016/12/big-data-main-developments-2016-key-trends-2017.html]
[16] Mark van Rijmenam. Top 7 big data trends for 2017. datafloq.com, December 1, 2016. [https://datafloq.com/read/the-top-7-big-data-trends-for-2017/2493]
[17] Big data and BI trends for 2017: machine learning, data lakes, Hadoop vs Spark. computerworlduk.com, December 28, 2016. [http://www.computerworlduk.com/data/big-data-bi-trends-2017-machine-learning-data-lakes-hadoop-vs-spark-3652166/#r3z-addoor]
[18] Kasey Panetta. Gartner's top 10 technology trends for 2017. gartner.com, October 18, 2016. [http://www.gartner.com/smarterwithgartner/gartners-top-10-technology-trends-2017/]
[19] Ben Lorica. 8 data trends on our radar for 2017. oreilly.com/ideas, January 3, 2017. [https://www.oreilly.com/ideas/8-data-trends-on-our-radar-for-2017]
[20] Nick Heudecker, Donald Feinberg, Merv Adrian, Terilyn Palanca, Rick Greenwald. Magic Quadrant for Operational Database Management Systems. Gartner, October 5, 2016.
[21] Nick Heudecker, Merv Adrian, Ankush Jain. Market Guide for Hadoop Distributions. Gartner, February 1, 2017.
[22] Chris Howard, Frank Buytendijk, Bill Swanton. Software 2020: Rearchitecting for the Digital World. Gartner, January 28, 2016.
[23] A. Shi-Nash and D. R. Hardoon. Data Analytics and Predictive Analytics in the Era of Big Data. In: Internet of Things and Data Analytics Handbook (ed. H. Geng). John Wiley & Sons, Inc., Hoboken, NJ, USA, 2017. doi: 10.1002/9781119173601.ch19.
[24] Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills. Advanced Analytics with Spark: Patterns for Learning from Data at Scale. O'Reilly Media, 2015.
[25] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. [http://www.deeplearningbook.org]

