Integrating SAP BusinessObjects with Hadoop Using a Multi-Node Hadoop Cluster

May 17, 2013
Integrating SAP BusinessObjects with Hadoop
Visual BI Solutions Inc. http://www.visualbis.com
SAP BO – HADOOP INTEGRATION
Contents

1. Installing a Single Node Hadoop Server
2. Configuring a Multi-Node Hadoop Cluster
3. Configuring Hive Data Warehouse
4. Integrating SAP BusinessObjects with Hadoop
1. Installing a Single Node Hadoop Server

Installing a single-node Hadoop server involves the following steps:
1. Install a stable Linux OS (preferably CentOS) with ssh, rsync, and a recent JDK from Oracle.
2. Download the Hadoop .rpm package (the Linux equivalent of a Windows installer) from the Apache website.
3. Install the downloaded package with the rpm or yum package manager.
4. Apache provides generic configuration options (mentioned below) that can be deployed by executing the scripts packaged with the .rpm file.
5. Execute the configuration process by running the hadoop-setup-conf.sh script with root privilege. Select the "default" option for the config, log, pid, NameNode, DataNode, JobTracker, and TaskTracker directories, and provide the system name for the NameNode and DataNode hosts.
6. To install the single-node .conf files, run the hadoop-setup-single-node.sh script with root privilege and select the default option for all categories.
7. Set up the single node and start the Hadoop services by running the hadoop-setup-hdfs.sh script with root privilege. The .rpm file ships with some basic examples such as wordcount, pi, and teragen, which can be used to test whether all the services are working.
8. Hadoop requires six services to be running for full functionality:
(a) Hadoop NameNode
(b) Hadoop DataNode
(c) Hadoop JobTracker
(d) Hadoop TaskTracker
(e) Hadoop Secondary NameNode
(f) Hadoop History Server
9. If all six services are running, the single-node cluster is ready for operation.
10. The status of the Hadoop services can be checked with the following Linux command (the service scripts are located in the /etc/init.d directory):
$root : service hadoop-namenode status
11. Similarly, the service Linux command can be used to start or stop services:
$root : service hadoop-datanode start
$root : service hadoop-jobtracker stop
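The per-service status checks above can be run in one pass; this sketch assumes the six init scripts installed by the .rpm follow the hadoop-<role> naming shown in the commands above (verify the exact names in /etc/init.d on your system).

```shell
# Check the status of all six Hadoop services in one pass
# (service names assumed from the hadoop-* init scripts in /etc/init.d)
for svc in hadoop-namenode hadoop-datanode hadoop-jobtracker \
           hadoop-tasktracker hadoop-secondarynamenode hadoop-historyserver; do
    service "$svc" status
done
```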
For more detailed information on Hadoop services: http://www.cloudera.com, http://www.wikipedia.org
For more installation options: http://hadoop.apache.org
The running Hadoop services can be monitored through their web interfaces.
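Assuming the stock Hadoop 1.x defaults, the web interfaces listen on the following ports; a quick reachability check (hostnames are placeholders for your own nodes):

```shell
# Default Hadoop 1.x web UI ports; adjust hostnames to your cluster
curl http://localhost:50070/   # NameNode
curl http://localhost:50075/   # DataNode
curl http://localhost:50030/   # JobTracker
curl http://localhost:50060/   # TaskTracker
```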
NameNode
DataNode
JobTracker
TaskTracker
Hadoop Basic Commands
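The screenshot above covers commands along these lines; a few common HDFS file operations, sketched with a hypothetical /user/demo path:

```shell
hadoop fs -mkdir /user/demo                # create an HDFS directory
hadoop fs -put input.txt /user/demo/       # copy a local file into HDFS
hadoop fs -ls /user/demo                   # list directory contents
hadoop fs -cat /user/demo/input.txt        # print a file to stdout
hadoop fs -rm /user/demo/input.txt         # delete a file
```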
2. Configuring a Multi-Node Hadoop Cluster

A single-node Hadoop server can be expanded into a Hadoop cluster. In cluster mode, the NameNode will have many live DataNodes and the JobTracker many live TaskTrackers.

Steps involved in installing a multi-node Hadoop cluster:
1. Install a stable Linux OS (preferably CentOS) on all machines (master and slaves).
2. Install Hadoop on all machines using the Hadoop RPM from Apache.
3. Update the /etc/hosts file on each machine so that every node in the cluster knows the IP addresses of all other nodes.
4. In the master node's /etc/hadoop directory, update the masters and slaves files with the domain names of the master and slave nodes respectively.
5. Generate an SSH key pair for the master node and place the public key on all slave nodes. This enables password-less SSH login from the master to all slaves.
6. Run the hadoop-setup-conf.sh script on all nodes. On the master, let all URLs point to the master. On the slaves, update the NameNode and JobTracker URLs to point to the master node and leave the other URLs pointing to localhost.
7. Open firewall ports for communication on both the master and slave nodes.
8. On the master, run the command start-dfs.sh; this starts the NameNode (on the master) and the DataNodes (on both master and slaves).
9. On the master, run the command start-mapred.sh; this starts the JobTracker (on the master) and the TaskTrackers (on both master and slaves).
10. The NameNode and JobTracker will now report more active nodes than the single-node server did.
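Steps 3 to 5 above can be sketched as follows; all hostnames and addresses are placeholders for your own cluster, and the ssh-copy-id helper is assumed to be available (the public key can also be appended to each slave's ~/.ssh/authorized_keys by hand):

```shell
# /etc/hosts on every node (addresses are placeholders):
#   192.168.1.10  master
#   192.168.1.11  slave1
#   192.168.1.12  slave2
#
# /etc/hadoop/masters on the master node: the master's hostname
# /etc/hadoop/slaves  on the master node: one slave hostname per line

# Password-less SSH from master to slaves (run on the master)
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id slave1
ssh-copy-id slave2
```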
For more configuration options, refer to:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/, http://hadoop.apache.org/docs/stable/cluster_setup.html
Some Screenshots of the Multi-node Hadoop Cluster at work
NameNode
DataNode
List of DataNodes
List of TaskTrackers
JobTracker Job Status
TaskTracker Task Status
3. Configuring Hive Data Warehouse

The Hive data warehousing environment runs on top of Hadoop. It performs ETL at run time and makes data available for reporting. Hive has to be installed first and then hosted as a service using the Hive server option.

Steps involved in configuring Hive:
1. Install and configure Hadoop on all machines and make sure all the services are running.
2. Download Hive from the Apache website.
3. Install MySQL for Hive metadata storage, or simply configure the default Derby database. Any RDBMS can be used for the Hive metastore; this is done by placing the correct JDBC connector in the Hive lib directory. For detailed information on connectivity, follow this link:
https://ccp.cloudera.com/display/CDHDOC/Hive+Installation#HiveInstallation-HiveConfiguration
4. Copy the needed .jar files to the required directories as per the instructions in the above link.
5. Go to the bin directory in the Hive package folder and execute the hive command.
6. Queries can now be executed in the shell.
7. The Hive Web Interface can be started with: hive --service hwi
8. The Hive Thrift server can be started with: hive --service hiveserver
9. Open the Hive server port (default 10000) in the firewall for connections through JDBC.
10. If security is needed for the Hive server, configure Kerberos network authentication and bind it to the Hive server. For more information, refer to http://www.cloudera.com.
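Once the shell works (step 6 above), queries can also be run non-interactively with hive -e; a small sketch in which the table name, columns, and delimiter are hypothetical:

```shell
# Create a table and query it from the command line
# (table and column names are illustrative only)
hive -e "CREATE TABLE IF NOT EXISTS words (word STRING, freq INT)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';"
hive -e "SELECT word, freq FROM words ORDER BY freq DESC LIMIT 10;"
```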
For more configuration options: http://hive.apache.org
For Hive – JDBC connection: https://cwiki.apache.org/Hive/hiveclient.html#HiveClient-JDBC
Screenshots of the Hive Server
Hive Web Interface
Hive Command Line
4. Integrating SAP BusinessObjects with Hadoop

Universe Design Using IDT
Steps involved in configuring SAP BusinessObjects for use with Hadoop:
1. Configure SAP BusinessObjects with Hive JDBC drivers if the server version is lower than BO 4.0 SP5; from BO 4.0 SP5 onward, SAP provides Hive connectivity by default. To configure JDBC drivers in earlier versions, refer to page 77 of this document:
http://help.sap.com/businessobject/product_guides/boexir4/en/xi4sp4_data_acs_en.pdf
2. Create the BO universe:
1. Open SAP IDT and create a user session with your login credentials.
2. Under the session, open the Connections folder and create a new relational connection.
3. In the driver selection menu, select Apache -> Hadoop Hive -> JDBC Drivers.
4. In the next tab, enter the database URL and port, the username, and the password, then click Test Connectivity. If the test is successful, save the connection by clicking Finish.
5. Create a new project in IDT and create a shortcut to the above connection in the project.
6. Create a new data foundation layer and bind the connection to it.
7. This connection will be used by the data foundation layer to import data from the Hive server.
8. From the data foundation layer, drag and drop the tables needed by the universe. Create views in the data foundation if required.
9. Create a new business layer and bind the data foundation layer to it.
10. Attributes can be set as measures with suitable aggregation functions in the business layer.
11. Right-click the business layer and select Publish -> Publish to Repository. Run an integrity check before publishing to verify dependencies.
12. Log on to the CMC and set the universe access policy for users.
13. Open the WebI Launch Pad or Rich Client and select a universe as the source. The published universe should be listed.
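The database URL entered during connection creation follows the HiveServer JDBC form; assuming the default port 10000 from section 3, it would look like the sketch below (the hostname is a placeholder for your Hive server):

```shell
# JDBC URL for the Hive thrift server started with "hive --service hiveserver"
# (host is a placeholder; 10000 is the default Hive server port)
#   jdbc:hive://master:10000/default
```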
For detailed information, refer to http://scn.sap.com and http://help.sap.com.
Some Screenshots of Universe Design
Data Foundation Layer
Business Layer
Convert To Measure
Publish Universe
3. Create reports

The published universe can be accessed through WebI, Dashboards, or Crystal Reports. Select the Hive universe as the data source and build queries using the Query Panel. The universe converts user queries into HiveQL statements and returns the results to the report.
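For example, a Query Panel request for the top words by frequency might be translated into HiveQL along these lines (table and column names are hypothetical, and the actual SQL the universe generates will differ):

```shell
# Illustrative HiveQL of the kind a universe query might produce
hive -e "SELECT word, SUM(freq) AS total FROM words
         GROUP BY word ORDER BY total DESC LIMIT 10;"
```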
Some Screenshots of Text Processing Reports
WEBI Mobile Report on Word Count