Introduction and Configuration of Oracle Data Integrator for Big Data (Cloudera Hadoop)

Apache Hadoop is designed to store and process data that typically comes from non-relational sources, at volumes beyond what relational databases can handle.

Oracle Data Integrator is a transparent and heterogeneous Big Data Integration technology based on an open and lightweight ELT architecture. It runs a diverse set of workloads, including Spark, Spark Streaming and Pig transformations, to enable customers to solve their most complex and time-sensitive data transformation and data movement challenges. It is a core component of Oracle Data Integration solutions, integrating seamlessly with the rest of Oracle’s Data Integration and Business Application solutions.

Oracle Data Integrator for Big Data provides the following benefits to customers:

  • It brings expanded connectivity to various Big Data sources such as Apache Kafka or Apache Cassandra
  • It decreases time to value for Big Data projects
  • It provides a future proof Big Data Integration technology investment
  • It streamlines and shortens the Big Data development and implementation process

Currently ODI supports:

  • Generation of Pig Latin transformations: users can choose Pig Latin as their transformation language and execution engine for ODI mappings. Apache Pig is a platform for analyzing large data sets in Hadoop and uses the high-level language Pig Latin for expressing data analysis programs.
  • Generation of Spark and Spark Streaming transformations: ODI mappings can also generate PySpark. Apache Spark is a transformation engine for large-scale data processing. It provides fast in-memory processing of large data sets. Custom PySpark code can be added through user-defined functions or the table function component.
  • Orchestration of ODI jobs using Oozie: users have a choice between using the traditional ODI Agent or Apache Oozie as the orchestration engine for jobs such as mappings, packages, scenarios, or procedures. Apache Oozie allows fully native execution on Hadoop infrastructure without installing an ODI agent for orchestration. Users can utilize Oozie tooling to schedule, manage, and monitor ODI jobs. ODI uses Oozie’s native actions to execute Hadoop processes and conditional branching logic.

You can use Oracle Data Integrator to design the ‘what’ of an integration flow and assign knowledge modules to define the ‘how’ of the flow from an extensible range of mechanisms. The ‘how’ is the execution technology: Oracle, Teradata, Hive, Spark, Pig, and so on.

Let’s configure Oracle Data Integrator for Cloudera Hadoop. You don’t need to install any components on your Hadoop cluster; a remote connection is enough to manage all jobs on Hadoop.

Install ODI Big Data

Before installing and configuring ODI we need a Hadoop environment that is already set up. The simplest way to run ODI (or its agent) is to use an Edge or Gateway node. You can find my post about configuring a Cloudera Edge (Gateway) node. An edge node doesn’t store or process any data; it is just a host where ODI Studio is installed.
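Before starting the installation you can quickly confirm that the edge node has the Hadoop client binaries and client configuration deployed (these commands assume the Cloudera client packages or parcels are already installed on this host):

# Hadoop client version available on the edge node
hadoop version
# Deployed Hadoop client configuration
ls /etc/hadoop/conf
# Confirm HDFS is reachable from this host
hdfs dfs -ls /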

First of all, install ODI. This is easy.

1. Download ODI from edelivery.oracle.com or from otn.oracle.com. We will use the latest version, which is 12.2.1.3.0.


2. Check that you have a certified Java version. The certification matrix is available here: http://www.oracle.com/technetwork/middleware/ias/downloads/fusion-certification-100350.html
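A quick way to confirm which JDK the installer will use on this host (check the matrix above for the exact certified version):

# Verify the Java version and JAVA_HOME on the installation host
java -version
echo $JAVA_HOME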

3. Unpack and run the installation:

java -jar fmw_12.2.1.3.0_odi.jar

4. Enter the directory for the Oracle Inventory.


5. Choose the directory where ODI will be installed and select the Standalone installation type. Accept the remaining prompts and wait until the installation finishes.


Configure ODI repository

1. Oracle Data Integrator stores its repository and projects inside a database, so we need to create a repository. The RCU utility is used to create it:

export ODI_HOME=/home/oracle/odi12213
$ODI_HOME/oracle_common/bin/rcu

2. Choose Create Repository with System Load and Product Load (the default), then fill in the connection parameters for the Oracle Database that will host the repository.

3. Select «Oracle Data Integrator» on the next page and change the prefix for the new ODI repository schemas, then enter a password for the created schemas.


4. Enter a password for SUPERVISOR. It will be used to authenticate the user to ODI. Leave the defaults for the other parameters and accept the remaining options.


Configure Big Data Connections on the Linux Edge Node

The first step is to configure connections to all Big Data technologies.

1. Run Oracle Data Integrator

export ODI_HOME=/home/oracle/odi12213
$ODI_HOME/odi/studio/odi.sh

2. Then click «Connect to Repository». Add a new connection in the dialog box by clicking «+» and enter the required connection information.


3. Run the Big Data wizard by clicking File->New…, then choose Big Data Configuration.


4. Enter a prefix for the new connections, choose your distribution (CDH X.XX), change the CDH base directory if needed, and choose the technologies that will be configured.


5. The next page allows you to enter information for the Hadoop/HDFS connection. There are two main URIs you should change: the HDFS Name Node URI and the Resource Manager/Job Tracker URI. You can find the hostnames of these nodes in Cloudera Manager on the Hosts->Roles page.


6. Find the roles HDFS Name Node (NN) and YARN Resource Manager (RM), note the hostnames of these roles, and enter them into the Hadoop connection dialog. Also enter the authentication information, then click Next.
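If you prefer the command line, the same values can be read from the Hadoop client configuration on the edge node (assuming the client configs are deployed there; 8020 is the usual CDH Name Node port):

# Default filesystem URI, used as the HDFS Name Node URI (e.g. hdfs://namenode-host:8020)
hdfs getconf -confKey fs.defaultFS
# Resource Manager address from the YARN client configuration
grep -A1 "yarn.resourcemanager.address" /etc/hadoop/conf/yarn-site.xml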

7. On the HDFS connection page, enter only the username and password.
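It is worth checking that the HDFS home directory for that user exists. The username oracle below is just the one used in this walkthrough, so adjust it to your environment:

# Check the HDFS home directory of the connection user
hdfs dfs -ls /user/oracle
# If it is missing, create it as the hdfs superuser and hand it over to the user
sudo -u hdfs hdfs dfs -mkdir -p /user/oracle
sudo -u hdfs hdfs dfs -chown oracle:oracle /user/oracle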


8. Enter the username and password on the Spark page.
You should also edit the «Master Cluster (Data Server)» parameter. If you have Spark 1, enter «yarn-client» (YARN client mode, with the driver running on the submitting machine). If you have Spark 2, enter «yarn». I have Spark 2, so I entered yarn.
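To sanity-check the master setting outside of ODI, you can submit the bundled SparkPi example from the edge node. The examples jar path below assumes the CDH Spark 2 parcel and may differ on your cluster:

# Show the Spark version picked up on the edge node
spark2-submit --version
# Run the SparkPi example on YARN in client mode (jar path may differ)
spark2-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_*.jar 10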


9. The next page is the Kafka configuration. Enter the username and password as usual. Also enter the URL of the metadata broker list. To find your broker list, return to the Cloudera Manager Roles page and look for the Kafka Broker role (KB).
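Kafka brokers listen on port 9092 by default. A couple of quick checks from the edge node, assuming netcat is available and the Kafka client scripts are on the PATH (the hostnames below are placeholders):

# Check that a broker port is reachable (replace kafka-host with a Kafka Broker hostname)
nc -zv kafka-host 9092
# List existing topics via ZooKeeper (replace zk-host with a ZooKeeper hostname)
kafka-topics --list --zookeeper zk-host:2181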


10. The next page is the Hive configuration. Enter the username and password; we will also enter the Metastore URI and the JDBC URL. You can find the Thrift host on the Cloudera Manager Roles page: look for Hive Metastore Server (HMS). The Hive Metastore Server port is usually 9083, but you can confirm it by searching for «hive.metastore.port» in the Hive configuration.

The JDBC URL also contains a hostname. Use any host that has the Hive Gateway (G) role.
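Both values can also be verified from the command line on the edge node. The hostname below is a placeholder, and HiveServer2 listens on port 10000 by default:

# Show the configured Metastore URI (thrift://<host>:9083 by default)
grep -A1 "hive.metastore.uris" /etc/hive/conf/hive-site.xml
# Test the JDBC URL with beeline, using the host you entered above (adjust user and password)
beeline -u "jdbc:hive2://hive-host:10000/default" -n oracle -p password -e "show databases;"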


11. The last page is the Oozie configuration. Find the Oozie Server role on the Cloudera Manager Roles page and enter its hostname in the host parameter. Leave the defaults for the other parameters.
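Before testing from ODI, you can check that the Oozie server is up from the edge node; Oozie listens on port 11000 by default, and the hostname below is a placeholder:

# Query Oozie server status; expect "System mode: NORMAL"
oozie admin -oozie http://oozie-host:11000/oozie -status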


12. Validate the configuration and click Finish.

13. Go to the Topology tab, choose the technologies one by one, and test them. The first technology to test is Hadoop.

14. The first time you test, you will get error ODI-26039: java.io.FileNotFoundException: File /user/oracle/odi_home does not exist.

This is because odi_home doesn’t exist or is empty. Click the Initialize button to create the odi_home directory in HDFS, then test again. The test should now succeed.
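After clicking Initialize, you can confirm from the edge node that the directory was created and populated (the path assumes the defaults used in this setup):

# List the ODI working directory created by Initialize
hdfs dfs -ls /user/oracle/odi_home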


15. Test the Hive connection. It should succeed.

16. Test the Kafka connection. I got an error while testing Kafka:

oracle.odi.runtime.agent.ExecutionException: oracle.odi.core.exception.OdiRuntimeException: java.lang.Exception: ODI-14184: Unable to load required classes from BDC_Kafka Data Server Classpath

This is because the initial classpath was configured for the old Kafka embedded in CDH, but Kafka is now installed in a separate parcel home. So we need to change the classpath to include the directory

/opt/cloudera/parcels/KAFKA/lib/kafka/libs/*
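Before saving the new classpath, it is worth confirming that the Kafka client jars are actually present in that directory; the parcel path is from my cluster and may differ on yours:

# Show the jars that will be picked up by the Kafka data server classpath
ls /opt/cloudera/parcels/KAFKA/lib/kafka/libs/*.jar | head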

17. So I ended up with the following configuration. Adjust yours accordingly and test again; the test should now be successful.


18. The last step is to test the Oozie engine connection. Again, while testing I got an error:

ODI-26140: Oozie Engine test failed.
[ File /user/oracle/odi_home/odi_12.2.1 does not exist. ]


19. Initialize the directories using the «Initialize» button and then test the connection again. Initialize uploads the required ODI artifacts to HDFS. The test should now pass.
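You can verify the upload from the edge node by listing the directory from the earlier error message (the path assumes the defaults used above):

# The ODI 12.2.1 artifacts uploaded to HDFS by Initialize
hdfs dfs -ls /user/oracle/odi_home/odi_12.2.1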

Conclusion

We have configured ODI to connect to Hadoop. We used a Linux edge node to install ODI, which is the easiest way. You can also install ODI on Windows, copy the required libraries, and adjust the CLASSPATH entries, and it will work just like ODI on Linux.

So we can connect to the Hadoop cluster and run jobs using Hive and Spark, and we can also schedule jobs using Oozie. The next post will be about doing some ETL using ODI on Hadoop.
