Oracle Data Integrator is a transparent, heterogeneous Big Data integration technology based on an open and lightweight ELT architecture. It runs a diverse set of workloads, including Spark, Spark Streaming, and Pig transformations, enabling customers to solve their most complex and time-sensitive data transformation and data movement challenges. It is a core component of Oracle Data Integration solutions and integrates seamlessly with the rest of Oracle’s Data Integration and Business Application solutions.
Oracle Data Integrator for Big Data provides the following benefits to customers:
- It brings expanded connectivity to various Big Data sources such as Apache Kafka and Cassandra
- It decreases time to value for Big Data projects
- It provides a future proof Big Data Integration technology investment
- It streamlines and shortens the Big Data development and implementation process
Currently, ODI supports:
- Generation of Pig Latin transformations: users can choose Pig Latin as their transformation language and execution engine for ODI mappings. Apache Pig is a platform for analyzing large data sets in Hadoop and uses the high-level language Pig Latin for expressing data analysis programs.
- Generation of Spark and Spark Streaming transformations: ODI mappings can also generate PySpark. Apache Spark is a transformation engine for large-scale data processing. It provides fast in-memory processing of large data sets. Custom PySpark code can be added through user-defined functions or the table function component.
- Orchestration of ODI jobs using Oozie: users have a choice between using the traditional ODI Agent or Apache Oozie as the orchestration engine for jobs such as mappings, packages, scenarios, or procedures. Apache Oozie allows fully native execution on Hadoop infrastructures without installing an ODI agent for orchestration. Users can utilize Oozie tooling to schedule, manage, and monitor ODI jobs. ODI uses Oozie’s native actions to execute Hadoop processes and conditional branching logic.
You can use Oracle Data Integrator to design the ‘what’ of an integration flow and assign knowledge modules to define the ‘how’ of the flow through an extensible range of mechanisms. The ‘how’ is the technology that executes the flow: Oracle, Teradata, Hive, Spark, Pig, and so on.
Let’s configure Oracle Data Integrator for a Cloudera Hadoop cluster. You don’t need to install any components on the cluster itself; a remote connection is enough to manage all jobs on Hadoop.
Install ODI Big Data
Before installing and configuring ODI, we need a Hadoop environment that is already configured. The simplest way to run ODI (or its agent) is to use an Edge (Gateway) node. You can find my post about configuring a Cloudera Edge (Gateway) node. An edge node doesn’t store or process any data; it is just a host where the ODI GUI is installed.
First of all, install ODI. This is easy.
1. Download ODI from edelivery.oracle.com or from otn.oracle.com. We will use the latest version, which is 12.2.1.
2. Check that we have certified Java. Certification matrix is available here http://www.oracle.com/technetwork/middleware/ias/downloads/fusion-certification-100350.html
3. Unpack and run installation
java -jar fmw_12.2.1_odi.jar
4. Enter directory for Oracle Inventory.
5. Choose directory to install ODI. Choose Standalone installation. Agree to other questions and wait until installation is finished.
Configure ODI repository
1. Oracle Data Integrator stores its repository and projects inside a database, so we need to create a repository. The RCU utility is used to create it.
2. Select «Oracle Data Integrator» on the next page, change the prefix for the new ODI repository schemas, and then enter a password for the created schemas.
3. Enter a password for SUPERVISOR. It will be used to authenticate users to ODI. Leave the defaults for the other parameters, then agree to the remaining options.
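If you prefer scripting the repository creation, RCU also has a silent command-line mode. This is only a sketch: the connect string, database user, and schema prefix below are hypothetical placeholders for your environment.

```shell
# Silent RCU run; host, SID, and prefix are hypothetical examples.
# RCU prompts for the schema passwords on stdin in silent mode.
./rcu -silent -createRepository \
  -connectString dbhost.example.com:1521:orcl \
  -dbUser sys -dbRole sysdba \
  -schemaPrefix DEV \
  -component ODI
```

The GUI wizard described above does the same thing interactively, so scripting is only worthwhile when you create repositories repeatedly.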
Configure Big Data Connections for Linux on Edge Node
The first step is to configure connections to all the Big Data technologies.
1. Run Oracle Data Integrator
2. Then click «Connect to Repository». Add a new connection in the dialog box by clicking «+» and enter the information required to connect.
3. Run the Big Data wizard by clicking File->New…, then choose Big Data Configuration.
4. Enter a prefix for the new connections, choose your distribution (CDH X.XX), change the directory for CDH if needed, and choose the technologies that will be configured.
5. The next page allows you to enter information for the Hadoop/HDFS connection. There are two main URIs you should change: the HDFS Name Node URI and the Resource Manager/Job Tracker URI. You can find the hostnames of these nodes in Cloudera Manager on the Hosts->Roles page.
6. Find the roles HDFS Name Node (NN) and YARN Resource Manager (RM), note the hostnames of these roles, and enter them into the Hadoop connection dialog. Also enter the authentication information, then click Next.
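Concretely, the two URIs typically look like the following on a default CDH install (8020 is the default NameNode port, 8032 the default ResourceManager port); the hostnames here are hypothetical placeholders for the NN and RM hosts you found in Cloudera Manager.

```shell
# Hypothetical hostnames; ports are CDH defaults.
NN_URI="hdfs://nn-host.example.com:8020"   # HDFS Name Node URI
RM_URI="rm-host.example.com:8032"          # Resource Manager / Job Tracker URI
echo "$NN_URI"
echo "$RM_URI"
# On the edge node you can double-check the NameNode URI with:
# hdfs getconf -confKey fs.defaultFS
```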
7. On the HDFS connection page, enter only a username and password.
8. Enter username and password on Spark page.
You should also edit the «Master Cluster (Data Server)» parameter. If you have Spark 1, enter «yarn-client», which runs the Spark driver locally on the edge node; if you have Spark 2, enter «yarn». I have Spark 2 and entered yarn.
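The choice can be sketched as a simple rule; SPARK_MAJOR below is a hypothetical variable standing in for your Spark major version.

```shell
# Pick the master string for the «Master Cluster (Data Server)» field.
SPARK_MAJOR=2   # hypothetical: set to your Spark major version (1 or 2)
if [ "$SPARK_MAJOR" -ge 2 ]; then
  MASTER="yarn"          # Spark 2: deploy mode is configured separately
else
  MASTER="yarn-client"   # Spark 1: driver runs locally on the edge node
fi
echo "$MASTER"
```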
9. The next page is the Kafka configuration. Enter a username and password as usual, and also enter the URL of the metadata broker list. To find your broker list, return to the Cloudera Manager Roles page and look for the Kafka Broker role (KB).
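The metadata broker list is a comma-separated set of host:port pairs; 9092 is the default Kafka listener port, and the hostnames below are hypothetical.

```shell
# One entry per Kafka Broker (KB) role found in Cloudera Manager.
BROKER_LIST="kb1.example.com:9092,kb2.example.com:9092"
echo "$BROKER_LIST"
```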
10. The next page is the Hive configuration. Enter a username and password; we will also enter the Metastore URI and the JDBC URL. You can find the Thrift host on the Cloudera Manager Roles page by looking for Hive Metastore Server (HMS). The Hive Metastore Server port is usually 9083, but you can confirm it by searching for «hive.metastore.port» in the Hive configuration.
The JDBC URL also contains a hostname. Use any host that has the Hive Gateway (G) role.
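Putting the two Hive values together, here is a sketch with hypothetical hostnames (9083 is the default metastore port, 10000 the default HiveServer2 port).

```shell
HMS_HOST="hms-host.example.com"   # host with the Hive Metastore Server (HMS) role
HIVE_GW="gw-host.example.com"     # any host with the Hive Gateway (G) role
METASTORE_URI="thrift://${HMS_HOST}:9083"
JDBC_URL="jdbc:hive2://${HIVE_GW}:10000/default"
echo "$METASTORE_URI"
echo "$JDBC_URL"
```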
11. The last page is the Oozie configuration. Find the Oozie Server role on the Cloudera Manager Roles page and enter its hostname in the host parameter. Leave the defaults for the other parameters.
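The Oozie server URL usually follows this pattern (11000 is the default Oozie port; the hostname is hypothetical).

```shell
OOZIE_URL="http://oozie-host.example.com:11000/oozie"
echo "$OOZIE_URL"
# If the oozie client is installed on the edge node, you can verify the server:
# oozie admin -status -oozie "$OOZIE_URL"
```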
12. Validate configuration and click Finish.
13. Go to the Topology tab, choose the technologies one by one, and test them. The first technology to test is Hadoop.
14. The first time, you will get error ODI-26039. java.io.FileNotFoundException: File /user/oracle/odi_home does not exist.
This is because odi_home doesn’t exist or is empty. Click the Initialize button to create the odi_home directory in HDFS, then test again. This time the test should succeed.
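As an alternative to the Initialize button, the directory from the error message can also be created manually from the edge node, assuming the hdfs client is on the PATH.

```shell
# Create the ODI home directory in HDFS and confirm it exists.
hdfs dfs -mkdir -p /user/oracle/odi_home
hdfs dfs -ls /user/oracle
```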
15. Test the Hive connection. It should succeed.
16. Test the Kafka connection. I got an error while testing Kafka:
oracle.odi.runtime.agent.ExecutionException: oracle.odi.core.exception.OdiRuntimeException: java.lang.Exception: ODI-14184: Unable to load required classes from BDC_Kafka Data Server Classpath
This is because the initial path was configured for the old Kafka embedded in CDH, but Kafka is now installed in a separate home. So we need to change the classpath to include that directory.
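On CDH, standalone Kafka is typically delivered as a parcel, so the client jars live under the parcel directory. This is a sketch assuming the default parcel location; verify the actual path on your edge node before adding it to the Data Server classpath.

```shell
KAFKA_LIBS="/opt/cloudera/parcels/KAFKA/lib/kafka/libs"   # hypothetical default parcel path
# Add this pattern to the Kafka Data Server classpath in ODI:
echo "${KAFKA_LIBS}/*"
# ls "$KAFKA_LIBS"   # confirm the jars are actually there
```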
17. So I have the following configuration. Apply it and test again; now the test should be successful.
18. The last step is to test the Oozie engine connection. Again, while testing I got an error:
ODI-26140: Oozie Engine test failed.
[ File /user/oracle/odi_home/odi_12.2.1 does not exist. ]
19. We should initialize the directories using the «Initialize» button and then test the connection again. Initialization loads the required artifacts to HDFS. The test should work now.
We have configured ODI to connect to Hadoop. We used a Linux edge node to install ODI, which is the easiest way. You can also install ODI on Windows, copy the required libraries, and adjust the CLASSPATH entries, and it will work just like ODI on Linux.
Now we can connect to the Hadoop cluster and run jobs using Hive and Spark, and we can also schedule jobs using Oozie. The next post will be about doing some ETL using ODI on Hadoop.