Oracle DataSource for Apache Hadoop (OD4H): introduction

Introduction

Hadoop is increasingly becoming part of the enterprise data warehouse family, and the members of that family need to talk to each other. Sometimes we need to access Hadoop data from Oracle Database; sometimes Hadoop users need enterprise data stored in Oracle Database.

Hive has a very interesting concept: external tables, which let you plug in Java classes that access an external database and present it as if it were a native Hive table.

Oracle Datasource for Apache Hadoop (formerly Oracle Table Access for Apache Hadoop) turns Oracle Database tables into a Hadoop data source (i.e., an external table), enabling direct and consistent HiveQL/Spark SQL queries as well as direct Hadoop API access. Applications can join master or dimension data in Oracle Database with data stored in Hadoop. Additionally, data can be written back to Oracle Database after processing.
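For example, once an Oracle table is exposed this way, it can be joined with native Hive tables in plain HiveQL. A hypothetical sketch (oracle_customers being the OD4H-backed external table and web_logs a native Hive table; both names are invented for illustration):

    -- Join dimension data living in Oracle with fact data living in Hadoop.
    SELECT c.customer_name, COUNT(*) AS hits
    FROM web_logs l
    JOIN oracle_customers c ON (l.customer_id = c.customer_id)
    GROUP BY c.customer_name;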

Oracle Datasource for Apache Hadoop optimizes query execution plans using predicate and projection pushdown and partition pruning. Database table access is performed in parallel based on the selected split pattern, using smart and secure connections (Kerberos, SSL, Oracle Wallet), regulated by both Hadoop (i.e., maximum concurrent tasks) and Oracle DBAs (i.e., max pool size).
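The split pattern is chosen per table through table properties. A hedged sketch, using the property names from the OD4H documentation (verify them against the docs for your release; the table name is hypothetical):

    -- Fetch the Oracle table in row-based splits of 1000 rows each,
    -- so several mappers can read it in parallel.
    ALTER TABLE my_oracle_table SET TBLPROPERTIES (
      'oracle.hcat.osh.splitterKind' = 'ROW_SPLITTER',
      'oracle.hcat.osh.rowsPerSplit' = '1000'
    );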

How does Oracle DataSource for Apache Hadoop (OD4H) work?

You can start working with OD4H using the following steps:

  1. Create a new Oracle table, or reuse an existing one.
  2. Write the Hive DDL that creates the external table referencing the Oracle table.
  3. Issue HiveQL, Spark SQL, or other Spark/Hadoop queries and API calls.

How to set it up

1. Oracle Datasource for Apache Hadoop requires the following (check the current certification information here: https://www.oracle.com/database/big-data-connectors/certifications.html):

  • Oracle Database 12c, 11.2.0.4, or earlier (as long as you can query it with the Oracle JDBC driver 12c).
    Oracle Database 12c is preferable, as it gives better performance.
  • Hadoop: Cloudera CDH5, Hortonworks 2.x, or Apache Hadoop 2.2.0-2.6.0.
  • Apache Hive 0.13.0, 0.13.1, or 1.1.0.

2. Download the Oracle Datasource for Apache Hadoop connector. There are two places where you can get it:
– the standalone archive: http://www.oracle.com/technetwork/database/database-technologies/bdc/big-data-connectors/downloads/index.html
– the Oracle Mammoth archive from support.oracle.com (the BDA Extras bundle contains OD4H)

3. I’ve downloaded the standalone archive, named od4h-1.2.0.0.zip, so let’s unzip it on each Hadoop node. I’ll use the free tool pssh to run the unzip command on all nodes; a sketch follows.
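A minimal sketch, assuming a hosts.txt file that lists one Hadoop node per line (on some distributions the parallel copy tool is named pscp.pssh instead of pscp):

    # Copy the archive to every node, then unzip it in parallel.
    # Assumes the archive unpacks into an od4h directory under /opt/oracle.
    pscp -h hosts.txt od4h-1.2.0.0.zip /tmp/
    pssh -h hosts.txt -i "sudo unzip -o /tmp/od4h-1.2.0.0.zip -d /opt/oracle"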

4. Add /opt/oracle/od4h/jlib/* to HIVE_AUX_JARS_PATH.
– If you have Cloudera Manager, choose Hive -> Configuration and search for HIVE_AUX_JARS_PATH.
– If you don’t have Cloudera Manager, configure this variable in the hive-env.sh file, as sketched below.
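A minimal sketch of the hive-env.sh entry (the path assumes the archive was unpacked under /opt/oracle; append your existing value if the variable is already set):

    # hive-env.sh: expose the OD4H jars to Hive
    export HIVE_AUX_JARS_PATH=/opt/oracle/od4h/jlib/*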

5. That’s it. We are ready to create an external Hive table linked to an Oracle table.

Create a simple example

1. Run the following query in Hive.
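Here is a minimal sketch of such a DDL, modeled on the OD4H documentation; the table name, columns, and connection details (JDBC URL, username, password) are placeholder values for illustration, not the ones from my environment:

    CREATE EXTERNAL TABLE EmployeeDataSimple (
      Emp_ID     INT,
      First_Name VARCHAR(20),
      Last_Name  VARCHAR(20),
      Job_Title  VARCHAR(40),
      Salary     INT
    )
    STORED BY 'oracle.hcat.osh.OracleStorageHandler'
    WITH SERDEPROPERTIES (
      'oracle.hcat.osh.columns.mapping' = 'Emp_ID,First_Name,Last_Name,Job_Title,Salary'
    )
    TBLPROPERTIES (
      'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@dbhost:1521/orcl',
      'mapreduce.jdbc.username' = 'scott',
      'mapreduce.jdbc.password' = 'tiger',
      'mapreduce.jdbc.input.table.name' = 'EmployeeData'
    );

Note that mapreduce.jdbc.input.table.name carries no schema prefix; that is what the note below is about.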

Attention! You can’t specify the table schema (owner) here; you have to connect as the table owner.

2. Run a query using beeline.
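A hypothetical invocation (adjust the HiveServer2 URL and user to your cluster):

    beeline -u jdbc:hive2://hiveserver:10000/default -n hive \
            -e "SELECT COUNT(*) FROM EmployeeDataSimple;"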

3. Get the output

Conclusion

We can see from the output that the query ran with 1 map and 1 reduce task. This is fine for small tables, but for large tables it will be a very slow configuration: we need more mappers and more database sessions to fetch the data faster. I will talk about OD4H performance tuning in my next posts.
