Introduction and configuring Oracle Data Integrator for Big Data (Cloudera Hadoop)

Apache Hadoop is designed to handle and process data that typically comes from non-relational sources, in volumes beyond what relational databases can handle.

Oracle Data Integrator is a transparent and heterogeneous Big Data Integration technology based on an open and lightweight ELT architecture. It runs a diverse set of workloads, including Spark, Spark Streaming and Pig transformations, to enable customers to solve their most complex and time-sensitive data transformation and data movement challenges. It is a core component of Oracle Data Integration solutions, integrating seamlessly with the rest of Oracle’s Data Integration and Business Application solutions.

Oracle Data Integrator for Big Data provides the following benefits to customers:

  • It brings expanded connectivity to various Big Data sources such as Apache Kafka or Cassandra
  • It decreases time to value for Big Data projects
  • It provides a future-proof Big Data Integration technology investment
  • It streamlines and shortens the Big Data development and implementation process

Currently ODI supports:

  • Generation of Pig Latin transformations: users can choose Pig Latin as their transformation language and execution engine for ODI mappings. Apache Pig is a platform for analyzing large data sets in Hadoop and uses the high-level language Pig Latin for expressing data analysis programs.
  • Generation of Spark and Spark Streaming transformations: ODI mappings can also generate PySpark. Apache Spark is a transformation engine for large-scale data processing. It provides fast in-memory processing of large data sets. Custom PySpark code can be added through user-defined functions or the table function component.
  • Orchestration of ODI Jobs using Oozie: users have a choice between using the traditional ODI Agent or Apache Oozie as the orchestration engine for jobs such as mappings, packages, scenarios, or procedures. Apache Oozie allows fully native execution on Hadoop infrastructures without installing an ODI agent for orchestration. Users can utilize Oozie tooling to schedule, manage, and monitor ODI jobs. ODI uses Oozie’s native actions to execute Hadoop processes and conditional branching logic.

You can use Oracle Data Integrator to design the ‘what’ of an integration flow and assign knowledge modules to define the ‘how’ of the flow in an extensible range of mechanisms. The ‘how’ is the technology that executes the flow: Oracle, Teradata, Hive, Spark, Pig, etc.

Let’s configure Oracle Data Integrator for Cloudera Hadoop. You don’t need to install any components on your Hadoop cluster. A remote connection is enough to manage all jobs on Hadoop.

Installing Edge (Gateway) Node for Hadoop or Install client for Hadoop

Many tools use Hadoop as a backend for performing some jobs. For example, we can use Kafka (or HDFS) as a staging area for Oracle Data Integrator or GoldenGate. Usually it is better to install a separate node that is used exclusively by ODI or GoldenGate, because if we install them on a Hadoop node they will interfere with other workloads. Hadoop is a cluster: each node does its share of the work, and the whole job is not finished until the last node is finished. So caravans move at the speed of the slowest camel.
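The "slowest camel" point can be shown with a little arithmetic. The following is only a sketch with made-up numbers: the node names and durations are hypothetical, and the third node stands in for a worker that is slowed down by an extra tool running on it.

```python
# Hypothetical per-node task durations in seconds (illustrative numbers only).
# "node3" stands in for a worker that also hosts another tool and runs slower.
node_seconds = {"node1": 40, "node2": 42, "node3": 95}


def job_duration(durations):
    """A parallel job finishes only when its slowest node finishes."""
    return max(durations.values())


# The node that determines the overall job time.
slowest = max(node_seconds, key=node_seconds.get)
print(f"job takes {job_duration(node_seconds)}s, limited by {slowest}")
```

Two of the three nodes finish in about 40 seconds, yet the job as a whole takes as long as the slowest one, which is why it helps to keep client tools off the worker nodes.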

Hadoop vendors call such a special node an “Edge” or “Gateway” node. These nodes don’t contain any data and don’t participate in data processing, but they host client software and the Hadoop configuration. Let’s look at how to install such a node. I will use the Cloudera distribution and Cloudera Manager as the management tool.

Why do we need to configure Edge nodes using tools like Cloudera Manager or Ambari? Because the software and configuration need to be kept up to date. We shouldn’t have to worry when somebody adds a new Kafka broker or changes a ZooKeeper host. The management tool takes care of that.

So let’s start.

GoldenGate 12.3: Microservice Architecture (MA)

The Microservices Architecture (MA) for Oracle GoldenGate is a new REST API Microservices-based architecture that allows you to install, configure, monitor, and manage Oracle GoldenGate services using a web-based UI.

In effect there are now two versions of GoldenGate: classic and microservices. The classic architecture has the standard extract, replicat, pump and receiver processes and is managed by the classic ggsci. The Microservices Architecture (MA) has different types of processes and is managed using the Admin Client or the web UI. See the architecture of GoldenGate MA below.

Oracle GoldenGate MA is designed with the industry-standard HTTP communication protocol and JSON data interchange format.

The classic architecture was managed using the ggsci console and had weak authentication and authorization tools. Oracle GoldenGate MA can verify identity using basic authentication and using SSL client certificates.

GoldenGate MA processes

Oracle GoldenGate MA uses different types of processes to perform the same tasks as GoldenGate Classic. Let’s talk a little bit about the new processes:

Service Manager. Something like (and a replacement for) the Manager process. This is a watchdog for the other processes.

Administration Server. Something like the ggsci console. It operates as the central control entity. You use it to create and manage other processes. The key feature of the Administration Server is its REST API, which can be accessed from any HTTP or HTTPS client.

You can add, delete or alter GoldenGate processes, edit configuration files, add users and assign roles using the Administration Server.
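As a rough illustration of what "any HTTP or HTTPS client" means here, the sketch below builds (but does not send) a request to the Administration Server using basic authentication. The host, port, user name and password are placeholders, and the `/services/v2/extracts` path is an assumption that should be checked against the REST API reference for your GoldenGate version.

```python
import base64
import urllib.request


def build_request(base_url, path, user, password):
    """Build a GET request carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(base_url.rstrip("/") + path, method="GET")
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Accept", "application/json")  # MA exchanges JSON
    return req


# Placeholder host, port and credentials; the path is an assumed endpoint.
req = build_request(
    "https://gg-host.example.com:9001",
    "/services/v2/extracts",
    "ggadmin",
    "secret",
)
print(req.get_method(), req.full_url)

# To actually call the server, something like this would be needed:
#   with urllib.request.urlopen(req) as resp:
#       payload = json.load(resp)   # requires "import json"
```

The request is only constructed here so the example runs without a GoldenGate deployment. A real call over HTTPS would also need the server certificate to be trusted by the client.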

Receiver Server. Something like the collector. It can receive trail files from a remote server. However, it replaces multiple collectors because it is multithreaded. The Receiver Server was designed to be protocol agnostic, so it supports HTTPS, HTTP, UDT (reliable UDP) and classic GoldenGate TCP transports. By default it uses the HTTPS protocol.

Distribution Server. Something like the pump. But again, this is a multithreaded process that can handle multiple trails at the same time, so it replaces multiple pumps. And again it supports multiple protocols: WebSockets for HTTPS-based streaming (which relies on SSL security), UDT, SOCKS5 and HTTP. It also supports Passive mode, in which the connection is initiated from the remote side.

Performance Metrics Server. This is the process that collects and saves information from other processes (extracts, replicats, etc.). All GoldenGate processes push information to the Performance Metrics Server. It is now the only process that writes data to the GoldenGate datastore (Berkeley DB). You can use the Performance Metrics Server to query various metrics, view logs and process statuses, monitor system utilization, etc.

Admin Client. It is a command-line utility (similar to ggsci) used to create, configure and manage processes. The Admin Client uses the REST API to accomplish its tasks.
