Introduction and configuring Oracle Data Integrator for Big Data (Cloudera Hadoop)

Apache Hadoop is designed to handle and process data that typically comes from non-relational sources, in volumes beyond what relational databases can handle.

Oracle Data Integrator is a transparent and heterogeneous Big Data Integration technology based on an open and lightweight ELT architecture. It runs a diverse set of workloads, including Spark, Spark Streaming and Pig transformations, to help customers solve their most complex and time-sensitive data transformation and data movement challenges. It is a core component of Oracle Data Integration solutions, integrating seamlessly with the rest of Oracle’s Data Integration and Business Application solutions.

Oracle Data Integrator for Big Data provides the following benefits to customers:

  • It brings expanded connectivity to various Big Data sources such as Apache Kafka or Cassandra
  • It decreases time to value for Big Data projects
  • It provides a future-proof Big Data Integration technology investment
  • It streamlines and shortens the Big Data development and implementation process

Currently ODI supports:

  • Generation of Pig Latin transformations: users can choose Pig Latin as their transformation language and execution engine for ODI mappings. Apache Pig is a platform for analyzing large data sets in Hadoop and uses the high-level language Pig Latin for expressing data analysis programs.
  • Generation of Spark and Spark Streaming transformations: ODI mappings can also generate PySpark. Apache Spark is a transformation engine for large-scale data processing. It provides fast in-memory processing of large data sets. Custom PySpark code can be added through user-defined functions or the table function component.
  • Orchestration of ODI Jobs using Oozie: users can choose between the traditional ODI Agent and Apache Oozie as the orchestration engine for jobs such as mappings, packages, scenarios, or procedures. Apache Oozie allows fully native execution on Hadoop infrastructures without installing an ODI agent for orchestration. Users can utilize Oozie tooling to schedule, manage, and monitor ODI jobs. ODI uses Oozie’s native actions to execute Hadoop processes and conditional branching logic.

You can use Oracle Data Integrator to design the ‘what’ of an integration flow and assign knowledge modules to define the ‘how’ of the flow in an extensible range of mechanisms. The ‘how’ is the technology that executes the flow: Oracle, Teradata, Hive, Spark, Pig, etc.

Let’s configure Oracle Data Integrator for Cloudera Hadoop. You don’t need to install any components on your Hadoop cluster. A remote connection is enough to manage all jobs on Hadoop.

Continue reading ‘Introduction and configuring Oracle Data Integrator for Big Data (Cloudera Hadoop)’ »

Installing Edge (Gateway) Node for Hadoop or Install client for Hadoop

Many tools use Hadoop as a backend for some of their jobs. For example, we can use Kafka (or HDFS) as a staging area for Oracle Data Integrator or GoldenGate. It is usually better to install a separate node used exclusively by ODI or GoldenGate, because if we install them on a Hadoop node they will interfere with other workloads. Hadoop is a cluster: each node does its part of the work, and the whole job is not finished until the last node finishes. A caravan moves at the speed of its slowest camel.
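The "slowest camel" point can be shown with a few lines of C++. This is only a sketch of the reasoning, assuming one task per node running in parallel; the function name `job_completion_seconds` and the timings are made up for illustration.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: wall-clock time of a distributed job whose tasks run
// in parallel, one per node. The job ends only when the slowest node ends,
// so the result is the maximum of the per-node times, not their average.
double job_completion_seconds(const std::vector<double>& node_seconds) {
    if (node_seconds.empty()) {
        return 0.0;  // no nodes, nothing to wait for
    }
    return *std::max_element(node_seconds.begin(), node_seconds.end());
}
```

With three nodes finishing in 10, 12 and 11 seconds, the job takes 12 seconds. If the third node also hosts an ODI or GoldenGate process and needs 45 seconds, the whole job takes 45 seconds, even though the other two nodes were idle most of that time.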

Hadoop vendors call such a special node “Edge” or “Gateway”. These nodes don’t contain any data and don’t participate in data processing, but they host client software and Hadoop configuration. Let’s look at how to install such a node. I will use the Cloudera distribution and Cloudera Manager as the management tool.

Why do we need to configure Edge nodes using tools like Cloudera Manager or Ambari? Because software and configuration must be kept up to date. We shouldn’t have to worry when somebody adds a new Kafka broker or changes a ZooKeeper host. The management tool takes care of this.

So let’s start.

Continue reading ‘Installing Edge (Gateway) Node for Hadoop or Install client for Hadoop’ »

GoldenGate 12.3: Microservice Architecture (MA)

The Microservices Architecture (MA) for Oracle GoldenGate is a new REST API Microservices-based architecture that allows you to install, configure, monitor, and manage Oracle GoldenGate services using a web-based UI.

There are now two versions of GoldenGate: classic and microservice. The classic architecture has the standard extract, replicat, pump and receiver processes and is managed with the classic ggsci. Microservice Architecture (MA) has different types of processes and is managed using Admin Client or the web UI. See the architecture of GoldenGate MA below.

Oracle GoldenGate MA is designed with the industry-standard HTTP communication protocol and JSON data interchange format.

The classic architecture was managed using the ggsci console and had weak authentication and authorization tools. Oracle GoldenGate MA can verify identity using basic authentication and SSL client certificates.
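To make "basic authentication" concrete, here is a minimal sketch of how any HTTP client builds the Basic `Authorization` header (the generic scheme from RFC 7617). It is not GoldenGate code: the function names `base64_encode` and `basic_auth_header` are my own, and the credentials in the example are placeholders.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Standard Base64 (RFC 4648) encoding of arbitrary bytes.
std::string base64_encode(const std::string& input) {
    static const char kAlphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((input.size() + 2) / 3) * 4);
    std::size_t i = 0;
    // Encode each full 3-byte group as 4 output characters.
    while (i + 2 < input.size()) {
        std::uint32_t n = (static_cast<unsigned char>(input[i]) << 16) |
                          (static_cast<unsigned char>(input[i + 1]) << 8) |
                          static_cast<unsigned char>(input[i + 2]);
        out += kAlphabet[(n >> 18) & 63];
        out += kAlphabet[(n >> 12) & 63];
        out += kAlphabet[(n >> 6) & 63];
        out += kAlphabet[n & 63];
        i += 3;
    }
    // Encode the 1 or 2 leftover bytes and pad with '='.
    const std::size_t rest = input.size() - i;
    if (rest == 1) {
        std::uint32_t n = static_cast<unsigned char>(input[i]) << 16;
        out += kAlphabet[(n >> 18) & 63];
        out += kAlphabet[(n >> 12) & 63];
        out += "==";
    } else if (rest == 2) {
        std::uint32_t n = (static_cast<unsigned char>(input[i]) << 16) |
                          (static_cast<unsigned char>(input[i + 1]) << 8);
        out += kAlphabet[(n >> 18) & 63];
        out += kAlphabet[(n >> 12) & 63];
        out += kAlphabet[(n >> 6) & 63];
        out += '=';
    }
    return out;
}

// Builds the header line an HTTP client sends for Basic authentication:
// the Base64 encoding of "user:password" after the word "Basic".
std::string basic_auth_header(const std::string& user,
                              const std::string& password) {
    return "Authorization: Basic " + base64_encode(user + ":" + password);
}
```

Calling `basic_auth_header("Aladdin", "open sesame")` yields `Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==`, the example from RFC 7617. Base64 is an encoding, not encryption, so this scheme only makes sense over HTTPS.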

GoldenGate MA processes

Oracle GoldenGate MA uses different types of processes to perform the same tasks as GoldenGate Classic. Let’s talk a little bit about the new processes:

Service Manager. Similar to (and a replacement for) the Manager process. It is a watchdog for the other processes.

Administration Server. Similar to the ggsci console. It operates as the central control entity. You use it to create and manage other processes. The key feature of Administration Server is its REST API, which can be accessed from any HTTP or HTTPS client.

You can add, delete or alter GoldenGate processes, edit configuration files, add users and assign roles using Administration Server.

Receiver Server. Similar to the collector. It receives trail files from a remote server. However, it replaces multiple collectors because it is multithreaded. Receiver Server was designed to be protocol agnostic, so it supports HTTPS, HTTP, UDT (reliable UDP) and classic GoldenGate TCP transports. By default it uses the HTTPS protocol.

Distribution Server. Similar to the pump. Again, it is a multithreaded process that can handle multiple trails at the same time, so it replaces multiple pumps. It also supports multiple protocols: WebSockets for HTTPS-based streaming (which relies on SSL security), UDT, SOCKS5 and HTTP. It also supports Passive mode, in which the connection is initiated from the remote side.

Performance Metrics Server. This process collects and saves information from other processes (extracts, replicats, etc.). All GoldenGate processes push information to Performance Metrics Server. It is now the only process that writes data to the GoldenGate datastore (Berkeley DB). You can use Performance Metrics Server to query various metrics, view logs and process statuses, monitor system utilization, etc.

Admin Client. A command-line utility (similar to ggsci) used to create, configure and manage processes. Admin Client uses the REST API to accomplish its tasks.


Continue reading ‘GoldenGate 12.3: Microservice Architecture (MA)’ »

Error while compiling program with oci.h

Some time ago I was developing a small utility in C++, using Visual Studio as IDE and compiler. I included oci.h to connect to Oracle Database like this:

#include <stdio.h>
#include "oci.h"

but got the following errors:

Error    C2371    ‘BOOLEAN’: redefinition; different basic types
Error    C2632    ‘char’ followed by ‘int’ is illegal
Warning    C4091    ‘typedef ‘: ignored on left of ‘unsigned char’ when no variable is declared
Error (active)    E0084    invalid combination of type specifiers

This happens because oratypes.h and Wtypesbase.h (Wtypes.h) conflict when defining the type boolean. The problem was solved by adding "Wtypesbase.h" as the first include, like this:

#include "Wtypesbase.h"
#include <stdio.h>
#include "oci.h"

GoldenGate 12.3: announcement, new features and installation

Oracle released the new version, GoldenGate 12.3, on 18 August. This version was long awaited: it was postponed 2 or 3 times because of some very important new features. Here are some useful links for GoldenGate 12.3:

Continue reading ‘GoldenGate 12.3: announcement, new features and installation’ »

Oracle DataSource for Apache Hadoop (OD4H): introduction

Introduction

Hadoop is now becoming part of the Enterprise Data Warehouse family. But family members should be connected to each other. Sometimes we need access to Hadoop from Oracle Database. Sometimes Hadoop users need enterprise data stored in Oracle Database.

Hive has a very interesting concept, External Tables, which allows you to define Java classes to access an external database and present it as a native Hive table.

Oracle Datasource for Apache Hadoop (formerly Oracle Table Access for Apache Hadoop) turns Oracle Database tables into a Hadoop data source (i.e., external table) enabling direct and consistent Hive QL/Spark SQL queries, as well as direct Hadoop API access. Applications can join master data or dimension data in Oracle Database with data stored in Hadoop. Additionally data can be written back to Oracle Database after processing.

Oracle Datasource for Apache Hadoop optimizes a query’s execution plans using predicate and projection pushdown, and partition pruning. Database table access is performed in parallel based on the selected split patterns, using smart and secure connections (Kerberos, SSL, Oracle Wallet), regulated by both Hadoop (i.e., maximum concurrent tasks) and Oracle DBAs (i.e., max pool size).

Continue reading ‘Oracle DataSource for Apache Hadoop (OD4H): introduction’ »

GoldenGate Cloud Service (GGCS): Configure GoldenGate to replicate data

GoldenGate Cloud Service is part of Oracle’s PaaS portfolio. From a technical perspective it is just standard GoldenGate deployed on a VM in Oracle Cloud. So the same proven architecture works in the cloud.

GGCS can be used for different cases, from zero-downtime migration to real-time DWH feeding. More cases, like Big Data and data pipeline feeding, are on the way.

So what do you need to use GoldenGate Cloud Service? You should have:

  • a database instance in the cloud (DBaaS or ExadataCS)
  • a subscription for GoldenGate Cloud Service
  • a storage cloud service (it is used for backup)

GGCS is currently available as a Non-Metered service. With the Non-Metered service you pay even when your GoldenGate instance is down.

Soon GGCS will be available as a Metered Service, so it will be possible to pay on a per-hour basis. This capability will open up new cases like Dev/Test Cloud Environment Synchronization. Imagine you have a database in the cloud for testing purposes and you need to synchronize it periodically (every week or month) with the production database. You don’t need GGCS running all the time; you can run it for 2 hours every Sunday to apply the captured data. This approach can save a lot of money.
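The arithmetic behind that saving is easy to sketch. The snippet below only compares running hours; the helper names and the flat hourly rate are assumptions for illustration, not real Oracle pricing.

```cpp
// Hours in a week for an instance that is always on.
constexpr int kHoursPerWeek = 7 * 24;  // 168

// Hypothetical helper: weekly cost for a number of running hours at an
// assumed flat hourly rate (a placeholder, not an actual GGCS price).
constexpr double weekly_cost(int running_hours, double hourly_rate) {
    return running_hours * hourly_rate;
}

// How many times fewer hours are billed when the service runs only for
// the hours it is actually needed each week.
constexpr double savings_factor(int needed_hours) {
    return static_cast<double>(kHoursPerWeek) / needed_hours;
}
```

For the 2-hours-every-Sunday case, `savings_factor(2)` is 84: under a flat per-hour rate you would be billed for 2 hours instead of 168.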

Continue reading ‘GoldenGate Cloud Service (GGCS): Configure GoldenGate to replicate data’ »

Oracle Database Cloud Service: Create database

Oracle Cloud provides several Oracle Database offerings. You can choose from:

  • a single-schema-based service
  • a virtual machine with a fully configured and running Oracle Database instance
  • Exadata Service with all the database features

You can look into details here: https://cloud.oracle.com/en_US/database

We will talk about Database as a Service here, not about Schema or Exadata. My final goal is to create a database for GoldenGate replication, which is a separate service. OK, let’s start.

Continue reading ‘Oracle Database Cloud Service: Create database’ »

Oracle Storage Cloud Service: Creating Containers Using the REST API

Introduction

Let’s define the terms that Oracle uses in Oracle Cloud Services.

  • Block Storage – optimizes storage for IOPS and block-based access and provides POSIX-compliant file systems for Oracle Compute Cloud Service instances. This is just a standard disk device. Sometimes it is a single drive, sometimes a RAID device. Either way, applications access it using standard disk operations.
  • Object Storage – scalable storage which can store large binary objects with metadata and a unique ID. Multiple storage nodes form a single, shared, horizontally scalable pool. Applications can access data using the REST API.

Oracle Storage Cloud Service provides a low cost, reliable, secure, and scalable object-storage solution for storing unstructured data and accessing it anytime from anywhere. It is ideal for data backup, archival, file sharing, and for storing large amounts of unstructured data like logs, sensor-generated data, and VM images.

Continue reading ‘Oracle Storage Cloud Service: Creating Containers Using the REST API’ »

GoldenGate Studio: Quick Start

Introduction

If you work with GoldenGate, you know that there was a tool, GoldenGate Director, for configuration, deployment, monitoring and management. In practice these responsibilities are split between development and support, so there should be two tools: one for development and one for support. Oracle follows this idea and has created two tools:

  • GoldenGate Monitor (or GG plugin for OEM if you are OEM fan) – for management and monitoring
  • GoldenGate Studio – for design and deployment

What do we need from a development tool? I have the following list in my head:

  • WYSIWYG/drag’n’drop interface
  • easy movement of configuration along the Dev/QA/Prod path
  • versioning
  • team collaboration

GoldenGate Studio has these features. It was created to be a great, scalable development tool.

Continue reading ‘GoldenGate Studio: Quick Start’ »