Building a Modern Data Lake on Oracle Cloud Infrastructure

Shadab Mohammad
Published in Oracle Developers
8 min read · Jul 28, 2021

Introduction

A data lake is the evolution of the data warehouse in both an etymological and a functional sense. In the past, enterprises had relational databases as data sources from which data was extracted, transformed, and then loaded into a central repository. These processes ran daily in batch mode and were appropriately termed ETL (Extract, Transform, Load), and the central database was referred to as a data warehouse. Data warehouse became a generic term for a huge database where you park your goods (data, in this case) for a long period of time before extracting value from them using analytics and visualization. This approach to deriving value from historical data suited the on-premises world, where both the volume and velocity of data were predictable and capacity planning was a matter of counting your data sources and estimating data growth. However, with the advent of Big Data computing and the change in the nature of incoming data, this approach did not scale well, especially with extremely large volumes of data.

Data lakes offer an almost infinitely scalable storage pool dedicated to collecting and storing structured (relational databases), semi-structured (NoSQL databases), and unstructured (video, image, audio) data. With data lakes, the approach is to collect data from a wide variety of sources at huge volume and velocity. The storage can easily be scaled to petabytes while collecting all sorts of data, without having to capacity plan a huge, centralised database. This approach to collecting and storing historical data led to a slight change in the ETL process: the ‘L’ shifted before the ‘T’, making it ELT (Extract, Load, Transform). You now extract the data first, load it into a centralised staging area in its raw format, and transform it later to derive business value from it. To store these massive amounts of data, distributed object stores are used instead of expensive custom-built data warehouse appliances or filesystems. Object storage is a much cheaper storage mechanism and can retain data at a lower total cost of ownership (TCO). Internet-scale object storage made this cheaper still and enabled companies to grow their data lakes to exabytes.
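
The shift from ETL to ELT can be illustrated with a minimal sketch. It uses an in-memory dictionary as a stand-in for an object store, and the names `raw_zone`, `load_raw`, and `transform_later` are hypothetical, chosen only for illustration; they are not part of any OCI API.

```python
import json

# In-memory stand-in for an object store bucket (hypothetical; a real data
# lake would use a service such as OCI Object Storage).
raw_zone = {}


def load_raw(object_name, payload):
    """Extract + Load: store the payload exactly as received, with no parsing."""
    raw_zone[object_name] = payload


def transform_later(object_name):
    """Transform: parse and reshape only when the data is actually needed."""
    records = json.loads(raw_zone[object_name])
    # Example transformation: keep only completed orders and total them.
    completed = [r for r in records if r["status"] == "completed"]
    return {"orders": len(completed), "revenue": sum(r["amount"] for r in completed)}


# Load first, in raw format...
load_raw(
    "orders/2021-07-28.json",
    '[{"status": "completed", "amount": 40}, {"status": "cancelled", "amount": 15},'
    ' {"status": "completed", "amount": 60}]',
)

# ...and transform later, when a business question comes up.
summary = transform_later("orders/2021-07-28.json")
print(summary)
```

Because the raw payload is kept untouched, a different transformation can be run over the same object later without going back to the source system.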

Datalake Architecture

The journey of your data from an upstream system into your data lake, and from there to your downstream systems for consumption, can be divided into four broad stages: Ingestion, Storage, Transformation, and Visualisation. Once these stages are completed, you can derive business value from your data.

Now let us look at what a modern data lake on OCI would look like and how the various stages can be built using OCI services.

Data in a data lake needs to be maintained in three different silos. The reason for this segregation is to ensure that the intrinsic value of each category of data is identifiable and that data retention is appropriately defined to avoid excess cost. The first stage is storage of raw data. This is where you will typically just unload the incoming streams to dump the data. Data can come from different sources and can be staged in different buckets. The second stage is where we do some basic cleaning of the data and store it in categories according to the type of application from which the data originated. Categorizing data application-wise is optional; you can categorize it based on your business needs instead.

The final storage stage is where the data is cleansed, massaged, and tagged so it is readily available to be consumed or loaded into your visualization tools. This would include running machine learning algorithms and other data science techniques to make predictions on business outcomes based on your organizational needs. How you categorize the data in this final stage again depends on your business requirements, but a good practice is to store it according to the organisation’s business structure. This is also important from a security perspective, so that only the appropriate analysts have access to the data based on their roles.
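
One way to picture the three storage stages is as a naming convention for objects. The sketch below is only an illustration: the bucket names in `ZONES`, the `object_path` helper, and the date-based layout are all assumptions, not an OCI requirement.

```python
from datetime import date

# Hypothetical bucket names for the three storage stages described above.
ZONES = {
    "raw": "datalake-raw",
    "cleaned": "datalake-cleaned",
    "curated": "datalake-curated",
}


def object_path(zone, category, day, filename):
    """Build 'bucket/category/yyyy/mm/dd/filename' for a given zone.

    `category` is the source system for raw data, the originating application
    for cleaned data, and the business unit for curated data.
    """
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{ZONES[zone]}/{category}/{day:%Y/%m/%d}/{filename}"


day = date(2021, 7, 28)
print(object_path("raw", "pos-stream", day, "events.json"))
print(object_path("cleaned", "billing-app", day, "events.parquet"))
print(object_path("curated", "finance", day, "daily_revenue.parquet"))
```

Keeping the business unit in the path of the curated zone makes it straightforward to scope access for analysts to a single prefix or bucket.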

OCI Datalake Offerings

Oracle Cloud Infrastructure (OCI) has all the offerings needed to build an internet-scale data lake for your enterprise using OCI-native data services and Oracle’s own Object Storage. OCI offers startups, scaleups, and enterprises alike an opportunity to build a robust, secure, and scalable data lake. OCI has some of the most competitive cost and performance offerings for building your exabyte-scale data lake.

OCI offers a comprehensive list of services to build a fully cloud-native and serverless data lake that can ingest data from your on-premises systems, other cloud providers, and various third-party data sources. In addition, Oracle Cloud Infrastructure has a multitude of offerings for integrating and transforming data and for deriving business value from your data lake. All these offerings are fully managed services, with no need to provision any servers.

Ingestion

OCI Data Integration — Oracle Cloud Infrastructure Data Integration is a fully managed, multi-tenant, serverless, native cloud service that helps you with common extract, transform, and load (ETL) tasks such as ingesting data from different sources, cleansing, transforming, and reshaping that data, and then efficiently loading it to target data sources on Oracle Cloud Infrastructure.

OCI GoldenGate — GoldenGate is a fully managed, native cloud service that moves data in real time, at scale. OCI GoldenGate processes data as it moves from one or more data management systems to target databases. You can also design, run, orchestrate, and monitor data replication tasks without having to allocate or manage any compute environments.

Oracle Integration Cloud (OIC) — Oracle Integration is a fully managed service that allows you to integrate your applications, automate processes and gain insight into your business processes.

OCI Streaming (Kafka Compatible) — Oracle Cloud Infrastructure Streaming service provides a fully managed, scalable, and durable solution for ingesting and consuming high-volume data streams in real-time. Use Streaming for any use case in which data is produced and processed continually and sequentially in a publish-subscribe messaging model.
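
The publish-subscribe model mentioned above can be sketched with a toy in-memory log. This is not the OCI Streaming or Kafka API; the `InMemoryStream` class and the consumer group names are hypothetical and only show how messages are appended once and read sequentially by independent subscribers.

```python
class InMemoryStream:
    """Toy append-only log; a stand-in for a managed stream, not the OCI API."""

    def __init__(self):
        self._messages = []
        self._offsets = {}  # consumer group name -> next offset to read

    def publish(self, message):
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def consume(self, group, limit=10):
        """Return up to `limit` unread messages for `group`, in publish order."""
        start = self._offsets.get(group, 0)
        batch = self._messages[start:start + limit]
        self._offsets[group] = start + len(batch)
        return batch


stream = InMemoryStream()
for reading in ("temp=21.5", "temp=21.7", "temp=22.0"):
    stream.publish(reading)

# Each subscriber tracks its own position, so both see every message in order.
first_batch = stream.consume("raw-zone-loader", limit=2)
second_batch = stream.consume("raw-zone-loader")
alerting_batch = stream.consume("alerting")
print(first_batch, second_batch, alerting_batch)
```

In a data lake, one subscriber would typically land the messages in the raw storage zone while others react to them in real time.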

Storage

Oracle Object Storage — Oracle Cloud Infrastructure Object Storage service is an internet-scale, high-performance storage platform that offers reliable and cost-efficient data durability. The Object Storage service can store an unlimited amount of unstructured data of any content type, including analytic data and rich content, like images and videos.

Transformation

OCI Data Catalog — Oracle Cloud Infrastructure Data Catalog is a fully managed, self-service data discovery and governance solution for your enterprise data. With Data Catalog, you get a single collaborative environment to manage technical, business, and operational metadata. You can collect, organize, find, access, understand, enrich, and activate this metadata.

OCI Data Flow — Oracle Cloud Infrastructure Data Flow is a fully managed service for running Apache Spark™ applications. It allows developers to focus on their applications and provides an easy runtime environment to execute them. It has a simple user interface with API support for integration with applications and workflows.

OCI Big Data Service — Big Data Service provisions fully configured, secure, highly available, and dedicated Hadoop and Spark clusters on demand. Scale the cluster to fit your big data and analytics workloads by using a range of Oracle Cloud Infrastructure compute shapes that support anything from small test and development clusters to large production clusters.

OCI Data Science — Data Science is a fully managed and serverless platform for data science teams to build, train, and manage machine learning models in Oracle Cloud Infrastructure.

Visualization

Oracle Analytics Cloud — Analytics empowers business analysts and consumers with modern, AI-powered, self-service analytics capabilities for data preparation, visualization, enterprise reporting, augmented analysis, and natural language processing.

Oracle Autonomous Database (APEX) — Autonomous Database is a fully managed, preconfigured database environment with four workload types available: Autonomous Transaction Processing, Autonomous Data Warehouse, Oracle APEX Application Development, and Autonomous JSON Database.

OCI Datalake Deployment

Pre-requisites

  • Create an OCI tenancy and an Administrator user that has rights to set up IAM policies at the root compartment level.
  • Ensure the correct IAM groups and policies are created for the services that will be deployed.
  • Create API keys for a user that has Administrator privileges in your OCI tenancy.
  • There should be network connectivity from your OCI tenancy's Virtual Cloud Network (VCN) to your on-premises, other cloud, or third-party network. The security best practice is to have a private virtual circuit, using either IPsec VPN or FastConnect, to other clouds and on-premises data centres. Here is an example of how you can connect an AWS VPC to an OCI VCN using FastConnect for high bandwidth and lower latency.
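
As a rough illustration of the IAM prerequisite, the sketch below assembles policy statements in the general form `Allow group <group> to <verb> <resource-type> in compartment <compartment>`. The group names, the compartment name, and the `policy_statement` helper are hypothetical; the exact resource types each service needs should be taken from the OCI documentation for that service.

```python
# OCI policy verbs, ordered from least to most access.
VERBS = ("inspect", "read", "use", "manage")


def policy_statement(group, verb, resource_type, compartment):
    """Return one IAM policy statement as plain text."""
    if verb not in VERBS:
        raise ValueError(f"unknown verb: {verb}")
    return f"Allow group {group} to {verb} {resource_type} in compartment {compartment}"


# Hypothetical groups and compartment: engineers manage Object Storage
# resources, analysts can only read them.
statements = [
    policy_statement("DataLakeEngineers", "manage", "object-family", "datalake"),
    policy_statement("DataLakeAnalysts", "read", "object-family", "datalake"),
]

for statement in statements:
    print(statement)
```

Generating the statements from a small table of groups and verbs keeps the access model reviewable, and the same strings can be supplied to whatever tool creates the policies.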

You can automate the deployment of OCI Datalake using Terraform.

Data Governance

Data Retention

Data retention is an important part of your Datalake strategy. Retention of data can be important for various compliance and regulatory policies in line with the industry in which your business operates or the local data retention laws of the country where the data resides.
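
A retention rule boils down to comparing an object's age against a period defined for its storage stage. The following is a minimal sketch of that check; the periods in `RETENTION_DAYS` are made-up values, and in practice a managed feature such as Object Storage lifecycle rules would usually enforce the policy instead of custom code.

```python
from datetime import date, timedelta

# Hypothetical retention periods per storage stage, in days.
RETENTION_DAYS = {"raw": 90, "cleaned": 365, "curated": 7 * 365}


def is_expired(zone, created, today):
    """True when an object in `zone` is older than its retention period."""
    return today - created > timedelta(days=RETENTION_DAYS[zone])


today = date(2021, 7, 28)
created = date(2021, 1, 1)  # 208 days before `today`

for zone in RETENTION_DAYS:
    print(zone, is_expired(zone, created, today))
```

With these example values only the raw copy has passed its retention period, which reflects the idea that short-lived raw data is cheap to drop once cleaned and curated copies exist.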

Data Security

Security is the utmost priority, especially when data from a variety of sources can be subject to compliance regimes such as HIPAA, PCI-DSS, FedRAMP, and FIPS 140. In the cloud, security follows a shared responsibility model, so please refer to this link on how OCI can help you comply with the various industry attestations for data security. Data in a data lake should be encrypted both at rest and in transit. You can find out more about security-related best practices on OCI here.

Conclusion

In the recent past there have been further changes in where you park your final data. With cloud-native data warehouses like Oracle’s Autonomous Data Warehouse, a more cost-effective alternative is available for storing huge amounts of readily available data. A newer concept is the Data Lakehouse architecture, where the underlying storage behaves like tables in a relational database when it comes to querying and writing data in the data lake. The Data Lakehouse is a further evolution of the data lake. Storing and processing historical data from different sources to derive value for your business has been, and will remain, an ever-evolving process.

With OCI it is easy to build and deploy an internet-scale, serverless, and scalable data lake. OCI offers both data integration and transformation services along with data science and machine learning technologies to derive deep insights from your data. Oracle has always been central to every enterprise’s critical data needs, and we have always supported our customers in their growing data needs, be it with Exadata for petabyte-scale data warehousing or now with Oracle Cloud Infrastructure data lakes with exabyte-scale data storage and processing.

Next Steps

[1] Automated Deployment with Terraform

[2] Enterprise data warehousing — an integrated data lake example

[3] Oracle Quickstart Deployment using Terraform

[4] Oracle updates Big Data Service to accelerate use of managed open source on Oracle Cloud


/*The statements and opinions expressed here are my own & do not necessarily represent those of my employer*/