Building an Agile Data Lake

Starting a data lake initiative is often a big step for an organization. There are many ways in which organizations can move towards fact-based decision making. The traditional approach of EDW and BI tools has worked for many years. These systems support structured analysis of known, recurring business questions, but they have proven less agile at fulfilling ad hoc requirements and answering ad hoc questions. Is it possible to be more agile in how we collect, analyze, and extract value from data? Can we get the most out of all the data we come across?

A data lake is one logical starting point towards being data-driven in an agile way. In this post I will show how an organization can approach building a data lake in an agile way, and describe the high-level architecture of a data lake and its components. I will also point out some of the technologies of choice today.

You have probably heard the term data lake quite a few times, but let me define it here.

A data lake is a centralized repository of all data collected from a variety of sources. It:

Stores both relational and non-relational data in raw form at the lowest granularity
Can keep data for as long as required by leveraging cheap, scale-out storage, typically HDFS or Amazon S3
Serves as a staging layer for further structured and unstructured analysis

Getting all the data in one place is by no means trivial. You don’t set out to build a data lake just because others are doing it. Often, you start with one business problem in mind; getting just the data required for that problem streamed into the lake is a good starting point. In The Data Warehouse Toolkit, Ralph Kimball suggests a similar method, although in the context of traditional DW/BI systems. Add more data sources as needed. Eventually, the lake becomes the repository of all your data, readily usable for answering new questions and for operationalizing new analytics pipelines.

Many Hadoop distributions have published reference architectures describing the ideal state of affairs. The following is a high-level picture of what a data lake might look like and some of the technologies involved.

[/fullwidth_text] [spb_single_image image=”10336″ image_size=”full” frame=”noframe” intro_animation=”none” full_width=”no” lightbox=”yes” link_target=”_self” caption=”Data Lake” width=”1/1″ el_position=”first last”] [fullwidth_text alt_background=”none” width=”1/1″ el_position=”first last”]

Raw Data Staging Layer

This layer extracts data from various conventional and unstructured data sources in batch or streaming mode. For example, data may be extracted from relational OLTP sources every night or every few hours, or streamed in near real time. The choice of ingestion tools depends on the acceptable delay before the data must be available for decision making.

Tools like Flume or Logstash can forward log data from web servers and application servers to the data lake. Log data, especially when combined with other data sources, is quite useful for clickstream analysis, A/B testing, recommender systems, and so on.

Data from social networking sites like Facebook, Twitter, and LinkedIn can be streamed and stored for offline analysis. This may also be done on an ad hoc basis, for example to monitor the response to a marketing campaign, a product launch, or a public event.

The data ingested from any of these sources is first stored in raw form in HDFS. Strictly speaking, this raw staged data is what is called the data lake or data reservoir. It can serve as a source of data for any ad hoc analysis, should the need arise in the future.
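
To make the idea of raw staging concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: a local directory stands in for HDFS or S3, and the function name `land_raw`, the `raw/<source>/dt=YYYY-MM-DD/` layout, and the `orders_oltp` source name are hypothetical, not part of any particular tool. The point it shows is that the payload is written byte for byte, with no parsing or cleaning.

```python
from datetime import datetime, timezone
from pathlib import Path
import tempfile


def land_raw(base_dir: Path, source: str, payload: bytes, ingested_at: datetime) -> Path:
    """Write one ingested batch unchanged under a source/date directory."""
    # Hypothetical layout: <base>/raw/<source>/dt=YYYY-MM-DD/<timestamp>.raw
    day_dir = base_dir / "raw" / source / f"dt={ingested_at:%Y-%m-%d}"
    day_dir.mkdir(parents=True, exist_ok=True)
    target = day_dir / f"{ingested_at:%Y%m%dT%H%M%S}.raw"
    target.write_bytes(payload)  # stored as received: no parsing, no cleaning
    return target


if __name__ == "__main__":
    # A temporary local directory stands in for the lake's storage.
    with tempfile.TemporaryDirectory() as tmp:
        path = land_raw(
            Path(tmp),
            "orders_oltp",  # hypothetical source name
            b'{"order_id": 1}\n',
            datetime(2016, 1, 15, 2, 0, tzinfo=timezone.utc),
        )
        print(path.relative_to(tmp))
```

Keeping the source name and ingestion date in the path is one simple way to keep raw data findable later; a real deployment would likely write through an HDFS or S3 client instead of the local filesystem.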

Performance Layer (or Access Layer)

Raw data is subsequently deduplicated, cleaned, transformed, and stored in an efficiently queryable format. Typical choices are columnar file formats like Parquet, which also offer very good compression. Data may be partitioned by a time interval to reduce the amount of data scanned at query time.
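
The deduplicate-then-partition step can be sketched in a few lines of plain Python. This is a toy, in-memory illustration under my own assumptions, not how a cluster would do it: the field names `event_id` and `ts`, the daily interval, and the `dt=YYYY-MM-DD` partition naming (the convention Hive uses for partition directories) are all chosen for the example, and real pipelines would write each partition out as Parquet files rather than Python lists. What it shows is why time partitioning helps: a query for one day only touches that day's partition.

```python
from collections import defaultdict
from datetime import date, datetime


def deduplicate(records, key="event_id"):
    """Keep the first record seen for each key value."""
    seen = set()
    unique = []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique


def partition_by_day(records, ts_field="ts"):
    """Group records under 'dt=YYYY-MM-DD' partition names."""
    partitions = defaultdict(list)
    for rec in records:
        day = datetime.fromisoformat(rec[ts_field]).date()
        partitions[f"dt={day.isoformat()}"].append(rec)
    return dict(partitions)


def prune(partitions, start: date, end: date):
    """Return only the partitions whose day falls within [start, end]."""
    selected = {}
    for name, rows in partitions.items():
        day = date.fromisoformat(name[len("dt="):])
        if start <= day <= end:
            selected[name] = rows
    return selected


if __name__ == "__main__":
    raw = [
        {"event_id": 1, "ts": "2016-01-14T23:59:00", "page": "/home"},
        {"event_id": 2, "ts": "2016-01-15T00:01:00", "page": "/cart"},
        {"event_id": 2, "ts": "2016-01-15T00:01:00", "page": "/cart"},  # duplicate
    ]
    parts = partition_by_day(deduplicate(raw))
    hits = prune(parts, date(2016, 1, 15), date(2016, 1, 15))
    print(sorted(parts), sum(len(rows) for rows in hits.values()))
```

The choice of interval (hour, day, month) is a trade-off: finer partitions mean less data scanned per query but more, smaller files to manage.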

Parquet is not a good storage format for streaming data when that data must be made available with low latency. NoSQL databases like Cassandra, HBase, or Elasticsearch may be used to store the streaming data instead.

As mentioned in the previous post about data warehousing and BI, Hadoop will not replace the DW/BI infrastructure, at least for now. Being the landing zone and long-term repository for all relational data from OLTP systems, Hadoop can also perform ETL (utilizing the compute power of the cluster with Spark) and export the data to the EDW.

Existing BI tools and custom web applications leveraging advanced visualization libraries like D3.js can connect directly to Hadoop using SQL-on-Hadoop engines like Cloudera Impala and Spark SQL with its Thrift server.

Data Discovery Lab

The data discovery lab is the sandbox or playground for new innovations. Data scientists can bring raw and/or processed data into this lab to do exploratory analysis and build new machine learning models. Findings made during exploration might be presented as a one-time report, and machine learning models built here can be operationalized. The key point is that the data lake enables such a data discovery lab to exist, where data scientists and even business analysts can bring data for exploration and research.

To summarize, a data lake brings great benefits by building an enterprise data hub and enabling data discovery, real-time analytics, and machine learning. Starting to build a data lake needs just one use case, such as a recommender system for your customer portal, offloading ETL workloads from your EDW, customer churn analysis, or A/B testing. Taking this first step towards an enterprise data hub, and approaching it in an agile and iterative way, can bring great long-term benefits.

[/fullwidth_text]