Over the last decade, organisations have made substantial investments in their data warehouses: hardware infrastructure, software licenses, BI tools and manpower to harvest huge amounts of data and gain insights. However, with escalating data volume and variety, these data warehouses are becoming inefficient even as their budgets keep growing. Though many organisations plan to adopt big data technologies to cut their data warehousing costs, embracing an emerging technology like Hadoop can seem a daunting task at first.
Let's explore the possibilities that Hadoop can offer your data warehouse.
The Challenge
Companies use hundreds or thousands of ETL jobs to bring together data from different systems or applications. Most of these upstream ETL processes were designed a few years ago with a limited set of data in mind. With high-volume, high-velocity data, some of these processes take days to complete. Moreover, these complicated ETL processes consist of individualized connections for bringing in data from a variety of data sources. Ingesting a new data source may take months because the schema has to be altered.
While planning to offload ETL processes to Hadoop, make sure you have a well-defined strategy for each of the following.
- Which parts of your warehouse are to be offloaded
- What the ETL workflow will be and which technology will be used for ETL
- What storage formats to use for efficient querying on top of Hadoop data
- What SQL-on-Hadoop solution to use for querying data out of Hadoop
- What BI tool integration strategy to follow
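As one way to approach the first question, the sketch below ranks ETL jobs by the warehouse time they consume each week, so the heaviest ones surface as offload candidates. The job names, runtimes and the 24-hour threshold are invented for illustration and do not come from any real warehouse; a real assessment would also weigh data volume, dependencies and licensing.

```python
from dataclasses import dataclass


@dataclass
class EtlJob:
    name: str             # hypothetical job name
    runtime_hours: float  # average wall-clock time per run
    runs_per_week: int    # how often the job is scheduled


def weekly_hours(job: EtlJob) -> float:
    """Warehouse hours the job consumes in a week."""
    return job.runtime_hours * job.runs_per_week


def offload_candidates(jobs, min_weekly_hours=24.0):
    """Return jobs at or above the threshold, heaviest first."""
    heavy = [j for j in jobs if weekly_hours(j) >= min_weekly_hours]
    return sorted(heavy, key=weekly_hours, reverse=True)


# Illustrative inventory; the numbers are assumptions, not measurements.
jobs = [
    EtlJob("clickstream_sessionize", 30.0, 1),
    EtlJob("daily_sales_rollup", 0.5, 7),
    EtlJob("call_log_parse", 6.0, 7),
]

for job in offload_candidates(jobs):
    print(f"{job.name}: {weekly_hours(job):.1f} h/week")
```

Ranking by weekly hours rather than by single-run duration keeps short but frequent jobs in view, since they can occupy the warehouse as much as one long weekly batch.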
Scalable Hadoop solutions
The Hadoop ecosystem offers a wide range of platforms that not only solve the storage issue but also provide distributed compute power that can free up substantial resources in your current data warehouse.
The Hadoop ecosystem is evolving quickly, and its tools allow users to ensure consistency of data across the information pipeline and to manage system resources efficiently.
A flexible schema allows you to ingest a wide variety of data, including sensor data, social media and IoT data, at will.
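The idea behind a flexible schema can be shown with a minimal sketch in plain Python: raw records are stored exactly as they arrive, and a schema is applied only when the data is read. The record shapes and field names below are assumptions made up for this example, and the code illustrates the concept rather than any particular Hadoop API.

```python
import json

# Raw records are kept as-is; each source has its own shape (all invented).
raw_lines = [
    '{"source": "sensor", "device_id": "s-17", "temp_c": 21.5}',
    '{"source": "social", "user": "alice", "text": "great service"}',
    '{"source": "sensor", "device_id": "s-18", "temp_c": 19.0, "humidity": 40}',
]


def read_with_schema(lines, source, fields):
    """Apply a schema at read time: keep only `fields` for one source.

    Fields missing from a record come back as None instead of failing.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        if record.get("source") == source:
            rows.append({f: record.get(f) for f in fields})
    return rows


sensor_rows = read_with_schema(raw_lines, "sensor", ["device_id", "temp_c", "humidity"])
for row in sensor_rows:
    print(row)
```

Because nothing is rejected at write time, a new source or a new field (such as `humidity` here) can be ingested immediately, and readers decide later which columns they need.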
Here is an interesting talk with Usama Fayyad, Chief Data Officer and Group Managing Director at Barclays, on how Hadoop is adding value to data warehousing in the banking sector. In his view, the cost per terabyte per year can be brought down by a factor of 10 to 40. Apart from these huge storage cost savings, ingesting a variety of data sources such as images, call center audio recordings and documents provides high value for the banking sector. Ever-growing data lakes are no longer ‘toxic dumps’ but are used to gain valuable insights.
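To make the factor-of-10 to factor-of-40 range concrete, here is a small worked calculation. The baseline cost per terabyte per year and the data volume are purely hypothetical placeholders chosen for round numbers; only the reduction factors come from the talk.

```python
def reduced_cost(cost_per_tb_year: float, factor: float) -> float:
    """Cost per terabyte per year after dividing by a reduction factor."""
    return cost_per_tb_year / factor


baseline_per_tb = 20_000.0  # hypothetical warehouse cost per TB per year
volume_tb = 500             # hypothetical amount of data offloaded

for factor in (10, 40):
    per_tb = reduced_cost(baseline_per_tb, factor)
    yearly_saving = (baseline_per_tb - per_tb) * volume_tb
    print(f"factor {factor}: {per_tb:,.0f} per TB per year, "
          f"saving {yearly_saving:,.0f} per year on {volume_tb} TB")
```

Under these assumed inputs, most of the saving is already captured at a factor of 10, which suggests the low end of the range is enough to justify a closer look.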