What is an Enterprise Data Lake?
Back in 2010, Pentaho co-founder and CTO James Dixon coined the term ‘Data Lake’. While there are many interpretations of the term these days, it usually means a repository that holds a vast amount of raw data in its native format until it is needed. Data is stored at its most granular level so that any ad-hoc analysis can be performed at any time.
So in an enterprise context, a data lake is a repository of enterprise-wide data (from one or more sources) in its raw format. The data lake is designed to take in data in whatever format it arrives and store it, and it should also provide a way to analyse the data when required. Some people call it an Enterprise Data Hub as well.
In almost all cases, enterprises have used a Hadoop-based system to build a data lake. HDFS, the Hadoop Distributed File System, has been the natural foundation for a data lake – it is highly scalable and provides low-cost storage. At the same time, Hadoop and the ecosystem around it provide the platform, frameworks and tools for effective data analysis.
When people first learn about the data lake, they often try comparing it with the Enterprise Data Warehouse or Data Marts. James Dixon explains the difference in the simplest way possible:
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
Initially, James described the data lake as being fed from a single source, but the modern Enterprise Data Lake is designed to hold data from many sources and, in some cases, to act as a single repository of all enterprise-wide data for easy consumption.
There are fundamental differences between a data lake and a traditional data warehouse. A data warehouse follows an Extract, Transform, and Load (ETL) process; the schema is well defined up front, so data must be transformed before being loaded into it – schema on write. In a data lake, however, there is no pre-built schema, so data is stored in its raw format. The process followed here is Extract, Load, and Transform (ELT); data is loaded first and transformed only when needed, with the schema decided at the time of reading or querying – schema on read.
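To make the distinction concrete, here is a minimal Python sketch (my own illustration, not from any particular product): raw JSON events are kept exactly as they landed, and a schema is imposed only at the moment they are read for analysis. The folder layout and field names are hypothetical.

```python
import csv
import json
from pathlib import Path

# Hypothetical landing area: raw click events are dumped here as JSON lines,
# exactly as they arrived, with no schema enforced at write time.
RAW_DIR = Path("lake/raw/clickstream")

def load_clicks_schema_on_read():
    """Apply a schema only when the data is read (ELT / schema on read)."""
    for raw_file in RAW_DIR.glob("*.jsonl"):
        for line in raw_file.read_text().splitlines():
            event = json.loads(line)
            # The 'schema' lives in this projection, not in the storage layer:
            # pick the fields the current analysis needs, coerce types, and
            # tolerate records that predate newer fields.
            yield {
                "user_id": str(event.get("user_id", "")),
                "url": event.get("url"),
                "duration_ms": int(event.get("duration_ms", 0)),
            }

# A warehouse-style pipeline would instead validate and reshape each record
# against a fixed table definition *before* writing it (ETL / schema on write).
if __name__ == "__main__":
    with open("clicks_report.csv", "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["user_id", "url", "duration_ms"])
        writer.writeheader()
        for row in load_clicks_schema_on_read():
            writer.writerow(row)
```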
Image Credit: Tamara Dull, Director of Emerging Technologies, SAS Best Practices
Why a data lake?
We have had the Enterprise Data Warehouse (EDW) and Data Mart approach for more than two decades now. So the common questions are:
- Why the data lake now?
- Is it going to replace data warehouses?
With the advent of big data, enterprises want to derive value from it. The traditional data warehouse built on top of an RDBMS is not equipped for that, so a different way to store and process big data in a cost-effective manner is needed. One can summarise the need for the data lake in the following way: remove data silos and take advantage of enterprise-wide data by allowing more questions, better answers and, of course, serendipity!
Is the data lake going to replace existing data warehouses or data marts? Probably not – though there are efforts to replicate data-warehouse-like functionality in a data lake. A data warehouse, possibly with an OLAP layer on top, gives business users an ease and effectiveness in deriving value that a data lake cannot match by itself. However, a data lake can be an ideal staging area or source for a data mart.
Source: PwC
Common Criticisms of the Data Lake Approach
There is always value to be found in data, and a data lake makes a big promise in that sense. But there are also substantial risks, which Gartner and others have been critical about. Identifying these risks early on can help avoid the pitfalls of a data lake approach and put it to the best use for an enterprise's needs. Addressing the following can also be considered a set of best practices for implementing a data lake in an organisation:
- Data Governance
- Managing metadata and data discovery
- Determining data quality and the lineage of findings made by other users
- Security and Access Control
- Performance (Query and data processing)
If these are not addressed, a data lake can easily become a data swamp, good for nothing. More on this, perhaps, in another post.
The Microsoft Approach – Azure Data Lake
Enter Azure Data Lake! Though it was announced last April during the Build conference, Microsoft brought it into public preview (beta release) last week, with a promise of making big data easy and the enterprise data lake a success for any organisation in a cost-effective and simple way.
Components of Azure Data Lake
Azure Data Lake has three components:
- Data Lake Store
- Data Lake Analytics
- HDInsight
Data Lake Store
Azure Data Lake Store is a Hadoop-compatible file system that exposes the WebHDFS interface and works with the Hadoop ecosystem. Data Lake Store is integrated with Azure Data Lake Analytics and Azure HDInsight, and will be integrated with other Microsoft offerings as well as industry-standard distributions like Hortonworks, Cloudera, and MapR, and individual Hadoop projects like Spark, Storm, Flume, Sqoop, and Kafka. It has no fixed limits on account size or file size and has been optimised for massive throughput, high frequency, low latency, and real-time analytics.
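Because the store exposes a WebHDFS-compatible endpoint, standard WebHDFS-style REST calls should work against it. Below is a minimal Python sketch (my own, not from Microsoft's documentation); the account name, folder path, and token are placeholders, and details of the preview API, such as the exact endpoint and authentication flow, may differ.

```python
import requests

# Placeholder values – substitute your own Data Lake Store account name,
# a folder path in the store, and a valid Azure AD OAuth bearer token.
ACCOUNT = "mydatalakestore"          # hypothetical account name
PATH = "/clickstream/2015/11"        # hypothetical folder in the store
TOKEN = "<azure-ad-oauth-token>"

# Standard WebHDFS operation: list the contents of a directory.
url = f"https://{ACCOUNT}.azuredatalakestore.net/webhdfs/v1{PATH}"
resp = requests.get(
    url,
    params={"op": "LISTSTATUS"},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()

# Print name, type (FILE/DIRECTORY) and size of each entry.
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"], entry["length"])
```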
Importantly, Data Lake Store provides rich capabilities to help manage and secure your data assets. It brings peace of mind by letting you monitor performance, receive alerts, and audit usage. Data Lake Store uses Azure Active Directory, providing a robust identity and access management solution for all of the data in the store.
Data Lake Analytics
Microsoft introduced a new Azure-based service, Data Lake Analytics, which works on top of Data Lake Store. It provides distributed infrastructure on demand and lets you run complex data processing tasks in the easiest possible manner. It also introduces U-SQL (the U stands for Universal!), a query language that blends the declarative nature of SQL with the expressive power of C#. The Data Lake Analytics service runs U-SQL code as a job. The U-SQL language is built on the same distributed runtime that powers the big data systems inside Microsoft. Millions of SQL and .NET developers can now process and analyse all of their data using skills they already have – remarkable, given that big data processing has so far required an extra set of skills.
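To give a feel for the language, here is a small illustrative U-SQL sketch (my own, not taken from Microsoft's samples); the input path and columns are hypothetical. Note how C# types such as string and int appear directly inside SQL-like statements, and how the schema is applied to the raw file only at read time:

```
// Apply a schema to a raw TSV file in the store at read time (schema on read).
@searchlog =
    EXTRACT userId string,
            region string,
            duration int
    FROM "/input/searchlog.tsv"
    USING Extractors.Tsv();

// Declarative, SQL-like aggregation over the extracted rowset.
@totals =
    SELECT region,
           SUM(duration) AS totalDuration
    FROM @searchlog
    GROUP BY region;

// Write the result back to the store as CSV.
OUTPUT @totals
    TO "/output/totals_by_region.csv"
    USING Outputters.Csv();
```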
HDInsight
HDInsight is an already existing service which provides managed clusters for Apache Hadoop and its ecosystem (core Hadoop, HBase, Storm and the latest addition, Apache Spark). As expected, it works seamlessly with Azure Data Lake Store.
Takeaways so far
From the beginning, I have been interested in finding out whether Azure Data Lake can help organisations prevent their data lake from becoming a data swamp (I must confess, I have not yet even scratched the surface of it). That, of course, depends on the implementation, but the important question is: does the platform provide the necessary support and the required tools? So far, it appears so. It offers a complete, cost-effective, all-cloud big data implementation that organisations can take advantage of to build their data lake with relative ease.
While this exploration will continue, it has become increasingly clear that the data lake approach may be the way forward for enterprise-wide big data implementations.
Whether you agree with me or not, you will find the following video entertaining and, if you believe in the power of data, enlightening as well – one of the most popular data guys of our time, bestselling author of Freakonomics, Professor Steven Levitt, speaks at the Sackler Big Data Colloquium, explaining the importance of big data, data science, and data scientists in his own atypical and interesting way: