As enterprises modernize their aging data platforms, Data Lake architecture has become crucial to the future of their businesses. In enterprise data architecture, the term Data Lake has come to mean a scalable data storage and compute platform that can flexibly hold data of all types, and process and query that data in various ways. In practice, scalability has been achieved using a Hadoop-based distributed platform, and flexibility using a block storage abstraction that is agnostic to whether data is structured, semi-structured, or unstructured.
What is missing from the above definition is the temporal flow of data from disparate sources into the Data Lake, and the dynamic surrounding capabilities that make the Data Lake itself useful: the means to know what is in the Data Lake (metadata and catalog), the policies determining the who, when, and how of data usage (governance and security), and the tools to derive insights (analytics of all kinds) that should ultimately translate to better business outcomes.
What, then, is a “Data Lake in the Cloud”? As was done with other enterprise IT platforms, the earliest attempts were to mimic on-premises Hadoop infrastructure in the cloud. This may have been part of the journey towards a more mature cloud adoption model, since organizations could leverage their existing skill sets. But creating and maintaining a permanent enterprise infrastructure in the cloud is the antithesis of the utility model of cloud usage, and such attempts hardly make economic sense. Moreover, a “lake in the cloud,” applied to data or otherwise, is not the most meaningful metaphor.
We define “Data Lake in the Cloud” as a data fabric that pervades the enterprise and multiple cloud provider realms, overlaid by a management plane. Such a fabric affords a seamless view of data and encourages optimal use of multiple storage and compute options across private and public clouds. It leverages IaaS, as well as PaaS and higher-level cloud services, unrestricted by physical infrastructure boundaries or specific distributed system technologies like Hadoop. We view the combining of Data Lake and cloud technologies less as a means of physically assembling data in a single repository than as a means of assembling metadata in order to efficiently derive value from data regardless of its location. Through the current series of blog posts, we strive to present the various facets of building, populating, maintaining, and benefiting from a cloud Data Lake that extends beyond the internal datacenters in an enterprise.
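To make the idea of assembling metadata rather than data a little more concrete, here is a minimal sketch in Python. It is only an illustration under our own assumptions, not a reference design: the names `DatasetEntry` and `MetadataCatalog`, the provider labels, the role names, and the URIs are all hypothetical. The catalog records where each dataset lives and who may read it, and never copies or opens the data itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetEntry:
    """Metadata for one dataset; the data stays wherever it already lives."""
    name: str
    provider: str       # hypothetical label, e.g. "on-prem" or "cloud-a"
    uri: str            # storage location; recorded here, never dereferenced
    data_format: str    # e.g. "parquet", "json", "binary"
    allowed_roles: frozenset = field(default_factory=frozenset)


class MetadataCatalog:
    """A toy management plane: one view over datasets in many locations."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Only metadata is stored; no bytes are moved between providers.
        self._entries[entry.name] = entry

    def locate(self, name):
        # Answers "where is it?" regardless of which provider holds it.
        return self._entries[name].uri

    def by_provider(self, provider):
        # Answers "what do we have there?" as a sorted list of names.
        return sorted(
            e.name for e in self._entries.values() if e.provider == provider
        )

    def can_read(self, name, role):
        # A stand-in for governance: a simple role check per dataset.
        return role in self._entries[name].allowed_roles


def build_example_catalog():
    """Build a catalog with two made-up datasets in two made-up locations."""
    catalog = MetadataCatalog()
    catalog.register(DatasetEntry(
        name="orders",
        provider="on-prem",
        uri="hdfs://onprem-cluster/sales/orders",
        data_format="parquet",
        allowed_roles=frozenset({"analyst", "engineer"}),
    ))
    catalog.register(DatasetEntry(
        name="clickstream",
        provider="cloud-a",
        uri="s3://example-bucket/clickstream/",
        data_format="json",
        allowed_roles=frozenset({"engineer"}),
    ))
    return catalog
```

A caller would ask the catalog, for example, `build_example_catalog().locate("orders")`, and receive a location without needing to know in advance whether the dataset sits in an internal datacenter or with a cloud provider. A real fabric would add lineage, schemas, and policy enforcement at the storage layer; this sketch only shows the separation between the metadata view and the physical placement of data.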