The data lake concept centers on landing all analyzable data sets, of any kind, in raw or only lightly processed form into an easily expandable, scale-out Hadoop infrastructure, so that the fidelity of the data is preserved.
Instead of forcing data into a static schema and running an ETL (Extract, Transform, Load) process to fit it into a structured database, a Hadoop-first approach enhances agility by storing data in its raw form. As a result, data remains available at a granular level without losing detail, and schemas are applied later, at read time. This approach is referred to as ‘schema-on-read.’
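The schema-on-read idea can be sketched in a few lines: raw records are landed exactly as they arrive, and a schema is only projected onto them when a consumer reads the data. This is a minimal illustration, assuming hypothetical JSON event logs and field names; it is not tied to any particular Hadoop tool.

```python
import json

# Hypothetical raw events, landed as-is. Schema-on-write would have forced
# these heterogeneous records into fixed columns before storage; here they
# are stored untouched, details and all.
raw_events = [
    '{"ts": "2024-01-01T10:00:00", "user": "a1", "action": "click", "page": "/home"}',
    '{"ts": "2024-01-01T10:00:05", "user": "a1", "action": "view", "item": 42}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: project only the requested fields,
    filling in None where a record lacks a field."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two different "schemas" over the same raw data, chosen by the reader.
rows = list(read_with_schema(raw_events, ["ts", "user", "action"]))
```

The same raw file can serve many downstream views, because each consumer decides which fields matter at query time rather than at load time.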
The data going into a lake might consist of machine-generated logs and sensor data (e.g., Internet of Things or IoT), customer behavior (e.g., web clickstreams), social media, documents (e.g., e-mails), geo-location trails, images, video and audio, and structured enterprise data sets such as transactional data from relational sources and systems such as ERP, CRM or SCM.
Pros and Cons of Data Lakes
The economics of Hadoop versus a traditional data warehouse have positioned data lakes as less grandiose data stores that function as feeder systems for data warehouses, analytic dashboards, or operational applications. Some organizations treat them as initial landing zones and use them to determine what data should be processed and sent downstream. However, the data stored in data lakes is micro-granular, and not ready for business users or downstream applications.
Another reason for data lakes’ rudimentary use is their lack of the enterprise-grade features required for broad, mission-critical usage: security, multi-tenancy, SLAs, and data governance capabilities that are core parts of existing data warehouses today. Therefore, while data lakes provide an economical and fast way to do detailed data discovery, it is critical to consider the longer-term architectural journey of Hadoop as an analytical repository.
Data lakes are created to store historical and micro-transactional data – data that in the past was not sustainable in data warehouses because of volume, complexity, storage costs, latency, or granularity requirements. This level of detail offers rich insights, but deriving meaning from it is prone to error and misinterpretation.
For example, Hadoop can be used to store customer interactions with an application or website. While the data that represents the interactive nature of the customer experience has record-by-record detail, capturing each click, it might be missing customer demographics, identification, and prior activity. In this case, other data management tools are needed to add structure around the most important elements of this data. For example, mapping web cookies to customer IDs provides additional dimensions about the visitor, such as age, location, and prior purchases. In the same scenario, enriching the clickstream data via the visitor’s IP address can reveal geo-location and support further segmentation of the data.
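The enrichment described above amounts to joining raw click records against reference data. The sketch below illustrates this with hypothetical lookup tables (cookie-to-customer and IP-to-region mappings); in practice these would come from a CRM system and a geo-IP database, and the join would run at much larger scale.

```python
# Hypothetical reference data (assumed for illustration only).
cookie_to_customer = {
    "ck_123": {"customer_id": "C001", "age": 34, "city": "Austin"},
}
ip_to_region = {"203.0.113.7": "US-TX"}

# Raw clickstream records as they might land in the lake.
clickstream = [
    {"cookie": "ck_123", "ip": "203.0.113.7", "url": "/product/42"},
    {"cookie": "ck_999", "ip": "198.51.100.2", "url": "/home"},
]

def enrich(events):
    """Join each click against the reference tables, leaving None where
    no match exists (an anonymous visitor or an unmapped IP)."""
    for event in events:
        profile = cookie_to_customer.get(event["cookie"], {})
        yield {
            **event,
            "customer_id": profile.get("customer_id"),
            "geo": ip_to_region.get(event["ip"]),
        }

enriched = list(enrich(clickstream))
```

Note that enrichment is additive: the raw fields are preserved alongside the new dimensions, so the fidelity of the original click data is not lost.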
Discovering patterns and analyzing data in the data lake leads to insights, but also to further questions. Data discovery is a process for determining what data, level of detail, and insights should be presented in customer-facing or business applications, and what other pieces of information are needed to enrich the data for a more complete picture.