The data warehouse has provided great value to businesses in unlocking the full potential of big data. Recently, an emerging technology, the data lake, is once again changing the way we approach and utilise data.
In order to clear up any confusion and distinguish these two concepts, let’s first take a look at the definitions.
What is a data lake?
A data lake is a central storage repository that holds a large amount of raw data for later use. Since data can be stored as-is, your business doesn't have to convert, structure or file data until it is needed.
What is a data warehouse?
A data warehouse is also a data repository for businesses, mainly used for reporting and data analysis. Data typically goes through an extract, transform, load (ETL) process before being added to the warehouse.
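The ETL process mentioned above can be sketched in a few lines of Python. The record fields and cleaning rules here are purely hypothetical, chosen only to show the three steps: extract raw records, transform them to match a predefined warehouse schema, then load the cleaned rows.

```python
# Extract: raw records as they might arrive from a transactional system
# (hypothetical fields, for illustration only).
raw_orders = [
    {"id": "1", "amount": " 19.99 ", "country": "uk"},
    {"id": "2", "amount": "5.00", "country": "US"},
]

def transform(record):
    # Transform: coerce types and normalise values so every row
    # conforms to the warehouse's predefined schema.
    return {
        "order_id": int(record["id"]),
        "amount": float(record["amount"].strip()),
        "country": record["country"].upper(),
    }

# Load: in a real pipeline this would be an INSERT into the warehouse;
# a plain list stands in for the target table here.
warehouse_table = [transform(r) for r in raw_orders]
```

The key point is that the cleaning happens *before* the data lands in the warehouse, which is exactly what distinguishes it from the data lake approach described next.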
The differences between data lakes and data warehouses
To put it in simple terms, a data warehouse stores transformed and structured data from various enterprise sources. This data is ready to be used for other purposes, especially reporting and analysis.
A data lake stores unstructured data, which is loaded in its raw state. Further transformation is required when this data is needed for other purposes.
Each has its own way of handling data and providing results.
1. Data types
As mentioned above, a data warehouse consists of data extracted from transactional systems, along with quantitative metrics that support analysis of performance and business status. A data warehouse requires a highly structured data model that defines which data is loaded into it and which is not.
In a data lake, all kinds of data from the source systems are loaded, including sources that a data warehouse might reject, such as web server logs, sensor data, social network activity, text and images.
A data lake can even store data that is currently not in use but might be used in the future. This is made possible by low-cost storage solutions such as Hadoop.
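Landing data in a lake can be as simple as writing each payload to storage untouched. The sketch below uses a temporary directory as a stand-in for the lake's object store, and the file names and payloads are hypothetical; the point is that logs, sensor readings and other formats are all stored byte-for-byte, with no schema check.

```python
import json
import pathlib
import tempfile

# A temp directory stands in for the lake's storage layer.
lake = pathlib.Path(tempfile.mkdtemp())

def land(name, payload):
    # Store whatever arrives, exactly as it arrived, for later use.
    (lake / name).write_bytes(payload)

# Very different kinds of data can land side by side, untransformed.
land("web-2024-01-01.log", b"GET /home 200\nGET /about 404\n")  # server log
land("sensor.json", json.dumps({"temp_c": 21.5}).encode())      # sensor reading

stored = sorted(p.name for p in lake.iterdir())
```

Nothing decides up front which of these files will ever be queried; that decision is deferred until someone needs them.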
2. Schema
A data warehouse applies the "Schema on Write" approach, meaning its model is highly structured for the main purpose of reporting. In detail, this process requires a considerable amount of time to analyse data sources, understand business processes and file data, resulting in a predefined system for storing data.
Data lakes keep data in its original state; when data is needed to answer a business question, only the relevant data is provided, and that smaller set can then be analysed to help answer it. This approach, known as "Schema on Read", effectively saves time and cost for your organisation.
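Schema on Read can be illustrated directly: the raw records stay untouched, and a schema (really just a choice of fields) is applied only at query time, for the question being asked. The event fields below are hypothetical.

```python
import json

# Raw events sit in the lake exactly as they were collected.
raw_events = [
    '{"user": "a", "page": "/home", "ms": 120, "agent": "Mozilla"}',
    '{"user": "b", "page": "/buy", "ms": 340, "agent": "curl"}',
]

def read_with_schema(lines, fields):
    # The "schema" is chosen here, at read time, not at write time:
    # only the fields relevant to the current question are projected.
    return [{f: json.loads(line)[f] for f in fields} for line in lines]

# Question: which pages are slow? Only two of the fields matter.
slow_pages = [r for r in read_with_schema(raw_events, ["page", "ms"])
              if r["ms"] > 200]
```

A different question tomorrow would simply project a different set of fields from the same raw records, with no upfront remodelling.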
3. Agility
Since a data warehouse is a highly structured repository, it is laborious to change its structure as the business's needs evolve. Making a change involves many complex processes, which makes it slow and expensive.
A data lake, on the other hand, takes advantage of the flexibility of data: because data is stored in its raw format and is always accessible, it can be reconfigured on the fly.
4. Users
A data warehouse, which is familiar to businesses and users, easily fulfils operational needs such as performance reports, metrics and data statistics. Given that it is well structured, easy to use and mainly aimed at answering users' questions, the data warehouse meets the needs of the operational stages.
A data lake is better suited to users who do deep analysis, such as data scientists. Given the broad range of data types, they can mash up many different kinds of data and come up with entirely new questions to answer.
Who should use data lakes?
Given their respective natures and abilities, a data warehouse seems the better choice for organisations looking to capitalise on data right away. Meanwhile, a data lake allows users to exploit the deeper possibilities data can bring, although this may be a difficult task for average end users whose skills are not advanced enough.
Both technologies will certainly continue to evolve, and chances are that providers will come up with hybrid solutions aimed at making data utilisation faster, more scalable and more reliable.