What is a data lake, its benefits and use cases

What is a data lake?

A data lake is a concept centred on landing all analysable data sets of any kind, in raw or only lightly processed form, in easily expandable scale-out infrastructure so that the fidelity of the data is preserved.

Instead of forcing data into a static schema and running an ETL (Extract, Transform, Load) process to fit it into a structured database, a data lake enhances agility by storing data in its raw form. As a result, data is available at a more granular level without losing any detail, and schemas are created at a later point, when the data is read. This approach is referred to as ‘schema-on-read.’
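To make schema-on-read concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the sample events, the field names and the read_with_schema helper are hypothetical and do not come from any particular data lake product. Raw records are kept exactly as they arrived, and a schema is applied only when someone reads them.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (one JSON document per line).
RAW_EVENTS = [
    '{"device": "pump-1", "temp_c": "71.5", "ts": "2024-01-01T00:00:00"}',
    '{"device": "pump-2", "temp_c": "68.0", "ts": "2024-01-01T00:01:00", "vibration": 0.3}',
    '{"user": "u42", "page": "/pricing"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    `schema` maps field name -> cast function. Records missing any of the
    fields are skipped; fields outside the schema are ignored, not deleted.
    """
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# Two different questions, two different schemas, same untouched raw data.
temperature_schema = {"device": str, "temp_c": float}
clickstream_schema = {"user": str, "page": str}

readings = list(read_with_schema(RAW_EVENTS, temperature_schema))
clicks = list(read_with_schema(RAW_EVENTS, clickstream_schema))
```

The same raw lines answer both a sensor question and a clickstream question, and a field that no schema asks for today (such as vibration) is still there if a later schema needs it.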

The data going into a lake might consist of machine-generated logs and sensor data (e.g., Internet of Things or IoT), customer behaviour (e.g., web clickstreams), social media, documents (e.g., e-mails), geo-location trails, images, video and audio, and structured enterprise data sets such as transactional data from relational sources and systems such as ERP, CRM or SCM.

Pros and cons of data lakes

Data lakes are created to store historical and micro-transactional data, which in the past could not be kept in data warehouses because of volume, complexity, storage cost, latency or granularity requirements. This level of detail offers rich insights, but deducing meaning from it is prone to error and misinterpretation.

Another reason data lakes remain in rudimentary use is that they lack the enterprise-grade features required for broad, mission-critical usage. These include security, multi-tenancy, SLAs and data governance capabilities, which are core parts of data warehouses today. Therefore, while data lakes provide an economical and fast way to do detailed data discovery, it is critical to consider the longer-term architectural journey of the data lake as an analytical repository.

Discovering patterns and analysing data in the data lake leads to insights, but also to further questions. Data discovery is a process for determining what data, level of detail and insights should be presented in customer-facing or business applications, and what other pieces of information are needed to enrich the data for a more complete picture.

Data lakes vs data warehouses

Before data lakes, data warehouses were viewed as a revolution in enterprise data management. However, to make it into the warehouse, all data must first be processed, a procedure that is time-consuming, laborious and challenging.

The debate over which technology is more useful, data lakes or data warehouses, will continue. To help clear up any confusion and distinguish the two concepts, here are four prominent differences.

1. Data types

Data warehouses consist of data extracted from transactional systems, along with quantitative metrics, to support the analysis of performance and business status. They require a highly structured data model that defines which data is loaded into the warehouse and which is not.

In data lakes, all kinds of data from the source systems are loaded, including data that a data warehouse may reject, such as web server logs, sensor data, social network activity, text and images.

Data lakes can even store data that is not currently in use but might become helpful in the future. This is made possible by low-cost storage on platforms such as Hadoop.

2. Schema

Data warehouses apply the “Schema on Write” approach, which means the data model is highly structured, mainly for the purpose of reporting. In detail, this approach requires a considerable amount of time to analyse data sources, understand business processes and file the data, resulting in a predefined system for storing data.

Data lakes keep data in its original state; when data is needed to answer a business-related question, only the relevant data is provided, and that smaller set of data can then be analysed to help answer the question. This approach, known as “Schema on Read”, saves your organisation time and cost.
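The contrast between the two approaches can be sketched in a few lines of Python. This is an illustrative sketch only: the order records, the coupon field and the orders table are hypothetical, and an in-memory SQLite database stands in for a data warehouse.

```python
import json
import sqlite3

# Hypothetical source records; the second carries a field nobody modelled yet.
records = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": 2, "amount": 5.00, "coupon": "SPRING"},
]

# Schema on write: the table structure is fixed before any data is loaded.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
)
for record in records:
    # Only the modelled columns are loaded; "coupon" is discarded at load time.
    warehouse.execute(
        "INSERT INTO orders (order_id, amount) VALUES (?, ?)",
        (record["order_id"], record["amount"]),
    )
# PRAGMA table_info returns one row per column; index 1 is the column name.
warehouse_columns = [row[1] for row in warehouse.execute("PRAGMA table_info(orders)")]

# Schema on read: the lake keeps every record as it arrived.
lake = [json.dumps(record) for record in records]
# A question about coupons, asked later, can still be answered from the raw data.
coupons = [json.loads(line).get("coupon") for line in lake]
```

In the warehouse, answering the coupon question would mean changing the table and reloading from the source, whereas in the lake the field was never thrown away. The trade-off is that every reader of the lake has to handle missing or inconsistent fields, as the None for the first order shows.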

3. Flexibility

Since data warehouses are highly structured repositories, changing their structure to suit the needs of the business is laborious. Each change involves many complex processes, which makes it slow and expensive.

Data lakes, on the other hand, take advantage of the flexibility of data: because data is stored in its raw format and is always accessible, it can be reconfigured on the fly.

4. Users

Data warehouses, which are familiar to businesses and users, easily fulfil a business's operational needs, namely performance reports, metrics and data statistics. Because it is well structured, easy to use and aimed mainly at answering users' questions, the data warehouse meets the needs of day-to-day operations.

A data lake is more suited to users who are tasked with deep analysis – such as data scientists. Given the broad data types, they can mash up many different types of data and come up with entirely new questions to be answered.

All in all, both technologies will continue to evolve, and chances are that providers will come up with a hybrid solution to make data utilisation faster, more scalable and more reliable.

Data lake use cases

According to an Aberdeen survey, organisations that implemented data lakes outperform their peers by 9% in revenue growth by identifying and acting upon new growth opportunities faster using new data sources and analytics.

The ability to garner practically all data provides endless opportunities for businesses. Data lakes have many uses and play a key role in solving many different business problems. With the right business intelligence tools, businesses can conduct experimental analysis on data before its value or purpose is defined and before it is moved to a data warehouse.

Oil and Gas

As one of the early adopters of multiple disruptive technologies, from cloud computing to IoT, the oil and gas industry is, unsurprisingly, also on board with this new trend. It is estimated that, on average, an oil and gas company generates 1.5 terabytes of IoT data daily.

Historical data stored in data lakes is vital for exploration and can be used to optimise directional drilling, minimise unexpected downtime, lower operating expenses, improve safety and stay compliant with regulatory requirements. Data science combined with GPS can enable oil and gas companies to increase production more than 20 times.

Smart city initiatives

According to IDC, investment in technologies that drive smart city initiatives is expected to reach $124 billion this year. These technologies will power traffic lights, direct law enforcement, enhance education systems, optimise waterways and tolls, and more. As a result, the amount of data generated per vehicle or pedestrian every single minute will be tremendous, and such sheer volume can only be contained using data lakes.

Life sciences

Our body is a highly complex machine, and it also generates tons of data. Our weight, blood pressure, heart rate, temperature, enzymes, white blood cell counts, etc. are measurements that change over time.

Life sciences need data lakes to conduct data exploration and discovery in order to gain a deeper understanding of the human genome, predict and detect defects, and leverage these insights to improve the life expectancy of the world's population.

Cybersecurity

Cybersecurity has always been a challenge that every organisation tries to eliminate, or at least minimise. Any laptop, server, smartphone or other computing device is vulnerable to internal and external threats. Ransomware, scam emails and viruses are becoming harder to detect.

To prevent security breaches from wreaking havoc on a company, its employees and its customers' trust, especially in the post-GDPR period, organisations need to put in place proactive, always-on security, disaster recovery and business continuity plans. Data lakes provide a safe and secure haven to house a business's precious digital assets.

Marketing

Every marketing channel and every touchpoint forms its own database. Data lakes can be used to collect any information, from demographics to preferences, about both customers and prospects from disparate sources, to assist in hyper-personalised marketing campaigns. As a result, marketers do not need to acquire such data from third parties.

What many of us are not aware of is that common customer data platforms used by marketers, like Salesforce and HubSpot, store fragmented data in data lakes and then present it to us through a web-based interface.

Data lakes can enable marketers to monitor and analyse data in near real-time, a vital capability if you are working with streaming services and need timely information to make informed strategic decisions and segmented campaigns.

Infor Data Lake

As a part of Infor OS, the Infor Data Lake unites all of your data, from CloudSuite, the Internet of Things, documents, third-party applications and more, into just one repository. Infor Data Lake allows you to utilise your data sources to the fullest.

Infor Data Lake provides customers with many advanced functions such as intelligent data ingestion, metadata management (meta-graph), and key elements of big data architecture. It also enables users to access and consume data according to their varied needs through an assortment of interfaces (APIs, SQL, Elastic Search, etc.).

Infor is making efforts to support its customers by applying its AI assistant, Coleman AI PaaS, along with a suite of self-service interfaces. This allows developers to explore and analyse data sets through machine learning and to expose the resulting logic via APIs or through events that can be interrogated by the CloudSuite.

Understanding the need for up-to-date data and analysis, Infor has included data push capabilities to establish a clean, useful, near real-time data lake that its customers can analyse with the help of Infor’s Birst BI platform.

Bear in mind that first-generation data lakes are only exploratory. For a data lake to support effective analysis and provide helpful insights, all your transactional and operational data must be gathered in one place. On top of that, exploiting the data in practice involves large data volumes and an advanced technical skill set, which is a heavy-duty job for companies that have not prepared for the execution.
