Gleaning business insights through data analysis can help companies outperform competitors in a fast-changing business landscape. The vast amounts of data now available to many companies make data analysis even more valuable. But that data can also introduce new challenges, both because of its sheer volume and because it includes unstructured data from sources such as websites, social media posts and internet of things (IoT) devices. A data lake is a data repository that lets organizations store all this unstructured information alongside structured information from core business applications and databases so they can analyze it. By exploring this treasure trove of information from multiple sources, companies can generate valuable new insights that improve business performance.
What Is a Data Lake?
A data lake is a data store that can hold all of an organization’s data, including unstructured data like images and text files. It can hold information from external sources, including IoT devices, website clickstreams and social media platforms. It also can store structured operational data from on-premises and cloud-based business applications. Companies can analyze this information using a variety of tools, including machine-learning technology that automatically hunts for patterns.
- A data lake enables companies to store and analyze all types of information, including unstructured data and structured relational data.
- Data lakes store data in its original raw format with no predefined database structure.
- Many users throughout the organization can use the data lake to get answers to different business questions, including new questions that are triggered by changing business conditions.
- Companies are increasingly migrating from on-premises to cloud-based data lakes because cloud-based lakes offer easier scalability and lower administration and start-up costs with a pay-as-you-go subscription model.
Data Lakes Defined
Whatever data sources feed them, data lakes are most simply defined by their distinguishing features, including the following.
- They can handle any kind of data, including structured data from relational databases as well as emails, documents, sensor data, web logs and images.
- The structure (schema) of the data is not defined in advance; instead, subsets of the data are structured later when it’s analyzed.
- Data lakes primarily store data in its original format, although companies may clean up some data sources to prevent erroneous data from reaching the data lake.
- Many users throughout the organization may use the data lake to explore different business questions.
- Companies can apply a variety of analysis tools, including machine learning and graph analytics.
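The idea that structure is applied only when data is analyzed (often called schema-on-read) can be illustrated with a minimal sketch. It assumes, purely for illustration, that raw records are stored as JSON text; the sample events, the `read_with_schema` helper and the field names are hypothetical and do not come from any particular data lake product.

```python
import json

# Raw events sit in the lake exactly as the sources produced them.
# The records do not share a common set of fields.
RAW_EVENTS = [
    '{"source": "web", "user": "u1", "page": "/pricing"}',
    '{"source": "sensor", "device": "pump-7", "temp_c": 81.5}',
    '{"source": "web", "user": "u2", "page": "/docs", "referrer": "search"}',
]


def read_with_schema(raw_lines, source, fields):
    """Apply a schema at read time: keep one source, project chosen fields."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if record.get("source") != source:
            continue
        # Missing fields become None instead of failing the whole read.
        rows.append({name: record.get(name) for name in fields})
    return rows


# Two different analyses impose two different structures on the same raw data.
page_views = read_with_schema(RAW_EVENTS, "web", ["user", "page"])
sensor_readings = read_with_schema(RAW_EVENTS, "sensor", ["device", "temp_c"])
```

Nothing about the stored data changes between the two reads; only the structure requested by each analysis differs.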
Why Are Data Lakes Needed?
Traditionally, companies have analyzed business data by applying analytical tools to structured information gathered from business applications and stored in relational databases. Organizations often use a specialized type of relational database, called a data warehouse, to store information specifically for analysis. Data warehouses are designed to deliver excellent analytical performance so business groups can get questions answered quickly.
However, organizations now have more data than ever before from a wider variety of sources. Much of this information consists of unstructured data that relational databases were not designed to handle. Data lakes let companies aggregate all of their unstructured and structured information into a single data store so they can analyze it more easily and generate new insights.
How Data Lakes Work
Data lakes import information from multiple sources and store it in its raw form in a flat architecture, typically as files or objects rather than in database tables. Data can be imported in batches or in a continuous real-time stream, depending on the source. These sources may include internal enterprise resource planning (ERP) or customer relationship management (CRM) applications, email, websites, sensors or other sources, like social media. The data is cataloged so that developers and users know what’s in the data lake. A broad range of users can then apply analytic tools to identify trends and other important insights. Companies can also use machine-learning tools, which automatically sift through the data to look for patterns.
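As a rough sketch of the import-and-catalog flow described above, the following example stores incoming payloads unchanged and records a catalog entry for each one. The directory layout (source, then date), the `ingest` function and the catalog fields are assumptions made for illustration; real data lakes usually rely on object storage and a dedicated catalog service.

```python
import hashlib
import tempfile
from datetime import date
from pathlib import Path


def ingest(lake_root, catalog, source, name, payload, day):
    """Store payload bytes unchanged and record where they landed."""
    target_dir = Path(lake_root) / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # no parsing or transformation on the way in
    catalog.append({
        "source": source,
        "path": target.relative_to(lake_root).as_posix(),
        "bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "ingested_on": day.isoformat(),
    })
    return target


catalog = []  # hypothetical in-memory stand-in for a catalog service
lake_root = tempfile.mkdtemp()
ingest(lake_root, catalog, "crm", "accounts.csv",
       b"id,name\n1,Acme\n", date(2024, 1, 15))
ingest(lake_root, catalog, "sensors", "pump-7.json",
       b'{"temp_c": 81.5}', date(2024, 1, 15))

# Users consult the catalog to find out what the lake contains.
crm_entries = [entry for entry in catalog if entry["source"] == "crm"]
```

The structured CSV export and the sensor reading are handled identically, which is what lets new sources be added quickly.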
Data Lakes vs. Data Warehouses
Data lakes and data warehouses are both data stores specifically designed for analysis, but they have very different characteristics and uses. Many organizations use both.
Data warehouses are relational databases that hold highly structured data from multiple sources, such as information from ERP systems and other applications. As with any relational database, companies define the way the information is structured in a database “schema” to optimize it for rapid information access. Designing the schema can be a lengthy process that involves gaining a clear understanding of how the data will be used. Making changes to the schema requires careful planning. Companies generally process raw data before moving it into the data warehouse in order to extract the most important information, ensure its accuracy and transform its structure to conform to the schema, a process known as extract, transform and load (ETL). To help provide users with quick answers to business questions, data warehouses often run on high-performance servers and storage either on premises or in the cloud.
Data lakes, in contrast, hold both structured and unstructured data from a much wider variety of sources, which could be as diverse as social media posts, weather information and factory equipment sensors. There’s no predefined schema, and data is imported into the data lake in its raw form without the need to transform it first, although organizations generally manage and catalog the data to ensure that it’s accurate and that users know what data is available. This means new data can move quickly into the data lake to become available for analysis. Data lakes allow users to explore this raw data in many ways, and they’re valuable for situations when companies don’t know in advance the kinds of questions they will want to ask. Subsets within the data lake may be structured as necessary when the information is needed for analysis. Data lakes can hold vast amounts of information, typically on low-cost storage, which may be slower for some queries than data warehouses.
Data lakes typically complement data warehouses and other relational databases rather than replace them. Data lakes and data warehouses can coexist in various ways. For example:
- Organizations may keep an existing data warehouse to provide high-performance analysis and reporting for groups that already use it while adding a data lake to support new data sources.
- Companies can archive historical data from the data warehouse by moving it to a data lake. This helps ensure that the data warehouse can continue to perform queries quickly and efficiently, while making historical data available for analysis.
- A data lake may act as the repository for all the organization’s data with subsets of structured data moved from the lake to smaller warehouses, called data marts, for specific groups within the organization.
| |Data Lake|Data Warehouse|
|---|---|---|
|Information types|Many types of unstructured and structured data from business applications and databases, websites, IoT and mobile devices and social media|Structured relational (tabular) data from operational systems and databases|
|Structure|Data stored in its original raw form; data is structured only when it is analyzed|Schema defined in advance determines the way information is structured|
|Information availability|Data is available for analysis very quickly|Data may take longer to become available for analysis because it is processed before being imported to the data warehouse|
|Uses|A variety of users can explore the data to delve into new questions. Users include data scientists, analysts and developers|Primarily used by business groups and developers to ask specific, predetermined types of questions|
|Analysis|Wide variety of tools, including machine learning, statistical analysis, graph analytics|Analytic tools include business intelligence products, machine learning and dashboards|
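For contrast with the raw storage used by data lakes, the warehouse-style extract, transform and load process can be sketched as follows. This is only an illustration: it uses an in-memory SQLite database as a stand-in for a data warehouse, and the `orders` table, the sample records and the cleaning rules are all hypothetical.

```python
import sqlite3

RAW_ORDERS = [
    {"order_id": "1001", "amount": "250.00", "region": " east "},
    {"order_id": "1002", "amount": "n/a", "region": "west"},  # bad amount
    {"order_id": "1003", "amount": "75.50", "region": "WEST"},
]


def transform(record):
    """Clean one raw record so it fits the fixed schema, or return None."""
    try:
        return (
            int(record["order_id"]),
            float(record["amount"]),
            record["region"].strip().lower(),
        )
    except (KeyError, ValueError):
        return None  # records that cannot conform never reach the table


def load_orders(raw_records):
    """Load cleaned records into a table whose schema is defined up front."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, "
        "amount REAL NOT NULL, "
        "region TEXT NOT NULL)"
    )
    rows = [row for row in map(transform, raw_records) if row is not None]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn


conn = load_orders(RAW_ORDERS)
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
```

The up-front cleaning is what makes the later query simple and fast, but it also means the malformed record is gone. In a data lake, that record would still be available in its raw form for anyone who later wants to examine it.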
Types of Data Lakes
There are two primary ways to implement a data lake: in the cloud or on premises. Here are the key differences.
Cloud data lakes
Cloud data lakes run on hardware and software in a supplier’s cloud and you access them over the internet. Most follow a pay-as-you-go subscription model. Cloud data lakes scale easily — as your data grows, you simply add cloud capacity. The provider manages security, reliability, data backup and performance so you can focus your efforts on determining which data to include in the data lake and how to analyze it.
On-premises data lakes
With an on-premises data lake, you install and run software to operate the data lake on servers and storage in your company’s data center. Capital investment is needed to buy software licenses and hardware, and you’ll need IT expertise to install and manage the data lake. You’re responsible for managing security, protecting data and ensuring adequate performance. You may need to migrate the data lake to a larger system as it grows. An on-premises system may provide higher performance for users located within the company’s facilities.
Elements of Data Lakes
A data lake typically includes four distinct high-level elements.
- Data ingestion: The data lake is supported by “connectors” and other services that import data from multiple structured and unstructured sources.
- Secure storage: The data lake must be able to store and protect a vast and expanding volume of data. The infrastructure supporting the data lake should scale easily and at an appropriate price because it’s seldom possible to predict all future sources and volume of data. It also needs to be protected against system failures and unauthorized access.
- Governance and curation: Businesses need to decide which data is imported into the data lake and how to manage it. The data also needs to be cataloged so users can find it. Without governance, data lakes can deteriorate into data swamps: pools of disorganized, stagnant data that languish unused and provide little value to the organization.
- Processing and analytics: The data lake should support a wide range of analytic tools because people will use the data lake for different types of analysis.
Data Lake Architecture
Data lakes typically share some architectural features. Even so, the details can vary, depending on the software used and how it’s implemented.
Data ingestion layer: This consists of connectors and services that bring data from diverse sources into the data lake. They may include prebuilt connectors to commonly used data sources, such as leading relational databases. Some ingestion services import data in batch files. Other services connect real-time streams of data from internal and external sources, such as website clickstreams, financial transactions, social media feeds and sensor telemetry data.
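One common way to let batch files and real-time streams share the same ingestion path is to cut a stream into small batches. The sketch below shows that idea under simplifying assumptions: `clickstream` is a hypothetical stand-in for a real feed, and the `landed` list stands in for the lake's raw storage.

```python
from itertools import islice


def micro_batches(records, size):
    """Yield lists of up to `size` records from any iterable, finite or not."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def clickstream():
    """Hypothetical stand-in for a real-time feed of website events."""
    for n in range(5):
        yield {"event": "click", "seq": n}


landed = []  # stands in for the lake's raw storage
for batch in micro_batches(clickstream(), size=2):
    landed.append(batch)

batch_sizes = [len(batch) for batch in landed]
```

Because `micro_batches` accepts any iterable, the same function works for a list read from a batch file and for a generator that keeps producing events.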
Security: The data lake may become the organization’s biggest and most valuable store of business data, so it’s vital to protect it. Security measures should include authenticating users and ensuring they access only the data they’re authorized to view. Some data lakes encrypt data both while it’s stored and while it’s in transit.
Catalog: A catalog is an essential feature that enables users and application developers to find information in the data lake. The catalog includes metadata describing each dataset in the lake. Metadata may describe not only the structure of the data but also include its source, quality and how it’s used in the business. Some metadata can be generated automatically when data is ingested.
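As an illustration of metadata generated automatically at ingestion time, the sketch below scans newline-delimited JSON records and derives a simple catalog entry. The entry layout, the `infer_metadata` function and the sample telemetry are assumptions for this example, not the format of any specific catalog tool.

```python
import json
from collections import defaultdict


def infer_metadata(name, source, raw_lines):
    """Build a catalog entry by scanning newline-delimited JSON records."""
    field_types = defaultdict(set)
    count = 0
    for line in raw_lines:
        record = json.loads(line)
        count += 1
        for field, value in record.items():
            field_types[field].add(type(value).__name__)
    return {
        "dataset": name,
        "source": source,
        "record_count": count,
        # Sorted so the entry is stable regardless of record order.
        "fields": {f: sorted(t) for f, t in sorted(field_types.items())},
    }


entry = infer_metadata(
    "pump_telemetry",
    "factory sensors",
    [
        '{"device": "pump-7", "temp_c": 81.5}',
        '{"device": "pump-7", "temp_c": 82, "alarm": true}',
    ],
)
```

A field that shows up with more than one type, such as `temp_c` here, is a useful early hint about data quality. Descriptive metadata such as business meaning and ownership still has to be supplied by people.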
Processing and analytics: The business uses analytic tools to gain insights from the data lake. Multiple groups of users may explore the data, including data scientists and business analysts. They may use a variety of tools depending on their needs and their level of expertise, ranging from traditional business intelligence tools that extract relational data for dashboards to statistical analysis and machine-learning tools.
Data Lake Examples
Data lakes can be valuable for companies in any industry that uses a lot of data. Some examples include:
Manufacturing: Companies can use data lakes to implement predictive maintenance and improve operating efficiency. By collecting information from equipment sensors, problem reports and records of repair, companies can better understand the most common causes of failure and predict when they will occur. They can then adjust maintenance schedules to maximize uptime and reduce repair costs. Companies can also use data lakes to analyze the efficiency of production processes and determine where to cut costs.
Marketing: Marketers gather information about customers from many sources, including display advertising, email campaigns, social media platforms and third-party providers of demographic and market information. A data lake can capture data from all those sources, including real-time feeds from websites and mobile apps. This helps marketers build a much more complete snapshot of their customers so they can better segment the customer base, target marketing campaigns and increase conversion rates. They can monitor fast-changing consumer preferences and analyze which marketing campaigns deliver the best return on investment.
Supply chain: Information about suppliers can be buried in multiple systems, making it hard to spot trends and pinpoint problems. A data lake can collect information from internal ordering and warehouse management systems, suppliers and shippers, as well as external sources such as weather forecasts. As a result, companies can identify the cause of delays, get a clearer understanding of when they need to order inventory and predict potential bottlenecks.
Benefits of Data Lakes
Data lakes provide a number of important benefits that help businesses respond more quickly to changes in the business environment. Advantages include:
More data sources. Businesses can bring almost any type of structured or unstructured data into a data lake. They can get more value by bringing together and analyzing data from these different sources. Because the lake contains all raw data, not just refined subsets, expert users can explore every aspect of the data in increasing depth to glean new insights over time.
Greater agility. Business conditions can change rapidly, which means companies may need to get answers to new and unforeseen questions. Companies have more flexibility to analyze the data in different ways because data lakes don’t constrain the kinds of questions you can ask. This helps them adapt more quickly to changes in market preferences or economic conditions.
Value to more users. Data lakes may be useful to a wide range of users across the organization because they hold many types of information, which can be analyzed in many different ways. Data scientists can delve into the data using complex analytical and modeling tools, while business users can perform simpler analysis.
Potentially faster implementation. There’s no need to perform a lengthy schema-definition process before building the data lake. Information is simply imported in its raw form without requiring transformation.
Low-cost scalability. Data lakes scale at a relatively low cost because they generally run on low-cost hardware. This is important because a company’s store of information can grow rapidly and in ways that are not easy to predict. For example, new IoT devices or external sources can generate voluminous streams of information.
Challenges of Data Lakes
Data lakes have their share of downsides, too. Among the biggest challenges is the dreaded data swamp.
Data swamps: A data lake can turn into a data swamp of stagnant information that is largely worthless if a company doesn’t use strong governance. This can happen when users are allowed to import any data they like or if companies don’t adequately catalog the data, ensure accuracy and remove obsolete information. When this happens, most people won’t understand or trust the information in the data lake, so it doesn’t get used.
Underestimating the implementation: Implementing a data lake is a significant undertaking, and it’s important to plan the project carefully. Early data lakes were complex and difficult to implement. Companies encountered obstacles collecting data from so many different sources and struggled with understanding how to scale the data lake and teach non-expert users how to analyze the data. However, data lake software has matured and cloud-based products have eased scalability concerns, which has helped companies overcome these obstacles.
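The governance gaps that lead to a data swamp, such as uncataloged or obsolete data, can be checked for automatically. The sketch below audits a list of catalog entries; the entry fields (`owner`, `description`, `last_refreshed`), the `audit_catalog` function and the 90-day staleness threshold are hypothetical choices that a governance team would set for itself.

```python
from datetime import date, timedelta


def audit_catalog(entries, today, max_age_days=90):
    """Return datasets that look like data-swamp candidates, with reasons."""
    findings = {}
    for entry in entries:
        reasons = []
        if not entry.get("owner"):
            reasons.append("no owner")
        if not entry.get("description"):
            reasons.append("no description")
        if today - entry["last_refreshed"] > timedelta(days=max_age_days):
            reasons.append("stale")
        if reasons:
            findings[entry["dataset"]] = reasons
    return findings


CATALOG = [
    {"dataset": "orders", "owner": "sales-ops", "description": "ERP orders",
     "last_refreshed": date(2024, 5, 30)},
    {"dataset": "old_clicks", "owner": "", "description": "Clickstream dump",
     "last_refreshed": date(2023, 1, 10)},
]

findings = audit_catalog(CATALOG, today=date(2024, 6, 1))
```

A report like this only flags candidates. Deciding whether to refresh, document or remove a dataset remains a job for the people responsible for it.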
History of Data Lakes
The history of data lakes is closely linked with Hadoop, an open-source software framework that was used to build some of the first data lakes (one of the software’s creators named it Hadoop, after his son’s toy elephant). Created in 2005 and still widely used for data lakes today, Hadoop includes a distributed file system that can store large amounts of information on clusters of low-cost servers.
James Dixon, then the chief technical officer at software company Pentaho, coined the term data lake in 2010 based on research into how companies were using Hadoop. He found that many of them were working with data that wouldn’t fit into a traditional data mart.
The companies said they wanted to explore their data to be ready to answer new, as-yet-unknown questions. Dixon proposed the data lake as a solution. As he put it: “If you think of a data mart as a store of bottled water — cleansed and packaged and structured for easy consumption — the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in or take samples.”
Since then, many suppliers have built commercial data lake offerings based on Hadoop and other software. In addition, organizations are migrating to cloud-based data lakes to take advantage of the scalability, reduced administration and pay-as-you-go cost structure.
Data Lake Best Practices
Don’t underestimate the work involved in creating and managing a data lake. A well-executed data lake project can become an enormously valuable asset and a competitive advantage — but it’s important to do it right. Following these best practices can help your project stay on track.
Gather the right expertise. You may need technical skills that you don’t already have, such as data scientists with the quantitative skills to perform complex analysis and software engineers with data lake expertise. Enlist the help of people across the business to determine which questions to ask, which data should be in the lake and where to get it.
Focus on governance. Governance is vital to the initial success of a data lake project — and also prevents the lake from turning into an unusable data swamp over time. Establish a team of stakeholders to create a governance framework that determines who can add data to the data lake and how the data should be managed over its full life cycle. Set rules for cataloging the data so users can find it. Enlist the help of people across the business as data stewards to manage data quality for their department — they’ll know when data is inaccurate or when it’s not refreshed often enough to be useful. If you’re storing sensitive information, such as personal data, you’ll need to ensure compliance with all applicable regulations.
Don’t overlook security. Because it stores so much data, a data lake can become one of the organization’s most valuable information assets. Protecting it is critical. Ensure data is encrypted and that only authorized users can access it. If you’re running an on-premises implementation, you’ll also need to consider other data protection measures, such as backups and redundancy.
Ingest data quickly and automate wherever possible. One of the key benefits of a data lake is that you can add new information rapidly because you can import data in its raw state. This means that you can analyze information sooner, which enables the company to respond more quickly to events. Maximize the benefits by focusing on rapid ingestion. You may be able to stream data continuously instead of importing it in batches. Automating data collection eliminates manual errors and allows data to be processed as soon as it’s available.
Data lakes can help organizations respond more quickly to the ever-changing business landscape. They let businesses quickly aggregate unstructured and structured data from many different sources into a single store for analysis. Many different users can employ a variety of analytic tools to explore answers to new business questions as they arise. A well-implemented data lake can deliver business insights that drive improvements in business performance.
Data Lake FAQs
What is the difference between a data lake and a data warehouse?
A data warehouse is a relational database that is designed for analyzing data. It holds structured information typically drawn from core business applications. The structure of the information is defined in a schema, which is developed in advance. A data lake is also designed for analysis, but it holds both unstructured and structured data from a wider range of sources. It has no predefined schema, and data is stored in its original form.
What is a data lake?
A data lake is a repository that can hold all of an organization’s data, including unstructured data like images and text files, as well as structured business data that’s traditionally stored in relational databases. Companies can analyze this information using various tools, including machine-learning technology that automatically hunts for patterns in the data.
Why is it called a data lake?
Data lakes got their name because they are unstructured pools of information with data flowing into them from many different sources.
Is Hadoop a data lake?
Hadoop is not a data lake in itself. It is a software framework that is commonly used to build data lakes. It includes a distributed file system that can store large amounts of information on clusters of low-cost servers.