Data lake concept and systems: a survey
Abstract
- reviews the development, definition, and architectures of data lakes
- provides a comprehensive overview of research questions for designing and building data lakes
- classifies the existing data lake systems based on the functions they provide
1 Introduction
Big data gave rise to ELT (instead of ETL) and NoSQL stores, which in turn paved the way for data lakes
2 A brief history of data lakes
- 2010-2013: Beginnings
- 2014-2015: Criticisms and further development
- 2016-present: Prosperity and diversity
3 Data lake definition
Data Lake: A data lake is a flexible, scalable data storage and management system, which ingests and stores raw data from heterogeneous sources in their original format, and provides query processing and data analytics in an on-the-fly manner.
remarks
- store raw data
- not only a storage system
- support on-demand data processing and querying
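The last remark is essentially schema-on-read: raw records keep their original format, and a schema is imposed only at query time. A minimal Python sketch, assuming a hypothetical JSON-lines file in the raw store:

```python
# Minimal schema-on-read sketch: raw data stays in its original format;
# a schema is applied only when a query runs (file path is hypothetical).
import json

def query_raw(path, predicate, projection):
    """Parse raw JSON-lines records on the fly and apply the query."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)          # schema applied at read time
            if predicate(record):
                yield {k: record.get(k) for k in projection}

# Example: project name/age of adult users from an untouched raw file.
adults = query_raw("raw/users.jsonl", lambda r: r.get("age", 0) >= 18,
                   ["name", "age"])
print(list(adults))
```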


4 Data lake architecture
two high-level data lake philosophies:
- pond architecture
- zone architecture
Both philosophies remain high-level and lack technical detail about concrete functions, which hampers modular and repeatable implementations
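For illustration, a minimal sketch of a zone-style layout, assuming the common raw/cleansed/curated zone names (naming varies between proposals):

```python
# Hypothetical sketch of a zone architecture: datasets move through zones
# as they are cleansed and refined; zone names follow common conventions,
# not a fixed standard.
from pathlib import Path

ZONES = ["raw", "cleansed", "curated"]

def promote(lake_root: str, dataset: str, from_zone: str, to_zone: str):
    """Move a dataset one zone forward (refinement steps omitted)."""
    src = Path(lake_root, from_zone, dataset)
    dst = Path(lake_root, to_zone, dataset)
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_bytes(src.read_bytes())   # real systems transform, not just copy
```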


users
- data scientists
- information curators
- the governance, risk, and compliance team
- operations team
5 Storage
preserve the ingested datasets
5.1 File-based storage systems
e.g. HDFS: supports a wide range of file formats
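A small sketch of browsing such a file-based lake with pyarrow's HDFS binding (host, port, and path are assumptions; a local Hadoop client with libhdfs is required):

```python
# Sketch of listing an HDFS-backed lake with pyarrow; namenode host/port
# and the /datalake path are assumptions for illustration.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode", port=8020)
for info in hdfs.get_file_info(fs.FileSelector("/datalake", recursive=True)):
    print(info.path, info.size)   # HDFS stores any file format as-is
```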
5.2 Single data store
e.g. Neo4j
has a special application focus: personal user data, which is usually small compared to business scenarios but carries stricter requirements regarding data privacy
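A sketch of accessing such a graph store with the official neo4j Python driver (the URI, credentials, and the Dataset/Item labels are assumptions, not the cited system's actual schema):

```python
# Sketch of querying a graph-based personal data lake via the neo4j driver;
# connection details and node labels are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    result = session.run("MATCH (d:Dataset)-[:CONTAINS]->(i:Item) "
                         "RETURN d.name AS dataset, count(i) AS items")
    for record in result:
        print(record["dataset"], record["items"])
driver.close()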
5.3 Polystore systems
e.g. BigDAWG
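BigDAWG groups stores into "islands", one per data model; the hypothetical sketch below only mimics that routing idea with stub engines and is not BigDAWG's actual API:

```python
# Hypothetical polystore routing sketch in the spirit of BigDAWG's islands:
# each sub-query is tagged with its data model and dispatched to the
# matching engine (engine clients are stubs, not real connectors).
def execute(island: str, query: str, engines: dict):
    """Dispatch a query fragment to the store that owns that data model."""
    return engines[island](query)

engines = {
    "relational": lambda q: f"postgres would run: {q}",
    "array":      lambda q: f"scidb would run: {q}",
}
print(execute("relational", "SELECT * FROM patients", engines))
```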
5.4 Data lakes on clouds
e.g. IaaS
- scale storage space and computation power dynamically; in many cases resource prices are more economical than on-premises deployments
- the major cloud vendors provide many additional analytics tools in their product portfolios
- relying on a cloud platform also implies risks and challenges in some aspects such as data security, data provenance, and fault tolerance
6 Ingestion
Ingestion components load data into the lake and store it in databases or file systems
6.1 Metadata extraction
To discover metadata that is essential for accessing a dataset
e.g. GEMMS, DATAMARAN, Skluma
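A toy sketch in the spirit of such extractors (not their actual code): infer attribute names and crude types from a raw CSV file; the path is hypothetical:

```python
# Minimal metadata-extraction sketch: read the header and one sample row
# of a CSV file and guess coarse attribute types.
import csv

def extract_metadata(path):
    with open(path, newline="") as f:
        rows = csv.reader(f)
        header = next(rows)
        sample = next(rows, [])

    def crude_type(value):
        try:
            float(value)
            return "numeric"
        except ValueError:
            return "string"

    return {"attributes": header,
            "types": [crude_type(v) for v in sample]}

print(extract_metadata("raw/sales.csv"))
```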
6.2 Metadata modeling
To structure and organize the metadata in a formal way
The majority of such models are either logic-based or graph-structured with more or less formal semantics
- Generic metadata model
- Data vault
- Graph-based metadata model, e.g. Aurum
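A minimal sketch of a graph-based metadata model, with datasets and attributes as nodes and containment/similarity as edges (networkx is used only for illustration; node names are hypothetical):

```python
# Sketch of a graph-based metadata model: traversing similarity edges
# between attribute nodes enables dataset discovery.
import networkx as nx

G = nx.DiGraph()
G.add_node("sales.csv", kind="dataset")
G.add_node("sales.csv/customer_id", kind="attribute")
G.add_node("crm.json/cust_id", kind="attribute")
G.add_edge("sales.csv", "sales.csv/customer_id", rel="hasAttribute")
G.add_edge("sales.csv/customer_id", "crm.json/cust_id",
           rel="similarTo", score=0.9)   # discovered relatedness signal
print(list(G.successors("sales.csv/customer_id")))
```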
7 Maintenance
To make the data usable, the data lake needs to further process and maintain the raw data
7.1 Dataset preparation and organization
structure and navigate the massive heterogeneous datasets in data lakes

7.1.1 Dataset preparation
e.g. KAYAK
7.1.2 Data lake organization
e.g. GOODS, DS-kNN
data lake organization problem: discovering the optimal structure for users to effectively find the desired dataset in a data lake
7.2 Discover related datasets
data discovery: tries to find a subset of relevant datasets that are similar or complementary to a given dataset in a certain way

steps
- define and extract relatedness signals from tables
- compute multi-dimensional similarities between attributes, and aggregate them to an overall similarity between tabular datasets
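A minimal sketch of these two steps, using value overlap (Jaccard) as the only relatedness signal; real systems combine several signals (attribute names, value distributions, embeddings):

```python
# Sketch of attribute-level similarity plus aggregation to a table-level
# score; tables are modeled as dicts mapping attribute name -> value set.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def table_similarity(t1: dict, t2: dict) -> float:
    """Average over each t1 attribute's best-matching t2 attribute."""
    scores = []
    for col1 in t1.values():
        best = max((jaccard(col1, col2) for col2 in t2.values()), default=0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0

t1 = {"country": {"DE", "FR"}, "city": {"Aachen", "Paris"}}
t2 = {"nation": {"DE", "FR", "NL"}}
print(table_similarity(t1, t2))
```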
7.3 Data integration
To combine multiple heterogeneous data sources and provide unified data access for users
data integration techniques: schema matching, schema mapping, query reformulation, entity linkage
e.g. Constance
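A toy sketch of the schema-mapping step: source attributes are renamed into a mediated schema so queries see unified records (sources and mappings are hypothetical):

```python
# Sketch of applying schema mappings to produce a unified view over
# heterogeneous sources; mapping tables are made up for illustration.
MAPPINGS = {
    "crm":   {"cust_id": "customer_id", "fullname": "name"},
    "sales": {"customer": "customer_id", "client_name": "name"},
}

def to_unified(source: str, record: dict) -> dict:
    """Rename source attributes into the mediated schema."""
    m = MAPPINGS[source]
    return {m[k]: v for k, v in record.items() if k in m}

unified = [to_unified("crm", {"cust_id": 7, "fullname": "Ada"}),
           to_unified("sales", {"customer": 7, "client_name": "Ada"})]
print(unified)
```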
7.4 Metadata enrichment
To further understand and explore a dataset
e.g. CoreDB, GOODS, Constance
7.5 Data quality improvement
Obtain dependencies (e.g., functional dependencies) from the data in the lake, then use them to improve data quality.
e.g. CLAMS, Constance
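A minimal sketch of the dependency-based idea, checking a candidate functional dependency X -> Y and reporting violating records (the example data is made up):

```python
# Sketch of dependency-based quality checking: verify a candidate
# functional dependency lhs -> rhs and collect violating rows.
def fd_violations(rows, lhs, rhs):
    """Return rows where the same lhs value maps to different rhs values."""
    seen, bad = {}, []
    for row in rows:
        key, val = row[lhs], row[rhs]
        if key in seen and seen[key] != val:
            bad.append(row)
        seen.setdefault(key, val)
    return bad

rows = [{"zip": "52062", "city": "Aachen"},
        {"zip": "52062", "city": "Achen"}]     # typo violates zip -> city
print(fd_violations(rows, "zip", "city"))
```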
7.6 Schema evolution
handling changes to schemas and integrity constraints
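One common tactic is to tolerate evolution at read time; a toy sketch, assuming an additive change (a new attribute with a default value):

```python
# Sketch of read-time schema evolution: records written under an older
# schema version are upgraded with defaults for attributes added later.
SCHEMA_V2_DEFAULTS = {"currency": "EUR"}   # attribute added in version 2

def upgrade(record: dict) -> dict:
    return {**SCHEMA_V2_DEFAULTS, **record}

old = {"amount": 10}                # written before "currency" existed
print(upgrade(old))                 # {'currency': 'EUR', 'amount': 10}
```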
8 Exploration
challenges:
- a large number of ingested sources
- the heterogeneity of data
solutions:
- explore the data lake based on the relatedness of datasets
- provide a unified query interface for heterogeneous data sources
8.1 Query-driven data discovery
searching a data lake based on the measured relatedness (e.g., joinable, unionable) among datasets
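A minimal sketch of joinability search, ranking candidate columns by set containment with respect to a query column (column names and values are made up):

```python
# Sketch of query-driven discovery: rank candidate columns by how well
# they join with a query column, using containment as the score.
def containment(query_col: set, candidate: set) -> float:
    return len(query_col & candidate) / len(query_col) if query_col else 0.0

query = {"DE", "FR", "NL"}
candidates = {"countries.iso": {"DE", "FR", "NL", "BE"},
              "cities.name":   {"Aachen", "Paris"}}
ranked = sorted(candidates.items(),
                key=lambda kv: containment(query, kv[1]), reverse=True)
print(ranked)   # joinable columns first
```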
8.2 Query heterogeneous data
e.g. Constance, CoreDB, Ontario
querying solutions:
- transform the data in heterogeneous NoSQL stores into relational tables and use an existing relational database to process it, e.g. Argo (see the sketch after this list)
- multistore systems providing a SQL-like query language to query NoSQL systems
- applying a middle-ware to access the multiple NoSQL stores
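A toy sketch of the first approach in the list above (Argo-style): flatten heterogeneous JSON documents into a relational table and query them with plain SQL (sqlite is used only for illustration):

```python
# Sketch of relational-over-NoSQL querying: JSON documents with varying
# attributes are flattened into one table, then queried with SQL.
import json
import sqlite3

docs = ['{"id": 1, "name": "Ada", "age": 36}',
        '{"id": 2, "name": "Bob"}']            # heterogeneous: "age" missing

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
for d in docs:
    r = json.loads(d)
    conn.execute("INSERT INTO users VALUES (?, ?, ?)",
                 (r.get("id"), r.get("name"), r.get("age")))
print(conn.execute("SELECT name FROM users WHERE age IS NULL").fetchall())
```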
9 Composite metadata management
To prevent a data lake from turning into a "data swamp"
9.1 Schema mapping formalisms
metadata-related challenges:
- Expressive power
- Algorithmic properties (computational efficiency)
- Structural properties and decidability of reasoning tasks
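For concreteness, the classic mapping formalism here is the source-to-target tuple-generating dependency (st-tgd), of the shape ∀x̄ ( φ_S(x̄) → ∃ȳ ψ_T(x̄, ȳ) ). An illustrative instance (relation names hypothetical):

∀e,d ( Emp(e, d) → ∃m Dept(d, m) )

i.e., every department d appearing with an employee in the source must appear in the target Dept relation with some (possibly unknown) manager m. The challenges above largely come down to which formulas φ and ψ a formalism allows on each side of the implication.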

9.2 Data provenance (data lineage)
tracks the origin and processing history of data records
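A minimal sketch of coarse-grained provenance: each derived dataset records the operation and inputs that produced it, so lineage can be traced back to raw sources (names are hypothetical):

```python
# Sketch of coarse-grained provenance capture and backward lineage tracing.
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    output: str
    operation: str
    inputs: list = field(default_factory=list)

lineage = [ProvenanceRecord("curated/sales", "dedup", ["raw/sales.csv"]),
           ProvenanceRecord("report/q3", "aggregate", ["curated/sales"])]

def trace(target, records):
    """Walk lineage edges back to the raw sources of a dataset."""
    for r in records:
        if r.output == target:
            return [src for i in r.inputs
                    for src in (trace(i, records) or [i])]
    return []

print(trace("report/q3", lineage))   # ['raw/sales.csv']
```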
10 New directions
- Machine learning in data lakes
- Data lakes for data science
- Stream data lakes
11 Summary and outlook
Some well-studied problems (e.g., data integration, schema evolution, metadata modeling) need new perspectives and methods in data lakes, while many blank spaces (e.g., stream data lakes, integrating data lakes with machine learning and data science) also call for novel solutions.