
What to expect from your Data Lake?



1. What is a data lake


The data lake is a hot concept at the moment, and many companies are building or planning to build their own. Before starting such a project, however, it is crucial to figure out what a data lake actually is, clarify the basic components of a data lake project, and then design its basic architecture. So what is a data lake? There are several different definitions.


According to Wikipedia, a data lake is a system or repository that stores data in its natural/raw format, usually as object blobs or files. It typically holds copies of the raw data produced by source systems as well as transformed data generated for various tasks, and it can contain structured data from relational databases (rows and columns), semi-structured data (such as CSV, logs, XML, JSON), unstructured data (such as emails, documents, PDFs), and binary data (such as images, audio, and video).


AWS defines a data lake as a centralized repository that allows you to store all your structured and unstructured data at any scale.


Microsoft's definition is vaguer still. Rather than stating clearly what a data lake is, it cleverly defines what a data lake does: a data lake includes all the capabilities that make it easier for developers, data scientists, and analysts to store and process data. These capabilities let users store data of any scale, type, and speed, and perform all types of analysis and processing across platforms and languages.


There are actually many definitions of data lakes, but they basically revolve around the following characteristics.


1. A data lake needs to provide sufficient storage capacity to hold all of the data in an enterprise/organization.


2. A data lake can store massive amounts of data of any type, including structured, semi-structured, and unstructured data.


3. The data in a data lake is raw data, a complete copy of the business data; it retains its original form exactly as it appears in the business systems.


4. A data lake needs complete data management capabilities (comprehensive metadata) that cover all data-related elements, including data sources, data formats, connection information, data schemas, and permission management.


5. The data lake needs to have diversified analysis capabilities, including but not limited to batch processing, streaming computing, interactive analysis, and machine learning; at the same time, it also needs to provide certain task scheduling and management capabilities.


6. A data lake needs full data lifecycle management. It must not only store the raw data but also save the intermediate results of analysis and processing, and completely record how each result was produced, so that users can trace the lineage of any piece of data in full detail.


7. A data lake needs complete data acquisition and data publishing capabilities. It must support a variety of data sources, obtain full or incremental data from them, and store that data in a standardized way. It must also be able to push analysis and processing results to the appropriate storage engines to meet the access requirements of different applications.


8. Support for big data, including ultra-large-scale storage and scalable large-scale data processing capabilities.


In summary, I personally believe that a data lake should be an evolving, scalable infrastructure for big data storage, processing, and analysis. It is data-oriented, enabling full ingestion, full storage, multi-mode processing, and full lifecycle management of data from any source, at any speed, scale, and type; and by interacting and integrating with various external heterogeneous data sources, it supports all kinds of enterprise-level applications.


Two more points need to be pointed out here:

1) Scalability refers both to scalability of scale and scalability of capability. A data lake must not only provide "sufficient" storage and computing power as data volumes grow; it must also continuously add new processing modes as needed. For example, a business may only need batch processing at first, but as it develops it may require interactive ad hoc analysis, and as latency requirements tighten it may need real-time analysis and machine learning.


2) Data-oriented means the data lake should be simple and easy to use, freeing users from complex IT infrastructure operations so they can focus on the business, the models, the algorithms, and the data itself. The data lake is built for data scientists and analysts. At present, cloud native is probably the ideal way to build a data lake; this point is discussed in detail in the "basic structure of the data lake" section below.



2. The basic characteristics of the data lake


Now that we have a basic understanding of the concept, we need to clarify what basic characteristics a data lake should have, especially compared with big data platforms or traditional data warehouses. I personally think we can analyze the characteristics of a data lake from two perspectives: data and computation. In terms of data:


1) "Fidelity". A complete copy of the data in the business system is stored in the data lake, which is exactly the same. The difference from the data warehouse is that a copy of the original data must be stored in the data lake. No matter the data format, data mode, or data content should not be modified. In this regard, the data lake emphasizes the preservation of the "authentic" business data. At the same time, the data lake should be able to store any type/format of data.


2) "Flexibility": One point in the above table is "write schema" vs "read schema". In fact, it is essentially a question of at which stage the design of the data schema occurs. For any data application, the design of the schema is actually essential. Even for some databases such as mongoDB that emphasize "modeless", it is still recommended in the best practice to use the same/similar structure for records as much as possible. The logic behind the "write-in schema" is that before data is written, the data schema needs to be determined according to the access method of the business, and then the data import is completed according to the established schema. The benefit is that the data and the business are well adapted However, this also means that the initial cost of ownership of the data warehouse will be relatively high, especially when the business model is not clear and the business is still in the exploratory stage, the flexibility of the data warehouse is not enough. The underlying logic behind the “read schema” emphasized by the Data Lake is that business uncertainty is the norm: we can’t anticipate business changes, so we maintain a certain degree of flexibility and delay the design so that the entire The infrastructure has the ability to make data fit the business "on demand". Therefore, I personally believe that "fidelity" and "flexibility" are in the same line: since there is no way to predict business changes, then simply keep the data in the most original state, and once needed, the data can be processed according to needs. Therefore, the data lake is more suitable for innovative enterprises and enterprises with rapid business development. At the same time, users of the data lake have correspondingly higher requirements. Data scientists and business analysts (with certain visualization tools) are the target customers of the data lake.


3) "Manageable": The data lake should provide comprehensive data management capabilities. Since data requires "fidelity" and "flexibility", there will be at least two types of data in the data lake: raw data and processed data. The data in the data lake will continue to accumulate and evolve. Therefore, data management capabilities will also be very demanding, and at least the following data management capabilities should be included: data source, data connection, data format, data schema (library/table/column/row). At the same time, the data lake is a unified data storage place in a single enterprise/organization. Therefore, it also needs to have certain rights management capabilities.


4) "Traceability": A data lake is a storage place for all data in an organization/enterprise. It needs to manage the entire life cycle of data, including the entire process of data definition, access, storage, processing, analysis, and application. The realization of a powerful data lake requires that the access, storage, processing, and consumption process of any piece of data between it be traceable, and the complete data generation process and flow process can be clearly reproduced. In terms of computing, I personally think that data lakes have very extensive requirements for computing capabilities, and they completely depend on the computing requirements of the business.


5) Rich computing engines. From batch processing, stream computing, and interactive analysis to machine learning, all kinds of computing engines fall within the scope of the data lake. Under normal circumstances, data loading, transformation, and processing use a batch engine; workloads requiring real-time results use a stream computing engine; and exploratory analysis scenarios may call for an interactive analysis engine. As big data and artificial intelligence technologies become more tightly integrated, machine learning and deep learning algorithms keep being introduced; for example, the TensorFlow/PyTorch frameworks already support reading sample data from HDFS/S3/OSS for training (a hedged sketch follows below). For a qualified data lake project, the scalability and pluggability of computing engines should therefore be a basic capability.
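As a hedged illustration of training directly against sample data in the lake, the sketch below assumes a TensorFlow build with S3 filesystem support and a hypothetical bucket layout and feature schema; it is not tied to any particular data lake product.

```python
# Hypothetical example: train a small model on TFRecord samples kept in
# object storage. Assumes TensorFlow can resolve s3:// paths (S3 filesystem
# support enabled); bucket, file pattern and feature names are placeholders.
import tensorflow as tf

files = tf.data.Dataset.list_files("s3://example-lake/training/samples/*.tfrecord")

def parse_example(serialized):
    # Feature names and shapes here are illustrative.
    features = {
        "pixels": tf.io.FixedLenFeature([784], tf.float32),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, features)
    return parsed["pixels"], parsed["label"]

dataset = (tf.data.TFRecordDataset(files)
           .map(parse_example)
           .shuffle(10_000)
           .batch(256))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(dataset, epochs=3)
```

The takeaway is simply that the training framework reads samples straight from the lake's storage, with no separate export step.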


6) Multi-modal storage engines. In theory, the data lake should have a built-in multi-modal storage engine to meet the access requirements of different applications (taking response time, concurrency, access frequency, cost, and other factors into account). In practice, however, the data in a lake is usually not accessed very frequently, and the applications on top of it are mostly exploratory, so to achieve an acceptable cost/performance ratio, data lake builds usually choose relatively cheap storage engines (such as S3/OSS/HDFS/OBS) and work with external storage engines when needed to satisfy diverse application requirements.



3. The basic structure of the data lake


The data lake can be considered as a new generation of big data infrastructure. In order to better understand the basic architecture of the data lake, let's take a look at the evolution process of the big data infrastructure architecture.


1) The first stage: offline data processing infrastructure represented by Hadoop. Hadoop is a batch data processing infrastructure with HDFS as the core storage and MapReduce (MR for short) as the basic computing model. Around HDFS and MR, a series of components emerged to continuously improve the data processing capabilities of the platform, such as HBase for online key-value access, Hive for SQL, and Pig for workflows.


At the same time, as performance requirements for batch processing kept rising, new computing models were proposed, producing engines such as Tez, Spark, and Presto, and the MR model gradually evolved into a DAG model. On the one hand, the DAG model increases the computing model's capacity for concurrency: each computation is decomposed and logically split at its aggregation (shuffle) points into stages, each stage consisting of one or more tasks that can run concurrently, which improves the parallelism of the whole computation. On the other hand, to reduce the writing of intermediate results to files, engines such as Spark and Presto cache data in the memory of compute nodes wherever possible, improving the efficiency of the whole pipeline and the system's throughput. A small PySpark sketch of these ideas follows.
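The following minimal PySpark sketch illustrates the two points above: transformations are only recorded until an action triggers execution, the shuffle at groupBy marks a stage boundary, and cache() keeps an intermediate result in memory instead of re-reading files. The paths and column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

logs = spark.read.parquet("s3a://example-lake/cleaned/access_logs/")

# Narrow transformations: these fuse into a single stage of the DAG.
errors = logs.filter(F.col("status") >= 500).select("service", "latency_ms")

# Cache the filtered result in executor memory so the two actions below
# do not re-scan the source files.
errors.cache()

# The shuffle introduced by groupBy starts a new stage whose tasks run
# concurrently across partitions.
by_service = errors.groupBy("service").agg(F.avg("latency_ms").alias("avg_latency"))
by_service.show()

# The second action reuses the cached partitions instead of re-reading files.
print("total 5xx responses:", errors.count())
```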


2) The second stage: the Lambda architecture. As data processing capabilities and requirements kept changing, more and more users found that no matter how much batch processing improved, it could not satisfy scenarios with strict real-time requirements, so stream computing engines such as Storm, Spark Streaming, and Flink emerged. As more and more applications came online, it became clear that batch processing and stream computing used together could cover most application needs, and that users do not really care about the underlying computing model: they simply want the processing results, whether from batch or stream, to be served through a unified data model. The Lambda architecture answers this need; its core idea is to run batch and stream processing side by side and serve their results in a unified way.


In the Lambda architecture, data flows into the platform from left to right and is then split into two paths: one processed in batch mode, the other in stream computing mode. Regardless of the computing mode, the final results are exposed to applications through a serving layer, ensuring consistent access.


3) The third stage: the Kappa architecture. The Lambda architecture solves the problem of consistent reads for applications, but the separate batch and stream pipelines increase development complexity. This raised the question of whether a single system could solve all of these problems. The currently popular approach is to build everything on stream computing: its naturally distributed design gives it good scalability, and by increasing the concurrency of the streaming job and widening the "time window" over the streaming data, batch and stream processing are unified in one computing mode. A minimal sketch of this idea follows.
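As a small, hedged Kappa-style illustration, the Structured Streaming job below computes a windowed aggregation over a stream; widening the window makes the same query behave like a batch aggregation over the stream's history. It assumes the Spark Kafka connector is on the classpath, and the broker address and topic are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kappa-demo").getOrCreate()

# Placeholder Kafka endpoint and topic; requires the spark-sql-kafka package.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka.example.internal:9092")
          .option("subscribe", "orders")
          .load()
          .selectExpr("CAST(value AS STRING) AS body", "timestamp"))

# Counting per 1-minute window; widening the window (e.g. to a day) turns the
# same query into what a separate batch job would otherwise compute.
per_minute = (events
              .withWatermark("timestamp", "10 minutes")
              .groupBy(F.window("timestamp", "1 minute"))
              .count())

query = (per_minute.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```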


In summary, from the traditional Hadoop architecture to the Lambda architecture, and from Lambda to Kappa, the evolution of big data infrastructure has gradually incorporated all of the data processing capabilities applications require, and the big data platform has gradually become the enterprise's/organization's platform for processing all of its data. In current corporate practice, apart from relational databases that remain attached to individual business systems, almost all other data is expected to be brought into the big data platform for unified processing. However, today's big data infrastructure focuses on storage and computation while neglecting the asset management of data, and this is precisely one of the areas the data lake, as a new generation of big data infrastructure, concentrates on.


The evolution of big data infrastructure reflects one point: within an enterprise/organization, it has become a consensus that data is an important asset. To make better use of data, the enterprise/organization needs to: 1) preserve data assets as-is over the long term; 2) manage and govern them effectively and centrally; 3) provide multi-mode computing capabilities to meet processing needs; and 4) provide business-facing unified data views, data models, and processing results. The data lake emerged in this context. Beyond the basic capabilities of a big data platform, the data lake emphasizes the management, governance, and capitalization of data. In concrete implementations, a data lake needs a series of data management components, including: 1) data ingestion; 2) data migration; 3) data governance; 4) quality management; 5) an asset catalog; 6) access control; 7) task management; 8) task scheduling; and 9) metadata management. A typical data lake shares with the big data platform the storage and computing capacity needed to handle ultra-large-scale data and the ability to offer multi-mode data processing; what it adds is more complete data management, which shows up specifically in:


1) More powerful data ingestion capabilities. These are embodied in the ability to define and manage various external heterogeneous data sources, and to extract and migrate data from them; the extracted and migrated data includes both the metadata of the external sources and the actual stored data.


2) More powerful data management capabilities. Management capabilities can be divided into basic and extended ones. Basic capabilities include the management of all kinds of metadata, data access control, and data asset management; these are mandatory for any data lake system, and we will look at how each vendor supports them in the "Data lake solutions of various vendors" section below. Extended capabilities include task management, process orchestration, and capabilities related to data quality and data governance. Task management and process orchestration are mainly used to manage, orchestrate, schedule, and monitor the tasks that process data in the lake; normally, data lake builders purchase or develop customized data integration or data development subsystems/modules to provide these capabilities, and such customized systems/modules integrate with the lake by reading its metadata. Data quality and data governance are more complex matters: in general, the data lake system does not provide these functions directly, but instead exposes interfaces or metadata so that capable enterprises/organizations can integrate existing data management software or do custom development.


3) Shareable metadata. The various computing engines in the data lake are deeply integrated with the data in it, and the basis of that integration is the lake's metadata. In a good data lake system, a computing engine can obtain data location, format, schema, distribution, and other information directly from the metadata and process the data without manual or programmatic intervention (a brief sketch follows below). Furthermore, a good data lake system can apply access control to the data in the lake, with granularity down to levels such as database, table, column, and row. In essence, the goal is for an enterprise's/organization's internal data to accumulate in one clear, unified place. In practice, the lake's storage should be a distributed file system that can expand on demand, and most data lake practices recommend distributed systems such as S3/OSS/OBS/HDFS as the unified storage. Switching to the data dimension, we can also look at how the data lake handles data across its lifecycle: in theory, a well-managed data lake retains the raw data permanently, while processed data is continuously refined and evolved to meet business needs.
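As a hedged illustration of shared metadata: once a table has been registered in the lake's metastore (a Hive metastore, the AWS Glue Data Catalog, and so on), any engine wired to that catalog can resolve its location, format, and schema by name alone. The database and table names below are illustrative.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("shared-metadata-demo")
         .enableHiveSupport()   # attach to the shared metastore
         .getOrCreate())

# No path, format or schema is supplied here; all of it comes from the
# lake's metadata, which other engines can read in exactly the same way.
daily_orders = spark.sql("""
    SELECT dt, COUNT(*) AS orders
    FROM lake_raw.orders
    GROUP BY dt
    ORDER BY dt
""")
daily_orders.show()
```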



4. Data lake solutions of various vendors


With the data lake being such a hot trend, the major cloud vendors have all launched their own data lake solutions and related products. This section analyzes the solutions of the mainstream vendors and maps them onto a data lake reference architecture to help you understand the strengths and weaknesses of each.



4.1 AWS Data Lake Solution

The entire solution is built around AWS Lake Formation. AWS Lake Formation is essentially a management component that works with other AWS services to deliver an enterprise-level data lake. The key steps are data ingestion, data storage, data calculation, and data application. Let's take a closer look at each:


1) Data ingestion. Ingestion is the starting point of data lake construction and covers both metadata and business data. Metadata ingestion consists of two steps, data source creation and metadata capture, which ultimately produce a data resource catalog along with the corresponding security settings and access control policies. The solution provides a dedicated component (a crawler) to obtain metadata from external data sources: it connects to the source, detects data formats and schemas, and creates the corresponding metadata in the data resource catalog (a boto3 sketch of this step follows below). Business data is ingested through ETL. In terms of product packaging, AWS abstracted metadata capture, ETL, and data preparation into a separate product, AWS Glue. AWS Glue and AWS Lake Formation share the same data resource catalog; the AWS Glue documentation states clearly that "each AWS account has one AWS Glue Data Catalog per AWS region". As for heterogeneous data sources, the AWS data lake solution supports S3, AWS relational databases, and AWS NoSQL databases, and uses Glue, EMR, Athena, and other components to let data flow freely among them.
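A sketch of the metadata-capture step with boto3 is shown below; the IAM role ARN, bucket, database, and crawler names are placeholders, and error handling is omitted for brevity.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# A crawler connects to the source, detects formats and schemas, and writes
# the resulting table definitions into the Glue Data Catalog.
glue.create_crawler(
    Name="orders-raw-crawler",
    Role="arn:aws:iam::123456789012:role/ExampleGlueRole",
    DatabaseName="lake_raw",
    Targets={"S3Targets": [{"Path": "s3://example-lake/raw/orders/"}]},
)
glue.start_crawler(Name="orders-raw-crawler")

# Once the crawler has finished, other services read the same catalog entries.
tables = glue.get_tables(DatabaseName="lake_raw")
for t in tables["TableList"]:
    print(t["Name"], t["StorageDescriptor"]["Location"])
```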


2) Data storage.

Amazon S3 serves as the centralized storage of the entire data lake, expanding on demand and billed by usage.


3) Data calculation.

The solution uses AWS Glue for basic data processing. Glue's basic unit of computation is the ETL job, run in various batch modes, and jobs can be started in three ways: manually, on a schedule, or by an event trigger (a boto3 sketch of these modes follows below). It must be said that AWS's services integrate very well with one another: in event-triggered mode, AWS Lambda can be used for extended development and to fire one or more jobs at once, which greatly improves the customizability of job triggering, and all ETL jobs can be monitored through CloudWatch.
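The sketch below shows the three job-start modes with boto3 against a hypothetical Glue job called "orders-etl"; all names, the cron expression, and the upstream job are placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# 1) Manual start: run the ETL job directly (e.g. from an operator's script
#    or from a Lambda function reacting to an event).
run = glue.start_job_run(JobName="orders-etl")
print("started run:", run["JobRunId"])

# 2) Scheduled trigger: a timed trigger owned by Glue itself.
glue.create_trigger(
    Name="orders-etl-nightly",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",     # every day at 02:00 UTC
    Actions=[{"JobName": "orders-etl"}],
    StartOnCreation=True,
)

# 3) Conditional (event-like) trigger: fire only after an upstream job succeeds.
glue.create_trigger(
    Name="orders-etl-after-staging",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": "orders-staging-etl",
        "State": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "orders-etl"}],
)
```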


4) Data application.

Beyond the basic batch computing mode, AWS provides rich computing support through various external engines: Athena and Redshift offer SQL-based interactive and batch processing, while EMR provides the full range of Spark capabilities, including the stream computing and machine learning workloads Spark can handle. A short Athena example follows below.
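As a hedged example of the SQL-on-lake path, the snippet below submits a query to Athena with boto3; the database, table, and result bucket are placeholders. Athena resolves the table definition from the shared Glue Data Catalog and scans the files on S3 directly.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

resp = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM orders GROUP BY status",
    QueryExecutionContext={"Database": "lake_raw"},
    ResultConfiguration={"OutputLocation": "s3://example-lake-query-results/"},
)
print("query execution id:", resp["QueryExecutionId"])
```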


5) Authority management.

AWS's data lake solution provides fairly complete permission management through Lake Formation, with granularity covering database, table, and column. There is one exception: when Glue accesses Lake Formation, the granularity is only database and table, which suggests from another angle that Glue is more tightly integrated with Lake Formation and enjoys broader data access rights. Lake Formation's permissions can be further divided into data resource catalog permissions and underlying data permissions, corresponding to metadata and the actual stored data respectively. The stored-data permissions are further split into data access permissions and data storage permissions: data access permissions resemble table-level permissions in a database, while data storage permissions refine access down to specific directories in S3 (both explicit and implicit). I personally think this further shows that a data lake needs to support a variety of storage engines: the future data lake may include not only core storage such as S3/OSS/OBS/HDFS but also additional engines chosen for application access patterns, for example S3 for raw data, a NoSQL store for processed data accessed in key-value fashion, and an OLAP engine for data that feeds real-time reports and ad hoc queries. Although much of the current literature emphasizes the difference between data lakes and data warehouses, in essence the data lake is a concrete realization of integrated data management thinking, and "lake-warehouse integration" is likely to be a future trend.


In summary, the AWS data lake solution is highly mature, especially in metadata management and permission management. It connects heterogeneous data sources with the various computing engines upstream and downstream, allowing data to "move" freely. Its stream computing and machine learning stories are also fairly complete. For stream computing, AWS offers the dedicated Kinesis family: Kinesis Data Firehose provides a fully managed delivery service, so data processed in real time through Kinesis Data Streams can easily be written to S3 via Firehose, with format conversion on the way, such as converting JSON to Parquet (a small ingestion sketch follows below). The best part is that Kinesis can access the metadata in Glue, which fully reflects the ecosystem completeness of the AWS solution. Similarly, for machine learning AWS provides SageMaker, which can read training data from S3 and write trained models back to S3. Note, however, that in the AWS data lake solution, stream computing and machine learning are not hard-wired into the lake; they are extensions of its computing power that can be integrated easily.
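A small sketch of that ingestion path: events are pushed to a hypothetical Firehose delivery stream, which is assumed to have been configured separately to convert JSON records to Parquet and land them on S3 for later analysis or SageMaker training. The stream name and payload are illustrative.

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# The delivery stream "orders-to-lake" is assumed to exist and to be
# configured with S3 as its destination and JSON-to-Parquet conversion.
event = {"order_id": "o-1001", "amount": 42.5, "status": "PAID"}
firehose.put_record(
    DeliveryStreamName="orders-to-lake",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```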


The mapping of the AWS data lake solution in the reference architecture

In summary, AWS's data lake solution covers all functions except quality management and data governance. In fairness, quality management and data governance are strongly tied to an enterprise's organizational structure and business type and require a great deal of custom development, so it is understandable that a general-purpose solution leaves them out. There are now reasonably good open source projects in this area, such as Apache Griffin; if you have strong requirements for quality management and data governance, you can build on them yourself.



4.2 Huawei Data Lake Solution

The information on Huawei's data lake solution comes from Huawei's official website. The related products currently listed there are Data Lake Insight (DLI) and the intelligent data lake operation platform (DAYU). DLI is roughly the equivalent of AWS's Lake Formation, Glue, Athena, and EMR (Flink & Spark) combined; I did not find an overall architecture diagram for DLI on the website. Huawei's solution is fairly complete: DLI covers all the core functions of lake construction, data processing, data management, and data application. DLI's biggest strength is the completeness of its analysis engines, which include SQL-based interactive analysis and a unified stream/batch processing engine based on Spark + Flink. For core storage, DLI relies on the built-in OBS, which is broadly comparable to AWS S3. In terms of the upstream and downstream ecosystem, Huawei's solution is even more complete than AWS's: for external data sources it supports almost every data source service currently offered on Huawei Cloud. DLI can interface with Huawei's CDM (Cloud Data Migration) and DIS (Data Access Service): 1) through DIS, DLI can define various ingestion points that can be used as sources or sinks in Flink jobs; 2) through CDM, DLI can even ingest data from IDCs and third-party clouds.


To better support advanced data lake functions such as data integration, data development, data governance, and quality management, Huawei Cloud provides the DAYU platform. DAYU is the embodiment of Huawei's data lake governance and operation methodology: it covers the entire core process of data lake governance and provides tool support for each step; Huawei's official documentation even offers suggestions on how to set up a data governance organization.


Essentially, DAYU's data governance methodology is an extension of traditional data warehouse governance methodology onto the data lake infrastructure: from the data model's point of view it still includes a source layer, a multi-source integration layer, and a detail data layer, fully consistent with a data warehouse. Quality rules and transformation models are generated from the data model and the indicator model, and DAYU interfaces with DLI, directly calling the data processing services DLI provides, to carry out governance. Huawei Cloud's data lake solution thus covers the full data processing lifecycle, explicitly supports data governance, and provides model- and indicator-based governance process tools; it has clearly begun to evolve toward "lake-warehouse integration".



4.3 Alibaba Cloud Data Lake Solution

Alibaba Cloud offers many data products. Its database-oriented data lake solution is more focused, targeting two scenarios: data lake analytics and federated analytics.

The entire solution still uses OSS as the centralized storage of the data lake. In terms of data source support, all Alibaba Cloud databases are currently supported, including OLTP, OLAP, and NoSQL databases. The key points are as follows:


1) Data ingestion and migration. During lake construction, DLA's Formation component provides metadata discovery and "one-click lake building". At the time of writing, one-click lake building only supports full loads, but incremental lake building based on binlogs is already under development and is expected to be available soon; incremental ingestion will greatly improve the freshness of data in the lake while keeping the load on the source business databases to a minimum. Note that DLA Formation is an internal component and is not exposed externally.


2) Data resource catalog. DLA provides a metadata catalog component for unified management of the data assets in the lake, whether the data is "inside" or "outside" the lake. The metadata catalog is also the unified metadata entry point for federated analytics.


3) Built-in computing engines. DLA provides two: a SQL engine and a Spark engine. Both are deeply integrated with the metadata catalog and can obtain metadata easily. Thanks to Spark, the DLA solution supports computing modes such as batch processing, stream computing, and machine learning.


4) Peripheral ecosystem. In addition to ingesting and aggregating data from various heterogeneous sources, DLA is deeply integrated with the cloud-native data warehouse (formerly ADB) on the output side. On the one hand, DLA's processing results can be pushed into ADB to serve real-time, interactive, ad hoc complex queries; on the other hand, data in ADB can easily be written back to OSS through external tables. With DLA, the heterogeneous data sources on Alibaba Cloud can be fully connected so that data flows freely.


5) Data integration and development. Alibaba Cloud's data lake solution offers two options: DataWorks or DMS. Either one provides visual process orchestration, task scheduling, and task management. For data lifecycle management, DataWorks' data map capability is relatively more mature.


6) DMS offers powerful data management and data security capabilities. Its management granularity covers database, table, column, and row, which fully supports enterprise-grade data security requirements. Beyond permission management, DMS goes further by extending the database-oriented DevOps concept to the data lake, making data lake operations and development more fine-grained.

Data producers generate all kinds of data (on premises, on this cloud, or on other clouds) and upload it with various tools to common, standard data stores, including OSS, HDFS, databases, and so on. For these sources, DLA completes lake construction through data discovery, data ingestion, data migration, and related capabilities. For data that has entered the lake, DLA offers SQL- and Spark-based processing, and visual data integration and data development via DataWorks/DMS; for serving applications, DLA exposes a standard JDBC interface that report tools, dashboards, and other applications can connect to directly (a hedged Python sketch follows below). The defining characteristic of Alibaba Cloud's DLA is that it sits on top of the entire Alibaba Cloud database ecosystem, including OLTP, OLAP, and NoSQL databases, and offers SQL-based data processing; for enterprises whose development stacks are built around traditional databases, the migration cost is relatively low and the learning curve is gentle.
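As a hedged sketch of consuming DLA through that standard JDBC-style interface from Python, the example below assumes a MySQL-protocol-compatible endpoint and a table already mapped onto OSS files; the host, port, credentials, database, and table names are all placeholders.

```python
import pymysql

# Placeholder endpoint and credentials; assumes DLA's JDBC access is exposed
# over a MySQL-compatible protocol, which is an assumption of this sketch.
conn = pymysql.connect(
    host="dla.example-region.example.com",
    port=10000,
    user="example_user",
    password="example_password",
    database="lake_demo",
)
try:
    with conn.cursor() as cur:
        # The "orders" table is assumed to be an external table over OSS data.
        cur.execute("SELECT status, COUNT(*) FROM orders GROUP BY status")
        for status, cnt in cur.fetchall():
            print(status, cnt)
finally:
    conn.close()
```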


Another feature of Alibaba Cloud's DLA solution is "cloud-native lake-warehouse integration". Traditional enterprise data warehouses remain irreplaceable for reporting applications in the big data era, but they cannot meet the flexibility that modern data analysis and processing demand. We therefore recommend treating the data warehouse as an upper-layer application of the data lake: the data lake is the single authoritative store of raw business data in an enterprise/organization; it processes the raw data according to the needs of various business applications to form reusable intermediate results; and once the schema of an intermediate result is relatively stable, DLA can push it into the data warehouse, on which the enterprise/organization builds its business applications. Alongside DLA, Alibaba Cloud also offers a cloud-native data warehouse (formerly ADB). DLA and the cloud-native data warehouse are deeply integrated in two respects.

1) A common SQL parsing engine. DLA's SQL is fully compatible with ADB's SQL syntax, which means developers can use one technology stack to build both data lake and data warehouse applications. 2) Built-in OSS access on both sides. OSS is DLA's native storage, and ADB can easily access structured data on OSS through external tables. With external tables, data can move freely between DLA and ADB, achieving genuine lake-warehouse integration.


The combination of DLA + ADB truly delivers cloud-native lake-warehouse integration (what "cloud native" means is beyond the scope of this article). In essence, DLA can be seen as an enhanced staging (source-aligned) layer of the data warehouse. Compared with a traditional warehouse staging layer, it: (1) can store all kinds of structured, semi-structured, and unstructured data; (2) can connect to various heterogeneous data sources; (3) has metadata discovery, management, and synchronization capabilities; (4) has stronger data processing power through its built-in SQL/Spark engines to meet diverse processing needs; and (5) offers full lifecycle management over all of the data. A lake-warehouse solution built on DLA + ADB covers the capabilities of "big data platform + data warehouse" at once.


Another important capability of DLA is building a data flow system that extends in every direction and exposing it with a database-like experience, regardless of whether the data is on or off the cloud, inside or outside the organization. With the data lake in place, there are no longer barriers between systems' data, and data can flow in and out freely; more importantly, this flow is governed, and the data lake keeps a complete record of how data moves.



4.4 Azure Data Lake Solution

Azure's data lake solution comprises data lake storage, an interface layer, resource scheduling, and a computing engine layer. The storage layer is built on Azure's object storage and still supports structured, semi-structured, and unstructured data. The interface layer is WebHDFS; what is special is that the HDFS interface is implemented on top of the object storage, a capability Azure calls "multi-protocol access on Data Lake Storage". Resource scheduling is based on YARN. For computing engines, Azure provides U-SQL, Hadoop, Spark, and other processing engines. What sets Azure apart is the development support it offers customers through Visual Studio.


1) Development tooling deeply integrated with Visual Studio. Azure recommends U-SQL as the development language for data lake analytics applications, and Visual Studio provides a complete development environment for it. To reduce the complexity of developing for a distributed data lake system, Visual Studio organizes work around projects: when developing U-SQL you can create a "U-SQL database project", within which Visual Studio makes it easy to code and debug, and a wizard is provided to publish the finished U-SQL scripts to the production environment. U-SQL can be extended with Python and R to meet custom development needs.


2) Support for multiple computing engines: SQL, Apache Hadoop, and Apache Spark. Hadoop here includes Azure's hosted HDInsight service, and Spark includes Azure Databricks.


3) The ability to convert tasks between different engines. Microsoft recommends U-SQL as the default development language for the data lake and provides conversion tools to translate between U-SQL scripts and Hive, Spark (HDInsight & Databricks), and Azure Data Factory data flows.



4.5 Summary

This article discusses data lake solutions as a whole rather than any single product from a cloud vendor. Comparing them across data ingestion, data storage, data computation, data management, and application ecosystem gives a brief picture of where each stands.


Due to space constraints, this article does not cover the data lake solutions of Google and Tencent, which are also well-known cloud vendors; judging from their official websites, both solutions are still relatively simple, mostly conceptual descriptions, with "Ultralake" as the recommended landing solution. In fact, a data lake should not be viewed simply as a technology platform, and there are many ways to realize one. The key to evaluating whether a data lake solution is mature is the data management capability it provides, including but not limited to metadata, data asset catalogs, data sources, data processing tasks, data lifecycle, data governance, and permission management, as well as its ability to connect with the surrounding ecosystem.


5. Summary


As a new generation of big data analysis and processing infrastructure, the data lake needs to go beyond the traditional big data platform. I personally think that the following aspects are the possible future development directions of the data lake solution.


1) Cloud-native architecture. Opinions differ on what a cloud-native architecture is, and a unified definition is hard to find, but for the data lake scenario I personally think it comes down to three characteristics: (1) storage and compute are separated, so computing power and storage capacity can scale independently; (2) multi-modal computing engines are supported, including SQL, batch processing, stream computing, machine learning, and so on; (3) serverless services are provided, ensuring enough elasticity and supporting pay-as-you-go.


2) Sufficient data management capabilities. The data lake needs to provide more powerful data management capabilities, including but not limited to data source management, data category management, processing flow scheduling, task scheduling, data traceability, data governance, quality management, authority management, etc.


3) Big data capability with a database experience. Most data analysts today only have database experience; big data platforms are powerful but not user-friendly. Data scientists and data analysts should be able to focus on data, algorithms, models, and business scenarios instead of spending large amounts of time and energy learning big data platform development. For the data lake to grow quickly, providing a good user experience is key: SQL-based database application development is already deeply familiar to practitioners, and exposing the data lake's capabilities through SQL is a major direction for the future.


4) Complete data integration and data development capabilities. The management and support of various heterogeneous data sources, the full/incremental migration support of heterogeneous data, and the support of various data formats are all directions that need continuous improvement. At the same time, a complete, visual, and extensible integrated development environment is required.


5) Deep integration with the business. The composition of a typical data lake architecture has basically become an industry consensus: distributed object storage + multi-modal computing engines + data management. The key to whether a data lake solution wins lies in data management: whether it is the management of raw data, data categories, data models, data permissions, or processing tasks, none of it can be separated from adaptation to and integration with the business. In the future, more and more industry-specific data lake solutions will emerge, developing and interacting virtuously with data scientists and data analysts. How to preset industry data models, ETL processes, analysis models, and customized algorithms in a data lake solution may be a key point of differentiation in the data lake market.


This blog post is based on content available on the Internet.



