AWS Lake Formation preview supports transactions for concurrent DML operations and consistent query results, row-level security policies for granular access control, and accelerated access through inline filtering, aggregations, and automatic file compaction.
Image Credit: AWS Lake Formation
Transactions and Concurrent DML Support
Data lakes need to show users a correct view of the data at all times, even while the data is receiving simultaneous real-time or frequent updates. A common pattern in data lakes is to organize data into tables composed of rows that can include structured or semi-structured data. To load streaming data or quickly incorporate changes from source data systems, you need to insert, delete, and modify rows across multiple tables in parallel. Today, developers write custom application code or use open source tools to manage these updates. These solutions are complex and difficult to scale because writing application code that maintains consistency when concurrently reading and writing the same data is tedious, brittle, and error prone.
AWS Lake Formation introduces new APIs that support atomic, consistent, isolated, and durable (ACID) transactions using a new data lake table type, called a ‘governed table.’ A governed table allows multiple users to concurrently insert, delete, and modify rows across tables, while other users simultaneously run analytical queries and machine learning (ML) models on the same data sets and still get consistent, up-to-date results. The ability to update and delete individual rows in governed tables, such as a customer's record after that customer has asked to be forgotten, helps users comply with “right to be forgotten” provisions in privacy laws like GDPR and CCPA.
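To make the transactional guarantee more concrete, here is a minimal in-memory sketch of the idea: writes are staged inside a transaction and become visible to readers all at once on commit. This is an illustration only, not the Lake Formation API. The names `ToyTable` and `ToyTransaction` are hypothetical, and the conflict check (abort if anything else committed first) is a deliberately coarse assumption; the actual service may resolve conflicts differently.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table; NOT the Lake Formation API."""
    committed: dict = field(default_factory=dict)  # row id -> row
    version: int = 0                               # bumped on every commit

    def snapshot(self) -> dict:
        # Readers only ever see the last committed state.
        return dict(self.committed)

    def begin(self) -> "ToyTransaction":
        return ToyTransaction(self)


class ToyTransaction:
    def __init__(self, table: ToyTable):
        self.table = table
        self.start_version = table.version
        self.writes = {}  # row id -> new row, or None to mark a delete

    def upsert(self, row_id, row):
        self.writes[row_id] = row

    def delete(self, row_id):
        self.writes[row_id] = None

    def commit(self):
        # Assumed, simplified optimistic check: abort if another
        # transaction committed after this one started.
        if self.table.version != self.start_version:
            raise RuntimeError("conflict: table changed since transaction start")
        new_state = dict(self.table.committed)
        for row_id, row in self.writes.items():
            if row is None:
                new_state.pop(row_id, None)
            else:
                new_state[row_id] = row
        # Swap in the new state in one step so readers never see a partial write.
        self.table.committed = new_state
        self.table.version += 1


table = ToyTable()
setup = table.begin()
setup.upsert(1, {"customer": "alice"})
setup.upsert(2, {"customer": "bob"})
setup.commit()

txn = table.begin()
txn.delete(1)                          # e.g. a "right to be forgotten" request
txn.upsert(3, {"customer": "carol"})
before = table.snapshot()              # uncommitted changes are invisible
txn.commit()
after = table.snapshot()               # all changes appear together
```

In this sketch a reader that takes a snapshot mid-transaction sees rows 1 and 2, and a reader after the commit sees rows 2 and 3; no reader ever observes the delete without the insert.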
Row-Level Security
Making sure users have access to only the right data in a data lake is difficult. Some users need access to all data within a dataset, while other users are restricted from seeing columns of sensitive information like social security numbers or rows of data like sales records from other regions. Data lake administrators often maintain multiple copies of data to apply different security policies for different users. This adds complexity, operational overhead, and extra storage costs.
AWS Lake Formation already allows you to set access policies that hide data, such as a column of social security numbers, from users who do not have permission to view it. With row-level security, you can now set row-level policies in addition to column-level policies. For example, you can set a policy that gives a regional manager access to only the data for their region.
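The combined effect of a row-level and a column-level policy can be sketched in a few lines. This is a conceptual illustration, not how Lake Formation defines or enforces policies; `AccessPolicy`, `apply_policy`, and the sample `sales` rows are all hypothetical names and data made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical policy: which rows a user may see, and which columns are hidden."""
    row_predicate: Callable[[dict], bool]
    hidden_columns: frozenset = frozenset()


def apply_policy(rows, policy):
    """Return only permitted rows, with hidden columns stripped from each."""
    visible = []
    for row in rows:
        if policy.row_predicate(row):  # row-level filter
            visible.append(
                {k: v for k, v in row.items() if k not in policy.hidden_columns}
            )  # column-level filter
    return visible


# Made-up sample data for illustration.
sales = [
    {"region": "EMEA", "amount": 120, "ssn": "000-00-0001"},
    {"region": "APAC", "amount": 75, "ssn": "000-00-0002"},
    {"region": "EMEA", "amount": 40, "ssn": "000-00-0003"},
]

# A regional manager sees only their region, and never the SSN column.
emea_manager = AccessPolicy(
    row_predicate=lambda row: row["region"] == "EMEA",
    hidden_columns=frozenset({"ssn"}),
)
visible = apply_policy(sales, emea_manager)
```

Because the filter is applied when the data is read, a single copy of the table can serve users with different permissions, which is the alternative to maintaining one copy per security policy.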
Better Performance with Filtering, Aggregations, and Automatic File Compaction
Analytics performance can suffer when new data written to the data lake is stored inefficiently as many small files. Processing these small files creates additional overhead for analytics services and slows query responses.
With this preview, Lake Formation includes a new storage optimizer that automatically combines small files into larger files to speed up queries by up to 7x. This process, commonly known as compaction, runs in the background so that it has no performance impact on your production workloads.
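As a rough picture of what compaction planning involves, the sketch below greedily groups small files into batches that each stay at or under a target size, so that each batch could be rewritten as one larger file. It is not the Lake Formation storage optimizer, whose algorithm is not described here; `plan_compaction`, the 128 MB target, and the sample file sizes are assumptions chosen for illustration.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group files, in order, into batches of at most target_mb.

    Hypothetical helper: a file larger than the target simply ends up
    in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Made-up sizes (in MB) of files written by frequent small updates.
small_files = [10, 20, 64, 50, 30, 128, 5]
plan = plan_compaction(small_files)
# plan == [[10, 20, 64], [50, 30], [128], [5]]
```

Here seven input files become four output files, so a query engine has fewer objects to open and list for the same data.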
Preview
In the preview, these new capabilities are available via new, open, and public update and access APIs for data lakes. Once generally available, these APIs may be used by AWS services, third parties, and custom applications that directly read from and write to Amazon S3 data lakes.
Contact us for help with your journey toward a cloud data lake on AWS.