AWS Lake Formation simplifies and automates many of the complex manual steps usually required to create a data lake. This includes collecting, cleaning, and cataloging data, and securely making that data available for analytics.
Customers can easily bring their data into a data lake from a variety of sources using pre-defined templates; automatically classify and prepare the data; and centrally define granular data access policies to govern access by the different groups within an organization.
Customers want to perform analytics and machine learning across all of their data, regardless of its format or where it lives. A data lake removes data silos by letting data reside in one central place, so customers can more easily apply different types of analytics and machine learning to it. Amazon Simple Storage Service (Amazon S3) has become a very popular place for customers to build data lakes because of its scale, cost-effectiveness, durability, and easy integration with AWS’s analytics and machine learning services.
However, even with those significant benefits, building and managing a data lake can still be a complex and time-consuming process. Customers need to provision and configure storage, move data from disparate sources into the data lake, extract the schema, and add metadata tags so the data can be found in a searchable data catalog. They must also clean and prepare the data (including partitioning, indexing, and transforming it) to optimize the performance and cost of running analytics on it.
Then, they have to set up data access roles and enforce security policies across their storage and each of their different analytics engines. They must also update those security policies whenever permissions change or new end users join.
Finally, customers must make the data securely available to their data analysts, who analyze and process it using any of the available analytics engines. These steps involve a lot of manual work, and as a result, it can take most customers up to several months to set up a data lake.
AWS Lake Formation simplifies data lakes
AWS Lake Formation significantly simplifies the process and removes the heavy lifting from setting up a data lake. It automates manual, time-consuming steps, like provisioning and configuring storage; crawling the data to extract schema and metadata tags; automatically optimizing the partitioning of the data; and transforming the data into formats like Apache Parquet and ORC that are ideal for analytics.
AWS Lake Formation cleans and deduplicates data using machine learning to improve data consistency and quality. To simplify data access and security, AWS Lake Formation provides a single, centralized place to set up and manage data access policies, governance, and auditing across Amazon S3 and multiple analytics engines.
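As a rough illustration of what centrally defined access policies can look like in practice, the sketch below builds a table-level grant and, optionally, submits it through the boto3 `lakeformation` client's `grant_permissions` call. The role ARN, database name, table name, and region are hypothetical placeholders, and the helper functions are illustrative names rather than part of any AWS library.

```python
from typing import Any, Dict, List


def build_table_grant(principal_arn: str, database: str, table: str,
                      permissions: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for a Lake Formation GrantPermissions request."""
    if not permissions:
        raise ValueError("at least one permission is required")
    return {
        # The IAM user or role that receives access.
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        # The catalog table the grant applies to.
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        # For example ["SELECT"] for read-only access.
        "Permissions": list(permissions),
    }


def apply_grant(request: Dict[str, Any], region: str = "us-east-1") -> None:
    """Send the grant to Lake Formation (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works offline

    client = boto3.client("lakeformation", region_name=region)
    client.grant_permissions(**request)


# Hypothetical principal, database, and table names, for illustration only.
request = build_table_grant(
    "arn:aws:iam::123456789012:role/AnalystRole",
    "sales_db",
    "orders",
    ["SELECT"],
)
```

Calling `apply_grant(request)` would submit the grant. Keeping the request builder separate from the network call lets the policy definition be reviewed or tested without AWS access.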
Customers can now easily access data from a single place and integrate with their choice of AWS analytics and machine learning services, including Amazon Redshift, Amazon Athena, and AWS Glue, with Amazon EMR, Amazon QuickSight, and Amazon SageMaker following in the next few months. With AWS Lake Formation, customers can set up and begin using a data lake in days instead of months.