Introduction to Data Lake
Data Lake vs Data Warehouse
A data lake is a vast pool of raw data whose purpose is not yet defined. A data warehouse, by contrast, is a repository for structured, filtered data that has already been processed for a specific purpose.
A data lake is a centralized repository that allows you to migrate and store all structured and unstructured data at any scale.
| Data Lake | Data Warehouse |
|---|---|
| Primary users are data scientists | Primary users are business professionals |
| Highly accessible and quick to update | Changes are more complicated and costly |
An AWS data lake is a single platform that hosts storage, governance, and analytics.
Data Lake Storage
S3
Object storage for any amount of data, cost-effective, with 11 nines (99.999999999%) of durability. It also provides object lifecycle management.
Glacier
Used for backup and long-term archival (multi-year) at extremely low cost, also with 11 nines of durability.
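For example, a minimal boto3 sketch of lifecycle management that transitions objects from S3 Standard to Infrequent Access and then to Glacier (the bucket name, prefix, and timings are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix; adjust to your data lake layout.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-datalake-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # Infrequent Access after 30 days
                    {"Days": 365, "StorageClass": "GLACIER"},     # Glacier after a year
                ],
            }
        ]
    },
)
```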
Data Lake Storage Comparison
| | S3 Standard | S3 Infrequent Access | Glacier |
|---|---|---|---|
| Cost (500 GB per month) | USD 11.50 | USD 6.25 | USD 2.00 |
| Durability | 99.999999999% | 99.999999999% | 99.999999999% |
How to Consolidate Data into a Data Lake
Real-time Streaming Data
Kinesis Data Firehose can be used to capture, transform, and load streaming data into a variety of AWS data stores, for example S3, Redshift, Elasticsearch, and Splunk. You can then use BI tools for real-time analysis of the streaming data.
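As a minimal sketch, the boto3 call below pushes one record into a Firehose delivery stream (the stream name and payload are hypothetical):

```python
import json
import boto3

firehose = boto3.client("firehose")

# "clickstream-to-s3" is a hypothetical delivery stream configured to land in S3.
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps({"user": "u123", "event": "page_view"}) + "\n").encode("utf-8")},
)
```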
Continuously Loading On-premises Data
AWS Storage Gateway can be used to move data from an on-premises data center into S3 by presenting an S3 bucket as a file share.
AWS Snowball and Snowmobile can be used to physically transfer large volumes of on-premises data to the cloud.
Data Catalog
A data catalog makes the data usable and discoverable. It contains metadata about the data assets and involves ongoing monitoring for new data assets to add to the catalog. It helps track versions of changes and provides a queryable interface for all the data assets.
How to Create a Data Catalog in AWS
There are two ways to approach this:
- A custom approach: collect metadata from the data lake and build a data catalog yourself using AWS Lambda, DynamoDB, and Elasticsearch.
- AWS Glue Data Catalog: a managed option from AWS for building and maintaining data catalogs. A Glue crawler scans the data at its source, whether S3, DynamoDB, or any other database that supports JDBC connectivity, and automatically builds the catalog by inferring the schema (see the crawler sketch below).
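A minimal sketch of the Glue option with boto3, creating and starting a crawler over an S3 path (the names, role ARN, and path are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Crawler name, IAM role, catalog database, and S3 path are hypothetical.
glue.create_crawler(
    Name="datalake-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="datalake_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-datalake-bucket/raw/"}]},
)

# Run the crawler; discovered tables appear in the Glue Data Catalog.
glue.start_crawler(Name="datalake-raw-crawler")
```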
Data Formats
One of the core values of a data lake is that it is the collection point and repository for all of an organization's data assets, in whatever their native formats are. Common formats include CSV, TSV, JSON, JSON Lines, Parquet, ORC, and Avro.
Data Transformation
Collect in native format → Transform in the data lake
Format Conversion Using Amazon EMR
Amazon EMR is a managed Hadoop environment for data processing; it supports Spark, Hive, HBase, TensorFlow, and MXNet.
Source format → Hive (EMR) → Target format
or
Source format → Spark (EMR) → Target format
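For example, a minimal PySpark job of the kind you would run on an EMR cluster, converting CSV to Parquet (the bucket and paths are hypothetical):

```python
from pyspark.sql import SparkSession

# Runs on an EMR cluster with Spark installed.
spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read raw CSV from the data lake and write it back as columnar Parquet.
df = spark.read.option("header", "true").csv("s3://my-datalake-bucket/raw/orders/")
df.write.mode("overwrite").parquet("s3://my-datalake-bucket/curated/orders/")
```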
Format Conversion Using AWS Glue
AWS Glue is a managed ETL service that automatically generates ETL scripts, schedules them, and runs them on Spark; it supports both Scala and Python.
Source format → Glue (generated ETL script, run on Spark) → Target format
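A sketch of the shape of a Glue job, assuming a table named raw_orders is already registered in the Glue Data Catalog (the names and output path are hypothetical; this runs inside a Glue job, not locally):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a table registered in the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_catalog", table_name="raw_orders"
)

# Write it back out to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake-bucket/curated/orders/"},
    format="parquet",
)
```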
Kinesis Firehose
Kinesis Data Firehose can transform streaming data to Parquet or ORC format, deliver the transformed data to AWS data stores, and optionally back up the original data to S3.
In-place Querying
- Query data directly in S3 using SQL
- Athena and Redshift Spectrum are the services used for in-place querying
Athena In-place Query
- Runs SQL queries directly against files in S3
- No need to provision servers
- Charged based on the amount of data scanned
- Supports popular file formats: CSV, JSON, Parquet, ORC, Avro
- Recommended for ad-hoc data discovery and SQL querying
This makes vast amounts of unstructured data accessible to any data lake user who can write SQL.
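A minimal sketch of running an Athena query with boto3 (the database, table, and results location are hypothetical):

```python
import boto3

athena = boto3.client("athena")

# Query runs directly against files in S3; results land in the output location.
response = athena.start_query_execution(
    QueryString="SELECT event, COUNT(*) AS n FROM clickstream GROUP BY event",
    QueryExecutionContext={"Database": "datalake_catalog"},
    ResultConfiguration={"OutputLocation": "s3://my-datalake-bucket/athena-results/"},
)
print(response["QueryExecutionId"])  # poll this ID for status and results
```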
Redshift Spectrum In-place Query
BI Tools → Redshift Cluster → Redshift Spectrum → S3 Data Lake
- Sophisticated query optimization
- Distributes queries across multiple nodes
- Uses Redshift data warehouse SQL syntax
- Works with existing BI tools
- Queries can span Redshift tables and the S3 data lake (see the sketch below)
- Recommended for more complex queries and for supporting large numbers of users
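As a sketch of a query spanning a local Redshift table and an external S3-backed table, using the Redshift Data API via boto3 (the cluster, database, user, and schema names are hypothetical, and the external schema must already be mapped to the Glue Data Catalog):

```python
import boto3

rsd = boto3.client("redshift-data")

rsd.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="analytics",
    DbUser="analyst",
    Sql="""
        SELECT d.region, COUNT(*) AS orders
        FROM spectrum.orders s                          -- external table backed by S3
        JOIN dim_region d ON d.region_id = s.region_id  -- local Redshift table
        GROUP BY d.region;
    """,
)
```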
Streaming Query: Kinesis Data Analytics
Kinesis Data Streams or Kinesis Data Firehose → Kinesis Data Analytics → Destination
Kinesis Data Analytics can be used for:
- Querying streaming data
- Running queries continuously
- Sending matching results to a configured destination
Analytics Tools
- A data lake needs to support both current and future tools
- S3 is a popular cloud service, and several third-party tools natively support S3
| Service | Purpose | Use |
|---|---|---|
| Amazon EMR | Hadoop ecosystem tools | Process data in S3 using Spark, Hive, Pig, HBase, TensorFlow, MXNet, and so forth |
| SageMaker | Machine learning | Train models with data in S3; generate real-time and batch predictions |
| AI services | Video, image, NLP | Analyze audio, video, images, and text in S3 |
| QuickSight | Business intelligence | Create interactive dashboards; supports Athena, Redshift, relational DBs |
| Redshift | Petabyte-scale data warehouse (columnar storage) | Load data into tables from S3 for local querying; query S3 directly using Redshift Spectrum |
| Lambda | Business logic (FaaS) | Serverless code execution; trigger-based function invocation (see the sketch below) |
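As an illustration of the Lambda row above, a minimal handler for S3 object-created events (the processing logic is a placeholder):

```python
# Triggered by an S3 event notification configured on the data lake bucket.
def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Placeholder: catalog, validate, or transform the new object here.
        print(f"New object in data lake: s3://{bucket}/{key}")
```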
Amazon Kinesis
Amazon Kinesis enables you to ingest, buffer, and process streaming data in real time, so you can derive insights in seconds or minutes. It handles any amount of streaming data from hundreds of thousands of sources with very low latency.
Kinesis provides several capabilities:
- Video Streams: capture and analyze video streams; used for security monitoring, video playback, and face detection
- Data Streams: capture and analyze data streams; used for custom real-time applications (see the sketch after this list)
- Firehose: capture and deliver data streams to AWS data stores (S3, Redshift, Elasticsearch, Splunk) and use existing BI tools on the streaming data
- Data Analytics: analyze data streams with SQL or Java; used for real-time analytics and anomaly detection
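A minimal sketch of writing a record to a Kinesis data stream with boto3 (the stream name and payload are hypothetical):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# "clickstream" is a hypothetical data stream.
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps({"user": "u123", "event": "page_view"}).encode("utf-8"),
    PartitionKey="u123",  # determines which shard receives the record
)
```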
Monitoring and Optimization
Monitoring
| Service | Purpose | Use |
|---|---|---|
| CloudWatch | Monitoring | Monitor your resources, configure alarms to alert, take automated action (see example below) |
| CloudWatch Logs | Log monitoring | Consolidate log files and monitor them |
| CloudTrail | Audit trail | Log all activities and who performed them; useful for investigation, compliance, and monitoring |
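As a sketch of the CloudWatch row above, a boto3 call that alarms on total bucket size (the alarm name, bucket, SNS topic, and threshold are hypothetical):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when the data lake bucket grows past 5 TB.
cloudwatch.put_metric_alarm(
    AlarmName="datalake-bucket-size",
    Namespace="AWS/S3",
    MetricName="BucketSizeBytes",
    Dimensions=[
        {"Name": "BucketName", "Value": "my-datalake-bucket"},
        {"Name": "StorageType", "Value": "StandardStorage"},
    ],
    Statistic="Average",
    Period=86400,  # this S3 metric is reported daily
    EvaluationPeriods=1,
    Threshold=5 * 1024**4,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:datalake-alerts"],
)
```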
Optimization
Data storage is often a significant portion of the costs associated with a data lake.
The following AWS services and features are used to optimize cost (a small upload example follows the list):
- S3 Lifecycle Management
- S3 Storage Class Analysis
- Intelligent Tiering
- Glacier and Glacier Deep Archive
- Data Formats
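For example, a minimal sketch of uploading an object directly into the Intelligent-Tiering storage class, which moves objects between access tiers automatically based on usage (the bucket and key are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="my-datalake-bucket",
    Key="raw/events/2024/01/01/events.json",
    Body=b'{"event": "page_view"}',
    StorageClass="INTELLIGENT_TIERING",  # automatic tiering for cost optimization
)
```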