
Introduction to Data Lake

Data lake vs Data Warehouse

A data lake is a vast pool of raw data whose purpose is not yet defined. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.

A data lake is a centralized repository that allows you to migrate and store all structured and unstructured data at unlimited scale.


| Data Lake | Data Warehouse |
| --- | --- |
| Users are data scientists | Users are business professionals |
| Highly accessible and quick to update | More complicated and costly to make changes |

An AWS data lake is a single platform that hosts Storage, Governance, and Analytics.

Data lake Storage

S3

Object storage that stores any amount of data cost-effectively with 11 9s (99.999999999%) of durability. Provides object lifecycle management.

Glacier

Used for backup and long-term archival (multi-year) at extremely low cost, with 11 9s of durability.

Data lake storage comparison

| | S3 Standard | S3 Infrequent Access | Glacier |
| --- | --- | --- | --- |
| Cost (500 GB per month) | USD 11.50 | USD 6.25 | USD 2.00 |
| Durability | 99.999999999% | 99.999999999% | 99.999999999% |

How to consolidate the data in a data lake

Real-time Streaming Data

Kinesis Data Firehose can be used to capture, transform, and load streaming data into a variety of AWS data stores, for example S3, Redshift, Elasticsearch, and Splunk. You can then use BI tools to do real-time analysis of the streaming data.
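As a sketch of the ingestion side, the snippet below pushes one JSON record into an existing Firehose delivery stream with boto3. The stream name and event shape are hypothetical placeholders.

```python
# A minimal Firehose ingestion sketch, assuming the delivery stream
# "clickstream-to-s3" (hypothetical) already exists with an S3 destination.
import json
import boto3

firehose = boto3.client("firehose")

event = {"user_id": 42, "action": "page_view", "ts": "2024-01-01T00:00:00Z"}

firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```

Firehose buffers such records and delivers them in batches to the configured destination.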

Continuous Loading of On-premises Data

AWS Storage Gateway can be used to extract data from an on-premises data center and present it to an S3 bucket as a file share.

Snowball and Snowmobile can be used to transfer bulk data from on-premises to the cloud.

Data catalog

A data catalog makes the data usable and discoverable. It contains metadata about the data and involves ongoing monitoring for new data assets, which are then added to the catalog. It helps in tracking versions of changes and provides a queryable interface for all the data assets.

How to create data catalog in AWS

There are two ways to approach this:

  1. A custom approach: collect metadata from the data lake and build a data catalog using AWS Lambda, DynamoDB, and Elasticsearch.
  2. AWS Glue Data Catalog: a managed option from AWS to build and maintain data catalogs. The AWS Glue crawler reads the data sources, infers a schema, and automatically builds the catalog; it can crawl S3, DynamoDB, and any other database that supports JDBC connectivity (see the sketch below).
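A minimal sketch of the Glue option, assuming an existing IAM role for the crawler; the crawler name, role ARN, database, and S3 path are hypothetical:

```python
# Create a Glue crawler over an S3 prefix and run it once. The crawler infers
# the schema and writes table definitions into the Glue Data Catalog.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="datalake-raw-crawler",                            # hypothetical
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical
    DatabaseName="datalake_raw",   # catalog database to populate
    Targets={"S3Targets": [{"Path": "s3://my-datalake/raw/"}]},
)

glue.start_crawler(Name="datalake-raw-crawler")
```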

Data Formats

One of the core values of a data lake is that it is the collection point and repository for all of an organization's data assets, in whatever their native formats are. Common data formats are CSV, TSV, JSON, JSON Lines, Parquet, ORC, and Avro.

Data Transformation

  • Collect in native format -> Transform in data lake

Format Conversion Using Amazon EMR

Amazon EMR is a managed Hadoop environment for data processing. It supports Spark, Hive, HBase, TensorFlow, and MXNet.

  • Source Format -> Hive (EMR) -> Target Format

or

  • Source Format -> Spark (EMR) -> Target Format
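As a sketch of the Spark path, a conversion job on EMR can be as small as reading the source format and writing the target. The bucket names and paths below are hypothetical.

```python
# Minimal PySpark format conversion: read CSV from S3, write Parquet back.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

df = spark.read.csv("s3://my-datalake/raw/events/", header=True, inferSchema=True)

# Columnar Parquet output typically cuts both storage and scan costs.
df.write.mode("overwrite").parquet("s3://my-datalake/curated/events/")

spark.stop()
```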

Format conversion using AWS Glue

AWS Glue is a managed ETL service that automatically generates ETL scripts, schedules them, and runs them on Spark, with support for Scala and Python.

Source Format -> Glue (generates ETL script) -> Run in Spark -> Target Format
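A minimal sketch in the shape of a Glue script, assuming the hypothetical catalog database `datalake_raw` and table `events` from the crawler example above:

```python
# Read a catalog table as a Glue DynamicFrame and write it to S3 as Parquet.
# Database, table, and output path are hypothetical.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

source = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_raw", table_name="events"
)

glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake/curated/events/"},
    format="parquet",
)
```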

Kinesis Firehose

It is used to transform streaming data to Parquet or ORC format, deliver the transformed data to AWS data stores, and optionally back up the original data to S3.

In-place Querying

  • Directly query data in S3 using SQL
  • Athena and Redshift Spectrum are used for in-place querying

Athena In-place Query

  • Directly runs SQL query against files in S3
  • No need to provision servers
  • Charged based on the amount of data scanned
  • Supports popular file formats: CSV, JSON, Parquet, ORC, Avro
  • Recommended for ad hoc data discovery and SQL querying

This makes vast amounts of unstructured data accessible to any data lake user who can use SQL.
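A minimal sketch of running an Athena query from boto3, assuming the hypothetical `datalake_raw.events` table and a results bucket:

```python
# Start an Athena query, poll until it finishes, then fetch the result rows.
# Database, table, and output location are hypothetical.
import time
import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="SELECT action, COUNT(*) AS n FROM events GROUP BY action",
    QueryExecutionContext={"Database": "datalake_raw"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    print(rows)
```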

Redshift Spectrum In-place Query

BI Tools -> Redshift Spectrum -> Redshift Cluster -> S3 Data Lake

  • Sophisticated Query Optimization
  • Distributes queries across multiple nodes
  • Uses Redshift data warehouse SQL syntax
  • Use with existing BI tools
  • Query can span Redshift Tables and S3 Data lake
  • Recommended for more complex queries and supporting a large number of users (see the sketch after this list)
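A minimal sketch of defining and querying a Spectrum external table over the data lake, using the redshift_connector driver; the cluster endpoint, credentials, IAM role, and table names are all hypothetical:

```python
# Map a Glue catalog database into Redshift as an external schema, then query
# S3-backed tables with ordinary Redshift SQL. All names are hypothetical.
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="...",  # placeholder credential
)
cur = conn.cursor()

cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'datalake_raw'
    IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
""")

# Such a query can also join spectrum.* tables with local Redshift tables.
cur.execute("SELECT action, COUNT(*) FROM spectrum.events GROUP BY action")
print(cur.fetchall())
```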

Streaming Query: Kinesis Data Analytics

Kinesis Data Streams / Kinesis Data Firehose -> Kinesis Data Analytics -> Destination

Kinesis Data Analytics can be used for:

  • Querying streaming data
  • Continuously running queries
  • Sending matching results to the configured destination

Analytics Tools

  • A data lake needs to support current and future tools
  • S3 is a popular cloud service, and several third-party tools natively support S3
| Service | Purpose | Use |
| --- | --- | --- |
| Amazon EMR | Hadoop ecosystem tools | Process data in S3 using Spark, Hive, Pig, HBase, TensorFlow, MXNet, and so forth |
| SageMaker | Machine learning | Train models with data in S3; generate real-time and batch predictions |
| Artificial Intelligence | Video, image, NLP | Analyze audio, video, images, and text in S3 |
| QuickSight | Business intelligence | Create interactive dashboards; supports Athena, Redshift, relational DBs |
| Redshift | Petabyte-scale data warehouse (columnar storage) | Load data into tables from S3 for local querying; query S3 directly using Redshift Spectrum |
| Lambda | Business logic (FaaS) | Serverless code execution; trigger-based function invocation |

Amazon Kinesis

Amazon Kinesis enables you to ingest, buffer, and process streaming data in real time, so you can derive insights in seconds or minutes. It handles any amount of streaming data from hundreds of thousands of sources with very low latency.

Kinesis provides four capabilities:

  • Video Streams - Capture and analyze video streams; used for security monitoring, video playback, and face detection
  • Data Streams - Capture and analyze data streams; used for custom real-time applications (see the sketch after this list)
  • Firehose - Capture and deliver data streams to AWS data stores (S3, Redshift, Elasticsearch, Splunk) so existing BI tools can run on streaming data
  • Data Analytics - Analyze data streams with SQL or Java; used for real-time analytics and anomaly detection
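A minimal sketch of the producer side of a data stream with boto3; the stream name "events" is hypothetical and must already exist:

```python
# Write one record to a Kinesis data stream. The partition key determines
# which shard receives the record; here the user ID is used as the key.
import json
import boto3

kinesis = boto3.client("kinesis")

kinesis.put_record(
    StreamName="events",  # hypothetical stream name
    Data=json.dumps({"user_id": 42, "action": "page_view"}).encode("utf-8"),
    PartitionKey="42",
)
```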

Monitoring and Optimization

Monitoring

| Service | Purpose | Use |
| --- | --- | --- |
| CloudWatch | Monitoring | Monitor your resources; configure alarms to alert; take automated action |
| CloudWatch Logs | Log monitoring | Consolidate log files and monitor them |
| CloudTrail | Audit trail | Log all activities and who performed them; useful for investigation, compliance, monitoring |
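As an illustration of the alarm workflow, the sketch below creates a CloudWatch alarm on an S3 storage metric; the bucket name, threshold, and SNS topic are hypothetical:

```python
# Alarm when the daily object count of a bucket crosses a threshold and notify
# an SNS topic. All names and numbers are hypothetical placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="datalake-object-count-high",
    Namespace="AWS/S3",
    MetricName="NumberOfObjects",
    Dimensions=[
        {"Name": "BucketName", "Value": "my-datalake"},
        {"Name": "StorageType", "Value": "AllStorageTypes"},
    ],
    Statistic="Average",
    Period=86400,  # S3 storage metrics are reported once per day
    EvaluationPeriods=1,
    Threshold=10_000_000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:datalake-alerts"],
)
```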

Optimization

Data storage is often a significant portion of the costs associated with a data lake.

To optimize cost, the AWS services and features below are used:

  • S3 Lifecycle Management (see the sketch after this list)
  • S3 Storage Class Analysis
  • Intelligent Tiering
  • Glacier and Glacier Deep Archive
  • Data Formats
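A minimal sketch of a lifecycle rule that combines several of these levers: transition objects to Infrequent Access, then to Glacier, then expire them. The bucket name, prefix, and day counts are hypothetical choices.

```python
# Apply a lifecycle rule to a data lake bucket: Standard-IA after 30 days,
# Glacier after 90 days, expiry after 5 years. All values are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-datalake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 1825},
            }
        ]
    },
)
```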
