
Introduction to Data Lake

Data lake vs Data Warehouse

A data lake is a vast pool of raw data whose purpose is not yet defined. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.

A data lake is a centralized repository that allows you to migrate and store all structured and unstructured data at unlimited scale.


| Data Lake | Data Warehouse |
| --- | --- |
| Users are data scientists | Users are business professionals |
| Highly accessible and quick to update | More complicated and costly to make changes |

An AWS data lake is a single platform that hosts Storage, Governance, and Analytics.

Data lake Storage

S3

Object storage that stores any amount of data cost-effectively with 11 9s (99.999999999%) of durability. Provides object lifecycle management.

Glacier

Used for backup and long-term archival (multi-year) at extremely low cost, with 11 9s of durability.

Data lake storage comparison

| | S3 Standard | S3 Infrequent Access | Glacier |
| --- | --- | --- | --- |
| Cost (500 GB per month) | USD 11.50 | USD 6.25 | USD 2.00 |
| Durability | 99.999999999% | 99.999999999% | 99.999999999% |

How to consolidate the data in a data lake

Real-time Streaming Data

Kinesis Data Firehose can be used to capture, transform, and load streaming data into a variety of AWS data stores, for example S3, Redshift, Elasticsearch, and Splunk. You can then use BI tools to do real-time analysis of the streaming data.
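As a sketch of the ingestion side, the snippet below pushes one JSON record into an existing Firehose delivery stream with boto3. The stream name and event shape are hypothetical placeholders.

```python
# A minimal Firehose ingestion sketch, assuming the delivery stream
# "clickstream-to-s3" (hypothetical) already exists with an S3 destination.
import json
import boto3

firehose = boto3.client("firehose")

event = {"user_id": 42, "action": "page_view", "ts": "2024-01-01T00:00:00Z"}

firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```

Firehose buffers such records and delivers them in batches to the configured destination.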

Continuous Loading of On-premises Data

AWS Storage Gateway can be used to extract data from an on-premises data center and present it to an S3 bucket as a file share.

Snowball and Snowmobile can be used to transfer bulk data from on-premises to the cloud.

Data catalog

A data catalog makes the data usable and discoverable. It contains metadata about the data and involves ongoing monitoring for new data assets, which are then added to the catalog. It helps in tracking versions of changes and provides a queryable interface for all the data assets.

How to create data catalog in AWS

There are two ways to approach this:

  1. A custom approach: collect metadata from the data lake and build a data catalog using AWS Lambda, DynamoDB, and Elasticsearch.
  2. AWS Glue Data Catalog: a managed option from AWS to build and maintain data catalogs. The AWS Glue crawler reads the data sources, infers a schema, and automatically builds the catalog; it can crawl S3, DynamoDB, and any other database that supports JDBC connectivity (see the sketch below).
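A minimal sketch of the Glue option, assuming an existing IAM role for the crawler; the crawler name, role ARN, database, and S3 path are hypothetical:

```python
# Create a Glue crawler over an S3 prefix and run it once. The crawler infers
# the schema and writes table definitions into the Glue Data Catalog.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="datalake-raw-crawler",                            # hypothetical
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical
    DatabaseName="datalake_raw",   # catalog database to populate
    Targets={"S3Targets": [{"Path": "s3://my-datalake/raw/"}]},
)

glue.start_crawler(Name="datalake-raw-crawler")
```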

Data Formats

One of the core values of a data lake is that it is the collection point and repository for all of an organization's data assets, in whatever their native formats are. Common data formats are CSV, TSV, JSON, JSON Lines, Parquet, ORC, and Avro.

Data Transformation

  • Collect in native format -> Transform in data lake

Format Conversion Using Amazon EMR

Amazon EMR is a managed Hadoop environment for data processing. It supports Spark, Hive, HBase, TensorFlow, and MXNet.

  • Source Format -> Hive (EMR) -> Target Format

or

  • Source Format -> Spark (EMR) -> Target Format
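As a sketch of the Spark path, a conversion job on EMR can be as small as reading the source format and writing the target. The bucket names and paths below are hypothetical.

```python
# Minimal PySpark format conversion: read CSV from S3, write Parquet back.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

df = spark.read.csv("s3://my-datalake/raw/events/", header=True, inferSchema=True)

# Columnar Parquet output typically cuts both storage and scan costs.
df.write.mode("overwrite").parquet("s3://my-datalake/curated/events/")

spark.stop()
```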

Format conversion using AWS Glue

AWS Glue is a managed ETL service that automatically generates ETL scripts, schedules them, and runs them on Spark, with support for Scala and Python.

Source Format -> Glue (generates ETL script) -> Run in Spark -> Target Format
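A minimal sketch in the shape of a Glue script, assuming the hypothetical catalog database `datalake_raw` and table `events` from the crawler example above:

```python
# Read a catalog table as a Glue DynamicFrame and write it to S3 as Parquet.
# Database, table, and output path are hypothetical.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

source = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_raw", table_name="events"
)

glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake/curated/events/"},
    format="parquet",
)
```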

Kinesis Firehose

It is used to transform streaming data to Parquet or ORC format, deliver the transformed data to AWS data stores, and optionally back up the original data to S3.

In-place Querying

  • Directly query data in S3 using SQL
  • Athena and Redshift Spectrum are used for in-place querying

Athena In-place Query

  • Directly runs SQL query against files in S3
  • No need to provision servers
  • Charged based on the amount of data scanned
  • Supports popular file formats: CSV, JSON, Parquet, ORC, Avro
  • Recommended for ad hoc data discovery and SQL querying

This makes vast amounts of unstructured data accessible to any data lake user who can use SQL.
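A minimal sketch of running an Athena query from boto3, assuming the hypothetical `datalake_raw.events` table and a results bucket:

```python
# Start an Athena query, poll until it finishes, then fetch the result rows.
# Database, table, and output location are hypothetical.
import time
import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="SELECT action, COUNT(*) AS n FROM events GROUP BY action",
    QueryExecutionContext={"Database": "datalake_raw"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    print(rows)
```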

Redshift Spectrum In-place Query

BI Tools -> Redshift Spectrum -> Redshift Cluster -> S3 Data Lake

  • Sophisticated Query Optimization
  • Distributes queries across multiple nodes
  • Uses Redshift data warehouse SQL syntax
  • Use with existing BI tools
  • Query can span Redshift Tables and S3 Data lake
  • Recommended for more complex queries and supporting a large number of users (see the sketch after this list)
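A minimal sketch of defining and querying a Spectrum external table over the data lake, using the redshift_connector driver; the cluster endpoint, credentials, IAM role, and table names are all hypothetical:

```python
# Map a Glue catalog database into Redshift as an external schema, then query
# S3-backed tables with ordinary Redshift SQL. All names are hypothetical.
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="...",  # placeholder credential
)
cur = conn.cursor()

cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'datalake_raw'
    IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
""")

# Such a query can also join spectrum.* tables with local Redshift tables.
cur.execute("SELECT action, COUNT(*) FROM spectrum.events GROUP BY action")
print(cur.fetchall())
```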

Streaming Query: Kinesis Data Analytics

Kinesis Data Streams / Kinesis Data Firehose -> Kinesis Data Analytics -> Destination

Kinesis Data Analytics can be used for:

  • Querying streaming data
  • Continuously running queries
  • Sending matching results to the configured destination

Analytics Tools

  • A data lake needs to support current and future tools
  • S3 is a popular cloud service, and several third-party tools natively support S3
| Service | Purpose | Use |
| --- | --- | --- |
| Amazon EMR | Hadoop ecosystem tools | Process data in S3 using Spark, Hive, Pig, HBase, TensorFlow, MXNet, and so forth |
| SageMaker | Machine learning | Train models with data in S3; generate real-time and batch predictions |
| Artificial Intelligence | Video, image, NLP | Analyze audio, video, images, and text in S3 |
| QuickSight | Business intelligence | Create interactive dashboards; supports Athena, Redshift, relational DBs |
| Redshift | Petabyte-scale data warehouse (columnar storage) | Load data into tables from S3 for local querying; query S3 directly using Redshift Spectrum |
| Lambda | Business logic (FaaS) | Serverless code execution; trigger-based function invocation |

Amazon Kinesis

Amazon Kinesis enables you to ingest, buffer, and process streaming data in real time, so you can derive insights in seconds or minutes. It handles any amount of streaming data from hundreds of thousands of sources with very low latency.

Kinesis provides four capabilities:

  • Video Streams - Capture and analyze video streams; used for security monitoring, video playback, and face detection
  • Data Streams - Capture and analyze data streams; used for custom real-time applications (see the sketch after this list)
  • Firehose - Capture and deliver data streams to AWS data stores (S3, Redshift, Elasticsearch, Splunk) so existing BI tools can run on streaming data
  • Data Analytics - Analyze data streams with SQL or Java; used for real-time analytics and anomaly detection
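A minimal sketch of the producer side of a data stream with boto3; the stream name "events" is hypothetical and must already exist:

```python
# Write one record to a Kinesis data stream. The partition key determines
# which shard receives the record; here the user ID is used as the key.
import json
import boto3

kinesis = boto3.client("kinesis")

kinesis.put_record(
    StreamName="events",  # hypothetical stream name
    Data=json.dumps({"user_id": 42, "action": "page_view"}).encode("utf-8"),
    PartitionKey="42",
)
```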

Monitoring and Optimization

Monitoring

| Service | Purpose | Use |
| --- | --- | --- |
| CloudWatch | Monitoring | Monitor your resources; configure alarms to alert; take automated action |
| CloudWatch Logs | Log monitoring | Consolidate log files and monitor them |
| CloudTrail | Audit trail | Log all activities and who performed them; useful for investigation, compliance, monitoring |
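As an illustration of the alarm workflow, the sketch below creates a CloudWatch alarm on an S3 storage metric; the bucket name, threshold, and SNS topic are hypothetical:

```python
# Alarm when the daily object count of a bucket crosses a threshold and notify
# an SNS topic. All names and numbers are hypothetical placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="datalake-object-count-high",
    Namespace="AWS/S3",
    MetricName="NumberOfObjects",
    Dimensions=[
        {"Name": "BucketName", "Value": "my-datalake"},
        {"Name": "StorageType", "Value": "AllStorageTypes"},
    ],
    Statistic="Average",
    Period=86400,  # S3 storage metrics are reported once per day
    EvaluationPeriods=1,
    Threshold=10_000_000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:datalake-alerts"],
)
```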

Optimization

Data storage is often a significant portion of the costs associated with a data lake.

To optimize cost, the AWS services and features below are used:

  • S3 Lifecycle Management (see the sketch after this list)
  • S3 Storage Class Analysis
  • Intelligent Tiering
  • Glacier and Glacier Deep Archive
  • Data Formats
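A minimal sketch of a lifecycle rule that combines several of these levers: transition objects to Infrequent Access, then to Glacier, then expire them. The bucket name, prefix, and day counts are hypothetical choices.

```python
# Apply a lifecycle rule to a data lake bucket: Standard-IA after 30 days,
# Glacier after 90 days, expiry after 5 years. All values are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-datalake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 1825},
            }
        ]
    },
)
```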
