Saturday, April 10, 2021

Day 37 - AWS - Athena, Glue, CloudWatch

AWS - session 37

Class Notes
Today's topics:
- Athena
- Glue Data Catalog
- CloudWatch Logs

These services are closely related to S3 (prefixes/partitions improve its performance).


Go to the AWS console and search for:
- CloudWatch
- Glue
- Athena

Read about these services

AWS advanced program
- Data analytics (Athena and Glue are discussed)
- These services are part of big data.

Let's discuss what Athena and Glue do for us.

We have different departments in our organization:
- Research
- Technical
- Development
- Support

Each has its own data, and every day they generate lots of it. This data is recorded in log files.
We have a web server, and customers access the system through the web server.


A log file can look like the following:

1    1.1.1.1        index.html    200
2    2.2.2.2        web.php        400


Let's look at one of the datasets:
1,John,ok
2,Bill,good
3,Jack,good

Data like this, with no metadata, is called raw data.
There is a huge amount of it, which is why it is called big data, and it is
used for analysis later on.

Data lake - stores big data files. It holds all the raw data which we use for analytics.

For example: let's find the users who got a 404 error, or the users who successfully logged in to the system, and much more.

Hadoop has a distributed filesystem (HDFS) through which you access a data lake.
- It is an open-source tool; you configure it on your own storage.
- Or use AWS and look for a service that lets you store and use the data on AWS.
The size of the storage is going to be petabytes.

Then use some program/tool to access the data lake and generate reports.

AWS provides services to handle these kinds of situations.
S3 is an object storage service.
- It is scalable; you can keep adding your data.
- Read/write throughput is very fast on S3.

You can use HDFS or S3.

This big storage is called a data lake.

Normally, if your data is in a file in storage, you run a program; the process reads the data, loads it into memory and executes. But our data is terabytes in size, and it is hard to load that into memory.

Instead, our program goes to S3, analyzes the data there and returns only the result.
The tool we will be using is Athena.
- It is meant to be an analytics/analysis tool.
- It is tightly tied to S3.
- For Athena, S3 is the data source.
- Athena can access data from other sources as well.

S3 is the data lake
- data in raw format

1,John,100,ok
2,Bill,200,good
3,Jack,500,good

Looking at the data above, we don't know what it means.

You also need metadata to know what this data is about, such as:
serial no.
name
salary
remarks

s/n,name,salary,remarks
1,John,100,ok
2,Bill,200,good
3,Jack,500,good

Without metadata, it is very hard to figure out what the raw data really means.
Once you have metadata, you can figure out what the data means.

To analyze the raw data, you first create metadata and attach it to the raw data.

So, first create the metadata (also called a schema) and attach it to the raw data in S3 (the data source).

The name of the service is Glue.
- Glue is an ETL tool.
- Glue has a Data Catalog which helps you create metadata.


Athena is similar to Hive.
S3 is similar to HDFS.
The schema (metadata) is similar to the Hive metastore.

You have to set up HDFS, Hive and the Hive metastore yourself, but
Athena, S3 and Glue are already configured for you on AWS.


Splunk is an alternative to Athena,
but Splunk comes with more features.

Now, let's do the practical.

We have raw data. Let's save it as:
> notepad athena3.csv
1,John,100,ok
2,Bill,200,good
3,Jack,500,good

So, what's the plan?
- First go to S3, create a bucket and upload this file
  (you upload this raw data)

Go to the AWS console:
bucket name: awsathena (bucket names must be lowercase and globally unique)
select the region (Singapore)
create the bucket

upload the file

s3 -> key = filename (the object name)
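
As a rough equivalent of these console steps, here is a minimal boto3 sketch (assumptions: boto3 is installed and credentials are configured; the bucket name "aws-athena-demo-lab" is hypothetical, since real bucket names must be globally unique):

import boto3

s3 = boto3.client("s3", region_name="ap-southeast-1")   # Singapore region

# Create the bucket (outside us-east-1 a LocationConstraint is required)
s3.create_bucket(
    Bucket="aws-athena-demo-lab",                        # hypothetical name
    CreateBucketConfiguration={"LocationConstraint": "ap-southeast-1"},
)

# Upload the raw CSV; the object key is the "filename" inside the bucket
s3.upload_file("athena3.csv", "aws-athena-demo-lab", "athena3.csv")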

If you want to access the file from your laptop, you initiate a GET request.

The speed at which you can download this file is very fast.
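
For illustration, downloading the object with boto3 issues that GET request under the hood (same hypothetical bucket name as above):

import boto3

s3 = boto3.client("s3", region_name="ap-southeast-1")
# download_file performs GET requests against the object key
s3.download_file("aws-athena-demo-lab", "athena3.csv", "athena3-local.csv")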


The maximum rate at which you can fetch files from a bucket is roughly 6,000 requests/sec per prefix (AWS documents about 5,500 GET requests per second per prefix).
filesize = ?

Let's say you receive 60,000 requests per second. What happens?
How do we improve the performance?

S3 is already a very fast storage service,
providing about 6K requests/sec per prefix, but how do we get more throughput?

You can split the data across prefixes, say:
- on a region basis
- on a day basis (1, 15, 25, Christmas)
   /mybucket/day1
        /day2/f1
        /day3/f2
        /day15/d15
        /day25/d24
This way we distribute the load. Every folder gets its own request rate: if you have 10 folders and each supports 6K requests/sec, you get about 60,000/sec in total.
This way you can reach the speed you require.
The more partitions (folders) you create, the more the aggregate speed increases.
- day1, day2, day3 ... these are called prefixes.
- There is no limit on the number of prefixes.
- You can add more prefixes to speed things up.
- From a hardware perspective a single prefix has reached its limit, but this way you can also organize the data.
- In storage, partitioning speeds up performance.

Let's go to the S3 bucket and create folders:
- create a folder named: april
- create another folder named: may

Each folder (prefix) has its own request rate,
so in total we have about 12K requests/sec.

Now, let's go ahead and upload the file to the april folder.

Use the default options and upload the file.
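
A minimal sketch of the same idea with boto3, uploading copies of the raw file under the two date prefixes (bucket name is the same assumption as before):

import boto3

s3 = boto3.client("s3", region_name="ap-southeast-1")

# Each prefix ("folder") gets its own request-rate allowance,
# so spreading objects across prefixes spreads the load.
for prefix in ("april", "may"):
    s3.upload_file("athena3.csv", "aws-athena-demo-lab", f"{prefix}/athena3.csv")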

Now, we have to create the metadata.

We have to go to Athena and Glue.

Go to the console and search for Athena.

Let's select the region: Singapore.
- Click on Get started.
On the left side you see: Data source.
Our data source is S3.
Click on "Connect data source" next to Data source on the left side.

Two options are available:
- Query data in Amazon S3 -> select this one
- Query a data source (federated)

Review both, but we will select S3.

Choose a metadata catalog:
- AWS Glue data catalog
- Apache Hive metastore

Select Glue.

-----------------
Now, open a new tab and search for Glue.
You see the Data Catalog.
Review the different options.
------------------------------

Connect to data source:
connect to the AWS Glue data catalog
(the second option)

Tools ------SQL-------> data Source

Add table
1. Name and location
database: lwdb (it's like a folder)
table name: emptable
location of input data set:

Now go to your S3 bucket, open the april folder and get the URL:
s3.awsathena../april
filename: athena3.csv

- Copy the address and click next.
2. Data format: CSV

Click next.
3. Columns:
What is the first field?
column: id
column type: int
column: name, type: string
column: salary, type: int

Click "Add a column" to add more columns.
column: remarks, type: string

So the full schema is:
id: int
name: string
salary: int
remarks: string

Once you have created the columns, click next and you will be at 4. Partitions.

Click next.
You get an error telling you to specify a location to store the query results.
To resolve it, go to Athena and click on the link provided.
When you click on the link, you will be at the Athena bucket.
Create a folder for the results, named output.

Go to Athena and select the folder called output as the query result location.

Now, go back to the wizard page and click next.
We have created a schema (table).
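
For reference, the same database and table can also be created through the Glue API instead of the Athena wizard. A minimal sketch (the S3 location points at the hypothetical bucket used earlier; the SerDe shown is the standard Hive CSV SerDe):

import boto3

glue = boto3.client("glue", region_name="ap-southeast-1")

glue.create_database(DatabaseInput={"Name": "lwdb"})

glue.create_table(
    DatabaseName="lwdb",
    TableInput={
        "Name": "emptable",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "int"},
                {"Name": "name", "Type": "string"},
                {"Name": "salary", "Type": "int"},
                {"Name": "remarks", "Type": "string"},
            ],
            "Location": "s3://aws-athena-demo-lab/april/",   # hypothetical bucket
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)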


====================
Go to Glue and refresh; you will see your database.
Click on the database and you will see the table,
and when you click on the table, you see details about the schema.

Look at the data location: it points to S3.

You can edit the schema, add more fields and more.



Go to Athena and create a new query.
Read all the columns and rows and display the data:
select * from emptable;

You will see the result,
a nice display of the data.

For big data, do not run this type of query; it may simply hang.

Display the sum of the salaries:
select SUM(salary) from emptable;


Even if the data is terabytes in size,
only the result comes back to Athena.

select * from emptable where name='Jack';

select remarks from emptable where name='Jack';

ORDER BY
JOIN
INNER JOIN ... and much more.
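
The same queries can also be submitted through the Athena API. A minimal boto3 sketch (the OutputLocation points at the output folder created earlier in the hypothetical bucket):

import time
import boto3

athena = boto3.client("athena", region_name="ap-southeast-1")

qid = athena.start_query_execution(
    QueryString="SELECT SUM(salary) FROM emptable",
    QueryExecutionContext={"Database": "lwdb"},
    ResultConfiguration={"OutputLocation": "s3://aws-athena-demo-lab/output/"},
)["QueryExecutionId"]

# Poll until the query finishes, then fetch the (small) result set
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])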


================================
CloudWatch Logs
--------------------
You have one EC2 instance: a web server,
and another instance on VirtualBox: a DB server.

Both of these systems generate logs.

Clients come and try to access the data.

Metrics: memory, I/O, CPU, throughput.

Logs are stored on the local disks.

If you have hundreds of servers, accessing all of those logs is a challenge,
so we create a centralized logging server that all the logs are synced to.

We use CloudWatch Logs, which is a centralized log system.

Go to the AWS console and search for CloudWatch.
- It is used for multiple purposes.
- Go to Logs on the left.
- This service is used for monitoring, analyzing and storing logs.
- Highly scalable, durable and much more.
- It is a log management tool.

Logs are actually created on local storage and then uploaded to CloudWatch.

We have other tools for centralized logging: rsyslog, Datadog, Splunk.

CloudWatch has its own agent.
- The agent collects logs from the local system on behalf of CloudWatch and uploads them to CloudWatch.
- At the centralized location, we use Athena or another tool to analyze them.

You can install the CloudWatch agent on your own PC or VM as well.
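
To see what the agent does, here is a minimal sketch that pushes a single log line straight to CloudWatch Logs with boto3 (the log group and stream names are made up for the example):

import time
import boto3

logs = boto3.client("logs", region_name="ap-southeast-1")

# One-time setup: a log group per application, a stream per source
logs.create_log_group(logGroupName="/demo/webserver")          # hypothetical name
logs.create_log_stream(logGroupName="/demo/webserver", logStreamName="access-log")

# Ship one log event (timestamp is in milliseconds since the epoch)
logs.put_log_events(
    logGroupName="/demo/webserver",
    logStreamName="access-log",
    logEvents=[{"timestamp": int(time.time() * 1000),
                "message": "1 1.1.1.1 index.html 200"}],
)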



Demo: tomorrow







