AWS Glue 개요 및 활용방안

2020-02-14

Data Engineering

[AWS Glue 개요 및 활용방안]

Glue

[Glue를 이용한 데이터 전처리 핸즈온]

** 상단에 PPT 자료 참고

1. 핸즈온 시나리오

STEP 1) Glue Dev Endpoint 사용을 위한 IAM role 설정

Create role 클릭 -> 하단에 glue 선택 후 next 클릭 -> s3 검색 후 s3fullaccess 선택, glue 검색 후 servicerole 및 GlueServiceNotebookRole 선택 후 next 클릭 -> role 생성

STEP 2) Glue Dev Endpoint 생성

DPU worker type 설명

STEP 3) Dev Endpoint 내에 Notebook 생성

노트북을 새로 생성할때마다 role도 같이 새로 만들어줘야 한다.(노트북 생정중에 role을 생성하는 설정가능)

STEP 4) CSV 형태의 원본데이터를 parquet로 변환하고 partitioning 된 형태로 저장

ETL Job : CSV 형태의 원본데이터를 parquet로 변환하고 partitioning 된 형태로 저장하는 job
원본데이터 -> Clean data로 만드는 작업

STEP 5) 원본데이터 및 Clean Data를 대상으로 하는 Data Catalog 생성

step 5-1) Glue Catalog database 생성
step 5-2) Glue Crawler 생성(csv, parquet 2EA) 및 크롤링 수행
step 5-3) Glue Table(데이터 카탈로그) 생성확인

STEP 6) Athena를 이용한 테스트

왜 ETL 작업을 수행해야 하는가 확인하는 테스트
csv 형태의 파일 대비하여 parquet 파일형태 변환 및 partitioning 저장의 이점을 확인하는 테스트

STEP 7) spark SQL을 이용해 Clean Data를 대상으로 재가공한 합산 및 요약데이터 생성

STEP 8) step 4), step 7)에서 수행했던 작업을 Glue Job을 이용하여 실제 운영계에 적용하는 실습

STEP 9) 구동한 자원 종료

노트북 서버 중지 및 삭제, 개발앤드포인트 종료, 원본데이터 삭제

2. step 4), step 7)에서 작성할 pyspark 코드

2.1. step 4) 실습코드

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import *
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.types import *

## spark 사용을 위한 엔진구동
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

## csv파일을 spark dataframe으로 로딩
df_csv = spark.read.format("csv").option("header", "true").load("s3a://lhw-s3-test/Lotte/csv/*.csv")

Starting Spark application

## 데이터프레임 확인
df_csv.show(10)

+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
|          event_time|event_type|product_id|        category_id|       category_code|   brand| price|  user_id|        user_session|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
|2019-11-01 00:00:...|      view|   1003461|2053013555631882655|electronics.smart...|  xiaomi|489.07|520088904|4d3b30da-a5e4-49d...|
|2019-11-01 00:00:...|      view|   5000088|2053013566100866035|appliances.sewing...|  janome|293.65|530496790|8e5f4f83-366c-4f7...|
|2019-11-01 00:00:...|      view|  17302664|2053013553853497655|                null|   creed| 28.31|561587266|755422e7-9040-477...|
|2019-11-01 00:00:...|      view|   3601530|2053013563810775923|appliances.kitche...|      lg|712.87|518085591|3bfb58cd-7892-48c...|
|2019-11-01 00:00:...|      view|   1004775|2053013555631882655|electronics.smart...|  xiaomi|183.27|558856683|313628f1-68b8-460...|
|2019-11-01 00:00:...|      view|   1306894|2053013558920217191|  computers.notebook|      hp|360.09|520772685|816a59f3-f5ae-4cc...|
|2019-11-01 00:00:...|      view|   1306421|2053013558920217191|  computers.notebook|      hp|514.56|514028527|df8184cc-3694-454...|
|2019-11-01 00:00:...|      view|  15900065|2053013558190408249|                null| rondell| 30.86|518574284|5e6ef132-4d7c-473...|
|2019-11-01 00:00:...|      view|  12708937|2053013553559896355|                null|michelin| 72.72|532364121|0a899268-31eb-46d...|
|2019-11-01 00:00:...|      view|   1004258|2053013555631882655|electronics.smart...|   apple|732.07|532647354|d2d3d2c6-631d-489...|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
only showing top 10 rows

## 데이터 레코드 수 확인
df_csv.count()

67501979

## 데이터프레임 스키마 확인
df_csv.schema

StructType(List(StructField(event_time,StringType,true),StructField(event_type,StringType,true),StructField(product_id,StringType,true),StructField(category_id,StringType,true),StructField(category_code,StringType,true),StructField(brand,StringType,true),StructField(price,StringType,true),StructField(user_id,StringType,true),StructField(user_session,StringType,true)))

## event_time이라는 피쳐를 슬라이싱하여 year, month, day 피쳐를 새로 생성하는 각각의 사용자 정의함수를 정의하여 적용

udf_year = udf(lambda record:record[0:4],StringType())
udf_month = udf(lambda record:record[5:7],StringType())
udf_day = udf(lambda record:record[8:10],StringType())

new_df_csv = df_csv.withColumn('year',udf_year('event_time').cast(IntegerType()))
new_df_csv = new_df_csv.withColumn('month',udf_month('event_time').cast(IntegerType()))
new_df_csv = new_df_csv.withColumn('day',udf_day('event_time').cast(IntegerType()))

new_df_csv.show(10)

+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+----+-----+---+
|          event_time|event_type|product_id|        category_id|       category_code|   brand| price|  user_id|        user_session|year|month|day|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+----+-----+---+
|2019-11-01 00:00:...|      view|   1003461|2053013555631882655|electronics.smart...|  xiaomi|489.07|520088904|4d3b30da-a5e4-49d...|2019|   11|  1|
|2019-11-01 00:00:...|      view|   5000088|2053013566100866035|appliances.sewing...|  janome|293.65|530496790|8e5f4f83-366c-4f7...|2019|   11|  1|
|2019-11-01 00:00:...|      view|  17302664|2053013553853497655|                null|   creed| 28.31|561587266|755422e7-9040-477...|2019|   11|  1|
|2019-11-01 00:00:...|      view|   3601530|2053013563810775923|appliances.kitche...|      lg|712.87|518085591|3bfb58cd-7892-48c...|2019|   11|  1|
|2019-11-01 00:00:...|      view|   1004775|2053013555631882655|electronics.smart...|  xiaomi|183.27|558856683|313628f1-68b8-460...|2019|   11|  1|
|2019-11-01 00:00:...|      view|   1306894|2053013558920217191|  computers.notebook|      hp|360.09|520772685|816a59f3-f5ae-4cc...|2019|   11|  1|
|2019-11-01 00:00:...|      view|   1306421|2053013558920217191|  computers.notebook|      hp|514.56|514028527|df8184cc-3694-454...|2019|   11|  1|
|2019-11-01 00:00:...|      view|  15900065|2053013558190408249|                null| rondell| 30.86|518574284|5e6ef132-4d7c-473...|2019|   11|  1|
|2019-11-01 00:00:...|      view|  12708937|2053013553559896355|                null|michelin| 72.72|532364121|0a899268-31eb-46d...|2019|   11|  1|
|2019-11-01 00:00:...|      view|   1004258|2053013555631882655|electronics.smart...|   apple|732.07|532647354|d2d3d2c6-631d-489...|2019|   11|  1|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+----+-----+---+
only showing top 10 rows

## year, month, day 피쳐를 사용자가 원하는 s3 경로로 저장. 포맷은 파케이형태
new_df_csv.write.partitionBy("year","month","day").save("s3a://lhw-s3-test/Lotte/cleandata/", format='parquet', header=True)

## 저장한 parquet 파일을 로딩
df_parquet = spark.read.format("parquet").option("header", "true").load("s3a://lhw-s3-test/Lotte/cleandata/")
df_parquet.show(20)

+--------------------+----------+----------+-------------------+--------------------+---------+-------+---------+--------------------+----+-----+---+
|          event_time|event_type|product_id|        category_id|       category_code|    brand|  price|  user_id|        user_session|year|month|day|
+--------------------+----------+----------+-------------------+--------------------+---------+-------+---------+--------------------+----+-----+---+
|2019-11-17 08:43:...|      view|   2501799|2053013564003713919|appliances.kitche...| elenberg|  46.31|563237118|4368d099-6d19-47c...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   6400335|2053013554121933129|computers.compone...|    intel| 435.28|551129779|4db2c365-ee85-443...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   3701538|2053013565983425517|appliances.enviro...|   irobot|1878.81|539845715|bf7d95c0-69e1-40f...|2019|   11| 17|
|2019-11-17 08:43:...|      view|  26400266|2053013563651392361|                null|  lucente| 119.18|572211322|8e6c63f8-7f34-48b...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   1004659|2053013555631882655|electronics.smart...|  samsung| 762.18|512965259|2981c9f9-3905-49d...|2019|   11| 17|
|2019-11-17 08:43:...|      view|  28716993|2053013565228450757|       apparel.shoes|  respect|  82.63|524048356|a2a0497c-59a9-4d4...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   1004856|2053013555631882655|electronics.smart...|  samsung| 128.42|572486931|55e4ac09-1cec-446...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   1005021|2053013555631882655|electronics.smart...|     oppo| 386.08|512887550|3c3af822-9816-434...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   3300349|2053013555355058573|                null|  redmond|  73.34|566506783|3835f9bf-750b-46a...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   4600560|2053013563944993659|appliances.kitche...|     beko| 412.40|522329355|dce61941-af79-4fd...|2019|   11| 17|
|2019-11-17 08:43:...|      view|  12709556|2053013553559896355|                null|   viatti|  40.16|546204388|86e071a9-b9d2-47e...|2019|   11| 17|
|2019-11-17 08:43:...|  purchase|   7002255|2053013560346280633|       kids.carriage|wingoffly| 113.77|517821629|94575067-317d-4ec...|2019|   11| 17|
|2019-11-17 08:43:...|      view|  26203739|2053013563693335403|                null|  lucente| 158.56|527083517|e6bf2cdb-778f-44a...|2019|   11| 17|
|2019-11-17 08:43:...|      view|  28401077|2053013566209917945|     accessories.bag|  respect|  39.90|512757661|4c6f8f63-a612-4c5...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   3701428|2053013565983425517|appliances.enviro...|   arnica|  68.59|572271574|7b9040ab-2524-4d6...|2019|   11| 17|
|2019-11-17 08:43:...|  purchase|   1004249|2053013555631882655|electronics.smart...|    apple| 765.79|562839858|98c3adb8-a028-4e8...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   1005253|2053013555631882655|electronics.smart...|   xiaomi| 288.04|516404307|a383cb03-2673-446...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   5700981|2053013553970938175|auto.accessories....|   alpine| 875.18|558414772|14c5b3c8-3c0a-4bd...|2019|   11| 17|
|2019-11-17 08:43:...|      view|  22300011|2053013552427434207|                null|    apple| 200.52|572444775|e13be654-ddef-4a7...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   1801940|2053013554415534427|electronics.video.tv|    haier| 231.64|551704382|8e617209-f000-47a...|2019|   11| 17|
+--------------------+----------+----------+-------------------+--------------------+---------+-------+---------+--------------------+----+-----+---+
only showing top 20 rows

2.2. step 7) 실습코드

## Glue catalog를 이용해서 spark sql로 간단한 합산 데이터 프레임 만들기
## 메타데이터는 글루카탈로그를 이용함

sql_df = spark.sql("SELECT year,month,day,count(*) FROM lotte.cleandata group by year,month,day")
sql_df.show()

+----+-----+---+--------+
|year|month|day|count(1)|
+----+-----+---+--------+
|2019|   11| 19| 1728541|
|2019|   11| 12| 1987569|
|2019|   11| 10| 1940575|
|2019|   11| 30| 1754878|
|2019|   11| 15| 6220416|
|2019|   11|  3| 1567774|
|2019|   11|  9| 1877906|
|2019|   11| 29| 1854426|
|2019|   11| 18| 2021512|
|2019|   11| 21| 1677336|
|2019|   11| 24| 1591765|
|2019|   11|  2| 1555538|
|2019|   11| 16| 6502957|
|2019|   11|  5| 1717244|
|2019|   11|  7| 1796833|
|2019|   11| 28| 1658378|
|2019|   11|  8| 1896402|
|2019|   11| 14| 3069726|
|2019|   11| 11| 2009390|
|2019|   11|  6| 1694821|
+----+-----+---+--------+
only showing top 20 rows

## 합산한 데이터 프레임을 사용자가 원하는 s3 경로로 저장
sql_df.repartitionByRange(1,'year').write.save("s3a://lhw-s3-test/Lotte/summary/201911/", format='csv', header=True)

## 방금 저장한 합산데이터를 로딩
sql_new_df = spark.read.format("csv").option("header", "true").load("s3a://lhw-s3-test/Lotte/summary/201911/")
sql_new_df.show()

+----+-----+---+--------+
|year|month|day|count(1)|
+----+-----+---+--------+
|2019|   11| 15| 6220416|
|2019|   11|  3| 1567774|
|2019|   11| 18| 2021512|
|2019|   11| 28| 1658378|
|2019|   11|  8| 1896402|
|2019|   11| 14| 3069726|
|2019|   11|  1| 1445360|
|2019|   11| 25| 1593582|
|2019|   11|  9| 1877906|
|2019|   11| 11| 2009390|
|2019|   11| 22| 1568243|
|2019|   11| 23| 1561716|
|2019|   11| 10| 1940575|
|2019|   11|  7| 1796833|
|2019|   11|  5| 1717244|
|2019|   11|  6| 1694821|
|2019|   11| 27| 1646456|
|2019|   11| 12| 1987569|
|2019|   11| 30| 1754878|
|2019|   11| 16| 6502957|
+----+-----+---+--------+
only showing top 20 rows

3. step 6) Athena를 이용한 테스트 쿼리

쿼리 1

SELECT * FROM “lotte”.”csv” limit 1;

쿼리 2

SELECT count(*) FROM “lotte”.”csv” where substr(event_time,9,2) = ‘11’

– SELECT substr(event_time,1,4) yy, substr(event_time,6,2) mon , substr(event_time,9,2) dd, count(*)
– FROM “lotte”.”csv” – group by substr(event_time,1,4) , substr(event_time,6,2) , substr(event_time,9,2) – where substr(event_time,9,2) = ‘11’ ;

쿼리 3

SELECT * FROM “lotte”.”cleandata” limit 1;

쿼리 4

– SELECT COUNT(*)
– FROM “lotte”.”cleandata” – where substr(event_time,9,2) = ‘11’;

SELECT substr(event_time,1,4) yy, substr(event_time,6,2) mon , substr(event_time,9,2) dd, count(*)
FROM “lotte”.”cleandata” group by substr(event_time,1,4) , substr(event_time,6,2) , substr(event_time,9,2);

쿼리 5

SELECT count(*) FROM “lotte”.”cleandata” where day = ‘11’ ;

쿼리 6

SELECT year,month,day,count(*) FROM “lotte”.”cleandata” group by year,month,day;

 쿠버네티스 개요 AWS Lambda를 이용한 EMR spark batch job 자동화 