AWS Glue Overview and Use Cases

2020-02-14



[Hands-on: Data Preprocessing with Glue]

** See the PPT slides above for reference.

1. Hands-on Scenario

STEP 1) Set up an IAM role for the Glue Dev Endpoint

  • Click Create role -> select Glue as the service and click Next -> search for "s3" and select AmazonS3FullAccess; search for "glue" and select AWSGlueServiceRole and AWSGlueServiceNotebookRole, then click Next -> create the role
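The console clicks above map to a couple of IAM API calls. A minimal boto3-flavored sketch: the role name is a hypothetical placeholder, the managed-policy ARNs are my reading of the policies picked above, and the live calls are left as comments since they need AWS credentials.

```python
import json

# Trust policy that lets the Glue service assume the role
def glue_trust_policy():
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

# The three managed policies selected in the console walkthrough
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceNotebookRole",
]

print(json.dumps(glue_trust_policy()))
# With credentials configured, the role would then be created roughly as:
# iam = boto3.client("iam")
# iam.create_role(RoleName="glue-handson-role",  # hypothetical name
#                 AssumeRolePolicyDocument=json.dumps(glue_trust_policy()))
# for arn in MANAGED_POLICY_ARNS:
#     iam.attach_role_policy(RoleName="glue-handson-role", PolicyArn=arn)
```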

STEP 2) Create a Glue Dev Endpoint

  • Explain DPUs and worker types

STEP 3) Create a notebook on the Dev Endpoint

  • Each time you create a new notebook, a matching role must be created along with it. (A role can be created as part of the notebook-creation flow.)

STEP 4) Convert the raw CSV data to Parquet and store it in partitioned form

  • ETL Job: a job that converts the raw CSV data to Parquet and stores it in partitioned form

  • Turning the raw data into clean data

STEP 5) Create a Data Catalog over the raw data and the clean data

  • step 5-1) Create a Glue Catalog database

  • step 5-2) Create two Glue Crawlers (one for csv, one for parquet) and run them

  • step 5-3) Confirm the Glue Tables (Data Catalog) were created
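Crawler creation can also be scripted instead of clicked through. A minimal sketch of the parameters one crawler needs, assuming hypothetical crawler names, role, and bucket paths; the actual boto3 calls are left as comments since they need live credentials.

```python
# Parameter set for one crawler per data location (names/paths are placeholders)
def crawler_args(name, role, database, s3_path):
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

# One crawler for the raw csv data, one for the clean parquet data
for n, p in [("csv-crawler", "s3://my-bucket/Lotte/csv/"),
             ("parquet-crawler", "s3://my-bucket/Lotte/cleandata/")]:
    args = crawler_args(n, "glue-handson-role", "lotte", p)
    print(args["Name"])
    # glue = boto3.client("glue")
    # glue.create_crawler(**args)
    # glue.start_crawler(Name=args["Name"])
```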

STEP 6) Test with Athena

  • A test to see why the ETL work needs to be done

  • A test to confirm the benefit of Parquet conversion and partitioned storage compared to CSV files

STEP 7) Use Spark SQL to build aggregated summary data from the clean data

STEP 8) Hands-on: apply the work from step 4) and step 7) to a production setting with a Glue Job
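Registering the step 4)/step 7) script as a Glue Job comes down to a `create_job` call. A hedged sketch of the job parameters; the job name, role, script location, and worker sizing are illustrative placeholders, not values from the hands-on, and Glue version "1.0" reflects what was current at the time of writing.

```python
# Build the argument set for a Spark ETL job (all names are placeholders)
def glue_job_args(name, role, script_s3_path):
    return {
        "Name": name,
        "Role": role,
        "Command": {
            "Name": "glueetl",              # Spark ETL job type
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "GlueVersion": "1.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 10,
    }

args = glue_job_args("csv-to-parquet", "glue-handson-role",
                     "s3://my-bucket/scripts/csv_to_parquet.py")
print(args["Command"]["Name"])
# glue = boto3.client("glue")
# glue.create_job(**args)
# glue.start_job_run(JobName=args["Name"])
```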

STEP 9) Shut down the resources

  • Stop and delete the notebook server, terminate the Dev Endpoint, delete the raw data

2. PySpark code for step 4) and step 7)

2.1. step 4) hands-on code

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import *
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.types import *

## Start the Spark engine
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

## Load the CSV files into a Spark DataFrame
df_csv = spark.read.format("csv").option("header", "true").load("s3a://lhw-s3-test/Lotte/csv/*.csv")

## Inspect the DataFrame
df_csv.show(10)
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
|          event_time|event_type|product_id|        category_id|       category_code|   brand| price|  user_id|        user_session|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
|2019-11-01 00:00:...|      view|   1003461|2053013555631882655|electronics.smart...|  xiaomi|489.07|520088904|4d3b30da-a5e4-49d...|
|2019-11-01 00:00:...|      view|   5000088|2053013566100866035|appliances.sewing...|  janome|293.65|530496790|8e5f4f83-366c-4f7...|
|2019-11-01 00:00:...|      view|  17302664|2053013553853497655|                null|   creed| 28.31|561587266|755422e7-9040-477...|
|2019-11-01 00:00:...|      view|   3601530|2053013563810775923|appliances.kitche...|      lg|712.87|518085591|3bfb58cd-7892-48c...|
|2019-11-01 00:00:...|      view|   1004775|2053013555631882655|electronics.smart...|  xiaomi|183.27|558856683|313628f1-68b8-460...|
|2019-11-01 00:00:...|      view|   1306894|2053013558920217191|  computers.notebook|      hp|360.09|520772685|816a59f3-f5ae-4cc...|
|2019-11-01 00:00:...|      view|   1306421|2053013558920217191|  computers.notebook|      hp|514.56|514028527|df8184cc-3694-454...|
|2019-11-01 00:00:...|      view|  15900065|2053013558190408249|                null| rondell| 30.86|518574284|5e6ef132-4d7c-473...|
|2019-11-01 00:00:...|      view|  12708937|2053013553559896355|                null|michelin| 72.72|532364121|0a899268-31eb-46d...|
|2019-11-01 00:00:...|      view|   1004258|2053013555631882655|electronics.smart...|   apple|732.07|532647354|d2d3d2c6-631d-489...|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
only showing top 10 rows
## Count the records
df_csv.count()
67501979
## Check the DataFrame schema
df_csv.schema
StructType(List(StructField(event_time,StringType,true),StructField(event_type,StringType,true),StructField(product_id,StringType,true),StructField(category_id,StringType,true),StructField(category_code,StringType,true),StructField(brand,StringType,true),StructField(price,StringType,true),StructField(user_id,StringType,true),StructField(user_session,StringType,true)))
## Define and apply UDFs that slice the event_time column into new year, month, and day columns

udf_year = udf(lambda record:record[0:4],StringType())
udf_month = udf(lambda record:record[5:7],StringType())
udf_day = udf(lambda record:record[8:10],StringType())

new_df_csv = df_csv.withColumn('year',udf_year('event_time').cast(IntegerType()))
new_df_csv = new_df_csv.withColumn('month',udf_month('event_time').cast(IntegerType()))
new_df_csv = new_df_csv.withColumn('day',udf_day('event_time').cast(IntegerType()))

new_df_csv.show(10)
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+----+-----+---+
|          event_time|event_type|product_id|        category_id|       category_code|   brand| price|  user_id|        user_session|year|month|day|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+----+-----+---+
|2019-11-01 00:00:...|      view|   1003461|2053013555631882655|electronics.smart...|  xiaomi|489.07|520088904|4d3b30da-a5e4-49d...|2019|   11|  1|
|2019-11-01 00:00:...|      view|   5000088|2053013566100866035|appliances.sewing...|  janome|293.65|530496790|8e5f4f83-366c-4f7...|2019|   11|  1|
|2019-11-01 00:00:...|      view|  17302664|2053013553853497655|                null|   creed| 28.31|561587266|755422e7-9040-477...|2019|   11|  1|
|2019-11-01 00:00:...|      view|   3601530|2053013563810775923|appliances.kitche...|      lg|712.87|518085591|3bfb58cd-7892-48c...|2019|   11|  1|
|2019-11-01 00:00:...|      view|   1004775|2053013555631882655|electronics.smart...|  xiaomi|183.27|558856683|313628f1-68b8-460...|2019|   11|  1|
|2019-11-01 00:00:...|      view|   1306894|2053013558920217191|  computers.notebook|      hp|360.09|520772685|816a59f3-f5ae-4cc...|2019|   11|  1|
|2019-11-01 00:00:...|      view|   1306421|2053013558920217191|  computers.notebook|      hp|514.56|514028527|df8184cc-3694-454...|2019|   11|  1|
|2019-11-01 00:00:...|      view|  15900065|2053013558190408249|                null| rondell| 30.86|518574284|5e6ef132-4d7c-473...|2019|   11|  1|
|2019-11-01 00:00:...|      view|  12708937|2053013553559896355|                null|michelin| 72.72|532364121|0a899268-31eb-46d...|2019|   11|  1|
|2019-11-01 00:00:...|      view|   1004258|2053013555631882655|electronics.smart...|   apple|732.07|532647354|d2d3d2c6-631d-489...|2019|   11|  1|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+----+-----+---+
only showing top 10 rows
## Write the data, partitioned by year, month, and day, to the desired S3 path in Parquet format
new_df_csv.write.partitionBy("year","month","day").save("s3a://lhw-s3-test/Lotte/cleandata/", format='parquet', header=True)
## Load the Parquet files we just wrote
df_parquet = spark.read.format("parquet").option("header", "true").load("s3a://lhw-s3-test/Lotte/cleandata/")
df_parquet.show(20)
+--------------------+----------+----------+-------------------+--------------------+---------+-------+---------+--------------------+----+-----+---+
|          event_time|event_type|product_id|        category_id|       category_code|    brand|  price|  user_id|        user_session|year|month|day|
+--------------------+----------+----------+-------------------+--------------------+---------+-------+---------+--------------------+----+-----+---+
|2019-11-17 08:43:...|      view|   2501799|2053013564003713919|appliances.kitche...| elenberg|  46.31|563237118|4368d099-6d19-47c...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   6400335|2053013554121933129|computers.compone...|    intel| 435.28|551129779|4db2c365-ee85-443...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   3701538|2053013565983425517|appliances.enviro...|   irobot|1878.81|539845715|bf7d95c0-69e1-40f...|2019|   11| 17|
|2019-11-17 08:43:...|      view|  26400266|2053013563651392361|                null|  lucente| 119.18|572211322|8e6c63f8-7f34-48b...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   1004659|2053013555631882655|electronics.smart...|  samsung| 762.18|512965259|2981c9f9-3905-49d...|2019|   11| 17|
|2019-11-17 08:43:...|      view|  28716993|2053013565228450757|       apparel.shoes|  respect|  82.63|524048356|a2a0497c-59a9-4d4...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   1004856|2053013555631882655|electronics.smart...|  samsung| 128.42|572486931|55e4ac09-1cec-446...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   1005021|2053013555631882655|electronics.smart...|     oppo| 386.08|512887550|3c3af822-9816-434...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   3300349|2053013555355058573|                null|  redmond|  73.34|566506783|3835f9bf-750b-46a...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   4600560|2053013563944993659|appliances.kitche...|     beko| 412.40|522329355|dce61941-af79-4fd...|2019|   11| 17|
|2019-11-17 08:43:...|      view|  12709556|2053013553559896355|                null|   viatti|  40.16|546204388|86e071a9-b9d2-47e...|2019|   11| 17|
|2019-11-17 08:43:...|  purchase|   7002255|2053013560346280633|       kids.carriage|wingoffly| 113.77|517821629|94575067-317d-4ec...|2019|   11| 17|
|2019-11-17 08:43:...|      view|  26203739|2053013563693335403|                null|  lucente| 158.56|527083517|e6bf2cdb-778f-44a...|2019|   11| 17|
|2019-11-17 08:43:...|      view|  28401077|2053013566209917945|     accessories.bag|  respect|  39.90|512757661|4c6f8f63-a612-4c5...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   3701428|2053013565983425517|appliances.enviro...|   arnica|  68.59|572271574|7b9040ab-2524-4d6...|2019|   11| 17|
|2019-11-17 08:43:...|  purchase|   1004249|2053013555631882655|electronics.smart...|    apple| 765.79|562839858|98c3adb8-a028-4e8...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   1005253|2053013555631882655|electronics.smart...|   xiaomi| 288.04|516404307|a383cb03-2673-446...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   5700981|2053013553970938175|auto.accessories....|   alpine| 875.18|558414772|14c5b3c8-3c0a-4bd...|2019|   11| 17|
|2019-11-17 08:43:...|      view|  22300011|2053013552427434207|                null|    apple| 200.52|572444775|e13be654-ddef-4a7...|2019|   11| 17|
|2019-11-17 08:43:...|      view|   1801940|2053013554415534427|electronics.video.tv|    haier| 231.64|551704382|8e617209-f000-47a...|2019|   11| 17|
+--------------------+----------+----------+-------------------+--------------------+---------+-------+---------+--------------------+----+-----+---+
only showing top 20 rows
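The three UDFs above are nothing more than fixed-width slices of the `event_time` string followed by an integer cast. The same extraction logic in plain Python (the helper name is illustrative):

```python
# event_time format: "YYYY-MM-DD HH:MM:SS" -> fixed-width slices, cast to int
def split_event_time(ts):
    return int(ts[0:4]), int(ts[5:7]), int(ts[8:10])

print(split_event_time("2019-11-01 00:00:00"))  # (2019, 11, 1)
```

The same columns could also be derived with Spark's built-in date functions on a timestamp column, which would avoid the serialization overhead of Python UDFs.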

2.2. step 7) hands-on code

## Use Spark SQL with the Glue Catalog to build a simple aggregate DataFrame
## The table metadata comes from the Glue Catalog

sql_df = spark.sql("SELECT year,month,day,count(*) FROM lotte.cleandata group by year,month,day")
sql_df.show()
+----+-----+---+--------+
|year|month|day|count(1)|
+----+-----+---+--------+
|2019|   11| 19| 1728541|
|2019|   11| 12| 1987569|
|2019|   11| 10| 1940575|
|2019|   11| 30| 1754878|
|2019|   11| 15| 6220416|
|2019|   11|  3| 1567774|
|2019|   11|  9| 1877906|
|2019|   11| 29| 1854426|
|2019|   11| 18| 2021512|
|2019|   11| 21| 1677336|
|2019|   11| 24| 1591765|
|2019|   11|  2| 1555538|
|2019|   11| 16| 6502957|
|2019|   11|  5| 1717244|
|2019|   11|  7| 1796833|
|2019|   11| 28| 1658378|
|2019|   11|  8| 1896402|
|2019|   11| 14| 3069726|
|2019|   11| 11| 2009390|
|2019|   11|  6| 1694821|
+----+-----+---+--------+
only showing top 20 rows
## Write the aggregated DataFrame to the desired S3 path
sql_df.repartitionByRange(1,'year').write.save("s3a://lhw-s3-test/Lotte/summary/201911/", format='csv', header=True)
## Load the summary data we just wrote
sql_new_df = spark.read.format("csv").option("header", "true").load("s3a://lhw-s3-test/Lotte/summary/201911/")
sql_new_df.show()
+----+-----+---+--------+
|year|month|day|count(1)|
+----+-----+---+--------+
|2019|   11| 15| 6220416|
|2019|   11|  3| 1567774|
|2019|   11| 18| 2021512|
|2019|   11| 28| 1658378|
|2019|   11|  8| 1896402|
|2019|   11| 14| 3069726|
|2019|   11|  1| 1445360|
|2019|   11| 25| 1593582|
|2019|   11|  9| 1877906|
|2019|   11| 11| 2009390|
|2019|   11| 22| 1568243|
|2019|   11| 23| 1561716|
|2019|   11| 10| 1940575|
|2019|   11|  7| 1796833|
|2019|   11|  5| 1717244|
|2019|   11|  6| 1694821|
|2019|   11| 27| 1646456|
|2019|   11| 12| 1987569|
|2019|   11| 30| 1754878|
|2019|   11| 16| 6502957|
+----+-----+---+--------+
only showing top 20 rows
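The Spark SQL statement above is a plain group-by count. The equivalent logic, shown over a few made-up (year, month, day) keys in plain Python:

```python
from collections import Counter

# Hypothetical (year, month, day) keys extracted from event_time values
rows = [
    ("2019", "11", "01"), ("2019", "11", "01"),
    ("2019", "11", "02"), ("2019", "11", "01"),
]

# Counter(rows) is GROUP BY year, month, day + count(*)
counts = Counter(rows)
print(counts[("2019", "11", "01")])  # 3
```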

3. step 6) Athena test queries

Query 1

SELECT * FROM "lotte"."csv" limit 1;

Query 2

SELECT count(*) FROM "lotte"."csv" where substr(event_time,9,2) = '11'

-- SELECT substr(event_time,1,4) yy, substr(event_time,6,2) mon, substr(event_time,9,2) dd, count(*)
-- FROM "lotte"."csv"
-- where substr(event_time,9,2) = '11'
-- group by substr(event_time,1,4), substr(event_time,6,2), substr(event_time,9,2);

Query 3

SELECT * FROM "lotte"."cleandata" limit 1;

Query 4

-- SELECT COUNT(*)
-- FROM "lotte"."cleandata"
-- where substr(event_time,9,2) = '11';

SELECT substr(event_time,1,4) yy, substr(event_time,6,2) mon, substr(event_time,9,2) dd, count(*)
FROM "lotte"."cleandata" group by substr(event_time,1,4), substr(event_time,6,2), substr(event_time,9,2);

Query 5

SELECT count(*) FROM "lotte"."cleandata" where day = '11';

Query 6

SELECT year,month,day,count(*) FROM "lotte"."cleandata" group by year,month,day;
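Query 5 against the `day` partition column is the payoff of step 4): Athena can prune down to the objects under the matching `day=11` prefix, while query 2 has to scan every CSV row and slice the string itself. A toy illustration of prefix-based pruning over hypothetical object keys of the kind `partitionBy("year","month","day")` writes:

```python
# Hypothetical S3 object keys laid out by partitionBy("year","month","day")
keys = [
    "Lotte/cleandata/year=2019/month=11/day=10/part-0000.parquet",
    "Lotte/cleandata/year=2019/month=11/day=11/part-0000.parquet",
    "Lotte/cleandata/year=2019/month=11/day=11/part-0001.parquet",
    "Lotte/cleandata/year=2019/month=11/day=12/part-0000.parquet",
]

# Partition pruning: only objects in the matching partition are ever read
pruned = [k for k in keys if "/day=11/" in k]
print(len(pruned))  # 2
```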