AWS Glue 개요 및 활용방안
.
[AWS Glue 개요 및 활용방안]
[Glue를 이용한 데이터 전처리 핸즈온]
** 상단에 PPT 자료 참고
1. 핸즈온 시나리오
STEP 1) Glue Dev Endpoint 사용을 위한 IAM role 설정
- Create role 클릭 -> 하단에 glue 선택 후 next 클릭 -> s3 검색 후 s3fullaccess 선택, glue 검색 후 servicerole 및 GlueServiceNotebookRole 선택 후 next 클릭 -> role 생성
STEP 2) Glue Dev Endpoint 생성
- DPU worker type 설명
STEP 3) Dev Endpoint 내에 Notebook 생성
- 노트북을 새로 생성할때마다 role도 같이 새로 만들어줘야 한다.(노트북 생정중에 role을 생성하는 설정가능)
STEP 4) CSV 형태의 원본데이터를 parquet로 변환하고 partitioning 된 형태로 저장
-
ETL Job : CSV 형태의 원본데이터를 parquet로 변환하고 partitioning 된 형태로 저장하는 job
-
원본데이터 -> Clean data로 만드는 작업
STEP 5) 원본데이터 및 Clean Data를 대상으로 하는 Data Catalog 생성
-
step 5-1) Glue Catalog database 생성
-
step 5-2) Glue Crawler 생성(csv, parquet 2EA) 및 크롤링 수행
-
step 5-3) Glue Table(데이터 카탈로그) 생성확인
STEP 6) Athena를 이용한 테스트
-
왜 ETL 작업을 수행해야 하는가 확인하는 테스트
-
csv 형태의 파일 대비하여 parquet 파일형태 변환 및 partitioning 저장의 이점을 확인하는 테스트
STEP 7) spark SQL을 이용해 Clean Data를 대상으로 재가공한 합산 및 요약데이터 생성
STEP 8) step 4), step 7)에서 수행했던 작업을 Glue Job을 이용하여 실제 운영계에 적용하는 실습
STEP 9) 구동한 자원 종료
- 노트북 서버 중지 및 삭제, 개발앤드포인트 종료, 원본데이터 삭제
2. step 4), step 7)에서 작성할 pyspark 코드
2.1. step 4) 실습코드
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import *
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.types import *
## spark 사용을 위한 엔진구동
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
## csv파일을 spark dataframe으로 로딩
df_csv = spark.read.format("csv").option("header", "true").load("s3a://lhw-s3-test/Lotte/csv/*.csv")
Starting Spark application
## 데이터프레임 확인
df_csv.show(10)
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
| event_time|event_type|product_id| category_id| category_code| brand| price| user_id| user_session|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
|2019-11-01 00:00:...| view| 1003461|2053013555631882655|electronics.smart...| xiaomi|489.07|520088904|4d3b30da-a5e4-49d...|
|2019-11-01 00:00:...| view| 5000088|2053013566100866035|appliances.sewing...| janome|293.65|530496790|8e5f4f83-366c-4f7...|
|2019-11-01 00:00:...| view| 17302664|2053013553853497655| null| creed| 28.31|561587266|755422e7-9040-477...|
|2019-11-01 00:00:...| view| 3601530|2053013563810775923|appliances.kitche...| lg|712.87|518085591|3bfb58cd-7892-48c...|
|2019-11-01 00:00:...| view| 1004775|2053013555631882655|electronics.smart...| xiaomi|183.27|558856683|313628f1-68b8-460...|
|2019-11-01 00:00:...| view| 1306894|2053013558920217191| computers.notebook| hp|360.09|520772685|816a59f3-f5ae-4cc...|
|2019-11-01 00:00:...| view| 1306421|2053013558920217191| computers.notebook| hp|514.56|514028527|df8184cc-3694-454...|
|2019-11-01 00:00:...| view| 15900065|2053013558190408249| null| rondell| 30.86|518574284|5e6ef132-4d7c-473...|
|2019-11-01 00:00:...| view| 12708937|2053013553559896355| null|michelin| 72.72|532364121|0a899268-31eb-46d...|
|2019-11-01 00:00:...| view| 1004258|2053013555631882655|electronics.smart...| apple|732.07|532647354|d2d3d2c6-631d-489...|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
only showing top 10 rows
## 데이터 레코드 수 확인
df_csv.count()
67501979
## 데이터프레임 스키마 확인
df_csv.schema
StructType(List(StructField(event_time,StringType,true),StructField(event_type,StringType,true),StructField(product_id,StringType,true),StructField(category_id,StringType,true),StructField(category_code,StringType,true),StructField(brand,StringType,true),StructField(price,StringType,true),StructField(user_id,StringType,true),StructField(user_session,StringType,true)))
## event_time이라는 피쳐를 슬라이싱하여 year, month, day 피쳐를 새로 생성하는 각각의 사용자 정의함수를 정의하여 적용
udf_year = udf(lambda record:record[0:4],StringType())
udf_month = udf(lambda record:record[5:7],StringType())
udf_day = udf(lambda record:record[8:10],StringType())
new_df_csv = df_csv.withColumn('year',udf_year('event_time').cast(IntegerType()))
new_df_csv = new_df_csv.withColumn('month',udf_month('event_time').cast(IntegerType()))
new_df_csv = new_df_csv.withColumn('day',udf_day('event_time').cast(IntegerType()))
new_df_csv.show(10)
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+----+-----+---+
| event_time|event_type|product_id| category_id| category_code| brand| price| user_id| user_session|year|month|day|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+----+-----+---+
|2019-11-01 00:00:...| view| 1003461|2053013555631882655|electronics.smart...| xiaomi|489.07|520088904|4d3b30da-a5e4-49d...|2019| 11| 1|
|2019-11-01 00:00:...| view| 5000088|2053013566100866035|appliances.sewing...| janome|293.65|530496790|8e5f4f83-366c-4f7...|2019| 11| 1|
|2019-11-01 00:00:...| view| 17302664|2053013553853497655| null| creed| 28.31|561587266|755422e7-9040-477...|2019| 11| 1|
|2019-11-01 00:00:...| view| 3601530|2053013563810775923|appliances.kitche...| lg|712.87|518085591|3bfb58cd-7892-48c...|2019| 11| 1|
|2019-11-01 00:00:...| view| 1004775|2053013555631882655|electronics.smart...| xiaomi|183.27|558856683|313628f1-68b8-460...|2019| 11| 1|
|2019-11-01 00:00:...| view| 1306894|2053013558920217191| computers.notebook| hp|360.09|520772685|816a59f3-f5ae-4cc...|2019| 11| 1|
|2019-11-01 00:00:...| view| 1306421|2053013558920217191| computers.notebook| hp|514.56|514028527|df8184cc-3694-454...|2019| 11| 1|
|2019-11-01 00:00:...| view| 15900065|2053013558190408249| null| rondell| 30.86|518574284|5e6ef132-4d7c-473...|2019| 11| 1|
|2019-11-01 00:00:...| view| 12708937|2053013553559896355| null|michelin| 72.72|532364121|0a899268-31eb-46d...|2019| 11| 1|
|2019-11-01 00:00:...| view| 1004258|2053013555631882655|electronics.smart...| apple|732.07|532647354|d2d3d2c6-631d-489...|2019| 11| 1|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+----+-----+---+
only showing top 10 rows
## year, month, day 피쳐를 사용자가 원하는 s3 경로로 저장. 포맷은 파케이형태
new_df_csv.write.partitionBy("year","month","day").save("s3a://lhw-s3-test/Lotte/cleandata/", format='parquet', header=True)
## 저장한 parquet 파일을 로딩
df_parquet = spark.read.format("parquet").option("header", "true").load("s3a://lhw-s3-test/Lotte/cleandata/")
df_parquet.show(20)
+--------------------+----------+----------+-------------------+--------------------+---------+-------+---------+--------------------+----+-----+---+
| event_time|event_type|product_id| category_id| category_code| brand| price| user_id| user_session|year|month|day|
+--------------------+----------+----------+-------------------+--------------------+---------+-------+---------+--------------------+----+-----+---+
|2019-11-17 08:43:...| view| 2501799|2053013564003713919|appliances.kitche...| elenberg| 46.31|563237118|4368d099-6d19-47c...|2019| 11| 17|
|2019-11-17 08:43:...| view| 6400335|2053013554121933129|computers.compone...| intel| 435.28|551129779|4db2c365-ee85-443...|2019| 11| 17|
|2019-11-17 08:43:...| view| 3701538|2053013565983425517|appliances.enviro...| irobot|1878.81|539845715|bf7d95c0-69e1-40f...|2019| 11| 17|
|2019-11-17 08:43:...| view| 26400266|2053013563651392361| null| lucente| 119.18|572211322|8e6c63f8-7f34-48b...|2019| 11| 17|
|2019-11-17 08:43:...| view| 1004659|2053013555631882655|electronics.smart...| samsung| 762.18|512965259|2981c9f9-3905-49d...|2019| 11| 17|
|2019-11-17 08:43:...| view| 28716993|2053013565228450757| apparel.shoes| respect| 82.63|524048356|a2a0497c-59a9-4d4...|2019| 11| 17|
|2019-11-17 08:43:...| view| 1004856|2053013555631882655|electronics.smart...| samsung| 128.42|572486931|55e4ac09-1cec-446...|2019| 11| 17|
|2019-11-17 08:43:...| view| 1005021|2053013555631882655|electronics.smart...| oppo| 386.08|512887550|3c3af822-9816-434...|2019| 11| 17|
|2019-11-17 08:43:...| view| 3300349|2053013555355058573| null| redmond| 73.34|566506783|3835f9bf-750b-46a...|2019| 11| 17|
|2019-11-17 08:43:...| view| 4600560|2053013563944993659|appliances.kitche...| beko| 412.40|522329355|dce61941-af79-4fd...|2019| 11| 17|
|2019-11-17 08:43:...| view| 12709556|2053013553559896355| null| viatti| 40.16|546204388|86e071a9-b9d2-47e...|2019| 11| 17|
|2019-11-17 08:43:...| purchase| 7002255|2053013560346280633| kids.carriage|wingoffly| 113.77|517821629|94575067-317d-4ec...|2019| 11| 17|
|2019-11-17 08:43:...| view| 26203739|2053013563693335403| null| lucente| 158.56|527083517|e6bf2cdb-778f-44a...|2019| 11| 17|
|2019-11-17 08:43:...| view| 28401077|2053013566209917945| accessories.bag| respect| 39.90|512757661|4c6f8f63-a612-4c5...|2019| 11| 17|
|2019-11-17 08:43:...| view| 3701428|2053013565983425517|appliances.enviro...| arnica| 68.59|572271574|7b9040ab-2524-4d6...|2019| 11| 17|
|2019-11-17 08:43:...| purchase| 1004249|2053013555631882655|electronics.smart...| apple| 765.79|562839858|98c3adb8-a028-4e8...|2019| 11| 17|
|2019-11-17 08:43:...| view| 1005253|2053013555631882655|electronics.smart...| xiaomi| 288.04|516404307|a383cb03-2673-446...|2019| 11| 17|
|2019-11-17 08:43:...| view| 5700981|2053013553970938175|auto.accessories....| alpine| 875.18|558414772|14c5b3c8-3c0a-4bd...|2019| 11| 17|
|2019-11-17 08:43:...| view| 22300011|2053013552427434207| null| apple| 200.52|572444775|e13be654-ddef-4a7...|2019| 11| 17|
|2019-11-17 08:43:...| view| 1801940|2053013554415534427|electronics.video.tv| haier| 231.64|551704382|8e617209-f000-47a...|2019| 11| 17|
+--------------------+----------+----------+-------------------+--------------------+---------+-------+---------+--------------------+----+-----+---+
only showing top 20 rows
2.2. step 7) 실습코드
## Glue catalog를 이용해서 spark sql로 간단한 합산 데이터 프레임 만들기
## 메타데이터는 글루카탈로그를 이용함
sql_df = spark.sql("SELECT year,month,day,count(*) FROM lotte.cleandata group by year,month,day")
sql_df.show()
+----+-----+---+--------+
|year|month|day|count(1)|
+----+-----+---+--------+
|2019| 11| 19| 1728541|
|2019| 11| 12| 1987569|
|2019| 11| 10| 1940575|
|2019| 11| 30| 1754878|
|2019| 11| 15| 6220416|
|2019| 11| 3| 1567774|
|2019| 11| 9| 1877906|
|2019| 11| 29| 1854426|
|2019| 11| 18| 2021512|
|2019| 11| 21| 1677336|
|2019| 11| 24| 1591765|
|2019| 11| 2| 1555538|
|2019| 11| 16| 6502957|
|2019| 11| 5| 1717244|
|2019| 11| 7| 1796833|
|2019| 11| 28| 1658378|
|2019| 11| 8| 1896402|
|2019| 11| 14| 3069726|
|2019| 11| 11| 2009390|
|2019| 11| 6| 1694821|
+----+-----+---+--------+
only showing top 20 rows
## 합산한 데이터 프레임을 사용자가 원하는 s3 경로로 저장
sql_df.repartitionByRange(1,'year').write.save("s3a://lhw-s3-test/Lotte/summary/201911/", format='csv', header=True)
## 방금 저장한 합산데이터를 로딩
sql_new_df = spark.read.format("csv").option("header", "true").load("s3a://lhw-s3-test/Lotte/summary/201911/")
sql_new_df.show()
+----+-----+---+--------+
|year|month|day|count(1)|
+----+-----+---+--------+
|2019| 11| 15| 6220416|
|2019| 11| 3| 1567774|
|2019| 11| 18| 2021512|
|2019| 11| 28| 1658378|
|2019| 11| 8| 1896402|
|2019| 11| 14| 3069726|
|2019| 11| 1| 1445360|
|2019| 11| 25| 1593582|
|2019| 11| 9| 1877906|
|2019| 11| 11| 2009390|
|2019| 11| 22| 1568243|
|2019| 11| 23| 1561716|
|2019| 11| 10| 1940575|
|2019| 11| 7| 1796833|
|2019| 11| 5| 1717244|
|2019| 11| 6| 1694821|
|2019| 11| 27| 1646456|
|2019| 11| 12| 1987569|
|2019| 11| 30| 1754878|
|2019| 11| 16| 6502957|
+----+-----+---+--------+
only showing top 20 rows
3. step 6) Athena를 이용한 테스트 쿼리
쿼리 1
SELECT * FROM “lotte”.”csv” limit 1;
쿼리 2
SELECT count(*) FROM “lotte”.”csv” where substr(event_time,9,2) = ‘11’
– SELECT substr(event_time,1,4) yy, substr(event_time,6,2) mon , substr(event_time,9,2) dd, count(*)
– FROM “lotte”.”csv”
– group by substr(event_time,1,4) , substr(event_time,6,2) , substr(event_time,9,2)
– where substr(event_time,9,2) = ‘11’ ;
쿼리 3
SELECT * FROM “lotte”.”cleandata” limit 1;
쿼리 4
– SELECT COUNT(*)
– FROM “lotte”.”cleandata”
– where substr(event_time,9,2) = ‘11’;
SELECT substr(event_time,1,4) yy, substr(event_time,6,2) mon , substr(event_time,9,2) dd, count(*)
FROM “lotte”.”cleandata”
group by substr(event_time,1,4) , substr(event_time,6,2) , substr(event_time,9,2);
쿼리 5
SELECT count(*) FROM “lotte”.”cleandata” where day = ‘11’ ;
쿼리 6
SELECT year,month,day,count(*) FROM “lotte”.”cleandata” group by year,month,day;