EMR on EKS 실습 워크샵

2021-12-19


Data_Engineering_TIL(20211218)

[실습자료]

  • AWS “EMR on EKS Workshop” 자료를 공부하고 정리한 내용입니다.

** URL : https://emr-on-eks.workshop.aws

[실습목표]

  • Create EMR virtual clusters that point to a namespace on Amazon EKS

  • Submit Spark jobs to virtual clusters

  • Use managed Linux nodes or Fargate profiles for Amazon EKS node groups

  • Use Spark UI and Kubernetes dashboard for monitoring and debugging

  • Use advanced features such as pod templates

[실습내용]

STEP 1) 버지니아 리전으로 가서 EC2 인스턴스 접속에 사용할 SSH pem 키를 생성해준다.

(그림 1)

** 실습이 버지니아 리전에서 이루어진다.
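참고로 콘솔 대신 AWS CLI로도 키페어를 만들 수 있다. 아래는 뒤에 나오는 CloudFormation 템플릿의 기본값인 ee-default-keypair 이름을 그대로 쓴 예시이고, 다른 이름을 써도 된다.

# 버지니아 리전에 키페어를 생성하고 pem 파일로 저장
aws ec2 create-key-pair \
  --region us-east-1 \
  --key-name ee-default-keypair \
  --query 'KeyMaterial' \
  --output text > ee-default-keypair.pem
chmod 400 ee-default-keypair.pem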

STEP 2) 실습용 CloudFormation 템플릿 세팅 링크를 클릭하고 아래의 YAML 템플릿을 이용해서 리소스를 생성해준다.

** 실습용 CloudFormation 템플릿 세팅 링크

https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=EMR-EKS-Workshop&templateURL=https://aws-data-analytics-workshops.s3.amazonaws.com/emr-eks-workshop/cloudformation/amazon-emr-on-eks-workshop.yaml

---
AWSTemplateFormatVersion: '2010-09-09'
Description: AWS CloudFormation template for EMR on EKS setup. Creates an EC2 instance that uses CDK to deploy the resources.

Parameters:

  EC2StartupInstanceType:
    Description: Startup instance type
    Type: String
    Default: t3.medium
    AllowedValues:
      - t2.micro
      - t3.micro
      - t3.small
      - t3.medium

  EC2StartupInstanceVolumeSize:
    Type: Number
    Description: The Size in GB of the Startup Instance Volume
    Default: 15

  EEKeyPair:
    Type: AWS::EC2::KeyPair::KeyName
    Description: SSH key (for access instances)
    Default: ee-default-keypair

  LatestAmiId:
    Type:  'AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>'
    Default: '/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2'

Resources:

  EMRWorkshopAdmin:
    Type: AWS::IAM::Role
    Properties:
      Tags:
        - Key: Environment
          Value: emr-eks-workshop
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
        - Effect: Allow
          Principal:
            Service:
            - ec2.amazonaws.com
            - ssm.amazonaws.com
          Action:
          - sts:AssumeRole
      ManagedPolicyArns:
      - arn:aws:iam::aws:policy/AdministratorAccess
      - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      Path: "/"

  VPCStack:
    Type: 'AWS::CloudFormation::Stack'
    Properties:
      TemplateURL: >-
        https://aws-data-analytics-workshops.s3.amazonaws.com/emr-eks-workshop/cloudformation/eks_vpc_stack.yml
      TimeoutInMinutes: 10

  InstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Path: "/"
      Roles:
      - Ref: EMRWorkshopAdmin

  EC2StartupInstance:
    Type: AWS::EC2::Instance
    DependsOn: [ VPCStack, InstanceProfile ]
    Properties:
      InstanceType: !Ref EC2StartupInstanceType
      KeyName: !Ref EEKeyPair
      ImageId: !Ref LatestAmiId
      IamInstanceProfile: !Ref InstanceProfile
      SecurityGroupIds:
      - !GetAtt VPCStack.Outputs.SecurityGroup
      SubnetId: !GetAtt VPCStack.Outputs.PublicSubnet1Id
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash -xe
          exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1
          yum update -y
          yum install git -y
          touch ~/.bashrc
          curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.34.0/install.sh | bash
          export NVM_DIR="$HOME/.nvm"
          [ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"
          nvm install node
          npm install -g aws-cdk
          git clone https://github.com/emrspecialistsamer/amazon-emr-on-eks-labs.git
          cd amazon-emr-on-eks-labs
          npm install --silent
          cdk bootstrap aws://${AWS::AccountId}/${AWS::Region}
          cdk deploy --require-approval never
      Tags:
        -
          Key: Environment
          Value: emr-eks-workshop
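참고로 위의 콘솔 링크 대신 AWS CLI로 같은 템플릿을 배포할 수도 있다. 아래는 템플릿 URL과 키페어 기본값을 그대로 사용한 예시이다(템플릿이 IAM 역할을 만들기 때문에 CAPABILITY_IAM이 필요하다).

aws cloudformation create-stack \
  --region us-east-1 \
  --stack-name EMR-EKS-Workshop \
  --template-url https://aws-data-analytics-workshops.s3.amazonaws.com/emr-eks-workshop/cloudformation/amazon-emr-on-eks-workshop.yaml \
  --parameters ParameterKey=EEKeyPair,ParameterValue=ee-default-keypair \
  --capabilities CAPABILITY_IAM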

(그림 2)

스택을 생성하면 리소스 전체 생성 완료까지 대략 15분 정도 소요된다.

생성이 완료되면 아래와 같이 6개의 스택이 생성된다.

(그림 3)

EmrEksAppStack에 우리가 오늘 실습할 때 필요한 리소스들이 생성된 것을 확인할 수 있다.

(그림 4)

STEP 3) 실습용 CLOUD9 WORKSPACE 생성

버지니아 리전의 클라우드9 콘솔로 이동해서 아래와 같이 cloud9 환경을 생성해준다.

step 3-1) Select Create environment

step 3-2) Name it emr-eks-cloud9, click Next.

step 3-3) Choose t3.small for Instance type, and take the default values for the remaining settings (Environment type, Platform).

step 3-4) Choose Network settings (advanced) and choose VPC. Select EmrEksAppStack VPC. Choose Next Step.

step 3-5) Click Create environment

step 3-6) Choose Open IDE to launch the Cloud9 IDE.

아래와 같이 클라우드9 화면이 세팅되었으면 완료된 상태임.

(그림 5)

STEP 4) ATTACH IAM ROLE TO WORKSPACE

step 4-1) Click the grey circle button (in top right corner) and select Manage EC2 Instance

(그림 6)

step 4-2) Select the instance, then choose Actions / Security / Modify IAM Role

(그림 7)

step 4-3) Choose emr-eks-instance-profile from the IAM Role drop down, and select Save

(그림 8)

STEP 5) UPDATE IAM SETTINGS FOR WORKSPACE

Cloud9 normally manages IAM credentials dynamically. This isn’t currently compatible with the EKS IAM authentication, so we will disable it and rely on the IAM role instead.

step 5-1) Return to your Cloud9 workspace and click the gear icon (in top right corner)

step 5-2) Select AWS SETTINGS

step 5-3) Turn off AWS managed temporary credentials

step 5-4) Close the Preferences tab

(그림 9)
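임시 자격증명을 끈 뒤에는 아래처럼 호출 주체가 인스턴스에 붙인 역할인지 간단히 확인해볼 수 있다. Arn에 emr-eks-instance-profile에 연결된 역할 이름이 나오면 정상이라고 보면 된다.

# Cloud9이 남겨둔 임시 자격증명 파일이 있으면 제거
rm -vf ${HOME}/.aws/credentials
# 현재 어떤 자격증명으로 AWS API를 호출하는지 확인
aws sts get-caller-identity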

STEP 6) CREATE EMR CLUSTER ON EKS

# Run the below code in Cloud9 terminal to copy the bootstrap script
minsupark:~/environment $ curl https://aws-data-analytics-workshops.s3.amazonaws.com/emr-eks-workshop/scripts/bootstrap.sh -o bootstrap.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3034  100  3034    0     0  53381      0 --:--:-- --:--:-- --:--:-- 54178

minsupark:~/environment $ pwd
/home/ec2-user/environment

minsupark:~/environment $ ll
total 8
-rw-rw-r-- 1 ec2-user ec2-user 3034 Dec 18 07:55 bootstrap.sh
-rw-r--r-- 1 ec2-user ec2-user  569 Dec 11 01:55 README.md

minsupark:~/environment $ cat bootstrap.sh
if [ $# -eq 0 ]
  then
    echo "Please provide EKSClusterName, Region and EKSClusterAdminArn from cloudformation outputs"
    return
fi

#cloud9 comes with AWS v1. Upgrade to AWS v2
sudo yum install jq -y

aws configure set region $2

account_id=`aws sts get-caller-identity --query Account --output text`

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Install eksctl on cloud9. You must have eksctl 0.34.0 version or later.

curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin
eksctl version

# Install kubectl on cloud9.

curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.18.8/2020-09-18/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin

# Install helm on cloud9.

curl -sSL https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash

# Copy TPC-DS data into account bucket
#aws s3 cp --recursive s3://aws-data-analytics-workshops/emr-eks-workshop/data/ s3://emr-eks-workshop-$account_id/data/

aws eks update-kubeconfig --name $1 --region $2 --role-arn $3

# Allow Cloud9 to talk to EKS Control Plane. Add Cloud9 IP address address inbound rule to EKS Cluster Security Group
export EKS_SG=`aws eks describe-cluster --name $1 --query cluster.resourcesVpcConfig.clusterSecurityGroupId | sed 's/"//g'`
export C9_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
aws ec2 authorize-security-group-ingress  --group-id ${EKS_SG}  --protocol tcp  --port 443  --cidr ${C9_IP}/32

#Create a namespace on EKS for EMR cluster
kubectl create namespace emr-eks-workshop-namespace

#Create a namespace on EKS Fargate for EMR cluster
kubectl create namespace eks-fargate

# Create Amazon EMR Cluster in EKS emr-eks-workshop-namespace namespace

eksctl create iamidentitymapping \
    --cluster $1 \
    --namespace emr-eks-workshop-namespace \
    --service-name "emr-containers"

aws emr-containers create-virtual-cluster \
--name emr_eks_cluster \
--container-provider '{
    "id":   "'"$1"'",
    "type": "EKS",
    "info": {
        "eksInfo": {
            "namespace": "emr-eks-workshop-namespace"
        }
    }
}'    

# Setup the Trust Policy for the IAM Job Execution Role

aws emr-containers update-role-trust-policy \
       --cluster-name $1 \
       --namespace emr-eks-workshop-namespace \
       --role-name EMR_EKS_Job_Execution_Role

# Create Amazon EMR Cluster in EKS eks-fargate namespace

eksctl create iamidentitymapping \
    --cluster $1 \
    --namespace eks-fargate \
    --service-name "emr-containers"

# Setup the Trust Policy for the IAM Job Execution Role

aws emr-containers update-role-trust-policy \
       --cluster-name $1 \
       --namespace eks-fargate \
       --role-name EMR_EKS_Job_Execution_Role

# Enable IAM Roles for Service Accounts (IRSA) on the EKS cluster

eksctl utils associate-iam-oidc-provider --cluster $1 --approve

웹브라우저 새 창을 하나 띄우고 클라우드포메이션 콘솔로 가서 아래와 같이 부트스트랩 실행 명령어를 복사한다.

(그림 10)

클라우드9 콘솔로 다시 돌아와서 아래와 같이 명령어를 실행해준다.

# 위에서 복사한 부트스트랩 실행 명령어를 붙여넣은 후 실행한다.
# Once the bootstrap script runs it will create the necessary workspace in EKS for EMR Cluster and create an EMR Cluster
minsupark:~/environment $ sh bootstrap.sh Cluster9EE0221C-ffcf7c5e49c7479785f2b009ce1974c1 us-east-1 arn:aws:iam::111111111111:role/EmrEksAppStack-emreksadminRole494E27B4-C4P9V6CFNV4N

...

namespace/emr-eks-workshop-namespace created
namespace/eks-fargate created
2021-12-18 08:01:54 [ℹ]  eksctl version 0.77.0
2021-12-18 08:01:54 [ℹ]  using region us-east-1
2021-12-18 08:01:54 [ℹ]  created "emr-eks-workshop-namespace:Role.rbac.authorization.k8s.io/emr-containers"
2021-12-18 08:01:54 [ℹ]  created "emr-eks-workshop-namespace:RoleBinding.rbac.authorization.k8s.io/emr-containers"
2021-12-18 08:01:54 [ℹ]  adding identity "arn:aws:iam::111111111111:role/AWSServiceRoleForAmazonEMRContainers" to auth ConfigMap
{
    "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/zse2a4iflxcu1mrstc0dk1srf", 
    "id": "zse2a4iflxcu1mrstc0dk1srf", 
    "name": "emr_eks_cluster"
}
Successfully updated trust policy of role EMR_EKS_Job_Execution_Role
2021-12-18 08:02:10 [ℹ]  eksctl version 0.77.0
2021-12-18 08:02:10 [ℹ]  using region us-east-1
2021-12-18 08:02:10 [ℹ]  created "eks-fargate:Role.rbac.authorization.k8s.io/emr-containers"
2021-12-18 08:02:10 [ℹ]  created "eks-fargate:RoleBinding.rbac.authorization.k8s.io/emr-containers"
Successfully updated trust policy of role EMR_EKS_Job_Execution_Role
2021-12-18 08:02:11 [ℹ]  eksctl version 0.77.0
2021-12-18 08:02:11 [ℹ]  using region us-east-1
2021-12-18 08:02:11 [ℹ]  will create IAM Open ID Connect provider for cluster "Cluster9EE0221C-ffcf7c5e49c7479785f2b009ce1974c1" in "us-east-1"
2021-12-18 08:02:11 [✔]  created IAM Open ID Connect provider for cluster "Cluster9EE0221C-ffcf7c5e49c7479785f2b009ce1974c1" in "us-east-1"

# 클러스터 잘 떴는지 체크
minsupark:~/environment $ kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   172.20.0.1   <none>        443/TCP   45m

minsupark:~/environment $ eksctl get cluster
2021-12-18 08:05:32 [ℹ]  eksctl version 0.77.0
2021-12-18 08:05:32 [ℹ]  using region us-east-1
NAME                                                    REGION          EKSCTL CREATED
Cluster9EE0221C-ffcf7c5e49c7479785f2b009ce1974c1        us-east-1       False

또한 EMR 콘솔에 접속해보면 아래와 같이 클러스터가 생성된 것을 확인할 수 있다.

(그림 11)

STEP 7) EMR 클러스터에 spark job submit하기

클라우드9 콘솔에서 아래와 같이 명령어를 실행해준다.

The command below runs a Spark Pi Python script on EMR 6.2.0. Substitute the values for virtual-cluster-id and execution-role-arn.

virtual-cluster-id: Copy the cluster id from the output of the command below.

minsupark:~/environment $ aws emr-containers list-virtual-clusters
{
    "virtualClusters": [
        {
            "id": "zse2a4iflxcu1mrstc0dk1srf",
            "name": "emr_eks_cluster",
            "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/zse2a4iflxcu1mrstc0dk1srf",
            "state": "RUNNING",
            "containerProvider": {
                "type": "EKS",
                "id": "Cluster9EE0221C-ffcf7c5e49c7479785f2b009ce1974c1",
                "info": {
                    "eksInfo": {
                        "namespace": "emr-eks-workshop-namespace"
                    }
                }
            },
            "createdAt": "2021-12-18T08:02:08+00:00",
            "tags": {}
        }
    ]
}

execution-role-arn: Refer to the cloudformation outputs and copy the values of EKSClusterId, EMRJobExecutionRoleArn and S3 Bucket

Export these values to Cloud9 terminal:

export EMR_EKS_CLUSTER_ID=<virtual-cluster-id>

export EMR_EKS_EXECUTION_ARN=<arn:aws:iam::xxxxx:role/EMR_EKS_Job_Execution_Role>

export S3_BUCKET=<S3Bucket>

EMR_EKS_EXECUTION_ARN, S3_BUCKET은 아래 그림과 같이 클라우드포메이션 output에서 확인할 수 있다.

(그림 12)

minsupark:~/environment $ export EMR_EKS_CLUSTER_ID=zse2a4iflxcu1mrstc0dk1srf
minsupark:~/environment $ export EMR_EKS_EXECUTION_ARN=arn:aws:iam::111111111111:role/EMR_EKS_Job_Execution_Role
minsupark:~/environment $ export S3_BUCKET=s3://emr-eks-workshop-111111111111
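참고로 가상 클러스터가 하나뿐이라면 아래처럼 CLI로 ID를 바로 뽑아서 export할 수도 있다(RUNNING 상태의 클러스터가 하나만 있다는 가정의 예시).

export EMR_EKS_CLUSTER_ID=$(aws emr-containers list-virtual-clusters \
  --query "virtualClusters[?state=='RUNNING'] | [0].id" --output text)
echo ${EMR_EKS_CLUSTER_ID}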

아래와 같이 명령어를 실행해준다.

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-pi \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-6.2.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }'
minsupark:~/environment $ aws emr-containers start-job-run \
> --virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
> --name spark-pi \
> --execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
> --release-label emr-6.2.0-latest \
> --job-driver '{
>     "sparkSubmitJobDriver": {
>         "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
>         "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
>         }
>     }'
{
    "id": "00000002veck7snqfhe",
    "name": "spark-pi",
    "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/zse2a4iflxcu1mrstc0dk1srf/jobruns/00000002veck7snqfhe",
    "virtualClusterId": "zse2a4iflxcu1mrstc0dk1srf"
}
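잡이 제출된 뒤의 상태는 아래처럼 describe-job-run으로 확인할 수 있다. --id에는 위 출력의 id 값을 넣어주면 된다.

aws emr-containers describe-job-run \
  --virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
  --id 00000002veck7snqfhe \
  --query 'jobRun.state'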

minsupark:~/environment $ aws s3 cp s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py .
download: s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py to ./pi.py

minsupark:~/environment $ cat pi.py
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession


if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()

You can check your job on EMR console

(그림 13)

You can try running the same job with a different EMR version. The command below runs the code on EMR 5.33.0.

아래와 같이 다시 EMR 버전만 바꿔서 명령어를 실행해보자.

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-pi \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-5.33.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }'
minsupark:~/environment $ aws emr-containers start-job-run \
> --virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
> --name spark-pi \
> --execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
> --release-label emr-5.33.0-latest \
> --job-driver '{
>     "sparkSubmitJobDriver": {
>         "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
>         "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
>         }
>     }' 
{
    "id": "00000002veckihtit41",
    "name": "spark-pi",
    "arn": "arn:aws:emr-containers:us-east-1:161461013751:/virtualclusters/zse2a4iflxcu1mrstc0dk1srf/jobruns/00000002veckihtit41",
    "virtualClusterId": "zse2a4iflxcu1mrstc0dk1srf"
}

EMR 콘솔에 접속해보면 아래 그림과 같이 또 다른 잡이 실행되고 있는 것을 확인할 수 있다.

(그림 14)

이번에는 아래 명령어와 같이 로그가 S3(그리고 CloudWatch)에 남도록 실행해보자.

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-pi-logging \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-6.2.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"2G"
         }
      }
    ], 
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-containers/jobs", 
        "logStreamNamePrefix": "emr-eks-workshop"
      }, 
      "s3MonitoringConfiguration": {
        "logUri": "'"$S3_BUCKET"'/logs/"
      }
    }
}'

Substitute the value for s3-bucket. This example shows how to specify a CloudWatch Monitoring Configuration and S3 log path as part of the job configuration.

minsupark:~/environment $ aws emr-containers start-job-run \
> --virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
> --name spark-pi-logging \
> --execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
> --release-label emr-6.2.0-latest \
> --job-driver '{
>     "sparkSubmitJobDriver": {
>         "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
>         "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
>         }
>     }' \
> --configuration-overrides '{
>     "applicationConfiguration": [
>       {
>         "classification": "spark-defaults", 
>         "properties": {
>           "spark.driver.memory":"2G"
>          }
>       }
>     ], 
>     "monitoringConfiguration": {
>       "cloudWatchMonitoringConfiguration": {
>         "logGroupName": "/emr-containers/jobs", 
>         "logStreamNamePrefix": "emr-eks-workshop"
>       }, 
>       "s3MonitoringConfiguration": {
>         "logUri": "'"$S3_BUCKET"'/logs/"
>       }
>     }
> }'
{
    "id": "00000002vecl0ttmdoh",
    "name": "spark-pi-logging",
    "arn": "arn:aws:emr-containers:us-east-1:161461013751:/virtualclusters/zse2a4iflxcu1mrstc0dk1srf/jobruns/00000002vecl0ttmdoh",
    "virtualClusterId": "zse2a4iflxcu1mrstc0dk1srf"
}

You can go to the S3 bucket you specified to check for the logs. Your log data is sent to the following Amazon S3 locations.

Controller Logs - /logUri/virtual-cluster-id/jobs/job-id/containers/pod-name/(stderr.gz/stdout.gz)

Driver Logs - /logUri/virtual-cluster-id/jobs/job-id/containers/spark-application-id/spark-job-id-driver/(stderr.gz/stdout.gz)

Executor Logs - /logUri/virtual-cluster-id/jobs/job-id/containers/spark-application-id/executor-pod-name/(stderr.gz/stdout.gz)

Explore the contents of the Driver logs and run an S3 Select query on stdout.gz. The screenshots below show the output of the PySpark Pi job and the value of Pi.

The path should be in the format: s3://xxxx/yyyy/containers/spark-xxxx/spark-xxx-driver/stdout.gz
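정확한 드라이버 로그 경로는 아래처럼 버킷을 재귀적으로 나열해서 찾아볼 수 있다(경로는 잡 ID에 따라 달라진다).

aws s3 ls ${S3_BUCKET}/logs/ --recursive | grep driver/stdout.gz
# 예: 위에서 찾은 키를 붙여서 내용 확인
# aws s3 cp s3://<버킷>/<위에서 찾은 키> - | gunzip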

아래 그림과 같이 s3에도 로그가 남는 것을 확인할 수 있다.

(그림 15)

In the StartJobRun API, log_group_name is the log group name for CloudWatch, and log_stream_prefix is the log stream name prefix for CloudWatch. You can view and search these logs in the AWS Management Console.

Controller logs - logGroup/logStreamPrefix/virtual-cluster-id/jobs/job-id/containers/pod-name/(stderr/stdout)

Driver logs - logGroup/logStreamPrefix/virtual-cluster-id/jobs/job-id/containers/spark-application-id/spark-job-id-driver/(stderr/stdout)

Executor logs - logGroup/logStreamPrefix/virtual-cluster-id/jobs/job-id/containers/spark-application-id/executor-pod-name/(stderr/stdout)

From AWS Console -> Services -> CloudWatch, choose Log groups and then choose /emr-containers/jobs. Choose workshop/xxx/jobs/xxx/containers/spark-xxx-driver/stdout as shown below:

아래와 같이 클라우드워치에서도 spark job 실행로그를 확인할 수 있다.

(그림 16)
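콘솔 대신 AWS CLI v2의 logs tail 명령으로도 같은 로그 그룹을 확인할 수 있다. 아래는 최근 1시간치 로그를 실시간으로 따라가며 보는 예시이다.

aws logs tail /emr-containers/jobs --since 1h --follow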

STEP 8) SPARK HISTORY SERVER에서 로그 확인하기

From AWS Console go to EMR Console. Choose Virtual Clusters -> emr_eks_cluster

(그림 17)

STEP 9) EKS CLUSTER LOGGING하기

In this section we will learn how to enable EKS logging and viewing logs in CloudWatch.

CloudWatch logging for EKS control plane is not enabled by default due to data ingestion and storage costs.

There are 5 types of logs that you may wish to enable:

1) api

2) audit

3) authenticator

4) controllerManager

5) scheduler

step 9-1) From AWS Console go to Amazon EKS.

step 9-2) Choose Clusters and select the Cluster created by the workshop.

step 9-3) Choose Logging and then choose Manage Logging

step 9-4) Enable Audit and Scheduler and choose Save changes

(그림 18)
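참고로 콘솔 대신 eksctl로도 같은 설정을 할 수 있다. 아래는 audit, scheduler 두 타입만 켜는 예시이고, <EKSCluster>에는 cloudformation output의 EKS 클러스터 이름을 넣는다.

eksctl utils update-cluster-logging \
  --cluster <EKSCluster> \
  --region us-east-1 \
  --enable-types audit,scheduler \
  --approve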

step 9-5) 클라우드9 콘솔로 돌아가서 아래와 같이 명령어를 실행한다.

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-pi \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-6.2.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }'
minsupark:~/environment $ aws emr-containers start-job-run \
> --virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
> --name spark-pi \
> --execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
> --release-label emr-6.2.0-latest \
> --job-driver '{
>     "sparkSubmitJobDriver": {
>         "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
>         "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
>         }
>     }'
{
    "id": "00000002vecms3kg4c1",
    "name": "spark-pi",
    "arn": "arn:aws:emr-containers:us-east-1:161461013751:/virtualclusters/zse2a4iflxcu1mrstc0dk1srf/jobruns/00000002vecms3kg4c1",
    "virtualClusterId": "zse2a4iflxcu1mrstc0dk1srf"
}

step 9-6) Go to CloudWatch

step 9-7) Choose Logs -> Log groups -> /aws/eks/<cluster-name>/cluster

(그림 19)

step 9-8) Explore the logs by selecting one of the log streams

이제 그러면 Spark ETL job을 실행하고 EKS 모니터링과 로깅을 해보자.

STEP 10) DEPLOY KUBERNETES DASHBOARD

Kubernetes Dashboard is a web-based user interface. You can use Dashboard to get an overview of applications running on your cluster. In this lab we will deploy the official Kubernetes Dashboard.

step 10-1) Deploy Dashboard

# The Dashboard UI is not deployed by default. To deploy it, run the following command
minsupark:~/environment $ export DASHBOARD_VERSION="v2.0.0"
minsupark:~/environment $ kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/${DASHBOARD_VERSION}/aio/deploy/recommended.yaml
namespace/kubernetes-dashboard created
serviceaccount/kubernetes-dashboard created
service/kubernetes-dashboard created
secret/kubernetes-dashboard-certs created
secret/kubernetes-dashboard-csrf created
secret/kubernetes-dashboard-key-holder created
configmap/kubernetes-dashboard-settings created
role.rbac.authorization.k8s.io/kubernetes-dashboard created
clusterrole.rbac.authorization.k8s.io/kubernetes-dashboard created
rolebinding.rbac.authorization.k8s.io/kubernetes-dashboard created
clusterrolebinding.rbac.authorization.k8s.io/kubernetes-dashboard created
deployment.apps/kubernetes-dashboard created
service/dashboard-metrics-scraper created
deployment.apps/dashboard-metrics-scraper created

# You can access Dashboard using the kubectl command-line tool by running the following command in your Cloud9 terminal
minsupark:~/environment $ kubectl proxy --port=8080 --address=0.0.0.0 --disable-filter=true &
[1] 23532
minsupark:~/environment $ W1218 08:54:01.219245   23532 proxy.go:167] Request filter disabled, your proxy is vulnerable to XSRF attacks, please be cautious
Starting to serve on [::]:8080

# This will start the proxy, listen on port 8080, listen on all interfaces, and will disable the filtering of non-localhost requests. This command will continue to run in the background of the current terminal’s session.

step 10-2) Accessing the Dashboard UI

먼저 클라우드9 콘솔로 가서 아래와 같이 명령어를 실행해서 kubernetes-dashboard 접속을 위한 토큰을 발급받는다.

aws eks get-token \
--cluster-name <EKSCluster> \
--region us-east-1 \
--role-arn <EKSClusterAdminArn> | jq -r '.status.token'

EKSClusterAdminArn은 클라우드포메이션 EmrEksAppStack의 output에 기록되어 있음

minsupark:~/environment $ aws eks get-token --cluster-name Cluster9EE0221C-ffcf7c5e49c7479785f2b009ce1974c1 --region us-east-1 --role-arn arn:aws:iam::111111111111:role/EmrEksAppStack-emreksadminRole494E27B4-C4P9V6CFNV4N | jq -r '.status.token'
k8s-aws-v1.xxxxxxxxxxxxxxxxxxxxxxxxxxx...

In your Cloud9 workspace, click Tools / Preview / Preview Running Application

(그림 20)

그런 다음에 아래 그림과 같이 URL 뒤에 아래 경로를 추가해서 접속해준다.

/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/

그런 다음 아래 그림과 같이 로그인 화면에서 Token을 선택하고 위에서 발급받은 토큰을 붙여넣어 로그인해준다.

(그림 21)

아래 그림과 같이 쿠버네티스 UI 콘솔화면에서 노드 현황 등을 확인할 수 있다.

(그림 22)

그런 다음에 Namespace 드롭다운에서 emr-eks-workshop-namespace를 선택해준다.

(그림 23)

그런 다음에 클라우드9 콘솔로 들어와서 아래와 같이 spark job을 실행해본다.

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-pi-dashboard \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-6.2.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }'
minsupark:~/environment $ aws emr-containers start-job-run \
> --virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
> --name spark-pi-dashboard \
> --execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
> --release-label emr-6.2.0-latest \
> --job-driver '{
>     "sparkSubmitJobDriver": {
>         "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
>         "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
>         }
>     }'
{
    "id": "00000002vecs9v709q1",
    "name": "spark-pi-dashboard",
    "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/zse2a4iflxcu1mrstc0dk1srf/jobruns/00000002vecs9v709q1",
    "virtualClusterId": "zse2a4iflxcu1mrstc0dk1srf"
}

쿠버네티스 대시보드 화면으로 가면 아래와 같이 Job 메뉴와 Pod 메뉴에 실행 기록이 남는 것을 확인할 수 있다.

(그림 24)

Click the three dots on the running driver pod and choose logs. Select spark-kubernetes-driver from the drop down and view the driver logs. You should see the logs as shown below:

(그림 25)

STEP 11) SPARK ETL JOB 실행해보기

In the code below, Spark reads NY Taxi Trip data from Amazon S3. The script updates the timestamp column, prints the schema and row count and finally writes the data in parquet format to Amazon S3. The last section may take time depending on the EKS cluster size. Note that the input and output location is taken as a parameter. The script is already uploaded to the workshop S3 bucket and command to run the Spark ETL is shown in the section below the code snippet.

import sys
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

if __name__ == "__main__":

    print(len(sys.argv))
    if (len(sys.argv) != 3):
        print("Usage: spark-etl [input-folder] [output-folder]")
        sys.exit(0)

    spark = SparkSession\
        .builder\
        .appName("SparkETL")\
        .getOrCreate()

    nyTaxi = spark.read.option("inferSchema", "true").option("header", "true").csv(sys.argv[1])

    updatedNYTaxi = nyTaxi.withColumn("current_date", lit(datetime.now()))

    updatedNYTaxi.printSchema()

    print(updatedNYTaxi.show())

    print("Total number of records: " + str(updatedNYTaxi.count()))
    
    updatedNYTaxi.write.parquet(sys.argv[2])

Run the following command to execute the Spark ETL job on your EMR on EKS cluster. The S3 output location will be passed as a parameter to the script; this is passed via entryPointArguments.

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-etl \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-5.33.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/spark-etl.py",
        "entryPointArguments": ["s3://aws-data-analytics-workshops/shared_datasets/tripdata/",
          "'"$S3_BUCKET"'/taxi-data/"
        ],
        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }'\
    --configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"2G"
         }
      }
    ], 
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-containers/jobs", 
        "logStreamNamePrefix": "emr-eks-workshop"
      }, 
      "s3MonitoringConfiguration": {
        "logUri": "'"$S3_BUCKET"'/logs/"
      }
    }
}'
minsupark:~/environment $ aws emr-containers start-job-run \
> --virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
> --name spark-etl \
> --execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
> --release-label emr-5.33.0-latest \
> --job-driver '{
>     "sparkSubmitJobDriver": {
>         "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/spark-etl.py",
>         "entryPointArguments": ["s3://aws-data-analytics-workshops/shared_datasets/tripdata/",
>           "'"$S3_BUCKET"'/taxi-data/"
>         ],
>         "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
>         }
>     }'\
>     --configuration-overrides '{
>     "applicationConfiguration": [
>       {
>         "classification": "spark-defaults", 
>         "properties": {
>           "spark.driver.memory":"2G"
>          }
>       }
>     ], 
>     "monitoringConfiguration": {
>       "cloudWatchMonitoringConfiguration": {
>         "logGroupName": "/emr-containers/jobs", 
>         "logStreamNamePrefix": "emr-eks-workshop"
>       }, 
>       "s3MonitoringConfiguration": {
>         "logUri": "'"$S3_BUCKET"'/logs/"
>       }
>     }
> }'
{
    "id": "00000002vecstgqca7m",
    "name": "spark-etl",
    "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/zse2a4iflxcu1mrstc0dk1srf/jobruns/00000002vecstgqca7m",
    "virtualClusterId": "zse2a4iflxcu1mrstc0dk1srf"
}

minsupark:~/environment $ aws s3 cp s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/spark-etl.py .
download: s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/spark-etl.py to ./spark-etl.py

minsupark:~/environment $ cat spark-etl.py
import sys
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

if __name__ == "__main__":

    print(len(sys.argv))
    if (len(sys.argv) != 3):
        print("Usage: spark-etl [input-folder] [output-folder]")
        sys.exit(0)

    spark = SparkSession\
        .builder\
        .appName("SparkETL")\
        .getOrCreate()

    nyTaxi = spark.read.option("inferSchema", "true").option("header", "true").csv(sys.argv[1])

    updatedNYTaxi = nyTaxi.withColumn("current_date", lit(datetime.now()))

    updatedNYTaxi.printSchema()

    print(updatedNYTaxi.show())

    print("Total number of records: " + str(updatedNYTaxi.count()))
    
    updatedNYTaxi.write.parquet(sys.argv[2])
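잡이 완료된 뒤에는 entryPointArguments로 넘긴 출력 경로에 parquet 파일이 생성되었는지 아래처럼 확인해볼 수 있다.

aws s3 ls ${S3_BUCKET}/taxi-data/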

The screenshot below shows the output of the Spark ETL job.

(그림 26)

STEP 12) AWS GLUE METASTORE INTEGRATION 하기

In this section we will learn how to run a Spark ETL job with EMR on EKS and interact with AWS Glue MetaStore to create a table.

In the code below, Spark reads NY Taxi Trip data from Amazon S3. The script updates the timestamp column, prints the schema and row count, writes the data in parquet format to Amazon S3, and finally creates a database in AWS Glue and a table in that database to run queries against. The script may take time depending on the EKS cluster size. Note that the input location, output location and database name are taken as parameters. The script is already uploaded to the workshop S3 bucket and the command to run the Spark ETL is shown in the section below the code snippet.

import sys
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.functions import *

if __name__ == "__main__":

    print(len(sys.argv))
    if (len(sys.argv) != 4):
        print("Usage: spark-etl-glue [input-folder] [output-folder] [dbName]")
        sys.exit(0)

    spark = SparkSession\
        .builder\
        .appName("Python Spark SQL Glue integration example")\
        .enableHiveSupport()\
        .getOrCreate()

    nyTaxi = spark.read.option("inferSchema", "true").option("header", "true").csv(sys.argv[1])

    updatedNYTaxi = nyTaxi.withColumn("current_date", lit(datetime.now()))

    updatedNYTaxi.printSchema()

    print(updatedNYTaxi.show())

    print("Total number of records: " + str(updatedNYTaxi.count()))
    
    updatedNYTaxi.write.parquet(sys.argv[2])

    updatedNYTaxi.registerTempTable("ny_taxi_table")

    dbName = sys.argv[3]
    spark.sql("CREATE database if not exists " + dbName)
    spark.sql("USE " + dbName)
    spark.sql("CREATE table if not exists ny_taxi_parquet USING PARQUET LOCATION '" + sys.argv[2] + "' AS SELECT * from ny_taxi_table ")

Run the following command to execute the Spark ETL Glue job on your EMR on EKS cluster. The S3 output location and database name will be passed as parameters to the script; these are passed via entryPointArguments.

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-etl-s3-awsglue-integration \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-5.33.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/spark-etl-glue.py",
        "entryPointArguments": [
          "s3://aws-data-analytics-workshops/shared_datasets/tripdata/","'"$S3_BUCKET"'/taxi-data-glue/","tripdata"
        ],
        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.hadoop.hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
         }
      }
    ], 
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-containers/jobs", 
        "logStreamNamePrefix": "emr-eks-workshop"
      }, 
      "s3MonitoringConfiguration": {
        "logUri": "'"$S3_BUCKET"'/logs/"
      }
    }
}'
minsupark:~/environment $ aws emr-containers start-job-run \
> --virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
> --name spark-etl-s3-awsglue-integration \
> --execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
> --release-label emr-5.33.0-latest \
> --job-driver '{
>     "sparkSubmitJobDriver": {
>         "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/spark-etl-glue.py",
>         "entryPointArguments": [
>           "s3://aws-data-analytics-workshops/shared_datasets/tripdata/","'"$S3_BUCKET"'/taxi-data-glue/","tripdata"
>         ],
>         "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
>         }
>     }' \
> --configuration-overrides '{
>     "applicationConfiguration": [
>       {
>         "classification": "spark-defaults", 
>         "properties": {
>           "spark.hadoop.hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
>          }
>       }
>     ], 
>     "monitoringConfiguration": {
>       "cloudWatchMonitoringConfiguration": {
>         "logGroupName": "/emr-containers/jobs", 
>         "logStreamNamePrefix": "emr-eks-workshop"
>       }, 
>       "s3MonitoringConfiguration": {
>         "logUri": "'"$S3_BUCKET"'/logs/"
>       }
>     }
> }'
{
    "id": "00000002vectac6g8s1",
    "name": "spark-etl-s3-awsglue-integration",
    "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/zse2a4iflxcu1mrstc0dk1srf/jobruns/00000002vectac6g8s1",
    "virtualClusterId": "zse2a4iflxcu1mrstc0dk1srf"
}

The screenshot below shows the output of the Spark ETL Glue job. 그리고 AWS 콘솔에서 Amazon Athena로 이동해 데이터베이스와 테이블이 생성되었는지 쿼리를 실행해보면, 아래 그림과 같이 확인할 수 있다.

(그림 27)
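참고로 Athena 콘솔 대신 CLI로도 Glue 카탈로그에 테이블이 생성되었는지 확인할 수 있다. 아래는 위 잡에서 인자로 넘긴 tripdata 데이터베이스를 조회하는 예시이다.

aws glue get-tables --database-name tripdata \
  --query 'TableList[].{name:Name,location:StorageDescriptor.Location}'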

이제는 좀더 심화된 내용으로 실습을 진행해보자.

In this section we will learn about EKS Node Placement and submit jobs to run only on ONDEMAND instance and another job to run only on SPOT instances. We will also learn how to use Pod Templates file to define the driver or executor pod’s configurations

STEP 13) EKS 노드 배치(Node Placement)를 지정해서 spark job 실행해보기

AWS EKS clusters can span multiple AZs in a VPC. A Spark application whose driver and executor pods are distributed across multiple AZs can incur inter-AZ data transfer costs. To minimize or eliminate inter-AZ data transfer costs, you can configure the application to only run on the nodes within a single AZ. Also, depending on your use case you might prefer to run the Spark application on specific Instance Types.

step 13-1) submit a Spark job to run only on the nodes in a single AZ

먼저 아래 그림과 같이 Kubernetes Dashboard 로 가서 Node 메뉴로 이동한다.

(그림 28)

step 13-2) Choose one of the listed nodes and choose Show all under Labels as shown below. The label of interest for us is topology.kubernetes.io/zone: us-east-1b

(그림 29)

topology.kubernetes.io/zone: us-east-1b에서 us-east-1b string을 ctrl + c로 복사해둔다.

step 13-3) 클라우드9 콘솔에서 아래와 같이 명령어를 실행해본다. (Let’s submit the job to run only in a single AZ. Substitute the value for availability zone in the command below.)

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-pi-single-az \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-6.2.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
        "sparkSubmitParameters": "--conf spark.kubernetes.node.selector.topology.kubernetes.io/zone='<availability zone>' --conf spark.executor.instances=1 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"2G"
         }
      }
    ], 
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-containers/jobs", 
        "logStreamNamePrefix": "emr-eks-workshop"
      }, 
      "s3MonitoringConfiguration": {
        "logUri": "'"$S3_BUCKET"'/logs/"
      }
    }
}'

위의 topology.kubernetes.io/zone: us-east-1b에서 복사한 us-east-1b를 <availability zone> 자리에 넣어주면 된다.

minsupark:~/environment $ aws emr-containers start-job-run \
> --virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
> --name spark-pi-single-az \
> --execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
> --release-label emr-6.2.0-latest \
> --job-driver '{
>     "sparkSubmitJobDriver": {
>         "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
>         "sparkSubmitParameters": "--conf spark.kubernetes.node.selector.topology.kubernetes.io/zone='us-east-1b' --conf spark.executor.instances=1 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
>         }
>     }' \
> --configuration-overrides '{
>     "applicationConfiguration": [
>       {
>         "classification": "spark-defaults", 
>         "properties": {
>           "spark.driver.memory":"2G"
>          }
>       }
>     ], 
>     "monitoringConfiguration": {
>       "cloudWatchMonitoringConfiguration": {
>         "logGroupName": "/emr-containers/jobs", 
>         "logStreamNamePrefix": "emr-eks-workshop"
>       }, 
>       "s3MonitoringConfiguration": {
>         "logUri": "'"$S3_BUCKET"'/logs/"
>       }
>     }
> }'
{
    "id": "00000002veh0rsu2ine",
    "name": "spark-pi-single-az",
    "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/3ap6a9t9c11v5wki8o5ekqzx1/jobruns/00000002veh0rsu2ine",
    "virtualClusterId": "3ap6a9t9c11v5wki8o5ekqzx1"
}

When the job starts, the driver pod and executor pods are scheduled only on those EKS worker nodes with the label topology.kubernetes.io/zone: <availability zone>. This ensures the spark job is run within a single AZ.
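driver/executor 파드가 실제로 어느 노드에 떴는지, 그 노드가 어느 AZ인지는 아래처럼 확인해볼 수 있다.

# 파드가 스케줄된 노드 확인
kubectl get pods -n emr-eks-workshop-namespace -o wide
# 노드별 AZ 라벨 확인
kubectl get nodes -L topology.kubernetes.io/zone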

참고자료 :

https://spark.apache.org/docs/latest/running-on-kubernetes.html#pod-spec

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/

Configuration of interest -

--conf spark.kubernetes.node.selector.topology.kubernetes.io/zone='<availability zone>'

zone is a built-in label that EKS assigns to every EKS worker node. The above config ensures that the driver and executor pods are scheduled on those EKS worker nodes labeled topology.kubernetes.io/zone: <availability zone>. However, user defined labels can also be assigned to EKS worker nodes and used as node selectors.

Other common use cases are using node labels to force the job to run on on-demand/spot instances, on a specific machine type, etc.

user defined labels 참고자료 : https://eksctl.io/usage/eks-managed-nodes/#managing-labels

step 13-4) Single AZ and Instance Type Placement

In this lab we will submit a Spark job to run only on the nodes in a single AZ and on a specific instance type.

아까 라벨을 확인했던 대시보드로 이동해서 이번에는 topology.kubernetes.io/zone: us-east-1b와 node.kubernetes.io/instance-type: r5.xlarge 라벨이 있는지 확인해보자. 그리고 us-east-1b와 r5.xlarge를 메모장에 복사해둔다.
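참고로 대시보드 대신 kubectl로도 노드별 AZ/인스턴스 타입 라벨을 한 번에 볼 수 있다.

kubectl get nodes -L topology.kubernetes.io/zone,node.kubernetes.io/instance-type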

step 13-5) 클라우드9 콘솔에서 아래와 같은 명령어를 실행해본다. (Let’s submit the job to run only in a single AZ and on a specific instance type. Substitute the values for availability zone and instance type in the command below.)

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-pi-single-az-instance-type \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-6.2.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"2G",
          "spark.kubernetes.node.selector.topology.kubernetes.io/zone":"<availability zone>",
          "spark.kubernetes.node.selector.node.kubernetes.io/instance-type":"<instance type>"
         }
      }
    ], 
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-containers/jobs", 
        "logStreamNamePrefix": "emr-eks-workshop"
      }, 
      "s3MonitoringConfiguration": {
        "logUri": "'"$S3_BUCKET"'/logs/"
      }
    }
}'

마찬가지로 위에서 복사한 us-east-1b와 r5.xlarge로 <availability zone>과 <instance type>을 대체한 다음 아래와 같이 실행해본다.

minsupark:~/environment $ aws emr-containers start-job-run \
> --virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
> --name spark-pi-single-az-instance-type \
> --execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
> --release-label emr-6.2.0-latest \
> --job-driver '{
>     "sparkSubmitJobDriver": {
>         "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
>         "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
>         }
>     }' \
> --configuration-overrides '{
>     "applicationConfiguration": [
>       {
>         "classification": "spark-defaults", 
>         "properties": {
>           "spark.driver.memory":"2G",
>           "spark.kubernetes.node.selector.topology.kubernetes.io/zone":"us-east-1b",
>           "spark.kubernetes.node.selector.node.kubernetes.io/instance-type":"r5.xlarge"
>          }
>       }
>     ], 
>     "monitoringConfiguration": {
>       "cloudWatchMonitoringConfiguration": {
>         "logGroupName": "/emr-containers/jobs", 
>         "logStreamNamePrefix": "emr-eks-workshop"
>       }, 
>       "s3MonitoringConfiguration": {
>         "logUri": "'"$S3_BUCKET"'/logs/"
>       }
>     }
> }'
{
    "id": "00000002veh233amie3",
    "name": "spark-pi-single-az-instance-type",
    "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/3ap6a9t9c11v5wki8o5ekqzx1/jobruns/00000002veh233amie3",
    "virtualClusterId": "3ap6a9t9c11v5wki8o5ekqzx1"
}

When the job starts, the driver pod and executor pods are scheduled only on those EKS worker nodes with the labels topology.kubernetes.io/zone: <availability zone> and node.kubernetes.io/instance-type: <instance type>. This ensures the spark job is run within a single AZ and on the specific instance type needed for the job.

Configuration of interest -

spark.kubernetes.node.selector.topology.kubernetes.io/zone:<availability zone>
spark.kubernetes.node.selector.node.kubernetes.io/instance-type:<instance type>

zone and instance-type are built-in labels that EKS assigns to every EKS worker node. The above config ensures that the driver and executor pods are scheduled on those EKS worker nodes labeled topology.kubernetes.io/zone: <availability zone> and node.kubernetes.io/instance-type: <instance type>. However, user defined labels can also be assigned to EKS worker nodes and used as node selectors.

Multiple key value pairs for spark.kubernetes.node.selector.[labelKey] can be passed to add filter conditions for selecting the EKS worker node.

Other common use cases are using node labels to force the job to run on on-demand/spot instances, on a specific machine type, etc.

STEP 13-6) POD 템플릿으로 spark job 실행해보기

In this section we will learn how to use Pod Template feature of Spark on Kubernetes and submit jobs.

With Amazon EMR versions 5.33.0 and later, Amazon EMR on EKS supports Spark’s pod template feature. You can use pod template files to define the driver or executor pod’s configurations that Spark configurations do not support. You can specify the spark properties spark.kubernetes.driver.podTemplateFile and spark.kubernetes.executor.podTemplateFile to point to the pod template files in Amazon S3. Then Spark will load the pod template file and use it to construct driver and executor pods. For more information about the Spark’s pod template feature, see Pod Template (https://spark.apache.org/docs/3.0.0-preview/running-on-kubernetes.html#pod-template ).

You can enable this pod template feature by passing an Amazon S3 path pointing to your pod template. Note: Spark uses the job execution role to load the pod template, so the job execution role must have permission to access Amazon S3 in order to load the pod templates.

We will run a use case where we want the Driver Pod to be always created on an ON-DEMAND instance while executor pods to launch on SPOT instances.

Note: The template files are already available in S3 bucket and the examples below can be run directly.

Driver Template Specification (아래에서 driver_template.yaml)

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    eks.amazonaws.com/capacityType: ON_DEMAND

Executor Template Specification (아래에서 executor_template.yaml)

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT
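템플릿이 nodeSelector로 쓰는 eks.amazonaws.com/capacityType 라벨이 각 노드에 어떻게 붙어 있는지(ON_DEMAND/SPOT)는 아래처럼 미리 확인해볼 수 있다.

kubectl get nodes -L eks.amazonaws.com/capacityType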

Example 1: Specify in SparkSubmitParameters

You can specify the Amazon S3 path to the pod template when using the SparkSubmitParameters as the following example demonstrates:

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-pi-pod-template \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-5.33.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
        "sparkSubmitParameters": "--conf spark.kubernetes.driver.podTemplateFile=s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/driver_template.yaml --conf spark.kubernetes.executor.podTemplateFile=s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/executor_template.yaml --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }'
minsupark:~/environment $ aws emr-containers start-job-run \
> --virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
> --name spark-pi-pod-template \
> --execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
> --release-label emr-5.33.0-latest \
> --job-driver '{
>     "sparkSubmitJobDriver": {
>         "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
>         "sparkSubmitParameters": "--conf spark.kubernetes.driver.podTemplateFile=s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/driver_template.yaml --conf spark.kubernetes.executor.podTemplateFile=s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/executor_template.yaml --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
>         }
>     }'
{
    "id": "00000002veh2ts7svov",
    "name": "spark-pi-pod-template",
    "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/3ap6a9t9c11v5wki8o5ekqzx1/jobruns/00000002veh2ts7svov",
    "virtualClusterId": "3ap6a9t9c11v5wki8o5ekqzx1"
}

Example 2: Specify in ApplicationConfiguration

You can also specify the Amazon S3 path to the pod template when using the configurationOverrides as the following example demonstrates:

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-pi-pod-template \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-5.33.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"2G",
          "spark.kubernetes.driver.podTemplateFile":"s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/driver_template.yaml",
          "spark.kubernetes.executor.podTemplateFile":"s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/executor_template.yaml"
         }
      }
    ]   
}'
minsupark:~/environment $ aws emr-containers start-job-run \
> --virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
> --name spark-pi-pod-template \
> --execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
> --release-label emr-5.33.0-latest \
> --job-driver '{
>     "sparkSubmitJobDriver": {
>         "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
>         "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
>         }
>     }' \
> --configuration-overrides '{
>     "applicationConfiguration": [
>       {
>         "classification": "spark-defaults", 
>         "properties": {
>           "spark.driver.memory":"2G",
>           "spark.kubernetes.driver.podTemplateFile":"s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/driver_template.yaml",
>           "spark.kubernetes.executor.podTemplateFile":"s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/executor_template.yaml"
>          }
>       }
>     ]   
> }'
{
    "id": "00000002veh351i488f",
    "name": "spark-pi-pod-template",
    "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/3ap6a9t9c11v5wki8o5ekqzx1/jobruns/00000002veh351i488f",
    "virtualClusterId": "3ap6a9t9c11v5wki8o5ekqzx1"
}

You can go to CloudWatch logs and check the scheduler logs to verify that the driver ran on ON_DEMAND instance and executors were launched on SPOT instances. Use the Kubernetes Dashboard -> Nodes to verify the Node IP addresses.

클라우드워치 로그 경로는 예를 들어 아래 그림과 같이 CloudWatch --> Log groups --> 클러스터 아이디 --> 스케줄러로 접속해서 확인하면 된다.

(그림 30)

아래와 같이 spark driver는 기존에 떠 있던 온디맨드 EC2에서 실행되고 executor는 spot 인스턴스에서 실행되는 것을 확인할 수 있다.

I0507 21:35:45.286775       1 scheduler.go:742] pod emr-eks-workshop-namespace/spark-00000002ua8o76tor07-driver is bound successfully on node "ip-10-0-182-89.ec2.internal", 4 nodes evaluated, 2 nodes were found feasible.
I0507 21:35:52.044834       1 scheduler.go:742] pod emr-eks-workshop-namespace/pythonpi-1620423351548-exec-1 is bound successfully on node "ip-10-0-140-246.ec2.internal", 4 nodes evaluated, 2 nodes were found feasible.
I0507 21:35:52.110272       1 scheduler.go:742] pod emr-eks-workshop-namespace/pythonpi-1620423352037-exec-2 is bound successfully on node "ip-10-0-197-221.ec2.internal", 4 nodes evaluated, 1 nodes were found feasible.

(그림 31)

STEP 14) Serverless Spark with AWS Fargate

In this section we will learn how to use AWS Fargate and submit Spark job

AWS Fargate is a technology that provides on-demand, right-sized compute capacity for containers. With AWS Fargate, you no longer have to provision, configure, or scale groups of virtual machines to run containers. This removes the need to choose server types, decide when to scale your node groups, or optimize cluster packing. You can control which pods start on Fargate and how they run with Fargate profiles, which are defined as part of your Amazon EKS cluster.

Fargate profile is already created as part of your EKS cluster. You can check the Fargate profile as shown below:

(1) From AWS Console choose Elastic Kubernetes Service.

(2) Choose Clusters and choose the cluster created by the cloudformation stack as part of the workshop.

(3) Choose Configuration

(4) Choose Compute

(그림 32)

Run the below command to create a new EMR virtual cluster to be used with fargate profile. Replace the value of EKSCluster from cloudformation outputs.

aws emr-containers create-virtual-cluster \
--name emr_eks_fargate_cluster \
--container-provider '{
    "id":   "<<EKSCluster>>",
    "type": "EKS",
    "info": {
        "eksInfo": {
            "namespace": "eks-fargate"
        }
    }
}'

위에서 EKSCluster는 cloudformation output에서 확인이 가능하다.

minsupark:~/environment $ aws emr-containers create-virtual-cluster \
> --name emr_eks_fargate_cluster \
> --container-provider '{
>     "id":   "Cluster9EE0221C-56f72a1ffd6841c5a237df7291747ca8",
>     "type": "EKS",
>     "info": {
>         "eksInfo": {
>             "namespace": "eks-fargate"
>         }
>     }
> }'
{
    "id": "x431imiq7ygl1qiwo9rw2fmcl",
    "name": "emr_eks_fargate_cluster",
    "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/x431imiq7ygl1qiwo9rw2fmcl"
}

Note the virtual cluster id. We will use this to submit jobs to Fargate.

Run the command below to run a Spark job that computes the value of Pi. Replace <<EMR_FARGATE_VIRTUAL_CLUSTER_ID>> with the virtual cluster id from the previous step.

aws emr-containers start-job-run \
--virtual-cluster-id <<EMR_FARGATE_VIRTUAL_CLUSTER_ID>> \
--name spark-pi-logging \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-6.2.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"2G"
         }
      }
    ], 
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-containers/jobs", 
        "logStreamNamePrefix": "emr-eks-workshop"
      }, 
      "s3MonitoringConfiguration": {
        "logUri": "'"$S3_BUCKET"'/logs/"
      }
    }
}'

As shown below, simply paste in the id that was returned when the Fargate virtual cluster was created above.

{
    "id": "x431imiq7ygl1qiwo9rw2fmcl",
    "name": "emr_eks_fargate_cluster",
    "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/x431imiq7ygl1qiwo9rw2fmcl"
}
minsupark:~/environment $ aws emr-containers start-job-run \
> --virtual-cluster-id x431imiq7ygl1qiwo9rw2fmcl \
> --name spark-pi-logging \
> --execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
> --release-label emr-6.2.0-latest \
> --job-driver '{
>     "sparkSubmitJobDriver": {
>         "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
>         "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
>         }
>     }' \
> --configuration-overrides '{
>     "applicationConfiguration": [
>       {
>         "classification": "spark-defaults", 
>         "properties": {
>           "spark.driver.memory":"2G"
>          }
>       }
>     ], 
>     "monitoringConfiguration": {
>       "cloudWatchMonitoringConfiguration": {
>         "logGroupName": "/emr-containers/jobs", 
>         "logStreamNamePrefix": "emr-eks-workshop"
>       }, 
>       "s3MonitoringConfiguration": {
>         "logUri": "'"$S3_BUCKET"'/logs/"
>       }
>     }
> }'
{
    "id": "00000002veh7gu2aak1",
    "name": "spark-pi-logging",
    "arn": "arn:aws:emr-containers:us-east-1:161461013751:/virtualclusters/x431imiq7ygl1qiwo9rw2fmcl/jobruns/00000002veh7gu2aak1",
    "virtualClusterId": "x431imiq7ygl1qiwo9rw2fmcl"
}
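
Besides kubectl, you can also poll the job state through the EMR on EKS API itself. The sketch below uses the virtual cluster id and job run id returned above.

# Check the state of the job run via the EMR on EKS API
aws emr-containers describe-job-run \
  --virtual-cluster-id x431imiq7ygl1qiwo9rw2fmcl \
  --id 00000002veh7gu2aak1 \
  --query 'jobRun.state' \
  --output text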

# You can also run the following command to check the job and pods status
# You should see output similar to this once the job starts running
minsupark:~/environment $ kubectl get all -n eks-fargate
NAME                                   READY   STATUS              RESTARTS   AGE
pod/00000002veh7gu2aak1-zlsfk          3/3     Running             0          6m42s
pod/pythonpi-09331a7dd141afc7-exec-1   0/2     ContainerCreating   0          42s
pod/pythonpi-09331a7dd141afc7-exec-2   0/2     Pending             0          42s
pod/spark-00000002veh7gu2aak1-driver   2/2     Running             0          4m15s

NAME                                                            TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
service/spark-00000002veh7gu2aak1-23d07d7dd13e5fdd-driver-svc   ClusterIP   None         <none>        7078/TCP,7079/TCP,4040/TCP   4m12s

NAME                            COMPLETIONS   DURATION   AGE
job.batch/00000002veh7gu2aak1   0/1           6m42s      6m42s
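
To confirm that the driver and executor pods really landed on Fargate capacity rather than on the managed node group, you can look at the nodes they were scheduled onto. The sketch below relies on the eks.amazonaws.com/compute-type=fargate label that EKS applies to Fargate nodes.

# Show which node each pod is running on (Fargate nodes are named fargate-ip-...)
kubectl get pods -n eks-fargate -o wide

# List only the Fargate-backed nodes
kubectl get nodes -l eks.amazonaws.com/compute-type=fargate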

Check the status of the job on the Kubernetes Dashboard.

33

You can also see the Spark job running in the EMR console.

34

You can go to the S3 bucket you specified to check the logs. Your log data is sent to the following Amazon S3 locations.

Controller Logs - /logUri/virtual-cluster-id/jobs/job-id/containers/pod-name/(stderr.gz/stdout.gz)

Driver Logs - /logUri/virtual-cluster-id/jobs/job-id/containers/spark-application-id/spark-job-id-driver/(stderr.gz/stdout.gz)

Executor Logs - /logUri/virtual-cluster-id/jobs/job-id/containers/spark-application-id/executor-pod-name/(stderr.gz/stdout.gz)

Explore the contents of the driver logs and run an S3 Select query on stdout.gz. The screenshots below show the output of the PySpark Pi job and the computed value of Pi. The path should be in the format: s3://xxxx/yyyy/containers/spark-xxxx/spark-xxx-driver/stdout.gz

35

If you open stdout.gz, or query it with SQL (S3 Select) as shown below, you can see the resulting value of approximately 3.14.

36
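
If you prefer the command line to S3 Select, a sketch like the one below finds the driver's stdout.gz under the log prefix and prints it. It assumes $S3_BUCKET is still set to the same s3:// URI that was used in the job's monitoringConfiguration.

# Locate the driver stdout log under the configured log prefix
aws s3 ls "$S3_BUCKET/logs/" --recursive | grep driver/stdout.gz

# Stream and decompress the object (paste the full key reported by the listing;
# the exact path depends on your virtual cluster, job and application ids)
aws s3 cp "$S3_BUCKET/<key-from-listing>" - | gunzip -c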

[Deleting the lab resources after finishing the workshop]

STEP 1) Run the following commands from the Cloud9 console.

# List the virtual clusters and delete every cluster that is still running.
minsupark:~/environment $ aws emr-containers list-virtual-clusters
{
    "virtualClusters": [
        {
            "id": "zse2a4iflxcu1mrstc0dk1srf",
            "name": "emr_eks_cluster",
            "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/zse2a4iflxcu1mrstc0dk1srf",
            "state": "RUNNING",
            "containerProvider": {
                "type": "EKS",
                "id": "Cluster9EE0221C-ffcf7c5e49c7479785f2b009ce1974c1",
                "info": {
                    "eksInfo": {
                        "namespace": "emr-eks-workshop-namespace"
                    }
                }
            },
            "createdAt": "2021-12-18T08:02:08+00:00",
            "tags": {}
        }
    ]
}

# aws emr-containers delete-virtual-cluster --id <virtual-cluster-id>
minsupark:~/environment $ aws emr-containers delete-virtual-cluster --id zse2a4iflxcu1mrstc0dk1srf
{
    "id": "zse2a4iflxcu1mrstc0dk1srf"
}
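
If several virtual clusters are still around (for example the Fargate one created in STEP 14), a small loop such as the sketch below deletes every cluster that is still in the RUNNING state.

# Delete every virtual cluster that is still RUNNING
for vc_id in $(aws emr-containers list-virtual-clusters \
    --states RUNNING \
    --query 'virtualClusters[].id' \
    --output text); do
  aws emr-containers delete-virtual-cluster --id "$vc_id"
done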

minsupark:~/environment $ helm uninstall aws-load-balancer-controller -n kube-system

minsupark:~/environment $ kubectl delete -k github.com/aws/eks-charts/stable/aws-load-balancer-controller//crds?ref=master

# eksctl delete iamserviceaccount --cluster <<EKSClusterName>> --name aws-load-balancer-controller --namespace kube-system --wait
minsupark:~/environment $ eksctl delete iamserviceaccount --cluster Cluster9EE0221C-ffcf7c5e49c7479785f2b009ce1974c1 --name aws-load-balancer-controller --namespace kube-system --wait
2021-12-18 09:53:14 [ℹ]  eksctl version 0.77.0
2021-12-18 09:53:14 [ℹ]  using region us-east-1
2021-12-18 09:53:14 [ℹ]  1 iamserviceaccount (kube-system/aws-load-balancer-controller) was included (based on the include/exclude rules)
2021-12-18 09:53:14 [ℹ]  1 task: { delete serviceaccount "kube-system/aws-load-balancer-controller" }
2021-12-18 09:53:14 [ℹ]  serviceaccount "kube-system/aws-load-balancer-controller" was already deleted

STEP 2) Open IAM and delete the EMREKSWorkshop-AWSLoadBalancerControllerIAMPolicy policy.
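
If you prefer the CLI, the policy can also be removed with the sketch below; it looks up your account id first so the policy ARN can be built, and assumes the policy is no longer attached to any IAM role after the previous step.

# A sketch: delete the load balancer controller IAM policy from the CLI
# (a managed policy can only be deleted once it is detached from all roles)
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws iam delete-policy \
  --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/EMREKSWorkshop-AWSLoadBalancerControllerIAMPolicy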

STEP 3) Open the CloudFormation console, select the emr-on-eks-workshop stack, and click the Delete button to terminate the stack.

STEP 4) Open the CloudFormation console, select the EmrEksAppStack stack, and click the Delete button to terminate the stack.

STEP 5) Open the CloudFormation console and delete the eksctl-Clusterxxx-…-addon-iamserviceaccount-kube-system-aws-load-balancer-controller stack.

STEP 6) Delete the Cloud9 environment emr-eks-cloud9 that was created as part of the workshop labs.

STEP 7) Delete the Glue table and database that were created as part of the workshop labs.
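
These can also be removed from the CLI; the database and table names below are placeholders, so substitute the names you actually created earlier in the labs.

# Placeholders: replace <glue-database> and <glue-table> with the names used in the labs
aws glue delete-table --database-name <glue-database> --name <glue-table>
aws glue delete-database --name <glue-database>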