EMR Studio with EMR on EKS 실습 워크샵

2021-12-25

.

Data_Engineering_TIL(20211225)

[실습자료]

  • AWS “EMR on EKS Workshop” 자료를 공부하고 정리한 내용입니다.

** URL : https://emr-on-eks.workshop.aws

  • “EMR on EKS 실습 워크샵” 스터디 노트 내용에 이어서 진행하는 실습자료 입니다.

** URL : https://minman2115.github.io/DE_TIL323

[학습내용]

STEP 1) 실습환경 셋팅

먼저 “EMR on EKS 실습 워크샵” 스터디 노트(https://minman2115.github.io/DE_TIL323) 내용에서 “STEP 1) 버지니아 리전으로가서 ec2 키 접속을 위한 SSH pem 키를 생성해준다.” 부터 “STEP 6) CREATE EMR CLUSTER ON EKS” 까지 진행해준다.

STEP 2) 아래와 같이 실습 오버뷰 내용을 통해서 어떻게 실습할지에 대한 내용을 확인하기

Amazon EMR Studio works with Amazon EMR on EKS so that Studio users can submit notebook code to Amazon Elastic Kubernetes Service (EKS). To provide Studio users access to EKS clusters, you must first create a virtual cluster using Amazon EMR on EKS. After you create a virtual cluster, you must create one or more managed endpoints to which Studio users can connect a Workspace.

Follow these labs to learn how to set up Amazon EMR on EKS for EMR Studio:

(1) Configuring AWS SSO

(2) Amazon EMR Studio Setup

(3) Using EMR Studio

(4) Configure Amazon EKS Cluster for Amazon EMR

(5) Create a Managed Endpoint

(6) Connecting from Studio

STEP 3) Enable AWS Single Sign-On

Currently, EMR Studio uses AWS Single Sign-On (SSO) to provide access with a unique sign-in URL. You can ignore this step if you already enabled SSO and have AWS SSO users/groups.

아래 그림과 같이 sso 콘솔에 접속한다. 그런 다음에 If you haven’t enabled SSO yet, you will see the following blue screen along with Enable AWS SSO button. Click on it to enable AWS SSO.

1

AWS SSO requires the AWS Organizations service . Click on Create AWS organization to create an organization.

2

이메일 인증후 SSO 콘솔로 이동하면 아래와 같이 URL을 획득할 수 있다.

Now, go back to AWS SSO service and enable AWS SSO. This process may take up to 30 seconds. If SSO is enabled, a successful message will be displayed. At the bottom, it will also display User portal URL for end users to login to AWS using AWS SSO. For this workshop, we will not use this URL as EMR Studio creates a unique sign in URL for you.

3

STEP 4) Create AWS SSO User

Click on the Users button from the left navigation panel and then click Add User to add a test user for EMR Studio. This is the user who will log in to EMR Studio using a unique URL and later work on the notebooks.

아래와 같은 정보로 유저를 생성한다.

(1) Username: emr-engineer

(2) Password: Generate a one-time password…

(3) Email address: Enter your email address

(4) First name: EMR

(5) Last name: Engineer

그런 다음에 Leave the rest of the fields as default and click on the Next: Groups button.

아래와 같이 유저를 생성하면 되고, 유저 정보가 생성되면 정보들을 메모장에 복사해둔다.

4

STEP 5) Amazon EMR Studio Setup

This chapter covers the foundational steps that you need to follow to create an Amazon EMR Studio, and attaching or detaching an existing EMR on EKS cluster to EMR Studio.

#######################################################################################

실습전 주의사항

#######################################################################################

(1) There must be at least one private subnet in common between your EMR Studio and the Amazon EKS cluster that you use to register your virtual cluster. Otherwise, your managed endpoint will not appear as an option in your Studio Workspaces. You can create an Amazon EKS cluster and associate it with the subnets that belong to your Studio. Alternatively, you can create a Studio and specify your EKS cluster’s private subnets.

(2) EMR Studio does not currently support Amazon EMR on EKS when you use an AWS Fargate-only Amazon EKS cluster

#######################################################################################

아래 그림과 같이 EMR 스튜디오를 생성해준다.

5

At this point, you have created an EMR Studio, but haven’t assigned the SSO user to this studio yet. Click on the emr-eks-workshop-studio to open up the details. Click on Add users, you will see the AWS SSO user that you created in the previous step. Select that user and click on Add.

그런 다음에 아래 그림과 같이 유저를 추가해준다.

6

By default, the assigned user gets AdministratorAccess session policy. You can fine tune it further by assigning a dedicated session policy. Select the user and click on Assign policy. The CloudFormation template already created session policy for this workshop. For this lab, select the EmrEksAppStack-EMRStudioUserIAMPolicyxxx from the drop-down and click on Apply policy.

7

At this point, Setting up the EMR Studio using console is completed.

To wrap-up, you created an Amazon EMR Studio using console, later you assign an AWS SSO user to that studio and provided a specific session policy. In the next lab, you will learn how to Use EMR Studio.

STEP 6) Using EMR Studio

Now that you have set up EMR Studio and assigned an AWS SSO user emr-engineer, use that user to log in and explore EMR Studio:

step 6-1) Log in to Amazon EMR Studio

Use the Url obtained from the previous step and open it on the browser. It will prompt you to enter the username. To avoid a browser cache issue, you can open up this Url in a different browser or incognito mode.

8

step 6-2) Create EMR Studio Workspace

아래 그림과 같이 워크스페이스를 생성한다.

9

step 6-3) 워크스페이스 접속

10

STEP 7) Configure Amazon EKS Cluster for Amazon EMR

In order to use an Amazon EKS cluster for Amazon EMR, it needs to configure first. This includes creating job execution role, setting up a load balancer controller, certificate manager and so on. Follow these steps to continue with this process:

To enable HTTPS connection for the load balancer, you need to upload a certificate to AWS Certificate Manager. You can use an existing certificate and import it from your local machine or you can follow these steps to create a certificate and load it to AWS Certificate Manager.

그런 다음에 로컬 PC에서 아래와 같이 명령어를 실행한다.

minma@minman MINGW64 ~/Desktop/aws key
$ aws configure
AWS Access Key ID [None]: xxxxxxxxxxxxxxxxxxxxxxxxx
AWS Secret Access Key [None]: yyyyyyyyyyyyyyyyyyyyyyyyy
Default region name [None]: us-east-1
Default output format [None]: json

minma@minman MINGW64 ~/Desktop/aws key
$ pwd
/c/Users/minma/Desktop/aws key

minma@minman MINGW64 ~/Desktop/aws key
$ openssl req -x509 -newkey rsa:1024 -keyout privateKey.pem -out certificateChain.pem -days 365 -nodes -subj '/C=US/ST=Washington/L=Seattle/O=MyOrg/OU=MyDept/CN=*.us-east-1.compute.internal'
Generating a RSA private key
..............+++++
......................................................+++++
writing new private key to 'privateKey.pem'
-----

minma@minman MINGW64 ~/Desktop/aws key
$ aws acm import-certificate --certificate fileb://trustedCertificates.pem --certificate-chain fileb://certificateChain.pem --private-key fileb://privateKey.pem
{
    "CertificateArn": "arn:aws:acm:us-east-1:111111111111:certificate/2fcc1f61-6d16-4f9e-89b9-9f2e75e3081b"
}

그런 다음에 클라우드9에서 아래와 같이 명령어를 실행한다.

minsupark:~/environment $ curl https://aws-data-analytics-workshops.s3.amazonaws.com/emr-eks-workshop/scripts/configureLoadBalancer.sh -o configureLoadBalancer.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2367  100  2367    0     0  30036      0 --:--:-- --:--:-- --:--:-- 30346

minsupark:~/environment $ cat configureLoadBalancer.sh
clusterName=${1}
region=${2}
vpcId=${3}

echo "Download IAM policies AWS Load Balancer Controller..."

curl -o iam_policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.2.0/docs/install/iam_policy.json

echo "Creating AWS Load Balancer Controller IAM Policy..."

loadBalancerPolicyARN=`(aws iam create-policy --region $region \
--policy-name EMREKSWorkshop-AWSLoadBalancerControllerIAMPolicy \
--policy-document file://iam_policy.json | jq -r '.Policy.Arn')`

echo "Load Balancer Controller IAM Policy ARN: "$loadBalancerPolicyARN

echo "Creating Kubernetes service account IAM role..."

eksctl create iamserviceaccount \
--cluster=$clusterName \
--namespace=kube-system \
--name=aws-load-balancer-controller \
--attach-policy-arn=$loadBalancerPolicyARN \
--override-existing-serviceaccounts \
--region $region \
--approve

cluster=`(echo $clusterName | cut -d- -f1)`
roleArn=`(aws iam list-roles | grep "eksctl-${cluster}" | grep "Arn" | sed 's/ //g' | cut -d'"' -f4)`
#roleArn=`(aws iam list-roles | jq -r '.Roles[] | select(.RoleName|test("eksctl-$clusterName")).Arn')`

echo "Service Account Role is created, ARN: ":$roleArn

cat <<EOF > aws-load-balancer-controller.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
    labels:
        app.kubernetes.io/component: controller
        app.kubernetes.io/name: aws-load-balancer-controller
    name: aws-load-balancer-controller
    namespace: kube-system
    annotations:
        eks.amazonaws.com/role-arn: $roleArn
EOF

echo "Creating Service Account..."

kubectl apply -f aws-load-balancer-controller.yaml

echo "Installing the TargetGroupBinding custom resource definitions..."

kubectl apply -k "github.com/aws/eks-charts/stable/aws-load-balancer-controller//crds?ref=master"

echo "Adding the eks-charts repository..."

helm repo add eks https://aws.github.io/eks-charts

echo "Installing the AWS Load Balancer Controller..."

helm upgrade -i aws-load-balancer-controller eks/aws-load-balancer-controller \
--set clusterName=$clusterName \
--set region=$region \
--set vpcId=$vpcId \
--set serviceAccount.create=false \
--set serviceAccount.name=aws-load-balancer-controller \
-n kube-system

echo "Verifying the controller status..."
kubectl get deployment -n kube-system aws-load-balancer-controller

echo "Load balancer controller has been installed successfully."

Run it by providing three parameters:

(1) EKS cluster name (you can obtain this from the CloudFormation output tab)

(2) your AWS region (us-east-1)

(3) VPC ID (you can obtain this from the CloudFormation output tab)

This script will take few minutes to run completely and will launch few AWS CloudFormation stacks.

minsupark:~/environment $ sh configureLoadBalancer.sh Cluster9EE0221C-2940135deafb4878bdb490dbcde6e613 us-east-1 vpc-06ecc7a825992bc90
Download IAM policies AWS Load Balancer Controller...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7273  100  7273    0     0  79123      0 --:--:-- --:--:-- --:--:-- 79923
Creating AWS Load Balancer Controller IAM Policy...
Load Balancer Controller IAM Policy ARN: arn:aws:iam::111111111111:policy/EMREKSWorkshop-AWSLoadBalancerControllerIAMPolicy
Creating Kubernetes service account IAM role...
2021-12-25 11:50:38 [ℹ]  eksctl version 0.77.0
2021-12-25 11:50:38 [ℹ]  using region us-east-1
2021-12-25 11:50:38 [ℹ]  1 iamserviceaccount (kube-system/aws-load-balancer-controller) was included (based on the include/exclude rules)
2021-12-25 11:50:38 [!]  metadata of serviceaccounts that exist in Kubernetes will be updated, as --override-existing-serviceaccounts was set
2021-12-25 11:50:38 [ℹ]  1 task: { 
    2 sequential sub-tasks: { 
        create IAM role for serviceaccount "kube-system/aws-load-balancer-controller",
        create serviceaccount "kube-system/aws-load-balancer-controller",
    } }2021-12-25 11:50:38 [ℹ]  building iamserviceaccount stack "eksctl-Cluster9EE0221C-2940135deafb4878bdb490dbcde6e613-addon-iamserviceaccount-kube-system-aws-load-balancer-controller"
2021-12-25 11:50:38 [ℹ]  deploying stack "eksctl-Cluster9EE0221C-2940135deafb4878bdb490dbcde6e613-addon-iamserviceaccount-kube-system-aws-load-balancer-controller"
2021-12-25 11:50:38 [ℹ]  waiting for CloudFormation stack "eksctl-Cluster9EE0221C-2940135deafb4878bdb490dbcde6e613-addon-iamserviceaccount-kube-system-aws-load-balancer-controller"
2021-12-25 11:50:54 [ℹ]  waiting for CloudFormation stack "eksctl-Cluster9EE0221C-2940135deafb4878bdb490dbcde6e613-addon-iamserviceaccount-kube-system-aws-load-balancer-controller"
2021-12-25 11:51:11 [ℹ]  waiting for CloudFormation stack "eksctl-Cluster9EE0221C-2940135deafb4878bdb490dbcde6e613-addon-iamserviceaccount-kube-system-aws-load-balancer-controller"
2021-12-25 11:51:11 [ℹ]  created serviceaccount "kube-system/aws-load-balancer-controller"
Service Account Role is created, ARN: :arn:aws:iam::161461013751:role/eksctl-Cluster9EE0221C-2940135deafb4878bdb49-Role1-KAPGD7ELVF6X
Creating Service Account...
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
serviceaccount/aws-load-balancer-controller configured
Installing the TargetGroupBinding custom resource definitions...
customresourcedefinition.apiextensions.k8s.io/ingressclassparams.elbv2.k8s.aws created
customresourcedefinition.apiextensions.k8s.io/targetgroupbindings.elbv2.k8s.aws created
Adding the eks-charts repository...
"eks" has been added to your repositories
Installing the AWS Load Balancer Controller...
Release "aws-load-balancer-controller" does not exist. Installing it now.
NAME: aws-load-balancer-controller
LAST DEPLOYED: Sat Dec 25 11:51:21 2021
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
AWS Load Balancer controller installed!
Verifying the controller status...
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
aws-load-balancer-controller   0/2     2            0           1s
Load balancer controller has been installed successfully.

# The loadbalancer might take few minutes to come up. Check the status to make sure it's in Ready state.
minsupark:~/environment $ kubectl get deployment -n kube-system aws-load-balancer-controller
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
aws-load-balancer-controller   2/2     2            2           4m11s

STEP 8) Create a Managed Endpoint

클라우드9에서 아래와 같이 명령어를 실행한다.

Execute the following command to create a managed endpoint. It takes few parameters

--virtual-cluster-id: Virtual cluster id of the EMR cluster from the previous labs ( Amazon EMR on EKS Workshop -> Create EMR Cluster on EKS )

--name: Use "emr-eks-endpoint"

--execution-role-arn: You can obtain this ARN value from the CloudFormation Output tab, look for resource key: EMRJobExecutionRoleArn

--release-label: Use either "emr-6.2.0-latest" or "5.32.0"

--certificate-arn: Certificate ARN that you obtain after importing your certificate. Refer Fargate lab setup chapter.

# 명령어 템플릿
aws emr-containers create-managed-endpoint \
--type JUPYTER_ENTERPRISE_GATEWAY \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name emr-eks-endpoint \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-6.2.0-latest \
--certificate-arn <certificate-arn>
minsupark:~/environment $ aws emr-containers create-managed-endpoint \
> --type JUPYTER_ENTERPRISE_GATEWAY \
> --virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
> --name emr-eks-endpoint \
> --execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
> --release-label emr-6.2.0-latest \
> --certificate-arn arn:aws:acm:us-east-1:111111111111:certificate/2fcc1f61-6d16-4f9e-89b9-9f2e75e3081b
{
    "id": "0q4dvqzxmbcb0",
    "name": "emr-eks-endpoint",
    "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/1j9nl7iuhe3kzqbaftijpi2q0/endpoints/0q4dvqzxmbcb0",
    "virtualClusterId": "1j9nl7iuhe3kzqbaftijpi2q0"
}

# The command will return the id of the endpoint along with few other information. Keep the managed endpoint id to track the endpoint creation process. Endpoint creation process takes few minutes, you can check the status of the endpoint using this:
# aws emr-containers describe-managed-endpoint --id <endpoint-id> --virtual-cluster-id ${EMR_EKS_CLUSTER_ID}
# <endpoint-id>는 바로 위에서 실행한 명령어 결과인 id값임
minsupark:~/environment $ aws emr-containers describe-managed-endpoint --id 0q4dvqzxmbcb0 --virtual-cluster-id ${EMR_EKS_CLUSTER_ID}
{
    "endpoint": {
        "id": "0q4dvqzxmbcb0",
        "name": "emr-eks-endpoint",
        "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/1j9nl7iuhe3kzqbaftijpi2q0/endpoints/0q4dvqzxmbcb0",
        "virtualClusterId": "1j9nl7iuhe3kzqbaftijpi2q0",
        "type": "JUPYTER_ENTERPRISE_GATEWAY",
        "state": "CREATING",
        "releaseLabel": "emr-6.2.0-latest",
        "executionRoleArn": "arn:aws:iam::111111111111:role/EMR_EKS_Job_Execution_Role",
        "certificateArn": "arn:aws:acm:us-east-1:111111111111:certificate/2fcc1f61-6d16-4f9e-89b9-9f2e75e3081b",
        "createdAt": "2021-12-25T12:04:17+00:00",
        "tags": {}
    }
}

STEP 9) Connecting from Studio

In the past few sections, you learned how to create an Amazon EKS cluster, how to configure it for Amazon EMR and later learned how to register your EKS cluster and create a managed endpoint. Now that you have met all the requirements to run EMR Studio Notebook code on EMR on EKS, follow these steps to connect EMR Studio with EMR on EKS

Open up EMR Studio and create a new workspace without any EMR cluster. In the following screenshot, emr-workspace-1 is created to connect to EMR on EKS.

Launch the new workspace and click on Cluster on the left navigation panel. Select EMR Cluster on EKS under Cluster type option. When you select EMR Cluster on EKS, you will notice the virtual cluster that you created in the previous step. Select emr-eks-cluster and check the drop-down for Endpoint. You will see the endpoint emr-eks-endpoint is showed up here . Select that endpoint and click on Attach button.

Cluster attachment process will begin. Once the cluster attachment is completed, the notebook may refresh automatically and you can confirm the attachment by clicking on the Cluster panel again. On the right side within the Launcher, you will notice that two kernels PySpark and Python 3 are now running on Kubernetes.

아래 그림과 같이 EMR 스튜디오에서 EKS 클러스터와 연결후에 첨부파일 SparkConverter.ipynb를 업로드하고 이거를 실행해본다.

11

노트북 내용 : Spark reads Amazon Reviews data for toys from Amazon S3. Then it prints out the total count, schema, and few sample data. The last sections repartition the data and write them back to Amazon S3. The last section may take time depending on the EKS cluster size. You can uncomment that line, replace the OUTPUT_LOCATION, and execute the entire notebook.

[실습종료후 리소스 삭제하기]

클라우드9에서 아래와 같은 명령어로 EKS 클러스터를 삭제한후에 S3로 가서 버킷들 비워주고, cloudformation으로 가서 EmrEksAppStack 스텍과 emr-on-eks-workshop 스택을 삭제해준다.

minsupark:~/environment $ aws emr-containers list-virtual-clusters
{
    "virtualClusters": [
        {
            "id": "1j9nl7iuhe3kzqbaftijpi2q0",
            "name": "emr_eks_cluster",
            "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/1j9nl7iuhe3kzqbaftijpi2q0",
            "state": "RUNNING",
            "containerProvider": {
                "type": "EKS",
                "id": "Cluster9EE0221C-2940135deafb4878bdb490dbcde6e613",
                "info": {
                    "eksInfo": {
                        "namespace": "emr-eks-workshop-namespace"
                    }
                }
            },
            "createdAt": "2021-12-25T09:33:38+00:00",
            "tags": {}
        }
}

minsupark:~/environment $ aws emr-containers list-managed-endpoints --virtual-cluster-id 1j9nl7iuhe3kzqbaftijpi2q0
{
    "endpoints": [
        {
            "id": "0q4dvqzxmbcb0",
            "name": "emr-eks-endpoint",
            "arn": "arn:aws:emr-containers:us-east-1:111111111111:/virtualclusters/1j9nl7iuhe3kzqbaftijpi2q0/endpoints/0q4dvqzxmbcb0",
            "virtualClusterId": "1j9nl7iuhe3kzqbaftijpi2q0",
            "type": "JUPYTER_ENTERPRISE_GATEWAY",
            "state": "ACTIVE",
            "releaseLabel": "emr-6.2.0-latest",
            "executionRoleArn": "arn:aws:iam::111111111111:role/EMR_EKS_Job_Execution_Role",
            "certificateArn": "arn:aws:acm:us-east-1:111111111111:certificate/2fcc1f61-6d16-4f9e-89b9-9f2e75e3081b",
            "serverUrl": "https://internal-k8s-emrekswo-ingress0-d5c50a82f7-1706898367.us-east-1.elb.amazonaws.com:18888",
            "createdAt": "2021-12-25T12:04:17Z",
            "securityGroup": "sg-0c5d6d4f6be4aad55",
            "stateDetails": "Endpoint created successfully. It took 3 Minutes 13 Seconds",
            "tags": {}
        }
    ]
}

minsupark:~/environment $ aws emr-containers delete-managed-endpoint --id 0q4dvqzxmbcb0 --virtual-cluster-id 1j9nl7iuhe3kzqbaftijpi2q0
{
    "id": "0q4dvqzxmbcb0",
    "virtualClusterId": "1j9nl7iuhe3kzqbaftijpi2q0"
}

minsupark:~/environment $ aws emr-containers delete-virtual-cluster --id 1j9nl7iuhe3kzqbaftijpi2q0
{
    "id": "1j9nl7iuhe3kzqbaftijpi2q0"
}

minsupark:~/environment $ helm uninstall aws-load-balancer-controller -n kube-system

minsupark:~/environment $ kubectl delete -k github.com/aws/eks-charts/stable/aws-load-balancer-controller//crds?ref=master

minsupark:~/environment $ eksctl delete iamserviceaccount --cluster Cluster9EE0221C-2940135deafb4878bdb490dbcde6e613 --name aws-load-balancer-controller --namespace kube-system --wait