Setting up a Spark cluster in a local environment using docker compose
Data_Engineering_TIL(20240523)
References
[Spark] Cluster setup with docker-compose, standalone (2/5) (blog post)
URL: https://ampersandor.tistory.com/11?category=1170869
Overview
How to set up a Spark cluster in a local macOS environment using docker compose
Setup summary
STEP 1) Create the working folder
STEP 2) Write the custom image Dockerfiles
- cluster-base.Dockerfile
- spark-base.Dockerfile
- spark-master.Dockerfile
- spark-worker.Dockerfile
- jupyterlab.Dockerfile
STEP 3) Write the image build shell script (build.sh)
STEP 4) Write docker-compose.yaml
STEP 5) Build the docker images
STEP 6) docker compose up
STEP 7) Open the Spark web UIs
STEP 8) Run a Spark test
Setup details
STEP 1) Create the working folder
The working folder and its files will end up laid out as shown below, so first create the working folder.
$ pwd
/Users/minman/spark_test
$ tree
.
├── build.sh
├── cluster-base.Dockerfile
├── docker-compose.yaml
├── jupyterlab.Dockerfile
├── spark-base.Dockerfile
├── spark-master.Dockerfile
└── spark-worker.Dockerfile
1 directory, 7 files
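For reference, the working folder itself can be created and entered like this (the path is just the one used in this example):
$ mkdir -p /Users/minman/spark_test
$ cd /Users/minman/spark_test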
STEP 2) Write the custom image Dockerfiles
[cluster-base.Dockerfile]
- The most fundamental image, cluster-base
- On top of this image, shared_workspace is simply shared as a docker volume, and the spark images and the jupyter image run on top of it
- Structurally, this mimics putting Spark on top of HDFS in a real production environment
- The cluster's base image is 8-jre-slim
- Since Spark runs on the JVM, the official Java runtime environment image distributed on Docker Hub is used, which is a bit simpler than installing Java on a plain Linux OS image
ARG debian_buster_image_tag=8-jre-slim
FROM openjdk:${debian_buster_image_tag}
ARG shared_workspace=/opt/workspace
RUN mkdir -p ${shared_workspace} && \
apt-get update -y && \
apt-get install -y python3 && \
ln -s /usr/bin/python3 /usr/bin/python && \
rm -rf /var/lib/apt/lists/*
ENV SHARED_WORKSPACE=${shared_workspace}
VOLUME ${shared_workspace}
CMD ["bash"]
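Once this image is built in STEP 5, the python symlink can be sanity-checked with a one-off container (a quick check, not part of the build itself):
$ docker run --rm cluster-base python --version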
[spark-base.Dockerfile]
- Based on the cluster-base image: download Spark, unpack it, and set the environment variables
- Since cluster-base is already a JRE-based image, there is no need to install a JVM
FROM cluster-base
ARG spark_version
ARG hadoop_version
RUN apt-get update -y && \
apt-get install -y curl && \
curl https://archive.apache.org/dist/spark/spark-${spark_version}/spark-${spark_version}-bin-hadoop${hadoop_version}.tgz -o spark.tgz && \
tar -xf spark.tgz && \
mv spark-${spark_version}-bin-hadoop${hadoop_version} /usr/bin/ && \
mkdir /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/logs && \
rm spark.tgz
ENV SPARK_HOME /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}
ENV SPARK_MASTER_HOST spark-master
ENV SPARK_MASTER_PORT 7077
ENV PYSPARK_PYTHON python3
WORKDIR ${SPARK_HOME}
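Likewise, the Spark installation can be verified after the build; since WORKDIR is SPARK_HOME, the relative bin/ path works:
$ docker run --rm spark-base bin/spark-submit --version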
[spark-master.Dockerfile]
FROM spark-base
ARG spark_master_web_ui=8080
EXPOSE ${spark_master_web_ui}
EXPOSE ${SPARK_MASTER_PORT}
CMD bin/spark-class org.apache.spark.deploy.master.Master >> logs/spark-master.out
[spark-worker.Dockerfile]
FROM spark-base
ARG spark_worker_web_ui=8081
EXPOSE ${spark_worker_web_ui}
CMD bin/spark-class org.apache.spark.deploy.worker.Worker spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT} >> logs/spark-worker.out
[jupyterlab.Dockerfile]
FROM cluster-base
ARG spark_version
ARG jupyterlab_version
RUN apt-get update -y && \
apt-get install -y python3-pip && \
pip3 install wget pyspark==${spark_version} jupyterlab==${jupyterlab_version}
EXPOSE 8888
WORKDIR ${SHARED_WORKSPACE}
CMD jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root --NotebookApp.token=
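Note that pyspark is pinned to the same spark_version as the cluster images, since the client and cluster Spark versions need to match. After building, this can be double-checked with:
$ docker run --rm jupyterlab python3 -c "import pyspark; print(pyspark.__version__)"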
STEP 3) Write the image build shell script (build.sh)
- build.sh (the build order matters: spark-base is built FROM cluster-base, and the master/worker images FROM spark-base, so each docker build must find the previously built local tag)
#!/bin/bash
# -- Software Stack Version
SPARK_VERSION="3.0.0"
HADOOP_VERSION="2.7"
JUPYTERLAB_VERSION="2.1.5"
# -- Building the Images
echo "############################################# building cluster-base #############################################"
docker build \
-f cluster-base.Dockerfile \
-t cluster-base .
echo "############################################# building spark-base #############################################"
docker build \
--build-arg spark_version="${SPARK_VERSION}" \
--build-arg hadoop_version="${HADOOP_VERSION}" \
-f spark-base.Dockerfile \
-t spark-base .
echo "############################################# building spark-master #############################################"
docker build \
-f spark-master.Dockerfile \
-t spark-master .
echo "############################################# building spark-worker #############################################"
docker build \
-f spark-worker.Dockerfile \
-t spark-worker .
echo "############################################# building jupyter #############################################"
docker build \
--build-arg spark_version="${SPARK_VERSION}" \
--build-arg jupyterlab_version="${JUPYTERLAB_VERSION}" \
-f jupyterlab.Dockerfile \
-t jupyterlab .
STEP 4) Write docker-compose.yaml
- docker-compose.yaml
version: "3.6"
volumes:
  shared-workspace:
    name: "hadoop-distributed-file-system"
    driver: local
services:
  jupyterlab:
    image: jupyterlab
    container_name: jupyterlab
    ports:
      - 8888:8888
    volumes:
      - shared-workspace:/opt/workspace
  spark-master:
    image: spark-master
    container_name: spark-master
    ports:
      - 4040:8080
      - 7077:7077
    volumes:
      - shared-workspace:/opt/workspace
  spark-worker-1:
    image: spark-worker
    container_name: spark-worker-1
    environment:
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=512m
    ports:
      - 4041:8081
    volumes:
      - shared-workspace:/opt/workspace
    depends_on:
      - spark-master
  spark-worker-2:
    image: spark-worker
    container_name: spark-worker-2
    environment:
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=512m
    ports:
      - 4042:8081
    volumes:
      - shared-workspace:/opt/workspace
    depends_on:
      - spark-master
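Before bringing anything up, the file can be validated; docker compose config prints the fully resolved configuration (or an error if the YAML is malformed):
$ docker compose config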
STEP 5) Build the docker images
Run the commands below from the working folder to build the images.
$ chmod +x ./build.sh
$ ./build.sh
############################################# building cluster-base #############################################
[+] Building 1.6s (7/7) FINISHED
=> [internal] load build definition from cluster-base.Dockerfile 0.0s
=> => transferring dockerfile: 451B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/openjdk:8-jre-slim 1.5s
=> [auth] library/openjdk:pull token for registry-1.docker.io 0.0s
=> [1/2] FROM docker.io/library/openjdk:8-jre-slim@sha256:xxxxxxxxxxxxxxxxxxxxxx 0.0s
=> CACHED [2/2] RUN mkdir -p /opt/workspace && apt-get update -y && apt-get install -y python3 && ln -s /usr/bin/python3 /usr/bin/python && rm -rf / 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:xxxxxxxxxxxxxxxxxxxxxx 0.0s
=> => naming to docker.io/library/cluster-base 0.0s
Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
############################################# building spark-base #############################################
[+] Building 0.1s (7/7) FINISHED
=> [internal] load build definition from spark-base.Dockerfile 0.0s
=> => transferring dockerfile: 816B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/cluster-base:latest 0.0s
=> [1/3] FROM docker.io/library/cluster-base 0.0s
=> CACHED [2/3] RUN apt-get update -y && apt-get install -y curl && curl https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz -o 0.0s
=> CACHED [3/3] WORKDIR /usr/bin/spark-3.0.0-bin-hadoop2.7 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:xxxxxxxxxxxxxxxxxxxxxx 0.0s
=> => naming to docker.io/library/spark-base 0.0s
Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
############################################# building spark-master #############################################
[+] Building 0.1s (5/5) FINISHED
=> [internal] load build definition from spark-master.Dockerfile 0.0s
=> => transferring dockerfile: 257B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/spark-base:latest 0.0s
=> CACHED [1/1] FROM docker.io/library/spark-base 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:xxxxxxxxxxxxxxxxxxxxxx 0.0s
=> => naming to docker.io/library/spark-master 0.0s
Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
############################################# building spark-worker #############################################
[+] Building 0.1s (5/5) FINISHED
=> [internal] load build definition from spark-worker.Dockerfile 0.0s
=> => transferring dockerfile: 279B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/spark-base:latest 0.0s
=> CACHED [1/1] FROM docker.io/library/spark-base 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:xxxxxxxxxxxxxxxxxxxxxx 0.0s
=> => naming to docker.io/library/spark-worker 0.0s
Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
############################################# building jupyter #############################################
[+] Building 0.1s (7/7) FINISHED
=> [internal] load build definition from jupyterlab.Dockerfile 0.0s
=> => transferring dockerfile: 407B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/cluster-base:latest 0.0s
=> [1/3] FROM docker.io/library/cluster-base 0.0s
=> CACHED [2/3] RUN apt-get update -y && apt-get install -y python3-pip && pip3 install wget pyspark==3.0.0 jupyterlab==2.1.5 0.0s
=> CACHED [3/3] WORKDIR /opt/workspace 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:xxxxxxxxxxxxxxxxxxxxxx 0.0s
=> => naming to docker.io/library/jupyterlab 0.0s
Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
jupyterlab latest xxxxxxxxxxxx 53 minutes ago 1.43GB
spark-worker latest xxxxxxxxxxxx 54 minutes ago 483MB
spark-base latest xxxxxxxxxxxx 54 minutes ago 483MB
spark-master latest xxxxxxxxxxxx 54 minutes ago 483MB
cluster-base latest xxxxxxxxxxxx 55 minutes ago 217MB
STEP 6) docker compose up
Run the command below to bring docker compose up.
$ docker compose up -d
[+] Running 6/6
⠿ Network spark_test_default Created 0.0s
⠿ Volume "hadoop-distributed-file-system" Created 0.0s
⠿ Container spark-master Started 0.7s
⠿ Container jupyterlab Started 0.7s
⠿ Container spark-worker-2 Started 1.4s
⠿ Container spark-worker-1 Started 1.4s
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
xxxxxxxxxxxx spark-worker "/bin/sh -c 'bin/spa…" 12 seconds ago Up 10 seconds 0.0.0.0:4041->8081/tcp spark-worker-1
xxxxxxxxxxxx spark-worker "/bin/sh -c 'bin/spa…" 12 seconds ago Up 10 seconds 0.0.0.0:4042->8081/tcp spark-worker-2
xxxxxxxxxxxx spark-master "/bin/sh -c 'bin/spa…" 12 seconds ago Up 11 seconds 0.0.0.0:7077->7077/tcp, 0.0.0.0:4040->8080/tcp spark-master
xxxxxxxxxxxx jupyterlab "/bin/sh -c 'jupyter…" 12 seconds ago Up 11 seconds 0.0.0.0:8888->8888/tcp jupyterlab
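Since each Spark image's CMD redirects stdout into a file under logs/, docker logs shows almost nothing; tail the log files inside the containers instead:
$ docker exec spark-master tail logs/spark-master.out
$ docker exec spark-worker-1 tail logs/spark-worker.out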
Conversely, to take the cluster down, run the docker compose down command.
$ docker compose down
[+] Running 5/5
⠿ Container spark-worker-1 Removed 10.5s
⠿ Container spark-worker-2 Removed 10.6s
⠿ Container jupyterlab Removed 10.7s
⠿ Container spark-master Removed 10.3s
⠿ Network spark_test_default Removed 0.1s
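Note that docker compose down leaves the named volume (hadoop-distributed-file-system) behind; add the -v flag to remove it as well:
$ docker compose down -v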
STEP 7) Open the Spark web UIs
JupyterLab and the exposed Spark web UIs can be reached via the links below, and the Spark master web UI shows that the workers registered correctly. (The master UI's container port 8080 is mapped to host port 4040 in docker-compose.yaml.)
- jupyterlab: http://localhost:8888
- master node: http://localhost:4040
- worker node 1: http://localhost:4041
- worker node 2: http://localhost:4042
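The worker registration can also be checked from the terminal, since the standalone master web UI also serves a JSON summary of the cluster at /json:
$ curl -s http://localhost:4040/json | python3 -m json.tool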
STEP 8) Run a Spark test
Enter and run the code below in jupyterlab.
from pyspark.sql import SparkSession

# Create a session against the standalone master; the executor memory
# matches the 512m given to each worker in docker-compose.yaml
spark = SparkSession.\
builder.\
appName("pyspark-notebook").\
master("spark://spark-master:7077").\
config("spark.executor.memory", "512m").\
getOrCreate()

import wget

# Download the iris dataset into the shared workspace volume
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
wget.download(url)

# Read it back through Spark (no header row, so columns come out as _c0.._c4)
data = spark.read.csv("iris.data")
data.show(n=5)
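The cluster can also be smoke-tested without Jupyter by submitting the Pi example straight from the master container (assuming the default examples/ layout shipped in the spark-3.0.0-bin-hadoop2.7 distribution):
$ docker exec spark-master bin/spark-submit --master spark://spark-master:7077 examples/src/main/python/pi.py 10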