spark standalone & zeppelin 구성 실습
.
Data_Engineering_TIL(20201018)
Youtube 채널 ‘min zzang’ 님의 ‘아파치 스파크 Standalone 설치 및 구성 1&2’ 을 공부한 내용입니다.
** URL : https://youtu.be/mXUOxPETbm8
[실습목표]
가상머신 1대에 spark standalone mode를 설치한다.
[실습내용]
step 1) EC2 생성
-
운영체제 : Amazon linux 2
-
사양 : m5.4xlarge, 30GB 볼륨
step 2) SSH로 접속해서 아래와 같은 명령어 실행
[ec2-user@ip-10-1-10-126 ~]$ sudo yum update -y
[ec2-user@ip-10-1-10-126 ~]$ yum list java*jdk-devel
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
Available Packages
java-1.7.0-openjdk-devel.x86_64 1:1.7.0.261-2.6.22.2.amzn2.0.1 amzn2-core
java-1.8.0-openjdk-devel.x86_64 1:1.8.0.265.b01-1.amzn2.0.1 amzn2-core
# 자바 설치
[ec2-user@ip-10-1-10-126 ~]$ sudo yum install java-1.8.0-openjdk-devel.x86_64 -y
[ec2-user@ip-10-1-10-126 ~]$ java -version
openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)
# javac 위치확인
[ec2-user@ip-10-1-10-126 ~]$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64/bin/javac
# 메모리 가용자원 확인
[ec2-user@ip-10-1-10-126 ~]$ free -g
total used free shared buff/cache available
Mem: 62 0 60 0 1 61
Swap: 0 0 0
# cpu 코어수 확인
[ec2-user@ip-10-1-10-126 ~]$ grep -c processor /proc/cpuinfo
16
# spark 다운로드
[ec2-user@ip-10-1-10-126 ~]$ wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
--2020-10-15 09:03:09-- https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
Resolving archive.apache.org (archive.apache.org)... 138.201.131.134, 2a01:4f8:172:2ec5::2
Connecting to archive.apache.org (archive.apache.org)|138.201.131.134|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 195636829 (187M) [application/x-gzip]
Saving to: ‘spark-2.1.0-bin-hadoop2.7.tgz’
100%[================================================================================================>] 195,636,829 5.15MB/s in 38s
2020-10-15 09:03:48 (4.93 MB/s) - ‘spark-2.1.0-bin-hadoop2.7.tgz’ saved [195636829/195636829]
[ec2-user@ip-10-1-10-126 ~]$ ls
spark-2.1.0-bin-hadoop2.7.tgz
# 다운로드한 spark 압축해제
[ec2-user@ip-10-1-10-126 ~]$ tar -zxf spark-2.1.0-bin-hadoop2.7.tgz
[ec2-user@ip-10-1-10-126 ~]$ ls
spark-2.1.0-bin-hadoop2.7 spark-2.1.0-bin-hadoop2.7.tgz
# 자바와 spark PATH 설정을 해준다.
[ec2-user@ip-10-1-10-126 ~]$ export PATH=$PATH:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64/bin
[ec2-user@ip-10-1-10-126 ~]$ pwd
/home/ec2-user
[ec2-user@ip-10-1-10-126 ~]$ export PATH=$PATH:/home/ec2-user/spark-2.1.0-bin-hadoop2.7/bin
[ec2-user@ip-10-1-10-126 ~]$ cd spark-2.1.0-bin-hadoop2.7/
[ec2-user@ip-10-1-10-126 spark-2.1.0-bin-hadoop2.7]$ ls
bin conf data examples jars LICENSE licenses NOTICE python R README.md RELEASE sbin yarn
[ec2-user@ip-10-1-10-126 spark-2.1.0-bin-hadoop2.7]$ cd conf
[ec2-user@ip-10-1-10-126 conf]$ ls
docker.properties.template log4j.properties.template slaves.template spark-env.sh.template
fairscheduler.xml.template metrics.properties.template spark-defaults.conf.template
[ec2-user@ip-10-1-10-126 conf]$ cp spark-env.sh.template spark-env.sh
[ec2-user@ip-10-1-10-126 conf]$ ls
docker.properties.template log4j.properties.template slaves.template spark-env.sh
fairscheduler.xml.template metrics.properties.template spark-defaults.conf.template spark-env.sh.template
[ec2-user@ip-10-1-10-126 conf]$ sudo vim spark-env.sh
# 아래내용 추가
export SPARK_WORKER_INSTANCES=3
[ec2-user@ip-10-1-10-126 conf]$ cd ..
[ec2-user@ip-10-1-10-126 spark-2.1.0-bin-hadoop2.7]$ ls
bin conf data examples jars LICENSE licenses NOTICE python R README.md RELEASE sbin yarn
[ec2-user@ip-10-1-10-126 spark-2.1.0-bin-hadoop2.7]$ cd sbin
[ec2-user@ip-10-1-10-126 sbin]$ ls
slaves.sh start-history-server.sh start-slave.sh stop-master.sh stop-slaves.sh
spark-config.sh start-master.sh start-slaves.sh stop-mesos-dispatcher.sh stop-thriftserver.sh
spark-daemon.sh start-mesos-dispatcher.sh start-thriftserver.sh stop-mesos-shuffle-service.sh
spark-daemons.sh start-mesos-shuffle-service.sh stop-all.sh stop-shuffle-service.sh
start-all.sh start-shuffle-service.sh stop-history-server.sh stop-slave.sh
[ec2-user@ip-10-1-10-126 sbin]$ sh start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /home/ec2-user/spark-2.1.0-bin-hadoop2.7/logs/spark-ec2-user-org.apache.spark.deploy.master.Master-1-ip-10-1-10-126.ap-northeast-2.compute.internal.out
# spark-shell가 잘 실행되는지 확인해본다.
[ec2-user@ip-10-1-10-126 sbin]$ cd ..
[ec2-user@ip-10-1-10-126 spark-2.1.0-bin-hadoop2.7]$ ls
bin conf data examples jars LICENSE licenses logs NOTICE python R README.md RELEASE sbin work yarn
[ec2-user@ip-10-1-10-126 spark-2.1.0-bin-hadoop2.7]$ cd bin
[ec2-user@ip-10-1-10-126 bin]$ ls
beeline find-spark-home metastore_db pyspark.cmd spark-class sparkR spark-shell spark-sql spark-submit.cmd
beeline.cmd load-spark-env.cmd pyspark run-example spark-class2.cmd sparkR2.cmd spark-shell2.cmd spark-submit
derby.log load-spark-env.sh pyspark2.cmd run-example.cmd spark-class.cmd sparkR.cmd spark-shell.cmd spark-submit2.cmd
[ec2-user@ip-10-1-10-126 bin]$ ./spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/10/18 10:57:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/10/18 10:57:37 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '3').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --num-executors to specify the number of executors
- Or set SPARK_EXECUTOR_INSTANCES
- spark.executor.instances to configure the number of instances in the spark config.
20/10/18 10:57:38 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/10/18 10:57:41 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
20/10/18 10:57:41 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
20/10/18 10:57:42 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.1.10.126:4041
Spark context available as 'sc' (master = local[*], app id = local-1603018658481).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@64cec4d0
# 확인했으면 컨트롤+c 를 눌러 spark-shell을 빠져나간다.
그리고 마스터 노드 실행 후 웹브라우저를 열고 [ec2_public_ip]:8080
로 접속하면 아래 그림과 같은 화면이 전시될 것이다.
그리고 해당화면에서 확인할 수 있는 주요정보는 다음과 같다.
Spark Master at spark://ip-10-1-10-126.ap-northeast-2.compute.internal:7077
URL: spark://ip-10-1-10-126.ap-northeast-2.compute.internal:7077
REST URL: spark://ip-10-1-10-126.ap-northeast-2.compute.internal:6066 (cluster mode)
Alive Workers: 0
Cores in use: 0 Total, 0 Used
Memory in use: 0.0 B Total, 0.0 B Used
Applications: 0 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
그리고 아래와 같이 슬레이브 노드도 스타트해준다.
# 슬레이브 실행시 뒤에 옵션이 붙는것에 유의한다.
# sh start-slave.sh [방금 위에서 확인한 마스터노드 주소] -m [노드당 메모리] -c [노드당 코어수]
# 이 가상머신의 메모리가 총 62G, 코어는 16 코어이므로 적당히 노드당 18기가, 코어는 4개씩으로 설정했다.
# 슬레이브 노드수는 아까 spark-env.sh에서 설정한 3대가 생성된다.
[ec2-user@ip-10-1-10-126 sbin]$ sh start-slave.sh spark://ip-10-1-10-126.ap-northeast-2.compute.internal:7077 -m 18g -c 4
starting org.apache.spark.deploy.worker.Worker, logging to /home/ec2-user/spark-2.1.0-bin-hadoop2.7/logs/spark-ec2-user-org.apache.spark.deploy.worker.Worker-1-ip-10-1-10-126.ap-northeast-2.compute.internal.out
starting org.apache.spark.deploy.worker.Worker, logging to /home/ec2-user/spark-2.1.0-bin-hadoop2.7/logs/spark-ec2-user-org.apache.spark.deploy.worker.Worker-2-ip-10-1-10-126.ap-northeast-2.compute.internal.out
starting org.apache.spark.deploy.worker.Worker, logging to /home/ec2-user/spark-2.1.0-bin-hadoop2.7/logs/spark-ec2-user-org.apache.spark.deploy.worker.Worker-3-ip-10-1-10-126.ap-northeast-2.compute.internal.out
그런다음에 다시 마스터노드 WEB UI로 가보면 슬레이브 노드가 올라온것을 확인할 수 있다.
Spark Master at spark://ip-10-1-10-126.ap-northeast-2.compute.internal:7077
URL: spark://ip-10-1-10-126.ap-northeast-2.compute.internal:7077
REST URL: spark://ip-10-1-10-126.ap-northeast-2.compute.internal:6066 (cluster mode)
Alive Workers: 3
Cores in use: 12 Total, 0 Used
Memory in use: 54.0 GB Total, 0.0 B Used
Applications: 0 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
Workers
Worker Id Address State Cores Memory
worker-20201018101815-10.1.10.126-37999 10.1.10.126:37999 ALIVE 4 (0 Used) 18.0 GB (0.0 B Used)
worker-20201018101818-10.1.10.126-36635 10.1.10.126:36635 ALIVE 4 (0 Used) 18.0 GB (0.0 B Used)
worker-20201018101820-10.1.10.126-45525 10.1.10.126:45525 ALIVE 4 (0 Used) 18.0 GB (0.0 B Used)
하나의 버츄얼 머신에 세개의 쓰레드를 만든것이다.
이제는 제플린을 설치하려고 한다. 참고로 spark와 제플린의 버전을 잘 맞춰서 설치해야하기 때문에 스파크 최신버전이 아니라 2.1 버전을 설치한 것이다 (제플린은 0.7.3 버전). 스팍과 제플린의 버전을 맞춰주지 않으면 Error 발생이 많아서 상당히 운영하기 어려울 것이다.
[ec2-user@ip-10-1-10-126 sbin]$ cd ~
[ec2-user@ip-10-1-10-126 ~]$ wget https://archive.apache.org/dist/zeppelin/zeppelin-0.7.3/zeppelin-0.7.3-bin-all.tgz
--2020-10-18 10:34:45-- https://archive.apache.org/dist/zeppelin/zeppelin-0.7.3/zeppelin-0.7.3-bin-all.tgz
Resolving archive.apache.org (archive.apache.org)... 138.201.131.134, 2a01:4f8:172:2ec5::2
Connecting to archive.apache.org (archive.apache.org)|138.201.131.134|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 834875175 (796M) [application/x-gzip]
Saving to: ‘zeppelin-0.7.3-bin-all.tgz’
100%[==============================================================================================>] 834,875,175 5.59MB/s in 2m 25s
2020-10-18 10:37:11 (5.50 MB/s) - ‘zeppelin-0.7.3-bin-all.tgz’ saved [834875175/834875175]
[ec2-user@ip-10-1-10-126 ~]$ tar -zxf zeppelin-0.7.3-bin-all.tgz
[ec2-user@ip-10-1-10-126 ~]$ ls
spark-2.1.0-bin-hadoop2.7 spark-2.1.0-bin-hadoop2.7.tgz zeppelin-0.7.3-bin-all zeppelin-0.7.3-bin-all.tgz
[ec2-user@ip-10-1-10-126 ~]$ cd zeppelin-0.7.3-bin-all/conf
[ec2-user@ip-10-1-10-126 conf]$ ls
configuration.xsl log4j.properties zeppelin-env.cmd.template zeppelin-site.xml.template
interpreter-list shiro.ini.template zeppelin-env.sh.template
[ec2-user@ip-10-1-10-126 conf]$ cp zeppelin-site.xml.template zeppelin-site.xml
[ec2-user@ip-10-1-10-126 conf]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet 10.1.10.126 netmask 255.255.255.0 broadcast 10.1.10.255
inet6 fe80::96:85ff:fedd:5f42 prefixlen 64 scopeid 0x20<link>
ether 02:96:85:dd:5f:42 txqueuelen 1000 (Ethernet)
RX packets 836189 bytes 1224343214 (1.1 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 369801 bytes 24755058 (23.6 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 975 bytes 532888 (520.3 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 975 bytes 532888 (520.3 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
# 위에서 ifconfig 했을때 10.1.10.126 (ec2 private ip) 이거를 IP 주소로 설정해주고, 제플린 포트도 원래 디폴트는 8080인데 스팍 마스터서버가 해당포트를
# 쓰고있으니까 제플린은 7777을 쓰도록 설정해준다.
[ec2-user@ip-10-1-10-126 conf]$ sudo vim zeppelin-site.xml
<property>
<name>zeppelin.server.addr</name>
<value>10.1.10.126</value>
<description>Server address</description>
</property>
<property>
<name>zeppelin.server.port</name>
<value>7777</value>
<description>Server port.</description>
</property>
[ec2-user@ip-10-1-10-126 conf]$ cp zeppelin-env.sh.template zeppelin-env.sh
# 아래의 내용을 추가한다.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64
export MASTER=spark://ip-10-1-10-126.ap-northeast-2.compute.internal:7077
export SPARK_HOME=/home/ec2-user/spark-2.1.0-bin-hadoop2.7
[ec2-user@ip-10-1-10-126 conf]$ cd ..
[ec2-user@ip-10-1-10-126 zeppelin-0.7.3-bin-all]$ ls
bin conf interpreter lib LICENSE licenses notebook NOTICE README.md zeppelin-web-0.7.3.war
[ec2-user@ip-10-1-10-126 zeppelin-0.7.3-bin-all]$ cd bin
[ec2-user@ip-10-1-10-126 bin]$ ls
common.cmd functions.cmd install-interpreter.sh interpreter.sh zeppelin-daemon.sh
common.sh functions.sh interpreter.cmd zeppelin.cmd zeppelin.sh
[ec2-user@ip-10-1-10-126 bin]$ ./zeppelin-daemon.sh start
Log dir doesn't exist, create /home/ec2-user/zeppelin-0.7.3-bin-all/logs
Pid dir doesn't exist, create /home/ec2-user/zeppelin-0.7.3-bin-all/run
Zeppelin start [ OK ]
# JVM에서 실행되는 프로세스가 어떤게 있는지 체크하여 우리가 띄운것들이 잘 구동되고 있는지 확인한다.
[ec2-user@ip-10-1-10-126 bin]$ jps
10001 ZeppelinServer
9618 Worker
9732 Worker
10075 Jps
12621 Master
9503 Worker
그런 다음에 웹브라우져를 열어서 [ec2-public-ip]:7777
로 접속하면 제플린에 정상접속 되는것을 확인할 수 있다.
접속해서 spark 노트북을 하나 만들고 sc
를 쓰고 실행했을때
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@552dc360
와 같이 sparkcontext가 잘 만들어지는지도 확인해본다.
그러면 제플린에서 데이터를 처리하는 실습을 해보자
먼저 버츄얼머신에 아래와 같이 샘플 데이터를 s3로부터 다운로드 하자.
[ec2-user@ip-10-1-10-126 ~]$ aws configure
AWS Access Key ID [None]: xxxxxxxxxxxxxxxxxxxxxxxxxx
AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Default region name [None]: ap-northeast-2
Default output format [None]: json
[ec2-user@ip-10-1-10-126 ~]$ aws s3 cp s3://my_s3_bucket/oct/2019-Oct.csv sample_data.csv
download: s3://pms-bucket-test/oct/2019-Oct.csv to ./sample_data.csv
그런 다음에 제플린 노트북으로 돌아와서
val data = spark.read.format("csv").option("header","true").load("/home/ec2-user/sample_data.csv")
를 실행하면
data: org.apache.spark.sql.DataFrame = [event_time: string, event_type: string ... 7 more fields]
와 같이 결과가 뜨면서 데이터 프레임 형태로 데이터를 가져오게 된다.
val data_rdd = sc.textFile("/home/ec2-user/sample_data.csv")
와 같이 실행하면
data_rdd: org.apache.spark.rdd.RDD[String] = /home/ec2-user/sample_data.csv MapPartitionsRDD[8] at textFile at <console>:27
와 같이 rdd 형태로도 가져오게 된다.
data.show()
를 하게되면 아래와 같이 데이터를 불러온다.
+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+
| event_time|event_type|product_id| category_id| category_code| brand| price| user_id| user_session|
+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+
|2019-10-01 00:00:...| view| 44600062|2103807459595387724| null|shiseido| 35.79|541312140|72d76fde-8bb3-4e0...|
|2019-10-01 00:00:...| view| 3900821|2053013552326770905|appliances.enviro...| aqua| 33.20|554748717|9333dfbd-b87a-470...|
|2019-10-01 00:00:...| view| 17200506|2053013559792632471|furniture.living_...| null| 543.10|519107250|566511c2-e2e3-422...|
|2019-10-01 00:00:...| view| 1307067|2053013558920217191| computers.notebook| lenovo| 251.74|550050854|7c90fc70-0e80-459...|
|2019-10-01 00:00:...| view| 1004237|2053013555631882655|electronics.smart...| apple|1081.98|535871217|c6bd7419-2748-4c5...|
|2019-10-01 00:00:...| view| 1480613|2053013561092866779| computers.desktop| pulser| 908.62|512742880|0d0d91c2-c9c2-4e8...|
|2019-10-01 00:00:...| view| 17300353|2053013553853497655| null| creed| 380.96|555447699|4fe811e9-91de-46d...|
|2019-10-01 00:00:...| view| 31500053|2053013558031024687| null|luminarc| 41.16|550978835|6280d577-25c8-414...|
|2019-10-01 00:00:...| view| 28719074|2053013565480109009| apparel.shoes.keds| baden| 102.71|520571932|ac1cd4e5-a3ce-422...|
|2019-10-01 00:00:...| view| 1004545|2053013555631882655|electronics.smart...| huawei| 566.01|537918940|406c46ed-90a4-478...|
|2019-10-01 00:00:...| view| 2900536|2053013554776244595|appliances.kitche...|elenberg| 51.46|555158050|b5bdd0b3-4ca2-4c5...|
|2019-10-01 00:00:...| view| 1005011|2053013555631882655|electronics.smart...| samsung| 900.64|530282093|50a293fb-5940-41b...|
|2019-10-01 00:00:...| view| 3900746|2053013552326770905|appliances.enviro...| haier| 102.38|555444559|98b88fa0-d8fa-4b9...|
|2019-10-01 00:00:...| view| 44600062|2103807459595387724| null|shiseido| 35.79|541312140|72d76fde-8bb3-4e0...|
|2019-10-01 00:00:...| view| 13500240|2053013557099889147|furniture.bedroom...| brw| 93.18|555446365|7f0062d8-ead0-4e0...|
|2019-10-01 00:00:...| view| 23100006|2053013561638126333| null| null| 357.79|513642368|17566c27-0a8f-450...|
|2019-10-01 00:00:...| view| 1801995|2053013554415534427|electronics.video.tv| haier| 193.03|537192226|e3151795-c355-4ef...|
|2019-10-01 00:00:...| view| 10900029|2053013555069845885|appliances.kitche...| bosch| 58.95|519528062|901b9e3c-3f8f-414...|
|2019-10-01 00:00:...| view| 1306631|2053013558920217191| computers.notebook| hp| 580.89|550050854|7c90fc70-0e80-459...|
|2019-10-01 00:00:...| view| 1005135|2053013555631882655|electronics.smart...| apple|1747.79|535871217|c6bd7419-2748-4c5...|
+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+
only showing top 20 rows
data_rdd.count()
를 실행하여 rdd로 해당 데이터의 로우카운트를 해보자.
실행결과는 res3: Long = 42448765
와 같이 나온다.