EMR 마스터노드에서 SBT를 이용한 scala script to jar file 변환 실습
2020-11-22
.
Data_Engineering_TIL(20201122)
[실습시 참고자료]
베스핀글로벌 최정민님 ‘Scala script 를 Jar file로 변환 및 실행하기’ 자료
[실습내용]
step 1) 임의의 EMR 생성, master node로 ssh 접속
step 2) EMR master node에 SBT 설치
마스터노드에서 아래와 같이 명령어 실행
Last login: Sun Nov 22 05:36:43 2020
__| __|_ )
_| ( / Amazon Linux 2 AMI
___|\___|___|
https://aws.amazon.com/amazon-linux-2/
71 package(s) needed for security, out of 105 available
Run "sudo yum update" to apply all updates.
EEEEEEEEEEEEEEEEEEEE MMMMMMMM MMMMMMMM RRRRRRRRRRRRRRR
E::::::::::::::::::E M:::::::M M:::::::M R::::::::::::::R
EE:::::EEEEEEEEE:::E M::::::::M M::::::::M R:::::RRRRRR:::::R
E::::E EEEEE M:::::::::M M:::::::::M RR::::R R::::R
E::::E M::::::M:::M M:::M::::::M R:::R R::::R
E:::::EEEEEEEEEE M:::::M M:::M M:::M M:::::M R:::RRRRRR:::::R
E::::::::::::::E M:::::M M:::M:::M M:::::M R:::::::::::RR
E:::::EEEEEEEEEE M:::::M M:::::M M:::::M R:::RRRRRR::::R
E::::E M:::::M M:::M M:::::M R:::R R::::R
E::::E EEEEE M:::::M MMM M:::::M R:::R R::::R
EE:::::EEEEEEEE::::E M:::::M M:::::M R:::R R::::R
E::::::::::::::::::E M:::::M M:::::M RR::::R R::::R
EEEEEEEEEEEEEEEEEEEE MMMMMMM MMMMMMM RRRRRRR RRRRRR
[hadoop@ip-10-0-1-70 ~]$ curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintraysbt-rpm.repo
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 160 0 160 0 0 212 0 --:--:-- --:--:-- --:--:-- 212
#bintray--sbt-rpm - packages by from Bintray
[bintray--sbt-rpm]
name=bintray--sbt-rpm
baseurl=https://sbt.bintray.com/rpm
gpgcheck=0
repo_gpgcheck=0
enabled=1
[hadoop@ip-10-0-1-70 ~]$ sudo yum install sbt -y
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
amzn2-core | 3.7 kB 00:00:00
bintray--sbt-rpm | 1.3 kB 00:00:00
bintray--sbt-rpm/primary | 6.0 kB 00:00:00
bintray--sbt-rpm 55/55
9 packages excluded due to repository priority protections
No package sbtsudo available.
Package yum-3.4.3-158.amzn2.0.4.noarch already installed and latest version
No package install available.
Resolving Dependencies
--> Running transaction check
---> Package sbt.noarch 0:1.4.3-0 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
=========================================================================================================================================
Package Arch Version Repository Size
=========================================================================================================================================
Installing:
sbt noarch 1.4.3-0 bintray--sbt-rpm 1.2 M
Transaction Summary
=========================================================================================================================================
Install 1 Package
Total download size: 1.2 M
Installed size: 1.4 M
Downloading packages:
sbt-1.4.3.rpm | 1.2 MB 00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : sbt-1.4.3-0.noarch 1/1
Verifying : sbt-1.4.3-0.noarch 1/1
Installed:
sbt.noarch 0:1.4.3-0
Complete!
step 3) SBT project 구성
- .sbt 생성
[hadoop@ip-10-0-1-70 my_scala_pj]$ pwd
/home/hadoop
[hadoop@ip-10-0-1-70 ~]$ mkdir my_scala_pj
[hadoop@ip-10-0-1-70 ~]$ cd my_scala_pj/
[hadoop@ip-10-0-1-70 my_scala_pj]$ sudo vim ./simple.sbt
# 자유롭게 임의로 작성
name := "My scala Project"
# 자유롭게 임의로 작성
version := "0.1"
# 아래의 내용은 script의 종속성과 연관되어 있기 때문에 스크립트에 필요한 라이브러리에 맞게 작성되어야 함
scalaVersion := "2.12.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0"
[hadoop@ip-10-0-1-70 my_scala_pj]$ find .
.
./simple.sbt
- TestApp.scala 생성
[hadoop@ip-10-0-1-70 my_scala_pj]$ mkdir src
[hadoop@ip-10-0-1-70 my_scala_pj]$ ls
simple.sbt src
[hadoop@ip-10-0-1-70 my_scala_pj]$ cd src
[hadoop@ip-10-0-1-70 src]$ mkdir main
[hadoop@ip-10-0-1-70 src]$ cd main
[hadoop@ip-10-0-1-70 main]$ mkdir scala
[hadoop@ip-10-0-1-70 main]$ cd scala
[hadoop@ip-10-0-1-70 scala]$ pwd
/home/hadoop/my_scala_pj/src/main/scala
[hadoop@ip-10-0-1-70 scala]$ sudo vim TestApp.scala
import org.apache.spark.sql.SparkSession
object TestApp {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder.master("yarn").appName("my_app").getOrCreate()
val rdd = spark.read.textFile("s3://pms-bucket-test/README.txt")
rdd.write.format("parquet").save("s3://pms-bucket-test/test_folder")
println("Job done")
spark.stop()
System.exit(0)
}
}
[hadoop@ip-10-0-1-70 scala]$ cd /home/hadoop/my_scala_pj
[hadoop@ip-10-0-1-70 my_scala_pj]$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/TestApp.scala
[hadoop@ip-10-0-1-70 my_scala_pj]$ tree
.
├── simple.sbt
└── src
└── main
└── scala
└── TestApp.scala
3 directories, 2 files
step 4) scala script to jar file
# 프로젝트 최상단 폴더로 이동
[hadoop@ip-10-0-1-70 my_scala_pj]$ cd /home/hadoop/my_scala_pj
[hadoop@ip-10-0-1-70 my_scala_pj]$ sbt package
[info] [launcher] getting org.scala-sbt sbt 1.4.3 (this may take some time)...
downloading https://repo1.maven.org/maven2/org/scala-sbt/sbt/1.4.3/sbt-1.4.3.jar ...
:: loading settings :: url = jar:file:/usr/share/sbt/bin/sbt-launch.jar!/org/apache/ivy/core/settings/ivysettings.xml
downloading https://repo1.maven.org/maven2/org/scala-lang/scala-library/2.12.12/scala-library-2.12.12.jar ...
:: loading settings :: url = jar:file:/usr/share/sbt/bin/sbt-launch.jar!/org/apache/ivy/core/settings/ivysettings.xml
downloading https://repo1.maven.org/maven2/org/scala-sbt/io_2.12/1.4.0/io_2.12-1.4.0.jar ...
:: loading settings :: url = jar:file:/usr/share/sbt/bin/sbt-launch.jar!/org/apache/ivy/core/settings/ivysettings.xml
downloading https://repo1.maven.org/maven2/org/scala-sbt/main_2.12/1.4.3/main_2.12-1.4.3.jar ...
:: loading settings :: url = jar:file:/usr/share/sbt/bin/sbt-launch.jar!/org/apache/ivy/core/settings/ivysettings.xml
[SUCCESSFUL ] org.scala-sbt#sbt;1.4.3!sbt.jar (615ms)
downloading https://repo1.maven.org/maven2/org/scala-sbt/logic_2.12/1.4.3/logic_2.12-1.4.3.jar ...
[SUCCESSFUL ] org.scala-sbt#logic_2.12;1.4.3!logic_2.12.jar (579ms)
downloading https://repo1.maven.org/maven2/org/scala-sbt/actions_2.12/1.4.3/actions_2.12-1.4.3.jar ...
[SUCCESSFUL ] org.scala-sbt#io_2.12;1.4.0!io_2.12.jar (1461ms)
downloading https://repo1.maven.org/maven2/org/scala-sbt/main-settings_2.12/1.4.3/main-settings_2.12-1.4.3.jar ...
[SUCCESSFUL ] org.scala-sbt#actions_2.12;1.4.3!actions_2.12.jar (732ms)
...
https://repo1.maven.org/maven2/org/lz4/lz4-java/1.7.1/lz4-java-1.7.1.jar
100.0% [##########] 634.7 KiB (2.7 MiB / s)
[info] Fetched artifacts of
[warn] There may be incompatibilities among your library dependencies; run 'evicted' to see detailed eviction warnings.
[info] compiling 1 Scala source to /home/hadoop/my_scala_pj/target/scala-2.12/classes ...
https://repo1.maven.org/maven2/org/scala-sbt/compiler-bridge_2.12/1.4.3/compiler-bridge_2.12-1.4.3.pom
100.0% [##########] 2.7 KiB (17.5 KiB / s)
https://repo1.maven.org/maven2/org/scala-sbt/compiler-interface/1.4.3/compiler-interface-1.4.3.pom
100.0% [##########] 2.9 KiB (19.3 KiB / s)
https://repo1.maven.org/maven2/org/scala-sbt/compiler-bridge_2.12/1.4.3/compiler-bridge_2.12-1.4.3-sources.jar
100.0% [##########] 52.4 KiB (376.8 KiB / s)
https://repo1.maven.org/maven2/org/scala-sbt/util-interface/1.4.0/util-interface-1.4.0.jar
100.0% [##########] 2.6 KiB (19.0 KiB / s)
[info] Non-compiled module 'compiler-bridge_2.12' for Scala 2.12.12. Compiling...
[info] Compilation completed in 9.657s.
[success] Total time: 32 s, completed Nov 22, 2020 6:33:34 AM
# /home/hadoop/my_scala_pj/target/scala-2.12 폴더에 my-scala-project_2.12-0.1.jar 생성
[hadoop@ip-10-0-1-70 my_scala_pj]$ tree
.
├── project
│ ├── build.properties
│ └── target
│ ├── config-classes
│ │ ├── $26557af1960df2d28302.cache
│ │ ├── $26557af1960df2d28302.class
│ │ ├── $26557af1960df2d28302$.class
│ │ ├── $2c66babbd77e41c16c42.cache
│ │ ├── $2c66babbd77e41c16c42.class
│ │ ├── $2c66babbd77e41c16c42$.class
│ │ ├── $3f5e9b4c00ec2e641532.cache
│ │ ├── $3f5e9b4c00ec2e641532.class
│ │ ├── $3f5e9b4c00ec2e641532$.class
│ │ ├── $f4aa3cb48b96f0931374.cache
│ │ ├── $f4aa3cb48b96f0931374.class
│ │ └── $f4aa3cb48b96f0931374$.class
│ ├── scala-2.12
│ │ └── sbt-1.0
│ │ └── update
│ │ └── update_cache_2.12
│ │ ├── inputs
│ │ └── output
│ └── streams
│ ├── compile
│ │ ├── bspReporter
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── compile
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── compileIncremental
│ │ │ └── _global
│ │ │ └── streams
│ │ │ ├── export
│ │ │ └── out
│ │ ├── copyResources
│ │ │ └── _global
│ │ │ └── streams
│ │ │ ├── copy-resources
│ │ │ └── out
│ │ ├── dependencyClasspath
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── export
│ │ ├── exportedProducts
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── export
│ │ ├── externalDependencyClasspath
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── export
│ │ ├── _global
│ │ │ └── _global
│ │ │ ├── compileOutputs
│ │ │ │ └── previous
│ │ │ └── discoveredMainClasses
│ │ │ └── data
│ │ ├── incOptions
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── internalDependencyClasspath
│ │ │ └── _global
│ │ │ └── streams
│ │ │ ├── export
│ │ │ └── out
│ │ ├── managedClasspath
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── export
│ │ ├── scalacOptions
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── unmanagedClasspath
│ │ │ └── _global
│ │ │ └── streams
│ │ │ ├── export
│ │ │ └── out
│ │ └── unmanagedJars
│ │ └── _global
│ │ └── streams
│ │ └── export
│ ├── _global
│ │ ├── csrConfiguration
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── csrProject
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── dependencyPositions
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── update_cache_2.12
│ │ │ ├── input_dsp
│ │ │ └── output_dsp
│ │ ├── _global
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── ivyConfiguration
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── ivySbt
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── moduleSettings
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── projectDescriptors
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── scalaCompilerBridgeScope
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ └── update
│ │ └── _global
│ │ └── streams
│ │ └── out
│ └── runtime
│ ├── dependencyClasspath
│ │ └── _global
│ │ └── streams
│ │ └── export
│ ├── exportedProducts
│ │ └── _global
│ │ └── streams
│ │ └── export
│ ├── externalDependencyClasspath
│ │ └── _global
│ │ └── streams
│ │ └── export
│ ├── fullClasspath
│ │ └── _global
│ │ └── streams
│ │ └── export
│ ├── internalDependencyClasspath
│ │ └── _global
│ │ └── streams
│ │ ├── export
│ │ └── out
│ ├── managedClasspath
│ │ └── _global
│ │ └── streams
│ │ └── export
│ ├── unmanagedClasspath
│ │ └── _global
│ │ └── streams
│ │ ├── export
│ │ └── out
│ └── unmanagedJars
│ └── _global
│ └── streams
│ └── export
├── simple.sbt
├── src
│ └── main
│ └── scala
│ └── TestApp.scala
└── target
├── global-logging
├── scala-2.12
│ ├── classes
│ │ ├── TestApp.class
│ │ └── TestApp$.class
│ ├── my-scala-project_2.12-0.1.jar
│ ├── update
│ │ └── update_cache_2.12
│ │ ├── inputs
│ │ └── output
│ └── zinc
│ └── inc_compile_2.12.zip
├── streams
│ ├── compile
│ │ ├── bspReporter
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── compile
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── compileIncremental
│ │ │ └── _global
│ │ │ └── streams
│ │ │ ├── export
│ │ │ └── out
│ │ ├── copyResources
│ │ │ └── _global
│ │ │ └── streams
│ │ │ ├── copy-resources
│ │ │ └── out
│ │ ├── dependencyClasspath
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── export
│ │ ├── externalDependencyClasspath
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── export
│ │ ├── _global
│ │ │ └── _global
│ │ │ ├── compileOutputs
│ │ │ │ └── previous
│ │ │ └── discoveredMainClasses
│ │ │ └── data
│ │ ├── incOptions
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── internalDependencyClasspath
│ │ │ └── _global
│ │ │ └── streams
│ │ │ ├── export
│ │ │ └── out
│ │ ├── mainClass
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── managedClasspath
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── export
│ │ ├── packageBin
│ │ │ └── _global
│ │ │ └── streams
│ │ │ ├── inputs
│ │ │ ├── out
│ │ │ └── output
│ │ ├── scalacOptions
│ │ │ └── _global
│ │ │ └── streams
│ │ │ └── out
│ │ ├── unmanagedClasspath
│ │ │ └── _global
│ │ │ └── streams
│ │ │ ├── export
│ │ │ └── out
│ │ └── unmanagedJars
│ │ └── _global
│ │ └── streams
│ │ └── export
│ └── _global
│ ├── csrConfiguration
│ │ └── _global
│ │ └── streams
│ │ └── out
│ ├── csrProject
│ │ └── _global
│ │ └── streams
│ │ └── out
│ ├── dependencyPositions
│ │ └── _global
│ │ └── streams
│ │ └── update_cache_2.12
│ │ ├── input_dsp
│ │ └── output_dsp
│ ├── _global
│ │ └── _global
│ │ └── streams
│ │ └── out
│ ├── ivyConfiguration
│ │ └── _global
│ │ └── streams
│ │ └── out
│ ├── ivySbt
│ │ └── _global
│ │ └── streams
│ │ └── out
│ ├── moduleSettings
│ │ └── _global
│ │ └── streams
│ │ └── out
│ ├── projectDescriptors
│ │ └── _global
│ │ └── streams
│ │ └── out
│ ├── scalaCompilerBridgeScope
│ │ └── _global
│ │ └── streams
│ │ └── out
│ └── update
│ └── _global
│ └── streams
│ └── out
└── task-temp-directory
200 directories, 96 files
step 5) spark-submit test
[hadoop@ip-10-0-1-70 ~]$ spark-submit --deploy-mode client /home/hadoop/my_scala_pj/target/scala-2.12/my-scala-project_2.12-0.1.jar
20/11/22 06:51:35 INFO SparkContext: Running Spark version 3.0.0-amzn-0
20/11/22 06:51:35 INFO ResourceUtils: ==============================================================
20/11/22 06:51:35 INFO ResourceUtils: Resources for spark.driver:
20/11/22 06:51:35 INFO ResourceUtils: ==============================================================
20/11/22 06:51:35 INFO SparkContext: Submitted application: my_app
20/11/22 06:51:35 INFO SecurityManager: Changing view acls to: hadoop
20/11/22 06:51:35 INFO SecurityManager: Changing modify acls to: hadoop
20/11/22 06:51:35 INFO SecurityManager: Changing view acls groups to:
20/11/22 06:51:35 INFO SecurityManager: Changing modify acls groups to:
20/11/22 06:51:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
20/11/22 06:51:35 INFO Utils: Successfully started service 'sparkDriver' on port 46701.
20/11/22 06:51:35 INFO SparkEnv: Registering MapOutputTracker
20/11/22 06:51:35 INFO SparkEnv: Registering BlockManagerMaster
20/11/22 06:51:35 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/11/22 06:51:35 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/11/22 06:51:36 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
20/11/22 06:51:36 INFO DiskBlockManager: Created local directory at /mnt/tmp/blockmgr-1dbf36f4-4555-4043-8b2d-xxxxxxxxxxxx
20/11/22 06:51:36 INFO MemoryStore: MemoryStore started with capacity 1028.8 MiB
20/11/22 06:51:36 INFO SparkEnv: Registering OutputCommitCoordinator
20/11/22 06:51:36 INFO log: Logging initialized @2638ms to org.sparkproject.jetty.util.log.Slf4jLog
20/11/22 06:51:36 INFO Server: jetty-9.4.20.v20190813; built: 2019-08-13T21:28:18.144Z; git: xxxxxxxxxxxxxxxxxxx; jvm 1.8.0_252-b09
20/11/22 06:51:36 INFO Server: Started @2748ms
20/11/22 06:51:36 INFO AbstractConnector: Started ServerConnector@4f7c0be3{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/11/22 06:51:36 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/11/22 06:51:36 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4a1e3ac1{/jobs,null,AVAILABLE,@Spark}
...
20/11/22 06:52:00 INFO MultipartUploadOutputStream: close closed:false s3://pms-bucket-test/test_folder/_SUCCESS
20/11/22 06:52:01 INFO BlockManagerInfo: Removed broadcast_1_piece0 on ip-10-0-1-70.ap-northeast-2.compute.internal:44339 in memory (size: 76.6 KiB, free: 1028.8 MiB)
20/11/22 06:52:01 INFO BlockManagerInfo: Removed broadcast_1_piece0 on ip-10-0-1-70.ap-northeast-2.compute.internal:41353 in memory (size: 76.6 KiB, free: 2.6 GiB)
20/11/22 06:52:01 INFO FileFormatWriter: Write Job 75e58e92-278e-4bd0-8ec4-ff25d5a95b4f committed.
20/11/22 06:52:01 INFO FileFormatWriter: Finished processing stats for write job 75e58e92-278e-4bd0-8ec4-xxxxxxxxxx.
Job done
20/11/22 06:52:01 INFO AbstractConnector: Stopped Spark@4f7c0be3{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/11/22 06:52:01 INFO SparkUI: Stopped Spark web UI at http://ip-10-0-1-70.ap-northeast-2.compute.internal:4040
20/11/22 06:52:01 INFO YarnClientSchedulerBackend: Interrupting monitor thread
20/11/22 06:52:01 INFO YarnClientSchedulerBackend: Shutting down all executors
20/11/22 06:52:01 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
20/11/22 06:52:01 INFO YarnClientSchedulerBackend: YARN client scheduler backend Stopped
20/11/22 06:52:01 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/11/22 06:52:01 INFO MemoryStore: MemoryStore cleared
20/11/22 06:52:01 INFO BlockManager: BlockManager stopped
20/11/22 06:52:01 INFO BlockManagerMaster: BlockManagerMaster stopped
20/11/22 06:52:01 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/11/22 06:52:01 INFO SparkContext: Successfully stopped SparkContext
20/11/22 06:52:01 INFO ShutdownHookManager: Shutdown hook called
20/11/22 06:52:01 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-54537dcf-0233-4fd6-a929-46767d90b6e0
20/11/22 06:52:01 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-f1de40ef-1032-4235-b334-39078403f24e