Spark application components and execution flow

2020-12-26


Data_Engineering_TIL(20201226)

[Reference materials]

  • Notes written up after reading and studying the blog post “Spark. Anatomy of Spark application”

URL : https://luminousmen.com/post/spark-anatomy-of-spark-application

[Study notes]


  • Spark Application running steps

Example application

from pyspark import SparkConf
from pyspark.sql import SparkSession

# initialization of the Spark session
# (appName and master are placeholders for the application name and master URL)
conf = SparkConf().setAppName(appName).setMaster(master)
spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .config(conf=conf)\
        .getOrCreate()
sc = spark.sparkContext

# read data from HDFS; as a result we get an RDD of lines
linesRDD = sc.textFile("hdfs://...")

# from the RDD of lines create an RDD of words
wordsRDD = linesRDD.flatMap(lambda line: line.split(" "))

# from the RDD of words make an RDD of (word, count) tuples,
# where the count starts at 1
wordCountRDD = wordsRDD.map(lambda word: (word, 1))

# combine elements with the same word value
resultRDD = wordCountRDD.reduceByKey(lambda a, b: a + b)

# write the result back to HDFS
resultRDD.saveAsTextFile("hdfs://...")
spark.stop()


step 1) When we submit the Spark application in cluster mode, the spark-submit utility communicates with the Cluster Resource Manager to start the Application Master.

step 2) The Resource Manager is then responsible for selecting a suitable container in which to run the Application Master, and it tells a specific Node Manager to launch the Application Master there.

step 3) The Application Master registers with the Resource Manager. Registration allows the client program to request information from the Resource Manager; that information lets the client program communicate directly with its own Application Master.

step 4) The Spark Driver then runs inside the Application Master container (in the case of cluster mode).

step 5) The driver implicitly converts user code containing transformations and actions into a logical plan called a DAG. All RDDs are created in the driver and do nothing until an action is called. At this stage, the driver also performs optimizations such as pipelining narrow transformations.
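
As a rough illustration of this lazy behavior (not from the original post; the session name and sample data are made up), the transformations below only build lineage, and toDebugString() shows the logical plan the driver has recorded before any action runs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyDAGDemo").getOrCreate()
sc = spark.sparkContext

# transformations only record lineage; no job is launched yet
linesRDD = sc.parallelize(["a b a", "b c"])  # stand-in for the HDFS input
wordsRDD = linesRDD.flatMap(lambda line: line.split(" "))
wordCountRDD = wordsRDD.map(lambda word: (word, 1))
resultRDD = wordCountRDD.reduceByKey(lambda a, b: a + b)

# print the lineage (logical plan) built so far -- still nothing executed
print(resultRDD.toDebugString())

# only this action triggers a job
print(resultRDD.collect())

spark.stop()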

step 6) The driver then converts the DAG into a physical execution plan. During this conversion it creates, within each stage, physical execution units called tasks.
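
The number of tasks in a stage matches the number of partitions of the RDD that the stage processes. A small sketch (not from the post; the partition counts are arbitrary) using getNumPartitions():

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StageTaskDemo").getOrCreate()
sc = spark.sparkContext

# 4 input partitions -> the first (map-side) stage consists of 4 tasks
rdd = sc.parallelize(range(100), 4)
print(rdd.getNumPartitions())    # 4

# the stage after the shuffle gets its own number of partitions/tasks
counts = rdd.map(lambda x: (x % 10, 1)).reduceByKey(lambda a, b: a + b, 8)
print(counts.getNumPartitions())    # 8

spark.stop()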

step 7) The Application Master now communicates with the Cluster Manager and negotiates resources. The Cluster Manager allocates containers and asks the appropriate Node Managers to run executors on all selected containers. When the executors start, they register with the Driver, so the Driver has a complete view of its executors.
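
How many executor containers are requested, and how large they are, is driven by the application's configuration. A hedged sketch (the property names are standard Spark settings; the values and session name are arbitrary):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# ask the cluster manager for 2 executors, each with 2 cores and 2g of memory
conf = (SparkConf()
        .set("spark.executor.instances", "2")
        .set("spark.executor.cores", "2")
        .set("spark.executor.memory", "2g"))

spark = SparkSession\
        .builder\
        .appName("ExecutorConfDemo")\
        .config(conf=conf)\
        .getOrCreate()

spark.stop()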

step 8) At this point, the Driver will send tasks to executors via Cluster Manager based on the data placement.

step 9) The code of the user application is launched inside the container. It provides information (stage of execution, status) to the Application Master.

step 10) At this stage, we will start to execute our code. Our first RDD will be created by reading data in parallel from HDFS to different partitions on different nodes based on HDFS InputFormat. Thus, each node will have a subset of data.
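
To see that each partition holds a subset of the data, glom() can be used; it groups the elements of every partition into a list. A sketch (the HDFS path is the same placeholder as in the example above; minPartitions is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitionDemo").getOrCreate()
sc = spark.sparkContext

# textFile() splits the input according to the HDFS InputFormat;
# minPartitions only sets a lower bound on the number of splits
linesRDD = sc.textFile("hdfs://...", minPartitions=4)

# glom() returns one list per partition, showing which lines landed where
# (only sensible for small samples, since it collects to the driver)
print(linesRDD.glom().collect())

spark.stop()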

step 11) After reading the data, we have two map-like transformations (flatMap and map) which will be executed in parallel on each partition.

step 12) Next, we have a reduceByKey transformation. It is not a narrow transformation like map, so it creates an additional stage. It first combines records with the same key within each partition, then moves data between nodes and partitions (the shuffle) so that all records with the same key end up together.
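
A small sketch of this behavior (sample data and partition count are made up): reduceByKey merges values with the same key inside each partition first, so only the partial sums cross the network during the shuffle, and the resulting stage boundary is visible in the lineage:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ShuffleDemo").getOrCreate()
sc = spark.sparkContext

wordCountRDD = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1)], 2)

# map-side combine inside each partition, then a shuffle of the partial sums
resultRDD = wordCountRDD.reduceByKey(lambda a, b: a + b)
print(resultRDD.collect())       # e.g. [('b', 1), ('a', 3)]

# the shuffled RDD in the lineage marks the boundary between the two stages
print(resultRDD.toDebugString())

spark.stop()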

step 13) We then perform an action, writing the result back to HDFS, which triggers execution of the entire DAG. During the execution of the user application, the client communicates with the Application Master to obtain the application status.

step 14) When the application finishes executing and all of the necessary work is done, the Application Master disconnects itself from the Resource Manager and stops, freeing up its container for other purposes.