Spark core concepts

2020-12-27


Data_Engineering_TIL_(20201227)

References (source):

  • Notes compiled after reading and studying the blog post “Spark core concepts explained”.

  • URL : https://luminousmen.com/post/spark-core-concepts-explained

1. Overview

Spark's architecture is built on the following two main concepts:

  • Resilient Distributed Dataset (RDD)

  • Directed Acyclic Graph (DAG)

2. Resilient Distributed Dataset (RDD)

RDD is Spark's core data abstraction: an immutable, partitioned collection of elements that can be processed in parallel across the cluster.

Example:

rdd = sc.parallelize(range(20))  # create RDD

rdd
PythonRDD[1] at RDD at PythonRDD.scala:53
rdd.collect()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
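
Besides parallelize, an RDD can also be created from an external data source. A minimal sketch of my own (not from the original post), assuming a hypothetical data.txt reachable by the cluster:

# Not from the original post: creating an RDD from a text file instead of a
# Python collection. "data.txt" is a placeholder path.
lines = sc.textFile("data.txt")                     # one element per line
words = lines.flatMap(lambda line: line.split())    # transformation: split lines into words
words.take(5)                                       # action: fetch the first five elements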

Let’s check the number of partitions and data on them:

# get the current number of partitions
rdd.getNumPartitions()
2
# collect data on driver based on partitions
rdd.glom().collect()
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
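
The two partitions here come from Spark's default parallelism; the partition count can also be set explicitly. A small sketch of my own (not from the original post), reusing the same sc:

# Not from the original post: controlling the number of partitions explicitly.
rdd4 = sc.parallelize(range(20), numSlices=4)    # request 4 partitions up front
print(rdd4.getNumPartitions())                   # -> 4

# repartition() shuffles the data into a new number of partitions
print(rdd4.repartition(2).getNumPartitions())    # -> 2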

Let’s apply a filter transformation to our data:

rdd = sc.parallelize(range(20))
filteredRDD = rdd.filter(lambda x: x > 10)

## to see the execution graph; only one stage is created
print(filteredRDD.toDebugString())
b'(2) PythonRDD[1] at RDD at PythonRDD.scala:53 []\n |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195 []'
filteredRDD.collect()
[11, 12, 13, 14, 15, 16, 17, 18, 19]
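
Note that filter, like every transformation, is lazy: defining it returns immediately, and nothing is computed until an action runs. A small illustration of my own (not from the original post):

import time

# Not from the original post: a rough way to see that transformations are lazy.
def slow_double(x):
    time.sleep(0.1)     # simulate expensive per-element work
    return x * 2

start = time.time()
mapped = filteredRDD.map(slow_double)            # transformation: returns almost instantly
print("after map:", time.time() - start)

start = time.time()
print(mapped.collect())                          # action: the sleeps actually run here
print("after collect:", time.time() - start)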

## group data based on mod
groupedRDD = filteredRDD.groupBy(lambda x: x % 2)
## two separate stages are created, because of the shuffle
print(groupedRDD.toDebugString())
b'(2) PythonRDD[6] at RDD at PythonRDD.scala:53 []\n |  MapPartitionsRDD[5] at mapPartitions at PythonRDD.scala:133 []\n |  ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:0 []\n +-(2) PairwiseRDD[3] at groupBy at <stdin>:2 []\n    |  PythonRDD[2] at groupBy at <stdin>:2 []\n    |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195 []'
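
groupBy pairs each key with an iterable of values; to inspect the result you can convert those iterables to lists when collecting. This snippet is mine, not from the original post:

# Not from the original post: materializing the (key, iterable) pairs from groupBy.
[(k, sorted(v)) for k, v in groupedRDD.collect()]
# e.g. [(0, [12, 14, 16, 18]), (1, [11, 13, 15, 17, 19])]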

* Actions

Actions are applied when it is necessary to materialize the result: save the data to disk, write it to a database, or print part of it to the console. The collect operation we have used so far is also an action; it collects data.

Actions are not lazy; they actually trigger the data processing. Actions are RDD operations that produce values that are not RDDs.

Let's look at an example that uses the reduce action to sum the filtered data:

filteredRDD.reduce(lambda a, b: a + b)
135
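
A few other common actions, shown on the same filtered RDD. These examples are mine, not from the original post:

# Not from the original post: a few more common actions on filteredRDD ([11..19]).
filteredRDD.count()   # number of elements      -> 9
filteredRDD.take(3)   # first three elements    -> [11, 12, 13]
filteredRDD.first()   # first element           -> 11
filteredRDD.sum()     # sum of all elements     -> 135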

3. DAG

Unlike Hadoop, where the user has to break all the operations down into smaller tasks and chain them together in order to use MapReduce, Spark defines tasks that can be computed in parallel on the partitioned data across the cluster. From these tasks, Spark builds a logical flow of operations that can be represented as a directed acyclic graph (DAG), where a node represents an RDD partition and an edge represents a transformation applied to the data. Spark builds this execution plan implicitly from the application the user provides.

DAGScheduler computes a DAG of stages for each job. A stage consists of tasks based on the partitions of the input data. DAGScheduler merges some transformations together; for example, many map operators can be combined into one stage. The end result of DAGScheduler is an optimal set of stages in the form of a TaskSet, which is then passed to TaskScheduler. The number of tasks in a stage depends on the number of partitions. TaskScheduler launches the tasks through the cluster manager; it knows nothing about the dependencies between stages.
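
For example, a chain of narrow transformations such as map and filter is pipelined into a single stage, while a shuffle operation such as reduceByKey introduces a stage boundary. A sketch of my own (not from the original post), using the same toDebugString inspection as above:

# Not from the original post: narrow transformations are pipelined into one stage,
# while a shuffle (reduceByKey) splits the lineage into two stages.
rdd = sc.parallelize(range(20))
pipelined = rdd.map(lambda x: x * 2).filter(lambda x: x > 10)                     # one stage
shuffled = pipelined.map(lambda x: (x % 3, x)).reduceByKey(lambda a, b: a + b)    # adds a shuffle

print(pipelined.toDebugString())   # a single, flat lineage block
print(shuffled.toDebugString())    # the indented block marks the extra shuffle stage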

RDDs can determine preferred locations for processing partitions. DAGScheduler places computation so that it will be as close to the data as possible (data locality).