YARN Basics for Understanding How Spark Runs

2020-01-17


Data_Engineering_TIL_(20201227)

[Study references]

These are notes I took while studying the blog post ‘Things you need to know about Hadoop and YARN being a Spark developer’.

URL : https://luminousmen.com/post/hadoop-yarn-spark

[Notes]


In addition, the OS has a processing component: a kernel, a scheduler, and the threads and processes that allow programs to run on data.


* YARN Architecture


1) Client

  • can submit any type of application supported by YARN

2) ResourceManager(RM)

  • keeps track of live NodeManagers and available resources

  • allocates available resources to appropriate applications and tasks

  • monitors ApplicationMasters (e.g. the MapReduce master, the Spark master)

3) ApplicationMaster(AM)

  • coordinates the execution of all tasks within its application

  • asks for appropriate resource containers to run tasks

4) NodeManager(NM)

  • provides computational resources in the form of containers

  • manages processes running in containers

5) Containers

  • can run different types of tasks (including an ApplicationMaster)

  • come in different sizes, e.g. different amounts of CPU and RAM
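To make the roles above concrete, here is a minimal toy model in Python. It is only a sketch for study purposes and is not the YARN API: the class names mirror the components, but every method (`register`, `allocate`, `launch`, `request_tasks`) and the first-fit placement rule are assumptions made up for this illustration. Real YARN schedulers (Capacity, Fair) are far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """A slice of one node's resources (hypothetical model, not the YARN API)."""
    node: str
    vcores: int
    memory_mb: int


@dataclass
class NodeManager:
    """Offers its node's resources as containers and tracks what is running."""
    name: str
    vcores: int
    memory_mb: int
    running: list = field(default_factory=list)

    def can_fit(self, vcores, memory_mb):
        return vcores <= self.vcores and memory_mb <= self.memory_mb

    def launch(self, vcores, memory_mb):
        # Reserve the resources on this node and hand back a container.
        self.vcores -= vcores
        self.memory_mb -= memory_mb
        container = Container(self.name, vcores, memory_mb)
        self.running.append(container)
        return container


class ResourceManager:
    """Tracks live NodeManagers and grants containers on request."""

    def __init__(self):
        self.nodes = {}

    def register(self, nm):
        self.nodes[nm.name] = nm

    def allocate(self, vcores, memory_mb):
        # First-fit placement: a deliberate simplification (assumption).
        for nm in self.nodes.values():
            if nm.can_fit(vcores, memory_mb):
                return nm.launch(vcores, memory_mb)
        return None  # no node has enough free resources right now


class ApplicationMaster:
    """Runs in a container itself and asks the RM for task containers."""

    def __init__(self, rm):
        self.rm = rm
        self.containers = []

    def request_tasks(self, count, vcores, memory_mb):
        for _ in range(count):
            container = self.rm.allocate(vcores, memory_mb)
            if container is not None:
                self.containers.append(container)
        return self.containers


def run_demo():
    rm = ResourceManager()
    rm.register(NodeManager("node-1", vcores=4, memory_mb=8192))
    rm.register(NodeManager("node-2", vcores=4, memory_mb=8192))

    # A client submission first gets one container for the AM...
    am_container = rm.allocate(vcores=1, memory_mb=1024)
    # ...and the AM then requests containers for its tasks.
    am = ApplicationMaster(rm)
    task_containers = am.request_tasks(count=3, vcores=2, memory_mb=2048)
    return am_container, task_containers
```

Calling `run_demo()` places the AM container on node-1. The three task containers then spread across both nodes as node-1 runs out of vcores, which is the "containers come in different sizes and land wherever resources are free" idea from the list.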


* Interesting facts and features

YARN offers a number of other great features. It is beyond the scope of this post to describe them all, but I have included some noteworthy features:

1) Uberization is the ability to run all MapReduce tasks in ApplicationMaster’s JVM if the tasks are small enough. This way, you avoid the overhead associated with requesting containers from ResourceManager and asking NodeManagers to run (presumably small) tasks.

2) Simplified management and access to application log files. Application-generated logs do not remain on individual slave nodes (as in MRv1) but are moved to a central repository, such as HDFS. They can later be used for debugging or for historical analysis to detect performance problems.
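As a rough sketch of what "small enough" could mean for uberization (item 1 above), the Python snippet below checks a job against three thresholds. The thresholds are modeled on Hadoop's `mapreduce.job.ubertask.*` properties (`enable`, `maxmaps`, `maxreduces`, `maxbytes`), but the `JobShape` type, the `is_uber_candidate` function, and the specific values are assumptions for illustration. The actual check inside Hadoop considers more than this (for example, whether the tasks fit in the ApplicationMaster's own memory), so treat this as a study aid, not the real logic.

```python
from dataclasses import dataclass


@dataclass
class JobShape:
    """Hypothetical summary of a MapReduce job's size."""
    num_maps: int
    num_reduces: int
    input_bytes: int


# Illustrative thresholds (assumed values); check your cluster's
# mapreduce.job.ubertask.* settings for the real ones.
MAX_MAPS = 9
MAX_REDUCES = 1
MAX_BYTES = 128 * 1024 * 1024


def is_uber_candidate(job, enabled=True):
    """Return True if the job looks small enough to run inside the AM's JVM."""
    return (
        enabled
        and job.num_maps <= MAX_MAPS
        and job.num_reduces <= MAX_REDUCES
        and job.input_bytes <= MAX_BYTES
    )
```

For example, `is_uber_candidate(JobShape(2, 1, 10 * 1024 * 1024))` is true, while a job with 50 map tasks would go through the normal path of requesting containers from the ResourceManager.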