YARN Basics for Understanding How Spark Runs

2020-01-17

Data Engineering

Data_Engineering_TIL_(20201227)

[Study references]

These are my study notes on the blog post ‘Things you need to know about Hadoop and YARN being a Spark developer’.

URL : https://luminousmen.com/post/hadoop-yarn-spark

[Notes]

In addition, the OS has a processor component: a kernel, a scheduler, and the threads and processes that allow programs to run on data.

* YARN Architecture

1) Client

can submit any type of application supported by YARN

2) ResourceManager(RM)

keeps track of live NodeManagers and available resources
allocates available resources to appropriate applications and tasks
monitors application masters (MapReduce master, Spark master)

3) ApplicationMaster(AM)

coordinates the execution of all tasks within its application
asks for appropriate resource containers to run tasks

4) NodeManager(NM)

provides computational resources in the form of containers
manages processes running in containers

5) Containers

can run different types of tasks (including the ApplicationMaster itself)
come in different sizes, e.g. in CPU and RAM
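
To make the roles above concrete, here is a toy model in plain Python. It is only a sketch and not the YARN API: the class names mirror the components in the list, but the first-fit placement rule, the 1 vcore / 1024 MB ApplicationMaster size, and the helper `run_application` are assumptions made up for illustration. Real YARN schedulers (Capacity, Fair) are far more involved.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Container:
    node: str        # name of the NodeManager hosting this container
    vcores: int
    memory_mb: int


@dataclass
class NodeManager:
    name: str
    free_vcores: int
    free_memory_mb: int

    def launch(self, vcores: int, memory_mb: int) -> Optional[Container]:
        """Carve a container out of this node's free resources, if it fits."""
        if vcores > self.free_vcores or memory_mb > self.free_memory_mb:
            return None
        self.free_vcores -= vcores
        self.free_memory_mb -= memory_mb
        return Container(self.name, vcores, memory_mb)


@dataclass
class ResourceManager:
    nodes: List[NodeManager] = field(default_factory=list)

    def allocate(self, vcores: int, memory_mb: int) -> Optional[Container]:
        """First-fit placement (an assumption of this sketch, not YARN's policy)."""
        for node in self.nodes:
            container = node.launch(vcores, memory_mb)
            if container is not None:
                return container
        return None


def run_application(
    rm: ResourceManager, num_tasks: int, vcores: int, memory_mb: int
) -> Tuple[Container, List[Optional[Container]]]:
    """Client submits an app: the RM first places the ApplicationMaster,
    then the AM asks the RM for one container per task."""
    am = rm.allocate(1, 1024)  # hypothetical AM size
    if am is None:
        raise RuntimeError("no room for the ApplicationMaster")
    tasks = [rm.allocate(vcores, memory_mb) for _ in range(num_tasks)]
    return am, tasks


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("nm1", 4, 8192), NodeManager("nm2", 4, 8192)])
    am, tasks = run_application(rm, num_tasks=4, vcores=2, memory_mb=2048)
    print("AM on", am.node)
    print("tasks on", [t.node if t else None for t in tasks])
```

A request that no node can satisfy comes back as `None` here; in a real cluster the request would stay pending until resources free up.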

Interesting facts and features

YARN offers a number of other great features. It is beyond the scope of this post to describe them all, but I have included some noteworthy features:

1) Uberization is the ability to run all MapReduce tasks in ApplicationMaster’s JVM if the tasks are small enough. This way, you avoid the overhead associated with requesting containers from ResourceManager and asking NodeManagers to run (presumably small) tasks.
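
The "small enough" check can be sketched roughly as below. This is a simplification under assumptions: the field names are modeled on the `mapreduce.job.ubertask.*` properties (`enable`, `maxmaps`, `maxreduces`, `maxbytes`), but the numeric limits are illustrative placeholders, and the real check also considers whether the tasks' memory and CPU fit inside the ApplicationMaster's container. `UberLimits` and `can_run_uberized` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class UberLimits:
    # Illustrative values only; check mapred-site.xml on your own cluster.
    enabled: bool = True
    max_maps: int = 9
    max_reduces: int = 1
    max_bytes: int = 128 * 1024 * 1024


def can_run_uberized(
    num_maps: int, num_reduces: int, input_bytes: int, limits: UberLimits
) -> bool:
    """Return True if the job is small enough to run inside the AM's JVM."""
    return (
        limits.enabled
        and num_maps <= limits.max_maps
        and num_reduces <= limits.max_reduces
        and input_bytes <= limits.max_bytes
    )


if __name__ == "__main__":
    limits = UberLimits()
    print(can_run_uberized(3, 1, 10 * 1024 * 1024, limits))   # small job
    print(can_run_uberized(50, 1, 10 * 1024 * 1024, limits))  # too many maps
```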

2) Simplified management and access to application log files. Application-generated logs do not remain on individual slave nodes (as in MRv1) but are moved to a central repository, such as HDFS. They can later be used for debugging or for historical analysis to detect performance problems.
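
Once log aggregation is enabled on the cluster, the collected logs can be read back with the `yarn logs -applicationId <id>` command. A minimal Python wrapper might look like this. It assumes the `yarn` CLI is on the PATH, and the application ID in the example is a made-up placeholder; `yarn_logs_command` and `fetch_logs` are hypothetical helper names.

```python
import subprocess
from typing import List


def yarn_logs_command(application_id: str) -> List[str]:
    """Build the argv for fetching an application's aggregated logs."""
    return ["yarn", "logs", "-applicationId", application_id]


def fetch_logs(application_id: str) -> str:
    """Run the command and return its stdout (requires a working YARN client)."""
    result = subprocess.run(
        yarn_logs_command(application_id),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder ID; substitute one from your own cluster.
    print(" ".join(yarn_logs_command("application_1234567890123_0001")))
```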
