EMR operations troubleshooting case - fixing temp-folder access permissions for EMRFS (S3) writes
Data_Engineering_TIL(20201101)
[Problem]
Running a PySpark script on the Spark engine with spark-submit fails at the statement that writes the processed data to S3.
(On the EMR master node, running spark-submit --master yarn --deploy-mode client test.py executes every data-processing step in test.py without a problem, but the final statement that writes the result to S3, df.repartition(1).write.mode('overwrite').parquet('s3a://mys3bucket/'), raises the error below.)
org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: /mnt/var/lib/hadoop/tmp/s3a
at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:167)
at org.apache.hadoop.util.DiskChecker.checkDirInternal(DiskChecker.java:100)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:77)
at org.apache.hadoop.util.BasicDiskValidator.checkStatus(BasicDiskValidator.java:32)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:331)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:394)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:477)
at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:213)
at org.apache.hadoop.fs.s3a.S3AFileSystem.createTmpFileForWrite(S3AFileSystem.java:589)
at org.apache.hadoop.fs.s3a.S3ADataBlocks$DiskBlockFactory.create(S3ADataBlocks.java:811)
at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.createBlockIfNeeded(S3ABlockOutputStream.java:190)
at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.<init>(S3ABlockOutputStream.java:168)
at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:822)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1125)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1105)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:994)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:439)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:379)
at com.amazon.emr.committer.FilterParquetOutputCommitter.commitJob(FilterParquetOutputCommitter.java:82)
at com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter.commitJob(EmrOptimizedSparkSqlParquetOutputCommitter.java:9)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:167)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:215)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:180)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:124)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:123)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:944)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:106)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:207)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:88)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:944)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:396)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:380)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:269)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:829)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
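For context, a minimal sketch of a job shaped like test.py. The input path, schema, and transformation here are hypothetical; only the final write mirrors the failing line above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()

# Hypothetical processing steps; any transformation behaves the same way here.
df = spark.read.json("s3a://mys3bucket/raw/")  # hypothetical input prefix
df = df.dropDuplicates()

# The job fails only at this point: before uploading to S3, S3A buffers the
# output blocks to local disk, and creating that temp file is what raises
# the DiskErrorException above.
df.repartition(1).write.mode('overwrite').parquet('s3a://mys3bucket/')

spark.stop()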
[Root Cause]
After the EMR cluster was created, the first job that processed data with Spark and uploaded it to S3 was run from JupyterHub, in a Jupyter notebook with the PySpark kernel. During that run the livy user created /mnt/var/lib/hadoop/tmp/s3a while writing data there, and granted write permission on the directory only to itself. With the directory in that state, writing a file to S3 via spark-submit on the master node hands the work to YARN, and because only livy had write permission on /mnt/var/lib/hadoop/tmp/s3a, the write fails with the error above.
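(/mnt/var/lib/hadoop/tmp/s3a is where the S3A connector buffers output blocks on local disk before uploading them: fs.s3a.buffer.dir defaults to ${hadoop.tmp.dir}/s3a, which is why the DiskBlockFactory frames appear in the stack trace.) An alternative workaround, instead of loosening permissions on the shared directory, is to point each job's buffer at a directory its own user can write. A minimal sketch, assuming the hypothetical buffer path /tmp/s3a is acceptable on your nodes:

from pyspark.sql import SparkSession

# spark.hadoop.* settings are forwarded to the Hadoop configuration, so this
# overrides fs.s3a.buffer.dir for this job only. /tmp/s3a is a hypothetical
# choice; any local path the submitting user can write would work.
spark = (
    SparkSession.builder
    .appName("buffer-dir-workaround")
    .config("spark.hadoop.fs.s3a.buffer.dir", "/tmp/s3a")
    .getOrCreate()
)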
[Solution]
Connect to each of the master and core nodes, grant write permission on the directory to YARN as well, and re-run spark-submit --master yarn --deploy-mode client test.py; this time the job completes successfully.
The steps for granting the write permission on the master and core nodes are shown below.
## 10-0-5-186 is core node 1
[hadoop@ip-10-0-5-186 s3a]$ sudo su - yarn
Last login: Fri Oct 30 05:10:18 UTC 2020
## (EMR login banner omitted)
-bash-4.2$ cd /mnt/var/lib/hadoop/tmp/s3a
-bash-4.2$ ls -ialF
total 0
1424388 drwxr-xr-x 2 yarn yarn 6 Oct 30 11:45 ./
100664302 drwxrwxrwt 4 hadoop hadoop 41 Oct 30 06:55 ../
-bash-4.2$ chmod a+w .
-bash-4.2$ ls -ialF
total 0
1424388 drwxrwxrwx 2 yarn yarn 6 Oct 30 11:45 ./
100664302 drwxrwxrwt 4 hadoop hadoop 41 Oct 30 06:55 ../
## 10-0-5-187 is core node 2
[hadoop@ip-10-0-5-187 ~]$ sudo su - yarn
Last login: Fri Oct 30 07:03:10 UTC 2020
## (EMR login banner omitted)
-bash-4.2$ cd /mnt/var/lib/hadoop/tmp/s3a
-bash-4.2$ ls -al
total 0
drwxr-xr-x 2 yarn yarn 6 Oct 30 11:35 .
drwxrwxrwt 4 hadoop hadoop 41 Oct 30 11:07 ..
-bash-4.2$ chmod a+w .
-bash-4.2$ ls -al
total 0
drwxrwxrwx 2 yarn yarn 6 Oct 30 11:35 .
drwxrwxrwt 4 hadoop hadoop 41 Oct 30 11:07 ..
## 10-0-5-185 is the master node
[hadoop@ip-10-0-5-185 ~]$ sudo su - root
## (EMR login banner omitted)
[root@ip-10-0-5-185 ~]# cd /mnt/var/lib/hadoop/tmp/s3a
[root@ip-10-0-5-185 s3a]# ls -ailF
total 0
268435565 drwxr-xr-x 2 livy livy 6 Oct 30 06:55 ./
33587496 drwxrwxrwt 5 hadoop hadoop 43 Oct 30 06:55 ../
[root@ip-10-0-5-185 s3a]# chmod a+w .
[root@ip-10-0-5-185 s3a]# ls -ialF
total 0
268435565 drwxrwxrwx 2 livy livy 6 Oct 30 06:55 ./
33587496 drwxrwxrwt 5 hadoop hadoop 43 Oct 30 06:55 ../
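Once the permissions are opened, re-running test.py succeeds. For a quick check independent of test.py, a minimal sketch assuming the same cluster and bucket (the permission_check/ prefix is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("permission-check").getOrCreate()

# A one-row DataFrame is enough to force S3A to create a temp file under
# /mnt/var/lib/hadoop/tmp/s3a and upload it - exactly the code path that
# failed before the chmod.
df = spark.createDataFrame([(1, "ok")], ["id", "status"])
df.write.mode("overwrite").parquet("s3a://mys3bucket/permission_check/")

spark.stop()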