Hands-on: setting up Hadoop in pseudo-distributed mode on EC2

2020-10-05


Data_Engineering_TIL(20200925)

  • study program : T아카데미 - 아파치 하둡 입문과 활용 (Introduction to Apache Hadoop and Its Applications)

** URL : https://tacademy.skplanet.com/frontMain.action

Lab goal: build a Hadoop pseudo-distributed setup on a single EC2 instance.

  • Install JDK 1.8
[ec2-user@ip-10-1-10-221 ~]$ sudo yum install -y java-1.8.0-openjdk-devel.x86_64
[ec2-user@ip-10-1-10-221 ~]$ sudo /usr/sbin/alternatives --config java
There is 1 program that provides 'java'.

  Selection    Command
-----------------------------------------------
*+ 1           java-1.8.0-openjdk.x86_64 (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64/jre/bin/java)

Enter to keep the current selection[+], or type selection number: ## just press Enter here

[ec2-user@ip-10-1-10-221 ~]$ sudo yum remove java-1.7.0-openjdk -y
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
No Match for argument: java-1.7.0-openjdk
No Packages marked for removal

[ec2-user@ip-10-1-10-221 ~]$ java -version
openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)

[ec2-user@ip-10-1-10-178 ~]$ readlink -f $(which java)
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64/jre/bin/java
## So the JAVA_HOME path is /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64 (the path above minus the trailing jre/bin/java).
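
If you would rather not hard-code the version string, the same JAVA_HOME path can be derived from the readlink output; a minimal sketch (it strips the trailing jre/bin/java components):

## readlink resolves to .../jre/bin/java, so peel off three path components
export JAVA_HOME=$(dirname $(dirname $(dirname $(readlink -f $(which java)))))
echo $JAVA_HOME  ## should print /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64
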
  • Install Hadoop 3.3.0
[ec2-user@ip-10-1-10-221 ~]$ wget http://apache.mirror.cdnetworks.com/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
--2020-09-25 03:44:16--  http://apache.mirror.cdnetworks.com/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
Resolving apache.mirror.cdnetworks.com (apache.mirror.cdnetworks.com)... 14.0.101.165
Connecting to apache.mirror.cdnetworks.com (apache.mirror.cdnetworks.com)|14.0.101.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 500749234 (478M) [application/x-gzip]
Saving to: hadoop-3.3.0.tar.gz

100%[=======================================================================================>] 500,749,234  105MB/s   in 7.3s

2020-09-25 03:44:24 (65.1 MB/s) - hadoop-3.3.0.tar.gz saved [500749234/500749234]

[ec2-user@ip-10-1-10-221 ~]$ tar -zxvf hadoop-3.3.0.tar.gz
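
Since the download came from a third-party mirror, it can also be worth verifying the tarball against the official digest; a sketch, assuming archive.apache.org still hosts the checksum file for this release:

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz.sha512
sha512sum hadoop-3.3.0.tar.gz  ## compare the digest by eye against the downloaded .sha512 file
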
  • Install Maven and Hive 3.1.2 (optional; not required for this lab, installed only for use in other exercises)
[ec2-user@ip-10-1-10-221 ~]$ sudo wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo
[ec2-user@ip-10-1-10-221 ~]$ sudo sed -i s/\$releasever/6/g /etc/yum.repos.d/epel-apache-maven.repo
[ec2-user@ip-10-1-10-221 ~]$ sudo yum install -y apache-maven
[ec2-user@ip-10-1-10-221 ~]$ mvn --version
Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z)
Maven home: /usr/share/apache-maven
Java version: 1.8.0_265, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.14.193-149.317.amzn2.x86_64", arch: "amd64", family: "unix"
                
[ec2-user@ip-10-1-10-221 ~]$ wget http://apache.mirror.cdnetworks.com/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
--2020-09-25 03:46:36--  http://apache.mirror.cdnetworks.com/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
Resolving apache.mirror.cdnetworks.com (apache.mirror.cdnetworks.com)... 14.0.101.165
Connecting to apache.mirror.cdnetworks.com (apache.mirror.cdnetworks.com)|14.0.101.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 278813748 (266M) [application/x-gzip]
Saving to: apache-hive-3.1.2-bin.tar.gz

100%[=======================================================================================>] 278,813,748  104MB/s   in 2.5s

2020-09-25 03:46:38 (104 MB/s) - apache-hive-3.1.2-bin.tar.gz saved [278813748/278813748]
        
[ec2-user@ip-10-1-10-221 ~]$ tar -zxvf apache-hive-3.1.2-bin.tar.gz
  • Create a Platform directory, move Hadoop and Hive into it, then set environment variables for Java and Hadoop
[ec2-user@ip-10-1-10-221 ~]$ pwd
/home/ec2-user

[ec2-user@ip-10-1-10-221 ~]$ ls
apache-hive-3.1.2-bin  apache-hive-3.1.2-bin.tar.gz  hadoop-3.3.0  hadoop-3.3.0.tar.gz

[ec2-user@ip-10-1-10-221 ~]$ mkdir Platform

[ec2-user@ip-10-1-10-221 ~]$ mv apache-hive-3.1.2-bin Platform/

[ec2-user@ip-10-1-10-221 ~]$ mv apache-hive-3.1.2-bin.tar.gz Platform/

[ec2-user@ip-10-1-10-221 ~]$ mv hadoop-3.3.0 Platform/

[ec2-user@ip-10-1-10-221 ~]$ mv hadoop-3.3.0.tar.gz Platform/

[ec2-user@ip-10-1-10-221 ~]$ ls
Platform

[ec2-user@ip-10-1-10-221 ~]$ cd Platform/

[ec2-user@ip-10-1-10-221 Platform]$ ls
apache-hive-3.1.2-bin  apache-hive-3.1.2-bin.tar.gz  hadoop-3.3.0  hadoop-3.3.0.tar.gz

[ec2-user@ip-10-1-10-178 Platform]$ cd ~

[ec2-user@ip-10-1-10-146 ~]$ sudo vim .bash_profile 
# append the following at the very bottom of the file
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64
export HADOOP_HOME=/home/ec2-user/Platform/hadoop-3.3.0
export PATH=$HADOOP_HOME/bin:$PATH

[ec2-user@ip-10-1-10-178 ~]$ source .bash_profile


## confirm that .bash_profile was applied correctly
[ec2-user@ip-10-1-10-178 ~]$ $JAVA_HOME/bin/javac -version
javac 1.8.0_265

[ec2-user@ip-10-1-10-178 ~]$ echo $JAVA_HOME
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64
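
Since $HADOOP_HOME/bin is now on PATH, the hadoop command itself should resolve as well; a quick sanity check:

hadoop version  ## should report Hadoop 3.3.0
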
  • SSH setup
[ec2-user@ip-10-1-10-221 ~]$ cd ~ # move to /home/ec2-user
[ec2-user@ip-10-1-10-178 ~]$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Your identification has been saved in /home/ec2-user/.ssh/id_rsa.
Your public key has been saved in /home/ec2-user/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx ec2-user@ip-10-1-10-178.ap-northeast-2.compute.internal
The key's randomart image is:
+---[RSA 2048]----+
|                 |
|                 |
|                 |
|                 |
|                 |
|                 |
|                 |
|                 |
|                 |
+----[SHA256]-----+
[ec2-user@ip-10-1-10-178 ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[ec2-user@ip-10-1-10-178 ~]$ chmod 0600 ~/.ssh/authorized_keys
[ec2-user@ip-10-1-10-178 ~]$ ssh localhost
Last login: Sun Oct  4 10:38:48 2020 from 61.101.189.4

       __|  __|_  )
       _|  (     /   Amazon Linux 2 AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-2/
[ec2-user@ip-10-1-10-178 ~]$ exit
logout
Connection to localhost closed.

For reference, if you want to build an actual Hadoop cluster with several nodes plus a separate Hadoop client, additional detailed work such as SSH configuration is needed; see https://minman2115.github.io/DE_TIL112/.

  • Minimal HDFS configuration
[ec2-user@ip-10-1-10-221 ~]$ cd /home/ec2-user/Platform/hadoop-3.3.0/etc/hadoop
[ec2-user@ip-10-1-10-221 hadoop]$ ls
capacity-scheduler.xml      hadoop-user-functions.sh.example  kms-log4j.properties        ssl-client.xml.example
configuration.xsl           hdfs-rbf-site.xml                 kms-site.xml                ssl-server.xml.example
container-executor.cfg      hdfs-site.xml                     log4j.properties            user_ec_policies.xml.template
core-site.xml               httpfs-env.sh                     mapred-env.cmd              workers
hadoop-env.cmd              httpfs-log4j.properties           mapred-env.sh               yarn-env.cmd
hadoop-env.sh               httpfs-site.xml                   mapred-queues.xml.template  yarn-env.sh
hadoop-metrics2.properties  kms-acls.xml                      mapred-site.xml             yarnservice-log4j.properties
hadoop-policy.xml           kms-env.sh                        shellprofile.d              yarn-site.xml


[ec2-user@ip-10-1-10-221 hadoop]$ sudo vim hadoop-env.sh

## find the JAVA_HOME section and add the line below
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64

## and uncomment the following lines
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}

[ec2-user@ip-10-1-10-221 hadoop]$ mkdir /home/ec2-user/hadoop_temp
[ec2-user@ip-10-1-10-221 hadoop]$ sudo vim core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/ec2-user/hadoop_temp</value> <!-- the /home/ec2-user/hadoop_temp directory must be created beforehand -->
    </property>
</configuration>


[ec2-user@ip-10-1-10-221 hadoop]$ sudo vim hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
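
Before formatting the namenode, it can help to confirm that Hadoop actually resolves these properties; a quick check, assuming $HADOOP_HOME/bin is on PATH (hdfs getconf prints the effective value of a key):

hdfs getconf -confKey fs.defaultFS     ## expect hdfs://localhost:9000
hdfs getconf -confKey dfs.replication  ## expect 1
hdfs getconf -confKey dfs.namenode.name.dir
## if the last one does not show /home/ec2-user/hadoop_temp, note that this
## property conventionally lives in hdfs-site.xml rather than core-site.xml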


[ec2-user@ip-10-1-10-221 ~]$ cd /home/ec2-user/Platform/hadoop-3.3.0
[ec2-user@ip-10-1-10-221 hadoop-3.3.0]$ pwd
/home/ec2-user/Platform/hadoop-3.3.0            
[ec2-user@ip-10-1-10-221 hadoop-3.3.0]$ bin/hdfs namenode -format # when copy-pasting, make sure -format uses a plain short hyphen, not a long dash
2020-10-04 12:23:57,237 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ip-10-1-10-178.ap-northeast-2.compute.internal/10.1.10.178
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.3.0
STARTUP_MSG:   classpath = /home/ec2-user/Platform/hadoop-3.3.0/etc/hadoop:/home/ec2-user/Platform/hadoop-3.3.0 ... (omitted)
STARTUP_MSG:   build = https://gitbox.apache.org/repos/asf/hadoop.git -r aa96f1871bfd858f9bac59cf2a81ec470da649af; compiled by 'brahma' on 2020-07-06T18:44Z
STARTUP_MSG:   java = 1.8.0_265
************************************************************/
2020-10-05 04:55:33,549 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
2020-10-05 04:55:33,683 INFO namenode.NameNode: createNameNode [-format]
2020-10-05 04:55:34,434 INFO namenode.NameNode: Formatting using clusterid: CID-0434521d-37d1-4523-be97-5c2fb5b4172d
2020-10-05 04:55:34,480 INFO namenode.FSEditLog: Edit logging is async:true
2020-10-05 04:55:34,535 INFO namenode.FSNamesystem: KeyProvider: null
2020-10-05 04:55:34,540 INFO namenode.FSNamesystem: fsLock is fair: true
2020-10-05 04:55:34,544 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
2020-10-05 04:55:34,558 INFO namenode.FSNamesystem: fsOwner                = ec2-user (auth:SIMPLE)
2020-10-05 04:55:34,558 INFO namenode.FSNamesystem: supergroup             = supergroup
2020-10-05 04:55:34,558 INFO namenode.FSNamesystem: isPermissionEnabled    = true
2020-10-05 04:55:34,558 INFO namenode.FSNamesystem: isStoragePolicyEnabled = true
2020-10-05 04:55:34,558 INFO namenode.FSNamesystem: HA Enabled: false
2020-10-05 04:55:34,605 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2020-10-05 04:55:34,615 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit: configured=1000, counted=60, effected=1000
2020-10-05 04:55:34,615 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
2020-10-05 04:55:34,618 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2020-10-05 04:55:34,619 INFO blockmanagement.BlockManager: The block deletion will start around 2020 Oct 05 04:55:34
2020-10-05 04:55:34,620 INFO util.GSet: Computing capacity for map BlocksMap
2020-10-05 04:55:34,620 INFO util.GSet: VM type       = 64-bit
2020-10-05 04:55:34,623 INFO util.GSet: 2.0% max memory 436 MB = 8.7 MB
2020-10-05 04:55:34,623 INFO util.GSet: capacity      = 2^20 = 1048576 entries
2020-10-05 04:55:34,630 INFO blockmanagement.BlockManager: Storage policy satisfier is disabled
2020-10-05 04:55:34,631 INFO blockmanagement.BlockManager: dfs.block.access.token.enable = false
2020-10-05 04:55:34,639 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.threshold-pct = 0.999
2020-10-05 04:55:34,639 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.min.datanodes = 0
2020-10-05 04:55:34,640 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.extension = 30000
2020-10-05 04:55:34,640 INFO blockmanagement.BlockManager: defaultReplication         = 1
2020-10-05 04:55:34,640 INFO blockmanagement.BlockManager: maxReplication             = 512
2020-10-05 04:55:34,640 INFO blockmanagement.BlockManager: minReplication             = 1
2020-10-05 04:55:34,640 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
2020-10-05 04:55:34,641 INFO blockmanagement.BlockManager: redundancyRecheckInterval  = 3000ms
2020-10-05 04:55:34,641 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
2020-10-05 04:55:34,641 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
2020-10-05 04:55:34,680 INFO namenode.FSDirectory: GLOBAL serial map: bits=29 maxEntries=536870911
2020-10-05 04:55:34,680 INFO namenode.FSDirectory: USER serial map: bits=24 maxEntries=16777215
2020-10-05 04:55:34,680 INFO namenode.FSDirectory: GROUP serial map: bits=24 maxEntries=16777215
2020-10-05 04:55:34,680 INFO namenode.FSDirectory: XATTR serial map: bits=24 maxEntries=16777215
2020-10-05 04:55:34,699 INFO util.GSet: Computing capacity for map INodeMap
2020-10-05 04:55:34,699 INFO util.GSet: VM type       = 64-bit
2020-10-05 04:55:34,699 INFO util.GSet: 1.0% max memory 436 MB = 4.4 MB
2020-10-05 04:55:34,700 INFO util.GSet: capacity      = 2^19 = 524288 entries
2020-10-05 04:55:34,702 INFO namenode.FSDirectory: ACLs enabled? true
2020-10-05 04:55:34,702 INFO namenode.FSDirectory: POSIX ACL inheritance enabled? true
2020-10-05 04:55:34,702 INFO namenode.FSDirectory: XAttrs enabled? true
2020-10-05 04:55:34,702 INFO namenode.NameNode: Caching file names occurring more than 10 times
2020-10-05 04:55:34,708 INFO snapshot.SnapshotManager: Loaded config captureOpenFiles: false, skipCaptureAccessTimeOnlyChange: false, snapshotDiffAllowSnapRootDescendant: true, maxSnapshotLimit: 65536
2020-10-05 04:55:34,726 INFO snapshot.SnapshotManager: SkipList is disabled
2020-10-05 04:55:34,732 INFO util.GSet: Computing capacity for map cachedBlocks
2020-10-05 04:55:34,732 INFO util.GSet: VM type       = 64-bit
2020-10-05 04:55:34,733 INFO util.GSet: 0.25% max memory 436 MB = 1.1 MB
2020-10-05 04:55:34,733 INFO util.GSet: capacity      = 2^17 = 131072 entries
2020-10-05 04:55:34,743 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2020-10-05 04:55:34,743 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2020-10-05 04:55:34,743 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2020-10-05 04:55:34,752 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
2020-10-05 04:55:34,754 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2020-10-05 04:55:34,757 INFO util.GSet: Computing capacity for map NameNodeRetryCache
2020-10-05 04:55:34,757 INFO util.GSet: VM type       = 64-bit
2020-10-05 04:55:34,757 INFO util.GSet: 0.029999999329447746% max memory 436 MB = 133.9 KB
2020-10-05 04:55:34,757 INFO util.GSet: capacity      = 2^14 = 16384 entries
2020-10-05 04:55:34,801 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1928559644-10.1.10.61-1601873734792
2020-10-05 04:55:34,826 INFO common.Storage: Storage directory /tmp/hadoop-ec2-user/dfs/name has been successfully formatted.
2020-10-05 04:55:34,873 INFO namenode.FSImageFormatProtobuf: Saving image file /tmp/hadoop-ec2-user/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2020-10-05 04:55:34,976 INFO namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-ec2-user/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 403 bytes saved in 0 seconds .
2020-10-05 04:55:34,993 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2020-10-05 04:55:35,003 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2020-10-05 04:55:35,003 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ip-10-1-10-61.ap-northeast-2.compute.internal/10.1.10.61
************************************************************/

[ec2-user@ip-10-1-10-221 hadoop-3.3.0]$ sbin/start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [ip-10-1-10-61.ap-northeast-2.compute.internal]

## check that HDFS came up properly
[ec2-user@ip-10-1-10-221 hadoop-3.3.0]$ jps
10864 NameNode
11376 Jps
11012 DataNode
11246 SecondaryNameNode
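
With all three HDFS daemons up, a simple read/write smoke test confirms things end to end; a minimal sketch, run from the hadoop-3.3.0 directory:

bin/hdfs dfs -mkdir -p /user/ec2-user                        ## create the HDFS home directory
bin/hdfs dfs -put etc/hadoop/core-site.xml /user/ec2-user/   ## upload a local file
bin/hdfs dfs -ls /user/ec2-user                              ## the uploaded file should be listed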

Open a web browser and go to [EC2 public IP]:9870 to confirm that the NameNode UI is reachable.

Note that port 9870 must be opened in the security group beforehand.
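
If you prefer the AWS CLI to the console for that security-group change, something like the following should work (the group id and CIDR below are placeholders for your own values):

aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxxxx \
    --protocol tcp --port 9870 \
    --cidr <your-ip>/32  ## restrict access to your own IP rather than 0.0.0.0/0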

[ec2-user@ip-10-1-10-221 hadoop]$ cd /home/ec2-user/Platform/hadoop-3.3.0/etc/hadoop
[ec2-user@ip-10-1-10-221 hadoop]$ sudo vim mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>

[ec2-user@ip-10-1-10-147 hadoop]$ sudo vim yarn-site.xml
<configuration>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property>
                <name>yarn.nodemanager.env-whitelist</name>
                <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
        </property>
</configuration>
    
[ec2-user@ip-10-1-10-221 hadoop]$ cd /home/ec2-user/Platform/hadoop-3.3.0            
[ec2-user@ip-10-1-10-221 hadoop-3.3.0]$ sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers

[ec2-user@ip-10-1-10-221 hadoop-3.3.0]$ jps
12578 SecondaryNameNode
14037 NodeManager
14358 Jps
13897 ResourceManager
12187 NameNode
12334 DataNode

Likewise, open a web browser and go to [EC2 public IP]:8088 to confirm that the ResourceManager UI is reachable.

Note that port 8088 must be opened in the security group beforehand.
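
Finally, submitting one of the bundled example MapReduce jobs to YARN exercises the whole stack; a sketch using the examples jar that ships with the distribution:

## run from /home/ec2-user/Platform/hadoop-3.3.0
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 2 5

While it runs, the job should appear on the 8088 UI. When you are done, the daemons can be stopped with sbin/stop-yarn.sh and sbin/stop-dfs.sh.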