ec2에서 hadoop pseudo-distributed 구현 실습
2020-10-05
.
Data_Engineering_TIL(20200925)
- study program : T아카데미 - 아파치 하둡 입문과 활용
** URL : https://tacademy.skplanet.com/frontMain.action
실습목표 : ec2 1대에서 hadoop pseudo-distributed를 구현해본다.
- jdk 1.8 설치
[ec2-user@ip-10-1-10-221 ~]$ sudo yum install -y java-1.8.0-openjdk-devel.x86_64
[ec2-user@ip-10-1-10-221 ~]$ sudo /usr/sbin/alternatives --config java
There is 1 program that provides 'java'.
Selection Command
-----------------------------------------------
*+ 1 java-1.8.0-openjdk.x86_64 (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64/jre/bin/java)
Enter to keep the current selection[+], or type selection number: ## 엔터 누르면 됨
[ec2-user@ip-10-1-10-221 ~]$ sudo yum remove java-1.7.0-openjdk -y
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
No Match for argument: java-1.7.0-openjdk
No Packages marked for removal
[ec2-user@ip-10-1-10-221 ~]$ java -version
openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)
[ec2-user@ip-10-1-10-178 ~]$ readlink -f $(which java)
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64/jre/bin/java
## 그러면 Java home 경로는 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64 이다.
- hadoop 3.3 ver 설치
[ec2-user@ip-10-1-10-221 ~]$ wget http://apache.mirror.cdnetworks.com/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
wget http://apache.mirror.cdnetworks.com/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
--2020-09-25 03:44:16-- http://apache.mirror.cdnetworks.com/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
Resolving apache.mirror.cdnetworks.com (apache.mirror.cdnetworks.com)... 14.0.101.165
Connecting to apache.mirror.cdnetworks.com (apache.mirror.cdnetworks.com)|14.0.101.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 500749234 (478M) [application/x-gzip]
Saving to: ‘hadoop-3.3.0.tar.gz’
100%[=======================================================================================>] 500,749,234 105MB/s in 7.3s
2020-09-25 03:44:24 (65.1 MB/s) - ‘hadoop-3.3.0.tar.gz’ saved [500749234/500749234]
[ec2-user@ip-10-1-10-221 ~]$ tar -zxvf hadoop-3.3.0.tar.gz
- maven과 hive 3.1.2 ver 설치 (optional, 필수가 아니고 다른 실습을 위해 그냥 추가적으로 설치한 부분임)
[ec2-user@ip-10-1-10-221 ~]$ sudo wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo
[ec2-user@ip-10-1-10-221 ~]$ sudo sed -i s/\$releasever/6/g /etc/yum.repos.d/epel-apache-maven.repo
[ec2-user@ip-10-1-10-221 ~]$ sudo yum install -y apache-maven
[ec2-user@ip-10-1-10-221 ~]$ mvn --version
Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z)
Maven home: /usr/share/apache-maven
Java version: 1.8.0_265, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.14.193-149.317.amzn2.x86_64", arch: "amd64", family: "unix"
[ec2-user@ip-10-1-10-221 ~]$ wget http://apache.mirror.cdnetworks.com/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
--2020-09-25 03:46:36-- http://apache.mirror.cdnetworks.com/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
Resolving apache.mirror.cdnetworks.com (apache.mirror.cdnetworks.com)... 14.0.101.165
Connecting to apache.mirror.cdnetworks.com (apache.mirror.cdnetworks.com)|14.0.101.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 278813748 (266M) [application/x-gzip]
Saving to: ‘apache-hive-3.1.2-bin.tar.gz’
100%[=======================================================================================>] 278,813,748 104MB/s in 2.5s
2020-09-25 03:46:38 (104 MB/s) - ‘apache-hive-3.1.2-bin.tar.gz’ saved [278813748/278813748]
[ec2-user@ip-10-1-10-221 ~]$ tar -zxvf apache-hive-3.1.2-bin.tar.gz
- Platform 폴더를 만들어서 하둡과 하이브를 Platform 폴더에 위치시킨다. 그리고 자바와 하둡을 대상으로 환경변수를 설정
[ec2-user@ip-10-1-10-221 ~]$ pwd
/home/ec2-user
[ec2-user@ip-10-1-10-221 ~]$ ls
apache-hive-3.1.2-bin apache-hive-3.1.2-bin.tar.gz hadoop-3.3.0 hadoop-3.3.0.tar.gz
[ec2-user@ip-10-1-10-221 ~]$ mkdir Platform
[ec2-user@ip-10-1-10-221 ~]$ mv apache-hive-3.1.2-bin Platform/
[ec2-user@ip-10-1-10-221 ~]$ mv apache-hive-3.1.2-bin.tar.gz Platform/
[ec2-user@ip-10-1-10-221 ~]$ mv hadoop-3.3.0 Platform/
[ec2-user@ip-10-1-10-221 ~]$ mv hadoop-3.3.0.tar.gz Platform/
[ec2-user@ip-10-1-10-221 ~]$ ls
Platform
[ec2-user@ip-10-1-10-221 ~]$ cd Platform/
[ec2-user@ip-10-1-10-221 Platform]$ ls
apache-hive-3.1.2-bin apache-hive-3.1.2-bin.tar.gz hadoop-3.3.0 hadoop-3.3.0.tar.gz
[ec2-user@ip-10-1-10-178 Platform]$ cd ~
[ec2-user@ip-10-1-10-146 ~]$ sudo vim .bash_profile
# 가장 하단에 아래 내용추가
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64
export HADOOP_HOME=/home/ec2-user/Platform/hadoop-3.3.0
export PATH=$HADOOP_HOME/bin:$PATH
[ec2-user@ip-10-1-10-178 ~]$ source .bash_profile
## .bash_profile 정상적용 확인
[ec2-user@ip-10-1-10-178 ~]$ $JAVA_HOME/bin/javac -version
javac 1.8.0_265
[ec2-user@ip-10-1-10-178 ~]$ echo $JAVA_HOME
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64
- ssh 설정
[ec2-user@ip-10-1-10-221 ~]$ cd ~ # /home/ec2-user로 이동
[ec2-user@ip-10-1-10-178 ~]$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Your identification has been saved in /home/ec2-user/.ssh/id_rsa.
Your public key has been saved in /home/ec2-user/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx ec2-user@ip-10-1-10-178.ap-northeast-2.compute.internal
The key's randomart image is:
+---[RSA 2048]----+
| |
| |
| |
| |
| |
| |
| |
| |
| |
+----[SHA256]-----+
[ec2-user@ip-10-1-10-178 ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[ec2-user@ip-10-1-10-178 ~]$ chmod 0600 ~/.ssh/authorized_keys
[ec2-user@ip-10-1-10-178 ~]$ ssh localhost
Last login: Sun Oct 4 10:38:48 2020 from 61.101.189.4
__| __|_ )
_| ( / Amazon Linux 2 AMI
___|\___|___|
https://aws.amazon.com/amazon-linux-2/
[ec2-user@ip-10-1-10-178 ~]$ exit
logout
Connection to localhost closed.
참고로 만약에 하둡 클라이언트를 두고, 실제 여러대의 노드를 띄워 하둡 클러스터를 구현하고 싶다면
https://minman2115.github.io/DE_TIL112/ 를 참고해서 ssh 설정등 세부적으로 좀더 해줘야 한다.
- 최소한의 HDFS config 설정
[ec2-user@ip-10-1-10-221 ~]$ cd /home/ec2-user/Platform/hadoop-3.3.0/etc/hadoop
[ec2-user@ip-10-1-10-221 hadoop]$ ls
capacity-scheduler.xml hadoop-user-functions.sh.example kms-log4j.properties ssl-client.xml.example
configuration.xsl hdfs-rbf-site.xml kms-site.xml ssl-server.xml.example
container-executor.cfg hdfs-site.xml log4j.properties user_ec_policies.xml.template
core-site.xml httpfs-env.sh mapred-env.cmd workers
hadoop-env.cmd httpfs-log4j.properties mapred-env.sh yarn-env.cmd
hadoop-env.sh httpfs-site.xml mapred-queues.xml.template yarn-env.sh
hadoop-metrics2.properties kms-acls.xml mapred-site.xml yarnservice-log4j.properties
hadoop-policy.xml kms-env.sh shellprofile.d yarn-site.xml
[ec2-user@ip-10-1-10-221 hadoop]$ sudo vim hadoop-env.sh
## 자바 홈 부분에 가서 아래내용 추가
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64
## 아래 부분 각주해제
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}
[ec2-user@ip-10-1-10-221 hadoop]$ mkdir /home/ec2-user/hadoop_temp
[ec2-user@ip-10-1-10-221 hadoop]$ sudo vim core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/ec2-user/hadoop_temp</value> ## 시전에 /home/ec2-user/hadoop_temp 폴더 생성필요
</property>
</configuration>
[ec2-user@ip-10-1-10-221 hadoop]$ sudo vim hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
[ec2-user@ip-10-1-10-221 ~]$ cd /home/ec2-user/Platform/hadoop-3.3.0
[ec2-user@ip-10-1-10-221 hadoop-3.3.0]$ pwd
/home/ec2-user/Platform/hadoop-3.3.0
[ec2-user@ip-10-1-10-221 hadoop-3.3.0]$ bin/hdfs namenode -format #복붙할때 -format에서 long vs short hyphen 주의
2020-10-04 12:23:57,237 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ip-10-1-10-178.ap-northeast-2.compute.internal/10.1.10.178
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 3.3.0
STARTUP_MSG: classpath = /home/ec2-user/Platform/hadoop-3.3.0/etc/hadoop:/home/ec2-user/Platform/hadoop-3.3.0 ... 생략
STARTUP_MSG: build = https://gitbox.apache.org/repos/asf/hadoop.git -r aa96f1871bfd858f9bac59cf2a81ec470da649af; compiled by 'brahma' on 2020-07-06T18:44Z
STARTUP_MSG: java = 1.8.0_265
************************************************************/
2020-10-05 04:55:33,549 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
2020-10-05 04:55:33,683 INFO namenode.NameNode: createNameNode [-format]
2020-10-05 04:55:34,434 INFO namenode.NameNode: Formatting using clusterid: CID-0434521d-37d1-4523-be97-5c2fb5b4172d
2020-10-05 04:55:34,480 INFO namenode.FSEditLog: Edit logging is async:true
2020-10-05 04:55:34,535 INFO namenode.FSNamesystem: KeyProvider: null
2020-10-05 04:55:34,540 INFO namenode.FSNamesystem: fsLock is fair: true
2020-10-05 04:55:34,544 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
2020-10-05 04:55:34,558 INFO namenode.FSNamesystem: fsOwner = ec2-user (auth:SIMPLE)
2020-10-05 04:55:34,558 INFO namenode.FSNamesystem: supergroup = supergroup
2020-10-05 04:55:34,558 INFO namenode.FSNamesystem: isPermissionEnabled = true
2020-10-05 04:55:34,558 INFO namenode.FSNamesystem: isStoragePolicyEnabled = true
2020-10-05 04:55:34,558 INFO namenode.FSNamesystem: HA Enabled: false
2020-10-05 04:55:34,605 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2020-10-05 04:55:34,615 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit: configured=1000, counted=60, effected=1000
2020-10-05 04:55:34,615 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
2020-10-05 04:55:34,618 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2020-10-05 04:55:34,619 INFO blockmanagement.BlockManager: The block deletion will start around 2020 Oct 05 04:55:34
2020-10-05 04:55:34,620 INFO util.GSet: Computing capacity for map BlocksMap
2020-10-05 04:55:34,620 INFO util.GSet: VM type = 64-bit
2020-10-05 04:55:34,623 INFO util.GSet: 2.0% max memory 436 MB = 8.7 MB
2020-10-05 04:55:34,623 INFO util.GSet: capacity = 2^20 = 1048576 entries
2020-10-05 04:55:34,630 INFO blockmanagement.BlockManager: Storage policy satisfier is disabled
2020-10-05 04:55:34,631 INFO blockmanagement.BlockManager: dfs.block.access.token.enable = false
2020-10-05 04:55:34,639 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.threshold-pct = 0.999
2020-10-05 04:55:34,639 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.min.datanodes = 0
2020-10-05 04:55:34,640 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.extension = 30000
2020-10-05 04:55:34,640 INFO blockmanagement.BlockManager: defaultReplication = 1
2020-10-05 04:55:34,640 INFO blockmanagement.BlockManager: maxReplication = 512
2020-10-05 04:55:34,640 INFO blockmanagement.BlockManager: minReplication = 1
2020-10-05 04:55:34,640 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
2020-10-05 04:55:34,641 INFO blockmanagement.BlockManager: redundancyRecheckInterval = 3000ms
2020-10-05 04:55:34,641 INFO blockmanagement.BlockManager: encryptDataTransfer = false
2020-10-05 04:55:34,641 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
2020-10-05 04:55:34,680 INFO namenode.FSDirectory: GLOBAL serial map: bits=29 maxEntries=536870911
2020-10-05 04:55:34,680 INFO namenode.FSDirectory: USER serial map: bits=24 maxEntries=16777215
2020-10-05 04:55:34,680 INFO namenode.FSDirectory: GROUP serial map: bits=24 maxEntries=16777215
2020-10-05 04:55:34,680 INFO namenode.FSDirectory: XATTR serial map: bits=24 maxEntries=16777215
2020-10-05 04:55:34,699 INFO util.GSet: Computing capacity for map INodeMap
2020-10-05 04:55:34,699 INFO util.GSet: VM type = 64-bit
2020-10-05 04:55:34,699 INFO util.GSet: 1.0% max memory 436 MB = 4.4 MB
2020-10-05 04:55:34,700 INFO util.GSet: capacity = 2^19 = 524288 entries
2020-10-05 04:55:34,702 INFO namenode.FSDirectory: ACLs enabled? true
2020-10-05 04:55:34,702 INFO namenode.FSDirectory: POSIX ACL inheritance enabled? true
2020-10-05 04:55:34,702 INFO namenode.FSDirectory: XAttrs enabled? true
2020-10-05 04:55:34,702 INFO namenode.NameNode: Caching file names occurring more than 10 times
2020-10-05 04:55:34,708 INFO snapshot.SnapshotManager: Loaded config captureOpenFiles: false, skipCaptureAccessTimeOnlyChange: false, snapshotDiffAllowSnapRootDescendant: true, maxSnapshotLimit: 65536
2020-10-05 04:55:34,726 INFO snapshot.SnapshotManager: SkipList is disabled
2020-10-05 04:55:34,732 INFO util.GSet: Computing capacity for map cachedBlocks
2020-10-05 04:55:34,732 INFO util.GSet: VM type = 64-bit
2020-10-05 04:55:34,733 INFO util.GSet: 0.25% max memory 436 MB = 1.1 MB
2020-10-05 04:55:34,733 INFO util.GSet: capacity = 2^17 = 131072 entries
2020-10-05 04:55:34,743 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2020-10-05 04:55:34,743 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2020-10-05 04:55:34,743 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2020-10-05 04:55:34,752 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
2020-10-05 04:55:34,754 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2020-10-05 04:55:34,757 INFO util.GSet: Computing capacity for map NameNodeRetryCache
2020-10-05 04:55:34,757 INFO util.GSet: VM type = 64-bit
2020-10-05 04:55:34,757 INFO util.GSet: 0.029999999329447746% max memory 436 MB = 133.9 KB
2020-10-05 04:55:34,757 INFO util.GSet: capacity = 2^14 = 16384 entries
2020-10-05 04:55:34,801 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1928559644-10.1.10.61-1601873734792
2020-10-05 04:55:34,826 INFO common.Storage: Storage directory /tmp/hadoop-ec2-user/dfs/name has been successfully formatted.
2020-10-05 04:55:34,873 INFO namenode.FSImageFormatProtobuf: Saving image file /tmp/hadoop-ec2-user/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2020-10-05 04:55:34,976 INFO namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-ec2-user/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 403 bytes saved in 0 seconds .
2020-10-05 04:55:34,993 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2020-10-05 04:55:35,003 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2020-10-05 04:55:35,003 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ip-10-1-10-61.ap-northeast-2.compute.internal/10.1.10.61
************************************************************/
[ec2-user@ip-10-1-10-221 hadoop-3.3.0]$ sbin/start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [ip-10-1-10-61.ap-northeast-2.compute.internal]
## HDFS 정상구동여부 확인
[ec2-user@ip-10-1-10-221 hadoop-3.3.0]$ jps
10864 NameNode
11376 Jps
11012 DataNode
11246 SecondaryNameNode
웹브라우저를 열어서 [ec2 퍼블릭 아이피]:9870
를 입력했을때 네임노드 UI로 접속이 가능한지 확인.
보안그룹에서 참고로 9870포트를 사전에 열어줘야함
[ec2-user@ip-10-1-10-221 hadoop]$ cd /home/ec2-user/Platform/hadoop-3.3.0/etc/hadoop
[ec2-user@ip-10-1-10-221 hadoop]$ sudo vim mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
[ec2-user@ip-10-1-10-147 hadoop]$ sudo vim yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
[ec2-user@ip-10-1-10-221 hadoop]$ cd /home/ec2-user/Platform/hadoop-3.3.0
[ec2-user@ip-10-1-10-221 hadoop-3.3.0]$ sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers
[ec2-user@ip-10-1-10-221 hadoop-3.3.0]$ jps
12578 SecondaryNameNode
14037 NodeManager
14358 Jps
13897 ResourceManager
12187 NameNode
12334 DataNode
마찬가지로 웹브라우저를 열어서 [ec2 퍼블릭 아이피]:8088
를 입력했을때 네임노드 UI로 접속이 가능한지 확인.
보안그룹에서 참고로 8088포트를 사전에 열어줘야함