s3-dist-cp usage example

2021-09-28


Data_Engineering_TIL(20210928)

The more objects there are to copy, the more advantageous s3-dist-cp appears to be compared to the AWS CLI command (aws s3 cp).
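
For intuition, a minimal sketch of the two approaches being compared (bucket names and prefixes below are placeholders, not taken from the run that follows): the AWS CLI copy runs on a single machine, while s3-dist-cp submits a MapReduce job so the copy work is spread across the EMR cluster.

# AWS CLI: single-machine copy, object by object (with limited client-side parallelism)
aws s3 cp s3://source-bucket/ s3://dest-bucket/backup/ --recursive

# s3-dist-cp: distributed copy executed as a MapReduce job on the EMR cluster
s3-dist-cp --src=s3://source-bucket/ --dest=s3://dest-bucket/backup/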

Usage example

[hadoop@ip-10-10-1-149 ~]$ s3-dist-cp --src=s3://xxx-bucket/ --dest=s3://pms-bucket-test/s3distcp_test/
2021-09-28 13:29:53,489 INFO s3distcp.Main: Running with args: -libjars /usr/share/aws/emr/s3-dist-cp/lib/guava-18.0.jar,/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp-2.16.0.jar,/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar --src=s3://xxx-bucket/ --dest=s3://pms-bucket-test/s3distcp_test/
2021-09-28 13:29:53,880 INFO s3distcp.S3DistCp: S3DistCp args: --src=s3://xxx-bucket/ --dest=s3://pms-bucket-test/s3distcp_test/
2021-09-28 13:29:53,886 INFO s3distcp.S3DistCp: Using output path 'hdfs:/tmp/8cad22b9-e740-4ec9-9d49-ebef46e4b792/output'
2021-09-28 13:29:53,886 INFO s3distcp.S3DistCp: Try to recursively delete with throw Exceptionhdfs:/tmp/8cad22b9-e740-4ec9-9d49-ebef46e4b792/files
2021-09-28 13:29:54,376 INFO s3distcp.S3DistCp: Try to recursively delete with throw Exceptionhdfs:/tmp/8cad22b9-e740-4ec9-9d49-ebef46e4b792/output
2021-09-28 13:29:56,226 INFO s3distcp.S3ClientFactory: Create Amazon S3 client with endpoint override
2021-09-28 13:29:56,637 INFO s3distcp.S3ClientFactory: DefaultAWSCredentialsProviderChain is used to create AmazonS3Client.
2021-09-28 13:29:56,637 INFO s3distcp.S3ClientFactory: Overrides Amazon S3 Endpoint with s3.ap-northeast-2.amazonaws.com
2021-09-28 13:29:56,805 INFO s3distcp.S3DistCp: Get src prefix s3://xxx-bucket/ endWithSlash true
2021-09-28 13:29:56,805 INFO s3distcp.S3DistCp: Only a single prefix is got
2021-09-28 13:29:56,808 INFO s3distcp.S3DistCp: listing objects in bucket xxx-bucket target prefix is null
2021-09-28 13:29:56,824 WARN cred.CredentialsLegacyConfigLocationProvider: Found the legacy config profiles file at [/home/hadoop/.aws/config]. Please move it to the latest default location [~/.aws/credentials].
2021-09-28 13:29:57,442 INFO s3distcp.S3DistCp: Got object summary
 Bucket = xxx-bucket, Key = xxxxxx.csv
2021-09-28 13:29:57,443 INFO s3distcp.S3DistCp: Got object summary
 Bucket = xxx-bucket, Key = yyyyyy.csv
2021-09-28 13:29:57,443 INFO s3distcp.S3DistCp: Got object summary
 Bucket = xxx-bucket, Key = zzzzzz.csv

...

2021-09-28 13:29:58,015 INFO s3distcp.S3DistCp: Got object summary
 Bucket = xxx-bucket, Key = test.txt
2021-09-28 13:29:58,015 INFO s3distcp.S3DistCp: Got object summary
 Bucket = xxx-bucket, Key = test2.txt
2021-09-28 13:29:58,015 INFO s3distcp.S3DistCp: Got object summary
 Bucket = xxx-bucket, Key = output.txt
2021-09-28 13:29:58,015 INFO s3distcp.S3DistCp: Got object summary
 Bucket = xxx-bucket, Key = pms_test.txt
2021-09-28 13:29:58,016 INFO s3distcp.S3DistCp: Created 1 files to copy 2018 files
2021-09-28 13:29:58,744 INFO s3distcp.S3DistCp: Reducer number: 10
2021-09-28 13:29:58,961 INFO client.RMProxy: Connecting to ResourceManager at ip-10-10-1-149.ap-northeast-2.compute.internal/10.10.1.149:8032
2021-09-28 13:29:59,102 INFO client.AHSProxy: Connecting to Application History server at ip-10-10-1-149.ap-northeast-2.compute.internal/10.10.1.149:10200
2021-09-28 13:29:59,228 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1632835295636_0001
2021-09-28 13:30:00,287 INFO input.FileInputFormat: Total input files to process : 1
2021-09-28 13:30:01,151 INFO mapreduce.JobSubmitter: number of splits:1
2021-09-28 13:30:01,339 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1632835295636_0001
2021-09-28 13:30:01,341 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-09-28 13:30:01,635 INFO conf.Configuration: resource-types.xml not found
2021-09-28 13:30:01,635 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-09-28 13:30:02,074 INFO impl.YarnClientImpl: Submitted application application_1632835295636_0001
2021-09-28 13:30:02,128 INFO mapreduce.Job: The url to track the job: http://ip-10-10-1-149.ap-northeast-2.compute.internal:20888/proxy/application_16328352956
2021-09-28 13:30:02,128 INFO mapreduce.Job: Running job: job_1632835295636_0001
2021-09-28 13:30:10,218 INFO mapreduce.Job: Job job_1632835295636_0001 running in uber mode : false
2021-09-28 13:30:10,219 INFO mapreduce.Job:  map 0% reduce 0%
2021-09-28 13:30:15,269 INFO mapreduce.Job:  map 100% reduce 0%
2021-09-28 13:30:33,365 INFO mapreduce.Job:  map 100% reduce 10%
2021-09-28 13:30:50,453 INFO mapreduce.Job:  map 100% reduce 20%
2021-09-28 13:31:07,532 INFO mapreduce.Job:  map 100% reduce 29%
2021-09-28 13:31:13,557 INFO mapreduce.Job:  map 100% reduce 30%
2021-09-28 13:33:36,169 INFO mapreduce.Job:  map 100% reduce 40%
2021-09-28 13:34:41,369 INFO mapreduce.Job:  map 100% reduce 50%
2021-09-28 13:34:58,420 INFO mapreduce.Job:  map 100% reduce 60%
2021-09-28 13:35:15,470 INFO mapreduce.Job:  map 100% reduce 69%
2021-09-28 13:35:18,479 INFO mapreduce.Job:  map 100% reduce 70%
2021-09-28 13:35:33,523 INFO mapreduce.Job:  map 100% reduce 80%
2021-09-28 13:36:39,717 INFO mapreduce.Job:  map 100% reduce 89%
2021-09-28 13:36:45,735 INFO mapreduce.Job:  map 100% reduce 90%
2021-09-28 13:38:19,048 INFO mapreduce.Job:  map 100% reduce 99%
2021-09-28 13:38:25,064 INFO mapreduce.Job:  map 100% reduce 100%
2021-09-28 13:38:58,164 INFO mapreduce.Job: Job job_1632835295636_0001 completed successfully
2021-09-28 13:38:58,256 INFO mapreduce.Job: Counters: 59
        File System Counters
                FILE: Number of bytes read=72762
                FILE: Number of bytes written=2813691
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=414445
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=54
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=20
                HDFS: Number of bytes read erasure-coded=0
                S3: Number of bytes read=27055616432
                S3: Number of bytes written=27055789737
                S3: Number of read operations=0
                S3: Number of large read operations=0
                S3: Number of write operations=0
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=10
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=285600
                Total time spent by all reduces in occupied slots (ms)=98009472
                Total time spent by all map tasks (ms)=2975
                Total time spent by all reduce tasks (ms)=510466
                Total vcore-milliseconds taken by all map tasks=2975
                Total vcore-milliseconds taken by all reduce tasks=510466
                Total megabyte-milliseconds taken by all map tasks=9139200
                Total megabyte-milliseconds taken by all reduce tasks=3136303104
        Map-Reduce Framework
                Map input records=2018
                Map output records=2018
                Map output bytes=501895
                Map output materialized bytes=72722
                Input split bytes=172
                Combine input records=0
                Combine output records=0
                Reduce input groups=2018
                Reduce shuffle bytes=72722
                Reduce input records=2018
                Reduce output records=0
                Spilled Records=4036
                Shuffled Maps =10
                Failed Shuffles=0
                Merged Map outputs=10
                GC time elapsed (ms)=3090
                CPU time spent (ms)=781620
                Physical memory (bytes) snapshot=7798751232
                Virtual memory (bytes) snapshot=75669417984
                Total committed heap usage (bytes)=6478626816
                Peak Map Physical memory (bytes)=472436736
                Peak Map Virtual memory (bytes)=4405927936
                Peak Reduce Physical memory (bytes)=1179234304
                Peak Reduce Virtual memory (bytes)=7135850496
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=414273
        File Output Format Counters
                Bytes Written=0
2021-09-28 13:38:58,258 INFO s3distcp.S3DistCp: Try to recursively delete without throw Exceptionhdfs:/tmp/8cad22b9-e740-4ec9-9d49-ebef46e4b792
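
In the run above, a single map task and 10 reduce tasks copied 2,018 objects (the S3 bytes-read counter shows roughly 27 GB) in about 9 minutes. Beyond plain --src/--dest, s3-dist-cp also accepts filtering and aggregation options documented for AWS EMR; the commands below are untested sketches with placeholder bucket names and patterns.

# Copy only objects whose path matches a regular expression
s3-dist-cp --src=s3://source-bucket/logs/ --dest=s3://dest-bucket/logs/ --srcPattern='.*\.csv'

# Merge many small files into larger ones: files sharing the same value of regex group 1
# are concatenated, aiming for roughly 128 MiB per output file
s3-dist-cp --src=s3://source-bucket/logs/ --dest=s3://dest-bucket/merged/ \
           --groupBy='.*/(\d{4}-\d{2}-\d{2})/.*' --targetSize=128

# Compress the output and delete source objects only after a successful copy
s3-dist-cp --src=s3://source-bucket/raw/ --dest=s3://dest-bucket/gz/ \
           --outputCodec=gz --deleteOnSuccess

# Simple sanity check after the copy: count the objects that landed in the destination
aws s3 ls s3://pms-bucket-test/s3distcp_test/ --recursive | wc -l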

As another example, also refer to the URL below, which covers running s3-dist-cp while excluding a specific S3 prefix.

https://github.com/minman2115/Data_engineering_studynotes_2021/blob/master/s3-dist-cp%20%EC%82%AC%EC%9A%A9%EC%98%88%EC%8B%9C/example.zip
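
s3-dist-cp has no dedicated exclude flag, so one common workaround (an untested sketch; the linked example may use a different technique) is to pass a negative-lookahead regular expression to --srcPattern so that keys under the unwanted prefix are filtered out. Alternatively, --srcPrefixesFile can be given a file listing only the prefixes you do want to copy.

# Copy everything from the source bucket EXCEPT keys under the hypothetical prefix tmp/
# (verify on a small test prefix first; how the pattern anchors against the full source
#  path can differ between EMR versions)
s3-dist-cp --src=s3://xxx-bucket/ \
           --dest=s3://pms-bucket-test/s3distcp_test/ \
           --srcPattern='^(?!.*/tmp/).*'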