
DistCp: a data transfer tool between Hadoop clusters

An introduction to DistCp, a tool for transferring data between two Hadoop clusters.
Official documentation: http://hadoop.apache.org/docs/stable1/distcp2.html
There are currently two versions, DistCp and DistCp2. I tested them myself but did not compare their efficiency.
Usage:
Copying between HDFS clusters of the same version:
# hadoop distcp hdfs://master1:9000/foo  hdfs://master2:9000/foo
16/01/03 17:32:10 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs://master1:9000/foo], targetPath=hdfs://master:9000/foo, targetPathExists=true, preserveRawXattrs=false}

16/01/03 17:32:10 INFO client.RMProxy: Connecting to ResourceManager at master1/192.168.211.128:8032
16/01/03 17:32:13 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
16/01/03 17:32:13 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
16/01/03 17:32:14 INFO client.RMProxy: Connecting to ResourceManager at master1/192.168.211.128:8032
16/01/03 17:32:15 INFO mapreduce.JobSubmitter: number of splits:2
16/01/03 17:32:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1451870356825_0001
16/01/03 17:32:17 INFO impl.YarnClientImpl: Submitted application application_1451870356825_0001
16/01/03 17:32:17 INFO mapreduce.Job: The url to track the job: http://master1:8088/proxy/application_1451870356825_0001/
16/01/03 17:32:17 INFO tools.DistCp: DistCp job-id: job_1451870356825_0001
16/01/03 17:32:17 INFO mapreduce.Job: Running job: job_1451870356825_0001
16/01/03 17:32:38 INFO mapreduce.Job: Job job_1451870356825_0001 running in uber mode : false

16/01/03 17:32:38 INFO mapreduce.Job:  map 0% reduce 0%
16/01/03 17:32:58 INFO mapreduce.Job:  map 50% reduce 0%
16/01/03 17:33:05 INFO mapreduce.Job:  map 100% reduce 0%

16/01/03 17:33:05 INFO mapreduce.Job: Job job_1451870356825_0001 completed successfully
16/01/03 17:33:05 INFO mapreduce.Job: Counters: 33
   File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=216186
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1196
        HDFS: Number of bytes written=66
        HDFS: Number of read operations=33
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=10
   Job Counters
        Launched map tasks=2
        Other local map tasks=2
        Total time spent by all maps in occupied slots (ms)=42030
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=42030
        Total vcore-seconds taken by all map tasks=42030
        Total megabyte-seconds taken by all map tasks=43038720
   Map-Reduce Framework
        Map input records=4
        Map output records=0
        Input split bytes=266
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=677
        CPU time spent (ms)=2450
        Physical memory (bytes) snapshot=180961280
        Virtual memory (bytes) snapshot=4115144704
        Total committed heap usage (bytes)=33157120
   File Input Format Counters
        Bytes Read=864
   File Output Format Counters
        Bytes Written=0
   org.apache.hadoop.tools.mapred.CopyMapper$Counter
        BYTESCOPIED=66
        BYTESEXPECTED=66
        COPY=4

The transferred data can then be found under the /foo directory on the other cluster.
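To verify, list the target path (reusing the hostnames from the example above):
# hadoop fs -ls hdfs://master2:9000/foo
If the copy needs to be re-run, DistCp also supports options such as -update (copy only missing or changed files), -overwrite (replace existing files on the target) and -m <num> (cap the number of map tasks). A sketch, with the flag values chosen only for illustration:
# hadoop distcp -update -m 10 hdfs://master1:9000/foo hdfs://master2:9000/foo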
Copying between different HDFS versions:
For copies between different Hadoop versions, use HftpFileSystem. HFTP is a read-only filesystem, so distcp must be run on the destination cluster (more precisely, on TaskTrackers that can write to the destination cluster). The source is specified as hftp://<dfs.http.address>/<path> (dfs.http.address defaults to port 50070; in my same-version test above, port 9000, the NameNode RPC port, was used).
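A minimal sketch of such a cross-version copy, assuming the source NameNode's HTTP address is master1:50070 and the command is run on the destination cluster; paths mirror the example above:
# hadoop distcp hftp://master1:50070/foo hdfs://master2:9000/foo
Between versions that use different checksum types, adding -update -skipcrccheck (a DistCp2 option) is a common workaround when checksum comparison fails:
# hadoop distcp -update -skipcrccheck hftp://master1:50070/foo hdfs://master2:9000/foo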
