
MapReduce Counters, with a Worked Example

yuwenge posted on 2015-6-15 19:38:13
Guiding questions
1. What built-in counters does Hadoop have?
2. What does job.getCounters() return?
3. Does MapReduce allow user-defined counters?

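Overview:
Hadoop counters give developers a job-wide view of how a job is running and of its key metrics, so that errors can be diagnosed and handled promptly.
Compared with logs, counters are easier to analyze.
Built-in counters:
(1) Hadoop's built-in counters mainly record how a job executes.
(2) The built-in counter groups are:
- MapReduce framework counters (Map-Reduce Framework)
- File system counters (File System Counters)
- Job counters (Job Counters)
- File input format counters (File Input Format Counters)
- File output format counters (File Output Format Counters)
- Shuffle error counters (Shuffle Errors)
(3) Counters are maintained by the task they belong to and sent periodically to the tasktracker, which forwards them to the jobtracker. (These are the Hadoop 1 daemons; under YARN in Hadoop 2, tasks report to the MapReduce application master instead.)
(4) The final job counters are actually maintained by the jobtracker, so they can be aggregated globally without being passed around the whole network.
(5) Counter values are complete and reliable only after the job has finished successfully.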
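Points (3) and (4) can be pictured with a small plain-Java simulation. It does not use Hadoop at all: the class and method names (CounterAggregationSketch, merge) and the per-task values are made up for illustration, and real tasks report their counters through the framework rather than through a map like this.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java illustration only; this is not Hadoop code. */
final class CounterAggregationSketch {

    /** Adds every counter reported by one task into the job-wide totals. */
    static void merge(Map<String, Long> jobTotals, Map<String, Long> taskCounters) {
        for (Map.Entry<String, Long> e : taskCounters.entrySet()) {
            jobTotals.merge(e.getKey(), e.getValue(), Long::sum);
        }
    }

    public static void main(String[] args) {
        // Hypothetical per-task values; each task only knows its own counts.
        Map<String, Long> task1 = new LinkedHashMap<>();
        task1.put("Map input records", 4L);
        Map<String, Long> task2 = new LinkedHashMap<>();
        task2.put("Map input records", 3L);

        // The central aggregator sums what the tasks report.
        Map<String, Long> jobTotals = new LinkedHashMap<>();
        merge(jobTotals, task1);
        merge(jobTotals, task2);
        System.out.println(jobTotals);
    }
}
```

Each task only ever sends its own small set of values, which is why the totals can be global without the tasks exchanging counters with one another.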
[mw_shl_code=bash,true]Built-in counters:
15/06/15 08:46:47 INFO mapreduce.Job: Job job_1434248323399_0004 completed successfully
15/06/15 08:46:47 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=103
                FILE: Number of bytes written=315873
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=116
                HDFS: Number of bytes written=40
                HDFS: Number of read operations=9
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=4
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=2
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=2893
                Total time spent by all reduces in occupied slots (ms)=6453
                Total time spent by all map tasks (ms)=2893
                Total time spent by all reduce tasks (ms)=6453
                Total vcore-seconds taken by all map tasks=2893
                Total vcore-seconds taken by all reduce tasks=6453
                Total megabyte-seconds taken by all map tasks=2962432
                Total megabyte-seconds taken by all reduce tasks=6607872
        Map-Reduce Framework
                Map input records=7
                Map output records=7
                Map output bytes=77
                Map output materialized bytes=103
                Input split bytes=95
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=103
                Reduce input records=7
                Reduce output records=2
                Spilled Records=14
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=59
                CPU time spent (ms)=3600
                Physical memory (bytes) snapshot=606015488
                Virtual memory (bytes) snapshot=2672865280
                Total committed heap usage (bytes)=602996736
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=21
        File Output Format Counters
                Bytes Written=40[/mw_shl_code]
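Because the dump above has a regular shape (a group header followed by indented name=value lines), it is easy to load into a program for analysis. The following is a minimal plain-Java sketch, not part of Hadoop: CounterDumpParser and parse are hypothetical names, and it assumes that the log prefix (timestamp, INFO mapreduce.Job:) has already been stripped, that group header lines contain no '=', and that every value is an integer.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that reads a printed counter dump into nested maps. */
final class CounterDumpParser {

    /** Returns group name -> (counter name -> value), in the order encountered. */
    static Map<String, Map<String, Long>> parse(List<String> lines) {
        Map<String, Map<String, Long>> groups = new LinkedHashMap<>();
        Map<String, Long> current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            int eq = line.lastIndexOf('=');
            if (eq < 0) {
                // No '=' on the line: treat it as a group header.
                current = groups.computeIfAbsent(line, k -> new LinkedHashMap<String, Long>());
            } else if (current != null) {
                // "name=value"; trim handles names such as "Shuffled Maps =2".
                String name = line.substring(0, eq).trim();
                long value = Long.parseLong(line.substring(eq + 1).trim());
                current.put(name, value);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList(
                "        Map-Reduce Framework",
                "                Map input records=7",
                "                Shuffled Maps =2",
                "        File Output Format Counters",
                "                Bytes Written=40");
        System.out.println(parse(dump));
    }
}
```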

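Using counters:
1. View them in the Web UI
(Note: the history server must be running)
(Screenshot: 20150615094910140.png)
2. Command line:
hadoop job -counter (the author reports that this does not work on Hadoop 2.x, where the hadoop job script is deprecated in favor of mapred job)
3. Hadoop API
Call job.getCounters() to obtain a Counters object, then call counters.findCounter() to get an individual counter. The final counter values are only available after the job has finished.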
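Option 3 might look like the following in a driver, once job.waitForCompletion(true) has returned. This is a sketch against the org.apache.hadoop.mapreduce API and needs the Hadoop client libraries on the classpath. The helper class CounterReport is a made-up name, and "ErrorCounter" / "below_2" are the group and counter names used in the example further down in this post.

```java
import java.io.IOException;

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

/** Hypothetical helper; call it after job.waitForCompletion(true) has returned. */
public final class CounterReport {

    public static void print(Job job) throws IOException {
        Counters counters = job.getCounters();
        if (counters == null) {
            // Counters may be unavailable, for example if the job has been retired.
            System.out.println("No counters available for this job");
            return;
        }

        // Built-in counter, looked up by its enum constant.
        long mapInput = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
        System.out.println("Map input records = " + mapInput);

        // Custom counter, looked up by group name and counter name.
        long below2 = counters.findCounter("ErrorCounter", "below_2").getValue();
        System.out.println("below_2 = " + below2);

        // Walk every group and every counter inside it.
        for (CounterGroup group : counters) {
            System.out.println(group.getDisplayName());
            for (Counter counter : group) {
                System.out.println("    " + counter.getDisplayName() + "=" + counter.getValue());
            }
        }
    }
}
```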
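Custom counters, with an example:

MapReduce lets users define their own counters. A counter is a global variable and belongs to a group; it can be defined either with a Java enum or with strings.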

[mw_shl_code=java,true]package org.apache.hadoop.mapreduce;
public interface TaskAttemptContext extends JobContext, Progressable {
    //Get the {@link Counter} for the given
    //<code>counterName</code>.
    public Counter getCounter(Enum<?> counterName);

    //Get the {@link Counter} for the given
    //<code>groupName</code> and <code>counterName</code>.
    public Counter getCounter(String groupName, String counterName);
}[/mw_shl_code]
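The string form (dynamic counters) is more flexible than the enum form, because counters can be added to a group at run time. The old API uses Reporter for this; the new API obtains the counter with context.getCounter(groupName, counterName) and then increments it.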
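As far as naming goes, an enum counter is filed under a group named after the enum's class, with the constant name as the counter name, whereas the string form lets both names be computed while the job runs. The plain-Java sketch below only illustrates that naming difference. LineQuality, CounterNaming and its methods are hypothetical, and no Hadoop classes are involved.

```java
/** Plain-Java illustration of how enum and string counters get their names. */
final class CounterNaming {

    /** Hypothetical enum of counters; the set of names is fixed at compile time. */
    enum LineQuality { BELOW_2, ABOVE_2 }

    /** Group name of an enum counter: the name of the enum's class. */
    static String groupOf(Enum<?> counter) {
        return counter.getDeclaringClass().getName();
    }

    /** Counter name of an enum counter: the constant's name. */
    static String nameOf(Enum<?> counter) {
        return counter.name();
    }

    /** A dynamic counter name built at run time, e.g. one counter per word count. */
    static String dynamicName(int wordCount) {
        return "words_" + wordCount;
    }

    public static void main(String[] args) {
        System.out.println(groupOf(LineQuality.BELOW_2) + " / " + nameOf(LineQuality.BELOW_2));
        System.out.println("ErrorCounter / " + dynamicName(3));
    }
}
```

With the enum form the compiler catches a misspelled counter name; with the string form a new counter such as words_3 can appear without any code change, at the cost of that check.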
[mw_shl_code=java,true]package org.apache.hadoop.mapreduce;
/**
* A named counter that tracks the progress of a map/reduce job.
* <p><code>Counters</code> represent global counters, defined either by the
* Map-Reduce framework or applications. Each <code>Counter</code> is named by
* an {@link Enum} and has a long for the value.</p>
* <p><code>Counters</code> are bunched into Groups, each comprising of
* counters from a particular <code>Enum</code> class.
*/
public interface Counter extends Writable {
  /**
   * Increment this counter by the given value
   * @param incr the value to increase this counter by
   */
  void increment(long incr);
}[/mw_shl_code]
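Custom counter example
Count the lines that contain more than two words and the lines that contain fewer than two words.
Input data file: counter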
[mw_shl_code=bash,true][root@liguodong file]# vi counter
[root@liguodong file]# hdfs dfs -put counter /counter
[root@liguodong file]# hdfs dfs -cat /counter
hello world
hello hadoop
hi baby
hello 4325 7785993
java hadoop
come[/mw_shl_code]
[mw_shl_code=java,true]package MyCounter;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyCounter {
    private final static String INPUT_PATH = "hdfs://liguodong:8020/counter";
    private final static String OUTPUT_PATH = "hdfs://liguodong:8020/outputcounter";

    // Map-only job: passes every line through unchanged and counts the
    // lines whose word count is not exactly 2.
    public static class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text>
    {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException
        {
            // Split the line on runs of whitespace.
            String[] val = value.toString().split("\\s+");
            if(val.length<2){
                // Dynamic counter: group "ErrorCounter", counter "below_2".
                context.getCounter("ErrorCounter","below_2").increment(1);
            }else if(val.length>2){
                context.getCounter("ErrorCounter", "above_2").increment(1);
            }
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws IllegalArgumentException, IOException,
    URISyntaxException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Delete the output directory if it is left over from an earlier run.
        final FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH),conf);
        if(fileSystem.exists(new Path(OUTPUT_PATH)))
        {
            fileSystem.delete(new Path(OUTPUT_PATH),true);
        }
        Job job = Job.getInstance(conf, "define counter");

        // Locate the job jar through this class.
        job.setJarByClass(MyCounter.class);

        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        job.setMapperClass(MyMapper.class);

        // No reduce phase: map output is written directly.
        job.setNumReduceTasks(0);

        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}[/mw_shl_code]
[mw_shl_code=bash,true]Run output:
[main] INFO org.apache.hadoop.mapreduce.Job - Counters: 25
    File System Counters
        FILE: Number of bytes read=148
        FILE: Number of bytes written=187834
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=69
        HDFS: Number of bytes written=86
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=3
    Map-Reduce Framework
        Map input records=6
        Map output records=6
        Input split bytes=94
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=12
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=16252928
    ErrorCounter
        above_2=1
        below_2=1
    File Input Format Counters
        Bytes Read=69
    File Output Format Counters
        Bytes Written=86[/mw_shl_code]
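The two ErrorCounter values can be cross-checked without a cluster by applying the same split-and-compare rule to the six input lines. The sketch below is plain Java, not MapReduce. LineLengthCheck and classify are made-up names, and the returned array simply stands in for the two counters.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java check of the mapper's counting rule; not a MapReduce job. */
final class LineLengthCheck {

    /** Returns {below_2, above_2}: lines with fewer than two and more than two words. */
    static long[] classify(List<String> lines) {
        long below2 = 0;
        long above2 = 0;
        for (String line : lines) {
            // Same rule as the mapper: split on runs of whitespace.
            String[] words = line.split("\\s+");
            if (words.length < 2) {
                below2++;
            } else if (words.length > 2) {
                above2++;
            }
        }
        return new long[] { below2, above2 };
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "hello world", "hello hadoop", "hi baby",
                "hello 4325 7785993", "java hadoop", "come");
        long[] result = classify(input);
        // prints "below_2=1 above_2=1"
        System.out.println("below_2=" + result[0] + " above_2=" + result[1]);
    }
}
```

Only "come" has fewer than two words and only "hello 4325 7785993" has more than two, which matches below_2=1 and above_2=1 in the job output.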



4 comments so far

轩辕依梦Q posted on 2015-6-21 10:41:15
Thanks for sharing; this fills a gap in my knowledge.

rocky2015 posted on 2015-12-13 17:34:54
Nice. It would be even better with a few more screenshots.

阿里云 posted on 2017-10-29 09:50:24
Thanks for sharing. Theory combined with a working implementation, great!