
Using Hive on Hadoop to find the top 10 most-requested URLs in a CDN access log within a given time window (combined with Python...)


1. Hadoop environment:
master node: node1
slave nodes: node2, node3, node4
remote server (Python connecting to Hive): node29

2. Requirement: use Hive to find the 10 most frequently requested URLs in the CDN log within a given time window.
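To make the requirement concrete, the counting logic the Hive query performs is roughly equivalent to the local Python sketch below (the field positions used to parse each log line are assumptions about the cdnlog format; the real work in this post is done by the Hive query in the script of step 3):

# Local sketch of the "top 10 URLs in a time window" logic, for illustration only.
# The field positions below are assumptions about the cdnlog line format.
from collections import Counter

def top10_local(logfile, start, end):
    counts = Counter()
    with open(logfile) as f:
        for line in f:
            fields = line.split()
            time_field = fields[3]   # assumed position of the '[27/Oct/2014:10:40:00' field
            request = fields[6]      # assumed position of the requested URL
            # String comparison, mirroring how the Hive query below compares the time column.
            if start <= time_field <= end:
                counts[request] += 1
    return counts.most_common(10)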


PS: an equivalent query written with Pig is covered in this article:

http://shineforever.blog.51cto.com/1429204/1571124

3. Note: accessing Hive remotely from Python goes through the Thrift interface. The Hive distribution ships the Thrift Python bindings under lib/py:
[root@node1 shell]# ls -l /usr/local/hive-0.8.1/lib/py
total 28
drwxr-xr-x 2 hadoop hadoop 4096 Nov  5 15:29 fb303
drwxr-xr-x 2 hadoop hadoop 4096 Oct 15 10:30 fb303_scripts
drwxr-xr-x 2 hadoop hadoop 4096 Nov  5 15:29 hive_metastore
drwxr-xr-x 2 hadoop hadoop 4096 Oct 15 10:30 hive_serde
drwxr-xr-x 2 hadoop hadoop 4096 Nov  5 15:29 hive_service
drwxr-xr-x 2 hadoop hadoop 4096 Nov  5 15:20 queryplan
drwxr-xr-x 6 hadoop hadoop 4096 Nov  5 15:20 thrift
1) Copy the relevant files over to the corresponding directory on the remote node29:
scp -r /usr/local/hive-0.8.1/lib/py/* 172.16.41.29:/usr/local/hive_py/.
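A quick way to confirm the copied bindings are importable on node29 (a minimal sketch; /usr/local/hive_py matches the scp target above):

import sys
sys.path.append('/usr/local/hive_py')   # directory the bindings were copied into

# These imports should succeed if the copy worked; they are the same modules
# the query script in step 3) relies on.
from hive_service import ThriftHive
from thrift.transport import TSocket
print('Hive Thrift bindings are importable')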
2) Start the Hive server on node1:
[hadoop@node1 py]$ hive --service hiveserver
Starting Hive Thrift Server
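Before writing the query script, it helps to confirm from node29 that the Thrift port is reachable (a minimal sketch; 172.16.41.151:10000 is the HiveServer address and default port used by the script below):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(5)
try:
    s.connect(('172.16.41.151', 10000))   # node1's HiveServer Thrift port
    print('HiveServer is reachable')
finally:
    s.close()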
3) Write the query script on node29:
#!/usr/bin/env python
#coding:utf-8
# Find the 10 most frequently requested URLs in the CDN log for a given time window.

import sys

# Load the Hive Thrift Python bindings copied over in step 1).
sys.path.append('/usr/local/hive_py')
from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol


dbname = "default"
hsql = ("select request, count(request) as counts from cdnlog "
        "where time >= '[27/Oct/2014:10:40:00 +0800]' and time <= '[27/Oct/2014:10:49:59 +0800]' "
        "group by request order by counts desc limit 10")

def hiveExe(hsql, dbname):

    try:
        transport = TSocket.TSocket('172.16.41.151', 10000)
        transport = TTransport.TBufferedTransport(transport)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)
        client = ThriftHive.Client(protocol)
        transport.open()
        # Load the contrib jar (required, e.g. for the regex SerDe); note this path is
        # on the remote Hive server (node1), not on the machine running this script!
        client.execute('add jar /usr/local/hive-0.8.1/lib/hive_contrib.jar')
#        client.execute("use " + dbname)   # optional: the default database is used anyway
        client.execute(hsql)
        rows = client.fetchAll()   # fetch the whole result set
        transport.close()
        return rows
    except Thrift.TException, tx:
        print '%s' % (tx.message)

if __name__ == '__main__':
    results = hiveExe(hsql, dbname)
    for row in results:
        print row
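fetchAll() returns each result row as a single string; with HiveServer's default output format the columns are tab-separated, so the rows can be split into (url, count) pairs afterwards (a sketch, assuming that tab-separated format):

results = hiveExe(hsql, dbname)
for row in results:
    request, counts = row.split('\t')
    print '%s\t%s' % (counts, request)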

Run the script on node29; the output is:
[screenshot: top-10 query results]
The Hive job running on node1 looks like this:
[screenshot: Hive/MapReduce job output on node1]
