由于数据量大关系型数据库可装载但处理起来很容易遇到性能瓶颈,第一步工作是将关系型数据库数据同步到分布式上,一下脚本是在
hive中处理大批量数据,然后将结果集数据每天全量同步到mysql用作报表展现
[mw_shl_code=shell,true]#########################################################################
## *@Company: bcw
## *@Prog Desc: 功能:REPORT_BI_数据导出
## *@Name: data_hive_to_mysql
## *@Modify 朱齐天
## *@Modify Date: 2017/06/12
## *@1,hive_to_mysql
#########################################################################
#!/bin/bash
date=`date +%Y%m%d -d'1 days ago'`
echo $date
source /etc/profile
#login to mysql
HOSTNAME="xxx"
PORT="3306"
USERNAME="xxx"
PASSWORD="xxx"
DBNAME="db"
TABNAME="table"
JDBCDRIVER="jdbc:mysql://"
#terminated args
FIELDSTERMINATED="\t"
LINESTERMINATED="\n"
NULLSTRING="\\\N"
NONSTRING="\\\N"
#hdfsdir args
HDFSDIR="xxx"
HIVEDB="xxx"
HIVEDB_BAK="xxx_bak"
#hql scripts 逻辑处理完的结果集数据
HQL1="insert overwrite table ${HIVEDB}.${TABNAME}
xxx"
#数据备份
HQL2="insert overwrite table ${HIVEDB_BAK}.${TABNAME} partition(pt_date=${date})
echo "#####################################run hql scripts:${HQL1}${HQL2}"
#execu hiveql
hive -e "${HQL1}${HQL2}";
select * from ${HIVEDB}.${TABNAME};"
if [ $? -ne 0 ];then
echo "##################################run hql failed!"
exit 1;
else
echo "##################################run hql success!"
fi
#"清除mysql数据库订单发货按发货时间表数据"
echo "clear data for ${DBNAME}.${TABNAME}"
mysql -h${HOSTNAME} -P${PORT} -u${USERNAME} -p${PASSWORD} ${DBNAME} -e "
truncate table ${TABNAME};
"
if [ $? -ne 0 ];then
echo "##################################clear data failed!"
exit 1;
else
echo "##################################clear data success!"
fi
COLUMNS=`hive -e "use ${HIVEDB};desc ${TABNAME}"`
for column in `echo "${COLUMNS}"| awk -F" " '{print $1}' | grep -v "col_name" | grep -v "pt_date" | grep -v "#"`
do
sql+="${column},"
done
LENGTH=${#sql}
echo "columns:${sql:0:${LENGTH}-1}"
#同步数据到mysql
sqoop export \
--connect ${JDBCDRIVER}${HOSTNAME}:${PORT}/${DBNAME} \
--username ${USERNAME} \
--password ${PASSWORD} \
--table ${TABNAME} \
--columns "${sql:0:${LENGTH}-1}" \
--export-dir ${HDFSDIR}${TABNAME}/ \
--input-fields-terminated-by ${FIELDSTERMINATED} \
--input-lines-terminated-by ${LINESTERMINATED} \
--input-null-string ${NULLSTRING} \
--input-null-non-string ${NONSTRING} ;
if [ $? -ne 0 ];then
echo "##################################synchronization data failed!"
exit 1;
else
echo "##################################synchronization data success!"
fi
exit;[/mw_shl_code]
方法比较low,肯定是有更高效的办法,这份作为第一个版本,因为我们hive是用tez作为执行引擎,效率也还不错的
重在流程,我们也不会把一些高端技术生硬的套上去,企业中满足需求即可
|
|