
How an expert works through a hard problem: a summary of troubleshooting errors when writing HFiles from Spark

nettman · 2019-5-14 19:41:45
Guiding questions

1. What problem does this article solve?
2. What problems came up while writing to HBase?
3. What was the approach to solving them?


Requirement
Parse the data stored on HDFS and bulk-load it into HBase via HFiles (multiple columns per row need to be written).
The key API for writing the data:
[mw_shl_code=scala,true]saveAsNewAPIHadoopFile(
stagingFolder,
classOf[ImmutableBytesWritable],
classOf[KeyValue],
classOf[HFileOutputFormat2],
job.getConfiguration)[/mw_shl_code]



Points worth noting:
1):
The first warning encountered when writing HFiles:
Does it contain files in subdirectories that correspond to column family
names

There are roughly three causes:
A: a bug in the code
B: a problem in the data source
C: the class configured with setMapOutputKeyClass does not match the class passed to saveAsNewAPIHadoopFile (see the sketch after this list)
(in my case it was the data source)
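
On cause C: the map-output key/value classes set on the Hadoop Job must be the same classes passed to saveAsNewAPIHadoopFile. A minimal sketch, assuming `rdd`, `stagingFolder` and `config` already exist (they are placeholders here, not from the original post):

[mw_shl_code=scala,true]import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.mapreduce.Job

// Keep the Job's map-output classes and the saveAsNewAPIHadoopFile
// type parameters consistent; a mismatch is one source of the warning above.
val job = Job.getInstance(config)
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable]) // must match the key class below
job.setMapOutputValueClass(classOf[KeyValue])             // must match the value class below

rdd.saveAsNewAPIHadoopFile(
  stagingFolder,
  classOf[ImmutableBytesWritable], // same key class as setMapOutputKeyClass
  classOf[KeyValue],               // same value class as setMapOutputValueClass
  classOf[HFileOutputFormat2],
  job.getConfiguration)[/mw_shl_code]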

2):
With normal Put operations, the server side sorts the data for us, so Put-based writes never hit this error:

Added a key not lexically larger than previous

But when writing HFiles, if this error shows up:

Added a key not lexically larger than previous

the usual assumption is that the rowkeys were not sorted. I naively went and checked, and the rowkeys were in fact sorted.
The real reason: when Spark writes HFiles, the data is ordered by rowkey + column family + column qualifier, so the data being written has to be ordered as a whole, not just by rowkey (a minimal sketch of such an ordering is shown below).
(And that was not the end of it.)
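
To make the "ordered as a whole" requirement concrete, here is a minimal sketch that sorts flattened (ImmutableBytesWritable, KeyValue) pairs by rowkey, column family and qualifier together. It assumes an HBase 1.x-era client (where KeyValue.getRow/getFamily/getQualifier are available) and an existing RDD called `kvRdd`; the string concatenation is a simplification of a raw byte-level comparison:

[mw_shl_code=scala,true]import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Sketch only: sort by (rowkey, column family, qualifier) so the KeyValues
// arrive at HFileOutputFormat2 in full lexical order, not just rowkey order.
def sortForHFile(kvRdd: RDD[(ImmutableBytesWritable, KeyValue)])
    : RDD[(ImmutableBytesWritable, KeyValue)] = {
  kvRdd
    .map { case (ik, kv) =>
      // Composite sort key: rowkey + family + qualifier (string form for brevity)
      val sortKey = Bytes.toString(kv.getRow) + "|" +
        Bytes.toString(kv.getFamily) + "|" +
        Bytes.toString(kv.getQualifier)
      (sortKey, (ik, kv))
    }
    .sortByKey()  // global sort across partitions
    .map(_._2)    // drop the composite key again
}[/mw_shl_code]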

3):
Since multiple columns need to be written per row, the most convenient options are either using reflection to obtain the column names and values dynamically, or reading them from a DataFrame (df.columns).
The reflection approach:
[mw_shl_code=scala,true]val listData: RDD[(ImmutableBytesWritable, ListBuffer[KeyValue])] = rdd.map {
  line =>
    val rowkey = line.vintime
    // Collect the bean's field names via reflection
    val clazz = Class.forName(XXXXXXXXXXXXXXXX)
    val fields = clazz.getDeclaredFields
    val list = new ListBuffer[String]()
    val kvlist = new ListBuffer[KeyValue]()
    if (fields != null && fields.size > 0) {
      for (field <- fields) {
        field.setAccessible(true)
        val column = field.getName
        list.append(column)
      }
    }
    // Sort the column names so the KeyValues inside a row are ordered
    val newList = list.sortWith(_ < _)
    val ik = new ImmutableBytesWritable(Bytes.toBytes(rowkey))
    for (column <- newList) {
      val declaredField: Field = line.getClass.getDeclaredField(column)
      declaredField.setAccessible(true)
      val value = declaredField.get(line).toString
      val kv: KeyValue = new KeyValue(
        Bytes.toBytes(rowkey),
        Bytes.toBytes(columnFamily),
        Bytes.toBytes(column),
        Bytes.toBytes(value))
      kvlist.append(kv)
    }
    (ik, kvlist)
}[/mw_shl_code]


The DataFrame approach:
[mw_shl_code=scala,true]val tmpData: RDD[(ImmutableBytesWritable, util.LinkedList[KeyValue])] = df.rdd.map {
  line =>
    val rowkey = line.getAs[String]("vintime")
    val ik = new ImmutableBytesWritable(Bytes.toBytes(rowkey))
    val linkedList = new util.LinkedList[KeyValue]()
    // `columns` is obtained from df.columns
    for (column <- columns) {
      val kv: KeyValue = new KeyValue(
        Bytes.toBytes(rowkey),
        Bytes.toBytes(columnFamily),
        Bytes.toBytes(column),
        Bytes.toBytes(line.getAs[String](column)))
      linkedList.add(kv)
    }
    (ik, linkedList)
}
// Flatten the per-row Java lists into (key, KeyValue) pairs and sort by rowkey
val result: RDD[(ImmutableBytesWritable, KeyValue)] = tmpData.flatMapValues {
  s => JavaConverters.asScalaIteratorConverter(s.iterator()).asScala
}.sortBy(x => x._1, true)[/mw_shl_code]

Looking closely, both approaches do sort the data, yet even after step (1) the job still failed with:

Added a key not lexically larger than previous

Recall the requirement for writing HFiles: rowkey + column family + column qualifier must all be ordered. If the data contains duplicate rows, it is not strictly ordered either!
So, deduplicate the data first:
[mw_shl_code=scala,true]// Keep only one record per rowkey (the last one wins)
val key: RDD[(String, TransferTime)] = data.reduceByKey((x, y) => y)
val unitData: RDD[TransferTime] = key.map(line => line._2)[/mw_shl_code]
Sure enough, this resolved the "Added a key not lexically larger than previous" exception.
But then another exception appeared:
[mw_shl_code=scala,true]Kryo serialization failed: Buffer overflow[/mw_shl_code]
This happens because, when Kryo serializes some of the classes, the amount of data exceeds the default buffer size; adjusting the buffer settings is enough (a fuller config sketch follows the snippet below):
[mw_shl_code=scala,true]sparkConf.set("spark.kryoserializer.buffer.max" , "256m")
sparkConf.set("spark.kryoserializer.buffer" , "64m")[/mw_shl_code]
The complete code:
[mw_shl_code=scala,true]/**
  * Created by angel
  */
object WriteTransferTime extends WriteToHbase {
  /**
    * @param data         the data to insert
    * @param tableName    the table name
    * @param columnFamily the column family
    **/
  override def bulkLoadData(data: RDD[Any], tableName: String, columnFamily: String): Unit = {
    val bean: RDD[TransferTime] = data.map(line => line.asInstanceOf[TransferTime])
    val map: RDD[(String, TransferTime)] = bean.map(line => (line.vintime, line))
    // Deduplicate by rowkey (the last record wins), then sort by rowkey
    val key: RDD[(String, TransferTime)] = map.reduceByKey((x, y) => y)
    val map1: RDD[TransferTime] = key.map(line => line._2)
    val by1: RDD[TransferTime] = map1.sortBy(f => f.vintime)
    // Build (rowkey, sorted KeyValue list) pairs via reflection
    val listData: RDD[(ImmutableBytesWritable, ListBuffer[KeyValue])] = by1.map {
      line =>
        val rowkey = line.vintime
        val clazz = Class.forName("com.dongfeng.code.Bean.message.TransferTime")
        val fields = clazz.getDeclaredFields
        val list = new ListBuffer[String]()
        val kvlist = new ListBuffer[KeyValue]()
        if (fields != null && fields.size > 0) {
          for (field <- fields) {
            field.setAccessible(true)
            val column = field.getName
            list.append(column)
          }
        }
        // Sort the column names so the KeyValues inside a row are ordered
        val newList = list.sortWith(_ < _)
        val ik = new ImmutableBytesWritable(Bytes.toBytes(rowkey))
        for (column <- newList) {
          val declaredField: Field = line.getClass.getDeclaredField(column)
          declaredField.setAccessible(true)
          val value = declaredField.get(line).toString
          val kv: KeyValue = new KeyValue(
            Bytes.toBytes(rowkey),
            Bytes.toBytes(columnFamily),
            Bytes.toBytes(column),
            Bytes.toBytes(value))
          kvlist.append(kv)
        }
        (ik, kvlist)
    }
    // Flatten to (rowkey, KeyValue) pairs and sort by rowkey
    val result: RDD[(ImmutableBytesWritable, KeyValue)] = listData.flatMapValues(s => s.iterator)
    val resultDD: RDD[(ImmutableBytesWritable, KeyValue)] = result.sortBy(x => x._1, true)
    WriteToHbaseDB.hfile_load(resultDD, TableName.valueOf(tableName), columnFamily)
  }
}
// hfile_load is the method of WriteToHbaseDB referenced above
def hfile_load(rdd: RDD[(ImmutableBytesWritable, KeyValue)], tableName: TableName, columnFamily: String): Unit = {
  // Handle on the target table
  var table: Table = null
  try {
    val startTime = System.currentTimeMillis()
    println(s"Start time: -------->${startTime}")
    // Temporary HDFS path where the generated HFiles are staged
    val stagingFolder = "hdfs://cdh1:9000/hfile/" + tableName + new Date().getTime
    table = connection.getTable(tableName)
    // Create the table if it does not exist yet
    if (!admin.tableExists(tableName)) {
      createTable(tableName, columnFamily)
    }
    // Write the HFiles
    val job = Job.getInstance(config)
    job.setJobName("DumpFile")
    job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setMapOutputValueClass(classOf[KeyValue])
    rdd.sortBy(x => x._1, true).saveAsNewAPIHadoopFile(
      stagingFolder,
      classOf[ImmutableBytesWritable],
      classOf[KeyValue],
      classOf[HFileOutputFormat2],
      job.getConfiguration)
    // Bulk-load the staged HFiles into the table
    val load = new LoadIncrementalHFiles(config)
    val regionLocator = connection.getRegionLocator(tableName)
    HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)
    load.doBulkLoad(new Path(stagingFolder), table.asInstanceOf[HTable])
    // load.doBulkLoad(new Path(stagingFolder), connection.getAdmin, table, regionLocator)
    val endTime = System.currentTimeMillis()
    println(s"End time: -------->${endTime}")
    println(s"Elapsed: ----------------->${endTime - startTime}ms")
  } catch {
    case e: IOException =>
      e.printStackTrace()
  } finally {
    if (table != null) {
      try {
        // Close the table handle
        table.close()
      } catch {
        case e: IOException =>
          e.printStackTrace()
      }
    }
    if (connection != null) {
      try {
        // Close the HBase connection
        connection.close()
      } catch {
        case e: IOException =>
          e.printStackTrace()
      }
    }
  }
}[/mw_shl_code]
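
For completeness, a hypothetical invocation of the method above; the `parsedRdd` variable, table name and column family are placeholders, not from the original post:

[mw_shl_code=scala,true]// Hypothetical usage, assuming `parsedRdd: RDD[TransferTime]` was parsed from HDFS.
val anyRdd: RDD[Any] = parsedRdd.map(t => t: Any) // widen to RDD[Any], as bulkLoadData expects
WriteTransferTime.bulkLoadData(anyRdd, "transfer_time", "cf")[/mw_shl_code]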


Source document (attachment): spark写hfile遇到错误以及总结.zip (107.62 KB)