HBase Connection Pooling: HTablePool Is Deprecated, and the Likely Reasons Why

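This post was last edited by pig2 on 2014-8-28 00:51
Reading guide:
1. How does the official documentation explain the deprecation of HTablePool?
2. Which class should be used in place of HTablePool?
3. How do you get a table through HConnectionManager?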

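1. Connections
HTable is the HBase client. It is responsible for finding, via the meta table, the RegionServers that host the target data. Once the target RegionServers are located, the client talks to them directly and no longer needs to go through the master.
HTable instances are not thread-safe. When you need to create HTable instances, it is wise to reuse the same Configuration instance (created by HBaseConfiguration.create()), so that the instances share the ZooKeeper connection and the sockets to the RegionServers. For example, prefer code like this: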

Configuration conf = HBaseConfiguration.create();
HTable table1 = new HTable(conf, "myTable");
HTable table2 = new HTable(conf, "myTable");

rather than code like this:

Configuration conf1 = HBaseConfiguration.create();
HTable table1 = new HTable(conf1, "myTable");
Configuration conf2 = HBaseConfiguration.create();
HTable table2 = new HTable(conf2, "myTable");

2. Connection pooling
When multiple threads need access, you can create an HConnection up front, as in the following code:

Example 9.1. Pre-Creating a HConnection

// Create a connection to the cluster.
HConnection connection = HConnectionManager.createConnection(Configuration);
HTableInterface table = connection.getTable("myTable");
// use table as needed, the table returned is lightweight
table.close();
// use the connection for other access to the cluster
connection.close();

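Constructing an HTableInterface implementation is very lightweight, and resources are controlled.

Note:
HTablePool is the old way of pooling HBase connections. The class is deprecated in 0.94, 0.95 and 0.96, and has been removed in versions after 0.98.1. (The sparse official documentation ends here.)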

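3. HConnectionManager
This class is the key to connection pooling, so it gets a section of its own.
HConnectionManager is a non-instantiable class whose sole purpose is to create HConnection instances.
The simplest way to create an HConnection is HConnectionManager.createConnection(config). This creates an HConnection to the cluster, and the program that creates it is responsible for managing it. From this HConnection you can obtain HTableInterface implementations through HConnection.getTable(byte[]), for example: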
         HConnection connection = HConnectionManager.createConnection(config);
         HTableInterface table = connection.getTable("tablename");
         try {
            // Use the table as needed, for a single operation and a single thread
         } finally {
            table.close();
            connection.close();
         }

3.1 Constructors
None; the class cannot be instantiated.

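3.2 Common methods
(1) static HConnection createConnection(org.apache.hadoop.conf.Configuration conf)
Creates a new HConnection instance.
This method bypasses the regular HConnection lifecycle management, in which connections are obtained through getConnection(Configuration). The caller is responsible for calling Closeable.close() on the returned connection.
The recommended way to create an HConnection is: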
      HConnection connection = HConnectionManager.createConnection(conf);
      HTableInterface table = connection.getTable("mytable");
      table.get(...);
      ...
      table.close();
      connection.close();

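(2) public static HConnection getConnection(org.apache.hadoop.conf.Configuration conf)
Gets the connection instance that goes with conf. If no such connection exists, the method creates a new one.

Note: this method is deprecated in both 0.96 and 0.98, yet it has come back to life in the latest unreleased code!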

3.3 Example code
package fulong.bigdata.hbase;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ConnectionPoolTest {
private static final String QUORUM = "FBI001,FBI002,FBI003";
private static final String CLIENTPORT = "2181";
private static final String TABLENAME = "rd_ns:itable";
private static Configuration conf = null;
private static HConnection conn = null;

static{
      try {
         conf =  HBaseConfiguration.create();
         conf.set("hbase.zookeeper.quorum", QUORUM);
         conf.set("hbase.zookeeper.property.clientPort", CLIENTPORT);
         conn = HConnectionManager.createConnection(conf);
      } catch (IOException e) {
         e.printStackTrace();
      }
}

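public static void main(String[] args) throws IOException {
      // Table handles are lightweight; get one from the shared connection.
      HTableInterface htable = ConnectionPoolTest.conn.getTable(TABLENAME);
      try {
         Scan scan = new Scan();
         ResultScanner rs = htable.getScanner(scan);
         try {
            // Print the cells of the first five rows.
            for (Result r : rs.next(5)) {
               for (Cell cell : r.rawCells()) {
                  System.out.println("Rowkey : " + Bytes.toString(r.getRow())
                        + " Family:Qualifier : "
                        + Bytes.toString(CellUtil.cloneFamily(cell)) + ":"
                        + Bytes.toString(CellUtil.cloneQualifier(cell))
                        + " Value : "
                        + Bytes.toString(CellUtil.cloneValue(cell))
                        + " Time : " + cell.getTimestamp());
               }
            }
         } finally {
            rs.close(); // release the scanner's resources
         }
      } finally {
         htable.close();
         conn.close(); // the caller owns the connection and must close it
      }

}
}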

4. New findings from reading the source
4.1 The vanishing HConnectionManager.getConnection
In the HConnectionManager source for 0.96 and 0.98 you can see that
  static final Map<HConnectionKey, HConnectionImplementation> CONNECTION_INSTANCES;

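is the connection pool. Each connection in the pool is identified by an HConnectionKey. However, every method in the HConnectionManager source that touches CONNECTION_INSTANCES has been deprecated.

Let's look at the deprecated getConnection method: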
  /**
   * Get the connection that goes with the passed <code>conf</code> configuration instance.
   * If no current connection exists, method creates a new connection and keys it using
   * connection-specific properties from the passed {@link Configuration}; see
   * {@link HConnectionKey}.
   * @param conf configuration
   * @return HConnection object for <code>conf</code>
   * @throws ZooKeeperConnectionException
   */
  @Deprecated
  public static HConnection getConnection(final Configuration conf)
  throws IOException {
    HConnectionKey connectionKey = new HConnectionKey(conf);
    synchronized (CONNECTION_INSTANCES) {
      HConnectionImplementation connection = CONNECTION_INSTANCES.get(connectionKey);
      if (connection == null) {
        connection = (HConnectionImplementation)createConnection(conf, true);
        CONNECTION_INSTANCES.put(connectionKey, connection);
      } else if (connection.isClosed()) {
        HConnectionManager.deleteConnection(connectionKey, true);
        connection = (HConnectionImplementation)createConnection(conf, true);
        CONNECTION_INSTANCES.put(connectionKey, connection);
      }
      connection.incCount();
      return connection;
    }
  }
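The logic of this method is simple:
It builds an HConnectionKey from the conf passed in, then uses that HConnectionKey as the key to look up a connection in the pool map CONNECTION_INSTANCES. If a connection is found, it is returned; if none is found, a new one is created; if one is found but has already been closed, it is deleted and a new one is created. In every case the connection's reference count is incremented before it is returned.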
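
To make the flow above concrete, here is a minimal, self-contained sketch of the same lookup / create / replace-if-closed pattern. It is not HBase code: FakeConnection, ConnectionCacheSketch and the plain String key are hypothetical stand-ins, used only to illustrate the caching logic described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a cached connection; not an HBase class.
class FakeConnection {
    private boolean closed = false;
    private int refCount = 0;

    boolean isClosed() { return closed; }
    void close() { closed = true; }
    void incCount() { refCount++; }
    int getRefCount() { return refCount; }
}

// Hypothetical cache that mirrors the lookup / create / replace-if-closed flow.
class ConnectionCacheSketch {
    private static final Map<String, FakeConnection> INSTANCES =
            new HashMap<String, FakeConnection>();

    static FakeConnection getConnection(String key) {
        synchronized (INSTANCES) {
            FakeConnection connection = INSTANCES.get(key);
            if (connection == null) {
                // Nothing cached for this key: create and remember a new one.
                connection = new FakeConnection();
                INSTANCES.put(key, connection);
            } else if (connection.isClosed()) {
                // The cached entry is stale: drop it and create a replacement.
                INSTANCES.remove(key);
                connection = new FakeConnection();
                INSTANCES.put(key, connection);
            }
            connection.incCount(); // every caller bumps the reference count
            return connection;
        }
    }

    public static void main(String[] args) {
        FakeConnection a = getConnection("cluster-1");
        FakeConnection b = getConnection("cluster-1");
        System.out.println("same instance: " + (a == b));
        System.out.println("reference count: " + a.getRefCount());
        a.close();
        FakeConnection c = getConnection("cluster-1");
        System.out.println("replaced after close: " + (a != c));
    }
}
```

Two lookups with the same key share one instance and raise its reference count; once that instance is closed, the next lookup replaces it.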

Now let's look at the HConnectionKey constructor:

  HConnectionKey(Configuration conf) {
    Map<String, String> m = new HashMap<String, String>();
    if (conf != null) {
      for (String property : CONNECTION_PROPERTIES) {
        String value = conf.get(property);
        if (value != null) {
          m.put(property, value);
        }
      }
    }
    this.properties = Collections.unmodifiableMap(m);

    try {
      UserProvider provider = UserProvider.instantiate(conf);
      User currentUser = provider.getCurrent();
      if (currentUser != null) {
        username = currentUser.getName();
      }
    } catch (IOException ioe) {
      HConnectionManager.LOG.warn("Error obtaining current user, skipping username in HConnectionKey", ioe);
    }
  }

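As the source above shows, when an HConnectionKey is constructed from conf, the connection-related properties of conf (those listed in CONNECTION_PROPERTIES) are copied into the HConnectionKey's own properties. In other words, no matter how many times you call new, as long as the conf properties are the same, the resulting HConnectionKey instances have the same properties.
Conclusion 1: conf properties --> HConnectionKey properties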

Next, back in the getConnection source, we see this line:
   HConnectionImplementation connection = CONNECTION_INSTANCES.get(connectionKey);

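This code uses the HConnectionKey instance as the key to check whether the CONNECTION_INSTANCES LinkedHashMap already contains an entry for it. Note that a map's get method does not compare object identity: it uses the key's hashCode() to find the bucket and then equals() to confirm the match, as you can see by reading the JDK source.
HConnectionKey overrides hashCode (and equals, which compares the same fields):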
  @Override
  public int hashCode() {
    final int prime = 31;
    int result = 1;
    if (username != null) {
      result = username.hashCode();
    }
    for (String property : CONNECTION_PROPERTIES) {
      String value = properties.get(property);
      if (value != null) {
        result = prime * result + value.hashCode();
      }
    }

    return result;
  }

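In this code, the hash code that is finally returned depends on the current user name and on the connection properties of the current conf. So as long as the conf properties and the user are the same, the HConnectionKey instances have the same hash code!
Conclusion 2: conf properties --> HConnectionKey hash code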

Look again at the line from before:
   HConnectionImplementation connection = CONNECTION_INSTANCES.get(connectionKey);
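For the connectionKey argument of get, it does not matter whether it is the same object: as long as the properties of connectionKey are the same, its hash code is the same and equals() returns true, so for get it is the same key!
This gives conclusion 3: conf properties --> HConnectionKey hash code --> connection instance returned by get
Put another way, conclusion 3 says:
same conf properties --> CONNECTION_INSTANCES.get returns the same connection instance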
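
The chain of conclusions above can be checked with a small, self-contained sketch. KeySketch below is a hypothetical key class modeled on the idea of HConnectionKey (a user name plus a property map); it is not the HBase source, and the user name and quorum strings are made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical key class modeled on the idea of HConnectionKey; not HBase code.
final class KeySketch {
    private final String username;
    private final Map<String, String> properties;

    KeySketch(String username, Map<String, String> properties) {
        this.username = username;
        this.properties =
                Collections.unmodifiableMap(new HashMap<String, String>(properties));
    }

    @Override
    public int hashCode() {
        // Derived only from field values, so equal fields give equal hash codes.
        return 31 * username.hashCode() + properties.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof KeySketch)) return false;
        KeySketch other = (KeySketch) obj;
        return username.equals(other.username) && properties.equals(other.properties);
    }
}

class KeyLookupDemo {
    // Builds a property map for a made-up ZooKeeper quorum.
    static Map<String, String> props(String quorum) {
        Map<String, String> m = new HashMap<String, String>();
        m.put("hbase.zookeeper.quorum", quorum);
        m.put("hbase.zookeeper.property.clientPort", "2181");
        return m;
    }

    public static void main(String[] args) {
        Map<KeySketch, Object> cache = new HashMap<KeySketch, Object>();
        Object connection = new Object(); // placeholder for a connection
        cache.put(new KeySketch("alice", props("zk1,zk2,zk3")), connection);

        // A different key object with identical field values finds the same entry.
        KeySketch sameValues = new KeySketch("alice", props("zk1,zk2,zk3"));
        System.out.println("hit: " + (cache.get(sameValues) == connection));

        // A key for another cluster misses; a real cache would then create a second connection.
        KeySketch otherCluster = new KeySketch("alice", props("zkA,zkB,zkC"));
        System.out.println("miss: " + (cache.get(otherCluster) == null));
    }
}
```

The lookup succeeds for a freshly constructed key because both hashCode() and equals() depend only on the field values, which is the behavior the conclusions above rely on.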

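However, suppose we have only one HBase cluster. Then there is only one conf for that cluster (one fixed set of properties); having several HBase clusters is a different story.
Under this mechanism, with only one conf, the pool will only ever hold a single connection instance! A "pool" like that serves little purpose.
That is why the code deprecated the mechanism of fetching pooled connections through getConnection, and the official documentation recommends instead: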
*******************************************************************************************************************
When multiple threads need access, you can create an HConnection up front, as in the following code:

Example 9.1. Pre-Creating a HConnection

// Create a connection to the cluster.
HConnection connection = HConnectionManager.createConnection(Configuration);
HTableInterface table = connection.getTable("myTable");
// use table as needed, the table returned is lightweight
table.close();
// use the connection for other access to the cluster
connection.close();

Constructing an HTableInterface implementation is very lightweight, and resources are controlled.

*******************************************************************************************************************

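(The above repeats the translated passage from the official documentation.)
If you follow the official advice, that is, create one connection up front and share it for all later access, the effect is exactly the same as with the old getConnection: either way you are working with a single connection instance!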

4.2 A new era for HBase
I looked at the latest code on Git (https://git-wip-us.apache.org/repos/asf?p=hbase.git;a=tree) and found that getConnection has come back:
  /**
   * Get the connection that goes with the passed <code>conf</code> configuration instance.
   * If no current connection exists, method creates a new connection and keys it using
   * connection-specific properties from the passed {@link Configuration}; see
   * {@link HConnectionKey}.
   * @param conf configuration
   * @return HConnection object for <code>conf</code>
   * @throws ZooKeeperConnectionException
   */
  public static HConnection getConnection(final Configuration conf) throws IOException {
    return ConnectionManager.getConnectionInternal(conf);
  }

That is not the main point, though. The main point is the pom of the latest code:

<groupId>org.apache.hbase</groupId>
<artifactId>hbase</artifactId>
<packaging>pom</packaging>
<version>2.0.0-SNAPSHOT</version>
<name>HBase</name>
<description>
   Apache HBase™ is the Hadoop database. Use it when you need
   random, realtime read/write access to your Big Data.
   This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters
   of commodity hardware.
</description>

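On releasing HTable instances versus releasing connections
An HTable instance involves two connections: one to ZooKeeper and one to the RegionServer.
The ZooKeeper connection is released only when no other HTable instances remain, that is, when the reference count on the ZooKeeper connection has dropped to 0 (this cannot happen while the HTablePool size is greater than 0).
The RegionServer connection is maintained separately by the HBaseClient$Connection thread and has essentially nothing to do with the HTable instance. Note that the HBaseClient$Connection thread is bound to the connection.
HTablePool as a whole:
It holds multiple HTable instances.
Multiple HTable instances share the same ZooKeeper connection.
Multiple HTable instances that talk to the same RegionServer share the same HBaseClient$Connection.
It is easy to assume that every HTable instance has its own HBaseClient$Connection, as in a connection pool, but that is not the case.
Although HTablePool has a maximum size, it does not prevent the number of HTable instances from exceeding that size. When demand exceeds the size, new instances are still created, but when an instance is returned to a pool that is already full, it is discarded. HTablePool is therefore an object pool, not a connection pool.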
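
The "object pool, not connection pool" behavior described above can be sketched in a few lines of plain Java. ObjectPoolSketch is a hypothetical class written for this post, not the HTablePool source; it only mimics the two rules stated above: borrowing is never refused, and an instance returned to a full pool is dropped.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical pool that mimics the described behavior; not the HTablePool source.
class ObjectPoolSketch {
    private final int maxSize;
    private final Deque<Object> idle = new ArrayDeque<Object>();
    private int created = 0;

    ObjectPoolSketch(int maxSize) {
        this.maxSize = maxSize;
    }

    // Hands out an idle instance if one exists, otherwise creates a new one.
    // maxSize is not consulted here: borrowing is never refused.
    synchronized Object borrow() {
        Object pooled = idle.poll();
        if (pooled != null) {
            return pooled;
        }
        created++;
        return new Object();
    }

    // Keeps the instance only while the pool has room; otherwise it is dropped.
    synchronized boolean giveBack(Object instance) {
        if (idle.size() >= maxSize) {
            return false;
        }
        idle.push(instance);
        return true;
    }

    synchronized int createdCount() { return created; }
    synchronized int idleCount() { return idle.size(); }

    public static void main(String[] args) {
        ObjectPoolSketch pool = new ObjectPoolSketch(2);
        Object a = pool.borrow();
        Object b = pool.borrow();
        Object c = pool.borrow(); // exceeds maxSize, still served
        System.out.println("created: " + pool.createdCount());
        System.out.println("kept a: " + pool.giveBack(a));
        System.out.println("kept b: " + pool.giveBack(b));
        System.out.println("kept c: " + pool.giveBack(c)); // pool is full, c is dropped
        System.out.println("idle: " + pool.idleCount());
    }
}
```

With a maximum size of 2, three instances are still handed out, but only two survive being returned. The size bounds how many idle objects are retained, not how many are in use, and it says nothing about the underlying network connections.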

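What is the point of using HTablePool?
The author of "HBase: The Definitive Guide" puts it this way:
Instantiating an HTable is relatively expensive, so it is best done once at startup (this reason is not very convincing, since a singleton HTable would do just as well).
HTable instances are not thread-safe, especially when auto flush is false, because there is a local write buffer. Even with auto flush set to true, sharing an instance is not recommended (the author does not say why); one HTable instance per thread is advised.
Problems with HTablePool
The PooledHTable code is ugly: PooledHTable is a wrapper around HTable, so the relationship between the two should be composition, yet the source uses inheritance.
HTablePool is not a connection pool. It simply uses HBaseClient$Connection (a single thread when the target is the same region) for network communication, so multiple threads end up sharing a single HBaseClient$Connection, which brings synchronization and blocking problems.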
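
The "one instance per thread" advice can be illustrated without HBase. In the sketch below, BufferedHandle is a hypothetical stand-in for a client handle with an unsynchronized local write buffer, and a ThreadLocal is one possible way (an assumption of this sketch, not something the book prescribes) to give every thread its own handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a client handle with a local, unsynchronized write buffer.
class BufferedHandle {
    private final List<String> writeBuffer = new ArrayList<String>(); // not thread-safe
    private final AtomicInteger sink;

    BufferedHandle(AtomicInteger sink) {
        this.sink = sink;
    }

    void put(String row) {
        writeBuffer.add(row);
    }

    // Pushes the buffered rows to the shared sink and clears the local buffer.
    void flush() {
        sink.addAndGet(writeBuffer.size());
        writeBuffer.clear();
    }
}

class PerThreadHandleDemo {
    // Runs the given number of threads and returns how many rows reached the sink.
    static int run(int threads, final int rowsPerThread) {
        final AtomicInteger sink = new AtomicInteger();
        // Each thread lazily gets its own handle, so buffers are never shared.
        final ThreadLocal<BufferedHandle> handles = new ThreadLocal<BufferedHandle>() {
            @Override
            protected BufferedHandle initialValue() {
                return new BufferedHandle(sink);
            }
        };
        List<Thread> workers = new ArrayList<Thread>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(new Runnable() {
                @Override
                public void run() {
                    BufferedHandle handle = handles.get();
                    for (int i = 0; i < rowsPerThread; i++) {
                        handle.put("row-" + i);
                    }
                    handle.flush();
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return sink.get();
    }

    public static void main(String[] args) {
        System.out.println("rows flushed: " + run(4, 1000));
    }
}
```

Because no buffer is ever touched by two threads, every row written is accounted for; only the sink, which is designed for concurrent use, is shared.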

Reference: http://blog.csdn.net/u010967382/article/details/38046821