
Using the SequenceFile File Format with Impala Tables

2014-12-5 16:56


Cloudera Impala supports using SequenceFile data files.

The following sections describe the details of using SequenceFile data files with Impala tables.

Creating SequenceFile Tables and Loading Data

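If you do not have an existing data file to use, begin by creating one in the appropriate format.

To create a SequenceFile table:

In impala-shell, issue a command similar to: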

create table sequencefile_table (column_specs) stored as sequencefile;

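Because Impala can query some kinds of tables that it cannot currently write to, after creating a table in certain file formats you might need to load the data through the Hive shell. See How Impala Works with Hadoop File Formats for details. After loading data through Hive or any other mechanism outside of Impala, issue a REFRESH table_name statement the next time you connect to the Impala node, before querying the table, so that Impala recognizes the newly added data.

For example, here is how you might create SequenceFile tables in Impala (either by specifying the columns explicitly or by cloning the structure of another table), load data through Hive, and then query the data through Impala: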

$ impala-shell -i localhost
[localhost:21000] > create table seqfile_table (x int) stored as sequencefile;
[localhost:21000] > create table seqfile_clone like some_other_table stored as sequencefile;
[localhost:21000] > quit;

$ hive
hive> insert into table seqfile_table select x from some_other_table;
3 Rows loaded to seqfile_table
Time taken: 19.047 seconds
hive> quit;

$ impala-shell -i localhost
[localhost:21000] > select * from seqfile_table;
Returned 0 row(s) in 0.23s
[localhost:21000] > -- Make Impala recognize the data loaded through Hive;
[localhost:21000] > refresh seqfile_table;
[localhost:21000] > select * from seqfile_table;
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
+---+
Returned 3 row(s) in 0.23s

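Enabling Compression for SequenceFile Tables

You may want to enable compression on existing tables. Enabling compression improves performance in most cases, and it is supported for SequenceFile tables. For example, to enable Snappy compression, specify the following additional settings when loading data through the Hive shell: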

hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> insert overwrite table new_table select * from old_table;

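If you are converting a partitioned table, you must complete additional steps. In that case, specify additional settings similar to the following: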

hive> create table new_table (your_cols) partitioned by (partition_cols) stored as new_format;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> insert overwrite table new_table partition(comma_separated_partition_cols) select * from old_table;

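Remember that Hive does not require you to specify the source format. Consider converting a table with two partition columns, year and month, to a Snappy-compressed SequenceFile table. Combining the components outlined previously to complete this conversion, you would specify settings similar to the following: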

hive> create table TBL_SEQ (int_col int, string_col string) STORED AS SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_seq SELECT * FROM tbl;

To complete a similar process for a table that includes partitions, you would specify settings similar to the following:

hive> CREATE TABLE tbl_seq (int_col INT, string_col STRING) PARTITIONED BY (year INT) STORED AS SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_seq PARTITION(year) SELECT * FROM tbl;
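
The Hive sessions above differ only in the table definition and in whether the dynamic-partition settings and the PARTITION clause are present. As a rough sketch of that pattern, the Python helper below assembles the same statements as text so they can be reviewed or pasted into the Hive shell. The function name build_conversion_script and its parameters are hypothetical and not part of Hive or Impala; the sketch only builds strings and does not connect to a cluster. It assumes the partition columns come last in the source table, so that SELECT * lines up with the PARTITION clause.

```python
# Hypothetical helper: assembles the Hive statements shown above as plain text.
# It does not talk to Hive; it only returns strings.

SNAPPY_CODEC = "org.apache.hadoop.io.compress.SnappyCodec"


def build_conversion_script(source, target, columns, partition_cols=()):
    """Return, in order, the Hive statements that copy `source` into a new
    Snappy-compressed SequenceFile table named `target`.

    columns and partition_cols are sequences of (name, type) pairs.
    """
    col_spec = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    create = f"CREATE TABLE {target} ({col_spec})"
    if partition_cols:
        part_spec = ", ".join(f"{name} {ctype}" for name, ctype in partition_cols)
        create += f" PARTITIONED BY ({part_spec})"
    create += " STORED AS SEQUENCEFILE;"

    # Compression settings used in every example in this article.
    statements = [
        create,
        "SET hive.exec.compress.output=true;",
        "SET mapred.max.split.size=256000000;",
        "SET mapred.output.compression.type=BLOCK;",
        f"SET mapred.output.compression.codec={SNAPPY_CODEC};",
    ]

    insert = f"INSERT OVERWRITE TABLE {target}"
    if partition_cols:
        # Dynamic partitioning is only needed when the target is partitioned.
        statements += [
            "SET hive.exec.dynamic.partition.mode=nonstrict;",
            "SET hive.exec.dynamic.partition=true;",
        ]
        names = ", ".join(name for name, _ in partition_cols)
        insert += f" PARTITION({names})"
    insert += f" SELECT * FROM {source};"
    statements.append(insert)
    return statements


if __name__ == "__main__":
    script = build_conversion_script(
        "tbl",
        "tbl_seq",
        [("int_col", "INT"), ("string_col", "STRING")],
        partition_cols=[("year", "INT")],
    )
    print("\n".join(script))
```

Called with a partition column as in the usage block, the helper reproduces the partitioned example; called without partition_cols, it leaves out the two dynamic-partition settings and the PARTITION clause, since an unpartitioned target does not need them.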