Sqoop 1.4.4: Importing MySQL Data into HDFS, with a Summary of Problems

This post was last edited by pig2 on 2015-10-23 17:53.

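Guiding questions:

1. Which parameter lets Sqoop import data using a SQL statement?

2. When importing with --query, which three items must be supplied?

3. What does the --split-by parameter do?

4. How many map tasks does Sqoop use by default for an import?

5. What is the difference between double and single quotes around the SQL statement after --query, and how do you deal with it?

6. Which two data file formats does Sqoop support for imports, and which one is the default?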

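I. Free-form query import

Sqoop can also import the result set of an arbitrary query. Instead of --table, --columns and --where, you pass a SQL statement with the --query parameter. A free-form query import must specify a destination directory with --target-dir, must name a splitting column with --split-by (required whenever more than one map task is used), and must contain a WHERE clause that includes the $CONDITIONS token, which each Sqoop process replaces with its own unique condition expression. For example: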

[mw_shl_code=bash,true][hadoopUser@secondmgt conf]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --query  'select * from users where  id<60 and $CONDITIONS' --split-by id -m 1 --target-dir /output/query/
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
15/01/18 14:30:10 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/01/18 14:30:10 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/01/18 14:30:10 INFO tool.CodeGenTool: Beginning code generation
15/01/18 14:30:11 INFO manager.SqlManager: Executing SQL statement: select * from users where  id<60 and  (1 = 0)
15/01/18 14:30:11 INFO manager.SqlManager: Executing SQL statement: select * from users where  id<60 and  (1 = 0)
15/01/18 14:30:11 INFO manager.SqlManager: Executing SQL statement: select * from users where  id<60 and  (1 = 0)
15/01/18 14:30:11 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0
Note: /tmp/sqoop-hadoopUser/compile/3488270c7f7b23dd3b556d8d185f6a82/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/01/18 14:30:12 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoopUser/compile/3488270c7f7b23dd3b556d8d185f6a82/QueryResult.jar
15/01/18 14:30:12 INFO mapreduce.ImportJobBase: Beginning query import.
15/01/18 14:30:12 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hbase/hbase-0.96.2-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/01/18 14:30:12 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/01/18 14:30:13 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/01/18 14:30:13 INFO client.RMProxy: Connecting to ResourceManager at secondmgt/192.168.2.133:8032
15/01/18 14:30:14 INFO mapreduce.JobSubmitter: number of splits:1
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files
15/01/18 14:30:14 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
15/01/18 14:30:14 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
15/01/18 14:30:14 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/01/18 14:30:14 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
15/01/18 14:30:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1421373857783_0016
15/01/18 14:30:15 INFO impl.YarnClientImpl: Submitted application application_1421373857783_0016 to ResourceManager at secondmgt/192.168.2.133:8032
15/01/18 14:30:15 INFO mapreduce.Job: The url to track the job: http://secondmgt:8088/proxy/application_1421373857783_0016/
15/01/18 14:30:15 INFO mapreduce.Job: Running job: job_1421373857783_0016
15/01/18 14:30:27 INFO mapreduce.Job: Job job_1421373857783_0016 running in uber mode : false
15/01/18 14:30:27 INFO mapreduce.Job:  map 0% reduce 0%
15/01/18 14:30:38 INFO mapreduce.Job:  map 100% reduce 0%
15/01/18 14:30:38 INFO mapreduce.Job: Job job_1421373857783_0016 completed successfully
15/01/18 14:30:38 INFO mapreduce.Job: Counters: 27
      File System Counters
            FILE: Number of bytes read=0
            FILE: Number of bytes written=91814
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=87
            HDFS: Number of bytes written=123
            HDFS: Number of read operations=4
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=2
      Job Counters
            Launched map tasks=1
            Other local map tasks=1
            Total time spent by all maps in occupied slots (ms)=33944
            Total time spent by all reduces in occupied slots (ms)=0
      Map-Reduce Framework
            Map input records=3
            Map output records=3
            Input split bytes=87
            Spilled Records=0
            Failed Shuffles=0
            Merged Map outputs=0
            GC time elapsed (ms)=44
            CPU time spent (ms)=2440
            Physical memory (bytes) snapshot=164503552
            Virtual memory (bytes) snapshot=888926208
            Total committed heap usage (bytes)=83886080
      File Input Format Counters
            Bytes Read=0
      File Output Format Counters
            Bytes Written=123
15/01/18 14:30:38 INFO mapreduce.ImportJobBase: Transferred 123 bytes in 25.6853 seconds (4.7887 bytes/sec)
15/01/18 14:30:38 INFO mapreduce.ImportJobBase: Retrieved 3 records.[/mw_shl_code]

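With --split-by followed by a column name, Sqoop divides the workload by that column. By default Sqoop uses the table's primary key as the splitting column; in the import command above we use id.

Sqoop imports data in parallel from most data sources. The -m parameter controls the number of map tasks; the default is 4, and here we set it to 1. Each map task handles an evenly sized slice of the splitting column's total range. For example, given a table whose primary key id ranges from 0 to 1000 and the default of 4 map tasks, Sqoop runs four processes, each executing a statement of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.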
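The range arithmetic in that example can be sketched in a few lines of shell. This is only an illustration of the even split described above, not Sqoop's actual splitter code; the function name split_ranges is hypothetical, and it assumes integer keys and no more map tasks than there are key values.

```shell
# Illustrative only: reproduces the even split described in the text,
# not Sqoop's real splitter implementation.
# Usage: split_ranges MIN MAX NUM_MAP_TASKS
split_ranges() {
    min=$1; max=$2; n=$3
    step=$(( (max - min) / n ))      # width of each slice (integer division)
    i=0
    while [ "$i" -lt "$n" ]; do
        lo=$(( min + i * step ))
        if [ "$i" -eq $(( n - 1 )) ]; then
            hi=$(( max + 1 ))        # last slice must still include the maximum id
        else
            hi=$(( lo + step ))
        fi
        echo "WHERE id >= $lo AND id < $hi"
        i=$(( i + 1 ))
    done
}

# Key range 0-1000 with the default of 4 map tasks.
split_ranges 0 1000 4
```

Each printed line corresponds to the condition one map task would use, which also shows why a skewed key distribution leaves some tasks with far more rows than others.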

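Note 1: If the primary key values are not uniformly distributed across their range, the tasks can end up unbalanced. In that case, explicitly choose a different column with --split-by. Sqoop currently cannot split on multi-column indices; if a table has no index column, or has a multi-column key, you must specify a splitting column manually.

Note 2: If the SQL statement is wrapped in double quotes (""), you must write \$CONDITIONS instead of $CONDITIONS so that your shell does not treat it as one of its own variables. For example:

Incorrect: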

[mw_shl_code=bash,true][hadoopUser@secondmgt ~]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --query "select * from users where $CONDITIONS" --split-by id  --target-dir /output/query/
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
15/01/18 15:17:50 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/01/18 15:17:50 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/01/18 15:17:50 INFO tool.CodeGenTool: Beginning code generation
15/01/18 15:17:50 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Query [select * from users where ] must contain '$CONDITIONS' in WHERE clause.
      at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:352)
      at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1277)
      at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1089)
      at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:96)
      at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:396)
      at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:502)
      at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
      at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
      at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
      at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
      at org.apache.sqoop.Sqoop.main(Sqoop.java:238)[/mw_shl_code]

Correct:
[mw_shl_code=bash,true][hadoopUser@secondmgt ~]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --query "select * from users where \$CONDITIONS" --split-by id  --target-dir /output/query/[/mw_shl_code]
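The quoting behaviour can be checked without running Sqoop at all. The following is a small sketch that only inspects what the shell would hand over as the --query argument; the variable names are made up for the illustration, and CONDITIONS is set to an empty string to stand in for the usual case where the shell has no such variable.

```shell
# Stand-in for a shell where CONDITIONS has no value (the usual situation).
CONDITIONS=''

# Single quotes: the shell leaves $CONDITIONS untouched.
single_quoted='select * from users where $CONDITIONS'

# Double quotes without escaping: the shell expands $CONDITIONS to nothing.
double_unescaped="select * from users where $CONDITIONS"

# Double quotes with a backslash: the literal token survives.
double_escaped="select * from users where \$CONDITIONS"

printf 'single quotes:            [%s]\n' "$single_quoted"
printf 'double quotes, unescaped: [%s]\n' "$double_unescaped"
printf 'double quotes, escaped:   [%s]\n' "$double_escaped"
```

The unescaped variant ends in "where " with nothing after it, which matches the query text quoted in the IOException above.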

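Note 3: In the current version of Sqoop, free-form query import is limited to simple queries; complex queries and queries with OR conditions in the WHERE clause are not supported.

II. Viewing the results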

[mw_shl_code=bash,true][hadoopUser@secondmgt ~]$ hadoop fs -cat /output/query/*
56,hua,hanyun,男,开通,2013-12-02,0,1
58,feng,123456,男,开通,2013-11-22,0,0
59,test,123456,男,开通,2014-03-05,58,0[/mw_shl_code]

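III. Controlling the import process

Some databases provide faster, more efficient tools for moving table data into other systems; in that case you can pass the --direct parameter. For example, for MySQL Sqoop calls mysqldump and mysqlimport, and for PostgreSQL it calls psql.

IV. Controlling type mapping

Sqoop is preconfigured to map most SQL types to suitable Java and Hive types. The default mapping does not always fit the user's needs, however, so the following two parameters let you override it for your own application: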

[mw_shl_code=bash,true]Argument       Description
--map-column-java <mapping>       Override mapping from SQL to Java type for configured columns.
--map-column-hive <mapping>       Override mapping from SQL to Hive type for configured columns.[/mw_shl_code]
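As a hedged illustration of the mapping syntax: each override is written as column=Type, and several overrides are separated by commas. The sketch below only prints a command for review rather than running it. The helper name print_mapped_import, the -P password prompt, the --table import and the target directory /output/users_mapped are assumptions made for the illustration; the connection string, the users table and the id column come from the example earlier in this post.

```shell
# Hypothetical helper: prints a sqoop command for review instead of running it.
# $1 is the mapping list in column=Type form, e.g. "id=String".
print_mapped_import() {
    mapping=$1
    printf '%s\n' "sqoop import --connect jdbc:mysql://secondmgt:3306/spice \
--username hive -P --table users \
--map-column-java $mapping \
--target-dir /output/users_mapped -m 1"
}

# Ask Sqoop to represent the id column as a Java String (illustrative choice).
print_mapped_import "id=String"
```

Printing first makes it easy to confirm the mapping string before submitting a real job; --map-column-hive takes its overrides in the same column=Type form.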

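V. File formats

Sqoop can import data in two file formats: delimited text and SequenceFiles. Delimited text is the default, and as the query results above show, fields are separated by commas by default. The --as-textfile parameter selects this format explicitly; to import as SequenceFiles instead, use --as-sequencefile.

Delimited text suits most non-binary data types, and it is easy for other tools such as Hive to process further.

SequenceFiles are a binary format that stores individual records in custom record-specific data types.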
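Because the default output is plain comma-separated lines, ordinary text tools can read it directly. A minimal sketch, using the three rows from the hadoop fs -cat output above as inline sample data; the function name first_two_fields is hypothetical, and the column meanings are not stated in the post, so the fields are referred to by position only.

```shell
# Sample rows copied from the "hadoop fs -cat" output shown earlier.
rows='56,hua,hanyun,男,开通,2013-12-02,0,1
58,feng,123456,男,开通,2013-11-22,0,0
59,test,123456,男,开通,2014-03-05,58,0'

# Split each line on commas and print fields 1 and 2, tab-separated.
first_two_fields() {
    awk -F',' '{ print $1 "\t" $2 }'
}

printf '%s\n' "$rows" | first_two_fields
```

On a cluster the same function could be fed from the real files, for example hadoop fs -cat /output/query/* | first_two_fields. SequenceFile output would not be readable this way, since it is binary.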
