
Hadoop Tutorial 2: Hello World with the Hadoop ecosystem, and Sqoop with Informix, DB2, and MySQL

This post continues directly from the previous installment.





Hadoop includes a browser interface for inspecting the status of HDFS.
Figure 7 shows the output of the word count job.
Figure 7. Viewing HDFS with a browser


The Cloudera web site offers a more sophisticated console free of charge. This console provides a number of capabilities beyond the standard Hadoop web interface. Notice that the HDFS health status in Figure 8 shows as Bad.
Figure 8. Hadoop services managed by Cloudera Manager



Why is the status Bad? Because in a single virtual machine, HDFS cannot make three copies of the data blocks. When blocks are under-replicated, there is a risk of data loss, so the health of the system is bad. It is a good thing you aren't trying to run production Hadoop jobs on a single node.
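If you only want to clear the warning in a single-node sandbox, one common approach is to drop the replication factor to 1. Here is a minimal sketch of the relevant hdfs-site.xml entry, assuming a pseudo-distributed CDH-style install (the exact file location varies by distribution):

<!-- hdfs-site.xml: on a one-node sandbox, keep a single copy of each block -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>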
Your MapReduce jobs are not limited to Java. This last MapReduce example uses Hadoop Streaming to support a mapper written in Python and a reducer written in AWK. No, you don't have to be a Java guru to write map-reduce!
Mark Twain was not a big fan of Cooper. In this use case, Hadoop will provide some simple literary criticism comparing Twain and Cooper. The Flesch-Kincaid test computes the reading level of a particular text. One of the factors in this analysis is the average sentence length. Parsing sentences turns out to be more complicated than just looking for the period character. The openNLP package and the Python NLTK package have excellent sentence parsers. For simplicity, the example in Listing 8 will use word length as a surrogate for the number of syllables in a word. If you want to take this to the next level, implement the Flesch-Kincaid test in MapReduce, crawl the web, and compute the reading level of your favorite news sites.
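For reference, the Flesch-Kincaid grade level needs only three counts: 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59. Here is a small illustrative Python sketch (not part of the original listings) that uses the same word-length shortcut as Listing 8 in place of a real syllable counter, and naive period splitting in place of openNLP/NLTK parsing:

# rough Flesch-Kincaid grade level; crude sentence and syllable estimates
def fk_grade(text):
    sentences = [s for s in text.split('.') if s.strip()]
    words = text.split()
    syllables = sum(max(1, len(w) // 3) for w in words)  # length-based stand-in
    return 0.39 * (len(words) / float(len(sentences))) \
         + 11.8 * (syllables / float(len(words))) - 15.59

print fk_grade("Call me Ishmael. Some years ago, never mind how long precisely.")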

Listing 8. The Python-based mapper for literary criticism
# here is the mapper we'll connect to the streaming hadoop interface

# the mapper is reading the text in the file - not really appreciating Twain's humor
#

# modified from
# http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
$ cat mapper.py
#!/usr/bin/env python
import sys

# read stdin
for linein in sys.stdin:
    # strip blanks
    linein = linein.strip()
    # split into words
    mywords = linein.split()
    # loop on mywords, output the length of each word
    for word in mywords:
        # the reducer just cares about the first column,
        # normally there is a key - value pair
        print '%s %s' % (len(word), 0)

The mapper output for the word "Twain" would be 5 0. The word lengths are sorted in numeric order and presented to the reducer in sorted order. In the examples in Listings 9 and 10, sorting the data isn't required to get the correct output, but the sort is built into the MapReduce infrastructure and will happen anyway.
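Because Streaming mappers simply read stdin and write stdout, you can check this from the shell before involving Hadoop at all. An illustrative run (note that plain sort orders the keys as text rather than numerically, which is harmless here since the reducer's statistics don't depend on order):

$ echo "Twain wrote Huckleberry Finn" | ./mapper.py | sort
11 0
4 0
5 0
5 0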

Listing 9. An AWK reducer for literary criticism

# the awk code is modified from http://www.commandlinefu.com

# awk is calculating
#  NR - the number of words in total
#  sum/NR - the average word length
#  sqrt(mean2/NR) - the standard deviation

$ cat statsreducer.awk
awk '{delta = $1 - avg; avg += delta / NR; \
mean2 += delta * ($1 - avg); sum=$1+sum } \
END { print NR, sum/NR, sqrt(mean2 / NR); }'
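The one-pass mean and variance update in this script is Welford's online algorithm, which never needs to hold all the values in memory. To cross-check the reducer's arithmetic, an equivalent illustrative Python version (not part of the original tutorial) can be dropped into the same pipeline in place of statsreducer.awk:

#!/usr/bin/env python
# same math as statsreducer.awk: Welford's online mean/variance
import sys

n = 0; avg = 0.0; mean2 = 0.0; total = 0.0
for line in sys.stdin:
    x = float(line.split()[0])   # first column is the word length
    n += 1
    delta = x - avg
    avg += delta / n
    mean2 += delta * (x - avg)
    total += x

# count, average word length, standard deviation
print n, total / n, (mean2 / n) ** 0.5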

Listing 10. Running the Python mapper and AWK reducer with Hadoop Streaming

# test locally

# because we're using Hadoop Streaming, we can test the
# mapper and reducer with simple pipes

# the "sort" phase is a reminder the keys are sorted
# before presentation to the reducer
# in this example it doesn't matter what order the
# word length values are presented for calculating the std deviation

$ zcat ../DS.txt.gz | ./mapper.py | sort | ./statsreducer.awk
215107 4.56068 2.50734

# now run in hadoop with streaming

# CDH4
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input HF.txt -output HFstats -file ./mapper.py -file \
./statsreducer.awk -mapper ./mapper.py -reducer ./statsreducer.awk

# CDH3
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u4.jar \
-input HF.txt -output HFstats -file ./mapper.py -file ./statsreducer.awk \
-mapper ./mapper.py -reducer ./statsreducer.awk

$ hls HFstats
Found 3 items
-rw-r--r--   1 cloudera supergroup   0 2012-08-12 15:38 /user/cloudera/HFstats/_SUCCESS
drwxr-xr-x   - cloudera supergroup   0 2012-08-12 15:37 /user/cloudera/HFstats/_logs
-rw-r--r--   1 cloudera ...  24 2012-08-12 15:37 /user/cloudera/HFstats/part-00000

$ hcat /user/cloudera/HFstats/part-00000
113365 4.11227 2.17086

# now for cooper

$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u4.jar \
-input DS.txt.gz -output DSstats -file ./mapper.py -file ./statsreducer.awk \
-mapper ./mapper.py -reducer ./statsreducer.awk

$ hcat /user/cloudera/DSstats/part-00000
215107 4.56068 2.50734

Mark Twain fans can relax, happy to know that Hadoop finds Cooper to use longer words, and with an alarming standard deviation (humor intended). That does, of course, assume that shorter words are better. Let's move on: the next step is writing the data in HDFS to Informix and DB2.
Using Sqoop to write data from HDFS into Informix, DB2, or MySQL via JDBC
The Apache Sqoop project is an open-source, JDBC-based utility for moving data between Hadoop and databases. Sqoop was originally created during a hackathon at Cloudera and was then open sourced.
Moving data from HDFS into a relational database is a common use case. HDFS and map-reduce are great at doing the heavy lifting; for simple queries, or as the back-end store for a web site, caching map-reduce output in a relational store is a good design pattern. You can avoid re-running the map-reduce word count by simply Sqooping the results into Informix and DB2. You have generated data about Twain and Cooper; now let's move it into a database, as shown in Listing 11.
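Once the word counts land in a table, serving them becomes an ordinary indexed lookup instead of a map-reduce re-run. As a hypothetical illustration of the design pattern (MySQL LIMIT syntax shown; Informix would use SELECT FIRST 10):

-- the ten most frequent words, read straight from the cached results
SELECT word, n
FROM wordcount
ORDER BY n DESC
LIMIT 10;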

Listing 11. JDBC driver installation

# Sqoop needs access to the JDBC driver for every
# database that it will access

# please copy the driver for each database you plan to use for these exercises
# the MySQL database and driver are already installed in the virtual image
# but you still need to copy the driver to the sqoop/lib directory

# one time copy of jdbc driver to sqoop lib directory
$ sudo cp Informix_JDBC_Driver/lib/ifxjdbc*.jar /usr/lib/sqoop/lib/
$ sudo cp db2jdbc/db2jcc*.jar /usr/lib/sqoop/lib/
$ sudo cp /usr/lib/hive/lib/mysql-connector-java-5.1.15-bin.jar /usr/lib/sqoop/lib/
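Before running any of the exports, a quick check that the drivers actually landed in the Sqoop lib directory can save a confusing ClassNotFoundException later (a simple sanity check, not from the original article):

$ ls /usr/lib/sqoop/lib/ | egrep -i 'ifxjdbc|db2jcc|mysql-connector'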

The examples shown in Listings 12 through 15 are presented for each database. Please skip to the example you're interested in, whether Informix, DB2, or MySQL. For the database polyglots, enjoy working through every example. If your database of choice is not included here, it won't be a huge challenge to make these samples work elsewhere.

Listing 12. Informix users: Sqoop writing the results of the word count to Informix
# create a target table to put the data
# fire up dbaccess and use this sql
# create table wordcount ( word char(36) primary key, n int);

# now run the sqoop command
# this is best put in a shell script to help avoid typos...

$ sqoop export -D sqoop.export.records.per.statement=1 \
--fields-terminated-by '\t' --driver com.informix.jdbc.IfxDriver \
--connect \
"jdbc:informix-sqli://myhost:54321/stores_demo:informixserver=i7;user=me;password=mypw" \
--table wordcount --export-dir /user/cloudera/HF.out

Listing 13. Informix users: the output of the Sqoop export to Informix
12/08/08 21:39:42 INFO manager.SqlManager: Using default fetchSize of 1000
12/08/08 21:39:42 INFO tool.CodeGenTool: Beginning code generation
12/08/08 21:39:43 INFO manager.SqlManager: Executing SQL statement: SELECT t.*
FROM wordcount AS t WHERE 1=0
12/08/08 21:39:43 INFO manager.SqlManager: Executing SQL statement: SELECT t.*
FROM wordcount AS t WHERE 1=0
12/08/08 21:39:43 INFO orm.CompilationManager: HADOOP_HOME is /usr/lib/hadoop
12/08/08 21:39:43 INFO orm.CompilationManager: Found hadoop core jar at:
/usr/lib/hadoop/hadoop-0.20.2-cdh3u4-core.jar
12/08/08 21:39:45 INFO orm.CompilationManager: Writing jar file:
/tmp/sqoop-cloudera/compile/248b77c05740f863a15e0136accf32cf/wordcount.jar
12/08/08 21:39:45 INFO mapreduce.ExportJobBase: Beginning export of wordcount
12/08/08 21:39:45 INFO manager.SqlManager: Executing SQL statement: SELECT t.*
FROM wordcount AS t WHERE 1=0
12/08/08 21:39:46 INFO input.FileInputFormat: Total input paths to process : 1
12/08/08 21:39:46 INFO input.FileInputFormat: Total input paths to process : 1
12/08/08 21:39:46 INFO mapred.JobClient: Running job: job_201208081900_0012
12/08/08 21:39:47 INFO mapred.JobClient:  map 0% reduce 0%
12/08/08 21:39:58 INFO mapred.JobClient:  map 38% reduce 0%
12/08/08 21:40:00 INFO mapred.JobClient:  map 64% reduce 0%
12/08/08 21:40:04 INFO mapred.JobClient:  map 82% reduce 0%
12/08/08 21:40:07 INFO mapred.JobClient:  map 98% reduce 0%
12/08/08 21:40:09 INFO mapred.JobClient: Task Id :
attempt_201208081900_0012_m_000000_0, Status : FAILED
java.io.IOException: java.sql.SQLException:
    Encoding or code set not supported.
at ...SqlRecordWriter.close(AsyncSqlRecordWriter.java:187)
at ...$NewDirectOutputCollector.close(MapTask.java:540)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at ....doAs(UserGroupInformation.java:1177)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.sql.SQLException: Encoding or code set not supported.
at com.informix.util.IfxErrMsg.getSQLException(IfxErrMsg.java:413)
at com.informix.jdbc.IfxChar.toIfx(IfxChar.java:135)
at com.informix.jdbc.IfxSqli.a(IfxSqli.java:1304)
at com.informix.jdbc.IfxSqli.d(IfxSqli.java:1605)
at com.informix.jdbc.IfxS
12/08/08 21:40:11 INFO mapred.JobClient:  map 0% reduce 0%
12/08/08 21:40:15 INFO mapred.JobClient: Task Id :
attempt_201208081900_0012_m_000000_1, Status : FAILED
java.io.IOException: java.sql.SQLException:
    Unique constraint (informix.u169_821) violated.
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:223)
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:49)
at .mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
at .mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:82)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:40)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at .mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:189)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.a
12/08/08 21:40:20 INFO mapred.JobClient:
Task Id : attempt_201208081900_0012_m_000000_2, Status : FAILED
java.sql.SQLException: Unique constraint (informix.u169_821) violated.
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:223)
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:49)
at .mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
at .mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:82)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:40)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at .mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:189)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.a
12/08/08 21:40:27 INFO mapred.JobClient: Job complete: job_201208081900_0012
12/08/08 21:40:27 INFO mapred.JobClient: Counters: 7
12/08/08 21:40:27 INFO mapred.JobClient:   Job Counters
12/08/08 21:40:27 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=38479
12/08/08 21:40:27 INFO mapred.JobClient:
Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/08 21:40:27 INFO mapred.JobClient:
Total time spent by all maps waiting after reserving slots (ms)=0
12/08/08 21:40:27 INFO mapred.JobClient:     Launched map tasks=4
12/08/08 21:40:27 INFO mapred.JobClient:     Data-local map tasks=4
12/08/08 21:40:27 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/08/08 21:40:27 INFO mapred.JobClient:     Failed map tasks=1
12/08/08 21:40:27 INFO mapreduce.ExportJobBase:
Transferred 0 bytes in 41.5758 seconds (0 bytes/sec)
12/08/08 21:40:27 INFO mapreduce.ExportJobBase: Exported 0 records.
12/08/08 21:40:27 ERROR tool.ExportTool: Error during export: Export job failed!

# despite the errors above, rows are inserted into the wordcount table
# one row is missing
# the retry and duplicate key exception are most likely related, but
# troubleshooting will be saved for a later article

# check how we did
# nothing like a "here document" shell script

$ dbaccess stores_demo - <<eoj
> select count(*) from wordcount;
> eoj

Database selected.
(count(*))
13837
1 row(s) retrieved.
Database closed.
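One way to keep a failed-and-retried export from leaving partial or duplicate rows behind is Sqoop's --staging-table option: rows are loaded into a staging table and moved to the target only if the entire export succeeds. A sketch of what that might look like here (the wordcount_stage table is hypothetical and would need to be created with the same schema as wordcount):

$ sqoop export -D sqoop.export.records.per.statement=1 \
--fields-terminated-by '\t' --driver com.informix.jdbc.IfxDriver \
--connect \
"jdbc:informix-sqli://myhost:54321/stores_demo:informixserver=i7;user=me;password=mypw" \
--table wordcount --staging-table wordcount_stage --clear-staging-table \
--export-dir /user/cloudera/HF.out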



Listing 14. DB2 users: Sqoop writing the results of the word count to DB2
# here is the db2 syntax
# create a destination table for db2
#
# db2 => connect to sample
#
#   Database Connection Information
#
# Database server        = DB2/LINUXX8664 10.1.0
# SQL authorization ID   = DB2INST1
# Local database alias   = SAMPLE
#
# db2 => create table wordcount ( word char(36) not null primary key , n int)
# DB20000I  The SQL command completed successfully.
#

sqoop export -D sqoop.export.records.per.statement=1 \
--fields-terminated-by '\t' \
--driver com.ibm.db2.jcc.DB2Driver \
--connect "jdbc:db2://192.168.1.131:50001/sample"  \
--username db2inst1 --password db2inst1 \
--table wordcount --export-dir /user/cloudera/HF.out

12/08/09 12:32:59 WARN tool.BaseSqoopTool: Setting your password on the
command-line is insecure. Consider using -P instead.
12/08/09 12:32:59 INFO manager.SqlManager: Using default fetchSize of 1000
12/08/09 12:32:59 INFO tool.CodeGenTool: Beginning code generation
12/08/09 12:32:59 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM wordcount AS t WHERE 1=0
12/08/09 12:32:59 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM wordcount AS t WHERE 1=0
12/08/09 12:32:59 INFO orm.CompilationManager: HADOOP_HOME is /usr/lib/hadoop
12/08/09 12:32:59 INFO orm.CompilationManager: Found hadoop core jar
at: /usr/lib/hadoop/hadoop-0.20.2-cdh3u4-core.jar
12/08/09 12:33:00 INFO orm.CompilationManager: Writing jar
file: /tmp/sqoop-cloudera/compile/5532984df6e28e5a45884a21bab245ba/wordcount.jar
12/08/09 12:33:00 INFO mapreduce.ExportJobBase: Beginning export of wordcount
12/08/09 12:33:01 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM wordcount AS t WHERE 1=0
12/08/09 12:33:02 INFO input.FileInputFormat: Total input paths to process : 1
12/08/09 12:33:02 INFO input.FileInputFormat: Total input paths to process : 1
12/08/09 12:33:02 INFO mapred.JobClient: Running job: job_201208091208_0002
12/08/09 12:33:03 INFO mapred.JobClient:  map 0% reduce 0%
12/08/09 12:33:14 INFO mapred.JobClient:  map 24% reduce 0%
12/08/09 12:33:17 INFO mapred.JobClient:  map 44% reduce 0%
12/08/09 12:33:20 INFO mapred.JobClient:  map 67% reduce 0%
12/08/09 12:33:23 INFO mapred.JobClient:  map 86% reduce 0%
12/08/09 12:33:24 INFO mapred.JobClient:  map 100% reduce 0%
12/08/09 12:33:25 INFO mapred.JobClient: Job complete: job_201208091208_0002
12/08/09 12:33:25 INFO mapred.JobClient: Counters: 16
12/08/09 12:33:25 INFO mapred.JobClient:   Job Counters
12/08/09 12:33:25 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=21648
12/08/09 12:33:25 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
12/08/09 12:33:25 INFO mapred.JobClient:     Total time spent by all
maps waiting after reserving slots (ms)=0
12/08/09 12:33:25 INFO mapred.JobClient:     Launched map tasks=1
12/08/09 12:33:25 INFO mapred.JobClient:     Data-local map tasks=1
12/08/09 12:33:25 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/08/09 12:33:25 INFO mapred.JobClient:   FileSystemCounters
12/08/09 12:33:25 INFO mapred.JobClient:     HDFS_BYTES_READ=138350
12/08/09 12:33:25 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=69425
12/08/09 12:33:25 INFO mapred.JobClient:   Map-Reduce Framework
12/08/09 12:33:25 INFO mapred.JobClient:     Map input records=13838
12/08/09 12:33:25 INFO mapred.JobClient:     Physical memory (bytes) snapshot=105148416
12/08/09 12:33:25 INFO mapred.JobClient:     Spilled Records=0
12/08/09 12:33:25 INFO mapred.JobClient:     CPU time spent (ms)=9250
12/08/09 12:33:25 INFO mapred.JobClient:     Total committed heap usage (bytes)=42008576
12/08/09 12:33:25 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=596447232
12/08/09 12:33:25 INFO mapred.JobClient:     Map output records=13838
12/08/09 12:33:25 INFO mapred.JobClient:     SPLIT_RAW_BYTES=126
12/08/09 12:33:25 INFO mapreduce.ExportJobBase: Transferred 135.1074 KB
in 24.4977 seconds (5.5151 KB/sec)
12/08/09 12:33:25 INFO mapreduce.ExportJobBase: Exported 13838 records.

# check on the results...
#
# db2 => select count(*) from wordcount
#
# 1
# -----------
#       13838
#
#   1 record(s) selected.
#
#

Listing 15. MySQL users: Sqoop writing the results of the word count to MySQL
# if you don't have Informix or DB2 you can still do this example
# mysql - it is already installed in the VM, here is how to access

# one time copy of the JDBC driver

sudo cp /usr/lib/hive/lib/mysql-connector-java-5.1.15-bin.jar /usr/lib/sqoop/lib/

# now create the database and table

$ mysql -u root
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 45
Server version: 5.0.95 Source distribution

Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create database mydemo;
Query OK, 1 row affected (0.00 sec)

mysql> use mydemo
Database changed
mysql> create table wordcount ( word char(36) not null primary key, n int);
Query OK, 0 rows affected (0.00 sec)

mysql> exit
Bye

# now export

$ sqoop export --connect jdbc:mysql://localhost/mydemo \
--table wordcount --export-dir /user/cloudera/HF.out \
--fields-terminated-by '\t' --username root
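Listing 15 ends at the export command; to verify the rows arrived, a quick count from the shell (mirroring the Informix and DB2 checks above) would be:

$ mysql -u root mydemo -e 'select count(*) from wordcount;'
# the count should match the number of records reported by the Sqoop export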

Importing data from Informix and DB2 into HDFS with Sqoop
Sqoop can also insert data into Hadoop HDFS. This bidirectional capability is controlled by the import argument.
The sample databases that ship with both products have some simple datasets you can use for this purpose. Listing 16 shows the Sqoop syntax and results for each server.
For MySQL users, please adapt the syntax of the Informix or DB2 examples that follow.
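As a sketch of that adjustment, a MySQL import of the wordcount table created in Listing 15 might look like the following. Because Sqoop has built-in MySQL support, no --driver option is needed; the -m 1 flag keeps it to one connection, which also sidesteps splitting on the char() primary key:

$ sqoop import --connect jdbc:mysql://localhost/mydemo \
--table wordcount --username root -m 1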

Listing 16. Sqoop importing from the Informix sample database to HDFS

$ sqoop import --driver com.informix.jdbc.IfxDriver \
--connect \
"jdbc:informix-sqli://192.168.1.143:54321/stores_demo:informixserver=ifx117" \
--table orders \
--username informix --password useyours

12/08/09 14:39:18 WARN tool.BaseSqoopTool: Setting your password on the command-line
is insecure. Consider using -P instead.
12/08/09 14:39:18 INFO manager.SqlManager: Using default fetchSize of 1000
12/08/09 14:39:18 INFO tool.CodeGenTool: Beginning code generation
12/08/09 14:39:19 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM orders AS t WHERE 1=0
12/08/09 14:39:19 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM orders AS t WHERE 1=0
12/08/09 14:39:19 INFO orm.CompilationManager: HADOOP_HOME is /usr/lib/hadoop
12/08/09 14:39:19 INFO orm.CompilationManager: Found hadoop core jar
at: /usr/lib/hadoop/hadoop-0.20.2-cdh3u4-core.jar
12/08/09 14:39:21 INFO orm.CompilationManager: Writing jar
file: /tmp/sqoop-cloudera/compile/0b59eec7007d3cff1fc0ae446ced3637/orders.jar
12/08/09 14:39:21 INFO mapreduce.ImportJobBase: Beginning import of orders
12/08/09 14:39:21 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM orders AS t WHERE 1=0
12/08/09 14:39:22 INFO db.DataDrivenDBInputFormat: BoundingValsQuery:
SELECT MIN(order_num), MAX(order_num) FROM orders
12/08/09 14:39:22 INFO mapred.JobClient: Running job: job_201208091208_0003
12/08/09 14:39:23 INFO mapred.JobClient:  map 0% reduce 0%
12/08/09 14:39:31 INFO mapred.JobClient:  map 25% reduce 0%
12/08/09 14:39:32 INFO mapred.JobClient:  map 50% reduce 0%
12/08/09 14:39:36 INFO mapred.JobClient:  map 100% reduce 0%
12/08/09 14:39:37 INFO mapred.JobClient: Job complete: job_201208091208_0003
12/08/09 14:39:37 INFO mapred.JobClient: Counters: 16
12/08/09 14:39:37 INFO mapred.JobClient:   Job Counters
12/08/09 14:39:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=22529
12/08/09 14:39:37 INFO mapred.JobClient:     Total time spent by all reduces
waiting after reserving slots (ms)=0
12/08/09 14:39:37 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
12/08/09 14:39:37 INFO mapred.JobClient:     Launched map tasks=4
12/08/09 14:39:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/08/09 14:39:37 INFO mapred.JobClient:   FileSystemCounters
12/08/09 14:39:37 INFO mapred.JobClient:     HDFS_BYTES_READ=457
12/08/09 14:39:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=278928
12/08/09 14:39:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2368
12/08/09 14:39:37 INFO mapred.JobClient:   Map-Reduce Framework
12/08/09 14:39:37 INFO mapred.JobClient:     Map input records=23
12/08/09 14:39:37 INFO mapred.JobClient:     Physical memory (bytes) snapshot=291364864
12/08/09 14:39:37 INFO mapred.JobClient:     Spilled Records=0
12/08/09 14:39:37 INFO mapred.JobClient:     CPU time spent (ms)=1610
12/08/09 14:39:37 INFO mapred.JobClient:     Total committed heap usage (bytes)=168034304
12/08/09 14:39:37 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2074587136
12/08/09 14:39:37 INFO mapred.JobClient:     Map output records=23
12/08/09 14:39:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=457
12/08/09 14:39:37 INFO mapreduce.ImportJobBase: Transferred 2.3125 KB in 16.7045
seconds (141.7585 bytes/sec)
12/08/09 14:39:37 INFO mapreduce.ImportJobBase: Retrieved 23 records.

# now look at the results

$ hls
Found 4 items
-rw-r--r--   1 cloudera supergroup     459386 2012-08-08 19:34 /user/cloudera/DS.txt.gz
drwxr-xr-x   - cloudera supergroup          0 2012-08-08 19:38 /user/cloudera/HF.out
-rw-r--r--   1 cloudera supergroup     597587 2012-08-08 19:35 /user/cloudera/HF.txt
drwxr-xr-x   - cloudera supergroup          0 2012-08-09 14:39 /user/cloudera/orders
$ hls orders
Found 6 items
-rw-r--r--   1 cloudera supergroup   0 2012-08-09 14:39 /user/cloudera/orders/_SUCCESS
drwxr-xr-x   - cloudera supergroup   0 2012-08-09 14:39 /user/cloudera/orders/_logs
-rw-r--r--   1 cloudera ...roup    630 2012-08-09 14:39 /user/cloudera/orders/part-m-00000
-rw-r--r--   1 cloudera supergroup  564 2012-08-09 14:39 /user/cloudera/orders/part-m-00001
-rw-r--r--   1 cloudera supergroup  527 2012-08-09 14:39 /user/cloudera/orders/part-m-00002
-rw-r--r--   1 cloudera supergroup  647 2012-08-09 14:39 /user/cloudera/orders/part-m-00003

# wow  there are four files part-m-0000x
# look inside one

# some of the lines are edited to fit on the screen
$ hcat /user/cloudera/orders/part-m-00002
1013,2008-06-22,104,express ,n,B77930    ,2008-07-10,60.80,12.20,2008-07-31
1014,2008-06-25,106,ring bell,  ,n,8052      ,2008-07-03,40.60,12.30,2008-07-10
1015,2008-06-27,110,        ,n,MA003     ,2008-07-16,20.60,6.30,2008-08-31
1016,2008-06-29,119, St.          ,n,PC6782    ,2008-07-12,35.00,11.80,null
1017,2008-07-09,120,use                 ,n,DM354331  ,2008-07-13,60.00,18.00,null


Why are there four different files, each containing only part of the data? Sqoop is a highly parallelized utility. If a 4000-node cluster running Sqoop did a full-throttle import from a database, the 4000 connections would look very much like a denial-of-service attack against the database. Sqoop's default connection limit is four JDBC connections. Each connection generates a data file in HDFS, hence the four files. Not to worry: you'll see how Hadoop works across these files without any difficulty.
The next step is to import a DB2 table. As shown in Listing 17, by specifying the -m 1 option, a table without a primary key can be imported, and the result is a single file.
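The same -m flag (alias --num-mappers) is also how you tune the parallelism just described. As an illustration, rerunning the Informix import of the orders table with a single connection would yield one part-m-00000 file instead of four:

$ sqoop import --driver com.informix.jdbc.IfxDriver \
--connect \
"jdbc:informix-sqli://192.168.1.143:54321/stores_demo:informixserver=ifx117" \
--table orders -m 1 \
--username informix --password useyours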


To be continued in the next installment!


