
A First Look at Configuring and Using Spark SQL



Questions this post addresses:
1. How do you set up the environment for Spark SQL?
2. What problems come up with Spark SQL, and how are they solved?







1. Environment

  1. OS: Red Hat Enterprise Linux Server release 6.4 (Santiago)
  2. Hadoop: 2.4.1
  3. Hive: 0.11.0
  4. JDK: 1.7.0_60
  5. Spark: 1.1.0 (with Spark SQL built in)
  6. Scala: 2.11.2


2. Spark Cluster Plan

  1. Account: ebupt
  2. master: eb174
  3. slaves: eb174, eb175, eb176
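The plan above maps directly onto Spark's standalone `conf/slaves` file, which lists one worker hostname per line. A minimal sketch, assuming a standalone deployment under an illustrative `$SPARK_HOME` (the path is an assumption, not from the original post):

```shell
# Hypothetical sketch: write the worker list for the cluster plan above.
# SPARK_HOME defaults to an illustrative path; adjust to the real install.
SPARK_HOME="${SPARK_HOME:-$HOME/spark}"
mkdir -p "$SPARK_HOME/conf"
cat > "$SPARK_HOME/conf/slaves" <<'EOF'
eb174
eb175
eb176
EOF
```

With this file in place, running `sbin/start-all.sh` on the master (eb174) starts a worker on each listed host.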



3. A Brief History of Spark SQL
Spark 1.1.0 was released on September 11, 2014. Spark SQL was first introduced in Spark 1.0, and the most significant changes in 1.1.0 are in Spark SQL and MLlib; see the release notes for details.
Spark SQL's predecessor is Shark. Because of Shark's inherent limitations, Reynold Xin announced on June 1, 2014 that development of Shark would stop. Spark SQL abandons the old Shark code base but carries over some of its strengths, such as in-memory columnar storage and Hive compatibility, and was developed from scratch.


4. Configuration
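This section is empty in the original post. Judging from the rest of the article (the client connects to a Hive metastore at thrift://eb170:9083 in the log below), the key step is most likely making Hive's client configuration visible to Spark. A hedged sketch, with every path an assumption:

```shell
# Hedged sketch of the likely configuration step; all paths are assumptions.
SPARK_HOME="${SPARK_HOME:-$HOME/spark}"
HIVE_CONF="${HIVE_CONF:-$HOME/hive/conf}"
mkdir -p "$SPARK_HOME/conf"
# Spark SQL discovers the metastore URI (thrift://eb170:9083 in the log) via hive-site.xml.
if [ -f "$HIVE_CONF/hive-site.xml" ]; then
  cp "$HIVE_CONF/hive-site.xml" "$SPARK_HOME/conf/"
fi
```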


5. Running
  • Start the Spark cluster.
  • Start the Spark SQL client: ./spark/bin/spark-sql --master spark://eb174:7077 --executor-memory 3g
  • Run SQL against a Hive table: spark-sql> select count(*) from test.t1;
14/10/08 20:46:04 INFO ParseDriver: Parsing command: select count(*) from test.t1
14/10/08 20:46:05 INFO ParseDriver: Parse Completed
14/10/08 20:46:05 INFO metastore: Trying to connect to metastore with URI thrift://eb170:9083
14/10/08 20:46:05 INFO metastore: Waiting 1 seconds before next connection attempt.
14/10/08 20:46:06 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@eb174:55408/user/Executor#1282322316] with ID 2
14/10/08 20:46:06 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@eb176:56138/user/Executor#-264112470] with ID 0
14/10/08 20:46:06 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@eb175:43791/user/Executor#-996481867] with ID 1
14/10/08 20:46:06 INFO BlockManagerMasterActor: Registering block manager eb174:54967 with 265.4 MB RAM
14/10/08 20:46:06 INFO BlockManagerMasterActor: Registering block manager eb176:60783 with 265.4 MB RAM
14/10/08 20:46:06 INFO BlockManagerMasterActor: Registering block manager eb175:35197 with 265.4 MB RAM
14/10/08 20:46:06 INFO metastore: Connected to metastore.
14/10/08 20:46:07 INFO deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/10/08 20:46:07 INFO MemoryStore: ensureFreeSpace(406982) called with curMem=0, maxMem=278302556
14/10/08 20:46:07 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 397.4 KB, free 265.0 MB)
14/10/08 20:46:07 INFO MemoryStore: ensureFreeSpace(25198) called with curMem=406982, maxMem=278302556
14/10/08 20:46:07 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.6 KB, free 265.0 MB)
14/10/08 20:46:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on eb174:49971 (size: 24.6 KB, free: 265.4 MB)
14/10/08 20:46:07 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
14/10/08 20:46:07 INFO SparkContext: Starting job: collect at HiveContext.scala:415
14/10/08 20:46:08 INFO FileInputFormat: Total input paths to process : 1
14/10/08 20:46:08 INFO DAGScheduler: Registering RDD 5 (mapPartitions at Exchange.scala:86)
14/10/08 20:46:08 INFO DAGScheduler: Got job 0 (collect at HiveContext.scala:415) with 1 output partitions (allowLocal=false)
14/10/08 20:46:08 INFO DAGScheduler: Final stage: Stage 0(collect at HiveContext.scala:415)
14/10/08 20:46:08 INFO DAGScheduler: Parents of final stage: List(Stage 1)
14/10/08 20:46:08 INFO DAGScheduler: Missing parents: List(Stage 1)
14/10/08 20:46:08 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[5] at mapPartitions at Exchange.scala:86), which has no missing parents
14/10/08 20:46:08 INFO MemoryStore: ensureFreeSpace(11000) called with curMem=432180, maxMem=278302556
14/10/08 20:46:08 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 10.7 KB, free 265.0 MB)
14/10/08 20:46:08 INFO MemoryStore: ensureFreeSpace(5567) called with curMem=443180, maxMem=278302556
14/10/08 20:46:08 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 5.4 KB, free 265.0 MB)
14/10/08 20:46:08 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on eb174:49971 (size: 5.4 KB, free: 265.4 MB)
14/10/08 20:46:08 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
14/10/08 20:46:08 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[5] at mapPartitions at Exchange.scala:86)
14/10/08 20:46:08 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/10/08 20:46:08 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, eb174, NODE_LOCAL, 1199 bytes)
14/10/08 20:46:08 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, eb176, NODE_LOCAL, 1199 bytes)
14/10/08 20:46:08 INFO ConnectionManager: Accepted connection from [eb176/10.1.69.176:49289]
14/10/08 20:46:08 INFO ConnectionManager: Accepted connection from [eb174/10.1.69.174:33401]
14/10/08 20:46:08 INFO SendingConnection: Initiating connection to [eb176/10.1.69.176:60783]
14/10/08 20:46:08 INFO SendingConnection: Initiating connection to [eb174/10.1.69.174:54967]
14/10/08 20:46:08 INFO SendingConnection: Connected to [eb176/10.1.69.176:60783], 1 messages pending
14/10/08 20:46:08 INFO SendingConnection: Connected to [eb174/10.1.69.174:54967], 1 messages pending
14/10/08 20:46:08 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on eb176:60783 (size: 5.4 KB, free: 265.4 MB)
14/10/08 20:46:08 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on eb174:54967 (size: 5.4 KB, free: 265.4 MB)
14/10/08 20:46:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on eb174:54967 (size: 24.6 KB, free: 265.4 MB)
14/10/08 20:46:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on eb176:60783 (size: 24.6 KB, free: 265.4 MB)
14/10/08 20:46:10 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 1) in 2657 ms on eb176 (1/2)
14/10/08 20:46:10 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) in 2675 ms on eb174 (2/2)
14/10/08 20:46:10 INFO DAGScheduler: Stage 1 (mapPartitions at Exchange.scala:86) finished in 2.680 s
14/10/08 20:46:10 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
14/10/08 20:46:10 INFO DAGScheduler: looking for newly runnable stages
14/10/08 20:46:10 INFO DAGScheduler: running: Set()
14/10/08 20:46:10 INFO DAGScheduler: waiting: Set(Stage 0)
14/10/08 20:46:10 INFO DAGScheduler: failed: Set()
14/10/08 20:46:10 INFO DAGScheduler: Missing parents for Stage 0: List()
14/10/08 20:46:10 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[9] at map at HiveContext.scala:360), which is now runnable
14/10/08 20:46:10 INFO MemoryStore: ensureFreeSpace(9752) called with curMem=448747, maxMem=278302556
14/10/08 20:46:10 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 9.5 KB, free 265.0 MB)
14/10/08 20:46:10 INFO MemoryStore: ensureFreeSpace(4941) called with curMem=458499, maxMem=278302556
14/10/08 20:46:10 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.8 KB, free 265.0 MB)
14/10/08 20:46:10 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on eb174:49971 (size: 4.8 KB, free: 265.4 MB)
14/10/08 20:46:10 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
14/10/08 20:46:11 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[9] at map at HiveContext.scala:360)
14/10/08 20:46:11 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/10/08 20:46:11 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 2, eb175, PROCESS_LOCAL, 948 bytes)
14/10/08 20:46:11 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@513f39c
14/10/08 20:46:11 INFO StatsReportListener: task runtime:(count: 2, mean: 2666.000000, stdev: 9.000000, max: 2675.000000, min: 2657.000000)
14/10/08 20:46:11 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
14/10/08 20:46:11 INFO StatsReportListener:     2.7 s   2.7 s   2.7 s   2.7 s   2.7 s   2.7 s   2.7 s   2.7 s   2.7 s
14/10/08 20:46:11 INFO StatsReportListener: shuffle bytes written:(count: 2, mean: 50.000000, stdev: 0.000000, max: 50.000000, min: 50.000000)
14/10/08 20:46:11 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
14/10/08 20:46:11 INFO StatsReportListener:     50.0 B  50.0 B  50.0 B  50.0 B  50.0 B  50.0 B  50.0 B  50.0 B  50.0 B
14/10/08 20:46:11 INFO StatsReportListener: task result size:(count: 2, mean: 1848.000000, stdev: 0.000000, max: 1848.000000, min: 1848.000000)
14/10/08 20:46:11 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
14/10/08 20:46:11 INFO StatsReportListener:     1848.0 B        1848.0 B        1848.0 B        1848.0 B        1848.0 B        1848.0 B    1848.0 B        1848.0 B        1848.0 B
14/10/08 20:46:11 INFO StatsReportListener: executor (non-fetch) time pct: (count: 2, mean: 86.309428, stdev: 0.103820, max: 86.413248, min: 86.205607)
14/10/08 20:46:11 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
14/10/08 20:46:11 INFO StatsReportListener:     86 %    86 %    86 %    86 %    86 %    86 %    86 %    86 %    86 %
14/10/08 20:46:11 INFO StatsReportListener: other time pct: (count: 2, mean: 13.690572, stdev: 0.103820, max: 13.794393, min: 13.586752)
14/10/08 20:46:11 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
14/10/08 20:46:11 INFO StatsReportListener:     14 %    14 %    14 %    14 %    14 %    14 %    14 %    14 %    14 %
14/10/08 20:46:11 INFO ConnectionManager: Accepted connection from [eb175/10.1.69.175:36187]
14/10/08 20:46:11 INFO SendingConnection: Initiating connection to [eb175/10.1.69.175:35197]
14/10/08 20:46:11 INFO SendingConnection: Connected to [eb175/10.1.69.175:35197], 1 messages pending
14/10/08 20:46:11 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on eb175:35197 (size: 4.8 KB, free: 265.4 MB)
14/10/08 20:46:12 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to sparkExecutor@eb175:58085
14/10/08 20:46:12 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 140 bytes
14/10/08 20:46:12 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 2) in 1428 ms on eb175 (1/1)
14/10/08 20:46:12 INFO DAGScheduler: Stage 0 (collect at HiveContext.scala:415) finished in 1.432 s
14/10/08 20:46:12 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/10/08 20:46:12 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@6e8030b0
14/10/08 20:46:12 INFO StatsReportListener: task runtime:(count: 1, mean: 1428.000000, stdev: 0.000000, max: 1428.000000, min: 1428.000000)
14/10/08 20:46:12 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
14/10/08 20:46:12 INFO StatsReportListener:     1.4 s   1.4 s   1.4 s   1.4 s   1.4 s   1.4 s   1.4 s   1.4 s   1.4 s
14/10/08 20:46:12 INFO StatsReportListener: fetch wait time:(count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)
14/10/08 20:46:12 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
14/10/08 20:46:12 INFO StatsReportListener:     0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms
14/10/08 20:46:12 INFO StatsReportListener: remote bytes read:(count: 1, mean: 100.000000, stdev: 0.000000, max: 100.000000, min: 100.000000)
14/10/08 20:46:12 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
14/10/08 20:46:12 INFO StatsReportListener:     100.0 B 100.0 B 100.0 B 100.0 B 100.0 B 100.0 B 100.0 B 100.0 B 100.0 B
14/10/08 20:46:12 INFO SparkContext: Job finished: collect at HiveContext.scala:415, took 4.787407158 s
14/10/08 20:46:12 INFO StatsReportListener: task result size:(count: 1, mean: 1072.000000, stdev: 0.000000, max: 1072.000000, min: 1072.000000)
14/10/08 20:46:12 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
14/10/08 20:46:12 INFO StatsReportListener:     1072.0 B        1072.0 B        1072.0 B        1072.0 B        1072.0 B        1072.0 B    1072.0 B        1072.0 B        1072.0 B
14/10/08 20:46:12 INFO StatsReportListener: executor (non-fetch) time pct: (count: 1, mean: 80.252101, stdev: 0.000000, max: 80.252101, min: 80.252101)
14/10/08 20:46:12 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
14/10/08 20:46:12 INFO StatsReportListener:     80 %    80 %    80 %    80 %    80 %    80 %    80 %    80 %    80 %
14/10/08 20:46:12 INFO StatsReportListener: fetch wait time pct: (count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)
14/10/08 20:46:12 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
14/10/08 20:46:12 INFO StatsReportListener:      0 %     0 %     0 %     0 %     0 %     0 %     0 %     0 %     0 %
14/10/08 20:46:12 INFO StatsReportListener: other time pct: (count: 1, mean: 19.747899, stdev: 0.000000, max: 19.747899, min: 19.747899)
14/10/08 20:46:12 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
14/10/08 20:46:12 INFO StatsReportListener:     20 %    20 %    20 %    20 %    20 %    20 %    20 %    20 %    20 %
5078
Time taken: 7.581 seconds



Note:
  • If you start spark-sql without specifying a master, it runs in local mode; --master can point to either a standalone master URL or YARN.
  • When the master is set to YARN (spark-sql --master yarn), the whole job can be monitored from the page at http://$master:8088.
  • If spark.master spark://eb174:7077 is configured in $SPARK_HOME/conf/spark-defaults.conf, spark-sql runs on the standalone cluster even when started without a master.
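The third note above amounts to a single entry in spark-defaults.conf; a sketch that appends it (the file location is Spark's standard one, the URL comes from this cluster's plan):

```shell
# Append a default master so spark-sql no longer needs --master.
# SPARK_HOME defaults to an illustrative path; adjust to the real install.
SPARK_HOME="${SPARK_HOME:-$HOME/spark}"
mkdir -p "$SPARK_HOME/conf"
echo 'spark.master            spark://eb174:7077' >> "$SPARK_HOME/conf/spark-defaults.conf"
```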


6. Problems Encountered and Solutions
① Running a SQL statement in the spark-sql command-line client fails with UnknownHostException: ebcloud (ebcloud is the value of Hadoop's dfs.nameservices)
14/10/08 20:42:44 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, eb174): java.lang.IllegalArgumentException: java.net.UnknownHostException: ebcloud
        org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
        org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:240)
        org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:144)
        org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:579)
        org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:524)
        org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:146)
        org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
        org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)



Cause: Spark cannot resolve the HDFS nameservice address. Fix: copy Hadoop's HDFS configuration file hdfs-site.xml into the $SPARK_HOME/conf directory.
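The fix can be sketched as a single copy on the client node (the Hadoop configuration path here is an assumption):

```shell
# Copy the HDFS client config into Spark's conf directory.
# Both paths are illustrative assumptions; adjust to the real installs.
SPARK_HOME="${SPARK_HOME:-$HOME/spark}"
HADOOP_CONF="${HADOOP_CONF:-$HOME/hadoop/etc/hadoop}"
mkdir -p "$SPARK_HOME/conf"
# hdfs-site.xml carries the dfs.nameservices mapping (ebcloud) that Spark needs.
if [ -f "$HADOOP_CONF/hdfs-site.xml" ]; then
  cp "$HADOOP_CONF/hdfs-site.xml" "$SPARK_HOME/conf/"
fi
```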



② The spark-sql client fails to connect to the NameNode:

14/10/08 20:26:46 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
14/10/08 20:26:46 INFO SparkContext: Starting job: collect at HiveContext.scala:415
14/10/08 20:29:19 WARN RetryInvocationHandler: Exception while invoking class org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo over eb171/10.1.69.171:8020. Not retrying because failovers (15) exceeded maximum allowed (15)
java.net.ConnectException: Call From eb174/10.1.69.174 to eb171:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1414)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)



Cause: the connection to HDFS failed because hdfs-site.xml had not been synced to all of the slave nodes.
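Propagating the file to the other nodes can be sketched as a loop over the slaves from section 2. This sketch only prints the scp commands rather than running them; passwordless SSH as user ebupt and identical install paths on every node are assumptions:

```shell
# Generate (not run) the sync commands for each slave node.
# SPARK_HOME is an illustrative path; hosts come from the cluster plan.
SPARK_HOME="${SPARK_HOME:-$HOME/spark}"
cmds=""
for host in eb175 eb176; do
  cmd="scp $SPARK_HOME/conf/hdfs-site.xml ebupt@$host:$SPARK_HOME/conf/"
  cmds="$cmds$cmd
"
  echo "$cmd"
done
```

Review the printed commands, then run them (or pipe the output to sh) once SSH access is in place.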


7. References





Source: http://www.cnblogs.com/byrhuangqiang/p/4012087.html