Spark Sql系统入门4：spark应用程序中使用spark sql

问题导读

1.你认为如何初始化spark sql？
2.不同的语言，实现方式都是什么？
3.spark sql语句如何实现在应用程序中使用？

相关篇章
Spark Sql系统入门1：什么是spark sql及包含哪些组件
http://www.aboutyun.com/forum.php?mod=viewthread&tid=20910

Spark Sql系统入门2：spark sql精简总结
http://www.aboutyun.com/forum.php?mod=viewthread&tid=21002

Spark Sql系统入门3：spark sql运行计划精简
http://www.aboutyun.com/forum.php?mod=viewthread&tid=21032

为了使用spark sql，我们构建HiveContext （或则SQLContext 那些想要的精简版）基于我们的SparkContext.这个context 提供额外的函数为查询和整合spark sql数据。使用HiveContext，我们构建SchemaRDDs.这代表我们机构化数据，和操作他们使用sql或则正常的rdd操作如map（）.
初始化spark sql

为了开始spark sql，我们需要添加一些imports 到我们程序。如下面例子1
例子1Scala SQL imports
[mw_shl_code=scala,true]// Import Spark SQL
import org.apache.spark.sql.hive.HiveContext
// Or if you can't have the hive dependencies
import org.apache.spark.sql.SQLContext[/mw_shl_code]

Scala用户注意，我们不使用 import HiveContext._,像我们这样做SparkContext，获取访问implicits.这些implicits用来转换rdds,带着需要的type信息到spark sql的序列化rdds为查询。相反，一旦我们有了结构化HiveContext实例化，我们可以导入 implicits 在例子2中。导入Java和Python在例子3和4中。例子2Scala SQL imports
[mw_shl_code=scala,true]// Create a Spark SQL HiveContext
val hiveCtx = ...
// Import the implicit conversions
import hiveCtx._
[/mw_shl_code]

例子3Java SQL imports
[mw_shl_code=java,true]// Import Spark SQL
import org.apache.spark.sql.hive.HiveContext;
// Or if you can't have the hive dependencies
import org.apache.spark.sql.SQLContext;
// Import the JavaSchemaRDD
import org.apache.spark.sql.SchemaRDD;
import org.apache.spark.sql.Row;
[/mw_shl_code]
例子4Python SQL imports
[mw_shl_code=python,true]# Import Spark SQL
from pyspark.sql import HiveContext, Row
# Or if you can't include the hive requirements
from pyspark.sql import SQLContext, Row[/mw_shl_code]
一旦我们添加我们的imports,我们需要创建HiveContext,或则SQLContext，如果我们引入Hive依赖（查看例子5和6）。这两个类都需要运行spark。
例子5：使用Scala结构化sql context
[mw_shl_code=scala,true]val sc = new SparkContext(...)
val hiveCtx = new HiveContext(sc)[/mw_shl_code]

例子6：使用java结构化sql context

[mw_shl_code=java,true]JavaSparkContext ctx = new JavaSparkContext(...);
SQLContext sqlCtx = new HiveContext(ctx);[/mw_shl_code]

例子7：使用python结构化sql context
[mw_shl_code=python,true]hiveCtx = HiveContext(sc)[/mw_shl_code]
现在我们有了HiveContext 或则SQLContext，我们准备加载数据和查询。

基本查询例子
为了对一个表查询，我们调用HiveContext或则SQLContext的sql()函数.第一个事情，我们需要告诉spark sql关于一些数据的查询。在这种情况下，我们load Twitter数据【json格式】,和给它一个name,注册为 “临时表”，因此我们可以使用sql查询。
例子8使用Scala加载和查询tweets
[mw_shl_code=scala,true]val input = hiveCtx.jsonFile(inputFile)
// Register the input schema RDD
input.registerTempTable("tweets")
// Select tweets based on the retweetCount
val topTweets = hiveCtx.sql("SELECT text, retweetCount FROM
tweets ORDER BY retweetCount LIMIT 10")[/mw_shl_code]

例子9使用Java加载和查询tweets
[mw_shl_code=java,true]SchemaRDD input = hiveCtx.jsonFile(inputFile);
// Register the input schema RDD
input.registerTempTable("tweets");
// Select tweets based on the retweetCount
SchemaRDD topTweets = hiveCtx.sql("SELECT text, retweetCount FROM
tweets ORDER BY retweetCount LIMIT 10");[/mw_shl_code]

例子10使用Python加载和查询tweets
[mw_shl_code=python,true]input = hiveCtx.jsonFile(inputFile)
# Register the input schema RDD
input.registerTempTable("tweets")
# Select tweets based on the retweetCount
topTweets = hiveCtx.sql("""SELECT text, retweetCount FROM
tweets ORDER BY retweetCount LIMIT 10""")
[/mw_shl_code]

如果你已经安装hive，并且复制hive-site.xml文件到$SPARK_HOME/conf，你也可以运行hiveCtx.sql 查询已存在的hive表。

转载注明来自about云（www.aboutyun.com）

小北酱 · 发表于 2017-2-25 08:52:04

留名感谢分享

美丽天空 · 发表于 2017-2-25 11:13:53

感谢分享

美丽天空 · 发表于 2017-2-26 23:05:25

感谢分享

图文精华

Spark Sql系统入门4：spark应用程序中使用spark sql

本帖被以下淘专辑推荐:

已有(3)人评论

活跃会员

热心会员

优秀版主

论坛元老

推荐 /2