How Spark SQL wraps a Seq(s1, s2, s3, ...): each element si is wrapped into a Row.
If si is a simple value, it produces a Row with a single column named value.
If si is an N-tuple, it produces a Row with N columns (named _1 through _N by default).
Note in particular that a parenthesized single value such as (si) is not a one-element tuple at all in Scala; it is simply si, so it also produces a Row with a single value column.
scala> Seq(("bluejoe"),("alex")).toDF().show
+-------+
| value|
+-------+
|bluejoe|
| alex|
+-------+
scala> Seq("bluejoe","alex").toDF().show
+-------+
| value|
+-------+
|bluejoe|
| alex|
+-------+
scala> Seq(("bluejoe",1),("alex",0)).toDF().show
+-------+---+
| _1| _2|
+-------+---+
|bluejoe| 1|
| alex| 0|
+-------+---+
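By contrast, a genuine one-element tuple constructed with Tuple1 is a real Product, so it goes through the tuple encoder and gets a column named _1. Below is a minimal sketch of the distinction; the column names in the comments reflect my reading of Spark's default naming, not output captured from a session:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[4]").getOrCreate()
import spark.implicits._

// (s) is not a tuple in Scala, so the single column is named "value"
Seq(("bluejoe"), ("alex")).toDF().printSchema()

// Tuple1(s) is a genuine Product, so the single column is named "_1"
Seq(Tuple1("bluejoe"), Tuple1("alex")).toDF().printSchema()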
I wrote the following test case specifically to verify this behavior:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.encoders.encoderFor
import org.junit.{Assert, Test}

class EncoderSchemaTest {
  @Test
  def testEncoderSchema(): Unit = {
    val spark = SparkSession.builder.master("local[4]").getOrCreate()
    val sqlContext = spark.sqlContext
    // brings the implicit Encoder[String] into scope, as required by encoderFor
    import sqlContext.implicits._

    // (String) and ((String)) are just String wrapped in redundant parentheses,
    // not Tuple1[String], so all three schemas must be identical
    val schema1 = encoderFor[String].schema
    val schema2 = encoderFor[(String)].schema
    val schema3 = encoderFor[((String))].schema
    Assert.assertEquals(schema1, schema2)
    Assert.assertEquals(schema1, schema3)
  }
}
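The same distinction can be checked at the encoder level. As a hedged extension of the test above (assuming JUnit 4.11+ for assertNotEquals and the same implicits in scope), an explicitly written Tuple1[String] should yield a different schema:

// Unlike (String), Tuple1[String] goes through the Product encoder,
// so its schema has a single field named "_1" instead of "value"
val schema4 = encoderFor[Tuple1[String]].schema
Assert.assertNotEquals(schema1, schema4)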