Spark ML Feature

Author: emm_simon | Published 2019-10-24 21:07

    [Reference: official documentation]
    [Reference: link]

    Feature Extraction

    TF-IDF

    Scala demo:

    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
    
    val sentenceData = spark.createDataFrame(Seq(
      (0.0, "Hi I heard about Spark"),
      (0.0, "I wish Java could use case classes"),
      (1.0, "Logistic regression models are neat")
    )).toDF("label", "sentence")
    
    val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
    val wordsData = tokenizer.transform(sentenceData)
    
    val hashingTF = new HashingTF()
      .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
    
    val featurizedData = hashingTF.transform(wordsData)
    // alternatively, CountVectorizer can also be used to get term frequency vectors
    
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val idfModel = idf.fit(featurizedData)
    
    val rescaledData = idfModel.transform(featurizedData)
    rescaledData.select("label", "features").show()
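As a sanity check on what the IDF stage computes, here is a small sketch in plain Python (not Spark) of the smoothed formula Spark ML documents for IDF, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t:

```python
import math

# Toy corpus: three tokenized "documents", mirroring the demo above.
docs = [
    ["hi", "i", "heard", "about", "spark"],
    ["i", "wish", "java", "could", "use", "case", "classes"],
    ["logistic", "regression", "models", "are", "neat"],
]

def idf(term, docs):
    # Spark ML's smoothed IDF: log((m + 1) / (df + 1)).
    m = len(docs)
    df = sum(term in d for d in docs)
    return math.log((m + 1) / (df + 1))

# "i" appears in 2 of 3 documents, "spark" in only 1,
# so "spark" is weighted higher than the common term "i".
print(idf("i", docs))      # log(4/3)
print(idf("spark", docs))  # log(4/2)
```

The rescaled features in the demo are these per-term IDF weights multiplied by the hashed term frequencies.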
    
    Word2Vec

    Scala demo:

    import org.apache.spark.ml.feature.Word2Vec
    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.Row
    
    // Input data: Each row is a bag of words from a sentence or document.
    val documentDF = spark.createDataFrame(Seq(
      "Hi I heard about Spark".split(" "),
      "I wish Java could use case classes".split(" "),
      "Logistic regression models are neat".split(" ")
    ).map(Tuple1.apply)).toDF("text")
    
    // Learn a mapping from words to Vectors.
    val word2Vec = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(3)
      .setMinCount(0)
    val model = word2Vec.fit(documentDF)
    
    val result = model.transform(documentDF)
    result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
      println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n") }
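In the transform step above, Word2VecModel turns each document into a single vector by averaging the vectors of its words. A minimal Python sketch of that averaging step (the word vectors here are invented placeholders, not values a trained model would produce):

```python
# Hypothetical learned word vectors (vectorSize = 3, values invented).
word_vectors = {
    "hi":    [0.1, -0.2, 0.3],
    "spark": [0.4,  0.0, -0.1],
}

def document_vector(words, word_vectors, size=3):
    # Average the vectors of in-vocabulary words, which is what
    # Word2VecModel.transform does for each row.
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:
        return [0.0] * size
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

print(document_vector(["hi", "spark"], word_vectors))  # ≈ [0.25, -0.1, 0.1]
```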
    
    CountVectorizer

    CountVectorizer and CountVectorizerModel convert a collection of text documents into vectors of token counts.
    When no a-priori dictionary is available, CountVectorizer can be used as an Estimator to extract the vocabulary and produce a CountVectorizerModel.
    The model produces a sparse representation of each document over the vocabulary, which can then be passed on to other algorithms such as LDA.
    During fitting, CountVectorizer ranks terms by their frequency across the corpus and selects the top vocabSize words.
    An optional parameter, minDF, also affects the fitting process: it specifies the minimum number of documents (or fraction of documents, if < 1.0) a term must appear in to be included in the vocabulary.
    Another option is the binary toggle parameter, which controls the output vector: if set to true, all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary counts rather than integer counts.
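The effect of the binary toggle can be sketched in a few lines of Python (sparse vectors written as (size, indices, values) triples, mirroring how Spark prints a SparseVector):

```python
def binarize(sparse):
    # With the binary toggle on, every nonzero count becomes 1.0;
    # zero entries stay absent from the sparse representation.
    size, idx, vals = sparse
    return (size, idx, [1.0 for _ in vals])

print(binarize((3, [0, 1, 2], [2.0, 2.0, 1.0])))  # (3, [0, 1, 2], [1.0, 1.0, 1.0])
```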
    Example:

     id | texts
    ----|----------
     0  | Array("a", "b", "c")
     1  | Array("a", "b", "b", "c", "a")
    

    Each row's texts field is a document of type Array[String]. Calling CountVectorizer's fit() method produces a CountVectorizerModel with vocabulary (a, b, c). The transformation then adds a new column "vector" to the DataFrame:

     id | texts                           | vector
    ----|---------------------------------|---------------
     0  | Array("a", "b", "c")            | (3,[0,1,2],[1.0,1.0,1.0])
     1  | Array("a", "b", "b", "c", "a")  | (3,[0,1,2],[2.0,2.0,1.0])
    

    Each vector represents the document's token counts over the vocabulary.
    Scala demo:

    import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
    
    val df = spark.createDataFrame(Seq(
      (0, Array("a", "b", "c")),
      (1, Array("a", "b", "b", "c", "a"))
    )).toDF("id", "words")
    
    // fit a CountVectorizerModel from the corpus
    val cvModel: CountVectorizerModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setVocabSize(3)
      .setMinDF(2)
      .fit(df)
    
    // alternatively, define CountVectorizerModel with a-priori vocabulary
    val cvm = new CountVectorizerModel(Array("a", "b", "c"))
      .setInputCol("words")
      .setOutputCol("features")
    
    cvModel.transform(df).show(false)
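The counting behaviour shown in the table above can be sketched in plain Python (a rough analogue only; hypothetical helper names, and it ignores vocabSize truncation and minDF filtering): build a vocabulary ordered by corpus frequency, then map each document to (index, count) pairs.

```python
from collections import Counter

docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]

# Vocabulary ordered by total term frequency across the corpus,
# analogous to how CountVectorizer ranks terms when fitting.
corpus_counts = Counter(t for d in docs for t in d)
vocab = [t for t, _ in corpus_counts.most_common()]
index = {t: i for i, t in enumerate(vocab)}

def to_sparse(doc):
    # (size, indices, values) triple, like Spark's SparseVector display.
    c = Counter(doc)
    idx = sorted(index[t] for t in c)
    return (len(vocab), idx, [float(c[vocab[i]]) for i in idx])

print(to_sparse(["a", "b", "b", "c", "a"]))  # (3, [0, 1, 2], [2.0, 2.0, 1.0])
```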
    

    Feature Transformation

    Tokenizer
    StopWordsRemover
    n-gram
    Binarizer
    PCA
    PolynomialExpansion
    Discrete Cosine Transform (DCT)
    StringIndexer
    IndexToString
    OneHotEncoder
    VectorIndexer
    Interaction
    Normalizer
    StandardScaler
    MinMaxScaler
    MaxAbsScaler
    Bucketizer
    ElementwiseProduct
    SQLTransformer
    VectorAssembler
    QuantileDiscretizer
    Imputer

    Feature Selection

    VectorSlicer
    RFormula
    ChiSqSelector
    Locality Sensitive Hashing (LSH)
    Bucketed Random Projection for Euclidean Distance
    MinHash for Jaccard Distance
    Feature Transformation
    Approximate Similarity Join
    Approximate Nearest Neighbor Search
    LSH Operations
    LSH Algorithms


    Original link: https://www.haomeiwen.com/subject/rmtjvctx.html