Hadoop - SparkSQL


Author: Xiangyuan_Ren | Published: 2018-01-02 16:18
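    • In Spark 2.0 the DataFrame API was unified with the Dataset API: a DataFrame is a Dataset of Row objects. PySpark still exposes it as DataFrame.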
    • Code:

    export SPARK_MAJOR_VERSION=2

    from pyspark.sql import SparkSession
    from pyspark.sql import Row
    from pyspark.sql import functions
    
    def loadMovieNames():
        movieNames = {}
        with open("ml-100k/u.item") as f:
            for line in f:
                fields = line.split('|')
                movieNames[int(fields[0])] = fields[1]
        return movieNames
    
    def parseInput(line):
        fields = line.split()
        return Row(movieID = int(fields[1]), rating = float(fields[2]))
    
    if __name__ == "__main__":
        # Create a SparkSession
        spark = SparkSession.builder.appName("PopularMovies").getOrCreate()
    
        # Load up our movie ID -> name dictionary
        movieNames = loadMovieNames()
        # Get the raw data
        lines = spark.sparkContext.textFile("hdfs:///user/maria_dev/ml-100k/u.data")
        # Convert it to an RDD of Row objects with (movieID, rating)
        movies = lines.map(parseInput)
        # Convert that to a DataFrame
        movieDataset = spark.createDataFrame(movies)
    
        # Compute average rating for each movieID
        averageRatings = movieDataset.groupBy("movieID").avg("rating")
    
        # Compute count of ratings for each movieID
        counts = movieDataset.groupBy("movieID").count()
    
        # Join the two together (we now have movieID, count, and avg(rating) columns)
        averagesAndCounts = counts.join(averageRatings, "movieID")
    
        # Sort ascending by average rating and take the 10 lowest-rated movies
        topTen = averagesAndCounts.orderBy("avg(rating)").take(10)
    
        # Print them out (name, count, average), converting movie IDs to names as we go.
        for movie in topTen:
            print (movieNames[movie[0]], movie[1], movie[2])
    
        # Stop the session
        spark.stop()
    

    spark-submit LowestRatedMovieDataFrame.py

    • Result (movie title, number of ratings, average rating):

    ('Further Gesture, A (1996)', 1, 1.0)
    ('Falling in Love Again (1980)', 2, 1.0)
    ('Amityville: Dollhouse (1996)', 3, 1.0)
    ('Power 98 (1995)', 1, 1.0)
    ('Low Life, The (1994)', 1, 1.0)
    ('Careful (1992)', 1, 1.0)
    ('Lotto Land (1995)', 1, 1.0)
    ('Hostile Intentions (1994)', 1, 1.0)
    ('Amityville: A New Generation (1993)', 5, 1.0)
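
Every movie in the result above has only a handful of ratings, so a single 1.0 score is enough to reach the bottom of the list. To check the aggregation logic without a cluster, the same group, count, average, and ascending sort can be sketched in plain Python. This is only an illustration under stated assumptions: `SAMPLE_LINES`, `parse_input`, `lowest_rated`, and the `min_ratings` threshold are hypothetical names that are not part of the original job, and the tie-break on movie ID is added here only to make the ordering deterministic.

```python
from collections import defaultdict

# Hypothetical sample in the u.data layout: userID, movieID, rating, timestamp
SAMPLE_LINES = [
    "196\t242\t3\t881250949",
    "186\t302\t1\t891717742",
    "22\t242\t5\t878887116",
    "244\t51\t2\t880606923",
    "166\t302\t1\t886397596",
    "298\t51\t4\t884182806",
    "115\t51\t3\t881171488",
]


def parse_input(line):
    """Return (movieID, rating) from one whitespace-separated u.data line."""
    fields = line.split()
    return int(fields[1]), float(fields[2])


def lowest_rated(lines, limit=10, min_ratings=1):
    """Mimic groupBy("movieID") + count/avg + ascending orderBy + take(limit).

    min_ratings is an assumed extra filter that the original job does not apply.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        movie_id, rating = parse_input(line)
        totals[movie_id] += rating
        counts[movie_id] += 1

    # Rows use the same column order as the joined DataFrame: ID, count, average
    rows = [
        (movie_id, counts[movie_id], totals[movie_id] / counts[movie_id])
        for movie_id in counts
        if counts[movie_id] >= min_ratings
    ]
    # Ascending by average; movie ID breaks ties (an assumption for determinism)
    rows.sort(key=lambda row: (row[2], row[0]))
    return rows[:limit]


if __name__ == "__main__":
    for row in lowest_rated(SAMPLE_LINES):
        print(row)
```

Calling `lowest_rated(SAMPLE_LINES, min_ratings=3)` keeps only movies with at least three ratings, which is one possible way to stop rarely rated titles from dominating the list.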
