MARK5826 Week4 lab

Author: GhostintheCode | Published: 2018-08-20 13:06

    CONTENT OPTIMIZATION (using machine learning)

    AIM

    Given movie data, predict revenue/ratings and understand WHAT contributes to the prediction.

    Reading Data

    List the file and read in the data.

    Let's summarise the data again using DESCRIBE, then look at the first rows using HEAD

    describe(data)
    
    head(data, 5)
    

    Plotting and Tabulating - Tableau

    I will write up how to use Tableau later.

    Data Cleaning - Natural Language Processing

    Remember when we plotted GENRE against years?
    Let's investigate it again.

    head(data['genre'],10)
    

    Clearly, there's a problem with the data we plotted. We need to SPLIT the column by the comma!
    BUT, before that, remove all SPACES using REMOVE. Why?
    Say we have: [apple, banana, hello] and [banana, apple, hello]
    If we just split by the comma, then apple in list 1 and apple in list 2 will NOT be matched together, since in the second list it is ' apple' (with a leading space).

    data['genre'] = remove(data['genre'], ' ')
    splits = split(data['genre'], ',')
    head(splits, 10)
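
To make the whitespace issue concrete, here is a minimal sketch in plain Python. It does not use the course's `remove` and `split` helpers (their internals are not shown in this lab), and the two sample strings are invented for illustration.

```python
# Two made-up genre strings; only the order of the items differs.
rows = ["apple, banana, hello", "banana, apple, hello"]

# Splitting on the comma alone keeps the space that follows each comma.
naive = [row.split(",") for row in rows]
print(naive[0])  # ['apple', ' banana', ' hello']
print(naive[1])  # ['banana', ' apple', ' hello']

# 'apple' and ' apple' are different strings, so a tally would count them separately.
print(naive[0][0] == naive[1][1])  # False

# Removing the spaces first makes the labels identical.
cleaned = [row.replace(" ", "").split(",") for row in rows]
print(cleaned[0][0] == cleaned[1][1])  # True
```

Note that removing every space also glues multi-word names together, which is why a director later shows up as `ChristopherNolan`.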
    
    genres = tally(splits, multiple = True)
    head(genres, 10)
    

    Next, let's find the COLUMN SUM using COL_OPERATION and 'sum'


    col_operation(genres, 'sum')
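
As a rough picture of what TALLY with MULTIPLE = TRUE followed by a column sum produces, the sketch below builds a 0/1 table in plain Python. This is an assumption about the helpers' behaviour inferred from the description in this lab, not their actual implementation, and the three sample movies are made up.

```python
# Hypothetical sample: each inner list is one movie's genres after splitting.
splits = [
    ["Action", "Sci-Fi"],
    ["Action", "Drama"],
    ["Drama"],
]

# One column per distinct genre, sorted so the order is reproducible.
columns = sorted({genre for row in splits for genre in row})

# One row per movie: 1 if the movie has that genre, else 0.
indicator = [[1 if genre in row else 0 for genre in columns] for row in splits]

# Column sum = number of movies carrying each genre.
totals = {genre: sum(row[i] for row in indicator) for i, genre in enumerate(columns)}

print(columns)    # ['Action', 'Drama', 'Sci-Fi']
print(indicator)  # [[1, 0, 1], [1, 1, 0], [0, 1, 0]]
print(totals)     # {'Action': 2, 'Drama': 2, 'Sci-Fi': 1}
```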

    Now use PLOT with the BARPLOT style to plot the genre totals

    sum_genre = col_operation(genres, 'sum')
    plot(x = sum_genre, style = 'barplot')
    

    So, after we counted GENRES separately, we want to find how RATING is affected per GENRE.
    Now, since the table is just 1s and 0s (for count), we want to MULTIPLY GENRES table with RATING.

    M = multiply(genres, data['rating'])
    head(M, 10)
    
    Multiply each movie's genre indicators by its RATING score.

    You can see how all the ratings are multiplied onto GENRES: every non-zero entry in a row is that movie's rating. But the whole point now is that we want an AVERAGE rating PER GENRE!
    So, how do we summarise columns? Use COL_OPERATION!

    col_operation(M, 'mean')
    
    Expected versus actual result
    col_operation(M, 'mean_zero')
    plot(col_operation(M, 'mean_zero'), style = 'barplot')
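
Why use 'mean_zero' rather than 'mean'? The sketch below assumes that 'mean_zero' averages only the non-zero entries of each column; that reading is inferred from the name and is not confirmed by the lab text. The table and ratings are hypothetical.

```python
# Hypothetical indicator table (rows = movies, columns = genres) and ratings.
genres = ["Action", "Drama"]
indicator = [[1, 0], [1, 1], [0, 1], [0, 1]]
ratings = [8.0, 6.0, 7.0, 9.0]

# Multiply each row of 0/1 flags by that movie's rating.
M = [[flag * rating for flag in row] for row, rating in zip(indicator, ratings)]

def column(table, i):
    """Return column i of a list-of-rows table."""
    return [row[i] for row in table]

# Plain mean: zeros from movies outside the genre drag the average down.
plain_mean = {g: sum(column(M, i)) / len(M) for i, g in enumerate(genres)}

def mean_nonzero(values):
    """Average of the non-zero entries (assumed meaning of 'mean_zero')."""
    kept = [v for v in values if v != 0]
    return sum(kept) / len(kept) if kept else float("nan")

zero_aware_mean = {g: mean_nonzero(column(M, i)) for i, g in enumerate(genres)}

print(plain_mean)                 # {'Action': 3.5, 'Drama': 5.5}
print(zero_aware_mean["Action"])  # 7.0
```

One caveat of this approach: a movie whose true rating is exactly 0 would be dropped as well, so it only works when real ratings are strictly positive.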

    Clearly, the WAR genre has the highest average rating, whilst HORROR has the lowest.
    Next, we want to analyse DIRECTORS. Do they affect RATINGS??
    We can use TALLY to see the count.


    head( tally(data['director']) , 10)

    Now, we clearly want to use MULTIPLE = TRUE in TALLY. But the biggest problem is that many directors have directed only once. We need to REMOVE all directors who directed, say, fewer than 4 movies.
    This is because a one-time shot at directing might not mean anything. But if you have directed 4 movies or so, it could be indicative of your "real" rating.
    Don't forget to remove all SPACES using REMOVE

    data['director'] = remove(data['director'], ' ')
    directors = tally(data['director'], multiple = True, min_count = 4)
    head(directors, 10)
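
The effect of MIN_COUNT can be sketched in plain Python as a count followed by a filter. The threshold is assumed to be inclusive (a director with exactly 4 movies is kept), which matches "fewer than 4 are removed" above; the director names are placeholders.

```python
from collections import Counter

# Hypothetical director column, one director per movie.
directors = ["A", "B", "A", "C", "A", "B", "A", "B", "B"]
min_count = 4

# Count how many movies each director has.
counts = Counter(directors)

# Keep only directors with at least min_count movies.
kept = sorted(name for name, n in counts.items() if n >= min_count)

# Indicator columns exist only for the directors that passed the threshold.
indicator = [[1 if d == name else 0 for name in kept] for d in directors]

print(dict(counts))  # {'A': 4, 'B': 4, 'C': 1}
print(kept)          # ['A', 'B']
print(indicator[3])  # [0, 0]  (the movie by 'C' has no flag set)
```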
    

    Now, use COL_OPERATION after MULTIPLYING RATING, and see who wins!

    director_rating = col_operation(multiply(directors, data['rating']), 'mean_zero')
    plot(director_rating, style = 'barplot')
    

    And how about ACTORS? We do the same thing as with directors, except that they must have performed in at least 5 films (NOT 4 but 5).
    Use SPLIT with ',', then TALLY with MULTIPLE = TRUE, then MULTIPLY, and then COL_OPERATION ('mean_zero').
    Also remove all SPACES using REMOVE

    data['actors'] = remove(data['actors'], ' ')
    splits = split(data['actors'], ',')
    sample(splits, 5)
    

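    How does this differ from the head function? (Presumably sample returns randomly chosen rows, whereas head returns the first rows.)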

    actors = tally(splits, multiple = True, min_count = 5)
    tail(actors, 5)
    
    A number in the table marks an actor who has appeared in at least five films.
    actors_rating = multiply(actors, data['rating'])
    actors_mean = col_operation(actors_rating, 'mean_zero')
    actors_mean = sort(actors_mean)
    actors_mean[0:10]
    

    Machine Learning - Supervised Learning

    Finally, we get to make a model to predict RATINGS!!! We want to combine the data we have just made for ACTORS, DIRECTORS and GENRES.
    First, we need to CONCATENATE all 3 new tables with the original.
    Use HCAT.

    X = hcat(data, actors, directors, genres)
    

    Now, we want to CLEAN the data. Machine Learning models require that there are NO missing values.
    Also, ID columns (rank in this case) need to be removed.
    All TEXT / OBJECT columns must be removed.
    Essentially, only NUMBERS can remain.
    Use CLEAN

    X = clean(X)
    help(clean)
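
The cleaning rules listed above (no missing values, no ID columns, numbers only) can be sketched in plain Python as follows. This is not the course's `clean` function: in particular, dropping rows with missing values is just one possible policy (the real helper might fill them in instead), and the sample rows are invented.

```python
# Hypothetical table as a list of row dictionaries.
rows = [
    {"rank": 1, "title": "Film A", "votes": 1000, "rating": 8.1},
    {"rank": 2, "title": "Film B", "votes": None, "rating": 7.4},
    {"rank": 3, "title": "Film C", "votes": 500, "rating": 6.9},
]
id_columns = {"rank"}

def is_number(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)

# Keep columns that are not IDs and whose non-missing values are all numeric.
numeric_columns = [
    name for name in rows[0]
    if name not in id_columns
    and all(is_number(r[name]) for r in rows if r[name] is not None)
]

# Drop rows with a missing value in any remaining column (assumed policy).
cleaned = [
    {name: r[name] for name in numeric_columns}
    for r in rows
    if all(r[name] is not None for name in numeric_columns)
]

print(numeric_columns)  # ['votes', 'rating']
print(cleaned)          # [{'votes': 1000, 'rating': 8.1}, {'votes': 500, 'rating': 6.9}]
```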
    

    Now, we want the data to predict RATING.
    Take RATING out of X as Y using normal column selection, then remove it from X using EXCLUDE.
    Also remove METASCORE (it might be influencing the real RATING)


    head(X)
    Y = X['rating']
    X = exclude(X, ['rating','metascore'])
    
    Formula basics
    Standardisation
    Linear regression
    Linear model
    Model fitting
    Prediction
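
The code for the standardise, fit and predict steps is not reproduced in the text of this post. As a stand-in, here is a minimal plain-Python sketch of the same idea with a single feature. It is not the course's `model` object; the variable names and the five data points are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical data: one feature (votes) and the target (rating).
votes = [100.0, 200.0, 300.0, 400.0, 500.0]
rating = [6.0, 6.5, 7.0, 7.5, 8.0]

# Standardise: subtract the mean, divide by the standard deviation.
mu, sigma = mean(votes), pstdev(votes)
z = [(v - mu) / sigma for v in votes]

# Ordinary least squares with one standardised feature.
# Because z has mean 0, the intercept is simply the mean of the target.
intercept = mean(rating)
slope = sum(zi * yi for zi, yi in zip(z, rating)) / sum(zi * zi for zi in z)

# Predict on the same rows.
predictions = [intercept + slope * zi for zi in z]

# The slope is "rating points per standard deviation of votes";
# dividing by sigma converts it back to "rating points per vote".
slope_per_vote = slope / sigma

print(round(slope, 4))           # 0.7071
print(round(slope_per_vote, 6))  # 0.005
```

With many features the same fit is usually done with matrix least squares, but the interpretation of each scaled coefficient is the same.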

    Now, after we fit a linear model, we want to know whether it was a good model or not.
    An easy way to check is to plot the real Y on the X axis and the PREDICTED Y on the Y axis, and see if the points lie on a straight line.
    The closer the points are to a 45-degree line, the better the prediction.
    Use model.PLOT(predictions, real_Y)

    model.plot(predictions, Y)
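
The visual check can be backed by two numbers. The sketch below computes the root mean squared error and R-squared in plain Python; these metrics are an addition for illustration and are not part of the lab's `model.plot`, and the sample values are made up.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error: typical distance from the 45-degree line."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    """Share of the variance in actual explained by predicted (1.0 is perfect)."""
    avg = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical real ratings and model predictions.
actual = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 6.5, 8.5, 8.5]

print(rmse(actual, predicted))                 # 0.5
print(round(r_squared(actual, predicted), 3))  # 0.8
```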
    

    ![model.plot(predictions, Y)](https://img.haomeiwen.com/i13612449/7ebf8cf4ca65efa3.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
    Now use model.COEFFICIENTS with PLOT = TRUE and TOP = 50 (without TOP, too many rows will be output)

    model.coefficients(plot = True, top = 50)
    

    Negative means a BAD influence on RATING. Positive means a GOOD influence.
    Note - in terms of interpretation, BE VERY VERY CAREFUL. VERY CAREFUL.
    The weights are SCALED weights (since we standardised). This means a NEGATIVE doesn't necessarily mean "smaller = better".
    You need to UNSCALE the data to see the real impact of the weights.
    There are also MEAN, STD, RANGE and SCALE columns.
    MEAN is the mean of the column and STD is its standard deviation. (Judging by the VOTES example below, both appear to be in the original units rather than after scaling.)
    RANGE means the original [minimum, .... maximum] of the data. It shows a snapshot of how the column looked.
    SCALE means the scaled [minimum, .... maximum] of the data. It shows a snapshot of how the column looked after it was scaled.

    Machine Learning - Interpretation Traps

    One of the most important things to do is to interpret the results / coefficients. Let's get the top 10.
    Now, VOTES is 0.66. What this means is that, relative to the mean (169808), each standard deviation of VOTES above the mean (i.e. each additional 188668 votes) adds 0.66 to the predicted RATING.
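    (Note to self: it is not clear to me where the 0.66 comes from. Presumably it is the VOTES row of the model.coefficients output.)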
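
The arithmetic behind that reading of the VOTES coefficient can be written out as follows. The three numbers are the ones quoted above; the function name and the assumption that the model is a plain linear model on standardised inputs are mine.

```python
# Values quoted in the text for the VOTES column.
coef_scaled = 0.66    # coefficient on the standardised VOTES column
votes_mean = 169808   # mean of VOTES in original units
votes_std = 188668    # standard deviation of VOTES in original units

def rating_contribution(votes):
    """Contribution of VOTES to the predicted rating for a given vote count."""
    return coef_scaled * (votes - votes_mean) / votes_std

# Unscaled weight: rating points per single extra vote.
coef_per_vote = coef_scaled / votes_std

print(rating_contribution(votes_mean))                   # 0.0
print(round(rating_contribution(votes_mean + votes_std), 2))  # 0.66
print(round(coef_per_vote * 100_000, 2))                 # 0.35 (per 100,000 extra votes)
```

A movie with fewer votes than the mean gets a negative contribution from this term, which is one reason scaled weights should not be read as raw "good" or "bad" labels.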

    Now, director_(Christopher Nolan). Be careful. VERY VERY CAREFUL. It does NOT NECESSARILY mean Christopher Nolan is BAD (since negative). (or is he actually bad?)

    Remember we found that he was one of the HIGHEST rated directors? Then why is the Linear Model providing a negative coefficient?

    sort(director_rating, how = 'descending')[0:10]
    

    Hmmmm? What's going on?
    Say we REMOVED NOLAN from the movies he directed. What would happen to the scores?
    Let's find out! (Next week, we'll continue on Linear Model interpretation - since it's very important).
    Use ANALYSE Column = director_(ChristopherNolan), and PLOT = True

    model.analyse(plot = True, column = 'director_(ChristopherNolan)')
    

    From ANALYSE, we can see that if NOLAN had not directed his 5 films (see N), the overall mean score would actually be REDUCED: the change is -2.17!!! (see Change if Removed).
    So why is the linear model saying he has a negative influence??
    WE'LL DISCUSS THIS NEXT WEEK.

    Article link: https://www.haomeiwen.com/subject/wggkiftx.html