MARK5826 Week4 lab

Author: GhostintheCode | Published: 2018-08-20 13:06 | Views: 0


CONTENT OPTIMIZATION (Using machine learning)

AIM

Given movie data, find out and predict revenue/ratings, and understand WHAT contributes to the prediction.

Reading Data

Display the data read from the file.

Let's summarise the data again using DESCRIBE

describe(data)


head(data, 5)


Plotting and Tabulating - Tableau

How to use Tableau will be covered in a later post.

Data Cleaning - Natural Language Processing

Remember the genre column that we plotted against years?
Let's investigate it again.

head(data['genre'],10)


Clearly, there's a problem with how we plotted the data. We need to SPLIT the column by the comma!
BUT, before that, remove all SPACES using REMOVE. Why?
Say we have: [apple, banana, hello] and [banana, apple, hello]
If we just split by a comma, then apple in list 1 and apple in list 2 will NOT be matched together, since in the second list it's _apple (with a leading space).

data['genre'] = remove(data['genre'], ' ')
splits = split(data['genre'], ',')
head(splits, 10)
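The course helpers REMOVE and SPLIT are not reproduced here. As a minimal sketch in plain Python, assuming they behave like string replace and split, the example below shows why the spaces have to go first. The genre strings are made up for illustration.

```python
# Hypothetical genre strings; the second one has spaces after the commas.
rows = ["Action,Adventure,Sci-Fi", "Adventure, Action, Sci-Fi"]

# Splitting without removing spaces leaves ' Action' and 'Action' as different labels.
naive = [row.split(",") for row in rows]

# Removing spaces first makes the labels match across rows.
cleaned = [row.replace(" ", "").split(",") for row in rows]

print(set(naive[0]) == set(naive[1]))      # False
print(set(cleaned[0]) == set(cleaned[1]))  # True
```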


tally

genres = tally(splits, multiple = True)
head(genres, 10)
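To make the shape of the tallied table concrete, here is a rough plain-Python sketch of what TALLY with MULTIPLE = TRUE appears to produce: one row per movie, one 0/1 column per genre. The function name `tally_multiple_sketch` and the sample rows are assumptions for illustration, not the course library's code.

```python
def tally_multiple_sketch(splits):
    """Build a 0/1 indicator table: one row per movie, one column per label."""
    labels = sorted({label for row in splits for label in row})
    table = [[1 if label in row else 0 for label in labels] for row in splits]
    return labels, table

# Hypothetical split genre lists for three movies.
splits = [["Action", "Sci-Fi"], ["Horror"], ["Action", "Horror"]]
labels, table = tally_multiple_sketch(splits)

print(labels)  # ['Action', 'Horror', 'Sci-Fi']
for row in table:
    print(row)
```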


Next, let's find the COLUMN SUM using COL_OPERATION and 'sum'

col_operation(genres, 'sum')

Now use PLOT and BARPLOT to plot the genres total

sum_genre = col_operation(genres, 'sum')
plot(x = sum_genre, style = 'barplot')
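As a sketch of what the 'sum' column operation computes (the plotting step is left out), the snippet below sums each column of a small 0/1 table. The table values are hypothetical.

```python
# Hypothetical indicator table: rows are movies, columns are genres.
labels = ["Action", "Horror", "Sci-Fi"]
table = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]

# zip(*table) transposes rows into columns, so each column can be summed.
column_sums = {label: sum(col) for label, col in zip(labels, zip(*table))}

print(column_sums)  # {'Action': 2, 'Horror': 2, 'Sci-Fi': 1}
```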

col-operation

So, after we counted GENRES separately, we want to find how RATING is affected per GENRE.
Now, since the table is just 1s and 0s (for count), we want to MULTIPLY GENRES table with RATING.

M = multiply(genres, data['rating'])
head(M, 10)

Each movie's genre indicators are multiplied by its rating score.

You can see how all the ratings are multiplied onto GENRES. Each non-zero entry in a row equals that movie's rating. But the whole point now is that we want an AVERAGE rating PER GENRE!
So, how do we summarise columns? Use COL_OPERATION!

col_operation(M, 'mean')


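Expectation versus reality: a plain mean also counts the zeros of movies outside the genre, so it understates each genre's average. 'mean_zero' presumably averages only the non-zero entries.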

col_operation(M, 'mean_zero')

plot(col_operation(M, 'mean_zero'), style = 'barplot')
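The difference between 'mean' and 'mean_zero' is easy to miss, so here is a small plain-Python sketch. It assumes 'mean_zero' means "average over the non-zero entries only", which is an interpretation of the name rather than something the lab states. Table and ratings are made up.

```python
# Hypothetical 0/1 genre table (columns: Action, Horror) and movie ratings.
table = [[1, 0], [0, 1], [1, 0]]
ratings = [8.0, 5.0, 6.0]

# MULTIPLY: scale every row of the table by that movie's rating.
M = [[cell * rating for cell in row] for row, rating in zip(table, ratings)]

def col_mean(col):
    """Plain mean: zeros from movies outside the genre drag it down."""
    return sum(col) / len(col)

def col_mean_zero(col):
    """Assumed 'mean_zero': average only the non-zero entries."""
    non_zero = [value for value in col if value != 0]
    return sum(non_zero) / len(non_zero) if non_zero else 0.0

columns = list(zip(*M))
print([col_mean(col) for col in columns])
print([col_mean_zero(col) for col in columns])  # [7.0, 5.0]
```

The Action column holds ratings 8.0 and 6.0 plus one zero, so the plain mean is 14/3 while the non-zero mean is 7.0, the value we actually want.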

Clearly, the WAR genre has the highest average rating, whilst HORROR has the lowest.
Next, we want to analyse DIRECTORS. Do they affect RATINGS??
We can use TALLY to see the count.

head( tally(data['director']) , 10)

Now, we clearly want to use MULTIPLE = TRUE in TALLY. But the biggest problem is that many directors have directed only once. We need to REMOVE all directors who directed, say, fewer than 4 movies.
This is because a one-off shot at directing might not mean anything. But if you directed 4 movies or so, it could be indicative of your "real" rating.
Don't forget to remove all SPACES using REMOVE

data['director'] = remove(data['director'], ' ')
directors = tally(data['director'], multiple = True, min_count = 4)
head(directors, 10)
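A rough sketch of the MIN_COUNT idea in plain Python: count how often each name appears and keep only those at or above the threshold. The director names are placeholders, and the "at least 4" reading follows the text above.

```python
from collections import Counter

# Hypothetical director column: A directed 4 films, B directed 1, C directed 5.
directors = ["DirectorA"] * 4 + ["DirectorB"] + ["DirectorC"] * 5

counts = Counter(directors)
min_count = 4

# Keep only directors with at least min_count films.
kept = sorted(name for name, n in counts.items() if n >= min_count)

print(kept)  # ['DirectorA', 'DirectorC']
```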

Now, use COL_OPERATION after MULTIPLYING RATING, and see who wins!

director_rating = col_operation(multiply(directors, data['rating']), 'mean_zero')
plot(director_rating, style = 'barplot')


And how about ACTORS? We do the same thing as with directors, except that they must have performed in at least 5 films (NOT 4 but 5).
Use SPLIT = ',' then TALLY MULTIPLE = TRUE then MULTIPLY and then COL_OPERATION (mean_zero)
Also remove all SPACES using REMOVE

data['actors'] = remove(data['actors'], ' ')
splits = split(data['actors'], ',')
sample(splits, 5)


How does this differ from the head function?
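On the difference between the two: assuming SAMPLE behaves like random sampling and HEAD like taking the first rows (which the names suggest but the lab does not spell out), a plain-Python comparison looks like this. The rows are placeholders.

```python
import random

# Hypothetical split actor lists, one per movie.
splits = [["a"], ["b"], ["c"], ["d"], ["e"], ["f"]]

# head-like: always the first rows, in order.
first_three = splits[:3]

# sample-like: rows picked at random, so the result changes between runs
# unless the random seed is fixed.
random.seed(0)
random_three = random.sample(splits, 3)

print(first_three)
print(random_three)
```

Sampling is handy when the first few rows are not representative of the whole table.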

actors = tally(splits, multiple = True, min_count = 5)
tail(actors, 5)

The columns in the table are the actors who have appeared in at least five films.

actors_rating = multiply(actors, data['rating'])
actors_mean = col_operation(actors_rating, 'mean_zero')
actors_mean = sort(actors_mean)
actors_mean[0:10]
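The ranking step can be sketched in plain Python as sorting name/score pairs and slicing off the top. The lab does not say which direction SORT uses by default, so descending order is an assumption here, and the actor names and scores are invented.

```python
# Hypothetical mean ratings per actor.
actors_mean = {"ActorA": 6.2, "ActorB": 8.1, "ActorC": 7.4}

# Sort by score, highest first (assumed direction), then take the top entries.
ranked = sorted(actors_mean.items(), key=lambda item: item[1], reverse=True)
top_two = ranked[0:2]

print(top_two)  # [('ActorB', 8.1), ('ActorC', 7.4)]
```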


Machine Learning - Supervised Learning

Finally, we get to make a model to predict RATINGS!!! We want to combine the data we have just made for ACTORS, DIRECTORS and GENRES.
First, we need to CONCATENATE all 3 new data with the original.
Use HCAT.

X = hcat(data, actors, directors, genres)
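HCAT presumably glues tables together side by side, row for row. A minimal plain-Python sketch of that idea follows; `hcat_sketch` and the tiny tables are hypothetical, and the real helper may align rows differently.

```python
from itertools import chain

def hcat_sketch(*tables):
    """Concatenate tables column-wise; every table needs the same row count."""
    n_rows = len(tables[0])
    if any(len(table) != n_rows for table in tables):
        raise ValueError("all tables need the same number of rows")
    return [list(chain.from_iterable(table[i] for table in tables))
            for i in range(n_rows)]

# Hypothetical pieces: original data (year, rating), actor and genre indicators.
data = [[2014, 8.1], [2016, 6.5]]
actors = [[1, 0], [0, 1]]
genres = [[1], [0]]

X = hcat_sketch(data, actors, genres)
print(X)  # [[2014, 8.1, 1, 0, 1], [2016, 6.5, 0, 1, 0]]
```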

Now, we want to CLEAN the data. Machine Learning models require that there are NO missing values.
Also, ID columns (rank in this case) need to be removed.
All TEXT / OBJECT columns must be removed.
Essentially, only NUMBERS can remain.
Use CLEAN

X = clean(X)
help(clean)
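The three cleaning rules listed above can be sketched in plain Python as follows. This is not the course's CLEAN: `clean_sketch` is a hypothetical stand-in, the column names are made up, and dropping rows with missing values (rather than filling them) is an assumption.

```python
def clean_sketch(columns, id_columns=("rank",)):
    """Keep numeric, non-ID columns and drop rows with missing values."""
    numeric = {
        name: values
        for name, values in columns.items()
        if name not in id_columns
        and all(v is None or isinstance(v, (int, float)) for v in values)
    }
    n_rows = len(next(iter(columns.values())))
    # A row survives only if none of its remaining cells is missing.
    keep = [i for i in range(n_rows)
            if all(values[i] is not None for values in numeric.values())]
    return {name: [values[i] for i in keep] for name, values in numeric.items()}

# Hypothetical raw table stored as a dict of columns.
raw = {
    "rank": [1, 2, 3],
    "title": ["A", "B", "C"],
    "votes": [100, None, 300],
    "rating": [7.5, 6.0, 8.1],
}

print(clean_sketch(raw))  # {'votes': [100, 300], 'rating': [7.5, 8.1]}
```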


Now, we want the data to predict RATING.
Save RATING from X as Y using normal column selection, then remove it from X using EXCLUDE.
Also remove METASCORE (it might be influencing the real RATING)

head(X)

Y = X['rating']
X = exclude(X, ['rating','metascore'])
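In plain Python, with the table held as a dict of columns, the same target/feature split might look like the sketch below. The column values are invented. METASCORE is dropped because it is itself a score and could leak information about the rating.

```python
# Hypothetical cleaned table as a dict of columns.
X = {"votes": [100, 300], "metascore": [76, 81], "rating": [7.5, 8.1]}

# The target is selected first, then removed from the features.
Y = X["rating"]
excluded = ["rating", "metascore"]
X = {name: col for name, col in X.items() if name not in excluded}

print(list(X))  # ['votes']
print(Y)        # [7.5, 8.1]
```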

Formula basics

Standardisation

Linear regression

Linear model

Building the model

Prediction
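The steps named above were shown as screenshots in the original post, so the course code is not available here. As a stand-in, this is a minimal one-feature sketch in plain Python: standardise the feature, fit ordinary least squares, then predict. All numbers are invented, and the real lab fits many columns at once.

```python
from statistics import mean, pstdev

# Hypothetical feature (votes) and target (rating).
votes = [100.0, 200.0, 300.0, 400.0]
rating = [5.0, 6.0, 7.0, 8.0]

# Standardisation: subtract the mean, divide by the standard deviation.
mu, sigma = mean(votes), pstdev(votes)
z = [(v - mu) / sigma for v in votes]

# One-feature least squares: slope = cov(z, y) / var(z).
z_bar, y_bar = mean(z), mean(rating)
slope = (sum((a - z_bar) * (b - y_bar) for a, b in zip(z, rating))
         / sum((a - z_bar) ** 2 for a in z))
intercept = y_bar - slope * z_bar

# Prediction on the standardised feature.
predictions = [intercept + slope * a for a in z]
print(slope, intercept)
print(predictions)
```

Because the feature is standardised, the slope reads as "rating change per one standard deviation of votes", and the intercept is the mean rating.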

Now, after we fit a linear model, we want to know whether it was a good model or not.
An easy way to check is to plot the real Y on the x-axis and the PREDICTED Y on the y-axis, and see if the points lie on a straight line.
The closer the points are to a 45 degree line, the better the prediction.
Use model.PLOT(predictions, real_Y)

model.plot(predictions, Y)
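A numeric companion to the 45 degree plot is R squared, which is 1 when every point sits exactly on the line and drops as points scatter away from it. The lab itself only uses the plot, so the sketch below is an addition, with invented values.

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical real ratings and two sets of predictions.
Y = [5.0, 6.0, 7.0, 8.0]
perfect = [5.0, 6.0, 7.0, 8.0]   # lies exactly on the 45 degree line
noisy = [5.5, 5.5, 7.5, 7.5]     # scattered around it

print(r_squared(Y, perfect))
print(r_squared(Y, noisy))
```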

![model.plot(predictions, Y)](https://img.haomeiwen.com/i13612449/7ebf8cf4ca65efa3.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
Next, use model.COEFFICIENTS with PLOT = TRUE and TOP = 50 (without TOP, too many rows will be output)

model.coefficients(plot = True, top = 50)


Negative means a BAD influence on RATING. Positive means a GOOD influence.
Note - in terms of interpretation, BE VERY VERY CAREFUL.
The weights are SCALED weights (since we standardised). This means a NEGATIVE doesn't necessarily mean "smaller = better".
You need to UNSCALE the data to see the weights' impact.
There are also MEAN, STD, RANGE and SCALE columns.
MEAN means the mean of the column after scaling. STD is the standard deviation.
RANGE means the original [minimum, ..., maximum] of the data. It shows a snapshot of how the column looked.
SCALE means the scaled [minimum, ..., maximum] of the data. It shows a snapshot of how the column looked after it was scaled.

Machine Learning - Interpretation Traps

One of the most important things to do is interpreting the results / coefficients. Let's get the top 10.
Now, VOTES is 0.66. What this means is that if VOTES > mean (169808), each standard deviation above the mean (or +188668) contributes 0.66 to the RATING.
(Author's note: it is not clear where the 0.66 comes from; presumably it is the VOTES weight reported by model.coefficients.)
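To get a feel for what UNSCALING does to the VOTES weight, the arithmetic can be sketched as below. It uses the 0.66 and 188668 quoted above and assumes the usual standardisation z = (x - mean) / std, so a weight per standard deviation becomes weight / std per original unit.

```python
# Values quoted in the text above.
coefficient_scaled = 0.66   # rating change per one standard deviation of votes
votes_std = 188668          # one standard deviation, in votes

# Assumed standardisation z = (x - mean) / std, so the per-vote effect is:
per_vote = coefficient_scaled / votes_std
per_100k_votes = per_vote * 100_000

print(round(per_100k_votes, 2))  # 0.35
```

Under that assumption, an extra 100,000 votes corresponds to roughly a third of a rating point.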

Now, director_(Christopher Nolan). Be careful. VERY VERY CAREFUL. It does NOT NECESSARILY mean Christopher Nolan is BAD (since negative). (or is he actually bad?)

Remember we found that he was one of the HIGHEST rated directors? Then why is the Linear Model providing a negative coefficient?

sort(director_rating, how = 'descending')[0:10]


Hmmmm? What's going on?
Say we REMOVED NOLAN from the movies he directed. What'll happen to him?
Let's find out! (Next week, we'll continue on Linear Model interpretation - since it's very important).
Use ANALYSE Column = director_(ChristopherNolan), and PLOT = True

model.analyse(plot = True, column = 'director_(ChristopherNolan)')


From ANALYSE, we can see that if NOLAN had not directed these 5 films (see N), the overall mean score would actually change by -2.17, i.e. it would REDUCE (see Change if Removed).
So why is the linear model saying he has a negative influence??
WE'LL DISCUSS THIS NEXT WEEK.


Article link: https://www.haomeiwen.com/subject/wggkiftx.html
