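I've spent the past few days on Udacity learning data analysis with R. I had studied it before, but never this systematically, so I'm posting today's lesson notes and exercises here. Anyone interested can follow the link to take the course.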
Lesson 4
Scatterplots and Perceived Audience Size
Notes:
Scatterplots
Notes:
library(ggplot2)
pf <- read.delim('pseudo_facebook.tsv')
qplot(age,friend_count,data=pf)
image.png
What are some things that you notice right away?
Response:
ggplot Syntax
Notes:
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point() +
  xlim(13, 90)
## Warning: Removed 4906 rows containing missing values (geom_point).
image.png
Overplotting
Notes:
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_jitter(alpha = 1/20) +
  xlim(13, 90)
## Warning: Removed 5176 rows containing missing values (geom_point).
image.png
What do you notice in the plot?
Response:
coord_trans()
Notes:
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point(alpha = 1/20, position = position_jitter(h = 0)) +
  xlim(13, 90) +
  coord_trans(y = "sqrt")
## Warning: Removed 5191 rows containing missing values (geom_point).
image.png
Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!
What do you notice?
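A small base R sketch of why the choice of layer matters here. coord_trans() rescales the axis after any summaries have been computed, so a mean line still shows the mean of the raw counts. A scale transformation such as scale_y_sqrt() transforms the values first, so summaries are then computed on the square-root scale. The friend counts below are simulated (an assumption for illustration, not pseudo_facebook.tsv).

```r
# Simulated, right-skewed "friend counts" (illustrative only; not the course data).
set.seed(1)
friend_count <- rexp(1000, rate = 1 / 200)

# coord_trans(y = "sqrt"): the mean is computed on raw values, then drawn on a sqrt axis.
mean_raw <- mean(friend_count)

# Scale transformation: values are square-rooted first, then averaged.
# Squaring puts the result back on the original scale for comparison.
mean_on_sqrt_scale <- mean(sqrt(friend_count))^2

c(mean_raw = mean_raw, mean_on_sqrt_scale = mean_on_sqrt_scale)
```

The second number is always the smaller one for non-constant data, which is why the two approaches can draw visibly different summary lines on skewed counts.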
Alpha and Jitter
Notes:
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
  geom_point(alpha = 1/20, position = 'jitter')
image.png
Overplotting and Domain Knowledge
Notes:
Conditional Means
Notes:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
age_groups <- group_by(pf, age)
pf.fc_by_age1 <- summarise(age_groups,
                           frd_co_mean = mean(friend_count),
                           frd_co_median = median(friend_count),
                           n = n())
pf.fc_by_age1 <- arrange(pf.fc_by_age1, age)
head(pf.fc_by_age1)
## # A tibble: 6 x 4
## age frd_co_mean frd_co_median n
## <int> <dbl> <dbl> <int>
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
library(dplyr)
# Same summary as above, written as a %>% pipeline
pf.fc_by_age2 <- pf %>%
  group_by(age) %>%
  summarise(frd_co_mean = mean(friend_count),
            frd_co_median = median(friend_count),
            n = n()) %>%
  arrange(age)
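As a cross-check on the grouped summary, here is a minimal base R sketch that computes the same three columns with tapply() and table(). The data frame `toy` and its values are made up for illustration; only the column names follow the code above.

```r
# Tiny made-up data frame standing in for pf (hypothetical values).
toy <- data.frame(
  age          = c(13, 13, 14, 14, 14),
  friend_count = c(10, 30, 5, 15, 100)
)

# Mean, median and group size per age, one row per age.
# tapply() and table() both order groups by increasing age.
by_age <- data.frame(
  age           = sort(unique(toy$age)),
  frd_co_mean   = as.vector(tapply(toy$friend_count, toy$age, mean)),
  frd_co_median = as.vector(tapply(toy$friend_count, toy$age, median)),
  n             = as.vector(table(toy$age))
)
by_age
```

For age 14 the mean (40) sits well above the median (15) because of the single large value, the same pattern visible in the real table above.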
Create your plot!
ggplot(aes(x = age, y = frd_co_mean), data = pf.fc_by_age2) +
  geom_line()
image.png
Overlaying Summaries with Raw Data
Notes:
ggplot(aes(x = age, y = friend_count), data = pf) +
  # raw data: faint orange points, jittered horizontally only
  geom_point(alpha = 1/20, position = position_jitter(h = 0), color = "orange") +
  xlim(13, 90) +
  coord_trans(y = "sqrt") +
  # conditional mean (black line)
  geom_line(stat = "summary", fun.y = mean) +
  # 90th and 10th percentiles (dashed) and median (solid), all in blue
  geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = .9),
            linetype = 2, color = 'blue') +
  geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = .1),
            linetype = 2, color = 'blue') +
  geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = .5),
            color = 'blue')
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 5191 rows containing missing values (geom_point).
image.png
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point(alpha = 1/20, position = position_jitter(h = 0), color = "orange") +
  # zoom in without dropping rows, so the summaries still use all of the data
  coord_cartesian(xlim = c(13, 70), ylim = c(0, 1000)) +
  geom_line(stat = "summary", fun.y = mean) +
  geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = .9),
            linetype = 2, color = 'blue') +
  geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = .1),
            linetype = 2, color = 'blue') +
  geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = .5),
            color = 'blue')
image.png
What are some of your observations of the plot?
Response:
Moira: Histogram Summary and Scatterplot
See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.
Notes:
Correlation
Notes:
cor.test(pf$age,pf$friend_count,method='pearson')
##
## Pearson's product-moment correlation
##
## data: pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
Look up the documentation for the cor.test function.
What’s the correlation between age and friend count? Round to three decimal places.
Response: -0.027
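Rather than reading the estimate off the printed output, it can be pulled out of the object that cor.test() returns. A minimal sketch, using simulated stand-ins for the two columns (assumed data, not the course file):

```r
# Simulated stand-ins for age and friend_count (not the course data).
set.seed(42)
age <- sample(13:90, 500, replace = TRUE)
friend_count <- rpois(500, lambda = 150)

ct <- cor.test(age, friend_count, method = "pearson")

# ct$estimate is a named numeric ("cor"); unname() drops the label before rounding.
r <- round(unname(ct$estimate), 3)
r
```

The same pattern works on the real call above by assigning its result to a variable first.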
Correlation on Subsets
Notes:
with(subset(pf,age<=70 & age>=13),cor.test(age, friend_count))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1780220 -0.1654129
## sample estimates:
## cor
## -0.1717245
Correlation Methods
Notes:
Create Scatterplots
Notes:
with(subset(pf,age<70),cor.test(www_likes_received,likes_received))
##
## Pearson's product-moment correlation
##
## data: www_likes_received and likes_received
## t = 926.58, df = 90664, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9504188 0.9516623
## sample estimates:
## cor
## 0.9510444
ggplot(aes(x=www_likes_received,y=likes_received),data=pf)+
geom_point()
image.png
Strong Correlations
Notes:
ggplot(aes(x=www_likes_received,y=likes_received),data=pf)+
geom_point(alpha=1/20)+
xlim(0,quantile(pf$www_likes_received,.95))+
ylim(0,quantile(pf$likes_received,.95))+
geom_smooth(method="lm",color='blue')
## Warning: Removed 6075 rows containing non-finite values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).
image.png
What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
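# Note: subset(age < 70) is carried over from the earlier chunk; the question does not ask for an age filter.
with(subset(pf, age < 70), cor.test(www_likes_received, likes_received))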
##
## Pearson's product-moment correlation
##
## data: www_likes_received and likes_received
## t = 926.58, df = 90664, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9504188 0.9516623
## sample estimates:
## cor
## 0.9510444
Response:
Moira on Correlation
Notes:
More Caution with Correlation
Notes:
library(alr3)
## Loading required package: car
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
data(Mitchell)
names(Mitchell)
## [1] "Month" "Temp"
with(data=Mitchell,cor.test(Temp,Month))
##
## Pearson's product-moment correlation
##
## data: Temp and Month
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08053637 0.19331562
## sample estimates:
## cor
## 0.05747063
Create your plot!
ggplot(aes(x=Month,y=Temp),data=Mitchell)+
geom_line()
image.png
Noisy Scatterplots
- Take a guess for the correlation coefficient for the scatterplot.
- What is the actual correlation of the two variables? (Round to the thousandths place)
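# Month %% 12 folds the 204 months onto 0-11, so put one tick on each month of the year
ggplot(aes(x = (Month %% 12), y = Temp), data = Mitchell) +
  geom_point(alpha = 0.3) +
  scale_x_continuous(breaks = seq(0, 11, 1))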
image.png
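The Mitchell result (r = 0.057) shows that a near-zero correlation coefficient does not mean there is no relationship. A minimal sketch with a synthetic series, assumed purely for illustration: temperature is an exact sine function of the month, yet the linear correlation with the running month index is tiny.

```r
# Synthetic monthly temperatures over 17 years (204 months); not the Mitchell data.
month <- 0:203
temp <- 10 + 12 * sin(2 * pi * month / 12)

# Linear correlation with the running month index is close to zero...
r_linear <- cor(month, temp)

# ...even though month-of-year determines the temperature exactly in this toy series:
# the spread of temp within each month-of-year group is essentially zero.
spread_by_month <- tapply(temp, month %% 12, function(v) diff(range(v)))

r_linear
max(spread_by_month)
```

Folding the x axis with `%% 12`, as in the plot above, is what makes the cyclical pattern visible.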
Making Sense of Data
Notes:
A New Perspective
What do you notice?
Response:
Watch the solution video and check out the Instructor Notes!
Notes:
Understanding Noise: Age to Age Months
Notes:
pf$age_with_months <- pf$age + (1 - pf$dob_month / 12)
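A worked example of the formula, wrapped in a helper function whose name is hypothetical. Under this formula a December birthday adds 0 to the age and a January birthday adds 11/12, which presumably reflects ages measured at the end of the year (an assumption; the notes do not say).

```r
# Hypothetical helper; the formula is the one used in the line above.
age_with_months <- function(age, dob_month) {
  age + (1 - dob_month / 12)
}

age_with_months(36, 3)    # born in March, age 36
age_with_months(13, 10)   # born in October, age 13
```

The first call gives 36.75. The second gives 13 + 2/12, about 13.16667, which is the smallest age_with_months value in the summary table below.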
Age with Months Means
pf.fc_by_age_months <- group_by(pf, age_with_months) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age_with_months)
head(pf.fc_by_age_months)
## # A tibble: 6 x 4
## age_with_months friend_count_mean friend_count_median n
## <dbl> <dbl> <dbl> <int>
## 1 13.16667 46.33333 30.5 6
## 2 13.25000 115.07143 23.5 14
## 3 13.33333 136.20000 44.0 25
## 4 13.41667 164.24242 72.0 33
## 5 13.50000 131.17778 66.0 45
## 6 13.58333 156.81481 64.0 54
Programming Assignment
Noise in Conditional Means
qplot(x = age_with_months, y = friend_count_mean,
      data = subset(pf.fc_by_age_months, age_with_months < 71),
      geom = "line")
image.png
Smoothing Conditional Means
Notes:
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
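# q1: monthly conditional means with a smoother on top
q1 <- ggplot(aes(x = age_with_months, y = friend_count_mean),
             data = subset(pf.fc_by_age_months, age_with_months < 71)) +
  geom_line() +
  geom_smooth()
# q2: mean friend count after rounding age to the nearest multiple of 5
q2 <- ggplot(aes(x = round(age / 5) * 5, y = friend_count),
             data = subset(pf, age < 71)) +
  geom_line(stat = 'summary', fun.y = mean)
grid.arrange(q1, q2, ncol = 1)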
## `geom_smooth()` using method = 'loess'
image.png
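To see what the `round(age / 5) * 5` expression in q2 does, here is a short sketch that applies it to a run of ages. Each age is mapped to the nearest multiple of 5, so the second plot averages over 5-year bins: less noise, but less detail.

```r
# Map each age to the nearest multiple of 5, as in q2 above.
ages <- 13:23
bins <- round(ages / 5) * 5

data.frame(age = ages, bin = bins)
```

Ages 13 to 17 land in the bin labelled 15, ages 18 to 22 in the bin labelled 20, and 23 starts the next bin at 25.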
Which Plot to Choose?
Notes:
Analyzing Two Variables
Reflection:
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!