R Language Study Notes - Bivariate Analysis

Author: 侯悍超 | Published 2017-12-02 15:40

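Over the past few days I have been taking Udacity's course on data analysis with R. I had studied R before, but never this systematically, so I am posting today's notes and exercises here. If you are interested, you can follow the link to take the course.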

Lesson 4


Scatterplots and Perceived Audience Size

Notes:


Scatterplots

Notes:

library(ggplot2)
pf <- read.delim('pseudo_facebook.tsv')
qplot(age,friend_count,data=pf)

What are some things that you notice right away?

Response:


ggplot Syntax

Notes:

ggplot(aes(x=age,y=friend_count),data=pf)+
  geom_point()+
  xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).

Overplotting

Notes:

ggplot(aes(x=age,y=friend_count),data=pf)+
  geom_jitter(alpha=1/20)+
  xlim(13,90)
## Warning: Removed 5176 rows containing missing values (geom_point).

What do you notice in the plot?

Response:


Coord_trans()

Notes:

ggplot(aes(x=age,y=friend_count),data=pf)+
  geom_point(alpha=1/20,position=position_jitter(h=0))+
  xlim(13,90)+
  coord_trans(y = "sqrt")
## Warning: Removed 5191 rows containing missing values (geom_point).

Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!
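
One detail in the plot code for this section is worth spelling out: the jitter is restricted to the horizontal direction with position_jitter(h=0). A minimal sketch of the likely reason, using made-up counts (the vector below is hypothetical, not taken from pseudo_facebook.tsv): if vertical jitter pushed a friend_count of 0 below zero, the square-root transform would have no real value to plot.

```r
# Hypothetical friend counts, including a user with zero friends
friend_count <- c(0, 1, 4, 100)

# The square root is well defined for the raw counts
sqrt_raw <- sqrt(friend_count)

# Assumed worst case: vertical jitter shifts every point down by 0.4
jittered <- friend_count - 0.4

# sqrt() of a negative number is NaN (R also emits a warning, silenced here)
sqrt_jittered <- suppressWarnings(sqrt(jittered))

print(sqrt_raw)
print(is.nan(sqrt_jittered))
```

Only the jittered zero turns into NaN, which is why jittering age alone keeps every point on the transformed axis.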

What do you notice?


Alpha and Jitter

Notes:

ggplot(aes(x=age,y=friendships_initiated),data=pf)+
  geom_point(alpha=1/20,position='jitter')


Overplotting and Domain Knowledge

Notes:


Conditional Means

Notes:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
age_groups <- group_by(pf,age)
pf.fc_by_age1 <- summarise(age_groups,
          frd_co_mean=mean(friend_count),
          frd_co_median=median(friend_count),
          n=n())
pf.fc_by_age1 <- arrange(pf.fc_by_age1,age)
head(pf.fc_by_age1)
## # A tibble: 6 x 4
##     age frd_co_mean frd_co_median     n
##   <int>       <dbl>         <dbl> <int>
## 1    13    164.7500          74.0   484
## 2    14    251.3901         132.0  1925
## 3    15    347.6921         161.0  2618
## 4    16    351.9371         171.5  3086
## 5    17    350.3006         156.0  3283
## 6    18    331.1663         162.0  5196
library(dplyr)
pf.fc_by_age2  <- pf %>%
  group_by(age)%>%
  summarise(frd_co_mean=mean(friend_count),
          frd_co_median=median(friend_count),
          n=n())%>%
  arrange(age)
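
As a cross-check on what the grouped summary computes, here is a base-R sketch of the same idea on a tiny made-up data frame. The name `toy` and all of its values are hypothetical; only the column names mirror the ones used above.

```r
# Hypothetical toy data standing in for pf
toy <- data.frame(age = c(13, 13, 14, 14, 14),
                  friend_count = c(10, 30, 5, 15, 100))

# Conditional mean, median and group size of friend_count for each age
frd_co_mean   <- tapply(toy$friend_count, toy$age, mean)
frd_co_median <- tapply(toy$friend_count, toy$age, median)
n             <- tapply(toy$friend_count, toy$age, length)

# Assemble one row per age, already ordered by age
summary_by_age <- data.frame(age = as.integer(names(frd_co_mean)),
                             frd_co_mean = as.vector(frd_co_mean),
                             frd_co_median = as.vector(frd_co_median),
                             n = as.vector(n))
print(summary_by_age)
```

For age 14 the mean (40) sits well above the median (15) because of the single large value, the same pattern visible in the real table.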

Create your plot!

ggplot(aes(x=age,y=frd_co_mean),data=pf.fc_by_age2)+
  geom_line()

Overlaying Summaries with Raw Data

Notes:

ggplot(aes(x=age,y=friend_count),data=pf)+
  geom_point(alpha=1/20,position=position_jitter(h=0),color="orange")+
  xlim(13,90)+
  coord_trans(y = "sqrt")+
  # ggplot2 >= 3.3.0 names this argument fun; older versions used fun.y
  geom_line(stat="summary",fun=mean)+
  # dashed lines: 90th and 10th percentiles of friend_count at each age
  geom_line(stat="summary",fun=quantile,fun.args = list(probs = .9),
            linetype=2,color='blue')+
  geom_line(stat="summary",fun=quantile,fun.args = list(probs = .1),
            linetype=2,color='blue')+
  # solid blue line: the median
  geom_line(stat="summary",fun=quantile,fun.args = list(probs = .5),
            color='blue')
## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 5191 rows containing missing values (geom_point).
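
Each summary line in this plot is one number per age: the mean, or a quantile selected through `probs`. A minimal sketch of those per-age computations on made-up counts (the vector is hypothetical and stands in for the friend counts of users of a single age):

```r
# Hypothetical friend counts for users of one age
friend_count <- seq(0, 100, by = 10)

# 10th, 50th and 90th percentiles, matching the three quantile lines
q <- quantile(friend_count, probs = c(.1, .5, .9))
print(q)

# The mean line is computed the same way, one value per age
m <- mean(friend_count)
print(m)
```

Plotting the 10% and 90% lines around the median shows how wide the middle 80% of users is at each age, which a mean alone would hide.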

ggplot(aes(x=age,y=friend_count),data=pf)+
  geom_point(alpha=1/20,position=position_jitter(h=0),color="orange")+
  # coord_cartesian zooms in without dropping any rows from the summaries
  coord_cartesian(xlim=c(13,70),ylim=c(0,1000))+
  # ggplot2 >= 3.3.0 names this argument fun; older versions used fun.y
  geom_line(stat="summary",fun=mean)+
  geom_line(stat="summary",fun=quantile,fun.args = list(probs = .9),
            linetype=2,color='blue')+
  geom_line(stat="summary",fun=quantile,fun.args = list(probs = .1),
            linetype=2,color='blue')+
  geom_line(stat="summary",fun=quantile,fun.args = list(probs = .5),
            color='blue')

What are some of your observations of the plot?

Response:


Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes:


Correlation

Notes:

cor.test(pf$age,pf$friend_count,method='pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737

Look up the documentation for the cor.test function.
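
For reference, the `cor` estimate that cor.test reports is Pearson's r: the covariance of the two variables divided by the product of their standard deviations. A small sketch on made-up vectors (both hypothetical, not from the Facebook data) that checks the formula against the built-in functions:

```r
# Hypothetical vectors, not from the Facebook data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson's r = cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y, method = "pearson")

# cor.test() reports the same estimate, plus a t statistic and confidence interval
r_test <- unname(cor.test(x, y, method = "pearson")$estimate)

print(c(r_manual, r_builtin, r_test))
```

All three values agree, so cor() is enough when only the coefficient is needed and cor.test() is the one to use when the confidence interval matters.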

What’s the correlation between age and friend count? Round to three decimal places.

Response: -0.027


Correlation on Subsets

Notes:

with(subset(pf,age<=70 & age>=13),cor.test(age, friend_count))
## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1780220 -0.1654129
## sample estimates:
##        cor 
## -0.1717245

Correlation Methods

Notes:


Create Scatterplots

Notes:

with(subset(pf,age<70),cor.test(www_likes_received,likes_received))
## 
##  Pearson's product-moment correlation
## 
## data:  www_likes_received and likes_received
## t = 926.58, df = 90664, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9504188 0.9516623
## sample estimates:
##       cor 
## 0.9510444
ggplot(aes(x=www_likes_received,y=likes_received),data=pf)+
  geom_point()

Strong Correlations

Notes:

ggplot(aes(x=www_likes_received,y=likes_received),data=pf)+
  geom_point(alpha=1/20)+
  xlim(0,quantile(pf$www_likes_received,.95))+
  ylim(0,quantile(pf$likes_received,.95))+
  geom_smooth(method="lm",color='blue')
## Warning: Removed 6075 rows containing non-finite values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).

What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

with(subset(pf,age<70),cor.test(www_likes_received,likes_received))
## 
##  Pearson's product-moment correlation
## 
## data:  www_likes_received and likes_received
## t = 926.58, df = 90664, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9504188 0.9516623
## sample estimates:
##       cor 
## 0.9510444

Response:


Moira on Correlation

Notes:


More Caution with Correlation

Notes:

library(alr3)
## Loading required package: car
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
data(Mitchell)
names(Mitchell)
## [1] "Month" "Temp"
with(data=Mitchell,cor.test(Temp,Month))
## 
##  Pearson's product-moment correlation
## 
## data:  Temp and Month
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

Create your plot!

ggplot(aes(x=Month,y=Temp),data=Mitchell)+
  geom_line()

Noisy Scatterplots

  1. Take a guess for the correlation coefficient for the scatterplot.

  2. What is the actual correlation of the two variables? (Round to the thousandths place)

ggplot(aes(x=(Month%%12),y=Temp),data=Mitchell)+
  geom_point(alpha=0.3)+
  # Month %% 12 only takes the values 0 to 11, so put one tick on each
  scale_x_continuous(breaks=0:11)

Making Sense of Data

Notes:


A New Perspective

What do you notice?

Response:

Watch the solution video and check out the Instructor Notes! Notes:
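
A rough illustration of why a near-zero correlation can coexist with an obvious pattern, using a synthetic series instead of the Mitchell data (the sine wave and the names `month` and `temp` are assumptions made for this sketch): a perfectly periodic signal has almost no linear relationship with a running month index, yet folding the index with %% 12 shows that every position in the year has a fixed value.

```r
# Hypothetical stand-in for a seasonal series: 10 full 12-month cycles
month <- 0:119
temp <- sin(2 * pi * month / 12)

# Linear correlation with the running month index is weak
r_raw <- cor(temp, month)

# Fold months onto their position within the year, as Month %% 12 does
month_in_year <- month %% 12

# Largest spread of temp within any single position of the year
spread_within_month <- max(tapply(temp, month_in_year,
                                  function(v) diff(range(v))))

print(r_raw)
print(spread_within_month)
```

The correlation coefficient only measures linear association, so it says little about a cyclical relationship like this one.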


Understanding Noise: Age to Age Months

Notes:

pf$age_with_months <- pf$age + (1 - pf$dob_month / 12) 
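
To see what this formula does, here is a small worked example on made-up values (the two-row data frame `toy` is hypothetical). Someone born in January gets most of an extra year added, while someone born in December gets nothing added.

```r
# Hypothetical toy data: two users aged 13, born in January (1) and December (12)
toy <- data.frame(age = c(13, 13), dob_month = c(1, 12))

# Same formula as above: the earlier in the year the birth month,
# the larger the fraction of a year that is added
toy$age_with_months <- toy$age + (1 - toy$dob_month / 12)

print(toy)
```

The January user comes out at 13 + 11/12 and the December user at exactly 13, so the new variable takes twelve distinct values per year of age.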

Age with Months Means

pf.fc_by_age_months <- group_by(pf, age_with_months)%>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age_with_months) 
head(pf.fc_by_age_months)
## # A tibble: 6 x 4
##   age_with_months friend_count_mean friend_count_median     n
##             <dbl>             <dbl>               <dbl> <int>
## 1        13.16667          46.33333                30.5     6
## 2        13.25000         115.07143                23.5    14
## 3        13.33333         136.20000                44.0    25
## 4        13.41667         164.24242                72.0    33
## 5        13.50000         131.17778                66.0    45
## 6        13.58333         156.81481                64.0    54

Programming Assignment


Noise in Conditional Means

qplot(x=age_with_months,y=friend_count_mean,
      data=subset(pf.fc_by_age_months,age_with_months<71),
                  geom="line")

Smoothing Conditional Means

Notes:

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
q1 <- ggplot(aes(x=age_with_months,y=friend_count_mean),
      data=subset(pf.fc_by_age_months,age_with_months<71))+
  geom_line()+
  geom_smooth()

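# round(age/5)*5 snaps each age to the nearest multiple of 5, giving 5-year bins
q2<- ggplot(aes(x=round(age/5)*5,y=friend_count),
      data=subset(pf,age<71))+
  # ggplot2 >= 3.3.0 names this argument fun; older versions used fun.y
  geom_line(stat='summary',fun=mean)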

grid.arrange(q1,q2,ncol=1)
## `geom_smooth()` using method = 'loess'
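
The second plot smooths by binning: each age is snapped to the nearest multiple of 5 before the mean is taken. A minimal sketch of that binning step on a few made-up ages (the vector is hypothetical):

```r
# Hypothetical ages, chosen to show which 5-year bin each one lands in
age <- c(13, 17, 18, 22, 23, 68)

# Divide by 5, round to the nearest integer, scale back up
age_bin <- round(age / 5) * 5

print(data.frame(age, age_bin))
```

Wider bins average over more users, so the line is less noisy, but they can also flatten real features that live at a single age.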

Which Plot to Choose?

Notes:


Analyzing Two Variables

Reflection:


Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!
