R Language Study Notes - Bivariate Analysis

Author: 侯悍超 | Published: 2017-12-02 15:40

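    Over the past few days I have been taking Udacity's course on data analysis with R. I had studied R before, but never this systematically, so I am posting today's notes and exercises here. Interested readers can follow the link to take the course.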

    Lesson 4


    Scatterplots and Perceived Audience Size

    Notes:


    Scatterplots

    Notes:

    library(ggplot2)
    pf <- read.delim('pseudo_facebook.tsv')
    qplot(age,friend_count,data=pf)
    
    image.png

    What are some things that you notice right away?

    Response:


    ggplot Syntax

    Notes:

    ggplot(aes(x=age,y=friend_count),data=pf)+
      geom_point()+
      xlim(13,90)
    
    ## Warning: Removed 4906 rows containing missing values (geom_point).
    
    image.png

    Overplotting

    Notes:

    ggplot(aes(x=age,y=friend_count),data=pf)+
      geom_jitter(alpha=1/20)+
      xlim(13,90)
    
    ## Warning: Removed 5176 rows containing missing values (geom_point).
    
    image.png
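Ages are recorded as whole numbers, so the raw scatterplot stacks points into vertical columns. Jittering adds a little random noise so the columns spread out, and `alpha=1/20` means roughly 20 overlapping points are needed to reach full opacity. The sketch below imitates the idea in base R on synthetic ages (the data and the ±0.4 offset are assumptions for illustration; 0.4 mirrors the documented default of 40% of the data resolution, but this is not ggplot2's internal code).

```r
# Synthetic stand-in for the integer-valued age column (not the real data).
set.seed(1)
age <- sample(13:90, 1000, replace = TRUE)

# Without jitter there are at most 78 distinct x positions for 1000 points.
length(unique(age))

# Adding uniform noise of up to +/- 0.4 spreads each column of points out.
age_jittered <- age + runif(length(age), -0.4, 0.4)
length(unique(age_jittered))

# The noise is small enough that rounding recovers the original age.
all(round(age_jittered) == age)
```

Because the offsets stay below 0.5, neighbouring ages never swap places; the plot only looks less striped.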

    What do you notice in the plot?

    Response:


    Coord_trans()

    Notes:

    ggplot(aes(x=age,y=friend_count),data=pf)+
      geom_point(alpha=1/20,position=position_jitter(h=0))+
      xlim(13,90)+
      coord_trans(y = "sqrt")
    
    ## Warning: Removed 5191 rows containing missing values (geom_point).
    
    image.png

    Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

    What do you notice?
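One detail in the code above is `position_jitter(h=0)`, which jitters only horizontally. A likely reason is that vertical noise could push a friend count of 0 below zero, and the square root of a negative number is undefined. A minimal base R sketch, using made-up counts and hand-picked offsets so the result is reproducible:

```r
# Hypothetical friend counts, including users with zero friends.
friend_count <- c(0, 0, 1, 4, 100)

# Hand-picked vertical "jitter" offsets (illustrative, not random).
offset <- c(-0.3, 0.2, -0.1, 0.3, -0.4)
noisy <- friend_count + offset

# sqrt() of a negative number is NaN, so the first point could not be drawn.
transformed <- suppressWarnings(sqrt(noisy))
is.nan(transformed)

# With no vertical jitter the counts stay untouched and sqrt() is always defined.
sqrt(friend_count)
```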


    Alpha and Jitter

    Notes:

    ggplot(aes(x=age,y=friendships_initiated),data=pf)+
      geom_point(alpha=1/20,position='jitter')
    
    image.png


    Overplotting and Domain Knowledge

    Notes:


    Conditional Means

    Notes:

    library(dplyr)
    
    ## 
    ## Attaching package: 'dplyr'
    
    ## The following objects are masked from 'package:stats':
    ## 
    ##     filter, lag
    
    ## The following objects are masked from 'package:base':
    ## 
    ##     intersect, setdiff, setequal, union
    
    age_groups <- group_by(pf,age)
    pf.fc_by_age1 <- summarise(age_groups,
              frd_co_mean=mean(friend_count),
              frd_co_median=median(friend_count),
              n=n())
    pf.fc_by_age1 <- arrange(pf.fc_by_age1,age)
    head(pf.fc_by_age1)
    
    ## # A tibble: 6 x 4
    ##     age frd_co_mean frd_co_median     n
    ##   <int>       <dbl>         <dbl> <int>
    ## 1    13    164.7500          74.0   484
    ## 2    14    251.3901         132.0  1925
    ## 3    15    347.6921         161.0  2618
    ## 4    16    351.9371         171.5  3086
    ## 5    17    350.3006         156.0  3283
    ## 6    18    331.1663         162.0  5196
    
    library(dplyr)
    pf.fc_by_age2  <- pf %>%
      group_by(age)%>%
      summarise(frd_co_mean=mean(friend_count),
              frd_co_median=median(friend_count),
              n=n())%>%
      arrange(age)
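
Both versions build the same table: `pf.fc_by_age1` with nested function calls and `pf.fc_by_age2` with `%>%` chaining. To see what `group_by()` plus `summarise()` computes, the same conditional statistics can be cross-checked with base R's `tapply()`. This is a swapped-in technique on a tiny made-up data frame, not the author's code:

```r
# Tiny made-up data frame standing in for pf (values are illustrative only).
toy <- data.frame(
  age          = c(13, 13, 13, 14, 14),
  friend_count = c(10, 20, 600, 5, 15)
)

# Conditional mean, median and group size, one row per age.
by_age <- data.frame(
  age           = sort(unique(toy$age)),
  frd_co_mean   = as.vector(tapply(toy$friend_count, toy$age, mean)),
  frd_co_median = as.vector(tapply(toy$friend_count, toy$age, median)),
  n             = as.vector(tapply(toy$friend_count, toy$age, length))
)
by_age
```

In the toy data a single user with 600 friends pulls the mean for age 13 far above the median, the same pattern visible in the real output above, where every mean exceeds its median.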
    

    Create your plot!

    ggplot(aes(x=age,y=frd_co_mean),data=pf.fc_by_age2)+
      geom_line()
    
    image.png

    Overlaying Summaries with Raw Data

    Notes:

    ggplot(aes(x=age,y=friend_count),data=pf)+
      geom_point(alpha=1/20,position=position_jitter(h=0),color="orange")+
      xlim(13,90)+
      coord_trans(y = "sqrt")+
      # Note: ggplot2 3.3.0 renamed fun.y to fun; newer versions warn about fun.y.
      geom_line(stat="summary",fun.y=mean)+
      geom_line(stat="summary",fun.y=quantile,fun.args = list(probs = .9),
                linetype=2,color='blue')+
      geom_line(stat="summary",fun.y=quantile,fun.args = list(probs = .1),
                linetype=2,color='blue')+
      geom_line(stat="summary",fun.y=quantile,fun.args = list(probs = .5),
                color='blue')
    
    ## Warning: Removed 4906 rows containing non-finite values (stat_summary).
    
    ## Warning: Removed 4906 rows containing non-finite values (stat_summary).
    
    ## Warning: Removed 4906 rows containing non-finite values (stat_summary).
    
    ## Warning: Removed 4906 rows containing non-finite values (stat_summary).
    
    ## Warning: Removed 5191 rows containing missing values (geom_point).
    
    image.png

    ggplot(aes(x=age,y=friend_count),data=pf)+
      geom_point(alpha=1/20,position=position_jitter(h=0),color="orange")+
      coord_cartesian(xlim=c(13,70),ylim=c(0,1000))+
      geom_line(stat="summary",fun.y=mean)+
      geom_line(stat="summary",fun.y=quantile,fun.args = list(probs = .9),
                linetype=2,color='blue')+
      geom_line(stat="summary",fun.y=quantile,fun.args = list(probs = .1),
                linetype=2,color='blue')+
      geom_line(stat="summary",fun.y=quantile,fun.args = list(probs = .5),
                color='blue')
    
    image.png
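
The second plot swaps `xlim()` for `coord_cartesian()`, and the warnings disappear. Scale limits such as `xlim()` and `ylim()` discard out-of-range rows before any summary is computed, while `coord_cartesian()` only zooms the view. For a y limit this changes the summary lines themselves. A minimal base R sketch of the difference, on made-up numbers:

```r
# Made-up friend counts for one age group, with one very popular user.
friend_count <- c(50, 100, 150, 4900)

# coord_cartesian(ylim = c(0, 1000)) only zooms: the summary still uses every row.
mean_zoomed <- mean(friend_count)

# ylim(0, 1000) would discard the out-of-range row first, then summarise.
mean_limited <- mean(friend_count[friend_count <= 1000])

c(zoomed = mean_zoomed, limited = mean_limited)
```

This is why zooming is usually the safer choice when a plot overlays means or quantiles on raw data.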

    What are some of your observations of the plot?

    Response:


    Moira: Histogram Summary and Scatterplot

    See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

    Notes:


    Correlation

    Notes:

    cor.test(pf$age,pf$friend_count,method='pearson')
    
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  pf$age and pf$friend_count
    ## t = -8.6268, df = 99001, p-value < 2.2e-16
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  -0.03363072 -0.02118189
    ## sample estimates:
    ##         cor 
    ## -0.02740737
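
The output pairs a tiny correlation (about -0.027) with a very small p-value. Both follow from the definition: Pearson's r is the covariance scaled by the two standard deviations, and the t statistic grows with the sample size, t = r * sqrt(df / (1 - r^2)). With r = -0.0274 and df = 99001 this gives about -8.63, matching the output. The sketch below checks both formulas against `cor.test()` on synthetic data (the variables `x` and `y` are assumptions, not columns of `pf`):

```r
set.seed(42)
x <- rnorm(200)
y <- -0.5 * x + rnorm(200)

# Pearson's r: co-variation divided by the product of the two spreads.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The t statistic reported by cor.test() follows from r and the sample size.
n <- length(x)
t_manual <- r_manual * sqrt((n - 2) / (1 - r_manual^2))

res <- cor.test(x, y, method = "pearson")
all.equal(r_manual, unname(res$estimate))
all.equal(t_manual, unname(res$statistic))
```

So a significant p-value here says only that the correlation is unlikely to be exactly zero, not that it is strong.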
    

    Look up the documentation for the cor.test function.

    What’s the correlation between age and friend count? Round to three decimal places.

    Response:


    Correlation on Subsets

    Notes:

    with(subset(pf,age<=70 & age>=13),cor.test(age, friend_count))
    
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  age and friend_count
    ## t = -52.592, df = 91029, p-value < 2.2e-16
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  -0.1780220 -0.1654129
    ## sample estimates:
    ##        cor 
    ## -0.1717245
    

    Correlation Methods

    Notes:
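
Besides `method='pearson'`, used above, `cor.test()` and `cor()` also accept `"spearman"` and `"kendall"`, which work on ranks and therefore measure monotonic rather than strictly linear association. A small sketch on made-up data where the two disagree:

```r
x <- 1:20
y <- exp(x)   # strictly increasing, but far from a straight line

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
c(pearson = pearson, spearman = spearman)
```

Here the ranks of `x` and `y` agree perfectly, so the Spearman coefficient is 1, while the Pearson coefficient is clearly lower because the relationship is not linear.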


    Create Scatterplots

    Notes:

    with(subset(pf,age<70),cor.test(www_likes_received,likes_received))
    
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  www_likes_received and likes_received
    ## t = 926.58, df = 90664, p-value < 2.2e-16
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  0.9504188 0.9516623
    ## sample estimates:
    ##       cor 
    ## 0.9510444
    
    ggplot(aes(x=www_likes_received,y=likes_received),data=pf)+
      geom_point()
    
    image.png

    Strong Correlations

    Notes:

    ggplot(aes(x=www_likes_received,y=likes_received),data=pf)+
      geom_point(alpha=1/20)+
      xlim(0,quantile(pf$www_likes_received,.95))+
      ylim(0,quantile(pf$likes_received,.95))+
      geom_smooth(method="lm",color='blue')
    
    ## Warning: Removed 6075 rows containing non-finite values (stat_smooth).
    
    ## Warning: Removed 6075 rows containing missing values (geom_point).
    
    image.png
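
The axis limits above use `quantile(..., .95)` to cut off the long right tail. Note that the warnings report 6075 removed rows, about 6.1% of the 99003 users rather than 5%, because a row is dropped if it exceeds either limit. The sketch below shows the cutoff idea for a single variable on synthetic, right-skewed counts (the `likes` vector is an assumption, not the real column):

```r
set.seed(7)
# Synthetic, heavily right-skewed "likes" counts (not the real column).
likes <- round(rexp(10000, rate = 1 / 50))

# Value below which about 95% of the users fall.
cutoff <- quantile(likes, 0.95)

# Share of rows that would remain inside an axis limit of c(0, cutoff).
share_kept <- mean(likes <= cutoff)
c(cutoff = unname(cutoff), max = max(likes), share_kept = share_kept)
```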

    What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

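    # Note: subset(pf,age<70) keeps only users younger than 70;
    # dropping the subset would compute the correlation on all rows.
    with(subset(pf,age<70),cor.test(www_likes_received,likes_received))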
    
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  www_likes_received and likes_received
    ## t = 926.58, df = 90664, p-value < 2.2e-16
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  0.9504188 0.9516623
    ## sample estimates:
    ##       cor 
    ## 0.9510444
    

    Response:


    Moira on Correlation

    Notes:


    More Caution with Correlation

    Notes:

    library(alr3)
    
    ## Loading required package: car
    
    ## 
    ## Attaching package: 'car'
    
    ## The following object is masked from 'package:dplyr':
    ## 
    ##     recode
    
    data(Mitchell)
    names(Mitchell)
    
    ## [1] "Month" "Temp"
    
    with(data=Mitchell,cor.test(Temp,Month))
    
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  Temp and Month
    ## t = 0.81816, df = 202, p-value = 0.4142
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  -0.08053637  0.19331562
    ## sample estimates:
    ##        cor 
    ## 0.05747063
    

    Create your plot!

    ggplot(aes(x=Month,y=Temp),data=Mitchell)+
      geom_line()
    
    image.png

    Noisy Scatterplots

    1. Take a guess for the correlation coefficient for the scatterplot.

    2. What is the actual correlation of the two variables? (Round to the thousandths place)

    ggplot(aes(x=(Month%%12),y=Temp),data=Mitchell)+
      geom_point(alpha=0.3)+
      scale_x_continuous(breaks=seq(0,11,1))  # Month%%12 only takes the values 0 to 11
    
    image.png
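
The correlation of 0.057 computed earlier suggests almost no relationship between `Month` and `Temp`, yet folding the months with `%%12` reveals a clear yearly cycle. The sketch below reproduces the effect on a synthetic seasonal series (the cosine shape, amplitude and noise level are assumptions, not the Mitchell data):

```r
# Synthetic seasonal series: 17 years of monthly values (not the Mitchell data).
month <- 0:203
set.seed(3)
temp <- -10 * cos(2 * pi * month / 12) + rnorm(length(month))

# Over the whole record the linear correlation is close to zero...
r_linear <- cor(month, temp)

# ...but folding the months onto one calendar year exposes the cycle.
month_of_year <- month %% 12
seasonal_mean <- tapply(temp, month_of_year, mean)

r_linear
round(seasonal_mean, 1)
```

A correlation coefficient measures linear association only, so a strong periodic pattern can hide behind a value near zero.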

    Making Sense of Data

    Notes:


    A New Perspective

    What do you notice?

    Response:

    Watch the solution video and check out the Instructor Notes!

    Notes:


    Understanding Noise: Age to Age Months

    Notes:

    pf$age_with_months <- pf$age + (1 - pf$dob_month / 12) 
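
The formula reads as if age were measured at the end of the year: someone born in December has had no extra months since their birthday, while someone born in January has had eleven. A worked example with hypothetical users who all report the same age:

```r
# Hypothetical users: same reported age, different birth months.
age       <- c(36, 36, 36)
dob_month <- c(1, 3, 12)      # January, March, December

age_with_months <- age + (1 - dob_month / 12)
age_with_months
```

The three values come out as 36 + 11/12, 36.75 and 36. The same arithmetic explains the smallest value in the table below, 13.16667, which is 13 + 2/12 and so corresponds to a birth month of 10.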
    

    Age with Months Means

    pf.fc_by_age_months <- group_by(pf, age_with_months)%>%
      summarise(friend_count_mean = mean(friend_count),
                friend_count_median = median(friend_count),
                n = n()) %>%
      arrange(age_with_months) 
    head(pf.fc_by_age_months)
    
    ## # A tibble: 6 x 4
    ##   age_with_months friend_count_mean friend_count_median     n
    ##             <dbl>             <dbl>               <dbl> <int>
    ## 1        13.16667          46.33333                30.5     6
    ## 2        13.25000         115.07143                23.5    14
    ## 3        13.33333         136.20000                44.0    25
    ## 4        13.41667         164.24242                72.0    33
    ## 5        13.50000         131.17778                66.0    45
    ## 6        13.58333         156.81481                64.0    54
    

    Programming Assignment


    Noise in Conditional Means

    qplot(x=age_with_months,y=friend_count_mean,
          data=subset(pf.fc_by_age_months,age_with_months<71),
          geom="line")
    
    image.png

    Smoothing Conditional Means

    Notes:

    library(gridExtra)
    
    ## 
    ## Attaching package: 'gridExtra'
    
    ## The following object is masked from 'package:dplyr':
    ## 
    ##     combine
    
    q1 <- ggplot(aes(x=age_with_months,y=friend_count_mean),
          data=subset(pf.fc_by_age_months,age_with_months<71))+
      geom_line()+
      geom_smooth()
    
    q2<- ggplot(aes(x=round(age/5)*5,y=friend_count),
          data=subset(pf,age<71))+
      geom_line(stat='summary',fun.y=mean)
    
    grid.arrange(q1,q2,ncol=1)
    
    ## `geom_smooth()` using method = 'loess'
    
    image.png
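
The second panel bins ages with `round(age/5)*5`, so each mean is computed over about five years of users. Wider bins give smoother but coarser lines: the monthly groups above hold as few as 6 users, the yearly groups hold hundreds or thousands. A quick sketch of how the binning expression maps individual ages:

```r
# Ages 13 to 24 mapped to the nearest multiple of 5.
age <- 13:24
age_bin <- round(age / 5) * 5

# How many distinct ages land in each bin.
bin_counts <- table(age_bin)
bin_counts
```

Ages 13 to 17 fall into bin 15 and ages 18 to 22 into bin 20, while 23 and 24 start bin 25. The lowest bin in the real plot is therefore slightly lopsided, since no users are younger than 13.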

    Which Plot to Choose?

    Notes:


    Analyzing Two Variables

    Reflection:


    Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!
