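Today I studied data analysis with R on Udacity. I had learned R before, but never this systematically, so I am posting today's work and exercises here. Anyone interested can follow the link to take the course.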
Lesson 3
What to Do First?
Notes:
Pseudo-Facebook User Data
Notes:
getwd()
## [1] "C:/Users/HH/Desktop/R Data analyst"
list.files()
## [1] "07-tidy-data.pdf" "demystifying.R"
## [3] "demystifyingR2_v3.html" "demystifyingR2_v3.Rmd"
## [5] "EDA_Course_Materials.zip" "lesson3_student.html"
## [7] "lesson3_student.rmd" "pseudo_facebook.tsv"
## [9] "reddit.csv" "stateData.csv"
## [11] "tidy-data.pdf"
pf<-read.delim('pseudo_facebook.tsv')
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
Histogram of Users’ Birthdays
Notes:
library(ggplot2)
qplot(x=dob_day,data=pf)+
scale_x_continuous(breaks=1:31)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What are some things that you notice about this histogram?
Response: It is unusual that so many users were born on the 1st of the month.
Moira’s Investigation
Notes:
Estimating Your Audience Size
Notes:
Think about a time when you posted a specific message or shared a photo on Facebook. What was it?
Response:
How many of your friends do you think saw that post?
Response:
Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?
Response:
Perceived Audience Size
Notes:
Faceting
Notes:
qplot(x=dob_day,data=pf)+
scale_x_continuous(breaks=1:31)+
facet_wrap(~dob_month,ncol=3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s take another look at our plot. What stands out to you here?
Response:
Be Skeptical - Outliers and Anomalies
Notes:
Moira’s Outlier
Notes:
Which case do you think applies to Moira’s outlier?
Response:
Friend Count
Notes:
What code would you enter to create a histogram of friend counts?
qplot(friend_count,data=pf)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
How is this plot similar to Moira’s first plot?
Response:
Limiting the Axes
Notes:
qplot(friend_count,data=pf)+
scale_x_continuous(limits=c(0,1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
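The "Removed 2951 rows" warning appears because `scale_x_continuous(limits = ...)` treats values outside the limits as missing before the histogram is binned. Here is a minimal sketch of that bookkeeping in base R. The vector is made up for illustration and is not taken from `pseudo_facebook.tsv`.

```r
# Hypothetical friend counts (made up, not from the real data set)
friend_count <- c(0, 12, 250, 999, 1000, 1001, 2500, 4923)
limits <- c(0, 1000)

# Values outside the limits are the ones a continuous scale would discard
outside <- friend_count < limits[1] | friend_count > limits[2]
n_removed <- sum(outside)
kept <- friend_count[!outside]

print(n_removed)  # 3 values lie above 1000
print(kept)
```

On the real data, `sum(pf$friend_count > 1000)` should give the number reported in the warning, assuming there are no missing or negative friend counts.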
Exploring with Bin Width
Notes:
Adjusting the Bin Width
Notes:
Faceting Friend Count
# What code would you add to facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50))+
facet_wrap(~gender)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
Omitting NA Values
Notes:
qplot(friend_count,data=subset(pf,!is.na(gender)),binwidth=25)+
scale_x_continuous(limits=c(0,1000),breaks=seq(0,1000,50))+
facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
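`subset(pf, !is.na(gender))` removes only the rows where gender is missing. `na.omit()` is stricter: it drops any row that has a missing value in any column. The toy data frame below is hypothetical and only illustrates the difference.

```r
# Hypothetical miniature of pf; all values are made up
toy <- data.frame(
  gender       = c("female", "male", NA, "male", "female"),
  friend_count = c(10, 20, 30, 40, 50),
  tenure       = c(100, NA, 300, 400, 500)
)

# Drops only the row where gender is missing
by_gender <- subset(toy, !is.na(gender))

# Drops every row with a missing value in ANY column
complete <- na.omit(toy)

print(nrow(by_gender))  # 4
print(nrow(complete))   # 3
```

For a plot faceted by gender, the first form is usually what you want, because `na.omit()` could also throw away users who are only missing an unrelated field.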
Statistics ‘by’ Gender
Notes:
table(pf$gender)
##
## female male
## 40254 58574
by(pf$friend_count,pf$gender,summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
Who on average has more friends: men or women?
Response: Women
What’s the difference between the median friend count for women and men?
Response: 22 (96 for women versus 74 for men)
Why would the median be a better measure than the mean?
Response: The median is robust: it barely changes when the data contain extreme values, whereas the mean is pulled toward them.
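To see the robustness of the median in numbers, here is a small sketch. The seven base values are invented. The added outlier, 4923, is the maximum from the female summary above.

```r
# Made-up friend counts without any extreme value
friends <- c(20, 35, 50, 74, 96, 120, 150)

# Same data plus one heavy user (4923 is the max in the summary above)
with_outlier <- c(friends, 4923)

print(c(mean = mean(friends), median = median(friends)))
print(c(mean = mean(with_outlier), median = median(with_outlier)))
```

One extra observation moves the median from 74 to 85 but drags the mean from about 78 to 683.5, which is why the median describes the typical user better in a long-tailed distribution.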
Tenure
Notes:
qplot(x=tenure,data=pf, binwidth=30,
color=I('black'), fill=I('#099DD9'))
## Warning: Removed 2 rows containing non-finite values (stat_bin).
How would you create a histogram of tenure by year?
qplot(x=tenure/365,data=pf, binwidth=.25,
color=I('black'), fill=I('#F79420'))+
scale_x_continuous(breaks=seq(1,7,1),limits=c(0,7))
## Warning: Removed 26 rows containing non-finite values (stat_bin).
Labeling Plots
Notes:
qplot(x=tenure/365,data=pf,
xlab='No. of years using FB',
ylab='No. of users in sample',
binwidth=.25,
color=I('black'), fill=I('#F79420'))+
scale_x_continuous(breaks=seq(1,7,1),limits=c(0,7))
## Warning: Removed 26 rows containing non-finite values (stat_bin).
User Ages
Notes:
qplot(x=age,data=pf,
xlab='Age of users', ylab='Number of users',
binwidth=1,
color=I('black'), fill=I('#5760AB'))+
scale_x_continuous(breaks=seq(1,113,5))
What do you notice?
Response:
The Spread of Memes
Notes:
Lada’s Money Bag Meme
Notes:
Transforming Data
Notes:
library(gridExtra)
p1 <- qplot(x= friend_count,data=pf)
p2 <- qplot(x=log10(friend_count+1),data=pf)
p3 <- qplot(x=sqrt(friend_count+1),data=pf)
grid.arrange(p1, p2, p3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p1 <- ggplot(aes(x= friend_count),data=pf) + geom_histogram()
p2 <- p1 + scale_x_log10()
p3 <- p1 + scale_x_sqrt()
grid.arrange(p1, p2, p3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
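The two versions differ in how they treat users with zero friends. `log10(0)` is `-Inf`, so `scale_x_log10()` most likely drops those users (the 1962 rows in the warning), while `log10(friend_count + 1)` keeps them at 0. Here is a sketch with made-up counts:

```r
# Hypothetical friend counts, including two users with no friends
friend_count <- c(0, 0, 9, 99, 999)

raw_log     <- log10(friend_count)      # zeros become -Inf
shifted_log <- log10(friend_count + 1)  # zeros become 0

print(raw_log)
print(sum(!is.finite(raw_log)))  # 2 non-finite values
print(shifted_log)
```

Adding 1 before taking the log is a convenience, not a rule. It shifts every value slightly, which matters little for large counts but does change the shape near zero.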
Add a Scaling Layer
Notes:
qplot (x=friend_count,data=pf)+
scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
Frequency Polygons
q1 <- ggplot(aes(x=friend_count,y=..count../sum(..count..)),
data=subset(pf,!is.na(gender)))+
geom_freqpoly(aes(color=gender),binwidth=10)+
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))+
xlab('Numbers of Friends')+
ylab('Percentage of users with that friend count')
q2 <- ggplot(aes(x=friend_count,y=..count../sum(..count..)),
data=subset(pf,!is.na(gender)))+
geom_freqpoly(aes(color=gender),binwidth=10)+
scale_x_continuous(limits = c(0, 250), breaks = seq(0, 250, 50))+
xlab('Numbers of Friends')+
ylab('Percentage of users with that friend count')
q3 <- ggplot(aes(x=friend_count,y=..count../sum(..count..)),
data=subset(pf,!is.na(gender)))+
geom_freqpoly(aes(color=gender),binwidth=10)+
scale_x_continuous(limits = c(250, 500), breaks = seq(250, 500, 50))+
xlab('Numbers of Friends')+
ylab('Percentage of users with that friend count')
q4 <- ggplot(aes(x=friend_count,y=..count../sum(..count..)),
data=subset(pf,!is.na(gender)))+
geom_freqpoly(aes(color=gender),binwidth=10)+
scale_x_continuous(limits = c(500, 1000), breaks = seq(500, 1000, 50))+
xlab('Numbers of Friends')+
ylab('Percentage of users with that friend count')
grid.arrange(q1,q2,q3,q4,ncol=2)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 19870 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 87181 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 93438 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
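The y aesthetic `..count../sum(..count..)` turns bin counts into proportions. My understanding is that `sum(..count..)` adds up the counts of both genders, so the curves show shares of all users rather than shares within each gender, but treat that as an assumption. The base R sketch below computes both kinds of proportion by hand on a made-up data frame so the two can be compared.

```r
# Hypothetical data; values are invented for illustration
toy <- data.frame(
  gender       = c("female", "female", "female", "female", "male", "male"),
  friend_count = c(5, 15, 15, 25, 5, 25)
)

# Bins of width 10, closed on the left: [0,10), [10,20), [20,30)
toy$bin <- cut(toy$friend_count, breaks = seq(0, 30, by = 10), right = FALSE)
counts <- table(toy$gender, toy$bin)

# Proportion within each gender (each row sums to 1)
within_gender <- prop.table(counts, margin = 1)

# Proportion of all users (the whole table sums to 1)
of_all_users <- counts / sum(counts)

print(within_gender)
print(of_all_users)
```

If within-gender shares are what you want in the plot, mapping y to `..density..` is one option to try.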
Likes on the Web
Notes:
by(pf$www_likes,pf$gender,sum)
## pf$gender: female
## [1] 3507665
## --------------------------------------------------------
## pf$gender: male
## [1] 1430175
by(pf$www_likes_received,pf$gender,sum)
## pf$gender: female
## [1] 4199879
## --------------------------------------------------------
## pf$gender: male
## [1] 1586098
Box Plots
Notes:
qplot(x=gender,y=friend_count,
data=subset(pf,!is.na(gender)),
geom='boxplot')+
scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1962 rows containing non-finite values (stat_boxplot).
Adjust the code to focus on users who have friend counts between 0 and 1000.
qplot(x=gender,y=friend_count,
data=subset(pf,!is.na(gender)),
geom='boxplot')+
coord_cartesian(ylim=c(0,1000))
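`coord_cartesian(ylim = ...)` only zooms the view, so the boxes are still computed from every user. Setting limits on the scale, or subsetting the data first, removes the large values before the quartiles are calculated, which shifts the boxes. Here is a sketch of the effect on the quartiles, using invented numbers:

```r
# Made-up friend counts with two values above 1000
friend_count <- c(10, 50, 100, 200, 400, 800, 1500, 3000)
probs <- c(0.25, 0.5, 0.75)

# Quartiles on all data: what a box plot zoomed with coord_cartesian reflects
q_all <- quantile(friend_count, probs)

# Quartiles after dropping values above 1000: what limiting the scale reflects
q_trim <- quantile(friend_count[friend_count <= 1000], probs)

print(q_all)
print(q_trim)
```

In this toy example the median falls from 300 to 150 once the two large values are dropped, so the zoomed plot and the filtered plot would not show the same boxes.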
Box Plots, Quartiles, and Friendships
Notes:
qplot(x=gender,y=friendships_initiated,
data=subset(pf,!is.na(gender)),
geom='boxplot')+
coord_cartesian(ylim=c(0,500))
On average, who initiated more friendships in our sample: men or women?
Response:
Write about some ways that you can verify your answer.
Response:
Response:
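One way to check a box plot reading is to print the numbers behind it, in the same style as the `by()` call used earlier for friend counts. The sketch below uses a made-up data frame. On the real data the call would be `by(pf$friendships_initiated, pf$gender, summary)`.

```r
# Hypothetical sample; values are invented
toy <- data.frame(
  gender = factor(c("female", "female", "female", "male", "male", "male")),
  friendships_initiated = c(10, 49, 120, 8, 44, 90)
)

# Full five-number summary plus mean for each gender
print(by(toy$friendships_initiated, toy$gender, summary))

# Just the medians, as a named vector that is easy to compare
medians <- tapply(toy$friendships_initiated, toy$gender, median)
print(medians)
```

Comparing the medians and quartiles from the summary against the box plot confirms whether the box for one gender really sits higher than the other.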
Getting Logical
Notes:
Response:
Analyzing One Variable
Reflection:
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!