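Today I took Udacity's course on data analysis with R. I had studied R before, but never this systematically, so I am posting today's notes and exercises here. Anyone interested can click the link to take the course.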
Lesson 3
What to Do First?
Notes:
Pseudo-Facebook User Data
Notes:
getwd()
## [1] "C:/Users/HH/Desktop/R Data analyst"
list.files()
## [1] "07-tidy-data.pdf" "demystifying.R"
## [3] "demystifyingR2_v3.html" "demystifyingR2_v3.Rmd"
## [5] "EDA_Course_Materials.zip" "lesson3_student.html"
## [7] "lesson3_student.rmd" "pseudo_facebook.tsv"
## [9] "reddit.csv" "stateData.csv"
## [11] "tidy-data.pdf"
pf<-read.delim('pseudo_facebook.tsv')
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
Histogram of Users’ Birthdays
Notes:
library(ggplot2)
qplot(x=dob_day,data=pf)+
scale_x_continuous(breaks=1:31)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
![](https://img.haomeiwen.com/i3328747/ae5affd3b0fb1d03.png)
What are some things that you notice about this histogram?
Response: It is unusual that so many people have a birthday on the 1st.
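The `stat_bin()` message above suggests picking a bin width. Because days of the month are integers, one bin per day is a natural choice. As a rough sketch in base R, tabulating the days gives the per-day counts that a one-day-wide histogram would draw. The vector `dob_day` below is made up for illustration and is not the course data.

```r
# Hypothetical birthdays (day of month), NOT taken from pseudo_facebook.tsv
dob_day <- c(1, 1, 1, 14, 14, 31)

# Count users per day; factor() keeps days with zero users in the table
day_counts <- table(factor(dob_day, levels = 1:31))

day_counts[["1"]]   # 3
day_counts[["14"]]  # 2
day_counts[["2"]]   # 0
```

With the real data, a spike on day 1 in such a table would match the tall first bar in the histogram.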
Moira’s Investigation
Notes:
Estimating Your Audience Size
Notes:
Think about a time when you posted a specific message or shared a photo on Facebook. What was it?
Response:
How many of your friends do you think saw that post?
Response:
Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?
Response:
Perceived Audience Size
Notes:
Faceting
Notes:
qplot(x=dob_day,data=pf)+
scale_x_continuous(breaks=1:31)+
facet_wrap(~dob_month,ncol=3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
![](https://img.haomeiwen.com/i3328747/1df7938f18307101.png)
Let’s take another look at our plot. What stands out to you here?
Response:
Be Skeptical - Outliers and Anomalies
Notes:
Moira’s Outlier
Notes:
Which case do you think applies to Moira’s outlier?
Response:
Friend Count
Notes:
What code would you enter to create a histogram of friend counts?
qplot(friend_count,data=pf)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
![](https://img.haomeiwen.com/i3328747/c24984fa03873998.png)
How is this plot similar to Moira’s first plot?
Response:
Limiting the Axes
Notes:
qplot(friend_count,data=pf)+
scale_x_continuous(limits=c(0,1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
![](https://img.haomeiwen.com/i3328747/00ec0853a17e2ef0.png)
Exploring with Bin Width
Notes:
Adjusting the Bin Width
Notes:
Faceting Friend Count
# What code would you add to facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50))+
facet_wrap(~gender)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
![](https://img.haomeiwen.com/i3328747/36b10ce9fab7cf5c.png)
Omitting NA Values
Notes:
qplot(friend_count,data=subset(pf,!is.na(gender)),binwidth=25)+
scale_x_continuous(limits=c(0,1000),breaks=seq(0,1000,50))+
facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
![](https://img.haomeiwen.com/i3328747/c13442596dc9948d.png)
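The call above filters with `!is.na(gender)`, which drops only the rows where gender is missing. A small sketch on a made-up data frame (the name `toy` and its values are hypothetical) shows how this differs from `na.omit()`, which drops a row if any column is missing:

```r
# Hypothetical data frame, NOT the course data
toy <- data.frame(
  gender       = c("female", "male", NA, "male"),
  friend_count = c(10, NA, 30, 40)
)

# Keep rows where gender is known, even if other columns are NA
known_gender <- subset(toy, !is.na(gender))
nrow(known_gender)   # 3

# na.omit() removes every row that has an NA in any column
complete_rows <- na.omit(toy)
nrow(complete_rows)  # 2
```

Filtering on the one column being plotted keeps more of the data, which is probably why the notes use `subset()` here.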
Statistics ‘by’ Gender
Notes:
table(pf$gender)
##
## female male
## 40254 58574
by(pf$friend_count,pf$gender,summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
Who on average has more friends: men or women?
Response: women
What’s the difference between the median friend count for women and men?
Response: 22
Why would the median be a better measure than the mean?
Response: The median is more robust: it does not change much when the data contain extreme values.
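To make the point about robustness concrete, here is a minimal sketch on made-up numbers (the vectors are hypothetical, not drawn from the data set). Adding one extreme value moves the mean a lot and the median only slightly:

```r
# Hypothetical friend counts, NOT the course data
friend_counts <- c(10, 20, 30, 40, 50)
with_outlier  <- c(friend_counts, 5000)  # one user with a huge friend count

mean(friend_counts)    # 30
median(friend_counts)  # 30

mean(with_outlier)     # about 858.3
median(with_outlier)   # 35
```

The same effect shows up in the `by()` output above, where each group's mean sits well above its median.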
Tenure
Notes:
qplot(x=tenure,data=pf, binwidth=30,
color=I('black'), fill=I('#099DD9'))
## Warning: Removed 2 rows containing non-finite values (stat_bin).
![](https://img.haomeiwen.com/i3328747/36452cee4c251913.png)
How would you create a histogram of tenure by year?
qplot(x=tenure/365,data=pf, binwidth=.25,
color=I('black'), fill=I('#F79420'))+
scale_x_continuous(breaks=seq(1,7,1),limits=c(0,7))
## Warning: Removed 26 rows containing non-finite values (stat_bin).
![](https://img.haomeiwen.com/i3328747/dfb15bcd3b8c6d6a.png)
Labeling Plots
Notes:
qplot(x=tenure/365,data=pf,
xlab='No. of years using FB',
ylab='No. of users in sample',
binwidth=.25,
color=I('black'), fill=I('#F79420'))+
scale_x_continuous(breaks=seq(1,7,1),limits=c(0,7))
## Warning: Removed 26 rows containing non-finite values (stat_bin).
![](https://img.haomeiwen.com/i3328747/b2bf81c17474da51.png)
User Ages
Notes:
qplot(x=age,data=pf,
xlab='Age of users', ylab='Number of users',
binwidth=1,
color=I('black'), fill=I('#5760AB'))+
scale_x_continuous(breaks=seq(1,113,5))
![](https://img.haomeiwen.com/i3328747/7f4e35fcf1cc9393.png)
What do you notice?
Response:
The Spread of Memes
Notes:
Lada’s Money Bag Meme
Notes:
Transforming Data
Notes:
library(gridExtra)
p1 <- qplot(x= friend_count,data=pf)
p2 <- qplot(x=log10(friend_count+1),data=pf)
p3 <- qplot(x=sqrt(friend_count+1),data=pf)
grid.arrange(p1, p2, p3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
![](https://img.haomeiwen.com/i3328747/70b156e2b9f2693c.png)
p1 <- ggplot(aes(x= friend_count),data=pf) + geom_histogram()
p2 <- p1 + scale_x_log10()
p3 <- p1 + scale_x_sqrt()
grid.arrange(p1, p2, p3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
![](https://img.haomeiwen.com/i3328747/69bcb05928e4e114.png)
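The warning about infinite values most likely comes from users with zero friends: `log10(0)` is `-Inf`, so a log scale cannot place them, while the earlier version adds 1 before taking the log. A small sketch on made-up values (the vector is hypothetical):

```r
# Hypothetical friend counts, including a user with zero friends
friend_count <- c(0, 9, 99, 999)

# Plain log10: the zero becomes -Inf and would be dropped from the plot
raw_log <- log10(friend_count)
is.infinite(raw_log[1])  # TRUE

# Adding 1 first keeps every user on the axis
shifted_log <- log10(friend_count + 1)
shifted_log              # 0 1 2 3
```

This is one way to read the two sets of plots: the first keeps the zero-friend users, the second removes them.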
Add a Scaling Layer
Notes:
qplot (x=friend_count,data=pf)+
scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
![](https://img.haomeiwen.com/i3328747/b2c6d6693de4c847.png)
Frequency Polygons
q1 <- ggplot(aes(x=friend_count,y=..count../sum(..count..)),
data=subset(pf,!is.na(gender)))+
geom_freqpoly(aes(color=gender),binwidth=10)+
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))+
xlab('Numbers of Friends')+
ylab('Percentage of users with that friend count')
q2 <- ggplot(aes(x=friend_count,y=..count../sum(..count..)),
data=subset(pf,!is.na(gender)))+
geom_freqpoly(aes(color=gender),binwidth=10)+
scale_x_continuous(limits = c(0, 250), breaks = seq(0, 250, 50))+
xlab('Numbers of Friends')+
ylab('Percentage of users with that friend count')
q3 <- ggplot(aes(x=friend_count,y=..count../sum(..count..)),
data=subset(pf,!is.na(gender)))+
geom_freqpoly(aes(color=gender),binwidth=10)+
scale_x_continuous(limits = c(250, 500), breaks = seq(250, 500, 50))+
xlab('Numbers of Friends')+
ylab('Percentage of users with that friend count')
q4 <- ggplot(aes(x=friend_count,y=..count../sum(..count..)),
data=subset(pf,!is.na(gender)))+
geom_freqpoly(aes(color=gender),binwidth=10)+
scale_x_continuous(limits = c(500, 1000), breaks = seq(500, 1000, 50))+
xlab('Numbers of Friends')+
ylab('Percentage of users with that friend count')
grid.arrange(q1,q2,q3,q4,ncol=2)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 19870 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 87181 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 93438 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
![](https://img.haomeiwen.com/i3328747/1cb68e2e7b890aba.png)
Likes on the Web
Notes:
by(pf$www_likes,pf$gender,sum)
## pf$gender: female
## [1] 3507665
## --------------------------------------------------------
## pf$gender: male
## [1] 1430175
by(pf$www_likes_received,pf$gender,sum)
## pf$gender: female
## [1] 4199879
## --------------------------------------------------------
## pf$gender: male
## [1] 1586098
Box Plots
Notes:
qplot(x=gender,y=friend_count,
data=subset(pf,!is.na(gender)),
geom='boxplot')+
scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1962 rows containing non-finite values (stat_boxplot).
![](https://img.haomeiwen.com/i3328747/38f57029daee6ea9.png)
Adjust the code to focus on users who have friend counts between 0 and 1000.
qplot(x=gender,y=friend_count,
data=subset(pf,!is.na(gender)),
geom='boxplot')+
coord_cartesian(ylim=c(0,1000))
![](https://img.haomeiwen.com/i3328747/26648a9054a41842.png)
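As far as I understand ggplot2, `coord_cartesian(ylim = ...)` only zooms the view, so the boxes are still computed from all users, whereas setting limits on the y scale drops out-of-range rows before the quartiles are computed. The base R sketch below uses made-up numbers (the vector `friend_count` is hypothetical) to show how much the median can shift when rows are dropped first:

```r
# Hypothetical friend counts, NOT the course data
friend_count <- c(10, 50, 100, 400, 2000, 4000)

# Median over all users: what a zoomed-in box plot would still show
median_all <- median(friend_count)
median_all      # 250

# Median after discarding users above 1000: what filtering first would show
median_trimmed <- median(friend_count[friend_count <= 1000])
median_trimmed  # 75
```

Zooming keeps the summary statistics honest, at the cost of hiding the outliers from view.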
Box Plots, Quartiles, and Friendships
Notes:
qplot(x=gender,y=friendships_initiated,
data=subset(pf,!is.na(gender)),
geom='boxplot')+
coord_cartesian(ylim=c(0,500))
![](https://img.haomeiwen.com/i3328747/03d0230d3d2c56b1.png)
On average, who initiated more friendships in our sample: men or women?
Response:
Write about some ways that you can verify your answer.
Response:
Response:
Getting Logical
Notes:
Response:
Analyzing One Variable
Reflection:
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!