R Beginner Homework (1)
- Open RStudio and report its working directory.
> getwd()
[1] "C:/Users/ZPY/Desktop/生信培训/u盘资料/3天课程资料/1.R/01-get_start"
getwd() prints the current working directory.
- Create 6 vectors based on different atomic types. (Focus on character, numeric, and logical.)
1. Character
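A small sketch of how the working directory relates to file paths. The file name `SraRunTable.txt` is the one used later in this homework; whether it exists depends on where you saved it on your own machine.

```r
# Current working directory (an absolute path, as a character string)
wd <- getwd()

# Relative file names are resolved against the working directory
path <- file.path(wd, "SraRunTable.txt")

# TRUE only if the file has been saved into the working directory
file.exists(path)
```

If this returns FALSE, either move the file or change the directory with setwd().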
> x<- c("a","b","test")
> x
[1] "a" "b" "test"
> class(x)
[1] "character"
2. Numeric (1:15 yields an integer vector, which R also treats as numeric)
> x2 <- c (1:15)
> x2
[1] 1 2 3 4 5 6 7 8 9 10
[11] 11 12 13 14 15
> class(x2)
[1] "integer"
3. Logical
> x3 <- c(T,T,F,T)
> x3
[1] TRUE TRUE FALSE TRUE
> class(x3)
[1] "logical"
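The task asks for six vectors, and R happens to have six atomic types. The sketch below builds one vector of each; the variable names are only illustrative.

```r
# One vector for each of R's six atomic types
v_chr <- c("a", "b", "test")      # character
v_dbl <- c(1.5, 2, 3.25)          # double (class "numeric")
v_int <- 1:15                     # integer, same as c(1L, 2L, ...)
v_lgl <- c(TRUE, TRUE, FALSE)     # logical; TRUE/FALSE are safer than T/F
v_cpl <- c(1+2i, 3-1i)            # complex
v_raw <- as.raw(c(1, 255))        # raw bytes

# typeof() reports the underlying storage type of each vector
vectors <- list(v_chr, v_dbl, v_int, v_lgl, v_cpl, v_raw)
types <- sapply(vectors, typeof)
print(types)
```

T and F are ordinary variables that can be reassigned, which is why spelling out TRUE and FALSE is usually recommended.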
- Create some data structures, such as a matrix, array, data frame, and list (focus on data frame and matrix).
1. Create a data frame
> df <- data.frame(gene=paste0("gene",1:5),a1 = rnorm(n=5),a2 = rnorm(n=5),a3 = rnorm(n=5),a4 = rnorm(n=5),a5 = rnorm(n=5))
> df
gene a1 a2
1 gene1 -1.2618589 1.23975969
2 gene2 -0.4130272 0.55415193
3 gene3 1.3418602 1.49528004
4 gene4 0.6431766 -0.92223528
5 gene5 0.9204888 0.04323589
a3 a4
1 1.9925937 1.13877165
2 1.9097372 0.57711783
3 -0.5818669 0.60345433
4 0.2551079 0.09098584
5 0.2774576 -1.24120023
a5
1 1.575937779
2 -0.008430767
3 -0.738294543
4 -0.032767262
5 -0.082232013
2. Create a matrix
> m <- matrix(1:15,ncol =3)
> m
[,1] [,2] [,3]
[1,] 1 6 11
[2,] 2 7 12
[3,] 3 8 13
[4,] 4 9 14
[5,] 5 10 15
> rownames(m) <- paste0(rep("gene",5),1:5)
> colnames(m) <- c("a1","a2","a3")
> m
a1 a2 a3
gene1 1 6 11
gene2 2 7 12
gene3 3 8 13
gene4 4 9 14
gene5 5 10 15
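The task also mentions arrays and lists, which are not shown above. A minimal sketch of both, with made-up names and values:

```r
# Array: like a matrix, but with any number of dimensions (here 2 x 3 x 2)
arr <- array(1:12, dim = c(2, 3, 2))
dim(arr)

# List: elements may have different types and lengths
lst <- list(name = "gene1", values = c(1.2, 3.4), flag = TRUE)
lst$values       # access an element by name
lst[["name"]]    # same idea with [[ ]]

# A data frame is a list of equal-length columns
df_small <- data.frame(gene = paste0("gene", 1:3), a1 = c(0.1, 0.2, 0.3))
is.list(df_small)
```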
- Slice the data frame you created, e.g. first take rows 1 and 3, then columns 4 and 6.
1. Take rows 1 and 3 of df, then columns 4 to 6
> df[c(1,3),]
gene a1 a2
1 gene1 -1.261859 1.23976
3 gene3 1.341860 1.49528
a3 a4 a5
1 1.9925937 1.1387717 1.5759378
3 -0.5818669 0.6034543 -0.7382945
> df[,4:6]
a3 a4
1 1.9925937 1.13877165
2 1.9097372 0.57711783
3 -0.5818669 0.60345433
4 0.2551079 0.09098584
5 0.2774576 -1.24120023
a5
1 1.575937779
2 -0.008430767
3 -0.738294543
4 -0.032767262
5 -0.082232013
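The task wording can be read as "columns 4 and 6" or "columns 4 to 6". The sketch below shows both forms on a data frame built like df above; set.seed(1) is an assumption added only to make the random values reproducible.

```r
set.seed(1)  # hypothetical seed, only so the example is reproducible
df <- data.frame(gene = paste0("gene", 1:5),
                 a1 = rnorm(5), a2 = rnorm(5), a3 = rnorm(5),
                 a4 = rnorm(5), a5 = rnorm(5))

rows_13 <- df[c(1, 3), ]        # rows 1 and 3, all columns
cols_4_to_6 <- df[, 4:6]        # columns 4, 5 and 6 (a3, a4, a5)
cols_4_and_6 <- df[, c(4, 6)]   # only columns 4 and 6 (a3, a5)
both <- df[c(1, 3), c(4, 6)]    # rows and columns in one step

colnames(cols_4_and_6)
dim(both)
```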
- Use the data function to load R's built-in dataset rivers and describe it.
> data("rivers") # load rivers
> ?rivers # lengths of 141 major North American rivers
> rivers
[1] 735 320 325 392 524 450
[7] 1459 135 465 600 330 336
[13] 280 315 870 906 202 329
[19] 290 1000 600 505 1450 840
[25] 1243 890 350 407 286 280
[31] 525 720 390 250 327 230
[37] 265 850 210 630 260 230
[43] 360 730 600 306 390 420
[49] 291 710 340 217 281 352
[55] 259 250 470 680 570 350
[61] 300 560 900 625 332 2348
[67] 1171 3710 2315 2533 780 280
[73] 410 460 260 255 431 350
[79] 760 618 338 981 1306 500
[85] 696 605 250 411 1054 735
[91] 233 435 490 310 460 383
[97] 375 1270 545 445 1885 380
[103] 300 380 377 425 276 210
[109] 800 420 350 360 538 1100
[115] 1205 314 237 610 360 540
[121] 1038 424 310 300 444 301
[127] 268 620 215 652 900 525
[133] 246 360 529 500 720 270
[139] 430 671 1770
> length(rivers)
[1] 141
> unique(rivers) # drop duplicated values
[1] 735 320 325 392 524 450
[7] 1459 135 465 600 330 336
[13] 280 315 870 906 202 329
[19] 290 1000 505 1450 840 1243
[25] 890 350 407 286 525 720
[31] 390 250 327 230 265 850
[37] 210 630 260 360 730 306
[43] 420 291 710 340 217 281
[49] 352 259 470 680 570 300
[55] 560 900 625 332 2348 1171
[61] 3710 2315 2533 780 410 460
[67] 255 431 760 618 338 981
[73] 1306 500 696 605 411 1054
[79] 233 435 490 310 383 375
[85] 1270 545 445 1885 380 377
[91] 425 276 800 538 1100 1205
[97] 314 237 610 540 1038 424
[103] 444 301 268 620 215 652
[109] 246 529 270 430 671 1770
> length(unique(rivers)) # number of distinct values
[1] 114
> table(rivers) # frequency of each value
rivers
135 202 210 215 217 230 233
1 1 2 1 1 2 1
237 246 250 255 259 260 265
1 1 3 1 1 2 1
268 270 276 280 281 286 290
1 1 1 3 1 1 1
291 300 301 306 310 314 315
1 3 1 1 2 1 1
320 325 327 329 330 332 336
1 1 1 1 1 1 1
338 340 350 352 360 375 377
1 1 4 1 4 1 1
380 383 390 392 407 410 411
2 1 2 1 1 1 1
420 424 425 430 431 435 444
2 1 1 1 1 1 1
445 450 460 465 470 490 500
1 1 2 1 1 1 2
505 524 525 529 538 540 545
1 1 2 1 1 1 1
560 570 600 605 610 618 620
1 1 3 1 1 1 1
625 630 652 671 680 696 710
1 1 1 1 1 1 1
720 730 735 760 780 800 840
2 1 2 1 1 1 1
850 870 890 900 906 981 1000
1 1 1 2 1 1 1
1038 1054 1100 1171 1205 1243 1270
1 1 1 1 1 1 1
1306 1450 1459 1770 1885 2315 2348
1 1 1 1 1 1 1
2533 3710
1 1
> sort(rivers) # sort in ascending order
[1] 135 202 210 210 215 217
[7] 230 230 233 237 246 250
[13] 250 250 255 259 260 260
[19] 265 268 270 276 280 280
[25] 280 281 286 290 291 300
[31] 300 300 301 306 310 310
[37] 314 315 320 325 327 329
[43] 330 332 336 338 340 350
[49] 350 350 350 352 360 360
[55] 360 360 375 377 380 380
[61] 383 390 390 392 407 410
[67] 411 420 420 424 425 430
[73] 431 435 444 445 450 460
[79] 460 465 470 490 500 500
[85] 505 524 525 525 529 538
[91] 540 545 560 570 600 600
[97] 600 605 610 618 620 625
[103] 630 652 671 680 696 710
[109] 720 720 730 735 735 760
[115] 780 800 840 850 870 890
[121] 900 900 906 981 1000 1038
[127] 1054 1100 1171 1205 1243 1270
[133] 1306 1450 1459 1770 1885 2315
[139] 2348 2533 3710
> median(rivers) # median
[1] 425
> range(rivers) # minimum and maximum
[1] 135 3710
> which.min(rivers) # index of the smallest value
[1] 8
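A few more base R calls that help describe rivers, as a short sketch. The variable name dup is illustrative.

```r
data("rivers")

summary(rivers)                 # min, quartiles, mean and max in one call
which.max(rivers)               # index of the longest river
rivers[which.max(rivers)]       # its length

# Lengths that occur more than once
dup <- table(rivers)
dup[dup > 1]
```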
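- Download the RunInfo Table file from https://www.ncbi.nlm.nih.gov/sra?term=SRP133642, read it into R, and explore the data frame: how many columns it has and what type of elements each column holds.
1. Download steps: open the link -- Send results to Run selector -- RunInfo Table
2. Read the file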
> sra <- read.table (SraRUNTable.txt)
Error in read.table(SraRUNTable.txt) : object 'SraRUNTable.txt' not found # the file name was not quoted, so R looked for an object with that name
> sra <- read.table (file="SraRunTable.txt")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
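line 1 did not have 44 elements # the field separator must be set to \t
> df1 <- read.table(file = "SraRunTable.txt",header = T,sep = '\t') # header = T reads the first line as column names
> View(df1)
> dim(df1) # number of rows and columns
> nrow(df1) # number of rows
> ncol(df1) # number of columns
> colnames(df1) # column names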
[1] "BioSample"
[2] "Experiment"
[3] "MBases"
[4] "MBytes"
[5] "Run"
[6] "SRA_Sample"
[7] "Sample_Name"
[8] "Assay_Type"
[9] "AssemblyName"
[10] "AvgSpotLen"
[11] "BioProject"
[12] "Center_Name"
[13] "Consent"
[14] "DATASTORE_filetype"
[15] "DATASTORE_provider"
[16] "InsertSize"
[17] "Instrument"
[18] "LibraryLayout"
[19] "LibrarySelection"
[20] "LibrarySource"
[21] "LoadDate"
[22] "Organism"
[23] "Platform"
[24] "ReleaseDate"
[25] "SRA_Study"
[26] "age"
[27] "cell_type"
[28] "marker_genes"
[29] "source_name"
[30] "strain"
[31] "tissue"
> for (i in colnames(df1)) paste(i,class(df1[,i])) %>% print() # check the type of each column
Error in paste(i, class(df1[, i])) %>% print() :
could not find function "%>%" # %>% is the pipe from the magrittr package; it passes the result of the previous step as the first argument of the next function, so intermediate assignments can be skipped.
Fix for the error: https://stackoverflow.com/questions/30248583/error-could-not-find-function
Then:
> install.packages("magrittr")
> library(magrittr) # load the package so that %>% is available
> for (i in colnames(df1)) paste(i,class(df1[,i])) %>% print() # for each column of df1, print the column name and its class; %>% pipes the output of paste() into print(); paste() converts its arguments to character strings and joins them, see https://blog.csdn.net/neweastsun/article/details/51792237
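The same column-type check can be done in base R without magrittr. The sketch below uses a small stand-in data frame named demo with made-up values, so that it runs without the downloaded file; with the real data you would pass df1 instead.

```r
# Hypothetical stand-in for df1, so the example runs without the download
demo <- data.frame(Run = c("SRR0000001", "SRR0000002"),
                   MBases = c(12L, 15L),
                   AvgSpotLen = c(43.5, 50),
                   stringsAsFactors = FALSE)

# Class of every column in one call; returns a named character vector
col_classes <- sapply(demo, class)
print(col_classes)

# The loop from the text, written without a pipe
for (i in colnames(demo)) print(paste(i, class(demo[, i])))

# str() gives a compact overview of dimensions, types and first values
str(demo)
```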
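- Download the sample information file sample.csv from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111229, read it into R, and explore the data frame: how many columns it has and what type of elements each column holds.
1. Download the GEO sample information
GEO website: https://www.ncbi.nlm.nih.gov/geo/ ---- click Samples ---- search for GSE111229 ---- Export
2. Read it into R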
> df2<- read.table("sample.csv",header = T)
> View(df2)
> dim(df2)
[1] 20 6
> library(magrittr)
> for (i in colnames(df2)) paste(i,class(df2[,i])) %>% print()
[1] "Accession.Title.Sample character"
[1] "Type.Taxonomy.Channels.Platform.Series.Supplementary character"
[1] "Types.Supplementary character"
[1] "Links.SRA character"
[1] "Accession.Contact.Release character"
[1] "Date character"
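After reading the file in, my result differed from other students'. Cause: this is a CSV file, which uses , as the separator, so read.table needs sep = "," (plus header = T to read the column names); read.csv sets both by default:
> df2<- read.table("sample.csv",header = T,sep = ",")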
> df3<- read.csv("sample.csv")
> dim(df2)
[1] 20 12
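The row count is still wrong. The downloaded file itself only had 20 rows, because I had exported only the current page. I downloaded it again, this time choosing All search results: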
> df2=read.csv(file="sample.csv")
> View(df2)
> dim(df2)
[1] 768 12
> library(magrittr)
Warning message:
package ‘magrittr’ was built under R version 3.5.3
> for (i in colnames(df2)) paste(i,class(df2[,i])) %>% print()
[1] "Accession factor"
[1] "Title factor"
[1] "Sample.Type factor"
[1] "Taxonomy factor"
[1] "Channels integer"
[1] "Platform factor"
[1] "Series factor"
[1] "Supplementary.Types factor"
[1] "Supplementary.Links factor"
[1] "SRA.Accession factor"
[1] "Contact factor"
[1] "Release.Date factor"
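The separator problem can be reproduced on a tiny made-up CSV written to a temporary file, as sketched below. All IDs and values are hypothetical. The factor columns in the output above come from R versions before 4.0.0, where stringsAsFactors defaulted to TRUE; the sketch sets it to FALSE explicitly so it behaves the same on any version.

```r
# Write a tiny, made-up CSV to a temporary file (values are hypothetical)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Accession,Title,Channels",
             "GSM0000001,sample_one,1",
             "GSM0000002,sample_two,1"), tmp)

# Default sep is whitespace, so each line is read as one single field
wrong <- read.table(tmp, header = TRUE)

# Either name the separator explicitly, or use read.csv
right1 <- read.table(tmp, header = TRUE, sep = ",", stringsAsFactors = FALSE)
right2 <- read.csv(tmp, stringsAsFactors = FALSE)

dim(wrong)             # one column whose name merges all header fields
dim(right1)            # rows and columns as intended
sapply(right2, class)  # strings stay character instead of factor
unlink(tmp)
```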
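- Join the two tables from the previous two steps (the RunInfo Table file and the sample information sample.csv) using the merge function.
Overall idea: find the content the two tables share and merge on it.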
rm(list = ls())
options(stringsAsFactors = F)
df1 <- read.table(file = "SraRunTable.txt",header = T,sep = '\t')
df2 <- read.csv(file = "sample.csv")
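for (i in colnames(df1)) {if (i %in% colnames(df2)) print(i)} # list the column names shared by both tables
df1[1,"Platform"]
df2[1,"Platform"] # compare the first value of the shared column in each data frame
I could not follow the rest of the expert's solution; see https://www.jianshu.com/p/c07e67e2c757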
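A minimal sketch of the merge step on toy data, not the referenced solution. It assumes the key is the GSM sample ID, held in the Sample_Name column of the RunInfo table and the Accession column of sample.csv; check this against your own files before relying on it. The data frames runs and samples, and all IDs and values in them, are made up. A shared column name is only useful as a key if the values in it actually match.

```r
# Toy stand-ins for df1 (RunInfo table) and df2 (sample.csv); all values are made up
runs <- data.frame(Run = c("SRR0000001", "SRR0000002", "SRR0000003"),
                   Sample_Name = c("GSM0000001", "GSM0000002", "GSM0000003"),
                   MBases = c(12L, 15L, 9L),
                   stringsAsFactors = FALSE)
samples <- data.frame(Accession = c("GSM0000002", "GSM0000001", "GSM0000004"),
                      Title = c("cell_B", "cell_A", "cell_D"),
                      stringsAsFactors = FALSE)

# Inner join: keep only rows whose key appears in both tables
merged <- merge(runs, samples, by.x = "Sample_Name", by.y = "Accession")

# all.x = TRUE keeps every run, filling unmatched columns with NA
merged_all <- merge(runs, samples, by.x = "Sample_Name", by.y = "Accession",
                    all.x = TRUE)

print(merged)
print(merged_all)
```

by.x and by.y name the key column in the first and second table, so the two columns do not need the same name; the result keeps the name from the first table.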