【R<-练习】生物统计练习题2

作者: 莫讠 | 来源:发表于2019-05-14 21:55 被阅读0次

【R<-练习】生物统计练习题2
【R<-练习】生物统计学练习题3
【R<-练习】生物统计学练习题1
R统计练习题
sam和bam格式文件的shell小练习
fasta和fastq格式文件的shell小练习
fasta和fastq格式文件的shell小练习
fasta和fastq格式文件的shell小练习
2018-12-01
fasta和fastq格式文件的shell小练习

Q1
研究者检测了6位胃癌病人的肿瘤样本和对应的癌旁正常组织样本中的TP53的表达量，试分别用wilcoxon signed-rank test和signed test，检测TP53的表达量在胃癌患者的肿瘤与正常样本中是否存在差异。

表1 胃癌患者的肿瘤样本和正常样本中TP53的表达量(FPKM)

肿瘤样本	30.09	18.68	19.90	5.13	39.69	30.87
正常样本	30.25	17.14	25.79	8.64	36.42	31.93

解：根据题意作出以下假设：
step1：
H0：TP53的表达量在胃癌患者的肿瘤与正常样本中不存在差异。
HA：TP53的表达量在胃癌患者的肿瘤与正常样本中存在差异。

1. wilcoxon signed-rank test

step2：输入数据，调用wilcox.test()命令，作wilcoxon signed-rank test

> x<-c(30.09,18.68,19.90,5.13,39.69,30.87)
> y<-c(30.25,17.14,25.79,8.64,36.42,31.93)
> wilcox.test(x,y,alternative = "two.sided", paired = T)

    Wilcoxon signed rank test

data:  x and y
V = 7, p-value = 0.5625
alternative hypothesis: true location shift is not equal to 0

step3:
p-value = 0.5625>0.05，说明TP53的表达量在胃癌患者的肿瘤与正常样本中不存在差异。

2. signed test

The sign test is a statistical test to compare the sizes of two groups. It is a non-parametric or “distribution free” test, which means the test doesn't assume the data comes from a particular distribution, like the normal distribution. The sign test is an alternative to a one sample t test or a paired t test.

step2: 调用binom.tes()命令，作signed test

> binom.test(sum(x>y),length(x),alternative="two.sided")

    Exact binomial test

data:  sum(x > y) and length(x)
number of successes = 2, number of trials = 6, p-value = 0.6875
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.04327187 0.77722190
sample estimates:
probability of success 
             0.3333333

step3:
p-value = 0.6875，说明TP53的表达量在胃癌患者的肿瘤与正常样本中不存在差异。

Q2
今测得某单位10名青年男性员工的体脂率和10名中年男性员工的体脂率，如表2，试用wilcoxon rank sum test分析中年男性员工的体脂率是否比青年男性员工的体脂率高？

表2 两组员工的体脂率（%）

青年男性	15	17	21	14	18	16	20	13	20	17
中年男性	20	16	19	18	17	22	21	18	19	16

解：根据题意作出以下假设：
step1：
H0：中年男性员工的体脂率与青年男性员工的体脂率不存在差异。
HA：中年男性员工的体脂率高于青年男性员工的体脂率。

1. wilcoxon signed-ran

> j<-c(15,17,21,14,18,16,20,13,20,17);s<-c(20,16,19,18,17,22,21,18,19,16)
> wilcox.test(j,s,alternative = "less")

    Wilcoxon rank sum test with continuity correction

data:  j and s
W = 33.5, p-value = 0.1117
alternative hypothesis: true location shift is less than 0

Warning message:
In wilcox.test.default(j, s, alternative = "less") :
  cannot compute exact p-value with ties

step3:
p-value = 0.1117>0.05，说明中年男性员工的体脂率与青年男性员工的体脂率不存在差异。

Q3:
Alzheimer's Disease (AD) is the most common cause of dementia, a group of brain disorders that results in the loss of intellectual and social skills. These changes are severe enough to interfere with day-to-day life. AD is characterized by the presence of senile plaques and neurofibrillary tangles in cortical regions of the brain. These pathological markers are thought to be responsible for the massive cortical neurodegeneration and concomitant loss of memory, reasoning, and after aberrant behaviors that are seen in patients with AD.
Here we have a gene expression data from normal neurons and neurons containing neurofibrillary tangles of 14 mid-stage AD cases. It can be found in “expression_data.txt”. Column 1-7 of “expression_data.txt” are normal neurons and column 8-14 are neurons with neurofibrillary tangles. Use the information mentioned above to answer the following questions:

a)Use t-test to find significantly differential expression genes between normal and tangle neuron sample (p-value < 0.01). Give the number of differential expressed genes and give the names of top 10 significantly differential expression genes.
Hint1: two types of t-test with equal variance and unequal variance are different.
Hint2: “names()” or “rownames()” can be used to extract names of differentially expressed genes.
Hint3: “apply(data,1,function(x){…})” can apply function to every row in data more quickly than “for{}”, so try to use “apply”.

b)Adjust the p-values in question a) with both “bonferroni” and “fdr” method to find differentially expressed genes (adjusted p-value < 0.01). Give the number of differential expressed genes.
Hint: you can do the adjustment according to the formula, or use “p.adjust()” instead

注释：
本例题需要用到的数据从此下载

a)

# 首先设置工作路径
> setwd("C:/Users/My/Desktop/")
# 读取数据
> expression_data<-read.table("expression_data.txt",header = T, sep = "")
# 查看expression data 前6行的内容及格式
> head(expression_data)
         normal_1 normal_2 normal_3 normal_4 normal_5 normal_6  normal_7
A1BG     6.917468 6.308350 5.318841 5.886811 5.082975 5.629453  4.919035
A1BG-AS1 7.862730 7.065809 6.783732 6.275773 3.063104 5.131017  3.708938
A1CF     7.425996 7.707939 7.550885 5.708736 5.900326 6.888329  5.987753
A2M      6.544029 8.364054 7.239746 9.037322 8.783692 8.863040 10.602350
A2M-AS1  4.986893 7.248867 7.098780 6.787237 6.252609 6.428099  7.758556
A2ML1    7.205739 6.642142 6.320614 7.311845 5.133855 6.073740  6.683212
         tangle_1 tangle_2 tangle_3  tangle_4 tangle_5 tangle_6 tangle_7
A1BG     3.585249 5.387387 7.238931  6.806550 5.540485 5.371650 6.713791
A1BG-AS1 7.184999 5.617467 6.097404  7.146769 5.184743 4.538170 4.729529
A1CF     5.383705 7.921897 6.340110  7.833862 6.250489 6.386345 7.602273
A2M      8.322173 8.372522 8.977941 10.238420 9.501352 9.413700 9.712794
A2M-AS1  6.974106 7.745843 7.434041  7.455784 6.776254 6.949485 7.995484
A2ML1    7.004367 6.478264 6.273696  6.413472 4.983931 5.629765 5.213316

# 查看数据格式
> class(expression_data)
[1] "data.frame" #显示expression_data 为数据框
# 查看后6行的数据格式
> tail(expression_data)
       normal_1 normal_2 normal_3 normal_4 normal_5 normal_6 normal_7 tangle_1
ZXDC   6.640701 5.890944 5.938750 6.654318 6.736413 7.074550 7.471034 5.968300
ZYG11A 5.743557 4.240352 3.554672 4.870805 1.078166 2.100594 1.994033 3.260842
ZYG11B 8.155717 7.833657 7.735409 7.545919 9.281287 8.703769 9.068741 7.634121
ZYX    7.644676 7.219442 8.196732 8.111574 6.319132 7.059664 7.027851 8.310182
ZZEF1  6.333352 6.179294 8.038788 8.134454 5.207987 5.412623 5.279691 5.618791
ZZZ3   7.528695 7.970740 7.829330 7.998374 8.486491 8.432052 8.734170 8.111172
       tangle_2 tangle_3 tangle_4 tangle_5 tangle_6 tangle_7
ZXDC   5.956762 6.761840 6.477379 6.634568 6.790257 7.145204
ZYG11A 4.943476 2.984469 3.324508 1.310224 2.336032 2.811428
ZYG11B 7.664895 8.532017 7.817962 9.153750 8.746823 6.946859
ZYX    6.880747 7.719131 7.439598 6.459248 6.677331 8.315430
ZZEF1  5.748279 4.855432 5.800385 5.424811 5.232450 5.403220
ZZZ3   7.877152 7.529809 7.612622 8.091155 7.937044 7.274963

# 对数据按要求进行方差检验
# 通过apply() 函数进行数据批量处理，前7列和后7列数据一一进行计算方差
> p.vartest <- apply(expression_data, 1, function(x)var.test(x[1:7],x[8:14])$p.value)
# 通过cbind()将两者组合在一起成为一个新的expression_data文件
> expression_data <- cbind(expression_data,p.vartest)
# 查看前6行
> head(expression_data)
         normal_1 normal_2 normal_3 normal_4 normal_5 normal_6  normal_7 tangle_1 tangle_2 tangle_3  tangle_4 tangle_5 tangle_6 tangle_7 p.vartest
A1BG     6.917468 6.308350 5.318841 5.886811 5.082975 5.629453  4.919035 3.585249 5.387387 7.238931  6.806550 5.540485 5.371650 6.713791 0.1997210
A1BG-AS1 7.862730 7.065809 6.783732 6.275773 3.063104 5.131017  3.708938 7.184999 5.617467 6.097404  7.146769 5.184743 4.538170 4.729529 0.2408243
A1CF     7.425996 7.707939 7.550885 5.708736 5.900326 6.888329  5.987753 5.383705 7.921897 6.340110  7.833862 6.250489 6.386345 7.602273 0.7720165
A2M      6.544029 8.364054 7.239746 9.037322 8.783692 8.863040 10.602350 8.322173 8.372522 8.977941 10.238420 9.501352 9.413700 9.712794 0.1550970
A2M-AS1  4.986893 7.248867 7.098780 6.787237 6.252609 6.428099  7.758556 6.974106 7.745843 7.434041  7.455784 6.776254 6.949485 7.995484 0.1210762
A2ML1    7.205739 6.642142 6.320614 7.311845 5.133855 6.073740  6.683212 7.004367 6.478264 6.273696  6.413472 4.983931 5.629765 5.213316 0.9951156
# 通过apply() 函数批量将符合方差相等的数据进行成组t检验
# 得到p-value
> p.value <-apply(expression_data, 1, function(x){
+     if(x[15]>0.05){
+         t.test(x[1:7],x[8:14],var.equal = T)$p.value} #方差相等的t-test
+     else{t.test(x[1:7],x[8:14],var.equal = F)$p.value}#方差不等的t-test
+ })

# 提取出前10个差异表达最大的基因的名称
> names(p.value[order(p.value,decreasing = F)[1:10]])
 [1] "TECPR2"       "ZMAT2"        "NT5DC3"       "SIN3A"        "HYOU1"        "UBE2Z"       
 [7] "SDR39U1"      "GNL2"         "VIPAS39"      "LOC100505584"

b)

> adjust1 <- p.adjust(p.value, method = "bonferroni")
> sum(adjust1<=0.01)
[1] 364

> adjust2 <- p.adjust(p.value,method = "fdr")
> sum(adjust2<=0.01)
[1] 1049

所以用FDR和Bonferroni校正p值分别有1049个和364个基因差异表达

【R<-练习】生物统计练习题2
Q1研究者检测了6位胃癌病人的肿瘤样本和对应的癌旁正常组织样本中的TP53的表达量，试分别用wilcoxon si...
【R<-练习】生物统计学练习题3
Question1. 农场调查了A、B、C等3个品种草莓的维生素C含量,测定结果如表1所示。（1）检验维生素C含...
【R<-练习】生物统计学练习题1
第一小题：要求：在R环境中完成下述操作，并写出具体R代码。查看R当前工作目录，设置R工作目录为数据所在目录并查看...
R统计练习题
基础统计 1、求众数（mode）：一组数据中出现次数最多的数值，叫众数，有时众数在一组数中有好几个。用M表示。理...
sam和bam格式文件的shell小练习
首先使用bowtie2软件自带的测试数据生成sam/bam文件，代码如下： LINUX练习题 1) 统计共多少条r...
fasta和fastq格式文件的shell小练习
这个是有机结合生物信息学的linux和数据格式的练习题：下载bowtie2软件后拿到示例数据： 1)统计read...
fasta和fastq格式文件的shell小练习
这个是有机结合生物信息学的linux和数据格式的练习题：下载bowtie2软件后拿到示例数据： 1)统计reads...
fasta和fastq格式文件的shell小练习
这个是有机结合生物信息学的linux和数据格式的练习题：下载bowtie2软件后拿到示例数据： 1）统计reads...
2018-12-01
练习题1 练习题2 练习题3
fasta和fastq格式文件的shell小练习
这个是有机结合生物信息学的linux和数据格式的练习题：下载bowtie2软件后拿到示例数据： 1.统计read...