1 Terminology
NEPC: neuroendocrine prostate cancer
PCa: prostate cancer
t-NEPC: treatment-emergent NEPC
hypoxia-directed therapy
2 Data
PCa Beltran data set
---All BAM files and associated sample information are described in Supplementary Table 11; data are deposited in dbGap phs000909.v.p1 and accessible on the cBIO Portal for Cancer Genomics.
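The RNASeqV2 data from TCGA were processed and normalized with RSEM to produce TPM (transcripts per million).
The raw data are very large, and it seems we generally do not have download permission for dbGaP anyway.
So the data I used are those in data_RNA_Seq_expression_median.txt.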
PCa Lin data set
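----? LTL545 | combined with a clinical cohort? ------ skipped for now
GPL14450 data --- quantile normalization
Because this is microarray data, the annotation file for the corresponding platform is needed; I downloaded the matching annotation txt from GEO.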
https://www.ncbi.nlm.nih.gov/geo/browse/
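Mapping probe IDs to gene symbols with such an annotation file can be sketched in awk. This is an illustration only: the function name `annotate_probes` is made up, and it assumes a tab-separated annotation table whose comment lines start with `#`, with the probe ID in column 1 and the gene symbol in a column you pass in (the real column position depends on the platform file and is not stated in these notes).

```shell
# annotate_probes ANNOT COL: read an expression matrix on stdin (header row,
# probe ID in column 1) and replace each probe ID with the symbol found in
# column COL of ANNOT; probes without a symbol are dropped.
# Assumption: ANNOT is tab-separated and its comment lines start with '#'.
annotate_probes() {
    awk -F '\t' -v OFS='\t' -v annot="$1" -v col="$2" '
        BEGIN {
            while ((getline line < annot) > 0) {
                if (line ~ /^#/) continue          # skip comment lines
                split(line, f, "\t")
                if (f[col] != "") sym[f[1]] = f[col]
            }
        }
        NR == 1 { print; next }                    # keep the header row
        ($1 in sym) { $1 = sym[$1]; print }        # swap probe ID for symbol
    '
}
```

Usage would look like `annotate_probes annotation.txt 2 < expression.txt > expression_genes.txt`, where both file names are placeholders. Several probes can map to the same gene; this sketch does not collapse them.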
CCLE data
CCLE: Lung Cancer
CCLE: Nervous system tumor
- SCLC=lung+small_cell+ATCC+Gender(F/M)-note "duplicate" 38
- NSCLC=lung-small_cell-large_cell+ATCC|ECACC+Gender(F/M)-note "duplicate" 71
- Neuroblastoma=neuroblastoma+Gender(F/M)-note "duplicate" 11
- glioma=glioma+Gender(F/M)-note "duplicate" 33
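# line 3 of the GCT file is the header row holding the cell line names
zcat CCLE_RNAseq_genes_rpkm_20180929.gct.gz |sed -n '3p' > cell_line.txt
# transpose so each name sits on its own line; the line number then equals the column number
awk '{for(i=1;i<=NF;i++){a[FNR,i]=$i}}END{for(i=1;i<=NF;i++){for(j=1;j<=FNR;j++){printf "%s ", a[j,i]}print ""}}' cell_line.txt > tcell_line.txt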
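cat > num.sh
# $1 is a text file with one cell line name per line (Unix line endings)
while IFS= read -r line
do
# -n prefixes each hit with its line number in tcell_line.txt, i.e. its column number in the GCT file
grep -n -F -- "${line}" tcell_line.txt >> "$1_num.txt"
done < "$1"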
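#####Lesson learned here: scc.txt was pasted from an Excel filter on Windows and then uploaded to the server, so it was not in Unix format, and grep kept returning nothing. Only after converting it to Unix format in Notepad++ and uploading it again did the script produce results. $1 here is the txt of cell lines I selected on Windows according to the paper's description. The point of this step is to find the column numbers of those cell lines, so that cut can later pull out their expression columns.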
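The line-ending problem described above can also be checked and fixed on the server itself, without going back to Notepad++. A minimal sketch; the function names `has_crlf` and `strip_cr` are made up for illustration:

```shell
# has_crlf FILE: succeed (exit status 0) if any line of FILE ends with a
# carriage return, which is what Windows (CRLF) line endings leave behind.
has_crlf() {
    grep -q "$(printf '\r')\$" "$1"
}

# strip_cr FILE: print FILE with every carriage return removed.
strip_cr() {
    tr -d '\r' < "$1"
}
```

For example, `has_crlf scc.txt && strip_cr scc.txt > scc_unix.txt` writes a Unix-format copy only when one is needed (`scc_unix.txt` is a placeholder name).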
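cat > target.sh
# $1 is the *_num.txt file written by num.sh; each line is "column_number:cell_line_name"
while IFS= read -r line
do
# the column number is the part before the first ':'
num=${line%%:*}
# skip the two GCT header lines (version and dimensions), then cut that column
col=$(zcat CCLE_RNAseq_genes_rpkm_20180929.gct.gz | sed '1,2d' | cut -f "${num}")
# unquoted on purpose: echo flattens the column into one space-separated row
echo $col >> "$1_target.txt"
done < "$1"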
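#####Here the column numbers from the previous step are used for the cut; after echo, each column becomes a single row, which can then be redirected to a file.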
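As an alternative to decompressing the file once per cell line, the name lookup and the column extraction can be done in a single awk pass. This is only a sketch under assumptions: the function name `extract_columns` is hypothetical, the input is the decompressed GCT text on stdin (two header lines followed by a tab-separated header row), and the list file holds one exact column name per line in Unix format. Unlike the loop above, it matches names exactly and keeps the genes-in-rows orientation.

```shell
# extract_columns LIST: read GCT text on stdin and print the first (Name)
# column plus every column whose header exactly matches a line of LIST.
extract_columns() {
    awk -F '\t' -v OFS='\t' -v list="$1" '
        BEGIN { while ((getline name < list) > 0) want[name] = 1 }
        NR <= 2 { next }                      # skip GCT version and dimension lines
        NR == 3 {                             # header row: remember wanted column indexes
            n = 0
            for (i = 1; i <= NF; i++) if ($i in want) keep[++n] = i
        }
        {                                     # header and data rows: print kept columns
            row = $1
            for (k = 1; k <= n; k++) row = row OFS $(keep[k])
            print row
        }
    '
}
```

Usage would look like `zcat CCLE_RNAseq_genes_rpkm_20180929.gct.gz | extract_columns scc.txt > scc_expr.txt`, where `scc_expr.txt` is a placeholder output name.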
zscore
For mRNA and microRNA expression data, we typically compute the relative expression of an individual gene and tumor to the gene's expression distribution in a reference population. That reference population is all samples that are diploid for the gene in question (by default for mRNA), or normal samples (when specified), or all profiled samples. The returned value indicates the number of standard deviations away from the mean of expression in the reference population (Z-score). This measure is useful to determine whether a gene is up- or down-regulated relative to the normal samples or all other tumor samples.
The z-scores are calculated using only patient data. Hence, overexpressed in this case implies higher expression than the average patient.
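The definition above, z = (x - mean) / sd over a reference population, can be sketched in awk. Assumptions, none of which come from the quoted description: the reference population is simply all samples in the matrix, the sample standard deviation (n - 1 denominator) is used, the input is tab-separated with a header row and the gene ID in column 1, and the function name `row_zscore` is made up.

```shell
# row_zscore: read a tab-separated matrix on stdin (header row, gene ID in
# column 1) and print each value as (x - mean) / sd of its own row.
row_zscore() {
    awk -F '\t' -v OFS='\t' '
        NR == 1 { print; next }                    # keep the header row as is
        NF < 2  { print; next }                    # nothing to scale on this row
        {
            n = NF - 1; sum = 0; ss = 0
            for (i = 2; i <= NF; i++) sum += $i
            mean = sum / n
            for (i = 2; i <= NF; i++) ss += ($i - mean) ^ 2
            sd = (n > 1) ? sqrt(ss / (n - 1)) : 0  # sample SD (an assumption)
            row = $1
            for (i = 2; i <= NF; i++)
                row = row OFS ((sd > 0) ? sprintf("%.4f", ($i - mean) / sd) : "NA")
            print row
        }
    '
}
```

Rows with zero variance get `NA` rather than a division by zero. A positive value then means the gene is expressed above the average of the samples in that matrix.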
3 R part
A Wilcoxon test was used to calculate the p-value in every comparison, and Benjamini-Hochberg adjustment was applied to assess the false discovery rate (FDR) across multiple comparisons. Genes co-up-regulated (fold change > 2 and FDR < 0.05) in the NE vs. non-NE comparisons of all four data sets were subjected to the following network analysis.
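The final selection step (fold change > 2 and FDR < 0.05 in every data set) can be sketched in shell once the per-data-set statistics exist. The test and the adjustment themselves are not shown here; in R they would typically come from `wilcox.test` and `p.adjust(method = "BH")`. The table layout is an assumption: one tab-separated file per data set with a header line and the columns gene, fold change (NE over non-NE, linear scale) and FDR. The function name `co_up_genes` is made up.

```shell
# co_up_genes FILE...: print the genes that pass fold change > 2 and
# FDR < 0.05 in every FILE.
# Assumed columns: gene, fold_change, FDR; the first line is a header.
co_up_genes() {
    awk -F '\t' -v nfiles="$#" '
        FNR == 1 { next }                              # skip each header line
        $2 > 2 && $3 < 0.05 && !seen[FILENAME, $1]++ { hits[$1]++ }
        END { for (g in hits) if (hits[g] == nfiles) print g }
    ' "$@" | sort
}
```

Called with the four result tables, for example `co_up_genes beltran.tsv lin.tsv lung.tsv nervous.tsv` (placeholder file names), it prints the sorted list of genes up-regulated in all of them.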