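MutSig stands for "Mutation Significance". Put simply, it pools all the tumor samples, tallies their mutations, and derives a significance threshold; genes mutated beyond that threshold are reported as significantly mutated.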
CGA: mutsig
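I used the script recommended on the official site. Since I have no MATLAB license, I ran it with the free MCR (MATLAB Compiler Runtime). Apart from my own MAF file, all the other inputs are downloaded files.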
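library(maftools)
# maffile holds the path to your own MAF file
maf <- read.maf(maf = maffile)
mafcorrect <- prepareMutSig(maf = maf)  # adjust gene symbols for MutSig compatibility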
run_MutSigCV.sh <path_to_MCR> my_mutations.maf exome_full192.coverage.txt gene.covariates.txt my_results mutation_type_dictionary_file.txt chr_files_hg19
The output includes the significant-gene table: one gene per row, followed by its q-value, sorted by q-value.
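As a minimal sketch of reading that table, the Python snippet below keeps genes at or below a q-value cutoff. It assumes the results are a tab-separated file with `gene` and `q` columns; the sample rows, the 0.1 cutoff, and the helper name `significant_genes` are illustrative assumptions, not real MutSigCV output.

```python
import csv
import io

def significant_genes(handle, q_cutoff=0.1):
    """Return (gene, q) pairs with q <= q_cutoff, sorted by q (ascending)."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        q = float(row["q"])
        if q <= q_cutoff:
            hits.append((row["gene"], q))
    return sorted(hits, key=lambda pair: pair[1])

# Made-up rows for illustration only (assumed column names: gene, p, q).
example = io.StringIO(
    "gene\tp\tq\n"
    "GENE_A\t1e-8\t2e-5\n"
    "GENE_B\t0.2\t0.9\n"
    "GENE_C\t1e-4\t0.03\n"
)
hits = significant_genes(example)
print(hits)
```

To use it on a real run, open the results file and pass the file handle in place of `example`.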
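GISTIC is a tool released by the Broad Institute for identifying driver genes targeted by somatic copy-number alterations. Installation is a bit fiddly; see INSTALL.txt, or for a Chinese-language guide see: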
1. Segmentation File (-seg) (REQUIRED)
The column headers are:
(1) Sample (sample name)
(2) Chromosome (chromosome number)
(3) Start Position (segment start position, in bases)
(4) End Position (segment end position, in bases)
(5) Num Markers (number of markers in segment)
(6) Seg.CN (log2(copy number) - 1)
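The Seg.CN column can be illustrated with a short Python sketch, reading "log2() -1 of copy number" as log2(copy number) - 1, so a diploid segment (copy number 2) maps to 0. The helper names `seg_cn` and `format_seg_row` are hypothetical, and the four-decimal formatting is an arbitrary choice.

```python
import math

def seg_cn(copy_number):
    """Convert an absolute copy number to Seg.CN = log2(copy_number) - 1."""
    if copy_number <= 0:
        raise ValueError("copy number must be positive to take log2")
    return math.log2(copy_number) - 1

def format_seg_row(sample, chrom, start, end, num_markers, copy_number):
    """Build one tab-separated segmentation row in the six-column order above."""
    fields = [sample, chrom, str(start), str(end), str(num_markers),
              f"{seg_cn(copy_number):.4f}"]
    return "\t".join(fields)

# Hypothetical segment: 4 copies over 10 markers on chromosome 1.
print(format_seg_row("S1", "1", 100, 200, 10, 4))
```

A single-copy loss (copy number 1) gives -1 and a gain to 4 copies gives +1, which is why amplitude thresholds are expressed around zero.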
2. Markers File (-mk) (optional)
The markers file identifies the marker positions used in the original dataset (before segmentation) for array or capture experiments.
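3. Reference Genome File (-refgene) (REQUIRED)
The GISTIC installation ships reference genome files in the refgenefiles/ folder. They were created in MATLAB and are stored in .mat format, so they cannot be inspected as plain text; pick the one that matches the reference genome build you used.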
4. Array List File (-alf) (optional)
5. CNV File (-cnv) (optional)
This file is used to exclude germline CNVs.
1. All Lesions File (all_lesions.conf_XX.txt, where XX is the confidence level)
Region Data
Columns 1-9 present the data about the significant regions as follows:
(1) Unique Name: A name assigned to identify the region
(2) Descriptor: The genomic descriptor of that region.
(3) Wide Peak Limits: The "wide peak" boundaries most likely to contain the targeted genes. These are listed in genomic coordinates and marker (or probe) indices.
(4) Peak Limits: The boundaries of the region of maximal amplification or deletion.
(5) Region Limits: The boundaries of the entire significant region of amplification or deletion.
(6) q-values: The q-value of the peak region.
(7) Residual q-values: The q-value of the peak region after removing ("peeling off") amplifications or deletions that overlap other, more significant peak regions in the same chromosome.
(8) Broad or Focal: Identifies whether the region reaches significance due primarily to broad events (called "broad"), focal events (called "focal"), or independently significant broad and focal events (called "both").
(9) Amplitude Threshold: Key giving the meaning of values in the subsequent columns associated with each sample.
Sample Data
Each of the analyzed samples is represented in one of the columns following the lesion data (columns 10 through end). The data contained in these columns varies slightly by section of the file.
A '0' indicates that the copy number of the sample was not amplified or deleted beyond the threshold amount in that peak region. A '1' indicates that the sample had low-level copy number aberrations (exceeding the low threshold indicated in column 9), and a '2' indicates that the sample had high-level copy number aberrations (exceeding the high threshold indicated in column 9).
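The 0/1/2 encoding above can be summarized per peak with a small Python sketch. It assumes one row of the file has already been split on tabs, that the first nine fields are the region data, and that the remaining fields hold the 0/1/2 calls; it only makes sense for rows whose sample columns carry these calls. The function name `tally_calls` and the example row are hypothetical.

```python
from collections import Counter

def tally_calls(fields, first_sample_col=9):
    """Count 0/1/2 calls in the sample columns (column 10 onward) of one row."""
    # Skip empty fields, e.g. one produced by a trailing tab.
    calls = [int(value) for value in fields[first_sample_col:] if value.strip()]
    counts = Counter(calls)
    return {"none": counts[0], "low": counts[1], "high": counts[2]}

# Made-up row: nine placeholder region fields, then five sample calls.
row = ["region"] * 9 + ["0", "1", "2", "1", "0", ""]
summary = tally_calls(row)
print(summary)
```

Dividing the "low" plus "high" counts by the number of samples gives the fraction of samples carrying the lesion.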
2. Amplification/Deletion Genes File (amp(/del)_genes.conf_XX.txt, where XX is the confidence level)
The amp genes file contains one column for each amplification peak identified in the GISTIC analysis. The first four rows are:
(1) cytoband
(2) q-value
(3) residual q-value
(4) wide peak boundaries
3. GISTIC Scores File (scores.gistic)
The scores file lists the q-values [presented as -log10(q)], G-scores, average amplitudes among aberrant samples, and frequency of aberration, across the genome for both amplifications and deletions.
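Because the scores file stores q-values as -log10(q), converting back is a one-liner. The Python sketch below does that and applies a cutoff; the 0.25 default and the helper names `q_from_neg_log10` and `is_significant` are assumptions made for illustration.

```python
def q_from_neg_log10(neg_log10_q):
    """Recover q from its -log10(q) representation."""
    return 10.0 ** (-neg_log10_q)

def is_significant(neg_log10_q, q_cutoff=0.25):
    """True when the recovered q-value is at or below the cutoff."""
    return q_from_neg_log10(neg_log10_q) <= q_cutoff

# A stored value of 2 corresponds to q = 0.01.
print(q_from_neg_log10(2.0), is_significant(2.0))
```

Larger stored values therefore mean smaller q-values, so the highest peaks in a -log10(q) plot are the most significant regions.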