Expression quantification
1. Installing R on a Linux system
R could normally be installed system-wide, but without administrator privileges we install it through conda instead.
conda create -n R
#create an environment dedicated to R
conda activate R
#activate the R environment
conda search r
#check which R versions are available in conda; prefer a relatively recent one
conda install r-base=4.3
#install R; R and R packages are generally hosted in the conda-forge channel, so R packages are installed as follows
conda install -c conda-forge -y xxx
#install a given R package (replace xxx with the package name)
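For example, the argparser package needed by the quantification script later in this section can be installed this way; conda-forge names R packages with an r- prefix, and r-argparser is assumed to be available in that channel:
conda install -c conda-forge -y r-argparser
#install the argparser R package from the conda-forge channel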
2. Installing the read-counting software subread
2.1 Installing from the official website
wget https://sourceforge.net/projects/subread/files/subread-2.0.6/subread-2.0.6-Linux-x86_64.tar.gz
#download the Linux build of subread with wget
tar -zxvf subread-2.0.6-Linux-x86_64.tar.gz
#extract the archive
mv subread-2.0.6-Linux-x86_64 subread
#rename it to something simpler
export PATH=/path/to/your/subread/bin:$PATH
#add the subread bin directory to the PATH environment variable
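To check that the installation worked, print the version of featureCounts, the counting tool shipped with subread:
featureCounts -v
#prints the featureCounts version if the subread bin directory is on the PATH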
2.2 Installing with R
conda activate R
#activate the R environment; the following two commands are run inside an R session
install.packages("BiocManager")
#install BiocManager
BiocManager::install("Rsubread")
#install Rsubread with BiocManager
2.3 Installing with conda
conda create -n subread
#create an environment for subread
conda activate subread
#activate the environment
conda install -c bioconda subread=2.0.6
#install subread from the bioconda channel
3. Expression quantification
3.1 Installing the required R packages
Before running the R script that calculates FPKM values, the limma and edgeR packages must be installed, and both are installed via BiocManager (inside an R session):
BiocManager::install("limma")
#install the limma package
BiocManager::install("edgeR")
#install the edgeR package
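A quick way to confirm that everything the script needs can be loaded is a one-line check from the shell (assuming the R environment is active):
Rscript -e 'library(argparser); library(Rsubread); library(limma); library(edgeR)'
#fails with an error message if any of the four packages is missing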
Below is the R script provided in the course (run-featurecounts.R):
#!/usr/bin/env Rscript
# parse parameter ---------------------------------------------------------
library(argparser, quietly=TRUE)
# Create a parser
p <- arg_parser("run featureCounts and calculate FPKM/TPM")
# Add command line arguments
p <- add_argument(p, "--bam", help="input: bam file", type="character")
p <- add_argument(p, "--gtf", help="input: gtf file", type="character")
p <- add_argument(p, "--output", help="output prefix", type="character")
# Parse the command line arguments
argv <- parse_args(p)
library(Rsubread)
library(limma)
library(edgeR)
bamFile <- argv$bam
gtfFile <- argv$gtf
nthreads <- 1
outFilePref <- argv$output
outStatsFilePath <- paste(outFilePref, '.log', sep = '');
outCountsFilePath <- paste(outFilePref, '.count', sep = '');
fCountsList = featureCounts(bamFile, annot.ext=gtfFile, isGTFAnnotationFile=TRUE, nthreads=nthreads, isPairedEnd=FALSE)
dgeList = DGEList(counts=fCountsList$counts, genes=fCountsList$annotation)
fpkm = rpkm(dgeList, dgeList$genes$Length)
tpm = exp(log(fpkm) - log(sum(fpkm)) + log(1e6))
write.table(fCountsList$stat, outStatsFilePath, sep="\t", col.names=FALSE, row.names=FALSE, quote=FALSE)
featureCounts = cbind(fCountsList$annotation[,1], fCountsList$counts, fpkm, tpm)
colnames(featureCounts) = c('gene_id', 'counts', 'fpkm','tpm')
write.table(featureCounts, outCountsFilePath, sep="\t", col.names=TRUE, row.names=FALSE, quote=FALSE)
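Two lines are worth noting: rpkm() from edgeR computes FPKM from the raw counts and the gene lengths, and the next line converts FPKM to TPM on the log scale, which is equivalent to TPM_i = 10^6 * FPKM_i / sum_j(FPKM_j), i.e. the FPKM values rescaled so that they sum to one million within the sample.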
The basic command to run this script is:
Rscript run-featurecounts.R -b /home/monkeyflower/xiaodeng/transcripotom/filter_and_qc/filtered/BLO_S1_LD1.bam -g /home/monkeyflower/xiaodeng/transcripotom/ref/genes.gtf -o BLO_S1_LD1
#-b takes the bam alignment file, -g takes the genome gtf file, and -o takes the prefix you want for the output files
Note that for paired-end sequencing data you need to change isPairedEnd=FALSE to isPairedEnd=TRUE in the featureCounts() call (the part reading 'isGTFAnnotationFile=TRUE, nthreads=nthreads, isPairedEnd=FALSE').
3.2 Generating the batch script with awk
awk '{print "Rscript run-featurecounts.R -b /path/to/"$2".bam -g /path/to/genes.gtf -o "$2}' sample.txt > step1_count.sh
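The command reads the sample names from the second column of sample.txt and writes one Rscript call per sample. Assuming sample.txt has two whitespace-separated columns (for example group and sample name; the names below are only illustrative):
LD1	BLO_S1_LD1
LD1	BLO_S2_LD1
step1_count.sh will then contain one line per sample, e.g.
Rscript run-featurecounts.R -b /path/to/BLO_S1_LD1.bam -g /path/to/genes.gtf -o BLO_S1_LD1
and can be run with sh step1_count.sh (or nohup sh step1_count.sh & to keep it running in the background).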
3.3 Interpreting the log and count files
The log file records how many fragments were assigned to genes: Unassigned_Unmapped is the number of reads that did not map to the genome, and Unassigned_NoFeatures is the number that mapped to the genome but did not overlap any gene feature.
In the count file, the first column is the gene ID, the second is the number of reads falling within that gene, the third is FPKM and the fourth is TPM. Although the FPKM calculation here has a flaw, its trend agrees with the TPM values, so the conclusions are not affected.
Generating the expression matrix
1. Using abundance_estimates_to_matrix.pl to generate the normalized matrix
1.1 Basic commands
1.1.1 Simple usage
script/abundance_estimates_to_matrix.pl --est_method <method> sample1.results sample2.results --out_prefix xxx
--est_method specifies the quantification method; <method> can be featureCounts | RSEM | eXpress | kallisto | salmon. featureCounts is used for reference-based transcriptomes, while RSEM and eXpress are used for de novo (reference-free) transcriptomes. sample1.results and sample2.results are the .count files produced by the quantification step above.
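For example, with two count files from the previous step (the sample names here are only illustrative), the call would be:
perl script/abundance_estimates_to_matrix.pl --est_method featureCounts BLO_S1_LD1.count BLO_S2_LD1.count --out_prefix genes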
1.1.2 Advanced usage
ls /path/to/the/2.Quantification/*.count > genes.quant_files.txt
#first collect the paths of all count files into genes.quant_files.txt; /path/to/the/2.Quantification/*.count must be the absolute path to the count files
perl script/abundance_estimates_to_matrix.pl --est_method featureCounts --quant_files genes.quant_files.txt --out_prefix genes
#then run the perl script on that file list
Put the two commands above into a script named merge_result.sh.
1.2 Preparing the other files
In the same directory as merge_result.sh, create a folder named script, and inside it a folder named support_scripts. Put the abundance_estimates_to_matrix.pl script in the script folder and the run_TMM_scale_matrix.pl script in the support_scripts folder, and give both scripts executable permissions.
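The expected layout and permission commands, as a sketch:
merge_result.sh
script/
    abundance_estimates_to_matrix.pl
    support_scripts/
        run_TMM_scale_matrix.pl
chmod a+x script/abundance_estimates_to_matrix.pl script/support_scripts/run_TMM_scale_matrix.pl
#give both scripts executable permission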
The abundance_estimates_to_matrix.pl script:
#!/usr/bin/env perl
use strict;
use warnings;
use FindBin;
use File::Basename;
use Getopt::Long qw(:config no_ignore_case bundling pass_through);
my $usage = <<__EOUSAGE__;
############################################################
#
# Usage: $0 --est_method <method> sample1.results sample2.results ...
#
# or $0 --est_method <method> --quant_files file.listing_target_files.txt
#
# Note, if only a single input file is given, it's expected to contain the paths to all the target abundance estimation files.
#
# Required:
#
# --est_method <string> featureCounts|RSEM|eXpress|kallisto|salmon (needs to know what format to expect)
#
# Options:
#
# --cross_sample_norm <string> TMM|UpperQuartile|none (default: TMM)
#
# --name_sample_by_basedir name sample column by dirname instead of filename
# --basedir_index <int> default(-2)
#
# --out_prefix <string> default: 'matrix'
#
# --quant_files <string> file containing a list of all the target files.
#
############################################################
__EOUSAGE__
;
my $help_flag;
my $est_method;
my $val_type;
my $cross_sample_norm = "TMM";
my $name_sample_by_basedir = 0;
my $out_prefix = "matrix";
my $basedir_index = -2;
my $quant_files = "";
&GetOptions('help|h' => \$help_flag,
'est_method=s' => \$est_method,
'cross_sample_norm=s' => \$cross_sample_norm,
'name_sample_by_basedir' => \$name_sample_by_basedir,
'out_prefix=s' => \$out_prefix,
'basedir_index=i' => \$basedir_index,
'quant_files=s' => \$quant_files,
);
unless ($est_method && (@ARGV || $quant_files)) {
die $usage;
}
unless ($est_method =~ /^(featureCounts|RSEM|eXpress|kallisto|salmon)/i) {
die "Error, dont recognize --est_method $est_method ";
}
unless ($cross_sample_norm =~ /^(TMM|UpperQuartile|none)$/i) {
die "Error, dont recognize --cross_sample_norm $cross_sample_norm ";
}
my @files;
if ($quant_files) {
# allow for a file listing the various files.
@files = `cat $quant_files`;
chomp @files;
}
elsif (@ARGV) {
@files = @ARGV;
}
else {
die $usage;
}
=data_formats
## featureCounts
0 gene_id
1 counts
2 fpkm
3 tpm
## RSEM:
0 transcript_id
1 gene_id
2 length
3 effective_length
4 expected_count
5 TPM
6 FPKM
7 IsoPct
## eXpress v1.5:
1 target_id
2 length
3 eff_length
4 tot_counts
5 uniq_counts
6 est_counts
7 eff_counts
8 ambig_distr_alpha
9 ambig_distr_beta
10 fpkm
11 fpkm_conf_low
12 fpkm_conf_high
13 solvable
14 tpm
## kallisto:
0 target_id
1 length
2 eff_length
3 est_counts
4 tpm
## salmon:
0 Name
1 Length
2 EffectiveLength
3 TPM
4 NumReads
=cut
;
my ($acc_field, $counts_field, $fpkm_field, $tpm_field);
if ($est_method =~ /^rsem$/i) {
$acc_field = 0;
$counts_field = "expected_count";
$fpkm_field = "FPKM";
$tpm_field = "TPM";
}
elsif ($est_method =~ /^express$/i) { # as of v1.5
$acc_field = "target_id";
$counts_field = "eff_counts";
$fpkm_field = "fpkm";
$tpm_field = "tpm";
}
elsif ($est_method =~ /^kallisto$/i) {
$acc_field = "target_id";
$counts_field = "est_counts";
$fpkm_field = "tpm";
$tpm_field = "tpm";
}
elsif ($est_method =~ /^salmon/) {
$acc_field = "Name";
$counts_field = "NumReads";
$fpkm_field = "TPM";
$tpm_field = "TPM";
}
elsif ($est_method =~ /^featureCounts/) {
$acc_field = "gene_id";
$counts_field = "counts";
$fpkm_field = "fpkm";
$tpm_field = "tpm";
}
else {
die "Error, dont recognize --est_method [$est_method] ";
}
main: {
my %data;
foreach my $file (@files) {
print STDERR "-reading file: $file\n";
open (my $fh, $file) or die "Error, cannot open file $file";
my $header = <$fh>;
chomp $header;
my %fields = &parse_field_positions($header);
#use Data::Dumper; print STDERR Dumper(\%fields);
while (<$fh>) {
chomp;
my @x = split(/\t/);
my $acc = $x[ $fields{$acc_field} ];
my $count = $x[ $fields{$counts_field} ];
my $fpkm = $x[ $fields{$fpkm_field} ];
my $tpm = $x[ $fields{$tpm_field} ];
$data{$acc}->{$file}->{count} = $count;
$data{$acc}->{$file}->{FPKM} = $fpkm;
$data{$acc}->{$file}->{TPM} = $tpm;
}
close $fh;
}
my @filenames = @files;
foreach my $file (@filenames) {
if ($name_sample_by_basedir) {
my @path = split(m|/|, $file);
$file = $path[$basedir_index];
}
else {
$file = basename($file);
}
}
print STDERR "\n\n* Outputting combined matrix.\n\n";
my $counts_matrix_file = "$out_prefix.counts.matrix";
my $TPM_matrix_file = "$out_prefix.TPM.not_cross_norm";
open (my $ofh_counts, ">$counts_matrix_file") or die "Error, cannot write file $counts_matrix_file";
open (my $ofh_TPM, ">$TPM_matrix_file") or die "Error, cannot write file $TPM_matrix_file";
{ # check to see if they're unique
my %filename_map = map { + $_ => 1 } @filenames;
if (scalar keys %filename_map != scalar @filenames) {
die "Error, the column headings: @filenames are not unique. Should you consider using the --name_sample_by_basedir parameter?";
}
}
# clean up matrix headers
foreach my $file (@filenames) {
# also, get rid of the part of the filename that RSEM adds
$file =~ s/\.(genes|isoforms)\.results$//;
# featureCounts
$file =~ s/\.count$//;
}
print $ofh_counts join("\t", "", @filenames) . "\n";
print $ofh_TPM join("\t", "", @filenames) . "\n";
foreach my $acc (keys %data) {
print $ofh_counts "$acc";
print $ofh_TPM "$acc";
foreach my $file (@files) {
my $count = $data{$acc}->{$file}->{count};
unless (defined $count) {
$count = "NA";
}
my $tpm = $data{$acc}->{$file}->{TPM};
if (defined $tpm) {
$tpm = $tpm/1;
}
else {
$tpm = "NA";
}
print $ofh_counts "\t$count";
print $ofh_TPM "\t$tpm";
}
print $ofh_counts "\n";
print $ofh_TPM "\n";
}
close $ofh_counts;
close $ofh_TPM;
if (scalar @files > 1) {
## more than one sample
if ($cross_sample_norm =~ /^TMM$/i) {
my $cmd = "$FindBin::RealBin/support_scripts/run_TMM_scale_matrix.pl --matrix $TPM_matrix_file > $out_prefix.$cross_sample_norm.EXPR.matrix";
&process_cmd($cmd);
}
elsif ($cross_sample_norm =~ /^UpperQuartile$/) {
my $cmd = "$FindBin::RealBin/support_scripts/run_UpperQuartileNormalization_matrix.pl --matrix $TPM_matrix_file > $out_prefix.$cross_sample_norm.EXPR.matrix";
&process_cmd($cmd);
}
elsif ($cross_sample_norm =~ /^none$/i) {
print STDERR "-not performing cross-sample normalization.\n";
}
}
else {
unless (scalar @files == 1) {
die "Error, no target samples. Shouldn't get here.";
}
print STDERR "Warning, only one sample, so not performing cross-sample normalization\n";
}
print STDERR "Done.\n\n";
exit(0);
}
####
sub process_cmd {
my ($cmd) = @_;
print STDERR $cmd;
my $ret = system($cmd);
if ($ret) {
die "Error, CMD: $cmd died with ret $ret";
}
return;
}
####
sub parse_field_positions {
my ($header) = @_;
my %field_pos;
my @fields = split(/\s+/, $header);
for (my $i = 0; $i <= $#fields; $i++) {
$field_pos{$fields[$i]} = $i;
$field_pos{$i} = $i; # for fixed column assignment
}
return(%field_pos);
}
The run_TMM_scale_matrix.pl script:
#!/usr/bin/env perl
use strict;
use warnings;
use Carp;
use Getopt::Long qw(:config no_ignore_case bundling);
use Cwd;
use FindBin;
use File::Basename;
use lib ("$FindBin::RealBin/../../PerlLib");
use Data::Dumper;
my $usage = <<__EOUSAGE__;
#################################################################################################
#
# Required:
#
# --matrix <string> matrix of raw read counts (not normalized!)
#
################################################################################################
__EOUSAGE__
;
my $matrix_file;
my $help_flag;
&GetOptions ( 'h' => \$help_flag,
'matrix=s' => \$matrix_file,
);
if ($help_flag) {
die $usage;
}
unless ($matrix_file) {
die $usage;
}
main: {
my $tmm_info_file = &run_TMM($matrix_file);
&write_normalized_file($matrix_file, $tmm_info_file);
exit(0);
}
####
sub run_TMM {
my ($counts_matrix_file) = @_;
my $tmm_norm_script = "__tmp_runTMM.R";
open (my $ofh, ">$tmm_norm_script") or die "Error, cannot write to $tmm_norm_script";
#print $ofh "source(\"$FindBin::RealBin/R/edgeR_funcs.R\")\n";
print $ofh "library(edgeR)\n\n";
print $ofh "rnaseqMatrix = read.table(\"$counts_matrix_file\", header=T, row.names=1, com='', check.names=F)\n";
print $ofh "rnaseqMatrix = as.matrix(rnaseqMatrix)\n";
print $ofh "rnaseqMatrix = round(rnaseqMatrix)\n";
print $ofh "exp_study = DGEList(counts=rnaseqMatrix, group=factor(colnames(rnaseqMatrix)))\n";
print $ofh "exp_study = calcNormFactors(exp_study)\n";
print $ofh "exp_study\$samples\$eff.lib.size = exp_study\$samples\$lib.size * exp_study\$samples\$norm.factors\n";
print $ofh "write.table(exp_study\$samples, file=\"$counts_matrix_file.TMM_info.txt\", quote=F, sep=\"\\t\", row.names=F)\n";
close $ofh;
&process_cmd("Rscript $tmm_norm_script 1>&2 ");
return("$counts_matrix_file.TMM_info.txt");
}
####
sub process_cmd {
my ($cmd) = @_;
print STDERR "CMD: $cmd\n";
my $ret = system($cmd);
if ($ret) {
die "Error, cmd: $cmd died with ret ($ret) ";
}
return;
}
####
sub write_normalized_file {
my ($matrix_file, $tmm_info_file) = @_;
my %col_to_eff_lib_size;
my %col_to_norm_factor;
open (my $fh, $tmm_info_file) or die "Error, cannot open file $tmm_info_file";
my %bloom_to_col;
my $header = <$fh>;
while (<$fh>) {
chomp;
my @x = split(/\t/);
my ($col, $norm_factor, $eff_lib_size) = ($x[0], $x[2], $x[3]);
$col =~ s/\"//g;
my $bloom = $col;
$bloom =~ s/\W/$;/g;
$col_to_eff_lib_size{$col} = $eff_lib_size;
$col_to_norm_factor{$col} = $norm_factor;
if ($bloom ne $col) {
if (exists $bloom_to_col{$bloom}) {
die "Error, already stored $bloom_to_col{$bloom} for $bloom, but trying to also store $col here... Ensure column names are unique according to non-word characters.";
}
$col_to_eff_lib_size{$bloom} = $eff_lib_size;
$col_to_norm_factor{$bloom} = $norm_factor;
}
}
close $fh;
open ($fh, $matrix_file);
$header = <$fh>;
chomp $header;
my @pos_to_col = split(/\t/, $header);
my $check_column_ordering_flag = 0;
while (<$fh>) {
chomp;
my @x = split(/\t/);
unless ($check_column_ordering_flag) {
if (scalar(@x) == scalar(@pos_to_col) + 1) {
## header is offset, as is acceptable by R
## not acceptable here. fix it:
unshift (@pos_to_col, "");
}
$check_column_ordering_flag = 1;
print join("\t", @pos_to_col) . "\n";
}
my $gene = $x[0];
print $gene;
for (my $i = 1; $i <= $#x; $i++) {
my $col = $pos_to_col[$i];
$col =~ s/\"//g;
my $adj_col = $col;
$adj_col =~ s/-/\./g;
my $bloom = $col;
$bloom =~ s/\W/$;/g;
my $eff_lib_size = $col_to_eff_lib_size{$col}
|| $col_to_eff_lib_size{$bloom}
|| $col_to_eff_lib_size{$adj_col}
|| $col_to_eff_lib_size{"X$col"}
|| die "Error, no eff lib size for [$col] or [$bloom] or [$adj_col] or [\"X$col\"]" . Dumper(\%col_to_eff_lib_size);
my $norm_factor = $col_to_norm_factor{$col}
|| $col_to_norm_factor{$bloom}
|| $col_to_norm_factor{$adj_col}
|| $col_to_norm_factor{"X$col"}
|| die "Error, no normalization scaling factor for $col" . Dumper(\%col_to_norm_factor);
my $read_count = $x[$i];
my $converted_val = sprintf("%.3f", $read_count * 1/$norm_factor);
print "\t$converted_val";
}
print "\n";
}
return;
}
1.3 Running the program
chmod a+x run_TMM_scale_matrix.pl
#give the perl script executable permission
conda activate R
#activate the R environment (the TMM step calls Rscript and needs edgeR)
nohup sh merge_result.sh &
#run the merge script in the background with nohup
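With --out_prefix genes and the default TMM normalization, the run should produce three files (the names follow from the script's output conventions): genes.counts.matrix (raw counts per gene per sample), genes.TPM.not_cross_norm (TPM values before cross-sample normalization), and genes.TMM.EXPR.matrix (TPM values after TMM cross-sample normalization, typically the matrix used for downstream analyses).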