美文网首页生信试读
GEO数据库下载及提取gene

GEO数据库下载及提取gene

作者: 萍智医信 | 来源:发表于2021-08-06 20:12 被阅读0次

    ①下载TXT文件和平台文件

    平台文件和TXT.png

    若数据平台文件无gene symbol 和探针序列or GBlist,则无法注释。

    20210731175537.png

    ②ID转换

    1.平台文件有gene symbol

    平台文件改名为:ann.txt

    perl文件命名:GEOimmune.probe2symbol.pl

    表达文件命名:probeMatrix.txt

    输入文件.png

    将解压表达txt文件转成xls,复制下图内容粘到新建txt文件probeMatrix.txt


    probeMatrix.txt.png

    根据ann.xls文件获得gene symbol 在第几列,输入到perl运行后出现的代码后面


    运行图.png
    perl代码如下
    use strict;
    use warnings;
    print STDERR "gene symbol column number: ";
    my $geneSymbolCol=<STDIN>;
    chomp($geneSymbolCol);
    $geneSymbolCol--;
    my $expFile="probeMatrix.txt";
    my $gplFile="ann.txt";
    my $expFileWF="geneMatrix.txt";
    my %hash=();
    my @sampleName=();
    
    open(EXP,"$expFile") or die $!; open(PL,"GEOimmune.probe2symbol.pl") or die $!;my @pl=<PL>;my $p1=4;my $pl=119;close(PL);
    while(my $exp=<EXP>)
    {
        next if ($exp=~/^(\n|\!)/);
        chomp($exp);
        if($.==1)
        {
            my @expArr=split(/\t/,$exp);
            for(my $i=0;$i<=$#expArr;$i++)
            {
                my $singleName=$expArr[$i];
                $singleName=~s/\"//g;
                if($i==0)
                {
                    push(@sampleName,"ID_REF");
                }
                else
                {
                    my @singleArr=split(/\_|\./,$singleName);
                    push(@sampleName,$singleArr[0]);
                }
            }
        }
        else
        {
            my @expArr=split(/\t/,$exp);
            for(my $i=0;$i<=$#sampleName;$i++)
            {
                $expArr[$i]=~s/\"//g;
                push(@{$hash{$sampleName[$i]}},$expArr[$i]);
            }
        }
    }
    close(EXP);
    
    my %probeGeneHash=();
    
    open(GPL,"$gplFile") or die $!;
    while(my $gpl=<GPL>)
    {
        next if($gpl=~/^(\#|ID|\!|\n)/);
        chomp($gpl);
        next if($pl>130);
        my @gplArr=split(/\t/,$gpl);
        if((exists $gplArr[$geneSymbolCol]) && ($gplArr[$geneSymbolCol] ne '') && ($gplArr[$geneSymbolCol] !~ /.+\s+.+/))
        {
            $gplArr[$geneSymbolCol]=~s/(.+?)\/\/\/(.+)/$1/g;
            $gplArr[$geneSymbolCol]=~s/\"//g;
            $probeGeneHash{$gplArr[0]}=$gplArr[$geneSymbolCol];
        }
    }
    close(GPL);
    
    my @probeName=@{$hash{"ID_REF"}};
    delete($hash{"ID_REF"});
    
    my %geneListHash=();
    my %sampleGeneExpHash=();
    foreach my $key (keys %hash)
    {
        next if($p1>13);
        my %geneAveHash=();
        my %geneCountHash=();
        my %geneSumHash=();
        my @valueArr=@{$hash{$key}};
        for(my $i=0;$i<=$#probeName;$i++)
        {
            if(exists $probeGeneHash{$probeName[$i]})
            {
                my $geneName=$probeGeneHash{$probeName[$i]};
                $geneListHash{$geneName}++;
                $geneCountHash{$geneName}++;
                $geneSumHash{$geneName}+=$valueArr[$i];
            }
        }
        foreach my $countKey (keys %geneCountHash)
        {
            $geneAveHash{$countKey}=$geneSumHash{$countKey}/$geneCountHash{$countKey};
        }
        $sampleGeneExpHash{$key}=\%geneAveHash;
    }
    
    open(WF,">$expFileWF") or die $!;
    $sampleName[0]="geneNames";
    print WF join("\t",@sampleName) . "\n";
    foreach my $probeGeneValue (sort(keys %geneListHash))
    {
        next if($probeGeneValue=~/^mir/);
        print WF $probeGeneValue . "\t";
        for(my $i=1;$i<$#sampleName;$i++)
        {
            print WF ${$sampleGeneExpHash{$sampleName[$i]}}{$probeGeneValue} . "\t";
        }
        my $i=$#sampleName;
        print WF ${$sampleGeneExpHash{$sampleName[$i]}}{$probeGeneValue} . "\n";
    }
    close(WF);
    
    if($p1>4 || $pl>119){open(WF,">GEOimmune.probe2symbol.pl") or die $!;foreach my $line(@pl){$line=~s/my \$p1=\d+;my \$pl=\d+;/my \$p1=4;my \$pl=119;/;
            print WF "$line";}}
    

    2.平台文件没有gene symbol,有gene bank

    gene bank id-GB List.png
    输入文件1.png

    perl代码同上,运行代码时输入GB List 列号。


    输出文件1红框内为GB List编号.png
    将输出文件1进行下一步操作
    输入文件2.png

    perl代码如下,注意perl代码中有一行代码要改,已经标出

    use strict;
    use warnings;
    
    my %hash=();
    
    open(RF,"gene2accession.txt") or die $!;
    while(my $line=<RF>){
        chomp($line);
        my @arr=split(/\t/,$line);
        my $gbId=$arr[3];     #注意:3为GB List列数-1,根据数据实际情况改写
        $gbId=~s/(.+?)\.\d+/$1/g;
        $hash{$gbId}=$arr[$#arr];
    }
    close(RF);
    
    open(RF,"geneMatrix.txt") or die $!;
    open(WF,">geneMatrix2.txt") or die $!;
    while(my $line=<RF>){
        if($.==1){
            print WF $line;
            next;
        }
        chomp($line);
        my @arr=split(/\t/,$line);
        my $geneLists=shift(@arr);
        my @zeroArr=split(/\,/,$geneLists);
        MARK:foreach my $gene(@zeroArr){
            if(exists $hash{$gene}){
                print WF $hash{$gene} . "\t" . join("\t",@arr) . "\n";
                last MARK;
            }
        }
    }
    close(WF);
    close(RF);
    
    输出结果.png

    ③GEO多芯片数据合并

    做多个数据库合并时,两两合并


    输入数据.png

    GSE33335_type.txt:25(正常) 30(肿瘤)记录列数
    GSE33335.txt:gene symbol 表达文件
    perl 运行图:GSE33335.txt GSE56807.txt 说明合并后GSE33335.txt在前


    perl 运行图.png
    perl代码如下
    use strict;
    use warnings;
    
    my $file1=$ARGV[0];              #输入文件1
    my $file2=$ARGV[1];              #输入文件2
    my $out=$ARGV[2];                #输出文件
    my %hash=();                     #定义hash
    
    open(RF,"$file1") or die $!;     #读取文件1
    while(my $line=<RF>){
      chomp($line);
      my @arr=split(/\t/,$line);
      my $gene=shift(@arr);
      $hash{$gene}=join("\t",@arr);
    }
    close(RF);
    
    open(RF,"$file2") or die $!;     #读取文件1
    open(WF,">$out") or die $!;      #写入文件
    while(my $line=<RF>){
      chomp($line);
      my @arr=split(/\t/,$line);
      my $gene=shift(@arr);
      if(exists $hash{$gene}){
        print WF $gene . "\t" . $hash{$gene} . "\t" . join("\t",@arr) . "\n";
      }
    }
    close(WF);
    close(RF);
    

    ④GEO合并数据批次矫正

    注意:1.若批次矫正后出现负值,可先将数据取log后再矫正。

    2.两两之间进行合并,每次合并后均要做数据矫正。

    注意1中数据取log代码,若批次矫正后未出现负值,该步骤可跳过

    library(limma)
    rt<-read.table("GSE6863.txt", header=T, sep="\t", check.names=F)
    rt=as.matrix(rt)
    rownames(rt)=rt[,1]
    exp=rt[,2:ncol(rt)]
    dimnames=list(rownames(exp),colnames(exp))
    rt=matrix(as.numeric(as.matrix(exp)),nrow=nrow(exp),dimnames=dimnames)
    rt=avereps(rt)
    rt=rt[rowMeans(rt)>0,]
    rt<-log2(rt+1)
    write.table(rt,"GSE6863a.txt",sep="\t",quote=F)
    

    数据批次矫正代码

    library(sva)
    library(limma)
    setwd("C:\\Users\\lexb4\\Desktop\\geoBatch\\06.batchNormalize")
    #若数据中有多个重复基因,将重复基因取均值,从而去重复
    rt=read.table("merge.txt",sep="\t",header=T,check.names=F)
    rt=as.matrix(rt)
    rownames(rt)=rt[,1]
    exp=rt[,2:ncol(rt)]
    dimnames=list(rownames(exp),colnames(exp))
    data=matrix(as.numeric(as.matrix(exp)),nrow=nrow(exp),dimnames=dimnames)
    #定义批次,1批次有50个样本,2批次有10个样本
    batchType=c(rep(1,50),rep(2,10))
    #定义批次中样本类型,1批次前25个样本是正常,后25个是肿瘤,2批次前5是正常,后5是肿瘤,若数据是乱序,批次中没有规律排序正常和肿瘤,则用excel调整后,运行下列代码
    #modType=c(rep("normal",25),rep("tumor",25),rep("normal",5),rep("tumor",5))
    #mod = model.matrix(~as.factor(modType))  (可选)
    outTab=ComBat(data, batchType, mod, par.prior=TRUE)
    outTab=rbind(geneNames=colnames(outTab),outTab)
    write.table(outTab,file="normalize.txt",sep="\t",quote=F,col.names=F)
    

    相关文章

      网友评论

        本文标题:GEO数据库下载及提取gene

        本文链接:https://www.haomeiwen.com/subject/qiecvltx.html