一个提取序列的脚本

作者: 正踪大米饭儿 | 来源:发表于2016-12-09 19:32 被阅读341次

从fasta序列中提取特定ID的序列
一个提取序列的脚本
bedtools批量提取基因组指定位置序列
^M引发的惨案
使用TBtools对叶绿体蛋白编码基因进行GO注释
Python3 提取CDS
从fasta文件中批量提取特定序列
人员信息提取其实很简单
GO注释---Tbtools
GEE提取感兴趣区的时间序列

由于过程比较复杂，先附上代码，希望有天能写出多个版本～～～

#!/usr/bin/perl -w
use strict;
use Getopt::Long;

my ($list,$indir,$outdir,$help);

GetOptions(
    "list|l=s"   => \$list,
    "indir|i=s"  => \$indir,
    "outdir|o:s" => \$outdir,
    "help|?"     => \$help
);

die &usage if ( !defined $list || !defined $indir || defined $help);
$outdir ||= ".";
mkdir $outdir,0755 if (not -e $outdir);


open LST,$list or die "Can't open the filtered gene list file $! \n";
my %lst = map{chomp;(split /\t/,$_)[0],1}<LST>;
close LST;

my %genes;
opendir DIR,$indir or die "Can't open the outDIR of RSEM expression\n";
foreach my $file (sort grep(/^.*\.genes\.results$/,readdir(DIR))) {
    open IN,"$indir/$file" or die $!;
    while (<IN>) {
        chomp;
        next if /^gene/;

        my @a = split /\t/,$_;
        if (exists $lst{$a[0]}) {
            my @b = split /,/,$a[1];
            for my $b (@b){
                $genes{$b} = $a[0];
            }
        }
    }
    close IN;
}
closedir DIR;

open OUT,">$outdir/transcript.list";
foreach my $k (sort keys %genes){
    print OUT "$k\t$genes{$k}\n";
}
close OUT;


sub usage {
    print <<USAGE;
Name:
    $0
Description:
    merge the out put file of the rsem < \$sample.genes.result >
Update:
    2016-12-02  genebang
Uasge:
    perl $0 [options] 
Options:
    -l, --list    <string>*    list of filtered gene 
    -i, --indir   <string>*    input files contain the output of the rsem-exprssion-calular result
    -o, --output  <string>     output file
    --help                     print this help information
e.g
    perl $0 -l filter.gene.FPKM.xls -i Test/output -o ./ 
    perl $0 -l filter.gene.FPKM.xls -i Test/output -o ./ 
    perl $0 --list filter.gene.FPKM.xls --indir Test/output --outdir ./  
USAGE
    exit 0;
}

__END__

该脚本可以实现输入一个列表文件，提取某一目录下所有某一类型文件中的所有子集对应的名称。
示例文件如下：

列表文件：

ENSG00000000003.14   
ENSG00000000005.5      
ENSG00000000419.12    
ENSG00000000457.13    
ENSG00000000460.16

文件夹目录列表：

$ ls ./
BC486AA.genes.results
BC486BA.genes.results
BC499AA.genes.results
BC499BA.genes.results
BC505AA.genes.results
BC505BA.genes.results

文件内容：

ENSG00000000003.14      ENST00000373020.8,ENST00000494424.1,ENST00000496771.5,ENST00000612152.4
ENSG00000000005.5       ENST00000373031.4,ENST00000485971.1 
ENSG00000000419.12      ENST00000371582.8,ENST00000371584.8,ENST00000371588.9,ENST00000413082.1
ENSG00000000457.13      ENST00000367770.5,ENST00000367771.10,ENST00000367772.8,ENST00000423670.
ENSG00000000460.16      ENST00000286031.10,ENST00000359326.8,ENST00000413811.3,ENST00000459772.

结果输出：

ENST00000000233.9       ENSG00000004059.10
ENST00000000412.7       ENSG00000003056.7
ENST00000000442.10      ENSG00000173153.13
ENST00000001008.5       ENSG00000004478.7
ENST00000001146.6       ENSG00000003137.8

该脚本使用了perl 程序中好多实用的功能。

网友评论

本文标题：一个提取序列的脚本

本文链接：https://www.haomeiwen.com/subject/qzcymttx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！

一个提取序列的脚本

相关文章

从fasta序列中提取特定ID的序列

一个提取序列的脚本

bedtools批量提取基因组指定位置序列

^M引发的惨案

使用TBtools对叶绿体蛋白编码基因进行GO注释

Python3 提取CDS

从fasta文件中批量提取特定序列

人员信息提取其实很简单

GO注释---Tbtools

GEE提取感兴趣区的时间序列

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

生物信息学习

生物信息学

Perl