美文网首页Linux与生物信息
提取fasta文件特定ID序列,及其计算序列长度

提取fasta文件特定ID序列,及其计算序列长度

作者: caokai001 | 来源:发表于2020-04-27 01:38 被阅读0次

    参考:

    https://www.biostars.org/p/127141/
    快速计算fasta序列长度的方法


    Tips:

    有时候需要通过具体的ID,获取序列信息,及其计算序列长度.
    (下面主要通过linux 相关命令来实现)

    ### 测试数据:
    $ cat >test.fa
    >blah_C1
    ACTGTCTGTC
    ACTGTGTTGTG
    ATGTTGTGTGTG
    >blah_C2
    ACTTTATATATT
    >blah_C3
    ACTTATATATATATA
    >blah_C4
    ACTTATATATATATA
    >blah_C5
    ACTTTATATATT
    
    $ cat >IDs.txt
    blah_C4
    blah_C5
    

    Part1: 提取特定ID的序列

    1. You can do that with BBTools: 里面好多java 软件包,都被shell 封装好了.
    $ ./filterbyname.sh 
    
    Written by Brian Bushnell
    Last modified September 1, 2016
    
    Description:  Filters reads by name.
    
    Usage:  filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>
    
    in2 and out2 are for paired reads and are optional.
    If input is paired and there is only one output file, it will be written interleaved.
    Important!  Leading > and @ symbols are NOT part of sequence names;  they are part of
    the fasta, fastq, and sam specifications.  Therefore, this is correct:
    names=e.coli_K12
    And these are incorrect:
    names=>e.coli_K12
    names=@e.coli_K12
    
    Parameters:
    include=f       Set to 'true' to include the filtered names rather than excluding them.
    substring=f     Allow one name to be a substring of the other, rather than a full match.
                       f: No substring matching.
                       t: Bidirectional substring matching.
                       header: Allow input read headers to be substrings of names in list.
                       name: Allow names in list to be substrings of input read headers.
    prefix=f        Allow names to match read header prefixes.
    case=t          (casesensitive) Match case also.
    ow=t            (overwrite) Overwrites files that already exist.
    app=f           (append) Append to files that already exist.
    zl=4            (ziplevel) Set compression level, 1 (low) to 9 (max).
    int=f           (interleaved) Determines whether INPUT file is considered interleaved.
    names=          A list of strings or files.  The files can have one name per line, or
                    be a standard read file (fasta, fastq, or sam).
    minlen=0        Do not output reads shorter than this.
    ths=f           (truncateheadersymbol) Ignore a leading @ or > symbol in the names file.
    tws=f           (truncatewhitespace) Ignore leading or trailing whitespace in the names file.
    truncate=f      Set both ths and tws at the same time.
    
    Positional parameters:
    These optionally allow you to output only a portion of a sequence.  Zero-based, inclusive.
    Intended for a single sequence and include=t mode.
    from=-1         Only print bases starting at this position.
    to=-1           Only print bases up to this position.
    range=          Set from and to with a single flag.
    
    
    Java Parameters:
    -Xmx            This will set Java's memory usage, overriding autodetection.
                    -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.
                        The max is typically 85% of physical memory.
    -eoom           This flag will cause the process to exit if an out-of-memory
                    exception occurs.  Requires Java 8u92+.
    -da             Disable assertions.
    
    To read from stdin, set 'in=stdin'.  The format should be specified with an extension, like 'in=stdin.fq.gz'
    To write to stdout, set 'out=stdout'.  The format should be specified with an extension, like 'out=stdout.fasta'
    
    Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
    
    

    太长了,直接看例子:

    ### 程序运行:
    $ ./filterbyname.sh  in=test.fa names=IDs.txt out=out.fa include
    /public/home/kcao/tools/bbmap//calcmem.sh: line 75: [: -v: unary operator expected
    java -ea -Xmx13206m -cp /public/home/kcao/tools/bbmap/current/ driver.FilterReadsByName in=test.fa names=IDs.txt out=out.fa include
    Executing driver.FilterReadsByName [in=test.fa, names=IDs.txt, out=out.fa, include]
    
    Input is being processed as unpaired
    Time:                           0.098 seconds.
    Reads Processed:           5    0.05k reads/sec
    Bases Processed:          87    0.00m bases/sec
    Reads Out:          2
    Bases Out:          27
    ### 查看结果:
    [01:21:30] kcao@login:~/tools/bbmap
    $ cat out.fa 
    >blah_C4
    ACTTATATATATATA
    >blah_C5
    ACTTTATATATT
    

    2.1 awk 也可以搞定

    $ awk 'NR==1{printf $0"\t";next}{printf /^>/ ? "\n"$0"\t" : $0}' test.fa | awk -F"\t" 'BEGIN{while((getline k < "IDs.txt")>0)i[k]=1}{gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}'
    >blah_C4
    ACTTATATATATATA
    >blah_C5
    ACTTTATATATT
    

    2.2 不完善版本:

    $ awk -v RS='>' 'match($0, "blah_C1"){print ">"$0}' test.fa 
    >blah_C1
    ACTGTCTGTC
    ACTGTGTTGTG
    ATGTTGTGTGTG
    

    3.Bioawk 也可以

    $ bioawk -cfastx 'BEGIN{while((getline k <"IDs.txt")>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}' test.fa
    >blah_C4
    ACTTATATATATATA
    >blah_C5
    ACTTTATATATT
    

    4.UCGS utilities

    $ ./faSomeRecords main.fasta id.txt output.fa
    

    option -exclude will output sequences not present in "main.fasta"

    1. python(不完善:每次提取一个ID)
    $ cat >python_extract_ID.py
    import sys
    name = sys.argv[2]
    flag = 0
    with open(sys.argv[1]) as f:
        for line in f:
            line = line.strip()
            if line[0] == '>':
                if line.split()[0][1:] == name:
                    flag = 1
                    print(line)
                else :
                    flag = 0
            else:
                if flag == 0:
                    pass
                else:
                    print(line)
    
    $ python python_extract_ID.py test.fa "blah_C1"
    >blah_C1
    ACTGTCTGTC
    ACTGTGTTGTG
    ATGTTGTGTGTG
    
    小结:个人推荐awk 那个代码,修改两个参数就可以跑。

    Part2 : 计算序列长度

    • awk
    $ awk '/^>/&&NR>1{print "";}{ printf "%s",/^>/ ? $0" ":$0 }'  test.fa 
    >blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
    >blah_C2 ACTTTATATATT
    >blah_C3 ACTTATATATATATA
    >blah_C4 ACTTATATATATATA
    >blah_C5 ACTTTATATATT
    
    

    awk 里面length 函数.

    $ awk '/^>/&&NR>1{print "";}{ printf "%s",/^>/ ? $0" ":$0 }'  test.fa | awk '{print $1"\t"length($2)}'
    >blah_C1    33
    >blah_C2    12
    >blah_C3    15
    >blah_C4    15
    >blah_C5    12
    

    欢迎评论补充~

    相关文章

      网友评论

        本文标题:提取fasta文件特定ID序列,及其计算序列长度

        本文链接:https://www.haomeiwen.com/subject/fwevmhtx.html