美文网首页Linux与生物信息
提取fasta文件特定ID序列,及其计算序列长度

提取fasta文件特定ID序列,及其计算序列长度

作者: caokai001 | 来源:发表于2020-04-27 01:38 被阅读0次

参考:

https://www.biostars.org/p/127141/
快速计算fasta序列长度的方法


Tips:

有时候需要通过具体的ID,获取序列信息,及其计算序列长度.
(下面主要通过linux 相关命令来实现)

### 测试数据:
$ cat >test.fa
>blah_C1
ACTGTCTGTC
ACTGTGTTGTG
ATGTTGTGTGTG
>blah_C2
ACTTTATATATT
>blah_C3
ACTTATATATATATA
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

$ cat >IDs.txt
blah_C4
blah_C5

Part1: 提取特定ID的序列

  1. You can do that with BBTools: 里面好多java 软件包,都被shell 封装好了.
$ ./filterbyname.sh 

Written by Brian Bushnell
Last modified September 1, 2016

Description:  Filters reads by name.

Usage:  filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.
Important!  Leading > and @ symbols are NOT part of sequence names;  they are part of
the fasta, fastq, and sam specifications.  Therefore, this is correct:
names=e.coli_K12
And these are incorrect:
names=>e.coli_K12
names=@e.coli_K12

Parameters:
include=f       Set to 'true' to include the filtered names rather than excluding them.
substring=f     Allow one name to be a substring of the other, rather than a full match.
                   f: No substring matching.
                   t: Bidirectional substring matching.
                   header: Allow input read headers to be substrings of names in list.
                   name: Allow names in list to be substrings of input read headers.
prefix=f        Allow names to match read header prefixes.
case=t          (casesensitive) Match case also.
ow=t            (overwrite) Overwrites files that already exist.
app=f           (append) Append to files that already exist.
zl=4            (ziplevel) Set compression level, 1 (low) to 9 (max).
int=f           (interleaved) Determines whether INPUT file is considered interleaved.
names=          A list of strings or files.  The files can have one name per line, or
                be a standard read file (fasta, fastq, or sam).
minlen=0        Do not output reads shorter than this.
ths=f           (truncateheadersymbol) Ignore a leading @ or > symbol in the names file.
tws=f           (truncatewhitespace) Ignore leading or trailing whitespace in the names file.
truncate=f      Set both ths and tws at the same time.

Positional parameters:
These optionally allow you to output only a portion of a sequence.  Zero-based, inclusive.
Intended for a single sequence and include=t mode.
from=-1         Only print bases starting at this position.
to=-1           Only print bases up to this position.
range=          Set from and to with a single flag.


Java Parameters:
-Xmx            This will set Java's memory usage, overriding autodetection.
                -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.
                    The max is typically 85% of physical memory.
-eoom           This flag will cause the process to exit if an out-of-memory
                exception occurs.  Requires Java 8u92+.
-da             Disable assertions.

To read from stdin, set 'in=stdin'.  The format should be specified with an extension, like 'in=stdin.fq.gz'
To write to stdout, set 'out=stdout'.  The format should be specified with an extension, like 'out=stdout.fasta'

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

太长了,直接看例子:

### 程序运行:
$ ./filterbyname.sh  in=test.fa names=IDs.txt out=out.fa include
/public/home/kcao/tools/bbmap//calcmem.sh: line 75: [: -v: unary operator expected
java -ea -Xmx13206m -cp /public/home/kcao/tools/bbmap/current/ driver.FilterReadsByName in=test.fa names=IDs.txt out=out.fa include
Executing driver.FilterReadsByName [in=test.fa, names=IDs.txt, out=out.fa, include]

Input is being processed as unpaired
Time:                           0.098 seconds.
Reads Processed:           5    0.05k reads/sec
Bases Processed:          87    0.00m bases/sec
Reads Out:          2
Bases Out:          27
### 查看结果:
[01:21:30] kcao@login:~/tools/bbmap
$ cat out.fa 
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

2.1 awk 也可以搞定

$ awk 'NR==1{printf $0"\t";next}{printf /^>/ ? "\n"$0"\t" : $0}' test.fa | awk -F"\t" 'BEGIN{while((getline k < "IDs.txt")>0)i[k]=1}{gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}'
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

2.2 不完善版本:

$ awk -v RS='>' 'match($0, "blah_C1"){print ">"$0}' test.fa 
>blah_C1
ACTGTCTGTC
ACTGTGTTGTG
ATGTTGTGTGTG

3.Bioawk 也可以

$ bioawk -cfastx 'BEGIN{while((getline k <"IDs.txt")>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}' test.fa
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

4.UCGS utilities

$ ./faSomeRecords main.fasta id.txt output.fa

option -exclude will output sequences not present in "main.fasta"

  1. python(不完善:每次提取一个ID)
$ cat >python_extract_ID.py
import sys
name = sys.argv[2]
flag = 0
with open(sys.argv[1]) as f:
    for line in f:
        line = line.strip()
        if line[0] == '>':
            if line.split()[0][1:] == name:
                flag = 1
                print(line)
            else :
                flag = 0
        else:
            if flag == 0:
                pass
            else:
                print(line)

$ python python_extract_ID.py test.fa "blah_C1"
>blah_C1
ACTGTCTGTC
ACTGTGTTGTG
ATGTTGTGTGTG
小结:个人推荐awk 那个代码,修改两个参数就可以跑。

Part2 : 计算序列长度

  • awk
$ awk '/^>/&&NR>1{print "";}{ printf "%s",/^>/ ? $0" ":$0 }'  test.fa 
>blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
>blah_C2 ACTTTATATATT
>blah_C3 ACTTATATATATATA
>blah_C4 ACTTATATATATATA
>blah_C5 ACTTTATATATT

awk 里面length 函数.

$ awk '/^>/&&NR>1{print "";}{ printf "%s",/^>/ ? $0" ":$0 }'  test.fa | awk '{print $1"\t"length($2)}'
>blah_C1    33
>blah_C2    12
>blah_C3    15
>blah_C4    15
>blah_C5    12

欢迎评论补充~

相关文章

网友评论

    本文标题:提取fasta文件特定ID序列,及其计算序列长度

    本文链接:https://www.haomeiwen.com/subject/fwevmhtx.html