参考:
https://www.biostars.org/p/127141/
快速计算fasta序列长度的方法
Tips:
有时候需要通过具体的ID,获取序列信息,及其计算序列长度.
(下面主要通过linux 相关命令来实现)
### 测试数据:
$ cat >test.fa
>blah_C1
ACTGTCTGTC
ACTGTGTTGTG
ATGTTGTGTGTG
>blah_C2
ACTTTATATATT
>blah_C3
ACTTATATATATATA
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT
$ cat >IDs.txt
blah_C4
blah_C5
Part1: 提取特定ID的序列
- You can do that with BBTools: 里面好多java 软件包,都被shell 封装好了.
$ ./filterbyname.sh
Written by Brian Bushnell
Last modified September 1, 2016
Description: Filters reads by name.
Usage: filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>
in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.
Important! Leading > and @ symbols are NOT part of sequence names; they are part of
the fasta, fastq, and sam specifications. Therefore, this is correct:
names=e.coli_K12
And these are incorrect:
names=>e.coli_K12
names=@e.coli_K12
Parameters:
include=f Set to 'true' to include the filtered names rather than excluding them.
substring=f Allow one name to be a substring of the other, rather than a full match.
f: No substring matching.
t: Bidirectional substring matching.
header: Allow input read headers to be substrings of names in list.
name: Allow names in list to be substrings of input read headers.
prefix=f Allow names to match read header prefixes.
case=t (casesensitive) Match case also.
ow=t (overwrite) Overwrites files that already exist.
app=f (append) Append to files that already exist.
zl=4 (ziplevel) Set compression level, 1 (low) to 9 (max).
int=f (interleaved) Determines whether INPUT file is considered interleaved.
names= A list of strings or files. The files can have one name per line, or
be a standard read file (fasta, fastq, or sam).
minlen=0 Do not output reads shorter than this.
ths=f (truncateheadersymbol) Ignore a leading @ or > symbol in the names file.
tws=f (truncatewhitespace) Ignore leading or trailing whitespace in the names file.
truncate=f Set both ths and tws at the same time.
Positional parameters:
These optionally allow you to output only a portion of a sequence. Zero-based, inclusive.
Intended for a single sequence and include=t mode.
from=-1 Only print bases starting at this position.
to=-1 Only print bases up to this position.
range= Set from and to with a single flag.
Java Parameters:
-Xmx This will set Java's memory usage, overriding autodetection.
-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.
The max is typically 85% of physical memory.
-eoom This flag will cause the process to exit if an out-of-memory
exception occurs. Requires Java 8u92+.
-da Disable assertions.
To read from stdin, set 'in=stdin'. The format should be specified with an extension, like 'in=stdin.fq.gz'
To write to stdout, set 'out=stdout'. The format should be specified with an extension, like 'out=stdout.fasta'
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
太长了,直接看例子:
### 程序运行:
$ ./filterbyname.sh in=test.fa names=IDs.txt out=out.fa include
/public/home/kcao/tools/bbmap//calcmem.sh: line 75: [: -v: unary operator expected
java -ea -Xmx13206m -cp /public/home/kcao/tools/bbmap/current/ driver.FilterReadsByName in=test.fa names=IDs.txt out=out.fa include
Executing driver.FilterReadsByName [in=test.fa, names=IDs.txt, out=out.fa, include]
Input is being processed as unpaired
Time: 0.098 seconds.
Reads Processed: 5 0.05k reads/sec
Bases Processed: 87 0.00m bases/sec
Reads Out: 2
Bases Out: 27
### 查看结果:
[01:21:30] kcao@login:~/tools/bbmap
$ cat out.fa
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT
2.1 awk 也可以搞定
$ awk 'NR==1{printf $0"\t";next}{printf /^>/ ? "\n"$0"\t" : $0}' test.fa | awk -F"\t" 'BEGIN{while((getline k < "IDs.txt")>0)i[k]=1}{gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}'
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT
2.2 不完善版本:
$ awk -v RS='>' 'match($0, "blah_C1"){print ">"$0}' test.fa
>blah_C1
ACTGTCTGTC
ACTGTGTTGTG
ATGTTGTGTGTG
3.Bioawk 也可以
$ bioawk -cfastx 'BEGIN{while((getline k <"IDs.txt")>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}' test.fa
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT
$ ./faSomeRecords main.fasta id.txt output.fa
option -exclude
will output sequences not present in "main.fasta"
- python(不完善:每次提取一个ID)
$ cat >python_extract_ID.py
import sys
name = sys.argv[2]
flag = 0
with open(sys.argv[1]) as f:
for line in f:
line = line.strip()
if line[0] == '>':
if line.split()[0][1:] == name:
flag = 1
print(line)
else :
flag = 0
else:
if flag == 0:
pass
else:
print(line)
$ python python_extract_ID.py test.fa "blah_C1"
>blah_C1
ACTGTCTGTC
ACTGTGTTGTG
ATGTTGTGTGTG
小结:个人推荐awk 那个代码,修改两个参数就可以跑。
Part2 : 计算序列长度
- awk
$ awk '/^>/&&NR>1{print "";}{ printf "%s",/^>/ ? $0" ":$0 }' test.fa
>blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
>blah_C2 ACTTTATATATT
>blah_C3 ACTTATATATATATA
>blah_C4 ACTTATATATATATA
>blah_C5 ACTTTATATATT
awk 里面length 函数.
$ awk '/^>/&&NR>1{print "";}{ printf "%s",/^>/ ? $0" ":$0 }' test.fa | awk '{print $1"\t"length($2)}'
>blah_C1 33
>blah_C2 12
>blah_C3 15
>blah_C4 15
>blah_C5 12
欢迎评论补充~
网友评论