![](https://img.haomeiwen.com/i7691822/830fc29f8e1c995b.png)
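Introduction
pyfastx is a lightweight Python C extension that enables users to randomly access sequences in plain and gzipped FASTA/Q files. The module aims to provide simple APIs for extracting sequences from FASTA and reads from FASTQ by identifier and index number. An index stored in a sqlite3 database file is built for random access, which avoids consuming excessive memory. In addition, pyfastx can parse both standard FASTA (sequences wrapped into multiple lines of the same length) and nonstandard FASTA (sequences spread over one or more lines of different lengths).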
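Features
- Single file for the Python extension
- Lightweight and memory efficient for parsing FASTA/Q files
- Fast random access to sequences from gzipped FASTA/Q files
- Read sequences from a FASTA file line by line
- Calculate N50 and L50 of sequences in a FASTA file
- Calculate GC content and nucleotide composition
- Compute reverse complement sequences
- Excellent compatibility, with support for parsing nonstandard FASTA files
- Support for FASTQ quality score conversion
- A command line interface for splitting FASTA/Q files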
Installation
Currently, pyfastx supports Python 3.6 and later, and can be installed with pip.
pip install pyfastx
FASTX
FASTA sequence iteration
When iterating over a Fastx object for a FASTA file, each item is a tuple (name, seq).
import pyfastx
fa = pyfastx.Fastx('tests/data/test.fa.gz')
for name,seq in fa:
    print(name)
    print(seq)
'''
JZ822577.1
CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT
'''
# always output uppercase sequence
for item in pyfastx.Fastx('tests/data/test.fa', uppercase=True):
    print(item)
'''
('JZ822577.1', 'CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT')
'''
# manually specify the sequence format
for item in pyfastx.Fastx('tests/data/test.fa', format="fasta"):
    print(item)
'''
('JZ822577.1', 'CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT')
'''
If you want the sequence comment, set comment to True. New in pyfastx 0.9.0.
fa = pyfastx.Fastx('tests/data/test.fa.gz', comment=True)
for name,seq,comment in fa:
    print(name)
    print(seq)
    print(comment)
'''
JZ822577.1
CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT
contig1 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence
'''
FASTQ read iteration
When iterating over a Fastx object for a FASTQ file, each item is a tuple (name, seq, qual).
fq = pyfastx.Fastx('tests/data/test.fq.gz')
for name,seq,qual in fq:
    print(name)
    print(seq)
    print(qual)
'''
A00129:183:H77K2DMXX:1:1101:6804:1031
TGCACACGTAGGCGCGAGCGTCGCCGCCGGCGGCCTTGCAGGCAGCGACGGCTTCGTCGAGACGTTCGCGGTTGAGATCGACCAGGGCCAGCCTGGCGCCCTTGCCGGCGAGATATTCGCCCATTGCCCGGCCGAGCCCCTGGCAGCCGC
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFF
'''
If you want the read comment (the second field of the header line), set comment to True. New in pyfastx 0.9.0.
fq = pyfastx.Fastx('tests/data/test.fq.gz', comment=True)
for name,seq,qual,comment in fq:
    print(name)
    print(seq)
    print(qual)
    print(comment)
'''
A00129:183:H77K2DMXX:1:1101:6804:1031
TGCACACGTAGGCGCGAGCGTCGCCGCCGGCGGCCTTGCAGGCAGCGACGGCTTCGTCGAGACGTTCGCGGTTGAGATCGACCAGGGCCAGCCTGGCGCCCTTGCCGGCGAGATATTCGCCCATTGCCCGGCCGAGCCCCTGGCAGCCGC
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFF
1:N:0:CAATGGAA+CGAGGCTG
'''
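In the outputs above, the name is the first whitespace-delimited field of the header line and the comment is everything after it. A minimal plain-Python sketch of that split follows; the helper name split_header is hypothetical and not part of pyfastx, which does this internally and may treat edge cases differently.

```python
def split_header(header):
    """Split a FASTA/FASTQ header line into (name, comment)."""
    text = header.strip()
    # Drop the leading '>' (FASTA) or '@' (FASTQ) marker if present.
    if text[:1] in (">", "@"):
        text = text[1:]
    # The name is the first whitespace-delimited token; the rest is the comment.
    parts = text.split(None, 1)
    name = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return name, comment


print(split_header("@A00129:183:H77K2DMXX:1:1101:6804:1031 1:N:0:CAATGGAA+CGAGGCTG"))
```

A header without any whitespace yields an empty comment in this sketch.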
FASTA(fa)
Read FASTA file
Read a plain or gzipped FASTA file and build an index, which enables random access to the FASTA.
>>> import pyfastx
>>> fa = pyfastx.Fasta('tests/data/test.fa.gz')
>>> fa
<Fasta> tests/data/test.fa.gz contains 211 seqs
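FASTA records iteration
The fastest way to iterate over a plain or gzipped FASTA file is without building an index; the iteration returns a tuple containing the name and sequence.
import pyfastx
for name, seq in pyfastx.Fasta('tests/data/test.fa.gz', build_index=False):
    print(name, seq)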
'''
JZ822577.1 CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT
'''
You can also iterate sequence objects from a FASTA object like this:
import pyfastx
for seq in pyfastx.Fasta('tests/data/test.fa.gz'):
    print(seq.name)
    print(seq.seq)
    print(seq.description)
'''
JZ822577.1
CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT
JZ822577.1 contig1 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence
'''
Iteration with build_index=True (the default) returns sequence objects, which let you access sequence attributes. New in pyfastx 0.6.3.
Get FASTA information
>>> # get sequence counts in FASTA
>>> len(fa)
211
>>> # get total sequence length of FASTA
>>> fa.size
86262
>>> # get GC content of DNA sequences in FASTA
>>> fa.gc_content
43.529014587402344
>>> # get GC skew of DNA sequences in FASTA
>>> # New in pyfastx 0.3.8
>>> fa.gc_skew
0.004287730902433395
>>> # get composition of nucleotides in FASTA
>>> fa.composition
{'A': 24534, 'C': 18694, 'G': 18855, 'T': 24179}
>>> # get fasta type (DNA, RNA, or protein)
>>> fa.type
'DNA'
>>> # check whether the fasta file is gzip compressed
>>> fa.is_gzip
True
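The GC content and GC skew reported above agree with the standard definitions applied to the composition dictionary: GC content is (G + C) / (A + C + G + T) as a percentage, and GC skew is (G - C) / (G + C). The sketch below recomputes both in plain Python; the function names are hypothetical, and it is an assumption that pyfastx applies exactly these formulas.

```python
def gc_content(composition):
    """GC percentage over the A, C, G and T counts."""
    g = composition.get("G", 0)
    c = composition.get("C", 0)
    total = g + c + composition.get("A", 0) + composition.get("T", 0)
    return 100.0 * (g + c) / total if total else 0.0


def gc_skew(composition):
    """GC skew, (G - C) / (G + C)."""
    g = composition.get("G", 0)
    c = composition.get("C", 0)
    return (g - c) / (g + c) if (g + c) else 0.0


# Composition taken from the fa.composition output above.
composition = {'A': 24534, 'C': 18694, 'G': 18855, 'T': 24179}
print(round(gc_content(composition), 3))
print(round(gc_skew(composition), 6))
```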
Get longest and shortest sequence
New in pyfastx 0.3.0
>>> # get longest sequence
>>> s = fa.longest
>>> s
<Sequence> JZ822609.1 with length of 821
>>> s.name
'JZ822609.1'
>>> len(s)
821
>>> # get shortest sequence
>>> s = fa.shortest
>>> s
<Sequence> JZ822617.1 with length of 118
>>> s.name
'JZ822617.1'
>>> len(s)
118
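Calculate N50 and L50
New in pyfastx 0.3.0
Calculate assembly N50 and L50; the result is returned as (N50, L50).
N50 and L50 are two statistics that describe the distribution of sequence lengths in a dataset. They are commonly used to evaluate sequencing quality and assembly results.
N50:
- Definition: N50 is a length-weighted median. Sort the sequences from longest to shortest and accumulate their lengths starting from the longest; N50 is the length of the sequence at which the running total first reaches or exceeds 50% of the total sequence length.
- Meaning: N50 is more representative than a plain mean or median because it weights each sequence by its length. A larger N50 means that more of the total length lies in long sequences.
L50:
- Definition: L50 is the companion statistic of N50: the smallest number of sequences, taken from longest to shortest, whose combined length reaches or exceeds 50% of the total sequence length.
- Meaning: L50 tells you how many sequences are needed to cover half of the total length. A smaller L50 is usually better, because it means fewer sequences account for half of the total length.
Both statistics are useful when comparing datasets or assemblies, especially when evaluating genome assemblies.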
>>> # get FASTA N50 and L50
>>> fa.nl(50)
(516, 66)
>>> # get FASTA N90 and L90
>>> fa.nl(90)
(231, 161)
>>> # get FASTA N75 and L75
>>> fa.nl(75)
(365, 117)
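The definitions above translate directly into a short loop. The following is a plain-Python sketch of the general Nx/Lx calculation, not pyfastx's own implementation; the function name nl_stats and the toy lengths are made up for illustration.

```python
def nl_stats(lengths, percent=50):
    """Return (Nx, Lx) for a list of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    # Target: this percentage of the total length.
    threshold = sum(lengths) * percent / 100
    running = 0
    # Walk from the longest sequence down, accumulating length.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            # length is Nx, count is Lx.
            return length, count


lengths = [80, 70, 50, 40, 30, 20, 10]  # toy data, total length 300
print(nl_stats(lengths, 50))  # (70, 2): 80 + 70 reaches half of 300
print(nl_stats(lengths, 90))  # (30, 5): the five longest reach 270
```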
Get sequence mean and median length
New in pyfastx 0.3.0
>>> # get sequence average length
>>> fa.mean
408
>>> # get sequence median length
>>> fa.median
430
Get sequence counts
New in pyfastx 0.3.0
Get counts of sequences whose length is >= a specified length
>>> # get counts of sequences with length >= 200 bp
>>> fa.count(200)
173
>>> # get counts of sequences with length >= 500 bp
>>> fa.count(500)
70
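Get subsequences
Subsequences can be retrieved from a FASTA file with a (start, end) interval or a list of such intervals. Coordinates are 1-based and inclusive.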
>>> # get subsequence with start and end position
>>> interval = (1, 10)
>>> fa.fetch('JZ822577.1', interval)
'CTCTAGAGAT'
>>> # get subsequences with a list of start and end position
>>> intervals = [(1, 10), (50, 60)]
>>> fa.fetch('JZ822577.1', intervals)
'CTCTAGAGATTTTAGTTTGAC'
>>> # get subsequences with reverse strand
>>> fa.fetch('JZ822577.1', (1, 10), strand='-')
'ATCTCTAGAG'
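The three calls above can be reproduced on a plain string to make the coordinate convention explicit. This is a sketch only: the fetch function below is a hypothetical stand-in, not pyfastx's implementation, and concatenating the intervals before reverse-complementing is an assumption about how several intervals combine with strand='-'.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def fetch(sequence, intervals, strand="+"):
    """Extract 1-based, inclusive intervals from a string and join them."""
    if isinstance(intervals, tuple):
        intervals = [intervals]
    # A 1-based inclusive (start, end) maps to the Python slice [start-1:end].
    sub = "".join(sequence[start - 1:end] for start, end in intervals)
    if strand == "-":
        # Reverse complement of the joined subsequence (assumed behavior).
        sub = sub.translate(_COMPLEMENT)[::-1]
    return sub


# First 60 bases of JZ822577.1, copied from the output shown earlier.
seq = "CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGAC"
print(fetch(seq, (1, 10)))
print(fetch(seq, [(1, 10), (50, 60)]))
print(fetch(seq, (1, 10), strand="-"))
```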
Key function
Sometimes your FASTA file has long header lines that contain multiple identifiers and a description, for example, ">JZ822577.1 contig1 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence". In this case, both "JZ822577.1" and "contig1" can be used as the identifier. You can specify a key function to select one of them.
>>> # by default, JZ822577.1 is used as the identifier
>>> # specify key_func to select contig1 as the identifier
>>> fa = pyfastx.Fasta('tests/data/test.fa.gz', key_func=lambda x: x.split()[1])
>>> fa
<Fasta> tests/data/test.fa.gz contains 211 seqs
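A key function can be tried on a header string before it is passed to pyfastx. The sketch below applies the same split to the example header; the fallback variant is a hypothetical suggestion for files in which some headers have only one field, where the lambda above would raise IndexError.

```python
header = ("JZ822577.1 contig1 cDNA library of flower petals in tree peony by "
          "suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence")


def second_field(line):
    """Same logic as the lambda above: pick the second whitespace-delimited field."""
    return line.split()[1]


def second_field_or_first(line):
    """Hypothetical variant: fall back to the first field if there is no second one."""
    fields = line.split()
    return fields[1] if len(fields) > 1 else fields[0]


print(second_field(header))
print(second_field_or_first("JZ822577.1"))
```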
Sequence -> fa
Get a sequence from FASTA
>>> # get sequence like a dictionary by identifier
>>> s1 = fa['JZ822577.1']
>>> s1
<Sequence> JZ822577.1 with length of 333
>>> # get sequence like a list by index
>>> s2 = fa[2]
>>> s2
<Sequence> JZ822579.1 with length of 176
>>> # get last sequence
>>> s3 = fa[-1]
>>> s3
<Sequence> JZ840318.1 with length of 134
>>> # check whether a sequence name is in the FASTA file
>>> 'JZ822577.1' in fa
True
Sequence slicing
Sequence objects can be sliced like Python strings.
>>> # get a sub seq from sequence
>>> s = fa[-1]
>>> ss = s[10:30]
>>> ss
<Sequence> JZ840318.1 from 11 to 30
>>> ss.name
'JZ840318.1:11-30'
>>> s.seq
'ACTGGAGGTTCTTCTTCCTGTGGAAAGTAACTTGTTTTGCCTTCACCTGCCTGTTCTTCACATCAACCTTGTTCCCACACAAAACAATGGGAATGTTCTCACACACCCTGCAGAGATCACGATGCCATGTTGGT'
>>> ss.seq
'CTTCTTCCTGTGGAAAGTAA'
>>> ss = s[-10:]
>>> ss
<Sequence> JZ840318.1 from 125 to 134
>>> ss.name
'JZ840318.1:125-134'
>>> ss.seq
'CCATGTTGGT'
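Slice start and end coordinates are 0-based, as in Python, while the name of the sliced sequence reports the range in 1-based, inclusive coordinates. Currently, pyfastx does not support the optional third step or stride argument, for example ss[::-1].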
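The relation between the 0-based slice and the 1-based range shown in the sliced sequence's name can be sketched with slice.indices from the standard library. The helper sliced_name is hypothetical and only mirrors the labels seen in the output above.

```python
def sliced_name(name, length, key):
    """Build the 1-based, inclusive label for a 0-based Python slice."""
    # slice.indices resolves negative and omitted bounds against the length.
    start, stop, step = key.indices(length)
    if step != 1:
        raise ValueError("step/stride slicing is not supported")
    if stop <= start:
        raise ValueError("empty slice")
    # 0-based half-open [start, stop) equals 1-based inclusive [start + 1, stop].
    return f"{name}:{start + 1}-{stop}"


print(sliced_name("JZ840318.1", 134, slice(10, 30)))     # JZ840318.1:11-30
print(sliced_name("JZ840318.1", 134, slice(-10, None)))  # JZ840318.1:125-134
```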
Reverse and complement sequence
Get the reverse, complement, or reverse complement of a Sequence.
>>> # get sliced sequence
>>> fa[0][10:20].seq
'GTCAATTTCC'
>>> # get reverse of sliced sequence (read from end to start)
>>> fa[0][10:20].reverse
'CCTTTAACTG'
>>> # get complement of sliced sequence
>>> fa[0][10:20].complement
'CAGTTAAAGG'
>>> # get reversed complement sequence, corresponding to sequence in antisense strand
>>> fa[0][10:20].antisense
'GGAAATTGAC'
fa = pyfastx.Fasta('tests/data/test.fa.gz')
for seq in fa:
    print(seq.name)
    print(seq.seq)
    print(seq.reverse)  # reversed sequence
    print(seq.description)
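For reference, the three transformations can be written in a few lines of plain Python. This sketch handles only A, C, G and T (other characters such as N pass through unchanged), which is a simplification and not necessarily how pyfastx treats ambiguity codes.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Read the sequence from end to start."""
    return seq[::-1]


def complement(seq):
    """Swap each base for its pairing partner, keeping the order."""
    return seq.translate(_COMPLEMENT)


def antisense(seq):
    """Reverse complement: the sequence of the opposite strand."""
    return complement(seq)[::-1]


s = "GTCAATTTCC"  # the sliced sequence from the session above
print(reverse(s))
print(complement(s))
print(antisense(s))
```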
Search for subsequence
New in pyfastx 0.3.6
Search for a subsequence in a given sequence and get the 1-based start position of its first occurrence
>>> # search subsequence in sense strand
>>> fa[0].search('GCTTCAATACA')
262
>>> # check whether a subsequence is in the sequence
>>> 'GCTTCAATACA' in fa[0]
True
>>> # search subsequence in antisense strand
>>> fa[0].search('CCTCAAGT', '-')
301
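For the sense strand, the 1-based position corresponds to Python's 0-based str.find result plus one. The sketch below shows only that case on a toy string; returning None when nothing is found is an assumption, and the antisense case is left out because the text does not state how its coordinates are reported.

```python
def search_sense(sequence, sub):
    """Return the 1-based start of the first occurrence of sub, or None."""
    pos = sequence.find(sub)  # 0-based index, -1 when absent
    return pos + 1 if pos >= 0 else None


toy = "ACGTACGTTTGA"  # made-up sequence for illustration
print(search_sense(toy, "CGT"))
print(search_sense(toy, "TTGA"))
print(search_sense(toy, "GGG"))
```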
FastaKeys -> fa
New in pyfastx 0.8.0. The Identifier object has been changed to the FastaKeys object.
Get keys
Get all sequence names as a list-like object.
>>> ids = fa.keys()
>>> ids
<FastaKeys> contains 211 keys
>>> # get count of sequence
>>> len(ids)
211
>>> # get key by index
>>> ids[0]
'JZ822577.1'
>>> # check key whether in fasta
>>> 'JZ822577.1' in ids
True
>>> # iterate over keys
>>> for name in ids:
...     print(name)
>>> # convert to a list
>>> list(ids)
Sort keys
Sort keys by sequence id, name, or length for iteration
New in pyfastx 0.5.0
>>> # sort keys by length with descending order
>>> # note: this step does not seem very useful
>>> for name in ids.sort(by='length', reverse=True):
...     print(name)
>>> # sort keys by name with ascending order (this works: keys are sorted by name)
>>> for name in ids.sort(by='name'):
...     print(name)
>>> # sort keys by id with descending order
>>> for name in ids.sort(by='id', reverse=True):
...     print(name)
Filter keys
Filter keys by sequence length and name
New in pyfastx 0.5.10
>>> # get keys with length > 600
>>> ids.filter(ids > 600)
<FastaKeys> contains 48 keys
>>> # get keys with length >= 500 and <= 700
>>> ids.filter(ids>=500, ids<=700)
<FastaKeys> contains 48 keys
>>> # get keys with length > 500 and < 600
>>> ids.filter(500<ids<600)
<FastaKeys> contains 22 keys
>>> # get keys contain JZ8226
>>> ids.filter(ids % 'JZ8226')
<FastaKeys> contains 90 keys
>>> # get keys contain JZ8226 with length > 550
>>> ids.filter(ids % 'JZ8226', ids>550)
<FastaKeys> contains 17 keys
>>> # clear sort order and filters
>>> ids.reset()
<FastaKeys> contains 211 keys
>>> # list a filtered result
>>> ids.filter(ids % 'JZ8226', ids>730)
>>> list(ids)
['JZ822609.1', 'JZ822650.1', 'JZ822664.1', 'JZ822699.1']
>>> # list a filtered result with sort order
>>> ids.filter(ids % 'JZ8226', ids>730).sort('length', reverse=True)
>>> list(ids)
['JZ822609.1', 'JZ822699.1', 'JZ822664.1', 'JZ822650.1']
>>> ids.filter(ids % 'JZ8226', ids>730).sort('name', reverse=True)
>>> list(ids)
['JZ822699.1', 'JZ822664.1', 'JZ822650.1', 'JZ822609.1']
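To make the filter semantics concrete, here is roughly what a name-and-length filter followed by a sort amounts to in plain Python. It works on an ordinary dictionary of name to length (the five lengths are the ones printed earlier in this document) and does not use the FastaKeys API; filter_keys and its parameters are hypothetical.

```python
# Lengths as printed earlier in this document.
lengths = {
    "JZ822577.1": 333,
    "JZ822579.1": 176,
    "JZ822609.1": 821,
    "JZ822617.1": 118,
    "JZ840318.1": 134,
}


def filter_keys(lengths, contains=None, longer_than=None):
    """Keep names that contain a substring and/or exceed a length."""
    keep = []
    for name, length in lengths.items():
        if contains is not None and contains not in name:
            continue
        if longer_than is not None and length <= longer_than:
            continue
        keep.append(name)
    return keep


selected = filter_keys(lengths, contains="JZ8226")
# Sort the filtered names by length, longest first.
print(sorted(selected, key=lengths.get, reverse=True))
```

Unlike this sketch, the FastaKeys filters and sort orders shown above stay attached to the ids object until ids.reset() is called.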
FASTQ(fq)
Read FASTQ file
Read a plain or gzipped FASTQ file and build an index, which enables random access to reads in the FASTQ.
>>> import pyfastx
>>> fq = pyfastx.Fastq('tests/data/test.fq.gz')
>>> fq
<Fastq> tests/data/test.fq.gz contains 100 reads
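FASTQ records iteration
The fastest way to parse a plain or gzipped FASTQ file is without building an index; the iteration returns a tuple containing the read name, sequence, and quality.
>>> import pyfastx
>>> for name,seq,qual in pyfastx.Fastq('tests/data/test.fq.gz', build_index=False):
...     print(name)
...     print(seq)
...     print(qual)
You can also iterate read objects from a FASTQ object like this:
for read in pyfastx.Fastq('tests/data/test.fq.gz'):
    print(read)
    print(read.name)
    print(read.seq)
    print(read.qual)
    # quali is the list of integer quality scores: the ASCII code of each quality character minus the Phred offset (33 here)
    print(read.quali)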
'''
TGCACACGTAGGCGCGAGCGTCGCCGCCGGCGGCCTTGCAGGCAGCGACGGCTTCGTCGAGACGTTCGCGGTTGAGATCGACCAGGGCCAGCCTGGCGCCCTTGCCGGCGAGATATTCGCCCATTGCCCGGCCGAGCCCCTGGCAGCCGC
A00129:183:H77K2DMXX:1:1101:6804:1031
TGCACACGTAGGCGCGAGCGTCGCCGCCGGCGGCCTTGCAGGCAGCGACGGCTTCGTCGAGACGTTCGCGGTTGAGATCGACCAGGGCCAGCCTGGCGCCCTTGCCGGCGAGATATTCGCCCATTGCCCGGCCGAGCCCCTGGCAGCCGC
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFF
[37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37]
'''
Iteration with build_index=True (the default) returns read objects, which let you access read attributes.
Get FASTQ information
>>> # get read counts in FASTQ
>>> len(fq)
800
>>> # get total bases
>>> fq.size
120000
>>> # get GC content of FASTQ file
>>> fq.gc_content
66.17471313476562
>>> # get composition of bases in FASTQ
>>> fq.composition
{'A': 20501, 'C': 39705, 'G': 39704, 'T': 20089, 'N': 1}
>>> # New in pyfastx 0.6.10
>>> # get average length of reads
>>> fq.avglen
150.0
>>> # get maximum length of reads
>>> fq.maxlen
150
>>> # get minimum length of reads
>>> fq.minlen
150
>>> # get maximum quality score
>>> fq.maxqual
70
>>> # get minimum quality score
>>> fq.minqual
35
>>> # get phred which affects the quality score conversion
>>> fq.phred
33
>>> # Guess fastq quality encoding system
>>> # New in pyfastx 0.4.1
>>> fq.encoding_type
['Sanger Phred+33', 'Illumina 1.8+ Phred+33']
Get read from FASTQ
>>> #get read like a dict by read name
>>> r1 = fq['A00129:183:H77K2DMXX:1:1101:4752:1047']
>>> r1
<Read> A00129:183:H77K2DMXX:1:1101:4752:1047 with length of 150
>>> # get read like a list by index
>>> r2 = fq[10]
>>> r2
<Read> A00129:183:H77K2DMXX:1:1101:18041:1078 with length of 150
>>> # get the last read
>>> r3 = fq[-1]
>>> r3
<Read> A00129:183:H77K2DMXX:1:1101:31575:4726 with length of 150
>>> # check whether a read is in the FASTQ file
>>> 'A00129:183:H77K2DMXX:1:1101:4752:1047' in fq
True
Get read information
>>> r = fq[-10]
>>> r
<Read> A00129:183:H77K2DMXX:1:1101:1750:4711 with length of 150
>>> # get read order number in FASTQ file
>>> r.id
791
>>> # get read name
>>> r.name
'A00129:183:H77K2DMXX:1:1101:1750:4711'
>>> # get read full header line, New in pyfastx 0.6.3
>>> r.description
'@A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG'
>>> # get read length
>>> len(r)
150
>>> # get read sequence
>>> r.seq
'CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA'
>>> # get raw string of read, New in pyfastx 0.6.3
>>> print(r.raw)
@A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG
CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:
>>> # get read quality ascii string
>>> r.qual
'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:'
>>> # get read quality integer value, ascii - 33 or 64
>>> r.quali
[37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25]
>>> # get read length
>>> len(r)
150
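In a FASTQ file, quality scores are stored as ASCII characters, one per base, and each character encodes the sequencing quality of the corresponding base. The letter 'F' has ASCII code 70, which is a quality score of 37 with the Phred+33 offset. In the example above, r.qual consists mostly of 'F', with a few ':' (score 25) and ',' (score 11).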
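The conversion from the quality string to integer scores is a per-character subtraction of the Phred offset. The sketch below reproduces it in plain Python and also converts a score to an error probability with the standard Phred relation P = 10^(-Q/10); the function names are hypothetical.

```python
def phred_scores(qual, offset=33):
    """Convert a FASTQ quality string to integer scores (ASCII code minus offset)."""
    return [ord(ch) - offset for ch in qual]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


print(phred_scores("FF:,F"))  # [37, 37, 25, 11, 37]
print(error_probability(37))  # about 0.0002
```

Older Illumina pipelines used an offset of 64 instead of 33, which is why the offset is a parameter here and why fq.phred is reported above.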
FastqKeys -> fq
Get fastq keys
Get all read names as a list-like object.
>>> ids = fq.keys()
>>> ids
<FastqKeys> contains 800 keys
>>> # get count of read
>>> len(ids)
800
>>> # get key by index
>>> ids[0]
'A00129:183:H77K2DMXX:1:1101:6804:1031'
>>> # check whether a key is in the fastq
>>> 'A00129:183:H77K2DMXX:1:1101:14416:1031' in ids
True
Command line interface
$ pyfastx -h
usage: pyfastx COMMAND [OPTIONS]
A command line tool for FASTA/Q file manipulation
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
Commands:
index build index for fasta/q file
stat show detailed statistics information of fasta/q file
split split fasta/q file into multiple files
fq2fa convert fastq file to fasta file
subseq get subsequences from fasta file by region
sample randomly sample sequences from fasta or fastq file
extract extract full sequences or reads from fasta/q file
Build index
New in pyfastx 0.6.10
$ pyfastx index -h
usage: pyfastx index [-h] [-f] fastx [fastx ...]
positional arguments:
fastx fasta or fastq file, gzip support
optional arguments:
-h, --help show this help message and exit
-f, --full build full index, base composition will be calculated
pyfastx index -f tests/data/test.fq.gz
This generates test.fq.gz.fxi.
Show statistics information
$ pyfastx stat -h
usage: pyfastx info [-h] fastx
positional arguments:
fastx input fasta or fastq file, gzip support
optional arguments:
-h, --help show this help message and exit
Split FASTA/Q file
$ pyfastx split -h
usage: pyfastx split [-h] (-n int | -c int) [-o str] fastx
positional arguments:
fastx fasta or fastq file, gzip support
optional arguments:
-h, --help show this help message and exit
-n int split a fasta/q file into N new files with even size
-c int split a fasta/q file into multiple files containing the same sequence counts
-o str, --out-dir str
output directory, default is current folder
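- Split a file into files of equal size:
pyfastx split -n 3 -o output_folder input.fasta
This splits input.fasta into 3 new files of equal size and stores them in the output_folder directory.
- Split a file into files containing the same number of sequences:
pyfastx split -c 1000 -o output_folder input.fastq
This splits input.fastq into multiple new files, each containing 1000 sequences, and stores them in the output_folder directory.
Convert FASTQ to FASTA file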
$ pyfastx fq2fa -h
usage: pyfastx fq2fa [-h] [-o str] fastx
positional arguments:
fastx fastq file, gzip support
optional arguments:
-h, --help show this help message and exit
-o str, --out-file str
output file, default: output to stdout
pyfastx fq2fa -o result/output.fa tests/data/test.fq.gz
pyfastx fq2fa -o result/output.fasta tests/data/test.fq.gz
pyfastx fq2fa -o result/output_nogzip.fasta tests/data/test.fq
Get subsequences by region
$ pyfastx subseq -h
usage: pyfastx subseq [-h] [-r str | -b str] [-o str] fastx [region [region ...]]
positional arguments:
fastx input fasta file, gzip support
region format is chr:start-end, start and end position is 1-based, multiple names were separated by space
optional arguments:
-h, --help show this help message and exit
-r str, --region-file str
tab-delimited file, one region per line, both start and end position are 1-based
-b str, --bed-file str
tab-delimited BED file, 0-based start position and 1-based end position
-o str, --out-file str
output file, default: output to stdout
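The help text mixes two coordinate conventions: regions use 1-based start and end, while BED files use a 0-based start and a 1-based end. The following sketch parses a region string and converts a BED interval to that format; both helper names are hypothetical and not part of pyfastx.

```python
def parse_region(region):
    """Parse 'chr:start-end' (1-based, inclusive) into (name, start, end)."""
    # Split on the last ':' so that names containing ':' still work.
    name, _, span = region.rpartition(":")
    start_text, _, end_text = span.partition("-")
    start, end = int(start_text), int(end_text)
    if not name or start < 1 or end < start:
        raise ValueError(f"invalid region: {region!r}")
    return name, start, end


def bed_to_region(name, bed_start, bed_end):
    """Convert a BED interval (0-based start, 1-based end) to 'chr:start-end'."""
    return f"{name}:{bed_start + 1}-{bed_end}"


print(parse_region("JZ822577.1:1-10"))
print(bed_to_region("JZ822577.1", 0, 10))
```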
Sample sequences
$ pyfastx sample -h
usage: pyfastx sample [-h] (-n int | -p float) [-s int] [--sequential-read] [-o str] fastx
positional arguments:
fastx fasta or fastq file, gzip support
optional arguments:
-h, --help show this help message and exit
-n int number of sequences to be sampled
-p float proportion of sequences to be sampled, 0~1
-s int, --seed int random seed, default is the current system time
--sequential-read start sequential reading, particularly suitable for sampling large numbers of sequences
-o str, --out-file str
output file, default: output to stdout
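The pyfastx sample command samples sequences from a FASTA or FASTQ file. Its options are:
- Positional argument:
  - fastx: the input FASTA or FASTQ file; gzip is supported.
- Optional argument:
  - -h, --help: show the help message and exit.
- Sampling options:
  - -n int: number of sequences to sample.
  - -p float: proportion of sequences to sample, from 0 to 1.
- Randomization option:
  - -s int, --seed int: random seed for reproducibility. Defaults to the current system time.
- Reading option:
  - --sequential-read: start sequential reading, particularly suitable for sampling large numbers of sequences.
- Output option:
  - -o str, --out-file str: output file for the sampled sequences. Defaults to stdout.
Example usage:
- Sample 100 sequences from the input file:
pyfastx sample -n 100 input.fasta
- Sample 20% of the sequences from the input file, with a seed for reproducibility:
pyfastx sample -p 0.2 -s 42 input.fastq
- Sample 500 sequences from the input file with sequential reading:
pyfastx sample -n 500 --sequential-read input.fasta.gz
- Sample 10% of the sequences and save the output to a file:
pyfastx sample -p 0.1 -o sampled_sequences.fasta input.fastq.gz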
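The interplay of -n, -p and -s can be illustrated with the standard random module. This is a sketch of seeded sampling in general, not of pyfastx's algorithm: the function sample_names is hypothetical, and turning a proportion into a fixed count of round(p * N) is an assumption.

```python
import random


def sample_names(names, n=None, p=None, seed=None):
    """Pick n names, or a proportion p of them, reproducibly for a given seed."""
    if (n is None) == (p is None):
        raise ValueError("give exactly one of n or p")
    if p is not None:
        if not 0 <= p <= 1:
            raise ValueError("p must be between 0 and 1")
        n = round(len(names) * p)
    rng = random.Random(seed)  # seed=None falls back to system entropy
    chosen = set(rng.sample(range(len(names)), n))
    # Return the picked names in their original file order.
    return [name for i, name in enumerate(names) if i in chosen]


names = [f"read{i}" for i in range(100)]  # made-up read names
print(sample_names(names, n=5, seed=42) == sample_names(names, n=5, seed=42))
print(len(sample_names(names, p=0.2, seed=1)))
```

With the same seed and input, the same reads are selected on every run, which is the purpose of the -s option.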
Extract sequences
New in pyfastx 0.6.10
$ pyfastx extract -h
usage: pyfastx extract [-h] [-l str] [--reverse-complement] [--out-fasta] [-o str] [--sequential-read]
fastx [name [name ...]]
positional arguments:
fastx fasta or fastq file, gzip support
name sequence name or read name, multiple names were separated by space
optional arguments:
-h, --help show this help message and exit
-l str, --list-file str
a file containing sequence or read names, one name per line
--reverse-complement output reverse complement sequence
--out-fasta output fasta format when extract reads from fastq, default output fastq format
-o str, --out-file str
output file, default: output to stdout
--sequential-read start sequential reading, particularly suitable for extracting large numbers of sequences