Introduction to pyfastx


Author: 逍遥_yjz | Published: 2024-02-05 14:14

Introduction

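pyfastx is a lightweight Python C extension that enables users to randomly access sequences from plain and gzip-compressed FASTA/Q files. The module aims to provide a simple API for extracting sequences from FASTA and reads from FASTQ by identifier and by index number. An index, stored in an sqlite3 database file, is built for random access so that memory consumption stays low. In addition, pyfastx can parse both standard FASTA (the sequence is spread over multiple lines of the same length) and nonstandard FASTA (the sequence is spread over one or more lines of different lengths).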

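Features

  • A single file for the Python extension
  • Lightweight and memory-efficient parsing of FASTA/Q files
  • Fast random access to sequences in gzipped FASTA/Q files
  • Read sequences from a FASTA file line by line
  • Calculate N50 and L50 of the sequences in a FASTA file
  • Calculate GC content and nucleotide composition
  • Compute reverse complement sequences
  • Excellent compatibility, with support for parsing nonstandard FASTA files
  • Support for FASTQ quality score conversion
  • A command line interface for splitting FASTA/Q files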

Installation

Currently, pyfastx supports Python 3.6 and later, and it can be installed with pip.

pip install pyfastx

FASTX

FASTA sequence iteration

When iterating over sequences with a Fastx object, a tuple (name, seq) is returned.

import pyfastx

fa = pyfastx.Fastx('tests/data/test.fa.gz')
for name,seq in fa:
    print(name)
    print(seq)

'''
JZ822577.1
CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT
'''

#always output uppercase sequence
for item in pyfastx.Fastx('tests/data/test.fa', uppercase=True):
    print(item)

'''
('JZ822577.1', 'CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT')
'''
    
# manually specify the sequence format
for item in pyfastx.Fastx('tests/data/test.fa', format="fasta"):
    print(item)
    
'''
('JZ822577.1', 'CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT')
'''

If you also want the sequence comment, set comment to True (new in pyfastx 0.9.0).

fa = pyfastx.Fastx('tests/data/test.fa.gz', comment=True)
for name,seq,comment in fa:
    print(name)
    print(seq)
    print(comment)

'''
JZ822577.1
CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT
contig1 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence
'''
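
To make the (name, seq, comment) tuples concrete, here is a minimal pure-Python sketch that does not use pyfastx at all. It assumes the name is the first whitespace-delimited token of the header line and the comment is everything after it. `iter_fasta` and the demo records are hypothetical and only for illustration; this is not how pyfastx is implemented.

```python
import io


def iter_fasta(handle):
    """Yield (name, seq, comment) tuples from a FASTA text stream.

    Hypothetical helper: the first token of the header is taken as the name
    and the rest of the header as the comment.
    """
    name, comment, chunks = None, "", []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks), comment
            # split the header into name and comment at the first space
            name, _, comment = line[1:].partition(" ")
            chunks = []
        elif line:
            # sequence lines may have any length, so just collect them
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks), comment


demo = io.StringIO(">seq1 first demo record\nACGT\nAC\n>seq2\nGGCC\n")
records = list(iter_fasta(demo))
for name, seq, comment in records:
    print(name, seq, comment)
```

Because sequence lines are simply concatenated, the sketch handles both equal-length and uneven line wrapping.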

FASTQ read iteration

When iterating over reads with a Fastx object, a tuple (name, seq, qual) is returned.

fq = pyfastx.Fastx('tests/data/test.fq.gz')
for name,seq,qual in fq:
    print(name)
    print(seq)
    print(qual)
'''
A00129:183:H77K2DMXX:1:1101:6804:1031
TGCACACGTAGGCGCGAGCGTCGCCGCCGGCGGCCTTGCAGGCAGCGACGGCTTCGTCGAGACGTTCGCGGTTGAGATCGACCAGGGCCAGCCTGGCGCCCTTGCCGGCGAGATATTCGCCCATTGCCCGGCCGAGCCCCTGGCAGCCGC
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFF
'''

If you also want the read comment (the second field of the header line), set comment to True (new in pyfastx 0.9.0).

fq = pyfastx.Fastx('tests/data/test.fq.gz', comment=True)
for name,seq,qual,comment in fq:
    print(name)
    print(seq)
    print(qual)
    print(comment)

'''
A00129:183:H77K2DMXX:1:1101:6804:1031
TGCACACGTAGGCGCGAGCGTCGCCGCCGGCGGCCTTGCAGGCAGCGACGGCTTCGTCGAGACGTTCGCGGTTGAGATCGACCAGGGCCAGCCTGGCGCCCTTGCCGGCGAGATATTCGCCCATTGCCCGGCCGAGCCCCTGGCAGCCGC
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFF
1:N:0:CAATGGAA+CGAGGCTG
'''
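
The four-element tuple maps directly onto the four lines of a FASTQ record. The sketch below is plain Python, not pyfastx code, and assumes every record occupies exactly four lines (no wrapped sequences). `iter_fastq` and the demo read are hypothetical.

```python
import io
from itertools import islice


def iter_fastq(handle):
    """Yield (name, seq, qual, comment) from a 4-line-per-record FASTQ stream."""
    while True:
        block = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(block) < 4:
            break  # end of file, or a truncated trailing record
        header, seq, _plus, qual = block
        # header looks like "@name comment"; drop "@" and split at the first space
        name, _, comment = header[1:].partition(" ")
        yield name, seq, qual, comment


demo = io.StringIO("@read1 1:N:0:CAATGGAA+CGAGGCTG\nACGT\n+\nFF:F\n")
reads = list(iter_fastq(demo))
print(reads)
```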

FASTA(fa)

Read a FASTA file

Read a plain or gzipped FASTA file and build an index to support random access to the FASTA.

>>> import pyfastx
>>> fa = pyfastx.Fasta('tests/data/test.fa.gz')
>>> fa
<Fasta> tests/data/test.fa.gz contains 211 seqs

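FASTA record iteration

The fastest way to iterate over a plain or gzipped FASTA file is without building an index. The iteration returns a tuple containing the name and the sequence.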

import pyfastx
for name, seq in pyfastx.Fasta('tests/data/test.fa.gz', build_index=False):
    print(name, seq)
    
'''
JZ822577.1 CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT
'''

You can also iterate over sequence objects from a Fasta object, like this:

import pyfastx
for seq in pyfastx.Fasta('tests/data/test.fa.gz'):
    print(seq.name)
    print(seq.seq)
    print(seq.description)

'''
JZ822577.1
CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT
JZ822577.1 contig1 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence
'''

Iterating with build_index=True (the default) returns sequence objects, which let you access the attributes of each sequence. New in pyfastx 0.6.3.

Get FASTA information

>>> # get sequence counts in FASTA
>>> len(fa)
211

>>> # get total sequence length of FASTA
>>> fa.size
86262

>>> # get GC content of DNA sequences in FASTA
>>> fa.gc_content
43.529014587402344

>>> # get GC skew of DNA sequences in FASTA
>>> # New in pyfastx 0.3.8
>>> fa.gc_skew
0.004287730902433395

>>> # get composition of nucleotides in FASTA
>>> fa.composition
{'A': 24534, 'C': 18694, 'G': 18855, 'T': 24179}

>>> # get fasta type (DNA, RNA, or protein)
>>> fa.type
'DNA'

>>> # check whether the fasta file is gzip compressed
>>> fa.is_gzip
True
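
The GC numbers above can be cross-checked against the composition dictionary. The sketch below assumes the usual definitions: GC content is the percentage of G plus C among all counted bases, and GC skew is (G - C) / (G + C). These formulas reproduce the values printed in the session, but treat them as an assumption about what pyfastx computes rather than a statement of its implementation.

```python
# composition copied from the session above
composition = {'A': 24534, 'C': 18694, 'G': 18855, 'T': 24179}


def gc_content(comp):
    """Percentage of G and C among all counted bases."""
    total = sum(comp.values())
    return 100.0 * (comp.get('G', 0) + comp.get('C', 0)) / total


def gc_skew(comp):
    """(G - C) / (G + C), positive when G is more frequent than C."""
    g, c = comp.get('G', 0), comp.get('C', 0)
    return (g - c) / (g + c)


total_bases = sum(composition.values())
print(total_bases)                 # 86262, the same as fa.size
print(gc_content(composition))     # close to the 43.529... reported by fa.gc_content
print(gc_skew(composition))        # close to the 0.0042877... reported above
```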

Get the longest and shortest sequence

New in pyfastx 0.3.0

>>> # get longest sequence
>>> s = fa.longest
>>> s
<Sequence> JZ822609.1 with length of 821

>>> s.name
'JZ822609.1'

>>> len(s)
821

>>> # get shortest sequence
>>> s = fa.shortest
>>> s
<Sequence> JZ822617.1 with length of 118

>>> s.name
'JZ822617.1'

>>> len(s)
118


Calculate N50 and L50

New in pyfastx 0.3.0

Calculate the assembly N50 and L50. The result is returned as the tuple (N50, L50).

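N50 and L50 are two metrics that describe the distribution of sequence lengths in a dataset. They are commonly used to assess the quality of sequencing data and the result of an assembly.

  1. N50:

    • Definition: N50 is a length-weighted median. Sort the sequences from longest to shortest and add up their lengths starting from the longest; N50 is the length of the sequence at which the running total reaches or exceeds 50% of the total sequence length.
    • Meaning: N50 gives a more representative sequence length because it accounts for the influence of longer sequences. A larger N50 means the dataset contains more long sequences.
  2. L50:

    • Definition: L50 is the companion statistic of N50: the smallest number of sequences, counted from the longest, whose lengths add up to at least 50% of the total sequence length.
    • Meaning: L50 gives an intuitive sense of how many sequences are needed to cover half of the total length. A smaller L50 is usually better, because it means fewer sequences account for a large share of the total length.

These two metrics are useful when comparing the quality of different sequencing datasets or assemblies, especially when evaluating a genome assembly.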
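
The definitions above translate into a few lines of Python. This is a sketch of the textbook definition, not pyfastx's own code; the function name `nl` simply mirrors `fa.nl()` and the input lengths are made up.

```python
def nl(lengths, percent=50):
    """Return (Nxx, Lxx) for a list of sequence lengths."""
    threshold = sum(lengths) * percent / 100
    running = 0
    # walk from the longest sequence to the shortest, accumulating length
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must not be empty")


lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]   # total length 54
print(nl(lengths))        # (8, 3): 10 + 9 + 8 = 27 reaches half of 54
print(nl(lengths, 90))    # (4, 7): seven sequences are needed to reach 48.6
```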

>>> # get FASTA N50 and L50
>>> fa.nl(50)
(516, 66)

>>> # get FASTA N90 and L90
>>> fa.nl(90)
(231, 161)

>>> # get FASTA N75 and L75
>>> fa.nl(75)
(365, 117)

Get the mean and median sequence length

New in pyfastx 0.3.0

>>> # get sequence average length
>>> fa.mean
408

>>> # get sequence median length
>>> fa.median
430

Get sequence counts

New in pyfastx 0.3.0

Get the number of sequences whose length is >= a specified length.

>>> # get counts of sequences with length >= 200 bp
>>> fa.count(200)
173

>>> # get counts of sequences with length >= 500 bp
>>> fa.count(500)
70

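Get subsequences

Subsequences can be retrieved from a FASTA file with a list of [start, end] coordinates. As the examples show, positions are 1-based and both ends are included.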

>>> # get subsequence with start and end position
>>> interval = (1, 10)
>>> fa.fetch('JZ822577.1', interval)
'CTCTAGAGAT'

>>> # get subsequences with a list of start and end position
>>> intervals = [(1, 10), (50, 60)]
>>> fa.fetch('JZ822577.1', intervals)
'CTCTAGAGATTTTAGTTTGAC'

>>> # get subsequences with reverse strand
>>> fa.fetch('JZ822577.1', (1, 10), strand='-')
'ATCTCTAGAG'
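
The coordinate convention is easy to get wrong, so here is a plain-Python sketch that reproduces the three results above on the first 60 bases of JZ822577.1. The `fetch` function is a hypothetical stand-in written for this note. In particular, reverse-complementing the joined intervals on the '-' strand is an assumption about multi-interval behaviour that the session above does not show.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def fetch(seq, intervals, strand="+"):
    """Join 1-based, inclusive (start, end) intervals taken from seq."""
    if isinstance(intervals, tuple):
        intervals = [intervals]
    # 1-based inclusive [start, end] corresponds to the Python slice [start-1:end]
    sub = "".join(seq[start - 1:end] for start, end in intervals)
    if strand == "-":
        sub = sub.translate(COMPLEMENT)[::-1]   # reverse complement
    return sub


# first 60 bases of JZ822577.1, written in blocks of 10 for readability
seq = ("CTCTAGAGAT" "TACTTCTTCA" "CATTCCAGAT"
       "CACTCAGGCT" "CTTTGTCATT" "TTAGTTTGAC")

print(fetch(seq, (1, 10)))
print(fetch(seq, [(1, 10), (50, 60)]))
print(fetch(seq, (1, 10), strand="-"))
```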

Key function

Sometimes your FASTA file has long header lines that contain several identifiers and a description, for example ">JZ822577.1 contig1 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence". In this case, both "JZ822577.1" and "contig1" can serve as the identifier. You can specify a key function to select which one is used.

>>> # by default, JZ822577.1 is used as the identifier
>>> # specify key_func to select contig1 as the identifier
>>> fa = pyfastx.Fasta('tests/data/test.fa.gz', key_func=lambda x: x.split()[1])
>>> fa
<Fasta> tests/data/test.fa.gz contains 211 seqs
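
To see what the lambda does, you can apply it to a header string directly. The sketch below assumes the key function receives the header line without the leading ">"; the names `default_key` and `contig_key` are made up for this note.

```python
header = ("JZ822577.1 contig1 cDNA library of flower petals in tree peony by "
          "suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence")


def default_key(line):
    """First whitespace-delimited token, which is the default identifier."""
    return line.split()[0]


def contig_key(line):
    """Second token, the same selection as lambda x: x.split()[1]."""
    return line.split()[1]


print(default_key(header))
print(contig_key(header))
```

A key function like this fails with an IndexError on headers that contain only one token, so it is worth checking that every header in the file has the expected shape.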

Sequence -> fa

Get a sequence from FASTA

>>> # get sequence like a dictionary by identifier
>>> s1 = fa['JZ822577.1']
>>> s1
<Sequence> JZ822577.1 with length of 333

>>> # get sequence like a list by index
>>> s2 = fa[2]
>>> s2
<Sequence> JZ822579.1 with length of 176

>>> # get last sequence
>>> s3 = fa[-1]
>>> s3
<Sequence> JZ840318.1 with length of 134

>>> # check whether a sequence name is in the FASTA file
>>> 'JZ822577.1' in fa
True

Sequence slicing

A sequence object can be sliced like a Python string.

>>> # get a sub seq from sequence
>>> s = fa[-1]
>>> ss = s[10:30]
>>> ss
<Sequence> JZ840318.1 from 11 to 30

>>> ss.name
'JZ840318.1:11-30'

>>> s.seq
'ACTGGAGGTTCTTCTTCCTGTGGAAAGTAACTTGTTTTGCCTTCACCTGCCTGTTCTTCACATCAACCTTGTTCCCACACAAAACAATGGGAATGTTCTCACACACCCTGCAGAGATCACGATGCCATGTTGGT'

>>> ss.seq
'CTTCTTCCTGTGGAAAGTAA'

>>> ss = s[-10:]
>>> ss
<Sequence> JZ840318.1 from 125 to 134

>>> ss.name
'JZ840318.1:125-134'

>>> ss.seq
'CCATGTTGGT'

Slice start and end coordinates are 0-based. Currently, pyfastx does not support the optional third step (stride) argument, for example ss[::-1].
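
The sliced names above (11-30 and 125-134) follow from converting a 0-based, end-exclusive Python slice into 1-based, inclusive coordinates. The sketch below shows that arithmetic with the standard `slice.indices` method. `slice_to_name` is a hypothetical helper, not a pyfastx function.

```python
def slice_to_name(name, length, key):
    """Map a Python slice on a sequence of given length to 'name:start-end'."""
    start, stop, step = key.indices(length)   # resolves negative and missing bounds
    if step != 1:
        raise ValueError("a step other than 1 is not supported")
    # 0-based half-open [start, stop) equals 1-based closed [start + 1, stop]
    return f"{name}:{start + 1}-{stop}"


print(slice_to_name("JZ840318.1", 134, slice(10, 30)))     # same span as s[10:30]
print(slice_to_name("JZ840318.1", 134, slice(-10, None)))  # same span as s[-10:]
```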

Reverse and complement sequence

Get the reverse, the complement, or the reverse complement (antisense) of a Sequence.

>>> # get sliced sequence
>>> fa[0][10:20].seq
'GTCAATTTCC'

>>> # get reverse of sliced sequence (read from the last base to the first)
>>> fa[0][10:20].reverse
'CCTTTAACTG'

>>> # get complement of sliced sequence
>>> fa[0][10:20].complement
'CAGTTAAAGG'

>>> # get reverse complement sequence, corresponding to the sequence on the antisense strand
>>> fa[0][10:20].antisense
'GGAAATTGAC'

fa = pyfastx.Fasta('tests/data/test.fa.gz')

for seq in fa:
    print(seq.name)
    print(seq.seq)
    print(seq.reverse) # reversed sequence
    print(seq.description)
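
If the three operations are easy to mix up, the following plain-Python sketch reproduces the outputs shown above for 'GTCAATTTCC'. It only covers the four unambiguous DNA bases; how pyfastx treats other symbols is not covered here.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Same bases, read from the last to the first."""
    return seq[::-1]


def complement(seq):
    """Each base replaced by its pairing partner, order unchanged."""
    return seq.translate(COMPLEMENT)


def antisense(seq):
    """Reverse complement: the sequence as read on the opposite strand."""
    return complement(seq)[::-1]


s = "GTCAATTTCC"
print(reverse(s), complement(s), antisense(s))
```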

Search for a subsequence

New in pyfastx 0.3.6

Search for a subsequence in a given sequence and get the 1-based start position of its first occurrence.

>>> # search subsequence in sense strand
>>> fa[0].search('GCTTCAATACA')
262

>>> # check whether a subsequence is in the sequence
>>> 'GCTTCAATACA' in fa[0]
True

>>> # search subsequence in antisense strand
>>> fa[0].search('CCTCAAGT', '-')
301
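
For the sense strand, the 1-based position is just Python's 0-based `str.find` result plus one. The sketch below shows only that case; it makes no claim about how positions on the antisense strand are reported. `search` here is a hypothetical plain function, and returning None for a miss is an assumption made for this note.

```python
def search(seq, sub):
    """Return the 1-based start of the first occurrence of sub, or None."""
    pos = seq.find(sub)          # 0-based index, -1 when sub is absent
    return pos + 1 if pos >= 0 else None


print(search("AACCGGTT", "CCGG"))   # 3
print(search("AACCGGTT", "TTTT"))   # None
```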

FastaKeys -> fa

New in pyfastx 0.8.0. We have changed Identifier object to FastaKeys object.

Get keys

Get all sequence names as a list-like object.

>>> ids = fa.keys()
>>> ids
<FastaKeys> contains 211 keys

>>> # get count of sequence
>>> len(ids)
211

>>> # get key by index
>>> ids[0]
'JZ822577.1'

>>> # check key whether in fasta
>>> 'JZ822577.1' in ids
True

>>> # iterate over keys
>>> for name in ids:
...     print(name)

>>> # convert to a list
>>> list(ids)

Sort keys

Sort keys by sequence id, name, or length for iteration.

New in pyfastx 0.5.0

>>> # sort keys by length with descending order
>>> # (author's note: this one did not seem very useful in practice)
>>> for name in ids.sort(by='length', reverse=True):
...     print(name)

>>> # sort keys by name with ascending order
>>> # (author's note: this works, the keys come out sorted by name)
>>> for name in ids.sort(by='name'):
...     print(name)

>>> # sort keys by id with descending order
>>> for name in ids.sort(by='id', reverse=True):
...     print(name)

Filter keys

Filter keys by sequence length and name.

New in pyfastx 0.5.10

>>> # get keys with length > 600
>>> ids.filter(ids > 600)
<FastaKeys> contains 48 keys

>>> # get keys with length >= 500 and <= 700
>>> ids.filter(ids>=500, ids<=700)
<FastaKeys> contains 48 keys

>>> # get keys with length > 500 and < 600
>>> ids.filter(500<ids<600)
<FastaKeys> contains 22 keys

>>> # get keys contain JZ8226
>>> ids.filter(ids % 'JZ8226')
<FastaKeys> contains 90 keys

>>> # get keys contain JZ8226 with length > 550
>>> ids.filter(ids % 'JZ8226', ids>550)
<FastaKeys> contains 17 keys

>>> # clear sort order and filters
>>> ids.reset()
<FastaKeys> contains 211 keys

>>> # list a filtered result
>>> ids.filter(ids % 'JZ8226', ids>730)
>>> list(ids)
['JZ822609.1', 'JZ822650.1', 'JZ822664.1', 'JZ822699.1']

>>> # list a filtered result with sort order
>>> ids.filter(ids % 'JZ8226', ids>730).sort('length', reverse=True)
>>> list(ids)
['JZ822609.1', 'JZ822699.1', 'JZ822664.1', 'JZ822650.1']

>>> ids.filter(ids % 'JZ8226', ids>730).sort('name', reverse=True)
>>> list(ids)
['JZ822699.1', 'JZ822664.1', 'JZ822650.1', 'JZ822609.1']
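
If the operator syntax (`ids > 730`, `ids % 'JZ8226'`) looks unusual, it may help to see the same idea in ordinary Python. The sketch below filters and sorts a small dictionary of made-up names and lengths. It is an analogy for the behaviour shown above, not a description of how FastaKeys works internally.

```python
# hypothetical sequence names mapped to their lengths
lengths = {"seqA": 820, "seqB": 450, "seqC": 610, "seqD": 745}


def filter_keys(lengths, contains="", min_len=0):
    """Keep names that contain a substring and are longer than min_len."""
    return [name for name, size in lengths.items()
            if contains in name and size > min_len]


keys = filter_keys(lengths, contains="seq", min_len=600)
by_length = sorted(keys, key=lengths.get, reverse=True)   # longest first
by_name = sorted(keys, reverse=True)                      # names in descending order

print(keys)
print(by_length)
print(by_name)
```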

FASTQ(fq)

Read a FASTQ file

Read a plain or gzipped FASTQ file and build an index to support random access to the reads in the FASTQ.

>>> import pyfastx
>>> fq = pyfastx.Fastq('tests/data/test.fq.gz')
>>> fq
<Fastq> tests/data/test.fq.gz contains 800 reads

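FASTQ record iteration

The fastest way to parse a plain or gzipped FASTQ file is without building an index. The iteration returns a tuple containing the read name, the sequence, and the quality string.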

>>> import pyfastx
>>> for name,seq,qual in pyfastx.Fastq('tests/data/test.fq.gz', build_index=False):
...     print(name)
...     print(seq)
...     print(qual)

You can also iterate over read objects from a Fastq object, like this:

for read in pyfastx.Fastq('tests/data/test.fq.gz'):
    print(read)
    print(read.name)
    print(read.seq)
    print(read.qual)
    # quali is the list of integer quality scores: the ASCII code of each quality character minus the phred offset (with Phred+33, 'F' is 70 - 33 = 37)
    print(read.quali)
    
'''
TGCACACGTAGGCGCGAGCGTCGCCGCCGGCGGCCTTGCAGGCAGCGACGGCTTCGTCGAGACGTTCGCGGTTGAGATCGACCAGGGCCAGCCTGGCGCCCTTGCCGGCGAGATATTCGCCCATTGCCCGGCCGAGCCCCTGGCAGCCGC
A00129:183:H77K2DMXX:1:1101:6804:1031
TGCACACGTAGGCGCGAGCGTCGCCGCCGGCGGCCTTGCAGGCAGCGACGGCTTCGTCGAGACGTTCGCGGTTGAGATCGACCAGGGCCAGCCTGGCGCCCTTGCCGGCGAGATATTCGCCCATTGCCCGGCCGAGCCCCTGGCAGCCGC
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFF
[37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37]
'''

Iterating with build_index=True (the default) returns read objects, which let you access the attributes of each read.

Get FASTQ information

>>> # get read counts in FASTQ
>>> len(fq)
800

>>> # get total bases
>>> fq.size
120000

>>> # get GC content of FASTQ file
>>> fq.gc_content
66.17471313476562

>>> # get composition of bases in FASTQ
>>> fq.composition
{'A': 20501, 'C': 39705, 'G': 39704, 'T': 20089, 'N': 1}

>>> # New in pyfastx 0.6.10
>>> # get average length of reads
>>> fq.avglen
150.0

>>> # get maximum length of reads
>>> fq.maxlen
150

>>> # get minimum length of reads
>>> fq.minlen
150

>>> # get maximum quality score
>>> fq.maxqual
70

>>> # get minimum quality score
>>> fq.minqual
35

>>> # get phred which affects the quality score conversion
>>> fq.phred
33

>>> # Guess fastq quality encoding system
>>> # New in pyfastx 0.4.1
>>> fq.encoding_type
['Sanger Phred+33', 'Illumina 1.8+ Phred+33']
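
Guessing the encoding boils down to checking which known ASCII ranges contain the observed minimum and maximum quality characters. The sketch below uses the commonly cited ranges for each system; whether pyfastx uses exactly these bounds is an assumption, and `guess_encoding` is a hypothetical helper.

```python
# (name, lowest ASCII code, highest ASCII code), commonly cited ranges
ENCODINGS = [
    ("Sanger Phred+33", 33, 73),
    ("Illumina 1.8+ Phred+33", 33, 74),
    ("Solexa Solexa+64", 59, 104),
    ("Illumina 1.3+ Phred+64", 64, 104),
    ("Illumina 1.5+ Phred+64", 66, 104),
]


def guess_encoding(minqual, maxqual):
    """Return every encoding whose range contains [minqual, maxqual]."""
    return [name for name, low, high in ENCODINGS
            if minqual >= low and maxqual <= high]


# fq.minqual and fq.maxqual from the session above
print(guess_encoding(35, 70))
```

With a minimum of 35 and a maximum of 70, only the two Phred+33 systems remain, which matches the list returned by fq.encoding_type above.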

Get a read from FASTQ

>>> #get read like a dict by read name
>>> r1 = fq['A00129:183:H77K2DMXX:1:1101:4752:1047']
>>> r1
<Read> A00129:183:H77K2DMXX:1:1101:4752:1047 with length of 150

>>> # get read like a list by index
>>> r2 = fq[10]
>>> r2
<Read> A00129:183:H77K2DMXX:1:1101:18041:1078 with length of 150

>>> # get the last read
>>> r3 = fq[-1]
>>> r3
<Read> A00129:183:H77K2DMXX:1:1101:31575:4726 with length of 150

>>> # check whether a read is in the FASTQ file
>>> 'A00129:183:H77K2DMXX:1:1101:4752:1047' in fq
True

Get read information

>>> r = fq[-10]
>>> r
<Read> A00129:183:H77K2DMXX:1:1101:1750:4711 with length of 150

>>> # get read order number in FASTQ file
>>> r.id
791

>>> # get read name
>>> r.name
'A00129:183:H77K2DMXX:1:1101:1750:4711'

>>> # get read full header line, New in pyfastx 0.6.3
>>> r.description
'@A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG'

>>> # get read length
>>> len(r)
150

>>> # get read sequence
>>> r.seq
'CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA'

>>> # get raw string of read, New in pyfastx 0.6.3
>>> print(r.raw)
@A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG
CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:

>>> # get read quality ascii string
>>> r.qual
'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:'

>>> # get read quality integer value, ascii - 33 or 64
>>> r.quali
[37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25]


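In a FASTQ file, quality scores are stored as ASCII characters, one per base, and each character encodes an integer quality score. The letter 'F', for example, has ASCII code 70, so with a Phred offset of 33 it stands for a quality score of 37.

In the example above, r.qual consists mostly of 'F' (score 37), with a few ':' (ASCII 58, score 25) and ',' (ASCII 44, score 11), which matches the values in r.quali.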
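
The conversion from the quality string to integer scores is a one-liner. The sketch below assumes a Phred+33 file, as reported by fq.phred above; `quality_scores` is a hypothetical helper and not part of pyfastx.

```python
def quality_scores(qual, phred=33):
    """Convert a FASTQ quality string into a list of integer scores."""
    return [ord(ch) - phred for ch in qual]


print(quality_scores("FF:,F"))   # [37, 37, 25, 11, 37]
```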

FastqKeys -> fq

Get FASTQ keys

Get all read names as a list-like object.

>>> ids = fq.keys()
>>> ids
<FastqKeys> contains 800 keys

>>> # get count of read
>>> len(ids)
800

>>> # get key by index
>>> ids[0]
'A00129:183:H77K2DMXX:1:1101:6804:1031'

>>> # check whether a key is in the fastq
>>> 'A00129:183:H77K2DMXX:1:1101:14416:1031' in ids
True

Command line interface

$ pyfastx -h

usage: pyfastx COMMAND [OPTIONS]

A command line tool for FASTA/Q file manipulation

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit

Commands:

    index        build index for fasta/q file
    stat         show detailed statistics information of fasta/q file
    split        split fasta/q file into multiple files
    fq2fa        convert fastq file to fasta file
    subseq       get subsequences from fasta file by region
    sample       randomly sample sequences from fasta or fastq file
    extract      extract full sequences or reads from fasta/q file

Build index

New in pyfastx 0.6.10

$ pyfastx index -h

usage: pyfastx index [-h] [-f] fastx [fastx ...]

positional arguments:
  fastx       fasta or fastq file, gzip support

optional arguments:
  -h, --help  show this help message and exit
  -f, --full  build full index, base composition will be calculated

$ pyfastx index -f tests/data/test.fq.gz

This generates the index file test.fq.gz.fxi.

Show statistics information

$ pyfastx stat -h

usage: pyfastx stat [-h] fastx

positional arguments:
  fastx       input fasta or fastq file, gzip support

optional arguments:
  -h, --help  show this help message and exit

Split FASTA/Q file

$ pyfastx split -h

usage: pyfastx split [-h] (-n int | -c int) [-o str] fastx

positional arguments:
  fastx                 fasta or fastq file, gzip support

optional arguments:
  -h, --help            show this help message and exit
  -n int                split a fasta/q file into N new files with even size
  -c int                split a fasta/q file into multiple files containing the same sequence counts
  -o str, --out-dir str
                        output directory, default is current folder

  • Split a file into files of equal size:

    pyfastx split -n 3 -o output_folder input.fasta

    This splits input.fasta into 3 new files of equal size and stores them in the output_folder directory.

  • Split a file into files that contain the same number of sequences:

    pyfastx split -c 1000 -o output_folder input.fastq

    This splits input.fastq into multiple new files of 1000 sequences each and stores them in the output_folder directory.
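
The idea behind splitting by count can be sketched in a few lines of Python: take records from the input in fixed-size batches until none are left. The `batches` helper is hypothetical and says nothing about how pyfastx names or writes the output files; it also assumes the last batch simply holds whatever records remain.

```python
from itertools import islice


def batches(records, size):
    """Yield consecutive lists of at most size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# 2500 stand-in records split into groups of 1000
sizes = [len(batch) for batch in batches(range(2500), 1000)]
print(sizes)   # [1000, 1000, 500]
```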

Convert FASTQ to FASTA file

$ pyfastx fq2fa -h

usage: pyfastx fq2fa [-h] [-o str] fastx

positional arguments:
  fastx                 fastq file, gzip support

optional arguments:
  -h, --help            show this help message and exit
  -o str, --out-file str
                        output file, default: output to stdout

$ pyfastx fq2fa -o result/output.fa tests/data/test.fq.gz
$ pyfastx fq2fa -o result/output.fasta tests/data/test.fq.gz

$ pyfastx fq2fa -o result/output_nogzip.fasta tests/data/test.fq

Get subsequences by region

$ pyfastx subseq -h

usage: pyfastx subseq [-h] [-r str | -b str] [-o str] fastx [region [region ...]]

positional arguments:
  fastx                 input fasta file, gzip support
  region                format is chr:start-end, start and end position is 1-based, multiple names were separated by space

optional arguments:
  -h, --help            show this help message and exit
  -r str, --region-file str
                        tab-delimited file, one region per line, both start and end position are 1-based
  -b str, --bed-file str
                        tab-delimited BED file, 0-based start position and 1-based end position
  -o str, --out-file str
                        output file, default: output to stdout
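
The help text mixes two coordinate conventions: regions are 1-based at both ends, while BED files use a 0-based start and a 1-based end. The sketch below shows the conversion between them. `parse_region` and `bed_to_region` are hypothetical helpers written for this note, not part of pyfastx.

```python
def parse_region(region):
    """Split 'chr:start-end' into (name, start, end) with 1-based positions."""
    # rpartition splits at the last colon, so names that contain ':' still work
    name, _, span = region.rpartition(":")
    start, _, end = span.partition("-")
    return name, int(start), int(end)


def bed_to_region(name, bed_start, bed_end):
    """Convert a BED interval to a 1-based 'chr:start-end' region string."""
    # only the start shifts by one; the BED end already equals the 1-based end
    return f"{name}:{bed_start + 1}-{bed_end}"


print(parse_region("JZ822577.1:1-10"))
print(bed_to_region("JZ822577.1", 0, 10))
```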

Sample sequences

$ pyfastx sample -h

usage: pyfastx sample [-h] (-n int | -p float) [-s int] [--sequential-read] [-o str] fastx

positional arguments:
  fastx                 fasta or fastq file, gzip support

optional arguments:
  -h, --help            show this help message and exit
  -n int                number of sequences to be sampled
  -p float              proportion of sequences to be sampled, 0~1
  -s int, --seed int    random seed, default is the current system time
  --sequential-read     start sequential reading, particularly suitable for sampling large numbers of sequences
  -o str, --out-file str
                        output file, default: output to stdout

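The pyfastx sample command samples sequences from a FASTA or FASTQ file. Its options are explained below:

  • Positional arguments:

    • fastx: the input FASTA or FASTQ file; gzip compression is supported.
  • Optional arguments:

    • -h, --help: show the help message and exit.
  • Sampling options:

    • -n int: number of sequences to sample.
    • -p float: proportion of sequences to sample, from 0 to 1.
  • Randomization options:

    • -s int, --seed int: random seed, used for reproducibility; defaults to the current system time.
  • Reading options:

    • --sequential-read: start sequential reading, particularly suitable for sampling large numbers of sequences.
  • Output options:

    • -o str, --out-file str: output file for the sampled sequences; defaults to stdout.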

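Example usage:

  1. Sample 100 sequences from the input file:

    pyfastx sample -n 100 input.fasta

  2. Sample 20% of the sequences from the input file, with a seed so the result is reproducible:

    pyfastx sample -p 0.2 -s 42 input.fastq

  3. Sample 500 sequences from the input file with sequential reading:

    pyfastx sample -n 500 --sequential-read input.fasta.gz

  4. Sample 10% of the sequences and save the output to a file:

    pyfastx sample -p 0.1 -o sampled_sequences.fasta input.fastq.gz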
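
Why a seed makes sampling reproducible can be shown with Python's standard random module. This sketch only illustrates the role of -n, -p and -s; it does not describe how pyfastx samples internally, and `sample_names` and the read names are made up.

```python
import random


def sample_names(names, n=None, p=None, seed=None):
    """Pick n names, or a proportion p of them, without replacement."""
    rng = random.Random(seed)        # a fixed seed gives the same draw every time
    if n is None:
        n = round(len(names) * p)
    return rng.sample(names, n)


names = [f"read{i}" for i in range(100)]
first = sample_names(names, p=0.2, seed=42)
second = sample_names(names, p=0.2, seed=42)
print(len(first), first == second)
```

Without a seed, two runs will almost always pick different sequences, which is why a fixed -s value is useful when a result has to be repeated.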
    

Extract sequences

New in pyfastx 0.6.10

$ pyfastx extract -h

usage: pyfastx extract [-h] [-l str] [--reverse-complement] [--out-fasta] [-o str] [--sequential-read]
                       fastx [name [name ...]]

positional arguments:
  fastx                 fasta or fastq file, gzip support
  name                  sequence name or read name, multiple names were separated by space

optional arguments:
  -h, --help            show this help message and exit
  -l str, --list-file str
                        a file containing sequence or read names, one name per line
  --reverse-complement  output reverse complement sequence
  --out-fasta           output fasta format when extract reads from fastq, default output fastq format
  -o str, --out-file str
                        output file, default: output to stdout
  --sequential-read     start sequential reading, particularly suitable for extracting large numbers of sequences
