本例子展示的是如何在内含子和外显子中统计reads数量。不但包括了这个内容,还包括了其他的命名展示。例如
1、BAM file support (for more, see Working with BAM files)
2、indexing into Interval objects (for more, see Intervals)
3、filtering (for more, see Filtering)
4、streaming (for more, see Using BedTool objects as iterators/generators)
5、ability to use parallel processing
下面是含有注释信息的命令行展示
import multiprocessing
import pybedtools
# get example GFF and BAM filenames 获得例子中的GFF和BAM的内容gff = pybedtools.example_filename('gdc.gff')
bam = pybedtools.example_filename('gdc.bam')
# Some GFF files have invalid entries -- like chromosomes with negative coords GFF文件中含有不实的内容,例如染色为负数坐标的信息
# orfeatures of length = 0. This lineremoves them and saves the result in a tempfile 或者特征值的长度为0,可以使用命令行去除这些无意义的内容,并且存储为另外一个文件。g = pybedtools.BedTool(gff).remove_invalid().saveas()
# Next, we create a function to pass only features for a particular featuretype. This is similar to a"grep" operation when applied to every#feature in a BedTool
随后,我们构建了一个函数用来只显示符合要求的特征,类似于grep命令def featuretype_filter(feature, featuretype):
if feature[2] == featuretype:
return True
return False
# This function will eventually be run in parallel, applying the filter above to several different BedTools simultaneously多种不同的BedTools同时选择def subset_featuretypes(featuretype):
result = g.filter(featuretype_filter, featuretype).saveas()
return pybedtools.BedTool(result.fn)
# Thisfunction performs the intersection of a BAM file with a GFF file and returns the total number of hits. Itwill eventually be run in parallel.def count_reads_in_features(features_fn):
"""
Callback function to count reads infeatures
"""
# BAM files are auto-detected; no need foran `abam` argument. Here we # construct a new BedTool out of the BAM file and intersect it with the features filename. We use stream=True so that no intermediate tempfile is created, and bed=True so that the.count() method can iterate through the resulting streamed BedTool.return pybedtools.BedTool(bam).intersect(b=features_fn, stream=True).count()
# Set up a pool of workers for parallel processing pool = multiprocessing.Pool()
# Create separate files for introns and exons, using the function we defined abovefeaturetypes = ('intron', 'exon')
introns, exons = pool.map(subset_featuretypes, featuretypes)
# Perform some genome algebra to get unique and shared intron/exon regions. Here we keep only the filename of the results, which is safer than an entire BedTool for passing around in parallel computations.exon_only = exons.subtract(introns).merge().remove_invalid().saveas().fn
intron_only = introns.subtract(exons).merge().remove_invalid().saveas().fn
intron_and_exon = exons.intersect(introns).merge().remove_invalid().saveas().fn
# Do intersections with BAM file in parallel, using the other function we defined abovefeatures = (exon_only, intron_only, intron_and_exon)
results = pool.map(count_reads_in_features, features)
# Print the resultslabels = (' exon only:', ' intron only:', 'intron and exon:')
for label, reads in zip(labels, results):
sys.stdout.write('%s %s\n' % (label, reads))
网友评论