Polymorphic rate over protein-coding genes with polymut.py
This function calculates polymorphic site rates over protein-coding genes. It considers the dominant and second-dominant alleles of protein-coding genes at the nucleotide level, translates the ORFs into proteins, and then calculates and outputs the number of synonymous and non-synonymous mutations (at the protein level) between the dominant and second-dominant protein sequences. Positions where the ratio of second-dominant to dominant allele coverage is smaller than dominant_frq_thrsh are considered non-variant. This function was used in the study by Pasolli et al., 2019 as an ad-hoc measure of strain heterogeneity in metagenomes. The likelihood of finding more than one strain in the same gut varies strongly across gut commensals, as does within-species genetic diversity, so this function does not allow a rigorous classification of metagenomes into strain-mixed and non-strain-mixed. It can be shown, however, that when polymorphic site rates are considered over, e.g., the core genes of a given species, samples with a higher polymorphic site rate are more likely to harbour more than one strain.
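The thresholding and counting described above can be illustrated with a small, self-contained sketch. It is not the code in polymut.py: the names call_alleles and count_mutations, the per-position coverage dictionaries used as input, the use of the standard genetic code, and the choice to count one mutation per differing codon are all assumptions made for illustration.

```python
# Illustrative sketch only; names and input layout are hypothetical,
# not taken from polymut.py.

# Standard genetic code, built in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def call_alleles(counts, dominant_frq_thrsh):
    """Return (dominant, second_dominant) bases for one position.

    counts maps base -> coverage and must not be empty. If the ratio of
    second-dominant to dominant coverage is below the threshold, the
    position is treated as non-variant and the dominant base is returned
    twice.
    """
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    dominant, dom_cov = ranked[0]
    second, sec_cov = ranked[1] if len(ranked) > 1 else (dominant, 0)
    if dom_cov == 0 or sec_cov / dom_cov < dominant_frq_thrsh:
        return dominant, dominant
    return dominant, second


def count_mutations(pileup, dominant_frq_thrsh):
    """Count (synonymous, non_synonymous) codon differences for one ORF.

    pileup is a list of per-position coverage dicts, in frame, containing
    only A, C, G and T. Trailing positions that do not fill a codon are
    ignored.
    """
    pairs = [call_alleles(c, dominant_frq_thrsh) for c in pileup]
    dom_seq = "".join(p[0] for p in pairs)
    sec_seq = "".join(p[1] for p in pairs)
    syn = nonsyn = 0
    for i in range(0, len(dom_seq) - len(dom_seq) % 3, 3):
        dom_codon, sec_codon = dom_seq[i:i + 3], sec_seq[i:i + 3]
        if dom_codon == sec_codon:
            continue  # no variant position in this codon
        if CODON_TABLE[dom_codon] == CODON_TABLE[sec_codon]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# Toy ORF: ATG GCT AAA, with two variant positions.
example_pileup = [
    {"A": 10}, {"T": 10}, {"G": 10},
    {"G": 10}, {"C": 10}, {"T": 8, "C": 4},   # GCT vs GCC: both Ala
    {"A": 9, "G": 3}, {"A": 10}, {"A": 10},   # AAA (Lys) vs GAA (Glu)
]
print(count_mutations(example_pileup, 0.1))
```

In this toy input, raising the threshold to 0.4 makes the A/G position (ratio 3/9) non-variant, so only the synonymous change remains.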
Please supply a GFF file from Prokka and make sure that the contig names in the BAM file and the GFF file match.
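One way to check the contig names before running the tool is to compare the sequence IDs in the GFF file with the @SQ entries of the BAM header (as printed by, for example, samtools view -H). The sketch below is a hypothetical helper, not part of polymut.py; it assumes the GFF feature lines are tab-separated with the contig name in the first column and that any embedded sequence section starts with a ##FASTA line.

```python
# Hypothetical pre-flight check; function names are illustrative only.

def gff_contigs(gff_lines):
    """Collect contig names from the first column of GFF feature lines."""
    names = set()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequences follow; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        names.add(line.split("\t", 1)[0])
    return names


def sam_header_contigs(header_lines):
    """Collect contig names from the SN: field of @SQ header lines."""
    names = set()
    for line in header_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def unmatched_contigs(gff_lines, header_lines):
    """Return (only_in_gff, only_in_bam) as sorted lists."""
    gff = gff_contigs(gff_lines)
    bam = sam_header_contigs(header_lines)
    return sorted(gff - bam), sorted(bam - gff)


# Toy inputs standing in for the real files.
gff_example = [
    "##gff-version 3\n",
    "contig_1\tProdigal\tCDS\t1\t90\t.\t+\t0\tID=gene_1\n",
    "contig_2\tProdigal\tCDS\t5\t200\t.\t-\t0\tID=gene_2\n",
    "##FASTA\n",
    ">contig_1\n",
]
header_example = [
    "@HD\tVN:1.6\tSO:coordinate\n",
    "@SQ\tSN:contig_1\tLN:1000\n",
    "@SQ\tSN:contig_3\tLN:500\n",
]
print(unmatched_contigs(gff_example, header_example))
```

Empty lists in both positions mean every contig name is present in both files.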