美文网首页
基于OrthoFinder的结果提取直系同源单拷贝基因建树

基于OrthoFinder的结果提取直系同源单拷贝基因建树

作者: 深山夕照深秋雨OvO | 来源:发表于2023-04-20 10:17 被阅读0次

0.运行Orthofinder

1.建树

进入到Single_Copy_Orthologue_Sequences文件夹中

ls *fa|while read file;do mafft --auto $file > $file.aln;done
ls *aln|while read file;do seqkit sort $file|seqkit seq -w 0 > $file.format;done
rm *.aln
find  *format | xargs paste -d " " > ../all.single.copy.aln.fa && rm *format
这里就不提取4D位点了, 有的文章会提取有的文章则不提取
all.single.copy.aln.fa便是用来建树的文件

trimal -in all.single.copy.aln.fa -out all.single.copy.aln.trimal.fa
iqtree -s all.single.copy.aln.trimal.fa --runs 100
这个方法只能说命令行简单, 但是运行时间堪忧
因为我是slurm集群
for i in {00001..15175}
do
echo "mafft --auto OG00${i}.fa | seqkit sort - | seqkit seq -w 0 > OG00${i}.format" >> 1.sh
done
可以这样写,分1000行提交,运行速度更快
find  *format | xargs paste -d " " > ../all.single.copy.aln.fa && rm *format
这里就不提取4D位点了, 有的文章会提取有的文章则不提取
all.single.copy.aln.fa便是用来建树的文件

trimal -in all.single.copy.aln.fa -out all.single.copy.aln.trimal.fa
可以用的核心是32,所以-T设置为32
iqtree -s all.single.copy.aln.trimal.fa --runs 100 -T 32

https://github.com/yongzhiyang2012/Two_iron_wood_genome_analysis/tree/master/evolution/02.phylogenetic
在网上找到了相关的脚本也是构建物种树的, 包括后续的估计物种分歧时间和基因家族的收缩与扩张(https://www.jianshu.com/p/deb27036d3c5)都是基于这个网站的脚本做的

如果你使用了这些脚本, 可以引用下面这篇文章
Yang Y, Ma T, Wang Z, Lu Z, Li Y, Fu C, Chen X, Zhao M, Olson MS, Liu J: Genomic effects of population collapse in a critically endangered ironwood tree Ostrya rehderiana. Nature Communications 9, 5449 (2018).

下面介绍上述网站提及脚本的用法:

  1. 运行Orthofinder

1.输入文件的准备
上一步运行orthofinder, 在输出的文件夹Orthogroups中的 Orthogroups.txt 与 Orthogroups_SingleCopyOrthologues.txt是我们要用到的.
Orthogroups.txt中, 在基因名前面加上物种名, 比如原来的基因名是 evm.model.Chr1.1, 需要改成sppA|evm.model.Chr1.1, 即 "物种名" + “|”
此外, 在运行这些脚本的目录下创建一个文件夹data, 里面需要将orthofinder用到的物种的cds/pep文件做相应处理, 在基因名前加上物种名, 最后将cds/pep文件命名为 sppA.cds.clean.fa/sppA.pep.clean.fa


Orthogroups.txt修改后
sppA.pep.clean.fa
data文件夹里面的内容
python3 1.AllClusters.txt.single.copygene.txt.py Orthogroups_SingleCopyOrthologues.txt Orthogroups.txt > scg.txt
perl 1.splitSeq.pl data scg.txt
perl 2.align.pl > 2.align.pl.sh
sh 2.align.pl.sh
perl 3.connect.pl
perl 4.4Dsites.pl 3.connect.pl.connect.cds.fa
perl fasta2phylip.pl 4.4Dsites.pl.connect4Dsites.fa > A-c.phy
iqtree -s A-c.phy --runs 100

#输出的A-c.phy是后续做基因家族收缩与扩张要用到的文件
#A-c.phy.treefile即是物种树

3.该网站提供的脚本内容如下:
https://github.com/yongzhiyang2012/Two_iron_wood_genome_analysis/tree/master/evolution/02.phylogenetic

如果你使用了这些脚本, 可以引用下面这篇文章
Yang Y, Ma T, Wang Z, Lu Z, Li Y, Fu C, Chen X, Zhao M, Olson MS, Liu J: Genomic effects of population collapse in a critically endangered ironwood tree Ostrya rehderiana. Nature Communications 9, 5449 (2018).


vim 1.splitSeq.pl data scg.txt

use strict;
use warnings;
use Bio::SeqIO;

## created by Yongzhi Yang. 2017/3/20 ##

my ($dir,$singlecoypegenelist)=@ARGV;
die "Usage:\nperl $0 dir single_coype_gene_list
dir: containing cds and pep seq with the name of XXX.cds.clean.fa and XXX.pep.clean.fa. XXX represent the species ID.
single_coype_gene_list: clusterid\tGene_num\tSpecies_num\tOrthologous_genes_split_comma
" if (! $singlecoypegenelist);

my %list;
my %seq;
my @seq=<$dir/*clean.fa>;
for my $seqfile (@seq){
    $seqfile=~/\/(\w+)\.(\w+)\.clean.fa$/;
    my $type=$2;
    my $name=$1;
    my $fa=Bio::SeqIO->new(-file=>"$seqfile",-format=>'fasta');
    while (my $seq=$fa->next_seq) {
        my $id=$seq->id;
        my $seq=$seq->seq;
        my $key;
        if ($id=~/^$name\|/){
            $key=$id;
        }else{
            $key="$name|$id";
        }
        $seq{$type}{$key}=$seq;
    }
}

open (F,"$singlecoypegenelist")||die"$!";
while (<F>) {
    chomp;
    next if /^#/;
    my @a=split(/\t/,$_);
    my @b=split(/,/,$a[3]);
    for (my $i=0;$i<@b;$i++){
        $list{$a[0]}{$b[$i]}++;
    }
}
close F;

`mkdir align` if (! -e "align");
    for my $k (sort keys %list){
    my $dir="align/$k";
    `mkdir $dir` if (! -e "$dir");
    for my $type ("cds","pep"){
        open (O,">$dir/$type");
        for my $k2 (sort keys %{$list{$k}}){
            die "$k2\n" if ! exists $seq{$type}{$k2};
            $k2=~/^(\S+)\|/ or die "$k2\n";
            my $outid=$1;
            print O ">$outid\n$seq{$type}{$k2}\n";
        }
        close O;
    }
}

vim 2.align.pl

use strict;
use warnings;

## created by Yongzhi Yang. 2017/3/20 ##

my $muscle="~/muscle";  
#可以更换为其他序列比对软件

my $pal2nal="~/pal2nal.pl";

my @in=<align/*/pep>;
for my $in (@in){
    my $in2=$in;
    $in2=~s/pep/cds/;
    print "$muscle -in $in -out $in.best.fas ; $pal2nal $in.best.fas $in2 -output fasta > $in2.best.fas\n";
}
#如果更换了序列比对软件, 更改上述命令行即可

2.align.pl中提及的pal2nal.pl如下获取:
wget http://www.bork.embl.de/pal2nal/distribution/pal2nal.v14.tar.gz
tar -zxvf pal2nal.v14.tar.gz

vim 3.connect.pl

use strict;
use warnings;
use Bio::SeqIO;

## created by Yongzhi Yang. 2017/3/20 ##

my @in=<align/*/cds.best.fas>;
my %seq;
my %name;
my $alllen=0;
my $number;

for my $in (sort @in){
    my $fa=Bio::SeqIO->new(-file=>"$in",-format=>"fasta");
    my $len;
    while (my $seq=$fa->next_seq) {
        my $id=$seq->id;
        my $seq=$seq->seq;
        $len=length($seq);
        ###check
        if ($len % 3 != 0){
            die "wrong: $in\t$id\n";
        }
        my @seq=split(//,$seq);
        for (my $i=0;$i<@seq;$i += 3){
            my $word=$seq[$i].$seq[$i+1].$seq[$i+2];
            die "wrong: $in\t$id\n" if ($word=~/-/ && $word=~/\w/);
        }
        ###check done
        my @id=split(/\|/,$id);
        $seq{$id[0]} .= $seq;
        $name{$id[0]} .= $id;
    }
    $alllen += $len;
    $number++;
}

open (O,">$0.list");
print O "Length:\t$alllen\tNumber:\t$number\n";
for my $k1 (sort keys %name){
    my $v=$name{$k1};
    my @v=split(/$k1\|/,$v);
    shift @v;
    print O "$k1\t",scalar(@v),"\t",join(",",@v),"\n";
}
close O;
open (O,">$0.connect.cds.fa");
for my $k1 (sort keys %seq){
    print O ">$k1\n$seq{$k1}\n";
}
close O;

vim 4.4Dsites.pl

use strict;
use warnings;
use Bio::SeqIO;

## created by Yongzhi Yang. 2017/3/20 ##

my %fourDegenerateSite=(
            'TC'=>'Ser',
            'CT'=>'Leu',
            'CC'=>'Pho',
            'CG'=>'Arg',
            'AC'=>'Thr',
            'GT'=>'Val',
            'GC'=>'Phe',
            'GG'=>'Gln',
           );

my %seq;
my %count;
#my $in=shift or die "perl $0 \$inFasta output\n";
#my $out=shift or die "perl $0 \$inFasta output\n";
my $in="3.connect.pl.connect.cds.fa";
my $out="$0.connect4Dsites.fa";
my $fa=Bio::SeqIO->new(-file=>"$in",-format=>"fasta");
while (my $seq=$fa->next_seq) {
    my $id=$seq->id;
    my $seq=$seq->seq;
    my @seq=split(//,$seq);
    for (my $i=0;$i<@seq;$i=$i+3){
        my $key=$seq[$i].$seq[$i+1];
        my $value=$seq[$i+2];
        if (exists $fourDegenerateSite{$key}){
            $seq{$id}{$i}=$value;
            print "$id\t$i\t$value\t",$seq[$i-2],$seq[$i-1],$seq[$i],$seq[$i+1],$seq[$i+2],"\n" if ($value eq '-');
            $count{$i}++;
        }
    }
}


open (O,">$out");
open (O1,">$out.list");
my @k1=sort keys %seq;
for my $k1 (@k1){
    print O ">$k1\n";
    print O1 "$k1\n";
    for my $k2 (sort{$a<=>$b} keys %{$seq{$k1}}){
        my $count=$count{$k2};
        if ($count==scalar(@k1)){
            print O "$seq{$k1}{$k2}";
            print O1 $k2+2," ";
            #print "$k1\t$k2\t$seq{$k1}{$k2}\n" if $seq{$k1}{$k2} eq '-';
        }
    }
    print O "\n";
    print O1 "\n";
}
close O;
close O1;

cat fasta2phylip.pl

#!/usr/bin/perl

# Converts an aligned fasta (aa or dna) seq file to phylip format

# Copyright 2013, Naoki Takebayashi <ntakebayashi@alaska.edu>
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
# published by the Free Software Foundation; either version 3 of the
# License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

# Version: 20130612


my $usage = "Usage: $0 [-h] [-v] [-c numChar] [infile]\n" .
    "  -h: help\n" .
    "  -c: long seq names are shortened to 10 char, default: 7 chars from the\n".
    "      beggining is combined with the last 3 chars.  You can change the\n".
    "      behavior by this option.  E.g., -c 10 will shorten the long name\n" .
    "      by taking the first 10 characters of the name.\n".
    "  -v:  verbose (print name conversion in STDERR)\n" .
    " infile should be an aligned fasta, " .
    "STDIN is used if no infile is given\n";

use IO::File;
use Getopt::Std;
getopts('hvc:') || die "$usage\n";

die "$usage\n" if (defined ($opt_h));

my $totNumChar = 10;  # number of characters allowed for name in phylip
my $numFrontChar = 7; # When the name is too long, this amount of characters
                      # are used from the beginning of the name, and the rest
                      # are from the end of the name.
if (defined ($opt_c)) {
    if ($opt_c <= $totNumChar && $opt_c >= 0) {
    $numFrontChar = $opt_c;
    } else {
    die "Error: give an integer between 0 and $totNumChar (ends inclusive)".
        " for -c.\n";
    }
}

my $tmpFH = IO::File->new_tmpfile || die "Unable to make new temp file: $!";

my $firstLine = 1;
my $maxLen = 0;
my $numTaxa = 0;
my $name;

while(<>){

    chomp;
    s/^\s+//; s/\s$//;
    next if (/^$/);

    if (s/^>\s*//) {

    if ($firstLine == 0) {
        if ($seqLen != $maxLen && $maxLen != 0) {
        warn "WARN: The $numTaxa-th species ($name) have ",
        "different seq length\n";
        warn "Previous Seq Len: $maxLen, This Seq Len: $seqLen\n";
        }
        print $tmpFH "\n";    # end of the previous sequence
    } else {
        $firstLine = 0;
    }

    $maxLen = $seqLen if ($seqLen > $maxLen); $seqLen = 0;
    $numTaxa ++;

    $name = $_;
    if (CharLen($_) <=10) {
        printf $tmpFH "%-10s", $_;
    } else  {
        $shortName = TrimName($_);
        print STDERR "$_ => $shortName\n" if (defined ($opt_v));
        printf $tmpFH "%10s", $shortName;
    }
    } else {
    $seqLen += CharLen ($_);
    print $tmpFH $_;
    }
}

print $tmpFH "\n";

### print out to the STDOUT
print "$numTaxa   $maxLen\n";

seek ($tmpFH, 0, 0) || die "seek: $!";
my $line;
while (defined ($line = $tmpFH->getline())) {
    chomp ($line);
    print "$line";
    $missingBases = $maxLen - (CharLen($line) - $totNumChar);
    while ($missingBases > 0) {
    print "-";
    $missingBases--;
    }
    print "\n";
}

sub CharLen {  # returns number of characters in a string
    my $string = shift;
    return scalar(split (//, $string));
}

sub TrimName { # trim a long name
    my $name = shift;
    my @charArray = split (//, $name);

    if ($totNumChar == $numFrontChar) {
    return join ('', splice (@charArray, 0, $numFrontChar));
    }  else {
    return join ('', splice (@charArray, 0, $numFrontChar),
             splice (@charArray, - ($totNumChar - $numFrontChar)));
    }
}

相关文章

网友评论

      本文标题:基于OrthoFinder的结果提取直系同源单拷贝基因建树

      本文链接:https://www.haomeiwen.com/subject/fxyrjdtx.html