Writing a zsh script (mostly awk) to merge GFF files and remove overlapping regions

Author: 精神秃头生信小伙儿 | Published 2021-02-16 00:13

    Background

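    Here is the situation. Suppose you ran several annotation tools against several reference species and ended up with a pile of GFF files. Those files are bound to overlap. For annotation purposes only the regions carrying gene, exon, and CDS features matter, and among overlapping regions only one should be kept. Scores and lengths are ignored here: scores from different tools and species are hard to compare, although some tools can merge annotations by weight (for example, RNA-based evidence highest, homology-based annotation next, ab initio prediction lowest).

    The script is below. The same thing is easy in python/R, but for performance I went with zsh and an awk core, which is much faster. The idea is to treat every row whose feature is gene as an identifier: all records under a gene feature belong to that gene, so the gene row works like a record separator. Sort those rows with sort to get the identifier order, read them into an awk array, build a hash table of "identifier: all records for that gene", then walk the sorted array to emit the whole GFF in order.

    Next, check whether the start-end intervals of the gene features in the sorted GFF overlap, again tracking state in a hash, and print the final result. (Jianshu's indentation is hopeless and I could not be bothered to fix it.)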
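The block-reordering idea described above can be tried on a toy input. This is only a minimal sketch in POSIX shell and awk, not the script itself: the function name `sort_gene_blocks` and the two-gene sample are made up for illustration, and it assumes a headerless GFF in which every sub-feature line directly follows its gene line.

```shell
#!/bin/sh
# Hypothetical helper: reorder the gene blocks of a headerless GFF read from stdin.
# Assumes each gene line is immediately followed by its own sub-feature lines.
sort_gene_blocks() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    # Step 1: extract gene lines and sort them by seqid, strand, start, end.
    # Step 2: awk reads the sorted gene lines from stdin first (NR == FNR),
    #         then the full file, and prints the blocks in sorted order.
    awk -F '\t' '$3 == "gene"' "$tmp" |
        sort -t "$(printf '\t')" -k1,1 -k7,7 -k4,4n -k5,5n |
        awk -F '\t' '
            NR == FNR    { order[++n] = $0; next }           # sorted gene lines
            $3 == "gene" { id = $0; block[id] = $0; next }   # start a new block
            id != ""     { block[id] = block[id] "\n" $0 }   # attach to current gene
            END          { for (i = 1; i <= n; i++) print block[order[i]] }
        ' - "$tmp"
    rm -f "$tmp"
}

# Toy input: gene g2 (500-900) is listed before gene g1 (100-300).
printf 'chr1\tsrcA\tgene\t500\t900\t.\t+\t.\tID=g2\nchr1\tsrcA\texon\t500\t900\t.\t+\t.\tParent=g2\nchr1\tsrcB\tgene\t100\t300\t.\t+\t.\tID=g1\nchr1\tsrcB\texon\t100\t300\t.\t+\t.\tParent=g1\n' |
    sort_gene_blocks
```

The END loop walks the indices 1..n on purpose: awk does not define the iteration order of `for (k in array)`, so a for-in loop over the sorted keys could undo the sort.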

    #!/usr/bin/zsh
    # -*- coding: utf-8 -*-
    ### ------------------------------------
    ### merge gff results generated by annotation pipeline
    ### ------------------------------------
    
    # Get gffdir from input
    while {getopts d: arg} {
            case $arg {
                    (d)
                    gffdir=$OPTARG
                    d=$arg
                    # Test if the gffdir exists.
                    if [[ -d $gffdir ]] {
                            echo "Your gffdir is $gffdir"
                    } else {
                            echo "The directory that you specified does not exist, please specify the correct path."
                            exit 1
                    }
                    ;;
                    (?)
                    echo "Wrong option!"
                    exit 1
                    ;;
            }
    }
    
    #  Test if the -d option is provided.
    if [[ -z $d ]] {
            echo "You must use -d to specify the directory that contains all the gff files that you want to merge."
            exit 1
    }
    
    # Merge all gff to a big gff
    if [[ -f $gffdir/merged.gff ]] {
            rm -rf $gffdir/merged.gff
    }
    if [[ -f $gffdir/sortedmerged.gff ]] {
            rm -rf $gffdir/sortedmerged.gff
    }
    if [[ -f $gffdir/filtermerged.gff ]] {
            rm -rf $gffdir/filtermerged.gff
    }
    # Loop over the glob directly instead of parsing ls output
    for gff in $gffdir/*gff
    do
            print "Processing $gff..."
            grep -v "^#" $gff >> $gffdir/merged.gff
    done && print "Successfully merged gff files!"
    
    # sort gff
    ## Get a sorted id array that contains only the rows whose feature is gene
    id_sorted=`gawk '$3=="gene"{print $0}' $gffdir/merged.gff |
            sort -t $'\t' -k 1,1 -k 7,7 -k 4n,4 -k 5n,5`
    ## Use awk and the sorted id array to get a sorted gff file
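    gawk -v arr="$id_sorted" '
    BEGIN{
    # n is the number of sorted gene lines; split() indexes them 1..n
    n=split(arr,id_sorted,"\n")
    FS="\t"
    OFS="\t"
    }
    $3=="gene"{
            # A new gene line closes the previous block, so store that block first
            if(id!=""){recs[id]=id"\n"lines}
            id=$0
            lines=""
    }
    $3!="gene"{
            lines=lines"\n"$0
    }
    END{
    # Store the last block; awk concatenates by juxtaposition, "+" is numeric addition
    recs[id]=id"\n"lines
    # Walk indices 1..n, because the order of for (k in array) is unspecified
    for (i=1;i<=n;i++){
            print recs[id_sorted[i]]
    }
    }
    ' $gffdir/merged.gff | sed '/^\s*$/d' > $gffdir/sortedmerged.gff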
    
    
    # Exclude overlap records of sorted gff
    gawk 'BEGIN{
    RS="\n"
    FS="\t"
    flag=1
    }
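    $3=="gene"{
    key=$1","$7
    if(!(key in start)){
            # First gene on this sequence and strand: always keep it
            start[key]=$4
            end[key]=$5
            flag=1
            print $0
    }
    else{
            s=start[key]
            e=end[key]
            # Closed intervals overlap when each one starts no later than the other ends
            if($4+0<=e+0&&$5+0>=s+0){
                    flag=0
            }
            else{
                    start[key]=$4
                    end[key]=$5
                    flag=1
                    print $0
            }
    }
    }
    # Sub-features are printed only while their parent gene is kept
    $3!="gene"&&flag==1{print $0}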
    ' $gffdir/sortedmerged.gff > $gffdir/filtermerged.gff
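
To see the overlap rule in isolation, here is a minimal sketch in POSIX shell and awk. The function name `drop_overlapping_genes` and the sample rows are hypothetical. It assumes the input is already sorted by sequence, strand, and start, and, like the script above, it compares each gene only against the last kept gene on the same sequence and strand.

```shell
#!/bin/sh
# Hypothetical helper: read a sorted, headerless GFF on stdin and drop every
# gene block that overlaps the last kept gene on the same sequence and strand.
drop_overlapping_genes() {
    awk -F '\t' '
        BEGIN { keep = 1 }
        $3 == "gene" {
            key = $1 "," $7
            # Closed intervals [s,e] and [$4,$5] overlap when $4 <= e and $5 >= s.
            if ((key in s) && $4 + 0 <= e[key] && $5 + 0 >= s[key]) {
                keep = 0                      # overlapping gene: skip its block
            } else {
                s[key] = $4 + 0; e[key] = $5 + 0
                keep = 1                      # new kept gene becomes the reference
            }
        }
        keep { print }                        # gene line and its sub-features
    '
}

# Toy input: g2 (250-400) overlaps g1 (100-300); g3 (500-900) does not.
printf 'chr1\tsrcA\tgene\t100\t300\t.\t+\t.\tID=g1\nchr1\tsrcA\texon\t100\t300\t.\t+\t.\tParent=g1\nchr1\tsrcB\tgene\t250\t400\t.\t+\t.\tID=g2\nchr1\tsrcB\texon\t250\t400\t.\t+\t.\tParent=g2\nchr1\tsrcA\tgene\t500\t900\t.\t+\t.\tID=g3\nchr1\tsrcA\texon\t500\t900\t.\t+\t.\tParent=g3\n' |
    drop_overlapping_genes
```

Because a dropped gene never updates the stored interval, the first gene of an overlapping run always wins. That matches the "keep just one, ignore scores" goal stated in the background, though other policies (longest gene, highest-weight source) would need a different comparison.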
    
    


Article link: https://www.haomeiwen.com/subject/kaomxltx.html