
Deduplicating a GFF3 file

Author: xiaosine | Published 2022-11-12 14:51

```{python}
# References:
#   https://www.stechies.com/read-file-line-by-line-python/
#   https://blog.csdn.net/weixin_41578567/article/details/103262925
#   https://www.statology.org/pandas-remove-duplicates-keep-max/
import os
import sys
import pandas as pd

input_file = "alltransc.genome.sorted.uniq.gene+_gene-.bingji_145.sart1.haveid.2.genome.mRNA_CDS_1_head100.gff3"

# Read the large GFF3 file in chunks of 500 rows, then concatenate into one DataFrame
# df = pd.read_csv(input_file, header=None, sep="\t", nrows=100)  # quick test on the first 100 rows only
data = pd.read_csv(input_file, header=None, sep="\t", chunksize=500)
df = pd.concat(data)

# Alternative reader: the datatable module
# (install it into the conda environment with "conda install -c conda-forge datatable")
# import datatable as dt
# df = dt.fread(input_file).to_pandas()

# Name the nine GFF3 columns V1..V9
df.columns = ["V" + str(x) for x in range(1, 10)]

# Drop rows that share the same V3 and V4, keeping the row with the maximum V5
# (per the author's mapping: V5 = mRNA, V3 = start, V4 = end)
df_new = df.sort_values('V5', ascending=False).drop_duplicates(['V3', 'V4']).sort_index()

df_new.to_csv("output.gff3", sep="\t", index=False, header=False)
```
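
The datatable reader above is only sketched in comments. A minimal, self-contained version might look like the following; the `fread()` arguments shown here are an assumption about how it would be called on this tab-separated, headerless file, not the author's exact command.

```{python}
# Minimal sketch of the datatable-based reader mentioned in the comments above.
# Assumes the same tab-separated, headerless GFF3 input; the fread() arguments
# are illustrative, not taken from the original post.
import datatable as dt

input_file = "alltransc.genome.sorted.uniq.gene+_gene-.bingji_145.sart1.haveid.2.genome.mRNA_CDS_1_head100.gff3"

# fread() parses the file in parallel and is typically much faster than pandas on large files
frame = dt.fread(input_file, sep="\t", header=False)

# Convert to pandas so the same drop_duplicates() logic can be reused downstream
df = frame.to_pandas()
df.columns = ["V" + str(x) for x in range(1, 10)]
```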
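
To make the keep-the-max-row pattern from the statology link concrete, here is a tiny illustrative demo on an invented DataFrame (the values are made up, not taken from the GFF3 file above).

```{python}
# Illustrative-only demo of the sort_values + drop_duplicates "keep max" pattern;
# the small DataFrame below is invented for demonstration.
import pandas as pd

toy = pd.DataFrame({
    "V3": [100, 100, 200],   # rows 0 and 1 share the same (V3, V4) key
    "V4": [500, 500, 800],
    "V5": [1, 3, 2],
})

# Sort descending on V5 so the largest value comes first, keep the first occurrence
# of each (V3, V4) pair, then restore the original row order.
deduped = toy.sort_values("V5", ascending=False).drop_duplicates(["V3", "V4"]).sort_index()
print(deduped)   # the row with V5 == 3 is kept for the duplicated (100, 500) pair
```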
