美文网首页多组学知识基础
HOMER中findMotifGenome.pl中碰到的两个小错

HOMER中findMotifGenome.pl中碰到的两个小错

作者: 城管大队哈队长 | 来源:发表于2020-05-10 16:32 被阅读0次

    我有时候会对DIffBind的输出结果进行下修改,然后放到HOMER里面去找motif。但我有个不太好的一点就是,每次都不记我的awk操作,导致我每次都得重新输一遍awk。于是我今天在跑的时候就遇到了两个小bug。
    第一个是HOMER的findMotifGenome.pl显示我有好多Redundant Peak IDs

    findMotifsGenome.pl down.bed ~/reference/genome/TAIR10/Athaliana.fa .
    
        Position file = down.bed
        Genome = /home/sgd/reference/genome/TAIR10/Athaliana.fa
        Output Directory = .
        Using Custom Genome
        Peak/BED file conversion summary:
            BED/Header formatted lines: 1163
            peakfile formatted lines: 0
    
        Peak File Statistics:
            Total Peaks: 1163
            Redundant Peak IDs: 1162
            Peaks lacking information: 0 (need at least 5 columns per peak)
            Peaks with misformatted coordinates: 0 (should be integer)
            Peaks with misformatted strand: 0 (should be either +/- or 0/1)
    

    但事实上我awk操作产生的并没有冗余的peak

    awk -F "," 'BEGIN {OFS="\t"} $9 < -0.58 && $11 < 0.1 && NR > 1 {print $1,$2,$3,$5,$12}' test.csv | sed 's/\"//g' | sort -k1,1 -k2,2n > down.bed
    
    Chr1    67143   67743   *       WT_Peak_5
    Chr1    365682  366282  *       WT_Peak_28
    Chr1    415434  416034  *       WT_Peak_30
    Chr1    497934  498534  *       WT_Peak_41
    Chr1    677310  677910  *       WT_Peak_58
    

    我重新看了下HOMER的要求,才发现我的格式输出错误了……

    
    findMotifsGenome.pl accepts HOMER peak files or BED files:
    
    HOMER peak files should have at minimum 5 columns (separated by TABs, additional columns will be ignored):
    
        Column1: Unique Peak ID
        Column2: chromosome
        Column3: starting position
        Column4: ending position
        Column5: Strand (+/- or 0/1, where 0="+", 1="-")
    
    BED files should have at minimum 6 columns (separated by TABs, additional columns will be ignored)
    
        Column1: chromosome
        Column2: starting position
        Column3: ending position
        Column4: Unique Peak ID
        Column5: not used
        Column6: Strand (+/- or 0/1, where 0="+", 1="-")
    
    In theory, HOMER will accept BED files with only 4 columns (+/- in the 4th column), and files without unique IDs, but this is NOT recommended.  For one, if you don't have unique IDs for your regions, it's hard to go back and figure out which region contains which peak.
    
    

    我输出的是bed格式,但我的第四列却是链信息而非Unique Peak ID。所以HOMER就会误把我的第四列变成了peakID,所以他就会认为我的peakID全都是冗余的。

    在我改了列顺序之后,出现了另一个报错

    awk -F "," 'BEGIN {OFS="\t"} $9 < -0.58 && $11 < 0.1 && NR > 1 {print $12,$1,$2,$3,$5}' test.csv | sed 's/\"//g' | sort -k2,2 -k3,3n > down.bed
    
    WT_Peak_5   Chr1    67143   67743   *
    WT_Peak_28  Chr1    365682  366282  *
    WT_Peak_30  Chr1    415434  416034  *
    WT_Peak_41  Chr1    497934  498534  *
    WT_Peak_58  Chr1    677310  677910  *
    WT_Peak_67  Chr1    898870  899470  *
    

    类似这样的报错

    Reading input files...
        0 total sequences read
        Autonormalization: 1-mers (4 total)
                A       inf%    inf%    -nan
                C       inf%    inf%    -nan
                G       inf%    inf%    -nan
                T       inf%    inf%    -nan
        Autonormalization: 2-mers (16 total)
                AA      inf%    inf%    -nan
                CA      inf%    inf%    -nan
                GA      inf%    inf%    -nan
                TA      inf%    inf%    -nan
                AC      inf%    inf%    -nan
                CC      inf%    inf%    -nan
                GC      inf%    inf%    -nan
                TC      inf%    inf%    -nan
                AG      inf%    inf%    -nan
                CG      inf%    inf%    -nan
                GG      inf%    inf%    -nan
                TG      inf%    inf%    -nan
                AT      inf%    inf%    -nan
                CT      inf%    inf%    -nan
                GT      inf%    inf%    -nan
                TT      inf%    inf%    -nan
    

    后来我才发现,我应该把链信息中的 * 写成 ., 这样就不会报错了。所以最后的输入文件应该是长这样的

    awk -F "," 'BEGIN {OFS="\t"} $9 < -0.58 && $11 < 0.1 && NR > 1 {print $12,$1,$2,$3,"."}' test.csv | sed 's/\"//g' | sort -k2,2 -k3,3n > down.bed
    
    WT_Peak_5   Chr1    67143   67743   .
    WT_Peak_28  Chr1    365682  366282  .
    WT_Peak_30  Chr1    415434  416034  .
    WT_Peak_41  Chr1    497934  498534  .
    

    相关文章

      网友评论

        本文标题:HOMER中findMotifGenome.pl中碰到的两个小错

        本文链接:https://www.haomeiwen.com/subject/xtbjnhtx.html