美文网首页RNA-seqNGS 学习资源收集
生物信息学中常用的文件转换命令

生物信息学中常用的文件转换命令

作者: 热衷组培的二货潜 | 来源:发表于2018-10-17 14:07 被阅读10次

    放置在最前面的参考链接

    支持的文件转换类型

    • text file to wig
    • bed to wig
    • wig to bed
    • wig to bigwig
    • bed to bigbed
    • BAM to bedGraph for UCSC genome browser
    • bam to bigwig
    • bed to gff
    • Split bed file by chromosome
    • gff to gtf
    • gtf to bed
    • blat to gff
    • ...

    由于需要翻墙,所以这里直接将整个网页复制粘贴过来(这应该不侵权吧!)

    Convert text file to wig

     Sample command:
       txt2wig.pl foo.txt trackName(one word) > foo.wig
    

    Convert bed to wig

        Sample command:
        bed2wig.pl inputBed sampleName(one word) probeWidth > outputWig
        Note: It assumes that the probe width in all records is constant.
              If probe width is not constant, you can use bedGraph format.
              To convert bed to bedGraph format, just change the track name to bedGraph, and minus chromosome end position in bed format by 1.
    

    Convert wig to bed

      Sample command with variableStep wig format:
       wig2bed.pl inputWig sampleName(one word) > outputBed
      
       Sample command with fixedStep wig format:
       wig2bed_fixedStep.pl inputWig > outputBed
    

    Convert wig to bigwig

      Sample commands:
      Get chromosome lengths
       fetchChromSizes  hg18 > chrSize.txt
      Convert wig to big wig:  
       wigToBigWig foo.wig chrSize.txt foo.bw
    

    Convert bed to bigbed

     Sample commands:
      Get chromosome lengths
       fetchChromSizes  hg18 > chrSize.txt
      Convert bed to big bed:  
       bedToBigBed foo.bed chrSize.txt foo.bb
    

    Convert BAM to bedGraph for UCSC genome browser

      To view BAM files on UCSC browser, both foo.sorted.bam and foo.sorted.bam.bai have to be on a http or ftp server. One way to get around this is to convert BAM files into bedGraph files, which should be small enough that they can be simply uploaded.
       genomeCoverageBed -split -bg -ibam sorted.bam -g hg19.genome    
       where hg19.genome file is tab delimited and structured as follows:
           <chromName><TAB><chromSize>
           chr1    249250621
       One can use the UCSC Genome Browser's MySQL database to extract chromosome sizes. For example, H. sapiens:
           mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" > hg19.genome
    

    Convert bam to bigwig

    • Method 1: Single-base resolution across the genome
      Step1: convert bam to bedGraph format:
    genomeCoverageBed -split -bg -ibam accepted_hits.bam -g /nfs/genomes/mouse_gp_jul_07/anno/mm9.size > accepted_hits.bedGraph
    
      Step2: convert bedGraph to bigwig format:
    bedGraphToBigWig  accepted_hits.bedGraph /nfs/genomes/mouse_gp_jul_07/anno/mm9.size accepted_hits.bw
        where mm9.size file is tab delimited and structured as follows:
            <chromName><TAB><chromSize>
    
    • Method 2: Resolution of desired window size (after creating windows across desired regions or genome)
    coverageBed -a Human.hg19.1000.500.bed -b Sample_1.sorted.bam | cut -f1-4 > Sample_1.1000.500.coverage.bedgraph
    

    Update/fix UCSC GTF file

    • GTF files from UCSC Table Browser use RefSeq (NM* ids) for both gene_id and transcript_id which may not be compatible for some programs (eg. counting by genes using HTSeq)
    • Some Refseq gtf files (such as for the hg19, hg18, mm9, and dm3 assemblies) are in /nfs/genomes/, under gtf/ in each species folder. If you would like to create additional files, here are the steps:
      Step 1: Use UCSC Table Browser to download RefSeq id and gene symbol.
        Use "Genes and Gene Prediction Tracks" for group, "RefSeq Genes" for track and "refGene" for table.  Choose  "selected fields from primary and related tables" for output format and click "get output".  In the next page select "name" and "name2" for the fields.  
        output format should be : NM_017940       NBPF1
      Step 2: Download a gtf file from the UCSC Table Browser
        This uses refseq ID as gene_id and transcript_id, so we need to replace it with the gene symbol.
        sample command:  
          /nfs/BaRC_Public/BaRC_code/Perl/fix_gtf_refSeq_ensembl.pl hg19.refgene.gtf refseq2symbol > hg19.refgene.gtf
      Step 3: About 50-70 genes in the gtf file from UCSC are incorrect; they include exons with a start coordinate that is larger than the end coordinate.  
        Software such as cufflinks fails to deal with this situation and ignores these exons. 
        Since this only affects the last 1-3 bases of a transcript, a temporary solution is to remove these records.
          sample command: awk -F"\t" '{ if($4<=$5) print $0 }' hg19.refgene.gtf > hg19.refgene_new.gtf
    

    Convert bed to gff

    • Note that bed and gff use slightly different coordinate conventions
    • Use /nfs/BaRC_Public/BaRC_code/Perl/bed2gff/bed2gff.pl
        USAGE: bed2gff.pl bedFile > gffFile
        Ex: bed2gff.pl foo.bed WIBR exon > foo.gff
    

    Split bed file by chromosome

    • Sometimes it's easier working with only one chromosome of regions at a time
    • Output files will be named like "Sample_1.chr1.bed".
     awk '{close(f);f=$1}{print > "Sample_1."f".bed"}' Sample_1_all_chrs.bed
    

    Convert gff to gtf

    • Use ​gffread: Try 'gffread -h' too see the program's many options
    gffread My_transcripts_genes.gff3 -T -E -o My_transcripts_genes.gtf
    

    Convert gtf to bed

    • convert gtf to genePhred
      gtfToGenePred my.gtf my.genePhred
    
    • convert genePhred to bed:
       awk -f genePhredToBed my.genePhred > my.bed
    
    • genePhredToBed is a awk script by Katrina Learned, downloaded from UCSC Genome Browser discussion list
    #!/usr/bin/awk -f
    
    #
    # Convert genePred file to a bed file (on stdout)
    #
    BEGIN {
         FS="\t";
         OFS="\t";
    }
    {
         name=$1
         chrom=$2
         strand=$3
         start=$4
         end=$5
         cdsStart=$6
         cdsEnd=$7
         blkCnt=$8
    
         delete starts
         split($9, starts, ",");
         delete ends
         split($10, ends, ",");
         blkStarts=""
         blkSizes=""
         for (i = 1; i <= blkCnt; i++) {
             blkSizes = blkSizes (ends[i]-starts[i]) ",";
             blkStarts = blkStarts (starts[i]-start) ",";
         }
    
         print chrom, start, end, name, 1000, strand, cdsStart, cdsEnd, 0, blkCnt, blkSizes, blkStarts
    }
    

    Convert blat to gff

    • Use /nfs/BaRC_Public/BaRC_code/Perl/blat2gff/blat2gff.pl
     Convert BLAT output file (PSL format) into GFF format (v1.1 14 Dec 2010)
       blat2gff.pl blatFile dataSource(ex:WIBR) > gffFile
    

    Create wiggle files for visualizing paired-end data mapping to the + and - strands

    • split by strand by matched strand
    # input:    accepted_hits.bam
    # output:   accepted_hits_negStrand.bam: mapped to negative strand
    #       accepted_hits_posStrand.bam: mapped to positive strand
    
    bsub "samtools view -f 16 -b accepted_hits.bam >| accepted_hits_negStrand.bam"
    bsub "samtools view -F 16 -b accepted_hits.bam >| accepted_hits_posStrand.bam"
    
    • split reads by pair
    # input:    accepted_hits_posStrand.bam or accepted_hits_negStrand.bam
    # output:   1st pair: *_1stPair.bam
    #           2nd pair: *_2ndPair.bam
    bsub "samtools view -b -f 0x0040 accepted_hits_posStrand.bam > accepted_hits_posStrand_1stPair.bam"
    bsub "samtools view -b -F 0x0040 accepted_hits_posStrand.bam > accepted_hits_posStrand_2ndPair.bam"
    bsub "samtools view -b -f 0x0040 accepted_hits_negStrand.bam > accepted_hits_negStrand_1stPair.bam"
    bsub "samtools view -b -F 0x0040 accepted_hits_negStrand.bam > accepted_hits_negStrand_2ndPair.bam"
    
    • convert from bam to bedgraph format
    # input:    bam format: accepted_hits_*Strand_*Pair.bam
    #           /nfs/genomes/mouse_gp_jul_07/anno/mm9.size: length of each chromosome, format like 
    #                                   chr1    197195432
    # output:   bedgraph format: accepted_hits_*Strand_*Pair.bedgraph
    bsub "genomeCoverageBed -split -bg -ibam accepted_hits_posStrand_1stPair.bam -g mm9.size >| accepted_hits_posStrand_1stPair.bedgraph"
    bsub "genomeCoverageBed -split -bg -ibam accepted_hits_posStrand_2ndPair.bam -g mm9.size >| accepted_hits_posStrand_2ndPair.bedgraph"
    bsub "genomeCoverageBed -split -bg -ibam accepted_hits_negStrand_1stPair.bam -g mm9.size >| accepted_hits_negStrand_1stPair.bedgraph"
    bsub "genomeCoverageBed -split -bg -ibam accepted_hits_negStrand_2ndPair.bam -g mm9.size >| accepted_hits_negStrand_2ndPair.bedgraph"
    
    • join the reads sharing the same strand
    # This step is for fr-firststrand library (such as dUTP). which is
        1+-,1-+,2++,2--
    
        read1 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
        read1 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
        read2 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
        read2 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
     
    # input:    bedgraph file from the same strand
    # output:   merged bedgraph: pos.bedgraph or neg.bedgraph
    unionBedGraphs -i accepted_hits_posStrand_2ndPair.bedgraph accepted_hits_negStrand_1stPair.bedgraph |awk '{ print $1"\t"$2"\t"$3"\t"$4+$5 }' >|pos.bedgraph
    unionBedGraphs -i accepted_hits_posStrand_1stPair.bedgraph accepted_hits_negStrand_2ndPair.bedgraph |awk '{ print $1"\t"$2"\t"$3"\t-"$4+$5 }' >|neg.bedgraph
    
    infer_experiment.py -r mm9.refseq.bed12 -i accepted_hits.bam
    
    • convert bedgraph to bigwig
    # get rid of header lines of mm9.size: the header line with "chrom   size" is removed
    # input:    mm9.size: length of each chromosome
    # output:   mm9.size_noHeader
    tail --line=+2 mm9.size > mm9.size_noHeader
    # convert bedgraph to bigwig
    # input:    bedgraph file: neg.bedgraph or pos.bedgraph
    #           mm9.size_noHeader: length of each chromosome
    # *output:  bigwig format: neg.bw or pos.bw
    #           neg.bw or pos.bw can be visualized with IGV/UCSC genome browser
    bsub bedGraphToBigWig neg.bedgraph mm9.size_noHeader neg.bw
    bsub bedGraphToBigWig pos.bedgraph mm9.size_noHeader pos.bw
    

    相关文章

      网友评论

        本文标题:生物信息学中常用的文件转换命令

        本文链接:https://www.haomeiwen.com/subject/qkzlzftx.html