如何批量获取基因的转录组序列

作者: 落寞的橙子 | 来源:发表于2020-01-16 03:48 被阅读0次

    一、使用biomaRt包的getSequence
    我的一个例子,ensembl编号的基因集(eg:ENSG00000272106.1)

    基因集
    rm(list=ls())
    suppressMessages(library(biomaRt))
    #移除genelist的小数点
    genelist<-unlist(lapply(human_overlap_lnc, FUN = function(x) {return(strsplit(x, split = ".",fixed = T)[[1]][1])}))
    #查看有那些dataset
    #选用对应的数据集,这里采用的是hsapiens_gene_ensembl
    mart = useMart('ensembl')
    listDatasets(mart)
    Ensemble_to_seq<-function(x) {
    mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
    seq = getSequence(id = x, 
                      type = "ensembl_gene_id", 
                      seqType = "cdna", 
                      mart = mart)
    seq<-as.data.frame(seq)
    seq$"Length"<-lapply(seq[,1],function(y){return(nchar(y))})
    return(seq)
    }
    
    outTab<-data.frame()
    for (i in genelist){
      outTab<-rbind(outTab,Ensemble_to_seq(i))
    }
    
    write.table(outTab,paste0(ori_dir_lnc,“/your_GeneList_sequence.txt"),sep = "\t",quote = F,row.names = F,col.names = F)
    

    二、EASER 不会Python 没run

    三、seqkit grep
    后来我发现biomaRT实在太慢了,于是想到可以用grep,可以先去genecode上下载transcripts.fa 文件,使用seqkit的grep功能去做这个事情。

    seqkit 使用说明

    四、awk+grep
    参考方法
    虽然笨,但是确实work。同样去GENECODE网站上去下载transcripts.fa,然后往下走就行了,注意release版本要一致。我的例子:

    #######prepare the  files
    #Get the transcript sequence
    data_dir=/your_dir/sequence
    human_transcript_fasta=/your_dir/hg38/fasta/gencode.v32.transcripts.fa
    mouse_transcript_fasta=/your_dir/mouse/fasta/gencode.vM23.transcripts.fa 
    human_fa_dir=/your_dir/sequence/human/fasta
    mouse_fa_dir=/your_dir/sequence/mouse/fasta
    mkdir -p ${human_fa_dir}
    mkdir -p ${mouse_fa_dir}
    
    awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < ${human_transcript_fasta} > /you/hg38/fasta/gencode.v32.transcripts.linear.fa
    awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < ${mouse_transcript_fasta} > /your_dir/hg38/fasta/gencode.vM23.transcripts.linear.fa
    
    human_fa_linear=/your_dir/hg38/fasta/gencode.v32.transcripts.linear.fa
    mouse_fa_linear=/your_dir/hg38/fasta/gencode.vM23.transcripts.linear.fa
    
    cat ${data_dir}/human_fasting_refeeding_overlap_lncRNAs_logFC_0_pvalue_0.05.txt | while read line
    do
      cat ${human_fa_linear} | grep  ${line} --no-group-separator | tr '\t' '\n' |awk '{ if ($0~/^>/) { n=split($0, a, "|"); gsub(/_/," ", a[1]); printf("%s|%s\n", a[1], substr(a[2], 2)); } else { print $0; } }' > ${human_fa_dir}/${line}.fa
    done
    
    
    cat ${mouse_fa_linear} | grep -f ${data_dir}/mouse_genelist.txt --no-group-separator | tr '\t' '\n' > ${mouse_fa_dir}/mouse.lncRNA.fa
    
    

    相关文章

      网友评论

        本文标题:如何批量获取基因的转录组序列

        本文链接:https://www.haomeiwen.com/subject/jqnxzctx.html