Extracting gene names from sentences

Author: 小洁忘了怎么分身 | Source: published 2023-04-18 09:57

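    Requirement

    This request comes from 豆同学: extract the gene names from a large number of sentences.

    Take one of them as an example sentence: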

    "To ascertain whether a pre-existing subset of endoderm progenitors were responsible for generating endoderm cells in EZH2-/- cultures, we used flow cytometry to separate KIT+/CXCR4+ (endoderm primed) and KIT-/CXCR4- (not endoderm primed) EZH2-/- populations and subjected the cells to endoderm differentiation"

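    This sentence contains three gene names: "EZH2", "KIT", and "CXCR4".

    Approach

    Replace every punctuation character in the text with a space, split the result on spaces, and keep only the tokens that appear in a known vector of gene names.

    Code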

    library(stringr)
    s = "To ascertain whether a pre-existing subset of endoderm progenitors were responsible for generating endoderm cells in EZH2-/- cultures, we used flow cytometry to separate KIT+/CXCR4+ (endoderm primed) and KIT-/CXCR4- (not endoderm primed) EZH2-/- populations and subjected the cells to endoderm differentiation"
    s2 = gsub("[[:punct:]]"," ",s)
    m2 = str_split(s2," ")[[1]]
    # all_g is the vector of all gene names; it is shortened here to keep the example brief.
    all_g = c("EZH2", "KIT", "CXCR4", "AKR1B1P8", "AKR1B10", "AKR1B10P1", 
    "AKR1B10P2", "AKR1B11", "AKR1B15")
    all_g
    ## [1] "EZH2"      "KIT"       "CXCR4"     "AKR1B1P8"  "AKR1B10"   "AKR1B10P1"
    ## [7] "AKR1B10P2" "AKR1B11"   "AKR1B15"
    
    unique(m2[m2 %in% all_g])
    
    ## [1] "EZH2"  "KIT"   "CXCR4"
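
Since the requirement mentions a large number of sentences, the same steps can be wrapped in a function and applied to a whole character vector. The following is a minimal sketch in base R, not the author's code: the function name `extract_genes`, the two sample sentences, and the shortened `gene_list` are all hypothetical and only serve as an illustration.

```r
# Hypothetical helper: apply the punctuation-to-space approach to many sentences.
# Returns a list with one character vector of matched gene names per sentence.
extract_genes <- function(sentences, gene_list) {
  # gsub is vectorized, so every sentence is cleaned in one call
  cleaned <- gsub("[[:punct:]]", " ", sentences)
  # Split each sentence on runs of whitespace
  tokens <- strsplit(cleaned, "[[:space:]]+")
  # Keep only the tokens that are known gene names, without duplicates
  lapply(tokens, function(tok) unique(tok[tok %in% gene_list]))
}

# Made-up input for illustration only
sentences <- c(
  "EZH2-/- cultures were sorted into KIT+/CXCR4+ cells.",
  "No gene symbol appears in this sentence."
)
gene_list <- c("EZH2", "KIT", "CXCR4")

res <- extract_genes(sentences, gene_list)
print(res)
```

One limitation to keep in mind: because hyphens are also punctuation, a hyphenated symbol such as HLA-DRB1 would be split into "HLA" and "DRB1" and would not match its entry in the gene vector. Matching is also case-sensitive.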
    

    The key point: the regular expression [[:punct:]] matches any punctuation character.

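    gsub replaces every punctuation character with a space, so symbols such as "-", "/", "+", and parentheses no longer stick to the gene names.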
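
To make the effect of the substitution visible, here is a small sketch that runs the same two steps on a short fragment of the example sentence. The variable names `frag`, `cleaned`, `tokens`, and `genes` are chosen for illustration, and base R's `strsplit` stands in for `str_split`.

```r
# A short fragment of the example sentence
frag <- "KIT+/CXCR4+ (endoderm primed)"

# Base R gsub: each punctuation character becomes exactly one space
cleaned <- gsub("[[:punct:]]", " ", frag)
print(cleaned)

# Splitting on a single space leaves empty strings where
# two punctuation characters (or a space and one) were adjacent
tokens <- strsplit(cleaned, " ")[[1]]
print(tokens)

# The empty strings are harmless: they never match a gene name
genes <- c("EZH2", "KIT", "CXCR4")
found <- unique(tokens[tokens %in% genes])
print(found)
```

This relies on base R's `gsub`. As far as I know, stringr's ICU-based functions classify some symbols such as "+" differently under [[:punct:]], so replacing `gsub` with `str_replace_all` may not give the same result.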
