Recreating a Python Knowledge Graph


Author: 万州客 | Published: 2023-03-31 23:23

    This is, more or less, an intermediate version. Reference: https://zhuanlan.zhihu.com/p/243211697?utm_source=qq

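    The main idea is to take the subject and object of each sentence as entity nodes and to store the predicate as the relation edge between them, then collect the resulting triples in a pandas DataFrame. networkx and matplotlib (plt) handle the display.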
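
As a minimal sketch of that data model, the block below skips the NLP step and starts from a few hand-written triples. The sample triples are invented for illustration; the column names `source`, `target` and `edge` match the ones used in the full code.

```python
import pandas as pd
import networkx as nx

# Hand-written (subject, relation, object) triples; illustrative data only.
triples = [
    ("john", "completed", "task"),
    ("film", "released in", "2019"),
    ("album", "released in", "2019"),
]

# One row per triple: subject and object are nodes, the predicate is the edge.
kg_df = pd.DataFrame(triples, columns=["source", "edge", "target"])

# Build a directed multigraph; every remaining column becomes an edge attribute.
G = nx.from_pandas_edgelist(
    kg_df, "source", "target", edge_attr=True, create_using=nx.MultiDiGraph()
)

# Keep only one relation, as the full code does for "released in".
released = kg_df[kg_df["edge"] == "released in"]

for s, t, data in G.edges(data=True):
    print(s, "-[", data["edge"], "]->", t)
```

A `MultiDiGraph` is used so that two entities can be linked by more than one relation without the later edge overwriting the earlier one.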
    Code

    import re
    import pandas as pd
    import bs4
    import requests
    import spacy
    from spacy import displacy
    nlp = spacy.load('en_core_web_sm')
    
    from spacy.matcher import Matcher
    from spacy.tokens import Span
    import networkx as nx
    
    import matplotlib.pyplot as plt
    from tqdm import tqdm
    pd.set_option('display.max_colwidth', 200)
    
    candidate_sentences = pd.read_csv("wiki_sentences_v2.csv")
    
    """
    
    print(candidate_sentences.shape)
    print(candidate_sentences['sentence'].sample(5))
    
    doc = nlp("the music rights were sold for ₹150 million .")
    
    for tok in doc:
        print(tok.text, "...", tok.dep_)
    """
    
    
    def get_entities(sent):
        ## chunk 1
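        # This chunk defines some empty variables. prv_tok_dep and prv_tok_text
        # hold the dependency tag of the previous token and the previous token
        # itself. prefix and modifier hold the text attached to the subject or object.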
        ent1 = ""
        ent2 = ""
    
        prv_tok_dep = ""  # dependency tag of previous token in the sentence
        prv_tok_text = ""  # previous token in the sentence
    
        prefix = ""
        modifier = ""
    
        #############################################################
    
        for tok in nlp(sent):
            ## chunk 2
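            # Next, we loop over the tokens in the sentence. We first check whether
            # the token is punctuation; if so, we ignore it and move on to the next
            # token. If the token is part of a compound word (dependency tag =
            # compound), we save it in the prefix variable. A compound is several
            # words combined into one with a new meaning (e.g. "Football Stadium",
            # "animal lover").
            # When we reach the subject or object of the sentence, we prepend this
            # prefix. We do the same with modifiers, e.g. "nice shirt", "big house".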
    
            # if token is a punctuation mark then move on to the next token
            if tok.dep_ != "punct":
                # check: token is a compound word or not
                if tok.dep_ == "compound":
                    prefix = tok.text
                    # if the previous word was also a 'compound' then add the current word to it
                    if prv_tok_dep == "compound":
                        prefix = prv_tok_text + " " + tok.text
    
                # check: token is a modifier or not
                if tok.dep_.endswith("mod"):
                    modifier = tok.text
                    # if the previous word was a 'compound', prepend it to the modifier
                    if prv_tok_dep == "compound":
                        modifier = prv_tok_text + " " + tok.text
    
                ## chunk 3
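                # If the token is a subject, it is captured as the first entity in
                # ent1. prefix, modifier, prv_tok_dep and prv_tok_text are then reset.
                # Note: str.find() returns an index, so `find("subj") == True` only
                # matched tags with "subj" at index 1; a membership test is explicit.
                if "subj" in tok.dep_: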
                    ent1 = modifier + " " + prefix + " " + tok.text
                    prefix = ""
                    modifier = ""
                    prv_tok_dep = ""
                    prv_tok_text = ""
    
                ## chunk 4
                # If the token is an object, it is captured as the second entity in
                # ent2. Unlike the subject case, nothing is reset in this branch.
                if "obj" in tok.dep_:
                    ent2 = modifier + " " + prefix + " " + tok.text
    
                ## chunk 5
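                # Once the subject and object checks are done for this token, record
                # it and its dependency tag as the "previous" token for the next pass.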
                # update variables
                prv_tok_dep = tok.dep_
                prv_tok_text = tok.text
        #############################################################
    
        return [ent1.strip(), ent2.strip()]
    
    
    def get_relation(sent):
        doc = nlp(sent)
        # Matcher class object
        matcher = Matcher(nlp.vocab)
        # define the pattern
        pattern = [{'DEP':'ROOT'}, {'DEP':'prep','OP':"?"}, {'DEP':'agent','OP':"?"}, {'POS':'ADJ','OP':"?"}]
        matcher.add("matching_1", [pattern])
        matches = matcher(doc)
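        # guard against an empty result (e.g. an empty sentence with no ROOT token)
        if not matches:
            return ""
        # keep the last match returned by the matcher
        _, start, end = matches[-1]
        return doc[start:end].text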
    
    
    entity_pairs = []
    for i in tqdm(candidate_sentences["sentence"]):
        entity_pairs.append(get_entities(i))
    
    print(entity_pairs[10:20])
    print(get_relation("John completed the task"))
    relations = [get_relation(i) for i in tqdm(candidate_sentences['sentence'])]
    print(pd.Series(relations).value_counts()[:50])
    
    # extract subject
    source = [i[0] for i in entity_pairs]
    # extract object
    target = [i[1] for i in entity_pairs]
    
    
    kg_df = pd.DataFrame({'source':source, 'target':target, 'edge':relations})
    """
    
    # create a directed-graph from a dataframe
    G=nx.from_pandas_edgelist(kg_df, "source", "target", edge_attr=True, create_using=nx.MultiDiGraph())
    plt.figure(figsize=(12,12))
    pos = nx.spring_layout(G)
    nx.draw(G, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos = pos)
    plt.show()
    """
    G=nx.from_pandas_edgelist(
        kg_df[kg_df['edge']=="released in"],
        "source",
        "target",
        edge_attr=True,
        create_using=nx.MultiDiGraph())
    plt.figure(figsize=(12,12))
    pos = nx.spring_layout(G, k = 0.5) # k regulates the distance between nodes
    nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500, edge_cmap=plt.cm.Blues, pos = pos)
    plt.show()
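
To see how the prefix and modifier bookkeeping in `get_entities` plays out without loading a spaCy model, here is a simplified sketch that applies the same rules to hand-written (text, dependency tag) pairs. The function name `entities_from_tagged` and the tags in the sample sentence are assumptions made for illustration; they are not parser output.

```python
def entities_from_tagged(tokens):
    """Apply the get_entities rules to pre-tagged (text, dep) pairs."""
    ent1 = ent2 = ""
    prefix = modifier = ""
    prev_dep = prev_text = ""

    for text, dep in tokens:
        if dep == "punct":
            continue  # punctuation never contributes to an entity

        if dep == "compound":
            # chain consecutive compound tokens, e.g. "football" + "stadium"
            prefix = f"{prev_text} {text}" if prev_dep == "compound" else text

        if dep.endswith("mod"):
            modifier = f"{prev_text} {text}" if prev_dep == "compound" else text

        if "subj" in dep:
            # subject found: attach any pending modifier and prefix, then reset them
            ent1 = " ".join(p for p in (modifier, prefix, text) if p)
            prefix = modifier = ""

        if "obj" in dep:
            # object found: attach any pending modifier and prefix
            ent2 = " ".join(p for p in (modifier, prefix, text) if p)

        prev_dep, prev_text = dep, text

    return [ent1, ent2]


# Hand-assigned tags for "the football stadium hosted a big concert ."
sample = [
    ("the", "det"), ("football", "compound"), ("stadium", "nsubj"),
    ("hosted", "ROOT"), ("a", "det"), ("big", "amod"),
    ("concert", "dobj"), (".", "punct"),
]
print(entities_from_tagged(sample))  # ['football stadium', 'big concert']
```

The sketch joins only the non-empty parts, so it does not need the final `strip()` that the full code relies on.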
    
    

    Screenshots

    [Four screenshots: three matplotlib "Figure 1" windows showing the rendered graphs, and one of the Zhihu article "python从零开始构建知识图谱".]
