DeepLearning4J Text Similarity Comparison

Author: 山哥Samuel | Published: 2019-06-15 22:11

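    Text similarity comparison comes up in many scenarios: the customer address matching I did earlier for CMB, search, chatbot Q&A…
    The Python solutions are all mature and their results are very satisfying, but once something is built in Python, the notorious GIL (Global Interpreter Lock), slow looping, and similar issues make it expensive to deploy: it takes very good machines to handle high concurrency.
    山哥 noticed that the Java platform has something called DL4J, which is also quite mature for NLP and supports all kinds of neural networks on CPU and on GPU (CUDA); some benchmarks reportedly show it performing a bit better than TensorFlow. So 山哥 was tempted. After all, Java's performance is rock solid! Especially after using JVisualVM to monitor and tune performance, 山哥 no longer believes any platform today suits back-end services better than the JVM.
    A web search turned up no good DL4J example of text similarity comparison. What exists uses it to train Word2Vec and then find similar words… er, that's nice, but it's a million miles away from a real application! Fine! I'll have to do it myself!
    Borrowing from the Python versions and reading through various NLP papers and materials, I settled on the following solution: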

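    • Train a TF-IDF model on the corpus
    • Using that TF-IDF model, compute the vector of the target text and the vector of the source text
    • Compute the cosine similarity of the two vectors
      TF-IDF is rather magical: it needs no neural network, yet it works especially well in some scenarios, which is reportedly why Google's search algorithm uses it too. And since there is no heavy neural-network computation, a CPU is enough. Truly "cheap and good, we've always treasured it! Big brother buys it for his wife, and she couldn't be happier…" ahem!
      OK, a pile of chatter never beats a few lines of code. Here we go……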
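Before the DL4J version, here is a minimal plain-Kotlin sketch of the first two steps, just to make the TF-IDF weighting concrete. It is not DL4J's implementation. The helper names (`tokenize`, `buildIdf`, `tfidf`) are made up for illustration, and the formula (tf = term count / text length, idf = log10(N / df)) and the whitespace-only tokenizer are assumptions, chosen because they are consistent with the vector logged later in this post.

```kotlin
import kotlin.math.log10

// Whitespace tokenizer (assumption: no lower-casing, no punctuation stripping).
fun tokenize(text: String): List<String> =
    text.trim().split(Regex("\\s+")).filter { it.isNotEmpty() }

// Step 1: "train" on the corpus, i.e. compute idf = log10(N / df) for every term,
// where df is the number of documents that contain the term.
fun buildIdf(corpus: List<String>): Map<String, Double> {
    val docs = corpus.map { tokenize(it).toSet() }
    val vocab = docs.flatten().toSet()
    return vocab.associateWith { term ->
        log10(corpus.size.toDouble() / docs.count { term in it })
    }
}

// Step 2: weight a text against the trained vocabulary.
// tf = occurrences of the term / number of tokens in the text.
// Tokens that never appeared in the corpus are simply ignored.
fun tfidf(text: String, idf: Map<String, Double>): Map<String, Double> {
    val tokens = tokenize(text)
    val counts = tokens.groupingBy { it }.eachCount()
    return idf.mapValues { (term, weight) ->
        if (tokens.isEmpty()) 0.0
        else (counts[term] ?: 0).toDouble() / tokens.size * weight
    }
}

fun main() {
    val corpus = listOf(
        "HSAC Software Development (Guangdong) Ltd",
        "HSAC holdings plc",
        "Citi Bank",
        "HSAC Software Development (IN) Ltd"
    )
    val idf = buildIdf(corpus)
    val weights = tfidf("HSAC Software Development (Guangdong) Limited", idf)
    // Print only the terms that actually contribute to the vector.
    weights.filterValues { it > 0.0 }.forEach { (term, w) -> println("$term -> $w") }
}
```

For the query above this should give roughly 0.0250 for HSAC, 0.0602 for Software and Development, and 0.1204 for (Guangdong), which lines up with the non-zero entries of the TF-IDF vector in the log output. "Limited" never appears in the corpus, so it contributes nothing.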

    Maven

    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-native-platform</artifactId>
        <version>${nd4j.version}</version>
    </dependency>
    <!-- Core DL4J functionality -->
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-nlp</artifactId>
        <version>${dl4j.version}</version>
    </dependency>
    

    Java (actually Kotlin)

    The example uses the default English tokenizer; for a Chinese tokenizer, see here

    package com.example.demo
    
    import org.deeplearning4j.bagofwords.vectorizer.TfidfVectorizer
    import org.deeplearning4j.text.sentenceiterator.CollectionSentenceIterator
    import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory
    import org.junit.Test
    import org.nd4j.linalg.ops.transforms.Transforms
    import org.slf4j.LoggerFactory
    import java.util.*
    import java.util.concurrent.atomic.AtomicInteger
    
    class TextSimilarityControllerTests {
        val logger = LoggerFactory.getLogger(TextSimilarityControllerTests::class.java)
    
        @Test
        @Throws(Exception::class)
        fun testTfIdfVectorizer() {
            val rawLines = Arrays.asList("HSAC Software Development (Guangdong) Ltd",
                    "HSAC holdings plc",
                    "Citi Bank",
                    "HSAC Software Development (IN) Ltd")
            val iter = CollectionSentenceIterator(rawLines)
            // DefaultTokenizerFactory targets English; for Chinese text, implement a tokenizer factory backed by a Chinese segmenter such as Ansj
            val tokenizerFactory = DefaultTokenizerFactory()
    
            val vectorizer = TfidfVectorizer.Builder()
                    .setMinWordFrequency(1)
                    .setStopWords(ArrayList())
                    .setTokenizerFactory(tokenizerFactory)
                    .setIterator(iter)
                    .build()
    
            vectorizer.fit()
    
            val vector = vectorizer.transform("HSAC Software Development (Guangdong) Limited")
    
            logger.info("TF-IDF vector: " + Arrays.toString(vector.data().asDouble()))
    
            /**
             * Compare the similarity, sort desc, and pick top 2 and print it out
             */
            val counter = AtomicInteger(1)
            rawLines.parallelStream().map { line ->
                Pair<Double, String>(Transforms.cosineSim(vector, vectorizer.transform(line)), line)
            }.sorted { o1, o2 ->
                // Desc
                o2.first.compareTo(o1.first)
            }.limit(2).forEachOrdered {
                logger.info("\n" +
                        "Here comes ${counter.getAndIncrement().ordinal()} result of Top 2:" +
                        "\n" +
                        "line '${it.second}' with sim: ${it.first}")
            }
        }
    }
    
    /**
     * To ordinalize the number.
     */
    fun Number.ordinal(): String {
        val suffix = arrayOf("th", "st", "nd", "rd", "th", "th", "th", "th", "th", "th")
        val m = this.toInt() % 100
        return this.toString() + suffix[if (m > 3 && m < 21) 0 else m % 10]
    }
    

    Output. (A similarity of 1 means 100% similar; the closer to 1, the more similar.)

    21:37:46.306 [main] INFO com.example.demo.TextSimilarityControllerTests - TF-IDF vector: [0.02498774789273739, 0.06020599976181984, 0.0, 0.06020599976181984, 0.0, 0.0, 0.0, 0.12041199952363968, 0.0, 0.0]
    21:37:46.340 [ForkJoinPool.commonPool-worker-19] INFO com.example.demo.TextSimilarityControllerTests -

    Here comes 1st result of Top 2:

    line 'HSAC Software Development (Guangdong) Ltd' with sim: 0.9276711940765381
    21:37:46.340 [ForkJoinPool.commonPool-worker-19] INFO com.example.demo.TextSimilarityControllerTests -

    Here comes 2nd result of Top 2:

    line 'HSAC Software Development (IN) Ltd' with sim: 0.32648345828056335
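To see where those two similarity values come from, here is a small plain-Kotlin sketch of cosine similarity (dot product divided by the product of the vector lengths). The function name `cosineSimilarity` is hypothetical, not a DL4J call. The vectors are hand-built over only the six terms involved, reusing the weights from the logged TF-IDF vector, and they assume that "Ltd" carries the same weight as "Software" (both appear in 2 of the 4 lines) and "(IN)" the same weight as "(Guangdong)" (1 of 4).

```kotlin
import kotlin.math.sqrt

// Cosine similarity: dot(a, b) / (|a| * |b|).
// Returns 0.0 when either vector is all zeros (a convention chosen for this sketch).
fun cosineSimilarity(a: DoubleArray, b: DoubleArray): Double {
    require(a.size == b.size) { "vectors must have the same length" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    if (normA == 0.0 || normB == 0.0) return 0.0
    return dot / (sqrt(normA) * sqrt(normB))
}

fun main() {
    // Weights taken from the logged TF-IDF vector.
    val hsac = 0.02498774789273739
    val common = 0.06020599976181984   // Software, Development, Ltd (assumed)
    val rare = 0.12041199952363968     // (Guangdong), (IN) (assumed)

    // Term order: HSAC, Software, Development, (Guangdong), (IN), Ltd
    val query = doubleArrayOf(hsac, common, common, rare, 0.0, 0.0) // "... (Guangdong) Limited"
    val line1 = doubleArrayOf(hsac, common, common, rare, 0.0, common) // "... (Guangdong) Ltd"
    val line4 = doubleArrayOf(hsac, common, common, 0.0, rare, common) // "... (IN) Ltd"

    println(cosineSimilarity(query, line1))
    println(cosineSimilarity(query, line4))
}
```

Under those assumptions the two printed values should land close to the 0.9277 and 0.3265 in the log: sharing the rare, heavily weighted "(Guangdong)" term is what pulls the first line so far ahead of the second.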

    Oh yeah~ All done…
