
Your Guide to NLP with MLSQL Stack

Author: 祝威廉 | Published 2019-05-12 13:47

    End2End NLP with MLSQL Stack

    The MLSQL stack supports a complete train/predict pipeline. This means the following steps can all live in the same script:

    1. collect data
    2. preprocess data
    3. train
    4. predict

    Also, since any model or preprocessing ET can be registered as a function, you can reuse all of these functions in the Predict Service without writing any more code.

    Requirements

    This guide requires MLSQL Stack 1.3.0-SNAPSHOT. You can set up the MLSQL stack with the following links. We recommend deploying the MLSQL stack locally.

    1. Docker
    2. Manually Compile
    3. Prebuilt Distribution

    If you run into any problems while deploying, please let me know; feel free to file an issue at this link.

    Data Preparation

    In this article we will work with Chinese text.

    Download the Sogou news corpus from this site: news_sohusite.


    Upload the file to the MLSQL Stack file server

    Upload news_sohusite_xml.full.tar to the MLSQL Stack file server: just drag the file to the upload area.


    Once done, the web UI will indicate success by showing that one file has been uploaded.


    Download the file and save it to your home directory

    In order to read this file, we need to save it to our home directory. Use a command like the following:

    -----------------------------------------
    -- Download from file server.
    -- run command as DownloadExt.`` where 
    -- from="public/SogouCS.reduced.tar" and
    -- to="/tmp/nlp/sogo";
    -- or you can use command line.
    -----------------------------------------
    
    !saveUploadFileToHome public/SogouCS.reduced.tar /tmp/nlp/sogo;
    

    Check if the file has been created:

    !fs -ls /tmp/nlp/sogo;
    

    Well, it has been created successfully.

    Found 1 items
    -rw-r--r--   1 allwefantasy admin 1537763850 2019-05-09 16:59 /tmp/nlp/sogo/news_sohusite_xml.dat
    

    Load the XML data

    The MLSQL stack supports many data sources, including XML, and news_sohusite_xml.dat is in XML format. We can use a load statement to load the data:

    -- load data with xml format
    load xml.`/tmp/nlp/sogo/news_sohusite_xml.dat` where rowTag="doc" and charset="GBK" as xmlData; 
    

    Note that you can select any single statement, execute it on its own, and check whether the result is what you expect.

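    For example, after the load statement above you can quickly inspect a few rows:

    select * from xmlData limit 10 as output;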

    Extract the label from the URL

    The URL looks like this:

    http://sports.sohu.com/20070422/n249599819.shtml
    

    We need to extract sports from it, which tells us this article belongs to the sports category.

    select temp.* from (select split(split(url,"/")[2],"\\.")[0] as labelStr,content from xmlData) as temp 
    where temp.labelStr is not null 
    as rawData;
    

    The label we extract from the URL is a string, but the RandomForest algorithm requires the label to be an integer. Here we use StringIndex to build the mapping between strings and numbers:

    train rawData as StringIndex.`/tmp/nlp/label_mapping` where inputCol="labelStr" and
    outputCol="label";
    

    Now we can convert all string labels to integer labels:

    predict rawData as StringIndex.`/tmp/nlp/label_mapping` as rawDataWithLabel;
    

    Note that we need to register this model as a function, because in the later predict stage we will need to convert the numbers back to strings. It's easy to do this:

    register StringIndex.`/tmp/nlp/label_mapping` as convert_label;
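
    Registering StringIndex as convert_label also makes the reverse mapping available as convert_label_r (the _r suffix), which is what the predict stage uses later. As a minimal sketch of a round trip, assuming sports is one of the extracted labels:

    select convert_label_r(convert_label("sports")) as roundtrip as output;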
    

    Split the dataset

    Sometimes we need to reduce the dataset because of the limited resources we have. In another scenario, we may need to split the data into train/validate/test sets. Both can be done with the RateSampler ET. In MLSQL, many ETs also have an easier-to-use form that we call command line style. Here are the ET style and the command line style.

    ET Style:

    run xmlData as RateSampler.`` 
    where labelCol="url" and sampleRate="0.9,0.1" 
    as xmlDataArray;
    

    Command Line Style:

    !split rawDataWithLabel by label with "0.9,0.1" named xmlDataArray;
    

    Now we have split the dataset of each category into 0.9/0.1 portions. To speed things up, we use only the 10% portion.

    select * from xmlDataArray where __split__=1 as miniXmlData;
    

    Save what we have so far (Optional)

    save overwrite miniXmlData as parquet.`/tmp/nlp/miniXmlData`;
    load parquet.`/tmp/nlp/miniXmlData` as miniXmlData;
    

    This avoids recomputing miniXmlData every time we need it.
    In production, you may want to use cache (memory and disk) instead; you can use it like this:

    !cache miniXmlData script;
    

    You do not need to release it manually; the MLSQL Engine will take care of it.

    Use TF/IDF to process content

    train miniXmlData as TfIdfInPlace.`/tmp/nlp/tfidf` where inputCol="content" as trainData;
    

    Again, register the model as a function:

    register TfIdfInPlace.`/tmp/nlp/tfidf` as tfidf_predict;
    

    Save what we have so far (Optional)

    save overwrite trainData as parquet.`/tmp/nlp/trainData`;
    load parquet.`/tmp/nlp/trainData` as trainData;
    

    Again, you can cache trainData.
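
    As with miniXmlData earlier, a single command is enough:

    !cache trainData script;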

    Cut the feature size

    The feature size generated by the TF/IDF ET is more than 600,000, which slows down training, so here we use vec_range to take a subrange of the vector:

    select vec_range(content,array(0,10000)) as content,label from trainData as trainData;
    

    There are many more vector-related functions in MLSQL; check here if you are interested.
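
    As a small taste, here is a sketch that builds a dense vector and takes the position of its largest element. vec_argmax is used later in this guide; vec_dense is assumed to be part of the same vector function set:

    select vec_argmax(vec_dense(array(0.1D, 0.9D, 0.3D))) as idx as output;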

    Train RandomForest

    train trainData as RandomForest.`/tmp/nlp/rf` where 
    keepVersion="true" 
    and fitParam.0.featuresCol="content" 
    and fitParam.0.labelCol="label"
    and fitParam.0.maxDepth="4"
    and fitParam.0.checkpointInterval="100"
    and fitParam.0.numTrees="4"
    ;
    

    You can use the fitParam.{group} prefix (fitParam.0, fitParam.1, ...) to configure multiple groups of params, like this:

    train trainData as RandomForest.`/tmp/nlp/rf` where 
    keepVersion="true" 
    and fitParam.0.featuresCol="content" 
    and fitParam.0.labelCol="label"
    and fitParam.0.maxDepth="4"
    and fitParam.0.checkpointInterval="100"
    and fitParam.0.numTrees="4"
    and fitParam.1.featuresCol="content" 
    and fitParam.1.labelCol="label"
    and fitParam.1.maxDepth="3"
    and fitParam.1.checkpointInterval="100"
    and fitParam.1.numTrees="10"
    ;
    

    The MLSQL Engine will then generate two models.


    Register the model as a function:

    register RandomForest.`/tmp/nlp/rf` as rf_predict;
    

    Predict

    Now for end-to-end prediction; you can also deploy this as an API service.
    Do not forget to subrange the TF/IDF feature:

    select convert_label_r(vec_argmax(rf_predict(vec_range(tfidf_predict("新闻不错"),array(0,10000))))) as predicted as output;
    

    As you can see, we use all of the functions registered earlier, which lets us go from raw text all the way to the final string category. The steps are clear:

    1. use tfidf_predict to generate the TF/IDF vector
    2. use vec_range to subrange the vector
    3. use rf_predict to get the numeric category
    4. use convert_label_r to convert the number back to a string

    Most of the time you will train several times, and if you want to see the history,
    use a command like this:

    !model history /tmp/nlp/rf;
    

    How to deploy the API service

    Just start the MLSQL Engine in local mode, and then you can POST to http://127.0.0.1:9003/model/predict with the following params:

    dataType=row
    data=[{"content":"新闻不错"}]
    sql=select convert_label_r(vec_argmax(rf_predict(vec_range(tfidf_predict(content),array(0,10000))))) as predicted
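
    A minimal sketch of such a request with curl might look like this (the endpoint and params are exactly those above; URL-encoding the sql param via --data-urlencode is an assumption of this sketch):

    curl -XPOST 'http://127.0.0.1:9003/model/predict' \
      --data-urlencode 'dataType=row' \
      --data-urlencode 'data=[{"content":"新闻不错"}]' \
      --data-urlencode 'sql=select convert_label_r(vec_argmax(rf_predict(vec_range(tfidf_predict(content),array(0,10000))))) as predicted'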
    

    That's All.

    Bonus

    Thanks to the include statement and the script store support, once you have set up the MLSQL stack, you can use scripts from the store immediately:

    
    include store.`/alg/text_classify.mlsql`;
    
    !textClassify /tmp/nlp/sogo/news_sohusite_xml.dat  /tmp/nlp2;
    !textPredict "新闻很不错";
    

    The MLSQL Engine will download the script from repo.store.mlsql.tech automatically.
    Any script you have written can be wrapped as a command and used by others.

    The Final Complete Script

    
    -----------------------------------------
    -- Download from file server.
    -- run command as DownloadExt.`` where 
    -- from="public/SogouCS.reduced.tar" and
    -- to="/tmp/nlp/sogo";
    -- or you can use command line.
    -----------------------------------------
    
    !saveUploadFileToHome public/SogouCS.reduced.tar /tmp/nlp/sogo;
    
    
    -- load data with xml format
    load xml.`/tmp/nlp/sogo/news_sohusite_xml.dat` where rowTag="doc" and charset="GBK" as xmlData; 
    
    
    -- extract `sports` from the url [http://sports.sohu.com/20070422/n249599819.shtml]
    select temp.* from (select split(split(url,"/")[2],"\\.")[0] as labelStr,content from xmlData) as temp 
    where temp.labelStr is not null 
    as rawData;
    
    
    -- Tips:
    ----------------------------------------------------------------------------------
    -- Try the following sql to explore how many labels we have and what they look like.
    --
    -- select distinct(split(split(url,"/")[2],"\\.")[0]) as labelStr from rawData as output;
    -- select split(split(url,"/")[2],"\\.")[0] as labelStr,url from rawData as output;
    ----------------------------------------------------------------------------------
    
    -- the label we extract from the url is a string, and the RandomForest algorithm requires
    -- an integer label. here we use StringIndex to implement this.
    -- train a model which can map label to number and vice versa
    train rawData as StringIndex.`/tmp/nlp/label_mapping` where inputCol="labelStr" and
    outputCol="label";
    
    -- convert label to number 
    predict rawData as StringIndex.`/tmp/nlp/label_mapping` as rawDataWithLabel;
    
    
    -- you can use register to convert a model to a function
    register StringIndex.`/tmp/nlp/label_mapping` as convert_label; 
    
    
    
    -- we can reduce the dataset, because if there is too much data but only limited resources,
    -- it may take too long. you can use the command line
    -- or you can use raw ET:
    --
    -- run xmlData as RateSampler.`` 
    -- where labelCol="url" and sampleRate="0.9,0.1" 
    -- as xmlDataArray;
    !split rawDataWithLabel by label with "0.9,0.1" named xmlDataArray;
    -- then we fetch the xmlDataArray with position one to get the 10% data.
    select * from xmlDataArray where __split__=1 as miniXmlData;
    
    -- we can save the result data, because computing it takes quite some time.
    save overwrite miniXmlData as parquet.`/tmp/nlp/miniXmlData`;
    
    load parquet.`/tmp/nlp/miniXmlData` as miniXmlData;
    -- select * from miniXmlData limit 10 as output;
    
    -- convert the content to tfidf format
    train miniXmlData as TfIdfInPlace.`/tmp/nlp/tfidf` where inputCol="content" as trainData;
    -- again register the model as a function
    register TfIdfInPlace.`/tmp/nlp/tfidf` as tfidf_predict;
    
    
    save overwrite trainData as parquet.`/tmp/nlp/trainData`;
    load parquet.`/tmp/nlp/trainData` as trainData;
    
    -- the feature size generated by tfidf is more than 600,000, which will slow down the performance,
    -- here we use vec_range to subrange the vector.
    select vec_range(content,array(0,10000)) as content,label from trainData as trainData;
    
    -- use the RandomForest algorithm to train
    -- you can use the fitParam.{group} prefix to configure multiple groups of params
    train trainData as RandomForest.`/tmp/nlp/rf` where 
    keepVersion="true" 
    and fitParam.0.featuresCol="content" 
    and fitParam.0.labelCol="label"
    and fitParam.0.maxDepth="4"
    and fitParam.0.checkpointInterval="100"
    and fitParam.0.numTrees="4"
    ;
    
    -- register the RF model as a function
    register RandomForest.`/tmp/nlp/rf` as rf_predict;
    
    -- end to end predict; you can also deploy this as an API service
    -- do not forget to subrange the tfidf feature
    select convert_label_r(vec_argmax(rf_predict(vec_range(tfidf_predict("新闻不错"),array(0,10000))))) as predicted as output;
    
    -- !model history /tmp/nlp/rf;
    
    
    
