Home Depot Product Search Relevance

Author: 阿发贝塔伽马 | Published 2018-07-29 17:14

    Kaggle competition link: Home Depot Product Search Relevance. The key to this problem is feature extraction; the data needs careful inspection and processing. The final submission ranked in the top 10%.

    Feature group 1 (lexical semantics)

    • Levenshtein.ratio can be used to score the character-level similarity of two English words.
    • The wordnet corpus from nltk (nltk.corpus.wordnet) is used to judge the semantic similarity of two words; a sketch combining these two checks follows this list. WordNet is applied to the original text, because stemming changes the words too much; the stemmed text is mainly used for the TF-IDF and word2vec features.
    • If both similarities are low, also check whether the query matches something in the attributes file (one training example was rated 3 even though it matched neither the title nor the description, but it did match an entry in the attributes file).
    • If none of the above matches (at least four such cases were found, typically searches for a product model number), send the query to Google (an HTTP request) and compute the similarity against the first result instead.
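
    A minimal sketch of the two word-level checks above, assuming the python-Levenshtein package and the NLTK WordNet corpus are installed (it mirrors the similarity helpers defined later in the post):

    import Levenshtein
    from nltk.corpus import wordnet as wn

    def word_similarity(w1, w2):
        # character-level similarity of the two spellings
        edit_sim = Levenshtein.ratio(w1, w2)
        # semantic similarity: best path_similarity over all synset pairs
        sem_sim = 0.
        for s1 in wn.synsets(w1):
            for s2 in wn.synsets(w2):
                val = s1.path_similarity(s2)
                if val is not None:
                    sem_sim = max(sem_sim, val)
        return edit_sim, sem_sim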

    Feature group 2: word vectors (word2vec in gensim)

    • Train word2vec on an English Wikipedia corpus to measure how related two words are (a sketch follows this list).
    • Train word2vec on product_title and product_description concatenated together as the corpus to obtain the word vectors.
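
    A hypothetical sketch of the Wikipedia-trained variant, assuming the dump has already been extracted to a plain-text file with one sentence per line (the file name enwiki_sentences.txt is made up here); the title/description model is trained further down in the post:

    from gensim.models.word2vec import Word2Vec, LineSentence

    # stream the pre-extracted Wikipedia sentences from disk and train 128-dim vectors
    wiki_model = Word2Vec(LineSentence('enwiki_sentences.txt'),
                          size=128, window=5, min_count=5, workers=4)
    # relatedness of two words = cosine similarity of their vectors,
    # e.g. wiki_model.wv.similarity('drill', 'driver')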

    Feature group 3: TF-IDF


    Read the data

    import pandas as pd
    
    # read the data
    df_train = pd.read_csv('/Users/tangqinglong/Desktop/Scikit-learn/Depot/train.csv', encoding='ISO-8859-1')
    df_test = pd.read_csv('/Users/tangqinglong/Desktop/Scikit-learn/Depot/test.csv', encoding='ISO-8859-1')
    df_pro = pd.read_csv('/Users/tangqinglong/Desktop/Scikit-learn/Depot/product_descriptions.csv', encoding='ISO-8859-1')
    df_attr = pd.read_csv('/Users/tangqinglong/Desktop/Scikit-learn/Depot/attributes.csv', encoding='ISO-8859-1')
    

    Configure the pandas display width so long text fields are shown in full, which makes inspection easier.

    pd.set_option('display.max_colwidth',1000)
    

    Vertical concatenation: append the test data below the train data and ignore the index.

    df_all = pd.concat((df_train, df_test), axis=0,ignore_index=True)
    

    Merge the product descriptions into the combined table as an extra column.

    df_all = pd.merge(df_all, df_pro, how='left', on='product_uid')
    

    Take a look at the basic distribution of the relevance scores.

    import copy
    a = copy.deepcopy(df_train['relevance'].values)
    a.sort()
    import matplotlib.pyplot as plt
    print type(a)
    plt.plot(a)
    plt.show()
    
    df_y = pd.DataFrame(df_train['relevance'])
    df_y['num'] = 1
    df_y_g = df_y['num'].groupby(df_y['relevance'])
    df_y_g.sum()
    

    Output

    relevance
    1.00     2105
    1.25        4
    1.33     3006
    1.50        5
    1.67     6780
    1.75        9
    2.00    11730
    2.25       11
    2.33    16060
    2.50       19
    2.67    15202
    2.75       11
    3.00    19125
    Name: num, dtype: int64
    

    The scores range from 1 to 3 in steps of roughly 0.33 or 0.34. Levels such as 1.25 and 1.50 have only a handful of samples and can be ignored, which leaves 7 classes, so the regression problem could also be cast as classification (see the sketch below); here it is treated as regression first.
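
    A hypothetical sketch of the classification framing, snapping each score to the nearest of the seven dominant levels (the rare levels such as 1.25 simply fall into their nearest neighbour):

    import numpy as np

    levels = np.array([1.00, 1.33, 1.67, 2.00, 2.33, 2.67, 3.00])

    def to_class(relevance):
        # index of the closest of the 7 dominant relevance levels
        return int(np.argmin(np.abs(levels - relevance)))

    # e.g. df_train['relevance_class'] = df_train['relevance'].map(to_class)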

    • Build a stopword dictionary
    from nltk.corpus import stopwords
    dic_stopwords = dict(zip(stopwords.words('english'),xrange(len(stopwords.words('english')))))
    
    • Text preprocessing
    from nltk import SnowballStemmer  
    from nltk.corpus import stopwords
    import re
    import Levenshtein
    stemmer = SnowballStemmer('english')
    
    pattern_replace_pair_list = [
                (r"<.+?>", r""),
                # html codes
                (r"&nbsp;", r" "),
                (r"&amp;", r"&"),
                (r"&#39;", r"'"),
                (r"/>/Agt/>", r""),
                (r"</a<gt/", r""),
                (r"gt/>", r""),
                (r"/>", r""),
                (r"<br", r""),
                # do not remove [".", "/", "-", "%"] as they are useful in numbers, e.g., 1.97, 1-1/2, 10%, etc.
                (r"[ &<>)(_,;:!?\+^~@#\$\*]+", r" "),
                (r"'s\\b", r""),
                (r"[']+", r""),
                # split CamelCase words such as DeckOver, and also separate digits glued to letters
                #(r'([A-Z][a-z]+|[a-z]+|\d+)', r'\1 '),
                (r'(\d?)([a-zA-Z]+)', r'\1 \2 '),
                #(r'(/d+)', r' \1 '),
                (r'([A-Z][a-z]+)', r' \1 '),
            ]
    dic = {1:'one', 2:'two', 3:'three', 4:'four', 5:'five', 6:'six', 7:'seven', 8:'eight', 9:'nine', 0:'zero'}
    def dashrep(matchobj):
        if len(matchobj.group())==1:
            return dic[int(matchobj.group())]
        else:
            return matchobj.group() 
    
    # lowercase, strip punctuation and stopwords
    def transform(text):
        for pattern, replace in pattern_replace_pair_list:
            try:
                text = re.sub(pattern, replace, text)
            except:
                pass
        #text = re.sub(r'[\d]+', dashrep, text)
        text = re.sub(r"\s+", " ", text).strip()
        return ' '.join([word for word in text.lower().split() if word not in dic_stopwords])
    
    #word_list = "Package stopwords is already up-to-date".split(" ")
    #filtered_words = [word for word in word_list if word not in stopwords.words('english')]
    
    # stemming
    def str_stemmer(s):
        # stem each word (SnowballStemmer.stem also lowercases)
        if isinstance(s, float):
            s = unicode(s)
        return ' '.join([stemmer.stem(word) for word in s.split()])
    
    # how many words of str1 appear (as substrings) in str2
    def str_common_word(str1, str2):
        return sum(int(str2.find(word)>=0) for word in str1.split())
    
    def str_notcommon_word(str1, str2):
        return sum(int(str2.find(word)==-1) for word in str1.split())
    
    # str1: search term, str2: description, str3: title (as used below)
    # counts how many words of str1 appear in both str2 and str3
    def str_common_desc_pro_word(str1, str2, str3):
        return sum(int(str2.find(word)>=0 and str3.find(word)>=0) for word in str1.split())
    
    def word_vs_word_ratio(str1, str2):
        ratio = 0
        count = 0
        for word1 in str1.split():
            for word2 in str2.split():
                ratio = Levenshtein.ratio(word1, word2)+ratio
                count+=1
        return ratio/max(count,1)
    
    def search_vs_word_ratio(str_search, str_des):
        ratio = 0
        if len(str_search) ==0:
            return 0
        for word in str_des.split():
            ratio = max(Levenshtein.ratio(str_search, word), ratio)
        return ratio
    
    
    
    import nltk
    import Levenshtein
    def similarity(word1, word2):    
        from nltk.corpus import wordnet as wn
        word_1 = wn.synsets(word1)
        word_2 = wn.synsets(word2)
        sl = 0.
        for el1 in word_1:
            for el2 in word_2:
                val = el1.path_similarity(el2)
                if val is not None:
                    sl = max(val,sl)
                    if sl > 0.8:
                        break
        return sl
    
    def similarity_sentences(str1, str2):
        sl, count, Ntotal, Nmatch = 0., 0., 0., 0.
        for word1 in nltk.pos_tag(str1.split()):
            score = 0
            for word2 in nltk.pos_tag(str2.split()):
                score = max(Levenshtein.ratio(word1[0],word2[0]),score)
                if score < 0.75:
                    if word1[1][0]==word2[1][0]:
                        score = max(similarity(word1[0], word2[0]),score)
                #print score, word1, word2
                if score < 0.75:
                    continue
                sl += score
                count += 1
                break
            if score > 0.7:
                Nmatch += 1
        if count == 0:
            return [0., 0.]
        return  [sl/count,Nmatch/max(len(str1.split()),1)] 
    
    • Strip punctuation, lowercase, and remove stopwords
    %time df_all['search_term_transform'] = df_all['search_term'].map(lambda x:transform(x))
    %time df_all['product_title_transform'] = df_all['product_title'].map(lambda x:transform(x))
    %time df_all['pro_des_trans'] = df_all['product_description'].map(lambda x:transform(x))
    
    • Stemming
    %time df_all['search_term_transform_stem'] = df_all['search_term_transform'].map(lambda x:str_stemmer(x))
    %time df_all['product_title_transform_stem'] = df_all['product_title_transform'].map(lambda x:str_stemmer(x))
    %time df_all['pro_des_trans_stem'] = df_all['pro_des_trans'].map(lambda x:str_stemmer(x))
    
    df_all['all_texts_transform'] = df_all['product_title_transform'] + ' . ' + df_all['pro_des_trans']
    df_all['all_texts_trans_stemm'] = df_all['product_title_transform_stem'] + ' . ' + df_all['pro_des_trans_stem']
    
    from joblib import Parallel, delayed
    def func_similarity(df):
        sea_ter_tit = df.apply(lambda temp:similarity_sentences(temp['search_term_transform'],temp['product_title_transform']), axis=1).values[0]
        sea_ter_des = df.apply(lambda temp:similarity_sentences(temp['search_term_transform'],temp['pro_des_trans']), axis=1).values[0]
        sea_ter_all = df.apply(lambda temp:similarity_sentences(temp['search_term_transform'],temp['all_texts_transform']), axis=1).values[0]
        
        df.loc[:, 'sea_ter_tit'] = sea_ter_tit[0]
        df.loc[:, 'sea_ter_tit_ratio'] = sea_ter_tit[1]
        df.loc[:, 'sea_ter_des'] = sea_ter_des[0]
        df.loc[:, 'sea_ter_des_ratio'] = sea_ter_des[1]
        df.loc[:, 'sea_ter_all'] = sea_ter_all[0]
        df.loc[:, 'sea_ter_all_ratio'] = sea_ter_all[1]
        
        return df
    def apply_parallel(df_grouped, func):
        """利用 Parallel 和 delayed 函数实现并行运算"""
        results = Parallel(n_jobs=-1)(delayed(func)(group) for name, group in df_grouped)
        return pd.concat(results)
    
    df_model = pd.DataFrame({})
    

    Compute these features with multiple processes

    df_grouped = df_all.groupby(df_all.index)
    %time df_model = apply_parallel(df_grouped, func_similarity)[['sea_ter_tit', \
                                                                  'sea_ter_tit_ratio',\
                                                                  'sea_ter_des',\
                                                                  'sea_ter_des_ratio',\
                                                                  'sea_ter_all',\
                                                                  'sea_ter_all_ratio',\
                                                                  'product_uid']]
    
    import numpy as np
    df_model['len_search_term']=df_all['search_term_transform_stem'].map(
            lambda x:len(x.split())).astype(np.int64)
    # [New feature 1]: compare search_term with product_title
    df_model['dist_in_title'] = df_all.apply(lambda x:word_vs_word_ratio((x['search_term_transform_stem']),x['product_title_transform_stem']), axis=1)
    
    df_model['dist_in_title1'] = df_all.apply(lambda x:search_vs_word_ratio((x['search_term_transform_stem']),x['product_title_transform_stem']), axis=1)
    
    # [New feature 2]: compare search_term with product_description
    df_model['dist_in_desc'] = df_all.apply(lambda x:word_vs_word_ratio((x['search_term_transform_stem']),x['pro_des_trans_stem']), axis=1)
    
    df_model['dist_in_desc1'] = df_all.apply(lambda x:search_vs_word_ratio((x['search_term_transform_stem']),x['pro_des_trans_stem']), axis=1)
    
    df_model['len_of_query']=df_all['search_term_transform_stem'].map(
            lambda x:len(x.split())).astype(np.int64)
    df_model['len_search'] = df_all['search_term_transform_stem'].map(lambda x:len(x))
    
    # how many query words appear in the product title
    df_model['commons_in_title']=df_all.apply(
        lambda x:str_common_word(
        x['search_term_transform_stem'],x['product_title_transform_stem']),axis=1)
    
    # how many query words appear in the description
    %time df_model['commons_in_desc'] = df_all.apply(lambda x:str_common_word(x['search_term_transform_stem'], x['pro_des_trans_stem']), axis=1)
    
    %time df_model['common_in_desc_pro']=df_all.apply(lambda x: str_common_desc_pro_word(x['search_term_transform_stem'],x['pro_des_trans_stem'],x['product_title_transform_stem']), axis=1)
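
    The feature below relies on nn_word_numbers_In_Search, which the post never defines; a plausible sketch, assuming it counts the noun tokens (NN* tags) in the search term with nltk's POS tagger:

    import nltk

    # assumed helper (not in the original post): number of noun (NN*) tokens in the search term
    def nn_word_numbers_In_Search(text):
        return sum(1 for word, tag in nltk.pos_tag(text.split()) if tag.startswith('NN'))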
    
    df_model['nn_word_in_search'] = df_all['search_term_transform_stem'].map(nn_word_numbers_In_Search)
    
    df_model['queryvstitle'] = df_model.apply(lambda x: float(x['commons_in_title'])/max((x['len_of_query']),1), axis=1)
    
    df_model['product_uid'] =df_all['product_uid']
    
    df_model['queryvsdesc'] = df_model.apply(lambda x: float(x['commons_in_desc'])/max(x['len_of_query'],1), axis=1)
    
    

    TF-IDF features

    # With the combined sentences ready, we can tokenize.
    # Tokenization: here we use gensim so that the TF-IDF steps are broken down more explicitly; sklearn also ships a simple, convenient tfidf model.
    # Tokenizing just turns a long string into a list of tokens; NLTK and sklearn each have their own solutions too.
    
    from gensim.utils import tokenize
    from gensim.corpora.dictionary import Dictionary
    # build one big dictionary over all the words
    dictionary = Dictionary(list(tokenize(x, errors='ignore')) for x in df_all['all_texts_trans_stemm'].values)
    print(dictionary)
    # this class simply streams over the whole corpus and converts each document into plain word counts
    class MyCorpus(object):
        def __iter__(self):
            for x in df_all['all_texts_trans_stemm'].values:
                yield dictionary.doc2bow(list(tokenize(x, errors='ignore')))
    
    # The only point of this wrapper is to be memory friendly: with a large corpus, materializing
    # everything as one list makes the run very slow, so we yield one document at a time.
    # Conceptually it still looks like [['sentence', '1'], ['sentence', '2'], ...]
    corpus = MyCorpus()
    
    # With the corpus in this standard form we can init the TF-IDF model: it takes the bag-of-words vectors and applies the TF-IDF weighting.
    from gensim.models.tfidfmodel import TfidfModel
    tfidf = TfidfModel(corpus)
    # example: what an ordinary sentence looks like after the transform:
    tfidf[dictionary.doc2bow(list(tokenize('hello world, good morning', errors='ignore')))]
    
    # How do we judge the similarity of two sentences?
    # There is a small trick here: the tfidf output only contains entries of the form
    # "this token is present, with this weight", not the full vector,
    # so two outputs can appear to have different sizes.
    # For cosine similarity we need a fixed size. In fact the sizes are the same
    # (the dictionary size); the all-zero entries are simply omitted.
    # So we take one vector, expand it to the full matrix size as an index,
    # and feed the other one in to compute the similarity.
    
    from gensim.similarities import MatrixSimilarity
    
    # first wrap the transform above into a helper
    def to_tfidf(text):
        res = tfidf[dictionary.doc2bow(list(tokenize(text, errors='ignore')))]
        return res
    
    # then build a cosine-similarity comparison helper
    def cos_sim(text1, text2):
        tfidf1 = to_tfidf(text1)
        tfidf2 = to_tfidf(text2)
        index = MatrixSimilarity([tfidf1],num_features=len(dictionary))
        sim = index[tfidf2]
        # sim comes back as an array; we only need a scalar,
        # so cast it straight to float
        return float(sim[0])
    
    # TF-IDF cosine similarity between the search term and the product title
    df_model['tfidf_cos_sim_in_title'] = df_all.apply(lambda x: cos_sim(x['search_term_transform_stem'], x['product_title_transform_stem']), axis=1)
    # TF-IDF cosine similarity between the search term and the product description
    df_model['tfidf_cos_sim_in_desc'] = df_all.apply(lambda x: cos_sim(x['search_term_transform_stem'], x['pro_des_trans_stem']), axis=1)
    

    Use Word2Vec to measure the distance between the search term and the product title / description

    import nltk
    #1) nltk ships with a solid sentence splitter. [load the tool]
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    #2) first turn each long text into a list of sentences, then each sentence into a list of words: [text -> sentences]
    sentences = [tokenizer.tokenize(x) for x in df_all['all_texts_trans_stemm'].values]
    #3) flatten the list of lists. [sentences -> flatten]
    sentences = [y for x in sentences for y in x] # 1,998,321 sentences in total
    #4) split each sentence into words; gensim's tokenizer or nltk's word_tokenize both work. [sentence -> words]
    from nltk.tokenize import word_tokenize
    w2v_corpus = [word_tokenize(x) for x in sentences]
    #5) train word vectors on this corpus. [words -> word2vec model]
    from gensim.models.word2vec import Word2Vec
    model = Word2Vec(w2v_corpus, size=128, window=5, min_count=5, workers=4)
    
    • We get a vector per word, but each sentence consists of several words, so average the word vectors to obtain a sentence vector.
    import numpy as np
    #6) average the word vectors of a sentence to obtain its vector
    vocab = model.wv.vocab
    # vector of an arbitrary text: simply the mean of its word vectors
    def get_vector(text):
        res = np.zeros([128])
        count = 0
        for word in word_tokenize(text):
            if word in vocab:  # skip words dropped by min_count
                res += model.wv[word]
                count += 1
        return res/max(count,1)
    
    • Compute the cosine similarity of the two sentence vectors using scipy's spatial module
    from scipy import spatial
    def w2v_cos_sim(text1, text2):
        try:
            w2v1 = get_vector(text1)
            w2v2 = get_vector(text2)
            sim = 1 - spatial.distance.cosine(w2v1, w2v2)
            return float(sim)
        except:
            return float(0)
    
    df_model['w2v_cos_sim_in_title'] = df_all.apply(lambda x: w2v_cos_sim(x['search_term_transform_stem'], x['product_title_transform_stem']), axis=1)
    df_model['w2v_cos_sim_in_desc'] = df_all.apply(lambda x: w2v_cos_sim(x['search_term_transform_stem'], x['pro_des_trans_stem']), axis=1)
    

    Handle missing values

    def check_series(series):
        i = 0
        for el in series:
            if np.isnan(el):
                series[i] = 0
            i+=1
        return series
    df_model = df_model.apply(check_series)
    
    # keep the ids of the test set
    test_ids = df_test['id']
    # split off y_train
    y_train = df_train['relevance'].values
    X_train = df_model.loc[df_train.index]
    X_test = df_model.loc[len(df_train.index):]
    
    from sklearn import preprocessing
    scaler = preprocessing.StandardScaler().fit(X_train.values)
    train = scaler.transform(X_train.values)
    test = scaler.transform(X_test.values)
    

    Correlation coefficients

    from scipy.stats import pearsonr
    
    label = y_train
    lr = []
    for i, line in enumerate(X_train.values.T):
        lr.append([pearsonr(label, line), i])
    lr.sort()
    print lr
    

    xgboost

    • xgboost has no built-in log-cosh objective, so we define a custom loss: for L = log(cosh(pred − y)) the gradient is tanh(pred − y) and the Hessian is 1 − tanh²(pred − y).
    import xgboost as xgb
    import random
    import numpy as np
    
    def huber_approx_obj(preds, dtrain):
        d = preds - dtrain.get_label()  # residual; drop .get_label() when using the sklearn wrapper
        h = 1  # h is the delta of the pseudo-Huber loss
        scale = 1 + (d / h) ** 2
        scale_sqrt = np.sqrt(scale)
        grad = d / scale_sqrt
        hess = 1 / scale / scale_sqrt
        return grad, hess
    
    def log_cosh_obj(preds, dtrain):
        x = dtrain.get_label() - preds
        grad = np.tanh(-x)
        hess = 1- np.tanh(x)**2
        return grad, hess
    def square_loss(preds, dtrain):
        #x = dtrain.get_label()-preds
        grad = preds - dtrain.get_label()
        hess = [1]*len(grad)
        return grad,hess
    xgb_params = { 
    'eta': 0.03, 
    'max_depth': 6, 
    'gamma':2,  # minimum loss reduction required to make a further split on a leaf; the larger, the more conservative the model. Range [0, ∞)
    'subsample': 0.6, 
    'colsample_bytree': 0.7, 
    'objective': 'reg:linear', 
    'eval_metric': 'rmse', 
    'silent': 1 ,
    'min_child_weight':1,
    'lambda':100,
    'seed':1,
    'booster':'gbtree'
    #'booster':'gblinear'
    }
    
    # load the data into xgboost's DMatrix object
    dtrain = xgb.DMatrix(train, y_train) 
    
    # run xgboost cross-validation and report rmse
    cv_output = xgb.cv(xgb_params, dtrain, num_boost_round=3000, early_stopping_rounds=100,obj = log_cosh_obj,
    verbose_eval=50,nfold=5, show_stdv=False, shuffle=True) 
    cv_output[['train-rmse-mean', 'test-rmse-mean']].plot() 
    
    bst = xgb.train(xgb_params, dtrain, num_boost_round=3000, early_stopping_rounds=None, obj=log_cosh_obj)
    
    dtest = xgb.DMatrix(test)
    xgb_preds = bst.predict(dtest)
    xgb_preds = scale_pred(xgb_preds)  # scale_pred is defined in the next block
    

    Clip the predictions to the range 1 to 3

    def scale_pred(pred):
        for i, el in enumerate(pred):
            if el > 3:
                pred[i] = 3
            elif el < 1:
                pred[i] = 1
        return pred
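
    For reference, the same clipping can be done in one line with numpy, which is equivalent for a float array:

    import numpy as np
    clipped = np.clip(xgb_preds, 1.0, 3.0)  # same effect as scale_pred(xgb_preds)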
    
    pd.DataFrame({'id':test_ids, 'relevance':xgb_preds}).to_csv(
        '/Users/tangqinglong/Desktop/Scikit-learn/Depot/submission.csv', index=False)
    

    keras+tf

    import keras
    from keras.models import Sequential
    from keras.layers import Dense, Dropout, Activation
    from keras.optimizers import SGD
    
    import numpy as np
    
    model = Sequential()
    # Dense(64) is a fully-connected layer with 64 hidden units.
    # In the first layer you must specify the expected input data shape:
    # here, vectors with one dimension per feature column (input_dim=L).
    L=len(X_train.columns)
    model.add(Dense(64, activation='relu', input_dim=L))
    model.add(Dropout(0.5))
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    
    sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(loss='logcosh',
                  optimizer=sgd,
                  metrics=['mse'])
    
    model.fit(train, y_train,
              epochs=1000,
              batch_size=256)
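
    The post stops at model.fit; an assumed follow-up (not shown in the original) would predict on the scaled test features and clip the result with the same scale_pred helper:

    # predict with the trained network on the scaled test set; the output has shape (n, 1)
    nn_preds = model.predict(test)[:, 0]
    nn_preds = scale_pred(nn_preds)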
    

    Model ensembling: use the predictions of the different models as features and fit another regression on top of them (a stacking sketch follows).
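
    A hypothetical stacking sketch, not from the original post: collect each base model's predictions as columns (ideally produced out-of-fold on the training set to avoid leakage) and fit a simple Ridge meta-regressor on them.

    import numpy as np
    from sklearn.linear_model import Ridge

    def stack(base_train_preds, base_test_preds, y_train):
        # base_train_preds: (n_train, n_models) out-of-fold predictions on the train set
        # base_test_preds:  (n_test,  n_models) predictions of the same models on the test set
        meta = Ridge(alpha=1.0)
        meta.fit(base_train_preds, y_train)
        return np.clip(meta.predict(base_test_preds), 1.0, 3.0)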
