20190805 Work Progress


Author: Songger | Published 2019-08-05 12:08

Last Friday's work:

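  1. Used the elbow method to look for the best number of clusters for the top 10,000 (top 1w) queries. On this data the slope of the SSE curve barely changes; my reading is that the data is too noisy to yield a clear optimal cluster count. Between 1,000 and 1,500 clusters, the slope of the SSE curve changes most at 1,370 clusters, so I ran a test with that value, but the job never managed to obtain resources.
  2. Used the most basic dssm_v0 algorithm to relate queries to titles. The program runs end to end and training is complete; testing happens today.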
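
The elbow check in item 1 can be sketched in Python as below. This is only an illustration under assumptions: the function name `elbow_k` and the SSE numbers are made up, and "largest slope change" is read here as the largest difference between the slopes on either side of a candidate k. It is not the code used in the actual experiment.

```python
def elbow_k(ks, sse):
    """Return (k, bend) for the interior point where the SSE slope changes most."""
    if len(ks) != len(sse) or len(ks) < 3:
        raise ValueError("need at least three (k, sse) points of equal length")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Slope of the SSE curve on each side of ks[i] (SSE change per added cluster).
        left = (sse[i] - sse[i - 1]) / (ks[i] - ks[i - 1])
        right = (sse[i + 1] - sse[i]) / (ks[i + 1] - ks[i])
        bend = right - left  # large and positive when the curve flattens after ks[i]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k, best_bend


# Made-up SSE values, for illustration only.
ks = [1, 2, 3, 4, 5, 6]
sse = [1000.0, 600.0, 300.0, 250.0, 220.0, 200.0]
k, bend = elbow_k(ks, sse)
print(k, bend)
```

When every `bend` value is small relative to the slopes themselves, the curve has no clear elbow, which matches the situation described for the top 1w queries.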

Today's plan:
Run inference tests on the trained dssm_v0 using the top 10,000 queries and the UGC titles.

  1. Weighted embedding for queries

Get the word segmentation results and the weight of each word
create table graph_embedding.hs_dssm_dic_query_2 as
select graph_embedding:hs_split_1(id, pair, "|") as (id, word, weight) from
(select bi_udf:bi_split_value(id, tag_result, "%") as (id, pair) from
(select id, search_kg:alinlp_termweight_ecom(se_keyword, "%", "{word}|{weight}", 1, 1) as tag_result from graph_embedding.hs_dssm_dic_query_1 where lengthb(se_keyword) > 0)a)b where lengthb(b.pair) > 0;

Compute an embedding for each segmented word
create table hs_dssm_dic_query_3 as select id, word, weight, search_kg:alinlp_word_embedding(word, "100", "CONTENT_SEARCH") as word_emb from hs_dssm_dic_query_2;

Take the weighted average of the word embeddings to get one vector per query
create table graph_embedding.hs_tmp_160 as select id, graph_embedding:hs_merge_emb_14(weight, word_emb) as emb from graph_embedding.hs_dssm_dic_query_3 group by id;
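
The three statements above can be mirrored in a small Python sketch to make the data flow explicit. Several things here are assumptions rather than facts about the UDFs: the `word|weight%word|weight` string layout is inferred from the separator and format arguments passed to `alinlp_termweight_ecom`, the helper names and the toy embedding table are hypothetical, and `hs_merge_emb_14` is assumed to compute a weighted mean normalized by the sum of the weights.

```python
def parse_term_weights(tag_result, pair_sep="%", kv_sep="|"):
    """Split 'word|weight%word|weight' into a list of (word, weight) tuples."""
    pairs = []
    for pair in tag_result.split(pair_sep):
        if not pair:
            continue  # mirrors the `lengthb(b.pair) > 0` filter
        word, weight = pair.rsplit(kv_sep, 1)
        pairs.append((word, float(weight)))
    return pairs


def weighted_mean(pairs, embeddings):
    """Weighted average of word vectors; words without a vector are skipped."""
    acc, total = None, 0.0
    for word, weight in pairs:
        vec = embeddings.get(word)
        if vec is None:
            continue
        if acc is None:
            acc = [0.0] * len(vec)
        for j, x in enumerate(vec):
            acc[j] += weight * x
        total += weight
    if acc is None or total == 0.0:
        return None  # nothing to average
    return [x / total for x in acc]


# Toy 2-dimensional embeddings, for illustration only (the SQL asks for 100 dimensions).
toy_emb = {"red": [1.0, 0.0], "dress": [0.0, 1.0]}
pairs = parse_term_weights("red|0.25%dress|0.75")
query_vec = weighted_mean(pairs, toy_emb)
print(query_vec)
```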

  2. Get the weighted-average embeddings for inference

train_query : hs_dssm_dic_query_1 - | id | words_mainse_ids | se_keyword |
train_title : hs_dssm_dic_title_3 - | id | words_mainse_ids | title |


inference_query : hs_dssm_dic_query_inf_1 - | id | words_mainse_ids | query |
inference_title : hs_dssm_dic_title_inf_1 - | id | words_mainse_ids | title |


hs_dssm_dic_query_inf_1 -> hs_tmp_162

create table graph_embedding.hs_dssm_dic_query_inf_2 as
select graph_embedding:hs_split_1(id, pair, "|") as (id, word, weight) from
(select bi_udf:bi_split_value(id, tag_result, "%") as (id, pair) from
(select id, search_kg:alinlp_termweight_ecom(query, "%", "{word}|{weight}", 1, 1) as tag_result from graph_embedding.hs_dssm_dic_query_inf_1 where lengthb(query) > 0)a)b where lengthb(b.pair) > 0;

create table hs_dssm_dic_query_inf_3 as select id, word, weight, search_kg:alinlp_word_embedding(word, "100", "CONTENT_SEARCH") as word_emb from hs_dssm_dic_query_inf_2;

create table graph_embedding.hs_tmp_162 as select id, graph_embedding:hs_merge_emb_14(weight, word_emb) as emb from graph_embedding.hs_dssm_dic_query_inf_3 group by id;

hs_dssm_dic_title_inf_1 -> hs_tmp_164

create table graph_embedding.hs_dssm_dic_title_inf_2 as
select graph_embedding:hs_split_1(id, pair, "|") as (id, word, weight) from
(select bi_udf:bi_split_value(id, tag_result, "%") as (id, pair) from
(select id, search_kg:alinlp_termweight_ecom(title, "%", "{word}|{weight}", 1, 1) as tag_result from graph_embedding.hs_dssm_dic_title_inf_1 where lengthb(title) > 0)a)b where lengthb(b.pair) > 0;

create table hs_dssm_dic_title_inf_3 as select id, word, weight, search_kg:alinlp_word_embedding(word, "100", "CONTENT_SEARCH") as word_emb from hs_dssm_dic_title_inf_2;

create table graph_embedding.hs_tmp_164 as select id, graph_embedding:hs_merge_emb_14(weight, word_emb) as emb from graph_embedding.hs_dssm_dic_title_inf_3 group by id;

Build the inference data

create table graph_embedding.hs_tmp_156 as
select c.query_id, c.title_id, c.query, d.emb as title from
(select a.*, b.emb as query from (select * from graph_embedding.hs_tmp_157)a left join (select * from graph_embedding.hs_tmp_162)b on a.query_id == b.id)c left join (select * from graph_embedding.hs_tmp_164)d on c.title_id == d.id;
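
The effect of the two left joins can be illustrated with a short Python sketch. It assumes that hs_tmp_157 holds (query_id, title_id) pairs and that ids are unique in both embedding tables; the function name and the sample data are hypothetical. The point it shows is that a pair whose id has no embedding still appears in the output, with an empty value in that column.

```python
def left_join_embeddings(id_pairs, query_emb, title_emb):
    """Attach embeddings to (query_id, title_id) pairs, keeping unmatched pairs."""
    rows = []
    for query_id, title_id in id_pairs:
        rows.append({
            "query_id": query_id,
            "title_id": title_id,
            "query": query_emb.get(query_id),  # None plays the role of SQL NULL
            "title": title_emb.get(title_id),
        })
    return rows


# Toy data, for illustration only.
id_pairs = [(1, 10), (2, 11)]
query_emb = {1: [0.1, 0.9], 2: [0.4, 0.6]}
title_emb = {10: [0.2, 0.8]}  # title 11 has no embedding

rows = left_join_embeddings(id_pairs, query_emb, title_emb)
# Rows that are safe to feed to the model: both vectors present.
complete = [r for r in rows if r["query"] is not None and r["title"] is not None]
print(len(rows), len(complete))
```

If rows with a missing vector would break inference, filtering them out as in `complete`, or using an inner join in the SQL, is one option to consider.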
