
dssm's Introduction

A demo implementation of "Learning Deep Structured Semantic Models for Web Search using Clickthrough Data" and its follow-up paper

"A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems".

Note:

****2020/11/15****

The paper li2020sentence combines normalizing flows with BERT and works remarkably well on semantic similarity tasks; I will verify this next.

****2020/10/27****

Added a Siamese-BERT experiment that uses BERT as the underlying encoder; see the class SiamenseBert in siamese_network.py. Everything else is the same as below.

Compared with feeding both sentences into BERT directly, the twin-tower BERT has these advantages:

  • The max sequence length is shorter, so training needs less GPU memory and runs slightly faster, which is friendlier to people with modest hardware.
  • After training, BERT can be used as a sentence-embedding encoder: for online matching, the vectors of the candidate sentences can be precomputed, saving real-time compute (see the sketch after the commands below).
  • The accuracy on the test set is a bit over one point lower than feeding both sentences into BERT directly.
  • BERT can use the [CLS] output or the average context embedding; the latter usually works better.
# twin-tower (Siamese) BERT model
python train.py --mode=train --method=bert_siamese
# use BERT directly
python train.py --mode=train --method=bert
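
To make the second bullet above concrete, here is a minimal sketch of offline/online matching with a trained twin-tower encoder; the encode function and candidate list are hypothetical placeholders, not part of this repo:

import numpy as np

def match(query, candidates, encode):
    # Offline: precompute and L2-normalize the candidate vectors once.
    cand_vecs = encode(candidates)
    cand_vecs /= np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    # Online: only the query needs a forward pass at request time.
    q = encode([query])[0]
    q /= np.linalg.norm(q)
    scores = cand_vecs @ q  # cosine similarities against all candidates
    return candidates[int(np.argmax(scores))]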

Reference: reimers2019sentence

****2020/10/17****

The previous dataset caused convergence problems, so it has been replaced with LCQMC, a semantic-similarity dataset of colloquial Chinese descriptions. The model has also changed from multi-tower to twin-tower; see siamese_network.py. Training entry point: train.py.

Public search-click datasets are hard to find, so a semantic-similarity dataset is used for now, which changes the task a bit, haha. Test accuracy on this dataset is improving, but it is only in the seventies; reaching the paper's accuracy still requires hyperparameter tuning.

Training (uses the LCQMC dataset by default):

python train.py --mode=train

Prediction:

python train.py --mode=train --file=$predict_file$

Test file format: q1\tq2, for example:

今天天气怎么样	今天温度怎么样
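
A file in this format can be produced with a few lines of Python (the file name is hypothetical):

pairs = [("今天天气怎么样", "今天温度怎么样")]
with open("predict.txt", "w", encoding="utf-8") as f:
    for q1, q2 in pairs:
        f.write(q1 + "\t" + q2 + "\n")  # one tab-separated pair per line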

****2019/5/18****

The previous code used deprecated APIs; the latest code is in dssm_rnn.py.

The data-processing code data_input.py and the data in data have been updated; since an RNN is used, the input is no longer bag-of-words.

[figure: LSTM-DSSM architecture]

Source: Palangi, Hamid, et al. "Semantic modelling with long-short-term memory for information retrieval." arXiv preprint arXiv:1412.6629 (2014).

Training loss (it essentially stops decreasing after 45 epochs):

[figure: dssm_rnn training loss]

1. Data & Environment

DSSM's input is query pairs: a short query and its associated impressions, where clicked impressions are positive samples and unclicked ones are negative. Clicks are also weighted differently according to their order; see the paper for details.
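
The data itself cannot be released (see below), so here is only a hypothetical illustration of the structure such click-through data takes; the field names are invented:

# One training example: a query plus its impressed documents.
# 'clicked' separates positives from negatives; 'rank' is the click
# order, which the paper weights differently.
sample = {
    "query": "今天天气怎么样",
    "impressions": [
        {"doc": "今日天气预报", "clicked": True, "rank": 1},
        {"doc": "明天气温", "clicked": False, "rank": None},
    ],
}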

I am not authorized to release my query data, so please find your own. Environment:

  1. Windows, Python 3.5, TensorFlow 1.4.

2. Word Hashing

The original paper uses letter 3-grams; for Chinese I used uni-grams, since individual Chinese characters carry meaning by themselves (some papers even decompose characters into strokes). Each gram is replaced by a one-hot encoding, which greatly reduces the dimensionality of short sentences.
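
A minimal sketch of this idea; the helper names are mine, not the repo's:

from collections import Counter

def build_vocab(sentences, min_count=1):
    # Every Chinese character is one uni-gram token.
    counts = Counter(ch for s in sentences for ch in s)
    chars = [ch for ch, n in counts.items() if n >= min_count]
    return {ch: i for i, ch in enumerate(chars)}

def to_indices(sentence, vocab):
    # Sparse one-hot (bag-of-characters): just the indices of the
    # characters that occur in the sentence.
    return sorted({vocab[ch] for ch in sentence if ch in vocab})

vocab = build_vocab(["今天天气怎么样", "今天温度怎么样"])
print(to_indices("今天天气怎么样", vocab))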

3. Architecture

Architecture diagram:

[figure: DSSM architecture]

  1. Map the entries to low-dimensional vectors.
  2. Compute the cosine similarity between the query and the documents.

3.1 Input

TensorBoard visualization is used here, so a name_scope is defined:

with tf.name_scope('input'):
    # Sparse bag-of-grams inputs: the query, the clicked (positive) docs,
    # and the unclicked (negative) docs. The two doc placeholders need
    # distinct names (the original snippet reused 'DocBatch' for both).
    query_batch = tf.sparse_placeholder(tf.float32, shape=[None, TRIGRAM_D], name='QueryBatch')
    doc_positive_batch = tf.sparse_placeholder(tf.float32, shape=[None, TRIGRAM_D], name='DocPositiveBatch')
    doc_negative_batch = tf.sparse_placeholder(tf.float32, shape=[None, TRIGRAM_D], name='DocNegativeBatch')
    # Flag that switches batch normalization between training and inference.
    on_train = tf.placeholder(tf.bool)
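
A tf.sparse_placeholder is fed with a tf.SparseTensorValue (indices, values, dense_shape). A minimal sketch of wrapping a batch of bag-of-character index lists (as produced in section 2) into one; the helper name is mine:

def to_sparse_value(batch_rows, dim):
    # batch_rows: one list of active uni-gram indices per sentence.
    indices, values = [], []
    for row, cols in enumerate(batch_rows):
        for col in cols:
            indices.append([row, col])
            values.append(1.0)  # one-hot entry for each present gram
    return tf.SparseTensorValue(indices, values, [len(batch_rows), dim])

# e.g. feed_dict = {query_batch: to_sparse_value(rows, TRIGRAM_D), ...}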

3.2 Fully Connected Layers

I use three fully connected layers. Each layer is identical except for the number of neurons, so a reusable function can be written: $$ l_n = f(W_n l_{n-1} + b_n) $$

def add_layer(inputs, in_size, out_size, activation_function=None):
    # Xavier/Glorot-style uniform initialization, as prescribed in the paper.
    wlimit = np.sqrt(6.0 / (in_size + out_size))
    Weights = tf.Variable(tf.random_uniform([in_size, out_size], -wlimit, wlimit))
    biases = tf.Variable(tf.random_uniform([out_size], -wlimit, wlimit))
    # Note: for the sparse first-layer input, tf.sparse_tensor_dense_matmul
    # would be needed in place of tf.matmul.
    Wx_plus_b = tf.matmul(inputs, Weights) + biases
    if activation_function is None:
        outputs = Wx_plus_b
    else:
        outputs = activation_function(Wx_plus_b)
    return outputs

The weights and biases use the specific initialization scheme from the paper:

    wlimit = np.sqrt(6.0 / (in_size + out_size))
    Weights = tf.Variable(tf.random_uniform([in_size, out_size], -wlimit, wlimit))
    biases = tf.Variable(tf.random_uniform([out_size], -wlimit, wlimit))

Batch Normalization

def batch_normalization(x, phase_train, out_size):
    """
    Batch normalization on fully connected layers.
    Ref.: http://stackoverflow.com/questions/33949786/how-could-i-use-batch-normalization-in-tensorflow
    Args:
        x:           Tensor, [batch, out_size] input
        phase_train: boolean tf.Variable, true indicates training phase
        out_size:    integer, depth of input
    Return:
        normed:      batch-normalized tensor
    """
    with tf.variable_scope('bn'):
        beta = tf.Variable(tf.constant(0.0, shape=[out_size]),
                           name='beta', trainable=True)
        gamma = tf.Variable(tf.constant(1.0, shape=[out_size]),
                            name='gamma', trainable=True)
        batch_mean, batch_var = tf.nn.moments(x, [0], name='moments')
        ema = tf.train.ExponentialMovingAverage(decay=0.5)

        def mean_var_with_update():
            ema_apply_op = ema.apply([batch_mean, batch_var])
            with tf.control_dependencies([ema_apply_op]):
                return tf.identity(batch_mean), tf.identity(batch_var)

        mean, var = tf.cond(phase_train,
                            mean_var_with_update,
                            lambda: (ema.average(batch_mean), ema.average(batch_var)))
        normed = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3)
    return normed

A single layer:

with tf.name_scope('FC1'):
    # The activation comes after batch normalization, so no activation here.
    query_l1 = add_layer(query_batch, TRIGRAM_D, L1_N, activation_function=None)
    doc_positive_l1 = add_layer(doc_positive_batch, TRIGRAM_D, L1_N, activation_function=None)
    doc_negative_l1 = add_layer(doc_negative_batch, TRIGRAM_D, L1_N, activation_function=None)

with tf.name_scope('BN1'):
    query_l1 = batch_normalization(query_l1, on_train, L1_N)
    doc_l1 = batch_normalization(tf.concat([doc_positive_l1, doc_negative_l1], axis=0), on_train, L1_N)
    doc_positive_l1 = tf.slice(doc_l1, [0, 0], [query_BS, -1])
    doc_negative_l1 = tf.slice(doc_l1, [query_BS, 0], [-1, -1])
    query_l1_out = tf.nn.relu(query_l1)
    doc_positive_l1_out = tf.nn.relu(doc_positive_l1)
    doc_negative_l1_out = tf.nn.relu(doc_negative_l1)
...... (the second and third layers repeat the same pattern)

Merging the negative samples

with tf.name_scope('Merge_Negative_Doc'):
    # Merge the negative samples; tile can optionally expand them
    # ([1, 1] leaves the positives unchanged).
    doc_y = tf.tile(doc_positive_y, [1, 1])
    for i in range(NEG):
        for j in range(query_BS):
            # slice(input_, begin, size): take one negative row at a time.
            doc_y = tf.concat([doc_y, tf.slice(doc_negative_y, [j * NEG + i, 0], [1, -1])], 0)

3.3 Computing Cosine Similarity

with tf.name_scope('Cosine_Similarity'):
    # Cosine similarity
    # query_norm = sqrt(sum(each x^2))
    query_norm = tf.tile(tf.sqrt(tf.reduce_sum(tf.square(query_y), 1, True)), [NEG + 1, 1])
    # doc_norm = sqrt(sum(each x^2))
    doc_norm = tf.sqrt(tf.reduce_sum(tf.square(doc_y), 1, True))

    prod = tf.reduce_sum(tf.multiply(tf.tile(query_y, [NEG + 1, 1]), doc_y), 1, True)
    norm_prod = tf.multiply(query_norm, doc_norm)

    # cos_sim_raw = query * doc / (||query|| * ||doc||)
    cos_sim_raw = tf.truediv(prod, norm_prod)
    # gamma = 20: the softmax smoothing factor from the paper
    cos_sim = tf.transpose(tf.reshape(tf.transpose(cos_sim_raw), [NEG + 1, query_BS])) * 20
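
The tile/concat/reshape bookkeeping above is easy to get wrong, so here is a small numpy check (query_BS=2, NEG=2) mirroring the transpose-reshape-transpose: afterwards each row holds one query's scores with the positive document in column 0:

import numpy as np

query_BS, NEG = 2, 2
# cos_sim_raw follows doc_y's row order: all positives first, then,
# for each negative slot i, one negative per query.
labels = np.array(['q0*p0', 'q1*p1',     # positives
                   'q0*n00', 'q1*n10',   # first negative of each query
                   'q0*n01', 'q1*n11'])  # second negative of each query
grid = labels.reshape(1, -1).reshape(NEG + 1, query_BS).T
print(grid)
# [['q0*p0' 'q0*n00' 'q0*n01']
#  ['q1*p1' 'q1*n10' 'q1*n11']]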

3.4 Defining the Loss Function

with tf.name_scope('Loss'):
    # Train Loss
    # Convert to a softmax probability matrix.
    prob = tf.nn.softmax(cos_sim)
    # Take only the first column, i.e., the probability of the positive sample.
    hit_prob = tf.slice(prob, [0, 0], [-1, 1])
    loss = -tf.reduce_sum(tf.log(hit_prob))
    tf.summary.scalar('loss', loss)
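
This implements the DSSM softmax loss from the paper: with smoothing factor $\gamma$ (the factor 20 above), the posterior of the clicked document is

$$ P(D^+ \mid Q) = \frac{\exp(\gamma \cos(Q, D^+))}{\sum_{D' \in \mathbf{D}} \exp(\gamma \cos(Q, D'))}, \qquad L = -\log \prod_{(Q, D^+)} P(D^+ \mid Q) $$

where $\mathbf{D}$ contains the clicked document plus the NEG sampled unclicked ones; taking column 0 of prob picks out exactly $P(D^+ \mid Q)$.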

3.5 Choosing the Optimizer

with tf.name_scope('Training'):
    # Optimizer
    train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(loss)

3.6 Training

# Create a Saver object to selectively save variables or the whole model.
saver = tf.train.Saver()
# with tf.Session(config=config) as sess:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    train_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/train', sess.graph)
    start = time.time()
    for step in range(FLAGS.max_steps):
        batch_id = step % FLAGS.epoch_steps
        sess.run(train_step, feed_dict=feed_dict(True, True, batch_id % FLAGS.pack_size, 0.5))
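
The feed_dict helper used here is not shown in this post; purely as an illustration (argument meanings guessed from the call site, reusing the hypothetical to_sparse_value from section 3.1, remaining arguments ignored), it might look roughly like:

def feed_dict(on_training, train, batch_id, keep_prob):
    # Slice the batch_id-th batch out of the pre-vectorized corpus and
    # wrap the bag-of-grams rows as tf.SparseTensorValue objects.
    lo, hi = batch_id * query_BS, (batch_id + 1) * query_BS
    return {
        query_batch: to_sparse_value(query_rows[lo:hi], TRIGRAM_D),
        doc_positive_batch: to_sparse_value(pos_rows[lo:hi], TRIGRAM_D),
        doc_negative_batch: to_sparse_value(neg_rows[lo * NEG:hi * NEG], TRIGRAM_D),
        on_train: on_training,
    }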

Full code on GitHub: https://github.com/InsaneLife/dssm

The Multi-View DSSM implementation is analogous; see multi_view_dssm_v3 on GitHub.

Original CSDN post: http://blog.csdn.net/shine19930820/article/details/79042567

Automatic Hyperparameter Tuning

Search space: search_space.json. Config file: auto_ml.yml. Launch command:

nnictl create --config auto_ml.yml -p 8888

Since I have no GPU 😂, auto_ml.yml does not configure one; if you have a GPU, configure it yourself.
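
For reference, NNI search spaces use a {"_type": ..., "_value": ...} JSON convention; a hypothetical example (the parameter names are illustrative, not necessarily those in search_space.json), generated from Python to keep one language here:

import json

search_space = {
    "learning_rate": {"_type": "loguniform", "_value": [1e-5, 1e-2]},
    "batch_size": {"_type": "choice", "_value": [64, 128, 256]},
}
with open("search_space.json", "w") as f:
    json.dump(search_space, f, indent=2)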

Full documentation: https://nni.readthedocs.io/zh/latest/Overview.html

Reference

  • Huang, Po-Sen, et al. "Learning Deep Structured Semantic Models for Web Search using Clickthrough Data." CIKM 2013.
  • Elkahky, Ali, Yang Song, and Xiaodong He. "A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems." WWW 2015.
  • Palangi, Hamid, et al. "Semantic modelling with long-short-term memory for information retrieval." arXiv preprint arXiv:1412.6629 (2014).
  • reimers2019sentence: Reimers, Nils, and Iryna Gurevych. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP 2019.
  • li2020sentence: Li, Bohan, et al. "On the Sentence Embeddings from Pre-trained Language Models." EMNLP 2020.


dssm's Issues

Training environment

Has anyone trained on the full dataset? Roughly how much memory does it take? And with a GPU?

data_input

import data_input
ModuleNotFoundError: No module named 'data_input'
Where is data_input?


Data format

Hello, could you share the data format? I don't know how to construct the training data.

Low test accuracy

Using the "siamese_bert" model on an 800k-sample company dataset with a 1 (positive) : 4 (negative) ratio, the cosine-similarity ranking looks completely unreliable; frustrating.
auc: 0.64
accuracy: 0.75

Dataset format problem

Can you tell me the dataset format or show a screenshot?
In the following, you use data_sets.query_test_data, data_sets.doc_test_positive, and data_sets.doc_test_negative, so I don't quite understand the format.
Thanks!

Why does the dssm loss use reduce_sum?

The loss-computation code in dssm.py:

with tf.name_scope('Loss'):
    # Train Loss
    cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=doc_label_batch, logits=cos_sim)
    losses = tf.reduce_sum(cross_entropy)
    tf.summary.scalar('loss', losses)
    pass

Is this a problem? Why reduce_sum instead of reduce_mean?

Why is my train loss nan?

Hi,
When I run dssm_rnn.py, the training loss is always nan, no matter how I change the learning rate.
I printed the model's variables, and the embedding variable in word_embeddings_layer is the first to become nan.
How can I deal with this? Thanks!

How to predict

You need to train the model first, then run prediction. Training entry point: train.py.
Training (uses the LCQMC dataset by default):

python train.py --mode=train

Prediction:

python train.py --mode=train --file=$predict_file$

Test file format: q1\tq2, for example:

今天天气怎么样	今天温度怎么样

Originally posted by @InsaneLife in #25 (comment)

Does the loss definition only involve the positive samples?

Hello, in the loss-definition code, the cosine is first computed between the query and both the positive and negative out_embeddings; after the softmax, only the positive-sample probability is used. Why not also negate the negative-sample probabilities and add them in?

With your loss definition, couldn't the negative-sample input be dropped entirely?

Data format

Hello, could you explain the training-sample format (how the positive and negative samples are organized; is the input one query with 1 positive and 4 negative samples)? Also, did you use only uni-grams for Chinese feature extraction? Could you leave an email or other contact info? Thanks. (By the way, I'm also in Chengdu, haha.)
