A Collection of Recent DeepMind Papers

Author: Xiaohu Zhu (朱小虎) | Published 2016-05-31 12:31

    Neil Zhu, Jianshu ID Not_GOD, founder & Chief Scientist of University AI, dedicated to advancing the adoption of artificial intelligence worldwide. He formulates and carries out UAI's medium- and long-term growth strategy and goals, leading the team to grow rapidly into a highly professional force in the field of artificial intelligence.
    As an industry leader, he and UAI founded TASA (one of China's earliest AI societies) in 2014, along with the DL Center (a global value network for deep learning knowledge) and AI Growth (an industry think-tank and training program), supplying a great deal of talent to China's AI community. He has also taken part in or organized various international AI summits and events with considerable influence, written some 600,000 words of high-quality technical content on AI, and produced the translation of the first introductory deep learning book, "Neural Networks and Deep Learning", whose content has been widely republished and serialized by professional media and vertical accounts. He has been invited by top domestic universities to design AI study plans and teach courses on the frontiers of AI, which were well received by students and faculty.

    Continuous Deep Q-Learning with Model-based Acceleration

    http://arxiv.org/pdf/1603.00748v1.pdf

    Abstract: Model-free reinforcement learning has been successfully applied to a range of
    challenging problems, and has recently been extended to handle large neural network
    policies and value functions. However, the sample complexity of model-free algorithms,
    particularly when using high-dimensional function approximators, tends to limit their
    applicability to physical systems. In this paper, we explore algorithms and
    representations to reduce the sample complexity of deep reinforcement learning for
    continuous control tasks. We propose two complementary techniques for improving the
    efficiency of such algorithms. First, we derive a continuous variant of the Q-learning
    algorithm, which we call normalized advantage functions (NAF), as an alternative to the
    more commonly used policy gradient and actor-critic methods. The NAF representation
    allows us to apply Q-learning with experience replay to continuous tasks, and
    substantially improves performance on a set of simulated robotic control tasks. To
    further improve the efficiency of our approach, we explore the use of learned models for
    accelerating model-free reinforcement learning. We show that iteratively refitted local
    linear models are especially effective for this, and demonstrate substantially faster
    learning on domains where such models are applicable.
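The heart of NAF is a Q-function split into a state value and a quadratic advantage term, so the greedy action is available in closed form as mu(x). Below is a minimal NumPy sketch of just that decomposition; the hand-picked V, mu and L values stand in for the outputs of the paper's network heads and are purely illustrative.

```python
import numpy as np

def naf_q_value(x_features, u, V, mu, L):
    """Normalized advantage function: Q(x,u) = V(x) + A(x,u), with
    A(x,u) = -0.5 * (u - mu(x))^T P(x) (u - mu(x)) and P(x) = L(x) L(x)^T
    positive semi-definite, so argmax_u Q(x,u) = mu(x)."""
    P = L @ L.T
    d = u - mu
    advantage = -0.5 * d @ P @ d
    return V + advantage

# Toy example: hand-picked outputs standing in for the network heads.
x = np.array([0.1, -0.3])        # state features (unused by the toy heads)
V = 1.5                          # V(x): state value
mu = np.array([0.2, 0.0])        # mu(x): greedy action
L = np.array([[1.0, 0.0],        # L(x): Cholesky factor of P(x)
              [0.3, 0.8]])

print(naf_q_value(x, mu, V, mu, L))                     # 1.5, maximized at u = mu(x)
print(naf_q_value(x, np.array([1.0, -1.0]), V, mu, L))  # strictly lower
```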

Learning functions across many orders of magnitude

    http://arxiv.org/pdf/1602.07714v1.pdf

    Abstract: Learning non-linear functions can be hard when the magnitude of the target
    function is unknown beforehand, as most learning algorithms are not scale invariant. We
    propose an algorithm to adaptively normalize these targets. This is complementary to
    recent advances in input normalization. Importantly, the proposed method preserves the
    unnormalized outputs whenever the normalization is updated, to avoid instability caused
    by non-stationarity. It can be combined with any learning algorithm and any non-linear
    function approximation, including the important special case of deep learning. We
    empirically validate the method in supervised learning and reinforcement learning and
    apply it to learning how to play Atari 2600 games. Previous work on applying deep
    learning to this domain relied on clipping the rewards to make learning in different
    games more homogeneous, but this uses the domain-specific knowledge that in these games
    counting rewards is often almost as informative as summing them. Using our adaptive
    normalization we can remove this heuristic without diminishing overall performance, and
    even improve performance on some games, such as Ms. Pac-Man and Centipede, on which
    previous methods did not perform well.
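The core trick is to keep a running scale and shift for the targets and, whenever they change, to rescale the final linear layer so that unnormalized predictions are unchanged. The NumPy sketch below illustrates that output-preserving rescale under simplifying assumptions (a scalar output, a plain linear head, exponential-moving-average statistics); it is not the paper's exact update rule.

```python
import numpy as np

class AdaptiveTargetNormalizer:
    """Keeps targets roughly zero-mean/unit-scale and compensates the linear
    head (w, b) so the unnormalized prediction sigma*(h.w + b) + mu is
    preserved whenever (mu, sigma) are updated."""
    def __init__(self, n_features, beta=0.01):
        self.mu, self.sigma, self.beta = 0.0, 1.0, beta
        self.w = np.zeros(n_features)
        self.b = 0.0

    def predict(self, h):
        # Unnormalized prediction from the normalized head output.
        return self.sigma * (h @ self.w + self.b) + self.mu

    def update_scale(self, target):
        old_mu, old_sigma = self.mu, self.sigma
        # Exponential moving estimates of the target mean and scale.
        self.mu = (1 - self.beta) * old_mu + self.beta * target
        var = (1 - self.beta) * old_sigma**2 + self.beta * (target - self.mu)**2
        self.sigma = max(np.sqrt(var), 1e-4)
        # Preserve outputs: rescale the linear head to undo the change.
        self.w *= old_sigma / self.sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / self.sigma

norm = AdaptiveTargetNormalizer(n_features=3)
h = np.array([0.5, -0.2, 0.1])
before = norm.predict(h)
norm.update_scale(target=100.0)   # a target far outside the current scale
after = norm.predict(h)
print(np.isclose(before, after))  # True: outputs preserved across the rescale
```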

    Deep Exploration via Bootstrapped DQN

    http://arxiv.org/pdf/1602.04621v1.pdf

    Abstract: Efficient exploration in complex environments remains a major challenge for
    reinforcement learning. We propose bootstrapped DQN, a simple algorithm that explores in
    a computationally and statistically efficient manner through use of randomized value
    functions. Unlike dithering strategies such as ε-greedy exploration, bootstrapped DQN
    carries out temporally-extended (or deep) exploration; this can lead to exponentially
    faster learning. We demonstrate these benefits in complex stochastic MDPs and in the
    large-scale Arcade Learning Environment. Bootstrapped DQN substantially improves
    learning times and performance across most Atari games.
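The exploration scheme can be sketched independently of the deep network: keep K value-function heads, sample one per episode and act greedily with it throughout, and train each head only on a random (bootstrap-masked) subset of transitions. Below is a toy tabular version with hypothetical class and parameter names, meant to illustrate the mechanism rather than the paper's full architecture.

```python
import numpy as np

class BootstrappedHeads:
    """Bootstrapped exploration sketched for a tabular case: K independent
    Q-'heads', one sampled per episode and followed greedily, which gives
    temporally-extended exploration instead of per-step dithering."""
    def __init__(self, n_states, n_actions, n_heads=10, mask_prob=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        self.q = self.rng.normal(scale=0.1, size=(n_heads, n_states, n_actions))
        self.mask_prob = mask_prob
        self.active_head = 0

    def start_episode(self):
        # Sample one head uniformly and commit to it for the whole episode.
        self.active_head = int(self.rng.integers(self.q.shape[0]))

    def act(self, state):
        return int(np.argmax(self.q[self.active_head, state]))

    def update(self, state, action, target, lr=0.1):
        # Bootstrap mask: each head sees this transition with prob. mask_prob.
        mask = self.rng.random(self.q.shape[0]) < self.mask_prob
        for k in np.flatnonzero(mask):
            self.q[k, state, action] += lr * (target - self.q[k, state, action])

agent = BootstrappedHeads(n_states=5, n_actions=2)
agent.start_episode()
a = agent.act(state=0)
agent.update(state=0, action=a, target=1.0)
```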

    One-shot Learning with Memory-Augmented Neural Networks

    https://arxiv.org/pdf/1605.06065v1.pdf

    Abstract: This is work on one-shot learning. Traditional approaches need large amounts of data to learn; when new data arrive, such models must inefficiently relearn their parameters to smoothly incorporate the new information. Architectures with augmented memory capacities, such as Neural Turing Machines (NTMs), offer the ability to quickly encode and retrieve new information, and hence can potentially sidestep the weaknesses of conventional models. This paper shows that memory-augmented neural networks can rapidly assimilate new data and use it to make accurate predictions after seeing only a few samples. It also introduces a new method for accessing external memory that focuses on memory content, unlike previous methods that additionally use memory-location-based mechanisms for addressing.
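The content-based addressing mentioned above amounts to comparing a query key against every memory row by cosine similarity and reading out a similarity-weighted mixture. Here is a small NumPy sketch of just that read step (the paper's full module also manages writes, e.g. to least-recently-used slots, which is omitted here):

```python
import numpy as np

def content_based_read(memory, key, temperature=1.0):
    """Content-based memory read: cosine similarity between the query key and
    each memory row, a softmax over the similarities, then a weighted sum of
    the rows. This is only the content-addressing part of a memory-augmented
    network; write logic is not shown."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    weights = np.exp(temperature * sims)
    weights /= weights.sum()
    return weights @ memory, weights

memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0]])
read_vec, w = content_based_read(memory, key=np.array([1.0, 0.1, 0.0]), temperature=5.0)
print(w.round(3), read_vec.round(3))   # most weight on the rows closest to the key
```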

    Deep Reinforcement Learning with Attention for Slate Markov Decision Processes with High-Dimensional States and Actions

    http://arxiv.org/pdf/1512.01124v2.pdf

    Abstract: Many real-world problems come with action spaces represented as feature vectors. Although high-dimensional control is a largely unsolved problem, there has recently been progress for modest dimensionalities. Here we report on a successful attempt at addressing problems of dimensionality as high as 2000, of a particular form. Motivated by important applications such as recommendation systems that do not fit the standard reinforcement learning frameworks, we introduce Slate Markov Decision Processes (slate-MDPs).

    A slate-MDP is an MDP with a combinatorial action space consisting of slates (tuples of primitive actions of an underlying MDP). The agent does not control which action from the slate is selected, and the selected action need not even come from the slate; in a recommendation system, for example, all of the recommendations can be ignored by the user. We use deep Q-learning, based on feature representations of states and actions, to learn the value of whole slates.

    Unlike existing methods, we optimize for both the combinatorial and sequential aspects of our tasks. The new agent's superiority over agents that ignore either the combinatorial or the sequential long-term value aspect is demonstrated on a range of environments with dynamics from a real-world recommendation system. Further, we use deep deterministic policy gradients to learn a policy that, for each position of the slate, guides attention towards the part of the action space in which the value is highest, and we only evaluate actions in this area. The attention is used within a sequentially greedy procedure leveraging submodularity. Finally, we show how introducing risk-seeking can dramatically improve the agent's performance and its ability to discover more far-reaching strategies.

    Increasing the Action Gap: New Operators for Reinforcement Learning

    http://arxiv.org/pdf/1512.04860v1.pdf

    Abstract: This paper introduces new optimality-preserving operators on Q-functions. We first describe an operator for tabular representations, the consistent Bellman operator, which incorporates a notion of local policy consistency. We show that this local consistency leads to an increase in the action gap at each state; increasing this gap, we argue, mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. This operator can also be applied to discretized continuous space and time problems, and we provide empirical results evidencing superior performance in this context. Extending the idea of a locally consistent operator, we then derive sufficient conditions for an operator to preserve optimality, leading to a family of operators which includes our consistent Bellman operator. As corollaries we provide a proof of optimality for Baird's advantage learning algorithm and derive other gap-increasing operators with interesting properties. We conclude with an empirical study on 60 Atari 2600 games illustrating the strong potential of these new operators.
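In the tabular case, the consistent Bellman operator differs from the standard operator only on transitions that return to the same state: there it bootstraps with Q(x, a) instead of max_b Q(x, b), which widens the action gap. Below is a small NumPy sketch of both operators on a toy MDP; the MDP and parameter values are made up purely for illustration.

```python
import numpy as np

def bellman(Q, R, P, gamma):
    """Standard Bellman optimality operator for a tabular MDP:
    (TQ)(x,a) = R[x,a] + gamma * sum_x' P[x,a,x'] * max_b Q(x',b)."""
    return R + gamma * P @ Q.max(axis=1)

def consistent_bellman(Q, R, P, gamma):
    """Consistent Bellman operator: like T, but when the next state equals the
    current state the bootstrap uses Q(x,a) instead of max_b Q(x,b), which
    increases the action gap at each state."""
    V = Q.max(axis=1)
    TQ = R + gamma * P @ V
    n_states, n_actions = Q.shape
    for x in range(n_states):
        for a in range(n_actions):
            TQ[x, a] -= gamma * P[x, a, x] * (V[x] - Q[x, a])
    return TQ

# Tiny 2-state, 2-action MDP where some actions tend to stay in place.
R = np.array([[0.1, 0.0], [1.0, 0.5]])
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.2, 0.8]]])
Q = np.zeros((2, 2))
for _ in range(200):
    Q = consistent_bellman(Q, R, P, gamma=0.95)
print(Q, Q.max(axis=1) - Q.min(axis=1))   # fixed point and per-state action gaps
```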

MuProp: Unbiased Backpropagation for Stochastic Neural Networks

    http://arxiv.org/pdf/1511.05176v2.pdf

    Abstract: Deep neural networks are powerful parametric models that can be trained
    efficiently using the backpropagation algorithm. Stochastic neural networks combine the
    power of large parametric functions with that of graphical models, which makes it
    possible to learn very complex distributions. However, as backpropagation is not
    directly applicable to stochastic networks that include discrete sampling operations
    within their computational graph, training such networks remains difficult. We present
    MuProp, an unbiased gradient estimator for stochastic networks, designed to make this
    task easier. MuProp improves on the likelihood-ratio estimator by reducing its variance
    using a control variate based on the first-order Taylor expansion of a mean-field
    network. Crucially, unlike prior attempts at using backpropagation for training
    stochastic networks, the resulting estimator is unbiased and well behaved. Our
    experiments on structured output prediction and discrete latent variable modeling
    demonstrate that MuProp yields consistently good performance across a range of
    difficult tasks.
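For a single Bernoulli unit, the idea can be written out directly: take the likelihood-ratio (score-function) estimator, subtract a control variate given by the first-order Taylor expansion of the objective around the mean-field value, and add its analytic gradient back in. The NumPy sketch below does this under that single-unit assumption and checks the estimate against the closed-form gradient; it is a simplification, not the paper's full network-level estimator.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def muprop_grad(theta, f, f_prime, rng, n_samples=100_000):
    """Per-sample estimates of d/dtheta E_{z~Bernoulli(sigmoid(theta))}[f(z)]:
    a likelihood-ratio estimator with a control variate built from the
    first-order Taylor expansion of f around the mean-field value z_bar."""
    p = sigmoid(theta)
    z = (rng.random(n_samples) < p).astype(float)
    score = z - p                                   # d log p(z; theta) / d theta
    z_bar = p                                       # mean-field forward pass
    baseline = f(z_bar) + f_prime(z_bar) * (z - z_bar)
    lr_term = (f(z) - baseline) * score             # variance-reduced LR term
    analytic_term = f_prime(z_bar) * p * (1 - p)    # gradient through the mean
    return lr_term + analytic_term

f = lambda z: (z - 0.3) ** 2
f_prime = lambda z: 2 * (z - 0.3)
theta = 0.4
rng = np.random.default_rng(0)

est = muprop_grad(theta, f, f_prime, rng)
p = sigmoid(theta)
exact = p * (1 - p) * (f(1.0) - f(0.0))             # closed-form gradient
print(est.mean(), exact)                            # close: the estimator is unbiased
```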

Policy Distillation

    http://arxiv.org/pdf/1511.06295.pdf
