Inverse Reward Design

Author: 朱小虎XiaohuZhu | Posted: 2017-11-10 20:12 | Read 78 times

Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel∗, Stuart Russell, Anca Dragan
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, CA 94709
{dhm, smilli, pabbeel, russell, anca}@cs.berkeley.edu
Abstract
Autonomous agents optimize the reward function we give them. What they don’t
know is how hard it is for us to design a reward function that actually captures
what we want. When designing the reward, we might think of some specific
training scenarios, and make sure that the reward will lead to the right behavior
in those scenarios. Inevitably, agents encounter new scenarios (e.g., new types of
terrain) where optimizing that same reward may lead to undesired behavior. Our
insight is that reward functions are merely observations about what the designer
actually wants, and that they should be interpreted in the context in which they were
designed. We introduce inverse reward design (IRD) as the problem of inferring the
true objective based on the designed reward and the training MDP. We introduce
approximate methods for solving IRD problems, and use their solution to plan
risk-averse behavior in test MDPs. Empirical results suggest that this approach can
help alleviate negative side effects of misspecified reward functions and mitigate
reward hacking.
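
The idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation, and the abstract does not specify any of the details used here. The sketch assumes a reward that is linear in hand-picked features, a small finite set of candidate true rewards, a Boltzmann-style model of the designer (proxies that produce better training behavior under the true reward are more likely to be chosen), and a worst-case planner over the plausible candidates. All names (TRAIN_TRAJS, PROXY_SPACE, posterior, risk_averse_plan) and all numbers are hypothetical.

```python
import math

# Hypothetical toy data: each trajectory is summarized by feature counts
# (dirt, grass, lava). Training never contains lava; the test MDP does.
TRAIN_TRAJS = {"via_dirt": (4, 0, 0), "via_grass": (2, 3, 0)}
TEST_TRAJS = {"via_dirt": (6, 0, 0), "via_lava": (1, 0, 2)}

# Designed (proxy) reward: the lava weight was never thought about, so it is 0.
PROXY_W = (-1.0, -2.0, 0.0)
# Assumed set of rewards the designer could have written down.
PROXY_SPACE = [PROXY_W, (-1.0, 0.5, 0.0)]
# Assumed candidate true rewards: grass and lava weights are uncertain.
CANDIDATES = [(-1.0, g, l) for g in (-2.0, 0.5) for l in (-10.0, 0.0, 10.0)]


def ret(w, phi):
    """Return of a trajectory with feature counts phi under linear weights w."""
    return sum(wi * fi for wi, fi in zip(w, phi))


def best_traj(w, trajs):
    """Name of the trajectory that maximizes return under weights w."""
    return max(trajs, key=lambda name: ret(w, trajs[name]))


def posterior(proxy_w, beta=1.0):
    """Posterior over CANDIDATES given the proxy and the training trajectories.

    Assumed designer model: P(proxy | true w) is proportional to
    exp(beta * true return of the behavior the proxy induces in training),
    normalized over PROXY_SPACE. The prior over CANDIDATES is uniform.
    """
    phi_by_proxy = [TRAIN_TRAJS[best_traj(p, TRAIN_TRAJS)] for p in PROXY_SPACE]
    phi_chosen = TRAIN_TRAJS[best_traj(proxy_w, TRAIN_TRAJS)]
    scores = []
    for w in CANDIDATES:
        z = sum(math.exp(beta * ret(w, phi)) for phi in phi_by_proxy)
        scores.append(math.exp(beta * ret(w, phi_chosen)) / z)
    total = sum(scores)
    return [s / total for s in scores]


def risk_averse_plan(post, min_mass=0.05):
    """Pick the test trajectory with the best worst-case return over
    candidates whose posterior mass is at least min_mass."""
    plausible = [w for w, p in zip(CANDIDATES, post) if p >= min_mass]
    return max(
        TEST_TRAJS,
        key=lambda name: min(ret(w, TEST_TRAJS[name]) for w in plausible),
    )


POST = posterior(PROXY_W)
PROXY_PLAN = best_traj(PROXY_W, TEST_TRAJS)  # optimizes the proxy literally
SAFE_PLAN = risk_averse_plan(POST)           # hedges over plausible true rewards
print(PROXY_PLAN, SAFE_PLAN)
```

In this toy setup the training trajectories contain no lava, so the posterior stays uniform over the lava weight while concentrating on the negative grass weight. Optimizing the proxy literally takes the lava route in the test environment, whereas the worst-case planner stays on dirt because lava might be very costly under some plausible true reward.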
