Abstract
We consider the problem of budget (or other resource) allocation in sequential decision problems involving a large number of concurrently running sub-processes, whose only interaction is through their gradual consumption of budget (or the resource in question). We use the case of an advertiser interacting with a large population of target customers as a primary motivation. We consider two complementary problems that comprise our key contributions.
Our first contribution addresses the problem of computing MDP value functions as a function of the available budget. In contrast to standard constrained MDPs—which find optimal value functions given a fixed expected budget constraint—our aim is to assess the tradeoff between expected budget spent and the value attained when using that budget optimally. We show that optimal value functions are concave in budget. More importantly, in the finite-horizon case, we show there are a finite number of useful budget levels. This gives rise to piecewise-linear, concave value functions (piecewise-constant if we restrict to deterministic policies) with a representation that can be computed readily via dynamic programming. This representation also supports natural approximations. Our model not only allows the assessment of budget/value tradeoffs (e.g., to find the “sweet spot” in spend), but plays an important role in the allocation of budget across competing subprocesses.
Our second contribution is a method for constructing a policy that prescribes the joint policy to be taken across all sub-processes given the joint state of the system, subject to a global budget constraint. We cast the problem as a weakly coupled MDP in which budget is allocated online to each individual subprocess based on its observed (initial) state and the subprocess-specific value function. We show that the budget allocation process can be cast as a multi-item, multiple-choice knapsack problem (MCKP), which admits an efficient greedy algorithm to determine optimal allocations. We also discuss the possibility of online, per-stage re-allocation of budget to adaptively satisfy strict rather than expected budget constraints.
Introduction
Markov decision processes (MDPs) have found wide application throughout AI and provide firm foundations for decision-theoretic planning [10] and reinforcement learning [27]. In many domains, however, actions have costs or otherwise consume limited resources—in such cases, optimal policies must be constructed subject to resource constraints, a problem often best formulated using constrained MDPs (CMDPs) [2]. The complexity of these problems is exacerbated in domains where multiple independent or semi-independent MDPs must be controlled jointly, since the state and action spaces for the joint process typically consist of the cross-product of those from the individual subprocesses [32].
A key setting where both of these characteristics arise is online advertising. A number of models of online user engagement have been proposed that treat user behavior as a Markov process [15, 7]. Archak et al. [6], in particular, propose a constrained MDP model for the optimal allocation of an advertiser’s budget over an extended horizon that captures the sequential, cumulative, stochastic effect of multiple advertising touch points on a user’s behavior. Their model assumes that a fixed budget is predetermined for each user, and focuses on how to optimally advertise to a user, subject to that budget constraint. However, this formulation is not well-suited to advertisers who (a) have a global budget constraint (rather than a per-user constraint); and (b) are free to allocate that budget differentially and adaptively across users of different types, or users in different states of the underlying MDP. We instead model this as a joint optimization problem in which a joint policy across all users must be constructed. In particular, we consider the optimal, global, dynamic allocation of budget across all users in the target market, removing the need to pre-specify a per-user budget.
Using this problem as motivation, we consider a natural decomposition of this type of resource allocation problem as a weakly coupled MDP [32]. In this model, each “local” MDP (corresponding to a customer or customer type) is solved separately, and the solutions composed into a joint policy. Within this setting, our first contribution relates to the solution of the local MDPs themselves. One standard approach to dealing with resource constraints, such as budget limits, is CMDPs [2], in which actions consume one or more resources (e.g., budget, power) and optimal policies are sought that use no more than some pre-defined level of each resource in expectation. CMDPs have been applied in online advertising (see above), robotics, and other domains [12, 19, 18]. While CMDPs are valuable models, one weakness is the required a priori specification of a fixed “budget” or limit on each resource—in our example, this would be, say, a daily monetary budget cap on ad spend for each customer or across a specific ad network [7, 6, 4]. However, in many cases, determining a suitable budget depends on the tradeoff between ad spend and expected value. In our motivating example, in particular, the budget to be allocated to customers of a specific type, or in a specific MDP state, must be traded off against that allocated to other types or states.
We propose a model called budgeted MDPs (BMDPs) in which, like CMDPs, actions consume resources and can only be taken when the required resources are available. Unlike CMDPs, we do not place an a priori limit on resources, but instead solve the problem for all possible resource levels, allowing one to explore the tradeoff between (optimal) expected value and allocated resources. Note that we could formulate this problem by embedding the available resource levels in the state space, creating a “hybrid state” with one continuous dimension per resource. However, typically one chooses a budget (or an appropriate resource level) when initiating a policy—in our approach, after solving the budgeted MDP to assess value/budget tradeoffs. As a consequence, we instead treat the budget as a separate parameter of the value function (VF) and policy, and analyze the structure of optimal VFs w.r.t. budget. Specifically, for finite state and action BMDPs, we show that for any fixed state s: (a) the optimal VF V(s, b) is concave in available budget b; and (b) the optimal VF for any finite horizon is piecewise-linear and concave (PWLC) in budget, and is defined by a finite set of useful resource levels (the VF is piecewise constant if policies are deterministic). We present a dynamic programming (DP) algorithm to compute this PWLC VF representation, one that readily supports approximation. This approach is well-suited not only to multi-task problems, like our ad domain, but more generally to any constrained MDP where the appropriate level of resource to allocate depends on the value obtainable (e.g., the “sweet spot” in spend) and may not be easily determined a priori. Using the VF to inform the selection of an appropriate budget, one automatically determines the optimal policy for that (expected) spend level—as if one had solved the corresponding CMDP for that budget.
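To fix intuition before the formal development, consider a sketch of the finite-horizon backup under expected-budget semantics (the notation r, c, and the successor budgets b_{s'} is ours, for illustration; Sec. 4 gives the precise formulation):

\[
V_t(s, b) \;=\; \max_{a \in A,\; \{b_{s'} \ge 0\}} \Big[\, r(s,a) + \sum_{s'} P(s,a,s')\, V_{t-1}(s', b_{s'}) \,\Big]
\quad \text{s.t.} \quad c(s,a) + \sum_{s'} P(s,a,s')\, b_{s'} \;\le\; b .
\]

Because each V_{t-1}(s', ·) has finitely many budget breakpoints, only finitely many successor budget levels b_{s'} are ever useful, which is what lets the PWLC structure propagate through the backup (with randomization over actions supplying the concavification).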
Our second contribution is a method for piecing together the solutions of the individual BMDPs to determine a joint policy, subject to a global resource/budget constraint. Since the MDP is weakly coupled [32]—specifically, the individual customer MDPs evolve independently, linked only through the consumption of shared budget—our primary aim is to determine an appropriate allocation of budget to each customer, which in turn dictates the optimal policy for that customer. We show that, given the form of BMDP optimal value functions, the budget allocation problem can be formulated as a multiple-choice knapsack problem (MCKP) [40], for which a straightforward greedy optimization can be used to construct the optimal online allocation of budget. We also discuss circumstances in which the dynamic, per-stage reallocation of budget may be valuable—in such cases, the offline solution of the underlying BMDPs and the greedy nature of the online allocation algorithm admit fast, real-time response.
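To illustrate the greedy MCKP scheme, here is a hypothetical Python sketch (not the paper's exact algorithm): it assumes each sub-process i supplies the breakpoints of its concave, piecewise-linear value curve at its observed initial state, decomposes the curves into incremental segments, and spends the global budget on segments in decreasing order of marginal value.

def greedy_budget_allocation(curves, global_budget):
    # curves[i]: list of (budget, value) breakpoints, sorted by budget and
    # starting at budget 0, of a concave piecewise-linear value function.
    alloc = [0.0] * len(curves)
    segments = []  # (slope, width, sub-process index)
    for i, curve in enumerate(curves):
        for (b0, v0), (b1, v1) in zip(curve, curve[1:]):
            if b1 > b0:
                segments.append(((v1 - v0) / (b1 - b0), b1 - b0, i))
    # Concavity ensures each curve's own segments already appear in
    # decreasing-slope order, so a global sort never skips a curve prefix.
    segments.sort(key=lambda seg: seg[0], reverse=True)
    remaining = global_budget
    for slope, width, i in segments:
        if remaining <= 0.0 or slope <= 0.0:
            break  # budget exhausted, or no further marginal value
        spend = min(width, remaining)
        alloc[i] += spend
        remaining -= spend
    return alloc

A partially consumed final segment corresponds, under expected-budget semantics, to randomizing between the policies at that segment's endpoints, which resolves the usual knapsack integrality gap.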
The paper is organized as follows. We discuss related work on ad budget optimization and constrained and weakly coupled MDPs in Sec. 2. We outline our basic constrained weakly coupled MDP model in Sec. 3. In Sec. 4, we introduce budgeted MDPs (BMDPs) as a model for the local (e.g., single-customer) MDPs that make up the weakly coupled model. Unlike CMDPs, the aim is to determine VFs and policies as a function of available budget. In this section we describe the PWLC structure of optimal VFs for BMDPs; introduce a DP algorithm for computing VFs and policies; describe methods for approximation and provide an analysis of approximation error; describe how to compute the variance in expected spend induced by the optimal policy; and provide empirical results on both synthetic MDPs and MDPs derived from the data of a large advertiser. In Sec. 5, we describe how to exploit BMDP solutions to allocate a global budget to multiple independent local MDPs, linked only by their budget consumption. We formulate the problem as a multiple-choice knapsack problem (MCKP) and describe a simple greedy algorithm for optimal budget allocation. We show that this way of exploiting BMDP solutions provides much better results than adopting a static or fixed per-user budget. We also propose a dynamic budget reallocation method that reallocates budget as each CMDP process evolves. This reduces variance in the global spend, ensures guaranteed budget constraint satisfaction rather than expected satisfaction and, in settings with tightly constrained budgets, can offer better expected value through budget redistribution.
Background and Related Work
Budget Optimization in Online Advertising
Optimal allocation of advertising budgets is addressed at length in the marketing literature. Traditional concerns include how to allocate budget to different segments of the target market [41], and exploiting the different stochastic responses predicted from such segments [26]. Budget allocation is often modeled as the control of a continuous dynamical system [21], where audience response (e.g., purchases) to control actions (e.g., ad spending) is used to justify policies such as pulsing. User behavior is also sometimes modeled as a discrete Markov process (reflecting consumer demand, ad response, ad carryover effects, etc.) [12, 34].
A variety of mechanisms to support online advertising have been studied (and successfully deployed), including auctions for sponsored search [44, 20]. In this context, advertiser budget optimization is critical to maximizing impact and ROI in keyword auctions, leading to a variety of proposals for optimal bidding strategies under various models and constraints, including budget constraints [8, 23, 33, 24, 17, 28]. Even more expressive methods (e.g., for entire ad campaigns) have been proposed for both search and display advertising in online settings [11, 29].
By contrast, relatively little work has considered budget optimization in an online setting that is sensitive to user behavior as it extends over time. Charikar et al. [15] assume the web surfing behavior of different user types can be modeled as a Markov process and construct policies to optimally allocate ad spend at different states (e.g., web sites) to maximize return (e.g., conversions), possibly as a function of user history. Archak et al. [6] tackle a related problem, offering a somewhat different model, but still assuming Markovian user behavior (here in a sponsored search setting). They propose a constrained MDP model that can be used to determine the optimal ad spend for a given user, assuming an a priori fixed budget for that user. It is this model we adopt for our approach below, though we do not fix a per-user budget. Interestingly, the same authors [7] analyze user behavior empirically, demonstrating that users in a search context exhibit what might be called a “general to specific” search behavior that is approximately Markovian, and suggest that myopic click-through optimization will fail to optimize spend (hence motivating their MDP approach [6]). Amin et al. [4] develop an MDP model of the budget optimization process itself, formulating the problem facing the online advertiser—who is uncertain about the winning bid distribution—as one of learning under censored observations. Unlike the models above (and unlike our work), this work does not model Markovian user behavior.
Li et al. [30] also consider Markov models in behavioral and contextual advertising. More recent work has considered the application of reinforcement learning methods to determine appropriate advertising and offer strategies to large populations of customers, learning optimal policies from data generated by online interactions with users [38, 43]. These methods are similar in their underlying user modeling assumptions, but attempt to learn policies from data and do not account for budget constraints or tradeoffs, whereas we assume an underlying transition model is available and optimize policies relative to these budget tradeoffs.
Constrained MDPs
An important extension of the standard MDP model is offered by constrained MDPs (CMDPs) [2], in which actions consume one or more resources (e.g., monetary budget, power) and optimal policies are sought that use no more than some pre-defined level of each resource in expectation. CMDPs have been applied in online advertising, mobile robotics, and other domains within AI and OR [12, 19, 18], and techniques that exploit the structure of CMDPs for effective solution have received some attention [2, 19, 18]. We briefly define MDPs and the basic (one-dimensional) constrained extension.
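For concreteness, the standard occupancy-measure linear program for a discounted CMDP with a single budget resource takes the following form (this is the textbook formulation of [2], with initial distribution β, discount factor γ, per-action cost c, and budget cap B, not notation introduced by this paper):

\[
\max_{x \ge 0} \; \sum_{s,a} x(s,a)\, r(s,a)
\quad \text{s.t.} \quad
\sum_{a} x(s',a) = \beta(s') + \gamma \sum_{s,a} x(s,a)\, P(s,a,s') \;\; \forall s',
\qquad
\sum_{s,a} x(s,a)\, c(s,a) \le B .
\]

An optimal stationary policy is recovered as π(a|s) ∝ x(s,a); it is generally randomized, since deterministic policies can be suboptimal under an expected-resource constraint.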
Weakly Coupled MDPs
The decomposition of MDPs into independent or semi-independent processes can often be used to mitigate the curse of dimensionality by reducing the size of the MDP state and action space. Challenges lie in discovering a suitable decomposition structure and in determining how best to use the sub-process solutions to construct a (generally approximate) global solution.
There have been a number of approaches developed that have this flavor, in both standard MDPs and decentralized (DEC) MDPs and POMDPs [39, 1, 42, 31, 19, 45, 3, 35, 14]. The approach most related to ours is the decomposition method for weakly-coupled MDPs of Meuleau et al. [32]. In their model, a joint fully observable MDP comprises a number of almost completely independent subprocesses, each itself a fully observable MDP (a local MDP). Each MDP reflects the task or objectives of a specific agent, but the local policies themselves require resources, both consumable (e.g., fuel or money, that, once used in one local policy, are unavailable for future use in that or any other local MDP), and non-consumable (e.g., equipment, that can be reused multiple times and swapped across local MDPs, but can only be used in one local MDP at a time). The method of Meuleau et al. [32] solves the local MDPs independently to produce local value functions and policies that are parameterized by the resources available. These local value functions then inform a simple greedy algorithm that assigns resources to each local MDP. Finally, unconsumed resources at each stage of the process are reassigned depending on the updated observed state of the local MDPs.
Our approach to budget decomposition is similar, but has some key differences. We use the more standard expected budget constraint, rather than a guaranteed budget constraint, to determine the optimal policy—this requires the development of new DP techniques (as opposed to the use of standard value or policy iteration [32]). We also show that our resource (budget) allocation algorithm is optimal. Our dynamic budget reallocation scheme is derived from the reallocation mechanism of Meuleau et al. [32].
MDPs for Large-scale Budget Allocation
We begin by outlining a simple MDP model for engagement with users from a large, heterogeneous population, using a sequence of potentially costly actions. We use direct-response advertising as our main motivating example, though our techniques apply to control of any large, distributed collection of fully observable MDPs where actions consume limited resources. We note that our model abstracts away a number of factors that arise in realistic ad domains (e.g., partial or intermittent observability of user responses, hidden user state, coarse-grained control and lagging effects, incentives and game-theoretic interactions) in order to focus on the essence of budget allocation.
We assume an advertiser has a fixed budget with which to influence a target population of users through the use of various advertising actions. Each action taken by the advertiser has a specific cost and is targeted to a specific user (e.g., a textual or graphical ad in response to a search query, an in-app or web page graphical/video ad, direct mail to a household).
Users respond to advertising actions stochastically, and in a way that may depend not only on user features (e.g., demographics), but also on past actions to which they have been subjected (e.g., previous ad exposures or solicitations) and past responses (e.g., click through or purchase behavior). Following Archak et al. [6], we model this situation as an MDP in the following fashion. We assume a finite set of users i ≤ M. Generally speaking, users may be segmented into types that reflect static, observable characteristics that influence their response to advertising. For ease of exposition, we assume all users have the same type; but the extension to multiple types is straightforward, and aspects of the model requiring modification for multiple types will be indicated as they arise.
We assume a finite set S of user states j ≤ N. At any time, user i is in some state s[i] ∈ S. Let S = S^M be the joint state space, with the joint state of all users denoted s = ⟨s[1],...,s[M]⟩ ∈ S. Intuitively, the user state captures all relevant user characteristics and history that could alter her response to an advertising action. S could be relatively small or may be quite large in certain contexts (e.g., the most recent keyword search on which the advertiser bid, or some sufficient statistics summarizing historical interactions and user responses). A finite set A of advertising actions is available: for ease of presentation, we assume that any a ∈ A can be applied to a user in any state (subject to cost constraints, discussed below). Accommodating state-dependent constraints on the feasible action set is straightforward (as in any MDP model). At each stage, the advertiser selects an action a[i] to apply to user i. Letting A = A^M, a joint action is a = ⟨a[1],...,a[M]⟩ ∈ A.
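A quick calculation (ours, purely for scale) shows why the joint model cannot be attacked directly:

\[ |\mathbf{S}| = N^{M}, \qquad |\mathbf{A}| = |A|^{M}, \]

so even a modest N = 10 user states and M = 20 users yield 10^{20} joint states, motivating the weakly coupled decomposition pursued below.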
The stochastic response of a user to an action is captured by the transition model P : S × A → ∆(S), with P(s, a, t) denoting the probability that a user in state s moves to state t when subjected to action a. The reward model R(s, a) reflects the costs and payoffs when an advertiser applies action a to a user in state s. We decompose reward as R(s, a) = U(s) − C(s, a): cost model C reflects action costs (e.g., cost of placing an ad, or potential annoyance factor and associated lost revenue); and utility function U reflects benefits/payoffs (e.g., revenue from sales, value of brand exposure, etc.).
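In code, each local (per-user) model reduces to a transition tensor plus the two reward components. A minimal Python sketch (hypothetical names, NumPy-based) matching the definitions above, and reused by the DP illustration in the next section:

import numpy as np
from dataclasses import dataclass

@dataclass
class LocalMDP:
    """One user's (or user type's) MDP with N states and K ad actions."""
    P: np.ndarray  # shape (N, K, N); P[s, a, t] = Pr(move to t | state s, action a)
    U: np.ndarray  # shape (N,); utility U(s) of a user being in state s
    C: np.ndarray  # shape (N, K); cost C(s, a) of applying action a in state s

    def reward(self, s: int, a: int) -> float:
        # R(s, a) = U(s) - C(s, a), exactly as decomposed above.
        return float(self.U[s] - self.C[s, a])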
Solving Budgeted MDPs
We first formulate budgeted MDPs, a variant of CMDPs in which budgets (or other resources) are modeled so that decisions can be made readily about the level of budget to provide. The budget is implicitly treated as a component of the state, and value functions and policies are constructed that vary with both the state and available budget. We first outline the model, and then develop key intuitions regarding VF structure by considering deterministic policies. We move on to discuss stochastic policies—where randomization is much more natural than in the CMDP model—and show that these give rise to value functions that have computationally useful structural properties. This structure can be exploited in a simple dynamic programming (DP) algorithm, and gives rise to natural approximations. We conclude the section with some computational experiments to show the efficacy of our DP method.
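As a concrete, simplified illustration of the DP, the sketch below computes finite-horizon values over a discretized budget grid under strict, per-trajectory budget consumption. This is an approximation for intuition only: the algorithm developed in this section enforces the budget in expectation and computes the exact, finite set of useful budget levels. The sketch assumes costs fall on the grid and that every state has at least one zero-cost (no-op) action, so some action is always affordable.

import numpy as np

def bmdp_value_grid(P, U, C, budgets, horizon):
    # P: (N, K, N) transitions; U: (N,) utilities; C: (N, K) costs.
    # budgets: increasing 1-D array of budget levels with budgets[0] == 0.
    # Returns V with V[s, j] = value of state s given budget budgets[j].
    N, K, _ = P.shape
    V = np.zeros((N, len(budgets)))  # terminal values
    for _ in range(horizon):
        V_next = np.full_like(V, -np.inf)
        for s in range(N):
            for j, b in enumerate(budgets):
                for a in range(K):
                    if C[s, a] > b:
                        continue  # action unaffordable at this budget level
                    # largest grid index not exceeding the remaining budget
                    jr = np.searchsorted(budgets, b - C[s, a], side="right") - 1
                    q = U[s] - C[s, a] + P[s, a] @ V[:, jr]
                    V_next[s, j] = max(V_next[s, j], q)
        V = V_next
    return V

With deterministic action choices, each row V[s, :] is a nondecreasing step function of budget, mirroring the piecewise-constant structure established in this section; randomizing over actions concavifies it into the PWLC form.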