A network selection optimization algorithm based on the Markov decision process (MDP) is proposed so that mobile terminals can always connect to the best wireless network in a heterogeneous network environment. Considering the different types of service requirements, the MDP model and its reward function are constructed based on the quality of service (QoS) attribute parameters of the mobile users, and the network attribute weights are calculated by using the analytic hierarchy process (AHP). The network handoff decision condition is designed according to the different types of user services and the time-varying characteristics of the network, and the MDP model is solved by combining the genetic algorithm and simulated annealing (GA-SA), so that users can seamlessly switch to the network with the best long-term expected reward value. Simulation results show that the proposed algorithm converges well and ensures that users with different service types obtain satisfactory expected total reward values with few network handoffs.
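As an illustration of the AHP weighting step mentioned in the abstract above, the following minimal sketch derives attribute weights from a pairwise comparison matrix. It uses the row geometric mean method, a common approximation of the principal eigenvector, which may differ from what the paper does. The function name `ahp_weights`, the three attributes, and the matrix values are hypothetical and do not come from the paper.

```python
import math

def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric mean method."""
    n = len(pairwise)
    # Geometric mean of each row of the pairwise comparison matrix.
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    # Normalise so that the weights sum to 1.
    return [g / total for g in geo_means]

# Hypothetical comparison of three network attributes (illustrative values only).
# pairwise[i][j] states how much more important attribute i is than attribute j.
pairwise = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = ahp_weights(pairwise)
```

For a perfectly consistent matrix such as this one, the geometric mean method and the eigenvector method give the same weights; for inconsistent judgments they can differ slightly.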
Optimal policies in Markov decision problems may be quite sensitive to transition probabilities. In practice, some transition probabilities may be uncertain. The goals of the present study are to find the robust range for a given optimal policy and to obtain value intervals of the exact transition probabilities. Our research contributes to the study of Markov decision processes (MDPs) with uncertain transition probabilities. We first propose a method for estimating unknown transition probabilities based on maximum likelihood. Since the estimates may be far from accurate, and the highest expected total reward of the MDP may be sensitive to these transition probabilities, we analyze the robustness of an optimal policy and propose an approach for robust analysis. After defining a robust optimal policy for uncertain transition probabilities represented as sets of numbers, we formulate a model to obtain the optimal policy. Finally, we define the value intervals of the exact transition probabilities and construct models to determine their lower and upper bounds. Numerical examples are given to show the practicability of our methods.
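The maximum likelihood step mentioned in the abstract above can be sketched as follows, assuming the data are observed (state, action, next state) triples; the paper's actual estimator and data format are not given here. Under that assumption, the maximum likelihood estimate of each transition probability is the relative frequency of the observed successor. The name `estimate_transitions` and the sample data are hypothetical.

```python
from collections import defaultdict

def estimate_transitions(observations):
    """Maximum likelihood estimate of P(s' | s, a) from (s, a, s') samples."""
    counts = defaultdict(lambda: defaultdict(int))
    for state, action, next_state in observations:
        counts[(state, action)][next_state] += 1
    estimates = {}
    for key, successors in counts.items():
        total = sum(successors.values())
        # Relative frequency of each successor state for this (state, action) pair.
        estimates[key] = {s2: c / total for s2, c in successors.items()}
    return estimates

# Hypothetical observations (illustrative only).
observations = [
    ("s0", "a", "s0"),
    ("s0", "a", "s1"),
    ("s0", "a", "s1"),
    ("s1", "a", "s0"),
]
p_hat = estimate_transitions(observations)
```

With few samples per (state, action) pair these estimates can be far from the true probabilities, which is the motivation the abstract gives for the robustness analysis.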
This paper studies the limit average variance criterion for continuous-time Markov decision processes in Polish spaces. Using two approaches, it proves not only the existence of solutions to the variance minimization optimality equation and of a canonical variance minimal policy, but also the existence of solutions to the two variance minimization optimality inequalities and of a variance minimal policy that may not be canonical. An example is given to illustrate all of our conditions.
In recent years, ride-on-demand (RoD) services such as Uber and Didi have become increasingly popular. Unlike traditional taxi services, RoD services adopt dynamic pricing mechanisms to balance supply and demand on the road, and such mechanisms improve service capacity and quality. Seeking route recommendation has been widely studied for taxi services. In RoD services, the dynamic price is a new and accurate indicator of the supply and demand condition, but it has rarely been studied as a clue for drivers seeking passengers. In this paper, we propose to incorporate the impact of dynamic prices as a key factor in recommending seeking routes to drivers. We first show the importance of and need for doing so by analyzing real service data. We then design a Markov decision process (MDP) model based on passenger order and car GPS trajectory datasets, and take dynamic prices into account in designing rewards. Results show that our model not only guides drivers to locations with higher prices, but also significantly improves driver revenue. Compared with the drivers' yield before using the model, the maximum improvement in yield after using it reaches 28%.
In this paper, we study the distributionally robust joint chance-constrained Markov decision process. Utilizing the logarithmic transformation technique, we derive its deterministic reformulation with bi-convex terms under the moment-based uncertainty set. To cope with the non-convexity and improve the robustness of the solution, we propose a dynamical neural network approach to solve the reformulated optimization problem. Numerical results on a machine replacement problem demonstrate the efficiency of the proposed dynamical neural network approach when compared with the sequential convex approximation approach.
As one of the major contributions of biology to competitive decision making, evolutionary game theory provides a useful tool for studying the evolution of cooperation. To achieve the optimal solution for unmanned aerial vehicles (UAVs) that are carrying out a sensing task, this paper presents a Markov decision evolutionary game (MDEG) based learning algorithm. Each individual in the algorithm follows a Markov decision strategy to maximize its payoff against the well-known Tit-for-Tat strategy. Simulation results demonstrate that the MDEG-based approach effectively improves the collective payoff of the UAV group. The proposed algorithm can obtain not only the best action sequence but also a sub-optimal Markov policy that is independent of the game duration. Furthermore, the paper studies the emergence of cooperation in the evolution of self-regarding UAVs. The results show that the emergence of cooperation should be attributed to the adaptive ability of the MDEG-based approach together with the balance between revenge and forgiveness in the Tit-for-Tat strategy.
In shield tunneling, the control system needs a very reliable deviation rectification capability to ensure that the tunnel trajectory meets the permissible criterion. To this end, we present an approach that adopts Markov decision process (MDP) theory to plan the driving force with an explicit representation of the uncertainty during excavation. The possible shield attitudes and the driving forces during excavation are discretized into a state set and an action set, respectively. In particular, an evaluation function is proposed that takes into account the stability of the driving force and the deviation of the shield attitude. Unlike in the deterministic approach, the driving forces in the MDP model have uncertain effects, and the attitude is known only with an imprecise probability. We consider the case in which the transition probability varies in a given domain estimated from field data, and discuss the optimal policy based on interval arithmetic. The validity of the approach is discussed by comparing the planned driving force with the actual operating data from the field records of Line 9 in Tianjin. The comparison indicates that the MDP model is reasonable enough to predict the driving force for automatic deviation rectification.
The Markov decision process (MDP) offers a general framework for modelling sequential decision making where outcomes are random. In particular, it serves as a mathematical framework for reinforcement learning. This paper introduces an extension of MDP, namely quantum MDP (qMDP), that can serve as a mathematical model of decision making about quantum systems. We develop dynamic programming algorithms for policy evaluation and for finding optimal policies for qMDPs in the finite-horizon case. The results obtained in this paper provide some useful mathematical tools for reinforcement learning techniques applied to the quantum world.
This paper focuses on the constrained optimality problem (COP) of first passage discrete-time Markov decision processes (DTMDPs) in denumerable state and compact Borel action spaces with multiple constraints, state-dependent discount factors, and possibly unbounded costs. By means of the properties of a so-called occupation measure of a policy, we show that the constrained optimality problem is equivalent to an (infinite-dimensional) linear program on the set of occupation measures with some constraints, and thus prove the existence of an optimal policy under suitable conditions. Furthermore, using the equivalence between the constrained optimality problem and the linear program, we obtain an exact form of an optimal policy for the case of finite states and actions. Finally, as an example, a controlled queueing system is given to illustrate our results.
This paper is concerned with the convergence of a sequence of discrete-time Markov decision processes (DTMDPs) with constraints, state-action dependent discount factors, and possibly unbounded costs. Using the convex analytic approach under mild conditions, we prove that the optimal values and optimal policies of the original DTMDPs converge to those of the "limit" one. Furthermore, we show that any countable-state DTMDP can be approximated by a sequence of finite-state DTMDPs, which are constructed using the truncation technique. Finally, we illustrate the approximation by solving a controlled queueing system numerically, and give the corresponding error bound of the approximation.
Markov decision processes (MDPs) have been studied by mathematicians, probabilists, operations researchers and engineers since the late 1950s. In an MDP, a stochastic, dynamic system is controlled by a 'policy' selected by a decision-maker/controller, with the goal of maximizing an overall reward function that is an appropriately defined aggregate of immediate rewards, over either a finite or an infinite time horizon. As such, MDPs are a useful paradigm for modeling many processes occurring naturally in management and engineering contexts.
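To make the definition above concrete, here is a minimal value iteration sketch for a finite MDP under the discounted infinite-horizon criterion, one common choice of "aggregate of immediate rewards". The two-state example, the discount factor, and the names `value_iteration`, `transitions`, and `rewards` are illustrative assumptions, not taken from the abstract.

```python
def value_iteration(transitions, rewards, gamma=0.9, tol=1e-10):
    """Discounted value iteration.

    transitions[s][a] is a list of (probability, next_state) pairs;
    rewards[s][a] is the immediate reward for taking action a in state s.
    """
    def q_value(s, a, values):
        # Immediate reward plus discounted expected value of the successor.
        return rewards[s][a] + gamma * sum(p * values[s2] for p, s2 in transitions[s][a])

    values = {s: 0.0 for s in transitions}
    while True:
        new_values = {s: max(q_value(s, a, values) for a in transitions[s])
                      for s in transitions}
        delta = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(transitions[s], key=lambda a: q_value(s, a, values))
              for s in transitions}
    return values, policy

# Hypothetical two-state MDP (illustrative only).
transitions = {
    0: {"stay": [(1.0, 0)], "go": [(1.0, 1)]},
    1: {"stay": [(1.0, 1)]},
}
rewards = {
    0: {"stay": 0.0, "go": 1.0},
    1: {"stay": 2.0},
}
values, policy = value_iteration(transitions, rewards)
```

In this example the policy moves from state 0 to state 1, where it collects a reward of 2 per step, so the value of state 1 is 2 / (1 - 0.9) = 20 and the value of state 0 is 1 + 0.9 × 20 = 19.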
Markov decision processes (MDPs) and their variants are widely studied in the theory of controls for stochastic discrete-event systems driven by Markov chains. Much of the literature focuses on the risk-neutral criterion, in which the expected rewards, either average or discounted, are maximized. There exists some literature on MDPs that takes risks into account. Much of this addresses the exponential utility (EU) function and mechanisms to penalize different forms of variance of the rewards. EU functions have some numerical deficiencies, while variance measures variability both above and below the mean rewards; the variability above the mean rewards is usually beneficial and should not be penalized or avoided. As such, risk metrics that account for pre-specified targets (thresholds) for rewards have been considered in the literature, where the goal is to penalize the risk of revenues falling below those targets. Existing work on MDPs that takes targets into account seeks to minimize risks of this nature. Minimizing risks can lead to poor solutions where the risk is zero or near zero, but the average rewards are also rather low. In this paper, we therefore study a risk-averse criterion, in particular the so-called downside risk, which equals the probability of the revenues falling below a given target; in contrast to minimizing such risks, we only reduce this risk at the cost of slightly lowered average rewards. A solution where the risk is low and the average reward is quite high, although not at its maximum attainable value, is very attractive in practice. To be more specific, in our formulation, the objective function is the expected value of the rewards minus a scalar times the downside risk. In this setting, we analyze the infinite horizon MDP, the finite horizon MDP, and the infinite horizon semi-MDP (SMDP). We develop dynamic programming and reinforcement learning algorithms for the finite and infinite horizon. The algorithms are tested in numerical studies and show encouraging performance.
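The objective described in the abstract above, expected reward minus a scalar times the downside risk, can be estimated from sampled returns as in the sketch below. This only evaluates the criterion for a given set of samples; it is not the paper's dynamic programming or reinforcement learning algorithm. The name `risk_adjusted_objective`, the symbol `theta` for the scalar, and the sample values are hypothetical.

```python
def risk_adjusted_objective(returns, target, theta):
    """Sample estimate of E[reward] - theta * P(reward < target)."""
    n = len(returns)
    mean_reward = sum(returns) / n
    # Downside risk: fraction of sampled returns that fall below the target.
    downside_risk = sum(1 for r in returns if r < target) / n
    return mean_reward - theta * downside_risk

# Hypothetical sampled returns of one policy (illustrative only).
sample_returns = [10.0, 4.0, 6.0, 2.0]
score = risk_adjusted_objective(sample_returns, target=5.0, theta=2.0)
```

Here the mean return is 5.5 and two of the four samples fall below the target of 5, so the downside risk is 0.5 and the score is 5.5 - 2 × 0.5 = 4.5. A larger `theta` trades more average reward for a lower probability of missing the target.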
This paper is an attempt to study the minimization problem of the risk probability of piecewise deterministic Markov decision processes (PDMDPs) with unbounded transition rates and Borel spaces. Different from the expected discounted and average criteria in the existing literature, we consider the risk probability that the total rewards produced by a system do not exceed a prescribed goal during a first passage time to some target set, and aim to find a policy that minimizes the risk probability over the class of all history-dependent policies. Under suitable conditions, we derive the optimality equation (OE) for the probability criterion, prove that the value function of the minimization problem is the unique solution to the OE, and establish the existence of ε (≥ 0)-optimal policies. Finally, we provide two examples to illustrate our results.
This paper is concerned with continuous-time Markov decision processes (MDPs) having weak and strong interactions. Using a hierarchical approach, the state space of the underlying Markov chain can be decomposed into several groups of recurrent states and a group of transient states, resulting in a singularly perturbed MDP formulation. Instead of solving the original problem directly, a limit problem that is much simpler to handle is derived. On the basis of the optimal solution of the limit problem, nearly optimal decisions are constructed for the original problem. The asymptotic optimality of the constructed control is obtained, and the rate of convergence is ascertained.
This paper studies the strong n (n = −1, 0)-discount and finite horizon criteria for continuous-time Markov decision processes in Polish spaces. The corresponding transition rates are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. Under mild conditions, the authors prove the existence of strong n (n = −1, 0)-discount optimal stationary policies by developing two equivalence relations: one is between the standard expected average reward and strong −1-discount optimality, and the other is between the bias and strong 0-discount optimality. The authors also prove the existence of an optimal policy for a finite horizon control problem by developing an interesting characterization of a canonical triplet.
In this paper we study the average sample-path cost (ASPC) problem for continuous-time Markov decision processes in Polish spaces. To the best of our knowledge, this paper is a first attempt to study the ASPC criterion on continuous-time MDPs with Polish state and action spaces. The corresponding transition rates are allowed to be unbounded, and the cost rates may have neither upper nor lower bounds. Under some mild hypotheses, we prove the existence of ε (> 0)-ASPC optimal stationary policies based on two different approaches: one is the "optimality equation" approach and the other is the "two optimality inequalities" approach.
This paper studies denumerable continuous-time Markov decision processes with expected total reward criteria. The authors first study the unconstrained model with possibly unbounded transition rates, and give suitable conditions on the controlled system's primitive data under which they show the existence of a solution to the total reward optimality equation and of an optimal stationary policy. Then, the authors impose a constraint on an expected total cost, and consider the associated constrained model. Based on the results for the unconstrained model and using the Lagrange multipliers approach, the authors prove the existence of constrained-optimal policies under some additional conditions. Finally, the authors apply the results to controlled queueing systems.
We study Markov decision processes under the average-value-at-risk criterion. The state space and the action space are Borel spaces, the costs are allowed to be unbounded from above, and the discount factors are state-action dependent. Under suitable conditions, we establish the existence of optimal deterministic stationary policies. Furthermore, we apply our main results to a cash-balance model.
This paper deals with Markov decision processes with a target set for nonpositive rewards. Two types of threshold probability criteria are discussed. The first criterion is the probability that the total reward is not greater than a given initial threshold value, and the second is the probability that the total reward is less than it. Our first (resp. second) optimization problem is to minimize the first (resp. second) threshold probability. In these problems, the threshold value is a permissible level of the total reward for reaching a goal (the target set); that is, we would like to reach this set with a total reward above this level, if possible. For both problems, we 1) show that the optimal threshold probability is the unique solution to an optimality equation, 2) show that there exists an optimal deterministic stationary policy, and 3) give a value iteration and a policy space iteration. In addition, we prove that the first (resp. second) optimal threshold probability is a monotone increasing and right (resp. left) continuous function of the initial threshold value, and propose a method to obtain an optimal policy and the optimal threshold probability in the first problem by using those of the second problem.
Self-adaptive systems are able to adjust their behaviour in response to changes in environmental conditions and are widely deployed as Internetware. Considered a promising way to handle the ever-growing complexity of software systems, they have attracted increasing interest and cover a variety of applications, e.g., autonomous car systems and adaptive network systems. Many approaches for the construction of self-adaptive systems have been developed, and probabilistic models, such as Markov decision processes (MDPs), are among the favoured ones. However, the majority of them do not deal with the problems of the underlying MDP being obsolete under new environments or failing to satisfy the given properties. As a result, the policies generated from such an MDP fail to guide the self-adaptive system to run correctly and meet its goals. In this article, we propose a systematic approach to updating an obsolete MDP by exploring new states and transitions and removing obsolete ones, and to repairing an unsatisfactory MDP by adjusting its structure in a more meaningful way rather than arbitrarily changing the transition probabilities to values not in line with reality. Experimental results show that the MDPs updated and repaired by our approach guide the self-adaptive systems to run correctly more effectively than the original ones.
Funding (MDP-based network selection): partially supported by the National Science Foundation of China (61661025, 61661026) and the Foundation of A Hundred Youth Talents Training Program of Lanzhou Jiaotong University (152022).
Funding (MDPs with uncertain transition probabilities): supported by the National Natural Science Foundation of China (71571019).
Funding (limit average variance criterion): supported by the National Natural Science Foundation of China (10801056) and the Natural Science Foundation of Ningbo (2010A610094).
Funding (distributionally robust joint chance-constrained MDP): supported by the National Natural Science Foundation of China (Grant Nos. 11991023 and 12371324) and the National Key R&D Program of China (Grant No. 2022YFA1004000).
Funding (Markov decision evolutionary game for UAVs): supported by the National Natural Science Foundation of China (Grant Nos. 61425008, 61333004 and 61273054), the Top-Notch Young Talents Program of China, and the Aeronautical Foundation of China (Grant No. 20135851042).
Funding (driving force planning in shield tunneling): supported by the National Basic Research Program (973 Program) of China (Grant No. 2007CB714000).
Funding (quantum MDP): partly supported by the National Key R&D Program of China (No. 2018YFA0306701), the Australian Research Council (Nos. DP160101652 and DP180100691), the National Natural Science Foundation of China (No. 61832015), and the Key Research Program of Frontier Sciences, Chinese Academy of Sciences.
Funding (constrained optimality of first passage DTMDPs): supported in part by the National Natural Science Foundation of China (Grant Nos. 61374067, 41271076).
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 61374067 and 41271076).
Abstract: This paper is concerned with the convergence of a sequence of discrete-time Markov decision processes (DTMDPs) with constraints, state-action dependent discount factors, and possibly unbounded costs. Using the convex analytic approach under mild conditions, we prove that the optimal values and optimal policies of the original DTMDPs converge to those of the "limit" one. Furthermore, we show that any countable-state DTMDP can be approximated by a sequence of finite-state DTMDPs, which are constructed using the truncation technique. Finally, we illustrate the approximation by solving a controlled queueing system numerically, and give the corresponding error bound of the approximation.
Abstract: Markov decision processes (MDPs) have been studied by mathematicians, probabilists, operations researchers and engineers since the late 1950s. In an MDP, a stochastic, dynamic system is controlled by a "policy" selected by a decision-maker/controller, with the goal of maximizing an overall reward function that is an appropriately defined aggregate of immediate rewards, over either a finite or an infinite time horizon. As such, MDPs are a useful paradigm for modeling many processes occurring naturally in management and engineering contexts.
Abstract: Markov decision processes (MDPs) and their variants are widely studied in the theory of control for stochastic discrete-event systems driven by Markov chains. Much of the literature focusses on the risk-neutral criterion, in which the expected rewards, either average or discounted, are maximized. There exists some literature on MDPs that takes risk into account. Much of this addresses the exponential utility (EU) function and mechanisms to penalize different forms of variance of the rewards. EU functions have some numerical deficiencies, while variance measures variability both above and below the mean reward; variability above the mean reward is usually beneficial and should not be penalized or avoided. As such, risk metrics that account for pre-specified targets (thresholds) for rewards have been considered in the literature, where the goal is to penalize the risk of revenues falling below those targets. Existing work on MDPs that takes targets into account seeks to minimize risks of this nature. Minimizing risks can lead to poor solutions where the risk is zero or near zero, but the average rewards are also rather low. In this paper, we therefore study a risk-averse criterion, in particular the so-called downside risk, which equals the probability of the revenues falling below a given target; in contrast to minimizing such risks, we only reduce this risk at the cost of slightly lowered average rewards. A solution where the risk is low and the average reward is quite high, although not at its maximum attainable value, is very attractive in practice. More specifically, in our formulation the objective function is the expected value of the rewards minus a scalar times the downside risk. In this setting, we analyze the infinite horizon MDP, the finite horizon MDP, and the infinite horizon semi-MDP (SMDP). We develop dynamic programming and reinforcement learning algorithms for the finite and infinite horizon. The algorithms are tested in numerical studies and show encouraging performance.
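The objective described above (expected reward minus a scalar times the downside risk) can be illustrated on sampled returns. The sketch below is only an empirical estimate of that objective, not the paper's dynamic programming or reinforcement learning algorithms. The function names, the weight `theta`, the use of a strict inequality for "falling below", and the two sample sets are assumptions made for illustration.

```python
def downside_risk(returns, target):
    """Empirical probability that a sampled return falls strictly below the target."""
    return sum(1 for r in returns if r < target) / len(returns)


def risk_adjusted_score(returns, target, theta):
    """Mean return minus theta times the empirical downside risk."""
    mean_reward = sum(returns) / len(returns)
    return mean_reward - theta * downside_risk(returns, target)


# Hypothetical sampled returns for two policies.
policy_a = [10.0, 10.0, 10.0, 10.0]   # steady, lower mean
policy_b = [20.0, 20.0, 20.0, 0.0]    # higher mean, occasionally misses the target

score_a = risk_adjusted_score(policy_a, target=5.0, theta=30.0)
score_b = risk_adjusted_score(policy_b, target=5.0, theta=30.0)
```

With `theta = 0` the criterion reduces to the risk-neutral mean and the second policy scores higher; with a large enough `theta` the ranking flips in favour of the steadier policy, which is the trade-off the abstract describes.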
Funding: Supported by the National Natural Science Foundation of China (Nos. 11931018, 11961005), the Guangdong Province Key Laboratory of Computational Science at the Sun Yat-sen University (No. 2020B1212060032), and the Natural Science Foundation of Guangxi Province (No. 2020GXNSFAA297196).
Abstract: This paper is an attempt to study the minimization problem of the risk probability of piecewise deterministic Markov decision processes (PDMDPs) with unbounded transition rates and Borel spaces. Different from the expected discounted and average criteria in the existing literature, we consider the risk probability that the total rewards produced by a system do not exceed a prescribed goal during a first passage time to some target set, and aim to find a policy that minimizes the risk probability over the class of all history-dependent policies. Under suitable conditions, we derive the optimality equation (OE) for the probability criterion, prove that the value function of the minimization problem is the unique solution to the OE, and establish the existence of ε(≥0)-optimal policies. Finally, we provide two examples to illustrate our results.
Funding: The research of this author is supported in part by the Office of Naval Research Grant N00014-96-1-0263. The research of this a
Abstract: This paper is concerned with continuous-time Markov decision processes (MDPs) having weak and strong interactions. Using a hierarchical approach, the state space of the underlying Markov chain can be decomposed into several groups of recurrent states and a group of transient states, resulting in a singularly perturbed MDP formulation. Instead of solving the original problem directly, a limit problem that is much simpler to handle is derived. On the basis of the optimal solution of the limit problem, nearly optimal decisions are constructed for the original problem. The asymptotic optimality of the constructed control is obtained, and the rate of convergence is ascertained.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 61374080 and 61374067, the Natural Science Foundation of Zhejiang Province under Grant No. LY12F03010, the Natural Science Foundation of Ningbo under Grant No. 2012A610032, and a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.
Abstract: This paper studies the strong n (n = −1, 0)-discount and finite horizon criteria for continuous-time Markov decision processes in Polish spaces. The corresponding transition rates are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. Under mild conditions, the authors prove the existence of strong n (n = −1, 0)-discount optimal stationary policies by developing two equivalence relations: one is between the standard expected average reward and strong −1-discount optimality, and the other is between the bias and strong 0-discount optimality. The authors also prove the existence of an optimal policy for a finite horizon control problem by developing an interesting characterization of a canonical triplet.
Funding: Supported by the National Natural Science Foundation of China (No. 10801056), the Natural Science Foundation of Ningbo (No. 2010A610094), and the K.C. Wong Magna Fund in Ningbo University.
Abstract: In this paper we study the average sample-path cost (ASPC) problem for continuous-time Markov decision processes in Polish spaces. To the best of our knowledge, this paper is a first attempt to study the ASPC criterion on continuous-time MDPs with Polish state and action spaces. The corresponding transition rates are allowed to be unbounded, and the cost rates may have neither upper nor lower bounds. Under some mild hypotheses, we prove the existence of ε(> 0)-ASPC optimal stationary policies based on two different approaches: one is the "optimality equation" approach and the other is the "two optimality inequalities" approach.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 10925107 and 60874004.
Abstract: This paper studies denumerable continuous-time Markov decision processes with expected total reward criteria. The authors first study the unconstrained model with possibly unbounded transition rates, and give suitable conditions on the controlled system's primitive data under which they show the existence of a solution to the total reward optimality equation and also the existence of an optimal stationary policy. Then, the authors impose a constraint on an expected total cost, and consider the associated constrained model. Based on the results for the unconstrained model and using the Lagrange multipliers approach, the authors prove the existence of constrained-optimal policies under some additional conditions. Finally, the authors apply the results to controlled queueing systems.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 61673019, 11931018), the Natural Science Foundation of Guangdong Province (Grant Nos. 2018A030313738, 2021A1515010057), the Guangdong Province Key Laboratory of Computational Science at the Sun Yat-sen University (2020B1212060032), and the IMR and RAE Research Fund, Faculty of Science, HKU.
Abstract: We study Markov decision processes under the average-value-at-risk criterion. The state space and the action space are Borel spaces, the costs are allowed to be unbounded from above, and the discount factors are state-action dependent. Under suitable conditions, we establish the existence of optimal deterministic stationary policies. Furthermore, we apply our main results to a cash-balance model.
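To make the average-value-at-risk (AVaR) criterion concrete, here is a minimal sketch that evaluates AVaR for a finite sample of total costs. It uses the standard minimization representation AVaR_α(X) = min_s { s + E[(X − s)^+] / (1 − α) }. It is not the paper's method: the function name and the convention that `alpha` is the confidence level (larger `alpha` means a more extreme tail) are assumptions, and the literature uses more than one convention.

```python
def empirical_avar(costs, alpha):
    """AVaR at level alpha of the empirical distribution of the given costs.

    Uses AVaR = min over s of s + mean(max(c - s, 0)) / (1 - alpha).
    The objective is piecewise linear and convex in s with kinks at the
    sample points, so it suffices to search over the samples themselves.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    n = len(costs)

    def objective(s):
        excess = sum(max(c - s, 0.0) for c in costs)
        return s + excess / (n * (1.0 - alpha))

    return min(objective(s) for s in costs)


# Hypothetical sampled total costs.
sample_costs = [1.0, 2.0, 3.0, 4.0]
tail_average = empirical_avar(sample_costs, alpha=0.5)   # mean of the worst half
plain_mean = empirical_avar(sample_costs, alpha=0.0)     # reduces to the mean
```

At `alpha = 0` the criterion coincides with the ordinary expected cost, and as `alpha` grows it concentrates on the worst outcomes.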
Abstract: This paper deals with Markov decision processes with a target set for nonpositive rewards. Two types of threshold probability criteria are discussed. The first criterion is the probability that the total reward is not greater than a given initial threshold value, and the second is the probability that the total reward is less than it. Our first (resp. second) optimization problem is to minimize the first (resp. second) threshold probability. These problems suggest that the threshold value is a permissible level of the total reward to reach a goal (the target set); that is, we would reach this set above that level, if possible. For both problems, we show that 1) the optimal threshold probability is a unique solution to an optimality equation, 2) there exists an optimal deterministic stationary policy, and 3) a value iteration and a policy space iteration are given. In addition, we prove that the first (resp. second) optimal threshold probability is a monotone increasing and right (resp. left) continuous function of the initial threshold value, and propose a method to obtain an optimal policy and the optimal threshold probability in the first problem by using those of the second problem.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 61802179, 61972193 and 61972197, the Fundamental Research Funds for the Central Universities of China under Grant No. NS2021069, and the Natural Science Foundation of Jiangsu Province of China under Grant No. BK20201292.
Abstract: Self-adaptive systems are able to adjust their behaviour in response to changes in environmental conditions and are widely deployed as Internetwares. Considered a promising way to handle the ever-growing complexity of software systems, they have seen an increasing level of interest and cover a variety of applications, e.g., autonomous car systems and adaptive network systems. Many approaches for the construction of self-adaptive systems have been developed, and probabilistic models, such as Markov decision processes (MDPs), are among the favoured ones. However, the majority of them do not deal with the problems of the underlying MDP being obsolete under new environments or failing to satisfy the given properties. As a result, the policies generated from such an MDP fail to guide the self-adaptive system to run correctly and meet its goals. In this article, we propose a systematic approach to updating an obsolete MDP by exploring new states and transitions and removing obsolete ones, and to repairing an unsatisfactory MDP by adjusting its structure in a more meaningful way rather than arbitrarily changing the transition probabilities to values not in line with reality. Experimental results show that the MDPs updated and repaired by our approach are more competent than the original ones in guiding self-adaptive systems to run correctly.