Abstract: Markov decision processes (MDPs) and their variants are widely studied in the theory of control for stochastic discrete-event systems driven by Markov chains. Much of the literature focuses on the risk-neutral criterion, in which the expected rewards, either average or discounted, are maximized. There is some literature on MDPs that takes risk into account; much of it addresses the exponential utility (EU) function and mechanisms that penalize various forms of reward variance. EU functions have some numerical deficiencies, while variance measures variability both above and below the mean reward; variability above the mean reward is usually beneficial and should not be penalized or avoided. Risk metrics that account for pre-specified targets (thresholds) for rewards have therefore been considered in the literature, where the goal is to penalize the risk of revenues falling below those targets. Existing work on MDPs that takes targets into account seeks to minimize risks of this nature. Minimizing risk, however, can lead to poor solutions in which the risk is zero or near zero but the average reward is also rather low. Hence, in this paper, we study a risk-averse criterion, in particular the so-called downside risk, which equals the probability of the revenues falling below a given target; in contrast to minimizing this risk, we only reduce it at the cost of a slightly lower average reward. A solution in which the risk is low and the average reward is high, although not at its maximum attainable value, is very attractive in practice. More specifically, in our formulation the objective function is the expected value of the rewards minus a scalar times the downside risk. In this setting, we analyze the infinite-horizon MDP, the finite-horizon MDP, and the infinite-horizon semi-MDP (SMDP). We develop dynamic programming and reinforcement learning algorithms for the finite and infinite horizons. The algorithms are tested in numerical studies and show encouraging performance.
Fund: Supported by the National Science Foundation (No. ECS0841055)
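A minimal sketch of the risk-adjusted criterion described in the abstract above: the score of a policy is its mean reward minus a risk-aversion scalar times the downside risk, i.e., the empirical probability of the reward falling below the target. The simulated reward samples, the target, and the scalar below are hypothetical values chosen only for illustration.

```python
import numpy as np

def downside_risk_score(rewards, target, risk_weight):
    """Risk-adjusted score: mean reward minus risk_weight times the
    downside risk, i.e., the empirical probability that the reward
    falls below the pre-specified target."""
    rewards = np.asarray(rewards, dtype=float)
    downside_risk = np.mean(rewards < target)   # estimate of P(reward < target)
    return rewards.mean() - risk_weight * downside_risk

# Hypothetical example: reward samples of two policies, simulated elsewhere.
rng = np.random.default_rng(0)
policy_a = rng.normal(loc=10.0, scale=4.0, size=10_000)  # higher mean, high variability
policy_b = rng.normal(loc=9.0, scale=1.0, size=10_000)   # lower mean, low variability

target, lam = 7.0, 10.0   # revenue target and risk-aversion scalar (assumed values)
print(downside_risk_score(policy_a, target, lam))
print(downside_risk_score(policy_b, target, lam))
```

With these assumed numbers, the lower-variability policy scores higher despite its smaller mean reward, which is the trade-off the downside-risk criterion is designed to capture.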
Abstract: The adaptive critic heuristic has been a popular algorithm in reinforcement learning (RL) and approximate dynamic programming (ADP) alike; it is one of the first RL and ADP algorithms. RL and ADP algorithms are particularly useful for solving Markov decision processes (MDPs) that suffer from the curses of dimensionality and modeling. Many real-world problems, however, tend to be semi-Markov decision processes (SMDPs), in which the time spent in each transition of the underlying Markov chain is itself a random variable. Unfortunately, for the average reward case, unlike the discounted reward case, the MDP does not have an easy extension to the SMDP. Examples of SMDPs can be found in supply chain management, maintenance management, and airline revenue management. In this paper, we propose an adaptive critic heuristic for the SMDP under the long-run average reward criterion. We present a convergence analysis of the algorithm showing that, under certain mild conditions that can be ensured within a simulator, the algorithm converges to an optimal solution with probability 1. We test the algorithm extensively on a problem of airline revenue management in which the manager has to set prices for airline tickets over the booking horizon. The problem is large-scale, suffering from the curse of dimensionality, and hence difficult to solve via classical methods of dynamic programming. Our numerical results are encouraging and show that the algorithm outperforms an existing heuristic used widely in the airline industry.
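The sketch below is a simplified, hypothetical actor-critic (adaptive critic) update for the average-reward SMDP setting, not the authors' exact algorithm: the temporal-difference error charges the average-reward estimate for the random sojourn time of each transition, and the critic, actor, and average-reward estimate are updated with separate (assumed) step sizes. The `simulator` callable is a placeholder for a user-supplied SMDP simulator.

```python
import numpy as np

def adaptive_critic_smdp(simulator, n_states, n_actions,
                         n_steps=50_000, alpha=0.01, beta=0.001, eta=0.0001):
    """Simplified actor-critic sketch for an average-reward SMDP.
    `simulator(s, a)` is assumed to return (next_state, reward, sojourn_time)."""
    values = np.zeros(n_states)              # critic: state values
    prefs = np.zeros((n_states, n_actions))  # actor: action preferences
    rho = 0.0                                # estimate of average reward per unit time
    state = 0
    for _ in range(n_steps):
        # Softmax (Boltzmann) action selection from the actor's preferences.
        logits = prefs[state] - prefs[state].max()
        probs = np.exp(logits) / np.exp(logits).sum()
        action = np.random.choice(n_actions, p=probs)

        next_state, reward, tau = simulator(state, action)

        # SMDP temporal-difference error: the average reward is charged
        # for the random time tau spent in the transition.
        delta = reward - rho * tau + values[next_state] - values[state]

        values[state] += alpha * delta         # critic update
        prefs[state, action] += beta * delta   # actor update
        rho += eta * delta                     # average-reward estimate update

        state = next_state
    return prefs, values, rho
```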
Abstract: An earthquake of significant magnitude on the Richter scale occurring in an area with a high population density requires an effective and equitable emergency response plan. Emergency resources are usually located in so-called responding centers. One of the first problems faced by disaster-response management personnel in the rapidly degrading post-earthquake conditions is to gauge the hazard rate to which the disaster-affected area is subjected, estimate the time taken to bring the situation under control (also called the restoration time), and select the appropriate responding center for relief-and-rescue activities. In this paper, we propose an elaborate semi-Markov model to capture the stochastic dynamics of the events that follow an earthquake, which we use to quantify the hazard rate to which people are exposed and to estimate the restoration time. The model is further used, via dynamic programming, to determine the appropriate responding center. Our proposed model can be employed in conjunction with a variety of hazard scales and requires collecting data on only a few parameters related to emergency management. The model will be particularly useful in a smart city, where historical data on events following an earthquake would be systematically and accurately recorded.
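As one illustration of the kind of computation such a semi-Markov model supports, the sketch below estimates the expected restoration time as the expected time to absorption in a small, entirely hypothetical post-earthquake semi-Markov chain; the states, transition probabilities, and mean sojourn times are invented for illustration and are not taken from the paper.

```python
import numpy as np

def expected_restoration_time(P, sojourn, absorbing):
    """Expected time to reach the absorbing 'situation under control' state
    in a semi-Markov chain.  P is the embedded transition matrix, `sojourn`
    holds the mean sojourn time in each state, and `absorbing` is the index
    of the restored state.  Solves T = sojourn + P @ T over the transient states."""
    n = P.shape[0]
    transient = [i for i in range(n) if i != absorbing]
    Q = P[np.ix_(transient, transient)]          # transitions among transient states
    t = np.array([sojourn[i] for i in transient])
    T = np.linalg.solve(np.eye(len(transient)) - Q, t)
    out = np.zeros(n)
    out[transient] = T
    return out

# Hypothetical three-state post-earthquake model:
# 0 = severe hazard, 1 = moderate hazard, 2 = situation under control.
P = np.array([[0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6],
              [0.0, 0.0, 1.0]])
sojourn = np.array([6.0, 4.0, 0.0])   # assumed mean hours spent in each state
print(expected_restoration_time(P, sojourn, absorbing=2))
```

Restoration-time estimates of this kind could then feed a dynamic program that compares candidate responding centers, along the lines the abstract describes.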