Journal Articles: 83 articles found
1. Robust analysis of discounted Markov decision processes with uncertain transition probabilities (Cited by 2)
Authors: LOU Zhen-kai, HOU Fu-jun, LOU Xu-ming. Applied Mathematics (A Journal of Chinese Universities), SCIE, CSCD, 2020, Issue 4, pp. 417-436 (20 pages)
Optimal policies in Markov decision problems may be quite sensitive with regard to transition probabilities. In practice, some transition probabilities may be uncertain. The goals of the present study are to find the robust range for a certain optimal policy and to obtain value intervals of exact transition probabilities. Our research yields powerful contributions for Markov decision processes (MDPs) with uncertain transition probabilities. We first propose a method for estimating unknown transition probabilities based on maximum likelihood. Since the estimation may be far from accurate, and the highest expected total reward of the MDP may be sensitive to these transition probabilities, we analyze the robustness of an optimal policy and propose an approach for robust analysis. After giving the definition of a robust optimal policy with uncertain transition probabilities represented as sets of numbers, we formulate a model to obtain the optimal policy. Finally, we define the value intervals of the exact transition probabilities and construct models to determine the lower and upper bounds. Numerical examples are given to show the practicability of our methods.
Keywords: Markov decision processes; uncertain transition probabilities; robustness and sensitivity; robust optimal policy; value interval
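The maximum-likelihood step described in the abstract reduces to normalizing observed transition counts; below is a minimal sketch of that step followed by value iteration under the estimated model, assuming a synthetic two-state, two-action MDP (all counts, rewards, and names are illustrative, not taken from the paper):

```python
import numpy as np

# Synthetic transition counts: counts[s, a, s'] = times action a in state s led to s'.
counts = np.array([[[30.0, 10.0], [12.0, 28.0]],
                   [[22.0, 18.0], [5.0, 35.0]]])
rewards = np.array([[1.0, 0.5],
                    [0.2, 2.0]])                  # r(s, a), made-up values
gamma = 0.9

# Maximum-likelihood estimate: normalize counts over next states per (s, a) pair.
P_hat = counts / counts.sum(axis=2, keepdims=True)

# Value iteration under the estimated model.
V = np.zeros(2)
for _ in range(1000):
    Q = rewards + gamma * (P_hat @ V)             # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print("estimated transition kernel:\n", P_hat)
print("optimal policy:", Q.argmax(axis=1))
```

A robustness check in the spirit of the abstract would then perturb P_hat inside a confidence region around the estimate and verify whether the argmax policy stays unchanged.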
2. Variance minimization for continuous-time Markov decision processes: two approaches (Cited by 1)
Authors: ZHU Quan-xin. Applied Mathematics (A Journal of Chinese Universities), SCIE, CSCD, 2010, Issue 4, pp. 400-410 (11 pages)
This paper studies the limit average variance criterion for continuous-time Markov decision processes in Polish spaces. Based on two approaches, this paper proves not only the existence of solutions to the variance minimization optimality equation and the existence of a variance minimal policy that is canonical, but also the existence of solutions to the two variance minimization optimality inequalities and the existence of a variance minimal policy which may not be canonical. An example is given to illustrate all of our conditions.
Keywords: continuous-time Markov decision process; Polish space; variance minimization; optimality equation; optimality inequality
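For orientation, the limit average variance criterion for a continuous-time process typically takes the following form (generic notation, not necessarily the paper's):

$$
\sigma^2(\pi, x) \;=\; \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}_x^{\pi}\!\left[ \int_0^T \bigl( r(x_t, a_t) - \eta(\pi) \bigr)^2 \, dt \right],
$$

where $\eta(\pi)$ is the long-run average reward of policy $\pi$; variance minimization then seeks, among average-reward optimal policies, one that minimizes $\sigma^2$.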
3. Seeking for Passenger under Dynamic Prices: A Markov Decision Process Approach
Authors: Qianrong Shen. Journal of Computer and Communications, 2021, Issue 12, pp. 80-97 (18 pages)
In recent years, ride-on-demand (RoD) services such as Uber and Didi have become increasingly popular. Different from traditional taxi services, RoD services adopt dynamic pricing mechanisms to manipulate the supply and demand on the road, and such mechanisms improve service capacity and quality. Route recommendation for passenger seeking has been widely studied for taxi services. In RoD services, the dynamic price is a new and accurate indicator of the supply and demand condition, but it has rarely been studied as a clue for drivers seeking passengers. In this paper, we propose to incorporate the impact of dynamic prices as a key factor in recommending seeking routes to drivers. We first show the importance and need to do so by analyzing real service data. We then design a Markov Decision Process (MDP) model based on passenger-order and car GPS trajectory datasets, and take dynamic prices into account in designing rewards. Results show that our model not only guides drivers to locations with higher prices, but also significantly improves driver revenue: compared with the same drivers before using the model, the maximum revenue gain reaches 28%.
Keywords: ride-on-demand service; Markov decision process; dynamic pricing; taxi services; route recommendation
4. A dynamical neural network approach for distributionally robust chance-constrained Markov decision process (Cited by 1)
Authors: Tian Xia, Jia Liu, Zhiping Chen. Science China Mathematics, SCIE, CSCD, 2024, Issue 6, pp. 1395-1418 (24 pages)
In this paper, we study the distributionally robust joint chance-constrained Markov decision process. Utilizing the logarithmic transformation technique, we derive its deterministic reformulation with bi-convex terms under the moment-based uncertainty set. To cope with the non-convexity and improve the robustness of the solution, we propose a dynamical neural network approach to solve the reformulated optimization problem. Numerical results on a machine replacement problem demonstrate the efficiency of the proposed dynamical neural network approach when compared with the sequential convex approximation approach.
Keywords: Markov decision process; chance constraints; distributionally robust optimization; moment-based ambiguity set; dynamical neural network
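A moment-based ambiguity set of the kind named in the keywords is commonly written as follows (a generic form, stated only to fix ideas, not the paper's exact set):

$$
\mathcal{D} = \Bigl\{ \mu \;:\; \mathbb{E}_{\mu}[\xi] = \mu_0, \;\; \mathbb{E}_{\mu}\bigl[(\xi - \mu_0)(\xi - \mu_0)^{\top}\bigr] \preceq \Sigma_0 \Bigr\},
\qquad
\inf_{\mu \in \mathcal{D}} \; \mathbb{P}_{\mu}\bigl\{ a_k(\xi)^{\top} y \le b_k,\; k = 1, \dots, K \bigr\} \ge 1 - \epsilon,
$$

i.e., the joint chance constraint must hold for every distribution whose first two moments match the data; the logarithmic transformation mentioned in the abstract is one route to a deterministic reformulation of this requirement.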
5. Heterogeneous Network Selection Optimization Algorithm Based on a Markov Decision Model (Cited by 7)
Authors: Jianli Xie, Wenjuan Gao, Cuiran Li. China Communications, SCIE, CSCD, 2020, Issue 2, pp. 40-53 (14 pages)
A network selection optimization algorithm based on the Markov decision process (MDP) is proposed so that mobile terminals can always connect to the best wireless network in a heterogeneous network environment. Considering the different types of service requirements, the MDP model and its reward function are constructed based on the quality of service (QoS) attribute parameters of the mobile users, and the network attribute weights are calculated by using the analytic hierarchy process (AHP). The network handoff decision condition is designed according to the different types of user services and the time-varying characteristics of the network, and the MDP model is solved by using the genetic algorithm and simulated annealing (GA-SA); thus, users can seamlessly switch to the network with the best long-term expected reward value. Simulation results show that the proposed algorithm has good convergence performance, and can guarantee that users with different service types will obtain satisfactory expected total reward values and have low numbers of network handoffs.
Keywords: heterogeneous wireless networks; Markov decision process; reward function; genetic algorithm; simulated annealing
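The GA-SA solver itself is not detailed in this listing; as a rough illustration of the annealing half only, the sketch below searches over deterministic policies of a small synthetic MDP (the transition kernel, rewards, and cooling schedule are all made-up placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9                       # states, candidate networks, discount
P = rng.dirichlet(np.ones(S), size=(S, A))    # synthetic kernel; P[s, a, :] sums to 1
R = rng.uniform(size=(S, A))                  # synthetic QoS-derived rewards

def policy_value(pi):
    # Exact evaluation of a deterministic policy: solve (I - gamma * P_pi) V = R_pi.
    P_pi, R_pi = P[np.arange(S), pi], R[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi).sum()

pi = rng.integers(A, size=S)                  # initial policy: one network per state
cur, T = policy_value(pi), 1.0
for _ in range(2000):
    cand = pi.copy()
    cand[rng.integers(S)] = rng.integers(A)   # mutate one state's network choice
    v = policy_value(cand)
    # Accept improvements always, worse moves with Boltzmann probability.
    if v > cur or rng.random() < np.exp((v - cur) / T):
        pi, cur = cand, v
    T *= 0.995                                # geometric cooling

print("selected policy:", pi, " expected total reward:", round(cur, 3))
```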
6. Recorded recurrent deep reinforcement learning guidance laws for intercepting endoatmospheric maneuvering missiles
Authors: Xiaoqi Qiu, Peng Lai, Changsheng Gao, Wuxing Jing. Defence Technology, SCIE, EI, CAS, CSCD, 2024, Issue 1, pp. 457-470 (14 pages)
This work proposes a recorded recurrent twin delayed deep deterministic (RRTD3) policy gradient algorithm to solve the challenge of constructing guidance laws for intercepting endoatmospheric maneuvering missiles with uncertainties and observation noise. The attack-defense engagement scenario is modeled as a partially observable Markov decision process (POMDP). Given the benefits of recurrent neural networks (RNNs) in processing sequence information, an RNN layer is incorporated into the agent's policy network to alleviate the bottleneck of traditional deep reinforcement learning methods while dealing with POMDPs. The measurements from the interceptor's seeker during each guidance cycle are combined into one sequence as the input to the policy network, since the detection frequency of an interceptor is usually higher than its guidance frequency. During training, the hidden states of the RNN layer in the policy network are recorded to overcome the partially observable problem that this RNN layer causes inside the agent. The training curves show that the proposed RRTD3 successfully enhances data efficiency, training speed, and training stability. The test results confirm the advantages of the RRTD3-based guidance laws over some conventional guidance laws.
Keywords: endoatmospheric interception; missile guidance; reinforcement learning; Markov decision process; recurrent neural networks
7. Deep Reinforcement Learning for Energy-Efficient Edge Caching in Mobile Edge Networks
Authors: Meng Deng, Zhou Huan, Jiang Kai, Zheng Hantong, Cao Yue, Chen Peng. China Communications, SCIE, CSCD, 2024, Issue 11, pp. 243-256 (14 pages)
Edge caching has emerged as a promising application paradigm in 5G networks, and by building edge networks to cache content, it can alleviate the traffic load brought about by the rapid growth of Internet of Things (IoT) services and applications. Due to the limitations of Edge Servers (ESs) and the large number of user demands, how to make caching decisions and utilize the resources of ESs is significant. In this paper, we aim to minimize the total system energy consumption in a heterogeneous network and formulate the content caching optimization problem as a Mixed Integer Non-Linear Program (MINLP). To address the optimization problem, a Deep Q-Network (DQN)-based method is proposed to improve the overall performance of the system and reduce the backhaul traffic load. In addition, the DQN-based method can effectively overcome the limitation of traditional reinforcement learning (RL) in complex scenarios. Simulation results show that the proposed DQN-based method greatly outperforms other benchmark methods, significantly improving the cache hit rate and reducing the total system energy consumption in different scenarios.
Keywords: deep reinforcement learning; edge caching; energy consumption; Markov decision process
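As a hedged sketch of the DQN machinery referenced in the abstract (the actual state encoding, network architecture, and reward for the caching problem are not given in the listing, so everything below is a generic placeholder):

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 8, 4, 0.95
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())   # frozen copy for stable targets
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_step(s, a, r, s2, done):
    # Temporal-difference target from the frozen target network.
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# One synthetic batch of (state, action, reward, next state, done) transitions.
B = 32
loss = dqn_step(torch.randn(B, obs_dim), torch.randint(n_actions, (B,)),
                torch.rand(B), torch.randn(B, obs_dim), torch.zeros(B))
print(f"batch TD loss: {loss:.4f}")
```

In a caching setting, the state would encode content popularity and cache occupancy, and the reward would reflect energy saved by cache hits; the TD update itself is unchanged.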
8. Distributed Resource Allocation in Dispersed Computing Environment Based on UAV Track Inspection in Urban Rail Transit
Authors: Tong Gan, Shuo Dong, Shiyou Wang, Jiaxin Li. Computers, Materials & Continua, SCIE, EI, 2024, Issue 7, pp. 643-660 (18 pages)
With the rapid development of urban rail transit, existing track detection suffers from problems such as low efficiency and insufficient detection coverage, so an intelligent and automatic track detection method based on UAVs is urgently needed to avoid major safety accidents. At the same time, the geographical distribution of IoT devices results in inefficient use of the significant computing potential held by a large number of devices. The Dispersed Computing (DCOMP) architecture enables collaborative computing between devices in the Internet of Everything (IoE), promotes low-latency and efficient cross-wide applications, and meets users' growing needs for computing performance and service quality. This paper focuses on the resource allocation challenge within a dispersed computing environment that utilizes UAV inspection tracks. The system takes into account both resource constraints and computational constraints, and transforms the optimization problem into an energy minimization problem with computational constraints. The Markov Decision Process (MDP) model is employed to capture the connection between the dispersed computing resource allocation strategy and the system environment. Subsequently, a method based on Double Deep Q-Network (DDQN) is introduced to derive the optimal policy, and an experience replay mechanism is implemented to tackle the issue of increasing dimensionality. Experimental simulations validate the efficacy of the method across various scenarios.
Keywords: UAV track inspection; dispersed computing; resource allocation; deep reinforcement learning; Markov decision process
9. Service Function Chain Deployment Algorithm Based on Multi-Agent Deep Reinforcement Learning
Authors: Wanwei Huang, Qiancheng Zhang, Tao Liu, Yaoli Xu, Dalei Zhang. Computers, Materials & Continua, SCIE, EI, 2024, Issue 9, pp. 4875-4893 (19 pages)
Aiming at the rapid growth of network services, which leads to long service request processing times and high deployment costs when deploying network function virtualization service function chains (SFCs) under 5G networks, this paper proposes a multi-agent deep deterministic policy gradient optimization algorithm for SFC deployment (MADDPG-SD). Initially, an optimization model that enhances the request acceptance rate while minimizing the latency and deployment cost of SFCs is constructed for the network resource-constrained case. Subsequently, we model the dynamic problem as a Markov decision process (MDP), facilitating adaptation to the evolving states of network resources. Finally, by allocating SFCs to different agents and adopting a collaborative deployment strategy, each agent aims to maximize the request acceptance rate or minimize latency and costs. These agents learn strategies from historical data of virtual network functions in SFCs to guide server node selection, and achieve approximately optimal SFC deployment strategies through a cooperative framework of centralized training and distributed execution. Experimental simulation results indicate that the proposed method, while simultaneously meeting performance requirements and resource capacity constraints, effectively increases the acceptance rate of requests compared with the comparative algorithms, reducing end-to-end latency by 4.942% and deployment cost by 8.045%.
Keywords: network function virtualization; service function chain; Markov decision process; multi-agent reinforcement learning
10. Age-Driven Joint Sampling and Non-Slot Based Scheduling for Industrial Internet of Things
Authors: Cao Yali, Teng Yinglei, Song Mei, Wang Nan. China Communications, SCIE, CSCD, 2024, Issue 11, pp. 190-204 (15 pages)
Effective control of time-sensitive industrial applications depends on the real-time transmission of data from underlying sensors. Quantifying data freshness through the age of information (AoI), in this paper we jointly design sampling and non-slot-based scheduling policies to minimize the maximum time-average age of information (MAoI) among sensors, under constraints on average energy cost and finite queue stability. To overcome the intractability arising from the high couplings of such a complex stochastic process, we first focus on the single-sensor time-average AoI optimization problem and convert the constrained Markov decision process (CMDP) into an unconstrained Markov decision process (MDP) by the Lagrangian method. With the infinite-time average energy and AoI expressions expanded as the Bellman equation, the single-sensor time-average AoI optimization problem can be approached through the steady-state distribution probability. Further, we propose a low-complexity sub-optimal sampling and semi-distributed scheduling scheme for the multi-sensor scenario. Simulation results show that the proposed scheme reduces the MAoI significantly while achieving a balance between the sampling rate and service rate for multiple sensors.
Keywords: age of information (AoI); Industrial Internet of Things (IIoT); Markov decision process (MDP); time-sensitive systems; URLLC
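The CMDP-to-MDP conversion mentioned in the abstract follows the standard Lagrangian pattern; schematically, with $\Delta_t$ the AoI and $e_t$ the energy cost (generic notation, not the paper's exact formulation):

$$
\min_{\pi} \; \limsup_{T\to\infty} \frac{1}{T}\, \mathbb{E}^{\pi}\!\left[\sum_{t=1}^{T} \Delta_t\right]
\;\;\text{s.t.}\;\;
\limsup_{T\to\infty} \frac{1}{T}\, \mathbb{E}^{\pi}\!\left[\sum_{t=1}^{T} e_t\right] \le \bar{e}
\quad\Longrightarrow\quad
\min_{\pi} \; \limsup_{T\to\infty} \frac{1}{T}\, \mathbb{E}^{\pi}\!\left[\sum_{t=1}^{T} \bigl(\Delta_t + \lambda\, e_t\bigr)\right],
$$

where the multiplier $\lambda \ge 0$ is tuned in an outer loop so that the energy constraint is met.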
11. Application of Exponential Distribution in Modeling of State Holding Time in HIV/AIDS Transition Dynamics
Authors: Nahashon Mwirigi. Open Journal of Modelling and Simulation, 2024, Issue 4, pp. 159-183 (25 pages)
Markov modeling of HIV/AIDS progression was done under the assumption that the state holding time (waiting time) has a constant hazard. This paper discusses the properties of the hazard function of the exponential distribution and of its modifications, namely the proportional hazards (PH) and accelerated failure time (AFT) models, and their effectiveness in modeling the state holding time in Markov modeling of HIV/AIDS progression with and without risk factors. Patients were categorized by gender and age, with the female gender as the baseline. Data simulated using R software were fitted to each model, and the model parameters were estimated. The estimated P and Z values were then used to test the null hypothesis that the state waiting time data follow an exponential distribution. Model identification criteria, namely the Akaike information criterion (AIC), Bayesian information criterion (BIC), log-likelihood (LL), and R2, were used to evaluate the performance of the models. For the survival regression model, P and Z values supported non-rejection of the null hypothesis for mixed gender without interaction, and supported rejection for mixed gender with an interaction term and for males aged 50-60 years; both parameters supported non-rejection in the rest of the age groups. For males with interaction, both P and Z values supported rejection in all age groups except 20-30 years. For the Cox proportional hazards and AFT models, both P and Z values supported non-rejection of the null hypothesis across all age groups. The P-values of the three models supported different decisions for and against the null hypothesis, with the AFT and Cox values supporting similar decisions in most of the age groups. Among the models considered, the regression assumption provided a superior fit based on the AIC, BIC, LL, and R2 model identification criteria. This was particularly evident in age and gender subgroups where the data exhibited non-proportional hazards and violated the assumptions required for the Cox proportional hazards model. Moreover, the simplicity of the regression model, along with its ability to capture essential state transitions without overfitting, made it the more appropriate choice.
Keywords: Markov chain; Markov process; semi-Markov process; Markov decision tree; stochastic process; survival rate; CD4+ levels; absorption rates; AFT model; PH model
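The constant-hazard property that motivates the exponential assumption is immediate from the definitions: if the waiting time $T \sim \mathrm{Exp}(\lambda)$, then

$$
S(t) = e^{-\lambda t}, \qquad h(t) = \frac{f(t)}{S(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda,
$$

and the two modifications compared in the paper enter through the rate: a PH model scales the hazard by covariates, $h(t \mid x) = \lambda\, e^{\beta^{\top} x}$, while an AFT model rescales time, $T = e^{-\theta^{\top} x} T_0$; for the exponential distribution these coincide up to a reparameterization of the coefficients.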
12. Performance sensitivities for parameterized Markov systems
Authors: Xiren CAO, Junyu ZHANG. Control Theory and Applications (English Edition), EI, 2004, Issue 1, pp. 65-68 (4 pages)
It is known that the performance potentials (or equivalently, perturbation realization factors) can be used as building blocks for performance sensitivities of Markov systems. In parameterized systems, the changes in parameters may only affect some states, and the explicit transition probability matrix may not be known. In this paper, we use an example to show that we can use potentials to construct performance sensitivities in a more flexible way; only the potentials at the affected states need to be estimated, and the transition probability matrix need not be known. Policy iteration algorithms, which are simpler than the standard one, can be established.
Keywords: perturbation analysis; Markov decision processes; policy iteration; reinforcement learning; perturbation realization
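For context, performance potentials $g$ are defined through the Poisson equation, and the sensitivity construction the abstract alludes to rests on the standard realization formula (generic ergodic-chain notation, not necessarily the paper's):

$$
(I - P)\, g + \eta\, \mathbf{1} = f,
\qquad
\eta' - \eta = \pi' \bigl( \Delta P \bigr)\, g \quad \text{for } P' = P + \Delta P,
$$

where $f$ is the reward vector, $\eta$ the average reward, and $\pi'$ the stationary distribution of the perturbed chain; since $\Delta P$ is nonzero only in the rows of affected states, only the potentials entering those rows matter, which is the flexibility the paper exploits.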
13. Grid Integration of Wind Generation Considering Remote Wind Farms: Hybrid Markovian and Interval Unit Commitment
Authors: Bing Yan, Haipei Fan, Peter B. Luh, Khosrow Moslehi, Xiaoming Feng, Chien Ning Yu, Mikhail A. Bragin, Yaowen Yu. IEEE/CAA Journal of Automatica Sinica, SCIE, EI, CSCD, 2017, Issue 2, pp. 205-215 (11 pages)
Grid integration of wind power is essential to reduce fossil fuel usage but challenging in view of the intermittent nature of wind. Recently, we developed a hybrid Markovian and interval approach for the unit commitment and economic dispatch problem, where power generation of conventional units is linked to local wind states to dampen the effects of wind uncertainties. Also, to reduce complexity, extreme and expected states are considered in interval modeling. Although this approach is effective, the fact that major wind farms are often located in remote locations and not accompanied by conventional units leads to conservative results. Furthermore, the weights of extreme and expected states in the objective function are difficult to tune, resulting in significant differences between optimization and simulation costs. In this paper, each remote wind farm is paired with a conventional unit to dampen the effects of wind uncertainties without using expensive utility-scale battery storage, and extra constraints are innovatively established to model the pairing. Additionally, proper weights are derived through a novel quadratic fit of cost functions. The problem is solved by using a creative integration of our recent surrogate Lagrangian relaxation and branch-and-cut. Results demonstrate the modeling accuracy, computational efficiency, and significant reduction of the conservativeness of the previous approach.
Keywords: branch-and-cut; interval optimization; Markov decision process; remote wind farms; surrogate Lagrangian relaxation (SLR); unit commitment
14. Driving force planning in shield tunneling based on Markov decision processes (Cited by 7)
Authors: HU XiangTao, HUANG YongAn, YIN ZhouPing, XIONG YouLun. Science China (Technological Sciences), SCIE, EI, CAS, 2012, Issue 4, pp. 1022-1030 (9 pages)
In shield tunneling, the control system needs a very reliable capability of deviation rectifying in order to ensure that the tunnel trajectory meets the permissible criterion. To this goal, we present an approach that adopts Markov decision process (MDP) theory to plan the driving force with explicit representation of the uncertainty during excavation. The possible shield attitudes and the driving forces during excavation are discretized as a state set and an action set, respectively. In particular, an evaluation function is proposed with consideration of the stability of the driving force and the deviation of the shield attitude. Unlike the deterministic approach, the driving forces based on the MDP model lead to an uncertain effect, and the attitude is known only with an imprecise probability. We consider the case where the transition probability varies in a given domain estimated from field data, and discuss the optimal policy based on interval arithmetic. The validity of the approach is discussed by comparing the driving force planning with actual operating data from the field records of Line 9 in Tianjin. It is shown that the MDP model is reasonable enough to predict the driving force for automatic deviation rectifying.
Keywords: shield tunneling; Markov decision process; automatic deviation rectifying; interval arithmetic; driving force planning
15. Opportunistic admission and resource allocation for slicing enhanced IoT networks
Authors: Long Zhang, Bin Cao, Gang Feng. Digital Communications and Networks, SCIE, CSCD, 2023, Issue 6, pp. 1465-1476 (12 pages)
Network slicing is envisioned as one of the key techniques to meet the extremely diversified service requirements of the Internet of Things (IoT), as it provides an enhanced user experience and elastic resource configuration. In the context of slicing-enhanced IoT networks, both the Service Provider (SP) and the Infrastructure Provider (InP) face challenges of ensuring efficient slice construction and high profit in dynamic environments. These challenges arise from randomly arriving and departing slice requests from end-users, uncertain resource availability, and multidimensional resource allocation. Admission and resource allocation for distinct demands of slice requests are the key issues in addressing these challenges and should be handled effectively in dynamic environments. To this end, we propose an Opportunistic Admission and Resource allocation (OAR) policy to deal with the issues of random slicing requests, uncertain resource availability, and heterogeneous multi-resources. The key idea of OAR is to allow the SP to decide whether to accept slice requests immediately or defer them according to the load and price of resources. To cope with the random slice requests and uncertain resource availability, we formulate this issue as a Markov Decision Process (MDP) to obtain the optimal admission policy, with the aim of maximizing the system reward. Furthermore, a buyer-seller game theory approach is adopted to realize the optimal resource allocation, while motivating each SP and InP to maximize their rewards. Our numerical results show that the proposed OAR policy makes reasonable decisions effectively and steadily, and outperforms the baseline schemes in terms of the system reward.
Keywords: slice; IoT; Markov decision process; game theory; admission and resource allocation
16. SBFT: A BFT Consensus Mechanism Based on DQN Algorithm for Industrial Internet of Things
Authors: Ningjie Gao, Ru Huo, Shuo Wang, Jiang Liu, Tao Huang, Yunjie Liu. China Communications, SCIE, CSCD, 2023, Issue 10, pp. 185-199 (15 pages)
With the development and widespread use of blockchain in recent years, many projects have introduced blockchain technology to solve the growing security issues of the Industrial Internet of Things (IIoT). However, due to the conflict between the operational performance and security of the blockchain system, and compatibility issues with a large number of IIoT devices running together, mainstream blockchain systems cannot be applied to IIoT scenarios. In order to solve these problems, this paper proposes SBFT (Speculative Byzantine Consensus Protocol), a flexible and scalable blockchain consensus mechanism for the Industrial Internet of Things. SBFT has a consensus process based on speculation, improving the throughput and consensus speed of blockchain systems and reducing communication overhead. In order to improve the compatibility and scalability of the blockchain system, we select some nodes with better performance in the network to participate in the consensus. Since multiple properties determine node performance, we abstract the node selection problem as a joint optimization problem and use Dueling Deep Q-Learning (DQL) to solve it. Finally, we evaluate the performance of the scheme through simulation, and the simulation results prove the superiority of our scheme.
Keywords: Industrial Internet of Things; Byzantine fault tolerance; speculative consensus mechanism; Markov decision process; deep reinforcement learning
17. Multi-Agent Deep Reinforcement Learning for Cross-Layer Scheduling in Mobile Ad-Hoc Networks
Authors: Xinxing Zheng, Yu Zhao, Joohyun Lee, Wei Chen. China Communications, SCIE, CSCD, 2023, Issue 8, pp. 78-88 (11 pages)
Due to the fading characteristics of wireless channels and the burstiness of data traffic, how to deal with congestion in ad-hoc networks with effective algorithms is still open and challenging. In this paper, we focus on enabling congestion control to minimize network transmission delays through flexible power control. To effectively solve the congestion problem, we propose a distributed cross-layer scheduling algorithm, which is empowered by graph-based multi-agent deep reinforcement learning. The transmit power is adaptively adjusted in real time by our algorithm based only on local information (i.e., channel state information and queue length) and local communication (i.e., information exchanged with neighbors). Moreover, the training complexity of the algorithm is low due to the regional cooperation based on the graph attention network. In the evaluation, we show that our algorithm can reduce the transmission delay of data flows under severe signal interference and drastically changing channel states, and demonstrate its adaptability and stability in different topologies. The method is general and can be extended to various types of topologies.
Keywords: ad-hoc network; cross-layer scheduling; multi-agent deep reinforcement learning; interference elimination; power control; queue scheduling; actor-critic methods; Markov decision process
18. Optimal Policies for Quantum Markov Decision Processes (Cited by 2)
Authors: Ming-Sheng Ying, Yuan Feng, Sheng-Gang Ying. International Journal of Automation and Computing, EI, CSCD, 2021, Issue 3, pp. 410-421 (12 pages)
The Markov decision process (MDP) offers a general framework for modelling sequential decision making where outcomes are random. In particular, it serves as a mathematical framework for reinforcement learning. This paper introduces an extension of MDPs, namely quantum MDPs (qMDPs), that can serve as a mathematical model of decision making about quantum systems. We develop dynamic programming algorithms for policy evaluation and for finding optimal policies for qMDPs in the finite-horizon case. The results obtained in this paper provide some useful mathematical tools for reinforcement learning techniques applied to the quantum world.
Keywords: quantum Markov decision processes; quantum machine learning; reinforcement learning; dynamic programming; decision making
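The finite-horizon dynamic programming the paper extends has the familiar classical backbone, stated here in classical notation (in qMDPs, states become density operators and actions become super-operators acting on them):

$$
V_N(s) = r_N(s), \qquad V_k(s) = \max_{a \in A} \Bigl[ r(s, a) + \sum_{s'} P(s' \mid s, a)\, V_{k+1}(s') \Bigr], \quad k = N-1, \dots, 0,
$$

with an optimal action at stage $k$ being any maximizer of the bracketed term.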
19. Convergence of Markov decision processes with constraints and state-action dependent discount factors (Cited by 2)
Authors: Xiao Wu, Xianping Guo. Science China Mathematics, SCIE, CSCD, 2020, Issue 1, pp. 167-182 (16 pages)
This paper is concerned with the convergence of a sequence of discrete-time Markov decision processes (DTMDPs) with constraints, state-action dependent discount factors, and possibly unbounded costs. Using the convex analytic approach under mild conditions, we prove that the optimal values and optimal policies of the original DTMDPs converge to those of the "limit" one. Furthermore, we show that any countable-state DTMDP can be approximated by a sequence of finite-state DTMDPs, which are constructed using the truncation technique. Finally, we illustrate the approximation by solving a controlled queueing system numerically, and give the corresponding error bound of the approximation.
Keywords: discrete-time Markov decision processes; state-action dependent discount factors; unbounded costs; convergence
20. First passage Markov decision processes with constraints and varying discount factors (Cited by 2)
Authors: Xiao WU, Xiaolong ZOU, Xianping GUO. Frontiers of Mathematics in China, SCIE, CSCD, 2015, Issue 4, pp. 1005-1023 (19 pages)
This paper focuses on the constrained optimality problem (COP) of first passage discrete-time Markov decision processes (DTMDPs) in denumerable state and compact Borel action spaces with multi-constraints, state-dependent discount factors, and possibly unbounded costs. By means of the properties of a so-called occupation measure of a policy, we show that the constrained optimality problem is equivalent to an (infinite-dimensional) linear program on the set of occupation measures with some constraints, and thus prove the existence of an optimal policy under suitable conditions. Furthermore, using the equivalence between the constrained optimality problem and the linear programming, we obtain an exact form of an optimal policy for the case of finite states and actions. Finally, as an example, a controlled queueing system is given to illustrate our results.
Keywords: discrete-time Markov decision process (DTMDP); constrained optimality; varying discount factor; unbounded cost
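The occupation-measure reduction used in the paper follows the classical LP view of constrained MDPs; in the plain discounted case it reads as follows (shown only as a schematic; the paper's first-passage, varying-discount version modifies the flow constraint):

$$
\min_{\mu \ge 0} \; \sum_{s, a} c(s, a)\, \mu(s, a)
\quad \text{s.t.} \quad
\sum_{a} \mu(s', a) = \nu(s') + \alpha \sum_{s, a} P(s' \mid s, a)\, \mu(s, a) \;\; \forall s',
\qquad
\sum_{s, a} d_i(s, a)\, \mu(s, a) \le k_i,
$$

where $\nu$ is the initial distribution and $\alpha$ the discount factor; an optimal policy is recovered from an optimal $\mu$ via $\pi(a \mid s) = \mu(s, a) / \sum_{a'} \mu(s, a')$.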