摘要
MARKOV decision processes (MDPs) have been studied by mathematicians, probabilists, operation researchers and engineers since the late 1950s. In an MDPs a stochastic, dynamic system is controlled by a 'policy' selected by a decision-maker/controller, with the goal of maximizing an overall reward function that is an appropriately defined aggregate of immediate rewards, over either finite or infinite time horizon.As such MDPs are a useful paradigm for modeling many processes occurring naturally in the management and engineering contexts..