Abstract: Combat simulation of weapon-equipment systems-of-systems falls within the scope of complex-systems research. This work is the first to explore Nash-Q-based adversarial cognitive decision-making in a network information system-of-systems (NISoS). The Nash-Q algorithm takes a form similar to joint-action Q-learning; the difference lies in how the joint strategy is computed. For a zero-sum combat game model, Nash-Q obtains a mixed strategy by solving for the Nash equilibrium without requiring the historical information of other agents, which makes it easier to implement and more efficient. A campaign-level zero-sum dynamic combat game model is established, and a method for solving the Nash equilibrium is given that does not require complete information about the other agents. In addition, a Gaussian radial basis function (RBF) neural network is used to approximate the Q-table, giving the algorithm better discretization and generalization ability. Finally, NISoS combat simulation experiments verify the effectiveness of the algorithm, show that it achieves higher payoffs than Q-learning-based and rule-based decision algorithms, and demonstrate excellent performance in offline decision-making.
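For a zero-sum stage game, the Nash equilibrium mixed strategy described above can be computed by a maximin linear program, with no history from the other agent required. The following is a minimal sketch of that computation and of a Nash-Q backup, assuming a tabular Q indexed by (own action, opponent action); all function names are illustrative, not the paper's implementation.

```python
# Hedged sketch: zero-sum Nash-Q, where the stage-game equilibrium is the
# maximin mixed strategy obtained by linear programming.
import numpy as np
from scipy.optimize import linprog

def zero_sum_nash(Q_state: np.ndarray):
    """Return (mixed strategy x, game value v) for the row player of the
    zero-sum stage game with payoff matrix Q_state (rows: own actions)."""
    m, n = Q_state.shape
    # Variables: x_0..x_{m-1} (strategy) and v (game value); minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every opponent action j: v - sum_i Q[i, j] * x_i <= 0.
    A_ub = np.hstack([-Q_state.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Strategy probabilities sum to one; v is unbounded.
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:m], res.x[-1]

def nash_q_update(Q, s, a, o, r, s_next, alpha=0.1, gamma=0.9):
    """One Nash-Q backup; Q[s] is an (own x opponent) payoff table."""
    _, v_next = zero_sum_nash(Q[s_next])  # Nash value of the next state
    Q[s][a, o] += alpha * (r + gamma * v_next - Q[s][a, o])
```

In the paper the tabular Q is further approximated by a Gaussian RBF network for generalization; the LP above is the piece that removes the need for the opponent's history.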
Funding: This work was supported by the National Key R&D Program of China (2017YFB0802703) and the National Natural Science Foundation of China (61602052).
Abstract: As a core component of the network, web applications have become one of the preferred targets for attackers, because the static configuration of web applications simplifies the exploitation of vulnerabilities. Although moving target defense (MTD) has been proposed to increase the difficulty for attackers, no single approach can cope with all attacks; moreover, it is impossible to implement all these approaches simultaneously due to resource limitations. Thus, the selection of an optimal MTD-based defense strategy has become a focus of research. In general, the confrontation of two players in the security domain is viewed as a stochastic game in which the reward matrices are known to both players. However, a real security confrontation is a game of incomplete information: each player can only observe the actions performed by the opponent, and the observed actions are not completely accurate. To describe the attacker's reward function accurately enough to reach the Nash equilibrium, this work simulated and updated the attacker's strategy-selection distribution by observing the attacker's strategy-selection history. Next, the attacker's possible rewards in each confrontation were corrected via the observation matrix. On this basis, a Nash-Q learning algorithm with reward quantification was proposed to select the optimal strategy. Moreover, the performances of the Minimax-Q learning algorithm and the Naive-Q learning algorithm were compared and analyzed in the MTD environment. Finally, the experimental results showed that the strategy-selection algorithm enables defenders to select a more reasonable defensive strategy and achieve the maximum possible reward.
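The correction step can be sketched as follows: assuming an observation matrix whose entry (i, j) is the probability of observing action j when the attacker truly played i, a least-squares inversion of that matrix recovers an estimate of the attacker's true strategy distribution, which then weights the defender's reward estimates. This is a hedged illustration of the idea in the abstract, not the paper's exact algorithm; all names are assumptions.

```python
# Hedged sketch of "reward quantification": invert the noisy observation
# channel before estimating rewards against the attacker's strategy.
import numpy as np

def correct_attacker_distribution(obs_counts: np.ndarray,
                                  obs_matrix: np.ndarray) -> np.ndarray:
    """obs_matrix[i, j] = P(observe action j | attacker truly played i).
    Solve obs_matrix.T @ p = observed frequencies for the true strategy p."""
    f = obs_counts / obs_counts.sum()              # observed frequencies
    p, *_ = np.linalg.lstsq(obs_matrix.T, f, rcond=None)
    p = np.clip(p, 0.0, None)                      # project back onto simplex
    return p / p.sum()

def expected_defense_rewards(reward_matrix: np.ndarray,
                             p_attacker: np.ndarray) -> np.ndarray:
    """Expected reward of each defense strategy against the corrected
    attacker distribution (reward_matrix[d, a] = defense d vs attack a)."""
    return reward_matrix @ p_attacker

# Illustrative usage (hypothetical numbers): pick the defense with the best
# corrected expected reward.
# counts = np.array([30, 50, 20])
# O = np.array([[.8, .1, .1], [.1, .8, .1], [.1, .1, .8]])
# p = correct_attacker_distribution(counts, O)
# best = int(np.argmax(expected_defense_rewards(R_def, p)))
```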
Abstract: In this paper, a local-learning algorithm for multi-agent systems is presented, based on the fact that individual agents perform local perception and local interaction in a group environment. In individual learning, each agent adopts a greedy strategy to maximize its reward when interacting with the environment. In group learning, local interaction takes place between pairs of agents. A local-learning algorithm that chooses and modifies agents' actions is proposed to improve the traditional Q-learning algorithm, covering zero-sum games as well as general-sum games with a unique equilibrium or multiple equilibria. This local-learning algorithm is proved to be convergent, and its computational complexity is lower than that of Nash-Q. Additionally, grid-game tests indicate that with this local-learning algorithm, the local behaviors of agents can spread globally.
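A minimal sketch of the local-learning idea follows, under the assumption that each agent conditions its greedy choice only on the locally observed neighbor action, which avoids the per-step equilibrium computation that dominates Nash-Q's cost. All class and method names are hypothetical, not the paper's notation.

```python
# Hedged sketch: pairwise local Q-learning, conditioning on the neighbor's
# observed action instead of solving a joint equilibrium at every step.
import random
from collections import defaultdict

class LocalLearner:
    def __init__(self, actions, alpha=0.1, gamma=0.9, eps=0.1):
        # Q is keyed by (state, own_action, neighbor_action).
        self.Q = defaultdict(float)
        self.actions, self.alpha, self.gamma, self.eps = actions, alpha, gamma, eps

    def act(self, state, neighbor_action):
        """Epsilon-greedy choice given the locally observed neighbor action."""
        if random.random() < self.eps:
            return random.choice(self.actions)
        return max(self.actions,
                   key=lambda a: self.Q[(state, a, neighbor_action)])

    def update(self, s, a, na, r, s2, na2):
        """Standard Q backup over the local (own, neighbor) action pair."""
        best_next = max(self.Q[(s2, a2, na2)] for a2 in self.actions)
        td = r + self.gamma * best_next - self.Q[(s, a, na)]
        self.Q[(s, a, na)] += self.alpha * td
```

Because each update touches only one (own, neighbor) action pair, the per-step cost is linear in the action set rather than the cost of solving a stage-game equilibrium, which is consistent with the abstract's claim of lower complexity than Nash-Q.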