Abstract: Future unmanned battles urgently require intelligent combat policies, and multi-agent reinforcement learning offers a promising solution. However, due to the complexity of combat operations and the large size of the combat group, this task suffers from the credit assignment problem more than other reinforcement learning tasks. This study uses reward shaping to relieve the credit assignment problem and improve policy training for the new generation of large-scale unmanned combat operations. We first prove that multiple reward shaping functions do not change the Nash equilibrium in stochastic games, providing theoretical support for their use. According to the characteristics of combat operations, we propose tactical reward shaping (TRS), which comprises maneuver shaping advice and threat-assessment-based attack shaping advice. We then investigate the effects of different types and combinations of shaping advice on combat policies through experiments. The results show that TRS improves both the efficiency and attack accuracy of combat policies, with the combination of maneuver reward shaping advice and ally-focused attack shaping advice achieving the best performance compared with the baseline strategy.
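As a rough illustration of how several shaping terms can be layered on top of the environment reward without altering the underlying equilibrium, the sketch below uses potential-based shaping; the potential functions, weights, and state fields are illustrative assumptions, not the paper's actual TRS definitions.

```python
import numpy as np

def maneuver_potential(state):
    # Hypothetical maneuver potential: reward closing distance to the assigned target.
    return -np.linalg.norm(state["own_pos"] - state["target_pos"])

def threat_potential(state):
    # Hypothetical threat potential: prefer engaging the enemy closest to any ally.
    dists = np.linalg.norm(state["enemy_pos"] - state["ally_pos"][:, None], axis=-1)
    return -dists.min()

def shaped_reward(r_env, state, next_state, gamma=0.99, w_move=0.5, w_attack=0.5):
    """Potential-based shaping: F(s, s') = gamma * phi(s') - phi(s).

    Terms of this form leave optimal policies (and equilibria) unchanged."""
    f_move = gamma * maneuver_potential(next_state) - maneuver_potential(state)
    f_attack = gamma * threat_potential(next_state) - threat_potential(state)
    return r_env + w_move * f_move + w_attack * f_attack
```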
Funding: Supported by the National Key R&D Program of China (2018AAA0101400), the National Natural Science Foundation of China (62173251, 61921004, U1713209), the Natural Science Foundation of Jiangsu Province of China (BK20202006), and the Guangdong Provincial Key Laboratory of Intelligent Decision and Cooperative Control.
Abstract: In this paper, the reinforcement learning method for cooperative multi-agent systems (MAS) with an incremental number of agents is studied. Existing multi-agent reinforcement learning approaches deal with an MAS with a specific number of agents and can learn well-performed policies. However, if the number of agents increases, the previously learned policies may not perform well in the new scenario. The new agents need to learn from scratch to find optimal policies with the others, which may slow down the learning speed of the whole team. To solve this problem, we propose a new algorithm that takes full advantage of the historical knowledge learned before and transfers it from the previous agents to the new agents. Since the previous agents have been trained well in the source environment, they are treated as teacher agents in the target environment. Correspondingly, the new agents are called student agents. To enable the student agents to learn from the teacher agents, we first modify the input nodes of the teacher agents' networks to adapt to the current environment. Then, the teacher agents take the observations of the student agents as input and output advised actions and values as supervising information. Finally, the student agents combine the reward from the environment with the supervising information from the teacher agents and learn optimal policies with modified loss functions. By taking full advantage of the knowledge of the teacher agents, the search space for the student agents is reduced significantly, which accelerates the learning speed of the whole system. The proposed algorithm is verified in several multi-agent simulation environments, and its efficiency is demonstrated by the experimental results.
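A minimal sketch of how a student's loss might combine the environment's TD target with a teacher's advised values, assuming simple Q-networks; the weighting, dtypes, and network shapes are illustrative and not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def student_loss(student_q, target_q, teacher_q, batch, gamma=0.99, beta=0.3):
    """TD loss on the environment reward plus a term imitating the teacher's values.

    batch: dict of tensors with keys obs, action (long), reward, next_obs, done.
    beta weights how strongly the student follows the teacher's advice."""
    q = student_q(batch["obs"]).gather(1, batch["action"].unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_q(batch["next_obs"]).max(dim=1).values
        td_target = batch["reward"] + gamma * (1 - batch["done"]) * next_q
        advice = teacher_q(batch["obs"])   # teacher evaluates the student's observation
    td_loss = F.mse_loss(q, td_target)
    distill_loss = F.mse_loss(student_q(batch["obs"]), advice)
    return td_loss + beta * distill_loss
```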
Funding: Supported by the National Key R&D Program of China (2017YFB1400105).
Abstract: In the evolutionary game of the same task for groups, changes in game rules, personal interests, crowd size, and external supervision have uncertain effects on individual decision-making and game results. Within the Markov decision framework, a single-task multi-decision evolutionary game model based on multi-agent reinforcement learning is proposed to explore the evolutionary rules of the game process. The model can improve the result of an evolutionary game and facilitate the completion of the task. First, based on multi-agent theory and to solve the problems existing in the original model, a negative feedback tax penalty mechanism is proposed to guide the strategy selection of individuals in the group. In addition, to evaluate the evolutionary game results of the group in the model, a calculation method for the group intelligence level is defined. Second, the Q-learning algorithm is used to improve the guiding effect of the negative feedback tax penalty mechanism. In the model, the selection strategy of the Q-learning algorithm is improved, and a bounded-rationality evolutionary game strategy is proposed based on the rules of evolutionary games and consideration of the bounded rationality of individuals. Finally, simulation results show that the proposed model can effectively guide individuals to choose cooperation strategies that are beneficial to task completion and stability under different negative feedback factor values and different group sizes, thereby improving the group intelligence level.
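A rough sketch of a tabular Q-learning update in which a negative-feedback tax term is subtracted from the payoff of defectors; the tax schedule, action encoding, and payoff source are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def taxed_reward(base_payoff, action, defection_rate, tax_factor=0.5):
    # Hypothetical negative-feedback tax: defection is penalized more heavily
    # as the fraction of defectors in the group grows.
    if action == 1:  # 1 = defect, 0 = cooperate
        return base_payoff - tax_factor * defection_rate * base_payoff
    return base_payoff

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    # Standard Q-learning update on the taxed reward.
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q
```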
Funding: Financial support from the National Natural Science Foundation of China (Grant No. 61601491), the Natural Science Foundation of Hubei Province, China (Grant No. 2018CFC865), and the Military Research Project of China (Grant No. YJ2020B117).
Abstract: To solve the problem of multi-target hunting by an unmanned surface vehicle (USV) fleet, a hunting algorithm based on multi-agent reinforcement learning is proposed. First, the hunting environment and a kinematic model without boundary constraints are built, and the criteria for successful target capture are given. Then, the cooperative hunting problem of a USV fleet is modeled as a decentralized partially observable Markov decision process (Dec-POMDP), and a distributed partially observable multi-target hunting Proximal Policy Optimization (DPOMH-PPO) algorithm applicable to USVs is proposed. In addition, an observation model, a reward function, and an action space applicable to multi-target hunting tasks are designed. To deal with the dynamically changing dimension of the observational features input by partially observable systems, a feature embedding block is proposed: by combining the two feature compression methods of column-wise max-pooling (CMP) and column-wise average-pooling (CAP), an observational feature encoding is established. Finally, the centralized training and decentralized execution framework is adopted to train the hunting strategy; each USV in the fleet shares the same policy and performs actions independently. Simulation experiments verify the effectiveness of the DPOMH-PPO algorithm in test scenarios with different numbers of USVs. Moreover, the advantages of the proposed model are comprehensively analyzed in terms of algorithm performance, transfer across task scenarios, and self-organization capability after damage, and the potential deployment and application of DPOMH-PPO in real environments is verified.
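A minimal sketch of a feature-embedding block that compresses a variable number of per-entity observation rows into a fixed-size encoding via column-wise max- and average-pooling; the layer sizes and concatenation layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    """Encode a variable-length set of per-entity features into a fixed vector."""
    def __init__(self, feat_dim=8, hidden=64):
        super().__init__()
        self.per_entity = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())

    def forward(self, entities):
        # entities: (batch, n_entities, feat_dim); n_entities may vary between calls.
        h = self.per_entity(entities)
        cmp = h.max(dim=1).values    # column-wise max pooling
        cap = h.mean(dim=1)          # column-wise average pooling
        return torch.cat([cmp, cap], dim=-1)   # fixed-size (batch, 2 * hidden)

# Observations of 5 targets and of 3 targets both map to the same 128-d encoding.
enc = FeatureEmbedding()
print(enc(torch.randn(2, 5, 8)).shape, enc(torch.randn(2, 3, 8)).shape)
```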
Funding: Supported by the Innovation Capacity Construction Project of the Jilin Development and Reform Commission (2020C017-2) and the Science and Technology Development Plan Project of Jilin Province (20210201082GX).
Abstract: Mobile CrowdSensing (MCS) is a promising sensing paradigm that recruits users to cooperatively perform sensing tasks. Recently, unmanned aerial vehicles (UAVs), as powerful sensing devices, have been used to replace user participation and carry out special tasks such as epidemic monitoring and earthquake rescue. In this paper, we focus on scheduling UAVs to sense task Points-of-Interest (PoIs) with different frequency coverage requirements. To accomplish the sensing task, the scheduling strategy needs to consider coverage requirements, geographic fairness, and energy charging simultaneously. We consider the complex interaction among UAVs and propose a grouping multi-agent deep reinforcement learning approach (G-MADDPG) to schedule UAVs in a distributed manner. G-MADDPG groups all UAVs into teams with a distance-based clustering algorithm (DCA) and then regards each team as an agent. In this way, G-MADDPG avoids the problem that the training time of traditional MADDPG becomes too long to converge when the number of UAVs is large, and the trade-off between training time and result accuracy can be controlled flexibly by adjusting the number of teams. Extensive simulation results show that our scheduling strategy outperforms three baselines and is flexible in balancing training time and result accuracy.
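A rough sketch of grouping UAVs into teams by a simple distance-based clustering step before training; plain k-means is used here as a stand-in for the paper's DCA, and the team count is an illustrative knob for the training-time/accuracy trade-off.

```python
import numpy as np

def group_uavs(positions, n_teams, n_iter=20, seed=0):
    """Distance-based grouping (plain k-means): returns a team index per UAV.

    Fewer teams -> fewer learning agents -> faster training but coarser control."""
    rng = np.random.default_rng(seed)
    centers = positions[rng.choice(len(positions), n_teams, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(positions[:, None] - centers[None], axis=-1), axis=1)
        for k in range(n_teams):
            if np.any(labels == k):
                centers[k] = positions[labels == k].mean(axis=0)
    return labels

teams = group_uavs(np.random.rand(12, 2) * 100, n_teams=3)
print(teams)  # each of the 12 UAVs is assigned to one of 3 team-level agents
```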
Funding: The authors acknowledge the National Natural Science Foundation of China (Grant Nos. 61573285 and 62003267), the Aeronautical Science Foundation of China (Grant No. 2017ZC53021), the Open Fund of the Key Laboratory of Data Link Technology of China Electronics Technology Group Corporation (Grant No. CLDL-20182101), and the Natural Science Foundation of Shaanxi Province (Grant No. 2020JQ-220) for funding the experiments.
Abstract: Tracking a maneuvering target autonomously, accurately, and in real time in an uncertain environment is one of the challenging missions for unmanned aerial vehicles (UAVs). In this paper, aiming to address the control problem of maneuvering target tracking and obstacle avoidance, an online path planning approach for UAVs is developed based on deep reinforcement learning. Through end-to-end learning powered by neural networks, the proposed approach achieves perception of the environment and continuous motion control output. The approach includes: (1) a deep deterministic policy gradient (DDPG)-based control framework that provides learning and autonomous decision-making capability for UAVs; (2) an improved method named MN-DDPG that introduces a type of mixed noise to help the UAV explore stochastic strategies for online optimal planning; and (3) an algorithm of task decomposition and pre-training for efficient transfer learning to improve the generalization capability of the UAV control model built on MN-DDPG. Experimental simulation results verify that the proposed approach achieves good self-adaptive adjustment of the UAV's flight attitude in maneuvering target tracking tasks, with a significant improvement in the generalization capability and training efficiency of the UAV tracking controller in uncertain environments.
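A minimal sketch of adding a mixture of Ornstein-Uhlenbeck (temporally correlated) and Gaussian noise to a deterministic actor's output for exploration; the mixing weight and noise parameters are assumptions, since the abstract does not spell out the exact mixture.

```python
import numpy as np

class MixedNoise:
    """Mixture of OU (correlated) and Gaussian (uncorrelated) exploration noise."""
    def __init__(self, dim, theta=0.15, sigma_ou=0.2, sigma_g=0.1, w_ou=0.7):
        self.dim, self.theta, self.sigma_ou = dim, theta, sigma_ou
        self.sigma_g, self.w_ou = sigma_g, w_ou
        self.x = np.zeros(dim)

    def sample(self):
        self.x += self.theta * (-self.x) + self.sigma_ou * np.random.randn(self.dim)
        gaussian = self.sigma_g * np.random.randn(self.dim)
        return self.w_ou * self.x + (1.0 - self.w_ou) * gaussian

def noisy_action(actor, state, noise, low=-1.0, high=1.0):
    # Deterministic policy output perturbed by the mixed noise, clipped to bounds.
    return np.clip(actor(state) + noise.sample(), low, high)
```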
Abstract: As an important mechanism in multi-agent interaction, communication can make agents form complex team relationships rather than a simple collection of independent agents. However, existing communication schemes introduce considerable timing redundancy and irrelevant messages, which seriously affects their practical application. To solve this problem, this paper proposes a targeted multi-agent communication algorithm based on state control (SCTC). SCTC uses a state-control-based gating mechanism to reduce the timing redundancy of communication between agents, and determines the interaction relationships between agents and the importance weight of each communication message through hard- and self-attention mechanisms connected in series, realizing targeted processing of communication messages. In addition, by minimizing the difference between the fusion message generated from each agent's real communication message and the fusion message generated from the buffered message, the correctness of the agent's final action choice is ensured. Our evaluation on a challenging set of StarCraft II benchmarks indicates that SCTC significantly improves learning performance and reduces communication overhead between agents, thus ensuring better cooperation.
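A rough sketch of the idea of a state-controlled send gate followed by a hard mask and soft attention weights over incoming messages; the dimensions, thresholds, and the use of plain thresholding (rather than a differentiable Gumbel-style relaxation) are simplifications for illustration, not the SCTC architecture itself.

```python
import torch
import torch.nn as nn

class GatedComm(nn.Module):
    def __init__(self, state_dim=32, msg_dim=16):
        super().__init__()
        self.gate = nn.Linear(state_dim, 1)        # state-controlled send gate
        self.hard = nn.Linear(2 * msg_dim, 1)      # hard attention: keep/drop a pairwise link
        self.query = nn.Linear(msg_dim, msg_dim)   # soft self-attention over kept messages
        self.key = nn.Linear(msg_dim, msg_dim)

    def forward(self, own_state, own_msg, peer_msgs):
        # peer_msgs: (n_peers, msg_dim)
        send = (torch.sigmoid(self.gate(own_state)) > 0.5).float()   # communicate this step or not
        pair = torch.cat([own_msg.expand_as(peer_msgs), peer_msgs], dim=-1)
        keep = (torch.sigmoid(self.hard(pair)) > 0.5).float()        # hard mask per peer
        scores = (self.query(own_msg) * self.key(peer_msgs)).sum(-1, keepdim=True)
        scores = scores.masked_fill(keep == 0, float("-inf"))
        weights = torch.nan_to_num(torch.softmax(scores, dim=0))     # all peers masked -> zeros
        fused = (weights * peer_msgs).sum(dim=0)
        return send * fused                        # gated fusion of targeted messages
```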
Abstract: After an N-1 fault occurs in an urban power grid, new operational risks are very likely to arise, leading to large-scale blackouts under N-1-1 conditions. To manage post-N-1 operational risk in urban grids, this paper proposes a load-transfer strategy for N-1 risk control based on an improved dual-agent dueling double deep Q network (D3QN). Following risk-control principles, an N-1 scenario index is proposed that requires no additional historical data and accounts for automatic backup switching devices, single-supply substation risk, and single-supply load-bus risk; a three-stage load-transfer solution model is established that considers action order and the relationships between indices. Using an improved dual-agent D3QN method with a pre-action and varying-exploration action-selection strategy, the load transfer is decomposed into multiple sub-transfer stages for learning, which clarifies the transfer logic, reduces the dimensionality of the action space, and improves training and optimization, yielding a load-transfer strategy that controls N-1 risk. Case studies on multiple urban grid scenarios verify the effectiveness of the proposed model and method.
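A brief sketch of the dueling head and the double-DQN target that D3QN-style agents rely on; the network sizes are illustrative, and the load-transfer state/action encoding of the paper is not modeled here.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling head: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.adv = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        h = self.body(obs)
        adv = self.adv(h)
        return self.value(h) + adv - adv.mean(dim=-1, keepdim=True)

def double_dqn_target(online, target, reward, next_obs, done, gamma=0.99):
    with torch.no_grad():
        best = online(next_obs).argmax(dim=-1, keepdim=True)     # online net picks the action
        q_next = target(next_obs).gather(-1, best).squeeze(-1)   # target net evaluates it
    return reward + gamma * (1 - done) * q_next
```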
Abstract: In current deep-reinforcement-learning-based database index recommendation, when the workload changes, the model's recommendation quality drops significantly because the actual workload differs greatly from the training workload. To address the insufficient adaptivity and generalization of existing DRL-based index recommendation algorithms under incremental workload changes, an index recommendation algorithm based on multi-agent transfer reinforcement learning, MARLIA (multi-agent reinforcement learning index advisor), is proposed. The algorithm incorporates the idea of transfer learning and trains the model with multiple agents. When workload updates degrade recommendation quality, the algorithm uses policy distillation to pass the old index recommendation policy to the new index recommendation agent, improving generalization and support for dynamic workloads. Experimental results on the TPC-H dataset show that the algorithm's workload cost improvement rate remains within 7% of the baseline algorithms, and the cache hit rate reaches 76.3% with a workload of 120 queries. The study shows that MARLIA has strong adaptivity and generalization when the workload changes.
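A minimal sketch of the policy-distillation step by which an old policy's action distribution supervises a new agent; the temperature, the mixing weight, and the network shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(new_logits, old_logits, temperature=2.0):
    """KL divergence pushing the new (student) policy toward the old (teacher) policy."""
    teacher = F.softmax(old_logits / temperature, dim=-1)
    student_log = F.log_softmax(new_logits / temperature, dim=-1)
    return F.kl_div(student_log, teacher, reduction="batchmean") * temperature ** 2

# During retraining on the updated workload, the total loss might mix the usual RL loss
# with distillation from the frozen old policy:
# loss = rl_loss + lambda_distill * distillation_loss(new_policy(obs), old_policy(obs).detach())
```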
Abstract: Traditional data cleaning methods require experts to manually define data quality rules, which is complex and extremely time-consuming, and the cleaned data may not be reusable, reducing the quality and efficiency of data cleaning. To address this, a double deep Q-network for denial constraints transfer (DDQN-DCT) algorithm is proposed. The algorithm designs a similarity measure for denial constraints (DCs) and, combining similarity with the conciseness and coverage of DCs, uses a double deep Q-network (DDQN) to modify the predicates in DC rules and thereby realize DC transfer; the goal is to make the transferred rules as similar as possible to the original rules so as to preserve the information they carry. Based on DDQN-DCT, a DDQN-DCT+ algorithm is further designed, which divides the DDQN action-selection policy into two stages, addition and deletion; comparative experiments show that DDQN-DCT+ performs better on DC conciseness. In comparative experiments against brute-force dependency constraint transfer (BFDC), DDQN-DCT+, and structure expansion/reduction (SER), DDQN-DCT improves rule similarity by about 10% on average over BFDC, about 10.6% over DDQN-DCT+, and about 16.4% over SER. DDQN-DCT can effectively transfer rules from a source domain to similar target-domain data.
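A rough sketch of how a reward for a predicate-editing agent might combine rule similarity, conciseness, and coverage; the weights and the Jaccard-style similarity over predicate sets are illustrative assumptions, not the paper's exact measures.

```python
def dc_similarity(rule_a, rule_b):
    """Toy similarity between two denial constraints, each a set of predicate strings."""
    a, b = set(rule_a), set(rule_b)
    return len(a & b) / len(a | b) if a | b else 1.0

def dc_score(transferred, original, violations, n_tuples,
             w_sim=0.6, w_concise=0.2, w_cover=0.2):
    similarity = dc_similarity(transferred, original)
    conciseness = 1.0 / len(transferred)              # fewer predicates is better
    coverage = 1.0 - violations / max(n_tuples, 1)    # fraction of tuples the rule holds on
    return w_sim * similarity + w_concise * conciseness + w_cover * coverage

original = {"t1.city = t2.city", "t1.zip != t2.zip"}
candidate = {"t1.city = t2.city", "t1.state != t2.state"}
print(dc_score(candidate, original, violations=3, n_tuples=1000))
```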
Funding: Supported by the National Key R&D Program of China: Gravitational Wave Detection Project (Grant Nos. 2021YFC22026, 2021YFC2202601, and 2021YFC2202603) and the National Natural Science Foundation of China (Grant Nos. 12172288 and 12472046).
Abstract: This paper investigates impulsive orbital attack-defense (AD) games under multiple constraints and victory conditions, involving three spacecraft: an attacker, a target, and a defender. In the AD scenario, the attacker aims to breach the defender's interception to rendezvous with the target, while the defender seeks to protect the target by blocking or actively pursuing the attacker. Four different maneuvering constraints and five potential game outcomes are incorporated to model AD game problems more accurately and increase their complexity, thereby reducing the effectiveness of traditional methods such as differential games and game-tree search. To address these challenges, this study proposes a multi-agent deep reinforcement learning solution with variable reward functions. Two attack strategies, Direct attack (DA) and Bypass attack (BA), are developed for the attacker, each focusing on different mission priorities. Similarly, two defense strategies, Direct interdiction (DI) and Collinear interdiction (CI), are designed for the defender, each optimizing specific defensive actions through tailored reward functions. Each reward function incorporates both process rewards (e.g., distance and angle) and outcome rewards, derived from physical principles and validated via geometric analysis. Extensive simulations of the four strategy confrontations demonstrate average defensive success rates of 75% for DI vs. DA, 40% for DI vs. BA, 80% for CI vs. DA, and 70% for CI vs. BA. The results indicate that CI outperforms DI for defenders, while BA outperforms DA for attackers. Moreover, defenders achieve their objectives more effectively under identical maneuvering capabilities. Trajectory evolution analyses further illustrate the effectiveness of the proposed variable-reward-function-driven strategies. These strategies and analyses offer valuable guidance for practical orbital defense scenarios and lay a foundation for future multi-agent game research.
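A minimal sketch of a reward that mixes process terms (distance and angle shaping) with a terminal outcome term, of the kind a CI-style defender strategy might use; the scales, thresholds, and geometry are illustrative assumptions rather than the paper's derived reward functions.

```python
import numpy as np

def defender_reward(def_pos, atk_pos, tgt_pos, outcome=None,
                    w_dist=0.01, w_angle=0.5, capture_bonus=100.0):
    """Process reward: stay close to the attacker and near the attacker-target line.
    Outcome reward: a large terminal bonus or penalty once the episode ends."""
    dist = np.linalg.norm(def_pos - atk_pos)
    to_tgt = (tgt_pos - atk_pos) / (np.linalg.norm(tgt_pos - atk_pos) + 1e-9)
    to_def = (def_pos - atk_pos) / (dist + 1e-9)
    angle_alignment = float(np.dot(to_tgt, to_def))      # 1 when the defender is collinear
    r = -w_dist * dist + w_angle * angle_alignment       # per-step process reward
    if outcome == "intercepted":
        r += capture_bonus
    elif outcome == "target_reached":
        r -= capture_bonus
    return r
```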
Abstract: Deep reinforcement learning (DRL) has been successfully applied to mobile robot path planning; DRL-based path planning algorithms suit high-dimensional environments and are an important means of achieving autonomous learning for mobile robots. However, training a DRL model requires a large amount of environment-interaction experience, which implies higher computational cost. In addition, the capacity of a DRL algorithm's replay buffer is limited, so effective use of experience cannot be guaranteed. Spiking neural networks (SNNs), an important tool of brain-inspired computing, are uniquely biologically plausible, can integrate spatial and temporal information simultaneously, and are well suited to robot environment perception and control. Combining SNNs, convolutional neural networks (CNNs), and policy fusion, this work studies DRL-based mobile robot path planning algorithms and makes the following contributions. 1) The SCDDPG algorithm is proposed, which uses CNNs to extract multi-channel features from the input state and SNNs to learn spatio-temporal structure from the extracted features. 2) Building on SCDDPG, the SC2DDPG algorithm is proposed; SC2DDPG constrains the robot's running state through a state-constraint strategy, avoiding unnecessary environment exploration and improving the convergence speed of DRL in SC2DDPG. 3) Building on SCDDPG, the PFTDDPG (Policy Fusion and Transfer SCDDPG) algorithm is proposed, which fuses a staged control mode with the DRL algorithm, applies a wall-following strategy to wedge-shaped obstacles in the environment, and introduces transfer learning to transfer prior knowledge as policies. PFTDDPG not only completes path planning tasks that cannot be accomplished by RL alone, but also obtains optimal collision-free paths, and it further improves convergence speed and path planning performance. Experimental results demonstrate the effectiveness of the three proposed path planning algorithms; comparative results show that, among SpikeDDPG, SCDDPG, SC2DDPG, and PFTDDPG, PFTDDPG performs best on metrics such as path planning success rate, training convergence speed, and planned path length. This work offers new ideas for mobile robot path planning and enriches DRL-based solutions in this area.
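A rough sketch of the staged policy-fusion idea: a rule-based wall-following controller takes over when a wedge-shaped obstacle is detected, and the learned actor acts otherwise; the detection test, action layout, and controller outputs are illustrative assumptions, not the PFTDDPG design itself.

```python
import numpy as np

def wedge_detected(lidar, near=0.5):
    # Hypothetical test: obstacles close on the front-left and front-right simultaneously.
    front_left, front_right = lidar[:len(lidar) // 2], lidar[len(lidar) // 2:]
    return front_left.min() < near and front_right.min() < near

def wall_follow_action(lidar, speed=0.2, turn=0.6):
    # Turn away from the closer side (positive = turn left).
    return np.array([speed, turn if lidar[-1] < lidar[0] else -turn])

def fused_policy(actor, state, lidar):
    """Staged control: rule-based wall following near wedge obstacles, DRL elsewhere."""
    if wedge_detected(lidar):
        return wall_follow_action(lidar)
    return actor(state)   # learned actor output, e.g. (linear velocity, angular velocity)
```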