Low Earth orbit (LEO) satellite networks exhibit distinct characteristics, e.g., limited resources of individual satellite nodes and dynamic network topology, which have brought many challenges for routing algorithms. To satisfy the quality of service (QoS) requirements of various users, it is critical to research efficient routing strategies that fully utilize satellite resources. This paper proposes a multi-QoS-information-optimized routing algorithm based on reinforcement learning for LEO satellite networks, which guarantees that services with high-level assurance demands are prioritized under limited satellite resources while considering the load-balancing performance of the satellite network for services with low-level assurance demands, ensuring full and effective utilization of satellite resources. An auxiliary path search algorithm is proposed to accelerate the convergence of the satellite routing algorithm. Simulation results show that the generated routing strategy can process high-assurance services in a timely manner and fully meet their QoS demands while effectively improving the load-balancing performance of the links.
The guidance strategy is an extremely critical factor in determining the striking effect of a missile operation. A novel guidance law is presented by exploiting deep reinforcement learning (DRL) with the hierarchical deep deterministic policy gradient (DDPG) algorithm. The reward functions are constructed to minimize the line-of-sight (LOS) angle rate and to avoid the threat posed by opposing obstacles. To attenuate chattering of the acceleration, a hierarchical reinforcement learning structure and an improved reward function with an action penalty are put forward. The simulation results validate that a missile under the proposed method can hit the target successfully and keep away from the threatened areas effectively.
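The abstract gives the reward design only qualitatively. A minimal sketch of such a composite guidance reward, with the weights, the circular obstacle model, and the safe radius all assumed for illustration rather than taken from the paper:

```python
import numpy as np

def guidance_reward(los_rate, accel, prev_accel, missile_pos, obstacles,
                    w_los=1.0, w_threat=5.0, w_action=0.1, safe_radius=500.0):
    """Composite reward: penalize LOS angle rate, obstacle proximity, and
    acceleration changes (the action penalty against chattering).
    missile_pos and each obstacle are numpy position vectors; all weights
    and the threat model are illustrative assumptions."""
    r_los = -w_los * abs(los_rate)                  # drive the LOS rate toward zero
    r_threat = 0.0
    for obs in obstacles:                           # obstacles: list of center positions
        d = np.linalg.norm(missile_pos - obs)
        if d < safe_radius:                         # inside a threatened area
            r_threat -= w_threat * (safe_radius - d) / safe_radius
    r_action = -w_action * abs(accel - prev_accel)  # smoothness (anti-chattering) penalty
    return r_los + r_threat + r_action
```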
In order to improve the autonomous ability of unmanned aerial vehicles (UAVs) to carry out air combat missions, many artificial-intelligence-based autonomous air combat maneuver decision-making studies have been carried out, but these studies often address individual decision-making in 1v1 scenarios, which rarely occur in actual air combat. Building on research into 1v1 autonomous air combat maneuver decisions, this paper develops a multi-UAV cooperative air combat maneuver decision model based on multi-agent reinforcement learning. Firstly, a bidirectional recurrent neural network (BRNN) is used to achieve communication between individual UAVs, and the multi-UAV cooperative air combat maneuver decision model is established under the actor-critic architecture. Secondly, by combining target allocation with air combat situation assessment, the tactical goal of the formation is merged with the reinforcement learning goal of every UAV, and a cooperative tactical maneuver policy is generated. The simulation results prove that the model established in this paper can obtain the cooperative maneuver policy through reinforcement learning, and that this policy can guide the UAVs to gain the overall situational advantage and defeat their opponents under tactical cooperation.
Motion planning is critical to realizing the autonomous operation of mobile robots. As the complexity and randomness of robot application scenarios increase, the planning capability of classical hierarchical motion planners is challenged. With the development of machine learning, deep reinforcement learning (DRL)-based motion planners have gradually become a research hotspot due to several advantageous features: a DRL-based motion planner is model-free, does not rely on a prior structured map, and, most importantly, unifies the global planner and the local planner. In this paper, we provide a systematic review of various motion planning methods. Firstly, we summarize representative and state-of-the-art works for each submodule of the classical motion planning architecture and analyze their performance features. Then, we concentrate on reinforcement learning (RL)-based motion planning approaches, including motion planners combined with RL improvements, map-free RL-based motion planners, and multi-robot cooperative planning methods. Finally, we analyze in detail the urgent challenges faced by these mainstream RL-based motion planners, review some state-of-the-art works addressing these issues, and propose suggestions for future research.
This paper investigates a guidance method based on reinforcement learning (RL) for coplanar orbital interception in a continuous low-thrust scenario. The problem is formulated as a Markov decision process (MDP) model, and a well-designed RL algorithm, experience-based deep deterministic policy gradient (EBDDPG), is proposed to solve it. By taking advantage of prior information generated through the optimal control model, the proposed algorithm not only resolves the convergence problem of common RL algorithms but also successfully trains an efficient deep neural network (DNN) controller for the chaser spacecraft to generate the control sequence. Numerical simulation results show that the proposed algorithm is feasible and that the trained DNN controller improves efficiency over traditional optimization methods by roughly two orders of magnitude.
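The key idea is seeding learning with prior information from the optimal control model. A hedged sketch of one way to realize that, with the buffer API, capacity, and expert/online mixing ratio assumed rather than taken from the paper:

```python
import random
from collections import deque

class ExperienceBasedBuffer:
    """Replay buffer pre-filled with transitions extracted from
    optimal-control trajectories, then mixed with data collected online
    during DDPG training. The seeding ratio and interface are illustrative
    assumptions."""
    def __init__(self, capacity=100_000):
        self.expert = deque(maxlen=capacity)   # transitions from the optimal control model
        self.online = deque(maxlen=capacity)   # transitions gathered during RL training

    def seed(self, expert_transitions):
        self.expert.extend(expert_transitions)

    def add(self, transition):
        self.online.append(transition)

    def sample(self, batch_size, expert_frac=0.25):
        """Draw a mixed minibatch so early training leans on expert data."""
        n_exp = min(int(batch_size * expert_frac), len(self.expert))
        batch = random.sample(self.expert, n_exp)
        batch += random.sample(self.online, min(batch_size - n_exp, len(self.online)))
        return batch
```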
Ground-to-air confrontation task assignment is large in scale and must handle many concurrent assignments and random events. When existing task assignment methods are applied to ground-to-air confrontation, they are inefficient in dealing with complex tasks and produce interaction conflicts in multiagent systems. This study proposes a multiagent architecture based on one general agent with multiple narrow agents (OGMN) to reduce task assignment conflicts. Considering the slow speed of traditional dynamic task assignment algorithms, this paper proposes the proximal policy optimization for task assignment of general and narrow agents (PPO-TAGNA) algorithm. Based on the idea of the optimal assignment strategy and combined with a deep reinforcement learning (DRL) training framework, the algorithm adds a multi-head attention mechanism and a stage reward mechanism to the bilateral band-clipping PPO algorithm to address low training efficiency. Finally, simulation experiments are carried out in a digital battlefield. The OGMN-based multiagent architecture combined with the PPO-TAGNA algorithm obtains higher rewards faster and has a higher win ratio. Analysis of agent behavior verifies the efficiency, superiority, and rationality of the method's resource utilization.
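The "bilateral band clipping" suggests a dual-clip PPO surrogate. A sketch of that objective as it appears in the dual-clip PPO literature, with the lower-bound constant c assumed; whether PPO-TAGNA uses exactly this form is not stated in the abstract:

```python
import torch

def dual_clip_ppo_loss(ratio, advantage, eps=0.2, c=3.0):
    """Bilateral (dual-clip) PPO surrogate: the standard clipped objective
    plus a lower bound c * A for negative advantages, which caps the ratio's
    influence when the new policy drifts far from the old one. c = 3 is an
    assumed value."""
    surr = torch.min(ratio * advantage,
                     torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
    dual = torch.max(surr, c * advantage)   # second clipping band, active for A < 0
    return -torch.where(advantage < 0, dual, surr).mean()
```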
In the evolutionary game of a common task for groups, changes in the game rules, personal interests, crowd size, and external supervision have uncertain effects on individual decision-making and game results. Within the Markov decision framework, a single-task multi-decision evolutionary game model based on multi-agent reinforcement learning is proposed to explore the evolutionary rules of the game process. The model can improve the result of an evolutionary game and facilitate completion of the task. First, based on multi-agent theory and to solve the problems in the original model, a negative-feedback tax penalty mechanism is proposed to guide the strategy selection of individuals in the group; in addition, to evaluate the group's evolutionary game results in the model, a method for calculating the group intelligence level is defined. Secondly, the Q-learning algorithm is used to strengthen the guiding effect of the negative-feedback tax penalty mechanism: the selection strategy of the Q-learning algorithm is improved, and a bounded-rationality evolutionary game strategy is proposed based on the rules of evolutionary games and consideration of the bounded rationality of individuals. Finally, simulation results show that the proposed model can effectively guide individuals to choose cooperation strategies that benefit task completion and stability under different negative-feedback factor values and different group sizes, thereby improving the group intelligence level.
In this paper, a reinforcement learning method for cooperative multi-agent systems (MAS) with an incremental number of agents is studied. Existing multi-agent reinforcement learning approaches deal with an MAS having a specific number of agents and can learn well-performing policies. However, if the number of agents increases, the previously learned policies may not perform well in the new scenario, and the new agents need to learn from scratch to find optimal policies together with the others, which may slow down the learning speed of the whole team. To solve this problem, we propose a new algorithm that takes full advantage of previously learned historical knowledge and transfers it from the existing agents to the new agents. Since the previous agents have been trained well in the source environment, they are treated as teacher agents in the target environment; correspondingly, the new agents are called student agents. To enable the student agents to learn from the teacher agents, we first modify the input nodes of the teacher agents' networks to adapt to the current environment. Then, the teacher agents take the observations of the student agents as input and output advised actions and values as supervising information. Finally, the student agents combine the reward from the environment with the supervising information from the teacher agents and learn optimal policies with modified loss functions. By taking full advantage of the teacher agents' knowledge, the search space for the student agents is reduced significantly, which accelerates the learning speed of the holistic system. The proposed algorithm is verified in several multi-agent simulation environments, and its efficiency is demonstrated by the experimental results.
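A hedged sketch of a student-agent loss that combines the environment's TD error with the teacher's advice; the imitation term, its weight lam, and the discrete-action cross-entropy form are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def student_loss(q_student, q_target, action_logits, teacher_actions, lam=0.5):
    """Student objective: the usual TD loss toward the environment return,
    plus an imitation term toward the teacher's advised actions.
    lam balances environment reward against teacher supervision (assumed)."""
    td = F.mse_loss(q_student, q_target)                       # learn from the environment
    imitate = F.cross_entropy(action_logits, teacher_actions)  # learn from teacher advice
    return td + lam * imitate
```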
This paper proposes a linear active disturbance rejection control (LADRC) method based on the Q-learning algorithm of reinforcement learning (RL) to control the six-degree-of-freedom motion of an autonomous underwater vehicle (AUV). The number of controllers is increased to realize AUV motion decoupling. At the same time, to avoid an oversized algorithm, a simplified Q-learning algorithm tailored to the controlled content is constructed to realize parameter adaptation of the LADRC controller. Finally, simulation experiments comparing a fixed-parameter controller with the Q-learning-based controller verify the rationality of the simplified algorithm, the effectiveness of parameter adaptation, and the unique advantages of the LADRC controller.
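A toy sketch of one way such a simplified scheme could look: tabular Q-learning over discretized LADRC bandwidth pairs, with the grids, error binning, and learning constants all invented for illustration:

```python
import numpy as np

# Candidate LADRC observer/controller bandwidths; grids, state binning,
# and reward design are illustrative assumptions, not the paper's values.
WO_GRID = np.linspace(5.0, 50.0, 10)      # observer bandwidth candidates
WC_GRID = np.linspace(1.0, 20.0, 10)      # controller bandwidth candidates
N_STATES = 20                             # tracking-error bins

Q = np.zeros((N_STATES, len(WO_GRID) * len(WC_GRID)))

def update(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One Q-learning step selecting a (wo, wc) pair per error bin."""
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max()
                                 - Q[state, action])

def decode(action):
    """Map a flat action index back to a bandwidth pair."""
    return WO_GRID[action // len(WC_GRID)], WC_GRID[action % len(WC_GRID)]
```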
A novel method was designed to solve reinforcement learning problems with an artificial potential field. Firstly, a reinforcement learning problem was transformed into a path planning problem by using an artificial potential field (APF), which is a very appropriate way to model a reinforcement learning problem. Secondly, a new APF algorithm with a virtual water-flow concept was proposed to overcome the local-minimum problem of potential field methods. The performance of the new method was tested on a gridworld problem known as the key-and-door maze. The experimental results show that good and deterministic policies are found within 45 trials in almost all simulations. In comparison with Wiering's HQ-learning system, which needs 20,000 trials for a stable solution, the proposed method obtains an optimal and stable policy far more quickly. The new method is therefore simple and effective in giving an optimal solution to the reinforcement learning problem.
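A minimal 2-D sketch of the classical APF force plus an escape rule standing in for the virtual water-flow concept; the gains and the perpendicular "flow" direction are illustrative assumptions, not the paper's exact rule:

```python
import numpy as np

def apf_force(pos, goal, obstacles, k_att=1.0, k_rep=100.0, d0=2.0):
    """Classical attractive/repulsive APF force on a 2-D position; gains
    and the influence range d0 are assumed values."""
    f = k_att * (goal - pos)                         # attractive term
    for obs in obstacles:
        d = np.linalg.norm(pos - obs)
        if 1e-6 < d < d0:                            # repulsion only within range d0
            f += k_rep * (1.0 / d - 1.0 / d0) * (pos - obs) / d**3
    return f

def step(pos, goal, obstacles, lr=0.05):
    f = apf_force(pos, goal, obstacles)
    if np.linalg.norm(f) < 1e-3 and np.linalg.norm(goal - pos) > 1e-2:
        # Local minimum: inject a virtual flow perpendicular to the goal
        # direction so the agent "drains" around the trap; an illustrative
        # stand-in for the paper's virtual water-flow rule.
        g = goal - pos
        f = np.array([-g[1], g[0]])
    return pos + lr * f / (np.linalg.norm(f) + 1e-9)
```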
To solve the problem of multi-target hunting by an unmanned surface vehicle (USV) fleet, a hunting algorithm based on multi-agent reinforcement learning is proposed. Firstly, the hunting environment and a kinematic model without boundary constraints are built, and the criteria for successful target capture are given. Then, the cooperative hunting problem of a USV fleet is modeled as a decentralized partially observable Markov decision process (Dec-POMDP), and a distributed partially observable multi-target hunting proximal policy optimization (DPOMH-PPO) algorithm applicable to USVs is proposed. In addition, an observation model, a reward function, and an action space applicable to multi-target hunting tasks are designed. To deal with the dynamically changing dimension of the observational features input by partially observable systems, a feature embedding block is proposed: by combining two feature compression methods, column-wise max pooling (CMP) and column-wise average pooling (CAP), an observational feature encoding is established. Finally, the centralized-training, decentralized-execution framework is adopted to train the hunting strategy; each USV in the fleet shares the same policy and performs actions independently. Simulation experiments verify the effectiveness of the DPOMH-PPO algorithm in test scenarios with different numbers of USVs. Moreover, the advantages of the proposed model are comprehensively analyzed in terms of algorithm performance, the migration effect across task scenarios, and self-organization capability after damage, verifying the potential deployment and application of DPOMH-PPO in real environments.
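The feature embedding block is concrete enough to sketch directly: column-wise max pooling and column-wise average pooling over a variable number of per-target feature rows, concatenated into a fixed-length code (the feature dimension here is arbitrary):

```python
import torch

def embed_observations(feats):
    """Compress a variable-size set of per-target feature rows into a
    fixed-length code by concatenating column-wise max pooling (CMP) and
    column-wise average pooling (CAP). feats: tensor of shape (n_targets, d),
    where n_targets may change between steps."""
    cmp_code = feats.max(dim=0).values       # (d,) column-wise max
    cap_code = feats.mean(dim=0)             # (d,) column-wise average
    return torch.cat([cmp_code, cap_code])   # (2d,) fixed-length encoding

# Three observed targets and five observed targets map to the same size.
print(embed_observations(torch.randn(3, 8)).shape)  # torch.Size([16])
print(embed_observations(torch.randn(5, 8)).shape)  # torch.Size([16])
```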
In real-time strategy (RTS) games, the ability to recognize other players' goals is important for creating artificial intelligence (AI) players. However, most current goal recognition methods do not take players' deceptive behavior into account, even though it often occurs in RTS game scenarios, resulting in poor recognition results. To solve this problem, this paper proposes goal recognition for deceptive agents, an extended goal recognition method that applies deductive reasoning (from the general to the specific) to model a deceptive agent's behavioral strategy. First of all, a general deceptive behavior model is proposed to abstract the features of deception; these features are then used to construct the behavior strategy that best matches the deceiver's historical behavior data via the inverse reinforcement learning (IRL) method. Finally, to interfere with the implementation of deceptive behavior, we construct a game model to describe the confrontation scenario and the most effective interference measures.
In this paper, a reinforcement learning-based multi-battery energy storage system (MBESS) scheduling policy is proposed to minimize consumers' electricity cost. The MBESS scheduling problem is modeled as a Markov decision process (MDP) with unknown transition probability. However, the optimal value function is time-dependent and difficult to obtain because of the periodicity of the electricity price and residential load. Therefore, a series of time-independent action-value functions is proposed, one describing each period of the day. To approximate each action-value function, a corresponding critic network is established and cascaded with the other critic networks according to the time sequence; the continuous management strategy is then obtained from the related action network. Moreover, a two-stage learning protocol comprising offline and online learning stages is provided for detailed implementation in real-time battery management. Numerical experimental examples demonstrate the effectiveness of the developed algorithm.
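A sketch of the cascading idea: one small critic per period of the day, each bootstrapping from the next period's critic; the period count, network sizes, and the state-action encoding are assumptions:

```python
import torch
import torch.nn as nn

T = 24  # one critic per hour of the day (the period length is an assumption)

critics = nn.ModuleList(
    nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
    for _ in range(T)
)

def td_target(t, reward, next_sa, gamma=0.99):
    """The critic for period t bootstraps from the critic of period t+1
    (mod T), cascading the networks along the daily time sequence.
    next_sa: (batch, 4) state-action encoding, dimension assumed."""
    with torch.no_grad():
        return reward + gamma * critics[(t + 1) % T](next_sa)
```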
Unmanned aerial vehicle (UAV) swarm technology is one of the research hotspots of recent years. With the continuous improvement of the autonomous intelligence of UAVs, swarm technology will become one of the main trends of UAV development in the future. This paper studies the behavior decision-making process of a UAV swarm rendezvous task based on the double deep Q-network (DDQN) algorithm. We design a guided reward function to effectively solve the convergence problem caused by sparse returns in deep reinforcement learning (DRL) for long-duration tasks. We also propose the concept of a temporary storage area, optimizing the memory replay unit of the traditional DDQN algorithm, improving the convergence speed of the algorithm, and speeding up the training process. Unlike traditional task environments, this paper establishes a continuous state-space task environment model to improve the authenticity of the UAV task environment. Based on the DDQN algorithm, the collaborative tasks of the UAV swarm in different scenarios are trained. The experimental results validate that the DDQN algorithm is efficient at training the UAV swarm to complete the given collaborative tasks while meeting the swarm's requirements for centralization and autonomy, improving the intelligence of collaborative task execution. The simulation results show that, after training, the proposed UAV swarm carries out the rendezvous task well, with a mission success rate of 90%.
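The double-DQN target underlying DDQN can be sketched directly (the paper's temporary-storage-area modification to the replay unit is not detailed in the abstract, so it is omitted here):

```python
import torch

def ddqn_target(reward, next_state, done, online_net, target_net, gamma=0.99):
    """Double DQN target: the online network selects the greedy next action,
    the target network evaluates it, which curbs Q-value overestimation."""
    with torch.no_grad():
        a_star = online_net(next_state).argmax(dim=1, keepdim=True)
        q_next = target_net(next_state).gather(1, a_star).squeeze(1)
        return reward + gamma * (1.0 - done) * q_next
```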
In the world, most successes are the result of long-term effort. The reward of success is extremely high, but before that, a long-term investment process is required. People who are "myopic" only value short-term rewards and are unwilling to make early-stage investments, so they rarely achieve ultimate success and its corresponding high rewards. Similarly, for a reinforcement learning (RL) model with long-delayed rewards, the discount rate determines the strength of the agent's "farsightedness". In order to enable the trained agent to make a chain of correct choices and finally succeed, this paper first obtains the feasible region of the discount rate through mathematical derivation; this region satisfies the "farsightedness" requirement of the agent. Afterwards, to avoid the complicated problem of solving implicit equations when choosing feasible solutions, a simple method is explored and verified by theoretical demonstration and mathematical experiments. Then, a series of RL experiments is designed and implemented to verify the validity of the theory. Finally, the model is extended from the finite process to the infinite process, and the validity of the extended model is verified by theory and experiments. The research as a whole not only reveals the significance of the discount rate, but also provides a theoretical basis and a practical method for choosing the discount rate in future research.
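The paper's derivation is not reproduced in the abstract, but the trade-off can be illustrated with a toy instance: pay cost c per step for n steps and then collect R, versus taking an immediate reward r; the feasible discount rates are those that make the long path win. All numbers below are invented:

```python
import numpy as np

def farsighted_gap(gamma, n=10, R=100.0, c=1.0, r=5.0):
    """Toy model (not the paper's exact setup): discounted value of the
    long path (cost c per step for n steps, then reward R) minus the
    immediate alternative r. Positive gap means the agent is 'farsighted'."""
    long_term = gamma**n * R - c * (1 - gamma**n) / (1 - gamma)
    return long_term - r

# Scan for the feasible region of the discount rate in this toy instance.
gammas = np.linspace(0.01, 0.999, 500)
feasible = gammas[[farsighted_gap(g) > 0 for g in gammas]]
print(f"farsighted for gamma >= {feasible.min():.3f}")
```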
To solve the problem of the low interference success rate of air defense missile radio fuzes, which stems from the unified interference form of traditional fuze interference systems, an interference decision method based on the Q-learning algorithm is proposed. First, the distance between the missile and the target is divided into multiple states, enlarging the state space. Second, a multidimensional action space whose search range changes with the missile-target distance is used to select parameters, minimizing the number of ineffective interference parameters; the interference effect is determined by detecting whether the fuze signal disappears. Finally, a weighted reward function determines the reward value based on the range state, output power, and parameter quantity of the interference form. The effectiveness of the proposed method in selecting the range of action-space parameters and in designing the discrimination degree of the reward function is verified through offline experiments involving full-range missile rendezvous, and the optimal interference form for each distance state is obtained. Compared with the single-interference decision method, the proposed method effectively improves the success rate of interference.
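A hedged sketch of such a weighted reward; the weights, normalizers, and failure penalty are assumed, with only the three ingredients (range state, output power, parameter quantity) taken from the abstract:

```python
def interference_reward(jam_success, range_state, power, n_params,
                        w_r=0.5, w_p=0.3, w_n=0.2, p_max=100.0, n_max=10):
    """Weighted reward for one interference action: reward success, then
    trade off the range state, emitted power, and the number of interference
    parameters used. Weights and normalizers are illustrative assumptions."""
    if not jam_success:                        # the fuze signal did not disappear
        return -1.0
    return (w_r * range_state                  # range_state in [0, 1], pre-scaled
            + w_p * (1.0 - power / p_max)      # prefer low output power
            + w_n * (1.0 - n_params / n_max))  # prefer few parameters
```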
This work proposes a recorded recurrent twin delayed deep deterministic (RRTD3) policy gradient algorithm to solve the challenge of constructing guidance laws for intercepting endoatmospheric maneuvering missiles under uncertainties and observation noise. The attack-defense engagement scenario is modeled as a partially observable Markov decision process (POMDP). Given the benefits of recurrent neural networks (RNNs) in processing sequence information, an RNN layer is incorporated into the agent's policy network to alleviate the bottleneck of traditional deep reinforcement learning methods when dealing with POMDPs. The measurements from the interceptor's seeker during each guidance cycle are combined into one sequence as the input to the policy network, since the detection frequency of an interceptor is usually higher than its guidance frequency. During training, the hidden states of the RNN layer in the policy network are recorded to overcome the partial observability that this RNN layer itself introduces inside the agent. The training curves show that the proposed RRTD3 successfully enhances data efficiency, training speed, and training stability, and the test results confirm the advantages of RRTD3-based guidance laws over some conventional guidance laws.
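A minimal sketch of a recurrent TD3-style actor consuming one guidance cycle's seeker measurements as a sequence; the GRU choice, layer sizes, and observation dimension are assumptions:

```python
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Actor with a GRU layer over the seeker measurements gathered within
    one guidance cycle (several detections per guidance step). Sizes and
    the GRU choice are illustrative assumptions."""
    def __init__(self, obs_dim=6, hidden=128, act_dim=1):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(),
                                  nn.Linear(64, act_dim), nn.Tanh())

    def forward(self, seq, h0=None):
        # seq: (batch, n_detections_per_cycle, obs_dim). The hidden state h
        # is what training must record to keep stored transitions consistent.
        out, h = self.rnn(seq, h0)
        return self.head(out[:, -1]), h
```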
Autonomous unmanned aerial vehicle (UAV) manipulation is necessary for the defense department to execute tactical missions given by commanders in the future unmanned battlefield. A large amount of research has been devoted to improving the autonomous decision-making ability of UAVs in interactive environments, where finding the optimal maneuvering decision-making policy has become one of the key issues for enabling UAV intelligence. In this paper, we propose a maneuvering decision-making algorithm for autonomous air-delivery based on deep reinforcement learning under the guidance of expert experience. Specifically, we refine the guidance-towards-area and guidance-towards-specific-point tasks of the air-delivery process based on traditional air-to-surface fire control methods, and we construct the UAV maneuvering decision-making model based on Markov decision processes (MDPs). We then present a reward shaping method for the two tasks using a potential-based function and expert-guided advice. The proposed algorithm accelerates the convergence of the maneuvering decision-making policy and increases the stability of the policy's output during the later stage of the training process. The effectiveness of the proposed maneuvering decision-making policy is illustrated by the curves of the training parameters and by extensive experimental results from testing the trained policy.
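Potential-based shaping has a standard form, F(s, s') = γΦ(s') − Φ(s), which provably preserves the optimal policy; a sketch with an invented distance-based potential standing in for the paper's expert-guided Φ:

```python
import numpy as np

def shaped_reward(r_env, s, s_next, phi, gamma=0.99):
    """Potential-based shaping: add F = gamma * phi(s') - phi(s) to the
    environment reward; phi would encode the guidance-towards-area /
    guidance-towards-point heuristics and expert advice (the phi below is
    an invented stand-in)."""
    return r_env + gamma * phi(s_next) - phi(s)

# Illustrative potential: negative distance to an assumed delivery point.
target = np.array([1000.0, 0.0, 500.0])
phi = lambda s: -np.linalg.norm(np.asarray(s) - target)
print(shaped_reward(0.0, [0.0, 0.0, 0.0], [10.0, 0.0, 5.0], phi))  # > 0 when moving closer
```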
This paper mainly focuses on the development of a learning-based controller for a class of uncertain mechanical systems modeled by the Euler-Lagrange formulation. The considered system can depict the behavior of a large class of engineering systems, such as vehicular systems, robot manipulators, and satellites. All these systems are often characterized by highly nonlinear dynamics, heavy modeling uncertainties, and unknown perturbations, so accurate-model-based nonlinear control approaches become unavailable. Motivated by this challenge, a reinforcement learning (RL) adaptive control methodology based on the actor-critic framework is investigated to compensate for the uncertain mechanical dynamics. The approximation inaccuracies caused by RL and the exogenous unknown disturbances are circumvented via a continuous robust integral of the sign of the error (RISE) control approach. Unlike a classical RISE control law, a tanh(·) function is utilized instead of a sign(·) function to acquire a smoother control signal. The developed controller requires very little prior knowledge of the dynamic model, is robust to unknown dynamics and exogenous disturbances, and achieves asymptotic output tracking. Eventually, co-simulations through ADAMS and MATLAB/Simulink on a three-degrees-of-freedom (3-DOF) manipulator and experiments on a real-time electromechanical servo system verify the performance of the proposed approach.
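The tanh-for-sign substitution in the RISE term can be sketched in discrete time; the gains, sampling step, and boundary-layer width eps are illustrative, and the RL compensator that the paper couples with this law is omitted:

```python
import numpy as np

def make_rise_controller(ks=5.0, alpha=2.0, beta=1.5, eps=0.05, dt=1e-3):
    """Discrete-time RISE-style feedback with tanh(e / eps) replacing
    sign(e) for a smoother control signal near e = 0. All constants and
    the scalar-error form are illustrative assumptions."""
    integ = 0.0
    def control(e):
        nonlocal integ
        # integral term accumulates the proportional part and the smoothed sign
        integ += ((ks + 1.0) * alpha * e + beta * np.tanh(e / eps)) * dt
        return (ks + 1.0) * e + integ
    return control

u = make_rise_controller()
print(u(0.1), u(0.05))  # the control signal stays smooth as the error shrinks
```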
Equipment development planning (EDP) is usually a long-term process performed in an environment with high uncertainty. Traditional multi-stage dynamic programming cannot cope with this kind of uncertainty and its unpredictable situations. To deal with this problem, a multi-stage EDP model based on a deep reinforcement learning (DRL) algorithm is proposed that can respond quickly to any environmental change within a reasonable range. Firstly, the basic problem of multi-stage EDP is described, and a mathematical planning model is constructed. Then, for two kinds of uncertainty (future capability requirements and the amount of investment in each stage), a corresponding DRL framework is designed, defining the environment, state, action, and reward function for multi-stage EDP. After that, the dueling deep Q-network (Dueling DQN) algorithm is used to solve the multi-stage EDP and generate an approximately optimal multi-stage equipment development scheme. Finally, a case of ten kinds of equipment in 100 randomly generated possible environments is used to test the feasibility and effectiveness of the proposed models. The results show that the algorithm responds instantaneously in any state of the multi-stage EDP environment and, unlike traditional algorithms, does not need to re-optimize the problem when the environment changes. In addition, the algorithm can flexibly adjust in subsequent planning stages if the equipment capability requirements change, adapting to the new requirements.
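The Dueling DQN head has a standard aggregation, Q(s, a) = V(s) + A(s, a) - mean_a A(s, a); a sketch with placeholder dimensions for the EDP state and the equipment-investment actions:

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling head: a state-value stream V(s) and an advantage stream
    A(s, a), recombined with the mean-subtraction identifiability trick.
    The dimensions are placeholders for the paper's EDP encoding."""
    def __init__(self, state_dim=32, n_actions=10, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.adv = nn.Linear(hidden, n_actions)

    def forward(self, s):
        z = self.trunk(s)
        v, a = self.value(z), self.adv(z)
        return v + a - a.mean(dim=1, keepdim=True)  # Q(s, a)
```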
Funding: National Key Research and Development Program (2021YFB2900604).
Funding: supported by the National Natural Science Foundation of China (62003021, 91212304).
Funding: supported by the Aeronautical Science Foundation of China (2017ZC53033) and the Seed Foundation of Innovation and Creation for Graduate Students in Northwestern Polytechnical University (CX2020156).
Funding: supported by the National Natural Science Foundation of China (62173251), the "Zhishan" Scholars Program of Southeast University, the Fundamental Research Funds for the Central Universities, and the Shanghai Gaofeng & Gaoyuan Project for University Academic Program Development (22120210022).
Funding: supported by the National Defense Science and Technology Innovation (18-163-15-LZ-001-004-13).
Funding: the Project of the National Natural Science Foundation of China (Grant No. 62106283), the Project of the National Natural Science Foundation of China (Grant No. 72001214), which provided funds for conducting experiments, and the Project of the Natural Science Foundation of Shaanxi Province (Grant No. 2020JQ-484).
Funding: supported by the National Key R&D Program of China (2017YFB1400105).
Funding: supported by the National Key R&D Program of China (2018AAA0101400), the National Natural Science Foundation of China (62173251, 61921004, U1713209), the Natural Science Foundation of Jiangsu Province of China (BK20202006), and the Guangdong Provincial Key Laboratory of Intelligent Decision and Cooperative Control.
Funding: supported by the National Natural Science Foundation of China (61973175, 61973172) and the Tianjin Natural Science Foundation (19JCZDJC32800).
Funding: Projects (30270496, 60075019, 60575012) supported by the National Natural Science Foundation of China.
Funding: financial support from the National Natural Science Foundation of China (Grant No. 61601491), the Natural Science Foundation of Hubei Province, China (Grant No. 2018CFC865), and the Military Research Project of China (Grant No. YJ2020B117).
Funding: supported by the National Key R&D Program of China (2018AAA0101400), the National Natural Science Foundation of China (61921004, 62173251, U1713209, 62236002), the Fundamental Research Funds for the Central Universities, and the Guangdong Provincial Key Laboratory of Intelligent Decision and Cooperative Control.
Funding: supported by the Aeronautical Science Foundation (2017ZC53033).
Funding: supported by the National Natural Science Foundation of China (71771216, 71701209, 72001214).
Funding: National Natural Science Foundation of China (61973037) and National 173 Program Project (2019-JCJQ-ZD-324).
Funding: supported by the National Natural Science Foundation of China (Grant No. 12072090).
Funding: supported by the Key Research and Development Program of Shaanxi (2022GXLH-02-09), the Aeronautical Science Foundation of China (20200051053001), and the Natural Science Basic Research Program of Shaanxi (2020JM-147).
Funding: supported in part by the National Key R&D Program of China under Grant 2021YFB2011300 and the National Natural Science Foundation of China under Grant 52075262.
Funding: supported by the National Natural Science Foundation of China (71690233, 72001209) and the Scientific Research Foundation of the National University of Defense Technology (ZK19-16).