Funding: Supported by the National Key R&D Program of China: Gravitational Wave Detection Project (Grant Nos. 2021YFC22026, 2021YFC2202601, 2021YFC2202603) and the National Natural Science Foundation of China (Grant Nos. 12172288 and 12472046).
Abstract: This paper investigates impulsive orbital attack-defense (AD) games under multiple constraints and victory conditions, involving three spacecraft: attacker, target, and defender. In the AD scenario, the attacker aims to breach the defender's interception to rendezvous with the target, while the defender seeks to protect the target by blocking or actively pursuing the attacker. Four different maneuvering constraints and five potential game outcomes are incorporated to more accurately model AD game problems and increase complexity, thereby reducing the effectiveness of traditional methods such as differential games and game-tree searches. To address these challenges, this study proposes a multi-agent deep reinforcement learning solution with variable reward functions. Two attack strategies, Direct attack (DA) and Bypass attack (BA), are developed for the attacker, each focusing on different mission priorities. Similarly, two defense strategies, Direct interdiction (DI) and Collinear interdiction (CI), are designed for the defender, each optimizing specific defensive actions through tailored reward functions. Each reward function incorporates both process rewards (e.g., distance and angle) and outcome rewards, derived from physical principles and validated via geometric analysis. Extensive simulations of four strategy confrontations demonstrate average defensive success rates of 75% for DI vs. DA, 40% for DI vs. BA, 80% for CI vs. DA, and 70% for CI vs. BA. Results indicate that CI outperforms DI for defenders, while BA outperforms DA for attackers. Moreover, defenders achieve their objectives more effectively under identical maneuvering capabilities. Trajectory evolution analyses further illustrate the effectiveness of the proposed variable reward function-driven strategies. These strategies and analyses offer valuable guidance for practical orbital defense scenarios and lay a foundation for future multi-agent game research.
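To make the variable reward design described above concrete, the following is a minimal sketch of a defender-side reward that combines process terms (distance and angle) with outcome terms. All term names, weights, and terminal events are illustrative assumptions, not the paper's actual reward functions.

```python
def defender_reward(d_def_att, angle_dev, intercepted, target_reached,
                    w_dist=1.0, w_angle=0.5, r_win=100.0, r_lose=-100.0):
    """Illustrative variable reward for a defender (Collinear-interdiction style).

    Process terms penalize distance to the attacker and deviation from the
    target-defender-attacker line; outcome terms reward interception and
    penalize a successful attacker rendezvous. Weights are placeholders.
    """
    r = -w_dist * d_def_att - w_angle * angle_dev   # process reward (distance, angle)
    if intercepted:                                  # outcome reward: defense succeeded
        r += r_win
    elif target_reached:                             # outcome penalty: attacker reached target
        r += r_lose
    return r

# example step: 50 km separation, 0.2 rad off the collinear line, no terminal event yet
print(defender_reward(d_def_att=50.0, angle_dev=0.2,
                      intercepted=False, target_reached=False))
```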
Funding: Supported by the National Natural Science Foundation of China (71973001).
Abstract: To explore the green development of automobile enterprises and promote the achievement of the "dual carbon" target, based on bounded rationality assumptions, this study constructed a tripartite evolutionary game model of government, commercial banks, and automobile enterprises; introduced a dynamic reward and punishment mechanism; and analyzed the development of the three parties' strategic behavior under the static and dynamic reward and punishment mechanisms. Vensim PLE was used for numerical simulation analysis. Our results indicate that the system could not reach a stable state under the static reward and punishment mechanism. A dynamic reward and punishment mechanism can effectively improve system stability and better fit real situations. Under the dynamic reward and punishment mechanism, an increase in the initial probabilities of the three parties can promote system stability, and the government can implement effective supervision by adjusting the upper limit of the reward and punishment intensity. Finally, the implementation of green credit by commercial banks plays a significant role in promoting the green development of automobile enterprises.
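As a rough illustration of such a tripartite model, the sketch below runs replicator dynamics for the three strategy probabilities with a penalty and subsidy that scale dynamically with the enterprises' behavior. The payoff numbers and functional forms are placeholder assumptions, not the paper's calibrated Vensim PLE model.

```python
def replicator_step(x, y, z, dt=0.01, F_max=10.0, S_max=8.0):
    """One Euler step of illustrative replicator dynamics for government (x),
    commercial banks (y), and automobile enterprises (z).

    The dynamic mechanism scales penalty/reward with the enterprises' non-green
    probability, e.g. penalty = F_max * (1 - z); all payoff numbers below are
    placeholders rather than the paper's values.
    """
    penalty = F_max * (1 - z)          # dynamic punishment, capped by F_max
    subsidy = S_max * (1 - z)          # dynamic reward, capped by S_max
    # hypothetical expected-payoff differences driving each strategy share
    dx = x * (1 - x) * (penalty * (1 - z) - 2.0)           # regulate vs. not
    dy = y * (1 - y) * (subsidy * x - 1.0)                 # green credit vs. not
    dz = z * (1 - z) * (subsidy * y + penalty * x - 3.0)   # green production vs. not
    return x + dx * dt, y + dy * dt, z + dz * dt

x, y, z = 0.5, 0.5, 0.5
for _ in range(5000):
    x, y, z = replicator_step(x, y, z)
print(round(x, 3), round(y, 3), round(z, 3))
```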
Abstract: Future unmanned battles urgently require intelligent combat policies, and multi-agent reinforcement learning offers a promising solution. However, due to the complexity of combat operations and the large size of the combat group, this task suffers from the credit assignment problem more than other reinforcement learning tasks. This study uses reward shaping to relieve the credit assignment problem and improve policy training for the new generation of large-scale unmanned combat operations. We first prove that multiple reward shaping functions do not change the Nash equilibrium in stochastic games, providing theoretical support for their use. According to the characteristics of combat operations, we propose tactical reward shaping (TRS), which comprises maneuver shaping advice and threat assessment-based attack shaping advice. Then, we investigate the effects of different types and combinations of shaping advice on combat policies through experiments. The results show that TRS improves both the efficiency and attack accuracy of combat policies, with the combination of maneuver reward shaping advice and ally-focused attack shaping advice achieving the best performance compared with the baseline strategy.
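The equilibrium-preserving property claimed above is the hallmark of potential-based reward shaping; the sketch below shows how several shaping terms (e.g., maneuver advice and threat-based attack advice) can be combined in that form. The potential functions and state fields are hypothetical, not the paper's exact TRS terms.

```python
def shaped_reward(r, s, s_next, gamma, potentials):
    """Potential-based shaping: r' = r + sum_i [gamma * phi_i(s') - phi_i(s)].

    Combining several shaping terms this way leaves optimal/equilibrium policies
    unchanged. The potential functions below are illustrative placeholders.
    """
    return r + sum(gamma * phi(s_next) - phi(s) for phi in potentials)

# hypothetical potentials: negative distance to a waypoint, negative threat level
maneuver_phi = lambda s: -s["dist_to_waypoint"]
threat_phi   = lambda s: -s["threat_level"]

s      = {"dist_to_waypoint": 12.0, "threat_level": 0.6}
s_next = {"dist_to_waypoint": 10.0, "threat_level": 0.4}
print(shaped_reward(r=0.0, s=s, s_next=s_next, gamma=0.99,
                    potentials=[maneuver_phi, threat_phi]))
```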
Funding: Supported by the National Natural Science Foundation of China (71771216, 71701209, 72001214).
Abstract: In the world, most successes are the result of long-term effort. The reward of success is extremely high, but a long-term investment process is required beforehand. People who are "myopic" value only short-term rewards and are unwilling to make early-stage investments, so they rarely achieve ultimate success and the corresponding high rewards. Similarly, for a reinforcement learning (RL) model with long-delayed rewards, the discount rate determines the strength of the agent's "farsightedness". In order to enable the trained agent to make a chain of correct choices and finally succeed, this paper first obtains the feasible region of the discount rate through mathematical derivation; this region satisfies the "farsightedness" requirement of the agent. Afterwards, to avoid the complicated problem of solving implicit equations when choosing feasible solutions, a simple method is explored and verified by theoretical demonstration and mathematical experiments. Then, a series of RL experiments are designed and implemented to verify the validity of the theory. Finally, the model is extended from the finite process to the infinite process, and the validity of the extended model is verified by theory and experiments. The whole study not only reveals the significance of the discount rate, but also provides a theoretical basis as well as a practical method for choosing the discount rate in future research.
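The kind of feasibility condition described above can be explored numerically; the sketch below searches for the smallest discount rate at which a single delayed reward outweighs repeatedly taking a small immediate reward. The reward values and horizon are illustrative, not the paper's settings or its analytical derivation.

```python
import numpy as np

def min_farsighted_gamma(r_short, r_final, T, grid=100000):
    """Find (numerically) the smallest discount rate gamma in (0, 1) such that

        gamma**(T-1) * r_final  >  sum_{t=0}^{T-1} gamma**t * r_short,

    i.e. the discounted delayed reward beats grabbing the immediate reward at
    every step. Values are illustrative placeholders.
    """
    gammas = np.linspace(1e-4, 1 - 1e-4, grid)
    lhs = gammas ** (T - 1) * r_final
    rhs = r_short * (1 - gammas ** T) / (1 - gammas)   # finite geometric series
    feasible = gammas[lhs > rhs]
    return feasible[0] if feasible.size else None

# e.g. a reward of 100 after 10 steps vs. a myopic reward of 1 per step
print(min_farsighted_gamma(r_short=1.0, r_final=100.0, T=10))
```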
Funding: National Basic Research Program of China (2015CB553504), National Natural Science Foundation of China (81471350, 81671321), and Natural Science Foundation of Ningbo Municipality, Zhejiang Province, China (2017A610214).
Abstract: OBJECTIVE Glutamatergic projections from the prefrontal cortex (PFC) to the nucleus accumbens (NAc) regulate dopamine (DA) release in the NAc. However, it is not clear whether this circuit affects the reward and motivation of heroin addiction. Our study investigates the effects of metabotropic glutamate receptor 2/3 (mGluR2/3) and of the projections from the ventromedial prefrontal cortex (vmPFC) to the NAc shell on the reward and motivation of heroin-addicted rats. METHODS First, rats were trained in heroin self-administration for 14 d. On the 15th day, some rats were injected systemically with the mGluR2/3 agonist LY379268 (0.1, 0.3, and 1.0 mg·kg^(-1), ip), while other rats were bilaterally microinjected with LY379268 (0.3 and 1.0 g·L^(-1)) at a volume of 0.5 μL into the ventral tegmental area (VTA), NAc core, or NAc shell, respectively. All rats then underwent heroin self-administration testing under a fixed ratio 1 (FR1) schedule or a progressive ratio (PR) schedule to observe the effect of LY379268 on heroin reward or motivation. Second, rats were injected with a chemogenetic glutamatergic virus (pAAV-CaMKIIa-hM3D(Gq)-mCherry or pAOV-CaMKIIa-hM4D(Gi)-mCherry-3Flag) or a negative control virus in the vmPFC and trained in heroin self-administration for 14 d. On the 15th day, rats were bilaterally microinjected with clozapine-N-oxide (CNO, 1 mmol·L^(-1), 0.5 μL) into the NAc shell and tested for effects on heroin reward or motivation. Finally, rats were injected with an optogenetic glutamatergic virus (AAV2/9-CaMKII-hChR2-EYFP) or a negative control virus in the vmPFC, implanted with a 16-channel photoelectrode in the ipsilateral NAc shell, and trained in heroin self-administration for 14 d. On the 15th day, heroin reward was tested under the FR1 procedure with blue light stimulation at a wavelength of 470 nm, a frequency of 25 Hz, and a power of 5 mW, with each stimulation lasting 1 h followed by a 1 h interval. Spike changes in NAc shell neurons before and after stimulation were recorded. RESULTS LY379268 dose-dependently attenuated heroin reward and motivation, and the effective local site was mainly the NAc shell. Chemogenetic results showed that activation or inactivation of the projection from the vmPFC to the NAc shell enhanced or attenuated heroin reward and motivation, respectively. Optogenetic stimulation of the same projection also enhanced heroin reward, and tonic neuronal firing in the NAc shell was observed during the light stimulation session. CONCLUSION mGluR2/3 activation in the NAc shell is involved in the inhibition of heroin reward and motivation. Activation of the projection from the PFC to the NAc shell can enhance heroin reward and motivation.
Abstract: Unlike most existing research, which focuses on a single futures contract and lacks comparison across different periods, this paper describes the statistical characteristics of the wheat futures return time series of the Zhengzhou Commodity Exchange over the most recent three years. Besides basic statistical analysis, the paper uses GARCH and EGARCH models to describe the series exhibiting the ARCH effect and analyzes the persistence of volatility shocks and the leverage effect. The results show that, compared with a normal distribution, the wheat futures return series are non-normal, leptokurtic, and heavy-tailed. The study also finds that two of the return series have no autocorrelation, and that among the six correlated series, three present the ARCH effect. Using the autoregressive distributed lag model, the GARCH model, and the EGARCH model, the paper demonstrates the persistence of volatility shocks and the leverage effect in the wheat futures return time series. The results reveal that, on the one hand, the statistical characteristics of wheat futures returns are broadly similar to those of mature futures markets abroad; on the other hand, they reflect shortcomings of the Chinese futures market, such as its immaturity and excessive government control.
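For readers who want to reproduce this style of analysis, the sketch below fits GARCH(1,1) and EGARCH(1,1) models with Python's arch package on a simulated heavy-tailed series standing in for the wheat futures returns; the Zhengzhou data themselves are not reproduced here.

```python
import numpy as np
from arch import arch_model

rng = np.random.default_rng(0)
returns = rng.standard_t(df=5, size=750) * 0.8        # heavy-tailed stand-in series

# GARCH(1,1): alpha[1] + beta[1] close to 1 indicates persistent volatility shocks
garch = arch_model(returns, vol="GARCH", p=1, q=1, dist="t").fit(disp="off")
print(garch.params[["alpha[1]", "beta[1]"]])

# EGARCH(1,1) with an asymmetry term: a nonzero gamma[1] indicates a leverage effect
egarch = arch_model(returns, vol="EGARCH", p=1, o=1, q=1, dist="t").fit(disp="off")
print(egarch.params["gamma[1]"])
```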
Abstract: In multi-QoS (quality of service) scheduling problems, existing work relies only on immediate reward feedback, which leads to poor scalability and wasted resources when handling latency-sensitive data and media data with continuous transmission requirements under resource-constrained conditions. To address this, a reward backtracking based deep Q-network (RB-DQN) algorithm is proposed. The algorithm uses interactions at future time steps to backtrack and adjust the policy evaluation of the current state, so that packet losses caused by unreasonable scheduling policies can be identified and resolved more effectively. In addition, a latency-throughput trade-off (LTT) metric is designed, which jointly considers the service requirements of latency-sensitive data and media-type data and whose weights can be adjusted to emphasize different priorities. Extensive simulation results show that, compared with other scheduling policies, the proposed algorithm effectively reduces the delay and jitter of latency-sensitive data while ensuring the smoothness and stability of media-type data.
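As an illustration of the reward-backtracking idea, the sketch below shows a replay buffer that, when a later event such as a packet drop is detected, propagates a penalty back onto recently stored transitions. The window, decay, and penalty values are assumptions, not the RB-DQN paper's exact update rule.

```python
from collections import deque

class BacktrackingBuffer:
    """Illustrative replay buffer with reward backtracking: when a later event
    (e.g., a packet drop caused by an earlier scheduling decision) is observed,
    a penalty is distributed over the stored transitions within a window.
    This is a sketch of the idea, not the paper's exact RB-DQN algorithm.
    """
    def __init__(self, maxlen=10000):
        self.buffer = deque(maxlen=maxlen)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append([state, action, reward, next_state, done])

    def backtrack(self, penalty, window=5, decay=0.8):
        # spread the penalty over the most recent `window` transitions,
        # attenuated the further back the decision lies
        for i, idx in enumerate(range(len(self.buffer) - 1, -1, -1)):
            if i >= window:
                break
            self.buffer[idx][2] += penalty * (decay ** i)

buf = BacktrackingBuffer()
for t in range(6):
    buf.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
buf.backtrack(penalty=-10.0)          # e.g., a latency-sensitive packet was dropped
print([round(tr[2], 2) for tr in buf.buffer])
```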