Abstract: An alpha-uniformized Markov chain is defined via the concept of an equivalent infinitesimal generator for a semi-Markov decision process (SMDP) under both the average and discounted criteria. From the relations between their performance measures and performance potentials, the optimization of an SMDP can be carried out by simulating this chain. For the critic model of neuro-dynamic programming (NDP), a neuro-policy iteration (NPI) algorithm is presented, and a bound on the performance error is derived when each iteration step incurs both approximation error and improvement error. The results extend to Markov systems and are widely applicable. Finally, a numerical example is provided.
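The alpha-uniformization that this abstract builds on can be illustrated with the standard uniformization construction: given a conservative infinitesimal generator A and any rate alpha no smaller than the largest exit rate, P = I + A/alpha is a row-stochastic matrix whose chain can be simulated in discrete time. A minimal numpy sketch, assuming the SMDP's equivalent generator has already been computed; the 3-state generator here is purely illustrative:

import numpy as np

def alpha_uniformize(A, alpha=None):
    """Uniformize a generator A into a DTMC transition matrix.

    Standard construction behind the abstract's alpha-uniformized chain:
    choose alpha >= max_i |A[i, i]|, then P = I + A / alpha is row-stochastic,
    and the discrete-time chain P can be simulated in place of the
    continuous-time one.
    """
    A = np.asarray(A, dtype=float)
    if alpha is None:
        alpha = np.max(-np.diag(A))           # fastest exit rate
    P = np.eye(A.shape[0]) + A / alpha
    assert np.all(P >= -1e-12) and np.allclose(P.sum(axis=1), 1.0)
    return P, alpha

# Illustrative 3-state generator (each row sums to zero).
A = np.array([[-2.0, 1.5, 0.5],
              [0.3, -1.0, 0.7],
              [0.2, 0.8, -1.0]])
P, alpha = alpha_uniformize(A)               # here alpha = 2.0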
Abstract: To meet the needs of practical large-scale Markov systems, simulation-based learning optimization of Markov decision processes (MDPs) is studied. Starting from the definition of the performance potential, a unified temporal-difference formula is established under both the average and discounted criteria; a neural network represents the estimate of the performance potential, and a parametric TD(0) learning rule and algorithm are derived for approximate policy evaluation (see the sketch below). Then, based on the approximated potentials, approximate policy iteration realizes a neuro-dynamic programming (NDP) optimization method that is unified across the two criteria. The results also apply to semi-Markov decision processes. A numerical example shows that the neuro-policy iteration algorithm works under both criteria and verifies that the average-reward problem is the limit case of the discounted problem as the discount factor tends to zero.
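The parametric TD(0) update for the performance potential can be sketched with a linear approximator standing in for the paper's neural network. The unified form used here is an assumption: eta is taken as the running average reward under the average criterion and as zero under the discounted one, with beta the discount factor (beta = 1 in the average case):

import numpy as np

def td0_step(w, phi_s, phi_next, r, eta, beta, lr=0.01):
    """One parametric TD(0) update for the potential g(s) ~ w @ phi(s).

    A plausible unified form of the abstract's temporal-difference formula:
    discounted criterion -> eta = 0, beta = discount factor;
    average criterion    -> beta = 1, eta = running average reward.
    Linear features phi stand in for the paper's neural-network estimator.
    """
    delta = r - eta + beta * (w @ phi_next) - (w @ phi_s)   # unified TD(0) error
    return w + lr * delta * phi_s

# Illustrative use: 4-dimensional one-hot features, discounted criterion.
w = np.zeros(4)
w = td0_step(w, np.array([1., 0., 0., 0.]), np.array([0., 1., 0., 0.]),
             r=1.0, eta=0.0, beta=0.9)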
Abstract: To address the difficulty of processing real-time sensing data of manufacturing resources in time in a cloud-edge collaborative cloud manufacturing environment, and considering uncertainties such as limited edge computing resources, dynamically changing network states, and task loads, a cloud-edge collaborative joint offloading strategy based on mixed deep reinforcement learning (M-DRL) is proposed. First, a joint offloading model is built that combines discrete model offloading in the cloud with continuous task offloading at the edge. Second, the offloading optimization problem of minimizing the total cost of delay and energy consumption over consecutive time slots is formally defined as a Markov decision process (MDP). Finally, the M-DRL algorithm, which integrates the exploration strategies of DDPG and DQN and introduces a long short-term memory (LSTM) network into the network architecture, is used to solve this optimization problem (a structural sketch follows). Simulation results show that, compared with several existing offloading algorithms, M-DRL has good convergence and stability and significantly reduces the total system cost, providing an effective solution for the timely processing of manufacturing-resource sensing data.
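A minimal PyTorch sketch of the kind of hybrid policy this abstract describes: a shared LSTM over consecutive time slots, a DQN-style head for the discrete cloud-model choice, and a DDPG-style actor head for continuous edge offloading ratios, with epsilon-greedy plus Gaussian-noise exploration. All layer sizes, state/action dimensions, and exploration details are illustrative assumptions, not the paper's actual M-DRL architecture:

import torch
import torch.nn as nn

class HybridOffloadPolicy(nn.Module):
    def __init__(self, state_dim=8, n_cloud_models=4, edge_action_dim=2, hidden=64):
        super().__init__()
        # LSTM captures temporal structure across consecutive time slots.
        self.lstm = nn.LSTM(state_dim, hidden, batch_first=True)
        # DQN-style head: Q-values over discrete cloud model-offloading choices.
        self.q_head = nn.Linear(hidden, n_cloud_models)
        # DDPG-style actor head: continuous edge offloading ratios in [0, 1].
        self.actor_head = nn.Sequential(nn.Linear(hidden, edge_action_dim),
                                        nn.Sigmoid())

    def forward(self, state_seq):
        # state_seq: (batch, time_slots, state_dim)
        h, _ = self.lstm(state_seq)
        last = h[:, -1, :]                     # hidden state of the latest slot
        return self.q_head(last), self.actor_head(last)

def select_action(policy, state_seq, eps=0.1, noise_std=0.05):
    """Integrated exploration: epsilon-greedy on the discrete head (DQN-like),
    Gaussian noise on the continuous head (DDPG-like)."""
    q, ratios = policy(state_seq)
    if torch.rand(1).item() < eps:
        discrete = torch.randint(q.shape[-1], (q.shape[0],))
    else:
        discrete = q.argmax(dim=-1)
    continuous = (ratios + noise_std * torch.randn_like(ratios)).clamp(0.0, 1.0)
    return discrete, continuous

# Illustrative use: one batch of 5 consecutive time slots of sensed state.
policy = HybridOffloadPolicy()
discrete, continuous = select_action(policy, torch.randn(1, 5, 8))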