Deep Multi-Agent Reinforcement Learning-Based Scheduling Optimization for Mixed-Flow Assembly Lines of Machine Tools

    • Abstract: To ensure the on-time delivery of machine tools produced in a mixed-flow assembly shop, a scheduling optimization method for machine tool mixed-flow assembly lines based on improved deep multi-agent reinforcement learning is proposed, addressing the low solution quality and slow training of the minimum-delay production scheduling optimization model. A mixed-flow assembly line scheduling optimization model with the objective of minimizing delay time is constructed, and double deep Q network (DDQN) agents with decentralized execution are applied to learn the relationship between production information and scheduling objectives. The framework adopts a centralized-training, decentralized-execution strategy and uses parameter sharing, which allows it to handle the non-stationarity problem in multi-agent reinforcement learning. On this basis, a recurrent neural network is used to manage variable-length state and action representations, giving the agents the ability to handle problems of arbitrary size. A global/local reward function is also introduced to mitigate reward sparsity during training, and the optimal parameter combination is determined through ablation experiments. Numerical experiments show that, compared with the standard test scheme, the proposed algorithm improves goal attainment by 24.1% to 32.3% over the unimproved version and increases training speed by 8.3%.
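
    The abstract names the main algorithmic ingredients (double deep Q network agents, centralized training with decentralized execution via parameter sharing, and a recurrent encoder for variable-length state/action representations) but gives no implementation details. As a rough, hypothetical sketch only, with all names, dimensions, and hyperparameters assumed rather than taken from the paper, the following PyTorch fragment illustrates how a shared recurrent Q-network and a double-DQN target of this kind could be wired together.

```python
import torch
import torch.nn as nn

class SharedRecurrentQNet(nn.Module):
    """Hypothetical Q-network shared by all agents (parameter sharing).
    A GRU encodes a variable-length sequence of candidate-operation
    feature vectors; a linear head scores each candidate as a Q-value."""
    def __init__(self, feat_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, n_candidates, feat_dim); n_candidates may differ
        # between decision points, which is why a recurrent encoder is used.
        enc, _ = self.encoder(obs_seq)        # (batch, n_candidates, hidden)
        return self.q_head(enc).squeeze(-1)   # (batch, n_candidates) Q-values


def double_dqn_target(online: SharedRecurrentQNet,
                      target: SharedRecurrentQNet,
                      reward: torch.Tensor,
                      next_obs: torch.Tensor,
                      done: torch.Tensor,
                      gamma: float = 0.99) -> torch.Tensor:
    """Double-DQN target: the online net selects the next action,
    the target net evaluates it, reducing overestimation bias."""
    with torch.no_grad():
        next_a = online(next_obs).argmax(dim=1, keepdim=True)   # select action
        next_q = target(next_obs).gather(1, next_a).squeeze(1)  # evaluate action
        return reward + gamma * (1.0 - done) * next_q
```

    Because all agents act through the same network parameters, decentralized execution only requires each agent to run a forward pass on its own local observation; this is one plausible reading of the parameter-sharing scheme described in the abstract, not the authors' actual implementation.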
