A Two-Stage GAIL-PPO Optimization Framework for High-Speed Vehicle Lane-Changing Decision-Making


Abstract: This paper addresses key challenges in intelligent vehicle lane-changing decision-making on highways: the strong dependence of imitation learning on data quality and its limited generalization, and the low training efficiency and difficulty of balancing multiple objectives in reinforcement learning. It proposes a two-stage collaborative optimization framework based on generative adversarial imitation learning (GAIL) and proximal policy optimization (PPO). An adversarial mechanism based on the Wasserstein distance with a gradient penalty is introduced into the GAIL discriminator to improve training stability. PPO is integrated into the generator update of GAIL, leveraging an Actor-Critic architecture to strengthen the robustness of policy learning. PPO is then used for multi-objective reinforcement fine-tuning of the pre-trained policy, with a multi-objective reward function that balances traffic efficiency and safety constraints, enabling progression from expert imitation to policy optimization under complex scenario constraints. Experimental results in the highway-env simulation environment show that, compared with a PPO baseline and DQN, the proposed approach improves average travel speed by approximately 4% and 8%, respectively, while effectively reducing unnecessary lane changes. Longitudinal acceleration time-series analysis and robustness tests further validate the method's stability and generalization under varying driving durations, traffic flow densities, lane counts, and vehicle dynamics constraints.
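The multi-objective reward described in the abstract combines traffic efficiency with safety and lane-change penalties. A minimal sketch of such a reward term is shown below; the weights, the normalized-speed formulation, and the function name are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

def multi_objective_reward(speed, v_min, v_max, lane_changed, collided,
                           w_speed=1.0, w_lane=0.05, w_collision=1.0):
    """Hypothetical multi-objective reward: a normalized speed term
    (efficiency) minus penalties for lane changes and collisions (safety).
    All weights are illustrative placeholders."""
    # Efficiency: map the ego speed into [0, 1] over the permitted range.
    speed_term = np.clip((speed - v_min) / (v_max - v_min), 0.0, 1.0)
    # Penalize lane changes to discourage unnecessary weaving.
    lane_penalty = w_lane if lane_changed else 0.0
    # Hard safety penalty when a collision occurs.
    collision_penalty = w_collision if collided else 0.0
    return w_speed * speed_term - lane_penalty - collision_penalty
```

In frameworks such as highway-env, a term of this shape would be evaluated once per simulation step, so the relative weights directly trade off cruising speed against lane-change frequency.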

       
