Health Monitoring Method for Data Center Chiller Units Based on AI Agent

吴限; 孙衍宁; 马昊; 刘丽兰

doi:10.3969/j.issn.1007-7375.260023

Abstract: Health monitoring of data center chiller units suffers from three problems: rigid diagnostic models, difficult cross-scenario transfer, and a disconnect between algorithmic models and operation-and-maintenance (O&M) expertise. To address them, this paper proposes a health monitoring method based on an AI Agent. Inspired by the way human medical diagnosis combines preliminary screening with comprehensive consultation, the method builds a diagnostic framework that integrates univariate threshold monitoring with multivariate kernel principal component analysis (KPCA). Built on the DeepSeek-V3 large language model (LLM), the AI Agent automatically plans monitoring strategies, invokes algorithmic tools, and generates executable code. Experiments on the public ASHRAE RP-1043 dataset show that the AI Agent accurately reproduces the diagnostic logic for seven typical faults, including reduced condenser water flow, reduced chilled water flow, and refrigerant leakage. They also show that a standardized function-calling mechanism makes it more efficient and flexible to migrate and deploy diagnostic models across different chiller units. The results further show that combining the univariate threshold tool with the multivariate KPCA tool effectively reduces the false alarm rate when monitoring multiple operating conditions and different chiller units. In addition, with no loss of diagnostic accuracy, the monitoring code generated by the AI Agent is about 8.6% shorter, measured in lines of code, than manually written code. This study provides a scalable, easy-to-deploy automated solution for prognostics and health management (PHM) of critical data center infrastructure.
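The screening-then-consultation logic and the tool-calling mechanism summarized above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation. Every function and class name (`fit_thresholds`, `threshold_screen`, `KPCAMonitor`, `two_tier_monitor`, `make_registry`, `dispatch`) is hypothetical. The following are all assumed choices: the mean ± 3σ bounds, the RBF kernel with `gamma=0.05`, the 90% cumulative-variance rule, the empirical 99th-percentile control limits on T² and SPE, and the rule that an alarm requires both tiers to agree. The tool-call dictionary only imitates the shape of OpenAI-compatible function calling (a `name` plus JSON-string `arguments`), and no LLM is invoked. The three variables are synthetic stand-ins, not RP-1043 measurements.

```python
import json
import numpy as np

# Tier 1, "preliminary screening": per-variable bounds mean +/- k*std.
# The rule and k = 3 are assumptions, not settings stated in the paper.
def fit_thresholds(X_train, k=3.0):
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    return mu - k * sd, mu + k * sd

def threshold_screen(x, lower, upper):
    """Indices of the variables that fall outside their bounds."""
    return np.flatnonzero((x < lower) | (x > upper))

# Tier 2, "comprehensive consultation": RBF kernel PCA with T^2 and SPE.
# gamma, the 90% variance rule and the empirical 99% limits are assumed.
class KPCAMonitor:
    def __init__(self, gamma=0.05, var_ratio=0.9, quantile=0.99):
        self.gamma, self.var_ratio, self.quantile = gamma, var_ratio, quantile

    def _kernel(self, A, B):
        # k(a, b) = exp(-gamma * ||a - b||^2)
        d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
        return np.exp(-self.gamma * np.maximum(d2, 0.0))

    def fit(self, X):
        self.mu, self.sd = X.mean(0), X.std(0)
        self.Xs = (X - self.mu) / self.sd
        n = len(X)
        K = self._kernel(self.Xs, self.Xs)
        self.K_colmean, self.K_mean = K.mean(0), K.mean()
        one = np.full((n, n), 1.0 / n)
        Kc = K - one @ K - K @ one + one @ K @ one      # centre in feature space
        lam, V = np.linalg.eigh(Kc)                     # ascending order
        lam, V = lam[::-1], V[:, ::-1]
        keep = lam > 1e-10 * lam[0]
        lam, V = lam[keep], V[:, keep]
        p = int(np.searchsorted(np.cumsum(lam) / lam.sum(), self.var_ratio)) + 1
        p = min(p, len(lam))
        self.alpha = V[:, :p] / np.sqrt(lam[:p])        # unit-norm principal axes
        self.score_var = lam[:p] / n                    # training variance of each score
        t2, spe = self._stats(self.Xs)
        self.t2_lim = np.quantile(t2, self.quantile)
        self.spe_lim = np.quantile(spe, self.quantile)
        return self

    def _stats(self, Xs):
        k = self._kernel(Xs, self.Xs)
        k_rowmean = k.mean(1)
        kc = k - self.K_colmean[None, :] - k_rowmean[:, None] + self.K_mean
        t = kc @ self.alpha                             # KPCA scores
        t2 = (t**2 / self.score_var).sum(1)
        phi2 = 1.0 - 2.0 * k_rowmean + self.K_mean      # ||centred phi(x)||^2 (k(x,x)=1)
        spe = np.maximum(phi2 - (t**2).sum(1), 0.0)
        return t2, spe

    def consult(self, X):
        """True for each sample whose T^2 or SPE exceeds its control limit."""
        t2, spe = self._stats((np.atleast_2d(X) - self.mu) / self.sd)
        return (t2 > self.t2_lim) | (spe > self.spe_lim)

def two_tier_monitor(x, lower, upper, kpca):
    """Escalate to KPCA only if screening flags a variable, and alarm only
    if KPCA confirms (one possible fusion rule, assumed here)."""
    flagged = threshold_screen(x, lower, upper)
    alarm = flagged.size > 0 and bool(kpca.consult(x)[0])
    return {"alarm": alarm, "flagged": flagged.tolist()}

# Hypothetical tool registry: a tool call carries a name plus JSON-string
# arguments, imitating OpenAI-compatible function calling. No LLM is called.
def make_registry(lower, upper, kpca):
    return {
        "threshold_screen": lambda sample: threshold_screen(
            np.asarray(sample), lower, upper).tolist(),
        "kpca_consult": lambda sample: bool(kpca.consult(np.asarray(sample))[0]),
    }

def dispatch(registry, tool_call):
    return registry[tool_call["name"]](**json.loads(tool_call["arguments"]))

# Synthetic, correlated stand-in data (not RP-1043 measurements).
rng = np.random.default_rng(0)
z = rng.normal(size=(300, 1))
X_train = np.hstack([z,
                     0.8 * z + 0.3 * rng.normal(size=(300, 1)),
                     -0.5 * z + 0.3 * rng.normal(size=(300, 1))])
lower, upper = fit_thresholds(X_train)
kpca = KPCAMonitor().fit(X_train)

normal = X_train.mean(axis=0)
fault = X_train.mean(axis=0) + 10.0 * X_train.std(axis=0)
print(two_tier_monitor(normal, lower, upper, kpca))
print(two_tier_monitor(fault, lower, upper, kpca))

call = {"name": "threshold_screen",
        "arguments": json.dumps({"sample": fault.tolist()})}
print(dispatch(make_registry(lower, upper, kpca), call))
```

Requiring the KPCA tier to confirm a univariate flag is one way a screening stage could suppress single-sensor false alarms. Under this assumed rule, however, a fault that breaks the correlation between variables without pushing any single variable out of bounds would never be escalated. A deployment might therefore also run the KPCA tier on every sample.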