System Selection and Performance Evaluation of LLM-Based Python Programming Teaching Assistants
Abstract: This study investigates the application of large language models (LLMs) in Python programming education by constructing a multi-dimensional evaluation framework to systematically compare the performance of mainstream models, such as Qwen-Plus, Ernie Bot 8k, and SparkV3.5, in educational scenarios. Through testing tasks including factual questions, reasoning problems, code generation, and multi-turn dialogue, models were assessed across five dimensions: accuracy, completeness, linguistic fluency, contextual understanding, and code example quality. Experimental results show that Qwen-Plus achieved the highest overall score, demonstrating superior coverage of edge cases and logical coherence in multi-turn interactions, with code examples adhering to the PEP 8 standard. Ernie Bot 8k and SparkV3.5 exhibited high accuracy but suffered from redundant code comments, while GPT-4 scored lower due to code redundancy and incomplete exception handling. The study identifies common limitations in the models' coverage of Python language details and contextual modeling, and suggests improvements through knowledge base updates, reinforcement learning optimization, and multi-modal evaluation frameworks. These findings provide empirical evidence for model selection and educational scenario adaptation in intelligent teaching assistant systems.
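The five-dimension assessment described above implies aggregating per-dimension ratings into a composite score. A minimal sketch of such an aggregation is shown below; the dimension names follow the study, but the equal weighting, the 0-10 scale, and the sample ratings are illustrative assumptions, not values reported by the authors.

```python
# Hypothetical composite scoring over the five evaluation dimensions
# named in the abstract. Weights, scale, and ratings are illustrative
# assumptions; the paper does not publish its scoring formula.

DIMENSIONS = ("accuracy", "completeness", "fluency", "context", "code_quality")

def composite_score(scores, weights=None):
    """Weighted mean of per-dimension scores (assumed 0-10 scale)."""
    if weights is None:
        weights = {d: 1.0 for d in DIMENSIONS}  # equal weighting by default
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight

# Example: illustrative (invented) ratings for one model
ratings = {"accuracy": 9.0, "completeness": 8.5, "fluency": 9.0,
           "context": 8.0, "code_quality": 9.5}
print(round(composite_score(ratings), 2))  # equal-weight mean: 8.8
```

Passing a custom `weights` dictionary lets an instructor emphasize, say, code example quality over fluency when ranking candidate models for a teaching deployment.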
Source: Computer Science and Application, 2025, No. 6, pp. 190-197 (8 pages)