Multimodal Student Engagement Prediction Method Based on Vision-Language Models

Shen Shuang-Hong, Qin Zi-Xuan, Su Yu, Wang Shi-Jin, Huang Zhen-Ya, Sun Deng-Di

Citation: Shen Shuang-Hong, Qin Zi-Xuan, Su Yu, Wang Shi-Jin, Huang Zhen-Ya, Sun Deng-Di. Multimodal student engagement prediction method based on vision-language models. Acta Automatica Sinica, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c260173

doi: 10.16383/j.aas.c260173 cstr: 10.16383/j.aas.c260173
Funds: Supported by the National Natural Science Foundation of China (62507010, 62577019), the China Postdoctoral Science Foundation (2024M760725), the Anhui Provincial Natural Science Foundation (2408085QF212), the Key Project of Natural Science Research in Anhui Provincial Institutions of Higher Education (2025AHGXZK20027), the Key Teaching Research Project of the Quality Engineering Program in Anhui Provincial Institutions of Higher Education (2023jujxwt006), the Opening Project of the Guangdong Provincial Key Laboratory of Mental Health and Cognitive Science (South China Normal University), and the Anhui Provincial University Digital Transformation Project
    Author Bio:

    SHEN Shuang-Hong  Associate researcher at the Institute of Artificial Intelligence, Hefei Comprehensive National Science Center. His research interests include data mining and intelligent education. E-mail: shshen@iai.ustc.edu.cn

    QIN Zi-Xuan  Master's student at Anhui University. His research interests include computer vision, multimodal fusion, and the application of vision-language models. E-mail: wa24201053@stu.ahu.edu.cn

    SU Yu  Professor at Hefei Normal University. His research interests include natural language understanding and intelligent education. E-mail: yusu@hfnu.edu.cn

    WANG Shi-Jin  Professor-level senior engineer and vice president of iFLYTEK Co., Ltd. His research interests include cognitive intelligence and smart education. E-mail: sjwang3@iflytek.com

    HUANG Zhen-Ya  Associate professor at the School of Computer Science and Technology, University of Science and Technology of China. He received his Ph.D. from the University of Science and Technology of China in 2020. His research interests include data mining and cognitive reasoning. E-mail: huangzy@ustc.edu.cn

    SUN Deng-Di  Professor at the School of Artificial Intelligence, Anhui University. His research interests include computer vision, machine learning, and deep learning. Corresponding author of this paper. E-mail: sundengdi@ahu.edu.cn

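  • Abstract: With the spread of online education, student engagement prediction (SEP) has become a core task for evaluating teaching effectiveness. Although vision-language models (VLMs) perform well in general multimodal representation learning, transferring them directly to SEP is limited by perception bottlenecks for fine-grained facial micro-expressions and for macro-level emotions in specific teaching contexts, which makes it difficult to align visual features precisely with high-level semantic labels. To address this, we propose VLM-SEP, a VLM-based multimodal student engagement prediction method. The method first performs multi-level engagement feature decoupling, extracting structured engagement features from facial expressions and body postures. It then introduces knowledge-guided attention concentration reasoning, which establishes an explicit mapping between discrete visual features and language descriptions of engagement states. Finally, it makes a cross-modal fusion decision that combines visual and textual information to classify engagement. Experimental results on multiple public datasets show that the method improves the adaptability of VLMs to the SEP domain and provides an interpretable solution for student engagement prediction.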
  • Fig. 1  The framework of the VLM-SEP model

    Fig. 2  Prompt for the multi-level engagement feature decoupling module

    Fig. 3  Prompt for the knowledge-guided attention concentration reasoning module

    Fig. 4  Comparison of sample distributions before and after feature decoupling

    Fig. 5  Temporal correlation heatmap between text and video features

    Fig. 6  Confusion matrices of VLM-SEP on the three datasets

    Fig. 7  Case analysis

    Fig. 8  Qualitative analysis

    Table 1  Dataset distribution of DAISEE, EmotiW2023, and DIPSER

    Dataset       Training set    Validation set    Test set    Total
    DAISEE        5466            1712              1708        8886
    EmotiW2023    5443            1427              1074        7944
    DIPSER        5560            1470              1644        8674

    Table 2  Accuracy (%) comparison between VLM-SEP and traditional methods on three student datasets

    Model                     Input                                   DAISEE          EmotiW2023      DIPSER
    Video InceptionNet        Video frames                            53.98 ± 0.30    52.05 ± 0.30    63.63 ± 0.20
    C3D + LSTM                Video frames                            57.08 ± 0.20    51.44 ± 0.20    63.44 ± 0.25
    ResNet + LSTM             Video frames                            57.62 ± 0.15    54.05 ± 0.15    64.11 ± 0.30
    Swin-Transformer          Video frames                            55.62 ± 0.30    52.52 ± 0.30    63.44 ± 0.20
    ViViT                     Video frames                            57.03 ± 0.25    54.17 ± 0.25    63.99 ± 0.30
    EfficientNet + LSTM       Video frames                            56.98 ± 0.25    58.13 ± 0.25    63.56 ± 0.20
    EfficientNet + Bi-LSTM    Video frames                            57.33 ± 0.25    58.87 ± 0.35    64.54 ± 0.20
    FANN                      Video frames                            58.08 ± 0.30    57.91 ± 0.40    64.17 ± 0.20
    VisioPhysioENet           Video frames + physiological signals    56.64 ± 0.10    51.30 ± 0.20    -
    MGAFR                     Video frames + text                     58.36 ± 0.10    59.39 ± 0.10    -
    MIST                      Video frames + text                     57.22 ± 0.20    57.86 ± 0.20    -
    Video-LLaVA (7B)          Video frames + text                     49.97 ± 0.00    29.73 ± 0.00    48.24 ± 0.00
    LLaVA-NeXT (7B)           Video frames + text                     45.17 ± 0.00    24.67 ± 0.00    21.11 ± 0.00
    Qwen3-VL (8B)             Video frames + text                     50.24 ± 0.00    51.30 ± 0.00    55.23 ± 0.00
    VLM-SEP                   Video frames + text                     59.01 ± 0.20    61.08 ± 0.10    65.82 ± 0.20
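
    The entries in Table 2 have the form "value ± spread". This page does not say what the spread is. As a minimal sketch, assuming it is the sample standard deviation of accuracy over several repeated runs, such an entry could be produced as follows. The function name `summarize_runs` and the run accuracies are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev
from typing import Sequence


def summarize_runs(accuracies: Sequence[float]) -> str:
    """Format per-run accuracies (in percent) as 'mean ± std'.

    Assumption: the spread is the sample standard deviation (n - 1 in the
    denominator); the paper may use a different convention.
    """
    return f"{mean(accuracies):.2f} ± {stdev(accuracies):.2f}"


# Made-up accuracies from three hypothetical runs of one model.
hypothetical_runs = [58.8, 59.0, 59.2]
summary = summarize_runs(hypothetical_runs)
print(summary)
```

    Under this reading, a spread of 0.00, as in the Video-LLaVA, LLaVA-NeXT, and Qwen3-VL rows, would mean every run returned the same accuracy.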

    Table 3  Ablation study (Acc.: accuracy in %; MAE: mean absolute error)

    Ablation setting                      DAISEE                        EmotiW2023                    DIPSER
                                          Acc.      F1        MAE       Acc.      F1        MAE       Acc.      F1        MAE
    Visual branch only                    57.96     0.5330    0.4309    59.78     0.5857    0.5633    64.90     0.6116    0.4027
    Text branch only                      53.04     0.5159    0.4854    60.43     0.5377    0.5335    63.44     0.4925    0.3656
    Simple fusion                         58.72     0.5460    0.4227    59.78     0.5668    0.5447    65.70     0.6350    0.3765
    Without knowledge-guided module       56.79     0.5259    0.4415    60.04     0.5801    0.4991    65.15     0.5503    0.3571
    Unidirectional: vision-guided text    58.08     0.5472    0.4309    61.92     0.5969    0.4963    64.78     0.5319    0.3668
    Unidirectional: text-guided vision    57.73     0.4792    0.4456    60.15     0.5844    0.5223    65.39     0.6261    0.3899
    Full model (VLM-SEP)                  59.01     0.5554    0.4186    61.08     0.5970    0.5335    65.82     0.6108    0.3747
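
    Table 3 reports accuracy, F1 score, and mean absolute error. The sketch below shows one common way to compute these three metrics when engagement levels are encoded as ordered integers. It is an illustration under stated assumptions, not the paper's evaluation code: macro averaging for F1, the four-level 0 to 3 label scale, the function names, and the labels themselves are all hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted level equals the true level."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores.

    Assumption: macro averaging; this page does not state which
    averaging scheme the paper uses.
    """
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_absolute_error(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Average distance between predicted and true levels on the ordinal scale."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels on a hypothetical 0-3 engagement scale.
y_true = [0, 1, 2, 3, 2, 3, 1, 2]
y_pred = [0, 2, 2, 3, 3, 3, 1, 1]

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
print(f"MAE      = {mean_absolute_error(y_true, y_pred):.4f}")
```

    Unlike accuracy and F1, the mean absolute error depends on how far a wrong prediction lies from the true level, so two settings with similar accuracy can still differ in this column.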
Publication history
  • Received:  2026-03-09
  • Accepted:  2026-05-07
  • Published online:  2026-05-29
