Abstract
Stylized digital humans are a rapidly developing area in computer graphics, visual arts, and game design. Techniques for designing and producing digital characters have advanced markedly, enabling characters with more realistic appearance and behavior that also adapt well to diverse artistic styles and scenarios. This paper systematically surveys the development status, frontier advances, and open problems of stylized digital humans along three core research directions: stylized generation, multimodal driving, and user interaction. For stylized generation, methods are classified according to the two 3D representations of digital humans: explicit 3D models and implicit 3D models. Explicit 3D digital human stylization is analyzed mainly in terms of optimization-based methods, generative-adversarial-network-based methods, and engine-based methods; implicit 3D digital human stylization is reviewed from general implicit scene stylization methods and face-specific implicit stylization. For digital human driving, methods are reviewed by driving source (audio-driven, text-driven, and video-driven) and by implementation algorithm (intermediate-variable-based and encoder-decoder-based); intermediate-variable methods are further divided into keypoint-based, 3D-face-based, and optical-flow-based approaches. For user interaction, the current mainstream modality is voice: the voice interaction module is reviewed in terms of automatic speech recognition and text-to-speech synthesis, and the dialog system module is reviewed in terms of natural language understanding and natural language generation. On this basis, future development trends of stylized digital human research are discussed to provide a reference for subsequent work.
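The intermediate-variable driving paradigm summarized above (a driving signal is first mapped to an intermediate representation such as facial keypoints, which an encoder-decoder pipeline then turns into frames) can be illustrated with a minimal, purely hypothetical sketch. All names and the stand-in arithmetic below are illustrative and do not come from any surveyed method:

```python
# Hypothetical sketch of intermediate-variable-based driving: per-frame
# audio features are encoded into an intermediate representation (facial
# keypoints), which a decoder consumes to produce the final frame.
# Real systems use learned networks; simple arithmetic stands in here
# purely to make the data flow concrete.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Keypoints:
    """Intermediate variable: 2D facial landmarks in normalized coordinates."""
    coords: List[Tuple[float, float]]

def encode_audio_to_keypoints(audio_features: List[float]) -> Keypoints:
    # Stand-in "encoder": derive a mouth-opening amount from audio energy.
    energy = sum(abs(a) for a in audio_features) / max(len(audio_features), 1)
    mouth_open = min(1.0, energy)
    upper_lip = (0.5, 0.60)
    lower_lip = (0.5, 0.60 + 0.1 * mouth_open)  # opens with louder audio
    return Keypoints(coords=[upper_lip, lower_lip])

def decode_keypoints_to_frame(kp: Keypoints) -> dict:
    # Stand-in "decoder": a real method would render an image conditioned
    # on the keypoints; here we just package them as a frame description.
    return {"landmarks": kp.coords, "n_landmarks": len(kp.coords)}

def drive(audio_features: List[float]) -> dict:
    """One frame of audio-driven animation via the keypoint intermediate."""
    return decode_keypoints_to_frame(encode_audio_to_keypoints(audio_features))
```

The same skeleton applies when the intermediate variable is a 3D face parameterization or an optical-flow field: only the encoder and decoder change, while the overall signal-to-intermediate-to-frame flow stays fixed.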
Stylized digital characters have emerged as a fundamental force in reshaping the landscape of computer graphics, visual arts, and game design. Their unparalleled ability to mimic human appearance and behavior, coupled with their flexibility in adapting to a wide array of artistic styles and narrative frameworks, underscores their growing importance in crafting immersive and engaging digital experiences. This comprehensive survey delves into the complex world of stylized digital humans, examines their current development status, identifies the latest trends, and addresses the pressing challenges that lie ahead in three foundational research domains: the creation of stylized digital humans, multimodal driving mechanisms, and user interaction modalities. The first domain, the creation of stylized digital humans, examines the innovative methodologies employed in generating lifelike yet stylistically diverse characters that can seamlessly integrate into various digital environments. From advancements in 3D modeling and texturing to the integration of artificial intelligence for dynamic character development, this section provides a thorough analysis of the tools and technologies that are pushing the boundaries of what digital characters can achieve. In the realm of multimodal driving mechanisms, this study investigates evolving techniques for animating and controlling digital humans using a range of inputs, such as voice, gesture, and real-time motion capture. This section examines how these mechanisms not only enhance the realism of character interactions but also open new avenues for creators to involve users in interactive narratives in more meaningful ways. Finally, the discussion of user interaction modalities explores the various ways in which end users can engage with and influence the behavior of digital humans. From immersive virtual and augmented reality experiences to interactive web and mobile platforms, this segment evaluates the effectiveness of different
modalities in creating two-way interactions that enrich the user's experience and deepen their connection to digital characters. At the heart of this exploration lies the creation of stylized digital humans, a field that has witnessed remarkable progress in recent years. The generation of these characters can be broadly classified into two categories: explicit 3D models and implicit 3D models. Explicit 3D digital human stylization encompasses a range of methodologies, including optimization-based approaches that meticulously refine digital meshes to conform to specific stylistic attributes. These techniques often involve iterative processes that adjust geometric details, textures, and lighting to achieve the desired aesthetic. Generative adversarial networks, as cornerstones of deep learning, have revolutionized this landscape by enabling the automatic generation of novel stylized forms that capture the intricate nuances of various artistic styles. Furthermore, engine-based methods harness the power of advanced rendering engines to apply artistic filters and effects in real time, offering unparalleled flexibility and control over the final visual output. Implicit 3D digital human stylization draws inspiration from the realm of implicit scene stylization, particularly via neural implicit representations. These approaches offer a more holistic and flexible way to represent and manipulate 3D geometry and appearance, enabling stylization that transcends traditional mesh-based limitations. Within this framework, facial stylization holds a special place, requiring a profound understanding of facial anatomy, expression dynamics, and cultural nuances. Specialized methods have been developed to capture and manipulate facial features in a nuanced and artistic manner, fostering a level of realism and emotional expressiveness that is crucial for believable digital humans. Animating and controlling the behavior of stylized digital humans necessitates the use of diverse driving signals, which
serve as the lifeblood of these virtual beings. This study examines three primary sources of these signals: audio drivers, text drivers, and video drivers. Audio drivers leverage speech recognition and prosody analysis to synchronize digital human movements with spoken language, enabling characters to lip-sync and gesture in a natural and expressive manner. Conversely, text drivers rely on natural language processing (NLP) techniques to interpret textual commands or prompts and convert them into coherent actions, allowing for a more directive form of control. Video drivers, which are perhaps the most advanced in terms of realism, employ computer vision algorithms to track and mimic the movements of real-world actors, providing a seamless bridge between the virtual and physical worlds. These drivers are supported by sophisticated implementation algorithms, many of which rely on encoder-decoder structures driven by intermediate variables. Keypoint-based methods play a pivotal role in capturing and transferring motion, allowing for the precise replication of movements across different characters. Moreover, 3D face-based approaches focus on facial animation, utilizing detailed facial models and advanced animation techniques to achieve unparalleled realism in expressions and emotions. Meanwhile, optical flow-based techniques offer a holistic approach to the estimation, synthesis, capture, and reproduction of complex motion patterns across the entire digital human body. The true magic of stylized digital humans lies in their ability to engage with users in meaningful and natural interactions. Voice interaction, currently the mainstream mode of communication, relies heavily on automatic speech recognition for accurate speech-to-text conversion and text-to-speech synthesis for generating natural-sounding synthetic speech. The dialog system module, a cornerstone of virtual human interaction, emphasizes the importance of natural language understanding for interpreting user inputs and natural language
generation for crafting appropriate responses. When these capabilities are seamlessly integrated, stylized digital humans can engage in fluid and contextually relevant conversations with users, fostering a sense of intimacy and connection. The study of stylized digital characters will likely continue its ascendancy, fueled by advancements in deep learning, computer vision, and NLP. Future research may delve into integrating multiple modalities for richer and more nuanced interactions, pushing the boundaries of what is possible in virtual human communication. Innovative stylization techniques that bridge the gap between reality and fiction will also be explored, enabling the creation of digital humans that are both fantastical and relatable. Moreover, the development of intelligent agents capable of autonomous creativity and learning will revolutionize the way stylized digital humans contribute to various industries, including entertainment, education, healthcare, and beyond. As technology continues to evolve, stylized digital humans will undoubtedly play an increasingly substantial role in shaping how people engage with digital content and with each other, ushering in a new era of digital creativity and expression. This study serves as a valuable resource for researchers and practitioners alike, offering a comprehensive overview of the current state of the art and guiding the way forward in this dynamic, exciting field.
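The voice-interaction loop described above (automatic speech recognition, then natural language understanding and generation in the dialog system, then text-to-speech) can be sketched as a minimal pipeline of stub components. Every function and rule here is hypothetical, standing in for real recognition, dialog, and synthesis models:

```python
# Hypothetical voice-interaction loop for a digital human:
# ASR -> NLU (intent) -> NLG (response text) -> TTS.
# Each stage is a stub standing in for a real model, shown only to
# make the composition of the modules concrete.

def asr(audio: bytes) -> str:
    # Stub recognizer: pretend the audio decodes directly to text.
    return audio.decode("utf-8")

def nlu(text: str) -> str:
    # Stub understanding: map the utterance to a coarse intent label.
    return "greeting" if "hello" in text.lower() else "unknown"

def nlg(intent: str) -> str:
    # Stub generation: pick a canned response for the intent.
    responses = {"greeting": "Hello! How can I help you?",
                 "unknown": "Sorry, could you rephrase that?"}
    return responses[intent]

def tts(text: str) -> bytes:
    # Stub synthesizer: the "waveform" is just the encoded response text.
    return text.encode("utf-8")

def interact(audio_in: bytes) -> bytes:
    """One conversational turn of the voice-interaction pipeline."""
    return tts(nlg(nlu(asr(audio_in))))
```

In a deployed digital human, each stub would be replaced by a dedicated model, but the turn-level composition of the four modules stays the same.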
Authors
Pan Ye
Li Shaoxu
Tan Shuai
Wei Junjie
Zhai Guangtao
Yang Xiaokang
Pan Ye; Li Shaoxu; Tan Shuai; Wei Junjie; Zhai Guangtao; Yang Xiaokang (Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China)
Source
《中国图象图形学报》
Peking University Core Journal
2025, No. 2, pp. 334-360 (27 pages)
Journal of Image and Graphics
Funding
National Natural Science Foundation of China (62472285, 62102255)
Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).
Keywords
stylization
digital characters
face driven
human-computer interaction
3D modeling
deep learning
neural network
Author biographies
Corresponding author: Pan Ye, female, associate professor; research interests: virtual reality, human-computer interaction, and character animation. E-mail: whitneypanye@sjtu.edu.cn. Li Shaoxu, male, Ph.D. candidate; research interest: stylized 3D face generation. E-mail: lishaoxu94@163.com. Tan Shuai, male, Ph.D. candidate; research interests: digital human driving and generation. E-mail: tanshuai0219@sjtu.edu.cn. Wei Junjie, male, M.S. candidate; research interests: stylized 3D face generation and interaction. E-mail: danbaiwei@163.com. Zhai Guangtao, male, professor; research interest: multimedia signal processing. E-mail: zhaiguangtao@sjtu.edu.cn. Yang Xiaokang, male, professor; research interests: video coding and communication, image processing and pattern recognition, and video analysis and retrieval. E-mail: xkyang@sjtu.edu.cn.