多模态人机交互综述

Translated title of the contribution: A survey on multi-modal human-computer interaction

Jianhua Tao*, Yingcai Wu, Chun Yu, Dongdong Weng, Guanjun Li, Teng Han, Yuntao Wang, Bin Liu

*Corresponding author for this work

Research output: Contribution to journal › Review article › peer-review

14 Citations (Scopus)

Abstract

Driven by the development of the Internet of Things, human-computer interaction devices have become widely used in daily life. Human-computer interaction is no longer limited to input and output through a single sensory channel (vision, touch, hearing, smell or taste). Multi-modal human-computer interaction aims to exchange information between humans and computers using multiple modalities such as speech, images, text, eye movement and touch. It covers both multi-modal information input from human to computer and multi-modal information presentation from computer to human, and it is a comprehensive discipline closely related to cognitive psychology, ergonomics, multimedia technology and virtual reality technology. Multi-modal human-computer interaction is becoming ever more closely integrated with academic research and technology in the field of image and graphics. In the era of big data and artificial intelligence, multi-modal human-computer interaction technology, as the technical carrier linking humans, machines and things, is closely related to the development of image and graphics, artificial intelligence, affective computing, physiological and psychological assessment, Internet big data, office work and education, medical rehabilitation and other fields. Research on multi-modal human-computer interaction first appeared in the 1990s, when a number of works proposed interaction methods combining speech and gesture. In recent years, the emergence of immersive visualization has provided a new multi-modal interface for human-computer interaction: an immersive environment that integrates visual, auditory, tactile and other sensory channels. Visualization is an important technology for data analysis and exploration. It converts abstract data into graphical representations and facilitates analytical reasoning through interactive interfaces.
In today's data explosion, visualization transforms complex big data into easy-to-understand content, improving people's ability to understand and explore data. Traditional interactive interfaces support only flat visual designs, in both their data mapping channels and their data interaction methods, and cannot meet analysis needs in the big data era. In this era, data visualization faces problems such as limited presentation space, abstract data expression, and data occlusion. Immersive visualization provides a broad presentation space for high-dimensional big data visualization and integrates multiple sensory channels and modalities, so that users can interact with data naturally and in parallel through multiple channels. Interaction technology based on sound field perception can be divided into three types according to its working principle: (1) measuring and identifying the acoustic characteristics of a specific space or channel, or the change in those characteristics caused by an action; (2) using sound wave measurements from a microphone array to localize a sound source, where the source may emit a specific carrier audio signal to improve positioning accuracy and robustness; and (3) using machine learning algorithms to recognize sounds from a specific scene, environment or the human body. Technical solutions include both methods based on sound field perception alone and sensor fusion solutions. In a physical interaction system, the user interacts with the virtual environment through physical objects that exist in the real environment. In recent years, integrating physical interaction interface technology into virtual reality and augmented reality has become a mainstream direction in this field, and the concept of "physical mixed reality" has gradually formed, which is also the conceptual basis of passive haptics.
Haptics in physical interaction can be divided into three types: static passive haptics, passive haptics with feedback, and active force haptics. Because active haptic devices are relatively expensive, there is little current research on them, and the main research directions remain static passive haptics and encounter-type haptics. In mixed reality interaction based on passive haptics, the research levels of countries and institutions around the world differ little, although each has a slightly different emphasis. Research on wearable interaction mainly covers gesture interaction and touch interaction, chiefly in wristband form, as well as skin electronics technology and interaction design. Gesture input is considered one of the core elements of the "natural human-machine interface", and it is well suited to exploring input methods for wearable devices. The key to realizing gesture input lies in sensing technology. In the field of human-computer interaction, sensing technologies for gesture recognition based on infrared light, motion sensors, electromagnetic, capacitive and ultrasonic sensing, cameras and biological signals have been studied in depth. As the natural interface between people and the outside world, the skin has begun to be explored as a medium for information interaction, and applications in several areas have demonstrated its advantages. The human-computer dialogue interaction process involves multiple modules, including speech recognition, emotion recognition, a dialogue system, and speech synthesis. First, the speech recognition and emotion recognition modules convert the user's speech into the corresponding text and emotion labels. The dialogue system then interprets what the user is saying and generates a dialogue response. Finally, the speech synthesis module converts the response into speech to interact with the user.
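The four-stage dialogue process described above can be illustrated with a minimal Python sketch. Every name here (`DialogueTurn`, `run_dialogue_turn`, and the `toy_*` stand-ins) is hypothetical and chosen only for illustration; the toy modules are not real recognizers or synthesizers, and the abstract does not specify how the surveyed systems implement each stage.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DialogueTurn:
    """Intermediate and final results of one user turn."""
    text: str       # output of speech recognition
    emotion: str    # output of emotion recognition
    reply: str      # output of the dialogue system
    audio: bytes    # output of speech synthesis


def run_dialogue_turn(
    speech: bytes,
    recognize_speech: Callable[[bytes], str],
    recognize_emotion: Callable[[bytes], str],
    generate_reply: Callable[[str, str], str],
    synthesize_speech: Callable[[str], bytes],
) -> DialogueTurn:
    """Chain the four modules in the order given in the abstract."""
    text = recognize_speech(speech)        # speech -> text
    emotion = recognize_emotion(speech)    # speech -> emotion label
    reply = generate_reply(text, emotion)  # understand and respond
    audio = synthesize_speech(reply)       # response -> speech
    return DialogueTurn(text, emotion, reply, audio)


# Toy stand-ins (assumptions): the "audio" is simply UTF-8 encoded text.
def toy_asr(speech: bytes) -> str:
    return speech.decode("utf-8")


def toy_emotion(speech: bytes) -> str:
    return "negative" if b"!" in speech else "neutral"


def toy_dialogue(text: str, emotion: str) -> str:
    prefix = "Sorry to hear that. " if emotion == "negative" else ""
    return prefix + "You said: " + text


def toy_tts(reply: str) -> bytes:
    return reply.encode("utf-8")


if __name__ == "__main__":
    turn = run_dialogue_turn(
        b"the door is stuck!", toy_asr, toy_emotion, toy_dialogue, toy_tts
    )
    print(turn)
```

Passing the modules in as callables keeps the pipeline order explicit while leaving each stage replaceable, which matches the modular description above without committing to any particular model.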
How to effectively integrate information from different modalities in a human-computer interaction system, and thereby improve the quality of interaction, also deserves attention. Multi-modal fusion methods can be divided into three types: feature-level fusion, decision-level fusion, and hybrid fusion. Feature-level fusion maps the features extracted from multiple modalities into a single feature vector through a transformation and then feeds that vector to a classification model to obtain the final decision. Decision-level fusion combines the decisions obtained separately from each modality to obtain the final decision. Hybrid fusion uses both feature-level and decision-level fusion. This paper systematically reviews the development status and emerging directions of multi-modal human-computer interaction, and surveys in detail the research progress in big data visualization interaction, interaction based on sound field perception, near-eye display physical interaction, wearable interaction, and human-computer dialogue interaction. It argues that expanding new interaction methods, designing efficient combinations of modalities, building miniaturized interactive devices, enabling cross-device distributed interaction, and improving the robustness of interaction algorithms in open environments are the directions for future work in multi-modal human-computer interaction.
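The three fusion strategies described in the abstract can be contrasted in a short Python sketch. It rests on several assumptions that are not in the source: concatenation stands in for the unspecified feature transformation, a toy logistic scorer stands in for the classification model, decisions are combined by a weighted average, and hybrid fusion is a simple mix of the two results. All function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def linear_classifier(features: Sequence[float],
                      weights: Sequence[float],
                      bias: float = 0.0) -> float:
    """Toy classifier: logistic score, read as P(positive class)."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def feature_level_fusion(modal_features: Sequence[Sequence[float]],
                         joint_weights: Sequence[float]) -> float:
    """Merge all modalities into one vector, then classify once.

    Concatenation is an assumed choice of transformation.
    """
    fused = [x for feats in modal_features for x in feats]
    return linear_classifier(fused, joint_weights)


def decision_level_fusion(modal_features: Sequence[Sequence[float]],
                          modal_weights: Sequence[Sequence[float]],
                          importance: Optional[Sequence[float]] = None) -> float:
    """Classify each modality separately, then combine the decisions.

    A weighted average of probabilities is an assumed combination rule.
    """
    probs = [linear_classifier(f, w)
             for f, w in zip(modal_features, modal_weights)]
    if importance is None:
        importance = [1.0] * len(probs)
    return sum(p * a for p, a in zip(probs, importance)) / sum(importance)


def hybrid_fusion(modal_features: Sequence[Sequence[float]],
                  joint_weights: Sequence[float],
                  modal_weights: Sequence[Sequence[float]],
                  mix: float = 0.5) -> float:
    """Use both strategies and blend their outputs (assumed blend)."""
    early = feature_level_fusion(modal_features, joint_weights)
    late = decision_level_fusion(modal_features, modal_weights)
    return mix * early + (1.0 - mix) * late


if __name__ == "__main__":
    speech, face = [0.8, -0.2], [0.1, 0.4, 0.3]   # made-up feature values
    joint_w = [1.0, 0.5, -0.3, 0.7, 0.2]          # made-up weights
    per_modality_w = [[1.0, 0.5], [-0.3, 0.7, 0.2]]
    print(feature_level_fusion([speech, face], joint_w))
    print(decision_level_fusion([speech, face], per_modality_w))
    print(hybrid_fusion([speech, face], joint_w, per_modality_w))
```

The practical difference is where the modalities meet: feature-level fusion lets one model see cross-modal feature interactions, while decision-level fusion keeps each modality's model independent and merges only their outputs.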

Translated title of the contribution: A survey on multi-modal human-computer interaction
Original language: Chinese (Traditional)
Pages (from-to): 1956-1987
Number of pages: 32
Journal: Journal of Image and Graphics
Volume: 27
Issue number: 6
Publication status: Published - 16 Jun 2022
