Toward automatic audio description generation for accessible videos

Yujia Wang; Wei Liang

doi:10.1145/3411764.3445347

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

39 Citations (Scopus)

Abstract

Video accessibility is essential for people with visual impairments. Audio descriptions describe what is happening on-screen, e.g., physical actions, facial expressions, and scene changes. Generating high-quality audio descriptions requires a lot of manual description generation [50]. To address this accessibility obstacle, we built a system that analyzes the audiovisual contents of a video and generates the audio descriptions. The system consisted of three modules: AD insertion time prediction, AD generation, and AD optimization. We evaluated the quality of our system on five types of videos by conducting qualitative studies with 20 sighted users and 12 users who were blind or visually impaired. Our findings revealed how audio description preferences varied with user types and video types. Based on our study's analysis, we provided recommendations for the development of future audio description generation technologies.

Original language	English
Title of host publication	CHI 2021 - Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems
Subtitle of host publication	Making Waves, Combining Strengths
Publisher	Association for Computing Machinery
ISBN (Electronic)	9781450380966
DOIs	https://doi.org/10.1145/3411764.3445347
Publication status	Published - 6 May 2021
Event	2021 CHI Conference on Human Factors in Computing Systems, CHI 2021 - Virtual, Online, Japan Duration: 8 May 2021 → 13 May 2021

Publication series

Name	Conference on Human Factors in Computing Systems - Proceedings

Conference

Conference	2021 CHI Conference on Human Factors in Computing Systems, CHI 2021
Country/Territory	Japan
City	Virtual, Online
Period	8/05/21 → 13/05/21

Keywords

Audio description
Audio-visual consistency
Video captioning
Sentence-level embedding
Accessibility
Video description

Access to Document

10.1145/3411764.3445347

Cite this

@inproceedings{24941ef0b14844ecb6b8a167a0b72d52,

title = "Toward automatic audio description generation for accessible videos",

abstract = "Video accessibility is essential for people with visual impairments. Audio descriptions describe what is happening on-screen, e.g., physical actions, facial expressions, and scene changes. Generating high-quality audio descriptions requires a lot of manual description generation [50]. To address this accessibility obstacle, we built a system that analyzes the audiovisual contents of a video and generates the audio descriptions. The system consisted of three modules: AD insertion time prediction, AD generation, and AD optimization. We evaluated the quality of our system on five types of videos by conducting qualitative studies with 20 sighted users and 12 users who were blind or visually impaired. Our findings revealed how audio description preferences varied with user types and video types. Based on our study's analysis, we provided recommendations for the development of future audio description generation technologies.",

keywords = "Audio description, Audio-visual consistency, Video captioning, sentence-level embedding, accessibility, Video description",

author = "Yujia Wang and Wei Liang",

note = "Publisher Copyright: {\textcopyright} 2021 ACM.; 2021 CHI Conference on Human Factors in Computing Systems, CHI 2021 ; Conference date: 08-05-2021 Through 13-05-2021",

year = "2021",

month = may,

day = "6",

doi = "10.1145/3411764.3445347",

language = "English",

series = "Conference on Human Factors in Computing Systems - Proceedings",

publisher = "Association for Computing Machinery",

booktitle = "CHI 2021 - Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems",

}

Wang, Y & Liang, W 2021, Toward automatic audio description generation for accessible videos. in CHI 2021 - Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems: Making Waves, Combining Strengths. Conference on Human Factors in Computing Systems - Proceedings, Association for Computing Machinery, 2021 CHI Conference on Human Factors in Computing Systems, CHI 2021, Virtual, Online, Japan, 8/05/21. https://doi.org/10.1145/3411764.3445347

Toward automatic audio description generation for accessible videos. / Wang, Yujia; Liang, Wei.
CHI 2021 - Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems: Making Waves, Combining Strengths. Association for Computing Machinery, 2021. (Conference on Human Factors in Computing Systems - Proceedings).

TY - GEN

T1 - Toward automatic audio description generation for accessible videos

AU - Wang, Yujia

AU - Liang, Wei

PY - 2021/5/6

Y1 - 2021/5/6

N2 - Video accessibility is essential for people with visual impairments. Audio descriptions describe what is happening on-screen, e.g., physical actions, facial expressions, and scene changes. Generating high-quality audio descriptions requires a lot of manual description generation [50]. To address this accessibility obstacle, we built a system that analyzes the audiovisual contents of a video and generates the audio descriptions. The system consisted of three modules: AD insertion time prediction, AD generation, and AD optimization. We evaluated the quality of our system on five types of videos by conducting qualitative studies with 20 sighted users and 12 users who were blind or visually impaired. Our findings revealed how audio description preferences varied with user types and video types. Based on our study's analysis, we provided recommendations for the development of future audio description generation technologies.

AB - Video accessibility is essential for people with visual impairments. Audio descriptions describe what is happening on-screen, e.g., physical actions, facial expressions, and scene changes. Generating high-quality audio descriptions requires a lot of manual description generation [50]. To address this accessibility obstacle, we built a system that analyzes the audiovisual contents of a video and generates the audio descriptions. The system consisted of three modules: AD insertion time prediction, AD generation, and AD optimization. We evaluated the quality of our system on five types of videos by conducting qualitative studies with 20 sighted users and 12 users who were blind or visually impaired. Our findings revealed how audio description preferences varied with user types and video types. Based on our study's analysis, we provided recommendations for the development of future audio description generation technologies.

KW - Audio description

KW - Audio-visual consistency

KW - Video captioning

KW - Sentence-level embedding

KW - Accessibility

KW - Video description

UR - http://www.scopus.com/inward/record.url?scp=85106756267&partnerID=8YFLogxK

U2 - 10.1145/3411764.3445347

DO - 10.1145/3411764.3445347

M3 - Conference contribution

AN - SCOPUS:85106756267

T3 - Conference on Human Factors in Computing Systems - Proceedings

BT - CHI 2021 - Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems

PB - Association for Computing Machinery

T2 - 2021 CHI Conference on Human Factors in Computing Systems, CHI 2021

Y2 - 8 May 2021 through 13 May 2021

ER -
