LASFormer: Light Transformer for Action Segmentation with Receptive Field-Guided Distillation and Action Relation Encoding

Zhichao Ma; Kan Li

doi:10.3390/math12010057

LASFormer: Light Transformer for Action Segmentation with Receptive Field-Guided Distillation and Action Relation Encoding

Zhichao Ma, Kan Li^*

^*Corresponding author for this work

School of Computer Science and Technology

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Transformer-based models for action segmentation have achieved high frame-wise accuracy against challenging benchmarks. However, they rely on multiple decoders and self-attention blocks for informative representations, whose huge computing and memory costs remain an obstacle to handling long video sequences and practical deployment. To address these issues, we design a light transformer model for the action segmentation task, named LASFormer, with a novel encoder–decoder structure based on three key designs. First, we propose a receptive field-guided distillation to realize mode reduction, which can overcome more generally the gap in semantic feature structure between the intermediate features by aggregated temporal dilation convolution (ATDC). Second, we propose a simplified implicit attention to replace self-attention to avoid its quadratic complexity. Third, we design an efficient action relation encoding module embedded after the decoder, where the temporal graph reasoning introduces an inductive bias that adjacent frames are more likely to belong to the same class of model global temporal relations, and the cross-model fusion structure integrates frame-level and segment-level temporal clues, which can avoid over-segmentation independent of multiple decoders, thus reducing further computational complexity. Extensive experiments have verified the effectiveness and efficiency of the framework. Against the challenging 50Salads, GTEA, and Breakfast benchmarks, LASFormer significantly outperforms the current state-of-the-art methods in accuracy, edit score, and F1 score.

Original language	English
Article number	57
Journal	Mathematics
Volume	12
Issue number	1
DOIs	https://doi.org/10.3390/math12010057
Publication status	Published - Jan 2024

Keywords

action relation encoding
action segmentation
light transformer
model reduction

Access to Document

10.3390/math12010057

Cite this

@article{b821532aca714f1dbb3a53bdd65478bc,

title = "LASFormer: Light Transformer for Action Segmentation with Receptive Field-Guided Distillation and Action Relation Encoding",

abstract = "Transformer-based models for action segmentation have achieved high frame-wise accuracy against challenging benchmarks. However, they rely on multiple decoders and self-attention blocks for informative representations, whose huge computing and memory costs remain an obstacle to handling long video sequences and practical deployment. To address these issues, we design a light transformer model for the action segmentation task, named LASFormer, with a novel encoder–decoder structure based on three key designs. First, we propose a receptive field-guided distillation to realize mode reduction, which can overcome more generally the gap in semantic feature structure between the intermediate features by aggregated temporal dilation convolution (ATDC). Second, we propose a simplified implicit attention to replace self-attention to avoid its quadratic complexity. Third, we design an efficient action relation encoding module embedded after the decoder, where the temporal graph reasoning introduces an inductive bias that adjacent frames are more likely to belong to the same class of model global temporal relations, and the cross-model fusion structure integrates frame-level and segment-level temporal clues, which can avoid over-segmentation independent of multiple decoders, thus reducing further computational complexity. Extensive experiments have verified the effectiveness and efficiency of the framework. Against the challenging 50Salads, GTEA, and Breakfast benchmarks, LASFormer significantly outperforms the current state-of-the-art methods in accuracy, edit score, and F1 score.",

keywords = "action relation encoding, action segmentation, light transformer, model reduction",

author = "Zhichao Ma and Kan Li",

note = "Publisher Copyright: {\textcopyright} 2023 by the authors.",

year = "2024",

month = jan,

doi = "10.3390/math12010057",

language = "English",

volume = "12",

journal = "Mathematics",

issn = "2227-7390",

publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",

number = "1",

}

TY - JOUR

T1 - LASFormer

T2 - Light Transformer for Action Segmentation with Receptive Field-Guided Distillation and Action Relation Encoding

AU - Ma, Zhichao

AU - Li, Kan

PY - 2024/1

Y1 - 2024/1

N2 - Transformer-based models for action segmentation have achieved high frame-wise accuracy against challenging benchmarks. However, they rely on multiple decoders and self-attention blocks for informative representations, whose huge computing and memory costs remain an obstacle to handling long video sequences and practical deployment. To address these issues, we design a light transformer model for the action segmentation task, named LASFormer, with a novel encoder–decoder structure based on three key designs. First, we propose a receptive field-guided distillation to realize mode reduction, which can overcome more generally the gap in semantic feature structure between the intermediate features by aggregated temporal dilation convolution (ATDC). Second, we propose a simplified implicit attention to replace self-attention to avoid its quadratic complexity. Third, we design an efficient action relation encoding module embedded after the decoder, where the temporal graph reasoning introduces an inductive bias that adjacent frames are more likely to belong to the same class of model global temporal relations, and the cross-model fusion structure integrates frame-level and segment-level temporal clues, which can avoid over-segmentation independent of multiple decoders, thus reducing further computational complexity. Extensive experiments have verified the effectiveness and efficiency of the framework. Against the challenging 50Salads, GTEA, and Breakfast benchmarks, LASFormer significantly outperforms the current state-of-the-art methods in accuracy, edit score, and F1 score.

AB - Transformer-based models for action segmentation have achieved high frame-wise accuracy against challenging benchmarks. However, they rely on multiple decoders and self-attention blocks for informative representations, whose huge computing and memory costs remain an obstacle to handling long video sequences and practical deployment. To address these issues, we design a light transformer model for the action segmentation task, named LASFormer, with a novel encoder–decoder structure based on three key designs. First, we propose a receptive field-guided distillation to realize mode reduction, which can overcome more generally the gap in semantic feature structure between the intermediate features by aggregated temporal dilation convolution (ATDC). Second, we propose a simplified implicit attention to replace self-attention to avoid its quadratic complexity. Third, we design an efficient action relation encoding module embedded after the decoder, where the temporal graph reasoning introduces an inductive bias that adjacent frames are more likely to belong to the same class of model global temporal relations, and the cross-model fusion structure integrates frame-level and segment-level temporal clues, which can avoid over-segmentation independent of multiple decoders, thus reducing further computational complexity. Extensive experiments have verified the effectiveness and efficiency of the framework. Against the challenging 50Salads, GTEA, and Breakfast benchmarks, LASFormer significantly outperforms the current state-of-the-art methods in accuracy, edit score, and F1 score.

KW - action relation encoding

KW - action segmentation

KW - light transformer

KW - model reduction

UR - http://www.scopus.com/inward/record.url?scp=85181903951&partnerID=8YFLogxK

U2 - 10.3390/math12010057

DO - 10.3390/math12010057

M3 - Article

AN - SCOPUS:85181903951

SN - 2227-7390

VL - 12

JO - Mathematics

JF - Mathematics

IS - 1

M1 - 57

ER -

LASFormer: Light Transformer for Action Segmentation with Receptive Field-Guided Distillation and Action Relation Encoding

Abstract

Keywords

Access to Document

Other files and links

Fingerprint

Cite this