Accepted at ECCV 2026

SICAGE: Speaker-Independent Culture-Aware Gesture Generation using TED4C-L Dataset

SICAGE learns cultural representations from behavioral features, such as speech and text, while reducing dependence on each speaker's individual style. These representations generalize to unseen speakers from the same cultural group and condition culture-aware co-speech gesture generation.

Ariel Gjaci · Antonio Sgorbissa · Vittorio Murino

TED4C-L dataset at a glance

106.45 hours
764 speakers
4 cultural groups
659k samples

Learning cultural patterns in a speaker-independent way.

Culture is shaped by shared social practices and individual behavior. When a model learns cultural representations from the behavior of one group of speakers, those representations should still work for unseen speakers from the same group. Speaker-independent evaluation is therefore essential: otherwise, apparently culture-aware behavior can simply reflect patterns tied to the speakers used during training.

SICAGE treats each speaker as a separate domain and learns cultural representations designed to generalize to unseen speakers from the same group. ALaDiT then combines those representations with speech, textual context, and a short motion seed to synthesize synchronized upper-body gestures.
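To make the speaker-independent setup concrete, here is a minimal sketch of a speaker-disjoint split. It is not the TED4C-L split code: the function name `speaker_disjoint_split`, the `(speaker_id, culture)` sample layout, the 20% held-out fraction, and the toy speaker IDs are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(samples, test_fraction=0.2, seed=0):
    """Split samples so that no speaker appears in both train and test.

    `samples` is a list of (speaker_id, culture) pairs (hypothetical layout).
    Held-out speakers are drawn within each culture so every group is tested.
    """
    speakers_by_culture = defaultdict(set)
    for speaker_id, culture in samples:
        speakers_by_culture[culture].add(speaker_id)

    rng = random.Random(seed)
    held_out = set()
    for culture in sorted(speakers_by_culture):
        speakers = sorted(speakers_by_culture[culture])
        rng.shuffle(speakers)
        n_test = max(1, round(len(speakers) * test_fraction))
        held_out.update(speakers[:n_test])

    train = [s for s in samples if s[0] not in held_out]
    test = [s for s in samples if s[0] in held_out]
    return train, test


# Toy data: 4 cultures x 5 speakers x 3 samples each (hypothetical IDs).
cultures = ["India", "Italy", "Japan", "Turkey"]
samples = [(f"{c}_{i}", c) for c in cultures for i in range(5) for _ in range(3)]

train, test = speaker_disjoint_split(samples)
train_speakers = {speaker for speaker, _ in train}
test_speakers = {speaker for speaker, _ in test}
```

Because the split is made over speakers rather than over samples, every test clip comes from a speaker the model has never seen, which is the condition the evaluation above requires.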

One modular pipeline.

The dataset, cultural encoder, and motion generation model are modular, so each component can be replaced or studied independently.

Figure: SICAGE pipeline, from TED4C-L multimodal inputs through speaker-independent culture encoding to ALaDiT gesture generation. SICAGE extracts multimodal context, regularizes culture embeddings across speaker domains, and conditions motion synthesis.

Our implementation

1. Build TED4C-L

TED4C-L contains 659k five-second samples with matched upper-body motion, audio features, text features, culture labels, and speaker identities.

2. Learn cultural representations

Fishr or adversarial learning uses speech and text features to build cultural representations that improve generalization to unseen speakers.

3. Generate motion with ALaDiT

A 50-step motion diffusion model aligns motion with low-level audio and high-level text and cultural contexts to generate four seconds of motion in less than 14 ms.
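The second step mentions Fishr, a domain-generalization method that penalizes differences in gradient variance across domains. The sketch below illustrates that penalty with each speaker as a domain. It is an assumption-laden toy, not the SICAGE implementation: it uses a binary logistic classifier on two-dimensional features instead of a four-way culture classifier on speech and text features, and all function names and data are hypothetical.

```python
import math


def per_sample_grads(w, xs, ys):
    """Logistic-loss gradient w.r.t. w for each sample: (sigmoid(w.x) - y) * x."""
    grads = []
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        grads.append([(p - y) * xi for xi in x])
    return grads


def grad_variance(grads):
    """Element-wise (unbiased) variance of per-sample gradients in one domain."""
    n, dim = len(grads), len(grads[0])
    means = [sum(g[j] for g in grads) / n for j in range(dim)]
    return [sum((g[j] - means[j]) ** 2 for g in grads) / (n - 1) for j in range(dim)]


def fishr_penalty(w, domains):
    """Mean squared distance between each domain's gradient variance
    and the variance averaged over all domains."""
    variances = [grad_variance(per_sample_grads(w, xs, ys)) for xs, ys in domains]
    dim = len(w)
    mean_var = [sum(v[j] for v in variances) / len(variances) for j in range(dim)]
    return sum(
        sum((v[j] - mean_var[j]) ** 2 for j in range(dim)) for v in variances
    ) / len(variances)


# Two hypothetical speakers (domains) with toy features and binary labels.
w = [0.1, -0.2]
speaker_a = ([[1.0, 0.5], [0.2, -1.0], [-0.7, 0.3]], [1, 0, 1])
speaker_b = ([[3.0, 1.5], [0.6, -3.0], [-2.1, 0.9]], [1, 0, 1])

penalty_same = fishr_penalty(w, [speaker_a, speaker_a])
penalty_diff = fishr_penalty(w, [speaker_a, speaker_b])
```

The penalty vanishes when both domains produce identical gradient statistics and grows when they differ. In training, a term like this would typically be added to the classification loss with a weighting coefficient, pushing the encoder toward features that behave the same way across speakers.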

ALaDiT with Fishr cultural embeddings gives the best realism and cultural expressivity.

Mean ± standard deviation over 10 speaker-disjoint test runs. Lower FGD is better. For all other metrics, higher is better.
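As a small illustration of how a "mean ± standard deviation" entry can be produced from repeated runs, the sketch below aggregates ten values. The numbers are made up for the example and are not results from the paper. The use of the sample standard deviation (`statistics.stdev`) is also an assumption, since the page does not say which estimator was used.

```python
from statistics import mean, stdev


def summarize_runs(values):
    """Return (mean, sample standard deviation) over repeated test runs."""
    return mean(values), stdev(values)


def format_summary(values, digits=2):
    """Format a metric as 'mean ± std' with a fixed number of decimals."""
    m, s = summarize_runs(values)
    return f"{m:.{digits}f} ± {s:.{digits}f}"


# Hypothetical metric values from 10 runs (NOT the paper's numbers).
example_runs = [2.0, 2.2, 1.8, 2.1, 1.9, 2.0, 2.3, 1.7, 2.0, 2.0]
run_mean, run_std = summarize_runs(example_runs)
summary = format_summary(example_runs)
```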

Best realism: FGD of 1.03 for ALaDiT/FI

Best cultural expressivity: CE F1 of 44.61% for ALaDiT/FI

Real-time synthesis: under 14 ms for four seconds of motion

Model                         FGD ↓         CE F1 ↑        BAS ↑          SRGR ↑         Diversity ↑
ALaDiT / OneHot               1.63 ± 0.23   43.73 ± 1.13   22.51 ± 0.17   67.63 ± 0.25   111.79 ± 0.58
ALaDiT / NoDG                 1.56 ± 0.22   43.18 ± 1.20   22.51 ± 0.23   67.76 ± 0.23   111.60 ± 0.71
ALaDiT / NoAlign              1.36 ± 0.16   43.37 ± 0.91   22.58 ± 0.17   68.17 ± 0.23   111.10 ± 0.77
ALaDiT / NC                   1.60 ± 0.18   43.41 ± 1.10   22.51 ± 0.15   67.72 ± 0.23   109.50 ± 0.68
ALaDiT / ADV                  1.53 ± 0.17   42.71 ± 0.95   22.45 ± 0.17   67.57 ± 0.27   111.75 ± 0.71
ALaDiT / FI (best overall)    1.03 ± 0.15   44.61 ± 0.95   22.63 ± 0.22   68.09 ± 0.25   110.27 ± 0.70

Ablation evidence

Several ablations further demonstrate the value of SICAGE.

The ablations show that Fishr-based cultural representations matter beyond simply adding a culture label. Replacing Fishr with one-hot culture labels worsens FGD from 1.03 to 1.63, and using the same speech/text encoder without domain generalization gives 1.56. Removing ALaDiT's alignment losses also reduces quality, with FGD increasing to 1.36 and CE F1 dropping below FI. The user study further supports this trend: among generated models, ALaDiT/FI receives the highest overall ratings and is preferred over ADV overall and over NC for Cultural Match.

Same speech. Four motion conditions.

Each 25-second clip compares real motion, no-culture ALaDiT, adversarial culture embeddings, and Fishr embeddings.

India (Hindi)
Italy (Italian)
Japan (Japanese)
Turkey (Turkish)

A large, speaker-disjoint cultural gesture dataset.

TED4C-L contains 106.45 hours from 764 speakers across India, Italy, Japan, and Turkey. The public release provides numeric motion, audio, and text representations with fixed speaker-independent splits.

Speakers per country: India 191, Italy 194, Japan 196, Turkey 183.
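A quick arithmetic check of the figures above shows that the per-country speaker counts add up to the reported total and that the four groups are close to balanced. The derived quantities (each group's share and the average recording time per speaker) are computed here for illustration and are not stated on the page.

```python
# Per-country speaker counts and total hours as reported for TED4C-L.
speakers = {"India": 191, "Italy": 194, "Japan": 196, "Turkey": 183}
total_hours = 106.45

total_speakers = sum(speakers.values())  # 764, matching the reported total

# Derived values (not from the page): each group's share of speakers
# and the average amount of footage per speaker.
shares = {country: n / total_speakers for country, n in speakers.items()}
minutes_per_speaker = total_hours * 60 / total_speakers

for country, share in shares.items():
    print(f"{country}: {share:.1%} of speakers")
print(f"Average footage per speaker: {minutes_per_speaker:.2f} minutes")
```

Each group holds roughly a quarter of the speakers, and the average works out to a little over eight minutes of footage per speaker.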

We provide the code for full reproducibility.

The repository covers dataset creation or download, all ALaDiT variants and baseline ablations, and the code to prepare, run, and analyze the user study.

BibTeX


@inproceedings{gjaci2026sicage,
  author        = {Gjaci, Ariel and Sgorbissa, Antonio and Murino, Vittorio},
  title         = {SICAGE: Speaker-Independent Culture-Aware Gesture Generation using TED4C-L Dataset},
  booktitle     = {European Conference on Computer Vision (ECCV)},
  year          = {2026},
  eprint        = {2606.30001},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi           = {10.48550/arXiv.2606.30001},
  url           = {https://arxiv.org/abs/2606.30001}
}