Build TED4C-L
TED4C-L contains 659k five-second samples with matched upper-body motion, audio features, text features, culture labels, and speaker identities.
Accepted at ECCV 2026
SICAGE learns cultural representations from behavioral features, such as speech and text, while reducing dependence on each speaker's individual style. These representations generalize to unseen speakers from the same cultural group and condition culture-aware co-speech gesture generation.
Ariel Gjaci · Antonio Sgorbissa · Vittorio Murino
TED4C-L dataset at a glance
Culture is shaped by shared social practices and individual behavior. When a model learns cultural representations from the behavior of one group of speakers, those representations should still work for unseen speakers from the same group. Speaker-independent evaluation is therefore essential: otherwise, apparently culture-aware behavior can simply reflect patterns tied to the speakers used during training.
SICAGE treats each speaker as a separate domain and learns cultural representations designed to generalize to unseen speakers from the same group. ALaDiT then combines those representations with speech, textual context, and a short motion seed to synthesize synchronized upper-body gestures.
The dataset, cultural encoder, and motion generation model are modular, so each component can be replaced or studied independently.
Our implementation
TED4C-L contains 659k five-second samples with matched upper-body motion, audio features, text features, culture labels, and speaker identities.
Fishr or adversarial learning uses speech and text features to build cultural representations that improve generalization to unseen speakers.
A 50-step motion diffusion model aligns motion with low-level audio and high-level text and cultural contexts to generate four seconds of motion in less than 14 ms.
Mean ± standard deviation over 10 speaker-disjoint test runs. Lower FGD is better. For all other metrics, higher is better.
FGD for ALaDiT/FI
CE F1 for ALaDiT/FI
for four seconds of motion
| Model | FGD ↓ | CE F1 ↑ | BAS ↑ | SRGR ↑ | Diversity ↑ |
|---|---|---|---|---|---|
| ALaDiT / OneHot | 1.63 ± .23 | 43.73 ± 1.13 | 22.51 ± .17 | 67.63 ± .25 | 111.79 ± .58 |
| ALaDiT / NoDG | 1.56 ± .22 | 43.18 ± 1.20 | 22.51 ± .23 | 67.76 ± .23 | 111.60 ± .71 |
| ALaDiT / NoAlign | 1.36 ± .16 | 43.37 ± .91 | 22.58 ± .17 | 68.17 ± .23 | 111.10 ± .77 |
| ALaDiT / NC | 1.60 ± .18 | 43.41 ± 1.10 | 22.51 ± .15 | 67.72 ± .23 | 109.50 ± .68 |
| ALaDiT / ADV | 1.53 ± .17 | 42.71 ± .95 | 22.45 ± .17 | 67.57 ± .27 | 111.75 ± .71 |
| best overall ALaDiT / FI | 1.03 ± .15 | 44.61 ± .95 | 22.63 ± .22 | 68.09 ± .25 | 110.27 ± .70 |
Ablation evidence
The ablations show that Fishr-based cultural representations matter beyond simply adding a culture label. Replacing Fishr with one-hot culture labels worsens FGD from 1.03 to 1.63, and using the same speech/text encoder without domain generalization gives 1.56. Removing ALaDiT's alignment losses also reduces quality, with FGD increasing to 1.36 and CE F1 dropping below FI. The user study further supports this trend: among generated models, ALaDiT/FI receives the highest overall ratings and is preferred over ADV overall and over NC for Cultural Match.
Each 25-second clip compares real motion, no-culture ALaDiT, adversarial culture embeddings, and Fishr embeddings.
TED4C-L contains 106.45 hours from 764 speakers across India, Italy, Japan, and Turkey. The public release provides numeric motion, audio, and text representations with fixed speaker-independent splits.
Explore the datasetThe repository covers dataset creation or download, all ALaDiT variants and baseline ablations, and the code to prepare, run, and analyze the user study.
Accepted at ECCV 2026.
@inproceedings{gjaci2026sicage,
author = {Gjaci, Ariel and Sgorbissa, Antonio and Murino, Vittorio},
title = {SICAGE: Speaker-Independent Culture-Aware Gesture Generation using TED4C-L Dataset},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026},
eprint = {2606.30001},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
doi = {10.48550/arXiv.2606.30001},
url = {https://arxiv.org/abs/2606.30001}
}