HiSAM: Hierarchical State Space Alignment for Motion Generation.

Abstract

Achieving realistic long-sequence human motion generation requires a model capable of maintaining both temporal coherence and spatial fidelity across frames. Current state space models (SSMs), including Mamba, offer efficient handling of extended data sequences but lack the specialized framework necessary for complex motion generation tasks. To address these limitations, we present HiSAM, an adaptation of the Mamba architecture uniquely suited to the demands of motion data. This model introduces two core innovations: a Temporal Consistency Block (TCB) that organizes multiple SSMs hierarchically within a U-Net structure to strengthen temporal alignment, and a Spatial Alignment Block (SAB) that enhances frame-to-frame spatial accuracy. Evaluated on the HumanML3D and KIT-ML datasets, HiSAM demonstrates a significant improvement in generation quality, reducing Fréchet Inception Distance by 43% while achieving a 2.9x speedup over top-performing diffusion models. These results underscore the effectiveness of SSM-based frameworks for generating high-quality, extended motion sequences efficiently.
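The abstract's central idea, running SSMs hierarchically at multiple temporal resolutions inside a U-Net-style structure, can be sketched in simplified form. The code below is a minimal illustration, not the paper's implementation: it uses a plain linear time-invariant SSM scan in place of Mamba's selective scan, and the function names (`ssm_scan`, `hierarchical_ssm`) and the nearest-neighbor up/downsampling and additive skip merge are assumptions for illustration only.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear SSM recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    x: (T, d_in) sequence; A: (d_state, d_state); B: (d_state, d_in);
    C: (d_out, d_state). Returns (T, d_out). A real Mamba block would
    make A, B, C input-dependent (the "selective" scan).
    """
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]
        ys.append(C @ h)
    return np.stack(ys)

def hierarchical_ssm(x, params, depth=2):
    """U-Net-style temporal hierarchy (hypothetical TCB sketch).

    Process the sequence at full frame rate, recurse on a 2x temporally
    downsampled copy, then upsample the coarse output and merge it back
    with a skip-style addition.
    """
    A, B, C = params
    y = ssm_scan(x, A, B, C)                 # fine-scale pass
    if depth > 0 and x.shape[0] >= 2:
        coarse = x[::2]                      # temporal downsample by 2
        y_coarse = hierarchical_ssm(coarse, params, depth - 1)
        y_up = np.repeat(y_coarse, 2, axis=0)[: x.shape[0]]  # nearest upsample
        y = y + y_up                         # skip-connection merge
    return y
```

The multi-rate recursion is what gives the hierarchy long-range temporal reach: coarse levels see the sequence at a lower frame rate, so the same state dimension covers a longer effective horizon.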

Publication
IEEE/CVF Conference on Computer Vision and Pattern Recognition