STAA: Real-Time Explanation for Video Transformers

Authors
  • Zerui Wang

The Challenge

Video Transformer models like TimeSformer and ViViT have achieved state-of-the-art performance in video understanding tasks. However, their complex attention mechanisms make them difficult to interpret. Traditional XAI methods face significant limitations:

Method            | Spatial | Temporal | Speed     | Faithfulness
Grad-CAM          | Yes     | No       | Fast      | 0.65
SHAP-Video        | Yes     | Yes      | Very Slow | 0.72
Attention Rollout | Yes     | Limited  | Medium    | 0.68
STAA (Ours)       | Yes     | Yes      | Fast      | 0.87

How STAA Works

STAA extracts explanations directly from the Transformer's self-attention mechanism in a single forward pass:

[Figure: STAA method overview]
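
The entire explanation comes from one pass over the clip: hook the attention blocks, run the video through the model once, and keep the per-layer attention maps. The snippet below is a minimal sketch of that capture step, not the STAA implementation; it assumes a timm-style module layout (model.blocks, blk.attn) and that each attention block returns its attention map alongside its output.

import torch

def capture_attention(model, video_tensor):
    """Collect per-layer attention maps during a single forward pass."""
    attention_maps = []

    def hook(module, inputs, output):
        # Assumption: the attention block returns (context, attn_weights).
        # Adapt this to the attention module of the transformer being explained.
        attention_maps.append(output[1])

    handles = [blk.attn.register_forward_hook(hook) for blk in model.blocks]
    try:
        logits = model(video_tensor)
    finally:
        for handle in handles:
            handle.remove()

    return logits, attention_maps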

Key Innovation

Instead of treating spatial and temporal dimensions separately, STAA:

  1. Extracts joint attention weights from all transformer layers
  2. Decomposes them into spatial and temporal components using our novel attribution algorithm
  3. Produces a unified explanation showing both WHERE and WHEN the model focuses

def staa_attribution(video_tensor, model):
    """
    STAA: Spatio-Temporal Attention Attribution

    Args:
        video_tensor: Input video [B, T, C, H, W]
        model: Pretrained video transformer

    Returns:
        spatial_attr: Spatial importance map [B, T, H, W]
        temporal_attr: Temporal importance scores [B, T]
    """
    # Single forward pass with attention capture
    output, attention_weights = model(
        video_tensor,
        return_attention=True
    )

    # Extract spatial attribution (which pixels matter)
    spatial_attr = compute_spatial_attribution(
        attention_weights,
        method='gradient_weighted'
    )

    # Extract temporal attribution (which frames matter)
    temporal_attr = compute_temporal_attribution(
        attention_weights,
        method='attention_flow'
    )

    return spatial_attr, temporal_attr
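
The two helpers above are where the spatio-temporal decomposition happens. The paper's exact formulation is not reproduced here; the sketch below shows one simple, plausible reduction (head- and layer-averaged class-token attention) with simplified signatures, assuming each captured map has shape [B, heads, N, N] with a class token at index 0 and the remaining T*P patch tokens ordered frame by frame. The gradient weighting implied by the 'gradient_weighted' option is omitted.

import torch

def compute_spatial_attribution(attention_weights, num_frames, grid_size):
    """Class-token attention to patch tokens, averaged over layers and heads,
    reshaped to a per-frame map on the patch grid [B, T, grid, grid]."""
    stacked = torch.stack(attention_weights)              # [L, B, heads, N, N]
    cls_to_patches = stacked.mean(dim=(0, 2))[:, 0, 1:]   # [B, T*P]
    batch = cls_to_patches.shape[0]
    return cls_to_patches.reshape(batch, num_frames, grid_size, grid_size)

def compute_temporal_attribution(attention_weights, num_frames):
    """Per-frame attention mass, normalized to sum to one across frames [B, T]."""
    stacked = torch.stack(attention_weights)              # [L, B, heads, N, N]
    cls_to_patches = stacked.mean(dim=(0, 2))[:, 0, 1:]   # [B, T*P]
    batch = cls_to_patches.shape[0]
    per_frame = cls_to_patches.reshape(batch, num_frames, -1).sum(dim=-1)
    return per_frame / per_frame.sum(dim=-1, keepdim=True)

In practice the spatial map would be upsampled from the patch grid to pixel resolution to match the [B, T, H, W] shape in the docstring.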

Experimental Results

Performance on Kinetics-400

Metric       | STAA | Grad-CAM | SHAP | LIME
Faithfulness | 0.87 | 0.65     | 0.72 | 0.58
Monotonicity | 0.91 | 0.71     | 0.68 | 0.62
Latency (ms) | 47   | 35       | 3200 | 4100
Memory (GB)  | 2.1  | 1.8      | 8.5  | 7.2
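
Faithfulness and monotonicity follow standard attribution-evaluation protocols. As a rough illustration of a deletion-style faithfulness test (not necessarily the protocol used in the paper), one can progressively zero out the most-attributed positions and watch the target-class probability fall; the faster it drops, the more faithful the explanation. The deletion_faithfulness helper below is hypothetical and assumes spatial_attr has been upsampled to pixel resolution [B, T, H, W].

import torch

def deletion_faithfulness(model, video, spatial_attr, target_class, steps=10):
    """Zero the most-attributed positions in `steps` increments and record the
    target-class probability after each increment (returns a [B, steps+1] curve)."""
    b, t, c, h, w = video.shape
    order = spatial_attr.reshape(b, -1).argsort(dim=1, descending=True)
    per_step = order.shape[1] // steps
    masked = video.clone()
    probs = []
    for s in range(steps + 1):
        with torch.no_grad():
            p = model(masked).softmax(dim=-1)[:, target_class]
        probs.append(p)
        if s < steps:
            idx = order[:, s * per_step:(s + 1) * per_step]
            mask = torch.ones(b, t * h * w, device=video.device)
            mask.scatter_(1, idx, 0.0)
            masked = masked * mask.reshape(b, t, 1, h, w)
    return torch.stack(probs, dim=1)

The area under the resulting probability curve (lower is better for deletion) can then be compared across attribution methods.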

Computational Efficiency

  • Overhead: less than 3% additional computation compared with inference alone
  • Latency: 47 ms on average, well within the sub-150 ms budget for real-time applications
  • Memory: minimal additional memory footprint
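
Latency and overhead are hardware dependent, so it is worth re-measuring on the target deployment stack. A simple timing harness (the names below are illustrative, not part of STAA) might look like this:

import time
import torch

def mean_latency_ms(fn, warmup=5, iters=50):
    """Average wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000 / iters

# Hypothetical usage:
# baseline  = mean_latency_ms(lambda: model(video))
# explained = mean_latency_ms(lambda: staa_attribution(video, model))
# overhead  = (explained - baseline) / baseline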

Applications

1. Autonomous Driving

Understand what the model sees in real-time video streams from vehicle cameras.

2. Medical Diagnosis

Explain video-based diagnostic decisions (e.g., ultrasound analysis, surgical video review).

3. Security Systems

Debug model failures and detect adversarial attacks on surveillance systems.

4. Research & Development

Analyze Transformer attention patterns to improve model architectures.

Citation

@article{wang2025staa,
  title={STAA: Spatio-Temporal Attention Attribution for
         Real-Time Interpreting Transformer-Based AI Video Models},
  author={Wang, Zerui and Liu, Yan},
  journal={IEEE Access},
  volume={13},
  pages={101647--101661},
  year={2025},
  doi={10.1109/ACCESS.2025.3575440}
}