Scientists are tackling the challenge of deploying advanced vision-language-action (VLA) models on robots in real time, a crucial step towards more responsive and efficient robotic systems. Boseong Jeon, Yunho Choi, and Taehan Kim, all from Samsung Research South Korea, present a novel knowledge distillation framework called Shallow-π that dramatically reduces model complexity without significantly sacrificing performance. Their research systematically reduces the number of layers in flow-based VLA models, a previously underexplored direction, achieving more than two times faster inference with minimal loss in success rate on standard benchmarks. This work is significant because it demonstrates state-of-the-art performance among reduced VLA models and, importantly, validates its effectiveness through large-scale, real-world experiments, with deployment on edge computing platforms such as Jetson Orin and Thor, paving the way for practical robotic applications.

The team achieved a substantial reduction in model depth, compressing a VLA from 18 to 6 layers, while maintaining high performance on standard manipulation benchmarks. This breakthrough was accomplished by systematically reducing transformer layers in both the VLM backbone and the flow-based action head, a strategy previously unexplored in the context of knowledge distillation for flow-based VLAs. The study unveils a principled approach to compressing VLA models, focusing on transformer layer reduction rather than token-level efficiency, which has been the primary focus of prior work.
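As a rough sketch of what such depth reduction could look like in code, the snippet below initializes a shallow student by copying a subset of the teacher's transformer layers in both the backbone and the action head. The module names (`backbone.layers`, `action_head.layers`) and the kept indices are illustrative assumptions, not the authors' implementation.

```python
import copy
import torch.nn as nn

def build_shallow_student(teacher, keep_indices=(0, 3, 7, 11, 14, 17)):
    """Build a 6-layer student from an 18-layer teacher by keeping a subset
    of transformer layers in both the VLM backbone and the action head.
    Assumes (hypothetically) that the teacher exposes `backbone.layers` and
    `action_head.layers` as nn.ModuleList objects of equal depth, i.e. a
    pi-like architecture where the action head mirrors the backbone depth."""
    student = copy.deepcopy(teacher)
    student.backbone.layers = nn.ModuleList(
        copy.deepcopy(teacher.backbone.layers[i]) for i in keep_indices
    )
    # Keep the action head aligned layer-for-layer with the reduced backbone
    # so layer-wise feature injection still matches up.
    student.action_head.layers = nn.ModuleList(
        copy.deepcopy(teacher.action_head.layers[i]) for i in keep_indices
    )
    return student
```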

Researchers carefully designed a set of distillation objectives, including ground-truth supervision, teacher trajectory imitation, and intermediate attention transfer, specifically tailored for π-like flow-based VLAs. These objectives ensure that the reduced model retains the layer-wise feature transfer essential for maintaining performance on complex tasks. The innovation addresses a key limitation of existing flow-based VLA models, which combine a large VLM backbone with a computationally intensive diffusion-based action head, and its practical validation demonstrates the feasibility of deploying advanced VLA models on edge devices for real-time robotic control. Notably, this was accomplished without relying on graph-level optimizations or runtime conversion techniques, highlighting the inherent efficiency of the proposed framework. The work opens new avenues for creating more responsive and efficient robots capable of operating in challenging, real-world environments.

Distilling π-Like Flow-Based VLAs to Reduced Depth Improves Efficiency

The research team compressed models from an initial 18 layers down to just 6, achieving a substantial reduction in computational complexity without significant performance loss. This innovative approach addresses a critical need for faster, on-device inference crucial for real-time robotic applications. To achieve this layer reduction, the study pioneered a systematic knowledge distillation process, carefully tailoring objectives specifically for π-like flow-based VLAs, architectures where the action head mirrors the VLM depth to facilitate layer-wise feature transfer. Experiments employed a combination of distillation objectives, including ground-truth supervision, teacher trajectory imitation, and intermediate attention transfer, ensuring effective knowledge transfer from a larger teacher model to the compressed student model.
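For the ground-truth supervision term, a flow-matching target is required. The sketch below constructs one under a simple linear-interpolation convention commonly used for flow-based policies; the paper's exact parameterization of the interpolant and velocity target may differ.

```python
import torch

def flow_matching_batch(actions):
    """Construct a noisy action chunk and its flow-matching velocity target
    by interpolating between Gaussian noise and the ground-truth actions.
    `actions` has shape (batch, horizon, action_dim). This is one common
    convention; the paper's parameterization may differ."""
    eps = torch.randn_like(actions)                          # Gaussian noise sample
    tau = torch.rand(actions.size(0), 1, 1,
                     device=actions.device)                  # denoising timestep in (0, 1)
    noisy_actions = (1.0 - tau) * eps + tau * actions        # interpolant at time tau
    u = actions - eps                                        # velocity target toward the data
    return noisy_actions, tau, u
```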

The team meticulously designed these objectives to account for the unique characteristics of flow-based VLAs, where only action tokens are denoised and multimodal features are injected at each layer. The methodology involved training a 6-layer Shallow-π model and rigorously evaluating its performance on standard manipulation benchmarks. Furthermore, the team achieved almost 10 Hz end-to-end inference on Jetson Orin without relying on graph-level optimizations or runtime conversion techniques, highlighting the efficiency and practicality of Shallow-π. The work directly addresses limitations of prior layer-skipping methods, which typically require the full model to remain in GPU memory and often focus solely on backbone depth reduction.
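As a rough illustration of how such control-frequency numbers can be measured (not the authors' benchmarking code; `policy` and `sample_inputs` are placeholders for the deployed VLA and a representative observation), one can simply time repeated forward passes on the target device:

```python
import time
import torch

@torch.no_grad()
def measure_control_hz(policy, sample_inputs, n_iters=100, warmup=10):
    """Estimate end-to-end inference frequency (Hz) by timing repeated
    forward passes. `policy` is a callable model and `sample_inputs` a dict
    of input tensors already placed on the target device."""
    policy.eval()
    for _ in range(warmup):                  # exclude one-time allocation/JIT costs
        policy(**sample_inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        policy(**sample_inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return n_iters / (time.perf_counter() - start)   # inferences per second
```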

Shallow-π Distillation Accelerates Robotic Model Inference on Resource-Constrained Devices

The research team successfully compressed a vision-language-action (VLA) model from 18 layers to just 6 using a principled knowledge distillation approach, a significant reduction in model complexity. This breakthrough delivers substantial improvements in on-device inference speed, crucial for real-time robotic deployment. To understand layer interactions within the VLA model, the team measured feature similarity across denoising timesteps, denoted τ. Analysis of cosine similarity between adjacent layers demonstrated that similarity profiles vary substantially with τ, rendering fixed layer-skipping rules ineffective.
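The kind of layer-similarity analysis described can be sketched as follows, assuming the per-layer hidden states of the action tokens can be collected for a given τ (the data layout is an assumption):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def adjacent_layer_similarity(hidden_states):
    """Mean cosine similarity between the action-token features of adjacent
    transformer layers. `hidden_states` is a list of per-layer tensors, each
    shaped (batch, num_action_tokens, dim)."""
    sims = []
    for prev, curr in zip(hidden_states[:-1], hidden_states[1:]):
        sim = F.cosine_similarity(prev, curr, dim=-1)   # (batch, tokens)
        sims.append(sim.mean().item())
    return sims  # one similarity value per adjacent layer pair

# Repeating this for several denoising timesteps tau reveals whether a single
# fixed layer-skipping rule could hold across the whole denoising trajectory.
```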

Layer sensitivity analysis, which quantifies the drop in success rate when individual layers are skipped on the LIBERO benchmark, confirmed that similarity poorly predicts functional importance: skipping layers with higher similarity sometimes caused larger performance drops. The data also show that removing layers in order of lowest sensitivity, as determined by this analysis, caused the success rate to collapse after only three layers were removed. Results demonstrate that transformer layer reduction is uniquely effective at improving inference latency, particularly on high-performance hardware like the H100, where token reduction offers minimal benefit: reducing the number of transformer layers yields a substantially larger decrease in inference time than the modest latency improvements obtained from token-level pruning.
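The sensitivity analysis and the greedy-removal curve could look roughly like this, where `evaluate_success_rate(model, skip_layers)` is a placeholder for a LIBERO-style evaluation that runs the model with the given layers bypassed:

```python
def layer_sensitivity(model, evaluate_success_rate, num_layers):
    """Per-layer sensitivity: drop in benchmark success rate when a single
    layer is skipped. A larger drop means the layer is more important."""
    baseline = evaluate_success_rate(model, skip_layers=[])
    return {
        layer: baseline - evaluate_success_rate(model, skip_layers=[layer])
        for layer in range(num_layers)
    }

def greedy_removal_curve(model, evaluate_success_rate, sensitivity):
    """Skip layers cumulatively in order of lowest sensitivity and record how
    the success rate degrades as more layers are removed."""
    order = sorted(sensitivity, key=sensitivity.get)
    skipped, curve = [], []
    for layer in order:
        skipped.append(layer)
        curve.append(evaluate_success_rate(model, skip_layers=list(skipped)))
    return curve
```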

The study employed knowledge distillation, training a student model to approximate the teacher's denoising behavior using task supervision, knowledge distillation losses, and a novel attention distillation loss. Specifically, the task loss was calculated as $\mathbb{E}\left[\lVert v_\theta(\cdot) - u \rVert_2^2\right]$, while the knowledge distillation loss was $\mathbb{E}\left[\lVert v_\theta(\cdot) - v_\phi(\cdot) \rVert_2^2\right]$, where $v_\theta$ and $v_\phi$ denote the student and teacher velocity predictions and $u$ the ground-truth flow target. Furthermore, the team introduced an attention distillation loss, $\mathcal{L}_{\text{attn}} = \mathbb{E}\left[\mathrm{KL}\left(\mathrm{Attn}^{a \to vl}_\phi \,\middle\|\, \mathrm{Attn}^{a \to vl}_\theta\right)\right]$, aligning cross-attention distributions between action queries and vision-language key-value pairs at an intermediate transformer layer.
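Put together, the objectives above could be implemented along the following lines, assuming the student (θ) and teacher (φ) each return a predicted velocity and a pre-softmax cross-attention map from action queries to vision-language tokens at the chosen intermediate layer; tensor layouts and weighting factors are illustrative, not the released training code.

```python
import torch
import torch.nn.functional as F

def shallow_pi_losses(student_out, teacher_out, u, weights=(1.0, 1.0, 1.0)):
    """Combine the three distillation objectives described above.
    student_out / teacher_out: dicts with
      'v'    -- predicted velocity, shape (batch, horizon, action_dim)
      'attn' -- pre-softmax action-to-vision/language attention scores,
                shape (batch, heads, action_tokens, vl_tokens)
    u: ground-truth flow-matching velocity target, same shape as 'v'."""
    w_task, w_kd, w_attn = weights

    # Task loss: E[ || v_theta(.) - u ||_2^2 ]
    loss_task = F.mse_loss(student_out["v"], u)

    # Teacher trajectory imitation: E[ || v_theta(.) - v_phi(.) ||_2^2 ]
    loss_kd = F.mse_loss(student_out["v"], teacher_out["v"].detach())

    # Attention distillation: KL( Attn_phi || Attn_theta ) over the VL axis.
    log_p_student = F.log_softmax(student_out["attn"], dim=-1)
    p_teacher = F.softmax(teacher_out["attn"].detach(), dim=-1)
    loss_attn = F.kl_div(log_p_student, p_teacher, reduction="batchmean")

    return w_task * loss_task + w_kd * loss_kd + w_attn * loss_attn
```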

Shallow-π Distillation Boosts Robotic Manipulation Speed

The distilled model demonstrates improved robustness and generalization under unseen spatial perturbations, effectively incorporating updated visual observations to correct for errors. However, the authors acknowledge that knowledge distillation incurs higher training-time computational costs compared to layer-skipping approaches, as both teacher and student models are loaded simultaneously. Future work will focus on strategies to selectively freeze model components and curate key training samples to reduce VRAM consumption and improve distillation efficiency. Additionally, researchers plan to explore combining Shallow-π with other efficiency techniques, such as visual token reduction and diffusion step reduction, to further increase inference throughput.