Researchers are tackling the complex problem of robotic manipulation of moving objects, an area where current Vision-Language-Action (VLA) systems often fall short despite their successes with static objects. Haozhe Xie, Beichen Wen, and Jiarui Zheng, all from S-Lab at Nanyang Technological University, together with Chen et al., introduce DynamicVLA, a new framework designed to address these limitations through rapid perception, prediction of object movement, and continuous control. This work is significant because it not only presents a novel architecture integrating temporal reasoning and closed-loop adaptation, but also introduces the Dynamic Object Manipulation (DOM) benchmark, a new dataset of 200,000 synthetic and 2,000 real-world episodes created using an automated pipeline, facilitating further research in this challenging field.
Despite advancements in static manipulation, robots struggle with scenarios demanding rapid perception, temporal anticipation, and continuous control when objects are in motion. The research team addressed this limitation by integrating temporal reasoning and closed-loop adaptation into a unified system. DynamicVLA employs a compact 0.4B-parameter Vision-Language-Action (VLA) model, utilising a convolutional vision encoder to achieve spatially efficient and structurally faithful encoding, thereby enabling fast multimodal inference.
Central to this breakthrough is Continuous Inference, a technique that allows reasoning and execution to overlap, significantly reducing latency and enabling timely adaptation to object movement. The researchers also introduced Latent-aware Action Streaming, a mechanism that bridges the gap between perception and execution by enforcing temporally aligned action execution, ensuring consistent control even under inference delays. In addition, an automated data collection pipeline efficiently generated 200K synthetic episodes across 2.8K diverse scenes featuring 206 distinct objects, alongside 2K real-world episodes collected without relying on teleoperation.
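To make these two mechanisms more concrete, here is a minimal sketch, not the authors' implementation, of how overlapping inference and temporally aligned action execution could be wired together. The `policy.infer_chunk`, `robot.observe`, and `robot.apply` interfaces are hypothetical stand-ins, and the exact alignment rule used by DynamicVLA may differ.

```python
import threading
import time
from queue import Queue

# Hypothetical interfaces: policy.infer_chunk(obs) returns a list of
# (time_offset, action) pairs relative to the observation time;
# robot.observe() and robot.apply(action) stand in for real
# perception and control APIs.

def inference_loop(policy, robot, chunks: Queue, stop: threading.Event):
    """Keep producing action chunks so that reasoning overlaps with
    execution instead of blocking between chunks."""
    while not stop.is_set():
        obs = robot.observe()
        t_obs = time.monotonic()          # when the observation was captured
        chunk = policy.infer_chunk(obs)   # slow step, runs while the robot executes
        chunks.put((t_obs, chunk))

def execution_loop(robot, chunks: Queue, stop: threading.Event, hz: float = 30.0):
    """Execute each action at the wall-clock time it was planned for,
    dropping actions whose intended time has already passed because of
    inference delay (the temporal-alignment idea behind action streaming)."""
    dt = 1.0 / hz
    while not stop.is_set():
        t_obs, chunk = chunks.get()
        for t_offset, action in chunk:
            target = t_obs + t_offset
            now = time.monotonic()
            if now > target + dt:         # stale action: skip rather than lag behind
                continue
            time.sleep(max(0.0, target - now))
            robot.apply(action)
            if not chunks.empty():        # a fresher chunk has arrived; switch to it
                break
```

Running the two loops in separate threads lets the next chunk be computed while the previous one is still streaming to the robot, which captures the non-blocking, latency-hiding behaviour described above.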
Extensive evaluations demonstrate that DynamicVLA achieves remarkable improvements in response speed, perception accuracy, and generalisation. The framework outperforms existing VLA models, positioning it as a unified solution for general dynamic object manipulation across various robotic embodiments. As illustrated in the accompanying figures, current VLA models suffer from perception-execution gaps and inter-chunk waiting, leading to delayed reactions; DynamicVLA eliminates these issues through its action streaming and continuous inference scheme.

The DOM benchmark itself is a significant contribution, providing a robust platform for evaluating perception, interaction, and generalisation in dynamic manipulation tasks. The automated data collection pipeline, validated on robots such as the Franka Emika Panda and AgileX PiPER, enables efficient data gathering in both simulation and real-world settings. By combining a compact, efficient VLA model with a novel inference scheme and a large-scale benchmark, this work opens new avenues for developing robots that interact with dynamic environments seamlessly and reliably, paving the way for more versatile and adaptable robotic systems.
DynamicVLA framework and continuous multimodal inference offer promising results for real-time dynamic manipulation
Scientists developed DynamicVLA, a framework designed to address the challenges of manipulating dynamic objects, a persistent problem for Vision-Language-Action models. The research team tackled limitations in perceiving rapidly moving objects and maintaining continuous control by integrating temporal reasoning and closed-loop adaptation into their system. A key component of this work is a compact 0.4B VLA model, utilising a convolutional vision encoder, FastViT, to achieve spatially efficient and structurally faithful encoding, enabling fast multimodal inference. This architecture prioritises speed without sacrificing the quality of multimodal reasoning, a crucial step towards real-time dynamic manipulation.
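The following is a hedged sketch of how a convolutional encoder such as FastViT can yield a short, spatially compact token sequence for a small language backbone. It assumes timm ships a FastViT variant named "fastvit_sa12" and projects into an assumed 960-dimensional embedding space; the encoder configuration, resolution, and projection used in DynamicVLA may differ.

```python
import timm
import torch
import torch.nn as nn

class ConvVisionTokenizer(nn.Module):
    """Turn an image into a short sequence of visual tokens using a
    convolutional backbone's downsampled feature map."""

    def __init__(self, lm_dim: int = 960):  # 960 is an assumed LM hidden size
        super().__init__()
        # num_classes=0 and global_pool="" make timm return unpooled features.
        self.backbone = timm.create_model(
            "fastvit_sa12", pretrained=False, num_classes=0, global_pool=""
        )
        self.proj = nn.Linear(self.backbone.num_features, lm_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        fmap = self.backbone(images)              # (B, C, h, w), with small h and w
        tokens = fmap.flatten(2).transpose(1, 2)  # (B, h*w, C): few tokens per image
        return self.proj(tokens)                  # (B, h*w, lm_dim)

tokens = ConvVisionTokenizer()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # roughly (1, 64, 960) for a 32x-downsampling backbone
```

Because the convolutional feature map is already heavily downsampled, the language model sees far fewer visual tokens than with a patch-based ViT at the same resolution, which is one plausible reading of "spatially efficient" encoding.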
To overcome slow inference speeds, the study pioneered Continuous Inference, a technique that enables overlapping reasoning and execution. This method employs pipelined inference windows, allowing the system to process incoming observations and generate actions without blocking, thereby reducing latency and enabling timely adaptation to object motion. Experiments demonstrate that this approach yields non-blocking action execution across consecutive action chunks, which is critical for responding to unpredictable movements.

The DOM benchmark was built using an automated data collection pipeline, which efficiently generated 200K synthetic episodes across 2.8K scenes and 206 objects. The same pipeline also enabled the rapid collection of 2K real-world episodes, eliminating the need for time-consuming teleoperation, and delivers large-scale dynamic manipulation data in both simulation and the real world across multiple robot embodiments.

The architecture couples a lightweight backbone with an action expert, utilising SmolLM2-360M as the language backbone and truncating it to its first 16 layers to minimise inference latency. The action expert, a diffusion-based model, predicts action chunks conditioned on multimodal inputs, enabling precise control in dynamic environments.
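The backbone truncation described above can be illustrated with a brief sketch using Hugging Face transformers. The model id "HuggingFaceTB/SmolLM2-360M" and the wiring below are assumptions for illustration, not the authors' released code, and the coupling to the diffusion-based action expert is not reproduced here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only: load SmolLM2-360M and keep its first 16 decoder
# layers to cut inference latency, mirroring the truncation described above.
model_id = "HuggingFaceTB/SmolLM2-360M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm = AutoModelForCausalLM.from_pretrained(model_id)

lm.model.layers = lm.model.layers[:16]   # drop the later decoder layers
lm.config.num_hidden_layers = 16

# The truncated backbone still yields per-token hidden states that a
# downstream action expert could be conditioned on.
inputs = tokenizer("pick up the moving red cup", return_tensors="pt")
hidden = lm.model(**inputs).last_hidden_state    # (1, seq_len, hidden_size)
print(hidden.shape)
```

Halving the decoder depth roughly halves the per-token compute of the language backbone, which is consistent with the latency motivation given in the summary, though the quality trade-off would depend on how the truncated model is fine-tuned.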
DynamicVLA achieves fast, adaptive object manipulation through learned temporal reasoning and closed-loop adaptation
Scientists have developed DynamicVLA, a new framework for dynamic object manipulation, addressing challenges in rapidly changing scenarios that require precise perception and control. The research team integrated temporal reasoning and closed-loop adaptation, achieving significant improvements in response speed, perception, and generalisation. DynamicVLA uses a compact 0.4B Vision-Language-Action model with a convolutional vision encoder, enabling fast multimodal inference and structurally faithful encoding, and it significantly outperforms existing methods across nine evaluation sub-dimensions. Specifically, DynamicVLA achieved a path length of 2.50 meters and a task completion time of 8.53 seconds in simulation, showcasing enhanced efficiency and speed. Data shows that DynamicVLA excels in dynamic interaction, achieving 60.5% success in closed-loop reactivity, 38.5% in dynamic adaptation, and 40.5% in long-horizon sequencing. This represents a substantial improvement over the strongest baseline, with relative gains of +188.1%, +87.8%, and +440.0% respectively.
Measurements confirm that the framework effectively interprets appearance, spatial, and motion cues during dynamic manipulation, achieving 51.5% success in visual understanding, 48.0% in spatial reasoning, and 33.5% in motion perception. Tests show that DynamicVLA generalises well to unseen objects, novel 3D scenes, and unseen motion regimes, recording success rates of 59.5%, 65.0%, and 26.5% respectively. The result is a unified framework for general dynamic object manipulation, whose components demonstrably contribute to performance and reveal trade-offs between model capacity and inference efficiency. In real-world experiments, DynamicVLA consistently realigned perception and action under tight temporal constraints, maintaining robust performance where baseline methods frequently failed due to delayed reactions.
DynamicVLA tackles real-time robotic object manipulation with impressive gains in speed, perception, and generalisation
Scientists have developed DynamicVLA, a new framework designed to improve the manipulation of dynamic objects by robots. This research addresses the limitations of existing Vision-Language-Action models, which often struggle with scenarios requiring quick perception, anticipation of movement, and continuous control. The framework integrates temporal reasoning and closed-loop adaptation, utilising a compact 0.4 billion parameter VLA with a convolutional vision encoder for efficient processing. Extensive evaluations across diverse tasks and robotic embodiments demonstrate improvements in speed, perception, and generalisation. The authors acknowledge that the model’s performance is subject to limitations in real-time responsiveness and adaptation to unpredictable object motion. Future research could focus on enhancing the model’s ability to handle even more complex and uncertain dynamic environments, potentially exploring methods for improved prediction of object trajectories and more robust contact handling. This work establishes a unified framework for general dynamic object manipulation, offering a significant step towards more versatile and adaptable robotic systems.