Sara Hooker, head of Cohere for AI, and Xingjian "XJ" Zhang, Head of Growth at Apex.AI, have shed light on the promising advances in vision-language models (VLMs) and their growing role in the future of autonomous driving. On March 18, 2025, both speakers emphasized the transformative potential of VLMs, which merge computer vision and natural language processing so that autonomous vehicles (AVs) can interpret multimodal data by linking visual inputs with textual descriptions.
Zhang pointed out the prevailing struggles faced by the autonomous vehicle industry, particularly highlighting the infamous "long tail" problem. Two decades after the DARPA Grand Challenge, he explained, AVs still face significant challenges when dealing with unforeseen scenarios during routine trips. These vehicles typically rely on high-definition maps, extensively detailed datasets, and rigid rule-based logic; they perform effectively within structured environments but stumble when the unexpected arises. "Think of them as well-rehearsed stage actors—flawless when following a script but lost when asked to improvise," said Zhang.
The introduction of vision-language models has sparked optimism within the field. A notable project, DriveVLM, developed by Li Auto and Tsinghua University, exemplifies this integration. DriveVLM pairs a vision transformer encoder with large language models (LLMs) to process the high volume of data generated on the road. By converting camera images into tokens, the system can generate detailed linguistic descriptions of its surroundings, including road conditions and navigational attributes, even in rare long-tail circumstances.
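In schematic terms, that pipeline looks something like the sketch below: camera frames are split into patches, each patch becomes a visual token, and those tokens are projected into the language model's embedding space before text generation. The patch size, embedding widths, and stand-in modules here are illustrative assumptions, not DriveVLM's actual implementation.

```python
# Minimal sketch of the image-to-token path a VLM uses.
# Patch size, dimensions, and module names are illustrative assumptions.
import torch
import torch.nn as nn


class PatchTokenizer(nn.Module):
    """Splits a camera frame into patches and embeds each as a visual token."""

    def __init__(self, patch=16, dim=768):
        super().__init__()
        # A strided convolution is the standard ViT trick for patch embedding.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, frames):                     # frames: (B, 3, H, W)
        tokens = self.proj(frames)                 # (B, dim, H/patch, W/patch)
        return tokens.flatten(2).transpose(1, 2)   # (B, num_patches, dim)


class VisualAdapter(nn.Module):
    """Projects visual tokens into the language model's embedding space."""

    def __init__(self, vision_dim=768, llm_dim=2048):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_tokens):
        return self.proj(visual_tokens)


if __name__ == "__main__":
    frame = torch.randn(1, 3, 224, 224)         # one front-camera frame
    visual_tokens = PatchTokenizer()(frame)     # 196 visual tokens
    llm_inputs = VisualAdapter()(visual_tokens)
    # In a full system these embeddings are prepended to a text prompt
    # ("Describe the road conditions ahead") and fed to the LLM decoder,
    # which generates the linguistic scene description.
    print(llm_inputs.shape)                     # torch.Size([1, 196, 2048])
```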
This amalgamation of visual and linguistic data enhances the adaptability of AV systems, pushing the boundaries of their operational environments. "VLMs can handle the plethora of scenarios by being pre-trained on large-scale internet datasets, giving them a foundational grasp of the world," explained Zhang. Improved scene comprehension and planning empower AVs to navigate complex environments more adeptly than before.
At the same time, AV architectures are progressing from modular setups to end-to-end (E2E) designs. Traditional systems compartmentalize perception, prediction, and planning into separate modules, creating inefficiencies at each hand-off. End-to-end architectures instead unify these steps, mapping raw sensor inputs directly to driving actions so that all components can be optimized jointly, as sketched below.
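The contrast is easiest to see in code. The following is a minimal, hypothetical illustration: the modular pipeline chains independently built stages, while the end-to-end model is a single network whose perception, prediction, and planning behavior emerge from one jointly trained set of weights. Module names, tensor shapes, and the ten-step waypoint horizon are assumptions for illustration only, not any production architecture.

```python
# Schematic contrast between a modular AV stack and an end-to-end model.
# Stage boundaries and shapes are illustrative assumptions.
import torch
import torch.nn as nn


def modular_pipeline(sensors, perception, prediction, planner):
    """Hand-crafted pipeline: each stage consumes the previous stage's output."""
    objects = perception(sensors)        # detected agents, lanes, signals
    forecasts = prediction(objects)      # where those agents will move next
    return planner(forecasts)            # the ego vehicle's trajectory


class EndToEndDriver(nn.Module):
    """Single network mapping raw sensor features straight to a trajectory,
    so perception, prediction, and planning are optimized jointly."""

    def __init__(self, sensor_dim=512, horizon=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(sensor_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2),      # one (x, y) waypoint per step
        )
        self.horizon = horizon

    def forward(self, sensor_features):       # (B, sensor_dim)
        out = self.backbone(sensor_features)
        return out.view(-1, self.horizon, 2)  # (B, horizon, 2) waypoints


if __name__ == "__main__":
    features = torch.randn(1, 512)
    # End-to-end: one forward pass from sensor features to waypoints.
    print(EndToEndDriver()(features).shape)   # torch.Size([1, 10, 2])
    # Modular: identity stubs stand in for hand-built stages.
    stub = lambda x: x
    print(modular_pipeline(features, stub, stub, stub).shape)
```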
Waymo’s End-to-End Multimodal Model for Autonomous Driving (EMMA) serves as another prime example, integrating perception and planning within a single system. EMMA processes raw camera images along with high-level driving commands to produce driving outputs such as planned trajectories, achieving state-of-the-art performance on benchmark datasets including nuScenes and the Waymo Open Motion Dataset (WOMD).
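Because this framing treats driving as a language-model task, the interface can be pictured as a prompt-and-parse loop: images and a routing command go in, and the model's textual answer is converted back into numeric waypoints. The prompt wording, placeholder tokens, and output format below are hypothetical illustrations, not Waymo's published specification.

```python
# Hypothetical illustration of an EMMA-style interface: camera frames plus a
# high-level routing command go in, and the model's textual answer (here,
# future waypoints) is parsed back into numbers. Formats are assumptions.
import re
from typing import List, Tuple


def build_prompt(num_frames: int, command: str) -> str:
    """Interleave image placeholders with the high-level driving command."""
    frames = " ".join(f"<image_{i}>" for i in range(num_frames))
    return (
        f"{frames}\n"
        f"Command: {command}\n"
        "Predict the ego vehicle's next waypoints as (x, y) pairs in meters."
    )


def parse_waypoints(model_text: str) -> List[Tuple[float, float]]:
    """Recover numeric waypoints from the model's free-form text output."""
    pairs = re.findall(r"\(\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\)", model_text)
    return [(float(x), float(y)) for x, y in pairs]


if __name__ == "__main__":
    prompt = build_prompt(3, "turn right at the next intersection")
    # A real model call would go here; this is a canned response for illustration.
    response = "(1.2, 0.1) (2.5, 0.4) (3.9, 1.1)"
    print(parse_waypoints(response))  # [(1.2, 0.1), (2.5, 0.4), (3.9, 1.1)]
```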
Nevertheless, significant challenges persist. Unlike models such as DALL-E, which generate an image from a static prompt, AVs must continuously interpret high-dimensional video streams in real time. Capturing long-term spatial relationships and dynamic changes within complex traffic conditions demands advanced 3D scene comprehension, a problem that remains largely unsolved.
DriveVLM, for example, faced practical limitations when tested on the NVIDIA Orin X with its four-billion-parameter Qwen model. The system showed a prefill latency of 0.57 seconds and a decode latency of 1.33 seconds, a combined 1.9 seconds to respond to a single scene. That delay is alarming: at 50 mph, the vehicle would travel approximately 139 feet (42 meters) before it reacts, a dangerous lag in real-time traffic.
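The arithmetic behind that figure is straightforward to reproduce from the numbers quoted above:

```python
# Sanity check of the latency figures quoted above: how far a car travels
# at 50 mph during the reported 1.9-second response time.
MPH_TO_MPS = 0.44704              # meters per second in one mile per hour

prefill_s, decode_s = 0.57, 1.33
latency_s = prefill_s + decode_s          # 1.9 s total response time

speed_mps = 50 * MPH_TO_MPS               # 50 mph ≈ 22.35 m/s
distance_m = speed_mps * latency_s        # ≈ 42.5 m
distance_ft = distance_m * 3.28084        # ≈ 139 ft

print(f"{latency_s:.2f} s -> {distance_m:.1f} m ({distance_ft:.0f} ft)")
# 1.90 s -> 42.5 m (139 ft)
```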
Despite the hurdles, the pace of innovation within the AV sector is unrelenting. "I believe future breakthroughs in model distillation will enable VLMs to become more efficient without compromising intelligence," Zhang asserted, pointing to advances in edge computing as pivotal to minimizing inference latency. The goal is for VLM-powered autonomous vehicles to process multimodal information in real time, enabling on-the-fly decision-making.
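Model distillation, in its standard form, trains a small student network to imitate a large teacher's output distribution. The sketch below shows a generic distillation loss of that kind; the temperature and loss weighting are conventional defaults, not parameters from any system discussed here.

```python
# Generic knowledge-distillation loss: a small student model is trained to
# match a large teacher VLM's softened output distribution while still
# fitting the hard labels. Hyperparameters are conventional assumptions.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-target KL loss (teacher) with hard-label cross-entropy."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The KL term is scaled by T^2 so its gradients stay comparable to the CE term.
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce


if __name__ == "__main__":
    student = torch.randn(4, 100)              # small on-vehicle model's logits
    teacher = torch.randn(4, 100)              # large teacher VLM's logits (frozen)
    labels = torch.randint(0, 100, (4,))
    print(distillation_loss(student, teacher, labels).item())
```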
For years, autonomous vehicles have grappled with the intricacies of real-world environments. Industry experts are now optimistic that VLMs might finally bring "vision," "language," and "model" together to revolutionize not only how vehicles drive but also how they learn, reason, and communicate with humans.