Yun Zhang

I am currently a first-year PhD student at the UCLA Mobility Lab, working under the guidance of Prof. Jiaqi Ma and Prof. Wei Wang. My research focuses on vision-language-action (VLA) systems and embodied intelligence, with an emphasis on enabling agents to perceive, reason, and act effectively in real-world environments.

My work lies at the intersection of robotics, artificial intelligence, and mobility. I am particularly interested in developing methods that allow physical AI systems to integrate visual and linguistic understanding with action, maintain structured memory over time, and perform reliable decision-making in long-horizon tasks. My research spans topics including navigation, manipulation, and memory-driven reasoning, with the goal of building robust and adaptable embodied agents.

I am also an Amazon Trainium Fellow; the fellowship supports my research on large-scale vision and action learning for embodied intelligence.

news

Oct 19, 2025	Thrilled to announce that our paper MIC-BEV: Multi-Infrastructure Camera Bird’s-Eye-View Transformer with Relation-Aware Fusion for 3D Object Detection received the Best Paper Award (Third Prize) at the ICCV 2025 DriveX Workshop.
Aug 19, 2025	Honored to be selected as a 2025 Amazon Trainium Fellow.
Apr 07, 2025	I’m excited to share that I’ve been selected as a recipient of the 2025 RSS Pathway Fellowship Program! Grateful to RSS for this opportunity to connect with the robotics community and further explore my research interests.
Mar 17, 2025	Thrilled to share that I’ve been selected to receive the prestigious Graduate Dean’s Scholar Award (GDSA) from UCLA’s Division of Graduate Education, enhancing my student financial support with awards totaling $14,500 over the next two years.
Jan 09, 2025	Excited to have won the U.S. Department of Transportation’s Intersection Safety Challenge as a core developer, with a $750,000 cash prize.

selected publications

TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments

International Conference on Machine Learning (ICML), 2026

Zhiyu Huang*, Yun Zhang*, Johnson Liu, Rui Song, Chen Tang, and Jiaqi Ma

Robots in dynamic, human-centric environments must follow language instructions while maintaining real-time reactive control. Vision-language-action (VLA) models offer a promising framework, but they assume temporally aligned reasoning and control, despite semantic inference being inherently delayed relative to real-time action. We introduce Think-in-Control (TIC)-VLA, a latency-aware framework that explicitly models delayed semantic reasoning during action generation. TIC-VLA defines a delayed semantic-control interface that conditions action generation on delayed vision-language semantic states and explicit latency metadata, in addition to current observations, enabling policies to compensate for asynchronous reasoning. We further propose a latency-consistent training pipeline that injects reasoning inference delays during imitation learning and online reinforcement learning, aligning training with asynchronous deployment. To support realistic evaluation, we present DynaNav, a physics-accurate, photo-realistic simulation suite for language-guided navigation in dynamic environments. Extensive experiments in simulation and on a real robot show that TIC-VLA consistently outperforms prior VLA models while maintaining robust real-time control under multi-second reasoning latency.
MIC-BEV: Multi-Infrastructure Camera Bird’s-Eye-View Transformer with Relation-Aware Fusion for 3D Object Detection

Under Review, Best Paper Award (Third Prize) at the ICCV 2025 DriveX Workshop, 2025

Yun Zhang, Zhaoliang Zheng, Johnson Liu, Zhiyu Huang, Zewei Zhou, Zonglin Meng, Tianhui Cai, and Jiaqi Ma

Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios due to challenges such as multi-view infrastructure setup, diverse camera configurations, degraded visual inputs, and various road layouts. We introduce MIC-BEV, a Transformer-based bird’s-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection, featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment.
InSPE: Rapid Evaluation of Heterogeneous Multi-Modal Infrastructure Sensor Placement

Under Review, 2025

Zhaoliang Zheng*, Yun Zhang*, Zonglin Meng, Johnson Liu, Xin Xia, Jiaqi Ma

This paper introduces InSPE, a framework for evaluating heterogeneous multi-modal infrastructure sensor placement by integrating metrics like sensor coverage, occlusion, and information gain, supported by a new dataset and benchmarking experiments to optimize perception in intelligent intersections.
AutoVLA: Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Neural Information Processing Systems (NeurIPS), 2025

Zhiyu Huang, Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Jiaqi Ma

AutoVLA is a vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning capabilities.