I am a Ph.D. candidate in Electrical and Computer Engineering at The University of Texas at Austin, advised by Prof. Atlas Wang in the VITA group.
I work closely with Prof. Marco Pavone and Prof. Yue Wang on 3D end-to-end models with robust generalization capabilities; with Prof. Achuta Kadambi on recovering 3D/4D signals that capture the space-time structure of our world from casually captured data; and with Prof. Callie Hao on hardware-software co-design.
I was an awardee of the Qualcomm Innovation Fellowship 2022.
Humans possess the remarkable ability to perceive and interact with their environment. This ability is driven by an internal understanding of how a scene is structured and of the inherent properties of the environment. To equip future intelligent machines with this capability, they must perceive geometry directly from visual inputs when interacting with the physical world, rather than relying on offline algorithms to preprocess camera poses, which limits the scalability of foundation models for safe and reliable planning.
Next-generation learning algorithms equipped with visual sensors should inherently perceive geometric structure. To this end, my research aims to: pre-train 3D foundation models that leverage Internet-scale video data; align them with existing VLMs for reliable reasoning and planning by grounding them in physical geometry derived from visual inputs; and investigate novel architectures that can efficiently process, interpret, and reason over high-resolution visual streams in the temporal dimension.
The emergence of reasoning and strong generalization capabilities in foundation models is built upon the ability to process and compress large-scale data into well-designed, scalable models. However, 3D learning typically requires lengthy, modular, and non-differentiable pipelines for calibrating image or video data. This paradigm significantly hinders the scaling of 3D models to web-scale, unannotated video data that lacks image annotations and camera poses. My research addresses a practical and compelling challenge: 3D reconstruction from pose-free, unannotated image data.
My research also tackles the challenge of data sparsity in photorealistic 3D digital environments, where dense scene capture with annotated poses is often unavailable. By combining geometric principles with generative priors learned from large datasets, my methods fill in missing information using statistical patterns in shape and appearance. This approach leverages both the deterministic nature of geometry and the probabilistic power of generative models, enabling architectures to learn effectively from limited data.
My research has been demonstrated on platforms such as the Quest 3, deployed within IARPA projects, and integrated into multiple commercial products.