New Technique Improves AI Ability to Map 3D Space With 2D Cameras
For Immediate Release
Researchers have developed a technique that allows artificial intelligence (AI) programs to better map three-dimensional spaces using two-dimensional images captured by multiple cameras. Because the technique works effectively with limited computational resources, it holds promise for improving the navigation of autonomous vehicles.
“Most autonomous vehicles use powerful AI programs called vision transformers to take 2D images from multiple cameras and create a representation of the 3D space around the vehicle,” says Tianfu Wu, corresponding author of a paper on the work and an associate professor of electrical and computer engineering at North Carolina State University. “However, while each of these AI programs takes a different approach, there is still substantial room for improvement.
“Our technique, called Multi-View Attentive Contextualization (MvACon), is a plug-and-play supplement that can be used in conjunction with these existing vision transformer AIs to improve their ability to map 3D spaces,” Wu says. “The vision transformers aren’t getting any additional data from their cameras; they’re just able to make better use of the data.”
MvACon works by modifying an approach called Patch-to-Cluster attention (PaCa), which Wu and his collaborators released last year. PaCa allows transformer AIs to identify objects in an image more efficiently and effectively.
“The key advance here is applying what we demonstrated with PaCa to the challenge of mapping 3D space using multiple cameras,” Wu says.
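For readers who want a concrete picture of the patch-to-cluster idea, the following is a minimal sketch in plain Python. It is not the researchers' implementation. It assumes that image patches are softly grouped into a small number of clusters, and that each patch then attends to those clusters rather than to every other patch. The function name, the assignment weights and the sizes used here are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def patch_to_cluster_attention(patches, assign_weights):
    """Illustrative patch-to-cluster attention (hypothetical helper).

    patches:        N x D patch features (the queries)
    assign_weights: D x M projection scoring each patch against M clusters
                    (an assumption; in practice this would be learned)
    Returns the N x D contextualized features and the N x M attention map.
    """
    d = len(patches[0])
    logits = matmul(patches, assign_weights)                   # N x M
    # For each cluster, normalize its scores over all N patches.
    assign_t = [softmax(col) for col in transpose(logits)]     # M x N
    # Each cluster token is a weighted average of the patch features.
    clusters = matmul(assign_t, patches)                       # M x D
    # Every patch attends to the M clusters instead of all N patches.
    scores = matmul(patches, transpose(clusters))              # N x M
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    out = matmul(attn, clusters)                               # N x D
    return out, attn


random.seed(0)
N, D, M = 12, 4, 3  # 12 patches, 4 feature dimensions, 3 clusters
patches = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
assign_weights = [[random.gauss(0, 1) for _ in range(M)] for _ in range(D)]
out, attn = patch_to_cluster_attention(patches, assign_weights)
print(len(attn), len(attn[0]))  # attention map is N x M, not N x N
```

In this sketch the attention map has one row per patch and one column per cluster, so its size grows with the number of clusters rather than with the square of the number of patches. This is one way to see how such an approach could keep the added computational demand small.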
To test the performance of MvACon, the researchers used it in conjunction with three leading vision transformers – BEVFormer, the BEVFormer DFA3D variant, and PETR. In each case, the vision transformers were collecting 2D images from six different cameras. In all three instances, MvACon significantly improved performance.
“Performance was particularly improved when it came to locating objects, as well as the speed and orientation of those objects,” says Wu. “And the increase in computational demand of adding MvACon to the vision transformers was almost negligible.
“Our next steps include testing MvACon against additional benchmark datasets, as well as testing it against actual video input from autonomous vehicles. If MvACon continues to outperform the existing vision transformers, we’re optimistic that it will be adopted for widespread use.”
The paper, “Multi-View Attentive Contextualization for Multi-View 3D Object Detection,” will be presented June 20 at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, being held in Seattle, Wash. First author of the paper is Xianpeng Liu, a recent Ph.D. graduate of NC State. The paper was co-authored by Ce Zheng and Chen Chen of the University of Central Florida; Ming Qian and Nan Xue of the Ant Group; and Zhebin Zhang and Chen Li of the OPPO U.S. Research Center.
The work was done with support from the National Science Foundation, under grants 1909644, 2024688 and 2013451; the U.S. Army Research Office, under grants W911NF1810295 and W911NF2210010; and a research gift fund from Innopeak Technology, Inc.
-shipman-
Note to Editors: The study abstract follows.
“Multi-View Attentive Contextualization for Multi-View 3D Object Detection”
Authors: Xianpeng Liu and Tianfu Wu, North Carolina State University; Ce Zheng and Chen Chen, University of Central Florida; Ming Qian and Nan Xue, Ant Group; and Zhebin Zhang and Chen Li, OPPO U.S. Research Center
Presented: June 20 at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Wash.
Abstract: We present Multi-View Attentive Contextualization (MvACon), a simple yet effective method for improving 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. Despite remarkable progress witnessed in the field of query-based MV3D object detection, prior art often suffers from either the lack of exploiting high-resolution 2D features in dense attention-based lifting, due to high computational costs, or from insufficiently dense grounding of 3D queries to multi-scale 2D features in sparse attention-based lifting. Our proposed MvACon hits the two birds with one stone using a representationally dense yet computationally sparse attentive feature contextualization scheme that is agnostic to specific 2D-to-3D feature lifting approaches. In experiments, the proposed MvACon is thoroughly tested on the nuScenes benchmark, using both the BEVFormer and its recent 3D deformable attention (DFA3D) variant, as well as the PETR, showing consistent detection performance improvement, especially in enhancing performance in location, orientation, and velocity prediction. It is also tested on the Waymo-mini benchmark using BEVFormer with similar improvement. We qualitatively and quantitatively show that global cluster-based contexts effectively encode dense scene-level contexts for MV3D object detection. The promising results of our proposed MvACon reinforces the adage in computer vision – “(contextualized) feature matters.”