Introducing Cupid, a 3D generator that accurately infers the camera pose, 3D shape, and texture from a single image.
We present Cupid, a new generation-based method for single-image 3D reconstruction. Our approach jointly infers camera pose, 3D shape, and texture by formulating the task as a conditional sampling process within a two-stage flow matching pipeline. This unified framework enables robust pose estimation and achieves state-of-the-art results in both geometric accuracy and visual fidelity.
We use a powerful generative technique called flow matching, in which a network learns a velocity field that transports noise samples to data samples.
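For readers unfamiliar with flow models, the standard (rectified) flow matching setup looks as follows; this is a generic sketch, and the exact variant used in Cupid may differ:

$$ \mathbf{x}_t = (1-t)\,\mathbf{x}_0 + t\,\mathbf{x}_1, \qquad \mathcal{L}(\theta) = \mathbb{E}_{t,\, \mathbf{x}_0 \sim \mathcal{N}(\mathbf{0},\mathbf{I}),\, \mathbf{x}_1 \sim p_{\text{data}}} \Big[ \big\Vert v_\theta(\mathbf{x}_t, t) - (\mathbf{x}_1 - \mathbf{x}_0) \big\Vert^2 \Big]. $$

Sampling then integrates the learned ODE $\mathrm{d}\mathbf{x}_t/\mathrm{d}t = v_\theta(\mathbf{x}_t, t)$ from noise at $t=0$ to a sample at $t=1$.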
Stage 1: Occupancy and Pose Generation.
The first stage generates a coarse representation of the object and simultaneously estimates the camera pose. Given an input image, our flow model produces two key outputs: an occupancy cube (indicating which voxels $\mathbf{x}_i$ in space belong to the object) and a novel UV cube (indicating, for each occupied voxel, its 2D pixel location $\mathbf{u}_i$ in the input image).
Together, these cubes give dense 3D-2D correspondences, from which we can robustly solve for the camera's projection matrix $\mathbf{P}^{*}$ using a classical least-squares solver:
$$ \mathbf{P}^{*} = \operatorname*{arg\,min}_{\mathbf{P}} \sum_{i} \big\Vert\pi(\mathbf{P},\mathbf{x}_i) - \mathbf{u}_i\big\Vert^2. \tag{1} $$
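As an aside, the reprojection objective in Eq. (1) is typically initialized or approximated with the Direct Linear Transform (DLT), which minimizes an algebraic error in closed form. The NumPy sketch below is illustrative; the function name and interface are ours, not Cupid's code:

```python
import numpy as np

def estimate_projection_matrix(points_3d, points_2d):
    """Approximate Eq. (1) for a 3x4 projection matrix P via the Direct
    Linear Transform (DLT): each correspondence (x_i, u_i) contributes two
    linear constraints on the entries of P, solved by SVD.

    points_3d: (N, 3) voxel centers x_i from the occupancy cube
    points_2d: (N, 2) predicted pixel locations u_i from the UV cube
    """
    n = points_3d.shape[0]
    X = np.hstack([points_3d, np.ones((n, 1))])  # homogeneous 3D points (N, 4)
    A = np.zeros((2 * n, 12))
    for i, (u, v) in enumerate(points_2d):
        A[2 * i, 0:4] = X[i]          # row enforcing u = (p1 . X) / (p3 . X)
        A[2 * i, 8:12] = -u * X[i]
        A[2 * i + 1, 4:8] = X[i]      # row enforcing v = (p2 . X) / (p3 . X)
        A[2 * i + 1, 8:12] = -v * X[i]
    # The least-squares solution is the right singular vector of A
    # associated with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    P = vt[-1].reshape(3, 4)
    return P / np.linalg.norm(P)  # fix the arbitrary scale
```

In practice, a robust wrapper (e.g., RANSAC over the correspondences) can down-weight voxels with unreliable UV predictions.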
Stage 2: Pose-Aligned Geometry and Appearance Generation. With the camera pose now known, the second stage generates the fine-grained geometry and appearance. Common failure modes here are color drift and detail inconsistency, where the 3D model does not perfectly match the input image's colors and details. We solve this with a pose-aligned conditioner that injects pixel-wise information.
For each voxel in the occupancy cube, we use the calculated pose to find exactly where it lands on the 2D input image. We then sample features (both high-level semantics from DINO and low-level color/texture) from that precise pixel location. These pixel-aligned features are injected directly into the generation process, ensuring the final 3D model has high-fidelity geometry and appearance that is faithful to the input view.
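A minimal PyTorch sketch of this sampling step is below, assuming the image features (e.g., a DINO feature map) have already been extracted; the interface is illustrative rather than Cupid's actual code:

```python
import torch
import torch.nn.functional as F

def sample_pixel_aligned_features(voxels, P, feature_map):
    """Project voxel centers into the input view with the Stage-1 pose and
    sample pixel-aligned features for the conditioner.

    voxels:      (N, 3) occupied voxel centers in object space
    P:           (3, 4) projection matrix recovered in Stage 1
    feature_map: (C, H, W) image features (DINO semantics or raw RGB),
                 assumed to cover the full image plane
    """
    N = voxels.shape[0]
    homo = torch.cat([voxels, torch.ones(N, 1, dtype=voxels.dtype)], dim=1)
    proj = homo @ P.T                                  # (N, 3) homogeneous
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)    # (N, 2) pixel coords
    H, W = feature_map.shape[1:]
    # grid_sample expects coordinates normalized to [-1, 1] (x = col, y = row).
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
    feats = F.grid_sample(feature_map[None], grid.view(1, N, 1, 2),
                          align_corners=True)          # (1, C, N, 1)
    return feats.view(feature_map.shape[0], N).T       # (N, C) per-voxel features
```

Bilinear `grid_sample` keeps the lookup differentiable, so the conditioner can be trained end-to-end.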
Our framework naturally extends to reconstructing entire scenes. We use a segmentation model (such as SAM) to find all objects in an image, then run our reconstruction process on each object independently.
Finally, using the 3D-2D correspondences our method provides, we align each reconstructed object with a global depth prior (from a model like MoGe) to recover its scale and placement, composing the objects into a coherent scene.
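One concrete way to realize this alignment (a sketch under our own assumptions, not necessarily the paper's exact procedure): backproject the predicted pixels $\mathbf{u}_i$ through the depth prior to obtain world-space targets, then fit a similarity transform to the matching voxels $\mathbf{x}_i$ with the closed-form Umeyama method:

```python
import numpy as np

def align_object_to_scene(points_obj, points_world):
    """Fit a similarity transform (scale s, rotation R, translation t)
    mapping object-space points to depth-backprojected world points,
    via the closed-form Umeyama method.

    points_obj:   (N, 3) voxel centers x_i in the object's local frame
    points_world: (N, 3) backprojections of the pixels u_i through the
                  global depth prior (e.g., MoGe)
    """
    mu_x, mu_y = points_obj.mean(0), points_world.mean(0)
    X, Y = points_obj - mu_x, points_world - mu_y
    U, S, Vt = np.linalg.svd(Y.T @ X / len(X))    # 3x3 cross-covariance
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:
        D[2, 2] = -1                              # guard against reflections
    R = U @ D @ Vt
    s = (S * np.diag(D)).sum() / X.var(0).sum()   # optimal isotropic scale
    t = mu_y - s * R @ mu_x
    return s, R, t                                # world point ~= s * R @ x + t
```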
Although Cupid is trained with a single-image condition, it can easily be extended to multi-view reconstruction thanks to the flexibility of our generative framework.
Given multiple images of the same object from different angles, we know that the 3D object cube should be the same across all views.
Therefore, we can run our flow model for each image, but share the same occupancy latent $\mathbf{X}$ across all views during the iterative flow sampling. This is similar in spirit to MultiDiffusion, which fuses multiple diffusion processes by aggregating their predictions over a shared canvas.
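A sketch of this shared-latent sampling loop is below; `velocity_fn(x, t, image)` is a hypothetical interface for the conditional flow model, not Cupid's actual API:

```python
import torch

@torch.no_grad()
def multiview_flow_sample(velocity_fn, images, latent_shape, steps=50):
    """Sample one occupancy latent X shared across all views.

    At every Euler step, each view predicts its own velocity for the
    *same* latent, and the predictions are averaged so that all views
    steer a single consistent 3D sample (in the spirit of MultiDiffusion).
    """
    x = torch.randn(latent_shape)   # shared latent, pure noise at t = 0
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.tensor(k * dt)
        v = torch.stack([velocity_fn(x, t, im) for im in images]).mean(dim=0)
        x = x + dt * v              # Euler step along the fused flow
    return x                        # decoded downstream into the object cube
```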
@misc{huang2025cupidposegroundedgenerative3d,
  title={CUPID: Pose-Grounded Generative 3D Reconstruction from a Single Image},
  author={Binbin Huang and Haobin Duan and Yiqun Zhao and Zibo Zhao and Yi Ma and Shenghua Gao},
  year={2025},
  eprint={2510.20776},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.20776},
}
We thank NYU VisionX for the nice project page template.