
Showing 1–50 of 151 results for author: Wonka, P

Searching in archive cs.
  1. arXiv:2604.01973  [pdf, ps, other]

    cs.CV

    NearID: Identity Representation Learning via Near-identity Distractors

    Authors: Aleksandar Cvejic, Rameen Abdal, Abdelrahman Eldesokey, Bernard Ghanem, Peter Wonka

    Abstract: When evaluating identity-focused tasks such as personalized generation and image editing, existing vision encoders entangle object identity with background context, leading to unreliable representations and metrics. We introduce the first principled framework to address this vulnerability using Near-identity (NearID) distractors, where semantically similar but distinct instances are placed on the…

    Submitted 2 April, 2026; originally announced April 2026.

    Comments: Code at https://github.com/Gorluxor/NearID

  2. arXiv:2603.17975  [pdf, ps, other]

    cs.CV

    AHOY! Animatable Humans under Occlusion from YouTube Videos with Gaussian Splatting and Video Diffusion Priors

    Authors: Aymen Mir, Riza Alp Guler, Xiangjun Tang, Peter Wonka, Gerard Pons-Moll

    Abstract: We present AHOY, a method for reconstructing complete, animatable 3D Gaussian avatars from in-the-wild monocular video despite heavy occlusion. Existing methods assume unoccluded input (a fully visible subject, often in a canonical pose), excluding the vast majority of real-world footage where people are routinely occluded by furniture, objects, or other people. Reconstructing from such footage poses…

    Submitted 18 March, 2026; originally announced March 2026.

    Comments: Our project page is available at https://miraymen.github.io/ahoy/

  3. arXiv:2603.07664  [pdf, ps, other]

    cs.CV cs.AI cs.GR

    Ref-DGS: Reflective Dual Gaussian Splatting

    Authors: Ningjing Fan, Yiqun Wang, Dongming Yan, Peter Wonka

    Abstract: Reflective appearance, especially strong and typically near-field specular reflections, poses a fundamental challenge for accurate surface reconstruction and novel view synthesis. Existing Gaussian splatting methods either fail to model near-field specular reflections or rely on explicit ray tracing at substantial computational cost. We present Ref-DGS, a reflective dual Gaussian splatting framewo…

    Submitted 13 March, 2026; v1 submitted 8 March, 2026; originally announced March 2026.

    Comments: Project page: https://straybirdflower.github.io/Ref-DGS/

  4. arXiv:2603.04090  [pdf, ps, other]

    cs.CV cs.GR cs.HC

    EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR

    Authors: Zhenyu Li, Sai Kumar Dwivedi, Filip Maric, Carlos Chacon, Nadine Bertsch, Filippo Arcadu, Tomas Hodan, Michael Ramamonjisoa, Peter Wonka, Amy Zhao, Robin Kips, Cem Keskin, Anastasia Tkach, Chenhongyi Yang

    Abstract: Egocentric human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present EgoPoseFormer v2, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimatio…

    Submitted 4 March, 2026; originally announced March 2026.

    Comments: Accepted to CVPR 2026

  5. arXiv:2603.03026  [pdf, ps, other]

    cs.CV

    Any Resolution Any Geometry: From Multi-View To Multi-Patch

    Authors: Wenqing Cui, Zhenyu Li, Mykola Lavreniuk, Jian Shi, Ramzi Idoughi, Xiangjun Tang, Peter Wonka

    Abstract: Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. To address this challenge, we propose the Ultra Resolution Geometry Transformer (URGT), which adapts the Visual Geometry Grounded Transformer (VGGT) into a unifi…

    Submitted 3 March, 2026; originally announced March 2026.

    Comments: Project webpage: https://github.com/Dreamaker-MrC/Any-Resolution-Any-Geometry

  6. arXiv:2602.11693  [pdf, ps, other]

    cs.GR cs.AI cs.CV

    OMEGA-Avatar: One-shot Modeling of 360° Gaussian Avatars

    Authors: Zehao Xia, Yiqun Wang, Zhengda Lu, Kai Liu, Jun Xiao, Peter Wonka

    Abstract: Creating high-fidelity, animatable 3D avatars from a single image remains a formidable challenge. We identify three desirable attributes of avatar generation: 1) the method should be feed-forward, 2) it should model a 360° full head, and 3) it should be animation-ready. However, current work addresses only two of the three points simultaneously. To address these limitations, we propose OMEGA-Avatar, the firs…

    Submitted 12 February, 2026; originally announced February 2026.

    Comments: Project page: https://omega-avatar.github.io/OMEGA-Avatar/

  7. arXiv:2601.22231  [pdf, ps, other]

    cs.CV

    Geometry without Position? When Positional Embeddings Help and Hurt Spatial Reasoning

    Authors: Jian Shi, Michael Birsak, Wenqing Cui, Zhenyu Li, Peter Wonka

    Abstract: This paper revisits the role of positional embeddings (PEs) within vision transformers (ViTs) from a geometric perspective. We show that PEs are not mere token indices but effectively function as geometric priors that shape the spatial structure of the representation. We introduce token-level diagnostics that measure how multi-view geometric consistency in ViT representations depends on consistent P…

    Submitted 29 January, 2026; originally announced January 2026.

  8. arXiv:2601.11194  [pdf, ps, other]

    cs.CV

    ATATA: One Algorithm to Align Them All

    Authors: Boyi Pang, Savva Ignatyev, Vladimir Ippolitov, Ramil Khafizov, Yurii Melnik, Oleg Voynov, Maksim Nakhodnov, Aibek Alanov, Xiaopeng Fan, Peter Wonka, Evgeny Burnaev

    Abstract: We suggest a new multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. While some existing methods propose a codependent generation process, they do not view the problem of joint generation from a structural alignment perspective. Recent work uses Score Distillation Sampling to generate aligned 3D models, but SDS is known to be time-consuming,…

    Submitted 16 January, 2026; originally announced January 2026.

  9. arXiv:2601.02457  [pdf, ps, other]

    cs.CV

    PatchAlign3D: Local Feature Alignment for Dense 3D Shape Understanding

    Authors: Souhail Hadgi, Bingchen Gong, Ramana Sundararaman, Emery Pierson, Lei Li, Peter Wonka, Maks Ovsjanikov

    Abstract: Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. Recent approaches leverage vision and language foundation models to directly solve dense tasks through multi-view renderings and text queries. While promising, these pipelines require expensive inference over multiple renderings, depend heavily on large langu…

    Submitted 5 January, 2026; originally announced January 2026.

    Comments: Project website: https://souhail-hadgi.github.io/patchalign3dsite/

  10. arXiv:2512.16920  [pdf, ps, other]

    cs.CV cs.AI

    EasyV2V: A High-quality Instruction-based Video Editing Framework

    Authors: Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov, Bernard Ghanem, Peter Wonka, Ashkan Mirzaei

    Abstract: While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, l…

    Submitted 18 December, 2025; originally announced December 2025.

    Comments: Project page: https://snap-research.github.io/easyv2v/

  11. arXiv:2512.10840  [pdf, ps, other]

    cs.CV

    PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning

    Authors: Jianqi Chen, Biao Zhang, Xiangjun Tang, Peter Wonka

    Abstract: 6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object po…

    Submitted 11 December, 2025; originally announced December 2025.

    Comments: Project page: https://windvchen.github.io/PoseGAM/

  12. arXiv:2512.07459  [pdf, ps, other]

    cs.GR cs.CV

    Human Geometry Distribution for 3D Animation Generation

    Authors: Xiangjun Tang, Biao Zhang, Peter Wonka

    Abstract: Generating realistic human geometry animations remains a challenging task, as it requires modeling natural clothing dynamics with fine-grained geometric details under limited data. To address these challenges, we introduce two novel designs. First, we propose a compact distribution-based latent representation that enables efficient and high-quality geometry generation. We improve upon previous work…

    Submitted 8 December, 2025; originally announced December 2025.

  13. arXiv:2512.02781  [pdf, ps, other]

    cs.CV cs.GR cs.LG

    LumiX: Structured and Coherent Text-to-Intrinsic Generation

    Authors: Xu Han, Biao Zhang, Xiangjun Tang, Xianzhi Li, Peter Wonka

    Abstract: We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention…

    Submitted 2 December, 2025; originally announced December 2025.

    Comments: The code will be available at https://github.com/xhanxu/LumiX

  14. arXiv:2511.22171  [pdf, ps, other]

    cs.CV cs.GR

    BrepGPT: Autoregressive B-rep Generation with Voronoi Half-Patch

    Authors: Pu Li, Wenhao Zhang, Weize Quan, Biao Zhang, Peter Wonka, Dong-Ming Yan

    Abstract: Boundary representation (B-rep) is the de facto standard for CAD model representation in modern industrial design. The intricate coupling between geometric and topological elements in B-rep structures has forced existing generative methods to rely on cascaded multi-stage networks, resulting in error accumulation and computational inefficiency. We present BrepGPT, a single-stage autoregressive fram…

    Submitted 27 November, 2025; originally announced November 2025.

  15. arXiv:2510.15386  [pdf, ps, other]

    cs.CV

    PFGS: Pose-Fused 3D Gaussian Splatting for Complete Multi-Pose Object Reconstruction

    Authors: Ting-Yu Yen, Yu-Sheng Chiu, Shih-Hsuan Hung, Peter Wonka, Hung-Kuo Chu

    Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled high-quality, real-time novel-view synthesis from multi-view images. However, most existing methods assume the object is captured in a single, static pose, resulting in incomplete reconstructions that miss occluded or self-occluded regions. We introduce PFGS, a pose-aware 3DGS framework that addresses the practical challenge of reconstru…

    Submitted 17 October, 2025; originally announced October 2025.

  16. arXiv:2510.06208  [pdf, ps, other]

    cs.CV

    ShapeGen4D: Towards High Quality 4D Shape Generation from Videos

    Authors: Jiraphon Yenphraphai, Ashkan Mirzaei, Jianqi Chen, Jiaxu Zou, Sergey Tulyakov, Raymond A. Yeh, Peter Wonka, Chaoyang Wang

    Abstract: Video-conditioned 4D shape generation aims to recover time-varying 3D geometry and view-consistent appearance directly from an input video. In this work, we introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from the video. Our framework introduces three key components based on large-scale pre-trained 3D models: (i) a temporal a…

    Submitted 7 October, 2025; originally announced October 2025.

    Comments: Project page: https://shapegen4d.github.io/

  17. arXiv:2509.21989  [pdf, ps, other]

    cs.CV

    Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation

    Authors: Abdelrahman Eldesokey, Aleksandar Cvejic, Bernard Ghanem, Peter Wonka

    Abstract: We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models, enabling visual correspondence in a manner analogous to the well-established semantic correspondence. While diffusion model backbones are known to encode semantically rich features, they must also contain visual features to support their image synthesis capabilities. Howev…

    Submitted 26 September, 2025; originally announced September 2025.

    Comments: NeurIPS 2025 (Spotlight). Project Page: https://abdo-eldesokey.github.io/mind-the-glitch/

  18. arXiv:2509.11164  [pdf, ps, other]

    cs.CV

    No Mesh, No Problem: Estimating Coral Volume and Surface from Sparse Multi-View Images

    Authors: Diego Eustachio Farchione, Ramzi Idoughi, Peter Wonka

    Abstract: Effective reef monitoring requires the quantification of coral growth via accurate volumetric and surface area estimates, which is a challenging task due to the complex morphology of corals. We propose a novel, lightweight, and scalable learning framework that addresses this challenge by predicting the 3D volume and surface area of coral-like objects from 2D multi-view RGB images. Our approach uti…

    Submitted 22 January, 2026; v1 submitted 14 September, 2025; originally announced September 2025.

    Comments: Reverted to previous version due to clarity issues

  19. arXiv:2509.10678  [pdf, ps, other]

    cs.GR

    T2Bs: Text-to-Character Blendshapes via Video Generation

    Authors: Jiahao Luo, Chaoyang Wang, Michael Vasilkovsky, Vladislav Shakhrai, Di Liu, Peiye Zhuang, Sergey Tulyakov, Peter Wonka, Hsin-Ying Lee, James Davis, Jian Wang

    Abstract: We present T2Bs, a framework for generating high-quality, animatable character head morphable models from text by combining static text-to-3D generation with video diffusion. Text-to-3D models produce detailed static geometry but lack motion synthesis, while video diffusion models generate motion with temporal and multi-view geometric inconsistencies. T2Bs bridges this gap by leveraging deformable…

    Submitted 26 September, 2025; v1 submitted 12 September, 2025; originally announced September 2025.

  20. arXiv:2508.11379  [pdf, ps, other]

    cs.CV cs.AI

    G-CUT3R: Guided 3D Reconstruction with Camera and Depth Prior Integration

    Authors: Ramil Khafizov, Artem Komarichev, Ruslan Rakhimov, Peter Wonka, Evgeny Burnaev

    Abstract: We introduce G-CUT3R, a novel feed-forward approach for guided 3D scene reconstruction that enhances the CUT3R model by integrating prior information. Unlike existing feed-forward methods that rely solely on input images, our method leverages auxiliary data, such as depth, camera calibrations, or camera positions, commonly available in real-world scenarios. We propose a lightweight modification to…

    Submitted 29 September, 2025; v1 submitted 15 August, 2025; originally announced August 2025.

  21. arXiv:2508.01170  [pdf, ps, other]

    cs.CV

    DELTAv2: Accelerating Dense 3D Tracking

    Authors: Tuan Duc Ngo, Ashkan Mirzaei, Guocheng Qian, Hanwen Liang, Chuang Gan, Evangelos Kalogerakis, Peter Wonka, Chaoyang Wang

    Abstract: We propose a novel algorithm for accelerating dense long-term 3D point tracking in videos. Through analysis of existing state-of-the-art methods, we identify two major computational bottlenecks. First, transformer-based iterative tracking becomes expensive when handling a large number of trajectories. To address this, we introduce a coarse-to-fine strategy that begins tracking with a small subset…

    Submitted 9 December, 2025; v1 submitted 1 August, 2025; originally announced August 2025.

    Comments: Project page: https://snap-research.github.io/DELTAv2/

  22. arXiv:2507.15321  [pdf, ps, other]

    cs.CV

    BenchDepth: Are We on the Right Way to Evaluate Depth Foundation Models?

    Authors: Zhenyu Li, Haotong Lin, Jiashi Feng, Peter Wonka, Bingyi Kang

    Abstract: Depth estimation is a fundamental task in computer vision with diverse applications. Recent advancements in deep learning have led to powerful depth foundation models (DFMs), yet their evaluation remains challenging due to inconsistencies in existing protocols. Traditional benchmarks rely on alignment-based metrics that introduce biases, favor certain depth representations, and complicate fair com…

    Submitted 21 July, 2025; originally announced July 2025.

    Comments: Webpage: https://zhyever.github.io/benchdepth

  23. arXiv:2507.07644  [pdf, ps, other]

    cs.AI

    FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations

    Authors: Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, Peter Wonka

    Abstract: We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large-language models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes (e.g., kitchens, living rooms, bedrooms, bathrooms, and others), encoded symbolically in JSON or XML layouts. The benchmark covers core spatial tasks, including distance measurement, visibility, path findi…

    Submitted 30 January, 2026; v1 submitted 10 July, 2025; originally announced July 2025.

    Comments: v3, Project page: https://huggingface.co/papers/2507.07644

  24. arXiv:2506.18839  [pdf, ps, other]

    cs.CV

    4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation

    Authors: Chaoyang Wang, Ashkan Mirzaei, Vidit Goel, Willi Menapace, Aliaksandr Siarohin, Avalon Vinella, Michael Vasilkovsky, Ivan Skorokhodov, Vladislav Shakhrai, Sergey Korolev, Sergey Tulyakov, Peter Wonka

    Abstract: We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. Our architecture has two main components, a 4D video model and a 4D reconstruction model. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially o…

    Submitted 18 June, 2025; originally announced June 2025.

  25. arXiv:2505.21319  [pdf, ps, other]

    cs.GR cs.CV

    efunc: An Efficient Function Representation without Neural Networks

    Authors: Biao Zhang, Peter Wonka

    Abstract: Function fitting/approximation plays a fundamental role in computer graphics and other engineering applications. While recent advances have explored neural networks to address this task, these methods often rely on architectures with many parameters, limiting their practical applicability. In contrast, we pursue high-quality function approximation using parameter-efficient representations that eli…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Project website: https://efunc.github.io/efunc/

  26. arXiv:2505.05288  [pdf, ps, other]

    cs.CV cs.AI cs.RO

    PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

    Authors: Ahmed Abdelreheem, Filippo Aleotti, Jamie Watson, Zawar Qureshi, Abdelrahman Eldesokey, Peter Wonka, Gabriel Brostow, Sara Vicente, Guillermo Garcia-Hernando

    Abstract: We introduce the novel task of Language-Guided Object Placement in Real 3D Scenes. Our model is given a 3D scene's point cloud, a 3D asset, and a textual prompt broadly describing where the 3D asset should be placed. The task here is to find a valid placement for the 3D asset that respects the prompt. Compared with other language-guided localization tasks in 3D scenes such as grounding, this task…

    Submitted 2 October, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

    Comments: ICCV 2025. Project page: https://nianticlabs.github.io/placeit3d/

  27. arXiv:2504.18424  [pdf, other]

    cs.CV

    LaRI: Layered Ray Intersections for Single-view 3D Geometric Reasoning

    Authors: Rui Li, Biao Zhang, Zhenyu Li, Federico Tombari, Peter Wonka

    Abstract: We present layered ray intersections (LaRI), a new method for unseen geometry reasoning from a single image. Unlike conventional depth estimation that is limited to the visible surface, LaRI models multiple surfaces intersected by the camera rays using layered point maps. Benefiting from the compact and layered representation, LaRI enables complete, efficient, and view-aligned geometric reasoning…

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: Project page: https://ruili3.github.io/lari

  28. arXiv:2503.20318  [pdf, other]

    cs.CV

    EditCLIP: Representation Learning for Image Editing

    Authors: Qian Wang, Aleksandar Cvejic, Abdelrahman Eldesokey, Peter Wonka

    Abstract: We introduce EditCLIP, a novel representation-learning approach for image editing. Our method learns a unified representation of edits by jointly encoding an input image and its edited counterpart, effectively capturing their transformation. To evaluate its effectiveness, we employ EditCLIP to solve two tasks: exemplar-based image editing and automated edit evaluation. In exemplar-based image edit…

    Submitted 26 March, 2025; originally announced March 2025.

    Comments: Project page: https://qianwangx.github.io/EditCLIP/

  29. arXiv:2503.20289  [pdf, ps, other]

    cs.CV

    HierRelTriple: Guiding Indoor Layout Generation with Hierarchical Relationship Triplet Losses

    Authors: Kaifan Sun, Bingchen Yang, Peter Wonka, Jun Xiao, Haiyong Jiang

    Abstract: We present a hierarchical triplet-based indoor relationship learning method, coined HierRelTriple, with a focus on spatial relationship learning. Existing approaches often depend on manually defined spatial rules or simplified pairwise representations, which fail to capture complex, multi-object relationships found in real scenarios and lead to overcrowded or physically implausible arrangements. W…

    Submitted 15 September, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

  30. arXiv:2503.16653  [pdf, other]

    cs.CV

    iFlame: Interleaving Full and Linear Attention for Efficient Mesh Generation

    Authors: Hanxiao Wang, Biao Zhang, Weize Quan, Dong-Ming Yan, Peter Wonka

    Abstract: This paper proposes iFlame, a novel transformer-based network architecture for mesh generation. While attention-based models have demonstrated remarkable performance in mesh generation, their quadratic computational complexity limits scalability, particularly for high-resolution 3D data. Conversely, linear attention mechanisms offer lower computational costs but often struggle to capture long-range…

    Submitted 23 March, 2025; v1 submitted 20 March, 2025; originally announced March 2025.

    Comments: Project website: https://wanghanxiao123.github.io/iFa/

  31. arXiv:2503.09631  [pdf, ps, other]

    cs.GR eess.IV

    V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video

    Authors: Jianqi Chen, Biao Zhang, Xiangjun Tang, Peter Wonka

    Abstract: We present V2M4, a novel 4D reconstruction method that directly generates a usable 4D mesh animation asset from a single monocular video. Unlike existing approaches that rely on priors from multi-view image and video generation models, our method is based on native 3D mesh generation models. Naively applying 3D mesh generation models to generate a mesh for each frame in a 4D task can lead to issue…

    Submitted 29 July, 2025; v1 submitted 11 March, 2025; originally announced March 2025.

    Comments: Accepted by ICCV 2025. Project page: https://windvchen.github.io/V2M4/

  32. arXiv:2503.01448  [pdf, ps, other]

    cs.CV

    Generative Human Geometry Distribution

    Authors: Xiangjun Tang, Biao Zhang, Peter Wonka

    Abstract: Realistic human geometry generation is an important yet challenging task, requiring both the preservation of fine clothing details and the accurate modeling of clothing-body interactions. To tackle this challenge, we build upon Geometry Distributions, a recently proposed representation that can model a single human geometry with high fidelity using a flow matching model. However, extending a singl…

    Submitted 4 March, 2026; v1 submitted 3 March, 2025; originally announced March 2025.

  33. arXiv:2502.04762  [pdf, other]

    cs.CV

    Autoregressive Generation of Static and Growing Trees

    Authors: Hanxiao Wang, Biao Zhang, Jonathan Klein, Dominik L. Michels, Dongming Yan, Peter Wonka

    Abstract: We propose a transformer architecture and training strategy for tree generation. The architecture processes data at multiple resolutions and has an hourglass shape, with middle layers processing fewer tokens than outer layers. Similar to convolutional networks, we introduce longer-range skip connections to complement this multi-resolution approach. The key advantage of this architecture is the fas…

    Submitted 7 February, 2025; originally announced February 2025.

  34. PartEdit: Fine-Grained Image Editing using Pre-Trained Diffusion Models

    Authors: Aleksandar Cvejic, Abdelrahman Eldesokey, Peter Wonka

    Abstract: We present the first text-based image editing approach for object parts based on pre-trained diffusion models. Diffusion-based image editing approaches capitalize on diffusion models' deep understanding of image semantics to perform a variety of edits. However, existing diffusion models lack sufficient understanding of many object parts, hindering fine-grained edits requested by users. To a…

    Submitted 27 June, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

    Comments: Accepted by SIGGRAPH 2025 (Conference Track). Project page: https://gorluxor.github.io/part-edit/

    Journal ref: SIGGRAPH 2025 Conference Proceedings

  35. arXiv:2501.15981  [pdf, ps, other]

    cs.CV cs.GR cs.LG

    MatCLIP: Light- and Shape-Insensitive Assignment of PBR Material Models

    Authors: Michael Birsak, John Femiani, Biao Zhang, Peter Wonka

    Abstract: Assigning realistic materials to 3D models remains a significant challenge in computer graphics. We propose MatCLIP, a novel method that extracts shape- and lighting-insensitive descriptors of Physically Based Rendering (PBR) materials to assign plausible textures to 3D objects based on images, such as the output of Latent Diffusion Models (LDMs) or photographs. Matching PBR materials to static im…

    Submitted 9 August, 2025; v1 submitted 27 January, 2025; originally announced January 2025.

    Comments: Accepted at SIGGRAPH 2025 (Conference Track). Project page: https://birsakm.github.io/matclip

    Journal ref: SIGGRAPH 2025 Conference Proceedings

  36. arXiv:2501.01121  [pdf, other]

    cs.CV

    PatchRefiner V2: Fast and Lightweight Real-Domain High-Resolution Metric Depth Estimation

    Authors: Zhenyu Li, Wenqing Cui, Shariq Farooq Bhat, Peter Wonka

    Abstract: While current high-resolution depth estimation methods achieve strong results, they often suffer from computational inefficiencies due to reliance on heavyweight models and multiple inference steps, increasing inference time. To address this, we introduce PatchRefiner V2 (PRV2), which replaces heavy refiner models with lightweight encoders. This reduces model size and inference time but introduces…

    Submitted 2 January, 2025; originally announced January 2025.

  37. arXiv:2412.06592  [pdf, other]

    cs.CV cs.GR

    PrEditor3D: Fast and Precise 3D Shape Editing

    Authors: Ziya Erkoç, Can Gümeli, Chaoyang Wang, Matthias Nießner, Angela Dai, Peter Wonka, Hsin-Ying Lee, Peiye Zhuang

    Abstract: We propose a training-free approach to 3D editing that enables the editing of a single shape within a few minutes. The edited 3D mesh aligns well with the prompts, and remains identical for regions that are not intended to be altered. To this end, we first project the 3D object onto 4-view images and perform synchronized multi-view image editing along with user-guided text prompts and user-provide…

    Submitted 9 December, 2024; originally announced December 2024.

    Comments: Project Page: https://ziyaerkoc.com/preditor3d/ Video: https://www.youtube.com/watch?v=Ty2xXaEuewI

  38. arXiv:2412.06292  [pdf, other]

    cs.CV

    ZeroKey: Point-Level Reasoning and Zero-Shot 3D Keypoint Detection from Large Language Models

    Authors: Bingchen Gong, Diego Gomez, Abdullah Hamdi, Abdelrahman Eldesokey, Ahmed Abdelreheem, Peter Wonka, Maks Ovsjanikov

    Abstract: We propose a novel zero-shot approach for keypoint detection on 3D shapes. Point-level reasoning on visual data is challenging as it requires precise localization capability, posing problems even for powerful models like DINO or CLIP. Traditional methods for 3D keypoint detection rely heavily on annotated 3D datasets and extensive supervised training, limiting their scalability and applicability t…

    Submitted 9 December, 2024; originally announced December 2024.

    Comments: Project website is accessible at https://sites.google.com/view/zerokey

  39. arXiv:2412.04462  [pdf, other]

    cs.CV

    4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion

    Authors: Chaoyang Wang, Peiye Zhuang, Tuan Duc Ngo, Willi Menapace, Aliaksandr Siarohin, Michael Vasilkovsky, Ivan Skorokhodov, Sergey Tulyakov, Peter Wonka, Hsin-Ying Lee

    Abstract: We propose 4Real-Video, a novel framework for generating 4D videos, organized as a grid of video frames with both time and viewpoint axes. In this grid, each row contains frames sharing the same timestep, while each column contains frames from the same viewpoint. We propose a novel two-stream architecture. One stream performs viewpoint updates on columns, and the other stream performs temporal upd…

    Submitted 5 December, 2024; originally announced December 2024.

    Comments: Project page: https://snap-research.github.io/4Real-Video/

  40. arXiv:2412.02336  [pdf, other]

    cs.CV

    Amodal Depth Anything: Amodal Depth Estimation in the Wild

    Authors: Zhenyu Li, Mykola Lavreniuk, Jian Shi, Shariq Farooq Bhat, Peter Wonka

    Abstract: Amodal depth estimation aims to predict the depth of occluded (invisible) parts of objects in a scene. This task addresses the question of whether models can effectively perceive the geometry of occluded regions based on visible cues. Prior methods primarily rely on synthetic datasets and focus on metric depth estimation, limiting their generalization to real-world settings due to domain shifts an…

    Submitted 3 December, 2024; originally announced December 2024.

  41. arXiv:2412.00155  [pdf, other]

    cs.CV cs.LG

    T-3DGS: Removing Transient Objects for 3D Scene Reconstruction

    Authors: Alexander Markin, Vadim Pryadilshchikov, Artem Komarichev, Ruslan Rakhimov, Peter Wonka, Evgeny Burnaev

    Abstract: Transient objects in video sequences can significantly degrade the quality of 3D scene reconstructions. To address this challenge, we propose T-3DGS, a novel framework that robustly filters out transient distractors during 3D reconstruction using Gaussian Splatting. Our framework consists of two steps. First, we employ an unsupervised classification network that distinguishes transient objects fro…

    Submitted 8 March, 2025; v1 submitted 29 November, 2024; originally announced December 2024.

    Comments: Project website at https://transient-3dgs.github.io/

  42. arXiv:2411.16076  [pdf, ps, other]

    cs.CV cs.GR

    Geometry Distributions

    Authors: Biao Zhang, Jing Ren, Peter Wonka

    Abstract: Neural representations of 3D data have been widely adopted across various applications, particularly in recent work leveraging coordinate-based networks to model scalar or vector fields. However, these approaches face inherent challenges, such as handling thin structures and non-watertight geometries, which limit their flexibility and accuracy. In contrast, we propose a novel geometric data repres…

    Submitted 21 February, 2026; v1 submitted 24 November, 2024; originally announced November 2024.

    Comments: Accepted to ICCV 2025. For the project site, see https://1zb.github.io/GeomDist/

  43. arXiv:2411.14295  [pdf, other]

    cs.CV

    StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart

    Authors: Jian Shi, Qian Wang, Zhenyu Li, Ramzi Idoughi, Peter Wonka

    Abstract: Generating high-quality stereo videos that mimic human binocular vision requires consistent depth perception and temporal coherence across frames. Despite advances in image and video synthesis using diffusion models, producing high-quality stereo videos remains a challenging task due to the difficulty of maintaining consistent temporal and spatial coherence between left and right views. We introdu…

    Submitted 12 March, 2025; v1 submitted 21 November, 2024; originally announced November 2024.

  44. arXiv:2410.01295  [pdf, other]

    cs.CV cs.GR

    LaGeM: A Large Geometry Model for 3D Representation Learning and Diffusion

    Authors: Biao Zhang, Peter Wonka

    Abstract: This paper introduces a novel hierarchical autoencoder that maps 3D models into a highly compressed latent space. The hierarchical autoencoder is specifically designed to tackle the challenges arising from large-scale datasets and generative modeling using diffusion. Different from previous approaches that only work on a regular image or volume grid, our hierarchical autoencoder operates on unorde…

    Submitted 2 October, 2024; originally announced October 2024.

    Comments: For more information: https://1zb.github.io/LaGeM

  45. arXiv:2410.00262  [pdf, other]

    cs.CV

    ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning

    Authors: Jian Shi, Zhenyu Li, Peter Wonka

    Abstract: We introduce ImmersePro, an innovative framework specifically designed to transform single-view videos into stereo videos. This framework utilizes a novel dual-branch architecture comprising a disparity branch and a context branch on video data by leveraging spatial-temporal attention mechanisms. ImmersePro employs implicit disparity guidance, enabling the generation of stereo pa…

    Submitted 30 September, 2024; originally announced October 2024.

  46. arXiv:2408.14819  [pdf, other]

    cs.CV

    Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation

    Authors: Abdelrahman Eldesokey, Peter Wonka

    Abstract: We propose a diffusion-based approach for Text-to-Image (T2I) generation with interactive 3D layout control. Layout control has been widely studied to alleviate the shortcomings of T2I diffusion models in understanding objects' placement and relationships from text descriptions. Nevertheless, existing approaches for layout control are limited to 2D layouts, require the user to provide a static lay…

    Submitted 27 August, 2024; originally announced August 2024.

    Comments: Project Page: https://abdo-eldesokey.github.io/build-a-scene/

  47. arXiv:2406.15020  [pdf, other]

    cs.CV

    A3D: Does Diffusion Dream about 3D Alignment?

    Authors: Savva Ignatyev, Nina Konovalova, Daniil Selikhanovych, Oleg Voynov, Nikolay Patakin, Ilya Olkov, Dmitry Senushkin, Alexey Artemov, Anton Konushin, Alexander Filippov, Peter Wonka, Evgeny Burnaev

    Abstract: We tackle the problem of text-driven 3D generation from a geometry alignment perspective. Given a set of text prompts, we aim to generate a collection of objects with semantically corresponding parts aligned across them. Recent methods based on Score Distillation have succeeded in distilling the knowledge from 2D diffusion models to high-quality representations of the 3D objects. These methods han…

    Submitted 16 March, 2025; v1 submitted 21 June, 2024; originally announced June 2024.

  48. arXiv:2406.12831  [pdf, other]

    cs.CV cs.AI cs.MM

    VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing

    Authors: Jing Gu, Yuwei Fang, Ivan Skorokhodov, Peter Wonka, Xinya Du, Sergey Tulyakov, Xin Eric Wang

    Abstract: Video editing serves as a fundamental pillar of digital media, spanning applications in entertainment, education, and professional communication. However, previous methods often overlook the necessity of comprehensively understanding both global and local contexts, leading to inaccurate and inconsistent edits in the spatiotemporal dimension, especially for long videos. In this paper, we introduce…

    Submitted 27 March, 2025; v1 submitted 18 June, 2024; originally announced June 2024.

    Comments: 18 pages, 16 figures

  49. arXiv:2406.08659  [pdf, other]

    cs.CV

    Vivid-ZOO: Multi-View Video Generation with Diffusion Model

    Authors: Bing Li, Cheng Zheng, Wenxuan Zhu, Jinjie Mai, Biao Zhang, Peter Wonka, Bernard Ghanem

    Abstract: While diffusion models have shown impressive performance in 2D image/video generation, diffusion-based Text-to-Multi-view-Video (T2MVid) generation remains underexplored. The new challenges posed by T2MVid generation lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution. To this end, we propose a novel diffusion-based pipeline tha…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Our project page is at https://hi-zhengcheng.github.io/vividzoo/

  50. arXiv:2406.06679  [pdf, other]

    cs.CV

    PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation

    Authors: Zhenyu Li, Shariq Farooq Bhat, Peter Wonka

    Abstract: This paper introduces PatchRefiner, an advanced framework for metric single image depth estimation aimed at high-resolution real-domain inputs. While depth estimation is crucial for applications such as autonomous driving, 3D generative modeling, and 3D reconstruction, achieving accurate high-resolution depth in real-world scenarios is challenging due to the constraints of existing architectures a…

    Submitted 10 June, 2024; originally announced June 2024.