
Showing 1–33 of 33 results for author: Harley, A W

Searching in archive cs.
  1. arXiv:2510.26443  [pdf, ps, other]

    cs.CV

    PointSt3R: Point Tracking through 3D Grounded Correspondence

    Authors: Rhodri Guerrier, Adam W. Harley, Dima Damen

    Abstract: Recent advances in foundational 3D reconstruction models, such as DUSt3R and MASt3R, have shown great potential in 2D and 3D correspondence in static scenes. In this paper, we propose to adapt them for the task of point tracking through 3D grounded correspondence. We first demonstrate that these models are competitive point trackers when focusing on static points, present in current point tracking…

    Submitted 30 October, 2025; originally announced October 2025.

    Comments: Project page: http://rhodriguerrier.github.io/PointSt3R

  2. arXiv:2510.20951  [pdf, ps, other]

    cs.CV

    Generative Point Tracking with Flow Matching

    Authors: Mattie Tesfaldet, Adam W. Harley, Konstantinos G. Derpanis, Derek Nowrouzezahrai, Christopher Pal

    Abstract: Tracking a point through a video can be a challenging task due to uncertainty arising from visual obfuscations, such as appearance changes and occlusions. Although current state-of-the-art discriminative models excel in regressing long-term point trajectory estimates -- even through occlusions -- they are limited to regressing to a mean (or mode) in the presence of uncertainty, and fail to capture…

    Submitted 23 October, 2025; originally announced October 2025.

    Comments: Project page: https://mtesfaldet.net/genpt_projpage/

  3. arXiv:2508.14466  [pdf, ps, other]

    cs.CV

    LookOut: Real-World Humanoid Egocentric Navigation

    Authors: Boxiao Pan, Adam W. Harley, C. Karen Liu, Leonidas J. Guibas

    Abstract: The ability to predict collision-free future trajectories from egocentric observations is crucial in applications such as humanoid robotics, VR / AR, and assistive navigation. In this work, we introduce the challenging problem of predicting a sequence of future 6D head poses from an egocentric video. In particular, we predict both head translations and rotations to learn the active information-gat…

    Submitted 20 August, 2025; originally announced August 2025.

  4. arXiv:2506.07310  [pdf, ps, other]

    cs.CV

    AllTracker: Efficient Dense Point Tracking at High Resolution

    Authors: Adam W. Harley, Yang You, Xinglong Sun, Yang Zheng, Nikhil Raghuraman, Yunqi Gu, Sheldon Liang, Wen-Hsuan Chu, Achal Dave, Pavel Tokmakov, Suya You, Rares Ambrus, Katerina Fragkiadaki, Leonidas J. Guibas

    Abstract: We introduce AllTracker: a model that estimates long-range point tracks by way of estimating the flow field between a query frame and every other frame of a video. Unlike existing point tracking methods, our approach delivers high-resolution and dense (all-pixel) correspondence fields, which can be visualized as flow maps. Unlike existing optical flow methods, our approach corresponds one frame to…

    Submitted 1 August, 2025; v1 submitted 8 June, 2025; originally announced June 2025.
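    The abstract's core idea -- reading long-range tracks off flow fields that are all anchored at one query frame -- can be sketched in a few lines. This is a toy illustration, not the paper's model; the function name, shapes, and nearest-neighbor lookup are assumptions:

    ```python
    import numpy as np

    def tracks_from_query_flows(query_pts, flows):
        """Turn flow fields (query frame -> every frame) into point tracks.

        query_pts: (N, 2) integer (x, y) positions in the query frame.
        flows: (T, H, W, 2) flow from the query frame to each frame t.
        Returns (T, N, 2) track positions.
        """
        T = flows.shape[0]
        tracks = np.zeros((T, len(query_pts), 2))
        for t in range(T):
            for n, (x, y) in enumerate(query_pts):
                # a track point is the query location displaced by the flow there
                tracks[t, n] = (x, y) + flows[t, y, x]
        return tracks
    ```

    Because every frame is corresponded directly to the query frame (rather than chaining frame-to-frame flow), errors do not accumulate over time in this formulation.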

  5. arXiv:2506.03868  [pdf, ps, other]

    cs.CV

    Animal Pose Labeling Using General-Purpose Point Trackers

    Authors: Zhuoyang Pan, Boxiao Pan, Guandao Yang, Adam W. Harley, Leonidas Guibas

    Abstract: Automatically estimating animal poses from videos is important for studying animal behaviors. Existing methods do not perform reliably since they are trained on datasets that are not comprehensive enough to capture all necessary animal behaviors. However, it is very challenging to collect such datasets due to the large variations in animal morphology. In this paper, we propose an animal pose label…

    Submitted 4 June, 2025; originally announced June 2025.

  6. arXiv:2504.14717  [pdf, ps, other]

    cs.CV cs.LG

    TAPIP3D: Tracking Any Point in Persistent 3D Geometry

    Authors: Bowei Zhang, Lei Ke, Adam W. Harley, Katerina Fragkiadaki

    Abstract: We introduce TAPIP3D, a novel approach for long-term 3D point tracking in monocular RGB and RGB-D videos. TAPIP3D represents videos as camera-stabilized spatio-temporal feature clouds, leveraging depth and camera motion information to lift 2D video features into a 3D world space where camera movement is effectively canceled out. Within this stabilized 3D representation, TAPIP3D iteratively refines…

    Submitted 14 November, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

    Comments: NeurIPS 2025. Long-term feed-forward 3D point tracking in persistent 3D point maps. Code: https://github.com/zbw001/TAPIP3D

  7. arXiv:2412.04592  [pdf, other]

    cs.CV

    EgoPoints: Advancing Point Tracking for Egocentric Videos

    Authors: Ahmad Darkhalil, Rhodri Guerrier, Adam W. Harley, Dima Damen

    Abstract: We introduce EgoPoints, a benchmark for point tracking in egocentric videos. We annotate 4.7K challenging tracks in egocentric sequences. Compared to the popular TAP-Vid-DAVIS evaluation benchmark, we include 9x more points that go out-of-view and 59x more points that require re-identification (ReID) after returning to view. To measure the performance of models on these challenging points, we intr…

    Submitted 5 December, 2024; originally announced December 2024.

    Comments: Accepted at WACV 2025. Paper webpage: https://ahmaddarkhalil.github.io/EgoPoints/

  8. arXiv:2412.04457  [pdf, ps, other]

    cs.CV

    Monocular Dynamic Gaussian Splatting: Fast, Brittle, and Scene Complexity Rules

    Authors: Yiqing Liang, Mikhail Okunev, Mikaela Angelina Uy, Runfeng Li, Leonidas Guibas, James Tompkin, Adam W. Harley

    Abstract: Gaussian splatting methods are emerging as a popular approach for converting multi-view image data into scene representations that allow view synthesis. In particular, there is interest in enabling view synthesis for dynamic scenes using only monocular input data -- an ill-posed and challenging problem. The fast pace of work in this area has produced multiple simultaneous papers that claim to work…

    Submitted 7 June, 2025; v1 submitted 5 December, 2024; originally announced December 2024.

    Comments: TMLR 2025. Project Website: https://brownvc.github.io/MonoDyGauBench.github.io/

  9. arXiv:2405.19678  [pdf, other]

    cs.CV cs.AI

    View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields

    Authors: Haodi He, Colton Stearns, Adam W. Harley, Leonidas J. Guibas

    Abstract: Large-scale vision foundation models such as Segment Anything (SAM) demonstrate impressive performance in zero-shot image segmentation at multiple levels of granularity. However, these zero-shot predictions are rarely 3D-consistent. As the camera viewpoint changes in a scene, so do the segmentation predictions, as well as the characterizations of "coarse" or "fine" granularity. In this work, we ad…

    Submitted 17 July, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

  10. arXiv:2401.02416  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    ODIN: A Single Model for 2D and 3D Segmentation

    Authors: Ayush Jain, Pushkal Katara, Nikolaos Gkanatsios, Adam W. Harley, Gabriel Sarch, Kriti Aggarwal, Vishrav Chaudhary, Katerina Fragkiadaki

    Abstract: State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post-processing of sensed multiview RGB-D images. They are typically trained in-domain, forego large-scale 2D pre-training and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that…

    Submitted 25 June, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

    Comments: Camera Ready (CVPR 2024, Highlight)

  11. arXiv:2401.00850  [pdf, other]

    cs.CV cs.AI

    Refining Pre-Trained Motion Models

    Authors: Xinglong Sun, Adam W. Harley, Leonidas J. Guibas

    Abstract: Given the difficulty of manually annotating motion in video, the current best motion estimation methods are trained with synthetic data, and therefore struggle somewhat due to a train/test gap. Self-supervised methods hold the promise of training directly on real video, but typically perform worse. These include methods trained with warp error (i.e., color constancy) combined with smoothness terms…

    Submitted 16 February, 2024; v1 submitted 1 January, 2024; originally announced January 2024.

    Comments: Accepted at ICRA 2024

  12. arXiv:2312.15130  [pdf, other]

    cs.CV

    PACE: A Large-Scale Dataset with Pose Annotations in Cluttered Environments

    Authors: Yang You, Kai Xiong, Zhening Yang, Zhengxiang Huang, Junwei Zhou, Ruoxi Shi, Zhou Fang, Adam W. Harley, Leonidas Guibas, Cewu Lu

    Abstract: We introduce PACE (Pose Annotations in Cluttered Environments), a large-scale benchmark designed to advance the development and evaluation of pose estimation methods in cluttered scenarios. PACE provides a large-scale real-world benchmark for both instance-level and category-level settings. The benchmark consists of 55K frames with 258K annotations across 300 videos, covering 238 objects from 43 c…

    Submitted 19 July, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

    Comments: 14 pages; Accepted to ECCV 2024

  13. arXiv:2310.06992  [pdf, other]

    cs.CV

    Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models

    Authors: Wen-Hsuan Chu, Adam W. Harley, Pavel Tokmakov, Achal Dave, Leonidas Guibas, Katerina Fragkiadaki

    Abstract: Object tracking is central to robot perception and scene understanding. Tracking-by-detection has long been a dominant paradigm for object tracking of specific object categories. Recently, large-scale pre-trained models have shown promising advances in detecting and segmenting objects and parts in 2D static images in the wild. This begs the question: can we re-purpose these large-scale pre-trained…

    Submitted 25 January, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

    Comments: Project page available at https://wenhsuanchu.github.io/ovtracktor/

  14. arXiv:2309.03468  [pdf, other]

    cs.CV cs.AI cs.LG

    Support-Set Context Matters for Bongard Problems

    Authors: Nikhil Raghuraman, Adam W. Harley, Leonidas Guibas

    Abstract: Current machine learning methods struggle to solve Bongard problems, which are a type of IQ test that requires deriving an abstract "concept" from a set of positive and negative "support" images, and then classifying whether or not a new query image depicts the key concept. On Bongard-HOI, a benchmark for natural-image Bongard problems, most existing methods have reached at best 69% accuracy (wher…

    Submitted 30 November, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

    Comments: TMLR October 2024. Code: https://github.com/nraghuraman/bongard-context

  15. arXiv:2307.15055  [pdf, other]

    cs.CV

    PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking

    Authors: Yang Zheng, Adam W. Harley, Bokui Shen, Gordon Wetzstein, Leonidas J. Guibas

    Abstract: We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework, for the training and evaluation of long-term fine-grained tracking algorithms. Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion. Toward the goal of naturalism, we animate deformable characters using real-world motion capture data, we build 3D scenes to m…

    Submitted 27 July, 2023; originally announced July 2023.

  16. arXiv:2207.10761  [pdf, other]

    cs.CV

    TIDEE: Tidying Up Novel Rooms using Visuo-Semantic Commonsense Priors

    Authors: Gabriel Sarch, Zhaoyuan Fang, Adam W. Harley, Paul Schydlo, Michael J. Tarr, Saurabh Gupta, Katerina Fragkiadaki

    Abstract: We introduce TIDEE, an embodied agent that tidies up a disordered scene based on learned commonsense object placement and room arrangement priors. TIDEE explores a home environment, detects objects that are out of their natural place, infers plausible object contexts for them, localizes such contexts in the current scene, and repositions the objects. Commonsense priors are encoded in three modules…

    Submitted 21 July, 2022; originally announced July 2022.

  17. arXiv:2206.07959  [pdf, other]

    cs.CV

    Simple-BEV: What Really Matters for Multi-Sensor BEV Perception?

    Authors: Adam W. Harley, Zhaoyuan Fang, Jie Li, Rares Ambrus, Katerina Fragkiadaki

    Abstract: Building 3D perception systems for autonomous vehicles that do not rely on high-density LiDAR is a critical research problem because of the expense of LiDAR systems compared to cameras and other sensors. Recent research has developed a variety of camera-only methods, where features are differentiably "lifted" from the multi-camera images onto the 2D ground plane, yielding a "bird's eye view" (BEV)…

    Submitted 29 September, 2022; v1 submitted 16 June, 2022; originally announced June 2022.
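    The "lifting" the abstract refers to can be illustrated with a toy version of the simplest strategy: project 3D sample points into the camera and pull features from wherever they land. This is a sketch under assumed names and shapes, with nearest-neighbor sampling standing in for the bilinear sampling such methods typically use:

    ```python
    import numpy as np

    def lift_features(feat2d, points3d, K):
        """Sample 2D image features at the projections of 3D points.

        feat2d: (H, W, C) image feature map.
        points3d: (N, 3) points in camera coordinates.
        K: (3, 3) camera intrinsics.
        Returns (N, C); points projecting outside the image get zeros.
        """
        H, W, C = feat2d.shape
        out = np.zeros((len(points3d), C))
        proj = points3d @ K.T                      # pinhole projection
        for i, (u, v, w) in enumerate(proj):
            if w <= 0:
                continue                           # behind the camera
            x, y = int(round(u / w)), int(round(v / w))
            if 0 <= x < W and 0 <= y < H:
                out[i] = feat2d[y, x]              # nearest-neighbor sample
        return out
    ```

    Placing the 3D sample points on a ground-plane grid and reducing over height yields a BEV feature map; the paper's question is how much the choice of lifting scheme actually matters relative to other factors.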

  18. arXiv:2204.04153  [pdf, other]

    cs.CV

    Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories

    Authors: Adam W. Harley, Zhaoyuan Fang, Katerina Fragkiadaki

    Abstract: Tracking pixels in videos is typically studied as an optical flow estimation problem, where every pixel is described with a displacement vector that locates it in the next frame. Even though wider temporal context is freely available, prior efforts to take this into account have yielded only small gains over 2-frame methods. In this paper, we revisit Sand and Teller's "particle video" approach, an…

    Submitted 25 July, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

  19. arXiv:2104.03851  [pdf, other]

    cs.CV

    CoCoNets: Continuous Contrastive 3D Scene Representations

    Authors: Shamit Lal, Mihir Prabhudesai, Ishita Mediratta, Adam W. Harley, Katerina Fragkiadaki

    Abstract: This paper explores self-supervised learning of amodal 3D feature representations from RGB and RGB-D posed images and videos, agnostic to object and scene semantic content, and evaluates the resulting scene representations in the downstream tasks of visual correspondence, object tracking, and object detection. The model infers a latent 3D representation of the scene in the form of 3D feature points…

    Submitted 8 April, 2021; originally announced April 2021.

  20. arXiv:2104.03424  [pdf, other]

    cs.CV

    Track, Check, Repeat: An EM Approach to Unsupervised Tracking

    Authors: Adam W. Harley, Yiming Zuo, Jing Wen, Ayush Mangal, Shubhankar Potdar, Ritwick Chaudhry, Katerina Fragkiadaki

    Abstract: We propose an unsupervised method for detecting and tracking moving objects in 3D, in unlabelled RGB-D videos. The method begins with classic handcrafted techniques for segmenting objects using motion cues: we estimate optical flow and camera motion, and conservatively segment regions that appear to be moving independently of the background. Treating these initial segments as pseudo-labels, we lea…

    Submitted 7 April, 2021; originally announced April 2021.

  21. arXiv:2012.00057  [pdf, other]

    cs.CV cs.AI cs.LG

    Move to See Better: Self-Improving Embodied Object Detection

    Authors: Zhaoyuan Fang, Ayush Jain, Gabriel Sarch, Adam W. Harley, Katerina Fragkiadaki

    Abstract: Passive methods for object detection and segmentation treat images of the same scene as individual samples and do not exploit object permanence across multiple views. Generalization to novel or difficult viewpoints thus requires additional training with lots of annotations. In contrast, humans often recognize objects by simply moving around, to get more informative viewpoints. In this paper, we pr…

    Submitted 29 March, 2021; v1 submitted 30 November, 2020; originally announced December 2020.

    Comments: First three authors contributed equally. Project Page: https://ayushjain1144.github.io/SeeingByMoving/

  22. arXiv:2011.03367  [pdf, other]

    cs.CV

    Disentangling 3D Prototypical Networks For Few-Shot Concept Learning

    Authors: Mihir Prabhudesai, Shamit Lal, Darshan Patil, Hsiao-Yu Tung, Adam W Harley, Katerina Fragkiadaki

    Abstract: We present neural architectures that disentangle RGB-D images into objects' shapes and styles and a map of the background scene, and explore their applications for few-shot 3D object detection and few-shot concept classification. Our networks incorporate architectural biases that reflect the image formation process, 3D geometry of the world scene, and shape-style interplay. They are trained end-to…

    Submitted 20 July, 2021; v1 submitted 6 November, 2020; originally announced November 2020.

  23. arXiv:2010.16279  [pdf, other]

    cs.CV

    3D Object Recognition By Corresponding and Quantizing Neural 3D Scene Representations

    Authors: Mihir Prabhudesai, Shamit Lal, Hsiao-Yu Fish Tung, Adam W. Harley, Shubhankar Potdar, Katerina Fragkiadaki

    Abstract: We propose a system that learns to detect objects and infer their 3D poses in RGB-D images. Many existing systems can identify objects and infer 3D poses, but they heavily rely on human labels and 3D annotations. The challenge here is to achieve this without relying on strong supervision signals. To address this challenge, we propose a model that maps RGB-D images to a set of 3D visual feature map…

    Submitted 30 October, 2020; originally announced October 2020.

  24. arXiv:2008.01295  [pdf, other]

    cs.CV

    Tracking Emerges by Looking Around Static Scenes, with Neural 3D Mapping

    Authors: Adam W. Harley, Shrinidhi K. Lakshmikanth, Paul Schydlo, Katerina Fragkiadaki

    Abstract: We hypothesize that an agent that can look around in static scenes can learn rich visual representations applicable to 3D object tracking in complex dynamic scenes. We are motivated in this pursuit by the fact that the physical world itself is mostly static, and multiview correspondence labels are relatively cheap to collect in static scenes, e.g., by triangulation. We propose to leverage multivie…

    Submitted 3 August, 2020; originally announced August 2020.

  25. arXiv:1910.01210  [pdf, other]

    cs.CV cs.LG cs.RO

    Embodied Language Grounding with 3D Visual Feature Representations

    Authors: Mihir Prabhudesai, Hsiao-Yu Fish Tung, Syed Ashar Javed, Maximilian Sieb, Adam W. Harley, Katerina Fragkiadaki

    Abstract: We propose associating language utterances to 3D visual abstractions of the scene they describe. The 3D visual abstractions are encoded as 3-dimensional visual feature maps. We infer these 3D visual scene feature maps from RGB images of the scene via view prediction: when the generated 3D scene feature map is neurally projected from a camera viewpoint, it should match the corresponding RGB image.…

    Submitted 17 June, 2021; v1 submitted 2 October, 2019; originally announced October 2019.

    Journal ref: Conference on Computer Vision and Pattern Recognition. 2020, pp. 2220-2229

  26. arXiv:1906.03764  [pdf, other]

    cs.CV

    Learning from Unlabelled Videos Using Contrastive Predictive Neural 3D Mapping

    Authors: Adam W. Harley, Shrinidhi K. Lakshmikanth, Fangyu Li, Xian Zhou, Hsiao-Yu Fish Tung, Katerina Fragkiadaki

    Abstract: Predictive coding theories suggest that the brain learns by predicting observations at various levels of abstraction. One of the most basic prediction tasks is view prediction: how would a given scene look from an alternative viewpoint? Humans excel at this task. Our ability to imagine and fill in missing information is tightly coupled with perception: we feel as if we see the world in 3 dimension…

    Submitted 16 May, 2020; v1 submitted 9 June, 2019; originally announced June 2019.

  27. arXiv:1901.03628  [pdf, other]

    cs.CV

    Image Disentanglement and Uncooperative Re-Entanglement for High-Fidelity Image-to-Image Translation

    Authors: Adam W. Harley, Shih-En Wei, Jason Saragih, Katerina Fragkiadaki

    Abstract: Cross-domain image-to-image translation should satisfy two requirements: (1) preserve the information that is common to both domains, and (2) generate convincing images covering variations that appear in the target domain. This is challenging, especially when there are no example translations available as supervision. Adversarial cycle consistency was recently proposed as a solution, with beautifu…

    Submitted 19 October, 2019; v1 submitted 11 January, 2019; originally announced January 2019.

  28. arXiv:1804.10692  [pdf, other]

    cs.CV cs.RO

    Reward Learning from Narrated Demonstrations

    Authors: Hsiao-Yu Fish Tung, Adam W. Harley, Liang-Kang Huang, Katerina Fragkiadaki

    Abstract: Humans effortlessly "program" one another by communicating goals and desires in natural language. In contrast, humans program robotic behaviours by indicating desired object locations and poses to be achieved, by providing RGB images of goal configurations, or supplying a demonstration to be imitated. None of these methods generalize across environment variations, and they convey the goal in awkwa…

    Submitted 27 April, 2018; originally announced April 2018.

    Comments: Accepted to the Conference on Computer Vision and Pattern Recognition (CVPR) 2018

  29. arXiv:1708.04607  [pdf, other]

    cs.CV

    Segmentation-Aware Convolutional Networks Using Local Attention Masks

    Authors: Adam W. Harley, Konstantinos G. Derpanis, Iasonas Kokkinos

    Abstract: We introduce an approach to integrate segmentation information within a convolutional neural network (CNN). This counter-acts the tendency of CNNs to smooth information across regions and increases their spatial precision. To obtain segmentation information, we set up a CNN to provide an embedding space where region co-membership can be estimated based on Euclidean distance. We use these embedding…

    Submitted 15 August, 2017; originally announced August 2017.
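    The masking idea the abstract describes -- down-weighting neighbors whose embeddings differ from the center pixel's, so convolution does not smooth across region boundaries -- can be sketched for a single output pixel. A toy version with assumed names and shapes, not the paper's exact formulation:

    ```python
    import numpy as np

    def seg_aware_response(window, kernel, embs, center_emb):
        """One output of a segmentation-aware convolution.

        window: (k, k) input patch; kernel: (k, k) filter;
        embs: (k, k, D) per-pixel embeddings; center_emb: (D,).
        Neighbors far from the center in embedding space are attenuated.
        """
        mask = np.exp(-np.linalg.norm(embs - center_emb, axis=-1))
        mask = mask / mask.sum()            # normalized local attention mask
        return float(np.sum(window * kernel * mask))
    ```

    When all embeddings in the window agree, the mask is uniform and the operation reduces to an ordinary (normalized) convolution; across a boundary, the mask suppresses the other region's contribution.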

  30. arXiv:1705.11166  [pdf, other]

    cs.CV

    Adversarial Inverse Graphics Networks: Learning 2D-to-3D Lifting and Image-to-Image Translation from Unpaired Supervision

    Authors: Hsiao-Yu Fish Tung, Adam W. Harley, William Seto, Katerina Fragkiadaki

    Abstract: Researchers have developed excellent feed-forward models that learn to map images to desired outputs, such as to the images' latent factors, or to other images, using supervised learning. Learning such mappings from unlabelled data, or improving upon supervised models by exploiting unlabelled data, remains elusive. We argue that there are two important parts to learning without annotations: (i) ma…

    Submitted 1 September, 2017; v1 submitted 31 May, 2017; originally announced May 2017.

    Journal ref: The IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4354-4362

  31. arXiv:1608.05842  [pdf, other]

    cs.CV

    Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness

    Authors: Jason J. Yu, Adam W. Harley, Konstantinos G. Derpanis

    Abstract: Recently, convolutional networks (convnets) have proven useful for predicting optical flow. Much of this success is predicated on the availability of large datasets that require expensive and involved data acquisition and laborious labeling. To bypass these challenges, we propose an unsupervised approach (i.e., without leveraging groundtruth flow) to train a convnet end-to-end for predicting o…

    Submitted 20 August, 2016; originally announced August 2016.
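    The two training signals named in the title -- brightness constancy and motion smoothness -- combine into a loss that needs no ground-truth flow. A simplified sketch (the paper's exact penalties and warping scheme differ; names and the weighting are assumptions):

    ```python
    import numpy as np

    def brightness_constancy(im1, im2_warped):
        # photometric error after warping frame 2 toward frame 1
        return float(np.mean(np.abs(im1 - im2_warped)))

    def smoothness(flow):
        # penalize spatial gradients of the (H, W, 2) flow field
        dx = np.abs(np.diff(flow, axis=1)).mean()
        dy = np.abs(np.diff(flow, axis=0)).mean()
        return float(dx + dy)

    def unsupervised_flow_loss(im1, im2_warped, flow, lam=1.0):
        # total loss: photometric term plus weighted smoothness regularizer
        return brightness_constancy(im1, im2_warped) + lam * smoothness(flow)
    ```

    A perfect, spatially constant flow drives both terms to zero; in practice the smoothness weight trades off sharp motion boundaries against noise in textureless regions.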

  32. arXiv:1511.04377  [pdf, other]

    cs.CV

    Learning Dense Convolutional Embeddings for Semantic Segmentation

    Authors: Adam W. Harley, Konstantinos G. Derpanis, Iasonas Kokkinos

    Abstract: This paper proposes a new deep convolutional neural network (DCNN) architecture that learns pixel embeddings, such that pairwise distances between the embeddings can be used to infer whether or not the pixels lie on the same region. That is, for any two pixels on the same object, the embeddings are trained to be similar; for any pair that straddles an object boundary, the embeddings are trained to…

    Submitted 7 January, 2016; v1 submitted 13 November, 2015; originally announced November 2015.
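    The pairwise objective the abstract describes -- pull same-region embeddings together, push cross-boundary embeddings apart -- is commonly written as a hinge loss on embedding distance. A sketch with illustrative margins (the paper's exact loss may differ):

    ```python
    import numpy as np

    def pair_loss(e_i, e_j, same_region, pull=0.5, push=2.0):
        """Hinge loss on the distance between two pixel embeddings.

        Same-region pairs are pulled to within `pull` of each other;
        pairs straddling an object boundary are pushed beyond `push`.
        The margin values here are illustrative, not the paper's.
        """
        d = float(np.linalg.norm(np.asarray(e_i) - np.asarray(e_j)))
        if same_region:
            return max(d - pull, 0.0)   # penalize only distances beyond the margin
        return max(push - d, 0.0)       # penalize only distances inside the margin
    ```

    Once trained, thresholding these distances gives the region co-membership estimates that downstream segmentation can exploit.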

  33. arXiv:1502.07058  [pdf, other]

    cs.CV cs.IR cs.LG cs.NE

    Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval

    Authors: Adam W. Harley, Alex Ufkes, Konstantinos G. Derpanis

    Abstract: This paper presents a new state-of-the-art for document image classification and retrieval, using features learned by deep convolutional neural networks (CNNs). In object and scene analysis, deep neural nets are capable of learning a hierarchical chain of abstraction from pixel inputs to concise and descriptive representations. The current work explores this capacity in the realm of document analy…

    Submitted 25 February, 2015; originally announced February 2015.