Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Authors: Shihan Cheng, Nilesh Kulkarni, David Hyde, Dmitriy Smirnov

Abstract: Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.

Submitted 10 December, 2025; v1 submitted 21 November, 2025; originally announced November 2025.

MSC Class: 68U05 ACM Class: I.3.3; I.5.4

arXiv:2506.09440 [pdf, ps, other]

GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture

Authors: GigaChat team, Mamedov Valentin, Evgenii Kosarev, Gregory Leleytner, Ilya Shchuckin, Valeriy Berezovskiy, Daniil Smirnov, Dmitry Kozlov, Sergei Averkiev, Lukyanenko Ivan, Aleksandr Proshunin, Ainur Israfilova, Ivan Baskov, Artem Chervyakov, Emil Shakirov, Mikhail Kolesov, Daria Khomich, Darya Latortseva, Sergei Porkhun, Yury Fedorov, Oleg Kutuzov, Polina Kudriavtseva, Sofiia Soldatova, Kolodin Egor, Stanislav Pyatkin, et al. (9 additional authors not shown)

Abstract: Generative large language models (LLMs) have become crucial for modern NLP research and applications across various languages. However, the development of foundational models specifically tailored to the Russian language has been limited, primarily due to the significant computational resources required. This paper introduces the GigaChat family of Russian LLMs, available in various sizes, including base models and instruction-tuned versions. We provide a detailed report on the model architecture, pre-training process, and experiments to guide design choices. In addition, we evaluate their performance on Russian and English benchmarks and compare GigaChat with multilingual analogs. The paper presents a system demonstration of the top-performing models accessible via an API, a Telegram bot, and a Web interface. Furthermore, we have released three open GigaChat models in open-source (https://huggingface.co/ai-sage), aiming to expand NLP research opportunities and support the development of industrial solutions for the Russian language.

Submitted 11 June, 2025; originally announced June 2025.

Comments: ACL-2025 System Demo

arXiv:2411.01212 [pdf, other]

Infinite-Resolution Integral Noise Warping for Diffusion Models

Authors: Yitong Deng, Winnie Lin, Lingxiao Li, Dmitriy Smirnov, Ryan Burgert, Ning Yu, Vincent Dedun, Mohammad H. Taghavi

Abstract: Adapting pretrained image-based diffusion models to generate temporally consistent videos has become an impactful generative modeling research direction. Training-free noise-space manipulation has proven to be an effective technique, where the challenge is to preserve the Gaussian white noise distribution while adding in temporal consistency. Recently, Chang et al. (2024) formulated this problem using an integral noise representation with distribution-preserving guarantees, and proposed an upsampling-based algorithm to compute it. However, while their mathematical formulation is advantageous, the algorithm incurs a high computational cost. Through analyzing the limiting-case behavior of their algorithm as the upsampling resolution goes to infinity, we develop an alternative algorithm that, by gathering increments of multiple Brownian bridges, achieves their infinite-resolution accuracy while simultaneously reducing the computational cost by orders of magnitude. We prove and experimentally validate our theoretical claims, and demonstrate our method's effectiveness in real-world applications. We further show that our method readily extends to the 3-dimensional space.

Submitted 2 November, 2024; originally announced November 2024.

arXiv:2306.13702 [pdf, other]

Magenta Green Screen: Spectrally Multiplexed Alpha Matting with Deep Colorization

Authors: Dmitriy Smirnov, Chloe LeGendre, Xueming Yu, Paul Debevec

Abstract: We introduce Magenta Green Screen, a novel machine learning--enabled matting technique for recording the color image of a foreground actor and a simultaneous high-quality alpha channel without requiring a special camera or manual keying techniques. We record the actor on a green background but light them with only red and blue foreground lighting. In this configuration, the green channel shows the actor silhouetted against a bright, even background, which can be used directly as a holdout matte, the inverse of the actor's alpha channel. We then restore the green channel of the foreground using a machine learning colorization technique. We train the colorization model with an example sequence of the actor lit by white lighting, yielding convincing and temporally stable colorization results. We further show that time-multiplexing the lighting between Magenta Green Screen and Green Magenta Screen allows the technique to be practiced under what appears to be mostly normal lighting. We demonstrate that our technique yields high-quality compositing results when implemented on a modern LED virtual production stage. The alpha channel data obtainable with our technique can provide significantly higher quality training data for natural image matting algorithms to support future ML matting research.

Submitted 23 June, 2023; originally announced June 2023.

Comments: In DigiPro 2023

arXiv:2201.11940 [pdf, other]

Wassersplines for Neural Vector Field--Controlled Animation

Authors: Paul Zhang, Dmitriy Smirnov, Justin Solomon

Abstract: Much of computer-generated animation is created by manipulating meshes with rigs. While this approach works well for animating articulated objects like animals, it has limited flexibility for animating less structured free-form objects. We introduce Wassersplines, a novel trajectory inference method for animating unstructured densities based on recent advances in continuous normalizing flows and optimal transport. The key idea is to train a neurally-parameterized velocity field that represents the motion between keyframes. Trajectories are then computed by advecting keyframes through the velocity field. We solve an additional Wasserstein barycenter interpolation problem to guarantee strict adherence to keyframes. Our tool can stylize trajectories through a variety of PDE-based regularizers to create different visual effects. We demonstrate our tool on various keyframe interpolation problems to produce temporally-coherent animations without meshing or rigging.

Submitted 19 September, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

arXiv:2111.09383 [pdf, other]

DeepCurrents: Learning Implicit Representations of Shapes with Boundaries

Authors: David Palmer, Dmitriy Smirnov, Stephanie Wang, Albert Chern, Justin Solomon

Abstract: Recent techniques have been successful in reconstructing surfaces as level sets of learned functions (such as signed distance fields) parameterized by deep neural networks. Many of these methods, however, learn only closed surfaces and are unable to reconstruct shapes with boundary curves. We propose a hybrid shape representation that combines explicit boundary curves with implicit learned interiors. Using machinery from geometric measure theory, we parameterize currents using deep networks and use stochastic gradient descent to solve a minimal surface problem. By modifying the metric according to target geometry coming, e.g., from a mesh or point cloud, we can use this approach to represent arbitrary surfaces, learning implicitly defined shapes with explicitly defined boundary curves. We further demonstrate learning families of shapes jointly parameterized by boundary curves and latent codes.

Submitted 21 March, 2022; v1 submitted 17 November, 2021; originally announced November 2021.

arXiv:2109.06279 [pdf, other]

doi 10.1145/3478513.3480568

Interactive All-Hex Meshing via Cuboid Decomposition

Authors: Lingxiao Li, Paul Zhang, Dmitriy Smirnov, S. Mazdak Abulnaga, Justin Solomon

Abstract: Standard PolyCube-based hexahedral (hex) meshing methods aim to deform the input domain into an axis-aligned PolyCube volume with integer corners; if this deformation is bijective, then applying the inverse map to the voxelized PolyCube yields a valid hex mesh. A key challenge in these methods is to maintain the bijectivity of the PolyCube deformation, thus reducing the robustness of these algorithms. In this work, we present an interactive pipeline for hex meshing that sidesteps this challenge by using a new representation of PolyCubes as unions of cuboids. We begin by deforming the input tetrahedral mesh into a near-PolyCube domain whose faces are close but not perfectly aligned to the major axis directions. We then build a PolyCube by optimizing the layout of a set of cuboids with user guidance to closely fit the deformed domain. Finally, we construct an inversion-free pullback map from the voxelized PolyCube to the input domain while optimizing for mesh quality metrics. We allow extensive user control over each stage, such as editing the voxelized PolyCube, positioning surface vertices, and exploring the trade-off among competing quality metrics, while also providing automatic alternatives. We validate our method on over one hundred shapes, including models that are challenging for past PolyCube-based and frame-field-based methods. Our pipeline reliably produces hex meshes with quality on par with or better than state-of-the-art. We additionally conduct a user study with 20 participants in which the majority prefer hex meshes they make using our tool to the ones from automatic state-of-the-art methods. This demonstrates the need for intuitive interactive hex meshing tools where the user can dictate the priorities of their mesh.

Submitted 13 September, 2021; originally announced September 2021.

ACM Class: I.3.5

arXiv:2104.14553 [pdf, other]

MarioNette: Self-Supervised Sprite Learning

Authors: Dmitriy Smirnov, Michael Gharbi, Matthew Fisher, Vitor Guizilini, Alexei A. Efros, Justin Solomon

Abstract: Artists and video game designers often construct 2D animations using libraries of sprites -- textured patches of objects and characters. We propose a deep learning approach that decomposes sprite-based video animations into a disentangled representation of recurring graphic elements in a self-supervised manner. By jointly learning a dictionary of possibly transparent patches and training a network that places them onto a canvas, we deconstruct sprite-based content into a sparse, consistent, and explicit representation that can be easily used in downstream tasks, like editing or analysis. Our framework offers a promising approach for discovering recurring visual patterns in image collections without supervision.

Submitted 20 October, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

Comments: Accepted to NeurIPS 2021

arXiv:2104.12826 [pdf, other]

HodgeNet: Learning Spectral Geometry on Triangle Meshes

Authors: Dmitriy Smirnov, Justin Solomon

Abstract: Constrained by the limitations of learning toolkits engineered for other applications, such as those in image processing, many mesh-based learning algorithms employ data flows that would be atypical from the perspective of conventional geometry processing. As an alternative, we present a technique for learning from meshes built from standard geometry processing modules and operations. We show that low-order eigenvalue/eigenvector computation from operators parameterized using discrete exterior calculus is amenable to efficient approximate backpropagation, yielding spectral per-element or per-mesh features with similar formulas to classical descriptors like the heat/wave kernel signatures. Our model uses few parameters, generalizes to high-resolution meshes, and exhibits performance and time complexity on par with past work.

Submitted 26 April, 2021; originally announced April 2021.

Comments: Accepted to SIGGRAPH 2021

arXiv:2004.14875 [pdf, other]

Polygonal Building Segmentation by Frame Field Learning

Authors: Nicolas Girard, Dmitriy Smirnov, Justin Solomon, Yuliya Tarabalka

Abstract: While state of the art image segmentation models typically output segmentations in raster format, applications in geographic information systems often require vector polygons. To help bridge the gap between deep network output and the format used in downstream tasks, we add a frame field output to a deep segmentation model for extracting buildings from remote sensing images. We train a deep neural network that aligns a predicted frame field to ground truth contours. This additional objective improves segmentation quality by leveraging multi-task learning and provides structural information that later facilitates polygonization; we also introduce a polygonization algorithm that utilizes the frame field along with the raster segmentation. Our code is available at https://github.com/Lydorn/Polygonization-by-Frame-Field-Learning.

Submitted 31 March, 2021; v1 submitted 30 April, 2020; originally announced April 2020.

Comments: CVPR 2021 - IEEE Conference on Computer Vision and Pattern Recognition, Jun 2021, Pittsburg / Virtual, United States

Report number: hal-02548545, v2

arXiv:1906.12337 [pdf, other]

Learning Manifold Patch-Based Representations of Man-Made Shapes

Authors: Dmitriy Smirnov, Mikhail Bessmeltsev, Justin Solomon

Abstract: Choosing the right representation for geometry is crucial for making 3D models compatible with existing applications. Focusing on piecewise-smooth man-made shapes, we propose a new representation that is usable in conventional CAD modeling pipelines and can also be learned by deep neural networks. We demonstrate its benefits by applying it to the task of sketch-based modeling. Given a raster image, our system infers a set of parametric surfaces that realize the input in 3D. To capture piecewise smooth geometry, we learn a special shape representation: a deformable parametric template composed of Coons patches. Naively training such a system, however, is hampered by non-manifold artifacts in the parametric shapes and by a lack of data. To address this, we introduce loss functions that bias the network to output non-self-intersecting shapes and implement them as part of a fully self-supervised system, automatically generating both shape templates and synthetic training data. We develop a testbed for sketch-based modeling, demonstrate shape interpolation, and provide comparison to related work.

Submitted 9 February, 2021; v1 submitted 28 June, 2019; originally announced June 2019.

Comments: Accepted to ICLR 2021

arXiv:1904.08921 [pdf, other]

Deep Parametric Shape Predictions using Distance Fields

Authors: Dmitriy Smirnov, Matthew Fisher, Vladimir G. Kim, Richard Zhang, Justin Solomon

Abstract: Many tasks in graphics and vision demand machinery for converting shapes into consistent representations with sparse sets of parameters; these representations facilitate rendering, editing, and storage. When the source data is noisy or ambiguous, however, artists and engineers often manually construct such representations, a tedious and potentially time-consuming process. While advances in deep learning have been successfully applied to noisy geometric data, the task of generating parametric shapes has so far been difficult for these methods. Hence, we propose a new framework for predicting parametric shape primitives using deep learning. We use distance fields to transition between shape parameters like control points and input data on a pixel grid. We demonstrate efficacy on 2D and 3D tasks, including font vectorization and surface abstraction.

Submitted 19 March, 2020; v1 submitted 18 April, 2019; originally announced April 2019.

Comments: Accepted to CVPR 2020

Journal ref: Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. (2020) 561-570

arXiv:1701.02595 [pdf]

Around power law for PageRank components in Buckley-Osthus model of web graph

Authors: Alexander Gasnikov, Maxim Zhukovskii, Sergey Kim, Fedor Noskov, Stepan Plaunov, Daniil Smirnov

Abstract: In this paper we investigate the power law for PageRank components in the Buckley-Osthus model of the web graph. We compare different numerical methods for PageRank calculation. Using the best method, we carry out extensive numerical experiments, which confirm the power-law hypothesis. Finally, we discuss a real model of web ranking based on the classical PageRank approach.

Submitted 1 March, 2017; v1 submitted 8 January, 2017; originally announced January 2017.

Comments: in Russian, 41 pages

Showing 1–13 of 13 results for author: Smirnov, D