-
Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding
Authors:
Hatice Merve Vural,
Doga Kukul,
Ege Erdem Ozlu,
Demir Ekin Arikan,
Bob Mankoff,
Erkut Erdem,
Aykut Erdem
Abstract:
Humor is one of the few cognitive tasks where getting the reasoning right matters as much as getting the answer right. While recent work evaluates humor understanding on benchmarks such as the New Yorker Cartoon Caption Contest (NYCC), it largely treats it as black-box prediction, overlooking the structured reasoning processes underlying humor comprehension. We introduce IRS (Incongruity-Resolution Supervision), a framework that decomposes humor understanding into three components: incongruity modeling, which identifies mismatches in the visual scene; resolution modeling, which constructs coherent reinterpretations of these mismatches; and preference alignment, which evaluates candidate interpretations under human judgments. Grounded in incongruity-resolution theory and expert captionist practice, IRS supervises the intermediate reasoning process through structured traces that make the path from visual perception to humorous interpretation explicit and learnable. Across 7B, 32B, and 72B models on NYCC, IRS outperforms strong open and closed multimodal baselines across caption matching and ranking tasks, with our largest model approaching expert-level performance on ranking. Zero-shot transfer to external benchmarks shows that IRS learns generalizable reasoning patterns. Our results suggest that supervising reasoning structure, rather than scale alone, is key for reasoning-centric tasks.
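The exact trace schema is not given in the abstract; as a rough illustration of what incongruity-resolution supervision could look like, the sketch below (field names are hypothetical) pairs the three components with a serialized training target.

```python
from dataclasses import dataclass

@dataclass
class IRSTrace:
    """Hypothetical structured reasoning trace for one cartoon-caption pair.

    Field names are illustrative; the paper's actual schema may differ.
    """
    scene_description: str   # what is literally depicted in the cartoon
    incongruity: str         # the mismatch identified in the visual scene
    resolution: str          # the reinterpretation that makes the mismatch coherent
    caption: str             # candidate caption being explained
    preference_score: float  # human-judged preference signal

def format_supervision_target(trace: IRSTrace) -> str:
    """Serialize a trace so a multimodal LLM can be fine-tuned to emit
    its reasoning before the final judgment."""
    return (
        f"Scene: {trace.scene_description}\n"
        f"Incongruity: {trace.incongruity}\n"
        f"Resolution: {trace.resolution}\n"
        f"Caption: {trace.caption}\n"
        f"Preference: {trace.preference_score:.2f}"
    )
```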
Submitted 16 April, 2026;
originally announced April 2026.
-
FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Authors:
Mustafa Dogan,
Ilker Kesen,
Iacer Calixto,
Aykut Erdem,
Erkut Erdem
Abstract:
As multimodal large language models (MLLMs) advance in handling interleaved image-text data, assessing their few-shot learning capabilities remains an open challenge. In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting. Covering a diverse suite of multimodal understanding tasks, from attribute recognition to temporal reasoning, FewMMBench enables systematic analysis across task types, model families, and prompting strategies. We evaluate 26 open-weight MLLMs from six model families across zero-shot, few-shot, and CoT-augmented few-shot settings. Our findings reveal that instruction-tuned models exhibit strong zero-shot performance but benefit minimally, or even regress, with additional demonstrations or CoT reasoning. Retrieval-based demonstrations and increased context size also yield limited gains. These results highlight FewMMBench as a rigorous testbed for diagnosing and advancing few-shot capabilities in multimodal LLMs. The data is available at: https://huggingface.co/datasets/mustafaa/FewMMBench
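As a minimal sketch of the few-shot setup being evaluated (the chat-message format here is generic, not FewMMBench's actual harness), a k-shot interleaved image-text prompt with optional chain-of-thought could be assembled like this:

```python
def build_few_shot_prompt(demos, query, use_cot=False):
    """Assemble an interleaved image-text prompt from k demonstrations.

    `demos` is a list of dicts with keys: image, question, rationale, answer.
    `query` is a dict with keys: image, question.
    The chat-message structure below is generic; adapt it to the target MLLM.
    """
    messages = []
    for d in demos:
        messages.append({"role": "user",
                         "content": [{"type": "image", "image": d["image"]},
                                     {"type": "text", "text": d["question"]}]})
        answer = d["answer"]
        if use_cot:
            answer = f"Let's think step by step. {d['rationale']} Answer: {answer}"
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user",
                     "content": [{"type": "image", "image": query["image"]},
                                 {"type": "text", "text": query["question"]}]})
    return messages
```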
Submitted 25 February, 2026;
originally announced February 2026.
-
LAMP: Language-Assisted Motion Planning for Controllable Video Generation
Authors:
Muhammed Burak Kizil,
Enes Sanli,
Niloy J. Mitra,
Erkut Erdem,
Aykut Erdem,
Duygu Ceylan
Abstract:
Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control - specifying object dynamics and camera trajectories - is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP, which leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL) inspired by cinematography conventions. By harnessing the program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate LAMP's improved performance in motion controllability and alignment with user intent compared to state-of-the-art alternatives, establishing the first framework for generating both object and camera motions directly from natural language specifications. Code, models, and data are available on our project page.
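LAMP's actual DSL is not spelled out in the abstract; the toy motion program and deterministic interpreter below are illustrative assumptions meant only to convey the language-to-trajectory idea.

```python
import numpy as np

# Hypothetical miniature motion DSL: each command is (verb, args).
# The real LAMP DSL, inspired by cinematography conventions, is richer.
program = [
    ("move_to", {"target": (0.0, 0.0, 5.0), "duration": 2.0}),   # object glides to a point
    ("orbit",   {"center": (0.0, 0.0, 5.0), "radius": 2.0,
                 "duration": 4.0}),                              # camera-style orbit
]

def interpret(program, fps=24, start=(0.0, 0.0, 0.0)):
    """Deterministically map a motion program to a dense 3D trajectory."""
    points = [np.array(start, dtype=float)]
    for verb, args in program:
        n = max(int(args["duration"] * fps), 2)
        cur = points[-1].copy()
        if verb == "move_to":
            target = np.array(args["target"], dtype=float)
            for t in np.linspace(0.0, 1.0, n)[1:]:
                points.append((1 - t) * cur + t * target)
        elif verb == "orbit":
            center = np.array(args["center"], dtype=float)
            r = args["radius"]
            for theta in np.linspace(0.0, 2 * np.pi, n)[1:]:
                points.append(center + r * np.array([np.cos(theta), 0.0, np.sin(theta)]))
    return np.stack(points)  # (T, 3) trajectory to condition the video model on

trajectory = interpret(program)
```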
Submitted 29 March, 2026; v1 submitted 3 December, 2025;
originally announced December 2025.
-
Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360-Degree Videos
Authors:
Mert Cokelek,
Halit Ozsoy,
Nevrez Imamoglu,
Cagri Ozcinar,
Inci Ayhan,
Erkut Erdem,
Aykut Erdem
Abstract:
Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360-degree environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer's perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360-degree audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360-degree videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360-degree scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos. Code and dataset will be available at https://cyberiada.github.io/SalViT360.
Submitted 27 August, 2025;
originally announced August 2025.
-
Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish
Authors:
Yakup Abrek Er,
Ilker Kesen,
Gözde Gül Şahin,
Aykut Erdem
Abstract:
We introduce Cetvel, a comprehensive benchmark designed to evaluate large language models (LLMs) in Turkish. Existing Turkish benchmarks often lack either task diversity or culturally relevant content, or both. Cetvel addresses these gaps by combining a broad range of both discriminative and generative tasks while ensuring content that reflects the linguistic and cultural richness of the Turkish language. Cetvel covers 23 tasks grouped into seven categories, including tasks such as grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language. We evaluate 33 open-weight LLMs (up to 70B parameters) covering different model families and instruction paradigms. Our experiments reveal that Turkish-centric instruction-tuned models generally underperform relative to multilingual or general-purpose models (e.g. Llama 3 and Mistral), despite being tailored for the language. Moreover, we show that tasks such as grammatical error correction and extractive question answering are particularly discriminative in differentiating model capabilities. Cetvel offers a comprehensive and culturally grounded evaluation suite for advancing the development and assessment of LLMs in Turkish.
Submitted 22 August, 2025;
originally announced August 2025.
-
Absolute Parameters of Young Stars: NO Puppis
Authors:
Ahmet Erdem,
Volkan Bakış,
John Southworth,
Michael D. Rhodes,
Filiz Kahraman Aliçavuş,
Edwin Budding,
Mark Blackford,
Timothy Banks,
Murray Alexander
Abstract:
The southern early-type, young, eccentric-orbit eclipsing binary NO Puppis forms the A component of the multiple star Gaia DR3 5528147999779517568. The B component is an astrometric binary now at a separation of about 8.1 arcsec. There may be other fainter stars in this interesting but complex stellar system. We have combined several lines of evidence, including TESS data from 4 sectors, new ground-based BVR photometry, HARPS (ESO) and HERCULES (UCMJO) high-resolution spectra and astrometry of NO Pup. We derive a revised set of absolute parameters with increased precision. Alternative optimal curve-fitting programs were used in the analysis, allowing a wider view of modelling and parameter uncertainties. The main parameters are as follows: $M_{Aa} = 3.58 \pm 0.11$, $M_{Ab} = 1.68 \pm 0.09$ (M$_\odot$); $R_{Aa} = 2.17 \pm 0.03$, $R_{Ab} = 1.51 \pm 0.06$ (R$_\odot$), and $T_{\rm e Aa} = 13300 \pm 500$, $T_{\rm e Ab} = 7400 \pm 500$ (K). We estimate approximate masses of the wide companions, Ba and Bb, as $M_{Ba} = 2.0$ and $M_{Bb} = 1.8$ (M$_\odot$). The close binary's orbital separation is $a = 8.51 \pm 0.05$ (R$_\odot$); its age is approximately $20$ Myr and its distance $172 \pm 1$ pc. The close binary's secondary (Ab) appears to be the source of low-amplitude $\delta$ Scuti-type oscillations, although the form of these oscillations is irregular and unrepetitive. Analysis of the $\lambda$6678 He I profile of the primary shows synchronism of the mean bodily and orbital rotations. The retention of significant orbital eccentricity, in view of the closeness of the A-system components, is unexpected and poses challenges for the explanation that we discuss.
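As a quick consistency check not stated in the abstract, assuming a Keplerian two-body orbit and the nominal parameter values, the quoted separation and masses imply an orbital period of roughly 1.25 days:
\[
P \;=\; 2\pi\sqrt{\frac{a^{3}}{G\,(M_{Aa}+M_{Ab})}}
\;\approx\; 2\pi\sqrt{\frac{(8.51\,\mathrm{R}_\odot)^{3}}{G\,(5.26\,\mathrm{M}_\odot)}}
\;\approx\; 1.08\times10^{5}\ \mathrm{s} \;\approx\; 1.25\ \mathrm{d},
\]
consistent with a compact, close pair.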
Submitted 7 August, 2025;
originally announced August 2025.
-
Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models
Authors:
Enes Sanli,
Baris Sarper Tezcan,
Aykut Erdem,
Erkut Erdem
Abstract:
Recent progress in text-to-video (T2V) generation has enabled the synthesis of visually compelling and temporally coherent videos from natural language. However, these models often fall short in basic physical commonsense, producing outputs that violate intuitive expectations around causality, object behavior, and tool use. Addressing this gap, we present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of T2V systems. The benchmark includes 383 carefully curated prompts emphasizing tool use, material properties, and procedural interactions, domains where physical plausibility is crucial. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline: (1) formulate grounded physics questions from the prompt, (2) caption the generated video with a vision-language model, and (3) task a language model to answer several physics-involved questions using only the caption. This indirect strategy circumvents common hallucination issues in direct video-based evaluation. By highlighting affordances and tool-mediated actions, areas overlooked in current T2V evaluations, PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
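A minimal sketch of the three-stage, caption-mediated protocol described above; the callables and prompts are placeholders, not the benchmark's actual implementation.

```python
def evaluate_physics_commonsense(prompt, video, qa_llm, captioner, judge_llm):
    """Sketch of a three-stage, caption-mediated evaluation.

    `qa_llm` (returns a list of question strings), `captioner`, and `judge_llm`
    are placeholder callables; the paper's concrete models and prompts may differ.
    """
    # Stage 1: derive grounded physics questions from the text prompt.
    questions = qa_llm(
        f"List yes/no physics questions a correct video for this prompt "
        f"must satisfy:\n{prompt}"
    )

    # Stage 2: describe the generated video with a vision-language model,
    # so the judge never sees pixels directly (reduces hallucination).
    caption = captioner(video)

    # Stage 3: answer each question from the caption alone and score.
    answers = [judge_llm(f"Caption: {caption}\nQuestion: {q}\nAnswer yes or no.")
               for q in questions]
    return sum(a.strip().lower().startswith("yes") for a in answers) / len(answers)
```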
Submitted 21 July, 2025;
originally announced July 2025.
-
TanDiT: Tangent-Plane Diffusion Transformer for High-Quality 360° Panorama Generation
Authors:
Hakan Çapuk,
Andrew Bond,
Muhammed Burak Kızıl,
Emir Göçen,
Erkut Erdem,
Aykut Erdem
Abstract:
Recent advances in image generation have led to remarkable improvements in synthesizing perspective images. However, these models still struggle with panoramic image generation due to unique challenges, including varying levels of geometric distortion and the requirement for seamless loop-consistency. To address these issues while leveraging the strengths of the existing models, we introduce TanDiT, a method that synthesizes panoramic scenes by generating grids of tangent-plane images covering the entire 360$^\circ$ view. Unlike previous methods relying on multiple diffusion branches, TanDiT utilizes a unified diffusion model trained to produce these tangent-plane images simultaneously within a single denoising iteration. Furthermore, we propose a model-agnostic post-processing step specifically designed to enhance global coherence across the generated panoramas. To accurately assess panoramic image quality, we also present two specialized metrics, TangentIS and TangentFID, and provide a comprehensive benchmark comprising captioned panoramic datasets and standardized evaluation scripts. Extensive experiments demonstrate that our method generalizes effectively beyond its training data, robustly interprets detailed and complex text prompts, and seamlessly integrates with various generative models to yield high-quality, diverse panoramic images.
Submitted 26 June, 2025;
originally announced June 2025.
-
DeVisE: Behavioral Testing of Medical Large Language Models
Authors:
Camila Zurdo Tagliabue,
Heloisa Oss Boll,
Aykut Erdem,
Erkut Erdem,
Iacer Calixto
Abstract:
Large language models (LLMs) are increasingly applied in clinical decision support, yet current evaluations rarely reveal whether their outputs reflect genuine medical reasoning or superficial correlations. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework that probes fine-grained clinical understanding through controlled counterfactuals. Using intensive care unit (ICU) discharge notes from MIMIC-IV, we construct both raw (real-world) and template-based (synthetic) variants with single-variable perturbations in demographic (age, gender, ethnicity) and vital sign attributes. We evaluate eight LLMs, spanning general-purpose and medical variants, in a zero-shot setting. Model behavior is analyzed through (1) input-level sensitivity, capturing how counterfactuals alter perplexity, and (2) downstream reasoning, measuring their effect on predicted ICU length-of-stay and mortality. Overall, our results show that standard task metrics obscure clinically relevant differences in model behavior, with models differing substantially in how consistently and proportionally they adjust predictions to counterfactual perturbations.
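A minimal sketch of the input-level sensitivity measurement, assuming a HuggingFace-style causal LM; the naive string-swap perturbation below stands in for DeVisE's controlled templates.

```python
import math
import torch

def perplexity(model, tokenizer, text):
    """Perplexity of `text` under a causal LM (HuggingFace-style API assumed)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def counterfactual_sensitivity(model, tokenizer, note, old="age: 45", new="age: 85"):
    """Input-level sensitivity: the shift in perplexity after a single-variable
    perturbation (here a naive string swap of the age attribute)."""
    return (perplexity(model, tokenizer, note.replace(old, new))
            - perplexity(model, tokenizer, note))
```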
Submitted 26 February, 2026; v1 submitted 18 June, 2025;
originally announced June 2025.
-
A Vision-Language Framework for Multispectral Scene Representation Using Language-Grounded Features
Authors:
Enes Karanfil,
Nevrez Imamoglu,
Erkut Erdem,
Aykut Erdem
Abstract:
Scene understanding in remote sensing often faces challenges in generating accurate representations for complex environments such as various land use areas or coastal regions, which may also include snow, clouds, or haze. To address this, we present a vision-language framework named Spectral LLaVA, which integrates multispectral data with vision-language alignment techniques to enhance scene representation and description. Using the BigEarthNet v2 dataset from Sentinel-2, we establish a baseline with RGB-based scene descriptions and further demonstrate substantial improvements through the incorporation of multispectral information. Our framework optimizes a lightweight linear projection layer for alignment while keeping the vision backbone of SpectralGPT frozen. Our experiments encompass scene classification using linear probing and language modeling for jointly performing scene classification and description generation. Our results highlight Spectral LLaVA's ability to produce detailed and accurate descriptions, particularly for scenarios where RGB data alone proves inadequate, while also enhancing classification performance by refining SpectralGPT features into semantically meaningful representations.
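A minimal PyTorch sketch of the alignment recipe described above, in which only a lightweight linear projector is trained while the spectral encoder and language model stay frozen; dimensions are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class SpectralProjector(nn.Module):
    """Lightweight linear projection from frozen multispectral features to the
    LLM embedding space, in the spirit of Spectral LLaVA (sizes illustrative)."""

    def __init__(self, spectral_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(spectral_dim, llm_dim)

    def forward(self, spectral_feats):       # (B, N, spectral_dim)
        return self.proj(spectral_feats)     # (B, N, llm_dim) "visual tokens"

# Only the projector is optimized; the spectral encoder (e.g. SpectralGPT)
# and the language model remain frozen during alignment.
projector = SpectralProjector()
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)
```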
Submitted 17 January, 2025;
originally announced January 2025.
-
GaussianVideo: Efficient Video Representation via Hierarchical Gaussian Splatting
Authors:
Andrew Bond,
Jui-Hsien Wang,
Long Mai,
Erkut Erdem,
Aykut Erdem
Abstract:
Efficient neural representations for dynamic video scenes are critical for applications ranging from video compression to interactive simulations. Yet, existing methods often face challenges related to high memory usage, lengthy training times, and temporal consistency. To address these issues, we introduce a novel neural video representation that combines 3D Gaussian splatting with continuous camera motion modeling. By leveraging Neural ODEs, our approach learns smooth camera trajectories while maintaining an explicit 3D scene representation through Gaussians. Additionally, we introduce a spatiotemporal hierarchical learning strategy, progressively refining spatial and temporal features to enhance reconstruction quality and accelerate convergence. This memory-efficient approach achieves high-quality rendering at impressive speeds. Experimental results show that our hierarchical learning, combined with robust camera motion modeling, captures complex dynamic scenes with strong temporal consistency, achieving state-of-the-art performance across diverse video datasets in both high- and low-motion scenarios.
Submitted 8 January, 2025;
originally announced January 2025.
-
HyperGAN-CLIP: A Unified Framework for Domain Adaptation, Image Synthesis and Manipulation
Authors:
Abdul Basit Anees,
Ahmet Canberk Baykal,
Muhammed Burak Kizil,
Duygu Ceylan,
Erkut Erdem,
Aykut Erdem
Abstract:
Generative Adversarial Networks (GANs), particularly StyleGAN and its variants, have demonstrated remarkable capabilities in generating highly realistic images. Despite their success, adapting these models to diverse tasks such as domain adaptation, reference-guided synthesis, and text-guided manipulation with limited training data remains challenging. Towards this end, in this study, we present a novel framework that significantly extends the capabilities of a pre-trained StyleGAN by integrating CLIP space via hypernetworks. This integration allows dynamic adaptation of StyleGAN to new domains defined by reference images or textual descriptions. Additionally, we introduce a CLIP-guided discriminator that enhances the alignment between generated images and target domains, ensuring superior image quality. Our approach demonstrates unprecedented flexibility, enabling text-guided image manipulation without the need for text-specific training data and facilitating seamless style transfer. Comprehensive qualitative and quantitative evaluations confirm the robustness and superior performance of our framework compared to existing methods.
Submitted 19 November, 2024;
originally announced November 2024.
-
HUE Dataset: High-Resolution Event and Frame Sequences for Low-Light Vision
Authors:
Burak Ercan,
Onur Eker,
Aykut Erdem,
Erkut Erdem
Abstract:
Low-light environments pose significant challenges for image enhancement methods. To address these challenges, in this work, we introduce the HUE dataset, a comprehensive collection of high-resolution event and frame sequences captured in diverse and challenging low-light conditions. Our dataset includes 106 sequences, encompassing indoor, cityscape, twilight, night, driving, and controlled scenarios, each carefully recorded to address various illumination levels and dynamic ranges. Utilizing a hybrid RGB and event camera setup, we collect a dataset that combines high-resolution event data with complementary frame data. We employ both qualitative and quantitative evaluations using no-reference metrics to assess state-of-the-art low-light enhancement and event-based image reconstruction methods. Additionally, we evaluate these methods on a downstream object detection task. Our findings reveal that while event-based methods perform well in specific metrics, they may produce false positives in practical applications. This dataset and our comprehensive analysis provide valuable insights for future research in low-light vision and hybrid camera systems.
Submitted 24 October, 2024;
originally announced October 2024.
-
Comparative study of the W UMa type binaries S Ant and Epsilon CrA
Authors:
Volkan Bakis,
Edwin Budding,
Ahmet Erdem,
Tom Love,
Mark G. Blackford,
Wu Zihao,
Adam Tang,
Michael D. Rhodes,
Timothy S. Banks
Abstract:
Contact binaries challenge contemporary stellar astrophysics with respect to their incidence, structure and evolution. We explore these issues through a detailed study of two bright examples, S Ant and Eps CrA, which permit high-resolution spectroscopy at a relatively good S/N ratio. The availability of high-quality photometry, including data from the TESS satellite as well as Gaia parallaxes, allows us to apply the Russell paradigm to produce reliable up-to-date information on the physical properties of these binaries. As a result, models of their interactive evolution, such as the thermal relaxation oscillator scenario, can be examined. Mass transfer between the components is clearly evidenced, but the variability of the O'Connell effect over relatively short time scales points to irregularities in the mass transfer or accretion processes. Our findings indicate that S Ant may evolve into an R CMa type Algol, while the low mass ratio of Eps CrA suggests a likely merger of its components in the not-too-distant future.
Submitted 25 September, 2024;
originally announced September 2024.
-
Modelling of eclipsing binary systems with pulsating components and tertiary companions: BF Vel and RR Lep
Authors:
Alexios Liakos,
David J. W. Moriarty,
Ahmet Erdem,
Julian F. West,
Phil Evans
Abstract:
This paper presents a comprehensive analysis of RR Lep and BF Vel, two short-period semi-detached oscillating Algols (oEA stars), which are shown to be triple systems. Spectral types of their primaries were determined and radial velocities calculated from spectra observed with the Australian National University's 2.3 m telescope and Wide Field Spectrograph. Spectra of the Na I D doublet confirmed the presence of tertiary components which were apparent in the broadening function analyses and, with H$\alpha$ spectra during primary eclipses, indicated chromospheric activity in their secondaries. Ground-based telescopes were used for observations in several pass bands for photometric analyses. These data were complemented by data from the TESS mission to enable the modelling of the light curves, followed by a detailed analysis of pulsations. Eclipse-timing variation (ETV) analyses of both systems were used to determine the most likely mechanisms modulating the orbital period. We found mass values $M_1 = 2.9$ M$_\odot$ and $M_2 = 0.75$ M$_\odot$ for the components of RR Lep, and $M_1 = 1.93$ M$_\odot$ and $M_2 = 0.97$ M$_\odot$ for those of BF Vel. By integrating information from photometry, spectroscopy and ETV analysis, we found that tertiary components revolve around both systems. The primary star of RR Lep pulsates in 36 frequencies, of which five were identified as independent modes, with the dominant one being 32.28 d$^{-1}$. The pulsating component of BF Vel oscillates in 37 frequencies, with the frequency 46.73 d$^{-1}$ revealed as the only independent mode. For both systems, many frequencies were found to be related to the orbital frequency. Their physical properties were compared with other oEA stars in Mass-Radius and H-R diagrams, and the pulsational properties of their $\delta$ Sct components were compared with currently known systems of this type within the orbital-pulsation period and $\log g$-pulsation period diagrams.
Submitted 16 September, 2024; v1 submitted 6 September, 2024;
originally announced September 2024.
-
Winning Amazon KDD Cup'24
Authors:
Chris Deotte,
Ivan Sorokin,
Ahmet Erdem,
Benedikt Schifferer,
Gilberto Titericz Jr,
Simon Jegou
Abstract:
This paper describes the winning solution of all 5 tasks for the Amazon KDD Cup 2024 Multi Task Online Shopping Challenge for LLMs. The challenge was to build a useful assistant, answering questions in the domain of online shopping. The competition contained 57 diverse tasks, covering 5 different task types (e.g. multiple choice) across 4 different tracks (e.g. multi-lingual). Our solution is a single model per track. We fine-tune Qwen2-72B-Instruct on our own training dataset. As the competition released only 96 example questions, we developed our own training dataset by processing multiple public datasets or using Large Language Models for data augmentation and synthetic data generation. We apply wise-ft to account for distribution shifts and ensemble multiple LoRA adapters in one model. We employed Logits Processors to constrain the model output on relevant tokens for the tasks. AWQ 4-bit Quantization and vLLM are used during inference to predict the test dataset within the time constraints of 20 to 140 minutes depending on the track. Our solution achieved first place in each individual track and first place overall in Amazon's KDD Cup 2024.
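A minimal sketch of the logits-constraining idea mentioned above, masking next-token logits to the task-relevant choices; the actual competition code uses vLLM logits processors and is more involved (note also that many tokenizers require a leading space on letter tokens).

```python
import torch

def constrain_to_choices(logits, tokenizer, choices=("A", "B", "C", "D")):
    """Mask next-token logits so only the multiple-choice letters can be sampled.

    Illustrative only: a production logits processor would handle tokenizer
    quirks (leading spaces, multi-token choices) and batching.
    """
    allowed = [tokenizer.convert_tokens_to_ids(c) for c in choices]
    mask = torch.full_like(logits, float("-inf"))
    mask[..., allowed] = 0.0
    return logits + mask
```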
Submitted 5 August, 2024;
originally announced August 2024.
-
Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning
Authors:
Mustafa Dogan,
Ilker Kesen,
Iacer Calixto,
Aykut Erdem,
Erkut Erdem
Abstract:
The linguistic capabilities of Multimodal Large Language Models (MLLMs) are critical for their effective application across diverse tasks. This study aims to evaluate the performance of MLLMs on the VALSE benchmark, focusing on the efficacy of few-shot In-Context Learning (ICL), and Chain-of-Thought (CoT) prompting. We conducted a comprehensive assessment of state-of-the-art MLLMs, varying in model size and pretraining datasets. The experimental results reveal that ICL and CoT prompting significantly boost model performance, particularly in tasks requiring complex reasoning and contextual understanding. Models pretrained on captioning datasets show superior zero-shot performance, while those trained on interleaved image-text data benefit from few-shot learning. Our findings provide valuable insights into optimizing MLLMs for better grounding of language in visual contexts, highlighting the importance of the composition of pretraining data and the potential of few-shot learning strategies to improve the reasoning abilities of MLLMs.
Submitted 17 July, 2024;
originally announced July 2024.
-
CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models
Authors:
Yigit Ekin,
Ahmet Burak Yildirim,
Erdem Eren Caglar,
Aykut Erdem,
Erkut Erdem,
Aysegul Dundar
Abstract:
Advanced image editing techniques, particularly inpainting, are essential for seamlessly removing unwanted elements while preserving visual integrity. Traditional GAN-based methods have achieved notable success, but recent advancements in diffusion models have produced superior results due to their training on large-scale datasets, enabling the generation of remarkably realistic inpainted images. Despite their strengths, diffusion models often struggle with object removal tasks without explicit guidance, leading to unintended hallucinations of the removed object. To address this issue, we introduce CLIPAway, a novel approach leveraging CLIP embeddings to focus on background regions while excluding foreground elements. CLIPAway enhances inpainting accuracy and quality by identifying embeddings that prioritize the background, thus achieving seamless object removal. Unlike other methods that rely on specialized training datasets or costly manual annotations, CLIPAway provides a flexible, plug-and-play solution compatible with various diffusion-based inpainting techniques.
Submitted 13 June, 2024;
originally announced June 2024.
-
SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models
Authors:
Burak Can Biner,
Farrin Marouf Sofian,
Umur Berkay Karakaş,
Duygu Ceylan,
Erkut Erdem,
Aykut Erdem
Abstract:
We are witnessing a revolution in conditional image synthesis with the recent success of large scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using multi-modal input. While spatial control using cues such as depth, sketch, and other images has attracted a lot of research, we argue that another equally effective modality is audio since sound and sight are two main components of human perception. Hence, we propose a method to enable audio-conditioning in large scale image diffusion models. Our method first maps features obtained from audio clips to tokens that can be injected into the diffusion model in a fashion similar to text tokens. We introduce additional audio-image cross attention layers which we finetune while freezing the weights of the original layers of the diffusion model. In addition to audio conditioned image generation, our method can also be utilized in conjunction with diffusion based editing methods to enable audio conditioned image editing. We demonstrate our method on a wide range of audio and image datasets. We perform extensive comparisons with recent methods and show favorable performance.
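A minimal PyTorch sketch of an added audio-image cross-attention layer of the kind described above, trained while the original diffusion-model weights stay frozen; dimensions and placement are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AudioCrossAttentionAdapter(nn.Module):
    """Extra cross-attention layer letting U-Net feature tokens attend to audio
    tokens. Only this adapter is finetuned; the original diffusion layers are
    kept frozen. Sizes are illustrative."""

    def __init__(self, feat_dim=320, audio_dim=768, n_heads=8):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, feat_dim)   # audio clip features -> tokens
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, image_tokens, audio_feats):
        # image_tokens: (B, N, feat_dim); audio_feats: (B, M, audio_dim)
        audio_tokens = self.audio_proj(audio_feats)
        attended, _ = self.attn(query=self.norm(image_tokens),
                                key=audio_tokens, value=audio_tokens)
        return image_tokens + attended   # residual injection of audio conditioning
```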
Submitted 1 May, 2024;
originally announced May 2024.
-
Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare
Authors:
Emre Can Acikgoz,
Osman Batur İnce,
Rayene Bench,
Arda Anıl Boz,
İlker Kesen,
Aykut Erdem,
Erkut Erdem
Abstract:
The integration of Large Language Models (LLMs) into healthcare promises to transform medical diagnostics, research, and patient care. Yet, the progression of medical LLMs faces obstacles such as complex training requirements, rigorous evaluation demands, and the dominance of proprietary models that restrict academic exploration. Transparent, comprehensive access to LLM resources is essential for advancing the field, fostering reproducibility, and encouraging innovation in healthcare AI. We present Hippocrates, an open-source LLM framework specifically developed for the medical domain. In stark contrast to previous efforts, it offers unrestricted access to its training datasets, codebase, checkpoints, and evaluation protocols. This open approach is designed to stimulate collaborative research, allowing the community to build upon, refine, and rigorously evaluate medical LLMs within a transparent ecosystem. Also, we introduce Hippo, a family of 7B models tailored for the medical domain, fine-tuned from Mistral and LLaMA2 through continual pre-training, instruction tuning, and reinforcement learning from human and AI feedback. Our models outperform existing open medical LLMs by a large margin, even surpassing models with 70B parameters. Through Hippocrates, we aspire to unlock the full potential of LLMs not just to advance medical knowledge and patient care but also to democratize the benefits of AI research in healthcare, making them available across the globe.
Submitted 25 April, 2024;
originally announced April 2024.
-
Sequential Compositional Generalization in Multimodal Models
Authors:
Semih Yagcioglu,
Osman Batur İnce,
Aykut Erdem,
Erkut Erdem,
Desmond Elliott,
Deniz Yuret
Abstract:
The rise of large-scale multimodal models has paved the pathway for groundbreaking advances in generative modeling and reasoning, unlocking transformative applications in a variety of complex tasks. However, a pressing question that remains is their genuine capability for stronger forms of generalization, which has been largely underexplored in the multimodal setting. Our study aims to address this by examining sequential compositional generalization using CompAct (Compositional Activities; project page: http://cyberiada.github.io/CompAct), a carefully constructed, perceptually grounded dataset set within a rich backdrop of egocentric kitchen activity videos. Each instance in our dataset is represented with a combination of raw video footage, naturally occurring sound, and crowd-sourced step-by-step descriptions. More importantly, our setup ensures that the individual concepts are consistently distributed across training and evaluation sets, while their compositions are novel in the evaluation set. We conduct a comprehensive assessment of several unimodal and multimodal models. Our findings reveal that bi-modal and tri-modal models exhibit a clear edge over their text-only counterparts. This highlights the importance of multimodality while charting a trajectory for future research in this domain.
Submitted 18 April, 2024;
originally announced April 2024.
-
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
Authors:
Ilker Kesen,
Andrea Pedrotti,
Mustafa Dogan,
Michele Cafagna,
Emre Can Acikgoz,
Letitia Parcalabescu,
Iacer Calixto,
Anette Frank,
Albert Gatt,
Aykut Erdem,
Erkut Erdem
Abstract:
With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm footing. Task-based evaluations, while valuable, fail to capture the complexities and specific temporal aspects of moving images that VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. ViLMA also includes proficiency tests, which assess basic capabilities deemed essential to solving the main counterfactual tests. We show that current VidLMs' grounding abilities are no better than those of vision-language models which use static images. This is especially striking once the performance on proficiency tests is factored in. Our benchmark serves as a catalyst for future research on VidLMs, helping to highlight areas that still need to be explored.
Submitted 12 November, 2023;
originally announced November 2023.
-
Harnessing Dataset Cartography for Improved Compositional Generalization in Transformers
Authors:
Osman Batur İnce,
Tanin Zeraati,
Semih Yagcioglu,
Yadollah Yaghoobzadeh,
Erkut Erdem,
Aykut Erdem
Abstract:
Neural networks have revolutionized language modeling and excelled in various downstream tasks. However, the extent to which these models achieve compositional generalization comparable to human cognitive abilities remains a topic of debate. While existing approaches in the field have mainly focused on novel architectures and alternative learning paradigms, we introduce a pioneering method harnessing the power of dataset cartography (Swayamdipta et al., 2020). By strategically identifying a subset of compositional generalization data using this approach, we achieve a remarkable improvement in model accuracy, yielding enhancements of up to 10% on CFQ and COGS datasets. Notably, our technique incorporates dataset cartography as a curriculum learning criterion, eliminating the need for hyperparameter tuning while consistently achieving superior performance. Our findings highlight the untapped potential of dataset cartography in unleashing the full capabilities of compositional generalization within Transformer models. Our code is available at https://github.com/cyberiada/cartography-for-compositionality.
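A minimal sketch of using dataset-cartography statistics (per-epoch gold-label probabilities) as a curriculum ordering criterion; the easy-to-ambiguous ordering shown is one plausible choice, not necessarily the paper's.

```python
import numpy as np

def cartography_curriculum(epoch_probs):
    """Order training examples by dataset-cartography statistics.

    `epoch_probs` has shape (n_epochs, n_examples): the probability assigned to
    the gold label at each epoch (Swayamdipta et al., 2020).
    """
    confidence = epoch_probs.mean(axis=0)   # high = easy-to-learn
    variability = epoch_probs.std(axis=0)   # high = ambiguous
    # Present confident examples first, then increasingly ambiguous ones.
    return np.lexsort((variability, -confidence))

order = cartography_curriculum(np.random.rand(5, 1000))  # toy usage
```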
Submitted 18 October, 2023;
originally announced October 2023.
-
Hyperspectral Image Denoising via Self-Modulating Convolutional Neural Networks
Authors:
Orhan Torun,
Seniha Esen Yuksel,
Erkut Erdem,
Nevrez Imamoglu,
Aykut Erdem
Abstract:
Compared to natural images, hyperspectral images (HSIs) consist of a large number of bands, with each band capturing different spectral information from a certain wavelength, even some beyond the visible spectrum. These characteristics of HSIs make them highly effective for remote sensing applications. That said, existing hyperspectral imaging devices introduce severe degradation in HSIs. Hence, hyperspectral image denoising has attracted considerable attention from the community lately. While recent deep HSI denoising methods have provided effective solutions, their performance under real-life complex noise remains suboptimal, as they lack adaptability to new data. To overcome these limitations, in our work, we introduce a self-modulating convolutional neural network, referred to as SM-CNN, which utilizes correlated spectral and spatial information. At the core of the model lies a novel block, which we call the spectral self-modulating residual block (SSMRB), that allows the network to transform the features in an adaptive manner based on the adjacent spectral data, enhancing the network's ability to handle complex noise. In particular, the introduction of SSMRB transforms our denoising network into a dynamic network that adapts its predicted features while denoising every input HSI with respect to its spatio-spectral characteristics. Experimental analysis on both synthetic and real data shows that the proposed SM-CNN outperforms other state-of-the-art HSI denoising methods both quantitatively and qualitatively on public benchmark datasets.
Submitted 15 September, 2023;
originally announced September 2023.
-
Spherical Vision Transformer for 360-degree Video Saliency Prediction
Authors:
Mert Cokelek,
Nevrez Imamoglu,
Cagri Ozcinar,
Erkut Erdem,
Aykut Erdem
Abstract:
The growing interest in omnidirectional videos (ODVs) that capture the full field-of-view (FOV) has made 360-degree saliency prediction increasingly important in computer vision. However, predicting where humans look in 360-degree scenes presents unique challenges, including spherical distortion, high resolution, and limited labelled data. We propose a novel vision-transformer-based model for omnidirectional videos named SalViT360 that leverages tangent image representations. We introduce a spherical geometry-aware spatiotemporal self-attention mechanism that is capable of effective omnidirectional video understanding. Furthermore, we present a consistency-based unsupervised regularization term for projection-based 360-degree dense-prediction models to reduce artefacts in the predictions that occur after inverse projection. Our approach is the first to employ tangent images for omnidirectional saliency prediction, and our experimental results on three ODV saliency datasets demonstrate its effectiveness compared to the state-of-the-art.
Submitted 24 August, 2023;
originally announced August 2023.
-
CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing
Authors:
Ahmet Canberk Baykal,
Abdul Basit Anees,
Duygu Ceylan,
Erkut Erdem,
Aykut Erdem,
Deniz Yuret
Abstract:
Researchers have recently begun exploring the use of StyleGAN-based models for real image editing. One particularly interesting application is using natural language descriptions to guide the editing process. Existing approaches for editing images using language either resort to instance-level latent code optimization or map predefined text prompts to some editing directions in the latent space. However, these approaches have inherent limitations. The former is not very efficient, while the latter often struggles to effectively handle multi-attribute changes. To address these weaknesses, we present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes. The core of our method is the use of novel, lightweight text-conditioned adapter layers integrated into pretrained GAN-inversion networks. We demonstrate that by conditioning the initial inversion step on the CLIP embedding of the target description, we are able to obtain more successful edit directions. Additionally, we use a CLIP-guided refinement step to make corrections in the resulting residual latent codes, which further improves the alignment with the text prompt. Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds, as shown by our qualitative and quantitative results.
Submitted 18 July, 2023; v1 submitted 17 July, 2023;
originally announced July 2023.
-
HyperE2VID: Improving Event-Based Video Reconstruction via Hypernetworks
Authors:
Burak Ercan,
Onur Eker,
Canberk Saglam,
Aykut Erdem,
Erkut Erdem
Abstract:
Event-based cameras are becoming increasingly popular for their ability to capture high-speed motion with low latency and high dynamic range. However, generating videos from events remains challenging due to the highly sparse and varying nature of event data. To address this, in this study, we propose HyperE2VID, a dynamic neural network architecture for event-based video reconstruction. Our approach uses hypernetworks to generate per-pixel adaptive filters guided by a context fusion module that combines information from event voxel grids and previously reconstructed intensity images. We also employ a curriculum learning strategy to train the network more robustly. Our comprehensive experimental evaluations across various benchmark datasets reveal that HyperE2VID not only surpasses current state-of-the-art methods in terms of reconstruction quality but also achieves this with fewer parameters, reduced computational requirements, and accelerated inference times.
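A toy PyTorch sketch of the core idea, a hypernetwork head predicting per-pixel adaptive filters from a fused context tensor; sizes and structure are illustrative, not HyperE2VID's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerPixelFilterHypernet(nn.Module):
    """Predicts a k x k filter for every pixel from a context tensor and applies
    it to decoder features (dynamic, per-pixel convolution)."""

    def __init__(self, ctx_ch=64, feat_ch=32, k=3):
        super().__init__()
        self.k = k
        self.head = nn.Conv2d(ctx_ch, feat_ch * k * k, 1)  # filters per pixel

    def forward(self, feats, context):
        # feats: (B, C, H, W), context: (B, ctx_ch, H, W); C must equal feat_ch
        B, C, H, W = feats.shape
        filters = self.head(context).view(B, C, self.k * self.k, H, W)
        filters = torch.softmax(filters, dim=2)              # normalized local kernel
        patches = F.unfold(feats, self.k, padding=self.k // 2)  # (B, C*k*k, H*W)
        patches = patches.view(B, C, self.k * self.k, H, W)
        return (filters * patches).sum(dim=2)                # adaptively filtered features
```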
Submitted 20 February, 2024; v1 submitted 10 May, 2023;
originally announced May 2023.
-
EVREAL: Towards a Comprehensive Benchmark and Analysis Suite for Event-based Video Reconstruction
Authors:
Burak Ercan,
Onur Eker,
Aykut Erdem,
Erkut Erdem
Abstract:
Event cameras are a new type of vision sensor that incorporates asynchronous and independent pixels, offering advantages over traditional frame-based cameras such as high dynamic range and minimal motion blur. However, their output is not easily understandable by humans, making the reconstruction of intensity images from event streams a fundamental task in event-based vision. While recent deep learning-based methods have shown promise in video reconstruction from events, this problem is not completely solved yet. To facilitate comparison between different approaches, standardized evaluation protocols and diverse test datasets are essential. This paper proposes a unified evaluation methodology and introduces an open-source framework called EVREAL to comprehensively benchmark and analyze various event-based video reconstruction methods from the literature. Using EVREAL, we give a detailed analysis of the state-of-the-art methods for event-based video reconstruction, and provide valuable insights into the performance of these methods under varying settings, challenging scenarios, and downstream tasks.
Submitted 5 April, 2024; v1 submitted 30 April, 2023;
originally announced May 2023.
-
VidStyleODE: Disentangled Video Editing via StyleGAN and NeuralODEs
Authors:
Moayed Haji Ali,
Andrew Bond,
Tolga Birdal,
Duygu Ceylan,
Levent Karacan,
Erkut Erdem,
Aykut Erdem
Abstract:
We propose $\textbf{VidStyleODE}$, a spatiotemporally continuous disentangled $\textbf{Vid}$eo representation based upon $\textbf{Style}$GAN and Neural-$\textbf{ODE}$s. Effective traversal of the latent space learned by Generative Adversarial Networks (GANs) has been the basis for recent breakthroughs in image editing. However, the applicability of such advancements to the video domain has been hindered by the difficulty of representing and controlling videos in the latent space of GANs. In particular, videos are composed of content (i.e., appearance) and complex motion components that require a special mechanism to disentangle and control. To achieve this, VidStyleODE encodes the video content in a pre-trained StyleGAN $\mathcal{W}_+$ space and benefits from a latent ODE component to summarize the spatiotemporal dynamics of the input video. Our novel continuous video generation process then combines the two to generate high-quality and temporally consistent videos with varying frame rates. We show that our proposed method enables a variety of applications on real videos: text-guided appearance manipulation, motion manipulation, image animation, and video interpolation and extrapolation. Project website: https://cyberiada.github.io/VidStyleODE
Submitted 10 March, 2025; v1 submitted 12 April, 2023;
originally announced April 2023.
-
Inst-Inpaint: Instructing to Remove Objects with Diffusion Models
Authors:
Ahmet Burak Yildirim,
Vedat Baday,
Erkut Erdem,
Aykut Erdem,
Aysegul Dundar
Abstract:
The image inpainting task refers to erasing unwanted pixels from images and filling them in a semantically consistent and realistic way. Traditionally, the pixels to be erased are defined with binary masks. From an application point of view, a user needs to generate masks for the objects they would like to remove, which can be time-consuming and error-prone. In this work, we are interested in an image inpainting algorithm that estimates which object should be removed based on natural language input and removes it in a single step. For this purpose, we first construct a dataset named GQA-Inpaint for this task. Second, we present a novel inpainting framework, Inst-Inpaint, that can remove objects from images based on instructions given as text prompts. We establish various GAN- and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare methods using different evaluation metrics that measure the quality and accuracy of the models and show significant quantitative and qualitative improvements.
Submitted 9 August, 2023; v1 submitted 6 April, 2023;
originally announced April 2023.
-
ST360IQ: No-Reference Omnidirectional Image Quality Assessment with Spherical Vision Transformers
Authors:
Nafiseh Jabbari Tofighi,
Mohamed Hedi Elfkir,
Nevrez Imamoglu,
Cagri Ozcinar,
Erkut Erdem,
Aykut Erdem
Abstract:
Omnidirectional images, also known as 360 images, can deliver immersive and interactive visual experiences. As their popularity has increased dramatically in recent years, evaluating the quality of 360 images has become a problem of interest, since it provides insights for capturing, transmitting, and consuming this new medium. However, directly adapting quality assessment methods proposed for standard natural images to omnidirectional data poses certain challenges. These models need to deal with very high-resolution data and implicit distortions due to the spherical form of the images. In this study, we present a method for no-reference 360 image quality assessment. Our proposed ST360IQ model extracts tangent viewports from the salient parts of the input omnidirectional image and employs a vision-transformer-based module that processes saliency-selective patches/tokens and estimates a quality score for each viewport. It then aggregates these scores into a final quality score. Our experiments on two benchmark datasets, namely OIQA and CVIQ, demonstrate that, compared to the state-of-the-art, our approach predicts omnidirectional image quality in a way that correlates well with human-perceived image quality. The code is available at https://github.com/Nafiseh-Tofighi/ST360IQ
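A minimal sketch of the viewport-then-aggregate idea follows; the small CNN stands in for the vision-transformer encoder and the tangent-viewport extraction step, so this is only an assumption-laden illustration rather than the released ST360IQ code.

import torch
import torch.nn as nn

class ViewportQualityModel(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # Stand-in encoder; the actual model uses a vision transformer over saliency-selected tokens.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(feat_dim, 1)            # per-viewport quality regressor

    def forward(self, viewports):                     # (N_viewports, 3, H, W) tangent crops
        per_view = self.head(self.encoder(viewports)).squeeze(-1)
        return per_view.mean()                        # aggregate per-viewport scores into one value

score = ViewportQualityModel()(torch.rand(6, 3, 224, 224))   # six salient tangent viewports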
Submitted 13 March, 2023;
originally announced March 2023.
-
First detailed study of two eccentric eclipsing binaries: TYC 5378-1590-1 and TYC 8378-252-1
Authors:
P. Zasche,
D. Sürgit,
A. Erdem,
C. A. Engelbrecht,
F. Marang
Abstract:
Aims: The analysis of combined photometry and spectroscopy of eccentric eclipsing binary systems facilitates the derivation of precise values for the parameters of the component stars and their orbits, thereby providing stringent tests of theories of stellar structure and evolution. In this paper two eccentric eclipsing binary systems, TYC 5378-1590-1 and TYC 8378-252-1, are studied in detail for the first time. Methods: Radial velocities were obtained using cross-correlation methods applied to mid-resolution spectra covering almost the entire orbital phase domains of these two systems. TESS photometry was used for the analysis of TYC 5378-1590-1, whereas ASAS-SN photometry was used for the analysis of TYC 8378-252-1. Results: We obtained the first precise derivation of the physical parameters of these systems. Both systems display moderately eccentric orbits (e = 0.3 and 0.2) with periods of 3.7323 and 2.8776 days, respectively. The apsidal motion is very slow, with a period of several centuries. We present two models for the apsidal motion of TYC 5378-1590-1. The internal structure constant derived from observations for TYC 8378-252-1 is approximately 11% lower than theoretical predictions. We discuss possible reasons for this discrepancy. Our analysis indicates that the components of both systems are on the main sequence. The components of TYC 5378-1590-1 are relatively young stars (age 600 Myr) close to the ZAMS, whereas the components of TYC 8378-252-1 are relatively old stars (age 4 Gyr) close to the TAMS. Our finding that the circularization timescale for TYC 5378-1590-1 is 200 times longer than its evolutionary age is compatible with theory; however, the fact that the evolutionary age of TYC 8378-252-1 is approximately ten times longer than its circularization timescale, while its orbital eccentricity remains quite high (e = 0.2), challenges present theories of circularization.
Submitted 9 February, 2023;
originally announced February 2023.
-
General Terms of All Almost Balancing Numbers of First and Second Type
Authors:
Ahmet Tekcan,
Alper Erdem
Abstract:
In this work, we determine the general terms of all almost balancing numbers of the first and second type in terms of balancing numbers and, conversely, the general terms of all balancing numbers in terms of almost balancing numbers of the first and second type. We also establish a correspondence between all almost balancing numbers of the first and second type and the Pell numbers.
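For background (standard definitions that the abstract assumes rather than states): a positive integer $B$ is a balancing number with balancer $R$ if $1 + 2 + \cdots + (B-1) = (B+1) + (B+2) + \cdots + (B+R)$. The balancing numbers $B_n = 1, 6, 35, 204, \ldots$ satisfy the recurrence $B_{n+1} = 6B_n - B_{n-1}$ with $B_1 = 1$ and $B_2 = 6$, and one well-known bridge to the Pell numbers $P_n$ is $B_n = P_{2n}/2$; identities of this kind are the machinery behind general-term formulas such as those derived here.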
Submitted 19 November, 2022; v1 submitted 16 November, 2022;
originally announced November 2022.
-
Detecting Euphemisms with Literal Descriptions and Visual Imagery
Authors:
İlker Kesen,
Aykut Erdem,
Erkut Erdem,
Iacer Calixto
Abstract:
This paper describes our two-stage system for the Euphemism Detection shared task hosted by the 3rd Workshop on Figurative Language Processing in conjunction with EMNLP 2022. Euphemisms tone down expressions about sensitive or unpleasant issues like addiction and death. The ambiguous nature of euphemistic words or expressions makes it challenging to detect their actual meaning within a context. In the first stage, we seek to mitigate this ambiguity by incorporating literal descriptions into the input text prompts of our baseline model. It turns out that this kind of direct supervision yields a remarkable performance improvement. In the second stage, we integrate visual supervision into our system using visual imagery: two sets of images generated by a text-to-image model from the terms and their descriptions. Our experiments demonstrate that visual supervision also gives a statistically significant performance boost. Our system achieved second place with an F1 score of 87.2%, only about 0.9% below the best submission.
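A minimal Python sketch of the first-stage idea, prepending a literal description of the potentially euphemistic term to the input before classification; the description lookup, separator token, and downstream classifier are hypothetical placeholders rather than the shared-task system itself.

def build_prompt(sentence, term, literal_descriptions):
    # Append a literal paraphrase of the term so the classifier also sees its plain meaning.
    description = literal_descriptions.get(term, "")
    return f"{sentence} [SEP] {term} : {description}"

literal_descriptions = {"let go": "to dismiss someone from a job"}   # hypothetical lookup table
text = build_prompt("The company let him go last week.", "let go", literal_descriptions)
# `text` would then be fed to a fine-tuned sequence classifier that labels the usage
# as euphemistic or literal.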
Submitted 8 November, 2022;
originally announced November 2022.
-
Disentangling Content and Motion for Text-Based Neural Video Manipulation
Authors:
Levent Karacan,
Tolga Kerimoğlu,
İsmail İnan,
Tolga Birdal,
Erkut Erdem,
Aykut Erdem
Abstract:
Giving machines the ability to imagine possible new objects or scenes from linguistic descriptions and produce their realistic renderings is arguably one of the most challenging problems in computer vision. Recent advances in deep generative models have led to new approaches that give promising results towards this goal. In this paper, we introduce a new method called DiCoMoGAN for manipulating videos with natural language, aiming to perform local and semantic edits on a video clip to alter the appearances of an object of interest. Our GAN architecture allows for better utilization of multiple observations by disentangling content and motion to enable controllable semantic edits. To this end, we introduce two tightly coupled networks: (i) a representation network for constructing a concise understanding of motion dynamics and temporally invariant content, and (ii) a translation network that exploits the extracted latent content representation to actuate the manipulation according to the target description. Our qualitative and quantitative evaluations demonstrate that DiCoMoGAN significantly outperforms existing frame-based methods, producing temporally coherent and semantically more meaningful results.
Submitted 5 November, 2022;
originally announced November 2022.
-
Perception-Distortion Trade-off in the SR Space Spanned by Flow Models
Authors:
Cansu Korkmaz,
A. Murat Tekalp,
Zafer Dogan,
Erkut Erdem,
Aykut Erdem
Abstract:
Flow-based generative super-resolution (SR) models learn to produce a diverse set of feasible SR solutions, called the SR space. The diversity of SR solutions increases with the temperature ($τ$) of the latent variables, which introduces random variations of texture among sample solutions, resulting in visual artifacts and low fidelity. In this paper, we present a simple but effective image ensembling/fusion approach to obtain a single SR image, eliminating random artifacts and improving fidelity without significantly compromising perceptual quality. We achieve this by benefiting from a diverse set of feasible photo-realistic solutions in the SR space spanned by flow models. We propose different image ensembling and fusion strategies which offer multiple paths to move sample solutions in the SR space to more desirable destinations in the perception-distortion plane in a controllable manner, depending on the fidelity vs. perceptual quality requirements of the task at hand. Experimental results demonstrate that our image ensembling/fusion strategy achieves a more promising perception-distortion trade-off than the sample SR images produced by flow models and adversarially trained models, in terms of both quantitative metrics and visual quality.
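A minimal Python sketch of the sample-then-fuse idea; simple pixel-wise averaging is shown as one possible strategy, and the flow-model sampling call in the comment is a placeholder rather than any specific model's API.

import torch

def fuse_sr_samples(samples):
    # samples: list of (C, H, W) SR tensors drawn from the same low-resolution input.
    # Pixel-wise averaging suppresses sample-specific texture artifacts at some cost in sharpness.
    return torch.stack(samples, dim=0).mean(dim=0)

# e.g. samples = [flow_model.sample(lr_image, temperature=0.8) for _ in range(8)]  # placeholder call
fused = fuse_sr_samples([torch.rand(3, 256, 256) for _ in range(8)])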
Submitted 18 September, 2022;
originally announced September 2022.
-
V410 Puppis: A useful laboratory for early stellar evolution
Authors:
Ahmet Erdem,
Derya Surgit,
Burcu Ozkardes,
Petr Hadrava,
Micheal D. Rhodes,
Tom Love,
Mark G. Blackford,
Timothy S. Banks,
Edwin Budding
Abstract:
New spectrometric (HERCULES) and ground-based multi-colour photometric data on the multiple star V410 Puppis are combined with satellite photometry (HIPPARCOS and TESS), as well as historic astrometric observations. Absolute parameters for V410 Pup Aab are derived: $M_{Aa}$ = $3.15 \pm 0.10$, $M_{Ab}$ = $1.83 \pm 0.08$ (M$_{\odot}$); $R_{Aa}$ = $2.12 \pm 0.10$, $R_{Ab}$ = $1.52 \pm 0.08$ (R$_\odot$); $a$ = $6.57 \pm 0.04$ R$_\odot$; $T_{Aa}$ = $12500 \pm 1000$, $T_{Ab}$ = $9070 \pm 800$ (K), and photometric distance $350 \pm 10$ (pc). We report the discovery of a low-amplitude SPB variation in the light curve and also indications of an accretion structure around V410 Pup B as well as emission cores in V410 Pup C. We argue that V410 Pup is probably a young formation connected with the Vela 2 OB Association. The combined evidence allows an age in the range 7-25 Myr from comparisons with standard stellar evolution modelling.
Submitted 27 July, 2022;
originally announced July 2022.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza
, et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
SLAMP: Stochastic Latent Appearance and Motion Prediction
Authors:
Adil Kaan Akan,
Erkut Erdem,
Aykut Erdem,
Fatma Güney
Abstract:
Motion is an important cue for video prediction and is often utilized by separating video content into static and dynamic components. Most previous work utilizing motion is deterministic, but there are stochastic methods that can model the inherent uncertainty of the future. Existing stochastic models either do not reason about motion explicitly or make limiting assumptions about the static part. In this paper, we reason about appearance and motion in the video stochastically by predicting the future based on the motion history. Explicit reasoning about motion without history already reaches the performance of current stochastic models. The motion history further improves the results by allowing the model to predict consistent dynamics several frames into the future. Our model performs comparably to state-of-the-art models on generic video prediction datasets; however, it significantly outperforms them on two challenging real-world autonomous driving datasets with complex motion and dynamic backgrounds.
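As a generic illustration of the appearance-motion decomposition (a simplified stand-in, not the SLAMP architecture itself), the sketch below warps the previous frame with a predicted flow field and adds an appearance residual; the flow and residual tensors would come from learned, stochastic predictors.

import torch
import torch.nn.functional as F

def warp(frame, flow):
    # frame: (N, C, H, W); flow: (N, 2, H, W) in pixel units (x-displacement, y-displacement).
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2, H, W) pixel coordinates
    coords = base.unsqueeze(0) + flow                              # absolute sampling positions
    # grid_sample expects an (N, H, W, 2) grid of (x, y) coordinates normalized to [-1, 1].
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(frame, grid, align_corners=True)

def predict_next(prev_frame, flow, residual):
    # Motion component (warped previous frame) plus appearance component (residual correction).
    return warp(prev_frame, flow) + residual

next_frame = predict_next(torch.rand(1, 3, 64, 64), torch.zeros(1, 2, 64, 64), torch.zeros(1, 3, 64, 64))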
Submitted 5 August, 2021;
originally announced August 2021.
-
Absolute Parameters of Young Stars: PU Pup
Authors:
A. Erdem,
D. Surgit,
T. S. Banks,
B. Ozkardes,
E. Budding
Abstract:
We present combined photometric and spectroscopic analyses of the southern binary star PU Pup. High-resolution spectra of this system were taken at the University of Canterbury Mt. John Observatory in 2008 and again in 2014-15. We find the light contribution of the secondary component to be only $\sim$2\% of the total light of the system in optical wavelengths, resulting in a single-lined spectroscopic binary. Recent TESS data revealed grazing eclipses within the light minima, though the tidal distortion, examined also from HIPPARCOS data, remains the predominant light curve effect. Our model shows PU Pup to have the more massive primary relatively close to filling its Roche lobe. PU Pup is thus approaching the rare `fast phase' of interactive (Case B) evolution. Our adopted absolute parameters are as follows: $M_1$ = 4.10 ($\pm$0.20) M$_{\odot}$, $M_2$ = 0.65 ($\pm$0.05) M$_{\odot}$, $R_{1}$ = 6.60 ($\pm$0.30) R$_{\odot}$, $R_2$ = 0.90 ($\pm$0.10) R$_{\odot}$; $T_{1}$ = 11500 ($\pm$500) K, $T_{2}$ = 5000 ($\pm$350) K; photometric distance = 186 ($\pm$20) pc, age = 170 ($\pm$20) Myr. The less-massive secondary component is found to be significantly oversized and overluminous compared to standard Main Sequence models. We discuss this discrepancy with reference to heating from the reflection effect.
Submitted 9 July, 2021;
originally announced July 2021.
-
The ultra-hot-Jupiter KELT-16 b: Dynamical Evolution and Atmospheric Properties
Authors:
L. Mancini,
J. Southworth,
L. Naponiello,
O. Basturk,
D. Barbato,
F. Biagiotti,
I. Bruni,
L. Cabona,
G. D'Ago,
M. Damasso,
A. Erdem,
D. Evans,
Th. Henning,
O. Ozturk,
D. Ricci,
A. Sozzetti,
J. Tregloan-Reed,
S. Yalcinkayaz
Abstract:
We present broad-band photometry of 30 planetary transits of the ultra-hot Jupiter KELT-16b, using five medium-class telescopes. The transits were monitored through standard B, V, R, I filters and four were simultaneously observed from different places, for a total of 36 new light curves. We used these new photometric data and those from the TESS space telescope to review the main physical properties of the KELT-16 planetary system. Our results agree with previous measurements but are more precise. We estimated the mid-transit times for each of these transits and combined them with others from the literature to obtain 69 epochs, with a time baseline extending over more than four years, and searched for transit time variations. We found no evidence for a period change, suggesting a lower limit for the orbital decay timescale of 8 Myr, with a lower limit on the reduced tidal quality factor of $Q^{\prime}_{\star}>(1.9 \pm 0.8) \times 10^5$ with $95\%$ confidence. We built up an observational, low-resolution transmission spectrum of the planet, finding evidence of the presence of optical absorbers, although with a low significance. Using TESS data, we reconstructed the phase curve, finding that KELT-16b has a phase offset of $25.25 \pm 14.03$ $^{\circ}$E, and day- and night-side brightness temperatures of $3190 \pm 61$ K and $2668 \pm 56$ K, respectively. Finally, we compared the flux ratio of the planet over its star at the TESS and Spitzer wavelengths with theoretical emission spectra, finding evidence of a temperature inversion in the planet's atmosphere, the chemical composition of which is more likely oxygen-rich than carbon-rich.
Submitted 5 November, 2021; v1 submitted 3 May, 2021;
originally announced May 2021.
-
A Gated Fusion Network for Dynamic Saliency Prediction
Authors:
Aysun Kocak,
Erkut Erdem,
Aykut Erdem
Abstract:
Predicting saliency in videos is a challenging problem due to the complex modeling of interactions between spatial and temporal information, especially when the ever-changing, dynamic nature of videos is considered. Recently, researchers have proposed large-scale datasets and models that take advantage of deep learning as a way to understand what is important for video saliency. These approaches, however, learn to combine spatial and temporal features in a static manner and do not adapt much to changes in the video content. In this paper, we introduce the Gated Fusion Network for dynamic saliency (GFSalNet), the first deep saliency model capable of making predictions in a dynamic way via a gated fusion mechanism. Moreover, our model also exploits spatial and channel-wise attention within a multi-scale architecture that further allows for highly accurate predictions. We evaluate the proposed approach on a number of datasets, and our experimental analysis demonstrates that it outperforms or is highly competitive with the state of the art. Importantly, we show that it has good generalization ability and, moreover, exploits temporal information more effectively via its adaptive fusion scheme.
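A minimal Python sketch of a gated fusion of spatial and temporal feature maps, where a learned, content-dependent gate decides how to blend the two streams at every location; channel sizes are placeholders and this is not the GFSalNet implementation.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, spatial_feats, temporal_feats):   # both (N, C, H, W)
        g = self.gate(torch.cat([spatial_feats, temporal_feats], dim=1))
        # g near 1 favours static appearance cues; g near 0 favours motion cues,
        # so the blend adapts to the current video content instead of being fixed.
        return g * spatial_feats + (1.0 - g) * temporal_feats

fused = GatedFusion()(torch.rand(1, 256, 32, 32), torch.rand(1, 256, 32, 32))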
Submitted 15 February, 2021;
originally announced February 2021.
-
Object and Relation Centric Representations for Push Effect Prediction
Authors:
Ahmet E. Tekden,
Aykut Erdem,
Erkut Erdem,
Tamim Asfour,
Emre Ugur
Abstract:
Pushing is an essential non-prehensile manipulation skill used for tasks ranging from pre-grasp manipulation to scene rearrangement and reasoning about object relations in the scene, and pushing actions have thus been widely studied in robotics. The effective use of pushing actions often requires an understanding of the dynamics of the manipulated objects and adaptation to the discrepancies between prediction and reality. For this reason, effect prediction and parameter estimation with pushing actions have been heavily investigated in the literature. However, current approaches are limited because they either model systems with a fixed number of objects or use image-based representations whose outputs are not very interpretable and quickly accumulate errors. In this paper, we propose a graph neural network based framework for effect prediction and parameter estimation of pushing actions by modeling object relations based on contacts or articulations. Our framework is validated in both real and simulated environments containing differently shaped multi-part objects connected via different types of joints, as well as objects with different masses, and it outperforms image-based representations on physics prediction. Our approach enables the robot to predict and adapt to the effect of a pushing action as it observes the scene. It can also be used for tool manipulation with never-before-seen tools. Further, we demonstrate 6D effect prediction for the lever-up action in the context of robot-based hard-disk disassembly.
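A compact sketch of relation-centric message passing over an object graph, with objects as nodes and contacts or articulations as attributed edges; the layer below is a generic illustration with placeholder feature sizes, not the authors' exact network.

import torch
import torch.nn as nn

class PushGNNLayer(nn.Module):
    def __init__(self, node_dim=64, edge_dim=16):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * node_dim + edge_dim, node_dim), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(2 * node_dim, node_dim), nn.ReLU())

    def forward(self, x, edge_index, edge_attr):
        # x: (N, node_dim) object states; edge_index: (2, E) contact/joint edges; edge_attr: (E, edge_dim).
        src, dst = edge_index
        messages = self.edge_mlp(torch.cat([x[src], x[dst], edge_attr], dim=-1))
        agg = torch.zeros_like(x).index_add_(0, dst, messages)   # sum incoming messages per object
        return self.node_mlp(torch.cat([x, agg], dim=-1))        # updated object states

layer = PushGNNLayer()
x = torch.rand(4, 64)                               # four objects in the scene
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])   # directed contact edges src -> dst
edge_attr = torch.rand(3, 16)
x_new = layer(x, edge_index, edge_attr)             # a readout head on x_new would predict per-object motion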
Submitted 22 February, 2023; v1 submitted 3 February, 2021;
originally announced February 2021.
-
Physical parameters of close binary systems: VIII
Authors:
K. Gazeas,
S. Zola,
A. Liakos,
B. Zakrzewski,
S. M. Rucinski,
J. M. Kreiner,
W. Ogloza,
M. Drozdz,
D. Koziel-Wierzbowska,
G. Stachowski,
M. Siwak,
A. Baran,
D. Kjurkchieva,
D. Marchev,
A. Erdem,
S. Szalankiewicz
Abstract:
This paper presents the results of a combined spectroscopic and photometric study of 20 contact binary systems: HV Aqr, OO Aql, FI Boo, TX Cnc, OT Cnc, EE Cet, RW Com, KR Com, V401 Cyg, V345 Gem, AK Her, V502 Oph, V566 Oph, V2612 Oph, V1363 Ori, V351 Peg, V357 Peg, Y Sex, V1123 Tau and W UMa, which was conducted in the frame of the W UMa Project. Together with 51 systems already covered by the project and an additional 67 in the existing literature, these bring the total number of contact binaries with known combined spectroscopic and photometric solutions to 138. It was found that the mass, radius and luminosity of the components follow certain relations along the MS, and new empirical power relations are extracted. We found that 30 per cent of the systems in the current sample show extreme values in their parameters, expressed in their mass ratio or fill-out factor. This study shows that, among the contact binary systems studied, some have an extremely low mass ratio (q < 0.1) or an ultra-short orbital period (Porb < 0.25 d), and these are expected to show evidence of ongoing mass transfer. The evolutionary status of these components is discussed with the aid of correlation diagrams, and their physical and orbital parameters are compared to those in the entire sample of known contact binaries. The existence of very short orbital periods confirms the very slow nature of the merging process, which seems to explain why their components still exist as MS stars in contact configurations even after several Gyr of evolution.
Submitted 26 January, 2021;
originally announced January 2021.
-
Cross-lingual Visual Pre-training for Multimodal Machine Translation
Authors:
Ozan Caglayan,
Menekse Kuyu,
Mustafa Sercan Amac,
Pranava Madhyastha,
Erkut Erdem,
Aykut Erdem,
Lucia Specia
Abstract:
Pre-trained language models have been shown to improve performance in many natural language tasks substantially. Although the early focus of such models was single language pre-training, recent advances have resulted in cross-lingual and visual pre-training methods. In this paper, we combine these two approaches to learn visually-grounded cross-lingual representations. Specifically, we extend the translation language modelling (Lample and Conneau, 2019) with masked region classification and perform pre-training with three-way parallel vision & language corpora. We show that when fine-tuned for multimodal machine translation, these models obtain state-of-the-art performance. We also provide qualitative insights into the usefulness of the learned grounded representations.
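A minimal sketch of the combined objective, summing a translation-language-modelling loss over masked tokens with a masked-region-classification loss over masked visual regions; the hard-label cross-entropy for regions and all tensor shapes are simplifying assumptions, not the released training code.

import torch
import torch.nn.functional as F

def pretraining_loss(token_logits, token_targets, region_logits, region_targets, mrc_weight=1.0):
    # token_logits: (N_masked_tokens, vocab_size); token_targets: (N_masked_tokens,)
    # region_logits: (N_masked_regions, n_object_classes); region_targets: (N_masked_regions,)
    tlm_loss = F.cross_entropy(token_logits, token_targets)     # translation language modelling term
    mrc_loss = F.cross_entropy(region_logits, region_targets)   # masked region classification term
    return tlm_loss + mrc_weight * mrc_loss

loss = pretraining_loss(torch.randn(10, 30000), torch.randint(0, 30000, (10,)),
                        torch.randn(5, 1600), torch.randint(0, 1600, (5,)))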
Submitted 20 April, 2021; v1 submitted 25 January, 2021;
originally announced January 2021.
-
MSVD-Turkish: A Comprehensive Multimodal Dataset for Integrated Vision and Language Research in Turkish
Authors:
Begum Citamak,
Ozan Caglayan,
Menekse Kuyu,
Erkut Erdem,
Aykut Erdem,
Pranava Madhyastha,
Lucia Specia
Abstract:
Automatic generation of video descriptions in natural language, also called video captioning, aims to understand the visual content of the video and produce a natural language sentence depicting the objects and actions in the scene. This challenging integrated vision and language problem, however, has been predominantly addressed for English. The lack of data and the linguistic properties of other languages limit the success of existing approaches for such languages. In this paper, we target Turkish, a morphologically rich and agglutinative language that has very different properties compared to English. To do so, we create the first large-scale video captioning dataset for this language by carefully translating the English descriptions of the videos in the MSVD (Microsoft Research Video Description Corpus) dataset into Turkish. In addition to enabling research in video captioning in Turkish, the parallel English-Turkish descriptions also enable the study of the role of video context in (multimodal) machine translation. In our experiments, we build models for both video captioning and multimodal machine translation and investigate the effect of different word segmentation approaches and different neural architectures to better address the properties of Turkish. We hope that the MSVD-Turkish dataset and the results reported in this work will lead to better video captioning and multimodal machine translation models for Turkish and other morphologically rich and agglutinative languages.
Submitted 13 December, 2020;
originally announced December 2020.
-
CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions
Authors:
Tayfun Ates,
M. Samil Atesoglu,
Cagatay Yigit,
Ilker Kesen,
Mert Kobas,
Erkut Erdem,
Aykut Erdem,
Tilbe Goksun,
Deniz Yuret
Abstract:
Humans are able to perceive, understand and reason about causal events. Developing models with similar physical and causal understanding capabilities is a long-standing goal of artificial intelligence. As a step in this direction, we introduce CRAFT, a new video question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 58K video and question pairs that are generated from 10K videos from 20 different virtual environments, containing various objects in motion that interact with each other and the scene. Two question categories in CRAFT include previously studied descriptive and counterfactual questions. Additionally, inspired by the Force Dynamics Theory in cognitive linguistics, we introduce a new causal question category that involves understanding the causal interactions between objects through notions like cause, enable, and prevent. Our results show that even though the questions in CRAFT are easy for humans, the tested baseline models, including existing state-of-the-art methods, do not yet deal with the challenges posed in our benchmark.
Submitted 1 March, 2022; v1 submitted 8 December, 2020;
originally announced December 2020.
-
Burst Photography for Learning to Enhance Extremely Dark Images
Authors:
Ahmet Serdar Karadeniz,
Erkut Erdem,
Aykut Erdem
Abstract:
Capturing images under extremely low-light conditions poses significant challenges for the standard camera pipeline. Images become too dark and too noisy, which makes traditional enhancement techniques almost impossible to apply. Recently, learning-based approaches have shown very promising results for this task since they have substantially more expressive capabilities to allow for improved quality. Motivated by these studies, in this paper, we aim to leverage burst photography to boost the performance and obtain much sharper and more accurate RGB images from extremely dark raw images. The backbone of our proposed framework is a novel coarse-to-fine network architecture that generates high-quality outputs progressively. The coarse network predicts a low-resolution, denoised raw image, which is then fed to the fine network to recover fine-scale details and realistic textures. To further reduce the noise level and improve the color accuracy, we extend this network to a permutation invariant structure so that it takes a burst of low-light images as input and merges information from multiple images at the feature-level. Our experiments demonstrate that our approach leads to perceptually more pleasing results than the state-of-the-art methods by producing more detailed and considerably higher quality images.
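A minimal sketch of a permutation-invariant burst merge: each frame is encoded with shared weights and the features are max-pooled across the burst dimension, so the result does not depend on frame order; the tiny encoder and channel counts are placeholders, not the paper's network.

import torch
import torch.nn as nn

class BurstMerge(nn.Module):
    def __init__(self, in_ch=4, feat_ch=64):   # e.g. 4 channels for a packed raw Bayer frame
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )

    def forward(self, burst):                   # (N, B, C, H, W), B = burst size
        n, b, c, h, w = burst.shape
        feats = self.encoder(burst.view(n * b, c, h, w)).view(n, b, -1, h, w)
        merged, _ = feats.max(dim=1)            # order-independent pooling across the burst
        return merged                           # would be fed to the coarse-to-fine decoder

merged = BurstMerge()(torch.rand(1, 8, 4, 128, 128))   # a burst of eight dark raw frames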
Submitted 19 November, 2021; v1 submitted 17 June, 2020;
originally announced June 2020.
-
Modulating Bottom-Up and Top-Down Visual Processing via Language-Conditional Filters
Authors:
İlker Kesen,
Ozan Arkan Can,
Erkut Erdem,
Aykut Erdem,
Deniz Yuret
Abstract:
How to best integrate linguistic and perceptual processing in multi-modal tasks that involve language and vision is an important open problem. In this work, we argue that the common practice of using language in a top-down manner, to direct visual attention over high-level visual features, may not be optimal. We hypothesize that the use of language to also condition the bottom-up processing from pixels to high-level features can provide benefits to the overall performance. To support our claim, we propose a U-Net-based model and perform experiments on two language-vision dense-prediction tasks: referring expression segmentation and language-guided image colorization. We compare results where either one or both of the top-down and bottom-up visual branches are conditioned on language. Our experiments reveal that using language to control the filters for bottom-up visual processing in addition to top-down attention leads to better results on both tasks and achieves competitive performance. Our linguistic analysis suggests that bottom-up conditioning improves segmentation of objects especially when input text refers to low-level visual concepts. Code is available at https://github.com/ilkerkesen/bvpr.
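A rough sketch of language-conditioned bottom-up processing, in which convolution kernels applied to low-level visual features are generated from a sentence embedding; the per-sample loop and all sizes are simplifying assumptions, not the released U-Net code at the linked repository.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LangConditionedConv(nn.Module):
    def __init__(self, in_ch=64, out_ch=64, k=3, text_dim=512):
        super().__init__()
        self.k, self.in_ch, self.out_ch = k, in_ch, out_ch
        self.kernel_gen = nn.Linear(text_dim, out_ch * in_ch * k * k)   # kernels come from language

    def forward(self, visual_feats, text_emb):
        # visual_feats: (N, in_ch, H, W) low-level features; text_emb: (N, text_dim) sentence embeddings.
        outs = []
        for feats, emb in zip(visual_feats, text_emb):                  # per-sample dynamic kernels
            w = self.kernel_gen(emb).view(self.out_ch, self.in_ch, self.k, self.k)
            outs.append(F.conv2d(feats.unsqueeze(0), w, padding=self.k // 2))
        return torch.cat(outs, dim=0)

out = LangConditionedConv()(torch.rand(2, 64, 32, 32), torch.rand(2, 512))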
Submitted 23 June, 2022; v1 submitted 28 March, 2020;
originally announced March 2020.
-
Burst Denoising of Dark Images
Authors:
Ahmet Serdar Karadeniz,
Erkut Erdem,
Aykut Erdem
Abstract:
Capturing images under extremely low-light conditions poses significant challenges for the standard camera pipeline. Images become too dark and too noisy, which makes traditional image enhancement techniques almost impossible to apply. Very recently, researchers have shown promising results using learning based approaches. Motivated by these ideas, in this paper, we propose a deep learning framework for obtaining clean and colorful RGB images from extremely dark raw images. The backbone of our framework is a novel coarse-to-fine network architecture that generates high-quality outputs in a progressive manner. The coarse network predicts a low-resolution, denoised raw image, which is then fed to the fine network to recover fine-scale details and realistic textures. To further reduce noise and improve color accuracy, we extend this network to a permutation invariant structure so that it takes a burst of low-light images as input and merges information from multiple images at the feature-level. Our experiments demonstrate that the proposed approach leads to perceptually more pleasing results than state-of-the-art methods by producing much sharper and higher quality images.
Submitted 18 June, 2020; v1 submitted 17 March, 2020;
originally announced March 2020.