publications | Metod Jazbec

Preprints

arXiv

A Tale of Two Temperatures: Simple, Efficient, and Diverse Sampling from Diffusion Language Models

Theo X. Olausson*, Metod Jazbec*, Xi Wang, Armando Solar-Lezama, Christian A. Naesseth, Stephan Mandt, and Eric Nalisnick

2026

Abs HTML

Much work has been done on designing fast and accurate sampling for diffusion language models (dLLMs). However, these efforts have largely focused on the tradeoff between speed and quality of individual samples; how to additionally ensure diversity across samples remains less well understood. In this work, we show that diversity can be increased by using softened, tempered versions of familiar confidence-based remasking heuristics, retaining their computational benefits and offering simple implementations. We motivate this approach by introducing an idealized formal model of fork tokens and studying the impact of remasking on the expected entropy at the forks. Empirically, the proposed tempered heuristics close the exploration gap (pass@k) between existing confidence-based and autoregressive sampling, hence outperforming both when controlling for cost (pass@NFE). We further study how the increase in diversity translates to downstream post-training and test-time compute scaling. Overall, our findings demonstrate that simple, efficient, and diverse sampling from dLLMs is possible.

Conference Papers

ICMLOral

Learning Unmasking Policies for Diffusion Language Models

Metod Jazbec*, Theo X. Olausson*, Louis Béthune, Pierre Ablin, Michael Kirchhof, João Monteiro, Victor Turrisi, Jason Ramapuram, and Marco Cuturi

2026

Abs HTML Code

Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One critical design aspect of dLLMs is the sampling procedure that selects which tokens to unmask at each diffusion step. Indeed, recent work has found that heuristic strategies such as confidence thresholding improve both sample quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger block sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy based on a single-layer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive (block) generation, while outperforming them in the full-diffusion setting.
ICML

Controlling the Risk of Corrupted Contexts for Language Models via Early-Exiting

Andrea Wynn, Metod Jazbec, Charith Peris, Rinat Khaziev, Anqi Liu, Daniel Khashabi, and Eric Nalisnick

2026

Abs HTML

Large language models (LLMs) can be influenced by harmful or irrelevant context, which can significantly harm model performance on downstream tasks. This motivates principled designs in which LLM systems include built-in mechanisms to guard against such "garbage in, garbage out" scenarios. We propose a novel approach to limit the degree to which harmful context can degrade model performance. First, we define a baseline "safe" behavior for the model – the model’s performance given no context at all (zero-shot). Next, we apply distribution-free risk control (DFRC) to control the extent to which the user-provided context can decay performance below this safe zero-shot baseline. We achieve this by leveraging dynamic early exit prediction, ignoring later attention heads that attend the most to the unsafe inputs. Finally, we propose modifications to DFRC that allow it to both control risk for harmful inputs and leverage performance and efficiency gains on helpful inputs. We present both theoretical and empirical results across 9 tasks spanning in-context learning and open-ended question answering, showing that our approach can effectively control risk for harmful context and simultaneously achieve substantial computational efficiency gains with helpful context.
ICML

Efficient and Uncertainty-Aware Diffusion Framework for Offline-to-Online Reinforcement Learning

Ha Manh Bui, Metod Jazbec, Eric Nalisnick, and Anqi Liu

2026

Abs HTML

Offline-to-Online Reinforcement Learning (O2O-RL) leverages an offline, pre-trained policy to minimize costly online interactions. Although data-efficient, O2O-RL is susceptible to shifts between offline and online distributions. Existing work aims to mitigate the harm of this shift by finetuning the policy on trajectory data sampled from a diffusion model. Inspired by this line of work, we propose DUAL: an efficient Diffusion Uncertainty-Aware framework for offline-to-online reinforcement Learning. DUAL utilizes the prior knowledge of the diffusion model to distill a fast-sampling diffusion actor policy and transition model in the offline phase. DUAL also employs a Laplace approximation and distance transition-state-shift detection, thereby using uncertainty quantification to improve exploration versus exploitation in the online phase. We formally show that our actor loss with the Laplace approximation provides a proxy for a principled estimate of epistemic uncertainty. Empirically, DUAL improves the online expected return over O2O-RL baselines across multiple settings and environments.
NeurIPS

Monitoring Risks in Test-Time Adaptation

Mona Schirmer*, Metod Jazbec*, Christian A. Naesseth, and Eric Nalisnick

2025

Abs HTML Code

Encountering shifted data at test time is a ubiquitous challenge when deploying predictive models. Test-time adaptation (TTA) methods address this issue by continuously adapting a deployed model using only unlabeled test data. While TTA can extend the model’s lifespan, it is only a temporary solution. Eventually the model might degrade to the point that it must be taken offline and retrained. To detect such points of ultimate failure, we propose pairing TTA with risk monitoring frameworks that track predictive performance and raise alerts when predefined performance criteria are violated. Specifically, we extend existing monitoring tools based on sequential testing with confidence sequences to accommodate scenarios in which the model is updated at test time and no test labels are available to estimate the performance metrics of interest. Our extensions unlock the application of rigorous statistical risk monitoring to TTA, and we demonstrate the effectiveness of our proposed TTA monitoring framework across a representative set of datasets, distribution shift types, and TTA methods.
UAI

Generative Uncertainty in Diffusion Models

Metod Jazbec, Eliot Wong-Toi, Guoxuan Xia, Dan Zhang, Eric Nalisnick, and Stephan Mandt

2025

Abs HTML Code

Diffusion models have recently driven significant breakthroughs in generative modeling. While state-of-the-art models produce high-quality samples on average, individual samples can still be low quality. Detecting such samples without human inspection remains a challenging task. To address this, we propose a Bayesian framework for estimating generative uncertainty of synthetic samples. We outline how to make Bayesian inference practical for large, modern generative models and introduce a new semantic likelihood (evaluated in the latent space of a feature extractor) to address the challenges posed by high-dimensional sample spaces. Through our experiments, we demonstrate that the proposed generative uncertainty effectively identifies poor-quality samples and significantly outperforms existing uncertainty-based methods. Notably, our Bayesian framework can be applied post-hoc to any pretrained diffusion or flow matching model (via the Laplace approximation), and we propose simple yet effective techniques to minimize its computational overhead during sampling.
NeurIPS

Fast yet Safe: Early-Exiting with Risk Control

Metod Jazbec*, Alexander Timans*, Tin Hadži Veljković, Kaspar Sakmann, Dan Zhang, Christian A. Naesseth, and Eric Nalisnick

2024

Abs HTML Code

Scaling machine learning models significantly improves their performance. However, such gains come at the cost of inference being slow and resource-intensive. Early-exit neural networks (EENNs) offer a promising solution: they accelerate inference by allowing intermediate layers to exit and produce a prediction early. Yet a fundamental issue with EENNs is how to determine when to exit without severely degrading performance. In other words, when is it ’safe’ for an EENN to go ’fast’? To address this issue, we investigate how to adapt frameworks of risk control to EENNs. Risk control offers a distribution-free, post-hoc solution that tunes the EENN’s exiting mechanism so that exits only occur when the output is of sufficient quality. We empirically validate our insights on a range of vision and language tasks, demonstrating that risk control can produce substantial computational savings, all the while preserving user-specified performance goals.
UAI

Early-Exit Neural Networks with Nested Prediction Sets

Metod Jazbec, Patrick Forré, Stephan Mandt, Dan Zhang, and Eric Nalisnick

2024

Abs HTML Code

Early-exit neural networks (EENNs) enable adaptive and efficient inference by providing predictions at multiple stages during the forward pass. In safety-critical applications, these predictions are meaningful only when accompanied by reliable uncertainty estimates. A popular method for quantifying the uncertainty of predictive models is the use of prediction sets. However, we demonstrate that standard techniques such as conformal prediction and Bayesian credible sets are not directly applicable to EENNs. They tend to generate non-nested sets at different exits, meaning labels deemed improbable at one exit may reappear in the prediction sets of subsequent exits. To address this issue, we investigate anytime-valid confidence sequences (AVCSs), an extension of traditional confidence intervals tailored for data-streaming scenarios. These sequences are inherently nested and thus well-suited for the sequential prediction task in EENNs. We explore the theoretical and practical challenges of using AVCSs in EENNs and show that they indeed yield nested sets across exits. Thus, our work presents a promising approach towards fast, yet still safe, predictive modeling.
NeurIPS

Towards Anytime Classification in Early-Exit Architectures by Enforcing Conditional Monotonicity

Metod Jazbec, James Urquhart Allingham, Dan Zhang, and Eric Nalisnick

2023

Abs HTML Code

Modern predictive models are often deployed to environments in which computational budgets are dynamic. Anytime algorithms are well-suited to such environments as, at any point during computation, they can output a prediction whose quality is a function of computation time. Early-exit neural networks have garnered attention in the context of anytime computation due to their capability to provide intermediate predictions at various stages throughout the network. However, we demonstrate that current early-exit networks are not directly applicable to anytime settings, as the quality of predictions for individual data points is not guaranteed to improve with longer computation. To address this shortcoming, we propose an elegant post-hoc modification, based on the Product-of-Experts, that encourages an early-exit network to become gradually confident. This gives our deep models the property of conditional monotonicity in the prediction quality – an essential stepping stone towards truly anytime predictive modeling using early-exit architectures. Our empirical results on standard image-classification tasks demonstrate that such behaviors can be achieved while preserving competitive accuracy on average.
AISTATS

Scalable gaussian process variational autoencoders

Metod Jazbec, Matt Ashman, Vincent Fortuin, Michael Pearce, Stephan Mandt, and Gunnar Rätsch

2021

Abs HTML Code

Conventional variational autoencoders fail in modeling correlations between data points due to their use of factorized priors. Amortized Gaussian process inference through GPVAEs has led to significant improvements in this regard, but is still inhibited by the intrinsic complexity of exact GP inference. We improve the scalability of these methods through principled sparse inference approaches. We propose a new scalable GPVAE model that outperforms existing approaches in terms of runtime and memory footprint, is easy to implement, and allows for joint end-to-end optimization of all components.

Workshop Papers

DeLTa@ICLR

Dynamic Mixture-of-Experts for Visual Autoregressive Model

Jort Vincenti, Metod Jazbec, and Guoxuan Xia

2026

Abs HTML

Visual Autoregressive Models (VAR) offer efficient and high-quality image generation but suffer from computational redundancy due to repeated Transformer calls at increasing resolutions. We introduce a dynamic Mixture-of-Experts router integrated into VAR. The new architecture allows to trade compute for quality through scale-aware thresholding. This thresholding strategy balances expert selection based on token complexity and resolution, without requiring additional training. As a result, we achieve 20% fewer FLOPs, 11% faster inference and match the image quality achieved by the dense baseline.
SPIGM@NeurIPS

Failure Prediction Is a Better Performance Proxy for Early-Exit Networks Than Calibration

Piotr Kubaty, Filip Szatkowski, Metod Jazbec, and Bartosz Wójcik

2025

Abs HTML

Early-exit models accelerate inference by attaching internal classifiers to intermediate layers of the network, allowing computation to halt once a prediction meets a predefined exit criterion. Most early-exit methods rely on confidence-based exit strategies, which has motivated prior work to calibrate intermediate classifiers in pursuit of improved performance-efficiency trade-offs. In this paper, we argue that calibration metrics can be misleading indicators of multi-exit model performance. Specifically, we present empirical evidence showing that miscalibrated networks can outperform calibrated ones. As an alternative, we propose using failure prediction as a more informative proxy for early-exit model performance. Unlike calibration, failure prediction captures changes in sample rankings and correlates strongly with efficiency gains, offering a more reliable framework for designing and evaluating early-exit models.
AFM@NeurIPS

DuoDiff: Accelerating Diffusion Models with a Dual-Backbone Approach

Daniel Gallo Fernandez, Razvan-Andrei Matisan, Alejandro Monroy Munoz, Ana Maria Vasilcoiu, Janusz Partyka, Tin Hadži Veljković, and Metod Jazbec

2024

Abs HTML Code

Diffusion models have achieved unprecedented performance in image generation, yet they suffer from slow inference due to their iterative sampling process. To address this, early-exiting has recently been proposed, where the depth of the denoising network is made adaptive based on the (estimated) difficulty of each sampling step. Here, we discover an interesting "phase transition" in the sampling process of current adaptive diffusion models: the denoising network consistently exits early during the initial sampling steps, until it suddenly switches to utilizing the full network. Based on this, we propose accelerating generation by employing a shallower denoising network in the initial sampling steps and a deeper network in the later steps. We demonstrate empirically that our dual-backbone approach, DuoDiff, outperforms existing early-exit diffusion methods in both inference speed and generation quality. Importantly, DuoDiff is easy to implement and complementary to existing approaches for accelerating diffusion.
ENLSP@NeurIPS

Dynamic Vocabulary Pruning in Early-Exit LLMs

Karim Abdel Sadek, Matteo Nulli, Joan Velja, Jort Vincenti, and Metod Jazbec

2024

Abs HTML Code

Increasing the size of large language models (LLMs) has been shown to lead to better performance. However, this comes at the cost of slower and more expensive inference. Early-exiting is a promising approach for improving the efficiency of LLM inference by enabling next token prediction at intermediate layers. Yet, the large vocabulary size in modern LLMs makes the confidence estimation required for exit decisions computationally expensive, diminishing the efficiency gains. To address this, we propose dynamically pruning the vocabulary at test time for each token. Specifically, the vocabulary is pruned at one of the initial layers, and the smaller vocabulary is then used throughout the rest of the forward pass. Our experiments demonstrate that such post-hoc dynamic vocabulary pruning improves the efficiency of confidence estimation in early-exit LLMs while maintaining competitive performance.
FITML@NeurIPS

On Efficient Distillation from LLMs to SLMs

Metod Jazbec, Menglin Xia, Ankur Mallick, Daniel Madrigal, Dongge Han, Samuel Kessler, and Victor Rühle

2024

Abs HTML Code

Finetuning small language models (SLMs) on data generated by large language models (LLMs), a form of knowledge distillation, has recently been demonstrated to lead to significantly enhanced capabilities of small models across various domains (e.g., mathematical reasoning). However, current approaches typically require synthesizing a large number of new examples (>100\textrmK), which increases the resources and training time needed for finetuning. To address this issue, we investigate principles for making the distillation process more efficient by reducing the amount of synthetic data required. Specifically, we explore \emph(i) incorporating SLM’s feedback into the LLM’s data generation process and \emph(ii) including LLM’s rationales (i.e., step-by-step solutions) in the distilled data. In our experiments using the Mistral7B model as the SLM on math reasoning tasks (GSM8K, MATH), we find that both feedback and rationales can help make finetuning with distillation more efficient (by requiring up to \sim2\textx less synthetic data).
AABI

Factorized Gaussian process variational autoencoders

Metod Jazbec, Michael Pearce, and Vincent Fortuin

2020

Abs arXiv Code

Variational autoencoders often assume isotropic Gaussian priors and mean-field posteriors, hence do not exploit structure in scenarios where we may expect similarity or consistency across latent variables. Gaussian process variational autoencoders alleviate this problem through the use of a latent Gaussian process, but lead to a cubic inference time complexity. We propose a more scalable extension of these models by leveraging the independence of the auxiliary features, which is present in many datasets. Our model factorizes the latent kernel across these features in different dimensions, leading to a significant speed-up (in theory and practice), while empirically performing comparably to existing non-scalable approaches. Moreover, our approach allows for additional modeling of global latent information and for more general extrapolation to unseen input combinations.

Journal Papers

RS Open Science

On the impact of publicly available news and information transfer to financial markets

Metod Jazbec, Barna Pàsztor, Felix Faltings, Nino Antulov-Fantulin, and Petter N Kolm

2021

Abs HTML Code

We quantify the propagation and absorption of large-scale publicly available news articles from the World Wide Web to financial markets. To extract publicly available information, we use the news archives from the Common Crawl, a non-profit organization that crawls a large part of the web. We develop a processing pipeline to identify news articles associated with the constituent companies in the S&P 500 index, an equity market index that measures the stock performance of US companies. Using machine learning techniques, we extract sentiment scores from the Common Crawl News data and employ tools from information theory to quantify the information transfer from public news articles to the US stock market. Furthermore, we analyse and quantify the economic significance of the news-based information with a simple sentiment-based portfolio trading strategy. Our findings provide support for that information in publicly available news on the World Wide Web has a statistically and economically significant impact on events in financial markets.