Reasons and Solutions for the Decline in Model Performance after Editing
Huang, Xiusheng, Liu, Jiaxiang, Wang, Yequan, Liu, Kang
Knowledge editing technology has received widespread attention for low-cost updates of incorrect or outdated knowledge in large-scale language models. However, recent research has found that edited models often exhibit varying degrees of performance degradation. The reasons behind this phenomenon and potential solutions have not yet been provided. In order to investigate the reasons for the performance decline of the edited model and optimize the editing method, this work explores the underlying reasons from both data and model perspectives. Specifically, 1) from a data perspective, to clarify the impact of data on the performance of editing models, this paper first constructs a Multi-Question Dataset (MQD) to evaluate the impact of different types of editing data on model performance. The performance of the editing model is mainly affected by the diversity of editing targets and sequence length, as determined through experiments. 2) From a model perspective, this article explores the factors that affect the performance of editing models. The results indicate a strong correlation between the L1-norm of the editing model layer and the editing accuracy, and clarify that this is an important factor leading to the bottleneck of editing performance. Finally, in order to improve the performance of the editing model, this paper further proposes a Dump for Sequence (D4S) method, which successfully overcomes the previous editing bottleneck by reducing the L1-norm of the editing layer, allowing users to perform multiple effective edits and minimizing model damage. Our code is available at https://github.com/nlpkeg/D4S.
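As a rough illustration of the quantity mentioned above, the following sketch computes an entrywise L1-norm of a small weight matrix and tracks how it grows as edit updates accumulate. The matrix values, the helper names `l1_norm` and `apply_edit`, and the entrywise definition of the norm are assumptions made for illustration; this is not the D4S method itself.

```python
def l1_norm(weights):
    """Entrywise L1-norm: sum of absolute values of all entries."""
    return sum(abs(w) for row in weights for w in row)


def apply_edit(weights, delta):
    """Return a new matrix with the edit update added entrywise."""
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weights, delta)]


# Hypothetical 2x2 layer and a hypothetical per-edit update.
layer = [[0.5, -0.25], [0.0, 1.0]]
update = [[0.25, -0.25], [0.5, 0.5]]

norms = [l1_norm(layer)]
for _ in range(3):  # three sequential edits applied to the same layer
    layer = apply_edit(layer, update)
    norms.append(l1_norm(layer))

print(norms)  # [1.75, 3.25, 4.75, 6.25]
```

In this toy setting every edit pushes the norm upward, which is the kind of growth the abstract associates with degraded editing performance.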
LLM Pretraining with Continuous Concepts
Jihoon Tack, Jack Lanchantin, Jane Dwivedi-Yu, Andrew Cohen, Ilia Kulikov, Janice Lan, Shibo Hao, Yuandong Tian, Jason E Weston, Xian Li
Next token prediction has been the standard training objective used in large language model pretraining. Representations are learned as a result of optimizing for token-level perplexity. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts. Specifically, CoCoMix predicts "continuous concepts" learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving with token hidden representations. Through experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, we show that CoCoMix is more sample efficient and consistently outperforms standard next token prediction and knowledge distillation. We find that combining both concept learning and interleaving in an end-to-end framework is critical to performance gains. Furthermore, CoCoMix enhances interpretability and steerability by allowing direct inspection and modification of the predicted concept, offering a transparent way to guide the model's internal reasoning process.
VIRTUE: Visual-Interactive Text-Image Universal Embedder
Wei-Yao Wang, Kazuya Tateishi, Qiyu Wu, Shusuke Takahashi, Yuki Mitsufuji
Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities to specify regions of interest from users (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions would not only unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel **V**isual-**I**nte**R**active **T**ext-Image **U**niversal **E**mbedder (**VIRTUE**) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale **S**egmentation-and-Scene **Ca**ption **R**etrieval (**SCaR**) benchmark comprising 1M samples that aims to retrieve the text caption by jointly considering the entity with a specific object and image scene. VIRTUE consistently achieves state-of-the-art performance with significant improvements across 36 universal MMEB (**3.1\%–8.5\%**) and five visual-interactive SCaR (**15.2\%–20.3\%**) tasks. The code, models, and benchmarks are available at https://github.com/sony/virtue.
What Matters When Repurposing Diffusion Models for General Dense Perception Tasks?
Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, Chunhua Shen
Extensive pre-training with large data is indispensable for downstream geometry and semantic visual perception tasks. Thanks to large-scale text-to-image (T2I) pretraining, recent works show promising results by simply fine-tuning T2I diffusion models for a few dense perception tasks. However, several crucial design decisions in this process still lack comprehensive justification, encompassing the necessity of the multi-step diffusion mechanism, training strategy, inference ensemble strategy, and fine-tuning data quality. In this work, we conduct a thorough investigation into critical factors that affect transfer efficiency and performance when using diffusion priors. Our key findings are: 1) High-quality fine-tuning data is paramount for both semantic and geometry perception tasks. 2) As a special case of the diffusion scheduler by setting its hyper-parameters, the multi-step generation can be simplified to a one-step fine-tuning paradigm without any loss of performance, while significantly speeding up inference. 3) Apart from fine-tuning the diffusion model with only latent space supervision, task-specific supervision can be beneficial to enhance fine-grained details. These observations culminate in the development of GenPercept, an effective deterministic one-step fine-tuning paradigm tailored for dense visual perception tasks exploiting diffusion priors. Different from the previous multi-step methods, our paradigm offers a much faster inference speed, and can be seamlessly integrated with customized perception decoders and loss functions for task-specific supervision, which can be critical for improving the fine-grained details of predictions. Comprehensive experiments on a diverse set of dense visual perceptual tasks, including monocular depth estimation, surface normal estimation, image segmentation, and matting, are performed to demonstrate the remarkable adaptability and effectiveness of our proposed method. Code: https://github.com/aim-uofa/GenPercept
Rethinking Radiology Report Generation: From Narrative Flow to Topic-Guided Findings
Sheng Cheng, Devika Subramanian
Vision-Language Models (VLMs) for radiology report generation are typically trained to mimic the narrative flow of human experts. However, we identify a potential limitation in this conventional paradigm. We hypothesize that optimizing for narrative coherence encourages models to rely on linguistic priors and inter-sentence correlations, which can weaken their grounding in direct visual evidence and lead to factual inaccuracies. To investigate this, we design a controlled experiment demonstrating that as textual context increases, a model's reliance on the input image systematically decays. We propose LLaVA-TA (Topic-guided and Anatomy-aware), a new fine-tuning framework that directly addresses this challenge by re-engineering the generation process. Instead of producing a linear narrative, LLaVA-TA decomposes the report into a set of independent, clinically-relevant topics. By training the model to generate a discrete finding for each topic conditioned on both the full image and its corresponding anatomical region, we reduce the model's reliance on narrative flow and enforce stricter visual grounding. Our experiments show that LLaVA-TA sets a new state of the art on the MIMIC-CXR dataset, significantly improving clinical accuracy on metrics like RadGraph F1 (from 29.4 to 44.0) and CheXpert F1-14 (from 39.5 to 71.5) over strong baselines. Our work demonstrates that dismantling a report's narrative structure to enforce independent, visually-grounded observations is a crucial and effective step toward building more accurate and reliable medical VLMs.
A Practitioner's Guide to Real-World Continual Multimodal Pretraining
Udandarao, Vishaal, Roth, Karsten, Dziadzio, Sebastian, Prabhu, Ameya, Cherti, Mehdi, Vinyals, Oriol, Henaff, Olivier, Albanie, Samuel, Akata, Zeynep, Bethge, Matthias
Multimodal foundation models serve numerous applications at the intersection of vision and language. Still, despite being pretrained on extensive data, they become outdated over time. To keep models updated, research into continual pretraining mainly explores scenarios with either (1) infrequent, indiscriminate updates on large-scale new data, or (2) frequent, sample-level updates. However, practical model deployment often operates in the gap between these two limit cases, as real-world applications demand adaptation to specific subdomains, tasks or concepts --- spread over the entire, varying life cycle of a model. In this work, we complement current perspectives on continual pretraining through a research test bed and offer comprehensive guidance for effective continual model updates in such scenarios. We first introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements, constructed over 63 datasets with diverse visual and semantic coverage. Using FoMo-in-Flux, we explore the complex landscape of practical continual pretraining through multiple perspectives: (1) data mixtures and stream orderings that emulate real-world deployment settings, (2) methods ranging from simple fine-tuning and traditional continual learning strategies to parameter-efficient updates and model merging, (3) meta-learning-rate schedules and mechanistic design choices, and (4) model and compute scaling. Together, our insights provide a practitioner's guide to continual multimodal pretraining for real-world deployment. Benchmark and code are provided here: https://github.com/ExplainableML/fomoinflux.
Rethinking Self-Distillation: Label Averaging and Enhanced Soft Label Refinement with Partial Labels
Hyeonsu Jeong, Hye Won Chung
We investigate the mechanisms of self-distillation in multi-class classification, particularly in the context of linear probing with fixed feature extractors where traditional feature learning explanations do not apply. Our theoretical analysis reveals that multi-round self-distillation effectively performs label averaging among instances with high feature correlations, governed by the eigenvectors of the Gram matrix derived from input features. This process leads to clustered predictions and improved generalization, mitigating the impact of label noise by reducing the model's reliance on potentially corrupted labels. We establish conditions under which multi-round self-distillation achieves 100\% population accuracy despite label noise. Furthermore, we introduce a novel, efficient single-round self-distillation method using refined partial labels from the teacher's top two softmax outputs, referred to as the PLL student model. This approach replicates the benefits of multi-round distillation in a single round, achieving comparable or superior performance--especially in high-noise scenarios--while significantly reducing computational cost.
Flat Reward in Policy Parameter Space Implies Robust Reinforcement Learning
HyunKyu Lee, Sung Whan Yoon
Investigating flat minima on loss surfaces in parameter space is well-documented in the supervised learning context, highlighting its advantages for model generalization. However, limited attention has been paid to the reinforcement learning (RL) context, where the impact of flatter reward landscapes in policy parameter space remains largely unexplored. Beyond merely extrapolating from supervised learning, which suggests a link between flat reward landscapes and enhanced generalization, we aim to formally connect the flatness of the reward surface to the robustness of RL models. In policy models where a deep neural network determines actions, flatter reward landscapes in response to parameter perturbations lead to consistent rewards even when actions are perturbed. Moreover, robustness to action perturbations further enhances robustness against other variations, such as changes in state transition probabilities and reward functions. We extensively simulate various RL environments, confirming the consistent benefits of flatter reward landscapes in enhancing the robustness of RL under diverse conditions, including action selection, transition dynamics, and reward functions. The code for these experiments is available at https://github.com/HK-05/flatreward-RRL.
Contrastive Learning with Adversarial Examples
Ho, Chih-Hui, Vasconcelos, Nuno
Contrastive learning (CL) is a popular technique for self-supervised learning (SSL) of visual representations. It uses pairs of augmentations of unlabeled training examples to define a classification task for pretext learning of a deep embedding. Despite extensive work on augmentation procedures, prior work does not address the selection of challenging negative pairs, as images within a sampled batch are treated independently. This paper addresses the problem by introducing a new family of adversarial examples for contrastive learning and using these examples to define a new adversarial training algorithm for SSL, denoted as CLAE. When compared to standard CL, the use of adversarial examples creates more challenging positive pairs and adversarial training produces harder negative pairs by accounting for all images in a batch during the optimization. CLAE is compatible with many CL methods in the literature. Experiments show that it improves the performance of several existing CL baselines on multiple datasets.
Understanding the Implicit Biases of Design Choices for Time Series Foundation Models
Annan Yu, Danielle Maddix, Boran Han, Xiyuan Zhang, Abdul Fatir Ansari, Oleksandr Shchur, Christos Faloutsos, Andrew Gordon Wilson, Michael W Mahoney, Bernie Wang
Time series foundation models (TSFMs) are a potential class of powerful, general-purpose tools for forecasting and related temporal tasks, but their behavior is strongly shaped by subtle inductive biases in their design. Rather than developing a new model and claiming that it is better than existing TSFMs, e.g., by winning on existing benchmarks, our objective is to understand how the various "knobs" of the training process affect model quality. Using a mix of theory and controlled empirical evaluation, we identify and show how various design choices (e.g., patch size, embedding choice, training objective, etc.) lead to implicit biases in fundamental model properties (e.g., temporal behavior, geometric structure, how aggressively or not the model regresses to the mean, etc.), and how these biases can be intuitive or counterintuitive, depending on properties of the model and data. We illustrate in a case study on outlier handling how multiple biases interact in complex ways.
One-Sided Unsupervised Domain Mapping
Benaim, Sagie, Wolf, Lior
In unsupervised domain mapping, the learner is given two unmatched datasets $A$ and $B$. The goal is to learn a mapping $G_{AB}$ that translates a sample in $A$ to the analog sample in $B$. Recent approaches have shown that when learning simultaneously both $G_{AB}$ and the inverse mapping $G_{BA}$, convincing mappings are obtained. In this work, we present a method of learning $G_{AB}$ without learning $G_{BA}$. This is done by learning a mapping that maintains the distance between a pair of samples. Moreover, good mappings are obtained, even by maintaining the distance between different parts of the same sample before and after mapping. We present experimental results showing that the new method not only allows for one-sided mapping learning, but also leads to preferable numerical results over the existing circularity-based constraint. Our entire code is made publicly available at https://github.com/sagiebenaim/DistanceGAN.
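The distance-preservation idea can be sketched as a penalty on how much pairwise distances change under the mapping. The snippet below is a minimal illustration under stated assumptions: the L1 distance, the unnormalized mean absolute difference, and the names `l1_distance` and `distance_preservation_loss` are all hypothetical choices, not the loss used in the released code.

```python
def l1_distance(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_preservation_loss(samples, mapping):
    """Mean absolute change in pairwise L1 distance under `mapping`."""
    mapped = [mapping(s) for s in samples]
    total, pairs = 0.0, 0
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            before = l1_distance(samples[i], samples[j])
            after = l1_distance(mapped[i], mapped[j])
            total += abs(before - after)
            pairs += 1
    return total / pairs


# Hypothetical source-domain samples and two toy mappings.
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]


def shift(s):
    return [x + 3.0 for x in s]  # translation keeps all pairwise distances


def scale(s):
    return [2.0 * x for x in s]  # scaling doubles all pairwise distances


print(distance_preservation_loss(samples, shift))  # 0.0
print(distance_preservation_loss(samples, scale))  # 2.0
```

A mapping that keeps pairwise distances intact incurs zero penalty, so the term can be minimized using samples from the source domain only, without training an inverse mapping.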
Do Deep Neural Network Solutions Form a Star Domain?
Ankit Sonthalia, Alexander Rubinstein, Ehsan Abbasnejad, Seong Joon Oh
It has recently been conjectured that neural network solution sets reachable via stochastic gradient descent (SGD) are convex, considering permutation invariances. This means that a linear path can connect two independent solutions with low loss, given the weights of one of the models are appropriately permuted. However, current methods to test this theory often require very wide networks to succeed. In this work, we conjecture that more generally, the SGD solution set is a star domain that contains a star model that is linearly connected to all the other solutions via paths with low loss values, modulo permutations. We propose the Starlight algorithm that finds a star model of a given learning task. We validate our claim by showing that this star model is linearly connected with other independently found solutions. As an additional benefit of our study, we demonstrate better uncertainty estimates on Bayesian Model Averaging over the obtained star domain. Further, we demonstrate star models as potential substitutes for model ensembles.
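To make "linearly connected via paths with low loss" concrete, the sketch below measures a loss barrier along the straight line between two weight vectors. The toy loss, the `loss_barrier` helper, and the grid of interpolation points are assumptions for illustration; the permutation alignment step and the Starlight algorithm are not shown.

```python
def interpolate(w_a, w_b, t):
    """Point at fraction t on the segment from w_a to w_b."""
    return [(1 - t) * a + t * b for a, b in zip(w_a, w_b)]


def loss_barrier(loss, w_a, w_b, steps=10):
    """Largest excess of the loss on the segment over the linearly
    interpolated endpoint losses, evaluated on a uniform grid."""
    loss_a, loss_b = loss(w_a), loss(w_b)
    worst = 0.0
    for k in range(steps + 1):
        t = k / steps
        baseline = (1 - t) * loss_a + t * loss_b
        worst = max(worst, loss(interpolate(w_a, w_b, t)) - baseline)
    return worst


def toy_loss(w):
    """Hypothetical non-convex loss with minima at (1, 0) and (-1, 0)."""
    return (w[0] ** 2 - 1.0) ** 2 + w[1] ** 2


barrier_between_minima = loss_barrier(toy_loss, [1.0, 0.0], [-1.0, 0.0])
barrier_same_point = loss_barrier(toy_loss, [1.0, 0.0], [1.0, 0.0])
print(barrier_between_minima, barrier_same_point)
```

In this toy case the two minima are separated by a barrier of about 1 at the midpoint, so they are not linearly connected; a star model in the sense of the abstract would be one whose barrier to every other solution stays near zero.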
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components like diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: first, the unified vision tower aligns discrete visual tokens with textual inputs during pretraining, which enhances visual perception; second, autoregressive image generation can achieve quality similar to that of diffusion models when trained on a high-quality dataset. This allows VILA-U to perform comparably to more complex models using a fully token-based autoregressive framework.
Advantage Alignment Algorithms
Juan Duque, Milad Aghajohari, Timotheus Cooijmans, Razvan Ciuca, Tianyu Zhang, Gauthier Gidel, Aaron Courville
Artificially intelligent agents are increasingly being integrated into human decision-making: from large language model (LLM) assistants to autonomous vehicles. These systems often optimize their individual objective, leading to conflicts, particularly in general-sum games where naive reinforcement learning agents empirically converge to Pareto-suboptimal Nash equilibria. To address this issue, opponent shaping has emerged as a paradigm for finding socially beneficial equilibria in general-sum games. In this work, we introduce Advantage Alignment, a family of algorithms derived from first principles that perform opponent shaping efficiently and intuitively. We achieve this by aligning the advantages of interacting agents, increasing the probability of mutually beneficial actions when their interaction has been positive. We prove that existing opponent shaping methods implicitly perform Advantage Alignment. Compared to these methods, Advantage Alignment simplifies the mathematical formulation of opponent shaping, reduces the computational burden and extends to continuous action domains. We demonstrate the effectiveness of our algorithms across a range of social dilemmas, achieving state-of-the-art cooperation and robustness against exploitation.
Shapley-Guided Utility Learning for Effective Graph Inference Data Valuation
Hongliang Chi, Qiong Wu, Zhengyi Zhou, Yao Ma
Graph Neural Networks (GNNs) have demonstrated remarkable performance in various graph-based machine learning tasks, yet evaluating the importance of neighbors of testing nodes remains largely unexplored due to the challenge of assessing data importance without test labels. To address this gap, we propose Shapley-Guided Utility Learning (SGUL), a novel framework for graph inference data valuation. SGUL innovatively combines transferable data-specific and model-specific features to approximate test accuracy without relying on ground truth labels. By incorporating Shapley values as a preprocessing step and using feature Shapley values as input, our method enables direct optimization of Shapley value prediction while reducing computational demands. SGUL overcomes key limitations of existing methods, including poor generalization to unseen test-time structures and indirect optimization. Experiments on diverse graph datasets demonstrate that SGUL consistently outperforms existing baselines in both inductive and transductive settings. SGUL offers an effective, efficient, and interpretable approach for quantifying the value of test-time neighbors.
PureGen: Universal Data Purification for Train-Time Poison Defense via Generative Model Dynamics
Pooladzandi, Omead, Bhat, Sunay, Jiang, Jeffrey, Branch, Alexander, Pottie, Gregory
Train-time data poisoning attacks threaten machine learning models by introducing adversarial examples during training, leading to misclassification. Current defense methods often reduce generalization performance, are attack-specific, and impose significant training overhead. To address this, we introduce a set of universal data purification methods using a stochastic transform realized via iterative Langevin dynamics of Energy-Based Models (EBMs), Denoising Diffusion Probabilistic Models (DDPMs), or both. These approaches purify poisoned data with minimal impact on classifier generalization. Our specially trained EBMs and DDPMs provide state-of-the-art defense against various attacks (including Narcissus, Bullseye Polytope, Gradient Matching) on CIFAR-10, Tiny-ImageNet, and CINIC-10, without needing attack or classifier-specific information. We discuss performance trade-offs and show that our methods remain highly effective even with poisoned or distributionally shifted generative model training data.
SimXRD-4M: Big Simulated X-ray Diffraction Data and Crystal Symmetry Classification Benchmark
Bin Cao, Yang Liu, Zinan Zheng, Ruifeng Tan, Jia Li, Tong-Yi Zhang
Powder X-ray diffraction (XRD) patterns are highly effective for crystal identification and play a pivotal role in materials discovery. While machine learning (ML) has advanced the analysis of powder XRD patterns, progress has been constrained by the limited availability of training data and established benchmarks. To address this, we introduce SimXRD, the largest open-source simulated XRD pattern dataset to date, aimed at accelerating the development of crystallographic informatics. We developed a novel XRD simulation method that incorporates comprehensive physical interactions, resulting in a high-fidelity database. SimXRD comprises 4,065,346 simulated powder XRD patterns, representing 119,569 unique crystal structures under 33 simulated conditions that reflect real-world variations. We benchmark 21 sequence models in both in-library and out-of-library scenarios and analyze the impact of class imbalance in long-tailed crystal label distributions. Remarkably, we find that: (1) current neural networks struggle with classifying low-frequency crystals, particularly in out-of-library situations; (2) models trained on SimXRD can generalize to real experimental data.
Mixture of In-Context Prompters for Tabular PFNs
Derek Xu, Olcay Cirit, Reza Asadi, Yizhou Sun, Wei Wang
Recent benchmarks find In-Context Learning (ICL) outperforms both deep learning and tree-based algorithms on small tabular datasets. However, on larger datasets, ICL for tabular learning suffers in both efficiency and effectiveness. In terms of efficiency, transformers incur linear space and quadratic time complexity w.r.t. context size. In terms of effectiveness, contexts at inference encounter distribution shift compared to contexts from pretraining. We propose MixturePFN, which extends Sparse Mixture of Experts to the state-of-the-art ICL for tabular learning model. Specifically, MixturePFN finetunes a specialized ICL expert on each cluster of tabular data and routes new test samples to appropriate experts at inference. MixturePFN supports constant-size contexts by splitting large training datasets into more manageable clusters. MixturePFN addresses distribution shift by finetuning an expert on each training dataset cluster via bootstrapping. Extensive experimental results show MixturePFN outperforms 19 baselines both in mean rank and as the Condorcet winner across 36 diverse tabular datasets under both accuracy and F1 score with statistical significance.
Online Algorithms with Uncertainty-Quantified Predictions
Bo Sun, Jerry Huang, Nicolas Christianson, Mohammad Hajiesmaili, Adam Wierman, Raouf Boutaba
The burgeoning field of algorithms with predictions studies the problem of using possibly imperfect machine learning predictions to improve online algorithm performance. While nearly all existing algorithms in this framework make no assumptions on prediction quality, a number of methods providing uncertainty quantification (UQ) on machine learning models have been developed in recent years, which could enable additional information about prediction quality at decision time. In this work, we investigate the problem of optimally utilizing uncertainty-quantified predictions in the design of online algorithms. In particular, we study two classic online problems, ski rental and online search, where the decision-maker is provided predictions augmented with UQ describing the likelihood of the ground truth falling within a particular range of values. We demonstrate that non-trivial modifications to algorithm design are needed to fully leverage the UQ predictions. Moreover, we consider how to utilize more general forms of UQ, proposing an online learning framework that learns to exploit UQ to make decisions in multi-instance settings.
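For readers unfamiliar with ski rental, the sketch below shows the classic prediction-free break-even strategy that serves as the usual baseline: rent until the rental spend would match the purchase price, then buy. This is the textbook algorithm, not the UQ-aware algorithm proposed in the paper; the unit rental price and the function names are assumptions for illustration.

```python
def break_even_cost(buy_cost, season_days):
    """Cost of renting at 1 per day and buying on day `buy_cost`."""
    if season_days < buy_cost:
        return season_days            # season ended before we bought
    return (buy_cost - 1) + buy_cost  # rented buy_cost - 1 days, then bought


def offline_optimal_cost(buy_cost, season_days):
    """Cost of the best decision when the season length is known in advance."""
    return min(season_days, buy_cost)


buy_cost = 10  # hypothetical purchase price, in units of one day's rent
ratios = [
    break_even_cost(buy_cost, days) / offline_optimal_cost(buy_cost, days)
    for days in range(1, 31)
]
worst_ratio = max(ratios)
print(worst_ratio)  # 1.9
```

The worst-case ratio is (2B - 1) / B for purchase price B, which stays below 2. Predictions, and uncertainty estimates about them, are what allow learning-augmented algorithms to improve on this bound when the forecast is reliable.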
Towards Minimizing Feature Drift in Model Merging: Layer-wise Task Vector Fusion for Adaptive Knowledge Integration
Wenju Sun, Qingyong Li, Wen Wang, Yang Liu, Yangliao Geng, Boyang Li
Multi-task model merging aims to consolidate knowledge from multiple fine-tuned task-specific experts into a unified model while minimizing performance degradation. Existing methods primarily approach this by minimizing differences between task-specific experts and the unified model, either from a parameter-level or a task-loss perspective. However, parameter-level methods exhibit a significant performance gap compared to the upper bound, while task-loss approaches entail costly secondary training procedures. In contrast, we observe that performance degradation closely correlates with feature drift, i.e., differences in feature representations of the same sample caused by model merging. Motivated by this observation, we propose Layer-wise Optimal Task Vector Merging (LOT Merging), a technique that explicitly minimizes feature drift between task-specific experts and the unified model in a layer-by-layer manner. LOT Merging can be formulated as a convex quadratic optimization problem, enabling us to analytically derive closed-form solutions for the parameters of linear and normalization layers. Consequently, LOT Merging achieves efficient model consolidation through basic matrix operations. Extensive experiments across vision and vision-language benchmarks demonstrate that LOT Merging significantly outperforms baseline methods, achieving improvements of up to 4.4% (ViT-B/32) over state-of-the-art approaches. The source code is available at https://github.com/SunWenJu123/model-merging.
HYPER: A Foundation Model for Inductive Link Prediction with Knowledge Hypergraphs
Xingyue Huang, Mikhail Galkin, Michael Bronstein, Ismail I Ceylan
Inductive link prediction with knowledge hypergraphs is the task of predicting missing hyperedges involving completely *novel entities* (i.e., nodes unseen during training). Existing methods for inductive link prediction with knowledge hypergraphs assume a fixed relational vocabulary and, as a result, cannot generalize to knowledge hypergraphs with *novel relation types* (i.e., relations unseen during training). Inspired by knowledge graph foundation models, we propose HYPER as a foundation model for link prediction, which can generalize to *any knowledge hypergraph*, including novel entities and novel relations. Importantly, HYPER can learn and transfer across different relation types of *varying arities*, by encoding the entities of each hyperedge along with their respective positions in the hyperedge. To evaluate HYPER, we construct 16 new inductive datasets from existing knowledge hypergraphs, covering a diverse range of relation types of varying arities. Empirically, HYPER consistently outperforms all existing methods in both node-only and node-and-relation inductive settings, showing strong generalization to unseen, higher-arity relational structures.
Efficient Sketches for Training Data Attribution and Studying the Loss Landscape
Schioppa, Andrea
The study of modern machine learning models often necessitates storing vast quantities of gradients or Hessian vector products (HVPs). Traditional sketching methods struggle to scale under these memory constraints. We present a novel framework for scalable gradient and HVP sketching, tailored for modern hardware. We provide theoretical guarantees and demonstrate the power of our methods in applications like training data attribution, Hessian spectrum analysis, and intrinsic dimension computation for pre-trained language models. Our work sheds new light on the behavior of pre-trained language models, challenging assumptions about their intrinsic dimensionality and Hessian properties.
Distributional Associations vs In-Context Reasoning: A Study of Feed-forward and Attention Layers
Lei Chen, Joan Bruna, Alberto Bietti
Large language models have been successful at tasks involving basic forms of in-context reasoning, such as generating coherent language, as well as storing vast amounts of knowledge. At the core of the Transformer architecture behind such models are feed-forward and attention layers, which are often associated with knowledge and reasoning, respectively. In this paper, we study this distinction empirically and theoretically in a controlled synthetic setting where certain next-token predictions involve both distributional and in-context information. We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning. Our theoretical analysis identifies the noise in the gradients as a key factor behind this discrepancy. Finally, we illustrate how similar disparities emerge in pre-trained models through ablations on the Pythia model family on simple reasoning tasks.
Non-Asymptotic Analysis for Two Time-scale TDC with General Smooth Function Approximation
Wang, Yue, Zou, Shaofeng, Zhou, Yi
Temporal-difference learning with gradient correction (TDC) is a two time-scale algorithm for policy evaluation in reinforcement learning. This algorithm was initially proposed with linear function approximation, and was later extended to the one with general smooth function approximation. The asymptotic convergence for the on-policy setting with general smooth function approximation was established in [Bhatnagar et al., 2009]; however, the non-asymptotic convergence analysis remains unsolved due to challenges in the non-linear and two-time-scale update structure, non-convex objective function and the projection onto a time-varying tangent plane. In this paper, we develop novel techniques to address the above challenges and explicitly characterize the non-asymptotic error bound for the general off-policy setting with i.i.d. or Markovian samples, and show that it converges as fast as $\mathcal{O}(1/\sqrt{T})$ (up to a factor of $\mathcal{O}(\log T)$). Our approach can be applied to a wide range of value-based reinforcement learning algorithms with general smooth function approximation.
Learning to Localize Using a LiDAR Intensity Map
Ioan Andrei Barsan, Shenlong Wang, Andrei Pokrovsky, Raquel Urtasun
In this paper we propose a real-time, calibration-agnostic and effective localization system for self-driving cars. Our method learns to embed the online LiDAR sweeps and intensity map into a joint deep embedding space. Localization is then conducted through an efficient convolutional matching between the embeddings. Our full system can operate in real-time at 15Hz while achieving centimeter level accuracy across different LiDAR sensors and environments. Our experiments illustrate the performance of the proposed approach over a large-scale dataset consisting of over 4000km of driving.
Diffusion Transformers for Imputation: Statistical Efficiency and Uncertainty Quantification
Zeqi Ye, Minshuo Chen
Imputation methods play a critical role in enhancing the quality of practical time-series data, which often suffer from pervasive missing values. Recently, diffusion-based generative imputation methods have demonstrated remarkable success compared to autoregressive and conventional statistical approaches. Despite their empirical success, the theoretical understanding of how well diffusion-based models capture complex spatial and temporal dependencies between the missing values and observed ones remains limited. Our work addresses this gap by investigating the statistical efficiency of conditional diffusion transformers for imputation and quantifying the uncertainty in missing values. Specifically, we derive statistical sample complexity bounds based on a novel approximation theory for conditional score functions using transformers, and, through this, construct tight confidence regions for missing values. Our findings also reveal that the efficiency and accuracy of imputation are significantly influenced by the missing patterns. Furthermore, we validate these theoretical insights through simulation and propose a mixed-masking training strategy to enhance the imputation performance.
Efficient Temporal Action Segmentation via Boundary-aware Query Voting
Wang, Peiyao, Lin, Yuewei, Blasch, Erik, wei, jie, Ling, Haibin
Although the performance of Temporal Action Segmentation (TAS) has been improved in recent years, achieving promising results often comes with a high computational cost due to dense inputs, complex model structures, and resource-intensive post-processing requirements. To improve the efficiency while keeping the high performance, we present a novel perspective centered on per-segment classification. By harnessing the capabilities of Transformers, we tokenize each video segment as an instance token, endowed with intrinsic instance segmentation. To realize efficient action segmentation, we introduce BaFormer, a boundary-aware Transformer network. It employs instance queries for instance segmentation and a global query for class-agnostic boundary prediction, yielding continuous segment proposals. During inference, BaFormer employs a simple yet effective voting strategy to classify boundary-wise segments based on instance segmentation. Remarkably, as a single-stage approach, BaFormer significantly reduces the computational costs, utilizing only 6% of the running time compared to the state-of-the-art method DiffAct, while producing better or comparable accuracy over several popular benchmarks. The code for this project is publicly available at https://github.com/peiyao-w/BaFormer.
LSEnet: Lorentz Structural Entropy Neural Network for Deep Graph Clustering
Li Sun, Zhenhao Huang, Hao Peng, YuJie Wang, Chunyang Liu, Philip Yu
Graph clustering is a fundamental problem in machine learning. Deep learning methods have achieved state-of-the-art results in recent years, but they still cannot work without predefined cluster numbers. This limitation motivates us to pose a more challenging problem of graph clustering with unknown cluster number. We propose to address this problem from a fresh perspective of graph information theory (i.e., structural information). In the literature, structural information has not yet been introduced to deep clustering, and its classic definition falls short of discrete formulation and modeling node features. In this work, we first formulate a differentiable structural information (DSI) in the continuous realm, accompanied by several theoretical results. By minimizing DSI, we construct the optimal partitioning tree where densely connected nodes in the graph tend to have the same assignment, revealing the cluster structure. DSI is also theoretically presented as a new graph clustering objective, not requiring the pre-defined cluster number. Furthermore, we design a neural LSEnet in the Lorentz model of hyperbolic space, where we integrate node features to structural information via manifold-valued graph convolution. Extensive empirical results on real graphs show the superiority of our approach.
Asymptotics of smoothed Wasserstein distances in the small noise regime
Ding, Yunzi, Niles-Weed, Jonathan
We study the behavior of the Wasserstein- distance between discrete measures and in when both measures are smoothed by small amounts of Gaussian noise. This procedure, known as Gaussian-smoothed optimal transport, has recently attracted attention as a statistically attractive alternative to the unregularized Wasserstein distance. We give precise bounds on the approximation properties of this proposal in the small noise regime, and establish the existence of a phase transition: we show that, if the optimal transport plan from to is unique and a perfect matching, there exists a critical threshold such that the difference between and the Gaussian-smoothed OT distance scales like for below the threshold, and scales like above it. These results establish that for sufficiently small, the smoothed Wasserstein distance approximates the unregularized distance exponentially well.
Enhancing Hallucination Detection through Noise Injection
Litian Liu, Reza Pourreza, Sunny Panchal, Apratim Bhattacharyya, Yubing Jian, Yao Qin, Roland Memisevic
Large Language Models (LLMs) are prone to generating plausible yet incorrect responses, known as hallucinations. Effectively detecting hallucinations is therefore crucial for the safe deployment of LLMs. Recent research has linked hallucinations to model uncertainty, suggesting that hallucinations can be detected by measuring dispersion over answer distributions obtained from multiple samples drawn from a model. While drawing from the distribution over tokens defined by the model is a natural way to obtain samples, in this work, we argue that it is sub-optimal for the purpose of detecting hallucinations. We show that detection can be improved significantly by taking into account model uncertainty in the Bayesian sense. To this end, we propose a very simple, training-free approach based on perturbing an appropriate subset of model parameters, or equivalently hidden unit activations, during sampling. We demonstrate that our approach significantly improves inference-time hallucination detection over standard sampling across diverse datasets, model architectures, and uncertainty metrics.
A Truncated Newton Method for Optimal Transport
Mete Kemertas, Amir-massoud Farahmand, Allan Jepson
Developing a contemporary optimal transport (OT) solver requires navigating trade-offs among several critical requirements: GPU parallelization, scalability to high-dimensional problems, theoretical convergence guarantees, empirical performance in terms of precision versus runtime, and numerical stability in practice. With these challenges in mind, we introduce a specialized truncated Newton algorithm for entropic-regularized OT. In addition to proving that locally quadratic convergence is possible without assuming a Lipschitz Hessian, we provide strategies to maximally exploit the high rate of local convergence in practice. Our GPU-parallel algorithm exhibits exceptionally favorable runtime performance, achieving high precision orders of magnitude faster than many existing alternatives. This is evidenced by wall-clock time experiments on 24 problem sets (12 datasets × 2 cost functions). The scalability of the algorithm is showcased on an extremely large OT problem with , solved approximately under weak entropic regularization.
How does a Neural Network's Architecture Impact its Robustness to Noisy Labels?
Li, Jingling, Zhang, Mozhi, Xu, Keyulu, Dickerson, John, Ba, Jimmy
Noisy labels are inevitable in large real-world datasets. In this work, we explore an area understudied by previous works: how the network's architecture impacts its robustness to noisy labels. We provide a formal framework connecting the robustness of a network to the alignments between its architecture and target/noise functions. Our framework measures a network's robustness via the predictive power in its representations, that is, the test performance of a linear model trained on the learned representations using a small set of clean labels. We hypothesize that a network is more robust to noisy labels if its architecture is more aligned with the target function than the noise. To support our hypothesis, we provide both theoretical and empirical evidence across various neural network architectures and different domains. We also find that when the network is well-aligned with the target function, its predictive power in representations could improve upon state-of-the-art (SOTA) noisy-label-training methods in terms of test accuracy and even outperform sophisticated methods that use clean labels.
Entropic Gromov-Wasserstein between Gaussian Distributions
Khang Le, Dung Le, Huy Nguyen, , Tung Pham, Nhat Ho
We study the entropic Gromov-Wasserstein and its unbalanced version between (unbalanced) Gaussian distributions with different dimensions. When the metric is the inner product, which we refer to as inner product Gromov-Wasserstein (IGW), we demonstrate that the optimal transportation plans of entropic IGW and its unbalanced variant are (unbalanced) Gaussian distributions. Via an application of von Neumann's trace inequality, we obtain closed-form expressions for the entropic IGW between these Gaussian distributions. Finally, we consider an entropic inner product Gromov-Wasserstein barycenter of multiple Gaussian distributions. We prove that the barycenter is a Gaussian distribution when the entropic regularization parameter is small. We further derive a closed-form expression for the covariance matrix of the barycenter.
InstaSHAP: Interpretable Additive Models Explain Shapley Values Instantly
James Enouen, Yan Liu
In recent years, the Shapley value and SHAP explanations have emerged as one of the most dominant paradigms for providing post-hoc explanations of black-box models. Despite their well-founded theoretical properties, many recent works have focused on the limitations in both their computational efficiency and their representation power. The underlying connection with additive models, however, is left critically under-emphasized in the current literature. In this work, we find that a variational perspective linking GAM models and SHAP explanations is able to provide deep insights into nearly all recent developments. In light of this connection, we borrow in the other direction to develop a new method to train interpretable GAM models which are automatically purified to compute the Shapley value in a single forward pass. Finally, we provide theoretical results showing the limited representation power of GAM models is the same Achilles’ heel existing in SHAP and discuss the implications for SHAP’s modern usage in CV and NLP.
CALM: Culturally Self-Aware Language Models
Lingzhi Shen, Xiaohao Cai, Yunfei Long, Imran Razzak, Guanming Chen, Shoaib Jameel
Cultural awareness in language models is the capacity to understand and adapt to diverse cultural contexts. However, most existing approaches treat culture as static background knowledge, overlooking its dynamic and evolving nature. This limitation reduces their reliability in downstream tasks that demand genuine cultural sensitivity. In this work, we introduce CALM, a novel framework designed to endow language models with cultural self-awareness. CALM disentangles task semantics from explicit cultural concepts and latent cultural signals, shaping them into structured cultural clusters through contrastive learning. These clusters are then aligned via cross-attention to establish fine-grained interactions among related cultural features and are adaptively integrated through a Mixture-of-Experts mechanism along culture-specific dimensions. The resulting unified representation is fused with the model's original knowledge to construct a culturally grounded internal identity state, which is further enhanced through self-prompted reflective learning, enabling continual adaptation and self-correction. Experiments on the benchmark datasets demonstrate that CALM outperforms state-of-the-art methods.
SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators
Rasoul Shafipour, David Harrison, Maxwell Horton, JEFFREY MARKER, Houman Bedayat, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, Saman Naderiparizi
Large Language Models (LLMs) have transformed natural language processing, but face significant challenges in widespread deployment due to their high runtime cost. In this paper, we introduce SeedLM, a novel post-training compression method that uses seeds of a pseudo-random generator to encode and compress model weights. Specifically, for each block of weights, we find a seed that is fed into a Linear Feedback Shift Register (LFSR) during inference to efficiently generate a random matrix. This matrix is then linearly combined with compressed coefficients to reconstruct the weight block. SeedLM reduces memory access and leverages idle compute cycles during inference, effectively speeding up memory-bound tasks by trading compute for fewer memory accesses. Unlike state-of-the-art methods that rely on calibration data, our approach is data-free and generalizes well across diverse tasks. Our experiments with Llama3 70B, which is particularly challenging, show zero-shot accuracy retention at 4- and 3-bit compression to be on par with or better than state-of-the-art methods, while maintaining performance comparable to FP16 baselines. Additionally, FPGA-based tests demonstrate that 4-bit SeedLM, as model size increases, approaches a 4x speed-up over an FP16 Llama 2/3 baseline.
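The seed-based reconstruction described above can be illustrated with a small sketch. It is not the SeedLM implementation: the 16-bit Fibonacci LFSR and its tap positions, the ±1 entries, the unquantized least-squares coefficients, and the names `lfsr_bits`, `random_matrix`, `compress_block`, and `decompress_block` are all assumptions chosen for illustration. The abstract speaks of compressed coefficients, whereas this sketch leaves them in full precision.

```python
import numpy as np


def lfsr_bits(seed, n):
    """Return n pseudo-random +1/-1 values from a 16-bit Fibonacci LFSR.

    Assumption: taps at bits 16, 14, 13, 11, a common textbook choice.
    """
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("an LFSR seed must be non-zero")
    out = np.empty(n)
    for i in range(n):
        out[i] = 1.0 if state & 1 else -1.0
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
    return out


def random_matrix(seed, rows, cols):
    """Deterministically expand a seed into a rows x cols matrix."""
    return lfsr_bits(seed, rows * cols).reshape(rows, cols)


def compress_block(w, seeds, rank):
    """Search candidate seeds; keep the one whose matrix best fits w."""
    best = None
    for seed in seeds:
        u = random_matrix(seed, w.size, rank)
        # Least-squares coefficients t minimizing ||u @ t - w||.
        t, *_ = np.linalg.lstsq(u, w, rcond=None)
        err = float(np.sum((u @ t - w) ** 2))
        if best is None or err < best[2]:
            best = (seed, t, err)
    return best


def decompress_block(seed, t, size):
    """Rebuild the weight block from the stored seed and coefficients."""
    return random_matrix(seed, size, t.size) @ t
```

Only the seed and the short coefficient vector need to be stored per block; the matrix itself is regenerated on demand, which is the compute-for-memory trade the abstract describes.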
μLO: Compute-Efficient Meta-Generalization of Learned Optimizers
Benjamin Thérien, Charles-Étienne Joseph, Boris Knyazev, Edouard Oyallon, Irina Rish, Eugene Belilovsky
Learned optimizers (LOs) have the potential to significantly reduce the wall-clock training time of neural networks. However, they can struggle to optimize unseen tasks (*meta-generalize*), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization (μP) for two state-of-the-art learned optimizer architectures and propose a simple meta-training recipe for μ-parameterized LOs (μLOs). Our empirical evaluation demonstrates that μLOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks when compared to LOs trained under standard parametrization (SP) using the same compute budget. We also empirically observe that μLOs exhibit unexpectedly improved meta-generalization to deeper networks ( meta-training) and surprising generalization to much longer training horizons ( meta-training) when compared to SP LOs.
DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
Pagliardini, Matteo, Mohtashami, Amirkeivan, Fleuret, François, Jaggi, Martin
The transformer architecture by Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding. We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size, adding only a few thousand parameters for large-scale models in the 100B-parameter range. Our approach relies on an additional averaging step after each transformer block, which computes a weighted average of current and past representations; we refer to this operation as Depth-Weighted-Average (DWA). The learned DWA weights exhibit coherent patterns of information flow, revealing the strong and structured reuse of activations from distant layers. Experiments demonstrate that DenseFormer is more data efficient, reaching the same perplexity as much deeper transformer models, and that for the same perplexity, these new models outperform transformer baselines in terms of memory efficiency and inference time.
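The averaging step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the blocks are arbitrary callables standing in for transformer blocks, the history is assumed to hold the embedded input followed by each block's raw output, and the names `dwa` and `forward_with_dwa` are hypothetical.

```python
import numpy as np


def dwa(history, alpha):
    """Weighted average over all representations collected so far.

    history: list of equally shaped arrays (input first, then block outputs).
    alpha:   one scalar weight per entry of history.
    """
    stacked = np.stack(history, axis=0)         # (depth, ...)
    return np.tensordot(alpha, stacked, axes=1)  # contract over depth


def forward_with_dwa(x0, blocks, alphas):
    """Run the blocks in order, applying a DWA step after each one."""
    history = [x0]
    y = x0
    for block, alpha in zip(blocks, alphas):
        history.append(block(y))  # raw output of the current block
        y = dwa(history, alpha)   # mix current and past representations
    return y
```

In this sketch, setting each weight vector to zero everywhere except a one on the newest entry reduces the computation to an ordinary stack of blocks, so the extra parameters amount to one scalar per pair of depths.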
Single-agent Poisoning Attacks Suffice to Ruin Multi-Agent Learning
Fan Yao, Yuwei Cheng, Ermin Wei, Haifeng Xu
We investigate the robustness of multi-agent learning in strongly monotone games with bandit feedback. While previous research has developed learning algorithms that achieve last-iterate convergence to the unique Nash equilibrium (NE) at a polynomial rate, we demonstrate that all such algorithms are vulnerable to adversaries capable of poisoning even a single agent's utility observations. Specifically, we propose an attacking strategy such that for any given time horizon $T$, the adversary can mislead any multi-agent learning algorithm to converge to a point other than the unique NE with a corruption budget that grows sublinearly in $T$. To further understand the inherent robustness of these algorithms, we characterize the fundamental trade-off between convergence speed and the maximum tolerable total utility corruptions for two example algorithms, including the state-of-the-art one. Our theoretical and empirical results reveal an intrinsic efficiency-robustness trade-off: the faster an algorithm converges, the more vulnerable it becomes to utility poisoning attacks. To the best of our knowledge, this is the first work to identify and characterize such a trade-off in the context of multi-agent learning.
Learning piecewise Lipschitz functions in changing environments
Dravyansh Sharma, Maria-Florina Balcan, Travis Dick
Optimization in the presence of sharp (non-Lipschitz), unpredictable (w.r.t. time and amount) changes is a challenging and largely unexplored problem of great significance. We consider the class of piecewise Lipschitz functions, which is the most general online setting considered in the literature for the problem, and arises naturally in various combinatorial algorithm selection problems where utility functions can have sharp discontinuities. The usual performance metric of ‘static’ regret minimizes the gap between the payoff accumulated and that of the best fixed point for the entire duration, and thus fails to capture changing environments. Shifting regret is a useful alternative, which allows for up to environment *shifts*. In this work we provide an regret bound for -dispersed functions, where roughly quantifies the rate at which discontinuities appear in the utility functions in expectation (typically in problems of practical interest \cite{2019arXiv190409014B,balcan2018dispersion}). We also present a lower bound tight up to sub-logarithmic factors. We further obtain improved bounds when selecting from a small pool of experts. We empirically demonstrate a key application of our algorithms to online clustering problems on popular benchmarks.
Unifying Causal Representation Learning with the Invariance Principle
Dingling Yao, Dario Rancati, Riccardo Cadei, Marco Fumero, Francesco Locatello
Causal representation learning (CRL) aims at recovering latent causal variables from high-dimensional observations to solve causal downstream tasks, such as predicting the effect of new interventions or more robust classification. A plethora of methods have been developed, each tackling carefully crafted problem settings that lead to different types of identifiability. These different settings are widely assumed to be important because they are often linked to different rungs of Pearl's causal hierarchy, even though this correspondence is not always exact. This work shows that instead of strictly conforming to this hierarchical mapping, *many causal representation learning approaches methodologically align their representations with inherent data symmetries.* Identification of causal variables is guided by invariance principles that are not necessarily causal. This result allows us to unify many existing approaches in a single method that can mix and match different assumptions, including non-causal ones, based on the invariance relevant to the problem at hand. It also significantly benefits applicability, which we demonstrate by improving treatment effect estimation on real-world high-dimensional ecological data. Overall, this paper clarifies the role of causal assumptions in the discovery of causal variables and shifts the focus to preserving data symmetries.
Mitigate the Gap: Improving Cross-Modal Alignment in CLIP
Sedigheh Eslami, Gerard de Melo
Contrastive Language-Image Pre-training (CLIP) has manifested remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap. This gap renders the embedding space overly sparse and disconnected, with different modalities being densely distributed in distinct subregions of the hypersphere. In this work, we propose AlignCLIP, in order to improve the alignment between text and image embeddings, and thereby reduce the modality gap. AlignCLIP increases the cross-modal alignment, and yields gains across several zero-shot and fine-tuning downstream evaluations, by sharing the learnable parameters between the modality encoders and by applying a semantically-regularized separation objective function to the uni-modal embeddings. The source code and model checkpoints for reproducing our experiments are available at https://github.com/sarahESL/AlignCLIP.
Exposing Mixture and Annotating Confusion for Active Universal Test-Time Adaptation
Jiayao Tan, Fan Lyu, Chenggong Ni, Fuyuan Hu, Wei Feng, Rui Yao
Universal Test-Time Adaptation (UTTA) tackles the challenge of handling both class and domain shifts in unsupervised settings with stream testing data. Currently, most UTTA methods can only deal with minor shifts and heavily rely on heuristic approaches. To advance UTTA under dual shifts, we propose a novel Active Universal Test-Time Adaptation (AUTTA) framework, Exposing Mixture and Annotating Confusion (EMAC), which incorporates active human annotation into the UTTA setting. To select appropriate samples for annotation in AUTTA, we first identify the mixed regions of target domain samples under dual shifts, highlighting potential candidate samples. We then design a reward-guided active selection strategy to prioritize annotating the most representative samples within this set, maximizing annotation effectiveness. Additionally, to balance the use of pseudo-labels with the limited number of annotations, we propose an adaptation objective designed to address the adaptation imbalance caused by annotation scarcity. Extensive experiments show that the proposed AUTTA approach significantly improves performance and achieves state-of-the-art results.
MuPT: A Generative Symbolic Music Pretrained Transformer
Xingwei Qu, yuelin bai, Yinghao MA, Ziya Zhou, Ka Man Lo, JIAHENG LIU, Ruibin Yuan, Lejun Min, Xueling Liu, Tianyu Zhang, Xeron Du, Shuyue Guo, Yiming Liang, Yizhi Li, Shangda Wu, Junting Zhou, Tianyu Zheng, Ziyang Ma, Fengze Han, Wei Xue, Gus Xia, Emmanouil Benetos, Xiang Yue, Chenghua Lin, Xu Tan, Wenhao Huang, Jie Fu, Ge Zhang
In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising research direction in music generation, offering extensive resources for further research through our open-source contributions.
Order Matters in the Presence of Dataset Imbalance for Multilingual Learning
Choi, Dami, Xin, Derrick, Dadkhahi, Hamid, Gilmer, Justin, Garg, Ankush, Firat, Orhan, Yeh, Chih-Kuan, Dai, Andrew M., Ghorbani, Behrooz
In this paper, we empirically study the optimization dynamics of multi-task learning, particularly focusing on those that govern a collection of tasks with significant data imbalance. We present a simple yet effective method of pre-training on high-resource tasks, followed by fine-tuning on a mixture of high/low-resource tasks. We provide a thorough empirical study and analysis of this method's benefits showing that it achieves consistent improvements relative to the performance trade-off profile of standard static weighting. We analyze under what data regimes this method is applicable and show its improvements empirically in neural machine translation (NMT) and multi-lingual language modeling.
Private Federated Learning with Autotuned Compression
Enayat Ullah, Christopher Choquette-Choo, Peter Kairouz, Sewoong Oh
We propose new techniques for reducing communication in private federated learning without the need for setting or tuning compression rates. Our on-the-fly methods automatically adjust the compression rate based on the error induced during training, while maintaining provable privacy guarantees through the use of secure aggregation and differential privacy. Our techniques are provably instance-optimal for mean estimation, meaning that they can adapt to the "hardness of the problem" with minimal interactivity. We demonstrate the effectiveness of our approach on real-world datasets by achieving favorable compression rates without the need for tuning.
Spurious Local Minima are Common in Two-Layer ReLU Neural Networks
Itay Safran, Ohad Shamir
We consider the optimization problem associated with training simple ReLU neural networks of the form with respect to the squared loss. We provide a computer-assisted proof that even if the input distribution is standard Gaussian, even if the dimension is arbitrarily large, and even if the target values are generated by such a network, with orthonormal parameter vectors, the problem can still have spurious local minima once . By a concentration of measure argument, this implies that in high input dimensions,
Aligning Time Series on Incomparable Spaces
Samuel Cohen, Giulia Luise, Alexander Terenin, Brandon Amos, Marc Deisenroth
Dynamic time warping (DTW) is a useful method for aligning, comparing and combining time series, but it requires them to live in comparable spaces. In this work, we consider a setting in which time series live on different spaces without a sensible ground metric, causing DTW to become ill-defined. To alleviate this, we propose Gromov dynamic time warping (GDTW), a distance between time series on potentially incomparable spaces that avoids the comparability requirement by instead considering intra-relational geometry. We demonstrate its effectiveness at aligning, combining and comparing time series living on incomparable spaces. We further propose a smoothed version of GDTW as a differentiable loss and assess its properties in a variety of settings, including barycentric averaging, generative modeling and imitation learning.
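To make the comparability requirement concrete, here is a minimal sketch of classical DTW, not of GDTW. The function name `dtw_cost` and the `ground_cost` argument are illustrative assumptions. The point is that every step calls a cost between one element of each series, which is exactly what is unavailable when the two series live on unrelated spaces.

```python
import numpy as np


def dtw_cost(x, y, ground_cost):
    """Classical DTW cost via dynamic programming.

    ground_cost(a, b) must compare an element of x with an element of y,
    so both series need to live in a common, comparable space.
    """
    n, m = len(x), len(y)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = ground_cost(x[i - 1], y[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            acc[i, j] = c + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])
```

For two scalar series, `ground_cost` could simply be the absolute difference. The abstract's proposal replaces this cross-series cost with comparisons of distances within each series.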
Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
Anirudh Bharadwaj, Chaitanya Malaviya, Nitish Joshi, Mark Yatskar
Language models serve as proxies for human preference judgements in alignment and evaluation, yet they exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities. This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations. However, the connection between training data artifacts and the miscalibrated preferences exhibited by models remains poorly understood. In this work, we systematically investigate the relationship between training data biases and preference model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with artificially magnified biases (*skew*), finding this preference occurs in % of instances, and model preferences show high *miscalibration* (%) compared to human preferences. Notably, bias features only show mild negative correlations to human preference labels (mean ) but show moderately strong positive correlations with labels from a strong reward model (mean ), suggesting that models may overrely on spurious cues. To mitigate these issues, we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Fine-tuning models with CDA reduces average miscalibration from 39.4% to 32.5% and average absolute skew difference from 20.5% to 10.0%, while maintaining overall RewardBench performance, indicating that targeted debiasing can strengthen the reliability of preference models within standard alignment pipelines.
Calibration and Consistency of Adversarial Surrogate Losses
Awasthi, Pranjal, Frank, Natalie, Mao, Anqi, Mohri, Mehryar, Zhong, Yutao
Adversarial robustness is an increasingly critical property of classifiers in applications. The design of robust algorithms relies on surrogate losses since the optimization of the adversarial loss with most hypothesis sets is NP-hard. But which surrogate losses should be used and when do they benefit from theoretical guarantees? We present an extensive study of this question, including a detailed analysis of the H-calibration and H-consistency of adversarial surrogate losses. We show that convex loss functions, or the supremum-based convex losses often used in applications, are not H-calibrated for common hypothesis sets used in machine learning. We then give a characterization of H-calibration and prove that some surrogate losses are indeed H-calibrated for the adversarial zero-one loss, with common hypothesis sets. In particular, we fix some calibration results presented in prior work for a family of linear models and significantly generalize the results to the nonlinear hypothesis sets. Next, we show that H-calibration is not sufficient to guarantee consistency and prove that, in the absence of any distributional assumption, no continuous surrogate loss is consistent in the adversarial setting. This, in particular, proves that a claim made in prior work is inaccurate. Next, we identify natural conditions under which some surrogate losses that we describe in detail are H-consistent. We also report a series of empirical results which show that many H-calibrated surrogate losses are indeed not H-consistent, and validate our theoretical assumptions. Our adversarial H-consistency results are novel, even for the case where H is the family of all measurable functions.