Papers
Ultrahyperbolic Neural Networks
Law, Marc
Riemannian space forms, such as the Euclidean space, sphere and hyperbolic space, are popular and powerful representation spaces in machine learning. For instance, hyperbolic geometry is appropriate to represent graphs without cycles and has been used to extend Graph Neural Networks. Recently, some pseudo-Riemannian space forms that generalize both hyperbolic and spherical geometries have been exploited to learn a specific type of nonparametric embedding called ultrahyperbolic. The lack of a geodesic between every pair of ultrahyperbolic points makes the task of learning parametric models (e.g., neural networks) difficult. This paper introduces a method to learn parametric models in ultrahyperbolic space. We experimentally show the relevance of our approach in the tasks of graph and node classification.
A Communication-efficient Algorithm with Linear Convergence for Federated Minimax Learning
Sun, Zhenyu, Wei, Ermin
In this paper, we study a large-scale multi-agent minimax optimization problem, which models many interesting applications in statistical learning and game theory, including Generative Adversarial Networks (GANs). The overall objective is a sum of agents' private local objective functions. We focus on the federated setting, where agents can perform local computation and communicate with a central server. Most existing federated minimax algorithms either require communication per iteration or lack performance guarantees, with the exception of Local Stochastic Gradient Descent Ascent (SGDA), a multiple-local-update descent ascent algorithm which guarantees convergence under a diminishing stepsize. By analyzing Local SGDA under the ideal condition of no gradient noise, we show that generally it cannot guarantee exact convergence with constant stepsizes and thus suffers from slow rates of convergence. To tackle this issue, we propose FedGDA-GT, an improved Federated (Fed) Gradient Descent Ascent (GDA) method based on Gradient Tracking (GT). When local objectives are Lipschitz smooth and strongly-convex-strongly-concave, we prove that FedGDA-GT converges linearly with a constant stepsize to a global ε-approximation solution with O(log(1/ε)) rounds of communication, which matches the time complexity of the centralized GDA method. Then, we analyze the general distributed minimax problem from a statistical aspect, where the overall objective approximates a true population minimax risk by empirical samples. We provide generalization bounds for learning with this objective through Rademacher complexity analysis. Finally, we numerically show that FedGDA-GT outperforms Local SGDA.
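To make the gradient-tracking idea concrete, below is a minimal sketch of tracking-corrected local descent ascent on toy quadratic saddle problems. This is a reading of the abstract, not the paper's pseudocode: the quadratic objectives, stepsize, and the exact form of the correction (local drift subtracted, round-start global gradient added back) are illustrative assumptions.

import numpy as np

# Toy local objectives f_i(x, y) = 0.5*a_i*x^2 + b_i*x*y - 0.5*c_i*y^2,
# each Lipschitz smooth and strongly-convex-strongly-concave.
rng = np.random.default_rng(0)
m = 5
a, b, c = rng.uniform(1, 2, m), rng.uniform(-1, 1, m), rng.uniform(1, 2, m)

def grad(i, x, y):
    return a[i] * x + b[i] * y, b[i] * x - c[i] * y  # (df_i/dx, df_i/dy)

x, y = 1.0, 1.0
eta, K = 0.05, 10
for _ in range(200):  # communication rounds
    # Server aggregates the global gradient once per round (tracking reference).
    gx = np.mean([grad(i, x, y)[0] for i in range(m)])
    gy = np.mean([grad(i, x, y)[1] for i in range(m)])
    xs, ys = [], []
    for i in range(m):
        xi, yi = x, y
        hx0, hy0 = grad(i, x, y)  # local gradient at the round start
        for _ in range(K):        # local updates with a constant stepsize
            hx, hy = grad(i, xi, yi)
            # Tracking-corrected descent step on x and ascent step on y.
            xi -= eta * (hx - hx0 + gx)
            yi += eta * (hy - hy0 + gy)
        xs.append(xi); ys.append(yi)
    x, y = np.mean(xs), np.mean(ys)

print(round(x, 6), round(y, 6))  # converges toward the global saddle point (0, 0)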
Model Rubik’s Cube: Twisting Resolution, Depth and Width for TinyNets
Han, Kai, Wang, Yunhe, Zhang, Qiulin, Zhang, Wei, XU, Chunjing, Zhang, Tong
To obtain excellent deep neural architectures, a series of techniques are carefully designed in EfficientNets. The giant formula for simultaneously enlarging the resolution, depth and width provides us with a Rubik's cube for neural networks, so that we can find networks with high efficiency and excellent performance by twisting the three dimensions. This paper aims to explore the twisting rules for obtaining deep neural networks with minimum model sizes and computational costs. Different from network enlarging, we observe that resolution and depth are more important than width for tiny networks. Therefore, the original method, i.e., the compound scaling in EfficientNet, is no longer suitable. To this end, we summarize a tiny formula for downsizing neural architectures through a series of smaller models derived from EfficientNet-B0 under a FLOPs constraint. Experimental results on the ImageNet benchmark illustrate that our TinyNet performs much better than the smaller versions of EfficientNets obtained with the inversed giant formula. For instance, our TinyNet-E achieves a 59.9% Top-1 accuracy with only 24M FLOPs, which is about 1.9% higher than that of the previous best MobileNetV3 with similar computational cost. Code will be available at https://github.com/huawei-noah/CV-Backbones/tree/master/tinynet and https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/cv/tinynet.
GraphTrail: Translating GNN Predictions into Human-Interpretable Logical Rules
Armgaan, Burouj, Dalmia, Manthan, Medya, Sourav, Ranu, Sayan
Instance-level explanation of graph neural networks (GNNs) is a well-studied area. These explainers, however, only explain an instance (e.g., a graph) and fail to uncover the combinatorial reasoning learned by a GNN from the training data towards making its predictions. In this work, we introduce GraphTrail, the first end-to-end, global, post-hoc GNN explainer that translates the functioning of a black-box GNN model to a boolean formula over the (sub)graph level concepts without relying on local explainers. GraphTrail is unique in automatically mining the discriminative subgraph-level concepts using Shapley values. Subsequently, the GNN predictions are mapped to a human-interpretable boolean formula over these concepts through symbolic regression. Extensive experiments across diverse datasets and GNN architectures demonstrate significant improvement over existing global explainers in mapping GNN predictions to faithful logical formulae. The robust and accurate performance of GraphTrail makes it invaluable for improving GNNs and facilitates adoption in domains with strict transparency requirements.
Differentiable Top-k with Optimal Transport
Xie, Yujia, Dai, Hanjun, Chen, Minshuo, Dai, Bo, Zhao, Tuo, Zha, Hongyuan, Wei, Wei, Pfister, Tomas
Finding the k largest or smallest elements from a collection of scores, i.e., the top-k operation, is an important model component widely used in information retrieval, machine learning, and data mining. However, if the top-k operation is implemented in an algorithmic way, e.g., using the bubble sort algorithm, the resulting model cannot be trained in an end-to-end way using prevalent gradient descent algorithms. This is because these implementations typically involve swapping indices, whose gradient cannot be computed. Moreover, the corresponding mapping from the input scores to the indicator vector of whether this element belongs to the top-k set is essentially discontinuous. To address the issue, we propose a smoothed approximation, namely the SOFT (Scalable Optimal transport-based diFferenTiable) top-k operator. Specifically, our SOFT top-k operator approximates the output of the top-k operation as the solution of an Entropic Optimal Transport (EOT) problem. The gradient of the SOFT operator can then be efficiently approximated based on the optimality conditions of the EOT problem. We then apply the proposed operator to the k-nearest neighbors algorithm and the beam search algorithm. Numerical experiments demonstrate that both achieve improved performance.
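A minimal sketch of the entropic-OT view of top-k: mass on the n scores is transported to two anchors, and the rescaled plan column for the "top" anchor acts as a differentiable membership vector. The anchors {0, 1}, the squared-distance cost, and the regularization strength below are illustrative assumptions, not the paper's exact construction.

import numpy as np

def soft_topk(scores, k, eps=1e-2, n_iters=200):
    """Smoothed top-k membership via entropic optimal transport (Sinkhorn)."""
    s = np.asarray(scores, dtype=float)
    n = len(s)
    s = (s - s.min()) / (s.max() - s.min() + 1e-12)   # normalize scores to [0, 1]
    C = np.stack([s ** 2, (s - 1.0) ** 2], axis=1)    # n x 2 cost to anchors {0, 1}
    mu = np.full(n, 1.0 / n)                          # uniform source marginal
    nu = np.array([(n - k) / n, k / n])               # anchor capacities
    K = np.exp(-C / eps)
    u, v = np.ones(n), np.ones(2)
    for _ in range(n_iters):                          # Sinkhorn fixed-point updates
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    P = u[:, None] * K * v[None, :]                   # entropic transport plan
    return n * P[:, 1]                                # soft top-k indicator

print(np.round(soft_topk([0.1, 3.0, 2.5, -1.0, 0.2], k=2), 3))  # approx [0, 1, 1, 0, 0]

Because every step is differentiable, gradients can flow through the Sinkhorn iterations back to the input scores.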
Narrowing the Gap: Random Forests In Theory and In Practice
Misha Denil, David Matheson, Nando De Freitas
Despite widespread interest and practical use, the theoretical properties of random forests are still not well understood. In this paper we contribute to this understanding in two ways. We present a new theoretically tractable variant of random regression forests and prove that our algorithm is consistent. We also provide an empirical evaluation, comparing our algorithm and other theoretically tractable random forest models to the random forest algorithm used in practice. Our experiments provide insight into the relative importance of different simplifications that theoreticians have made to obtain tractable models for analysis.
Efficient Learning for AlphaZero via Path Consistency
Dengwei Zhao, Shikui Tu, Lei Xu
In recent years, deep reinforcement learning has made great breakthroughs on board games. Still, most of the works require huge computational resources for a large scale of environmental interactions or self-play for the games. This paper aims at building powerful models under a limited amount of self-play, comparable to what a human could study throughout a lifetime. We propose a learning algorithm built on AlphaZero, with its path searching regularised by a path consistency (PC) optimality, i.e., values on one optimal search path should be identical. Thus, the algorithm is shortly named PCZero. In implementation, the historical trajectory and the search paths scouted by MCTS make a good balance between exploration and exploitation, which enhances the generalization ability effectively. On Hex, PCZero obtains a winning rate against the champion of the 2015 Hex Computer Olympiad that is much higher than AlphaZero's, while consuming only a number of self-play games on the order of what a human can study in a lifetime. The improvements by PCZero also generalize to Othello and Gomoku. Experiments further demonstrate the efficiency of PCZero in the offline learning setting.
Infinite latent feature models and the Indian buffet process
Ghahramani, Zoubin, Griffiths, Thomas
We define a probability distribution over equivalence classes of binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features. We identify a simple generative process that results in the same distribution over equivalence classes, which we call the Indian buffet process. We illustrate the use of this distribution as a prior in an infinite latent feature model, deriving a Markov chain Monte Carlo algorithm for inference in this model and applying the algorithm to an image dataset.
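The generative process is concrete enough to sample directly. A minimal sketch of the "restaurant" construction (customer i takes each previously sampled dish k with probability m_k/i, then tries Poisson(α/i) new dishes); the function and variable names are our own:

import numpy as np

def sample_ibp(alpha, n_customers, seed=0):
    """Sample a binary feature matrix Z from the Indian buffet process."""
    rng = np.random.default_rng(seed)
    counts = []   # counts[k] = number of previous customers who took dish k
    rows = []
    for i in range(1, n_customers + 1):
        row = [int(rng.random() < m / i) for m in counts]  # revisit existing dishes
        for k, z in enumerate(row):
            counts[k] += z
        n_new = rng.poisson(alpha / i)                     # try new dishes
        row += [1] * n_new
        counts += [1] * n_new
        rows.append(row)
    Z = np.zeros((n_customers, len(counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

print(sample_ibp(alpha=2.0, n_customers=6))  # rows: customers; columns: latent features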
Untangling tradeoffs between recurrence and self-attention in artificial neural networks
Kerg, Giancarlo, Kanuparthi, Bhargav, Goyal, Anirudh, Goyette, Kyle, Bengio, Yoshua, Lajoie, Guillaume
Attention and self-attention mechanisms are now central to state-of-the-art deep learning on sequential tasks. However, most recent progress hinges on heuristic approaches with limited understanding of attention's role in model optimization and computation, and relies on considerable memory and computational resources that scale poorly. In this work, we present a formal analysis of how self-attention affects gradient propagation in recurrent networks, and prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies by establishing concrete bounds for gradient norms. Building on these results, we propose a relevancy screening mechanism, inspired by the cognitive process of memory consolidation, that allows for a scalable use of sparse self-attention with recurrence. While providing guarantees to avoid vanishing gradients, we use simple numerical experiments to demonstrate the tradeoffs in performance and computational resources by efficiently balancing attention and recurrence. Based on our results, we propose a concrete direction of research to improve scalability of attentive networks.
De-Anonymizing Text by Fingerprinting Language Generation
Sun, Zhen, Schuster, Roei, Shmatikov, Vitaly
Components of machine learning systems are not (yet) perceived as security hotspots. Secure coding practices, such as ensuring that no execution paths depend on confidential inputs, have not yet been adopted by ML developers. We initiate the study of code security of ML systems by investigating how nucleus sampling, a popular approach for generating text used in applications such as auto-completion, unwittingly leaks texts typed by users. Our main result is that the series of nucleus sizes for many natural English word sequences is a unique fingerprint. We then show how an attacker can infer typed text by measuring these fingerprints via a suitable side channel (e.g., cache access times), explain how this attack could help de-anonymize anonymous texts, and discuss defenses.
Rethinking Optimal Transport in Offline Reinforcement Learning
Asadulaev, Arip, Korst, Rostislav, Korotin, Aleksandr, Egiazarian, Vage, Filchenkov, Andrey, Burnaev, Evgeny
We propose a novel algorithm for offline reinforcement learning using optimal transport. Typically, in offline reinforcement learning, the data is provided by various experts and some of them can be sub-optimal. To extract an efficient policy, it is necessary to stitch the best behaviors from the dataset. To address this problem, we rethink offline reinforcement learning as an optimal transportation problem. Based on this, we present an algorithm that aims to find a policy that maps states to a partial distribution of the best expert actions for each given state. We evaluate the performance of our algorithm on continuous control problems from the D4RL suite and demonstrate improvements over existing methods.
Learning to Generate with Memory
Chongxuan Li, Jun Zhu, Bo Zhang
Memory units have been widely used to enrich the capabilities of deep networks on capturing long-term dependencies in reasoning and prediction tasks, but little investigation exists on deep generative models (DGMs) which are good at inferring high-level invariant representations from unlabeled data. This paper presents a deep generative model with a possibly large external memory and an attention mechanism to capture the local detail information that is often lost in the bottom-up abstraction process in representation learning. By adopting a smooth attention model, the whole network is trained end-to-end by optimizing a variational bound of data likelihood via auto-encoding variational Bayesian methods, where an asymmetric recognition network is learnt jointly to infer high-level invariant representations. The asymmetric architecture can reduce the competition between bottom-up invariant feature extraction and top-down generation of instance details. Our experiments on several datasets demonstrate that memory can significantly boost the performance of DGMs on various tasks, including density estimation, image generation, and missing value imputation, and DGMs with memory can achieve state-of-the-art quantitative results.
Adaptive Annealed Importance Sampling with Constant Rate Progress
Shirin Goshtasbpour, Victor Cohen, Perez Fernando
Annealed Importance Sampling (AIS) synthesizes weighted samples from an intractable distribution given its unnormalized density function. This algorithm relies on a sequence of interpolating distributions bridging the target to an initial tractable distribution, such as the well-known geometric mean path of unnormalized distributions, which is assumed to be suboptimal in general. In this paper, we prove that the geometric annealing corresponds to the distribution path that minimizes the KL divergence between the current particle distribution and the desired target when the feasible change in the particle distribution is constrained. Following this observation, we derive the constant rate discretization schedule for this annealing sequence, which adjusts the schedule to the difficulty of moving samples between the initial and the target distributions. We further extend our results to α-divergences and present the respective dynamics of annealing sequences, based on which we propose the Constant Rate AIS (CR-AIS) algorithm and its efficient implementation for α-divergences. We empirically show that CR-AIS performs well on multiple benchmark distributions while avoiding the computationally expensive tuning loop in existing Adaptive AIS.
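For concreteness, here is a toy AIS run along the geometric path pi_beta ∝ pi0^(1-beta) * pi1^beta with a uniform beta grid, i.e., the baseline schedule that the paper's constant-rate construction replaces with steps of equal divergence movement. The Gaussian endpoints, grid size, and Metropolis kernel are illustrative assumptions.

import numpy as np

def ais_geometric(n=2000, T=60, seed=0):
    """AIS with a uniform annealing grid. Toy setup: pi0 = N(0, 1), pi1 = N(3, 0.5^2)."""
    rng = np.random.default_rng(seed)
    log_p0 = lambda z: -0.5 * z ** 2
    log_p1 = lambda z: -0.5 * ((z - 3.0) / 0.5) ** 2
    betas = np.linspace(0.0, 1.0, T + 1)
    x = rng.standard_normal(n)          # exact samples from pi0
    logw = np.zeros(n)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # Importance-weight increment from advancing the annealing parameter.
        logw += (b - b_prev) * (log_p1(x) - log_p0(x))
        # One random-walk Metropolis step targeting pi_b to move the particles.
        log_pb = lambda z: (1 - b) * log_p0(z) + b * log_p1(z)
        prop = x + 0.5 * rng.standard_normal(n)
        accept = np.log(rng.random(n)) < log_pb(prop) - log_pb(x)
        x = np.where(accept, prop, x)
    return x, logw

x, logw = ais_geometric()
w = np.exp(logw - logw.max()); w /= w.sum()
print(round(float(np.sum(w * x)), 3))   # weighted mean, approx 3.0 (the target mean)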
Quantification and Analysis of Layer-wise and Pixel-wise Information Discarding
Haotian Ma, Hao Zhang, Fan Zhou, Yinqing Zhang, Quanshi Zhang
This paper presents a method to explain how the information of each input variable is gradually discarded during the forward propagation in a deep neural network (DNN), which provides new perspectives to explain DNNs. We define two types of entropy-based metrics, i.e. (1) the discarding of pixel-wise information used in the forward propagation, and (2) the uncertainty of the input reconstruction, to measure the input information contained by a specific layer from two perspectives. Unlike previous attribution metrics, the proposed metrics ensure the fairness of comparisons between different layers of different DNNs. We can use these metrics to analyze the efficiency of information processing in DNNs, which exhibits strong connections to the performance of DNNs. We analyze information discarding in a pixel-wise manner, which is different from the information bottleneck theory measuring feature information w.r.t. the sample distribution. Experiments have shown the effectiveness of our metrics in analyzing classic DNNs and explaining existing deep-learning techniques. The code is available at https://github.com/haotianSustc/deepinfo.
Does Knowledge Distillation Really Work?
Stanton, Samuel, Izmailov, Pavel, Kirichenko, Polina, Alemi, Alexander A., Wilson, Andrew G.
Knowledge distillation is a popular technique for training a small student network to emulate a larger teacher model, such as an ensemble of networks. We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood: there often remains a surprisingly large discrepancy between the predictive distributions of the teacher and the student, even in cases when the student has the capacity to perfectly match the teacher. We identify difficulties in optimization as a key reason for why the student is unable to match the teacher. We also show how the details of the dataset used for distillation play a role in how closely the student matches the teacher, and that more closely matching the teacher paradoxically does not always lead to better student generalization.
Hierarchical Clustering for Euclidean Data
Moses Charikar, Vaggos Chatziafratis, Rad Niazadeh, Grigory Yaroslavtsev
Recent works on Hierarchical Clustering (HC), a well-studied problem in exploratory data analysis, have focused on optimizing various objective functions for this problem under arbitrary similarity measures. In this paper we take the first step and give novel scalable algorithms for this problem tailored to Euclidean data in R^d and under vector-based similarity measures, a prevalent model in several typical machine learning applications. We focus primarily on the popular Gaussian kernel and present our results through the lens of the objective introduced recently by [MW’17]. We show the approximation factor in [MW’17] can be improved for Euclidean data. We further demonstrate both theoretically and experimentally that our algorithms scale to very high dimension d, while outperforming average-linkage and showing competitive results against other less scalable approaches.
Two steps to risk sensitivity
Gagne, Christopher, Dayan, Peter
Distributional reinforcement learning (RL) – in which agents learn about all the possible long-term consequences of their actions, and not just the expected value – is of great recent interest. One of the most important affordances of a distributional view is facilitating a modern, measured, approach to risk when outcomes are not completely certain. By contrast, psychological and neuroscientific investigations into decision making under risk have utilized a variety of more venerable theoretical models such as prospect theory that lack axiomatically desirable properties such as coherence. Here, we consider a particularly relevant risk measure for modeling human and animal planning, called conditional value-at-risk (CVaR), which quantifies worst-case outcomes (e.g., vehicle accidents or predation). We first adopt a conventional distributional approach to CVaR in a sequential setting and reanalyze the choices of human decision-makers in the well-known two-step task, revealing substantial risk aversion that had been lurking under stickiness and perseveration. We then consider a further critical property of risk sensitivity, namely time consistency, showing alternatives to this form of CVaR that enjoy this desirable characteristic. We use simulations to examine settings in which the various forms differ in ways that have implications for human and animal planning and behavior.
Deep Supervised and Convolutional Generative Stochastic Network for Protein Secondary Structure Prediction
Jian Zhou, Olga Troyanskaya
Predicting protein secondary structure is a fundamental problem in protein structure prediction. Here we present a new supervised generative stochastic network (GSN) based method to predict local secondary structure with deep hierarchical representations. GSN is a recently proposed deep learning technique (Bengio & Thibodeau-Laufer, 2013) to globally train deep generative models. We present the supervised extension of GSN, which learns a Markov chain to sample from a conditional distribution, and apply it to protein structure prediction. To scale the model to full-sized, high-dimensional data, like protein sequences with hundreds of amino acids, we introduce a convolutional architecture, which allows efficient learning across multiple layers of hierarchical representations. Our architecture uniquely focuses on predicting structured low-level labels informed with both low- and high-level representations learned by the model. In our application this corresponds to labeling the secondary structure state of each amino-acid residue. We trained and tested the model on separate sets of non-homologous proteins sharing less than 30% sequence identity. Our model achieves 66.4% Q8 accuracy on the CB513 dataset, better than the previously reported best performance of 64.9% (Wang et al., 2011) for this challenging secondary structure prediction problem.
Towards Understanding Extrapolation: a Causal Lens
Kong, Lingjing, Chen, Guangyi, Stojanov, Petar, Li, Haoxuan, Xing, Eric, Zhang, Kun
Canonical work handling distribution shifts typically necessitates an entire target distribution that lands inside the training distribution. However, practical scenarios often involve only a handful of target samples, potentially lying outside the training support, which requires the capability of extrapolation. In this work, we aim to provide a theoretical understanding of when extrapolation is possible and offer principled methods to achieve it without requiring an on-support target distribution. To this end, we formulate the extrapolation problem with a latent-variable model that embodies the minimal change principle in causal mechanisms. Under this formulation, we cast the extrapolation problem into a latent-variable identification problem. We provide realistic conditions on shift properties and the estimation objectives that lead to identification even when only one off-support target sample is available, tackling the most challenging scenarios. Our theory reveals the intricate interplay between the underlying manifold's smoothness and the shift properties. We showcase how our theoretical results inform the design of practical adaptation algorithms. Through experiments on both synthetic and real-world data, we validate our theoretical findings and their practical implications.
Simultaneously Learning Stochastic and Adversarial Bandits with General Graph Feedback
Fang Kong, Yichi Zhou, Shuai Li
The problem of online learning with graph feedback has been extensively studied in the literature due to its generality and potential to model various learning tasks. Existing works mainly study the adversarial and stochastic feedback separately. If the prior knowledge of the feedback mechanism is unavailable or wrong, such specially designed algorithms could suffer great loss. To avoid this problem, Erez and Koren (2021) try to optimize for both environments. However, they assume the feedback graphs are undirected and that each vertex has a self-loop, which compromises the generality of the framework and may not be satisfied in applications. With a general feedback graph, the observation of an arm may not be available when this arm is pulled, which makes the exploration more expensive and the algorithms more challenging to perform optimally in both environments. In this work, we overcome this difficulty with a new trade-off mechanism using a carefully designed proportion for exploration and exploitation. We prove the proposed algorithm simultaneously achieves poly(log T) regret in the stochastic setting and the minimax-optimal regret of Õ(T^{2/3}) in the adversarial setting, where T is the horizon and Õ hides parameters independent of T as well as logarithmic terms. To our knowledge, this is the first best-of-both-worlds result for general feedback graphs.
Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
Gupta, Gunshi, Yadav, Karmesh, Gal, Yarin, Batra, Dhruv, Kira, Zsolt, Lu, Cong, Rudner, Tim G. J.
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. Such capabilities are difficult to learn solely from task-specific data. This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations such as in CLIP have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding—a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using pre-trained text-to-image diffusion models, we construct Stable Control Representations which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned using Stable Control Representations are competitive with state-of-the-art representation learning approaches across a broad range of simulated control settings, encompassing challenging manipulation and navigation tasks. Most notably, we show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
Unity by Diversity: Improved Representation Learning for Multimodal VAEs
Sutter, Thomas, Meng, Yang, Agostini, Andrea, Chopard, Daphné, Fortin, Norbert, Vogt, Julia, Shahbaba, Babak, Mandt, Stephan
Variational Autoencoders for multimodal data hold promise for many tasks in data analysis, such as representation learning, conditional generation, and imputation. Current architectures either share the encoder output, decoder input, or both across modalities to learn a shared representation. Such architectures impose hard constraints on the model. In this work, we show that a better latent representation can be obtained by replacing these hard constraints with a soft constraint. We propose a new mixture-of-experts prior, softly guiding each modality's latent representation towards a shared aggregate posterior. This approach results in a superior latent representation and allows each encoding to better preserve information from its uncompressed original features. In extensive experiments on multiple benchmark datasets and two challenging real-world datasets, we show improved learned latent representations and imputation of missing data modalities compared to existing methods.
Accurate Point Cloud Registration with Robust Optimal Transport
Shen, Zhengyang, Feydy, Jean, Liu, Peirong, Curiale, Ariel H, San Jose Estepar, Ruben, San Jose Estepar, Raul, Niethammer, Marc
This work investigates the use of robust optimal transport (OT) for shape matching. Specifically, we show that recent OT solvers improve both optimization-based and deep learning methods for point cloud registration, boosting accuracy at an affordable computational cost. This manuscript starts with a practical overview of modern OT theory. We then provide solutions to the main difficulties in using this framework for shape matching. Finally, we showcase the performance of transport-enhanced registration models on a wide range of challenging tasks: rigid registration for partial shapes; scene flow estimation on the Kitti dataset; and nonparametric registration of lung vascular trees between inspiration and expiration. Our OT-based methods achieve state-of-the-art results on Kitti and for the challenging lung registration task, both in terms of accuracy and scalability. We also release PVT1010, a new public dataset of 1,010 pairs of lung vascular trees with densely sampled points. This dataset provides a challenging use case for point cloud registration algorithms with highly complex shapes and deformations. Our work demonstrates that robust OT enables fast pre-alignment and fine-tuning for a wide range of registration models, thereby providing a new key method for the computer vision toolbox. Our code and dataset are available online at: https://github.com/uncbiag/robot.
On the Learnability of Multilabel Ranking
Raman, Vinod, SUBEDI, UNIQUE, Tewari, Ambuj
Multilabel ranking is a central task in machine learning. However, the most fundamental question of learnability in a multilabel ranking setting with relevance-score feedback remains unanswered. In this work, we characterize the learnability of multilabel ranking problems in both batch and online settings for a large family of ranking losses. Along the way, we give two equivalence classes of ranking losses based on learnability that capture most losses used in practice.
A Limitation of the PAC-Bayes Framework
Livni, Roi, Moran, Shay
PAC-Bayes is a useful framework for deriving generalization bounds which was introduced by McAllester ('98). This framework has the flexibility of deriving distribution- and algorithm-dependent bounds, which are often tighter than VC-related uniform convergence bounds. In this manuscript we present a limitation of the PAC-Bayes framework. We demonstrate an easy learning task which is not amenable to a PAC-Bayes analysis. Specifically, we consider the task of linear classification in 1D; it is well-known that this task is learnable using just O(log(1/δ)/ε) examples. On the other hand, we show that this fact cannot be proved using a PAC-Bayes analysis: for any algorithm that learns 1-dimensional linear classifiers there exists a (realizable) distribution for which the PAC-Bayes bound is arbitrarily large.
An Efficient Approach for Assessing Hyperparameter Importance
Frank Hutter, Holger Hoos, Kevin Leyton-Brown
The performance of many machine learning methods depends critically on hyperparameter settings. Sophisticated Bayesian optimization methods have recently achieved considerable successes in optimizing these hyperparameters, in several cases surpassing the performance of human experts. However, blind reliance on such methods can leave end users without insight into the relative importance of different hyperparameters and their interactions. This paper describes efficient methods that can be used to gain such insight, leveraging random forest models fit on the data already gathered by Bayesian optimization. We first introduce a novel, linear-time algorithm for computing marginals of random forest predictions and then show how to leverage these predictions within a functional ANOVA framework, to quantify the importance of both single hyperparameters and of interactions between hyperparameters. We conducted experiments with prominent machine learning frameworks and state-of-the-art solvers for combinatorial problems. We show that our methods provide insight into the relationship between hyperparameter settings and performance, and demonstrate that—even in very high-dimensional cases—most performance variation is attributable to just a few hyperparameters.
GDeR: Safeguarding Efficiency, Balancing, and Robustness via Prototypical Graph Pruning
Zhang, Guibin, Dong, Haonan, Zhang, Yuchen, Li, Zhixun, Chen, Dingshuo, Wang, Kai, Chen, Tianlong, Liang, Yuxuan, Cheng, Dawei, Wang, Kun
Training high-quality deep models necessitates vast amounts of data, resulting in overwhelming computational and memory demands. Recently, data pruning, distillation, and coreset selection have been developed to streamline data volume by retaining, synthesizing, or selecting a small yet informative subset from the full set. Among these methods, data pruning incurs the least additional training cost and offers the most practical acceleration benefits. However, it is the most vulnerable, often suffering significant performance degradation with imbalanced or biased data schema, thus raising concerns about its accuracy and reliability in on-device deployment. Therefore, there is a looming need for a new data pruning paradigm that maintains the efficiency of previous practices while ensuring balance and robustness. Unlike the fields of computer vision and natural language processing, where mature solutions have been developed to address these issues, graph neural networks (GNNs) continue to struggle with increasingly large-scale, imbalanced, and noisy datasets, lacking a unified dataset pruning solution. To achieve this, we introduce GDeR, a novel dynamic soft-pruning method designed to update the training "basket" during the process using trainable prototypes. GDeR first constructs a well-modeled graph embedding hypersphere and then samples representative, balanced, and unbiased subsets from this embedding space, which achieves the goal we call Graph Training Debugging. Extensive experiments on four datasets across three GNN backbones demonstrate that GDeR (I) achieves or surpasses the performance of the full dataset with fewer training samples, (II) attains a lossless training speedup, and (III) outperforms state-of-the-art pruning methods in both imbalanced and noisy training scenarios.
What is being transferred in transfer learning?
Neyshabur, Behnam, Sedghi, Hanie, Zhang, Chiyuan
One desired capability for machines is the ability to transfer their understanding of one domain to another domain where data is (usually) scarce. Despite the wide adoption of transfer learning in many deep learning applications, we still do not understand what enables a successful transfer and which part of the network is responsible for that. In this paper, we provide new tools and analysis to address these fundamental questions. Through a series of analyses on transferring to block-shuffled images, we separate the effect of feature reuse from that of learning high-level statistics of data and show that some benefit of transfer learning comes from the latter. We show that when training from pre-trained weights, the model stays in the same basin in the loss landscape, and different instances of such a model are similar in feature space and close in parameter space.
Recovery Analysis for Plug-and-Play Priors using the Restricted Eigenvalue Condition
Liu, Jiaming, Asif, Salman, Wohlberg, Brendt, Kamilov, Ulugbek
The plug-and-play priors (PnP) and regularization by denoising (RED) methods have become widely used for solving inverse problems by leveraging pre-trained deep denoisers as image priors. While the empirical imaging performance and the theoretical convergence properties of these algorithms have been widely investigated, their recovery properties have not previously been theoretically analyzed. We address this gap by showing how to establish theoretical recovery guarantees for PnP/RED by assuming that the solution of these methods lies near the fixed-points of a deep neural network. We also present numerical results comparing the recovery performance of PnP/RED in compressive sensing against that of recent compressive sensing algorithms based on generative models. Our numerical results suggest that PnP with a pre-trained artifact removal network provides significantly better results compared to the existing state-of-the-art methods.
Retiring Adult: New Datasets for Fair Machine Learning
Ding, Frances, Hardt, Moritz, Miller, John, Schmidt, Ludwig
Although the fairness community has recognized the importance of data, researchers in the area primarily rely on UCI Adult when it comes to tabular data. Derived from a 1994 US Census survey, this dataset has appeared in hundreds of research papers where it served as the basis for the development and comparison of many algorithmic fairness interventions. We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity. Our primary contribution is a suite of new datasets derived from US Census surveys that extend the existing data ecosystem for research on fair machine learning. We create prediction tasks relating to income, employment, health, transportation, and housing. The data span multiple years and all states of the United States, allowing researchers to study temporal shift and geographic variation. We highlight a broad initial sweep of new empirical insights relating to trade-offs between fairness criteria, performance of algorithmic interventions, and the role of distribution shift based on our new datasets. Our findings inform ongoing debates, challenge some existing narratives, and point to future research directions.
Inverse Reinforcement Learning in a Continuous State Space with Formal Guarantees
Dexter, Gregory, Bello, Kevin, Honorio, Jean
Inverse Reinforcement Learning (IRL) is the problem of finding a reward function which describes observed/known expert behavior. The IRL setting is remarkably useful for automated control, in situations where the reward function is difficult to specify manually or as a means to extract agent preference. In this work, we provide a new IRL algorithm for the continuous state space setting with unknown transition dynamics by modeling the system using a basis of orthonormal functions. Moreover, we provide a proof of correctness and formal guarantees on the sample and time complexity of our algorithm. Finally, we present synthetic experiments to corroborate our theoretical guarantees.
Generative Cooperative Networks for Natural Language Generation
Sylvain Lamprier, Thomas Scialom, Antoine Chaffin, Vincent Claveau, Ewa Kijak, Jacopo Staiano, Benjamin Piwowarski
Generative Adversarial Networks (GANs) have achieved tremendous success in many continuous generation tasks, especially in the field of image generation. However, for discrete outputs such as language, optimizing GANs remains an open problem with many instabilities, as no gradient can be properly back-propagated from the discriminator output to the generator parameters. An alternative is to learn the generator network via reinforcement learning, using the discriminator signal as a reward, but such a technique suffers from moving rewards and vanishing gradient problems, and it often falls short compared to direct maximum-likelihood approaches. In this paper, we introduce Generative Cooperative Networks, in which the discriminator architecture is cooperatively used along with the generation policy to output samples of realistic texts for the task at hand. We give theoretical guarantees of convergence for our approach, and study various efficient decoding schemes to empirically achieve state-of-the-art results in two main NLG tasks.
Conformalization of Sparse Generalized Linear Models
Etash Guha, Eugene Ndiaye, Xiaoming Huo
Given a sequence of observable variables (x_1, y_1), …, (x_n, y_n), the conformal prediction method estimates a confidence set for y_{n+1} given x_{n+1} that is valid for any finite sample size by merely assuming that the joint distribution of the data is permutation invariant. Although attractive, computing such a set is computationally infeasible in most regression problems. Indeed, in these cases, the unknown variable y_{n+1} can take an infinite number of possible candidate values, and generating conformal sets requires retraining a predictive model for each candidate. In this paper, we focus on a sparse linear model with only a subset of variables for prediction and use numerical continuation techniques to approximate the solution path efficiently. The critical property we exploit is that the set of selected variables is invariant under a small perturbation of the input data. Therefore, it is sufficient to enumerate and refit the model only at the change points of the set of active features and smoothly interpolate the rest of the solution via a Predictor-Corrector mechanism. We show how our path-following algorithm accurately approximates conformal prediction sets and illustrate its performance using synthetic and real data examples.
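As a reference point, the brute-force full conformal procedure that the paper accelerates can be written directly: refit the lasso for every candidate y and keep those whose residual rank is not extreme. The paper's contribution is to avoid this loop by refitting only at the change points of the active set and interpolating in between. The grid, regularization level, and toy data below are illustrative assumptions.

import numpy as np
from sklearn.linear_model import Lasso

def full_conformal_lasso(X, y, x_new, y_grid, miscoverage=0.1, lam=0.1):
    """Naive full conformal set for a lasso: one refit per candidate value."""
    keep = []
    for y_cand in y_grid:
        Xa = np.vstack([X, x_new])            # augment data with the candidate point
        ya = np.append(y, y_cand)
        model = Lasso(alpha=lam).fit(Xa, ya)
        res = np.abs(ya - model.predict(Xa))
        # Conformal p-value: rank of the candidate's residual among all residuals.
        pval = np.mean(res >= res[-1])
        if pval > miscoverage:
            keep.append(y_cand)
    return (min(keep), max(keep)) if keep else None

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
beta = np.array([1.0, 0.0, 0.0, 2.0, 0.0])
y = X @ beta + 0.3 * rng.standard_normal(50)
x_new = rng.standard_normal(5)
print(full_conformal_lasso(X, y, x_new, np.linspace(-6, 6, 121)))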
Multi-Task Temporal Shift Attention Networks for On-Device Contactless Vitals Measurement
Liu, Xin, Fromm, Josh, Patel, Shwetak, McDuff, Daniel
Telehealth and remote health monitoring have become increasingly important during the SARS-CoV-2 pandemic and it is widely expected that this will have a lasting impact on healthcare practices. These tools can help reduce the risk of exposing patients and medical staff to infection, make healthcare services more accessible, and allow providers to see more patients. However, objective measurement of vital signs is challenging without direct contact with a patient. We present a video-based and on-device optical cardiopulmonary vital sign measurement approach. It leverages a novel multi-task temporal shift convolutional attention network (MTTS-CAN) and enables real-time cardiovascular and respiratory measurements on mobile platforms. We evaluate our system on an Advanced RISC Machine (ARM) CPU and achieve state-of-the-art accuracy while running at over 150 frames per second which enables real-time applications. Systematic experimentation on large benchmark datasets reveals that our approach leads to substantial (20%-50%) reductions in error and generalizes well across datasets.
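The temporal shift at the core of the architecture is a zero-parameter, zero-FLOP operation. A minimal sketch follows; the fraction of shifted channels and the tensor layout are illustrative, and MTTS-CAN's attention and multi-task heads are omitted.

import numpy as np

def temporal_shift(x, fold_div=8):
    """Shift a fraction of channels one frame forward/backward so that a plain
    2-D convolution applied per frame can mix information across time.
    x: (T, C, H, W) video tensor."""
    T, C, H, W = x.shape
    fold = C // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                 # shift future frames backward
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]  # shift past frames forward
    out[:, 2 * fold:] = x[:, 2 * fold:]             # leave remaining channels untouched
    return out

x = np.arange(2 * 4, dtype=float).reshape(2, 4, 1, 1)  # T=2 frames, C=4 channels
print(temporal_shift(x, fold_div=4).squeeze())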
Neural Networks Structured for Control Application to Aircraft Landing
Schley, Charles, Chauvin, Yves, Henkle, Van, Golden, Richard
We present a generic neural network architecture capable of controlling non-linear plants. The network is composed of dynamic, parallel, linear maps gated by non-linear switches. Using a recurrent form of the back-propagation algorithm, control is achieved by optimizing the control gains and task-adapted switch parameters. A mean quadratic cost function computed across a nominal plant trajectory is minimized along with performance constraint penalties. The approach is demonstrated for a control task consisting of landing a commercial aircraft in difficult wind conditions. We show that the network yields excellent performance while remaining within acceptable damping response constraints.
FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs
Agarwal, Alekh, Kakade, Sham, Krishnamurthy, Akshay, Sun, Wen
In order to deal with the curse of dimensionality in reinforcement learning (RL), it is common practice to make parametric assumptions where values or policies are functions of some low dimensional feature space. This work focuses on the representation learning question: how can we learn such features? Under the assumption that the underlying (unknown) dynamics correspond to a low rank transition matrix, we show how the representation learning question is related to a particular non-linear matrix decomposition problem. Structurally, we make precise connections between these low rank MDPs and latent variable models, showing how they significantly generalize prior formulations, such as block MDPs, for representation learning in RL. Algorithmically, we develop FLAMBE, which engages in exploration and representation learning for provably efficient RL in low rank transition models. On a technical level, our analysis eliminates reachability assumptions that appear in prior results on the simpler block MDP model and may be of independent interest.
Modeling Complex Cells in an Awake Macaque during Natural Image Viewing
Vinje, William, Gallant, Jack
We model the responses of cells in visual area V1 during natural vision. Our model consists of a classical energy mechanism whose output is divided by nonclassical gain control and texture contrast mechanisms. We apply this model to review movies, a stimulus sequence that replicates the stimulation a cell receives during free viewing of natural images. Data were collected from three cells using five different review movies, and the model was fit separately to the data from each movie. For the energy mechanism alone we find modest but significant correlations (rE = 0.41, 0.43, 0.59, 0.35) between model and data. These correlations are improved somewhat when we allow for suppressive surround effects (rE+G = 0.42, 0.56, 0.60, 0.37). In one case the inclusion of a delayed suppressive surround dramatically improves the fit to the data by modifying the time course of the model's response.
Understanding Representation of Deep Equilibrium Models from Neural Collapse Perspective
Sun, Haixiang, Shi, Ye
The Deep Equilibrium Model (DEQ), a typical implicit neural network, is notable for its memory efficiency and competitive performance compared to explicit neural networks. However, there has been relatively limited theoretical analysis of the representation of DEQ. In this paper, we utilize Neural Collapse (NC) as a tool to systematically analyze the representation of DEQ under both balanced and imbalanced conditions. NC is an interesting phenomenon in the neural network training process that characterizes the geometry of class features and classifier weights. While extensively studied in traditional explicit neural networks, the NC phenomenon has not received substantial attention in the context of implicit neural networks. We theoretically show that NC exists in DEQ under balanced conditions. Moreover, in imbalanced settings, despite the presence of minority collapse, DEQ demonstrates advantages over explicit neural networks. These advantages include the convergence of extracted features to the vertices of a simplex equiangular tight frame and self-duality properties under mild conditions, highlighting DEQ's superiority in handling imbalanced datasets. Finally, we validate our theoretical analyses through experiments in both balanced and imbalanced scenarios.
Deep Variational Graph Convolutional Recurrent Network for Multivariate Time Series Anomaly Detection
Wenchao Chen, Long Tian, Bo Chen, Liang Dai, Zhibin Duan, Mingyuan Zhou
Anomaly detection within multivariate time series (MTS) is an essential task in both data mining and service quality management. Many recent works on anomaly detection focus on designing unsupervised probabilistic models to extract robust normal patterns of MTS. In this paper, we model sensor dependency and stochasticity within MTS by developing an embedding-guided probabilistic generative network. We combine it with an adaptive variational graph convolutional recurrent network (VGCRN) to model both spatial and temporal fine-grained correlations in MTS. To explore hierarchical latent representations, we further extend VGCRN into a deep variational network, which captures multilevel information at different layers and is robust to noisy time series. Moreover, we develop an upward-downward variational inference scheme that considers both forecasting-based and reconstruction-based losses, achieving an accurate posterior approximation of latent variables with better MTS representations. The experiments verify the superiority of the proposed method over current state-of-the-art methods.
Frank-Wolfe with Subsampling Oracle
Thomas Kerdreux, Fabian Pedregosa, Alexandre d’Aspremont
We analyze two novel randomized variants of the Frank-Wolfe (FW) or conditional gradient algorithm. While classical FW algorithms require solving a linear minimization problem over the domain at each iteration, the proposed method only requires solving a linear minimization problem over a small subset of the original domain.
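A minimal sketch of the idea on a polytope given by an explicit list of atoms: the per-iteration linear minimization scans only a random fraction of the vertices. The sampled fraction, step size, and toy objective are illustrative assumptions, not the paper's exact variants.

import numpy as np

def fw_subsampled(grad, x0, vertices, n_iters=200, p=0.1, seed=0):
    """Frank-Wolfe where each linear minimization oracle call is solved over a
    random subsample of the atoms instead of the full domain."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    m = len(vertices)
    for t in range(n_iters):
        g = grad(x)
        idx = rng.choice(m, size=max(1, int(p * m)), replace=False)
        # LMO over the subsample: the sampled vertex most aligned with -g.
        s = vertices[idx[np.argmin(vertices[idx] @ g)]]
        x += 2.0 / (t + 2) * (s - x)   # standard FW step size
    return x

# Minimize ||x - b||^2 over the probability simplex (vertices e_1, ..., e_d).
d = 20
b = np.abs(np.random.default_rng(1).standard_normal(d)); b /= b.sum()
grad = lambda x: 2 * (x - b)
x = fw_subsampled(grad, np.full(d, 1.0 / d), np.eye(d))
print(np.round(np.abs(x - b).sum(), 3))   # small residual: x approaches b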
Directed Acyclic Transformer for Non-Autoregressive Machine Translation
Fei Huang, Hao Zhou, Yang Liu, Hang Li, Minlie Huang
Non-autoregressive Transformers (NATs) significantly reduce the decoding latency by generating all tokens in parallel. However, such independent predictions prevent NATs from capturing the dependencies between the tokens for generating multiple possible translations. In this paper, we propose the Directed Acyclic Transformer (DA-Transformer), which represents the hidden states in a Directed Acyclic Graph (DAG), where each path of the DAG corresponds to a specific translation. The whole DAG simultaneously captures multiple translations and facilitates fast predictions in a non-autoregressive fashion. Experiments on the raw training data of the WMT benchmark show that DA-Transformer substantially outperforms previous NATs by about 3 BLEU on average, making it the first NAT model that achieves competitive results with autoregressive Transformers without relying on knowledge distillation.
Latent Template Induction with Gumbel-CRFs
Fu, Yao, Tan, Chuanqi, Bi, Bin, Chen, Mosha, Feng, Yansong, Rush, Alexander
Learning to control the structure of sentences is a challenging problem in text generation. Existing work either relies on simple deterministic approaches or RL-based hard structures. We explore the use of structured variational autoencoders to infer latent templates for sentence generation using a soft, continuous relaxation in order to utilize reparameterization for training. Specifically, we propose a Gumbel-CRF, a continuous relaxation of the CRF sampling algorithm using a relaxed Forward-Filtering Backward-Sampling (FFBS) approach. As a reparameterized gradient estimator, the Gumbel-CRF gives more stable gradients than score-function based estimators. As a structured inference network, we show that it learns interpretable templates during training, which allows us to control the decoder during testing. We demonstrate the effectiveness of our methods with experiments on data-to-text generation and unsupervised paraphrase generation.
Bangs, Clicks, Snaps, Thuds and Whacks: An Architecture for Acoustic Transient Processing
Pineda, Fernando, Cauwenberghs, Gert, Edwards, R.
We propose a neuromorphic architecture for real-time processing of acoustic transients in analog VLSI. We show how judicious normalization of a time-frequency signal allows an elegant and robust implementation of a correlation algorithm. The algorithm uses binary multiplexing instead of analog-analog multiplication. This removes the need for analog storage and analog multiplication. Simulations show that the resulting algorithm has the same out-of-sample classification performance (~93% correct) as a baseline template-matching algorithm.
Robust Principal Component Analysis with Complex Noise
Qian Zhao, Deyu Meng, Zongben Xu, Wangmeng Zuo, Lei Zhang
The research on robust principal component analysis (RPCA) has been attracting much attention recently. The original RPCA model assumes sparse noise and uses the L_1-norm to characterize the error term. In practice, however, the noise is much more complex and it is not appropriate to simply use a certain L_p-norm for noise modeling. We propose a generative RPCA model under the Bayesian framework by modeling data noise as a mixture of Gaussians (MoG). The MoG is a universal approximator to continuous distributions and thus our model is able to fit a wide range of noises such as Laplacian, Gaussian, sparse noises and any combinations of them. A variational Bayes algorithm is presented to infer the posterior of the proposed model. All involved parameters can be recursively updated in closed form. The advantage of our method is demonstrated by extensive experiments on synthetic data, face modeling and background subtraction.
Set Based Stochastic Subsampling
Bruno Andreis, Seanie Lee, A. Tuan Nguyen, Juho Lee, Eunho Yang, Sung Ju Hwang
Deep models are designed to operate on huge volumes of high-dimensional data such as images. In order to reduce the volume of data these models must process, we propose a set-based two-stage end-to-end neural subsampling model that is jointly optimized with an arbitrary downstream task network (e.g. a classifier). In the first stage, we efficiently subsample candidate elements using conditionally independent Bernoulli random variables, capturing coarse-grained global information with set encoding functions; in the second stage, we perform conditionally dependent autoregressive subsampling of the candidate elements using Categorical random variables, modeling pair-wise interactions with set attention networks. We apply our method to feature and instance selection and show that it outperforms the relevant baselines under low subsampling rates on a variety of tasks including image classification, image reconstruction, function reconstruction and few-shot classification. Additionally, for nonparametric models such as Neural Processes that need to leverage the whole training data at inference time, we show that our method enhances the scalability of these models.
An Equivalence Between Data Poisoning and Byzantine Gradient Attacks
Sadegh Farhadkhani, Rachid Guerraoui, Lê-Nguyên Hoang, Oscar Villemaud
To study the resilience of distributed learning, the "Byzantine" literature considers a strong threat model where workers can report arbitrary gradients to the parameter server. Whereas this model helped obtain several fundamental results, it has sometimes been considered unrealistic, when the workers are mostly trustworthy machines. In this paper, we show a surprising equivalence between this model and data poisoning, a threat considered much more realistic. More specifically, we prove that every gradient attack can be reduced to data poisoning, in any personalized federated learning system with PAC guarantees (which we show are both desirable and realistic). This equivalence makes it possible to obtain new impossibility results on the resilience of any "robust" learning algorithm to data poisoning in highly heterogeneous applications, as corollaries of existing impossibility theorems on Byzantine machine learning. Moreover, using our equivalence, we derive a practical attack that we show (theoretically and empirically) can be very effective against classical personalized federated learning models.
A Bayesian-Symbolic Approach to Reasoning and Learning in Intuitive Physics
Xu, Kai, Srivastava, Akash, Gutfreund, Dan, Sosa, Felix, Ullman, Tomer, Tenenbaum, Josh, Sutton, Charles
Humans can reason about intuitive physics in fully or partially observed environments even after being exposed to a very limited set of observations. This sample-efficient intuitive physical reasoning is considered a core domain of human common sense knowledge. One hypothesis to explain this remarkable capacity posits that humans quickly learn approximations to the laws of physics that govern the dynamics of the environment. In this paper, we propose a Bayesian-symbolic framework (BSP) for physical reasoning and learning that is close to human-level sample-efficiency and accuracy. In BSP, the environment is represented by a top-down generative model of entities, which are assumed to interact with each other under unknown force laws over their latent and observed properties. BSP models each of these entities as random variables, and uses Bayesian inference to estimate their unknown properties. For learning the unknown forces, BSP leverages symbolic regression on a novel grammar of Newtonian physics in a bilevel optimization setup. These inference and regression steps are performed in an iterative manner using expectation-maximization, allowing BSP to simultaneously learn force laws while maintaining uncertainty over entity properties. We show that BSP is more sample-efficient compared to neural alternatives on controlled synthetic datasets, demonstrate BSP's applicability to real-world common sense scenes and study BSP's performance on tasks previously used to study human physical reasoning.
Nesting Particle Filters for Experimental Design in Dynamical Systems
Sahel Iqbal, Adrien Corenflos, Simo Särkkä, Hany Abdulsamad
In this paper, we propose a novel approach to Bayesian experimental design for non-exchangeable data that formulates it as risk-sensitive policy optimization. We develop the Inside-Out SMC algorithm, a nested sequential Monte Carlo technique to infer optimal designs, and embed it into a particle Markov chain Monte Carlo framework to perform gradient-based policy amortization. Our approach is distinct from other amortized experimental design techniques, as it does not rely on contrastive estimators. Numerical validation on a set of dynamical systems showcases the efficacy of our method in comparison to other state-of-the-art strategies.
Modern Hopfield Networks and Attention for Immune Repertoire Classification
Widrich, Michael, Schäfl, Bernhard, Pavlović, Milena, Ramsauer, Hubert, Gruber, Lukas, Holzleitner, Markus, Brandstetter, Johannes, Sandve, Geir Kjetil, Greiff, Victor, Hochreiter, Sepp, Klambauer, Günter
A central mechanism in machine learning is to identify, store, and recognize patterns. How to learn, access, and retrieve such patterns is crucial in Hopfield networks and the more recent transformer architectures. We show that the attention mechanism of transformer architectures is actually the update rule of modern Hopfield networks that can store exponentially many patterns. We exploit this high storage capacity of modern Hopfield networks to solve a challenging multiple instance learning (MIL) problem in computational biology: immune repertoire classification. In immune repertoire classification, a vast number of immune receptors are used to predict the immune status of an individual. This constitutes a MIL problem with an unprecedentedly massive number of instances, two orders of magnitude larger than currently considered problems, and with an extremely low witness rate. Accurate and interpretable machine learning methods solving this problem could pave the way towards new vaccines and therapies, which is currently a very relevant research topic intensified by the COVID-19 crisis. In this work, we present our novel method DeepRC that integrates transformer-like attention, or equivalently modern Hopfield networks, into deep learning architectures for massive MIL such as immune repertoire classification. We demonstrate that DeepRC outperforms all other methods with respect to predictive performance on large-scale experiments including simulated and real-world virus infection data and enables the extraction of sequence motifs that are connected to a given disease class. Source code and datasets: https://github.com/ml-jku/DeepRC
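The stated correspondence is easy to check in code: one update of a modern (continuous) Hopfield network, xi <- X softmax(beta * X^T xi), is exactly attention with the stored patterns X as keys and values and the state xi as query. A minimal retrieval sketch, with illustrative dimensions and inverse temperature beta:

import numpy as np

def hopfield_retrieve(X, xi, beta=8.0, n_steps=3):
    """Iterate the modern Hopfield update; each step is one attention pass."""
    for _ in range(n_steps):
        a = beta * (X.T @ xi)                    # similarity to stored patterns
        a = np.exp(a - a.max()); a /= a.sum()    # softmax over stored patterns
        xi = X @ a                               # attention-weighted readout
    return xi

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 10))                    # 10 stored patterns, dim 64
query = X[:, 3] + 0.4 * rng.standard_normal(64)      # noisy version of pattern 3
out = hopfield_retrieve(X, query)
print(np.argmax(X.T @ out))                          # 3: the stored pattern is retrieved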
Why Normalizing Flows Fail to Detect Out-of-Distribution Data
Kirichenko, Polina, Izmailov, Pavel, Wilson, Andrew G.
Detecting out-of-distribution (OOD) data is crucial for robust machine learning systems. Normalizing flows are flexible deep generative models that often surprisingly fail to distinguish between in- and out-of-distribution data: a flow trained on pictures of clothing assigns higher likelihood to handwritten digits. We investigate why normalizing flows perform poorly for OOD detection. We demonstrate that flows learn local pixel correlations and generic image-to-latent-space transformations which are not specific to the target image datasets, focusing on flows based on coupling layers. We show that by modifying the architecture of flow coupling layers we can bias the flow towards learning the semantic structure of the target data, improving OOD detection. Our investigation reveals that properties that enable flows to generate high-fidelity images can have a detrimental effect on OOD detection.