A note on the relations between mixture models, maximum-likelihood and entropic optimal transport
Titouan Vayer, Etienne Lasalle
This note aims to demonstrate that performing maximum-likelihood estimation for a mixture model is equivalent to minimizing over the parameters an optimal transport problem with entropic regularization. The objective is pedagogical: we seek to present this already known result in a concise and hopefully simple manner. We give an illustration with Gaussian mixture models by showing that the standard EM algorithm is a specific block-coordinate descent on an optimal transport loss.
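As a rough illustration of the correspondence the note describes, the sketch below (Python with NumPy/SciPy; not the authors' code, and the optimal-transport framing is only indicated in comments) runs EM for a Gaussian mixture as an alternating minimization: the E-step solves for the coupling (the responsibilities) with one marginal fixed, and the M-step minimizes the same objective over the mixture parameters.

```python
# Minimal sketch: EM for a Gaussian mixture written as block-coordinate descent.
# In the note's framing, the E-step is the entropic-OT step with one marginal fixed;
# the M-step minimizes the same objective over the mixture parameters.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, K, replace=False)]
    covs = np.array([np.cov(X.T).reshape(d, d) + 1e-6 * np.eye(d) for _ in range(K)])
    weights = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities (the coupling, given the current parameters)
        log_p = np.stack([
            np.log(weights[k]) + multivariate_normal.logpdf(X, means[k], covs[k])
            for k in range(K)
        ], axis=1)                                    # shape (n, K)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means and covariances given the coupling
        Nk = resp.sum(axis=0)
        weights = Nk / n
        means = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - means[k]
            covs[k] = (resp[:, k][:, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
    return weights, means, covs
```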
The Unified Non-Convex Framework for Robust Causal Inference: Overcoming the Gaussian Barrier and Optimization Fragility
Eichi Uehara
This document proposes a Unified Robust Framework that re-engineers the estimation of the Average Treatment Effect on the Overlap (ATO). It synthesizes gamma-Divergence for outlier robustness, Graduated Non-Convexity (GNC) for global optimization, and a "Gatekeeper" mechanism to address the impossibility of higher-order orthogonality in Gaussian regimes.
Improving the Predictive Performances of $k$ Nearest Neighbors Learning by Efficient Variable Selection
Eddie Pei, Ernest Fokoue
This paper computationally demonstrates a sharp improvement in the predictive performance of $k$ nearest neighbors thanks to an efficient forward selection of the predictor variables. We show on both simulated and real-world data that this novel approach repeatedly outperforms regression models under stepwise selection.
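A minimal sketch of this kind of procedure, assuming scikit-learn is available; the paper's actual selection criterion and stopping rule may differ. Predictors are added greedily as long as they improve the cross-validated error of a kNN regressor.

```python
# Minimal sketch (assumption: scikit-learn; not necessarily the paper's criterion):
# greedy forward selection of predictors for kNN regression, scored by CV error.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

def forward_knn_selection(X, y, k=5, cv=5):
    n_features = X.shape[1]
    selected, best_score = [], -np.inf
    while True:
        candidates = [j for j in range(n_features) if j not in selected]
        if not candidates:
            break
        scores = {
            j: cross_val_score(KNeighborsRegressor(n_neighbors=k),
                               X[:, selected + [j]], y, cv=cv,
                               scoring="neg_mean_squared_error").mean()
            for j in candidates
        }
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:   # stop when no candidate improves the CV score
            break
        selected.append(j_best)
        best_score = scores[j_best]
    return selected
```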
A Note on Comparison of F-measures
Wei Ju, Wenxin Jiang
We comment on a recent TKDE paper "Linear Approximation of F-measure for the Performance Evaluation of Classification Algorithms on Imbalanced Data Sets", and make two improvements related to comparison of F-measures for two prediction rules.
Deep orthogonal linear networks are shallow
Pierre Ablin
We consider the problem of training a deep orthogonal linear network, which consists of a product of orthogonal matrices, with no non-linearity in-between. We show that training the weights with Riemannian gradient descent is equivalent to training the whole factorization by gradient descent. This means that there is no effect of overparametrization and implicit bias at all in this setting: training such a deep, overparametrized, network is perfectly equivalent to training a one-layer shallow network.
The Elliptical Potential Lemma Revisited
Alexandra Carpentier, Claire Vernade, Yasin Abbasi-Yadkori
This note proposes a new proof and new perspectives on the so-called Elliptical Potential Lemma. This result is important in online learning, especially for linear stochastic bandits. The original proof of the result, however short and elegant, does not give much flexibility on the type of potentials considered and we believe that this new interpretation can be of interest for future research in this field.
Multilevel Monte Carlo estimation of log marginal likelihood
Takashi Goda, Kei Ishikawa
In this short note we provide an unbiased multilevel Monte Carlo estimator of the log marginal likelihood and discuss its application to variational Bayes.
Multilayer Perceptron Algebra
Zhao Peng
Artificial Neural Networks (ANNs) have been phenomenally successful on various pattern recognition tasks. However, the design of neural networks relies heavily on the experience and intuitions of individual developers. In this article, the author introduces a mathematical structure called MLP algebra on the set of all Multilayer Perceptron Neural Networks (MLPs), which can serve as a guiding principle for building MLPs suited to particular data sets, and for building complex MLPs from simpler ones.
The energy landscape of a simple neural network
Anthony Collins Gamst, Alden Walker
We explore the energy landscape of a simple neural network. In particular, we expand upon previous work demonstrating that the empirical complexity of fitted neural networks is vastly less than a naive parameter count would suggest and that this implicit regularization is actually beneficial for generalization from fitted models.
A Contemporary Overview of Probabilistic Latent Variable Models
Rick Farouni
In this paper we provide a conceptual overview of latent variable models within a probabilistic modeling framework, an overview that emphasizes the compositional nature and the interconnectedness of the seemingly disparate models commonly encountered in statistical practice.
RSSL: Semi-supervised Learning in R
Jesse H. Krijthe
In this paper, we introduce a package for semi-supervised learning research in the R programming language called RSSL. We cover the purpose of the package, the methods it includes and comment on their use and implementation. We then show, using several code examples, how the package can be used to replicate well-known results from the semi-supervised learning literature.
A Simple Approach to Sparse Clustering
Ery Arias-Castro, Xiao Pu
Consider the problem of sparse clustering, where it is assumed that only a subset of the features are useful for clustering purposes. In the framework of the COSA method of Friedman and Meulman, subsequently improved in the form of the Sparse K-means method of Witten and Tibshirani, a natural and simpler hill-climbing approach is introduced. The new method is shown to be competitive with these two methods and others.
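A hedged sketch of one possible hill-climbing scheme in this spirit (not the authors' algorithm): swap features in and out of a subset of fixed size, keeping a swap whenever it lowers the k-means within-cluster sum of squares computed on standardized features.

```python
# Illustrative sketch only: swap-based hill climbing over feature subsets of fixed
# size s, scored by the k-means within-cluster sum of squares on standardized data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def hill_climb_sparse_kmeans(X, n_clusters, s, n_passes=5, seed=0):
    X = StandardScaler().fit_transform(X)
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    subset = list(rng.choice(p, s, replace=False))

    def score(feats):
        km = KMeans(n_clusters=n_clusters, n_init=5, random_state=seed)
        km.fit(X[:, feats])
        return km.inertia_  # within-cluster sum of squares (lower is better)

    best = score(subset)
    for _ in range(n_passes):
        improved = False
        for out in list(subset):
            for into in range(p):
                if into in subset:
                    continue
                trial = [f for f in subset if f != out] + [into]
                val = score(trial)
                if val < best:
                    subset, best, improved = trial, val, True
        if not improved:
            break
    return sorted(subset), best
```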
On the convergence rate of the three operator splitting scheme
Fabian Pedregosa
The three operator splitting scheme was recently proposed by [Davis and Yin, 2015] as a method to optimize composite objective functions with one convex smooth term and two convex (possibly non-smooth) terms whose proximity operators are accessible. In this short note we provide an alternative proof of the sublinear rate of convergence of this method.
Graphical Exponential Screening
Zhe Liu
In high dimensions we propose and analyze an aggregation estimator of the precision matrix for Gaussian graphical models. This estimator, called graphical Exponential Screening (gES), linearly combines a suitable set of individual estimators with different underlying graphs, and balances the estimation error and sparsity. We study the risk of this aggregation estimator and show that it is comparable to that of the best estimator based on a single graph, chosen by an oracle. Numerical performance of our method is investigated using both simulated and real datasets, in comparison with some state-of-the-art estimation procedures.
Spectral Convergence Rate of Graph Laplacian
Xu Wang
Laplacian eigenvectors of the graph constructed from a data set are used in many spectral manifold learning algorithms such as diffusion maps and spectral clustering. Given a graph constructed from a random sample of a $d$-dimensional compact submanifold $M$ in $\mathbb{R}^D$, we establish the spectral convergence rate of the graph Laplacian. This implies the consistency of the spectral clustering algorithm via a standard perturbation argument. A simple numerical study indicates the necessity of a denoising step before applying spectral algorithms.
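For concreteness, a minimal sketch (illustrative only, not the paper's construction) of the object under study: a Gaussian-kernel graph Laplacian built from a point cloud, together with its leading eigenvectors as used in diffusion maps and spectral clustering.

```python
# Illustrative sketch: Gaussian-kernel graph Laplacian of a point cloud and its
# leading eigenvectors (smallest eigenvalues first).
import numpy as np
from scipy.spatial.distance import cdist

def graph_laplacian_eigs(X, bandwidth, n_eig=5):
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * bandwidth ** 2))  # affinity matrix
    d = W.sum(axis=1)
    L = np.diag(d) - W                          # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    return eigvals[:n_eig], eigvecs[:, :n_eig]
```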
Variable and Fixed Interval Exponential Smoothing
Javier R. Movellan
Exponential smoothers are a simple and memory-efficient way to compute running averages of time series. Here we define exponential smoothers for signals observed at constant and variable intervals and describe their practical properties.
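A minimal sketch of both cases, under the common assumption that the variable-interval smoothing factor is set from the elapsed time as alpha = 1 - exp(-dt/tau); the paper's exact formulation may differ.

```python
# Illustrative sketch: exponential smoothing at fixed and at variable intervals.
# For variable intervals the effective smoothing factor depends on the elapsed time
# dt via alpha = 1 - exp(-dt / tau), which reduces to a constant alpha for fixed dt.
import numpy as np

def exp_smooth_fixed(x, alpha):
    """Running average of x with constant smoothing factor alpha in (0, 1]."""
    s = np.empty(len(x), dtype=float)
    s[0] = x[0]
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s

def exp_smooth_variable(times, x, tau):
    """Running average of x observed at irregular times, with time constant tau."""
    s = np.empty(len(x), dtype=float)
    s[0] = x[0]
    for t in range(1, len(x)):
        alpha = 1.0 - np.exp(-(times[t] - times[t - 1]) / tau)
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s
```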
Improved graph Laplacian via geometric self-consistency
Dominique Perrault-Joncas, Marina Meila
We address the problem of setting the kernel bandwidth used by Manifold Learning algorithms to construct the graph Laplacian. Exploiting the connection between manifold geometry, represented by the Riemannian metric, and the Laplace-Beltrami operator, we set the bandwidth by optimizing the Laplacian's ability to preserve the geometry of the data. Experiments show that this principled approach is effective and robust.
Most Correlated Arms Identification
Che-Yu Liu, Sébastien Bubeck
We study the problem of finding the most mutually correlated arms among many arms. We show that adaptive arms sampling strategies can have significant advantages over the non-adaptive uniform sampling strategy. Our proposed algorithms rely on a novel correlation estimator. The use of this accurate estimator allows us to get improved results for a wide range of problem instances.
Chebyshev Greedy Algorithm in convex optimization
Vladimir Temlyakov
The Chebyshev Greedy Algorithm is a generalization of the well-known Orthogonal Matching Pursuit, defined in a Hilbert space, to the setting of Banach spaces. We apply this algorithm to construct sparse approximate solutions (with respect to a given dictionary) to convex optimization problems. Rate-of-convergence results in the style of Lebesgue-type inequalities are proved.
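For reference, a sketch of the Hilbert-space special case that the algorithm generalizes, namely standard Orthogonal Matching Pursuit over a finite dictionary; this is illustrative only and is not the Banach-space algorithm itself.

```python
# Illustrative sketch: standard Orthogonal Matching Pursuit in a Hilbert space,
# the special case generalized by the Chebyshev Greedy Algorithm. Greedily picks
# the atom most correlated with the residual, then re-fits on the selected atoms.
import numpy as np

def omp(D, y, n_atoms):
    """D: (m, p) dictionary with unit-norm columns; y: (m,) target; n_atoms: sparsity."""
    residual, support = y.copy(), []
    coef = np.zeros(0)
    for _ in range(n_atoms):
        correlations = np.abs(D.T @ residual)
        correlations[support] = -np.inf         # do not reselect an atom
        support.append(int(np.argmax(correlations)))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    return support, coef
```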
Sparsity-accuracy trade-off in MKL
Ryota Tomioka, Taiji Suzuki
We empirically investigate the best trade-off between sparse and uniformly-weighted multiple kernel learning (MKL) using the elastic-net regularization on real and simulated datasets. We find that the best trade-off parameter depends not only on the sparsity of the true kernel-weight spectrum but also on the linear dependence among kernels and the number of samples.