
Contents
Awards
Printed Proceedings
Online Proceedings
Crossconference papers
Awards
In honor of its 25th anniversary, the Machine Learning Journal is sponsoring
the awards for the student authors of the best and distinguished papers.
Good Job and Congratulations!
Best Paper
The Best Paper award goes to Kevin Waugh, Brian Ziebart and Drew Bagnell for
Computational Rationalization: The Inverse Equilibrium Problem.
Distinguished Paper Awards
Abhimanyu Das and David Kempe:
Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation
and Dictionary Selection
Miguel LazaroGredilla and Michalis Titsias:
Variational Heteroscedastic Gaussian Process Regression
Jascha SohlDickstein, Peter Battaglino, and Michael DeWeese:
Minimum Probability Flow Learning
Distinguished Application Paper Awards
Lauren Hannah and David Dunson:
Approximate Dynamic Programming for Storage Problems
Sean Gerrish and David Blei:
Predicting Legislative Roll Calls from Text
Richard Socher, Cliff ChiungYu Lin, Andrew Ng, and Chris Manning:
Parsing Natural Scenes and Natural Language with Recursive Neural Networks
TestofTime Award
This award is given to papers that time and hindsight proved to be of lasting value
to the Machine Learning community. This year, we are recognizing the seminal paper
of CRFs. John D. Lafferty, Andrew McCallum, Fernando C. N. Pereira.
Conditional Random Fields:
Probabilistic Models for Segmenting and Labeling Sequence Data.
Printed Proceedings
You can purchase the printed proceedings online for $100
and pick them up during the conference at the registration office, aka the Grand Ballroom Coat Check.
Online Proceedings
These are the abstracts of the accepted papers. You can download
the whole proceedings (87MB zip) and
the summary bibfile.
Hashing with Graphs
Wei Liu, Jun Wang, Sanjiv Kumar, ShihFu Chang
Abstract:Hashing is becoming increasingly popular for efficient nearest neighbor search in massive databases. However, learning short codes that yield good search performance is still a challenge. Moreover, in many cases realworld data lives on a lowdimensional manifold, which should be taken into account to capture meaningful nearest neighbors. In this paper, we propose a novel graphbased hashing method which automatically discovers the neighborhood structure inherent in the data to learn appropriate compact codes. To make such an approach computationally feasible, we utilize Anchor Graphs to obtain tractable lowrank adjacency matrices. Our formulation allows constant time hashing of a new data point by extrapolating graph Laplacian eigenvectors to eigenfunctions. Finally, we describe a hierarchical threshold learning procedure in which each eigenfunction yields multiple bits, leading to higher search accuracy. Experimental comparison with the other stateoftheart methods on two large datasets demonstrates the efficacy of the proposed method.
[download] [bibTeX] [discuss]
Efficient Sparse Modeling with Automatic Feature Grouping
Wenliang Zhong, James Kwok
Abstract:The grouping of features is highly beneficial in learning with highdimensional data. It reduces the variance in the estimation and improves the stability of feature selection, leading to improved generalization. Moreover, it can also help in data understanding and interpretation. OSCAR is a recent sparse modeling tool that achieves this by using a $\ell_1$regularizer and a pairwise $\ell_\infty$regularizer. However, its optimization is computationally expensive. In this paper, we propose an efficient solver based on the accelerated gradient methods. We show that its key projection step can be solved by a simple iterative group merging algorithm. It is highly efficient and reduces the empirical time complexity from $O(d^3 \sim d^5)$ for the existing solvers to just $O(d)$, where $d$ is the number of features. Experimental results on toy and realworld data sets demonstrate that OSCAR is a competitive sparse modeling approach with the added ability of automatic feature grouping.
[download] [bibTeX] [discuss]
MultiLabel Classification on Tree and DAGStructured Hierarchies
Wei Bi, James Kwok
Abstract:Many realworld applications involve multilabel classification, in which the labels are organized in the form of a tree or directed acyclic graph (DAG). However, current research efforts typically ignore the label dependencies or can only exploit the dependencies in treestructured hierarchies. In this paper, we present a novel hierarchical multilabel classification algorithm which can be used on both tree and DAGstructured hierarchies. The key idea is to formulate the search for the optimal consistent multilabel as the finding of the best subgraph in a tree/DAG. Using a simple greedy strategy, the proposed algorithm is computationally efficient, easy to implement, does not suffer from the problem of insufficient/skewed training data in classifier training, and can be readily used on large hierarchies. Theoretical results guarantee the optimality of the obtained solution. Experiments are performed on a large number of functional genomics data sets. The proposed method consistently outperforms the stateoftheart method on both tree and DAGstructured hierarchies.
[download] [bibTeX] [discuss]
A Graphbased Framework for MultiTask MultiView Learning
Jingrui He, Rick Lawrence
Abstract:Many realworld problems exhibit dualheterogeneity. A single learning task might have features in multiple views (i.e., feature heterogeneity); multiple learning tasks might be related with each other through one or more shared views (i.e., task heterogeneity). Existing multitask learning or multiview learning algorithms only capture one type of heterogeneity. In this paper, we introduce MultiTask MultiView (M^2TV) learning for such complicated learning problems with both feature heterogeneity and task heterogeneity. We propose a graphbased framework (GraM^2) to take full advantage of the dualheterogeneous nature. Our framework has a natural connection to Reproducing Kernel Hilbert Space (RKHS). Furthermore, we propose an iterative algorithm (IteM^2) for GraM^2 framework, and analyze its optimality, convergence and time complexity. Experimental results on various real data sets demonstrate its effectiveness.
[download] [bibTeX] [discuss]
GoDec: Randomized Lowrank & Sparse Matrix Decomposition in Noisy Case
Tianyi Zhou, Dacheng Tao, University of Technology
Abstract:Lowrank and sparse structures have been profoundly studied in matrix completion and compressed sensing. In this paper, we develop ``Go Decomposition'' (GoDec) to efficiently and robustly estimate the lowrank part $L$ and the sparse part $S$ of a matrix $X=L+S+G$ with noise $G$. GoDec alternatively assigns the lowrank approximation of $XS$ to $L$ and the sparse approximation of $XL$ to $S$. The algorithm can be significantly accelerated by bilateral random projections (BRP). We also propose GoDec for matrix completion as an important variant. We prove that the objective value $\XLS\_F^2$ converges to a local minimum, while $L$ and $S$ linearly converge to local optimums. Theoretically, we analyze the influence of $L$, $S$ and $G$ to the asymptotic/convergence speeds in order to discover the robustness of GoDec. Empirical studies suggest the efficiency, robustness and effectiveness of GoDec comparing with representative matrix decomposition and completion tools, e.g., Robust PCA and OptSpace.
[download] [bibTeX] [discuss]
Unimodal Bandits
Jia Yuan Yu, Shie Mannor
Abstract:We consider multiarmed bandit problems where the expected reward is unimodal over partially ordered arms. In particular, the arms may belong to a continuous interval or correspond to vertices in a graph, where the graph structure represents similarity in rewards. The unimodality assumption has an important advantage: we can determine if a given arm is optimal by sampling the possible directions around it. This property allows us to quickly and efficiently find the optimal arm and detect abrupt changes in the reward distributions. For the case of bandits on graphs, we incur a regret proportional to the maximal degree and the diameter of the graph, instead of the total number of vertices.
[download] [bibTeX] [discuss]
Learning Output Kernels with Block Coordinate Descent
Francesco Dinuzzo, Cheng Soon Ong, Peter Gehler, Gianluigi Pillonetto
Abstract:We propose a method to learn simultaneously a vectorvalued function and a kernel between its components. The obtained kernel can be used both to improve learning performances and to reveal structures in the output space which may be important in their own right. Our method is based on the solution of a suitable regularization problem over a reproducing kernel Hilbert space (RKHS) of vectorvalued functions. Although the regularized risk functional is nonconvex, we show that it is invex, implying that all local minimizers are global minimizers. We derive a blockwise coordinate descent method that efficiently exploits the structure of the objective functional. Then, we empirically demonstrate that the proposed method can improve classification accuracy. Finally, we provide a visual interpretation of the learned kernel matrix for some well known datasets.
[download] [bibTeX] [discuss]
Vectorvalued Manifold Regularization
Ha Quang Minh, Vikas Sindhwani
Abstract:We consider the general problem of learning an unknown functional dependency, f : X>Y, between a structured input space X and a structured output space Y, from labeled and unlabeled examples. We formulate this problem in terms of datadependent regularization in Vectorvalued Reproducing Kernel Hilbert Spaces (Micchelli & Pontil, 2005) which elegantly extend familiar scalarvalued kernel methods to the general setting where Y has a Hilbert space structure. Our methods provide a natural extension of Manifold Regularization (Belkin et al., 2006) algorithms to also exploit output interdependencies while enforcing smoothness with respect to input data geometry. We propose a class of matrixvalued kernels which allow efficient implementations of our algorithms via the use of numerical solvers for Sylvester matrix equations. On multilabel image annotation and text classification problems, we find favorable empirical comparisons against several competing alternatives.
[download] [bibTeX] [discuss]
On InformationMaximization Clustering: Tuning Parameter Selection and Analytic Solution
Masashi Sugiyama, Makoto Yamada, Manabu Kimura, Hirotaka Hachiya
Abstract:Informationmaximization clustering learns a probabilistic classifier in an unsupervised manner so that mutual information between feature vectors and cluster assignments is maximized. A notable advantage of this approach is that it only involves continuous optimization of model parameters, which is substantially easier to solve than discrete optimization of cluster assignments. However, existing methods still involve nonconvex optimization problems, and therefore finding a good local optimal solution is not straightforward in practice. In this paper, we propose an alternative informationmaximization clustering method based on a squaredloss variant of mutual information. This novel approach gives a clustering solution analytically in a computationally efficient way via kernel eigenvalue decomposition. Furthermore, we provide a practical model selection procedure that allows us to objectively optimize tuning parameters included in the kernel function. Through experiments, we demonstrate the usefulness of the proposed approach.
[download] [bibTeX] [discuss]
On tracking portfolios with certainty equivalents on a generalization of Markowitz model: the Fool, the Wise and the Adaptive
Richard Nock, Brice Magdalou, Eric Briys, Frank Nielsen
Abstract:Portfolio allocation theory has been heavily influenced by a major contribution of Harry Markowitz in the early fifties: the meanvariance approach. While there has been a continuous line of works in online learning portfolios over the past decades, very few works have really tried to cope with Markowitz model. A major drawback of the meanvariance approach is that it is approximationfree only when stock returns obey a Gaussian distribution, an assumption known not to hold in real data. In this paper, we first alleviate this assumption, and rigorously lift the meanvariance model to a more general meandivergence model in which stock returns are allowed to obey any exponential family of distributions. We then devise a general online learning algorithm in this setting. We prove for this algorithm the first lower bounds on the most relevant quantity to be optimized in the framework of Markowitz model: the certainty equivalents. Experiments on four realworld stock markets display its ability to track portfolios whose cumulated returns exceed those of the best stock by orders of magnitude.
[download] [bibTeX] [discuss]
Multiple Instance Learning with Manifold Bags
Boris Babenko, Nakul Verma, Piotr Dollar, Serge Belongie
Abstract:In many machine learning applications, labeling every instance of data is burdensome. Multiple Instance Learning (MIL), in which training data is provided in the form of labeled bags rather than labeled instances, is one approach for a more relaxed form of supervised learning. Though much progress has been made in analyzing MIL problems, existing work considers bags that have a finite number of instances. In this paper we argue that in many applications of MIL (e.g. image, audio, e.t.c.) the bags are better modeled as low dimensional manifolds in high dimensional feature space. We show that the geometric structure of such manifold bags affects PAClearnability. We discuss how a learning algorithm that is designed for finite sized bags can be adapted to learn from manifold bags. Furthermore, we propose a simple heuristic that reduces the memory requirements of such algorithms. Our experiments on realworld data validate our analysis and show that our approach works well.
[download] [bibTeX] [discuss]
Eigenvalue Sensitive Feature Selection
Yi Jiang, Jiangtao Ren
Abstract:In recent years, some spectral feature selection methods are proposed to choose those features with high power of preserving sample similarity. However, when there exist lots of irrelevant or noisy features in data, the similarity matrix constructed from all the unweighted features may be not reliable, which then misleads existing spectral feature selection methods to select 'wrong' features. To solve this problem, we propose that feature importance should be evaluated according to their impacts on similarity matrix, which means features with high impacts on similarity matrix should be chosen as important ones. Since graph Laplacian\cite{luxbury2007} is defined on the similarity matrix, then the impact of each feature on similarity matrix can be reflected on the change of graph Laplacian, especially on its eigensystem. Based on this point of view, we propose an Eigenvalue Sensitive Criteria (EVSC) for feature selection, which aims at seeking those features with high impact on graph Laplacian's eigenvalues. Empirical analysis demonstrates our proposed method outperforms some traditional spectral feature selection methods.
[download] [bibTeX] [discuss]
Large Scale Text Classification using Semisupervised Multinomial Naive Bayes
Jiang Su, Jelber Sayyad Shirab, Stan Matwin
Abstract:Numerous semisupervised learning methods have been proposed to augment Multinomial Naive Bayes (MNB) using unlabeled documents, but their use in practice is often limited due to implementation difficulty, inconsistent prediction performance, or high computational cost. In this paper, we propose a new, very simple semisupervised extension of MNB, called Semisupervised Frequency Estimate (SFE). Our experiments show that it consistently improves MNB with additional data (labeled or unlabeled) in terms of AUC and accuracy, which is not the case when combining MNB with Expectation Maximization (EM). We attribute this to the fact that SFE consistently produces better conditional log likelihood values than both EM+MNB and MNB in labeled training data.
[download] [bibTeX] [discuss]
Enhanced Gradient and Adaptive Learning Rate for Training Restricted Boltzmann Machines
KyungHyun Cho, Tapani Raiko, Alexander Ilin
Abstract:Boltzmann machines are often used as building blocks in greedy learning of deep networks. However, training even a simplified model, known as restricted Boltzmann machine (RBM), can be extremely laborious: Traditional learning algorithms often converge only with the right choice of the learning rate scheduling and the scale of the initial weights. They are also sensitive to specific data representation: An equivalent RBM can be obtained by flipping some bits and changing the weights and biases accordingly, but traditional learning rules are not invariant to such transformations. Without careful tuning of these training settings, traditional algorithms can easily get stuck at plateaus or even diverge. In this work, we present an enhanced gradient which is derived such that it is invariant to bitflipping transformations. We also propose a way to automatically adjust the learning rate by maximizing a local likelihood estimate. Our experiments confirm that the proposed improvements yield more stable training of RBMs.
[download] [bibTeX] [discuss]
Dynamic Tree Block Coordinate Ascent
Daniel Tarlow, Dhruv Batra, Pushmeet Kohli, Vladimir Kolmogorov
Abstract:This paper proposes a novel Linear Programming (LP) based algorithm, called Dynamic TreeBlock Coordinate Ascent (DTBCA), for performing maximum a posteriori (MAP) inference in probabilistic graphical models. Unlike traditional message passing algorithms, which operate uniformly on the whole factor graph, our method dynamically chooses regions of the factor graph on which to focus messagepassing efforts. We propose two criteria for selecting regions, including an efficiently computable upperbound on the increase in the objective possible by passing messages in any particular region. This bound is derived from the theory of primaldual methods from combinatorial optimization, and the forest that maximizes the bounds can be chosen efficiently using a maximumspanningtreelike algorithm. Experimental results show that our dynamic schedules significantly speed up stateofthe art LPbased messagepassing algorithms on a wide variety of realworld problems.
[download] [bibTeX] [discuss]
Implementing regularization implicitly via approximate eigenvector computation
Michael Mahoney, Lorenzo Orecchia
Abstract:Regularization is a powerful technique for extracting useful information from noisy data. Typically, it is implemented by adding some sort of norm constraint to an objective function and then exactly optimizing the modified objective function. This procedure often leads to optimization problems that are computationally more expensive than the original problem, a fact that is clearly problematic if one is interested in largescale applications. On the other hand, a large body of empirical work has demonstrated that heuristics, and in some cases approximation algorithms, developed to speed up computations sometimes have the sideeffect of performing regularization implicitly. Thus, we consider the question: What is the regularized optimization objective that an approximation algorithm is exactly optimizing? We address this question in the context of computing approximations to the smallest nontrivial eigenvector of a graph Laplacian; and we consider three randomwalkbased procedures: one based on the heat kernel of the graph, one based on computing the the PageRank vector associated with the graph, and one based on a truncated lazy random walk. In each case, we provide a precise characterization of the manner in which the approximation method can be viewed as implicitly computing the exact solution to a regularized problem. Interestingly, the regularization is not on the usual vector form of the optimization problem, but instead it is on a related semidefinite program.
[download] [bibTeX] [discuss]
Parsing Natural Scenes and Natural Language with Recursive Neural Networks
Richard Socher, Cliff ChiungYu Lin, Andrew Ng, Chris Manning
Abstract:Recursive structure is commonly found in the inputs of different modalities such as natural scene images or natural language sentences. Discovering this recursive structure helps us to not only identify the units that an image or sentence contains but also how they interact to form a whole. We introduce a maxmargin structure prediction architecture based on recursive neural networks that can successfully recover such structure both in complex scene images as well as sentences. The same algorithm can be used both to provide a competitive syntactic parser for natural language sentences from the Penn Treebank and to outperform alternative approaches for semantic scene segmentation, annotation and classification. For segmentation and annotation our algorithm obtains a new level of stateoftheart performance on the Stanford background dataset (78.1%). The features from the image parse tree outperform Gist descriptors for scene classification by 4%.
[download] [bibTeX] [discuss]
Conjugate Markov Decision Processes
Philip Thomas, Andrew Barto
Abstract:Many open problems involve the search for a mapping that is used by an algorithm solving an MDP. Useful mappings are often from the state set to some other set. Examples include representation discovery (a mapping to a feature space) and skill discovery (a mapping to skill termination probabilities). Different mappings result in algorithms achieving varying expected returns. In this paper we present a novel approach to the search for any mapping used by any algorithm attempting to solve an MDP, for that which results in maximum expected return.
[download] [bibTeX] [discuss]
Learning Mallows Models with Pairwise Preferences
Tyler Lu, Craig Boutilier
Abstract:Learning preference distributions is a key problem in many areas (e.g., recommender systems, IR, social choice). However, existing methods require restrictive data models for evidence about user preferences. We relax these restrictions by considering as data arbitrary pairwise comparisonsthe fundamental building blocks of ordinal rankings. We develop the first algorithms for learning Mallows models (and mixtures) with pairwise comparisons. At the heart is a new algorithm, the generalized repeated insertion model (GRIM), for sampling from arbitrary ranking distributions. We develop approximate samplers that are exact for many important special casesand have provable bounds with pairwise evidenceand derive algorithms for evaluating loglikelihood, learning Mallows mixtures, and nonparametric estimation. Experiments on large, realworld datasets show the effectiveness of our approach.
[download] [bibTeX] [discuss]
Surrogate losses and regret bounds for costsensitive classification with exampledependent costs
Clayton Scott
Abstract:We study surrogate losses in the context of costsensitive classification with exampledependent costs, a problem also known as regression level set estimation. We give sufficient conditions on the surrogate loss for the existence of a surrogate regret bound. Such bounds imply that as the surrogate risk tends to its optimal value, so too does the expected misclassification cost. Our sufficient conditions encompass exampledependent versions of the hinge, exponential, and other common losses. These results provide theoretical justification for some previously proposed surrogatebased algorithms, and suggests others that have not yet been developed.
[download] [bibTeX] [discuss]
Efficient Rule Ensemble Learning using Hierarchical Kernels
Pratik Jawanpuria, Saketha Nath Jagarlapudi, Ganesh Ramakrishnan
Abstract:This paper addresses the problem of Rule Ensemble Learning (REL), where the goal is simultaneous discovery of a small set of simple rules and their optimal weights that lead to good generalization. Rules are assumed to be conjunctions of basic propositions concerning the values taken by the input features. From the perspectives of interpretability as well as generalization, it is highly desirable to construct rule ensembles with low training error, having rules that are i) simple, {\em i.e.}, involve few conjunctions and ii) few in number. We propose to explore the (exponentially) large feature space of all possible conjunctions optimally and efficiently by employing the recently introduced Hierarchical Kernel Learning (HKL) framework. The regularizer employed in the HKL formulation can be interpreted as a potential for discouraging selection of rules involving large number of conjunctions  justifying its suitability for constructing rule ensembles. Simulation results show that the proposed approach improves over stateoftheart REL algorithms in terms of generalization and indeed learns simple rules. Although this is encouraging, it can be shown that HKL selects a conjunction only if all its subsets are selected, and this is highly undesirable. We propose a novel convex formulation which alleviates this problem and generalizes the HKL framework. The main technical contribution of this paper is an efficient mirrordescent based active set algorithm for solving the new formulation. Empirical evaluations on REL problems illustrate the utility of generalized HKL.
[download] [bibTeX] [discuss]
An Augmented Lagrangian Approach to Constrained MAP Inference
Andre Martins, Mario Figueiredo, pedro Aguiar, Noah Smith, Eric Xing
Abstract:We propose a new fast algorithm for approximate MAP inference on factor graphs, which combines augmented Lagrangian optimization with the dual decomposition method. Each slave subproblem is given a quadratic penalty, which pushes toward faster consensus than in previous subgradient approaches. Our algorithm is provably convergent, parallelizable, and suitable for fine decompositions of the graph. We show how it can efficiently handle problems with (possibly global) structural constraints via simple sort operations. Experiments on synthetic and realworld data show that our approach compares favorably with the stateoftheart.
[download] [bibTeX] [discuss]
MeanVariance Optimization in Markov Decision Processes
Shie Mannor, John Tsitsiklis
Abstract:We consider finite horizon Markov decision processes under performance measures that involve both the mean and the variance of the cumulative reward. We show that either randomized or historybased policies can improve performance. We prove that the complexity of computing a policy that maximizes the mean reward under a variance constraint is NPhard for some cases, and strongly NPhard for others. We finally offer pseudopolynomial exact and approximation algorithms.
[download] [bibTeX] [discuss]
Time Series Clustering: Complex is Simpler!
Lei Li, B. Aditya Prakash
Abstract:Given a motion capture sequence, how to identify the category of the motion? Classifying human motions is a critical task in motion editing and synthesizing, for which manual labeling is clearly inefficient for large databases. Here we study the general problem of time series clustering. We propose a novel method of clustering time series that can (a) learn joint temporal dynamics in the data; (b) handle time lags; and (c) produce interpretable features. We achieve this by developing complexvalued linear dynamical systems (CLDS), which include realvalued Kalman filters as a special case; our advantage is that the transition matrix is simpler (just diagonal), and the transmission one easier to interpret. We then present ComplexFit, a novel EM algorithm to learn the parameters for the general model and its special case for clustering. Our approach produces significant improvement in clustering quality, 1.5 to 5 times better than wellknown competitors on real motion capture sequences.
[download] [bibTeX] [discuss]
Maxmargin Learning for Lower Linear Envelope Potentials in Binary Markov Random Fields
Stephen Gould
Abstract:The standard approach to maxmargin parameter learning for Markov random fields (MRFs) involves incrementally adding the most violated constraints during each iteration of the algorithm. This requires exact MAP inference, which is intractable for many classes of MRF. In this paper, we propose an exact MAP inference algorithm for binary MRFs containing a class of higherorder models, known as lower linear envelope potentials. Our algorithm is polynomial in the number of variables and number of linear envelope functions. With tractable inference in hand, we show how the parameters and corresponding feature vectors can be represented in a maxmargin framework for efficiently learning lower linear envelope potentials.
[download] [bibTeX] [discuss]
Inference of Inversion Transduction Grammars
Alexander Clark
Abstract:We present the first polynomial algorithm for learning a class of inversion transduction grammars (ITGs) that implement context free transducers  functions from strings to strings. The class of transductions that we can learn properly includes all subsequential transductions. These algorithms are based on a generalisation of distributional learning; we prove correctness of our algorithm under an identification in the limit model.
[download] [bibTeX] [discuss]
BCDNPKL: Scalable NonParametric Kernel Learning Using Block Coordinate Descent
EnLiang Hu, Bo Wang, SongCan Chen
Abstract:Most existing approaches for nonparametric kernel learning (NPKL) suffer from expensive computation, which would limit their applications to largescale problems. To address the scalability problem of NPKL, we propose a novel algorithm called BCDNPKL, which is very efficient and scalable. Superior to most existing approaches, BCDNPKL keeps away from semidefinite programming (SDP) and eigendecomposition, which benefits from two findings: 1) The original SDP framework of NPKL can be reduced into a far smallersized counterpart which is corresponding to the subkernel (referred to as boundary kernel) learning; 2) The subkernel learning can be efficiently solved by using the proposed block coordinate descent (BCD) technique. We provide a formal proof of global convergence for the proposed BCDNPKL algorithm. The extensive experiments verify the scalability and effectiveness of BCDNPKL, compared with the stateoftheart algorithms.
[download] [bibTeX] [discuss]
Learning Discriminative Fisher Kernels
Laurens Van der Maaten
Abstract:Fisher kernels provide a commonly used vectorial representation of structured objects. The paper presents a technique that exploits label information to improve the object representation of Fisher kernels by employing ideas from metric learning. In particular, the new technique trains a generative model in such a way that the distance between the loglikelihood gradients induced by two objects with the same label is as small as possible, and the distance between the gradients induced by two objects with different labels is as large as possible. We illustrate the strong performance of classifiers trained on the resulting object representations on problems in handwriting recognition, speech recognition, facial expression analysis, and bioinformatics.
[download] [bibTeX] [discuss]
Pruning nearest neighbor cluster trees
Samory Kpotufe, Ulrike von Luxburg
Abstract:Nearest neighbor ($k$NN) graphs are widely used in machine learning and data mining applications, and our aim is to better understand what they reveal about the cluster structure of the unknown underlying distribution of points. Moreover, is it possible to identify spurious structures that might arise due to sampling variability? Our first contribution is a statistical analysis that reveals how certain subgraphs of a $k$NN graph form a consistent estimator of the cluster tree of the underlying distribution of points. Our second and perhaps most important contribution is the following finite sample guarantee. We carefully work out the tradeoff between aggressive and conservative pruning and are able to guarantee the removal of all spurious cluster structures while at the same time guaranteeing the recovery of salient clusters. This is the first such finite sample result in the context of clustering.
[download] [bibTeX] [discuss]
Online AUC Maximization
Peilin Zhao, Steven Hoi, Rong Jin, Tianbao Yang
Abstract:Most studies of online learning measure the performance of a learner by classification accuracy, which is inappropriate for applications where the data are unevenly distributed among different classes. We address this limitation by developing online learning algorithm for maximizing Area Under the ROC curve (AUC), a metric that is widely used for measuring the classification performance for imbalanced data distributions. The key challenge of online AUC maximization is that it needs to optimize the pairwise loss between two instances from different classes. This is in contrast to the classical setup of online learning where the overall loss is a sum of losses over individual training examples. We address this challenge by exploiting the reservoir sampling technique, and present two algorithms for online AUC maximization with theoretic performance guarantee. Extensive experimental studies confirm the effectiveness and the efficiency of the proposed algorithms for maximizing AUC.
[download] [bibTeX] [discuss]
Beat the Mean Bandit
Yisong Yue, Thorsten Joachims
Abstract:The Dueling Bandits Problem is an online learning framework in which actions are restricted to noisy comparisons between pairs of strategies (also known as bandits). It models settings where absolute rewards are difficult to elicit but pairwise preferences are readily available. In this paper, we extend the Dueling Bandits Problem to a relaxed setting where preference magnitudes can violate transitivity. We present the first algorithm for this more general Dueling Bandits Problem and provide theoretical guarantees in both the online and the PAC settings. Furthermore, we show that the new algorithm has stronger guarantees than existing results even in the original Dueling Bandits Problem, which we validate empirically.
[download] [bibTeX] [discuss]
UltraFast Optimization Algorithm for Sparse Multi Kernel Learning
Francesco Orabona, Luo Jie
Abstract:Many stateoftheart approaches for Multi Kernel Learning (MKL) struggle at finding a compromise between performance, sparsity of the solution and speed of the optimization process. In this paper we look at the MKL problem at the same time from a learning and optimization point of view. So, instead of designing a regularizer and then struggling to find an efficient method to minimize it, we design the regularizer while keeping the optimization algorithm in mind. Hence, we introduce a novel MKL formulation, which mixes elements of pnorm and elasticnet kind of regularization. We also propose a fast stochastic gradient descent method that solves the novel MKL formulation. We show theoretically and empirically that our method has 1) stateoftheart performance on many classification tasks; 2) exact sparse solutions with a tunable level of sparsity; 3) a convergence rate bound that depends only logarithmically on the number of kernels used, and is independent of the sparsity required; 4) independence on the particular convex loss function used.
[download] [bibTeX] [discuss]
Estimating the Bayes Point Using Linear Knapsack Problems
Brian Potetz
Abstract:A Bayes Point machine is a binary classifier that approximates the Bayesoptimal classifier by estimating the mean of the posterior distribution of classifier parameters. Past Bayes Point machines have overcome the intractability of this goal by using message passing techniques that approximate the posterior of the classifier parameters as a Gaussian distribution. In this paper, we investigate alternative message passing approaches that do not rely on Gaussian approximation. To make this possible, we introduce a new computational shortcut based on linear multiplechoice knapsack problems that reduces the complexity of computing Bayes Point belief propagation messages from exponential to linear in the number of data features. Empirical tests of our approach show significant improvement in linear classification over both softmargin SVMs and Expectation Propagation Bayes Point machines for several realworld UCI datasets.
[download] [bibTeX] [discuss]
On optimization methods for deep learning
Quoc Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, Andrew Ng
Abstract:The predominant methodology in training deep learning advocates the use of stochastic gradient descent methods (SGDs). Despite its ease of implementation, SGDs are difficult to tune and parallelize. These problems make it challenging to develop, debug and scale up deep learning algorithms with SGDs. In this paper, we show that more sophisticated offtheshelf optimization methods such as Limited memory BFGS (LBFGS) and Conjugate gradient (CG) with linesearch can significantly simplify and speed up the process of pretraining deep algorithms. In our experiments, the difference between LBFGS/CG and SGDs are more pronounced if we consider algorithmic extensions (e.g., sparsity regularization) and hardware extensions (e.g., GPUs or computer clusters). Our experiments with distributed optimization support the use of LBFGS with locally connected networks and convolutional neural networks. Using LBFGS, our convolutional network model achieves 0.69\% on the standard MNIST dataset. This is a stateoftheart result on MNIST among algorithms that do not use distortions or pretraining.
[download] [bibTeX] [discuss]
Multiclass Classification with Bandit Feedback using Adaptive Regularization
Koby Crammer, Claudio Gentile
Abstract:We present a new multiclass algorithm in the bandit framework, where after making a prediction, the learning algorithm receives only partial feedback, i.e., a single bit of rightorwrong, rather then the true label. Our algorithm is based on the secondorder Perceptron, and uses upperconfidence bounds to trade off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where instances are chosen adversarially, while the labels are chosen according to a linear probabilistic model, which is also chosen adversarially. From the theoretical viewpoint, we show a regret of O(\sqrt{T}\log(T)), which improves over the current best bounds of O(T^{2/3}) in the fully adversarial setting. We evaluate our algorithm on nine realworld text classification problems, obtaining stateoftheart results, even compared with nonbandit online algorithms, especially when label noise is introduced.
[download] [bibTeX] [discuss]
On the Necessity of Irrelevant Variables
Dave Helmbold, Phil Long
Abstract:This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant.
[download] [bibTeX] [discuss]
ABCEP: Expectation Propagation for Likelihoodfree Bayesian Computation
Simon Barthelmé, Nicolas Chopin
Abstract:Many statistical models of interest to the natural and social sciences have no tractable likelihood function. Until recently, Bayesian inference for such models was thought infeasible. Pritchard et al. (1999) introduced an algorithm known as ABC, for Approximate Bayesian Computation, that enables Bayesian computation in such models. Despite steady progress since this first breakthrough, such as the adaptation of MCMC and Sequential Monte Carlo techniques to likelihoodfree inference, stateofthe art methods remain notoriously hard to use and require enormous computation times. Among other issues, one faces the difficult task of finding appropriate summary statistics for the model, and tuning the algorithm can be timeconsuming when little prior information is available. We show that Expectation Propagation, a widely successful approximate inference technique, can be adapted to the likelihoodfree context. The resulting algorithm does not require summary statistics, is an order of magnitude faster than existing techniques, and remains usable when prior information is vague.
[download] [bibTeX] [discuss]
A PACBayes Samplecompression Approach to Kernel Methods
Pascal Germain, Alexandre Lacoste, Francois Laviolette, Mario Marchand, Sara Shanian
Abstract:We propose a PACBayes sample compression approach to kernel methods that can accommodate any bounded similarity function and show that the support vector machine (SVM) classifier is a particular case of a more general class of datadependent classifiers known as majority votes of samplecompressed classifiers. We provide novel risk bounds for these majority votes and learning algorithms that minimize these bounds.
[download] [bibTeX] [discuss]
Integrating Partial Model Knowledge in Model Free RL Algorithms
Aviv Tamar, Dotan Di Castro, Ron Meir
Abstract:In reinforcement learning an agent uses online feedback from the environment and prior knowledge in order to adaptively select an effective policy. Model free approaches address this task by directly mapping external and internal states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel algorithm which combines them into a single algorithm, which switches between a model based and a model free mode, depending on the current environmental state and on the status of the agent's knowledge. We prove that such an approach leads to improved performance whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach and suggest its efficacy in boosting policy gradient learning.
[download] [bibTeX] [discuss]
Fast Newtontype Methods for Total Variation Regularization
Álvaro Barbero, Suvrit Sra
Abstract:Numerous applications in statistics, signal processing, and machine learning regularize using Total Variation (TV) penalties. We study anisotropic (l1based) TV and also a related l2norm variant. We consider for both variants associated (1D) proximity operators, which lead to challenging optimization problems. We solve these problems by developing Newtontype methods that outperform the stateoftheart algorithms. More importantly, our 1DTV algorithms serve as building blocks for solving the harder task of computing 2 (and higher)dimensional TV proximity. We illustrate the computational benefits of our methods by applying them to several applications: (i) image denoising; (ii) image deconvolution (by plugging in our TV solvers into publicly available software); and (iii) four variants of fusedlasso. The results show large speedupsand to support our claims, we provide software accompanying this paper.
[download] [bibTeX] [discuss]
Parallel Coordinate Descent for L1Regularized Loss Minimization
Joseph Bradley, Aapo Kyrola, Daniel Bickson, Carlos Guestrin
Abstract:We propose Shotgun, a parallel coordinate descent algorithm for minimizing L1regularized losses. Though coordinate descent seems inherently sequential, we prove convergence bounds for Shotgun which predict linear speedups, up to a problemdependent limit. We present a comprehensive empirical study of Shotgun for Lasso and sparse logistic regression. Our theoretical predictions on the potential for parallelism closely match behavior on real data. Shotgun outperforms other published solvers on a range of large problems, proving to be one of the most scalable algorithms for L1.
[download] [bibTeX] [discuss]
LargeScale Convex Minimization with a LowRank Constraint
Shai ShalevShwartz, Alon Gonen, Ohad Shamir
Abstract:We address the problem of minimizing a convex function over the space of large matrices with low rank. While this optimization problem is hard in general, we propose an efficient greedy algorithm and derive its formal approximation guarantees. Each iteration of the algorithm involves (approximately) finding the left and right singular vectors corresponding to the largest singular value of a certain matrix, which can be calculated in linear time. This leads to an algorithm which can scale to large matrices arising in several applications such as matrix completion for collaborative filtering and robust low rank matrix approximation.
[download] [bibTeX] [discuss]
Approximate Dynamic Programming for Storage Problems
Lauren Hannah, David Dunson
Abstract:Storage problems are an important subclass of stochastic control problems. This paper presents a new method, approximate dynamic programming for storage, to solve storage problems with continuous, convex decision sets. Unlike other solution procedures, ADPS allows math programming to be used to make decisions each time period, even in the presence of large state variables. We test ADPS on the day ahead wind commitment problem with storage.
[download] [bibTeX] [discuss]
Online Submodular Minimization for Combinatorial Structures
Stefanie Jegelka, Jeff Bilmes
Abstract:Most results for online decision problems with structured concepts, such as trees or cuts, assume linear costs. In many settings, however, nonlinear costs are more realistic. Owing to their nonseparability, these lead to much harder optimization problems. Going beyond linearity, we address online approximation algorithms for structured concepts that allow the cost to be submodular, i.e., nonseparable. In particular, we show regret bounds for three Hannanconsistent strategies that capture different settings. Our results also tighten a regret bound for unconstrained online submodular minimization.
[download] [bibTeX] [discuss]
Minimal Loss Hashing for Compact Binary Codes
Mohammad Norouzi, David Fleet
Abstract:We propose a method for learning similaritypreserving hash functions that map highdimensional data onto binary codes. The formulation is based on structured prediction with latent variables and a hingelike loss function. It is efficient to train for large datasets, scales well to large code lengths, and outperforms stateoftheart methods.
[download] [bibTeX] [discuss]
The Hierarchical Beta Process for Convolutional Factor Analysis and Deep Learning
Bo Chen, Gungor Polatkan, Guillermo Sapiro, David Dunson, Lawrence Carin
Abstract:A convolutional factoranalysis model is developed, with the number of filters (factors) inferred via the beta process (BP) and hierarchical BP, for singletask and multitask learning, respectively. The computation of the model parameters is implemented within a Bayesian setting, employing Gibbs sampling; we explicitly exploit the convolutional nature of the expansion to accelerate computations. The model is used in a multilevel (deep) analysis of general data, with specific results presented for imageprocessing data sets, e.g., classification.
[download] [bibTeX] [discuss]
Simultaneous Learning and Covering with Adversarial Noise
Andrew Guillory, Jeff Bilmes
Abstract:We study simultaneous learning and covering problems: submodular set cover problems that depend on the solution to an active (query) learning problem. The goal is to jointly minimize the cost of both learning and covering. We extend recent work in this setting to allow for a limited amount of adversarial noise. Certain noisy query learning problems are a special case of our problem. Crucial to our analysis is a lemma showing the logical OR of two submodular cover constraints can be reduced to a single submodular set cover constraint. Combined with known results, this new lemma allows for arbitrary monotone circuits of submodular cover constraints to be reduced to a single constraint. As an example practical application, we present a movie recommendation website that minimizes the total cost of learning what the user wants to watch and recommending a set of movies.
[download] [bibTeX] [discuss]
Topic Modeling with Nonparametric Markov Tree
Haojun Chen, David Dunson, Lawrence Carin
Abstract:A new hierarchical treebased topic model is developed, based on nonparametric Bayesian techniques. The model has two unique attributes: (i) a child node in the tree may have more than one parent, with the goal of eliminating redundant subtopics deep in the tree; and (ii) parsimonious subtopics are manifested, by removing redundant usage of words at multiple scales. The depth and width of the tree are unbounded within the prior, with a retrospective sampler employed to adaptively infer the appropriate tree size based upon the corpus under study. Excellent quantitative results are manifested on five standard data sets, and the inferred tree structure is also found to be highly interpretable.
[download] [bibTeX] [discuss]
Relational Active Learning for Joint Collective Classification Models
Ankit Kuwadekar, Jennifer Neville
Abstract:In many network domains, labeled data may be costly to acquireindicating a need for {\em relational active learning} methods. Recent work has demonstrated that relational model performance can be improved by taking network structure into account when choosing instances to label. However, in collective inference settings, {\em both} model estimation {\em and} prediction can be improved by acquiring a node's labelsince relational models estimate a joint distribution over labels in the network and collective classification methods propagate information from labeled training data during prediction. This conflates improvement in learning with improvement in inference, since labeling nodes can reduce inference error without improving the overall quality of the learned model. Here, we use {\em acrossnetwork} classification to separate the effects on learning and prediction, and focus on reduction of learning error. When label propagation is used for learning, we find that labeling based on prediction {\em certainty} is more effective than labeling based on {\em uncertainty}. As such, we propose a novel active learning method that combines a networkbased {\em certainty} metric with semisupervised learning and relational resampling. We evaluate our approach on synthetic and realworld networks and show faster learning compared to several baselines, including the network based method of Bilgic et al. 2010.
[download] [bibTeX] [discuss]
A Cotraining Approach for Multiview Spectral Clustering
Abhishek Kumar, Hal Daume III, University of Maryland
Abstract:We propose a spectral clustering algorithm for the multiview setting where we have access to multiple views of the data, each of which can be independently used for clustering. Our spectral clustering algorithm has a flavor of cotraining, which is already a widely used idea in semisupervised learning. We work on the assumption that the true underlying clustering would assign a point to the same cluster irrespective of the view. Hence, we constrain our approach to only search for the clusterings that agree across the views. Our algorithm does not have any hyperparameters to set, which is a major advantage in unsupervised learning. We empirically compare with a number of baseline methods on synthetic and realworld datasets to show the efficacy of the proposed algorithm.
[download] [bibTeX] [discuss]
Learning from Multiple Outlooks
Maayan Harel, Shie Mannor
Abstract:We propose a novel problem formulation of learning a single task when the data are provided in different feature spaces. Each such space is called an outlook, and is assumed to contain both labeled and unlabeled data. The objective is to take advantage of the data from all the outlooks to better classify each of the outlooks. We devise an algorithm that computes optimal affine mappings from different outlooks to a target outlook by matching moments of the empirical distributions. We further derive a probabilistic interpretation of the resulting algorithm and a sample complexity bound indicating how many samples are needed to adequately find the mapping. We report the results of extensive experiments on activity recognition tasks that show the value of the proposed approach in boosting performance.
[download] [bibTeX] [discuss]
Adaptive Kernel Approximation for LargeScale NonLinear SVM Prediction
Michele Cossalter, Rong Yan, Lu Zheng
Abstract:The applicability of nonlinear support vector machines (SVMs) has been limited in largescale data collections because of their linear prediction complexity to the size of support vectors. We propose an efficient prediction algorithm with performance guarantee for nonlinear SVMs, termed AdaptSVM. It can selectively collapse the kernel function computation to a reduced set of support vectors, compensated by an additional correction term that can be easily computed online. It also allows adaptive fallback to original kernel computation based on its estimated variance and maximum error tolerance. In addition to theoretical analysis, we empirically evaluate on multiple largescale datasets to show that the proposed algorithm can speed up the prediction process up to 10000 times with only <0.5 accuracy loss.
[download] [bibTeX] [discuss]
RiskBased Generalizations of fdivergences
Darío GarcíaGarcía, Ulrike von Luxburg, Raúl SantosRodríguez
Abstract:We derive a generalized notion of fdivergences, called (f,l)divergences. We show that this generalization enjoys many of the nice properties of fdivergences, although it is a richer family. It also provides alternative definitions of standard divergences in terms of surrogate risks. As a first practical application of this theory, we derive a new estimator for the KulbackLeibler divergence that we use for clustering sets of vectors.
[download] [bibTeX] [discuss]
Learning MultiView Neighborhood Preserving Projections
Novi Quadrianto, Christoph Lampert
Abstract:We address the problem of metric learning for multiview data, namely the construction of embedding projections from data in different representations into a shared feature space, such that the Euclidean distance in this space provides a meaningful withinview as well as betweenview similarity. Our motivation stems from the problem of crossmedia retrieval tasks, where the availability of a joint Euclidean distance function is a prerequisite to allow fast, in particular hashingbased, nearest neighbor queries. We formulate an objective function that expresses the intuitive concept that matching samples are mapped closely together in the output space, whereas nonmatching samples are pushed apart, no matter in which view they are available. The resulting optimization problem is not convex, but it can be decomposed explicitly into a convex and a concave part, thereby allowing efficient optimization using the convexconcave procedure. Experiments on an image retrieval task show that nearestneighbor based crossview retrieval is indeed possible, and the proposed technique improves the retrieval accuracy over baseline techniques.
[download] [bibTeX] [discuss]
Better Algorithms for Selective Sampling
Francesco Orabona, Nicolň CesaBianchi
Abstract:We study online algorithms for selective sampling that use regularized least squares (RLS) as base classifier. These algorithms typically perform well in practice, and some of them have formal guarantees on their mistake and query rates. We refine and extend these guarantees in various ways, proposing algorithmic variants that exhibit better empirical behavior while enjoying performance guarantees under much more general conditions. We also show a simple way of coupling a generic gradientbased classifier with a specific RLSbased selective sampler, obtaining hybrid algorithms with combined performance guarantees.
[download] [bibTeX] [discuss]
Minimax Learning Rates for Bipartite Ranking and Plugin Rules
Sylvain Robbiano, Stéphan Clémençon
Abstract:While it is now wellknown in the standard binary classication setup, that, under suitable margin assumptions and complexity conditions on the regression function, fast or even superfast rates (i.e. rates faster than n^(1/2) or even faster than n^1) can be achieved by plugin classiers, no result of this nature has been proved yet in the context of bipartite ranking, though akin to that of classication. It is the main purpose of the present paper to investigate this issue. Viewing bipartite ranking as a nested continuous collection of costsensitive classication problems, we exhibit a global low noise condition under which certain plugin ranking rules can be shown to achieve fast (but not superfast) rates, establishing thus minimax upper bounds for the excess of ranking risk.
[download] [bibTeX] [discuss]
Task Space Retrieval Using Inverse Feedback Control
Nikolay Jetchev, Marc Toussaint
Abstract:Learning complex skills by repeating and generalizing expert behavior is a fundamental problem in robotics. A common approach is learning from demonstration: given examples of correct motions, learn a policy mapping state to action consistent with the training data. However, the usual approaches do not answer the question of what are appropriate representations to generate motions for specific tasks. Inspired by Inverse Optimal Control, we present a novel method to learn latent costs, imitate and generalize demonstrated behavior, and discover a task relevant motion representation: Task Space Retrieval Using Inverse Feedback Control (TRIC). We use the learned latent costs to create motion with a feedback controller. We tested our method on robot grasping of objects, a challenging highdimensional task. TRIC learns the important control dimensions for the grasping task from a few example movements and is able to robustly approach and grasp objects in new situations.
[download] [bibTeX] [discuss]
Bayesian CCA via Group Sparsity
Seppo Virtanen, Arto Klami, Samuel Kaski
Abstract:Bayesian treatments of Canonical Correlation Analysis (CCA) type latent variable models have been recently proposed for coping with overfitting in small sample sizes, as well as for producing factorizations of the data sources into correlated and nonshared effects. However, all of the current implementations of Bayesian CCA and its extensions are computationally inefficient for highdimensional data and, as shown in this paper, break down completely for highdimensional sources with low sample count. Furthermore, they cannot reliably separate the correlated effects from nonshared ones. We propose a new Bayesian CCA variant that is computationally efficient and works for highdimensional data, while also learning the factorization more accurately. The improvements are gained by introducing a group sparsity assumption and an improved variational approximation. The method is demonstrated to work well on multilabel prediction tasks and in analyzing brain correlates of naturalistic audio stimulation.
[download] [bibTeX] [discuss]
PILCO: A ModelBased and DataEfficient Approach to Policy Search
Marc Deisenroth, Carl Rasmussen
Abstract:In this paper, we introduce PILCO, a practical, dataefficient modelbased policy search method. PILCO reduces model bias, one of the key problems of modelbased reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into longterm planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using stateoftheart approximate inference. Furthermore, policy gradients are computed analytically for policy improvement. We report unprecedented learning efficiency on challenging and highdimensional control tasks.
[download] [bibTeX] [discuss]
Suboptimal Solution Path Algorithm for Support Vector Machine
Masayuki Karasuyama, Ichiro Takeuchi
Abstract:We consider a suboptimal solution path algorithm for the Support Vector Machine. The solution path algorithm is known as an effective tool for solving a sequence of a parametrized optimization problems in machine learning. However, the algorithm needs to keep strict optimality conditions satisfied everywhere on the path. This requirement narrows the applicability of the path algorithm and adversely affects its computational efficiency. In our algorithm, user can specify tolerances to the optimality and control the tradeoff between accuracy of the solution and the computational cost. We also show that our suboptimal solutions can be interpreted as the solution of a perturbed optimization problem from the original one, provide some theoretical analyses of our algorithm based on a novel interpretation. The experimental results demonstrate the effectiveness of our algorithm in terms of efficiency and accuracy.
[download] [bibTeX] [discuss]
Incremental Basis Construction from Temporal Difference Error
Yi Sun, Faustino Gomez, Mark Ring, Jürgen Schmidhuber
Abstract:In many reinforcementlearning (RL) systems, the value function is approximated as a linear combination of a fixed set of basis functions. Performance can be improved by adding to this set. Previous approaches construct a series of basis functions that in sufficient number can eventually represent the value function. In contrast, we show that there is a single, ideal basis function, which can directly represent the value function. Its addition to the set immediately reduces the error to zerowithout changing existing weights. Moreover, this ideal basis function is simply the value function that results from replacing the MDP's reward function with its Bellman error. This result suggests a novel method for improving valuefunction estimation: a primary reinforcement learner estimates its value function using its present basis functions; it then sends its TD error to a secondary learner, which interprets that error as a reward function and estimates the corresponding value function; the resulting value function then becomes the primary learner's new basis function. We present both batch and online versions in combination with incremental basis projection, and demonstrate that the performance is superior to existing methods, especially in the case of large discount factors.
[download] [bibTeX] [discuss]
Predicting Legislative Roll Calls from Text
Sean Gerrish, David Blei
Abstract:We develop several predictive models linking legislative sentiment to legislative text. Our models, which draw on ideas from ideal point estimation and topic models, predict voting patterns based on the contents of bills and infer the political leanings of legislators. With supervised topics, we provide an exploratory window into how the language of the law is correlated with political support. We also derive approximate posterior inference algorithms based on variational methods. Across 12 years of legislative data, we predict specific voting patterns with high accuracy.
[download] [bibTeX] [discuss]
On Bayesian PCA: Automatic Dimensionality Selection and Analytic Solution
Shinichi Nakajima, Masashi Sugiyama, Derin Babacan
Abstract:In probabilistic PCA, the fully Bayesian estimation is computationally intractable. To cope with this problem, two types of approximation schemes were introduced: the partially Bayesian PCA (PBPCA) where only the latent variables are integrated out, and the variational Bayesian PCA (VBPCA) where the loading vectors are also integrated out. The VBPCA was proposed as an improved variant of PBPCA for enabling automatic dimensionality selection (ADS). In this paper, we investigate whether VBPCA is really the best choice from the viewpoints of computational efficiency and ADS. We first show that ADS is not the unique feature of VBPCAPBPCA is also actually equipped with ADS. We further show that PBPCA is more advantageous in computational efficiency than VBPCA because the global solution of PBPCA can be computed analytically. However, we also show the negative fact that PBPCA results in a trivial solution in the empirical Bayesian framework. We next consider a simplified variant of VBPCA, where the latent variables and loading vectors are assumed to be mutually independent (while the ordinary VBPCA only requires matrixwise independence). We show that this simplified VBPCA is the most advantageous in practice because its empirical Bayes solution experimentally works as well as the original VBPCA, and its global optimal solution can be computed efficiently in a closed form.
[download] [bibTeX] [discuss]
Learning Linear Functions with Quadratic and Linear Multiplicative Updates
Tom Bylander
Abstract:We analyze variations of multiplicative updates for learning linear functions online. These can be described as substituting exponentiation in the Exponentiated Gradient (EG) algorithm with quadratic and linear functions. Both kinds of updates substitute exponentiation with simpler operations and reduce dependence on the parameter that specifies the sum of the weights during learning. In particular, the linear multiplicative update places no restrictions on the sum of the weights, and, under a wide range of conditions, achieves worstcase behavior close to the EG algorithm. We perform our analysis for square loss and absolute loss, and for regression and classification. We also describe some experiments showing that the performance of our algorithms are comparable to EG and the $p$norm algorithm.
[download] [bibTeX] [discuss]
Domain Adaptation for LargeScale Sentiment Classification: A Deep Learning Approach
Xavier Glorot, Antoine Bordes, Yoshua Bengio
Abstract:The exponential increase in the availability of online reviews and recommendations makes sentiment classification an interesting topic in academic and industrial research. Reviews can span so many different domains that it is difficult to gather annotated training data for all of them. Hence, this paper studies the problem of domain adaptation for sentiment classifiers, hereby a system is trained on labeled reviews from one source domain but is meant to be deployed on another. We propose a deep learning approach which learns to extract a meaningful representation for each review in an unsupervised fashion. Sentiment classifiers trained with this highlevel feature representation clearly outperform stateoftheart methods on a benchmark composed of reviews of 4 types of Amazon products. Furthermore, this method scales well and allowed us to successfully perform domain adaptation on a larger industrialstrength dataset of 22 domains.
[download] [bibTeX] [discuss]
Learning with Whom to Share in Multitask Feature Learning
Zhuoliang Kang, Kristen Grauman, Fei Sha
Abstract:In multitask learning (MTL), multiple tasks are learnt jointly. A major assumption for this paradigm is that all those tasks are indeed related so that the joint training is appropriate and beneficial. In this paper, we study the problem of multitask learning of shared feature representations among tasks, while simultaneously determining ``with whom'' each task should share. We formulate the problem as a mixed integer programming and provide an alternating minimization technique to solve the optimization problem of jointly identifying grouping structures and parameters. The algorithm monotonically decreases the objective function and converges to a local optimum. Compared to the standard MTL paradigm where all tasks are in a single group, our algorithm improves its performance with statistical significance for three out of the four datasets we have studied. We also demonstrate its advantage over other task grouping techniques investigated in literature.
[download] [bibTeX] [discuss]
Boosting on a Budget: Sampling for FeatureEfficient Prediction
Lev Reyzin
Abstract:In this paper, we tackle the problem of featureefficient prediction: classification using a limited number of features per test example. We show that modifying an ensemble classifier such as AdaBoost, by sampling hypotheses from its final weighted predictor, is wellsuited for this task. We further consider an extension of this problem, where the costs of examining the various features can differ from one another, and we give an algorithm for this more general setting. We prove the correctness of our algorithms and derive bounds for the number of samples needed for given error rates. We also experimentally verify the effectiveness of our methods.
[download] [bibTeX] [discuss]
SpeedingUp HoeffdingBased Regression Trees With Options
Elena Ikonomovska, Joăo Gama, Bernard Zenko, Saso Dzeroski
Abstract:Data streams are ubiquitous and have in the last two decades become an important research topic. For their predictive nonparametric analysis, Hoeffdingbased trees are often a method of choice, offering a possibility of anytime predictions. However, one of their main problems is the delay in learning progress due to the existence of equally discriminative attributes. Options are a natural way to deal with this problem. Option trees build upon regular trees by adding splitting options in the internal nodes. As such they are known to improve accuracy, stability and reduce ambiguity. In this paper, we present online option trees for faster learning on numerical data streams. Our results show that options improve the anytime performance of ordinary online regression trees, while preserving the interpretable structure of trees and without significantly increasing the computational complexity of the algorithm.
[download] [bibTeX] [discuss]
Linear Regression under FixedRank Constraints: A Riemannian Approach
Gilles Meyer, Silvčre Bonnabel, Rodolphe Sepulchre, University of Ličge
Abstract:In this paper, we tackle the problem of learning a linear regression model whose parameter is a fixedrank matrix. We study the Riemannian manifold geometry of the set of fixedrank matrices and develop efficient linesearch algorithms. The proposed algorithms have many applications, scale to highdimensional problems, enjoy local convergence properties and confer a geometric basis to recent contributions on learning fixedrank matrices. Numerical experiments on benchmarks suggest that the proposed algorithms compete with the stateoftheart, and that manifold optimization offers a versatile framework for the design of rankconstrained machine learning algorithms.
[download] [bibTeX] [discuss]
Cauchy Graph Embedding
Dijun Luo, Chris Ding, Feiping Nie, Heng Huang
Abstract:Laplacian embedding provides a lowdimensional representation for the nodes of a graph where the edge weights denote pairwise similarity among the node objects. It is commonly assumed that the Laplacian embedding results preserve the local topology of the original data on the lowdimensional projected subspaces, i.e., for any pair of graph nodes with large similarity, they should be embedded closely in the embedded space. However, in this paper, we will show that the Laplacian embedding often cannot preserve local topology well as we expected. To enhance the local topology preserving property in graph embedding, we propose a novel Cauchy graph embedding which preserves the similarity relationships of the original data in the embedded space via a new objective. Consequentially the machine learning tasks (such as k Nearest Neighbor type classifications) can be easily conducted on the embedded data with better performance. The experimental results on both synthetic and real world benchmark data sets demonstrate the usefulness of this new type of embedding.
[download] [bibTeX] [discuss]
Uncovering the Temporal Dynamics of Diffusion Networks
Manuel Gomez Rodriguez, David Balduzzi, Bernhard Schölkopf
Abstract:Time plays an essential role in the diffusion of information, influence and disease over networks. In many cases we only observe when a node copies information, makes a decision or becomes infected  but the connectivity, transmission rates between nodes and transmission sources are unknown. Inferring the underlying dynamics is of outstanding interest since it enables forecasting, influencing and retarding infections, broadly construed. To this end, we model diffusion processes as discrete networks of continuous temporal processes occurring at different rates. Given cascade data  observed infection times of nodes  we infer the edges of the global diffusion network and estimate the transmission rates of each edge that best explain the observed data. The optimization problem is convex. The model naturally (without heuristics) imposes sparse solutions and requires no parameter tuning. The problem decouples into a collection of independent smaller problems, thus scaling easily to networks on the order of hundreds of thousands of nodes. Experiments on real and synthetic data show that our algorithm both recovers the edges of diffusion networks and accurately estimates their transmission rates from cascade data.
[download] [bibTeX] [discuss]
Multiclass Boosting with Hinge Loss based on Output Coding
Tianshi Gao, Daphne Koller
Abstract:Multiclass classification is an important and fundamental problem in machine learning. A popular family of multiclass classification methods belongs to reducing multiclass to binary based on output coding. Several multiclass boosting algorithms have been proposed to learn the coding matrix and the associated binary classifiers in a problemdependent way. These algorithms can be unified under a sumofexponential loss function defined in the domain of margins. Instead, multiclass SVM uses another type of loss function based on hinge loss. In this paper, we present a new outputcodingbased multiclass boosting algorithm using the multiclass hinge loss, which we call HingeBoost.OC. HingeBoost.OC is tested on various real world datasets and shows better performance than the existing multiclass boosting algorithm AdaBoost.ERP, onevsone, onevsall, ECOC and multiclass SVM in a majority of different cases.
[download] [bibTeX] [discuss]
Approximation Bounds for Inference using Cooperative Cuts
Stefanie Jegelka, Jeff Bilmes
Abstract:We analyze a family of probability distributions that are characterized by an embedded combinatorial structure. This family includes models having arbitrary treewidth and arbitrary sized factors. Unlike general models with such freedom, where the most probable explanation (MPE) problem is inapproximable, the combinatorial structure within our model, in particular the indirect use of submodularity, leads to several MPE algorithms that all have approximation guarantees.
[download] [bibTeX] [discuss]
Brier Curves: a New CostBased Visualisation of Classifier Performance
Jose HernandezOrallo, Peter Flach, Cčsar Ferri
Abstract:It is often necessary to evaluate classifier performance over a range of operating conditions, rather than as a point estimate. This is typically assessed through the construction of curvesover a space, visualising how one or two performance metrics vary with the operating condition. For binary classifiers in particular, cost space is a natural way of showing this range of performance, visualising loss against operating condition. However, the curves which have been traditionally drawn in cost space, known as cost curves, show the optimal loss, and hence assume knowledge of the optimal decision threshold for a given operating condition. Clearly, this leads to an optimistic assessment of classifier performance. In this paper we propose a more natural way of visualising classifier performance in cost space, which is to plot probabilistic loss on the yaxis, i.e., the loss arising from the probability estimates. This new curve provides new ways of understanding classifier performance and new tools to compare classifiers. In addition, we show that the area under this curve is exactly the Brier score, one of the most popular performance metrics for probabilistic classifiers.
[download] [bibTeX] [discuss]
Semisupervised Penalized Output Kernel Regression for Link Prediction
Céline Brouard, Florence D'AlcheBuc, Marie Szafranski
Abstract:Link prediction is addressed as an output kernel learning task through semisupervised Output Kernel Regression. Working in the framework of RKHS theory with vectorvalued functions, we establish a new representer theorem devoted to semisupervised least square regression. We then apply it to get a new model (POKR: Penalized Output Kernel Regression) and show its relevance using numerical experiments on artificial networks and two real applications using a very low percentage of labeled data in a transductive setting.
[download] [bibTeX] [discuss]
A New Bayesian Rating System for Team Competitions
Sergey Nikolenko, Alexander Sirotkin
Abstract:We present a novel probabilistic rating system for team competitions. Building upon TrueSkill(tm), we change the factor graph structure to cope with the problems of TrueSkill(tm), e.g., multiway ties and variable team size. We give detailed inference algorithms for the new structure. Experimental results show a significant improvement over TrueSkill(tm).
[download] [bibTeX] [discuss]
OptiML: An Implicitly Parallel DomainSpecific Language for Machine Learning
Arvind Sujeeth, HyoukJoong Lee, Kevin Brown, Tiark Rompf, Hassan Chafi, Michael Wu, Anand Atreya, Martin Odersky, Kunle Olukotun
Abstract:As the size of datasets continues to grow, machine learning applications are becoming increasingly limited by the amount of available computational power. Taking advantage of modern hardware requires using multiple parallel programming models targeted at different devices (e.g. CPUs and GPUs). However, programming these devices to run efficiently and correctly is difficult, errorprone, and results in software that is harder to read and maintain. We present OptiML, a domainspecific language (DSL) for machine learning. OptiML is an implicitly parallel, expressive and high performance alternative to MATLAB and C++. OptiML performs domainspecific analyses and optimizations and automatically generates CUDA code for GPUs. We show that OptiML outperforms explicitly parallelized MATLAB code in nearly all cases.
[download] [bibTeX] [discuss]
Infinite SVM: a Dirichlet Process Mixture of Largemargin Kernel Machines
Jun Zhu, Ning Chen, Eric Xing
Abstract:We present Infinite SVM (iSVM), a Dirichlet process mixture of largemargin kernel machines for multiway classification. An iSVM enjoys the advantages of both Bayesian nonparametrics in handling the unknown number of mixing components, and largemargin kernel machines in robustly capturing local nonlinearity of complex data. We develop an efficient variational learning algorithm for posterior inference of iSVM, and we demonstrate the advantages of iSVM over Dirichlet process mixture of generalized linear models and other benchmarks on both synthetic and real Flickr image classification datasets.
[download] [bibTeX] [discuss]
On the Integration of Topic Modeling and Dictionary Learning
Lingbo Li, Mingyuan Zhou, Guillermo Sapiro, Lawrence Carin
Abstract:A new nonparametric Bayesian model is developed to integrate dictionary learning and topic model into a unified framework. The model is employed to analyze partially annotated images, with the dictionary learning performed directly on image patches. Efficient inference is performed with a Gibbsslice sampler, and encouraging results are reported on widely used datasets.
[download] [bibTeX] [discuss]
Piecewise Bounds for Estimating BernoulliLogistic Latent Gaussian Models
Benjamin Marlin, Mohammad Khan, Kevin Murphy
Abstract:Bernoullilogistic latent Gaussian models (bLGMs) are a useful model class, but accurate parameter estimation is complicated by the fact that the marginal likelihood contains an intractable logisticGaussian integral. In this work, we propose the use of fixed piecewise linear and quadratic upper bounds to the logisticlogpartition (LLP) function as a way of circumventing this intractable integral. We describe a framework for approximately computing minimax optimal piecewise quadratic bounds, as well a generalized expectation maximization algorithm based on using piecewise bounds to estimate bLGMs. We prove a theoretical result relating the maximum error in the LLP bound to the maximum error in the marginal likelihood estimate. Finally, we present empirical results showing that piecewise bounds can be significantly more accurate than previously proposed variational bounds.
[download] [bibTeX] [discuss]
Access to Unlabeled Data can Speed up Prediction Time
Ruth Urner, Shai ShalevShwartz, Shai BenDavid
Abstract:Semisupervised learning (SSL) addresses the problem of training a classifier using a small number of labeled examples and many unlabeled examples. Most previous work on SSL focused on how availability of unlabeled data can improve the accuracy of the learned classifiers. In this work we study how unlabeled data can be beneficial for constructing faster classifiers. We propose an SSL algorithmic framework which can utilize unlabeled examples for learning classifiers from a predefined set of fast classifiers. We formally analyze conditions under which our algorithmic paradigm obtains significant improvements by the use of unlabeled data. As a side benefit of our analysis we propose a novel quantitative measure of the socalled cluster assumption. We demonstrate the potential merits of our approach by conducting experiments on the MNIST data set, showing that, when a sufficiently large unlabeled sample is available, a fast classifier can be learned from much fewer labeled examples than without such a sample.
[download] [bibTeX] [discuss]
From PACBayes Bounds to Quadratic Programs for Majority Votes
JeanFrancis Roy, Francois Laviolette, Mario Marchand
Abstract:We propose to construct a weighted majority vote on a set of basis functions by minimizing a risk bound (called the Cbound) that depends on the first two moments of the margin of the Qconvex combination realized on the training data. This bound minimization algorithm turns out to be a quadratic program that can be efficiently solved. A first version of the algorithm is designed for the supervised inductive setting and turns out to be competitive with AdaBoost, MDBoost and the SVM. The second version of the algorithm, designed for the transductive setting, competes well with TSVM. We also propose a new PACBayes theorem that bounds the difference between the "true" value of the Cbound and its empirical estimate and that, unexpectedly, contains no KLdivergence.
[download] [bibTeX] [discuss]
A Coherent Interpretation of AUC as a Measure of Aggregated Classification Performance
Peter Flach, Jose HernandezOrallo, Cčsar Ferri
Abstract:The area under the ROC curve (AUC), a wellknown measure of ranking performance, is also often used as a measure of classification performance, aggregating over decision thresholds as well as class and cost skews. However, David Hand has recently argued that AUC is fundamentally incoherent as a measure of aggregated classifier performance and proposed an alternative measure. Specifically, Hand derives a linear relationship between AUC and expected minimum loss, where the expectation is taken over a distribution of the misclassification cost parameter that depends on the model under consideration. Replacing this distribution with a Beta(2;2) distribution, Hand derives his alternative measure H. In this paper we offer an alternative, coherent interpretation of AUC as linearly related to expected loss. We use a distribution over cost parameter and a distribution over data points, both uniform and hence model independent. Should one wish to consider only optimal thresholds, we demonstrate that a simple and more intuitive alternative to Hands H measure is already available in the form of the area under the cost curve.
[download] [bibTeX] [discuss]
Support Vector Machines as Probabilistic Models
Vojtech Franc, Alexander Zien, Bernhard Schölkopf
Abstract:We show how the SVM can be viewed as a maximum likelihood estimate of a class of probabilistic models. This model class can be viewed as a reparametrization of the SVM in a similar vein to the $\nu$SVM reparametrizing the original SVM. It is not discriminative, but has a nonuniform marginal. We illustrate the benefits of this new view by rederiving and reinvestigating two established SVMrelated algorithms.
[download] [bibTeX] [discuss]
Adaptively Learning the Crowd Kernel
Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, Adam Kalai
Abstract:We introduce an algorithm that, given n objects, learns a similarity matrix over all n^2 pairs, from crowdsourced data *alone*. The algorithm samples responses to adaptively chosen tripletbased relativesimilarity queries. Each query has the form "is object a more similar to b or to c?" and is chosen to be maximally informative given the preceding responses. The output is an embedding of the objects into Euclidean space (like MDS); we refer to this as the "crowd kernel." SVMs reveal that the crowd kernel captures prominent and subtle features across a number of domains, such as "is striped" among neckties and "vowel vs. consonant" among letters.
[download] [bibTeX] [discuss]
Bayesian Learning via Stochastic Gradient Langevin Dynamics
Max Welling, Yee Whye Teh
Abstract:In this paper we propose a new framework for learning from large scale datasets based on iterative learning from small minibatches. By adding the right amount of noise to a standard stochastic gradient optimization algorithm we show that the iterates will converge to samples from the true posterior distribution as we anneal the stepsize. This seamless transition between optimization and Bayesian posterior sampling provides an inbuilt protection against overfitting. We also propose a practical method for Monte Carlo estimates of posterior statistics which monitors a ``sampling threshold'' and collects samples after it has been surpassed. We apply the method to three models: a mixture of Gaussians, logistic regression and ICA with natural gradients.
[download] [bibTeX] [discuss]
Multimodal Deep Learning
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Ng
Abstract:Deep networks have been successfully applied to unsupervised feature learning for single modalities (e.g., text, images or audio). In this work, we propose a novel application of deep networks to learn features over multiple modalities. We present a series of tasks for multimodal learning and show how to train deep networks that learn features to address these tasks. In particular, we demonstrate cross modality feature learning, where better features for one modality (e.g., video) can be learned if multiple modalities (e.g., audio and video) are present at feature learning time. Furthermore, we show how to learn a shared representation between modalities and evaluate it on a unique task, where the classifier is trained with audioonly data but tested with videoonly data and viceversa. Our methods are validated on the CUAVE and AVLetters datasets with an audiovisual speech classification task, demonstrating best published visual speech classification on AVLetters and effective shared representation learning.
[download] [bibTeX] [discuss]
On the Robustness of Kernel Density MEstimators
JooSeuk Kim, Clayton Scott
Abstract:We analyze a method for nonparametric density estimation that exhibits robustness to contamination of the training sample. This method achieves robustness by combining a traditional kernel density estimator (KDE) with ideas from classical Mestimation. The KDE based on a Gaussian kernel is interpreted as a sample mean in the associated reproducing kernel Hilbert space (RKHS). This mean is estimated robustly through the use of a robust loss, yielding the socalled robust kernel density estimator (RKDE). This robust sample mean can be found via a kernelized iteratively reweighted least squares (IRWLS) algorithm. Our contributions are summarized as follows. First, we present a representer theorem for the RKDE, which gives an insight into the robustness of the RKDE. Second, we provide necessary and sufficient conditions for kernel IRWLS to converge to the global minimizer, in the Gaussian RKHS, of the objective function defining the RKDE. Third, characterize and provide a method for computing the influence function associated with the RKDE. Fourth, we illustrate the robustness of the RKDE through experiments on several data sets.
[download] [bibTeX] [discuss]
Beam Search based MAP Estimates for the Indian Buffet Process
Piyush Rai, Hal Daume III, University of Maryland
Abstract:Nonparametric latent feature models offer a flexible way to discover the latent features underlying the data, without having to a priori specify their number. The Indian Buffet Process (IBP) is a popular example of such a model. Inference in IBP based models, however, remains a challenge. Sampling techniques such as MCMC can be computationally expensive and can take a long time to converge to the stationary distribution. Variational techniques, although faster than sampling, can be difficult to design, and can still remain slow on large data. In many problems, however, we only seek a maximum a posteriori (MAP) estimate of the latent feature assignment matrix. For such cases, we show that techniques such as beam search can give fast, approximate MAP estimates in the IBP based models. If samples from the posterior are desired, these MAP estimates can also serve as sensible initializers for MCMC based algorithms. Experimental results on a variety of datasets suggest that our algorithms can be a computationally viable alternative to Gibbs sampling, the particle filter, and variational inference based approaches for the IBP, and also perform better than other heuristics such as greedy search.
[download] [bibTeX] [discuss]
Optimal Distributed Online Prediction
Ofer Dekel, Ran GiladBachrach, Ohad Shamir, Lin Xiao
Abstract:Online prediction methods are typically studied as serial algorithms running on a single processor. In this paper, we present the distributed minibatch (DMB) framework, a method of converting a serial gradientbased online algorithm into a distributed algorithm, and prove an asymptotically optimal regret bound for smooth convex loss functions and stochastic examples. Our analysis explicitly takes into account communication latencies between computing nodes in a network. We also present robust variants, which are resilient to failures and node heterogeneity in an synchronous distributed environment. Our method can also be used for distributed stochastic optimization, attaining an asymptotically linear speedup. Finally, we empirically demonstrate the merits of our approach on largescale online prediction problems.
[download] [bibTeX] [discuss]
Message Passing Algorithms for the Dirichlet Diffusion Tree
David Knowles, Jurgen Van Gael, Zoubin Ghahramani
Abstract:We demonstrate efficient approximate inference for the Dirichlet Diffusion Tree, a Bayesian nonparametric prior over tree structures. Although DDTs provide a powerful and elegant approach for modeling hierarchies they haven't seen much use to date. One problem is the computational cost of MCMC inference. We provide the first deterministic approximate inference methods for DDT models and show excellent performance compared to the MCMC alternative. We present message passing algorithms to approximate the Bayesian model evidence for a specific tree. This is used to drive sequential tree building and greedy search to find optimal tree structures, corresponding to hierarchical clusterings of the data. We demonstrate appropriate observation models for continuous and binary data. The empirical performance of our method is very close to the computationally expensive MCMC alternative on a density estimation problem, and significantly outperforms kernel density estimators.
[download] [bibTeX] [discuss]
Convex MaxProduct over Compact Sets for Protein Folding
Jian Peng, Tamir Hazan, David McAllester, Raquel Urtasun
Abstract:In this paper we present an approach to inference in graphical models with mixture of discrete and bounded continuous variables. In particular, we extend convex maxproduct to deal with these hybrid models and derive the conditions under which our approach is guaranteed to produce the MAP assignment. When dealing with continuous variables the messages are functions. We investigate a multigrid approach which can be viewed as a piecewise constant representation of these messages. While this approach provides the theoretical guarantees it is not very practical. Inspired by this, we further propose a particle convex maxproduct algorithm that significantly outperforms existing particle methods in the task of protein folding and performs comparable to the stateofthe art while using a smaller amount of prior knowledge.
[download] [bibTeX] [discuss]
Structure Learning in Ergodic Factored MDPs without Knowledge of the Transition Function's InDegree
Doran Chakraborty, Peter Stone
Abstract:This paper introduces Learn Structure and Exploit RMax (LSERMax), a novel model based structure learning algorithm for ergodic factoredstate MDPs. Given a planning horizon that satisfies a condition, LSERMax provably guarantees a return very close to the optimal return, with a high certainty, without requiring any prior knowledge of the indegree of the transition function as input. LSERMax is fully implemented with a thorough analysis of its sample complexity. We also present empirical results demonstrating its effectiveness compared to prior approaches to the problem.
[download] [bibTeX] [discuss]
Clusterpath: an Algorithm for Clustering using Convex Fusion Penalties
Toby Hocking, JeanPhilippe Vert, Francis Bach, Armand Joulin
Abstract: We present a new clustering algorithm by proposing a convex relaxation of hierarchical clustering, which results in a family of objective functions with a natural geometric interpretation. We give efficient algorithms for calculating the continuous regularization path of solutions, and discuss relative advantages of the parameters. Our method experimentally gives stateoftheart results similar to spectral clustering for nonconvex clusters, and has the added benefit of learning a tree structure from the data.
[download] [bibTeX] [discuss]
Tree preserving embedding
Albert Shieh, Tatsunori Hashimoto, Edo Airoldi
Abstract:Visualization techniques for complex data are a workhorse of modern scientific pursuits. The goal of visualization is to embed high dimensional data in a low dimensional space, while preserving structure in the data relevant to exploratory data analysis, such as the existence of clusters. However, existing visualization methods often either entirely fail to preserve clusters in embeddings due to the crowding problem or can only preserve clusters at a single resolution. Here, we develop a new approach to visualization, tree preserving embedding (TPE). Our approach takes advantage of the topological notion of connectedness to provably preserve clusters at all resolutions. Our performance guarantee holds for finite samples, which makes TPE a robust method for applications. Our approach suggests new strategies for robust data visualization in practice.
[download] [bibTeX] [discuss]
Clustering by LeftStochastic Matrix Factorization
Raman Arora, Maya Gupta, Amol Kapila, Maryam Fazel
Abstract:We propose clustering samples given their pairwise similarities by factorizing the similarity matrix into the product of a cluster probability matrix and its transpose. We propose a rotationbased algorithm to compute this leftstochastic decomposition (LSD). Theoretical results link the LSD clustering method to a soft kernel kmeans clustering, give conditions for when the factorization and clustering are unique, and provide error bounds. Experimental results on simulated and real similarity datasets show that the proposed method reliably provides accurate clusterings.
[download] [bibTeX] [discuss]
The Infinite Regionalized Policy Representation
Miao Liu, Xuejun Liao, Lawrence Carin
Abstract:We introduce the infinite regionalized policy presentation (iRPR), as a nonparametric policy for reinforcement learning in partially observable Markov decision processes (POMDPs). The iRPR assumes an unbounded set of decision states a priori, and infers the number of states to represent the policy given the experiences. We propose algorithms for learning the number of decision states while maintaining a proper balance between exploration and exploitation. Convergence analysis is provided, along with performance evaluations on benchmark problems.
[download] [bibTeX] [discuss]
SampleRank: Training Factor Graphs with Atomic Gradients
Michael Wick, Khashayar Rohanimanesh, Kedar Bellare, Aron Culotta, Andrew McCallum
Abstract:We present SampleRank, an alternative to contrastive divergence (CD) for estimating parameters in complex graphical models. SampleRank harnesses a userprovided loss function to distribute stochastic gradients across an MCMC chain. As a result, parameter updates can be computed between arbitrary MCMC states. SampleRank is not only faster than CD, but also achieves better accuracy in practice (up to 23\% error reduction on nounphrase coreference).
[download] [bibTeX] [discuss]
TreeStructured Infinite Sparse Factor Model
XianXing Zhang, David Dunson, Lawrence Carin
Abstract:A new treestructured multiplicative gamma process (TMGP) is developed, for inferring the depth of a treebased factoranalysis model. This new model is coupled with the nested Chinese restaurant process, to nonparametrically infer the depth and width (structure) of the tree. In addition to developing the model, theoretical properties of the TMGP are addressed, and a novel MCMC sampler is developed. The structure of the inferred tree is used to learn relationships between highdimensional data, and the model is also applied to compressive sensing and interpolation of incomplete images.
[download] [bibTeX] [discuss]
Preserving Personalized Pagerank in Subgraphs
Andrea Vattani, Deepayan Chakrabarti, Maxim Gurevich
Abstract:Choosing a subgraph that can concisely represent a large realworld graph is useful in many scenarios. The usual strategy employed is to sample nodes so that the induced subgraph matches the original graphs degree distribution, clustering coefficient, etc., but no attempt is made to preserve finegrained relationships between nodes, which are vital for applications like clustering, classification, and ranking. In this work, we model such relationships via the notion of Personalized PageRank Value (PPV). We show that induced subgraphs output by current sampling methods do not preserve PPVs, and propose algorithms for creating PPVpreserving subgraphs on any given subset of graph nodes. Experiments on three large realworld graphs show that the subgraphs created by our method improve upon induced subgraphs in terms of preserving PPVs, clustering accuracy, and maintaining basic graph properties.
[download] [bibTeX] [discuss]
Hierarchical Classification via Orthogonal Transfer
Lin Xiao, Dengyong Zhou, Mingrui Wu
Abstract:We consider multiclass classification problems where the set of labels are organized hierarchically as a category tree. We associate each node in the tree with a classifier and classify the examples recursively from the root to the leaves. We propose a hierarchical Support Vector Machine (SVM) that encourages the classifier at each node to be different from the classifiers at its ancestors. More specifically, we introduce regularizations that force the normal vector of the classifying hyperplane at each node to be orthogonal to those at its ancestors as much as possible. We establish conditions under which training such a hierarchical SVM is a convex optimization problem, and develop an efficient dualaveraging method for solving it. We evaluate the method on a number of realworld text categorization tasks and obtain stateoftheart performance.
[download] [bibTeX] [discuss]
A ThreeWay Model for Collective Learning on MultiRelational Data
Maximilian Nickel, Volker Tresp, HansPeter Kriegel
Abstract:Relational learning is becoming increasingly important in many areas of application. Here, we present a novel approach to relational learning based on the factorization of a threeway tensor. We show that unlike other tensor approaches, our method is able to perform collective learning via the latent components of the model and provide an efficient algorithm to compute the factorization. We substantiate our theoretical considerations regarding the collective learning capabilities of our model by the means of experiments on both a new dataset and a dataset commonly used in entity resolution. Furthermore, we show on common benchmark datasets that our approach achieves better or onpar results, if compared to current stateoftheart relational learning solutions, while it is significantly faster to compute.
[download] [bibTeX] [discuss]
Variational Inference for Policy Search in changing situations
Gerhard Neumann
Abstract:Many policy search algorithms minimize the KullbackLeibler (KL) divergence to a certain target distribution in order to fit their policy. The commonly used KLdivergence forces the resulting policy to be 'rewardattracted'. The policy tries to reproduce all positively rewarded experience while negative experience is neglected. However, the KLdivergence is not symmetric and we can also minimize the the reversed KLdivergence, which is typically used in variational inference. The policy now becomes 'costaverse'. It tries to avoid reproducing any negativelyrewarded experience while maximizing exploration. Due to this 'costaverseness' of the policy, Variational Inference for Policy Search (VIP) has several interesting properties. It requires no kernelbandwith nor exploration rate, such settings are determined automatically by the inference. The algorithm meets the performance of stateoftheart methods while being applicable to simultaneously learning in multiple situations. We concentrate on using VIP for policy search in robotics. We apply our algorithm to learn dynamic counterbalancing of different kinds of pushes with a humanlike 4link robot.
[download] [bibTeX] [discuss]
Learning Scoring Functions with OrderPreserving Losses and Standardized Supervision
David Buffoni, Clément Calauzenes, Patrick Gallinari, Nicolas Usunier
Abstract:We address the problem of designing surrogate losses for learning scoring functions in the context of label ranking. We extend to ranking problems a notion of order preserving losses previously introduced for multiclass classi?cation, and show that these losses lead to consistent formulations with respect to a family of ranking evaluation metrics. An orderpreserving loss can be tailored for a given evaluation metric by appropriately setting some weights depending on this metric and the observed supervision. These weights, called the standard form of the supervision, do not always exist, but we show that previous consistency results for ranking were proved in special cases where they do. We then evaluate a new pairwise loss consistent with the (Normalized) Discounted Cumulative Gain on benchmark datasets.
[download] [bibTeX] [discuss]
Contractive AutoEncoders: Explicit Invariance During Feature Extraction
Salah RIFAI, Pascal Vincent, Xavier Muller, Xavier Glorot, Yoshua Bengio
Abstract:We present in this paper a novel approach for training deterministic autoencoders. We show that by adding a well chosen penalty term to the classical reconstruction cost function, we can achieve results that equal or surpass those attained by other regularized autoencoders as well as denoising autoencoders on a range of datasets. This penalty term corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. We show that this penalty term results in a localized space contraction which in turn yields robust features on the activation layer. Furthermore, we show how this penalty term is related to both regularized autoencoders and denoising autoencoders and how it can be seen as a link between deterministic and nondeterministic autoencoders. We find empirically that this penalty helps to carve a representation that better captures the local directions of variation dictated by the data, corresponding to a lowerdimensional nonlinear manifold, while being more invariant to the vast majority of directions orthogonal to the manifold. Finally, we show that by using the learned features to initialize an MLP, we achieve state of the art classification error on a range of datasets, surpassing other methods of pretraining.
[download] [bibTeX] [discuss]
Variational Heteroscedastic Gaussian Process Regression
Miguel LazaroGredilla, Michalis Titsias
Abstract:Standard Gaussian processes (GPs) model observations' noise as constant throughout input space. This is often a too restrictive assumption, but one that is needed for GP inference to be tractable. In this work we present a nonstandard variational approximation that allows accurate inference in heteroscedastic GPs (i.e., under inputdependent noise conditions). Computational cost is roughly twice that of the standard GP, and also scales as O(n^3). Accuracy is verified by comparing with the golden standard MCMC and its effectiveness is illustrated on several synthetic and real datasets of diverse characteristics. An application to volatility forecasting is also considered.
[download] [bibTeX] [discuss]
Bounding the Partition Function using Holder's Inequality
Qiang Liu, Alexander Ihler, University of California
Abstract:We describe an algorithm for approximate inference in graphical models based on Holder's inequality that provides upper and lower bounds on common summation problems such as computing the partition function or probability of evidence in a graphical model. Our algorithm unifies and extends several existing approaches, including variable elimination techniques such as minibucket elimination and variational methods such as tree reweighted belief propagation and conditional entropy decomposition. We show that our method inherits benefits from each approach to provide significantly better bounds on sumproduct tasks.
[download] [bibTeX] [discuss]
Dynamic Egocentric Models for Citation Networks
Duy Vu, Arthur Asuncion, David Hunter, Padhraic Smyth
Abstract:The analysis of the formation and evolution of networks over time is of fundamental importance to social science, biology, and many other fields. While longitudinal network data sets are increasingly being recorded at the granularity of individual timestamped events, most studies only focus on collapsed crosssectional snapshots of the network. In this paper, we introduce a dynamic egocentric framework that models continuoustime network data using multivariate counting processes. For inference, an efficient partial likelihood approach is used, allowing our methods to scale to large networks. We apply our techniques to various citation networks and demonstrate the predictive power and interpretability of the learned statistical models.
[download] [bibTeX] [discuss]
The Constrained Weight Space SVM: Learning with Ranked Features
Kevin Small, Byron Wallace, Carla Brodley, Thomas Trikalinos
Abstract:Applying supervised learning methods to new classification tasks requires domain experts to label sufficient training data for the classifier to achieve acceptable performance. It is desirable to mitigate this annotation effort. To this end, a pertinent observation is that instance labels are often an indirect form of supervision; it may be more efficient to impart domain knowledge directly to the model in the form of labeled features. We present a novel algorithm for exploiting such domain knowledge which we call the Constrained Weight Space SVM (CWSVM). In addition to exploiting binary labeled features, our approach allows domain experts to provide ranked features, and, more generally, to express arbitrary expected relationships between sets of features. Our empirical results show that the CWSVM outperforms both baseline supervised learning strategies and previously proposed methods for learning with labeled features.
[download] [bibTeX] [discuss]
Robust Matrix Completion and Corrupted Columns
Yudong Chen, Huan Xu, Constantine Caramanis, Sujay Sanghavi, Dept. of Electrical and Computer Engineering
Abstract:This paper considers the problem of matrix completion, when some number of the columns are arbitrarily corrupted. It is wellknown that standard algorithms for matrix completion can return arbitrarily poor results, if even a single column is corrupted. What can be done if a large number, or even a constant fraction of columns are corrupted? In this paper, we study this very problem, and develop an robust and efficient algorithm for its solution. One direct application comes from robust collaborative filtering. Here, some number of users are socalled manipulators, and try to skew the predictions of the algorithm. Significantly, our results hold {\it without any assumptions on the observed entries of the manipulated columns}.
[download] [bibTeX] [discuss]
Online Discovery of Feature Dependencies
Alborz Geramifard, Finale Doshi, Joshua Redding, Nicholas Roy, Jonathan How
Abstract:Online representational expansion techniques have improved the learning speed of existing reinforcement learning (RL) algorithms in low dimensional domains, yet existing online expansion methods do not scale well to high dimensional problems. We conjecture that one of the main difficulties limiting this scaling is that features defined over the fulldimensional state space often generalize poorly. Hence, we introduce incremental Feature Dependency Discovery (iFDD) as a computationallyinexpensive method for representational expansion that can be combined with any online, valuebased RL method that uses binary features. Unlike other online expansion techniques, iFDD creates new features in low dimensional subspaces of the full state space where feedback errors persist. We provide convergence and computational complexity guarantees for iFDD, as well as showing empirically that iFDD scales well to high dimensional multiagent planning domains with hundreds of millions of stateaction pairs.
[download] [bibTeX] [discuss]
Variational Inference for StickBreaking Beta Process Priors
John Paisley, Lawrence Carin, David Blei
Abstract:We present a variational Bayesian inference algorithm for the stickbreaking construction of the beta process. We derive an alternate representation of the beta process that is amenable to variational inference, and present a bound relating the truncated beta process to its infinite counterpart. We assess performance on two matrix factorization problems, using a nonnegative factorization model and a linearGaussian model.
[download] [bibTeX] [discuss]
Apprenticeship Learning About Multiple Intentions
Monica Babes, Vukosi Marivate, Michael Littman, Kaushik Subramanian
Abstract:In this paper, we apply tools from inverse reinforcement learning (IRL) to the problem of learning from (unlabeled) demonstration trajectories of behavior generated by varying ``intentions'' or objectives. We derive an EM approach that clusters observed trajectories by inferring the objectives for each cluster using any of several possible IRL methods, and then uses the constructed clusters to quickly identify the intent of a trajectory. We show that a natural approach to IRLa gradient ascent method that modifies reward parameters to maximize the likelihood of the observed trajectoriesis successful at quickly identifying unknown reward functions. We demonstrate these ideas in the context of apprenticeship learning by acquiring the preferences of a human driver in a simple highway car simulator.
[download] [bibTeX] [discuss]
Minimum Probability Flow Learning
Jascha SohlDickstein, Peter Battaglino, Michael DeWeese
Abstract:Fitting probabilistic models to data is often difficult, due to the general intractability of the partition function and its derivatives. Here we propose a new parameter estimation technique that does not require computing an intractable normalization factor or sampling from the equilibrium distribution of the model. This is achieved by establishing dynamics that would transform the observed data distribution into the model distribution, and then setting as the objective the minimization of the KL divergence between the data distribution and the distribution produced by running the dynamics for an infinitesimal time. Score matching, minimum velocity learning, and certain forms of contrastive divergence are shown to be special cases of this learning technique. We demonstrate parameter estimation in Ising models, deep belief networks and an independent component analysis model of natural scenes. In the Ising model case, current state of the art techniques are outperformed by at least an order of magnitude in learning time, with lower error in recovered coupling parameters.
[download] [bibTeX] [discuss]
Infinite Dynamic Bayesian Networks
Finale Doshi, David Wingate, Josh Tenenbaum, Nicholas Roy
Abstract:We present the infinite dynamic Bayesian network model (iDBN), a nonparametric, factored statespace model that generalizes dynamic Bayesian networks (DBNs). The iDBN can infer every aspect of a DBN: the number of hidden factors, the number of values each factor can take, and (arbitrarily complex) connections and conditionals between factors and observations. In this way, the iDBN generalizes other nonparametric state space models, which until now generally focused on binary hidden nodes and more restricted connection structures. We show how this new prior allows us to find interesting structure in benchmark tests and on two realworld datasets involving weather data and neural information flow networks.
[download] [bibTeX] [discuss]
The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization
Adam Coates, Andrew Ng
Abstract: While vector quantization (VQ) has been applied widely to generate features for visual recognition problems, much recent work has focused on more powerful methods. In particular, sparse coding has emerged as a strong alternative to traditional VQ approaches and has been shown to achieve consistently higher performance on benchmark datasets. Both approaches can be split into a training phase, where the system learns a dictionary of basis functions from unlabeled data, and an encoding phase, where the dictionary is used to extract features from new inputs. In this work, we investigate the reasons for the success of sparse coding over VQ by decoupling these phases, allowing us to separate out the contributions of the training and encoding in a controlled way. Through extensive experiments on CIFAR, NORB and Caltech 101 datasets, we compare sparse coding and several other training and encoding schemes, including a form of VQ paired with a soft threshold activation function. Our results show not only that we can use fast VQ algorithms for training without penalty, but that we can just as well use randomly chosen exemplars from the training set. Rather than spend resources on training, we find it is more important to choose a good encoderwhich can often be as simple as a feed forward nonlinearity. Among our results, we demonstrate stateoftheart performance on both CIFAR and NORB.
[download] [bibTeX] [discuss]
Fast Global Alignment Kernels
Marco Cuturi
Abstract:We propose novel approaches to cast the widelyused family of Dynamic Time Warping (DTW) distances and similarities as positive definite kernels for time series. To this effect, we provide new theoretical insights on the family of Global Alignment kernels introduced by Cuturi et al. (2007) and propose alternative kernels which are both positive definite and faster to compute. We provide experimental evidence that these alternatives are both faster and more efficient in classification tasks than other kernels based on the DTW formalism.
[download] [bibTeX] [discuss]
Learning attentional policies for tracking and recognition in video with deep networks
Loris Bazzani, Nando Freitas, Hugo Larochelle, Vittorio Murino, JoAnne Ting
Abstract:We propose a novel attentional model for simultaneous object tracking and recognition that is driven by gaze data. Motivated by theories of the human perceptual system, the model consists of two interacting pathways: ventral and dorsal. The ventral pathway models object appearance and classification using deep (factored)restricted Boltzmann machines. At each point in time, the observations consist of retinal images, with decaying resolution toward the periphery of the gaze. The dorsal pathway models the location, orientation, scale and speed of the attended object. The posterior distribution of these states is estimated with particle filtering. Deeper in the dorsal pathway, we encounter an attentional mechanism that learns to control gazes so as to minimize tracking uncertainty. The approach is modular (with each module easily replaceable with more sophisticated algorithms), straightforward to implement, practically efficient, and works well in simple video sequences.
[download] [bibTeX] [discuss]
LargeScale Learning of Embeddings with Reconstruction Sampling
Yann Dauphin, Xavier Glorot, Yoshua Bengio
Abstract:In this paper, we present a novel method to speed up the learning of embeddings for largescale learning tasks involving very sparse data, as is typically the case for Natural Language Processing tasks. Our speedup method has been developed in the context of Denoising Autoencoders, which are trained in a purely unsupervised way to capture the input distribution, and learn embeddings for words and text similar to earlier neural language models. The main contribution is a new method to approximate reconstruction error by a sampling procedure. We show how this approximation can be made to obtain an unbiased estimator of the training criterion, and we show how it can be leveraged to make learning much more computationally efficient. We demonstrate the effectiveness of this method on the Amazon and RCV1 NLP datasets. Instead of reducing vocabulary size to make learning practical, our method allows us to train using very large vocabularies. In particular, reconstruction sampling requires 22x less training time on the full Amazon dataset.
[download] [bibTeX] [discuss]
Automatic Feature Decomposition for Single View Cotraining
Minmin Chen, Kilian Weinberger, Yixin Chen
Abstract:One of the most successful semisupervised learning approaches is cotraining for multiview data. In cotraining, one trains two classifiers, one for each view, and uses the most confident predictions of the unlabeled data for the two classifiers to ``teach each other''. In this paper, we extend cotraining to learning scenarios without an explicit multiview representation. Inspired by a theoretical analysis of Balcan et. al (2004), we introduce a novel algorithm that splits the feature space during learning, explicitly to encourage cotraining to be successful. We demonstrate the efficacy of our proposed method in a weaklysupervised setting on the challenging Caltech256 object recognition task, where we improve significantly over previous results by (Bergamo & Torresani, 2010) in almost all trainingset size settings.
[download] [bibTeX] [discuss]
Mapping kernels for trees
Kilho Shin, Marco Cuturi, Tetsuji Kuboyama
Abstract:We propose a comprehensive survey of tree kernels through the lens of the mapping kernels framework. We argue that most existing tree kernels, as well as many more that are presented for the first time in this paper, fall into a typology of kernels whose seemingly intricate computation can be efficiently factorized to yield polynomial time algorithms. Despite this fact, we argue that a naive implementation of such kernels remains prohibitively expensive to compute. We propose an approach whereby some computations for smaller trees are cached, which speeds up considerably the computation of all these tree kernels. We provide experimental evidence of this fact as well as preliminary results on the performance of these kernels.
[download] [bibTeX] [discuss]
Stochastic LowRank Kernel Learning for Regression
Pierre Machart, Thomas Peel, Sandrine Anthoine, Liva Ralaivola, Hervé Glotin
Abstract:We present a novel approach to learn a kernelbased regression function. It is based on the use of conical combinations of databased parameterized kernels and on a new stochastic convex optimization procedure of which we establish convergence guarantees. The overall learning procedure has the nice properties that a) the learned conical combination is automatically designed to perform the regression task at hand and b) the updates implicated by the optimization procedure are quite inexpensive. In order to shed light on the appositeness of our learning strategy, we present empirical results from experiments conducted on various benchmark datasets.
[download] [bibTeX] [discuss]
Sizeconstrained Submodular Minimization through Minimum Norm Base
Kiyohito Nagano, Yoshinobu Kawahara, Kazuyuki Aihara
Abstract:A number of combinatorial optimization problems in machine learning can be described as the problem of minimizing a submodular function. It is known that the unconstrained submodular minimization problem can be solved in strongly polynomial time. However, additional constraints make the problem intractable in many settings. In this paper, we discuss the submodular minimization under a size constraint, which is NPhard, and generalizes the densest subgraph problem and the uniform graph partitioning problem. Because of NPhardness, it is difficult to compute an optimal solution even for a prescribed size constraint. In our approach, we do not give approximation algorithms. Instead, the proposed algorithm computes optimal solutions for some of possible size constraints in polynomial time. Our algorithm utilizes the basic polyhedral theory associated with submodular functions. Additionally, we evaluate the performance of the proposed algorithm through computational experiments.
[download] [bibTeX] [discuss]
Locally Linear Support Vector Machines
Lubor Ladicky, Philip Torr
Abstract:Linear support vector machines ({\sc svm}s) have become popular for solving classification tasks due to their fast and simple online application to large scale data sets. However, many problems are not linearly separable. For these problems kernelbased {\sc svm}s are often used, but unlike their linear variant they suffer from various drawbacks in terms of computational and memory efficiency. Their response can be represented only as a function of the set of support vectors, which has been experimentally shown to grow linearly with the size of the training set. In this paper we propose a novel locally linear {\sc svm} classifier with smooth decision boundary and bounded curvature. We show how the functions defining the classifier can be approximated using local codings and show how this model can be optimized in an online fashion by performing stochastic gradient descent with the same convergence guarantees as standard gradient descent method for linear {\sc svm}. Our method achieves comparable performance to the stateoftheart whilst being significantly faster than competing kernel {\sc svm}s. We generalise this model to locally finite dimensional kernel {\sc svm}.
[download] [bibTeX] [discuss]
Functional Regularized Least Squares Classication with Operatorvalued Kernels
Hachem Kadri, Asma Rabaoui, philippe Preux, Emmanuel Duflos, Alain Rakotomamonjy
Abstract:Although operatorvalued kernels have recently received increasing interest in various machine learning and functional data analysis problems such as multitask learning or functional regression, little attention has been paid to the understanding of their associated feature spaces. In this paper, we explore the potential of adopting an operatorvalued kernel feature space perspective for the analysis of functional data. We then extend the Regularized Least Squares Classification (RLSC) algorithm to cover situations where there are multiple functions per observation. Experiments on a sound recognition problem show that the proposed method outperforms the classical RLSC algorithm.
[download] [bibTeX] [discuss]
Clustering Partially Observed Graphs via Convex Optimization
Ali Jalali, Yudong Chen, Sujay Sanghavi, Huan Xu
Abstract:This paper considers the problem of clustering a partially observed unweighted graph  i.e. one where for some node pairs we know there is an edge between them, for some others we know there is no edge, and for the remaining we do not know whether or not there is an edge. We want to organize the nodes into disjoint clusters so that there is relatively dense (observed) connectivity within clusters, and sparse across clusters. We take a novel yet natural approach to this problem, by focusing on finding the clustering that minimizes the number of "disagreements"  i.e. the sum of the number of (observed) missing edges within clusters, and (observed) present edges across clusters. Our algorithm uses convex optimization; its basis is a reduction of disagreement minimization to the problem of recovering an (unknown) lowrank matrix and an (unknown) sparse matrix from their partially observed sum. We show that our algorithm succeeds under certain natural assumptions on the optimal clustering and its disagreements. Our results significantly strengthen existing matrix splitting results for the special case of our clustering problem. Our results directly enhance solutions to the problem of Correlation Clustering with partial observations
[download] [bibTeX] [discuss]
On the Use of Variational Inference for Learning Discrete Graphical Models
Eunho Yang, Pradeep Ravikumar
Abstract:We study the general class of estimators for graphical model structure based on optimizing $\ell_1$regularized approximate loglikelihood, where the approximate likelihood uses tractable variational approximations of the partition function. We provide a messagepassing algorithm that \emph{directly} computes the $\ell_1$ regularized approximate MLE. Further, in the case of certain reweighted entropy approximations to the partition function, we show that surprisingly the $\ell_1$ regularized approximate MLE estimator has a \emph{closedform}, so that we would no longer need to run through many iterations of approximate inference and messagepassing. Lastly, we analyze this general class of estimators for graph structure recovery, or its \emph{sparsistency}, and show that it is indeed sparsistent under certain conditions.
[download] [bibTeX] [discuss]
Generating Text with Recurrent Neural Networks
Ilya Sutskever, James Martens, Geoffrey Hinton
Abstract:Recurrent Neural Networks (RNNs) are very powerful sequence models that do not enjoy widespread use because it is extremely difficult to train them properly. Fortunately, recent advances in Hessianfree optimization have been able to overcome the difficulties associated with training RNNs, making it possible to apply them successfully to challenging sequence problems. In this paper we demonstrate the power of RNNs trained with the new HessianFree optimizer (HF) by applying them to characterlevel language modeling tasks. The standard RNN architecture, while effective, is not ideally suited for such tasks, so we introduce a new RNN variant that uses multiplicative (or ``gated'') connections which allow the current input character to determine the transition matrix from one hidden state vector to the next. After training the multiplicative RNN with the HF optimizer for five days on 8 highend Graphics Processing Units, we were able to surpass the performance of the best previous single method for characterlevel language modeling  a hierarchical nonparametric sequence model. To our knowledge this represents the largest recurrent neural network application to date.
[download] [bibTeX] [discuss]
Probabilistic Matrix Addition
Amrudin Agovic, Arindam Banerjee, Snigdhansu Chatterje
Abstract:We introduce Probabilistic Matrix Addition (PMA) for modeling realvalued data matrices by simultaneously capturing covariance structure among rows and among columns. PMA additively combines two latent matrices drawn from two Gaussian Processes respectively over rows and columns. The resulting joint distribution over the observed matrix does not factorize over entries, rows, or columns, and can thus capture intricate dependencies in the matrix. Exact inference in PMA is possible, but involves inversion of large matrices, and can be computationally prohibitive. Efficient approximate inference is possible due to the sparse dependency structure among latent variables. We propose two families of approximate inference algorithms for PMA based on Gibbs sampling and MAP inference. We demonstrate the effectiveness of PMA for missing value prediction and multilabel classification problems.
[download] [bibTeX] [discuss]
Learning Recurrent Neural Networks with HessianFree Optimization
James Martens, Ilya Sutskever
Abstract: In this work we resolve the longoutstanding problem of how to effectively train recurrent neural networks (RNNs) on complex and difficult sequence modeling problems which may contain longterm data dependencies. Utilizing recent advances in the Hessianfree optimization approach \citep{hf}, together with a novel damping scheme, we successfully train RNNs on two sets of challenging problems. First, a collection of pathological synthetic datasets which are known to be impossible for standard optimization approaches (due to their extremely longterm dependencies), and second, on three natural and highly complex realworld sequence datasets where we find that our method significantly outperforms the previous stateoftheart method for training neural sequence models: the Long Shortterm Memory approach of \citet{lstm}. Additionally, we offer a new interpretation of the generalized GaussNewton matrix of \citet{schraudolph} which is used within the HF approach of Martens.
[download] [bibTeX] [discuss]
Sparse Additive Generative Models of Text
Jacob Eisenstein, Amr Ahmed, Eric Xing
Abstract:Generative models of text typically associate a multinomial with every class label or topic. Even in simple models this requires the estimation of thousands of parameters; in multifaceted latent variable models, standard approaches require additional latent ``switching'' variables for every token, complicating inference. In this paper, we propose an alternative generative model for text. The central idea is that each class label or latent topic is endowed with a model of the deviation in logfrequency from a constant background distribution. This approach has two key advantages: we can enforce sparsity to prevent overfitting, and we can combine generative facets through simple addition in log space, avoiding the need for latent switching variables. We demonstrate the applicability of this idea to a range of scenarios: classification, topic modeling, and more complex multifaceted generative models.
[download] [bibTeX] [discuss]
Classificationbased Policy Iteration with a Critic
Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh, Bruno Scherrer
Abstract:In this paper, we study the effect of adding a value function approximation component (critic) to rollout classificationbased policy iteration (RCPI) algorithms. The idea is to use a critic to approximate the return after we truncate the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the actionvalue function. Therefore, the introduction of a critic can improve the accuracy of the rollout estimates, and as a result, enhance the performance of the RCPI algorithm. We present a new RCPI algorithm, called direct policy iteration with critic (DPICritic), and provide its finitesample analysis when the critic is based on the LSTD method. We empirically evaluate the performance of DPICritic and compare it with DPI and LSPI in two benchmark reinforcement learning problems.
[download] [bibTeX] [discuss]
Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection
Abhimanyu Das, David Kempe
Abstract:We study the problem of selecting a subset of $k$ random variables from a large set, in order to obtain the best linear prediction of another variable of interest. This problem can be viewed in the context of both feature selection and sparse approximation. We analyze the performance of widely used greedy heuristics, using insights from the maximization of submodular functions and spectral analysis. We introduce the submodularity ratio as a key quantity to help understand why greedy algorithms perform well even when the variables are highly correlated. Using our techniques, we obtain the strongest known approximation guarantees for this problem, both in terms of the submodularity ratio and the smallest $k$sparse eigenvalue of the covariance matrix. We also analyze greedy algorithms for the dictionary selection problem, and significantly improve the previously known guarantees. Our theoretical analysis is complemented by experiments on realworld and synthetic data sets; the experiments show that the submodular ratio is a stronger predictor of the performance of greedy algorithms than other spectral parameters.
[download] [bibTeX] [discuss]
A Spectral Algorithm for Latent Tree Graphical Models
Ankur Parikh, Le Song, Eric Xing
Abstract:Latent variable models are powerful tools for probabilistic modeling, and have been successfully applied to various domains, such as speech analysis and bioinformatics. However, parameter learning algorithms for latent variable models have predominantly relied on local search heuristics such as expectation maximization (EM). We propose a fast, localminimumfree spectral algorithm for learning latent variable models with arbitrary tree topologies, and show that the joint distribution of the observed variables can be reconstructed from the marginals of triples of observed variables irrespective of the maximum degree of the tree. We demonstrate the performance of our spectral algorithm on synthetic and real datasets; for large training sizes, our algorithm performs comparable to or better than EM while being orders of magnitude faster.
[download] [bibTeX] [discuss]
A Unified Probabilistic Model for Global and Local Unsupervised Feature Selection
Yue Guan, Jennifer Dy, Michael Jordan
Abstract:Existing algorithms for joint clustering and feature selection can be categorized as either global or local approaches. Global methods select a single clusterindependent subset of features, whereas local methods select clusterspecific subsets of features. In this paper, we present a unified probabilistic model that can perform both global and local feature selection for clustering. Our approach is based on a hierarchical betaBernoulli prior combined with a Dirichlet process mixture model. We obtain global or local feature selection by adjusting the variance of the beta prior. We provide a variational inference algorithm for our model. In addition to simultaneously learning the clusters and features, this Bayesian formulation allows us to learn both the number of clusters and the number of features to retain. Experiments on synthetic and real data show that our unified model can find global and local features and cluster data as well as competing methods of each type.
[download] [bibTeX] [discuss]
Towards Making Unlabeled Data Never Hurt
YuFeng Li, ZhiHua Zhou
Abstract:It is usually expected that, when labeled data are limited, the learning performance can be improved by exploiting unlabeled data. In many cases, however, the performances of current semisupervised learning approaches may be even worse than purely using the limited labeled data.It is desired to have \textit{safe} semisupervised learning approaches which never degenerate learning performance by using unlabeled data. In this paper, we focus on semisupervised support vector machines (S3VMs) and propose S4VMs, i.e., safe S3VMs. Unlike S3VMs which typically aim at approaching an optimal lowdensity separator, S4VMs try to exploit the candidate lowdensity separators simultaneously to reduce the risk of identifying a poor separator with unlabeled data. We describe two implementations of S4VMs, and our comprehensive experiments show that the overall performance of S4VMs are highly competitive to S3VMs, while in contrast to S3VMs which degenerate performance in many cases, S4VMs are never significantly inferior to inductive SVMs.
[download] [bibTeX] [discuss]
On Random Weights and Unsupervised Feature Learning
Andrew Saxe, pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, Andrew Ng
Abstract:Recently two anomalous results in the literature have shown that certain feature learning architectures can yield useful features for object recognition tasks even with untrained, random weights. In this paper we pose the question: why do random weights sometimes do so well? Our answer is that certain convolutional pooling architectures can be inherently frequency selective and translation invariant, even with random weights. Based on this we demonstrate the viability of extremely fast architecture search by using random weights to evaluate candidate architectures, thereby sidestepping the timeconsuming learning process. We then show that a surprising fraction of the performance of certain stateoftheart methods can be attributed to the architecture alone.
[download] [bibTeX] [discuss]
Doubly Robust Policy Evaluation and Learning
Miroslav Dudik, John Langford, Lihong Li
Abstract: We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including healthcare policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strength and overcome the weaknesses of the two approaches by applying the \emph{doubly robust} technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have \emph{either} a good (but not necessarily consistent) model of rewards \emph{or} a good (but not necessarily consistent) model of past policy. Extensive empirical comparison demonstrates that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice.
[download] [bibTeX] [discuss]
Learning Deep Energy Models
Jiquan Ngiam, Zhenghao Chen, pang Wei Koh, Andrew Ng
Abstract:Deep generative models with multiple hidden layers have been shown to be able to learn meaningful and compact representations of data. In this work we propose deep energy models, a class of models that use a deep feedforward neural network to model the energy landscape that defines a probabilistic model. We are able to efficiently train all layers of our model at the same time, allowing the lower layers of the model to adapt to the training of the higher layers, producing better generative models. We evaluate the generative performance of our models on natural images and demonstrate that joint training of multiple layers yields qualitative and quantitative improvements over greedy layerwise training. We further generalize our models beyond the commonly used sigmoidal neural networks and show how a deep extension of the product of Studentt distributions model achieves good generative performance. Finally, we introduce a discriminative extension of our model and demonstrate that it outperforms other fullyconnected models on object recognition on the NORB dataset.
[download] [bibTeX] [discuss]
Bipartite Ranking through Minimization of Univariate Loss
Wojciech Kotlowski, Krzysztof Dembczynski, Eyke Huellermeier
Abstract:Minimization of the rank loss or, equivalently, maximization of the AUC in bipartite ranking calls for minimizing the number of disagreements between pairs of instances. Since the complexity of this problem is inherently quadratic in the number of training examples, it is tempting to ask how much is actually lost by minimizing a simple univariate loss function, as done by standard classification methods, as a surrogate. In this paper, we first note that minimization of 0/1 loss is not an option, as it may yield an arbitrarily high rank loss. We show, however, that better results can be achieved by means of a weighted (costsensitive) version of 0/1 loss. Yet, the real gain is obtained through marginbased loss functions, for which we are able to derive proper bounds, not only for rank risk but, more importantly, also for rank regret. The paper is completed with an experimental study in which we address specific questions raised by our theoretical analysis.
[download] [bibTeX] [discuss]
Manifold Identification of Dual Averaging Methods for Regularized Stochastic Online Learning
Sangkyun Lee, Stephen Wright
Abstract:Iterative methods that take steps in approximate subgradient directions have proved to be useful for stochastic learning problems over large or streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, whose purpose is to induce structure (for example, sparsity) in the solution, the solution often lies on a lowdimensional manifold along which the regularizer is smooth. This paper shows that a regularized dual averaging algorithm can identify this manifold with high probability. This observation motivates an algorithmic strategy in which, once a nearoptimal manifold is identified, we switch to an algorithm that searches only in this manifold, which typically has much lower intrinsic dimension than the full space, thus converging quickly to a nearoptimal point with the desired structure. Computational results are presented to illustrate these claims.
[download] [bibTeX] [discuss]
Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions
Alekh Agarwal, Sahand Negahban, Martin Wainwright
Abstract:We analyze a class of estimators based on a convex relaxation for solving highdimensional matrix decomposition problems. The observations are the noisy realizations of the sum of an (approximately) low rank matrix $\Theta^\star$ with a second matrix $\Gamma^\star$ endowed with a complementary form of lowdimensional structure. We derive a general theorem that gives upper bounds on the Frobenius norm error for an estimate of the pair $(\Theta^\star, \Gamma^\star)$ obtained by solving a convex optimization problem. We then specialize our general result to two cases that have been studied in the context of robust PCA: low rank plus sparse structure, and low rank plus a column sparse structure. Our theory yields Frobenius norm error bounds for both deterministic and stochastic noise matrices, and in the latter case, they are minimax optimal. The sharpness of our theoretical predictions is also confirmed in numerical simulations.
[download] [bibTeX] [discuss]
Bundle Selling by Online Estimation of Valuation Functions
Daniel Vainsencher, Ofer Dekel, Shie Mannor
Abstract:We consider the problem of online selection of a bundle of items when the cost of each item changes arbitrarily from round to round and the valuation function is initially unknown and revealed only through the noisy values of selected bundles (the bandit feedback setting). We are interested in learning schemes that have a small regret compared to an agent who knows the true valuation function. Since there are exponentially many bundles, further assumptions on the valuation functions are needed. We make the assumption that the valuation function is supermodular and has nonlinear interactions that are of low degree in a novel sense. We develop efficient learning algorithms that balance exploration and exploitation to achieve low regret in this setting.
[download] [bibTeX] [discuss]
Unsupervised Models of Images by SpikeandSlab RBMs
Aarron Courville, James Bergstra, Yoshua Bengio
Abstract:The spike and slab Restricted Boltzmann Machine (RBM) is defined by having both a real valued ``slab'' variable and a binary ``spike'' variable associated with each unit in the hidden layer. In this paper we generalize and extend the spike and slab RBM to include nonzero means of the conditional distribution over the observed variables conditional on the binary spike variables. We also introduce a term, quadratic in the observed data that we exploit to guarantee the all conditionals associated with the model are well defined  a guarantee that was absent in the original spike and slab RBM. The inclusion of these generalizations improves the performance of the spike and slab RBM as a feature learner and achieves competitive performance on the CIFAR10 image classification task. The spike and slab model, when trained in a convolutional configuration, can generate sensible samples that demonstrate that the model has capture the broad statistical structure of natural images.
[download] [bibTeX] [discuss]
Approximating Correlated Equilibria using Relaxations on the Marginal Polytope
Hetunandan Kamisetty, Eric Xing, Christopher Langmead
Abstract:In game theory, a Correlated Equilibrium (CE) is an equilibrium concept that generalizes the more wellknown Nash Equilibrium. If the game is represented as a graphical game, the computational complexity of computing an optimum CE is exponential in the treewidth of the graph. In settings where this exact computation is not feasible, it is desirable to approximate the properties of the CE, such as its expected social utility and marginal probabilities. We study outer relaxations of this problem that yield approximate marginal strategies for the players under a variety of utility functions. Results on simulated games and in a real problem involving drug design indicate that our approximations can be highly accurate and can be successfully used when exact computation of CE is infeasible.
[download] [bibTeX] [discuss]
Active Learning from Crowds
Yan Yan, Romer Rosales, Glenn Fung, Jennifer Dy
Abstract:Obtaining labels is expensive or timeconsuming, but unlabeled data is often abundant and easy to obtain. Many learning task can profit from intelligently choosing unlabeled instances to be labeled by an oracle also known as active learning, instead of simply labeling all the data or randomly selecting data to be labeled. Supervised learning traditionally relies on an oracle playing the role of a teacher. In the multiple annotator paradigm, an oracle, who knows the ground truth, no longer exists; instead, multiple labelers, with varying expertise, are available for querying. This paradigm posits new challenges to the active learning scenario. We can ask which data sample should be labeled next and which annotator should we query to benefit our learning model the most. In this paper, we develop a probabilistic model for learning from multiple annotators that can also learn the annotator expertise even when their expertise may not be consistently accurate (or inaccurate) across the task domain. In addition, we provide an optimization formulation that allows us to simultaneously learn the most uncertain sample and the annotator/s to query the labels from for active learning. Our active learning approach combines both intelligently selecting samples to label and learning from expertise among multiple labelers to improve learning performance.
[download] [bibTeX] [discuss]
Computational Rationalization: The Inverse Equilibrium Problem
Kevin Waugh, Brian Ziebart, Drew Bagnell
Abstract:Modeling the purposeful behavior of imperfect agents from a small number of observations is a challenging task. When restricted to the singleagent decisiontheoretic setting, inverse optimal control techniques assume that observed behavior is an approximately optimal solution to an unknown decision problem. These techniques learn a utility function that explains the example behavior and can then be used to accurately predict or imitate future behavior in similar observed or unobserved situations. In this work, we consider similar tasks in competitive and cooperative multiagent domains. Here, unlike singleagent settings, a player cannot myopically maximize its reward  it must speculate on how the other agents may act to influence the game's outcome. Employing the gametheoretic notion of regret and the principle of maximum entropy, we introduce a technique for predicting and generalizing behavior, as well as recovering a reward function in these domains.
[download] [bibTeX] [discuss]
FiniteSample Analysis of LassoTD
Mohammad Ghavamzadeh, Alessandro Lazaric, Remi Munos, Matthew Hoffman
Abstract:In this paper, we analyze the performance of LassoTD, a modification of LSTD in which the projection operator is defined as a Lasso problem. We first show that LassoTD is guaranteed to have a unique fixed point and its algorithmic implementation coincides with the recently presented LARSTD and LCTD methods. We then derive two bounds on the prediction error of LassoTD in the Markov design setting, i.e., when the performance is evaluated on the same states used by the method. The first bound makes no assumption, but has a slow rate w.r.t. the number of samples. The second bound is under an assumption on the empirical Gram matrix, called the compatibility condition, but has an improved rate and directly relates the prediction error to the sparsity of the value function in the feature space at hand.
[download] [bibTeX] [discuss]
Generalized Value Functions for Large Action Sets
Jason Pazis, Ron Parr
Abstract:The majority of value function approximation based reinforcement learning algorithms available today, focus on approximating the state (V) or stateaction (Q) value function and efficient action selection comes as an afterthought. On the other hand, realworld problems tend to have large action spaces, where evaluating every possible action becomes impractical. This mismatch presents a major obstacle in successfully applying reinforcement learning to realworld problems. In this paper we present a unified view of V and Q functions and arrive at a new spaceefficient representation, where action selection can be done exponentially faster, without the use of a model. We then describe how to calculate this new value function efficiently via approximate linear programming and provide experimental results that demonstrate the effectiveness of the proposed approach.
[download] [bibTeX] [discuss]
kDPPs: FixedSize Determinantal Point Processes
Alex Kulesza, Ben Taskar
Abstract:Determinantal point processes (DPPs) have recently been proposed as models for set selection problems where diversity is preferred. For example, they can be used to select diverse sets of sentences to form document summaries, or to find multiple nonoverlapping human poses in an image. However, DPPs conflate the modeling of two distinct characteristics: the size of the set, and its content. For many realistic tasks, the size of the desired set is known up front; e.g., in search we may want to show the user exactly ten results. In these situations the effort spent by DPPs modeling set size is not only wasteful, but actually introduces unwanted bias into the modeling of content. Instead, we propose the kDPP, a conditional DPP that models only sets of cardinality k. In exchange for this restriction, kDPPs offer greater expressiveness and control over content, and simplified integration into applications like search. We derive algorithms for efficiently normalizing, sampling, and marginalizing kDPPs, and propose an expertsstyle algorithm for learning combinations of kDPPs. We demonstrate the usefulness of the model on an image search task, where kDPPs significantly outperform MMR as judged by human annotators.
[download] [bibTeX] [discuss]
On Autoencoders and Score Matching for Energy Based Models
Kevin Swersky, Marc'Aurelio Ranzato, David Buchman, Benjamin Marlin, Nando Freitas
Abstract:We consider estimation methods for the class of continuousdata energy based models (EBMs). Our main result shows that estimating the parameters of an EBM using score matching when the conditional distribution over the visible units is Gaussian corresponds to training a particular form of regularized autoencoder. We show how different Gaussian EBMs lead to different autoencoder architectures, providing deep links between these two families of models. We compare the score matching estimator for the mPoT model, a particular Gaussian EBM, to several other training methods on a variety of tasks including image denoising and unsupervised feature extraction. We show that the regularization function induced by score matching leads to superior classification performance relative to a standard autoencoder. We also show that score matching yields classification results that are indistinguishable from betterknown stochastic approximation maximum likelihood estimators.
[download] [bibTeX] [discuss]
Generalized Boosting Algorithms for Convex Optimization
Alexander Grubb, Drew Bagnell
Abstract:Boosting is a popular way to derive powerful learners from simpler hypothesis classes. Following previous work (Mason et al., 1999; Friedman, 2000) on general boosting frameworks, we analyze gradientbased descent algorithms for boosting with respect to any convex objective and introduce a new measure of weak learner performance into this setting which generalizes existing work. We present the first weak to strong learning guarantees for the existing gradient boosting work for smooth convex objectives, and also demonstrate that this work fails for nonsmooth objectives. To address this issue, we present new algorithms which extend this boosting approach to arbitrary convex loss functions and give corresponding weak to strong convergence results. In addition, we demonstrate experimental results that support our analysis and demonstrate the need for the new algorithms we present.
[download] [bibTeX] [discuss]
Invited CrossConference Track
Debt Collections Using Constrained Reinforcement Learning
Naoki Abe; Prem Melville; Cezar Pendus; David L. Jensen; Chandan K. Reddy; Vince P. Thomas; James J. Bennett; Gary F. Anderson; Brent R. Cooley; Melissa Weatherwax; Timothy Gardinier; Gerard Miller
[download]
Bangpeng Yao; Aditya Khosla; Li FeiFei
[download]
Abraham Bachrach; Ruijie He; Nicholas Roy
[download]
Gil Weinberg
[download]
Christina Leslie
[download]
Maayan Roth; Tzvika Barenholz; Assaf BenDavid; David Deutscher; Guy Flysher; Avinatan Hassidim; Ilan Horn; Ari Leichtberg; Naty Leiser; Yossi Matias; Ron Merom
[download]
Fernando Diaz; Donald Metzler; Sihem AmerYahia
[download]
Rajesh Ranganath; Dan Jurafsky; Dan McFarland
[download]
