Machine learning has become a popular application domain for modern optimization techniques, pushing its algorithmic frontier. The need for large-scale optimization algorithms that can handle millions of dimensions or data points, typical of the big data era, has brought a resurgence of interest in first-order algorithms, making us revisit the venerable stochastic gradient method [Robbins-Monro 1951] as well as the Frank-Wolfe algorithm [Frank-Wolfe 1956]. In this talk, I will review recent improvements to these algorithms that exploit the structure of modern machine learning approaches. I will explain why the Frank-Wolfe algorithm has become so popular lately, and present a surprising tweak to the stochastic gradient method which yields a fast linear convergence rate. Motivating applications will include weakly supervised video analysis and structured prediction problems.
Organizers: Philipp Hennig
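As a rough illustration of the Frank-Wolfe algorithm named in the abstract above, the following is a minimal sketch, not the speaker's method. It assumes a toy problem chosen here for illustration (minimizing 0.5 * ||x - b||^2 over the probability simplex) and the standard open-loop step size 2 / (t + 2); the function name `frank_wolfe_simplex` is hypothetical.

```python
def frank_wolfe_simplex(b, num_iters=5000):
    """Minimize 0.5 * ||x - b||^2 over the probability simplex (toy problem)."""
    n = len(b)
    x = [1.0 / n] * n  # start at the barycenter, which is feasible
    for t in range(num_iters):
        # Gradient of the quadratic objective at the current iterate
        grad = [xi - bi for xi, bi in zip(x, b)]
        # Linear minimization oracle: on the simplex, the minimizer of a
        # linear function is the vertex with the smallest gradient entry
        i_star = min(range(n), key=lambda i: grad[i])
        gamma = 2.0 / (t + 2.0)  # standard open-loop step size
        # Convex combination of the iterate and the chosen vertex,
        # so the iterate stays feasible without any projection
        x = [(1.0 - gamma) * xi for xi in x]
        x[i_star] += gamma
    return x
```

Each iteration only needs a linear minimization over the feasible set rather than a projection, which is one commonly cited reason for the method's appeal on structured domains.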
Under acute threat, biological agents need to choose adaptive actions to survive. In my talk, I will provide a decision-theoretic view on this problem and ask what the potential computational algorithms for this choice are, and how they are implemented in neural circuits. Rational design principles and non-human animal data tentatively suggest a specific architecture that relies heavily on tailored algorithms for specific threat scenarios. Virtual reality computer games provide an opportunity to translate non-human animal tasks to humans and investigate these algorithms across species. I will discuss the specific challenges for empirical inference on underlying neural circuits given such an architecture.
Organizers: Michel Besserve
Performance metrics are a key component of machine learning systems, and are ideally constructed to reflect real-world tradeoffs. In contrast, much of the literature simply focuses on algorithms for maximizing accuracy. With the increasing integration of machine learning into real systems, it is clear that accuracy is an insufficient measure of performance for many problems of interest. Unfortunately, unlike accuracy, many real-world performance metrics are non-decomposable, i.e., they cannot be computed as a sum of losses over individual instances. Thus, known algorithms and their associated analyses do not extend trivially, and direct approaches require expensive combinatorial optimization. I will outline recent results characterizing population optimal classifiers for large families of binary and multilabel classification metrics, including such nonlinear metrics as F-measure and Jaccard measure. Perhaps surprisingly, the prediction which maximizes the utility for a range of such metrics takes a simple form. This results in simple and scalable procedures for optimizing complex metrics in practice. I will also outline how the same analysis gives optimal procedures for selecting point estimates from complex posterior distributions for structured objects such as graphs. Joint work with Nagarajan Natarajan, Bowei Yan, Kai Zhong, Pradeep Ravikumar and Inderjit Dhillon.
Organizers: Mijung Park
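To make the notion of a non-decomposable metric in the abstract above concrete, here is a minimal sketch. It computes the F1 variant of the F-measure from confusion counts pooled over the whole sample, and then picks a decision threshold on classifier scores by direct search. Treating "threshold the score" as the simple form of the optimal prediction is an assumption made here for illustration; the abstract does not state the form, and the function names are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 from pooled confusion counts; not a sum of per-instance losses."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores):
    """Search the observed scores for the threshold with the highest F1."""
    best_f1, best_thr = -1.0, None
    for thr in sorted(set(scores)):
        # Predict positive whenever the score reaches the threshold
        y_pred = [1 if s >= thr else 0 for s in scores]
        f1 = f1_score(y_true, y_pred)
        if f1 > best_f1:
            best_f1, best_thr = f1, thr
    return best_thr, best_f1
```

Because F1 depends on the counts jointly, changing one prediction changes the contribution of all others, which is why a search over a one-dimensional threshold is attractive compared with combinatorial optimization over label vectors.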
We present a way to set the step size of stochastic gradient descent as the solution of a distance minimization problem. The obtained result has an intuitive interpretation and resembles the update rules of well-known optimization algorithms. Asymptotic results on its relation to the optimal learning rate of gradient descent are also discussed. In addition, we discuss two different estimators, with applications to variational inference problems, and present approximate results on their variance. Finally, we combine all of the above to present an optimization algorithm that can be used for both mini-batch optimization and variational problems.
Organizers: Philipp Hennig
Bioelectronics applies principles of electrical engineering and materials science to biology, medicine and ultimately health. Soft bioelectronics focuses on designing and manufacturing electronic devices with mechanical properties close to those of the host biological tissue, so that long-term reliability and minimal perturbation are achieved in vivo and/or truly wearable systems become possible. We illustrate the potential of this soft technology with examples ranging from prosthetic tactile skins to soft multimodal neural implants.
Organizers: Diana Rebmann
Vaccine refusal can lead to outbreaks of previously eradicated diseases and is an increasing problem worldwide. Vaccination decisions exemplify a complex, coupled system where vaccinating behavior and disease dynamics influence one another. Complex systems often exhibit characteristic dynamics near a tipping point to a new dynamical regime. For instance, critical slowing down (the tendency for a system to start 'wobbling') can increase close to a tipping point. We used a linear support vector machine to classify the sentiment of geo-located United States and California tweets concerning measles vaccination from 2011 to 2016. We also extracted data on internet searches on measles from Google Trends. We found evidence for critical slowing down in both datasets in the years before and after the 2014-15 Disneyland, California measles outbreak, suggesting that the population approached a tipping point corresponding to widespread vaccine refusal, but then receded from the tipping point in the face of the outbreak. A differential equation model of coupled behavior-disease dynamics is shown to exhibit the same patterns. We conclude that studying critical phenomena in online social media data can help us develop analytical tools based on dynamical systems theory to identify populations at heightened risk of widespread vaccine refusal.
Organizers: Diana Rebmann
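The abstract above does not say which indicators of critical slowing down were computed. A commonly used one in the early-warning-signal literature is the lag-1 autocorrelation of a time series within a sliding window, which tends to rise as a system recovers more slowly from perturbations. The sketch below assumes that indicator; the function names and the window parameter are hypothetical.

```python
def lag1_autocorrelation(series):
    """Lag-1 autocorrelation of a sequence of numbers (0.0 if constant)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0  # a constant series carries no autocorrelation signal
    cov = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    return cov / var


def rolling_indicator(series, window):
    """Lag-1 autocorrelation in each sliding window of the given length."""
    return [
        lag1_autocorrelation(series[i:i + window])
        for i in range(len(series) - window + 1)
    ]
```

Applied to, for example, a weekly sentiment score, an upward trend in the rolling indicator would be read as a possible sign of approach to a tipping point, subject to the usual caveats about detrending and window choice.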
Standard methods of causal discovery take as input a statistical data set of measurements of well-defined causal variables. The goal is then to determine the causal relations among these variables. But how are these causal variables identified or constructed in the first place? Often we have sensor-level data but assume that the relevant causal interactions occur at a higher scale of aggregation. Sometimes we only have aggregate measurements of causal interactions at a finer scale. I will motivate the general problem of causal discovery and present recent work on a framework and method for the construction and identification of causal macro-variables that ensures that the resulting causal variables have well-defined intervention distributions. Time permitting, I will show an application of this approach to large-scale climate data, for which we were able to identify the macro-phenomenon of El Niño using an unsupervised method on micro-level measurements of the sea surface temperature and wind speeds over the equatorial Pacific.
Organizers: Sebastian Weichwald
Autonomous systems rely on learning from experience to automatically refine their strategy and adapt to their environment, and thereby have huge advantages over traditional hand-engineered systems. At PROWLER.io we use reinforcement learning (RL) for sequential decision making under uncertainty to develop intelligent agents capable of acting in dynamic and unknown environments. In this talk we first give a general overview of the goals and the research conducted at PROWLER.io. Then, we will talk about two specific research topics. The first is Information-Theoretic Model Uncertainty, which deals with the problem of making robust decisions that take into account unspecified models of the environment. The second is Deep Model-Based Reinforcement Learning, which deals with the problem of learning the transition and reward functions of a Markov Decision Process in order to use them for data-efficient learning.
Organizers: Michel Besserve
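The model-based idea in the abstract above, learning the transition and reward functions of a Markov Decision Process from experience, can be illustrated in its simplest tabular form. The sketch below is a count-based stand-in chosen here for illustration, not the deep models the talk refers to; the function name and the `(state, action, reward, next_state)` tuple layout are assumptions.

```python
from collections import defaultdict


def estimate_model(transitions):
    """Estimate tabular transition probabilities and mean rewards.

    transitions: iterable of (state, action, reward, next_state) tuples.
    Returns (P, R) where P[(s, a)][s_next] is an empirical probability
    and R[(s, a)] is the empirical mean reward.
    """
    counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(float)
    totals = defaultdict(int)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        totals[(s, a)] += 1
    # Normalize counts into empirical transition probabilities
    P = {
        sa: {s2: c / totals[sa] for s2, c in nxt.items()}
        for sa, nxt in counts.items()
    }
    # Average the observed rewards for each state-action pair
    R = {sa: reward_sums[sa] / totals[sa] for sa in totals}
    return P, R
```

Once such a model is available, a planner can evaluate candidate actions against it instead of the real environment, which is where the data efficiency mentioned in the abstract comes from.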
Probabilistic deep learning methods have recently made great progress for generative and discriminative modeling. I will give a brief overview of recent developments and then present two contributions. The first is a generalization of generative adversarial networks (GANs), extending their use considerably. GANs can be shown to approximately minimize the Jensen-Shannon divergence between two distributions, the true sampling distribution and the model distribution. We extend GANs to the class of f-divergences, which includes popular divergences such as the Kullback-Leibler divergence. This enables applications to variational inference and likelihood-free maximum likelihood, and allows GAN models to become basic building blocks in larger models. The second contribution is to consider representation learning using variational autoencoder models. To make learned representations of data useful we need to ground them in semantic concepts. We propose a generative model that can decompose an observation into multiple separate latent factors, each of which represents a separate concept. Such a disentangled representation is useful for recognition and for precise control in generative modeling. We learn our representations using weak supervision in the form of groups of observations where all samples within a group share the same value in a given latent factor. To make such learning feasible we generalize recent methods for amortized probabilistic inference to the dependent case. Joint work with: Ryota Tomioka (MSR Cambridge), Botond Cseke (MSR Cambridge), Diane Bouchacourt (Oxford)
Organizers: Lars Mescheder
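For readers unfamiliar with the f-divergence family mentioned in the abstract above, the sketch below evaluates D_f(P || Q) = sum_x q(x) f(p(x) / q(x)) for discrete distributions, with generator functions that recover the Kullback-Leibler and Jensen-Shannon divergences. It only illustrates the definition and assumes q(x) > 0 everywhere; it is not the adversarial training procedure from the talk, and the function names are hypothetical.

```python
import math


def _xlogx(u):
    """u * log(u), with the usual convention 0 * log(0) = 0."""
    return u * math.log(u) if u > 0 else 0.0


def f_kl(u):
    """Generator f(u) = u log u, which yields KL(P || Q)."""
    return _xlogx(u)


def f_js(u):
    """Generator that yields the Jensen-Shannon divergence."""
    return 0.5 * (_xlogx(u) - (u + 1.0) * math.log((u + 1.0) / 2.0))


def f_divergence(p, q, f):
    """D_f(P || Q) for discrete distributions with q(x) > 0 everywhere."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))
```

Both generators are convex with f(1) = 0, so the divergence vanishes when the two distributions coincide; swapping the generator changes which discrepancy between model and data is penalized.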
As large tensor-variate data increasingly become the norm in applied machine learning and statistics, complex analysis methods similarly increase in prevalence. Such a trend offers the opportunity to understand more intricate features of the data that, ostensibly, could not be studied with simpler datasets or simpler methodologies. While promising, these advances are also perilous: these novel analysis techniques do not always consider the possibility that their results are in fact an expected consequence of some simpler, already-known feature of simpler data (for example, treating the tensor like a matrix or a univariate quantity) or simpler statistic (for example, the mean and covariance of one of the tensor modes). I will present two works that address this growing problem, the first of which uses Kronecker algebra to derive a tensor-variate maximum entropy distribution that shares modal moments with the real data. This distribution of surrogate data forms the basis of a statistical hypothesis test, and I use this method to answer a question of epiphenomenal tensor structure in populations of neural recordings in the motor and prefrontal cortex. In the second part, I will discuss how to extend this maximum entropy formulation to arbitrary constraints using deep neural network architectures in the flavor of implicit generative modeling, and I will use this method in a texture synthesis application.
Organizers: Philipp Hennig