The experts at MINDS are working to establish the principles behind the analysis and interpretation of massive amounts of complex data.
What are the theoretical foundations of data science? Our ambitious and multidisciplinary research agenda covers topics in mathematics, statistics, and theoretical computer science, as well as important questions at the intersection of all three.
Foundations of Representation Learning
Representation learning is a fundamental challenge in machine learning. Unsupervised techniques for representation learning capitalize on unlabeled data (which is often cheap and abundant) to learn a representation that reveals intrinsic low-dimensional structure in the data, disentangles underlying factors of variation, and is useful across multiple tasks and domains. We’re focusing on developing new theory and methods for representation learning that scale easily to large datasets. We’re interested in formally understanding which training criteria should be employed for representation learning, what a good notion of learnability is in unsupervised settings and how it informs methodology, and which inductive biases, implicit in algorithmic and heuristic approaches to representation learning, help with generalization.
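As a minimal, purely illustrative sketch of the simplest linear case, the snippet below uses PCA on synthetic unlabeled data to recover a low-dimensional representation; the data, dimensions, and variable names are all invented for the example.

```python
# A minimal sketch of unsupervised representation learning: PCA recovers a
# low-dimensional linear representation from unlabeled data. Illustrative only;
# the data and dimensions are made up.
import numpy as np

rng = np.random.default_rng(0)

# Unlabeled data with intrinsic 5-dimensional structure embedded in 100 dimensions.
n, d, k = 1000, 100, 5
latent = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
X = latent @ mixing + 0.1 * rng.normal(size=(n, d))

# Learn a k-dimensional representation from the top principal directions.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T          # learned representation, shape (n, k)

print(Z.shape)
```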
Streaming and Sketching Algorithms for Clustering and Learning
As data sets have come to outgrow the available memory on most machines, streaming and sketching models of computation have emerged as popular areas of algorithmic research. Yet, despite significant progress in streaming and sketching algorithms, there is still a large gap between theory and practice. We are applying these methods to a wide class of learning problems including regression, M-estimators, and subspace clustering. We are also developing scalable frameworks for statistical inference of multimodal, high-dimensional, large, and complex data.
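As a toy illustration of the sketch-and-solve idea, the snippet below streams the rows of a synthetic regression problem one at a time, maintains only a small Gaussian sketch in memory, and then solves the compressed least-squares problem; the plain Gaussian sketch and all problem sizes are assumptions made for the example, not the structured sketches used in practice.

```python
# A toy single-pass "sketch-and-solve" for least-squares regression: rows
# arrive one at a time and only an (m, d) sketch is kept in memory, never the
# full (n, d) data matrix. Illustrative only.
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 20_000, 50, 500          # stream length, dimension, sketch size
x_true = rng.normal(size=d)

SA = np.zeros((m, d))              # sketch of the design matrix
Sb = np.zeros(m)                   # sketch of the response vector

for _ in range(n):                 # single pass over the (simulated) stream
    a_i = rng.normal(size=d)       # next streamed row
    b_i = a_i @ x_true + 0.01 * rng.normal()
    g_i = rng.normal(size=m) / np.sqrt(m)
    SA += np.outer(g_i, a_i)       # update the sketches; O(m*d) memory total
    Sb += g_i * b_i

x_hat, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
print(np.linalg.norm(x_hat - x_true))   # should be small for this easy instance
```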
Modeling and Discovering Linguistic Structure
Human languages are intricate systems with interacting structures across many levels. Any collection of text or speech is created by stochastic processes that consider the structure of the events and concepts being discussed, the syntactic structure of sentences, the internal structure of words, the spelling or pronunciation of morphemes, and the meaning of those morphemes. These processes are further influenced by speakers’ efforts to be understood and by the evolution of language over time. We are developing probabilistic models and approximate inference methods that can discover the structure of a language and can reason about the structure of specific words, sentences, and documents. Our approaches are also relevant to other domains that call for integrating data science with prior theories involving latent variables.
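As a minimal, hypothetical example of the latent-variable reasoning involved, the snippet below scores a character string under a tiny hidden Markov model with invented consonant/vowel hidden states, marginalizing over hidden state sequences with the forward algorithm; the states, alphabet, and probabilities are all made up for illustration.

```python
# Forward algorithm for a tiny hidden Markov model: the probability of a
# character sequence is computed by summing over all hidden state paths.
import numpy as np

states = ["C", "V"]                 # hypothetical consonant/vowel hidden states
alphabet = {"a": 0, "b": 1, "n": 2}

pi = np.array([0.5, 0.5])                   # initial state distribution
T = np.array([[0.2, 0.8],                   # transition probabilities C->., V->.
              [0.7, 0.3]])
E = np.array([[0.1, 0.5, 0.4],              # emission probabilities per state
              [0.8, 0.1, 0.1]])

def forward_logprob(word):
    """Log-probability of a word, marginalizing over hidden state paths."""
    obs = [alphabet[c] for c in word]
    alpha = pi * E[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ T) * E[:, o]
    return np.log(alpha.sum())

print(forward_logprob("banana"))
```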
High-Dimensional Learning in the Small-Sample Regime
Whereas the impact of high-dimensional machine learning has been enormous in many areas of science, the same cannot be said for learning scenarios where the sample size (n) is very small compared to the dimension (d) of the data – the so-called “small n, large d” regime. For example, in computational molecular medicine, the impact of high-dimensional learning on clinical practice has been very limited despite almost two decades of applying data science to learn models and predictors for disease. A primary reason is that most reported results fail to replicate in cross-study validation. We believe that the only solution may be to massively reduce complexity by a combination of “digitizing” high-dimensional data relative to a baseline class and hardwiring domain-specific information about mechanism into the learning and inference procedures.
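As one possible, deliberately simplified reading of “digitizing” relative to a baseline class, the toy sketch below reduces each feature of a synthetic case group to a single bit (above or below the baseline median); the data and the thresholding rule are invented for illustration and do not represent any specific clinical pipeline.

```python
# Toy "digitization" of high-dimensional data relative to a baseline class:
# each feature becomes one bit (above/below the baseline median), collapsing
# continuous measurements into a much simpler representation. Illustrative only.
import numpy as np

rng = np.random.default_rng(2)
n_base, n_case, d = 20, 20, 5_000             # "small n, large d"

baseline = rng.normal(0.0, 1.0, size=(n_base, d))
cases = rng.normal(0.0, 1.0, size=(n_case, d))
cases[:, :50] += 2.0                          # a handful of truly shifted features

thresholds = np.median(baseline, axis=0)      # per-feature baseline reference
digitized = (cases > thresholds).astype(int)  # one bit per feature per sample

# A crude complexity-reduced statistic: how often each feature exceeds baseline.
flip_rate = digitized.mean(axis=0)
print(np.argsort(flip_rate)[-10:])            # features most often above baseline
```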
Learning Maps for Complex Heterogeneous Data
Establishing connections between different data sets, regressing a function whose range is high-dimensional, and matching complex objects such as shapes or graphs often require learning maps. These maps often have both high-dimensional domains and ranges. We are developing novel approaches for learning maps that correspond to a higher level of abstraction in machine learning tasks. We are also studying how to detect, learn, and disregard invariances – or approximate invariances – that make learning tasks difficult by inflating the dimensionality of problems in computer vision and machine learning.
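As a minimal sketch of the simplest instance of such a map, the snippet below fits a linear map between a high-dimensional domain and range with multi-output ridge regression; the linear model and all dimensions are assumptions made for the example.

```python
# Learning a map with a high-dimensional domain and range: multi-output ridge
# regression fits a linear map from R^p to R^q in closed form. Illustrative only.
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 200, 300, 250                 # fewer samples than input dimensions

W_true = rng.normal(size=(p, q)) / np.sqrt(p)
X = rng.normal(size=(n, p))
Y = X @ W_true + 0.05 * rng.normal(size=(n, q))

lam = 1.0                               # ridge penalty keeps the map well-posed
W_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

print(np.linalg.norm(X @ W_hat - Y) / np.linalg.norm(Y))   # relative fit error
```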
Statistical Inference on Attributed Graphs
We are considering a comprehensive paradigm for statistical inference on random graphs centered on spectral embedding for subsequent inference. We are examining the analogues, in graph inference, of several canonical tenets of classical Euclidean inference, including consistency and asymptotic normality of spectral embeddings, and the role these embeddings can play in the construction of data analytic methods for graph data. We are considering theory, methods, and applications.
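As a small illustration of the spectral-embedding starting point, the snippet below simulates a two-block stochastic block model and embeds each node using the leading eigenpairs of the adjacency matrix; the block probabilities, graph size, and embedding dimension are chosen only for illustration.

```python
# Adjacency spectral embedding for a two-block stochastic block model: each
# node is embedded via the top eigenpairs of the adjacency matrix, and the
# blocks separate in the embedded space. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 2                                 # nodes, embedding dimension
blocks = np.repeat([0, 1], n // 2)
B = np.array([[0.5, 0.1],                     # within/between-block edge probs
              [0.1, 0.4]])

P = B[np.ix_(blocks, blocks)]
A = np.triu(rng.random((n, n)) < P, k=1)
A = (A + A.T).astype(float)                   # symmetric, hollow adjacency

vals, vecs = np.linalg.eigh(A)
top = np.argsort(np.abs(vals))[-d:]           # d largest-magnitude eigenpairs
X_hat = vecs[:, top] * np.sqrt(np.abs(vals[top]))   # spectral embedding, (n, d)

print(X_hat[:3])                              # rows estimate latent positions
```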
Mathematical Analysis of Non-Convex Optimization Algorithms
Optimization has been an essential tool for solving many diverse data-driven problems. Historically, great focus has been placed on advances in convex optimization, since convex subproblems are routinely used in a wide range of applications. One may argue, however, that this focus is too limiting for many challenging applications. By allowing non-convex techniques to be adopted, we gain additional modeling flexibility that can be used to improve estimation robustness, improve algorithm scalability, and better model the corruptions present in real-life, data-driven applications. Therefore, we are designing a new generation of model-based algorithms for non-convex optimization, as well as developing the associated theory. Our goal is to design scalable optimization methods that are capable of efficiently identifying local minimizers and, when certain favorable problem structure exists, global minimizers.
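As a deliberately simple illustration of why non-convexity matters, the snippet below runs gradient descent from random starting points on a made-up one-dimensional objective with two local minimizers; random restarts are a crude stand-in for the structured, model-based algorithms described above.

```python
# Gradient descent on a non-convex function can stop at a local minimizer;
# random restarts are a simple (if crude) way to search for better ones.
import numpy as np

def f(x):
    return 0.25 * x**4 - 2.0 * x**2 + 0.5 * x    # two local minima, one global

def grad(x):
    return x**3 - 4.0 * x + 0.5

def gradient_descent(x0, step=0.05, iters=500):
    x = x0
    for _ in range(iters):
        x -= step * grad(x)
    return x

rng = np.random.default_rng(5)
starts = rng.uniform(-3.0, 3.0, size=10)
solutions = [gradient_descent(x0) for x0 in starts]
best = min(solutions, key=f)
print(best, f(best))        # the best stationary point found across restarts
```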
Mathematics of Deep Learning
We have seen a dramatic increase in the performance of recognition systems thanks to the introduction of deep networks for representation learning. However, the mathematical reasons for this success remain elusive. One challenge is that there is currently no theory regarding how the network architecture should be constructed. Another challenge is that existing networks have billions of parameters, which make them prone to overfitting. A third challenge is that the training problem for deep networks is non-convex, so optimization algorithms could get stuck in non-global minima. We are developing a mathematical framework that addresses these challenges, provides theoretical insights for the success of current network architectures, and guides the design of novel architectures.
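As a minimal, purely illustrative example of the non-convex training problem, the snippet below fits a tiny one-hidden-layer network to XOR by gradient descent; the task, architecture, and hyperparameters are all invented for the example.

```python
# A tiny one-hidden-layer network trained on XOR: the squared-error objective
# is non-convex in the weights, so training can in principle stall in a poor
# local minimum, which is precisely the point. Illustrative only.
import numpy as np

rng = np.random.default_rng(6)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])            # XOR labels

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)     # hidden layer
W2, b2 = rng.normal(size=(8, 1)) * 0.5, np.zeros(1)  # output layer
lr = 0.1

for _ in range(5000):
    H = np.tanh(X @ W1 + b1)                      # forward pass
    out = H @ W2 + b2
    err = out - y                                 # squared-error gradient
    # backward pass (chain rule), then a gradient step on each parameter
    dW2, db2 = H.T @ err, err.sum(axis=0)
    dH = (err @ W2.T) * (1 - H**2)
    dW1, db1 = X.T @ dH, dH.sum(axis=0)
    W1 -= lr * dW1 / 4; b1 -= lr * db1 / 4
    W2 -= lr * dW2 / 4; b2 -= lr * db2 / 4

print(np.round(np.tanh(X @ W1 + b1) @ W2 + b2, 2))   # should approximate XOR
```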
Statistical Analysis in Shape Spaces
Data sets where each observation is a curve, a surface, or a more complicated geometric object are becoming common in computer vision and medical imaging. Their statistical analysis raises multiple challenges, including defining the spaces where the data belong as mathematical objects. Such spaces are typically defined as (infinite-dimensional) Riemannian manifolds and can be constructed through the action of deformations (induced by the diffeomorphism group) on objects. In such spaces, measuring distances and computing averages or low-dimensional data representations require computationally intensive numerical algorithms. In these settings, testing statistical hypotheses (for example, to assess the impact of disease on human organs) and validating models (describing, for example, propagation modes of neurodegenerative diseases across brain structures) become challenging.
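As a toy stand-in for these computations, the snippet below computes the Fréchet (intrinsic) mean of synthetic points on the unit sphere by alternating tangent-space averaging with the exponential map; the sphere is, of course, far simpler than the shape spaces described above, and the data are made up.

```python
# Intrinsic (Fréchet) mean of points on the unit sphere, computed by repeatedly
# averaging in the tangent space and mapping back with the exponential map.
import numpy as np

rng = np.random.default_rng(7)

def log_map(p, q):
    """Tangent vector at p pointing toward q along the geodesic."""
    w = q - np.dot(p, q) * p
    theta = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    norm = np.linalg.norm(w)
    return theta * w / norm if norm > 1e-12 else np.zeros_like(p)

def exp_map(p, v):
    """Point reached from p along the geodesic with initial velocity v."""
    t = np.linalg.norm(v)
    return p if t < 1e-12 else np.cos(t) * p + np.sin(t) * v / t

# Synthetic data: noisy directions clustered around a common axis.
points = rng.normal(loc=[3.0, 1.0, 0.5], scale=0.4, size=(50, 3))
points /= np.linalg.norm(points, axis=1, keepdims=True)

mean = points[0]
for _ in range(20):                       # fixed-point iteration for the mean
    tangent = np.mean([log_map(mean, q) for q in points], axis=0)
    mean = exp_map(mean, tangent)

print(mean)                               # intrinsic mean on the sphere
```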
Understanding How Vision Systems Deal with Massive Amounts of Complex Data
The visual system is humans’ under-appreciated superpower, enabling us to gather information about the universe at ranges from a fraction of an inch to millions of light years. Every second, the visual system processes massive amounts of data and solves incredibly complex problems while consuming orders of magnitude less power than advanced computers. To understand how this is possible, and to develop computer vision systems with similar capabilities, we use advanced mathematical models and machine learning tools combined with insights from cognitive science and neuroscience. We seek to develop algorithms that, like the human visual system, can deal with the enormous complexity of real-world images, generalize flexibly to novel situations, address many different visual tasks, and learn from few examples.