Gaël Varoquaux (hosted by CIS) – “DirtyData: statistical learning on non-curated databases”
Abstract: While growing amounts and diversity of data bring many promises to empirical studies, they also impose more and more human curation before statistical analysis. “Dirty data” is reported as the worst roadblock to data science in practice [1]. One challenge is that in many data-science applications, for instance in healthcare or social sciences, the data are not measurements that naturally have a homogeneous structure, but rather heterogeneous entries and columns of different natures. Analysts must invest significant manual effort to cast the data into a representation amenable to statistical learning, traditionally using database-cleaning methods. Our goal in the DirtyData research axis is to unite statistical learning and database techniques to work directly on non-curated databases.
I will present two recent contributions to building a statistical-learning framework on non-curated databases. First, we tackle the problem of non-normalized categorical columns, e.g., with typos or nomenclature variations. We introduce two approaches to embed the data in a vector space: one based on a character-level Gamma-Poisson factorization that recovers latent categories, the other exploiting unstudied properties of min-hash vectors that lead to very fast stateless transformations of string inclusions into simple vector inequalities [2]. Second, we study supervised learning in the presence of missing values [3]. We show that in missing-at-random settings, simple imputation by the mean is consistent for powerful supervised models. We also stress that in missing-not-at-random settings, imputing may render supervised learning impossible, and we study simple practical solutions. Studying the seemingly simple case of data generated with a linear mechanism shows that fitting imputation and linear models is brittle, and that it is preferable to forgo imputation and fit richer models [4].
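As an illustration of the min-hash property mentioned in the first contribution, here is a minimal sketch, not the implementation from [2]. It assumes character 3-grams without padding and uses salted BLAKE2b digests as a stand-in family of hash functions; the names `char_ngrams`, `salted_hash`, and `minhash_encode` are hypothetical. The idea: if the n-grams of one string are all contained in another, every min-hash coordinate of the longer string is less than or equal to that of the shorter one, since a minimum over a larger set can only be smaller.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text` (no padding)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def salted_hash(token, seed):
    """Deterministic 64-bit hash of `token`; `seed` picks the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_encode(text, n_components=8, n=3):
    """Encode a string as a vector of min-hash values over its n-grams.

    The transformation is stateless: nothing is fitted on a dataset.
    """
    grams = char_ngrams(text, n)
    if not grams:
        raise ValueError("text must contain at least n characters")
    return [
        min(salted_hash(gram, seed) for gram in grams)
        for seed in range(n_components)
    ]


# Illustrative strings: every 3-gram of `short` also occurs in `long`.
short = "paris"
long = "paris, france"

short_vec = minhash_encode(short)
long_vec = minhash_encode(long)

# String inclusion shows up as an elementwise vector inequality.
inclusion_as_inequality = all(l <= s for l, s in zip(long_vec, short_vec))
print(inclusion_as_inequality)
```

Note that the implication only goes one way with certainty: inclusion guarantees the inequality, whereas an inequality observed on a finite number of components suggests inclusion only probabilistically, with more components giving stronger evidence.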
[1] https://www.kaggle.com/ash316/novice-to-grandmaster
[2] Encoding high-cardinality string categorical variables, P Cerda, G Varoquaux, https://arxiv.org/abs/1907.01860
[3] On the consistency of supervised learning with missing values, J Josse, N Prost, E Scornet, G Varoquaux, https://arxiv.org/abs/1902.06931
[4] Linear predictor on linearly-generated data with missing values: non consistency and solutions, M Le Morvan, N Prost, J Josse, E Scornet, G Varoquaux, accepted at AISTATS 2020.
Bio: Gaël Varoquaux is a tenured research director at Inria. His research focuses on statistical-learning tools for data science and scientific inference. Since 2008, he has been exploring data-intensive approaches to understand brain function and mental health. More generally, he develops tools to make machine learning easier, with statistical models suited for real-life, uncurated data, and software for data science. He co-founded scikit-learn, one of the reference machine-learning toolboxes, and helped build various central tools for data analysis in Python. Varoquaux has contributed key methods for learning on spatial data, matrix factorizations, and modeling covariance matrices. He has a PhD in quantum physics and is a graduate of Ecole Normale Superieure, Paris.