
Statistics and Data Science Seminar Series

The Department of Statistics hosts the Statistics and Data Science Seminar Series (SDSS) throughout the year, usually on Friday afternoons at 2pm in COL 1.06. Topics include statistics, machine learning, computer science and their interface, from both a theoretical and an applied point of view. We invite both internal and external speakers to present their latest cutting-edge research. All are welcome to attend our seminars!

Winter Term 2025 

Friday 17 January 2025, 2-3pm - Nicolas Verzelen (INRAE)


Title: Computation-information gap in high-dimensional clustering.

Abstract: We investigate the existence of a fundamental computation-information gap for the problem of clustering a mixture of isotropic Gaussians in the high-dimensional regime, where the ambient dimension p is larger than the number n of points. The existence of a computation-information gap in a specific Bayesian high-dimensional asymptotic regime has been conjectured by Lesieur et al. (2016) based on the replica heuristic from statistical physics. We provide evidence of the existence of such a gap generically in the high-dimensional regime p > n, by (i) proving a non-asymptotic low-degree polynomials computational barrier for clustering in high dimension, matching the performance of the best known polynomial-time algorithms, and by (ii) establishing that the information barrier for clustering is smaller than the computational barrier when the number K of clusters is large enough. These results are in contrast with the (moderately) low-dimensional regime n > poly(p, K), where there is no computation-information gap for clustering a mixture of isotropic Gaussians. This is based on joint work with Bertrand Even and Christophe Giraud (Paris-Saclay).
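
As a loose illustration of this regime (not from the talk), the sketch below simulates a two-component mixture of isotropic Gaussians with ambient dimension p larger than the sample size n and runs k-means, a standard polynomial-time algorithm; all parameters are hypothetical.

```python
# Illustrative sketch (not from the talk): clustering a mixture of
# isotropic Gaussians in the p > n regime discussed in the abstract.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, p, K, sep = 100, 500, 2, 3.0          # ambient dimension p exceeds n
means = rng.normal(size=(K, p))
means *= sep / np.linalg.norm(means, axis=1, keepdims=True)  # fixed separation
labels = rng.integers(K, size=n)
X = means[labels] + rng.normal(size=(n, p))  # isotropic unit-variance noise

pred = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
# Misclustering rate up to label permutation (valid for the K = 2 case).
err = min(np.mean(pred != labels), np.mean(pred == labels))
print(f"misclustering rate: {err:.2f}")
```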

Biography: Website 

Friday 24 January 2025, 2-3pm - Peter Orbanz (UCL)

Title: Gaussian and non-Gaussian universality, and applications to data augmentation.

Abstract: The term Gaussian universality refers to a class of results that are, loosely speaking, generalized central limit theorems (where, somewhat confusingly, the limit law is not necessarily Gaussian). They provide useful tools to study certain problems in machine learning. I will give a short overview of this idea and present two types of results: the first consists of upper and lower bounds that map out where Gaussian universality is applicable and what rates of convergence one can expect; the second is the use of these techniques to obtain quantitative results on the effects of data augmentation in machine learning problems.

This is joint work with KH Huang (Gatsby Unit) and M Austern (Harvard).
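
For readers unfamiliar with the idea, here is a minimal illustrative demo of Gaussian universality (ours, not the speaker's): a smooth statistic computed from iid non-Gaussian inputs with matched first two moments behaves almost identically to its Gaussian counterpart.

```python
# Illustrative demo of Gaussian universality (not from the talk): the ridge
# estimation risk is nearly the same for Gaussian and Rademacher designs
# with matched mean and variance.
import numpy as np

rng = np.random.default_rng(11)
n, p, reps = 200, 100, 500

def ridge_risk(draw):
    err = 0.0
    for _ in range(reps):
        X = draw((n, p))
        beta = np.ones(p) / np.sqrt(p)
        y = X @ beta + rng.normal(size=n)
        b = np.linalg.solve(X.T @ X + np.eye(p), X.T @ y)   # ridge, lambda = 1
        err += np.sum((b - beta) ** 2) / reps
    return err

gauss = ridge_risk(lambda s: rng.normal(size=s))
rademacher = ridge_risk(lambda s: rng.choice([-1.0, 1.0], size=s))
print(f"ridge risk, Gaussian design: {gauss:.3f}; Rademacher design: {rademacher:.3f}")
```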

Biography: Website

Friday 31 January 2025, 2-3pm - François-Xavier Briol (UCL)

Title: Robust and Conjugate Gaussian Process Regression.

Abstract: To enable closed form conditioning, a common assumption in Gaussian process (GP) regression is independent and identically distributed Gaussian observation noise. This strong and simplistic assumption is often violated in practice, which leads to unreliable inferences and uncertainty quantification. Unfortunately, existing methods for robustifying GPs break closed-form conditioning, which makes them less attractive to practitioners and significantly more computationally expensive. In this work, we demonstrate how to perform provably robust and conjugate Gaussian process (RCGP) regression at virtually no additional cost using generalised Bayesian inference. RCGP is particularly versatile as it enables exact conjugate closed form updates in all settings where standard GPs admit them. To demonstrate its strong empirical performance, we deploy RCGP for problems ranging from Bayesian optimisation to sparse variational Gaussian processes. 
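
For context, the sketch below implements only the standard conjugate GP update that the abstract starts from, on hypothetical data containing outliers; RCGP's generalised-Bayes weighting, described in the paper, is not implemented here.

```python
# Minimal sketch of standard conjugate GP regression with iid Gaussian noise,
# the baseline assumption the talk's RCGP method robustifies. Illustrative only.
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
y[::10] += 3.0                    # outliers violating the iid-Gaussian assumption
Xs = np.linspace(-3, 3, 100)[:, None]

sigma2 = 0.1 ** 2
K = rbf_kernel(X, X) + sigma2 * np.eye(len(X))       # noisy Gram matrix
Ks = rbf_kernel(Xs, X)
mean = Ks @ np.linalg.solve(K, y)                    # closed-form posterior mean
cov = rbf_kernel(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)
print(mean[:3], np.diag(cov)[:3])
```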

Biography: Website

Friday 7 February 2025, 2-3pm - Karla Diaz Ordaz (UCL)

Title: IV-learner: learning conditional average treatment effects using instrumental variables.

Abstract: Instrumental variable methods are very popular in econometrics and biostatistics for inferring causal average effects of an exposure on an outcome where there is unmeasured confounding. However, their application for learning heterogeneous treatment effects,  such as conditional average treatment effects (CATE), in combination with machine learning, is somewhat limited. 

A generic approach that allows the use of arbitrary machine learning algorithms can be based on the popular two-stage principle. We first "regress" the exposure on the instrumental variables (and pre-exposure covariates) and then learn the causal treatment effects by regressing the outcome on the predicted exposure. This is the approach of Foster and Syrgkanis (2023), referred to as IV-debiased machine learning (IV-DML).  
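
As a rough illustration of this two-stage principle (a sketch on hypothetical simulated data, not the IV-DML implementation), one can plug in any off-the-shelf learner for the first stage:

```python
# Schematic two-stage sketch with hypothetical simulated data: regress the
# exposure A on instrument Z and covariates X, then regress the outcome Y
# on the predicted exposure.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 3))                 # pre-exposure covariates
Z = rng.normal(size=n)                      # instrument
U = rng.normal(size=n)                      # unmeasured confounder
A = Z + X[:, 0] + U + rng.normal(size=n)    # exposure
Y = 2.0 * A + X[:, 1] + 2 * U + rng.normal(size=n)  # true effect of A is 2

# Stage 1: predict the exposure from instrument and covariates.
stage1 = RandomForestRegressor(n_estimators=200, random_state=0)
stage1.fit(np.column_stack([Z, X]), A)
A_hat = stage1.predict(np.column_stack([Z, X]))

# Stage 2: regress the outcome on the predicted exposure (a simple linear
# fit here, standing in for an arbitrary CATE learner).
beta = np.polyfit(A_hat, Y, 1)[0]
print(f"estimated effect: {beta:.2f}")
```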

Unfortunately, the slow convergence rates of the data-adaptive estimators that affect the first-stage predictions propagate into the resulting CATE estimates.  

In view of this, we propose the IV-learner, inspired by infinite-dimensional targeted learning procedures (Vansteelandt 2023, van der Laan et al 2024). It strategically targets the first-stage predictions so they perform well in their ultimate task: CATE estimation. The resulting learner is easy to construct based on arbitrary, off-the-shelf algorithms.  

We study the finite sample performance of our proposal using simulations, and compare it to existing methods. We also illustrate it using a real data example.

Joint work with Stijn Vansteelandt, Stephen O’Neill, Richard Grieve.

Biography: Website

Friday 14 February 2025, 1-2pm - Solenne Gaucher (Centre de Mathématiques Appliquées at École polytechnique)


Please note the change of time for this seminar.  It will now take place 1-2pm. 

Title: Classification and regression under fairness constraints.

Abstract: Artificial intelligence (AI) is increasingly shaping the decisions that affect our lives—from hiring and education to healthcare and access to social services. While AI promises efficiency and objectivity, it also carries the risk of perpetuating and even amplifying societal biases embedded in the data used to train these systems. Algorithmic fairness aims to design and analyze algorithms capable of providing predictions that are both reliable and equitable.

In this talk, I will introduce one of the main approaches to achieving this goal: statistical fairness. After outlining the basic principles of this approach, I will focus specifically on a fairness criterion known as "demographic parity," which seeks to ensure that the distribution of predictions is identical across different populations. I will then discuss recent results related to regression and classification problems under this fairness constraint, exploring scenarios where differentiated treatment of populations is either permitted or prohibited.
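
As a small illustration of the demographic parity criterion (hypothetical data, not from the talk), the sketch below measures the gap in positive-prediction rates between two groups and applies one classical remedy, group-specific thresholds, which is only available when differentiated treatment is permitted.

```python
# Illustrative sketch (hypothetical data): demographic parity asks that the
# distribution of predictions be the same across populations; here we measure
# the gap in positive-prediction rates between two groups.
import numpy as np

rng = np.random.default_rng(3)
group = rng.integers(2, size=5000)                 # sensitive attribute
scores = rng.normal(loc=0.2 * group, size=5000)    # a biased score
pred = (scores > 0.0).astype(int)

rates = [pred[group == g].mean() for g in (0, 1)]
print(f"positive rate by group: {rates[0]:.2f} vs {rates[1]:.2f}")
print(f"demographic parity gap: {abs(rates[0] - rates[1]):.2f}")

# One classical remedy (when differentiated treatment is permitted):
# group-specific thresholds chosen to equalise the positive rates.
thresholds = [np.quantile(scores[group == g], 0.5) for g in (0, 1)]
pred_fair = np.array([scores[i] > thresholds[group[i]] for i in range(5000)])
gap = abs(pred_fair[group == 0].mean() - pred_fair[group == 1].mean())
print(f"gap after per-group thresholds: {gap:.2f}")
```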

Biography: Solenne Gaucher is an assistant professor at the Centre de Mathématiques Appliquées at École polytechnique. Her work focuses on developing fair machine learning algorithms to address algorithmic biases and mitigate their broader societal impact. Prior to this role, she completed a postdoctoral fellowship at the Center for Research in Economics and Statistics (CREST) at ENSAE Paris. Solenne holds a Ph.D. in mathematics from Université Paris-Saclay. She was awarded the "Young French Talent" prize by the L'Oréal-UNESCO Foundation For Women in Science.

Friday 21 February 2025, 3.30-4.30pm - Eleni Matechou (University of Kent)


Please note the later time of this seminar.

Title: Parametric, nonparametric and repulsive mixture models for ecological data

Abstract: Ecological surveys often track individuals or species to monitor time-varying processes such as migration patterns and changes in behavioural or life states. Mixture models are a suitable and flexible approach for analysing such data, and they have been used extensively in the field. In this seminar, parametric, nonparametric, and repulsive mixture models are discussed for different types of ecological data and results are presented for case studies on species monitored using standard ecological data, as well as data collected using new technologies. 
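
As a toy example of the parametric case (hypothetical data, not a result from the talk), the sketch below fits a two-component Gaussian mixture to simulated arrival times with two migration waves.

```python
# Illustrative sketch (hypothetical data): fitting a parametric Gaussian
# mixture to simulated arrival times with two migration waves, the kind of
# time-varying process the abstract describes.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
arrivals = np.concatenate([rng.normal(100, 5, 300),    # first wave (day of year)
                           rng.normal(140, 8, 200)])   # second wave
gmm = GaussianMixture(n_components=2, random_state=0).fit(arrivals[:, None])
print("wave means:", gmm.means_.ravel())
print("wave weights:", gmm.weights_)
```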

Biography: Website

Friday 7 March 2025, 2-3pm - Ruth Heller (Tel-Aviv University)


Please note that this seminar is joint with the Department of Mathematics.

Title: Addressing multiplicity and selection in conformal prediction.

Abstract: We begin by introducing the problem of multiple testing and selective inference, emphasizing the central roles of the false discovery rate (FDR) and the false coverage rate (FCR) in the analysis of high-dimensional problems. We then address the important roles that the FDR and FCR may play in prediction problems, particularly in supervised learning tasks, including regression and classification. In these contexts, conformal methods provide prediction sets for outcomes or labels with finite-sample coverage guarantees for any machine learning predictor. Our focus is on cases where such prediction sets are constructed following a selection process. This selection process requires that the selected prediction sets be "informative" in a well-defined sense. We explore both classification and regression settings, where the analyst may define informative prediction sets as those that are sufficiently small, exclude null values, or satisfy other appropriate monotone constraints. We introduce InfoSP and InfoSCOP, novel procedures that provide FCR control for informative prediction sets. We demonstrate the utility of these methods through applications to both real and simulated data. While our primary focus is on the "batch" setting, we also address the ubiquitous "online" learning setting.

Joint work with Cherief-Abdeltaif, B., Gazin, U., Humbert, P., Marandon, A., and Roquain, E.
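
For readers new to conformal methods, the sketch below shows a minimal split-conformal procedure for regression on hypothetical data; the talk's InfoSP and InfoSCOP procedures additionally control the FCR when prediction sets are reported only after selection, which this baseline does not do.

```python
# Minimal split-conformal sketch for regression (illustrative only; no
# selection step, hence no FCR adjustment as in InfoSP/InfoSCOP).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(1200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=1200)
train, calib, test = np.split(rng.permutation(1200), [600, 1000])

model = GradientBoostingRegressor(random_state=0).fit(X[train], y[train])
scores = np.abs(y[calib] - model.predict(X[calib]))     # conformity scores
alpha = 0.1
q = np.quantile(scores, np.ceil((len(calib) + 1) * (1 - alpha)) / len(calib))

pred = model.predict(X[test])
lo, hi = pred - q, pred + q      # finite-sample marginal coverage ~ 90%
print(f"empirical coverage: {np.mean((y[test] >= lo) & (y[test] <= hi)):.2f}")
```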

Biography: Website 

Friday 7 March 2025, 3.15-4.15pm - Wen Zhou (NYU GPH) 


Please note the later time of this seminar.

Title: Identification of Informative Core Structures in Weighted Directed Networks with Uncertainty Quantification

Abstract: In network analysis, noise and bias, which are often introduced by peripheral or non-essential components, can mask pivotal structures and hinder the efficacy of many network modeling and inference procedures. Recognizing this, identification of the core-periphery (CP) structure has emerged as a crucial data pre-processing step. While the identification of the CP structure has been instrumental in pinpointing core structures within networks, its application to directed weighted networks has been underexplored. Many existing efforts either fail to account for the directionality or lack theoretical justification of the identification procedure. In this work, we seek answers to three pressing questions: (i) How to distinguish the informative and noninformative structures in weighted directed networks? (ii) What approach offers computational efficiency in discerning these components? (iii) Upon the detection of CP structure, can uncertainty be quantified to evaluate the detection? We adopt the signal-plus-noise model, categorizing different types of noninformative relational patterns, by which we define the sender and receiver peripheries. Furthermore, instead of confining the core component to a specific structure, we consider it complementary to either the sender or receiver peripheries. Based on our definitions of the sender and receiver peripheries, we propose spectral algorithms to identify the CP structure in directed weighted networks. Our algorithm stands out with statistical guarantees, ensuring the identification of sender and receiver peripheries with overwhelming probability. Additionally, we propose a hypothesis testing framework to infer the CP structure upon detection. Our methods scale effectively for expansive directed networks. Implementing our methodology on faculty hiring network data revealed captivating insights into the informative structures and distinctions between informative and noninformative sender/receiver nodes across various academic disciplines. This is a joint work with Wenqin Du, Tianxi Li, and Lihua Lei.
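
As a generic illustration of the spectral idea (not the authors' algorithm), the sketch below scores sender and receiver "coreness" in a simulated weighted directed network via the leading singular vectors of the adjacency matrix.

```python
# Generic illustrative sketch (not the authors' algorithm): scoring sender
# and receiver coreness in a weighted directed network via the SVD of the
# adjacency matrix, in the spirit of spectral CP identification.
import numpy as np

rng = np.random.default_rng(6)
n, core = 60, 15
W = rng.exponential(0.1, size=(n, n))                        # weak background
W[:core, :core] += rng.exponential(2.0, size=(core, core))   # dense weighted core
np.fill_diagonal(W, 0.0)

U, s, Vt = np.linalg.svd(W)
sender_score = np.abs(U[:, 0])     # left singular vector: outgoing structure
receiver_score = np.abs(Vt[0])     # right singular vector: incoming structure
print("top senders:  ", np.argsort(-sender_score)[:core])
print("top receivers:", np.argsort(-receiver_score)[:core])
```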

Biography: Wen Zhou is an Associate Professor in the Department of Biostatistics at the School of Global Public Health. He received his Ph.D.s in Statistics and Applied Mathematics from Iowa State University. His research focuses on developing theories and methods for network data analysis, high-dimensional statistics, machine learning, and causal inference. He is interested in applications within genomics, genetics, protein structure modeling, social science, and health policy. Wen serves on the editorial boards of Statistica Sinica, the Journal of Multivariate Analysis, and Biometrics, and serves as Editor-in-Chief of the Journal of Biopharmaceutical Statistics. He is an elected member of the International Statistical Institute and was elected WNAR program coordinator in 2024.

Friday 14 March 2025, 2-3pm - Shuangning Li (University of Chicago Booth School of Business)

Title: Estimating the Global Average Treatment Effect under Structured Interference

Abstract: The field of causal inference develops methods for estimating treatment effects, often relying on the Stable Unit Treatment Value Assumption (SUTVA), which states that a unit’s outcome depends only on its own treatment. However, in many real-world settings, SUTVA is violated due to interference—where the treatment assigned to one unit influences the outcomes of others. Such interference can arise from social interactions among units or competition for shared resources, complicating causal analysis and leading to biased estimates. Fortunately, in many cases, interference follows structured patterns that can potentially be leveraged for more accurate estimation. In this paper, we examine and formalize two specific forms of structured interference—monotone interference and submodular interference—which we believe arise in many practical settings. We investigate how incorporating these structures can improve causal effect estimation. Our main contributions are (i) a set of bounds relating key interference estimands under these structural assumptions and (ii) new estimators that integrate these structures through constrained optimization. Since these constraints may introduce bias, we further develop debiasing techniques based on treatment regeneration and bootstrap methods to mitigate this issue.

This is joint work (ongoing) with Kevin Han and Johan Ugander.
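
As a toy illustration of why interference matters (hypothetical simulation, not from the paper), the sketch below shows the naive difference-in-means recovering only the direct effect when a simple monotone spillover is present.

```python
# Illustrative simulation (not from the paper): under interference, the
# naive difference-in-means no longer estimates the global average treatment
# effect; here a simple monotone spillover comes from a treated "neighbour".
import numpy as np

rng = np.random.default_rng(10)
n = 10000
Z = rng.integers(2, size=n)             # Bernoulli(1/2) treatment assignment
neighbour_treated = np.roll(Z, 1)       # toy neighbour structure
Y = 1.0 * Z + 0.5 * neighbour_treated + rng.normal(size=n)

# Global effect (everyone treated vs no one treated) is 1.0 + 0.5 = 1.5,
# but the naive contrast recovers only the direct effect of 1.0.
naive = Y[Z == 1].mean() - Y[Z == 0].mean()
print(f"naive difference-in-means: {naive:.2f} (global effect is 1.50)")
```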

Biography: Shuangning Li is an Assistant Professor of Econometrics and Statistics at the University of Chicago Booth School of Business. Previously, she was a postdoctoral fellow in the Department of Statistics at Harvard University. She earned her Ph.D. in Statistics from Stanford University, advised by Professors Emmanuel Candès and Stefan Wager. Before that, she obtained a Bachelor of Science from the University of Hong Kong. Her research focuses on causal inference, multiple hypothesis testing, selective inference, and statistical reinforcement learning.

Friday 21 March 2025, 2-3pm - Matteo Farnè (Università di Bologna)

Title: Large spectral density matrix and dynamic factor model estimation by nuclear norm plus ℓ1 norm penalization. Joint work with Angela Montanari and Matteo Barigozzi.

Abstract: This talk provides a comprehensive overview of the estimation framework for high-dimensional spectral density matrices via nuclear norm plus ℓ1 norm penalization under the assumption of an underlying low rank plus sparse structure, which naturally occurs when the data follow an approximate dynamic factor model with a sparse residual autocovariance. The underlying assumptions allow for non-pervasive latent dynamic eigenvalues and a prominent residual autocovariance pattern. In that context, existing approaches based on principal components may misestimate the number of factors. In contrast, minimizing a quadratic loss under a nuclear norm plus ℓ1 norm constraint, which controls the latent rank and the residual sparsity pattern via two specific threshold parameters, proves to be an effective optimization strategy.
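
As a schematic illustration of this optimization idea (hypothetical thresholds, not the ALSE algorithm), the sketch below alternates the two proximal operators that the nuclear-norm and ℓ1 penalties induce: singular-value and entrywise soft-thresholding.

```python
# Schematic low-rank plus sparse decomposition via naive alternating proximal
# sweeps (illustrative only; thresholds are arbitrary, not the ALSE choices).
import numpy as np

def svt(M, tau):
    """Singular-value soft-thresholding: prox of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, lam):
    """Entrywise soft-thresholding: prox of the l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - lam, 0.0)

rng = np.random.default_rng(7)
p, r = 50, 2
L_true = rng.normal(size=(p, r)) @ rng.normal(size=(r, p))  # low-rank part
S_true = soft(rng.normal(size=(p, p)), 2.0)                 # sparse residual part
Sigma = L_true + S_true                                     # observed matrix

L_hat, S_hat = np.zeros((p, p)), np.zeros((p, p))
for _ in range(20):                     # naive alternating proximal sweeps
    L_hat = svt(Sigma - S_hat, tau=1.0)
    S_hat = soft(Sigma - L_hat, lam=0.5)
print("recovered rank of L_hat:", np.linalg.matrix_rank(L_hat, tol=1e-6))
```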

In this time-dependent setting, we propose a new estimator of high-dimensional spectral density matrices, called ALgebraic Spectral Estimator (ALSE). The quadratic loss function requires as input the classical smoothed periodogram estimator and prompts a consequent choice of the two threshold parameters. We prove consistency of ALSE as both the dimension and the sample size diverge to infinity, as well as the recovery of the latent rank and residual sparsity pattern with probability one. We then propose the UNshrunk ALgebraic Spectral Estimator (UNALSE), which is designed to minimize the Frobenius loss with respect to the pre-estimator while retaining the optimality of the ALSE, thus presenting the tightest error bound in the minimax sense. On top of that, we prove that the ensuing estimators of dynamic factor loadings and scores via Bartlett’s and Thomson’s methods have the same minimax property, and we provide the asymptotic rates for those estimators. An interesting property of such dynamic factor score estimators is that they are entirely based on frequency-domain quantities, which makes it possible to avoid direct control of the ergodicity and the number of lags of the common component.

When applying UNALSE to a standard U.S. quarterly macroeconomic dataset, we find evidence of two main sources of comovements: a real factor driving the economy at business cycle frequencies, and a nominal factor driving the higher frequency dynamics. Dynamic factor scores can be estimated by UNALSE spectral density matrices, thus depicting the behaviour of each of the two factors over time.

Biography: Website

Monday 24 March 2025, 2-3pm - Xi Chen (NYU Stern School of Business)

Please note that this seminar takes place on a Monday, not at our usual Friday time.

Title: Digital Privacy in Personalised Pricing and Trustworthy Machine Learning via Blockchain

Abstract: This talk has two parts. The first part is on digital privacy in personalized pricing. When personalized information is involved, how to protect the privacy of such information becomes a critical issue in practice. In this talk, we consider a dynamic pricing problem with an unknown demand function of posted prices and personalized information. By leveraging the fundamental framework of differential privacy, we develop a privacy-preserving dynamic pricing policy, which aims to maximize retailer revenue while avoiding leakage of individual customers' information and purchasing decisions. This is joint work with Prof. Yining Wang and Prof. David Simchi-Levi.
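
As a generic illustration of the differential-privacy building block involved (not the paper's pricing policy), the sketch below releases a noisy average of hypothetical customer valuations via the Laplace mechanism.

```python
# Generic differential-privacy building block (not the paper's pricing
# policy): the Laplace mechanism releases a noisy statistic, e.g. the average
# willingness-to-pay, with epsilon-DP for a bounded per-customer contribution.
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    return value + rng.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(8)
wtp = rng.uniform(0, 100, size=1000)   # hypothetical valuations in [0, 100]
true_mean = wtp.mean()
sensitivity = 100 / len(wtp)           # changing one customer moves the mean <= 0.1
private_mean = laplace_mechanism(true_mean, sensitivity, epsilon=1.0, rng=rng)
print(f"true {true_mean:.2f} vs private {private_mean:.2f}")
```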

The second part introduces the concept of using blockchain to create a decentralized computing market for AI training and fine-tuning. We introduce the concept of incentive-security, which incentivizes rational trainers to behave honestly in their own best interest. We design a Proof-of-Learning mechanism with computational efficiency, a provable incentive-security guarantee, and controllable difficulty. Our research also proposes an environmentally friendly verification mechanism for blockchain systems, allowing existing proof-of-work computations to be used for AI services, thus achieving useful proof-of-work.

Biography: Xi Chen is a professor and Andre-Meyer Faculty Fellow at the Stern School of Business, New York University, and an affiliated professor of Computer Science and at the Center for Data Science. Before that, he was a postdoc in the group of Prof. Michael Jordan at UC Berkeley and obtained his Ph.D. from the Machine Learning Department at Carnegie Mellon University.

He studies high-dimensional machine learning, online learning, large-scale stochastic optimization, and applications to operations management and FinTech. Recently, he started a new research line on blockchain technology and decentralized finance. He is an IMS Fellow and a recipient of the COPSS Leadership Award, the NSF CAREER Award, The World’s Best 40 under 40 MBA Professors by Poets & Quants, and Forbes 30 under 30 in Science. Take a look at this website too.

Friday 28 March 2025, 2-3pm - Nicola Gnecco (Imperial College London)

Title: Extremal Random Forests

Abstract: Classical methods for quantile regression fail in cases where the quantile of interest is extreme and only a few or no training data points exceed it. Asymptotic results from extreme value theory can be used to extrapolate beyond the range of the data, and several approaches exist that use linear regression, kernel methods or generalized additive models. Most of these methods break down if the predictor space has more than a few dimensions or if the regression function of extreme quantiles is complex. We propose a method for extreme quantile regression that combines the flexibility of random forests with the theory of extrapolation. Our extremal random forest (ERF) estimates the parameters of a generalized Pareto distribution, conditional on the predictor vector, by maximizing a local likelihood with weights extracted from a quantile random forest. We penalize the shape parameter in this likelihood to regularize its variability in the predictor space. Under general domain of attraction conditions, we show consistency of the estimated parameters in both the unpenalized and penalized case. Simulation studies show that our ERF outperforms both classical quantile regression methods and existing regression approaches from extreme value theory. We apply our methodology to extreme quantile prediction for U.S. wage data.

This is joint work with Edossa Merga Terefe and Sebastian Engelke.
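
As an illustration of the extrapolation step only (hypothetical data; the forest-based, covariate-dependent weighting is the paper's contribution and is omitted here), the sketch below fits an unconditional generalized Pareto distribution to threshold exceedances and reads off an extreme quantile.

```python
# Illustrative sketch of the extrapolation step only: fit a generalized
# Pareto distribution (GPD) to exceedances over a high threshold and read
# off an extreme quantile. ERF makes the GPD parameters covariate-dependent
# via forest weights, which this unconditional fit omits.
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(9)
y = rng.standard_t(df=4, size=5000)             # heavy-tailed sample
u = np.quantile(y, 0.9)                         # intermediate threshold, P(Y > u) = 0.1
exc = y[y > u] - u
shape, loc, scale = genpareto.fit(exc, floc=0)  # MLE with location fixed at 0
# Extrapolate: P(Y > u) * P(Y - u > t | Y > u) = 0.001 gives the 99.9% quantile.
q = u + genpareto.ppf(1 - 0.001 / 0.1, shape, loc=0, scale=scale)
print(f"estimated 99.9% quantile: {q:.2f}")
```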

Biography: Website

Spring Term 2025