Grants

Research grants

Current research grants

Statistical Foundations for Detecting Anomalous Structure in Stream Settings (DASS) - An EPSRC Programme Grant

Awarding body: EPSRC (Engineering and Physical Sciences Research Council)
Total value: £5,053,464 (LSE: £961,052)
Grant holders: 5 academics from 4 UK universities, LSE holder is Professor Qiwei Yao
Start/end date: 01/11/2024 - 31/10/2029

Summary: With the exponentially increasing prevalence of networked sensors and other devices for collecting data in real-time, automated data analysis methods with theoretically justified performance guarantees are in constant demand. Often a key question with such streaming data is whether they show evidence of anomalous behaviour. This could, for example, be due to malicious bot activity on a website, early warning of potential equipment failure, or detection of methane leakages. These and other motivating examples share a common feature which is not accommodated by classical point anomaly models in statistics: the anomaly may not simply be an 'outlying' observation, but rather a distinctive pattern observed over consecutive observations. The strategic vision for this programme grant is to establish the statistical foundations for Detecting Anomalous Structure in Streaming data settings (DASS).
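
To make the distinction concrete, the toy sketch below (illustrative only, and not drawn from the grant) simulates a stream containing both a single outlying observation and a shifted segment of consecutive observations; a per-observation outlier screen catches only the former, while a sliding-window screen catches the latter. The shift sizes, window length and thresholds are arbitrary choices.

    # Illustrative toy example (not from the grant): a point anomaly versus a
    # collective anomaly in a simulated stream with unit Gaussian noise.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, 500)
    x[120] += 8.0        # point anomaly: a single outlying observation
    x[300:340] += 1.0    # collective anomaly: a pattern over consecutive observations

    # A per-observation outlier screen catches the spike but not the segment.
    print("point outliers (|x| > 4):", np.flatnonzero(np.abs(x) > 4))

    # A sliding-window mean screen catches the shifted segment instead.
    w = 40
    window_means = np.convolve(x, np.ones(w) / w, mode="valid")
    print("anomalous windows (|mean|*sqrt(w) > 4):",
          np.flatnonzero(np.abs(window_means) * np.sqrt(w) > 4))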

Discussions with a wide range of industrial partners from different sectors have identified important, generic challenges that cut across distinct DASS applications, and are relevant for analysing streaming data more broadly:

Contemporary Constrained Environments: Anomaly detection is often performed under various constraints due, for example, to the restrictions on measurement frequency, the volume of data transferable between sensors and a central processor, or battery usage limits. Additionally, certain scenarios may impose privacy restrictions when handling sensitive data. Consequently, it has become imperative to establish the mathematical underpinning for rigorously examining the trade-offs between, e.g., statistical accuracy, communication efficiency, privacy preservation and computational demands.

Handling Data Realities: A substantial portion of research in statistical anomaly detection operates under the assumption of clean data. Nevertheless, real-world data typically exhibit various imperfections, such as missing values, labelling errors in data streams, synchronisation discrepancies, sensor malfunctions and heterogeneous sensor performance. Consequently, there is a pressing need for the development of principled, model-based procedures that can effectively address the features of real data and enhance the resilience of anomaly detection methods.

Identifying, Accounting for and Tracking Dependence: Not only are data streams often interdependent, but also anomalous patterns may be dependent across those streams. Taking into account both types of dependence is crucial in enhancing the statistical efficiency of anomaly detection algorithms, and also in controlling the errors arising from handling a large number of data streams in a principled way. Other challenges include tracking the path of an anomaly across multiple data sources with a view to learning causal indicators allowing for precautionary intervention.

Our ambitious goal of comprehensively addressing these challenges is only achievable via the programme grant scheme. Our philosophy is to tackle the methodological, theoretical and computational aspects of these statistical problems together. This integrated approach is essential to achieving the substantive fundamental advances in statistics envisaged, and to ensuring that our new methods are sufficiently robust and efficient to be widely adopted by academics, industry and society more generally.


Network Stochastic Processes and Time Series (NeST) - An EPSRC Programme Grant

Awarding body: EPSRC (Engineering and Physical Sciences Research Council)
Total value: £6,451,752 (LSE: £668,569)
Grant holders: 9 academics from 6 UK universities, LSE holder is Professor Qiwei Yao
Start/end date: 01/12/2022 - 30/11/2028 

Summary: Dynamic networks occur in many fields of science, technology and medicine, as well as in everyday life, and understanding their behaviour has important applications. For example, whether it is to uncover serious crime on the dark web, detect intrusions in a computer network, or identify hijacks at global internet scales, better network anomaly detection tools are desperately needed in cyber-security. Characterising the network structure of multiple EEG time series recorded at different locations in the brain is critical for understanding neurological disorders and therapeutics development. Modelling dynamic networks is of great interest in transport applications, such as for preventing accidents on highways and predicting the influence of bad weather on train networks. Systematically identifying, attributing, and preventing misinformation online requires realistic models of information flow in social networks.

Whilst the theory of simple random networks is well-established in maths and computer science, the recent explosion of dynamic network data has exposed a large gap in our ability to process real-life networks. Classical network models have led to a body of beautiful mathematical theory, but do not always capture the rich structure and temporal dynamics seen in real data, nor are they geared to answer practitioners' typical questions, e.g. relating to forecasting, anomaly detection or data ethics issues. Our NeST programme will develop robust, principled, yet computationally feasible ways of modelling dynamically changing networks and the statistical processes on them.

Some aspects of these problems, such as quantifying the influence of policy interventions on the spread of misinformation or disease, require advances in probability theory. Dynamic network data are also notoriously difficult to analyse. At a computational level, the datasets are often very large and/or only available "on the stream". At a statistical level, they often come with important collection biases and missing data. Often, even understanding the data and how they may relate to the analysis goal can be challenging. Therefore, to tackle these research questions in a systematic way we need to bring probabilists, statisticians and application domain experts together. 

NeST's six-year programme will see probabilists and statisticians with theoretical, computational, machine learning and data science expertise collaborate across six world-class institutes to conduct leading and impactful research. In different overlapping groups, we will tackle questions such as: How do we model data to capture the complex features and dynamics we observe in practice? How should we conduct exploratory data analysis or, to quote a famous statistician, "Looking at the data to see what it seems to say" (Tukey, 1977)? How can we forecast network data, or detect anomalies, changes, trends? To ground techniques in practice, our research will be informed and driven by challenges in many key scientific disciplines through frequent interaction with industrial & government partners in energy, cyber-security, the environment, finance, logistics, statistics, telecoms, transport, and biology. A valuable output of this work will be high-quality, curated, dynamic network datasets from a broad range of application domains, which we will make publicly available in a repository for benchmarking, testing & reproducibility (responsible innovation), partly as a vehicle to foster new collaborations. We also have a strategy to disseminate knowledge through a diverse range of scientific publication routes, high-quality free software (e.g. R packages, Python notebooks accompanying data releases), conferences, patents and outreach activities. NeST will also carefully nurture and develop the next generation of highly-trained and research-active people in our area, which will contribute strongly to satisfying the high demand for such people in industry, government and academia.


Statistical Methods in Offline Reinforcement Learning

Awarding body: EPSRC (Engineering and Physical Sciences Research Council)
Total value: £398,393
Grant holder: Dr. Chengchun Shi
Start/end date: 01/04/2022 - 31/03/2025
 
Summary: Reinforcement learning (RL) is concerned with how intelligent agents take actions in a given environment to learn an optimal policy that maximises the cumulative reward that they receive. It has arguably been one of the most vibrant research frontiers in machine learning over the last few years. According to Google Scholar, over 40K scientific articles containing the phrase "reinforcement learning" were published in 2020. Over 100 papers on RL were accepted for presentation at ICML 2020 (a premier conference in the machine learning area), accounting for more than 10% of the accepted papers in total. Significant progress has been made in solving challenging problems across various domains using RL, including games, robotics, healthcare, bidding and automated driving. Nevertheless, statistics as a field, as opposed to computer science, has only recently begun to engage with RL both in depth and in breadth. The proposed research will develop statistical learning methodologies to address several key issues in offline RL domains. Our objective is to propose RL algorithms that utilise previously collected data, without additional online data collection. The proposed research is primarily motivated by applications in healthcare. Most of the existing state-of-the-art RL algorithms were motivated by online settings (e.g., video games), and their generalisations to applications in healthcare remain unknown. We also remark that our solutions will be transferable to other fields (e.g., robotics).

A fundamental question the proposed research will consider is offline policy optimisation where the objective is to learn an optimal policy to maximise the long-term outcome based on an offline dataset. Solving this question faces at least two major challenges. First, in contrast to online settings where data are easy to collect or simulate, the number of observations in many offline applications (e.g., healthcare) is limited. With such limited data, it is critical to develop RL algorithms that are statistically efficient. The proposed research will devise some "value enhancement" methods that are generally applicable to state-of-the-art RL algorithms to improve their statistical efficiency. For a given initial policy computed by existing algorithms, we aim to output a new policy whose expected return converges at a faster rate, achieving the desired "value enhancement" property. Second, many offline datasets are created via aggregating over many heterogeneous data sources. This is typically the case in healthcare where the data trajectories collected from different patients might not have a common distribution function. We will study existing transfer learning methods in RL and develop new approaches designed for healthcare applications, based on our expertise in statistics.

Another question the proposed research will consider is off-policy evaluation (OPE). OPE aims to learn a target policy's expected return (value) from a pre-collected dataset generated by a different policy. It is critical in applications such as healthcare and automated driving where new policies need to be evaluated offline before online validation. A common assumption made in most of the existing work is that of no unmeasured confounding. However, this assumption is not testable from the data, and it can be violated in observational datasets generated from healthcare applications. Moreover, due to the limited sample size, many offline applications will benefit from having a confidence interval (CI) that quantifies the uncertainty of the value estimator. The proposed research is concerned with constructing a CI for a target policy's value in the presence of latent confounders. In addition, in a variety of applications the outcome distribution is skewed and heavy-tailed, so criteria such as quantiles are more sensible than the mean. We will develop methodologies to learn the quantile curve of the return under a target policy and construct its associated confidence band.
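
For readers unfamiliar with OPE, the sketch below shows one textbook baseline, per-trajectory importance sampling, under the strong assumptions that the behaviour policy's action probabilities were logged and that there is no unmeasured confounding; it is precisely the kind of estimator whose assumptions and uncertainty quantification the proposed research aims to move beyond. Function and argument names (e.g. target_prob) are illustrative only.

    # A textbook off-policy evaluation baseline, shown for illustration only:
    # per-trajectory importance sampling. It assumes the behaviour policy's
    # action probabilities were logged and that there are no unmeasured
    # confounders; these are assumptions the proposed research relaxes.
    import numpy as np

    def is_value_estimate(trajectories, target_prob, gamma=0.99):
        """Estimate a target policy's value from logged trajectories.

        Each trajectory is a list of (state, action, reward, behaviour_prob)
        tuples; target_prob(state, action) gives the target policy's
        probability of choosing that action in that state.
        """
        estimates = []
        for traj in trajectories:
            weight, ret = 1.0, 0.0
            for t, (s, a, r, b_prob) in enumerate(traj):
                weight *= target_prob(s, a) / b_prob   # cumulative importance ratio
                ret += gamma ** t * r                  # discounted return
            estimates.append(weight * ret)
        # Point estimate and a naive standard error for a rough confidence interval.
        return float(np.mean(estimates)), float(np.std(estimates) / np.sqrt(len(estimates)))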


Was that change real? Quantifying uncertainty for change points

Awarding body: EPSRC (Engineering and Physical Sciences Research Council)
Total value: £323,942
Grant holder: Professor Piotr Fryzlewicz
Start/end date: 01/10/2021 - 30/09/2024

Summary: Detecting changes in data is currently one of the most active areas of statistics. In many applications there is interest in segmenting the data into regions with the same statistical properties, either as a way to flexibly model data, to help with down-stream analysis or to ensure predictions are made based only on relevant data. In others, the main interest lies in detecting when changes have occurred, as they indicate features of interest, from potential failures of machinery to security breaches or the presence of genomic features such as copy number variations.

To date, most research in this area has focused on developing methods for detecting changes: algorithms that take data as input and output a best guess as to whether there have been relevant changes and, if so, how many there have been and when they occurred. A comparatively neglected problem is assessing how confident we are that a specific change has occurred in a given part of the data.

In many applications, quantifying the uncertainty around whether a change has occurred is of paramount importance. For example, if we are monitoring a large communication network, and changes indicate potential faults, it is helpful to know how confident we are that there is a fault at any given point in the network so that we can prioritise the use of limited resources available for investigating and repairing faults. When analysing calcium imaging data on neuronal activity, where changes correspond to times at which a neuron fires, it is helpful to know how certain we are that a neuron fired at each time point so as to improve down-stream analysis of the data.

A naive approach to this problem is to first detect changes and then apply standard statistical tests for their presence. But this approach is flawed, as it uses the data twice: first to decide where to test, and then to perform the test. We can overcome this using sample-splitting ideas, where we use half the data to detect a change and the other half to perform the test. But such methods lose power, e.g. because only part of the data is used to detect changes.
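
The toy sketch below illustrates the sample-splitting idea in its simplest form, for a single change in mean: one interleaved half of the data locates a candidate change point via a CUSUM scan, and the held-out half supplies an independent two-sample test at that location. It is the baseline the proposal seeks to improve upon, not the proposed methodology.

    # Toy sketch of sample splitting for a single change in mean: odd-indexed
    # observations locate a candidate change point, even-indexed observations
    # provide an independent test at that location.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0, 1, 100), rng.normal(1, 1, 100)])

    detect, test = x[0::2], x[1::2]          # two interleaved halves

    # CUSUM scan on the detection half to locate the most likely change.
    n = len(detect)
    cusum = [abs(detect[:k].mean() - detect[k:].mean()) * np.sqrt(k * (n - k) / n)
             for k in range(1, n)]
    tau = int(np.argmax(cusum)) + 1

    # Standard two-sample test on the held-out half at that location.
    t_stat, p_value = stats.ttest_ind(test[:tau], test[tau:])
    print(f"candidate change near index {2 * tau}, p-value = {p_value:.2g}")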

This proposal will develop statistically valid approaches to quantifying uncertainty that are more powerful than sample-splitting approaches. These approaches are based on two complementary ideas: (i) performing inference prior to detection; and (ii) developing tests for a change that account for earlier detection steps. The output will be a new general toolbox for change points, encompassing both new general statistical methods and their implementation within software packages.

Recent grants: 2014 - 2024

Statistical Network Analysis: Model Selection, Differential Privacy, and Dynamic Structures

Awarding body: EPSRC (Engineering and Physical Sciences Research Council)
Total value: £631,743
Grant holder: Professor Qiwei Yao
Start/end date: 01/06/2021 - 31/05/2024

Summary: In this proposal we tackle some challenging problems in the following three aspects of statistical network analysis. 

1. Jittered resampling for selecting network models.

The first, and arguably the most important, step in statistical modelling is to choose an appropriate model for a given data set. We propose a new 'bootstrap jittering' or 'jittered resampling' method for selecting an appropriate network model. The method does not impose any specific model forms or conditions, and therefore provides a generic tool for network model selection.

2. Edge differential privacy for network data.

Network data often contain sensitive individual/personal information. Hence the primary concern for data privacy is twofold: (a) to release only a sanitised version of the original network data to protect privacy, and (b) to ensure that the sanitised data preserve the information of interest, so that analysis based on the sanitised data is still meaningful. We will adopt the so-called dyadwise randomised response approach, and further develop this scheme to handle networks with additional node features/attributes (e.g., social networks with additional information on age, gender, hobby, occupation, etc.).
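
As a rough illustration of the basic mechanism only (the project's extensions to networks with node attributes are not shown), the sketch below applies dyadwise randomised response to an undirected 0/1 adjacency matrix, flipping each dyad independently with probability 1/(1 + exp(eps)); the function name and interface are illustrative.

    # A minimal sketch of dyadwise randomised response on an undirected 0/1
    # adjacency matrix: each dyad is flipped independently with probability
    # 1 / (1 + exp(eps)).
    import numpy as np

    def randomise_edges(adj, eps, rng=None):
        """Return a sanitised copy of a symmetric 0/1 adjacency matrix."""
        rng = np.random.default_rng() if rng is None else rng
        adj = np.asarray(adj)
        n = adj.shape[0]
        flip_prob = 1.0 / (1.0 + np.exp(eps))      # smaller eps: more noise, more privacy
        upper = np.triu(np.ones((n, n), dtype=bool), k=1)
        flips = (rng.random((n, n)) < flip_prob) & upper
        sanitised = adj.copy()
        sanitised[flips] = 1 - sanitised[flips]    # flip the selected dyads
        sanitised = np.triu(sanitised, k=1)        # keep the upper triangle only
        return sanitised + sanitised.T             # symmetrise; diagonal stays zero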

3. Modelling and forecasting dynamic networks.

A substantial proportion of real networks are dynamic in nature. Understanding, and being able to forecast, the changes over time is of immense importance for, e.g., monitoring anomalies in internet traffic networks, predicting demand and setting pricing in electricity supply networks, managing natural resources using environmental readings from sensor networks, and understanding how news and opinion propagate in online social networks. Combining recent developments in tensor decomposition and factor-driven dimension reduction with efficient time series tools such as exponential smoothing and Kalman filters, we will take on this challenge to build new dynamic models.
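
As a toy illustration of just one ingredient, the sketch below applies entry-wise exponential smoothing to a sequence of weighted adjacency snapshots to produce a one-step-ahead forecast; the tensor decomposition and factor-driven dimension reduction central to the proposed models are not shown, and the function is illustrative only.

    # A toy building block only: entry-wise exponential smoothing over a
    # sequence of weighted adjacency snapshots W_1, ..., W_T, giving a
    # one-step-ahead forecast of the next snapshot.
    import numpy as np

    def smooth_forecast(snapshots, alpha=0.3):
        """Forecast the next weighted adjacency matrix from past snapshots."""
        level = np.asarray(snapshots[0], dtype=float)
        for w in snapshots[1:]:
            level = alpha * np.asarray(w, dtype=float) + (1 - alpha) * level
        return level   # the smoothed level serves as the one-step-ahead forecast

    # Example: three 4-node snapshots with slowly strengthening edge weights.
    T, n = 3, 4
    snaps = [np.full((n, n), t / T) for t in range(1, T + 1)]
    print(smooth_forecast(snaps))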


Change-point analysis in high dimensions

Awarding body: EPSRC (Engineering and Physical Sciences Research Council)
Total value: £229,022
Grant holder: Dr. Tengyao Wang
Start/end date: 01/03/2021 - 31/03/2024

Summary: Modern applications routinely generate time-ordered datasets in which many covariates are measured simultaneously over time. Examples include wearable technologies recording the health state of individuals from multi-sensor feedback, internet traffic data collected by tens of thousands of routers, and functional magnetic resonance imaging (fMRI) scans that record the evolution of a certain chemical contrast in different areas of the brain. The explosion in the number of such high-dimensional data streams calls for methodological advances for their analysis.

Change-point analysis is an essential statistical technique used in identifying abrupt changes in such data streams. The identified 'change-points' often signal interesting or abnormal events, and can be used to carve up the data streams into shorter segments that are easier to analyse.

Classical change-point analysis methods identify changes in a single variable over time. However, they often suffer from significant performance loss in high-dimensional datasets when applied componentwise. The area of high-dimensional change-point analysis grew out of the need to respond to the challenge created by high-dimensional data streams. A few methods have been proposed in this relatively new area. However, they often require simplifying assumptions that restrict their usefulness in many applications.

In this proposal, I will develop new methods that can handle more realistic data settings. Specifically, I will develop (1) an algorithm that can monitor the data stream 'online' as data points are observed one after another, so that it responds to changes as quickly as possible while maintaining a low rate of false alarms; (2) a change-point procedure that can handle highly correlated component series, a situation that is very common in multi-sensor measurements; (3) a robust method for change-point estimation in the presence of missing or contaminated data. I will provide theoretical performance guarantees for the developed methods and implement them in open-source R packages.
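
As a simple illustration of the sequential detection/false-alarm trade-off underlying (1), the sketch below runs a univariate online CUSUM monitor for an upward mean shift; the proposal targets the far harder high-dimensional, correlated and contaminated settings, which this toy version does not address.

    # A univariate toy version of online monitoring: a CUSUM detector for an
    # upward mean shift. The threshold governs the rate of false alarms.
    import numpy as np

    def online_cusum(stream, drift=0.5, threshold=8.0):
        """Return the first time a mean increase is declared, or None."""
        s = 0.0
        for t, x in enumerate(stream):
            s = max(0.0, s + x - drift)   # accumulate evidence above the drift allowance
            if s > threshold:             # larger threshold: fewer false alarms, slower detection
                return t
        return None

    rng = np.random.default_rng(2)
    stream = np.concatenate([rng.normal(0, 1, 200), rng.normal(1, 1, 200)])
    print("change declared at time", online_cusum(stream))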


Methods for the Analysis of Longitudinal Dyadic Data with an Application to Intergenerational Exchanges of Family Support

Awarding body: ESRC (Economic and Social Research Council) and EPSRC (Engineering and Physical Sciences Research Council)
Total value: £633,392
Grant holder: Professor Fiona Steele
Co-investigators: Siliang Zhang, Professor Jouni Kuha, Professor Irini Moustaki, Professor Chris Skinner, Dr Tania Burchardt (Centre for Analysis of Social Exclusion (CASE), LSE), Dr Eleni Karagiannaki (CASE, LSE), Professor Emily Grundy (University of Essex)
Start/end date: 01/10/2017 - 30/09/2021

Summary: Data on pairs of subjects (dyads) are commonly collected in social research. In family research, for example, there is interest in how the strength of parent-child relationships depends on characteristics of parents and children. Dyadic data provide detailed information on interpersonal processes, but they are challenging to analyse because of their highly complex structure: they are often longitudinal because of interest in dependencies between individuals over time, dyads may be clustered into larger groups (e.g. in families or organisations), and variables of interest such as perceptions and attitudes may be measured by multiple indicators. This research will develop a general latent variable modelling framework for the analysis of clustered multivariate dyadic data, with applications to the study of exchanges of support between non-coresident family members. A particular feature of this framework will be to allow modelling of associations between an individual's exchanges over time, between help given and received (reciprocity), between exchanges of time and money, between respondent-child and respondent-parent exchanges, and between members of the same household. Sensitivity of results to measurement error and non-ignorable attrition will be considered.



New challenges in time series analysis

Awarding body: EPSRC (Engineering and Physical Sciences Research Council)
Total value: £1,306,110.05
Grant holder: Professor Piotr Fryzlewicz
Start/end date: 01/04/2014 - 31/03/2019

Summary: This research will break new ground in the analysis of non-stationary, high-dimensional and curve-valued time series. Although many of the problems we propose to tackle are motivated by financial applications, our solutions will be transferable to other fields. In particular, we will:
(a) Re-define the way in which people think of non-stationarity. We will define (non-)stationarity to be a problem-dependent, rather than 'fixed', property of time series, and propose new statistical model selection procedures in accordance with this new point of view. This will lead to the concept of (non-)stationarity being put to much better use in solving practical problems (such as forecasting) than it has been so far;

(b) Propose new, problem-dependent dimensionality reduction procedures for time series which are both high-dimensional and non-stationary (dimensionality reduction is useful in practice as low-dimensional time series are much easier to handle). We hope that this problem-dependent approach will induce a completely new way of thinking of high-dimensional time series data and high-dimensional data in general;

(c) Propose new methods for statistical model selection in high-dimensional time series regression problems, including the non-stationary setting. Our new methods will be useful in fields such as financial forecasting or statistical market research;

(d) Investigate new methods for statistical model selection in high-dimensional time series (of, e.g., financial returns) in which the dependence structure changes in an abrupt fashion due to `shocks', e.g. macroeconomic announcements;

(e) Propose new multiscale time series models, specifically designed to solve a longstanding problem in finance of consistent modelling of financial returns on multiple time scales, e.g. intraday and daily;

(f) Propose new ways of analysing time series of curves (e.g. yield curves) which can be non-stationary in a variety of ways.

Older and miscellaneous grants