Statistical Foundations for Detecting Anomalous Structure in Streaming Settings (DASS) -- An EPSRC Programme Grant
Awarding body: EPSRC (Engineering and Physical Sciences Research Council)
Total value: £5,053,464 (LSE: £961,052)
Grant holders: 5 academics from 4 UK universities, LSE holder is Professor Qiwei Yao
Start/end date: 01/11/2024 - 31/10/2029
Summary: With the exponentially increasing prevalence of networked sensors and other devices for collecting data in real time, automated data analysis methods with theoretically justified performance guarantees are in constant demand. Often a key question with such streaming data is whether they show evidence of anomalous behaviour. This could be due, for example, to malicious bot activity on a website, early warning of potential equipment failure, or the detection of methane leakages. These and other motivating examples share a common feature which is not accommodated by classical point anomaly models in statistics: the anomaly may not simply be an 'outlying' observation, but rather a distinctive pattern observed over consecutive observations. The strategic vision for this programme grant is to establish the statistical foundations for Detecting Anomalous Structure in Streaming data settings (DASS).
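To make this distinction concrete, the following minimal sketch (purely illustrative, not part of the DASS methodology) contrasts a point-anomaly check, which flags individual outlying observations, with a naive scan for a collective anomaly, i.e. a segment of consecutive observations whose mean has shifted; the Gaussian noise model, baseline mean of zero, candidate segment lengths and thresholds are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 500)
x[200:240] += 1.5                             # modest mean shift over 40 consecutive points

# Point-anomaly view: flag individual observations with a large z-score.
z = (x - x.mean()) / x.std()
point_flags = np.where(np.abs(z) > 3)[0]      # may miss the shifted segment entirely

# Collective-anomaly view: scan segments [s, e) and score the standardised segment mean,
# assuming a baseline mean of zero and unit noise variance (as simulated above).
best_score, best_seg = 0.0, None
for s in range(len(x)):
    for e in range(s + 10, min(s + 100, len(x)) + 1):   # candidate lengths 10 to 100
        seg = x[s:e]
        score = abs(seg.mean()) * np.sqrt(len(seg))      # grows with both length and shift size
        if score > best_score:
            best_score, best_seg = score, (s, e)

print("individually flagged points:", point_flags)
print("best collective-anomaly segment:", best_seg, "score:", round(best_score, 2))
```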
Discussions with a wide range of industrial partners from different sectors have identified important, generic challenges that cut across distinct DASS applications and are relevant for analysing streaming data more broadly:
Contemporary Constrained Environments: Anomaly detection is often performed under various constraints due, for example, to restrictions on measurement frequency, the volume of data transferable between sensors and a central processor, or battery usage limits. Additionally, certain scenarios may impose privacy restrictions when handling sensitive data. Consequently, it has become imperative to establish the mathematical underpinning for rigorously examining the trade-offs between, e.g., statistical accuracy, communication efficiency, privacy preservation and computational demands.
Handling Data Realities: A substantial portion of research in statistical anomaly detection operates under the assumption of clean data. Nevertheless, real-world data typically exhibit various imperfections, such as missing values, labelling errors in data streams, synchronisation discrepancies, sensor malfunctions and heterogeneous sensor performance. Consequently, there is a pressing need for the development of principled, model-based procedures that can effectively address the features of real data and enhance the resilience of anomaly detection methods.
Identifying, Accounting for and Tracking Dependence: Not only are data streams often interdependent, but anomalous patterns may also be dependent across those streams. Taking both types of dependence into account is crucial for enhancing the statistical efficiency of anomaly detection algorithms, and for controlling, in a principled way, the errors arising from handling a large number of data streams. Other challenges include tracking the path of an anomaly across multiple data sources, with a view to learning causal indicators that allow for precautionary intervention.
Our ambitious goal of comprehensively addressing these challenges is only achievable via the programme grant scheme. Our philosophy is to tackle the methodological, theoretical and computational aspects of these statistical problems together. This integrated approach is essential to achieving the substantive fundamental advances in statistics envisaged, and to ensuring that our new methods are sufficiently robust and efficient to be widely adopted by academics, industry and society more generally.
Network Stochastic Processes and Time Series (NeST) -- An EPSRC Programme Grant
Awarding body: EPSRC (Engineering and Physical Sciences Research Council)
Total value: £6,451,752 (LSE: £668,569)
Grant holders: 9 academics from 6 UK universities, LSE holder is Professor Qiwei Yao
Start/end date: 01/12/2022 - 30/11/2028
Summary: Dynamic networks occur in many fields of science, technology and medicine, as well as everyday life. Understanding their behaviour has important applications. For example, whether it is to uncover serious crime on the dark web, intrusions in a computer network, or hijacks at global internet scales, better network anomaly detection tools are desperately needed in cyber-security. Characterising the network structure of multiple EEG time series recorded at different locations in the brain is critical for understanding neurological disorders and therapeutics development. Modelling dynamic networks is of great interest in transport applications, such as for preventing accidents on highways and predicting the influence of bad weather on train networks. Systematically identifying, attributing, and preventing misinformation online requires realistic models of information flow in social networks.
Whilst the theory of simple random networks is well-established in maths and computer science, the recent explosion of dynamic network data has exposed a large gap in our ability to process real-life networks. Classical network models have led to a body of beautiful mathematical theory, but they do not always capture the rich structure and temporal dynamics seen in real data, nor are they geared to answer practitioners' typical questions, e.g. relating to forecasting, anomaly detection or data ethics issues. Our NeST programme will develop robust, principled, yet computationally feasible ways of modelling dynamically changing networks and the statistical processes on them.
Some aspects of these problems, such as quantifying the influence of policy interventions on the spread of misinformation or disease, require advances in probability theory. Dynamic network data are also notoriously difficult to analyse. At a computational level, the datasets are often very large and/or only available "on the stream". At a statistical level, they often come with important collection biases and missing data. Often, even understanding the data and how they may relate to the analysis goal can be challenging. Therefore, to tackle these research questions in a systematic way we need to bring probabilists, statisticians and application domain experts together.
NeST's six-year programme will see probabilists and statisticians with theoretical, computational, machine learning and data science expertise collaborate across six world-class institutes to conduct leading and impactful research. In different overlapping groups, we will tackle questions such as: How do we model data to capture the complex features and dynamics we observe in practice? How should we conduct exploratory data analysis or, to quote a famous statistician, "Looking at the data to see what it seems to say" (Tukey, 1977)? How can we forecast network data, or detect anomalies, changes and trends? To ground techniques in practice, our research will be informed and driven by challenges in many key scientific disciplines through frequent interaction with industrial & government partners in energy, cyber-security, the environment, finance, logistics, statistics, telecoms, transport and biology.
A valuable output of this work will be high-quality, curated, dynamic network datasets from a broad range of application domains, which we will make publicly available in a repository for benchmarking, testing & reproducibility (responsible innovation), partly as a vehicle to foster new collaborations. We also have a strategy to disseminate knowledge through a diverse range of scientific publication routes, high-quality free software (e.g. R packages and Python notebooks accompanying data releases), conferences, patents and outreach activities. NeST will also carefully nurture and develop the next generation of highly trained, research-active people in our area, which will contribute strongly to satisfying the high demand for such people in industry, government and academia.
Statistical Methods in Offline Reinforcement Learning
Awarding body: EPSRC (Engineering and Physical Sciences Research Council)
Total value: £398,393
Grant holder: Dr. Chengchun Shi
Start/end date: 01/04/2022 - 31/03/2025
Summary: Reinforcement learning (RL) is concerned with how intelligent agents take actions in a given environment to learn an optimal policy that maximises the cumulative reward they receive. It has arguably been one of the most vibrant research frontiers in machine learning over the last few years. According to Google Scholar, over 40K scientific articles containing the phrase "reinforcement learning" were published in 2020. Over 100 papers on RL were accepted for presentation at ICML 2020 (a premier conference in machine learning), accounting for more than 10% of all accepted papers. Significant progress has been made in solving challenging problems across various domains using RL, including games, robotics, healthcare, bidding and automated driving. Nevertheless, statistics as a field, as opposed to computer science, has only recently begun to engage with RL in both depth and breadth. The proposed research will develop statistical learning methodologies to address several key issues in offline RL domains. Our objective is to propose RL algorithms that utilise previously collected data, without additional online data collection. The proposed research is primarily motivated by applications in healthcare. Most existing state-of-the-art RL algorithms were motivated by online settings (e.g., video games), and how well they generalise to healthcare applications remains unknown. We also remark that our solutions will be transferable to other fields (e.g., robotics).
A fundamental question the proposed research will consider is offline policy optimisation, where the objective is to learn an optimal policy that maximises the long-term outcome based on an offline dataset. Solving this problem faces at least two major challenges. First, in contrast to online settings where data are easy to collect or simulate, the number of observations in many offline applications (e.g., healthcare) is limited. With such limited data, it is critical to develop RL algorithms that are statistically efficient. The proposed research will devise "value enhancement" methods that are generally applicable to state-of-the-art RL algorithms and improve their statistical efficiency: for a given initial policy computed by existing algorithms, we aim to output a new policy whose expected return converges at a faster rate, achieving the desired "value enhancement" property. Second, many offline datasets are created by aggregating over many heterogeneous data sources. This is typically the case in healthcare, where the data trajectories collected from different patients might not share a common distribution. We will study existing transfer learning methods in RL and, drawing on our expertise in statistics, develop new approaches designed for healthcare applications.
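As a point of reference for offline policy optimisation, the sketch below implements fitted Q-iteration (FQI), a standard baseline for learning a policy from a fixed batch of data; it is not the grant's "value enhancement" method. The toy environment, the two-action setting and the random-forest regressor are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, gamma = 2000, 0.9

# Synthetic offline dataset of (state, action, reward, next state) transitions,
# logged under a uniformly random behaviour policy; states are one-dimensional.
s = rng.uniform(-1, 1, size=(n, 1))
a = rng.integers(0, 2, size=n)                                     # two actions: 0 or 1
r = np.where(a == 1, s[:, 0], -s[:, 0]) + 0.1 * rng.normal(size=n)
s_next = np.clip(s + rng.normal(0, 0.1, size=(n, 1)), -1, 1)

# Fitted Q-iteration: alternate between regressing Bellman targets on states
# (one regressor per action) and recomputing the targets from the fitted Q functions.
q = [RandomForestRegressor(n_estimators=50, random_state=0) for _ in range(2)]
targets = r.copy()
for _ in range(20):
    for act in range(2):
        idx = a == act
        q[act].fit(s[idx], targets[idx])
    q_next = np.column_stack([q[act].predict(s_next) for act in range(2)])
    targets = r + gamma * q_next.max(axis=1)   # reward plus discounted best next-state value

def policy(state):
    """Greedy policy derived from the final Q estimates."""
    values = [q[act].predict(np.atleast_2d(state))[0] for act in range(2)]
    return int(np.argmax(values))

print("action chosen at state 0.5:", policy([0.5]))   # the reward favours action 1 when s > 0
```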
Another question the proposed research will consider is off-policy evaluation (OPE). OPE aims to learn a target policy's expected return (value) from a pre-collected dataset generated by a different policy. It is critical in applications such as healthcare and automated driving, where new policies need to be evaluated offline before online validation. A common assumption made in most existing work is that of no unmeasured confounding. However, this assumption is not testable from the data and can be violated in observational datasets generated from healthcare applications. Moreover, owing to limited sample sizes, many offline applications will benefit from a confidence interval (CI) that quantifies the uncertainty of the value estimator. The proposed research is concerned with constructing a CI for a target policy's value in the presence of latent confounders. In addition, in a variety of applications the outcome distribution is skewed and heavy-tailed, so criteria such as quantiles are more sensible than the mean. We will develop methodologies to learn the quantile curve of the return under a target policy and to construct its associated confidence band.
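For concreteness, the following sketch shows a textbook importance-sampling OPE estimator with a normal-approximation confidence interval in a simple contextual-bandit setting; it assumes the behaviour policy is known and there is no unmeasured confounding, so it does not address the latent-confounder or quantile problems described above, and the data-generating mechanism is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Logged data: contexts, binary actions drawn from a known behaviour policy, observed rewards.
context = rng.uniform(0, 1, n)
p_behaviour = np.full(n, 0.5)                      # behaviour policy: action 1 with probability 0.5
action = rng.binomial(1, p_behaviour)
reward = action * context + (1 - action) * (1 - context) + 0.1 * rng.normal(size=n)

# Deterministic target policy to evaluate: choose action 1 whenever context > 0.5.
pi_target = (context > 0.5).astype(float)
p_target = np.where(action == 1, pi_target, 1 - pi_target)

# Importance-weighted rewards give an unbiased estimate of the target policy's value.
iw_rewards = (p_target / p_behaviour) * reward
value_hat = iw_rewards.mean()
se = iw_rewards.std(ddof=1) / np.sqrt(n)

print(f"estimated value: {value_hat:.3f}, "
      f"95% CI: ({value_hat - 1.96 * se:.3f}, {value_hat + 1.96 * se:.3f})")
```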
Was that change real? Quantifying uncertainty for change points
Awarding body: EPSRC (Engineering and Physical Sciences Research Council)
Total value: £323,942
Grant holder: Professor Piotr Fryzlewicz
Start/end date: 01/10/2021 - 30/09/2024
Summary: Detecting changes in data is currently one of the most active areas of statistics. In many applications the interest lies in segmenting the data into regions with the same statistical properties, either as a way to model data flexibly, to help with downstream analysis, or to ensure predictions are based only on relevant data. In others, the main interest lies in detecting when changes have occurred, as they indicate features of interest, from potential failures of machinery to security breaches or the presence of genomic features such as copy number variations.
To date, most research in this area has focused on developing methods for detecting changes: algorithms that take data as input and output a best guess as to whether there have been relevant changes, and if so, how many there have been and when they occurred. A comparatively neglected problem is assessing how confident we are that a specific change has occurred in a given part of the data.
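As a point of reference, the sketch below gives a minimal example of such a detector: a CUSUM-type statistic for a single change in mean, with an ad hoc detection threshold chosen purely for illustration; it is not one of the methods developed in this project.

```python
import numpy as np

def cusum_stat(x):
    """CUSUM statistic for a change in mean at each candidate split point 1..n-1."""
    n = len(x)
    k = np.arange(1, n)
    partial = np.cumsum(x)[:-1]
    left_mean = partial / k
    right_mean = (x.sum() - partial) / (n - k)
    return np.sqrt(k * (n - k) / n) * np.abs(left_mean - right_mean)

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 150), rng.normal(1, 1, 150)])  # mean shift at t = 150

stat = cusum_stat(x)
tau_hat = int(np.argmax(stat)) + 1          # estimated change location
detected = stat.max() > 3.0                 # ad hoc threshold, for illustration only

print("change detected:", detected, "at estimated location:", tau_hat)
```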
In many applications, quantifying the uncertainty around whether a change has occurred is of paramount importance. For example, if we are monitoring a large communication network, where changes indicate potential faults, it is helpful to know how confident we are that there is a fault at any given point in the network, so that we can prioritise the limited resources available for investigating and repairing faults. When analysing calcium imaging data on neuronal activity, where changes correspond to times at which a neuron fires, it is helpful to know how certain we are that a neuron fired at each time point, so as to improve downstream analysis of the data.
A naive approach to this problem is to first detect changes and then apply standard statistical tests for their presence. But this approach is flawed, as it uses the data twice: first to decide where to test, and then to perform the test. We can overcome this using sample-splitting ideas, where we use half the data to detect a change and the other half to perform the test, as in the sketch below. But such methods lose power, for example because only part of the data is used to detect changes.
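Below is a minimal sketch of the sample-splitting idea, under the simplifying assumptions of Gaussian noise with known unit variance and at most one change in mean: one half of the series (here the odd-indexed observations) is used to locate a candidate change, and the held-out half is used to test for a mean difference at that location.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(0.8, 1, 100)])

detect, test = x[0::2], x[1::2]             # interleaved split into two halves

# Step 1: locate the best candidate change on the detection half via a CUSUM argmax.
n = len(detect)
k = np.arange(1, n)
partial = np.cumsum(detect)[:-1]
cusum = np.sqrt(k * (n - k) / n) * np.abs(partial / k - (detect.sum() - partial) / (n - k))
tau = int(np.argmax(cusum)) + 1

# Step 2: a standard two-sample z-test (known unit variance) on the held-out half,
# splitting it at the location found in step 1.
left, right = test[:tau], test[tau:]
z = (right.mean() - left.mean()) / np.sqrt(1.0 / len(left) + 1.0 / len(right))
p_value = 2 * stats.norm.sf(abs(z))

print(f"candidate split at index {tau} of the held-out half; p-value = {p_value:.4f}")
```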
This proposal will develop statistically valid approaches to quantifying uncertainty that are more powerful than sample-splitting approaches. These approaches are based on two complementary ideas: (i) performing inference prior to detection; and (ii) developing tests for a change that account for earlier detection steps. The output will be a new general toolbox for change points, encompassing both new general statistical methods and their implementation within software packages.