IceDice: Predicting the stochastic behaviour of West Antarctica’s Marine Ice Sheet
Awarding body: NERC (Natural Environment Research Council)
Total value: £760,000 to £950,000
Grant holders: 9 academics from the British Antarctic Survey (BAS), LSE holder is Dr Ieva Kazlauskaite
Start/end date: 09/2025 – 09/2028 (tentative)
Summary: Because the contribution that the West Antarctic ice sheet will make to sea level is so poorly known, the severity and cost of coastal flooding to be expected over the coming century remain hugely uncertain. Simplified modelling suggests that optimal strategies for coastal adaptation can reduce flood-related costs by an order of magnitude, from around $1 trillion per year globally with no adaptation to around $100 billion per year by the end of this century. Realising this $900 billion saving in full will require the capacity to forecast sea level over planning horizons that extend over many decades. This means that accurate forecasts of the range of possible outcomes under different policy decisions have enormous economic and societal value: they provide the opportunity to stress-test decisions about energy policy and coastal adaptation and to select the best policy. If the forecasts are too pessimistic, costly over-adaptation would constitute an opportunity cost that reduces economic growth. If they are too optimistic, expensive and catastrophic loss of infrastructure may ensue, along with great societal harm attributable to pervasive, large-scale flooding. This makes it extremely important that ice-sheet forecasts can attach reliable probabilities to the range of possible outcomes.
In this project we seek to provide exactly this information by removing a series of obstacles that have hitherto hindered progress in probabilistic forecasting of West Antarctica. In particular, we propose to greatly increase the realism of the ice-sheet models used in sea-level forecasts by including both ice fracture and ice-ocean interaction in the same model, using new approaches. Beyond this, we address the computational challenge of providing accurate probabilistic information, which has been hindered by the inability to perform enough model simulations to fully sample the probability distribution. We will address this using large-scale compute together with a strategy that combines efficient parameter calibration, emulation of models based on best practices in machine learning, and a sampling algorithm that is known to converge to the probabilities most informative for decision makers. We will pay particular attention to assessing the risks of the extreme sea-level contributions that are most damaging for coastal communities. In doing so, we will provide coastal planners with the opportunity to follow optimal pathways that have the potential to produce great economic savings and huge societal benefits.
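The emulate-then-sample strategy can be illustrated in miniature. The sketch below is a purely hypothetical stand-in, not the project's ice-sheet model or calibration pipeline: an invented one-parameter "simulator", a polynomial emulator fitted to a handful of simulator runs, and a Metropolis-Hastings sampler, an algorithm whose stationary distribution is the target posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an expensive simulator: maps one uncertain
# parameter to a scalar output (think: a sea-level contribution).
def simulator(theta):
    return theta + 0.5 * np.sin(theta)

# Step 1: run the simulator at a few design points, fit a cheap emulator.
design = np.linspace(-3.0, 3.0, 15)
emulator = np.poly1d(np.polyfit(design, simulator(design), deg=7))

# Step 2: Metropolis-Hastings on the posterior over theta given one
# synthetic observation, evaluating only the emulator (never the simulator).
obs, noise_sd = simulator(1.2), 0.1

def log_post(theta):
    if abs(theta) > 3.0:               # flat prior on [-3, 3]
        return -np.inf
    return -0.5 * ((obs - emulator(theta)) / noise_sd) ** 2

samples = []
theta = 0.0
lp = log_post(theta)
for _ in range(20_000):
    prop = theta + rng.normal(0.0, 0.3)
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:   # accept/reject step
        theta, lp = prop, lp_prop
    samples.append(theta)

posterior = np.array(samples[5_000:])  # discard burn-in
```

Because each posterior evaluation calls only the cheap emulator, the sampler can afford the very many iterations needed to resolve tail probabilities, which is the point of the emulation step.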
Statistical Foundations for Detecting Anomalous Structure in Stream Settings (DASS) - An EPSRC Programme Grant
Awarding body: EPSRC (Engineering and Physical Sciences Research Council)
Total value: £5,053,464 (LSE: £961,052)
Grant holders: 5 academics from 4 UK universities, LSE holder is Professor Qiwei Yao
Start/end date: 01/11/2024 - 31/10/2029
Summary: With the exponentially increasing prevalence of networked sensors and other devices for collecting data in real time, automated data-analysis methods with theoretically justified performance guarantees are in constant demand. Often a key question with such streaming data is whether they show evidence of anomalous behaviour: for example, malicious bot activity on a website, early warning of potential equipment failure, or detection of methane leakages. These and other motivating examples share a common feature which is not accommodated by classical point-anomaly models in statistics: the anomaly may not simply be an 'outlying' observation, but rather a distinctive pattern observed over consecutive observations. The strategic vision for this programme grant is to establish the statistical foundations for Detecting Anomalous Structure in Stream Settings (DASS).
Discussions with a wide range of industrial partners from different sectors have identified important, generic challenges that cut across distinct DASS applications and are relevant for analysing streaming data more broadly:
Contemporary Constrained Environments: Anomaly detection is often performed under various constraints due, for example, to restrictions on measurement frequency, on the volume of data transferable between sensors and a central processor, or on battery usage. Additionally, certain scenarios may impose privacy restrictions when handling sensitive data. Consequently, it has become imperative to establish the mathematical underpinning for rigorously examining the trade-offs between, e.g., statistical accuracy, communication efficiency, privacy preservation and computational demands.
Handling Data Realities: A substantial portion of research in statistical anomaly detection operates under the assumption of clean data. Nevertheless, real-world data typically exhibit various imperfections, such as missing values, labelling errors in data streams, synchronisation discrepancies, sensor malfunctions and heterogeneous sensor performance. Consequently, there is a pressing need for the development of principled, model-based procedures that can effectively address the features of real data and enhance the resilience of anomaly detection methods.
Identifying, Accounting for and Tracking Dependence: Not only are data streams often interdependent, but also anomalous patterns may be dependent across those streams. Taking into account both types of dependence is crucial in enhancing the statistical efficiency of anomaly detection algorithms, and also in controlling the errors arising from handling a large number of data streams in a principled way. Other challenges include tracking the path of an anomaly across multiple data sources with a view to learning causal indicators allowing for precautionary intervention.
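As a minimal illustration of an anomaly that is a pattern over consecutive observations rather than a single outlier, the sketch below applies a one-sided CUSUM statistic, a classical sequential detector, to a simulated stream with a sustained mean shift. The parameter values and data are invented for illustration; this is not DASS methodology.

```python
import numpy as np

def cusum_alarm(stream, target_mean=0.0, drift=0.5, threshold=10.0):
    """One-sided CUSUM: accumulate positive deviations from target_mean
    (less a drift allowance) and alarm once the sum exceeds threshold.
    Returns the index of the first alarm, or None if none occurs."""
    s = 0.0
    for i, x in enumerate(stream):
        s = max(0.0, s + (x - target_mean) - drift)
        if s > threshold:
            return i
    return None

rng = np.random.default_rng(0)
# 200 in-control points (mean 0), then a collective anomaly:
# 50 consecutive points whose mean has shifted to 2.
stream = np.concatenate([rng.normal(0.0, 1.0, 200),
                         rng.normal(2.0, 1.0, 50)])
alarm = cusum_alarm(stream)  # expected shortly after index 200
```

A single large observation barely moves the statistic, but a modest shift sustained over consecutive points accumulates quickly, which is exactly the collective-anomaly behaviour described above.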
Our ambitious goal of comprehensively addressing these challenges is only achievable via the programme grant scheme. Our philosophy is to tackle the methodological, theoretical and computational aspects of these statistical problems together. This integrated approach is essential to achieving the substantive fundamental advances in statistics envisaged, and to ensuring that our new methods are sufficiently robust and efficient to be widely adopted by academics, industry and society more generally.
Network Stochastic Processes and Time Series (NeST) - An EPSRC Programme Grant
Awarding body: EPSRC (Engineering and Physical Sciences Research Council)
Total value: £6,451,752 (LSE: £668,569)
Grant holders: 9 academics from 6 UK universities, LSE holder is Professor Qiwei Yao
Start/end date: 01/12/2022 - 30/11/2028
Summary: Dynamic networks occur in many fields of science, technology and medicine, as well as everyday life. Understanding their behaviour has important applications. For example, whether it is to uncover serious crime on the dark web, intrusions in a computer network, or hijacks at global internet scales, better network anomaly detection tools are desperately needed in cyber-security. Characterising the network structure of multiple EEG time series recorded at different locations in the brain is critical for understanding neurological disorders and therapeutics development. Modelling dynamic networks is of great interest in transport applications, such as for preventing accidents on highways and predicting the influence of bad weather on train networks. Systematically identifying, attributing, and preventing misinformation online requires realistic models of information flow in social networks.
Whilst the theory of simple random networks is well established in mathematics and computer science, the recent explosion of dynamic network data has exposed a large gap in our ability to process real-life networks. Classical network models have led to a body of beautiful mathematical theory, but do not always capture the rich structure and temporal dynamics seen in real data, nor are they geared to answer practitioners' typical questions, e.g. relating to forecasting, anomaly detection or data-ethics issues. Our NeST programme will develop robust, principled, yet computationally feasible ways of modelling dynamically changing networks and the statistical processes on them.
Some aspects of these problems, such as quantifying the influence of policy interventions on the spread of misinformation or disease, require advances in probability theory. Dynamic network data are also notoriously difficult to analyse. At a computational level, the datasets are often very large and/or only available "on the stream". At a statistical level, they often come with important collection biases and missing data. Often, even understanding the data and how they may relate to the analysis goal can be challenging. Therefore, to tackle these research questions in a systematic way we need to bring probabilists, statisticians and application domain experts together.
NeST's six-year programme will see probabilists and statisticians with theoretical, computational, machine-learning and data-science expertise collaborate across six world-class institutions to conduct leading and impactful research. In different overlapping groups, we will tackle questions such as: How do we model data to capture the complex features and dynamics we observe in practice? How should we conduct exploratory data analysis, or, to quote a famous statistician, go about "looking at the data to see what it seems to say" (Tukey, 1977)? How can we forecast network data, or detect anomalies, changes and trends? To ground our techniques in practice, our research will be informed and driven by challenges in many key scientific disciplines through frequent interaction with industrial & government partners in energy, cyber-security, the environment, finance, logistics, statistics, telecoms, transport, and biology. A valuable output of our work will be high-quality, curated, dynamic network datasets from a broad range of application domains, which we will make publicly available in a repository for benchmarking, testing & reproducibility (responsible innovation), partly as a vehicle to foster new collaborations. We also have a strategy to disseminate knowledge through a diverse range of scientific publication routes, high-quality free software (e.g. R packages, Python notebooks accompanying data releases), conferences, patents and outreach activities. NeST will also carefully nurture and develop the next generation of highly trained and research-active people in our area, which will contribute strongly to satisfying the high demand for such people in industry, government and academia.
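As a toy example of one question in this space, detecting an anomaly in a dynamically evolving network, the sketch below monitors the edge density of a sequence of network snapshots and flags those deviating sharply from an in-control baseline. The Erdős-Rényi snapshots and the four-sigma rule are invented for illustration and are not NeST methodology.

```python
import numpy as np

def edge_density(adj):
    """Fraction of possible (undirected, no self-loop) edges present."""
    n = adj.shape[0]
    return adj[np.triu_indices(n, k=1)].mean()

rng = np.random.default_rng(1)
n = 50

def snapshot(p):
    """One symmetric adjacency matrix with edge probability p."""
    upper = rng.random((n, n)) < p
    adj = np.triu(upper, k=1)
    return (adj | adj.T).astype(int)

# 30 baseline snapshots with p = 0.1, then 10 anomalous ones with p = 0.25
densities = ([edge_density(snapshot(0.10)) for _ in range(30)]
             + [edge_density(snapshot(0.25)) for _ in range(10)])

# Flag snapshots whose density deviates from the baseline by > 4 sigma
base_mean = np.mean(densities[:30])
base_sd = np.std(densities[:30])
anomalous = [t for t, d in enumerate(densities)
             if abs(d - base_mean) > 4 * base_sd]
```

Real dynamic network data would of course demand models of dependence between snapshots, not a single summary statistic, which is precisely the gap the programme targets.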
Statistical Methods in Offline Reinforcement Learning
Awarding body: EPSRC (Engineering and Physical Sciences Research Council)
Total value: £398,393
Grant holder: Dr Chengchun Shi
Start/end date: 01/04/2022 - 31/03/2025
Summary: Reinforcement learning (RL) is concerned with how intelligent agents take actions in a given environment to learn an optimal policy that maximises the cumulative reward they receive. It has arguably been one of the most vibrant research frontiers in machine learning over the last few years. According to Google Scholar, over 40,000 scientific articles containing the phrase "reinforcement learning" were published in 2020. Over 100 papers on RL were accepted for presentation at ICML 2020 (a premier machine-learning conference), accounting for more than 10% of all accepted papers. Significant progress has been made in solving challenging problems across various domains using RL, including games, robotics, healthcare, bidding and automated driving. Nevertheless, statistics as a field, as opposed to computer science, has only recently begun to engage with RL in both depth and breadth. The proposed research will develop statistical learning methodologies to address several key issues in offline RL, where the objective is to propose RL algorithms that utilise previously collected data without additional online data collection. The proposed research is primarily motivated by applications in healthcare. Most existing state-of-the-art RL algorithms were motivated by online settings (e.g., video games), and their generalisation to applications in healthcare remains unknown. We also remark that our solutions will be transferable to other fields (e.g., robotics).
A fundamental question the proposed research will consider is offline policy optimisation, where the objective is to learn an optimal policy that maximises the long-term outcome based on an offline dataset. This question presents at least two major challenges. First, in contrast to online settings where data are easy to collect or simulate, the number of observations in many offline applications (e.g., healthcare) is limited. With such limited data, it is critical to develop RL algorithms that are statistically efficient. The proposed research will devise "value enhancement" methods that are generally applicable to state-of-the-art RL algorithms and improve their statistical efficiency: for a given initial policy computed by existing algorithms, we aim to output a new policy whose expected return converges at a faster rate, achieving the desired "value enhancement" property. Second, many offline datasets are created by aggregating over many heterogeneous data sources. This is typically the case in healthcare, where the data trajectories collected from different patients might not share a common distribution. We will study existing transfer-learning methods in RL and, drawing on our expertise in statistics, develop new approaches designed for healthcare applications.
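To make the offline policy-optimisation setting concrete, the sketch below runs generic fitted Q-iteration, a standard baseline rather than the "value enhancement" procedures proposed here, on an invented two-state toy MDP: the policy is learned purely from a fixed logged dataset, with no further interaction.

```python
import numpy as np

# Invented toy MDP, 2 states x 2 actions: action 1 moves to / stays in
# state 1, where it earns reward 1; action 0 resets to state 0, reward 0.
def step(s, a):
    if a == 1:
        return 1, (1.0 if s == 1 else 0.0)
    return 0, 0.0

rng = np.random.default_rng(0)

# Offline dataset logged under a uniformly random behaviour policy
dataset, s = [], 0
for _ in range(2_000):
    a = int(rng.integers(2))
    s_next, r = step(s, a)
    dataset.append((s, a, r, s_next))
    s = s_next

# Fitted Q-iteration: repeatedly regress Bellman targets on the fixed data
# (here "regression" is just a per-(state, action) average, as the MDP is
# tabular; in practice a flexible learner plays this role).
gamma = 0.9
Q = np.zeros((2, 2))
for _ in range(200):
    targets, counts = np.zeros_like(Q), np.zeros_like(Q)
    for s, a, r, s_next in dataset:
        counts[s, a] += 1
        targets[s, a] += r + gamma * Q[s_next].max()
    Q = np.where(counts > 0, targets / np.maximum(counts, 1), Q)

greedy_policy = Q.argmax(axis=1)  # learned policy: action 1 in both states
```

The statistical-efficiency challenge in the text arises exactly here: with few logged trajectories the per-(state, action) averages (or their function-approximation analogues) are noisy, and the quality of the learned policy degrades accordingly.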
Another question the proposed research will consider is off-policy evaluation (OPE). OPE aims to learn a target policy's expected return (value) from a pre-collected dataset generated by a different policy. It is critical in applications such as healthcare and automated driving, where new policies need to be evaluated offline before online validation. A common assumption made in most existing work is that of no unmeasured confounding. However, this assumption is not testable from the data and can be violated in observational datasets generated from healthcare applications. Moreover, owing to the limited sample size, many offline applications will benefit from a confidence interval (CI) that quantifies the uncertainty of the value estimator. The proposed research is concerned with constructing a CI for a target policy's value in the presence of latent confounders. In addition, in a variety of applications the outcome distribution is skewed and heavy-tailed, and criteria such as quantiles are more sensible than the mean. We will develop methodologies to learn the quantile curve of the return under a target policy and to construct its associated confidence band.
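In its simplest single-step form, with no unmeasured confounding, OPE can be sketched with an importance-sampling estimator. The toy two-armed bandit below is an invented illustration of that baseline idea, not the confounder-robust methodology the project will develop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented two-armed bandit (single-step horizon):
# arm 0 has mean reward 0.3, arm 1 has mean reward 0.7.
true_means = np.array([0.3, 0.7])

behaviour = np.array([0.8, 0.2])  # policy that logged the data
target = np.array([0.1, 0.9])     # policy to be evaluated offline

# Logged dataset: actions drawn from the behaviour policy, noisy rewards
n = 50_000
actions = rng.choice(2, size=n, p=behaviour)
rewards = rng.normal(true_means[actions], 0.1)

# Importance-sampling estimate of the target policy's value: reweight each
# logged reward by how much more (or less) likely the target policy was to
# have taken that action.
weights = target[actions] / behaviour[actions]
v_hat = float(np.mean(weights * rewards))

true_value = float(target @ true_means)  # 0.1*0.3 + 0.9*0.7 = 0.66
```

The estimator is unbiased here because the action probabilities are fully known; with latent confounders those probabilities are misspecified, which is why the project targets CIs that remain valid in that harder setting.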