Transmission T-003: Luu Hoang Duc and Jürgen Jost on Making the Most of Bad Data

Perspectival Study of the Adoration of Magi by Leonardo da Vinci, circa 1481.

March 30, 2020

To forecast the spread of the novel coronavirus, we must attend to the quality and consistency of the data.

Read the Reflection, written 29 July 2021, below the following original Transmission. For an updated version of the Transmission, see The Complex Alternative: Complexity Scientists on the COVID-19 Pandemic.

There is no shortage of data on the unfolding coronavirus epidemic. Countries around the world are publishing daily case counts, which should constitute a digital treasure trove for those of us who seek to understand and even forecast the spread of the epidemic. The problem with this massive quantity of data is its quality — datasets from different countries are not really compatible with each other, are often internally inconsistent, and in some cases could be politically manipulated.

So what’s a complexity scientist to do? In our research group in Leipzig, we believe we can establish general statistical regularities using simplifying assumptions and procedures that can compensate for data fluctuations. Below, we provide a few examples of problems that arise from inconsistent data, and solutions for making the most of it.

For each of the countries we survey, we distinguish different periods of pandemic development based on the respective growth rates for the number of infections recorded. In the beginning, the growth rate is typically extremely high but then weakens. In the final saturation phase, the growth rate has become so low that the development of the epidemic is essentially under control. Various countries are currently at different stages of development. In countries in which the growth rate is still very high, as is currently the case in Germany, it must be expected that a saturation phase will only occur after much higher case numbers.

How to handle data that are difficult to compare, unreliable, and inconsistent

There are now many data points from many countries on the spread of the coronavirus epidemic, updated at least once a day. But, as mentioned above, the data from different countries are difficult to compare.

Here are some of the problems: Test density and methodology vary greatly; not all virus carriers show symptoms; not all infected people are identified; hospitals do not necessarily report discharges to the authorities; those who have recovered at home will not always report; and the death toll is unclear, because it is difficult to distinguish between people who die from the coronavirus and people who die with it. The epidemic is over when the number of active cases, calculated as the number of infected persons minus those who have recovered or died, reaches zero, but given these reporting gaps the calculation may be inconsistent or incorrect.
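
The active-case calculation described above, and one way to spot days on which the reported series contradict each other, can be sketched as follows. This is a minimal illustration and not the authors' code: the function names and the numbers are hypothetical, and the two consistency checks (no negative active count, no decreasing cumulative series) are assumptions about what "inconsistent" might mean in practice.

```python
def active_cases(confirmed, recovered, deceased):
    """Active cases per day: cumulative confirmed minus recovered minus deceased."""
    return [c - r - d for c, r, d in zip(confirmed, recovered, deceased)]


def inconsistent_days(confirmed, recovered, deceased):
    """Indices of days on which the cumulative series contradict each other."""
    flagged = []
    for i, (c, r, d) in enumerate(zip(confirmed, recovered, deceased)):
        # A negative active count means more people left the pool than entered it.
        negative_active = c - r - d < 0
        # Cumulative counts should never go down from one day to the next.
        decreasing = i > 0 and (
            c < confirmed[i - 1] or r < recovered[i - 1] or d < deceased[i - 1]
        )
        if negative_active or decreasing:
            flagged.append(i)
    return flagged


# Hypothetical cumulative counts for four consecutive days (not real data).
confirmed = [100, 150, 140, 200]
recovered = [10, 20, 30, 45]
deceased = [1, 2, 3, 4]

print(active_cases(confirmed, recovered, deceased))       # [89, 128, 107, 151]
print(inconsistent_days(confirmed, recovered, deceased))  # [2]
```

Day 2 is flagged because the cumulative number of confirmed cases drops from 150 to 140, which cannot happen in a consistently maintained series.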

So, how can we deal statistically with such a data situation? Is it still possible to gain general insights into the course of the epidemic and perhaps even to make predictions about how long it could take for individual countries to bring the epidemic under control?

In short, we have to use simplifying assumptions and simple, robust procedures that can compensate for data fluctuations. Here, we assume that the ratio of reported cases to actual cases will remain reasonably constant, i.e., typically the test methodology and coverage will not change. Then the respective rates of increase will also be similar, and we can draw conclusions about the actual cases from the increase in reported cases.
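
Why a constant reporting ratio is enough can be made explicit with a short derivation. The symbols are our own notation for illustration and do not appear in the original: R(t) is the reported cumulative case count, A(t) the actual count, and ρ the assumed constant fraction of cases that is reported.

```latex
% Assumption: R(t) = \rho\, A(t) with a constant reporting ratio 0 < \rho \le 1.
\[
  \frac{R(t+1)}{R(t)} \;=\; \frac{\rho\, A(t+1)}{\rho\, A(t)} \;=\; \frac{A(t+1)}{A(t)},
  \qquad\text{equivalently}\qquad
  \log R(t+1) - \log R(t) \;=\; \log A(t+1) - \log A(t).
\]
% If the ratio varies in time, an extra term appears:
\[
  \log R(t+1) - \log R(t) \;=\; \log A(t+1) - \log A(t) \;+\; \log\frac{\rho(t+1)}{\rho(t)} .
\]
```

The unknown ratio cancels, so the growth of the reported numbers equals the growth of the actual numbers. If testing coverage changes, the extra term in the second line shows up as a jump in the reported growth rate even when the actual growth rate is unchanged.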

We draw a simple regression line through the logarithmic rates of increase. Extrapolating this line yields a prognosis (a very rough one, of course) of the case numbers at which the epidemic can probably be brought under control and of how long it is likely to last. The actual development will naturally depend on the measures taken to contain the epidemic, on how they are implemented, and on how well the population complies with them.
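
One possible reading of this procedure is sketched below. It is not the authors' implementation: we assume that the "rate of increase" is the daily relative growth of the cumulative case count, that the line is fitted to its logarithm by ordinary least squares, and that "under control" means the fitted growth rate has fallen below a chosen threshold. The function names, the threshold of 0.1% per day, and the synthetic series are all hypothetical.

```python
import math


def growth_rates(cumulative):
    """Daily relative growth g(t) = (N(t) - N(t-1)) / N(t-1) for t >= 1.

    Assumes every cumulative count is positive.
    """
    return [(b - a) / a for a, b in zip(cumulative, cumulative[1:])]


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_controlled(cumulative, threshold=0.001):
    """Day (counted from the first data point) at which the fitted growth rate
    drops below `threshold`, or None if the fitted rate is not decreasing.

    Needs at least two days with positive growth.
    """
    rates = growth_rates(cumulative)
    # Keep only days with positive growth so that the logarithm is defined.
    points = [(t, math.log(g)) for t, g in enumerate(rates, start=1) if g > 0]
    slope, intercept = fit_line([t for t, _ in points], [y for _, y in points])
    if slope >= 0:
        return None
    return (math.log(threshold) - intercept) / slope


# Synthetic series: growth starts near 30% per day and decays by exp(-0.1) daily.
cumulative = [1000.0]
for t in range(1, 31):
    cumulative.append(cumulative[-1] * (1 + 0.3 * math.exp(-0.1 * t)))

print(days_until_controlled(cumulative))  # roughly day 57 for this synthetic series
```

A steeper downward slope means the growth rate is collapsing quickly and the threshold is reached sooner; a flat line pushes the estimate far into the future, and a rising line gives no estimate at all.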

We use data provided by the WHO and Worldometer and evaluate them for countries with over 1,000 reported infections. Here, we show the data for Italy. See https://www.mis.mpg.de/covid19/covid19-mpi-mis-leipzig-start.html for full data.

The first two graphs show the numbers of infected, deceased, and recovered persons and the active cases over time. The next two graphs show the number of daily new infections and new deaths. The final two graphs compare the growth rate of new infections with their linear regression (blue line). Large deviations from this straight line may indicate problems or systematic changes in data collection. The flatter the blue line is, the slower the epidemic weakens.

Heterogeneous contact networks

We also see important consequences for scientific models of the spread of epidemics.

Diseases are transmitted through contacts, and therefore many propagation models try to capture the network of social contacts. Typically, a fairly homogeneous network structure is used as a basis for the model to remain manageable, but the spreading of the coronavirus epidemic points to very heterogeneous underlying network structures. In South Korea, for instance, the virus apparently spread very quickly and strongly within a particular sect, which was favored by intensive contacts within the sect, but could then be confined because contacts with the rest of the population were apparently much thinner. The death rate in Italy is comparatively high, probably because the contacts between generations are more intense, allowing the virus to quickly reach the elderly, whereas in Germany and the Scandinavian countries it was probably first spread by ski tourists returning from their winter vacation in the Alps. In the Scandinavian countries there also seem to be two different waves of spread, unlike in the rest of the world.
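
The effect of such heterogeneity can be illustrated with a toy model. The sketch below is not the authors' model: it is a deterministic, discrete-time SIR model with one set of compartments per group, and every number in it (group sizes, contact rates, recovery rate) is a hypothetical value chosen only to contrast a small, tightly connected group with a large, loosely connected population.

```python
import math


def simulate_groups(sizes, contacts, initial_infected, gamma=0.1, days=365):
    """Discrete-time SIR with one set of compartments per group.

    contacts[i][j] is the assumed daily rate of transmitting contacts that a
    person in group i has with members of group j.
    Returns the final attack rate (fraction ever infected) of each group.
    """
    n = len(sizes)
    infected = [float(x) for x in initial_infected]
    susceptible = [sizes[i] - infected[i] for i in range(n)]
    for _ in range(days):
        prevalence = [infected[j] / sizes[j] for j in range(n)]
        new_cases = []
        for i in range(n):
            # Force of infection on group i from all groups, itself included.
            force = sum(contacts[i][j] * prevalence[j] for j in range(n))
            new_cases.append(susceptible[i] * (1.0 - math.exp(-force)))
        for i in range(n):
            recoveries = gamma * infected[i]
            susceptible[i] -= new_cases[i]
            infected[i] += new_cases[i] - recoveries
    return [1.0 - susceptible[i] / sizes[i] for i in range(n)]


# Group 0: a small, tightly connected cluster. Group 1: the rest of the population.
sizes = [1_000, 99_000]
contacts = [
    [0.6, 0.0099],   # cluster members: intense internal contact, few outside contacts
    [0.0001, 0.08],  # everyone else: sparse contact overall
]
cluster_rate, rest_rate = simulate_groups(sizes, contacts, initial_infected=[10, 0])

# Homogeneous counterpart: one group with the population-average contact rate.
average_rate = (1_000 * (0.6 + 0.0099) + 99_000 * (0.0001 + 0.08)) / 100_000
(homogeneous_rate,) = simulate_groups([100_000], [[average_rate]], initial_infected=[10])

print(cluster_rate, rest_rate, homogeneous_rate)
```

With these made-up parameters, nearly the whole cluster is infected while the rest of the population is barely touched. The homogeneous version, which spreads the same contacts evenly over everyone, produces almost no outbreak at all and so misses the cluster entirely.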

Conclusions

Reported data about the COVID-19 epidemic are obviously incomplete, vary greatly between countries, are possibly politically manipulated, and are typically internally inconsistent. If we want to draw any reasonable statistical conclusions at all from such data, we need methods that can identify some robust trends. We therefore looked at the dynamics of the growth rate and found regularities there that are captured by a simple linear regression. This leads to prognoses, which, of course, are rough and tentative and will be affected by the political measures taken and by the compliance of the populations in the various countries.

We also see a scientific challenge for models of epidemic transmission: networks of social contacts may be very heterogeneous. They may contain, for instance, subgroups with few outside contacts or, in contrast, inter-generational contacts that can quickly carry infections into high-risk groups.

We hope that a deeper understanding of these and other problems will allow us to better cope with such epidemics in the future.

Luu Hoang Duc
Max Planck Institute for Mathematics in the Sciences, Leipzig

Jürgen Jost
Max Planck Institute for Mathematics in the Sciences, Leipzig
Santa Fe Institute


Read more posts in the Transmission series, dedicated to sharing SFI insights on the coronavirus pandemic.

Listen to SFI President David Krakauer discuss this Transmission on episode 26 of our Complexity Podcast.

Reflection

July 29, 2021

WHY DO WE NEED COMPLEX-SYSTEMS SCIENCE TO UNDERSTAND THE COVID-19 PANDEMIC?

The coronavirus disease 2019 (COVID-19) is caused by a small virus (SARS-CoV-2), which has diverged into several variants, and affects humans and societies across the world. Thus, modeling the disease might naturally span a vast range of scales, from the molecular to the global. This virus itself is only moderately complex, but its dynamics depend on the complexity of human biology and society. Analyzing the genome of the virus does not clarify these dynamics because its replication machinery depends, for its assembly and proliferation, on the properties of the host cells. The virus is transmitted through the air, and we know something about the aerosol physics behind transmission. The transmission depends on the behavior of people, and this may be modified by voluntary restraint or political measures, which are not always accepted and obeyed by people. Also, in many countries, vaccination campaigns encounter resistance from substantial parts of the population, while in other countries, there is a vaccine shortage. The spreading of the pandemic as well as the measures taken to constrain it may have many psychological, social, political, economic, and financial side effects and perhaps as-yet-unknown long-term consequences. Apparently, the severity of the disease is strongly correlated with age, but the availability and quality of treatment in the medical system can mitigate the mortality risk, although the medical systems in many countries are poorly equipped to cope with this challenge. Also, the interaction between scientific opinions and public controversies has repercussions, but is not easy to model.

Should a good model incorporate as many of these scales and dynamics as possible, from the biochemistry to the global economy, from the scale of virus replication in a host to the long-term political instabilities? Perhaps not, as such a model may depend on too many parameters that cannot be reliably estimated, and it may become far too complicated. However, there may exist certain universal patterns underlying the dynamics of this and, perhaps, other pandemics. Such patterns can only be captured by simpler models that identify essential aspects. But, while this valuable path may lead to important contributions, a profound scientific understanding needs to go beyond such models and incorporate complexity.

Conceptual Issues

As one of us (JJ) has advocated, the key feature of biological life is that a biological process can control and regulate other processes, and it improves that ability over time. This control can happen hierarchically and/or reciprocally. Thus, the information that a biological process needs to use concerns only the control, but not the content or the internal structure, of those processes. Those other processes can be—or rather, have to be—vastly more complex than the controlling process itself. Each biological process draws upon the complexity of its environment.

The novel coronavirus illustrates that thesis. From that perspective, we should conceptualize the virus not as a physical molecule, but as a dynamic process. Complex systems in general may consist of many interacting levels and scales. In fact, to understand the pandemic, it does not suffice to sequence the 26–32 kilobases of RNA of the virus. We rather need to understand the complexity of the human cells that the virus uses for its reproduction and the complexity of human societies that enable its transmission.

Complex systems are characteristically both vulnerable and resilient at the same time. In noncomplex systems, stochastic fluctuations usually average out, and small random events have only small consequences. In complex systems, by contrast, small and local random events may have large and global consequences. A single mutation of one copy of the virus may change the course of the global pandemic. The negligence of a single individual may have catastrophic consequences. The consequences may also be positive: the human immune system may adapt to the virus and learn to fight it off. Vaccinations may make people immune. A society may rearrange its interaction patterns to restrict physical contact between individuals. Through that, it may discover that certain rearrangements make it more efficient in other respects—for instance, in reducing unnecessary traffic and improving time management among home office workers.

Consequences

We need specialists in many disciplines. Knowing the genetic sequence of the virus and its mutated variants, we can apply our knowledge of human cell biology to understand how the virus uses the molecular machinery of human cells for its reproduction and how its copies can then invade other cells inside and across organisms. We can then try to interfere with various stages of the reproduction and transmission process by creating medicines or vaccines. Understanding the physics of transmission, we can propose contact restrictions that reduce the transmission chances. Analyzing the large-scale structure of social contacts, using concepts like network modularity or assortativity (i.e., to what extent the network consists of distinct modules that have only a few connections between them, or whether highly connected individuals preferentially connect to other such individuals or rather stay away from them), can improve epidemiological models. In order to assess the effectiveness of countermeasures, we have to take into account psychological microfoundations of human behavior as well as cultural practices and differences. Contact restrictions emerge from interactions between political actors, scientific advice, public debates, and opinion dynamics in the population. Thus, insight into political processes and opinion dynamics can guide the scientific system in communicating its findings and formulating its advice efficiently, anticipating public reactions and possible countermeasures by individuals against political measures emerging from scientific advice. We also need to look at the scientific system from the outside and see how individual scientists or scientific institutions react to public pressure. We must anticipate how internal disagreement in the scientific system can be exploited by political agitators or interest groups to discredit science at large. We should also look at the long-term consequences, such as increasing social inequality, growing public debt, economic reorganizations, and education deficits for many schoolchildren and students.
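
The two network concepts mentioned above can be made concrete on a small example. The sketch below uses the standard definitions of modularity (for a given division into groups) and of degree assortativity (the correlation between the degrees at the two ends of an edge); the function names and the six-node graph are hypothetical and chosen only for illustration.

```python
def degrees(edges):
    """Number of edges attached to each node of an undirected graph."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return degree


def modularity(edges, community):
    """Modularity Q = sum over groups of (e_c / m) - (d_c / 2m)^2, where m is the
    number of edges, e_c the edges inside group c, d_c the degree sum of group c."""
    m = len(edges)
    internal = {}
    for u, v in edges:
        if community[u] == community[v]:
            internal[community[u]] = internal.get(community[u], 0) + 1
    degree_sum = {}
    for node, k in degrees(edges).items():
        c = community[node]
        degree_sum[c] = degree_sum.get(c, 0) + k
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


def degree_assortativity(edges):
    """Pearson correlation of the degrees at the two ends of each edge.

    Undefined (division by zero) if all nodes have the same degree.
    """
    degree = degrees(edges)
    # Count every edge in both directions so that the measure is symmetric.
    pairs = [(degree[u], degree[v]) for u, v in edges]
    pairs += [(b, a) for a, b in pairs]
    n = len(pairs)
    mean = sum(x for x, _ in pairs) / n
    cov = sum((x - mean) * (y - mean) for x, y in pairs) / n
    var = sum((x - mean) ** 2 for x, _ in pairs) / n
    return cov / var


# Two triangles joined by a single bridge between nodes 2 and 3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

print(modularity(edges, community))   # 5/14, about 0.357
print(degree_assortativity(edges))    # -1/6, about -0.167
```

A clearly positive modularity indicates that the chosen groups have more internal edges than expected by chance, and a negative assortativity indicates that well-connected nodes tend to link to less connected ones.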

The preceding seems rather obvious, but the point we want to make is that an understanding of the dynamics of the pandemic and of its potential long-term effects on our societies and economies requires that all of these factors and their interactions be considered. Pandemic models and advice based on just some of these factors—virology, epidemiology, aerosol physics, social contact structure, opinion dynamics, and collective behavior, or whatever—are almost surely inadequate. Of course, we need specialists in all those disciplines, but we also have to integrate their insights into a comprehensive understanding from a complex-systems perspective. This poses challenges, from analyzing the dynamics across vastly different scales and at many different levels, to coping with radical uncertainty.

Building on fundamental research in various disciplines, complex-systems scientists need to integrate these findings and create appropriate models. The models themselves may be simple, but effective simplicity must be grounded in the complex details and qualitative insights those various disciplines have uncovered. That is, we need both specialized research and its integration to understand how the virus exploits the complexity of human biology and society and to devise and propose appropriate intervention strategies.

On the one hand, we need to be humble. We cannot predict chance events. We do not know which mutations will occur and whether the virus will become more harmful or more benign in the future. We may, however, simulate potential mutations and their ability to control the molecular mechanisms in human cells. Which of those will actually be realized, we do not know, but we might be able to predict their consequences when they occur. We cannot predict when and where superspreading events will happen, but we may propose strategies to reduce their probability. We need to keep in mind the point expressed earlier about complex systems, that small random events need not average out, but may have large-scale consequences for the system. Such amplifications exploit particular vulnerabilities of a system, and we should try to identify those. Some of our predictions may be self-defeating, because people may be scared by them and react appropriately.

On the other hand, we need to convince the general public and the politicians that scientific advice is the best advice we have. Science can provide insight into how the virus reproduces and spreads and point to short- and long-term consequences of the pandemic. Good complexity science can also integrate the individual findings and balance the effects at the various levels and scales.

We also need scientific insight to cope with radical uncertainty. In particular, when the pandemic first started to unfold in the spring of 2020, we did not know what consequences radical countermeasures or mere inactivity might have. In hindsight, of course, we know a lot more. But the decisions made under that uncertainty cannot and should not be held against the decision-makers. Rather, in order to make good decisions and act upon them, we must understand which psychological factors may help or hinder the exploitation of diffuse knowledge, the use of analogies with other situations of uncertainty, the assignment of appropriate weights to the many factors in play, and the adaptation of heuristics suitable for the challenges we face.

As complex-systems scientists, we try to build upon specific expertise and then integrate that expertise into the large-scale frame. Thus, we have analyzed and developed mathematical models for the spread of epidemics. The key parameter is the contact rate, which depends on the behavior of people and can therefore be influenced. We build our modeling upon a careful analysis of an extensive body of data, although the quality and reliability of some of those data pose problems that we have to overcome. The important point is that this parameter, the contact rate, can include stochastic fluctuations as extracted from the data, heterogeneity across populations, and systematic shifts resulting from the voluntary or enforced implementation of countermeasures. The model can thus readily include insight coming from social network science or social psychology.

In turn, when the model is run with different parameter values, it can be used to assess the effect of political actions. Other parameters are less variable, but may be estimated on the basis of results from virology or from an understanding of the infection process. We also regularly update the data about the pandemic in most countries around the world to enable a comparison of the different dynamics and to correlate them with political measures and social, economic, and other factors. Please consult our website for details.
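
How a model of this kind can be run with different contact rates to compare scenarios is sketched below. This is a generic discrete-time SIR model with a time-dependent contact rate, not the authors' model, and every parameter value (population, rates, the day on which measures begin) is a hypothetical choice for illustration.

```python
def run_sir(contact_rate, population=1_000_000, initial_infected=100,
            recovery_rate=0.1, days=300):
    """Discrete-time SIR model in which `contact_rate` is a function of the day.

    Returns (peak number of simultaneously infected, total ever infected).
    """
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    peak = infected
    for day in range(days):
        # New infections scale with the contact rate that applies on this day.
        new_cases = contact_rate(day) * susceptible * infected / population
        new_cases = min(new_cases, susceptible)
        recoveries = recovery_rate * infected
        susceptible -= new_cases
        infected += new_cases - recoveries
        peak = max(peak, infected)
    return peak, population - susceptible


def no_measures(day):
    """Scenario without intervention: the contact rate never changes."""
    return 0.3


def with_measures(day):
    """Scenario in which the contact rate is reduced from day 30 onward."""
    return 0.3 if day < 30 else 0.12


peak_base, total_base = run_sir(no_measures)
peak_measures, total_measures = run_sir(with_measures)

print(f"no measures:   peak {peak_base:,.0f}, total {total_base:,.0f}")
print(f"with measures: peak {peak_measures:,.0f}, total {total_measures:,.0f}")
```

Because the contact rate is passed in as a function, the same routine could take a rate with random fluctuations or with several successive shifts, which is one simple way to represent the kinds of variation described above.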

Read more thoughts on the COVID-19 pandemic from complex-systems researchers in The Complex Alternative, published by SFI Press.
