PRINTED FROM the OXFORD RESEARCH ENCYCLOPEDIA, CLIMATE SCIENCE. (c) Oxford University Press USA, 2018. All Rights Reserved. Personal use only; commercial use is strictly prohibited (for details see Privacy Policy and Legal Notice).

Date: 20 November 2018

Uncertainty Quantification in Multi-Model Ensembles

Summary and Keywords

Long-term planning for many sectors of society (including infrastructure, human health, agriculture, food security, water supply, insurance, conflict, and migration) requires an assessment of the range of possible futures which the planet might experience. Unlike short-term forecasts, for which validation data exist for comparing forecast to observation, long-term forecasts have almost no validation data. As a result, researchers must rely on supporting evidence to make their projections. A review of methods for quantifying the uncertainty of climate predictions is given. The primary tools for quantifying these uncertainties are climate models, which attempt to model all the relevant processes that are important in climate change. However, neither the construction nor the calibration of climate models is perfect, and therefore the uncertainties due to model errors must also be taken into account in the uncertainty quantification.

Typically, prediction uncertainty is quantified by generating ensembles of solutions from climate models to span possible futures. For instance, initial condition uncertainty is quantified by generating an ensemble of initial states that are consistent with available observations and then integrating the climate model starting from each initial condition. A climate model is itself subject to uncertain choices in modeling certain physical processes. Some of these choices can be sampled using so-called perturbed physics ensembles, whereby uncertain parameters or structural switches are perturbed within a single climate model framework. For a variety of reasons, there is a strong reliance on so-called ensembles of opportunity, which are multi-model ensembles (MMEs) formed by collecting predictions from different climate modeling centers, each using a potentially different framework to represent relevant processes for climate change. The most extensive collection of these MMEs is associated with the Coupled Model Intercomparison Project (CMIP). However, the component models have biases, simplifications, and interdependencies that must be taken into account when making formal risk assessments. Techniques and concepts for integrating model projections in MMEs are reviewed, including differing paradigms of ensembles and how they relate to observations and reality. Aspects of these conceptual issues then inform the more practical matters of how to combine and weight model projections to best represent the uncertainties associated with projected climate change.

Keywords: uncertainty, ensembles, scenarios, weighting, impacts, CMIP, parameterization


The past few decades have been associated with increasing confidence in the basic science of climate change: that additional greenhouse gases released into the earth’s atmosphere by humanity will alter the planet’s radiative balance, and that this will result in an increase in global mean temperatures with consequences for the hydrological cycle, circulation, and cryosphere. Such changes have the capacity to alter the ecological boundary conditions for every species (Urban, 2015), humanity included. However, in order to be able to adapt to such changes, the details matter. Any risk assessment exercise requires an understanding of the uncertainties associated with our projections, but assessing or even identifying those uncertainties is often not a trivial matter.

Uncertainty in Climate Simulations

A modern Earth System Model (ESM) is a comprehensive representation of all of the planet’s major components: its atmosphere, ocean, land surface, and cryosphere. When combined, these components can simulate the response of the ESM to changing boundary conditions, such as the altered balance of incoming radiation due to anthropogenic greenhouse gas emissions. But such a simulation is necessarily an approximation of the real world and is subject to error. In order to use a climate model for risk assessment, the sources of these errors need to be understood.

Uncertainty in Future Human Behavior

An ESM simulation takes a set of external boundary conditions (such as incoming solar radiation) and integrates the state of the model over time. A climate change simulation also includes the effects of human activity, such as the emissions of greenhouse gases or aerosol precursors, deforestation or the irrigation of crops. Looking to the past, these boundary conditions are known to some degree, and the model can be initialized in a pre-industrial state and integrated until the present day. However, in the future, these boundary conditions are fundamentally uncertain. This is addressed in practice using scenarios—a discrete set of possible futures which represent a certain set of assumptions for future human behavior. Any future climate simulation is therefore conditional on the set of assumptions which made up its scenario.

The Limits of Predictability

Even if a model of the climate system were a perfect representation of reality, there are limits to the degree to which the future can be predicted. A prediction, such as a weather forecast, takes an estimate of an initial state of the system and then integrates the model forward in time. However, the equations of motion which define the evolution of the atmosphere are fundamentally chaotic: Small errors in the initial state of the model will grow exponentially over time such that the initial state only constrains the model’s evolution for a limited period of time. This period is often referred to as a model’s memory of the initial state. In weather forecasting, this effect is well known: tomorrow’s forecast is reasonably accurate, next week’s less so. As such, the memory of the atmosphere is of the order of weeks. In the ocean, the timescales are longer and the system has memory of the order of decades (Slingo & Palmer, 2011).

For longer periods of time, the system evolves within its state space, the envelope of possible combined configurations of the natural modes of variability of the model. These modes of variability result from the oscillations within the climate system as energy, heat, and mass are exchanged between components and regions to create preferred spatial modes. For century-scale climate projections initialized in a pre-industrial state, the timescales are an order of magnitude longer than the memory of the ocean or atmosphere, and so the model is effectively free to evolve within its state space and the exact initial state of the system is uninformative, provided that the model is initialized in a stable state of energetic balance.

A single model initialized with slightly different initial conditions can produce a range of self-consistent realizations of the climate. These solutions under a single set of boundary conditions constitute an initial condition ensemble. Many tens of ensemble members are required to sample the distribution of possible future changes consistent with a certain model and scenario.
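The growth of small initial errors, and the resulting spread of an initial condition ensemble, can be illustrated with a toy system. The following is a minimal sketch that uses the three-variable Lorenz (1963) equations as a stand-in for a chaotic model; it is not a climate model, and the function names, step size, perturbation size, and ensemble size are all illustrative assumptions.

```python
import random

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the Lorenz-63 system by one fourth-order Runge-Kutta step."""
    def tendency(s):
        x, y, z = s
        return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = tendency(state)
    k2 = tendency(shifted(state, k1, dt / 2))
    k3 = tendency(shifted(state, k2, dt / 2))
    k4 = tendency(shifted(state, k3, dt))
    return tuple(
        s + dt / 6.0 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def run_ensemble(n_members=10, n_steps=4000, perturbation=1e-6, seed=0):
    """Integrate members that differ only by a tiny perturbation in x.

    Returns one trajectory of the x variable per ensemble member.
    """
    rng = random.Random(seed)
    trajectories = []
    for _ in range(n_members):
        state = (1.0 + perturbation * rng.gauss(0, 1), 1.0, 20.0)
        xs = []
        for _ in range(n_steps):
            state = lorenz_step(state)
            xs.append(state[0])
        trajectories.append(xs)
    return trajectories

def spread(trajectories, step):
    """Range (maximum minus minimum) of x across members at a given step."""
    values = [traj[step] for traj in trajectories]
    return max(values) - min(values)

ensemble = run_ensemble()
early_spread = spread(ensemble, 50)     # members are still nearly identical
late_spread = spread(ensemble, 3999)    # members have lost memory of the start
```

Early in the integration the members are practically indistinguishable; by the end, the spread is comparable to the full range of the variable, and only the statistics of the ensemble remain informative.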

An Approximation of Reality

Cutting-edge climate simulations tax the limitations of even the world’s largest supercomputers. To perform the types of multi-century simulations which are necessary to model the earth’s response to continued greenhouse gas emissions, the surface of the planet is divided into a grid, where the largest elements are of the order of hundreds of kilometers. The equations of motion describing fluid flow in the atmosphere and ocean are discretized and solved at each grid point, and all surface components such as vegetation and sea ice are generally described in a similar fashion.

Such a representation implies that there are processes that occur on a scale smaller than the grid which are not resolved natively by the model structure. Some of these processes, such as convection or ocean eddies, have the capacity to influence larger scales of the model, and so their effects must be represented by a parameterization: a function of variables at the resolved scales of the model that describes behavior which is not explicitly resolved. These functions are only approximations of the real world; theory may guide their formulation, but there may be multiple competing theories for which there is no clear preferential choice. Uncertain parameters must be calibrated either through theoretical constraints or by comparison with observed behavior.

The challenge of calibration, however, is formidable. A modern ESM often comprises millions of lines of code and hundreds of parameterizations. In a fully coupled system, each parameter has the capacity to have downstream effects and so they cannot necessarily be calibrated in isolation. The features of the climate system which model developers seek to represent are often not a function of a single parameter. For example, the dynamics of the El Niño Southern Oscillation depend on atmospheric convection, cloud physics, and dynamics, as well as ocean processes such as vertical mixing of heat. It is only through the joint calibration of all of these components that a solution can be found which approximates the real world.

In practice, the joint calibration of a climate model is a huge effort that takes years to complete, usually resulting in a single configuration of the model which exhibits an acceptable climatology and historical simulation. What defines an acceptable model is fundamentally subjective, and different modeling centers may have differing priorities. Some requirements are more absolute than others: a model must conserve energy and mass, for example (any which do not can be easily disregarded). Other aspects may be subject to institutional priorities, such as the characteristics of extreme precipitation or the variability of the El Niño Southern Oscillation.

As a result, it is practically impossible to sample all of the possible choices and degrees of freedom which go into the construction of a single model. Certain studies, such as Frame et al. (2009), have attempted to explore a subset of these uncertainties in a single model by perturbing the values of model parameters. However, sampling structural choices (such as the type of parameterization) in a formal manner requires switches to be built into the model at the design stage—and this rarely happens. In addition, making parameter perturbations from the published configuration of a model will often result in a degraded present-day climate simulation. As such, projections made using a perturbed version of the model may not be equally representative of the real world.

Given these challenges, most attempts to quantify model uncertainty use the collection of models which have been produced by different modeling centers. Assuming that different groups have made different decisions about how to represent elements of the climate system, and they have then calibrated their models appropriately to best represent the real world given those assumptions and given the available observations, we have a sample of possible model representations of the real world and how they might respond to future climate forcing. The inherent diversity of the projections arising from those models, along with knowledge of their biases, common assumptions, and limitations, provides the best available basis for statements of confidence in future projections of the earth’s response to anthropogenic emissions.

The Ensemble of Opportunity

The most comprehensive archive of climate model simulations available to date is that of the Coupled Model Intercomparison Project, or CMIP (Eyring et al., 2016; Taylor, Stouffer, & Meehl, 2012). The most recent complete iteration, CMIP5 (Taylor et al., 2012), defined experimental protocols for a large number of different simulation types (Figure 1), some of which represented experiments to cleanly test model responses to idealized boundary conditions, while others represented more complete projections of future climate.

Figure 1. A schematic of the range of experiment types included in the CMIP5 experimental design. Reproduced from Taylor et al. (2012). D&A refers to “Detection and Attribution,” AMIP is the “Atmospheric Model Intercomparison Project,” RCP is “Representative Concentration Pathway,” E-driven is “Emissions Driven,” and SST is “Sea Surface Temperature.”

The component models of the ensemble are submitted in a coordinated fashion by the world’s major climate modeling centers. Because participation is voluntary, the resulting ensemble is not formally designed to sample uncertainty; rather, it is an ensemble of opportunity (Gleckler, Taylor, & Doutriaux, 2008), and any inference made from the resulting distribution must be made in the context of the ad hoc nature of the sampling.

Many analyses have considered this ensemble to be a model democracy. In this interpretation, given a certain scenario or set of boundary conditions, each model from the CMIP archive is treated as an equally likely realization of the future climate. Future outputs from the ensemble of simulations can then be treated as samples of an unknown truth, and one can make statements of confidence from the distribution of projected changes.

Figure 2. Mean projection and confidence of temperature and precipitation change per unit Kelvin warming. Reproduced from Figure 12.10 in IPCC-AR5 WG1. Collins et al. (2013).

Note. Stippling indicates regions where the mean change from present-day conditions exceeds the 95th percentile in the distribution of models.

We see evidence of this thinking in past reports of the Intergovernmental Panel on Climate Change (IPCC). Figure 2 shows a representation of the pattern of future changes in precipitation and temperature from the Fifth Assessment Report (AR5) (Pachauri et al., 2014). The most likely future is taken to be the mean of the model projections, while the confidence in that mean is assessed by considering regions where the projected change exceeds the ensemble spread in projections. This approach assumes that each model in the ensemble is an equally likely representation of the real world, that each model is independent of the other models in the ensemble, and that the spread in projections is indicative of the true underlying uncertainty. Although the limitations of model democracy have long been recognized in the IPCC assessment reports (Knutti et al., 2010), such assumptions have persisted, primarily because there is no consensus on the best alternative model.

In order to make progress in making assessments based on the output of a multi-model ensemble (MME), it is necessary to consider the conceptual model of what the ensemble is. This requires addressing the key assumptions of model democracy, how appropriate those assumptions are, and whether progress can be made to improve assessment.

The Challenges of Multi-Model Analysis

There are a number of properties of the Coupled Model Intercomparison Project (CMIP) ensembles which present challenges for assessment and multi-model analysis. In Section 3.1, we consider the properties of the ensembles available today which act to complicate their interpretation, while Sections 3.2, 3.3, 3.4, and 3.5 discuss methodologies for advancing despite the known inadequacies of the available model ensembles.

Ensembles are Not Distributed Around “Truth”

Treating models as equally plausible and independent samples of future change is effectively treating model outputs as independent measurements of an unknown system, that is, assuming there are no common biases in the measurements. Making this assumption would imply that the more samples or measurements are taken, the more accurately the result is known. This assumption is commonly referred to as the “truth plus error” paradigm: model ensemble members are treated as noisy samples from an underlying truth. Early model weighting approaches made the implicit assumption that model agreement implied increased confidence (Giorgi & Mearns, 2003), and later formal Bayesian approaches inherited this assumption (Tebaldi, Smith, Nychka, & Mearns, 2005), effectively treating climate models as independent observations of the true future system state. Furthermore, it was assumed that models with a smaller present-day simulation bias were inherently more trustworthy than models with a larger bias. The initial applications of such approaches led to highly confident projections of future climate change (see Figure 3), but the inherent assumptions involved have since been challenged by a number of papers which considered the complexities of interpreting the ensemble of opportunity (Abramowitz & Bishop, 2015; Rougier, Goldstein, & House, 2013; Sanderson & Knutti, 2012; Yokohata et al., 2011).

Figure 3. Posterior distributions for warming. Tebaldi et al. (2005).

Note. Posterior distributions are shown for certain regions when models are used as inputs to a Bayesian model in which present-day model performance and model agreement increase model weight. Points on the horizontal axis show the original outputs from the ensemble; the solid curve shows the Bayesian posterior distribution using the models’ present-day temperature to form a likelihood, and the dashed curve takes into account common biases in present-day and future simulations.
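The consequence of the “truth plus error” assumption can be made explicit with a short worked equation. This is a sketch under simplifying assumptions that are not taken from the cited studies: each of the N model outputs x_i is written as the truth plus an error with a common variance, and the symbols \mu, \varepsilon_i, \sigma, and \rho are introduced here purely for illustration.

```latex
% Truth plus error: each model output is the truth plus a zero-mean error.
x_i = \mu + \varepsilon_i, \qquad
\operatorname{E}[\varepsilon_i] = 0, \qquad
\operatorname{Var}(\varepsilon_i) = \sigma^2 .

% Independent errors: the variance of the ensemble mean vanishes as N grows.
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right) = \frac{\sigma^2}{N}

% Errors with a common pairwise correlation \rho (shared biases):
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)
  = \sigma^2\left[\rho + \frac{1-\rho}{N}\right]
  \;\xrightarrow{\,N \to \infty\,}\; \rho\,\sigma^2
```

With independent errors, adding models always narrows the uncertainty of the mean. With any shared error component, the uncertainty has a floor that no number of additional models can remove, which is why neglecting common biases leads to overconfident projections.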

An alternative interpretation is that the ensemble members are interchangeable with reality (Rougier et al., 2013). In this worldview, the members of the multi-model ensemble (MME) represent discrete samples from a set of all plausible models. In the best case, reality can be treated as a possible member of this set. A number of papers have attempted to implement aspects of the interchangeable assumption into a quantitative model based on this ensemble interpretation, by considering perfect model arguments for how accurately a model can be predicted if it were treated as truth (Tebaldi & Sansó, 2009), or by considering how to transform ensemble results into a distribution which represents a set of equally likely independent projections (Bishop & Abramowitz, 2013). But the interpretation that the real world is interchangeable with models in the CMIP archive requires additional assumptions which can themselves be debated.

Model Projections are Not Equally Good Estimates of Reality

The first question to address in the interchangeable paradigm is whether all models are equally similar to reality, or whether we can say anything about the relative likelihood of different simulations. If all models are equally plausible, then there is nothing to be gained by weighting the results of individual members, and the distribution of independent projections represents the limit of possible knowledge of future change.

A test of indistinguishability was proposed by Annan and Hargreaves (2010), who used a rank histogram analysis to assess whether models conformed to the indistinguishable or truth-centered paradigm. This approach considers a range of outputs from the model (for example, the mean climatological temperature in a set of different, ideally independent, regions), and whether observed values tend to lie within, or outside of, the ensemble distribution (Figure 4). The analysis found that the historical behavior of the climate models was broadly in line with the ensemble being “reliable,” that is, that the observations were equally likely to lie anywhere within the ensemble distributions. Although this would imply that the interchangeable paradigm would be appropriate, there are limits to this approach for assessing ensemble properties. The rank histogram effectively assumes uncorrelated models and grid points, which complicates its usefulness when assessing climatological reliability (Marzban, Wang, Kong, & Leyton, 2011).

Figure 4. A rank histogram analysis of observations and CMIP3 model output for three different variables. Annan and Hargreaves (2010).

Note. The histogram shows where observations would lie in the ensemble distribution, assessed for each grid point in the domain. Histograms that are clustered toward the center of the ensemble imply that the ensemble is over-dispersive, that is, the real world will more often than not be closer to the middle of the ensemble than toward the edge.
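The mechanics of a rank histogram can be sketched in a few lines. The example below is a minimal illustration on synthetic data, not a reproduction of the Annan and Hargreaves (2010) analysis; the function name, the Gaussian test data, and the ensemble size are assumptions, and each case is treated as independent, which is the limitation noted in the text.

```python
import random

def rank_histogram(observations, ensemble):
    """Count how often the observation takes each rank within the ensemble.

    observations: one observed value per case (for example, per region).
    ensemble: list of lists; ensemble[i] holds the model values for case i.
    Returns a list of length n_models + 1, where entry r counts the cases
    in which exactly r ensemble members fall below the observation.
    """
    n_models = len(ensemble[0])
    counts = [0] * (n_models + 1)
    for obs, members in zip(observations, ensemble):
        rank = sum(1 for m in members if m < obs)
        counts[rank] += 1
    return counts

# Synthetic demonstration with 5 "models" and 6000 independent cases.
rng = random.Random(1)
n_cases, n_models = 6000, 5
members = [[rng.gauss(0, 1) for _ in range(n_models)] for _ in range(n_cases)]

# Observations drawn from the same distribution as the members:
# the ensemble is "reliable" and the histogram is roughly flat.
obs_same = [rng.gauss(0, 1) for _ in range(n_cases)]
flat_counts = rank_histogram(obs_same, members)

# Observations drawn from a narrower distribution than the members:
# the ensemble is over-dispersive and the histogram peaks in the middle.
obs_narrow = [rng.gauss(0, 0.3) for _ in range(n_cases)]
domed_counts = rank_histogram(obs_narrow, members)
```

A flat histogram is what the indistinguishable paradigm predicts, while a histogram peaked at the central ranks corresponds to the over-dispersive case described in the note to Figure 4.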

Rank histograms are one approach for assessing an ensemble’s performance in simulating observed climate. The method requires the selection of a variable to assess, an observation with which to compare it, and a time period to consider. Studies that have attempted to assess multiple models based on their historical simulation performance, however, have found large differences in model rankings depending on which variables or metrics are used to assess the models (Gleckler et al., 2008). These studies illustrate that any general estimate of model performance is dependent upon the priorities of the researcher (see Figure 5).

Figure 5. Model skill scores for different models in the CMIP3 archive of climate simulations. Gleckler et al. (2008).

Note. Skill scores for different metrics are illustrated in different rows of the grid. Warmer colors indicate relatively better performing models, while cool colors indicate poorly performing models. The leftmost columns indicate the multi-model median and mean values.

However, efforts to combine multiple metrics (Sanderson & Knutti, 2012) found that within the CMIP5 ensemble a subset of models was reliably better performing than others, although no models consistently outperformed the rest. Furthermore, models exhibited more similarity with the observations than they did with any particular model treated as truth, which undermines the notion that all models can be treated as equally plausible. But, as noted in Sanderson and Knutti (2012), the statistical properties of models in the present are not necessarily preserved in future projections. Models’ present-day simulations are to some degree calibrated to best represent observed reality, but the feedback processes which govern future change are not directly calibrated.

These studies together imply that neither the “truth plus error,” nor the “indistinguishable” model is strictly supported by the data from the models. Instead, models are weakly distinguishable (i.e., some weighting may be appropriate), but cannot be considered as independent measurements of an underlying truth.

Not All Models are Equally Complex

When making a climate model, each modeling center faces a large number of choices on how to represent a set of processes which are deemed important for climate. These choices are the first obstacle in interpreting the ensemble as a whole. Figure 6 shows that the number of processes represented within the world’s leading climate models has increased substantially since the late 1960s. The models of the 1960s included only a coarse atmosphere with prescribed cloud and surface ice cover and simple thermodynamic land and ocean representations. By contrast, the models of the early 21st century include interactive clouds and aerosols, dynamic ice sheets and vegetation, and even representation of human systems.

Figure 6. Schematic of climate model complexity over time. Reproduced from Jakob (2014).

However, the march of complexity is not uniform. The decision of an individual modeling group to include a particular process in their model may not match another group’s decision. For example, some modeling centers in CMIP5 chose to include prognostic aerosols (i.e., atmospheric aerosol burdens were calculated as a function of emissions), whereas others continued to use prescribed aerosols (where aerosol concentrations are predefined). Hence, when comparing projections from the two types of models, one is not necessarily comparing like with like.

This presents a problem if we wish to weight the models by their historical performance. Consider two models, one with an interactive aerosol scheme and another with prescribed aerosols. The model with prescribed aerosols would trivially perform better in any model evaluation which considered an error metric of atmospheric aerosol concentrations, although the interactive model would clearly be the more sophisticated, and likely more useful in a future climate simulation where the aerosol loadings are unknown. Similar arguments can be made for any process that is represented in only a fraction of models in the ensemble; thus particular care should be taken to ensure that all models in a given analysis are of comparable complexity.

Models are Not Independent

A further assumption of simple model democracy is that models are independent (this usually refers to the idea that models do not share common code or assumptions, rather than a strict statistical interpretation). If models do share common code, then the number of effective models in the ensemble will be less than the apparent number of models in the ensemble. Moreover, if a model is highly replicated within the ensemble, then any ensemble average will be unfairly biased toward the projection of the replicated model.

And indeed, there are a number of reasons to suspect that the distribution of models in present-day MMEs does not comprise a set of fully independent models. A qualitative assessment of model origins (Edwards, 2010) suggests that many models have a shared heritage. Many models have branched at some point in their history, and although they might have since partially diverged, they still contain some common code (or ideas/parameterizations) with other models in the ensemble (see Figure 7).

Figure 7. Schematic of the development of atmospheric models over time. Diagram reproduced from Edwards (2010).

Note. Red lines show cases where the full or partial code of a climate model was imported into another model.

Evidence for model interdependency can also be seen directly from the model output. Masson and Knutti (2011) showed that a simple assessment of inter-model mean error fields suggested that models which are known to have shared a common heritage also exhibit similar patterns of mean state bias in their temperature or precipitation fields. This property is illustrated in Figure 8, which uses a clustering algorithm to demonstrate that models with small pairwise distances in their model output tend to have a clearly common heritage: identical models with different resolutions, or different versions of the same climate model. In cases where there is no clear relationship, the authors found that the models exhibiting small pairwise distances did share some common ancestry.

Figure 8. Finding near neighbors using a hierarchical clustering algorithm.

Reproduced from Masson and Knutti (2011).

Note. Algorithm based on model output within the CMIP3 ensemble of climate models. Models which branch toward the left-hand side of the plot exhibit smaller pairwise errors, and are clustered first by the algorithm. Those groups of models which branch toward the right of the plot are the most dissimilar.

Models are Subject to Calibration Priorities

In many cases, the results of Masson and Knutti (2011) are intuitive: One would expect two versions of the same model run at different resolutions to have similar climatological biases, because the two models share a large number of common assumptions and code. But in other cases, the similarity of the climatology cannot be trivially explained. For example, the Hadley Centre models HadCM3 and HadGEM1 in Figure 8 differ quite significantly in terms of their parameterizations, and yet they still appear as near neighbors. Subsequent studies with CMIP5 found similar behavior (Knutti, Masson, & Gettelman, 2013; Sanderson, Knutti, & Caldwell, 2015a): models from the same center generally exhibited smaller pairwise errors than those from different institutions, irrespective of the degree to which the model components had been altered.

Sanderson and Knutti (2012) proposed that, in these cases, the calibration of the models also plays a part in the similarity of the present-day climatological state. Each modeling center has specific protocols which are used to calibrate model parameters, concentrating on particular regions or processes which are of particular importance to the model developers. As a result, the models could exhibit similar error structures partly as a function of model tuning, as well as due to common structural assumptions. It was noted that this tuning effect, unlike error similarity which arises from common physical process assumptions, would tend to have a greater effect on present-day climatology than on future change.

Methods to Address Imperfections in an Imperfect Ensemble

Taken together, many of the fundamental assumptions which would be needed to defend the use of model democracy in multi-model analysis have been challenged by the literature in the last few years. Models differ in performance, complexity, and independence, and each model’s mean climate is calibrated using different priorities which are not always made clear to the community.

If these aspects of the ensemble are ignored, and analyses continue to treat the multi-model archives as equally plausible and independent estimates of the future, there is a potential for analyses to be overconfident (or underconfident), or biased toward particular models. In light of this imperfect ensemble, a number of approaches have been proposed to address some of these aspects.

Model Weighting for Performance

Some papers have made the assumption that a simulation’s consistency with historical records (or “skill”) implies greater accuracy in future projections, thus justifying weighting of models in the ensemble. For example, the Giorgi and Mearns (2003) approach introduced the concept that models should be up-weighted if their historical simulations were close to observations, and for proximity to the multi-model mean, implying that the mean projection will converge on the true future as the ensemble grows. These ideas were formalized in a Bayesian framework by Tebaldi (2004), but this framework was questioned by some who claimed that models should not be interpreted as independent estimates of the true state; rather, the real world should be seen as (at best) interchangeable with, and indistinguishable from, the models in the archive (Annan & Hargreaves, 2010). Tebaldi and Sansó (2009) attempted to resolve this issue by associating the posterior distribution of model future change with a precision parameter, akin to the precision which the Bayesian formalism requires of each of the input models.

Recent Advances in Model Weighting

There is increasing acknowledgment that there are limits to the applicability of performance-based weighting to constrain future response (at least where there is no clear relationship between the weighting metric and the projected quantity). Räisänen, Ruokolainen, and Ylhäisi (2009) suggested that specific relationships should be sought between projected quantities and observable metrics (a closely related concept to the emergent constraint framework), while Weigel, Knutti, Liniger, and Appenzeller (2010) also noted that indiscriminate weighting could worsen performance in out-of-sample tests.

Addressing Model Interdependency in Weighting Schemes

Further to this, the literature has started to recognize the potential bias which might arise from model interdependency and replication within the CMIP archives, and a number of studies have suggested strategies for addressing this. Multiple authors noted that model error correlation could be used to make inference about model interdependency (Masson & Knutti, 2011; Pennell & Reichler, 2011), finding that the effective number of climate models was likely significantly less than the total number of models submitted to the archive.
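As a minimal sketch of how an error correlation matrix can be converted into an "effective number" of models, the example below uses the eigenvalue participation ratio, (sum of eigenvalues)^2 / (sum of squared eigenvalues). This estimator is an assumption chosen for illustration and is not necessarily the one used in the studies cited above; the function name and the toy matrices are hypothetical.

```python
def effective_number(corr):
    """Participation-ratio estimate of the effective number of models.

    corr is a symmetric inter-model error correlation matrix (list of lists).
    For a symmetric matrix, the sum of the eigenvalues equals the trace and
    the sum of the squared eigenvalues equals the sum of squared entries,
    so no eigen-decomposition is required.
    """
    n = len(corr)
    trace = sum(corr[i][i] for i in range(n))
    sum_sq = sum(corr[i][j] ** 2 for i in range(n) for j in range(n))
    return trace ** 2 / sum_sq


# Hypothetical four-model archive with fully independent errors.
independent = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Same archive, but models 0 and 1 are near-duplicates (error correlation 0.9).
replicated = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

n_independent = effective_number(independent)
n_replicated = effective_number(replicated)
```

In this toy case the independent archive retains an effective size of four, whereas the archive containing a near-duplicate pair has an effective size below three.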

In Sanderson et al. (2015a), a pairwise inter-model distance matrix was derived from the consideration of a number of output variables (as illustrated in Figure 9). These data were proposed to provide information on both model skill and replication and a weighting scheme based on such output was used for the United States’ 2017 Climate Science Special Report, which used inter-model distance information to down-weight heavily replicated models.

Figure 9. Distances between models in the CMIP5 archive.

Reproduced from Sanderson et al. (2015a).

Note. Evaluated by a root-mean square distance metric derived from multiple model output variables.

The distances were used in a framework that rewarded models for proximity to the observational point, while down-weighting models which exhibited very small distances to other models in the archive. Two tuning parameters defined the strength of each of these weighting effects, enabling the researcher to determine the degree to which model similarity and poor model performance would be down-weighted.
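A weighting of this kind can be sketched as follows. The Gaussian kernel form, the function name, and the radius names `sigma_d` (skill) and `sigma_s` (similarity) are assumptions made for illustration; the exact functional form used in Sanderson et al. (2015a) may differ, and the distances are invented.

```python
import math


def skill_independence_weights(dist_to_obs, dist_between, sigma_d, sigma_s):
    """Normalized weights that reward skill and penalize replication.

    dist_to_obs[i]     : distance of model i from observations
    dist_between[i][j] : distance between models i and j
    sigma_d, sigma_s   : tuning radii for the skill and similarity terms
    """
    n = len(dist_to_obs)
    raw = []
    for i in range(n):
        # Skill term: close to 1 for models near the observations.
        skill = math.exp(-(dist_to_obs[i] / sigma_d) ** 2)
        # Similarity term: counts how many close relatives model i has.
        similarity = sum(
            math.exp(-(dist_between[i][j] / sigma_s) ** 2)
            for j in range(n)
            if j != i
        )
        raw.append(skill / (1.0 + similarity))
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical archive: models 0 and 1 are near-duplicates,
# model 3 lies far from the observations.
dist_to_obs = [0.5, 0.5, 0.5, 1.5]
dist_between = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
]
weights = skill_independence_weights(dist_to_obs, dist_between,
                                     sigma_d=1.0, sigma_s=0.5)
```

With these toy values, the unique skilful model (index 2) receives the largest weight, the duplicated pair shares a reduced weight, and the poorly performing model receives the smallest weight. Increasing either radius weakens the corresponding effect.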

Sanderson et al.’s (2015a) study noted that “leave-one-out” out-of-sample testing for weighting schemes and emergent constraints may be an unfair test if close relatives of the model that is considered as truth remain in the test ensemble. They proposed a framework for a fair out-of-sample test, which allows the exclusion of close relatives from the test. The study found a small increase in out-of-sample projection skill for moderate weighting, but that strong skill weighting would negatively impact performance.

Bayesian updating schemes have also been considered in the light of model similarity and uniqueness. Sunyer, Madsen, Rosbjerg, and Arnbjerg-Nielsen (2014) extended the Bayesian model-combination framework of Tebaldi and Sansó (2009) to allow for model interdependency, using the error correlation matrix to define a multivariate distribution from which the original models are drawn.

Ensemble Subsetting

A number of authors have suggested methodologies that go beyond simply weighting model output, where the entire ensemble is resampled in some fashion to provide a product which in theory can address some of the imperfections in the original ensemble. One of the simplest approaches is to consider a subset of models that can be demonstrated to exhibit fewer biases than the original ensemble; a couple of papers have focused on different aspects of multi-model bias.

Herger et al. (2017) developed an algorithm to select a subset of models from a multi-model archive such that the error in the multi-model mean was minimized, which, it was argued, may decrease the interdependency of the remaining subset because the algorithm tends to select models with relatively independent error patterns. Sanderson et al. (2015b) proposed a technique that explicitly set out to minimize interdependency and maximize skill in a subset of climate models, by using inter-model distance information to determine models which are likely to be co-dependent with other models in the archive.
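The idea of choosing a subset whose mean has minimal error can be illustrated with an exhaustive search over a toy archive. This is only a sketch under stated assumptions: the published algorithms use their own optimization strategies, and the model names, fields, and the function `best_subset` are hypothetical.

```python
from itertools import combinations
import math


def rmse(a, b):
    """Root-mean-square difference between two equally sized fields."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def best_subset(models, obs, size):
    """Exhaustively find the subset of a given size whose mean is closest to obs."""
    best_err, best_names = float("inf"), None
    for names in combinations(sorted(models), size):
        mean = [sum(models[n][k] for n in names) / size for k in range(len(obs))]
        err = rmse(mean, obs)
        if err < best_err:
            best_err, best_names = err, names
    return best_names, best_err


# Hypothetical four-point "fields" expressed as errors relative to observations.
obs = [0.0, 0.0, 0.0, 0.0]
models = {
    "A": [1.0, 1.0, 0.0, 0.0],
    "A2": [1.1, 0.9, 0.0, 0.0],    # near-copy of A
    "B": [-1.0, -1.0, 0.0, 0.0],   # error pattern opposite to A
    "C": [0.0, 0.0, 1.0, 1.0],
}
subset, err = best_subset(models, obs, size=2)
```

In this example the search selects A and B, whose errors cancel, rather than the near-duplicates A and A2, which illustrates why minimizing the error of the mean tends to favor models with dissimilar error patterns. Exhaustive search is only feasible for small archives.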

However, all of the approaches for model weighting or subsetting considered here rely on a matrix of inter-model error correlation or integrated absolute differences between fields. This has the advantage of being relatively simple to compute, but comes at a cost—it provides only a single estimate of model similarity. In reality, models may share certain components only—and a bulk assessment of error correlation is a crude assessment of exactly which aspects of existing models are shared. Furthermore, there is an implicit assumption that the common error in historical simulations will be related to common features or biases in projections, which is not necessarily the case (Sanderson & Knutti, 2012).

Emergent Constraints and Process-Based Model Filtering

A more targeted approach for making projections based on a broad set of models of varying quality is broadly referred to as “emergent constraints.” In this approach, the researcher proposes a potentially observable quantity that can be used to constrain an unknown future change. In an ideal case, the researcher would propose a mechanism whereby both the observable quantity and the future change can be demonstrated to be a function of the same feedback. In a limited number of cases, such clear cause and effect can be readily demonstrated. For example, the emergent constraint proposed by Hall and Qu (2006) demonstrated that the seasonal amplitude of present-day Arctic ice extent is well correlated with future ice loss in a warming scenario.

The relationship proposed by Hall and Qu was intuitive; it is unsurprising that the same processes which might govern the response of ice extent to the annual solar cycle might also be relevant for the response of ice extent to long-term warming. Such information was used by Knutti et al. (2017) to constrain future projections of sea ice change.
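In its simplest statistical form, an emergent constraint is a regression across models of the future change on the observable quantity, evaluated at the observed value. The sketch below assumes a linear relationship and uses invented numbers; it omits the uncertainty in both the regression and the observation, which a real analysis would need to propagate.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical ensemble: one (observable, future change) pair per model.
observable = [1.0, 2.0, 3.0, 4.0, 5.0]
future_change = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(observable, future_change)

# Hypothetical real-world measurement of the observable quantity.
observed_value = 2.5
constrained_change = slope * observed_value + intercept
```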

However, many future changes represent a combination of processes which cannot be easily attributed to a single feedback process. For example, climate sensitivity (the equilibrium response of global mean temperature to a doubling of atmospheric carbon dioxide concentrations) is a function of combined atmospheric and land-based feedbacks (Andrews, Gregory, Webb, & Taylor, 2012) and therefore isolating a single observable constraint to explain these combined feedbacks is more difficult. There have been, however, numerous attempts to do this (Fasullo & Trenberth, 2012; Sherwood, Bony, & Dufresne, 2014)—where a single predictor is found to correlate with climate sensitivity and a physical process is proposed to explain the correlation.

The primary concern with this type of approach is that the effective sample size of the CMIP ensembles available today is sufficiently small that it is likely that some variables will correlate with climate sensitivity purely by chance (Caldwell et al., 2014). Therefore, the presence of a correlation alone cannot be considered to be sufficient evidence that a future process has been constrained. In light of this, attention in recent years has focused on finding constraints for specific feedback processes rather than bulk global responses such as climate sensitivity (Klein & Hall, 2015).
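The risk of chance correlations can be demonstrated with a small Monte Carlo experiment. In the sketch below, both the target quantity and every candidate predictor are independent random noise, so any correlation found is spurious by construction; the ensemble sizes and the number of candidate predictors are arbitrary illustrative choices.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally sized samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_chance_correlation(n_models, n_candidates, seed=0):
    """Largest |r| between a random target and many unrelated random predictors."""
    rng = random.Random(seed)
    target = [rng.gauss(0.0, 1.0) for _ in range(n_models)]
    best = 0.0
    for _ in range(n_candidates):
        predictor = [rng.gauss(0.0, 1.0) for _ in range(n_models)]
        best = max(best, abs(pearson(predictor, target)))
    return best


# A CMIP-sized ensemble versus a much larger hypothetical ensemble.
small_ensemble = max_chance_correlation(n_models=20, n_candidates=1000)
large_ensemble = max_chance_correlation(n_models=500, n_candidates=1000)
```

With only 20 models, screening 1,000 unrelated predictors reliably turns up at least one with a seemingly strong correlation, whereas the strongest chance correlation in the 500-member case is far weaker.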

Ensemble Resampling Approaches

A number of studies use the Coupled Model Intercomparison Project (CMIP) ensembles to produce simulated distributions of a hypothetical set of models which might represent a less biased assessment of possible future change. Bishop and Abramowitz (2013) suggested a framework to transform the output of model ensembles onto a set of “metamodels,” whose geometric mean better reflected the earth’s true historical climate. This technique was termed the “replicate earth paradigm,” in which the possible realizations of natural variability of the true earth system were the target, reasoning that this would represent the limit of predictability for any simulation or ensemble of simulations.

An estimate of this ideal distribution is constructed from the original archive, first by using a multilinear regression to obtain the linear combination of models which minimizes the root mean squared error in a gridded surface temperature field compared with observations. An ensemble is constructed centered on this mean regression estimate by linearly transforming the original ensemble such that the transformed models are centered on the regression mean, and the width of the distribution is informed by expected natural variability. Because the CMIP inter-model spread for future simulations greatly exceeds the natural variability exhibited within any single model, the resulting transformed ensemble has less variance with regard to the mean projection than the original CMIP result.

The transformed ensemble has some desirable properties; the resulting metamodels are independent by construction, and if historical temperature increase is an effective constraint on future change, then this relationship would be naturally incorporated into the future projection. However, as in the case of the emergent constraint literature, there is a risk that the small sample size of the original ensemble could make the regression-based projection overconfident. Moreover, because the transformed metamodels are linear combinations of models, they can no longer be interpreted as physical models in themselves.

Sanderson et al. (2015a) proposed an alternative technique for interpolating projected quantities between existing models in the archive in a space defined by mean state similarity metrics, allowing a resampling of metamodels which exhibit better mean state skill, a process which is also robust to model replication in the original archive. This approach allows distributions for future change to be constructed in a quasi-truth-centered ensemble, but the construction of the interpolated space makes the implicit assumption that models which exhibit a small present-day bias are more trustworthy in the future, which is unlikely to be strictly true in the case where models have been calibrated to best reproduce historical climate (Sanderson & Knutti, 2012).

Transient Constraints and Internal Variability

Many of the studies considered in Section 5 have focused on mean climatology in order to assess model skill or model similarity. However, a number of studies have proposed that long-term trends, or aspects of climate variability, could be used to constrain the results of models.

Variability-Based Constraints

Perhaps the most intuitive constraint is to use the seasonal cycle as a proxy for future change. In some regions, the seasonal cycle is driven by a change in radiative forcing which can be compared to the change induced by greenhouse gases. This approach has been most successful at high latitudes (Hall & Qu, 2006), where the seasonal cycle of sea ice extent was shown to be well correlated to future sea ice change. Attempts to use the seasonal cycle in the multi-model archive to constrain global scale responses were less successful (Knutti, Meehl, Allen, & Stainforth, 2006).

Attempts have also been made to use inter-annual variability to constrain specific processes. For example, it was found that the response of CO2 concentrations to inter-annual variability in temperature in coupled carbon cycle models was a good predictor of long-term carbon loss from tropical rainforests (Cox et al., 2013). But the small sample size and lack of validation data in the Coupled Model Intercomparison Project (CMIP) ensembles make the assessment of confidence in such constraints difficult; for example, both Brown and Caldeira (2017) and Cox, Huntingford, and Williamson (2018) used metrics of climate variability to constrain future levels of warming, with divergent conclusions.

Long-Term Transient Constraints

Perhaps the most intuitive way to think about constraining future long-term climate change in response to greenhouse gas forcing is to consider past response over previous decades and centuries. However, this has proven difficult for a number of reasons. First, the temperature change of the 20th century was driven by a combination of forcing factors—the increasing atmospheric concentration of greenhouse gases acted to warm the system, but the increasing aerosol concentration also impacted global temperatures. It has been observed (Kiehl, 2007) that models tend to exhibit a relationship between the model’s total climate forcing in the 20th century and their climate sensitivity (see Figure 10). Broadly, this implies that model developers tend to make efforts to ensure that historical warming is in line with observations. This could be an explicit effort (by tuning aerosol parameters such that the total warming is compatible with observations), or an implicit effort (by rejecting incremental model versions which do not accurately reproduce the 20th century). Either way, this means that the 20th-century temperature increase is a poor predictor of future behavior in available models.
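The compensation between sensitivity and forcing can be illustrated with a deliberately crude scaling in which warming is proportional to sensitivity multiplied by forcing. This sketch assumes an equilibrium response and ignores ocean heat uptake; the two models and all numerical values are hypothetical.

```python
F_2X = 3.7  # W m^-2, approximate radiative forcing from doubled CO2


def warming(sensitivity, forcing):
    """Equilibrium warming (K) under a forcing, scaled by climate sensitivity."""
    return sensitivity * forcing / F_2X


# Hypothetical models: (climate sensitivity in K, net 20th-century forcing in W m^-2).
models = {
    "low_sensitivity": (2.0, 1.85),     # weak aerosol cooling, larger net forcing
    "high_sensitivity": (4.0, 0.925),   # strong aerosol cooling, smaller net forcing
}

# Both models reproduce the same historical warming...
historical = {name: warming(s, f) for name, (s, f) in models.items()}

# ...but diverge once both are subjected to the same greenhouse-gas-dominated forcing.
future = {name: warming(s, F_2X) for name, (s, _) in models.items()}
```

Both hypothetical models simulate 1 K of historical warming, yet their responses to a doubling of CO2 differ by a factor of two, so agreement with the historical record alone does not discriminate between them.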

Figure 10. Relationship between GCM climate sensitivity in the CMIP3 archive and the GCM’s total anthropogenic forcing experienced at the end of the 20th century.

Reproduced from Kiehl (2007).

A number of studies have attempted to constrain global climate parameters using multi-decadal trends (Gillett, Arora, Matthews, & Allen, 2013; Otto et al., 2013). However, another complicating factor is that internal variability and uncertainty in observed temperature records can make the observed forced response rather uncertain. Temperature reconstructions of global warming from pre-industrial periods to the present day can differ by several tenths of a degree C (Hansen, Ruedy, Sato, & Lo, 2010), and modes of internal variability can add significant error to the estimation of observed global decadal temperature trends (Deser, Phillips, Bourdette, & Teng, 2012; Meehl, Hu, Arblaster, Fasullo, & Trenberth, 2013).

As such, disentangling to what degree recent changes in temperature are attributable to externally forced change or to internally generated variability is a limiting factor on how much can be inferred from historical trends. One potential way to address this is to employ “optimal fingerprinting” (Stott et al., 2006), which isolates spatial patterns associated with particular forcing types, thereby allowing an estimate of warming attributable to anthropogenic activity without the influence of variability. Such approaches provide a framework for understanding the forced response, but are themselves subject to uncertainty, as the spatial modes of response to different forcings cannot be unambiguously separated.

Perturbed Physics Ensembles

While the majority of analyses use output from the multi-model Coupled Model Intercomparison Project (CMIP) archives, a small number of studies also exploit output from perturbed parameter ensembles (PPEs), where uncertain parameters or structural switches are perturbed within a single climate model framework. Such ensembles have the capacity to produce a large number of simulations (Rowlands et al., 2012), and because the experiments are designed by a single group (rather than CMIP where the models are an ad hoc collection of simulations from willing participants), parameters can be sampled formally to explore the possible climate projections which can be simulated within the model structure.

Whereas multi-model ensembles (MMEs) contain members that have each been vetted, to some degree, to ensure reasonable simulation of historical climate, this is not true of PPEs. Because the process of producing a PPE involves explicitly detuning a model from its default configuration, the climatology produced by perturbed models can vary enormously. Models are no longer necessarily in energetic balance, often requiring surface flux corrections in order to produce stable future simulations.

Analyses conducted using PPEs face a number of further obstacles. Early studies on the topic used very large PPEs to find statistical emergent constraints on climate sensitivity using linear (Piani, Frame, Stainforth, & Allen, 2005) or non-linear (Knutti et al., 2006) transfer functions, relying on observations to make predictions of future response. However, a PPE can only sample behavior in a single model, and thus any prediction is subject to potential structural errors in that model.

These assumptions can be tested to some degree by using emergent constraints derived within a PPE to attempt to predict the results of structurally different models within the CMIP archive. When conducting this kind of test, a number of studies found that emergent constraints arising in one PPE cannot be applied to another (Yokohata et al., 2010), nor are they necessarily valid in a structurally different ensemble (Sanderson, 2013). Other studies considered the rank behavior of observations treated as potential members of the ensemble to assess the relative dispersion of PPEs and MMEs such as CMIP (Yokohata et al., 2011). The study found that observations could be treated as plausible members of the CMIP archives, but the same was not true for PPEs where observed values often tended to fall outside the ensemble range.
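The rank-based test can be sketched as follows: for each variable or location, the observation is ranked among the ensemble members, and the ranks are accumulated into a histogram. The ensembles and observations below are invented for illustration, and the function names are hypothetical; the sketch is not a reproduction of the cited analysis.

```python
def observation_rank(ensemble_values, observed):
    """Number of ensemble members lying below the observation (0 to N)."""
    return sum(1 for v in ensemble_values if v < observed)


def rank_histogram(ensemble_values, observations):
    """Histogram of observation ranks; the first and last bins count
    observations that fall outside the ensemble range."""
    counts = [0] * (len(ensemble_values) + 1)
    for observed in observations:
        counts[observation_rank(ensemble_values, observed)] += 1
    return counts


# Hypothetical observations of four different quantities (standardized units).
observations = [0.2, -0.8, 1.1, -0.1]

# A widely dispersed ensemble and an under-dispersed one.
broad = [-1.5, -0.5, 0.5, 1.5]
narrow = [-0.3, -0.1, 0.1, 0.3]

broad_hist = rank_histogram(broad, observations)
narrow_hist = rank_histogram(narrow, observations)
```

For the broad ensemble every observation falls within the ensemble range, whereas for the narrow ensemble half of the observations land in the outermost bins, which is the signature of an under-dispersed ensemble.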

Together, these studies imply that the PPEs produced to date have failed to reproduce the same degree of climatological diversity that can be observed in an MME such as CMIP, and even if a large number of parameters are perturbed, only a small subset of those parameters have a detectable influence on global response. Furthermore, the sensitivity to different parameters is not known a priori, so the effects of perturbing one particularly sensitive parameter can potentially dwarf subtler effects and render large fractions of the ensemble effectively unusable because they no longer represent an earth-like state.

Finally, the parameter space of a general circulation model (GCM) is generally very large, where the number of potential degrees of freedom in the model configuration could result in hundreds or even thousands of parameters. Sampling all possible parameter interactions rapidly becomes computationally infeasible as the number of parameters grows (Donoho, 2000). As such, PPEs are generally limited to a subset of the model’s parameter space and thus can only provide a conditional, rather than a general, assessment of the parametric uncertainty arising from the model’s design.
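The scale of the problem is easily quantified for a full factorial design, in which every combination of parameter settings is simulated. The sketch below assumes just three settings per parameter (for example low, default, and high), which is an illustrative choice.

```python
def full_factorial_runs(n_parameters, levels_per_parameter=3):
    """Number of simulations needed to cover every combination of settings."""
    return levels_per_parameter ** n_parameters


# Required ensemble size for 5, 10, and 30 perturbed parameters.
runs = {n: full_factorial_runs(n) for n in (5, 10, 30)}
```

Five parameters require 243 simulations and ten require 59,049, while thirty would require more than 10^14, far beyond any feasible ensemble of general circulation model runs.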

Although the ability to design a formal experiment through parameter sampling makes some forms of statistical analysis easier, the computational expense of running PPEs, combined with the fact that many ensemble members must be disregarded, complicates interpretation. Such issues have led to the use of model emulators or surrogates (Williamson et al., 2013), which use a statistical model to predict the relationship between model parameters and climatological features. As such, even if some models in the original PPE are demonstrably unlike reality, they are useful in that they provide information on how the model output responds to parameters.

A trained emulator can be used to find configurations of models which cannot be ruled out from observations, an approach known as “history matching.” By considering the diversity of future response in this set of plausible simulations, the researcher can produce a distribution of possible future change. Such distributions are often produced using a Bayesian framework (Jackson, 2009; Lee, Carslaw, Pringle, Mann, & Spracklen, 2011; Williamson et al., 2013), which provides a formal framework for updating prior assumptions about parameter values with information obtained from model skill (where an emulator might be used to rapidly sample the model response).
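History matching is commonly expressed through an implausibility measure: the distance between the observation and the emulator prediction, scaled by the combined observational, emulator, and structural (discrepancy) uncertainty. The sketch below assumes a one-parameter model, a hypothetical pre-trained emulator `toy_emulator`, invented variances, and a cut-off of 3, which is a conventional but assumed choice.

```python
import math


def toy_emulator(x):
    """Hypothetical emulator: (mean, variance) of the model output at parameter x."""
    return 2.0 * x + 1.0, 0.01


def implausibility(x, observed, obs_var, discrepancy_var, emulator=toy_emulator):
    """Standardized distance between the observation and the emulated output."""
    mean, emulator_var = emulator(x)
    return abs(observed - mean) / math.sqrt(obs_var + emulator_var + discrepancy_var)


# Candidate parameter values from 0.0 to 3.0 in steps of 0.1.
candidates = [i / 10 for i in range(31)]

# Retain the values that cannot be ruled out by the (hypothetical) observation.
not_ruled_out = [
    x for x in candidates
    if implausibility(x, observed=4.0, obs_var=0.04, discrepancy_var=0.04) < 3.0
]
```

Here only parameter values between 1.1 and 1.9 survive. Note that the method rules configurations out rather than identifying a single best fit, and that a larger discrepancy variance widens the retained region.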

However, there are potential complexities in using such approaches to estimate future change. The use of a Bayesian approach requires the specification of a parameter prior, although there is no definitive approach for doing so. The selection of the prior can have large impacts on the posterior probability distribution (Frame et al., 2005), and the use of a Bayesian approach cannot address the structural limitations of the original model and parameter sampling strategy used to create the PPE.
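The influence of the prior can be demonstrated with a one-dimensional example in which the same likelihood is combined with two priors that are each "uniform," but in different variables: one uniform in climate sensitivity S and one uniform in the feedback parameter, taken here as F_2x / S. All numerical values, including the observational estimate of the feedback and the grid limits, are illustrative assumptions rather than values from the cited studies.

```python
import math

F_2X = 3.7  # W m^-2, approximate radiative forcing from doubled CO2


def tail_probability(prior, threshold=4.5, lam_obs=1.2, lam_sd=0.4):
    """Posterior probability that S exceeds a threshold, evaluated on a grid.

    The likelihood is Gaussian in the feedback parameter lam = F_2X / S.
    prior(s) is the (unnormalized) prior density expressed in terms of S.
    """
    grid = [0.5 + 0.01 * k for k in range(1951)]  # S from 0.5 K to 20 K
    posterior = []
    for s in grid:
        lam = F_2X / s
        likelihood = math.exp(-0.5 * ((lam - lam_obs) / lam_sd) ** 2)
        posterior.append(likelihood * prior(s))
    total = sum(posterior)
    return sum(p for s, p in zip(grid, posterior) if s > threshold) / total


# Prior uniform in sensitivity S.
tail_uniform_s = tail_probability(lambda s: 1.0)

# Prior uniform in feedback, which corresponds to a density proportional to 1 / S^2.
tail_uniform_feedback = tail_probability(lambda s: 1.0 / s ** 2)
```

The prior that is uniform in sensitivity assigns a larger probability to high-sensitivity outcomes than the prior that is uniform in feedback, even though the observational information is identical in both cases.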

Analyses Combining PPE and MME Information

A small number of studies have attempted to take information from both PPEs and MMEs in order to produce a more comprehensive assessment of future uncertainty in climate change. The UKCP09 project attempted to do this in a formal sense to produce probability distributions for future climate in the United Kingdom (Harris, Collins, Sexton, Murphy, & Booth, 2010). The method used a PPE to assess future uncertainty in climate, but also included an estimate of structural error in the model by using the CMIP ensemble as a proxy for reality. Broadly, the HadSM3 model was used to produce a PPE, and this PPE was used to create a model emulator. The emulator was then used to find model configurations which best represented historical climatology from models in the CMIP3 ensemble using a Bayesian updating procedure. The resulting distribution for future change could then be checked against the “true” future simulations from the CMIP models. The error in the prediction was used to build a discrepancy term—an estimate of additional error to be included in the posterior distribution to account for the fact that the model structure of HadSM3 differed from other models and from the real world. While conceptually one of the most complete approaches to quantify uncertainty, the method has been criticized for its complexity and assumptions that are hard to justify (Frigg, Smith, & Stainforth, 2015).

Another approach (Sanderson, 2013) involved using a constrained regression, which adapted the PPE-only regression methodology for predicting climate sensitivity (Piani et al., 2005), but introduced a constraint that CMIP models must also satisfy the derived relationship. This approach claimed to use the fact that PPEs allowed for a very large sample size, while rejecting correlations which were not observed to hold in the multi-model archive.

Future Horizons

Combining information from multiple climate simulations is necessary if we wish to quantify the uncertainty of the possible futures that the earth might experience under climate change, but assessing which approach is correct is difficult. In contrast to a weather forecast, where ensemble analysis schemes can be tweaked to improve forecasts over time, a long-term climate projection offers no validation. The only sure way to know which model or weighting scheme is correct is to wait for the decades and centuries required for the projection to be validated (and even that would only provide one realization of natural variability, leaving some ambiguity as to which model was best). As such, the researcher today must rely on supporting evidence to make their projection.

In this article, we have considered a wide range of approaches of varying complexity, making different assumptions with different types of ensemble. Each approach has benefits and limitations, and the appropriate method for the researcher will depend upon the desired application, audience, and available data. Different approaches might produce more or less confident projections of the future, but by understanding the assumptions underlying the projections we can better explain why differences between projections occur.

Supporting evidence for a projection can arise from a number of sources. Perhaps most common is a consideration of model errors, under the assumption that models which exhibit more error in their present-day or historical simulation are less trustworthy in the future. But climate models produce a vast quantity of output, and the challenge lies in identifying which errors might be most relevant for constraining future change. The relatively small number of samples available in the Coupled Model Intercomparison Project (CMIP) archives, compared with the high dimensionality of the model output, means that it is trivial to find observable quantities which are correlated with future change by chance. This is further complicated by the fact that there is no way to independently validate models. All model developers have access to information on historical climate, and therefore even an unphysical model can potentially be calibrated to reproduce historical behavior.

Hence, developing better constraints on future response must rely on more than bulk error assessment or data-mining statistics, both of which can be undermined by the nature of the data available. Instead, the problem of constraining future change must be broken down into component parts where the global response is described in terms of its component feedbacks. Each of these feedbacks can be considered in turn, assessing the physical plausibility of each model in representing the process, and considering which observable features of the present-day or historical climate might be useful in constraining future change.

Using historical trends to inform future trends is also fraught with difficulty. Modelers tend to ensure that their models correctly reproduce global long-term temperature change, but they might achieve similar results for different reasons, for instance by balancing sensitivity to greenhouse gases with aerosol forcing. However, because aerosol forcing is likely to decrease in the future as clean air policies become more widespread, long-term future change will be governed primarily by sensitivity to greenhouse gases. As such, models may broadly agree on past trends but can disagree dramatically on future trends. This is compounded by the influence of internal variability, which tends to add noise to any estimate of what observed historical forced trends in climate might be. However, as the system continues to warm, both of these problems will be somewhat reduced as continued high-quality observations combined with a larger signal-to-noise ratio provide better transient observations of warming trends. As such, it seems likely that transient constraints will become more important in the future for constraining long-term response.

Moving Beyond the Status Quo

Thinking forward to future climate assessments, there is as yet no consensus on how to process multi-model ensembles (MMEs) to best describe uncertainty in future change. For model weighting, the complexity of modern climate models implies almost by definition that there can be no single set of metrics appropriate for all purposes. Although there is precedent for applying targeted metrics to certain climate properties, such as sea ice in the Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment Report (AR5) (and more formally in Knutti et al., 2017), such processes tend to have a small set of metrics that are clearly related to the projected quantity in question. The use of a weighting which uses only a small number of components can produce a dramatic reduction in projected spread in quantities such as sea ice, where the mean state climatology of some models can rapidly rule out their relevance for projection. However, such metrics also provide little information on model interdependency (with little error correlation structure from which to draw inference), and cannot easily be generalized to more complex processes such as regional precipitation change, where it is less clear that one single historical metric can inform the relevance of a model’s future projection.

A danger also arises from the fact that models might have potentially been calibrated using observations which are subsequently used to assess model weights. A weighting scheme that relies only on metrics which are readily used as calibration targets (such as top of atmosphere energy balance) therefore has the potential to demonstrate only that models have sufficient degrees of freedom to fit the observed data: a challenge going forward will be to find metrics that are more indicative of processes which are relevant for constraining future trends.

Many existing weighting schemes (Giorgi & Mearns, 2003; Knutti et al., 2017; Sanderson, Wehner, & Knutti, 2017) rely on mean state metrics, and it is generally recognized that a comprehensive scheme would require a consideration of transient metrics, and possibly an assessment of internal variability. It has been demonstrated that ranking models by skill in their representation of internal variability differs from mean state skill (Santer et al., 2009), and some emergent constraints have used variability information to attempt to constrain future change (Cox et al., 2013; Zelinka & Hartmann, 2011). However, given that greater consideration of process-based metrics is essential in weighting schemes, it would seem prudent to consider both mean state and variability information in future schemes.

There is enough evidence to determine that the continued assumption of model democracy (Knutti, 2010) cannot be fully justified in future impact studies. It can be demonstrated that models are not independent, and there are several candidate approaches for addressing this interdependency when making assessments of change informed by MMEs. An all-purpose recommendation for weighting model skill is not possible at present, and assessments must be conducted on a case-by-case basis. In projection examples where there is a clear relationship between future behavior and present-day metrics, a skill weighting may be appropriate to reduce model spread. In cases where there is no clear set of relevant metrics, strong weighting may erroneously over-constrain projections, but carefully conducted out-of-sample tests can help mitigate this risk by informing the appropriate strength of weighting on a case-by-case basis.

There are numerous communities that have focused on specific technical issues which have yet to be integrated into formal weighting schemes. The last few years have seen the proposal of many emergent constraints on cloud feedbacks and climate sensitivity (Klein & Hall, 2015) and carbon cycle responses (Wenzel, Cox, Eyring, & Friedlingstein, 2014), but there remain questions regarding the statistical robustness of these constraints as derived from the CMIP archive (Bracegirdle & Stephenson, 2013; Caldwell et al., 2014; Klocke, Pincus, & Quaas, 2011) and no studies have formalized their constraints in a framework which could be directly used in comprehensive multi-model assessment. This will require a reconciliation of the literature, which has to date focused on isolated process-based emergent constraints, with the more statistically comprehensive literature on model weighting.

There is also a potential for greater collaboration with the detection and attribution community, whose statistical machinery provides relevant information for the assessment of model representation of key feedback processes, an avenue which is yet to be explored for the purposes of model weighting. In addition, although there is a precedent for using paleoclimate simulations to constrain future change (Hargreaves, Annan, Yoshimori, & Abe-Ouchi, 2012), it is not common practice to consider paleoclimate model performance in model assessment—which presents an opportunity in the light of increasingly available long-term records (Marcott, Shakun, Clark, & Mix, 2013) and multi-century paleoclimate simulations (Otto-Bliesner et al., 2016; Zanchettin, Rubino, Matei, Bothe, & Jungclaus, 2012).

Making Projections More Useful

Incorporating climate information into long-term policy planning is increasingly common, and science has a role to play in ensuring that information can be utilized in making real-world decisions. There is an increasing demand for climate science to supply higher-resolution projections of the future, which may divert resources from robust uncertainty characterization (Dessai, Hulme, Lempert, & Pielke, 2009). Providing probability distributions is only useful if the uncertainties they describe are relevant to the application, and if the assumptions are well understood by the end user. This is often difficult or impossible to achieve in reality, and the researcher will be faced with the necessity to make judgments which highlight the key uncertainties in projections to end users in a concise fashion. Some of the approaches discussed in this article have the potential to facilitate such communication through the use of model subsets, which appropriately span a reasonable range of uncertainty in a projection, or simple probability distributions, which illustrate the risks which must be accounted for in planning.

However, the researcher is still fundamentally limited by the simulations that are available. Although efforts can be made to mitigate varying model quality and model interdependency to some degree, the data available will never represent a comprehensive sample of all possible future climates. Models represent many processes as parameterizations rather than resolving them explicitly. Such approximations imply some irreducible uncertainty, and forming useful projections requires assessing both the uncertainty in parameter values and the appropriateness of the parameterization itself. Some of the approaches illustrated here have attempted to do this, but much more will need to be done in the coming years to better characterize how uncertainty in model formulation should be represented to the end user.

In addition, models do not represent many elements of the Earth system that might prove crucial for understanding the long-term climate response to forcing (such as interactive ice sheets or methane release from melting permafrost). Although some models are working to introduce these components, it might be decades before these processes can be represented in a single model and their inherent uncertainties fully understood. Researchers will therefore have to remain aware of uncertainties arising from what is not represented within their models, and will have to continue working with incomplete data, drawing on multiple sources to piece together estimates that bound the future response (in sea-level rise, for example).

Researchers must also be aware that the available MMEs are not comprehensive samples of uncertainty, and that there is therefore some likelihood that the true future lies outside the model distribution of projections. Some of the methods illustrated cannot communicate this type of risk to the end user. For example, a model weighting scheme can at best describe the relative likelihood of different outcomes; it cannot describe the likelihood that the entire ensemble is biased. To address such issues, the researcher will again need to consider multiple sources of data: perturbed physics ensembles and simple climate models have the capacity to describe a more diverse set of futures (albeit with the added complexity of addressing the limitations of each).
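The limitation of weighting noted above can be made concrete with a small numerical sketch. The Gaussian form of the skill weight, the function name `skill_weights`, and every number below are assumptions chosen for illustration; none is taken from a specific scheme in the literature.

```python
import numpy as np


def skill_weights(distances, sigma):
    """Normalized weights that decay with model-observation distance."""
    d = np.asarray(distances, dtype=float)
    w = np.exp(-((d / sigma) ** 2))
    return w / w.sum()


# Invented projected changes and model-observation distances for five models.
projections = np.array([2.1, 2.6, 3.0, 3.4, 4.2])
distances = np.array([0.4, 0.9, 0.3, 1.2, 0.8])

weights = skill_weights(distances, sigma=0.5)
weighted_mean = float(np.sum(weights * projections))

# Try many arbitrary non-negative weightings: the weighted mean stays
# inside the range spanned by the ensemble members (up to rounding).
rng = np.random.default_rng(0)
random_weights = rng.dirichlet(np.ones(projections.size), size=1000)
random_means = random_weights @ projections
tol = 1e-12
always_inside = bool(
    np.all(
        (random_means >= projections.min() - tol)
        & (random_means <= projections.max() + tol)
    )
)

print(f"weights: {np.round(weights, 3)}")
print(f"weighted mean: {weighted_mean:.2f}")
print(f"all reweighted means inside ensemble range: {always_inside}")
```

Because the weights are non-negative and sum to one, no choice of weights can place probability on outcomes beyond the most extreme ensemble members. If every model shares a bias, reweighting cannot reveal it.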

There is relatively little literature on how to combine information from perturbed physics ensembles and MMEs into an integrated assessment of uncertainty. Given the limitations of each type of ensemble, this presents an opportunity for future analyses, which would need to combine the structural diversity that is possible only within an MME with the formal experimental design and process variation seen only in perturbed physics experiments. This is difficult to achieve with existing data, because very few existing models have produced perturbed physics archives, and those experiments that do exist have not been conducted according to a common experimental design.

Bridging Physical Understanding and Statistical Rigor

Many of the advances in multi-model analysis in recent decades have arisen either from a perspective of pure process understanding or from the development of complex statistical models that rely on bulk metrics of model skill. There is therefore potential in the coming years to apply more robust statistical approaches to the constraints and process understanding already discussed in the literature. Specialized ensembles such as the Cloud Feedback Model Intercomparison Project can provide insight into the physical processes and feedback mechanisms that contribute to the diversity of the model simulations available today. Drawing on this literature and data can yield more targeted metrics for evaluating the relevant aspects of model skill, and thus for judging the plausibility of a given model’s future projection.

Despite the limitations of the MMEs available today, the coordinated experiments conducted under CMIP are undoubtedly the most comprehensive sample of climate projections available to researchers. Successive generations of the ensemble increase the number of processes described within the models, and although this will tend to prevent a convergence of projections, it will also reduce the problem of accounting for unresolved processes in uncertainty assessments.

A final and significant source of uncertainty is that of future human behavior. At timescales of multiple decades and longer, the uncertainty arising from the choice of greenhouse gas emission scenario exceeds all other sources (Hawkins & Sutton, 2009), although model uncertainty and scenario uncertainty are rarely combined into a single assessment. Jointly assessing human and physical uncertainties remains a relatively unexplored field. Some simple models of climate have been coupled to socio-economic models (Hartin, Patel, Schwarber, Link, & Bond-Lamberty, 2015; O’Neill, Ren, Jiang, & Dalton, 2012), but there are very few examples of coupled socio-economic models in a comprehensive Earth System Model (ESM) (Collins et al., 2015). Even in the presence of such models, no studies have attempted to jointly assess the uncertainties in future human behavior and physical climate uncertainty. However, although this kind of assessment may prove distant, multi-model ensembles can help bound our assessment of how the climate might continue to change with continued human activity—and understanding their origin, assumptions, and limitations is critical if science is to provide the best possible information to society.
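A simple variance partition, loosely in the spirit of Hawkins and Sutton (2009), illustrates how scenario and model uncertainty might be compared within one framework. The sketch below is an assumption-laden illustration: the warming values are synthetic, internal variability is omitted, and all models and scenarios are treated as equally likely.

```python
import numpy as np

# Synthetic end-of-century warming (K), indexed as [scenario, model].
# Each scenario has a nominal signal that every model scales differently.
rng = np.random.default_rng(1)
scenario_signal = np.array([1.5, 2.5, 4.0])
model_scaling = rng.normal(1.0, 0.2, size=12)
warming = scenario_signal[:, None] * model_scaling[None, :]

# Scenario uncertainty: variance across scenarios of the multi-model mean.
scenario_var = np.var(warming.mean(axis=1))
# Model uncertainty: variance across models, averaged over scenarios.
model_var = np.mean(np.var(warming, axis=1))
total_var = scenario_var + model_var

scenario_frac = scenario_var / total_var
model_frac = model_var / total_var
print(f"scenario share of variance: {scenario_frac:.2f}")
print(f"model share of variance:    {model_frac:.2f}")
```

With equal numbers of models per scenario, the two components sum exactly to the variance of the pooled values, so the fractions describe a complete partition of the spread in this toy ensemble. A joint assessment of the kind discussed above would additionally need likelihoods for the scenarios themselves, which this sketch does not attempt.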

Further Reading

Collins, M., Chandler, R. E., Cox, P. M., Huthnance, J. M., Rougier, J., & Stephenson, D. B. (2012). Quantifying future climate change. Nature Climate Change, 2, 403–409.

Kharin, V. V., & Zwiers, F. W. (2002). Climate predictions with multimodel ensembles. Journal of Climate, 15(7), 793–799.

Knutti, R. (2010). The end of model democracy? Climatic Change, 102, 395–404.

Knutti, R., Abramowitz, G., Collins, M., Eyring, V., Gleckler, P. J., Hewitson, B., & Mearns, L. (2010). Good practice guidance paper on assessing and combining multi model climate projections. IPCC Expert Meeting on Assessing and Combining Multi Model Climate Projections, National Center for Atmospheric Research, Boulder, CO, January 25–27.

Knutti, R., Allen, M. R., Friedlingstein, P., Gregory, J. M., Hegerl, G., Meehl, G. A., . . . Wigley, T. M. L. (2008). A review of uncertainties in global temperature projections over the twenty-first century. Journal of Climate, 21, 2651–2663.

Sanderson, B. M., & Knutti, R. (2012). On the interpretation of constrained climate model ensembles. Geophysical Research Letters, 39, L16708.

Tebaldi, C., & Knutti, R. (2007). The use of the multi-model ensemble in probabilistic climate projections. Philosophical Transactions of the Royal Society A, 365(1857), 2053–2075.


References

Abramowitz, G., & Bishop, C. H. (2015). Climate model dependence and the ensemble dependence transformation of CMIP projections. Journal of Climate, 28(6), 2332–2348.

Andrews, T., Gregory, J. M., Webb, M. J., & Taylor, K. E. (2012). Forcing, feedbacks and climate sensitivity in CMIP5 coupled atmosphere‐ocean climate models. Geophysical Research Letters, 39(9), 1–7.

Annan, J. D., & Hargreaves, J. C. (2010). Reliability of the CMIP3 ensemble. Geophysical Research Letters, 37(2), L02703.

Bishop, C. H., & Abramowitz, G. (2013). Climate model dependence and the replicate earth paradigm. Climate Dynamics, 41(3–4), 885–900.

Bracegirdle, T. J., & Stephenson, D. B. (2013). On the robustness of emergent constraints used in multimodel climate change projections of Arctic warming. Journal of Climate, 26(2), 669–678.

Brown, P. T., & Caldeira, K. (2017). Greater future global warming inferred from earth’s recent energy budget. Nature, 552(7683), 45.

Caldwell, P. M., Bretherton, C. S., Zelinka, M. D., Klein, S. A., Santer, B. D., & Sanderson, B. M. (2014). Statistical significance of climate sensitivity predictors obtained by data mining. Geophysical Research Letters, 41(5), 1803–1808.

Collins, M., Knutti, R., Arblaster, J., Dufresne, J.-L., Fichefet, T., Friedlingstein, P., . . . Wehner, M. (2013). Long-term climate change: Projections, commitments and irreversibility. In IPCC (Ed.), Climate Change 2013: The Physical Science Basis. IPCC Working Group I Contribution to AR5 (pp. 1029–1136). Cambridge: Cambridge University Press.

Collins, W. D., Craig, A. P., Truesdale, J. E., Di Vittorio, A. V., Jones, A. D., Bond-Lamberty, B., . . . Thomson, A. M. (2015). The integrated Earth System Model (iESM): Formulation and functionality. Geoscientific Model Development Discussions, 8(1), 381–427.

Cox, P. M., Huntingford, C., & Williamson, M. S. (2018). Emergent constraint on equilibrium climate sensitivity from global temperature variability. Nature, 553(7688), 319.

Cox, P. M., Pearson, D., Booth, B. B., Friedlingstein, P., Huntingford, C., Jones, C. D., & Luke, C. M. (2013). Sensitivity of tropical carbon to climate change constrained by carbon dioxide variability. Nature, 494(7437), 341–344.

Deser, C., Phillips, A., Bourdette, V., & Teng, H. (2012). Uncertainty in climate change projections: The role of internal variability. Climate Dynamics, 38(3–4), 527–546.

Dessai, S., Hulme, M., Lempert, R., & Pielke, R., Jr. (2009). Climate prediction: A limit to adaptation. In W. N. Adger, I. Lorenzoni, & K. O’Brien (Eds.), Adapting to Climate Change: Thresholds, Values, Governance (pp. 64–78). Cambridge, UK: Cambridge University Press.

Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, 1, 32.

Edwards, P. N. (2010). A vast machine: Computer models, climate data, and the politics of global warming. Cambridge, MA: MIT Press.

Eyring, V., Bony, S., Meehl, G. A., Senior, C. A., Stevens, B., Stouffer, R. J., & Taylor, K. E. (2016). Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geoscientific Model Development, 9(5), 1937–1958.

Fasullo, J. T., & Trenberth, K. E. (2012). A less cloudy future: The role of subtropical subsidence in climate sensitivity. Science, 338(6108), 792–794.

Frame, D. J., Aina, T., Christensen, C. M., Faull, N. E., Knight, S. H. E., Piani, C., . . . Allen, M. R. (2009). The climateprediction.net BBC climate change experiment: Design of the coupled model ensemble. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 367(1890), 855–870.

Frame, D. J., Booth, B. B. B., Kettleborough, J. A., Stainforth, D. A., Gregory, J. M., Collins, M., & Allen, M. R. (2005). Constraining climate forecasts: The role of prior assumptions. Geophysical Research Letters, 32, L09702.

Frigg, R., Smith, L. A., & Stainforth, D. A. (2015). An assessment of the foundational assumptions in high-resolution climate projections: The case of UKCP09. Synthese, 192(12), 3979–4008.

Gillett, N. P., Arora, V. K., Matthews, D., & Allen, M. R. (2013). Constraining the ratio of global warming to cumulative CO2 emissions using CMIP5 simulations. Journal of Climate, 26(18), 6844–6858.

Giorgi, F., & Mearns, L. O. (2003). Probability of regional climate change based on the Reliability Ensemble Averaging (REA) method. Geophysical Research Letters, 30(12), 1629–1632.

Gleckler, P. J., Taylor, K. E., & Doutriaux, C. (2008). Performance metrics for climate models. Journal of Geophysical Research: Atmospheres, 113, D06104.

Hall, A., & Qu, X. (2006). Using the current seasonal cycle to constrain snow albedo feedback in future climate change. Geophysical Research Letters, 33, L03502.

Hansen, J., Ruedy, R., Sato, M., & Lo, K. (2010). Global surface temperature change. Reviews of Geophysics, 48, RG4004.

Hargreaves, J. C., Annan, J. D., Yoshimori, M., & Abe-Ouchi, A. (2012). Can the Last Glacial Maximum constrain climate sensitivity? Geophysical Research Letters, 39, L24702.

Harris, G. R., Collins, M., Sexton, D. M. H., Murphy, J. M., & Booth, B. B. B. (2010). Probabilistic projections for 21st century European climate. Natural Hazards and Earth System Sciences, 10(9), 2009–2020.

Hartin, C. A., Patel, P., Schwarber, A., Link, R. P., & Bond-Lamberty, B. P. (2015). A simple object-oriented and open-source model for scientific and policy analyses of the global climate system–Hector v1.0. Geoscientific Model Development, 8(4), 939–955.

Hawkins, E., & Sutton, R. (2009). The potential to narrow uncertainty in regional climate predictions. Bulletin of the American Meteorological Society, 90(8), 1095–1107.

Herger, N., Abramowitz, G., Knutti, R., Angélil, O., Lehmann, K., & Sanderson, B. M. (2017). Selecting a climate model subset to optimise key ensemble properties. Earth System Dynamics, 9, 1–24.

Jackson, C. S. (2009). Use of Bayesian inference and data to improve simulations of multi-physics climate phenomena. Journal of Physics: Conference Series, 180, 012029.

Jakob, C. (2014). Going back to basics. Nature Climate Change, 4, 1042–1045.

Kiehl, J. T. (2007). Twentieth century climate model response and climate sensitivity. Geophysical Research Letters, 34(22), L22710.

Klein, S. A., & Hall, A. (2015). Emergent constraints for cloud feedbacks. Current Climate Change Reports, 1(4), 276–287.

Klocke, D., Pincus, R., & Quaas, J. (2011). On constraining estimates of climate sensitivity with present-day observations through model weighting. Journal of Climate, 24(23), 6092–6099.

Knutti, R. (2010). The end of model democracy? Climatic Change, 102(3–4), 395–404.

Knutti, R., Abramowitz, G., Collins, M., Eyring, V., Gleckler, P. J., Hewitson, B., & Mearns, L. (2010). Good Practice Guidance Paper on Assessing and Combining Multi Model Climate Projections. In T. F. Stocker, Q. Dahe, G.-K. Plattner, M. Tignor, & P. M. Midgley (Eds.), Meeting Report of the Intergovernmental Panel on Climate Change Expert Meeting on Assessing and Combining Multi Model Climate Projections. IPCC Working Group I Technical Support Unit, University of Bern, Bern, Switzerland.

Knutti, R., Masson, D., & Gettelman, A. (2013). Climate model genealogy: Generation CMIP5 and how we got there. Geophysical Research Letters, 40(6), 1194–1199.

Knutti, R., Meehl, G. A., Allen, M. R., & Stainforth, D. A. (2006). Constraining climate sensitivity from the seasonal cycle in surface temperature. Journal of Climate, 19(17), 4224–4233.

Knutti, R., Sedláček, J., Sanderson, B. M., Lorenz, R., Fischer, E. M., & Eyring, V. (2017). A climate model projection weighting scheme accounting for performance and interdependence. Geophysical Research Letters, 44(4), 1909–1918.

Lee, L. A., Carslaw, K. S., Pringle, K. J., Mann, G. W., & Spracklen, D. V. (2011). Emulation of a complex global aerosol model to quantify sensitivity to uncertain parameters. Atmospheric Chemistry and Physics, 11(23), 12253–12273.

Marcott, S. A., Shakun, J. D., Clark, P. U., & Mix, A. C. (2013). A reconstruction of regional and global temperature for the past 11,300 years. Science, 339(6124), 1198–1201.

Marzban, C., Wang, R., Kong, F., & Leyton, S. (2011). On the effect of correlations on rank histograms: Reliability of temperature and wind speed forecasts from finescale ensemble reforecasts. Monthly Weather Review, 139(1), 295–310.

Masson, D., & Knutti, R. (2011). Climate model genealogy. Geophysical Research Letters, 38(8), L08703.

Meehl, G. A., Hu, A., Arblaster, J. M., Fasullo, J., & Trenberth, K. E. (2013). Externally forced and internally generated decadal climate variability associated with the interdecadal Pacific oscillation. Journal of Climate, 26(18), 7298–7310.

O’Neill, B. C., Ren, X., Jiang, L., & Dalton, M. (2012). The effect of urbanization on energy use in India and China in the iPETS model. Energy Economics, 34, S339–S345.

Otto, A., Otto, F. E., Boucher, O., Church, J., Hegerl, G., Forster, P. M., . . . Allen, M. R. (2013). Energy budget constraints on climate response. Nature Geoscience, 6, 415–416.

Otto-Bliesner, B. L., Brady, E. C., Fasullo, J., Jahn, A., Landrum, L., Stevenson, S., . . . Strand, G. (2016). Climate variability and change since 850 CE: An ensemble approach with the community Earth System Model. Bulletin of the American Meteorological Society, 97(5), 735–754.

Pachauri, R. K., Allen, M. R., Barros, V. R., Broome, J., Cramer, W., Christ, R., . . . van Ypersele, J. P. (2014). Climate change 2014: Synthesis report: Contribution of Working Groups I, II and III to the fifth assessment report of the Intergovernmental Panel on Climate Change. IPCC, Geneva, Switzerland.

Pennell, C., & Reichler, T. (2011). On the effective number of climate models. Journal of Climate, 24(9), 2358–2367.

Piani, C., Frame, D. J., Stainforth, D. A., & Allen, M. R. (2005). Constraints on climate change from a multi-thousand member ensemble of simulations. Geophysical Research Letters, 32(23), L23825.

Räisänen, J., Ruokolainen, L., & Ylhäisi, J. (2009). Weighting of model results for improving best estimates of climate change. Climate Dynamics, 35(2–3), 407–422.

Rougier, J., Goldstein, M., & House, L. (2013). Second-order exchangeability analysis for multimodel ensembles. Journal of the American Statistical Association, 108(503), 852–863.

Rowlands, D. J., Frame, D. J., Ackerley, D., Aina, T., Booth, B. B., Christensen, C., . . . Grandey, B. S. (2012). Broad range of 2050 warming from an observationally constrained large climate model ensemble. Nature Geoscience, 5(4), 256–260.

Sanderson, B. M. (2013). On the estimation of systematic error in regression-based predictions of climate sensitivity. Climatic Change, 118(3–4), 757–770.

Sanderson, B. M., & Knutti, R. (2012). On the interpretation of constrained climate model ensembles. Geophysical Research Letters, 39, L16708.

Sanderson, B. M., Knutti, R., & Caldwell, P. (2015a). Addressing interdependency in a multimodel ensemble by interpolation of model properties. Journal of Climate, 28(13), 5150–5170.

Sanderson, B. M., Knutti, R., & Caldwell, P. (2015b). A representative democracy to reduce interdependency in a multimodel ensemble. Journal of Climate, 28(13), 5171–5194.

Sanderson, B. M., Wehner, M., & Knutti, R. (2017). Skill and independence weighting for multi-model assessments. Geoscientific Model Development, 10(6), 2379–2395.

Santer, B. D., Taylor, K. E., Gleckler, P. J., Bonfils, C., Barnett, T. P., Pierce, D. W., . . . Wehner, M. F. (2009). Incorporating model quality information in climate change detection and attribution studies. Proceedings of the National Academy of Sciences of the United States of America, 106(35), 14778–14783.

Sherwood, S. C., Bony, S., & Dufresne, J.-L. (2014). Spread in model climate sensitivity traced to atmospheric convective mixing. Nature, 505(7481), 37–42.

Slingo, J., & Palmer, T. (2011). Uncertainty in weather and climate prediction. Philosophical Transactions of the Royal Society A, 369(1956), 4751–4767.

Stott, P. A., Mitchell, J. F., Allen, M. R., Delworth, T. L., Gregory, J. M., Meehl, G. A., & Santer, B. D. (2006). Observational constraints on past attributable warming and predictions of future global warming. Journal of Climate, 19(13), 3055–3069.

Sunyer, M. A., Madsen, H., Rosbjerg, D., & Arnbjerg-Nielsen, K. (2014). A Bayesian approach for uncertainty quantification of extreme precipitation projections including climate model interdependency and nonstationary bias. Journal of Climate, 27(18), 7113–7132.

Taylor, K. E., Stouffer, R. J., & Meehl, G. A. (2012). An overview of CMIP5 and the experiment design. Bulletin of the American Meteorological Society, 93(4), 485–498.

Tebaldi, C. (2004). Regional probabilities of precipitation change: A Bayesian analysis of multimodel simulations. Geophysical Research Letters, 31, L24213.

Tebaldi, C., & Sansó, B. (2009). Joint projections of temperature and precipitation change from multiple climate models: A hierarchical Bayesian approach. Journal of the Royal Statistical Society. Series A, 172(1), 83–106.

Tebaldi, C., Smith, R. L., Nychka, D., & Mearns, L. O. (2005). Quantifying uncertainty in projections of regional climate change: A Bayesian approach to the analysis of multimodel ensembles. Journal of Climate, 18(10), 1524–1540.

Urban, M. C. (2015). Accelerating extinction risk from climate change. Science, 348(6234), 571–573.

Weigel, A. P., Knutti, R., Liniger, M. A., & Appenzeller, C. (2010). Risks of model weighting in multimodel climate projections. Journal of Climate, 23(15), 4175–4191.

Wenzel, S., Cox, P. M., Eyring, V., & Friedlingstein, P. (2014). Emergent constraints on climate-carbon cycle feedbacks in the CMIP5 Earth System Models. Journal of Geophysical Research: Biogeosciences, 119(5), 794–807.

Williamson, D., Goldstein, M., Allison, L., Blaker, A., Challenor, P., Jackson, L., & Yamazaki, K. (2013). History matching for exploring and reducing climate model parameter space using observations and a large perturbed physics ensemble. Climate Dynamics, 41(7–8), 1703–1729.

Yokohata, T., Annan, J. D., Collins, M., Jackson, C. S., Tobis, M., Webb, M. J., & Hargreaves, J. C. (2011). Reliability of multi-model and structurally different single-model ensembles. Climate Dynamics, 39(3–4), 599–616.

Yokohata, T., Webb, M. J., Collins, M., Williams, K. D., Yoshimori, M., Hargreaves, J. C., & Annan, J. D. (2010). Structural similarities and differences in climate responses to CO2 increase between two perturbed physics ensembles. Journal of Climate, 23(6), 1392.

Zanchettin, D., Rubino, A., Matei, D., Bothe, O., & Jungclaus, J. H. (2012). Multidecadal-to-centennial SST variability in the MPI-ESM simulation ensemble for the last millennium. Climate Dynamics, 40(5–6), 1301–1318.

Zelinka, M. D., & Hartmann, D. L. (2011). The observed sensitivity of high clouds to mean surface temperature anomalies in the tropics. Journal of Geophysical Research, 116, D23103.