Unified forward modelling of galaxies and type Ia supernovae

CIGaRS is a unified forward model for type Ia supernova cosmology. It is a diptych (Fig. 1) of two state-of-the-art Bayesian hierarchical simulators for photometric galaxy observations (dh) and summary statistics of type Ia supernovae (ds), joined on three levels:

  1. by the probability of occurrence (equivalently, number \({N}_{{\rm{SN}}}^{h}\)) of type Ia supernovae in a given host h, based on the intrinsic properties of the latter, gh;

  2. by the distribution of the intrinsic properties λs of the supernovae, which also depend on the gh(s) of their respective hosts;

  3. by extrinsic effects of the host environment (for example, dust) and other properties (for example, cosmological redshift and peculiar velocity) on supernova light as it travels towards the observer.

The former two, which affect λ, can be considered causal connections, whereas the latter, which affects only the observables and data d, is coincidental. One last connection can arise during data reduction (for example, subtraction of the host light). We plan to add this in a future extension that also implements the image analysis procedure of transient surveys. In the following, we describe in detail all components of CIGaRS (see also the full hierarchy in Fig. 1 and the list of parameters in Extended Data Table 1) and discuss the planned improvements, which are mainly related to the intrahost localization of the supernovae.

Host model

A comprehensive treatment of the photo-spectral evolution of galaxies—see the pioneering studies using the MOPED method83,84,85 and, for example, Mo et al.86 for a review—is beyond the scope of this work. At a high level, the intrinsic properties of a galaxy h, such as its SFH, metallicity (\({Z}_{* }^{h}\)) and interstellar dust content, are largely determined by the total stellar mass of the galaxy (\({M}_{* }^{h}\)) and its cosmological redshift (\({z}_{{\rm{c}}}^{h}\)), the latter being a proxy (given a cosmological model) for the age of the Universe (Th) at the time the galaxy’s light was emitted. To represent the intrinsic correlations inherent in galaxy formation, we adopt an off-the-shelf Bayesian framework for galaxy photometry: Prospector87, as updated in Prospector-β60.

SFH and chemical enrichment: Prospector-β

Prospector encodes a non-trivial SFH by discretizing it into seven time bins (the optimal number for describing moderate signal-to-noise observations88) spanning [0; Th]: the first two are fixed to [0 Myr; 30 Myr] and [30 Myr; 100 Myr], the next four are logarithmically spaced within [100 Myr; 0.85Th], and the last covers [0.85Th; Th]. The total mass of a galaxy is then distributed into seven stellar populations that represent the stars born during each history bin:

$${{\bf{SFH}}}^{h}\equiv {\left[{\mathrm{SFH}}^{h,j}\right]}_{j=1}^{7},\,\,\,\,\,\,\mathrm{with}\,\,\,\,\,\,\mathop{\sum }\limits_{j=1}^{7}{\mathrm{SFH}}^{h,j}={M}_{* }^{h}.$$

(1)

In each forward realization, SFHh is sampled from the prior \(p({{\bf{SFH}}}^{h}| {M}_{* }^{h},{T}^{h})\) introduced in Prospector-β (Section 3.3 of ref. 60). The model adopts a non-parametric continuity prescription89 whereby the ratio of SFRs in adjacent bins exhibits a random scatter around the value dictated by the average (‘cosmic’) SFR density66, taking into account also the empirical observation that high-mass galaxies form earlier.
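For concreteness, the bin construction can be sketched in a few lines of numpy (the helper name sfh_bin_edges is ours; Prospector's internal implementation may differ in detail):

```python
import numpy as np

def sfh_bin_edges(T_h_myr):
    """Edges (in Myr) of the seven SFH time bins for a galaxy of age T_h_myr.

    The first two bins are fixed to [0, 30] and [30, 100] Myr, the next four
    are logarithmically spaced within [100 Myr, 0.85*T_h] and the last bin
    covers [0.85*T_h, T_h].
    """
    log_edges = np.logspace(2.0, np.log10(0.85 * T_h_myr), 5)  # 4 log-spaced bins
    return np.concatenate([[0.0, 30.0, 100.0], log_edges[1:], [T_h_myr]])

edges = sfh_bin_edges(10_000.0)  # a 10-Gyr-old galaxy
assert len(edges) == 8           # 8 edges -> 7 bins
```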

For a given SFHh, and assuming that all stars in a given population were born at approximately the same time (\({t}_{* }^{h,j}\), taken at the (linear) centre of the respective bin), the process of stellar population synthesis (SPS) produces the integrated light (before extinction) from all stars in a galaxy by convolving the emission spectra of single stellar populations with the SFH.

SPS also requires a prescription for the galactic chemical enrichment history: the metallicities of all its stellar populations. Prospector assumes a single metallicity within a galaxy (\({Z}_{* }^{h}\)), determining it (with appropriate scatter) from the total stellar mass, as in Gallazzi et al.67. This has also been shown to be appropriate for modelling the progenitors of type Ia supernovae90, which is our ultimate goal.

Host dust extinction

The light emitted by stars is then extinguished (and reddened) by dust within the host. Prospector considers separate birth-cloud and diffuse dust components. The former affects only the youngest stellar population (acting on the emission only from the first time bin) and has a fixed wavelength dependence proportional to \({\lambda }^{-1}\) and optical depth \({\tau }_{1}^{h}\). Extinction due to diffuse (interstellar) dust is described by the KC13 law (ref. 81, following refs. 91,92) with shape parameter δh and optical depth \({\tau }_{2}^{h}\) related to the column density of dust along the line of sight (as it represents the total extinction in the (rest-frame) V-band, we will label it \({A}_{{{\rm{V}}}}^{h}\) instead).

In Prospector-β, the dust parameters are a priori independent from other galaxy properties; however, many studies have found a posteriori relations from large galaxy surveys. We enforce the empirical relations of Alsing et al. (equations (14) and (15) in ref. 93):

$${\tau }_{2}^{h}\equiv {A}_{{\mathrm{V}}}^{h} \sim {\mathcal{N}}\left(0.2+0.5\,{\mathrm{ReLU}}\left({\log }_{10}\left[{\mathrm{SFR}}^{h}/\left({M}_{\odot }\,{\mathrm{yr}}^{-1}\right)\right]\right),{0.2}^{2}\right),$$

(2)

$${\delta }^{h} \sim {\mathcal{N}}\left(-0.095+0.111{A}_{{\mathrm{V}}}^{h}-0.0066{({A}_{{\mathrm{V}}}^{h})}^{2},{0.4}^{2}\right),$$

(3)

where the SFR is derived from the most recent (current) component of SFHh: SFRh ≡ SFHh,1/30 Myr. Prospector then sets the optical depth of the birth-cloud dust according to \({\tau }_{1}^{h}/{\tau }_{2}^{h}\sim {\mathcal{N}}(1,{0.3}^{2})\).
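A minimal sketch of this dust sampling, under our own naming (a full implementation would clip or resample the occasional negative draws that the Gaussians permit):

```python
import numpy as np

def sample_host_dust(sfh_h1_msun, rng):
    """Sample (A_V, delta, tau_1) for one host, following eqs. (2)-(3).

    sfh_h1_msun: stellar mass (M_sun) formed in the most recent bin,
    so that SFR^h = SFH^{h,1} / 30 Myr.
    """
    sfr = sfh_h1_msun / 30e6                                # M_sun / yr
    relu = max(np.log10(sfr), 0.0) if sfr > 0 else 0.0      # ReLU(log10 SFR)
    A_V = rng.normal(0.2 + 0.5 * relu, 0.2)                 # eq. (2)
    delta = rng.normal(-0.095 + 0.111 * A_V - 0.0066 * A_V**2, 0.4)  # eq. (3)
    tau_1 = A_V * rng.normal(1.0, 0.3)                      # birth-cloud depth
    return A_V, delta, tau_1

A_V, delta, tau_1 = sample_host_dust(1e7, np.random.default_rng(0))
```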

Importantly, the above description of dust extinction is global, that is, averaged over the spatial light distribution of the galaxy, as the host photometry is usually integrated. It may, thus, not faithfully represent the local extinction law or the optical depth of dust at any particular location within the galaxy, for example in the proximity of a given type Ia supernova. The spatial distribution of dust properties is a topic of active research (for example, refs. 94,95,96,97), with mounting evidence for its effect on the standardization of type Ia supernovae98,99. We will address this in a future refinement of the model, in coordination with the dust effects applied to supernova light.

Photometry and detection of hosts

Given the above quantities (and parameters that describe the presence and emission of interstellar gas and an active galactic nucleus, which we randomly sample from the Prospector-β priors as they are not directly relevant to studies of type Ia supernovae), we compute the rest-frame spectral energy distribution—more accurately, the spectral flux—Φh(λr) using the NN-based SPS emulator Speculator-α100. The speed-up this brings over traditional (from first principles) SPS is crucial for simulating the large amount of training data we need.

We then redshift Φh (in this study, we assume no peculiar velocities, \({{z}_{{\rm{c}}}}^{h}={z}^{h}={z}^{{s}}\), as the fraction of objects in our simulations and mock data with zc ≲ 0.02, where the distinction would be important, is negligible) and use the cosmological (transverse co-moving) distance \({D}^{h}\equiv D({z}_{{\rm{c}}}^{h},{\mathcal{C}})\), where \({\mathcal{C}}\) are the cosmological parameters, to convert it to the observer-frame spectral flux density:

$${F}^{h}({\lambda }_{{\rm{o}}})\equiv \frac{{{\varPhi}}^{h}({\lambda }_{{\rm{o}}}/(1+{z}^{h}))}{{(1+{z}^{h})}^{3}4{{\uppi }}{({D}^{h})}^{2}},$$

(4)

where λo is the observer-frame wavelength. In principle, at this stage we need to account for extinction in intergalactic space101,102 and in the Milky Way. The former is a small effect (~0.015 mag at zc = 0.6) and is rarely addressed explicitly in supernova cosmology (see, for example, Section 4.4.7 in ref. 29), whereas the latter can be accounted for by using detailed multi-wavelength attenuation maps of the Milky Way103,104,105,106. Hence, in this work, we ignore the two effects and assume that the observations have been appropriately corrected.

Thus, we numerically integrate Fh within the Rubin ugrizy passbands, convert to AB magnitudes and apply 0.01 mag Gaussian noise (a conservative estimate for the Rubin Observatory’s absolute photometric calibration107) to arrive at the simulated galaxy data dh, a vector of length 6. Finally, to simulate detection, that is, the indicator variable Sh, we apply a simple magnitude cut by stipulating that all six measured magnitudes must be brighter than the respective 5σ detection thresholds expected for LSST 10-yr co-added imaging108.
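Schematically, this final photometric step amounts to the following sketch, in which the depth values are placeholders rather than the actual LSST 10-yr thresholds:

```python
import numpy as np

M5 = np.array([25.6, 26.9, 26.9, 26.4, 25.6, 24.8])  # placeholder 5-sigma depths (ugrizy)

def observe_host(flux_density_jy, rng, sigma_cal=0.01):
    """Noisy AB magnitudes d^h and detection flag S^h for one galaxy.

    flux_density_jy: length-6 band-integrated flux densities in Jy.
    """
    mags = -2.5 * np.log10(flux_density_jy / 3631.0)   # AB zero point: 3631 Jy
    d_h = mags + rng.normal(0.0, sigma_cal, size=6)    # calibration noise floor
    S_h = bool(np.all(d_h < M5))                       # brighter than all thresholds
    return d_h, S_h
```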

The description above treats galaxies as isotropically emitting point sources, whereas in reality, measuring their brightnesses requires accounting for inclination and de-blending of overlapping objects, which—to first order—would inflate the uncertainty beyond the photometric calibration floor we have assumed. To account for higher-order effects (for example, complicated biases due to inclination, blending or selection effects arising from the complex procedure for detecting and selecting objects for inclusion in a galaxy catalogue), one would need to process the raw image data. Owing to the flexibility of neural SBI, this can be achieved through high-fidelity simulations (see, for example, refs. 109,110,111,112,113) to which the same data-reduction procedure has been applied as to the real data.

Occurrence of type Ia supernovae

The first connection between galaxies and type Ia supernovae (or any other transient) is the emergence of the latter from the stellar populations of the former, which we describe through the DTD: the rate (per unit rest-frame time) of occurrence of type Ia supernovae from a given stellar population (per unit stellar mass within it), as a function of the age of the population t*. We adopt a power-law parameterization:

$${\rm{DTD}}\,({t}_{* })=A\times {({t}_{* }/{\rm{Gyr}})}^{b}\times {{{\rm{M}}}_{\odot }}^{-1}\,{{\rm{yr}}}^{-1},$$

(5)

which is guided by the theory of binary systems that decay through the emission of gravitational waves (for which b = −1), with independent priors on the normalization and slope from observational constraints68:

$${\log }_{10}\,A \sim {\mathcal{N}}\left(-12.15,{0.1}^{2}\right),$$

(6)

$$b \sim {\mathcal{N}}\left(-1.34,{0.2}^{2}\right).$$

(7)

To avoid an infinite total rate, we set the DTD to zero for t* < 0.1 Gyr (that is, for the two youngest stellar populations in Prospector), corresponding to the minimum time for the creation of a carbon-oxygen white dwarf114,115. This physical prescription resolves the strong a posteriori correlation between the cutoff time and A, which has previously prevented direct inference of the former77. In future analyses, we plan to relax this assumption and include a mixture of formation channels (for example, ref. 116), modelling their DTDs from first principles.
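With the prior means of equations (6) and (7) as illustrative defaults, the rate can be written as:

```python
import numpy as np

def dtd_rate(t_star_gyr, log10_A=-12.15, b=-1.34, t_min_gyr=0.1):
    """Power-law DTD of eq. (5) in SNe Ia per M_sun per yr, zero below t_min_gyr."""
    t = np.asarray(t_star_gyr, dtype=float)
    rate = 10.0**log10_A * np.maximum(t, t_min_gyr)**b
    return np.where(t < t_min_gyr, 0.0, rate)
```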

Given the DTD, we calculate the number of type Ia supernovae that arise (on expectation) from the stellar population j of galaxy h (recall that we assumed all stars within it have the same age \({t}_{* }^{h,j}\)) during a survey of (observer-frame) duration \({\mathcal{T}}\):

$$\left\langle {N}_{\mathrm{SN}}^{h,\,j}\right\rangle =\frac{{\mathcal{T}}}{1+{z}^{h}}\times {\mathrm{SFH}}^{h,\,j}\times \mathrm{DTD}\,({t}_{* }^{h,\,j}).$$

(8)

We then iterate (in parallel on a GPU) over all stellar populations of all galaxies, that is over all h and j, to generate

$${N}_{\mathrm{SN}}^{h,\,j} \sim \mathrm{Poisson}\left(\left\langle {N}_{\mathrm{SN}}^{h,\,j}\right\rangle \right)$$

(9)

supernovae. We sample their ages, \({\{{t}_{* }^{h,\,j,k}\}}_{k=1}^{{N}_{\mathrm{SN}}^{h,\,j}}\), uniformly within the time bin associated with host h and population j. To simplify notation, we uniquely label each supernova with \(s\in \{1,\ldots ,{\sum }_{h,\,j}{N}_{\mathrm{SN}}^{h,\,j}\}\) and keep track of the host h(s) of each, so that we can associate the relevant intrinsic galaxy properties: \({z}_{{\rm{c}}}^{\,s}\equiv {z}_{{\rm{c}}}^{\,h(s)}\), \({Z}_{* }^{s}\equiv {Z}_{* }^{\,h(s)}\) and \({M}_{* }^{s}\equiv {M}_{* }^{h(s)}\) and host observations dh(s). We note that this assumes a perfect host association, which is often not the case in real observations, and we ignore the extra information from instances of several supernovae arising in the same host (see, for example, ref. 117); we plan to make improvements in these respects by extending the simulator towards detection and characterization of the transients directly from telescope images.
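A sketch of this seeding step for a single host, combining equations (8) and (9) with the uniform age sampling just described (reusing dtd_rate from the sketch above; names ours):

```python
import numpy as np

def seed_supernovae(T_obs_yr, z_h, sfh_msun, edges_gyr, rng):
    """Sample SN Ia ages (Gyr) for one host, following eqs. (8)-(9).

    sfh_msun: length-7 stellar masses per SFH bin; edges_gyr: length-8 bin edges.
    """
    t_mid = 0.5 * (edges_gyr[:-1] + edges_gyr[1:])                  # linear bin centres
    expected = T_obs_yr / (1.0 + z_h) * sfh_msun * dtd_rate(t_mid)  # eq. (8)
    n_sn = rng.poisson(expected)                                    # eq. (9)
    return np.concatenate([rng.uniform(lo, hi, size=n)              # uniform age in bin
                           for lo, hi, n in zip(edges_gyr[:-1], edges_gyr[1:], n_sn)])
```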

Finally, it is also possible to keep track—in the simulator—of the stellar population j(s) from which each supernova has arisen and associate the corresponding properties, if these differ from one stellar population to another within the same host (for example, to represent a fully realistic enrichment history through different \({Z}_{* }^{\,h,\,j}\)). Similarly, we can simulate the spatial distribution of stellar populations that give rise to metallicity and age gradients (for example, refs. 118,119) and differences in dust extinction (for example, refs. 94,95,96,97). We defer an expansion of the simulator within the hosts to future work.

Model for type Ia supernovae

Once we have determined how many type Ia supernovae have occurred during a survey, we proceed to simulate their observables. This follows a similar path, going from the intrinsic properties derived from a population model informed by the host of each object, through extinction, redshift and distance to noisy measurements and detection or sample selection.

Causal host–type Ia supernova connections

The cornerstone of CIGaRS is an explicit parameterized dependence of the intrinsic properties of type Ia supernovae on the characteristics of their host or of their progenitor stellar population: in general, p(λs∣gh(s),j(s), γ), where γ are global parameters. Our choice of which links to include is motivated by observational evidence4,6,7,8,9,10,11,12 and takes the form of a host-dependent magnitude offset δMs, but we could similarly introduce, for example, a parameterized host-dependent distribution of \({x}_{{\rm{int}}}^{s}\) (ref. 26).

As the physical source of δMs is not yet established, we allow for several possibilities (in isolation or combination)—namely, metallicity \({Z}_{* }^{s}\) and progenitor age \({t}_{* }^{s}\)—with correlation coefficients, respectively \({\gamma }_{{Z}_{* }}\) and \({\gamma }_{{t}_{* }}\), whose priors include 0 and wide ranges of positive and negative values. In keeping with current practice and as a ‘catch-all’ parameter, we also allow for an additional (or residual) ‘mass step’ ΔM with a similar wide prior and a free (inferred) stellar-mass location \({M}_{* }^{{\rm{step}}}\). The full intrinsic host connection is, therefore:

$$\delta {M}^{s}=\underbrace{{\gamma }_{{Z}_{* }}\times {\log }_{10}\left({Z}_{* }^{s}/{Z}_{\odot }\right)}_{\text{metallicity}}+\underbrace{{\gamma }_{{t}_{* }}\times {t}_{* }^{s}}_{\text{age}}+\underbrace{\Delta M\times {\mathbb{I}}\left({M}_{* }^{s} > {M}_{* }^{\mathrm{step}}\right)}_{\text{ad hoc mass step}},$$

(10)

where 𝕀 returns 1 or 0 if its argument is true or false, respectively, and \({\gamma }_{{Z}_{* }},{\gamma }_{{t}_{* }},\Delta M,{M}_{* }^{\mathrm{step}}\in {\boldsymbol{\gamma }}\) are global parameters. Inference of the presence and strength of the respective effects is performed by examining their posterior(s) or performing a Bayesian model selection, which is most conveniently and rigorously achieved through simulation-based methods37. Moreover, with SBI, all underlying correlations are accounted for and marginalized over when inferring the cosmology, which ensures the robustness of dark energy inference against systematics arising from the host–type Ia supernova dependences.
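In code, equation (10) amounts to the following per-supernova computation (function name ours):

```python
import numpy as np

def delta_M(log10_Z_rel, t_star_gyr, M_star, gamma_Z, gamma_t, DM_step, M_step):
    """Host-dependent magnitude offset of eq. (10).

    log10_Z_rel: log10(Z_*^s / Z_sun); M_star and M_step in M_sun.
    """
    return (gamma_Z * log10_Z_rel            # metallicity term
            + gamma_t * t_star_gyr           # progenitor-age term
            + DM_step * (M_star > M_step))   # ad hoc mass step
```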

Intrinsic properties and dust extinction of type Ia supernovae: Simple-BayeSN

Bayesian hierarchical modelling replaces empirical standardization with a priori correlated latent parameters sampled from hierarchical priors32,120:

$$\text{stretch:}\,\,{x}_{\mathrm{int}}^{s} \sim {\mathcal{N}}\left({\mu }_{x},{\sigma }_{x}^{2}\right),$$

(11)

$$\text{colour:}\,\,{c}_{\mathrm{int}}^{s} \sim {\mathcal{N}}\left({\mu }_{c}+{\alpha }_{c}{x}_{\mathrm{int}}^{s},{\sigma }_{c}^{2}\right),$$

(12)

$$\text{absolute }B\text{-band magnitude:}\,\,{M}_{\mathrm{int}}^{s} \sim {\mathcal{N}}\left({M}_{0}+\alpha {x}_{\mathrm{int}}^{s}+\beta {c}_{\mathrm{int}}^{s}+\delta {M}^{s},{\sigma }_{0}^{2}\right),$$

(13)

where μx, σx, μc, σc, M0, σ0, α, αc, β ∈ γ are population parameters assigned fixed hyper-priors, as listed in Extended Data Table 1.

Like starlight, supernova light is affected by the dust surrounding it within the host and along the line of sight, that is, in intergalactic space and in the Milky Way. We will ignore the latter two, assuming that they have been perfectly corrected for, and adopt the Bayesian formulation of host dust extinction from Simple-BayeSN32, which acts on the intrinsic parameters introduced in equations (11) to (13) to obtain their ‘extrinsic’ versions:

$${x}_{{\rm{ext}}}^{s}={x}_{{\rm{int}}}^{s},$$

(14)

$${c}_{\mathrm{ext}}^{\,s}={c}_{\mathrm{int}}^{s}+{E}_{B-V}^{\,s}={c}_{\mathrm{int}}^{\,s}+{A}_{V}^{s}/{R}_{V}^{s},$$

(15)

$${M}_{\mathrm{ext}}^{s}={M}_{\mathrm{int}}^{s}+{A}_{B}^{s}={M}_{\mathrm{int}}^{s}+{A}_{{\rm{V}}}^{s}({R}_{V}^{s}+1)/{R}_{V}^{s},$$

(16)

where \({E}_{B-V}\equiv {A}_{B}-{A}_{V}\) is the selective extinction, with \({A}_{B}\) the total extinction in the rest-frame B-band (in which the peak magnitude of type Ia supernovae is typically standardized) and \({R}_{V}\equiv {A}_{V}/{E}_{B-V}\). Note that the colour–magnitude standardization coefficient β in equation (13) refers only to cint rather than the dust-affected cext.

Ideally, in a unified model, one would coordinate the extinction applied to type Ia supernovae with the dust properties of the host. However, owing to the complicated distribution of dust within galaxies (for example, refs. 94,95,96,97) and the non-trivial attenuation effects of dust, which include not only extinction but also scattering of galactic and supernova light into the line of sight31, the simple approach of adopting the host-global \({A}_{V}^{h}\) and \({R}_{V}({\delta }^{h})\) predicts optical depths an order of magnitude larger than empirically observed in supernova data. We will explore the effect of dust localization in a further study, and here we adopt instead the host-independent dust populations from Simple-BayeSN:

$${R}_{V}^{s} \sim {\mathcal{N}}\left({\mu }_{R},{\sigma }_{R}^{2}\right),$$

(17)

$${A}_{V}^{s} \sim \mathrm{Exponential}(1/\tau ),$$

(18)

with μR, σR, τ ∈ γ global parameters (Extended Data Table 1).
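Equations (11) to (18) then reduce to the following per-object sampling, where pop collects the relevant global population parameters (a sketch under our naming):

```python
import numpy as np

def sample_sn_latents(dM_s, pop, rng):
    """Draw extrinsic (x, c, M) for one SN, following eqs. (11)-(18)."""
    x_int = rng.normal(pop["mu_x"], pop["sigma_x"])                           # (11)
    c_int = rng.normal(pop["mu_c"] + pop["alpha_c"] * x_int, pop["sigma_c"])  # (12)
    M_int = rng.normal(pop["M0"] + pop["alpha"] * x_int
                       + pop["beta"] * c_int + dM_s, pop["sigma_0"])          # (13)
    R_V = rng.normal(pop["mu_R"], pop["sigma_R"])                             # (17)
    A_V = rng.exponential(pop["tau"])                                         # (18)
    return (x_int,                             # (14): stretch unaffected by dust
            c_int + A_V / R_V,                 # (15): reddening
            M_int + A_V * (R_V + 1.0) / R_V)   # (16): B-band extinction
```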

Cosmological distance

The extrinsic absolute (rest-frame B-band) magnitude \({M}_{\mathrm{ext}}^{s}\) is then transformed into an apparent (still rest-frame B-band) magnitude ms through the usual distance-modulus relation:

$${m}^{s}={M}_{\mathrm{ext}}^{s}+{\mu }^{s}\,\,\,\,\,\mathrm{with}\,\,\,\,\,{\mu }^{s}=\mu ({z}_{{\rm{c}}}^{s},{\mathcal{C}}),$$

(19)

where \({z}_{{\rm{c}}}^{s}\) is the cosmological redshift of the supernova, and \({\mathcal{C}}\) are the cosmological parameters. In this study, we use Λ-cold dark matter, which is described by the present-day dimensionless densities of cold dark matter and dark energy in the form of a cosmological constant: \({\mathcal{C}}\equiv [{\varOmega }_{{\rm{m}}0},{\varOmega }_{\Lambda 0}]\). Just as when modelling galaxies, we disregard peculiar velocities, including those of the supernovae within their hosts, which are negligible at high redshifts and best accounted for at the level of LCs. To account for the motion of nearby objects (z ≲ 0.03), CIGaRS can be seamlessly extended with an appropriate Bayesian hierarchical model (for example, refs. 121,122,123), which would allow the incorporation of the parameters controlling large-scale structure formation124,125 within the unified framework.
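A numerical sketch of \(\mu (z,{\mathcal{C}})\) for a possibly curved Λ-cold dark matter model; fixing H0, which is degenerate with the absolute magnitude scale, is a simplifying assumption of this sketch:

```python
import numpy as np
from scipy.integrate import quad

C_KM_S, H0 = 299792.458, 70.0  # km/s and km/s/Mpc (H0 fixed by assumption)

def distance_modulus(z, Om0, OL0):
    """mu(z; C) for C = [Omega_m0, Omega_L0], allowing spatial curvature."""
    Ok0 = 1.0 - Om0 - OL0
    E = lambda x: np.sqrt(Om0 * (1 + x)**3 + Ok0 * (1 + x)**2 + OL0)
    D_C = (C_KM_S / H0) * quad(lambda x: 1.0 / E(x), 0.0, z)[0]  # comoving, Mpc
    if abs(Ok0) < 1e-8:                      # flat: transverse = line-of-sight
        D_M = D_C
    else:                                    # open (sinh) or closed (sin) geometry
        k = np.sqrt(abs(Ok0)) * H0 / C_KM_S
        D_M = (np.sinh if Ok0 > 0 else np.sin)(k * D_C) / k
    return 5.0 * np.log10((1.0 + z) * D_M) + 25.0  # D_L = (1+z) D_M, in Mpc
```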

Observables of type Ia supernovae

Observations of supernovae take the form of multi-band LCs: collections of flux measurements in different filters at irregularly spaced times, which vary from supernova to supernova. However, cosmological analyses are typically performed after the LCs have been summarized (independently of one another) by subtracting the constant contribution of the host and fitting an LC model, which produces parameter estimates \(\widehat{x}({\bf{LC}})\), \(\widehat{c}({\bf{LC}})\) and \(\widehat{m}({\bf{LC}})\) and fit uncertainties \({\widehat{\sigma }}_{x}({\bf{LC}})\), \({\widehat{\sigma }}_{c}({\bf{LC}})\) and \({\widehat{\sigma }}_{m}({\bf{LC}})\). It is then assumed that these summary statistics are related to the extrinsic supernova parameters \({x}_{{\rm{ext}}}^{s},{c}_{{\rm{ext}}}^{s}\) and ms by Gaussian sampling distributions:

$${\widehat{x}}^{s} \sim {\mathcal{N}}\left({x}_{\mathrm{ext}}^{s},{({\widehat{\sigma }}_{x}^{s})}^{2}\right),$$

(20)

$${\widehat{c}}^{s} \sim {\mathcal{N}}\left({c}_{\mathrm{ext}}^{s},{({\widehat{\sigma }}_{c}^{s})}^{2}\right),$$

(21)

$${\widehat{m}}^{s} \sim {\mathcal{N}}\left({m}^{s},{({\widehat{\sigma }}_{m}^{s})}^{2}\right).$$

(22)

As the uncertainties depend only on instrumental properties, like noise and cadence, their distributions can be robustly determined and treated as fixed simulator inputs. We used the model of Boyd et al.126, which is based on the results of current surveys and LSST simulations:

$$\ln{\widehat{\sigma }}_{x}^{s} \sim {\mathcal{N}}\left(-1.5,{0.5}^{2}\right),$$

(23)

$$\ln{\widehat{\sigma }}_{c}^{s} \sim {\mathcal{N}}\left(-3.5,{0.3}^{2}\right),$$

(24)

$$\ln{\widehat{\sigma }}_{m}^{s} \sim {\mathcal{N}}\left(0.1({m}^{s}-56),{0.6}^{2}\right).$$

(25)

We note that these values assume that the LC fits have been performed given a precise redshift measurement from spectroscopy of the supernova or the host. In its absence, for example, when inferring the redshift from the LC itself127, the fit uncertainties will be inflated, as both colour and (to a lesser extent) stretch are a posteriori correlated with redshift80 (Fig. 6). This would require adjusting the above distributions accordingly.
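The summary-statistic stage of the simulator therefore reduces to draws of the following form (a sketch; names ours):

```python
import numpy as np

def observe_sn(x_ext, c_ext, m, rng):
    """Noisy LC summaries and their uncertainties, following eqs. (20)-(25)."""
    s_x = np.exp(rng.normal(-1.5, 0.5))               # eq. (23)
    s_c = np.exp(rng.normal(-3.5, 0.3))               # eq. (24)
    s_m = np.exp(rng.normal(0.1 * (m - 56.0), 0.6))   # eq. (25)
    return (rng.normal(x_ext, s_x), s_x,              # eq. (20)
            rng.normal(c_ext, s_c), s_c,              # eq. (21)
            rng.normal(m, s_m), s_m)                  # eq. (22)
```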

It is also possible to extend this description to a dense covariance matrix that represents the systematic correlations between supernovae, as demonstrated in SICRET. However, with hierarchical modelling and SBI, these can, instead, be accounted for explicitly in the simulator, thus replacing the likelihood-centric covariance formulation. Issues related to summary statistics can also be circumvented altogether by extending the simulation and analysis to full LCs, as demonstrated in SIDE-real.

Detection and selection of type Ia supernovae

The detection and inclusion of supernovae in modern cosmological analyses is a complex process that considers the raw data quality, the goodness of LC fits and the estimated summaries. Using host observations (for example, to derive redshifts or to study host–type Ia supernova connections) relies on a further procedure that identifies and associates them correctly (for example, refs. 128,129). A big advantage of our framework is that arbitrary detection, selection and association (and classification) criteria can be straightforwardly integrated into the forward model and accounted for with SBI, as recently shown in STAR NRE.

As the present study does not specifically focus on selection effects, we adopt the same simple supernova detection and selection procedure as in STAR NRE (Section 3.2 in ref. 49). Thus, we derive from the expected LSST observing conditions a probability \(p({S}^{s}| {\widehat{m}}^{s},{z}^{s})\) that depends on the observed brightness of the supernova and its redshift (due to the different detection limits in the different (observer-frame) bands in which the peak occurs). We treat as ‘detected and selected’ all host–type Ia supernova pairs (‘objects’) where both Sh and Ss are sampled true. We label their count Nsel and treat it as an observable, as in STAR NRE. Finally, the output of the simulator is a set of length-12 vectors that combine the host- and supernova-related observables for each of the detected and selected objects:

$$D\equiv {\left\{\left({{\bf{d}}}^{h(i)},{\widehat{m}}^{i},{({\widehat{\sigma }}_{m}^{i})}^{2},{\widehat{x}}^{i},{({\widehat{\sigma }}_{x}^{i})}^{2},{\widehat{c}}^{i},{({\widehat{\sigma }}_{c}^{i})}^{2}\right)\right\}}_{i=1}^{{N}_{\mathrm{sel}}}.$$

(26)

Implementation details and simulated counts

Here we discuss two technical details of our forward simulator.

First, we note that Prospector relies on the cosmological model to map redshift to age: \(T({z}_{{\rm{c}}},{\mathcal{C}})\). In principle, this calculation can be repeated for each sampled cosmology in our simulator; however, we do not expect this to have a noticeable effect on our results, as type Ia supernovae are mainly informative of cosmological distances rather than times; that is, for values of \({\mathcal{C}}\) consistent with a given sample of type Ia supernovae, \(T({z}_{{\rm{c}}},{\mathcal{C}})\) does not vary significantly with \({\mathcal{C}}\). Therefore, we treat T(zc) as fixed to that consistent with the cosmology from the Wilkinson Microwave Anisotropy Probe130, as when training and creating Prospector-β.

The second point concerns the number of objects that we simulate, which, in principle, depends on (and, hence, is informative of) the astrophysical and cosmological models (for example, through a parameter S8), as well as the surveyed sky area (Ω). At present, our high-level supernova-focused simulator does not include these details, as supernova observations do not carry information about this connection, beyond the distribution of their redshifts, which we assume is extracted mainly from the hosts.

Hence, the purely host-related part of CIGaRS has no free global parameters up to the point where the redshifted (observer-frame) spectral flux

$${\varPhi }_{{\rm{o}}}^{h}({\lambda }_{{\rm{o}}})\equiv \frac{{\varPhi }^{h}({\lambda }_{{\rm{o}}}/(1+{z}^{h}))}{{(1+{z}^{{h}})}^{3}}$$

(27)

is converted to the spectral flux density through cosmological distance, as in equation (4). This means that we can generate a fixed ‘bank’ of galaxies and associated observables—noiseless absolute ugrizy magnitudes obtained by integrating equation (27) through the respective band-pass filter—to serve as potential hosts when simulating supernovae. We choose a bank size of 1,000,000 and note that this represents the total population of galaxies (Ngal).

However, as illustrated in Fig. 3, the distribution of transient hosts is significantly skewed towards high-mass galaxies due to the abundance of prospective progenitors within them. As these are a minority of the total population and we use a fixed simulation bank, we run the risk of repeatedly including the same objects in our training catalogues, which could cause the network to overfit. To prevent this, we modified the distribution from which we sampled galaxy properties when generating the simulation bank to more closely represent the final distribution of type Ia supernova hosts. Specifically, we modified the stellar mass function in Prospector-β:

$$p({M}_{* },{z}_{{\rm{c}}})\to \frac{{M}_{* }}{1+{z}_{{\rm{c}}}}p({M}_{* },{z}_{{\rm{c}}}),$$

(28)

motivated by equation (8). We then exactly undid this when calculating the number of expected supernovae within each host by dividing equation (8) by the same factor, which preserves \(\left\langle {N}_{{\rm{SN}}}\right\rangle\) as a function of mass and redshift.
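Schematically, the reweighting and its exact inverse are:

```python
def bank_weight(M_star, z_c):
    """Importance factor of eq. (28) applied when drawing the galaxy bank."""
    return M_star / (1.0 + z_c)

def corrected_expected_n_sn(expected_n_sn, M_star, z_c):
    """Divide eq. (8) by the same factor, exactly undoing the reweighting."""
    return expected_n_sn / bank_weight(M_star, z_c)
```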

Then, when compiling a mock survey, we calculated the apparent brightnesses of the galaxies (given \({\mathcal{C}}\)), added noise and evaluated detection as in section ‘Photometry and detection of hosts’. We then seeded only the detected and selected subset with type Ia supernovae as described in section ‘Occurrence of type Ia supernovae’. This is an implicit selection criterion on the supernovae: we only considered supernovae for which we can observe and uniquely identify the host (with certainty); for our (current) method to be applicable, we need to apply the same cut to the real data, but this is usually not a stronger requirement than the cuts typically applied on the supernova data. For the default cosmological model and DTD (Extended Data Table 1), only about 39% of the galaxies are selected, and of these, the type Ia supernova rate is ~2,500 yr−1, of which about 1/3 pass detection and selection (for the default population parameters and our toy model of LSST).

Last, we need to set a survey duration (\({\mathcal{T}}\) in equation (8)), but this is only well defined in combination with Ω or a physically meaningful Ngal, which we replaced with the fixed size of the galaxy ‘bank’. Therefore, we set an arbitrary scaling in equation (8) so as to achieve (on expectation) a given number of detected and selected objects, for example, 1,600 or 16,000. Importantly, as Ω and \({\mathcal{T}}\) are well known for real surveys, we can easily extend the simulator to properly calculate the galaxy and type Ia supernova counts when analysing real data. In the present set-up, we can apply the same scaling when generating training data as for the test mocks.

Set-based TMNRE for hierarchical models

Bayesian hierarchical models feature a large number of free parameters, which scales with the number of objects observed. Accurate, precise and fast inference from future data is, thus, a considerable computational challenge that likelihood-based methods are ill-suited to address. Moreover, the likelihood requires explicit calculations of intractable probabilities, often leading to ad hoc approximations. SBI addresses all these issues—scalability, realism and rigour—by delivering marginal Bayesian posteriors, given data D0, for any parameters of interest θ, implicitly integrated over all nuisance parameters (ν) and relevant stochastic processes (for example, selection and other systematic effects):

$$p({\boldsymbol{\theta }}| {\mathbf{D}}_{0})\propto p({\boldsymbol{\theta }})p({\mathbf{D}}_{0}| {\boldsymbol{\theta }})=p({\boldsymbol{\theta }})\int p({\mathbf{D}}_{0}| {\boldsymbol{\nu }},{\boldsymbol{\theta }})\,p({\boldsymbol{\nu }}| {\boldsymbol{\theta }})\,{\rm{d}}{\boldsymbol{\nu }}.$$

(29)

SBI requires only samples from the prior (p(θ)) and the marginal likelihood (p(Dθ)), which are provided by a stochastic forward simulator that represents the Bayesian joint model p(θ, D). There are different techniques (flavours of neural SBI) for using simulated pairs (θ, D) to train a NN and later perform inference from observed data D0. We adopt the approach called neural ratio estimation74, as it offers the greatest freedom in the choice of the NN architecture and in choosing priors for training and evaluation. It trains a network \(\widehat{\mathrm{r}}({\boldsymbol{\theta }},\mathbf{D})\) to approximate the single real number

$$r({\boldsymbol{\theta }},\mathbf{D})\equiv \frac{p({\boldsymbol{\theta }},\mathbf{D})}{p({\boldsymbol{\theta }})p(\mathbf{D})}=\frac{p({\boldsymbol{\theta }}| \mathbf{D})}{p({\boldsymbol{\theta }})}$$

(30)

by minimizing the binary cross-entropy loss commonly used for classification tasks. Once trained, \(\widehat{\mathrm{r}}({\boldsymbol{\theta }},{\mathbf{D}}_{0})\) evaluated with the observed data can simply be multiplied by the prior or used to reweight prior samples to represent the target posterior, according to the second equality in equation (30).
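A minimal PyTorch sketch of this training objective, in which log_r stands for any of the ratio networks, mapping a batch of parameter-data pairs to logits that converge to \(\ln r\) at the optimum:

```python
import torch
import torch.nn.functional as F

def nre_loss(log_r, theta, data):
    """Binary cross-entropy loss for neural ratio estimation.

    Pairs drawn jointly, (theta, D) ~ p(theta, D), are labelled 1; pairs with
    theta shuffled across the batch, ~ p(theta)p(D), are labelled 0.
    """
    logit_joint = log_r(theta, data)
    logit_marginal = log_r(theta[torch.randperm(len(theta))], data)
    return (F.binary_cross_entropy_with_logits(logit_joint,
                                               torch.ones_like(logit_joint))
            + F.binary_cross_entropy_with_logits(logit_marginal,
                                                 torch.zeros_like(logit_marginal)))
```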

In principle, θ can represent any group of parameters that we wish to derive a posterior for. However, scientific interpretations are usually based on one- or two-dimensional marginal distributions, and consequently, we will define the following groups of (global) parameters of interest γg:

  • one-dimensional: \({M}_{0},{\sigma }_{0},\alpha ,\beta ,{\alpha }_{c},{M}_{* }^{{\rm{step}}}\);

  • two-dimensional: [A, b] for the DTD and [Ωm0, ΩΛ0] for cosmology;

  • two-dimensional: \([\Delta M,{\gamma }_{{Z}_{* }}]\), \([\Delta M,{\gamma }_{{t}_{* }}]\), and \([{\gamma }_{{Z}_{* }},{\gamma }_{{t}_{* }}]\), forming all combinations of the host–type Ia supernova connection parameters: a so-called corner plot.

For each of them, we train—simultaneously—separate ratio estimators \({\widehat{{\rm{r}}}}_{g}\), following shared data preprocessing.

Object-specific parameters

In addition to the global parameters, we will—simultaneously—train ratio estimators for all \({\mathcal{O}}\left({N}_{\mathrm{sel}}\right)\) object-specific (local) parameters in the hierarchical model (here we will use the unified label λi, which represents gh and λs in CIGaRS collectively). Although we can, as before, form local-parameter groups λg from parameters for the same object, we will infer each object-specific parameter λg ∈ {zc, M*, Z*, δ, τ2, xint, cint} marginally. Although in Fig. 6 we show only a subset of the results that represent scientific interest in the present study, estimating all local parameters during training helps extract informative features and ultimately improves the inference of the global parameters.

In SICRET and SIDE-real, we demonstrated simultaneous marginal inference of all {λi} for models in which the dataset size Nsel was known a priori. That is, it could be fixed after a survey is performed because selection effects were not considered. Here we extend the approach to simulators that produce datasets with various Nsel, which introduces two complications.

The first concerns identifying the objects. Within Bayesian hierarchical modelling, the {λi} are a priori independent and identically distributed, conditional on the set of global parameters γ, and each influences the sampling distribution of only one observed ‘object’; that is, we have the general model structure:

$$p({\boldsymbol{\gamma }},\{{\lambda }^{i}\},\{{{\bf{d}}}^{i}\})=\left[\mathop{\prod }\limits_{i=1}^{{N}_{\mathrm{sel}}}p({{\bf{d}}}^{i}| {\lambda }^{i},{\boldsymbol{\gamma }})p({\lambda }^{i}| {\boldsymbol{\gamma }})\right]p({\boldsymbol{\gamma }}).$$

(31)

Previously, we used a fixed ordered collection of auxiliary variables \({[{{\bf{a}}}^{i}]}_{i=1}^{{N}_{{\rm{sel}}}}\) to disambiguate the assignment of labels i, which effectively individualized the sampling distributions:

$$p({{\bf{d}}}^{i}| {\lambda }^{i},{\boldsymbol{\gamma }})\to p({{\bf{d}}}^{i}| {{\bf{a}}}^{i},{\lambda }^{i},{\boldsymbol{\gamma }})\to {p}_{i}({{\bf{d}}}^{i}| {\lambda }^{i},{\boldsymbol{\gamma }}).$$

(32)

However, the present simulator produces unordered (exchangeable) sets of a priori undetermined sizes, and so the auxiliary variables need to become an output of the model, that is be incorporated into the observable di. Indeed, CIGaRS explicitly models the host photometry (from which the redshift is ultimately derived) and the observational (fitted) uncertainties, which comprised ai in SICRET and SIDE-real. Thus, we can treat λi and di as realizations of singular random variables λ and d:

$$\left[\mathop{\prod }\limits_{i}p({\bf{d}}={{\bf{d}}}^{i}| {\boldsymbol{\lambda }}={\lambda }^{i},{\boldsymbol{\gamma }})p({\boldsymbol{\lambda }}={\lambda }^{i}| {\boldsymbol{\gamma }})\right]p({\boldsymbol{\gamma }}),$$

(33)

rather than collections of separate (independent and identically distributed) random variables, which allows us to train a single network to represent all posteriors:

$$p({\lambda }^{i}| {\mathbf{D}})\approx p({\boldsymbol{\lambda }}={\lambda }^{i}| {\bf{d}}={{\bf{d}}}^{i}),$$

(34)

where we have ignored the dependence on the full {di}, as it is approximately redundant with the truncation of global parameters, as argued for SICRET.

Second, a technical complication arises as we need training pairs

$${\boldsymbol{\lambda }},{\bf{d}} \sim p({\boldsymbol{\lambda }},{\bf{d}})=\int \,p({\boldsymbol{\lambda }},{\bf{d}}| {\boldsymbol{\gamma }})p({\boldsymbol{\gamma }})\,{\rm{d}}{\boldsymbol{\gamma }},$$

(35)

but the simulator samples in proportion to \(\left\langle {N}_{\mathrm{sel}}\right\rangle({\boldsymbol{\gamma }}) \times p({\boldsymbol{\gamma }})\) rather than simply p(γ). This is not a problem when Nsel is a fixed input and can otherwise be rectified by randomly selecting a fixed number Nsub of objects from the simulator output. In fact, Nsub := 1 would be the natural choice in line with the above treatment of {λi} and {di} as realizations of λ and d, and we chose this when generating a validation set, which also represents the prior p(λ). Still, by setting a larger Nsub, we can cheaply (without extra simulation) enlarge the effective batch size for training local-parameter inference networks, and so we chose Nsub = 100 in this case. Finally, when evaluating the results for the Nsel,0 objects in D0, we simply skipped the subselection and used the full set (Nsub := Nsel,0).

Network architecture: conditioned deep set++

The NN we use (depicted, together with details of the sizes of its various layers, in Fig. 4) is based on the conditioned deep-set architecture from STAR NRE (following Zaheer et al.76) but augmented in depth as in Zhang et al.75 due to the sophistication of the simulator and the inference tasks and with the addition of local-parameter estimators as in SICRET and SIDE-real.

Given a dataset D ≡ {di} with cardinality Nsel, we first preprocess each element di with a small feed-forward network to (automatically) derive (nonlinear) ‘features’:

$${{\mathtt{d}}}^{i}={\mathtt{DataPre}}({{\bf{d}}}^{i}),$$

(36)

which are generally useful across all inference tasks. These may include galaxy colours, (fiducially) standardized supernova magnitudes, observational uncertainty (from the provided variances) and estimates of object-specific parameters directly usable in the respective downstream ratio estimators. Although they are not forced to be directly ‘meaningful’ to a human scientist, we plan to explore and interpret these features in future work. The {di} are then used as the input for all ratio estimators.

Following Zaheer et al.’s representation theorem76 and the considerations laid out in STAR NRE, the estimator for a group of global parameters θg makes use of two feed-forward networks \({\widehat{\phi }}_{{\rm{g}}}\) and \({\widehat{\rho }}_{{\rm{g}}}\) and takes the form:

$$\ln\,{\widehat{\mathrm{r}}}_{{\rm{g}}}({\boldsymbol{\theta}}_{{\rm{g}}},\mathbf{D})\equiv {\widehat{\rho }}_{{\rm{g}}}({\boldsymbol{\theta}}_{{\rm{g}}},{\mathtt{S}}_{{\rm{g}}},{N}_{\mathrm{sel}}),$$

(37)

with

$${\mathtt{S}}_{{\rm{g}}}\equiv \frac{1}{{N}_{\mathrm{sel}}}\mathop{\sum}\limits_{i}{\hat{\phi}}_{{\rm{g}}}\left({\boldsymbol{\theta}}_{{\rm{g}}},\frac{{{\bf{d}}}^{i}-{\mathtt{m}}}{{\mathtt{s}}},{\mathtt{m}},{\mathtt{s}}\right),$$

(38)

where m and s are, respectively, the mean and standard deviation of the {di}. This implements the SetNorm operation advocated by Zhang et al.75 and explicitly preserves the information contained in the set moments m and s. For expressivity (which allows the set aggregation to be a simple averaging operation), \({\widehat{\phi }}_{{\rm{g}}}\) is a deep residual network that implements ‘skip connections’ as proposed by Zhang et al.75 after combining its inputs \((\boldsymbol{\theta}_{\rm{g}},\frac{{\bf{d}}^{i}-{\mathtt{m}}}{\mathtt{s}},{\mathtt{m}},{\mathtt{s}})\) into one vector through a single hidden network layer (Fig. 4). Finally, \({\widehat{\rho }}_{{\rm{g}}}\) similarly combines its inputs (θg, Sg, Nsel) and feeds them through a few fully connected layers to output the single number \(\ln\,{\widehat{{\rm{r}}}}_{{\rm{g}}}(\boldsymbol{\theta }_{{\rm{g}}},\mathbf{D})\).
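A compact PyTorch sketch of one such global ratio estimator follows; the layer sizes are illustrative rather than those of Fig. 4, and the residual skip connections are omitted for brevity:

```python
import torch
import torch.nn as nn

class GlobalNRE(nn.Module):
    """Deep-set ratio estimator for one parameter group, following eqs. (37)-(38)."""

    def __init__(self, d_dim, theta_dim, width=128):
        super().__init__()
        self.phi = nn.Sequential(                     # per-object featurizer
            nn.Linear(theta_dim + 3 * d_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU())
        self.rho = nn.Sequential(                     # head on the aggregate
            nn.Linear(theta_dim + width + 1, width), nn.ReLU(),
            nn.Linear(width, 1))

    def forward(self, theta, D):
        """theta: (theta_dim,); D: (N_sel, d_dim). Returns ln r_hat(theta, D)."""
        m, s = D.mean(0), D.std(0)                    # set moments (SetNorm)
        N = D.shape[0]
        x = torch.cat([theta.expand(N, -1), (D - m) / s,
                       m.expand(N, -1), s.expand(N, -1)], dim=1)
        S_g = self.phi(x).mean(0)                     # eq. (38): simple averaging
        return self.rho(torch.cat([theta, S_g, torch.tensor([float(N)])])).squeeze()

log_r = GlobalNRE(d_dim=12, theta_dim=2)(torch.zeros(2), torch.randn(50, 12))
```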

The ratio estimator for a local or object-specific parameter λg is simpler, as it does not need to aggregate information across the set, and so it consists of a single feed-forward network:

$$\ln\,{\widehat{{\rm{r}}}}_{{\rm{g}}}({\lambda }_{{\rm{g}}},{{\bf{d}}}^{i})={\mathtt{LocalNRE}}_{{\rm{g}}}({\lambda }_{{\rm{g}}},{{\bf{d}}}^{i}),$$

(39)

which can be applied to all Nsub input data in parallel.

Truncation and fine-tuning

Neural SBI is amortized: once the network is trained, results can be quickly derived from numerous simulated (or real, if available) data realizations. While this can be exploited to verify and calibrate inference (see, for example, refs. 57,58), that would require training over a wide range of data, which can be inefficient if one is interested only in the results for a single dataset D0.

In such cases, the simulation budget, training time and network capacity can be optimized through various sequential SBI strategies, which modify (either continuously or in a succession of stages) the prior \(p({\boldsymbol{\theta }})\to \widetilde{p}({\boldsymbol{\theta }})\) from which parameters are drawn, based on intermediate results during training. A simple yet effective prescription is prior truncation73, in which the shape or form of \(\widetilde{p}({\boldsymbol{\theta }})\) is unchanged, but its support is restricted to a region T(D0) in which the posterior density is significantly different from zero:

$${\boldsymbol{\theta }}\notin T({D}_{0})\,\Rightarrow \,p({\boldsymbol{\theta }}| {D}_{0})\approx 0,$$

(40)

as approximated by a previously trained network evaluated at D0. Thus, the only modification needed is a trivial renormalization to account for the excluded prior probability mass:

$$p({\boldsymbol{\theta }})\to {\widetilde{p}}_{T({\mathbf{D}}_{0})}({\boldsymbol{\theta }})=\left\{\begin{array}{ll}\frac{p({\boldsymbol{\theta }})}{{\int }_{T({\mathbf{D}}_{0})}p({\boldsymbol{\theta }})\,{\rm{d}}{\boldsymbol{\theta }}} & \mathrm{if}\,{\boldsymbol{\theta }}\in T({\mathbf{D}}_{0}),\\ 0 & \mathrm{otherwise},\end{array}\right.$$

(41)

which leaves the posterior (given D0) unchanged while restricting simulated examples to more closely resemble D0.

We apply truncation separately for each global inference group. In one dimension, we use a simple contiguous interval that contains 99.99% approximate credibility and then sample training examples by analytically modifying the (simple, whether uniform or normal) prior. For two-dimensional groups, we either define an iso-likelihood contour (which again contains 99.99% of approximate posterior mass) and use rejection sampling within it or apply one-dimensional truncation separately for the two parameters (we truncate to an axis-aligned rectangular region) when they are not strongly correlated. The truncation regions are illustrated with grey shading in Fig. 5.
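A sketch of the one-dimensional case, which estimates a central 99.99% credible interval from ratio-weighted prior samples (names ours):

```python
import numpy as np

def truncate_1d(prior_samples, log_r, level=0.9999):
    """Central interval holding `level` of approximate posterior mass.

    prior_samples: draws of one parameter from its prior; log_r: the trained
    ln r_hat evaluated at those draws with the observed data D_0.
    """
    w = np.exp(log_r - np.max(log_r))            # importance weights ∝ posterior
    order = np.argsort(prior_samples)
    cdf = np.cumsum(w[order]) / np.sum(w)
    q_lo, q_hi = (1 - level) / 2, 1 - (1 - level) / 2
    lo = prior_samples[order][np.searchsorted(cdf, q_lo)]
    hi = prior_samples[order][min(np.searchsorted(cdf, q_hi), len(cdf) - 1)]
    return lo, hi
```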

Once new training data are generated, we resume—instead of re-initializing—training, which means that the network is simply fine-tuned to give more accurate results in the ‘zoomed-in’ parameter space, instead of being forced to relearn the analysis from scratch. This is especially relevant for the preprocessor and deep-set featurizers (described above), whose computations should not appreciably depend on the parameter ranges.

Training details

Analysing the mock data with ~1,600 objects required two rounds of truncation (starting from the priors in Extended Data Table 1 and with the network in Fig. 4) for the results to converge (judged by the difference between posteriors from successive rounds). In each round, we generated 64,000 full example datasets (with 6,400 more for validation and to represent the priors when evaluating results) and trained using the Adam optimizer131 with learning rate 10−3 (reduced to 10−4 when fine-tuning) for at most 200 epochs, although the loss typically plateaus earlier. On two NVIDIA A100 GPUs with a combined mini-batch size of 64, one stage takes ~24 h, with most of the computation taken up by backpropagating through all 11 deep sets (for the separate global-parameter groups).

For the larger (Nsel ≈ 16,000) dataset, we took the final network from above and fine-tuned it on simulations from the same final truncated prior region but with a tenfold increase in output size (on expectation). As the deep-set architecture processes the full dataset at once, training it now required roughly ten times more compute and memory, so we used eight GPUs and a batch size of 32, which again required 24 h per truncation stage (we needed only one in this case).

Finally, for each global-parameter group γg, we picked the checkpoint with the lowest validation loss for the specific group, whereas for object-specific parameters λg, we picked the best checkpoint overall (lowest loss averaged over all groups). To represent the posteriors, we evaluated \(\widehat{\mathrm{r}}({{\boldsymbol{\theta }}}_{{\rm{g}}},\mathbf{D}_{0})\) over the parameter samples in the validation set, which represent p(θg), and used the results as importance weights for plotting contours and calculating posterior moments.
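The final reweighting step is then simply (a sketch for a one-dimensional group):

```python
import numpy as np

def posterior_moments(theta_samples, log_r):
    """Posterior mean and standard deviation from prior samples reweighted
    by the trained ratio, following eq. (30)."""
    w = np.exp(log_r - np.max(log_r))
    w /= np.sum(w)
    mean = np.sum(w * theta_samples)
    return mean, np.sqrt(np.sum(w * (theta_samples - mean)**2))
```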