Data

We use the equal-area HEALPix grid for all of our models. We index HEALPix resolution by a discrete zoom level \(z\in {{\mathbb{Z}}}_{\ge 0}\) with

$${N}_{{\rm{side}}}(z)={2}^{z},\,{N}_{{\rm{pix}}}(z)=12\cdot {4}^{z},$$

(1)

so that each pixel at level z subdivides into four children at z + 1. Unless otherwise stated, we use z = 8 (Nside = 256, Npix = 786,432), which provides an effective angular resolution comparable to the ERA5 0.25° grid while guaranteeing equal-area cells and iso-latitude rings27. We denote the coarsest and finest HEALPix levels used in the model by \({z}_{\min }\) and \({z}_{{\rm{in}}}\).
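The grid sizes in Eq. (1) can be checked with a few lines of Python, using only the closed-form expressions above (no HEALPix library required):

```python
def n_side(z: int) -> int:
    # N_side(z) = 2^z
    return 2 ** z

def n_pix(z: int) -> int:
    # N_pix(z) = 12 * 4^z, i.e. 12 * N_side(z)^2
    return 12 * 4 ** z

# At z = 8, the grid matches the resolution quoted in the text.
assert n_side(8) == 256
assert n_pix(8) == 786_432
# Each pixel at level z subdivides into exactly four children at z + 1.
assert n_pix(9) == 4 * n_pix(8)
```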

We use ECMWF ERA5 reanalysis at hourly cadence and aggregate each variable to daily means before any normalization or train/test splits. All models are trained/evaluated on ERA5 remapped to the HEALPix sphere at z = 8.

All models are trained on 1940–2021 and evaluated on a temporal hold-out from 2022 to 2025 (through April 2025). No evaluation-year samples are used during training, and all preprocessing statistics (below) are computed exclusively on the training period.

To enable hybrid training and evaluate the model’s ability to bridge climate model resolutions, we utilize data from the Max Planck Institute Earth System Model (MPI-ESM1.2) in its High-Resolution (HR) configuration39. We select a 10-member ensemble of historical simulations covering the period 1940–2014, which follow the CMIP6 historical forcing protocols. The atmospheric component (ECHAM6.3) of the HR configuration operates at a horizontal resolution of T127 (approximately 100 km or 0.93° at the equator) with 95 vertical levels. For consistency with our geometric architecture, we regrid the native spectral output to the HEALPix mesh at level 6 (Nside = 64), following the same remapping process as for the ERA5 dataset. The level 6 resolution corresponds to approximately 100 km, which serves as the coarse-grain input for our downscaling and hybrid diffusion experiments. The Compressed Field Diffusion results were produced from a one-year prediction and compared with the corresponding MPI-ESM1.2-HR simulations for the year 1940.

Daily aggregation is followed by a per-variable, percentile-based scaling using only training data. Let \(p0{1}_{v}\) and \(p9{9}_{v}\) denote the 1st and 99th percentiles of variable v on the training split. We scale

$${\widetilde{x}}^{(v)}=\frac{{x}^{(v)}-p0{1}_{v}}{p9{9}_{v}-p0{1}_{v}},$$

(2)

without clipping; values outside \([p0{1}_{v},p9{9}_{v}]\) may map outside [0, 1]. After inference/decoding, we invert the same transform to recover physical units. All error metrics are reported in physical units. On HEALPix (equal-area) we use unweighted spatial means, which are area-consistent by construction.
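A minimal sketch of this normalization, with hypothetical function names (percentiles are fit on the training split only, and the same transform is inverted after decoding):

```python
import numpy as np

def fit_percentile_scaler(train_values, lo=1.0, hi=99.0):
    """Compute the 1st/99th percentiles of one variable on the training split."""
    return np.percentile(train_values, lo), np.percentile(train_values, hi)

def scale(x, p01, p99):
    # Eq. (2): no clipping, so values may fall outside [0, 1].
    return (x - p01) / (p99 - p01)

def unscale(x_tilde, p01, p99):
    # Inverse transform back to physical units.
    return x_tilde * (p99 - p01) + p01

train = np.random.default_rng(0).normal(size=10_000)
p01, p99 = fit_percentile_scaler(train)
x = np.array([train.min(), 0.0, train.max()])  # extremes map outside [0, 1]
assert np.allclose(unscale(scale(x, p01, p99), p01, p99), x)
```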

Model architecture

We adopt the multi-scale decomposition, scale conservation, and field-space attention design introduced in the published work “Field-space attention for structure-preserving earth system transformers”26. Our implementation mirrors that approach, and we refer readers to Witte et al.26 for complete derivations, ablations, and implementation details. We introduce Field-Space Compression and Field-Space Decompression as techniques complementary to the Field-Space Attention of Witte et al.26.

Each Compression and Decompression block performs a linear mapping per patch across selected zoom levels, using a user-defined patch zoom \({z}_{P}\) that sets the patch size. A patch anchored at zP contains

$${n}_{{\rm{pix}}}(z| {z}_{P})=\left\{\begin{array}{ll}{4}^{(z-{z}_{P})}, & z\ge {z}_{P}\,\,{\rm{(children\; grouped\; into\; channels)}}\,,\\ 1, & z < {z}_{P}\,\,\,\,{\rm{(ancestor\; broadcast)}}\end{array}\right.$$

(3)

pixels from level z. For generating the input patches, let \({{\mathcal{Z}}}_{r}\subseteq \{{z}_{\min },\ldots ,{z}_{{\rm{in}}}\}\) be the set of input residual levels r(z), which are supplied together with one coarse mean level \({x}^{({z}_{c})}\). For each token on the zP-grid, we reshape each included level onto the patch and concatenate along the channel axis:

$${P}^{({z}_{P})}=\,{\rm{Concat}}\,\left(\{{{\mathcal{R}}}_{z\to {z}_{P}}({r}^{(z)}):z\in {{\mathcal{Z}}}_{r}\}\,\cup \,\{{{\mathcal{R}}}_{{z}_{c}\to {z}_{P}}({x}^{({z}_{c})})\}\right)\in {{\mathbb{R}}}^{B\times {N}_{{z}_{P}}\times {C}_{{\rm{in}}}},$$

(4)

with

$${C}_{{\rm{in}}}=\mathop{\sum }\limits_{z\in {{\mathcal{Z}}}_{r}\cup \{{z}_{c}\}}{n}_{{\rm{pix}}}(z| {z}_{P}),\,{{\mathcal{R}}}_{z\to {z}_{P}}(\cdot )\in {{\mathbb{R}}}^{B\times {N}_{{z}_{P}}\times {n}_{{\rm{pix}}}(z| {z}_{P})}.$$

(5)

We keep one channel per physical field, C = 1; scale mixing occurs only within \({P}^{({z}_{P})}\). For compression/downsampling, let \({z}_{\max }=\max {{\mathcal{Z}}}_{r}\) be the highest residual level provided to the block, and set the target to the next coarser level \({z}_{\downarrow }={z}_{\max }-1\). A shared linear map is applied per patch:

$${H}^{\Downarrow }\,=\,{W}^{\Downarrow }{P}^{({z}_{P})}+{b}^{\Downarrow }\,\in \,{{\mathbb{R}}}^{B\times {N}_{{z}_{P}}\times {C}_{{\rm{out}}}^{\Downarrow }},\,{W}^{\Downarrow }\in {{\mathbb{R}}}^{{C}_{{\rm{out}}}^{\Downarrow }\times {C}_{{\rm{in}}}},$$

(6)

with the number of output channels equal to the number of pixels in the target level’s patch:

$${C}_{{\rm{out}}}^{\Downarrow }={n}_{{\rm{pix}}}({z}_{\downarrow }| {z}_{P})={4}^{\max (0,{z}_{\downarrow }-{z}_{P})}.$$

(7)

A reshape distributes the patch outputs onto the z↓-grid:

$${\widehat{r}}^{({z}_{\downarrow })}={{\mathcal{H}}}_{{z}_{P}\to {z}_{\downarrow }}\left({H}^{\Downarrow }\right)\,\in \,{{\mathbb{R}}}^{B\times {N}_{{z}_{\downarrow }}\times 1},$$

(8)

where \({{\mathcal{H}}}_{{z}_{P}\to z}\) is the canonical HEALPix parent-child unstacking satisfying \({N}_{{z}_{P}}\cdot {n}_{{\rm{pix}}}(z| {z}_{P})={N}_{z}\). All levels strictly lower than z↓ (e.g., zc) are forwarded unchanged together with \({\widehat{r}}^{({z}_{\downarrow })}\); the highest level \({z}_{\max }\) is removed at this step.
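As a sketch (not the authors' implementation), the per-patch linear map of Eqs. (6)–(8) reduces to a single batched matrix multiply followed by a reshape; the reshape realizes \({{\mathcal{H}}}_{{z}_{P}\to {z}_{\downarrow }}\) under HEALPix nested ordering, in which the children of each pixel are stored contiguously:

```python
import numpy as np

B = 2                        # batch size
z_p, z_down = 6, 7           # patch zoom z_P and target level z↓
n_tokens = 12 * 4 ** z_p     # N_{z_P}
c_in, c_out = 21, 4          # Eq. (5) for levels {6, 7, 8}; Eq. (7): 4^(7-6)

rng = np.random.default_rng(0)
W = rng.normal(size=(c_out, c_in)) * 0.01   # shared across all patches
b = np.zeros(c_out)
P = rng.normal(size=(B, n_tokens, c_in))    # stacked patches, Eq. (4)

H = P @ W.T + b                             # (B, N_{z_P}, c_out), Eq. (6)
r_down = H.reshape(B, n_tokens * c_out, 1)  # unstack onto the z↓ grid, Eq. (8)
assert r_down.shape == (B, 12 * 4 ** z_down, 1)
```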

Example (see Fig. 1b): If zP = 6 and inputs are (z = 6, 7, 8) residuals, then

$${C}_{{\rm{in}}}={n}_{{\rm{pix}}}(6| 6)+{n}_{{\rm{pix}}}(7| 6)+{n}_{{\rm{pix}}}(8| 6)=1+4+16=21,$$

(9)

and compressing 8 → 7 yields \({C}_{{\rm{out}}}^{\Downarrow }={n}_{{\rm{pix}}}(7| 6)=4\). This realizes the mapping {6, 7, 8} → {7}.
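The channel counts in this example follow directly from Eq. (3); a small helper (hypothetical name) makes the bookkeeping explicit:

```python
def n_pix_in_patch(z: int, z_p: int) -> int:
    # Eq. (3): children grouped into channels for z >= z_P, ancestor broadcast below.
    return 4 ** (z - z_p) if z >= z_p else 1

# Compression example of Eq. (9): input levels {6, 7, 8} on a z_P = 6 patch.
c_in = sum(n_pix_in_patch(z, 6) for z in (6, 7, 8))
assert c_in == 21                   # 1 + 4 + 16
assert n_pix_in_patch(7, 6) == 4    # C_out for the 8 -> 7 compression
```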

For decompression/upsampling, given inputs up to \({z}_{\max }\), set \({z}_{\uparrow }={z}_{\max }+1\). The per-patch linear map produces the target-level patch, which is then scattered to the finer grid:

$${H}^{\Uparrow }\,=\,{W}^{\Uparrow }{P}^{({z}_{P})}+{b}^{\Uparrow }\,\in \,{{\mathbb{R}}}^{B\times {N}_{{z}_{P}}\times {C}_{{\rm{out}}}^{\Uparrow }},\,{C}_{{\rm{out}}}^{\Uparrow }={n}_{{\rm{pix}}}({z}_{\uparrow }| {z}_{P}),$$

(10)

$${\widehat{r}}^{({z}_{\uparrow })}={{\mathcal{H}}}_{{z}_{P}\to {z}_{\uparrow }}\left({H}^{\Uparrow }\right)\,\in \,{{\mathbb{R}}}^{B\times {N}_{{z}_{\uparrow }}\times 1}.$$

(11)

All levels at or below \({z}_{\max }\) pass through unchanged and are concatenated with \({\widehat{r}}^{({z}_{\uparrow })}\) for the next layer.

Example (see Fig. 1d): With zP = 6 and inputs at {z = 6, 7}, we have

$${C}_{{\rm{in}}}={n}_{{\rm{pix}}}(6| 6)+{n}_{{\rm{pix}}}(7| 6)=1+4=5,$$

(12)

and upsampling 7 → 8 yields \({C}_{{\rm{out}}}^{\Uparrow }={n}_{{\rm{pix}}}(8| 6)=16\). This realizes the mapping {6, 7} → {8}.
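Decompression is the same per-patch linear map with the channel roles reversed; a sketch assuming HEALPix nested ordering (so the 16 level-8 descendants of each token are contiguous):

```python
import numpy as np

z_p, z_up = 6, 8             # patch zoom z_P and target level z↑
n_tokens = 12 * 4 ** z_p     # N_{z_P}
c_in, c_out = 5, 16          # Eq. (12) for inputs {6, 7}; Eq. (10): n_pix(8 | 6)

rng = np.random.default_rng(0)
W = rng.normal(size=(c_out, c_in)) * 0.01
P = rng.normal(size=(1, n_tokens, c_in))

r_up = (P @ W.T).reshape(1, n_tokens * c_out, 1)   # Eq. (11)
assert r_up.shape == (1, 12 * 4 ** z_up, 1)
```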

Models

The HEALPix Convolution Autoencoder family operates directly on the raw input fields on the HEALPix grid. Its encoder and decoder follow the ResNet-style architecture of Rombach et al., with residual blocks that consist of two convolutional layers with normalization and nonlinearity, wrapped in a skip connection. We adapt the convolutions in the architecture to operate on the HEALPix grid. This equal-area discretization treats all points uniformly, avoiding polar distortions by respecting the globe’s spherical geometry. In the encoder, each spatial downsampling stage comprises a single identity ResNet block (preserving spatial resolution) followed by a downsampling ResNet block that applies average pooling, reducing the spatial resolution by a factor of 4 to the next lower HEALPix discretization. Symmetrically, in the decoder, each upsampling stage consists of a single identity ResNet block followed by an upsampling ResNet block using nearest-neighbour interpolation, increasing the spatial resolution by a factor of 4 to the next higher HEALPix discretization. Each model implements N + 1 such downsampling stages in the encoder and N + 1 upsampling stages in the decoder; in the bottleneck, a global spatial self-attention block (operating on the flattened spatial dimensions) is applied once after the encoder and once before the decoder. The effective compression depth is parameterized by N, with compression ratio f = 4N. For all models, the latent bottleneck has 4 channels per variable.
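On the HEALPix grid, the factor-4 average pooling used by the downsampling ResNet blocks reduces to a mean over each pixel's four nested children; a minimal sketch, assuming nested ordering and a channels-last layout:

```python
import numpy as np

def healpix_avg_pool(x):
    """Average over the 4 nested children -> next coarser HEALPix level."""
    b, n, c = x.shape
    return x.reshape(b, n // 4, 4, c).mean(axis=2)

x = np.random.default_rng(0).normal(size=(2, 12 * 4 ** 8, 3))  # fields at z = 8
y = healpix_avg_pool(x)                                        # fields at z = 7
assert y.shape == (2, 12 * 4 ** 7, 3)
```

N + 1 such stages give the stated compression ratio f = 4^N relative to the input resolution of the latent grid.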

The Field-Space Convolution Autoencoders accept as input the mean field at a coarse HEALPix level together with three residual levels from the multi-scale decomposition described by Witte et al.26. For compression ratios 16, 64, and 256, the mean field is provided at level z = 3, and for compression ratio 1024 at level z = 1. In all cases, residuals are taken at levels z = 6, 7, 8 on HEALPix. All Field-Space Conv models initially apply a Field-Space compression layer that maps the levels z = 6, 7, 8 to level z = 6 with 128 channels. Afterwards, each model implements N − 1 downsampling stages in the encoder and N + 1 upsampling stages in the decoder, alongside a global spatial self-attention block after the encoder and before the decoder (similar to the HEALPix Conv models). The final layer is a Field-Space decompression layer that maps level z = 6 back to the initial input levels z = 6, 7, 8. We observed that the initial mapping to level z = 6 provided better results than mapping the input levels to z = 8 and running an equivalent model from level z = 8, while also being significantly more computationally efficient. For all models, the latent bottleneck has 4 channels per variable.

The Field-Space Transformer Autoencoders utilize the same multi-scale input as defined in the previous section. Each variant implements N Field-Space compression blocks in the encoder, preceded by two Field-Space Attention blocks, and N Field-Space decompression blocks in the decoder, followed by two Field-Space Attention blocks. In addition, a small stack of Field-Space Attention blocks is placed in the bottleneck: two Field-Space Attention blocks following the encoder and two preceding the decoder. The number of encoder/decoder stages N is tied to the compression factor via the bottleneck zoom level zbottleneck: with the highest level being \({z}_{\max }=8\), we set \(N={z}_{\max }-{z}_{{\rm{bottleneck}}}\), so that the compression ratio is 4N (e.g., zbottleneck = 6 yields N = 2 and a compression factor of 42 = 16; N = 3 corresponds to a factor of 43 = 64). In the final layer of the decoder, we apply the scale conservation operation introduced by Witte et al.26 to re-enforce the scale-conservative hierarchy before reconstructing the finest field. All Field-Space Autoencoder blocks use the spherical harmonic position embeddings defined in Witte et al.26.
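The relation between bottleneck zoom, stage count, and compression factor can be made explicit, mirroring the examples in the text:

```python
z_max = 8  # finest HEALPix level used by the model

def stages_and_ratio(z_bottleneck: int):
    # N = z_max - z_bottleneck encoder/decoder stages; compression ratio 4^N.
    n = z_max - z_bottleneck
    return n, 4 ** n

assert stages_and_ratio(6) == (2, 16)   # z_bottleneck = 6 -> N = 2, factor 16
assert stages_and_ratio(5) == (3, 64)   # N = 3 -> factor 64
```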

To enable generative emulation, we train a latent diffusion model that operates directly on the compressed field representations. The model input consists of the learned compressed field state at level z = 5 and the base average at level z = 3, on which we define the diffusion and denoising process separately. The backbone architecture utilizes a model dimension of 256 and is composed of four stacked Field-Space Attention blocks designed to process the spatio-temporal structure of the data. Within each block, the model applies variable, spatial, and temporal attention simultaneously, allowing for the concurrent integration of information across physical variables and the global HEALPix grid. Crucially, the temporal attention mechanism operates over an 8-day context window, explicitly accounting for neighboring time steps to enforce temporal consistency in the generated sequence. To condition the generation, the network utilizes three embedding modules: a grid embedder (identical to the Field-Space Autoencoder component) to encode spatial topology, a time embedder to encode calendar timestamps following the implementation by Sun et al.40, and a diffusion step embedder to signal the current noise level. The diffusion process is trained using a cosine noise scheduler over T = 1000 steps, with the network objective set to velocity prediction (v-prediction)41. For efficient inference, we employ a Denoising Diffusion Implicit Model (DDIM) sampler42 and reduce the sampling trajectory to 100 steps.
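One standard formulation of the cosine noise schedule and the v-prediction training target is sketched below; the offset s and the exact parameterization are assumptions, since the text specifies only T = 1000 and the choice of objective41:

```python
import numpy as np

T = 1000  # diffusion steps

def alpha_bar(t, s=0.008):
    """Cosine schedule: cumulative signal fraction ᾱ(t); s is an assumed offset."""
    return np.cos(((t / T) + s) / (1 + s) * np.pi / 2) ** 2

def v_target(x0, eps, t):
    """Velocity target: v = sqrt(ᾱ) * ε − sqrt(1 − ᾱ) * x0."""
    a = alpha_bar(t)
    return np.sqrt(a) * eps - np.sqrt(1.0 - a) * x0

# The schedule starts near full signal and decays to near-zero at t = T.
assert alpha_bar(0) > 0.999
assert alpha_bar(T) < 1e-6
```

A DDIM sampler42 then traverses a 100-step sub-sequence of these T timesteps at inference.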

All models were trained with an RMSE reconstruction loss and an Adam optimizer with 5000 warmup iterations, followed by cosine annealing for the remaining iterations. We observed that the transformer models could be trained with a significantly higher learning rate (λ = 1 × 10−3) than the convolutional models (λ = 1 × 10−4), which exhibited instabilities when the learning rate was increased further; the higher learning rate also yielded faster convergence and superior results for the transformer-based models. The batch size was fixed at 4 for all models, with training run for 100,000 iterations. All models converged to a low loss by the end of training, with minimal gap between training and validation loss.
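The learning-rate schedule described above (linear warmup over 5000 iterations, then cosine annealing over the remaining steps) can be sketched as follows; annealing to a floor of zero is an assumption:

```python
import math

def lr_at(step, base_lr=1e-3, warmup=5_000, total=100_000):
    if step < warmup:
        return base_lr * step / warmup                  # linear warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))  # cosine decay

assert lr_at(0) == 0.0
assert lr_at(5_000) == 1e-3       # peak at the end of warmup
assert lr_at(100_000) < 1e-12     # annealed to ~0 at the final iteration
```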