This chapter describes the development and evaluation of two new denoising diffusion models (DDMs) based on Neural Cellular Automata (NCA): Diff-NCA and FourierDiff-NCA. These models address the parameter inefficiency and local-communication limitations of traditional diffusion models. We start with an introduction to the basic concepts and then describe the model architectures, training procedures, and experimental setups.

Neural Cellular Automata overview

Neural Cellular Automata are single-cell models that iteratively approach a final goal through a shared, learned update rule (a more detailed introduction can be found in Supplementary Notes 1 and 2). This approach reduces the model complexity traditionally associated with DDMs. Building on this basis, we present Diff-NCA, an image-generation method that combines the diffusion process with the generalization capability of NCAs to generate images of varying sizes. Moreover, we introduce FourierDiff-NCA to address the local-communication limitation of NCAs, a major obstacle to capturing global knowledge. FourierDiff-NCA circumvents this limitation by operating in Fourier space, bridging distant parts of an image without the need for extensive iterative communication across the image.

Fourier space: single-step global communication

When using Neural Cellular Automata, a major limitation is the number of steps required for the model to acquire global knowledge. Since NCA cells communicate only with their direct neighbors, communication across a 100 × 100 image in a naive setup requires 100 steps. The inherent structure of the Fourier representation (explained in depth in Supplementary Note 3), where low-frequency information lives in the center of the two-dimensional Fourier space, enables a fundamental shift in the communication pattern. NCAs operating in this space achieve global communication across the entire image in a single step, in stark contrast to the linear stepwise progression needed in image space; additional iterations can then refine the transmitted signal. This instantaneous communication over the entire image arises from applying an inverse Fast Fourier Transform after the initial NCA communication, whereby all relevant data in the limited Fourier-space window is transferred to the global scale. Having acquired global information in Fourier space, each cell in image space starts from a comprehensive understanding of the global context and then adjusts the details based on local information, a clear departure from the traditional “detail-first” approach.

Notably, only a fraction of the Fourier space is required for this process since a substantial portion of the information in the Fourier space is assumed to be insignificant from a quality point of view (we further illustrate this in the Supplementary Fig. 3).
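The intuition behind both points, single-step global communication and the sufficiency of a small Fourier window, can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: the image, window size, and noise level are our choices. Keeping only a small centered window of the shifted spectrum and applying one inverse FFT already spreads the global structure to every pixel.

```python
import numpy as np

# Toy image: a smooth global gradient plus fine-grained noise.
rng = np.random.default_rng(0)
img = np.linspace(0, 1, 100)[None, :] * np.ones((100, 1))
img_noisy = img + 0.05 * rng.standard_normal((100, 100))

# Shift the spectrum so low frequencies sit in the center,
# then keep only a small 16x16 window around that center.
spec = np.fft.fftshift(np.fft.fft2(img_noisy))
mask = np.zeros_like(spec)
c = 50  # center index of the 100x100 shifted spectrum
mask[c - 8:c + 8, c - 8:c + 8] = 1.0

# One inverse FFT distributes the window's content globally:
# every pixel now "knows" the coarse structure of the whole image.
low = np.fft.ifft2(np.fft.ifftshift(spec * mask)).real
err_low = np.abs(low - img).mean()
```

After the single inverse transform, the reconstruction from the 16 × 16 window closely tracks the global gradient, whereas a purely local 3 × 3 update rule would need on the order of 100 steps to propagate the same information.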

FourierDiff-NCA architecture

In FourierDiff-NCA, we address the challenge of achieving global coherence by beginning the diffusion process in Fourier space to capture global information and then transitioning to image space to integrate local details (the corresponding pseudocode can be found in Supplementary Note 5, Listing 1). This approach allows us to combine global and local information effectively without requiring a high number of NCA steps s. We use separate NCA models m1 and m2 for the image and Fourier space, respectively. The denoising process over 300 diffusion steps t is illustrated in Fig. 8.

Fig. 8

Diff-NCA predicts the noise using iterative local communication of NCAs, whereas FourierDiff-NCA additionally utilizes the Fourier space to communicate global knowledge across the image space.

Diff-NCA

As a subset of the FourierDiff-NCA architecture, Diff-NCA uses only local communication (illustrated in Supplementary Fig. 2). Notably, Diff-NCA can be run independently of FourierDiff-NCA, resulting in a diffusion process based solely on the image space and thus considering only local features. The denoising task follows the procedure of DDMs: the input of Diff-NCA, in, is a combination of the input image i and the noise n. An embedding includes the position x, y of each NCA cell, the diffusion timestep t, and the NCA timestep s.

$${n}_{p}={\text{Diff-NCA}}^{(s)}({i}_{n})$$

(1)

Diff-NCA predicts the noise np from in by iterating m1, s times over the image, incrementing its perceptive range by 1 per step.

$${\mathcal{L}}=\left({({n}_{p}-n)}^{2}+| {n}_{p}-n| \right)$$

(2)

The loss is computed between np and n as a combined L2 and L1 loss, which diverges from the original pure L2 loss to enhance convergence.
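The combined objective of Eq. (2) is straightforward to express in PyTorch. The following is a minimal sketch under our assumptions: the function name is ours, and we reduce the per-element loss with a mean, which Eq. (2) leaves implicit.

```python
import torch

def diff_nca_loss(n_pred: torch.Tensor, n_true: torch.Tensor) -> torch.Tensor:
    """Combined L2 + L1 loss between predicted and true noise, per Eq. (2).

    The mean reduction over all elements is our assumption; the equation
    itself only states the per-element term (n_p - n)^2 + |n_p - n|.
    """
    diff = n_pred - n_true
    return (diff ** 2 + diff.abs()).mean()
```

The L1 term keeps gradients informative when the residual is small, where the L2 term alone would flatten out.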

Embedding

The embedding consists of linear information of position x, y, diffusion timestep t, and NCA timestep s.

$${e}_{\sin }=\,\text{SinusoidalEncoding}\,(x,y,t,s)$$

(3)

The variables x, y, t and s are processed using sinusoidal encoding14, as introduced by DDM1.

$$e={{\rm{l}}}_{2}(\,\text{SiLU}({\text{l}}_{1}({e}_{\sin })))$$

(4)

A sequence of a linear layer l1 of size 256, SiLU activation, and another linear layer l2 mapping to four output channels e is utilized and concatenated with the input.
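Eqs. (3) and (4) together can be sketched as follows. This is our reading of the text, not the released code: the per-variable encoding width (16 channels here) is an assumption, as is encoding each of x, y, t, s separately and concatenating.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(v: torch.Tensor, dim: int = 16) -> torch.Tensor:
    """Standard transformer-style sinusoidal encoding of a scalar per entry.

    v: tensor of shape (...,); returns shape (..., dim). The width `dim`
    is a hypothetical choice for illustration.
    """
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = v[..., None] * freqs
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class ConditionEmbedding(nn.Module):
    """Encodes (x, y, t, s) sinusoidally, then applies
    l1 (Linear, 256) -> SiLU -> l2 (Linear, 4), per Eqs. (3)-(4)."""
    def __init__(self, enc_dim: int = 16):
        super().__init__()
        self.l1 = nn.Linear(4 * enc_dim, 256)
        self.l2 = nn.Linear(256, 4)

    def forward(self, x, y, t, s):
        e_sin = torch.cat(
            [sinusoidal_encoding(v) for v in (x, y, t, s)], dim=-1)
        return self.l2(nn.functional.silu(self.l1(e_sin)))
```

The four output channels e are then broadcast over the spatial grid and concatenated with the cell state, as described next.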

$$\bar{{o}_{1}}={M}_{{\rm {eb}}1}({o}_{1}),\quad \bar{{o}_{2}}={M}_{{\rm {eb}}2}({o}_{2})$$

(5)

This encoding is multiplied with the output o1 of the first 3 × 3 convolutional layer of Diff-NCA, as well as with the output o2 of the second 1 × 1 convolutional layer. As we use multiplicative conditioning, we add two multiplicative embedding blocks Meb1 and Meb2, which map the four output channels to the required sizes of 2h and h, respectively; each block consists of a 1 × 1 convolution, a SiLU activation, and a second 1 × 1 convolution.

FourierDiff-NCA

FourierDiff-NCA extends Diff-NCA by initiating the diffusion process with the gathering of global information in Fourier space.

$$f={{\rm{Extract}}}_{16\times 16}({\rm{FFT}}({i}_{n}))$$

(6)

Through a fast Fourier transform (FFT) of in, FourierDiff-NCA (FD-NCA) receives the diffusion task f in Fourier space. We extract a 16 × 16 cell window w starting at the center of f, paired with the embedding e, simplifying communication. This 16 × 16 quarter of the Fourier space contains enough detail for global communication.

$$\bar{w}=\,\text{IFFT}({\text{FD-NCA}}^{({s}_{{\rm {f}}})}(w))$$

(7)

After 32 iterations sf of m2 in Fourier space (the number required to communicate once across a 16 × 16 grid and back), an inverse FFT converts the result back to image space, and the process transitions to Diff-NCA, providing initial global information in the channels c of each NCA cell. Thus \({n}_{p}={\text{Diff-NCA}}^{(s)}(\bar{w})\).
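The two-stage process of Eqs. (6) and (7) can be wired up as below. This is a structural sketch only: the NCA models m1 and m2 are passed in as callables (identity functions work as stand-ins), and the use of fftshift plus a window extending from the spectrum center is our interpretation of the window placement; the embedding conditioning is omitted for brevity.

```python
import torch

def fourier_diff_nca(i_n, m2, m1, s_f=32, s=10, win=16):
    """Sketch of FourierDiff-NCA: a global pass in Fourier space,
    then local refinement in image space. m1/m2 are stand-ins for
    the image- and Fourier-space NCAs; s and win are illustrative."""
    # Eq. (6): FFT of the noisy input, low frequencies shifted to the center.
    f = torch.fft.fftshift(torch.fft.fft2(i_n))
    c = f.shape[-1] // 2
    w = f[..., c:c + win, c:c + win]       # 16x16 window starting at the center
    for _ in range(s_f):                   # s_f Fourier-space NCA steps
        w = m2(w)
    # Eq. (7): place the window back and invert to image space.
    f_out = torch.zeros_like(f)
    f_out[..., c:c + win, c:c + win] = w
    w_bar = torch.fft.ifft2(torch.fft.ifftshift(f_out)).real
    for _ in range(s):                     # s image-space Diff-NCA steps
        w_bar = m1(w_bar)
    return w_bar                           # predicted noise n_p
```

The key design point is the ordering: the inverse FFT happens once, after the Fourier-space iterations, so every image-space cell begins its local refinement already carrying global context.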

Model architecture

We design our model, illustrated in Fig. 8, with simplicity in mind. We keep the architecture identical for the image and Fourier space. However, there is a difference in the number of input channels c, since converting data to the Fourier space results in two values per channel.

Given an input image i in RGB format, the model is defined with three channels for the input I, three channels for the predicted noise N, and 90 additional empty channels E for storing information between steps. The empty channels are essential in any NCA, as they are the only medium for information retention between steps. The total number of channels is therefore c = I + N + E = 96, containing the image, the output noise, and the NCA's internal state v.

Local communication is implemented through a 3 × 3 2D convolution. The output of that convolution is concatenated with the previous internal state v and the embedding e, and the concatenated vector is multiplied with the output of Meb1. The next layer is a 1 × 1 2D convolution that maps the concatenated vector to a hidden vector of depth h = 512. Cell-wise normalization along the NCA channels is applied to ensure localized normalization: each pixel location is normalized individually by calculating the mean and variance across all channels at that spatial location. (A variant that replaces the localized normalization with layer normalization15, used to investigate the influence of globalized normalization, is denoted FourierDiff-NCAGN.) A leaky ReLU activation introduces nonlinearity.

The output of the leaky ReLU is then multiplied by the output of Meb2. Afterwards, another 1 × 1 2D convolution maps the hidden vector back to the channel size c, yielding an output vector o. The internal state is updated as v ← v + o, gated by a stochastic reset mechanism referred to as the fire rate, which is set to a probability of 90%: when the reset is triggered for a cell, its update is set to 0.
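One full update step as described above can be sketched as follows. This is our reading of the text, with loudly flagged assumptions: the exact channel bookkeeping around the multiplicative embedding blocks (the text mentions sizes 2h and h, whereas this sketch sizes Meb1 to the concatenated width so the multiplication type-checks), the leaky-ReLU slope, and the normalization epsilon are all our choices.

```python
import torch
import torch.nn as nn

class EmbeddingBlock(nn.Module):
    """Multiplicative embedding block: 1x1 conv -> SiLU -> 1x1 conv."""
    def __init__(self, out_ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, out_ch, 1), nn.SiLU(), nn.Conv2d(out_ch, out_ch, 1))

    def forward(self, e):
        return self.net(e)

class NCAStep(nn.Module):
    """One NCA update step (a sketch of the described cell rule)."""
    def __init__(self, c: int = 96, h: int = 512, fire_rate: float = 0.9):
        super().__init__()
        self.fire_rate = fire_rate
        self.perceive = nn.Conv2d(c, c, 3, padding=1)  # local 3x3 communication
        in_ch = c + c + 4          # perception output + state v + embedding e
        self.meb1 = EmbeddingBlock(in_ch)   # assumption: sized to the concat width
        self.fc1 = nn.Conv2d(in_ch, h, 1)
        self.meb2 = EmbeddingBlock(h)
        self.fc2 = nn.Conv2d(h, c, 1)

    def forward(self, v, e):
        z = torch.cat([self.perceive(v), v, e], dim=1)
        z = z * self.meb1(e)                 # multiplicative conditioning
        z = self.fc1(z)
        # Cell-wise normalization: mean/variance per pixel, across channels.
        z = (z - z.mean(1, keepdim=True)) / (z.std(1, keepdim=True) + 1e-5)
        z = nn.functional.leaky_relu(z)
        z = self.fc2(z * self.meb2(e))
        # Fire rate: each cell applies its update with probability 90%.
        fire = (torch.rand_like(v[:, :1]) < self.fire_rate).float()
        return v + z * fire                  # v <- v + o, stochastically gated
```

The stochastic gating desynchronizes the cells, which is the standard NCA trick for making the learned rule robust to asynchronous updates.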

Following practices aligned with the leading methods in the field2,16,17, our model incorporates an exponential moving average (EMA) on the weights, with a decay rate of 0.99.

Data

For the evaluation of our proposed methods, we select two distinct publicly available datasets that present different challenges.

CelebA dataset

The CelebA dataset11 (available for non-commercial research purposes only; the full license can be found at mmlab.ie.cuhk.edu.hk/projects/CelebA.html), a widely used benchmark, consists of 202,599 images, each of size 178 × 218. We scale all images to a uniform size of 64 × 64 to match the input requirements of the UNet and VNCA. The data is split 80%:10%:10% for training, validation, and testing. CelebA presents a challenge as it includes varied facial images against different backgrounds; the inclusion of individuals with a range of accessories, hairstyles, and facial expressions further increases the complexity of the dataset. The size and variety of this dataset allow us to evaluate the ability of our models to handle intricate details and diverse visual features.

BCSS Pathology dataset

This dataset18 (CC0 1.0 Universal (CC0 1.0) license) contains 144 high-resolution pathology samples, which present a unique challenge for generative models. To address slight blurring artifacts at the highest resolution, the images are first downscaled to one-quarter of their size in the x and y directions, yielding clear visual patterns. We then extract patches of size 64 × 64 for training. The BCSS pathology dataset provides an opportunity to rigorously test the capabilities of our proposed methods in generating large images in the medical domain. For this dataset, we use an 80%:10%:10% split for training, validation, and testing.

Infrastructure

All models are implemented in PyTorch19, where we use the official implementation of VNCA8. The models are trained on an Ubuntu 22.04 system using an Nvidia RTX 3090 and an Intel i7-12700 processor. Additional details can be found in the codebase.

Metrics

To assess the quality of the synthesized outputs, we use the well-established metrics Fréchet Inception Distance (FID)20 and Kernel Inception Distance (KID)21, which measure the similarity between real and synthetic images based on the Inception-v322 model. For all evaluations, we compare a set of 2048 real images from the test split against an equal number of synthesized images; the same protocol applies to the later super-resolution and inpainting results.

Experimental setup

To ensure reproducibility, detailed experiment settings are given below. All experiments are implemented in PyTorch19, with comprehensive descriptions complementing the full codebase, which will be released after acceptance.

Diff-NCA and FourierDiff-NCA

All conducted experiments utilize the Adam optimizer23. The chosen hyperparameters include a learning rate of 1.6 × 10−3, an exponential learning-rate decay with gamma of 0.9999, Adam betas of (0.9, 0.99), and epsilon (ϵ) set at 1 × 10−8. The models are trained for 200,000 steps with a batch size of 16. The number of diffusion sampling steps is set to 300. Detailed configurations for FourierDiff-NCA are outlined in Supplementary Note 6, Listing 2, while those for Diff-NCA are provided in Listing 3.
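The optimizer configuration maps directly onto PyTorch, assuming the gamma refers to a per-step exponential schedule (the schedule granularity is our assumption; the model here is a placeholder):

```python
import torch

model = torch.nn.Conv2d(96, 96, 1)  # placeholder for the NCA parameters

optimizer = torch.optim.Adam(
    model.parameters(), lr=1.6e-3, betas=(0.9, 0.99), eps=1e-8)

# Exponential learning-rate decay with gamma = 0.9999,
# assumed here to be applied once per training step.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9999)
```

With per-step decay, the learning rate after 200,000 steps shrinks by a factor of 0.9999^200000 ≈ 2 × 10−9 relative to its initial value, so the effective schedule anneals almost to zero over training.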

VNCA

We use the official CelebA-specific implementation of VNCA8 which can be found at https://github.com/rasmusbergpalm/vnca.

UNets

The training of all UNets employs the Adam optimizer with a learning rate of 3 × 10−5, betas set at (0.5, 0.999), and epsilon (ϵ) at 1 × 10−6. A training duration of 200,000 steps is undertaken using a batch size of 32. The number of diffusion sampling steps is set to 1000 as defined in the original DDM publication1. Refer to Table 6 for detailed layer configurations of each UNet model.

Table 6 Configuration of UNets