Decoding Diffusion Models: Core Concepts and PyTorch Code

In this article, I will try to distill the essence of diffusion models and give you the core intuition behind them, ending with PyTorch code to train a basic diffusion model.

Definition:

A diffusion model is a type of generative model in machine learning, used to generate high-quality data [such as images] starting from pure noise. Data is noised through diffusion steps that follow a Markov chain [a sequence of stochastic events in which each step depends only on the previous time step] and is then reconstructed by learning the reverse process.

Let us step back a little to understand the core idea behind diffusion models. In the paper “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” [1], the authors describe it as:

The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data.

The diffusion process is essentially divided into a forward phase and a reverse phase. Let us take the example of generating realistic, high-quality images with diffusion models.

Forward Diffusion Phase: We start with a real, high-quality image and add noise to it in steps to arrive at pure noise. Basically, we want to destroy the structure in the non-random data distribution that exists at the start.

The forward step is defined as q(x_t | x_{t-1}) = N(x_t; sqrt(1 - β_t) x_{t-1}, β_t I). Here, q is our forward process, x_t is the output of the forward process at time step t, and x_{t-1} is its input at time step t. N is a normal distribution with mean sqrt(1 - β_t) x_{t-1} and variance β_t I.

β_t [also called the schedule] controls the amount of noise added at time step t, and its value lies between 0 and 1. Depending on the type of schedule you use, you arrive at what is close to pure noise sooner or later, i.e. β_1,…,β_T is a variance schedule (either learned or fixed) which, if well-behaved, ensures that x_T is almost an isotropic Gaussian for sufficiently large T.
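
As a rough illustration of how the schedule decides when x_T gets close to pure noise, the sketch below tracks how much of the original signal survives after T steps. It assumes a linear schedule with endpoints 1e-4 and 2e-2 (the values used in the implementation section); the helper name signal_remaining is made up for this example.

```python
import torch

def signal_remaining(betas: torch.Tensor) -> torch.Tensor:
    """Return sqrt(prod_s (1 - beta_s)), the factor that multiplies x_0 after each step."""
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    return torch.sqrt(alphas_cumprod)

# Same linear endpoints, two different numbers of steps (an assumption for illustration).
s300 = signal_remaining(torch.linspace(1e-4, 2e-2, 300))
s1000 = signal_remaining(torch.linspace(1e-4, 2e-2, 1000))

print(f"signal factor after  300 steps: {s300[-1].item():.4f}")
print(f"signal factor after 1000 steps: {s1000[-1].item():.4f}")
```

With the same endpoints, more steps leave less of x_0 in x_T, so the final sample is closer to an isotropic Gaussian.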

Reverse Diffusion Phase: This is where the actual machine learning takes place. As the name suggests, in this phase we try to transform the noise back into a sample from the target distribution, i.e. the model learns to denoise pure Gaussian noise into a clean image. Once the neural network has been trained, this ability can be used to generate new images out of Gaussian noise through step-by-step reverse diffusion.

Since one cannot readily estimate q(x_{t-1} | x_t), we need to learn a model p_θ to approximate the conditional probabilities of the reverse diffusion process: p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t)).

We want to model the probability density of an earlier time step given the current one. If we apply this reverse formula for all time steps from T down to 0, we can trace our steps back to the original data distribution. The time step information is usually provided to the model as positional embeddings. It is worth mentioning that the diffusion model predicts the entire noise that would have to be removed at a given time step to get back to the starting image, not just the delta between the current and previous time step. However, we only subtract part of it and then move on to the next step. That is how the diffusion process works.
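
To make the "entire noise" point concrete, here is a minimal sketch. It assumes the closed-form relation x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε from the DDPM paper [2], where ᾱ_t is the cumulative product of (1 - β_s). The true noise stands in for a perfect model prediction, and the function name estimate_x0 is hypothetical.

```python
import torch

torch.manual_seed(0)

T = 300
betas = torch.linspace(1e-4, 2e-2, T)          # linear schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def estimate_x0(x_t, predicted_noise, t):
    """Invert x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * noise for x_0."""
    a_bar = alphas_cumprod[t]
    return (x_t - torch.sqrt(1.0 - a_bar) * predicted_noise) / torch.sqrt(a_bar)

x_0 = torch.rand(3, 8, 8) * 2 - 1   # a fake "image" with values in [-1, 1]
t = 150
noise = torch.randn_like(x_0)
x_t = torch.sqrt(alphas_cumprod[t]) * x_0 + torch.sqrt(1.0 - alphas_cumprod[t]) * noise

# With a perfect noise prediction, removing all of it recovers the clean image.
x_0_hat = estimate_x0(x_t, noise, t)
print(torch.allclose(x_0_hat, x_0, atol=1e-4))
```

A trained network's prediction is never perfect, which is one way to see why sampling removes only part of the predicted noise per step and then predicts again.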

To summarize, a diffusion model essentially destroys the structure in training data through the successive addition of Gaussian noise, and then learns to recover the data by reversing this noising process. After training, one can use the diffusion model to generate data by simply passing randomly sampled noise through the “learned” denoising process. For a detailed mathematical explanation, check out this blog [4].

Implementation:

We will use the Oxford Flowers102 dataset, which contains images of flowers in 102 categories, and build a very simple model for the purposes of this article, in order to understand the core idea and the implementation of diffusion models.

Forward phase: Since the sum of Gaussians is also a Gaussian, one can directly prepare a noisy version of the input image for any given time step, even though the noise is added sequentially [2].
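
For reference, the closed form this relies on can be written as follows, in the notation of the DDPM paper [2]. The symbols α_t and ᾱ_t are introduced here; they correspond to the tensors alphas and alphas_cumprod in the code.

```latex
\begin{align}
\alpha_t &= 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s \\
q(x_t \mid x_0) &= \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, I\right) \\
x_t &= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
\end{align}
```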

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_beta_schedule(timesteps, start=1e-4, end=2e-2):
    """Creates a linearly increasing noise schedule."""
    return torch.linspace(start, end, timesteps)

def get_idx_from_list(vals, t, x_shape):
    """ Returns a specific index t of a passed list of values vals. """
    batch_size = t.shape[0]
    out = vals.gather(-1, t.cpu())
    return out.reshape(batch_size, *((1,) * (len(x_shape) - 1))).to(t.device)

def forward_diffusion_sample(x_0, t, device="cpu"):
    """ Takes an image and a timestep as input and returns the noisy version of it."""
    noise = torch.randn_like(x_0)
    sqrt_alphas_cumprod_t = get_idx_from_list(sqrt_alphas_cumprod, t, x_0.shape)
    sqrt_one_minus_alphas_cumprod_t = get_idx_from_list(sqrt_one_minus_alphas_cumprod, t, x_0.shape)
    return sqrt_alphas_cumprod_t.to(device) * x_0.to(device) + sqrt_one_minus_alphas_cumprod_t.to(device) * noise.to(device), noise.to(device)


T = 300  # Total number of timesteps
betas = linear_beta_schedule(T)
# Precompute values for efficiency
alphas = 1. - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
alphas_cumprod_prev = F.pad(alphas_cumprod[:-1], (1, 0), value=1.0)

sqrt_recip_alphas = torch.sqrt(1. / alphas)
sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1. - alphas_cumprod)
posterior_variance = betas * (1. - alphas_cumprod_prev) / (1. - alphas_cumprod)
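
As a sanity check on these precomputed tensors, the sketch below applies the mean and variance coefficients of the one-step update sequentially and compares them with the closed-form values. It is a self-contained illustration that redefines the schedule (in float64, an assumption made for a tight comparison) instead of reusing the variables above.

```python
import torch

T = 300
betas = torch.linspace(1e-4, 2e-2, T, dtype=torch.float64)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

# Track x_t = mean_coef * x_0 + (noise with variance var), one step at a time,
# using the single-step rule x_t = sqrt(alpha_t) * x_{t-1} + sqrt(beta_t) * eps.
mean_coef = torch.tensor(1.0, dtype=torch.float64)
var = torch.tensor(0.0, dtype=torch.float64)
seq_mean, seq_var = [], []
for t in range(T):
    mean_coef = torch.sqrt(alphas[t]) * mean_coef
    var = alphas[t] * var + betas[t]
    seq_mean.append(mean_coef)
    seq_var.append(var)

seq_mean = torch.stack(seq_mean)
seq_var = torch.stack(seq_var)

# Both comparisons should hold: sequential noising matches the closed form.
print(torch.allclose(seq_mean, torch.sqrt(alphas_cumprod)))
print(torch.allclose(seq_var, 1.0 - alphas_cumprod))
```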

Reverse Diffusion Phase: We use a simple U-Net neural network that takes the noisy image and the time step [provided as a positional embedding] and predicts the noise. The ConvBlock layer below uses the sinusoidal time step embedding, which captures the temporal context, to condition the convolutional output.

class SinusoidalPositionEmbeddings(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        half_dim = self.dim // 2
        scale = math.log(10000) / (half_dim - 1)
        freqs = torch.exp(torch.arange(half_dim, device=t.device) * -scale)
        angles = t[:, None] * freqs[None, :]
        return torch.cat([angles.sin(), angles.cos()], dim=-1)

class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels, time_emb_dim, upsample=False):
        super().__init__()
        self.time_mlp = nn.Linear(time_emb_dim, out_channels)
        self.upsample = upsample

        self.conv1 = nn.Conv2d(in_channels * 2 if upsample else in_channels, out_channels, kernel_size=3, padding=1)
        self.transform = (
            nn.ConvTranspose2d(out_channels, out_channels, kernel_size=4, stride=2, padding=1)
            if upsample else
            nn.Conv2d(out_channels, out_channels, kernel_size=4, stride=2, padding=1)
        )
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU()

    def forward(self, x, t):
        h = self.bn1(self.relu(self.conv1(x)))
        time_emb = self.relu(self.time_mlp(t))[(..., ) + (None,) * 2]
        h = h + time_emb
        h = self.bn2(self.relu(self.conv2(h)))
        return self.transform(h)

class SimpleUNet(nn.Module):
    """Simplified U-Net for denoising diffusion models."""

    def __init__(self):
        super().__init__()
        image_channels = 3
        down_channels = (64, 128, 256, 512, 1024)
        up_channels = (1024, 512, 256, 128, 64)
        output_channels = 3
        time_emb_dim = 32

        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbeddings(time_emb_dim),
            nn.Linear(time_emb_dim, time_emb_dim),
            nn.ReLU()
        )
        self.init_conv = nn.Conv2d(image_channels, down_channels[0], kernel_size=3, padding=1)

        self.down_blocks = nn.ModuleList([
            ConvBlock(down_channels[i], down_channels[i+1], time_emb_dim)
            for i in range(len(down_channels) - 1)
        ])

        self.up_blocks = nn.ModuleList([
            ConvBlock(up_channels[i], up_channels[i+1], time_emb_dim, upsample=True)
            for i in range(len(up_channels) - 1)
        ])

        self.final_conv = nn.Conv2d(up_channels[-1], output_channels, kernel_size=1)

    def forward(self, x, t):
        t_emb = self.time_mlp(t)
        x = self.init_conv(x)
        skip_connections = []

        for block in self.down_blocks:
            x = block(x, t_emb)
            skip_connections.append(x)

        for block in self.up_blocks:
            skip_x = skip_connections.pop()
            x = torch.cat([x, skip_x], dim=1)
            x = block(x, t_emb)
        return self.final_conv(x)

model = SimpleUNet()
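
The conditioning step inside ConvBlock.forward can be looked at in isolation. The following sketch uses made-up tensor sizes and a random tensor in place of the sinusoidal embedding; it only illustrates how a per-sample, per-channel time vector is broadcast over the spatial dimensions of a feature map.

```python
import torch
import torch.nn as nn

batch, channels, height, width = 4, 16, 8, 8   # illustrative sizes (assumed)
time_emb_dim = 32

features = torch.randn(batch, channels, height, width)   # output of a conv layer
t_emb = torch.randn(batch, time_emb_dim)                  # stand-in for the time embedding

time_mlp = nn.Linear(time_emb_dim, channels)              # project to one value per channel
time_emb = torch.relu(time_mlp(t_emb))                    # shape: (batch, channels)
time_emb = time_emb[(...,) + (None,) * 2]                 # shape: (batch, channels, 1, 1)

conditioned = features + time_emb                         # broadcast over height and width
print(time_emb.shape, conditioned.shape)
```

Every spatial position in a channel receives the same time-dependent offset, so the convolution that follows can react differently at different noise levels.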

The training objective is a simple MSE loss that measures the difference between the actual noise and the model's prediction of that noise.

def get_loss(model, x_0, t, device):
    x_noisy, noise = forward_diffusion_sample(x_0, t, device)
    noise_pred = model(x_noisy, t)
    return F.mse_loss(noise, noise_pred)
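
The article does not show its training loop, so the following is only a minimal sketch of how such a loss is typically used: draw a random time step per image, noise the batch, and regress the noise. TinyDenoiser is a hypothetical stand-in for the U-Net, random tensors stand in for the flower images, and the Adam optimizer with learning rate 1e-3 is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

T = 300
betas = torch.linspace(1e-4, 2e-2, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class TinyDenoiser(nn.Module):
    """Hypothetical stand-in for the U-Net: same call signature, far fewer layers."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),
        )

    def forward(self, x, t):
        # Feed the normalized time step as one extra constant channel.
        t_channel = (t.float() / T)[:, None, None, None].expand(-1, 1, *x.shape[2:])
        return self.net(torch.cat([x, t_channel], dim=1))

def training_step(model, optimizer, x_0):
    # One random time step per image, then noise the batch in closed form.
    t = torch.randint(0, T, (x_0.shape[0],))
    noise = torch.randn_like(x_0)
    a_bar = alphas_cumprod[t][:, None, None, None]
    x_noisy = torch.sqrt(a_bar) * x_0 + torch.sqrt(1.0 - a_bar) * noise
    loss = F.mse_loss(model(x_noisy, t), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = TinyDenoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
fake_batch = torch.rand(8, 3, 16, 16) * 2 - 1   # random tensors in place of real images
losses = [training_step(model, optimizer, fake_batch) for _ in range(5)]
print(losses)
```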

Finally, after training the model for 300 epochs, we can start generating ~realistic-looking images of flowers by sampling pure Gaussian noise and feeding it through the learned reverse diffusion process.
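
The sampling code is not part of the article, so the sketch below follows the sampling procedure of the DDPM paper [2], using the same precomputed quantities as the forward phase. dummy_model is a hypothetical placeholder for a trained network and simply predicts zero noise, so the output is not a meaningful image.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T = 300
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
alphas_cumprod_prev = F.pad(alphas_cumprod[:-1], (1, 0), value=1.0)
sqrt_recip_alphas = torch.sqrt(1.0 / alphas)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)
posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)

@torch.no_grad()
def sample(model, shape):
    """Start from Gaussian noise and apply the reverse step for t = T-1, ..., 0."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        predicted_noise = model(x, t_batch)
        # Remove only the share of the predicted noise attributed to this step.
        mean = sqrt_recip_alphas[t] * (
            x - betas[t] * predicted_noise / sqrt_one_minus_alphas_cumprod[t]
        )
        if t > 0:
            x = mean + torch.sqrt(posterior_variance[t]) * torch.randn_like(x)
        else:
            x = mean  # no fresh noise is added at the final step
    return x

def dummy_model(x, t):
    """Hypothetical placeholder for a trained network: predicts zero noise."""
    return torch.zeros_like(x)

images = sample(dummy_model, (2, 3, 16, 16))
print(images.shape)
```

Replacing dummy_model with a trained noise-prediction network that accepts (x, t) is what would turn this loop into an image generator.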

References:

[1] Deep Unsupervised Learning using Nonequilibrium Thermodynamics, Sohl-Dickstein, J. et al. (2015)
[2] Denoising Diffusion Probabilistic Models, Ho et al. (2020)
[3] Diffusion Models Beat GANs on Image Synthesis, Dhariwal and Nichol (2021)
[4] This excellent blog for a deeper dive into the mathematics behind diffusion models.
[5] This repository, which provides a collection of resources and papers on diffusion models.
