December 18, 2020

EchoFlow

Normalizing flows for generative modeling of tabular datasets.

This library provides tools for generative modeling of tabular datasets using normalizing flows. Some of its core features include:

  • Density estimation.
  • Joint + conditional sampling.
  • Categorical + continuous values.
  • Normalizing flows from RealNVP and MAF.

To get started with EchoFlow, check out our documentation!

Motivating Example

Let us start by considering a simple tabular dataset containing two columns which, when plotted, form a spiral. Our goal will be to train a generative model on this dataset and then sample from it to create a "synthetic" copy of the dataset. Some of the tools for accomplishing this include:

  • Copulas. This library uses copula functions, a classical statistical method which is widely used in finance.
  • CTGAN. This library uses generative adversarial networks, a deep learning-based method which has notably been used to generate photo-realistic images.
  • And, of course, EchoFlow. Our library implements normalizing flows, which use specialized neural networks to transform probability distributions.

We applied each of these methods to our spiral dataset, generated 1000 samples, and visualized the results below.

This figure shows the spiral dataset as well as synthetic copies of the dataset generated using Copulas, CTGAN, and EchoFlow.

As shown in the above figure, EchoFlow produces significantly higher quality samples than either Copulas or CTGAN. In the following sections, we will (1) introduce some of the key concepts and math behind normalizing flows and (2) demonstrate some of the core functionality provided by the EchoFlow library.

Normalizing Flows

At a high level, normalizing flows transform random variables with invertible neural networks, using the change-of-variables formula for probability density functions to track how the density changes at each step.

Invertible Neural Networks

The most important property of a normalizing flow is that it must be invertible. In practice, this means that each layer of the neural network must be invertible so that the whole neural network can be inverted.

This property is critical because the direct pass - \(f(x)\) - is used to map the input distribution to the prior distribution and the inverse pass - \(f^{-1}(z)\) - is used to map the prior distribution back to the target distribution.

For example, given a trained network, the direct pass could be used to map your tabular data to a standard multivariate normal distribution to evaluate the log-likelihood. Using the same network, the inverse pass could be used to map random noise sampled from the multivariate normal distribution into samples that resemble the original tabular data.

For those of you who are familiar with variational auto-encoders (VAEs), these ideas may ring a bell - the direct pass plays essentially the role of the encoder in a VAE while the inverse pass plays the role of the decoder. However, with normalizing flows, these two networks are combined into a single invertible neural network.
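To make these two directions concrete, here is a minimal sketch of a RealNVP-style affine coupling layer in PyTorch - an illustration of the idea only, not EchoFlow's internal API:

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP-style coupling layer: the first half of the features passes
    through unchanged and parameterizes an invertible affine transform
    (scale and shift) of the second half."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        # Small conditioner network producing a log-scale and shift for x_b.
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        # Direct pass: data -> prior space, plus log|det Jacobian| for the loss.
        x_a, x_b = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x_a).chunk(2, dim=1)
        z_b = x_b * torch.exp(log_s) + t
        return torch.cat([x_a, z_b], dim=1), log_s.sum(dim=1)

    def inverse(self, z):
        # Inverse pass: prior space -> data space.
        z_a, z_b = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(z_a).chunk(2, dim=1)
        x_b = (z_b - t) * torch.exp(-log_s)
        return torch.cat([z_a, x_b], dim=1)

x = torch.randn(5, 2)
layer = AffineCoupling(dim=2)
z, log_det = layer(x)
assert torch.allclose(layer.inverse(z), x, atol=1e-5)  # invertibility check

Because both directions share the same parameters, training the direct pass automatically gives you the sampling direction for free.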

Change of Variables

Suppose you have a random variable \(x\) which has distribution \(p_x(x)\). If you apply a function \(z = f(x)\), then the random variable \(z\) has the following distribution:

\[ p_z(z) = p_x(x) \, \bigg| \det \frac{\partial f}{\partial x} \bigg|^{-1} \]

Normalizing flows use a neural network as the function \(f\) and apply this change-of-variables formula repeatedly to get from the input distribution to the prior distribution. The loss function is then simply the negative log-likelihood of the data. Therefore, in addition to being invertible, the neural network needs to be designed in such a way that the determinant of the Jacobian is easy to compute.
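Written out for a flow built from \(K\) layers \(f_1, \dots, f_K\), with intermediate values \(h_0 = x\) and \(h_k = f_k(h_{k-1})\), the quantity that training maximizes is:

\[ \log p_x(x) = \log p_z(h_K) + \sum_{k=1}^{K} \log \bigg| \det \frac{\partial f_k}{\partial h_{k-1}} \bigg| \]

and the loss is the negative of this expression averaged over the training data.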

The general strategy used to achieve this is to use neural networks whose Jacobian matrices are triangular. Assuming the network has N inputs and outputs, this corresponds to a model where the \(i\)th output depends only on the \(i\)th input and the preceding \(i-1\) inputs. With this type of autoregressive structure in each layer of the network, each Jacobian determinant reduces to a product of diagonal entries, and the per-layer determinants can be multiplied together to compute the likelihood.
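As a quick numerical illustration of why the triangular structure helps (a toy check with NumPy, not EchoFlow code):

import numpy as np

# With autoregressive structure, a layer's Jacobian is lower triangular,
# so its determinant is just the product of the diagonal entries.
rng = np.random.default_rng(0)
J = np.tril(rng.normal(size=(4, 4)))  # stand-in for one layer's Jacobian
assert np.isclose(np.linalg.det(J), np.prod(np.diag(J)))
# Across stacked layers the determinants multiply (log-determinants add),
# which gives the likelihood term in the loss without any expensive
# determinant computation.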

Visualizing Flow Layers

One way to get insight into how normalizing flows work is to visualize the output of each layer. The example below shows the output of each layer of a normalizing flow trained on the spiral dataset with a Gaussian prior.

This figure shows the values as they pass through each layer of the neural network.

The input is a sample from the standard multivariate normal prior. Each layer transforms the distribution until it approaches the target distribution, which resembles the spiral.
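For reference, here is one way such a visualization can be produced - a sketch that reuses the AffineCoupling layer defined above (untrained here, purely to show the plotting pattern; a trained flow would produce the spiral, and a real flow would also alternate which half of the features each layer transforms):

import matplotlib.pyplot as plt
import torch

layers = [AffineCoupling(dim=2) for _ in range(4)]  # from the earlier sketch
h = torch.randn(1000, 2)  # samples from the Gaussian prior

fig, axes = plt.subplots(1, len(layers) + 1, figsize=(15, 3))
axes[0].scatter(*h.T, s=2)
axes[0].set_title("prior")
for i, (ax, layer) in enumerate(zip(axes[1:], layers), start=1):
    with torch.no_grad():
        h = layer.inverse(h)  # one inverse-pass step toward the data space
    ax.scatter(*h.T, s=2)
    ax.set_title(f"layer {i}")
plt.show()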

Introducing EchoFlow

The EchoFlow library implements normalizing flows using PyTorch but also provides additional functionality to make them easier to apply to real-world tabular datasets. To get started with EchoFlow, you can install it with pip by running:

pip install echoflow
python -c "import echoflow; print(echoflow.__version__)"

Then, you can load the spirals dataset and train an EchoFlow model as follows:

from echoflow import EchoFlow
from echoflow.demo import load_dataset

# Fit a flow to the demo spiral dataset, then draw synthetic rows from it.
model = EchoFlow()
model.fit(load_dataset())
synthetic = model.sample(num_samples=10)

You can pass any DataFrame to the fit method; the sample method will return a new DataFrame containing the synthetic data with the specified number of rows. For advanced usage including conditional sampling, custom transformers, and more, check out our documentation!
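As a quick sanity check - assuming the fitted model from the snippet above is still in scope, and that the first two columns of the demo DataFrame hold the spiral coordinates - you can compare a synthetic copy against the original data with matplotlib:

import matplotlib.pyplot as plt
from echoflow.demo import load_dataset

real = load_dataset()
synthetic = model.sample(num_samples=len(real))  # same size as the original

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)
ax1.scatter(real.iloc[:, 0], real.iloc[:, 1], s=2)
ax1.set_title("real")
ax2.scatter(synthetic.iloc[:, 0], synthetic.iloc[:, 1], s=2)
ax2.set_title("synthetic")
plt.show()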

Benchmarks

EchoFlow uses the SDGym library for benchmarking. Using the default models - RNVP and MADE - we obtain better results than the CTGAN model across a variety of simulated datasets.

Currently, EchoFlow does not outperform the baseline on several real-world datasets, largely due to sub-optimal handling of categorical values. We are looking into improving support for categorical variables through methods such as discrete normalizing flows, or by treating them as external conditioning variables as in CTGAN.
