Neural networks are universal function approximators. Key word: function! While powerful function approximators, neural networks are not able to approximate non-functions. One important restriction to remember about functions - they have one input, one output! Neural networks suffer greatly when the training set has multiple values of Y for a single X.

In this guide I'll show you how to approximate the class of non-functions consisting of mappings from x -> y such that multiple y may exist for a given x. We'll use a class of neural networks called "Mixture Density Networks".

I'm going to use the new multibackend Keras Core project to build my Mixture Density networks. Great job to the Keras team on the project - it's awesome to be able to swap frameworks in one line of code.

Some bad news: I use TensorFlow probability in this guide... so it doesn't actually work with other backends.

Anyways, let's start by installing dependencies and sorting out imports:

Next, lets generate a noisy spiral that we're going to attempt to approximate. I've defined a few functions below to do this:

Next, lets invoke this function many times to construct a sample dataset:

As you can see, there's multiple possible values for Y with respect to a given X. Normal neural networks will simply learn the mean of these points with respect to geometric space.

We can quickly show this with a simple linear model:

Let's use mean squared error as well as the adam optimizer. These tend to be reasonable prototyping choices:

We can fit this model quite easy

And let's check out the result:

As expected, the model learns the geometric mean of all points in y for a given x.

Mixture Density Networks

Mixture Density networks can alleviate this problem. A Mixture density is a class of complicated densities expressible in terms of simpler densities. They are effectively the sum of a ton of probability distributions. Mixture Density networks learn to parameterize a mixture density distribution based on a given training set.

As a practitioner, all you need to know, is that Mixture Density Networks solve the problem of multiple values of Y for a given X. I'm hoping to add a tool to your kit- but I'm not going to formally explain the derivation of Mixture Density networks in this guide. The most important thing to know is that a Mixture Density network learns to parameterize a mixture density distribution. This is done by computing a special loss with respect to both the provided y_i label as well as the predicted distribution for the corresponding x_i. This loss function operates by computing the probability that y_i would be drawn from the predicted mixture distribution.

Let's implement a Mixture density network. Below, a ton of helper functions are defined based on an old Keras library Keras Mixture Density Network Layer.

I've adapted the code for use with Keras core.

Lets start writing a Mixture Density Network! First, we need a special activation function: ELU plus a tiny epsilon. This helps prevent ELU from outputting 0 which causes NaNs in Mixture Density Network loss evaluation.

Next, lets actually define a MixtureDensity layer that outputs all values needed to sample from the learned mixture distribution:

Lets construct an Mixture Density Network using our new layer:

Next, let's implement a custom loss function to train the Mixture Density Network layer based on the true values and our expected outputs:

Finally, we can call model.fit() like any other Keras model.

Let's make some predictions!

The MDN does not output a single value; instead it outputs values to parameterize a mixture distribution. To visualize these outputs, lets sample from the distribution.

Note that sampling is a lossy process. If you want to preserve all information as part of a greater latent representation (i.e. for downstream processing) I recommend you simply keep the distribution parameters in place.

Next lets use our sampling function:

Finally, we can visualize our network outputs

Beautiful. Love to see it

Conclusions

Neural Networks are universal function approximators - but they can only approximate functions. Mixture Density networks can approximate arbitrary x->y mappings using some neat probability tricks.

One more pretty graphic for the road: