Recently I had to deal with probabilities for vectors in Python. Surprisingly, handling them is not straightforward, for both practical and numerical reasons. I want to quickly sum up my findings here.
Assuming normal distribution, the vectors can be described by a multivariate normal distribution. In contrast to the univariate normal distribution, it is parameterized with a covariance matrix $\Sigma$ and a mean vector $M$. An excellent compilation of properties and methods to estimate these parameters is given here.
Estimation of the parameters in Python
The second parameter, the mean $M$, is easy to get. It is the mean of each vector component and can be calculated using numpy with
np.mean(vectors, axis=0) (assuming a row layout, i.e. one example per row of the matrix; this layout is assumed for the rest of this article).
The first parameter $\Sigma$, however, is not that easy to determine. This is the covariance matrix of the variables. The formula for its calculation is easy enough, but bearing in mind that the probability density function (PDF) should be evaluated later, one more restriction arises: it must be invertible!
This is a property that is not trivial to get, for numerical and practical reasons. Even the simple example of the three 2D vectors [[1, 2], [2, 3], [3, 4]] (see image below; the mean is drawn in red and is equal to the second vector) has a singular covariance matrix that cannot be used for calculating the PDF.
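To see this concretely, here is a small check using only numpy, showing that the plain sample covariance of these three vectors is singular:

```python
import numpy as np

# The three example vectors, one per row
vectors = np.array([[1, 2], [2, 3], [3, 4]])

# Plain (unshrunk) sample covariance: rows are observations,
# columns are variables
cov = np.cov(vectors, rowvar=False)

# The points are collinear, so the covariance matrix is singular:
# its determinant is (numerically) zero and it cannot be inverted.
print(np.linalg.det(cov))  # ~0
```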
However, there is a clever parameterized method called 'covariance shrinkage' (also described in the Wikipedia article on estimation of covariance matrices). The parameter is the shrinkage factor, which usually must be determined manually.
scikit-learn goes even further and has an implementation of the method of Ledoit and Wolf described in "A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices", Ledoit and Wolf, Journal of Multivariate Analysis, Volume 88, Issue 2, February 2004, pages 365-411. It estimates this parameter automatically, yielding very good results. Using this, the parameters for the example given above can be determined as
```python
import numpy as np
from sklearn.covariance import LedoitWolf

# Define the vectors
vectors = np.array([[1, 2], [2, 3], [3, 4]])

# Parameter estimation
mean = np.mean(vectors, axis=0)  # M
lw = LedoitWolf(store_precision=True, assume_centered=False)
lw.fit(vectors)
cov = lw.covariance_  # Sigma

# If you called LedoitWolf with store_precision=True, you can
# even get the precision matrix (the inverse of the covariance
# matrix) by getting
precision = lw.precision_  # Sigma^-1
```
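As a quick sanity check, the shrunk covariance really is invertible: its product with the stored precision matrix is, up to floating point, the identity. A minimal sketch:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

vectors = np.array([[1, 2], [2, 3], [3, 4]])
lw = LedoitWolf(store_precision=True).fit(vectors)

# Unlike the plain sample covariance, the shrunk estimate can be
# inverted: covariance_ times precision_ is numerically the identity.
print(np.allclose(np.dot(lw.covariance_, lw.precision_), np.eye(2)))
```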
Calculation of the probability density function in Python
For the specific application I was working on, I needed the PDF values for new vectors. Drawing new vector samples is easy using the numpy function numpy.random.multivariate_normal; however, neither numpy nor scipy provides a function for evaluating the PDF at a specific vector.
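As an aside, newer SciPy releases (0.14 and later) do ship scipy.stats.multivariate_normal, which has a pdf method; if upgrading is an option, the density can be obtained directly. A sketch using the estimates from above:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.covariance import LedoitWolf

vectors = np.array([[1, 2], [2, 3], [3, 4]])
mean = np.mean(vectors, axis=0)
cov = LedoitWolf().fit(vectors).covariance_

# Density of new vectors under the fitted distribution; the
# density is highest at the mean
rv = multivariate_normal(mean=mean, cov=cov)
print(rv.pdf([2, 3]), rv.pdf([0, 0]))
```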
Searching a little, I found that several people had been searching for and/or implementing their own solutions for this purpose. A good one that I found quite reliable is by David Cournapeau, and he recommended it himself on this mailing list. I found it in the Trac archive of the referenced project.
I had to change the calculation in _full_gauss_den(x, mu, va, log) to y = -0.5 * N.dot((x - mu), inva) * (x - mu).T and also remove the references to the misc package, which is unproblematic.
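For reference, the core of such a density function is small. Here is a minimal sketch of the idea (not the exact code of that module) using the corrected quadratic form:

```python
import numpy as np

def gauss_pdf(x, mu, cov):
    """Multivariate normal density at row vector(s) x.

    Minimal sketch, not the code from the densities module;
    assumes cov is invertible (e.g. a shrunk estimate).
    """
    x = np.atleast_2d(x)
    d = x.shape[1]
    inva = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    diff = x - mu
    # The quadratic form -0.5 * (x - mu) Sigma^-1 (x - mu)^T,
    # evaluated row by row
    expo = -0.5 * np.sum(np.dot(diff, inva) * diff, axis=1)
    return norm * np.exp(expo)

# The density of a standard 2D normal at the origin is 1 / (2 * pi)
print(gauss_pdf([0, 0], np.zeros(2), np.eye(2)))
```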
With this new module (I called it densities), we can plot the densities easily:
```python
import numpy as np
import matplotlib.pyplot as plt
import densities

# Plot the example vectors as arrows from the origin
X = np.zeros((vectors.shape[0], 1))
Y = np.zeros((vectors.shape[0], 1))
U, V = vectors[:, 0], vectors[:, 1]
plt.figure()
ax = plt.gca()
ax.quiver(X, Y, U, V, angles='xy', scale_units='xy', scale=1)
ax.set_xlim([0, 5])
ax.set_ylim([0, 5])

# Plot the mean vector in red
ax.quiver(0, 0, mean[0], mean[1], angles='xy', scale_units='xy',
          scale=1, color=[1.0, 0, 0])

# Create the sample grid
x = np.linspace(0, 5)
y = np.linspace(0, 5)

# Create the meshgrid
X, Y = np.meshgrid(x, y)
Z = np.zeros(X.shape)

# Get the density values at each grid point
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        Z[i, j] = densities.gauss_den(np.mat([[X[i, j], Y[i, j]]]),
                                      mean, cov)

# Plot and label the contours
contourplot = plt.contour(X, Y, Z)
plt.clabel(contourplot, inline=1, fontsize=10)

# Draw the plot
plt.draw()
```
This leads to the following result: