Pairwise Mahalanobis distances



I need to calculate the sample Mahalanobis distance in R between every pair of observations in an n×p matrix of covariates. I need a solution that is efficient, i.e. only the n(n−1)/2 unique distances are calculated, and preferably one implemented in C/Rcpp/Fortran etc. I assume that Σ, the population covariance matrix, is unknown, and use the sample covariance matrix in its place.

I am particularly interested in this question since there seems to be no "consensus" method for calculating pairwise Mahalanobis distances in R, i.e. it is implemented neither in the dist function nor in the cluster::daisy function. The mahalanobis function does not calculate pairwise distances without additional work from the programmer.

Pairwise Mahalanobis distance in R has already been asked, but the solutions there seem incorrect.

Here is a correct but terribly inefficient (since n×n distances are calculated) method:

set.seed(0)
x0 <- MASS::mvrnorm(33, 1:10, diag(c(seq(1, 1/2, l = 10)), 10))  # 33 observations of 10 variables
# mahalanobis() returns SQUARED distances, so dM holds squared pairwise distances;
# the full n x n matrix is computed before as.dist() discards the redundant half
dM = as.dist(apply(x0, 1, function(i) mahalanobis(x0, i, cov = cov(x0))))

This would be easy enough to code myself in C, but I feel like something this basic should have a pre-existing solution. Is there one?

There are other solutions that fall short: HDMD::pairwise.mahalanobis() calculates n×n distances, when only the n(n−1)/2 unique distances are required. compositions::MahalanobisDist() seems promising, but I don't want my function to come from a package that depends on rgl, which severely limits others' ability to run my code. Unless this implementation is perfect I would rather write my own. Anyone have experience with this function?


Welcome. Can you post in your question the two matrices of distances? And what counts as "inefficient" for you?
ttnphns

Are you only using the sample covariance matrix? If so, this amounts to 1) centering X; 2) computing the SVD of the centered X, say UDV'; and 3) computing the pairwise distances between the rows of U.
vqv

Thanks for posting this as a question. I think your formula is not correct. See my answer below.
user603

@vqv Yes, the sample covariance matrix. The original post has been edited to reflect this.
ahfoss

Answers:



Starting from ahfoss's "succinct" solution, I used the Cholesky decomposition instead of the SVD.

cholMaha <- function(X) {
 dec <- chol( cov(X) )               # upper-triangular R with t(R) %*% R = cov(X)
 tmp <- forwardsolve(t(dec), t(X) )  # whiten: solve the triangular system L z = x
 dist(t(tmp))                        # Euclidean distances of whitened rows = Mahalanobis distances
}
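As a quick sanity check, a sketch along these lines (not part of the original answer; it reuses the naive apply() method from the question as a reference, and takes sqrt() because mahalanobis() returns squared distances) confirms agreement:

set.seed(0)
x0 <- MASS::mvrnorm(33, 1:10, diag(c(seq(1, 1/2, l = 10)), 10))
dNaive <- as.dist(apply(x0, 1, function(i) mahalanobis(x0, i, cov = cov(x0))))
max(abs(cholMaha(x0) - sqrt(dNaive)))  # should be on the order of 1e-12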

It should be faster, because solving a triangular system by forward substitution is faster than a dense matrix multiplication with the inverse covariance matrix (see here). Here are benchmarks against ahfoss's and whuber's solutions in several settings:

 require(microbenchmark)
 set.seed(26565)
 N <- 100
 d <- 10

 X <- matrix(rnorm(N*d), N, d)

 A <- cholMaha( X = X ) 
 A1 <- fastPwMahal(x1 = X, invCovMat = solve(cov(X))) 
 sum(abs(A - A1)) 
 # [1] 5.973666e-12  Reassuring!

   microbenchmark(cholMaha(X),
                  fastPwMahal(x1 = X, invCovMat = solve(cov(X))),
                  mahal(x = X))
Unit: microseconds
expr          min       lq   median       uq      max neval
cholMaha    502.368 508.3750 512.3210 516.8960  542.806   100
fastPwMahal 634.439 640.7235 645.8575 651.3745 1469.112   100
mahal       839.772 850.4580 857.4405 871.0260 1856.032   100

 N <- 10
 d <- 5
 X <- matrix(rnorm(N*d), N, d)

   microbenchmark(cholMaha(X),
                  fastPwMahal(x1 = X, invCovMat = solve(cov(X))),
                  mahal(x = X)
                    )
Unit: microseconds
expr          min       lq    median       uq      max neval
cholMaha    112.235 116.9845 119.114 122.3970  169.924   100
fastPwMahal 195.415 201.5620 205.124 208.3365 1273.486   100
mahal       163.149 169.3650 172.927 175.9650  311.422   100

 N <- 500
 d <- 15
 X <- matrix(rnorm(N*d), N, d)

   microbenchmark(cholMaha(X),
                  fastPwMahal(x1 = X, invCovMat = solve(cov(X))),
                  mahal(x = X)
                    )
Unit: milliseconds
expr          min       lq     median       uq      max neval
cholMaha    14.58551 14.62484 14.74804 14.92414 41.70873   100
fastPwMahal 14.79692 14.91129 14.96545 15.19139 15.84825   100
mahal       12.65825 14.11171 39.43599 40.26598 41.77186   100

 N <- 500
 d <- 5
 X <- matrix(rnorm(N*d), N, d)

   microbenchmark(cholMaha(X),
                  fastPwMahal(x1 = X, invCovMat = solve(cov(X))),
                  mahal(x = X)
                    )
Unit: milliseconds
expr           min        lq      median        uq       max neval
cholMaha     5.007198  5.030110  5.115941  5.257862  6.031427   100
fastPwMahal  5.082696  5.143914  5.245919  5.457050  6.232565   100
mahal        10.312487 12.215657 37.094138 37.986501 40.153222   100

So Cholesky seems to be uniformly faster.


+1 Well done! I appreciate the explanation of why this solution is faster.
whuber

How does maha() give you the pairwise distance matrix, as opposed to the distance to a point?
sheß

You are right, it doesn't, so my edit is not entirely relevant. I will delete it, but maybe one day I will add a pairwise version of maha() to the package. Thanks for pointing this out.
Matteo Fasiolo

That would be nice! I am looking forward to it.


The standard formula for the squared Mahalanobis distance between two data points is

$$D_{12} = (x_1 - x_2)^T \Sigma^{-1} (x_1 - x_2),$$

where $x_i$ is a $p \times 1$ vector corresponding to observation $i$. Typically, the covariance matrix is estimated from the observed data. Not counting the matrix inversion, this operation requires $p^2 + p$ multiplications and $p^2 + 2p$ additions, each repeated $n(n-1)/2$ times.

Consider the following derivation:

$$D_{12} = (x_1 - x_2)^T \Sigma^{-1} (x_1 - x_2) = (x_1 - x_2)^T \Sigma^{-1/2} \Sigma^{-1/2} (x_1 - x_2) = (x_1^T \Sigma^{-1/2} - x_2^T \Sigma^{-1/2}) (\Sigma^{-1/2} x_1 - \Sigma^{-1/2} x_2) = (q_1^T - q_2^T)(q_1 - q_2),$$

where $q_i = \Sigma^{-1/2} x_i$. Note that $x_i^T \Sigma^{-1/2} = (\Sigma^{-1/2} x_i)^T = q_i^T$. This relies on the fact that $\Sigma^{-1/2}$ is symmetric, which in turn follows from the fact that for any symmetric diagonalizable matrix $A = P E P^T$,

$$(A^{1/2})^T = (P E^{1/2} P^T)^T = (P^T)^T (E^{1/2})^T P^T = P E^{1/2} P^T = A^{1/2}.$$

If we let $A = \Sigma^{-1}$ and note that $\Sigma^{-1}$ is symmetric (the inverse of a symmetric matrix is itself symmetric), then $\Sigma^{-1/2}$ must also be symmetric. If $X$ is the $n \times p$ matrix of observations and $Q$ is the $n \times p$ matrix whose $i$th row is $q_i$, then $Q$ can be succinctly expressed as $X \Sigma^{-1/2}$. This and the previous results imply that

$$D_{k\ell} = \sum_{i=1}^p (Q_{ki} - Q_{\ell i})^2,$$

so the only operations that are computed $n(n-1)/2$ times are $p$ multiplications and $2p$ additions (as opposed to the $p^2 + p$ multiplications and $p^2 + 2p$ additions of the standard method above), leading to an algorithm of computational complexity order $O(p n^2 + p^2 n)$ instead of the original $O(p^2 n^2)$.
require(ICSNP) # for pair.diff(), C implementation

fastPwMahal = function(data) {

    # Calculate inverse square root matrix
    invCov = solve(cov(data))
    svds = svd(invCov)
    invCovSqr = svds$u %*% diag(sqrt(svds$d)) %*% t(svds$u)

    Q = data %*% invCovSqr

    # Calculate distances
    # pair.diff() calculates the n(n-1)/2 element-by-element
    # pairwise differences between each row of the input matrix
    sqrDiffs = pair.diff(Q)^2
    distVec = rowSums(sqrDiffs)

    # Create dist object without creating a n x n matrix
    attr(distVec, "Size") = nrow(data)
    attr(distVec, "Diag") = FALSE
    attr(distVec, "Upper") = FALSE
    class(distVec) = "dist"
    return(distVec)
}
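A brief usage sketch (assuming the ICSNP package is installed, and reusing the simulated data from the question; note that this version returns squared distances):

set.seed(0)
x0 <- MASS::mvrnorm(33, 1:10, diag(c(seq(1, 1/2, l = 10)), 10))
d2 <- fastPwMahal(x0)  # "dist" object holding the 33*32/2 squared Mahalanobis distances
head(d2)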

Interesting. Sorry, I don't know R. Could you explain what the function pair.diff() does, with the outputs of each step, and give a numerical example? Thanks.
ttnphns

I edited the answer to include the derivation justifying these calculations, but I also posted a second answer containing code that is much more concise.
ahfoss


Let's try the obvious. From

$$D_{ij} = (x_i - x_j)' \Sigma^{-1} (x_i - x_j) = x_i' \Sigma^{-1} x_i + x_j' \Sigma^{-1} x_j - 2\,x_i' \Sigma^{-1} x_j$$

it follows we can compute the vector

$$u_i = x_i' \Sigma^{-1} x_i$$

in $O(p^2)$ time for each $i$, and the matrix

$$V = X \Sigma^{-1} X'$$

in $O(p n^2 + p^2 n)$ time, whence

$$D = u \oplus u - 2V,$$

where $\oplus$ is the outer product with respect to $+$: $(a \oplus b)_{ij} = a_i + b_j$.

An R implementation succinctly parallels the mathematical formulation (and assumes, with it, that $\Sigma = \operatorname{Var}(X)$ actually is invertible, with its inverse written h here):

mahal <- function(x, h=solve(var(x))) {
  u <- apply(x, 1, function(y) y %*% h %*% y)    # u_i = x_i' h x_i
  d <- outer(u, u, `+`) - 2 * x %*% h %*% t(x)   # D = u (+) u - 2 X h X'
  d[lower.tri(d)]                                # unique squared distances only
}

Note, for compatibility with the other solutions, that only the unique off-diagonal elements are returned, rather than the entire (symmetric, zero-on-the-diagonal) squared distance matrix. Scatterplots show its results agree with those of fastPwMahal.

In C or C++, RAM can be re-used and $u \oplus u$ computed on the fly, obviating any need for intermediate storage of $u \oplus u$.

Timing studies with $n$ ranging from 33 through 5000 and $p$ ranging from 10 to 100 indicate this implementation is 1.5 to 5 times faster than fastPwMahal within that range. The improvement gets better as $p$ and $n$ increase. Consequently, we can expect fastPwMahal to be superior for smaller $p$. The break-even occurs around $p = 7$ for $n \gtrsim 100$. Whether the same computational advantages of this straightforward solution pertain in other implementations may be a matter of how well they take advantage of vectorized array operations.


Looks good. I assume it could be made even more rapid by only calculating the lower diagonals, although I can't off-hand think of a way to do this in R without losing the speedy performance of apply and outer... except for breaking out Rcpp.
ahfoss

apply/outer have no speed advantage over plain-vanilla loops.
user603

@user603 I understand that in principle--but do the timing. Moreover, the main point of using these constructs is to provide semantic help for parallelizing the algorithm: the difference in how they express it is important. (It may be worth recalling the original question seeks C/Fortran/etc. implementations.) Ahfoss, I thought about limiting the calculation to the lower triangle too and agree that in R there seems to be nothing to gain by that.
whuber


If you wish to compute the sample Mahalanobis distance, then there are some algebraic tricks that you can exploit. They all lead to computing pairwise Euclidean distances, so let's assume we can use dist() for that. Let $X$ denote the $n \times p$ data matrix, which we assume to be centered so that its columns have mean 0, and to have rank $p$ so that the sample covariance matrix is nonsingular. (Centering requires $O(np)$ operations.) Then the sample covariance matrix is

$$S = X^T X / n.$$

The pairwise sample Mahalanobis distances of $X$ are the same as the pairwise Euclidean distances of $XL$ for any matrix $L$ satisfying $L L^T = S^{-1}$, e.g. the square root or Cholesky factor. This follows from some linear algebra, and it leads to an algorithm requiring the computation of $S$, $S^{-1}$, and a Cholesky decomposition. The worst case complexity is $O(np^2 + p^3)$.
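A minimal sketch of this first route (an illustration, not from the original answer; it assumes X is the column-centered data matrix from the text and uses this answer's $S = X^T X / n$ convention):

S <- crossprod(X) / nrow(X)  # sample covariance, X assumed column-centered
L <- t(chol(solve(S)))       # chol() returns R with t(R) %*% R = solve(S), so L %*% t(L) = solve(S)
dist(X %*% L)                # pairwise Mahalanobis distances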

More deeply, these distances relate to distances between the sample principal components of $X$. Let $X = U D V^T$ denote the SVD of $X$. Then

$$S = V D^2 V^T / n$$

and

$$S^{-1/2} = V D^{-1} V^T n^{1/2}.$$

So

$$X S^{-1/2} = U V^T n^{1/2},$$

and the sample Mahalanobis distances are just the pairwise Euclidean distances of $U$ scaled by a factor of $\sqrt{n}$, because Euclidean distance is rotation invariant. This leads to an algorithm requiring the computation of the SVD of $X$, which has worst case complexity $O(np^2)$ when $n > p$.

Here is an R implementation of the second method which I cannot test on the iPad I am using to write this answer.

u = svd(scale(x, center = TRUE, scale = FALSE), nv = 0)$u
dist(u)
# these distances need to be scaled by a factor of sqrt(n)
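For completeness, the scaling step might look like this (an addition following the answer's $S = X^T X / n$ convention):

n <- nrow(x)
sqrt(n) * dist(u)  # pairwise sample Mahalanobis distances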


This is a much more succinct solution. It is still based on the derivation involving the inverse square root covariance matrix (see my other answer to this question), but only uses base R and the stats package. It seems to be slightly faster (about 10% faster in some benchmarks I have run). Note that it returns Mahalanobis distance, as opposed to squared Maha distance.

fastPwMahal = function(x1,invCovMat) {
  SQRT = with(svd(invCovMat), u %*% diag(d^0.5) %*% t(v))
  dist(x1 %*% SQRT)
}

This function requires an inverse covariance matrix, and doesn't return a distance object -- but I suspect that this stripped-down version of the function will be more generally useful to stack exchange users.


This could be improved by replacing SQRT with the Cholesky decomposition chol(invCovMat).
vqv
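That variant might look like this (a sketch of the suggestion, not vqv's code; chol() returns an upper-triangular factor, hence the transpose):

fastPwMahalChol = function(x1, invCovMat) {
  # chol() gives upper-triangular R with t(R) %*% R = invCovMat,
  # so the rows of x1 %*% t(R) are the whitened observations
  dist(x1 %*% t(chol(invCovMat)))
}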


I had a similar problem solved by writing a Fortran95 subroutine. As you do, I didn't want to calculate the duplicates among the $n^2$ distances. Compiled Fortran95 is nearly as convenient with basic matrix calculations as R or Matlab, but much faster with loops. The routines for Cholesky decompositions and triangle substitutions can be used from LAPACK.

If you only use the Fortran77-features in the interface, your subroutine is still portable enough for others.



There is a very easy way to do it using the R package "biotools". In this case you will get a squared Mahalanobis distance matrix.

#Manly (2004, p.65-66)

x1 <- c(131.37, 132.37, 134.47, 135.50, 136.17)
x2 <- c(133.60, 132.70, 133.80, 132.30, 130.33)
x3 <- c(99.17, 99.07, 96.03, 94.53, 93.50)
x4 <- c(50.53, 50.23, 50.57, 51.97, 51.37)

#size (n x p) #Means 
x <- cbind(x1, x2, x3, x4) 

#size (p x p) #Variances and Covariances
Cov <- matrix(c(21.112,0.038,0.078,2.01, 0.038,23.486,5.2,2.844, 
        0.078,5.2,24.18,1.134, 2.01,2.844,1.134,10.154), 4, 4)

library(biotools)
Mahalanobis_Distance<-D2.dist(x, Cov)
print(Mahalanobis_Distance)
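Since D2.dist() returns squared distances, a small follow-up (assuming the result behaves as a standard dist object) puts it on the same scale as the dist-based answers above:

sqrt(Mahalanobis_Distance)  # plain (non-squared) pairwise Mahalanobis distances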

Can you please explain to me what a squared distance matrix means? I'm interested in the distance between two points/vectors, so what does a matrix tell me?
Ben


This is my old answer, moved here from another thread and expanded with code.

For a long time I've been computing the square symmetric matrix of pairwise Mahalanobis distances in SPSS via a hat-matrix approach, using the solving of a system of linear equations (which is faster than inverting the covariance matrix).

I'm not an R user, so I've just tried to reproduce @ahfoss's recipe here in SPSS, along with "my" recipe, on data of 1000 cases by 400 variables, and I've found my way considerably faster.


A faster way to calculate the full matrix of pairwise Mahalanobis distances is through the hat matrix $H$. I mean, if you are using a high-level language (such as R) with quite fast built-in matrix multiplication and inversion functions, you will need no loops at all, and it will be faster than doing casewise loops.

Definition. The double-centered matrix of squared pairwise Mahalanobis distances is equal to $H(n-1)$, where the hat matrix is $X(X'X)^{-1}X'$, computed from column-centered data $X$.

So, center the columns of the data matrix, compute the hat matrix, multiply it by (n−1), and perform the operation opposite to double-centering. You get the matrix of squared Mahalanobis distances.

"Double centering" is the geometrically correct conversion of squared distances (such as Euclidean and Mahalanobis) into scalar products defined from the geometric centroid of the data cloud. This operation is implicitly based on the cosine theorem. Imagine you have a matrix of squared euclidean distances between your multivariate data poits. You find the centroid (multivariate mean) of the cloud and replace each pairwise distance by the corresponding scalar product (dot product), it is based on the distances hs to centroid and the angle between those vectors, as shown in the link. The h2s stand on the diagonal of that matrix of scalar products and h1h2cos are the off-diagonal entries. Then, using directly the cosine theorem formula you easily convert the "double-centrate" matrix back into the squared distance matrix.

In our setting, the "double-centered" matrix is specifically the hat matrix (multiplied by n−1), not a matrix of Euclidean scalar products, and the resultant squared distance matrix is thus the squared Mahalanobis distance matrix, not the squared Euclidean distance matrix.

In matrix notation: let $\mathbf{h}$ be the diagonal of $H(n-1)$, a column vector. Propagate the column into a square matrix: $\mathbf{H} = [\mathbf{h}, \mathbf{h}, \dots]$; then $D^2_{\text{mahal}} = \mathbf{H} + \mathbf{H}' - 2H(n-1)$.
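For R users, a minimal sketch of this recipe might look as follows (a hypothetical translation, not part of the original answer, which uses the SPSS code below; it assumes n > p and a nonsingular sample covariance matrix):

hatMaha <- function(X) {
  Xc <- scale(X, center = TRUE, scale = FALSE)              # column-center the data
  Ht <- Xc %*% solve(crossprod(Xc), t(Xc)) * (nrow(X) - 1)  # hat matrix times (n-1), via a linear solve
  h  <- diag(Ht)                                            # the leverages
  sqrt(as.dist(outer(h, h, `+`) - 2 * Ht))                  # reverse the double-centering, take sqrt
}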

The SPSS code and a speed probe are below.


This first code corresponds to @ahfoss function fastPwMahal of the cited answer. It is equivalent to it mathematically. But I'm computing the complete symmetric matrix of distances (via matrix operations) while @ahfoss computed a triangle of the symmetric matrix (element by element).

matrix. /*Matrix session in SPSS;
        /*note: * operator means matrix multiplication, &* means usual, elementwise multiplication.
get data. /*Dataset 1000 cases x 400 variables
!cov(data%cov). /*compute usual covariances between variables [this is my own matrix function].
comp icov= inv(cov). /*invert it
call svd(icov,u,s,v). /*svd
comp isqrcov= u*sqrt(s)*t(v). /*COV^(-1/2)
comp Q= data*isqrcov. /*Matrix Q (see ahfoss answer)
!seuclid(Q%m). /*Compute 1000x1000 matrix of squared euclidean distances;
               /*computed here from Q "data" they are the squared Mahalanobis distances.
/*print m. /*Done, print
end matrix.

Time elapsed: 3.25 sec

The following is my modification of it to make it faster:

matrix.
get data.
!cov(data%cov).
/*comp icov= inv(cov). /*Don't invert.
call eigen(cov,v,s2). /*Do sdv or eigen decomposition (eigen is faster),
/*comp isqrcov= v * mdiag(1/sqrt(s2)) * t(v). /*compute 1/sqrt of the eigenvalues, and compose the matrix back, so we have COV^(-1/2).
comp isqrcov= v &* (make(nrow(cov),1,1) * t(1/sqrt(s2))) * t(v). /*Or this way not doing matrix multiplication on a diagonal matrix: a bit faster .
comp Q= data*isqrcov.
!seuclid(Q%m).
/*print m.
end matrix.

Time elapsed: 2.40 sec

Finally, the "hat matrix approach". For speed, I'm computing the hat matrix (the data must be centered first) $X(X'X)^{-1}X'$ via the generalized inverse $(X'X)^{-1}X'$, obtained with the linear system solver solve(X'X, X').

matrix.
get data.
!center(data%data). /*Center variables (columns).
comp hat= data*solve(sscp(data),t(data))*(nrow(data)-1). /*hat matrix, and multiply it by n-1 (i.e. by df of covariances).
comp ss= diag(hat)*make(1,ncol(hat),1). /*Now using its diagonal, the leverages (as column propagated into matrix).
comp m= ss+t(ss)-2*hat. /*compute matrix of squared Mahalanobis distances via "cosine rule".
/*print m.
end matrix.

[Notice that if in "comp ss" and "comp m" lines you use "sscp(t(data))",
 that is, DATA*t(DATA), in place of "hat", you get usual sq. 
 euclidean distances]

Time elapsed: 0.95 sec


The formula you have posted is not computing what you think you are computing (a U-statistic).

In the code I posted, I use cov(x1) as the scaling matrix (this is the variance of the pairwise differences of the data). You are using cov(x0) (the covariance matrix of your original data). I think this is a mistake on your part. The whole point of using the pairwise differences is that it relieves you from the assumption that the multivariate distribution of your data is symmetric around a centre of symmetry (or from having to estimate that centre of symmetry, for that matter, since crossprod(x1) is proportional to cov(x1)). Obviously, by using cov(x0) you lose that.
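A hypothetical sketch of the construction being described (an illustration, not the answerer's original code; x1 holds the n(n−1)/2 pairwise row differences of the question's x0, and cov(x1), not cov(x0), scales them):

pairDiffs <- function(x0) {  # all n(n-1)/2 pairwise row differences
  idx <- t(combn(nrow(x0), 2))
  x0[idx[, 1], , drop = FALSE] - x0[idx[, 2], , drop = FALSE]
}
x1 <- pairDiffs(x0)
d  <- sqrt(mahalanobis(x1, center = rep(0, ncol(x1)), cov = cov(x1)))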

This is well explained in the paper I linked to in my original answer.


I think we're talking about two different things here. My method calculates Mahalanobis distance, which I've verified against a few other formulas. My formula has also now been independently verified by Matteo Fasiolo and (I assume) whuber in this thread. Yours is different. I'd be interested in understanding what you are calculating, but it is clearly different from the Mahalanobis distance as typically defined.
ahfoss

@ahfoss: 1) Mahalanobis is the distance of the X to a point of symmetry in their metric. In your case, the X are an n*(n-1)/2 matrix of pairwise differences, their center of symmetry is the vector 0_p, and their metric is what I called cov(x1) in my code. 2) Ask yourself why you use a U-statistic in the first place, and as the paper explains you will see that using cov(x0) defeats that purpose.
user603

I think this is the disconnect. In my case the X are the rows of the observed data matrix (not distances), and I am interested in calculating the distance of every row to each other row, not the distance to a center. There are at least three "scenarios" in which Mahalanobis distance is used: [1] distance between distributions, [2] distance of observed units from the center of a distribution, and [3] distance between pairs of observed units (what I am referring to). What you describe resembles [2], except that X in your case are the pairwise distances with center 0_p.
ahfoss

After looking at the Croux et al. 1994 paper you cite, it is clear they discuss Mahalanobis distance in the context of outlier diagnostics, which is scenario [2] in my post above, although I will note that cov(x0) is typically used in this context, and seems to be consistent with Croux et al.'s usage. The paper does not mention U-statistics, at least not explicitly. They do mention S-, GS-, τ-, and LQD-estimators, perhaps you are referring to one of these?
ahfoss
Licensed under cc by-sa 3.0 with attribution required.