Pairwise Mahalanobis distances

18

I need to calculate the sample Mahalanobis distance in R between every pair of observations in an $n \times p$ matrix of covariates. I need a solution that is efficient, i.e. only $n(n-1)/2$ distances are calculated, and preferably implemented in C/RCpp/Fortran etc. I assume that $\Sigma$, the population covariance matrix, is unknown and use the sample covariance matrix in its place.

I am particularly interested in this question because there seems to be no "consensus" method for calculating pairwise Mahalanobis distances in R, i.e. it is implemented neither in the dist function nor in the cluster::daisy function. The mahalanobis function does not calculate pairwise distances without additional work from the programmer.

This was already asked in "Pairwise Mahalanobis distance in R", but the solutions there seem incorrect.

Here is a correct but terribly inefficient method (inefficient because $n \times n$ distances are calculated):

set.seed(0)
x0 <- MASS::mvrnorm(33,1:10,diag(c(seq(1,1/2,l=10)),10))
dM = as.dist(apply(x0, 1, function(i) mahalanobis(x0, i, cov = cov(x0))))
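As a small sanity check of the method above, the following sketch compares one entry of the all-pairs result with the definition $(x_1-x_2)^T S^{-1} (x_1-x_2)$ evaluated by hand. The data are simulated with rnorm purely for illustration, the variable names are not from the original post, and note that mahalanobis() returns squared distances.

```r
set.seed(1)
x <- matrix(rnorm(6 * 3), nrow = 6, ncol = 3)   # 6 observations, 3 covariates (illustrative)
S <- cov(x)

# All-pairs matrix as in the question: squared distances, n x n of them
d_full <- apply(x, 1, function(i) mahalanobis(x, i, cov = S))

# One entry computed directly from the definition
diff_12 <- x[1, ] - x[2, ]
d_12 <- drop(t(diff_12) %*% solve(S) %*% diff_12)

all.equal(d_full[1, 2], d_12)   # the two computations should agree
sqrt(d_12)                      # unsquared Mahalanobis distance between rows 1 and 2
```

Taking the square root at the end gives the distance itself; the as.dist() wrapper in the question stores the squared values.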

This is easy enough to code myself in C, but I feel that something this basic should have a pre-existing solution. Is there one?

There are other solutions that fall short: HDMD::pairwise.mahalanobis() calculates $n \times n$ distances, when only the $n(n-1)/2$ unique distances are required. compositions::MahalanobisDist() seems promising, but I don't want my function to come from a package that depends on rgl, which severely limits others' ability to run my code. Unless this implementation is perfect, I'd rather write my own. Does anybody have experience with this function?

r algorithms distance

— ahfoss
Source

Welcome. Can you print the two distance matrices in your question? And what does "inefficient" mean for you?

— ttnphns

1

Are you using only the sample covariance matrix? If so, then this is equivalent to 1) centering X; 2) computing the SVD of the centered X, say UDV^T; 3) computing the pairwise distances between the rows of U.

— vqv

Thanks for posting this as a question. I think your formula is not correct. See my answer below.

— user603

@vqv Yes, the sample covariance matrix. The original post has been edited to reflect this.

— ahfoss

See also the similar question stats.stackexchange.com/q/33518/3277.

— ttnphns

21

Starting from ahfoss's "succinct" solution, I have used the Cholesky decomposition in place of the SVD.

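cholMaha <- function(X) {
 dec <- chol( cov(X) )                 # upper-triangular R with t(R) %*% R == cov(X)
 tmp <- forwardsolve(t(dec), t(X) )    # solve t(R) %*% tmp = t(X), one column per observation
 dist(t(tmp))                          # Euclidean distances between the transformed rows
}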

This should be faster, because forward-solving a triangular system is faster than a dense matrix multiplication with the inverse covariance (see here). Here are the benchmarks against ahfoss's and whuber's solutions in several settings:

 require(microbenchmark)
 set.seed(26565)
 N <- 100
 d <- 10

 X <- matrix(rnorm(N*d), N, d)

 A <- cholMaha( X = X ) 
 A1 <- fastPwMahal(x1 = X, invCovMat = solve(cov(X))) 
 sum(abs(A - A1)) 
 # [1] 5.973666e-12  Reassuring!

   microbenchmark(cholMaha(X),
                  fastPwMahal(x1 = X, invCovMat = solve(cov(X))),
                  mahal(x = X))
Unit: microseconds
expr          min       lq   median       uq      max neval
cholMaha    502.368 508.3750 512.3210 516.8960  542.806   100
fastPwMahal 634.439 640.7235 645.8575 651.3745 1469.112   100
mahal       839.772 850.4580 857.4405 871.0260 1856.032   100

 N <- 10
 d <- 5
 X <- matrix(rnorm(N*d), N, d)

   microbenchmark(cholMaha(X),
                  fastPwMahal(x1 = X, invCovMat = solve(cov(X))),
                  mahal(x = X)
                    )
Unit: microseconds
expr          min       lq    median       uq      max neval
cholMaha    112.235 116.9845 119.114 122.3970  169.924   100
fastPwMahal 195.415 201.5620 205.124 208.3365 1273.486   100
mahal       163.149 169.3650 172.927 175.9650  311.422   100

 N <- 500
 d <- 15
 X <- matrix(rnorm(N*d), N, d)

   microbenchmark(cholMaha(X),
                  fastPwMahal(x1 = X, invCovMat = solve(cov(X))),
                  mahal(x = X)
                    )
Unit: milliseconds
expr          min       lq     median       uq      max neval
cholMaha    14.58551 14.62484 14.74804 14.92414 41.70873   100
fastPwMahal 14.79692 14.91129 14.96545 15.19139 15.84825   100
mahal       12.65825 14.11171 39.43599 40.26598 41.77186   100

 N <- 500
 d <- 5
 X <- matrix(rnorm(N*d), N, d)

   microbenchmark(cholMaha(X),
                  fastPwMahal(x1 = X, invCovMat = solve(cov(X))),
                  mahal(x = X)
                    )
Unit: milliseconds
expr           min        lq      median        uq       max neval
cholMaha     5.007198  5.030110  5.115941  5.257862  6.031427   100
fastPwMahal  5.082696  5.143914  5.245919  5.457050  6.232565   100
mahal        10.312487 12.215657 37.094138 37.986501 40.153222   100

So Cholesky seems to be uniformly faster.
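As a correctness check to go with the timings, here is a minimal sketch that compares cholMaha() with a brute-force reference built from mahalanobis(). The test data and the names ref_sq and err are illustrative assumptions; the one thing to keep in mind is that cholMaha() returns unsquared distances, whereas mahalanobis() returns squared ones.

```r
cholMaha <- function(X) {
  dec <- chol(cov(X))
  tmp <- forwardsolve(t(dec), t(X))
  dist(t(tmp))
}

set.seed(42)
X <- matrix(rnorm(50 * 4), 50, 4)   # illustrative data: 50 observations, 4 covariates

# Brute-force reference: squared distances, one row at a time
ref_sq <- as.dist(apply(X, 1, function(r) mahalanobis(X, r, cov(X))))

# Square the Cholesky-based distances before comparing
err <- max(abs(as.vector(cholMaha(X))^2 - as.vector(ref_sq)))
err   # should be at the level of floating-point rounding
```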

— Matteo Fasiolo
Source

3

+1 Well done! I appreciate the explanation of why this solution is faster.

— whuber

How does maha() give you the pairwise distance matrix, as opposed to just the distances from a single point?

— she

1

You are right, it doesn't, so my edit is not entirely relevant. I will delete it, but maybe one day I will add a pairwise version of maha() to the package. Thanks for pointing this out.

— Matteo Fasiolo

1

That would be great! Looking forward to it.

— she

9

The standard formula for the squared Mahalanobis distance between two data points is

$D_{12} = (x_1-x_2)^T \Sigma^{-1} (x_1-x_2)$

where $x_i$ is a $p \times 1$ vector corresponding to observation $i$. Typically, the covariance matrix is estimated from the observed data. Not counting the matrix inversion, this operation requires $p^2+p$ multiplications and $p^2+2p$ additions, each repeated $n(n-1)/2$ times.

Consider the following derivation:

$\begin{eqnarray*} D_{12} &=& (x_1-x_2)^T \Sigma^{-1} (x_1-x_2) \\ &=& (x_1-x_2)^T \Sigma^{-\frac{1}{2}} \Sigma^{-\frac{1}{2}} (x_1-x_2) \\ &=& (x_1^T \Sigma^{-\frac{1}{2}} - x_2^T \Sigma^{-\frac{1}{2}}) (\Sigma^{-\frac{1}{2}}x_1 - \Sigma^{-\frac{1}{2}}x_2) \\ &=& (q_1^T - q_2^T)(q_1 - q_2) \end{eqnarray*}$

where $q_i = \Sigma^{-\frac{1}{2}}x_i$. Note that $x_i^T \Sigma^{-\frac{1}{2}} = (\Sigma^{-\frac{1}{2}} x_i)^T = q_i^T$. This relies on the fact that $\Sigma^{-\frac{1}{2}}$ is symmetric, which holds because, for any symmetric diagonalizable matrix $A = PEP^T$,

$\begin{eqnarray*} A^{\frac{1}{2}^T} &=& (PE^{\frac{1}{2}}P^T)^T \\ &=& P^{T^T} E^{\frac{1}{2}^T} P^T \\ &=& PE^{\frac{1}{2}}P^T \\ &=& A^{\frac{1}{2}} \end{eqnarray*}$

If we let $A=\Sigma^{-1}$, and note that $\Sigma^{-1}$ is symmetric, we see that $\Sigma^{-\frac{1}{2}}$ must also be symmetric. If $X$ is the $n \times p$ matrix of observations and $Q$ is the $n \times p$ matrix such that the $i^{th}$ row of $Q$ is $q_i$, then $Q$ can be succinctly expressed as $X\Sigma^{-\frac{1}{2}}$. This and the previous results imply that

$D_{k\ell} = \sum_{i=1}^p (Q_{ki}-Q_{\ell i})^2.$

The only operations that are computed $n(n-1)/2$ times are $p$ multiplications and $2p$ additions (as opposed to the $p^2+p$ multiplications and $p^2+2p$ additions in the above method), resulting in an algorithm of computational complexity order $O(pn^2 + p^2n)$ instead of the original $O(p^2n^2)$.

require(ICSNP) # for pair.diff(), C implementation

fastPwMahal = function(data) {

    # Calculate inverse square root matrix
    invCov = solve(cov(data))
    svds = svd(invCov)
    invCovSqr = svds$u %*% diag(sqrt(svds$d)) %*% t(svds$u)

    Q = data %*% invCovSqr

    # Calculate distances
    # pair.diff() calculates the n(n-1)/2 element-by-element
    # pairwise differences between each row of the input matrix
    sqrDiffs = pair.diff(Q)^2
    distVec = rowSums(sqrDiffs)

    # Create dist object without creating a n x n matrix
    attr(distVec, "Size") = nrow(data)
    attr(distVec, "Diag") = FALSE
    attr(distVec, "Upper") = FALSE
    class(distVec) = "dist"
    return(distVec)
}
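The key step in the derivation, the symmetric inverse square root, can be checked numerically in base R. The sketch below is an illustration under assumptions of my own (simulated data, an eigendecomposition of the covariance rather than an SVD of its inverse, and dist() in place of pair.diff()); it is not the code benchmarked in this thread.

```r
set.seed(7)
X <- matrix(rnorm(40 * 3), 40, 3)   # illustrative data
Sigma <- cov(X)

# Symmetric inverse square root from the eigendecomposition Sigma = P E P^T
eig <- eigen(Sigma, symmetric = TRUE)
invSqrt <- eig$vectors %*% diag(1 / sqrt(eig$values)) %*% t(eig$vectors)

all.equal(invSqrt, t(invSqrt))                  # symmetric, up to rounding
all.equal(invSqrt %*% invSqrt, solve(Sigma))    # squares to the inverse covariance

# Rows of Q are the q_i; squared Euclidean distances on Q are the D_kl
Q <- X %*% invSqrt
d_sq <- dist(Q)^2

# Brute-force reference for comparison
ref_sq <- as.dist(apply(X, 1, function(r) mahalanobis(X, r, Sigma)))
max(abs(as.vector(d_sq) - as.vector(ref_sq)))
```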

— ahfoss
Source

Interesting. Sorry, I'm not sure what pair.diff() does. Could you explain it, and also give a numeric example with printouts of every step of your function? Thanks.

— ttnphns

I edited the answer to include a derivation justifying these calculations, but I also posted a second answer containing code that is much more concise.

— ahfoss

7

Let's try the obvious. From

$D_{ij} = (x_i-x_j)^\prime \Sigma^{-1} (x_i-x_j)=x_i^\prime \Sigma^{-1}x_i + x_j^\prime \Sigma^{-1}x_j -2 x_i^\prime \Sigma^{-1}x_j$

it follows that we can compute the vector

$u_i = x_i^\prime \Sigma^{-1}x_i$

in $O(p^2)$ time and the matrix

$V = X \Sigma^{-1} X^\prime$

in $O(p n^2 + p^2 n)$ time, most likely using built-in fast (parallelizable) array operations, and then form the solution as

$D = u \oplus u - 2 V$

where $\oplus$ is the outer product with respect to $+$: $(a \oplus b)_{ij} = a_i + b_j.$

An R implementation succinctly parallels the mathematical formulation (and assumes, with it, that $\Sigma=\text{Var}(X)$ actually is invertible, with its inverse written $h$ here):

mahal <- function(x, h=solve(var(x))) {
  u <- apply(x, 1, function(y) y %*% h %*% y)
  d <- outer(u, u, `+`) - 2 * x %*% h %*% t(x)
  d[lower.tri(d)]
}

Note, for compatibility with the other solutions, that only the unique off-diagonal elements are returned, rather than the entire (symmetric, zero-on-the-diagonal) squared distance matrix. Scatterplots show that its results agree with those of fastPwMahal.
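The agreement can also be checked numerically rather than by scatterplot. The following sketch compares mahal() with a brute-force reference from mahalanobis(); the simulated data and the names ref and err are assumptions for illustration. Both vectors list the lower triangle in column-major order and both hold squared distances.

```r
mahal <- function(x, h = solve(var(x))) {
  u <- apply(x, 1, function(y) y %*% h %*% y)
  d <- outer(u, u, `+`) - 2 * x %*% h %*% t(x)
  d[lower.tri(d)]
}

set.seed(3)
x <- matrix(rnorm(30 * 5), 30, 5)   # illustrative data

# Reference: squared distances via mahalanobis(), lower triangle only
ref <- as.vector(as.dist(apply(x, 1, function(r) mahalanobis(x, r, cov(x)))))

err <- max(abs(mahal(x) - ref))
err   # expected to be tiny, limited by rounding in u_i + u_j - 2 V_ij
```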

In C or C++, RAM can be re-used and $u\oplus u$ computed on the fly, obviating any need for intermediate storage of $u\oplus u$.
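To make the "on the fly" idea concrete, here is a loop-based sketch written in R only so that it stays in the language of this thread. The function name mahalLowerTri is hypothetical, and explicit R loops like this are not expected to be fast; the point is the access pattern a C or C++ version could follow, filling only the $n(n-1)/2$ unique entries without ever forming $u\oplus u$ or the full $n \times n$ matrix.

```r
mahalLowerTri <- function(x, h = solve(var(x))) {
  n <- nrow(x)
  xh <- x %*% h                  # n x p, computed once
  u <- rowSums(xh * x)           # u_i = x_i' h x_i
  out <- numeric(n * (n - 1) / 2)
  k <- 0L
  for (j in seq_len(n - 1L)) {   # column-major order of the lower triangle
    for (i in (j + 1L):n) {
      k <- k + 1L
      # u_i + u_j - 2 V_ij, with V_ij = x_i' h x_j evaluated on the fly
      out[k] <- u[i] + u[j] - 2 * sum(xh[i, ] * x[j, ])
    }
  }
  out
}

set.seed(4)
x <- matrix(rnorm(12 * 3), 12, 3)   # illustrative data
ref <- as.vector(as.dist(apply(x, 1, function(r) mahalanobis(x, r, cov(x)))))
max(abs(mahalLowerTri(x) - ref))
```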

Timing studies with $n$ ranging from $33$ through $5000$ and $p$ ranging from $10$ to $100$ indicate that this implementation is $1.5$ to $5$ times faster than fastPwMahal within that range. The improvement gets better as $p$ and $n$ increase. Consequently, we can expect fastPwMahal to be superior for smaller $p$. The break-even occurs around $p=7$ for $n\ge 100$. Whether the same computational advantages of this straightforward solution carry over to other implementations may be a matter of how well they take advantage of vectorized array operations.

— whuber
Source

Looks good. I assume it could be made even faster by calculating only the lower triangle, although I can't offhand think of a way to do this in R without losing the fast performance of apply and outer... short of breaking out Rcpp.

— ahfoss

apply/outer have no speed advantage over plain-vanilla loops.

— user603

@user603 I understand that in principle, but do the timing. Moreover, the main point of using these constructs is to provide semantic help for parallelizing the algorithm: the difference in how they express it is important. (It may be worth recalling that the original question calls for C/Fortran/etc. implementations.) ahfoss, I also thought about limiting the calculation to the lower triangle and agree that in R there appears to be nothing to gain by that.

— whuber

5

If you wish to compute the sample Mahalanobis distance, then there are some algebraic tricks that you can exploit. They all lead to computing pairwise Euclidean distances, so let's assume we can use dist() for that. Let $X$ denote the $n\times p$ data matrix, which we assume to be centered so that its columns have mean 0 and to have rank $p$ so that the sample covariance matrix is nonsingular. (Centering requires $O(np)$ operations.) Then the sample covariance matrix is

$S = X^T X / n.$

The pairwise sample Mahalanobis distances of $X$ are the same as the pairwise Euclidean distances of $X L$ for any matrix $L$ satisfying $LL^T = S^{-1}$, e.g. the square root or Cholesky factor. This follows from some linear algebra and it leads to an algorithm requiring the computation of $S$, $S^{-1}$, and a Cholesky decomposition. The worst case complexity is $O(np^2 + p^3)$.
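The claim that any $L$ with $LL^T = S^{-1}$ works can be illustrated with two different choices of $L$. This is a sketch under assumed simulated data, using the divisor $n$ from the text (R's cov() divides by $n-1$ instead); the names L_chol and L_sym are illustrative.

```r
set.seed(11)
X <- scale(matrix(rnorm(25 * 3), 25, 3), center = TRUE, scale = FALSE)
S <- crossprod(X) / nrow(X)      # S = X'X / n for centered X
Sinv <- solve(S)

# Choice 1: Cholesky factor. chol() returns upper-triangular R with R'R = Sinv, so L = R'
L_chol <- t(chol(Sinv))

# Choice 2: symmetric square root of Sinv
e <- eigen(Sinv, symmetric = TRUE)
L_sym <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)

d_chol <- dist(X %*% L_chol)
d_sym  <- dist(X %*% L_sym)
all.equal(as.vector(d_chol), as.vector(d_sym))   # same distances from either factor

# Brute-force reference using the same S
d_ref <- sqrt(as.vector(as.dist(apply(X, 1, function(r) mahalanobis(X, r, S)))))
max(abs(as.vector(d_chol) - d_ref))
```

The two factors differ by an orthogonal rotation, which leaves Euclidean distances unchanged.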

More deeply, these distances relate to distances between the sample principal components of $X$. Let $X=UDV^T$ denote the SVD of $X$. Then

$S=VD^2V^T/n$

and

$S^{-1/2}=VD^{-1}V^T n^{1/2}.$

So

$X S^{-1/2} = UV^T n^{1/2}$

and the sample Mahalanobis distances are just the pairwise Euclidean distances of $U$ scaled by a factor of $\sqrt{n}$, because Euclidean distance is rotation invariant. This leads to an algorithm requiring the computation of the SVD of $X$, which has worst case complexity $O(n p^2)$ when $n>p$.

Here is an R implementation of the second method which I cannot test on the iPad I am using to write this answer.

u = svd(scale(x, center = TRUE, scale = FALSE), nv = 0)$u
dist(u)
# per the derivation above, these distances need to be scaled by a factor of sqrt(n)
# (or sqrt(n - 1) to match R's cov(), which divides by n - 1)
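Since the snippet above was posted untested, here is a small check of the SVD route against mahalanobis(). It is a sketch with assumed simulated data; the scaling uses $\sqrt{n-1}$ because the reference is built with R's cov(), which divides by $n-1$ rather than $n$.

```r
set.seed(5)
x <- matrix(rnorm(20 * 3), 20, 3)   # illustrative data
n <- nrow(x)

# Left singular vectors of the centered data
u <- svd(scale(x, center = TRUE, scale = FALSE), nv = 0)$u
d_svd <- sqrt(n - 1) * dist(u)

# Reference: unsquared distances from mahalanobis() with the usual cov()
d_ref <- sqrt(as.dist(apply(x, 1, function(r) mahalanobis(x, r, cov(x)))))

max(abs(as.vector(d_svd) - as.vector(d_ref)))
```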

— vqv
Source

2

This is a much more succinct solution. It is still based on the derivation involving the inverse square root covariance matrix (see my other answer to this question), but only uses base R and the stats package. It seems to be slightly faster (about 10% faster in some benchmarks I have run). Note that it returns Mahalanobis distance, as opposed to squared Maha distance.

fastPwMahal = function(x1,invCovMat) {
  SQRT = with(svd(invCovMat), u %*% diag(d^0.5) %*% t(v))
  dist(x1 %*% SQRT)
}

This function requires an inverse covariance matrix, and doesn't return a distance object -- but I suspect that this stripped-down version of the function will be more generally useful to stack exchange users.

— ahfoss
Source

3

This could be improved by replacing SQRT with the Cholesky decomposition chol(invCovMat).
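A minimal sketch of that suggestion follows, with the hypothetical name fastPwMahalChol. One detail matters: chol() in R returns the upper-triangular factor $R$ with $R^T R$ equal to the input, so the data should be multiplied by t(R), not by R.

```r
fastPwMahalChol <- function(x1, invCovMat) {
  R <- chol(invCovMat)      # upper triangular, with t(R) %*% R equal to invCovMat
  dist(x1 %*% t(R))         # row x_i' t(R) has squared norm x_i' invCovMat x_i
}

set.seed(2)
X <- matrix(rnorm(40 * 4), 40, 4)   # illustrative data
d_chol <- fastPwMahalChol(X, solve(cov(X)))

# Reference: unsquared distances from mahalanobis()
d_ref <- sqrt(as.dist(apply(X, 1, function(r) mahalanobis(X, r, cov(X)))))
max(abs(as.vector(d_chol) - as.vector(d_ref)))
```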

— vqv

1

I had a similar problem solved by writing a Fortran95 subroutine. As you do, I didn't want to calculate the duplicates among the $n^2$ distances. Compiled Fortran95 is nearly as convenient with basic matrix calculations as R or Matlab, but much faster with loops. The routines for Cholesky decompositions and triangle substitutions can be used from LAPACK.

If you only use the Fortran77-features in the interface, your subroutine is still portable enough for others.

— Horst Grünbusch
Source

1

There is a very easy way to do it using the R package "biotools". In this case you will get a matrix of squared Mahalanobis distances.

#Manly (2004, p.65-66)

x1 <- c(131.37, 132.37, 134.47, 135.50, 136.17)
x2 <- c(133.60, 132.70, 133.80, 132.30, 130.33)
x3 <- c(99.17, 99.07, 96.03, 94.53, 93.50)
x4 <- c(50.53, 50.23, 50.57, 51.97, 51.37)

#size (n x p) #Means 
x <- cbind(x1, x2, x3, x4) 

#size (p x p) #Variances and Covariances
Cov <- matrix(c(21.112,0.038,0.078,2.01, 0.038,23.486,5.2,2.844, 
        0.078,5.2,24.18,1.134, 2.01,2.844,1.134,10.154), 4, 4)

library(biotools)
Mahalanobis_Distance<-D2.dist(x, Cov)
print(Mahalanobis_Distance)

— Jalles10
Source

Can you please explain to me what a squared distance matrix means? That is, I'm interested in the distance between two points/vectors, so what does a matrix tell me?

— Ben

1

This is my old answer, expanded with code and moved here from another thread.

For a long time I have been computing the square symmetric matrix of pairwise Mahalanobis distances in SPSS via a hat matrix approach that solves a system of linear equations (which is faster than inverting the covariance matrix).

I'm not an R user, so I've just tried to reproduce @ahfoss's recipe here in SPSS along with "my" recipe, on data of 1000 cases by 400 variables, and I've found my way considerably faster.

A faster way to calculate the full matrix of pairwise Mahalanobis distances is through hat matrix $\bf H$ . I mean, if you are using a high-level language (such as R) with quite fast matrix multiplication and inversion functions built in you will need no loops at all, and it will be faster than doing casewise loops.

Definition. The double-centered matrix of squared pairwise Mahalanobis distances is equal to $\mathbf{H}(n-1)$ , where the hat matrix is $\bf X(X'X)^{-1}X'$ , computed from column-centered data $\bf X$ .

So, center the columns of the data matrix, compute the hat matrix, multiply by (n-1), and perform the operation opposite to double-centering. You get the matrix of squared Mahalanobis distances.

"Double centering" is the geometrically correct conversion of squared distances (such as Euclidean and Mahalanobis) into scalar products defined from the geometric centroid of the data cloud. This operation is implicitly based on the cosine theorem. Imagine you have a matrix of squared Euclidean distances between your multivariate data points. You find the centroid (multivariate mean) of the cloud and replace each pairwise distance by the corresponding scalar product (dot product); it is based on the distances $h$ to the centroid and the angle between those vectors, as shown in the link. The $h^2$ values stand on the diagonal of that matrix of scalar products and the $h_1h_2\cos$ values are the off-diagonal entries. Then, using the cosine theorem formula directly, you easily convert the "double-centrate" matrix back into the squared distance matrix.

In our settings, the "double-centrate" matrix is specifically the hat matrix (multiplied by n-1), not euclidean scalar products, and the resultant squared distance matrix is thus the squared Mahalanobis distance matrix, not squared euclidean distance matrix.

In matrix notation: Let $H$ be the diagonal of $\mathbf{H}(n-1)$ , a column vector. Propagate the column into the square matrix: H= {H,H,...}; then $\mathbf {D_{mahal}^2} = H+H'-2 \mathbf{H}(n-1)$ .

The code in SPSS and speed probe is below.

This first code corresponds to @ahfoss function fastPwMahal of the cited answer. It is equivalent to it mathematically. But I'm computing the complete symmetric matrix of distances (via matrix operations) while @ahfoss computed a triangle of the symmetric matrix (element by element).

matrix. /*Matrix session in SPSS;
        /*note: * operator means matrix multiplication, &* means usual, elementwise multiplication.
get data. /*Dataset 1000 cases x 400 variables
!cov(data%cov). /*compute usual covariances between variables [this is my own matrix function].
comp icov= inv(cov). /*invert it
call svd(icov,u,s,v). /*svd
comp isqrcov= u*sqrt(s)*t(v). /*COV^(-1/2)
comp Q= data*isqrcov. /*Matrix Q (see ahfoss answer)
!seuclid(Q%m). /*Compute 1000x1000 matrix of squared euclidean distances;
               /*computed here from Q "data" they are the squared Mahalanobis distances.
/*print m. /*Done, print
end matrix.

Time elapsed: 3.25 sec

The following is my modification of it to make it faster:

matrix.
get data.
!cov(data%cov).
/*comp icov= inv(cov). /*Don't invert.
call eigen(cov,v,s2). /*Do svd or eigen decomposition (eigen is faster),
/*comp isqrcov= v * mdiag(1/sqrt(s2)) * t(v). /*compute 1/sqrt of the eigenvalues, and compose the matrix back, so we have COV^(-1/2).
comp isqrcov= v &* (make(nrow(cov),1,1) * t(1/sqrt(s2))) * t(v). /*Or this way not doing matrix multiplication on a diagonal matrix: a bit faster .
comp Q= data*isqrcov.
!seuclid(Q%m).
/*print m.
end matrix.

Time elapsed: 2.40 sec

Finally, the "hat matrix approach". For speed, I'm computing the hat matrix (the data must be centered first) $\bf X(X'X)^{-1}X'$ via generalized inverse $\bf (X'X)^{-1}X'$ obtained in linear system solver solve(X'X,X').

matrix.
get data.
!center(data%data). /*Center variables (columns).
comp hat= data*solve(sscp(data),t(data))*(nrow(data)-1). /*hat matrix, and multiply it by n-1 (i.e. by df of covariances).
comp ss= diag(hat)*make(1,ncol(hat),1). /*Now using its diagonal, the leverages (as column propagated into matrix).
comp m= ss+t(ss)-2*hat. /*compute matrix of squared Mahalanobis distances via "cosine rule".
/*print m.
end matrix.

[Notice that if in "comp ss" and "comp m" lines you use "sscp(t(data))",
 that is, DATA*t(DATA), in place of "hat", you get usual sq. 
 euclidean distances]

Time elapsed: 0.95 sec
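For R users, the hat-matrix recipe might be transcribed roughly as follows. This is a sketch with a hypothetical function name, hatMahal, not a port that has been timed against the SPSS code; it returns the full $n \times n$ matrix of squared distances, as the SPSS version does.

```r
hatMahal <- function(X) {
  Xc <- scale(X, center = TRUE, scale = FALSE)            # center the columns
  # Hat matrix via a linear solve instead of an explicit inverse, times (n - 1)
  H <- (Xc %*% solve(crossprod(Xc), t(Xc))) * (nrow(Xc) - 1)
  h <- diag(H)                                            # scaled leverages
  outer(h, h, `+`) - 2 * H                                # "cosine rule" step
}

set.seed(8)
X <- matrix(rnorm(30 * 4), 30, 4)   # illustrative data

# Reference: full matrix of squared distances from mahalanobis()
ref <- apply(X, 1, function(r) mahalanobis(X, r, cov(X)))
max(abs(hatMahal(X) - ref))
```

Wrapping the result in as.dist() would keep only the lower triangle, although the full matrix is still computed first.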

— ttnphns
Source

0

The formula you have posted is not computing what you think you are computing (a U-statistics).

In the code I posted, I use cov(x1) as the scaling matrix (this is the variance of the pairwise differences of the data). You are using cov(x0) (this is the covariance matrix of your original data). I think this is a mistake on your part. The whole point of using the pairwise differences is that it relieves you from the assumption that the multivariate distribution of your data is symmetric around a centre of symmetry (or from having to estimate that centre of symmetry, for that matter, since crossprod(x1) is proportional to cov(x1)). Obviously, by using cov(x0) you lose that.

This is well explained in the paper I linked to in my original answer.

— user603
Source

1

I think we're talking about two different things here. My method calculates Mahalanobis distance, which I've verified against a few other formulas. My formula has also now been independently verified by Matteo Fasiolo and (I assume) whuber in this thread. Yours is different. I'd be interested in understanding what you are calculating, but it is clearly different from the Mahalanobis distance as typically defined.

— ahfoss

@ahfoss: 1) mahalanobis is the distance of the X to a point of symmetry in their metric. In your case, the X are an n*(n-1)/2 matrix of pairwise differences, their center of symmetry is the vector 0_p and their metric is what I called cov(X1) in my code. 2) ask yourself why you use a U-statistic in the first place, and as the paper explains you will see that using cov(x0) defeats that purpose.

— user603

I think this is the disconnect. In my case the $X$ are the rows of the observed data matrix (not distances), and I am interested in calculating the distance of every row to each other row, not the distance to a center. There are at least three "scenarios" in which Mahalanobis distance is used: [1] distance between distributions, [2] distance of observed units from the center of a distribution, and [3] distance between pairs of observed units (what I am referring to). What you describe resembles [2], except that $X$ in your case are the pairwise distances with center $O_p$.
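The difference between scenarios [2] and [3] shows up directly in the shape of the output. The sketch below uses assumed simulated data and illustrative variable names: distances to the center give one value per observation, while pairwise distances give one value per pair.

```r
set.seed(9)
X <- matrix(rnorm(15 * 2), 15, 2)   # illustrative data: n = 15, p = 2
S <- cov(X)

# Scenario [2]: squared distance of each row from the sample mean
d_center <- mahalanobis(X, colMeans(X), S)

# Scenario [3]: squared distance between every pair of rows
d_pairs <- as.dist(apply(X, 1, function(r) mahalanobis(X, r, S)))

length(d_center)   # one value per observation
length(d_pairs)    # one value per unordered pair of observations
```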

— ahfoss

After looking at the Croux et al. 1994 paper you cite, it is clear they discuss Mahalanobis distance in the context of outlier diagnostics, which is scenario [2] in my post above, although I will note that cov(x0) is typically used in this context, and seems to be consistent with Croux et al.'s usage. The paper does not mention U-statistics, at least not explicitly. They do mention $S$-, $GS$-, $\tau$-, and $LQD$-estimators, perhaps you are referring to one of these?

— ahfoss