{"id":19,"date":"2010-04-30T14:06:37","date_gmt":"2010-04-30T12:06:37","guid":{"rendered":"http:\/\/djalil.chafai.net\/blog\/?p=19"},"modified":"2024-07-22T10:33:23","modified_gmt":"2024-07-22T08:33:23","slug":"wasserstein-distance-between-two-gaussians","status":"publish","type":"post","link":"https:\/\/djalil.chafai.net\/blog\/2010\/04\/30\/wasserstein-distance-between-two-gaussians\/","title":{"rendered":"Wasserstein distance between two Gaussians"},"content":{"rendered":"<figure id=\"attachment_9057\" aria-describedby=\"caption-attachment-9057\" style=\"width: 164px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" src=\"http:\/\/djalil.chafai.net\/blog\/wp-content\/uploads\/2016\/10\/Kantorovich.jpg\" alt=\"Leonid Vitaliyevich Kantorovich (1912 \u2013 1986)\" width=\"164\" height=\"228\" class=\"size-full wp-image-9057\" srcset=\"https:\/\/djalil.chafai.net\/blog\/wp-content\/uploads\/2016\/10\/Kantorovich.jpg 164w, https:\/\/djalil.chafai.net\/blog\/wp-content\/uploads\/2016\/10\/Kantorovich-108x150.jpg 108w\" sizes=\"(max-width: 164px) 100vw, 164px\" \/><figcaption id=\"caption-attachment-9057\" class=\"wp-caption-text\">Leonid Vitaliyevich Kantorovich (1912 \u2013 1986)<\/figcaption><\/figure>\n<p style=\"text-align: justify;\">The \\( {W_2} \\) Wasserstein coupling distance between two probability measures \\( {\\mu} \\) and \\( {\\nu} \\) on \\( {\\mathbb{R}^n} \\) is<\/p>\n<p style=\"text-align: center;\">\\[ W_2(\\mu;\\nu):=\\inf\\mathbb{E}(\\Vert X-Y\\Vert_2^2)^{1\/2} \\]<\/p>\n<p style=\"text-align: justify;\">where the infimum runs over all random vectors \\( {(X,Y)} \\) of \\( {\\mathbb{R}^n\\times\\mathbb{R}^n} \\) with \\( {X\\sim\\mu} \\) and \\( {Y\\sim\\nu} \\). 
It turns out that we have the following nice formula for \\( {d:=W_2(\\mathcal{N}(m_1,\\Sigma_1);\\mathcal{N}(m_2,\\Sigma_2))} \\): <a id=\"eqWG\"><\/a><\/p>\n<p style=\"text-align: center;\">\\[ d^2=\\Vert m_1-m_2\\Vert_2^2 +\\mathrm{Tr}(\\Sigma_1+\\Sigma_2-2(\\Sigma_1^{1\/2}\\Sigma_2\\Sigma_1^{1\/2})^{1\/2}). \\ \\ \\ \\ \\ (1) \\]<\/p>\n<p style=\"text-align: justify;\">This formula interested several authors including <a href=\"http:\/\/www.ams.org\/mathscinet-getitem?mr=MR752258\">Givens and Shortt<\/a>, <a href=\"http:\/\/www.ams.org\/mathscinet-getitem?mr=MR745785\">Knott and Smith<\/a>, <a href=\"http:\/\/www.ams.org\/mathscinet-getitem?mr=MR683223\">Olkin and Pukelsheim<\/a>, and <a href=\"http:\/\/www.ams.org\/mathscinet-getitem?mr=MR666017\">Dowson and Landau<\/a>. Note in particular that we have<\/p>\n<p style=\"text-align: center;\">\\[ \\mathrm{Tr}((\\Sigma_1^{1\/2}\\Sigma_2\\Sigma_1^{1\/2})^{1\/2})= \\mathrm{Tr}((\\Sigma_2^{1\/2}\\Sigma_1\\Sigma_2^{1\/2})^{1\/2}). \\]<\/p>\n<p style=\"text-align: justify;\">In the commutative case where \\( {\\Sigma_1\\Sigma_2=\\Sigma_2\\Sigma_1} \\), the formula <a href=\"#eqWG\">(1)<\/a> simply boils down to<\/p>\n<p style=\"text-align: center;\">\\[ W_2(\\mathcal{N}(m_1,\\Sigma_1);\\mathcal{N}(m_2,\\Sigma_2))^2 =\\Vert m_1-m_2\\Vert_2^2 +\\Vert\\Sigma_1^{1\/2}-\\Sigma_2^{1\/2}\\Vert_{\\mathrm{Frobenius}}^2. \\]<\/p>\n<p style=\"text-align: justify;\">To prove <a href=\"#eqWG\">(1)<\/a>, one can first reduce to the centered case \\( {m_1=m_2=0} \\). 
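As a side note, formula (1) and its commutative simplification are easy to check numerically. Here is a minimal sketch in Python, assuming NumPy and SciPy are available; the helper name `w2_gaussian_squared` is purely illustrative.

```python
# Numerical check of formula (1); a sketch assuming NumPy and SciPy.
# The helper name `w2_gaussian_squared` is illustrative, not standard.
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian_squared(m1, S1, m2, S2):
    """Squared W2 distance between N(m1, S1) and N(m2, S2) via formula (1)."""
    rS1 = sqrtm(S1)                 # Sigma_1^{1/2}
    cross = sqrtm(rS1 @ S2 @ rS1)   # (Sigma_1^{1/2} Sigma_2 Sigma_1^{1/2})^{1/2}
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross).real)

# Commuting (here: diagonal) covariances, where (1) should reduce to
# ||m1 - m2||^2 + ||Sigma_1^{1/2} - Sigma_2^{1/2}||_Frobenius^2.
m1, m2 = np.zeros(2), np.array([1.0, 0.0])
S1, S2 = np.diag([1.0, 4.0]), np.diag([9.0, 1.0])
d2 = w2_gaussian_squared(m1, S1, m2, S2)
d2_commutative = np.sum((m1 - m2) ** 2) + np.sum((sqrtm(S1) - sqrtm(S2)) ** 2)
print(d2, d2_commutative)  # both equal 6 up to rounding
```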
Next, if \\( {(X,Y)} \\) is a random vector (Gaussian or not) of \\( {\\mathbb{R}^n\\times\\mathbb{R}^n} \\) with covariance matrix<\/p>\n<p style=\"text-align: center;\">\\[ \\Gamma= \\begin{pmatrix} \\Sigma_1 & C\\\\ C^\\top&\\Sigma_2 \\end{pmatrix} \\]<\/p>\n<p style=\"text-align: justify;\">then the quantity<\/p>\n<p style=\"text-align: center;\">\\[ \\mathbb{E}(\\Vert X-Y\\Vert_2^2)=\\mathrm{Tr}(\\Sigma_1+\\Sigma_2-2C) \\]<\/p>\n<p style=\"text-align: justify;\">depends only on \\( {\\Gamma} \\). Also, when \\( {\\mu=\\mathcal{N}(0,\\Sigma_1)} \\) and \\( {\\nu=\\mathcal{N}(0,\\Sigma_2)} \\), one can restrict the infimum defining \\( {W_2} \\) to run over Gaussian laws \\( {\\mathcal{N}(0,\\Gamma)} \\) on \\( {\\mathbb{R}^n\\times\\mathbb{R}^n} \\) with covariance matrix \\( {\\Gamma} \\) structured as above. The sole constraint on \\( {C} \\) is the Schur complement condition:<\/p>\n<p style=\"text-align: center;\">\\[ \\Sigma_1-C\\Sigma_2^{-1}C^\\top\\succeq0. \\]<\/p>\n<p style=\"text-align: justify;\">The minimization of the function<\/p>\n<p style=\"text-align: center;\">\\[ C\\mapsto-2\\mathrm{Tr}(C) \\]<\/p>\n<p style=\"text-align: justify;\">under the constraint above leads to <a href=\"#eqWG\">(1)<\/a>. A detailed proof is given by <a href=\"http:\/\/www.ams.org\/mathscinet-getitem?mr=MR752258\">Givens and Shortt<\/a>. Alternatively, one may exhibit an optimal transportation map, as do <a href=\"http:\/\/www.ams.org\/mathscinet-getitem?mr=MR745785\">Knott and Smith<\/a>. It turns out that \\( {\\mathcal{N}(m_2,\\Sigma_2)} \\) is the image law of \\( {\\mathcal{N}(m_1,\\Sigma_1)} \\) under the linear map<\/p>\n<p style=\"text-align: center;\">\\[ x\\mapsto m_2+A(x-m_1) \\]<\/p>\n<p style=\"text-align: justify;\">where<\/p>\n<p style=\"text-align: center;\">\\[ A=\\Sigma_1^{-1\/2}(\\Sigma_1^{1\/2}\\Sigma_2\\Sigma_1^{1\/2})^{1\/2}\\Sigma_1^{-1\/2}=A^\\top. 
\\]<\/p>\n<p style=\"text-align: justify;\">To check that this maps \\( {\\mathcal{N}(m_1,\\Sigma_1)} \\) to \\( {\\mathcal{N}(m_2,\\Sigma_2)} \\), say in the case \\( {m_1=m_2=0} \\) for simplicity, one may define the random column vectors \\( {X\\sim\\mathcal{N}(m_1,\\Sigma_1)} \\) and \\( {Y=AX} \\) and write<\/p>\n<p style=\"text-align: center;\">\\[ \\begin{array}{rcl} \\mathbb{E}(YY^\\top) &=& A \\mathbb{E}(XX^\\top) A^\\top\\\\ &=& \\Sigma_1^{-1\/2}(\\Sigma_1^{1\/2}\\Sigma_2\\Sigma_1^{1\/2})^{1\/2} (\\Sigma_1^{1\/2}\\Sigma_2\\Sigma_1^{1\/2})^{1\/2}\\Sigma_1^{-1\/2}\\\\ &=& \\Sigma_2. \\end{array} \\]<\/p>\n<p style=\"text-align: justify;\">To check that the map is optimal, one may write<\/p>\n<p style=\"text-align: center;\">\\[ \\begin{array}{rcl} \\mathbb{E}(\\|X-Y\\|_2^2) &=&\\mathbb{E}(\\|X\\|_2^2)+\\mathbb{E}(\\|Y\\|_2^2)-2\\mathbb{E}(\\langle X,Y\\rangle) \\\\ &=&\\mathrm{Tr}(\\Sigma_1)+\\mathrm{Tr}(\\Sigma_2)-2\\mathbb{E}(\\langle X,AX\\rangle)\\\\ &=&\\mathrm{Tr}(\\Sigma_1)+\\mathrm{Tr}(\\Sigma_2)-2\\mathrm{Tr}(\\Sigma_1A) \\end{array} \\]<\/p>\n<p style=\"text-align: justify;\">and observe that, by the cyclic property of the trace,<\/p>\n<p style=\"text-align: center;\">\\[ \\mathrm{Tr}(\\Sigma_1 A) =\\mathrm{Tr}((\\Sigma_1^{1\/2}\\Sigma_2\\Sigma_1^{1\/2})^{1\/2}). \\]<\/p>\n<p style=\"text-align: justify;\">The generalizations to elliptical families of distributions and to infinite-dimensional Hilbert spaces are probably easy. Some more “geometric” properties of Gaussians with respect to such distances were studied more recently by <a href=\"http:\/\/arxiv.org\/abs\/0801.2250\">Takatsu<\/a> and <a href=\"http:\/\/arxiv.org\/abs\/0812.2752\">Takatsu and Yokota<\/a>.<\/p>\n<p style=\"text-align: justify;\">The optimal transport map may look strange at first sight. 
Actually, the uniqueness part of the Brenier theorem states that if \\( {\\mu} \\) is absolutely continuous and a map \\( {T} \\) maps \\( {\\mu} \\) to \\( {\\nu} \\) and is the gradient of a convex function, then it is the optimal transport map from \\( {\\mu} \\) to \\( {\\nu} \\) and<\/p>\n<p style=\"text-align: center;\">\\[ W_2^2(\\mu,\\nu)=\\int|T(x)-x|^2\\,\\mathrm{d}\\mu. \\]<\/p>\n<p style=\"text-align: justify;\">On the other hand, the affine map \\( {x\\in\\mathbb{R}^d\\mapsto m+Ax\\in\\mathbb{R}^d} \\) is the gradient of a convex function if and only if the \\( {d\\times d} \\) matrix \\( {A} \\) is symmetric positive semidefinite. As a direct application, since<\/p>\n<p style=\"text-align: center;\">\\[ x\\mapsto m_2+\\Sigma_1^{-1\/2}(\\Sigma_1^{1\/2}\\Sigma_2\\Sigma_1^{1\/2})^{1\/2}\\Sigma_1^{-1\/2}(x-m_1) \\]<\/p>\n<p style=\"text-align: justify;\">is the gradient of a convex function and maps \\( {\\mathcal{N}(m_1,\\Sigma_1)} \\) to \\( {\\mathcal{N}(m_2,\\Sigma_2)} \\), it is the optimal transport map, and we get the formula for \\( {W_2} \\)! 
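As a quick numerical sanity check, one can verify on a random example that this affine map is symmetric, pushes the first Gaussian to the second, and that its quadratic cost matches formula (1). This is a sketch assuming NumPy and SciPy; all variable names are ours.

```python
# Sketch (assuming NumPy/SciPy): check that x -> m2 + A(x - m1) with
# A = Sigma_1^{-1/2} (Sigma_1^{1/2} Sigma_2 Sigma_1^{1/2})^{1/2} Sigma_1^{-1/2}
# pushes N(m1, Sigma_1) to N(m2, Sigma_2) and attains formula (1).
import numpy as np
from scipy.linalg import sqrtm, inv

rng = np.random.default_rng(0)
m1, m2 = np.zeros(3), np.ones(3)
B1, B2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
S1, S2 = B1 @ B1.T + np.eye(3), B2 @ B2.T + np.eye(3)  # generic SPD covariances

rS1 = sqrtm(S1)
irS1 = inv(rS1)
A = irS1 @ sqrtm(rS1 @ S2 @ rS1) @ irS1

# A is symmetric (gradient of the convex quadratic x -> <Ax, x>/2),
# and the image of N(m1, S1) under the affine map has covariance A S1 A^T.
assert np.allclose(A, A.T)
assert np.allclose(A @ S1 @ A.T, S2)

# The transport cost E||X - T(X)||^2 of this map equals formula (1):
# ||m1 - m2||^2 + Tr((I - A) S1 (I - A)^T) = ||m1 - m2||^2 + Tr(S1 + S2 - 2 S1 A).
cost = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * S1 @ A)
d2 = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * sqrtm(rS1 @ S2 @ rS1))
assert abs(cost - d2) < 1e-6
```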
This is an alternative to Givens and Shortt.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The \\( {W_2} \\) Wasserstein coupling distance between two probability measures \\( {\\mu} \\) and \\( {\\nu} \\) on \\( {\\mathbb{R}^n} \\) is \\[ W_2(\\mu;\\nu):=\\inf\\mathbb{E}(\\Vert&#8230;<\/p>\n<div class=\"more-link-wrapper\"><a class=\"more-link\" href=\"https:\/\/djalil.chafai.net\/blog\/2010\/04\/30\/wasserstein-distance-between-two-gaussians\/\">Continue reading<span class=\"screen-reader-text\">Wasserstein distance between two Gaussians<\/span><\/a><\/div>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":19329},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/djalil.chafai.net\/blog\/wp-json\/wp\/v2\/posts\/19"}],"collection":[{"href":"https:\/\/djalil.chafai.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/djalil.chafai.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/djalil.chafai.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/djalil.chafai.net\/blog\/wp-json\/wp\/v2\/comments?post=19"}],"version-history":[{"count":25,"href":"https:\/\/djalil.chafai.net\/blog\/wp-json\/wp\/v2\/posts\/19\/revisions"}],"predecessor-version":[{"id":20477,"href":"https:\/\/djalil.chafai.net\/blog\/wp-json\/wp\/v2\/posts\/19\/revisions\/20477"}],"wp:attachment":[{"href":"https:\/\/djalil.chafai.net\/blog\/wp-json\/wp\/v2\/media?parent=19"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/djalil.chafai.net\/blog\/wp-json\/wp\/v2\/categories?post=19"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/djalil.chafai.net\/blog\/wp-json\/wp\/v2\/tags?post=19"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}