Does Dodecad K12 or Gedrosia K12 calculator make more sense? [Archive] - The Apricity Forum: A European Cultural Community

Zoro

10-01-2021, 02:13 PM

Here’s a map, based on the Dodecad K12 GEDmatch calculator, of Gedrosian (Baloch) admixture percentages. It shows Europeans scoring significant Gedrosian, with northern Europeans in fact scoring even more than southern Europeans!

https://i.imgur.com/6D1xmEb.jpg

Here’s a chart of the Gedrosia (Baloch) component from the GedrosiaDNA K12 spreadsheet on GEDmatch. It shows Europeans scoring almost 0 Gedrosian; central and south Indians scoring less Gedrosian than Iranics such as Kurds; Africans, East Asians and Siberians scoring zero Gedrosian; and Arabs and Caucasians scoring significantly less Gedrosian than Iranics.

https://i.imgur.com/4OIwLdk.jpg

WHICH CALCULATOR MAKES MORE SENSE TO YOU: Dodecad K12b or Gedrosia K12?

Mejgusu

10-01-2021, 02:21 PM

I think it’s actually true that some Europeans have some Gedrosia, considering it was always a part of steppe. The maps on this site aren’t always very accurate, but this one seems to be right for at least parts of West Asia. The other chart seems to be partly right too, although I must say Turks have much more Gedrosia than shown there. The same goes for Caucasians: both groups have 10-20%, not 1-10%.

Lucas

10-01-2021, 02:23 PM

Honestly, the components are not very important; the oracle is. For Dodecad, after many updates, it is quite good.
But of course I agree that Gedrosian in Dodecad is too high in Europe.

Zoro

10-01-2021, 02:37 PM

I think it’s actually true that some Europeans have some Gedrosia, considering it was always a part of steppe. The maps on this site aren’t always very accurate, but this one seems to be right for at least parts of West Asia. The other chart seems to be partly right too, although I must say Turks have much more Gedrosia than shown there. The same goes for Caucasians: both groups have 10-20%, not 1-10%.

I’m not saying Gedrosia K12 is perfect, because it’s still based on the ADMIXTURE program and the Caucasian vs. Arab results may be slightly off. Overall, though, the significant Gedrosia (Baloch) scores of Europeans, especially N. Europeans, in Dodecad K12b are a much more serious flaw than anything in Gedrosia K12.

Steppe wouldn’t be a valid excuse for Dodecad K12, because that component is based on specific Baloch genetic drift; that’s why they score 60%.

I didn’t mention the Harappaworld calculator, but that one may be even more flawed than Dodecad, because there C. Indians score more Gedrosian than Iranics, including Pashtuns!

Zoro

10-01-2021, 02:39 PM

Honestly, the components are not very important; the oracle is. For Dodecad, after many updates, it is quite good.
But of course I agree that Gedrosian in Dodecad is too high in Europe.

I disagree. Accurate percentages are more important to most people, because they want to know how similar they are to other world populations.

Mejgusu

10-01-2021, 02:48 PM

I’m not saying Gedrosia K12 is perfect, because it’s still based on the ADMIXTURE program and the Caucasian vs. Arab results may be slightly off. Overall, though, the significant Gedrosia (Baloch) scores of Europeans, especially N. Europeans, in Dodecad K12b are a much more serious flaw than anything in Gedrosia K12.

Steppe wouldn’t be a valid excuse for Dodecad K12, because that component is based on specific Baloch genetic drift; that’s why they score 60%.

I didn’t mention the Harappaworld calculator, but that one may be even more flawed than Dodecad, because there C. Indians score more Gedrosian than Iranics, including Pashtuns!

I would say it isn’t an excuse, but an explanation. At least I wouldn’t be surprised if some score just some Gedrosia.

It’s probably not perfect, like any other calculator, but there must be something wrong if they show Turks and Caucasians with that amount of Gedrosia. Dodecad K12b is a great calculator, maybe the best when it comes to West Asians, maybe not the best when it comes to Europeans. The same goes for Harappaworld imo: maybe it doesn’t work well for South Asians, but like Dodecad it is precise for West and Central Asians.

Zoro

10-01-2021, 03:11 PM

I would say it isn’t an excuse, but an explanation. At least I wouldn’t be surprised if some score just some Gedrosia.

It’s probably not perfect, like any other calculator, but there must be something wrong if they show Turks and Caucasians with that amount of Gedrosia.

I don’t think it’s that unreasonable that Gedrosia K12 shows Adana and Kayseri Turks with a slightly higher Gedrosia score than Georgians and Abkhazians, though, if we keep in mind that those Turks may in fact be a little closer to Kurds genetically (they may be more admixed with Kurds). It’s certainly not as outrageous as N. Europeans scoring that much Gedrosia on Dodecad K12.

I also don’t trust the Gedrosia scores on Harappaworld, since it shows C. Indians scoring more Baloch Gedrosia than Iranics, which we know is impossible.

Dr_Maul

10-01-2021, 03:27 PM

Lucas is right, these components are worthless (except in G25); the oracle is what matters. Gedrosia is neither Baloch nor Iran N, and steppe populations are half CHG, not either of those, so idk why it shows up in NW Europe. Just use G25 for these details.

Komintasavalta

10-01-2021, 03:37 PM

Maybe Gedrosia is a proxy for steppe ancestry in Dod K12, which would explain why it's higher in Northern Europeans than in Southern Europeans.

In the ADMIXTURE run below, there were 5 ancient populations, 40 modern populations, and 20 samples from each population. So there were 8 times as many modern samples as ancient samples, and the SNP loadings of the admixture components were mostly determined by modern samples. Anyway, it's interesting how the main component in modern Europeans ended up being the green component that is maximal in Balochi and Brahui. The steppe samples had on average 64% of the green component and 36% of the WHG component. Also the proportion of the Balochi-Brahui component is higher in Mordovians and Russians than in Southern Europeans.

https://i.ibb.co/DRv1Ybv/admixfail2.jpg

Zoro

10-01-2021, 04:09 PM

Lucas is right, these components are worthless (except in G25); the oracle is what matters. Gedrosia is neither Baloch nor Iran N, and steppe populations are half CHG, not either of those, so idk why it shows up in NW Europe. Just use G25 for these details.

The only problem, which no one here understands except maybe Komi, is that SNP loadings or G25 coordinates are affected by which populations are in the run that generates the coordinates. For example, if you want to make Iranians or Kurds more W. Eurasian, you can put more Caucasian and European samples in the run. That will make the Iranian/Kurd coordinates more W. Eurasian. If you want to make them more E. Eurasian, you do the opposite. You won’t become suspicious, because the G25 modelling will still be reasonable.

The only way you would become suspicious is if you did IBS comparisons of Kurds and Iranians vs. E. and W. Eurasians.

I don’t think the G25 guy is doing anything on purpose; it’s just a result of the chaotic way he keeps adding samples to the run, or of the unbalanced way he generated those coordinates originally.

Komintasavalta

10-01-2021, 04:37 PM

All samples in G25 are projected according to vbnetkhio: https://www.theapricity.com/forum/showthread.php?352378-Making-Vahaduo-like-models-based-on-SNP-level-data&p=7303874&viewfull=1#post7303874. So if that's the case, and if Davidski always uses the same set of reference samples when he generates new G25 coordinates, it doesn't matter how many new non-reference samples he adds to a SmartPCA run at a time.

If G25 only includes projected samples, I think it's because projected samples plot differently than reference samples in SmartPCA, which I demonstrated in the thread linked above by making a run where half of the samples from each population were projected and half were references. The same phenomenon in ADMIXTURE calculators is known as the infamous "calculator effect": https://bga101.blogspot.com/2012/05/beware-calculator-effect.html.

The way you can do a projected run with SmartPCA is to add the parameter `poplistname: /path/to/poplist`, where the poplist file contains the names of populations that are used as reference samples. Population names are specified in the sixth field of the fam file and not the first field.

(You probably know this already, but I'm trying to explain it to other users.)
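
To illustrate the general idea of projection, here is a minimal Python sketch on made-up 2-D points instead of genotypes. It is not SmartPCA's algorithm; the names `first_axis` and `project` and all coordinates are hypothetical. The axis is fitted on the reference points only, so a projected point gets a coordinate without influencing the axis, whereas adding the same point to the references rotates the axis.

```python
import math

def first_axis(points):
    """Mean and unit vector of the first principal axis of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed-form angle of the leading eigenvector of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

def project(point, mean, axis):
    """Coordinate of a point along an axis that was fitted on other samples."""
    return (point[0] - mean[0]) * axis[0] + (point[1] - mean[1]) * axis[1]

references = [(0.0, 0.0), (2.0, 1.0), (4.0, 2.0), (6.0, 3.0)]  # made-up reference samples
extra = (10.0, 0.0)                                            # made-up non-reference sample

# Projected run: the axis comes from the references alone.
ref_mean, ref_axis = first_axis(references)
projected_coord = project(extra, ref_mean, ref_axis)

# Non-projected run: the extra sample also helps define the axis.
all_mean, all_axis = first_axis(references + [extra])
joint_coord = project(extra, all_mean, all_axis)

print("axis from references only:", ref_axis, "coordinate:", projected_coord)
print("axis with extra sample included:", all_axis, "coordinate:", joint_coord)
```

In this toy setup any number of projected samples can be added without changing `ref_axis`, which is the point made above about a fixed reference set.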

Zoro

10-01-2021, 04:45 PM

Even if that’s the case, his original run when he generated the coordinates wasn’t balanced. Do an experiment: run the Plink PCA program with only 2 Iranians. In one run include Europeans and S. and E. Asians. In the other run include Europeans, S. and E. Asians AND Caucasians. Post the coordinates of the 2 Iranians here for both runs.
That should be interesting.

Lucas

10-01-2021, 04:57 PM

Even if that’s the case, his original run when he generated the coordinates wasn’t balanced. Do an experiment: run the Plink PCA program with only 2 Iranians. In one run include Europeans and S. and E. Asians. In the other run include Europeans, S. and E. Asians AND Caucasians. Post the coordinates of the 2 Iranians here for both runs.
That should be interesting.

Just for reference, G25 is based on SmartPCA, not Plink PCA.

Zoro

10-01-2021, 05:00 PM

Just for reference, G25 is based on SmartPCA, not Plink PCA.

OK, but they’re both PCA programs. He can do his experiment in SmartPCA instead.

Komintasavalta

10-01-2021, 05:31 PM

Even if that’s the case his original run when he generated coordinates wasn’t balanced.

Yeah, has he ever published the list of samples that he used in the original reference run for G25?

When you use all modern and ancient population averages from the official G25 datasheets as a source but you remove Chuvashes, then the best model for Maris has a distance of .073, and similarly if you remove Maris, then the best model for Chuvashes has a distance of .044:

$ curl 'https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y' -Lso mas
$ curl 'https://drive.google.com/uc?export=download&id=1F2rKEVtu8nWSm7qFhxPU6UESQNsmA-sl' -Lso aas
$ curl https://pastebin.com/raw/afaMiFSa|tr -d \\r>mix;chmod +x mix
$ pip3 install cvxpy
[...]
$ t=Mari;./mix <(cat [am]as|grep -v ^$t,) <(grep ^$t, mas) -s
Mari (.039): 89% Chuvash + 9% RUS_Krasnoyarsk_BA + 2% RUS_AfontovaGora3
$ t=Mari;./mix <(cat [am]as|grep -v ^$t,|grep -v Chuvash) <(grep ^$t, mas) -s
Mari (.073): 88% Udmurt + 9% RUS_Krasnoyarsk_BA + 2% ITA_Tagliente + 1% DEU_LBK_KD
$ t=Chuvash;./mix <(cat [am]as|grep -v ^$t,) <(grep ^$t, mas) -s
Chuvash (.005): 68% Mari + 8% Lithuanian_RA + 6% Russian_Belgorod + 5% Lithuanian_VZ + 5% Darginian + 2% HRV_Vucedol + 2% CHN_Amur_River_Xianbei_IA + 1% MNG_Afanasievo_1_contam + 1% Ket + 1% GEO_CHG + 1% VNM_BA_Dong_Son_Culture + 0% Han_Shanghai + 0% Sorb_Niederlausitz + 0% CHE_FN_steppe_contam + 0% CHN_Miaozigou_MN + 0% Sakha + 0% UKR_Cimmerian_o
$ t=Chuvash;./mix <(cat [am]as|grep -v ^$t,|grep -v Mari) <(grep ^$t, mas) -s
Chuvash (.044): 81% Udmurt + 6% Russian_Pinega + 4% HUN_MBA_Vatya_o + 4% DEU_LBK_KD + 2% RUS_Krasnoyarsk_BA + 2% CHN_Yinwang_500BP + 1% Baltic_EST_BA

I think it's because in the initial set of reference samples that Davidski used with G25, there were some Mari or Chuvash samples, so some PCs on G25 ended up accounting for drift that is specific to Maris or Chuvashes. But then G25 gives less weight to the drift of other populations that were not included among the initial reference samples.

I merged samples from 1240K+HO with samples from Cardona et al. 2014 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73996), and I calculated an f2 matrix for the samples. Then for each population that had an identical name in G25 and my dataset, I compared the f2 distance to the scaled G25 distance. In the plot below, Maris and Chuvashes are actually above the diagonal, because G25 accounts for drift that is specific to Maris. But then G25 underestimates the distance to other drifted or isolated populations, like Kubachinian, Kalash, Udmurt, Komi, Scottish, Icelandic, Kusunda, Chukchi, Surui, etc. (The reason why the distance to Even is much bigger in G25 than in my f2 matrix is that the Even population average in G25 is modeled as 12% Norwegian and 88% Han_Shanghai, but my Even samples had an average of 30% of a Caucasoid component in a K=2 Eurasian ADMIXTURE run.)

https://i.ibb.co/4dJQVRd/1.png

Zoro

10-01-2021, 05:59 PM

Well, as you have discovered, results change based on which samples are included in PCA, ADMIXTURE calculators and G25, and on how you run them. So there are always many unknown biases.

I can tell you for a fact that the E. Eurasian percentage in Iranics is a joke no matter which of these methods you use. To reduce biases and get closer to the truth, just do a one-to-one IBS or IBD comparison of each sample, or maybe use D-statistics or something like that. Doing so, you’ll see that these calculator results are full of biases and inaccuracies.

And to reduce bias even further, use whole genomes in IBS or D-statistics.
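
A one-to-one IBS comparison can be sketched roughly as below. This is a toy Python version, assuming genotypes are coded as 0, 1 or 2 copies of an allele with `None` for a missing call; the function name `ibs_similarity` and the three sample vectors are made up for illustration, and real comparisons would run over hundreds of thousands of SNPs from actual genotype files.

```python
def ibs_similarity(g1, g2):
    """Mean identity-by-state over SNPs called in both samples.

    Each SNP contributes 1 when the genotypes match, 0.5 when they
    differ by one allele, and 0 when they are opposite homozygotes.
    """
    shared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not shared:
        raise ValueError("no SNPs called in both samples")
    return sum(1 - abs(a - b) / 2 for a, b in shared) / len(shared)

# Made-up genotypes at eight SNPs (not real data).
sample_a = [0, 1, 2, 2, 1, 0, None, 2]
sample_b = [0, 1, 2, 1, 1, 0, 2, 2]
sample_c = [2, 1, 0, 0, 1, 2, 0, 1]

ibs_ab = ibs_similarity(sample_a, sample_b)
ibs_ac = ibs_similarity(sample_a, sample_c)
print("a vs b:", ibs_ab)
print("a vs c:", ibs_ac)
```

Because each pair of samples is compared directly, the result does not depend on which other populations happen to be in the dataset, which is the property the post is after.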

Komintasavalta

10-01-2021, 07:12 PM

I can tell you for a fact that E. Eurasian percent in Iranics is a joke no matter which of these methods you use.

Yeah but in the ADMIXTURE run I posted on the previous page, Iranians had 82% of the Balochi-Brahui component, which is clearly partially East Eurasian (or at least it's partially undifferentiated Eurasian which shares a lack of western drift with East Eurasians). I now tried doing runs at different K values with the same samples, except I removed the ancient samples, because I usually get f*cked up results when I try to mix ancient and modern samples in the same ADMIXTURE run. Now in the K=2 run, the percentage of the eastern component was 14% in Balochi, 13% in Brahui, and 6% in Iranians. And even in the K=4 run, Iranians got 4% of the North Asian component and 1% of the East Asian component:

https://i.ibb.co/BjpPb01/admixture-complexheatmap.png

I think the algorithm used by ADMIXTURE works in a way that when there is a sample that is close enough to having 100% of one component, it pulls the sample closer to having 100% of that component. So in a typical ADMIXTURE run, a large percentage of samples just gets 100% of a single component and 0% of all other components. For example the K=2 run above included 1469 samples, but 274 of them had 100% of the Caucasoid component, and 223 had 100% of the Mongoloid component. But all of the samples that got 100% of the Caucasoid component cannot be equally closely related to the component. And if there was a way to reduce the strength of the pull that makes a sample have 100% of a single component, then for example in the K=2 run above it would probably increase the average percentage of the Mongoloid component in Europeans. (I should check if the effect is less strong in alternatives to ADMIXTURE like Frappe or DyStruct or STRUCTURE.)
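
Counting such saturated samples in a Q matrix could look like the sketch below. It is only an illustration in Python: the function name `saturated_counts`, the toy Q rows, and the cutoff of 0.9999 (chosen slightly below 1 to allow for rounding in a Q file) are assumptions. The last line redoes the arithmetic for the counts quoted above.

```python
def saturated_counts(q_rows, threshold=0.9999):
    """For each component, count the samples whose proportion is at least `threshold`."""
    counts = [0] * len(q_rows[0])
    for row in q_rows:
        for j, value in enumerate(row):
            if value >= threshold:
                counts[j] += 1
    return counts

# Made-up K=2 rows (component 1, component 2); not taken from the run in the post.
toy_q = [
    [0.99999, 0.00001],
    [0.99999, 0.00001],
    [0.86000, 0.14000],
    [0.00001, 0.99999],
]
toy_counts = saturated_counts(toy_q)
print("saturated samples per component:", toy_counts)

# Share of the 1469 samples in the post's K=2 run that got 100% of one component.
share_from_thread = (274 + 223) / 1469
print("share saturated in the K=2 run:", round(share_from_thread, 3))
```

So roughly a third of the samples in that run sat at 100% of one of the two components.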

Komintasavalta

10-02-2021, 06:59 AM

One way to make ADMIXTURE-based calculators more accurate might be to combine the Q matrices of runs at different K values into a single wide matrix. Or at least that would make distances calculated based on the datasheet of a calculator more accurate even without having to multiply by FST. Currently if you don't account for FST, population 1 that is 100% North_European in a K=12 run is equally far from population 2 that is 50% North_European and 50% Atlantic_Med as from population 3 that is 50% North_European and 50% Siberian. But if you join the Q matrix of a K=12 run with the Q matrix of a K=3 run, then in the K=3 run the first two populations would have a much lower distance than the first and the third population, which would also make the combined distance between the populations more accurate.
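
The three-population example above can be worked through numerically. The Python sketch below uses made-up rows: the K=12 rows are cut down to the three components that matter here, and the K=3 component labels (West Eurasian, East Eurasian, African) are an assumption about what such a run might produce, not actual output.

```python
import math

def euclid(a, b):
    """Plain Euclidean distance between two rows of admixture proportions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical K=12 rows, reduced to (North_European, Atlantic_Med, Siberian).
k12 = {
    "pop1": [1.0, 0.0, 0.0],
    "pop2": [0.5, 0.5, 0.0],
    "pop3": [0.5, 0.0, 0.5],
}
# Hypothetical K=3 rows: (West_Eurasian, East_Eurasian, African).
k3 = {
    "pop1": [1.0, 0.0, 0.0],
    "pop2": [1.0, 0.0, 0.0],
    "pop3": [0.5, 0.5, 0.0],
}
# The wide matrix is just the two Q rows of each population placed side by side.
wide = {pop: k12[pop] + k3[pop] for pop in k12}

for label, table in (("K=12 only", k12), ("K=12 + K=3", wide)):
    d12 = euclid(table["pop1"], table["pop2"])
    d13 = euclid(table["pop1"], table["pop3"])
    print(label, "pop1-pop2:", round(d12, 4), "pop1-pop3:", round(d13, 4))
```

With the K=12 rows alone both distances are identical; after joining the K=3 columns, pop3 ends up further from pop1 than pop2 does, without any FST information being used.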

For example below is a PCA based on the Q matrix of a single K=8 run, where I didn't multiply the Q matrix with the FST matrix of the run. It exaggerates the distance between Europeans and Caucasians, because it doesn't know that the European component has a low FST distance to the Caucasian component. It treats a population with 100% of the European component as equally distant from a population with 100% of the Caucasian component as from a population with 100% of the Eskimo component.

https://i.ibb.co/5KW5rVv/2.png

But when you combine the Q matrices of all runs from K=2 to K=9 into a single wide matrix, it makes a PCA based on the matrix more accurate:

https://i.ibb.co/G3cFrwW/1.png

I think you also get more accurate oracles when you use a combined matrix of ADMIXTURE runs at different K values. For example here's 3-way models made using the Q matrix of a single K=8 run:

Hungarian (.00076): 95% English + 4% Lak + 1% Tibetan_Yunnan
Estonian (.00069): 82% Lithuanian + 17% Russian_Archangelsk_Krasnoborsky + 0% Icelandic
Finnish (.00083): 89% Lithuanian + 9% Ket + 1% Iranian
Karelian (.00000): 92% Lithuanian + 8% Nganasan + 0% Koryak
Veps (.00000): 86% Karelian + 12% Lithuanian + 2% Nganasan
Mordovian (.00069): 84% Russian_Archangelsk_Krasnoborsky + 11% Belarusian + 5% Karachai
Besermyan (.00573): 80% Udmurt + 17% Russian_Archangelsk_Krasnoborsky + 4% Balochi
Udmurt (.00779): 77% Besermyan + 16% Mansi + 6% Karelian
Mansi (.00015): 74% Selkup + 25% Lithuanian + 1% Eskimo_ChaplinSireniki
Selkup (.00068): 89% Enets + 9% Veps + 2% GujaratiD
Enets (.00017): 80% Nganasan + 15% Lithuanian + 4% Estonian
Nganasan (.04768): 100% Evenk_Transbaikal

Here's models made using a combined matrix of runs from K=2 to K=9:

Hungarian (.01717): 73% English + 20% Ukrainian + 7% Circassian
Estonian (.01661): 81% Lithuanian + 12% Finnish + 7% Veps
Finnish (.01363): 70% Veps + 17% Estonian + 13% Orcadian
Karelian (.00680): 65% Veps + 25% Russian_Archangelsk_Krasnoborsky + 10% Lithuanian
Veps (.01165): 98% Karelian + 2% Mansi + 0% Koryak
Mordovian (.00726): 43% Russian_Archangelsk_Krasnoborsky + 42% Ukrainian + 15% Chuvash
Besermyan (.01823): 85% Udmurt + 13% Czech + 2% Kalash
Udmurt (.01700): 72% Besermyan + 17% Mansi + 11% Russian_Archangelsk_Leshukonsky
Mansi (.05058): 50% Ket + 35% Yukagir_Forest + 15% Veps
Selkup (.02265): 74% Ket + 16% Tatar_Siberian_Zabolotniye + 10% Evenk_Transbaikal
Enets (.01851): 67% Ket + 31% Nganasan + 3% Russian_Archangelsk_Leshukonsky
Nganasan (.26531): 100% Evenk_Transbaikal

Zoro

10-02-2021, 08:39 AM

I like how you think outside the box, but I’m not sure I follow what you’re saying, because the Q matrix is the admixture proportion matrix, which changes based on K. For example, at K=4:

0.15 0.15 0.5 0.2

Whereas the FST matrix holds the distances between components (column values). It’s not SNP weights. So how do you combine those two different matrices? Can you give an example based on the first couple of rows of a Q matrix?

rothaer

10-02-2021, 11:53 AM

Maybe Gedrosia is a proxy for steppe ancestry in Dod K12, which would explain why it's higher in Northern Europeans than in Southern Europeans. (...)

It is a very interesting question what this Gedrosia component is at all.

If you look at the distribution map in the OP, it does not fit a steppe-related thing, because you have that component in Western Europe and 0% of it in Poland and Belarus. But these pops are known to have a big proportion of steppe ancestry.
So whatever is depicted by this Gedrosia component (might it be an arbitrarily and erroneously chosen component?), the question remains by what migration it got that distribution.

Zoro

10-02-2021, 02:08 PM

Bottom line: it doesn’t make sense, no matter what excuse we make to justify it.

What I just showed is that even if the Gedrosia percentages, or any other admixture percentages in a calculator, are wrong, the public will have no idea that the calculator is flawed as long as the oracles model them with their own ethnic group. (There is no reason to believe the E. Eurasian or any other component is accurate either; chances are that if one is off, the others are off too.) This applies to Eurogenes and G25 as well.

You may ask how a calculator, whether it is Dodecad, Eurogenes or G25, is able to model you or cluster you with your ethnic group if its admixture percentages are off.

ANSWER: As long as all people from your country get similarly wrong percentages of Gedrosia, E. Asian, W. Asian or African, the oracles will model you with your countrymen.

That’s how Iranians and Kurds are still modelled with Iranians and Kurds in GEDmatch or G25 despite silly E. Asian percentages: all Iranians and Kurds get similarly silly E. Asian.

That’s how the British are modelled with the British in this calculator’s oracles even with silly Gedrosian percentages: all British get the same wrong Gedrosia percentages.

CONCLUSION: Never judge a calculator to be good just because its oracles model you with your countrymen; the admixture percentages can be wrong and you would never know it.
Conversely, don’t judge calculators such as the GedrosiaDNA project to be bad just because their oracles don’t model you with your countrymen. It could be that the spreadsheet doesn’t have enough populations, while the admixture percentages are still better than in other calculators.
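
The argument can be illustrated with a toy single-population oracle that simply picks the nearest reference average. Everything in this Python sketch is hypothetical: the function name `nearest_population`, the two-column layout (Gedrosia, everything else), and all percentages are invented to show the logic, not taken from any calculator.

```python
import math

def nearest_population(sample, references):
    """Return (name, distance) of the reference average closest to the sample."""
    best_name, best_dist = None, math.inf
    for name, average in references.items():
        dist = math.sqrt(sum((s - a) ** 2 for s, a in zip(sample, average)))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist

# Columns: (Gedrosia, everything else). All numbers are invented.
accurate_refs = {"British": [0.00, 1.00], "Iranian": [0.30, 0.70]}
person_accurate = [0.01, 0.99]

# A biased calculator that gives every British sample 10 points of extra Gedrosia.
biased_refs = {"British": [0.10, 0.90], "Iranian": [0.30, 0.70]}
person_biased = [0.11, 0.89]

match_accurate = nearest_population(person_accurate, accurate_refs)
match_biased = nearest_population(person_biased, biased_refs)
print("accurate calculator:", match_accurate)
print("biased calculator:  ", match_biased)
```

Both calculators return the same population at the same distance, so the oracle output alone cannot reveal that the second one inflates Gedrosia for the whole group.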

Komintasavalta

10-02-2021, 03:13 PM

I like how you think outside the box, but I’m not sure I follow what you’re saying, because the Q matrix is the admixture proportion matrix, which changes based on K. For example, at K=4:

0.15 0.15 0.5 0.2

Whereas the FST matrix holds the distances between components (column values). It’s not SNP weights. So how do you combine those two different matrices? Can you give an example based on the first couple of rows of a Q matrix?

You can use matrix multiplication:

t=read.csv("https://pastebin.com/raw/UY1Em6qW",r=1)/100 # K13 original

fst=as.dist(read.csv(text=",North_Atlantic,Baltic,West_Med,West_Asian,East_Med,Red_Sea,South_Asian,East_Asian,Siberian,Amerindian,Oceanian,Northeast_African,Sub-Saharan
North_Atlantic,,,,,,,,,,,,,
Baltic,19,,,,,,,,,,,,
West_Med,28,36,,,,,,,,,,,
West_Asian,26,32,36,,,,,,,,,,
East_Med,26,35,28,21,,,,,,,,,
Red_Sea,52,62,50,48,39,,,,,,,,
South_Asian,64,65,76,57,60,82,,,,,,,
East_Asian,114,114,122,110,111,127,76,,,,,,
Siberian,111,111,123,109,112,130,83,56,,,,,
Amerindian,138,137,154,138,144,161,120,113,105,,,,
Oceanian,179,181,187,177,176,191,146,166,177,217,,,
Northeast_African,122,127,124,116,108,121,113,145,151,185,203,,
Sub-Saharan,146,150,150,140,135,141,133,164,170,204,220,41,",r=1))/1000
t2=as.matrix(t)%*%as.matrix(fst)

sort(as.matrix(dist(t2))[,"Mari"])
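
To see what the multiplication does to distances, here is a small Python sketch restricted to three of the K13 components. The FST values (North_Atlantic-Baltic 0.019, North_Atlantic-Siberian 0.111, Baltic-Siberian 0.111) are the ones from the table above; the three Q rows A, B and C are hypothetical unadmixed populations, and the helper names `times_fst` and `euclid` are made up.

```python
import math

# Symmetric FST matrix with a zero diagonal, in the order
# North_Atlantic, Baltic, Siberian (table values divided by 1000).
fst = [
    [0.000, 0.019, 0.111],
    [0.019, 0.000, 0.111],
    [0.111, 0.111, 0.000],
]

# Hypothetical populations with 100% of one component each.
q = {
    "A": [1.0, 0.0, 0.0],  # all North_Atlantic
    "B": [0.0, 1.0, 0.0],  # all Baltic
    "C": [0.0, 0.0, 1.0],  # all Siberian
}

def times_fst(row):
    """One row of the product Q x FST."""
    return [sum(row[i] * fst[i][j] for i in range(len(row))) for j in range(len(fst))]

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

raw_ab = euclid(q["A"], q["B"])
raw_ac = euclid(q["A"], q["C"])
fst_ab = euclid(times_fst(q["A"]), times_fst(q["B"]))
fst_ac = euclid(times_fst(q["A"]), times_fst(q["C"]))

print("raw Q distances:    A-B", round(raw_ab, 4), " A-C", round(raw_ac, 4))
print("after Q x FST:      A-B", round(fst_ab, 4), " A-C", round(fst_ac, 4))
```

On the raw Q rows A is equally far from B and C, while after the multiplication A is much closer to B, reflecting the low FST between the North_Atlantic and Baltic components.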

For example multiplying by the FST matrix moves Maris closer to Central Asians and South Asians but further from Europeans and Siberians, because it causes the distances between Eurasian populations to be largely determined by the Mongoloid-Caucasoid axis. It moves Maris closer to Turkmens and further from Kets and Selkups, which matches the results of f2. However I don't know if it's the right method to account for FST, because it sometimes gives weird results. For example it moves Maris closer to Balochi and Makrani than to Estonians, and it also moves Maris closer to Jordanians than to Bulgarians:

https://i.ibb.co/YyCR5p8/mari-k13-fst-multiplication.gif

library(tidyverse)
library(ggforce)
library(colorspace) # provides rainbow_hcl, used for the color scales below

k13=read.csv("https://pastebin.com/raw/aLBEQ2cu",r=1,check=F)/100
f2=read.csv("https://drive.google.com/uc?export=download&id=1qnXblYFWLFnOiEj-NbjCVkHcGIsGe64R",r=1)
# g25=read.csv("https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y",r=1) # modern averages scaled

k13fst=as.dist(read.csv(text=",North_Atlantic,Baltic,West_Med,West_Asian,East_Med,Red_Sea,South_Asian,East_Asian,Siberian,Amerindian,Oceanian,Northeast_African,Sub-Saharan
North_Atlantic,,,,,,,,,,,,,
Baltic,19,,,,,,,,,,,,
West_Med,28,36,,,,,,,,,,,
West_Asian,26,32,36,,,,,,,,,,
East_Med,26,35,28,21,,,,,,,,,
Red_Sea,52,62,50,48,39,,,,,,,,
South_Asian,64,65,76,57,60,82,,,,,,,
East_Asian,114,114,122,110,111,127,76,,,,,,
Siberian,111,111,123,109,112,130,83,56,,,,,
Amerindian,138,137,154,138,144,161,120,113,105,,,,
Oceanian,179,181,187,177,176,191,146,166,177,217,,,
Northeast_African,122,127,124,116,108,121,113,145,151,185,203,,
Sub-Saharan,146,150,150,140,135,141,133,164,170,204,220,41,",r=1))/1000

pop=intersect(rownames(f2),rownames(k13))
# pop=intersect(rownames(g25),rownames(k13))
k13=k13[pop,]
f2=f2[pop,pop]
# g25=g25[pop,]

k13mult=as.matrix(k13)%*%as.matrix(k13fst)
xy=data.frame(x=rank(f2[,"Mari"]),y=rank(as.matrix(dist(k13))[,"Mari"]))
# xy=data.frame(x=rank(as.matrix(dist(g25))[,"Mari"]),y=rank(as.matrix(dist(k13mult))[,"Mari"]))

xy$k=as.factor(cutree(hclust(as.dist(f2)),16))
# xy$k=as.factor(cutree(hclust(dist(g25)),16))

ggplot(xy,aes(x,y))+
ggforce::geom_mark_hull(aes(color=k,fill=k),concavity=1000,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.15)+
geom_abline(linetype="dashed",color="gray80",size=.3)+
geom_point(aes(color=k),size=.5)+
geom_text(aes(color=k),label=rownames(xy),size=2,vjust=-.7)+
scale_x_continuous(breaks=seq(1,200,10),expand=expansion(mult=c(.04,.04)))+
scale_y_continuous(breaks=seq(1,200,10),expand=expansion(mult=c(.04,.04)))+
scale_fill_manual(values=rainbow_hcl(nlevels(xy$k),90,60))+
scale_color_manual(values=rainbow_hcl(nlevels(xy$k),90,60))+
labs(x="Rank of f2 distance to Mari",y="Rank of K13 distance to Mari, not multiplied by FST")+
theme(
axis.text=element_text(size=6),
axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
axis.ticks=element_blank(),
axis.ticks.length=unit(0,"cm"),
axis.title=element_text(size=8),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray85",fill=NA,size=.6),
panel.grid.major=element_line(color="gray85",size=.2),
plot.background=element_rect(fill="white"),
plot.subtitle=element_text(size=7),
plot.title=element_text(size=11)
)

ggsave("1.png",w=6,h=6)

However when you multiply the matrix of admixture percentages by the FST matrix, it makes a global PCA based on the K13 spreadsheet have the conventional shape where on PC1 and PC2, the other major cline is between Africans and Europeans:

https://i.ibb.co/34Q4yKm/2.png

If you don't multiply by FST, then the other major cline on PC1 and PC2 is between Europeans and South Asians instead:

https://i.ibb.co/VBqrfz4/1.png

Zoro

10-02-2021, 03:52 PM

You can use matrix multiplication:

For example it moves Maris relatively closer to Central Asians and South Asians and further from Europeans, because it causes the distances between Eurasian populations to be largely determined by the Mongoloid-Caucasoid axis. It moves Maris closer to Turkmens and further from Kets and Selkups, which matches the results of f2. However I don't know if it's the right method to account for FST, because it sometimes gives weird results. For example it moves Maris closer to Balochi and Makrani than to Estonians, and it also moves Maris closer to Jordanians than to Scots:

However when you multiply the matrix of admixture percentages by the FST matrix, it makes a global PCA based on the K13 spreadsheet have the conventional shape where on PC1 and PC2, the other major cline is between Africans and Europeans:

I didn’t mean what code you use, I meant what logic you have for doing that. The 1st PCA you posted didn’t make sense to me in either mode, looking at the distance Balochi-Iranian vs Balochi-Caucasians or Arabs. The last PCA you posted probably makes more sense.

Zoro

10-02-2021, 03:53 PM

…….

Komintasavalta

10-02-2021, 05:15 PM

I didn’t mean what code you use, I meant what logic you have for doing that. The 1st PCA you posted didn’t make sense to me in either mode, looking at the distance Balochi-Iranian vs Balochi-Caucasians or Arabs. The last PCA you posted probably makes more sense.

I'm not sure if it's the correct way to account for FST, but I think it makes sense at least for making a PCA based on the datasheet of an admixture calculator. For example if you make a PCA of European populations from K13 updated without multiplying by FST, Tatars plot between Finns and Caucasians:

https://i.ibb.co/nCPhYM4/1.png

But after multiplying by FST, more weight is given to differences in Mongoloid admixture, and PC1 ends up differentiating populations based on the amount of Mongoloid ancestry. (However now it no longer makes sense to make a biplot that shows the loadings of the variables of the PCA, because the matrix that the PCA is based on no longer represents the percentages of the admixture components.)

https://i.ibb.co/DfVbqW9/3.png
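
A minimal sketch of the comparison described in this post, using an invented three-component table instead of the real K13 datasheet. The population names, admixture weights and FST values are all made up for illustration; only `prcomp` and `%*%` are real R.

```r
# Toy Q matrix (rows: populations, columns: components); values are invented.
q <- matrix(c(.90, .10, .00,
              .70, .05, .25,
              .50, .50, .00,
              .10, .10, .80),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("PopA", "PopB", "PopC", "PopD"),
                            c("West", "South", "East")))

# Toy symmetric FST matrix between the components; values are invented.
fst <- matrix(c(0, .03, .11,
                .03, 0, .10,
                .11, .10, 0),
              ncol = 3, dimnames = list(colnames(q), colnames(q)))

pca_raw <- prcomp(q)          # PCA on the admixture weights themselves
pca_fst <- prcomp(q %*% fst)  # PCA after multiplying the weights by FST

# Share of the total variance that falls on PC1 in each case
var_raw <- pca_raw$sdev[1]^2 / sum(pca_raw$sdev^2)
var_fst <- pca_fst$sdev[1]^2 / sum(pca_fst$sdev^2)
round(c(raw = var_raw, fst = var_fst), 3)
```

The columns of `q %*% fst` are no longer admixture percentages, which is why the loadings in `pca_fst$rotation` can't be read as component loadings any more.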

Hektor12

10-02-2021, 05:19 PM

So whatever is depicted by this Gedrosia component (might it be an arbitrarily and erroneously chosen component?), the question remains by what migration it got that distribution. I find the answer in the R1a-R1b question.

Zoro

10-02-2021, 06:32 PM

I'm not sure if it's the correct way to account for FST, but I think it makes sense at least for making a PCA based on the datasheet of an admixture calculator. For example if you make a PCA of European populations from K13 updated without multiplying by FST, Tatars plot between Finns and Caucasians:

But after multiplying by FST, more weight is given to differences in Mongoloid admixture, and PC1 ends up differentiating populations based on the amount of Mongoloid ancestry. (However now it no longer makes sense to make a biplot that shows the loadings of the variables of the PCA, because the matrix that the PCA is based on no longer represents the percentages of the admixture components.)

I like the original one without multiplying by FST. I don’t see the logic of FST multiplication, and I don’t even understand how you can multiply admixture percentages by FST distances between components. It doesn’t make sense to me, unless I misunderstood you.

Komintasavalta

10-02-2021, 07:14 PM

I like the original one without multiplying by FST. I don’t see the logic of FST multiplication, and I don’t even understand how you can multiply admixture percentages by FST distances between components. It doesn’t make sense to me, unless I misunderstood you.

It uses matrix multiplication (`%*%` operator in R):

https://i.ibb.co/0n3mm47/1-YGc-MQSr0ge-DGn96-Wn-Ek-Zw.png

Another example: In K13 original, Dolgan has 75% Siberian and Dai has 90% East Asian, but without multiplying by FST, Dolgan are further from Dai than from many SSAs. Without multiplying by FST, Burusho is much closer to Dolgans than Gujarati is, because Burusho has 6% Siberian and Gujarati has 1%, but after multiplying by FST, Gujarati moves to rank 66 and Burusho moves to rank 55. However there's also something wrong with how after multiplying by FST, Dolgans become closer to Ethiopian_Tigray than to most Europeans.

https://i.ibb.co/pwC6hrX/1.png

Komintasavalta

10-02-2021, 07:42 PM

Actually I think I now found a better way to account for FST, which is to first do multidimensional scaling on the FST matrix, and to then multiply the matrix of component percentages with the MDS matrix:

t=read.csv("https://pastebin.com/raw/UY1Em6qW",r=1)/100 # K13 original

fst=as.dist(read.csv(text=",North_Atlantic,Baltic,West_Med,West_Asian,East_Med,Red_Sea,South_Asian,East_Asian,Siberian,Amerindian,Oceanian,Northeast_African,Sub-Saharan
North_Atlantic,,,,,,,,,,,,,
Baltic,19,,,,,,,,,,,,
West_Med,28,36,,,,,,,,,,,
West_Asian,26,32,36,,,,,,,,,,
East_Med,26,35,28,21,,,,,,,,,
Red_Sea,52,62,50,48,39,,,,,,,,
South_Asian,64,65,76,57,60,82,,,,,,,
East_Asian,114,114,122,110,111,127,76,,,,,,
Siberian,111,111,123,109,112,130,83,56,,,,,
Amerindian,138,137,154,138,144,161,120,113,105,,,,
Oceanian,179,181,187,177,176,191,146,166,177,217,,,
Northeast_African,122,127,124,116,108,121,113,145,151,185,203,,
Sub-Saharan,146,150,150,140,135,141,133,164,170,204,220,41,",r=1))/1000

mds=cmdscale(fst,ncol(as.matrix(fst))-1)
t2=as.data.frame(as.matrix(t)%*%mds)
sort(as.matrix(dist(t2))[,"Selkup"])

Then Dolgans remain further from Ethiopians than from Europeans:

https://i.ibb.co/b3J00QQ/1.png
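
Here is a small sketch of the MDS step on its own, with an invented three-component FST matrix and invented admixture weights (none of the numbers come from K13). It only illustrates what `cmdscale` contributes: it turns the FST matrix into coordinates for the components, so that multiplying by it places each population at the weighted average of the component positions.

```r
# Toy FST matrix between three components; values are invented.
comp <- c("West", "South", "East")
fst <- matrix(c(0, .03, .11,
                .03, 0, .10,
                .11, .10, 0),
              ncol = 3, dimnames = list(comp, comp))

# Classical MDS gives each component coordinates whose Euclidean
# distances approximate the FST values (k can be at most n - 1).
mds <- cmdscale(as.dist(fst), k = 2)

# Toy Q matrix; each row sums to 1.
q <- matrix(c(.90, .10, .00,
              .10, .10, .80),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("PopA", "PopD"), comp))

# Each population becomes the weighted average of the component coordinates.
coords <- q %*% mds
dist(coords)

# A "pure" population sits exactly on its component, so distances between
# pure populations reproduce the FST matrix when it embeds in k dimensions.
pure <- diag(3) %*% mds
max(abs(unname(as.matrix(dist(pure))) - unname(fst)))
```

With only three components the FST values always embed exactly in two dimensions, so the last line is effectively zero. With 13 components the embedding is generally only approximate.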

In the images below, the second dimension of the MDS plot differentiates Americans from Australo-Melanesians, because the biggest FST distance in K13 is between the Oceanian and Sub-Saharan components (.220), and the second biggest distance is between the Oceanian and Amerindian components (.217):

https://i.ibb.co/wS8gmKp/1.png
https://i.ibb.co/fXv6Vkt/2.png

Now also the correlation with f2 distance becomes better, so most populations are close to the diagonal in the plot below. On the right side of the diagonal, there are populations that have drift which K13 doesn't account for, like Kalashes, Orcadians, and Chukchi. On the left side of the diagonal, there are mixed populations with low driftedness, like some Central Asians and Hungarians.

https://i.ibb.co/RBHG2pT/f2.png

rothaer

10-02-2021, 08:25 PM

I find the answer in the R1a-R1b question.

So some autosomal DNA connected to R1b? There is no biological connection to a Y chromosome, but maybe a statistical one...

Komintasavalta

10-03-2021, 02:20 AM

I selected populations that had an identical name in K13 updated and in my global f2 matrix. I then used Mantel's test to calculate the correlation between the f2 matrix and different versions of the K13 matrix:

f2 vs K13 with no multiplication by FST: .831
f2 vs K13 multiplied by FST matrix: .928
f2 vs K13 multiplied by MDS matrix of FST matrix: .953

When I compared K13 to scaled G25, I got even higher correlation, but I think it's partially because G25 and K13 include similar samples, or because both of them are less sensitive to drift than f2:

G25 vs K13 with no multiplication by FST: .889
G25 vs K13 multiplied by FST matrix: .978
G25 vs K13 multiplied by MDS matrix of FST matrix: .989

So in both cases, multiplying by MDS of FST produced the best correlation.
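
For anyone wondering what the number from Mantel's test is: as far as I understand it, the statistic reported by `mantel.rtest` as `$obs` is the Pearson correlation between the pairwise distances in the two matrices, and the p-value comes from shuffling the population labels. Below is a base-R sketch of that idea on random toy data. The helper `mantel_perm` is my own hypothetical function, not part of ade4.

```r
set.seed(1)
# Two toy coordinate tables for the same six populations; values are random.
a <- matrix(rnorm(6 * 3), nrow = 6, dimnames = list(paste0("Pop", 1:6), NULL))
b <- a + matrix(rnorm(6 * 3, sd = .3), nrow = 6)

d1 <- dist(a)
d2 <- dist(b)

# Observed statistic: Pearson correlation between the two sets of pairwise distances.
mantel_obs <- cor(as.vector(d1), as.vector(d2))

# Permutation test: shuffle the population labels of one matrix and recompute.
mantel_perm <- function(d1, d2, nrepet = 999) {
  m2 <- as.matrix(d2)
  n <- nrow(m2)
  replicate(nrepet, {
    p <- sample(n)
    cor(as.vector(d1), as.vector(as.dist(m2[p, p])))
  })
}

sim <- mantel_perm(d1, d2)
p_value <- (sum(sim >= mantel_obs) + 1) / (length(sim) + 1)
c(r = mantel_obs, p = p_value)
```

Both matrices have to list the same populations in the same order, which is what the `intersect(rownames(...))` step in the real code takes care of.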

When I compared populations that had identical names in my f2 matrix and in G25, I got a correlation of .927 when using scaled coordinates and .781 when using unscaled coordinates. So the correlation with my f2 matrix was actually lower for scaled G25 than for K13 multiplied by MDS of FST.

Next I tried comparing my f2 matrix to versions of scaled G25 that were truncated to a different number of dimensions. I got the highest correlation when I included all 25 dimensions, but the increase in the correlation became very minor after the first 18 or 19 dimensions:

1 0.6538
2 0.7422
3 0.7733
4 0.7949
5 0.8082
6 0.8212
7 0.8839
8 0.9176
9 0.9178
10 0.9179
11 0.9186
12 0.9197
13 0.9200
14 0.9204
15 0.9206
16 0.9208
17 0.9209
18 0.9257
19 0.9263
20 0.9263
21 0.9266
22 0.9271
23 0.9271
24 0.9272
25 0.9277

When I compared my f2 matrix to unscaled G25, I got the highest correlation when I only included the first 18 dimensions of G25:

1 0.6538
2 0.7255
3 0.7553
4 0.7422
5 0.7024
6 0.6336
7 0.7950
8 0.7572
9 0.7620
10 0.7669
11 0.7650
12 0.7770
13 0.7776
14 0.7723
15 0.7728
16 0.7726
17 0.7742
18 0.7888
19 0.7867
20 0.7855
21 0.7840
22 0.7833
23 0.7810
24 0.7804
25 0.7812

library(ade4)

t=read.csv("https://pastebin.com/raw/UY1Em6qW",r=1)/100 # K13 updated
f2=read.csv("https://drive.google.com/uc?export=download&id=1qnXblYFWLFnOiEj-NbjCVkHcGIsGe64R",r=1) # global f2 matrix
# g25=read.csv("https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y",row.names=1) # modern averages scaled
# g25unscaled=read.csv("https://drive.google.com/uc?export=download&id=1y49hyvviJpHj9esVqyeiFm32DhnPlfRQ",row.names=1) # modern averages unscaled

fst=as.dist(read.csv(text=",North_Atlantic,Baltic,West_Med,West_Asian,East_Med,Red_Sea,South_Asian,East_Asian,Siberian,Amerindian,Oceanian,Northeast_African,Sub-Saharan
North_Atlantic,,,,,,,,,,,,,
Baltic,19,,,,,,,,,,,,
West_Med,28,36,,,,,,,,,,,
West_Asian,26,32,36,,,,,,,,,,
East_Med,26,35,28,21,,,,,,,,,
Red_Sea,52,62,50,48,39,,,,,,,,
South_Asian,64,65,76,57,60,82,,,,,,,
East_Asian,114,114,122,110,111,127,76,,,,,,
Siberian,111,111,123,109,112,130,83,56,,,,,
Amerindian,138,137,154,138,144,161,120,113,105,,,,
Oceanian,179,181,187,177,176,191,146,166,177,217,,,
Northeast_African,122,127,124,116,108,121,113,145,151,185,203,,
Sub-Saharan,146,150,150,140,135,141,133,164,170,204,220,41,",r=1))/1000

pop=intersect(rownames(f2),rownames(t))
f2=f2[pop,pop]
t=t[pop,]

t2=as.matrix(t)%*%as.matrix(fst)
t3=as.matrix(t)%*%cmdscale(fst,ncol(as.matrix(fst))-2)

ade4::mantel.rtest(as.dist(f2),dist(t))
ade4::mantel.rtest(as.dist(f2),dist(t2))
ade4::mantel.rtest(as.dist(f2),dist(t3))

# mantel=sapply(1:25,function(x)ade4::mantel.rtest(dist(g25[,1:x]),as.dist(f2))$obs)
# cat(paste(seq(mantel),sprintf("%.4f",mantel)),sep="\n")

Zoro

10-04-2021, 11:19 PM

I still don’t understand what you mean by multiplying the admixture matrix by the FST matrix, since FST values vary depending on which two components you refer to. For example:

1st row of the Q admixture proportion matrix:

Sample   European   Asian   African
Adygei   0.9        0.1     0

FST

European-Asian: 0.1
European-African: 0.3

Which value are you multiplying the 0.9 European admixture by, 0.1 or 0.3?

Komintasavalta

10-04-2021, 11:56 PM

I posted more experiments using Mantel's test here: https://anthrogenica.com/showthread.php?22402-Mantel-s-Test-G25-vs-genetic-distances. They show that multiplying admixture weights by FST or MDS of FST greatly improves correlation with other measures of genetic distance.

I still don’t understand what you mean by multiplying the admixture matrix by the FST matrix, since FST values vary depending on which two components you refer to. For example:

1st row of the Q admixture proportion matrix:

Sample   European   Asian   African
Adygei   0.9        0.1     0

FST

European-Asian: 0.1
European-African: 0.3

Which value are you multiplying the 0.9 European admixture by, 0.1 or 0.3?

You didn't include the FST distance between Africans and Asians, but I set it as .4.

> qmatrix=read.csv(r=1,head=F,text="Adygei,.9,.1,0")
> fst=as.matrix(as.dist(read.csv(head=F,text=",,\n.1,,\n.3,.4,")))
> qmatrix
V2 V3 V4
Adygei 0.9 0.1 0
> fst
V1 V2 V3
V1 0.0 0.1 0.3
V2 0.1 0.0 0.4
V3 0.3 0.4 0.0
> as.matrix(qmatrix)%*%fst
V1 V2 V3
Adygei 0.01 0.09 0.31

First column: .9*0+.1*.1+0*.3 = .01
Second column: .9*.1+.1*0+0*.4 = .09
Third column: .9*.3+.1*.4+0*0 = .31

Komintasavalta

10-05-2021, 02:39 PM

I did a K=10 ADMIXTURE run of modern samples that had the same population name and sample ID in G25 and 1240K+HO.

I then compared the distance to Besermyans in G25 and in my ADMIXTURE run. When I didn't account for FST, Besermyans were closer to Kalmyks than to Jordanians, closer to Sardinians than to Tajiks, and closer to Datog than to Shor_Khakassia:

https://i.ibb.co/R90LvKc/3.png

However when I multiplied the admixture weights by FST, the points of the populations plotted much closer to the Loess regression curve. Compared to G25, my run underestimated the distance to Biaka, because there wasn't a distinct component for Bambutids. Compared to G25, my run also overestimated the distance to Papuans, because the Papuan component had a much higher FST distance to Eurasian components than the SSA component (which is probably because the Papuan component only accounted for a single population, but the SSA component accounted for West Africans, East Africans, and Biaka):

https://i.ibb.co/6B4RScF/2.png

When I multiplied the admixture weights by MDS of FST, I got similar results as when multiplying by FST, except the populations were now even closer to the regression curve:

https://i.ibb.co/t3ydRxT/1.png

I also tried using Mantel's test to calculate the correlation coefficient between a distance matrix of my ADMIXTURE run and a distance matrix of scaled G25 coordinates. When I didn't multiply the admixture weights by FST, the correlation was .775, but it increased to .956 when I multiplied the admixture weights by FST, and it further increased to .973 when I multiplied the admixture weights by MDS of FST:

library(ade4)

g25=read.csv("mis",r=1) # modern individuals scaled
admix=read.table(paste0("intersect2.p.10")) # Q matrix with additional columns for sample ID and population name
fst=as.dist(read.csv(paste0("intersect2.p.10.admixfst"),h=F))
admix=data.frame(admix[,-c(1,2)],row.names=paste0(admix[,2],":",admix[,1]))
g25=g25[rownames(admix),]

admixfst=as.matrix(admix)%*%as.matrix(fst)
admixmds=as.matrix(admix)%*%cmdscale(fst,ncol(admix)-1)

mantel.rtest(dist(g25),dist(admix))$obs # .775 (FST not accounted for)
mantel.rtest(dist(g25),dist(admixfst))$obs # .956 (multiplied by FST)
mantel.rtest(dist(g25),dist(admixmds))$obs # .973 (multiplied by MDS of FST)

Here's code to make the ADMIXTURE run:

# open https://github.com/DReichLab/EIG https://www.cog-genomics.org/plink/1.9/ https://dalexander.github.io/admixture/download.html # download convertf, PLINK 1.9, and ADMIXTURE
wget https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.{anno,ind,snp,geno}
f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
printf %s\\n ais\ 1UrhcfNMLW0oMXIbHGUE60v2taCM7PFw1 aas\ 1F2rKEVtu8nWSm7qFhxPU6UESQNsmA-sl mis\ 1HYrDwxEXv82DvDLoq736pS5ZTGJA4dn5 mas\ 1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y aiu\ 1YKkEOtyV5SISvmY_FyS4YSLXCxxYt5_W aau\ 1f0imQyVNZ9RPESNAYIeIkA8fx4wAVNYo miu\ 18GcEVEl3GI-ByviD-TgQQjvEaaTbNTr2 mau\ 1y49hyvviJpHj9esVqyeiFm32DhnPlfRQ|while read l m;do curl "https://drive.google.com/uc?export=download&id=$m" -Lso $l;done
x=intersect;cut -d, -f1 mis|awk -F\\t 'NR==FNR{a[$0];next}$8":"$2 in a{print$2,$8}' - v44.3_HO_public.anno>$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
plink --allow-no-sex --bfile $x --indep-pairwise 50 10 .1 --out $x;plink --allow-no-sex --bfile $x --extract $x.prune.in --make-bed --out $x.p
tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1,i]+=$i}}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i,j]/n[i]);print o}}' "FS=${1-$'\t'}")
k=10;admixture -j4 -C.1 $x.p.bed $k;awk 'NR==FNR{a[$1]=$2;next}{print$2,a[$2]}' $x.pick $x.p.fam|paste -d\ - $x.$k.Q>$x.$k;cut -d\ -f2- $x.$k|tav \ >$x.${k}a

Komintasavalta

10-09-2021, 03:41 AM

Has the FST matrix for Gedrosia K12 been published somewhere?

Ask your friend Kurd to calculate an f2 matrix between the populations in the Gedrosia K12 datasheet:

awk 'NR==FNR{a[$1]=$2;next}{$1=a[$2]}1' prefix.id2pop prefix.fam>temp;mv temp prefix.fam
Rscript -e 'library(admixtools);p="prefix";f=f2(p,unique_only=F);df=as.data.frame(f);write.csv(round(xtabs(df[,3]~df[,2]+df[,1]),8),paste0(p,".f2"),quote=F)'

Then we can use Mantel's test to see if multiplying the datasheet with the FST matrix improves correlation with the f2 matrix.

Lucas

10-15-2021, 11:56 PM

I played a bit with multiplying Gedmatch oracles by FST values of respective calculators (those which I found).

I used the method of multiplying that Komintasavalta described first, not the one with the MDS matrix (I wrote to you before about what I tried with FST, but of course I didn't use that lame method of mine this time).

I must say I am impressed: now I have magically become Polish everywhere instead of East Slavic :cool: All oracles are the updated ones from the Vahaduo website, of course.
Distances are also lower. For comparison, the distances shown here are the real ones, not divided by 100 (in the single-population results).
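
A rough sketch of what the comparison in this post amounts to, for readers who want to try it outside a spreadsheet. The reference populations `RefA` to `RefC` and the target percentages are invented; the three FST values are the Baltic, North_Atlantic and West_Asian entries of the K13 FST table posted earlier in the thread. The helper `nearest` is a hypothetical function of mine, and this is not Vahaduo's algorithm, just a plain nearest-population list.

```r
comp <- c("Baltic", "North_Atlantic", "West_Asian")

# Hypothetical reference averages and target, in percent.
ref <- matrix(c(45, 30, 25,
                52, 33, 15,
                20, 30, 50),
              ncol = 3, byrow = TRUE,
              dimnames = list(c("RefA", "RefB", "RefC"), comp))
target <- c(Baltic = 48, North_Atlantic = 32, West_Asian = 20)

# FST between these three components, taken from the K13 table in this thread.
fst <- matrix(c(0, .019, .032,
                .019, 0, .026,
                .032, .026, 0),
              ncol = 3, dimnames = list(comp, comp))

# Euclidean distance from the target to every reference population,
# after multiplying both by the weight matrix w (identity = no FST).
nearest <- function(target, ref, w = diag(ncol(ref))) {
  tr <- ref %*% w
  tt <- as.vector(target %*% w)
  sort(sqrt(rowSums(sweep(tr, 2, tt)^2)))
}

nearest(target, ref)        # plain distance on the percentages
nearest(target, ref, fst)   # after multiplying both sides by FST
```

The absolute distances shrink after the multiplication simply because the FST values are small numbers, so only the ordering and the relative gaps are comparable between the two lists.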

Ok here we go!

FST K13 updated

Target: lukas
Distance: 19.6564% / 0.19656432 | ADC: 0.25x RC
87.0 Polish
8.4 Polish_Greater_Poland
3.4 Austrian
1.0 Yemenite_Jewish
0.2 Ethiopian_Anuak

Target: lukas
Distance: 19.4255% / 0.19425525
50.6 Polish
22.2 Austrian
22.2 Latvian
2.4 Belorussian
1.2 Lithuanian
0.8 Yemenite_Jewish
0.4 Ethiopian_Tigray
0.2 Biaka_Pygmy

Distance to: lukas
0.21973246 Polish
0.27625870 Polish_Greater_Poland
0.30207503 Polish_Kielce
0.31178434 Polish_Mazovia
0.34205900 Polish_Masuria
0.34529770 Polish_Kuyavia
0.36878944 Russian_Smolensk
0.37438655 Belorussian
0.37681152 Polish_Silesia
0.38225857 South_Polish

No FST K13 updated

Target: lukas
Distance: 149.13% / 1.49127494 | ADC: 0.25x RC
77.5 Belarusian_Minsk
18.4 Latvian
4.1 Yemenite_Jewish

Target: lukas
Distance: 107.98% / 1.07982576
42.1 Latvian
25.4 Lithuanian
19.0 Polish_Silesia
5.6 Polish_Greater_Poland
4.1 Lebanese_Christian
1.9 Lebanese_Druze
1.6 Yemenite_Jewish
0.3 Yoruban

Distance to: lukas
2.77668868 Belarusian_Minsk
3.66615603 Russian_Smolensk
4.05947041 Belorussian
4.14441793 Russian_Southwest
4.27659912 Polish_Mazovia
4.81652364 Ukrainian
5.91515004 Polish_Kielce
6.03415280 Polish_Podlaskie
6.12607542 Polish_Kuyavia
6.30359421 Russian_average

==========================================

FST K15 updated

Target: Luk
Distance: 16.8926% / 0.16892637 | ADC: 0.5x
84.0 Polish
15.4 Ukrainian_Belgorod
0.6 Tunisian

Target: Luk
Distance: 13.2656% / 0.13265647 | ADC: 0.25x
61.2 Polish
36.8 Belorussian
0.6 Egyptian
0.6 Saudi
0.6 Yemenite_Jewish
0.2 Ethiopian_Amhara

Target: Luk
Distance: 12.7546% / 0.12754597
50.0 Belorussian
47.4 Polish
1.6 Yemenite_Jewish
0.6 Samaritan
0.4 Somali

Distance to: Luk
0.21162697 Polish
0.25180224 Polish_Kielce
0.25812203 Russian_Smolensk
0.27567171 South_Polish
0.29129793 Belorussian
0.31165125 Polish_Masuria
0.31992042 Ukrainian_Belgorod
0.35382366 Estonian_Polish
0.37434292 Sorb_Lusatia
0.38617901 Ukrainian_Kiev_Avg

No FST K15 updated

Target: Luk
Distance: 206.85% / 2.06845548 | ADC: 0.5x RC
44.3 Russian_Smolensk
37.4 Estonian_Polish
18.3 Ukrainian_Belgorod

Target: Luk
Distance: 191.82% / 1.91822983 | ADC: 0.25x
57.1 Estonian_Polish (it is Estonian admixed average, not good proxy for Poland)
34.9 Russian_Smolensk
6.0 Ukrainian_Belgorod
1.5 North_Ossetian
0.3 Somali
0.2 Lithuanian

Target: Luk
Distance: 175.84% / 1.75837809
85.3 Estonian_Polish
9.2 Lithuanian
4.2 North_Ossetian
0.6 French_Basque
0.6 Somali
0.1 Yoruban

Distance to: Luk
2.57561643 Russian_Smolensk
3.18691073 Estonian_Polish
3.55793479 Belorussian
4.04486094 Southwest_Russian
4.30416078 Ukrainian_Belgorod
4.69517838 Ukrainian_Kiev_Avg
6.02754511 Polish
6.19795127 Ukraine_East
6.28458431 Polish_Kielce
7.55439607 Lithuanian

=======================================

FST Updated Dodecad K12b

Target: luk
Distance: 18.6611% / 0.18661132 | ADC: 0.25x
79.4 Polish
15.6 Lithuanian
4.4 Greek_Pontus
0.6 Sorb_Lusatia

Target: luk
Distance: 13.6143% / 0.13614328
79.0 Lithuanian
16.0 Italian_Romagna
4.0 Greek_Pontus
1.0 Greek_Izmir

Distance to: luk
0.24128083 Polish
0.27540548 Sorb_Lusatia
0.30333600 Polish_Mazovia
0.37641532 Polish_Warmia-Masuria
0.37821484 Belorussian
0.40281688 Slovak
0.47257684 Czech
0.48264095 Ukrainian
0.49794841 Russian_Smolensk
0.50688766 Belarusian_Minsk

No FST Updated Dodecad K12b

Target: Luk
Distance: 143.33% / 1.43334537 | ADC: 0.25x
52.8 Belorussian
23.8 Russian_Voronezh
21.8 Polish_Mazovia
1.6 Kuwait2

Target: Luk
Distance: 64.08% / 0.64083465
67.2 Lithuanian
11.3 Greek_Peloponnese
7.4 Latvian
5.5 Bosnian
3.7 Sorb_Lusatia
3.5 Greek_Izmir
0.7 Kuwait2
0.3 Croat
0.2 Santhal
0.1 Albanian_Kosovo
0.1 Greek_Pontus

Distance to: Luk
2.39822851 Belorussian
2.50195923 Russian_Voronezh
2.66161605 Polish_Mazovia
2.68076482 Ukrainian
2.70185122 Russian_Smolensk
3.07553247 Russian_Kursk
3.48582845 Russian_Oryol
3.66465551 Belarusian_Minsk
4.37420850 Polish
5.18256693 Russian_Tver

Peterski

10-16-2021, 12:02 AM

(...) Yemenite_Jewish (...)

^^^
Do you really have some Jewish admix?

Do you score any Ashkenazi in 23andMe?

Leto

10-16-2021, 12:06 AM

I choose D K12b obviously but my preference is irrelevant.

Lucas

10-16-2021, 12:08 AM

I choose D K12b obviously but my preference is irrelevant.

Notice that irrespective which calc is multiplied by FST, all my results are oscillating about 80% Polish with higher ADC. Even bearing in mind we have somewhat different references in every calc (especially Dodecad compared to Eurogenes).

Peterski

10-16-2021, 01:08 AM

Notice that irrespective which calc is multiplied by FST, all my results are oscillating about 80% Polish with higher ADC. Even bearing in mind we have somewhat different references in every calc (especially Dodecad compared to Eurogenes).

But in Global25 you are close to Russian Smolensk average, if I remember correctly. Does it mean that G25 is also inaccurate for you just like calculators when not multiplied by FST ???

Peterski

10-16-2021, 01:14 AM

Check this South Wielkopolska outlier sample:

GEDmatch kit number - AY9156947

Without multiplying by FST, this sample is eastern-shifted (a clear outlier for Wielkopolska region). In Global25, this is also visible. I wonder if this sample will score more "normal" after applying FST - just like in your case ???

Komintasavalta

10-16-2021, 03:22 AM

Check this South Wielkopolska outlier sample:

GEDmatch kit number - AY9156947

Without multiplying by FST, this sample is eastern-shifted (a clear outlier for Wielkopolska region). In Global25, this is also visible. I wonder if this sample will score more "normal" after applying FST - just like in your case ???

In K13, if the matrix of admixture percentages is multiplied by an MDS matrix of the FST matrix, it moves the sample further from Finno-Permics, North Russians, and VURians, because it gives more weight to differences in Mongoloid ancestry.

Without accounting for FST, the sample is closer to some Serbs than to Swedes, but multiplying by MDS of FST moves it closer to Northwestern Europeans. (Because the main axis of genetic variation among mainstream IE Europeans is between wogs and non-wogs, and Poles have a similar level of wogginess as Swedes.)

The numbers shown on the x and y axis indicate the percentage of the distance from South_Wielopolska_outlier to the population that is the furthest from it.

https://i.ibb.co/f81wZHM/1.png

Lucas

10-16-2021, 09:40 AM

Without accounting for FST, the sample is closer to some Serbs than to Swedes, but multiplying by MDS of FST moves it closer to Northwestern Europeans. (Because the main axis of genetic variation among mainstream IE Europeans is between wogs and non-wogs, and Poles have a similar level of wogginess as Swedes.)

Is it universal for both Eurogenes and Dodecad calcs? Because in Dodecad we have an unsupervised admixture run (Eurogenes runs are supervised). Also, some components aren't comparable, like Gedrosia. And my results in all of them are nearly the same after FST.

BTW I noticed one interesting thing, which I think wasn't discussed before. My component percentages after multiplication by the FST matrix look like distances to components. I don't know if they really are distances, but they look like it.
Sorted from the lowest:

K13

Baltic 1,6277
North_Atlantic 1,89177
West_Asian 3,1113
East_Med 3,13918
West_Med 3,29874
Red_Sea 5,67981
South_Asian 6,57958
Siberian 11,27411
East_Asian 11,47395
Northeast_African 12,26858
Amerindian 13,95091
Sub-Saharan 14,60386
Oceanian 18,0752

K15

North_Sea 1,7986
Baltic 1,82585
Atlantic 1,88284
Eastern_Euro 2,03493
West_Asian 2,98029
East_Med 2,99288
West_Med 3,36092
Middle_Eastern 5,44257
South_Asian 6,18508
Northeast_African 10,89329
Siberian 11,12342
Southeast_Asian 11,20527
Amerindian 13,90996
Sub-Saharan 14,34409
Oceanian 17,87779

Dodecad K12b

North European 1,89208
Atlantic Med 3,28738
Caucasus 3,47649
Gedrosia 4,9647
Southwest Asian 5,73705
Northwest African 6,28406
South Asian 8,74821
East Asian 12,61477
Siberian 12,75133
Southeast Asian 13,17852
East African 13,6036
Sub Saharan 18,08489

michal3141

10-16-2021, 10:18 AM

Yes.
Multiplication by the FST matrix results in distances to components.
But as demonstrated by Komintasavalta it is better to multiply by the MDS matrix of FST if you want better correlation between genetic distances and euclidean distances.
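
To make the "distances to components" reading concrete, here is a short sketch reusing the Adygei toy numbers from earlier in the thread (Q = .9/.1/0, FST = .1/.3/.4). Each entry of the product is the admixture-weighted average of the FST values between the sample's own components and the target component. So it behaves like a distance, though it is an average of component FSTs and not an FST estimated for the sample itself.

```r
comp <- c("European", "Asian", "African")
q <- matrix(c(.9, .1, 0), nrow = 1, dimnames = list("Adygei", comp))
fst <- matrix(c(0, .1, .3,
                .1, 0, .4,
                .3, .4, 0),
              ncol = 3, dimnames = list(comp, comp))

# Matrix product, as in the thread
prod <- q %*% fst

# Same numbers computed explicitly as a Q-weighted mean FST to each component
by_hand <- sapply(comp, function(j) sum(q["Adygei", ] * fst[, j]))

prod
sort(by_hand)
```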

Lucas

10-16-2021, 10:23 AM

Yes.
Multiplication by the FST matrix results in distances to components.
But as demonstrated by Komintasavalta it is better to multiply by the MDS matrix of FST if you want better correlation between genetic distances and euclidean distances.

OK, I did the FST multiplication in a spreadsheet. But does the MDS matrix need Python?

michal3141

10-16-2021, 10:29 AM

OK, I did the FST multiplication in a spreadsheet. But does the MDS matrix need Python?

You can do it with R easily.
You can use the cmdscale function, as demonstrated by Komintasavalta on another forum :)

EDIT: Removing stuff that destroyed this site.

Komintasavalta

10-16-2021, 11:51 AM

BTW I noticed one interesting thing, which I don't think has been discussed before. After multiplication by the FST matrix, my component percentages look like distances to the components. I don't know if they really are distances, but they look like it.
Sorted from the lowest:

K13

Baltic 1,6277
North_Atlantic 1,89177
West_Asian 3,1113
East_Med 3,13918
West_Med 3,29874
Red_Sea 5,67981
South_Asian 6,57958
Siberian 11,27411
East_Asian 11,47395
Northeast_African 12,26858
Amerindian 13,95091
Sub-Saharan 14,60386
Oceanian 18,0752

Wow, I hadn't thought of it that way, but if you multiply the coordinates of a sample by the FST matrix, it does actually calculate the distance of the sample to each component. This shows the distances min-max scaled on the range 0-100:

https://i.ibb.co/m64908C/a.png

I wonder if there's a version of MDS which accepts an asymmetric distance matrix like above as input.

library(pheatmap)
library(colorspace)
library(vegan) # for reorder.hclust

t=read.csv(h=F,r=1,text="Ukrainian_Belgorod,25.78,47.29,7.85,8.78,5.54,0.58,1.10,0.71,1.02,0.59,0.25,0.16,0.34
Ukrainian_Lviv,27.35,41.85,11.96,7.10,5.29,1.79,1.20,0.39,1.36,0.82,0.75,0.00,0.13
Uttar_Pradesh,2.94,3.28,1.36,15.83,0.42,0.72,69.02,2.79,1.41,0.69,0.98,0.31,0.26
Uygur,7.18,9.70,0.87,18.93,4.66,0.18,10.42,24.33,20.48,2.35,0.42,0.26,0.21
Uzbeki,4.87,11.98,1.72,20.42,6.92,1.05,11.85,15.99,23.17,1.39,0.11,0.35,0.19
Velamas,0.15,0.19,0.13,21.78,2.62,0.26,70.10,1.58,0.89,0.31,1.27,0.38,0.35
Vietnamese,0.05,1.24,0.02,0.14,0.05,0.24,5.11,86.45,5.20,0.16,1.06,0.16,0.12
West_German,43.14,22.44,14.78,6.85,8.54,0.86,0.94,0.13,0.40,0.60,0.64,0.47,0.22
West_Greenlander,1.90,12.23,0.02,0.80,0.01,0.01,2.80,6.23,38.66,37.10,0.06,0.14,0.04
West_Scottish,53.18,23.35,12.31,5.56,1.72,0.88,1.14,0.05,0.47,0.80,0.34,0.08,0.12
West_Sicilian,21.14,7.59,22.70,10.85,28.66,5.07,0.65,0.52,0.20,0.05,0.61,0.99,0.96
Xibo,0.62,0.29,0.30,1.72,0.85,0.14,0.12,55.61,39.46,0.35,0.26,0.25,0.04
Yakut,0.56,3.45,1.61,1.67,0.21,0.56,1.13,15.25,74.19,0.52,0.67,0.17,0.01
Yemenite_Jewish,0.23,0.00,5.52,6.28,52.36,28.49,0.71,0.06,0.23,0.19,0.25,5.68,0.00
Yizu,0.04,0.03,0.04,0.08,0.01,0.10,3.29,70.41,23.74,0.48,1.45,0.13,0.20
Yoruban,0.05,0.05,0.11,0.20,0.13,0.91,0.07,0.11,0.08,0.08,0.10,2.91,95.20")

fst=as.matrix(as.dist(read.csv(r=1,text=",North_Atlantic,Baltic,West_Med,West_Asian,East_Med,Red_Sea,South_Asian,East_Asian,Siberian,Amerindian,Oceanian,Northeast_African,Sub-Saharan
North_Atlantic,,,,,,,,,,,,,
Baltic,19,,,,,,,,,,,,
West_Med,28,36,,,,,,,,,,,
West_Asian,26,32,36,,,,,,,,,,
East_Med,26,35,28,21,,,,,,,,,
Red_Sea,52,62,50,48,39,,,,,,,,
South_Asian,64,65,76,57,60,82,,,,,,,
East_Asian,114,114,122,110,111,127,76,,,,,,
Siberian,111,111,123,109,112,130,83,56,,,,,
Amerindian,138,137,154,138,144,161,120,113,105,,,,
Oceanian,179,181,187,177,176,191,146,166,177,217,,,
Northeast_African,122,127,124,116,108,121,113,145,151,185,203,,
Sub-Saharan,146,150,150,140,135,141,133,164,170,204,220,41,")))/1000

mds=cmdscale(fst,ncol(fst)-1)
t2=as.matrix(t)%*%fst
t3=as.matrix(t)%*%mds

# pro technique for reordering the branches of a hierarchical clustering tree based on the first dimension in MDS/PCA
hc=reorder(hclust(dist(t3)),t3[,1])

mima=(t2-min(t2))/(max(t2)-min(t2))

pheatmap::pheatmap(
100*mima,
filename="1.png",
clustering_callback=function(...)hc,
cluster_cols=F,
legend=F,
cellwidth=18,
cellheight=18,
fontsize=10,
treeheight_row=100,
treeheight_col=100,
border_color=NA,
display_numbers=T,
number_format="%.0f",
fontsize_number=8,
number_color="black",
colorRampPalette(colorspace::hex(HSV(c(210,210,170,135,100,60,30,0,330,300,270),c(0,rep(.4,10)),1)))(256)
)

Komintasavalta

10-16-2021, 02:22 PM

I think `smacof::unfolding` does something like MDS on a rectangular matrix: https://cran.r-project.org/web/packages/smacof/smacof.pdf.

I used data for an ADMIXTURE run of 1064 samples that had the same sample ID and population name in 1240K+HO and G25. For the run at K=10, when I applied `smacof::unfolding` to a matrix of admixture percentages that was multiplied by the FST matrix, it produced the highest correlation with f2 distance when I restricted the output to 4 columns. Then the correlation was higher than when doing simple matrix multiplication by FST, but it was still lower than when multiplying by MDS of FST:

> library(smacof)
> library(ade4)
> f2=read.csv("intersect2.p.f2",r=1)
> t=read.table("intersect2.p.10a",r=1)[rownames(f2),]
> fst=as.matrix(as.dist(read.csv("intersect2.p.admixfst",h=F)))
> t2=as.matrix(t)%*%cmdscale(fst,ncol(fst)-1)
> t3=as.matrix(t)%*%fst
> mantel.rtest(dist(t2),as.dist(f2))$obs # multiplied by MDS of FST
[1] 0.9439003
> mantel.rtest(dist(t3),as.dist(f2))$obs # multiplied by FST
[1] 0.9236697
> sapply(2:20,function(n)mantel.rtest(dist(unfolding(as.matrix(t)%*%fst,ndim=n)$conf.row),as.dist(f2))$obs)
[1] 0.9376099 0.9160480 0.9397850 0.9382412 0.9289399 0.9256893 0.9216252 0.9265515
[9] 0.9273626 0.9273712 0.9274165 0.9273795 0.9274111 0.9273540 0.9273588 0.9273744
[17] 0.9274254 0.9274580 0.9274602

Next I tried to compare `smacof::unfolding` to G25 distance. I again got the highest correlation when I restricted the output to 4 dimensions, where the correlation was again higher than when multiplied by FST but lower than when multiplied by MDS of FST:

> t=read.table("intersect2.p.10")
> rownames(t)=paste0(t[,2],":",t[,1])
> t=t[,-c(1,2)]
> fst=as.matrix(as.dist(read.csv("intersect2.p.10.admixfst",h=F)))
> g25=read.csv("g/25/mis",r=1)[rownames(t),]
> t2=as.matrix(t)%*%cmdscale(fst,ncol(fst)-1)
> t3=as.matrix(t)%*%fst
> mantel.rtest(dist(t2),dist(g25))$obs # multiplied by MDS of FST
[1] 0.9761411
> mantel.rtest(dist(t3),dist(g25))$obs # multiplied by FST
[1] 0.9616274
> sapply(2:20,function(n)mantel.rtest(dist(unfolding(as.matrix(t)%*%fst,ndim=n)$conf.row),dist(g25))$obs)
[1] 0.9384111 0.9685256 0.9705586 0.9647864 0.9675478 0.9689743 0.9658520 0.9688607
[9] 0.9669519 0.9679367 0.9680450 0.9681135 0.9681581 0.9681966 0.9682287 0.9682593
[17] 0.9682812 0.9683001 0.9683189

The reason why I got higher correlation with G25 than with f2 might be that I used individual samples with G25 but population averages with f2.
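
The comparison above can be sketched without the original files. As far as I understand, the `$obs` value of `ade4::mantel.rtest` is the Pearson correlation between the two sets of pairwise distances, so the sketch below computes that directly with base R. `mantel_obs` is a hypothetical helper, the FST values are four components of the K13 matrix (x1000), and the proportions in `prop` are invented, so the resulting number says nothing about real data:

```r
comp <- c("North_Atlantic", "Baltic", "West_Med", "West_Asian")
fst <- matrix(c( 0, 19, 28, 26,
                19,  0, 36, 32,
                28, 36,  0, 36,
                26, 32, 36,  0), nrow = 4, byrow = TRUE,
              dimnames = list(comp, comp))

# Invented admixture proportions for five samples (rows sum to 1)
prop <- rbind(s1 = c(0.70, 0.20, 0.05, 0.05),
              s2 = c(0.20, 0.60, 0.10, 0.10),
              s3 = c(0.10, 0.10, 0.60, 0.20),
              s4 = c(0.05, 0.15, 0.20, 0.60),
              s5 = c(0.25, 0.25, 0.25, 0.25))

# Hypothetical helper: Pearson correlation between two "dist" objects
mantel_obs <- function(d1, d2) cor(as.vector(d1), as.vector(d2))

d_fst <- dist(prop %*% fst)                              # multiplied by FST
d_mds <- dist(prop %*% cmdscale(fst, k = ncol(fst) - 1)) # multiplied by MDS of FST

r <- mantel_obs(d_fst, d_mds)
print(r)
```

In the real comparison the second argument would be the reference distance (f2 or G25), and each candidate projection would be correlated against it in turn.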

Lucas

10-16-2021, 02:52 PM

Maybe it has some meaning: in Dienekes' original post for K12b, his FST matrix has this format https://i.imgur.com/EBszEjs.png, so it's not like the original ones created by ADMIXTURE.
I used this format for K13 and K15 too when I multiplied those FST matrices by the matrices with the original percentage values for the refs.

Komintasavalta

10-16-2021, 03:29 PM

This is so cool. You can calculate the populations with the lowest distance to the Siberian component like this:

t=read.csv("https://pastebin.com/raw/9dMWpJcU",row.names=1,check.names=F) # K13 updated

fst=as.matrix(as.dist(read.csv(text=",,,,,,,,,,,,
19,,,,,,,,,,,,
28,36,,,,,,,,,,,
26,32,36,,,,,,,,,,
26,35,28,21,,,,,,,,,
52,62,50,48,39,,,,,,,,
64,65,76,57,60,82,,,,,,,
114,114,122,110,111,127,76,,,,,,
111,111,123,109,112,130,83,56,,,,,
138,137,154,138,144,161,120,113,105,,,,
179,181,187,177,176,191,146,166,177,217,,,
122,127,124,116,108,121,113,145,151,185,203,,
146,150,150,140,135,141,133,164,170,204,220,41,",head=F)))

t2=as.matrix(t)%*%fst
colnames(t2)=colnames(t)
s=head(sort(t2[,"Siberian"]),32)
cat(paste(round(s),names(s)),sep="\n")

If you don't want to install R, you can run the code here: https://www.mycompiler.io/new/r.

Output:

1599 Evens
1737 Evenki
1818 Oroqen
2070 Yakut
2236 Dolgan
3010 Koryak
3109 Buryat
3501 Hezhen
3557 Tuvinian
3690 Xibo
3724 Chukchi
4012 Japanese
4081 Mongolian
4372 Selkup
4379 Ket
4453 Naxi
4468 Altaian
4479 Tu
4612 Yizu
4800 Tujia
4897 She
4981 Miaozu
5071 Hakas
5481 East_Greenlander
5493 Kirgiz
5495 Shors
5551 Lahu
5712 Vietnamese
5826 Kazakh
5906 Tibeto-Burman_Burmese
5929 Dai
6176 West_Greenlander

All Polish populations are closer to Baltic than to North_Atlantic:

https://i.ibb.co/K0QJjkC/k13balticvsnorthatlantic.png

library(tidyverse)
library(ggforce)

t=read.csv("https://pastebin.com/raw/9dMWpJcU",row.names=1,check.names=F) # K13 updated

fst=as.matrix(as.dist(read.csv(text=",,,,,,,,,,,,
19,,,,,,,,,,,,
28,36,,,,,,,,,,,
26,32,36,,,,,,,,,,
26,35,28,21,,,,,,,,,
52,62,50,48,39,,,,,,,,
64,65,76,57,60,82,,,,,,,
114,114,122,110,111,127,76,,,,,,
111,111,123,109,112,130,83,56,,,,,
138,137,154,138,144,161,120,113,105,,,,
179,181,187,177,176,191,146,166,177,217,,,
122,127,124,116,108,121,113,145,151,185,203,,
146,150,150,140,135,141,133,164,170,204,220,41,",head=F)))

dm=as.matrix(t)%*%fst
colnames(dm)=colnames(t)
t2=as.matrix(t)%*%cmdscale(fst,ncol(fst)-1)

p1="North_Atlantic"
p2="Baltic"
xy=data.frame(x=dm[,p1],y=dm[,p2])

pick=apply(xy<3000,1,all)
xy=xy[pick,]
t2=t2[pick,]

seg=lapply(1:3,function(i)apply(as.matrix(dist(t2)),1,function(x)unlist(xy[names(sort(x)[i]),],use.names=F))%>%t%>%cbind(xy))%>%do.call(rbind,.)%>%setNames(paste0("V",1:4))

xy$k=cutree(hclust(dist(t2)),24)

pal1=hcl(seq(15,375,length.out=n_distinct(xy$k)+1)%>%head(-1),95,60)

ggplot(xy,aes(x,y))+
geom_abline(linetype="dashed",color="gray80",size=.3)+
geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray50",size=.1)+
ggforce::geom_mark_hull(aes(color=as.factor(k),fill=as.factor(k)),concavity=1000,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.15)+
geom_point(aes(color=as.factor(k)),size=.5)+
geom_text(aes(color=as.factor(k)),label=rownames(xy),size=2,vjust=-.7)+
scale_x_continuous(breaks=seq(0,10000,200),expand=expansion(mult=c(.06,.06)))+
scale_y_continuous(breaks=seq(0,10000,200),expand=expansion(mult=c(.06,.06)))+
scale_fill_manual(values=pal1)+
scale_color_manual(values=pal1)+
labs(x=paste0("Distance to ",p1),y=paste0("Distance to ",p2))+
theme(
axis.text=element_text(size=6),
axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
axis.ticks=element_blank(),
axis.ticks.length=unit(0,"cm"),
axis.title=element_text(size=8),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray85",fill=NA,size=.6),
panel.grid.major=element_line(color="gray85",size=.2),
panel.grid.minor=element_blank(),
plot.background=element_rect(fill="white"),
plot.subtitle=element_text(size=7),
plot.title=element_text(size=11)
)

ggsave("1.png",width=7,height=7)

Maybe it has some meaning: in Dienekes' original post for K12b, his FST matrix has this format https://i.imgur.com/EBszEjs.png, so it's not like the original ones created by ADMIXTURE.
I used this format for K13 and K15 too when I multiplied those FST matrices by the matrices with the original percentage values for the refs.

It doesn't matter, because `as.dist%>%as.matrix` copies the lower triangle to the upper triangle:

> fst=read.csv(text=",North_Atlantic,Baltic,West_Med,West_Asian
North_Atlantic,,,,
Baltic,19,,,
West_Med,28,36,,
West_Asian,26,32,36,",row.names=1)
> fst
               North_Atlantic Baltic West_Med West_Asian
North_Atlantic             NA     NA       NA         NA
Baltic                     19     NA       NA         NA
West_Med                   28     36       NA         NA
West_Asian                 26     32       36         NA
> as.matrix(as.dist(fst))
               North_Atlantic Baltic West_Med West_Asian
North_Atlantic              0     19       28         26
Baltic                     19      0       36         32
West_Med                   28     36        0         36
West_Asian                 26     32       36          0
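
The same check works in the other direction. A small sketch, reusing the four components above and assuming the other format is a full symmetric square matrix: it builds a full matrix and a lower-triangle-only copy, and confirms that both give the same result after `as.dist`. The names `full` and `lower` are made up for this example:

```r
comp <- c("North_Atlantic", "Baltic", "West_Med", "West_Asian")

# Full symmetric matrix with zeros on the diagonal
full <- matrix(c( 0, 19, 28, 26,
                 19,  0, 36, 32,
                 28, 36,  0, 36,
                 26, 32, 36,  0), nrow = 4, byrow = TRUE,
               dimnames = list(comp, comp))

# Lower-triangle-only copy: diagonal and upper triangle set to NA
lower <- full
lower[upper.tri(lower, diag = TRUE)] <- NA

# as.dist only reads the lower triangle, so both inputs end up identical
a <- as.matrix(as.dist(full))
b <- as.matrix(as.dist(lower))
print(isTRUE(all.equal(a, b)))
```

So either layout can be fed to the multiplication, as long as the lower triangle holds the same values.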