Does the Dodecad K12 or the Gedrosia K12 calculator make more sense?

**~~Komintasavalta~~** · 10-01-2021, 04:37 PM

Originally Posted by Zoro

The only problem, which no one here understands except maybe Komi, is that SNP loadings or G25 coordinates are affected by which populations are in the run that generates the G25 coordinates. For example, if you want to make Iranians or Kurds more W. Eurasian, you can put more Caucasian and European samples in the run. That will make Iranian/Kurd coordinates more W. Eurasian. If you want to make them more E. Eurasian, you do the opposite. You won’t become suspicious, because the G25 modelling will still be reasonable.

The only way you would become suspicious is if you did IBS comparisons of Kurds and Iranians vs. E. and W. Eurasians.

I don’t think the G25 guy is doing anything on purpose; it’s just a result of the chaotic way he keeps adding samples to the run, or of the unbalanced way he generated those coordinates originally.

All samples in G25 are projected according to vbnetkhio: https://www.theapricity.com/forum/sh...=1#post7303874. So if that's the case, and if Davidski always uses the same set of reference samples when he generates new G25 coordinates, it doesn't matter how many new non-reference samples he adds to a SmartPCA run at a time.

If G25 only includes projected samples, I think it's because projected samples plot differently than reference samples in SmartPCA, which I demonstrated in the thread linked above by making a run where half of the samples from each population were projected and half were references. The same phenomenon in ADMIXTURE calculators is known as the infamous "calculator effect": https://bga101.blogspot.com/2012/05/...or-effect.html.

The way you can do a projected run with SmartPCA is to add the parameter `poplistname: /path/to/poplist`, where the poplist file contains the names of populations that are used as reference samples. Population names are specified in the sixth field of the fam file and not the first field.
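
For other users, here is a minimal sketch of what such a parameter file could look like. The file names (`merged.bed`, `refpops.txt`, `smartpca.par`) and the three population names are placeholders I made up, and `numoutevec: 25` is only an assumption chosen to match the 25 dimensions of G25. `poplistname` is the parameter described above; the other keys are regular SmartPCA parameters.

```shell
# Hypothetical reference populations, one name per line (placeholders).
printf '%s\n' French Han Yoruba > refpops.txt

# Hypothetical file names. Only populations listed in refpops.txt are used to
# compute the PCs; samples from all other populations are projected onto them.
cat > smartpca.par <<'EOF'
genotypename: merged.bed
snpname: merged.bim
indivname: merged.fam
evecoutname: merged.evec
evaloutname: merged.eval
numoutevec: 25
poplistname: refpops.txt
EOF

# Run only if SmartPCA and the (hypothetical) dataset are actually present.
if command -v smartpca >/dev/null 2>&1 && [ -f merged.bed ]; then
  smartpca -p smartpca.par
fi
```

The names in `refpops.txt` have to match the population labels in the sixth field of `merged.fam` exactly, otherwise those samples are treated as non-reference samples and projected.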

(You probably know this already, but I'm trying to explain it to other users.)

**Zoro** · 10-01-2021, 04:45 PM

Originally Posted by Komintasavalta

All samples in G25 are projected according to vbnetkhio: https://www.theapricity.com/forum/sh...=1#post7303874. So if that's the case, and if Davidski always uses the same set of reference samples when he generates new G25 coordinates, it doesn't matter how many new non-reference samples he adds to a SmartPCA run at a time.

Even if that’s the case, his original run when he generated the coordinates wasn’t balanced. Do an experiment: run the PLINK PCA program with only 2 Iranians. In one run, include Europeans and S. and E. Asians. In the other run, include Europeans, S. and E. Asians, AND Caucasians. Post the coordinates of the 2 Iranians here for both runs.
That should be interesting.

**Lucas** · 10-01-2021, 04:57 PM

Originally Posted by Zoro

Even if that’s the case, his original run when he generated the coordinates wasn’t balanced. Do an experiment: run the PLINK PCA program with only 2 Iranians. In one run, include Europeans and S. and E. Asians. In the other run, include Europeans, S. and E. Asians, AND Caucasians. Post the coordinates of the 2 Iranians here for both runs.
That should be interesting.

Just for reference, G25 is based on SmartPCA, not PLINK PCA.

**Zoro** · 10-01-2021, 05:00 PM

Originally Posted by Lucas

Just for reference, G25 is based on SmartPCA, not PLINK PCA.

OK, but they’re both PCA programs. He can do his experiment with SmartPCA instead.

**~~Komintasavalta~~** · 10-01-2021, 05:31 PM

Originally Posted by Zoro

Even if that’s the case, his original run when he generated the coordinates wasn’t balanced.

Yeah, has he ever published the list of samples that he used in the original reference run for G25?

When you use all modern and ancient population averages from the official G25 datasheets as a source but you remove Chuvashes, then the best model for Maris has a distance of .073, and similarly if you remove Maris, then the best model for Chuvashes has a distance of .044:

Code:

# Download the G25 population-average datasheets (mas is presumably the modern one, aas the ancient one)
$ curl 'https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y' -Lso mas
$ curl 'https://drive.google.com/uc?export=download&id=1F2rKEVtu8nWSm7qFhxPU6UESQNsmA-sl' -Lso aas
# Download the mixture-modelling script, strip carriage returns, and make it executable
$ curl https://pastebin.com/raw/afaMiFSa|tr -d \\r>mix;chmod +x mix
$ pip3 install cvxpy
[...]
# Model Mari with every other population average as a possible source
$ t=Mari;./mix <(cat [am]as|grep -v ^$t,) <(grep ^$t, mas) -s
Mari (.039): 89% Chuvash + 9% RUS_Krasnoyarsk_BA + 2% RUS_AfontovaGora3
# Same model, but with Chuvash also removed from the sources
$ t=Mari;./mix <(cat [am]as|grep -v ^$t,|grep -v Chuvash) <(grep ^$t, mas) -s
Mari (.073): 88% Udmurt + 9% RUS_Krasnoyarsk_BA + 2% ITA_Tagliente + 1% DEU_LBK_KD
# Model Chuvash with every other population average as a possible source
$ t=Chuvash;./mix <(cat [am]as|grep -v ^$t,) <(grep ^$t, mas) -s
Chuvash (.005): 68% Mari + 8% Lithuanian_RA + 6% Russian_Belgorod + 5% Lithuanian_VZ + 5% Darginian + 2% HRV_Vucedol + 2% CHN_Amur_River_Xianbei_IA + 1% MNG_Afanasievo_1_contam + 1% Ket + 1% GEO_CHG + 1% VNM_BA_Dong_Son_Culture + 0% Han_Shanghai + 0% Sorb_Niederlausitz + 0% CHE_FN_steppe_contam + 0% CHN_Miaozigou_MN + 0% Sakha + 0% UKR_Cimmerian_o
# Same model, but with Mari also removed from the sources
$ t=Chuvash;./mix <(cat [am]as|grep -v ^$t,|grep -v Mari) <(grep ^$t, mas) -s
Chuvash (.044): 81% Udmurt + 6% Russian_Pinega + 4% HUN_MBA_Vatya_o + 4% DEU_LBK_KD + 2% RUS_Krasnoyarsk_BA + 2% CHN_Yinwang_500BP + 1% Baltic_EST_BA

I think it's because in the initial set of reference samples that Davidski used with G25, there were some Mari or Chuvash samples, so some PCs on G25 ended up accounting for drift that is specific to Maris or Chuvashes. But then G25 gives less weight to the drift of other populations that were not included among the initial reference samples.

I merged samples from 1240K+HO with samples from Cardona et al. 2014, and I calculated an f2 matrix for the samples. Then for each population that had an identical name in G25 and my dataset, I compared the f2 distance to the scaled G25 distance. In the plot below, Maris and Chuvashes are actually above the diagonal, because G25 accounts for drift that is specific to Maris. But then G25 underestimates the distance to other drifted or isolated populations, like Kubachinian, Kalash, Udmurt, Komi, Scottish, Icelandic, Kusunda, Chukchi, Surui, etc. (The reason why the distance to Even is much bigger in G25 than in my f2 matrix is that the Even population average in G25 is modeled as 12% Norwegian and 88% Han_Shanghai, but my Even samples had an average of 30% of a Caucasoid component in a K=2 Eurasian ADMIXTURE run.)

**Zoro** · 10-01-2021, 05:59 PM

Originally Posted by Komintasavalta

Yeah, has he ever published the list of samples that he used in the original reference run for G25?

Well, as you have discovered, results change based on which samples are included in PCA, ADMIXTURE calculators, and G25, and on how you run them. So there are always many unknown biases.

I can tell you for a fact that E. Eurasian percent in Iranics is a joke no matter which of these methods you use. To reduce biases and get closer to the truth, just do a one-to-one IBS or IBD comparison of each sample, or maybe use D-stats or something like that. Doing so, you’ll see that these calculator results are full of biases and inaccuracies.

And to reduce bias even further, use whole genomes in IBS or D-stats.

**~~Komintasavalta~~** · 10-01-2021, 07:12 PM

Originally Posted by Zoro

I can tell you for a fact that E. Eurasian percent in Iranics is a joke no matter which of these methods you use.

Yeah but in the ADMIXTURE run I posted on the previous page, Iranians had 82% of the Balochi-Brahui component, which is clearly partially East Eurasian (or at least it's partially undifferentiated Eurasian which shares a lack of western drift with East Eurasians). I now tried doing runs at different K values with the same samples, except I removed the ancient samples, because I usually get f*cked up results when I try to mix ancient and modern samples in the same ADMIXTURE run. Now in the K=2 run, the percentage of the eastern component was 14% in Balochi, 13% in Brahui, and 6% in Iranians. And even in the K=4 run, Iranians got 4% of the North Asian component and 1% of the East Asian component:

I think the algorithm used by ADMIXTURE works in a way that when there is a sample that is close enough to having 100% of one component, it pulls the sample closer to having 100% of that component. So in a typical ADMIXTURE run, a large percentage of samples just gets 100% of a single component and 0% of all other components. For example, the K=2 run above included 1469 samples, but 274 of them had 100% of the Caucasoid component, and 223 had 100% of the Mongoloid component. But the samples that got 100% of the Caucasoid component cannot all be equally closely related to the component. And if there was a way to reduce the strength of the pull that makes a sample have 100% of a single component, then for example in the K=2 run above it would probably increase the average percentage of the Mongoloid component in Europeans. (I should check if the effect is less strong in alternatives to ADMIXTURE like Frappe or DyStruct or STRUCTURE.)
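
A count like the one above can be reproduced from the Q file of a run with a short awk command. This is only a sketch on a made-up five-sample K=2 matrix (`toy.2.Q` is a placeholder name, not the run discussed here), and the 0.995 cutoff is my assumption for "rounds to 100%".

```shell
# Toy K=2 Q matrix with made-up values; one row per sample, one column per component.
cat > toy.2.Q <<'EOF'
0.999990 0.000010
0.940000 0.060000
0.000010 0.999990
0.999990 0.000010
0.860000 0.140000
EOF

# For each component, count the samples whose proportion rounds to 100%.
awk '{ k = NF; for (i = 1; i <= NF; i++) if ($i >= 0.995) n[i]++ }
     END { for (i = 1; i <= k; i++) printf "component %d: %d of %d samples\n", i, n[i], NR }' \
  toy.2.Q | tee maxed.txt
```

Running the same awk command on the Q files of other programs would be one way to compare how strongly each program pushes samples to a single component.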

**~~Komintasavalta~~** · 10-02-2021, 06:59 AM

One way to make ADMIXTURE-based calculators more accurate might be to combine the Q matrices of runs at different K values into a single wide matrix. Or at least that would make distances calculated based on the datasheet of a calculator more accurate even without having to multiply by FST. Currently if you don't account for FST, population 1 that is 100% North_European in a K=12 run is equally far from population 2 that is 50% North_European and 50% Atlantic_Med as from population 3 that is 50% North_European and 50% Siberian. But if you join the Q matrix of a K=12 run with the Q matrix of a K=3 run, then in the K=3 run the first two populations would have a much lower distance than the first and the third population, which would also make the combined distance between the populations more accurate.
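
The three-population example above can be sketched with toy numbers. Everything here is made up for illustration: the K=12 rows show only the North_European, Atlantic_Med, and Siberian columns (the other nine would be zero), and the K=3 rows assume hypothetical West Eurasian, East Eurasian, and African components. Distances are plain Euclidean distances with no FST weighting.

```shell
# Rows are population 1, 2, 3. Columns: North_European, Atlantic_Med, Siberian.
cat > k12.Q <<'EOF'
1.0 0.0 0.0
0.5 0.5 0.0
0.5 0.0 0.5
EOF

# Assumed K=3 proportions: West_Eurasian, East_Eurasian, African.
cat > k3.Q <<'EOF'
1.0 0.0 0.0
1.0 0.0 0.0
0.5 0.5 0.0
EOF

# Join the two Q matrices side by side into one wide matrix.
paste -d' ' k12.Q k3.Q > wide.Q

# Print the Euclidean distance from the first row to every other row.
dist() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) a[i] = $i; next }
       { s = 0; for (i = 1; i <= NF; i++) { d = $i - a[i]; s += d * d }
         printf "pop1-pop%d %.3f\n", NR, sqrt(s) }' "$1"
}

dist k12.Q  > d_k12.txt    # K=12 alone: both distances are 0.707
dist wide.Q > d_wide.txt   # combined: pop1-pop2 stays 0.707, pop1-pop3 becomes 1.000
cat d_k12.txt d_wide.txt
```

With these toy values the K=12 run alone cannot tell population 2 from population 3, but the wide matrix puts population 3 farther from population 1.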

For example below is a PCA based on the Q matrix of a single K=8 run, where I didn't multiply the Q matrix with the FST matrix of the run. It exaggerates the distance between Europeans and Caucasians, because it doesn't know that the European component has a low FST distance to the Caucasian component. It treats a population with 100% of the European component as equally distant from a population with 100% of the Caucasian component as from a population with 100% of the Eskimo component.
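
To show the effect with numbers, here is a sketch that takes "multiplying the Q matrix with the FST matrix" literally as a matrix product, which is only one possible reading. The FST values are made up, and the three rows are hypothetical populations with 100% of the European, Caucasian, and Eskimo component respectively.

```shell
# Made-up FST between components (order: European, Caucasian, Eskimo).
cat > fst.txt <<'EOF'
0.00 0.02 0.12
0.02 0.00 0.12
0.12 0.12 0.00
EOF

# Three hypothetical populations, each 100% of one component.
cat > q.txt <<'EOF'
1 0 0
0 1 0
0 0 1
EOF

# Matrix product Q x FST: the first file fills f[][], the second is multiplied row by row.
awk 'NR == FNR { for (j = 1; j <= NF; j++) f[FNR, j] = $j; k = NF; next }
     { for (j = 1; j <= k; j++) {
         s = 0; for (i = 1; i <= k; i++) s += $i * f[i, j]
         printf "%s%.4f", (j > 1 ? " " : ""), s }
       print "" }' fst.txt q.txt > qf.txt

# Print the Euclidean distance from the first row to every other row.
dist() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) a[i] = $i; next }
       { s = 0; for (i = 1; i <= NF; i++) { d = $i - a[i]; s += d * d }
         printf "pop1-pop%d %.3f\n", NR, sqrt(s) }' "$1"
}

dist q.txt  > d_raw.txt   # unweighted: European is 1.414 from both others
dist qf.txt > d_fst.txt   # weighted: 0.028 to Caucasian, 0.197 to Eskimo
cat d_raw.txt d_fst.txt
```

In the unweighted matrix the European row is equally far from the other two rows, while after the multiplication the distance to the Caucasian row is much smaller than the distance to the Eskimo row.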

But when you combine the Q matrices of all runs from K=2 to K=9 into a single wide matrix, it makes a PCA based on the matrix more accurate:

I think you also get more accurate oracles when you use a combined matrix of ADMIXTURE runs at different K values. For example here's 3-way models made using the Q matrix of a single K=8 run:

Hungarian (.00076): 95% English + 4% Lak + 1% Tibetan_Yunnan
Estonian (.00069): 82% Lithuanian + 17% Russian_Archangelsk_Krasnoborsky + 0% Icelandic
Finnish (.00083): 89% Lithuanian + 9% Ket + 1% Iranian
Karelian (.00000): 92% Lithuanian + 8% Nganasan + 0% Koryak
Veps (.00000): 86% Karelian + 12% Lithuanian + 2% Nganasan
Mordovian (.00069): 84% Russian_Archangelsk_Krasnoborsky + 11% Belarusian + 5% Karachai
Besermyan (.00573): 80% Udmurt + 17% Russian_Archangelsk_Krasnoborsky + 4% Balochi
Udmurt (.00779): 77% Besermyan + 16% Mansi + 6% Karelian
Mansi (.00015): 74% Selkup + 25% Lithuanian + 1% Eskimo_ChaplinSireniki
Selkup (.00068): 89% Enets + 9% Veps + 2% GujaratiD
Enets (.00017): 80% Nganasan + 15% Lithuanian + 4% Estonian
Nganasan (.04768): 100% Evenk_Transbaikal

Here's models made using a combined matrix of runs from K=2 to K=9:

Hungarian (.01717): 73% English + 20% Ukrainian + 7% Circassian
Estonian (.01661): 81% Lithuanian + 12% Finnish + 7% Veps
Finnish (.01363): 70% Veps + 17% Estonian + 13% Orcadian
Karelian (.00680): 65% Veps + 25% Russian_Archangelsk_Krasnoborsky + 10% Lithuanian
Veps (.01165): 98% Karelian + 2% Mansi + 0% Koryak
Mordovian (.00726): 43% Russian_Archangelsk_Krasnoborsky + 42% Ukrainian + 15% Chuvash
Besermyan (.01823): 85% Udmurt + 13% Czech + 2% Kalash
Udmurt (.01700): 72% Besermyan + 17% Mansi + 11% Russian_Archangelsk_Leshukonsky
Mansi (.05058): 50% Ket + 35% Yukagir_Forest + 15% Veps
Selkup (.02265): 74% Ket + 16% Tatar_Siberian_Zabolotniye + 10% Evenk_Transbaikal
Enets (.01851): 67% Ket + 31% Nganasan + 3% Russian_Archangelsk_Leshukonsky
Nganasan (.26531): 100% Evenk_Transbaikal

**Zoro** · 10-02-2021, 08:39 AM

Originally Posted by Komintasavalta

One way to make ADMIXTURE-based calculators more accurate might be to combine the Q matrices of runs at different K values into a single wide matrix. Or at least that would make distances calculated based on the datasheet of a calculator more accurate even without having to multiply by FST. Currently if you don't account for FST, population 1 that is 100% North_European in a K=12 run is equally far from population 2 that is 50% North_European and 50% Atlantic_Med as from population 3 that is 50% North_European and 50% Siberian. But if you join the Q matrix of a K=12 run with the Q matrix of a K=3 run, then in the K=3 run the first two populations would have a much lower distance than the first and the third population, which would also make the combined distance between the populations more accurate.

I like how you think outside the box, but I’m not sure I follow what you’re saying, because the Q matrix is the admixture proportion matrix, which changes based on K. For example, at K=4:

0.15 0.15 0.5 0.2

Whereas the FST matrix is the distances between components (column values). It’s not SNP weights. So how do you combine those two different matrices? Can you give an example based on the first couple of rows of the Q matrix?

**rothaer** · 10-02-2021, 11:53 AM

Originally Posted by Komintasavalta

Maybe Gedrosia is a proxy for steppe ancestry in Dod K12, which would explain why it's higher in Northern Europeans than in Southern Europeans. (...)

It is a very interesting question what this Gedrosia component actually is.

If you look at the distribution map in the OP, it does not fit a steppe-related signal. You have that component in Western Europe but 0% of it in Poland and Belarus, even though those populations are known to have a large proportion of steppe ancestry.
So whatever this Gedrosia component depicts (might it be an arbitrarily and erroneously chosen component?), the question remains by what migration it got that distribution.