Yeah, has he ever published the list of samples that he used in the original reference run for G25?
When you use all modern and ancient population averages from the official G25 datasheets as sources but remove Chuvashes, the best model for Maris has a distance of .073 (versus .039 with Chuvashes included). Similarly, if you remove Maris, the best model for Chuvashes has a distance of .044 (versus .005 with Maris included):
Code:
$ curl 'https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y' -Lso mas
$ curl 'https://drive.google.com/uc?export=download&id=1F2rKEVtu8nWSm7qFhxPU6UESQNsmA-sl' -Lso aas
$ curl https://pastebin.com/raw/afaMiFSa|tr -d \\r>mix;chmod +x mix
$ pip3 install cvxpy
[...]
$ t=Mari;./mix <(cat [am]as|grep -v ^$t,) <(grep ^$t, mas) -s
Mari (.039): 89% Chuvash + 9% RUS_Krasnoyarsk_BA + 2% RUS_AfontovaGora3
$ t=Mari;./mix <(cat [am]as|grep -v ^$t,|grep -v Chuvash) <(grep ^$t, mas) -s
Mari (.073): 88% Udmurt + 9% RUS_Krasnoyarsk_BA + 2% ITA_Tagliente + 1% DEU_LBK_KD
$ t=Chuvash;./mix <(cat [am]as|grep -v ^$t,) <(grep ^$t, mas) -s
Chuvash (.005): 68% Mari + 8% Lithuanian_RA + 6% Russian_Belgorod + 5% Lithuanian_VZ + 5% Darginian + 2% HRV_Vucedol + 2% CHN_Amur_River_Xianbei_IA + 1% MNG_Afanasievo_1_contam + 1% Ket + 1% GEO_CHG + 1% VNM_BA_Dong_Son_Culture + 0% Han_Shanghai + 0% Sorb_Niederlausitz + 0% CHE_FN_steppe_contam + 0% CHN_Miaozigou_MN + 0% Sakha + 0% UKR_Cimmerian_o
$ t=Chuvash;./mix <(cat [am]as|grep -v ^$t,|grep -v Mari) <(grep ^$t, mas) -s
Chuvash (.044): 81% Udmurt + 6% Russian_Pinega + 4% HUN_MBA_Vatya_o + 4% DEU_LBK_KD + 2% RUS_Krasnoyarsk_BA + 2% CHN_Yinwang_500BP + 1% Baltic_EST_BA
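For anyone wondering what the number in parentheses means: as far as I know, G25 modeling tools report the Euclidean distance between the target's coordinates and the weighted mix of the source coordinates, and I'm assuming the script above does the same. Here is a minimal sketch of that calculation. The function name `model_distance` and the 3-dimensional toy coordinates are made up for illustration (real G25 rows have 25 dimensions), and this only evaluates a given set of weights, it does not search for the best ones.

```python
import math

def model_distance(target, sources, weights):
    """Euclidean distance between the target and the weighted mix of sources."""
    dims = len(target)
    # Weighted average of the source coordinates, one dimension at a time
    mix = [sum(w * s[i] for w, s in zip(weights, sources)) for i in range(dims)]
    return math.sqrt(sum((t - m) ** 2 for t, m in zip(target, mix)))

# Toy coordinates invented for illustration, not taken from any G25 datasheet
target = [0.10, 0.02, -0.03]
source_a = [0.12, 0.02, -0.04]
source_b = [0.00, 0.02, 0.02]
weights = [0.8, 0.2]  # mixture proportions, assumed to sum to 1

d = model_distance(target, [source_a, source_b], weights)
print(f"{d:.4f}")
```

So when a close source like Chuvash is removed, the best remaining mix lands farther from the target and the distance goes up.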
I think it's because the initial set of reference samples that Davidski used with G25 included some Mari or Chuvash samples, so some PCs on G25 ended up accounting for drift that is specific to Maris or Chuvashes. As a result, G25 gives less weight to the drift of other populations that were not included among the initial reference samples.
I merged samples from 1240K+HO with samples from Cardona et al. 2014, and I calculated an f2 matrix for the samples. Then for each population that had an identical name in G25 and my dataset, I compared the f2 distance to the scaled G25 distance. In the plot below, Maris and Chuvashes are actually above the diagonal, because G25 accounts for drift that is specific to Maris. But then G25 underestimates the distance to other drifted or isolated populations, like Kubachinian, Kalash, Udmurt, Komi, Scottish, Icelandic, Kusunda, Chukchi, Surui, etc. (The reason why the distance to Even is much bigger in G25 than in my f2 matrix is that the Even population average in G25 is modeled as 12% Norwegian and 88% Han_Shanghai, but my Even samples had an average of 30% of a Caucasoid component in a K=2 Eurasian ADMIXTURE run.)
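Here is a rough sketch of one way to do that kind of comparison. It is not the exact procedure I used for the plot: it assumes the G25 distance goes on the vertical axis and that the "diagonal" is a least-squares line through the origin. The function names, the placeholder population names and all the numbers are invented for illustration.

```python
def through_origin_slope(xs, ys):
    """Least-squares slope k of the line y = k * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def diagonal_residuals(f2, g25):
    """For populations present in both dicts, return G25 distance minus
    the value predicted from f2. Positive means above the diagonal,
    negative means G25 underestimates the distance relative to f2."""
    names = sorted(set(f2) & set(g25))
    k = through_origin_slope([f2[n] for n in names], [g25[n] for n in names])
    return {n: g25[n] - k * f2[n] for n in names}

# Invented distances from one target to three placeholder populations
f2 = {"pop_A": 0.010, "pop_B": 0.020, "pop_C": 0.030}
g25 = {"pop_A": 0.05, "pop_B": 0.10, "pop_C": 0.12}

res = diagonal_residuals(f2, g25)
for name, r in res.items():
    print(name, "above" if r > 0 else "below")
```

Only populations with an identical name in both dictionaries are compared, which matches the name-matching step described above.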