qpAdm thread

**Zoro** · 03-06-2021, 11:52 AM

Originally Posted by Kaspias

Code:

✔ 1578508 SNPs read in total
! 565741 SNPs remain after filtering. 540537 are polymorphic.

Let me show you all the models:

Right (Adygei, Cretan, Iranian, Mansi, Polish, Jordanian, Tatar_Tomsk, Italian_North, Albanian, French), all of them Simons samples:

Code:

	target	left	weight	se	z
1	Kaspias	Bulgarian_1.DG	0.78	0.044	17.51
2	Kaspias	Turkmen.SG	0.22	0.044	4.98


	f4rank	dof	chisq	p.value	dofdiff	chisqdiff	p_nested
1	1	8	5.96	0.65119	10	553.51	1.6e-112
2	0	18	559.48	< 2e-16

Code:

	target	left	weight	se	z
1	Kaspias	Bulgarian_1.DG	0.76	0.048	15.89
2	Kaspias	Bashkir.SG	0.24	0.048	4.89


	f4rank	dof	chisq	p.value	dofdiff	chisqdiff	p_nested
1	1	8	6.44	0.59806	10	478.01	2.2e-96
2	0	18	484.45	< 2e-16

Code:

	target	left	weight	se	z
1	Kaspias	Bulgarian_1.DG	0.8	0.039	20.51
2	Kaspias	Uzbek.SG	0.2	0.039	5.13



	f4rank	dof	chisq	p.value	dofdiff	chisqdiff	p_nested
1	1	8	4.4	0.81917	10	796.21	1.3e-164
2	0	18	800.62	< 2e-16

Code:

	target	left	weight	se	z
1	Kaspias	Bulgarian_1.DG	0.82	0.044	18.83
2	Kaspias	Tatar_Tomsk.SG	0.18	0.044	4.15


	f4rank	dof	chisq	p.value	dofdiff	chisqdiff	p_nested
1	1	8	6.16	0.62951	10	628.79	1.2e-128
2	0	18	634.94	< 2e-16

Basically these mean that you're significantly E Asian shifted compared to Bulgarians (keep in mind Bulgarians themselves have some Siberian and E Asian). Going by p-value, your best models seem to be Uzbek (0.82) and Turkmen (0.65), but if you want to fine-tune this even more, add Mongolians or Han to pright. Your p-values may drop, but that's fine. I also don't get how you have Tatar_Tomsk in both pright and pleft at the same time.
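As a side note on where the p.value column comes from: it is the upper-tail probability of a chi-square distribution with `dof` degrees of freedom, evaluated at `chisq`. In R that is `pchisq(chisq, dof, lower.tail=FALSE)`. Below is a rough awk sketch of the same calculation, assuming an even dof (the closed-form series used here only holds in that case). The `chisq_p` helper name is made up for illustration. Small differences from the printed p-values are expected because the chisq values shown above are rounded.

```shell
# chisq_p CHISQ DOF: upper-tail chi-square probability (even DOF only).
# For even DOF: p = exp(-x/2) * sum_{k=0}^{DOF/2-1} (x/2)^k / k!
chisq_p() {
  awk -v x="$1" -v dof="$2" 'BEGIN {
    if (dof % 2 != 0) { print "chisq_p: even dof only" > "/dev/stderr"; exit 1 }
    term = 1; sum = 1                    # k = 0 term of the series
    for (k = 1; k < dof / 2; k++) {      # add (x/2)^k / k! for k = 1 .. dof/2 - 1
      term *= (x / 2) / k
      sum += term
    }
    printf "%.3f\n", exp(-x / 2) * sum
  }'
}

chisq_p 5.96 8   # Turkmen model (reported p.value 0.65119)
chisq_p 4.4 8    # Uzbek model (reported p.value 0.81917)
```

A higher p-value here means the two-source model is harder to reject, which is why the Uzbek and Turkmen runs come out on top.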

**Zoro** · 03-06-2021, 12:08 PM

@Kaspias

What command did you use to extract and to do FST?

**Kaspias** · 03-06-2021, 12:09 PM

Originally Posted by Zoro

Basically these mean that you're significantly E Asian shifted compared to Bulgarians (keep in mind Bulgarians themselves have some Siberian and E Asian). Going by p-value, your best models seem to be Uzbek (0.82) and Turkmen (0.65), but if you want to fine-tune this even more, add Mongolians or Han to pright. Your p-values may drop, but that's fine. I also don't get how you have Tatar_Tomsk in both pright and pleft at the same time.

I replaced Tatar_Tomsk with Bashkir on the right when running Tomsk on the left; Tomsk was on the right only when running the other populations (Uzbek, Turkmen...).

Thank you for your input. Now I have a question: I noticed that with ancient populations on the right the SNP count drops to around 100k, while I have around 600k with moderns. I constantly hear that ancients on the right are the better choice, but considering the SNP count, which would you prefer in my case?

I'd like to model Balkan populations with Medieval samples, for example, but I believe the SNP count will be around 80K. Is that enough for a decent run, or too low?

**Kaspias** · 03-06-2021, 12:12 PM

Originally Posted by Zoro

@Kaspias

What command did you use to extract and to do FST?

Code:

library(admixtools)
fst_blocks = fst("fstdir")
print(fst_blocks, n=2000)

**Zoro** · 03-06-2021, 12:16 PM

Originally Posted by Kaspias

I replaced Tatar_Tomsk with Bashkir on the right when running Tomsk on the left; Tomsk was on the right only when running the other populations (Uzbek, Turkmen...).

Thank you for your input. Now I have a question: I noticed that with ancient populations on the right the SNP count drops to around 100k, while I have around 600k with moderns. I constantly hear that ancients on the right are the better choice, but considering the SNP count, which would you prefer in my case?

I'd like to model Balkan populations with Medieval samples, for example, but I believe the SNP count will be around 80K. Is that enough for a decent run, or too low?

I think it should be enough. Use the highest quality samples and stay away from using 2 sources that are similar to each other. Give it a try.

**Kaspias** · 03-06-2021, 04:51 PM

Originally Posted by Zoro

I think it should be enough. Use the highest quality samples and stay away from using 2 sources that are similar to each other. Give it a try.

Do you remember asking me which population to use to represent the Anatolian ancestry of Anatolian Turks? I found the Roopkund outlier in the spreadsheet, a Central Anatolian Greek from the classical Ottoman era, and modeled possible scenarios using 3 different Turkic populations.

The Simons Turkish samples are from Hodoğlugil's study and were collected in Kayseri: http://simonsfoundation.s3.amazonaws...ion_update.txt

These Kayseri samples have ~6% East Eurasian on average (going by Gedmatch), so we can draw further conclusions from that.

Code:

# right

  "Mbuti.DG",
  "Han.DG",
  "Saami.DG",
  "Icelandic.DG",
  "Sardinian.DG",
  "Punjabi.DG",
  "Eskimo_Chaplin.DG",
  "BedouinB.DG",
  "Basque.DG"

! 90852 SNPs remain after filtering. 78431 are polymorphic.

Code:

 target               left                     weight     se     z
                                          
1 Anatolian_Turkish.DG Anatolian_Greek_Medieval  0.863 0.0363 23.8 
2 Anatolian_Turkish.DG Kimak.SG                  0.137 0.0363  3.76

f4rank   dof  chisq        p dofdiff chisqdiff  p_nested
                       
1      1     7   8.43 2.96e- 1       9      419.  1.00e-84
2      0    16 428.   5.31e-81      NA       NA  NA

! 99251 SNPs remain after filtering. 85286 are polymorphic.

Code:

 target               left                     weight     se     z
                                          
1 Anatolian_Turkish.DG Anatolian_Greek_Medieval  0.816 0.0472 17.3 
2 Anatolian_Turkish.DG Gokturk.SG                0.184 0.0472  3.90

 f4rank   dof chisq        p dofdiff chisqdiff  p_nested
                      
1      1     7  15.8 2.74e- 2       9      303.  6.10e-60
2      0    16 319.  3.31e-58      NA       NA  NA

! 247268 SNPs remain after filtering. 213710 are polymorphic.

Code:

 target               left                     weight     se     z
                                          
1 Anatolian_Turkish.DG Anatolian_Greek_Medieval  0.851 0.0283 30.1 
2 Anatolian_Turkish.DG Ottoman_MA2195.SG         0.149 0.0283  5.27

 f4rank   dof chisq         p dofdiff chisqdiff   p_nested
                        
1      1     7  17.2 1.64e-  2       9      728.  6.69e-151
2      0    16 745.  3.16e-148      NA       NA  NA

**Zoro** · 03-06-2021, 05:34 PM

Looks like the 1st model is best, with a p-value of 0.30. The other two (p = 0.027 and 0.016) can sort of be rejected.
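The pass/reject call above can be sketched in a couple of lines of awk, assuming the conventional 0.05 cutoff and a simple "model p-value" input format. Both the input format and the `flag_models` name are made up for illustration; the p-values are the ones from the three runs in this thread.

```shell
# flag_models [CUTOFF]: read "model p-value" pairs on stdin and label each one.
# A model "passes" when its p-value is at or above the cutoff (default 0.05).
flag_models() {
  awk -v cutoff="${1:-0.05}" '{
    verdict = ($2 + 0 >= cutoff + 0) ? "pass" : "reject"
    printf "%s\t%s\t%s\n", $1, $2, verdict
  }'
}

flag_models <<'EOF'
Kimak.SG 0.296
Gokturk.SG 0.0274
Ottoman_MA2195.SG 0.0164
EOF
```

Only the Kimak.SG run stays above the cutoff, which matches the reading above.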

As for the Kayseri samples, I kind of guessed that they would score that much on Gedmatch. Interestingly, I didn't get nearly as many passing W Asian + E Asian models for them as I did for Iraqi Kurds. Probably has to do with finding a good modern W Asian source, which is another reason I like using ancients.

I also got this sort-of-passing model for the Kayseri Turks (p-value 0.05):

Target   Source     Admix  SE
Turkish  Armenian   57%    7%
Turkish  Bulgarian  35%    7%
Turkish  Yakut      8%     1%

Again, those Turks would have more than 8% NE Asian in total, because the Armenian and Bulgarian sources also carry some.
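To spell out the reasoning: the total NE Asian share is each source's weight times whatever NE Asian fraction that source itself carries, summed over the sources. A toy awk sketch follows. The weights are the ones from the model above, but the per-source fractions (1%, 2%, 100%) are invented placeholders purely to show the arithmetic, not estimates, and `total_share` is a made-up helper name.

```shell
# total_share W1 F1 W2 F2 W3 F3: sum of weight * fraction over three sources, in percent.
total_share() {
  awk -v w1="$1" -v f1="$2" -v w2="$3" -v f2="$4" -v w3="$5" -v f3="$6" \
    'BEGIN { printf "%.1f%%\n", 100 * (w1 * f1 + w2 * f2 + w3 * f3) }'
}

# Weights: Armenian 0.57, Bulgarian 0.35, Yakut 0.08 (from the model above).
# Fractions 0.01, 0.02 and 1.00 are PLACEHOLDERS, not measured values.
total_share 0.57 0.01 0.35 0.02 0.08 1.00
```

With any nonzero fraction for the first two sources, the total ends up above the 8% Yakut weight on its own.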

I think you should try modeling yourself with these as well, plus Siberian and E Asian. You can use the ancient pright list I posted earlier. Here are their missingness rates; they're not that bad.

Anatolia_EBA I2495 3682752 4676043 0.79
Anatolia_EBA I2683 3761883 4668444 0.81
Anatolia_EBA.SG MA2210_final 4094550 4668444 0.88
Anatolia_EBA.SG MA2212_final 4120486 4676043 0.88
Anatolia_EBA.SG MA2213_final 3984232 4668444 0.85
Anatolia_Epipaleolithic ZBC_IPB001 3830320 4676043 0.82
Anatolia_IA.SG MA2198_final 4157111 4668444 0.89
Anatolia_MLBA.SG MA2200_final 3750514 4676043 0.80
Anatolia_MLBA.SG MA2203_final 4091029 4668444 0.88
Anatolia_MLBA.SG MA2205_final 4152321 4676043 0.89
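For reference, the last column in that list is just the third column divided by the fourth (e.g. 3682752 / 4676043 ≈ 0.79). A minimal awk sketch to recompute it, assuming whitespace-separated "population sample count total" rows; what the two counts stand for is not spelled out in the post, and `add_ratio` is a hypothetical helper name.

```shell
# add_ratio: append column 3 / column 4, rounded to two decimals, to each input row.
add_ratio() {
  awk '{ printf "%s %s %s %s %.2f\n", $1, $2, $3, $4, $3 / $4 }'
}

add_ratio <<'EOF'
Anatolia_EBA I2495 3682752 4676043
Anatolia_EBA I2683 3761883 4668444
Anatolia_MLBA.SG MA2205_final 4152321 4676043
EOF
```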

As for your modern pright list, I would maybe add Armenians or Iranians.

**~~Komintasavalta~~** · 03-06-2021, 05:40 PM

I did this to get FST distances:

Code:

$ printf %s\\n Mansi Finnish Nganasan Selkup Karelian Udmurt Mordovian>pops
$ R -e 'library(admixtools);fst=fst("g/v44.3_HO_public/v44.3_HO_public",pop1=readLines("pops"));write.csv(fst,"fst",quote=F)'
ℹ Reading allele frequencies from packedancestrymap files...
ℹ v44.3_HO_public.geno has 13197 samples and 597573 SNPs
ℹ Calculating allele frequencies from 19 samples in 4 populations
ℹ Expected size of allele frequency data: 86 MB
597k SNPs read...
✔ 597573 SNPs read in total
! 593124 SNPs remain after filtering. 414780 are polymorphic.
ℹ Allele frequency matrix for 593124 SNPs and 4 populations is 62 MB
ℹ Computing pairwise f2 for all SNPs and population pairs requires 493 MB RAM without splitting
ℹ Computing without splitting since 493 < 8000 (maxmem)...
ℹ Data written to f2/
ℹ Reading precomputed data for 4 populations...
ℹ Reading f2 data for pair 10 out of 10...
Warning message:
In read_f2(dir, pops, pops2, afprod = afprod, fst = fst, remove_na = remove_na,  :
  Discarding 1 block(s) due to missing values!
Discarded block(s): 535
>
>
$ cat fst
,pop1,pop2,est,se,z,p
1,Finnish,Karelian,0.00129340940098996,0.000385024618276676,3.35929013261311,0.000781429796323533
2,Finnish,Mordovian,0.00543917401762932,0.0003253493038913,16.7179519137578,9.6995724807741e-63
3,Finnish,Nganasan,0.119054350470445,0.00113575044283692,104.824392736371,0
4,Finnish,Selkup,0.0601437871347565,0.000773515188052884,77.753854175963,0
5,Finnish,Udmurt,0.0187032075983067,0.000585527652038594,31.9424839001009,6.87035668325847e-224
6,Karelian,Mordovian,0.00590771927078587,0.000239357005605168,24.6816225656289,1.68484078541802e-134
7,Karelian,Udmurt,0.019523384287915,0.000473936956593016,41.1940533784545,0
8,Mansi,Finnish,0.0402190424166203,0.000841540850181047,47.7921450966614,0
9,Mansi,Karelian,0.0399509729801598,0.000728931940431647,54.8075489139662,0
10,Mansi,Mordovian,0.0383778238793512,0.000668216221602333,57.4332418739013,0
11,Mansi,Nganasan,0.0602924170396429,0.000867560662779261,69.4964855212476,0
12,Mansi,Selkup,0.0223050689999093,0.000513096315174833,43.4715049401769,0
13,Mansi,Udmurt,0.0240652073778455,0.000663882926915772,36.2491734644333,1.02394799408991e-287
14,Nganasan,Karelian,0.118602793551424,0.00108015804818372,109.801333009418,0
15,Nganasan,Mordovian,0.11745770899229,0.00099405063691636,118.160689838352,0
16,Nganasan,Selkup,0.0504386703379596,0.000674035384985417,74.8308938395731,0
17,Nganasan,Udmurt,0.0911528579608077,0.000973329536820952,93.650561821567,0
18,Selkup,Karelian,0.0595410382563537,0.00071504905760348,83.268466160781,0
19,Selkup,Mordovian,0.0579452329275561,0.000631133692337951,91.8113446184526,0
20,Selkup,Udmurt,0.0409818006630253,0.000612173976633259,66.9446958337084,0
21,Udmurt,Mordovian,0.0170968626475659,0.000406949757311595,42.0122197897642,0

There's probably an easier way to do this in R, but this converts the FST pairs into a table:

Code:

$ awk -F, 'NR>1{print$3","$2","$4;print$2","$3","$4}' fst|awk -F, '{print$1","$1","}1'|sort -u>/tmp/a
$ cut -d, -f3 /tmp/a|awk '{printf"%.6f"(NR%n?",":"\n"),$0}' n=$(awk 'END{print NR^.5}' /tmp/a) -|paste -d, <(cut -d, -f1 /tmp/a|sort -u) -|cat <(cut -d, -f1 /tmp/a|sort -u|paste -sd, -|sed s/^/,/) ->/tmp/b
$ cat /tmp/b
,Finnish,Karelian,Mansi,Mordovian,Nganasan,Selkup,Udmurt
Finnish,0.000000,0.001293,0.040219,0.005439,0.119054,0.060144,0.018703
Karelian,0.001293,0.000000,0.039951,0.005908,0.118603,0.059541,0.019523
Mansi,0.040219,0.039951,0.000000,0.038378,0.060292,0.022305,0.024065
Mordovian,0.005439,0.005908,0.038378,0.000000,0.117458,0.057945,0.017097
Nganasan,0.119054,0.118603,0.060292,0.117458,0.000000,0.050439,0.091153
Selkup,0.060144,0.059541,0.022305,0.057945,0.050439,0.000000,0.040982
Udmurt,0.018703,0.019523,0.024065,0.017097,0.091153,0.040982,0.000000
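A somewhat more readable take on the same pair-to-table step, as a sketch rather than a drop-in replacement: it assumes the `fst` CSV layout shown above (row id, pop1, pop2, est, ...) and fills the diagonal, and any pair missing from the input, with zero. The `pairs_to_matrix` name is made up for illustration, and the demo values are placeholders, not real FST estimates.

```shell
# pairs_to_matrix FILE: turn "id,pop1,pop2,est,..." rows into a square CSV matrix.
pairs_to_matrix() {
  # The population names (columns 2 and 3) are sorted, de-duplicated and
  # handed to awk as one comma-separated string.
  awk -F, -v names="$(awk -F, 'NR>1{print $2; print $3}' "$1" | sort -u | paste -sd, -)" '
    NR > 1 { d[$2 "," $3] = $4; d[$3 "," $2] = $4 }   # store both directions
    END {
      n = split(names, p, ",")
      print "," names                                  # header row
      for (i = 1; i <= n; i++) {
        row = p[i]
        for (j = 1; j <= n; j++)
          row = row "," sprintf("%.6f", (i == j) ? 0 : d[p[i] "," p[j]])
        print row
      }
    }' "$1"
}

# Demo on a three-population toy file.
demo=$(mktemp)
printf '%s\n' ',pop1,pop2,est,se,z,p' '1,B,A,0.5,0,0,0' '2,C,A,0.25,0,0,0' '3,B,C,0.125,0,0,0' > "$demo"
pairs_to_matrix "$demo"
rm -f "$demo"
```

On the real file this would be run as `pairs_to_matrix fst > /tmp/b`.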

And this creates a heatmap of the table:

Code:

R -e 'install.packages(c("pheatmap","colorspace"),repos="https://cloud.r-project.org")'
R -e 'library(pheatmap)
library(colorspace)

# read the square FST table and blank the zero diagonal
t<-read.csv("/tmp/b",header=T,row.names=1,check.names=F)
t[t==0]=NA

pheatmap(
  1e4*t, # FST x 10000, so cells show whole numbers
  filename="/tmp/a.png",
  legend=F,
  # cluster on the FST table itself, used as a distance matrix
  clustering_callback=function(...){hclust(as.dist(t))},
  cellwidth=18,
  cellheight=12,
  border_color=NA,
  display_numbers=T,
  number_format="%.0f",
  number_color="black",
  fontsize_number=6,
  color=colorRampPalette(hex(HSV(c(210,180,150,120,90,60,30,0),.5,1)))(256)
)'

At first I got an error that there were too many missing blocks, so I tried adding a `maxmiss=Inf` parameter:

Code:

R -e 'library("admixtools");extract_f2(pref="g/v44.3_HO_public/v44.3_HO_public",pops=c("Finnish","Mansi","Mari.SG","Estonian.DG"),outdir="f2",maxmiss=Inf);f2=f2_from_precomp("f2");fst=fst(f2);write.csv(fst,"fst",quote=F)'

However it gave me nonsensical results where the distance between Finns and Maris was an order of magnitude bigger than the distance between Finns and Mansi:

Code:

,pop1,pop2,est,se,z,p
1,Estonian.DG,Finnish,0.000904946578981571,0.000400460532337746,2.25976471064156,0.0238358576870844
2,Estonian.DG,Mansi,0.015818211648642,0.00055465784565136,28.5188639675077,6.83699132061968e-179
3,Estonian.DG,Mari.SG,0.174033745411937,0.00101794728786578,170.96538051279,0
4,Finnish,Mansi,0.0139136259691746,0.000307938735432869,45.1830977016134,0
5,Finnish,Mari.SG,0.17331351350787,0.000760056331041605,228.027195392683,0
6,Mansi,Mari.SG,0.17490109671494,0.000788482478735399,221.819890018931,0

**Kaspias** · 03-07-2021, 12:47 PM

Thanks again. I did have Iranians in pright, but that somehow reduced the p-value, so I removed them.

I think these 3 models are really crucial for answering "what was the Oghuz genome like?" We had been using DA89 for a long time, but I recently started to question the accuracy of our method (I think DA89 is 3/4 Gokturk and 1/4 Sogdian, which makes it a false candidate for Oghuz). After reading up on the historical side (the most helpful source was İlk Oğuzlar by Osman Karatay), I came to the conclusion that Oghuz should fall between Kipchak and Kimak. Neither of the Kipchak samples we have is representative, so I went with the Kimak-like option. Apparently I was right, because this is the only passing model when using Medieval samples. In addition, the region where these samples were collected (Kayseri) was home to Cappadocian Greeks, which is what I used for their native admixture, so the result is pretty solid, and one can make guesses about Western Anatolia (10-40% Kimak?) too.

**Token** · 03-07-2021, 12:54 PM

Originally Posted by Kaspias

Code:

# right

  "Mbuti.DG",
  "Han.DG",
  "Saami.DG",
  "Icelandic.DG",
  "Sardinian.DG",
  "Punjabi.DG",
  "Eskimo_Chaplin.DG",
  "BedouinB.DG",
  "Basque.DG"

This is a very weak pright list.