Bulgarian + Uzbek,
Code:
  f4rank  dof  chisq  p          dofdiff  chisqdiff  p_nested
1 1       14   11.1   6.75e-1    16       1483.      2.95e-306
2 0       30   1494.  8.98e-296  NA       NA         NA
Bulgarian + Turkmen,
Code:
  f4rank  dof  chisq  p          dofdiff  chisqdiff  p_nested
1 1       14   12.7   5.50e-1    16       1029.      5.62e-209
2 0       30   1042.  6.54e-200  NA       NA         NA
Right pops:
Code:
"Papuan.DG", "Eskimo_Sireniki.DG", "Jordanian.DG", "Punjabi.DG", "Yakut.DG", "Polish.DG", "Yoruba.DG", "Sardinian.DG", "Finnish.DG", "Armenian.DG", "Greek_1.DG", "Tatar_Volga.SG", "Iranian.DG", "Estonian.DG", "Altaian.DG", "Uzbek.SG"("Turkmen.SG")





Researchers use ancients for right pops. I have had good luck with this set, and these samples don't have too many missing genotypes. If you are missing some of them, you can substitute something similar.
right = c('Khomani_San', 'Devils-Gate-N', 'Bichon', 'Morocco_Iberomaurusian',
          'Anatolia_N', 'Kotias', 'Karelia', 'Yana-UP', 'Iran_N', 'Kolyma-Mesol')
My Devils-Gate, Yana and Kolyma are WGS, but you can use diploids if you have them.
Your p-values should improve a lot.
Also, not everyone can be modeled successfully with just 2 sources. For example, many Kurds can be modeled with just 2 sources, but Armenians and Iranians appear to have more complex histories and I usually need at least 3 sources for them. Not sure about your situation.





I used the exact same populations you recommended, except for Turkmen, which I replaced with MA2196, and this is what I get:
Code:
  target   left                 weight  se      z
1 Kaspias  Bulgarian.DG         0.712   0.0923  7.71
2 Kaspias  Turkey_Ottoman_2.SG  0.288   0.0923  3.12

  f4rank  dof  chisq  p         dofdiff  chisqdiff  p_nested
1 1       8    6.59   5.82e-1   10       87.0       2.10e-14
2 0       18   93.6   3.26e-12  NA       NA         NA

  pat  wt  dof  chisq  p            f4rank  Bulgarian.DG  Turkey_Ottoman_2.SG  feasible  best  dofdiff  chisqdiff  p_nested
1 00   0   8    6.59   0.582        1       0.712         0.288                TRUE      NA    NA       NA         NA
2 01   1   9    24.4   0.00366      0       1             NA                   TRUE      TRUE  0        -23.9      1
3 10   1   9    48.4   0.000000218  0       NA            1                    TRUE      TRUE  NA       NA         NA
If I understand it correctly, the model still does not pass. What's the reason? I mean, I could add a 3rd population here (Greek or Crimean Tatar, both plausible for me), but Greek will cause overfitting with Bulgarian, whereas there is no Crimean Tatar in the spreadsheet.
In addition, the SNP coverage dropped sharply once I moved away from the Simons dataset:
! 29131 SNPs remain after filtering. 27980 are polymorphic.





Tuscan is too Northern for the base Balkan admixture of Thrace; I need something in between Apulia and Islander Greek instead.
I got almost no additional Slav:
Code:
  target   left         weight  se      z
1 Kaspias  Tuscan_1.DG  0.721   0.180   4.01
2 Kaspias  Polish.DG    0.0334  0.172   0.194
3 Kaspias  Turkmen.SG   0.246   0.0398  6.18
Besides, I ran these:
Code:
  target        left                      weight  se      z
1 Bulgarian.DG  Hungary_Avar_5            0.391   0.349   1.12
2 Bulgarian.DG  Bulgaria_IA               0.457   0.281   1.63
3 Bulgarian.DG  Russia_Medieval_Nomad.SG  0.152   0.0781  1.95

  f4rank  dof  chisq  p         dofdiff  chisqdiff  p_nested
1 2       7    9.69   2.07e-1   9        24.5       3.58e-3
2 1       16   34.2   5.13e-3   11       319.       8.11e-62
3 0       27   353.   1.47e-58  NA       NA         NA
Code:
  target  left                      weight  se      z
1 Gagauz  Hungary_Avar_5            0.421   0.142   2.96
2 Gagauz  Bulgaria_IA               0.429   0.118   3.64
3 Gagauz  Russia_Medieval_Nomad.SG  0.151   0.0394  3.83

  f4rank  dof  chisq  p         dofdiff  chisqdiff  p_nested
1 2       7    3.61   8.23e-1   9        25.2       2.74e-3
2 1       16   28.8   2.51e-2   11       316.       4.07e-61
3 0       27   345.   8.25e-57  NA       NA         NA
Code:
  target    left                      weight  se      z
1 Romanian  Hungary_Avar_5            0.506   0.183   2.77
2 Romanian  Bulgaria_IA               0.368   0.148   2.48
3 Romanian  Russia_Medieval_Nomad.SG  0.126   0.0471  2.68

  f4rank  dof  chisq  p         dofdiff  chisqdiff  p_nested
1 2       7    6.68   4.63e-1   9        24.8       3.17e-3
2 1       16   31.5   1.16e-2   11       316.       3.78e-61
3 0       27   347.   2.22e-57  NA       NA         NA
The same model on me:
Code:
  target   left                      weight  se      z
1 Kaspias  Hungary_Avar_5            0.0528  0.234   0.225
2 Kaspias  Bulgaria_IA               0.607   0.191   3.18
3 Kaspias  Russia_Medieval_Nomad.SG  0.341   0.0677  5.03

  f4rank  dof  chisq  p         dofdiff  chisqdiff  p_nested
1 2       7    4.68   6.98e-1   9        28.3       8.44e-4
2 1       16   33.0   7.39e-3   11       298.       1.94e-57
3 0       27   331.   3.88e-54  NA       NA         NA





You can increase the 29K SNPs a lot by using the 1240K SNP Reich set.
Let's first figure out which populations you are genetically closest to by running F2s. This will also tell us whether your personal data somehow got corrupted. To keep your SNP count up, don't use ancients like I did.
When I run F2s for Bulgarians using 200K SNPs I get the following, but I'm leaving out a lot of pops that are more relevant to Bulgarians, such as Hungarians, Greeks, etc., which you should use. In fact, you can use all of the 30 or so Simons pops in your dataset.
POP1       POP2                F2     SE      Z
Bulgarian  Sardinian           0.246  0.0010  258
Bulgarian  Estonian            0.247  0.0008  313
Bulgarian  Armenian            0.249  0.0008  296
Bulgarian  Georgian            0.249  0.0007  358
Bulgarian  Turkish-Kayseri     0.249  0.0007  371
Bulgarian  Tatar-Volga         0.25   0.0008  328
Bulgarian  Saami               0.25   0.0007  343
Bulgarian  Iran-Hasanlu-IA     0.251  0.0011  239
Bulgarian  Iranians-Fars       0.252  0.0015  168
Bulgarian  Karelia-EHG         0.252  0.0012  212
Bulgarian  Kotias-CHG          0.252  0.0009  291
Bulgarian  Kalash              0.252  0.0008  304
Bulgarian  Bashkir             0.252  0.0007  371
Bulgarian  Pathan              0.253  0.0009  268
Bulgarian  Jordanian           0.253  0.0009  288
Bulgarian  Villabruna-UP-WHG   0.254  0.0010  256
Bulgarian  Turkmen             0.254  0.0009  291
Bulgarian  Balochi             0.254  0.0008  301
Bulgarian  Brahui              0.254  0.0007  363
Bulgarian  MA1-ANE             0.257  0.0009  274
Bulgarian  Punjabi             0.257  0.0009  296
Bulgarian  Yana-UP-WGS         0.258  0.0008  336
Bulgarian  Devils-Gate-N-WGS   0.259  0.0008  316
Bulgarian  Kolyma-Mesol-WGS    0.261  0.0011  240
Bulgarian  Saharawi            0.261  0.0010  261
Bulgarian  Eskimo-Sireniki     0.261  0.0008  324
Bulgarian  Eskimo-Chaplin      0.262  0.0011  237
Bulgarian  China-Tianyuan-UP   0.267  0.0012  215
Bulgarian  UstIshim-UP         0.269  0.0010  263
Bulgarian  Khomani-San         0.313  0.0013  245
Running F2s is simple. Do this:
## Increase number of lines R prints
options(max.print = 100000)
extract_f2(pref, f2dir, pops = c(..........
f2_blocks = f2_from_precomp('............
##View(f2(f2_blocks))
print(f2(f2_blocks), n = 2000)
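Once the f2 table is printed, sorting it by the estimate makes the closest populations easy to spot. Here is a minimal base-R sketch of that step. It assumes the result behaves like a data frame with columns pop1, pop2, est and se (those column names are an assumption on my part), and the four rows are copied from the Bulgarian table above rather than computed.

```r
# Sketch: rank populations by f2 distance to a focal population.
# Assumption: the f2 result has columns pop1, pop2, est, se.
# The rows below are copied from the Bulgarian table in this post.
f2_tab <- data.frame(
  pop1 = "Bulgarian",
  pop2 = c("Khomani-San", "Sardinian", "Turkmen", "Estonian"),
  est  = c(0.313, 0.246, 0.254, 0.247),
  se   = c(0.0013, 0.0010, 0.0009, 0.0008),
  stringsAsFactors = FALSE
)

# Smaller f2 means less drift between the pair, so sort ascending.
closest <- f2_tab[order(f2_tab$est), ]
print(closest, row.names = FALSE)
```

The top rows of `closest` are the candidates worth trying first as sources.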





It looks like you're getting much closer. Your p-value is now passing at 6.98e-1, which is simply 0.698!
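For reference, the p column looks like the upper-tail chi-square probability for the chisq and dof in the same row. Here is a quick base-R check, assuming that is how it is computed. The helper name `tail_p` is made up for this sketch, and the printed chisq values are rounded, so the last digit can differ slightly from the posted output.

```r
# Upper-tail chi-square probability for a qpAdm-style fit statistic.
# Assumption: the reported p is P(X > chisq) with X ~ chi-square(dof).
tail_p <- function(chisq, dof) {
  pchisq(chisq, df = dof, lower.tail = FALSE)
}

# chisq and dof copied from two rank rows posted in this thread
p_three_source <- tail_p(4.68, 7)  # posted as 6.98e-1
p_two_source   <- tail_p(6.59, 8)  # posted as 5.82e-1

print(round(c(p_three_source, p_two_source), 3))
```

A p above the commonly used 0.05 cutoff means the model is not rejected, which is what "passing" refers to here.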
Your standard errors are not good though, especially for Avar:
1 Kaspias Hungary_Avar_5 0.0528 0.234 0.225
That row is saying 5.28% Avar +/- 23.4%.
All this means is that your pright are not sufficient to distinguish the genetic difference between Hungary-Avar and Bulgaria-IA. Add a pright that you think is genetically much closer to Avar than to Bulgaria-IA, or vice versa.
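To see why that Avar row is uninformative, you can recompute z as weight / se and put a rough interval around each weight. This is a base-R sketch using the three rows posted for the Kaspias model. The 1.96 multiplier assumes an approximately normal estimate, which is my assumption and not something the output states.

```r
# Weights and standard errors copied from the three-source Kaspias output.
w <- data.frame(
  left   = c("Hungary_Avar_5", "Bulgaria_IA", "Russia_Medieval_Nomad.SG"),
  weight = c(0.0528, 0.607, 0.341),
  se     = c(0.234, 0.191, 0.0677),
  stringsAsFactors = FALSE
)

# z is the weight divided by its standard error (a test of weight = 0).
w$z <- w$weight / w$se

# Rough 95% interval, assuming a normal approximation (assumption).
w$lower <- w$weight - 1.96 * w$se
w$upper <- w$weight + 1.96 * w$se

print(w, digits = 3, row.names = FALSE)
```

The Avar interval spans zero, while the nomad interval stays clearly above it, which matches the reading that only the Avar versus Bulgaria-IA split is poorly resolved.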






@Kaspias
I am glad that my post helped. Nice to see that you too have managed to run it!
@Zoro
Very helpful advice all around. Thanks again.
---
So I made a few more runs (maxmiss=0 and ~93k SNPs) using the 1240K dataset and the following populations. I picked Tepecik for Neolithic Anatolia. Open to suggestions!
Code:
right = c('Russia_DevilsCave_N.SG', 'Switzerland_Bichon.SG', 'Morocco_Iberomaurusian', 'Turkey_TepecikCiftlik_N.SG', 'Georgia_Kotias.SG', 'Russia_HG_Karelia', 'Russia_Yana_UP.SG', 'Iran_GanjDareh_N', 'Russia_Kolyma_M.SG')
left = c("Bulgarian.DG", "Adygei.DG", "Turkmen.SG", 'Georgian.DG', 'Greek_1.DG')
This seems to be the best result; the standard errors could go lower, I guess. The p-values seem OK.
About the z values corresponding to the weight estimates: what is being tested here? weight_i = 0? It seems like it.
Also, why do we want to fail to reject the model hypothesis? I can't seem to find a layman's interpretation (no surprise).
Run 1: (Greek and Bulgarian did not go well together, and using Greek instead of Bulgarian yielded better results. Georgian seems to be a non-factor here: not significantly different from 0. But I would expect to have around 10%. Adygei, on the other hand, has a high SE here, possibly due to its rather close proximity to Georgian.)
Code:
=======================================
  target  left         weight  se     z
---------------------------------------
1 me      Adygei.DG    0.436   0.267  1.636
2 me      Turkmen.SG   0.051   0.049  1.055
3 me      Georgian.DG  0.053   0.2    0.263
4 me      Greek_1.DG   0.46    0.127  3.617
---------------------------------------
the p value = 0.56
====================================================
  f4rank  dof  chisq    p     dofdiff  chisqdiff  p_nested
----------------------------------------------------
1 3       5    3.924    0.56  7        36.838     0
2 2       12   40.761   0     9        101.281    0
3 1       21   142.042  0     11       732.627    0
4 0       32   874.67   0     NA       NA         NA
----------------------------------------------------
Another run, without Georgian. (Adygei SE is now 0.15.)
Code:
======================================
  target  left        weight  se     z
--------------------------------------
1 me      Adygei.DG   0.489   0.15   3.268
2 me      Turkmen.SG  0.046   0.045  1.006
3 me      Greek_1.DG  0.466   0.13   3.584
--------------------------------------
=====================================================
  f4rank  dof  chisq    p      dofdiff  chisqdiff  p_nested
-----------------------------------------------------
1 2       6    3.965    0.681  8        68.9       0
2 1       14   72.865   0      10       653.942    0
3 0       24   726.807  0      NA       NA         NA
-----------------------------------------------------
bonus 1: me vs the populations I used (f2 statistics). If I am interpreting these correctly, it says I am closer to Bulgarians than to the Adygei (albeit not by a significant margin). On G25 I get the opposite all the time, with a clear margin.
Code:
====================================================================
   pop1  pop2                        est      se       z          p
--------------------------------------------------------------------
1  me    Bulgarian.DG                9e-04    0.0011   0.82034    0.41202
2  me    Adygei.DG                   0.00134  0.00116  1.14901    0.25055
3  me    Georgian.DG                 0.00232  0.00111  2.09104    0.03652
4  me    Greek_1.DG                  0.00301  0.00148  2.03656    0.04169
5  me    Iran_GanjDareh_N            0.0566   0.00121  46.81045   0
6  me    Turkmen.SG                  0.06436  0.00126  51.18922   0
7  me    Russia_Yana_UP.SG           0.08537  0.00142  60.25923   0
8  me    Morocco_Iberomaurusian      0.09281  0.0014   66.0843    0
9  me    Russia_DevilsCave_N.SG      0.10113  0.00151  66.87772   0
10 me    Turkey_TepecikCiftlik_N.SG  0.13288  0.00139  95.47373   0
11 me    Russia_HG_Karelia           0.16392  0.00155  106.00861  0
12 me    Georgia_Kotias.SG           0.17455  0.00155  112.48209  0
13 me    Switzerland_Bichon.SG       0.17911  0.00175  102.16275  0
14 me    Russia_Kolyma_M.SG          0.19436  0.00177  109.87032  0
--------------------------------------------------------------------
bonus 2: a graph. Perhaps this can give an idea as to whether the chosen populations are satisfactory or not. If the graphs produce nonsense, one can try different populations. This particular one is possibly nonsense, since I only used some moderns and some pretty ancient populations.
[graph image]
50% Turkish_Deliorman + 50% Adygei @ 4,879





Congrats, looking good.
The best way to figure out which samples have the fewest missing SNPs, so that you can use them in your run, is this plink command:
....../plink --bfile Master --missing
This will output a file called plink.imiss that lists the number of missing SNPs for every sample. That way you can use only your best samples.
For example, here's a portion of my plink.imiss file sorted by missingness:
FID                           IID           N_MISS   N_GENO   F_MISS
Anatolia_N                    Bar8          3563703  4668444  0.76
Anatolia_N                    Bar31         3646992  4676043  0.78
Anatolia_N                    I0707         3712565  4668444  0.80
Anatolia_N                    I0746         3726082  4676043  0.80
Anatolia_N                    I0745         3732807  4676043  0.80
Anatolia_N                    I0709         3741514  4676043  0.80
Anatolia_N                    I0708         3748125  4676043  0.80
Anatolia_N                    I1583_publ    3749842  4676043  0.80
Anatolia_N                    I1580_publ    3798544  4668444  0.81
Anatolia_N                    I0744         3837576  4676043  0.82
Anatolia_N                    I1581_publ    3874657  4668444  0.83
Anatolia_N                    I1585_publ    3875085  4668444  0.83
Anatolia_N                    I1579_publ    3880059  4668444  0.83
Anatolia_N                    I0736         3898059  4668444  0.84
Anatolia_N                    I1098         3914326  4668444  0.84
Anatolia_N                    ZHAG          3921509  4668444  0.84
Anatolia_N                    I1096         3957126  4676043  0.85
Anatolia_N                    I1097         3959996  4676043  0.85
Anatolia_N                    I1101         4047243  4676043  0.87
Anatolia_N                    I1103         4099354  4676043  0.88
Anatolia_Ottoman_1.SG         MA2195_final  4109325  4668444  0.88
Anatolia_TepecikCiftlik_N.SG  Tep003        4141647  4676043  0.89
You'll notice the best ENF sample is Bar8, with missingness of only 0.76, followed by Bar31, etc. You'll also notice that the ENF you used is one of the worst for missing SNPs, at a missingness of 0.89.
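If you'd rather do the sorting in R than by eye, something like the following base-R sketch works. It reads a few of the rows above from an inline string so it runs on its own. For a real file you would point `read.table` at plink.imiss instead, which I'm assuming has a header row with these column names.

```r
# Sketch: rank samples by fraction of missing genotypes.
# Rows are copied from the listing above. For a real run, replace the
# text= argument with the path to your plink.imiss (assumed to carry a
# header row with N_MISS and N_GENO columns).
imiss_text <- "FID IID N_MISS N_GENO F_MISS
Anatolia_TepecikCiftlik_N.SG Tep003 4141647 4676043 0.89
Anatolia_N I1585_publ 3875085 4668444 0.83
Anatolia_N Bar8 3563703 4668444 0.76
Anatolia_N Bar31 3646992 4676043 0.78"

imiss <- read.table(text = imiss_text, header = TRUE,
                    stringsAsFactors = FALSE)

# Recompute the fraction so the ranking does not depend on rounding.
imiss$frac <- imiss$N_MISS / imiss$N_GENO

ranked <- imiss[order(imiss$frac), c("FID", "IID", "frac")]
print(ranked, row.names = FALSE)
```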
Next, I go to my Eigenstrat .ind file and add _low to the population label of the samples with high missingness that I don't want Admixtools to use.
For example:
Anatolia_N:Bar8 F Anatolia_N
Anatolia_N:Bar31 M Anatolia_N
Anatolia_N:I0707 F Anatolia_N
Anatolia_N:I0708 M Anatolia_N
Anatolia_N:I0709 M Anatolia_N
Anatolia_N:I0736 F Anatolia_N_low
Anatolia_N:I0744 M Anatolia_N
Anatolia_N:I0745 M Anatolia_N
Anatolia_N:I0746 M Anatolia_N
Anatolia_N:I1096 M Anatolia_N_low
Anatolia_N:I1097 M Anatolia_N_low
Anatolia_N:I1098 F Anatolia_N_low
Anatolia_N:I1101 M Anatolia_N_low
Anatolia_N:I1103 M Anatolia_N_low
Anatolia_N:I1579_publ F Anatolia_N
Anatolia_N:I1580_publ F Anatolia_N
Anatolia_N:I1581_publ F Anatolia_N
Anatolia_N:I1583_publ M Anatolia_N
Anatolia_N:I1585_publ F Anatolia_N
Anatolia_N:ZHAG F Anatolia_N_low
Now when I add "Anatolia_N" to extract or pright, only the ENF samples with low missingness are used and the rest are ignored.
You may ask why I don't use only the best 2 ENF samples instead of the best 8. The answer is that the more samples you have, the more accurate the allele frequencies for the population become. So it's a tradeoff between dropping worse samples and improving allele frequencies.
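The relabelling can also be scripted instead of edited by hand. Below is a rough base-R sketch. The helper `tag_low` and its `low_ids` argument are names I made up for illustration, and the three lines are taken from the .ind excerpt above with their original labels. A real .ind file may have different spacing, so treat this as a starting point only.

```r
# Sketch: append "_low" to the population label of unwanted samples.
# Three lines from the .ind excerpt above, with original labels.
ind_lines <- c(
  "Anatolia_N:Bar8 F Anatolia_N",
  "Anatolia_N:I0736 F Anatolia_N",
  "Anatolia_N:I1103 M Anatolia_N"
)

# colClasses keeps the sex column as text (a bare "F" would otherwise
# risk being read as logical FALSE).
ind <- read.table(text = ind_lines, col.names = c("id", "sex", "pop"),
                  colClasses = "character")

# Hypothetical helper: tag the listed sample IDs, skipping any that
# already carry the suffix.
tag_low <- function(ind, low_ids, suffix = "_low") {
  flag <- ind$id %in% low_ids & !endsWith(ind$pop, suffix)
  ind$pop[flag] <- paste0(ind$pop[flag], suffix)
  ind
}

tagged <- tag_low(ind, low_ids = c("Anatolia_N:I0736", "Anatolia_N:I1103"))
writeLines(paste(tagged$id, tagged$sex, tagged$pop))
```

Write the result to a new file and not over the original .ind, so the untouched labels stay available if you change your mind about which samples to drop.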