check if it works now:
Attachment 106247
Printable View
check if it works now:
Attachment 106247
I have tried with regular MyHeritage file this time, converted it to Ancestry.
The format in .ped file is as such:Code:@----------------------------------------------------------@
| PLINK! | v1.07 | 10/Aug/2009 |
|----------------------------------------------------------|
| (C) 2009 Shaun Purcell, GNU General Public License, v2 |
|----------------------------------------------------------|
| For documentation, citation & bug-report instructions: |
| http://pngu.mgh.harvard.edu/purcell/plink/ |
@----------------------------------------------------------@
Web-based version check ( --noweb to skip )
Connecting to web... failed connection
Problem connecting to web
Writing this text to log file [ data_new.log ]
Analysis started: Sat Feb 27 00:52:58 2021
Options in effect:
--file data
--make-bed
--out data_new
1426149 (of 1426149) markers to be included from [ data.map ]
ERROR:
A problem with line 1 in [ data.ped ]
Expecting 6 + 2 * 1426149 = 2852304 columns, but found 2842162
So actually it is how it should be. There is a blank between the alleles.Code:name1 name2 0 0 0 1 T T 0 0 A A A A A A G G A A G G
that looks like a very old version of plink, maybe it doesn't support tab separated .ped
try with plink 1.9:
https://www.cog-genomics.org/plink/
try with this one...
https://filebin.net/c73ep7a8u1pfgtxw
the "space" version should work with the older version of plink, the other one with the newer.
none of the 23andme to plink scripts worked for me without errors. with my script I converted my file, and also some ancients which I had in the 23andme format successfuly.
Kaspias' file seems to have some empty fields and some fields which aren't ACGT0 so i had to adapt the script.
What do you mean?
I've converted hundreds of 23 files to plink using:
......./plink --23file sample.txt sample --out sample_plink where ..... is the path to plink executible. Highlighted sample is what the file will be named inside fam
Then you can merge your "sample" file with your "master" file using
.......//plink --bfile master --bmerge sample.bed sample.bim sample.fam --make-bed --out master_new
As far as converting some format to 23 format just post one line of your format (FTDNA, Living or whatever) and I'll make you a Unix script to convert it in 2 minutes
First run:
√ 1379990 SNPs read in total
! 6558 SNPs remain after filtering. 6468 are polymorphic.
Code:target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Turkmen 0.164 0.0991 1.65
2 Kaspias Bulgarian 0.836 0.0991 8.44
Many thanks to @Korialstrasz for the tutorial also to @vbnetkhio and @Zoro for giving a hand. I'd like to see some feedback on it, so I can try to improve.
6558 SNPs is too low to do an accurate comparison. Even though you weren't WGS genotyped I can still get you up to about 74,000 SNPs.
I assume you were able to create the Plink files from the Reich dataset based on the par file I wrote you. Can you post the plink .log file. It'll help me diagnose a few things.
Assuming you were able to get 1240K SNPs in your Plink data you can get about 70,000 overlapping SNPs if you only use your data and the Simmons samples in the Reich dataset. You'll recognize them because their IDs start with S_ such as "S_Armenian-1.DG"
If you only use your personal file and the Simmons ones starting in S_ when you extract in Admixtools 2 using:
extract_f2(pref, f2dir, pops = c(
then you should end up with about 70,000 SNPs
Congrats on making Plink files and using Admixtools. It's your gateway to much more meaningful analysis than merely using Vahaduo all the time !
6558 SNPs is too low to do an accurate comparison. Even though you weren't WGS genotyped I can still get you up to about 200,000 SNPs.
I assume you were able to create the Plink files from the Reich dataset based on the par file I wrote you. Can you post the plink .log file. It'll help me diagnose a few things.
Assuming you were able to get 1240K SNPs in your Plink data you can get about 200,000 overlapping SNPs if you only use your data and the Simmons samples in the Reich dataset. You'll recognize them because their IDs start with S_ such as "S_Armenian-1.DG"
If you only use your personal file and the Simmons ones starting in S_ when you extract in Admixtools 2 using:
extract_f2(pref, f2dir, pops = c(
then you should end up with about 200,000 SNPs
I use Admixtools 2 based on Eigenstrat geno snp and ind files.
Here are some 1240K pops I use alot because they don't drop your SNP counts
extract_f2(pref, f2dir, pops = c('Eskimo_Sireniki.DG',
'Punjabi',
'Turkmen','Pathan','Kalash','Bashkir','Kotias',
'Tatar_Volga','Turkish','Iranian','Armenian',
'Saami','Georgian','Jordanian',
'Estonian','Bulgarian','Sardinian','Avar','Hazara' ,
'Khomani_San',
'Papuan',
'Chukchi','Han','Uyghur','Mansi',
'Mongola','Buryat','Yakut','Adygei','Burmese','Jew _Iraqi',
'Russia_Abkhasian',
'Karelia','Balochi','Brahui'), maxmiss=0,verbose=TRUE)
.......
Log:
For merging:Code:.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (960586 variants, 1 person).
--file: Kaspias_n-temporary.bed + Kaspias_n-temporary.bim +
Kaspias_n-temporary.fam written.
960586 variants loaded from .bim file.
1 person (0 males, 0 females, 1 ambiguous) loaded from .fam.
Ambiguous sex ID written to Kaspias_n.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 0 founders and 1 nonfounder present.
Calculating allele frequencies... done.
Total genotyping rate is 0.974373.
960586 variants and 1 person pass filters and QC.
Note: No phenotypes present.
--make-bed to Kaspias_n.bed + Kaspias_n.bim + Kaspias_n.fam ... done.
Code:54 people loaded from master_plink.fam.
1 person to be merged from Kaspias_n.fam.
Of these, 1 is new, while 0 are present in the base dataset.
597573 markers loaded from master_plink.bim.
960586 markers to be merged from Kaspias_n.bim.
Of these, 782417 are new, while 178169 are present in the base dataset.
Warning: Variants 'rs144847714' and 'rs10492943' have the same position.
Warning: Variants 'rs3205229' and 'rs2229002' have the same position.
Warning: Variants 'rs769902' and 'rs201435286' have the same position.
748 more same-position warnings: see log file.
Performing single-pass merge (55 people, 1379990 variants).
Merged fileset written to merged_data-merge.bed + merged_data-merge.bim +
merged_data-merge.fam .
1379990 variants loaded from .bim file.
55 people (0 males, 0 females, 55 ambiguous) loaded from .fam.
Ambiguous sex IDs written to merged_data.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 54 founders and 1 nonfounder present.
Calculating allele frequencies... done.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate is 0.432004.
1379990 variants and 55 people pass filters and QC.
Note: No phenotypes present.
--make-bed to merged_data.bed + merged_data.bim + merged_data.fam ... done.
Here it is...
√ 1379990 SNPs read in total
! 173175 SNPs remain after filtering. 170906 are polymorphic.
Code:target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Bulgarian.DG 0.776 0.0339 22.9
2 Kaspias Turkmen.SG 0.224 0.0339 6.60
Code:target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Bulgarian.DG 0.798 0.0299 26.7
2 Kaspias Uzbek.SG 0.202 0.0299 6.76
Ok I see that you have 178k SNPs overlapping with master dataset. You can increase that quite a bit by using the Reich 1240K instead of the 593k set you are using. It’s available at Reich Lab.
You can easily convert it from Eigenstrat to plink using the par file I gave you
Then in Plink you can merge your personal data and any other data with it. You can do IBS analysis in plink
Then you can convert your new plink master back to Eigenstrat. Let me know when you’re ready I’ll give you a different par file to do that
Then you can do Admixtools with you included using Eigenstrat
Bulgarian + Uzbek,
Bulgarian + Turkmen,Code:f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 14 11.1 6.75e- 1 16 1483. 2.95e-306
2 0 30 1494. 8.98e-296 NA NA NA
Right pops:Code:f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 14 12.7 5.50e- 1 16 1029. 5.62e-209
2 0 30 1042. 6.54e-200 NA NA NA
Code:"Papuan.DG",
"Eskimo_Sireniki.DG",
"Jordanian.DG",
"Punjabi.DG",
"Yakut.DG",
"Polish.DG",
"Yoruba.DG",
"Sardinian.DG",
"Finnish.DG",
"Armenian.DG",
"Greek_1.DG",
"Tatar_Volga.SG",
"Iranian.DG",
"Estonian.DG",
"Altaian.DG",
"Uzbek.SG"("Turkmen.SG")
Researchers use ancients for right pops. I have had good luck with this set and they don't have too many missing genotypes. if you are missing some of these samples you can substitute something similar
right= c('Khomani_San','Devils-Gate-N','Bichon','Morocco_Iberomaurusian',
'Anatolia_N','Kotias','Karelia', 'Yana-UP', "Iran_N', 'Kolyma-Mesol')
My Devils-Gate, Yana and Kolyma are WGS but you can use diploids if you have them.
Your p-values should improve alot.
Also not everyone can model successfully with just 2 sources. For example many Kurds can model with just 2 sources but Armenians or Iranians appear to have more complex histories and I usually need at least 3 sources for them. Not sure about your situation.
I used the exact same populations you recommend except for Turkmen which I replaced with MA2196, and that's what I get:
If I get it correctly the model still does not pass. What's the reason? I mean, I could add here a 3rd population - Greek or Crimean Tatar - that are potentials for me, but Greek will cause overfitting with Bulgarian whereas there is no Crimean Tatar in the spreadsheet.Code:target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Bulgarian.DG 0.712 0.0923 7.71
2 Kaspias Turkey_Ottoman_2.SG 0.288 0.0923 3.12
f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 8 6.59 5.82e- 1 10 87.0 2.10e-14
2 0 18 93.6 3.26e-12 NA NA NA
pat wt dof chisq p f4rank Bulgarian.DG Turkey_Ottoman_2.SG feasible best dofdiff chisqdiff p_nested
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <dbl> <dbl> <dbl>
1 00 0 8 6.59 0.582 1 0.712 0.288 TRUE NA NA NA NA
2 01 1 9 24.4 0.00366 0 1 NA TRUE TRUE 0 -23.9 1
3 10 1 9 48.4 0.000000218 0 NA 1 TRUE TRUE NA NA NA
>
In addition, the SNP coverage reduced crucially when leaving Simeon's dataset:
! 29131 SNPs remain after filtering. 27980 are polymorphic.
Tuscan is too Northern for the base Balkan admixture of Thrace, need something in between Apulia and Islander Greek instead.
Almost got no additional Slav:
Code:target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Tuscan_1.DG 0.721 0.180 4.01
2 Kaspias Polish.DG 0.0334 0.172 0.194
3 Kaspias Turkmen.SG 0.246 0.0398 6.18
Besides, I run these:
Code:target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Bulgarian.DG Hungary_Avar_5 0.391 0.349 1.12
2 Bulgarian.DG Bulgaria_IA 0.457 0.281 1.63
3 Bulgarian.DG Russia_Medieval_Nomad.SG 0.152 0.0781 1.95
f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 2 7 9.69 2.07e- 1 9 24.5 3.58e- 3
2 1 16 34.2 5.13e- 3 11 319. 8.11e-62
3 0 27 353. 1.47e-58 NA NA NA
Code:target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Gagauz Hungary_Avar_5 0.421 0.142 2.96
2 Gagauz Bulgaria_IA 0.429 0.118 3.64
3 Gagauz Russia_Medieval_Nomad.SG 0.151 0.0394 3.83
f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 2 7 3.61 8.23e- 1 9 25.2 2.74e- 3
2 1 16 28.8 2.51e- 2 11 316. 4.07e-61
3 0 27 345. 8.25e-57 NA NA NA
The same model on me:Code:target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Romanian Hungary_Avar_5 0.506 0.183 2.77
2 Romanian Bulgaria_IA 0.368 0.148 2.48
3 Romanian Russia_Medieval_Nomad.SG 0.126 0.0471 2.68
f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 2 7 6.68 4.63e- 1 9 24.8 3.17e- 3
2 1 16 31.5 1.16e- 2 11 316. 3.78e-61
3 0 27 347. 2.22e-57 NA NA NA
Code:target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Hungary_Avar_5 0.0528 0.234 0.225
2 Kaspias Bulgaria_IA 0.607 0.191 3.18
3 Kaspias Russia_Medieval_Nomad.SG 0.341 0.0677 5.03
f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 2 7 4.68 6.98e- 1 9 28.3 8.44e- 4
2 1 16 33.0 7.39e- 3 11 298. 1.94e-57
3 0 27 331. 3.88e-54 NA NA NA
You can increase the 29K SNPs alot by using the 1240K SNP Reich set.
Let's first figure out which populations you are genetically closest to by running F2s. This will also tell us if somehow your personal data got corrupted or not. Don't use ancients like I did to keep your SNPs up.
When I run F2s for Bulgarians using 200K SNPs I get the following but I'm not using alot of pops more relevant to Bulgarians such as Hungarians, Greeks etc which you should use. In fact you can use all the Simons 30 or so pops in your dataset
<style type="text/css">td {border: 1px solid #ccc;}br {mso-data-placement:same-cell;}</style>
POP1 POP2 F2 SE Z Bulgarian Sardinian 0.246 0.0010 258 Bulgarian Estonian 0.247 0.0008 313 Bulgarian Armenian 0.249 0.0008 296 Bulgarian Georgian 0.249 0.0007 358 Bulgarian Turkish-Kayseri 0.249 0.0007 371 Bulgarian Tatar-Volga 0.25 0.0008 328 Bulgarian Saami 0.25 0.0007 343 Bulgarian Iran-Hasanlu-IA 0.251 0.0011 239 Bulgarian Iranians-Fars 0.252 0.0015 168 Bulgarian Karelia-EHG 0.252 0.0012 212 Bulgarian Kotias-CHG 0.252 0.0009 291 Bulgarian Kalash 0.252 0.0008 304 Bulgarian Bashkir 0.252 0.0007 371 Bulgarian Pathan 0.253 0.0009 268 Bulgarian Jordanian 0.253 0.0009 288 Bulgarian Villabruna-UP-WHG 0.254 0.0010 256 Bulgarian Turkmen 0.254 0.0009 291 Bulgarian Balochi 0.254 0.0008 301 Bulgarian Brahui 0.254 0.0007 363 Bulgarian MA1-ANE 0.257 0.0009 274 Bulgarian Punjabi 0.257 0.0009 296 Bulgarian Yana-UP-WGS 0.258 0.0008 336 Bulgarian Devils-Gate-N-WGS 0.259 0.0008 316 Bulgarian Kolyma-Mesol-WGS 0.261 0.0011 240 Bulgarian Saharawi 0.261 0.0010 261 Bulgarian Eskimo-Sireniki 0.261 0.0008 324 Bulgarian Eskimo-Chaplin 0.262 0.0011 237 Bulgarian China-Tianyuan-UP 0.267 0.0012 215 Bulgarian UstIshim-UP 0.269 0.0010 263 Bulgarian Khomani-San 0.313 0.0013 245
</dbl></dbl></dbl></lgl></lgl></dbl></dbl></dbl></dbl></dbl></dbl></dbl></chr></dbl></dbl></int></dbl></dbl></int></int></dbl></dbl></dbl></chr></chr>
Running F2s is simple. Do this
## Increase number of lines R prints
options(max.print = 100000)
extract_f2(pref, f2dir, pops = c(..........
f2_blocks = f2_from_precomp('............
##View(f2(f2_blocks))
print(f2(f2_blocks), n = 2000)
It looks like you're getting much closer. Your p-value is now passing at 6.98e- 1 which is basically 0.698 !
Your standard errors are not good though especially for Avar 1 Kaspias Hungary_Avar_5 0.0528 0.234 0.225 because it's saying 5.28% Avar +/-23.4%
All this means is your pright are not sufficient to distinguish the genetic difference between Hungary-Avar and Bulgaria-IA. Add a pright that you think is much genetically closer to Avar than Bulgaria-IA OR visa versa
@Kaspias
I am glad that my post helped. Nice to see that you too have managed to run it!
@Zoro
Very helpful advices all around. Thanks again.
---
So I made a few more runs (maxmiss=0 and 93k~ snps ) using the 1240K dataset and the following populations. I picked Tepecik for Neolithic Anatolia. Open for suggestions!
This seems to be the best result, standard errors can go lower I guess. The p values seem OKCode:
right= c('Russia_DevilsCave_N.SG','Switzerland_Bichon.SG','Morocco_Iberomaurusian','Turkey_TepecikCiftlik_N.SG','Georgia_Kotias.SG','Russia_HG_Karelia', 'Russia_Yana_UP.SG', 'Iran_GanjDareh_N', 'Russia_Kolyma_M.SG')
left = c("Bulgarian.DG","Adygei.DG","Turkmen.SG",'Georgian.DG','Greek_1.DG')
About the z values corresponding to weight estimations: What is being tested here? weight i = 0 ? It seems like it.
Also, why do we want to fail to reject the model hypothesis? Can't seem to find a layman interpretation (no surprise).
Run 1: (Greek and Bulgarian did not go well together and Greek instead of Bulgarian yielded better results..Georgian seems to be a non-factor here: not significantly different than 0. But I would expect to have around 10%. Adygei on the other hand has a high se here, possibly due to its rather close proximity to Georgian.)
Code:=======================================
target left weight se z
---------------------------------------
1 me Adygei.DG 0.436 0.267 1.636
2 me Turkmen.SG 0.051 0.049 1.055
3 me Georgian.DG 0.053 0.2 0.263
4 me Greek_1.DG 0.46 0.127 3.617
---------------------------------------
the p value = 0.56
====================================================
f4rank dof chisq p dofdiff chisqdiff p_nested
----------------------------------------------------
1 3 5 3.924 0.56 7 36.838 0
2 2 12 40.761 0 9 101.281 0
3 1 21 142.042 0 11 732.627 0
4 0 32 874.67 0 NA NA NA
----------------------------------------------------
Another run, without Georgian. (Adygei SE is now 0.15)
Code:
======================================
target left weight se z
--------------------------------------
1 me Adygei.DG 0.489 0.15 3.268
2 me Turkmen.SG 0.046 0.045 1.006
3 me Greek_1.DG 0.466 0.13 3.584
--------------------------------------
=====================================================
f4rank dof chisq p dofdiff chisqdiff p_nested
-----------------------------------------------------
1 2 6 3.965 0.681 8 68.9 0
2 1 14 72.865 0 10 653.942 0
3 0 24 726.807 0 NA NA NA
-----------------------------------------------------
bonus 1: me vs the populations I used (f2 statistics). If I am interpreting these correctly it says I am closer to Bulgarians than the Adygei (albeit not by a significant margin). On g25 I get the opposite all the time, with a clear margin.
bonus 2: a graph. Perhaps this can give an idea as to whether the chosen populations are satisfactory or not. If the graphs produce nonsense, one can try different populations. This particular one I produced is possibly nonsense since I only used some moderns and pretty ancient populations.Code:====================================================================
pop1 pop2 est se z p
--------------------------------------------------------------------
1 me Bulgarian.DG 9e-04 0.0011 0.82034 0.41202
2 me Adygei.DG 0.00134 0.00116 1.14901 0.25055
3 me Georgian.DG 0.00232 0.00111 2.09104 0.03652
4 me Greek_1.DG 0.00301 0.00148 2.03656 0.04169
5 me Iran_GanjDareh_N 0.0566 0.00121 46.81045 0
6 me Turkmen.SG 0.06436 0.00126 51.18922 0
7 me Russia_Yana_UP.SG 0.08537 0.00142 60.25923 0
8 me Morocco_Iberomaurusian 0.09281 0.0014 66.0843 0
9 me Russia_DevilsCave_N.SG 0.10113 0.00151 66.87772 0
10 me Turkey_TepecikCiftlik_N.SG 0.13288 0.00139 95.47373 0
11 me Russia_HG_Karelia 0.16392 0.00155 106.00861 0
12 me Georgia_Kotias.SG 0.17455 0.00155 112.48209 0
13 me Switzerland_Bichon.SG 0.17911 0.00175 102.16275 0
14 me Russia_Kolyma_M.SG 0.19436 0.00177 109.87032 0
--------------------------------------------------------------------
https://i.ibb.co/CsYKkz3/graph1.png
Congrats, looking good.
The best way to figure out which samples have the least missing SNPs so that you can use them in your run is to do this plink command:
....../plink/bfile Master --missing . This will output a file called plink.imiss and will list the number of missing SNPs in every sample. This way you can only use your best samples.
For ex, here's a portion of my plink.imiss file sorted by missingness
<style type="text/css">td {border: 1px solid #ccc;}br {mso-data-placement:same-cell;}</style>
Anatolia_N Bar8 3563703 4668444 0.76 Anatolia_N Bar31 3646992 4676043 0.78 Anatolia_N I0707 3712565 4668444 0.80 Anatolia_N I0746 3726082 4676043 0.80 Anatolia_N I0745 3732807 4676043 0.80 Anatolia_N I0709 3741514 4676043 0.80 Anatolia_N I0708 3748125 4676043 0.80 Anatolia_N I1583_publ 3749842 4676043 0.80 Anatolia_N I1580_publ 3798544 4668444 0.81 Anatolia_N I0744 3837576 4676043 0.82 Anatolia_N I1581_publ 3874657 4668444 0.83 Anatolia_N I1585_publ 3875085 4668444 0.83 Anatolia_N I1579_publ 3880059 4668444 0.83 Anatolia_N I0736 3898059 4668444 0.84 Anatolia_N I1098 3914326 4668444 0.84 Anatolia_N ZHAG 3921509 4668444 0.84 Anatolia_N I1096 3957126 4676043 0.85 Anatolia_N I1097 3959996 4676043 0.85 Anatolia_N I1101 4047243 4676043 0.87 Anatolia_N I1103 4099354 4676043 0.88 Anatolia_Ottoman_1.SG MA2195_final 4109325 4668444 0.88 Anatolia_TepecikCiftlik_N.SG Tep003 4141647 4676043 0.89
You'll notice the best ENF samples are Bar8 with missingness of only 0.76 followed by Bar31 etc. You'll also notice that the ENF you used is one of the worst as far as missing SNPs at missingness of 0.89
Next what I do is go to my Eigenstrat .ind file and add _low to the samples with high missingness that I don't want Admixtools to use.
For ex
Anatolia_N:Bar8 F Anatolia_N
Anatolia_N:Bar31 M Anatolia_N
Anatolia_N:I0707 F Anatolia_N
Anatolia_N:I0708 M Anatolia_N
Anatolia_N:I0709 M Anatolia_N
Anatolia_N:I0736 F Anatolia_N_low
Anatolia_N:I0744 M Anatolia_N
Anatolia_N:I0745 M Anatolia_N
Anatolia_N:I0746 M Anatolia_N
Anatolia_N:I1096 M Anatolia_N_low
Anatolia_N:I1097 M Anatolia_N_low
Anatolia_N:I1098 F Anatolia_N_low
Anatolia_N:I1101 M Anatolia_N_low
Anatolia_N:I1103 M Anatolia_N_low
Anatolia_N:I1579_publ F Anatolia_N
Anatolia_N:I1580_publ F Anatolia_N
Anatolia_N:I1581_publ F Anatolia_N
Anatolia_N:I1583_publ M Anatolia_N
Anatolia_N:I1585_publ F Anatolia_N
Anatolia_N:ZHAG F Anatolia_N_low
Now when I add "Anatolia_N" to extract or pright only the ENF samples with low missingness are used and the rest are ignored.
You may ask why I don't only use the best 2 ENF samples instead of the best 8. The answer to that is the more samples the more accurate the allele frequencies for the population become. So its a tradeoff between ignoring worse samples and improving allele frequencies.
Z is directly proportional to p-value. The higher the p the lower the Z
Your passing models are generally >0.05. The higher ones such as >0.40 are a little more accurate
Agree with regards to SE Adygei due to Georgian. If you really need to use both you have to add another pright that's shares more genetic drift with one vs the other.
Your SE is still acceptable. Your p of 0.68 is very good.
How many SNPs were used in the f2 run?
So I did this for every population I use, exluded those with very high missing ratios. But some populations like Ganj_Dareh only has a few samples and decided to stick to SNPs in the trade-off you mentioned. (I think I only eliminated those with 0.85+ missing)
Removed Tepecik and went on with only Turkey_N. Since my right pops cannot distinguish Greek vs Bulgarian, I wanted to use Polish for slavic proxy. (Not sure if this is a good idea). After going with Turkey_N, Adygei never fell below 0.6. This tells me that I am missing something. I´ll read this next: https://www.biorxiv.org/content/10.1...664v1.full.pdf
Perhaps I should use earlier, let´s say medieval, pops as left references. As I am mostly interested in getting a rough estimation of my admixture.
Code:
=======================================
target left weight se z
---------------------------------------
1 my_geno Adygei.DG 0.624 0.152 4.103
2 my_geno Turkmen.SG 0.012 0.044 0.282
3 my_geno Greek_1.DG 0.069 0.147 0.47
4 my_geno Polish.DG 0.294 0.094 3.138
---------------------------------------
======================================================
f4rank dof chisq p dofdiff chisqdiff p_nested
------------------------------------------------------
1 3 5 0.57 0.989 7 64.579 0
2 2 12 65.149 0 9 164.988 0
3 1 21 230.138 0 11 893.415 0
4 0 32 1123.552 0 NA NA NA
------------------------------------------------------
Those f2 statistics came from the run with 93k~ snps.
New one with 100k~ SNPs: (you can see the populations I use, moderns left ancients right.)
bonus:Code:===============================================
pop1 pop2 est se
-----------------------------------------------
1 me Bulgarian.DG 0.00071 0.00108
2 me Russia_Abkhasian.DG 0.00094 0.00111
3 me Georgian.DG 0.00097 0.00111
4 me Armenian.DG 0.00109 0.00109
5 me Adygei.DG 0.00114 0.00116
6 me Turkish.DG 0.00143 0.00113
7 me Greek_1.DG 0.00157 0.00112
8 me Russia_NorthOssetian.DG 0.00182 0.00114
9 me Iranian.DG 0.00219 0.00112
10 me Polish.DG 0.0022 0.00146
11 me Turkmen.SG 0.00557 0.00136
12 me Turkey_N 0.00895 0.00101
13 me Iran_GanjDareh_N 0.0279 0.00127
14 me Ukraine_N 0.0285 0.00113
15 me Russia_Yana_UP.SG 0.03287 0.00161
16 me Russia_DevilsCave_N.SG 0.06759 0.00159
17 me Georgia_Kotias.SG 0.18128 0.00152
18 me Russia_HG_Karelia 0.18468 0.00151
19 me Switzerland_Bichon.SG 0.18638 0.00168
20 me Russia_Kolyma_M.SG 0.20019 0.00163
-----------------------------------------------
possibly historically not accurate, but just felt curious to see what I´d get.
Thank you a lot again, Zoro.Code:
==========================================
target left weight se z
------------------------------------------
1 Turkish.DG Greek_1.DG 0.504 0.09 5.619
2 Turkish.DG Turkmen.SG 0.113 0.037 3.066
3 Turkish.DG Iranian.DG 0.383 0.115 3.33
------------------------------------------
======================================================
f4rank dof chisq p dofdiff chisqdiff p_nested
------------------------------------------------------
1 2 6 10.737 0.097 8 98.977 0
2 1 14 109.714 0 10 911.517 0
3 0 24 1021.231 0 NA NA NA
------------------------------------------------------
I'm having hard time while interpreting these tbh:
F2s
Fst:Code:pop1 pop2 est se
<chr> <chr> <dbl> <dbl>
1 Kaspias Adygei.DG 0.361 0.00114
2 Kaspias Albanian.DG 0.359 0.00128
3 Kaspias Bashkir.SG 0.363 0.00114
4 Kaspias Bulgarian.DG 0.360 0.00114
5 Kaspias Cretan.DG 0.360 0.00118
6 Kaspias French.DG 0.360 0.00113
7 Kaspias Greek_1.DG 0.360 0.00121
8 Kaspias Greek_2.DG 0.359 0.00122
9 Kaspias Iranian.DG 0.362 0.00115
10 Kaspias Italian_North.DG 0.360 0.00129
11 Kaspias Jordanian.DG 0.365 0.00107
12 Kaspias Mansi.DG 0.365 0.00111
13 Kaspias Polish.DG 0.360 0.00121
14 Kaspias Russian.DG 0.361 0.00113
15 Kaspias Tatar_Tomsk.SG 0.366 0.00108
16 Kaspias Turkish.DG 0.362 0.00112
17 Kaspias Turkmen.SG 0.364 0.00113
18 Kaspias Tuscan_1.DG 0.360 0.00115
19 Kaspias Uzbek.SG 0.366 0.00109
But I should add, the overlapping SNP number is ~610K now when used the 1240K dataset.Code:pop1 pop2 est se
<chr> <chr> <dbl> <dbl>
1 Kaspias Adygei.DG 0.134 0.00284
2 Kaspias Albanian.DG 0.137 0.00395
3 Kaspias Bashkir.SG 0.137 0.00296
4 Kaspias Bulgarian.DG 0.130 0.00288
5 Kaspias Cretan.DG 0.131 0.00294
6 Kaspias French.DG 0.134 0.00272
7 Kaspias Greek_1.DG 0.131 0.00337
8 Kaspias Greek_2.DG 0.129 0.00364
9 Kaspias Iranian.DG 0.133 0.00293
10 Kaspias Italian_North.DG 0.132 0.00346
11 Kaspias Jordanian.DG 0.141 0.00278
12 Kaspias Mansi.DG 0.156 0.00281
13 Kaspias Polish.DG 0.135 0.00352
14 Kaspias Russian.DG 0.134 0.00292
15 Kaspias Tatar_Tomsk.SG 0.142 0.00293
16 Kaspias Turkish.DG 0.133 0.00275
17 Kaspias Turkmen.SG 0.135 0.00298
18 Kaspias Tuscan_1.DG 0.134 0.00295
19 Kaspias Uzbek.SG 0.140 0.00270
Code:target left weight se z
1 Kaspias Bulgarian.DG 0.75 0.041 18.11
2 Kaspias Uzbek.SG 0.25 0.041 6.05
f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 5 5.81 0.32551 7 735.49 1.5e-154
2 0 12 741.3 < 2e-16
I'm glad your getting the hang of it. Keep at it and you'll slowly get better !
Yes, it's best to use ancients to model yourself and others since they are the ones ancestral and not moderns allthough moderns can be used to see how one is shifted compared to neighboring moderns.
It's always good to get a few passing p-value models with decent SE and then pick the one you think is most accurate based on historical.
The model you show for the Kayseri Turks is close to what I had but I didn't use Iranians. I used Armenians, Bulgarians and Turkmen and got about 16% Turkmen with equal portions of Armenian and Bulgarian
I'm glad your getting the hang of it. Keep at it and you'll slowly get better ! Good to see that you were able to use the 1240K set.
I like your FST results better than F2s and I think you should use that more than F2s in the future for distances.
I see that Greek-2 and Bulgarians are closest to you but it looks like you split Greeks into individual samples whereas with Bulgarians it was the average of both Simons samples.
Can you split the Bulgarians and others into individual samples and repost for kicks
Your qpAdm model looks good ! but I'm interested to know how many SNPs in the extract and which pright you used
You're showing 25% Usbek -/+4% which is higher than what I got for the Iraqi Kurds and Kayseri Turks. The Iraqi Kurds were something like ~80% Armenian + ~20% Bashkir or Turkmen.
I like to model with moderns sometimes because it gives me an idea how shifted compared to their neighbors a population is
For ex Iraqi Kurds consistently get Iranian or Georgian or Armenian + 10% Estonian or some percentage Tatar or Bashkir. It doesn't mean that they have Estonian ancestry bu this tells me that a significant proportion of their ancestors came from NE of Armenia or Iran
How many SNPs did you end up with when you did your extract ?
btw, I read on the supplement to the performance assessment paper that the choice of the top right population matters. Is that still the case? As far as I know, it was necessary to put the target population on top of the left populations before. It seems to have changed on R.
I also played around with the allsnps = TRUE feature, and I can say that the results change A LOT.Quote:
Right Population File
This is alist of reference populations to be included in the qpAdm model. The number of reference populations must be greater than the number of left (i.e. target and source) populations. One population should be listed per line. The first population in the list will be used as a base for all f4-statistics calculated. Population order after the first population does not matter.
Fst, Bulgarian and Turkish also separated:
I do not get, how can I be closer to French and Adygei than Albanians? The rest seems ok though.Code:pop1 pop2 est se
<chr> <chr> <dbl> <dbl>
1 Kaspias Adygei.DG 0.134 0.00285
2 Kaspias Albanian.DG 0.137 0.00395
3 Kaspias Bashkir.SG 0.136 0.00300
4 Kaspias Bulgarian_1.DG 0.126 0.00347
5 Kaspias Bulgarian_2.DG 0.130 0.00358
6 Kaspias Cretan.DG 0.131 0.00294
7 Kaspias French.DG 0.134 0.00274
8 Kaspias Greek_1.DG 0.130 0.00339
9 Kaspias Greek_2.DG 0.129 0.00364
10 Kaspias Iranian.DG 0.133 0.00292
11 Kaspias Italian_North.DG 0.132 0.00351
12 Kaspias Jordanian.DG 0.141 0.00279
13 Kaspias Mansi.DG 0.156 0.00283
14 Kaspias Polish.DG 0.135 0.00354
15 Kaspias Russian.DG 0.134 0.00293
16 Kaspias Tatar_Tomsk.SG 0.142 0.00294
17 Kaspias Turkish_1.DG 0.145 0.00381
18 Kaspias Turkish_2.DG 0.132 0.00348
19 Kaspias Turkmen.SG 0.135 0.00297
20 Kaspias Tuscan_1.DG 0.134 0.00296
21 Kaspias Uzbek.SG 0.140 0.00272
Quote:
Your qpAdm model looks good ! but I'm interested to know how many SNPs in the extract and which pright you used
You're showing 25% Usbek -/+4% which is higher than what I got for the Iraqi Kurds and Kayseri Turks. The Iraqi Kurds were something like ~80% Armenian + ~20% Bashkir or Turkmen.
I like to model with moderns sometimes because it gives me an idea how shifted compared to their neighbors a population is
For ex Iraqi Kurds consistently get Iranian or Georgian or Armenian + 10% Estonian or some percentage Tatar or Bashkir. It doesn't mean that they have Estonian ancestry bu this tells me that a significant proportion of their ancestors came from NE of Armenia or Iran
How many SNPs did you end up with when you did your extract
Let me show you all the models:Code:√ 1578508 SNPs read in total
! 565741 SNPs remain after filtering. 540537 are polymorphic.
Right(Adygei, Cretan, Iranian, Mansi, Polish, Jordanian, Tatar_Tomsk, Italian_North, Albanian, French) -all the of Simeon's-
Code:
target left weight se z
1 Kaspias Bulgarian_1.DG 0.78 0.044 17.51
2 Kaspias Turkmen.SG 0.22 0.044 4.98
f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 5.96 0.65119 10 553.51 1.6e-112
2 0 18 559.48 < 2e-16
Code:
target left weight se z
1 Kaspias Bulgarian_1.DG 0.76 0.048 15.89
2 Kaspias Bashkir.SG 0.24 0.048 4.89
f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 6.44 0.59806 10 478.01 2.2e-96
2 0 18 484.45 < 2e-16
Code:
target left weight se z
1 Kaspias Bulgarian_1.DG 0.8 0.039 20.51
2 Kaspias Uzbek.SG 0.2 0.039 5.13
f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 4.4 0.81917 10 796.21 1.3e-164
2 0 18 800.62 < 2e-16
I get negative scores when used Turkish samples to model my Turkic part, which might mean Turkic ancestors of Balkan Turks were not Anatolian Turkish-like.Code:
target left weight se z
1 Kaspias Bulgarian_1.DG 0.82 0.044 18.83
2 Kaspias Tatar_Tomsk.SG 0.18 0.044 4.15
f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 6.16 0.62951 10 628.79 1.2e-128
2 0 18 634.94 < 2e-16
Code:
target left weight se z
1 Kaspias Bulgarian_1.DG 1.04 5.3 0.2
2 Kaspias Turkish_1.DG -0.04 5.3 -0.01
f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 28.18 0.00044187 10 36.1 0.000081
2 0 18 64.28 4.09e-07
Code:
target left weight se z
1 Kaspias Bulgarian_1.DG 11.44 364.26 0.031
2 Kaspias Turkish_2.DG -10.44 364.26 -0.029
f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 29.87 0.00022259 10 30.66 0.00067
2 0 18 60.54 1.6747e-06
Problem with modern people who identify as French is they may have recent N. African or other European ancestors. That aside since the default in Admixtools is to use common and less common alleles in calculations ENF in Europeans and W. Asians confounds results because it was so successful and Europeans and W. Asians have alot of it.
I have found that when I exclude very common alleles from analysis you have less of those sorts of problems due to ENF in Italians and French. Default in Admixtools is to use all alleles subject to maxmiss meaning that if an allele has a MAF of 0.50 it is still used. MAF 0.50 means that minor or ALT allele is found in all your pops in the analysis whether African, Asian or European meaning if "T" is the minor allele at that position then all Africans and Eurasians in your analysis are "T" at that position.
If you want to exclude these very ancient common alleles from your analysis set maxmaf=0.4 or 0.3 and then rerun. I'm pretty sure you wont get French or British or whatever having a small distance to you because of shared ENF or EHG.