qpAdm thread

Quote:

Originally Posted by Kaspias

I have done. However, I used a super kit(created with 3 different raw data) and had ~40MB size while an average raw data has 15-20, stating in case if it might be about it.

check if it works now:
Attachment 106247

Quote:

Originally Posted by vbnetkhio

check if it works now:
Attachment 106247

I have tried with regular MyHeritage file this time, converted it to Ancestry.

Code:

@----------------------------------------------------------@ | PLINK! | v1.07 | 10/Aug/2009 | |----------------------------------------------------------| | (C) 2009 Shaun Purcell, GNU General Public License, v2 | |----------------------------------------------------------| | For documentation, citation & bug-report instructions: | | http://pngu.mgh.harvard.edu/purcell/plink/ | @----------------------------------------------------------@ Web-based version check ( --noweb to skip ) Connecting to web... failed connection Problem connecting to web Writing this text to log file [ data_new.log ] Analysis started: Sat Feb 27 00:52:58 2021 Options in effect: --file data --make-bed --out data_new 1426149 (of 1426149) markers to be included from [ data.map ] ERROR: A problem with line 1 in [ data.ped ] Expecting 6 + 2 * 1426149 = 2852304 columns, but found 2842162

The format in .ped file is as such:

Code:

name1 name2 0 0 0 1 T T 0 0 A A A A A A G G A A G G

So actually it is how it should be. There is a blank between the alleles.

Quote:

Originally Posted by Kaspias

I have tried with regular MyHeritage file this time, converted it to Ancestry.

Code:

@----------------------------------------------------------@ | PLINK! | v1.07 | 10/Aug/2009 | |----------------------------------------------------------| | (C) 2009 Shaun Purcell, GNU General Public License, v2 | |----------------------------------------------------------| | For documentation, citation & bug-report instructions: | | http://pngu.mgh.harvard.edu/purcell/plink/ | @----------------------------------------------------------@ Web-based version check ( --noweb to skip ) Connecting to web... failed connection Problem connecting to web Writing this text to log file [ data_new.log ] Analysis started: Sat Feb 27 00:52:58 2021 Options in effect: --file data --make-bed --out data_new 1426149 (of 1426149) markers to be included from [ data.map ] ERROR: A problem with line 1 in [ data.ped ] Expecting 6 + 2 * 1426149 = 2852304 columns, but found 2842162

The format in .bed file is as such:

Code:

name1 name2 0 0 0 1 T T 0 0 A A A A A A G G A A G G

So actually it is how it should be. There is a blank between the alleles.

that looks like a very old version of plink, maybe it doesn't support tab separated .ped

try with plink 1.9:
https://www.cog-genomics.org/plink/

Quote:

Originally Posted by vbnetkhio

that looks like a very old version of plink, maybe it doesn't support tab separated .ped

try with plink 1.9:
https://www.cog-genomics.org/plink/

Code:

Possibly irregular .ped line. Restarting scan, assuming multichar alleles. Rescanning .ped file... 0% Error: Half-missing call in .ped file at variant 1389874, line 1.

Quote:

Originally Posted by Kaspias

Code:

Possibly irregular .ped line. Restarting scan, assuming multichar alleles. Rescanning .ped file... 0% Error: Half-missing call in .ped file at variant 1389874, line 1.

try with this one...
https://filebin.net/c73ep7a8u1pfgtxw

the "space" version should work with the older version of plink, the other one with the newer.

Quote:

Originally Posted by Kaspias

I have tried with regular MyHeritage file this time, converted it to Ancestry.

Code:

@----------------------------------------------------------@ | PLINK! | v1.07 | 10/Aug/2009 | |----------------------------------------------------------| | (C) 2009 Shaun Purcell, GNU General Public License, v2 | |----------------------------------------------------------| | For documentation, citation & bug-report instructions: | | http://pngu.mgh.harvard.edu/purcell/plink/ | @----------------------------------------------------------@ Web-based version check ( --noweb to skip ) Connecting to web... failed connection Problem connecting to web Writing this text to log file [ data_new.log ] Analysis started: Sat Feb 27 00:52:58 2021 Options in effect: --file data --make-bed --out data_new 1426149 (of 1426149) markers to be included from [ data.map ] ERROR: A problem with line 1 in [ data.ped ] Expecting 6 + 2 * 1426149 = 2852304 columns, but found 2842162

The format in .ped file is as such:

Code:

name1 name2 0 0 0 1 T T 0 0 A A A A A A G G A A G G

So actually it is how it should be. There is a blank between the alleles.

Plink doesn’t like Ancestry format. Just convert whatever you have to 23andme format and then convert the 23andme to Plink bed bim fam. It’s super easy. Also use Plink 1.9

Ancestry alleles are tab seperated ex A A whereas 23 have no space ex AA

Quote:

Originally Posted by Zoro

Plink doesn’t like Ancestry format. Just convert whatever you have to 23andme format and then convert the 23andme to Plink bed bim fam. It’s super easy. Also use Plink 1.9

Ancestry alleles are tab seperated ex A A whereas 23 have no space ex AA

none of the 23andme to plink scripts worked for me without errors. with my script I converted my file, and also some ancients which I had in the 23andme format successfuly.
Kaspias' file seems to have some empty fields and some fields which aren't ACGT0 so i had to adapt the script.

Quote:

Originally Posted by vbnetkhio

none of the 23andme to plink scripts worked for me without errors. with my script I converted my file, and also some ancients which I had in the 23andme format successfuly.
Kaspias' file seems to have some empty fields and some fields which aren't ACGT0 so i had to adapt the script.

What do you mean?

I've converted hundreds of 23 files to plink using:

......./plink --23file sample.txt sample --out sample_plink where ..... is the path to plink executible. Highlighted sample is what the file will be named inside fam

Then you can merge your "sample" file with your "master" file using

.......//plink --bfile master --bmerge sample.bed sample.bim sample.fam --make-bed --out master_new

As far as converting some format to 23 format just post one line of your format (FTDNA, Living or whatever) and I'll make you a Unix script to convert it in 2 minutes

Quote:

Originally Posted by Zoro

......./plink --23file sample.txt sample --out sample_plink

is that a plink function or a separate script??

edit:
found it. i wasn't aware of this function, I was using some old 23andme to ped scripts which didn't produce proper ped files.

Quote:

Originally Posted by vbnetkhio

is that a plink function or a separate script??

edit:
found it. i wasn't aware of this function, I was using some old 23andme to ped scripts which didn't produce proper ped files.

Whaaaat ?

Stay as far as you can from ped files unless you absolutely have to such as converting plink to eigenstrat. They take up a ton more space than bed files

Quote:

Originally Posted by vbnetkhio

try with this one...
https://filebin.net/c73ep7a8u1pfgtxw

the "space" version should work with the older version of plink, the other one with the newer.

This did not work either. But I was able to convert my data to plink. I could not do it by using python at first stage but then when used conda it worked.

Edit: I fixed the other problem too. But still stuck while extracting populations...

First run:

√ 1379990 SNPs read in total
! 6558 SNPs remain after filtering. 6468 are polymorphic.

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Kaspias Turkmen 0.164 0.0991 1.65 2 Kaspias Bulgarian 0.836 0.0991 8.44

Many thanks to @Korialstrasz for the tutorial also to @vbnetkhio and @Zoro for giving a hand. I'd like to see some feedback on it, so I can try to improve.

Quote:

Originally Posted by Kaspias

First run:

√ 1379990 SNPs read in total
! 6558 SNPs remain after filtering. 6468 are polymorphic.

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Kaspias Turkmen 0.164 0.0991 1.65 2 Kaspias Bulgarian 0.836 0.0991 8.44

Many thanks to @Korialstrasz for the tutorial also to @vbnetkhio and @Zoro for giving a hand. I'd like to see some feedback on it, so I can try to improve.

6558 SNPs is too low to do an accurate comparison. Even though you weren't WGS genotyped I can still get you up to about 74,000 SNPs.

I assume you were able to create the Plink files from the Reich dataset based on the par file I wrote you. Can you post the plink .log file. It'll help me diagnose a few things.

Assuming you were able to get 1240K SNPs in your Plink data you can get about 70,000 overlapping SNPs if you only use your data and the Simmons samples in the Reich dataset. You'll recognize them because their IDs start with S_ such as "S_Armenian-1.DG"

If you only use your personal file and the Simmons ones starting in S_ when you extract in Admixtools 2 using:

extract_f2(pref, f2dir, pops = c(

then you should end up with about 70,000 SNPs

Quote:

Originally Posted by Kaspias

First run:

√ 1379990 SNPs read in total
! 6558 SNPs remain after filtering. 6468 are polymorphic.

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Kaspias Turkmen 0.164 0.0991 1.65 2 Kaspias Bulgarian 0.836 0.0991 8.44

Many thanks to @Korialstrasz for the tutorial also to @vbnetkhio and @Zoro for giving a hand. I'd like to see some feedback on it, so I can try to improve.

Congrats on making Plink files and using Admixtools. It's your gateway to much more meaningful analysis than merely using Vahaduo all the time !

6558 SNPs is too low to do an accurate comparison. Even though you weren't WGS genotyped I can still get you up to about 200,000 SNPs.

I assume you were able to create the Plink files from the Reich dataset based on the par file I wrote you. Can you post the plink .log file. It'll help me diagnose a few things.

Assuming you were able to get 1240K SNPs in your Plink data you can get about 200,000 overlapping SNPs if you only use your data and the Simmons samples in the Reich dataset. You'll recognize them because their IDs start with S_ such as "S_Armenian-1.DG"

If you only use your personal file and the Simmons ones starting in S_ when you extract in Admixtools 2 using:

extract_f2(pref, f2dir, pops = c(

then you should end up with about 200,000 SNPs

I use Admixtools 2 based on Eigenstrat geno snp and ind files.

Here are some 1240K pops I use alot because they don't drop your SNP counts

extract_f2(pref, f2dir, pops = c('Eskimo_Sireniki.DG',
'Punjabi',
'Turkmen','Pathan','Kalash','Bashkir','Kotias',
'Tatar_Volga','Turkish','Iranian','Armenian',
'Saami','Georgian','Jordanian',
'Estonian','Bulgarian','Sardinian','Avar','Hazara' ,
'Khomani_San',
'Papuan',
'Chukchi','Han','Uyghur','Mansi',
'Mongola','Buryat','Yakut','Adygei','Burmese','Jew _Iraqi',
'Russia_Abkhasian',
'Karelia','Balochi','Brahui'), maxmiss=0,verbose=TRUE)

.......

Quote:

Originally Posted by Zoro

Congrats on making Plink files and using Admixtools. It's your gateway to much more meaningful analysis than merely using Vahaduo all the time !

6558 SNPs is too low to do an accurate comparison. Even though you weren't WGS genotyped I can still get you up to about 200,000 SNPs.

I assume you were able to create the Plink files from the Reich dataset based on the par file I wrote you. Can you post the plink .log file. It'll help me diagnose a few things.

Assuming you were able to get 1240K SNPs in your Plink data you can get about 200,000 overlapping SNPs if you only use your data and the Simmons samples in the Reich dataset. You'll recognize them because their IDs start with S_ such as "S_Armenian-1.DG"

If you only use your personal file and the Simmons ones starting in S_ when you extract in Admixtools 2 using:

extract_f2(pref, f2dir, pops = c(

then you should end up with about 200,000 SNPs

Log:

Code:

.ped scan complete (for binary autoconversion). Performing single-pass .bed write (960586 variants, 1 person). --file: Kaspias_n-temporary.bed + Kaspias_n-temporary.bim + Kaspias_n-temporary.fam written. 960586 variants loaded from .bim file. 1 person (0 males, 0 females, 1 ambiguous) loaded from .fam. Ambiguous sex ID written to Kaspias_n.nosex . Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 0 founders and 1 nonfounder present. Calculating allele frequencies... done. Total genotyping rate is 0.974373. 960586 variants and 1 person pass filters and QC. Note: No phenotypes present. --make-bed to Kaspias_n.bed + Kaspias_n.bim + Kaspias_n.fam ... done.

For merging:

Code:

54 people loaded from master_plink.fam. 1 person to be merged from Kaspias_n.fam. Of these, 1 is new, while 0 are present in the base dataset. 597573 markers loaded from master_plink.bim. 960586 markers to be merged from Kaspias_n.bim. Of these, 782417 are new, while 178169 are present in the base dataset. Warning: Variants 'rs144847714' and 'rs10492943' have the same position. Warning: Variants 'rs3205229' and 'rs2229002' have the same position. Warning: Variants 'rs769902' and 'rs201435286' have the same position. 748 more same-position warnings: see log file. Performing single-pass merge (55 people, 1379990 variants). Merged fileset written to merged_data-merge.bed + merged_data-merge.bim + merged_data-merge.fam . 1379990 variants loaded from .bim file. 55 people (0 males, 0 females, 55 ambiguous) loaded from .fam. Ambiguous sex IDs written to merged_data.nosex . Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 54 founders and 1 nonfounder present. Calculating allele frequencies... done. Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands treat these as missing. Total genotyping rate is 0.432004. 1379990 variants and 55 people pass filters and QC. Note: No phenotypes present. --make-bed to merged_data.bed + merged_data.bim + merged_data.fam ... done.

Here it is...

√ 1379990 SNPs read in total
! 173175 SNPs remain after filtering. 170906 are polymorphic.

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Kaspias Bulgarian.DG 0.776 0.0339 22.9 2 Kaspias Turkmen.SG 0.224 0.0339 6.60

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Kaspias Bulgarian.DG 0.798 0.0299 26.7 2 Kaspias Uzbek.SG 0.202 0.0299 6.76

Quote:

Originally Posted by Kaspias

Log:

Code:

.ped scan complete (for binary autoconversion). Performing single-pass .bed write (960586 variants, 1 person). --file: Kaspias_n-temporary.bed + Kaspias_n-temporary.bim + Kaspias_n-temporary.fam written. 960586 variants loaded from .bim file. 1 person (0 males, 0 females, 1 ambiguous) loaded from .fam. Ambiguous sex ID written to Kaspias_n.nosex . Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 0 founders and 1 nonfounder present. Calculating allele frequencies... done. Total genotyping rate is 0.974373. 960586 variants and 1 person pass filters and QC. Note: No phenotypes present. --make-bed to Kaspias_n.bed + Kaspias_n.bim + Kaspias_n.fam ... done.

For merging:

Code:

54 people loaded from master_plink.fam. 1 person to be merged from Kaspias_n.fam. Of these, 1 is new, while 0 are present in the base dataset. 597573 markers loaded from master_plink.bim. 960586 markers to be merged from Kaspias_n.bim. Of these, 782417 are new, while 178169 are present in the base dataset. Warning: Variants 'rs144847714' and 'rs10492943' have the same position. Warning: Variants 'rs3205229' and 'rs2229002' have the same position. Warning: Variants 'rs769902' and 'rs201435286' have the same position. 748 more same-position warnings: see log file. Performing single-pass merge (55 people, 1379990 variants). Merged fileset written to merged_data-merge.bed + merged_data-merge.bim + merged_data-merge.fam . 1379990 variants loaded from .bim file. 55 people (0 males, 0 females, 55 ambiguous) loaded from .fam. Ambiguous sex IDs written to merged_data.nosex . Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 54 founders and 1 nonfounder present. Calculating allele frequencies... done. Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands treat these as missing. Total genotyping rate is 0.432004. 1379990 variants and 55 people pass filters and QC. Note: No phenotypes present. --make-bed to merged_data.bed + merged_data.bim + merged_data.fam ... done.

Ok I see that you have 178k SNPs overlapping with master dataset. You can increase that quite a bit by using the Reich 1240K instead of the 593k set you are using. It’s available at Reich Lab.

You can easily convert it from Eigenstrat to plink using the par file I gave you

Then in Plink you can merge your personal data and any other data with it. You can do IBS analysis in plink

Then you can convert your new plink master back to Eigenstrat. Let me know when you’re ready I’ll give you a different par file to do that

Then you can do Admixtools with you included using Eigenstrat

Quote:

Originally Posted by Kaspias

Here it is...

√ 1379990 SNPs read in total
! 173175 SNPs remain after filtering. 170906 are polymorphic.

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Kaspias Bulgarian.DG 0.776 0.0339 22.9 2 Kaspias Turkmen.SG 0.224 0.0339 6.60

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Kaspias Bulgarian.DG 0.798 0.0299 26.7 2 Kaspias Uzbek.SG 0.202 0.0299 6.76

Looking much better. Standard errors look good at 3%. Can you post the p-values so we can see if models are pass or fail. The 3rd row contains the p-value. Also can you post the p-right pops used

Quote:

Originally Posted by Zoro

Looking much better. Standard errors look good at 3%. Can you post the p-values so we can see if models are pass or fail. The 3rd row contains the p-value. Also can you post the p-right pops used

Bulgarian + Uzbek,

Code:

f4rank dof chisq p dofdiff chisqdiff p_nested <int> <int> <dbl> <dbl> <int> <dbl> <dbl> 1 1 14 11.1 6.75e- 1 16 1483. 2.95e-306 2 0 30 1494. 8.98e-296 NA NA NA

Bulgarian + Turkmen,

Code:

f4rank dof chisq p dofdiff chisqdiff p_nested <int> <int> <dbl> <dbl> <int> <dbl> <dbl> 1 1 14 12.7 5.50e- 1 16 1029. 5.62e-209 2 0 30 1042. 6.54e-200 NA NA NA

Right pops:

Code:

"Papuan.DG", "Eskimo_Sireniki.DG", "Jordanian.DG", "Punjabi.DG", "Yakut.DG", "Polish.DG", "Yoruba.DG", "Sardinian.DG", "Finnish.DG", "Armenian.DG", "Greek_1.DG", "Tatar_Volga.SG", "Iranian.DG", "Estonian.DG", "Altaian.DG", "Uzbek.SG"("Turkmen.SG")

Quote:

Originally Posted by Kaspias

Bulgarian + Uzbek,

Code:

f4rank dof chisq p dofdiff chisqdiff p_nested <int> <int> <dbl> <dbl> <int> <dbl> <dbl> 1 1 14 11.1 6.75e- 1 16 1483. 2.95e-306 2 0 30 1494. 8.98e-296 NA NA NA

Bulgarian + Turkmen,

Code:

f4rank dof chisq p dofdiff chisqdiff p_nested <int> <int> <dbl> <dbl> <int> <dbl> <dbl> 1 1 14 12.7 5.50e- 1 16 1029. 5.62e-209 2 0 30 1042. 6.54e-200 NA NA NA

Right pops:

Code:

"Papuan.DG", "Eskimo_Sireniki.DG", "Jordanian.DG", "Punjabi.DG", "Yakut.DG", "Polish.DG", "Yoruba.DG", "Sardinian.DG", "Finnish.DG", "Armenian.DG", "Greek_1.DG", "Tatar_Volga.SG", "Iranian.DG", "Estonian.DG", "Altaian.DG", "Uzbek.SG"("Turkmen.SG")

Unfortunately both models are fails since p-value of 1st is 2.95e-306 and 2nd about same. Your p-values for pass should be >.05. Remove a few of the p-rights to improve p-value.
you may need another source think about a third source that will help

Researchers use ancients for right pops. I have had good luck with this set and they don't have too many missing genotypes. if you are missing some of these samples you can substitute something similar

right= c('Khomani_San','Devils-Gate-N','Bichon','Morocco_Iberomaurusian',
'Anatolia_N','Kotias','Karelia', 'Yana-UP', "Iran_N', 'Kolyma-Mesol')

My Devils-Gate, Yana and Kolyma are WGS but you can use diploids if you have them.

Your p-values should improve alot.

Also not everyone can model successfully with just 2 sources. For example many Kurds can model with just 2 sources but Armenians or Iranians appear to have more complex histories and I usually need at least 3 sources for them. Not sure about your situation.

Quote:

Originally Posted by Zoro

Unfortunately both models are fails since p-value of 1st is 2.95e-306 and 2nd about same. Your p-values for pass should be >.05. Remove a few of the p-rights to improve p-value.
you may need another source think about a third source that will help

I used the exact same populations you recommend except for Turkmen which I replaced with MA2196, and that's what I get:

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Kaspias Bulgarian.DG 0.712 0.0923 7.71 2 Kaspias Turkey_Ottoman_2.SG 0.288 0.0923 3.12 f4rank dof chisq p dofdiff chisqdiff p_nested <int> <int> <dbl> <dbl> <int> <dbl> <dbl> 1 1 8 6.59 5.82e- 1 10 87.0 2.10e-14 2 0 18 93.6 3.26e-12 NA NA NA pat wt dof chisq p f4rank Bulgarian.DG Turkey_Ottoman_2.SG feasible best dofdiff chisqdiff p_nested <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <dbl> <dbl> <dbl> 1 00 0 8 6.59 0.582 1 0.712 0.288 TRUE NA NA NA NA 2 01 1 9 24.4 0.00366 0 1 NA TRUE TRUE 0 -23.9 1 3 10 1 9 48.4 0.000000218 0 NA 1 TRUE TRUE NA NA NA >

If I get it correctly the model still does not pass. What's the reason? I mean, I could add here a 3rd population - Greek or Crimean Tatar - that are potentials for me, but Greek will cause overfitting with Bulgarian whereas there is no Crimean Tatar in the spreadsheet.

In addition, the SNP coverage reduced crucially when leaving Simeon's dataset:

! 29131 SNPs remain after filtering. 27980 are polymorphic.

Quote:

Originally Posted by Kaspias

I used the exact same populations you recommend except for Turkmen which I replaced with MA2196, and that's what I get:

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Kaspias Bulgarian.DG 0.712 0.0923 7.71 2 Kaspias Turkey_Ottoman_2.SG 0.288 0.0923 3.12 f4rank dof chisq p dofdiff chisqdiff p_nested <int> <int> <dbl> <dbl> <int> <dbl> <dbl> 1 1 8 6.59 5.82e- 1 10 87.0 2.10e-14 2 0 18 93.6 3.26e-12 NA NA NA pat wt dof chisq p f4rank Bulgarian.DG Turkey_Ottoman_2.SG feasible best dofdiff chisqdiff p_nested <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <dbl> <dbl> <dbl> 1 00 0 8 6.59 0.582 1 0.712 0.288 TRUE NA NA NA NA 2 01 1 9 24.4 0.00366 0 1 NA TRUE TRUE 0 -23.9 1 3 10 1 9 48.4 0.000000218 0 NA 1 TRUE TRUE NA NA NA >

If I get it correctly the model still does not pass. What's the reason? I mean, I could add here a 3rd population - Greek or Crimean Tatar - that are potentials for me, but Greek will cause overfitting with Bulgarian whereas there is no Crimean Tatar in the spreadsheet.

In addition, the SNP coverage reduced crucially when leaving Simeon's dataset:

! 29131 SNPs remain after filtering. 27980 are polymorphic.

Try to do it with Tuscan, Ukrainian and Turkmen.

Quote:

Originally Posted by andre

Try to do it with Tuscan, Ukrainian and Turkmen.

Tuscan is too Northern for the base Balkan admixture of Thrace, need something in between Apulia and Islander Greek instead.

Almost got no additional Slav:

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Kaspias Tuscan_1.DG 0.721 0.180 4.01 2 Kaspias Polish.DG 0.0334 0.172 0.194 3 Kaspias Turkmen.SG 0.246 0.0398 6.18

Besides, I run these:

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Bulgarian.DG Hungary_Avar_5 0.391 0.349 1.12 2 Bulgarian.DG Bulgaria_IA 0.457 0.281 1.63 3 Bulgarian.DG Russia_Medieval_Nomad.SG 0.152 0.0781 1.95 f4rank dof chisq p dofdiff chisqdiff p_nested <int> <int> <dbl> <dbl> <int> <dbl> <dbl> 1 2 7 9.69 2.07e- 1 9 24.5 3.58e- 3 2 1 16 34.2 5.13e- 3 11 319. 8.11e-62 3 0 27 353. 1.47e-58 NA NA NA

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Gagauz Hungary_Avar_5 0.421 0.142 2.96 2 Gagauz Bulgaria_IA 0.429 0.118 3.64 3 Gagauz Russia_Medieval_Nomad.SG 0.151 0.0394 3.83 f4rank dof chisq p dofdiff chisqdiff p_nested <int> <int> <dbl> <dbl> <int> <dbl> <dbl> 1 2 7 3.61 8.23e- 1 9 25.2 2.74e- 3 2 1 16 28.8 2.51e- 2 11 316. 4.07e-61 3 0 27 345. 8.25e-57 NA NA NA

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Romanian Hungary_Avar_5 0.506 0.183 2.77 2 Romanian Bulgaria_IA 0.368 0.148 2.48 3 Romanian Russia_Medieval_Nomad.SG 0.126 0.0471 2.68 f4rank dof chisq p dofdiff chisqdiff p_nested <int> <int> <dbl> <dbl> <int> <dbl> <dbl> 1 2 7 6.68 4.63e- 1 9 24.8 3.17e- 3 2 1 16 31.5 1.16e- 2 11 316. 3.78e-61 3 0 27 347. 2.22e-57 NA NA NA

The same model on me:

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Kaspias Hungary_Avar_5 0.0528 0.234 0.225 2 Kaspias Bulgaria_IA 0.607 0.191 3.18 3 Kaspias Russia_Medieval_Nomad.SG 0.341 0.0677 5.03 f4rank dof chisq p dofdiff chisqdiff p_nested <int> <int> <dbl> <dbl> <int> <dbl> <dbl> 1 2 7 4.68 6.98e- 1 9 28.3 8.44e- 4 2 1 16 33.0 7.39e- 3 11 298. 1.94e-57 3 0 27 331. 3.88e-54 NA NA NA

Quote:

Originally Posted by Kaspias

I used the exact same populations you recommend except for Turkmen which I replaced with MA2196, and that's what I get:

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Kaspias Bulgarian.DG 0.712 0.0923 7.71 2 Kaspias Turkey_Ottoman_2.SG 0.288 0.0923 3.12 f4rank dof chisq p dofdiff chisqdiff p_nested <int> <int> <dbl> <dbl> <int> <dbl> <dbl> 1 1 8 6.59 5.82e- 1 10 87.0 2.10e-14 2 0 18 93.6 3.26e-12 NA NA NA pat wt dof chisq p f4rank Bulgarian.DG Turkey_Ottoman_2.SG feasible best dofdiff chisqdiff p_nested <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <dbl> <dbl> <dbl> 1 00 0 8 6.59 0.582 1 0.712 0.288 TRUE NA NA NA NA 2 01 1 9 24.4 0.00366 0 1 NA TRUE TRUE 0 -23.9 1 3 10 1 9 48.4 0.000000218 0 NA 1 TRUE TRUE NA NA NA >

If I get it correctly the model still does not pass. What's the reason? I mean, I could add here a 3rd population - Greek or Crimean Tatar - that are potentials for me, but Greek will cause overfitting with Bulgarian whereas there is no Crimean Tatar in the spreadsheet.

In addition, the SNP coverage reduced crucially when leaving Simeon's dataset:

! 29131 SNPs remain after filtering. 27980 are polymorphic.

You can increase the 29K SNPs alot by using the 1240K SNP Reich set.

Let's first figure out which populations you are genetically closest to by running F2s. This will also tell us if somehow your personal data got corrupted or not. Don't use ancients like I did to keep your SNPs up.

When I run F2s for Bulgarians using 200K SNPs I get the following but I'm not using alot of pops more relevant to Bulgarians such as Hungarians, Greeks etc which you should use. In fact you can use all the Simons 30 or so pops in your dataset

<style type="text/css">td {border: 1px solid #ccc;}br {mso-data-placement:same-cell;}</style>
POP1 POP2 F2 SE Z

Bulgarian Sardinian 0.246 0.0010 258

Bulgarian Estonian 0.247 0.0008 313

Bulgarian Armenian 0.249 0.0008 296

Bulgarian Georgian 0.249 0.0007 358

Bulgarian Turkish-Kayseri 0.249 0.0007 371

Bulgarian Tatar-Volga 0.25 0.0008 328

Bulgarian Saami 0.25 0.0007 343

Bulgarian Iran-Hasanlu-IA 0.251 0.0011 239

Bulgarian Iranians-Fars 0.252 0.0015 168

Bulgarian Karelia-EHG 0.252 0.0012 212

Bulgarian Kotias-CHG 0.252 0.0009 291

Bulgarian Kalash 0.252 0.0008 304

Bulgarian Bashkir 0.252 0.0007 371

Bulgarian Pathan 0.253 0.0009 268

Bulgarian Jordanian 0.253 0.0009 288

Bulgarian Villabruna-UP-WHG 0.254 0.0010 256

Bulgarian Turkmen 0.254 0.0009 291

Bulgarian Balochi 0.254 0.0008 301

Bulgarian Brahui 0.254 0.0007 363

Bulgarian MA1-ANE 0.257 0.0009 274

Bulgarian Punjabi 0.257 0.0009 296

Bulgarian Yana-UP-WGS 0.258 0.0008 336

Bulgarian Devils-Gate-N-WGS 0.259 0.0008 316

Bulgarian Kolyma-Mesol-WGS 0.261 0.0011 240

Bulgarian Saharawi 0.261 0.0010 261

Bulgarian Eskimo-Sireniki 0.261 0.0008 324

Bulgarian Eskimo-Chaplin 0.262 0.0011 237

Bulgarian China-Tianyuan-UP 0.267 0.0012 215

Bulgarian UstIshim-UP 0.269 0.0010 263

Bulgarian Khomani-San 0.313 0.0013 245

</dbl></dbl></dbl></lgl></lgl></dbl></dbl></dbl></dbl></dbl></dbl></dbl></chr></dbl></dbl></int></dbl></dbl></int></int></dbl></dbl></dbl></chr></chr>
Running F2s is simple. Do this

## Increase number of lines R prints
options(max.print = 100000)

extract_f2(pref, f2dir, pops = c(..........

f2_blocks = f2_from_precomp('............

##View(f2(f2_blocks))
print(f2(f2_blocks), n = 2000)

Quote:

Originally Posted by Kaspias

Tuscan is too Northern for the base Balkan admixture of Thrace, need something in between Apulia and Islander Greek instead.

Almost got no additional Slav:

The same model on me:

Code:

target left weight se z <chr> <chr> <dbl> <dbl> <dbl> 1 Kaspias Hungary_Avar_5 0.0528 0.234 0.225 2 Kaspias Bulgaria_IA 0.607 0.191 3.18 3 Kaspias Russia_Medieval_Nomad.SG 0.341 0.0677 5.03 f4rank dof chisq p dofdiff chisqdiff p_nested <int> <int> <dbl> <dbl> <int> <dbl> <dbl> 1 2 7 4.68 6.98e- 1 9 28.3 8.44e- 4 2 1 16 33.0 7.39e- 3 11 298. 1.94e-57 3 0 27 331. 3.88e-54 NA NA NA

It looks like you're getting much closer. Your p-value is now passing at 6.98e- 1 which is basically 0.698 !

Your standard errors are not good though especially for Avar 1 Kaspias Hungary_Avar_5 0.0528 0.234 0.225 because it's saying 5.28% Avar +/-23.4%

All this means is your pright are not sufficient to distinguish the genetic difference between Hungary-Avar and Bulgaria-IA. Add a pright that you think is much genetically closer to Avar than Bulgaria-IA OR visa versa

@Kaspias

I am glad that my post helped. Nice to see that you too have managed to run it!

@Zoro

Very helpful advices all around. Thanks again.

---

So I made a few more runs (maxmiss=0 and 93k~ snps ) using the 1240K dataset and the following populations. I picked Tepecik for Neolithic Anatolia. Open for suggestions!

Code:

right= c('Russia_DevilsCave_N.SG','Switzerland_Bichon.SG','Morocco_Iberomaurusian','Turkey_TepecikCiftlik_N.SG','Georgia_Kotias.SG','Russia_HG_Karelia', 'Russia_Yana_UP.SG', 'Iran_GanjDareh_N', 'Russia_Kolyma_M.SG') left = c("Bulgarian.DG","Adygei.DG","Turkmen.SG",'Georgian.DG','Greek_1.DG')

This seems to be the best result, standard errors can go lower I guess. The p values seem OK
About the z values corresponding to weight estimations: What is being tested here? weight i = 0 ? It seems like it.
Also, why do we want to fail to reject the model hypothesis? Can't seem to find a layman interpretation (no surprise).

Run 1: (Greek and Bulgarian did not go well together and Greek instead of Bulgarian yielded better results..Georgian seems to be a non-factor here: not significantly different than 0. But I would expect to have around 10%. Adygei on the other hand has a high se here, possibly due to its rather close proximity to Georgian.)

Code:

======================================= target left weight se z --------------------------------------- 1 me Adygei.DG 0.436 0.267 1.636 2 me Turkmen.SG 0.051 0.049 1.055 3 me Georgian.DG 0.053 0.2 0.263 4 me Greek_1.DG 0.46 0.127 3.617 --------------------------------------- the p value = 0.56 ==================================================== f4rank dof chisq p dofdiff chisqdiff p_nested ---------------------------------------------------- 1 3 5 3.924 0.56 7 36.838 0 2 2 12 40.761 0 9 101.281 0 3 1 21 142.042 0 11 732.627 0 4 0 32 874.67 0 NA NA NA ----------------------------------------------------

Another run, without Georgian. (Adygei SE is now 0.15)

Code:

====================================== target left weight se z -------------------------------------- 1 me Adygei.DG 0.489 0.15 3.268 2 me Turkmen.SG 0.046 0.045 1.006 3 me Greek_1.DG 0.466 0.13 3.584 -------------------------------------- ===================================================== f4rank dof chisq p dofdiff chisqdiff p_nested ----------------------------------------------------- 1 2 6 3.965 0.681 8 68.9 0 2 1 14 72.865 0 10 653.942 0 3 0 24 726.807 0 NA NA NA -----------------------------------------------------

bonus 1: me vs the populations I used (f2 statistics). If I am interpreting these correctly it says I am closer to Bulgarians than the Adygei (albeit not by a significant margin). On g25 I get the opposite all the time, with a clear margin.

Code:

==================================================================== pop1 pop2 est se z p -------------------------------------------------------------------- 1 me Bulgarian.DG 9e-04 0.0011 0.82034 0.41202 2 me Adygei.DG 0.00134 0.00116 1.14901 0.25055 3 me Georgian.DG 0.00232 0.00111 2.09104 0.03652 4 me Greek_1.DG 0.00301 0.00148 2.03656 0.04169 5 me Iran_GanjDareh_N 0.0566 0.00121 46.81045 0 6 me Turkmen.SG 0.06436 0.00126 51.18922 0 7 me Russia_Yana_UP.SG 0.08537 0.00142 60.25923 0 8 me Morocco_Iberomaurusian 0.09281 0.0014 66.0843 0 9 me Russia_DevilsCave_N.SG 0.10113 0.00151 66.87772 0 10 me Turkey_TepecikCiftlik_N.SG 0.13288 0.00139 95.47373 0 11 me Russia_HG_Karelia 0.16392 0.00155 106.00861 0 12 me Georgia_Kotias.SG 0.17455 0.00155 112.48209 0 13 me Switzerland_Bichon.SG 0.17911 0.00175 102.16275 0 14 me Russia_Kolyma_M.SG 0.19436 0.00177 109.87032 0 --------------------------------------------------------------------

bonus 2: a graph. Perhaps this can give an idea as to whether the chosen populations are satisfactory or not. If the graphs produce nonsense, one can try different populations. This particular one I produced is possibly nonsense since I only used some moderns and pretty ancient populations.

https://i.ibb.co/CsYKkz3/graph1.png

Quote:

Originally Posted by Korialstrasz

@Kaspias

I am glad that my post helped. Nice to see that you too have managed to run it!

@Zoro

Very helpful advices all around. Thanks again.

---

So I made a few more runs (maxmiss=0 and 93k~ snps ) using the 1240K dataset and the following populations. I picked Tepecik for Neolithic Anatolia. Open for suggestions!

Code:

right= c('Russia_DevilsCave_N.SG','Switzerland_Bichon.SG','Morocco_Iberomaurusian','Turkey_TepecikCiftlik_N.SG','Georgia_Kotias.SG','Russia_HG_Karelia', 'Russia_Yana_UP.SG', 'Iran_GanjDareh_N', 'Russia_Kolyma_M.SG') left = c("Bulgarian.DG","Adygei.DG","Turkmen.SG",'Georgian.DG','Greek_1.DG')

This seems to be the best result, standard errors can go lower I guess. The p values seem OK
About the z values corresponding to weight estimations: What is being tested here? weight i = 0 ? It seems like it.
Also, why do we want to fail to reject the model hypothesis? Can't seem to find a layman interpretation (no surprise).

Run 1: (Greek and Bulgarian did not go well together and Greek instead of Bulgarian yielded better results..Georgian seems to be a non-factor here: not significantly different than 0. But I would expect to have around 10%. Adygei on the other hand has a high se here, possibly due to its rather close proximity to Georgian.)

Code:

======================================= target left weight se z --------------------------------------- 1 me Adygei.DG 0.436 0.267 1.636 2 me Turkmen.SG 0.051 0.049 1.055 3 me Georgian.DG 0.053 0.2 0.263 4 me Greek_1.DG 0.46 0.127 3.617 --------------------------------------- the p value = 0.56 ==================================================== f4rank dof chisq p dofdiff chisqdiff p_nested ---------------------------------------------------- 1 3 5 3.924 0.56 7 36.838 0 2 2 12 40.761 0 9 101.281 0 3 1 21 142.042 0 11 732.627 0 4 0 32 874.67 0 NA NA NA ----------------------------------------------------

Another run, without Georgian. (Adygei SE is now 0.15)

Code:

====================================== target left weight se z -------------------------------------- 1 me Adygei.DG 0.489 0.15 3.268 2 me Turkmen.SG 0.046 0.045 1.006 3 me Greek_1.DG 0.466 0.13 3.584 -------------------------------------- ===================================================== f4rank dof chisq p dofdiff chisqdiff p_nested ----------------------------------------------------- 1 2 6 3.965 0.681 8 68.9 0 2 1 14 72.865 0 10 653.942 0 3 0 24 726.807 0 NA NA NA -----------------------------------------------------

bonus 1: me vs the populations I used (f2 statistics). If I am interpreting these correctly it says I am closer to Bulgarians than the Adygei (albeit not by a significant margin). On g25 I get the opposite all the time, with a clear margin.

Code:

==================================================================== pop1 pop2 est se z p -------------------------------------------------------------------- 1 me Bulgarian.DG 9e-04 0.0011 0.82034 0.41202 2 me Adygei.DG 0.00134 0.00116 1.14901 0.25055 3 me Georgian.DG 0.00232 0.00111 2.09104 0.03652 4 me Greek_1.DG 0.00301 0.00148 2.03656 0.04169 5 me Iran_GanjDareh_N 0.0566 0.00121 46.81045 0 6 me Turkmen.SG 0.06436 0.00126 51.18922 0 7 me Russia_Yana_UP.SG 0.08537 0.00142 60.25923 0 8 me Morocco_Iberomaurusian 0.09281 0.0014 66.0843 0 9 me Russia_DevilsCave_N.SG 0.10113 0.00151 66.87772 0 10 me Turkey_TepecikCiftlik_N.SG 0.13288 0.00139 95.47373 0 11 me Russia_HG_Karelia 0.16392 0.00155 106.00861 0 12 me Georgia_Kotias.SG 0.17455 0.00155 112.48209 0 13 me Switzerland_Bichon.SG 0.17911 0.00175 102.16275 0 14 me Russia_Kolyma_M.SG 0.19436 0.00177 109.87032 0 --------------------------------------------------------------------

]

Congrats, looking good.

The best way to figure out which samples have the least missing SNPs so that you can use them in your run is to do this plink command:

....../plink/bfile Master --missing . This will output a file called plink.imiss and will list the number of missing SNPs in every sample. This way you can only use your best samples.

For ex, here's a portion of my plink.imiss file sorted by missingness

<style type="text/css">td {border: 1px solid #ccc;}br {mso-data-placement:same-cell;}</style>
Anatolia_N Bar8 3563703 4668444 0.76

Anatolia_N Bar31 3646992 4676043 0.78

Anatolia_N I0707 3712565 4668444 0.80

Anatolia_N I0746 3726082 4676043 0.80

Anatolia_N I0745 3732807 4676043 0.80

Anatolia_N I0709 3741514 4676043 0.80

Anatolia_N I0708 3748125 4676043 0.80

Anatolia_N I1583_publ 3749842 4676043 0.80

Anatolia_N I1580_publ 3798544 4668444 0.81

Anatolia_N I0744 3837576 4676043 0.82

Anatolia_N I1581_publ 3874657 4668444 0.83

Anatolia_N I1585_publ 3875085 4668444 0.83

Anatolia_N I1579_publ 3880059 4668444 0.83

Anatolia_N I0736 3898059 4668444 0.84

Anatolia_N I1098 3914326 4668444 0.84

Anatolia_N ZHAG 3921509 4668444 0.84

Anatolia_N I1096 3957126 4676043 0.85

Anatolia_N I1097 3959996 4676043 0.85

Anatolia_N I1101 4047243 4676043 0.87

Anatolia_N I1103 4099354 4676043 0.88

Anatolia_Ottoman_1.SG MA2195_final 4109325 4668444 0.88

Anatolia_TepecikCiftlik_N.SG Tep003 4141647 4676043 0.89

You'll notice the best ENF samples are Bar8 with missingness of only 0.76 followed by Bar31 etc. You'll also notice that the ENF you used is one of the worst as far as missing SNPs at missingness of 0.89

Next what I do is go to my Eigenstrat .ind file and add _low to the samples with high missingness that I don't want Admixtools to use.

For ex
Anatolia_N:Bar8 F Anatolia_N
Anatolia_N:Bar31 M Anatolia_N
Anatolia_N:I0707 F Anatolia_N
Anatolia_N:I0708 M Anatolia_N
Anatolia_N:I0709 M Anatolia_N
Anatolia_N:I0736 F Anatolia_N_low
Anatolia_N:I0744 M Anatolia_N
Anatolia_N:I0745 M Anatolia_N
Anatolia_N:I0746 M Anatolia_N
Anatolia_N:I1096 M Anatolia_N_low
Anatolia_N:I1097 M Anatolia_N_low
Anatolia_N:I1098 F Anatolia_N_low
Anatolia_N:I1101 M Anatolia_N_low
Anatolia_N:I1103 M Anatolia_N_low
Anatolia_N:I1579_publ F Anatolia_N
Anatolia_N:I1580_publ F Anatolia_N
Anatolia_N:I1581_publ F Anatolia_N
Anatolia_N:I1583_publ M Anatolia_N
Anatolia_N:I1585_publ F Anatolia_N
Anatolia_N:ZHAG F Anatolia_N_low

Now when I add "Anatolia_N" to extract or pright only the ENF samples with low missingness are used and the rest are ignored.

You may ask why I don't only use the best 2 ENF samples instead of the best 8. The answer to that is the more samples the more accurate the allele frequencies for the population become. So its a tradeoff between ignoring worse samples and improving allele frequencies.

Quote:

Originally Posted by Korialstrasz

@Kaspias

I am glad that my post helped. Nice to see that you too have managed to run it!

@Zoro

Very helpful advices all around. Thanks again.

---

So I made a few more runs (maxmiss=0 and 93k~ snps ) using the 1240K dataset and the following populations. I picked Tepecik for Neolithic Anatolia. Open for suggestions!

Z is directly proportional to p-value. The higher the p the lower the Z

Your passing models are generally >0.05. The higher ones such as >0.40 are a little more accurate

Agree with regards to SE Adygei due to Georgian. If you really need to use both you have to add another pright that's shares more genetic drift with one vs the other.

Your SE is still acceptable. Your p of 0.68 is very good.

How many SNPs were used in the f2 run?

Quote:

Originally Posted by Zoro

Congrats, looking good.

Now when I add "Anatolia_N" to extract or pright only the ENF samples with low missingness are used and the rest are ignored.

You may ask why I don't only use the best 2 ENF samples instead of the best 8. The answer to that is the more samples the more accurate the allele frequencies for the population become. So its a tradeoff between ignoring worse samples and improving allele frequencies.

So I did this for every population I use, exluded those with very high missing ratios. But some populations like Ganj_Dareh only has a few samples and decided to stick to SNPs in the trade-off you mentioned. (I think I only eliminated those with 0.85+ missing)
Removed Tepecik and went on with only Turkey_N. Since my right pops cannot distinguish Greek vs Bulgarian, I wanted to use Polish for slavic proxy. (Not sure if this is a good idea). After going with Turkey_N, Adygei never fell below 0.6. This tells me that I am missing something. I´ll read this next: https://www.biorxiv.org/content/10.1...664v1.full.pdf

Perhaps I should use earlier, let´s say medieval, pops as left references. As I am mostly interested in getting a rough estimation of my admixture.

Code:

======================================= target left weight se z --------------------------------------- 1 my_geno Adygei.DG 0.624 0.152 4.103 2 my_geno Turkmen.SG 0.012 0.044 0.282 3 my_geno Greek_1.DG 0.069 0.147 0.47 4 my_geno Polish.DG 0.294 0.094 3.138 --------------------------------------- ====================================================== f4rank dof chisq p dofdiff chisqdiff p_nested ------------------------------------------------------ 1 3 5 0.57 0.989 7 64.579 0 2 2 12 65.149 0 9 164.988 0 3 1 21 230.138 0 11 893.415 0 4 0 32 1123.552 0 NA NA NA ------------------------------------------------------

Quote:

Originally Posted by Zoro

How many SNPs were used in the f2 run?

Those f2 statistics came from the run with 93k~ snps.

New one with 100k~ SNPs: (you can see the populations I use, moderns left ancients right.)

Code:

=============================================== pop1 pop2 est se ----------------------------------------------- 1 me Bulgarian.DG 0.00071 0.00108 2 me Russia_Abkhasian.DG 0.00094 0.00111 3 me Georgian.DG 0.00097 0.00111 4 me Armenian.DG 0.00109 0.00109 5 me Adygei.DG 0.00114 0.00116 6 me Turkish.DG 0.00143 0.00113 7 me Greek_1.DG 0.00157 0.00112 8 me Russia_NorthOssetian.DG 0.00182 0.00114 9 me Iranian.DG 0.00219 0.00112 10 me Polish.DG 0.0022 0.00146 11 me Turkmen.SG 0.00557 0.00136 12 me Turkey_N 0.00895 0.00101 13 me Iran_GanjDareh_N 0.0279 0.00127 14 me Ukraine_N 0.0285 0.00113 15 me Russia_Yana_UP.SG 0.03287 0.00161 16 me Russia_DevilsCave_N.SG 0.06759 0.00159 17 me Georgia_Kotias.SG 0.18128 0.00152 18 me Russia_HG_Karelia 0.18468 0.00151 19 me Switzerland_Bichon.SG 0.18638 0.00168 20 me Russia_Kolyma_M.SG 0.20019 0.00163 -----------------------------------------------

bonus:

possibly historically not accurate, but just felt curious to see what I´d get.

Code:

========================================== target left weight se z ------------------------------------------ 1 Turkish.DG Greek_1.DG 0.504 0.09 5.619 2 Turkish.DG Turkmen.SG 0.113 0.037 3.066 3 Turkish.DG Iranian.DG 0.383 0.115 3.33 ------------------------------------------ ====================================================== f4rank dof chisq p dofdiff chisqdiff p_nested ------------------------------------------------------ 1 2 6 10.737 0.097 8 98.977 0 2 1 14 109.714 0 10 911.517 0 3 0 24 1021.231 0 NA NA NA ------------------------------------------------------

Thank you a lot again, Zoro.

Quote:

Originally Posted by Zoro

It looks like you're getting much closer. Your p-value is now passing at 6.98e- 1 which is basically 0.698 !

Your standard errors are not good though especially for Avar 1 Kaspias Hungary_Avar_5 0.0528 0.234 0.225 because it's saying 5.28% Avar +/-23.4%

All this means is your pright are not sufficient to distinguish the genetic difference between Hungary-Avar and Bulgaria-IA. Add a pright that you think is much genetically closer to Avar than Bulgaria-IA OR visa versa

I'm having hard time while interpreting these tbh:

F2s

Code:

pop1 pop2 est se <chr> <chr> <dbl> <dbl> 1 Kaspias Adygei.DG 0.361 0.00114 2 Kaspias Albanian.DG 0.359 0.00128 3 Kaspias Bashkir.SG 0.363 0.00114 4 Kaspias Bulgarian.DG 0.360 0.00114 5 Kaspias Cretan.DG 0.360 0.00118 6 Kaspias French.DG 0.360 0.00113 7 Kaspias Greek_1.DG 0.360 0.00121 8 Kaspias Greek_2.DG 0.359 0.00122 9 Kaspias Iranian.DG 0.362 0.00115 10 Kaspias Italian_North.DG 0.360 0.00129 11 Kaspias Jordanian.DG 0.365 0.00107 12 Kaspias Mansi.DG 0.365 0.00111 13 Kaspias Polish.DG 0.360 0.00121 14 Kaspias Russian.DG 0.361 0.00113 15 Kaspias Tatar_Tomsk.SG 0.366 0.00108 16 Kaspias Turkish.DG 0.362 0.00112 17 Kaspias Turkmen.SG 0.364 0.00113 18 Kaspias Tuscan_1.DG 0.360 0.00115 19 Kaspias Uzbek.SG 0.366 0.00109

Fst:

Code:

pop1 pop2 est se <chr> <chr> <dbl> <dbl> 1 Kaspias Adygei.DG 0.134 0.00284 2 Kaspias Albanian.DG 0.137 0.00395 3 Kaspias Bashkir.SG 0.137 0.00296 4 Kaspias Bulgarian.DG 0.130 0.00288 5 Kaspias Cretan.DG 0.131 0.00294 6 Kaspias French.DG 0.134 0.00272 7 Kaspias Greek_1.DG 0.131 0.00337 8 Kaspias Greek_2.DG 0.129 0.00364 9 Kaspias Iranian.DG 0.133 0.00293 10 Kaspias Italian_North.DG 0.132 0.00346 11 Kaspias Jordanian.DG 0.141 0.00278 12 Kaspias Mansi.DG 0.156 0.00281 13 Kaspias Polish.DG 0.135 0.00352 14 Kaspias Russian.DG 0.134 0.00292 15 Kaspias Tatar_Tomsk.SG 0.142 0.00293 16 Kaspias Turkish.DG 0.133 0.00275 17 Kaspias Turkmen.SG 0.135 0.00298 18 Kaspias Tuscan_1.DG 0.134 0.00295 19 Kaspias Uzbek.SG 0.140 0.00270

But I should add, the overlapping SNP number is ~610K now when used the 1240K dataset.

Code:

target left weight se z 1 Kaspias Bulgarian.DG 0.75 0.041 18.11 2 Kaspias Uzbek.SG 0.25 0.041 6.05 f4rank dof chisq p.value dofdiff chisqdiff p_nested 1 1 5 5.81 0.32551 7 735.49 1.5e-154 2 0 12 741.3 < 2e-16

Quote:

Originally Posted by Korialstrasz

So I did this for every population I use, exluded those with very high missing ratios. But some populations like Ganj_Dareh only has a few samples and decided to stick to SNPs in the trade-off you mentioned. (I think I only eliminated those with 0.85+ missing)
Removed Tepecik and went on with only Turkey_N. Since my right pops cannot distinguish Greek vs Bulgarian, I wanted to use Polish for slavic proxy. (Not sure if this is a good idea). After going with Turkey_N, Adygei never fell below 0.6. This tells me that I am missing something. I´ll read this next: https://www.biorxiv.org/content/10.1...664v1.full.pdf

Perhaps I should use earlier, let´s say medieval, pops as left references. As I am mostly interested in getting a rough estimation of my admixture.

Code:

======================================= target left weight se z --------------------------------------- 1 my_geno Adygei.DG 0.624 0.152 4.103 2 my_geno Turkmen.SG 0.012 0.044 0.282 3 my_geno Greek_1.DG 0.069 0.147 0.47 4 my_geno Polish.DG 0.294 0.094 3.138 --------------------------------------- ====================================================== f4rank dof chisq p dofdiff chisqdiff p_nested ------------------------------------------------------ 1 3 5 0.57 0.989 7 64.579 0 2 2 12 65.149 0 9 164.988 0 3 1 21 230.138 0 11 893.415 0 4 0 32 1123.552 0 NA NA NA ------------------------------------------------------

Those f2 statistics came from the run with 93k~ snps.

New one with 100k~ SNPs: (you can see the populations I use, moderns left ancients right.)

Code:

=============================================== pop1 pop2 est se ----------------------------------------------- 1 me Bulgarian.DG 0.00071 0.00108 2 me Russia_Abkhasian.DG 0.00094 0.00111 3 me Georgian.DG 0.00097 0.00111 4 me Armenian.DG 0.00109 0.00109 5 me Adygei.DG 0.00114 0.00116 6 me Turkish.DG 0.00143 0.00113 7 me Greek_1.DG 0.00157 0.00112 8 me Russia_NorthOssetian.DG 0.00182 0.00114 9 me Iranian.DG 0.00219 0.00112 10 me Polish.DG 0.0022 0.00146 11 me Turkmen.SG 0.00557 0.00136 12 me Turkey_N 0.00895 0.00101 13 me Iran_GanjDareh_N 0.0279 0.00127 14 me Ukraine_N 0.0285 0.00113 15 me Russia_Yana_UP.SG 0.03287 0.00161 16 me Russia_DevilsCave_N.SG 0.06759 0.00159 17 me Georgia_Kotias.SG 0.18128 0.00152 18 me Russia_HG_Karelia 0.18468 0.00151 19 me Switzerland_Bichon.SG 0.18638 0.00168 20 me Russia_Kolyma_M.SG 0.20019 0.00163 -----------------------------------------------

bonus:

possibly historically not accurate, but just felt curious to see what I´d get.

Code:

========================================== target left weight se z ------------------------------------------ 1 Turkish.DG Greek_1.DG 0.504 0.09 5.619 2 Turkish.DG Turkmen.SG 0.113 0.037 3.066 3 Turkish.DG Iranian.DG 0.383 0.115 3.33 ------------------------------------------ ====================================================== f4rank dof chisq p dofdiff chisqdiff p_nested ------------------------------------------------------ 1 2 6 10.737 0.097 8 98.977 0 2 1 14 109.714 0 10 911.517 0 3 0 24 1021.231 0 NA NA NA ------------------------------------------------------

Thank you a lot again, Zoro.

I'm glad your getting the hang of it. Keep at it and you'll slowly get better !

Yes, it's best to use ancients to model yourself and others since they are the ones ancestral and not moderns allthough moderns can be used to see how one is shifted compared to neighboring moderns.

It's always good to get a few passing p-value models with decent SE and then pick the one you think is most accurate based on historical.

The model you show for the Kayseri Turks is close to what I had but I didn't use Iranians. I used Armenians, Bulgarians and Turkmen and got about 16% Turkmen with equal portions of Armenian and Bulgarian

Quote:

Originally Posted by Kaspias

I'm having hard time while interpreting these tbh:

F2s

Code:

pop1 pop2 est se <chr> <chr> <dbl> <dbl> 1 Kaspias Adygei.DG 0.361 0.00114 2 Kaspias Albanian.DG 0.359 0.00128 3 Kaspias Bashkir.SG 0.363 0.00114 4 Kaspias Bulgarian.DG 0.360 0.00114 5 Kaspias Cretan.DG 0.360 0.00118 6 Kaspias French.DG 0.360 0.00113 7 Kaspias Greek_1.DG 0.360 0.00121 8 Kaspias Greek_2.DG 0.359 0.00122 9 Kaspias Iranian.DG 0.362 0.00115 10 Kaspias Italian_North.DG 0.360 0.00129 11 Kaspias Jordanian.DG 0.365 0.00107 12 Kaspias Mansi.DG 0.365 0.00111 13 Kaspias Polish.DG 0.360 0.00121 14 Kaspias Russian.DG 0.361 0.00113 15 Kaspias Tatar_Tomsk.SG 0.366 0.00108 16 Kaspias Turkish.DG 0.362 0.00112 17 Kaspias Turkmen.SG 0.364 0.00113 18 Kaspias Tuscan_1.DG 0.360 0.00115 19 Kaspias Uzbek.SG 0.366 0.00109

Fst:

Code:

pop1 pop2 est se <chr> <chr> <dbl> <dbl> 1 Kaspias Adygei.DG 0.134 0.00284 2 Kaspias Albanian.DG 0.137 0.00395 3 Kaspias Bashkir.SG 0.137 0.00296 4 Kaspias Bulgarian.DG 0.130 0.00288 5 Kaspias Cretan.DG 0.131 0.00294 6 Kaspias French.DG 0.134 0.00272 7 Kaspias Greek_1.DG 0.131 0.00337 8 Kaspias Greek_2.DG 0.129 0.00364 9 Kaspias Iranian.DG 0.133 0.00293 10 Kaspias Italian_North.DG 0.132 0.00346 11 Kaspias Jordanian.DG 0.141 0.00278 12 Kaspias Mansi.DG 0.156 0.00281 13 Kaspias Polish.DG 0.135 0.00352 14 Kaspias Russian.DG 0.134 0.00292 15 Kaspias Tatar_Tomsk.SG 0.142 0.00293 16 Kaspias Turkish.DG 0.133 0.00275 17 Kaspias Turkmen.SG 0.135 0.00298 18 Kaspias Tuscan_1.DG 0.134 0.00295 19 Kaspias Uzbek.SG 0.140 0.00270

But I should add, the overlapping SNP number is ~610K now when used the 1240K dataset.

Code:

target left weight se z 1 Kaspias Bulgarian.DG 0.75 0.041 18.11 2 Kaspias Uzbek.SG 0.25 0.041 6.05 f4rank dof chisq p.value dofdiff chisqdiff p_nested 1 1 5 5.81 0.32551 7 735.49 1.5e-154 2 0 12 741.3 < 2e-16

I'm glad your getting the hang of it. Keep at it and you'll slowly get better ! Good to see that you were able to use the 1240K set.

I like your FST results better than F2s and I think you should use that more than F2s in the future for distances.

I see that Greek-2 and Bulgarians are closest to you but it looks like you split Greeks into individual samples whereas with Bulgarians it was the average of both Simons samples.

Can you split the Bulgarians and others into individual samples and repost for kicks

Your qpAdm model looks good ! but I'm interested to know how many SNPs in the extract and which pright you used

You're showing 25% Usbek -/+4% which is higher than what I got for the Iraqi Kurds and Kayseri Turks. The Iraqi Kurds were something like ~80% Armenian + ~20% Bashkir or Turkmen.

I like to model with moderns sometimes because it gives me an idea how shifted compared to their neighbors a population is

For ex Iraqi Kurds consistently get Iranian or Georgian or Armenian + 10% Estonian or some percentage Tatar or Bashkir. It doesn't mean that they have Estonian ancestry bu this tells me that a significant proportion of their ancestors came from NE of Armenia or Iran

How many SNPs did you end up with when you did your extract ?

Quote:

Originally Posted by Zoro

I'm glad your getting the hang of it. Keep at it and you'll slowly get better !

Yes, it's best to use ancients to model yourself and others since they are the ones ancestral and not moderns allthough moderns can be used to see how one is shifted compared to neighboring moderns.

It's always good to get a few passing p-value models with decent SE and then pick the one you think is most accurate based on historical.

The model you show for the Kayseri Turks is close to what I had but I didn't use Iranians. I used Armenians, Bulgarians and Turkmen and got about 16% Turkmen with equal portions of Armenian and Bulgarian

btw, I read on the supplement to the performance assessment paper that the choice of the top right population matters. Is that still the case? As far as I know, it was necessary to put the target population on top of the left populations before. It seems to have changed on R.

Quote:

Right Population File

This is alist of reference populations to be included in the qpAdm model. The number of reference populations must be greater than the number of left (i.e. target and source) populations. One population should be listed per line. The first population in the list will be used as a base for all f4-statistics calculated. Population order after the first population does not matter.

I also played around with the allsnps = TRUE feature, and I can say that the results change A LOT.

Quote:

Originally Posted by Zoro

I'm glad your getting the hang of it. Keep at it and you'll slowly get better ! Good to see that you were able to use the 1240K set.

I like your FST results better than F2s and I think you should use that more than F2s in the future for distances.

I see that Greek-2 and Bulgarians are closest to you but it looks like you split Greeks into individual samples whereas with Bulgarians it was the average of both Simons samples.

Can you split the Bulgarians and others into individual samples and repost for kicks

Fst, Bulgarian and Turkish also separated:

Code:

pop1 pop2 est se <chr> <chr> <dbl> <dbl> 1 Kaspias Adygei.DG 0.134 0.00285 2 Kaspias Albanian.DG 0.137 0.00395 3 Kaspias Bashkir.SG 0.136 0.00300 4 Kaspias Bulgarian_1.DG 0.126 0.00347 5 Kaspias Bulgarian_2.DG 0.130 0.00358 6 Kaspias Cretan.DG 0.131 0.00294 7 Kaspias French.DG 0.134 0.00274 8 Kaspias Greek_1.DG 0.130 0.00339 9 Kaspias Greek_2.DG 0.129 0.00364 10 Kaspias Iranian.DG 0.133 0.00292 11 Kaspias Italian_North.DG 0.132 0.00351 12 Kaspias Jordanian.DG 0.141 0.00279 13 Kaspias Mansi.DG 0.156 0.00283 14 Kaspias Polish.DG 0.135 0.00354 15 Kaspias Russian.DG 0.134 0.00293 16 Kaspias Tatar_Tomsk.SG 0.142 0.00294 17 Kaspias Turkish_1.DG 0.145 0.00381 18 Kaspias Turkish_2.DG 0.132 0.00348 19 Kaspias Turkmen.SG 0.135 0.00297 20 Kaspias Tuscan_1.DG 0.134 0.00296 21 Kaspias Uzbek.SG 0.140 0.00272

I do not get, how can I be closer to French and Adygei than Albanians? The rest seems ok though.

Quote:

Your qpAdm model looks good ! but I'm interested to know how many SNPs in the extract and which pright you used

You're showing 25% Usbek -/+4% which is higher than what I got for the Iraqi Kurds and Kayseri Turks. The Iraqi Kurds were something like ~80% Armenian + ~20% Bashkir or Turkmen.

I like to model with moderns sometimes because it gives me an idea how shifted compared to their neighbors a population is

For ex Iraqi Kurds consistently get Iranian or Georgian or Armenian + 10% Estonian or some percentage Tatar or Bashkir. It doesn't mean that they have Estonian ancestry bu this tells me that a significant proportion of their ancestors came from NE of Armenia or Iran

How many SNPs did you end up with when you did your extract

Code:

√ 1578508 SNPs read in total ! 565741 SNPs remain after filtering. 540537 are polymorphic.

Let me show you all the models:

Right(Adygei, Cretan, Iranian, Mansi, Polish, Jordanian, Tatar_Tomsk, Italian_North, Albanian, French) -all the of Simeon's-

Code:

target left weight se z 1 Kaspias Bulgarian_1.DG 0.78 0.044 17.51 2 Kaspias Turkmen.SG 0.22 0.044 4.98 f4rank dof chisq p.value dofdiff chisqdiff p_nested 1 1 8 5.96 0.65119 10 553.51 1.6e-112 2 0 18 559.48 < 2e-16

Code:

target left weight se z 1 Kaspias Bulgarian_1.DG 0.76 0.048 15.89 2 Kaspias Bashkir.SG 0.24 0.048 4.89 f4rank dof chisq p.value dofdiff chisqdiff p_nested 1 1 8 6.44 0.59806 10 478.01 2.2e-96 2 0 18 484.45 < 2e-16

Code:

target left weight se z 1 Kaspias Bulgarian_1.DG 0.8 0.039 20.51 2 Kaspias Uzbek.SG 0.2 0.039 5.13 f4rank dof chisq p.value dofdiff chisqdiff p_nested 1 1 8 4.4 0.81917 10 796.21 1.3e-164 2 0 18 800.62 < 2e-16

Code:

target left weight se z 1 Kaspias Bulgarian_1.DG 0.82 0.044 18.83 2 Kaspias Tatar_Tomsk.SG 0.18 0.044 4.15 f4rank dof chisq p.value dofdiff chisqdiff p_nested 1 1 8 6.16 0.62951 10 628.79 1.2e-128 2 0 18 634.94 < 2e-16

I get negative scores when used Turkish samples to model my Turkic part, which might mean Turkic ancestors of Balkan Turks were not Anatolian Turkish-like.

Code:

target left weight se z 1 Kaspias Bulgarian_1.DG 1.04 5.3 0.2 2 Kaspias Turkish_1.DG -0.04 5.3 -0.01 f4rank dof chisq p.value dofdiff chisqdiff p_nested 1 1 8 28.18 0.00044187 10 36.1 0.000081 2 0 18 64.28 4.09e-07

Code:

target left weight se z 1 Kaspias Bulgarian_1.DG 11.44 364.26 0.031 2 Kaspias Turkish_2.DG -10.44 364.26 -0.029 f4rank dof chisq p.value dofdiff chisqdiff p_nested 1 1 8 29.87 0.00022259 10 30.66 0.00067 2 0 18 60.54 1.6747e-06

Quote:

Originally Posted by Korialstrasz

btw, I read on the supplement to the performance assessment paper that the choice of the top right population matters. Is that still the case? As far as I know, it was necessary to put the target population on top of the left populations before. It seems to have changed on R.

I also played around with the allsnps = TRUE feature, and I can say that the results change A LOT.

qpAdm requires your target and sources for target in pleft and references in pright

I think that is referring to qpWave not qpAdm

I don't trust the calculations when allsnps=TRUE because you're not using SNPs that overlap all your pops

Quote:

Originally Posted by Kaspias

Fst, Bulgarian and Turkish also separated:

Code:

pop1 pop2 est se <chr> <chr> <dbl> <dbl> 1 Kaspias Adygei.DG 0.134 0.00285 2 Kaspias Albanian.DG 0.137 0.00395 3 Kaspias Bashkir.SG 0.136 0.00300 4 Kaspias Bulgarian_1.DG 0.126 0.00347 5 Kaspias Bulgarian_2.DG 0.130 0.00358 6 Kaspias Cretan.DG 0.131 0.00294 7 Kaspias French.DG 0.134 0.00274 8 Kaspias Greek_1.DG 0.130 0.00339 9 Kaspias Greek_2.DG 0.129 0.00364 10 Kaspias Iranian.DG 0.133 0.00292 11 Kaspias Italian_North.DG 0.132 0.00351 12 Kaspias Jordanian.DG 0.141 0.00279 13 Kaspias Mansi.DG 0.156 0.00283 14 Kaspias Polish.DG 0.135 0.00354 15 Kaspias Russian.DG 0.134 0.00293 16 Kaspias Tatar_Tomsk.SG 0.142 0.00294 17 Kaspias Turkish_1.DG 0.145 0.00381 18 Kaspias Turkish_2.DG 0.132 0.00348 19 Kaspias Turkmen.SG 0.135 0.00297 20 Kaspias Tuscan_1.DG 0.134 0.00296 21 Kaspias Uzbek.SG 0.140 0.00272

I do not get, how can I be closer to French and Adygei than Albanians? The rest seems ok though.

Problem with modern people who identify as French is they may have recent N. African or other European ancestors. That aside since the default in Admixtools is to use common and less common alleles in calculations ENF in Europeans and W. Asians confounds results because it was so successful and Europeans and W. Asians have alot of it.

I have found that when I exclude very common alleles from analysis you have less of those sorts of problems due to ENF in Italians and French. Default in Admixtools is to use all alleles subject to maxmiss meaning that if an allele has a MAF of 0.50 it is still used. MAF 0.50 means that minor or ALT allele is found in all your pops in the analysis whether African, Asian or European meaning if "T" is the minor allele at that position then all Africans and Eurasians in your analysis are "T" at that position.

If you want to exclude these very ancient common alleles from your analysis set maxmaf=0.4 or 0.3 and then rerun. I'm pretty sure you wont get French or British or whatever having a small distance to you because of shared ENF or EHG.

Quote:

Originally Posted by Kaspias

I get negative scores when used Turkish samples to model my Turkic part, which might mean Turkic ancestors of Balkan Turks were not Anatolian Turkish-like.

Code:

target left weight se z 1 Kaspias Bulgarian_1.DG 1.04 5.3 0.2 2 Kaspias Turkish_1.DG -0.04 5.3 -0.01 f4rank dof chisq p.value dofdiff chisqdiff p_nested 1 1 8 28.18 0.00044187 10 36.1 0.000081 2 0 18 64.28 4.09e-07

[CODE]

target left weight se z
1 Kaspias Bulgarian_1.DG 11.44 364.26 0.031
2 Kaspias Turkish_2.DG -10.44 364.26 -0.029

Yes, so basically its saying that Bulgarians already have alot of the Turkish type admixture you don't need additional Turkish to model yourself. Besides your SE get huge like with you at 1000% if you use 2 similar pops like that

POP1	POP2	F2	SE	Z
Bulgarian	Sardinian	0.246	0.0010	258
Bulgarian	Estonian	0.247	0.0008	313
Bulgarian	Armenian	0.249	0.0008	296
Bulgarian	Georgian	0.249	0.0007	358
Bulgarian	Turkish-Kayseri	0.249	0.0007	371
Bulgarian	Tatar-Volga	0.25	0.0008	328
Bulgarian	Saami	0.25	0.0007	343
Bulgarian	Iran-Hasanlu-IA	0.251	0.0011	239
Bulgarian	Iranians-Fars	0.252	0.0015	168
Bulgarian	Karelia-EHG	0.252	0.0012	212
Bulgarian	Kotias-CHG	0.252	0.0009	291
Bulgarian	Kalash	0.252	0.0008	304
Bulgarian	Bashkir	0.252	0.0007	371
Bulgarian	Pathan	0.253	0.0009	268
Bulgarian	Jordanian	0.253	0.0009	288
Bulgarian	Villabruna-UP-WHG	0.254	0.0010	256
Bulgarian	Turkmen	0.254	0.0009	291
Bulgarian	Balochi	0.254	0.0008	301
Bulgarian	Brahui	0.254	0.0007	363
Bulgarian	MA1-ANE	0.257	0.0009	274
Bulgarian	Punjabi	0.257	0.0009	296
Bulgarian	Yana-UP-WGS	0.258	0.0008	336
Bulgarian	Devils-Gate-N-WGS	0.259	0.0008	316
Bulgarian	Kolyma-Mesol-WGS	0.261	0.0011	240
Bulgarian	Saharawi	0.261	0.0010	261
Bulgarian	Eskimo-Sireniki	0.261	0.0008	324
Bulgarian	Eskimo-Chaplin	0.262	0.0011	237
Bulgarian	China-Tianyuan-UP	0.267	0.0012	215
Bulgarian	UstIshim-UP	0.269	0.0010	263
Bulgarian	Khomani-San	0.313	0.0013	245

Anatolia_N	Bar8	3563703	4668444	0.76
Anatolia_N	Bar31	3646992	4676043	0.78
Anatolia_N	I0707	3712565	4668444	0.80
Anatolia_N	I0746	3726082	4676043	0.80
Anatolia_N	I0745	3732807	4676043	0.80
Anatolia_N	I0709	3741514	4676043	0.80
Anatolia_N	I0708	3748125	4676043	0.80
Anatolia_N	I1583_publ	3749842	4676043	0.80
Anatolia_N	I1580_publ	3798544	4668444	0.81
Anatolia_N	I0744	3837576	4676043	0.82
Anatolia_N	I1581_publ	3874657	4668444	0.83
Anatolia_N	I1585_publ	3875085	4668444	0.83
Anatolia_N	I1579_publ	3880059	4668444	0.83
Anatolia_N	I0736	3898059	4668444	0.84
Anatolia_N	I1098	3914326	4668444	0.84
Anatolia_N	ZHAG	3921509	4668444	0.84
Anatolia_N	I1096	3957126	4676043	0.85
Anatolia_N	I1097	3959996	4676043	0.85
Anatolia_N	I1101	4047243	4676043	0.87
Anatolia_N	I1103	4099354	4676043	0.88
Anatolia_Ottoman_1.SG	MA2195_final	4109325	4668444	0.88
Anatolia_TepecikCiftlik_N.SG	Tep003	4141647	4676043	0.89