PDA

View Full Version : qpAdm thread



vbnetkhio
01-06-2021, 12:51 PM
a new version of ADMIXTOOLS is out:
https://uqrmaie1.github.io/admixtools/index.html

it's faster and more user friendly.

has anybody tried it out yet?

vbnetkhio
01-06-2021, 12:57 PM
https://i.imgur.com/bfvl1dE.png

I'm getting this error; I'm trying to figure out what's causing it.

vbnetkhio
01-06-2021, 01:06 PM
this is what I get when I filter out the missing SNPs (by default, extract_f2() will be very cautious and exclude all SNPs which are missing in any population):
https://i.imgur.com/ktVDjyS.png
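For reference, that filtering happens in `extract_f2()`; a minimal sketch, assuming genotype files with the hypothetical prefix `mygeno` and using the documented `maxmiss` parameter:

```r
library(admixtools)

# "mygeno" is a hypothetical prefix for PLINK or EIGENSTRAT genotype files.
# maxmiss = 0 (the cautious default) drops every SNP that is missing in any
# population; raising it towards 1 keeps more SNPs, at the risk of bias
# from missingness.
extract_f2("mygeno", outdir = "f2_dir", maxmiss = 0.5)

# load the precomputed f2 blocks for use in qpadm() etc.
f2_blocks = f2_from_precomp("f2_dir")
```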

vbnetkhio
01-06-2021, 01:07 PM
populations used:

left
Russia_Sunghir6.SG
Greece_BA_Mycenaean

right
Russia_Afanasievo
Luxembourg_Loschbour
Russia_HG_Karelia
Anatolia_N
Poland_Globular_Amphora
Iran_GanjDareh_N
Kazakhstan_Eneolithic_Botai
Georgia_Kotias.SG
Morocco_Iberomaurusian
Nganasan
Papuan
Ami
ONG.SG
South_Africa_400BP.SG
Kenya_PastoralN

target
Serb

any recommendation is welcome.
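For anyone following along, a sketch of how this setup translates into an ADMIXTOOLS 2 call, assuming f2 statistics for these populations were already extracted into `f2_blocks`:

```r
library(admixtools)

left   = c("Russia_Sunghir6.SG", "Greece_BA_Mycenaean")
right  = c("Russia_Afanasievo", "Luxembourg_Loschbour", "Russia_HG_Karelia",
           "Anatolia_N", "Poland_Globular_Amphora", "Iran_GanjDareh_N",
           "Kazakhstan_Eneolithic_Botai", "Georgia_Kotias.SG",
           "Morocco_Iberomaurusian", "Nganasan", "Papuan", "Ami",
           "ONG.SG", "South_Africa_400BP.SG", "Kenya_PastoralN")
target = "Serb"

# f2_blocks is assumed to come from extract_f2() / f2_from_precomp()
result = qpadm(f2_blocks, left, right, target)
result$weights   # admixture proportions with standard errors
result$rankdrop  # rank tests / p-values for the model
```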

Dr_Maul
01-06-2021, 01:40 PM
I'm sure a certain Kurd will be active in this thread soon enough

Leto
01-06-2021, 03:37 PM
I'm sure a certain Kurd will be active in this thread soon enough
Oh well. He will even destroy the font in this thread making it unreadably small :rolleyes:

Dr_Maul
01-06-2021, 03:40 PM
Oh well. He will even destroy the font in this thread making it unreadably small :rolleyes:

I don't really have a problem with Zoro unlike most people, however whatever he does to fuck up the thread font/size really gets the blood boiling...

gixajo
01-06-2021, 04:29 PM
a new version of ADMIXTOOLS is out:
https://uqrmaie1.github.io/admixtools/index.html

it's faster and more user friendly.

has anybody tried it out yet?

I saw it on Anthrogenica and saved the link, but I haven't used it yet.

vbnetkhio
01-06-2021, 05:12 PM
according to this, Estonia BA passes as a Slavic proxy for Serbs (p-value 0.1) and the Hungarian Slavs don't (0.02 and 0.0006) :confused:
am I doing something wrong?



pat wt dof chisq p f4rank Greece_BA_Mycenaean Estonia_BA.SG feasible best dofdiff chisqdiff p_nested
&lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;lgl&gt; &lt;lgl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
1 00 0 15 20.6 1.50e-1 1 0.480 0.520 TRUE NA NA NA NA
2 01 1 16 341. 7.57e-63 0 1 NA TRUE TRUE 0 36.2 0
3 10 1 16 305. 2.54e-55 0 NA 1 TRUE TRUE NA NA NA

pat wt dof chisq p f4rank Greece_BA_Mycenaean Hungary_AvarPeriod feasible best dofdiff chisqdiff p_nested
&lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;lgl&gt; &lt;lgl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
1 00 0 15 27.5 2.52e-2 1 0.255 0.745 TRUE NA NA NA NA
2 01 1 16 333. 3.24e-61 0 1 NA TRUE TRUE 0 292. 0
3 10 1 16 41.6 4.52e-4 0 NA 1 TRUE TRUE NA NA NA

pat wt dof chisq p f4rank Russia_Sunghir6.SG Greece_BA_Mycenaean Hungary_Avar_daughter.or.mother.AV1 feasible best dofdiff chisqdiff p_nested
&lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;lgl&gt; &lt;lgl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
1 000 0 14 15.5 3.43e-1 2 0.670 0.347 -0.0176 FALSE NA NA NA NA
2 001 1 15 16.2 3.70e-1 1 0.654 0.346 NA TRUE TRUE 0 -12.2 1
3 010 1 15 28.4 1.93e-2 1 2.71 NA -1.71 FALSE TRUE 0 -10.7 1
4 100 1 15 39.1 6.19e-4 1 NA 0.368 0.632 TRUE TRUE NA NA NA
5 011 2 16 50.0 2.29e-5 0 1 NA NA TRUE NA NA NA NA
6 101 2 16 314. 3.55e-57 0 NA 1 NA TRUE NA NA NA NA
7 110 2 16 74.4 1.70e-9 0 NA NA 1 TRUE NA NA NA NA

edit: I mixed up Estonia_IA and Estonia_BA. Estonia IA was Uralic admixed, Estonia BA was still purely Balto-Slavic, so it makes some sense after all.

Zoro
01-06-2021, 06:31 PM
populations used:

left
Russia_Sunghir6.SG
Greece_BA_Mycenaean

right
Russia_Afanasievo
Luxembourg_Loschbour
Russia_HG_Karelia
Anatolia_N
Poland_Globular_Amphora
Iran_GanjDareh_N
Kazakhstan_Eneolithic_Botai
Georgia_Kotias.SG
Morocco_Iberomaurusian
Nganasan
Papuan
Ami
ONG.SG
South_Africa_400BP.SG
Kenya_PastoralN

target
Serb

any recommendation is welcome.

I think the outgroup set EurasianDNA used is good: the samples are high quality, have high SNP overlap with your test subjects, and can differentiate streams of ancestry well.

For Serbs you can start out with higher quality Neolithic farmer samples + Iron Gates WHG + relevant MLBA steppe + Iron Age steppe/Turkic.

vbnetkhio
01-06-2021, 06:52 PM
I think the outgroup set EurasianDNA used is good: the samples are high quality, have high SNP overlap with your test subjects, and can differentiate streams of ancestry well.

For Serbs you can start out with higher quality Neolithic farmer samples + Iron Gates WHG + relevant MLBA steppe + Iron Age steppe/Turkic.

but is there anything obviously wrong with these outgroups? I used Afanasievo instead of Yamnaya because it's a larger and higher quality sample, and they seem to be of pure Yamnaya descent.

also Nganasan and Ami, because ADMIXTURE detects them as "pure" and important admixture sources, so they are probably the purest descendants of some ancient populations we don't have samples of yet?

is it ok to do things like this?

Token
01-06-2021, 07:08 PM
but is there anything obviously wrong with these outgroups? I used Afanasievo instead of Yamnaya because it's a larger and higher quality sample, and they seem to be of pure Yamnaya descent.

also Nganasan and Ami, because ADMIXTURE detects them as "pure" and important admixture sources, so they are probably the purest descendants of some ancient populations we don't have samples of yet?

is it ok to do things like this?

Papuan, Nganasan, etc. are redundant, since there will be no stream of ancestry from these pops to pleft.

vbnetkhio
01-06-2021, 07:16 PM
Papuan, Nganasan, etc. are redundant, since there will be no stream of ancestry from these pops to pleft.

would Tianyuan and Devil's Gate samples be a better choice?

JamesBond007
01-06-2021, 07:35 PM
a new version of ADMIXTOOLS is out:
https://uqrmaie1.github.io/admixtools/index.html

it's faster and more user friendly.

has anybody tried it out yet?

I'll check it out. I'm updating my operating system to a newer version, which could take hours. This new version looks interesting.

Is this different?


Full support for genotype data in PACKEDANCESTRYMAP/EIGENSTRAT format and PLINK format

So it will take regular PLINK-converted files now? The documentation used to suck, especially for the EIGENSTRAT tooling; that is where I got stuck, if I remember correctly. I never got stuck with qpAdm itself, I got stuck before that, because the documentation sucked balls on converting files to PACKEDANCESTRYMAP/EIGENSTRAT. However, if it just works with PLINK, at least in this new version, everything should be fine for me.

vbnetkhio
01-06-2021, 07:36 PM
I'll check it out. I'm updating my operating system to a newer version it could take hours. This new version looks interesting.

Is this different ? :


Full support for genotype data in PACKEDANCESTRYMAP/EIGENSTRAT format and PLINK format

So it will take just regular plink converted files now ?

yep.

Zoro
01-06-2021, 08:06 PM
Papuan, Nganasan, etc. are redundant, since there will be no stream of ancestry from these pops to pleft.

The opposite is true. In qpAdm you don't want recent geneflow from right to left.

Zoro
01-06-2021, 08:11 PM
but is there anything obviously wrong with these outgroups? I used Afanasievo instead of Yamnaya because it's a larger and higher quality sample, and they seem to be of pure Yamnaya descent.

also Nganasan and Ami, because ADMIXTURE detects them as "pure" and important admixture sources, so they are probably the purest descendants of some ancient populations we don't have samples of yet?


is it ok to do things like this?

We actually do have high-quality WGS and diploid ancients: Yana (ANS) WGS, Kolyma-Mesolithic WGS, and Devils-Gate-N. Neo-Siberians such as Nganasan are descended from these, and Ami have Devils-Gate ancestry.

Anyways, I would replace those ancient Africans with Mbuti, because they have quality and SNP overlap issues.

Also, if you're modelling Serbs with Neolithic and Chalcolithic sources, keep your outgroups Neolithic and older.

Token
01-06-2021, 08:15 PM
The opposite is true. In qpAdm you don't want recent geneflow from right to left.

False. qpAdm assumes no streams of ancestry from pleft into pright, so you choose 'deep' pops for pright to avoid violating this premise, not because you don't want recent geneflow from pright into pleft.

vbnetkhio
01-06-2021, 08:16 PM
And how do I interpret this?
The first two rows tell me that Estonia BA has WHG and EHG admixture, and the others just tell me that Estonia BA has steppe admixture.
How does this help me choose my outgroups?

pop1 pop2 pop3 pop4 est se z p
&lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
1 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo Luxembourg_Loschbour 0.00514 0.000811 6.35 2.22e-10
2 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo Russia_HG_Karelia 0.00316 0.000714 4.43 9.33e-6
3 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo Anatolia_N -0.00744 0.000396 -18.8 1.63e-78
4 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo Poland_Globular_Amphora -0.00546 0.000648 -8.43 3.60e-17
5 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo Iran_GanjDareh_N -0.00437 0.000534 -8.17 3.08e-16
6 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo Kazakhstan_Eneolithic_Botai 0.00196 0.000605 3.24 1.20e-3
7 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo Georgia_Kotias.SG -0.00388 0.000768 -5.05 4.48e-7
8 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo Morocco_Iberomaurusian -0.00490 0.000579 -8.47 2.41e-17
9 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo Nganasan -0.00152 0.000456 -3.32 8.85e-4
10 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo Papuan -0.00257 0.000545 -4.72 2.30e-6
11 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo Ami -0.00195 0.000484 -4.03 5.64e-5
12 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo ONG.SG -0.00233 0.000579 -4.01 5.98e-5
13 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo South_Africa_400BP.SG -0.00319 0.000623 -5.12 2.99e-7
14 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo Kenya_PastoralN -0.00571 0.000513 -11.1 8.98e-29
15 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo Nepal_Samzdong_1500BP.SG -0.00204 0.000686 -2.97 2.99e-3
16 Greece_BA_Mycenaean Estonia_BA.SG Russia_Afanasievo Russia_OldBeringSea_Ekven -0.00108 0.000469 -2.30 2.15e-2

Token
01-06-2021, 08:17 PM
would Tianyuan and Devil's Gate samples be a better choice?

Are these potential sources of ancestry for pleft pops? If not, then you don't need them. If Sunghir shows some Siberian admixture (which can be tested with G25), then it might be a good idea to include Devils Gate.

Zoro
01-06-2021, 08:22 PM
I’m out of here. You can thank and consult with ignoramus Token who has spent years doing qpAdm

Token
01-06-2021, 08:24 PM
Zoro is the typical case of a person who knows how to use qpAdm, but has no idea why he is using it that way.

vbnetkhio
01-06-2021, 08:35 PM
I’m out of here. You can thank and consult with ignoramus Token who has spent years doing qpAdm

well that's unfortunate, I wanted to hear both sides and try out both yours and his advice.

vbnetkhio
01-06-2021, 08:37 PM
Are these potential sources of ancestry for pleft pops? If not, then you don't need them. If Sunghir shows some Siberian admixture (which can be tested with G25), then it might be a good idea to include Devils Gate.

I plan to check later whether any Siberian or Eastern Steppe samples fit into the model. But first I want to find the best sources for the Balto-Slavic and Paleo-Balkan/Greco-Roman ancestry.

Token
01-06-2021, 08:52 PM
....

Kaazi
01-06-2021, 11:03 PM
a new version of ADMIXTOOLS is out:
https://uqrmaie1.github.io/admixtools/index.html

it's faster and more user friendly.

has anybody tried it out yet?

can you teach how to use it?

vbnetkhio
01-07-2021, 07:11 AM
can you teach how to use it?

first install rstudio:
https://rstudio.com/products/rstudio/download/

then just follow these instructions:
https://uqrmaie1.github.io/admixtools/index.html
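Roughly, the install steps from those instructions look like this in the R console (a sketch of the documented GitHub install; the shiny GUI launcher at the end is optional):

```r
# install devtools, then the admixtools package from GitHub
install.packages("devtools")
devtools::install_github("uqrmaie1/admixtools")

library(admixtools)

# optionally launch the point-and-click browser interface
run_shiny_admixtools()
```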

JamesBond007
01-08-2021, 12:10 PM
I got it working, but it was a major pain in the ass, like death by a thousand paper cuts. Not using admixtools2 per se, but merging PLINK files and extracting them via RStudio etc. I was hacking up shell and Perl scripts all night... jeez!

https://i.postimg.cc/fLvgnDBd/newplot.png

https://i.postimg.cc/65qbgsR7/newplot1.png


https://i.postimg.cc/SjbX6bbZ/weighted.png

JamesBond007
01-08-2021, 12:12 PM
first install rstudio:
https://rstudio.com/products/rstudio/download/

then just follow these instructions:
https://uqrmaie1.github.io/admixtools/index.html

Normal people can't do this shit; you have to be either a geneticist or the tech-geek elite.

vbnetkhio
01-08-2021, 09:01 PM
I got it working but it was a major pain in the ass like a death by a thousand paper cuts. Not using admixtools2 etc... per se but merging plink files and extracting them via Rstudio etc... I was hacking shell and perl scripts up all night etc.... jeez !

https://i.postimg.cc/fLvgnDBd/newplot.png

https://i.postimg.cc/65qbgsR7/newplot1.png


https://i.postimg.cc/SjbX6bbZ/weighted.png

Nice! Are those reference populations from the Reich lab dataset? I can send you Danish, Irish, Welsh etc. if you are interested.

JamesBond007
01-09-2021, 09:21 AM
nice! are those reference populations from the Reich lab dataset? I can send you Danish, Irish, Welsh etc. if you are interested

Yes, please, that would be great, thanks! Those reference populations are from a combination of HGDP and the 1000 Genomes Project. I merged them, so they likely come from both projects, but maybe most are from one; I'd have to double-check if you want the details.

P.S. it is a shame this thread is so dead while other, stupid threads here are more popular! :picard2:

vbnetkhio
01-09-2021, 09:43 AM
Yes, please, that would be great, thanks! Those reference populations are from a combination of HGDP and the 1000 Genomes Project. I merged them, so they likely come from both projects, but maybe most are from one; I'd have to double-check if you want the details.

P.S. it is a shame this thread is so dead while other, stupid threads here are more popular! :picard2:

this dataset has Welsh samples:
https://evolbio.ut.ee/Ongaro_2019/

German:
https://evolbio.ut.ee/turkic/

Swedish:
https://evolbio.ut.ee/khazar/

I'll send you the rest later.

JamesBond007
01-09-2021, 09:56 AM
this dataset has Welsh samples:
https://evolbio.ut.ee/Ongaro_2019/

German:
https://evolbio.ut.ee/turkic/

Swedish:
https://evolbio.ut.ee/khazar/

i'll send you the rest later

Ok, thanks a lot man. :thumb001:

Lucas
01-09-2021, 10:07 AM
I want to know if playing in qpadm only with modern samples is "valid" methodology. Outside TA all people discussing qpadm use only ancient samples.
But honestly I am much more interested in modern admixtures.

vbnetkhio
01-09-2021, 10:14 AM
I want to know if playing in qpadm only with modern samples is "valid" methodology. Outside TA all people discussing qpadm use only ancient samples.
But honestly I am much more interested in modern admixtures.

it's valid

here on page 91 onwards they start modelling with moderns:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7093155/bin/NIHMS1551077-supplement-Supplement.pdf

they also use moderns as outgroups:


From Imperial period onward
To further increase the power to find best fit models for samples in Imperial era and later in qpAdm
analysis, we defined an additional “right” (outgroup) population set consisting of 18 diverse modern
populations (MOD18) (with the sample size indicated by the number in the parentheses):
Ami (10), Basque (29), BedouinB (19), Biaka (20), Bougainville (2), Chukchi (20), Eskimo_Naukan (12),
Han (43), Iranian (38), Ju-hoan_North (5), Karitiana (12), Mbuti (10), Papuan (14), Russian (22),
Sardinian (27), She (10), Ulchi (25), Yoruba (30).
As for earlier time periods, we performed qpAdm admixture modeling for Italian individuals sampled in
Imperial era and later in a stepwise fashion. Having observed the high inter-individual ancestry diversity
in Iron Age and after, we did not test one-way models, as a positive result (p>0.05) would only indicate
that the average ancestries of the sampled individuals from the two populations happened to be similar.
Instead, we tested two-way models for individual in each time period, proposing the two sources to be
preceding Italian samples in last period and another ancient population (Iron Age onward) or a modern
population. We considered a model to be acceptable if it has p>0.05 with both ANC17 and MOD18 as the
right set, and reported the results under MOD18, unless otherwise noted.

JamesBond007
01-09-2021, 10:15 AM
I want to know if playing in qpadm only with modern samples is "valid" methodology. Outside TA all people discussing qpadm use only ancient samples.
But honestly I am much more interested in modern admixtures.

It is a matter of statistics, I'd venture to guess: maybe some statistical methods are valid, or 'more valid' than others, when dealing with modern populations. It is really a math issue, not the tool itself, IMHO, but I'm not a mathematician.

Zoro
01-09-2021, 11:08 AM
well that's unfortunate, I wanted to hear both sides and try out both yours and his advice.

Ok, I'll give it a second chance, but any wise-guy comments from him or anyone else who thinks they know better, and I'm out for good.

Zoro
01-09-2021, 11:09 AM
I want to know if playing in qpadm only with modern samples is "valid" methodology. Outside TA all people discussing qpadm use only ancient samples.
But honestly I am much more interested in modern admixtures.

Yes

Zoro
01-09-2021, 11:13 AM
It is a matter of statistics I'd venture to guess maybe some statistical methods are valid or 'more valid' than others when dealing with modern populations. It is really a math issue here not the tool itself IMHO but I'm not a mathematician.

qpAdm is a powerful tool for modelling present-day pops as admixtures of other present-day pops, except perhaps in situations of continuous geneflow over an extended period of time.

Zoro
01-09-2021, 11:16 AM
P.S. it is a shame this thread is so dead while other, stupid threads here are more popular! :picard2:

Nothing more than a reflection of the type of audience here. It looks like just a handful of people are interested in more serious analysis, or are able to understand serious analysis in the first place.

Zoro
01-09-2021, 11:23 AM
I think the outgroup set EurasianDNA used is good: the samples are high quality, have high SNP overlap with your test subjects, and can differentiate streams of ancestry well.

For Serbs you can start out with higher quality Neolithic farmer samples + Iron Gates WHG + relevant MLBA steppe + Iron Age steppe/Turkic.


Like I mentioned before, you want your pright references to be differentially related to the sources you're proposing. Here's a qpAdm guide; I strongly recommend anyone interested in qpAdm familiarize themselves with it:

https://www.biorxiv.org/content/10.1101/2020.04.09.032664v1.full.pdf


In fact, if all 'right' populations are symmetrically related to all ‘left’ populations in this way, qpAdm will not produce meaningful results. The method requires differential relatedness, meaning that at least some 'right' populations must be more closely related to a subset of 'left' populations than to the other 'left' populations.
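In ADMIXTOOLS 2, this requirement can be probed with `qpwave()`, which tests how many independent streams of ancestry the right set can distinguish among the left populations; a sketch, assuming `f2_blocks` was precomputed with `extract_f2()` for these populations:

```r
library(admixtools)

left  = c("Russia_Sunghir6.SG", "Greece_BA_Mycenaean")
right = c("Russia_Afanasievo", "Anatolia_N", "Luxembourg_Loschbour",
          "Iran_GanjDareh_N", "Nganasan")

# if even rank 0 fits (high p-value), the outgroups are not
# differentially related to the left pops, and qpAdm weights
# computed with this right set will not be meaningful
qpwave(f2_blocks, left, right)
```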

andre
01-09-2021, 01:01 PM
qpAdm is the type of thing that fascinates me, but come on, it's still too "nerd" for normal human beings :).

The better thing for us amateurs, in my opinion, would be a good improvement of G25 or Vahaduo, maybe adding a function that detects overlapping; that would be nice.

Lucas
01-09-2021, 01:33 PM
maybe adding a function that detects overlapping; that would be nice.

That would be very good.

JamesBond007
01-09-2021, 02:15 PM
QpAdm is a powerful tool in modelling present-day pops as admixtures of other present-day pops except perhaps in situations of continuous geneflow over an extended period of time

Yes, I agree. I know he was asking about qpAdm specifically, but I was talking in the broader context of all functions, including qpWave and qpGraph, which I have not explored completely in depth, as I just set the thing up the other day.

JamesBond007
01-09-2021, 02:26 PM
qpAdm is the type of thing that fascinates me, but come on, it's still too "nerd" for normal human beings :).

It was extremely nerdy to get working, and one big reason is that this scientific field apparently uses a weird, idiosyncratic format for SNP/genotype data, with nerdy CLI tools to convert between formats that work best from shell or Perl scripts, so you end up bashing away in the terminal like a nerd. Installing the program in R and extracting the data in R is also extremely nerdy. Using the shiny GUI (graphical user interface), though, is not that much nerdier than G25 IMHO.


The better thing for us (amateurs) in my opinion, it's just a good improve of G25 or Vahaduo.. maybe with adding a function that detect overlapping; it would be nice.

The better thing for amateurs is not to get involved. I hate G25, man. That crap is trash: it says I'm closest to the Dutch, and mostly a mix of Norwegian with a little bit of Spanish. It is too fine-grained, especially for modern populations, and it uses non-academic sources for many modern samples.

vbnetkhio
01-15-2021, 06:01 PM
qpAdm is the type of thing that fascinates me, but come on, it's still too "nerd" for normal human beings :).

The better thing for us amateurs, in my opinion, would be a good improvement of G25 or Vahaduo..

the new version is much easier to use. You can ask here if you get stuck anywhere.


maybe adding a function that detects overlapping; that would be nice.

what do you mean? like genetic similarity?

vbnetkhio
01-15-2021, 06:17 PM
Like I mentioned before, you want your pright references to be differentially related to the sources you're proposing. Here's a qpAdm guide; I strongly recommend anyone interested in qpAdm familiarize themselves with it:

https://www.biorxiv.org/content/10.1101/2020.04.09.032664v1.full.pdf


Papuan, Nganasan, etc. are redundant, since there will be no stream of ancestry from these pops to pleft.

what do you think, is it ok to merge similar populations into clusters and use those as outgroups, to improve the SNP count? e.g. an ANE group including Mal'ta, Afontova Gora and Botai.

because of this problem: (quote from anthrogenica)


You want a higher snp count for your models, try removing the less important low coverage samples in the right popslist like Natufian.

Zoro
01-15-2021, 06:48 PM
what do you think, is it ok to merge similar populations into clusters and use those as outgroups, to improve the SNP count? e.g. an ANE group including Mal'ta, Afontova Gora and Botai.

because of this problem: (quote from anthrogenica)

No, it wouldn't make sense.

Yes, you do want the highest SNP overlap possible. One thing I have learned from Eurasian DNA is that accuracy should always be a priority. By using the highest quality samples you do two things: you increase SNP overlap, and you end up with higher accuracy, because you filter out lower-coverage samples.

Don't try to reinvent the wheel. Just use the high quality references Eurasian DNA uses; they are optimized for quality as well as the ability to differentiate more closely related populations: https://eurasiandna.com/?p=2432


The following pright references were used in the qpAdm analysis:

Jo-Hoan-Simmons
Devils-Gate-Neolithic-WGS
Iran-GanjDareh-N
Anatolia-Neolithic
EHG-I0061-DIPLOID
Morocco-Iberomaurusian
Loschbour-DIPLOID
Kolyma-Mesolithic-WGS
Russia-Sunghir6
Botai-EN-DIPLOID
Yana-UP-WGS

Depending on the Scythian/Sarmatian samples used we were able to maintain an overlap of 220,000 to 400,000 SNPs between the samples.

In fact, accuracy is so important to Eurasian DNA that they went ahead and diploid-genotyped some of the published pseudo-haploid samples, reaching a higher level of quality than what the papers were using: https://eurasiandna.com/?p=345

Token
01-15-2021, 06:52 PM
what do you think, is it ok to merge similar populations into clusters and use those as outgroups, to improve the SNP count? e.g. an ANE group including Mal'ta, Afontova Gora and Botai.

because of this problem: (quote from anthrogenica)

SNP count is not crucial; qpAdm deals very well with missing data. From Harney et al.:


Each simulation contains an average of ~30 million SNPs. In order to understand the performance of qpAdm with less data, we randomly down-sample the complete dataset to produce analysis datasets of 1 million, 100 thousand, and 10 thousand sites. In all cases, the average admixture proportion estimate generated is extremely close to the simulated α, although we do observe an increase in the amount of variance in the individual estimates as the amount of data analyzed decreases (Figure 3A; Supplementary Table 3). In order to increase computational efficiency and to better approximate typical analysis datasets, all subsequent analyses are performed on the data that has been randomly down-sampled to 1 million sites. We observe similar results when using non-random ascertainment schemes to select sites for analysis (Supplementary Table 4).
The impact of non-random ascertainment schemes on qpAdm analyses are described in more detail in a later section.
We find that qpAdm is robust to missing data, where data from randomly selected sites in each individual is considered missing with rate 10%, 25%, 50%, 75% or 90%

I generally prefer to stick with diploid genomes, but pseudo-haploidy has little effect on qpAdm too.

andre
02-06-2021, 12:18 PM
Could anyone who knows how to use qpAdm test some Balkan populations (Bulgarians, Romanians, Serbians)?

I'm very interested in how much extra Near Eastern admixture they get.

The model I think will work well is Barcin_N, WHG, some EBA steppe source (Yamnaya Samara, I think, is good) and Iran_N or CHG to show the extra Near Eastern admixture.

I don't think it would be necessary due to the low percentage, but just in case, some Mongolia_N or Devil_Cave_N for extra North/East Asian admixture.

Hamilcar
02-06-2021, 12:24 PM
in which way is it better than g25 ?

andre
02-06-2021, 12:40 PM
in which way is it better than g25 ?

G25 is good, but it's an amateur tool. qpAdm is an academic tool and more reliable.

Ion Basescul
02-06-2021, 12:41 PM
Could anyone who knows how to use qpAdm test some Balkan populations (Bulgarians, Romanians, Serbians)?

I'm very interested in how much extra Near Eastern admixture they get.

The model I think will work well is Barcin_N, WHG, some EBA steppe source (Yamnaya Samara, I think, is good) and Iran_N or CHG to show the extra Near Eastern admixture.

I don't think it would be necessary due to the low percentage, but just in case, some Mongolia_N or Devil_Cave_N for extra North/East Asian admixture.

There aren't enough academic samples from Romania anyway. The Behar ones are from Arges county and Reich's are a mix of Gorj and Alba. That's why K13 is king for now.

Zoro
02-06-2021, 01:35 PM
There aren't enough academic samples from Romania anyway. The Behar ones are from Arges county and Reich's are a mix of Gorj and Alba. That's why K13 is king for now.

That's the good thing about qpAdm: you can work with single samples, unlike the ADMIXTURE program, where you need multiple references for each population when you are creating it.

Lemminkäinen
02-06-2021, 02:17 PM
The new version is quite sensitive to no-calls / missing alleles. The good news is the low run time, which makes it possible to run speculative analyses.

JamesBond007
02-06-2021, 02:32 PM
That's the good thing about qpAdm: you can work with single samples, unlike the ADMIXTURE program, where you need multiple references for each population when you are creating it.

I still wish someone would make one big set of files (bed/bim/fam, or whatever PLINK format) merging all known European samples, or at least West European and East European sets, because honestly the thing I hate most about using this tool is merging sets. Some sets are an eclectic mix but have a few samples you need, etc. I'm not in the mood to hack up a set of shell or Perl scripts to do this kind of sh*t, and it would save duplication of effort if someone could host it on a server. Even without one big set of European files, just having non-eclectic, strictly European sets uploaded to a server would be helpful.

andre
02-06-2021, 03:48 PM
So, someone will run a model? :)

Token
02-06-2021, 05:07 PM
Could anyone who knows how to use qpAdm test some Balkan populations (Bulgarians, Romanians, Serbians)?

I'm very interested in how much extra Near Eastern admixture they get.

The model I think will work well is Barcin_N, WHG, some EBA steppe source (Yamnaya Samara, I think, is good) and Iran_N or CHG to show the extra Near Eastern admixture.

I don't think it would be necessary due to the low percentage, but just in case, some Mongolia_N or Devil_Cave_N for extra North/East Asian admixture.


So, someone will run a model? :)

Bulgarian
Yamnaya_Samara 0.306 ± 0.0352
Anatolia_N 0.454 ± 0.0303
WHG 0.0796 ± 0.0140
Iran_N 0.161 ± 0.0387
p-value 0.559

Romanian
Yamnaya_Samara 0.341 ± 0.0373
Anatolia_N 0.430 ± 0.0315
WHG 0.0795 ± 0.0155
Iran_N 0.149 ± 0.0426
p-value 0.0467

Leto
02-06-2021, 05:11 PM
How much East Eurasian are Russians according to that method? Use Devil's Gate Cave for reference.

Token
02-06-2021, 05:21 PM
in which way is it better than g25 ?

G25 is better for recent ancestry, qpAdm is far superior in dealing with ancients.

andre
02-06-2021, 06:27 PM
How much East Eurasian are Russians according to that method? Use Devil's Gate Cave for reference.

Someone should make a damn video tutorial for qpAdm, so everyone can use it.

vbnetkhio
02-06-2021, 07:08 PM
Bulgarian
Yamnaya_Samara 0.306 ± 0.0352
Anatolia_N 0.454 ± 0.0303
WHG 0.0796 ± 0.0140
Iran_N 0.161 ± 0.0387
p-value 0.559

Romanian
Yamnaya_Samara 0.341 ± 0.0373
Anatolia_N 0.430 ± 0.0315
WHG 0.0795 ± 0.0155
Iran_N 0.149 ± 0.0426
p-value 0.0467

Which outgroups should be used for a model like this? I tried these, but the model failed.

Russia_Sunghir3.SG
Russia_Kostenki14
Czech_Vestonice16
ONG.SG
Iran_Mesolithic
Russia_MA1_HG.SG
Anatolia_Epipaleolithic
Morocco_Iberomaurusian
Mbuti
Ethiopia_4500BP_published.SG
China_Tianyuan

Ajeje Brazorf
02-06-2021, 08:08 PM
Bulgarian
Yamnaya_Samara 0.306 ± 0.0352
Anatolia_N 0.454 ± 0.0303
WHG 0.0796 ± 0.0140
Iran_N 0.161 ± 0.0387
p-value 0.559

Romanian
Yamnaya_Samara 0.341 ± 0.0373
Anatolia_N 0.430 ± 0.0315
WHG 0.0795 ± 0.0155
Iran_N 0.149 ± 0.0426
p-value 0.0467

Same model but with Global25:

Scaled


Target: Romanian
Distance: 3.2702% / 0.03270239
50.8 TUR_Barcin_N
43.2 Yamnaya_RUS_Samara
6.0 WHG

Target: Bulgarian
Distance: 3.2130% / 0.03212961
51.8 TUR_Barcin_N
41.2 Yamnaya_RUS_Samara
5.4 WHG
1.6 IRN_Ganj_Dareh_N

Unscaled


Target: Romanian
Distance: 2.1592% / 0.02159224
53.0 TUR_Barcin_N
42.6 Yamnaya_RUS_Samara
3.6 WHG
0.8 IRN_Ganj_Dareh_N

Target: Bulgarian
Distance: 2.1636% / 0.02163633
53.6 TUR_Barcin_N
38.6 Yamnaya_RUS_Samara
4.2 WHG
3.6 IRN_Ganj_Dareh_N

Zoro
02-06-2021, 11:42 PM
Which outgroups should be used for a model like this? I tried these, but the model failed.

Russia_Sunghir3.SG
Russia_Kostenki14
Czech_Vestonice16
ONG.SG
Iran_Mesolithic
Russia_MA1_HG.SG
Anatolia_Epipaleolithic
Morocco_Iberomaurusian
Mbuti
Ethiopia_4500BP_published.SG
China_Tianyuan


Without knowing what outgroups were used, common sense would dictate that you should drop Iran-mes and Anatolia-epi from your outgroups, as they're too closely related to your sources Iran-n and Anatolia-N. I'm 100% sure you'll see your p-values increase if you do that. Let me know how much better your p-values become after dropping them.

After doing that, do another run replacing Mota with Khomani and Yana, and see what happens.

Zoro
02-06-2021, 11:49 PM
How much East Eurasian are Russians according to that method? Use Devil's Gate Cave for reference.

You have to specify whether you want to use all Neolithic pops or all BA pops to model them. Obviously, if you use Neolithic pops their EE will be higher than with BA pops, because the further forward you go in time, the more mixed Eurasian pops became.

Komintasavalta
02-24-2021, 06:54 AM
a new version of ADMIXTOOLS is out:
https://uqrmaie1.github.io/admixtools/index.html

it's faster and more user friendly.

has anybody tried it out yet?

I tried running this in R (https://github.com/uqrmaie1/admixtools):


install.packages("devtools")
devtools::install_github("uqrmaie1/admixtools")
library("admixtools")

`devtools` failed to install because there was an error installing a dependency of its dependency:


ERROR: dependency ‘gert’ is not available for package ‘usethis’
* removing ‘/usr/local/lib/R/4.0/site-library/usethis’
ERROR: dependency ‘usethis’ is not available for package ‘devtools’
* removing ‘/usr/local/lib/R/4.0/site-library/devtools’

When I tried running `install.packages("gert")`, I got this error:


Configuration failed to find libgit2 library. Try installing:
* brew: libgit2 (MacOS)

So after I ran `brew install libgit2`, I was able to successfully install `devtools`. But then when I tried to install `admixtools`, it failed because the dependency `igraph` could not be installed. Running `brew uninstall --ignore-dependencies suite-sparse` fixed it.

Then I downloaded `v44.3_HO_public.tar` from here: https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/index.html. The file `v44.3_1240K_public.tar` had fewer modern populations; for example, it didn't have Nganasan.

I picked populations names from the last column of the file `v44.3_1240K_public/v44.3_1240K_public.ind`. Then I ran these commands:


printf %s\\n Mari.SG>target
printf %s\\n Russia_AfontovaGora3 Russia_HG_Karelia Estonia_CordedWare Germany_EN_LBK Russia_MLBA_Sintashta Russia_Medieval_Nomad.SG Nganasan France_Rochedane Sweden_Motala_HG>left
printf %s\\n Mbuti.DG Israel_Natufian_published Mixe.DG Ami.DG Itelmen.DG Czech_Vestonice16 Serbia_IronGates_Mesolithic Russia_Shamanka_Eneolithic.SG Papuan.DG Russia_Ust_Ishim.DG Switzerland_Bichon.SG>right
cat left right target>pops
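
To avoid typos in those population names, it can help to first pull the distinct labels out of the third column of the .ind file and diff your list against them. A toy sketch of the idea (`toy.ind`, `all_pops` and `pops` are illustrative names; with the real data you would read `v44.3_HO_public/v44.3_HO_public.ind`):

```shell
# Toy .ind file: one line per sample, "<sample id> <sex> <population label>".
printf '%s\n' 'S_Armenian-1.DG M Armenian.DG' \
              'I0001 M Russia_HG_Karelia' \
              'I0002 F Russia_HG_Karelia' > toy.ind
# Distinct population labels available in the dataset.
awk '{print $3}' toy.ind | sort -u > all_pops
# Populations we intend to use.
printf '%s\n' Russia_HG_Karelia Nganasan > pops
# Print any requested name that is NOT in the dataset (here: Nganasan).
grep -vxF -f all_pops pops
```

Any name this prints would make extract_f2() fail, so it is worth running before the long extraction step.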

Then in R I ran commands like this:


library("admixtools")
extract_f2(pref="v44.3_HO_public/v44.3_HO_public",pops=readLines("pops"),outdir="myf2dir")
f2_blocks=f2_from_precomp("myf2dir")
qpadm(f2_blocks,left=readLines("left"),right=readLines("right"),target=readLines("target"))

Output:


$weights
# A tibble: 9 x 5
target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Mari.SG Russia_AfontovaGora3 -0.613 0.701 -0.874
2 Mari.SG Russia_HG_Karelia 0.0170 0.560 0.0304
3 Mari.SG Estonia_CordedWare 0.425 1.17 0.362
4 Mari.SG Germany_EN_LBK 0.352 0.965 0.365
5 Mari.SG Russia_MLBA_Sintashta 0.419 0.841 0.498
6 Mari.SG Russia_Medieval_Nomad.SG -0.399 0.558 -0.715
7 Mari.SG Nganasan 0.0751 0.454 0.166
8 Mari.SG France_Rochedane 0.128 0.582 0.220
9 Mari.SG Sweden_Motala_HG 0.595 3.22 0.185

$f4
# A tibble: 1,100 x 9
pop1 pop2 pop3 pop4 est se z p weight
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mari.SG Estonia_CordedWare Ami.DG Czech_Vestonice16 -0.0156 0.0146 -1.07 0.286 0.425
2 Mari.SG fit Ami.DG Czech_Vestonice16 -0.000149 0.0158 -0.00939 0.993 NA
3 Mari.SG France_Rochedane Ami.DG Czech_Vestonice16 0.0219 0.0255 0.859 0.390 0.128
4 Mari.SG Germany_EN_LBK Ami.DG Czech_Vestonice16 0.0297 0.0168 1.76 0.0776 0.352
5 Mari.SG Nganasan Ami.DG Czech_Vestonice16 0.00882 0.0140 0.628 0.530 0.0751
6 Mari.SG Russia_AfontovaGora3 Ami.DG Czech_Vestonice16 0.00938 0.0211 0.443 0.657 -0.613
7 Mari.SG Russia_HG_Karelia Ami.DG Czech_Vestonice16 0.0406 0.0251 1.62 0.106 0.0170
8 Mari.SG Russia_Medieval_Nomad.SG Ami.DG Czech_Vestonice16 0.0281 0.0235 1.20 0.231 -0.399
9 Mari.SG Russia_MLBA_Sintashta Ami.DG Czech_Vestonice16 0.0129 0.0150 0.865 0.387 0.419
10 Mari.SG Sweden_Motala_HG Ami.DG Czech_Vestonice16 0.00579 0.0165 0.350 0.726 0.595
# … with 1,090 more rows

$rankdrop
# A tibble: 9 x 7
f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 8 2 2.07 0.356 4 0.114 0.998
2 7 6 2.18 0.902 6 1.27 0.973
3 6 12 3.45 0.991 8 2.82 0.945
4 5 20 6.27 0.998 10 6.76 0.748
5 4 30 13.0 0.997 12 12.6 0.396
6 3 42 25.7 0.978 14 22.7 0.0649
7 2 56 48.4 0.755 16 28.1 0.0308
8 1 72 76.5 0.337 18 62.2 0.000000895
9 0 90 139. 0.000751 NA NA NA

$popdrop
# A tibble: 511 x 20
pat wt dof chisq p f4rank Russia_Afontova… Russia_HG_Karel… Estonia_CordedW… Germany_EN_LBK Russia_MLBA_Sin… Russia_Medieval… Nganasan
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0000… 0 2 2.07 0.356 8 -0.613 0.0170 0.425 0.352 0.419 -0.399 0.0751
2 0000… 1 3 2.03 0.567 7 -0.792 0.391 0.507 -0.0808 0.780 -0.217 0.295
3 0000… 1 3 1.48 0.687 7 -0.974 0.155 0.267 0.141 0.741 -0.400 0.112
4 0000… 1 3 1.45 0.694 7 -0.730 -0.0587 0.241 0.257 0.427 -0.420 NA
5 0000… 1 3 1.71 0.636 7 -1.78 -1.43 -2.94 -0.857 -0.0838 NA -0.709
6 0000… 1 3 1.43 0.699 7 -1.22 -0.416 -1.07 -0.137 NA -0.539 0.160
7 0001… 1 3 1.68 0.641 7 -1.06 -0.110 -0.410 NA 0.314 -0.465 0.154
8 0010… 1 3 1.77 0.622 7 -1.05 0.0144 NA 0.220 0.400 -0.329 0.133
9 0100… 1 3 1.54 0.673 7 -0.898 NA 0.155 0.287 0.483 -0.395 0.0972
10 1000… 1 3 1.41 0.704 7 NA -8.23 -18.7 1.51 -12.8 3.80 -3.54
# … with 501 more rows, and 7 more variables: France_Rochedane <dbl>, Sweden_Motala_HG <dbl>, feasible <lgl>, best <lgl>, dofdiff <dbl>, chisqdiff <dbl>,
# p_nested <dbl>

These pages were helpful:

https://uqrmaie1.github.io/admixtools/articles/admixtools.html
https://comppopgenworkshop2019.readthedocs.io/en/latest/contents/05_qpwave_qpadm/qpwave_qpadm.html

Korialstrasz
02-24-2021, 04:37 PM
Ok, so I got it running with my own data. I thought I would have to use Linux for this and wasted a considerable amount of time trying that, but I managed to do it ENTIRELY on my Windows PC.


*I use R and Python daily, and I can say that you need almost no proficiency in either of these languages. But you will need them installed properly, nonetheless.
*download one of the Reich datasets (https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data)

Here is what I did, roughly:
1- converted my FTDNA raw to 23andme v2 raw using DNA kit studio.

2- used this (http://jade-cheng.com/au/23andme-to-plink/) script to convert the raw file to .map and .ped files. (This requires Python; I have Anaconda installed, so I just used the terminal in the conda environment, but if you don't have Python in your PATH, you will have to add it.) Just go to the directory (cd Path) in which your raw file and the script reside and run this command:


python 23andme-to-plink.py yourfile.txt
3- this gives you two files in the same directory ( yourfile.ped and yourfile.map).

4- download plink (https://zzz.bwh.harvard.edu/plink/download.shtml)

5- the folder to which you extract the contents of the archive matters. I did not want to bother with PATH, so I just unpacked the contents into the same folder as above.

6- now that you have both plink and your .ped and .map files in the same folder, you can run the following command from the terminal: plink --file yourfile --make-bed --out yourfile_new (I only added the _new suffix here to show that it is the name of the output files you will be getting). Now you should have three files, which constitute the plink binary format. You will be merging these with the Reich dataset to run qpadm for yourself.

7- Start R (preferably RStudio; I like using the embedded terminal there). Install admixtools and all the other packages that are deemed necessary, if you haven't already (check the admixtools R package landing page for that):

make sure that your working directory is the same as the folder that contains the necessary files. (getwd() will tell you your current working directory. You can use setwd(yourdirectory) to change it.)

8- Ok, so far so good. The Reich dataset is distributed in the EIGENSTRAT format. We are going to have to convert this to plink too, so that we can merge the raw data with the dataset. But the entire dataset is too big; you will have to use a subset of it. Luckily, the admixtools R package offers some functions for that.

the EIGENSTRAT structure, as far as I have seen, comes with 3 sub-files: .geno, .ind, and .snp. You may view the .ind file with a text editor to see which populations are available to us. (If you know the samples by heart, that is even better.)
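
To get a quick overview of what the .ind file offers, you can also count samples per population label; labels with very few samples give noisier f-statistics. A toy sketch (`toy.ind` stands in for the real `v44.3_HO_public.ind`):

```shell
# Toy .ind file; the 3rd column is the population label.
printf '%s\n' 'I0001 M Mbuti.DG' 'I0002 F Mbuti.DG' 'I0003 M Papuan.DG' > toy.ind
# Count samples per label, biggest populations first.
awk '{count[$3]++} END {for (p in count) print count[p], p}' toy.ind | sort -rn
```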

This code below shows the subset of populations I used.

(For the "right-hand side" populations I used Zoro´s advice in this thread: https://www.theapricity.com/forum/showthread.php?325104-qpAdm-modelling-first-attempt&highlight=admixtools I have not read the theory yet so I am practically a layman at this. This is just an ad-hoc solution and I tried to replace the populations that were not present in the dataset. I would really love to get hear some advice in this regard. For the "left-hand side" populations I picked the relevant pops for my ancestry without much overlap) The code below should work and this will give you a vector of the population names with which you will be using to filter the dataset.



pops <- c( #right

"Mbuti.DG",
"Russia_Ust_Ishim.DG",
"China_Tianyuan",
"Goyet_Neanderthal.SG",
"Russia_Sunghir3.SG",
"Russia_Kostenki14.SG",
"Morocco_Iberomaurusian",
"Israel_Natufian_published",
"Spain_ElMiron",
"Russia_MA1_HG.SG",
"Georgia_Satsurblia.SG",
"Russia_DevilsCave_N.SG",
"Papuan.DG",
"Turkey_N.SG",
"Iran_GanjDareh_N",
"Switzerland_Bichon.SG",
#left
"Adygei",
"Turkmen",
"Bulgarian",
)



Assuming that you have extracted the contents of the Reich dataset to the same folder, the code below uses the conversion function that is available in the admixtools package. It will use all three parts of the EIGENSTRAT structure and create 3 plink files with the extensions .bed, .bim and .fam. (You may have noticed that now both your raw data and the dataset have the same format.)

The first argument requires the prefix of the EIGENSTRAT files, and the second one, outpref, is arbitrary; this is going to be the prefix of your plink files. The pops argument uses the populations you chose to subset the dataset. (It is impossible to take in all of them without a supercomputer.)


eigenstrat_to_plink("v44.3_HO_public",outpref = "master_plink",pops = pops)

9- Now we go back to the terminal to merge the files.


plink --bfile master_plink --bmerge yourfile_new.bed yourfile_new.bim yourfile_new.fam --make-bed --out merged_data


master_plink = reich dataset in plink format
yourfile_new = your raw data in plink format
merged_data = prefix of the merged data in plink format

10- OK, if everything goes smoothly we only have two more steps until we successfully run qpadm. While the manual says there are other ways of doing it, I chose to do it the following way:
*Extract the f2 statistics from the merged dataset.


extract_f2("merged_data",outdir = "f2_new")

The code above looks for your input files and creates a new folder with many output files in it. Since we have just created a set of plink files with the prefix merged_data, we will be using that as the input. The second argument, outdir, is asking for the name of the folder within which the f2 statistics are going to be stored. This is arbitrary, but we will be using that folder to get the qpadm estimation.

11- Now, the qpadm part. It is vital to look inside the folder that has just been created. Look for the name of your raw data. This can be anything if you have not changed it before, and assuming that you have used the script I linked, the name of your file will likely have the suffix "FAM". Find it and add it among the "left" populations, AND set it as the target. As you might expect, you cannot introduce extra populations here, but if you are somehow not satisfied with the result, you can remove some of the populations. If you would like to have more populations, you are going to have to do the subset-merge process over again.


qpa <- qpadm(data = "f2_new",left = c("YOURFILE","Adygei","Turkmen", "Bulgarian"),right = c( "Mbuti.DG",
"Russia_Ust_Ishim.DG",
"China_Tianyuan",
"Goyet_Neanderthal.SG",
"Russia_Sunghir3.SG",
"Russia_Kostenki14.SG",
"Morocco_Iberomaurusian",
"Israel_Natufian_published",
"Spain_ElMiron",
"Russia_MA1_HG.SG",
"Georgia_Satsurblia.SG",
"Russia_DevilsCave_N.SG",
"Papuan.DG",
"Turkey_N.SG",
"Iran_GanjDareh_N",
"Switzerland_Bichon.SG"
),target = "YOURFILE")

This will run and store the results in an object named "qpa". qpa$weights will give you the weight estimations; if you are using RStudio you will be able to see what other values you can access. As far as I understand, we are looking for weight estimations between 0 and 1, with low standard errors and high p-values. This makes sense, as the estimations sum to 1 and look like admixture proportion estimates.

I have got these results:


left weight se z
<chr> <dbl> <dbl> <dbl>
Adygei 0.439 0.595 0.738
Turkmen 0.348 0.304 1.14
Bulgarian 0.213 0.404 0.527
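
As a sanity check on how to read this table: the z column is simply weight/se, and the weights should sum to roughly 1. Recomputing both from the numbers above (just arithmetic, not an admixtools call):

```shell
# Recompute z = weight/se for each row and total the weights.
printf '%s\n' 'Adygei 0.439 0.595' 'Turkmen 0.348 0.304' 'Bulgarian 0.213 0.404' |
awk '{printf "%s z=%.3f\n", $1, $2/$3; sum += $2} END {printf "sum=%.3f\n", sum}'
```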

Not bad for a first attempt, I guess. If you put closely related populations together, you will likely get negative estimations with very high standard errors.
I would like to reiterate that I have no prior knowledge of population genetics and am quite ignorant compared to other Apricians. For all I know, what I attempted might just be bullshit.


I would also like to add that, after extracting the f2 statistics, I get notified that

√ 1034771 SNPs read in total
! 1331 SNPs remain after filtering. 1331 are polymorphic.
i Allele frequency matrix for 1331 SNPs and 22 populations is 0 MB

I am not sure if this is normal, but it seemed suspicious to me, as it eliminates almost the entirety of the SNPs. (I checked to see how many SNPs my FTDNA data and the Reich dataset have in common, and it turned out to be a little fewer than 130k. Both the Reich data and my raw data have almost 600k lines for SNPs, a significant amount of which I believe are either no-calls or missing values.)
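
The overlap check described above can be scripted with comm on the rsID columns. A toy sketch (`raw.txt` and `toy.snp` are made-up stand-ins for a 23andme-style raw file and the dataset's .snp file; real files are tab-separated and have comment/header lines you would also skip):

```shell
# Toy raw genotype file (rsid chrom pos genotype) and toy EIGENSTRAT .snp file.
printf '%s\n' 'rs1 1 100 AA' 'rs2 1 200 --' 'rs3 1 300 CT' > raw.txt
printf '%s\n' 'rs2 1 0.0 200 A G' 'rs3 1 0.0 300 C T' 'rs9 1 0.0 900 G T' > toy.snp
# Count rsIDs present in both files (here: rs2 and rs3, so 2).
comm -12 <(cut -d' ' -f1 raw.txt | sort) <(cut -d' ' -f1 toy.snp | sort) | wc -l
```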

Korialstrasz
02-24-2021, 05:21 PM
I checked the converted 23andme file and noticed that a huge chunk of SNPs was now "--". I thought it would just convert the contents of the raw data to another format; turns out I was wrong. Therefore, I manually converted the FTDNA file to 23andme format myself, and now I have 2 times the SNPs in the allele frequency matrix. Strange.

vbnetkhio
02-24-2021, 05:33 PM
Then I downloaded `v44.3_HO_public.tar` from here: https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/index.html. The file `v44.3_1240K_public.tar` had less modern populations, and for example it didn't have Nganasan.


it's best to merge these 2. HO has more modern populations, but 1240K has all of the ancients and many important moderns with many more SNPs.

also check out the data here:
https://evolbio.ut.ee/

Zoro
02-24-2021, 09:57 PM
I have got these results:


left weight se z
<chr> <dbl> <dbl> <dbl>
Adygei 0.439 0.595 0.738
Turkmen 0.348 0.304 1.14
Bulgarian 0.213 0.404 0.527

Not bad for a first attempt, I guess. If you put closely related populations together, you will likely get negative estimations with very high standard errors.
I would like to reiterate that I have no prior knowledge of population genetics and am quite ignorant compared to other Apricians. For all I know, what I attempted might just be bullshit.


I would also like to add that, after extracting the f2 statistics, I get notified that

√ 1034771 SNPs read in total
! 1331 SNPs remain after filtering. 1331 are polymorphic.
i Allele frequency matrix for 1331 SNPs and 22 populations is 0 MB

I am not sure if this is normal, but it seemed suspicious to me, as it eliminates almost the entirety of the SNPs. (I checked to see how many SNPs my FTDNA data and the Reich dataset have in common, and it turned out to be a little fewer than 130k. Both the Reich data and my raw data have almost 600k lines for SNPs, a significant amount of which I believe are either no-calls or missing values.)


First, congrats on getting the software running, and absolutely no, you are not more ignorant than other people here. In fact, 98% of the people wouldn't even have a clue as to what you just wrote. I would say you're more knowledgeable than 98% of the people here!

Ok, so here's a few observations and tips:

1- "! 1331 SNPs remain after filtering. 1331 are polymorphic." This is absolutely not acceptable and will give you horrible results and in fact is mostly responsible for the 114% standard errors you got on Turkmen. Although its very important for Admixtools 2 not to have missing SNPs in any of your samples ( in other words maxmiss=0) it's just as important that you salvage at least 100,000 SNPs. Drop low coverage samples if you have to

2- Assuming you were able to get close to 100,000 SNPs: if you still get high SE, it means your left pops are too closely related and your right pops are unable to properly distinguish between them. So add some right pops that are very differentially related to one left pop vs the other left pop.

3- Let me know if you need a simple script to convert your FTDNA or Ancestry data to 23andme format

You're on a good track, Good luck !

Korialstrasz
02-25-2021, 07:08 PM
First, congrats on getting the software running, and absolutely no, you are not more ignorant than other people here. In fact, 98% of the people wouldn't even have a clue as to what you just wrote. I would say you're more knowledgeable than 98% of the people here!

Ok, so here's a few observations and tips:

1- "! 1331 SNPs remain after filtering. 1331 are polymorphic." This is absolutely not acceptable and will give you horrible results and in fact is mostly responsible for the 114% standard errors you got on Turkmen. Although its very important for Admixtools 2 not to have missing SNPs in any of your samples ( in other words maxmiss=0) it's just as important that you salvage at least 100,000 SNPs. Drop low coverage samples if you have to

2- Assuming you were able to get close to 100,000 SNPs: if you still get high SE, it means your left pops are too closely related and your right pops are unable to properly distinguish between them. So add some right pops that are very differentially related to one left pop vs the other left pop.

3- Let me know if you need a simple script to convert your FTDNA or Ancestry data to 23andme format

You're on a good track, Good luck !

Thanks! Your instructions have been tremendously helpful. I think I managed to convert the FTDNA file myself without losing any SNPs, but I don't know if I missed anything.

I took the part below from the admixtools documentation and this is pretty much in line with what you advise.


By default, extract_f2() will be very cautious and exclude all SNPs which are missing in any population (maxmiss = 0). If you lose too many SNPs this way, you can either

*limit the number of populations for which to extract f2-statistics,
*compute f3- and f4-statistics directly from genotype files, or
*increase the maxmiss parameter (maxmiss = 1 means no SNPs will be excluded).
The advantages and disadvantages of the different approaches are described here. Briefly, when running qpadm() and qpdstat() it can be better to choose the safer but slower options 1 and 2, while for qpgraph(), which is not centered around hypothesis testing, it is usually fine to choose option 3. Since the absolute difference in f-statistics between these approaches is usually small, it can also make sense to use option 3 for exploratory analyses, and confirm key results using options 1 or 2.

I tried different maxmiss values to salvage some SNPs, but the models I ran afterwards did not make much sense. I need to try different sets of populations, it seems. I had the impression that the right-hand side populations function akin to a "control variable"; so, would it then make sense to run an analysis on modern populations using, let's say, Iron Age samples that provide enough "control" for the left? Or is it better not to take too many liberties in this regard?


I'll be reading the instructions here: https://www.biorxiv.org/content/biorxiv/early/2020/04/10/2020.04.09.032664/DC1/embed/media-1.pdf

Zoro
02-26-2021, 12:49 AM
Thanks! Your instructions have been tremendously helpful. I think I managed to convert the FTDNA file myself without losing any SNPs, but I don't know if I missed anything.

I took the part below from the admixtools documentation and this is pretty much in line with what you advise.



I tried different maxmiss values to salvage some SNPs, but the models I ran afterwards did not make much sense. I need to try different sets of populations, it seems. I had the impression that the right-hand side populations function akin to a "control variable"; so, would it then make sense to run an analysis on modern populations using, let's say, Iron Age samples that provide enough "control" for the left? Or is it better not to take too many liberties in this regard?


I'll be reading the instructions here: https://www.biorxiv.org/content/biorxiv/early/2020/04/10/2020.04.09.032664/DC1/embed/media-1.pdf


Post a couple of runs here showing me all the details of the output, such as number of SNPs and right and left pops, and I'll try to diagnose it for you. I would use maxmiss=0.002 or 0.003

vbnetkhio
02-26-2021, 01:37 PM
some might find this useful:

I made an AncestryDNA raw data to .ped converter script for R:
106242

to run it, rename your raw data to "data.txt", then place the "data.txt" and "anc_to_ped.r" into your R directory, and run this command in R: source("anc_to_ped.r")

The file has to be in the AncestryDNA format.
If you have a different format, e.g. 23andme, you can convert it first with DNA Kit Studio (don't use a raw data template, just choose the AncestryDNA format)

Kaspias
02-26-2021, 05:14 PM
So I felt a need to learn how to run qpAdm, but pretty much beginner in these tools.

I get this error while trying to create the 3rd file in plink:


1426149 (of 1426149) markers to be included from [ data.map ]

ERROR:
A problem with line 1 in [ data.ped ]
Expecting 6 + 2 * 1426149 = 2852304 columns, but found 2842162
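
That error message is pure arithmetic: a .ped row needs 6 family/ID fields plus 2 alleles per marker, so the field count should be 6 + 2 × markers. You can check what a file actually contains before handing it to plink. A toy sketch (the `data.ped` here is a fabricated two-marker example, not the real file):

```shell
# Toy .ped line: 6 ID fields plus 2 alleles for each of 2 markers = 10 fields.
printf '%s\n' 'fam1 ind1 0 0 0 1 A A C T' > data.ped
# Report the field count of line 1 and the implied number of markers.
awk 'NR==1 {print "fields:", NF, "markers:", (NF-6)/2; exit}' data.ped
```

If the reported marker count doesn't match the .map line count, the .ped rows are missing or gaining fields somewhere.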

vbnetkhio
02-26-2021, 05:23 PM
So I felt a need to learn how to run qpAdm, but pretty much beginner in these tools.

I get this error while trying to create the 3rd file in plink:


1426149 (of 1426149) markers to be included from [ data.map ]

ERROR:
A problem with line 1 in [ data.ped ]
Expecting 6 + 2 * 1426149 = 2852304 columns, but found 2842162

did you use my script?
there's a bug, of course... I'll try to fix it.

edit:
all seems to work fine for me, what are you trying to do with the file?

Kaspias
02-26-2021, 07:05 PM
did you use my script?
there's a bug, of course... I'll try to fix it.

edit:
all seems to work fine for me, what are you trying to do with the file?

I was following Korialstrasz's entries in #68. Here is what I have done:

I have got the .ped and .map files, but while using this command: plink --file yourfile --make-bed --out yourfile_new to get the plink binary files (bed, bim, fam), I received the error I posted. I used the R script you posted in order to get the .ped and .map.

Besides, while extracting the populations from the EIGENSTRAT file I could not manage to get multiple populations within the file, but only one pop, like: eigenstrat_to_plink("v44.3_HO_public",outpref = "master_plink",pops = 316)

I think I will have some more problems in the following steps as I'm clueless, but that's it for now :D

vbnetkhio
02-26-2021, 07:11 PM
I was following Korialstrasz's entries in #68. Here is what I have done:

I have got the .ped and .map files, but while using this command: plink --file yourfile --make-bed --out yourfile_new to get the plink binary files (bed, bim, fam), I received the error I posted. I used the R script you posted in order to get the .ped and .map.

Besides, while extracting the populations from the EIGENSTRAT file I could not manage to get multiple populations within the file, but only one pop, like: eigenstrat_to_plink("v44.3_HO_public",outpref = "master_plink",pops = 316)

I think I will have some more problems in the following steps as I'm clueless, but that's it for now :D

did you convert your raw data to ancestry format first? (with allele1 and allele2 in separate columns?)
that seems to be causing your error.

vbnetkhio
02-26-2021, 07:12 PM
...

Kaspias
02-26-2021, 07:16 PM
did you convert your raw data to ancestry format first?

I have. However, I used a super kit (created from 3 different raw data files), and it is ~40MB in size while an average raw data file is 15-20MB; I'm stating this in case it might be relevant.

vbnetkhio
02-26-2021, 08:44 PM
I have. However, I used a super kit (created from 3 different raw data files), and it is ~40MB in size while an average raw data file is 15-20MB; I'm stating this in case it might be relevant.

check if it works now:
106247

Kaspias
02-26-2021, 09:01 PM
check if it works now:
106247

I have tried with regular MyHeritage file this time, converted it to Ancestry.


@----------------------------------------------------------@
| PLINK! | v1.07 | 10/Aug/2009 |
|----------------------------------------------------------|
| (C) 2009 Shaun Purcell, GNU General Public License, v2 |
|----------------------------------------------------------|
| For documentation, citation & bug-report instructions: |
| http://pngu.mgh.harvard.edu/purcell/plink/ |
@----------------------------------------------------------@

Web-based version check ( --noweb to skip )
Connecting to web... failed connection

Problem connecting to web

Writing this text to log file [ data_new.log ]
Analysis started: Sat Feb 27 00:52:58 2021

Options in effect:
--file data
--make-bed
--out data_new

1426149 (of 1426149) markers to be included from [ data.map ]

ERROR:
A problem with line 1 in [ data.ped ]
Expecting 6 + 2 * 1426149 = 2852304 columns, but found 2842162



The format in the .ped file is like this:


name1 name2 0 0 0 1 T T 0 0 A A A A A A G G A A G G

So actually it is how it should be. There is a blank between the alleles.

vbnetkhio
02-26-2021, 09:04 PM
I have tried with regular MyHeritage file this time, converted it to Ancestry.


@----------------------------------------------------------@
| PLINK! | v1.07 | 10/Aug/2009 |
|----------------------------------------------------------|
| (C) 2009 Shaun Purcell, GNU General Public License, v2 |
|----------------------------------------------------------|
| For documentation, citation & bug-report instructions: |
| http://pngu.mgh.harvard.edu/purcell/plink/ |
@----------------------------------------------------------@

Web-based version check ( --noweb to skip )
Connecting to web... failed connection

Problem connecting to web

Writing this text to log file [ data_new.log ]
Analysis started: Sat Feb 27 00:52:58 2021

Options in effect:
--file data
--make-bed
--out data_new

1426149 (of 1426149) markers to be included from [ data.map ]

ERROR:
A problem with line 1 in [ data.ped ]
Expecting 6 + 2 * 1426149 = 2852304 columns, but found 2842162



The format in the .ped file is like this:


name1 name2 0 0 0 1 T T 0 0 A A A A A A G G A A G G

So actually it is how it should be. There is a blank between the alleles.

that looks like a very old version of plink, maybe it doesn't support tab separated .ped

try with plink 1.9:
https://www.cog-genomics.org/plink/

Kaspias
02-26-2021, 09:18 PM
that looks like a very old version of plink, maybe it doesn't support tab separated .ped

try with plink 1.9:
https://www.cog-genomics.org/plink/


Possibly irregular .ped line. Restarting scan, assuming multichar alleles.
Rescanning .ped file... 0%
Error: Half-missing call in .ped file at variant 1389874, line 1.

vbnetkhio
02-26-2021, 09:26 PM
Possibly irregular .ped line. Restarting scan, assuming multichar alleles.
Rescanning .ped file... 0%
Error: Half-missing call in .ped file at variant 1389874, line 1.

try with this one...
https://filebin.net/c73ep7a8u1pfgtxw

the "space" version should work with the older version of plink, the other one with the newer.

Zoro
02-26-2021, 10:48 PM
I have tried with regular MyHeritage file this time, converted it to Ancestry.


@----------------------------------------------------------@
| PLINK! | v1.07 | 10/Aug/2009 |
|----------------------------------------------------------|
| (C) 2009 Shaun Purcell, GNU General Public License, v2 |
|----------------------------------------------------------|
| For documentation, citation & bug-report instructions: |
| http://pngu.mgh.harvard.edu/purcell/plink/ |
@----------------------------------------------------------@

Web-based version check ( --noweb to skip )
Connecting to web... failed connection

Problem connecting to web

Writing this text to log file [ data_new.log ]
Analysis started: Sat Feb 27 00:52:58 2021

Options in effect:
--file data
--make-bed
--out data_new

1426149 (of 1426149) markers to be included from [ data.map ]

ERROR:
A problem with line 1 in [ data.ped ]
Expecting 6 + 2 * 1426149 = 2852304 columns, but found 2842162



The format in .ped file is as such:


name1 name2 0 0 0 1 T T 0 0 A A A A A A G G A A G G

So actually it is how it should be. There is a blank between the alleles.

Plink doesn’t like Ancestry format. Just convert whatever you have to 23andme format and then convert the 23andme to Plink bed bim fam. It’s super easy. Also use Plink 1.9

Ancestry alleles are tab separated, e.g. A A, whereas 23andme has no space, e.g. AA
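
Since the only difference is that split last column, joining the two allele fields is a one-line awk job. A hedged sketch (`ancestry.txt`/`23andme.txt` are illustrative names; a real converter would also skip the header and # comment lines and handle no-calls):

```shell
# Toy Ancestry-style file: rsid chrom pos allele1 allele2.
printf '%s\n' 'rs1 1 100 A A' 'rs2 1 200 C T' > ancestry.txt
# Emit 23andme-style rows: rsid chrom pos genotype (alleles concatenated).
awk 'BEGIN{OFS="\t"} {print $1, $2, $3, $4 $5}' ancestry.txt > 23andme.txt
cat 23andme.txt
```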

vbnetkhio
02-27-2021, 06:08 AM
Plink doesn’t like Ancestry format. Just convert whatever you have to 23andme format and then convert the 23andme to Plink bed bim fam. It’s super easy. Also use Plink 1.9

Ancestry alleles are tab separated, e.g. A A, whereas 23andme has no space, e.g. AA

none of the 23andme-to-plink scripts worked for me without errors. With my script I successfully converted my file, and also some ancients which I had in the 23andme format.
Kaspias' file seems to have some empty fields, and some fields which aren't ACGT0, so I had to adapt the script.

Zoro
02-27-2021, 10:32 AM
none of the 23andme-to-plink scripts worked for me without errors. With my script I successfully converted my file, and also some ancients which I had in the 23andme format.
Kaspias' file seems to have some empty fields, and some fields which aren't ACGT0, so I had to adapt the script.

What do you mean?

I've converted hundreds of 23 files to plink using:

......./plink --23file sample.txt sample --out sample_plink where ..... is the path to the plink executable. The highlighted sample is the name the sample will be given inside the .fam file


Then you can merge your "sample" file with your "master" file using

.......//plink --bfile master --bmerge sample.bed sample.bim sample.fam --make-bed --out master_new



As far as converting some format to 23 format just post one line of your format (FTDNA, Living or whatever) and I'll make you a Unix script to convert it in 2 minutes

vbnetkhio
02-27-2021, 10:51 AM
......./plink --23file sample.txt sample --out sample_plink

is that a plink function or a separate script??

edit:
found it. I wasn't aware of this function; I was using some old 23andme-to-ped scripts which didn't produce proper ped files.

Zoro
02-27-2021, 11:08 AM
is that a plink function or a separate script??

edit:
found it. I wasn't aware of this function; I was using some old 23andme-to-ped scripts which didn't produce proper ped files.

Whaaaat ?

Stay as far as you can from ped files unless you absolutely have to such as converting plink to eigenstrat. They take up a ton more space than bed files

Kaspias
02-27-2021, 05:32 PM
try with this one...
https://filebin.net/c73ep7a8u1pfgtxw

the "space" version should work with the older version of plink, the other one with the newer.

This did not work either. But I was able to convert my data to plink. I could not do it using Python at first, but when I used conda it worked.

Edit: I fixed the other problem too. But still stuck while extracting populations...

Kaspias
03-02-2021, 05:01 PM
First run:

√ 1379990 SNPs read in total
! 6558 SNPs remain after filtering. 6468 are polymorphic.


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Turkmen 0.164 0.0991 1.65
2 Kaspias Bulgarian 0.836 0.0991 8.44




Many thanks to @Korialstrasz for the tutorial, and to @vbnetkhio and @Zoro for giving a hand. I'd like to see some feedback on it so I can try to improve.

Zoro
03-02-2021, 05:33 PM
First run:

√ 1379990 SNPs read in total
! 6558 SNPs remain after filtering. 6468 are polymorphic.


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Turkmen 0.164 0.0991 1.65
2 Kaspias Bulgarian 0.836 0.0991 8.44




Many thanks to @Korialstrasz for the tutorial, and to @vbnetkhio and @Zoro for giving a hand. I'd like to see some feedback on it so I can try to improve.

6558 SNPs is too low for an accurate comparison. Even though you weren't WGS genotyped, I can still get you up to about 74,000 SNPs.

I assume you were able to create the Plink files from the Reich dataset based on the par file I wrote you. Can you post the plink .log file? It'll help me diagnose a few things.

Assuming you got the 1240K SNPs into your Plink data, you can reach about 70,000 overlapping SNPs if you use only your data and the Simons samples in the Reich dataset. You'll recognize them because their IDs start with S_, such as "S_Armenian-1.DG".

If you only use your personal file and the Simmons ones starting in S_ when you extract in Admixtools 2 using:

extract_f2(pref, f2dir, pops = c(

then you should end up with about 70,000 SNPs

Zoro
03-02-2021, 05:34 PM
First run:

√ 1379990 SNPs read in total
! 6558 SNPs remain after filtering. 6468 are polymorphic.


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Turkmen 0.164 0.0991 1.65
2 Kaspias Bulgarian 0.836 0.0991 8.44




Many thanks to @Korialstrasz for the tutorial, and to @vbnetkhio and @Zoro for giving a hand. I'd like to see some feedback on it so I can try to improve.


Congrats on making Plink files and using Admixtools. It's your gateway to much more meaningful analysis than merely using Vahaduo all the time!

6558 SNPs is too low for an accurate comparison. Even though you weren't WGS genotyped, I can still get you up to about 200,000 SNPs.

I assume you were able to create the Plink files from the Reich dataset based on the par file I wrote you. Can you post the plink .log file? It'll help me diagnose a few things.

Assuming you got the 1240K SNPs into your Plink data, you can reach about 200,000 overlapping SNPs if you use only your data and the Simons samples in the Reich dataset. You'll recognize them because their IDs start with S_, such as "S_Armenian-1.DG".

If you only use your personal file and the Simmons ones starting in S_ when you extract in Admixtools 2 using:

extract_f2(pref, f2dir, pops = c(

then you should end up with about 200,000 SNPs

Zoro
03-02-2021, 05:41 PM
I use Admixtools 2 based on Eigenstrat geno snp and ind files.

Here are some 1240K pops I use a lot because they don't drop your SNP counts:

extract_f2(pref, f2dir, pops = c('Eskimo_Sireniki.DG',
'Punjabi',
'Turkmen','Pathan','Kalash','Bashkir','Kotias',
'Tatar_Volga','Turkish','Iranian','Armenian',
'Saami','Georgian','Jordanian',
'Estonian','Bulgarian','Sardinian','Avar','Hazara',
'Khomani_San',
'Papuan',
'Chukchi','Han','Uyghur','Mansi',
'Mongola','Buryat','Yakut','Adygei','Burmese','Jew_Iraqi',
'Russia_Abkhasian',
'Karelia','Balochi','Brahui'), maxmiss=0, verbose=TRUE)

Zoro
03-02-2021, 05:43 PM
.......

Kaspias
03-02-2021, 07:30 PM
Congrats on making Plink files and using Admixtools. It's your gateway to much more meaningful analysis than merely using Vahaduo all the time !

6558 SNPs is too low for an accurate comparison. Even though you weren't WGS genotyped, I can still get you up to about 200,000 SNPs.

I assume you were able to create the Plink files from the Reich dataset based on the par file I wrote you. Can you post the plink .log file? It'll help me diagnose a few things.

Assuming you got the 1240K SNPs into your Plink data, you can reach about 200,000 overlapping SNPs if you use only your data and the Simons samples in the Reich dataset. You'll recognize them because their IDs start with S_, such as "S_Armenian-1.DG".

If you only use your personal file and the Simmons ones starting in S_ when you extract in Admixtools 2 using:

extract_f2(pref, f2dir, pops = c(

then you should end up with about 200,000 SNPs

Log:


.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (960586 variants, 1 person).
--file: Kaspias_n-temporary.bed + Kaspias_n-temporary.bim +
Kaspias_n-temporary.fam written.
960586 variants loaded from .bim file.
1 person (0 males, 0 females, 1 ambiguous) loaded from .fam.
Ambiguous sex ID written to Kaspias_n.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 0 founders and 1 nonfounder present.
Calculating allele frequencies... done.
Total genotyping rate is 0.974373.
960586 variants and 1 person pass filters and QC.
Note: No phenotypes present.
--make-bed to Kaspias_n.bed + Kaspias_n.bim + Kaspias_n.fam ... done.

For merging:


54 people loaded from master_plink.fam.
1 person to be merged from Kaspias_n.fam.
Of these, 1 is new, while 0 are present in the base dataset.
597573 markers loaded from master_plink.bim.
960586 markers to be merged from Kaspias_n.bim.
Of these, 782417 are new, while 178169 are present in the base dataset.
Warning: Variants 'rs144847714' and 'rs10492943' have the same position.
Warning: Variants 'rs3205229' and 'rs2229002' have the same position.
Warning: Variants 'rs769902' and 'rs201435286' have the same position.
748 more same-position warnings: see log file.
Performing single-pass merge (55 people, 1379990 variants).
Merged fileset written to merged_data-merge.bed + merged_data-merge.bim +
merged_data-merge.fam .
1379990 variants loaded from .bim file.
55 people (0 males, 0 females, 55 ambiguous) loaded from .fam.
Ambiguous sex IDs written to merged_data.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 54 founders and 1 nonfounder present.
Calculating allele frequencies... done.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate is 0.432004.
1379990 variants and 55 people pass filters and QC.
Note: No phenotypes present.
--make-bed to merged_data.bed + merged_data.bim + merged_data.fam ... done.
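The merge log's "782417 are new, while 178169 are present in the base dataset" line is just set arithmetic on the variant IDs in the two .bim files. A toy sketch of the same bookkeeping (the .bim lines here are made up for illustration):

```python
# Count how many variants two PLINK .bim filesets share, mirroring plink's
# "X are new, while Y are present in the base dataset" merge message.
# A .bim line is: chrom, variant-ID, cM, position, allele1, allele2.
def bim_ids(lines):
    return {line.split()[1] for line in lines if line.strip()}

master = [
    "1 rs1 0 1000 A G",
    "1 rs2 0 2000 C T",
    "1 rs3 0 3000 G A",
]
sample = [
    "1 rs2 0 2000 C T",
    "1 rs4 0 4000 T C",
]
overlap = bim_ids(master) & bim_ids(sample)   # present in the base dataset
new = bim_ids(sample) - bim_ids(master)       # new variants
print(len(overlap), len(new))  # 1 1
```

Running the same intersection on your real .bim files tells you the usable overlap before you ever merge.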

Kaspias
03-02-2021, 07:51 PM
Here it is...

√ 1379990 SNPs read in total
! 173175 SNPs remain after filtering. 170906 are polymorphic.


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Bulgarian.DG 0.776 0.0339 22.9
2 Kaspias Turkmen.SG 0.224 0.0339 6.60



target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Bulgarian.DG 0.798 0.0299 26.7
2 Kaspias Uzbek.SG 0.202 0.0299 6.76

Zoro
03-02-2021, 08:03 PM
Log:


.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (960586 variants, 1 person).
--file: Kaspias_n-temporary.bed + Kaspias_n-temporary.bim +
Kaspias_n-temporary.fam written.
960586 variants loaded from .bim file.
1 person (0 males, 0 females, 1 ambiguous) loaded from .fam.
Ambiguous sex ID written to Kaspias_n.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 0 founders and 1 nonfounder present.
Calculating allele frequencies... done.
Total genotyping rate is 0.974373.
960586 variants and 1 person pass filters and QC.
Note: No phenotypes present.
--make-bed to Kaspias_n.bed + Kaspias_n.bim + Kaspias_n.fam ... done.

For merging:


54 people loaded from master_plink.fam.
1 person to be merged from Kaspias_n.fam.
Of these, 1 is new, while 0 are present in the base dataset.
597573 markers loaded from master_plink.bim.
960586 markers to be merged from Kaspias_n.bim.
Of these, 782417 are new, while 178169 are present in the base dataset.
Warning: Variants 'rs144847714' and 'rs10492943' have the same position.
Warning: Variants 'rs3205229' and 'rs2229002' have the same position.
Warning: Variants 'rs769902' and 'rs201435286' have the same position.
748 more same-position warnings: see log file.
Performing single-pass merge (55 people, 1379990 variants).
Merged fileset written to merged_data-merge.bed + merged_data-merge.bim +
merged_data-merge.fam .
1379990 variants loaded from .bim file.
55 people (0 males, 0 females, 55 ambiguous) loaded from .fam.
Ambiguous sex IDs written to merged_data.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 54 founders and 1 nonfounder present.
Calculating allele frequencies... done.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate is 0.432004.
1379990 variants and 55 people pass filters and QC.
Note: No phenotypes present.
--make-bed to merged_data.bed + merged_data.bim + merged_data.fam ... done.

OK, I see that you have 178K SNPs overlapping with the master dataset. You can increase that quite a bit by using the Reich 1240K set instead of the 597K set you are using. It's available at the Reich Lab.

You can easily convert it from Eigenstrat to plink using the par file I gave you.

Then in Plink you can merge your personal data, and any other data, with it. You can also do IBS analysis in plink.

Then you can convert your new plink master back to Eigenstrat. Let me know when you're ready and I'll give you a different par file for that.

Then you can run Admixtools with yourself included, using Eigenstrat.

Zoro
03-02-2021, 08:08 PM
Here it is...

√ 1379990 SNPs read in total
! 173175 SNPs remain after filtering. 170906 are polymorphic.


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Bulgarian.DG 0.776 0.0339 22.9
2 Kaspias Turkmen.SG 0.224 0.0339 6.60



target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Bulgarian.DG 0.798 0.0299 26.7
2 Kaspias Uzbek.SG 0.202 0.0299 6.76

Looking much better. Standard errors look good at 3%. Can you post the p-values so we can see whether the models pass or fail? The 3rd row contains the p-value. Also, can you post the right pops used?

Kaspias
03-02-2021, 08:34 PM
Looking much better. Standard errors look good at 3%. Can you post the p-values so we can see whether the models pass or fail? The 3rd row contains the p-value. Also, can you post the right pops used?

Bulgarian + Uzbek,


f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 14 11.1 6.75e- 1 16 1483. 2.95e-306
2 0 30 1494. 8.98e-296 NA NA NA

Bulgarian + Turkmen,


f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 14 12.7 5.50e- 1 16 1029. 5.62e-209
2 0 30 1042. 6.54e-200 NA NA NA

Right pops:


"Papuan.DG",
"Eskimo_Sireniki.DG",
"Jordanian.DG",
"Punjabi.DG",
"Yakut.DG",
"Polish.DG",
"Yoruba.DG",
"Sardinian.DG",
"Finnish.DG",
"Armenian.DG",
"Greek_1.DG",
"Tatar_Volga.SG",
"Iranian.DG",
"Estonian.DG",
"Altaian.DG",
"Uzbek.SG" (or "Turkmen.SG" in the Turkmen model)

Zoro
03-02-2021, 08:51 PM
Bulgarian + Uzbek,


f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 14 11.1 6.75e- 1 16 1483. 2.95e-306
2 0 30 1494. 8.98e-296 NA NA NA

Bulgarian + Turkmen,


f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 14 12.7 5.50e- 1 16 1029. 5.62e-209
2 0 30 1042. 6.54e-200 NA NA NA

Right pops:


"Papuan.DG",
"Eskimo_Sireniki.DG",
"Jordanian.DG",
"Punjabi.DG",
"Yakut.DG",
"Polish.DG",
"Yoruba.DG",
"Sardinian.DG",
"Finnish.DG",
"Armenian.DG",
"Greek_1.DG",
"Tatar_Volga.SG",
"Iranian.DG",
"Estonian.DG",
"Altaian.DG",
"Uzbek.SG" (or "Turkmen.SG" in the Turkmen model)

Unfortunately both models fail, since the p-value of the 1st is 2.95e-306 and the 2nd is about the same. Your p-values should be >0.05 for a pass. Remove a few of the right pops to improve the p-value.
You may also need another source; think about a third source that might help.
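As a sanity check on the p column: it is the upper-tail probability of a chi-square statistic with the listed dof, and for even dof that tail has a closed form, so qpAdm's numbers can be verified by hand. A minimal stdlib-only sketch, using the chisq = 11.1, dof = 14 row from the Bulgarian + Uzbek model above:

```python
import math

def chisq_sf_even_dof(x, dof):
    # Upper-tail P(X >= x) for a chi-square with EVEN dof:
    # exp(-x/2) * sum_{k=0}^{dof/2 - 1} (x/2)^k / k!
    assert dof % 2 == 0
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k)
                                 for k in range(dof // 2))

# Bulgarian + Uzbek model: chisq = 11.1 on dof = 14
p = chisq_sf_even_dof(11.1, 14)
print(round(p, 2))  # -> 0.68, i.e. the table's 6.75e-1 up to rounding of chisq
```

Anything below 0.05 here is a fail by Zoro's rule of thumb; 2.95e-306 is about as hard a fail as a model can produce.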

Zoro
03-02-2021, 09:40 PM
Researchers use ancient samples for right pops. I have had good luck with this set, and they don't have too many missing genotypes. If you are missing some of these samples you can substitute something similar:

right = c('Khomani_San','Devils-Gate-N','Bichon','Morocco_Iberomaurusian',
'Anatolia_N','Kotias','Karelia','Yana-UP','Iran_N','Kolyma-Mesol')

My Devils-Gate, Yana and Kolyma samples are WGS, but you can use diploids if you have them.

Your p-values should improve a lot.

Also, not everyone can be modeled successfully with just 2 sources. For example, many Kurds can be modeled with just 2 sources, but Armenians and Iranians appear to have more complex histories and I usually need at least 3 sources for them. Not sure about your situation.

Kaspias
03-03-2021, 11:18 AM
Unfortunately both models fail, since the p-value of the 1st is 2.95e-306 and the 2nd is about the same. Your p-values should be >0.05 for a pass. Remove a few of the right pops to improve the p-value.
You may also need another source; think about a third source that might help.

I used the exact same populations you recommended, except for Turkmen, which I replaced with MA2196, and this is what I get:


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Bulgarian.DG 0.712 0.0923 7.71
2 Kaspias Turkey_Ottoman_2.SG 0.288 0.0923 3.12



f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 8 6.59 5.82e- 1 10 87.0 2.10e-14
2 0 18 93.6 3.26e-12 NA NA NA



pat wt dof chisq p f4rank Bulgarian.DG Turkey_Ottoman_2.SG feasible best dofdiff chisqdiff p_nested
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <dbl> <dbl> <dbl>
1 00 0 8 6.59 0.582 1 0.712 0.288 TRUE NA NA NA NA
2 01 1 9 24.4 0.00366 0 1 NA TRUE TRUE 0 -23.9 1
3 10 1 9 48.4 0.000000218 0 NA 1 TRUE TRUE NA NA NA
>

If I understand correctly, the model still does not pass. What's the reason? I mean, I could add a 3rd population here - Greek or Crimean Tatar - both are plausible for me, but Greek will cause overfitting with Bulgarian, and there is no Crimean Tatar in the spreadsheet.

In addition, the SNP coverage dropped sharply when leaving the Simons dataset:

! 29131 SNPs remain after filtering. 27980 are polymorphic.

andre
03-03-2021, 12:21 PM
I used the exact same populations you recommended, except for Turkmen, which I replaced with MA2196, and this is what I get:


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Bulgarian.DG 0.712 0.0923 7.71
2 Kaspias Turkey_Ottoman_2.SG 0.288 0.0923 3.12



f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 8 6.59 5.82e- 1 10 87.0 2.10e-14
2 0 18 93.6 3.26e-12 NA NA NA



pat wt dof chisq p f4rank Bulgarian.DG Turkey_Ottoman_2.SG feasible best dofdiff chisqdiff p_nested
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <dbl> <dbl> <dbl>
1 00 0 8 6.59 0.582 1 0.712 0.288 TRUE NA NA NA NA
2 01 1 9 24.4 0.00366 0 1 NA TRUE TRUE 0 -23.9 1
3 10 1 9 48.4 0.000000218 0 NA 1 TRUE TRUE NA NA NA
>

If I understand correctly, the model still does not pass. What's the reason? I mean, I could add a 3rd population here - Greek or Crimean Tatar - both are plausible for me, but Greek will cause overfitting with Bulgarian, and there is no Crimean Tatar in the spreadsheet.

In addition, the SNP coverage dropped sharply when leaving the Simons dataset:

! 29131 SNPs remain after filtering. 27980 are polymorphic.

Try to do it with Tuscan, Ukrainian and Turkmen.

Kaspias
03-03-2021, 01:27 PM
Try to do it with Tuscan, Ukrainian and Turkmen.

Tuscan is too northern for the base Balkan admixture of Thrace; I'd need something between Apulian and island Greek instead.

I got almost no additional Slavic ancestry:


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Tuscan_1.DG 0.721 0.180 4.01
2 Kaspias Polish.DG 0.0334 0.172 0.194
3 Kaspias Turkmen.SG 0.246 0.0398 6.18




Besides, I run these:


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Bulgarian.DG Hungary_Avar_5 0.391 0.349 1.12
2 Bulgarian.DG Bulgaria_IA 0.457 0.281 1.63
3 Bulgarian.DG Russia_Medieval_Nomad.SG 0.152 0.0781 1.95


f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 2 7 9.69 2.07e- 1 9 24.5 3.58e- 3
2 1 16 34.2 5.13e- 3 11 319. 8.11e-62
3 0 27 353. 1.47e-58 NA NA NA



target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Gagauz Hungary_Avar_5 0.421 0.142 2.96
2 Gagauz Bulgaria_IA 0.429 0.118 3.64
3 Gagauz Russia_Medieval_Nomad.SG 0.151 0.0394 3.83

f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 2 7 3.61 8.23e- 1 9 25.2 2.74e- 3
2 1 16 28.8 2.51e- 2 11 316. 4.07e-61
3 0 27 345. 8.25e-57 NA NA NA


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Romanian Hungary_Avar_5 0.506 0.183 2.77
2 Romanian Bulgaria_IA 0.368 0.148 2.48
3 Romanian Russia_Medieval_Nomad.SG 0.126 0.0471 2.68

f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 2 7 6.68 4.63e- 1 9 24.8 3.17e- 3
2 1 16 31.5 1.16e- 2 11 316. 3.78e-61
3 0 27 347. 2.22e-57 NA NA NA

The same model on me:


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Hungary_Avar_5 0.0528 0.234 0.225
2 Kaspias Bulgaria_IA 0.607 0.191 3.18
3 Kaspias Russia_Medieval_Nomad.SG 0.341 0.0677 5.03

f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 2 7 4.68 6.98e- 1 9 28.3 8.44e- 4
2 1 16 33.0 7.39e- 3 11 298. 1.94e-57
3 0 27 331. 3.88e-54 NA NA NA

Zoro
03-03-2021, 01:31 PM
I used the exact same populations you recommended, except for Turkmen, which I replaced with MA2196, and this is what I get:


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Bulgarian.DG 0.712 0.0923 7.71
2 Kaspias Turkey_Ottoman_2.SG 0.288 0.0923 3.12



f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 8 6.59 5.82e- 1 10 87.0 2.10e-14
2 0 18 93.6 3.26e-12 NA NA NA



pat wt dof chisq p f4rank Bulgarian.DG Turkey_Ottoman_2.SG feasible best dofdiff chisqdiff p_nested
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <dbl> <dbl> <dbl>
1 00 0 8 6.59 0.582 1 0.712 0.288 TRUE NA NA NA NA
2 01 1 9 24.4 0.00366 0 1 NA TRUE TRUE 0 -23.9 1
3 10 1 9 48.4 0.000000218 0 NA 1 TRUE TRUE NA NA NA
>

If I understand correctly, the model still does not pass. What's the reason? I mean, I could add a 3rd population here - Greek or Crimean Tatar - both are plausible for me, but Greek will cause overfitting with Bulgarian, and there is no Crimean Tatar in the spreadsheet.

In addition, the SNP coverage dropped sharply when leaving the Simons dataset:

! 29131 SNPs remain after filtering. 27980 are polymorphic.



You can increase the 29K SNPs a lot by using the 1240K Reich SNP set.

Let's first figure out which populations you are genetically closest to by running f2s. This will also tell us whether your personal data somehow got corrupted. Don't use ancients like I did, to keep your SNP count up.

When I run f2s for Bulgarians using 200K SNPs I get the following, but I'm not using a lot of pops more relevant to Bulgarians, such as Hungarians, Greeks etc., which you should use. In fact you can use all 30 or so Simons pops in your dataset:

POP1        POP2                 F2      SE      Z
Bulgarian   Sardinian            0.246   0.0010  258
Bulgarian   Estonian             0.247   0.0008  313
Bulgarian   Armenian             0.249   0.0008  296
Bulgarian   Georgian             0.249   0.0007  358
Bulgarian   Turkish-Kayseri      0.249   0.0007  371
Bulgarian   Tatar-Volga          0.25    0.0008  328
Bulgarian   Saami                0.25    0.0007  343
Bulgarian   Iran-Hasanlu-IA      0.251   0.0011  239
Bulgarian   Iranians-Fars        0.252   0.0015  168
Bulgarian   Karelia-EHG          0.252   0.0012  212
Bulgarian   Kotias-CHG           0.252   0.0009  291
Bulgarian   Kalash               0.252   0.0008  304
Bulgarian   Bashkir              0.252   0.0007  371
Bulgarian   Pathan               0.253   0.0009  268
Bulgarian   Jordanian            0.253   0.0009  288
Bulgarian   Villabruna-UP-WHG    0.254   0.0010  256
Bulgarian   Turkmen              0.254   0.0009  291
Bulgarian   Balochi              0.254   0.0008  301
Bulgarian   Brahui               0.254   0.0007  363
Bulgarian   MA1-ANE              0.257   0.0009  274
Bulgarian   Punjabi              0.257   0.0009  296
Bulgarian   Yana-UP-WGS          0.258   0.0008  336
Bulgarian   Devils-Gate-N-WGS    0.259   0.0008  316
Bulgarian   Kolyma-Mesol-WGS     0.261   0.0011  240
Bulgarian   Saharawi             0.261   0.0010  261
Bulgarian   Eskimo-Sireniki      0.261   0.0008  324
Bulgarian   Eskimo-Chaplin       0.262   0.0011  237
Bulgarian   China-Tianyuan-UP    0.267   0.0012  215
Bulgarian   UstIshim-UP          0.269   0.0010  263
Bulgarian   Khomani-San          0.313   0.0013  245

Running F2s is simple. Do this


## Increase number of lines R prints
options(max.print = 100000)

extract_f2(pref, f2dir, pops = c(..........

f2_blocks = f2_from_precomp('............

##View(f2(f2_blocks))
print(f2(f2_blocks), n = 2000)

Zoro
03-03-2021, 01:41 PM
Tuscan is too northern for the base Balkan admixture of Thrace; I'd need something between Apulian and island Greek instead.

I got almost no additional Slavic ancestry:


The same model on me:


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Kaspias Hungary_Avar_5 0.0528 0.234 0.225
2 Kaspias Bulgaria_IA 0.607 0.191 3.18
3 Kaspias Russia_Medieval_Nomad.SG 0.341 0.0677 5.03

f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 2 7 4.68 6.98e- 1 9 28.3 8.44e- 4
2 1 16 33.0 7.39e- 3 11 298. 1.94e-57
3 0 27 331. 3.88e-54 NA NA NA



It looks like you're getting much closer. Your p-value is now passing at 6.98e-1, which is 0.698!

Your standard errors are not good though, especially for Avar ("1 Kaspias Hungary_Avar_5 0.0528 0.234 0.225"), because it's saying 5.28% Avar +/- 23.4%.

All this means is that your right pops are not sufficient to distinguish Hungary_Avar_5 from Bulgaria_IA genetically. Add a right pop that you think is much closer genetically to the Avars than to Bulgaria_IA, or vice versa.
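The z column in the qpAdm weight table is simply weight/se (a test of weight = 0), and the +/- range quoted for the Avar weight is roughly a 95% interval of weight +/- 1.96*se. A minimal sketch using that row's numbers:

```python
# The z column in the qpAdm weight table is weight / se, testing H0: weight = 0;
# a rough 95% interval is weight +/- 1.96 * se.
def z_and_ci(weight, se):
    z = weight / se
    ci = (weight - 1.96 * se, weight + 1.96 * se)
    return round(z, 3), ci

# Hungary_Avar_5 row: weight 0.0528, se 0.234
z, (lo, hi) = z_and_ci(0.0528, 0.234)
print(z)            # 0.226 -- the table's 0.225, up to rounding
print(lo < 0 < hi)  # True: the Avar weight is not distinguishable from zero
```

When the interval straddles zero like this, the model cannot tell whether that source contributed anything at all, which is exactly the ambiguity Zoro describes.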

Korialstrasz
03-03-2021, 05:25 PM
@Kaspias

I am glad that my post helped. Nice to see that you too have managed to run it!

@Zoro

Very helpful advice all around. Thanks again.


---


So I made a few more runs (maxmiss=0 and ~93k SNPs) using the 1240K dataset and the following populations. I picked Tepecik for Neolithic Anatolia. Open for suggestions!





right = c('Russia_DevilsCave_N.SG','Switzerland_Bichon.SG','Morocco_Iberomaurusian','Turkey_TepecikCiftlik_N.SG','Georgia_Kotias.SG','Russia_HG_Karelia','Russia_Yana_UP.SG','Iran_GanjDareh_N','Russia_Kolyma_M.SG')

left = c("Bulgarian.DG","Adygei.DG","Turkmen.SG",'Georgian.DG','Greek_1.DG')




This seems to be the best result; the standard errors could be lower, I guess. The p-values seem OK.
About the z-values for the weight estimates: what is being tested here? weight_i = 0? It seems like it.
Also, why do we want to fail to reject the model hypothesis? I can't find a layman's interpretation (no surprise).

Run 1: (Greek and Bulgarian did not go well together, and Greek instead of Bulgarian yielded better results. Georgian seems to be a non-factor here: not significantly different from 0, though I would expect around 10%. Adygei, on the other hand, has a high SE here, possibly due to its close proximity to Georgian.)



=======================================
target left weight se z
---------------------------------------
1 me Adygei.DG 0.436 0.267 1.636
2 me Turkmen.SG 0.051 0.049 1.055
3 me Georgian.DG 0.053 0.2 0.263
4 me Greek_1.DG 0.46 0.127 3.617
---------------------------------------

the p value = 0.56

====================================================
f4rank dof chisq p dofdiff chisqdiff p_nested
----------------------------------------------------
1 3 5 3.924 0.56 7 36.838 0
2 2 12 40.761 0 9 101.281 0
3 1 21 142.042 0 11 732.627 0
4 0 32 874.67 0 NA NA NA
----------------------------------------------------




Another run, without Georgian. (Adygei SE is now 0.15)




======================================
target left weight se z
--------------------------------------
1 me Adygei.DG 0.489 0.15 3.268
2 me Turkmen.SG 0.046 0.045 1.006
3 me Greek_1.DG 0.466 0.13 3.584
--------------------------------------

=====================================================
f4rank dof chisq p dofdiff chisqdiff p_nested
-----------------------------------------------------
1 2 6 3.965 0.681 8 68.9 0
2 1 14 72.865 0 10 653.942 0
3 0 24 726.807 0 NA NA NA
-----------------------------------------------------




bonus 1: me vs. the populations I used (f2 statistics). If I'm interpreting these correctly, it says I'm closer to Bulgarians than to the Adygei (albeit not by a significant margin). On G25 I get the opposite every time, with a clear margin.



====================================================================
pop1 pop2 est se z p
--------------------------------------------------------------------
1 me Bulgarian.DG 9e-04 0.0011 0.82034 0.41202
2 me Adygei.DG 0.00134 0.00116 1.14901 0.25055
3 me Georgian.DG 0.00232 0.00111 2.09104 0.03652
4 me Greek_1.DG 0.00301 0.00148 2.03656 0.04169
5 me Iran_GanjDareh_N 0.0566 0.00121 46.81045 0
6 me Turkmen.SG 0.06436 0.00126 51.18922 0
7 me Russia_Yana_UP.SG 0.08537 0.00142 60.25923 0
8 me Morocco_Iberomaurusian 0.09281 0.0014 66.0843 0
9 me Russia_DevilsCave_N.SG 0.10113 0.00151 66.87772 0
10 me Turkey_TepecikCiftlik_N.SG 0.13288 0.00139 95.47373 0
11 me Russia_HG_Karelia 0.16392 0.00155 106.00861 0
12 me Georgia_Kotias.SG 0.17455 0.00155 112.48209 0
13 me Switzerland_Bichon.SG 0.17911 0.00175 102.16275 0
14 me Russia_Kolyma_M.SG 0.19436 0.00177 109.87032 0
--------------------------------------------------------------------


bonus 2: a graph. Perhaps this can give an idea of whether the chosen populations are satisfactory. If the graph produces nonsense, one can try different populations. This particular one may well be nonsense, since I only used a few moderns and some very ancient populations.


https://i.ibb.co/CsYKkz3/graph1.png (https://ibb.co/wrPh9RF)

Zoro
03-03-2021, 05:54 PM
@Kaspias

I am glad that my post helped. Nice to see that you too have managed to run it!

@Zoro

Very helpful advice all around. Thanks again.


---


So I made a few more runs (maxmiss=0 and ~93k SNPs) using the 1240K dataset and the following populations. I picked Tepecik for Neolithic Anatolia. Open for suggestions!





right = c('Russia_DevilsCave_N.SG','Switzerland_Bichon.SG','Morocco_Iberomaurusian','Turkey_TepecikCiftlik_N.SG','Georgia_Kotias.SG','Russia_HG_Karelia','Russia_Yana_UP.SG','Iran_GanjDareh_N','Russia_Kolyma_M.SG')

left = c("Bulgarian.DG","Adygei.DG","Turkmen.SG",'Georgian.DG','Greek_1.DG')




This seems to be the best result; the standard errors could be lower, I guess. The p-values seem OK.
About the z-values for the weight estimates: what is being tested here? weight_i = 0? It seems like it.
Also, why do we want to fail to reject the model hypothesis? I can't find a layman's interpretation (no surprise).

Run 1: (Greek and Bulgarian did not go well together, and Greek instead of Bulgarian yielded better results. Georgian seems to be a non-factor here: not significantly different from 0, though I would expect around 10%. Adygei, on the other hand, has a high SE here, possibly due to its close proximity to Georgian.)



=======================================
target left weight se z
---------------------------------------
1 me Adygei.DG 0.436 0.267 1.636
2 me Turkmen.SG 0.051 0.049 1.055
3 me Georgian.DG 0.053 0.2 0.263
4 me Greek_1.DG 0.46 0.127 3.617
---------------------------------------

the p value = 0.56

====================================================
f4rank dof chisq p dofdiff chisqdiff p_nested
----------------------------------------------------
1 3 5 3.924 0.56 7 36.838 0
2 2 12 40.761 0 9 101.281 0
3 1 21 142.042 0 11 732.627 0
4 0 32 874.67 0 NA NA NA
----------------------------------------------------




Another run, without Georgian. (Adygei SE is now 0.15)




======================================
target left weight se z
--------------------------------------
1 me Adygei.DG 0.489 0.15 3.268
2 me Turkmen.SG 0.046 0.045 1.006
3 me Greek_1.DG 0.466 0.13 3.584
--------------------------------------

=====================================================
f4rank dof chisq p dofdiff chisqdiff p_nested
-----------------------------------------------------
1 2 6 3.965 0.681 8 68.9 0
2 1 14 72.865 0 10 653.942 0
3 0 24 726.807 0 NA NA NA
-----------------------------------------------------




bonus 1: me vs. the populations I used (f2 statistics). If I'm interpreting these correctly, it says I'm closer to Bulgarians than to the Adygei (albeit not by a significant margin). On G25 I get the opposite every time, with a clear margin.



====================================================================
pop1 pop2 est se z p
--------------------------------------------------------------------
1 me Bulgarian.DG 9e-04 0.0011 0.82034 0.41202
2 me Adygei.DG 0.00134 0.00116 1.14901 0.25055
3 me Georgian.DG 0.00232 0.00111 2.09104 0.03652
4 me Greek_1.DG 0.00301 0.00148 2.03656 0.04169
5 me Iran_GanjDareh_N 0.0566 0.00121 46.81045 0
6 me Turkmen.SG 0.06436 0.00126 51.18922 0
7 me Russia_Yana_UP.SG 0.08537 0.00142 60.25923 0
8 me Morocco_Iberomaurusian 0.09281 0.0014 66.0843 0
9 me Russia_DevilsCave_N.SG 0.10113 0.00151 66.87772 0
10 me Turkey_TepecikCiftlik_N.SG 0.13288 0.00139 95.47373 0
11 me Russia_HG_Karelia 0.16392 0.00155 106.00861 0
12 me Georgia_Kotias.SG 0.17455 0.00155 112.48209 0
13 me Switzerland_Bichon.SG 0.17911 0.00175 102.16275 0
14 me Russia_Kolyma_M.SG 0.19436 0.00177 109.87032 0
--------------------------------------------------------------------




Congrats, looking good.

The best way to figure out which samples have the fewest missing SNPs, so that you can use them in your run, is this plink command:

....../plink --bfile Master --missing

This will output a file called plink.imiss listing the number of missing SNPs for every sample. This way you can use only your best samples.

For ex, here's a portion of my plink.imiss file sorted by missingness

FID                           IID           N_MISS    N_GENO    F_MISS
Anatolia_N                    Bar8          3563703   4668444   0.76
Anatolia_N                    Bar31         3646992   4676043   0.78
Anatolia_N                    I0707         3712565   4668444   0.80
Anatolia_N                    I0746         3726082   4676043   0.80
Anatolia_N                    I0745         3732807   4676043   0.80
Anatolia_N                    I0709         3741514   4676043   0.80
Anatolia_N                    I0708         3748125   4676043   0.80
Anatolia_N                    I1583_publ    3749842   4676043   0.80
Anatolia_N                    I1580_publ    3798544   4668444   0.81
Anatolia_N                    I0744         3837576   4676043   0.82
Anatolia_N                    I1581_publ    3874657   4668444   0.83
Anatolia_N                    I1585_publ    3875085   4668444   0.83
Anatolia_N                    I1579_publ    3880059   4668444   0.83
Anatolia_N                    I0736         3898059   4668444   0.84
Anatolia_N                    I1098         3914326   4668444   0.84
Anatolia_N                    ZHAG          3921509   4668444   0.84
Anatolia_N                    I1096         3957126   4676043   0.85
Anatolia_N                    I1097         3959996   4676043   0.85
Anatolia_N                    I1101         4047243   4676043   0.87
Anatolia_N                    I1103         4099354   4676043   0.88
Anatolia_Ottoman_1.SG         MA2195_final  4109325   4668444   0.88
Anatolia_TepecikCiftlik_N.SG  Tep003        4141647   4676043   0.89

You'll notice the best ENF samples are Bar8, with missingness of only 0.76, followed by Bar31, etc. You'll also notice that the ENF sample you used is one of the worst, at 0.89 missingness.

Next, I go to my Eigenstrat .ind file and add _low to the samples with high missingness that I don't want Admixtools to use.

For ex
Anatolia_N:Bar8 F Anatolia_N
Anatolia_N:Bar31 M Anatolia_N
Anatolia_N:I0707 F Anatolia_N
Anatolia_N:I0708 M Anatolia_N
Anatolia_N:I0709 M Anatolia_N
Anatolia_N:I0736 F Anatolia_N_low
Anatolia_N:I0744 M Anatolia_N
Anatolia_N:I0745 M Anatolia_N
Anatolia_N:I0746 M Anatolia_N
Anatolia_N:I1096 M Anatolia_N_low
Anatolia_N:I1097 M Anatolia_N_low
Anatolia_N:I1098 F Anatolia_N_low
Anatolia_N:I1101 M Anatolia_N_low
Anatolia_N:I1103 M Anatolia_N_low
Anatolia_N:I1579_publ F Anatolia_N
Anatolia_N:I1580_publ F Anatolia_N
Anatolia_N:I1581_publ F Anatolia_N
Anatolia_N:I1583_publ M Anatolia_N
Anatolia_N:I1585_publ F Anatolia_N
Anatolia_N:ZHAG F Anatolia_N_low


Now when I add "Anatolia_N" to extract or pright only the ENF samples with low missingness are used and the rest are ignored.

You may ask why I don't only use the best 2 ENF samples instead of the best 8. The answer to that is that the more samples, the more accurate the allele frequencies for the population become. So it's a tradeoff between ignoring worse samples and improving allele frequencies.
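If you want to script that tagging, here's a minimal Python sketch of the same workflow. It assumes plink's .imiss column layout (FID IID MISS_PHENO N_MISS N_GENO F_MISS); the 0.85 cutoff and the sample data are illustrative, not from the run above.

```python
# Sketch: flag high-missingness samples from plink.imiss and append "_low"
# to their group label in an Eigenstrat .ind file.

def high_missing_ids(imiss_lines, cutoff=0.85):
    """Return sample IDs (IID column) whose F_MISS exceeds the cutoff."""
    bad = set()
    for line in imiss_lines[1:]:               # skip the header line
        fields = line.split()
        if len(fields) >= 6 and float(fields[-1]) > cutoff:
            bad.add(fields[1])
    return bad

def tag_ind_lines(ind_lines, bad_ids):
    """Append _low to the group label of every flagged sample."""
    out = []
    for line in ind_lines:
        sample, sex, group = line.split()
        iid = sample.split(":")[-1]            # "Anatolia_N:I1103" -> "I1103"
        if iid in bad_ids:
            group += "_low"
        out.append(f"{sample} {sex} {group}")
    return out

imiss = [
    "FID IID MISS_PHENO N_MISS N_GENO F_MISS",
    "Anatolia_N Bar8 N 3563703 4668444 0.76",
    "Anatolia_N I1103 N 4099354 4676043 0.88",
]
ind = ["Anatolia_N:Bar8 F Anatolia_N", "Anatolia_N:I1103 M Anatolia_N"]
print(tag_ind_lines(ind, high_missing_ids(imiss)))
# -> ['Anatolia_N:Bar8 F Anatolia_N', 'Anatolia_N:I1103 M Anatolia_N_low']
```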

Zoro
03-03-2021, 06:02 PM
@Kaspias

I am glad that my post helped. Nice to see that you too have managed to run it!

@Zoro

Very helpful advice all around. Thanks again.


---


So I made a few more runs (maxmiss=0 and ~93k SNPs) using the 1240K dataset and the following populations. I picked Tepecik for Neolithic Anatolia. Open for suggestions!


Z and the p-value are inversely related: the higher the p, the lower the |Z|.

Your passing models are generally >0.05. The higher ones such as >0.40 are a little more accurate
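For the weight table, that relationship is just the usual normal approximation with Z = weight/SE; a back-of-the-envelope sketch, not qpAdm's own rank test:

```python
# Sketch: two-sided p-value for a weight's Z score under a normal
# approximation. This is for eyeballing the weight table, not the
# chi-square p of the model itself.
import math

def two_sided_p(z):
    return math.erfc(abs(z) / math.sqrt(2))

print(round(two_sided_p(1.96), 3))  # ~0.05, the usual |Z| ~ 2 cutoff
print(two_sided_p(4.0) < 0.001)     # a weight with Z ~ 4 is clearly nonzero
```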

Agree with regards to the Adygei SE being due to Georgian. If you really need to use both, you have to add another pright that shares more genetic drift with one vs the other.

Your SE is still acceptable. Your p of 0.68 is very good.

How many SNPs were used in the f2 run?

Korialstrasz
03-04-2021, 09:01 PM
Congrats, looking good.



Now when I add "Anatolia_N" to extract or pright only the ENF samples with low missingness are used and the rest are ignored.

You may ask why I don't only use the best 2 ENF samples instead of the best 8. The answer to that is that the more samples, the more accurate the allele frequencies for the population become. So it's a tradeoff between ignoring worse samples and improving allele frequencies.


So I did this for every population I use and excluded those with very high missing ratios. But some populations like Ganj_Dareh only have a few samples, so I kept them and accepted the SNP trade-off you mentioned. (I think I only eliminated those with 0.85+ missingness.)
Removed Tepecik and went on with only Turkey_N. Since my right pops cannot distinguish Greek vs Bulgarian, I wanted to use Polish as a Slavic proxy. (Not sure if this is a good idea.) After going with Turkey_N, Adygei never fell below 0.6. This tells me that I am missing something. I'll read this next: https://www.biorxiv.org/content/10.1101/2020.04.09.032664v1.full.pdf

Perhaps I should use earlier, let's say medieval, pops as left references, as I am mostly interested in getting a rough estimation of my admixture.


=======================================
target left weight se z
---------------------------------------
1 my_geno Adygei.DG 0.624 0.152 4.103
2 my_geno Turkmen.SG 0.012 0.044 0.282
3 my_geno Greek_1.DG 0.069 0.147 0.47
4 my_geno Polish.DG 0.294 0.094 3.138
---------------------------------------

================================================== ====
f4rank dof chisq p dofdiff chisqdiff p_nested
------------------------------------------------------
1 3 5 0.57 0.989 7 64.579 0
2 2 12 65.149 0 9 164.988 0
3 1 21 230.138 0 11 893.415 0
4 0 32 1123.552 0 NA NA NA
------------------------------------------------------








How many SNPs were used in the f2 run?

Those f2 statistics came from the run with ~93k SNPs.

New one with ~100k SNPs (you can see the populations I use: moderns left, ancients right).



===============================================
pop1 pop2 est se
-----------------------------------------------
1 me Bulgarian.DG 0.00071 0.00108
2 me Russia_Abkhasian.DG 0.00094 0.00111
3 me Georgian.DG 0.00097 0.00111
4 me Armenian.DG 0.00109 0.00109
5 me Adygei.DG 0.00114 0.00116
6 me Turkish.DG 0.00143 0.00113
7 me Greek_1.DG 0.00157 0.00112
8 me Russia_NorthOssetian.DG 0.00182 0.00114
9 me Iranian.DG 0.00219 0.00112
10 me Polish.DG 0.0022 0.00146
11 me Turkmen.SG 0.00557 0.00136
12 me Turkey_N 0.00895 0.00101
13 me Iran_GanjDareh_N 0.0279 0.00127
14 me Ukraine_N 0.0285 0.00113
15 me Russia_Yana_UP.SG 0.03287 0.00161
16 me Russia_DevilsCave_N.SG 0.06759 0.00159
17 me Georgia_Kotias.SG 0.18128 0.00152
18 me Russia_HG_Karelia 0.18468 0.00151
19 me Switzerland_Bichon.SG 0.18638 0.00168
20 me Russia_Kolyma_M.SG 0.20019 0.00163
-----------------------------------------------


bonus:

possibly not historically accurate, but I just felt curious to see what I'd get.




==========================================
target left weight se z
------------------------------------------
1 Turkish.DG Greek_1.DG 0.504 0.09 5.619
2 Turkish.DG Turkmen.SG 0.113 0.037 3.066
3 Turkish.DG Iranian.DG 0.383 0.115 3.33
------------------------------------------

================================================== ====
f4rank dof chisq p dofdiff chisqdiff p_nested
------------------------------------------------------
1 2 6 10.737 0.097 8 98.977 0
2 1 14 109.714 0 10 911.517 0
3 0 24 1021.231 0 NA NA NA
------------------------------------------------------



Thank you a lot again, Zoro.

Kaspias
03-05-2021, 08:50 AM
It looks like you're getting much closer. Your p-value is now passing at 6.98e-1, which is basically 0.698!

Your standard errors are not good though, especially for Avar (1 Kaspias Hungary_Avar_5 0.0528 0.234 0.225), because it's saying 5.28% Avar +/- 23.4%.

All this means is that your pright are not sufficient to distinguish the genetic difference between Hungary-Avar and Bulgaria-IA. Add a pright that you think is genetically much closer to Avar than to Bulgaria-IA, or vice versa.
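Reading "5.28% +/- 23.4%" as a rough 95% interval makes the problem obvious; a quick sketch using the normal approximation (values from the Avar row above):

```python
# Sketch: rough 95% interval from a qpAdm weight and its SE.
def ci95(weight, se):
    half = 1.96 * se
    return (weight - half, weight + half)

lo, hi = ci95(0.0528, 0.234)        # the Hungary_Avar_5 weight above
print(lo < 0 < hi)  # True: the interval includes 0, so no detectable Avar
```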

I'm having a hard time interpreting these tbh:

F2s


pop1 pop2 est se
<chr> <chr> <dbl> <dbl>
1 Kaspias Adygei.DG 0.361 0.00114
2 Kaspias Albanian.DG 0.359 0.00128
3 Kaspias Bashkir.SG 0.363 0.00114
4 Kaspias Bulgarian.DG 0.360 0.00114
5 Kaspias Cretan.DG 0.360 0.00118
6 Kaspias French.DG 0.360 0.00113
7 Kaspias Greek_1.DG 0.360 0.00121
8 Kaspias Greek_2.DG 0.359 0.00122
9 Kaspias Iranian.DG 0.362 0.00115
10 Kaspias Italian_North.DG 0.360 0.00129
11 Kaspias Jordanian.DG 0.365 0.00107
12 Kaspias Mansi.DG 0.365 0.00111
13 Kaspias Polish.DG 0.360 0.00121
14 Kaspias Russian.DG 0.361 0.00113
15 Kaspias Tatar_Tomsk.SG 0.366 0.00108
16 Kaspias Turkish.DG 0.362 0.00112
17 Kaspias Turkmen.SG 0.364 0.00113
18 Kaspias Tuscan_1.DG 0.360 0.00115
19 Kaspias Uzbek.SG 0.366 0.00109

Fst:


pop1 pop2 est se
<chr> <chr> <dbl> <dbl>
1 Kaspias Adygei.DG 0.134 0.00284
2 Kaspias Albanian.DG 0.137 0.00395
3 Kaspias Bashkir.SG 0.137 0.00296
4 Kaspias Bulgarian.DG 0.130 0.00288
5 Kaspias Cretan.DG 0.131 0.00294
6 Kaspias French.DG 0.134 0.00272
7 Kaspias Greek_1.DG 0.131 0.00337
8 Kaspias Greek_2.DG 0.129 0.00364
9 Kaspias Iranian.DG 0.133 0.00293
10 Kaspias Italian_North.DG 0.132 0.00346
11 Kaspias Jordanian.DG 0.141 0.00278
12 Kaspias Mansi.DG 0.156 0.00281
13 Kaspias Polish.DG 0.135 0.00352
14 Kaspias Russian.DG 0.134 0.00292
15 Kaspias Tatar_Tomsk.SG 0.142 0.00293
16 Kaspias Turkish.DG 0.133 0.00275
17 Kaspias Turkmen.SG 0.135 0.00298
18 Kaspias Tuscan_1.DG 0.134 0.00295
19 Kaspias Uzbek.SG 0.140 0.00270
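Such a list is easier to read sorted; a minimal sketch with a few of the Fst values from the table above (full table omitted):

```python
# Sketch: rank Fst distances from closest to most distant.
fst = {
    "Greek_2.DG": 0.129, "Bulgarian.DG": 0.130, "Adygei.DG": 0.134,
    "Mansi.DG": 0.156,
}
for pop, dist in sorted(fst.items(), key=lambda kv: kv[1]):
    print(pop, dist)   # Greek_2.DG comes first: lowest Fst = closest
```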

But I should add, the overlapping SNP number is now ~610K, using the 1240K dataset.



target left weight se z
1 Kaspias Bulgarian.DG 0.75 0.041 18.11
2 Kaspias Uzbek.SG 0.25 0.041 6.05


f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 5 5.81 0.32551 7 735.49 1.5e-154
2 0 12 741.3 < 2e-16
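For what it's worth, the p in these rank tables is the chi-square tail probability of chisq at the given dof. A sketch that reproduces it (closed form, valid only for even dof, so it covers the dof-12 row above but not the dof-5 one):

```python
# Sketch: chi-square survival function P(X > chisq), closed form for even dof.
import math

def chisq_sf_even(chisq, dof):
    assert dof % 2 == 0 and dof > 0
    x = chisq / 2.0
    term, total = 1.0, 1.0
    for j in range(1, dof // 2):   # sum of x^j / j! for j = 0 .. dof/2 - 1
        term *= x / j
        total += term
    return math.exp(-x) * total

print(chisq_sf_even(741.3, 12) < 2e-16)  # True: matches the "< 2e-16" row above
```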

Zoro
03-05-2021, 03:22 PM
So I did this for every population I use and excluded those with very high missing ratios. But some populations like Ganj_Dareh only have a few samples, so I kept them and accepted the SNP trade-off you mentioned. (I think I only eliminated those with 0.85+ missingness.)
Removed Tepecik and went on with only Turkey_N. Since my right pops cannot distinguish Greek vs Bulgarian, I wanted to use Polish as a Slavic proxy. (Not sure if this is a good idea.) After going with Turkey_N, Adygei never fell below 0.6. This tells me that I am missing something. I'll read this next: https://www.biorxiv.org/content/10.1101/2020.04.09.032664v1.full.pdf

Perhaps I should use earlier, let's say medieval, pops as left references, as I am mostly interested in getting a rough estimation of my admixture.


=======================================
target left weight se z
---------------------------------------
1 my_geno Adygei.DG 0.624 0.152 4.103
2 my_geno Turkmen.SG 0.012 0.044 0.282
3 my_geno Greek_1.DG 0.069 0.147 0.47
4 my_geno Polish.DG 0.294 0.094 3.138
---------------------------------------

================================================== ====
f4rank dof chisq p dofdiff chisqdiff p_nested
------------------------------------------------------
1 3 5 0.57 0.989 7 64.579 0
2 2 12 65.149 0 9 164.988 0
3 1 21 230.138 0 11 893.415 0
4 0 32 1123.552 0 NA NA NA
------------------------------------------------------







Those f2 statistics came from the run with ~93k SNPs.

New one with ~100k SNPs (you can see the populations I use: moderns left, ancients right).



===============================================
pop1 pop2 est se
-----------------------------------------------
1 me Bulgarian.DG 0.00071 0.00108
2 me Russia_Abkhasian.DG 0.00094 0.00111
3 me Georgian.DG 0.00097 0.00111
4 me Armenian.DG 0.00109 0.00109
5 me Adygei.DG 0.00114 0.00116
6 me Turkish.DG 0.00143 0.00113
7 me Greek_1.DG 0.00157 0.00112
8 me Russia_NorthOssetian.DG 0.00182 0.00114
9 me Iranian.DG 0.00219 0.00112
10 me Polish.DG 0.0022 0.00146
11 me Turkmen.SG 0.00557 0.00136
12 me Turkey_N 0.00895 0.00101
13 me Iran_GanjDareh_N 0.0279 0.00127
14 me Ukraine_N 0.0285 0.00113
15 me Russia_Yana_UP.SG 0.03287 0.00161
16 me Russia_DevilsCave_N.SG 0.06759 0.00159
17 me Georgia_Kotias.SG 0.18128 0.00152
18 me Russia_HG_Karelia 0.18468 0.00151
19 me Switzerland_Bichon.SG 0.18638 0.00168
20 me Russia_Kolyma_M.SG 0.20019 0.00163
-----------------------------------------------


bonus:

possibly not historically accurate, but I just felt curious to see what I'd get.




==========================================
target left weight se z
------------------------------------------
1 Turkish.DG Greek_1.DG 0.504 0.09 5.619
2 Turkish.DG Turkmen.SG 0.113 0.037 3.066
3 Turkish.DG Iranian.DG 0.383 0.115 3.33
------------------------------------------

================================================== ====
f4rank dof chisq p dofdiff chisqdiff p_nested
------------------------------------------------------
1 2 6 10.737 0.097 8 98.977 0
2 1 14 109.714 0 10 911.517 0
3 0 24 1021.231 0 NA NA NA
------------------------------------------------------



Thank you a lot again, Zoro.



I'm glad you're getting the hang of it. Keep at it and you'll slowly get better!

Yes, it's best to use ancients to model yourself and others, since they are the ancestral ones rather than moderns, although moderns can be used to see how one is shifted compared to neighboring moderns.

It's always good to get a few passing p-value models with decent SEs and then pick the one you think is most accurate based on history.

The model you show for the Kayseri Turks is close to what I had but I didn't use Iranians. I used Armenians, Bulgarians and Turkmen and got about 16% Turkmen with equal portions of Armenian and Bulgarian

Zoro
03-05-2021, 03:38 PM
I'm having a hard time interpreting these tbh:

F2s


pop1 pop2 est se
<chr> <chr> <dbl> <dbl>
1 Kaspias Adygei.DG 0.361 0.00114
2 Kaspias Albanian.DG 0.359 0.00128
3 Kaspias Bashkir.SG 0.363 0.00114
4 Kaspias Bulgarian.DG 0.360 0.00114
5 Kaspias Cretan.DG 0.360 0.00118
6 Kaspias French.DG 0.360 0.00113
7 Kaspias Greek_1.DG 0.360 0.00121
8 Kaspias Greek_2.DG 0.359 0.00122
9 Kaspias Iranian.DG 0.362 0.00115
10 Kaspias Italian_North.DG 0.360 0.00129
11 Kaspias Jordanian.DG 0.365 0.00107
12 Kaspias Mansi.DG 0.365 0.00111
13 Kaspias Polish.DG 0.360 0.00121
14 Kaspias Russian.DG 0.361 0.00113
15 Kaspias Tatar_Tomsk.SG 0.366 0.00108
16 Kaspias Turkish.DG 0.362 0.00112
17 Kaspias Turkmen.SG 0.364 0.00113
18 Kaspias Tuscan_1.DG 0.360 0.00115
19 Kaspias Uzbek.SG 0.366 0.00109

Fst:


pop1 pop2 est se
<chr> <chr> <dbl> <dbl>
1 Kaspias Adygei.DG 0.134 0.00284
2 Kaspias Albanian.DG 0.137 0.00395
3 Kaspias Bashkir.SG 0.137 0.00296
4 Kaspias Bulgarian.DG 0.130 0.00288
5 Kaspias Cretan.DG 0.131 0.00294
6 Kaspias French.DG 0.134 0.00272
7 Kaspias Greek_1.DG 0.131 0.00337
8 Kaspias Greek_2.DG 0.129 0.00364
9 Kaspias Iranian.DG 0.133 0.00293
10 Kaspias Italian_North.DG 0.132 0.00346
11 Kaspias Jordanian.DG 0.141 0.00278
12 Kaspias Mansi.DG 0.156 0.00281
13 Kaspias Polish.DG 0.135 0.00352
14 Kaspias Russian.DG 0.134 0.00292
15 Kaspias Tatar_Tomsk.SG 0.142 0.00293
16 Kaspias Turkish.DG 0.133 0.00275
17 Kaspias Turkmen.SG 0.135 0.00298
18 Kaspias Tuscan_1.DG 0.134 0.00295
19 Kaspias Uzbek.SG 0.140 0.00270

But I should add, the overlapping SNP number is now ~610K, using the 1240K dataset.



target left weight se z
1 Kaspias Bulgarian.DG 0.75 0.041 18.11
2 Kaspias Uzbek.SG 0.25 0.041 6.05


f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 5 5.81 0.32551 7 735.49 1.5e-154
2 0 12 741.3 < 2e-16



I'm glad you're getting the hang of it. Keep at it and you'll slowly get better! Good to see that you were able to use the 1240K set.

I like your Fst results better than the f2s, and I think you should use Fst rather than f2 for distances in the future.

I see that Greek-2 and Bulgarians are closest to you but it looks like you split Greeks into individual samples whereas with Bulgarians it was the average of both Simons samples.

Can you split the Bulgarians and others into individual samples and repost for kicks?


Your qpAdm model looks good! But I'm interested to know how many SNPs were in the extract and which pright you used.

You're showing 25% Uzbek +/- 4%, which is higher than what I got for the Iraqi Kurds and Kayseri Turks. The Iraqi Kurds were something like ~80% Armenian + ~20% Bashkir or Turkmen.

I like to model with moderns sometimes because it gives me an idea of how shifted a population is compared to its neighbors.

For ex Iraqi Kurds consistently get Iranian or Georgian or Armenian + 10% Estonian or some percentage Tatar or Bashkir. It doesn't mean that they have Estonian ancestry, but this tells me that a significant proportion of their ancestors came from NE of Armenia or Iran.

How many SNPs did you end up with when you did your extract ?

Korialstrasz
03-05-2021, 04:16 PM
I'm glad you're getting the hang of it. Keep at it and you'll slowly get better!

Yes, it's best to use ancients to model yourself and others, since they are the ancestral ones rather than moderns, although moderns can be used to see how one is shifted compared to neighboring moderns.

It's always good to get a few passing p-value models with decent SEs and then pick the one you think is most accurate based on history.

The model you show for the Kayseri Turks is close to what I had but I didn't use Iranians. I used Armenians, Bulgarians and Turkmen and got about 16% Turkmen with equal portions of Armenian and Bulgarian

btw, I read in the supplement to the performance assessment paper that the choice of the top right population matters. Is that still the case? As far as I know, it used to be necessary to put the target population at the top of the left populations. That seems to have changed in the R version.


Right Population File

This is a list of reference populations to be included in the qpAdm model. The number of reference populations must be greater than the number of left (i.e. target and source) populations. One population should be listed per line. The first population in the list will be used as a base for all f4-statistics calculated. Population order after the first population does not matter.
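That count requirement is easy to sanity-check before a run; a minimal sketch (the population lists are illustrative, borrowed from models in this thread):

```python
# Sketch: check the qpAdm constraint that right (reference) pops outnumber
# left (target + source) pops, and that no pop sits on both sides.
left = ["Kaspias", "Bulgarian_1.DG", "Turkmen.SG"]
right = ["Adygei", "Cretan", "Iranian", "Mansi", "Polish",
         "Jordanian", "Tatar_Tomsk", "Italian_North", "Albanian", "French"]

assert not set(left) & set(right), "a population cannot be on both sides"
assert len(right) > len(left), "need more right pops than left pops"
print(f"{len(left)} left vs {len(right)} right: OK")
```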

I also played around with the allsnps = TRUE feature, and I can say that the results change A LOT.

Kaspias
03-06-2021, 08:42 AM
I'm glad you're getting the hang of it. Keep at it and you'll slowly get better! Good to see that you were able to use the 1240K set.

I like your Fst results better than the f2s, and I think you should use Fst rather than f2 for distances in the future.

I see that Greek-2 and Bulgarians are closest to you but it looks like you split Greeks into individual samples whereas with Bulgarians it was the average of both Simons samples.

Can you split the Bulgarians and others into individual samples and repost for kicks?




Fst, Bulgarian and Turkish also separated:


pop1 pop2 est se
<chr> <chr> <dbl> <dbl>
1 Kaspias Adygei.DG 0.134 0.00285
2 Kaspias Albanian.DG 0.137 0.00395
3 Kaspias Bashkir.SG 0.136 0.00300
4 Kaspias Bulgarian_1.DG 0.126 0.00347
5 Kaspias Bulgarian_2.DG 0.130 0.00358
6 Kaspias Cretan.DG 0.131 0.00294
7 Kaspias French.DG 0.134 0.00274
8 Kaspias Greek_1.DG 0.130 0.00339
9 Kaspias Greek_2.DG 0.129 0.00364
10 Kaspias Iranian.DG 0.133 0.00292
11 Kaspias Italian_North.DG 0.132 0.00351
12 Kaspias Jordanian.DG 0.141 0.00279
13 Kaspias Mansi.DG 0.156 0.00283
14 Kaspias Polish.DG 0.135 0.00354
15 Kaspias Russian.DG 0.134 0.00293
16 Kaspias Tatar_Tomsk.SG 0.142 0.00294
17 Kaspias Turkish_1.DG 0.145 0.00381
18 Kaspias Turkish_2.DG 0.132 0.00348
19 Kaspias Turkmen.SG 0.135 0.00297
20 Kaspias Tuscan_1.DG 0.134 0.00296
21 Kaspias Uzbek.SG 0.140 0.00272

I don't get it: how can I be closer to French and Adygei than to Albanians? The rest seems OK though.



Your qpAdm model looks good! But I'm interested to know how many SNPs were in the extract and which pright you used.

You're showing 25% Uzbek +/- 4%, which is higher than what I got for the Iraqi Kurds and Kayseri Turks. The Iraqi Kurds were something like ~80% Armenian + ~20% Bashkir or Turkmen.

I like to model with moderns sometimes because it gives me an idea of how shifted a population is compared to its neighbors.

For ex Iraqi Kurds consistently get Iranian or Georgian or Armenian + 10% Estonian or some percentage Tatar or Bashkir. It doesn't mean that they have Estonian ancestry, but this tells me that a significant proportion of their ancestors came from NE of Armenia or Iran.

How many SNPs did you end up with when you did your extract


✔ 1578508 SNPs read in total
! 565741 SNPs remain after filtering. 540537 are polymorphic.

Let me show you all the models:

Right(Adygei, Cretan, Iranian, Mansi, Polish, Jordanian, Tatar_Tomsk, Italian_North, Albanian, French) - all of them Simons samples




target left weight se z
1 Kaspias Bulgarian_1.DG 0.78 0.044 17.51
2 Kaspias Turkmen.SG 0.22 0.044 4.98


f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 5.96 0.65119 10 553.51 1.6e-112
2 0 18 559.48 < 2e-16





target left weight se z
1 Kaspias Bulgarian_1.DG 0.76 0.048 15.89
2 Kaspias Bashkir.SG 0.24 0.048 4.89


f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 6.44 0.59806 10 478.01 2.2e-96
2 0 18 484.45 < 2e-16




target left weight se z
1 Kaspias Bulgarian_1.DG 0.8 0.039 20.51
2 Kaspias Uzbek.SG 0.2 0.039 5.13



f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 4.4 0.81917 10 796.21 1.3e-164
2 0 18 800.62 < 2e-16




target left weight se z
1 Kaspias Bulgarian_1.DG 0.82 0.044 18.83
2 Kaspias Tatar_Tomsk.SG 0.18 0.044 4.15


f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 6.16 0.62951 10 628.79 1.2e-128
2 0 18 634.94 < 2e-16

I get negative scores when I use Turkish samples to model my Turkic part, which might mean the Turkic ancestors of Balkan Turks were not Anatolian-Turkish-like.




target left weight se z
1 Kaspias Bulgarian_1.DG 1.04 5.3 0.2
2 Kaspias Turkish_1.DG -0.04 5.3 -0.01


f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 28.18 0.00044187 10 36.1 0.000081
2 0 18 64.28 4.09e-07




target left weight se z
1 Kaspias Bulgarian_1.DG 11.44 364.26 0.031
2 Kaspias Turkish_2.DG -10.44 364.26 -0.029


f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 29.87 0.00022259 10 30.66 0.00067
2 0 18 60.54 1.6747e-06

Zoro
03-06-2021, 11:28 AM
btw, I read in the supplement to the performance assessment paper that the choice of the top right population matters. Is that still the case? As far as I know, it used to be necessary to put the target population at the top of the left populations. That seems to have changed in the R version.



I also played around with the allsnps = TRUE feature, and I can say that the results change A LOT.


qpAdm requires your target and its sources in pleft and the references in pright

I think that is referring to qpWave not qpAdm

I don't trust the calculations when allsnps=TRUE because you're not using SNPs that overlap all your pops

Zoro
03-06-2021, 11:36 AM
Fst, Bulgarian and Turkish also separated:


pop1 pop2 est se
<chr> <chr> <dbl> <dbl>
1 Kaspias Adygei.DG 0.134 0.00285
2 Kaspias Albanian.DG 0.137 0.00395
3 Kaspias Bashkir.SG 0.136 0.00300
4 Kaspias Bulgarian_1.DG 0.126 0.00347
5 Kaspias Bulgarian_2.DG 0.130 0.00358
6 Kaspias Cretan.DG 0.131 0.00294
7 Kaspias French.DG 0.134 0.00274
8 Kaspias Greek_1.DG 0.130 0.00339
9 Kaspias Greek_2.DG 0.129 0.00364
10 Kaspias Iranian.DG 0.133 0.00292
11 Kaspias Italian_North.DG 0.132 0.00351
12 Kaspias Jordanian.DG 0.141 0.00279
13 Kaspias Mansi.DG 0.156 0.00283
14 Kaspias Polish.DG 0.135 0.00354
15 Kaspias Russian.DG 0.134 0.00293
16 Kaspias Tatar_Tomsk.SG 0.142 0.00294
17 Kaspias Turkish_1.DG 0.145 0.00381
18 Kaspias Turkish_2.DG 0.132 0.00348
19 Kaspias Turkmen.SG 0.135 0.00297
20 Kaspias Tuscan_1.DG 0.134 0.00296
21 Kaspias Uzbek.SG 0.140 0.00272

I don't get it: how can I be closer to French and Adygei than to Albanians? The rest seems OK though.





The problem with modern people who identify as French is that they may have recent N. African or other European ancestors. That aside, since the default in Admixtools is to use both common and less common alleles in the calculations, ENF in Europeans and W. Asians confounds results, because it was so successful and Europeans and W. Asians have a lot of it.

I have found that when I exclude very common alleles from the analysis you have fewer of those sorts of problems due to ENF in Italians and French. The default in Admixtools is to use all alleles subject to maxmiss, meaning that an allele with a MAF of 0.50 is still used. A MAF of 0.50 means the minor or ALT allele is common across all the pops in your analysis, whether African, Asian or European: if "T" is the minor allele at that position, then the Africans and Eurasians in your analysis all carry "T" at high frequency.

If you want to exclude these very ancient, common alleles from your analysis, set maxmaf=0.4 or 0.3 and then rerun. I'm pretty sure you won't get French or British or whatever having a small distance to you because of shared ENF or EHG.
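The maxmaf idea in one line of filtering; a sketch with made-up frequencies, just to show which SNPs a maxmaf=0.4 cutoff would keep:

```python
# Sketch: drop SNPs whose minor-allele frequency exceeds maxmaf, so very old,
# near-universal variants don't dominate the statistics. Frequencies invented.
def minor_allele_freq(alt_freq):
    return min(alt_freq, 1.0 - alt_freq)

snp_alt_freqs = {"rs1": 0.50, "rs2": 0.35, "rs3": 0.95}
kept = [s for s, f in snp_alt_freqs.items() if minor_allele_freq(f) <= 0.4]
print(kept)  # rs1 (MAF 0.50) is dropped; rs3 (ALT 0.95, MAF 0.05) is kept
```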

Zoro
03-06-2021, 11:39 AM
I get negative scores when I use Turkish samples to model my Turkic part, which might mean the Turkic ancestors of Balkan Turks were not Anatolian-Turkish-like.




target left weight se z
1 Kaspias Bulgarian_1.DG 1.04 5.3 0.2
2 Kaspias Turkish_1.DG -0.04 5.3 -0.01


f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 28.18 0.00044187 10 36.1 0.000081
2 0 18 64.28 4.09e-07


target left weight se z
1 Kaspias Bulgarian_1.DG 11.44 364.26 0.031
2 Kaspias Turkish_2.DG -10.44 364.26 -0.029



Yes, so basically it's saying that Bulgarians already have a lot of the Turkish-type admixture, so you don't need additional Turkish to model yourself. Besides, your SEs get huge, like yours did here, if you use 2 similar pops like that.
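One detail that makes those wild numbers less mysterious: qpAdm weights are constrained to sum to 1, so when two sources are nearly interchangeable one weight can swing far negative while the other balloons past 1, with enormous SEs. A quick check with the Turkish_2 model above:

```python
# Sketch: the weights still sum to 1 even when they are individually absurd.
weights = [11.44, -10.44]   # Bulgarian_1.DG and Turkish_2.DG above
print(abs(sum(weights) - 1.0) < 1e-9)  # True
```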

Zoro
03-06-2021, 11:52 AM
✔ 1578508 SNPs read in total
! 565741 SNPs remain after filtering. 540537 are polymorphic.

Let me show you all the models:

Right(Adygei, Cretan, Iranian, Mansi, Polish, Jordanian, Tatar_Tomsk, Italian_North, Albanian, French) - all of them Simons samples




target left weight se z
1 Kaspias Bulgarian_1.DG 0.78 0.044 17.51
2 Kaspias Turkmen.SG 0.22 0.044 4.98


f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 5.96 0.65119 10 553.51 1.6e-112
2 0 18 559.48 < 2e-16





target left weight se z
1 Kaspias Bulgarian_1.DG 0.76 0.048 15.89
2 Kaspias Bashkir.SG 0.24 0.048 4.89


f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 6.44 0.59806 10 478.01 2.2e-96
2 0 18 484.45 < 2e-16




target left weight se z
1 Kaspias Bulgarian_1.DG 0.8 0.039 20.51
2 Kaspias Uzbek.SG 0.2 0.039 5.13



f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 4.4 0.81917 10 796.21 1.3e-164
2 0 18 800.62 < 2e-16




target left weight se z
1 Kaspias Bulgarian_1.DG 0.82 0.044 18.83
2 Kaspias Tatar_Tomsk.SG 0.18 0.044 4.15


f4rank dof chisq p.value dofdiff chisqdiff p_nested
1 1 8 6.16 0.62951 10 628.79 1.2e-128
2 0 18 634.94 < 2e-16




Basically these mean that you're significantly E Asian shifted compared to Bulgarians (keep in mind Bulgarians themselves have Siberian and E Asian). Your best model based on p-value seems to be Uzbeks and Turkmen, but if you want to fine-tune this even more, add Mongolians or Han to pright. Your p-values may drop but that's fine. I also don't get how you have Tatar_Tomsk in both pright and pleft at the same time.

Zoro
03-06-2021, 12:08 PM
@Kaspias

What command did you use to extract and to do FST?

Kaspias
03-06-2021, 12:09 PM
Basically these mean that you're significantly E Asian shifted compared to Bulgarians (keep in mind Bulgarians themselves have Siberian and E Asian). Your best model based on p-value seems to be Uzbeks and Turkmen but if you want to fine tune this even more add Mongolians or Han to pright. Your p-values may drop but that's fine. I also don't get how you have Tatar-Tomsk in both pright and pleft at the same time

I replaced Tatar_Tomsk with Bashkir on the right while running Tomsk on the left, and Tomsk was on the right while running the other populations (Uzbek, Turkmen...).

Thank you for your input. Now I have a question: I realized that with ancient populations on the right the SNP number drops to around ~100k, whereas I now have ~600k with moderns. I constantly hear that ancients on the right are the better idea, but considering the SNP count, which would you say is preferable in my case?

I'd like to model Balkan populations with Medieval samples, for example, but I believe the SNP count will be around 80K. Is that enough for a decent run, or too low?

Kaspias
03-06-2021, 12:12 PM
@Kaspias

What command did you use to extract and to do FST?


fst_blocks = fst("fstdir")
print((fst_blocks), n=2000)

Zoro
03-06-2021, 12:16 PM
I replaced Tatar_Tomsk with Bashkir on the right while running Tomsk on the left, and Tomsk was on the right while running the other populations (Uzbek, Turkmen...).

Thank you for your input. Now I have a question: I realized that with ancient populations on the right the SNP number drops to around ~100k, whereas I now have ~600k with moderns. I constantly hear that ancients on the right are the better idea, but considering the SNP count, which would you say is preferable in my case?

I'd like to model Balkan populations with Medieval samples, for example, but I believe the SNP count will be around 80K. Is that enough for a decent run, or too low?

I think it should be enough. Use the highest quality samples and stay away from 2 similar sources. Give it a try

Kaspias
03-06-2021, 04:51 PM
I think it should be enough. Use the highest quality samples and stay away from 2 similar sources. Give it a try

Do you remember asking me which population to use to represent the Anatolian ancestry of Anatolian Turks? I have found the Roopkund outlier in the spreadsheet, who is a Central Anatolian Greek from the classical Ottoman era, and simulated possible scenarios using 3 different Turkic populations.

The Simons Turkish samples are from Hodoğlugil's study and were collected in Kayseri: http://simonsfoundation.s3.amazonaws.com/share/SCDA/datasets/10_24_2014_SGDP_metainformation_update.txt

These Kayseri samples have ~6% East Eurasian on average (referencing Gedmatch), so we can draw further conclusions based on that.


# right

"Mbuti.DG",
"Han.DG",
"Saami.DG",
"Icelandic.DG",
"Sardinian.DG",
"Punjabi.DG",
"Eskimo_Chaplin.DG",
"BedouinB.DG",
"Basque.DG"



! 90852 SNPs remain after filtering. 78431 are polymorphic.


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Anatolian_Turkish.DG Anatolian_Greek_Medieval 0.863 0.0363 23.8
2 Anatolian_Turkish.DG Kimak.SG 0.137 0.0363 3.76

f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 7 8.43 2.96e-1 9 419. 1.00e-84
2 0 16 428. 5.31e-81 NA NA NA



! 99251 SNPs remain after filtering. 85286 are polymorphic.


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Anatolian_Turkish.DG Anatolian_Greek_Medieval 0.816 0.0472 17.3
2 Anatolian_Turkish.DG Gokturk.SG 0.184 0.0472 3.90

f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 7 15.8 2.74e-2 9 303. 6.10e-60
2 0 16 319. 3.31e-58 NA NA NA



! 247268 SNPs remain after filtering. 213710 are polymorphic.


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Anatolian_Turkish.DG Anatolian_Greek_Medieval 0.851 0.0283 30.1
2 Anatolian_Turkish.DG Ottoman_MA2195.SG 0.149 0.0283 5.27

f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 7 17.2 1.64e-2 9 728. 6.69e-151
2 0 16 745. 3.16e-148 NA NA NA

Zoro
03-06-2021, 05:34 PM
Do you remember asking me which population to use to represent the Anatolian ancestry of Anatolian Turks? I have found the Roopkund outlier in the spreadsheet, who is a Central Anatolian Greek from the classical Ottoman era, and simulated possible scenarios using 3 different Turkic populations.

The Simons Turkish samples are from Hodoğlugil's study and were collected in Kayseri: http://simonsfoundation.s3.amazonaws.com/share/SCDA/datasets/10_24_2014_SGDP_metainformation_update.txt

These Kayseri samples have ~6% East Eurasian on average (referencing Gedmatch), so we can draw further conclusions based on that.


# right

"Mbuti.DG",
"Han.DG",
"Saami.DG",
"Icelandic.DG",
"Sardinian.DG",
"Punjabi.DG",
"Eskimo_Chaplin.DG",
"BedouinB.DG",
"Basque.DG"



! 90852 SNPs remain after filtering. 78431 are polymorphic.


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Anatolian_Turkish.DG Anatolian_Greek_Medieval 0.863 0.0363 23.8
2 Anatolian_Turkish.DG Kimak.SG 0.137 0.0363 3.76

f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 7 8.43 2.96e-1 9 419. 1.00e-84
2 0 16 428. 5.31e-81 NA NA NA



! 99251 SNPs remain after filtering. 85286 are polymorphic.


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Anatolian_Turkish.DG Anatolian_Greek_Medieval 0.816 0.0472 17.3
2 Anatolian_Turkish.DG Gokturk.SG 0.184 0.0472 3.90

f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 7 15.8 2.74e-2 9 303. 6.10e-60
2 0 16 319. 3.31e-58 NA NA NA



! 247268 SNPs remain after filtering. 213710 are polymorphic.


target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Anatolian_Turkish.DG Anatolian_Greek_Medieval 0.851 0.0283 30.1
2 Anatolian_Turkish.DG Ottoman_MA2195.SG 0.149 0.0283 5.27

f4rank dof chisq p dofdiff chisqdiff p_nested
<int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 7 17.2 1.64e-2 9 728. 6.69e-151
2 0 16 745. 3.16e-148 NA NA NA

Looks like the 1st model is best with a p-value of 0.30. The other ones can sort of be rejected.

As far as the Kayseri samples go, I kind of guessed that they would score about that much on Gedmatch. Interestingly, I didn't get nearly as many passing W Asian + E Asian models for them as I did for Iraqi Kurds. It probably has to do with a good modern W Asian source. Another reason I like using ancients.

I also got this sort of passing model for the Kayseri Turks p-value 0.05

Admix SE
Turkish Armenian 57% 7%
Turkish Bulgarian 35% 7%
Turkish Yakut 8% 1%

Again those Turks would have more than 8% NE Asian because Armenians and Bulgarians also have some

I think you should try modeling yourself with these also plus Siberian and E Asian. You can use the ancient pright list I use which I posted earlier. I'm posting their missingness rate. They're not that bad.

Anatolia_EBA I2495 3682752 4676043 0.79
Anatolia_EBA I2683 3761883 4668444 0.81
Anatolia_EBA.SG MA2210_final 4094550 4668444 0.88
Anatolia_EBA.SG MA2212_final 4120486 4676043 0.88
Anatolia_EBA.SG MA2213_final 3984232 4668444 0.85
Anatolia_Epipaleolithic ZBC_IPB001 3830320 4676043 0.82
Anatolia_IA.SG MA2198_final 4157111 4668444 0.89
Anatolia_MLBA.SG MA2200_final 3750514 4676043 0.80
Anatolia_MLBA.SG MA2203_final 4091029 4668444 0.88
Anatolia_MLBA.SG MA2205_final 4152321 4676043 0.89


As far as your modern pright list I would add maybe Armenians or Iranians

Komintasavalta
03-06-2021, 05:40 PM
I did this to get FST distances:


$ printf %s\\n Mansi Finnish Nganasan Selkup Karelian Udmurt Mordovian>pops
$ R -e 'library(admixtools);fst=fst("g/v44.3_HO_public/v44.3_HO_public",pop1=readLines("pops"));write.csv(fst,"fst",quote=F)'
ℹ Reading allele frequencies from packedancestrymap files...
ℹ v44.3_HO_public.geno has 13197 samples and 597573 SNPs
ℹ Calculating allele frequencies from 19 samples in 4 populations
ℹ Expected size of allele frequency data: 86 MB
597k SNPs read...
✔ 597573 SNPs read in total
! 593124 SNPs remain after filtering. 414780 are polymorphic.
ℹ Allele frequency matrix for 593124 SNPs and 4 populations is 62 MB
ℹ Computing pairwise f2 for all SNPs and population pairs requires 493 MB RAM without splitting
ℹ Computing without splitting since 493 < 8000 (maxmem)...
ℹ Data written to f2/
ℹ Reading precomputed data for 4 populations...
ℹ Reading f2 data for pair 10 out of 10...
Warning message:
In read_f2(dir, pops, pops2, afprod = afprod, fst = fst, remove_na = remove_na, :
Discarding 1 block(s) due to missing values!
Discarded block(s): 535
>
>
$ cat fst
,pop1,pop2,est,se,z,p
1,Finnish,Karelian,0.00129340940098996,0.000385024618276676,3.35929013261311,0.000781429796323533
2,Finnish,Mordovian,0.00543917401762932,0.0003253493038913,16.7179519137578,9.6995724807741e-63
3,Finnish,Nganasan,0.119054350470445,0.00113575044283692,104.824392736371,0
4,Finnish,Selkup,0.0601437871347565,0.000773515188052884,77.753854175963,0
5,Finnish,Udmurt,0.0187032075983067,0.000585527652038594,31.9424839001009,6.87035668325847e-224
6,Karelian,Mordovian,0.00590771927078587,0.000239357005605168,24.6816225656289,1.68484078541802e-134
7,Karelian,Udmurt,0.019523384287915,0.000473936956593016,41.1940533784545,0
8,Mansi,Finnish,0.0402190424166203,0.000841540850181047,47.7921450966614,0
9,Mansi,Karelian,0.0399509729801598,0.000728931940431647,54.8075489139662,0
10,Mansi,Mordovian,0.0383778238793512,0.000668216221602333,57.4332418739013,0
11,Mansi,Nganasan,0.0602924170396429,0.000867560662779261,69.4964855212476,0
12,Mansi,Selkup,0.0223050689999093,0.000513096315174833,43.4715049401769,0
13,Mansi,Udmurt,0.0240652073778455,0.000663882926915772,36.2491734644333,1.02394799408991e-287
14,Nganasan,Karelian,0.118602793551424,0.00108015804818372,109.801333009418,0
15,Nganasan,Mordovian,0.11745770899229,0.00099405063691636,118.160689838352,0
16,Nganasan,Selkup,0.0504386703379596,0.000674035384985417,74.8308938395731,0
17,Nganasan,Udmurt,0.0911528579608077,0.000973329536820952,93.650561821567,0
18,Selkup,Karelian,0.0595410382563537,0.00071504905760348,83.268466160781,0
19,Selkup,Mordovian,0.0579452329275561,0.000631133692337951,91.8113446184526,0
20,Selkup,Udmurt,0.0409818006630253,0.000612173976633259,66.9446958337084,0
21,Udmurt,Mordovian,0.0170968626475659,0.000406949757311595,42.0122197897642,0

There's probably an easier way to do this in R, but this converts the FST pairs into a table:


$ awk -F, 'NR>1{print$3","$2","$4;print$2","$3","$4}' fst|awk -F, '{print$1","$1","}1'|sort -u>/tmp/a
$ cut -d, -f3 /tmp/a|awk '{printf"%.6f"(NR%n?",":"\n"),$0}' n=$(awk 'END{print NR^.5}' /tmp/a) -|paste -d, <(cut -d, -f1 /tmp/a|sort -u) -|cat <(cut -d, -f1 /tmp/a|sort -u|paste -sd, -|sed s/^/,/) ->/tmp/b
$ cat /tmp/b
,Finnish,Karelian,Mansi,Mordovian,Nganasan,Selkup,Udmurt
Finnish,0.000000,0.001293,0.040219,0.005439,0.119054,0.060144,0.018703
Karelian,0.001293,0.000000,0.039951,0.005908,0.118603,0.059541,0.019523
Mansi,0.040219,0.039951,0.000000,0.038378,0.060292,0.022305,0.024065
Mordovian,0.005439,0.005908,0.038378,0.000000,0.117458,0.057945,0.017097
Nganasan,0.119054,0.118603,0.060292,0.117458,0.000000,0.050439,0.091153
Selkup,0.060144,0.059541,0.022305,0.057945,0.050439,0.000000,0.040982
Udmurt,0.018703,0.019523,0.024065,0.017097,0.091153,0.040982,0.000000
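For the record, the same reshaping can be sketched with tidyr instead of awk (this is not the poster's pipeline; the tiny `fst` tibble below is a stand-in for the real output of `fst()`):

```r
# Sketch: mirror each FST pair in both directions, then pivot the long
# table into a wide symmetric matrix. 'fst' here is a toy stand-in.
library(dplyr)
library(tidyr)

fst <- tibble(
  pop1 = c("Finnish", "Finnish", "Karelian"),
  pop2 = c("Karelian", "Mansi", "Mansi"),
  est  = c(0.001293, 0.040219, 0.039951)
)

wide <- bind_rows(fst, select(fst, pop1 = pop2, pop2 = pop1, est)) %>%
  pivot_wider(names_from = pop2, values_from = est, values_fill = 0) %>%
  arrange(pop1)
```

`values_fill = 0` fills the diagonal with zeros, matching the awk output above.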

And this creates a heatmap of the table:


R -e 'install.packages(c("pheatmap","colorspace"),repos="https://cloud.r-project.org")'
R -e 'library(pheatmap)
library(colorspace)

t<-read.csv("/tmp/b",header=T,row.names=1,check.names=F)
t[t==0]=NA

pheatmap(
1e4*t,
filename="/tmp/a.png",
legend=F,
clustering_callback=function(...){hclust(as.dist(t))},
cellwidth=18,
cellheight=12,
border_color=NA,
display_numbers=T,
number_format="%.0f",
number_color="black",
fontsize_number=6,
colorRampPalette(hex(HSV(c(210,180,150,120,90,60,30,0),.5,1)))(256)
)'

https://i.ibb.co/GRjLJm3/fst.png

At first I got an error that there were too many missing blocks, so I tried adding a `maxmiss=Inf` parameter:


R -e 'library("admixtools");extract_f2(pref="g/v44.3_HO_public/v44.3_HO_public",pops=c("Finnish","Mansi","Mari.SG","Estonian.DG"),outdir="f2",maxmiss=Inf);f2=f2_from_precomp("f2");fst=fst(f2);write.csv(fst,"fst",quote=F)'

However it gave me nonsensical results where the distance between Finns and Maris was an order of magnitude bigger than the distance between Finns and Mansi:


,pop1,pop2,est,se,z,p
1,Estonian.DG,Finnish,0.000904946578981571,0.000400460532337746,2.25976471064156,0.0238358576870844
2,Estonian.DG,Mansi,0.015818211648642,0.00055465784565136,28.5188639675077,6.83699132061968e-179
3,Estonian.DG,Mari.SG,0.174033745411937,0.00101794728786578,170.96538051279,0
4,Finnish,Mansi,0.0139136259691746,0.000307938735432869,45.1830977016134,0
5,Finnish,Mari.SG,0.17331351350787,0.000760056331041605,228.027195392683,0
6,Mansi,Mari.SG,0.17490109671494,0.000788482478735399,221.819890018931,0

Kaspias
03-07-2021, 12:47 PM
Looks like the 1st model is best with a p-value of 0.30. The other ones can sort of be rejected.

As far as the Kayseri samples I kind of guessed that they would score that much on Gedmatch. Interestingly I didn't get nearly as many passing W Asian + E Asian models for them as I did for Iraqi Kurds. Probably has to do with a good modern W Asian source. Another reason I like using Ancients.

I also got this sort of passing model for the Kayseri Turks p-value 0.05

Admix SE
Turkish Armenian 57% 7%
Turkish Bulgarian 35% 7%
Turkish Yakut 8% 1%

Again those Turks would have more than 8% NE Asian because Armenians and Bulgarians also have some

I think you should try modeling yourself with these also plus Siberian and E Asian. You can use the ancient pright list I use which I posted earlier. I'm posting their missingness rate. They're not that bad.

Anatolia_EBA I2495 3682752 4676043 0.79
Anatolia_EBA I2683 3761883 4668444 0.81
Anatolia_EBA.SG MA2210_final 4094550 4668444 0.88
Anatolia_EBA.SG MA2212_final 4120486 4676043 0.88
Anatolia_EBA.SG MA2213_final 3984232 4668444 0.85
Anatolia_Epipaleolithic ZBC_IPB001 3830320 4676043 0.82
Anatolia_IA.SG MA2198_final 4157111 4668444 0.89
Anatolia_MLBA.SG MA2200_final 3750514 4676043 0.80
Anatolia_MLBA.SG MA2203_final 4091029 4668444 0.88
Anatolia_MLBA.SG MA2205_final 4152321 4676043 0.89


As far as your modern pright list I would add maybe Armenians or Iranians

Thanks again. I had Iranians on the pright, but it somehow reduced the p-value, so I removed them.


I think these 3 models are really crucial for answering the question of what the Oghuz genome was like. We had been using DA89 for a long time, but I recently started to question the accuracy of our method (I think DA89 is 3/4 Gokturk and 1/4 Sogdian, which makes it a false candidate for the Oghuz) and came to the conclusion that the Oghuz should be in between Kipchak and Kimak after researching the historical perspective (the work that helped most was İlk Oğuzlar by Osman Karatay). Neither of the Kipchak samples we have is representative, so I went with the Kimak-like option. Apparently I was right, because this is the only passing model when using Medieval samples. In addition, the region where these samples were collected (Kayseri) housed Cappadocian Greeks, which is what I used for their native admixture, so the result is pretty solid and one can also make guesses about Western Anatolia (10-40% Kimak?).

Token
03-07-2021, 12:54 PM
# right

"Mbuti.DG",
"Han.DG",
"Saami.DG",
"Icelandic.DG",
"Sardinian.DG",
"Punjabi.DG",
"Eskimo_Chaplin.DG",
"BedouinB.DG",
"Basque.DG"
This is a very weak pright list.

Kaspias
03-07-2021, 01:03 PM
This is a very weak pright list.

Any suggestions?

I have tried using ancient ones before, like Devils Gate, Ust Ishim, and so on, but the SNP number was only 9k, so I went with the modern option.

Token
03-07-2021, 01:29 PM
Any suggestions?

I have tried using ancient ones before, like Devils Gate, Ust Ishim, and so on, but the SNP number was only 9k, so I went with the modern option.

The crucial thing is to avoid using low coverage samples, because the SNP overlap is always defined by the weakest link of the chain, and always use allsnps=YES (maxmiss=1 in Admixtools2, I believe). There is not much secret in choosing the pright: choose populations that don't violate the qpAdm assumption of no geneflow from pleft into pright (in practice, select prehistoric samples - the older the better), and make sure the populations in pright are all asymmetrically related to the populations in pleft.
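As a rough sketch of that advice in ADMIXTOOLS 2 (the dataset path and population names are placeholders, and `allsnps` support in this call is my assumption, so check your version's docs):

```r
# Sketch only: run qpadm straight from the genotype prefix so each f4
# statistic can use all SNPs available for its populations (the analogue
# of allsnps: YES in classic qpAdm).
library(admixtools)

left   <- c("Anatolian_Greek_Medieval", "Kimak.SG")
# Old, high-coverage prehistoric outgroups (placeholder list)
right  <- c("Mbuti.DG", "Russia_Kostenki14", "Russia_Ust_Ishim.DG",
            "Czech_Vestonice16", "China_Tianyuan")
target <- "Anatolian_Turkish.DG"

qp <- qpadm("g/v44.3_HO_public/v44.3_HO_public",
            left = left, right = right, target = target,
            allsnps = TRUE)
qp$weights
```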

Zoro
03-07-2021, 03:03 PM
I did this to get FST distances:


$ printf %s\\n Mansi Finnish Nganasan Selkup Karelian Udmurt Mordovian>pops
$ R -e 'library(admixtools);fst=fst("g/v44.3_HO_public/v44.3_HO_public",pop1=readLines("pops"));write.csv(fst,"fst",quote=F)'
ℹ Reading allele frequencies from packedancestrymap files...
ℹ v44.3_HO_public.geno has 13197 samples and 597573 SNPs
ℹ Calculating allele frequencies from 19 samples in 4 populations
ℹ Expected size of allele frequency data: 86 MB
597k SNPs read...
✔ 597573 SNPs read in total
! 593124 SNPs remain after filtering. 414780 are polymorphic.
ℹ Allele frequency matrix for 593124 SNPs and 4 populations is 62 MB
ℹ Computing pairwise f2 for all SNPs and population pairs requires 493 MB RAM without splitting
ℹ Computing without splitting since 493 < 8000 (maxmem)...
ℹ Data written to f2/
ℹ Reading precomputed data for 4 populations...
ℹ Reading f2 data for pair 10 out of 10...
Warning message:
In read_f2(dir, pops, pops2, afprod = afprod, fst = fst, remove_na = remove_na, :
Discarding 1 block(s) due to missing values!
Discarded block(s): 535
>
>
$ cat fst
,pop1,pop2,est,se,z,p
1,Finnish,Karelian,0.00129340940098996,0.000385024618276676,3.35929013261311,0.000781429796323533
2,Finnish,Mordovian,0.00543917401762932,0.0003253493038913,16.7179519137578,9.6995724807741e-63
3,Finnish,Nganasan,0.119054350470445,0.00113575044283692,104.824392736371,0
4,Finnish,Selkup,0.0601437871347565,0.000773515188052884,77.753854175963,0
5,Finnish,Udmurt,0.0187032075983067,0.000585527652038594,31.9424839001009,6.87035668325847e-224
6,Karelian,Mordovian,0.00590771927078587,0.000239357005605168,24.6816225656289,1.68484078541802e-134
7,Karelian,Udmurt,0.019523384287915,0.000473936956593016,41.1940533784545,0
8,Mansi,Finnish,0.0402190424166203,0.000841540850181047,47.7921450966614,0
9,Mansi,Karelian,0.0399509729801598,0.000728931940431647,54.8075489139662,0
10,Mansi,Mordovian,0.0383778238793512,0.000668216221602333,57.4332418739013,0
11,Mansi,Nganasan,0.0602924170396429,0.000867560662779261,69.4964855212476,0
12,Mansi,Selkup,0.0223050689999093,0.000513096315174833,43.4715049401769,0
13,Mansi,Udmurt,0.0240652073778455,0.000663882926915772,36.2491734644333,1.02394799408991e-287
14,Nganasan,Karelian,0.118602793551424,0.00108015804818372,109.801333009418,0
15,Nganasan,Mordovian,0.11745770899229,0.00099405063691636,118.160689838352,0
16,Nganasan,Selkup,0.0504386703379596,0.000674035384985417,74.8308938395731,0
17,Nganasan,Udmurt,0.0911528579608077,0.000973329536820952,93.650561821567,0
18,Selkup,Karelian,0.0595410382563537,0.00071504905760348,83.268466160781,0
19,Selkup,Mordovian,0.0579452329275561,0.000631133692337951,91.8113446184526,0
20,Selkup,Udmurt,0.0409818006630253,0.000612173976633259,66.9446958337084,0
21,Udmurt,Mordovian,0.0170968626475659,0.000406949757311595,42.0122197897642,0

There's probably an easier way to do this in R, but this converts the FST pairs into a table:


$ awk -F, 'NR>1{print$3","$2","$4;print$2","$3","$4}' fst|awk -F, '{print$1","$1","}1'|sort -u>/tmp/a
$ cut -d, -f3 /tmp/a|awk '{printf"%.6f"(NR%n?",":"\n"),$0}' n=$(awk 'END{print NR^.5}' /tmp/a) -|paste -d, <(cut -d, -f1 /tmp/a|sort -u) -|cat <(cut -d, -f1 /tmp/a|sort -u|paste -sd, -|sed s/^/,/) ->/tmp/b
$ cat /tmp/b
,Finnish,Karelian,Mansi,Mordovian,Nganasan,Selkup,Udmurt
Finnish,0.000000,0.001293,0.040219,0.005439,0.119054,0.060144,0.018703
Karelian,0.001293,0.000000,0.039951,0.005908,0.118603,0.059541,0.019523
Mansi,0.040219,0.039951,0.000000,0.038378,0.060292,0.022305,0.024065
Mordovian,0.005439,0.005908,0.038378,0.000000,0.117458,0.057945,0.017097
Nganasan,0.119054,0.118603,0.060292,0.117458,0.000000,0.050439,0.091153
Selkup,0.060144,0.059541,0.022305,0.057945,0.050439,0.000000,0.040982
Udmurt,0.018703,0.019523,0.024065,0.017097,0.091153,0.040982,0.000000

And this creates a heatmap of the table:


R -e 'install.packages(c("pheatmap","colorspace"),repos="https://cloud.r-project.org")'
R -e 'library(pheatmap)
library(colorspace)

t<-read.csv("/tmp/b",header=T,row.names=1,check.names=F)
t[t==0]=NA

pheatmap(
1e4*t,
filename="/tmp/a.png",
legend=F,
clustering_callback=function(...){hclust(as.dist(t))},
cellwidth=18,
cellheight=12,
border_color=NA,
display_numbers=T,
number_format="%.0f",
number_color="black",
fontsize_number=6,
colorRampPalette(hex(HSV(c(210,180,150,120,90,60,30,0),.5,1)))(256)
)'

https://i.ibb.co/GRjLJm3/fst.png

At first I got an error that there were too many missing blocks, so I tried adding a `maxmiss=Inf` parameter:


R -e 'library("admixtools");extract_f2(pref="g/v44.3_HO_public/v44.3_HO_public",pops=c("Finnish","Mansi","Mari.SG","Estonian.DG"),outdir="f2",maxmiss=Inf);f2=f2_from_precomp("f2");fst=fst(f2);write.csv(fst,"fst",quote=F)'

However it gave me nonsensical results where the distance between Finns and Maris was an order of magnitude bigger than the distance between Finns and Mansi:


,pop1,pop2,est,se,z,p
1,Estonian.DG,Finnish,0.000904946578981571,0.000400460532337746,2.25976471064156,0.0238358576870844
2,Estonian.DG,Mansi,0.015818211648642,0.00055465784565136,28.5188639675077,6.83699132061968e-179
3,Estonian.DG,Mari.SG,0.174033745411937,0.00101794728786578,170.96538051279,0
4,Finnish,Mansi,0.0139136259691746,0.000307938735432869,45.1830977016134,0
5,Finnish,Mari.SG,0.17331351350787,0.000760056331041605,228.027195392683,0
6,Mansi,Mari.SG,0.17490109671494,0.000788482478735399,221.819890018931,0

You get "A" for effort. Nice heatmap.

They had some errors in the code. There's a new version of Admixtools, dated yesterday. Remove it and re-install it. Also:


593124 SNPs remain after filtering. 414780 are polymorphic

indicates you have too many uninformative non-polymorphic SNPs. To remove them set maxmaf=0.45

For FST it's crucial to have maxmiss at default which is 0 because you want all your samples to overlap each other exactly for unbiased results

Zoro
03-07-2021, 03:37 PM
Thanks again. I had Iranians on the pright, but it somehow reduced the p-value, so I removed them.


I think these 3 models are really crucial for answering the question of what the Oghuz genome was like. We had been using DA89 for a long time, but I recently started to question the accuracy of our method (I think DA89 is 3/4 Gokturk and 1/4 Sogdian, which makes it a false candidate for the Oghuz) and came to the conclusion that the Oghuz should be in between Kipchak and Kimak after researching the historical perspective (the work that helped most was İlk Oğuzlar by Osman Karatay). Neither of the Kipchak samples we have is representative, so I went with the Kimak-like option. Apparently I was right, because this is the only passing model when using Medieval samples. In addition, the region where these samples were collected (Kayseri) housed Cappadocian Greeks, which is what I used for their native admixture, so the result is pretty solid and one can also make guesses about Western Anatolia (10-40% Kimak?).

The good thing about having SE and p-values is you get an idea which models are feasible unlike with Admixture or G25 where you get a bunch of feasible models and you have no idea which ones are no good (distance in G25 is NOT a substitute for p-values and SE by any means). Of course there are other issues besides model viability in G25.

For FST it's crucial to have maxmiss at default which is 0 because you want all your samples to overlap each other exactly for unbiased results

They had some errors in the code. There's a new version of Admixtools, dated yesterday. Remove it, re-install it, and re-run extract at default (maxmiss=0). When you run extract, add the option maxmaf=0.45 to get rid of minor alleles with a frequency > 45% across all pops (very common old alleles shared by most global pops). Now I'm getting 940,000 polymorphic SNPs on the Simons samples at maxmiss=0 with the latest download of Admixtools.
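A sketch of those extract settings (the dataset path and populations are placeholders):

```r
# Sketch: maxmiss = 0 (the default) keeps only SNPs present in every
# population, which is what unbiased FST needs; maxmaf = 0.45 drops SNPs
# whose minor allele frequency exceeds 45% across all pops.
library(admixtools)

extract_f2(
  pref    = "g/v44.3_HO_public/v44.3_HO_public",
  pops    = c("Finnish", "Mansi", "Nganasan", "Udmurt"),
  outdir  = "f2",
  maxmiss = 0,
  maxmaf  = 0.45
)
fst(f2_from_precomp("f2"))
```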

Also, Dilawer informed me that Plink doesn't maintain allele order. This can screw up calculations in Admixtools. Here are a couple of rows from the .snp file:

rs199706086 1 0 10250 A C
rs112750067 1 0 10327 T C
rs201725126 1 0 13116 G T
rs200579949 1 0 13118 G A
rs180734498 1 0 13302 T C
rs79585140 1 0 14907 G A
rs75454623 1 0 14930 A G

Column 5 is supposed to be the Reference or Ancestral allele and column 6 the Derived or Minor allele. If you check the dbSNP website you'll notice that the ones I bolded are in the wrong order. In other words, one should be T G and the other A G. Plink screws up allele order.

I fixed the allele order in my .snp file using Dilawer's script so now it agrees with dbSNP. There were thousands of such mistakes in my .snp file
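The kind of fix described above can be sketched as follows (this is not Dilawer's actual script; the `ref` table is a hand-made stand-in for a dbSNP lookup, with values taken from the post's rs201725126 example):

```r
# Toy sketch: flag .snp rows whose allele order disagrees with a
# reference lookup, then swap them back into ref/alt order.
snp <- data.frame(
  id = c("rs112750067", "rs201725126"),
  a1 = c("T", "G"),   # as written by Plink (rs201725126 is swapped)
  a2 = c("C", "T")
)
ref <- data.frame(    # stand-in for the dbSNP ref/alt order
  id  = c("rs112750067", "rs201725126"),
  ref = c("T", "T"),
  alt = c("C", "G")
)

m <- merge(snp, ref, by = "id")
flipped <- m$a1 == m$alt & m$a2 == m$ref   # TRUE where ref/alt are swapped
m[flipped, c("a1", "a2")] <- m[flipped, c("ref", "alt")]
```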

Komintasavalta
03-07-2021, 11:25 PM
This prints all columns and all rows of tables, prints only 3 significant digits, and doesn't display negative numbers in red (https://tibble.tidyverse.org/reference/formatting.html): `options(tibble.width=Inf,tibble.print_max=Inf, pillar.sigfig=3,pillar.neg=F)`.

`options(width=Sys.getenv("COLUMNS"))` or `options(width=180)` increases the width of the terminal.

`print(tbl,width=Inf,n=Inf)` or `as.data.frame(tbl)` displays a whole tibble. This displays an HTML table in a browser: `install.packages("formattable");library(formattable);formattable(tbl)`.

This removes columns from a tibble and formats the table as CSV where doubles have 3 digits after the decimal point:


> qp$popdrop%>%select(!c(pat,wt,dof,dofdiff,chisqdiff,p_nested))%>%mutate(across(where(is.double),round,3))%>%format_csv%>%cat
chisq,p,f4rank,Nganasan,Norway_N_HG.SG,Russia_AfontovaGora3,Turkey_Epipaleolithic,feasible,best
3.251,0.354,3,0.128,0.132,0.116,0.625,TRUE,NA
15.538,0.004,2,0.151,1.265,-0.416,NA,FALSE,TRUE
5.152,0.272,2,0.144,0.302,NA,0.554,TRUE,TRUE
3.827,0.43,2,0.123,NA,0.176,0.701,TRUE,TRUE
11.855,0.018,2,NA,-0.814,0.64,1.173,FALSE,TRUE
28.572,0,1,0.072,0.928,NA,NA,TRUE,NA
116.62,0,1,-0.108,NA,1.108,NA,FALSE,NA
11.9,0.036,1,0.179,NA,NA,0.821,TRUE,NA
22.973,0,1,NA,1.388,-0.388,NA,FALSE,NA
28.13,0,1,NA,0.622,NA,0.378,TRUE,NA
16.715,0.005,1,NA,NA,0.272,0.728,TRUE,NA
1386.434,0,0,1,NA,NA,NA,TRUE,NA
32.704,0,0,NA,1,NA,NA,TRUE,NA
125.18,0,0,NA,NA,1,NA,TRUE,NA
36.405,0,0,NA,NA,NA,1,TRUE,NA
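A side note on where these p values come from (my own sanity check, not part of the output): they are right-tail chi-squared probabilities of the chisq statistic, so the first row above can be roughly reproduced with pchisq.

```r
# The popdrop p values are P(X > chisq) for X ~ chi-squared(dof); the
# first row above has chisq = 3.251 with 3 degrees of freedom.
p <- pchisq(3.251, df = 3, lower.tail = FALSE)
round(p, 2)  # roughly 0.35, matching the table's 0.354
```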

This omits models with only one population (where f4rank is 0) and models that are not feasible (with one or more negative weight). It then sorts the remaining models by their p value:


> qp$popdrop%>%dplyr::filter(feasible==T&f4rank!=0)%>%arrange(desc(p))%>%dplyr::select(!c(pat,wt,dof,chisq,f4rank,feasible,best,dofdiff,chisqdiff,p_nested))%>%mutate(across(where(is.double),round,3))%>%as.data.frame
p Nganasan Norway_N_HG.SG Russia_AfontovaGora3 Turkey_Epipaleolithic
1 0.430 0.123 NA 0.176 0.701
2 0.354 0.128 0.132 0.116 0.625
3 0.272 0.144 0.302 NA 0.554
4 0.036 0.179 NA NA 0.821
5 0.005 NA NA 0.272 0.728
6 0.000 NA 0.622 NA 0.378
7 0.000 0.072 0.928 NA NA

This saves the popdrop table to a CSV file (I know my pright sucks or whatever):


target="Finnish"
left=c("Turkey_Boncuklu_N.SG","Latvia_HG","Norway_N_HG.SG","Russia_HG_Karelia","Russia_HG_Tyumen","Russia_AfontovaGora3","Nganasan")
right=c("Mbuti.DG","Mixe.DG","Ami.DG","Czech_Vestonice16","Papuan.DG","Ethiopia_4500BP_published.SG","Russia_Kostenki14","Ju_hoan_North.SDG","Morocco_Iberomaurusian")
pops=c(left,right,target)

unlink("f2",recursive=T)
extract_f2(pref="g/v44.3_HO_public/v44.3_HO_public",pops=pops,outdir="f2")
f2=f2_from_precomp("f2")
qp=qpadm(f2,left=left,right=right,target=target)

qp2=qp$popdrop%>%dplyr::filter(feasible==T&f4rank!=0)%>%arrange(desc(p))%>%dplyr::select(!c(wt,dof,chisq,f4rank,feasible,best,dofdiff,chisqdiff,p_nested))
write_csv(qp2,"/tmp/a")

This generates a stacked bar chart of the models sorted by their p value:


library(tidyverse)
library(cowplot)
library(reshape2)

t=read_csv("/tmp/a")

abbr=c("Turk","Latv","Norw","Kare","Tyum","AG3","Ngan")

l=lapply(t$pat,function(x)abbr[unlist(gregexpr("0",x))]%>%paste(collapse=" "))
t$lab=paste0(l," (",sub("^0","",sprintf("%.3f",t$p)),")")

t=t[-c(1,2)]
t2=melt(t,id.var="lab")

p=ggplot(t2,aes(x=fct_rev(factor(lab,level=t$lab)),y=value,fill=variable))+
geom_bar(stat="identity",width=1,position=position_fill(reverse=T))+
geom_text(aes(label=round(100*value)),position=position_stack(vjust=.5,reverse=T),size=4)+
coord_flip()+
theme(
axis.text.x=element_blank(),
axis.text=element_text(color="black"),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
legend.box.just="center",
legend.box.margin=margin(0,unit="cm"),
legend.box.spacing=unit(.05,"in"),
legend.direction="horizontal",
legend.justification="center",
legend.margin=margin(0,unit="cm"),
legend.text=element_text(size=12),
legend.title=element_blank(),
panel.border=element_blank(),
text=element_text(size=18)
)+
guides(fill=guide_legend(ncol=3,label.position="right",byrow=T))+
scale_x_discrete(expand=c(0,0))+
scale_y_discrete(expand=c(0,0))+
xlab("")+
scale_fill_manual("legend",values=c("#be661f","#66f6ff","#3397f5","#22419c","#39de39","#157f0a","#ef50ed"))

ggdraw(p)
leg=get_legend(p)
p=p+theme(legend.position="none")
ggdraw(plot_grid(p,leg,ncol=1,rel_heights=c(1,.2)))

ggsave("output.png",width=7,height=7)

https://i.ibb.co/vV9f0g5/qp.png

Why do Finns get such a high percentage of Turkey_Boncuklu? Is it because of bad right populations or something?

andre
03-08-2021, 05:06 PM
Hi everyone, could someone run North_Italians, Tuscans and Sicilians through qpAdm with this model?

WHG
Steppe_EMBA (Samara, I think, is OK)
Barcin_N
Iran_N
Levant_PPNB

I want to see if Levant_PPNB is necessary for modelling Italians, particularly southerners.

Thank you.

vbnetkhio
03-21-2021, 03:00 PM
which of these should be used as outgroups when modelling modern pops with Mesolithic/Neolithic samples?

Austria_Krems1_1I2483
Austria_Krems1_2_twin.I2483I2484
Austria_Krems1_2_twin.I2483_allI2484_all
Austria_KremsWA3I1577
Belgium_UP_GoyetQ116_1_publishedGoyetQ116-1_udg_published
Belgium_UP_GoyetQ116_1_published_allGoyetQ116-1_published
Belgium_UP_GoyetQ376-19_publishedGoyetQ376-19_published_d
Belgium_UP_GoyetQ53_1_published_lcGoyetQ53-1_published_d
Belgium_UP_GoyetQ56_16_published_lcGoyetQ56-16_published_d
Belgium_UP_MagdalenianGoyetQ-2
Belgium_UP_Magdalenian_udgGoyetQ-2_udg
China_TianyuanTianyuan
Czech_Pavlov1Pavlov1_d
Czech_Vestonice13Vestonice13_d
Czech_Vestonice14_lcVestonice14_d
Czech_Vestonice15Vestonice15_d
Czech_Vestonice16Vestonice16
Czech_Vestonice43Vestonice43_d
France_Rigney1_publishedRigney1_published_d
Germany_Brillenhohle_published_lcBrillenhohle_published_d
Germany_Burkhardtshohle_publishedBurkhardtshohle_published_d
Germany_HohleFels49_publishedHohleFels49_published_d
Germany_HohleFels79_published_lcHohleFels79_published_d
Italy_South_HG_Ostuni1Ostuni1_d
Italy_South_HG_Ostuni2Ostuni2_d
Italy_South_HG_Paglicci108_published_lcPaglicci108_published_d
Italy_South_HG_Paglicci133_publishedPaglicci133_published
Romania_Cioclovina_published_lcCioclovina1_published_d
Romania_MuieriiMuierii2_d
Romania_OaseOase1_d
Russia_Kostenki12Kostenki12
Russia_Kostenki14Kostenki14
Russia_Kostenki14.SGKostenki14.SG
Russia_Sunghir1.SGSunghir1.SG
Russia_Sunghir2.SGSunghir2.SG
Russia_Sunghir3.SGSunghir3.SG
Russia_Sunghir4.SGSunghir4.SG
Russia_Ust_Ishim_HG_published.DGUst_Ishim_published.DG
Russia_Ust_Ishim.DGUstIshim_snpAD.DG
Russia_Yana_UP.SGYana_old.SG
Russia_Yana_UP.SGYana_old2.SG
Spain_ElMironElMiron_d

Kostenki14 and Ust-Ishim were used here, but this was a couple of years ago and new samples have been published since, so should something be added?
https://eurogenes.blogspot.com/2017/01/qpadm-tour-of-europe-mesolithic-to.html

vbnetkhio
03-21-2021, 10:46 PM
most of these Paleolithic samples were analyzed here:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4943878/figure/F3/?report=objectonly
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4943878/

I ran a PCA on all Reich samples older than 8000 ybp, to see into which clusters the more recently published samples fall. I removed Neanderthals and Denisovans, and African, Middle Eastern and Amerindian samples, because they were outliers and skewed the European dimensions. I didn't do LD or MAF pruning for the PCA; I guess Paleolithic samples probably require completely different LD and MAF settings from modern pops.

The result is very similar to the MDS plot from the study:
https://i.imgur.com/KwBcpN2.png

The conclusions would be Sunghir samples fall into the Vestonice cluster, Yana clusters with Ust-Ishim, and Geometric and Azilian samples plot with El Miron.

Kaspias
09-01-2021, 05:02 PM
I was trying to extract the new Turkish samples but received a few errors during the process. Do the files need to be imputed in Linux by referencing their genetic linkage (perhaps using SHAPEIT (https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#download))? At least that's what I understood, although I have no idea how to do it.

Any tips on what is going on here?

Files can be found here (https://figshare.com/articles/dataset/The_genetic_structure_of_the_Turkish_population_reveals_high_levels_of_variation_and_admixture/15147642) in case someone would like to try it.


+ extract_f2("originknownturkish",outdir = "balkanturks", pops = pops)


i Reading allele frequencies from PLINK files...
Warning: 1 parsing failure.
row col expected actual file
1460 X6 embedded null 'originknownturkish.fam'

i originknownturkish.geno has 1460 samples and 423261 SNPs
i Calculating allele frequencies from 1 samples in 1 populations
i Expected size of allele frequency data: 102 MB
423k SNPs read...
√ 423261 SNPs read in total
! 421395 SNPs remain after filtering. 0 are polymorphic.
i Allele frequency matrix for 421395 SNPs and 1 populations is 34 MB
i Computing pairwise f2 for all SNPs and population pairs requires 67 MB RAM without splitting
i Computing without splitting since 67 < 8000 (maxmem)...

Error in cpp_get_block_lengths(numchr, dat[[distcol]], blgsize) :
upper value must be greater than lower value
In addition: Warning messages:
1: In get_block_lengths(afdat$snpfile, blgsize = blgsize) :
No genetic linkage map or base positions found! Each chromosome will be its own block, which can make standard error estimates inaccurate.
2: In get_block_lengths(afdat$snpfile[poly, ], blgsize = blgsize) :
No genetic linkage map found! Defining blocks by base pair distance of 2e+06
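The two warnings suggest the genetic-map column of the PLINK files is empty, and the crash suggests the base positions may be unusable too. Note also the parsing failure on the .fam: only 1 sample in 1 population was read, which would explain why 0 SNPs are polymorphic. A quick sanity check, assuming a standard .bim layout next to the prefix used above:

```r
# Sketch: inspect the .bim that extract_f2 reads. Standard PLINK .bim
# columns: chromosome, id, genetic position (cM), base position, A1, A2.
bim <- read.table("originknownturkish.bim")
sum(bim$V3 != 0)   # SNPs that actually have a genetic-map position
range(bim$V4)      # base positions; all zeros would explain the crash
```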

goloden
12-21-2024, 04:10 PM
Hi brothers, could you help me convert a file from 23andMe to qpAdm format?