Making Vahaduo-like models based on SNP-level data?

**Lucas** · 09-22-2021, 08:51 PM

Originally Posted by vbnetkhio

I think the modern samples from evolbio.ut.ee and hgdp (the original, not the version in reich) have more ancestry relevant SNPs than those from reich. They should have 150-200k after ld and maf, depending on the sample choice. Then you can model the ones from reich with them as sources.

Yes, the world datasets from Busby and Lazaridis are also good.

**~~Komintasavalta~~** · 09-23-2021, 08:49 AM

Originally Posted by vbnetkhio

I think the modern samples from evolbio.ut.ee and hgdp (the original, not the version in reich) have more ancestry relevant SNPs than those from reich. They should have 150-200k after ld and maf, depending on the sample choice. Then you can model the ones from reich with them as sources.

How would that work? I need the source and target samples to have the same set of SNPs. Neither Vahaduo nor Michal's script can handle input that has NA/null values, even though there's probably some way to do the convex optimization while allowing NA values.
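
A minimal sketch of one way to allow NA values: drop every SNP that is missing in the target or in any source, then fit non-negative weights that sum to 1 by least squares. This is not how Vahaduo or Michal's script works; the function name `fit_mixture`, the Frank-Wolfe solver, and the toy data are all assumptions made for illustration.

```python
def fit_mixture(target, sources, iters=500):
    """Fit non-negative weights summing to 1 by least squares.

    target  -- list of per-SNP values; None marks a missing value
    sources -- list of such lists, one per source population
    Returns (weights, distance, n_snps_used).
    """
    # Keep only SNPs observed in the target and in every source.
    keep = [i for i, t in enumerate(target)
            if t is not None and all(s[i] is not None for s in sources)]
    if not keep:
        raise ValueError("no SNP is observed in every population")
    t = [target[i] for i in keep]
    S = [[s[i] for i in keep] for s in sources]
    k, n = len(S), len(t)

    w = [1.0 / k] * k  # start from equal weights
    for _ in range(iters):
        fitted = [sum(w[j] * S[j][i] for j in range(k)) for i in range(n)]
        resid = [f - x for f, x in zip(fitted, t)]
        # Gradient of 0.5 * ||fitted - t||^2 with respect to each weight.
        grad = [sum(r * v for r, v in zip(resid, S[j])) for j in range(k)]
        best = min(range(k), key=lambda j: grad[j])
        # Frank-Wolfe step: move toward the single best source,
        # using an exact line search clipped to [0, 1].
        d = [S[best][i] - fitted[i] for i in range(n)]
        denom = sum(x * x for x in d)
        if denom == 0.0:
            break
        step = max(0.0, min(1.0, -sum(r * x for r, x in zip(resid, d)) / denom))
        if step == 0.0:
            break
        w = [(1.0 - step) * wj for wj in w]
        w[best] += step

    fitted = [sum(w[j] * S[j][i] for j in range(k)) for i in range(n)]
    distance = sum((f - x) ** 2 for f, x in zip(fitted, t)) ** 0.5
    return w, distance, n


# Toy data: 4 SNPs, two sources, None marks a missing call.
pop_a = [0.1, 0.9, 0.5, 0.3]
pop_b = [0.8, 0.2, 0.4, None]
target = [0.31, 0.69, None, 0.39]
weights, distance, n_used = fit_mixture(target, [pop_a, pop_b])
print(n_used, [round(x, 3) for x in weights], round(distance, 6))
```

Because the SNP mask depends on the target, distances from fits that used different numbers of SNPs are not directly comparable, and the approach only helps if enough shared SNPs remain.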

Originally Posted by Lucas

Yes, the world datasets from Busby and Lazaridis are also good.

I tried making a model for Chuvashes using all samples in the Busby dataset (https://data.mendeley.com/datasets/ckz9mtgrjj/3). But I again got zero or a few percent of ancestry from many unrelated populations:

Chuvash (20.407):
13% Russian_North
7% CEU
7% Mordvin
6% Ukrainian
5% Polish
5% Selkup
4% Nganasan
4% German
3% Hungarian
3% Lithuanian
3% Belarusian
3% Bulgarian
3% Croatian
3% Romanian
2% Uygur
2% Kyrgyz
2% Chukchi
2% Altaian
2% Tuvan
2% Mongol_Mongolia
2% Norwegian
1% Irish
1% Koryak
1% Uzbek
1% Kumyk
1% Dolgan
1% Ket
1% Finnish
1% Yakut
1% Burusho
1% Kalash
1% Lezgin
1% Colombian
1% Tajik
1% Oroqen
0% Welsh
0% Nogai
0% North_Ossetian
0% Kanjar
0% Evenk
0% Yukaghir
0% Kurdish_Kazakhstan
0% English
0% Adyghe
0% German_or_Austrian
0% Maya
0% Papuan_Sepik
0% Lambadi
0% Meghawal
0% Kurumba
0% Pathan
0% Yoruba
0% Kshatriya_Uttar_Pradesh
0% Surui
0% Bantu_Pedi_South_Africa
0% Bengali
0% Bantu_Ovambo_Angola
0% French
0% Brahmin_Uttar_Pradesh
0% Balkar
0% Nasioi_Bougainville
0% Hadza
0% Kol
0% Papuan_Highlands_East

Next I tried doing a PCA of the populations and making models based on the first 20 dimensions, and I limited the models to a maximum of 2-8 populations. When I used an unscaled PCA, some models got a few percent of ancestry from unrelated populations like Sandawe or Papuan_Sepik:

Chuvash (.00334): 82% Mordvin + 18% Nganasan
Chuvash (.00290): 80% Mordvin + 18% Nganasan + 1% Sandawe
Chuvash (.00236): 56% Mordvin + 20% Nganasan + 18% Polish + 7% Tajik
Chuvash (.00206): 62% Mordvin + 20% Nganasan + 13% Polish + 4% Brahui + 1% Sandawe
Chuvash (.00179): 56% Mordvin + 15% Polish + 14% Nganasan + 8% Selkup + 5% Tajik + 1% Sandawe
Chuvash (.00152): 56% Mordvin + 15% Polish + 13% Nganasan + 11% Selkup + 4% Lezgin + 1% Sandawe + 1% Papuan_Sepik
Chuvash (.00157): 47% Mordvin + 19% Polish + 17% Nganasan + 6% Selkup + 5% Lezgin + 4% Finnish + 1% Sandawe + 1% Kalash

But when I used a scaled PCA, the results became more reasonable:

Chuvash (.00057): 72% Mordvin + 28% Selkup
Chuvash (.00056): 69% Mordvin + 28% Selkup + 3% Polish
Chuvash (.00039): 73% Mordvin + 16% Nganasan + 6% Polish + 5% Selkup
Chuvash (.00032): 55% Mordvin + 17% Polish + 14% Nganasan + 9% Selkup + 5% Balkar
Chuvash (.00032): 55% Mordvin + 17% Polish + 14% Nganasan + 9% Selkup + 5% Balkar
Chuvash (.00030): 55% Mordvin + 17% Polish + 9% Selkup + 9% Nganasan + 5% Yakut + 4% Balkar
Chuvash (.00027): 51% Mordvin + 20% Polish + 13% Nganasan + 7% Selkup + 3% Burusho + 3% Yakut + 3% Balkar + 1% Finnish
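
The post doesn't say what "scaled" means here. A minimal sketch of one common convention, multiplying each PC column by the square root of its eigenvalue so that low-variance PCs contribute less to the fitted distance; this reading and the function names are assumptions, not a description of what was actually run.

```python
import math


def scale_coordinates(coords, eigenvalues):
    """Multiply each PC column by the square root of its eigenvalue."""
    factors = [math.sqrt(ev) for ev in eigenvalues]
    return [[x * f for x, f in zip(row, factors)] for row in coords]


def euclidean(a, b):
    """Plain Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two samples that differ equally on PC1 and PC2 in unscaled coordinates.
unscaled = [[0.00, 0.00], [0.01, 0.01]]
eigenvalues = [16.0, 1.0]  # hypothetical: PC1 explains far more variance
scaled = scale_coordinates(unscaled, eigenvalues)
print(euclidean(*unscaled), euclidean(*scaled))
```

Under this convention the later PCs, which often separate a single outlying population, are down-weighted, which would be consistent with the stray 1% Sandawe or Papuan_Sepik components disappearing.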

If I try to specify a maximum number of populations when I make models based on the CSV file for SNP-level data, it takes forever both in Vahaduo and with Michal's script.
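
One possible reason, assuming the population limit is enforced by trying source subsets exhaustively (the thread doesn't say how either tool does it): the number of candidate subsets grows very quickly, and each fit on tens of thousands of SNP rows costs far more than a fit on 20 PCA dimensions. A small sketch of the count, where 64 is simply the number of populations listed in the full Busby model above and the function name is hypothetical:

```python
import math


def subset_count(n_sources, max_size):
    """Number of non-empty source subsets with at most max_size members."""
    return sum(math.comb(n_sources, k) for k in range(1, max_size + 1))


for max_size in range(2, 9):
    print(max_size, subset_count(64, max_size))
```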

Here's a SmartPCA run of all samples in the Busby dataset:

And here are just the Eurasian samples. The Finnish population average seems unusually western compared to the Mordvin and North Russian samples. And it's not just some Chukchi samples but all of them that look mixed with Europeans. Chuvashes are at around the same point on PC1 as Nogais, because the Nogai population average includes samples with low Mongoloid ancestry, like the Nogai_Karachay_Cherkessia samples in 1240K+HO.

There are so many South Asian samples in the Busby dataset that South Asians get their own component in a K=3 Eurasian ADMIXTURE run:

**Lucas** · 09-23-2021, 09:29 AM

Try with Lazaridis; there are two datasets, 2014 and 2016. It would be interesting to see which is better.

**Lemminkäinen** · 09-23-2021, 09:53 AM

Komintasavalta

The Finnish population average seems unusually western compared to the Mordvin and North Russian samples.

The Siberian ancestry in Finns is not exactly the same as in North Russians. On your PCA they are on the X-axis between North Russians and Belarusians, due to the difference in Siberian admixture. The deviation on the Y-axis reflects genetic drift. That's my explanation.

Imho, min-max on X-axis is distorted as to Europeans. Min-max is always distorted, but you can choose the way by selecting populations and sizes.

**~~Komintasavalta~~** · 09-23-2021, 11:18 AM

Originally Posted by Lemminkäinen

The Siberian ancestry in Finns is not exactly the same as in North Russians. On your PCA they are on the X-axis between North Russians and Belarusians, due to the difference in Siberian admixture. The deviation on the Y-axis reflects genetic drift. That's my explanation.

Imho, min-max on X-axis is distorted as to Europeans. Min-max is always distorted, but you can choose the way by selecting populations and sizes.

Or maybe it's the Western European ancestry that pulled Finns west on PC1. I tried doing another SmartPCA run for just European samples, and again Finns plot lower on PC2 than North Russians and Mordvins. But I think it's because this run included relatively few samples with high Mongoloid ancestry, so PC2 differentiates between low-mong Western Europeans and low-mong Eastern Europeans, and Finns have more Germanic-like ancestry than Mordvins.

There's also something f*cked up with some Romanian and Bulgarian samples. And I don't know why some Polish samples plot further north than Lithuanian samples on PC1.

I used `--indep-pairwise 50 10 .2 --geno .01 --maf .03`, which kept about 90,000 SNPs. Maybe I did the QC wrong (https://data.mendeley.com/datasets/ckz9mtgrjj/3):

These genotypes were all generated on Illumina chips (550, 610, 660) for multiple different studies. The two main papers that this dataset was compiled for are: Hellenthal, et al 2014 A Genetic Atlas of Human Admixture History, Science; and Busby, et al 2015 The role of recent admixture in forming the contemporary West Eurasian genomic landscape, Current Biology.

The data are in PLINK format and the BusbyWorldwidePopulations.csv file outlines where the different datasets come from. Note that because these two datasets were combined together, not all populations are typed on the same set of SNPs. We have included genotype data on 523,443 SNPs, of which 441,038 are genotyped on at least 97.5% of individuals.

Therefore, additional QC steps are required to filter this set down to high quality calls, depending on the subset of samples that are required.
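
For reference, in PLINK 1.9 `--indep-pairwise` only writes lists of variants to keep (`.prune.in`) and drop (`.prune.out`), so LD pruning is normally a two-step process. A sketch that builds both command lines with the thresholds quoted above; the file prefixes are hypothetical, and whether the original run was done this way isn't stated.

```python
import shlex


def plink_qc_commands(bfile, out, geno=0.01, maf=0.03,
                      window=50, step=10, r2=0.2):
    """Build the two PLINK 1.9 commands for filtering plus LD pruning."""
    filters = ["--geno", str(geno), "--maf", str(maf)]
    ld_prefix = out + "_ld"
    # Step 1: apply the filters and write <ld_prefix>.prune.in / .prune.out.
    prune = ["plink", "--bfile", bfile, *filters,
             "--indep-pairwise", str(window), str(step), str(r2),
             "--out", ld_prefix]
    # Step 2: apply the same filters and keep only the pruned-in variants.
    extract = ["plink", "--bfile", bfile, *filters,
               "--extract", ld_prefix + ".prune.in",
               "--make-bed", "--out", out]
    return prune, extract


# "busby_europe" and "busby_europe_qc" are made-up file prefixes.
for cmd in plink_qc_commands("busby_europe", "busby_europe_qc"):
    print(shlex.join(cmd))
```

The commands are only printed here; passing each list to `subprocess.run(cmd, check=True)` would execute them if PLINK is on the path.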

**Lemminkäinen** · 09-23-2021, 11:40 AM

Originally Posted by Komintasavalta

Or maybe it's the Western European ancestry that pulled Finns west on PC1. I tried doing another SmartPCA run for just European samples, and again Finns plot lower on PC2 than North Russians and Mordvins. But I think it's because this run included relatively few samples with high Mongoloid ancestry, so PC2 differentiates between low-mong Western Europeans and low-mong Eastern Europeans, and Finns have more Germanic-like ancestry than Mordvins.

There's also something f*cked up with some Romanian and Bulgarian samples. And I don't know why some Polish samples plot further north than Lithuanian samples on PC1.

I used `--indep-pairwise 50 10 .2 --geno .01 --maf .03`, which kept about 90,000 SNPs. Maybe I did the QC wrong (https://data.mendeley.com/datasets/ckz9mtgrjj/3):

These genotypes were all generated on Illumina chips (550, 610, 660) for multiple different studies. The two main papers that this dataset was compiled for are: Hellenthal, et al 2014 A Genetic Atlas of Human Admixture History, Science; and Busby, et al 2015 The role of recent admixture in forming the contemporary West Eurasian genomic landscape, Current Biology.

The data are in PLINK format and the BusbyWorldwidePopulations.csv file outlines where the different datasets come from. Note that because these two datasets were combined together, not all populations are typed on the same set of SNPs. We have included genotype data on 523,443 SNPs, of which 441,038 are genotyped on at least 97.5% of individuals.

Therefore, additional QC steps are required to filter this set down to high quality calls, depending on the subset of samples that are required.

Could be. Add Saamis, keeping Nganasans and East Asians, Chuvashes, Mordvas, North Russians and the European block without changes.

**Lucas** · 09-23-2021, 11:46 AM

Originally Posted by Komintasavalta

There's also something f*cked up with some Romanian and Bulgarian samples. And I don't know why some Polish samples plot further north than Lithuanian samples on PC1.

They are infamous "Estonian Poles", tested by Estonian Biocentre among their Polish minority, certainly mixed with Estonians. You can exclude those plotting to north.

Among Romanians are two Gypsy samples.
Not sure about Bulgarians, maybe some too. But I didn't test them thoroughly.

**Zoro** · 09-23-2021, 12:11 PM

Originally Posted by Komintasavalta

I used `--indep-pairwise 50 10 .2 --geno .01 --maf .03`, which kept about 90,000 SNPs. Maybe I did the QC wrong (https://data.mendeley.com/datasets/ckz9mtgrjj/3):

Nice looking plots!

Not a good idea to do `--maf 0.03`. Basically what you did was to throw out all rarer SNPs and leave the more common older SNPs with minor allele frequencies greater than 3% in your dataset.

In other words if you used 3000 samples then it means you threw out all alleles shared by 90 samples to the exclusion of the rest of samples. So if 3 populations of 30 samples each have unique alleles that set them apart from all the other populations, you just threw out those alleles!

I would do the opposite if you’re looking for more recent shared ancestry. I would throw out all the common alleles older than 10,000 years old by doing something like `--max-maf 0.2` which gets rid of alleles shared by 2400 of your 3000 samples since they’re not as informative

You can also try `--max-maf 0.25` or `0.3` if you’re not left with enough SNPs

Repost graphs. FYI the clustering will be more realistic but not as neat in terms of everyone in a population having the same recent ancestry. In other words you’ll find more variation within a population than G25 or some calculators have you believe

Just as there’s considerable phenotype variation in a population, in reality, outside of G25 or calculator Lalaland there’s considerable genotype variation in a population. Another thing to keep in mind is the SNPs picked by ancestry companies are optimized for intra-European variation. There are many SNPs unique to East Asians or Africans that are not genotyped. I think in 20 years your ancestry calculations will look different from now

**~~Komintasavalta~~** · 09-23-2021, 12:43 PM

Originally Posted by Lemminkäinen

Could be. Add Saamis, keeping Nganasans and East Asians, Chuvashes, Mordvas, North Russians and the European block without changes.

I added some Siberian samples from Busby, and I added samples from Tambets et al. 2018 (https://evolbio.ut.ee/Tambets2018/). The Finns from Busby look more western or northern than the Finns from Tambets.

I wonder if Tatar1003 is a Crimean Tatar, because it plots close to Nogais. The sample buryat_V43501 from Tambets looks hapa.

Originally Posted by Lucas

They are infamous "Estonian Poles", tested by Estonian Biocentre among their Polish minority, certainly mixed with Estonians. You can exclude those plotting to north.

Among Romanians are two Gypsy samples.
Not sure about Bulgarians, maybe some too. But I didn't test them thoroughly.

Yeah gypsies makes sense. But how do you know it's the Estonian Poles? The Polish samples are from Hellenthal et al. 2014, but I didn't find information about their geographic location anywhere, because it's an old paper with scanty supplementary information. In the plot above, the ID of the northernmost Polish sample is Polish2 and the second northernmost is POL079.

**~~Komintasavalta~~** · 09-23-2021, 01:40 PM

Originally Posted by Zoro

Not a good idea to do `--maf 0.03`. Basically what you did was to throw out all rarer SNPs and leave the more common older SNPs with minor allele frequencies greater than 3% in your dataset.

In other words if you used 3000 samples then it means you threw out all alleles shared by 90 samples to the exclusion of the rest of samples. So if 3 populations of 30 samples each have unique alleles that set them apart from all the other populations, you just threw out those alleles!

I would do the opposite if you’re looking for more recent shared ancestry. I would throw out all the common alleles older than 10,000 years old by doing something like `--max-maf 0.2` which gets rid of alleles shared by 2400 of your 3000 samples since they’re not as informative

You can also try `--max-maf 0.25` or `0.3` if you’re not left with enough SNPs

Yeah I have no idea which `--maf` or `--max-maf` setting I should use, so maybe it's best to not use them at all. But `--maf` and `--max-maf` don't select the SNPs to remove based on the minor allele frequency relative to other samples in the dataset, but based on the absolute frequency. Here's the number of SNPs removed in my set of European samples from Busby, out of a total of 523443:

`--max-maf .499`: 898
`--max-maf .49`: 9297
`--max-maf .45`: 46980
`--max-maf .25`: 243706
`--max-maf .1`: 421404
`--max-maf .05`: 480608
`--maf .001`: 7873
`--maf .01`: 18428
`--maf .05`: 42808
`--maf .25`: 243905
`--maf .4`: 429267
`--maf .45`: 476446
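
A small sketch of the arithmetic behind these filters, assuming diploid genotypes coded as alternate-allele counts (0, 1, 2) and PLINK's documented behavior that `--maf` drops variants below the threshold while `--max-maf` drops variants above it. The function names and toy genotypes are made up for illustration.

```python
def minor_allele_frequency(genotypes):
    """MAF from per-sample alternate-allele counts; None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    alt_freq = sum(called) / (2 * len(called))  # two alleles per sample
    return min(alt_freq, 1.0 - alt_freq)


def keep_variant(genotypes, maf=None, max_maf=None):
    """Return True if the variant passes the given lower and upper MAF bounds."""
    freq = minor_allele_frequency(genotypes)
    if freq is None:
        return False
    if maf is not None and freq < maf:
        return False
    if max_maf is not None and freq > max_maf:
        return False
    return True


n = 3000
het_carriers = [1] * 90 + [0] * (n - 90)  # 90 copies among 6000 alleles
hom_carriers = [2] * 90 + [0] * (n - 90)  # 180 copies among 6000 alleles
print(minor_allele_frequency(het_carriers), keep_variant(het_carriers, maf=0.03))
print(minor_allele_frequency(hom_carriers), keep_variant(hom_carriers, max_maf=0.2))
```

The frequency is computed over all allele copies in the filtered dataset, so an allele carried by 90 of 3000 samples has a MAF between 0.015 and 0.03 depending on zygosity, regardless of how those carriers are distributed across populations.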

Originally Posted by Zoro

Another thing to keep in mind is the SNPs picked by ancestry companies are optimized for intra-European variation. There are many SNPs unique to East Asians or Africans that are not genotyped. I think in 20 years your ancestry calculations will look different from now

Is that also true of the Human Origins array, or is that why West Africans have such a low f2 distance to Khoisan and Central African pygmies and Hadza in 1240K+HO? (https://anthrogenica.com/showthread....315#post800315)

Edit: apparently the Human Origins array was designed to differentiate 11 modern populations, including Mbuti and San, but maybe it still gives relatively little weight to SNPs that are specific to Capoids and Bambutids (https://www.thermofisher.com/documen...ns_appnote.pdf):

A total of 1.81 million candidate SNPs, all from genome locations covered by sequencing reads from Neanderthals, Denisovans, and chimpanzees, were ascertained using a simple SNP discovery procedure first described by Keinan, et al., 2007.[1] The most important ascertainment involved using whole-genome shotgun sequencing data to discover differences between the two chromosomes carried by individuals from 11 populations (San, Yoruba, Mbuti, French, Sardinian, Han, Cambodian, Mongolian, Karitiana, Papuan, and Bougainville).

A paper about a new SNP panel said that the 1240K panel overestimates the difference between Africans and non-Africans (https://arborbiosci.com/wp-content/u...nel_Design.pdf):

We computed FST for 1) the whole genomes in the SGDP 2) the currently widely used 1240k panel, and for the new 850k Ancestral SNP panel. Differentiation between African and non-African populations is overestimated for the 1240k, but no similar bias is observed for the Ancestral SNP panel (Figure 2).
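
The quoted text doesn't say which FST estimator was used. As an illustration of why the choice of SNP panel matters, here is a minimal sketch of Hudson's estimator in its ratio-of-averages form (as described by Bhatia et al. 2013); the estimator choice and the function name are assumptions. The result is a sum over whichever SNPs are in the panel, so a panel ascertained in a biased way shifts the estimate.

```python
def hudson_fst(p1, p2, n1, n2):
    """Ratio-of-averages Hudson FST across SNPs.

    p1, p2 -- per-SNP allele frequencies in populations 1 and 2
    n1, n2 -- per-SNP numbers of sampled chromosomes (each must be > 1)
    """
    num = 0.0
    den = 0.0
    for a, b, m1, m2 in zip(p1, p2, n1, n2):
        # Squared frequency difference, corrected for finite sample size.
        num += (a - b) ** 2 - a * (1 - a) / (m1 - 1) - b * (1 - b) / (m2 - 1)
        # Probability that one allele drawn from each population differs.
        den += a * (1 - b) + b * (1 - a)
    return num / den if den else float("nan")


# Toy frequencies for three SNPs in two populations, 20 chromosomes each.
print(hudson_fst([0.9, 0.5, 0.2], [0.1, 0.4, 0.3], [20] * 3, [20] * 3))
```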