Thumbs Up |
Received: 4,863 Given: 2,946 |
How would that work, because I need the source and target samples to have the same set of SNPs... Neither Vahaduo nor Michal's script knows how to deal with input that has NA/null values, even though there's probably some way to do the convex optimization while allowing NA values.
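One simple way to tolerate NA values, sketched here under assumptions, is to drop every SNP that is missing in the target or in any source and then fit non-negative weights that sum to 1 on the SNPs that remain. This is not a description of how Vahaduo or Michal's script work internally; the function names, the made-up allele frequencies and the projected-gradient solver are all illustrative choices.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) == 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, value in enumerate(u, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def fit_mixture(target, sources, iterations=5000):
    """Least-squares mixture weights using only SNPs called everywhere.

    target  : list of allele frequencies, None = missing
    sources : list of lists with the same length as target
    Returns (weights, number_of_snps_used).
    """
    # Keep a SNP only if the target and every source have a value for it.
    keep = [j for j, t in enumerate(target)
            if t is not None and all(s[j] is not None for s in sources)]
    if not keep:
        raise ValueError("no SNP is called in the target and all sources")
    t = [target[j] for j in keep]
    S = [[s[j] for j in keep] for s in sources]
    n, k = len(keep), len(S)
    # Conservative step size: 1 / (upper bound on the gradient's Lipschitz constant).
    step = n / (2.0 * sum(x * x for row in S for x in row))
    w = [1.0 / k] * k
    for _ in range(iterations):
        resid = [sum(w[i] * S[i][j] for i in range(k)) - t[j] for j in range(n)]
        grad = [2.0 / n * sum(resid[j] * S[i][j] for j in range(n)) for i in range(k)]
        w = project_to_simplex([w[i] - step * grad[i] for i in range(k)])
    return w, n


# Toy allele frequencies (made up for illustration).
pop_a = [0.1, 0.9, 0.2, 0.8, 0.5, 0.3, 0.7, 0.4]
pop_b = [0.9, 0.1, 0.7, 0.2, 0.4, 0.6, 0.3, 0.8]
pop_c = [0.5, 0.5, 0.1, 0.1, 0.9, 0.9, 0.2, 0.6]
target = [0.6 * a + 0.4 * b for a, b in zip(pop_a, pop_b)]
target[2] = None   # missing call in the target
pop_b[5] = None    # missing call in one source

weights, n_used = fit_mixture(target, [pop_a, pop_b, pop_c])
print(n_used, [round(w, 3) for w in weights])
```

The drawback is that the usable SNP set shrinks to the intersection of everything that was called, so with many sources typed on different chips it can get small quickly.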
I tried making a model for Chuvashes using all samples in the Busby dataset (https://data.mendeley.com/datasets/ckz9mtgrjj/3). But I again got zero or a few percent of ancestry from many unrelated populations:
Chuvash (20.407):
13% Russian_North
7% CEU
7% Mordvin
6% Ukrainian
5% Polish
5% Selkup
4% Nganasan
4% German
3% Hungarian
3% Lithuanian
3% Belarusian
3% Bulgarian
3% Croatian
3% Romanian
2% Uygur
2% Kyrgyz
2% Chukchi
2% Altaian
2% Tuvan
2% Mongol_Mongolia
2% Norwegian
1% Irish
1% Koryak
1% Uzbek
1% Kumyk
1% Dolgan
1% Ket
1% Finnish
1% Yakut
1% Burusho
1% Kalash
1% Lezgin
1% Colombian
1% Tajik
1% Oroqen
0% Welsh
0% Nogai
0% North_Ossetian
0% Kanjar
0% Evenk
0% Yukaghir
0% Kurdish_Kazakhstan
0% English
0% Adyghe
0% German_or_Austrian
0% Maya
0% Papuan_Sepik
0% Lambadi
0% Meghawal
0% Kurumba
0% Pathan
0% Yoruba
0% Kshatriya_Uttar_Pradesh
0% Surui
0% Bantu_Pedi_South_Africa
0% Bengali
0% Bantu_Ovambo_Angola
0% French
0% Brahmin_Uttar_Pradesh
0% Balkar
0% Nasioi_Bougainville
0% Hadza
0% Kol
0% Papuan_Highlands_East
Next I tried doing a PCA of the populations and making models based on the first 20 dimensions, and I limited the models to a maximum of 2-8 populations. When I used an unscaled PCA, some models got a few percent of ancestry from unrelated populations like Sandawe or Papuan_Sepik:
Chuvash (.00334): 82% Mordvin + 18% Nganasan
Chuvash (.00290): 80% Mordvin + 18% Nganasan + 1% Sandawe
Chuvash (.00236): 56% Mordvin + 20% Nganasan + 18% Polish + 7% Tajik
Chuvash (.00206): 62% Mordvin + 20% Nganasan + 13% Polish + 4% Brahui + 1% Sandawe
Chuvash (.00179): 56% Mordvin + 15% Polish + 14% Nganasan + 8% Selkup + 5% Tajik + 1% Sandawe
Chuvash (.00152): 56% Mordvin + 15% Polish + 13% Nganasan + 11% Selkup + 4% Lezgin + 1% Sandawe + 1% Papuan_Sepik
Chuvash (.00157): 47% Mordvin + 19% Polish + 17% Nganasan + 6% Selkup + 5% Lezgin + 4% Finnish + 1% Sandawe + 1% Kalash
But when I used a scaled PCA, the results became more reasonable:
Chuvash (.00057): 72% Mordvin + 28% Selkup
Chuvash (.00056): 69% Mordvin + 28% Selkup + 3% Polish
Chuvash (.00039): 73% Mordvin + 16% Nganasan + 6% Polish + 5% Selkup
Chuvash (.00032): 55% Mordvin + 17% Polish + 14% Nganasan + 9% Selkup + 5% Balkar
Chuvash (.00032): 55% Mordvin + 17% Polish + 14% Nganasan + 9% Selkup + 5% Balkar
Chuvash (.00030): 55% Mordvin + 17% Polish + 9% Selkup + 9% Nganasan + 5% Yakut + 4% Balkar
Chuvash (.00027): 51% Mordvin + 20% Polish + 13% Nganasan + 7% Selkup + 3% Burusho + 3% Yakut + 3% Balkar + 1% Finnish
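A possible reason for the difference, assuming "scaled" means that each PC is multiplied by the square root of its eigenvalue (the thread does not spell this out): in unscaled coordinates a mismatch on a late, low-variance PC counts as much as the same mismatch on PC1, so small amounts of unrelated populations can be used to fit noise. The eigenvalues and coordinates below are made up.

```python
import math


def scale_coords(coords, eigenvalues):
    """Multiply each PC coordinate by sqrt(eigenvalue) (assumed meaning of 'scaled')."""
    return [c * math.sqrt(ev) for c, ev in zip(coords, eigenvalues)]


def distance(a, b):
    """Plain Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


eigenvalues = [9.0, 4.0, 1.0, 0.25]          # hypothetical, first four PCs only
target = [0.10, 0.05, 0.00, 0.00]
off_on_pc1 = [0.14, 0.05, 0.00, 0.00]        # differs from target only on PC1
off_on_pc4 = [0.10, 0.05, 0.00, 0.04]        # differs from target only on PC4

# Unscaled: both candidates look equally far from the target.
unscaled_pc1 = distance(target, off_on_pc1)
unscaled_pc4 = distance(target, off_on_pc4)

# Scaled: the PC1 mismatch is penalised more than the PC4 mismatch.
scaled_target = scale_coords(target, eigenvalues)
scaled_pc1 = distance(scaled_target, scale_coords(off_on_pc1, eigenvalues))
scaled_pc4 = distance(scaled_target, scale_coords(off_on_pc4, eigenvalues))

print(unscaled_pc1, unscaled_pc4, scaled_pc1, scaled_pc4)
```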
If I try to specify a maximum number of populations when I make models based on the CSV file for SNP-level data, it takes forever both in Vahaduo and with Michal's script.
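If the population limit is enforced by trying every subset of sources (an assumption; neither tool's search strategy is described in this thread), the number of candidate models grows combinatorially. A minimal sketch using the 64 source populations listed in the full Chuvash model above:

```python
import math


def count_models(n_pops, max_size):
    """Number of non-empty source subsets with at most max_size populations."""
    return sum(math.comb(n_pops, k) for k in range(1, max_size + 1))


n_sources = 64  # populations in the full Chuvash model listed earlier
for max_size in range(2, 9):
    print(max_size, count_models(n_sources, max_size))
```

Each subset would need its own fit over every SNP, so even a fast per-model solve adds up when there are hundreds of thousands of SNPs instead of 20 PCA dimensions.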
Here's a SmartPCA run of all samples in the Busby dataset:
And here's just Eurasian samples. The Finnish population average seems unusually western compared to the Mordvin and North Russian samples. And it's not just some Chukchi samples but all Chukchi samples that look like they're mixed with Europeans. And then Chuvashes are at around the same point on PC1 as Nogais, because the Nogai population average includes samples with low Mongoloid ancestry, like the Nogai_Karachay_Cherkessia samples in 1240K+HO.
There are so many South Asian samples in the Busby dataset that South Asians get their own component in a K=3 Eurasian ADMIXTURE run:
Last edited by Komintasavalta; 09-23-2021 at 09:31 AM.
Thumbs Up |
Received: 7,243 Given: 2,623 |
Try with Lazaridis; there are two datasets, 2014 and 2016. It would be interesting to see which is better.
Thumbs Up |
Received: 3,437 Given: 1,436 |
Komintasavalta wrote: "The Finnish population average seems unusually western compared to the Mordvin and North Russian samples."
The Siberian in Finns is not exactly the same as in North Russians. On your PCA they are on the X-axis between North Russians and Belarusians, due to the difference in Siberian admixture. The exception on the Y-axis reflects genetic drift. That's my explanation.
IMHO, the min-max range on the X-axis is distorted with respect to Europeans. The min-max range is always distorted, but you can choose in which way by selecting the populations and their sample sizes.
Thumbs Up |
Received: 4,863 Given: 2,946 |
Or maybe it's the Western European ancestry that pulled Finns west on PC1. I tried doing another SmartPCA run for just European samples, and again Finns plot lower on PC2 than North Russians and Mordvins. But I think it's because this run included relatively few samples with high Mongoloid ancestry, so PC2 differentiates between low-mong Western Europeans and low-mong Eastern Europeans, and Finns have more Germanic-like ancestry than Mordvins.
There's also something f*cked up with some Romanian and Bulgarian samples. And I don't know why some Polish samples plot further north than Lithuanian samples on PC1.
I used `--indep-pairwise 50 10 .2 --geno .01 --maf .03`, which kept about 90,000 SNPs. Maybe I did the QC wrong (https://data.mendeley.com/datasets/ckz9mtgrjj/3):
These genotypes were all generated on Illumina chips (550, 610, 660) for multiple different studies. The two main papers that this dataset was compiled for are: Hellenthal, et al 2014 A Genetic Atlas of Human Admixture History, Science; and Busby, et al 2015 The role of recent admixture in forming the contemporary West Eurasian genomic landscape, Current Biology.
The data are in PLINK format and the BusbyWorldwidePopulations.csv file outlines where the different datasets come from. Note that because these two datasets were combined together, not all populations are typed on the same set of SNPs. We have included genotype data on 523,443 SNPs, of which 441,038 are genotyped on at least 97.5% of individuals.
Therefore, additional QC steps are required to filter this set down to high quality calls, depending on the subset of samples that are required.
Thumbs Up |
Received: 7,243 Given: 2,623 |
They are the infamous "Estonian Poles", tested by the Estonian Biocentre among their Polish minority, and certainly mixed with Estonians. You can exclude those plotting to the north.
Among the Romanians there are two Gypsy samples.
Not sure about the Bulgarians, maybe some too. But I didn't test them thoroughly.
Thumbs Up |
Received: 1,249 Given: 524 |
Nice looking plots!
Not a good idea to do `--maf 0.03`. Basically what you did was throw out all the rarer SNPs and keep only the more common, older SNPs with minor allele frequencies greater than 3% in your dataset.
In other words, if you used 3000 samples, it means you threw out all alleles shared by 90 samples to the exclusion of the rest of the samples. So if 3 populations of 30 samples each have unique alleles that set them apart from all the other populations, you just threw out those alleles!
I would do the opposite if you're looking for more recent shared ancestry. I would throw out all the common alleles older than 10,000 years by doing something like `--max-maf 0.2`, which gets rid of SNPs whose minor allele frequency is above 20%, since they're not as informative.
You can also try `--max-maf 0.25` or `0.3` if you're not left with enough SNPs.
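The arithmetic behind the "90 samples" figure can be sketched as follows, assuming diploid samples and no missing genotypes (both are simplifications, and the function name is made up for illustration):

```python
def minor_allele_copies(n_samples, maf):
    """Allele copies that correspond to a given MAF among diploid samples."""
    chromosomes = 2 * n_samples
    return maf * chromosomes


copies = minor_allele_copies(3000, 0.03)
# The same number of copies can sit in few homozygous or many heterozygous carriers.
homozygous_carriers = copies / 2
heterozygous_carriers = copies
print(copies, homozygous_carriers, heterozygous_carriers)
```

So a 3% cutoff in 3000 samples corresponds to about 180 allele copies, which is 90 samples only if every carrier is homozygous; with heterozygous carriers it can be up to about 180 samples.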
Repost the graphs afterwards. FYI, the clustering will be more realistic but not as neat in terms of everyone in a population having the same recent ancestry. In other words, you'll find more variation within a population than G25 or some calculators would have you believe.
Just as there's considerable phenotype variation in a population, in reality, outside of G25 or calculator Lalaland, there's considerable genotype variation in a population. Another thing to keep in mind is that the SNPs picked by ancestry companies are optimized for intra-European variation. There are many SNPs unique to East Asians or Africans that are not genotyped. I think in 20 years your ancestry calculations will look different from now.
Last edited by Zoro; 09-23-2021 at 12:28 PM.
Thumbs Up |
Received: 4,863 Given: 2,946 |
I added some Siberian samples from Busby, and I added samples from Tambets et al. 2018 (https://evolbio.ut.ee/Tambets2018/). The Finns from Busby look more western or northern than the Finns from Tambets.
I wonder if Tatar1003 is a Crimean Tatar, because it plots close to Nogais. The sample buryat_V43501 from Tambets looks hapa.
Yeah, Gypsies make sense. But how do you know it's the Estonian Poles? The Polish samples are from Hellenthal et al. 2014, but I didn't find information about their geographic location anywhere, because it's an old paper with scanty supplementary information. In the plot above, the ID of the northernmost Polish sample is Polish2 and the second northernmost is POL079.
Thumbs Up |
Received: 4,863 Given: 2,946 |
Yeah, I have no idea which `--maf` or `--max-maf` setting I should use, so maybe it's best to not use them at all. But `--maf` and `--max-maf` don't select the SNPs to remove based on the minor allele frequency relative to other samples in the dataset, but based on the absolute frequency. Here's the number of SNPs removed in my set of European samples from Busby, out of a total of 523443:
`--max-maf .499`: 898
`--max-maf .49`: 9297
`--max-maf .45`: 46980
`--max-maf .25`: 243706
`--max-maf .1`: 421404
`--max-maf .05`: 480608
`--maf .001`: 7873
`--maf .01`: 18428
`--maf .05`: 42808
`--maf .25`: 243905
`--maf .4`: 429267
`--maf .45`: 476446
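For reference, here is a minimal sketch of what the two thresholds do to a single SNP. It assumes genotypes coded as 0/1/2 copies of one allele with `None` for missing calls, and it simplifies PLINK's behaviour (for example, PLINK computes frequencies from founders only, and dropping a SNP with no calls is a choice made here for illustration).

```python
def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 genotype codes; None entries are ignored."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def passes(genotypes, maf=None, max_maf=None):
    """True if the SNP survives a lower (maf) and/or upper (max_maf) bound."""
    freq = minor_allele_frequency(genotypes)
    if freq is None:
        return False  # assumption: a SNP with no calls is dropped
    if maf is not None and freq < maf:
        return False
    if max_maf is not None and freq > max_maf:
        return False
    return True


snp = [0, 0, 1, None, 2, 0]          # 3 copies among 5 called samples
print(minor_allele_frequency(snp))   # frequency within this sample set only
print(passes(snp, maf=0.25), passes(snp, max_maf=0.25))
```

The frequency is always computed within whatever samples are loaded, so the same SNP can pass in a worldwide set and fail in a European-only subset.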
Is that also true of the Human Origins array, or is that why West Africans have such a low f2 distance to Khoisan and Central African pygmies and Hadza in 1240K+HO? (https://anthrogenica.com/showthread....315#post800315)
Edit: apparently the Human Origins array was designed to differentiate 11 modern populations, including Mbuti and San, but maybe it still gives relatively little weight to SNPs that are specific to Capoids and Bambutids (https://www.thermofisher.com/documen...ns_appnote.pdf):
A total of 1.81 million candidate SNPs, all from genome locations covered by sequencing reads from Neanderthals, Denisovans, and chimpanzees, were ascertained using a simple SNP discovery procedure first described by Keinan, et al., 2007.[1] The most important ascertainment involved using whole-genome shotgun sequencing data to discover differences between the two chromosomes carried by individuals from 11 populations (San, Yoruba, Mbuti, French, Sardinian, Han, Cambodian, Mongolian, Karitiana, Papuan, and Bougainville).
A paper about a new SNP panel said that the 1240K panel overestimates the difference between Africans and non-Africans (https://arborbiosci.com/wp-content/u...nel_Design.pdf):
We computed FST for 1) the whole genomes in the SGDP 2) the currently widely used 1240k panel, and for the new 850k Ancestral SNP panel. Differentiation between African and non-African populations is overestimated for the 1240k, but no similar bias is observed for the Ancestral SNP panel (Figure 2).
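To see how the choice of SNPs alone can move FST, here is a small sketch using Hudson's estimator (ratio of sums, as described by Bhatia et al. 2013). The quoted document does not say which estimator it used, and the allele frequencies below are made up, so this only illustrates the general ascertainment effect, not the 1240k result itself.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's FST over a set of SNPs.

    p1, p2 : per-SNP allele frequencies in the two populations
    n1, n2 : number of sampled chromosomes in each population
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        numerator += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        denominator += a * (1 - b) + b * (1 - a)
    return numerator / denominator


# Hypothetical frequencies for four SNPs in two populations.
pop1 = [0.9, 0.5, 0.2, 0.6]
pop2 = [0.1, 0.5, 0.3, 0.6]

fst_all = hudson_fst(pop1, pop2, 200, 200)
# A panel that happens to keep only the most differentiated SNP.
fst_panel = hudson_fst(pop1[:1], pop2[:1], 200, 200)
print(fst_all, fst_panel)
```

A panel enriched for SNPs that differ strongly between two groups gives a higher FST between them than the full set of SNPs does.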
Last edited by Komintasavalta; 09-23-2021 at 02:09 PM.