Page 2 of 6
Results 11 to 20 of 51

Thread: Making Vahaduo-like models based on SNP-level data?

  1. #11
    Veteran Member Apricity Funding Member
    "Friend of Apricity"


    Join Date
    Oct 2016
    Last Online
    @
    Ethnicity
    me
    Country
    European Union
    Y-DNA
    R1a > YP1337 > R-BY160486*
    mtDNA
    H3*
    Gender
    Posts
    6,066
    Thumbs Up
    Received: 7,243
    Given: 2,623


    Quote Originally Posted by vbnetkhio View Post
    I think the modern samples from evolbio.ut.ee and hgdp (the original, not the version in reich) have more ancestry relevant SNPs than those from reich. They should have 150-200k after ld and maf, depending on the sample choice. Then you can model the ones from reich with them as sources.
    Yes, the worldwide datasets from Busby and Lazaridis are also good.

  2. #12
    Banned
    Join Date
    Sep 2020
    Last Online
    09-12-2023 @ 03:47 PM
    Location
    コミ共和国
    Meta-Ethnicity
    Finno-Permic
    Ethnicity
    Peasant
    Ancestry
    コミ
    Country
    Finland
    Taxonomy
    Karaboğa (euryprosopic, platyrrhine, dolichocephalic)
    Relationship Status
    Virgin
    Gender
    Posts
    2,170
    Thumbs Up
    Received: 4,863
    Given: 2,946


    Quote Originally Posted by vbnetkhio View Post
    I think the modern samples from evolbio.ut.ee and hgdp (the original, not the version in reich) have more ancestry relevant SNPs than those from reich. They should have 150-200k after ld and maf, depending on the sample choice. Then you can model the ones from reich with them as sources.
    How would that work? I need the source and target samples to have the same set of SNPs. Neither Vahaduo nor Michal's script can handle input that has NA/null values, even though there's probably some way to do the convex optimization while allowing NA values.
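One possible way to allow NA values, as a rough sketch and not how Vahaduo or Michal's script actually work: for each target, drop the SNP columns where the target or any source is missing, then fit non-negative weights that sum to 1 on the remaining columns. The function names `project_to_simplex` and `fit_mixture` and the allele frequencies are made up for illustration, and plain projected gradient descent is just one simple solver choice.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) == 1}."""
    u = sorted(v, reverse=True)
    running, theta = 0.0, 0.0
    for i, x in enumerate(u, start=1):
        running += x
        t = (running - 1.0) / i
        if x - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def fit_mixture(target, sources, steps=5000):
    """Least-squares mixture weights, ignoring SNPs with any missing value.

    target: list of allele frequencies (None = missing)
    sources: list of such lists, one per source population
    Returns (weights, number_of_snps_used).
    """
    # Keep only the SNP columns that are observed in the target and all sources.
    keep = [j for j, t in enumerate(target)
            if t is not None and all(s[j] is not None for s in sources)]
    y = [target[j] for j in keep]
    a = [[s[j] for j in keep] for s in sources]
    k = len(a)
    # Step size 1/L, where L bounds the largest eigenvalue of A A^T by its trace.
    lip = sum(x * x for row in a for x in row) or 1.0
    w = [1.0 / k] * k
    for _ in range(steps):
        resid = [sum(w[i] * a[i][j] for i in range(k)) - y[j]
                 for j in range(len(y))]
        grad = [sum(a[i][j] * resid[j] for j in range(len(y)))
                for i in range(k)]
        w = project_to_simplex([w[i] - grad[i] / lip for i in range(k)])
    return w, len(keep)


# Hypothetical allele frequencies for three sources over six SNPs.
src0 = [0.1, 0.9, 0.5, 0.2, 0.8, 0.3]
src1 = [0.9, 0.1, 0.5, 0.7, 0.2, 0.6]
src2 = [0.5, 0.5, 0.1, None, 0.4, 0.9]   # one SNP missing in this source
target = [0.7 * p + 0.3 * q for p, q in zip(src0, src1)]
target[2] = None                          # one SNP missing in the target

weights, n_used = fit_mixture(target, [src0, src1, src2])
print([round(x, 3) for x in weights], n_used)
```

With this approach each target can end up using a different subset of SNPs, so the fit distances of different targets are not strictly comparable.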

    Quote Originally Posted by Lucas View Post
    Yes,also good are world datasets from Busby and Lazaridis.
    I tried making a model for Chuvashes using all samples in the Busby dataset (https://data.mendeley.com/datasets/ckz9mtgrjj/3). But I again got zero or a few percent of ancestry from many unrelated populations:

    Chuvash (20.407):
    13% Russian_North
    7% CEU
    7% Mordvin
    6% Ukrainian
    5% Polish
    5% Selkup
    4% Nganasan
    4% German
    3% Hungarian
    3% Lithuanian
    3% Belarusian
    3% Bulgarian
    3% Croatian
    3% Romanian
    2% Uygur
    2% Kyrgyz
    2% Chukchi
    2% Altaian
    2% Tuvan
    2% Mongol_Mongolia
    2% Norwegian
    1% Irish
    1% Koryak
    1% Uzbek
    1% Kumyk
    1% Dolgan
    1% Ket
    1% Finnish
    1% Yakut
    1% Burusho
    1% Kalash
    1% Lezgin
    1% Colombian
    1% Tajik
    1% Oroqen
    0% Welsh
    0% Nogai
    0% North_Ossetian
    0% Kanjar
    0% Evenk
    0% Yukaghir
    0% Kurdish_Kazakhstan
    0% English
    0% Adyghe
    0% German_or_Austrian
    0% Maya
    0% Papuan_Sepik
    0% Lambadi
    0% Meghawal
    0% Kurumba
    0% Pathan
    0% Yoruba
    0% Kshatriya_Uttar_Pradesh
    0% Surui
    0% Bantu_Pedi_South_Africa
    0% Bengali
    0% Bantu_Ovambo_Angola
    0% French
    0% Brahmin_Uttar_Pradesh
    0% Balkar
    0% Nasioi_Bougainville
    0% Hadza
    0% Kol
    0% Papuan_Highlands_East

    Next I tried doing a PCA of the populations and making models based on the first 20 dimensions, and I limited the models to a maximum of 2-8 populations. When I used an unscaled PCA, some models got a few percent of ancestry from unrelated populations like Sandawe or Papuan_Sepik:

    Chuvash (.00334): 82% Mordvin + 18% Nganasan
    Chuvash (.00290): 80% Mordvin + 18% Nganasan + 1% Sandawe
    Chuvash (.00236): 56% Mordvin + 20% Nganasan + 18% Polish + 7% Tajik
    Chuvash (.00206): 62% Mordvin + 20% Nganasan + 13% Polish + 4% Brahui + 1% Sandawe
    Chuvash (.00179): 56% Mordvin + 15% Polish + 14% Nganasan + 8% Selkup + 5% Tajik + 1% Sandawe
    Chuvash (.00152): 56% Mordvin + 15% Polish + 13% Nganasan + 11% Selkup + 4% Lezgin + 1% Sandawe + 1% Papuan_Sepik
    Chuvash (.00157): 47% Mordvin + 19% Polish + 17% Nganasan + 6% Selkup + 5% Lezgin + 4% Finnish + 1% Sandawe + 1% Kalash

    But when I used a scaled PCA, the results became more reasonable:

    Chuvash (.00057): 72% Mordvin + 28% Selkup
    Chuvash (.00056): 69% Mordvin + 28% Selkup + 3% Polish
    Chuvash (.00039): 73% Mordvin + 16% Nganasan + 6% Polish + 5% Selkup
    Chuvash (.00032): 55% Mordvin + 17% Polish + 14% Nganasan + 9% Selkup + 5% Balkar
    Chuvash (.00032): 55% Mordvin + 17% Polish + 14% Nganasan + 9% Selkup + 5% Balkar
    Chuvash (.00030): 55% Mordvin + 17% Polish + 9% Selkup + 9% Nganasan + 5% Yakut + 4% Balkar
    Chuvash (.00027): 51% Mordvin + 20% Polish + 13% Nganasan + 7% Selkup + 3% Burusho + 3% Yakut + 3% Balkar + 1% Finnish
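A guess at why scaling matters, assuming "scaled" here means what it usually means for G25-style coordinates: each PC coordinate is multiplied by the square root of that PC's eigenvalue, so the later, low-variance PCs count for less in the Euclidean distance. The eigenvalues and coordinates below are invented for illustration and are not from the Busby run.

```python
import math


def scale_pcs(coords, eigenvalues):
    """Multiply each PC coordinate by the square root of its eigenvalue."""
    return [c * math.sqrt(ev) for c, ev in zip(coords, eigenvalues)]


def distance(a, b):
    """Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical two-PC example: a strong first PC and a weak later PC.
eigenvalues = [40.0, 1.2]
sample_a = [0.00, 0.00]
sample_b = [0.02, 0.05]   # differs a little on the strong PC, more on the weak PC

unscaled_diff = [abs(x - y) for x, y in zip(sample_a, sample_b)]
scaled_diff = [abs(x - y) for x, y in zip(scale_pcs(sample_a, eigenvalues),
                                          scale_pcs(sample_b, eigenvalues))]

print("unscaled per-PC difference:", unscaled_diff)
print("scaled per-PC difference:  ", [round(d, 4) for d in scaled_diff])
print("scaled distance:", round(distance(scale_pcs(sample_a, eigenvalues),
                                         scale_pcs(sample_b, eigenvalues)), 4))
```

In the unscaled version the weak PC dominates the distance, which could be one way small amounts of Sandawe or Papuan_Sepik get pulled into a fit. After scaling, the strong PC dominates instead.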

    If I try to specify a maximum number of populations when I make models based on the CSV file for SNP-level data, it takes forever both in Vahaduo and with Michal's script.
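Part of the slowness may simply be combinatorial. I don't know how Vahaduo or Michal's script search internally, so this is only the count of candidate source sets an exhaustive search would face, using the 64 populations in the Chuvash model earlier in this post. `count_models` is a made-up helper name.

```python
import math


def count_models(n_sources, max_pops):
    """Number of non-empty source subsets with at most max_pops members."""
    return sum(math.comb(n_sources, k) for k in range(1, max_pops + 1))


n_sources = 64  # assumption: the populations listed in the Chuvash model above
for max_pops in (2, 4, 8):
    print(max_pops, count_models(n_sources, max_pops))
```

Each candidate set needs its own constrained fit, and with SNP-level data every fit runs over tens of thousands of columns instead of 20 PCs.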

    Here's a SmartPCA run of all samples in the Busby dataset:



    And here's just Eurasian samples. The Finnish population average seems unusually western compared to the Mordvin and North Russian samples. And it's not just some Chukchi samples but all of them that look like they're mixed with Europeans. Chuvashes are at around the same point on PC1 as Nogais, because the Nogai population average includes samples with low Mongoloid ancestry, like the Nogai_Karachay_Cherkessia samples in 1240K+HO.




    There are so many South Asian samples in the Busby dataset that South Asians get their own component in a K=3 Eurasian ADMIXTURE run:

    Last edited by Komintasavalta; 09-23-2021 at 09:31 AM.

  3. #13
    Veteran Member Apricity Funding Member
    "Friend of Apricity"


    Join Date
    Oct 2016
    Last Online
    @
    Ethnicity
    me
    Country
    European Union
    Y-DNA
    R1a > YP1337 > R-BY160486*
    mtDNA
    H3*
    Gender
    Posts
    6,066
    Thumbs Up
    Received: 7,243
    Given: 2,623


    Try the Lazaridis data; there are two datasets, from 2014 and 2016. It would be interesting to see which is better.

  4. #14
    Veteran Member Apricity Funding Member
    "Friend of Apricity"


    Join Date
    Jun 2014
    Last Online
    03-13-2024 @ 06:31 PM
    Location
    Helsinki
    Ethnicity
    Finnish
    Country
    Finland
    Y-DNA
    I1
    mtDNA
    H39
    Politics
    Ugly history as it is. Don't blame me.
    Gender
    Posts
    4,729
    Thumbs Up
    Received: 3,437
    Given: 1,436


    Komintasavalta

    The Finnish population average seems unusually western compared to the Mordvin and North Russian samples.
    The Siberian ancestry in Finns is not exactly the same as in North Russians. On your PCA, Finns fall between North Russians and Belarusians on the X-axis because of the difference in Siberian admixture. The exception on the Y-axis reflects genetic drift. That's my explanation.

    IMHO the min-max range on the X-axis is distorted with respect to Europeans. The range is always distorted, but you can choose how by selecting the populations and sample sizes.

  5. #15
    Banned
    Join Date
    Sep 2020
    Last Online
    09-12-2023 @ 03:47 PM
    Location
    コミ共和国
    Meta-Ethnicity
    Finno-Permic
    Ethnicity
    Peasant
    Ancestry
    コミ
    Country
    Finland
    Taxonomy
    Karaboğa (euryprosopic, platyrrhine, dolichocephalic)
    Relationship Status
    Virgin
    Gender
    Posts
    2,170
    Thumbs Up
    Received: 4,863
    Given: 2,946


    Quote Originally Posted by Lemminkäinen View Post
    The Siberian ancestry in Finns is not exactly the same as in North Russians. On your PCA, Finns fall between North Russians and Belarusians on the X-axis because of the difference in Siberian admixture. The exception on the Y-axis reflects genetic drift. That's my explanation.

    IMHO the min-max range on the X-axis is distorted with respect to Europeans. The range is always distorted, but you can choose how by selecting the populations and sample sizes.
    Or maybe it's the Western European ancestry that pulled Finns west on PC1. I tried doing another SmartPCA run for just European samples, and again Finns plot lower on PC2 than North Russians and Mordvins. But I think it's because this run included relatively few samples with high Mongoloid ancestry, so PC2 differentiates between low-mong Western Europeans and low-mong Eastern Europeans, and Finns have more Germanic-like ancestry than Mordvins.

    There's also something f*cked up with some Romanian and Bulgarian samples. And I don't know why some Polish samples plot further north than Lithuanian samples on PC1.



    I used `--indep-pairwise 50 10 .2 --geno .01 --maf .03`, which kept about 90,000 SNPs. Maybe I did the QC wrong (https://data.mendeley.com/datasets/ckz9mtgrjj/3):

    These genotypes were all generated on Illumina chips (550, 610, 660) for multiple different studies. The two main papers that this dataset was compiled for are: Hellenthal, et al 2014 A Genetic Atlas of Human Admixture History, Science; and Busby, et al 2015 The role of recent admixture in forming the contemporary West Eurasian genomic landscape, Current Biology.

    The data are in PLINK format and the BusbyWorldwidePopulations.csv file outlines where the different datasets come from. Note that because these two datasets were combined together, not all populations are typed on the same set of SNPs. We have included genotype data on 523,443 SNPs, of which 441,038 are genotyped on at least 97.5% of individuals.

    Therefore, additional QC steps are required to filter this set down to high quality calls, depending on the subset of samples that are required.
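To illustrate why the order of subsetting and filtering matters with a merged-chip dataset like this one, here is a small sketch of a per-SNP missingness filter in the spirit of `--geno`: keep a SNP only if the fraction of samples with a missing call does not exceed the threshold. The genotype matrix and the name `call_rate_filter` are invented for the example, and this is not PLINK's own code.

```python
def call_rate_filter(genotypes, max_missing):
    """Return indices of SNPs whose missing-call fraction is <= max_missing.

    genotypes: one list per sample, entries are 0/1/2 or None for a missing call.
    """
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    kept = []
    for j in range(n_snps):
        missing = sum(1 for sample in genotypes if sample[j] is None)
        if missing / n_samples <= max_missing:
            kept.append(j)
    return kept


# Hypothetical data: SNP 2 is absent from the chip used for the first two samples.
genotypes = [
    [0, 1, None],   # chip A
    [1, None, None],  # chip A, plus one failed call at SNP 1
    [2, 1, 0],      # chip B
    [0, 2, 1],      # chip B
]

strict = call_rate_filter(genotypes, 0.01)
loose = call_rate_filter(genotypes, 0.30)
chip_b_only = call_rate_filter(genotypes[2:], 0.01)
print(strict, loose, chip_b_only)
```

A SNP that fails the filter on the full dataset can pass once the samples are restricted to populations typed on the same chip, which fits the dataset note about choosing QC steps per sample subset. Note also that `--geno .01` is stricter than the 97.5% call rate mentioned in that note.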

  6. #16
    Veteran Member Apricity Funding Member
    "Friend of Apricity"


    Join Date
    Jun 2014
    Last Online
    03-13-2024 @ 06:31 PM
    Location
    Helsinki
    Ethnicity
    Finnish
    Country
    Finland
    Y-DNA
    I1
    mtDNA
    H39
    Politics
    Ugly history as it is. Don't blame me.
    Gender
    Posts
    4,729
    Thumbs Up
    Received: 3,437
    Given: 1,436


    Quote Originally Posted by Komintasavalta View Post
    Or maybe it's the Western European ancestry that pulled Finns west on PC1. I tried doing another SmartPCA run for just European samples, and again Finns plot lower on PC2 than North Russians and Mordvins. But I think it's because this run included relatively few samples with high Mongoloid ancestry, so PC2 differentiates between low-mong Western Europeans and low-mong Eastern Europeans, and Finns have more Germanic-like ancestry than Mordvins.

    There's also something f*cked up with some Romanian and Bulgarian samples. And I don't know why some Polish samples plot further north than Lithuanian samples on PC1.



    I used `--indep-pairwise 50 10 .2 --geno .01 --maf .03`, which kept about 90,000 SNPs. Maybe I did the QC wrong (https://data.mendeley.com/datasets/ckz9mtgrjj/3):

    These genotypes were all generated on Illumina chips (550, 610, 660) for multiple different studies. The two main papers that this dataset was compiled for are: Hellenthal, et al 2014 A Genetic Atlas of Human Admixture History, Science; and Busby, et al 2015 The role of recent admixture in forming the contemporary West Eurasian genomic landscape, Current Biology.

    The data are in PLINK format and the BusbyWorldwidePopulations.csv file outlines where the different datasets come from. Note that because these two datasets were combined together, not all populations are typed on the same set of SNPs. We have included genotype data on 523,443 SNPs, of which 441,038 are genotyped on at least 97.5% of individuals.

    Therefore, additional QC steps are required to filter this set down to high quality calls, depending on the subset of samples that are required.
    Could be. Add Saamis, keeping Nganasans and East Asians, Chuvashes, Mordvas, North Russians and the European block without changes.

  7. #17
    Veteran Member Apricity Funding Member
    "Friend of Apricity"


    Join Date
    Oct 2016
    Last Online
    @
    Ethnicity
    me
    Country
    European Union
    Y-DNA
    R1a > YP1337 > R-BY160486*
    mtDNA
    H3*
    Gender
    Posts
    6,066
    Thumbs Up
    Received: 7,243
    Given: 2,623


    Quote Originally Posted by Komintasavalta View Post

    There's also something f*cked up with some Romanian and Bulgarian samples. And I don't know why some Polish samples plot further north than Lithuanian samples on PC1.
    They are the infamous "Estonian Poles", sampled by the Estonian Biocentre from Estonia's Polish minority and certainly mixed with Estonians. You can exclude the ones that plot to the north.

    There are two Gypsy samples among the Romanians.
    I'm not sure about the Bulgarians; maybe there are some too, but I didn't test them thoroughly.

  8. #18
    Veteran Member Zoro's Avatar
    Join Date
    Dec 2017
    Last Online
    01-22-2023 @ 10:21 AM
    Meta-Ethnicity
    Indo-Iranian
    Ethnicity
    Kurd
    Ancestry
    74.31% W. Eurasian + 11.42% E. Eurasian + 5.42% S. Eurasian + 8.85% Basal Eurasian/African
    Country
    United States
    Region
    Kurdistan
    Y-DNA
    Q-M25
    mtDNA
    W4
    Gender
    Posts
    2,225
    Thumbs Up
    Received: 1,249
    Given: 524


    Quote Originally Posted by Komintasavalta View Post

    I used `--indep-pairwise 50 10 .2 --geno .01 --maf .03`, which kept about 90,000 SNPs. Maybe I did the QC wrong (https://data.mendeley.com/datasets/ckz9mtgrjj/3):
    Nice looking plots!

    Not a good idea to do --maf 0.03. Basically what you did was to throw out all rarer SNPs and leave the more common older SNPs with minor allele frequencies greater than 3% in your dataset.

    In other words if you used 3000 samples then it means you threw out all alleles shared by 90 samples to the exclusion of the rest of samples. So if 3 populations of 30 samples each have unique alleles that set them apart from all the other populations, you just threw out those alleles!

    I would do the opposite if you’re looking for more recent shared ancestry. I would throw out all the common alleles older than 10,000 years old by doing something like --max-maf 0.2 which gets rid of alleles shared by 2400 of your 3000 samples since they’re not as informative

    You can also try --max-maf 0.25 or 0.3 if you’re not left with enough SNPs

    Repost graphs. FYI the clustering will be more realistic but not as neat in terms of everyone in a population having the same recent ancestry. In other words you’ll find more variation within a population than G25 or some calculators would have you believe

    Just as there’s considerable phenotype variation in a population, in reality, outside of G25 or calculator Lalaland there’s considerable genotype variation in a population. Another thing to keep in mind is the SNPs picked by ancestry companies are optimized for intra-European variation. There are many SNPs unique to East Asians or Africans that are not genotyped. I think in 20 years your ancestry calculations will look different from now
    Last edited by Zoro; 09-23-2021 at 12:28 PM.
    Muzh ba staso la tyaro tsakha ra wubaasu


  9. #19
    Banned
    Join Date
    Sep 2020
    Last Online
    09-12-2023 @ 03:47 PM
    Location
    コミ共和国
    Meta-Ethnicity
    Finno-Permic
    Ethnicity
    Peasant
    Ancestry
    コミ
    Country
    Finland
    Taxonomy
    Karaboğa (euryprosopic, platyrrhine, dolichocephalic)
    Relationship Status
    Virgin
    Gender
    Posts
    2,170
    Thumbs Up
    Received: 4,863
    Given: 2,946


    Quote Originally Posted by Lemminkäinen View Post
    Could be. Add Saamis, keeping Nganasans and East Asians, Chuvashes, Mordvas, North Russians and the European block without changes.
    I added some Siberian samples from Busby, and I added samples from Tambets et al. 2018 (https://evolbio.ut.ee/Tambets2018/). The Finns from Busby look more western or northern than the Finns from Tambets.

    I wonder if Tatar1003 is a Crimean Tatar, because it plots close to Nogais. The sample buryat_V43501 from Tambets looks hapa.



    Quote Originally Posted by Lucas View Post
    They are the infamous "Estonian Poles", sampled by the Estonian Biocentre from Estonia's Polish minority and certainly mixed with Estonians. You can exclude the ones that plot to the north.

    There are two Gypsy samples among the Romanians.
    I'm not sure about the Bulgarians; maybe there are some too, but I didn't test them thoroughly.
    Yeah, Gypsies make sense. But how do you know they're the Estonian Poles? The Polish samples are from Hellenthal et al. 2014, but I didn't find information about their geographic location anywhere, because it's an old paper with scanty supplementary information. In the plot above, the ID of the northernmost Polish sample is Polish2 and the second northernmost is POL079.

  10. #20
    Banned
    Join Date
    Sep 2020
    Last Online
    09-12-2023 @ 03:47 PM
    Location
    コミ共和国
    Meta-Ethnicity
    Finno-Permic
    Ethnicity
    Peasant
    Ancestry
    コミ
    Country
    Finland
    Taxonomy
    Karaboğa (euryprosopic, platyrrhine, dolichocephalic)
    Relationship Status
    Virgin
    Gender
    Posts
    2,170
    Thumbs Up
    Received: 4,863
    Given: 2,946


    Quote Originally Posted by Zoro View Post
    Not a good idea to do --maf 0.03. Basically what you did was to throw out all rarer SNPs and leave the more common older SNPs with minor allele frequencies greater than 3% in your dataset.

    In other words if you used 3000 samples then it means you threw out all alleles shared by 90 samples to the exclusion of the rest of samples. So if 3 populations of 30 samples each have unique alleles that set them apart from all the other populations, you just threw out those alleles!

    I would do the opposite if you’re looking for more recent shared ancestry. I would throw out all the common alleles older than 10,000 years old by doing something like --max-maf 0.2 which gets rid of alleles shared by 2400 of your 3000 samples since they’re not as informative

    You can also try --max-maf 0.25 or 0.3 if you’re not left with enough SNPs
    Yeah, I have no idea which `--maf` or `--max-maf` setting I should use, so maybe it's best not to use them at all. But `--maf` and `--max-maf` don't select the SNPs to remove based on the minor allele frequency relative to other samples in the dataset, but based on the absolute frequency. Here's the number of SNPs removed in my set of European samples from Busby, out of a total of 523,443:

    `--max-maf .499`: 898
    `--max-maf .49`: 9297
    `--max-maf .45`: 46980
    `--max-maf .25`: 243706
    `--max-maf .1`: 421404
    `--max-maf .05`: 480608
    `--maf .001`: 7873
    `--maf .01`: 18428
    `--maf .05`: 42808
    `--maf .25`: 243905
    `--maf .4`: 429267
    `--maf .45`: 476446
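For reference, a minimal sketch of what the two filters do to a single SNP, as far as I understand PLINK's behavior: the minor allele frequency is the rarer allele's share of all non-missing allele copies, `--maf` drops SNPs below the threshold and `--max-maf` drops SNPs above it. The helper names and genotypes are invented. Note that with 3,000 diploid samples there are 6,000 allele copies, so a MAF of 0.03 corresponds to 180 copies of the minor allele.

```python
def minor_allele_freq(genotypes_at_snp):
    """MAF from 0/1/2 alternate-allele counts; None entries are missing calls."""
    calls = [g for g in genotypes_at_snp if g is not None]
    if not calls:
        return None
    alt_freq = sum(calls) / (2 * len(calls))   # two allele copies per sample
    return min(alt_freq, 1.0 - alt_freq)


def passes(maf, min_maf=None, max_maf=None):
    """True if the SNP survives a --maf style floor and a --max-maf style cap."""
    if min_maf is not None and maf < min_maf:
        return False
    if max_maf is not None and maf > max_maf:
        return False
    return True


rare_alt = minor_allele_freq([0, 0, 0, 1])      # alternate allele is the minor one
rare_ref = minor_allele_freq([2, 2, 2, 1])      # reference allele is the minor one
balanced = minor_allele_freq([1, 1, 0, 2, None])

print(rare_alt, rare_ref, balanced)
print(passes(rare_alt, min_maf=0.03), passes(balanced, max_maf=0.2))
print("copies at MAF 0.03 with 3000 samples:", round(0.03 * 2 * 3000))
```

Since the minor allele can never be above 50%, `--max-maf` removes the SNPs where the two alleles are closest to balanced, and leaves the ones where one allele is clearly rarer.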

    Quote Originally Posted by Zoro View Post
    Another thing to keep in mind is the SNPs picked by ancestry companies are optimized for intra-European variation. There are many SNPs unique to East Asians or Africans that are not genotyped. I think in 20 years your ancestry calculations will look different from now
    Is that also true of the Human Origins array, or is that why West Africans have such a low f2 distance to Khoisan and Central African pygmies and Hadza in 1240K+HO? (https://anthrogenica.com/showthread....315#post800315)

    Edit: apparently the Human Origins array was designed to differentiate 11 modern populations, including Mbuti and San, but maybe it still gives relatively little weight to SNPs that are specific to Capoids and Bambutids (https://www.thermofisher.com/documen...ns_appnote.pdf):

    A total of 1.81 million candidate SNPs, all from genome locations covered by sequencing reads from Neanderthals, Denisovans, and chimpanzees, were ascertained using a simple SNP discovery procedure first described by Keinan, et al., 2007.[1] The most important ascertainment involved using whole-genome shotgun sequencing data to discover differences between the two chromosomes carried by individuals from 11 populations (San, Yoruba, Mbuti, French, Sardinian, Han, Cambodian, Mongolian, Karitiana, Papuan, and Bougainville).

    A paper about a new SNP panel said that the 1240K panel overestimates the difference between Africans and non-Africans (https://arborbiosci.com/wp-content/u...nel_Design.pdf):

    We computed FST for 1) the whole genomes in the SGDP 2) the currently widely used 1240k panel, and for the new 850k Ancestral SNP panel. Differentiation between African and non-African populations is overestimated for the 1240k, but no similar bias is observed for the Ancestral SNP panel (Figure 2).
    Last edited by Komintasavalta; 09-23-2021 at 02:09 PM.
