
Thread: Making Vahaduo-like models based on SNP-level data?

  1. #1 Komintasavalta

    Making Vahaduo-like models based on SNP-level data?

    (Reposted from Anthrogenica, because I want to hear comments from Lemminkäinen, Lucas, vbnetkhio, Zoro & co. (in alphabetical order).)

    1. Download the 1240K+HO dataset: https://reich.hms.harvard.edu/allen-...cient-dna-data. Convert it to PLINK format.

    Code:
    wget https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.{anno,ind,snp,geno}
    x=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $x.geno snpname:\ $x.snp indivname:\ $x.ind outputformat:\ PACKEDPED genotypeoutname:\ $x.bed snpoutname:\ $x.bim indivoutname:\ $x.fam)
    2. Select modern populations with at least 20 samples, so that individual-level genetic variation is averaged out. Include only the 20 samples with the highest SNP count from each population, because otherwise the models would be biased towards populations with a large number of samples as sources. In the anno file, field 2 is the sample name, field 8 is the population name, field 6 is the age in years before present, and field 15 is the SNP count. (A pandas sketch of the same selection is included after the code.)

    Code:
    x=test;awk -F\\t '$6==0&&++a[$8]==20{print$8}' v44.3_HO_public.anno|grep -Ev '\.DG|\.SDG|\.SG|\.WGA|Ignore|outlier'|awk 'NR==FNR{a[$0];next}$3 in a&&++b[$3]<=20{print$1,$3}' - <(sort -t$'\t' -rnk15 v44.3_HO_public.ind)>$x.pick
    plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
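    Here is roughly the same selection as a Python/pandas sketch, for anyone who finds the awk pipeline hard to follow. It assumes the .anno file is tab-separated with a header row and uses the 1-based field positions listed above (positions may differ in other dataset versions); the output name test.pick matches the shell version.

    Code:
    import pandas as pd

    # Step-2 selection: modern samples only, populations with >= 20 of them,
    # and within each population the 20 samples with the highest SNP count.
    anno = pd.read_csv("v44.3_HO_public.anno", sep="\t", header=0, low_memory=False)
    df = pd.DataFrame({
        "sample": anno.iloc[:, 1],                                  # field 2: sample name
        "age": pd.to_numeric(anno.iloc[:, 5], errors="coerce"),     # field 6: age BP
        "pop": anno.iloc[:, 7],                                     # field 8: population
        "snps": pd.to_numeric(anno.iloc[:, 14], errors="coerce"),   # field 15: SNP count
    })

    # drop ancient samples and populations with .DG/.SG-style suffixes or outlier flags
    modern = df[(df["age"] == 0) &
                ~df["pop"].str.contains(r"\.DG|\.SDG|\.SG|\.WGA|Ignore|outlier", na=True)]

    counts = modern["pop"].value_counts()
    picked = (modern[modern["pop"].isin(counts[counts >= 20].index)]
              .sort_values("snps", ascending=False)
              .groupby("pop")
              .head(20))
    picked[["sample", "pop"]].to_csv("test.pick", sep=" ", index=False, header=False)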
    3. Perform aggressive LD pruning (`--indep-pairwise 50 10 .05`) and remove all SNPs with missing data (`--geno 0`) in order to get the total SNP count down to about 30,000. If this removes too many SNPs, you can use something like `--geno .001` instead.

    Code:
    plink --bfile $x --indep-pairwise 50 10 .05 --geno 0 --out $x;plink --allow-no-sex --bfile $x --extract $x.prune.in --make-bed --out $x.p
    4. Convert the PLINK file to the EIGENSTRAT format. The eigenstratgeno file consists of one SNP per line with one number per individual, where the numbers 0-2 indicate the number of copies of the reference allele and 9 indicates missing data.

    Code:
    convertf -p <(printf %s\\n genotypename:\ $x.p.bed snpname:\ $x.p.bim indivname:\ $x.p.fam outputformat:\ EIGENSTRAT genotypeoutname:\ $x.eigenstratgeno snpoutname:\ $x.snp indivoutname:\ $x.ind)
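    To sanity-check the conversion, the eigenstratgeno matrix is easy to load directly. A minimal numpy sketch (only the file name produced above is assumed):

    Code:
    import numpy as np

    # One SNP per line, one character per individual:
    # 0/1/2 = copies of the reference allele, 9 = missing.
    with open("test.eigenstratgeno") as f:
        geno = np.array([[int(c) for c in line.strip()] for line in f], dtype=np.int8)

    print(geno.shape)             # (number of SNPs, number of individuals)
    print((geno == 9).mean())     # fraction of missing genotypes (0 after --geno 0)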
    5. Convert the EIGENSTRAT file to a CSV format, and create another CSV file for average allele frequencies within populations.

    Code:
    awk -F '' '{for(i=1;i<=NF;i++)a[i,NR]=$i}END{for(j=1;j<=NF;j++)for(k=1;k<=NR;k++)printf(k==NR?"%s\n":"%s,"),a[j,k]}' $x.eigenstratgeno|tr -d 9|paste -d, <(awk 'NR==FNR{a[$1]=$2;next}{print a[$2]":"$2}' $x.pick $x.fam) ->$x.csv
    sed 's/:[^,]*//' $x.csv|awk -F, '{n[$1]++;for(i=2;i<=NF;i++){a[$1,i]+=$i}}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o","a[i,j]/n[i];print o}}'>$x.ave.csv
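    Here is the per-population averaging step again as a pandas sketch, in case the awk is hard to read. It assumes test.csv is in the format produced above (the Population:Sample label in the first column, one genotype per remaining column, missing values left empty), and it writes to a separate file name (test.ave.pandas.csv) so it doesn't overwrite the awk output.

    Code:
    import pandas as pd

    # Per-sample CSV from step 5: index is "Population:Sample", columns are genotypes.
    df = pd.read_csv("test.csv", header=None, index_col=0)
    df.index = df.index.str.split(":").str[0]      # keep only the population name

    # Mean genotype per SNP within each population (missing values are ignored,
    # but --geno 0 should already have removed them).
    ave = df.groupby(level=0).mean()
    ave.to_csv("test.ave.pandas.csv", header=False)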
    6. Install the CVXPY library and download my modified version of michal3141's convex optimization script: https://github.com/michal3141/g25. My version uses the same algorithm as the original script, but I changed the output format. I also changed the solver to CVXOPT, because compared to the ECOS solver, it seems to deal better with an input matrix that has tens of thousands of columns.

    Code:
    pip3 install cvxpy
    curl https://pastebin.com/raw/afaMiFSa|tr -d \\r>mix;chmod +x mix
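    For anyone curious what the script does under the hood: as far as I understand it, the fit boils down to non-negative least squares with a sum-to-one constraint. Here is a minimal CVXPY sketch of that idea (not the script itself, so details like scaling and the fit statistic may differ); the sources.csv/target.csv names are just placeholders for the kind of input the grep commands in step 7 produce.

    Code:
    import numpy as np
    import cvxpy as cp

    # Each row: "pop,value1,value2,..." (the format of test.ave.csv).
    def load(path):
        rows = [line.strip().split(",") for line in open(path)]
        return [r[0] for r in rows], np.array([[float(x) for x in r[1:]] for r in rows])

    names, S = load("sources.csv")    # (n_sources, n_SNPs)
    _, T = load("target.csv")         # a single target row
    t = T[0]

    # Minimize ||S^T w - t||^2 subject to w >= 0 and sum(w) = 1.
    w = cp.Variable(S.shape[0], nonneg=True)
    prob = cp.Problem(cp.Minimize(cp.sum_squares(S.T @ w - t)), [cp.sum(w) == 1])
    prob.solve(solver=cp.CVXOPT)      # CVXOPT instead of ECOS, as in the modified script

    for name, frac in sorted(zip(names, w.value), key=lambda p: -p[1]):
        if frac > 0.005:
            print(f"{100 * frac:.1f}% {name}")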
    7. Try out some models. The models usually make less sense than G25 models, and there's often about 0-5% ancestry from multiple populations that are not actually related to the target population. But I guess it's a start.

    Code:
    $ t=Selkup;./mix <(grep -v ^$t, test.ave.csv) <(grep ^$t, test.ave.csv)
    Selkup (31.204):
    21% Nganasan
    16% Bashkir
    11% Tuvinian
    9% Tubalar
    7% Chukchi
    7% Altaian
    7% Yakut
    7% Mordovian
    5% Russian
    3% Buryat
    2% Tajik
    2% Ulchi
    2% Uzbek
    1% Hungarian
    1% French
    1% Burusho
    0% Mongol
    $ t=Mordovian;./mix <(grep -v ^$t, test.ave.csv) <(grep ^$t, test.ave.csv)
    Mordovian (23.311):
    25% Russian
    17% Hungarian
    12% French
    7% Bashkir
    7% Spanish
    5% Tajik
    4% Italian_North
    4% Basque
    4% Greek
    4% Uzbek
    4% Selkup
    3% Adygei
    1% Tubalar
    1% Burusho
    1% Balochi
    1% Altaian
    0% Turkish
    0% Chukchi
    $ t=Nganasan;./mix <(grep -v ^$t, test.ave.csv) <(grep ^$t, test.ave.csv)
    Nganasan (36.739):
    22% Yakut
    22% Selkup
    19% Tuvinian
    14% Ulchi
    11% Buryat
    11% Chukchi
    I uploaded the CSV file of average allele frequencies within populations here: https://drive.google.com/file/d/1nuS...ew?usp=sharing. You can also use the CSV file with Vahaduo. When I tried to paste the contents of the file into the SOURCE tab, Vahaduo crashed in Safari but not in Chrome. Running a single model took about half a minute.

    Here are PCA plots based on the CSV file; I excluded Yoruba and Biaka from the second plot:

    [images: two PCA plots based on the CSV file; the second excludes Yoruba and Biaka]

    If you put the CSV file linked above into Vahaduo's TARGET tab and the CSV file produced by the code below into the SOURCE tab, you can create models similar to supervised ADMIXTURE:

    Code:
    curl -Ls 'https://drive.google.com/uc?export=download&id=1nuS1pZ-pZCCWfeA1JpMwhfqgAT-c9Oi8'>test.ave.csv
    tav(){ awk -F, '{n[$1]++;for(i=2;i<=NF;i++){a[$1,i]+=$i}}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS a[i,j]/n[i];print o}}' "FS=${1-$'\t'}";}
    printf %s\\n Yoruba+Biaka Palestinian+Yemeni_Northwest+BedouinA Russian+Mordovian+Hungarian Nganasan+Chukchi+Yakut Han+Japanese+Qiang|
    while read l;do tr + \\n<<<$l|awk -F, 'NR==FNR{a[$0];next}$1 in a' - test.ave.csv|cut -d, -f2-|sed s/^/$l,/|tav ,;done>source
    The image below displays the models copied from Vahaduo's MULTI tab. The number after the population name is the distance fit. Some South Asian populations got bad fits, and they got a few percent of the SSA component, which may be a sign that I should've added a South Asian component to this run.

    [image: models copied from Vahaduo's MULTI tab]

    Does anyone know a better way to make models like this that are based on SNP-level data? Is there something I could do to improve this workflow?
    Last edited by Komintasavalta; 09-21-2021 at 12:37 PM.

  2. #2 Komintasavalta

    I now made another CSV file for all modern populations with at least 8 samples and with no suffix like .SG or .DG. It has 188 populations and 21,065 SNPs. But now I get an even larger number of populations in the models:

    Code:
    $ t=Udmurt;./mix <(grep -v $t test.ave.csv) <(grep $t test.ave.csv)
    Udmurt (27.542):
    8.7% Mansi
    7.6% Tatar_Kazan
    5.8% Selkup
    5.1% Chuvash
    4.7% Finnish
    3.9% Tatar_Siberian
    3.8% Ket
    3.8% Mordovian
    3.5% Tajik
    2.7% Belarusian
    2.4% Bashkir
    2.2% Orcadian
    2.2% Yukagir_Tundra
    2.1% Hungarian
    2.1% Norwegian
    2.0% Veps
    2.0% Russian
    1.8% Icelandic
    1.7% Kalash
    1.6% Lak
    1.6% French
    1.5% Ukrainian_North
    1.5% Nogai_Stavropol
    1.5% Darginian
    1.5% Chukchi
    1.4% Avar
    1.4% Czech
    1.3% Ukrainian
    1.3% Tatar_Mishar
    1.3% Romanian
    1.3% Nganasan
    1.2% Estonian
    1.2% Altaian_Chelkan
    1.2% English
    1.2% Lithuanian
    1.1% Khamnegan
    1.0% Hazara
    1.0% Kaitag
    1.0% Khakass
    1.0% Sindhi_Pakistan
    0.9% Karelian
    0.8% Tubalar
    0.7% Uzbek
    0.4% Gagauz
    0.4% Khakass_Kachin
    0.3% Ingushian
    0.3% Khomani
    0.3% Basque
    0.3% Karitiana
    0.3% Pathan
    0.1% Tabasaran
    0.1% Koryak
    0.0% Azeri
    However, if I make models based on a PCA of the CSV file of allele frequencies, I get a smaller number of populations in the models. I applied G25-style scaling, where I multiplied the eigenvectors by the square roots of the eigenvalues.

    Code:
    $ Rscript -e 't=read.csv("test.csv",row.names=1,header=F);p=prcomp(t);p2=(p$x*sqrt(p$sdev))[,1:ncol(p$x)-1];write.table(round(p2,5),"test.pca",col.names=F,quote=F,sep=",");p3=data.frame(aggregate(p2,by=list(sub(":.*","",rownames(p2))),mean),row.names=1);write.table(round(p3,5),"test.pca.ave",col.names=F,quote=F,sep=",")'
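    For comparison, here is the scaling described above as a numpy sketch: each PC column is multiplied by the square root of its eigenvalue and the coordinates are then averaged per population. It is not guaranteed to reproduce the R one-liner exactly, and the output file name is just a placeholder.

    Code:
    import numpy as np
    import pandas as pd

    # test.csv: "Population:Sample" label in column 1, genotypes in the rest.
    df = pd.read_csv("test.csv", header=None, index_col=0)
    X = df.to_numpy(dtype=float)
    Xc = X - X.mean(axis=0)                    # center each SNP

    # PCA via SVD: U*s are the unscaled PC coordinates (prcomp's p$x).
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s
    eigvals = s ** 2 / (X.shape[0] - 1)        # variance explained by each PC

    scaled = scores * np.sqrt(eigvals)         # scale each dimension by sqrt(eigenvalue)
    pops = df.index.str.split(":").str[0]
    ave = pd.DataFrame(scaled, index=pops).groupby(level=0).mean()
    ave.round(5).to_csv("test.pca.ave.numpy.csv", header=False)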
    Based on models made with the PCA, I keep getting Iranian or Caucasian ancestry for Udmurts, which matches the results of G25 and ADMIXTURE. I only used the first 10 dimensions in the first model below and the first 20 dimensions in the second model, because if I used all 188 dimensions, I got a model with about 50 populations.

    Code:
    $ t=Udmurt;k=10;./mix <(grep -v ^$t, test.pca.ave|cut -d, -f-$[k+1]) <(grep $t test.pca.ave|cut -d, -f-$[k+1]) -t.1 -f1
    Udmurt (2.573):
    46.0% Mansi
    41.3% Karelian
    12.5% Kalash
    0.2% Mbuti
    $ t=Udmurt;k=20;./mix <(grep -v $t test.pca.ave|cut -d, -f-$[k+1]) <(grep $t test.pca.ave|cut -d, -f-$[k+1]) -t.1 -f1
    Udmurt (3.747):
    32.6% Veps
    25.5% Ket
    14.8% Karelian
    10.5% Darginian
    9.4% Mansi
    4.3% Kalash
    1.9% Koryak
    0.8% Khomani
    0.3% Karitiana
    If Finns are modeled using a maximum of 4 source populations, they get 9% of ancestry from Siberian and Russian Far Eastern populations, even though part of the Siberian ancestry of Finns is hidden under the Estonian component:

    Code:
    $ t=Finnish;k=20;./mix <(grep -v $t test.pca.ave|cut -d, -f-$[k+1]) <(grep $t test.pca.ave|cut -d, -f-$[k+1]) -t.1 -f1 -m4
    Finnish (1.193):
    60.6% Estonian
    30.2% Ukrainian_North
    6.1% Tofalar
    3.1% Koryak
    But then if Finns are modeled without Uralic or VURian source populations, they get about 12% Siberian ancestry:

    Code:
    $ t=Finnish;k=20;./mix <(grep -v $t test.pca.ave|grep -Ev 'Estonian|Karelian|Veps|Udmurt|Selkup|Nganasan|Mansi|Chuvash|Bashkir|Tatar_Mishar'|cut -d, -f-$[k+1]) <(grep $t test.pca.ave|cut -d, -f-$[k+1]) -t.1 -f1 -m4
    Finnish (1.737):
    48.1% Lithuanian
    40.2% Ukrainian_North
    6.3% Yukagir_Tundra
    5.4% Ket
    Below are plots of the first 8 dimensions of the PCA. PC3 differentiates Americans, PC4 differentiates Siberians, and PC5 differentiates Papuans. I don't know why the populations from Malawi plot so far out on PC1 and PC2, but the Malawi samples also included some extreme outliers in a SmartPCA run.

    [images: plots of the first 8 PCA dimensions]

    But now I've just ended up reinventing G25, and it would've been easier to do your DIY G25 with SmartPCA. I probably should've used more SNPs than just 20,000, and I should've removed outliers like some of the samples from Malawi.
    Last edited by Komintasavalta; 09-21-2021 at 06:05 PM.

  3. #3 Lucas

    It's best if you have more than 100,000 SNPs for intercontinental comparisons. For a small regional comparison, 20,000 would be enough according to the ADMIXTURE manual (I know it's not ADMIXTURE here).

    In my opinion you recreated the ADMIXTURE method using your own technique, which is very interesting.

    Here are the source files from my old K4 calculator (first 10 lines of each):

    .alleles file (created using Dienekes' method from a PLINK .frq file):

    rs10000023 G T
    rs10000041 G T
    rs1000014 A G
    rs1000016 G A
    rs1000022 C T
    rs10000226 T C
    rs1000031 A G
    rs1000032 T G
    rs1000040 G A
    rs1000061 G A

    .F file (frequencies of the four components for those alleles, created using Dienekes' method from the original ADMIXTURE .P file):

    0.526076 0.518078 0.615615 0.568263
    0.686147 0.557793 0.822141 0.908194
    0.870707 0.959106 0.81822 0.572322
    0.353429 0.988267 0.898614 0.774009
    0.670159 0.97078 0.419393 0.828624
    0.984299 0.661853 0.912809 0.99999
    0.701481 0.066696 0.618875 0.678923
    0.99782 0.854309 0.896498 0.99999
    0.657839 0.326454 0.715561 0.681471
    0.657131 0.324176 0.665631 0.609679


    Components for this calculator:

    SE-Asian-Oceanian
    SSA
    West-Med
    Amerindian


    .Q file from ADMIXTURE (not used by the calculator, but needed for making the oracle; it contains the proportions of the four components among the reference samples used for the unsupervised ADMIXTURE run. I deleted the original .fam file, so I don't know exactly which samples these are; could be Chinese, Yoruba etc.):

    0.798662 0.025817 0.008296 0.167224
    0.001013 0.912440 0.076914 0.009633
    0.913330 0.042865 0.014940 0.028865
    0.003667 0.946447 0.048983 0.000904
    0.005281 0.990495 0.004104 0.000119
    0.020896 0.817714 0.143084 0.018306
    0.023215 0.889052 0.074717 0.013016
    0.000457 0.956805 0.040238 0.002501
    0.829160 0.000010 0.012228 0.158601
    0.013498 0.718899 0.242508 0.025095
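    For the connection to the approach in this thread: a calculator like this essentially estimates the admixture proportions by maximizing the binomial likelihood of an individual's genotypes given the per-component allele frequencies in the .F file. Below is a rough CVXPY sketch of that idea (not Dienekes' actual code); the calc.F and genotypes.txt file names are assumptions, and the genotype coding has to match the allele order in the .alleles file.

    Code:
    import numpy as np
    import cvxpy as cp

    freqs = np.loadtxt("calc.F")          # SNPs x components, allele frequencies
    g = np.loadtxt("genotypes.txt")       # this person's 0/1/2 counts of the same alleles
                                          # (missing genotypes would need to be dropped first)

    q = cp.Variable(freqs.shape[1], nonneg=True)   # admixture proportions
    p = freqs @ q                                  # expected allele frequency per SNP
    loglik = cp.sum(cp.multiply(g, cp.log(p)) + cp.multiply(2 - g, cp.log(1 - p)))
    prob = cp.Problem(cp.Maximize(loglik), [cp.sum(q) == 1])
    prob.solve()

    for name, frac in zip(["SE-Asian-Oceanian", "SSA", "West-Med", "Amerindian"], q.value):
        print(f"{100 * frac:5.1f}% {name}")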
    Last edited by Lucas; 09-22-2021 at 11:09 AM.

  4. #4 Komintasavalta

    Quote Originally Posted by Lucas
    It's best if you have more than 100,000 SNPs for intercontinental comparisons. For a small regional comparison, 20,000 would be enough according to the ADMIXTURE manual (I know it's not ADMIXTURE here).
    It's the reverse: you need more SNPs for a local comparison and fewer for a global one (http://dalexander.github.io/admixtur...ure-manual.pdf):

    As a rule of thumb, we have found that 10,000 markers suffice to perform GWAS correction for continentally separated populations (for example, African, Asian, and European populations FST > .05) while more like 100,000 markers are necessary when the populations are within a continent (Europe, for instance, FST < 0.01).

    BTW, I don't think 10,000 SNPs is enough for a global ADMIXTURE run, or at least not when the SNP count has been brought down that far by LD pruning. In runs with about 20,000 or 30,000 SNPs, I get models where many samples get one or a few percent of components that are not actually related to the sample, and the effect becomes more pronounced with fewer than 20,000 SNPs. For example, in the ADMIXTURE run below with 15,645 SNPs, I got 5% East Asian ancestry for Sardinians and Mozabites, but in the run with 68,154 SNPs the results are more reasonable. In both runs, the SNPs were removed by LD pruning.

    [image: ADMIXTURE results with 15,645 vs. 68,154 SNPs]

    Anyway, the reason I used such a low SNP count in this thread is that CVXPY often failed to solve models with even about 20,000 or 30,000 SNPs. Maybe it would be better to use a commercial solver, because the CVXPY tutorial says this: "If you need to solve a large mixed-integer problem quickly, or if you have a nonlinear mixed-integer model, then you will need to use a commercial solver such as CPLEX, GUROBI, XPRESS, or MOSEK." (https://www.cvxpy.org/tutorial/advan....html#advanced)

  5. #5 Lucas

    Yes, it's the reverse. I was too lazy to check the manual before writing.
    But I've never created calculator runs with fewer than 60,000 SNPs. With 10,000 we could maybe create some K3 using only Yoruba, Han and Sardinian refs, but it would be basically useless.
    Now I only create calculators with more than 100,000 SNPs. More SNPs, more detailed results.

  6. #6 Komintasavalta

    Quote Originally Posted by Lucas
    But I've never created calculator runs with fewer than 60,000 SNPs. With 10,000 we could maybe create some K3 using only Yoruba, Han and Sardinian refs, but it would be basically useless.
    Now I only create calculators with more than 100,000 SNPs. More SNPs, more detailed results.
    With my SNP-level Vahaduo models, the results don't change that much between around 15,000 and 70,000 SNPs. But it would probably be better to use around 100,000 SNPs.


  7. #7 Lucas

    Quote Originally Posted by Komintasavalta
    With my SNP-level Vahaduo models, the results don't change that much between around 15,000 and 70,000 SNPs. But it would probably be better to use around 100,000 SNPs.

    I was talking about ADMIXTURE-related calcs. Maybe with your model it's possible to attain a similar level of detail with a lower SNP count; it should be tested more. If I send you my raw file, can you check me? But it needs Polish, Belarusian and Lithuanian samples too; for me, German isn't really needed.

  8. #8 Komintasavalta

    You can do the conversion to CSV much faster by using `--recode A` with PLINK (https://www.cog-genomics.org/plink/1.9/data):

    Code:
    $ x=test3.p;plink --bfile $x --recode A --out $x>$x.log
    $ head -n5 $x.raw|cut -d' ' -f-10
    FID IID PAT MAT SEX PHENOTYPE rs1240747_A rs3766176_C rs6694101_C rs201250599_G
    49 Adg-194 0 0 1 1 1 1 0 0
    54 Adg-238 0 0 1 1 2 0 1 0
    65 ALT-741 0 0 1 1 0 1 0 0
    67 ALT-744 0 0 1 1 0 0 0 0
    $ sed 1d $x.raw|cut -d' ' -f2,7-|tr \  ,|head -n4|cut -d, -f-5
    Adg-194,1,1,0,0
    Adg-238,2,0,1,0
    ALT-741,0,1,0,0
    ALT-744,0,0,0,0
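    The .raw file is plain space-separated text, so it can also be loaded directly; here is a small pandas sketch of the same IID+genotype extraction (the output file name is just an example).

    Code:
    import pandas as pd

    # PLINK --recode A output: FID IID PAT MAT SEX PHENOTYPE, then one 0/1/2 column per SNP.
    raw = pd.read_csv("test3.p.raw", sep=r"\s+")
    geno = raw.drop(columns=["FID", "PAT", "MAT", "SEX", "PHENOTYPE"]).set_index("IID")
    geno.to_csv("test3.csv", header=False)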
    Quote Originally Posted by Lucas
    If I send you my raw file, can you check me? But it needs Polish, Belarusian and Lithuanian samples too; for me, German isn't really needed.
    Sure.

  9. #9 Veteran Member

    Quote Originally Posted by Lucas
    It's best if you have more than 100,000 SNPs for intercontinental comparisons. For a small regional comparison, 20,000 would be enough according to the ADMIXTURE manual (I know it's not ADMIXTURE here).
    Isn't it the opposite? 20,000 is enough for an Africa/Europe/Asia run, and for something more detailed you need 100,000 or more.

    Edit: OK, I see it was already answered.

  10. #10 Veteran Member

    Quote Originally Posted by Komintasavalta
    With my SNP-level Vahaduo models, the results don't change that much between around 15,000 and 70,000 SNPs. But it would probably be better to use around 100,000 SNPs.

    I think the modern samples from evolbio.ut.ee and the HGDP (the original, not the version in the Reich dataset) have more ancestry-relevant SNPs than those from the Reich dataset. They should have 150-200k after LD and MAF filtering, depending on the sample choice. Then you can model the ones from the Reich dataset with them as sources.
