Making Vahaduo-like models based on SNP-level data? [Archive] - The Apricity Forum: A European Cultural Community

View Full Version : Making Vahaduo-like models based on SNP-level data?

Komintasavalta

09-21-2021, 12:37 PM

(Reposted from Anthrogenica, because I want to hear comments from Lemminkäinen, Lucas, vbnetkhio, Zoro & co. (in alphabetical order).)

1. Download the 1240K+HO dataset: https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data. Convert it to PLINK format.

wget https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.{anno,ind,snp,geno}
x=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $x.geno snpname:\ $x.snp indivname:\ $x.ind outputformat:\ PACKEDPED genotypeoutname:\ $x.bed snpoutname:\ $x.bim indivoutname:\ $x.fam)

2. Select modern populations with at least 20 samples, so individual-level genetic variation becomes averaged out. Include only the 20 samples with the highest SNP count from each population, because otherwise the models would be biased so that populations with a large number of samples would be preferred as sources. In the anno file, field 2 is sample name, field 8 is population name, field 6 is age in years before present, and field 15 is SNP count.

x=test;awk -F\\t '$6==0&&++a[$8]==20{print$8}' v44.3_HO_public.anno|grep -Ev '\.DG|\.SDG|\.SG|\.WGA|Ignore|outlier'|awk 'NR==FNR{a[$0];next}$3 in a&&++b[$3]<=20{print$1,$3}' - <(sort -t$'\t' -rnk15 v44.3_HO_public.ind)>$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x

3. Perform aggressive LD pruning (`--indep-pairwise 50 10 .05`) and remove all SNPs with missing data (`--geno 0`) in order to get the total SNP count down to about 30,000. If this removes too many SNPs, you can use something like `--geno .001` instead.

plink --bfile $x --indep-pairwise 50 10 .05 --geno 0 --out $x;plink --allow-no-sex --bfile $x --extract $x.prune.in --make-bed --out $x.p

4. Convert the PLINK file to the EIGENSTRAT format. The eigenstratgeno file consists of one SNP per line with one number per individual, where the numbers 0-2 indicate the number of copies of the reference allele and 9 indicates missing data.

convertf -p <(printf %s\\n genotypename:\ $x.p.bed snpname:\ $x.p.bim indivname:\ $x.p.fam outputformat:\ EIGENSTRAT genotypeoutname:\ $x.eigenstratgeno snpoutname:\ $x.snp indivoutname:\ $x.ind)

5. Convert the EIGENSTRAT file to a CSV format, and create another CSV file for average allele frequencies within populations.

awk -F '' '{for(i=1;i<=NF;i++)a[i,NR]=$i}END{for(j=1;j<=NF;j++)for(k=1;k<=NR;k++)printf(k==NR?"%s\n":"%s,"),a[j,k]}' $x.eigenstratgeno|tr -d 9|paste -d, <(awk 'NR==FNR{a[$1]=$2;next}{print a[$2]":"$2}' $x.pick $x.fam) ->$x.csv
sed 's/:[^,]*//' $x.csv|awk -F, '{n[$1]++;for(i=2;i<=NF;i++){a[$1,i]+=$i}}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o","a[i,j]/n[i];print o}}'>$x.ave.csv

6. Install the CVXPY library and download my modified version of michal3141's convex optimization script: https://github.com/michal3141/g25. My version uses the same algorithm as the original script, but I changed the output format. I also changed the solver to CVXOPT, because compared to the ECOS solver, it seems to deal better with an input matrix that has tens of thousands of columns.

pip3 install cvxpy
curl https://pastebin.com/raw/afaMiFSa|tr -d \\r>mix;chmod +x mix

7. Try out some models. The models usually make less sense than G25 models, and there's often about 0-5% ancestry from multiple populations that are not actually related to the target population. But I guess it's a start.

$ t=Selkup;./mix <(grep -v ^$t, test.ave.csv) <(grep ^$t, test.ave.csv)
Selkup (31.204):
21% Nganasan
16% Bashkir
11% Tuvinian
9% Tubalar
7% Chukchi
7% Altaian
7% Yakut
7% Mordovian
5% Russian
3% Buryat
2% Tajik
2% Ulchi
2% Uzbek
1% Hungarian
1% French
1% Burusho
0% Mongol
$ t=Mordovian;./mix <(grep -v ^$t, test.ave.csv) <(grep ^$t, test.ave.csv)
Mordovian (23.311):
25% Russian
17% Hungarian
12% French
7% Bashkir
7% Spanish
5% Tajik
4% Italian_North
4% Basque
4% Greek
4% Uzbek
4% Selkup
3% Adygei
1% Tubalar
1% Burusho
1% Balochi
1% Altaian
0% Turkish
0% Chukchi
$ t=Nganasan;./mix <(grep -v ^$t, test.ave.csv) <(grep ^$t, test.ave.csv)
Nganasan (36.739):
22% Yakut
22% Selkup
19% Tuvinian
14% Ulchi
11% Buryat
11% Chukchi

I uploaded the CSV file for average allele frequencies within populations here: https://drive.google.com/file/d/1nuS1pZ-pZCCWfeA1JpMwhfqgAT-c9Oi8/view?usp=sharing. You can also use the CSV file with Vahaduo. When I tried to paste the contents of the file to the SOURCE tab, Vahaduo crashed in Safari but not in Chrome. Running a single model took about half a minute.

Here's PCA plots that are based on the CSV file, where I excluded Yoruba and Biaka from the second plot:

https://i.ibb.co/z5yb4BF/allele-pca.jpg

If you put the CSV file linked above to Vahaduo's TARGET tab and the CSV file produced by the code below to the SOURCE tab, you can create models similar to supervised ADMIXTURE:

curl -Ls 'https://drive.google.com/uc?export=download&id=1nuS1pZ-pZCCWfeA1JpMwhfqgAT-c9Oi8'>test.ave.csv
tav(){ awk -F, '{n[$1]++;for(i=2;i<=NF;i++){a[$1,i]+=$i}}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS a[i,j]/n[i];print o}}' "FS=${1-$'\t'}";}
printf %s\\n Yoruba+Biaka Palestinian+Yemeni_Northwest+BedouinA Russian+Mordovian+Hungarian Nganasan+Chukchi+Yakut Han+Japanese+Qiang|
while read l;do tr + \\n<<<$l|awk -F, 'NR==FNR{a[$0];next}$1 in a' - test.ave.csv|cut -d, -f2-|sed s/^/$l,/|tav ,;done>source

The image below displays the models copied from Vahaduo's MULTI tab. The number after the population name is the distance fit. Some South Asian populations got bad fits, and they got a few percent of the SSA component, which may be a sign that I should've added a South Asian component to this run.

https://i.ibb.co/0rFbWDg/snp-standro.png

Does anyone know a better way to make models like this that are based on SNP-level data? Is there something I could do to improve this workflow?

Komintasavalta

09-21-2021, 02:58 PM

I now made another CSV file for all modern populations with at least 8 samples and with no suffix like .SG or .DG. It has 188 populations and 21065 SNPs. But now I get an even bigger number of populations in the models:

$ t=Udmurt;./mix <(grep -v $t test.ave.csv) <(grep $t test.ave.csv)
Udmurt (27.542):
8.7% Mansi
7.6% Tatar_Kazan
5.8% Selkup
5.1% Chuvash
4.7% Finnish
3.9% Tatar_Siberian
3.8% Ket
3.8% Mordovian
3.5% Tajik
2.7% Belarusian
2.4% Bashkir
2.2% Orcadian
2.2% Yukagir_Tundra
2.1% Hungarian
2.1% Norwegian
2.0% Veps
2.0% Russian
1.8% Icelandic
1.7% Kalash
1.6% Lak
1.6% French
1.5% Ukrainian_North
1.5% Nogai_Stavropol
1.5% Darginian
1.5% Chukchi
1.4% Avar
1.4% Czech
1.3% Ukrainian
1.3% Tatar_Mishar
1.3% Romanian
1.3% Nganasan
1.2% Estonian
1.2% Altaian_Chelkan
1.2% English
1.2% Lithuanian
1.1% Khamnegan
1.0% Hazara
1.0% Kaitag
1.0% Khakass
1.0% Sindhi_Pakistan
0.9% Karelian
0.8% Tubalar
0.7% Uzbek
0.4% Gagauz
0.4% Khakass_Kachin
0.3% Ingushian
0.3% Khomani
0.3% Basque
0.3% Karitiana
0.3% Pathan
0.1% Tabasaran
0.1% Koryak
0.0% Azeri

However if I make models based on a PCA of the CSV file of allele frequencies, then I get a smaller number populations in the models. I applied G25-style scaling where I multiplied the eigenvectors with the square roots of the eigenvalues.

$ Rscript -e 't=read.csv("test.csv",row.names=1,header=F);p=prcomp(t);p2=(p$x*sqrt(p$ sdev))[,1:ncol(p$x)-1];write.table(round(p2,5),"test.pca",col.names=F,quote=F,sep=",");p3=data.frame(aggregate(p2,by=list(sub(":.*","",rownames(p2))),mean),row.names=1);write.table(rou nd(p3,5),"test.pca.ave",col.names=F,quote=F,sep=",")'

Based on models made with the PCA, I keep getting Iranian or Caucasian ancestry for Udmurts, which matches the results of G25 and ADMIXTURE. I only used the first 10 dimensions in the first model below and the first 20 dimensions in the second model, because if I used all 188 dimensions, I got a model with about 50 populations.

$ t=Udmurt;k=10;./mix <(grep -v ^$t, test.pca.ave|cut -d, -f-$[k+1]) <(grep $t test.pca.ave|cut -d, -f-$[k+1]) -t.1 -f1
Udmurt (2.573):
46.0% Mansi
41.3% Karelian
12.5% Kalash
0.2% Mbuti
$ t=Udmurt;k=20;./mix <(grep -v $t test.pca.ave|cut -d, -f-$[k+1]) <(grep $t test.pca.ave|cut -d, -f-$[k+1]) -t.1 -f1
Udmurt (3.747):
32.6% Veps
25.5% Ket
14.8% Karelian
10.5% Darginian
9.4% Mansi
4.3% Kalash
1.9% Koryak
0.8% Khomani
0.3% Karitiana

If Finns are modeled using a maximum of 4 source populations, they get 9% of ancestry from Siberian and Russian Far Eastern populations, even though part of the Siberian ancestry of Finns is hidden under the Estonian component:

$ t=Finnish;k=20;./mix <(grep -v $t test.pca.ave|cut -d, -f-$[k+1]) <(grep $t test.pca.ave|cut -d, -f-$[k+1]) -t.1 -f1 -m4
Finnish (1.193):
60.6% Estonian
30.2% Ukrainian_North
6.1% Tofalar
3.1% Koryak

But then if Finns are modeled without Uralic or VURian source populations, they get about 12% Siberian ancestry:

$ t=Finnish;k=20;./mix <(grep -v $t test.pca.ave|grep -Ev 'Estonian|Karelian|Veps|Udmurt|Selkup|Nganasan|Man si|Chuvash|Bashkir|Tatar_Mishar'|cut -d, -f-$[k+1]) <(grep $t test.pca.ave|cut -d, -f-$[k+1]) -t.1 -f1 -m4
Finnish (1.737):
48.1% Lithuanian
40.2% Ukrainian_North
6.3% Yukagir_Tundra
5.4% Ket

Below are plots of the first 8 dimensions of the PCA. PC3 differentiates Americans, PC4 differentiates Siberians, and PC5 differentiates Papuans. I don't know why the populations from Malawi plot so far on PC1 and PC2, but the samples from Malawi also included some extreme outliers in a SmartPCA run.

https://i.ibb.co/GT5S6Jn/1.png
https://i.ibb.co/SrWJJL4/3.png
https://i.ibb.co/9psr8xp/5.png
https://i.ibb.co/5sxW5vf/7.png

But now I just ended up reinventing G25, and it would've been easier to do your DIY G25 with SmartPCA. And I should've probably used a higher number of SNPs than just 20,000, and I should've removed outliers like some of the samples from Malawi.

Lucas

09-22-2021, 11:55 AM

Best if you have more then 100 000 for intercontinental comparisons. For small regional this 20 000 would be enough according to Admixture manual (I know it is not admixture here).

In my opinion you recreated Admixture method using your own technique. Which is very interesting.

Look there are sources of files from my old K4 calculator (first 10 lines each one)

.alleles file (created using Dienekes method from PLINK .frq file)

rs10000023 G T
rs10000041 G T
rs1000014 A G
rs1000016 G A
rs1000022 C T
rs10000226 T C
rs1000031 A G
rs1000032 T G
rs1000040 G A
rs1000061 G A

.F file (frequencies of four components for those alleles, created using Dienekes method from original Admixture .P file)

0.526076 0.518078 0.615615 0.568263
0.686147 0.557793 0.822141 0.908194
0.870707 0.959106 0.81822 0.572322
0.353429 0.988267 0.898614 0.774009
0.670159 0.97078 0.419393 0.828624
0.984299 0.661853 0.912809 0.99999
0.701481 0.066696 0.618875 0.678923
0.99782 0.854309 0.896498 0.99999
0.657839 0.326454 0.715561 0.681471
0.657131 0.324176 0.665631 0.609679

components for this calc

SE-Asian-Oceanian
SSA
West-Med
Amerindian

.Q file from Admxiture (not used by calculator but needed for making oracle, frequencies of four components among references used for Admixture run - unsupervised. I deleted original .fam file so don't know which one here exactly, could be Chinese, Yoruba etc)

0.798662 0.025817 0.008296 0.167224
0.001013 0.912440 0.076914 0.009633
0.913330 0.042865 0.014940 0.028865
0.003667 0.946447 0.048983 0.000904
0.005281 0.990495 0.004104 0.000119
0.020896 0.817714 0.143084 0.018306
0.023215 0.889052 0.074717 0.013016
0.000457 0.956805 0.040238 0.002501
0.829160 0.000010 0.012228 0.158601
0.013498 0.718899 0.242508 0.025095

Komintasavalta

09-22-2021, 12:59 PM

Best if you have more then 100 000 for intercontinental comparisons. For small regional this 20 000 would be enough according to Admixture manual (I know it is not admixture here).

It's the reverse, and you need more SNPs for a local comparison and less for a global comparison (http://dalexander.github.io/admixture/admixture-manual.pdf):

As a rule of thumb, we have found that 10,000 markers suffice to perform GWAS correction for continentally separated populations (for example, African, Asian, and European pop- ulations FST > .05) while more like 100,000 markers are necessary when the populations are within a continent (Europe, for instance, FST < 0.01).

BTW I don't think 10,000 SNPs is enough for a global ADMIXTURE run, or at least it's not when all of the SNPs are removed because of LD pruning. In runs with about 20,000 or 30,000 SNPs, I get models where many samples get one or a few percent of components that are not actually related to the sample, and the effect becomes more pronounced with less than 20,000 SNPs. For example below in the ADMIXTURE run with 15,645 SNPs, I got 5% East Asian ancestry for Sardinians and Mozabites, but in the run with 68,154 SNPs, the results are more reasonable. In both runs, all SNPs were removed because of LD pruning.

https://i.ibb.co/xX6fhN2/admixture-snp-comparison.jpg

Anyway, the reason why I used such a low SNP count in this thread is that often CVXPY failed to solve even models with about 20,000 or 30,000 SNPs. Maybe it would be better to use a commercial solver, because the CVXPY tutorial says this: "If you need to solve a large mixed-integer problem quickly, or if you have a nonlinear mixed-integer model, then you will need to use a commercial solver such as CPLEX, GUROBI, XPRESS, or MOSEK." (https://www.cvxpy.org/tutorial/advanced/index.html#advanced)

Lucas

09-22-2021, 01:10 PM

Yes it's the reverse. I was too lazy to check in manual before writing:)
But I never created calculator runs with less then 60 000 snps. With 10 000 we can maybe create some K3 using only Yoruba, Han and Sardinian refs. But it is basically useless.
Now I create only such calculators which have more then 100 000. More snps, more detailed results,

Komintasavalta

09-22-2021, 01:46 PM

But I never created calculator runs with less then 60 000 snps. With 10 000 we can maybe create some K3 using only Yoruba, Han and Sardinian refs. But it is basically useless.
Now I create only such calculators which have more then 100 000. More snps, more detailed results,

With my SNP-level Vahaduo models, the results don't change that much between around 15,000 and 70,000 SNPs. But it would probably be better to use around 100,000 SNPs though.

https://i.ibb.co/zZmM4CC/a.jpg

Lucas

09-22-2021, 02:26 PM

With my SNP-level Vahaduo models, the results don't change that much between around 15,000 and 70,000 SNPs. But it would probably be better to use around 100,000 SNPs though.

https://i.ibb.co/zZmM4CC/a.jpg

I talked about Admixture related calcs. Maybe in your model is possible to attain similar level of details with lower snp number. Should be tested more. If I send you raw file you can check me? But it needs Polish, Belarusian, Lithuanian samples too, for me German is not needed rather.

Komintasavalta

09-22-2021, 03:37 PM

You can do the conversion to CSV much faster by using `--recode A` with PLINK (https://www.cog-genomics.org/plink/1.9/data):

$ x=test3.p;plink --bfile $x --recode A --out $x>$x.log
$ head -n5 $x.raw|cut -d' ' -f-10
FID IID PAT MAT SEX PHENOTYPE rs1240747_A rs3766176_C rs6694101_C rs201250599_G
49 Adg-194 0 0 1 1 1 1 0 0
54 Adg-238 0 0 1 1 2 0 1 0
65 ALT-741 0 0 1 1 0 1 0 0
67 ALT-744 0 0 1 1 0 0 0 0
$ sed 1d $x.raw|cut -d' ' -f2,7-|tr \ ,|head -n4|cut -d, -f-5
Adg-194,1,1,0,0
Adg-238,2,0,1,0
ALT-741,0,1,0,0
ALT-744,0,0,0,0

If I send you raw file you can check me? But it needs Polish, Belarusian, Lithuanian samples too, for me German is not needed rather.

Sure.

vbnetkhio

09-22-2021, 07:57 PM

Best if you have more then 100 000 for intercontinental comparisons. For small regional this 20 000 would be enough according to Admixture manual (I know it is not admixture here).

Isn't it the oposite - 20 000 is enough for an Africa/Europe/Asia run, and for something more detailed you need 100 000 or more.

Edit: ok , i see it was already answered.

vbnetkhio

09-22-2021, 08:05 PM

With my SNP-level Vahaduo models, the results don't change that much between around 15,000 and 70,000 SNPs. But it would probably be better to use around 100,000 SNPs though.

https://i.ibb.co/zZmM4CC/a.jpg

I think the modern samples from evolbio.ut.ee and hgdp (the original, not the version in reich) have more ancestry relevant SNPs than those from reich. They should have 150-200k after ld and maf, depending on the sample choice. Then you can model the ones from reich with them as sources.

Lucas

09-22-2021, 09:51 PM

Komintasavalta

09-23-2021, 09:49 AM

I think the modern samples from evolbio.ut.ee and hgdp (the original, not the version in reich) have more ancestry relevant SNPs than those from reich. They should have 150-200k after ld and maf, depending on the sample choice. Then you can model the ones from reich with them as sources.

How would that work, because I need the source and target samples to have the same set of SNPs... Neither Vahaduo or Michal's script knows how to deal with input that has NA/null values, even though there's probably some way to do the convex optimization while allowing NA values.

Yes,also good are world datasets from Busby and Lazaridis.

I tried making a model for Chuvashes using all samples in the Busby dataset (https://data.mendeley.com/datasets/ckz9mtgrjj/3). But I again got zero or a few percent of ancestry from many unrelated populations:

Chuvash (20.407):
13% Russian_North
7% CEU
7% Mordvin
6% Ukrainian
5% Polish
5% Selkup
4% Nganasan
4% German
3% Hungarian
3% Lithuanian
3% Belarusian
3% Bulgarian
3% Croatian
3% Romanian
2% Uygur
2% Kyrgyz
2% Chukchi
2% Altaian
2% Tuvan
2% Mongol_Mongolia
2% Norwegian
1% Irish
1% Koryak
1% Uzbek
1% Kumyk
1% Dolgan
1% Ket
1% Finnish
1% Yakut
1% Burusho
1% Kalash
1% Lezgin
1% Colombian
1% Tajik
1% Oroqen
0% Welsh
0% Nogai
0% North_Ossetian
0% Kanjar
0% Evenk
0% Yukaghir
0% Kurdish_Kazakhstan
0% English
0% Adyghe
0% German_or_Austrian
0% Maya
0% Papuan_Sepik
0% Lambadi
0% Meghawal
0% Kurumba
0% Pathan
0% Yoruba
0% Kshatriya_Uttar_Pradesh
0% Surui
0% Bantu_Pedi_South_Africa
0% Bengali
0% Bantu_Ovambo_Angola
0% French
0% Brahmin_Uttar_Pradesh
0% Balkar
0% Nasioi_Bougainville
0% Hadza
0% Kol
0% Papuan_Highlands_East

Next I tried doing a PCA of the populations and making models based on the first 20 dimensions, and I limited the models to a maximum of 2-8 populations. When I used an unscaled PCA, some models got a few percent of ancestry from unrelated populations like Sandawe or Papuan_Sepik:

Chuvash (.00334): 82% Mordvin + 18% Nganasan
Chuvash (.00290): 80% Mordvin + 18% Nganasan + 1% Sandawe
Chuvash (.00236): 56% Mordvin + 20% Nganasan + 18% Polish + 7% Tajik
Chuvash (.00206): 62% Mordvin + 20% Nganasan + 13% Polish + 4% Brahui + 1% Sandawe
Chuvash (.00179): 56% Mordvin + 15% Polish + 14% Nganasan + 8% Selkup + 5% Tajik + 1% Sandawe
Chuvash (.00152): 56% Mordvin + 15% Polish + 13% Nganasan + 11% Selkup + 4% Lezgin + 1% Sandawe + 1% Papuan_Sepik
Chuvash (.00157): 47% Mordvin + 19% Polish + 17% Nganasan + 6% Selkup + 5% Lezgin + 4% Finnish + 1% Sandawe + 1% Kalash

But when I used a scaled PCA, the results became more reasonable:

Chuvash (.00057): 72% Mordvin + 28% Selkup
Chuvash (.00056): 69% Mordvin + 28% Selkup + 3% Polish
Chuvash (.00039): 73% Mordvin + 16% Nganasan + 6% Polish + 5% Selkup
Chuvash (.00032): 55% Mordvin + 17% Polish + 14% Nganasan + 9% Selkup + 5% Balkar
Chuvash (.00032): 55% Mordvin + 17% Polish + 14% Nganasan + 9% Selkup + 5% Balkar
Chuvash (.00030): 55% Mordvin + 17% Polish + 9% Selkup + 9% Nganasan + 5% Yakut + 4% Balkar
Chuvash (.00027): 51% Mordvin + 20% Polish + 13% Nganasan + 7% Selkup + 3% Burusho + 3% Yakut + 3% Balkar + 1% Finnish

If I try to specify a maximum number of populations when I make models based on the CSV file for SNP-level data, it takes forever both in Vahaduo and with Michal's script.

Here's a SmartPCA run of all samples in the Busby dataset:

https://i.ibb.co/RQdXhtS/busby-pca-world.png

And here's just Eurasian samples. The Finnish population average seems unusually western compared to the Mordvin and North Russian samples. And it's not just some Chukchi samples but all Chukchi samples that look like mixed with Europeans. And then Chuvashes are at around the same point on PC1 as Nogais, because the Nogai population average includes samples with low Mongoloid ancestry, like the Nogai_Karachay_Cherkessia samples in 1240K+HO.

https://i.ibb.co/LzFXKXk/busby-eurasia-smartpca-pc12.png
https://i.ibb.co/QKHX6zJ/busby-eurasia-smartpca-pc34.png

There's so many South Asian samples in the Busby dataset that South Asians get their own component in a K=3 Eurasian ADMIXTURE run:

https://i.ibb.co/bvfH8QM/busby-admixture-eurasia-k3.jpg

Lucas

09-23-2021, 10:29 AM

Try with Lazaridis, there are two datasets 2014 and 2016. Interesting which is better.

Lemminkäinen

09-23-2021, 10:53 AM

Komintasavalta

The Finnish population average seems unusually western compared to the Mordvin and North Russian samples.

The Siberian in Finns is not exactly the same as in North Russians. On your PCA they are on X-axis between North Russians and Belarussians, due to the difference in Siberian admixture. The exception on Y-axis figures genetic drift. My explanation.

Imho, min-max on X-axis is distorted as to Europeans. Min-max is always distorted, but you can choose the way by selecting populations and sizes.

Komintasavalta

09-23-2021, 12:18 PM

The Siberian in Finns is not exactly the same as in North Russians. On your PCA they are on X-axis between North Russians and Belarussians, due to the difference in Siberian admixture. The exception on Y-axis figures genetic drift. My explanation.

Imho, min-max on X-axis is distorted as to Europeans. Min-max is always distorted, but you can choose the way by selecting populations and sizes.

Or maybe it's the Western European ancestry that pulled Finns west on PC1. I tried doing another SmartPCA run for just European samples, and again Finns plot lower on PC2 than North Russians and Mordvins. But I think it's because this run included relatively few samples with high Mongoloid ancestry, so PC2 differentiates between low-mong Western Europeans and low-mong Eastern Europeans, and Finns have more Germanic-like ancestry than Mordvins.

There's also something f*cked up with some Romanian and Bulgarian samples. And I don't know why some Polish samples plot further north than Lithuanian samples on PC1.

https://i.ibb.co/yy4X7Mz/busby-europe-smartpca.png

I used `--indep-pairwise 50 10 .2 --geno .01 --maf .03`, which kept about 90,000 SNPs. Maybe I did the QC wrong (https://data.mendeley.com/datasets/ckz9mtgrjj/3):

These genotypes were all generated on Illumina chips (550, 610, 660) for multiple different studies. The two main papers that this dataset was compiled for are: Hellenthal, et al 2014 A Genetic Atlas of Human Admixture History, Science; and Busby, et al 2015 The role of recent admixture in forming the contemporary West Eurasian genomic landscape, Current Biology.

The data are in PLINK format and the BusbyWorldwidePopulations.csv file outlines where the different datasets come from. Note that because these two datasets were combined together, not all populations are typed on the same set of SNPs. We have included genotype data on 523,443 SNPs, of which 441,038 are genotyped on at least 97.5% of individuals.

Therefore, additional QC steps are required to filter this set down to high quality calls, depending on the subset of samples that are required.

Lemminkäinen

09-23-2021, 12:40 PM

Or maybe it's the Western European ancestry that pulled Finns west on PC1. I tried doing another SmartPCA run for just European samples, and again Finns plot lower on PC2 than North Russians and Mordvins. But I think it's because this run included relatively few samples with high Mongoloid ancestry, so PC2 differentiates between low-mong Western Europeans and low-mong Eastern Europeans, and Finns have more Germanic-like ancestry than Mordvins.

There's also something f*cked up with some Romanian and Bulgarian samples. And I don't know why some Polish samples plot further north than Lithuanian samples on PC1.

https://i.ibb.co/yy4X7Mz/busby-europe-smartpca.png

I used `--indep-pairwise 50 10 .2 --geno .01 --maf .03`, which kept about 90,000 SNPs. Maybe I did the QC wrong (https://data.mendeley.com/datasets/ckz9mtgrjj/3):

These genotypes were all generated on Illumina chips (550, 610, 660) for multiple different studies. The two main papers that this dataset was compiled for are: Hellenthal, et al 2014 A Genetic Atlas of Human Admixture History, Science; and Busby, et al 2015 The role of recent admixture in forming the contemporary West Eurasian genomic landscape, Current Biology.

The data are in PLINK format and the BusbyWorldwidePopulations.csv file outlines where the different datasets come from. Note that because these two datasets were combined together, not all populations are typed on the same set of SNPs. We have included genotype data on 523,443 SNPs, of which 441,038 are genotyped on at least 97.5% of individuals.

Therefore, additional QC steps are required to filter this set down to high quality calls, depending on the subset of samples that are required.

Could be. Add Saamis, keeping Nganasans and East Asians, Chuvashes, Mordvas, North Russians and the European block without changes.

Lucas

09-23-2021, 12:46 PM

There's also something f*cked up with some Romanian and Bulgarian samples. And I don't know why some Polish samples plot further north than Lithuanian samples on PC1.

They are infamous "Estonian Poles", tested by Estonian Biocentre among their Polish minority, certainly mixed with Estonians. You can exclude those plotting to north.

Among Romanians are two Gypsy samples.
Not sure about Bulgarians, maybe some too. But I didn't test them thoroughly.

Zoro

09-23-2021, 01:11 PM

I used `--indep-pairwise 50 10 .2 --geno .01 --maf .03`, which kept about 90,000 SNPs. Maybe I did the QC wrong (https://data.mendeley.com/datasets/ckz9mtgrjj/3):
]

Nice looking plots!

Not a good idea to do —maf 0.03. Basically what you did was to throw out all rarer SNPs and leave the more common older SNPs with minor allele frequencies greater than 3% in your dataset.

In other words if you used 3000 samples then it means you threw out all alleles shared by 90 samples to the exclusion of the rest of samples. So if 3 populations of 30 samples each have unique alleles that set them apart from all the other populations, you just threw out those alleles!

I would do the opposite if you’re looking for more recent shared ancestry. I would throw out all the common alleles older than 10,000 years old by doing something like —max-maf 0.2 which gets rid of alleles shared by 2400 of your 3000 samples since they’re not as informative

You can also try —max-maf 0.25 or 0.3 if you’re not left with enough SNPs

Repost graphs. FYI the clustering will be more realistic but not as neat in terms of everyone in a population having the same recent ancestry. In other words you’ll find more variation within a population than G25 or some calculators have you believe

Just as there’s considerable phenotype variation in a population, in reality, outside of G25 or calculator Lalaland there’s considerable genotype variation in a population. Another thing to keep in mind is the SNPs picked by ancestry companies are optimized for intra-European variation. There are many SNPs unique to East Asians or Africans that are not genotyped. I think in 20 years your ancestry calculations will look different from now

Komintasavalta

09-23-2021, 01:43 PM

Could be. Add Saamis, keeping Nganasans and East Asians, Chuvashes, Mordvas, North Russians and the European block without changes.

I added some Siberian samples from Busby, and I added samples from Tambets et al. 2018 (https://evolbio.ut.ee/Tambets2018/). The Finns from Busby look more western or northern than the Finns from Tambets.

I wonder if Tatar1003 is a Crimean Tatar, because it plots close to Nogais. The sample buryat_V43501 from Tambets looks hapa.

https://i.ibb.co/FsQ1xqG/busby-tambets-smartpca.png

They are infamous "Estonian Poles", tested by Estonian Biocentre among their Polish minority, certainly mixed with Estonians. You can exclude those plotting to north.

Among Romanians are two Gypsy samples.
Not sure about Bulgarians, maybe some too. But I didn't test them thoroughly.

Yeah gypsies makes sense. But how do you know it's the Estonian Poles? The Polish samples are from Hellenthal et al. 2014 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4209567/), but I didn't find information about their geographic location anywhere, because it's an old paper with scanty supplementary information. In the plot above, the ID of the northernmost Polish sample is Polish2 and the second northernmost is POL079.

Komintasavalta

09-23-2021, 02:40 PM

Not a good idea to do —maf 0.03. Basically what you did was to throw out all rarer SNPs and leave the more common older SNPs with minor allele frequencies greater than 3% in your dataset.

In other words if you used 3000 samples then it means you threw out all alleles shared by 90 samples to the exclusion of the rest of samples. So if 3 populations of 30 samples each have unique alleles that set them apart from all the other populations, you just threw out those alleles!

I would do the opposite if you’re looking for more recent shared ancestry. I would throw out all the common alleles older than 10,000 years old by doing something like —max-maf 0.2 which gets rid of alleles shared by 2400 of your 3000 samples since they’re not as informative

You can also try —max-maf 0.25 or 0.3 if you’re not left with enough SNPs

Yeah I have no idea which `--maf` or `--max-maf` setting I should use, so maybe it's best to not use them at all. But `--maf` and `--max-maf` don't select the SNPs to remove based on the minor allele frequency relative to other samples in the dataset, but based on the absolute frequency. Here's the number of SNP removed in my set of European samples from Busby, out of a total of 523443:

`--max-maf .499`: 898
`--max-maf .49`: 9297
`--max-maf .45`: 46980
`--max-maf .25`: 243706
`--max-maf .1`: 421404
`--max-maf .05`: 480608
`--maf .001`: 7873
`--maf .01`: 18428
`--maf .05`: 42808
`--maf .25`: 243905
`--maf .4`: 429267
`--maf .45`: 476446

Another thing to keep in mind is the SNPs picked by ancestry companies are optimized for intra-European variation. There are many SNPs unique to East Asians or Africans that are not genotyped. I think in 20 years your ancestry calculations will look different from now

Is that also true of the Human Origins array, or is that why West Africans have such a low f2 distance to Khoisan and Central African pygmies and Hadza in 1240K+HO? (https://anthrogenica.com/showthread.php?24760-Is-Basal-Eurasian-real-or-an-fstats-artifact/page2&p=800315#post800315)

Edit: apparently the Human Origins array was designed to differentiate 11 modern popuations, including Mbuti and San, but maybe it still gives relatively little weight to SNPs that are specific to Capoids and Bambutids (https://www.thermofisher.com/document-connect/document-connect.html?url=https%3A%2F%2Fassets.thermofisher .com%2FTFS-Assets%2FLSG%2Fbrochures%2Faxiom_hu_origins_appnot e.pdf):

A total of 1.81 million candidate SNPs, all from genome locations covered by sequencing reads from Neanderthals, Denisovans, and chimpanzees, were ascertained using a simple SNP discovery procedure first described by Keinan, et al., 2007.[1] The most important ascertainment involved using whole-genome shotgun sequencing data to discover differences between the two chromosomes carried by individuals from 11 populations (San, Yoruba, Mbuti, French, Sardinian, Han, Cambodian, Mongolian, Karitiana, Papuan, and Bougainville).

A paper about a new SNP panel said that the 1240K panel overestimates the difference between Africans and non-Africans (https://arborbiosci.com/wp-content/uploads/2021/03/Skoglund_Ancestral_850K_Panel_Design.pdf):

We computed FST for 1) the whole genomes in the SGDP 2) the currently widely used 1240k panel, and for the new 850k Ancestral SNP panel. Differentiation between African and non-African populations is overestimated for the 1240k, but no similar bias is observed for the Ancestral SNP panel (Figure 2).

Lemminkäinen

09-23-2021, 02:42 PM

I added some Siberian samples from Busby, and I added samples from Tambets et al. 2018 (https://evolbio.ut.ee/Tambets2018/). The Finns from Busby look more western or northern than the Finns from Tambets.

I wonder if Tatar1003 is a Crimean Tatar, because it plots close to Nogais. The sample buryat_V43501 from Tambets looks hapa.

https://i.ibb.co/FsQ1xqG/busby-tambets-smartpca.png

Yeah gypsies makes sense. But how do you know it's the Estonian Poles? The Polish samples are from Hellenthal et al. 2014 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4209567/), but I didn't find information about their geographic location anywhere, because it's an old paper with scanty supplementary information. In the plot above, the ID of the northernmost Polish sample is Polish2 and the second northernmost is POL079.

You see, Chuvashes pull North Russians, Saamis pull Finns. Everything looks to be okay Saamis and Finns came around 2000 years ago and have common 1500 years old drift. If we have to search common ancestry with VUR ppl it has to be based on 1000 migrants who came here 2500-3000 years ago.

Zoro

09-23-2021, 03:34 PM

Yeah I have no idea which `--maf` or `--max-maf` setting I should use, so maybe it's best to not use them at all. But `--maf` and `--max-maf` don't select the SNPs to remove based on the minor allele frequency relative to other samples in the dataset, but based on the absolute frequency. Here's the number of SNP removed in my set of European samples from Busby, out of a total of 523443:

`--max-maf .499`: 898
`--max-maf .49`: 9297
`--max-maf .45`: 46980
`--max-maf .25`: 243706
`--max-maf .1`: 421404
`--max-maf .05`: 480608
`--maf .001`: 7873
`--maf .01`: 18428
`--maf .05`: 42808
`--maf .25`: 243905
`--maf .4`: 429267
`--maf .45`: 476446

Is that also true of the Human Origins array, or is that why West Africans have such a low f2 distance to Khoisan and Central African pygmies and Hadza in 1240K+HO? (https://anthrogenica.com/showthread.php?24760-Is-Basal-Eurasian-real-or-an-fstats-artifact/page2&p=800315#post800315)

Edit: apparently the Human Origins array was designed to differentiate 11 modern popuations, including Mbuti and San, but maybe it still gives relatively little weight to SNPs that are specific to Capoids and Bambutids (https://www.thermofisher.com/document-connect/document-connect.html?url=https%3A%2F%2Fassets.thermofisher .com%2FTFS-Assets%2FLSG%2Fbrochures%2Faxiom_hu_origins_appnot e.pdf):

A total of 1.81 million candidate SNPs, all from genome locations covered by sequencing reads from Neanderthals, Denisovans, and chimpanzees, were ascertained using a simple SNP discovery procedure first described by Keinan, et al., 2007.[1] The most important ascertainment involved using whole-genome shotgun sequencing data to discover differences between the two chromosomes carried by individuals from 11 populations (San, Yoruba, Mbuti, French, Sardinian, Han, Cambodian, Mongolian, Karitiana, Papuan, and Bougainville).

A paper about a new SNP panel said that the 1240K panel overestimates the difference between Africans and non-Africans (https://arborbiosci.com/wp-content/uploads/2021/03/Skoglund_Ancestral_850K_Panel_Design.pdf):

We computed FST for 1) the whole genomes in the SGDP 2) the currently widely used 1240k panel, and for the new 850k Ancestral SNP panel. Differentiation between African and non-African populations is overestimated for the 1240k, but no similar bias is observed for the Ancestral SNP panel (Figure 2).

you need to use —max-maf There’s no ambiguity trust me you can read the plink manual

Use .25 to filter out all alleles with frequency is greater than 25% in your data set and then re-post graph

Komintasavalta

09-23-2021, 07:43 PM

you need to use —max-maf There’s no ambiguity trust me you can read the plink manual

Use .25 to filter out all alleles with frequency is greater than 25% in your data set and then re-post graph

Adding `--max-maf .25` doesn't make that much difference, or at least not in the run of European samples from the Busby dataset:

https://i.ibb.co/ZGV1Svm/a.gif

Leto

09-23-2021, 08:00 PM

You see, Chuvashes pull North Russians, Saamis pull Finns. Everything looks to be okay Saamis and Finns came around 2000 years ago and have common 1500 years old drift. If we have to search common ancestry with VUR ppl it has to be based on 1000 migrants who came here 2500-3000 years ago.
So around 1000 BC they would have come from the Volga to the Eastern Baltic (first to Estonia, then to Southern Finland). Do I understand it correctly?

Dirdepo

09-23-2021, 08:03 PM

So around 1000 BC they would have come from the Volga to the Eastern Baltic (first to Estonia, then to Southern Finland). Do I understand it correctly?

You got it right Baltic_BA top shooter in this bitch Zlatan Ibrahimovic nanananannaananananannanananana Turkish music

vbnetkhio

09-23-2021, 09:41 PM

Adding `--max-maf .25` doesn't make that much difference, or at least not in the run of Eur

https://i.ibb.co/ZGV1Svm/a.gif

Could you test this:
-take a dataset
-split the samples of each population into 2 random halves
-run smartpca, use the first half to infer PCs, and project the other half
-compare them like this?

Lemminkäinen

09-23-2021, 09:49 PM

So around 1000 BC they would have come from the Volga to the Eastern Baltic (first to Estonia, then to Southern Finland). Do I understand it correctly?

At least what scientists say. In Estonia we see N1c1 around 500 BC. Finnish scientists assume that Saamis came to Finland at the same time. At the time Saamis were in Southeastern Finland they adapted loanwords from unknown and Germanic languages. It looks reasonable to think that Saami and Finnic diverged somewhere to east from Estonia around 1000 BC. Around 200-500 AD Finnic groups crossed the sea and settled Southwestern Finland.

Leto

09-23-2021, 10:30 PM

At least what scientists say. In Estonia we see N1c1 around 500 BC. Finnish scientists assume that Saamis came to Finland at the same time. At the time Saamis were in Southeastern Finland they adapted loanwords from unknown and Germanic languages. It looks reasonable to think that Saami and Finnic diverged somewhere to east from Estonia around 1000 BC. Around 200-500 AD Finnic groups crossed the sea and settled Southwestern Finland.
Interesting. I guess much of the Mongoloid in modern Finns is from the Sami because Estonians are apparently less Mong on average than Russians and even that may be from later, medieval migrations. All in all, both Finns and Estonians seem to be very Paleo-European genetically, which is ironic considering the popular jokes about the Finns being Asians.

Sandis

09-23-2021, 10:37 PM

At least what scientists say. In Estonia we see N1c1 around 500 BC. Finnish scientists assume that Saamis came to Finland at the same time. At the time Saamis were in Southeastern Finland they adapted loanwords from unknown and Germanic languages. It looks reasonable to think that Saami and Finnic diverged somewhere to east from Estonia around 1000 BC. Around 200-500 AD Finnic groups crossed the sea and settled Southwestern Finland.

I thought for a long time that N1c1 is younger in Baltics than scientists claimed previously and also that FU languages diverged later, not 4000 years ago or something like that.

Zoro

09-23-2021, 10:44 PM

Adding `--max-maf .25` doesn't make that much difference, or at least not in the run of European samples from the Busby dataset:

https://i.ibb.co/ZGV1Svm/a.gif

Looks good!

Can you try max-maf 0.2 with and without pruning

Lemminkäinen

09-23-2021, 11:07 PM

Interesting. I guess much of the Mongoloid in modern Finns is from the Sami because Estonians are apparently less Mong on average than Russians and even that may be from later, medieval migrations. All in all, both Finns and Estonians seem to be very Paleo-European genetically, which is ironic considering the popular jokes about the Finns being Asians.

It is likely from Saamis, who probably got it from Nenets or via contacts with people speaking unknown language which left words to Saami. The idea of tracking the Finnish Siberian admixture to the urheimat of Uralic languages is fascinating, but likely not true. I believe that it has been quite difficult to avoid Siberian admixture while living relatively northern or eastern location. Either one is true the Siberian admixture is a weak proof, but a big ship turns slowly and we have to wait long for all explaining theory of Baltic Finns.

Lemminkäinen

09-24-2021, 08:57 AM

I thought for a long time that N1c1 is younger in Baltics than scientists claimed previously and also that FU languages diverged later, not 4000 years ago or something like that.

Latest genetic studies claim that first N1c1 finds in Estonia is from the early Iron Age. A lovely theory could be that Baltic Finns, carrying N1c1, had powerful Iron weapons and they subjugated primitive Bronze Age R1a men in Estonia 🙂. Probably not receiving support from Slavic spheres.

They also beat us, underdeveloped Finnish aboriginals.

Komintasavalta

09-24-2021, 04:59 PM

Could you test this:
-take a dataset
-split the samples of each population into 2 random halves
-run smartpca, use the first half to infer PCs, and project the other half
-compare them like this?

That had a bigger impact than I expected. When I didn't project any samples, some Yemeni samples stretched PC2 much more than when I projected half of samples from each population. But I had disabled the automatic outlier removal feature of SmartPCA, which might have removed some of the Yemeni samples.

https://i.ibb.co/WGFPyVy/proj.png

vbnetkhio

09-24-2021, 05:52 PM

That had a bigger impact than I expected. When I didn't project any samples, some Yemeni samples stretched PC2 much more than when I projected half of samples from each population. But I had disabled the automatic outlier removal feature of SmartPCA, which might have removed some of the Yemeni samples.

https://i.ibb.co/WGFPyVy/proj.png

Thanks. Apparently the samples in G25 sheets are exclusively projected samples, because source and projected samples plot differently, and cannot be compared with each other.

But here on the second pca this effect can't be seen, all samples of the same population seem to cluster together.

Komintasavalta

09-24-2021, 06:26 PM

Thanks. Apparently the samples in G25 sheets are exclusively projected samples, because source and projected samples plot differently, and cannot be compared with each other.

But here on the second pca this effect can't be seen, all samples of the same population seem to cluster together.

Actually if you look at lower plot where half of the samples are projected, then for Mongoloid populations like Japanese or Hezhen or Daur or Oroqen, half of the samples plot further right on PC1 and half plot further left. Apparently it's the projected samples that plot further right and the reference samples that plot further left. The same effect exists for European populations, like for example Sardinians and Lithuanians, but it's harder to see because there's so many overlapping European populations.

The population label of each population is drawn at the center point of the samples from the population. In the lower plot where half of the samples are projected, if you look at populations at the right and left edges of the plot, there's a bigger horizontal distance between the population labels and the outermost samples.

Similarly if you look at the bottom edge of the lower plot, then populations like Mansi and Selkup are more spread out vertically than in the upper plot.

Ion Basescul

09-24-2021, 08:43 PM

Remove the 2 full Romas from the Romanian dataset...

Sandis

09-25-2021, 08:21 PM

Latest genetic studies claim that first N1c1 finds in Estonia is from the early Iron Age. A lovely theory could be that Baltic Finns, carrying N1c1, had powerful Iron weapons and they subjugated primitive Bronze Age R1a men in Estonia ��. Probably not receiving support from Slavic spheres.

They also beat us, underdeveloped Finnish aboriginals.

I read this useful study about transition to Iron Age in Estonia and arrival of Siberian ancestry.
Some estimate for split of Finnic languages is even 150 AD near Gulf of Finland.
We were told many years that FU languages arrived in Latvia with CC culture 4000 BC before IE migrations, but independent historians had different versions.
However we also had this Siberian influence probably at the same time period as in Estonia.
I hope that ship will start go faster.

Leto

09-26-2021, 01:19 AM

Komintasavalta

09-28-2021, 05:42 PM

When I try to make models using ancient source populations, I keep getting models like below where most populations have around 5% or more of every component, or for example here even Han has 5% ENF and Sardinian has 6% Mongoloid:

https://i.ibb.co/GkdPTnc/v.jpg

In the model above, each source population consisted of 80 samples. I only used `--indep-pairwise` but no `--geno`, which left me with about 140,000 SNPs. I then calculated average allele frequencies within populations so that I skipped NA values (similar to `colMeans(na.rm=T)` in R). After that about 3,000 SNPs still had an NA value, but I just removed them all.

x=ancient
printf %s\\n 'Steppe Russia_Samara_EBA_Yamnaya Russia_Steppe_Catacomb Russia_Afanasievo Russia_MBA_Poltavka Ukraine_Ozera_EBA_Yamnaya Ukraine_EBA_Yamnaya Ukraine_EBA_Yamnaya_published Kazakhstan_EBA_Yamnaya.SG Russia_Caucasus_EBA_Yamnaya Russia_Steppe_Eneolithic Russia_Khvalynsk_Eneolithic Russia_MLBA_Sintashta' 'Mongoloid China_NEastAsia_Coastal_EN China_SEastAsia_Island_EN China_Shimao_LN China_Upper_YR_LN China_WLR_MN China_YR_LN China_YR_MN Mongolia_North_N Russia_UstBelaya_Angara Nepal_Chokhopani_2700BP.SG Russia_DevilsCave_N.SG Russia_Shamanka_Eneolithic.SG Russia_UstBelaya_Angara.SG Russia_MN_Boisman Russia_Lokomotiv_Eneolithic.SG China_WLR_LN Russia_Siberia_UP' 'WHG_EHG Italy_Mesolithic.SG Italy_North_Villabruna_HG Latvia_HG Lithuania_EMN_Narva Luxembourg_Loschbour Romania_IronGates_Mesolithic Spain_HG Sweden_Motala_HG Russia_HG_Karelia Norway_N_HG.SG Luxembourg_Loschbour.DG England_Mesolithic_all.SG Ireland_Mesolithic.SG Latvia_HG.SG Romania_IronGates_Mesolithic.SG Spain_HG.SG Switzerland_Bichon.SG Norway_Mesolithic.SG Sweden_Mesolithic.SG Sweden_Motala_HG.SG Sweden_PWC.SG Russia_HG_Karelia.SG Russia_Sidelkino_HG.SG Estonia_MN_CCC_1 Estonia_MN_CCC_2 Estonia_EMN_Narva Serbia_IronGates_Mesolithic Ukraine_Mesolithic Russia_Popovo_HG Russia_EHG' 'ENF Turkey_N Germany_EN_LBK Hungary_EN_Koros Hungary_MN_LBK Israel_C Macedonia_N Portugal_LN_C Serbia_EN Spain_C Spain_EBA Spain_EN Spain_MLN'>$x.ref
tr ' ' \\n<$x.ref|cat - <(printf %s\\n Finnish Sardinian Udmurt Mansi Nganasan Han French Tajik Norwegian Lithuanian)|awk -F\\t 'NR==FNR{a[$0];next}$8 in a&&!b[$3]++&&++c[$8]<=80{print$2,$8}' - <(sort -t$'\t' -rnk15 v44.3_HO_public.anno)>$x.pop
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pop v44.3_HO_public.fam) --make-bed --out $x
plink --allow-no-sex --bfile $x --indep-pairwise 50 10 .1 --out $x;plink --allow-no-sex --bfile $x --extract $x.prune.in --make-bed --out $x.p
plink --allow-no-sex --bfile $x.p --recode A --out $x
awk 'NR==FNR{for(i=2;i<=NF;i++)a[$i]=$1;next}$2 in a{$2=a[$2]}1' $x.ref $x.pop|awk 'NR==FNR{a[$1]=$2;next}{print a[$2],$0}' - <(sed 1d $x.raw)|cut -d' ' -f1,8-|tr ' ' ,|gawk -F, '{for(i=2;i<=NF;i++)if($i!="NA"){a[$1][i]+=$i;n[$1][i]++}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS (n[i][j]?a[i][j]/n[i][j]:"NA");print o}}'>$x.alfa
awk -F, 'NR==FNR{for(i=2;i<=NF;i++)if($i=="NA")a[i];next}{printf"%s",$1;for(i=2;i<=NF;i++)if(!(i in a))printf",%s",$i;print""}' $x.alfa{,}>$x.nona

BTW in order to calculate the distance fit of the models so that it is not affected by the number of SNPs, you can use the f2 formula (https://uqrmaie1.github.io/admixtools/articles/admixtools.html#f-statistics-basics-1), where instead of taking the square root of the summed square of differences, you divide it by the total number of SNPs:

https://i.ibb.co/vwZXn3m/f2.png

You can convert an Euclidean distance to an f2-style distance by squaring it and then dividing it by the number of SNPs. For example in the image above, the model for Finns had a distance of 66.32116041, and I had 140,522 SNPs, so the f2-style distance would be 66.32116041^2/140522 ≈ 0.031.

Or with Michal's script, you can replace this:

residual_norm=cp.norm(M@x-b,p=2).value

With this:

residual_norm=cp.norm(M@x-b,p=2).value**2/b.size

Komintasavalta

09-28-2021, 11:52 PM

In my previous post, I made a CSV file for the average allele frequencies of about 140,000 SNPs. The file contained 4 ancient populations that each had 80 samples, and it also contained a few modern populations with less samples. Because my CSV file has zero SNPs with missing data, you can use it to do a PCA with R's `prcomp` function instead of SmartPCA, and you can combine ancient and modern populations without needing to project ancient samples on modern references:

https://i.ibb.co/cyH0hmN/a.jpg

Another way you can do a scatterplot where you use the average allele frequencies of a large number of ancient samples is to do an MDS plot of an f2 matrix, where you just set the names in the first field of the `fam` file to the names of the ancient reference populations:

f2c()(Rscript --no-init-file -e 'library(admixtools);p=commandArgs(T)[1];f=f2(p,unique_only=F);df=as.data.frame(f);write.c sv(round(xtabs(df[,3]~df[,2]+df[,1]),6),paste0(p,".f2"),quote=F)' "$@")
x=ancient;awk 'NR==FNR{for(i=2;i<=NF;i++)a[$i]=$1;next}$2 in a{$2=a[$2]}1' $x.ref $x.pop|awk 'NR==FNR{a[$1]=$2;next}{$1=a[$2]}1' - $x.p.fam>temp;mv temp $x.p.fam;f2c $x.p

Then make an MDS plot of the f2 matrix using `cmdscale(as.dist(f2matrix))`:

https://i.ibb.co/z7t5kpQ/b.jpg

library(tidyverse)
library(ggforce) # for geom_mark_hull
library(colorspace) # for hex

t=read.csv("ancient.p.f2",row.names=1,check.names=F)

mds=cmdscale(as.dist(t),20,eig=T)
p=as.data.frame(mds$points)
eig=head(mds$eig,ncol(p))
pct=paste0("MDS dimension ",1:ncol(p)," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

ranges=diff(apply(p,2,range))[1,]
p$k=as.factor(cutree(hclust(as.dist(t)),8))

pal1=hex(HSV(seq(0,360,length.out=nlevels(p$k)+1)%>%head(-1),.55,1))
pal2=hex(HSV(seq(0,360,length.out=nlevels(p$k)+1)%>%head(-1),.25,1))

for(i in c(1,3)){
seg=lapply(1:3,function(j)apply(t,1,function(x)unl ist(p[names(sort(x)[j]),c(i,i+1)],use.names=F))%>%t%>%cbind(p[,c(i,i+1)]))%>%do.call(rbind,.)%>%setNames(paste0("V",1:4))

p1=sym(paste0("V",i))
p2=sym(paste0("V",i+1))

maxrange=max(ranges[c(i,i+1)])
lims=sapply(c(i,i+1),function(x)mean(range(t[,x]))+c(-.52*maxrange,.52*maxrange))

ggplot(p,aes(!!p1,!!p2))+
geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V 4),color="gray80",size=.15)+
geom_mark_hull(aes(group=k),color=pal1[p$k],fill=pal1[p$k],concavity=100,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.15)+
geom_point(aes(x=!!p1,y=!!p2),color=pal1[p$k],size=.3)+
geom_text(aes(x=!!p1,y=!!p2),label=rownames(p),col or=pal2[p$k],size=2,vjust=-.7)+
labs(x=pct[i],y=pct[i+1])+
scale_x_continuous(breaks=seq(-1,1,.01),expand=expansion(.1))+
scale_y_continuous(breaks=seq(-1,1,.01))+
theme(
axis.text.y=element_text(angle=90,vjust=1,hjust=.5 ),
axis.text=element_text(color="gray90",size=6),
axis.ticks.length=unit(0,"pt"),
axis.ticks=element_blank(),
axis.title=element_text(color="gray90",size=8),
legend.position="none",
panel.background=element_rect(fill="gray40"),
panel.border=element_rect(color="gray30",fill=NA,size=.4),
panel.grid.major=element_line(color="gray30",size=.2),
panel.grid.minor=element_blank(),
plot.background=element_rect(fill="gray40",color=NA),
plot.title=element_text(size=10,color="gray90")
)

ggsave(paste0(i,".png"),width=4,height=4)
}

I think you get more accurate f2 distances between ancient and modern populations when you use ancient populations that consist of a large number of samples:

$ rm -rf /tmp/tmp;mkdir /tmp/tmp;for p in Steppe ENF Mongoloid WHG_EHG;do awk -F, 'NR==1{for(i=2;i<=NF;i++)if($i==x)break;next}$1!=x{print$i,$1}' x=$p ancient2.p.f2|sort -n|awk '{$1=sprintf("%.4f",$1)}1'|printf %s\\n "f2 distance to $p:" "$(cat)">/tmp/tmp/$p;done;paste /tmp/tmp/*|column -s$'\t' -t
f2 distance to ENF: f2 distance to Mongoloid: f2 distance to Steppe: f2 distance to WHG_EHG:
0.0017 Sardinian 0.0012 Mongol 0.0022 Mordovian 0.0059 Lithuanian
0.0018 Italian_North 0.0017 Evenk_FarEast 0.0022 English 0.0063 Mordovian
0.0019 Italian_South 0.0018 Japanese 0.0023 Hungarian 0.0064 Finnish
0.0023 French 0.0019 Han 0.0023 Norwegian 0.0066 Norwegian
0.0027 Hungarian 0.0024 Ulchi 0.0024 Tajik 0.0067 English
0.0027 Basque 0.0026 Kazakh 0.0025 French 0.0067 Hungarian
0.0027 English 0.0029 Yakut 0.0025 Lithuanian 0.0069 French
0.0030 Armenian 0.0032 Khakass 0.0025 Finnish 0.0075 Chuvash
0.0033 Norwegian 0.0036 Yukagir_Tundra 0.0029 Chuvash 0.0076 Steppe
0.0034 Palestinian 0.0037 Nogai_Astrakhan 0.0031 Italian_North 0.0078 Basque
0.0039 Iranian 0.0050 Tatar_Siberian 0.0032 Bashkir 0.0082 Italian_North
0.0039 Mordovian 0.0060 Enets 0.0033 Iranian 0.0085 Bashkir
0.0041 Lithuanian 0.0062 Selkup 0.0035 Italian_South 0.0091 Udmurt
0.0042 Finnish 0.0067 Ket 0.0037 Armenian 0.0094 Italian_South
0.0048 Kurd 0.0069 Mansi 0.0038 Udmurt 0.0095 ENF
0.0049 Tajik 0.0071 Bashkir 0.0041 Basque 0.0095 Tajik
0.0052 Chuvash 0.0071 Nganasan 0.0044 Kurd 0.0097 Sardinian
0.0056 Steppe 0.0096 Chuvash 0.0045 Tatar_Siberian 0.0101 Tatar_Siberian
0.0057 Bashkir 0.0098 Udmurt 0.0050 Palestinian 0.0109 Iranian
0.0069 Udmurt 0.0099 Tajik 0.0055 Sardinian 0.0112 Armenian
0.0072 Tatar_Siberian 0.0127 Mordovian 0.0056 ENF 0.0113 Nogai_Astrakhan
0.0078 Nogai_Astrakhan 0.0129 Iranian 0.0056 Nogai_Astrakhan 0.0118 Palestinian
0.0093 Kazakh 0.0132 Finnish 0.0069 Kazakh 0.0120 Kurd
0.0095 WHG_EHG 0.0140 Hungarian 0.0069 Mansi 0.0124 Mansi
0.0109 Mansi 0.0144 Armenian 0.0076 WHG_EHG 0.0127 Kazakh
0.0111 Khakass 0.0145 Italian_South 0.0080 Khakass 0.0138 Khakass
0.0137 Selkup 0.0145 Steppe 0.0098 Selkup 0.0153 Selkup
0.0138 Evenk_FarEast 0.0146 French 0.0111 Ket 0.0167 Ket
0.0141 Mongol 0.0146 Kurd 0.0115 Evenk_FarEast 0.0167 Evenk_FarEast
0.0152 Ket 0.0146 Norwegian 0.0115 Mongol 0.0173 Mongol
0.0160 Enets 0.0147 Palestinian 0.0125 Enets 0.0179 Enets
0.0166 Yakut 0.0147 Italian_North 0.0139 Yakut 0.0196 Yakut
0.0170 Mongoloid 0.0147 English 0.0145 Mongoloid 0.0202 Mongoloid
0.0182 Han 0.0148 Lithuanian 0.0155 Yukagir_Tundra 0.0212 Yukagir_Tundra
0.0185 Japanese 0.0160 Basque 0.0159 Han 0.0215 Han
0.0188 Yukagir_Tundra 0.0166 Sardinian 0.0162 Japanese 0.0218 Japanese
0.0194 Ulchi 0.0170 ENF 0.0167 Ulchi 0.0223 Ulchi
0.0223 Nganasan 0.0202 WHG_EHG 0.0189 Nganasan 0.0246 Nganasan

The WHG_EHG population consists of a mixture of WHGs, WHG-EHG mixes, and EHGs. Even though it's not pure WHG, it's still so far from modern Europeans that even Udmurts are almost as far from it as from Han:

https://i.ibb.co/qjNXKCf/xy.jpg

Below I loaded the PLINK dataset where each ancient population had 80 samples into ADMIXTOOLS 2. Even though I used `maxmiss=0` and many of the ancient samples had low coverage, only about 4,000 SNPs got discarded, because only a few SNPs were missing for every sample in a population:

ℹ Reading allele frequencies from PLINK files...
ℹ ancient.p.geno has 815 samples and 142889 SNPs
ℹ Calculating allele frequencies from 815 samples in 14 populations
ℹ Expected size of allele frequency data: 64 MB
142k SNPs read...
✔ 142889 SNPs read in total
! 139024 SNPs remain after filtering. 139024 are polymorphic.
ℹ Allele frequency matrix for 139024 SNPs and 14 populations is 26 MB
ℹ Computing pairwise f2 for all SNPs and population pairs requires 716 MB RAM without splitting
ℹ Computing without splitting since 716 < 8000 (maxmem)...
ℹ Returning f2 blocks

Sandis

09-29-2021, 12:38 AM

WHG and EHG probably would be considered different race if lived today as they are so far from everyone.

Komintasavalta

09-29-2021, 01:26 AM

WHG and EHG probably would be considered different race if lived today as they are so far from everyone.

Yeah, I'm still not sure if I calculated the distances correctly. But I think I now found the trick to calculate f2 distances between ancient and modern populations in ADMIXTOOLS 2, which is to use the first field of the `fam` file to combine samples from different ancient populations into big reference populations.

I now did another run where I separated WHG from EHG. I also included a few Norwegian and Ukrainian HGs in the EHG population, because otherwise it wouldn't have had enough samples. (NHG refers to Norwegian HGs that are about 3/4 EHG and 1/4 WHG, like Steigen, Hum1, and Hum2, but some Ukrainian HGs have similar ancestry, like UKR_N:I5875.)

https://i.ibb.co/3MCnGyX/1.png

Now even the EHG_NHG population is closer to modern Europeans than to the WHG population. And also it's closer to Bashkirs than to the French, and it's almost as close to Selkups as to ENF.

f2 distance to EHG_NHG:
0.00564 Steppe
0.00595 Russian
0.00612 Mordovian
0.00687 Hungarian
0.00706 Bashkir
0.00743 French
0.00749 WHG
0.00821 Spanish
0.00827 Tajik
0.00872 Italian_North
0.00887 Greek
0.00900 Basque
0.00917 Uzbek
0.00945 Adygei
0.00994 Turkish
0.01015 Burusho
0.01025 Iranian
0.01033 Balochi
0.01069 Makrani
0.01082 Brahui
0.01106 Georgian
0.01133 Sardinian
0.01178 Palestinian
0.01188 Druze
0.01211 BedouinA
0.01248 Tubalar
0.01256 ENF
0.01328 Selkup
0.01394 Altaian
0.01418 Yemeni_Highlands
0.01429 Yemeni_Northwest
0.01625 Tuvinian
0.01709 Mongol
0.01760 Buryat
0.01896 Yakut
0.02020 Mongoloid
0.02032 Tibetan
0.02164 Qiang
0.02164 Han
0.02172 Japanese
0.02176 Chukchi
0.02206 Ulchi
0.02228 Dong
0.02420 Nganasan

Komintasavalta

09-29-2021, 02:38 PM

I tried doing a supervised ADMIXTURE run using the dataset in my previous post, but there were so few EHG samples relative to Siberian samples that the EHG component ended up accounting for Siberian ancestry, and for example Nganasans got 100% of the EHG component. And also the steppe component ended up accounting for South Asian and MENA ancestry. (In supervised ADMIXTURE, you can select which samples are forced to get 100% of each component, but other samples are still allowed to affect the SNP loadings of the same components.)

Next I tried doing a supervised and projected ADMIXTURE run, where I first did a supervised run just for my five ancient populations, and then I projected modern samples onto the run. But it produced similar results as my SNP-level Vahaduo models, where most populations got at least 5% of every component.

In the third image, I just did a regular unsupervised ADMIXTURE run with no projection. Now the results seem more reasonable, but it can give the false impression that the components are defined by the ancient populations, even though in the run below there were only 100 ancient samples but 8 times as many modern samples.

https://i.ibb.co/6sJndCV/admixfail.jpghttps://i.ibb.co/DRv1Ybv/admixfail2.jpg

I don't know if Frappe or some other software would work better than ADMIXTURE. Or maybe there's some better algorithm we could use to create SNP-level models with Michal's CVXPY script. I don't want to use qpAdm, because the results of qpAdm are influenced too heavily by the choice of outgroups.

vbnetkhio

10-06-2021, 06:48 PM

Yes,also good are world datasets from Busby and Lazaridis.

are they available in plink format for download?

Lucas

10-07-2021, 02:08 PM

are they available in plink format for download?

Lazaridis is on Estonian Biocentre maybe? Busby on Mendeley.

Lucas

10-07-2021, 02:11 PM

are they available in plink format for download?

Lazaridis is on Estonian Biocentre maybe? Busby on Mendeley.

vbnetkhio

10-07-2021, 02:48 PM

Lazaridis is on Estonian Biocentre maybe? Busby on Mendeley.

Busby is mostly African samples?

XenophobicPrussian

10-07-2021, 03:45 PM

I know the position of the Malawi Yao/Chewa samples PCA position can be caused by very rapid recent drift or just a problem with the PCA, but still. Looking up pictures of the Yao/Chewa, they look far more uniform, homogenous, and more stereotypical "SSA" than the Yoruba. They are also much more uniform in skin colour with almost no variation, while the Yoruba show a good amount of variation.

They also seem to have no obvious Khoisan features, like say, the Xhosa Bantu group in South Africa, who are usually the most outlying African population(Khoisan, not Xhosa) on PCAs. I haven't seen any other PCAs or papers show these PCA positions though(maybe they haven't even been on any PCAs previously?), but perhaps it's time to rethink whether or not Yoruba are actually the purest representation of SSA outside of Pygmies/Khoisan?

Sandis

10-07-2021, 10:41 PM

vbnetkhio

10-08-2021, 12:06 PM

Lazaridis is on Estonian Biocentre maybe? Busby on Mendeley.

from which study is the Lazaridis dataset exactly? there are many Lazaridis studies.

Komintasavalta

10-10-2021, 02:15 AM

from which study is the Lazaridis dataset exactly? there are many Lazaridis studies.

The datasets from Lazaridis et al. 2014 and 2016 can be downloaded from here: https://reich.hms.harvard.edu/datasets. 1240K+HO has a sample with the same ID for all 2068 samples in the Lazaridis 2016 dataset, and for 1945 out of 1963 of the samples in the Lazaridis 2014 dataset (but the remaining 18 samples consist of some standard ancient samples, a chimp reference genome, and 3 Australian samples that have a different ID). Almost all of the samples in Lazaridis 2016 are also in Lazaridis 2014.

$ curl https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.anno -Lso ho.anno
$ wc -l NearEastPublic/HumanOriginsPublic2068.ind # Lazaridis 2016
2068
$ awk '{print$1}' NearEastPublic/HumanOriginsPublic2068.ind|awk -F\\t 'NR==FNR{a[$0];next}$2 in a{print$4}' - ho.anno|sort|uniq -c|sort
5 PickrellNatureCommunications2012
127 LazaridisNature2016
881 PattersonGenetics2012
1055 LazaridisNature2014
$ wc -l EuropeFullyPublic/data.ind # Lazaridis 2014
1963
$ awk '{print$1}' EuropeFullyPublic/data.ind|awk -F\\t 'NR==FNR{a[$0];next}$2 in a{print$4}' - ho.anno|sort|uniq -c|sort
1 Genome
5 PickrellNatureCommunications2012
887 PattersonGenetics2012
1052 LazaridisNature2014
$ awk 'NR==FNR{a[$1];next}!($1 in a)' EuropeFullyPublic/data.ind NearEastPublic/HumanOriginsPublic2068.ind|wc -l
130