Iron Age Estonians analysed using unsupervised Admixture [Archive] - The Apricity Forum: A European Cultural Community

Lemminkäinen

02-16-2021, 10:46 AM

Here we go: the Iron Age Estonians are the samples with cryptic names starting with X, V and O/0. All IA Estonian samples are from an Estonian study suggesting an eastern origin of the Baltic Finns. Note that the Siberian reference is of real Siberian origin, from Siberia beyond the Ural Mountains, not the Kola Saami-related Nganasans (a Samoyedic group related to the Nenets).

https://4.bp.blogspot.com/-2eFZ6ZCiy8U/XOBf_riBO6I/AAAAAAAABi4/ZK5sCblxwokFWX7SdkV-fpe-Q6F_RvTsgCLcBGAs/s1600/rplot3.gif

0LS10 was classified as an outlier among IA Estonian samples, based on isotope analyses.

VI14, VI15 and VI16 are errors; the study didn't include such samples. I got them along with other Estonian samples from the distributed database. They were probably excluded due to low quality.

Ouch, I got those three from the Estonian Biocentre's list of samples excluded from the study:

https://evolbio.ut.ee/Saag_2019/EasternBalticBAIAMA_1.2M.fam

I'll check them if I have spare time.

Lemminkäinen

02-17-2021, 09:06 AM

A few more words. My Admixture analysis showed the same thing as the original study. Only ONE (0LS10) of all the Late Bronze Age and Iron Age Estonians showed relevant Siberian admixture, and the study showed that this one had a foreign origin. How, then, can the scientific world keep insisting that Siberian admixture is a fingerprint of Uralic languages while also insisting that Uralic languages came to the Baltic Sea region in the early Iron Age? The theory is that they came first to Estonia and around 300 AD continued to Southwestern Finland.

Another observation: using the Saami-related Nenets group and Admixture, Estonians get 1-2% Siberian and Balts 0%. Using a process similar to what we know from qpAdm, Haak et al. 2015 got notable Siberian for all Northern Europeans. Now, using Admixture and Evenks, I get a bit below 10% for the Balts, which seems to be in line with Haak.

I wonder why the original study didn't include those three samples. Two of them seem to have a Baltic CWC/BA-Siberian mixture, one of them also farmer admixture. These two look like different cases of real Siberians with Baltic admixture. The third one, along with some other study samples, was dropped from my test due to a low SNP rate. The Estonian Biocentre, which publishes the study samples, is an Estonian organisation under the University of Tartu.

vbnetkhio

02-17-2021, 10:04 AM

well yes, Bronze age Estonians were purely Balto-Slavic, and the Iron age are like 90% Bronze age + 10% of some Komi-like influence (which isn't purely East Asian ofc)

https://eurogenes.blogspot.com/2019/05/uralic-specific-genome-wide-ancestry.html

Komintasavalta

02-17-2021, 11:19 AM

I calculated population averages from the datasheet in that post and I used them to do hierarchical K-means clustering. Bronze Age Estonians formed a cluster with modern Latvians and Lithuanians, but Iron Age Estonians formed a cluster with modern Estonians and central-north Russians. (Kostroma is northeast of Moscow and Tver is northwest of Moscow. Kostroma is on the northern side of the North Russian dialect border but Tver is on the southern side.)

curl -Ls "drive.google.com/open?id=1dqhpDxjgPNTObUx50WQ2tSNu2YKcI0vp">East_Baltic_BA-IA_transition
tr -d \\r< East_Baltic_BA-IA_transition|cut -f3-|sed 1,2d|tr \\t ,|sed s/,\$//>esto
tav(){ awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS a[i][j]/n[i];print o}}' "FS=${1-$'\t'}";}
awk -F: '{print$1","$0}' esto|cut -d, -f1,3-|tav ,>estoave
R -e 'library(factoextra);t<-read.csv("estoave",header=F,row.names=1);for(n in 2:30){k<-hkmeans(t,n);
p<-fviz_cluster(k,stand=F,shape=1,show.clust.cent=F,labelsize=8,main=paste("Hierarchical K-means:",n,"clusters"))+theme(legend.position="none");ggsave(sprintf("esto%02d.png",n),p)}'

https://i.imgur.com/UMkLdom.png
https://i.imgur.com/4tdHlAc.png

$ cat ~/bin/eud
#!/usr/bin/env ruby -roptparse

opt={}
OptionParser.new{|x|
x.on("-m NUM",Integer){|y|opt[:m]=y}
x.on("-f NUM",Integer){|y|opt[:f]=y}
}.parse!

a=IO.readlines(ARGV[0]).map{|l|x,*y=l.chomp.split(",");[x,y.map(&:to_f)]}

puts IO.readlines(ARGV[1]).map{|l|
x,*y=l.chomp.split(",")
y.map!(&:to_f)
d=a.reject{|z|z[0]==x}.map{|z|[z[1].map.with_index{|v,i|(v-y[i])**2}.sum**0.5,z[0]]}.sort_by(&:first)
d=d.take(opt[:m])if opt[:m]
"Distance to: #{x}\n"+d.map{|x,y|("%.#{opt[:f]||3}f"%x).sub(/^0/,"")+" "+y}*"\n"
}*"\n\n"

$ eud -m30 esto <(grep 'EST_[BI]A' estoave)
Distance to: Baltic_EST_BA
.009 Baltic_LVA_BA:Kivutkalns215
.012 Baltic_EST_BA:V9_2
.012 Baltic_LVA_BA:Kivutkalns153
.012 Baltic_EST_BA:V14_2
.013 Baltic_LVA_BA:Kivutkalns194
.013 Baltic_EST_BA:V16_1
.013 Baltic_LVA_BA:Kivutkalns207
.013 Baltic_LVA_BA:Kivutkalns19
.014 Baltic_LVA_BA:Kivutkalns222
.015 Baltic_EST_BA:X17_2
.015 Baltic_LVA_BA:Kivutkalns25
.015 Baltic_LVA_BA:Kivutkalns209
.016 Baltic_EST_BA:X14_1
.017 Baltic_EST_BA:X08_1
.019 Latvian:latvian58C6
.020 Baltic_LVA_BA:Kivutkalns42
.020 Latvian:latvian54H7
.020 Baltic_LVA_BA:Kivutkalns164
.021 Latvian:latvian54A2
.021 Baltic_EST_BA:0LS11_1
.022 Baltic_EST_BA:X11_1
.022 Lithuanian:lithuania2
.022 Baltic_EST_BA:X10_1
.023 Lithuanian:lithuania7
.023 Baltic_EST_BA:X15_2
.024 Russian_Orel:RussianOrjol45
.024 Latvian:latvian58C8
.024 Baltic_EST_IA:X04_1
.024 Baltic_EST_IA:X04_1
.025 Latvian:latvian22J5

Distance to: Baltic_EST_IA
.010 Baltic_EST_IA:VII4_1
.010 Baltic_EST_IA:VII4_1
.012 Estonian:ee53
.012 Estonian:ee105
.013 Baltic_EST_IA:0LS10_1
.013 Baltic_EST_IA:0LS10_1
.014 Estonian:ee86
.015 Estonian:ee114
.015 Estonian:ee68
.016 Baltic_EST_IA:X04_1
.016 Baltic_EST_IA:X04_1
.016 Estonian:ee45
.016 Baltic_EST_IA:V12_1
.016 Baltic_EST_IA:V12_1
.017 Estonian:ee140
.017 Vepsian:vepsa19
.017 Baltic_EST_IA:V11_1
.017 Baltic_EST_IA:V11_1
.017 Estonian:ee66
.018 Latvian:latvian54H7
.019 Estonian:ee1
.019 Estonian:ee144
.019 Baltic_LTU_BA:Turlojiske3
.019 Russian_Tver:Russia_Tver418
.019 Lithuanian:lithuania8
.019 Estonian:ee72
.019 Estonian:ee147
.019 Estonian:ee111
.020 Russian_Voronez:russianVoron101
.020 Estonian:ee108

Finns, Karelians, Vepsians, and Komi have a high level of PC2 and a low level of PC3. Compared to Iron Age Estonians, Bronze Age Estonians have a lower level of PC1 and PC2 and a higher level of PC3.

cat <(printf %s\\n '' PC{1..10}|paste -sd, -) estoave>estoaveh
R -e 'library(RColorBrewer);library(pheatmap);pheatmap(read.csv("estoaveh",header=T,row.names=1,check.names=F),filename="output.png",cluster_cols=F,
cellwidth=12,cellheight=12,fontsize=8,breaks=seq(-.05,.05,(.1/256)),rev(colorRampPalette(brewer.pal(7,"RdBu"))(256)))'

https://i.ibb.co/V2mYG1R/heat.png

Lemminkäinen

02-17-2021, 12:06 PM

well yes, Bronze age Estonians were purely Balto-Slavic, and the Iron age are like 90% Bronze age + 10% of some Komi-like influence (which isn't purely East Asian ofc)

https://eurogenes.blogspot.com/2019/05/uralic-specific-genome-wide-ancestry.html

It looks like early Iron Age Estonians (before 0 AD) were still quite a heterogeneous people, and newcomers met the old Baltic people, which is to be expected. My test had some weaknesses, mainly because the CWC group is too homogeneous and I need more samples. This led to a unique CWC portion. Unsupervised Admixture is easily thrown off by homogeneous sample groups. Despite this caveat, there is something worth noticing: the Middle Age samples show western similarity (II* samples, Karja, Otepää). Another observation I already mentioned is that the Siberian component is notable only in modern samples, excluding those two exceptions. It looks like the Siberian wave hit Estonians from the north, not from the east, and this route contradicts the scientific view of the direction from which the FU languages spread.

Lemminkäinen

02-17-2021, 12:17 PM

I don't share Polako's opinion about the connection between the Estonian IA Y-DNA and autosomal data. The study is right about it, but its conclusions were too far-fetched, although right in principle.

RicoSuave

02-19-2021, 05:25 AM

Baltic_EST_BA is good to use; as vbnetkhio said, they are nearly 100% European.

Komintasavalta

04-01-2021, 03:16 PM

There's also individual-level ADMIXTURE results in figure S1A of Saag et al. 2019: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6544527/#SMtitle.

https://i.ibb.co/0tbhc2h/saag-2019-s1-admixture.png

I converted the figure into a high-resolution image like this:

wget https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6544527/bin/EMS82666-supplement-Supplementary_Information.pdf
convert -density 2000 desktop/EMS82666-supplement-Supplementary_Information.pdf'[0]' a.png

Lemminkäinen

04-01-2021, 06:37 PM

https://4.bp.blogspot.com/-2eFZ6ZCiy8U/XOBf_riBO6I/AAAAAAAABi4/ZK5sCblxwokFWX7SdkV-fpe-Q6F_RvTsgCLcBGAs/s1600/rplot3.gif

My result from the MEC calculator is very close to the Finnish results in this admixture.

Target: Mm
Distance: 1.3428% / 1.34277682
44.3 Steppe_EMBA
35.5 Europe_EN
14.4 VillabrunaWHG
5.8 Nganasan

Alt

Target: Mm
Distance: 1.2721% / 1.27205809
43.5 Steppe_EMBA
32.5 Anatolia_N
18.2 VillabrunaWHG
5.8 Nganasan

Petalpusher

04-01-2021, 07:54 PM

I noticed that too. I think my calc works better for central-north and north Euros (or south Europe has too complicated a genesis)

Btw what are those two samples x10 and v10 showing only EHG and CW?

Lemminkäinen

04-01-2021, 08:20 PM

There's also individual-level ADMIXTURE results in figure S1A of Saag et al. 2019: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6544527/#SMtitle.

https://i.ibb.co/0tbhc2h/saag-2019-s1-admixture.png

I converted the figure into a high-resolution image like this:

wget https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6544527/bin/EMS82666-supplement-Supplementary_Information.pdf
convert -density 2000 desktop/EMS82666-supplement-Supplementary_Information.pdf'[0]' a.png

WHG seems to be overrepresented in the Baltic region, but I have seen this before. Some estimates give nearly 90% WHG in Latvia, Lithuania and Estonia and other estimates 60%. Baltic BA has more WHG than Baltic CCC. It looks like WHG's revenge after the Baltic CWC. Hard to believe, because I don't see a parallel growth of haplogroup I*; actually R1a is the dominant haplogroup through the BA.

Komintasavalta

04-01-2021, 08:47 PM

The blue component isn't really WHG, because its proportion is almost as large in modern Latvians as in WHGs. I don't know how it's possible to do a projected ADMIXTURE run, but they somehow projected ancient individuals on modern data:

We performed ADMIXTURE analysis by projecting aDNA data on world-wide EBC-chipDB modern data (Figure S1C–D, Table S3) and present results at K=9 (Figure 1B, Figure S1A–B, Methods). EstBA individuals are clearly distinguishable from Estonian CWC individuals as the former have more of the blue component most frequent in WHGs and less of the brown and yellow components maximized in Caucasus hunter-gatherers and modern Khanty, respectively.

It's also interesting how the proportion of the yellow component that is maximal in Khanty decreased in Estonia over time: it's the highest in Comb Ceramic, lower in Corded Ware, and even lower in the late Bronze Age samples from this study. I didn't know that Comb Ceramic was so EHG-like.

Lemminkäinen

04-01-2021, 08:57 PM

I noticed that too. I think my calc works better for central-north and north Euros (or south Europe has too complicated a genesis)

Btw what are those two samples x10 and v10 showing only EHG and CW?

My goal was to get admixtures of European Neolithic, Baltic HG, Siberian and Baltic CWC, but because CWC is itself a mixed entity we see some vagueness there. X10 is BA and V10 is early IA. X10 is around 50/50 HG+CWC, V10 fully CWC.

Lemminkäinen

04-01-2021, 09:16 PM

Admixture projection is simple: you first run Admixture using the source populations and then, using a "projection" parameter, run a second analysis. But if present-day samples carry later admixtures that are lacking from the ancient ones, the result can be wrong. I think that something existing in the Baltic Sea area is interpreted as WHG. Admixture simply allocates whatever it finds to the K proportions. So "modern WHG" can be anything, maybe the present Baltic makeup.
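To make the projection idea concrete, here is a toy sketch in Ruby. It is not ADMIXTURE's actual code: it assumes the allele frequencies of each component are already known and kept fixed, and it estimates only the ancestry proportions of one projected individual with a plain EM update. The function name `project_q`, the genotype coding and the example frequencies are all made up for illustration, and the frequencies are assumed to lie strictly between 0 and 1.

```ruby
# Toy projection: estimate one individual's ancestry proportions q while the
# component allele frequencies (learned from a reference run) stay fixed.
# genotypes: array of 0/1/2 counts of the counted allele, nil = missing call
# freqs:     freqs[k][j] = frequency of that allele at SNP j in component k
#            (assumed strictly between 0 and 1)
def project_q(genotypes, freqs, iterations: 200)
  k = freqs.length
  q = Array.new(k, 1.0 / k)                  # start from equal proportions
  iterations.times do
    acc = Array.new(k, 0.0)
    genotypes.each_with_index do |g, j|
      next if g.nil?                         # skip missing calls
      ref = (0...k).sum { |m| q[m] * freqs[m][j] }
      alt = (0...k).sum { |m| q[m] * (1.0 - freqs[m][j]) }
      k.times do |m|
        # expected share of each allele copy that comes from component m
        acc[m] += g * q[m] * freqs[m][j] / ref if g > 0
        acc[m] += (2 - g) * q[m] * (1.0 - freqs[m][j]) / alt if g < 2
      end
    end
    total = acc.sum
    q = acc.map { |v| v / total }            # renormalise so that q sums to 1
  end
  q
end

# Hypothetical two-component reference with four SNPs.
freqs = [[0.9, 0.8, 0.9, 0.7], [0.1, 0.2, 0.1, 0.3]]
puts project_q([2, 2, 2, 2], freqs).map { |v| v.round(3) }.inspect
```

The point of the sketch is that a projected individual can only be described in terms of the components that were fixed beforehand, so whatever ancestry it carries is forced into those K proportions.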

Komintasavalta

04-01-2021, 09:34 PM

Admixture projection is simple: you first run Admixture using the source populations and then, using a "projection" parameter, run a second analysis.

Yeah maybe I should read the manual (http://dalexander.github.io/admixture/admixture-manual.pdf):

# Verify the two datasets have the same set of SNPs
% diff -s reference.bim study.bim
# Run unsupervised ADMIXTURE with K=2
% admixture reference.bed 2
# Use learned allele frequencies as (fixed) input to next step
% cp reference.2.P study.2.P.in
# Run projection ADMIXTURE with K=2
% admixture -P study.bed 2

BTW Zoro said earlier that 0.1 is too aggressive for the third parameter of `--indep-pairwise` (r^2 threshold), but an example in the ADMIXTURE manual uses `--indep-pairwise 50 10 0.1`:

### 2.3 Do I need to thin the marker set for linkage disequilibrium?

We tend to believe this is a good idea, since our model does not explicitly take LD into consideration, and since enormous data sets take more time to analyze. It is impossible to "remove" all LD, especially in recently-admixed populations, which have a high degree of "admixture LD". Two approaches to mitigating the effects of LD are to include markers that are separated from each other by a certain genetic distance, or to thin the markers according the observed sample correlation coefficients. The easiest way is the latter, using the `--indep-pairwise` option of PLINK. For example, if we start with a file `rawData.bed`, we could use the following commands to prune according to a correlation threshold and store the pruned dataset in `prunedData.bed`:

> % plink --bfile rawData --indep-pairwise 50 10 0.1
> (output indicating number of SNPs targeted for inclusion/exclusion)
> % plink --bfile rawData --extract plink.prune.in --make-bed --out prunedData

Specifically, the first command targets for removal each SNP that has an R2 value of greater than 0.1 with any other SNP within a 50-SNP sliding window (advanced by 10 SNPs each time). The second command copies the remaining (untargetted) SNPs to `prunedData.bed`.

This approach is imperfect but seems to work well in practice. Please read our paper for more information.

The manual also says that only 10,000 markers are needed for a global ADMIXTURE run:

As a rule of thumb, we have found that 10,000 markers suffice to perform GWAS correction for continentally separated populations (for example, African, Asian, and European populations FST > .05) while more like 100,000 markers are necessary when the populations are within a continent (Europe, for instance, FST < 0.01).

Petalpusher

04-01-2021, 09:47 PM

The problem is dealing with many sources carrying elements that can be recombined in something like EHG. West farmers had sizeable WHG in them before they even got into Europe, then they picked up some more during the trip. You recombine that with legit local EHG, more eastward EHG from the steppe that is more highly ANE. The WHG and ANE that isn't local is gonna be seen as ...EHG. Yet many affinities to WHG in all this too.

vbnetkhio

04-01-2021, 10:10 PM

BTW Zoro said earlier that 0.1 is too aggressive for the third parameter of `--indep-pairwise` (r^2 threshold), but an example in the ADMIXTURE manual uses `--indep-pairwise 50 10 0.1`:

in studies, the most common parameters when doing admixture are "200 25 0.4".
the last one can go from 0.1 to 0.5 if you want more or less aggressive pruning.

I found one study with maf 0.01, but usually there's no maf applied at all.

and for PCA, usually there is no filtering at all. optionally indep pairwise if the PCA seems distorted by some recent drift.
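To show what the r^2 threshold does, here is a rough Ruby sketch. It is a simplified greedy version written for illustration, not PLINK's exact algorithm (there is no step size and no rule for choosing which SNP of a correlated pair to drop). The function names `r_squared` and `prune` and the toy genotypes are made up.

```ruby
# Squared Pearson correlation between the genotype vectors (0/1/2) of two SNPs.
def r_squared(a, b)
  n = a.length.to_f
  ma = a.sum / n
  mb = b.sum / n
  cov = a.zip(b).sum { |x, y| (x - ma) * (y - mb) }
  va = a.sum { |x| (x - ma)**2 }
  vb = b.sum { |y| (y - mb)**2 }
  return 0.0 if va.zero? || vb.zero?       # monomorphic SNP: correlation undefined
  cov**2 / (va * vb)
end

# Simplified greedy pruning: keep a SNP only if its r^2 with every already
# kept SNP among the previous `window` SNPs is at or below `threshold`.
# Returns the indices of the kept SNPs.
def prune(snps, window: 50, threshold: 0.1)
  kept = []
  snps.each_with_index do |g, i|
    recent = kept.select { |j| i - j < window }
    kept << i if recent.all? { |j| r_squared(snps[j], g) <= threshold }
  end
  kept
end

# Toy data: SNP 1 duplicates SNP 0, SNP 2 is moderately correlated with SNP 0.
snps = [
  [0, 1, 2, 0, 1, 2],
  [0, 1, 2, 0, 1, 2],
  [1, 0, 2, 0, 2, 1]
]
puts prune(snps, threshold: 0.1).inspect
puts prune(snps, threshold: 0.4).inspect
```

With these toy genotypes the lower threshold removes more SNPs than the higher one, which is the sense in which 0.1 is the more aggressive setting.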

vbnetkhio

04-01-2021, 10:48 PM

WHG seems to be overrepresented in the Baltic region, but I have seen this before. Some estimates give nearly 90% WHG in Latvia, Lithuania and Estonia and other estimates 60%. Baltic BA has more WHG than Baltic CCC. It looks like WHG's revenge after the Baltic CWC. Hard to believe, because I don't see a parallel growth of haplogroup I*; actually R1a is the dominant haplogroup through the BA.

it's because they actually projected ancient data into a modern model for some reason.
so that's a modern Baltic component, not WHG.

Lemminkäinen

04-02-2021, 09:38 AM

Yes, LD pruning is the way to get rid of modern LD, but I am not sure it is the right way to "get rid of" modern admixtures.

10,000 SNPs can be enough to find out global diversity, but how many SNPs do we need to trace 2000 years of migrations in Estonia?

Lemminkäinen

04-02-2021, 09:45 AM

The problem is dealing with many sources carrying elements that can be recombined in something like EHG. West farmers had sizeable WHG in them before they even got into Europe, then they picked up some more during the trip. You recombine that with legit local EHG, more eastward EHG from the steppe that is more highly ANE. The WHG and ANE that isn't local is gonna be seen as ...EHG. Yet many affinities to WHG in all this too.

This is absolutely right. I see these problems in my Admixture run. On the other hand, I am quite content that it is as good as it is, taking into account all those problems, and that I succeeded in isolating the Baltic CWC with only a few samples and in an unsupervised run.

Lemminkäinen

04-02-2021, 09:48 AM

it's because they actually projected ancient data into a modern model for some reason.
so that's a modern Baltic component, not WHG.

Yeah, that is what I already suggested.

Komintasavalta

04-02-2021, 06:43 PM

I tried running ADMIXTURE with similar populations as Lemminkäinen, but the results didn't make much sense. For example at K=5, some samples within the same population had 100% of one component, some had 100% of a second component, and some had 100% of a third component. However when I included more populations, the results became more reasonable.

wget https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar
tar -xf v44.3_HO_public.tar
f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
printf %s\\n Besermyan Estonia_BA.SG Estonia_CordedWare Estonia_EMN_Narva Estonia_EarlyViking.SG Estonia_IA.SG Estonia_Medieval.SG Estonia_N_CombCeramic.SG Estonian Estonian.DG Finland_Levanluhta Finland_Saami_Modern.SG Finnish Finnish.DG Germany_EN_LBK Hungarian.DG Italy_North_Villabruna_HG Karelian Latvia_BA Latvia_HG Latvia_MN Lithuania_EMN_Narva Luxembourg_Loschbour.DG Mansi Mansi.DG Mordovian Nganasan Russia_Bolshoy Russia_IA_Ingria.SG Saami.DG Sardinian.SDG Selkup Turkey_N_published Udmurt>pops
x=estolatvi14;k=4
a0()(gawk -e'ARGIND==1{a[$0];next}' -e"$@")
plink --bfile v44.3_HO_public --keep <(a0 '$3 in a{print$1}' pops v44.3_HO_public.ind|a0 '$2 in a' - v44.3_HO_public.fam) --geno .9 --indep-pairwise 50 10 .1 --make-bed --out $x.temp
plink --bfile $x.temp --extract $x.temp.prune.in --make-bed --out $x
admixture $x.bed $k
awk 'NR==FNR{a[$1]=$3;next}{print a[$2],$2}' v44.3_HO_public.ind $x.fam|tr \ :|paste -d' ' - $x.$k.Q>$x.$k
tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}")
sed -E 's/:[^ ]*//;s/\.(SG|DG|SDG)//' $x.$k|tav ' '>$x.$k.ave

library(pheatmap)
library(colorspace)

t=read.table("estolatvi14.4.ave",row.names=1,header=F)
t=t[,c(1,4,2,3)]
t=t[order(-2*t[,1]-t[,2]+2*t[,4]),]

pheatmap(
100*t,
filename="a.png",
cluster_cols=F,
cluster_rows=F,
show_colnames=F,
legend=F,
cellwidth=16,
cellheight=16,
fontsize=8,
border_color=NA,
display_numbers=T,
number_format="%.0f",
fontsize_number=7,
number_color="black",
colorRampPalette(hex(HSV(c(210,210,130,60,40,20,0),c(0,.5,.5,.5,.5,.5,.5),1)))(256)
)

The heatmap below shows population averages for a total of 362 samples. The proportion of the third HG component is much higher in Estonia_BA (19%) than in Estonia_IA (9%). Comb Ceramic has a huge proportion of the HG component (66%), but there was no separate EHG component, so it also has 4% of the Nganasan component. There's something weird with Levänluhta because it has 15% of the Neolithic component, but maybe I should've just picked some of the highest-quality samples, because I also got weird FST results with Levänluhta.

https://i.ibb.co/HTNRMqW/admixture.png

Lucas

04-02-2021, 08:46 PM

BTW, Admixture runs with high K and many samples, like 10 0000 (and of course many SNPs), can take days. I recommend buying a cheap VPS or cloud server with Ubuntu or even Debian for such things (for one month, for example). Even if it has low RAM, it doesn't matter; it will just run longer.

Unless someone wants his computer to burn out soon :)

vbnetkhio

04-02-2021, 11:46 PM

I tried running ADMIXTURE with similar populations as Lemminkäinen, but the results didn't make much sense. For example at K=5, some samples within the same population had 100% of one component, some had 100% of a second component, and some had 100% of a third component. However when I included more populations, the results became more reasonable.

did you remove all relatives and duplicate samples?
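As a rough illustration of the duplicate check, here is a Ruby sketch that flags sample pairs whose genotype calls are almost identical. It only catches duplicates, not relatives (for those a proper relatedness estimate from a tool such as PLINK or KING would be needed). The function names, the 0.99 cutoff and the sample names are made up for the example.

```ruby
# Fraction of SNPs with identical genotype calls, ignoring missing (nil) calls.
def concordance(a, b)
  pairs = a.zip(b).reject { |x, y| x.nil? || y.nil? }
  return 0.0 if pairs.empty?
  pairs.count { |x, y| x == y } / pairs.length.to_f
end

# Return the pairs of sample names whose concordance is at least `cutoff`.
# samples: hash of sample name => array of 0/1/2 genotype calls (nil = missing)
def likely_duplicates(samples, cutoff: 0.99)
  samples.keys.combination(2).select do |s, t|
    concordance(samples[s], samples[t]) >= cutoff
  end
end

# Hypothetical samples: the second one is a re-genotyped copy of the first.
samples = {
  "sample_A"       => [0, 1, 2, 1, 0, 2, 1, 1],
  "sample_A_rerun" => [0, 1, 2, 1, 0, 2, 1, 1],
  "sample_B"       => [2, 1, 0, 0, 1, 2, 0, 1]
}
puts likely_duplicates(samples).inspect
```

One sample from each flagged pair could then be left out before running ADMIXTURE.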