Visualizing an ADMIXTURE run as a polygonal diagram



Komintasavalta
05-08-2021, 10:47 AM
When you have a numeric matrix with three columns, where the values on each row add up to a constant and there are no negative values, you can visualize the matrix as a ternary plot, where the rows of the matrix are drawn as points inside an equilateral triangle: https://en.wikipedia.org/wiki/Ternary_plot. Basically you draw an equilateral triangle centered at the origin, with a vector pointing from the origin to each corner of the triangle, and you then calculate the coordinates of each point as a linear combination of the vectors, using the values on the row as weights. Because the rows of the matrix add up to a constant, there is a one-to-one correspondence between coordinates within the triangle and rows of the matrix: the value of the third column is always equal to the constant minus the sum of the first and second columns, so each row only has two free values.
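The one-to-one correspondence can be checked with a small sketch. The proportions below are made up for illustration (they are not from any run in this thread), and the inverse step is just one way to recover them: it adds the constraint that each row sums to 1 and solves the resulting linear system.

```r
# Corners of an equilateral triangle centered at the origin (radius 1),
# starting from the top corner and going clockwise.
angles <- head(seq(0, 2, length.out = 3 + 1), -1) * pi
corners <- cbind(sin(angles), cos(angles))

# Hypothetical admixture proportions; each row sums to 1.
props <- rbind(a = c(0.2, 0.5, 0.3), b = c(0.7, 0.1, 0.2))

# Forward map: each point is a weighted average of the corners.
xy <- props %*% corners

# Inverse map: append the constraint "proportions sum to 1" and solve
# the resulting 3x3 linear system for every point at once.
A <- rbind(t(corners), 1)
recovered <- t(solve(A, rbind(t(xy), 1)))

print(round(recovered, 6))
```

The recovered matrix matches `props`, which is what makes the triangle plot lossless.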

It is possible to extend the concept of a ternary plot in order to draw a square plot for a matrix with four columns, to draw a pentagon-shaped plot for a matrix with five columns, and so on. However, there is then no longer a one-to-one correspondence between points within the polygon and rows of the matrix, because for example within a square plot, a point in the middle of the plot can have either 25% of all four components or 50% of each of two opposite components.
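A minimal sketch of that ambiguity in a square plot, using two made-up rows (one with 25% of every component, one with 50% of two opposite components):

```r
# Corners of a square centered at the origin, built the same way as the
# triangle corners but with four corners.
angles <- head(seq(0, 2, length.out = 4 + 1), -1) * pi
square <- cbind(sin(angles), cos(angles))

even     <- c(0.25, 0.25, 0.25, 0.25)  # 25% of all four components
opposite <- c(0.50, 0.00, 0.50, 0.00)  # 50% of two opposite components

# Both rows are projected to the center of the square.
xy <- rbind(even, opposite) %*% square
print(round(xy, 6))
```

Both rows end up at the origin, so the position of a point in the square does not tell you which of the two rows produced it.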

I now selected almost all modern European samples from the 1240K+HO dataset, except I excluded duplicate samples, I excluded one sample from each pair of samples with PI_HAT of .3 or above, and I only included at most 16 samples per population. I then ran ADMIXTURE at each K value from 3 to 8, and I visualized the results as polygonal diagrams.

In the images below, I reordered the admixture components so that I always placed the Kalmyk component at the top of the diagram, because there were no Nenets samples in the dataset I used, so I considered Kalmyks to be the racially purest Europeans. I placed Northern Europeans on the right side of Kalmyks, because there is a cline from Northern Europeans to Kalmyks in Northeastern Europe, and I placed North Caucasians on the left side of Kalmyks, because Nogais are intermediate between Caucasians and Kalmyks.

https://i.ibb.co/sQJ8KP0/3.jpg
https://i.ibb.co/9YQm4dV/4.jpg
https://i.ibb.co/80K0L2L/5.jpg
https://i.ibb.co/9mx74vp/6.jpg
https://i.ibb.co/yRh5yS1/7.jpg
https://i.ibb.co/wQZqQyZ/8.jpg

The image below shows population averages from the same ADMIXTURE runs visualized as heatmaps. The clustering is based on a matrix where the columns of each run have been joined into a single wide matrix.

At K=3, the middle component seems like a WHG-like component, because its proportion is the highest in Basques and Lithuanians. The left component is maximal in Kalmyks, but it is also influenced by VURians and Nogais, so even Udmurts have 42% of the left component. Nogais are the only population that has a large proportion of both the first and third components. Nogais from Stavropol are from north of North Caucasus, and Nogais from Astrakhan are from between Kalmykia and Kazakhstan. Compared to them, Nogais from Karachay-Cherkessia (North Caucasus) are closer to Caucasians and less Mongoloid.

At K=4, the middle component splits into a Northern European component which is maximal in Estonians and Lithuanians and a Southern European component which is maximal in Sardinians. However, even Greeks still have 35% of the Caucasian component. The proportion of the Mongoloid component in Udmurts also decreases from 42% to 34%.

At K=5, a mysterious ghost component splits off from the Northern European component. Its proportion is the highest in Arkhangelsk Russians, Gagauzes, and Moldovans. At K=5, Kazan Tatars still have 14% of the Caucasian component but Chuvashes only have 2%. Bashkirs have 8% of the Caucasian component, 36% of the Mongoloid component, and 56% of the Northern European component. However, the Bashkir samples are from Jeong et al. 2019, which included both northern and southern Bashkirs, and the southern Bashkir samples had much higher Mongoloid ancestry.

At K=6, the Southern European component splits into a Sardinian component and a Maltese component. The Maltese component is rarer, and it only has a high percentage in Maltese, Ashkenazis, and Sicilians. Caucasians were overrepresented in this run, so the Caucasian component also splits into two different components at K=6.

At K=8, a Uralic component that is maximal in Vepsians appears. If these runs had included more samples of non-Finnic Finno-Permic populations like Saami, Maris, or Komis, the Uralic component might have become more Mongoloid, or it might have appeared at an earlier K value.

https://i.ibb.co/wMNGM0C/admixture-euro3.jpg

Download required data and software:

1240K+HO dataset: https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data
ADMIXTURE: https://github.com/NovembreLab
Binaries for PLINK 1.9: https://www.cog-genomics.org/plink2/
Compile EIGENSOFT from source: https://reich.hms.harvard.edu/software
Mac binaries for EIGENSOFT 7.2.1: https://drive.google.com/file/d/1H8kPzVXKEetImKYfyjbbD9xboz3JJVFP/view?usp=sharing
Mac binaries for an old fork of EIGENSOFT: https://github.com/chrchang/eigensoft

Download the 1240K+HO dataset and run ADMIXTURE:


curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar;tar -xf v44.3_HO_public.tar
f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
x=euro5
printf %s\\n Albanian Basque Basque.SDG Belarusian Bulgarian Cretan.DG Croatian Czech English French French.SDG Greek Icelandic Italian_North Italian_South Lithuanian Maltese Moldavian Norwegian Norwegian.DG Orcadian Orcadian.SDG Polish.DG Romanian Russian Russian.SDG Russian_Archangelsk_Krasnoborsky Russian_Archangelsk_Leshukonsky Russian_Archangelsk_Pinezhsky Sardinian Scottish Sicilian Spanish Spanish_North Ukrainian Ukrainian_North Besermyan Estonian Finnish Finnish.DG Hungarian Karelian Mordovian Saami.DG Udmurt Veps Chuvash Gagauz Tatar_Kazan Tatar_Mishar Abazin Adygei Adygei.SDG Avar Balkar Chechen Circassian Darginian Ingushian Kabardinian Kaitag Karachai Kumyk Lak Lezgin Lezgin.DG Ossetian Tabasaran Bashkir Jew_Ashkenazi Kalmyk Nogai_Astrakhan Nogai_Karachay_Cherkessia Nogai_Stavropol>$x.pop
sed 1d v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '!a[$3]++{print$2,$8}'|awk 'NR==FNR{a[$0];next}$2 in a' $x.pop ->$x.temp.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.temp.pick v44.3_HO_public.fam) --make-bed --out $x.temp
plink --allow-no-sex --bfile $x.temp --genome --out $x
awk 'FNR>1&&$10>=.25{print$2<$4?$2:$4}' $x.genome|awk 'NR==FNR{a[$0];next}!($1 in a)' - $x.temp.pick>$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
plink --allow-no-sex --bfile $x --indep-pairwise 50 10 .05 --out $x
plink --bfile $x --extract $x.prune.in --make-bed --out $x.pruned
tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}")
for k in {3..8};do admixture -j4 -C .1 $x.pruned.bed $k;paste -d' ' <(awk 'NR==FNR{a[$1]=$2;next}{print$2,a[$2]}' $x.pick $x.pruned.fam) $x.pruned.$k.Q>$x.$k;cut -d' ' -f2- $x.$k|tav \ >$x.$k.ave;done

Generate polygonal diagrams:


library(tidyverse)
library(ggforce)
library(ggrepel)

for(n in c(3,4,5,6,7,8)){
t=read.table(paste0("euro5.",n))
rownames(t)=paste0(t[,2],":",t[,1])
t=t[,-c(1,2)]

columnorder=list(c(2,1,3),c(4,3,2,1),c(2,5,4,3,1), c(4,1,2,3,6,5),c(1,7,5,3,2,6,4),c(2,3,7,4,1,8,5,6) )
t=t[,columnorder[[n-2]]]

corners=sapply(c(sin,cos),function(x)head(x(seq(0, 2,length.out=n+1)*pi),-1))
corners=corners*min(2/diff(apply(corners,2,range)))
corners[,2]=corners[,2]-mean(range((corners[,2])))

xy=as.data.frame(as.matrix(t)%*%corners)
grid=as.data.frame(rbind(cbind(corners,rbind(corners[-1,],corners[1,])),cbind(corners,matrix(apply(corners,2,mean),ncol=2,nrow=n,byrow=T))))

pop=sub(":.*","",rownames(xy))
pop=sub("\\.(DG|SDG|SG|WGA)","",pop)
centers=aggregate(xy,by=list(pop),mean)
xy$pop=pop

set.seed(1488)
color=as.factor(sample(seq(1,length(unique(xy$pop) ))))
cl=rbind(c(60,80),c(25,95),c(30,70),c(70,50),c(60, 100),c(20,50),c(15,40))
hues=max(ceiling(length(color)/nrow(cl)),2)
pal1=as.vector(apply(cl,1,function(x)hcl(head(seq( 15,375,length=hues+1),-1),x[1],x[2])))
pal2=as.vector(apply(cl,1,function(x)hcl(head(seq( 15,375,length=hues+1),-1),ifelse(x[2]>=60,.5*x[1],.1*x[1]),ifelse(x[2]>=60,.2*x[2],95))))

xy$V1=xy$V1+runif(nrow(xy))/1e3
xy$V2=xy$V2+runif(nrow(xy))/1e3

lims=apply(corners,2,range)+c(-.08,.08)

ggplot(xy,aes(x=V1,y=V2))+
geom_segment(data=grid,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray85",size=.3)+
geom_voronoi_tile(aes(group=0,fill=color[as.factor(pop)],color=color[as.factor(pop)]),size=.07,max.radius=.055)+
geom_label_repel(data=centers,aes(x=V1,y=V2,label=Group.1,color=color,fill=color),max.overlaps=Inf,point.size=0,size=2.3,alpha=.8,label.r=unit(.1,"lines"),label.padding=unit(.1,"lines"),label.size=.1,box.padding=0,segment.size=.3)+
coord_fixed(xlim=lims[,1],ylim=lims[,2],expand=F)+
scale_fill_manual(values=pal1)+
scale_color_manual(values=pal2)+
theme(
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
legend.position="none",
panel.background=element_rect(fill="white")
)

ggsave(paste0(n,".png"),width=7,height=7)
}

Use ComplexHeatmap to combine heatmaps for different K values (https://jokergoo.github.io/ComplexHeatmap-reference/book/):


library(ComplexHeatmap)
library(circlize)
library(colorspace)
library(vegan)

kvals=c(3,4,5,6,7,8)

# columnorder=lapply(kvals,seq)
columnorder=list(c(2,1,3),c(4,3,2,1),c(2,5,4,3,1), c(4,1,2,3,6,5),c(1,7,5,3,2,6,4),c(2,3,7,4,1,8,5,6) )

mats=sapply(1:length(kvals),function(i){
t=100*read.table(paste0("euro5.",kvals[i],".ave"),row.names=1)[,columnorder[[i]]]
rownames(t)=sub("Cherkessia","Cher",sub("Russian_Archangelsk_","Rus_Arch_",rownames(t)))
data.frame(aggregate(t,list(sub("\\.(DG|SDG|SG|WGA)|_1|_2","",row.names(t))),mean),row.names=1)
})

png("a.png",w=6000,h=5000,res=144)

maps=sapply(kvals,function(k){
mat=as.matrix(mats[match(k,kvals)][[1]])
Heatmap(
mat,
show_heatmap_legend=F,
show_column_names=F,
show_row_names=F,
clustering_distance_rows="euclidean",
width=ncol(mat)*unit(30,"pt"),
height=nrow(mat)*unit(30,"pt"),
row_dend_width=unit(200,"pt"),
cluster_columns=F,
cluster_rows=reorder(hclust(dist(do.call(cbind,mats))),-mats[[2]][,2]-2*mats[[2]][,1]),
column_title=paste0("K=",k),column_title_gp=gpar(fontsize=24),
right_annotation=rowAnnotation(text1=anno_text(gt_render(rownames(mat),padding=unit(c(2,2,2,2),"mm")),just="left",location=unit(0,"npc"),gp=gpar(fontsize=17))),
col=colorRamp2(seq(0,100,length.out=7),hex(HSV(c(210,210,130,60,40,20,0),c(0,rep(.5,6)),1))),
cell_fun=function(j,i,x,y,w,h,fill)grid.text(sprintf("%.0f",mat[i,j]),x,y,gp=gpar(fontsize=15))
)
})

draw(Reduce(`+`,maps))
dev.off()
system("mogrify -gravity center -trim -border 16 -bordercolor white a.png")

Tenma de Pegasus
05-08-2021, 11:22 AM
It's a new way to look at genetics, very interesting!

Komintasavalta
05-08-2021, 01:34 PM
Here's also ADMIXTURE runs for Turkic samples in 1240K+HO. This time I didn't remove related samples with high PI_HAT, so at K=3, the bottom right corner and bottom left corner both include populations with high PI_HAT, like Tubalars, Todzins, Tofalars, and Dolgans. At K=6, there is also one component for Tofalars and another component for Dolgans and Yakuts.

https://i.ibb.co/hX0mM40/3.jpg
https://i.ibb.co/5nkPbRK/4.jpg
https://i.ibb.co/c2dbJrq/5.jpg
https://i.ibb.co/QfK2yy1/6.jpg
https://i.ibb.co/DQr3qZq/7.jpg
https://i.ibb.co/DRfrN7X/turkadmix.jpg

Below is a list of the 16 pairs of samples with the highest PI_HAT value. I should've probably at least removed samples with PI_HAT over .35 or .3, but I wanted to demonstrate how the presence of related samples can affect an ADMIXTURE run.

$ x=turk
$ printf %s\\n Altaian Altaian_Chelkan Azeri Balkar Bashkir Chuvash Dolgan Gagauz Karachai Karakalpak Kazakh Kazakh_China Khakass Khakass Khakass_Kachin Kumyk Kyrgyz_China Kyrgyz_Kyrgyzstan Kyrgyz_Kyrgyzstan.DG Kyrgyz_Tajikistan Kyrgyz_Tajikstan Nogai Nogai_Astrakhan Nogai_Karachay_Cherkessia Nogai_Stavropol Salar Shor_Khakassia Shor_Mountain Tatar_Kazan Tatar_Mishar Tatar_Siberian Tatar_Siberian_Zabolotniye Todzin Tofalar Tubalar Turkish Turkish.DG Turkish_Balikesir Turkmen Tuvinian Uyghur Uyghur.DG Uzbek Yakut Yakut.DG Yakut.SDG>$x.pop
$ awk -F$'\t' 'NR==FNR{a[$0];next}$8 in a&&(!a[$3]++){print$2,$8}' $x.pop v44.3_HO_public.anno>$x.pick
$ plink --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
[...]
$ plink --bfile $x --genome --out $x
[...]
$ awk '{print$10,$2,$4}' $x.genome|sort -rn|head -n16|awk 'NR==FNR{a[$1]=$3;next}{print$1,a[$2]":"$2,a[$3]":"$3}' v44.3_HO_public.ind -
0.6302 Tofalar:Vgut8 Tofalar:Vgut12
0.6191 Shor_Khakassia:KHS-035 Shor_Khakassia:KHS-036
0.6139 Tubalar:Tuba23 Tubalar:Tuba24
0.5926 Khakass_Kachin:Khs-493 Khakass_Kachin:Khs-513
0.5816 Tubalar:ALT-116 Tubalar:Tuba2
0.4788 Tofalar:Vgut11 Tofalar:Vgut13
0.4433 Tofalar:Vgut1 Tofalar:Vgut4
0.4302 Tuvinian:Tuvinians86 Tuvinian:Tuvinians111
0.4224 Tubalar:Tuba10 Tubalar:Tuba11
0.3807 Tofalar:Vgut13 Tofalar:Vgut18
0.3765 Karachai:ABA-035 Karachai:ABA-091
0.3675 Tubalar:Tuba21 Tubalar:Tuba1
0.3619 Azeri:AZR-0864 Azeri:AZR-0868
0.3597 Kazakh:KZH-1611 Kazakh:KZH-1750
0.3341 Tatar_Mishar:TTR-272 Tatar_Mishar:TTR-464
0.3264 Tofalar:Vgut11 Tofalar:Vgut15
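As a rough sketch of what removing related samples at a threshold could look like in R: the four Tofalar pairs below are copied from the list above, but `drop_related` is a hypothetical helper and its greedy rule (drop the second sample of a pair unless one member of the pair has already been dropped) is only one possible choice. It is not the same rule as the awk one-liner in the European run.

```r
# A few of the pairs listed above (PI_HAT, first sample, second sample).
pairs <- read.table(text = "
0.6302 Vgut8 Vgut12
0.4788 Vgut11 Vgut13
0.3807 Vgut13 Vgut18
0.3264 Vgut11 Vgut15
", col.names = c("pi_hat", "id1", "id2"), stringsAsFactors = FALSE)

# Hypothetical helper: walk through the pairs at or above the threshold
# and drop the second sample of each pair, unless one of the two samples
# has already been dropped because of an earlier pair.
drop_related <- function(pairs, threshold = 0.35) {
  dropped <- character(0)
  for (i in which(pairs$pi_hat >= threshold)) {
    ids <- c(pairs$id1[i], pairs$id2[i])
    if (!any(ids %in% dropped)) dropped <- c(dropped, ids[2])
  }
  dropped
}

dropped <- drop_related(pairs)
print(dropped)
```

With a threshold of .35 this drops Vgut12 and Vgut13. Vgut18 is kept because its only relative above the threshold (Vgut13) is already gone.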

Komintasavalta
05-08-2021, 04:47 PM
I now selected ancient samples that had a mean age BP of 6000 or higher and that had at least 400,000 SNPs. I omitted some early Neolithic and WHG samples so they wouldn't be overrepresented. I also omitted Cameroon_SMA and Morocco_Iberomaurusian.

Here's plots of the population averages of the samples, where many populations only consist of a single sample. I joined the columns of the runs at all K values into a single wide matrix. I used the distance matrix of the wide matrix to connect each point to its three closest neighbors, and also to draw convex hulls around the populations based on hierarchical clustering.
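A more readable sketch of the nearest-neighbor and clustering step, using a tiny made-up matrix in place of the real wide matrix (the population names A to E and the values are placeholders, and `nearest` is a hypothetical helper):

```r
# Hypothetical wide matrix: one row per population, with the columns of
# several runs joined side by side.
wide <- rbind(
  A = c(1.0, 0.0, 0.0),
  B = c(0.9, 0.1, 0.0),
  C = c(0.8, 0.1, 0.1),
  D = c(0.0, 0.0, 1.0),
  E = c(0.1, 0.0, 0.9)
)

# Euclidean distances between all pairs of populations.
d <- as.matrix(dist(wide))

# For each population, the names of its n closest other populations.
nearest <- function(d, n = 3) {
  t(sapply(rownames(d), function(i) {
    others <- d[i, colnames(d) != i]
    names(sort(others))[seq_len(n)]
  }))
}
nn <- nearest(d, 3)
print(nn)

# Groups for the convex hulls, cut from a hierarchical clustering tree.
clusters <- cutree(hclust(dist(wide)), k = 2)
print(clusters)
```

Each row of `nn` gives the endpoints of the three line segments drawn from that population, and `clusters` gives the group that each hull is drawn around.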

At K=3, MA1 and Tyumen_HG are closer to the top pole than to the WHG pole, but it's probably because the top pole includes so many American samples.

https://i.ibb.co/vLRbpQ2/3.jpg

At K=4, the top pole splits into an American pole and an East-North Asian pole. There are two paths which connect the WHG pole to the East-North Asian pole. Swedish HGs are connected to Ukraine_N, which is connected to Latvia_MN_o2, which is connected to EHGs, which are connected to WSHGs. From there you can choose between two paths to the East Asian pole: either go from Ust'-Ishim to Tianyuan to China_SEastAsia_Island_EN, or go from USA_Ancient_Beringian to Russia_Kolyma_M to Russia_Siberia_Lena.

https://i.ibb.co/Prf4xwx/4.jpg

At K=5, Sunghir splits off into its own pole. In the previous image at K=4, Sunghir had about 50% of the early Neolithic component, 25% of the WHG component, 15% of the East-North Asian component, and 10% of the American component. At K=5, MA1 is also close to the pole of Sunghir. EHGs are a mixture of WHG, American, and Sunghir.

https://i.ibb.co/f4F4rbb/5.jpg

At K=6, Iran_N splits off from Turkey_N. Iran_C and Armenia_C are intermediate between them.

https://i.ibb.co/XY0hQf7/6.jpg

At K=7, Siberians and Mongolians split off from East Asians, even though they merge again at K=8.

https://i.ibb.co/mbvJ9rs/7.jpg

At K=8, EHGs split off from WHGs, and SHGs are approximately halfway between EHGs and WHGs. However even Norway_Mesolithic has 100% of the EHG component, because in an ADMIXTURE run like this that includes a relatively small number of samples, often many samples only have 100% of a single component. WSHG is now between EHGs and Americans, but MA1 is between EHGs and Sunghir. Russia_Steppe_Eneolithic is close to the center of the plot, but it just has 44% of the Iran_N component and 44% of the EHG component.

https://i.ibb.co/Ws8JmhH/8.jpg

In the images above, Ust'-Ishim is close to the center of the plot at most K values. It actually has a balanced mix of different admixture components:

https://i.ibb.co/Rv9HS65/admixture-before-6000-bp.jpg


library(tidyverse)
library(magrittr) # provides %<>% and set_rownames(), which tidyverse does not attach
library(ggforce)
library(ggrepel)

for(k in 3:8){
t=read.table(paste0("hqhg19.",k,"a"),row.names=1)
rownames(t)%<>%sub("\\.(DG|SDG|SG|WGA)","",.)

t=t[,list(c(1,2,3),c(2,3,4,1),c(2,5,1,4,3),c(1,5,4,3,6 ,2),c(4,5,7,3,1,2,6),c(5,3,1,8,6,4,7,2))[[k-2]]]

corners=sapply(c(sin,cos),function(x)head(x(seq(0, 2,length.out=k+1)*pi),-1))
corners=corners*min(2/diff(apply(corners,2,range)))
corners[,2]=corners[,2]-mean(range(corners[,2]))

xy=as.data.frame(as.matrix(t)%*%corners)
grid=as.data.frame(rbind(cbind(corners,rbind(corners[-1,],corners[1,])),cbind(corners,matrix(apply(corners,2,mean),ncol=2,nrow=k,byrow=T))))

joined=sapply(2:8,function(i)read.table(paste0("hqhg19.",i,"a"))[,-1])%>%do.call(cbind,.)%>%set_rownames(rownames(t))
dist=as.data.frame(as.matrix(dist(joined)))
seg=lapply(1:4,function(i)apply(dist,1,function(x) unlist(xy[names(sort(x)[i]),],use.names=F))%>%t%>%cbind(xy))%>%do.call(rbind,.)%>%setNames(paste0("V",1:4))
xy$k=as.factor(cutree(hclust(dist(joined)),16))

set.seed(1488)
color=as.factor(sample(seq(length(unique(xy$k)))))
cl=rbind(c(50,90),c(100,80))
hues=max(ceiling(length(color)/nrow(cl)),8)
pal1=as.vector(apply(cl,1,function(x)hcl(head(seq( 15,375,length=hues+1),-1),x[1],x[2])))

xy$V1=xy$V1+runif(nrow(xy))/1e3
xy$V2=xy$V2+runif(nrow(xy))/1e3

expand=c(.08,.02)

ggplot(xy,aes(x=V1,y=V2))+
geom_polygon(data=as.data.frame(corners),fill="gray40")+
geom_segment(data=grid,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray50",size=.5)+
geom_mark_hull(aes(group=k,color=k,fill=k),concavity=1000,radius=unit(.3,"cm"),expand=unit(.3,"cm"),alpha=.2,size=.1)+
geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray20",size=.3)+
geom_point(aes(color=k),size=.5)+
geom_text_repel(aes(label=rownames(xy),color=k),max.overlaps=Inf,force=3,force_pull=2,size=2.3,segment.size=.15,min.segment.length=.15)+
coord_fixed(xlim=(1+expand[1])*c(-1,1),ylim=(1+expand[2])*c(-1,1))+
scale_fill_manual(values=pal1)+
scale_color_manual(values=pal1)+
theme(
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
legend.position="none",
panel.background=element_rect(fill="gray30"),
panel.grid=element_blank(),
plot.background=element_rect(fill="gray30",color=NA),
plot.margin=margin(0,0,0,0)
)

ggsave(paste0(k,".png"),width=7,height=7/(Reduce(`/`,1+expand)))
}

Komintasavalta
05-08-2021, 07:41 PM
Here's population averages from the European ADMIXTURE run, where each population is linked to its three closest neighbors.

I'm happy that Finns are connected to Kalmyks by only four links: first from Finnish to Russian_Archangelsk_Pinezhsky, then to Besermyan, then to Bashkir, and then to Kalmyk.

https://i.imgur.com/NepyCSJ.png

Komintasavalta
05-09-2021, 07:51 AM
Here's runs that include all samples with the suffix ".DG", except for Neanderthals, Denisovans, and samples with the prefix "Ignore_". This time I didn't manually reorder the admixture components at each corner of the polygons, so the order of the components is completely different at different K values.

For example, Kusunda and Khonda_Dora are members of the same cluster. At K=10, Kusunda gets its own admixture component, which is located in the opposite corner from Khonda_Dora, but that doesn't mean that they actually have a high genetic distance, because the corners of the polygon are in arbitrary order.

https://i.ibb.co/TBd8LBw/3.png
https://i.ibb.co/H7Ndn57/4.png
https://i.ibb.co/7jS1b7D/5.png
https://i.ibb.co/n8Gr4Sg/6.png
https://i.ibb.co/QPrD9kw/8.png
https://i.ibb.co/0qDX7QN/10.png
https://i.ibb.co/r65vHTV/12.png

Leto
05-09-2021, 08:27 AM
I don't know what all this gobbledegook is about but the Kalmyks are no Europeans by any stretch. They are simply 17th century immigrants from Mongolia. By this logic the French Canadians are pure Native Canadians.

Komintasavalta
05-09-2021, 12:52 PM
It also works with the spreadsheets of calculators. Here's Eurogenes K15 updated:

https://i.ibb.co/Cz9ppdG/eurogenesk15updated.png


library(tidyverse)
library(ggforce) # for geom_mark_hull
library(ggrepel)
library(colorspace) # for hex(HSV())

t=read.csv("https://pastebin.com/raw/Q3inavNV",row.names=1,check.names=F)
t=t/100
n=ncol(t)

corners=sapply(c(sin,cos),function(x)head(x(seq(0, 2,length.out=n+1)*pi),-1))
corners=corners*min(2/diff(apply(corners,2,range)))
corners[,2]=corners[,2]-mean(range(corners[,2]))

xy=as.data.frame(as.matrix(t)%*%corners)
grid=as.data.frame(rbind(cbind(corners,rbind(corners[-1,],corners[1,])),cbind(corners,matrix(apply(corners,2,mean),ncol=2,nrow=n,byrow=T))))

dist=as.data.frame(as.matrix(dist(t)))
seg=lapply(1:4,function(i)apply(dist,1,function(x) unlist(xy[names(sort(x)[i]),],use.names=F))%>%t%>%cbind(xy))%>%do.call(rbind,.)%>%setNames(paste0("V",1:4))
xy$k=as.factor(cutree(hclust(dist(t)),16))

hue=c(0,30,60,90,130,180,210,240,280,320)
pal1=c(hex(HSV(hue[-c(8,9)],.5,1)),hex(HSV(hue,.25,1)))

expand=c(.02,.02)

angle=head(seq(360,0,length.out=n+1),-1)
angle=ifelse(angle>90&angle<=270,angle+180,angle)

ggplot(xy,aes(x=V1,y=V2))+
geom_polygon(data=as.data.frame(corners),fill="gray40")+
geom_text(data=as.data.frame(corners),aes(x=1.04*V1,y=1.04*V2),label=names(t),size=3.2,angle=angle,color="gray80")+
geom_segment(data=grid,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray50",size=.4)+
geom_mark_hull(aes(group=k,color=k,fill=k),concavity=1000,radius=unit(.3,"cm"),expand=unit(.3,"cm"),alpha=.15,size=.15)+
geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray20",size=.3)+
geom_point(aes(color=k),size=.5)+
geom_text(aes(label=rownames(xy),color=k),size=2.2,vjust=-.6)+
# geom_text_repel(aes(label=rownames(xy),color=k),max.overlaps=Inf,force=4,force_pull=2,size=2.2,segment.size=.2,min.segment.length=.2,box.padding=.05)+
coord_fixed(xlim=(1+expand[1])*c(-1,1),ylim=(1+expand[2])*c(-1,1))+
scale_fill_manual(values=pal1)+
scale_color_manual(values=pal1)+
theme(
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
legend.position="none",
panel.background=element_rect(fill="gray30"),
panel.grid=element_blank(),
plot.background=element_rect(fill="gray30",color=NA,size=0),
plot.margin=margin(0,0,0,0)
)

ggsave("t/a.png",width=9,height=9/(Reduce(`/`,1+expand)))


I don't know what all this gobbledegook is about but the Kalmyks are no Europeans by any stretch. They are simply 17th century immigrants from Mongolia. By this logic the French Canadians are pure Native Canadians.

Europe is a multiracial continent that is populated by the wog race, the white race, and the Turco-Uralo-Mongolic race.

Parts of Europe have been populated by peoples with 50% or higher Mongoloid ancestry since at least the time of Bolshoy Oleni Ostrov almost 4,000 years ago. And even before Nenetses, the area of Nenetsia was inhabited by Sikhirtya, who were described as having Mongoloid appearance (https://avaldsnes.info/en/informasjon/hjor/).

Even before the Kalmyk expansion, the area of Kalmykia was part of the Xacitarxan Khanate.

Some Kalmyks like this pass as Europeans (Kalmyk or Nenets or Kazakh) but not as unmixed East Asians:

https://i.imgur.com/34dDY2c.jpg
https://vk.com/public53212025?z=photo-53212025_421892640%2Falbum-53212025_/173547324

Komintasavalta
05-09-2021, 02:03 PM
Here's Dodecad K12b:

https://i.ibb.co/4NSWpbt/dodecadk12b.png

The clustering would work better if I was somehow able to take the table of FST distances between each component into account:

http://1.bp.blogspot.com/-kXZ8Mxu5dns/TybJ7CQJuPI/AAAAAAAAEbk/QYJc4rvQ3ww/s1600/fst.png

vbnetkhio
05-09-2021, 02:21 PM
The clustering would work better if I was somehow able to take the table of FST distances between each component into account:

http://1.bp.blogspot.com/-kXZ8Mxu5dns/TybJ7CQJuPI/AAAAAAAAEbk/QYJc4rvQ3ww/s1600/fst.png


I tried something like this recently:


a <- read.table("results.csv", header = TRUE, row.names=1)
b <- read.table("fst_distances.csv", header = TRUE, row.names=1)

a <- as.matrix(a)
b <- as.matrix(b)

c <- a %*% b

write.table(c, file = "fst_scaled.txt")

i didn't like the result. Basically all Europeans end up more similar to each other, and some Hungarians with a tiny bit of Asian were bigger outliers.

gixajo
05-09-2021, 02:22 PM
Something like Dirichlet-type distribution?

I found a very good image to understand PCAs in multivariate statistics in a simple and intuitive way, but I can't find it right now.

Peterski
05-09-2021, 03:01 PM
Very neat-looking! What software did you use to create these diagrams? R ???

Komintasavalta
05-09-2021, 03:55 PM
I tried something like this recently:


a <- read.table("results.csv", header = TRUE, row.names=1)
b <- read.table("fst_distances.csv", header = TRUE, row.names=1)

a <- as.matrix(a)
b <- as.matrix(b)

c <- a %*% b

write.table(c, file = "fst_scaled.txt")

i didn't like the result. Basically all Europeans end up more similar to each other, and some Hungarians with a tiny bit of Asian were bigger outliers.

That's exactly what I thought about doing, but I didn't think it would be that simple. But it actually worked. I made a new version of the Dodecad k12b graph in my previous post where I multiplied the matrix of admixture percentages by the FST matrix. It reduced the number of clusters in Europe and Caucasus, because the North_European, Atlantic_Med, and Caucasus components have low FST distances with each other. But it expectedly increased the number of clusters in Africa. Previously the three closest neighbors of Selkups were Kets, Dolgans, and Yukaghirs, because they all have a high proportion of the Siberian component, which is an Nganasan-like central-north Siberian component. Selkups were relatively far from Siberian populations with low Siberian and high Southeast_Asian, like Altaians. After multiplying by the FST matrix, Selkups became closer to southern Siberians like Altaians.

My script also uses matrix multiplication to calculate the coordinates inside the polygon. For example, these are the corners of an equilateral triangle centered at the origin with radius 1:


> triangle=sapply(c(sin,cos),function(x)head(x(seq(0,2,length.out=3+1)*pi),-1))
> triangle
[,1] [,2]
[1,] 0.0000000 1.0
[2,] 0.8660254 -0.5
[3,] -0.8660254 -0.5

These were admixture proportions in one K=3 run:


> admix=read.table(text="Saami.DG 0.238969 0.761021 0.000010\nMansi.DG 0.534995 0.464994 0.000010",row.names=1)

Then the x and y coordinates inside the triangle would be these:


> as.matrix(admix)%*%triangle
[,1] [,2]
Saami.DG 0.6590549 -0.1415465
Mansi.DG 0.4026880 0.3024930

vbnetkhio
05-09-2021, 04:10 PM
aaah so that's what you're doing.

it's also how the "location predictors" like these work:

https://gen3553.pagesperso-orange.fr/ADN/Europe.htm
https://gen3553.pagesperso-orange.fr/ADN/K15.htm

here i scaled k36 averages into real-life geographic coordinates:
https://www.theapricity.com/forum/showthread.php?303248-k36-schematic-map-(-quot-PCA-quot-)

(north atlantic is scaled to London, Italian to Rome etc.)
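A minimal sketch of the same idea, assuming a location predictor is nothing more than a weighted average of reference coordinates. The two reference points use approximate coordinates for London and Rome, the sample proportions are invented, and real predictors may well do something more elaborate:

```r
# Reference coordinates for two components (approximate longitude and
# latitude of London and Rome).
refs <- rbind(
  North_Atlantic = c(lon = -0.13, lat = 51.51),
  Italian        = c(lon = 12.50, lat = 41.90)
)

# Hypothetical sample with 75% of one component and 25% of the other.
props <- rbind(sample1 = c(North_Atlantic = 0.75, Italian = 0.25))

# Predicted location: the same matrix multiplication that places a
# sample inside the polygon, but with geographic corner coordinates.
pred <- props %*% refs
print(pred)
```

The only difference from the polygon plots is that the corners are placed at real-world coordinates instead of at the corners of a regular polygon.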

Komintasavalta
05-09-2021, 04:22 PM
Very neat-looking! What software did you use to create these diagrams? R ???

Yeah.

On macOS, you can run my scripts like this:


brew install r
brew install udunits # needed by ggforce
R -e 'install.packages(c("tidyverse","ggforce","ggrepel"),repos="https://cloud.r-project.org")'
Rscript path/to/script.R

I think a lot of Windows users just use the RStudio IDE: https://www.rstudio.com. But I hate GUIs and I use R from Emacs:

https://i.ibb.co/0CJg4zN/a.jpg

gixajo
05-09-2021, 04:28 PM
This could be a good thread to post this link without it being completely off topic, since several people with an interest in this type of thing are gathered here:

https://cran.r-project.org/web/views/Graphics.html

Komintasavalta
05-09-2021, 07:09 PM
Here's Dodecad K7b.

When I calculated the clusters and the nearest neighbors, I now multiplied the matrix of admixture percentages by the square root of the matrix of FST distances between the admixture components. When I didn't take the square root of the FST distances, the effect seemed too radical, and even Saudis were part of the same cluster as Finns.

https://i.ibb.co/WzPzqq3/dodecadk7b.jpg
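A toy sketch of the weighting step, assuming the square root is taken element by element (the post doesn't say whether it is an element-wise or a matrix square root). The three components, their FST values, and the three populations are all invented:

```r
# Hypothetical FST distances between three components: the first two are
# close to each other and the third is far from both.
fst <- matrix(c(0.00, 0.01, 0.16,
                0.01, 0.00, 0.16,
                0.16, 0.16, 0.00), nrow = 3, byrow = TRUE)

# Hypothetical admixture proportions.
admix <- rbind(pop1 = c(0.7, 0.3, 0.0),
               pop2 = c(0.3, 0.7, 0.0),
               pop3 = c(0.0, 0.1, 0.9))

# Weight the proportions by the element-wise square root of FST.
weighted <- admix %*% sqrt(fst)

# Compare how far pop1 is from pop2 relative to pop3, before and after.
d_plain    <- as.matrix(dist(admix))
d_weighted <- as.matrix(dist(weighted))
ratio_plain    <- d_plain["pop1", "pop2"] / d_plain["pop1", "pop3"]
ratio_weighted <- d_weighted["pop1", "pop2"] / d_weighted["pop1", "pop3"]
print(c(plain = ratio_plain, weighted = ratio_weighted))
```

In this toy case pop1 and pop2 only differ in two components with a low FST distance, so the weighting moves them much closer together relative to pop3. That is the same kind of effect as European populations becoming more similar to each other.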

vbnetkhio
05-09-2021, 07:13 PM
Yeah.


did you ever try calculating FST distances between populations with smartpca?

i tried now and i get this in the logfile:

"population: 0 Case 3450"

it recognizes 0 populations for some reason

Komintasavalta
05-09-2021, 07:24 PM
did you ever try calculating FST distances between populations with smartpca?

i tried now and i get this in the logfile:

"population: 0 Case 3450"

it recognizes 0 populations for some reason

You need to add population numbers to the sixth field of the fam file: https://www.biostars.org/p/266511/. The commands below use integers starting from 10 as group identifiers, because the numbers 1, 2, and 9 have a special meaning (1 assigns the line as a control, 2 assigns it as a case, and 9 ignores it).

`phylipoutname: fstfilename` saves an FST matrix to a file, but in the file the FST values only have three digits after the decimal point. There's also the undocumented parameter `fsthiprecision: YES`, which causes the FST values that are printed to STDOUT to be multiplied by a million instead of a thousand, but it doesn't affect the contents of the `phylipoutname` file.

If an FST run includes more than 100 populations, SmartPCA exits with an error unless you include a parameter like `maxpops: 1000`.


x=uralic
sed 1d v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '!a[$3]++{print$2,$8}'|awk 'NR==FNR{a[$0];next}$2 in a' <(printf %s\\n Besermyan Enets Estonian Finnish Hungarian Karelian Mansi Mordovian Nganasan Saami.DG Selkup Udmurt Veps) ->$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
awk '!a[$2]++{i++}{print$1,i}' <(sort -k2 $x.pick)|awk 'NR==FNR{a[$1]=$2;next}{$6=a[$2]+9}1' - $x.fam>$x.famtemp;mv $x.fam{temp,}
smartpca -p <(printf %s\\n genotypename:\ $x.bed snpname:\ $x.bim indivname:\ $x.fam fstonly:\ YES fsthiprecision:\ YES)|tee $x.smartpca
p=$(awk 'NR==FNR{a[$1]=$2;next}{print a[$2]}' $x.{pick,fam}|awk '!a[$0]++')
sed -n '/fst \*1000000/,/^$/p' $x.smartpca|sed 1,2d|sed \$d|tr -s ' ' ,|cut -d, -f3-|paste -d, <(printf %s\\n "$p") -|cat <(printf %s\\n '' "$p"|paste -sd,) ->$x.fst
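
The renumbering of the sixth field is the part that is easiest to get wrong, so here is a more verbose sketch of just that step on its own. The function name `number_fam_populations` is hypothetical, and unlike the one-liner above it numbers populations in order of first appearance in the pick file instead of sorting them first.

```shell
# number_fam_populations PICK_FILE FAM_FILE  (hypothetical helper)
# PICK_FILE: "sample_id population" per line (same layout as the .pick files above)
# FAM_FILE:  PLINK .fam file; the sample ID is in field 2
# Prints the .fam lines with field 6 set to 10, 11, 12, ... per population.
number_fam_populations() {
  awk '
    NR == FNR {
      if (!($2 in num)) num[$2] = 10 + n++   # first population gets 10
      pop[$1] = $2
      next
    }
    $2 in pop { $6 = num[pop[$2]] }          # samples not in PICK_FILE are left unchanged
    { print }
  ' "$1" "$2"
}
```

You would redirect the output to a temporary file and move it over the original fam file, like the `famtemp` step above does.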

Maybe you're supposed to do LD pruning before calculating FST, because Kerminen et al. 2021 (https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009347) said this: "We calculated pairwise-FST between the reference groups (Fig 2) and the ancestor candidate groups (S9 Fig) using SmartPCA of EIGENSOFT package[7] (fstonly: YES, fsthiprecision: YES) and 56,661 LD-independent variants."

vbnetkhio
05-09-2021, 09:07 PM
You need to add population numbers to the sixth field of the fam file: https://www.biostars.org/p/266511/. The commands below use integers starting from 10 as group identifiers, because the numbers 1, 2, and 9 have a special meaning (1 assigns the line as a case, 2 assigns it as a control, and 9 ignores it).

`phylipoutname: fstfilename` saves an FST matrix to a file, but in the file the FST values only have three digits after the decimal point. There's also the undocumented parameter `fsthiprecision: YES` which causes the FST values that are printed to STDOUT to be multiplied by million instead of thousand, but it doesn't affect the contents of the `phylipoutname` file.

If an FST run includes more than 100 populations, SmartPCA exits with an error unless you include a parameter like `maxpops: 1000`.

So I ended up with code like this:


x=uralic
printf %s\\n Besermyan Enets Estonian Finnish Hungarian Karelian Mansi Mordovian Nganasan Saami.DG Selkup Udmurt Veps>$x.pop
sed 1d v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '!a[$3]++{print$2,$8}'|awk 'NR==FNR{a[$0];next}$2 in a' $x.pop ->$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
awk '!a[$2]++{i++}{print$1,i}' $x.pick|awk 'NR==FNR{a[$1]=$2;next}{$6=a[$2]+9}1' - $x.fam>$x.famtemp;mv $x.fam{temp,}
smartpca -p <(printf %s\\n genotypename:\ $x.bed snpname:\ $x.bim indivname:\ $x.fam fstonly:\ YES fsthiprecision:\ YES)|tee $x.smartpca
p=$(cut -d' ' -f2 $x.pick|awk '!a[$0]++');sed -n '/fst \*1000000/,/^$/p' $x.smartpca|sed 1,2d|sed \$d|tr -s ' ' ,|cut -d, -f3-|paste -d, <(echo "$p") -|cat <(printf %s\\n '' "$p"|paste -sd,) ->$x.fst

Maybe you're supposed to do LD pruning before calculating FST, because Kerminen et al. 2021 (https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009347) said this: "We calculated pairwise-FST between the reference groups (Fig 2) and the ancestor candidate groups (S9 Fig) using SmartPCA of EIGENSOFT package[7] (fstonly: YES, fsthiprecision: YES) and 56,661 LD-independent variants."

would there be any problems with calculating fst this way:

run supervised admixture, assign each sample to its population, and run with as many K as there are populations, and fst gets written to the output.

http://dalexander.github.io/admixture/admixture-manual.pdf

smartpca's version takes forever, this could actually be faster? because there are no unassigned samples, just the allele frequencies and fst will be calculated

vbnetkhio
05-09-2021, 09:43 PM
the calculation on (most of) evolbio's database has finished:

fst *1000 version:
https://pastebin.com/raw/UzeqH7Dr

s.dev * 1000000 version:
https://pastebin.com/raw/3HDjLwGU

phylip version
https://pastebin.com/raw/NxZUj10x

sadly, i messed up the pop numbers, so now there's a Corsico-Croat and a Germano-Greek population :rotfl:

the rest should be fine

Komintasavalta
05-09-2021, 09:47 PM
would there be any problems with calculating fst this way:

run supervised admixture, assign each sample to it's population, and run with as many K as there is populations, and fst gets written to the output.

http://dalexander.github.io/admixture/admixture-manual.pdf

smartpca's version takes forever, this could actually be faster? because there are no unassigned samples, just the allele frequencies and fst will be calculated

Did you do LD pruning? The paper by Kerminen et al. said that they only used about 60,000 SNPs to calculate FST, even though their paper was about Finnish subpopulations.

The ADMIXTURE manual says this:


2.4 How many markers do I need to supply to ADMIXTURE?

This depends on how genetically differentiated your populations are, and on what you plan to do with the estimates. It has been noted elsewhere [4] that the number of markers needed to resolve populations in this kind of analysis is inversely proportional to the genetic distance (FST) between the populations.

It is also noted in that paper that more markers are needed to perform adequate GWAS correction than are needed to simply observe the population structure.

As a rule of thumb, we have found that 10,000 markers suffice to perform GWAS correction for continentally separated populations (for example, African, Asian, and European populations FST > .05) while more like 100,000 markers are necessary when the populations are within a continent (Europe, for instance, FST < 0.01).

Using supervised ADMIXTURE to calculate FST actually works, but it's slower than SmartPCA because the five priming steps take a long time:


$ cut -d' ' -f2 uralic.pick|awk '!a[$0]++'>uralic.pop
$ admixture -j4 --supervised uralic.bed 13
**** ADMIXTURE Version 1.3.0 ****
**** Copyright 2008-2015 ****
**** David Alexander, Suyash Shringarpure, ****
**** John Novembre, Ken Lange ****
**** ****
**** Please cite our paper! ****
**** Information at www.genetics.ucla.edu/software/admixture ****

Parallel execution requested. Will use 4 threads.
Random seed: 43
Point estimation method: Block relaxation algorithm
Convergence acceleration algorithm: QuasiNewton, 3 secant conditions
Point estimation will terminate when objective function delta < 0.1
Estimation of standard errors disabled; will compute point estimates only.
Supervised analysis mode. Examining .pop file...
Size of G: 181x597573
Performing five EM steps to prime main algorithm
1 (EM) Elapsed: 18.302 Loglikelihood: -7.38566e+07 (delta): 1.40226e+08
2 (EM) Elapsed: 18.246 Loglikelihood: -7.38526e+07 (delta): 3976.64
3 (EM) Elapsed: 20.037 Loglikelihood: -7.38526e+07 (delta): 0.176208
4 (EM) Elapsed: 20.448 Loglikelihood: -7.38526e+07 (delta): 0.0233383
5 (EM) Elapsed: 20.357 Loglikelihood: -7.38526e+07 (delta): 0.015767
Initial loglikelihood: -7.38526e+07
Starting main algorithm
1 (QN/Block) Elapsed: 12.214 Loglikelihood: -7.38526e+07 (delta): 0
Summary:
Converged in 1 iterations (113.629 sec)
Loglikelihood: -73852623.464629
Fst divergences between estimated populations:
       Pop0   Pop1   Pop2   Pop3   Pop4   Pop5   Pop6   Pop7   Pop8   Pop9   Pop10  Pop11
Pop0
Pop1   0.043
Pop2   0.041  0.015
Pop3   0.043  0.023  0.021
Pop4   0.062  0.040  0.039  0.042
Pop5   0.056  0.046  0.042  0.041  0.061
Pop6   0.047  0.017  0.018  0.026  0.043  0.050
Pop7   0.117  0.083  0.086  0.096  0.111  0.121  0.086
Pop8   0.053  0.024  0.025  0.033  0.050  0.055  0.026  0.091
Pop9   0.075  0.049  0.049  0.056  0.073  0.076  0.050  0.113  0.055
Pop10  0.066  0.033  0.035  0.044  0.061  0.069  0.035  0.096  0.040  0.064
Pop11  0.069  0.034  0.037  0.047  0.064  0.074  0.036  0.097  0.041  0.065  0.048
Pop12  0.175  0.140  0.142  0.154  0.168  0.179  0.139  0.192  0.142  0.160  0.148  0.144
Writing output files.
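
ADMIXTURE only prints the lower triangle of the FST matrix, so to use it as a distance matrix elsewhere it has to be mirrored into a full square matrix. Below is a rough sketch that does this from a saved copy of the standard output. The function name `admixture_fst_to_csv` is made up, and the script keeps the `PopN` labels as they are, because the output above does not say which label belongs to which population.

```shell
# admixture_fst_to_csv LOG_FILE  (hypothetical helper)
# LOG_FILE: saved standard output of an ADMIXTURE run, for example made with tee
admixture_fst_to_csv() {
  awk '
    /^Fst divergences/ { grab = 1; next }     # the table starts after this line
    grab == 1 { grab = 2; next }              # skip the header row of column labels
    grab == 2 && $1 ~ /^Pop[0-9]+$/ {
      name[++n] = $1
      for (j = 2; j <= NF; j++) { f[n, j - 1] = $j; f[j - 1, n] = $j }
      next
    }
    grab == 2 { grab = 0 }                    # any other line ends the table
    END {
      hdr = ""
      for (i = 1; i <= n; i++) hdr = hdr "," name[i]
      print hdr
      for (i = 1; i <= n; i++) {
        row = name[i]
        for (j = 1; j <= n; j++) row = row "," (i == j ? 0 : f[i, j])
        print row
      }
    }
  ' "$1"
}
```

The result has the same layout as the `$x.fst` files made from the SmartPCA output earlier in the thread (an empty top-left cell, then names in the first row and the first column), so it can be read with `read.csv(...,row.names=1)`.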

You can also use ADMIXTOOLS 2 to calculate FST, but it's slower than SmartPCA (the `f2m` function converts FST or f2 pairs to a square matrix):


$ R -e 'library("admixtools");f2m=function(x){t=as.data.frame(x[,1:3]);t2=rbind(t,setNames(t[,c(2,1,3)],names(t)));xtabs(t2[,3]~t2[,2]+t2[,1])};fst=fst("v44.3_HO_public",c("Besermyan","Enets","Estonian","Finnish","Hungarian","Karelian","Mansi","Mordovian","Nganasan","Saami.DG","Selkup","Udmurt","Veps"));write.csv(f2m(fst),"fst",quote=F)'

Without LD pruning, calculating FST for the 13 populations listed in my previous post took 28 seconds with SmartPCA, 52 seconds with ADMIXTOOLS 2, and 113 seconds with ADMIXTURE. After I ran `--indep-pairwise 50 10 .1`, it only took about 3 seconds with SmartPCA.

vbnetkhio
05-09-2021, 09:53 PM
you can use the multithreaded mode, the priming steps are much faster. e.g. you can add the flag -j8 for 8 threads.

I decided to try without LD pruning first.
in studies pruning is usually very light, or none. a higher number of SNPs is considered much more valuable.

vbnetkhio
05-09-2021, 10:02 PM
pca from the phylip fst file:
https://i.imgur.com/ZuFgXtJ.png

non-metric MDS, reflects geography much better:
https://i.imgur.com/jQLGpxi.png

i used the "user-supplied distance" option, it tells the program that the data is already a distance matrix, and not raw data.
the Samaritans are outliers for some reason. "and" are Amerindians from the andes.

UPGMA clustering, also with user supplied distance

https://i.imgur.com/uqrWNN6.png

Komintasavalta
05-09-2021, 10:45 PM
sadly, i messed up the pop numbers, so now there's a Corsico-Croat and a Germano-Greek population :rotfl:

Sorry, my code printed the population names in the wrong order. I edited my post on the previous page.


pca from the phylip fst file:

You can do classical MDS instead because the input is already a distance matrix: `cmdscale(as.dist(read.csv("input.fst",row.names=1)))`.

`cmdscale(dist(df))` produces the same coordinates as the first two columns of `prcomp(df)$x` (except the signs of some dimensions may arbitrarily be flipped).

vbnetkhio
05-09-2021, 11:00 PM
Sorry, my code printed the population names in the wrong order. I edited my post on the previous page.



You can do classical MDS instead because the input is already a distance matrix: `cmdscale(as.dist(read.csv("input.fst",row.names=1)))`.

`cmdscale(dist(df))` produces identical coordinates to `prcomp(df)$x` (except the signs of some dimensions may arbitrarily be flipped).

it's not your fault, i did it with vlookup in libreoffice.

Lucas
05-10-2021, 07:33 AM
It's a new way to see genetics, very interesting! :thumb001:

Yes, Komin brings us beautiful new visualization scripts every day. I wonder what he will show in a year.

Komintasavalta
05-10-2021, 08:29 AM
Here are ADMIXTURE runs of Ural-Altaic populations. The clustering and nearest neighbors are still based on a matrix made by concatenating the columns of the admixture weights at different K values. I also tried calculating the clustering and nearest neighbors based on FST distances returned by SmartPCA, but it made less sense.

https://i.ibb.co/Pw8959B/3.jpg
https://i.ibb.co/VQN2d7P/4.jpg
https://i.ibb.co/1bp3tBk/5.jpg
https://i.ibb.co/54zpf0C/6.jpg
https://i.ibb.co/9VTdGVk/7.jpg

From these runs, you can see that Tofalars have a lot of Nganasan-like ancestry. They are a people related to Tuvans who live north of the Tuvan Republic in the Eastern Sayan. Todzins are a northern subgroup of Tuvans who also have Samoyedic ancestry (https://ru.wikipedia.org/wiki/Тувинцы-тоджинцы): "As the researchers believe, the Todzhins formed as a result of the mixing of the ancient Keto-speaking and Samoyedic population with the Dubo tribes who moved to the Sayan-Altai region [7] [8]."

Zabolotniye Tatars (Swamp Tatars) are a northern subgroup of Siberian Tatars who live in the swamp regions between the Siberian Tatar and Ob-Ugric territories. In these runs, Swamp Tatars were closer to Mansi than to other Siberian Tatars. According to physical anthropology, Swamp Tatars are also similar to Khanty (https://ru.wikipedia.org/wiki/Сибирские_татары):


- the Uralic type is the main one for all groups of Siberian Tatars occupying the northern area of their residence, and as a component is traced in more southern groups of Siberian Tatars.
- the South Siberian type is characteristic primarily of the Turks of the Barabinsk steppe and as an admixture is noted in almost all Siberian Tatars, with a tendency to increase in the southern, steppe groups and to decrease in the northern, forest groups.

The Central Asian type was recorded among the Barabinians. Some groups of Tobolsk and Tomsk Tatars have the Chulym type. The Zabolotnye Tatars are extremely close to the Berezovsky Khanty [4].

Komintasavalta
05-10-2021, 03:03 PM
Yes, Komin brings us beautiful new visualization scripts every day. I wonder what he will show in a year.

Yeah I just started using R 3 months ago, but I already have 230 files in my directory for R scripts.

I now also made a script for making a stacked bar chart of an ADMIXTURE run which uses the same clustering method as the scripts in this thread, where the clustering is based on a combined matrix of ADMIXTURE runs at different K values. The colors of the population labels are based on cutting the clustering tree in 12 parts.

https://i.imgur.com/6acb3cl.jpg


library(tidyverse)
library(ggdendro)
library(vegan)
library(colorspace)
library(cowplot)

t=read.table("https://pastebin.com/raw/FEwYnBNb",row.names=1) # population averages from ADMIXTURE with population names in first column

t=t[,c(3,2,5,1,4)] # reorder columns (change if your input does not have five columns)
names(t)=paste0("V",1:ncol(t))

# do clustering based on a combined matrix of admixture weights at different K values
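joined=sapply(3:8,function(i)read.table(paste0("uralaltaic.",i,".ave"))[,-1])%>%do.call(cbind,.)%>%magrittr::set_rownames(rownames(t)) # `set_rownames` comes from magrittr, which `library(tidyverse)` does not attach
hc=hclust(dist(joined))
hc=reorder(hc,-as.matrix(t)%*%seq(ncol(t))^2)
dist=as.matrix(dist(joined))
maxdist=which(dist==max(dist),arr.ind=T)[1,] # row and column of the most distant pair of populations
hc=reorder(hc,dist[,maxdist[1]]-dist[,maxdist[2]]) # reorder branches based on distance to that pair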

# fst=read.csv("https://pastebin.com/raw/ktMkDf24",row.names=1)[rownames(t),rownames(t)]
# fst[fst<0]=0
# maxfst=which(fst==max(fst))[1] # reorder branches based on distance to the pair of populations with the highest FST distance
# hc=reorder(hclust(as.dist(fst)),fst[,maxfst%%nrow(fst)]-fst[,maxfst%/%nrow(fst)+1])
# # hc=reorder(hclust(dist(t)),-as.matrix(t)%*%exp(seq(ncol(t)))) # reorder branches based on the order of the bars

k=as.factor(cutree(hc,12))[hc$labels[hc$order]]

tree=ggdendro::dendro_data(as.dendrogram(hc),type="triangle")

p1=ggplot(ggdendro::segment(tree))+
geom_segment(aes(x=y,y=x,xend=yend,yend=xend),size=.5,lineend="round",color="gray85")+ # `lineend="round"` draws corners properly when not using `type="triangle"`
scale_x_continuous(expand=expansion(mult=c(0,.01)))+ # don't crop a few pixels from the right border of the tree
scale_y_continuous(limits=.5+c(0,nrow(t)),expand=c(0,0))+
theme(
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.ticks.length=unit(0,"pt"), # remove extra space normally occupied by tick marks
axis.title=element_blank(),
panel.background=element_rect(fill="gray30"),
panel.grid=element_blank(),
plot.background=element_rect(fill="gray30",color=NA), # `color=NA` removes a thin white border around the plot
plot.margin=margin(5,5,5,0)
)

t=t[hc$labels[hc$order],]
t2=data.frame(V1=rownames(t)[row(t)],V2=colnames(t)[col(t)],V3=unname(do.call(c,t))) # an alternative to `pivot_longer` and `melt`
lab=round(100*t2$V3)
lab[lab<=1]=""

pal1=colorspace::hex(HSV(c(30,210,250,310,0),.4,.9))
pal2=colorspace::hex(HSV(c(30,210,250,310,0),.4,.2))
pal3=hex(HSV(seq(0,360,length.out=n_distinct(k)+1)%>%head(-1),.4,1))

p2=ggplot(t2,aes(x=factor(V1,level=rownames(t)),y=V3,fill=V2))+
geom_bar(stat="identity",width=1,position=position_fill(reverse=T),size=.2,color="gray10")+
geom_text(aes(label=lab),position=position_stack(vjust=.5,reverse=T),size=3.5,color="gray10")+
coord_flip()+
scale_x_discrete(expand=c(0,0))+
scale_y_discrete(expand=c(0,0))+
scale_fill_manual(values=pal1)+
theme(
axis.text=element_text(color=pal3[k],size=11),
axis.text.x=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
legend.position="none",
plot.background=element_rect(fill="gray30",color=NA),
panel.background=element_rect(fill="gray30"),
plot.margin=margin(5,0,5,5)
)

cowplot::plot_grid(p2,p1,rel_widths=c(1,.4))
ggsave("a.png",height=.27*nrow(t),width=7)

Lucas
05-10-2021, 05:43 PM
Yeah I just started using R 3 months ago, but I already have 230 files in my directory for R scripts.

I now also made a script for making a stacked bar chart of an ADMIXTURE run which uses the same clustering method as the scripts in this thread, where the clustering is based on a combined matrix of ADMIXTURE runs at different K values. The colors of the population labels are based on cutting the clustering tree in 12 parts.


I think your scripts should be used in academic papers. They are much better than some of their visualizations of ADMIXTURE charts or PCA plots, which graphically look like shit from 20 years ago.

Lucas
05-10-2021, 05:49 PM
You should make a GitHub page with all those scripts. Then they could be officially used by others in their papers.

Komintasavalta
05-10-2021, 08:59 PM
If you do a non-SSA ADMIXTURE run with two components, you can estimate the amount of East Eurasian ancestry in different populations:


$ sort -rnk2 maalima2.i.2a|awk '{printf"%.1f %s\n",100*$2,$1}'
100.0 Zhuang
100.0 Tujia
100.0 Tibetan_Yunnan
100.0 She
100.0 Qiang
100.0 Nivh
100.0 Negidal
100.0 Naxi
100.0 Nanai
100.0 Mulam
100.0 Miao
100.0 Maonan
100.0 Li
100.0 Korean
100.0 Gelao
100.0 Dong
100.0 Ami
100.0 Atayal
99.9 Dai
99.9 Han
99.9 Japanese
99.8 Vietnamese
99.8 Yi
99.8 China_Lahu
99.8 Ulchi
99.7 Kankanaey
99.7 Murut
99.4 Kinh
99.2 Hezhen
98.7 Ilocano
98.4 Oroqen
98.3 Sherpa
98.2 Xibo
98.1 Dusun
97.7 Daur
97.6 Tibetan
97.4 Yugur
97.0 Mongola
96.5 Rai
95.5 Yukagir_Tundra
94.9 Nganasan
94.8 Bonan
94.6 Koryak
94.5 Visayan
94.5 Tu
94.0 Evenk_Transbaikal
93.8 Itelmen
93.6 Gurung
93.2 Chukchi
93.2 Tagalog
92.0 Chukchi1
91.4 Eskimo_ChaplinSireniki
90.9 Eskimo_Naukan
90.3 Salar
89.7 Cambodian
89.7 Khamnegan
89.6 Thai
89.4 Dungan
88.6 Dongxiang
88.6 Magar
88.6 Malay
88.5 Yakut
87.3 Todzin
87.1 Buryat
86.9 Tamang
86.8 Burmese
86.2 Mongol
83.7 Dolgan
83.5 Tofalar
83.4 Evenk_FarEast
83.2 Tuvinian
83.2 Karitiana
82.0 Kalmyk
81.2 Piapoco
81.2 Mixe
79.7 Surui
78.5 Pima
78.5 Kusunda
77.2 Zapotec
76.2 Mixtec
76.0 Enets
75.7 Kazakh_China
74.7 Khakass_Kachin
74.2 Altaian
72.7 Bolivian
71.9 Mayan
71.6 Kyrgyz_China
70.8 Quechua
70.8 Nasioi
69.2 Kyrgyz_Kyrgyzstan
68.7 Tharu
68.7 Ket
68.7 Even
68.0 Kyrgyz_Tajikistan
67.9 Papuan
67.7 Khakass
66.1 Newar
65.3 Selkup
63.7 Kazakh
63.3 Shor_Khakassia
62.6 Shor_Mountain
62.6 Tubalar
62.6 Australian
58.1 Altaian_Chelkan
56.5 Karakalpak
55.1 Hazara
54.4 Uyghur
54.0 Nogai_Astrakhan
52.7 Mansi
51.7 Tatar_Siberian_Zabolotniye
48.7 Nogai_Stavropol
47.8 Tatar_Siberian
46.9 Yukagir_Forest
46.6 Tlingit
42.2 Bahun
42.0 Bengali
39.1 Uzbek
36.9 Aleut
35.5 Bashkir
35.4 Turkmen
33.2 Punjabi
32.0 GujaratiD
30.1 GujaratiC
29.1 Burusho
28.3 Udmurt
27.1 GujaratiB
26.3 Nogai_Karachay_Cherkessia
24.9 Besermyan
23.5 GujaratiA
23.4 Jew_Cochin
23.4 Chuvash
22.1 Sindhi_Pakistan
21.3 Tatar_Kazan
19.7 Tajik
19.6 Pathan
18.5 Kalash
15.6 Tatar_Mishar
15.5 Russian_Archangelsk_Leshukonsky
14.3 Balochi
13.9 Turkish_Balikesir
13.6 Brahui
13.1 Russian_Archangelsk_Pinezhsky
11.0 Makrani
10.6 Abazin
10.4 Kabardinian
10.4 Veps
9.6 Russian_Archangelsk_Krasnoborsky
9.5 Karachai
9.3 Balkar
9.1 Karelian
8.7 Circassian
8.4 Azeri
8.3 Mordovian
8.0 Finnish
8.0 Ossetian
7.1 Kumyk
7.1 Ezid
6.4 Turkish
6.1 Adygei
6.0 Ingushian
5.9 Iranian
5.8 Russian
5.1 Lak
4.9 Avar
4.9 Tabasaran
4.8 Chechen
4.6 Lezgin
4.6 Darginian
4.5 Kaitag
4.3 Kubachinian
3.3 Kurd
2.9 Estonian
2.6 Abkhasian
2.5 Belarusian
2.1 Gagauz
1.6 Ukrainian
1.6 Ukrainian_North
1.6 Lithuanian
1.3 Hungarian
1.1 Lebanese
1.1 Georgian
1.0 Jew_Iranian
1.0 Moldavian
0.9 Czech
0.8 Jew_Georgian
0.7 Norwegian
0.7 Syrian
0.7 Jew_Ashkenazi
0.7 Yemeni_Desert
0.6 Jordanian
0.6 Assyrian
0.6 Lebanese_Muslim
0.6 Bulgarian
0.5 Armenian
0.5 Armenian_Hemsheni
0.4 Croatian
0.4 Saudi
0.3 Yemeni_Northwest
0.3 BedouinA
0.3 Yemeni_Highlands
0.3 French
0.3 Egyptian
0.3 Romanian
0.3 English
0.2 Icelandic
0.2 Lebanese_Christian
0.2 Maltese
0.2 Scottish
0.2 Palestinian
0.2 Greek
0.2 Orcadian
0.1 Italian_North
0.1 Italian_South
0.1 Druze
0.1 Jew_Turkish
0.1 Albanian
0.1 Spanish
0.0 Jew_Iraqi
0.0 Jew_Moroccan
0.0 Basque
0.0 Jew_Yemenite
0.0 Spanish_North
0.0 Sicilian
0.0 Sardinian
0.0 Jew_Tunisian
0.0 Jew_Libyan
0.0 Cypriot
0.0 Canary_Islander
0.0 BedouinB

I first did a global K=3 run of modern samples, where I selected samples where the years BP field in the anno file was 0:


curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar;tar -xf v44.3_HO_public.tar
f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutang|Primate_Chimp|hg19ref')
x=maalima;awk -F\\t 'NR>1{print$2,$8}' v44.3_HO_public.anno|igno|grep -Ev '\.(SG|SDG|DG|WGA)'|grep -v _o|cut -d' ' -f1|awk -F\\t 'NR==FNR{a[$0];next}$2 in a&&$6==0&&(!a[$3]++){print$2,$8}' - v44.3_HO_public.anno>$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
plink --allow-no-sex --bfile $x --genome --out $x
awk 'FNR>1&&$10>=.3{print$2<$4?$2:$4}' $x.genome|awk 'NR==FNR{a[$0];next}!($1 in a)' - $x.pick>$x.i.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.i.pick v44.3_HO_public.fam) --make-bed --out $x.i
plink --allow-no-sex --bfile $x.i --indep-pairwise 50 10 .01 --out $x.i
plink --bfile $x.i --extract $x.i.prune.in --make-bed --out $x.i.p
tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1,i]+=$i}}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i,j]/n[i]);print o}}' "FS=${1-$'\t'}")
k=3;admixture -j4 -C .1 $x.i.p.bed $k;paste -d' ' <(awk 'NR==FNR{a[$1]=$2;next}{print$2,a[$2]}' $x.i.pick $x.i.p.fam) $x.i.p.$k.Q>$x.$k;cut -d' ' -f2- $x.$k|tav \ >$x.$k.ave
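
The `tav` helper is meant to average the columns of the Q matrix over the samples of each population. Here is a more verbose sketch of the same idea, with a made-up function name `pop_means` and an assumed input layout of one space-separated line per sample. Unlike `tav` it sorts the output by population name.

```shell
# pop_means FILE  (hypothetical helper)
# FILE: "population q1 q2 ... qK" per sample, space-separated
# Prints one line per population with the mean of each column.
pop_means() {
  awk '
    {
      if (NF > cols) cols = NF
      count[$1]++                                   # samples seen per population
      for (j = 2; j <= NF; j++) sum[$1, j] += $j    # per-population column sums
    }
    END {
      for (p in count) {
        line = p
        for (j = 2; j <= cols; j++) line = line sprintf(" %.6f", sum[p, j] / count[p])
        print line
      }
    }
  ' "$1" | sort
}
```

The important detail is that the `END` block loops over the array of sample counts, whose keys are plain population names, and not over the array of sums, whose keys also contain the column number.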

Then I selected samples that had less than 20% of the SSA component (excluding Australians and Papuans) and I did a new K=2 run.
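
That selection step could look something like the sketch below. It assumes that the input has the same layout as the `$x.$k` file made above (sample ID, population, then the K proportions), that you already know which column holds the SSA component, and that the parenthetical means Australians and Papuans were kept regardless of the threshold. The function name `keep_low_component` is made up.

```shell
# keep_low_component FILE COLUMN THRESHOLD [EXEMPT_POPULATION ...]  (hypothetical helper)
# FILE:   "sample_id population q1 ... qK" per line
# COLUMN: 1-based field number of the component to filter on
# Prints "sample_id population" for samples below THRESHOLD or in an exempt population.
keep_low_component() {
  f=$1; c=$2; m=$3
  shift 3
  awk -v col="$c" -v max="$m" -v exempt="$*" '
    BEGIN { n = split(exempt, e, " "); for (i = 1; i <= n; i++) keep[e[i]] = 1 }
    ($2 in keep) || $(col) < max { print $1, $2 }
  ' "$f"
}
```

For example `keep_low_component maalima.3 5 0.2 Australian Papuan >maalima2.pick` would produce a new pick file if the SSA component happened to be the third component. The file names here are only placeholders.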

Next I did a K=3 run of the non-SSA samples. Even though I excluded some North African populations with the highest SSA ancestry, the third component still became an SSA-like or basal-like component, where both Egyptians and Papuans have about 50% of the third component. Maltese have 22% of the third component and Greeks have 9%. Even Thais have 6% of the third component.

https://i.ibb.co/dPnHhg4/non-ssa-admix.png

At K=4, Americans and Siberians split off from East-Southeast Asians. Kets have 70% of the American-Siberian component because they have so much ANE.

The clustering and nearest neighbors are based on just the runs at K=2, K=3, and K=4, because the K=5 run has already taken more than an hour. Maybe I should've done more aggressive LD pruning, because almost 100,000 SNPs remained even after `--indep-pairwise 50 10 .03`.

https://i.ibb.co/0JY93Gd/4.jpg

Sorry for all these huge images, but regular-size images look like crap on a retina display.


You should make a GitHub page with all those scripts. Then they could be officially used by others in their papers.

I already deleted my websites and my Github account years ago, and I decided that I was no longer going to make any contributions to the world.

Github is too gay and post web 2.0 anyway. Oldschool static websites are nicer.

Also I don't think they would like to use a script that says `set.seed(1488)`.

Komintasavalta
05-11-2021, 09:15 AM
I had to leave it running overnight, but the runs at K=6, K=7, and K=8 now finished.

At K=7, I got a component that is similar to the Gedrosia component in Dodecad K7b. It is maximal in Kalash, Brahui, Sindhi_Pakistan, and Balochi. In the official K7b spreadsheet (http://dodecad.blogspot.com/2012/01/k12b-and-k7b-calculators.html), the Gedrosia component is maximal in Brahui, Balochi, Makrani, and Sindhi.

The European component is the highest in Lithuanians (96%) but it's the fifth highest in Spanish_North (94%) and the eighth highest in Basques (93%). In ADMIXTURE models where Southwestern Europeans have a high proportion of a European component, usually Uralic people have fairly high Mongoloid ancestry, and here also the proportion of the Nganasan component is 10% in Finns, 13% in Vepsians, and 30% in Udmurts.

Based on the links to the three nearest neighbors, there is a path from Finns to Mongols: first from Finnish to Veps, then to Tatar_Mishar, Tatar_Kazan, Chuvash, Udmurt, Aleut, Tlingit, Mansi, Altaian_Chelkan, Tubalar, Khakass, Altaian, Evenk_FarEast, Kalmyk, and then to Mongol. I didn't realize it until recently, but there is actually a huge genetic gap produced by the Gobi Desert, where Khalkha Mongols have a high genetic distance to Han and northern Chinese ethnicities. It is also visible in this image, where there is no line that connects Mongols to Hans, apart from lines that go through South Asians or Australians. However, my method for calculating the nearest neighbors could still be improved, because now one of the three closest neighbors of Australians is Karakalpaks.
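
The lines in the image come from a list of each population's three nearest neighbors. A minimal sketch of that lookup, assuming you already have a square distance matrix saved as CSV with names in the header row and first column (the function name `nearest_neighbors` is made up, and names must not contain spaces or commas):

```shell
# nearest_neighbors CSV [K]  (hypothetical helper, K defaults to 3)
# Prints "population neighbor distance" for the K closest other populations.
nearest_neighbors() {
  awk -F, '
    NR == 1 { for (j = 2; j <= NF; j++) name[j] = $j; next }   # column labels
    { for (j = 2; j <= NF; j++) if (name[j] != $1) print $1, name[j], $j }
  ' "$1" |
    LC_ALL=C sort -k1,1 -k3,3n |             # group by population, closest first
    awk -v k="${2:-3}" 'seen[$1]++ < k'      # keep the first K lines of each group
}
```

A path like the one from Finnish to Mongol is then just a chain of such links. An odd link like the one between Australians and Karakalpaks will still appear whenever a population has no close relatives in the dataset, because the K nearest neighbors are always printed no matter how far away they are.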

https://i.ibb.co/FDJ84T2/7.jpg

Lemminkäinen
05-11-2021, 10:17 AM
Structure makes this triangle straight from the genome data.

Zoro
05-11-2021, 10:28 AM
If you do a non-SSA ADMIXTURE run with two components, you can estimate the amount of East Eurasian ancestry in different populations:


$ sort -rnk2 maalima2.i.2a|awk '{printf"%.1f %s\n",100*$2,$1}'
100.0 Zhuang
100.0 Tujia
100.0 Tibetan_Yunnan
100.0 She
100.0 Qiang
100.0 Nivh
100.0 Negidal
100.0 Naxi
100.0 Nanai
100.0 Mulam
100.0 Miao
100.0 Maonan
100.0 Li
100.0 Korean
100.0 Gelao
100.0 Dong
100.0 Ami
100.0 Atayal
99.9 Dai
99.9 Han
99.9 Japanese
99.8 Vietnamese
99.8 Yi
99.8 China_Lahu
99.8 Ulchi
99.7 Kankanaey
99.7 Murut
99.4 Kinh
99.2 Hezhen
98.7 Ilocano
98.4 Oroqen
98.3 Sherpa
98.2 Xibo
98.1 Dusun
97.7 Daur
97.6 Tibetan
97.4 Yugur
97.0 Mongola
96.5 Rai
95.5 Yukagir_Tundra
94.9 Nganasan
94.8 Bonan
94.6 Koryak
94.5 Visayan
94.5 Tu
94.0 Evenk_Transbaikal
93.8 Itelmen
93.6 Gurung
93.2 Chukchi
93.2 Tagalog
92.0 Chukchi1
91.4 Eskimo_ChaplinSireniki
90.9 Eskimo_Naukan
90.3 Salar
89.7 Cambodian
89.7 Khamnegan
89.6 Thai
89.4 Dungan
88.6 Dongxiang
88.6 Magar
88.6 Malay
88.5 Yakut
87.3 Todzin
87.1 Buryat
86.9 Tamang
86.8 Burmese
86.2 Mongol
83.7 Dolgan
83.5 Tofalar
83.4 Evenk_FarEast
83.2 Tuvinian
83.2 Karitiana
82.0 Kalmyk
81.2 Piapoco
81.2 Mixe
79.7 Surui
78.5 Pima
78.5 Kusunda
77.2 Zapotec
76.2 Mixtec
76.0 Enets
75.7 Kazakh_China
74.7 Khakass_Kachin
74.2 Altaian
72.7 Bolivian
71.9 Mayan
71.6 Kyrgyz_China
70.8 Quechua
70.8 Nasioi
69.2 Kyrgyz_Kyrgyzstan
68.7 Tharu
68.7 Ket
68.7 Even
68.0 Kyrgyz_Tajikistan
67.9 Papuan
67.7 Khakass
66.1 Newar
65.3 Selkup
63.7 Kazakh
63.3 Shor_Khakassia
62.6 Shor_Mountain
62.6 Tubalar
62.6 Australian
58.1 Altaian_Chelkan
56.5 Karakalpak
55.1 Hazara
54.4 Uyghur
54.0 Nogai_Astrakhan
52.7 Mansi
51.7 Tatar_Siberian_Zabolotniye
48.7 Nogai_Stavropol
47.8 Tatar_Siberian
46.9 Yukagir_Forest
46.6 Tlingit
42.2 Bahun
42.0 Bengali
39.1 Uzbek
36.9 Aleut
35.5 Bashkir
35.4 Turkmen
33.2 Punjabi
32.0 GujaratiD
30.1 GujaratiC
29.1 Burusho
28.3 Udmurt
27.1 GujaratiB
26.3 Nogai_Karachay_Cherkessia
24.9 Besermyan
23.5 GujaratiA
23.4 Jew_Cochin
23.4 Chuvash
22.1 Sindhi_Pakistan
21.3 Tatar_Kazan
19.7 Tajik
19.6 Pathan
18.5 Kalash
15.6 Tatar_Mishar
15.5 Russian_Archangelsk_Leshukonsky
14.3 Balochi
13.9 Turkish_Balikesir
13.6 Brahui
13.1 Russian_Archangelsk_Pinezhsky
11.0 Makrani
10.6 Abazin
10.4 Kabardinian
10.4 Veps
9.6 Russian_Archangelsk_Krasnoborsky
9.5 Karachai
9.3 Balkar
9.1 Karelian
8.7 Circassian
8.4 Azeri
8.3 Mordovian
8.0 Finnish
8.0 Ossetian
7.1 Kumyk
7.1 Ezid
6.4 Turkish
6.1 Adygei
6.0 Ingushian
5.9 Iranian
5.8 Russian
5.1 Lak
4.9 Avar
4.9 Tabasaran
4.8 Chechen
4.6 Lezgin
4.6 Darginian
4.5 Kaitag
4.3 Kubachinian
3.3 Kurd
2.9 Estonian
2.6 Abkhasian
2.5 Belarusian
2.1 Gagauz
1.6 Ukrainian
1.6 Ukrainian_North
1.6 Lithuanian
1.3 Hungarian
1.1 Lebanese
1.1 Georgian
1.0 Jew_Iranian
1.0 Moldavian
0.9 Czech
0.8 Jew_Georgian
0.7 Norwegian
0.7 Syrian
0.7 Jew_Ashkenazi
0.7 Yemeni_Desert
0.6 Jordanian
0.6 Assyrian
0.6 Lebanese_Muslim
0.6 Bulgarian
0.5 Armenian
0.5 Armenian_Hemsheni
0.4 Croatian
0.4 Saudi
0.3 Yemeni_Northwest
0.3 BedouinA
0.3 Yemeni_Highlands
0.3 French
0.3 Egyptian
0.3 Romanian
0.3 English
0.2 Icelandic
0.2 Lebanese_Christian
0.2 Maltese
0.2 Scottish
0.2 Palestinian
0.2 Greek
0.2 Orcadian
0.1 Italian_North
0.1 Italian_South
0.1 Druze
0.1 Jew_Turkish
0.1 Albanian
0.1 Spanish
0.0 Jew_Iraqi
0.0 Jew_Moroccan
0.0 Basque
0.0 Jew_Yemenite
0.0 Spanish_North
0.0 Sicilian
0.0 Sardinian
0.0 Jew_Tunisian
0.0 Jew_Libyan
0.0 Cypriot
0.0 Canary_Islander
0.0 BedouinB

I first did a global K=3 run of modern samples, where I selected samples where the years BP field in the anno file was 0:


curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar;tar -xf v44.3_HO_public.tar
f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutang|Primate_Chimp|hg19ref')
x=maalima;awk -F\\t 'NR>1{print$2,$8}' v44.3_HO_public.anno|igno|grep -Ev '\.(SG|SDG|DG|WGA)'|grep -v _o|cut -d' ' -f1|awk -F\\t 'NR==FNR{a[$0];next}$2 in a&&$6==0&&(!a[$3]++){print$2,$8}' - v44.3_HO_public.anno>$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
plink --allow-no-sex --bfile $x --genome --out $x
awk 'FNR>1&&$10>=.3{print$2<$4?$2:$4}' $x.genome|awk 'NR==FNR{a[$0];next}!($1 in a)' - $x.pick>$x.i.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.i.pick v44.3_HO_public.fam) --make-bed --out $x.i
plink --allow-no-sex --bfile $x.i --indep-pairwise 50 10 .01 --out $x.i
plink --bfile $x.i --extract $x.i.prune.in --make-bed --out $x.i.p
tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1,i]+=$i}}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i,j]/n[i]);print o}}' "FS=${1-$'\t'}")
k=3;admixture -j4 -C .1 $x.i.p.bed $k;paste -d' ' <(awk 'NR==FNR{a[$1]=$2;next}{print$2,a[$2]}' $x.i.pick $x.i.p.fam) $x.i.p.$k.Q>$x.$k;cut -d' ' -f2- $x.$k|tav \ >$x.$k.ave

Then I selected samples that had less than 20% of the SSA component (excluding Australians and Papuans) and I did a new K=2 run.
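
A minimal sketch of that selection step, assuming the per-sample table has the layout written by the command above (sample ID, population, then one column per component). The function name `pick_below`, the component index, and the output file name are hypothetical; which column is the SSA component differs between runs, so check it before using the filter. The sketch also assumes that Australians and Papuans are exempt from the cutoff rather than dropped.

```shell
# Hypothetical helper: keep samples whose share of one ADMIXTURE component is below a cutoff.
# Input lines: sample_id population q1 q2 ... (space-separated).
# $1 = 1-based index of the component among the q columns (assumed; check which one is SSA)
# $2 = cutoff as a fraction, e.g. .2
# $3 = comma-separated populations that are kept regardless of the cutoff
pick_below() {
  awk -v col="$1" -v max="$2" -v keep="$3" '
    BEGIN { n = split(keep, k, ","); for (i = 1; i <= n; i++) exempt[k[i]] }
    ($2 in exempt) || ($(col + 2) + 0 < max + 0) { print $1, $2 }
  '
}
```

It could be used as `pick_below 3 .2 Australian,Papuan <$x.3 >maalima2.pick`, after which the `plink --keep` step works the same way as before.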

Next I did a K=3 run of the non-SSA samples. Even though I excluded some North African populations with the highest SSA ancestry, the third component still became an SSA-like or basal-like component, where both Egyptians and Papuans have about 50% of the third component. Maltese have 22% of the third component and Greeks have 9%. Even Thais have 6% of the third component.

https://i.ibb.co/dPnHhg4/non-ssa-admix.png

At K=4, Americans and Siberians split off from East-Southeast Asians. Kets have 70% of the American-Siberian component because they have so much ANE.

The clustering and nearest neighbors are based on just the runs at K=2, K=3, and K=4, because the K=5 run has already taken more than an hour. Maybe I should've done more aggressive LD pruning, because almost 100,000 SNPs remained even after `--indep-pairwise 50 10 .03`.

https://i.ibb.co/0JY93Gd/4.jpg

Sorry for all these huge images, but regular-size images look like crap on a retina display.



I already deleted my websites and my Github account years ago, and I decided that I was no longer going to make any contributions to the world.

Github is too gay and post web 2.0 anyway. Oldschool static websites are nicer.

Also I don't think they would like to use a script that says `set.seed(1488)`.

Good job with graphing! I like how you continuously try to find new ways to visualize your results. Keep on doing what you do.

Here’s a few notes on your K3 unsupervised admixture run:

- Clustering should not necessarily be interpreted as shared genetic drift in the last 30,000 years, or geneflow. The classic example is Neanderthal clustering with SSA in Admixture. Therefore SNP ascertainment substantially skews results.

- You’ll notice problems with SNP ascertainment in the Reich dataset, with Turkmen, Tatars, Altaians and a couple of others not clustering as expected (maybe that’s why you didn’t include them?)

- Papuans having 50% of the orange component along with Egyptians and Bedouin should not be interpreted as orange being Basal Eurasian, because Papuans should not be scoring that much Basal Eurasian. The orange component may be nothing more than clustering due to very ancient, million-year-old alleles or some other non-meaningful SNP artifacts

- Ezidi Kurds having higher E. Eurasian than Kurmanji or Sorani Kurds would not make sense in my experience since they don’t have as much Central Asian input as other Kurds

If you want to make a K3 based on East and West Eurasian my suggestion is supervised using good LBK samples as West Eurasian proxies

Komintasavalta
05-11-2021, 10:18 PM
I now figured out how to use the `circlize` package to draw a circular stacked bar chart: https://jokergoo.github.io/circlize_book/book/. Next I'll try to learn how to add a thin bar for each individual sample within a population.

https://i.ibb.co/Brq5mM9/circlize-admixture.jpg


library(circlize)
library(vegan) # for reorder.hclust (may be masked by the package seriation)
library(dendextend) # for color_branches

f="uralaltaic.i"
kvals=c(3,7)
columnorder=list(c(3,2,1),c(1,5,4,3,6,7,2))

mats=sapply(kvals,function(x)read.table(paste0(f,".",x,"a"),r=1)[,columnorder[lapply(columnorder,length)==x][[1]]])

joined=do.call(cbind,sapply(Sys.glob(paste0(f,".[0-9]a")),function(x)read.table(x,r=1)))
dist=as.data.frame(as.matrix(dist(joined)))
hc=hclust(dist(joined))

hc=reorder(hc,dist[,"Nganasan"]-dist[,"Estonian"])
# hc=reorder(hc,mats[[1]][,3]-mats[[1]][,1])
# maxdist=which(dist==max(dist))[1];hc=reorder(hc,dist[,maxdist%%nrow(dist)]-dist[,maxdist%/%nrow(dist)+1])

labelcolor=hcl(c(260,120,60,0,220,160,310,90)+15,60,70)
barcolor=list(hcl(c(220,120,310)+15,60,70),hcl(c(220,60,120,0,270,90,310)+15,60,70))

labels=hc$labels[hc$order]
cut=cutree(hc,8)
dend=color_branches(as.dendrogram(hc),k=length(unique(cut)),col=labelcolor[unique(cut[labels])])

circos.clear()
png("a.png",w=2500,h=2500,res=300)
circos.par(cell.padding=c(0,0,0,0))
circos.initialize(0,xlim=c(0,nrow(mats[[1]])))

circos.track(ylim=c(0,1),bg.border=NA,track.height=.2,track.margin=c(.005,0),panel.fun=function(x,y)
for(i in 1:nrow(mats[[1]]))circos.text(i-.5,0,labels[i],adj=c(0,.5),facing="clockwise",niceFacing=T,cex=.65,col=labelcolor[cut[labels[i]]])
)

for(j in length(mats):1)circos.track(ylim=c(0,1),track.height=.25,track.margin=c(0,.01),bg.lty=0,panel.fun=function(x,y){
mat=as.matrix(mats[[j]][hc$order,])
pos=1:nrow(mat)-.5
barwidth=1
for(i in 1:ncol(mat)){
seq1=rowSums(mat[,seq(i-1),drop=F])
seq2=rowSums(mat[,seq(i),drop=F])
circos.rect(pos-barwidth/2,if(i==1){0}else{seq1},pos+barwidth/2,seq2,col=barcolor[[j]][i],border="gray20",lwd=.1)
}
for(i in 1:ncol(mat)){
seq1=rowSums(mat[,seq(i-1),drop=F])
seq2=rowSums(mat[,seq(i),drop=F])
lab=round(100*mat[,i])
lab[lab<=1]=""
circos.text(pos,if(i==1){seq1/2}else{seq1+(seq2-seq1)/2},labels=lab,col="gray10",cex=.5,facing="downward")
}
})

circos.track(ylim=c(0,attr(dend,"height")),track.height=.25,track.margin=c(0,.0015),bg.border=NA,panel.fun=function(x,y)circos.dendrogram(dend))

circos.clear()
dev.off()


Structure makes this triangle straight from the genome data.

I guess you mean this (from the Structure manual, https://web.stanford.edu/group/pritchardlab/structure_software/release_versions/v2.3.4/structure_doc.pdf):

https://i.ibb.co/nsWz7Lb/structuretriangle.png

I thought that this was something that was invented by me...

Komintasavalta
05-17-2021, 04:41 PM
If you want to make a K3 based on East and West Eurasian my suggestion is supervised using good LBK samples as West Eurasian proxies

I haven't been very successful in making supervised runs where I have only used ancient samples as references. But if I do an unsupervised run with the right mixture of modern and ancient samples, I can get components for WHG and LBK at a relatively low K value.

The image below shows two ADMIXTURE runs at K=3 and K=6, where I included modern samples with the suffix `.DG`, and I included ancient samples with over 500,000 SNPs and with mean age BP over 6,000. I used `--indep-pairwise 50 10 .05` which kept 72,163 SNPs.

Now I don't get that much SSA in Eurasians even at K=3, but maybe it's partially because the SSA component includes many Capoid-Bambutids, so even West Africans only get 90-95% of the SSA component at K=3. In Gedrosia K3, even Somalis can get 3% East Eurasian ancestry, but maybe it's for similar reasons that even West Africans get East Eurasian ancestry in these runs.

However there's also something weird about how Villabruna gets 10% SSA at K=3.

At K=3, Norwegians get 8% of the East Eurasian component, but at K=6, Norwegians get 9% of the American component and 0% of the East Eurasian component. At K=6, Finns get 5% East Asian in addition to 10% American. At K=6, Karelia_HG gets 25% American, 42% WHG, and 32% LBK.

https://i.ibb.co/61fFtdw/k3k6.jpg

Here's a SmartPCA run of the same samples without SSAs, Saharans, or Australo-Melanesians. This time the clustering and lines to nearest neighbors are not based on an FST matrix, but they're just based on the first 8 dimensions of the PCA multiplied by the square roots of the eigenvalues. When I include ancient populations in an FST run, there's usually a huge distance from some ancient populations to other populations. I don't know if it's because of missing data or something.

https://i.ibb.co/qFJ1R2B/a.jpg
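
The scaling described above (PCA coordinates multiplied by the square roots of the eigenvalues) can be sketched in awk. This assumes the usual smartpca `.evec` layout, with a `#eigvals:` header line followed by one row per sample holding the sample ID, the PC coordinates, and a population label. The function name `scale_evec` is hypothetical and this is not necessarily how the plot above was made.

```shell
# Scale the first n principal components of a smartpca .evec file by sqrt(eigenvalue).
# Assumed layout: a "#eigvals: e1 e2 ..." header, then rows of: sample_id PC1 ... PCm label
# $1 = number of components to keep (defaults to 8)
scale_evec() {
  awk -v n="${1:-8}" '
    NR == 1 { for (i = 1; i <= n; i++) w[i] = sqrt($(i + 1)); next }
    { o = $1; for (i = 1; i <= n; i++) o = o " " sprintf("%.4f", $(i + 1) * w[i]); print o }
  '
}
```

For example, `scale_evec 8 <a.evec >a.scaled` (file names assumed). Euclidean distances between the scaled rows then weight each dimension by how much variance it explains, so clustering or nearest-neighbor lines can be computed from them directly.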

BTW where can we see the proxies that were used in Gedrosia K3? Was the West Eurasian component based on LBK or something?

Petalpusher
05-17-2021, 05:03 PM
BTW where can we see the proxies that were used in Gedrosia K3? Was the West Eurasian component based on LBK or something?

Yes. It's a quote from gedwiki (link seems dead right now)


Eurasia K3 - E Eurasian, W Eurasian, and Sub-Saharan African Calculator
This calculator calculates an individual's E Eurasian, W Eurasian, and Sub-Saharan African admixture.

The components are defined as follows:

1- E Eurasian - This component peaks in E & SE populations such as Ami, Nivkh, Dai, Han, and Ulchi, at about 100%, followed by Siberian & other Asian populations such as Nganasans, Tibetans, Subba, and Mongola.

2- W Eurasian - This component peaks in Neolithic European farmers such as Stuttgart, and LBK culture, as well as in most modern European populations at over 95%.

3- SSA (Sub-Saharan African) - This component peaks in Sub-Saharan African populations such as Yoruban, Esan, and Luhiya at over 97%.

Komintasavalta
05-29-2021, 08:44 AM
New VUR vs Moor K2 calculator:

$ x=vurmoor
$ printf %s\\n Albanian Basque Basque.SDG Belarusian Besermyan Bulgarian Chuvash Cretan.DG Croatian Czech English Estonian Finnish Finnish.DG French French.SDG Gagauz Greek Hungarian Icelandic Italian_North Italian_South Karelian Lithuanian Maltese Mari.SDG Moldavian Mordovian Norwegian Norwegian.DG Orcadian Orcadian.SDG Polish.DG Romanian Russian Russian.SDG Russian_Archangelsk_Krasnoborsky Russian_Archangelsk_Leshukonsky Russian_Archangelsk_Pinezhsky Saami.DG Sardinian Scottish Sicilian Spanish Spanish_North Tatar_Kazan Tatar_Mishar Udmurt Ukrainian Ukrainian_North Veps>$x.pop
$ awk -F\\t 'NR==FNR{a[$0];next}$8 in a&&!a[$3]++{print$3,$8}' $x.pop v44.3_HO_public.anno|awk '++a[$2]<=16'>$x.pick
$ plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
[...]
$ plink --allow-no-sex --bfile $x --indep-pairwise 50 10 .05 --out $x;plink --allow-no-sex --bfile $x --extract $x.prune.in --make-bed --out $x.p
[...]
$ admixture -j4 -C .1 $x.p.bed 2
[...]
$ tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}")
$ awk 'NR==FNR{a[$1]=$2;next}{print a[$2]}' $x.{pick,fam}|paste -d' ' - $x.p.2.Q|sed -E 's/\.S?DG//'|tav ' '|sort -rnk2|awk '{for(i=2;i<=NF;i++)printf"%.0f ",100*$i;print$1}'
100 0 Udmurt
100 0 Chuvash
100 0 Besermyan
100 0 Russian_Archangelsk_Pinezhsky
100 0 Russian_Archangelsk_Leshukonsky
99 1 Veps
96 4 Tatar_Kazan
93 7 Karelian
86 14 Tatar_Mishar
83 17 Russian_Archangelsk_Krasnoborsky
81 19 Finnish
74 26 Mordovian
68 32 Estonian
66 34 Russian
64 36 Lithuanian
61 39 Ukrainian_North
57 43 Belarusian
54 46 Ukrainian
37 63 Czech
32 68 Hungarian
32 68 Icelandic
30 70 Gagauz
28 72 Norwegian
26 74 Scottish
26 74 Moldavian
23 77 Orcadian
22 78 Croatian
22 78 English
17 83 Bulgarian
16 84 Romanian
13 87 French
4 96 Albanian
4 96 Greek
0 100 Italian_North
0 100 Sicilian
0 100 Sardinian
0 100 Maltese
0 100 Italian_South
0 100 Basque
0 100 Spanish_North
0 100 Spanish

It's cool how Finns are 81% VUR but Norwegians are 72% Moor.

Flashball
05-29-2021, 10:00 AM
A depigmented Eurasian with high siberian-like blood who talks about "wog" for the Sardinians and the Basques, it's cute.

And the people here who say absolutely nothing about this completely stupid talk!

Ambient stupidity.

Zanzibar
08-30-2021, 07:29 PM
If you do a non-SSA ADMIXTURE run with two components, you can estimate the amount of East Eurasian ancestry in different populations:


$ sort -rnk2 maalima2.i.2a|awk '{printf"%.1f %s\n",100*$2,$1}'
100.0 Zhuang
100.0 Tujia
100.0 Tibetan_Yunnan
100.0 She
100.0 Qiang
100.0 Nivh
100.0 Negidal
100.0 Naxi
100.0 Nanai
100.0 Mulam
100.0 Miao
100.0 Maonan
100.0 Li
100.0 Korean
100.0 Gelao
100.0 Dong
100.0 Ami
100.0 Atayal
99.9 Dai
99.9 Han
99.9 Japanese
99.8 Vietnamese
99.8 Yi
99.8 China_Lahu
99.8 Ulchi
99.7 Kankanaey
99.7 Murut
99.4 Kinh
99.2 Hezhen
98.7 Ilocano
98.4 Oroqen
98.3 Sherpa
98.2 Xibo
98.1 Dusun
97.7 Daur
97.6 Tibetan
97.4 Yugur
97.0 Mongola
96.5 Rai
95.5 Yukagir_Tundra
94.9 Nganasan
94.8 Bonan
94.6 Koryak
94.5 Visayan
94.5 Tu
94.0 Evenk_Transbaikal
93.8 Itelmen
93.6 Gurung
93.2 Chukchi
93.2 Tagalog
92.0 Chukchi1
91.4 Eskimo_ChaplinSireniki
90.9 Eskimo_Naukan
90.3 Salar
89.7 Cambodian
89.7 Khamnegan
89.6 Thai
89.4 Dungan
88.6 Dongxiang
88.6 Magar
88.6 Malay
88.5 Yakut
87.3 Todzin
87.1 Buryat
86.9 Tamang
86.8 Burmese
86.2 Mongol
83.7 Dolgan
83.5 Tofalar
83.4 Evenk_FarEast
83.2 Tuvinian
83.2 Karitiana
82.0 Kalmyk
81.2 Piapoco
81.2 Mixe
79.7 Surui
78.5 Pima
78.5 Kusunda
77.2 Zapotec
76.2 Mixtec
76.0 Enets
75.7 Kazakh_China
74.7 Khakass_Kachin
74.2 Altaian
72.7 Bolivian
71.9 Mayan
71.6 Kyrgyz_China
70.8 Quechua
70.8 Nasioi
69.2 Kyrgyz_Kyrgyzstan
68.7 Tharu
68.7 Ket
68.7 Even
68.0 Kyrgyz_Tajikistan
67.9 Papuan
67.7 Khakass
66.1 Newar
65.3 Selkup
63.7 Kazakh
63.3 Shor_Khakassia
62.6 Shor_Mountain
62.6 Tubalar
62.6 Australian
58.1 Altaian_Chelkan
56.5 Karakalpak
55.1 Hazara
54.4 Uyghur
54.0 Nogai_Astrakhan
52.7 Mansi
51.7 Tatar_Siberian_Zabolotniye
48.7 Nogai_Stavropol
47.8 Tatar_Siberian
46.9 Yukagir_Forest
46.6 Tlingit
42.2 Bahun
42.0 Bengali
39.1 Uzbek
36.9 Aleut
35.5 Bashkir
35.4 Turkmen
33.2 Punjabi
32.0 GujaratiD
30.1 GujaratiC
29.1 Burusho
28.3 Udmurt
27.1 GujaratiB
26.3 Nogai_Karachay_Cherkessia
24.9 Besermyan
23.5 GujaratiA
23.4 Jew_Cochin
23.4 Chuvash
22.1 Sindhi_Pakistan
21.3 Tatar_Kazan
19.7 Tajik
19.6 Pathan
18.5 Kalash
15.6 Tatar_Mishar
15.5 Russian_Archangelsk_Leshukonsky
14.3 Balochi
13.9 Turkish_Balikesir
13.6 Brahui
13.1 Russian_Archangelsk_Pinezhsky
11.0 Makrani
10.6 Abazin
10.4 Kabardinian
10.4 Veps
9.6 Russian_Archangelsk_Krasnoborsky
9.5 Karachai
9.3 Balkar
9.1 Karelian
8.7 Circassian
8.4 Azeri
8.3 Mordovian
8.0 Finnish
8.0 Ossetian
7.1 Kumyk
7.1 Ezid
6.4 Turkish
6.1 Adygei
6.0 Ingushian
5.9 Iranian
5.8 Russian
5.1 Lak
4.9 Avar
4.9 Tabasaran
4.8 Chechen
4.6 Lezgin
4.6 Darginian
4.5 Kaitag
4.3 Kubachinian
3.3 Kurd
2.9 Estonian
2.6 Abkhasian
2.5 Belarusian
2.1 Gagauz
1.6 Ukrainian
1.6 Ukrainian_North
1.6 Lithuanian
1.3 Hungarian
1.1 Lebanese
1.1 Georgian
1.0 Jew_Iranian
1.0 Moldavian
0.9 Czech
0.8 Jew_Georgian
0.7 Norwegian
0.7 Syrian
0.7 Jew_Ashkenazi
0.7 Yemeni_Desert
0.6 Jordanian
0.6 Assyrian
0.6 Lebanese_Muslim
0.6 Bulgarian
0.5 Armenian
0.5 Armenian_Hemsheni
0.4 Croatian
0.4 Saudi
0.3 Yemeni_Northwest
0.3 BedouinA
0.3 Yemeni_Highlands
0.3 French
0.3 Egyptian
0.3 Romanian
0.3 English
0.2 Icelandic
0.2 Lebanese_Christian
0.2 Maltese
0.2 Scottish
0.2 Palestinian
0.2 Greek
0.2 Orcadian
0.1 Italian_North
0.1 Italian_South
0.1 Druze
0.1 Jew_Turkish
0.1 Albanian
0.1 Spanish
0.0 Jew_Iraqi
0.0 Jew_Moroccan
0.0 Basque
0.0 Jew_Yemenite
0.0 Spanish_North
0.0 Sicilian
0.0 Sardinian
0.0 Jew_Tunisian
0.0 Jew_Libyan
0.0 Cypriot
0.0 Canary_Islander
0.0 BedouinB

Btw here is another data to look at the amount of Mongoloid-ness for various Uralics:
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1522-1/figures/6


Nenets 79%
Kets 74%
Selkups 70%
Khanty 61%
Mari 35%
Mansi 34% (usually they have much higher Mong than this. Southern Mansi?)
Udmurts 30%
Saami 24% (also usually higher Mongoloid like 26-30%)
Komi 19%
Mordovians 11%
Finns 8%
Estonians 5%

Komintasavalta
08-30-2021, 07:44 PM
Btw here is another data to look at the amount of Mongoloid-ness for various Uralics:
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1522-1/figures/6

Nenets 79%
Kets 74%
Selkups 70%
Khanty 61%
Mari 35%
Mansi 34%
Udmurts 30%
Saami 24%
Komi 19%
Mordovians 11%
Finns 8%
Estonians 5%

I don't think you can estimate the amount of Mongoloid ancestry based on the Nganasan component alone, because EHG is also part Mongoloid, and for example Karelians have 6% Nganasan and 24% EHG, but Vepsians have 12% Nganasan and 0% EHG.

I recreated that qpGraph model on the Finnish anthroforum:


library(admixtools)

t=read.table(text="R Yoruba.DG
R A
A O
A AA
O E
O CH
E EU
E L1
EU WH
EU EH
AA EH
EH EH2
EH EH1
EH AAA
AA AAA
WH CCE
WH Italy_North_Villabruna_HG
CH YM
CH CH1
CH AAN
EH1 YM
EH1 EH0
AAA AAN
CH1 EH0
CH1 Georgia_Kotias.SG
L1 Turkey_N_published
L1 LB
L1 YM1
L1 CW
EH2 LB
EH2 CWC
YM YM1
YM CW
EH0 Russia_HG_Karelia
AAN Nganasan
AAN CCA
LB Germany_EN_LBK
LB CCW
YM1 Russia_Samara_EBA_Yamnaya
CW CCW
CW Estonia_CordedWare
CCW CWC
CWC CCE
CCE CCA
CCA Udmurt")

pop=c("Udmurt","Germany_CordedWare","Estonia_CordedWare","Germany_EN_LBK","Italy_North_Villabruna_HG","Nganasan","Russia_HG_Karelia","Russia_Samara_EBA_Yamnaya","Turkey_N_published","Yoruba.DG","Georgia_Kotias.SG","Chimp.REF")

f2=f2_from_geno("path/to/v44.3_HO_public",pops=pop)
gr=qpgraph(f2,t)
plot_graph(gr$edges)

ggsave("a.png",width=9,height=9)
system("mogrify -trim -border 64 -bordercolor white a.png")

# plotly::plotly_graph(gr$edges)

# write_dot(gr$edges) # paste to graphviz.it for black-and-white graph

However I only got 25% of the Nganasan component for Udmurts:

https://i.ibb.co/mJn07Z0/a.png

Zanzibar
08-30-2021, 07:51 PM
I don't think you can estimate the amount of Mongoloid ancestry based on the Nganasan component alone, because EHG is also part Mongoloid, and for example Karelians have 6% Nganasan and 24% EHG, but Vepsians have 12% Nganasan and 0% EHG.

I think you are right. EHG also has some East Eurasian affinity since it is mostly ANE which also has like 25% ENA (Eastern Non-African)/East Eurasian-like affinity.

What data did you use to model the Udmurts? It seems not to be G25.. Also how did you generate the last diagram?

Btw here is more data from Anthrogenica using qpAdm: https://anthrogenica.com/showthread.php?21474-Some-Uralic-qpAdm-runs&p=695047&viewfull=1#post695047. I actually don't know how to run them, but I think you can still sort of interpret the results shown, such as Finns being approximately 10% Nganasan, 36% Estonia_MN_CCC, 21-22% Corded Ware and 32% LBK Neolithic (which seems to have WHG ancestry as well)


Finnish
Nganasan: 0.104±0.015
Estonia_MN_CCC: 0.358±0.070
Germany_CordedWare: 0.216±0.105
Hungary_LBK_MN.SG: 0.322±0.045



I ran qpAdm on some Uralic groups from the 1240K+HO dataset. I tried to keep the left and right pops consistent.

left:
Nganasan - Siberian ancestry
Estonia_MN_CCC - Chalcolithic forager ancestry derived from predominantly EHG and some Narva
Latvia_LN_CordedWare - Early Baltic Corded Ware with minor non-Steppe ancestry
Ukraine_Globular_Amphora - Farmer ancestry already accompanying some Western hunter-gatherer ancestry

right:
Cameroon_SMA_published
Morocco_Iberomaurusian
Czech_Vestonice16
Iberia_ElMiron
Jordan_PPNB_published
Iran_GanjDareh_N
Kolyma_M.SG
Russia_Shamanka_Eneolithic.SG
Belgium_UP_GoyetQ116_1_published
Israel_Natufian
Russia_HG_Tyumen
Luxembourg_Loschbour_published.DG
Russia_HG_Karelia
DevilsCave_N.SG

Udmurt
Nganasan: 0.307±0.007
Estonia_MN_CCC: 0.150±0.033
Latvia_LN_CordedWare: 0.387±0.045
Ukraine_Globular_Amphora: 0.155±0.029
tail: 0.099657
chisq: 15.999
39148

Mari
Nganasan: 0.344±0.013
Estonia_MN_CCC: 0.153±0.052
Latvia_LN_CordedWare: 0.280±0.078
Ukraine_Globular_Amphora: 0.223±0.049
tail: 0.119238
chisq: 15.367
39147

Saami
Nganasan: 0.292±0.011
Estonia_MN_CCC: 0.306±0.039
Latvia_LN_CordedWare: 0.142±0.058
Ukraine_Globular_Amphora: 0.260±0.037
tail: 0.195315
chisq: 13.534
39149


Here is another model (https://anthrogenica.com/showthread.php?21474-Some-Uralic-qpAdm-runs&p=695889&viewfull=1#post695889) for these pops. This time the run also includes Finns, Estonians, and Mordovians.



1240K+HO transversions only, all SNPs.
left pops:
Nganasan
Estonia_MN_CCC
Germany_CordedWare
Hungary_LBK_MN.SG

right pops:
Cameroon_SMA_published
Morocco_Iberomaurusian
Iberia_ElMiron
Jordan_PPNB_published
Iran_GanjDareh_N
Anatolia_N
Ukraine_EBA_Yamnaya.SG
Kolyma_M.SG
Russia_MA1_HG.SG
Luxembourg_Loschbour_published.DG
Russia_Steppe_Eneolithic
Georgia_Kotias.SG
DevilsCave_N.SG
Russia_HG_Karelia


Mordovian
Nganasan: 0.106±0.013
Estonia_MN_CCC: 0.312±0.060
Germany_CordedWare: 0.288±0.093
Hungary_LBK_MN.SG: 0.295±0.037
tail: 0.45506
chisq: 9.835


Estonian
Nganasan: 0.043±0.015
Estonia_MN_CCC: 0.413±0.075
Germany_CordedWare: 0.235±0.113
Hungary_LBK_MN.SG: 0.308±0.045
tail: 0.317555
chisq: 11.533


Finnish
Nganasan: 0.104±0.015
Estonia_MN_CCC: 0.358±0.070
Germany_CordedWare: 0.216±0.105
Hungary_LBK_MN.SG: 0.322±0.045
tail: 0.45506
chisq: 9.835


Saami
Nganasan: 0.286±0.016
Estonia_MN_CCC: 0.326±0.076
Germany_CordedWare: 0.214±0.115
Hungary_LBK_MN.SG: 0.174±0.047
tail: 0.62748
chisq: 8.014


Mari
Nganasan: 0.345±0.023
Estonia_MN_CCC: 0.225±0.097
Germany_CordedWare: 0.350±0.149
Hungary_LBK_MN.SG: 0.081±0.060
tail: 0.825055
chisq: 5.882

Udmurt
Nganasan: 0.287±0.009
Estonia_MN_CCC: 0.176±0.045
Germany_CordedWare: 0.450±0.068
Hungary_LBK_MN.SG: 0.088±0.027
tail: 0.659253
chisq: 7.688

Voskos
08-30-2021, 08:06 PM
At K=6, the wog component splits off into a Sardinian component and a Maltese component.

Your credibility being reduced to a pile of shit through the use of a single word.

Leto
08-30-2021, 08:23 PM
Your credibility being reduced to a pile of shit through the use of a single word.
He means the Southern Caucasoids. Who cares, you dislike the North, he dislikes the South and I ain't like either of you.

Voskos
08-30-2021, 08:27 PM
He means the Southern Caucasoids. Who cares, you dislike the North, he dislikes the South and I ain't like either of you.

I don't dislike the North. And nobody asked or cares about your opinion.

Komintasavalta
08-30-2021, 10:19 PM
What data did you use to model the Udmurts? It seems not to be G25.. Also how did you generate the last diagram?

The data is from 1240K+HO. I used the plot_graph function of ADMIXTOOLS 2.


Btw here is another data from Anthrogenica by a user there using qpAdm: https://anthrogenica.com/showthread.php?21474-Some-Uralic-qpAdm-runs&p=695047&viewfull=1#post695047

I tried modeling Uralic populations with qpAdm too: https://anthrogenica.com/showthread.php?23677-R-scripts-for-ADMIXTOOLS-2&p=768662&viewfull=1#post768662.

But I don't really trust qpAdm, because the results can be all over the place depending on the choice of outgroups. For example in the two models below, the only difference is that I removed Turkey_Epipaleolithic from the outgroups in the second model, but it increased the Turkey_N ancestry of Finns by 23 percentage points:

https://i.imgur.com/nIQSCc1.png
https://i.ibb.co/jfZ9pqX/c.png


Your credibility being reduced to a pile of shit through the use of a single word.

Greeks are an ultra-wog people. Below are populations with the lowest FST distance to Greeks in 1240K+HO, up to Finns. Greeks are even closer to Iranians and Moroccan Jews than to Norwegians.

FST distance to Greek:
.0003 Bulgarian
.0004 Albanian
.0014 Gagauz
.0014 Italian_North
.0017 Italian_South
.0019 Sicilian
.0026 Romanian
.0027 Turkish
.0027 Spanish
.0030 Hungarian
.0032 Cypriot
.0032 Moldavian
.0035 Jew_Turkish
.0036 Croatian
.0037 French
.0038 Armenian
.0039 Maltese
.0041 Lebanese_Muslim
.0042 Czech
.0047 Kumyk
.0051 Lebanese_Christian
.0052 Kabardinian
.0052 English
.0052 Abazin
.0057 Jew_Ashkenazi
.0058 Lebanese
.0059 Ukrainian
.0061 Circassian
.0062 Jew_Moroccan
.0062 Iranian
.0063 Azeri
.0064 Adygei
.0066 Georgian
.0067 Russian
.0068 Balkar
.0068 Jordanian
.0068 Assyrian
.0069 Ukrainian_North
.0070 Belarusian
.0070 Canary_Islander
.0071 Lezgin
.0073 Abkhasian
.0075 Norwegian
.0077 Ossetian
.0078 Spanish_North
.0079 Syrian
.0080 Orcadian
.0080 Icelandic
.0081 Armenian_Hemsheni
.0083 Mordovian
.0084 Sardinian
.0085 Palestinian
.0086 Russian_Archangelsk_Krasnoborsky
.0086 Druze
.0087 Chechen
.0090 Jew_Iraqi
.0091 Tabasaran
.0092 Scottish
.0092 Estonian
.0093 Ingushian
.0094 Tatar_Kazan
.0097 BedouinA
.0098 Tatar_Mishar
.0099 Nogai_Karachay_Cherkessia
.0100 Basque
.0101 Lithuanian
.0104 Finnish

Voskos
08-30-2021, 11:29 PM
...

Nice pile of bullshit. Now check some real f3 stats using Yoruba as an outgroup (image taken from an Elsevier paper and not homemade like your charts) showing that even Cretans (and Cypriots!) share more drift with Norwegians than with either Levantines, Iranians or North Africans.

https://ars.els-cdn.com/content/image/1-s2.0-S0092867421003706-figs4_lrg.jpg


For clarity, we only show results for west Eurasian and north African populations and cap f3 values below 0.15. For each case, we show the geographic distribution of f3 (warmer colors represent greater sharing between populations X and Y).

Komintasavalta
08-31-2021, 12:09 AM
Nice pile of bullshit. Now check some real f3 stats using Yoruba as an outgroup (image taken from an Elsevier paper and not homemade like your charts) showing that even Cretans (and Cypriots!) share more drift with Norwegians than with either Levantines, Iranians or North Africans.

I think populations with high WHG just get high f3 values with other European populations. I ran f3(Yoruba, Greek, x) for the populations included in my previous post. Like in the image you posted, Greeks had the highest f3 value with Basques and the second highest value with Lithuanians:


> library(admixtools)
> p3=c("Bulgarian","Albanian","Gagauz","Italian_North","Italian_South","Sicilian","Romanian","Turkish","Spanish","Hungarian","Cypriot","Moldavian","Jew_Turkish","Croatian","French","Armenian","Maltese","Lebanese_Muslim","Czech","Kumyk","Lebanese_Christian","Kabardinian","English","Abazin","Jew_Ashkenazi","Lebanese","Ukrainian","Circassian","Jew_Moroccan","Iranian","Azeri","Adygei","Georgian","Russian","Balkar","Jordanian","Assyrian","Ukrainian_North","Belarusian","Canary_Islander","Lezgin","Abkhasian","Norwegian","Ossetian","Spanish_North","Syrian","Orcadian","Icelandic","Armenian_Hemsheni","Mordovian","Sardinian","Palestinian","Russian_Archangelsk_Krasnoborsky","Druze","Chechen","Jew_Iraqi","Tabasaran","Scottish","Estonian","Ingushian","Tatar_Kazan","BedouinA","Tatar_Mishar","Nogai_Karachay_Cherkessia","Basque","Lithuanian","Finnish")
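> # Outgroup f3(Yoruba; Greek, x) for every population x in p3
> f=f3("v44.3_HO_public","Yoruba","Greek",p3)
> # Sort by the f3 estimate, highest first, and print one population per line
> s=f[order(-f$est),]
> paste(sprintf("%.4f",s$est),s$pop3)%>%cat(sep="\n")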
0.1663 Basque
0.1661 Lithuanian
0.1657 Sardinian
0.1657 Norwegian
0.1656 Spanish_North
0.1656 English
0.1656 Icelandic
0.1655 Czech
0.1655 Orcadian
0.1654 Croatian
0.1652 Hungarian
0.1652 Scottish
0.1651 Ukrainian_North
0.1651 Italian_North
0.1651 Estonian
0.1649 French
0.1649 Albanian
0.1648 Romanian
0.1648 Ukrainian
0.1648 Belarusian
0.1648 Bulgarian
0.1646 Moldavian
0.1645 Gagauz
0.1641 Russian
0.1635 Finnish
0.1635 Spanish
0.1627 Mordovian
0.1627 Russian_Archangelsk_Krasnoborsky
0.1620 Armenian_Hemsheni
0.1619 Italian_South
0.1615 Georgian
0.1613 Abkhasian
0.1613 Chechen
0.1609 Cypriot
0.1609 Armenian
0.1607 Lezgin
0.1607 Adygei
0.1606 Sicilian
0.1605 Tabasaran
0.1604 Ingushian
0.1601 Jew_Ashkenazi
0.1599 Tatar_Mishar
0.1598 Kumyk
0.1597 Ossetian
0.1596 Circassian
0.1595 Abazin
0.1594 Balkar
0.1594 Assyrian
0.1592 Kabardinian
0.1591 Turkish
0.1588 Jew_Turkish
0.1588 Maltese
0.1584 Lebanese_Christian
0.1582 Jew_Iraqi
0.1578 Tatar_Kazan
0.1576 Azeri
0.1573 Druze
0.1569 Canary_Islander
0.1558 Iranian
0.1555 Jew_Moroccan
0.1548 Lebanese_Muslim
0.1546 Nogai_Karachay_Cherkessia
0.1520 Lebanese
0.1497 Syrian
0.1489 Palestinian
0.1485 Jordanian
0.1433 BedouinA

Even f3(Yoruba, Chuvash, x) was higher for Lithuanians than for Tatars or Udmurts or Maris:

0.1649 Saami.DG
0.1647 Lithuanian
0.1639 Finnish
0.1633 Mari.SG
0.1627 Udmurt
0.1618 Tatar_Mishar
0.1611 Tatar_Kazan
0.1604 Basque
0.1600 Spanish_North
0.1595 Bashkir
0.1572 Greek
0.1563 Sardinian
0.1534 Kazakh
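
For context, here is a minimal sketch of what an outgroup f3 statistic measures, in plain R with made-up allele frequencies. It uses the naive estimator mean((o - a) * (o - b)) and is only an illustration: the f3() call above also reports standard errors, and the helper name naive_f3 and all of the numbers below are assumptions, not values from the dataset.

```r
# Naive outgroup f3 from allele frequencies (no bias correction, no standard errors).
# o, a, b: numeric vectors of allele frequencies at the same SNPs.
naive_f3 <- function(o, a, b) mean((o - a) * (o - b), na.rm = TRUE)

# Toy frequencies, made up for illustration only.
o  <- c(0.10, 0.80, 0.50, 0.30)  # outgroup
a  <- c(0.40, 0.50, 0.70, 0.60)  # target population
b1 <- c(0.45, 0.45, 0.75, 0.65)  # differs from the outgroup in the same direction as a
b2 <- c(0.15, 0.85, 0.45, 0.25)  # stays close to the outgroup

naive_f3(o, a, b1)  # larger: a and b1 share their departure from the outgroup
naive_f3(o, a, b2)  # smaller: a and b2 share little departure from the outgroup
```

The statistic is large when both populations differ from the outgroup in the same direction at the same SNPs, which is why it is read as shared drift rather than as overall similarity.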

Leto
08-31-2021, 03:09 PM
Greeks are an ultra-wog people. Below are populations with the lowest FST distance to Greeks in 1240K+HO, up to Finns. Greeks are even closer to Iranians and Moroccan Jews than to Norwegians.

Would you say Russians are closer to Tajiks than to Armenians? Tajiks are 30-40% BA Steppe whereas Armenians have almost none. But on the other hand, Tajiks have a lot more non-Caucasoid admixture than Armenians who are pretty much 98-100% Cauc.
And please don't use those extreme outliers like Pinega, those are just a few thousand people and most Russians are not like them.

Komintasavalta
08-31-2021, 03:40 PM
Would you say Russians are closer to Tajiks than to Armenians? Tajiks are 30-40% BA Steppe whereas Armenians have almost none. But on the other hand, Tajiks have a lot more non-Caucasoid admixture than Armenians who are pretty much 98-100% Cauc.
And please don't use those extreme outliers like Pinega, those are just a few thousand people and most Russians are not like them.

Based on FST distances between samples in 1240K+HO, Russians are closer to Tajiks (.0111) than to Armenians (.0139).

Central Asian populations like Tajiks are mixed populations with a large effective population size, so they're the opposite of drifted, and they have relatively low FST and f2 distances to other populations. Russians are also closer to Tajiks than to Basques (.0120), even though Russians are much closer to Basques on G25. I think that's because FST is more sensitive to drift than G25.

Both North Russians and South Russians are overrepresented among the Russian samples, but the average latitude of the Russian samples is 56.6, which is almost the same as the latitude of the center of population of Russia. However, 22 of the 71 samples are from Arkhangelsk Oblast, which might skew the results a bit.

https://i.ibb.co/55BFxGJ/a.png

Here's the FST matrix as a CSV file: https://pastebin.com/raw/QrySXDRp. The FST values are multiplied by one million.
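
A minimal sketch of how a matrix like this could be rescaled and queried in R. It assumes the CSV has a header row of population names and the same names in the first column, which may not match the actual file. The inline excerpt contains only the three Russian distances quoted in this post, with every other pair left as NA, and the helper name closest is hypothetical.

```r
# Excerpt in the assumed layout, values multiplied by one million.
# Only the Russian row/column is filled in (from the distances quoted above).
csv_text <- "pop,Russian,Tajik,Armenian,Basque
Russian,0,11100,13900,12000
Tajik,11100,0,NA,NA
Armenian,13900,NA,0,NA
Basque,12000,NA,NA,0"

# Read the table and undo the scaling to get plain FST values.
fst <- as.matrix(read.csv(text = csv_text, row.names = 1, check.names = FALSE)) / 1e6

# Populations sorted by FST distance to a target, excluding the target itself.
closest <- function(m, pop) sort(m[pop, colnames(m) != pop])

closest(fst, "Russian")
```

With the full file, the same two steps would start from read.csv() pointed at the downloaded CSV instead of csv_text, provided the layout assumption holds.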

vbnetkhio
09-10-2021, 07:18 PM
Here's population averages from the European ADMIXTURE run, where each population is linked to its three closest neighbors.

I'm happy that Finns are connected to Kalmyks by only four links: first from Finnish to Russian_Archangelsk_Pinezhsky, then to Besermyan, then to Bashkir, and then to Kalmyk.

https://i.imgur.com/NepyCSJ.png
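
A minimal sketch of how "three closest neighbors" links could be derived from a table of population averages in R. The population names, the component values, and the choice of Euclidean distance are all assumptions for illustration. The post does not say which distance was used for the image.

```r
# Toy population averages of admixture components (rows sum to 1); values are made up.
avg <- rbind(
  A = c(0.80, 0.15, 0.05),
  B = c(0.75, 0.20, 0.05),
  C = c(0.50, 0.40, 0.10),
  D = c(0.20, 0.60, 0.20),
  E = c(0.10, 0.30, 0.60)
)

# Pairwise Euclidean distances between populations (assumed metric).
d <- as.matrix(dist(avg))
diag(d) <- Inf  # a population is not its own neighbor

# For each population, the names of its k closest neighbors, nearest first.
k <- 3
links <- t(apply(d, 1, function(row) names(sort(row))[1:k]))
links
```

Each row of links lists the populations one population would be connected to. Since a link counts in either direction, some populations end up with more than three connections in the drawn graph.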

Could you do an unsupervised ADMIXTURE run with modern Siberians, Finno-Ugrics, and non-SG ancient samples, measuring the Mesolithic/Neolithic components like ANE, WHG, etc.? Sorry if you already did it, I can't find it.