Log in

View Full Version : Heatmaps of G25 Euclidean distances with R



Komintasavalta
02-20-2021, 12:19 PM
I used data from the file "Global25 pop averages modern scaled": https://eurogenes.blogspot.com/2019/07/getting-most-out-of-global25_12.html.

The numbers displayed in the heatmap are the same Euclidean distances that are shown by Vahaduo, but I multiplied them by 100 so I could make them fit the cells of the heatmap better, and because integers are nice.

Distances based on G25 have to be taken with a grain of salt, but based on the distances shown in the image below, Kola Saami are closer to Komis than to other Saami. The population that is closest to Komis are Kola Saami. Bashkirs are far from every population. Maris are the closest to non-Kola Saami. Non-Kola Saami are closest to Kola Saami, followed by Udmurts. Kazan Tatars are closest to Mishars, followed by Besermyans.

https://i.ibb.co/HPTdmNH/g25-euclidean-northeast-heatmap.png


brew install R
R -e 'install.packages(c("pheatmap","RColorBrewer"),repos="https://cloud.r-project.org")'
curl -Ls 'drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y'>modernave
awk -F, 'NR==FNR{a[$0];next}$1 in a' <(printf %s\\n Bashkir Besermyan Chuvash Estonian Finnish Karelian Komi Mari Mordovian Russian_Kostroma Russian_Pinega Saami Saami_Kola Tatar_Kazan Tatar_Mishar Udmurt Vepsian) modernave>selected
R -e 'library("pheatmap");library("RColorBrewer");t<-read.csv("selected",header=F,row.names=1);t2<-100*as.matrix(dist(t,upper=T));diag(t2)<-NA;
pheatmap(t2,filename="output.png",main="G25 Euclidean distances multiplied by 100",cellwidth=12,cellheight=12,fontsize=9,border_colo r=NA,
display_numbers=T,number_format="%.0f",fontsize_number=7,number_color="black",rev(colorRampPalette(brewer.pal(11,"Spectral"))(256)))'

Here are also some Siberian population averages:

https://i.ibb.co/fXb8RQc/g25-euclidean-siberia-heatmap.png

Here are 16 random populations:

https://i.ibb.co/dKNm042/g25-euclidean-random-heatmap.png

Note that the heatmaps above use different ranges of numbers for the color scale. You can give an argument like `breaks=seq(0,14.88,14.88/256)` for `pheatmap` to use a fixed scale from 0 to 14.88.

It's easy to calculate Euclidean distances in R:


$ Rscript -e 'round(dist(read.csv("modernave",row.names=1,header=T)[c("Chuvash","Khanty","Komi","Mari","Nenets","Udmurt"),],upper=T),3)'
Chuvash Khanty Komi Mari Nenets Udmurt
Chuvash 0.205 0.073 0.056 0.302 0.049
Khanty 0.205 0.247 0.173 0.112 0.186
Komi 0.073 0.247 0.125 0.343 0.067
Mari 0.056 0.173 0.125 0.270 0.082
Nenets 0.302 0.112 0.343 0.270 0.286
Udmurt 0.049 0.186 0.067 0.082 0.286
$ Rscript -e 't<-read.csv("modernave",row.names=1,header=T);round(head(sort(as.matrix(d ist(t))["Chuvash",]),8),3)'
Chuvash Besermyan Udmurt Mari Tatar_Kazan Saami Komi Saami_Kola
0.000 0.048 0.049 0.056 0.064 0.071 0.073 0.077
$ Rscript -e 't<-read.csv("modernave",row.names=1,header=T);p<-t["Chuvash",];head(round(sort(apply(t,1,function(x)sqrt(sum((x-p)^2)))),3),8)'
Chuvash Besermyan Udmurt Mari Tatar_Kazan Saami Komi Saami_Kola
0.000 0.048 0.049 0.056 0.064 0.071 0.073 0.077

You can also use awk:


$ awk -F, 'NR==FNR{for(i=2;i<=NF;i++)a[i]=$i;next}{s=0;for(i=2;i<=NF;i++)s+=($i-a[i])^2;print s^.5","$1}' <(grep Chuvash, modernave) modernave|sort -n|awk -F, '{printf"%.03f %s\n",$1,$2}'|sed s/^0//|head -n8
.000 Chuvash
.048 Besermyan
.049 Udmurt
.056 Mari
.064 Tatar_Kazan
.071 Saami
.073 Komi
.077 Saami_Kola

Or use this Ruby script:


$ cat ~/bin/eud
#!/usr/bin/env ruby -roptparse

opt={}
OptionParser.new{|x|
x.on("-m NUM",Integer){|y|opt[:m]=y}
x.on("-f NUM",Integer){|y|opt[:f]=y}
}.parse!

a=IO.readlines(ARGV[0]).map{|l|x,*y=l.chomp.split(",");[x,y.map(&:to_f)]}

puts IO.readlines(ARGV[1]).map{|l|
x,*y=l.chomp.split(",")
y.map!(&:to_f)
d=a.reject{|z|z[0]==x}.map{|z|[z[1].map.with_index{|v,i|(v-y[i])**2}.sum**0.5,z[0]]}.sort_by(&:first)
d=d.take(opt[:m])if opt[:m]
"Distance to: #{x}\n"+d.map{|x,y|("%.#{opt[:f]||3}f"%x).sub(/^0/,"")+" "+y}*"\n"
}*"\n\n"
$ eud -m8 modernave <(grep Chuvash modernave)
Distance to: Chuvash
.048 Besermyan
.049 Udmurt
.056 Mari
.064 Tatar_Kazan
.071 Saami
.073 Komi
.077 Saami_Kola
.087 Tatar_Mishar

The distances calculated by Vahaduo are also simple Euclidean distances:

https://i.ibb.co/GV3fk2G/20210220151009.jpg

Komintasavalta
02-21-2021, 03:38 PM
This selects Northeast European populations, takes the four closest rows to each of them from the datasheet for ancient population averages, and makes a dendrogram of all rows:

curl -Ls 'drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y'>modernave
curl -Ls 'drive.google.com/uc?export=download&id=1F2rKEVtu8nWSm7qFhxPU6UESQNsmA-sl'>ancientave
printf %s\\n Bashkir Besermyan Chuvash Finnish Finnish_East Ingrian Karelian Komi Latvian Lithuanian_PA Mari Mordovian Russian_Kostroma Russian_Pinega Russian_Tver Saami Saami_Kola Tatar_Kazan Tatar_Lipka Tatar_Mishar Udmurt Vepsian>pop
awk -F, 'NR==FNR{a[$0];next}$1 in a' pop modernave>modernselected
dist()(awk -F, 'NR==FNR{for(i=2;i<=NF;i++)a[i]=$i;next}$1!=""{s=0;for(i=2;i<=NF;i++)s+=($i-a[i])^2;print s^.5,$1}' "$1" "$2")
cat modernselected|while read l;do dist <(echo "$l") ancientave|sort -n|head -n4;done|cut -d' ' -f2|awk -F, 'NR==FNR{a[$0];next}$1 in a' - ancientave>ancientselected
cat {modern,ancient}selected>selected
R -e 'install.packages("ape",repos="https://cloud.r-project.org")'
R -e 'library("ape");png("output.png",w=750,h=900);plot(as.phylo(hclust(dist(read.csv("selected",header=F,row.names=1,check.names=F)))),type="phylogram",cex=1.3,no.margin=T,font=1);dev.off()'
mogrify -trim -border 10 -bordercolor white output.png

https://i.ibb.co/n0pDQ1t/nee-dendrogram.png

Ajeje Brazorf
02-21-2021, 03:56 PM
Hierarchical clustering of modern populations, which of these two makes more "sense" to you?

Scaled
https://i.ibb.co/XtGdwLW/Hierarchical-clustering-Scaled.png

Unscaled
https://i.ibb.co/1v94hVv/Hierarchical-clustering-Unscaled.png

JamesBond007
02-21-2021, 04:13 PM
Interesting but I'm not installing some random packages called : "pheatmap RColorBrewer" that are not in the OpenBSD ports system (dons tin foil hat). I'd have to read the source code first and I don't have time for that. Especially, when there is a website that can use for that -- no it is not the vahaduo website-- (I can contain my browser ) with unveil/pledge and 'secure the settings in my browser' and using a proxy :


NAME
unveil - unveil parts of a restricted filesystem view

SYNOPSIS
#include <unistd.h>

int
unveil(const char *path, const char *permissions);

DESCRIPTION
The first call to unveil() that specifies a path removes visibility of
the entire filesystem from all other filesystem-related system calls
(such as open(2), chmod(2) and rename(2)), except for the specified path
and permissions.

The unveil() system call remains capable of traversing to any path in the
filesystem, so additional calls can set permissions at other points in
the filesystem hierarchy.

After establishing a collection of path and permissions rules, future
calls to unveil() can be disabled by passing two NULL arguments.
Alternatively, pledge(2) may be used to remove the "unveil" promise.

The permissions argument points to a string consisting of zero or more of
the following characters:

r Make path available for read operations, corresponding to the
pledge(2) promise "rpath".
w Make path available for write operations, corresponding to
the pledge(2) promise "wpath".
x Make path available for execute operations, corresponding to
the pledge(2) promise "exec".
c Allow path to be created and removed, corresponding to the
pledge(2) promise "cpath".

A path that is a directory will enable all filesystem access underneath
path using permissions if and only if no more specific matching unveil()
exists at a lower level. Directories are remembered at the time of a
call to unveil(). This means that a directory that is removed and
recreated after a call to unveil() will appear to not exist.

Non-directory paths are remembered by name within their containing
directory, and so may be created, removed, or re-created after a call to
unveil() and still appear to exist.

Attempts to access paths not allowed by unveil() will result in an error
of EACCES when the permissions argument does not match the attempted
operation. ENOENT is returned for paths for which no unveil()
permissions qualify. After a process has terminated, lastcomm(1) will
mark it with the `U' flag if file access was prevented by unveil().

unveil() use can be tricky because programs misbehave badly when their
files unexpectedly disappear. In many cases it is easier to unveil the
directories in which an application makes use of files.

https://man.openbsd.org/unveil

The simple Rscript euclidian distances is something I would use though.

How is your mac doing, BTW (I guess you are using Darwin ? ) :


A previously undetected piece of malware found on almost 30,000 Macs worldwide is generating intrigue in security circles, which are still trying to understand precisely what it does and what purpose its self-destruct capability serves.

Once an hour, infected Macs check a control server to see if there are any new commands the malware should run or binaries to execute. So far, however, researchers have yet to observe delivery of any payload on any of the infected 30,000 machines, leaving the malware's ultimate goal unknown. The lack of a final payload suggests that the malware may spring into action once an unknown condition is met.

Also curious, the malware comes with a mechanism to completely remove itself, a capability that's typically reserved for high-stealth operations. So far, though, there are no signs the self-destruct feature has been used, raising the question why the mechanism exists. Besides those questions, the malware is notable for a version that runs natively on the M1 chip that Apple introduced in November, making it only the second known piece of macOS malware to do so...

The malware has been found in 153 countries with detections concentrated in the US, UK, Canada, France, and Germany.

Red Canary, the security firm that discovered the malware, has named it "Silver Sparrow."

First detected in August of 2020, the Silver Sparrow malware is interesting in several unsettling ways. It uses the macOS Installer Javascript API to launch a bash process to gain a foothold into the user's system, a hitherto-unobserved method for bypassing malware detection. This bash shell is then used to invoke macOS's built-in PlistBuddy tool to create a LaunchAgent which executes a bash script every hour. This is the command and control process, which downloads a JSON file containing (potentially) new instructions.

Besides the novel installation method, Silver Sparrow is also mysterious in its payload: a single, tiny binary that does nothing but open a window reading "Hello, World!" (in v1, which targets Intel Macs) or "You did it!" (in v2, which is an M1-compatible fat binary). These "bystander binaries" are never executed and appear to be proofs-of-concept or placeholders for future functionality.

https://it.slashdot.org/story/21/02/21/031224/sophisticated-new-malware-found-on-30000-macs-stumps-security-pros

Komintasavalta
02-21-2021, 06:35 PM
Interesting but I'm not installing some random packages

You can do a dendrogram in plain R like this, but then it cuts off the end of long labels like in the images posted by Ajeje Brazorf:


R -e 'png("output.png",w=700,h=8000);plot(as.dendrogram(hclust(dist(read .csv("Global25_PCA_modern_pop_averages_scaled.txt",header=T,row.names=1)))),horiz=T,yaxt="n");dev.off()'

`horiz=T` rotates the plot 90 degrees and `yaxt="n"` removes the scale. Run `?plot.dendrogram` in R to see help.


Hierarchical clustering of modern populations, which of these two makes more "sense" to you?

Scaled
https://i.ibb.co/XtGdwLW/Hierarchical-clustering-Scaled.png

Unscaled
https://i.ibb.co/1v94hVv/Hierarchical-clustering-Unscaled.png

You need scaled coordinates to calculate the distances correctly. In the unscaled version you posted, the Amerindian branch splits off from whites before the Baka pygmy branch. And Chuvashes and Maris split off from mainstream Europeans before SSAs.

In the unscaled version, too little weight is given to the first components which explain most of the variance:

https://i.ibb.co/7vTTRbS/unscaled-vs-scaled.png

Komintasavalta
02-22-2021, 10:15 PM
I'm getting more powerful at R. This does a PCA where it uses K-means clustering to assign label colors. It selects the 64 modern population averages that are the closest to the population average of Nenetses.

https://i.ibb.co/0yKbFsv/b.png


library("factoextra")
library("ggplot2")
library("colorspace")

download.file("https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y","modernave")

t<-read.csv("modernave",header=T,row.names=1)
t2<-t[names(head(sort(apply(t,1,function(x)sqrt(sum((x-t["Nenets",])^2)))),65)),]
k<-hkmeans(t2,16)
p<-prcomp(t2)

pct<-paste0(colnames(p$x)," (",round(p$sdev/sum(p$sdev)*100,1),"%)")

ggplot(p$x,aes(x=PC2,y=PC1,label=rownames(t2)))+
theme_dark()+
geom_text(aes(color=as.factor(k$cluster)),size=2.5 ,show.legend=F)+
theme(
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
plot.background=element_rect(fill="gray35"),
panel.background=element_rect(fill="gray20"),
panel.grid.major=element_line(color="gray35"),
panel.grid.minor=element_blank(),
text=element_text(color="gray10"),
axis.text=element_text(color="gray10")
)+
scale_x_continuous(breaks=seq(-1,1,.1),expand=expansion(mult=.17))+
scale_y_continuous(breaks=seq(-1,1,.1),expand=expansion(mult=.02))+
coord_fixed(ratio=p$sdev[1]/p$sdev[2],expand=T)+
xlab(pct[2])+ylab(pct[1])+
scale_color_discrete_qualitative(palette="Set 3")
ggsave("output.png")

Komintasavalta
02-22-2021, 11:41 PM
Here's the 64 closest ancient individuals to the Ymyyakhtakh population average. Ymy is in a world of its own together with kra001 who Davidski now says is "Proto-Uralic".

https://i.ibb.co/gS0jcjz/20210223024048.png

The closest modern matches to Ymyyakhtakh are Yukaghirs and Nganasans:

$ awk -F, 'NR==FNR{for(i=2;i<=NF;i++)a[i]=$i;next}$1!=""{s=0;for(i=2;i<=NF;i++)s+=($i-a[i])^2;print s^.5,$1}' <(grep Ymy mui/g25/ancientave) mui/g25/modern|sort -n|head -n16|awk '{printf"%.3f %s\n",$1,$2}'|sed s,^0,,
.052 Yukagir_Tundra:Ckrd18
.053 Yukagir_Tundra:Yuk_021
.059 Yukagir_Tundra:Yuk_009
.061 Yukagir_Tundra:Yuk_022
.063 Yukagir_Tundra:Yuk_019
.063 Nganassan:Nganassan10
.066 Nganassan:Nganassan8
.066 Nganassan:Nganassan2
.067 Yukagir_Tundra:Yuk_023
.068 Selkup:Selkup87
.068 Nganassan:Nganassan15
.070 Nganassan:Nganassan3
.071 Nganassan:Nganassan1
.072 Evenk:GS000014336
.072 Yukagir_Tundra:Yuk_011
.072 Evenk:GS000014335