Plink related questions [Archive] - The Apricity Forum: A European Cultural Community

Zoro

03-12-2021, 07:24 AM

This thread is for asking Plink related questions and posting IBS or IBD results from Plink programs such as --genome or any other Plink program

@ Komintasavalta

Regarding your question in the other thread of how to convert from .geno .snp .ind format used in Admixtools to Plink .bed .bim .fam.
Here's how I do it:

Create a text file like this. Use your own file name instead of sample and save it as par.PED.PACKEDPED

genotypename: sample.geno
snpname: sample.snp
indivname: sample.ind
outputformat: PACKEDPED
genotypeoutname: sample.bed
snpoutname: sample.bim
indivoutname: sample.fam
familynames: YES

At a linux terminal execute command:

......../convertf -p par.PED.PACKEDPED

put the path to the ADMIXTOOLS convertf file instead of ..........

You'll receive 3 Plink files : bed bim fam

Check your fam file. You can edit the names of the populations and their IDs to something you like or like how they were named in the .ind file

Zoro

03-12-2021, 07:40 AM

Discussion related to Lezgins, Chechens, and Daghestanis in Iraq, Lezgin-Kurd IBS closeness and Lezgin IBS located at :

https://www.theapricity.com/forum/showthread.php?341277-Dodecad-k12b-West-amp-Central-Asian-results-Vol-4/page52

Zoro

03-12-2021, 07:45 AM

Discussion related to G25 distance results wrongly showing :

1- Eurasians such as Mongols closer to Khomani-San and Ju-Hoan than to Mbuti
2- Eurasians such as Kurds closer to SSA than other Eurasians such as Papuans, Karitiana, and Surui
3- Kurds closer to Jordanians than to Uyghur , Baloch, Brahui etc

Conclusion: The above leads to overestimation of SW Asian and African in W Asians such as Kurds and underestimation of E Asian and Siberian

AND Plink IBS results correctly showing above in contrast to G25

located at https://www.theapricity.com/forum/showthread.php?343634-Why-is-the-amount-of-East-Eurasian-ancestry-of-Saamis-and-other-Uralics-underestimate-by-some-here/page2

Komintasavalta

03-12-2021, 08:22 AM

Yeah I figured it out already. I did something like this to make a global PCA of modern individuals in the Reich dataset:

wget reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_1240K_public.tar
tar -xf v44.3_1240K_public.tar
f=v44.3_1240K_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
sed 1d v44.3_1240K_public.anno|grep -v 1KGPhase|awk -F\\t '$9=="Modern"{print$2,$13}'|grep -Ev '_dup|Ignore_|\.REF|_o'|sed -E 's/\.(SDG|DG|SG)$//'>picks0
awk 'NR==FNR{a[$1];next}$2 in a' picks0 v44.3_1240K_public.fam>picks
plink --bfile v44.3_1240K_public --keep picks --allow-no-sex --make-bed --out picks
plink --bfile picks --pca --geno .001 --allow-no-sex --out picks
paste -d' ' <(cut -d' ' -f2 picks0) <(cut -d' ' -f2- picks.eigenvec)>picks.eigenvec.2

When I didn't add `--geno .001`, it fked up the clustering of some populations at first, so that for example one South Asian population clustered together with Africans.

I couldn't get convertf to compile, but I downloaded a Mac binary from here: https://github.com/chrchang/eigensoft. The Mac binaries for plink from Harvard's website didn't work, but there were working binaries by another maintainer here: https://www.cog-genomics.org/plink/1.9/. You can download the v44.3_1240K_public.tar file manually from here: https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/index.html.

I then made this in R:

https://i.ibb.co/h2jY6SD/plink-pca-1240k.png

libarary(tidyverse)
library(colorspace)

f="picks"
t=read.table(paste0(f,".eigenvec.2"),sep=" ")
eig=as.double(readLines(paste0(f,".eigenval")))

# t=cbind(t[,c(1,2)],t(t(t[,-c(1,2)])*sqrt(eig))) # I think this corresponds to scaling in G25

pct=paste0("PC",seq(length(eig))," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

ave=aggregate(t[,-c(1,2)],list(t[,1]),mean)
names(ave)=c("pop",paste0("PC",seq(ncol(ave)-1)))

k=cutree(hclust(dist(ave[,-1]),method="ward.D2"),k=12)
write.csv(k,"/tmp/k",quote=F)
ave$k=k

ggplot(ave,aes(x=PC1,y=PC2,label=pop))+
geom_point(aes(color=as.factor(k)),size=.5)+
geom_polygon(data=ave%>%group_by(k)%>%slice(chull(PC1,PC2)),alpha=.2,aes(color=as.facto r(k),fill=as.factor(k)),size=.3)+
geom_text(aes(label=pop,color=as.factor(k)),size=2 ,vjust=-.7)+
theme(
aspect.ratio=3/4,
axis.text=element_text(color="black",size=7),
axis.ticks.length=unit(0,"pt"),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.title=element_text(color="black",size=10),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.grid.major=element_line(color="gray75",size=.2)
)+
scale_x_continuous(breaks=seq(-2,2,.1),expand=expansion(mult=.07))+
scale_y_continuous(breaks=seq(-2,2,.1),expand=expansion(mult=.06))+
labs(x=pct[1],y=pct[2])+
scale_color_discrete_qualitative(palette="Set 2",c=80,l=40)

ggsave("output.png")
system("/usr/local/bin/mogrify -trim -bordercolor white -border 20x20 output.png")

However there's something wrong with the distances in my PCA. For example Finns have about 5 times bigger distance to Khomani_San than to Yoruba:

$ tav(){ awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}";}
$ dist(){ awk -F, 'NR==FNR{for(i=2;i<=NF;i++)a[i]=$i;next}$1{s=0;for(i=2;i<=NF;i++)s+=($i-a[i])^2;print s^.5,$1}' "$2" "$1"|sort -n|awk '{printf"%.3f %s\n",$1,$2}'|sed s,^0,,;}
$ paste -d' ' <(cut -d' ' -f2 maailma0) <(cut -d' ' -f3- maailma.eigenvec)|tav ' '|tr ' ' ,>ave;dist ave <(grep Finnish ave)|tail -n16
.130 Karitiana
.130 Mende
.135 Gambian
.136 Esan
.155 Yoruba
.174 Papuan
.189 Mandenka
.206 BantuKenya
.216 Biaka
.242 BantuSA
.268 Mbuti
.336 Ju_hoan_North
.492 BantuHerero
.492 BantuSA_Herero
.497 BantuTswana
.705 Khomani_San

Am I supposed to apply some further quality control or filtering? I tried to include only samples with few missing SNPs: `awk -F\\t '$21>9e5' v44.3_1240K_public.anno`. I also tried increasing the value of the `--geno` option and I tried adding an option like `--max-maf .3`. None of it helped however.

I also tried multiplying the columns of the table with the square roots of the eigenvalues but it didn't help:

f="picks"
t=read.table(paste0(f,".eigenvec.2"),sep=" ")
eig=as.double(readLines(paste0(f,".eigenval")))

t2=(cbind(t[,c(1,2)],t(t(t[,-c(1,2)])*sqrt(eig))))

ave=aggregate(t[,-c(1,2)],list(t[,1]),mean)
ave2=aggregate(t2[,-c(1,2)],list(t2[,1]),mean)

ind=cbind(paste0(t[,1],":",t[,2]),t[,-c(1,2)])
ind2=cbind(paste0(t2[,1],":",t2[,2]),t2[,-c(1,2)])

write.table(ind,paste0(f,".ind"),quote=F,sep=",",col.names=F,row.names=F)
write.table(ind2,paste0(f,".indscaled"),quote=F,sep=",",col.names=F,row.names=F)
write.table(ave,paste0(f,".ave"),quote=F,sep=",",col.names=F,row.names=F)
write.table(ave2,paste0(f,".avescaled"),quote=F,sep=",",col.names=F,row.names=F)

Lucas

03-12-2021, 08:27 AM

Yeah I figured it out already. I did something like this to make a global PCA of modern individuals in the Reich dataset:

wget reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_1240K_public.tar
tar -xf v44.3_1240K_public.tar
f=v44.3_1240K_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
sed 1d v44.3_1240K_public.anno|grep -v 1KGPhase|awk -F\\t '$9=="Modern"{print$2,$13}'|grep -Ev '_dup|Ignore_|\.REF|_o'|sed -E 's/\.(SDG|DG|SG)$//'|gv BIR>picks0
awk 'NR==FNR{a[$1];next}$2 in a' picks0 v44.3_1240K_public.fam>picks
plink --bfile v44.3_1240K_public --keep picks --allow-no-sex --make-bed --out picks
plink --bfile picks --pca --geno .001 --allow-no-sex --out picks
paste -d' ' <(cut -d' ' -f2 picks0) <(cut -d' ' -f2- picks.eigenvec)>picks.eigenvec.2

Why not SmartPCA? Davidski used it for G25, not PlinkPCA https://eurogenes.blogspot.com/2017/05/pca-projection-bias-fix.html

Lucas

03-12-2021, 08:50 AM

For plink dataset do also LD based pruning https://zzz.bwh.harvard.edu/plink/summary.shtml#prune

plink --file data --indep-pairwise 50 5 0.5 (for last better lower value like 0.3)

=======================================

Also missing rate per person https://zzz.bwh.harvard.edu/plink/thresh.shtml#miss2

plink --file mydata --mind 0.1

==========================================
Also minor allele frequency exclude https://zzz.bwh.harvard.edu/plink/thresh.shtml#maf

plink --file mydata --maf 0.05

After that dataset will be smaller in size of course but should be better.

Komintasavalta

03-12-2021, 11:18 AM

Why not SmartPCA? Davidski used it for G25, not PlinkPCA https://eurogenes.blogspot.com/2017/05/pca-projection-bias-fix.html

I tried SmartPCA with the whole Reich dataset at first, but the dataset was rejected because there were more than 100 populations:

$ f=g/v44.3_1240K_public/v44.3_1240K_public;smartpca -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind evecoutname:\ evec evaloutname:\ eval)
parameter file: /dev/fd/63
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
genotypename: g/v44.3_1240K_public/v44.3_1240K_public.geno
snpname: g/v44.3_1240K_public/v44.3_1240K_public.snp
indivname: g/v44.3_1240K_public/v44.3_1240K_public.ind
evecoutname: evec
evaloutname: eval
## smartpca version: 10210
norm used

read 1073741824 bytes
read 2147483648 bytes
read 2859357147 bytes
packed geno read OK
number of populations too large. Increase maxpops if you wish
fatalx:
(makeeglist) You really want to analyse more than 100 populations?

I think the maxpops option needs to be changed from the source code where it's defined as `#define MAXPOPS 100`. Adding a maxpops option to the parfile didn't have an effect, and it's not documented as one of the options in the parfile (https://github.com/chrchang/eigensoft/blob/master/POPGEN/README).

Next I tried SmartPCA with a subset of samples from the Reich dataset:

$ plink --bfile g/bed/v44.3_1240K_public --keep <(awk -F\\t '$9=="Modern"&&$21>9e5{print$2}' g/v44.3_1240K_public/v44.3_1240K_public.anno|grep -v REF|head -n200|awk 'NR==FNR{a[$0];next}$2 in a' - g/bed/v44.3_1240K_public.fam) --make-bed --out reichsubset
$ f=reichsubset;smartpca -p <(printf %s\\n genotypename:\ $f.bed snpname:\ $f.bim indivname:\ $f.fam evecoutname:\ $f.evec evaloutname:\ $f.eval numoutlieriter:\ 0)

Without the option `numoutlieriter: 0`, it removed 30 out of 200 of the samples as outliers (including all SSAs).

However like with `plink --pca`, the distances of Khoisan and Bambutids seemed too high.

Actually what I needed was `--maf .05`:

$ plink --bfile g/bed/v44.3_1240K_public --keep <(awk -F\\t '$9=="Modern"&&$21>9e5{print$2}' g/v44.3_1240K_public/v44.3_1240K_public.anno|grep -v REF|head -n200|awk 'NR==FNR{a[$0];next}$2 in a' - g/bed/v44.3_1240K_public.fam) --allow-no-sex --maf .05 --make-bed --out withmaf
$ plink --bfile g/bed/v44.3_1240K_public --keep <(awk -F\\t '$9=="Modern"&&$21>9e5{print$2}' g/v44.3_1240K_public/v44.3_1240K_public.anno|grep -v REF|head -n200|awk 'NR==FNR{a[$0];next}$2 in a' - g/bed/v44.3_1240K_public.fam) --allow-no-sex --make-bed --out nomaf
$ f=withmaf;smartpca -p <(printf %s\\n genotypename:\ $f.bed snpname:\ $f.bim indivname:\ $f.fam evecoutname:\ $f.evec evaloutname:\ $f.eval numoutlieriter:\ 0)
$ f=nomaf;smartpca -p <(printf %s\\n genotypename:\ $f.bed snpname:\ $f.bim indivname:\ $f.fam evecoutname:\ $f.evec evaloutname:\ $f.eval numoutlieriter:\ 0)
$ sed 1d withmaf.evec|awk '{$1=$1}NF--' OFS=,|cut -d: -f2 >withmafdist
$ sed 1d nomaf.evec|awk '{$1=$1}NF--' OFS=,|cut -d: -f2 >nomafdist
$ dist(){ awk -F, 'NR==FNR{for(i=2;i<=NF;i++)a[i]=$i;next}$1{s=0;for(i=2;i<=NF;i++)s+=($i-a[i])^2;print s^.5,$1}' "$2" "$1"|sort -n|awk '{printf"%.3f %s\n",$1,$2}'|sed s,^0,,;}
$ dist withmafdist <(grep Finnish withmafdist)|tail -n16
.439 B_Karitiana-3.DG
.443 S_Eskimo_Sireniki-1.DG
.453 S_Eskimo_Sireniki-2.DG
.455 S_BedouinB-2.DG
.472 S_Eskimo_Chaplin-1.DG
.473 S_Eskimo_Naukan-1.DG
.473 A_Ju_hoan_North-5.DG
.475 S_Eskimo_Naukan-2.DG
.486 S_Khomani_San-1.DG
.495 B_Ju_hoan_North-4.DG
.503 S_Ju_hoan_North-1.DG
.511 S_BedouinB-1.DG
.516 S_Ju_hoan_North-2.DG
.597 A_Mbuti-5.DG
.601 B_Mbuti-4.DG
.627 S_Mbuti-3.DG
$ dist nomafdist <(grep Finnish nomafdist)|tail -n16
.314 S_Papuan-2.DG
.316 A_Karitiana-4.DG
.318 S_Papuan-9.DG
.323 A_Papuan-16.DG
.326 B_Karitiana-3.DG
.524 B_Mbuti-4.DG
.530 B_Ju_hoan_North-4.DG
.546 S_Ju_hoan_North-1.DG
.553 S_Ju_hoan_North-2.DG
.562 S_Mbuti-3.DG
.592 S_Khomani_San-1.DG
.629 A_Mbuti-5.DG
.738 B_Yoruba-3.DG
.880 S_Yoruba-2.DG
1.003 A_Yoruba-4.DG
1.008 A_Ju_hoan_North-5.DG

With `--maf .05` I got a plot similar to G25, but with `--maf .01` the distance from Bambutids and Capoids to other humans was reduced only moderately:

https://i.ibb.co/zHrfXVB/nomaf.png
https://i.ibb.co/84YVx7v/maf05.png
https://i.ibb.co/P9TW5dc/maf01.png

But what if the distance between Finns and Ju'Hoan is actually supposed to be much bigger than the distance between Finns and Karitiana? Could it be artificially reduced by G25 because it removes minor alleles that are specific to Capoids?

Zoro

03-12-2021, 01:17 PM

I tried SmartPCA with the whole Reich dataset at first, but the dataset was rejected because there were more than 100 populations:

But what if the distance between Finns and Ju'Hoan is actually supposed to be much bigger than the distance between Finns and Karitiana? Could it be artificially reduced by G25 because it removes minor alleles that are specific to Capoids?

I wouldn’t use —maf because that removes positions with allele frequency below for ex 0.01 if —maf 0.01. I would instead use —max-maf which dies opposite. For ex —max-maf 0.4 removes uninformative alleles common to your data > 40%

Also i would use —geno 0.001 if one wants to have an overlapping set of SNPs in all samples, in other words one doesn’t want some samples to have more Snps than others

Komintasavalta

03-12-2021, 01:44 PM

Actually `--maf .05` is probably way too high. When I tried `plink --pca` with different `--maf` settings, `--maf .05` caused Finns to be less than twice as far from Khomani San as from Armenians. The effect of `--maf` became noticeable between .005 and .01, but it became huge between .01 and .05.

$ for x in {0001,001,005,01,05};do plink --bfile g/bed/v44.3_1240K_public --keep <(awk -F\\t '$9=="Modern"&&$2~/\.DG$/{print$2,$13}' g/v44.3_1240K_public/v44.3_1240K_public.anno|grep -Ev 'REF\.|Ignore_|_o'|awk 'NR==FNR{a[$1];next}$2 in a' - g/bed/v44.3_1240K_public.fam) --allow-no-sex --maf .$x --geno .1 --pca --out $x;done
$ for x in {0001,001,005,01,05};do printf %s\\n '' "--maf .$x --geno .1:";grep Finnish-1 $x.eigenvec|awk 'NR==1{for(i=3;i<=NF;i++)a[i]=$i;next}{s=0;for(i=3;i<=NF;i++)s+=(a[i]-$i)^2;print s^.5,$2}' - $x.eigenvec|sort -n|egrep '(Khomani_San|Karitiana|Mbuti|Eskimo_Sireniki|Arme nian|Hungarian)-1';done

--maf .0001 --geno .1:
0.0307083 S_Hungarian-1.DG
0.100753 S_Armenian-1.DG
0.236026 S_Eskimo_Sireniki-1.DG
0.300353 S_Karitiana-1.DG
0.532705 S_Mbuti-1.DG
1.0062 S_Khomani_San-1.DG

--maf .001 --geno .1:
0.0307083 S_Hungarian-1.DG
0.100753 S_Armenian-1.DG
0.236026 S_Eskimo_Sireniki-1.DG
0.300353 S_Karitiana-1.DG
0.532705 S_Mbuti-1.DG
1.0062 S_Khomani_San-1.DG

--maf .005 --geno .1:
0.0346009 S_Hungarian-1.DG
0.140324 S_Armenian-1.DG
0.265385 S_Eskimo_Sireniki-1.DG
0.3049 S_Karitiana-1.DG
0.532409 S_Mbuti-1.DG
1.00394 S_Khomani_San-1.DG

--maf .01 --geno .1:
0.0461182 S_Hungarian-1.DG
0.246201 S_Armenian-1.DG
0.318358 S_Eskimo_Sireniki-1.DG
0.324301 S_Karitiana-1.DG
0.542761 S_Mbuti-1.DG
0.79708 S_Khomani_San-1.DG

--maf .05 --geno .1:
0.100773 S_Hungarian-1.DG
0.304498 S_Armenian-1.DG
0.429683 S_Eskimo_Sireniki-1.DG
0.450306 S_Khomani_San-1.DG
0.536177 S_Mbuti-1.DG
0.632922 S_Karitiana-1.DG

Here's the same with no `--geno` option:

--maf .0001:
0.030676 S_Hungarian-1.DG
0.100718 S_Armenian-1.DG
0.236023 S_Eskimo_Sireniki-1.DG
0.300351 S_Karitiana-1.DG
0.532722 S_Mbuti-1.DG
1.0062 S_Khomani_San-1.DG

--maf .001:
0.030676 S_Hungarian-1.DG
0.100718 S_Armenian-1.DG
0.236023 S_Eskimo_Sireniki-1.DG
0.300351 S_Karitiana-1.DG
0.532722 S_Mbuti-1.DG
1.0062 S_Khomani_San-1.DG

--maf .005:
0.0345906 S_Hungarian-1.DG
0.14085 S_Armenian-1.DG
0.265786 S_Eskimo_Sireniki-1.DG
0.304951 S_Karitiana-1.DG
0.532434 S_Mbuti-1.DG
1.00393 S_Khomani_San-1.DG

--maf .01:
0.0460322 S_Hungarian-1.DG
0.246952 S_Armenian-1.DG
0.318423 S_Eskimo_Sireniki-1.DG
0.324423 S_Karitiana-1.DG
0.543871 S_Mbuti-1.DG
0.796861 S_Khomani_San-1.DG

--maf .05:
0.100475 S_Hungarian-1.DG
0.30487 S_Armenian-1.DG
0.429481 S_Eskimo_Sireniki-1.DG
0.450133 S_Khomani_San-1.DG
0.53606 S_Mbuti-1.DG
0.633111 S_Karitiana-1.DG

BTW some of the individuals in my previous PCA were marked with `Ignore_`, like the one outlier Ju'Hoan. I probably should've not included them.

Zoro

03-12-2021, 02:00 PM

Maybe you didn’t understand what i said so I’ll repeat it’s bad idea to use —maf because that does opposite of what we want. It removes informative population specific alleles or in other words rarer alleles

You should use —max-maf instead which removes uninformative alleles common to all populations in other words very very ancient alleles

Try —max-maf 0.4 and —geno 0.001 and repost

Also check your plink bim file against dbsnp database to make sure you do not have some flipped alleles because that’s pretty common with plink

Plink bim col 5 should have alt allele and col 6 ref allele. I bet the order is wrong on some of your positions in the bim file

Komintasavalta

03-12-2021, 02:15 PM

it’s bad idea to use —maf because that does opposite of what we want. It removes informative population specific alleles or in other words rarer alleles

You should use —max-maf instead which removes uninformative alleles common to all populations in other words very very ancient alleles

Try —max-maf 0.4 and —geno 0.001 and repost

`--max-maf` had a pretty huge effect too. Adding `--max-maf .3` caused S_Finnish-1.DG to be twice as close to S_Hungarian-1.DG:

No --max-maf, no --geno:
0.0306736 S_Hungarian-1.DG
0.100753 S_Armenian-1.DG
0.235997 S_Eskimo_Sireniki-1.DG
0.300359 S_Karitiana-1.DG
0.532739 S_Mbuti-1.DG
1.00621 S_Khomani_San-1.DG

--max-maf .4, no --geno:
0.0220245 S_Hungarian-1.DG
0.0761764 S_Armenian-1.DG
0.148297 S_Eskimo_Sireniki-1.DG
0.269105 S_Karitiana-1.DG
0.533212 S_Mbuti-1.DG
1.00575 S_Khomani_San-1.DG

--max-maf .3, no --geno:
0.0164799 S_Hungarian-1.DG
0.0640595 S_Armenian-1.DG
0.149513 S_Eskimo_Sireniki-1.DG
0.261177 S_Karitiana-1.DG
0.537252 S_Mbuti-1.DG
1.00609 S_Khomani_San-1.DG

Adding `--geno .001` didn't change the results that much:

--max-maf .4 --geno .001:
0.0213731 S_Hungarian-1.DG
0.0702464 S_Armenian-1.DG
0.149484 S_Eskimo_Sireniki-1.DG
0.266066 S_Karitiana-1.DG
0.537867 S_Mbuti-1.DG
1.00574 S_Khomani_San-1.DG

Zoro

03-12-2021, 02:29 PM

`--max-maf` had a pretty huge effect too. Adding `--max-maf .3` caused S_Finnish-1.DG to be twice as close to S_Hungarian-1.DG:

No --max-maf, no --geno:
0.0306736 S_Hungarian-1.DG
0.100753 S_Armenian-1.DG
0.235997 S_Eskimo_Sireniki-1.DG
0.300359 S_Karitiana-1.DG
0.532739 S_Mbuti-1.DG
1.00621 S_Khomani_San-1.DG

--max-maf .4, no --geno:
0.0220245 S_Hungarian-1.DG
0.0761764 S_Armenian-1.DG
0.148297 S_Eskimo_Sireniki-1.DG
0.269105 S_Karitiana-1.DG
0.533212 S_Mbuti-1.DG
1.00575 S_Khomani_San-1.DG

--max-maf .3, no --geno:
0.0164799 S_Hungarian-1.DG
0.0640595 S_Armenian-1.DG
0.149513 S_Eskimo_Sireniki-1.DG
0.261177 S_Karitiana-1.DG
0.537252 S_Mbuti-1.DG
1.00609 S_Khomani_San-1.DG

Adding `--geno .001` didn't change the results that much:

--max-maf .4 --geno .001:
0.0213731 S_Hungarian-1.DG
0.0702464 S_Armenian-1.DG
0.149484 S_Eskimo_Sireniki-1.DG
0.266066 S_Karitiana-1.DG
0.537867 S_Mbuti-1.DG
1.00574 S_Khomani_San-1.DG

How does pca look with max-maf 0.4 and 0.3 and how many SNPs

Lemminkäinen

03-12-2021, 03:09 PM

Komintasavalta

However there's something wrong with the distances in my PCA. For example Finns have about 5 times bigger distance to Khomani_San than to Yoruba:

The reason is one of weaknesses in PCA; inbred populations in the edge areas tend to stretch. It is even stronger with low quality ancient samples. In case of ancient samples Smartpca includes corrections set by run time parameters, but in case of high quality samples it is a real problem, because it can't be corrected by filling missing SNPs by approximations. Use only one inbred population on the pca edges. If you have to use more than one you need to place a "bigger collector population" to collect edge samples.

This happens because PCA always leak on edges. In case of inbred populations even more.

Komintasavalta

03-12-2021, 03:59 PM

Removing more variants with `--geno` also caused Capoids to move further from other Africans.

- `--geno` (same as `--geno .1`) removed about 8% of all variants.
- `--geno .01` removed about 60% of all variants (719878 out of 1233013).
- `--geno .005` removed about 93% of all variants (1148877 out of 1233013)
- `--geno .001` removed all except 10 variants, i.e. 1233003 out of 1233013.

https://i.ibb.co/Gt05t7B/nogeno.png
https://i.ibb.co/0nMLM9j/geno1.png
https://i.ibb.co/qndrv1Q/geno01.png
https://i.ibb.co/vDs2kwg/geno005.png

The different `--geno` options didn't have a large impact on the distance to S_Finnish-1.DG:

No --geno:
0.0312757 S_Hungarian-1.DG
0.0722052 S_Eskimo_Sireniki-1.DG
0.0803997 S_Armenian-1.DG
0.10559 S_Karitiana-1.DG
0.310352 S_Mbuti-1.DG
0.998254 S_Khomani_San-1.DG

--geno .1
0.0316516 S_Hungarian-1.DG
0.072291 S_Eskimo_Sireniki-1.DG
0.0808708 S_Armenian-1.DG
0.105792 S_Karitiana-1.DG
0.311129 S_Mbuti-1.DG
0.998301 S_Khomani_San-1.DG

--geno .01:
0.029879 S_Hungarian-1.DG
0.074467 S_Eskimo_Sireniki-1.DG
0.0810191 S_Armenian-1.DG
0.106372 S_Karitiana-1.DG
0.312748 S_Mbuti-1.DG
0.998466 S_Khomani_San-1.DG

--geno .005:
0.0298644 S_Hungarian-1.DG
0.0731674 S_Eskimo_Sireniki-1.DG
0.0814011 S_Armenian-1.DG
0.106963 S_Karitiana-1.DG
0.282191 S_Mbuti-1.DG
0.997774 S_Khomani_San-1.DG

How does pca look with max-maf 0.4 and 0.3 and how many SNPs

Lower --max-maf settings caused Capoids to move further away from other Africans:

https://i.ibb.co/PwYjbkd/maxmaf0.png
https://i.ibb.co/D8HxRBd/maxmaf4.png
https://i.ibb.co/pzc4jXf/maxmaf3.png

Zoro

03-12-2021, 04:30 PM

@Komin

Africans moving further from Eurasians is normal and expected with max-maf because we’re removing the very ancient alleles Eurasians commonly share with SSA. So that’s good. Also with max-naf we remove some of the ENF alleles that Europeans share with each other and Asians and make Italians close to W Asians. So that’s also good

Can you post PCA for geno 0.001 and max-maf 0.4 and 0.3

PC1/PC2
PC1/PC3

Edit: if you want Eurasians to be more clearly visible on PCA. Don’t add Africans when you generate PCA but still include them when you process max-maf

Komintasavalta

03-12-2021, 04:59 PM

Can you post PCA for geno 0.001 and max-maf 0.4 and 0.3

PC1/PC2
PC1/PC3

`--max-maf .4 --geno .001` removed all except 11 variants with the populations I used previously. Therefore I tried selecting only samples with million or more SNPs instead. The clustering is still messed up, because for example Maris cluster with Malaysians in the first image.

`--geno .001 --max-maf .4` printed this:

598790 variants removed due to missing genotype data (--geno).
92894 variants removed due to minor allele threshold(s)

`-geno .001 --max-maf .3` printed this:

598790 variants removed due to missing genotype data (--geno).
189344 variants removed due to minor allele threshold(s)

https://i.ibb.co/XL1k6mp/maxmaf0.png
https://i.ibb.co/NSN34bw/maxmaf4.png
https://i.ibb.co/hXJfdJZ/maxmaf3.png

Papuans are the most differentiated on PC3 and Capoids are the most differentiated on PC4:

https://i.ibb.co/Z1ZvmMz/maxmaf4.png

Zoro

03-12-2021, 05:24 PM

Looks good and expected as we remove more and more very ancient alleles that are common to most pops the pops drift further apart but you didn’t post PC1/PC3

Also can you remove Africans from PCA so we can more clearly see Eurasians

Edit: if you want to remove old ENF common alleles that bring W Asians closer to Europeans and very old alleles that common to American Indians and other Eurasians don’t extract any Africans from your master. Only keep Eurasians and then do max-maf 0.4 or 0.3

Zanzibar

03-13-2021, 02:24 AM

Removing more variants with `--geno` also caused Capoids to move further from other Africans.

- `--geno` (same as `--geno .1`) removed about 8% of all variants.
- `--geno .01` removed about 60% of all variants (719878 out of 1233013).
- `--geno .005` removed about 93% of all variants (1148877 out of 1233013)
- `--geno .001` removed all except 10 variants, i.e. 1233003 out of 1233013.

https://i.ibb.co/Gt05t7B/nogeno.png
https://i.ibb.co/0nMLM9j/geno1.png
https://i.ibb.co/qndrv1Q/geno01.png
https://i.ibb.co/vDs2kwg/geno005.png

The different `--geno` options didn't have a large impact on the distance to S_Finnish-1.DG:

No --geno:
0.0312757 S_Hungarian-1.DG
0.0722052 S_Eskimo_Sireniki-1.DG
0.0803997 S_Armenian-1.DG
0.10559 S_Karitiana-1.DG
0.310352 S_Mbuti-1.DG
0.998254 S_Khomani_San-1.DG

--geno .1
0.0316516 S_Hungarian-1.DG
0.072291 S_Eskimo_Sireniki-1.DG
0.0808708 S_Armenian-1.DG
0.105792 S_Karitiana-1.DG
0.311129 S_Mbuti-1.DG
0.998301 S_Khomani_San-1.DG

--geno .01:
0.029879 S_Hungarian-1.DG
0.074467 S_Eskimo_Sireniki-1.DG
0.0810191 S_Armenian-1.DG
0.106372 S_Karitiana-1.DG
0.312748 S_Mbuti-1.DG
0.998466 S_Khomani_San-1.DG

--geno .005:
0.0298644 S_Hungarian-1.DG
0.0731674 S_Eskimo_Sireniki-1.DG
0.0814011 S_Armenian-1.DG
0.106963 S_Karitiana-1.DG
0.282191 S_Mbuti-1.DG
0.997774 S_Khomani_San-1.DG

Lower --max-maf settings caused Capoids to move further away from other Africans:

https://i.ibb.co/PwYjbkd/maxmaf0.png
https://i.ibb.co/D8HxRBd/maxmaf4.png
https://i.ibb.co/pzc4jXf/maxmaf3.png

Very interesting. These PCAs show that Native Americans really have Caucasoid/West Eurasian ancestry judging by how they clustered between Mongoloids and more Caucasoid admixed groups like Central Asians and South Asians. I actually think Amerindians should plot the same as Altaians though, in these PCAs, they are seem to be shifted east towards Mongoloids compared to Altaians. When actually both Natives and Altaians have very similar amounts of East and West Eurasian ancestries in percentages. Probably has to do with the extreme genetic drift and isolation in Amerindians.

Also its fascinating how Papuans and other Australoids also seem to drift towards South Asians in most PCAs. It makes me wonder if Australoidsin general have ancient West Eurasian ancestry?

Look at how Saamis and VURers like Mari, Bashkir are plotted very faraway from Caucasoids (Euros and West Asians/MENAs) lol. They are even farther from Euros and churkas/wogs in the PCAs than many South-Central Asians like Tajiks, Pashtuns, Balochs, Sindhis, Burushos or even some South Asians like certain Gujaratis are.

Zanzibar

03-13-2021, 02:25 AM

Looks good and expected as we remove more and more very ancient alleles that are common to most pops the pops drift further apart but you didn’t post PC1/PC3

Also can you remove Africans from PCA so we can more clearly see Eurasians

Edit: if you want to remove old ENF common alleles that bring W Asians closer to Europeans and very old alleles that common to American Indians and other Eurasians don’t extract any Africans from your master. Only keep Eurasians and then do max-maf 0.4 or 0.3

Do Australoids liker Papuans/Melanesians, Aborigines have West Eurasian ancestry? Why do they always seem so drift towards South Asians in most Global PCAs? Amerindians do have the same amount of West Eurasian/Caucasoid as some Central Asians like Altaians, Kyrgyzs do right?

Komintasavalta

03-13-2021, 06:17 AM

I made a matrix of 1-IBS distances of modern individuals in the Reich dataset with at least 1.1 million SNPs, excluding samples from 1000 Genomes (1KG):

$ x=ibs;awk -F\\t 'NR>1&&$9=="Modern"&&$7!~/^1KG/&&$21>1.1e6{print$2" "$13}' g/v44.3_1240K_public/v44.3_1240K_public.anno|grep -Ev '\.REF|Ignore_|_o|_dup|_contam|SomeAfrican'>$x.pick;plink --bfile g/bed/v44.3_1240K_public --allow-no-sex --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick g/bed/v44.3_1240K_public.fam) --geno .01 --distance 1-ibs square --out $x;awk '{print$2":"$1}' $x.pick|paste - $x.mdist|cat <(awk '{print$2":"$1}' $x.pick|cat <(echo) -|paste -s) ->$x.table
$ head -n4 ibs.table|cut -f-4|tr \\t ,
,Dinka.DG:A_Dinka-4.DG,Dai.DG:A_Dai-5.DG,Mandenka.DG:A_Mandenka-4.DG
Dinka.DG:A_Dinka-4.DG,0,0.268425,0.227236
Dai.DG:A_Dai-5.DG,0.268425,0,0.272545
Mandenka.DG:A_Mandenka-4.DG,0.227236,0.272545,0

This calculates population averages for the 1-IBS distances:

$ awk 'NR==FNR{a[NR]=$2;next}{for(i=1;i<=NF;i++){b[a[FNR]][a[i]]+=$i;n[a[FNR]][a[i]]++}}END{PROCINFO["sorted_in"]="@val_str_asc";for(i in b)printf"\t%s",i;print"";for(i in b){printf"%s",i;for(j in b)printf"\t"b[i][j]/n[i][j];print""}}' ibs.pick ibs.mdist

Or another way:

tav(){ awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}";}
tp(){ awk '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?"\n":FS),a[i][j]}' "FS=${1-$'\t'}";}
paste <(cut -d' ' -f2 ibs.pick) ibs.mdist|tav|tp|paste <(echo;cut -d' ' -f2 ibs.pick) -|(gsed -u 1q;tav)

Or with R:

Rscript -e 't=read.table("ibs.mdist");t2=read.table("ibs.pick");a=aggregate(t,list(t2$V2),mean);row.names(a)=a[,1];a=a[,-1];a=aggregate(t(a),list(t2$V2),mean);row.names(a)=a[,1];a=a[,-1];write.table(round(a,6),"output",sep="\\t",quote=F)'

This finds the closest individuals to an individual:

$ awk 'NR==FNR{a[NR]=$2":"$1;b[$1]=NR;next}FNR>1{k=b["S_Mansi-1.DG"];print$k,a[FNR]}' ibs.pick ibs.mdist|sort -n|head -n4
0 Mansi.DG:S_Mansi-1.DG
0.207693 Mansi.DG:S_Mansi-2.DG
0.208493 Yakut.SDG:HGDP00951.SDG
0.208495 Yakut.DG:S_Yakut-2.DG

However the distance of Mansi is still lower to some Papuans than to some Altaians or Siberian Tatars:

0.156157 Papuan.DG:S_Papuan-2.DG
0.156167 Papuan.SDG:HGDP00540.SDG
0.156212 Teleut.SG:Teleuts1.SG
0.15626 Aymara.SG:TA6.SG
0.156294 Yukpa.SG:Y2040.SG
0.15647 RIA.SG:mondal_RIA-63.SG
0.156916 Altaian.SG:Altais2.SG
0.157109 Huichol.SG:HUI03.SG
0.157143 RIA.SG:mondal_RIA-42.SG
0.157306 Tatar_Irtysh_Barabinsk.SG:IrtyshBarabinskTatars2.S G

What am I doing wrong? Using `--geno .001` instead of `--geno .01` or adding `--max-maf .4` didn't help.

Do I need a "marker set in approximate linkage equilibrium" like `plink --help` says?

--distance [{square | square0 | triangle}] [{gz | bin | bin4}] ['ibs']
['1-ibs'] ['allele-ct'] ['flat-missing']
Write a lower-triangular tab-delimited table of (weighted) genomic
distances in allele count units to <output prefix>.dist, and a list of the
corresponding sample IDs to <output prefix>.dist.id. The first row of the
.dist file contains a single <genome 1-genome 2> distance, the second row
has the <genome 1-genome 3> and <genome 2-genome 3> distances in that
order, etc.
* It is usually best to perform this calculation on a marker set in
approximate linkage equilibrium.
* If the 'square' or 'square0' modifier is present, a square matrix is
written instead; 'square0' fills the upper right triangle with zeroes.
* If the 'gz' modifier is present, a compressed .dist.gz file is written
instead of a plain text file.
* If the 'bin' modifier is present, a binary (square) matrix of
double-precision floating point values, suitable for loading from R, is
instead written to <output prefix>.dist.bin. ('bin4' specifies
single-precision numbers instead.) This can be combined with 'square0'
if you still want the upper right zeroed out, or 'triangle' if you don't
want to pad the upper right at all.
* If the 'ibs' modifier is present, an identity-by-state matrix is written
to <output prefix>.mibs. '1-ibs' causes distances expressed as genomic
proportions (i.e. 1 - IBS) to be written to <output prefix>.mdist.
Combine with 'allele-ct' if you want to generate the usual .dist file as
well.
* By default, distance rescaling in the presence of missing genotype calls
is sensitive to allele count distributions: if variant A contributes, on
average, twice as much to other pairwise distances as variant B, a
missing call at variant A will result in twice as large of a missingness
correction. To turn this off (because e.g. your missing calls are highly
nonrandom), use the 'flat-missing' modifier.
* The computation can be subdivided with --parallel.

Lucas

03-13-2021, 08:36 AM

I made a matrix of 1-IBS distances of modern individuals in the Reich dataset with at least 1.1 million SNPs, excluding samples from 1000 Genomes (1KG):

$ x=ibs;awk -F\\t 'NR>1&&$9=="Modern"&&$7!~/^1KG/&&$21>1.1e6{print$2" "$13}' g/v44.3_1240K_public/v44.3_1240K_public.anno|grep -Ev '\.REF|Ignore_|_o|_dup|_contam|SomeAfrican'>$x.pick;plink --bfile g/bed/v44.3_1240K_public --allow-no-sex --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick g/bed/v44.3_1240K_public.fam) --geno .01 --distance 1-ibs square --out $x;awk '{print$2":"$1}' $x.pick|paste - $x.mdist|cat <(awk '{print$2":"$1}' $x.pick|cat <(echo) -|paste -s) ->$x.table
$ head -n4 ibs.table|cut -f-4|tr \\t ,
,Dinka.DG:A_Dinka-4.DG,Dai.DG:A_Dai-5.DG,Mandenka.DG:A_Mandenka-4.DG
Dinka.DG:A_Dinka-4.DG,0,0.268425,0.227236
Dai.DG:A_Dai-5.DG,0.268425,0,0.272545
Mandenka.DG:A_Mandenka-4.DG,0.227236,0.272545,0

This calculates population averages for the 1-IBS distances:

$ awk 'NR==FNR{a[NR]=$2;next}{for(i=1;i<=NF;i++){b[a[FNR]][a[i]]+=$i;n[a[FNR]][a[i]]++}}END{PROCINFO["sorted_in"]="@val_str_asc";for(i in b)printf"\t%s",i;print"";for(i in b){printf"%s",i;for(j in b)printf"\t"b[i][j]/n[i][j];print""}}' ibs.pick ibs.mdist

Or another way:

tav(){ awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}";}
tp(){ awk '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?"\n":FS),a[i][j]}' "FS=${1-$'\t'}";}
paste <(cut -d' ' -f2 ibs.pick) ibs.mdist|tav|tp|paste <(echo;cut -d' ' -f2 ibs.pick) -|(gsed -u 1q;tav)

Or with R:

Rscript -e 't=read.table("ibs.mdist");t2=read.table("ibs.pick");a=aggregate(t,list(t2$V2),mean);row.names(a)=a[,1];a=a[,-1];a=aggregate(t(a),list(t2$V2),mean);row.names(a)=a[,1];a=a[,-1];write.table(round(a,6),"output",sep="\\t",quote=F)'

This finds the closest individuals to an individual:

$ awk 'NR==FNR{a[NR]=$2":"$1;b[$1]=NR;next}FNR>1{k=b["S_Mansi-1.DG"];print$k,a[FNR]}' ibs.pick ibs.mdist|sort -n|head -n4
0 Mansi.DG:S_Mansi-1.DG
0.207693 Mansi.DG:S_Mansi-2.DG
0.208493 Yakut.SDG:HGDP00951.SDG
0.208495 Yakut.DG:S_Yakut-2.DG

However the distance of Mansi is still lower to some Papuans than to some Altaians or Siberian Tatars:

0.156157 Papuan.DG:S_Papuan-2.DG
0.156167 Papuan.SDG:HGDP00540.SDG
0.156212 Teleut.SG:Teleuts1.SG
0.15626 Aymara.SG:TA6.SG
0.156294 Yukpa.SG:Y2040.SG
0.15647 RIA.SG:mondal_RIA-63.SG
0.156916 Altaian.SG:Altais2.SG
0.157109 Huichol.SG:HUI03.SG
0.157143 RIA.SG:mondal_RIA-42.SG
0.157306 Tatar_Irtysh_Barabinsk.SG:IrtyshBarabinskTatars2.S G

What am I doing wrong? Using `--geno .001` instead of `--geno .01` or adding `--max-maf .4` didn't help.

Do I need a "marker set in approximate linkage equilibrium" like `plink --help` says?

--distance [{square | square0 | triangle}] [{gz | bin | bin4}] ['ibs']
['1-ibs'] ['allele-ct'] ['flat-missing']
Write a lower-triangular tab-delimited table of (weighted) genomic
distances in allele count units to <output prefix>.dist, and a list of the
corresponding sample IDs to <output prefix>.dist.id. The first row of the
.dist file contains a single <genome 1-genome 2> distance, the second row
has the <genome 1-genome 3> and <genome 2-genome 3> distances in that
order, etc.
* It is usually best to perform this calculation on a marker set in
approximate linkage equilibrium.
* If the 'square' or 'square0' modifier is present, a square matrix is
written instead; 'square0' fills the upper right triangle with zeroes.
* If the 'gz' modifier is present, a compressed .dist.gz file is written
instead of a plain text file.
* If the 'bin' modifier is present, a binary (square) matrix of
double-precision floating point values, suitable for loading from R, is
instead written to <output prefix>.dist.bin. ('bin4' specifies
single-precision numbers instead.) This can be combined with 'square0'
if you still want the upper right zeroed out, or 'triangle' if you don't
want to pad the upper right at all.
* If the 'ibs' modifier is present, an identity-by-state matrix is written
to <output prefix>.mibs. '1-ibs' causes distances expressed as genomic
proportions (i.e. 1 - IBS) to be written to <output prefix>.mdist.
Combine with 'allele-ct' if you want to generate the usual .dist file as
well.
* By default, distance rescaling in the presence of missing genotype calls
is sensitive to allele count distributions: if variant A contributes, on
average, twice as much to other pairwise distances as variant B, a
missing call at variant A will result in twice as large of a missingness
correction. To turn this off (because e.g. your missing calls are highly
nonrandom), use the 'flat-missing' modifier.
* The computation can be subdivided with --parallel.

Do you use only modern samples from Reich dataset?

vbnetkhio

03-13-2021, 11:18 AM

Yeah I figured it out already. I did something like this to make a global PCA of modern individuals in the Reich dataset:

wget reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_1240K_public.tar
tar -xf v44.3_1240K_public.tar
f=v44.3_1240K_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
sed 1d v44.3_1240K_public.anno|grep -v 1KGPhase|awk -F\\t '$9=="Modern"{print$2,$13}'|grep -Ev '_dup|Ignore_|\.REF|_o'|sed -E 's/\.(SDG|DG|SG)$//'>picks0
awk 'NR==FNR{a[$1];next}$2 in a' picks0 v44.3_1240K_public.fam>picks
plink --bfile v44.3_1240K_public --keep picks --allow-no-sex --make-bed --out picks
plink --bfile picks --pca --geno .001 --allow-no-sex --out picks
paste -d' ' <(cut -d' ' -f2 picks0) <(cut -d' ' -f2- picks.eigenvec)>picks.eigenvec.2

When I didn't add `--geno .001`, it fked up the clustering of some populations at first, so that for example one South Asian population clustered together with Africans.

I couldn't get convertf to compile, but I downloaded a Mac binary from here: https://github.com/chrchang/eigensoft. The Mac binaries for plink from Harvard's website didn't work, but there were working binaries by another maintainer here: https://www.cog-genomics.org/plink/1.9/. You can download the v44.3_1240K_public.tar file manually from here: https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/index.html.

I then made this in R:

https://i.ibb.co/h2jY6SD/plink-pca-1240k.png

libarary(tidyverse)
library(colorspace)

f="picks"
t=read.table(paste0(f,".eigenvec.2"),sep=" ")
eig=as.double(readLines(paste0(f,".eigenval")))

# t=cbind(t[,c(1,2)],t(t(t[,-c(1,2)])*sqrt(eig))) # I think this corresponds to scaling in G25

pct=paste0("PC",seq(length(eig))," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

ave=aggregate(t[,-c(1,2)],list(t[,1]),mean)
names(ave)=c("pop",paste0("PC",seq(ncol(ave)-1)))

k=cutree(hclust(dist(ave[,-1]),method="ward.D2"),k=12)
write.csv(k,"/tmp/k",quote=F)
ave$k=k

ggplot(ave,aes(x=PC1,y=PC2,label=pop))+
geom_point(aes(color=as.factor(k)),size=.5)+
geom_polygon(data=ave%>%group_by(k)%>%slice(chull(PC1,PC2)),alpha=.2,aes(color=as.facto r(k),fill=as.factor(k)),size=.3)+
geom_text(aes(label=pop,color=as.factor(k)),size=2 ,vjust=-.7)+
theme(
aspect.ratio=3/4,
axis.text=element_text(color="black",size=7),
axis.ticks.length=unit(0,"pt"),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.title=element_text(color="black",size=10),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.grid.major=element_line(color="gray75",size=.2)
)+
scale_x_continuous(breaks=seq(-2,2,.1),expand=expansion(mult=.07))+
scale_y_continuous(breaks=seq(-2,2,.1),expand=expansion(mult=.06))+
labs(x=pct[1],y=pct[2])+
scale_color_discrete_qualitative(palette="Set 2",c=80,l=40)

ggsave("output.png")
system("/usr/local/bin/mogrify -trim -bordercolor white -border 20x20 output.png")

However there's something wrong with the distances in my PCA. For example Finns have about 5 times bigger distance to Khomani_San than to Yoruba:

$ tav(){ awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}";}
$ dist(){ awk -F, 'NR==FNR{for(i=2;i<=NF;i++)a[i]=$i;next}$1{s=0;for(i=2;i<=NF;i++)s+=($i-a[i])^2;print s^.5,$1}' "$2" "$1"|sort -n|awk '{printf"%.3f %s\n",$1,$2}'|sed s,^0,,;}
$ paste -d' ' <(cut -d' ' -f2 maailma0) <(cut -d' ' -f3- maailma.eigenvec)|tav ' '|tr ' ' ,>ave;dist ave <(grep Finnish ave)|tail -n16
.130 Karitiana
.130 Mende
.135 Gambian
.136 Esan
.155 Yoruba
.174 Papuan
.189 Mandenka
.206 BantuKenya
.216 Biaka
.242 BantuSA
.268 Mbuti
.336 Ju_hoan_North
.492 BantuHerero
.492 BantuSA_Herero
.497 BantuTswana
.705 Khomani_San

Am I supposed to apply some further quality control or filtering? I tried to include only samples with few missing SNPs: `awk -F\\t '$21>9e5' v44.3_1240K_public.anno`. I also tried increasing the value of the `--geno` option and I tried adding an option like `--max-maf .3`. None of it helped however.

I also tried multiplying the columns of the table with the square roots of the eigenvalues but it didn't help:

f="picks"
t=read.table(paste0(f,".eigenvec.2"),sep=" ")
eig=as.double(readLines(paste0(f,".eigenval")))

t2=(cbind(t[,c(1,2)],t(t(t[,-c(1,2)])*sqrt(eig))))

ave=aggregate(t[,-c(1,2)],list(t[,1]),mean)
ave2=aggregate(t2[,-c(1,2)],list(t2[,1]),mean)

ind=cbind(paste0(t[,1],":",t[,2]),t[,-c(1,2)])
ind2=cbind(paste0(t2[,1],":",t2[,2]),t2[,-c(1,2)])

write.table(ind,paste0(f,".ind"),quote=F,sep=",",col.names=F,row.names=F)
write.table(ind2,paste0(f,".indscaled"),quote=F,sep=",",col.names=F,row.names=F)
write.table(ave,paste0(f,".ave"),quote=F,sep=",",col.names=F,row.names=F)
write.table(ave2,paste0(f,".avescaled"),quote=F,sep=",",col.names=F,row.names=F)

a PCA isn't a measure of genetic distance. it's just a graphical representation of the data. If you, for example, remove half of the African samples from this analysis, then African dimension would shrink, and they would appear closer to Europeans, and the Eurasian dimension would strecth.

I do a quick PCA of a dataset like this:

plink2 --bfile myfile --keep keep.txt --chr 1-22 --make-bed --out myfile_2
plink2 --bfile myfile_2 --maf 0.05 --indep-pairwise 50 10 0.1 --make-bed --out myfile_3
plink2 --bfile myfile_3 --extract myfile_3.prune.in --make-bed --out myfile_4
plink2 --bfile myfile_4 --mind 0.99 --make-bed --out myfile_5
plink2 --bfile myfile_5 --pca

--mind 0.99 removes the absolutely worst quality (ancient) samples.
after this the PCA looks fine, at least the first 2 dimensions.

vbnetkhio

03-13-2021, 11:24 AM

Komintasavalta

The reason is one of weaknesses in PCA; inbred populations in the edge areas tend to stretch. It is even stronger with low quality ancient samples. In case of ancient samples Smartpca includes corrections set by run time parameters, but in case of high quality samples it is a real problem, because it can't be corrected by filling missing SNPs by approximations. Use only one inbred population on the pca edges. If you have to use more than one you need to place a "bigger collector population" to collect edge samples.

This happens because PCA always leak on edges. In case of inbred populations even more.

it's not just a weakness of PCA. duplicate samples and 2nd degree relatives or closer should always be removed from any genetical analysis. And inbred populations like e.g. the Kalash, too.

Zoro

03-13-2021, 02:22 PM

I made a matrix of 1-IBS distances of modern individuals in the Reich dataset with at least 1.1 million SNPs, excluding samples from 1000 Genomes (1KG):

However the distance of Mansi is still lower to some Papuans than to some Altaians or Siberian Tatars:

You're not doing anything wrong. Something is wrong with the Simons Altaian, Tatar, and Bashkir samples. I have noticed that too. I just remove those samples from my analysis.

Why are you using the older Plink IBS flag though. The --genome flag is better IMO because it also does some form of IBS and gives you more detailed output..

Here's IBS with Mansi using --genome flag. geno 0.0001 is critical when doing IBS work make sure you use it otherwise the comparison becomes invalid for some samples. You need total overlap of SNPs

geno 0.0001 max-maf 0.3 335K SNPs

<tbody>
POPULATION
PI_HAT
IBS WITH MANSI

Even
0.855
0.031

Eskimo_Sireniki.DG
0.853
0.000

Ulchi
0.853
0.000

Itelmen
0.853
0.000

Yakut
0.853
0.000

Hezhen
0.853
0.000

Oroqen
0.853
0.000

Chukchi
0.852
0.134

Daur
0.852
0.000

Kyrgyz_Kyrgyzstan
0.852
0.121

Mongola
0.851
0.000

Tubalar
0.851
0.000

Korean
0.851
0.000

Mexico_Zapotec.DG
0.851
0.000

Xibo
0.850
0.000

Japanese
0.850
0.000

Quechua
0.850
0.000

Mixe
0.850
0.000

Tu
0.850
0.000

Mayan
0.850
0.000

Han
0.850
0.000

Naxi
0.850
0.000

Pima
0.850
0.000

Tlingit
0.850
0.034

Tujia
0.849
0.000

Yi
0.849
0.000

Miao
0.849
0.000

Piapoco
0.849
0.000

She
0.849
0.000

Saami
0.848
0.000

Uyghur
0.848
0.101

Surui
0.848
0.000

Mixtec
0.848
0.000

Dai
0.848
0.000

Burmese
0.848
0.000

Kinh
0.848
0.000

Karitiana
0.848
0.000

China_Lahu
0.848
0.000

Thai
0.848
0.000

Hazara
0.848
0.036

Ami.DG
0.847
0.000

Cambodian
0.846
0.000

Igorot
0.846
0.000

Kurd-IRAQ-WGS
0.846
0.000

Dusun
0.845
0.000

Finnish
0.845
0.000

Russian
0.845
0.000

Estonian
0.845
0.000

Maori
0.844
0.086

Burusho
0.844
0.000

Bengali
0.844
0.000

Icelandic
0.844
0.000

Norwegian
0.844
0.000

Kusunda
0.844
0.000

Adygei
0.843
0.000

Hungarian
0.843
0.000

Pathan
0.843
0.000

Orcadian
0.843
0.000

Ossetian-North
0.843
0.000

English
0.843
0.000

Brahmin
0.843
0.000

Tajik
0.843
0.000

Chechen
0.843
0.000

Punjabi
0.843
0.000

Kapu
0.843
0.000

Bulgarian
0.843
0.000

Madiga
0.842
0.000

Czech
0.842
0.000

Bergamo
0.842
0.000

Polish
0.842
0.000

Khonda_Dora
0.842
0.000

Yadava
0.842
0.000

Russia_Abkhasian
0.842
0.000

Spanish
0.842
0.000

Mala
0.842
0.000

Sindhi
0.842
0.000

Relli
0.842
0.000

Lezgin
0.842
0.000

Irula
0.842
0.000

Turkish-Kayseri
0.842
0.000

Kalash
0.842
0.000

Basque
0.841
0.000

Iranian-Fars
0.841
0.000

Greek
0.841
0.000

French
0.841
0.000

Armenian
0.841
0.000

Albanian.DG
0.841
0.000

Tuscan
0.840
0.000

Balochi
0.840
0.000

Brahui
0.840
0.000

Georgian
0.840
0.000

Sardinian
0.840
0.000

Jew_Iraqi
0.839
0.000

Makrani
0.839
0.000

YANA_UP_WGS
0.838
0.000

Druze
0.838
0.000

Palestinian
0.836
0.000

Jew_Yemenite
0.835
0.000

Bougainville
0.835
0.000

BedouinB
0.835
0.000

Jordanian
0.835
0.000

Samaritan
0.835
0.000

Papuan
0.832
0.000

Saharawi
0.826
0.000

Mozabite
0.825
0.000

Somali
0.809
0.000

Masai
0.801
0.000

BantuKenya
0.786
0.000

Luo
0.786
0.000

Gambian
0.785
0.000

Luhya
0.785
0.000

Mandenka
0.784
0.000

Yoruba
0.782
0.000

Mende
0.782
0.000

Esan
0.782
0.000

Biaka
0.778
0.000

Mbuti
0.775
0.000

Ju_hoan_North
0.770
0.000

Khomani_San
0.769
0.000

</tbody>
<style type="text/css">td {border: 1px solid #ccc;}br {mso-data-placement:same-cell;}</style>

Zoro

03-13-2021, 02:38 PM

Do Australoids liker Papuans/Melanesians, Aborigines have West Eurasian ancestry? Why do they always seem so drift towards South Asians in most Global PCAs? Amerindians do have the same amount of West Eurasian/Caucasoid as some Central Asians like Altaians, Kyrgyzs do right?

Haven’t check whether Austaloids have W Eurasian. They drift towards S Asians mainly because they both have alot of S Eurasian common ancestry. S Eurasian is similar to E Eurasian.

Amerindians and C Asians both have significant W Eurasian. As to who has more it will depend on the specific population. Sense American Indians have significant drift one way to check would be to do IBS from Amerindians and IBs from specific C Asians and compare the two lists to see what the ratio of W/E Eurasian IBS is for each

Zoro

03-13-2021, 02:41 PM

a PCA isn't a measure of genetic distance. it's just a graphical representation of the data. If you, for example, remove half of the African samples from this analysis, then African dimension would shrink, and they would appear closer to Europeans, and the Eurasian dimension would strecth.

I do a quick PCA of a dataset like this:

plink2 --bfile myfile --keep keep.txt --chr 1-22 --make-bed --out myfile_2
plink2 --bfile myfile_2 --maf 0.05 --indep-pairwise 50 10 0.1 --make-bed --out myfile_3
plink2 --bfile myfile_3 --extract myfile_3.prune.in --make-bed --out myfile_4
plink2 --bfile myfile_4 --mind 0.99 --make-bed --out myfile_5
plink2 --bfile myfile_5 --pca

--mind 0.99 removes the absolutely worst quality (ancient) samples.
after this the PCA looks fine, at least the first 2 dimensions.

Why do you do —maf 0.05. That removes less common population specific SNPs which is the opposite of what we want. We rather remove very very ancient common SNPs using —max-maf 0.4

Komintasavalta

03-13-2021, 03:00 PM

With no `--geno`, Somalis have higher 1-IBS with Yoruba than with Han, but with `--geno .001` Somalis are closer to Han. With `--geno .0001`, even Maasai are closer to Han than to Yoruba.

I only included samples with 1.1 million or more SNPs whose population name included `.DG`. I didn't use a `--max-maf` setting.

https://i.ibb.co/XDN9bnj/ibs.png
https://i.ibb.co/2F21MzT/ibs2.png
https://i.ibb.co/0M86Y8D/ibs3.png

x=ibs;awk -F\\t 'NR>1&&$9=="Modern"&&$7!~/^1KG/&&$21>1.1e6{print$2" "$13}' g/v44.3_1240K_public/v44.3_1240K_public.anno|grep -Ev '\.REF|Ignore_|_o|_dup|_contam|SomeAfrican|_father |_mother|_son|_daughter|_brother|_sister|_twin'>$x.pick;plink --bfile g/bed/v44.3_1240K_public --allow-no-sex --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick g/bed/v44.3_1240K_public.fam) --geno .001 --distance ibs square --out $x;awk '{print$2":"$1}' $x.pick|paste - $x.mibs|cat <(awk '{print$2":"$1}' $x.pick|cat <(echo) -|paste -s) ->$x.table

f="ibs3"
t=read.table(paste0(f,".mibs"))
t2=read.table(paste0(f,".pick"))

a=aggregate(t,list(t2$V2),mean)
row.names(a)=a[,1]
a=a[,-1]
a=aggregate(t(a),list(t2$V2),mean)
row.names(a)=a[,1]
a=a[,-1]
a=(1-a)

p1="Yoruba.DG"
p2="Han.DG"

# under=a[,p1]<=.242&a[,p2]<=.242
# a=a[under,under]

dg=grep("\\.DG$",row.names(a))
a=a[dg,dg]

pick=row.names(a)%in%c(p1,p2)
k=cutree(hclust(as.dist(a[!pick,!pick])),k=8)
b=a[!pick,pick]
names(b)=c("x","y")
b$k=k
# b=b-min(b[b!=0])

row.names(b)=sub("\\.DG","",row.names(b))

ggplot(b,aes(x,y))+
geom_abline(linetype="dashed",color="gray80",size=.3)+
geom_polygon(data=b%>%group_by(k)%>%slice(chull(x,y)),alpha=.2,aes(color=as.factor(k) ,fill=as.factor(k)),size=.3)+
geom_point(aes(color=as.factor(k)),size=.5)+
geom_text(label=rownames(b),aes(color=as.factor(k) ),size=2.2,vjust=-.7)+
theme(
legend.position="none",
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.ticks.length=unit(0,"pt"),
plot.background=element_rect(fill="white"),
panel.grid.major=element_line(color="gray80",size=.25),
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray80",fill=NA,size=.6),
axis.text=element_text(color="black")
)+
coord_cartesian(xlim=c(.105,.300),ylim=c(.105,.300 ),expand=F)+
scale_x_continuous(breaks=seq(-1,1,.01))+
scale_y_continuous(breaks=seq(-1,1,.01))+
labs(x=paste("1-IBS",p2),y=paste("1-IBS",p1))+
scale_color_discrete_qualitative(palette="Set 2",c=100,l=50)

ggsave("output.png")
system("mogrify -trim -bordercolor white -border 20x20 output.png")

Zoro

03-13-2021, 03:20 PM

With no `--geno`, Somalis have higher 1-IBS with Yoruba than with Han, but with `--geno .001` Somalis are closer to Han. With `--geno .0001`, even Maasai are closer to Han than to Yoruba.

I only included samples with 1.1 million or more SNPs whose population name included `.DG`. I didn't use a `--max-maf` setting.
]

I want to plot what you did. This will give me better idea what you’re doing. Please post all steps you took to generate PCA

Edit: I think you’re using IBS data for PCA, right. Why not generate 20 PC coordinates using plink —PCA command?

Edit: If the code you posted previously is complete I'll try to duplicate what you did but using my dataset to see if there's any difference

Komintasavalta

03-13-2021, 03:51 PM

Here lower `--geno` settings also moved South Asians closer to East Asians and further from Europeans. I used data from the same run as in my previous post.

https://i.ibb.co/gWMBCpG/ibs.png
https://i.ibb.co/PhYBbHF/ibs2.png
https://i.ibb.co/pb1ysrV/ibs3.png

I want to plot what you did. This will give me better idea what you’re doing. Please post all steps you took to generate PCA

Edit: I think you’re using IBS data for PCA, right. Why not generate 20 PC coordinates using plink —PCA command?

Edit: If the code you posted previously is complete I'll try to duplicate what you did but using my dataset to see if there's any difference

It's not a PCA. It's just a scatterplot that shows 1-IBS distance to one population on the x-axis and to another population on the y-axis.

The clustering is based on treating the whole 132x132 matrix of 1-IBS distances as a distance matrix.

In case someone else wants to replicate my code, use this to convert the Reich dataset to bed+bim+fam:

wget reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_1240K_public.tar
tar -xf v44.3_1240K_public.tar
f=v44.3_1240K_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)

I downloaded Mac binaries for convertf and plink from here: https://github.com/chrchang/eigensoft, https://www.cog-genomics.org/plink/1.9/.

Zoro

03-13-2021, 04:26 PM

Here lower `--geno` settings also moved South Asians closer to East Asians and further from Europeans. I used data from the same run as in my previous post.

[

I downloaded Mac binaries for convertf and plink from here: https://github.com/chrchang/eigensoft, https://www.cog-genomics.org/plink/1.9/.

Good thinking. You get another gold star for getting creative :)

South Asians moving closer to E Asians Is expected and normal

Unlike PCA or Admixture IBS for scatter plot is less susceptible to bias because IBS is not affected by number of samples in run. You can do a few of these to visualize 4 or 5 dimensions. For ex, Han vs Iberian, Han vs. Bedouin, Bedouin vs Iberian etc

Btw, Reich set also has 1000 genome data. Advantage is each pop is represented by 50 or more samples which is better for pop averages. Not sure how many overlapping SNPs with ancients or your personal data but you can try analysis using 1000G and compare with Simons

Edit: I replicate what you did later and post my script for fixing flipped alleles and plink bim based on human reference genome Later in the day

Zoro

03-13-2021, 07:50 PM

f="ibs3"
t=read.table(paste0(f,".mibs"))
t2=read.table(paste0(f,".pick"))

a=aggregate(t,list(t2$V2),mean)
row.names(a)=a[,1]
a=a[,-1]
a=aggregate(t(a),list(t2$V2),mean)
row.names(a)=a[,1]
a=a[,-1]
a=(1-a)

p1="Yoruba.DG"
p2="Han.DG"

# under=a[,p1]<=.242&a[,p2]<=.242
# a=a[under,under]

dg=grep("\\.DG$",row.names(a))
a=a[dg,dg]

pick=row.names(a)%in%c(p1,p2)
k=cutree(hclust(as.dist(a[!pick,!pick])),k=8)
b=a[!pick,pick]
names(b)=c("x","y")
b$k=k
# b=b-min(b[b!=0])

row.names(b)=sub("\\.DG","",row.names(b))

ggplot(b,aes(x,y))+
geom_abline(linetype="dashed",color="gray80",size=.3)+
geom_polygon(data=b%>%group_by(k)%>%slice(chull(x,y)),alpha=.2,aes(color=as.factor(k) ,fill=as.factor(k)),size=.3)+
geom_point(aes(color=as.factor(k)),size=.5)+
geom_text(label=rownames(b),aes(color=as.factor(k) ),size=2.2,vjust=-.7)+
theme(
legend.position="none",
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.ticks.length=unit(0,"pt"),
plot.background=element_rect(fill="white"),
panel.grid.major=element_line(color="gray80",size=.25),
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray80",fill=NA,size=.6),
axis.text=element_text(color="black")
)+
coord_cartesian(xlim=c(.105,.300),ylim=c(.105,.300 ),expand=F)+
scale_x_continuous(breaks=seq(-1,1,.01))+
scale_y_continuous(breaks=seq(-1,1,.01))+
labs(x=paste("1-IBS",p2),y=paste("1-IBS",p1))+
scale_color_discrete_qualitative(palette="Set 2",c=100,l=50)

ggsave("output.png")
system("mogrify -trim -bordercolor white -border 20x20 output.png")

I was going to run your code in R (my version is 3.6.3). What is the format of your inputs?

I see you're using 3 input files; ibs3, .mibs, .pick. What are they and what's your dataframe format for each file. Can you post a couple of rows with the headers from each?

vbnetkhio

03-13-2021, 09:37 PM

You're not doing anything wrong. Something is wrong with the Simons Altaian, Tatar, and Bashkir samples. I have noticed that too. I just remove those samples from my analysis.

There is something wrong with .SG and .DG samples in general. they usually have more SNPs, but a lot of that is noise.

However the distance of Mansi is still lower to some Papuans than to some Altaians or Siberian Tatars:

try out the data from evolbio.ut.ee and the non-SG/DG samples from the Reich HO dataset, too. I know they have has less SNPs but they are higher quality.

I believe evolbio has high quality samples of all the Finno-Ugric and Siberian peoples.

vbnetkhio

03-13-2021, 09:47 PM

Why do you do —maf 0.05. That removes less common population specific SNPs which is the opposite of what we want. We rather remove very very ancient common SNPs using —max-maf 0.4

either --maf 0.05 or --maf 0.01 get repeated a lot on forums as the standard filters. it's even mentioned on the official plink page, but it isn't explained why exactly 0.05.
what do you think was the reasoning behind this? maybe removing population specific SNPs has some advantages too, depending on what you are interested in?

vbnetkhio

03-13-2021, 10:18 PM

with this code, you can first calculate the PCs using only the high quality samples, and then project the low quality samples onto the PCA without them messing up the analysis.

the final PCA will be written to the "new_projection" file.
remove.txt is the list of low quality samples you want to remove.

plink2 --bfile myfile --remove remove.txt --chr 1-22 --make-bed --out myfile2
plink2 --bfile myfile2 --maf 0.05 --indep-pairwise 50 10 0.1 --make-bed --out myfile3
plink2 --bfile myfile3 --extract myfile3.prune.in --make-bed --out myfile4
plink2 --bfile myfile4 --mind 0.99 --make-bed --out myfile5
plink2 --bfile myfile5 --freq counts --pca 30 allele-wts --out ref_pcs
plink2 --bfile myfile --read-freq ref_pcs.acount --score ref_pcs.eigenvec.allele 2 5 header-read no-mean-imputation variance-standardize --score-col-nums 6-35 --out new_projection

also, I recommend using past4 (https://www.nhm.uio.no/english/research/infrastructure/past/) for plotting PCAs, it's much simpler tha R for this purpose.

Komintasavalta

03-14-2021, 01:57 AM

Unlike PCA or Admixture IBS for scatter plot is less susceptible to bias because IBS is not affected by number of samples in run.

Yeah but which variants are removed by `--geno` is.

Here in the first run I included all samples with the suffix `.SDG` and at least a million SNPs. In the second run I removed East/Southeast/North Asians. I used `--geno .0001` with both runs, which removed 83% of variants in the first run and 76% of variants in the second run.

In the first run, all SSAs are closer to Han, but in the second run, all SSAs are closer to Russians. Also other populations move closer to Russians in the second run.

https://i.ibb.co/WF7NykK/ibsa3.png
https://i.ibb.co/gw7dgtS/ibsr3.png

I was going to run your code in R (my version is 3.6.3). What is the format of your inputs?

I see you're using 3 input files; ibs3, .mibs, .pick. What are they and what's your dataframe format for each file. Can you post a couple of rows with the headers from each?

Actually they're `ibs3.pick` and `ibs3.mibs`, and they were generated by the shell code before the R code. (Except `x=ibs` should've been `x=ibs3` in the shell code, or `f="ibs3"` should've been `f="ibs"` in the R code.) `ibs` or `ibs3` is just an arbitrary name I used as a prefix. Here's the code again on multiple lines:

x=ibs
awk -F\\t 'NR>1&&$9=="Modern"&&$7!~/^1KG/&&$21>1.1e6{print$2" "$13}' g/v44.3_1240K_public/v44.3_1240K_public.anno|grep -Ev '\.REF|Ignore_|_o|_dup|_contam|SomeAfrican'|grep -Ev '_father|_mother|_son|_daughter|_brother|_sister|_ twin'>$x.pick
plink --bfile g/bed/v44.3_1240K_public --allow-no-sex --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick g/bed/v44.3_1240K_public.fam) --geno .001 --distance ibs square --out $x

This is not needed by the R code, but it adds G25-style sample IDs with population labels to the IBS matrix:

awk '{print$2":"$1}' $x.pick|paste - $x.mibs|cat <(awk '{print$2":"$1}' $x.pick|cat <(echo) -|paste -s) ->$x.table

`ibs.mibs` is a TSV matrix of IBS values generated by `plink --distance ibs square`. `ibs.pick` is not a special file needed by plink, but I just made it for convenience. It is a space-delimited table with two columns, where the first column contains sample IDs and the second column contains population names.

Zoro

03-14-2021, 03:58 AM

@ Komin and Vbnetkhio

The 1st thing I would do is fix the plink bim file. Col 5 is supposed to have the ALT allele and column 6 the REF allele. If you open it up and look at a couple of positions and compare to dbSNP website I'm sure you'll find that ALT and REF are flipped.

I got this script from Dilawer for fixing the bim file.

1- Get yourself a Hg19 Human Reference Genome. They're usually a few gigabytes in size. Format it so that it looks like the sample below. Col 5 is REF and col 6 is ALT. You can use AWK to format it.

26 64 rs3883917 C T
26 72 rs370271105 T C
26 93 rs369034419 A G
26 97 rs147830800 G A
26 103 rs369070397 G A
26 125 rs144402189 T C

2- Follow the steps below

##### TO FIX WRONG ALLELE ORDER IN PLINK BIM FILE AND IMPORT RECOMBINATION RATES INTO COL 3

### IMPORT ALT ALLELE FROM COL5 OF Hg19_60M_rs_ids_Plink_singlealleles.txt AND PLACE IT INTO COL5 OF BIM (MINOR ALLELE)

awk 'NR == FNR {REP[$3] = $5; next} ($2) in REP {$5 = REP[$2]} 1' Hg19_60M_rs_ids_Plink_singlealleles.txt MASTER9.bim > MASTER9.bim1

### IMPORT REF ALLELE FROM COL4 OF Hg19_60M_rs_ids_Plink_singlealleles.txt AND PLACE IT INTO COL6 OF BIM (MAJOR ALLELE)

awk 'NR == FNR {REP[$3] = $4; next} ($2) in REP {$6 = REP[$2]} 1' Hg19_60M_rs_ids_Plink_singlealleles.txt MASTER9.bim1 > MASTER9.bim2

### IMPORT RECOMB RATE FROM COL3 OF GENETIC_MAP_GRCH37_ALL_version2.txt AND PLACE IT INTO COL3 OF BIM

awk 'NR == FNR {REP[$1,$2] = $3; next} ($1,$4) in REP {$3 = REP[$1,$4]} 1' GENETIC_MAP_GRCH37_ALL_version2.txt MASTER9.bim2 > MASTER9.bim3

Zoro

03-14-2021, 04:04 AM

There is something wrong with .SG and .DG samples in general. they usually have more SNPs, but a lot of that is noise.

try out the data from evolbio.ut.ee and the non-SG/DG samples from the Reich HO dataset, too. I know they have has less SNPs but they are higher quality.

I believe evolbio has high quality samples of all the Finno-Ugric and Siberian peoples.

I didn't see any issues with the DG or SG files except for the samples I mentioned. Dilawer thinks it's possible that Simons samples were origianlly aligned with Hg17 or Hg18 Human reference and some were not properly converted to Hg19. Either that or somehow the data got corrupted on a few samples.

Zoro

03-14-2021, 04:07 AM

either --maf 0.05 or --maf 0.01 get repeated a lot on forums as the standard filters. it's even mentioned on the official plink page, but it isn't explained why exactly 0.05.
what do you think was the reasoning behind this? maybe removing population specific SNPs has some advantages too, depending on what you are interested in?

I see more harm than good with throwing out population specific SNPs such as with --maf 0.05. All that would do is increase the proportion of older more common alleles that are shared between many populations so in effect the distances between populations would become less. Kind of like going back in time

Zoro

03-14-2021, 04:25 AM

Yeah but which variants are removed by `--geno` is.

Positions are removed which are not genotyped in some or all samples.

Actually they're `ibs3.pick` and `ibs3.mibs`, and they were generated by the shell code before the R code. (Except `x=ibs` should've been `x=ibs3` in the shell code, or `f="ibs3"` should've been `f="ibs"` in the R code.) `ibs` or `ibs3` is just an arbitrary name I used as a prefix. Here's the code again on multiple lines:

x=ibs
awk -F\\t 'NR>1&&$9=="Modern"&&$7!~/^1KG/&&$21>1.1e6{print$2" "$13}' g/v44.3_1240K_public/v44.3_1240K_public.anno|grep -Ev '\.REF|Ignore_|_o|_dup|_contam|SomeAfrican'|grep -Ev '_father|_mother|_son|_daughter|_brother|_sister|_ twin'>$x.pick
plink --bfile g/bed/v44.3_1240K_public --allow-no-sex --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick g/bed/v44.3_1240K_public.fam) --geno .001 --distance ibs square --out $x

This is not needed by the R code, but it adds G25-style sample IDs with population labels to the IBS matrix:

awk '{print$2":"$1}' $x.pick|paste - $x.mibs|cat <(awk '{print$2":"$1}' $x.pick|cat <(echo) -|paste -s) ->$x.table

`ibs.mibs` is a TSV matrix of IBS values generated by `plink --distance ibs square`. `ibs.pick` is not a special file needed by plink, but I just made it for convenience. It is a space-delimited table with two columns, where the first column contains sample IDs and the second column contains population names.

I'm not going to run --ibs. I'll use my output from --genome. I don't even know what the output of --ibs looks like.

What part of your R code do I need to modify if my input file is something like this

SAMPLE IBS-Russian IBS-Han
Iberian 0.756 0.732
Turkish 0.761 0.746

Komintasavalta

03-14-2021, 04:36 AM

Below I have a balanced mix of Caucasoid and Mongoloid samples, so even after I added `--geno .01`, South Asians stayed at approximately the same position on the diagonal line.

https://i.ibb.co/WFSbx8f/genovsnogeno.gif

Btw, Reich set also has 1000 genome data. Advantage is each pop is represented by 50 or more samples which is better for pop averages. Not sure how many overlapping SNPs with ancients or your personal data but you can try analysis using 1000G and compare with Simons

Here the distance to Russians is bigger for Finns from 1000 Genomes (FIN.SG) than for Somali.DG or Maasai.DG. Also Yoruba from 1000 Genomes (YRI.SG) are in the top right corner of the plot even further than Khoisan. I used no `--geno` option, but adding `--geno .01` didn't help.

https://i.ibb.co/WpNYp9Y/ibs.png

If I only use modern high-SNP samples from one source (like only samples with a `.DG` or `.SDG` suffix), I seem to get mostly reasonable results. But I don't know how to mix samples from different sources yet.

You can find the source of the samples like this:

awk -F\\t '$2~/\.SDG$/{print$2,$7}' g/v44.3_1240K_public/v44.3_1240K_public.anno

- There are 4447 samples that end in `.SD`. Out of them 2535 are from 1000 Genomes (1KGPhase3). The publication of 442 is listed as MargaryanWillerslevNature2020 (Population genomics of the Viking world (https://www.nature.com/articles/s41586-020-2688-8)).
- The source of all 929 samples ending in `.SDG` is listed as BergstromScience2020 (Insights into human genetic variation and population history from 929 diverse genomes (https://science.sciencemag.org/content/367/6484/eaay5012)).
- 243 out of 298 samples ending in `.DG` have the source MallickNature2016 (The Simons Genome Diversity Project: 300 genomes from 142 diverse populations (https://pubmed.ncbi.nlm.nih.gov/27654912/)). Others are from 8 different sources.
- The fourth most common suffix is `.A0101`, but it's already so rare that only 27 samples have it. 16 samples with the suffix are from JeongPNAS2018 (Bronze Age population dynamics and the rise of dairy pastoralism on the eastern Eurasian steppe (https://www.pnas.org/content/115/48/E11248)).

You can easily grep for a line in the anno file and show it with the header row like this (this requires GNU sed for `-u` and gawk for arrays of arrays):

qg(){ sed -u 1q;grep -P "$@";}
tp(){ awk '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?"\n":FS),a[i][j]}' "FS=${1-$'\t'}";}
nu(){ awk '{printf"%"x""y"%s\n",NR,$0}' "x=${1-d}" "y=${2- }";}
$ curl -O reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_1240K_public.anno
$ qg Finnish-1<v44.3_1240K_public.anno|tp|nu

also, I recommend using past4 (https://www.nhm.uio.no/english/research/infrastructure/past/) for plotting PCAs, it's much simpler tha R for this purpose.

I tried it but it gave me brain damage, and it only took me few weeks to learn R. I can run an interactive R session inside Emacs using ESS (https://ess.r-project.org/index.php?Section=home) (Emacs Speaks Statistics), so I don't even need any gay IDE like RStudio:

https://i.ibb.co/7zP4xrz/a.png

Komintasavalta

03-14-2021, 04:53 AM

Positions are removed which are not genotyped in some or all samples.

I'm not going to run --ibs. I'll use my output from --genome. I don't even know what the output of --ibs looks like.

What part of your R code do I need to modify if my input file is something like this

SAMPLE IBS-Russian IBS-Han
Iberian 0.756 0.732
Turkish 0.761 0.746

You need the full IBS matrix from `--ibs` for the clustering. Even though you could also do the clustering based on the output of `--pca`.

Here's a simplified version that doesn't do clustering:

library(ggplot2)

t=read.table(text="SAMPLE IBS-Russian IBS-Han
Iberian 0.756 0.732
Turkish 0.761 0.746",row.names=1,header=T,check.names=F)
# t=read.table("file",row.names=1,header=T,check.names=F)

n=names(t)
names(t)=c("x","y")

ggplot(t,aes(x,y))+
geom_abline(linetype="dashed",color="gray80",size=.3)+
geom_point(size=.5)+
geom_text(label=row.names(t),size=2.5,vjust=-.7)+
theme(
legend.position="none",
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.ticks.length=unit(0,"pt"), # remove extra space occupied by tick marks
panel.grid.major=element_line(color="gray80",size=.25),
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray80",fill=NA,size=.6),
panel.grid.minor=element_line(color="white",size=.15),
axis.text=element_text(color="black")
)+
# coord_fixed()+ # use same scale for x and y axis
# coord_cartesian(xlim=c(.7,.8),ylim=c(.7,.8),expand =F)+ # display fixed range
scale_x_continuous(breaks=seq(-1,1,.01),minor_breaks=seq(-1,1,.001),expand=expansion(mult=.12))+ # expansion is often needed to make labels fit on plot area
scale_y_continuous(breaks=seq(-1,1,.01),minor_breaks=seq(-1,1,.001))+
labs(x=n[1],y=n[2])

ggsave("output.png")
system("/usr/local/bin/mogrify -trim -bordercolor white -border 20x20 output.png")

vbnetkhio

03-14-2021, 09:32 AM

I didn't see any issues with the DG or SG files except for the samples I mentioned.

maybe the issues appear only when combining SG and non-SG samples. SG and non-SG samples of the same ethnicity sometimes plot separately on PCAs, get separated in admixture, etc.

Dilawer thinks it's possible that Simons samples were origianlly aligned with Hg17 or Hg18 Human reference and some were not properly converted to Hg19.

it's probably this. a part of the Ukrainians from the HO dataset have hg18 alleles and hg19 coordinates. maybe it's fixed in the newest version, i didn't check.

vbnetkhio

03-14-2021, 10:17 AM

I didn't see any issues with the DG or SG files except for the samples I mentioned. Dilawer thinks it's possible that Simons samples were origianlly aligned with Hg17 or Hg18 Human reference and some were not properly converted to Hg19. Either that or somehow the data got corrupted on a few samples.

This is what Davidski says:

Shotgun sequences often don't produce similar D-stats to capture data, especially if they're not UDG treated.
There's more noise in shotgun data, especially in the lower coverage and non-UDG treated sequences.

would it be possible to filter out this noise?

Komintasavalta

03-14-2021, 11:58 AM

try out the data from evolbio.ut.ee and the non-SG/DG samples from the Reich HO dataset, too. I know they have has less SNPs but they are higher quality.

I believe evolbio has high quality samples of all the Finno-Ugric and Siberian peoples.

I tried merging the files from here: https://evolbio.ut.ee/Tambets2018/.

printf %s\\n g/bed/Tambets_et_al_2018_{660k,610k}>/tmp/a
plink --bfile g/bed/Tambets_et_al_2018_1M --merge-list /tmp/a --pca

When I tried using a process substitution instead of a temporary file, I got a segfault.

When I also included the file Tambets_et_al_2018_650k.bed, I got this error:

Error: 211441 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
plink.missnp.
(Warning: if this seems to work, strand errors involving SNPs with A/T or C/G
alleles probably remain in your data. If LD between nearby SNPs is high,
--flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
that subset of the data to VCF (via e.g. '--recode vcf'), merging with
another tool/script, and then importing the result; PLINK is not yet suited
to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.

I had to look up population names from table S1 of Tambets 2018: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1522-1.

Result:

https://i.ibb.co/P49vTTG/output.png

However there's still something weird with the clustering, because many Mansi cluster together with Finnic people.

One sample has the ID "Nenets986" and population name "Maris", but it seems more like a Mari based on its placement on the plot above.

library(tidyverse);library(colorspace)

t=read.table("https://pastebin.com/raw/uKapWxrU")
eig=c(3.78,2.13,1.29,1.17,1.15,1.12,1.09,1.07,1.05 ,1.04,1.04,1.04,1.03,1.03,1.03,1.02,1.02,1.02,1.02 ,1.02)
pct=paste0("PC",seq(length(eig))," (",sprintf("%.1f",100*eig/sum(eig)),"%)")
k=cutree(hclust(dist(t[,-c(1,2)]),method="ward.D2"),k=12)
t$k=k
t$lab=paste0(t$V1,":",t$V2)
write.table(cbind(k,t$lab),"clusters",sep=" ",quote=F,col.names=F,row.names=F)

ggplot(t,aes(x=V3,y=V4))+
geom_point(aes(color=as.factor(k)),size=.5)+
geom_polygon(data=t%>%group_by(k)%>%slice(chull(V3,V4)),alpha=.2,aes(color=as.factor( k),fill=as.factor(k)),size=.3)+
geom_text(aes(label=lab,color=as.factor(k)),size=2 ,vjust=-.7)+
theme(
aspect.ratio=3/4,
axis.text=element_text(color="black",size=6),
axis.ticks.length=unit(0,"pt"),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.title=element_text(color="black",size=8),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.grid.major=element_line(color="gray75",size=.2),
)+
scale_x_continuous(breaks=seq(-1,1,.1),expand=expansion(mult=.16))+
scale_y_continuous(breaks=seq(-1,1,.1),expand=expansion(mult=.04))+
labs(x=pct[1],y=pct[2])+
scale_color_discrete_qualitative(palette="Set 2",c=80,l=40)

ggsave("output.png")
system("mogrify -trim -bordercolor white -border 16x16 output.png")

vbnetkhio

03-14-2021, 12:17 PM

I tried merging the files from here: https://evolbio.ut.ee/Tambets2018/.

printf %s\\n g/bed/Tambets_et_al_2018_{660k,610k}>/tmp/a
plink --bfile g/bed/Tambets_et_al_2018_1M --merge-list /tmp/a --pca

When I tried using a process substitution instead of a temporary file, I got a segfault.

When I also included the file Tambets_et_al_2018_650k.bed, I got this error:

Error: 211441 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
plink.missnp.
(Warning: if this seems to work, strand errors involving SNPs with A/T or C/G
alleles probably remain in your data. If LD between nearby SNPs is high,
--flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
that subset of the data to VCF (via e.g. '--recode vcf'), merging with
another tool/script, and then importing the result; PLINK is not yet suited
to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.

I had to look up population names from table S1 of Tambets 2018: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1522-1.

Result:

https://i.ibb.co/P49vTTG/output.png

However there's still something weird with the clustering, because many Mansi cluster together with Finnic people.

One sample has the ID "Nenets986" and population name "Maris", but it seems more like a Mari based on its placement on the plot above.

library(tidyverse);library(colorspace)

t=read.table("https://pastebin.com/raw/uKapWxrU")
eig=c(3.78,2.13,1.29,1.17,1.15,1.12,1.09,1.07,1.05 ,1.04,1.04,1.04,1.03,1.03,1.03,1.02,1.02,1.02,1.02 ,1.02)
pct=paste0("PC",seq(length(eig))," (",sprintf("%.1f",100*eig/sum(eig)),"%)")
k=cutree(hclust(dist(t[,-c(1,2)]),method="ward.D2"),k=12)
t$k=k
t$lab=paste0(t$V1,":",t$V2)
write.table(cbind(k,t$lab),"clusters",sep=" ",quote=F,col.names=F,row.names=F)

ggplot(t,aes(x=V3,y=V4))+
geom_point(aes(color=as.factor(k)),size=.5)+
geom_polygon(data=t%>%group_by(k)%>%slice(chull(V3,V4)),alpha=.2,aes(color=as.factor( k),fill=as.factor(k)),size=.3)+
geom_text(aes(label=lab,color=as.factor(k)),size=2 ,vjust=-.7)+
theme(
aspect.ratio=3/4,
axis.text=element_text(color="black",size=6),
axis.ticks.length=unit(0,"pt"),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.title=element_text(color="black",size=8),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.grid.major=element_line(color="gray75",size=.2),
)+
scale_x_continuous(breaks=seq(-1,1,.1),expand=expansion(mult=.16))+
scale_y_continuous(breaks=seq(-1,1,.1),expand=expansion(mult=.04))+
labs(x=pct[1],y=pct[2])+
scale_color_discrete_qualitative(palette="Set 2",c=80,l=40)

ggsave("output.png")
system("mogrify -trim -bordercolor white -border 16x16 output.png")

this is the merging process when dealing with strand inconsistency:

plink2 --bfile fileA --bmerge fileB --make-bed --out merge --allow-no-sex

plink2 --bfile fileB --flip merge-merge.missnp --make-bed --out fileB_flip
plink2 --bfile fileA --bmerge fileB_flip --make-bed --out merge --allow-no-sex

plink2 --bfile fileB_flip --exclude merge-merge.missnp --make-bed --out fileB_filtered
plink2 --bfile fileA --bmerge fileB_filtered --make-bed --out merge --allow-no-sex

always add --allow-no-sex, first try flipping the SNPs, if you get an error again, then you also have to exclude some SNPs (last 2 lines)

those Mansi and the Buryat are probably mixed with Russians.

vbnetkhio

03-14-2021, 12:28 PM

I tried merging the files from here: https://evolbio.ut.ee/Tambets2018/.

check out the other studies too, populations of your interest are scattered across all of them. Not just the Siberian studies, there are also Mordovians in the Caucasus study, Chuvash in the Jewish study etc.
I merged all evolbio files into one big dataset and use that.

Zoro

03-14-2021, 01:12 PM

You need the full IBS matrix from `--ibs` for the clustering. Even though you could also do the clustering based on the output of `--pca`.

Here's a simplified version that doesn't do clustering:

library(ggplot2)

t=read.table(text="SAMPLE IBS-Russian IBS-Han
Iberian 0.756 0.732
Turkish 0.761 0.746",row.names=1,header=T,check.names=F)
# t=read.table("file",row.names=1,header=T,check.names=F)

n=names(t)
names(t)=c("x","y")

ggplot(t,aes(x,y))+
geom_abline(linetype="dashed",color="gray80",size=.3)+
geom_point(size=.5)+
geom_text(label=row.names(t),size=2.5,vjust=-.7)+
theme(
legend.position="none",
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.ticks.length=unit(0,"pt"), # remove extra space occupied by tick marks
panel.grid.major=element_line(color="gray80",size=.25),
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray80",fill=NA,size=.6),
panel.grid.minor=element_line(color="white",size=.15),
axis.text=element_text(color="black")
)+
# coord_fixed()+ # use same scale for x and y axis
# coord_cartesian(xlim=c(.7,.8),ylim=c(.7,.8),expand =F)+ # display fixed range
scale_x_continuous(breaks=seq(-1,1,.01),minor_breaks=seq(-1,1,.001),expand=expansion(mult=.12))+ # expansion is often needed to make labels fit on plot area
scale_y_continuous(breaks=seq(-1,1,.01),minor_breaks=seq(-1,1,.001))+
labs(x=n[1],y=n[2])

ggsave("output.png")
system("/usr/local/bin/mogrify -trim -bordercolor white -border 20x20 output.png")

Here is a simple 2 dimensional IBS. It looks better than alot of PCAs since it doesn't have the bias issues relating to sample numbers

I didn't have time to modify your code to color code by column3( Region). My R codes are different.

I don't see any issues with the samples used in the run. Using plink --genome flag and squaring IBS values. I have corrected flipped alleles in my bim file using the script I posted earlier though.

https://i.imgur.com/wvA5hBW.jpg

https://i.imgur.com/Lt3LpCE.jpg

Zoro

03-14-2021, 01:23 PM

This is what Davidski says:

would it be possible to filter out this noise?

What he says doesn't apply to the moderns though.

As far as noise just use the higher coverage samples. You can tell which ones they are by the number of SNPs they have when you run --missing in Plink

Zoro

03-14-2021, 01:37 PM

I tried merging the files from here: https://evolbio.ut.ee/Tambets2018/.

printf %s\\n g/bed/Tambets_et_al_2018_{660k,610k}>/tmp/a
plink --bfile g/bed/Tambets_et_al_2018_1M --merge-list /tmp/a --pca

When I tried using a process substitution instead of a temporary file, I got a segfault.

When I also included the file Tambets_et_al_2018_650k.bed, I got this error:

Error: 211441 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
plink.missnp.
(Warning: if this seems to work, strand errors involving SNPs with A/T or C/G
alleles probably remain in your data. If LD between nearby SNPs is high,
--flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
that subset of the data to VCF (via e.g. '--recode vcf'), merging with
another tool/script, and then importing the result; PLINK is not yet suited
to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.

]

The root of most of these problems is Plink flipping alleles. I'll try to explain with this example.

Let's say your Plink Master file bim has rs9874 as G T and the dataset you're trying to merge has rs9874 T G or even worse they use the -ve orientation when they genotyped and rs9874 A C. In this case stupid Plink will reject by saying there's 3 alleles for this positon.

If you want to avoid this fix your Master bim using the script I provided. Fix the bim on the new file also using the script I provided and then merge the two in Plink.

You should have instead of 200,000 positions with 3+ alleles maybe 100 positions only. To get rid of these just exclude them from the new file using:

/plink --bfile New_file --exclude merge...... --make-bed --out New_file1

You can check this page to maybe find a reference genome you can use with my script

https://faculty.washington.edu/browning/beagle/beagle.html

Zoro

03-14-2021, 02:10 PM

You need the full IBS matrix from `--ibs` for the clustering. Even though you could also do the clustering based on the output of `--pca`.

Here's a simplified version that doesn't do clustering:

library(ggplot2)

t=read.table(text="SAMPLE IBS-Russian IBS-Han
Iberian 0.756 0.732
Turkish 0.761 0.746",row.names=1,header=T,check.names=F)
# t=read.table("file",row.names=1,header=T,check.names=F)

n=names(t)
names(t)=c("x","y")

ggplot(t,aes(x,y))+
geom_abline(linetype="dashed",color="gray80",size=.3)+
geom_point(size=.5)+
geom_text(label=row.names(t),size=2.5,vjust=-.7)+
theme(
legend.position="none",
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.ticks.length=unit(0,"pt"), # remove extra space occupied by tick marks
panel.grid.major=element_line(color="gray80",size=.25),
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray80",fill=NA,size=.6),
panel.grid.minor=element_line(color="white",size=.15),
axis.text=element_text(color="black")
)+
# coord_fixed()+ # use same scale for x and y axis
# coord_cartesian(xlim=c(.7,.8),ylim=c(.7,.8),expand =F)+ # display fixed range
scale_x_continuous(breaks=seq(-1,1,.01),minor_breaks=seq(-1,1,.001),expand=expansion(mult=.12))+ # expansion is often needed to make labels fit on plot area
scale_y_continuous(breaks=seq(-1,1,.01),minor_breaks=seq(-1,1,.001))+
labs(x=n[1],y=n[2])

ggsave("output.png")
system("/usr/local/bin/mogrify -trim -bordercolor white -border 20x20 output.png")

I've never used this type of script to cluster. Is it possible to use your script to cluster based on 2 column values and automatically color code based on each cluster ?

vbnetkhio

03-14-2021, 02:19 PM

@ Komin and Vbnetkhio

The 1st thing I would do is fix the plink bim file. Col 5 is supposed to have the ALT allele and column 6 the REF allele. If you open it up and look at a couple of positions and compare to dbSNP website I'm sure you'll find that ALT and REF are flipped.

they are flipped simply because plink puts the less frequent allele (in the given dataset) first, and the more frequent one second.
so you think this causes some issues with PCA and other analyses?

if all the samples you are analyzing are in the same plink dataset, then there should be no problems, right?

Zoro

03-14-2021, 02:28 PM

they are flipped simply because plink puts the less frequent allele (in the given dataset) first, and the more frequent one second.
so you think this causes some issues with PCA and other analyses?

if all the samples you are analyzing are in the same plink dataset, then there should be no problems, right?

I think it does in Admixtools because the program looks at col 6 of .snp for minor allele frequency calculations. In plink i would guess plink looks at col 5 of .bim for —maf or —max-maf. Would this cause issue inside plink I’m not sure

Komintasavalta

03-14-2021, 03:03 PM

I've never used this type of script to cluster. Is it possible to use your script to cluster based on 2 column values and automatically color code based on each cluster ?

Yeah but it won't be as accurate. A benefit of adding clusters on top of a scatterplot is that it visualizes dimensions that are not shown on the x and y axis, but if the clustering is just based on the x and y axis, it seems pointless.

You can also use the function fviz_cluster from the package factoextra. For example this plots random G25 modern averages:

R -e 'install.packages("factoextra",repos="https://cloud.r-project.org")'
R -e 'library(factoextra)
t=read.csv("https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y",row.names=1,header=T)
t=t[sample(nrow(t),64),]
fviz_cluster(hkmeans(t,8),labelsize=9, show.clust.cent=F,shape=1,main="")
ggsave("out.png")'

Komintasavalta

03-15-2021, 02:47 AM

Here's how you can do the clustering based on the output of `plink --genome`.

awk -F\\t 'NR>1&&$2~/\.DG/{print$2" "$13}' v44.3_1240K_public.anno|grep -Ev '\.REF|Ignore_|_o|_dup|_contam|_SomeAfrican|_fathe r|_mother|_son|_daughter|_brother|_sister|_twin'>pick
plink --allow-no-sex --bfile v44.3_1240K_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' pick v44.3_1240K_public.fam) --genome

library(tidyverse)
library(colorspace)

t=read.table("plink.genome",skip=1)
pops=read.table("pick")
pops=pops[order(pops[,1]),]

t=t[,c(2,4,12)]
t2=rbind(t,setNames(t[,c(2,1,3)],names(t)))
t3=as.data.frame.matrix(xtabs(t2[,3]~t2[,1]+t2[,2]))

p1="Luxembourg_Loschbour.DG"
p2="Russia_Ust_Ishim.DG"
a=aggregate(t3,list(pops[,2]),mean)
row.names(a)=a[,1]
a=a[,-1]
a=aggregate(t(a),list(pops[,2]),mean)
row.names(a)=a[,1]
a=a[,-1]
a=1-a

rej=grepl("Neanderthal|Denisovan|Loschbour_published|Ishim_HG _published",row.names(a))
a=a[!rej,!rej]

# limit=a[,p1]<=.257&a[,p2]<=.257
# a=a[limit,limit]

pick=row.names(a)%in%c(p1,p2)
b=a[!pick,pick]
names(b)=c("x","y")
k=cutree(hclust(as.dist(a[!pick,!pick]),method="ward.D2"),k=32)
write.table(k,"/tmp/clusters",sep=" ",quote=F,col.names=F)
b$k=k

# row.names(b)=sub("\\.DG","",row.names(b))

ggplot(b,aes(x,y))+
geom_abline(linetype="dashed",color="gray80",size=.3)+
geom_point(aes(color=as.factor(k)),size=.5)+
geom_polygon(data=b%>%group_by(k)%>%slice(chull(x,y)),alpha=.2,aes(color=as.factor(k) ,fill=as.factor(k)),size=.3)+
geom_text(label=rownames(b),aes(color=as.factor(k) ),size=2.4,vjust=-.7)+
theme(
axis.text=element_text(color="black"),
axis.ticks.length=unit(0,"pt"),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray80",fill=NA,size=.8),
panel.grid.major=element_line(color="gray80",size=.3),
panel.grid.minor=element_line(color="gray90",size=.2),
plot.background=element_rect(fill="white")
)+
# coord_fixed()+ # uncomment when not using coord_cartesian
coord_cartesian(xlim=c(.214,.257),ylim=c(.214,.257 ),expand=F)+ # zoom in
scale_x_continuous(breaks=seq(-1,1,.01),minor_breaks=seq(-1,1,.001))+
scale_y_continuous(breaks=seq(-1,1,.01),minor_breaks=seq(-1,1,.001))+
labs(x=paste("1-IBS",p1),y=paste("1-IBS",p2))+
scale_color_discrete_qualitative(palette="Set 2",c=100,l=50)

ggsave("output.png")
system("mogrify -trim -bordercolor white -border 24x24 output.png")

Result:

https://i.ibb.co/gS3GwrZ/a.png

Zoro

03-15-2021, 03:49 AM

Here's how you can do the clustering based on the output of `plink --genome`.

awk -F\\t 'NR>1&&$2~/\.DG/{print$2" "$13}' v44.3_1240K_public.anno|grep -Ev '\.REF|Ignore_|_o|_dup|_contam|_SomeAfrican|_fathe r|_mother|_son|_daughter|_brother|_sister|_twin'>pick
plink --allow-no-sex --bfile v44.3_1240K_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' pick v44.3_1240K_public.fam) --genome

library(tidyverse)
library(colorspace)

t=read.table("plink.genome",skip=1)
pops=read.table("pick")
pops=pops[order(pops[,1]),]

t=t[,c(2,4,12)]
t2=rbind(t,setNames(t[,c(2,1,3)],names(t)))
t3=as.data.frame.matrix(xtabs(t2[,3]~t2[,1]+t2[,2]))

p1="Luxembourg_Loschbour.DG"
p2="Russia_Ust_Ishim.DG"
a=aggregate(t3,list(pops[,2]),mean)
row.names(a)=a[,1]
a=a[,-1]
a=aggregate(t(a),list(pops[,2]),mean)
row.names(a)=a[,1]
a=a[,-1]
a=1-a

rej=grepl("Neanderthal|Denisovan|Loschbour_published|Ishim_HG _published",row.names(a))
a=a[!rej,!rej]

# limit=a[,p1]<=.257&a[,p2]<=.257
# a=a[limit,limit]

pick=row.names(a)%in%c(p1,p2)
b=a[!pick,pick]
names(b)=c("x","y")
k=cutree(hclust(as.dist(a[!pick,!pick]),method="ward.D2"),k=32)
write.table(k,"/tmp/clusters",sep=" ",quote=F,col.names=F)
b$k=k

# row.names(b)=sub("\\.DG","",row.names(b))

ggplot(b,aes(x,y))+
geom_abline(linetype="dashed",color="gray80",size=.3)+
geom_point(aes(color=as.factor(k)),size=.5)+
geom_polygon(data=b%>%group_by(k)%>%slice(chull(x,y)),alpha=.2,aes(color=as.factor(k) ,fill=as.factor(k)),size=.3)+
geom_text(label=rownames(b),aes(color=as.factor(k) ),size=2.4,vjust=-.7)+
theme(
axis.text=element_text(color="black"),
axis.ticks.length=unit(0,"pt"),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray80",fill=NA,size=.8),
panel.grid.major=element_line(color="gray80",size=.3),
panel.grid.minor=element_line(color="gray90",size=.2),
plot.background=element_rect(fill="white")
)+
# coord_fixed()+ # uncomment when not using coord_cartesian
coord_cartesian(xlim=c(.214,.257),ylim=c(.214,.257 ),expand=F)+ # zoom in
scale_x_continuous(breaks=seq(-1,1,.01),minor_breaks=seq(-1,1,.001))+
scale_y_continuous(breaks=seq(-1,1,.01),minor_breaks=seq(-1,1,.001))+
labs(x=paste("1-IBS",p1),y=paste("1-IBS",p2))+
scale_color_discrete_qualitative(palette="Set 2",c=100,l=50)

ggsave("output.png")
system("mogrify -trim -bordercolor white -border 24x24 output.png")

Result:

https://i.ibb.co/gS3GwrZ/a.png

Thanks but I already did it based on factoextra package. Will post a lttle later

Zoro

03-15-2021, 03:51 AM

Dendogram based on IBS distances to Sardinian ( W Eurasian) and Mongol ( E Eurasian). This better explains phenotype similarities between some W and S Asians and Mexicans and S. Americans than G25.

https://i.imgur.com/fVBt7Pv.jpg

Zoro

03-15-2021, 04:13 AM

Here we see that S Asians cluster not too far from American Indians. Sort of explains some phenotype overlaps between S Asians and American Indians.

https://i.imgur.com/FulA3EW.jpg

Komintasavalta

03-15-2021, 05:09 AM

Dendogram based on IBS distances to Sardinian ( W Eurasian) and Mongol ( E Eurasian). This better explains phenotype similarities between some W and S Asians and Mexicans and S. Americans than G25.

https://i.imgur.com/fVBt7Pv.jpg

You can also reorder the branches by using the function `reorder.hclust` from the package Vegan: https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/reorder.hclust. For example here I ordered the tree from East Asians to Capoids.

I subtracted .20 from the 1-IBS values and I replaced values that were smaller than 0 with 0. It's not the right method to calculate the distances between the populations though, so the clustering is messed up and the height of the branches is not correct. Maybe we should use FST distances instead.

library(vegan) # for reorder.hclust
library(ape) # for plot.phylo

t=read.table("plink.mibs") # generated by `plink --distance ibs square`
t2=read.table("picks") # two-column table with sample ID in first column and population name in second column

a=aggregate(t,list(t2$V2),mean)
row.names(a)=a[,1]
a=a[,-1]
a=aggregate(t(a),list(t2$V2),mean)
row.names(a)=a[,1]
a=a[,-1]

a=1-a
a=a-.20
a[a<0]=0

rej=grepl("Neanderthal|Denisova",row.names(a))
a=a[!rej,!rej]

weight=as.matrix(2*a["Khomani_San.DG"]+a["Mbuti.DG"]+a["Yoruba.DG"]-4*a["Han.DG"])
r=reorder(hclust(as.dist(a),method="complete"),wts=weight)

png("out.png",w=900,h=2400)
plot(as.phylo(r),cex=1.5,font=1,underscore=T)
dev.off()

system("mogrify -trim -bordercolor white -border 16x16 out.png")

https://i.ibb.co/bP2Zv6t/ibsdendro.png

Komintasavalta

03-15-2021, 08:14 AM

I tried `plink --pca` with different values for the third parameter of `--indep-pairwise` (r^2 threshold):

awk -F\\t 'NR>1&&$2~/\.DG/{print$2" "$13}' v44.3_1240K_public.anno|grep -Ev '\.REF|Ignore_|_o|_dup|_contam|_SomeAfrican|_fathe r|_mother|_son|_daughter|_brother|_sister|_twin'|g rep -Ev 'Neanderthal|Denisova'>pick
for x in 0.1 0.3 0.5;do plink --allow-no-sex --bfile v44.3_1240K_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' pick v44.3_1240K_public.fam) --indep-pairwise 50 10 $x --make-bed --out $x;plink --allow-no-sex --bfile $x --extract $x.prune.in --pca --out $x;awk 'NR==FNR{a[$1]=$2;next}{$1=a[$2]}1' pick $x.eigenvec>$x.eigenvec.2;done

0.5 removed 62% of variants, 0.3 removed 75% of variants, and 0.1 removed 90% of variants.

I applied G25-style scaling to the PCA where I multiplied each column by the eigenvalues, because I've found that without it the clustering is more often messed up so that for example some SSAs or Mongoloids cluster with Europeans. The same thing happens with unscaled G25 coordinates. However even in the second image below, the clustering went wrong so that Yoruba clustered with Siberians. I don't know why.

Lower values for the last parameter of `--indep-pairwise` increased the distance of Capoids to non-Capoids more.

With no `--indep-pairwise` option, Russia_Ust_Ishim.DG formed a cluster with Russia_Ust_Ishim_HG_published.DG and Luxembourg_Loschbour.DG formed a cluster with Luxembourg_Loschbour_published.DG. (I should've probably removed the duplicate samples.) With `--indep-pairwise 50 10 0.5`, the cluster for Ust'-Ishim disappeared. With `--indep-pairwise 50 10 0.3`, the cluster for Loschbour also disappeared.

https://i.ibb.co/D9ssdkM/image.png
https://i.ibb.co/c3L30H8/0-5.png
https://i.ibb.co/QbmTm5J/0-3.png
https://i.ibb.co/TcLFhx6/0-1.png

library(tidyverse)
library(colorspace)

t=read.table("0.5.eigenvec.2",sep=" ")
eig=as.double(readLines("0.5.eigenval"))

t=(cbind(t[,c(1,2)],t(t(t[,-c(1,2)])*sqrt(eig)))) # scaling

ave=aggregate(t[,-c(1,2)],list(t[,1]),mean)
names(ave)=c("pop",paste0("PC",seq(ncol(ave)-1)))
k=cutree(hclust(dist(ave[,-c(1,2)]),method="ward.D2"),k=16)
ave$k=k

ggplot(ave,aes(x=PC1,y=PC2))+
geom_point(aes(color=as.factor(k)),size=.5)+
geom_polygon(data=ave%>%group_by(k)%>%slice(chull(PC1,PC2)),alpha=.2,aes(color=as.facto r(k),fill=as.factor(k)),size=.3)+
geom_text(aes(label=pop,color=as.factor(k)),size=2 ,vjust=-.7)+
theme(
aspect.ratio=1/sqrt(2),
legend.position="none",
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.ticks.length=unit(0,"pt"),
panel.border=element_rect(color="gray75",fill=NA,size=.4),
panel.grid.major=element_line(color="gray70",size=.15),
panel.background=element_rect(fill="white"),
plot.title=element_text(size=9),
axis.text=element_text(color="black",size=6),
axis.title=element_text(color="black",size=8)
)+
scale_x_continuous(breaks=seq(-10,10,.1),expand=expansion(mult=.12))+
scale_y_continuous(breaks=seq(-10,10,.1),expand=expansion(mult=.03))+
labs(x="",y="")+
scale_color_discrete_qualitative(palette="Set 2",c=80,l=40)

ggsave("0.5.png")
system(paste0("mogrify -trim -bordercolor white -border 20x20 0.5.png"))

Komintasavalta

03-15-2021, 09:16 AM

this is the merging process when dealing with strand inconsistency:

plink2 --bfile fileA --bmerge fileB --make-bed --out merge --allow-no-sex

plink2 --bfile fileB --flip merge-merge.missnp --make-bed --out fileB_flip
plink2 --bfile fileA --bmerge fileB_flip --make-bed --out merge --allow-no-sex

plink2 --bfile fileB_flip --exclude merge-merge.missnp --make-bed --out fileB_filtered
plink2 --bfile fileA --bmerge fileB_filtered --make-bed --out merge --allow-no-sex

always add --allow-no-sex, first try flipping the SNPs, if you get an error again, then you also have to exclude some SNPs (last 2 lines)

Thanks, I got it to work when I also ran the last two lines:

wget evolbio.ut.ee/Tambets2018/Tambets_et_al_2018_{1M,660k,650k,610k}.{bed,bim,fa m}
pli(){ plink --allow-no-sex "$@";}
pli --bfile Tambets_et_al_2018_1M --bmerge Tambets_et_al_2018_650k --make-bed --out merge
pli --bfile Tambets_et_al_2018_650k --flip merge-merge.missnp --make-bed --out Tambets_et_al_2018_650k_flip
pli --bfile Tambets_et_al_2018_1M --bmerge Tambets_et_al_2018_650k_flip --make-bed --out merge
pli --bfile Tambets_et_al_2018_650k_flip --exclude merge-merge.missnp --make-bed --out Tambets_et_al_2018_650k_filtered
pli --bfile Tambets_et_al_2018_1M --bmerge Tambets_et_al_2018_650k_filtered --make-bed --out merge

I used plink 1.9. With plink 2, I needed to replace `--bmerge` with `--pmerge`, but even then I got this error: `Error: Failed to open Tambets_et_al_2018_650k.psam`.

(Edit: added back missing third plink command.)

Lucas

03-15-2021, 01:31 PM

Thanks, I got it to work with the `--exclude` option:

wget evolbio.ut.ee/Tambets2018/Tambets_et_al_2018_{1M,660k,650k,610k}.{bed,bim,fa m}
pli(){ plink --allow-no-sex "$@";}
pli --bfile Tambets_et_al_2018_1M --bmerge Tambets_et_al_2018_650k --make-bed --out merge
pli --bfile Tambets_et_al_2018_650k --flip merge-merge.missnp --make-bed --out Tambets_et_al_2018_650k_flip
pli --bfile Tambets_et_al_2018_650k_flip --exclude merge-merge.missnp --make-bed --out Tambets_et_al_2018_650k_filtered
pli --bfile Tambets_et_al_2018_1M --bmerge Tambets_et_al_2018_650k_filtered --make-bed --out merge

I used plink 1.9. With plink 2, I needed to replace `--bmerge` with `--pmerge`, but even then I got this error: `Error: Failed to open Tambets_et_al_2018_650k.psam`.

It needs pgen/pvar/psam format dataset. https://www.cog-genomics.org/plink/2.0/formats#psam
You converted to it?

Zoro

03-15-2021, 02:24 PM

I tried `plink --pca` with different values for the third parameter of `--indep-pairwise` (VIF threshold):

awk -F\\t 'NR>1&&$2~/\.DG/{print$2" "$13}' v44.3_1240K_public.anno|grep -Ev '\.REF|Ignore_|_o|_dup|_contam|_SomeAfrican|_fathe r|_mother|_son|_daughter|_brother|_sister|_twin'|g rep -Ev 'Neanderthal|Denisova'>pick
for x in 0.1 0.3 0.5;do plink --allow-no-sex --bfile v44.3_1240K_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' pick v44.3_1240K_public.fam) --indep-pairwise 50 10 $x --make-bed --out $x;plink --allow-no-sex --bfile $x --extract $x.prune.in --pca --out $x;awk 'NR==FNR{a[$1]=$2;next}{$1=a[$2]}1' pick $x.eigenvec>$x.eigenvec.2;done

0.5 removed 62% of variants, 0.3 removed 75% of variants, and 0.1 removed 90% of variants.

I applied G25-style scaling to the PCA where I multiplied each column by the eigenvalues, because I've found that without it the clustering is more often messed up so that for example some SSAs or Mongoloids cluster with Europeans. The same thing happens with unscaled G25 coordinates. However even in the second image below, the clustering went wrong so that Yoruba clustered with Siberians. I don't know why.

Lower values for the last parameter of `--indep-pairwise` increased the distance of Capoids to non-Capoids more.

With no `--indep-pairwise` option, Russia_Ust_Ishim.DG formed a cluster with Russia_Ust_Ishim_HG_published.DG and Luxembourg_Loschbour.DG formed a cluster with Luxembourg_Loschbour_published.DG. (I should've probably removed the duplicate samples.) With `--indep-pairwise 50 10 0.5`, the cluster for Ust'-Ishim disappeared. With `--indep-pairwise 50 10 0.3`, the cluster for Loschbour also disappeared.

https://i.ibb.co/D9ssdkM/image.png
https://i.ibb.co/c3L30H8/0-5.png
https://i.ibb.co/QbmTm5J/0-3.png
https://i.ibb.co/TcLFhx6/0-1.png

library(tidyverse)
library(colorspace)

t=read.table("0.5.eigenvec.2",sep=" ")
eig=as.double(readLines("0.5.eigenval"))

t=(cbind(t[,c(1,2)],t(t(t[,-c(1,2)])*sqrt(eig)))) # scaling

ave=aggregate(t[,-c(1,2)],list(t[,1]),mean)
names(ave)=c("pop",paste0("PC",seq(ncol(ave)-1)))
k=cutree(hclust(dist(ave[,-c(1,2)]),method="ward.D2"),k=16)
ave$k=k

ggplot(ave,aes(x=PC1,y=PC2))+
geom_point(aes(color=as.factor(k)),size=.5)+
geom_polygon(data=ave%>%group_by(k)%>%slice(chull(PC1,PC2)),alpha=.2,aes(color=as.facto r(k),fill=as.factor(k)),size=.3)+
geom_text(aes(label=pop,color=as.factor(k)),size=2 ,vjust=-.7)+
theme(
aspect.ratio=1/sqrt(2),
legend.position="none",
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.ticks.length=unit(0,"pt"),
panel.border=element_rect(color="gray75",fill=NA,size=.4),
panel.grid.major=element_line(color="gray70",size=.15),
panel.background=element_rect(fill="white"),
plot.title=element_text(size=9),
axis.text=element_text(color="black",size=6),
axis.title=element_text(color="black",size=8)
)+
scale_x_continuous(breaks=seq(-10,10,.1),expand=expansion(mult=.12))+
scale_y_continuous(breaks=seq(-10,10,.1),expand=expansion(mult=.03))+
labs(x="",y="")+
scale_color_discrete_qualitative(palette="Set 2",c=80,l=40)

ggsave("0.5.png")
system(paste0("mogrify -trim -bordercolor white -border 20x20 0.5.png"))

I'm not a big fan of LD pruning because it throws out alot of SNPs that are population specific, but I do see some use for it in highly drifted populations.

Certainly --indep-pairwise 50 10 0.1-0.3 is way too aggressive, not to mention you are left with too few markers. Throwing out markers with r2 values higher than 0.1-0.3 removes too much useful information in terms of variation.

As far as duplicate samples never include them except for IBS because they can screw up true allele frequencies in populations

Zoro

03-15-2021, 02:26 PM

Thanks, I got it to work with the `--exclude` option:

wget evolbio.ut.ee/Tambets2018/Tambets_et_al_2018_{1M,660k,650k,610k}.{bed,bim,fa m}
pli(){ plink --allow-no-sex "$@";}
pli --bfile Tambets_et_al_2018_1M --bmerge Tambets_et_al_2018_650k --make-bed --out merge
pli --bfile Tambets_et_al_2018_650k --flip merge-merge.missnp --make-bed --out Tambets_et_al_2018_650k_flip
pli --bfile Tambets_et_al_2018_650k_flip --exclude merge-merge.missnp --make-bed --out Tambets_et_al_2018_650k_filtered
pli --bfile Tambets_et_al_2018_1M --bmerge Tambets_et_al_2018_650k_filtered --make-bed --out merge

I used plink 1.9. With plink 2, I needed to replace `--bmerge` with `--pmerge`, but even then I got this error: `Error: Failed to open Tambets_et_al_2018_650k.psam`.

I'll diagnose it out if you want to show me all the steps you took in Plink 1.9

vbnetkhio

03-15-2021, 02:29 PM

Thanks, I got it to work with the `--exclude` option:

wget evolbio.ut.ee/Tambets2018/Tambets_et_al_2018_{1M,660k,650k,610k}.{bed,bim,fa m}
pli(){ plink --allow-no-sex "$@";}
pli --bfile Tambets_et_al_2018_1M --bmerge Tambets_et_al_2018_650k --make-bed --out merge
pli --bfile Tambets_et_al_2018_650k --flip merge-merge.missnp --make-bed --out Tambets_et_al_2018_650k_flip
pli --bfile Tambets_et_al_2018_650k_flip --exclude merge-merge.missnp --make-bed --out Tambets_et_al_2018_650k_filtered
pli --bfile Tambets_et_al_2018_1M --bmerge Tambets_et_al_2018_650k_filtered --make-bed --out merge

I used plink 1.9. With plink 2, I needed to replace `--bmerge` with `--pmerge`, but even then I got this error: `Error: Failed to open Tambets_et_al_2018_650k.psam`.

yeah sorry, plink2 doesn't have a merge function yet, I meant to put just "plink" there.

but you missed one step here. you should attempt to merge Tambets_et_al_2018_650k_flip with Tambets_et_al_2018_1M. then if there are still some snps which can't be merged, a new missnp file will be written, and then you should just exclude those. this way you keep the most SNPs.

Zoro

03-15-2021, 02:32 PM

Here's how you can do the clustering based on the output of `plink --genome`.

awk -F\\t 'NR>1&&$2~/\.DG/{print$2" "$13}' v44.3_1240K_public.anno|grep -Ev '\.REF|Ignore_|_o|_dup|_contam|_SomeAfrican|_fathe r|_mother|_son|_daughter|_brother|_sister|_twin'>pick
plink --allow-no-sex --bfile v44.3_1240K_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' pick v44.3_1240K_public.fam) --genome

library(tidyverse)
library(colorspace)

t=read.table("plink.genome",skip=1)
pops=read.table("pick")
pops=pops[order(pops[,1]),]

t=t[,c(2,4,12)]
t2=rbind(t,setNames(t[,c(2,1,3)],names(t)))
t3=as.data.frame.matrix(xtabs(t2[,3]~t2[,1]+t2[,2]))

p1="Luxembourg_Loschbour.DG"
p2="Russia_Ust_Ishim.DG"
a=aggregate(t3,list(pops[,2]),mean)
row.names(a)=a[,1]
a=a[,-1]
a=aggregate(t(a),list(pops[,2]),mean)
row.names(a)=a[,1]
a=a[,-1]
a=1-a

rej=grepl("Neanderthal|Denisovan|Loschbour_published|Ishim_HG _published",row.names(a))
a=a[!rej,!rej]

# limit=a[,p1]<=.257&a[,p2]<=.257
# a=a[limit,limit]

pick=row.names(a)%in%c(p1,p2)
b=a[!pick,pick]
names(b)=c("x","y")
k=cutree(hclust(as.dist(a[!pick,!pick]),method="ward.D2"),k=32)
write.table(k,"/tmp/clusters",sep=" ",quote=F,col.names=F)
b$k=k

# row.names(b)=sub("\\.DG","",row.names(b))

ggplot(b,aes(x,y))+
geom_abline(linetype="dashed",color="gray80",size=.3)+
geom_point(aes(color=as.factor(k)),size=.5)+
geom_polygon(data=b%>%group_by(k)%>%slice(chull(x,y)),alpha=.2,aes(color=as.factor(k) ,fill=as.factor(k)),size=.3)+
geom_text(label=rownames(b),aes(color=as.factor(k) ),size=2.4,vjust=-.7)+
theme(
axis.text=element_text(color="black"),
axis.ticks.length=unit(0,"pt"),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray80",fill=NA,size=.8),
panel.grid.major=element_line(color="gray80",size=.3),
panel.grid.minor=element_line(color="gray90",size=.2),
plot.background=element_rect(fill="white")
)+
# coord_fixed()+ # uncomment when not using coord_cartesian
coord_cartesian(xlim=c(.214,.257),ylim=c(.214,.257 ),expand=F)+ # zoom in
scale_x_continuous(breaks=seq(-1,1,.01),minor_breaks=seq(-1,1,.001))+
scale_y_continuous(breaks=seq(-1,1,.01),minor_breaks=seq(-1,1,.001))+
labs(x=paste("1-IBS",p1),y=paste("1-IBS",p2))+
scale_color_discrete_qualitative(palette="Set 2",c=100,l=50)

ggsave("output.png")
system("mogrify -trim -bordercolor white -border 24x24 output.png")

Result:

https://i.ibb.co/gS3GwrZ/a.png

I was going to compare the scatter plot you generated with your code with the one I generated but I think you may have an error in your code.

Here's how your t2 file looks like:

> head(t2, n=3)
V2 V4 V12
1 S_Ami1 S_Adygei-1 0.892939
2 S_Ami1 S_Adygei-2 0.891089
3 S_Ami1 S_Albanian1 0.890971

Here's my pick file:

Adygei S_Adygei-1
Adygei S_Adygei-2
Ami.DG S_Ami1
Ami.DG S_Ami2
Armenian S_Armenian-1
Armenian S_Armenian-2
Balochi S_Balochi-1
Balochi S_Balochi-2
BantuKenya S_BantuKenya-1
BantuKenya S_BantuKenya-2

Here's the error message:

> t3=as.data.frame.matrix(xtabs(t2[,3]~t2[,1]+t2[,2]))
> p1="Mongola"
> p2="Sardinian"
> a=aggregate(t3,list(pops[,2]),mean)
Error in aggregate.data.frame(t3, list(pops[, 2]), mean) :
arguments must have same length

> head(t3, n=3)
Altais1 Altais2 Bashkirs1 Bashkirs2 Bashkirs3 S_Abkhasian-1
Altais1 0.000000 0.887653 0.882354 0.883418 0.881705 0.877456
Altais2 0.887653 0.000000 0.882295 0.884274 0.881161 0.876892
Bashkirs1 0.882354 0.882295 0.000000 0.881556 0.879413 0.876450
S_Abkhasian-2 S_Adygei-1 S_Adygei-2 S_Albanian1 S_Ami1 S_Ami2
Altais1 0.874894 0.877768 0.875895 0.873549 0.887796 0.887467
Altais2 0.874022 0.877397 0.874894 0.873506 0.888828 0.889391
Bashkirs1 0.874233 0.877311 0.875086 0.872841 0.881875 0.881902

Komintasavalta

03-15-2021, 02:43 PM

Here's my pick file:

Adygei S_Adygei-1
Adygei S_Adygei-2
Ami.DG S_Ami1
Ami.DG S_Ami2
Armenian S_Armenian-1
Armenian S_Armenian-2
Balochi S_Balochi-1
Balochi S_Balochi-2
BantuKenya S_BantuKenya-1
BantuKenya S_BantuKenya-2

It's supposed to have population names in the second column and sample IDs in the first column. I just tried running my code again and it worked.

You can also generate the pick file like this (where `plink.mibs.id` is a file generated by `plink --distance ibs square`):

awk 'NR==FNR{a[$1]=$3;next}{print$2,a[$2]}' v44.3_1240K_public.ind plink.mibs.id>pick

Zoro

03-15-2021, 03:03 PM

It's supposed to have population names in the second column and sample IDs in the first column. I just tried running my code again and it worked.

You can also generate the pick file like this (where `plink.mibs.id` is a file generated by `plink --distance ibs square`):

awk 'NR==FNR{a[$1]=$3;next}{print$2,a[$2]}' v44.3_1240K_public.ind plink.mibs.id>pick

I changed the pick file to :

head(pops, n=3)
V1 V2
172 S_Abkhasian-1 Russia_Abkhasian
173 S_Abkhasian-2 Russia_Abkhasian
2 S_Adygei-1 Adygei

but still same error:

> t=read.table("plink.genome",skip=1)
> pops=read.table("pick")
> pops=pops[order(pops[,1]),]
> t=t[,c(2,4,12)]
> t2=rbind(t,setNames(t[,c(2,1,3)],names(t)))
> t3=as.data.frame.matrix(xtabs(t2[,3]~t2[,1]+t2[,2]))
> p1="Mongola"
> p2="Sardinian"
> a=aggregate(t3,list(pops[,2]),mean)
Error in aggregate.data.frame(t3, list(pops[, 2]), mean) :
arguments must have same length

Edit: Must the pick file contain the same samples as plink.genome file ? I only picked some of the samples from plink.genome to be in pick file

Komintasavalta

03-16-2021, 07:47 AM

Edit: Must the pick file contain the same samples as plink.genome file ?

Yeah.

I'm not a big fan of LD pruning because it throws out alot of SNPs that are population specific, but I do see some use for it in highly drifted populations.

Certainly --indep-pairwise 50 10 0.1-0.3 is way too aggressive, not to mention you are left with too few markers. Throwing out markers with r2 values higher than 0.1-0.3 removes too much useful information in terms of variation.

When I googled `site:anthrogenica.com --indep-pairwise`, I saw these combinations:

- `--indep-pairwise 200 25 0.4`
- `--indep-pairwise 50 5 0.2`
- `--indep-pairwise 50 5 0.3`

So the range for the third parameter (r^2 threshold) was between 0.2 and 0.4.

This was used as an example in the documentation: `--indep-pairwise 50 5 0.5` (https://zzz.bwh.harvard.edu/plink/summary.shtml).

`--indep-pairwise 50 5 .5` removed 60.3% of variants. When I replaced the first parameter with 200, it removed 63.2% of variants, and when I replaced the second parameter with 25, it removed 59.7% of variants.

So higher values are more aggressive for the first parameter (window size), but lower values are more aggressive for the second and third parameters (step size and r^2 threshold).

Lemminkäinen

03-16-2021, 09:40 AM

http://terheninenmaa.blogspot.com/2020/02/projecting-or-imputing.html?m=1

Komintasavalta

03-16-2021, 10:16 AM

I couldn't get EIGENSOFT or SMARTPCA to compile, so I'm using Mac binaries for an old fork of EIGENSOFT: https://github.com/chrchang/eigensoft. The old version is missing the autoshrink option which Lemminkäinen used in his blog post. Also in the old version, the MAXPOPS options defaults to 100 and it can't be changed from the parfile, but in the newest version of SMARTPCA, the MAXPOPS options defaults to 1000 and it can also be increased from the parfile. I guess I'll have to learn how to compile it.

Lemminkäinen

03-16-2021, 10:21 AM

I couldn't get EIGENSOFT or SMARTPCA to compile, so I'm using Mac binaries for an old fork of EIGENSOFT: https://github.com/chrchang/eigensoft. The old version is missing the autoshrink option which Lemminkäinen used in his blog post. Also in the old version, the MAXPOPS options defaults to 100 and it can't be changed from the parfile, but in the newest version of SMARTPCA, the MAXPOPS options defaults to 1000 and it can also be increased from the parfile. I guess I'll have to learn how to compile it.

I can share Linux objects, later today.

Lemminkäinen

03-16-2021, 10:32 AM

The meaning of Autoshrink is to avoid the "leak" of peripheral low quality samples. It is not perfect however.

Komintasavalta

03-16-2021, 10:36 AM

I can share Linux objects, later today.

I don't need it. Maybe I could get it to compile on GNU/Linux myself. But I don't want to install a GNU/Linux VM, and I no longer have access to GNU/Linux via SSH either.

This was the error I got after I had ran `brew install openblas` and `cd EIG-7.2.1/src`:

$ make CFLAGS='-I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include' LDFLAGS='-L/usr/local/opt/openblas/lib -L/usr/local/opt/gsl/lib'
cc -L/usr/local/opt/openblas/lib -L/usr/local/opt/gsl/lib baseprog.o mcio.o egsubs.o admutils.o h2d.o eigensrc/exclude.o nicksrc/libnick.a -lgsl -lopenblas -lm -lpthread -o baseprog
cc -I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include -I../include -I/usr/include/openblas -c -o convertf.o convertf.c
convertf.c:257:5: error: implicit declaration of function 'setfamilypopnames' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
setfamilypopnames (YES);
^
1 error generated.
make: *** [convertf.o] Error 1

Lemminkäinen

03-16-2021, 10:54 AM

I don't need it. Maybe I could get it to compile on GNU/Linux myself. But I don't want to install a GNU/Linux VM, and I no longer have access to GNU/Linux via SSH either.

This was the error I got after I had ran `brew install openblas` and `cd EIG-7.2.1/src`:

$ make CFLAGS='-I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include' LDFLAGS='-L/usr/local/opt/openblas/lib -L/usr/local/opt/gsl/lib'
cc -L/usr/local/opt/openblas/lib -L/usr/local/opt/gsl/lib baseprog.o mcio.o egsubs.o admutils.o h2d.o eigensrc/exclude.o nicksrc/libnick.a -lgsl -lopenblas -lm -lpthread -o baseprog
cc -I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include -I../include -I/usr/include/openblas -c -o convertf.o convertf.c
convertf.c:257:5: error: implicit declaration of function 'setfamilypopnames' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
setfamilypopnames (YES);
^
1 error generated.
make: *** [convertf.o] Error 1

I have seen problems with those compilstion scripts; sometimes links to libraries are in a wrong sequence etc. -> library is not found. Like Reichlab had not tested scripts before publishing. But maybe they don't test on all environments and compilers.

Zoro

03-16-2021, 11:52 AM

Yeah.

Still same error even when no of samples in pick and no of samples in plnk.genome are same and with same names

When I googled `site:anthrogenica.com --indep-pairwise`, I saw these combinations:

- `--indep-pairwise 200 25 0.4`
- `--indep-pairwise 50 5 0.2`
- `--indep-pairwise 50 5 0.3`

So the range for the third parameter (r^2 threshold) was between 0.2 and 0.4.

This was used as an example in the documentation: `--indep-pairwise 50 5 0.5` (https://zzz.bwh.harvard.edu/plink/summary.shtml).

`--indep-pairwise 50 5 .5` removed 60.3% of variants. When I replaced the first parameter with 200, it removed 63.2% of variants, and when I replaced the second parameter with 25, it removed 59.7% of variants.

So higher values are more aggressive for the first parameter (window size), but lower values are more aggressive for the second and third parameters (step size and r^2 threshold).

That's what I said. I wouldn't take anything on Anthrogenica too seriously. Most of them are just hobbyists like here. The one used in the example This was used as an example in the documentation: `--indep-pairwise 50 5 0.5` is ok but I still think it does more harm than good for most types of analysis like I explained in my previous post.

Zoro

03-16-2021, 12:07 PM

These IBS Sardinian (W Eurasian) vs Mongolians (E Eurasians) are more informative to me than PCAs because unlike PCAs they don't change depending on which samples and how many samples of each population are in the run (ditto G25),

The more I look at them the more I learn. This one is done with 465K overlapping SNPs in all samples including very ancient common alleles

They correctly show Khomani and Ju-Hoan with less W and E Eurasian admixture than Mbuti which in turn have less than Yoruba and the like just like in the latest Paleolithic African paper. G25 of course, gets it wrong by showing the opposite wrt Khomani and Ju-Hoan having more Eurasian admixture than Mbuti.

We also correctly see here, unlike the G25 which gets this wrong too, that Kurds and other W Asians are closer to other Eurasians such as Amerindians and Papuans than to SSA

https://i.imgur.com/NeRGWsy.jpg

Zoro

03-16-2021, 12:15 PM

This one zooming in on Eurasians done by removing very common very old alleles using MaxMaf 0.3 and using 335K overlapping SNPs in all populations show:

1- American Indians are as E Eurasian shifted as C Asians such as Kyrgiz, Uyghur, Hazara and Mansi but they have less Sardinian ENF W Eurasian/Basal Eurasian related

2- 33,000 year old Yana-UP Ancient Siberian has about the same level W Eurasian as S Indians but with less E Eurasian. It also has the same level of E Eurasian as northern S Asians and some W Asians but with less ENF Sardinian related.

Do you learn anything else from this one?

https://i.imgur.com/KNi6rZF.jpg

Komintasavalta

03-16-2021, 12:17 PM

I have zero knowledge of C, and there was probably an easier way to do it, but I got smartpca to compile after I added function prototypes to the source files manually.

I added this to `src/ksrc/kjg_fpca.c`:

size_t get_ncols ();
size_t get_nrows ();
size_t kjg_geno_get_normalized_rows (int I, size_t KJG_FPCA_ROWS, double *Y);

I added this to `src/eigensrc/smartpca.c`:

void setid2pops (char *id2pops, Indiv **indivmarkers, int numindivs);
void kjg_fpca (int numeigs,int fastdim,int fastiter,double *evals, double *evecs);

I added this to `src/gval.c`:

int getcolxz (double *xcol, SNP *cupt, int *xindex, int *xtypes, int nrows, int col, double *xmean, double *xfancy, int *n0, int *n1);

Still same error even when no of samples in pick and no of samples in plnk.genome are same and with same names

Did you also flip the columns of the pick file so the population name is in the second column?

$ head -n2 pick
A_Dinka-4.DG Dinka.DG
A_Dai-5.DG Dai.DG

The R script I posted worked for me after I had run this:

awk -F\\t 'NR>1&&$2~/\.DG/{print$2" "$13}' v44.3_1240K_public.anno|grep -Ev '\.REF|Ignore_|_o|_dup|_contam|_SomeAfrican|_fathe r|_mother|_son|_daughter|_brother|_sister|_twin'>pick
plink --allow-no-sex --bfile v44.3_1240K_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' pick v44.3_1240K_public.fam) --genome

Zoro

03-16-2021, 12:43 PM

Did you also flip the columns of the pick file so the population name is in the second column?

$ head -n2 pick
A_Dinka-4.DG Dinka.DG
A_Dai-5.DG Dai.DG

The R script I posted worked for me after I had ran this:

awk -F\\t 'NR>1&&$2~/\.DG/{print$2" "$13}' v44.3_1240K_public.anno|grep -Ev '\.REF|Ignore_|_o|_dup|_contam|_SomeAfrican|_fathe r|_mother|_son|_daughter|_brother|_sister|_twin'>pick
plink --allow-no-sex --bfile v44.3_1240K_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' pick v44.3_1240K_public.fam) --genome

Yeah I did like i showed you in previous post

Komintasavalta

03-16-2021, 01:24 PM

The images you posted don't show IBS values but they're PCAs done by fviz_cluster. Look at the scale of the axes. Why would you have IBS values over 1 or under 0?

Use my script instead of fviz_cluster if you don't want to do a PCA.

Here's a new version of my script that doesn't require the pick file but that looks up population names from an .ind file:

t=read.table("plink.genome",skip=1)
ind=read.table("path/to/v44.3_1240K_public.ind",row.names=1)

t=t[,c(2,4,12)]
t2=rbind(t,setNames(t[,c(2,1,3)],names(t)))
t3=as.data.frame.matrix(xtabs(t2[,3]~t2[,1]+t2[,2]))

pops=ind[row.names(t3),2]

a=aggregate(t3,list(pops),mean)
row.names(a)=a[,1]
a=a[,-1]
a=aggregate(t(a),list(pops),mean)
row.names(a)=a[,1]
a=a[,-1]
a=1-a

# rej=grepl("Neanderthal|Denisovan|Loschbour_published|Ishim_HG _published",row.names(a))
# a=a[!rej,!rej]

p1="Luxembourg_Loschbour.DG"
p2="Russia_Ust_Ishim.DG"

# under=a[,p1]<=.257&a[,p2]<=.257
# a=a[under,under]

pick=row.names(a)%in%c(p1,p2)
b=a[!pick,pick]
names(b)=c("x","y")
k=cutree(hclust(as.dist(a[!pick,!pick]),method="ward.D2"),k=32)
write.table(k,"clusters",sep=" ",quote=F,col.names=F)
b$k=k

# row.names(b)=sub("\\.DG","",row.names(b))

ggplot(b,aes(x,y))+
geom_abline(linetype="dashed",color="gray80",size=.3)+
geom_point(aes(color=as.factor(k)),size=.7)+
geom_polygon(data=b%>%group_by(k)%>%slice(chull(x,y)),alpha=.2,aes(color=as.factor(k) ,fill=as.factor(k)),size=.4)+
geom_text(label=rownames(b),aes(color=as.factor(k) ),size=2.5,vjust=-.7)+
theme(
legend.position="none",
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.ticks.length=unit(0,"pt"),
panel.grid.major=element_line(color="gray80",size=.4),
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray80",fill=NA,size=.8),
panel.grid.minor=element_line(color="gray90",size=.2),
axis.text=element_text(color="black")
)+
coord_cartesian(xlim=c(.214,.257),ylim=c(.214,.257 ),expand=F)+ # zoom in
# coord_fixed()+ # uncomment when not using coord_cartesian()
scale_x_continuous(breaks=seq(-1,1,.01),minor_breaks=seq(-1,1,.001))+
scale_y_continuous(breaks=seq(-1,1,.01),minor_breaks=seq(-1,1,.001))+
labs(x=paste("1-IBS",p1),y=paste("1-IBS",p2))+
scale_color_discrete_qualitative(palette="Set 2",c=100,l=50)+

ggsave("out.png")
system("mogrify -trim -bordercolor white -border 20x20 out.png")

Zoro

03-16-2021, 01:28 PM

This one zooming in on Eurasians done by removing very common very old alleles using MaxMaf 0.3 and using 335K overlapping SNPs in all populations show:

1- American Indians are as E Eurasian shifted as C Asians such as Kyrgiz, Uyghur, Hazara and Mansi but they have less Sardinian ENF W Eurasian/Basal Eurasian related

2- 33,000 year old Yana-UP Ancient Siberian has about the same level W Eurasian as S Indians but with less E Eurasian. It also has the same level of E Eurasian as northern S Asians and some W Asians but with less ENF Sardinian related.

Do you learn anything else from this one?

https://i.imgur.com/KNi6rZF.jpg

Not only was the E/W Eurasian genetic ratio of 33,000 Yana-UP Ancient Siberia close to S Indians but pigmentation wise it also appears to have been close to S India pigmentation.

This pigmentation prediction chart is courtesy EurasianDNA and is based on the latest around 260 pigmentation associated SNPs unlike the older ones based on a handful of SNPs. The lower the ratio of light to dark alleles the darker the predicted pigment.

Notice how dark Bichon-WHG's prediction

<style type="text/css">td {border: 1px solid #ccc;}br {mso-data-placement:same-cell;}</style>
<colgroup><col width="281"><col width="100"><col width="100"><col width="100"></colgroup><tbody>
SAMPLE
DARK ALLELES
LIGHT ALLELES
RATIO LIGHT TO DARK

YANA-UP-WGS
176
86
0.49

LOSCHBOUR-WHG
159
103
0.65

ANATOLIA_N_Bar8
168
94
0.56

YAMNA_KARAGASH_DIPLOID
162
100
0.62

BICHON-WHG
180
82
0.46

STUTTGART-EEF
163
99
0.61

Basque_S_Basque-1
148
114
0.77

Basque_S_Basque-2
156
106
0.68

British_1000G_HG00100
146
116
0.79

British_1000G_HG00101
144
118
0.82

British_1000G_HG00110
150
112
0.75

British_1000G_HG00115
150
112
0.75

British_1000G_HG00118
160
102
0.64

British_1000G_HG00122
150
112
0.75

British_1000G_HG00124
150
112
0.75

British_1000G_HG00133
158
104
0.66

British_1000G_HG00139
162
100
0.62

Burmese_S_Burmese-1
170
92
0.54

Burmese_S_Burmese-2
174
88
0.51

Kurd-IRAQ
166
96
0.58

Papuan_B_Papuan-15
185
77
0.42

Papuan_S_Papuan-2
198
64
0.32

Papuan_S_Papuan-3
186
76
0.41

Papuan_S_Papuan-5
184
78
0.42

Papuan_S_Papuan-6
186
76
0.41

Papuan_S_Papuan-7
190
72
0.38

Papuan_S_Papuan-9
189
73
0.39

Pathan_S_Pathan-1
169
93
0.55

Pathan_S_Pathan-2
172
90
0.52

Saharawi_S_Saharawi-1
174
88
0.51

Saharawi_S_Saharawi-2
163
99
0.61

Samaritan_S_Samaritan-1
177
85
0.48

Sardinian_B_Sardinian-3
167
95
0.57

Sardinian_S_Sardinian-1
163
99
0.61

Sardinian_S_Sardinian-2
158
104
0.66

Somali_S_Somali-1
183
79
0.43

Tajik_Tadjik
172
90
0.52

Tajik_Tajiks1
144
118
0.82

Tajik_Tajiks3
174
88
0.51

Tamil_1000G_HG03644
204
58
0.28

Tamil_1000G_HG03673
182
80
0.44

Tamil_1000G_HG03679
178
84
0.47

Tamil_1000G_HG03680
170
92
0.54

Tamil_1000G_HG03681
190
72
0.38

Tamil_1000G_HG03684
186
76
0.41

Tamil_1000G_HG03685
176
86
0.49

Tamil_1000G_HG03686
202
60
0.30

Toscani_1000G_NA20506
166
96
0.58

Toscani_1000G_NA20511
162
100
0.62

Toscani_1000G_NA20512
150
112
0.75

Toscani_1000G_NA20515
154
108
0.70

Toscani_1000G_NA20517
164
98
0.60

Toscani_1000G_NA20527
164
98
0.60

Toscani_1000G_NA20530
154
108
0.70

Turk-Kayseri-1
148
114
0.77

Turk-Kayseri-2
170
92
0.54

Turkmen_Turkmens1
166
96
0.58

Turkmen_Turkmens2
154
108
0.70

</tbody>

Zoro

03-16-2021, 02:05 PM

The images you posted don't show IBS values but they're PCAs done by fviz_cluster. Look at the scale of the axes. Why would you have IBS values over 1 or under 0?

Use my script instead of fviz_cluster if you don't want to do a PCA.

Here's a new version of my script that doesn't require the pick file but that looks up population names from an .ind file:

Yeah the values don't mean much but the clustering is exactly the same as my other PCA. In other words there's no scaling or anything going on to change the positions of the samples on the graph.

I tried your new script but I don't think the pops file looks right.

t=read.table("plink.genome",skip=1)
> ind=read.table("ind",row.names=1)
> head(t,n=3)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 Abkhasian Abkhasian-1 Adygei S_Adygei-1 UN NA 1 0 0 0 -1 0.892939
2 Abkhasian Abkhasian-1 Adygei S_Adygei-2 UN NA 1 0 0 0 -1 0.891089
3 Abkhasian Abkhasian-1 Albanian.DG S_Albanian1 UN NA 1 0 0 0 -1 0.890971
V13 V14
1 0.0806 1.8878
2 0.3635 1.9710
3 0.0008 1.7600
> t=t[,c(2,4,12)]
> head(t,n=3)
> t=t[,c(2,4,12)]
> head(t,n=3)
V2 V4 V12
1 WGS2 S_Adygei-1 0.892939
2 WGS2 S_Adygei-2 0.891089
3 WGS2 S_Albanian1 0.890971

> t2=rbind(t,setNames(t[,c(2,1,3)],names(t)))

> head(t2,n=3)
V2 V4 V12
1 Abkhasian-1 S_Adygei-1 0.892939
2 Abkhasian-1 S_Adygei-2 0.891089
3 Abkhasian-1 S_Albanian1 0.890971

t3=as.data.frame.matrix(xtabs(t2[,3]~t2[,1]+t2[,2]))
> head(t3,n=3)
Altais1 Altais2 Bashkirs1 Bashkirs2 Bashkirs3 S_Abkhasian-1
Altais1 0.000000 0.887653 0.882354 0.883418 0.881705 0.877456
Altais2 0.887653 0.000000 0.882295 0.884274 0.881161 0.876892
Bashkirs1 0.882354 0.882295 0.000000 0.881556 0.879413 0.876450
S_Abkhasian-2 S_Adygei-1 S_Adygei-2 S_Albanian1 S_Ami1 S_Ami2
Altais1 0.874894 0.877768 0.875895 0.873549 0.887796 0.887467
Altais2 0.874022 0.877397 0.874894 0.873506 0.888828 0.889391
Bashkirs1 0.874233 0.877311 0.875086 0.872841 0.881875 0.881902

> pops=ind[row.names(t3),2]
> head(pops,n=3)
[1] <NA> <NA> <NA>
127 Levels: Adygei Albanian.DG Altaian.DG Ami.DG Armenian Balochi ... Yoruba

I wonder if the UNIX carriage return being different from DOS is causing issues. I re-made my .ind file in UNIX. I can do a UNIX to DOS command on the .ind file to see if it changes anything unless you have a better idea

Edit: Unix2DOS didn't change anything. FYI my ind file is tab delimited

Zanzibar

03-16-2021, 02:34 PM

Dendogram based on IBS distances to Sardinian ( W Eurasian) and Mongol ( E Eurasian). This better explains phenotype similarities between some W and S Asians and Mexicans and S. Americans than G25.

https://i.imgur.com/fVBt7Pv.jpg

If I remember Mongols aren't pure East Eurasian but have minor Western admixture, wouldn't that skew the results? Wouldn't it be better to use Han Chinese or Korean in this case?

Zoro

03-16-2021, 03:42 PM

If I remember Mongols aren't pure East Eurasian but have minor Western admixture, wouldn't that skew the results? Wouldn't it be better to use Han Chinese or Korean in this case?

On the scatter plot above It may shift American Indians and Central Asians just a little on the E Eurasian x-axis but will not change their position on the Sardinian y-axis. So C Asians will remain significantly more Sardinian ENF and WHG and Basal Eurasian shifted than American Indians.

This makes perfect sense because the ancestors of American Indians left Siberia before all the waves of bronze age Steppe and iron age Iranic Admixture which affected central Asians

You can also see that Karitiana and Surui appear to be an earlier migration out of Siberia than other American Indians or perhaps not as affected by European admixing recently

Zanzibar

03-16-2021, 04:39 PM

On the scatter plot above It may shift American Indians and Central Asians just a little on the E Eurasian x-axis but will not change their position on the Sardinian y-axis. So C Asians will remain significantly more Sardinian ENF and WHG and Basal Eurasian shifted than American Indians.

This makes perfect sense because the ancestors of American Indians left Siberia before all the waves of bronze age Steppe and iron age Iranic Admixture which affected central Asians

You can also see that Karitiana and Surui appear to be an earlier migration out of Siberia than other American Indians or perhaps not as affected by European admixing recently

Ah I C. Interesting. But American Indians have similar amounts of E and W Eurasian ancestries as C Asians right? Do American Indians have any other type of W Eurasian ancestries besides ANE (not counting recent European admixture of course)?

But overall do Karitiana and Surui have more W Eurasian than Northern Natives like Sioux, Ojibwe or Navajo since the latter also received additional Siberian ancestry from Athabaskan/Na-Dene speakers?

Also can you investigate whether Australoids like Papuans, Melanesians, Australian Aborigines possess W Eurasian ancestry besides the S Eurasian that they shared with South Asians and make them heavily drifted towards them in many PCAs?

Zoro

03-17-2021, 12:15 AM

Ah I C. Interesting. But American Indians have similar amounts of E and W Eurasian ancestries as C Asians right? Do American Indians have any other type of W Eurasian ancestries besides ANE (not counting recent European admixture of course)?

But overall do Karitiana and Surui have more W Eurasian than Northern Natives like Sioux, Ojibwe or Navajo since the latter also received additional Siberian ancestry from Athabaskan/Na-Dene speakers?

Also can you investigate whether Australoids like Papuans, Melanesians, Australian Aborigines possess W Eurasian ancestry besides the S Eurasian that they shared with South Asians and make them heavily drifted towards them in many PCAs?

Where did I say C Asians and American Indians have similar amounts of E and W Eurasian ancestry. Didn't I just say that the above scatter plots show they have similar amounts of E Eurasian ancestry but C Asians have higher W Eurasian ancestry (Sardinian ENF, WHG, Basal Eurasian).

I also just said the above scatter plots made alot of sense because the ancestors of American Indians left Siberia before the Bronze Age Steppe waves and Iron Age Iranic waves moving east brought more ENF, WHG, and Basal Eurasian to Central Asia. That's why according to the above scatter plot C Asians have similar E Eurasian as American Indians but they have higher W Eurasian as defined by ENF, WHG, and BE.

I can plot IBS 33,000 year old Ancient Siberian Yana vs some form of W Eurasian (WHG, ENF) later. The fact that many Central Asians and Iranics such as Kurds and Persians superficially look like Mexicans etc probably has to do with both having some of the former having significant ancient Siberian of some sort (ANE, ANS, or something similar).

Wrt to Karitiana and Surui vs other American Indians the above plots show that former with slightly less E Eurasian but at the same time significantly less Sardinian ENF, WHG etc. An explanation for the slightly higher E Eurasian could be later waves into the Americas from Siberia (than Karitiana) had more E Asian rich Neo-Siberian than the former waves (Karitiana). But that wouldn't explain the higher W Eurasian in the later groups. I think that could be explained with more recent European admixture into the non Karitiana and Surui.

Anyways all in all for me the simple 2 dimensional scatter plot above is more informative than PCA which are prone to all sorts of biases.

Wrt Australoids the above plot clearly shows that Papuans have less Sardinian ENF, WHG and Basal than even the most southern Indians and E Asians. Ofc this can be explained that they're a very ancient split and share less drift with ALL; WHG, ENF, and even Devils-Gate. Their relative closeness to S Indians and SE Asians is primarily AASI. As to whether some Australoids have more or less more recent S Asian. I believe yes and we can see some of this in the previous plot I posted showing the difference between Bougainville and Papuans.

Zanzibar

03-17-2021, 03:15 PM

Where did I say C Asians and American Indians have similar amounts of E and W Eurasian ancestry. Didn't I just say that the above scatter plots show they have similar amounts of E Eurasian ancestry but C Asians have higher W Eurasian ancestry (Sardinian ENF, WHG, Basal Eurasian).

I also just said the above scatter plots made alot of sense because the ancestors of American Indians left Siberia before the Bronze Age Steppe waves and Iron Age Iranic waves moving east brought more ENF, WHG, and Basal Eurasian to Central Asia. That's why according to the above scatter plot C Asians have similar E Eurasian as American Indians but they have higher W Eurasian as defined by ENF, WHG, and BE.

I can plot IBS 33,000 year old Ancient Siberian Yana vs some form of W Eurasian (WHG, ENF) later. The fact that many Central Asians and Iranics such as Kurds and Persians superficially look like Mexicans etc probably has to do with both having some of the former having significant ancient Siberian of some sort (ANE, ANS, or something similar).

Wrt to Karitiana and Surui vs other American Indians the above plots show that former with slightly less E Eurasian but at the same time significantly less Sardinian ENF, WHG etc. An explanation for the slightly higher E Eurasian could be later waves into the Americas from Siberia (than Karitiana) had more E Asian rich Neo-Siberian than the former waves (Karitiana). But that wouldn't explain the higher W Eurasian in the later groups. I think that could be explained with more recent European admixture into the non Karitiana and Surui.

Anyways all in all for me the simple 2 dimensional scatter plot above is more informative than PCA which are prone to all sorts of biases.

Wrt Australoids the above plot clearly shows that Papuans have less Sardinian ENF, WHG and Basal than even the most southern Indians and E Asians. Ofc this can be explained that they're a very ancient split and share less drift with ALL; WHG, ENF, and even Devils-Gate. Their relative closeness to S Indians and SE Asians is primarily AASI. As to whether some Australoids have more or less more recent S Asian. I believe yes and we can see some of this in the previous plot I posted showing the difference between Bougainville and Papuans.

Ok I haven't been thoroughly reading and following of all your posts. So apologies if I misunderstand anything. Alright, do American Indians have a third component that is neither West nor East Eurasian? Because then what's the non-East Eurasian ancestry of them if it is not West Eurasian? By C Asians, do you also include Altaians and Kyrgyzs having have more W Eurasian than American Indians?

Are you referring to the scatter plots you made or the one that Komin created? So do American Indians have a third specific component that is neither West nor East Eurasian that make them have less W Eurasian than C Asians?

Yes but remember that Mexicans have significant European admixture as well. Or are you referring only to Mexican Indians?

Ok so if I understand you correctly, the Kartiana and Surui have less E Eurasian than the latter groups but also possess less W Eurasian as well? That make sense due to recent European admixture into the latter groups as you answered me.

Alright but do Australoids have any W Eurasian though despite the fact they have less than most Southern Indians and E Asians/SE Asians? And where do Australoids received these recent S Asian or AASI from?

Lemminkäinen

03-17-2021, 05:17 PM

Grouping populations by admixture groups and population distances gives different results. Admixtures forget distances and distances (IBS or Fst) forget common ancestry proportions.

http://terheninenmaa.blogspot.com/2014/08/?m=0

Komintasavalta

03-17-2021, 05:34 PM

Grouping populations by admixture groups and population distances gives different results. Admixtures forget distances and distances (IBS or Fst) forget common ancestry proportions.

http://terheninenmaa.blogspot.com/2014/08/?m=0

I don't know what the "Gedmatch admix analysis" you wrote about is, in your first uncorrected dendrogram, the clustering looks similar to when unscaled G25 coordinates are used to do a dendrogram. Yoruba are part of the same cluster with Finns, but Selkups are part of a different cluster:

https://lh6.googleusercontent.com/proxy/PwWbN4HiKQi6ltN_FmSzT7s8F-K4bH9Np7Zjs2w-U6peHp-gdEiMMeHc6os3y1PF2ZSjTwu1LqN_xuPjpd0s1CSEiiNyGdPws Plbgw=s0-d

Similarly in the heatmap below that is based on unscaled G25 coordinates, Finns are part of the same cluster with Yoruba and Igbo, but Mansi are in a different cluster:

https://i.ibb.co/fMDDFcL/heatunscaled.png

However with scaled coordinates, PC1 which differentiates SSAs from non-SSAs is given more weight, so the clustering becomes more realistic:

https://i.ibb.co/5K45S8q/heatscaled.png

The clustering was done by using the hclust function with the default linkage method ("complete"):

library(pheatmap)
library(vegan) # for reorder.hclust()
library(colorspace) # for hex()

t=read.csv("https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y",header=T,row.names=1) # scaled
# t=read.csv("https://drive.google.com/uc?export=download&id=1y49hyvviJpHj9esVqyeiFm32DhnPlfRQ",header=T,row.names=1) # unscaled

pop=c("Finnish","German","Han_Shanghai","Igbo","Ju_hoan_North","Karitiana","Mansi","Mari","Mbuti","Nganassan","Onge","Papuan","Pashtun","Saami","Somali","Yoruba")
t=t[row.names(t)%in%pop,]
d=as.matrix(dist(t))
sort=reorder(hclust(dist(t)),wts=d["Ju_hoan_North",]-d["Han_Shanghai",])

pheatmap(
t,
filename="out.png",
clustering_callback=function(...){sort},
cluster_cols=F,
legend=F,
treeheight_row=80,
cellwidth=16,
cellheight=16,
fontsize=8,
border_color=NA,
display_numbers=T,
number_format="%.2f",
fontsize_number=7,
number_color="black",
breaks=seq(-.6,.6,1.2/256),
colorRampPalette(hex(HSV(c(210,210,0,0,0),c(.6,.3, 0,.3,.6),c(1))))(256)
)

Lucas

03-18-2021, 10:44 AM

I don't know what the "Gedmatch admix analysis" you wrote about is, in your first uncorrected dendrogram, the clustering looks similar to when unscaled G25 coordinates are used to do a dendrogram. Yoruba are part of the same cluster with Finns, but Selkups are part of a different cluster:

https://lh6.googleusercontent.com/proxy/PwWbN4HiKQi6ltN_FmSzT7s8F-K4bH9Np7Zjs2w-U6peHp-gdEiMMeHc6os3y1PF2ZSjTwu1LqN_xuPjpd0s1CSEiiNyGdPws Plbgw=s0-d

You read dendrogram in wrongly here. It doesnt' have any meaning that at the top of clade Siberians are near Africans. It is root which is important. And it shows that most different from rest of the world are East-Asians, then Siberians and, then Africans. Of course it could be the opposite in terms of of root should be primeval, but certainly it isn't the same cluster for those groups.
Than rest of the world is divided to Indians and rest. Then Sardinians (EEF) are different. And again and again.

Komintasavalta

03-18-2021, 12:08 PM

You read dendrogram in wrongly here. It doesnt' have any meaning that at the top of clade Siberians are near Africans. It is root which is important. And it shows that most different from rest of the world are East-Asians, then Siberians and, then Africans. Of course it could be the opposite in terms of of root should be primeval, but certainly it isn't the same cluster for those groups.
Than rest of the world is divided to Indians and rest. Then Sardinians (EEF) are different. And again and again.

Maybe Lemmi-san used a different clustering method, but the hclust function I used starts by joining the most similar nodes: "This function performs a hierarchical cluster analysis using a set of dissimilarities for the n objects being clustered. Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster. At each stage distances between clusters are recomputed by the Lance-Williams dissimilarity update formula according to the particular clustering method being used." (https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/hclust)

The default linkage method used by hclust in complete linkage, where the distance between two clusters depends on the distance of the furthest members of the clusters: "Complete-linkage clustering is one of several methods of agglomerative hierarchical clustering. At the beginning of the process, each element is in a cluster of its own. The clusters are then sequentially combined into larger clusters until all elements end up being in the same cluster. The method is also known as farthest neighbour clustering." (https://en.wikipedia.org/wiki/Complete-linkage_clustering)

In the G25 datasheet for unscaled modern population averages, Finns are actually closer to Igbo than to Mansi:

$ curl "https://drive.google.com/uc?export=download&id=1y49hyvviJpHj9esVqyeiFm32DhnPlfRQ">modernaveunscaled
$ awk -F, 'NR==FNR{for(i=2;i<=NF;i++)a[i]=$i;next}$1{s=0;for(i=2;i<=NF;i++)s+=($i-a[i])^2;print s^.5,$1}' <(grep Finnish, modernaveunscaled) modernaveunscaled|sort -n|grep -A5 Igbo
0.0866521 Igbo
0.0868445 Luhya_Kenya
0.0869378 Mansi
0.086945 Tamang
0.0870619 Dharkar
0.0871078 Dong_Guizhou

Zoro

03-18-2021, 12:23 PM

You read dendrogram in wrongly here. It doesnt' have any meaning that at the top of clade Siberians are near Africans. It is root which is important. And it shows that most different from rest of the world are East-Asians, then Siberians and, then Africans. Of course it could be the opposite in terms of of root should be primeval, but certainly it isn't the same cluster for those groups.
Than rest of the world is divided to Indians and rest. Then Sardinians (EEF) are different. And again and again.

I think that's what Komin means. It should be the opposite. So in other words if we reduce this whole dendogram to just 2 clusters it would show an E Asian cluster and then the rest of the world; Africans, Europeans, W Asians, Siberians on the other cluster which is basically wrong which implies the FST distances are wrong.

If reduced to 2 clusters it should correctly show Africans as one cluster and Eurasians and American Indians as the other.

Zoro

03-18-2021, 12:43 PM

Ok I haven't been thoroughly reading and following of all your posts. So apologies if I misunderstand anything. Alright, do American Indians have a third component that is neither West nor East Eurasian? Because then what's the non-East Eurasian ancestry of them if it is not West Eurasian? By C Asians, do you also include Altaians and Kyrgyzs having have more W Eurasian than American Indians?

Are you referring to the scatter plots you made or the one that Komin created? So do American Indians have a third specific component that is neither West nor East Eurasian that make them have less W Eurasian than C Asians?

I clearly said the plot I posted. Yes that would be the implication that there are other components. That's also why one shouldn't rely on Admixture calculator results because their results have a disclaimer that " Results are only correct if the references are actually ancestral to the person tested". How can this disclaimer be true for an individual let alone every individual that uses the calculator when they use specific references as moderns ??

For example if you use a calculator that uses Caucasian and SW Asian modern references, then the results for a Kurd tester would only hold if modern Caucasians and SW Asians are the actual ancestral pops for Kurds, which they clearly are not, thus invalidating the whole result

As far as the other components of ancestry for American Indians it would be safe to say they're related to non-ENF, non-WHG W Eurasians and perhaps other yet to be discovered Siberian or E. Asian lineages.

Or are you referring only to Mexican Indians? Yes, other C American Indians also have recent European admixture

Ok so if I understand you correctly, the Kartiana and Surui have less E Eurasian than the latter groups but also possess less W Eurasian as well? That make sense due to recent European admixture into the latter groups as you answered me. Yes that was 1st part of my answer. The 2nd part I had wrote was due to Karitiana being an earlier split out of Siberia, where the latter received additional E Asian related after the split of Karitiana

Alright but do Australoids have any W Eurasian though despite the fact they have less than most Southern Indians and E Asians/SE Asians? And where do Australoids received these recent S Asian or AASI from?

I'll try to answer later.

Zoro

03-18-2021, 01:14 PM

I clearly said the plot I posted. Yes that would be the implication that there are other components. That's also why one shouldn't rely on Admixture calculator results because their results have a disclaimer that " Results are only correct if the references are actually ancestral to the person tested". How can this disclaimer be true for an individual let alone every individual that uses the calculator when they use specific references as moderns ??

For example if you use a calculator that uses Caucasian and SW Asian modern references, then the results for a Kurd tester would only hold if modern Caucasians and SW Asians are the actual ancestral pops for Kurds, which they clearly are not, thus invalidating the whole result

As far as the other components of ancestry for American Indians it would be safe to say they're related to non-ENF, non-WHG W Eurasians and perhaps other yet to be discovered Siberian or E. Asian lineages.

To clarify further, the IBS results such as the cluster plots I posted earlier can be used to shed light on how E Asian shifted a population is RELATIVE to another population since they're based on a one-to-one gene-to-gene comparison. For example if we look at the result of the Kurds we see they share more IBS with Mongols than most W. Asians and E Europeans in the run, yet if you look at Admixture calculator results you would never have known this.

The reason for this is the Admixture calculator disclaimer I mentioned before which says the results are valid if the modern references are ALL the ACTUAL ancestral references for Kurds which of course they're not. In fact, if you were to drop Caucasian component from a calculator and replace it with a Balkan one then the results of Kurds will change differently from say the results of Turks because now Turks will score more W Eurasian (Balkan) and Kurds would score less W Eurasian (Balkan).

It should also be noted that after the Iron Age Scythian period, the Ilkhanid and Tumurid dynasties ruled for about 250 years over Kurdistan region. They were then succeeded by Seljuks and Ottomen. So that would have some impact on phenotype of some Kurds as well besides Ancient Siberian.

Lemminkäinen

03-18-2021, 02:36 PM

I don't know what the "Gedmatch admix analysis" you wrote about is, in your first uncorrected dendrogram, the clustering looks similar to when unscaled G25 coordinates are used to do a dendrogram. Yoruba are part of the same cluster with Finns, but Selkups are part of a different cluster:

Admixture analyses converted to dendrogram don't see distances between proportions. African, East Asian, European and Siberian are only NAMES giving some ideas in your brain. After attributing distances to proportions we have an exact view.

vbnetkhio

03-19-2021, 04:03 PM

To clarify further, the IBS results such as the cluster plots I posted earlier can be used to shed light on how E Asian shifted a population is RELATIVE to another population since they're based on a one-to-one gene-to-gene comparison. For example if we look at the result of the Kurds we see they share more IBS with Mongols than most W. Asians and E Europeans in the run, yet if you look at Admixture calculator results you would never have known this.

The reason for this is the Admixture calculator disclaimer I mentioned before which says the results are valid if the modern references are ALL the ACTUAL ancestral references for Kurds which of course they're not. In fact, if you were to drop Caucasian component from a calculator and replace it with a Balkan one then the results of Kurds will change differently from say the results of Turks because now Turks will score more W Eurasian (Balkan) and Kurds would score less W Eurasian (Balkan).

It should also be noted that after the Iron Age Scythian period, the Ilkhanid and Tumurid dynasties ruled for about 250 years over Kurdistan region. They were then succeeded by Seljuks and Ottomen. So that would have some impact on phenotype of some Kurds as well besides Ancient Siberian.

what do you think, would it be possible to make a proper mesolithic/neolithic admixture calculator with components like WHG, EEF, etc, one which average users could use and compare themselves to ancient samples.
there isn't a single one currently. G25 also has issues in determining these components.

I tried to do it, but there are problems:
1)supervised admixture would be the easiest solution, but there are too few ancient samples and they are low quality. e.g. there is just 3 ANE samples.

2)if you run an unsupervised admixture analysis on a dataset of all ancient samples from Reich, the components get singled out succesfully. Also, the allele frequencies get imputed from all the samples in the dataset, so having only 3 ANE samples is not a problem.
but, with this calc people couldn't compare themsleves to the ancient samples because of the calculator effect.

3)I tried keeping only 50 most basic ancient samples, and adding HGDP to them, to help impute the components. but then supervised admixture struggles with extracting the correct components. e.g. east asians get split into 2 components, but ANE doesn't get a component because it isn't that relevant on a modern dataset.

Zoro

03-20-2021, 03:09 AM

what do you think, would it be possible to make a proper mesolithic/neolithic admixture calculator with components like WHG, EEF, etc, one which average users could use and compare themselves to ancient samples.
there isn't a single one currently. G25 also has issues in determining these components.

Yes good idea. It would be much more accurate than G25 for determining similarity between user and ancients or moderns

1)supervised admixture would be the easiest solution, but there are too few ancient samples and they are low quality. e.g. there is just 3 ANE samples.

Yes unfortunately Admixture needs a few good quality samples from each population to accurately determine allele frequecies

3)I tried keeping only 50 most basic ancient samples, and adding HGDP to them, to help impute the components. but then supervised admixture struggles with extracting the correct components. e.g. east asians get split into 2 components, but ANE doesn't get a component because it isn't that relevant on a modern dataset.

Most people including Davidski makes the mistake of adding alot of samples to the run. Problem when you do that is that the alley frequencies don’t reflect the component References but instead reflect the allele frequencies of the component references and all the other samples in the run. That’s why if you have one ANE sample and a bunch of Siberians cluster under it that component is not really ANE but rather Siberian component

vbnetkhio

03-21-2021, 02:21 PM

Yes good idea. It would be much more accurate than G25 for determining similarity between user and ancients or moderns

Yes unfortunately Admixture needs a few good quality samples from each population to accurately determine allele frequecies

Most people including Davidski makes the mistake of adding alot of samples to the run. Problem when you do that is that the alley frequencies don’t reflect the component References but instead reflect the allele frequencies of the component references and all the other samples in the run. That’s why if you have one ANE sample and a bunch of Siberians cluster under it that component is not really ANE but rather Siberian component

maybe an automated qpadm model would be better, and even easier to make?

this Davidski's model can give a feasible model for all Europeans:
https://eurogenes.blogspot.com/2017/01/qpadm-tour-of-europe-mesolithic-to.html

do you think something more universal which works for Asians and MENA too could be made?

vbnetkhio

03-22-2021, 12:36 PM

This one zooming in on Eurasians done by removing very common very old alleles using MaxMaf 0.3 and using 335K overlapping SNPs in all populations show:

1- American Indians are as E Eurasian shifted as C Asians such as Kyrgiz, Uyghur, Hazara and Mansi but they have less Sardinian ENF W Eurasian/Basal Eurasian related

2- 33,000 year old Yana-UP Ancient Siberian has about the same level W Eurasian as S Indians but with less E Eurasian. It also has the same level of E Eurasian as northern S Asians and some W Asians but with less ENF Sardinian related.

Do you learn anything else from this one?

https://i.imgur.com/KNi6rZF.jpg

is Yana part of the same popualtion as Ust-Ishim?

Zoro

03-22-2021, 01:49 PM

is Yana part of the same popualtion as Ust-Ishim?

Although it may seem that 33,000 year old Yana may have some ancestry from 45000 year old Ust Ishim because they both inhabited Siberia, this does not appear to be the case based on the papers that reference them.

It also appears that Ust-Ishim lineage went extinct and did not contribute to any present day humans on the planet. Ofc we have to remember that 40,000 years ago there were very few humans living on the planet and they had to deal with Ice Ages which would have been devastating that far north for those hunter gatherers.

It appears that Ust-Ishim belongs to a lineage that broke off from the W Eurasian branch shortly after E Eurasians and W Eurasians split from each other.

But Ishim appears to have broke off the W Eurasian line even before W Eurasians (Goyet, Villabruna, El-Miron, Vestonice, Sunghir) and N Eurasians (Yana, MA1) split off from each other so in theory it should be equally related to both W Eurasians and N Eurasians. In practice though it would be related more to N Eurasians because the former received Basal Eurasian which would cause the FST between W and N Eurasians to grow.

Present day Europeans and American Indians appear more related to MA1 than to Yana though. Additionally Yana appears a little more related to E Eurasians than MA1 to E Eurasians

vbnetkhio

03-22-2021, 02:12 PM

Although it may seem that 33,000 year old Yana may have some ancestry from 45000 year old Ust Ishim because they both inhabited Siberia, this does not appear to be the case based on the papers that reference them.

It also appears that Ust-Ishim lineage went extinct and did not contribute to any present day humans on the planet. Ofc we have to remember that 40,000 years ago there were very few humans living on the planet and they had to deal with Ice Ages which would have been devastating that far north for those hunter gatherers.

It appears that Ust-Ishim belongs to a lineage that broke off from the W Eurasian branch shortly after E Eurasians and W Eurasians split from each other.

But Ishim appears to have broke off the W Eurasian line even before W Eurasians (Goyet, Villabruna, El-Miron, Vestonice, Sunghir) and N Eurasians (Yana, MA1) split off from each other so in theory it should be equally related to both W Eurasians and N Eurasians. In practice though it would be related more to N Eurasians because the former received Basal Eurasian which would cause the FST between W and N Eurasians to grow.

Present day Europeans and American Indians appear more related to MA1 than to Yana though. Additionally Yana appears a little more related to E Eurasians than MA1 to E Eurasians

which paleolithic samples should be used as outgroups when doing a simple model for later populations like e.g. whg/eef/yamnaya?
is more = better? would it be bad if I included all of them (Ust-Ishim, Yana, Vestonice, Kostenki, Sunghir, El Miron..)

Zoro

03-22-2021, 02:35 PM

which paleolithic samples should be used as outgroups when doing a simple model for later populations like e.g. whg/eef/yamnaya?
is more = better? would it be bad if I included all of them (Ust-Ishim, Yana, Vestonice, Kostenki, Sunghir, El Miron..)

you need enough Outgroups to be able to differentiate your sources and keep the standard errors reasonable but if you put too many then you may have a hard time getting a good P value.

I would start with the ones you mentioned including Villabruna, EHG, Taforalt, Devils-Gate, Kotias

vbnetkhio

03-24-2021, 02:09 PM

I tried SmartPCA with the whole Reich dataset at first, but the dataset was rejected because there were more than 100 populations:

how did you install smartpca?
i managed to do it with "apt-get install", but it's an older version, it doesn't accept a parfile, the flags need to be added in line.

vbnetkhio

03-24-2021, 02:51 PM

you need enough Outgroups to be able to differentiate your sources and keep the standard errors reasonable but if you put too many then you may have a hard time getting a good P value.

I would start with the ones you mentioned including Villabruna, EHG, Taforalt, Devils-Gate, Kotias

why are my p values like this?

https://i.imgur.com/5c5ecWg.png

did i miss any important right pops?

i susepct there could be something wrong with my admixtools installation.
could you make a working model for Italians, and i'll try to replicate it on my PC with the same right and left pops.

Komintasavalta

03-24-2021, 03:45 PM

how did you install smartpca?
i managed to do it with "apt-get install", but it's an older version, it doesn't accept a parfile, the flags need to be added in line.

I downloaded a new version of EIGENSOFT from here: https://www.hsph.harvard.edu/alkes-price/software/.

I ran `brew install gsl homebrew/science/openblas` as suggested in the README. It failed before I ran `brew uninstall --ignore-dependencies suite-sparse`.

I first got an error like this:

$ cd EIG-7.2.1/src
$ make CFLAGS='-I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include' LDFLAGS='-L/usr/local/opt/openblas/lib -L/usr/local/opt/gsl/lib'
cc -I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include -I../include -I/usr/include/openblas -c -o convertf.o convertf.c
convertf.c:257:5: error: implicit declaration of function 'setfamilypopnames' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
setfamilypopnames (YES);
^
1 error generated.
make: *** [convertf.o] Error 1

I was able to run `make` successfully after I added function prototypes to the source files manually. For example I added this to `src/eigensrc/smartpca.c`:

void setid2pops (char *id2pops, Indiv **indivmarkers, int numindivs);
void kjg_fpca (int numeigs,int fastdim,int fastiter,double *evals, double *evecs);

However I now found a second option that doesn't require editing the source files. I had earlier tried to run `make CC=gcc`, but it used `/usr/bin/gcc` which is actually Clang on macOS. `brew install gcc` installs a real GCC to `/usr/local/bin/gcc-10`.

First I ran `cd EIG-7.2.1/src/nicksrc` and I ran `make`, because without it I got an error like this:

gauss.c:1:11: fatal error: ranmath.h: No such file or directory
1 | #include "ranmath.h"
| ^~~~~~~~~~~
compilation terminated.
make[1]: *** [gauss.o] Error 1
make: *** [nicksrc/libnick.a] Error 2

I have no idea why it worked, but I was then able to compile everything successfully by running `make` two times: first using GCC with the `-k` option (`make -k CC=gcc-10`) and then using Clang (`make`):

# brew uninstall --ignore-dependencies suite-sparse
# brew install gcc gsl homebrew/science/openblas
wget github.com/DReichLab/EIG/archive/v7.2.1.tar.gz
tar -xf EIG-7.2.1.tar.gz
cd EIG-7.2.1/src/nicksrc
make
cd ..
make -k CC=/usr/local/bin/gcc-10 CFLAGS='-I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include' LDFLAGS='-L/usr/local/opt/openblas/lib -L/usr/local/opt/gsl/lib'
make CFLAGS='-I/usr/local/opt/openblas/include -I/usr/local/opt/gsl/include' LDFLAGS='-L/usr/local/opt/openblas/lib -L/usr/local/opt/gsl/lib'

Then test it:

mkdir -p ~/bin;cp eigensrc/smartpca ~/bin;cd
wget evolbio.ut.ee/Tambets2018/Tambets_et_al_2018_1M.{bed,bim,fam}
f=Tambets_et_al_2018_1M
bin/smartpca -p <(printf %s\\n genotypename:\ $f.bed snpname:\ $f.bim indivname:\ $f.fam evecoutname:\ $f.evec evaloutname:\ $f.eval numoutlieriter:\ 0 autoshrink:\ YES)

Zoro

03-25-2021, 04:18 AM

why are my p values like this?

https://i.imgur.com/5c5ecWg.png

did i miss any important right pops?

i susepct there could be something wrong with my admixtools installation.
could you make a working model for Italians, and i'll try to replicate it on my PC with the same right and left pops.

Admixtools 2 was updated several times over the past couple of months. I would reinstall if i were you
You have some poor quality samples in your pright such as Iran-Meso-belt and Natufian. I would use Ibero and Iran-Ganjdareh-N instead they’re much better quality and therefore have more SNP overlap. The set used at EurasianDNA is a balanced high quality set try them.

Looking at your output it’s not liking your choice of Yamnaya and Israel-N in addition to your other sources to model Italians. Why did you use Yamnaya? They’re relevant to perhaps Caucasians but not Italians. If you want to measure steppe in Italians you can use EHG because It doesn’t have the W Asian side Yamnaya has. Once you get a number on EHG you can work backwards and extrapolate for ex if you get 10% EHG and you know the relevant MLBA or IA steppe pop for Italians (X) was 30% EHG then you can guess that Italians had 33% X.
Or you can use Italian-N as a source or Stuttgart and add relevant MLBA or IA sources yo pleft

Komintasavalta

03-25-2021, 10:57 AM

why are my p values like this?

https://i.imgur.com/5c5ecWg.png

did i miss any important right pops?

i susepct there could be something wrong with my admixtools installation.
could you make a working model for Italians, and i'll try to replicate it on my PC with the same right and left pops.

I couldn't reproduce it. I tried both the 1240K and 1240K+HO datasets.

$ p(){ printf %s\\n "$@";}
$ p Russia_Samara_EBA_Yamnaya Turkey_N Luxembourg_Loschbour_published.DG Iran_GanjDareh_N Israel_PPNB>/tmp/left
$ p Georgia_Satsurblia.SG Iran_Mesolithic_BeltCave_all_published Israel_Natufian_published China_Tianyuan Mbuti.SDG Italy_North_Villabruna_HG Papuan.SDG Morocco_Iberomaurusian Russia_Ust_Ishim_HG_published.DG Russia_Kostenki14 Russia_Yana_UP.SG Ethiopia_4500BP_published.SG Czech_Vestonice16 Spain_ElMiron>/tmp/right
$ p Italian_North.SDG>/tmp/target
$ cat /tmp/{left,right,target}>/tmp/pops
$ rm -rf /tmp/f2;R -e 'library("admixtools");extract_f2(pref="path/to/v44.3_1240K_public",pops=readLines("/tmp/pops"),outdir="/tmp/f2");f2=f2_from_precomp("/tmp/f2");qp=qpadm(f2,left=readLines("/tmp/left"),right=readLines("/tmp/right"),target=readLines("/tmp/target"));pd=qp$popdrop%>%mutate(across(where(is.double),round,3));write.cs v(pd,"popdrop",quote=F,row.names=F)'
[output omitted]
$ cut -d, -f5,7-12 popdrop|column -s, -t
p Russia_Samara_EBA_Yamnaya Turkey_N Luxembourg_Loschbour_published.DG Iran_GanjDareh_N Israel_PPNB feasible
0.968 1.099 0.164 -0.161 -0.309 0.207 FALSE
0.951 1.192 0.486 -0.246 -0.431 NA FALSE
0.849 0.704 0.111 -0.029 NA 0.214 FALSE
0.881 0.858 -0.052 NA -0.105 0.299 FALSE
0.98 1.269 NA -0.179 -0.368 0.278 FALSE
0.255 NA 2.359 -0.301 -0.181 -0.877 FALSE
0.688 0.558 0.508 -0.065 NA NA FALSE
0.49 0.586 0.444 NA -0.03 NA FALSE
0.872 0.72 0.028 NA NA 0.252 TRUE
0.93 2.869 NA -0.63 -1.239 NA FALSE
0.888 0.761 NA -0.022 NA 0.262 FALSE
0.916 0.815 NA NA -0.094 0.279 FALSE
0.077 NA 0.706 0.057 0.236 NA TRUE
0.287 NA 1.79 -0.182 NA -0.608 FALSE
0.175 NA 1.132 NA 0.142 -0.274 FALSE
0.057 NA NA 0.201 0.47 0.329 TRUE
0.499 0.559 0.441 NA NA NA TRUE
0.347 1.088 NA -0.088 NA NA FALSE
0.347 1.219 NA NA -0.219 NA FALSE
0.909 0.734 NA NA NA 0.266 TRUE
0.038 NA 1.068 -0.068 NA NA FALSE
0.084 NA 0.844 NA 0.156 NA TRUE
0.126 NA 1.446 NA NA -0.446 FALSE
0.002 NA NA 0.278 0.722 NA TRUE
0.014 NA NA 0.128 NA 0.872 TRUE
0.007 NA NA NA 0.014 0.986 TRUE
0.268 1 NA NA NA NA TRUE
0.026 NA 1 NA NA NA TRUE
0 NA NA 1 NA NA TRUE
0 NA NA NA 1 NA TRUE
0.008 NA NA NA NA 1 TRUE

On the list above, all models with 4 or more populations have one or more negative weight so the value of their feasible column is false.

vbnetkhio

03-25-2021, 02:02 PM

Admixtools 2 was updated several times over the past couple of months. I would reinstall if i were you
You have some poor quality samples in your pright such as Iran-Meso-belt and Natufian. I would use Ibero and Iran-Ganjdareh-N instead they’re much better quality and therefore have more SNP overlap. The set used at EurasianDNA is a balanced high quality set try them.

Looking at your output it’s not liking your choice of Yamnaya and Israel-N in addition to your other sources to model Italians. Why did you use Yamnaya? They’re relevant to perhaps Caucasians but not Italians. If you want to measure steppe in Italians you can use EHG because It doesn’t have the W Asian side Yamnaya has. Once you get a number on EHG you can work backwards and extrapolate for ex if you get 10% EHG and you know the relevant MLBA or IA steppe pop for Italians (X) was 30% EHG then you can guess that Italians had 33% X.
Or you can use Italian-N as a source or Stuttgart and add relevant MLBA or IA sources yo pleft

it's a fresh R installation and fresh admixtools.
I'm not that interested Italians specifically, I'm trying to make something universal that works for all of Europe. I went with Italians first because user andre requested it on another thread.
In studies, in models like this, Yamnaya is usually used as a steppe proxy, and it gives a feasible model for all of Europe, so I went with that.

look at this model Token did in another thread:

Bulgarian
Yamnaya_Samara 0.306 ± 0.0352
Anatolia_N 0.454 ± 0.0303
WHG 0.0796 ± 0.0140
Iran_N 0.161 ± 0.0387
p-value 0.559

Romanian
Yamnaya_Samara 0.341 ± 0.0373
Anatolia_N 0.430 ± 0.0315
WHG 0.0795 ± 0.0155
Iran_N 0.149 ± 0.0426
p-value 0.0467

I want to first make a simple steppe/eef/whg model for all Europeans (but currently I can't get that to pass either), and then check which populations have extra Middle Eastern and Siberian influence outside of that model, and what are the best proxies for this influence. So, here I added Iran_N, Levant_N, CHG, Morroco EN, because all of the later Middle Easterners are a mix of these.

I think this would fully utilize the power of qpAdm.

I couldn't reproduce it. I tried both the 1240K and 1240K+HO datasets.

ok, so there is nothing wrong with my admixtools. good to know.

vbnetkhio

03-28-2021, 02:36 PM

Admixtools 2 was updated several times over the past couple of months. I would reinstall if i were you
You have some poor quality samples in your pright such as Iran-Meso-belt and Natufian. I would use Ibero and Iran-Ganjdareh-N instead they’re much better quality and therefore have more SNP overlap. The set used at EurasianDNA is a balanced high quality set try them.

Looking at your output it’s not liking your choice of Yamnaya and Israel-N in addition to your other sources to model Italians. Why did you use Yamnaya? They’re relevant to perhaps Caucasians but not Italians. If you want to measure steppe in Italians you can use EHG because It doesn’t have the W Asian side Yamnaya has. Once you get a number on EHG you can work backwards and extrapolate for ex if you get 10% EHG and you know the relevant MLBA or IA steppe pop for Italians (X) was 30% EHG then you can guess that Italians had 33% X.
Or you can use Italian-N as a source or Stuttgart and add relevant MLBA or IA sources yo pleft

I went by this logic when choosing my outgroups:
https://eurogenes.blogspot.com/2017/01/qpadm-tour-of-europe-mesolithic-to.html

The relationship between Satsurblia and Kotias, the Caucasus_HG reference, is much older than the arrival of Caucasus_HG admixture in Europe, so as far as I know, my setup is fine.

Same thing with Villabruna; it's much older than the three Western_HG references and their relationship with modern Europeans.

So I also added Iran_meso and Iberomaurusian as right pops, and Iran_N and Morroco_LN as left pops, by this same logic.

Also, I think, when measuring ANE in modern populations, MA1 should be used as an outgroup and Afontova Gora 3 as a left pop?

vbnetkhio

03-28-2021, 02:41 PM

Admixtools 2 was updated several times over the past couple of months. I would reinstall if i were you
You have some poor quality samples in your pright such as Iran-Meso-belt and Natufian. I would use Ibero and Iran-Ganjdareh-N instead they’re much better quality and therefore have more SNP overlap. The set used at EurasianDNA is a balanced high quality set try them.

Looking at your output it’s not liking your choice of Yamnaya and Israel-N in addition to your other sources to model Italians. Why did you use Yamnaya? They’re relevant to perhaps Caucasians but not Italians. If you want to measure steppe in Italians you can use EHG because It doesn’t have the W Asian side Yamnaya has. Once you get a number on EHG you can work backwards and extrapolate for ex if you get 10% EHG and you know the relevant MLBA or IA steppe pop for Italians (X) was 30% EHG then you can guess that Italians had 33% X.
Or you can use Italian-N as a source or Stuttgart and add relevant MLBA or IA sources yo pleft

could you do this?

Hi everyone, could someone run with qpAdm North_Italians, Tuscans and Sicilians with this model?

WHG
Steppe_EMBA (Samara i think it's ok)
Barcin_N
Iran_N
Levant_PPNB

I want to see if Levant_PPNB it's necessary to run italians, in particulary southerns.

Thank you.

Komintasavalta

04-18-2021, 07:52 PM

New question: how do I remove duplicate samples from the 1240K+HO dataset?

vbnetkhio

04-18-2021, 07:57 PM

New question: how do I remove duplicate samples from the 1240K+HO dataset?

check which have the same Master ID in the .anno file, keep the one with more SNPs.(autosomal targets)
it's also good to remove any with ignore,sister,brother,mother,father,daughter,son,s ibling,rel(relative),lc(low coverage) in their instance id.

Komintasavalta

04-18-2021, 09:44 PM

check which have the same Master ID in the .anno file, keep the one with more SNPs.(autosomal targets)

Ok, so basically this keeps only the samples with the highest SNP count:

grep -v _dup v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '!a[$3]++'|cut -f2

The third column is master ID, the second column is version ID, and the 15th column is SNP count.

I was able to run SmartPCA with duplicates removed like this:

smap()(smartpca -p <(printf %s\\n genotypename:\ $1.bed snpname:\ $1.bim indivname:\ $1.fam evecoutname:\ $1.evec evaloutname:\ $1.eval "${@:2}"))
smap2()(smap "$@" numoutlieriter:\ 0 autoshrink:\ YES lsqproject:\ YES numoutevec:\ 20)
tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}")
smapa()(sed 1d $1.evec|awk NF--|tr : \ |awk 'NR==FNR{a[$2]=$1;next}{$1=a[$2]}1' $1.pick -|sed s/\ /:/|tr ' ' ,>$1.csv;sed 's/:[^,]*//' $1.csv|tav ,>$1.ave;awk 'NR==FNR{a[NR]=$0;next}{printf"%s",$1;for(i=2;i<=NF;i++)printf",%.6f",$i*a[i-1]^.5;print""}' $1.{eval,ave}>$1.avescaled)
smap3()(smap2 $1;smapa $1)
igno()(grep -Ev '\.REF|Ignore_|_outlier|_dup|_contam|_father|_moth er|_son|_daughter|_brother|_sister|_sibling|_twin| \.rel\.|Neanderthal|Denisova|Vindija')
keep2()(plink --allow-no-sex --bfile $1 --keep <(awk 'NR==FNR{a[$2];next}$2 in a' $2.pick $1.fam) --make-bed --out $2)
x=latlong7;grep -v _dup v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t 'NR>1&&!a[$3]++&&$11>=45&&$12>=-22&&$15>=5e5&&$6==0{print$8,$2}'|igno|grep -Ev '_o|_lc'>$x.pick;keep2 v44.3_HO_public $x;smap3 $x

At first because there were duplicate samples, some of the clines on PCs 4 and further were dominated by populations with duplicate samples. Now I got more reasonable plots for the higher dimensions. Except maybe there's still duplicates among the Selkup or Ashkenazi samples.

https://i.ibb.co/H78Sd10/1.png
https://i.ibb.co/Yj600Gd/3.png
https://i.ibb.co/rZXfp8N/5.png
https://i.ibb.co/0MSJ7JJ/7.png

Code for Voronoi tesselation:

library(tidyverse)
library(ggforce)

t=read.csv("latlong7.csv",row.names=1,header=F)
names(t)=paste0("PC",1:ncol(t))

pop=sub("\\.(DG|SDG|SG|WGA)","",sub(":.*","",rownames(t)))
centers=aggregate(t,by=list(pop),mean)
t$pop=pop

set.seed(1488)
color=as.factor(sample(seq(1,length(unique(t$pop)) )))
col=rbind(c(60,80),c(25,95),c(30,70),c(60,100),c(2 0,50),c(70,50))
hues=max(ceiling(length(color)/nrow(col)),12)
pal1=as.vector(apply(col,1,function(x)hcl(seq(15,3 75,length=hues+1)[1:hues],x[1],x[2])))
pal2=as.vector(apply(col,1,function(x)hcl(seq(15,3 75,length=hues+1)[1:hues],.8*x[1],.3*x[2])))

ggplot(t,aes(x=PC1,y=PC2,group=-1L))+
ggforce::geom_voronoi_tile(aes(fill=color[as.factor(pop)],color=color[as.factor(pop)]),size=.07,max.radius=.003)+
geom_label(data=centers,aes(x=PC1,y=PC2,label=Grou p.1,color=color,fill=color),alpha=.7,size=2,label. r=unit(.08,"lines"),label.padding=unit(.08,"lines"),label.size=0)+
coord_fixed()+
scale_x_continuous(breaks=seq(-1,1,.02),expand=expansion(mult=.02))+
scale_y_continuous(breaks=seq(-1,1,.02),expand=expansion(mult=.02))+
scale_fill_manual(values=pal1)+
scale_color_manual(values=pal2)+
theme(
axis.text=element_text(color="black",size=6),
axis.ticks=element_blank(),
axis.ticks.length=unit(0,"pt"),
axis.title=element_text(color="black",size=8),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray85",fill=NA,size=.4),
panel.grid.major=element_line(color="gray85",size=.2)
)

ggsave("a.png",width=7,height=7)
system("mogrify -trim -bordercolor white -border 16x16 a.png")

vbnetkhio

04-18-2021, 09:51 PM

Except maybe there's still duplicates among the Selkup or Ashkenazi samples.

maybe close relatives, those are endogamous populations.
you can extract just them in separate datasets (one for Selkups, one for Ahkenazi) do plink --genome on them to check if they are close relatives.
(for me plink --genome doesn't work on too large datasets with multiple populations.)

Komintasavalta

04-18-2021, 10:48 PM

maybe close relatives, those are endogamous populations.
you can extract just them in separate datasets (one for Selkups, one for Ahkenazi) do plink --genome on them to check if they are close relatives.
(for me plink --genome doesn't work on too large datasets with multiple populations.)

When I ran `plink --genome` without LD pruning for all samples in my run, one pair of Selkups had the 76th highest IBS out of a total of 468028 pairs of samples:

$ plink --bfile latlong7 --genome
[output omitted]
$ awk 'NR==FNR{a[$1]=$3;next}FNR>1{print a[$2],$2,a[$4],$4,$12}' v44.3_HO_public.ind plink.genome|sort -rnk5|awk '$1==x&&$3==x{print NR,$0}' x=Selkup|head
76 Selkup Tuebingen51 Selkup Tuebingen62 0.843568
749 Selkup Tuebingen51 Selkup Tuebingen52 0.834890
1026 Selkup Tuebingen51 Selkup Tuebingen59 0.833330
1204 Selkup Selkup105 Selkup Tuebingen53 0.832269
1437 Selkup Tuebingen58 Selkup Tuebingen62 0.831279
1456 Selkup Tuebingen52 Selkup Tuebingen59 0.831183
1827 Selkup Tuebingen54 Selkup Tuebingen62 0.829653
1833 Selkup Tuebingen52 Selkup Tuebingen62 0.829639
1925 Selkup Tuebingen51 Selkup Tuebingen54 0.829319
1926 Selkup Tuebingen60 Selkup Tuebingen79 0.829308
$ sed 1d plink.genome|wc -l
468028

When I tried running SmartPCA with `numoutlieriter: 1` instead of `numoutlieriter: 0`, it removed only 8 individuals, but 2 of them were Selkups. It also removed Tofalars who were among the highest-ranking IBS pairs.

REMOVED outlier 2676:Ul56 iter 1 evec 7 sigmage 6.086 pop: Control
REMOVED outlier 3598:Tuebingen64 iter 1 evec 4 sigmage 8.457 pop: Control
REMOVED outlier 3599:Tuebingen72 iter 1 evec 4 sigmage 7.305 pop: Control
REMOVED outlier 3633:Vgut1 iter 1 evec 9 sigmage -6.654 pop: Control
REMOVED outlier 3638:Vgut8 iter 1 evec 9 sigmage -6.557 pop: Control
REMOVED outlier 3639:Vgut11 iter 1 evec 9 sigmage -6.316 pop: Control
REMOVED outlier 3640:Vgut12 iter 1 evec 9 sigmage -7.208 pop: Control
REMOVED outlier 3641:Vgut13 iter 1 evec 9 sigmage -6.398 pop: Control
number of samples after outlier removal: 960

The pair of samples with the highest IBS consisted of two Tofalars that were both removed by SmartPCA:

$ awk 'NR==FNR{a[$1]=$3;next}FNR>1{print a[$2],$2,a[$4],$4,$12}' g/g/ho.ind plink.genome|sort -rnk5|awk '$1==x&&$3==x{print NR,$0}' x=Tofalar|head
1 Tofalar Vgut8 Tofalar Vgut12 0.891949
7 Tofalar Vgut11 Tofalar Vgut13 0.864923
13 Tofalar Vgut1 Tofalar Vgut4 0.860371
41 Tofalar Vgut13 Tofalar Vgut18 0.847714
256 Tofalar Vgut11 Tofalar Vgut15 0.838493
281 Tofalar Vgut6 Tofalar Vgut12 0.838214
302 Tofalar Vgut7 Tofalar Vgut11 0.838022
320 Tofalar Vgut1 Tofalar Vgut11 0.837828
384 Tofalar Vgut1 Tofalar Vgut15 0.837239
433 Tofalar Vgut7 Tofalar Vgut19 0.836832

So maybe it's better to not use `numoutlieriter: 0` after all.

vbnetkhio

04-18-2021, 11:03 PM

When I ran `plink --genome` without LD pruning for all samples in my run, one pair of Selkups had the 76th highest IBS out of a total of 468028 pairs of samples:

ibs means identity by state, i. e. similarity by descent from the same populations and shared drifts. that doesn't necessarily mean close relatedness. but since it's so close in this case it probably means something,

in plink's output PI_HAT is the IBD (identity by descent) measure, and DST is IBS.
so you should look at the PI_HAT in this case.
https://www.cog-genomics.org/plink/1.9/ibd

in this case runs of homozigosity could also be useful?
https://www.cog-genomics.org/plink/1.9/ibd#homozyg

Komintasavalta

04-23-2021, 02:19 PM

in plink's output PI_HAT is the IBD (identity by descent) measure, and DST is IBS.
so you should look at the PI_HAT in this case.
https://www.cog-genomics.org/plink/1.9/ibd

Yeah, I now excluded pairs of samples with PI_HAT of .25 or higher after LD pruning, and I ran SmartPCA with `numoutlieriter: 0`.

curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar;tar -xf v44.3_HO_public.tar
f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_fath er|_mother|_son|_daughter|_brother|_sister|_siblin g|_twin|Neanderthal|Denisova|Vindija_light|Gorilla |Macaque|Marmoset|Orangutang|Primate_Chimp|hg19ref ')
x=prefix;awk -F\\t 'NR>1&&$11>45&&$12>50{print$2,$3,$8,$15}' v44.3_HO_public.anno|igno|grep -v _o|sort -rnk4|awk '!a[$2]++{print$1,$3}'>$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --indep-pairwise 200 25 .2 --make-bed --out $x.temp
plink --allow-no-sex --bfile $x.temp --extract $x.temp.prune.in --genome --out $x
awk 'NR==FNR{a[$1]=$2;next}FNR>1&&$10>=.25{print$2,a[$2];print$4,a[$4]}' $x.pick $x.genome|sort -u>$x.ignore
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}!($1 in a)' $x.ignore $x.pick|awk 'NR==FNR{a[$1];next}$2 in a' - v44.3_HO_public.fam) --make-bed --out $x
smartpca -p <(printf %s\\n genotypename:\ $x.bed snpname:\ $x.bim indivname:\ $x.fam evecoutname:\ $x.evec evaloutname:\ $x.eval numoutevec:\ 20 autoshrink:\ YES lsqproject:\ YES numoutlieriter:\ 0)

Pairs of samples with highest PI_HAT:

$ awk 'NR==FNR{a[$1]=$2;next}{print$10,a[$2]":"$2,a[$4]":"$4}' $x.pick $x.genome|sort -rn|head
0.5501 Tofalar:Vgut8 Tofalar:Vgut12
0.5289 Tubalar:Tuba23 Tubalar:Tuba24
0.5285 Shor_Khakassia:KHS-035 Shor_Khakassia:KHS-036
0.5028 Khakass_Kachin:Khs-493 Khakass_Kachin:Khs-513
0.4881 Tubalar:ALT-116 Tubalar:Tuba2
0.3945 Chukchi1:Sir50 Chukchi1:Sir41
0.3651 Nganasan:Tuebingen19 Nganasan:Tuebingen116
0.3608 Tofalar:Vgut11 Tofalar:Vgut13
0.3572 Nganasan:Tuebingen23 Nganasan:Tuebingen124
0.3455 Negidal:AMU-160 Nivh:AMU-164

vbnetkhio

04-23-2021, 09:39 PM

Yeah, I now excluded pairs of samples with PI_HAT of .25 or higher after LD pruning, and I ran SmartPCA with `numoutlieriter: 0`.

curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar;tar -xf v44.3_HO_public.tar
f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_fath er|_mother|_son|_daughter|_brother|_sister|_siblin g|_twin|Neanderthal|Denisova|Vindija_light|Gorilla |Macaque|Marmoset|Orangutang|Primate_Chimp|hg19ref ')
x=prefix;awk -F\\t 'NR>1&&$11>45&&$12>50{print$2,$3,$8,$15}' v44.3_HO_public.anno|igno|grep -v _o|sort -rnk4|awk '!a[$2]++{print$1,$3}'>$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --indep-pairwise 200 25 .2 --make-bed --out $x.temp
plink --allow-no-sex --bfile $x.temp --extract $x.temp.prune.in --genome --out $x
awk 'NR==FNR{a[$1]=$2;next}FNR>1&&$10>=.25{print$2,a[$2];print$4,a[$4]}' $x.pick $x.genome|sort -u>$x.ignore
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}!($1 in a)' $x.ignore $x.pick|awk 'NR==FNR{a[$1];next}$2 in a' - v44.3_HO_public.fam) --make-bed --out $x
smartpca -p <(printf %s\\n genotypename:\ $x.bed snpname:\ $x.bim indivname:\ $x.fam evecoutname:\ $x.evec evaloutname:\ $x.eval numoutevec:\ 20 autoshrink:\ YES lsqproject:\ YES numoutlieriter:\ 0)

Pairs of samples with highest PI_HAT:

$ awk 'NR==FNR{a[$1]=$2;next}{print$10,a[$2]":"$2,a[$4]":"$4}' $x.pick $x.genome|sort -rn|head
0.5501 Tofalar:Vgut8 Tofalar:Vgut12
0.5289 Tubalar:Tuba23 Tubalar:Tuba24
0.5285 Shor_Khakassia:KHS-035 Shor_Khakassia:KHS-036
0.5028 Khakass_Kachin:Khs-493 Khakass_Kachin:Khs-513
0.4881 Tubalar:ALT-116 Tubalar:Tuba2
0.3945 Chukchi1:Sir50 Chukchi1:Sir41
0.3651 Nganasan:Tuebingen19 Nganasan:Tuebingen116
0.3608 Tofalar:Vgut11 Tofalar:Vgut13
0.3572 Nganasan:Tuebingen23 Nganasan:Tuebingen124
0.3455 Negidal:AMU-160 Nivh:AMU-164

did removing those have an effect on the PCA?

Komintasavalta

04-25-2021, 11:47 PM

did removing those have an effect on the PCA?

Actually it only helped partially. I didn't check the result of the SmartPCA command I posted, but even after I removed the samples with PI_HAT of .25 or higher, there were still some outliers because I used the parameter `numoutlieriter: 0`.

Often it's not even enough if you run SmartPCA once without `numoutlieriter`, remove outliers manually, and run it again, because the next run will often have new outliers. I think that's why the default behavior is to do multiple iterations of outlier removal.

Komintasavalta

09-23-2021, 02:19 PM

How do you convert MyHeritage raw data to PLINK format? It looks like this:

$ head -n20 MyHeritage_raw_dna_data.csv
##fileformat=MyHeritage
##format=MHv1.0
##chip=GSA
##timestamp=2021-09-23 08:59:54 UTC
##reference=build37
#
# MyHeritage DNA raw data.
# For each SNP, we provide the identifier, chromosome number, base pair position and genotype.
# The genotype is reported on the forward (+) strand with respect to the human reference build 37.
# THIS INFORMATION IS FOR YOUR PERSONAL USE AND IS INTENDED FOR GENEALOGICAL RESEARCH
# ONLY. IT IS NOT INTENDED FOR MEDICAL OR HEALTH PURPOSES. PLEASE BE AWARE THAT THE
# DOWNLOADED DATA WILL NO LONGER BE PROTECTED BY OUR SECURITY MEASURES.
RSID,CHROMOSOME,POSITION,RESULT
"rs547237130","1","72526","AA"
"rs562180473","1","565703","AA"
"rs575203260","1","567693","TT"
"rs3131972","1","752721","GG"
"rs200599638","1","752918","GG"
"rs12184325","1","754105","CC"
"rs114525117","1","759036","GG"

Lucas

09-23-2021, 02:58 PM

How do you convert MyHeritage raw data to PLINK format? It looks like this:

$ head -n20 MyHeritage_raw_dna_data.csv
##fileformat=MyHeritage
##format=MHv1.0
##chip=GSA
##timestamp=2021-09-23 08:59:54 UTC
##reference=build37
#
# MyHeritage DNA raw data.
# For each SNP, we provide the identifier, chromosome number, base pair position and genotype.
# The genotype is reported on the forward (+) strand with respect to the human reference build 37.
# THIS INFORMATION IS FOR YOUR PERSONAL USE AND IS INTENDED FOR GENEALOGICAL RESEARCH
# ONLY. IT IS NOT INTENDED FOR MEDICAL OR HEALTH PURPOSES. PLEASE BE AWARE THAT THE
# DOWNLOADED DATA WILL NO LONGER BE PROTECTED BY OUR SECURITY MEASURES.
RSID,CHROMOSOME,POSITION,RESULT
"rs547237130","1","72526","AA"
"rs562180473","1","565703","AA"
"rs575203260","1","567693","TT"
"rs3131972","1","752721","GG"
"rs200599638","1","752918","GG"
"rs12184325","1","754105","CC"
"rs114525117","1","759036","GG"

If you have problem with those company formats, you can alwas can use DNAKitStudio for convertion to 23me format for example.

Komintasavalta

09-23-2021, 05:50 PM

If you have problem with those company formats, you can alwas can use DNAKitStudio for convertion to 23me format for example.

I don't have a Windows VM, so I registered an account at dnagenics.com and used the option "DNA Tools > RAW Converter".

But apparently the 23andme format is the same as the MyHeritage format, except it doesn't have a header line and it's TSV and not CSV. So you can just do this:

plink --23file <(grep -v ^\# MyHeritage_raw_dna_data.csv|sed 1d|tr , \\t|tr -d \") --make-bed --out mybed