Plink related questions

**Zoro** · 03-12-2021, 07:24 AM

This thread is for Plink-related questions and for posting IBS or IBD results from Plink commands such as --genome or any other Plink command.

@ Komintasavalta

Regarding your question in the other thread about how to convert the .geno .snp .ind format used in Admixtools to Plink .bed .bim .fam, here's how I do it:

Create a text file like this, using your own file name instead of sample, and save it as par.PED.PACKEDPED:

genotypename: sample.geno
snpname: sample.snp
indivname: sample.ind
outputformat: PACKEDPED
genotypeoutname: sample.bed
snpoutname: sample.bim
indivoutname: sample.fam
familynames: YES

At a Linux terminal, run:

......../convertf -p par.PED.PACKEDPED

Put the path to the ADMIXTOOLS convertf binary in place of the dots.

You'll get 3 Plink files: .bed, .bim and .fam.

Check your .fam file. You can edit the population names and sample IDs to whatever you like, or keep them as they were named in the .ind file.
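If you convert several datasets, the parameter file can be generated from the file prefix instead of being typed by hand. This is only a sketch: `write_convertf_par` is a made-up helper name, and it assumes the three input files share one prefix.

```shell
# Sketch: write a convertf parameter file for a given file prefix.
# "write_convertf_par" is a hypothetical helper, not part of ADMIXTOOLS.
write_convertf_par() {
  prefix=$1
  parfile=${2:-par.PED.PACKEDPED}   # output name, same default as above
  cat > "$parfile" <<EOF
genotypename: $prefix.geno
snpname: $prefix.snp
indivname: $prefix.ind
outputformat: PACKEDPED
genotypeoutname: $prefix.bed
snpoutname: $prefix.bim
indivoutname: $prefix.fam
familynames: YES
EOF
}

# Writes par.PED.PACKEDPED for sample.geno / sample.snp / sample.ind
write_convertf_par sample
```

After that, run convertf with `-p par.PED.PACKEDPED` as described above.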

**Zoro** · 03-12-2021, 07:40 AM

The discussion of Lezgins, Chechens, and Daghestanis in Iraq, Lezgin-Kurd IBS closeness, and Lezgin IBS is located at:

https://www.theapricity.com/forum/sh...s-Vol-4/page52

**Zoro** · 03-12-2021, 07:45 AM

Discussion related to G25 distance results wrongly showing :

1- Eurasians such as Mongols closer to Khomani-San and Ju-Hoan than to Mbuti
2- Eurasians such as Kurds closer to SSA than other Eurasians such as Papuans, Karitiana, and Surui
3- Kurds closer to Jordanians than to Uyghur, Baloch, Brahui etc.

Conclusion: The above leads to overestimation of SW Asian and African ancestry in W Asians such as Kurds and underestimation of E Asian and Siberian ancestry.

The Plink IBS results, in contrast to G25, show the above relationships correctly.

Both are located at https://www.theapricity.com/forum/sh...ome-here/page2

**~~Komintasavalta~~** · 03-12-2021, 08:22 AM

Yeah I figured it out already. I did something like this to make a global PCA of modern individuals in the Reich dataset:

Code:

wget reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_1240K_public.tar
tar -xf v44.3_1240K_public.tar
f=v44.3_1240K_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
sed 1d v44.3_1240K_public.anno|grep -v 1KGPhase|awk -F\\t '$9=="Modern"{print$2,$13}'|grep -Ev '_dup|Ignore_|\.REF|_o'|sed -E 's/\.(SDG|DG|SG)$//'>picks0
awk 'NR==FNR{a[$1];next}$2 in a' picks0 v44.3_1240K_public.fam>picks
plink --bfile v44.3_1240K_public --keep picks --allow-no-sex --make-bed --out picks
plink --bfile picks --pca --geno .001 --allow-no-sex --out picks
paste -d' ' <(cut -d' ' -f2 picks0) <(cut -d' ' -f2- picks.eigenvec)>picks.eigenvec.2

When I didn't add `--geno .001`, it messed up the clustering of some populations at first, so that, for example, one South Asian population clustered together with Africans.

I couldn't get convertf to compile, but I downloaded a Mac binary from here: https://github.com/chrchang/eigensoft. The Mac binaries for plink from Harvard's website didn't work, but there were working binaries by another maintainer here: https://www.cog-genomics.org/plink/1.9/. You can download the v44.3_1240K_public.tar file manually from here: https://reichdata.hms.harvard.edu/pu...ses/index.html.

I then made this in R:

Code:

library(tidyverse)
library(colorspace)

f="picks"
t=read.table(paste0(f,".eigenvec.2"),sep=" ")
eig=as.double(readLines(paste0(f,".eigenval")))

# t=cbind(t[,c(1,2)],t(t(t[,-c(1,2)])*sqrt(eig))) # I think this corresponds to scaling in G25

pct=paste0("PC",seq(length(eig))," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

ave=aggregate(t[,-c(1,2)],list(t[,1]),mean)
names(ave)=c("pop",paste0("PC",seq(ncol(ave)-1)))

k=cutree(hclust(dist(ave[,-1]),method="ward.D2"),k=12)
write.csv(k,"/tmp/k",quote=F)
ave$k=k

ggplot(ave,aes(x=PC1,y=PC2,label=pop))+
geom_point(aes(color=as.factor(k)),size=.5)+
geom_polygon(data=ave%>%group_by(k)%>%slice(chull(PC1,PC2)),alpha=.2,aes(color=as.factor(k),fill=as.factor(k)),size=.3)+
geom_text(aes(label=pop,color=as.factor(k)),size=2,vjust=-.7)+
theme(
  aspect.ratio=3/4,
  axis.text=element_text(color="black",size=7),
  axis.ticks.length=unit(0,"pt"),
  axis.ticks.x=element_blank(),
  axis.ticks.y=element_blank(),
  axis.title=element_text(color="black",size=10),
  legend.position="none",
  panel.background=element_rect(fill="white"),
  panel.grid.major=element_line(color="gray75",size=.2)
)+
scale_x_continuous(breaks=seq(-2,2,.1),expand=expansion(mult=.07))+
scale_y_continuous(breaks=seq(-2,2,.1),expand=expansion(mult=.06))+
labs(x=pct[1],y=pct[2])+
scale_color_discrete_qualitative(palette="Set 2",c=80,l=40)

ggsave("output.png")
system("/usr/local/bin/mogrify -trim -bordercolor white -border 20x20 output.png")

However, there's something wrong with the distances in my PCA. For example, the distance from Finns to Khomani_San is about 4.5 times their distance to Yoruba:

Code:

$ tav(){ awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}";}
$ dist(){ awk -F, 'NR==FNR{for(i=2;i<=NF;i++)a[i]=$i;next}$1{s=0;for(i=2;i<=NF;i++)s+=($i-a[i])^2;print s^.5,$1}' "$2" "$1"|sort -n|awk '{printf"%.3f %s\n",$1,$2}'|sed s,^0,,;}
$ paste -d' ' <(cut -d' ' -f2 maailma0) <(cut -d' ' -f3- maailma.eigenvec)|tav ' '|tr ' ' ,>ave;dist ave <(grep Finnish ave)|tail -n16
.130 Karitiana
.130 Mende
.135 Gambian
.136 Esan
.155 Yoruba
.174 Papuan
.189 Mandenka
.206 BantuKenya
.216 Biaka
.242 BantuSA
.268 Mbuti
.336 Ju_hoan_North
.492 BantuHerero
.492 BantuSA_Herero
.497 BantuTswana
.705 Khomani_San

Am I supposed to apply some further quality control or filtering? I tried to include only samples with few missing SNPs: `awk -F\\t '$21>9e5' v44.3_1240K_public.anno`. I also tried increasing the value of the `--geno` option and I tried adding an option like `--max-maf .3`. None of it helped however.

I also tried multiplying the columns of the table with the square roots of the eigenvalues but it didn't help:

Code:

f="picks"
t=read.table(paste0(f,".eigenvec.2"),sep=" ")
eig=as.double(readLines(paste0(f,".eigenval")))

t2=(cbind(t[,c(1,2)],t(t(t[,-c(1,2)])*sqrt(eig))))

ave=aggregate(t[,-c(1,2)],list(t[,1]),mean)
ave2=aggregate(t2[,-c(1,2)],list(t2[,1]),mean)

ind=cbind(paste0(t[,1],":",t[,2]),t[,-c(1,2)])
ind2=cbind(paste0(t2[,1],":",t2[,2]),t2[,-c(1,2)])

write.table(ind,paste0(f,".ind"),quote=F,sep=",",col.names=F,row.names=F)
write.table(ind2,paste0(f,".indscaled"),quote=F,sep=",",col.names=F,row.names=F)
write.table(ave,paste0(f,".ave"),quote=F,sep=",",col.names=F,row.names=F)
write.table(ave2,paste0(f,".avescaled"),quote=F,sep=",",col.names=F,row.names=F)

**Lucas** · 03-12-2021, 08:27 AM

Originally Posted by Komintasavalta

Yeah I figured it out already. I did something like this to make a global PCA of modern individuals in the Reich dataset:

Why not SmartPCA? Davidski used it for G25, not PlinkPCA https://eurogenes.blogspot.com/2017/...-bias-fix.html

**Lucas** · 03-12-2021, 08:50 AM

For a Plink dataset, also do LD-based pruning: https://zzz.bwh.harvard.edu/plink/summary.shtml#prune

plink --file data --indep-pairwise 50 5 0.5 (for the last value, the r^2 threshold, a lower value like 0.3 is better)

=======================================

Also filter on the missing rate per person: https://zzz.bwh.harvard.edu/plink/thresh.shtml#miss2

plink --file mydata --mind 0.1

==========================================
Also exclude SNPs by minor allele frequency: https://zzz.bwh.harvard.edu/plink/thresh.shtml#maf

plink --file mydata --maf 0.05

After that the dataset will of course be smaller, but it should be better.
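A rough sketch of how the three steps could be chained on a binary fileset (so `--bfile` instead of `--file`). Keep in mind that `--indep-pairwise` on its own only writes the lists `.prune.in` and `.prune.out`; the pruning is applied by a second run with `--extract`. The prefixes `data` and `data_qc`, the function name and the `PLINK` variable are placeholders, and the thresholds are just the ones mentioned above, not recommendations. Whether `--maf` belongs in the pipeline at all is debated later in this thread.

```shell
# Sketch only: LD pruning followed by the two other filters.
# PLINK can point at another binary; it defaults to "plink" on the PATH.
PLINK=${PLINK:-plink}

qc_pipeline() {   # usage: qc_pipeline INPUT_PREFIX OUTPUT_PREFIX
  in=$1; out=$2
  # Step 1: compute the pruning lists $out.prune.in and $out.prune.out
  "$PLINK" --bfile "$in" --indep-pairwise 50 5 0.3 --out "$out" || return 1
  # Step 2: keep only the pruned-in SNPs, then filter samples and SNPs
  "$PLINK" --bfile "$in" --extract "$out.prune.in" --mind 0.1 --maf 0.05 \
    --allow-no-sex --make-bed --out "$out"
}

# Example call (needs data.bed, data.bim and data.fam):
# qc_pipeline data data_qc
```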

**~~Komintasavalta~~** · 03-12-2021, 11:18 AM

Originally Posted by Lucas

Why not SmartPCA? Davidski used it for G25, not PlinkPCA https://eurogenes.blogspot.com/2017/...-bias-fix.html

I tried SmartPCA with the whole Reich dataset at first, but the dataset was rejected because there were more than 100 populations:

$ f=g/v44.3_1240K_public/v44.3_1240K_public;smartpca -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind evecoutname:\ evec evaloutname:\ eval)
parameter file: /dev/fd/63
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
genotypename: g/v44.3_1240K_public/v44.3_1240K_public.geno
snpname: g/v44.3_1240K_public/v44.3_1240K_public.snp
indivname: g/v44.3_1240K_public/v44.3_1240K_public.ind
evecoutname: evec
evaloutname: eval
## smartpca version: 10210
norm used

read 1073741824 bytes
read 2147483648 bytes
read 2859357147 bytes
packed geno read OK
number of populations too large. Increase maxpops if you wish
fatalx:
(makeeglist) You really want to analyse more than 100 populations?

I think the maxpops limit needs to be changed in the source code, where it's defined as `#define MAXPOPS 100`. Adding a maxpops option to the parfile didn't have an effect, and it's not documented as one of the parfile options (https://github.com/chrchang/eigensof.../POPGEN/README).
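To see in advance whether a dataset will hit the limit, the distinct population labels in the .ind file can be counted. A minimal sketch, assuming the usual three-column .ind layout (sample ID, sex, population label); `count_pops` is a made-up name:

```shell
# Count the distinct population labels (3rd column) in an .ind file.
count_pops() {
  awk '{print $3}' "$1" | sort -u | wc -l | tr -d ' '
}

# Example call:
# count_pops g/v44.3_1240K_public/v44.3_1240K_public.ind
```

Anything above 100 would run into the same error as in the log above.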

Next I tried SmartPCA with a subset of samples from the Reich dataset:

$ plink --bfile g/bed/v44.3_1240K_public --keep <(awk -F\\t '$9=="Modern"&&$21>9e5{print$2}' g/v44.3_1240K_public/v44.3_1240K_public.anno|grep -v REF|head -n200|awk 'NR==FNR{a[$0];next}$2 in a' - g/bed/v44.3_1240K_public.fam) --make-bed --out reichsubset
$ f=reichsubset;smartpca -p <(printf %s\\n genotypename:\ $f.bed snpname:\ $f.bim indivname:\ $f.fam evecoutname:\ $f.evec evaloutname:\ $f.eval numoutlieriter:\ 0)

Without the option `numoutlieriter: 0`, it removed 30 out of 200 of the samples as outliers (including all SSAs).

However, as with `plink --pca`, the distances to Khoisan and Bambutids seemed too high.

Actually what I needed was `--maf .05`:

$ plink --bfile g/bed/v44.3_1240K_public --keep <(awk -F\\t '$9=="Modern"&&$21>9e5{print$2}' g/v44.3_1240K_public/v44.3_1240K_public.anno|grep -v REF|head -n200|awk 'NR==FNR{a[$0];next}$2 in a' - g/bed/v44.3_1240K_public.fam) --allow-no-sex --maf .05 --make-bed --out withmaf
$ plink --bfile g/bed/v44.3_1240K_public --keep <(awk -F\\t '$9=="Modern"&&$21>9e5{print$2}' g/v44.3_1240K_public/v44.3_1240K_public.anno|grep -v REF|head -n200|awk 'NR==FNR{a[$0];next}$2 in a' - g/bed/v44.3_1240K_public.fam) --allow-no-sex --make-bed --out nomaf
$ f=withmaf;smartpca -p <(printf %s\\n genotypename:\ $f.bed snpname:\ $f.bim indivname:\ $f.fam evecoutname:\ $f.evec evaloutname:\ $f.eval numoutlieriter:\ 0)
$ f=nomaf;smartpca -p <(printf %s\\n genotypename:\ $f.bed snpname:\ $f.bim indivname:\ $f.fam evecoutname:\ $f.evec evaloutname:\ $f.eval numoutlieriter:\ 0)
$ sed 1d withmaf.evec|awk '{$1=$1}NF--' OFS=,|cut -d: -f2 >withmafdist
$ sed 1d nomaf.evec|awk '{$1=$1}NF--' OFS=,|cut -d: -f2 >nomafdist
$ dist(){ awk -F, 'NR==FNR{for(i=2;i<=NF;i++)a[i]=$i;next}$1{s=0;for(i=2;i<=NF;i++)s+=($i-a[i])^2;print s^.5,$1}' "$2" "$1"|sort -n|awk '{printf"%.3f %s\n",$1,$2}'|sed s,^0,,;}
$ dist withmafdist <(grep Finnish withmafdist)|tail -n16
.439 B_Karitiana-3.DG
.443 S_Eskimo_Sireniki-1.DG
.453 S_Eskimo_Sireniki-2.DG
.455 S_BedouinB-2.DG
.472 S_Eskimo_Chaplin-1.DG
.473 S_Eskimo_Naukan-1.DG
.473 A_Ju_hoan_North-5.DG
.475 S_Eskimo_Naukan-2.DG
.486 S_Khomani_San-1.DG
.495 B_Ju_hoan_North-4.DG
.503 S_Ju_hoan_North-1.DG
.511 S_BedouinB-1.DG
.516 S_Ju_hoan_North-2.DG
.597 A_Mbuti-5.DG
.601 B_Mbuti-4.DG
.627 S_Mbuti-3.DG
$ dist nomafdist <(grep Finnish nomafdist)|tail -n16
.314 S_Papuan-2.DG
.316 A_Karitiana-4.DG
.318 S_Papuan-9.DG
.323 A_Papuan-16.DG
.326 B_Karitiana-3.DG
.524 B_Mbuti-4.DG
.530 B_Ju_hoan_North-4.DG
.546 S_Ju_hoan_North-1.DG
.553 S_Ju_hoan_North-2.DG
.562 S_Mbuti-3.DG
.592 S_Khomani_San-1.DG
.629 A_Mbuti-5.DG
.738 B_Yoruba-3.DG
.880 S_Yoruba-2.DG
1.003 A_Yoruba-4.DG
1.008 A_Ju_hoan_North-5.DG

With `--maf .05` I got a plot similar to G25, but with `--maf .01` the distance from Bambutids and Capoids to other humans was reduced only moderately.

But what if the distance between Finns and Ju'Hoan is actually supposed to be much bigger than the distance between Finns and Karitiana? Could it be artificially reduced by G25 because it removes minor alleles that are specific to Capoids?

**Zoro** · 03-12-2021, 01:17 PM

Originally Posted by Komintasavalta

I tried SmartPCA with the whole Reich dataset at first, but the dataset was rejected because there were more than 100 populations:

But what if the distance between Finns and Ju'Hoan is actually supposed to be much bigger than the distance between Finns and Karitiana? Could it be artificially reduced by G25 because it removes minor alleles that are specific to Capoids?

I wouldn’t use --maf, because it removes positions whose allele frequency is below the threshold, for example below 0.01 with --maf 0.01. I would instead use --max-maf, which does the opposite. For example, --max-maf 0.4 removes uninformative alleles that are common in your data (frequency > 40%).

Also, I would use --geno 0.001 if you want an overlapping set of SNPs in all samples; in other words, if you don’t want some samples to have more SNPs than others.
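To make the difference between the two options concrete, here is a toy sketch on a frequency table laid out like the .frq file that plink --freq writes (columns CHR SNP A1 A2 MAF NCHROBS). The SNP names and frequencies are invented, and `maf_keep` is a made-up helper that only mimics the two thresholds; it is not how Plink itself is implemented.

```shell
# Print the SNP IDs whose MAF lies within [MIN, MAX]; the header line is skipped.
maf_keep() {   # usage: maf_keep MIN MAX file.frq
  awk -v min="$1" -v max="$2" 'NR>1 && $5>=min && $5<=max {print $2}' "$3"
}

frq=$(mktemp)
cat > "$frq" <<'EOF'
CHR SNP A1 A2 MAF NCHROBS
1 rsA A G 0.004 400
1 rsB C T 0.120 400
1 rsC G A 0.470 400
EOF

maf_keep 0.01 0.5 "$frq"   # like --maf 0.01: the rare rsA is dropped
maf_keep 0 0.4 "$frq"      # like --max-maf 0.4: the common rsC is dropped
```

The lower bound throws away rare, often population-specific variants, while the upper bound throws away variants that are near 50% in the whole dataset.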

**~~Komintasavalta~~** · 03-12-2021, 01:44 PM

Actually `--maf .05` is probably way too high. When I tried `plink --pca` with different `--maf` settings, `--maf .05` caused Finns to be less than twice as far from Khomani San as from Armenians. The effect of `--maf` became noticeable between .005 and .01, but it became huge between .01 and .05.

$ for x in {0001,001,005,01,05};do plink --bfile g/bed/v44.3_1240K_public --keep <(awk -F\\t '$9=="Modern"&&$2~/\.DG$/{print$2,$13}' g/v44.3_1240K_public/v44.3_1240K_public.anno|grep -Ev 'REF\.|Ignore_|_o'|awk 'NR==FNR{a[$1];next}$2 in a' - g/bed/v44.3_1240K_public.fam) --allow-no-sex --maf .$x --geno .1 --pca --out $x;done
$ for x in {0001,001,005,01,05};do printf %s\\n '' "--maf .$x --geno .1:";grep Finnish-1 $x.eigenvec|awk 'NR==1{for(i=3;i<=NF;i++)a[i]=$i;next}{s=0;for(i=3;i<=NF;i++)s+=(a[i]-$i)^2;print s^.5,$2}' - $x.eigenvec|sort -n|egrep '(Khomani_San|Karitiana|Mbuti|Eskimo_Sireniki|Armenian|Hungarian)-1';done

--maf .0001 --geno .1:
0.0307083 S_Hungarian-1.DG
0.100753 S_Armenian-1.DG
0.236026 S_Eskimo_Sireniki-1.DG
0.300353 S_Karitiana-1.DG
0.532705 S_Mbuti-1.DG
1.0062 S_Khomani_San-1.DG

--maf .001 --geno .1:
0.0307083 S_Hungarian-1.DG
0.100753 S_Armenian-1.DG
0.236026 S_Eskimo_Sireniki-1.DG
0.300353 S_Karitiana-1.DG
0.532705 S_Mbuti-1.DG
1.0062 S_Khomani_San-1.DG

--maf .005 --geno .1:
0.0346009 S_Hungarian-1.DG
0.140324 S_Armenian-1.DG
0.265385 S_Eskimo_Sireniki-1.DG
0.3049 S_Karitiana-1.DG
0.532409 S_Mbuti-1.DG
1.00394 S_Khomani_San-1.DG

--maf .01 --geno .1:
0.0461182 S_Hungarian-1.DG
0.246201 S_Armenian-1.DG
0.318358 S_Eskimo_Sireniki-1.DG
0.324301 S_Karitiana-1.DG
0.542761 S_Mbuti-1.DG
0.79708 S_Khomani_San-1.DG

--maf .05 --geno .1:
0.100773 S_Hungarian-1.DG
0.304498 S_Armenian-1.DG
0.429683 S_Eskimo_Sireniki-1.DG
0.450306 S_Khomani_San-1.DG
0.536177 S_Mbuti-1.DG
0.632922 S_Karitiana-1.DG

Here's the same with no `--geno` option:

--maf .0001:
0.030676 S_Hungarian-1.DG
0.100718 S_Armenian-1.DG
0.236023 S_Eskimo_Sireniki-1.DG
0.300351 S_Karitiana-1.DG
0.532722 S_Mbuti-1.DG
1.0062 S_Khomani_San-1.DG

--maf .001:
0.030676 S_Hungarian-1.DG
0.100718 S_Armenian-1.DG
0.236023 S_Eskimo_Sireniki-1.DG
0.300351 S_Karitiana-1.DG
0.532722 S_Mbuti-1.DG
1.0062 S_Khomani_San-1.DG

--maf .005:
0.0345906 S_Hungarian-1.DG
0.14085 S_Armenian-1.DG
0.265786 S_Eskimo_Sireniki-1.DG
0.304951 S_Karitiana-1.DG
0.532434 S_Mbuti-1.DG
1.00393 S_Khomani_San-1.DG

--maf .01:
0.0460322 S_Hungarian-1.DG
0.246952 S_Armenian-1.DG
0.318423 S_Eskimo_Sireniki-1.DG
0.324423 S_Karitiana-1.DG
0.543871 S_Mbuti-1.DG
0.796861 S_Khomani_San-1.DG

--maf .05:
0.100475 S_Hungarian-1.DG
0.30487 S_Armenian-1.DG
0.429481 S_Eskimo_Sireniki-1.DG
0.450133 S_Khomani_San-1.DG
0.53606 S_Mbuti-1.DG
0.633111 S_Karitiana-1.DG

BTW, some of the individuals in my previous PCA were marked with `Ignore_`, like the one outlier Ju'Hoan. I probably shouldn't have included them.

**Zoro** · 03-12-2021, 02:00 PM

Maybe you didn’t understand what I said, so I’ll repeat: it’s a bad idea to use --maf because it does the opposite of what we want. It removes informative population-specific alleles, in other words the rarer alleles.

You should use --max-maf instead, which removes uninformative alleles common to all populations, in other words very, very ancient alleles.

Try --max-maf 0.4 and --geno 0.001 and repost.

Also check your Plink bim file against the dbSNP database to make sure you don’t have flipped alleles, because that’s pretty common with Plink.

Plink bim col 5 should have the alt allele and col 6 the ref allele. I bet the order is wrong for some of the positions in your bim file.
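A full check against dbSNP needs a reference download, but a quick first pass is to list the strand-ambiguous SNPs (A/T and C/G pairs), where a strand flip cannot be told apart from swapped alleles. This is only a sketch under the assumption of a standard six-column .bim (alleles in columns 5 and 6); `ambiguous_snps` is a made-up name. Note also that Plink 1.9 by default puts the minor allele in column 5 unless --keep-allele-order is given, so a col 5/col 6 order that differs from alt/ref is not necessarily an error.

```shell
# List the IDs of SNPs whose two alleles are complementary (A/T or C/G).
ambiguous_snps() {
  awk '{p=toupper($5 $6)} p=="AT"||p=="TA"||p=="CG"||p=="GC" {print $2}' "$1"
}

# Example call, writing a list that could be passed to plink --exclude:
# ambiguous_snps picks.bim > ambiguous.txt
```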