Plink related questions

**~~Komintasavalta~~** · 04-18-2021, 07:52 PM

New question: how do I remove duplicate samples from the 1240K+HO dataset?

**vbnetkhio** · 04-18-2021, 07:57 PM

Originally Posted by Komintasavalta

New question: how do I remove duplicate samples from the 1240K+HO dataset?

check which have the same Master ID in the .anno file, keep the one with more SNPs.(autosomal targets)
it's also good to remove any with ignore,sister,brother,mother,father,daughter,son,s ibling,rel(relative),lc(low coverage) in their instance id.

**~~Komintasavalta~~** · 04-18-2021, 09:44 PM

Originally Posted by vbnetkhio

check which have the same Master ID in the .anno file, keep the one with more SNPs.(autosomal targets)

Ok, so basically this keeps only the samples with the highest SNP count:

Code:

grep -v _dup v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '!a[$3]++'|cut -f2

The third column is master ID, the second column is version ID, and the 15th column is SNP count.

I was able to run SmartPCA with duplicates removed like this:

Code:

smap()(smartpca -p <(printf %s\\n genotypename:\ $1.bed snpname:\ $1.bim indivname:\ $1.fam evecoutname:\ $1.evec evaloutname:\ $1.eval "${@:2}"))
smap2()(smap "$@" numoutlieriter:\ 0 autoshrink:\ YES lsqproject:\ YES numoutevec:\ 20)
tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}")
smapa()(sed 1d $1.evec|awk NF--|tr : \ |awk 'NR==FNR{a[$2]=$1;next}{$1=a[$2]}1' $1.pick -|sed s/\ /:/|tr ' ' ,>$1.csv;sed 's/:[^,]*//' $1.csv|tav ,>$1.ave;awk 'NR==FNR{a[NR]=$0;next}{printf"%s",$1;for(i=2;i<=NF;i++)printf",%.6f",$i*a[i-1]^.5;print""}' $1.{eval,ave}>$1.avescaled)
smap3()(smap2 $1;smapa $1)
igno()(grep -Ev '\.REF|Ignore_|_outlier|_dup|_contam|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|\.rel\.|Neanderthal|Denisova|Vindija')
keep2()(plink --allow-no-sex --bfile $1 --keep <(awk 'NR==FNR{a[$2];next}$2 in a' $2.pick $1.fam) --make-bed --out $2)
x=latlong7;grep -v _dup v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t 'NR>1&&!a[$3]++&&$11>=45&&$12>=-22&&$15>=5e5&&$6==0{print$8,$2}'|igno|grep -Ev '_o|_lc'>$x.pick;keep2 v44.3_HO_public $x;smap3 $x

At first because there were duplicate samples, some of the clines on PCs 4 and further were dominated by populations with duplicate samples. Now I got more reasonable plots for the higher dimensions. Except maybe there's still duplicates among the Selkup or Ashkenazi samples.

Code for Voronoi tesselation:

Code:

library(tidyverse)
library(ggforce)

t=read.csv("latlong7.csv",row.names=1,header=F)
names(t)=paste0("PC",1:ncol(t))

pop=sub("\\.(DG|SDG|SG|WGA)","",sub(":.*","",rownames(t)))
centers=aggregate(t,by=list(pop),mean)
t$pop=pop

set.seed(1488)
color=as.factor(sample(seq(1,length(unique(t$pop)))))
col=rbind(c(60,80),c(25,95),c(30,70),c(60,100),c(20,50),c(70,50))
hues=max(ceiling(length(color)/nrow(col)),12)
pal1=as.vector(apply(col,1,function(x)hcl(seq(15,375,length=hues+1)[1:hues],x[1],x[2])))
pal2=as.vector(apply(col,1,function(x)hcl(seq(15,375,length=hues+1)[1:hues],.8*x[1],.3*x[2])))

ggplot(t,aes(x=PC1,y=PC2,group=-1L))+
ggforce::geom_voronoi_tile(aes(fill=color[as.factor(pop)],color=color[as.factor(pop)]),size=.07,max.radius=.003)+
geom_label(data=centers,aes(x=PC1,y=PC2,label=Group.1,color=color,fill=color),alpha=.7,size=2,label.r=unit(.08,"lines"),label.padding=unit(.08,"lines"),label.size=0)+
coord_fixed()+
scale_x_continuous(breaks=seq(-1,1,.02),expand=expansion(mult=.02))+
scale_y_continuous(breaks=seq(-1,1,.02),expand=expansion(mult=.02))+
scale_fill_manual(values=pal1)+
scale_color_manual(values=pal2)+
theme(
  axis.text=element_text(color="black",size=6),
  axis.ticks=element_blank(),
  axis.ticks.length=unit(0,"pt"),
  axis.title=element_text(color="black",size=8),
  legend.position="none",
  panel.background=element_rect(fill="white"),
  panel.border=element_rect(color="gray85",fill=NA,size=.4),
  panel.grid.major=element_line(color="gray85",size=.2)
)

ggsave("a.png",width=7,height=7)
system("mogrify -trim -bordercolor white -border 16x16 a.png")

**vbnetkhio** · 04-18-2021, 09:51 PM

Originally Posted by Komintasavalta

Except maybe there's still duplicates among the Selkup or Ashkenazi samples.

maybe close relatives, those are endogamous populations.
you can extract just them in separate datasets (one for Selkups, one for Ahkenazi) do plink --genome on them to check if they are close relatives.
(for me plink --genome doesn't work on too large datasets with multiple populations.)

**~~Komintasavalta~~** · 04-18-2021, 10:48 PM

Originally Posted by vbnetkhio

maybe close relatives, those are endogamous populations.
you can extract just them in separate datasets (one for Selkups, one for Ahkenazi) do plink --genome on them to check if they are close relatives.
(for me plink --genome doesn't work on too large datasets with multiple populations.)

When I ran `plink --genome` without LD pruning for all samples in my run, one pair of Selkups had the 76th highest IBS out of a total of 468028 pairs of samples:

$ plink --bfile latlong7 --genome
[output omitted]
$ awk 'NR==FNR{a[$1]=$3;next}FNR>1{print a[$2],$2,a[$4],$4,$12}' v44.3_HO_public.ind plink.genome|sort -rnk5|awk '$1==x&&$3==x{print NR,$0}' x=Selkup|head
76 Selkup Tuebingen51 Selkup Tuebingen62 0.843568
749 Selkup Tuebingen51 Selkup Tuebingen52 0.834890
1026 Selkup Tuebingen51 Selkup Tuebingen59 0.833330
1204 Selkup Selkup105 Selkup Tuebingen53 0.832269
1437 Selkup Tuebingen58 Selkup Tuebingen62 0.831279
1456 Selkup Tuebingen52 Selkup Tuebingen59 0.831183
1827 Selkup Tuebingen54 Selkup Tuebingen62 0.829653
1833 Selkup Tuebingen52 Selkup Tuebingen62 0.829639
1925 Selkup Tuebingen51 Selkup Tuebingen54 0.829319
1926 Selkup Tuebingen60 Selkup Tuebingen79 0.829308
$ sed 1d plink.genome|wc -l
468028

When I tried running SmartPCA with `numoutlieriter: 1` instead of `numoutlieriter: 0`, it removed only 8 individuals, but 2 of them were Selkups. It also removed Tofalars who were among the highest-ranking IBS pairs.

REMOVED outlier 2676:Ul56 iter 1 evec 7 sigmage 6.086 pop: Control
REMOVED outlier 3598:Tuebingen64 iter 1 evec 4 sigmage 8.457 pop: Control
REMOVED outlier 3599:Tuebingen72 iter 1 evec 4 sigmage 7.305 pop: Control
REMOVED outlier 3633:Vgut1 iter 1 evec 9 sigmage -6.654 pop: Control
REMOVED outlier 3638:Vgut8 iter 1 evec 9 sigmage -6.557 pop: Control
REMOVED outlier 3639:Vgut11 iter 1 evec 9 sigmage -6.316 pop: Control
REMOVED outlier 3640:Vgut12 iter 1 evec 9 sigmage -7.208 pop: Control
REMOVED outlier 3641:Vgut13 iter 1 evec 9 sigmage -6.398 pop: Control
number of samples after outlier removal: 960

The pair of samples with the highest IBS consisted of two Tofalars that were both removed by SmartPCA:

$ awk 'NR==FNR{a[$1]=$3;next}FNR>1{print a[$2],$2,a[$4],$4,$12}' g/g/ho.ind plink.genome|sort -rnk5|awk '$1==x&&$3==x{print NR,$0}' x=Tofalar|head
1 Tofalar Vgut8 Tofalar Vgut12 0.891949
7 Tofalar Vgut11 Tofalar Vgut13 0.864923
13 Tofalar Vgut1 Tofalar Vgut4 0.860371
41 Tofalar Vgut13 Tofalar Vgut18 0.847714
256 Tofalar Vgut11 Tofalar Vgut15 0.838493
281 Tofalar Vgut6 Tofalar Vgut12 0.838214
302 Tofalar Vgut7 Tofalar Vgut11 0.838022
320 Tofalar Vgut1 Tofalar Vgut11 0.837828
384 Tofalar Vgut1 Tofalar Vgut15 0.837239
433 Tofalar Vgut7 Tofalar Vgut19 0.836832

So maybe it's better to not use `numoutlieriter: 0` after all.

**vbnetkhio** · 04-18-2021, 11:03 PM

Originally Posted by Komintasavalta

When I ran `plink --genome` without LD pruning for all samples in my run, one pair of Selkups had the 76th highest IBS out of a total of 468028 pairs of samples:

ibs means identity by state, i. e. similarity by descent from the same populations and shared drifts. that doesn't necessarily mean close relatedness. but since it's so close in this case it probably means something,

in plink's output PI_HAT is the IBD (identity by descent) measure, and DST is IBS.
so you should look at the PI_HAT in this case.
https://www.cog-genomics.org/plink/1.9/ibd

in this case runs of homozigosity could also be useful?
https://www.cog-genomics.org/plink/1.9/ibd#homozyg

**~~Komintasavalta~~** · 04-23-2021, 02:19 PM

Originally Posted by vbnetkhio

in plink's output PI_HAT is the IBD (identity by descent) measure, and DST is IBS.
so you should look at the PI_HAT in this case.
https://www.cog-genomics.org/plink/1.9/ibd

Yeah, I now excluded pairs of samples with PI_HAT of .25 or higher after LD pruning, and I ran SmartPCA with `numoutlieriter: 0`.

Code:

curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar;tar -xf v44.3_HO_public.tar
f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutang|Primate_Chimp|hg19ref')
x=prefix;awk -F\\t 'NR>1&&$11>45&&$12>50{print$2,$3,$8,$15}' v44.3_HO_public.anno|igno|grep -v _o|sort -rnk4|awk '!a[$2]++{print$1,$3}'>$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --indep-pairwise 200 25 .2 --make-bed --out $x.temp
plink --allow-no-sex --bfile $x.temp --extract $x.temp.prune.in --genome --out $x
awk 'NR==FNR{a[$1]=$2;next}FNR>1&&$10>=.25{print$2,a[$2];print$4,a[$4]}' $x.pick $x.genome|sort -u>$x.ignore
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}!($1 in a)' $x.ignore $x.pick|awk 'NR==FNR{a[$1];next}$2 in a' - v44.3_HO_public.fam) --make-bed --out $x
smartpca -p <(printf %s\\n genotypename:\ $x.bed snpname:\ $x.bim indivname:\ $x.fam evecoutname:\ $x.evec evaloutname:\ $x.eval numoutevec:\ 20 autoshrink:\ YES lsqproject:\ YES numoutlieriter:\ 0)

Pairs of samples with highest PI_HAT:

Code:

$ awk 'NR==FNR{a[$1]=$2;next}{print$10,a[$2]":"$2,a[$4]":"$4}' $x.pick $x.genome|sort -rn|head
0.5501 Tofalar:Vgut8 Tofalar:Vgut12
0.5289 Tubalar:Tuba23 Tubalar:Tuba24
0.5285 Shor_Khakassia:KHS-035 Shor_Khakassia:KHS-036
0.5028 Khakass_Kachin:Khs-493 Khakass_Kachin:Khs-513
0.4881 Tubalar:ALT-116 Tubalar:Tuba2
0.3945 Chukchi1:Sir50 Chukchi1:Sir41
0.3651 Nganasan:Tuebingen19 Nganasan:Tuebingen116
0.3608 Tofalar:Vgut11 Tofalar:Vgut13
0.3572 Nganasan:Tuebingen23 Nganasan:Tuebingen124
0.3455 Negidal:AMU-160 Nivh:AMU-164

**vbnetkhio** · 04-23-2021, 09:39 PM

Originally Posted by Komintasavalta

Yeah, I now excluded pairs of samples with PI_HAT of .25 or higher after LD pruning, and I ran SmartPCA with `numoutlieriter: 0`.

Code:

curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar;tar -xf v44.3_HO_public.tar
f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutang|Primate_Chimp|hg19ref')
x=prefix;awk -F\\t 'NR>1&&$11>45&&$12>50{print$2,$3,$8,$15}' v44.3_HO_public.anno|igno|grep -v _o|sort -rnk4|awk '!a[$2]++{print$1,$3}'>$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --indep-pairwise 200 25 .2 --make-bed --out $x.temp
plink --allow-no-sex --bfile $x.temp --extract $x.temp.prune.in --genome --out $x
awk 'NR==FNR{a[$1]=$2;next}FNR>1&&$10>=.25{print$2,a[$2];print$4,a[$4]}' $x.pick $x.genome|sort -u>$x.ignore
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}!($1 in a)' $x.ignore $x.pick|awk 'NR==FNR{a[$1];next}$2 in a' - v44.3_HO_public.fam) --make-bed --out $x
smartpca -p <(printf %s\\n genotypename:\ $x.bed snpname:\ $x.bim indivname:\ $x.fam evecoutname:\ $x.evec evaloutname:\ $x.eval numoutevec:\ 20 autoshrink:\ YES lsqproject:\ YES numoutlieriter:\ 0)

Pairs of samples with highest PI_HAT:

Code:

$ awk 'NR==FNR{a[$1]=$2;next}{print$10,a[$2]":"$2,a[$4]":"$4}' $x.pick $x.genome|sort -rn|head
0.5501 Tofalar:Vgut8 Tofalar:Vgut12
0.5289 Tubalar:Tuba23 Tubalar:Tuba24
0.5285 Shor_Khakassia:KHS-035 Shor_Khakassia:KHS-036
0.5028 Khakass_Kachin:Khs-493 Khakass_Kachin:Khs-513
0.4881 Tubalar:ALT-116 Tubalar:Tuba2
0.3945 Chukchi1:Sir50 Chukchi1:Sir41
0.3651 Nganasan:Tuebingen19 Nganasan:Tuebingen116
0.3608 Tofalar:Vgut11 Tofalar:Vgut13
0.3572 Nganasan:Tuebingen23 Nganasan:Tuebingen124
0.3455 Negidal:AMU-160 Nivh:AMU-164

did removing those have an effect on the PCA?

**~~Komintasavalta~~** · 04-25-2021, 11:47 PM

Originally Posted by vbnetkhio

did removing those have an effect on the PCA?

Actually it only helped partially. I didn't check the result of the SmartPCA command I posted, but even after I removed the samples with PI_HAT of .25 or higher, there were still some outliers because I used the parameter `numoutlieriter: 0`.

Often it's not even enough if you run SmartPCA once without `numoutlieriter`, remove outliers manually, and run it again, because the next run will often have new outliers. I think that's why the default behavior is to do multiple iterations of outlier removal.

**~~Komintasavalta~~** · 09-23-2021, 02:19 PM

How do you convert MyHeritage raw data to PLINK format? It looks like this:

$ head -n20 MyHeritage_raw_dna_data.csv
##fileformat=MyHeritage
##format=MHv1.0
##chip=GSA
##timestamp=2021-09-23 08:59:54 UTC
##reference=build37
#
# MyHeritage DNA raw data.
# For each SNP, we provide the identifier, chromosome number, base pair position and genotype.
# The genotype is reported on the forward (+) strand with respect to the human reference build 37.
# THIS INFORMATION IS FOR YOUR PERSONAL USE AND IS INTENDED FOR GENEALOGICAL RESEARCH
# ONLY. IT IS NOT INTENDED FOR MEDICAL OR HEALTH PURPOSES. PLEASE BE AWARE THAT THE
# DOWNLOADED DATA WILL NO LONGER BE PROTECTED BY OUR SECURITY MEASURES.
RSID,CHROMOSOME,POSITION,RESULT
"rs547237130","1","72526","AA"
"rs562180473","1","565703","AA"
"rs575203260","1","567693","TT"
"rs3131972","1","752721","GG"
"rs200599638","1","752918","GG"
"rs12184325","1","754105","CC"
"rs114525117","1","759036","GG"