Page 12 of 13 FirstFirst ... 28910111213 LastLast
Results 111 to 120 of 122

Thread: Plink related questions

  1. #111
    Banned
    Join Date
    Sep 2020
    Last Online
    09-12-2023 @ 03:47 PM
    Location
    コミ共和国
    Meta-Ethnicity
    Finno-Permic
    Ethnicity
    Peasant
    Ancestry
    コミ
    Country
    Finland
    Taxonomy
    Karaboğa (euryprosopic, platyrrhine, dolichocephalic)
    Relationship Status
    Virgin
    Gender
    Posts
    2,170
    Thumbs Up
    Received: 4,862
    Given: 2,946

    0 Not allowed!

    Default

    New question: how do I remove duplicate samples from the 1240K+HO dataset?

  2. #112
    Veteran Member
    Join Date
    Jul 2019
    Last Online
    03-11-2024 @ 04:25 PM
    Ethnicity
    Unknown
    Country
    Antarctica
    Gender
    Posts
    3,911
    Thumbs Up
    Received: 3,471
    Given: 1,541

    1 Not allowed!

    Default

    Quote Originally Posted by Komintasavalta View Post
    New question: how do I remove duplicate samples from the 1240K+HO dataset?
    check which have the same Master ID in the .anno file, keep the one with more SNPs.(autosomal targets)
    it's also good to remove any with ignore,sister,brother,mother,father,daughter,son,s ibling,rel(relative),lc(low coverage) in their instance id.

  3. #113
    Banned
    Join Date
    Sep 2020
    Last Online
    09-12-2023 @ 03:47 PM
    Location
    コミ共和国
    Meta-Ethnicity
    Finno-Permic
    Ethnicity
    Peasant
    Ancestry
    コミ
    Country
    Finland
    Taxonomy
    Karaboğa (euryprosopic, platyrrhine, dolichocephalic)
    Relationship Status
    Virgin
    Gender
    Posts
    2,170
    Thumbs Up
    Received: 4,862
    Given: 2,946

    2 Not allowed!

    Default

    Quote Originally Posted by vbnetkhio View Post
    check which have the same Master ID in the .anno file, keep the one with more SNPs.(autosomal targets)
    Ok, so basically this keeps only the samples with the highest SNP count:

    Code:
    grep -v _dup v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t '!a[$3]++'|cut -f2
    The third column is master ID, the second column is version ID, and the 15th column is SNP count.

    I was able to run SmartPCA with duplicates removed like this:

    Code:
    smap()(smartpca -p <(printf %s\\n genotypename:\ $1.bed snpname:\ $1.bim indivname:\ $1.fam evecoutname:\ $1.evec evaloutname:\ $1.eval "${@:2}"))
    smap2()(smap "$@" numoutlieriter:\ 0 autoshrink:\ YES lsqproject:\ YES numoutevec:\ 20)
    tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}")
    smapa()(sed 1d $1.evec|awk NF--|tr : \ |awk 'NR==FNR{a[$2]=$1;next}{$1=a[$2]}1' $1.pick -|sed s/\ /:/|tr ' ' ,>$1.csv;sed 's/:[^,]*//' $1.csv|tav ,>$1.ave;awk 'NR==FNR{a[NR]=$0;next}{printf"%s",$1;for(i=2;i<=NF;i++)printf",%.6f",$i*a[i-1]^.5;print""}' $1.{eval,ave}>$1.avescaled)
    smap3()(smap2 $1;smapa $1)
    igno()(grep -Ev '\.REF|Ignore_|_outlier|_dup|_contam|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|\.rel\.|Neanderthal|Denisova|Vindija')
    keep2()(plink --allow-no-sex --bfile $1 --keep <(awk 'NR==FNR{a[$2];next}$2 in a' $2.pick $1.fam) --make-bed --out $2)
    x=latlong7;grep -v _dup v44.3_HO_public.anno|sort -t$'\t' -rnk15|awk -F\\t 'NR>1&&!a[$3]++&&$11>=45&&$12>=-22&&$15>=5e5&&$6==0{print$8,$2}'|igno|grep -Ev '_o|_lc'>$x.pick;keep2 v44.3_HO_public $x;smap3 $x
    At first because there were duplicate samples, some of the clines on PCs 4 and further were dominated by populations with duplicate samples. Now I got more reasonable plots for the higher dimensions. Except maybe there's still duplicates among the Selkup or Ashkenazi samples.






    Code for Voronoi tesselation:

    Code:
    library(tidyverse)
    library(ggforce)
    
    t=read.csv("latlong7.csv",row.names=1,header=F)
    names(t)=paste0("PC",1:ncol(t))
    
    pop=sub("\\.(DG|SDG|SG|WGA)","",sub(":.*","",rownames(t)))
    centers=aggregate(t,by=list(pop),mean)
    t$pop=pop
    
    set.seed(1488)
    color=as.factor(sample(seq(1,length(unique(t$pop)))))
    col=rbind(c(60,80),c(25,95),c(30,70),c(60,100),c(20,50),c(70,50))
    hues=max(ceiling(length(color)/nrow(col)),12)
    pal1=as.vector(apply(col,1,function(x)hcl(seq(15,375,length=hues+1)[1:hues],x[1],x[2])))
    pal2=as.vector(apply(col,1,function(x)hcl(seq(15,375,length=hues+1)[1:hues],.8*x[1],.3*x[2])))
    
    ggplot(t,aes(x=PC1,y=PC2,group=-1L))+
    ggforce::geom_voronoi_tile(aes(fill=color[as.factor(pop)],color=color[as.factor(pop)]),size=.07,max.radius=.003)+
    geom_label(data=centers,aes(x=PC1,y=PC2,label=Group.1,color=color,fill=color),alpha=.7,size=2,label.r=unit(.08,"lines"),label.padding=unit(.08,"lines"),label.size=0)+
    coord_fixed()+
    scale_x_continuous(breaks=seq(-1,1,.02),expand=expansion(mult=.02))+
    scale_y_continuous(breaks=seq(-1,1,.02),expand=expansion(mult=.02))+
    scale_fill_manual(values=pal1)+
    scale_color_manual(values=pal2)+
    theme(
      axis.text=element_text(color="black",size=6),
      axis.ticks=element_blank(),
      axis.ticks.length=unit(0,"pt"),
      axis.title=element_text(color="black",size=8),
      legend.position="none",
      panel.background=element_rect(fill="white"),
      panel.border=element_rect(color="gray85",fill=NA,size=.4),
      panel.grid.major=element_line(color="gray85",size=.2)
    )
    
    ggsave("a.png",width=7,height=7)
    system("mogrify -trim -bordercolor white -border 16x16 a.png")

  4. #114
    Veteran Member
    Join Date
    Jul 2019
    Last Online
    03-11-2024 @ 04:25 PM
    Ethnicity
    Unknown
    Country
    Antarctica
    Gender
    Posts
    3,911
    Thumbs Up
    Received: 3,471
    Given: 1,541

    1 Not allowed!

    Default

    Quote Originally Posted by Komintasavalta View Post
    Except maybe there's still duplicates among the Selkup or Ashkenazi samples.
    maybe close relatives, those are endogamous populations.
    you can extract just them in separate datasets (one for Selkups, one for Ahkenazi) do plink --genome on them to check if they are close relatives.
    (for me plink --genome doesn't work on too large datasets with multiple populations.)
    Last edited by vbnetkhio; 04-18-2021 at 10:04 PM.

  5. #115
    Banned
    Join Date
    Sep 2020
    Last Online
    09-12-2023 @ 03:47 PM
    Location
    コミ共和国
    Meta-Ethnicity
    Finno-Permic
    Ethnicity
    Peasant
    Ancestry
    コミ
    Country
    Finland
    Taxonomy
    Karaboğa (euryprosopic, platyrrhine, dolichocephalic)
    Relationship Status
    Virgin
    Gender
    Posts
    2,170
    Thumbs Up
    Received: 4,862
    Given: 2,946

    1 Not allowed!

    Default

    Quote Originally Posted by vbnetkhio View Post
    maybe close relatives, those are endogamous populations.
    you can extract just them in separate datasets (one for Selkups, one for Ahkenazi) do plink --genome on them to check if they are close relatives.
    (for me plink --genome doesn't work on too large datasets with multiple populations.)
    When I ran `plink --genome` without LD pruning for all samples in my run, one pair of Selkups had the 76th highest IBS out of a total of 468028 pairs of samples:

    $ plink --bfile latlong7 --genome
    [output omitted]
    $ awk 'NR==FNR{a[$1]=$3;next}FNR>1{print a[$2],$2,a[$4],$4,$12}' v44.3_HO_public.ind plink.genome|sort -rnk5|awk '$1==x&&$3==x{print NR,$0}' x=Selkup|head
    76 Selkup Tuebingen51 Selkup Tuebingen62 0.843568
    749 Selkup Tuebingen51 Selkup Tuebingen52 0.834890
    1026 Selkup Tuebingen51 Selkup Tuebingen59 0.833330
    1204 Selkup Selkup105 Selkup Tuebingen53 0.832269
    1437 Selkup Tuebingen58 Selkup Tuebingen62 0.831279
    1456 Selkup Tuebingen52 Selkup Tuebingen59 0.831183
    1827 Selkup Tuebingen54 Selkup Tuebingen62 0.829653
    1833 Selkup Tuebingen52 Selkup Tuebingen62 0.829639
    1925 Selkup Tuebingen51 Selkup Tuebingen54 0.829319
    1926 Selkup Tuebingen60 Selkup Tuebingen79 0.829308
    $ sed 1d plink.genome|wc -l
    468028

    When I tried running SmartPCA with `numoutlieriter: 1` instead of `numoutlieriter: 0`, it removed only 8 individuals, but 2 of them were Selkups. It also removed Tofalars who were among the highest-ranking IBS pairs.

    REMOVED outlier 2676:Ul56 iter 1 evec 7 sigmage 6.086 pop: Control
    REMOVED outlier 3598:Tuebingen64 iter 1 evec 4 sigmage 8.457 pop: Control
    REMOVED outlier 3599:Tuebingen72 iter 1 evec 4 sigmage 7.305 pop: Control
    REMOVED outlier 3633:Vgut1 iter 1 evec 9 sigmage -6.654 pop: Control
    REMOVED outlier 3638:Vgut8 iter 1 evec 9 sigmage -6.557 pop: Control
    REMOVED outlier 3639:Vgut11 iter 1 evec 9 sigmage -6.316 pop: Control
    REMOVED outlier 3640:Vgut12 iter 1 evec 9 sigmage -7.208 pop: Control
    REMOVED outlier 3641:Vgut13 iter 1 evec 9 sigmage -6.398 pop: Control
    number of samples after outlier removal: 960

    The pair of samples with the highest IBS consisted of two Tofalars that were both removed by SmartPCA:

    $ awk 'NR==FNR{a[$1]=$3;next}FNR>1{print a[$2],$2,a[$4],$4,$12}' g/g/ho.ind plink.genome|sort -rnk5|awk '$1==x&&$3==x{print NR,$0}' x=Tofalar|head
    1 Tofalar Vgut8 Tofalar Vgut12 0.891949
    7 Tofalar Vgut11 Tofalar Vgut13 0.864923
    13 Tofalar Vgut1 Tofalar Vgut4 0.860371
    41 Tofalar Vgut13 Tofalar Vgut18 0.847714
    256 Tofalar Vgut11 Tofalar Vgut15 0.838493
    281 Tofalar Vgut6 Tofalar Vgut12 0.838214
    302 Tofalar Vgut7 Tofalar Vgut11 0.838022
    320 Tofalar Vgut1 Tofalar Vgut11 0.837828
    384 Tofalar Vgut1 Tofalar Vgut15 0.837239
    433 Tofalar Vgut7 Tofalar Vgut19 0.836832

    So maybe it's better to not use `numoutlieriter: 0` after all.

  6. #116
    Veteran Member
    Join Date
    Jul 2019
    Last Online
    03-11-2024 @ 04:25 PM
    Ethnicity
    Unknown
    Country
    Antarctica
    Gender
    Posts
    3,911
    Thumbs Up
    Received: 3,471
    Given: 1,541

    1 Not allowed!

    Default

    Quote Originally Posted by Komintasavalta View Post
    When I ran `plink --genome` without LD pruning for all samples in my run, one pair of Selkups had the 76th highest IBS out of a total of 468028 pairs of samples:
    ibs means identity by state, i. e. similarity by descent from the same populations and shared drifts. that doesn't necessarily mean close relatedness. but since it's so close in this case it probably means something,

    in plink's output PI_HAT is the IBD (identity by descent) measure, and DST is IBS.
    so you should look at the PI_HAT in this case.
    https://www.cog-genomics.org/plink/1.9/ibd

    in this case runs of homozigosity could also be useful?
    https://www.cog-genomics.org/plink/1.9/ibd#homozyg

  7. #117
    Banned
    Join Date
    Sep 2020
    Last Online
    09-12-2023 @ 03:47 PM
    Location
    コミ共和国
    Meta-Ethnicity
    Finno-Permic
    Ethnicity
    Peasant
    Ancestry
    コミ
    Country
    Finland
    Taxonomy
    Karaboğa (euryprosopic, platyrrhine, dolichocephalic)
    Relationship Status
    Virgin
    Gender
    Posts
    2,170
    Thumbs Up
    Received: 4,862
    Given: 2,946

    2 Not allowed!

    Default

    Quote Originally Posted by vbnetkhio View Post
    in plink's output PI_HAT is the IBD (identity by descent) measure, and DST is IBS.
    so you should look at the PI_HAT in this case.
    https://www.cog-genomics.org/plink/1.9/ibd
    Yeah, I now excluded pairs of samples with PI_HAT of .25 or higher after LD pruning, and I ran SmartPCA with `numoutlieriter: 0`.

    Code:
    curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar;tar -xf v44.3_HO_public.tar
    f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
    igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutang|Primate_Chimp|hg19ref')
    x=prefix;awk -F\\t 'NR>1&&$11>45&&$12>50{print$2,$3,$8,$15}' v44.3_HO_public.anno|igno|grep -v _o|sort -rnk4|awk '!a[$2]++{print$1,$3}'>$x.pick
    plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --indep-pairwise 200 25 .2 --make-bed --out $x.temp
    plink --allow-no-sex --bfile $x.temp --extract $x.temp.prune.in --genome --out $x
    awk 'NR==FNR{a[$1]=$2;next}FNR>1&&$10>=.25{print$2,a[$2];print$4,a[$4]}' $x.pick $x.genome|sort -u>$x.ignore
    plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}!($1 in a)' $x.ignore $x.pick|awk 'NR==FNR{a[$1];next}$2 in a' - v44.3_HO_public.fam) --make-bed --out $x
    smartpca -p <(printf %s\\n genotypename:\ $x.bed snpname:\ $x.bim indivname:\ $x.fam evecoutname:\ $x.evec evaloutname:\ $x.eval numoutevec:\ 20 autoshrink:\ YES lsqproject:\ YES numoutlieriter:\ 0)
    Pairs of samples with highest PI_HAT:

    Code:
    $ awk 'NR==FNR{a[$1]=$2;next}{print$10,a[$2]":"$2,a[$4]":"$4}' $x.pick $x.genome|sort -rn|head
    0.5501 Tofalar:Vgut8 Tofalar:Vgut12
    0.5289 Tubalar:Tuba23 Tubalar:Tuba24
    0.5285 Shor_Khakassia:KHS-035 Shor_Khakassia:KHS-036
    0.5028 Khakass_Kachin:Khs-493 Khakass_Kachin:Khs-513
    0.4881 Tubalar:ALT-116 Tubalar:Tuba2
    0.3945 Chukchi1:Sir50 Chukchi1:Sir41
    0.3651 Nganasan:Tuebingen19 Nganasan:Tuebingen116
    0.3608 Tofalar:Vgut11 Tofalar:Vgut13
    0.3572 Nganasan:Tuebingen23 Nganasan:Tuebingen124
    0.3455 Negidal:AMU-160 Nivh:AMU-164

  8. #118
    Veteran Member
    Join Date
    Jul 2019
    Last Online
    03-11-2024 @ 04:25 PM
    Ethnicity
    Unknown
    Country
    Antarctica
    Gender
    Posts
    3,911
    Thumbs Up
    Received: 3,471
    Given: 1,541

    0 Not allowed!

    Default

    Quote Originally Posted by Komintasavalta View Post
    Yeah, I now excluded pairs of samples with PI_HAT of .25 or higher after LD pruning, and I ran SmartPCA with `numoutlieriter: 0`.

    Code:
    curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.tar;tar -xf v44.3_HO_public.tar
    f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
    igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutang|Primate_Chimp|hg19ref')
    x=prefix;awk -F\\t 'NR>1&&$11>45&&$12>50{print$2,$3,$8,$15}' v44.3_HO_public.anno|igno|grep -v _o|sort -rnk4|awk '!a[$2]++{print$1,$3}'>$x.pick
    plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --indep-pairwise 200 25 .2 --make-bed --out $x.temp
    plink --allow-no-sex --bfile $x.temp --extract $x.temp.prune.in --genome --out $x
    awk 'NR==FNR{a[$1]=$2;next}FNR>1&&$10>=.25{print$2,a[$2];print$4,a[$4]}' $x.pick $x.genome|sort -u>$x.ignore
    plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}!($1 in a)' $x.ignore $x.pick|awk 'NR==FNR{a[$1];next}$2 in a' - v44.3_HO_public.fam) --make-bed --out $x
    smartpca -p <(printf %s\\n genotypename:\ $x.bed snpname:\ $x.bim indivname:\ $x.fam evecoutname:\ $x.evec evaloutname:\ $x.eval numoutevec:\ 20 autoshrink:\ YES lsqproject:\ YES numoutlieriter:\ 0)
    Pairs of samples with highest PI_HAT:

    Code:
    $ awk 'NR==FNR{a[$1]=$2;next}{print$10,a[$2]":"$2,a[$4]":"$4}' $x.pick $x.genome|sort -rn|head
    0.5501 Tofalar:Vgut8 Tofalar:Vgut12
    0.5289 Tubalar:Tuba23 Tubalar:Tuba24
    0.5285 Shor_Khakassia:KHS-035 Shor_Khakassia:KHS-036
    0.5028 Khakass_Kachin:Khs-493 Khakass_Kachin:Khs-513
    0.4881 Tubalar:ALT-116 Tubalar:Tuba2
    0.3945 Chukchi1:Sir50 Chukchi1:Sir41
    0.3651 Nganasan:Tuebingen19 Nganasan:Tuebingen116
    0.3608 Tofalar:Vgut11 Tofalar:Vgut13
    0.3572 Nganasan:Tuebingen23 Nganasan:Tuebingen124
    0.3455 Negidal:AMU-160 Nivh:AMU-164
    did removing those have an effect on the PCA?

  9. #119
    Banned
    Join Date
    Sep 2020
    Last Online
    09-12-2023 @ 03:47 PM
    Location
    コミ共和国
    Meta-Ethnicity
    Finno-Permic
    Ethnicity
    Peasant
    Ancestry
    コミ
    Country
    Finland
    Taxonomy
    Karaboğa (euryprosopic, platyrrhine, dolichocephalic)
    Relationship Status
    Virgin
    Gender
    Posts
    2,170
    Thumbs Up
    Received: 4,862
    Given: 2,946

    0 Not allowed!

    Default

    Quote Originally Posted by vbnetkhio View Post
    did removing those have an effect on the PCA?
    Actually it only helped partially. I didn't check the result of the SmartPCA command I posted, but even after I removed the samples with PI_HAT of .25 or higher, there were still some outliers because I used the parameter `numoutlieriter: 0`.

    Often it's not even enough if you run SmartPCA once without `numoutlieriter`, remove outliers manually, and run it again, because the next run will often have new outliers. I think that's why the default behavior is to do multiple iterations of outlier removal.

  10. #120
    Banned
    Join Date
    Sep 2020
    Last Online
    09-12-2023 @ 03:47 PM
    Location
    コミ共和国
    Meta-Ethnicity
    Finno-Permic
    Ethnicity
    Peasant
    Ancestry
    コミ
    Country
    Finland
    Taxonomy
    Karaboğa (euryprosopic, platyrrhine, dolichocephalic)
    Relationship Status
    Virgin
    Gender
    Posts
    2,170
    Thumbs Up
    Received: 4,862
    Given: 2,946

    0 Not allowed!

    Default

    How do you convert MyHeritage raw data to PLINK format? It looks like this:

    $ head -n20 MyHeritage_raw_dna_data.csv
    ##fileformat=MyHeritage
    ##format=MHv1.0
    ##chip=GSA
    ##timestamp=2021-09-23 08:59:54 UTC
    ##reference=build37
    #
    # MyHeritage DNA raw data.
    # For each SNP, we provide the identifier, chromosome number, base pair position and genotype.
    # The genotype is reported on the forward (+) strand with respect to the human reference build 37.
    # THIS INFORMATION IS FOR YOUR PERSONAL USE AND IS INTENDED FOR GENEALOGICAL RESEARCH
    # ONLY. IT IS NOT INTENDED FOR MEDICAL OR HEALTH PURPOSES. PLEASE BE AWARE THAT THE
    # DOWNLOADED DATA WILL NO LONGER BE PROTECTED BY OUR SECURITY MEASURES.
    RSID,CHROMOSOME,POSITION,RESULT
    "rs547237130","1","72526","AA"
    "rs562180473","1","565703","AA"
    "rs575203260","1","567693","TT"
    "rs3131972","1","752721","GG"
    "rs200599638","1","752918","GG"
    "rs12184325","1","754105","CC"
    "rs114525117","1","759036","GG"

Page 12 of 13 FirstFirst ... 28910111213 LastLast

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Similar Threads

  1. 3 Questions
    By OldSchool47 in forum Deutschland - English Entries
    Replies: 0
    Last Post: 07-26-2019, 06:56 PM
  2. 50 Questions for Men
    By Oneeye in forum Gender Issues
    Replies: 9
    Last Post: 03-20-2017, 12:50 AM
  3. Some Questions
    By FilthyLibertine in forum Anthropology
    Replies: 12
    Last Post: 11-02-2012, 05:28 PM
  4. 5 questions
    By HawkR in forum Games
    Replies: 445
    Last Post: 12-15-2011, 12:32 PM

Tags for this Thread

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •