The crucial thing is to avoid using low-coverage samples, because the SNP overlap is always defined by the weakest link in the chain, and to always use allsnps=YES (maxmiss=1 in ADMIXTOOLS 2, I believe). There is not much secret to choosing the pright: choose populations that don't violate the qpAdm assumption of no gene flow from pleft into pright (in practice, select prehistoric samples; the older the better), and make sure the populations in pright are all asymmetrically related to the populations in pleft.
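As a sketch of what that looks like in ADMIXTOOLS 2 (the file prefix and all population labels below are placeholders, and allsnps=TRUE is my reading of the option being referred to):

```r
library(admixtools)

# Placeholder prefix and labels; substitute your own dataset and populations.
prefix = "v44.3_HO_public/v44.3_HO_public"
left   = c("Yamnaya_Samara", "Anatolia_N")            # sources under test
right  = c("Mbuti.DG", "Russia_Ust_Ishim.DG",         # old references, chosen to be
           "Russia_Kostenki14", "Czech_Vestonice16")  # asymmetrically related to pleft
target = "Finnish"

# allsnps = TRUE uses all SNPs available for each statistic instead of only
# the intersection across every population (the "weakest link" overlap).
qp = qpadm(prefix, left = left, right = right, target = target, allsnps = TRUE)
qp$weights   # admixture weights with standard errors
qp$rankdrop  # model-fit p-values
```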
You get "A" for effort. Nice heatmap.
They had some errors in the code. There's a new version of ADMIXTOOLS 2 dated yesterday. Remove it and re-install it. Also, output like this:

Code:
593124 SNPs remain after filtering. 414780 are polymorphic

indicates you have too many uninformative non-polymorphic SNPs. To remove them, set maxmaf=0.45.
For FST it's crucial to keep maxmiss at its default of 0, because you want all your samples to overlap each other exactly for unbiased results.
The good thing about having SEs and p-values is that you get an idea of which models are feasible, unlike with ADMIXTURE or G25, where you get a bunch of feasible models and have no idea which ones are no good (distance in G25 is NOT a substitute for p-values and SEs by any means). Of course there are other issues besides model viability in G25.
They had some errors in the code. There's a new version of ADMIXTOOLS 2 dated yesterday. Remove it, re-install it, and re-run extract at the default (maxmiss=0). When you run extract, add the option maxmaf=0.45 to get rid of minor alleles with a frequency > 45% across all pops (very common old alleles shared by most global pops). Now I'm getting 940,000 polymorphic SNPs on the Simons samples at maxmiss=0 with the latest download of ADMIXTOOLS 2.
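In code, the re-extraction described above might look like this sketch (the genotype prefix and population labels are placeholders):

```r
library(admixtools)

pref = "g/v44.3_HO_public/v44.3_HO_public"               # placeholder dataset prefix
pops = c("Finnish", "Nganasan", "Russia_AfontovaGora3")  # example labels

unlink("f2", recursive = TRUE)  # clear f2 stats computed by the old version
extract_f2(pref, pops = pops, outdir = "f2",
           maxmiss = 0,    # the default: keep only SNPs present in all pops
           maxmaf = 0.45)  # drop alleles with frequency > 45% across all pops
f2 = f2_from_precomp("f2")
```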
Also, Dilawer informed me that Plink doesn't maintain allele order. This can screw up calculations in ADMIXTOOLS. Here are a couple of rows from the .snp file:
Code:
rs199706086 1 0 10250 A C
rs112750067 1 0 10327 T C
rs201725126 1 0 13116 G T
rs200579949 1 0 13118 G A
rs180734498 1 0 13302 T C
rs79585140 1 0 14907 G A
rs75454623 1 0 14930 A G
Column 5 is supposed to hold the reference or ancestral allele and column 6 the derived or minor allele. If you check the dbSNP website, you'll notice that the ones I bolded are in the wrong order. In other words, one should be T G and the other A G. Plink screws up allele order.
I fixed the allele order in my .snp file using Dilawer's script, so now it agrees with dbSNP. There were thousands of such mistakes in my .snp file.
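Dilawer's script isn't posted here, but a minimal sketch of the same check — comparing .snp rows against an allele table exported from dbSNP beforehand (both file names are hypothetical) — might look like:

```r
# Hypothetical inputs: "example.snp" is an EIGENSTRAT .snp file, and
# "dbsnp_alleles.tsv" is assumed to hold rsid/ref/alt columns from dbSNP.
snp = read.table("example.snp",
                 col.names = c("rsid", "chr", "cm", "pos", "ref", "alt"))
dbsnp = read.table("dbsnp_alleles.tsv", header = TRUE)  # columns: rsid, ref, alt

m = merge(snp, dbsnp, by = "rsid", suffixes = c("", ".db"))
# Rows where both alleles are present but in swapped order relative to dbSNP:
flipped = m$ref == m$alt.db & m$alt == m$ref.db
sum(flipped)  # how many rows disagree with dbSNP

# Swap the two allele columns back into dbSNP order for the flipped rows:
m[flipped, c("ref", "alt")] = m[flipped, c("alt", "ref")]
```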
This prints all columns and all rows of tables, prints only 3 significant digits, and doesn't display negative numbers in red (https://tibble.tidyverse.org/reference/formatting.html): `options(tibble.width=Inf,tibble.print_max=Inf, pillar.sigfig=3,pillar.neg=F)`.
`options(width=Sys.getenv("COLUMNS"))` or `options(width=180)` increases the width of the terminal.
`print(tbl,width=Inf,n=Inf)` or `as.data.frame(tbl)` displays a whole tibble. This displays an HTML table in a browser: `install.packages("formattable");library(formattable);formattable(tbl)`.
This removes columns from a tibble and formats the table as CSV, where doubles have 3 digits after the decimal point:

Code:
> qp$popdrop%>%select(!c(pat,wt,dof,dofdiff,chisqdiff,p_nested))%>%mutate(across(where(is.double),round,3))%>%format_csv%>%cat
chisq,p,f4rank,Nganasan,Norway_N_HG.SG,Russia_AfontovaGora3,Turkey_Epipaleolithic,feasible,best
3.251,0.354,3,0.128,0.132,0.116,0.625,TRUE,NA
15.538,0.004,2,0.151,1.265,-0.416,NA,FALSE,TRUE
5.152,0.272,2,0.144,0.302,NA,0.554,TRUE,TRUE
3.827,0.43,2,0.123,NA,0.176,0.701,TRUE,TRUE
11.855,0.018,2,NA,-0.814,0.64,1.173,FALSE,TRUE
28.572,0,1,0.072,0.928,NA,NA,TRUE,NA
116.62,0,1,-0.108,NA,1.108,NA,FALSE,NA
11.9,0.036,1,0.179,NA,NA,0.821,TRUE,NA
22.973,0,1,NA,1.388,-0.388,NA,FALSE,NA
28.13,0,1,NA,0.622,NA,0.378,TRUE,NA
16.715,0.005,1,NA,NA,0.272,0.728,TRUE,NA
1386.434,0,0,1,NA,NA,NA,TRUE,NA
32.704,0,0,NA,1,NA,NA,TRUE,NA
125.18,0,0,NA,NA,1,NA,TRUE,NA
36.405,0,0,NA,NA,NA,1,TRUE,NA

This omits models with only one population (where f4rank is 0) and models that are not feasible (with one or more negative weights), and then sorts the remaining models by their p-value:

Code:
> qp$popdrop%>%dplyr::filter(feasible==T&f4rank!=0)%>%arrange(desc(p))%>%dplyr::select(!c(pat,wt,dof,chisq,f4rank,feasible,best,dofdiff,chisqdiff,p_nested))%>%mutate(across(where(is.double),round,3))%>%as.data.frame
      p Nganasan Norway_N_HG.SG Russia_AfontovaGora3 Turkey_Epipaleolithic
1 0.430    0.123             NA                0.176                 0.701
2 0.354    0.128          0.132                0.116                 0.625
3 0.272    0.144          0.302                   NA                 0.554
4 0.036    0.179             NA                   NA                 0.821
5 0.005       NA             NA                0.272                 0.728
6 0.000       NA          0.622                   NA                 0.378
7 0.000    0.072          0.928                   NA                    NA

This saves the popdrop table to a CSV file (I know my pright sucks or whatever):

Code:
target="Finnish"
left=c("Turkey_Boncuklu_N.SG","Latvia_HG","Norway_N_HG.SG","Russia_HG_Karelia","Russia_HG_Tyumen","Russia_AfontovaGora3","Nganasan")
right=c("Mbuti.DG","Mixe.DG","Ami.DG","Czech_Vestonice16","Papuan.DG","Ethiopia_4500BP_published.SG","Russia_Kostenki14","Ju_hoan_North.SDG","Morocco_Iberomaurusian")
pops=c(left,right,target)
unlink("f2",recursive=T)
extract_f2(pref="g/v44.3_HO_public/v44.3_HO_public",pops=pops,outdir="f2")
f2=f2_from_precomp("f2")
qp=qpadm(f2,left=left,right=right,target=target)
qp2=qp$popdrop%>%dplyr::filter(feasible==T&f4rank!=0)%>%arrange(desc(p))%>%dplyr::select(!c(wt,dof,chisq,f4rank,feasible,best,dofdiff,chisqdiff,p_nested))
write_csv(qp2,"/tmp/a")

This generates a stacked bar chart of the models sorted by their p-value:

Code:
library(tidyverse)
library(cowplot)
library(reshape2)

t=read_csv("/tmp/a")
abbr=c("Turk","Latv","Norw","Kare","Tyum","AG3","Ngan")
l=lapply(t$pat,function(x)abbr[unlist(gregexpr("0",x))]%>%paste(collapse=" "))
t$lab=paste0(l," (",sub("^0","",sprintf("%.3f",t$p)),")")
t=t[-c(1,2)]
t2=melt(t,id.var="lab")

p=ggplot(t2,aes(x=fct_rev(factor(lab,level=t$lab)),y=value,fill=variable))+
  geom_bar(stat="identity",width=1,position=position_fill(reverse=T))+
  geom_text(aes(label=round(100*value)),position=position_stack(vjust=.5,reverse=T),size=4)+
  coord_flip()+
  theme(
    axis.text.x=element_blank(),
    axis.text=element_text(color="black"),
    axis.ticks=element_blank(),
    axis.title.x=element_blank(),
    legend.box.just="center",
    legend.box.margin=margin(0,unit="cm"),
    legend.box.spacing=unit(.05,"in"),
    legend.direction="horizontal",
    legend.justification="center",
    legend.margin=margin(0,unit="cm"),
    legend.text=element_text(size=12),
    legend.title=element_blank(),
    panel.border=element_blank(),
    text=element_text(size=18)
  )+
  guides(fill=guide_legend(ncol=3,label.position="right",byrow=T))+
  scale_x_discrete(expand=c(0,0))+
  scale_y_discrete(expand=c(0,0))+
  xlab("")+
  scale_fill_manual("legend",values=c("#be661f","#66f6ff","#3397f5","#22419c","#39de39","#157f0a","#ef50ed"))

ggdraw(p)
leg=get_legend(p)
p=p+theme(legend.position="none")
ggdraw(plot_grid(p,leg,ncol=1,rel_heights=c(1,.2)))
ggsave("output.png",width=7,height=7)
Why do Finns get such a high percentage of Turkey_Boncuklu? Is it because of bad right populations or something?
Last edited by Komintasavalta; 03-08-2021 at 05:11 PM.
Hi everyone, could someone run North_Italians, Tuscans and Sicilians through qpAdm with this model?
WHG
Steppe_EMBA (Samara, I think, is OK)
Barcin_N
Iran_N
Levant_PPNB
I want to see if Levant_PPNB is necessary to model Italians, in particular southern Italians.
Thank you.
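For reference, a qpAdm sketch of that model in ADMIXTOOLS 2 — all the left/right labels below are my guesses at plausible dataset names and would need checking against the .ind file of whatever dataset is actually used:

```r
library(admixtools)

# All labels and the prefix are assumptions; verify against your dataset.
left  = c("WHG", "Russia_Samara_EBA_Yamnaya", "Turkey_N",
          "Iran_GanjDareh_N", "Israel_PPNB")
right = c("Mbuti.DG", "Russia_Ust_Ishim.DG", "Russia_Kostenki14",
          "Czech_Vestonice16", "Russia_MA1_HG.SG")

for (target in c("Italian_North", "Italian_Tuscan", "Sicilian")) {
  qp = qpadm("v44.3_HO_public/v44.3_HO_public",
             left = left, right = right, target = target)
  # Is the Levant_PPNB weight distinguishable from 0 given its SE?
  print(qp$weights)
}
```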
Which of these should be used as outgroups when modelling modern pops with Mesolithic/Neolithic samples?
Code:
Austria_Krems1_1 I2483
Austria_Krems1_2_twin.I2483 I2484
Austria_Krems1_2_twin.I2483_all I2484_all
Austria_KremsWA3 I1577
Belgium_UP_GoyetQ116_1_published GoyetQ116-1_udg_published
Belgium_UP_GoyetQ116_1_published_all GoyetQ116-1_published
Belgium_UP_GoyetQ376-19_published GoyetQ376-19_published_d
Belgium_UP_GoyetQ53_1_published_lc GoyetQ53-1_published_d
Belgium_UP_GoyetQ56_16_published_lc GoyetQ56-16_published_d
Belgium_UP_Magdalenian GoyetQ-2
Belgium_UP_Magdalenian_udg GoyetQ-2_udg
China_Tianyuan Tianyuan
Czech_Pavlov1 Pavlov1_d
Czech_Vestonice13 Vestonice13_d
Czech_Vestonice14_lc Vestonice14_d
Czech_Vestonice15 Vestonice15_d
Czech_Vestonice16 Vestonice16
Czech_Vestonice43 Vestonice43_d
France_Rigney1_published Rigney1_published_d
Germany_Brillenhohle_published_lc Brillenhohle_published_d
Germany_Burkhardtshohle_published Burkhardtshohle_published_d
Germany_HohleFels49_published HohleFels49_published_d
Germany_HohleFels79_published_lc HohleFels79_published_d
Italy_South_HG_Ostuni1 Ostuni1_d
Italy_South_HG_Ostuni2 Ostuni2_d
Italy_South_HG_Paglicci108_published_lc Paglicci108_published_d
Italy_South_HG_Paglicci133_published Paglicci133_published
Romania_Cioclovina_published_lc Cioclovina1_published_d
Romania_Muierii Muierii2_d
Romania_Oase Oase1_d
Russia_Kostenki12 Kostenki12
Russia_Kostenki14 Kostenki14
Russia_Kostenki14.SG Kostenki14.SG
Russia_Sunghir1.SG Sunghir1.SG
Russia_Sunghir2.SG Sunghir2.SG
Russia_Sunghir3.SG Sunghir3.SG
Russia_Sunghir4.SG Sunghir4.SG
Russia_Ust_Ishim_HG_published.DG Ust_Ishim_published.DG
Russia_Ust_Ishim.DG UstIshim_snpAD.DG
Russia_Yana_UP.SG Yana_old.SG
Russia_Yana_UP.SG Yana_old2.SG
Spain_ElMiron ElMiron_d
Kostenki14 and Ust-Ishim were used here, but this was a couple of years ago and new samples have been published since, so should something be added?
https://eurogenes.blogspot.com/2017/...lithic-to.html
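For what it's worth, in ADMIXTOOLS 2 whichever subset you settle on just becomes the pright vector; for example, starting from the Kostenki14/Ust-Ishim pair used in that post and adding a few of the newer labels above (purely illustrative, not a recommendation):

```r
# Illustrative outgroup set only; which Paleolithic samples make good
# outgroups depends on the pleft being tested.
right = c("Russia_Kostenki14", "Russia_Ust_Ishim.DG", "Czech_Vestonice16",
          "Belgium_UP_GoyetQ116_1_published", "Russia_Sunghir3.SG",
          "China_Tianyuan", "Spain_ElMiron")
```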
most of these Paleolithic samples were analyzed here:
https://www.ncbi.nlm.nih.gov/pmc/art...ort=objectonly
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4943878/
I ran a PCA on all Reich samples older than 8000 ybp to see into which clusters the more recently published samples fall. I removed Neanderthals and Denisovans, as well as African, Middle Eastern and Amerindian samples, because they were outliers and skewed the European dimensions. I didn't do LD or MAF pruning before the PCA; I guess Paleolithic samples would probably require completely different LD and MAF settings from modern pops anyway.
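The PCA step itself can be sketched in base R (the genotype matrix below is randomly generated just to make the sketch self-contained; in practice smartpca or similar tools are usually used, since they handle missing data and projection):

```r
# Hypothetical genotype matrix: one row per sample, one column per SNP,
# entries are 0/1/2 alternate-allele counts (random here for illustration).
geno = matrix(sample(0:2, 200 * 1000, replace = TRUE), nrow = 200)

geno = scale(geno)                        # center and scale each SNP
geno = geno[, colSums(is.na(geno)) == 0]  # drop zero-variance columns (NaN after scaling)
pc = prcomp(geno)                         # principal components
plot(pc$x[, 1], pc$x[, 2], xlab = "PC1", ylab = "PC2")
```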
The result is very similar to the MDS plot from the study:
The conclusions would be that the Sunghir samples fall into the Vestonice cluster, Yana clusters with Ust-Ishim, and the Geometric and Azilian samples plot with El Miron.
I was trying to extract new Turkish samples but got a few errors during the process. Do the files need to be imputed in Linux by reference to their genetic linkage map (using SHAPEIT, perhaps)? At least that's what I understood, although I have no idea how to do it.
Any tips on what is going on here?
Files can be found here in case someone would like to try it.
Code:
+ extract_f2("originknownturkish", outdir = "balkanturks", pops = pops)
i Reading allele frequencies from PLINK files...
Warning: 1 parsing failure.
 row col       expected actual                     file
1460  X6  embedded null        'originknownturkish.fam'
i originknownturkish.geno has 1460 samples and 423261 SNPs
i Calculating allele frequencies from 1 samples in 1 populations
i Expected size of allele frequency data: 102 MB
423k SNPs read...
√ 423261 SNPs read in total
! 421395 SNPs remain after filtering. 0 are polymorphic.
i Allele frequency matrix for 421395 SNPs and 1 populations is 34 MB
i Computing pairwise f2 for all SNPs and population pairs requires 67 MB RAM without splitting
i Computing without splitting since 67 < 8000 (maxmem)...
Error in cpp_get_block_lengths(numchr, dat[[distcol]], blgsize) :
  upper value must be greater than lower value
In addition: Warning messages:
1: In get_block_lengths(afdat$snpfile, blgsize = blgsize) :
  No genetic linkage map or base positions found! Each chromosome will be its own block, which can make standard error estimates inaccurate.
2: In get_block_lengths(afdat$snpfile[poly, ], blgsize = blgsize) :
  No genetic linkage map found! Defining blocks by base pair distance of 2e+06
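Two things stand out in that log: "0 are polymorphic" is expected when everything is read as 1 sample in 1 population, which points back to the parsing failure (the embedded null in the .fam) rather than the SNP data; and the warnings say no genetic map or base positions were found. Before imputing anything, it may be worth inspecting the .bim directly (file name inferred from the prefix above):

```r
# Inspect the PLINK .bim: column 3 is the genetic position (cM),
# column 4 the base-pair position.
bim = read.table("originknownturkish.bim",
                 col.names = c("chr", "rsid", "cm", "pos", "a1", "a2"))
all(bim$cm == 0)   # TRUE: no genetic map, blocks fall back to base positions
any(bim$pos <= 0)  # TRUE for some rows would explain the block-length error
```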