Which outgroups should be used for a model like this? I tried these, but the model failed:
Russia_Sunghir3.SG
Russia_Kostenki14
Czech_Vestonice16
ONG.SG
Iran_Mesolithic
Russia_MA1_HG.SG
Anatolia_Epipaleolithic
Morocco_Iberomaurusian
Mbuti
Ethiopia_4500BP_published.SG
China_Tianyuan
Same model but with Global25:

Scaled
Code:
Target: Romanian
Distance: 3.2702% / 0.03270239
50.8 TUR_Barcin_N
43.2 Yamnaya_RUS_Samara
 6.0 WHG

Target: Bulgarian
Distance: 3.2130% / 0.03212961
51.8 TUR_Barcin_N
41.2 Yamnaya_RUS_Samara
 5.4 WHG
 1.6 IRN_Ganj_Dareh_N
Unscaled
Code:
Target: Romanian
Distance: 2.1592% / 0.02159224
53.0 TUR_Barcin_N
42.6 Yamnaya_RUS_Samara
 3.6 WHG
 0.8 IRN_Ganj_Dareh_N

Target: Bulgarian
Distance: 2.1636% / 0.02163633
53.6 TUR_Barcin_N
38.6 Yamnaya_RUS_Samara
 4.2 WHG
 3.6 IRN_Ganj_Dareh_N
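For intuition about those numbers: the "Distance" that G25/nMonte-style fits report is the Euclidean distance between the target's coordinate vector and the weighted average of the source vectors. A toy sketch with made-up 3-component vectors (real G25 vectors have 25 components; this is an illustration of the idea, not nMonte itself):

```python
# Toy illustration: the reported "Distance" is the Euclidean distance between
# the target's coordinates and the weight-averaged source coordinates.
def fit_distance(target, sources, weights):
    # Weighted average of the source vectors, component by component
    mix = [sum(w * s[i] for w, s in zip(weights, sources))
           for i in range(len(target))]
    # Euclidean distance between target and the mixed vector
    return sum((t - m) ** 2 for t, m in zip(target, mix)) ** 0.5

# Made-up 3-D coordinates (hypothetical, for illustration only)
target = [0.10, 0.05, -0.02]
sources = [[0.12, 0.06, -0.01], [0.06, 0.02, -0.04]]
weights = [0.6, 0.4]          # admixture proportions, must sum to 1
d = fit_distance(target, sources, weights)
```

The tools search over the weights to minimize this distance; the percentage shown is just the same number multiplied by 100.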
Without knowing what outgroups were used, common sense would dictate that you should drop Iran_Mesolithic and Anatolia_Epipaleolithic from your outgroups, as they're too closely related to your sources Iran_N and Anatolia_N. I'm 100% sure you'll see your p-values increase if you do that. Let me know how much better your p-values become after dropping them.
After doing that, do another run replacing Mota with Khomani and Yana and see what happens.
We will bring you out of your darkness. [translated from Pashto]
I tried running this in R (https://github.com/uqrmaie1/admixtools):
install.packages("devtools")
devtools::install_github("uqrmaie1/admixtools")
library("admixtools")
`devtools` failed to install because there was an error installing a dependency of its dependency:
ERROR: dependency ‘gert’ is not available for package ‘usethis’
* removing ‘/usr/local/lib/R/4.0/site-library/usethis’
ERROR: dependency ‘usethis’ is not available for package ‘devtools’
* removing ‘/usr/local/lib/R/4.0/site-library/devtools’
When I tried running `install.packages("gert")`, I got this error:
Configuration failed to find libgit2 library. Try installing:
* brew: libgit2 (MacOS)
So after I ran `brew install libgit2`, I was able to successfully install `devtools`. But then when I tried to install `admixtools`, it failed because the dependency `igraph` could not be installed. Running `brew uninstall --ignore-dependencies suite-sparse` fixed it.
Then I downloaded `v44.3_HO_public.tar` from here: https://reichdata.hms.harvard.edu/pu...ses/index.html. The file `v44.3_1240K_public.tar` had fewer modern populations; for example, it didn't have Nganasan.
I picked population names from the last column of the file `v44.3_1240K_public/v44.3_1240K_public.ind`. Then I ran these commands:
printf '%s\n' Mari.SG > target
printf '%s\n' Russia_AfontovaGora3 Russia_HG_Karelia Estonia_CordedWare Germany_EN_LBK Russia_MLBA_Sintashta Russia_Medieval_Nomad.SG Nganasan France_Rochedane Sweden_Motala_HG > left
printf '%s\n' Mbuti.DG Israel_Natufian_published Mixe.DG Ami.DG Itelmen.DG Czech_Vestonice16 Serbia_IronGates_Mesolithic Russia_Shamanka_Eneolithic.SG Papuan.DG Russia_Ust_Ishim.DG Switzerland_Bichon.SG > right
cat left right target > pops
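Before running `extract_f2`, it can save a failed run to check that every label in `pops` actually occurs in the last column of the `.ind` file. A minimal sketch with stand-in data (in a real run you would read your `pops` file and `v44.3_HO_public/v44.3_HO_public.ind` from disk; the sample lines below are hypothetical):

```python
# Check that every requested population label exists in the .ind file.
# Stand-in data; a real run would use open(...).readlines() on both files.
ind_lines = ["Sample1 M Mbuti.DG", "Sample2 F Mari.SG"]
pops = ["Mbuti.DG", "Nganasan"]

# The population label is the last whitespace-separated field of each .ind line
available = {line.split()[-1] for line in ind_lines}
missing = [p for p in pops if p not in available]
```

Any label left in `missing` would make the extraction fail or silently drop a population, so fix those names first.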
Then in R I ran commands like this:
library("admixtools")
extract_f2(pref="v44.3_HO_public/v44.3_HO_public",pops=readLines("pops"),outdir="myf2dir")
f2_blocks=f2_from_precomp("myf2dir")
qpadm(f2_blocks,left=readLines("left"),right=readLines("right"),target=readLines("target"))
Output:
$weights
# A tibble: 9 x 5
target left weight se z
1 Mari.SG Russia_AfontovaGora3 -0.613 0.701 -0.874
2 Mari.SG Russia_HG_Karelia 0.0170 0.560 0.0304
3 Mari.SG Estonia_CordedWare 0.425 1.17 0.362
4 Mari.SG Germany_EN_LBK 0.352 0.965 0.365
5 Mari.SG Russia_MLBA_Sintashta 0.419 0.841 0.498
6 Mari.SG Russia_Medieval_Nomad.SG -0.399 0.558 -0.715
7 Mari.SG Nganasan 0.0751 0.454 0.166
8 Mari.SG France_Rochedane 0.128 0.582 0.220
9 Mari.SG Sweden_Motala_HG 0.595 3.22 0.185
$f4
# A tibble: 1,100 x 9
pop1 pop2 pop3 pop4 est se z p weight
1 Mari.SG Estonia_CordedWare Ami.DG Czech_Vestonice16 -0.0156 0.0146 -1.07 0.286 0.425
2 Mari.SG fit Ami.DG Czech_Vestonice16 -0.000149 0.0158 -0.00939 0.993 NA
3 Mari.SG France_Rochedane Ami.DG Czech_Vestonice16 0.0219 0.0255 0.859 0.390 0.128
4 Mari.SG Germany_EN_LBK Ami.DG Czech_Vestonice16 0.0297 0.0168 1.76 0.0776 0.352
5 Mari.SG Nganasan Ami.DG Czech_Vestonice16 0.00882 0.0140 0.628 0.530 0.0751
6 Mari.SG Russia_AfontovaGora3 Ami.DG Czech_Vestonice16 0.00938 0.0211 0.443 0.657 -0.613
7 Mari.SG Russia_HG_Karelia Ami.DG Czech_Vestonice16 0.0406 0.0251 1.62 0.106 0.0170
8 Mari.SG Russia_Medieval_Nomad.SG Ami.DG Czech_Vestonice16 0.0281 0.0235 1.20 0.231 -0.399
9 Mari.SG Russia_MLBA_Sintashta Ami.DG Czech_Vestonice16 0.0129 0.0150 0.865 0.387 0.419
10 Mari.SG Sweden_Motala_HG Ami.DG Czech_Vestonice16 0.00579 0.0165 0.350 0.726 0.595
# … with 1,090 more rows
$rankdrop
# A tibble: 9 x 7
f4rank dof chisq p dofdiff chisqdiff p_nested
1 8 2 2.07 0.356 4 0.114 0.998
2 7 6 2.18 0.902 6 1.27 0.973
3 6 12 3.45 0.991 8 2.82 0.945
4 5 20 6.27 0.998 10 6.76 0.748
5 4 30 13.0 0.997 12 12.6 0.396
6 3 42 25.7 0.978 14 22.7 0.0649
7 2 56 48.4 0.755 16 28.1 0.0308
8 1 72 76.5 0.337 18 62.2 0.000000895
9 0 90 139. 0.000751 NA NA NA
$popdrop
# A tibble: 511 x 20
pat wt dof chisq p f4rank Russia_Afontova… Russia_HG_Karel… Estonia_CordedW… Germany_EN_LBK Russia_MLBA_Sin… Russia_Medieval… Nganasan
1 0000… 0 2 2.07 0.356 8 -0.613 0.0170 0.425 0.352 0.419 -0.399 0.0751
2 0000… 1 3 2.03 0.567 7 -0.792 0.391 0.507 -0.0808 0.780 -0.217 0.295
3 0000… 1 3 1.48 0.687 7 -0.974 0.155 0.267 0.141 0.741 -0.400 0.112
4 0000… 1 3 1.45 0.694 7 -0.730 -0.0587 0.241 0.257 0.427 -0.420 NA
5 0000… 1 3 1.71 0.636 7 -1.78 -1.43 -2.94 -0.857 -0.0838 NA -0.709
6 0000… 1 3 1.43 0.699 7 -1.22 -0.416 -1.07 -0.137 NA -0.539 0.160
7 0001… 1 3 1.68 0.641 7 -1.06 -0.110 -0.410 NA 0.314 -0.465 0.154
8 0010… 1 3 1.77 0.622 7 -1.05 0.0144 NA 0.220 0.400 -0.329 0.133
9 0100… 1 3 1.54 0.673 7 -0.898 NA 0.155 0.287 0.483 -0.395 0.0972
10 1000… 1 3 1.41 0.704 7 NA -8.23 -18.7 1.51 -12.8 3.80 -3.54
# … with 501 more rows, and 7 more variables: France_Rochedane, Sweden_Motala_HG , feasible , best , dofdiff , chisqdiff ,
# p_nested
These pages were helpful:
https://uqrmaie1.github.io/admixtool...dmixtools.html
https://comppopgenworkshop2019.readt...ave_qpadm.html
OK, so I got it running with my own data. I thought I would have to use Linux for this and wasted a considerable amount of time trying that, but I managed to do it entirely on my Windows PC.
* I use R and Python daily, and I can say that you need almost no proficiency in either of these languages. But you will need them installed properly, nonetheless.
* Download one of the Reich datasets (https://reich.hms.harvard.edu/allen-...cient-dna-data).
Here is what I did, roughly:
1- Converted my FTDNA raw data to 23andme v2 raw using DNA Kit Studio.
2- Used this script (http://jade-cheng.com/au/23andme-to-plink/) to convert the raw file to .map and .ped files. (This requires Python; I have Anaconda installed, so I just used the terminal in the conda environment, but if you don't have Python in your PATH, you will have to add it.) Just go to the directory (cd Path) in which your raw file and the script reside and run this command:
Code:
python 23andme-to-plink.py yourfile.txt
3- This gives you two files in the same directory (yourfile.ped and yourfile.map).
4- Download plink (https://zzz.bwh.harvard.edu/plink/download.shtml).
5- The folder into which you extract the contents of the archive matters; I did not want to bother with PATH, so I just unpacked the contents into the same folder as above.
6- Now that you have both plink and your .ped and .map files in the same folder, you can run the following command from the terminal:
Code:
plink --file yourfile --make-bed --out yourfile_new
(I only added the _new suffix here to show that it is the name of the output files you will be getting.) Now you should have three files, which constitute the plink structure. You will be merging these with the Reich dataset to run qpadm for yourself.
7- Start R (preferably RStudio; I like using the embedded terminal there). Install admixtools and all the other packages that are deemed necessary, if you haven't already (check the admixtools R package landing page for that).
Make sure that your working directory is the same as the folder that contains the necessary files. (getwd() will tell you your current working directory; you can use setwd(yourdirectory) to change it.)
8- OK, so far so good. The Reich dataset is distributed in the EIGENSTRAT format. We are going to have to convert this to plink too, so that we can merge the raw data with the dataset. But the entire dataset is too big; you will have to use a subset of it. Luckily, the admixtools R package offers some functions for that.
The EIGENSTRAT structure, as far as I have seen, comes with three files: .geno, .ind, and .snp. You may view the .ind file with a text editor to see which populations are available to us. (If you know the samples by heart, that is even better.)
The code below shows the subset of populations I used.
(For the "right-hand side" populations I used Zoro's advice in this thread: https://www.theapricity.com/forum/sh...ght=admixtools. I have not read the theory yet, so I am practically a layman at this. This is just an ad-hoc solution, and I tried to replace the populations that were not present in the dataset. I would really love to hear some advice in this regard. For the "left-hand side" populations I picked the relevant pops for my ancestry without much overlap.) The code below should give you a vector of the population names with which you will filter the dataset.
Code:
pops <- c(
  # right
  "Mbuti.DG", "Russia_Ust_Ishim.DG", "China_Tianyuan", "Goyet_Neanderthal.SG",
  "Russia_Sunghir3.SG", "Russia_Kostenki14.SG", "Morocco_Iberomaurusian",
  "Israel_Natufian_published", "Spain_ElMiron", "Russia_MA1_HG.SG",
  "Georgia_Satsurblia.SG", "Russia_DevilsCave_N.SG", "Papuan.DG",
  "Turkey_N.SG", "Iran_GanjDareh_N", "Switzerland_Bichon.SG",
  # left
  "Adygei", "Turkmen", "Bulgarian"
)
Assuming that you have extracted the contents of the Reich dataset to the same folder, the code below uses the conversion function that is available in the admixtools package. It will read all three parts of the EIGENSTRAT structure and create three plink files with the extensions .bed, .bim and .fam. (You may have noticed that now both your raw data and the dataset have the same format.)
The first argument is the prefix of the EIGENSTRAT files; the second one, outpref, is arbitrary and is going to be the prefix of your plink files. The pops argument uses the populations you chose to subset the dataset (it is impossible to take in all of them without a supercomputer).
Code:
eigenstrat_to_plink("v44.3_HO_public", outpref = "master_plink", pops = pops)
9- Now we go back to the terminal to merge the files:
Code:
plink --bfile master_plink --bmerge yourfile_new.bed yourfile_new.bim yourfile_new.fam --make-bed --out merged_data
master_plink = Reich dataset in plink format
yourfile_new = your raw data in plink format
merged_data = prefix of the merged data in plink format
10- OK, if everything goes smoothly, we only have two more steps until we successfully run qpadm. While the manual says there are other ways of doing it, I chose to do it the following way:
* Extract the f2 statistics from the merged dataset:
Code:
extract_f2("merged_data", outdir = "f2_new")
This looks for your input files and creates a new folder with many output files in it. Since we have just created a set of plink files with the prefix merged_data, we will be using that as the input. The second argument, outdir, is the name of the folder in which the f2 statistics are going to be stored. The name is arbitrary, but we will be using that folder to get the qpadm estimation.
11- Now, the qpadm part. It is vital to look inside the folder that has just been created. Look for the name of your raw data; this can be anything if you have not changed it before, and assuming that you used the script I linked, the name of your file will likely have the suffix "FAM". Find it, add it among the "left" populations, AND set it as the target. As you might expect, you cannot introduce extra populations here, but if you are somehow not satisfied with the result, you can remove some of the populations. If you would like to have more populations, you are going to have to do the subset-merge process over again.
Code:
qpa <- qpadm(data = "f2_new",
             left = c("YOURFILE", "Adygei", "Turkmen", "Bulgarian"),
             right = c("Mbuti.DG", "Russia_Ust_Ishim.DG", "China_Tianyuan",
                       "Goyet_Neanderthal.SG", "Russia_Sunghir3.SG",
                       "Russia_Kostenki14.SG", "Morocco_Iberomaurusian",
                       "Israel_Natufian_published", "Spain_ElMiron",
                       "Russia_MA1_HG.SG", "Georgia_Satsurblia.SG",
                       "Russia_DevilsCave_N.SG", "Papuan.DG", "Turkey_N.SG",
                       "Iran_GanjDareh_N", "Switzerland_Bichon.SG"),
             target = "YOURFILE")
This will run and store the results in an object named "qpa". qpa$weights will give you the weight estimates; if you are using RStudio you will be able to see what other values you can access. As far as I understand, we are looking for weight estimates between 0 and 1, with low standard errors and high p-values. This makes sense, as the estimates sum to 1 and look like admixture proportion estimates.
I have got these results:
Code:
left       weight     se      z
Adygei      0.439  0.595  0.738
Turkmen     0.348  0.304  1.14
Bulgarian   0.213  0.404  0.527
Not bad for a first attempt, I guess. If you put closely related populations together, you will likely get negative estimates with very high standard errors.
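As a rule-of-thumb screen (this check is an editorial addition, not part of admixtools), you can flag sources whose weight falls outside [0, 1] or whose standard error exceeds the estimate itself, using the numbers from the run above:

```python
# Rule-of-thumb screen for qpadm weights: flag sources whose estimate is
# outside [0, 1] or whose standard error is larger than the estimate.
# Numbers taken from the qpadm run shown above; thresholds are a heuristic.
weights = {"Adygei": (0.439, 0.595),
           "Turkmen": (0.348, 0.304),
           "Bulgarian": (0.213, 0.404)}

flagged = [name for name, (w, se) in weights.items()
           if not 0.0 <= w <= 1.0 or se > abs(w)]
```

A flagged source is not necessarily wrong, but large standard errors relative to the estimate usually mean the sources are too similar for the model to tell apart.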
I would like to reiterate that I have no prior knowledge of population genetics and am quite ignorant compared to other Apricians. For all I know, what I attempted might just be bullshit.
I would also like to add that, after extracting the f2 statistics, I get notified that
Code:
√ 1034771 SNPs read in total
! 1331 SNPs remain after filtering. 1331 are polymorphic.
i Allele frequency matrix for 1331 SNPs and 22 populations is 0 MB
I am not sure if this is normal, but it seemed suspicious to me, as it eliminates almost the entirety of the SNPs. (I checked how many SNPs my FTDNA data and the Reich dataset have in common, and it turned out to be a little fewer than 130k. Both the Reich data and my raw data have almost 600k SNP lines, a significant portion of which I believe are either no-calls or missing values.)
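That overlap can be checked directly: count the rsIDs shared between the converted raw file and the merged plink .bim, skipping no-call ("--") genotypes. A stand-in sketch (the sample lines are hypothetical; in a real run you would read your raw file and merged_data.bim from disk):

```python
# Count SNP IDs shared between a 23andme-style raw file and a plink .bim,
# treating "--" genotypes as no-calls. Stand-in lines; real files are large.
raw_lines = ["rs1\t1\t100\tAA",
             "rs2\t1\t200\t--",   # no-call: should not count as usable
             "rs3\t1\t300\tCT"]
bim_lines = ["1 rs2 0 200 A G",
             "1 rs3 0 300 C T",
             "1 rs4 0 400 G T"]

# rsID is field 1 of the raw file, field 2 of the .bim
raw_called = {f[0] for f in (l.split() for l in raw_lines) if f[3] != "--"}
bim_ids = {l.split()[1] for l in bim_lines}
shared = raw_called & bim_ids
```

Comparing `len(shared)` with and without the no-call filter shows how many of the "common" SNPs are actually usable after merging.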
Last edited by Korialstrasz; 02-24-2021 at 06:27 PM.
50% Turkish_Deliorman + 50% Adygei @ 4,879
I checked the converted 23andme file and noticed that a huge chunk of SNPs was now "--". I thought the tool would just convert the contents of the raw data to another format; turns out I was wrong. Therefore, I manually converted the FTDNA file to 23andme format, and now I have twice as many SNPs in the allele frequency matrix. Strange.
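A sketch of that manual FTDNA-to-23andme conversion, assuming the common FTDNA CSV layout ("RSID","CHROMOSOME","POSITION","RESULT"); real exports vary between FTDNA versions, so treat the column handling here as an assumption:

```python
# Convert FTDNA-style CSV rows to 23andme-style tab-separated lines.
# Assumes columns: RSID, CHROMOSOME, POSITION, RESULT (a common FTDNA layout).
import csv
import io

ftdna = ('"RSID","CHROMOSOME","POSITION","RESULT"\n'
         '"rs1","1","100","AA"\n'
         '"rs2","1","200","CT"\n')

reader = csv.reader(io.StringIO(ftdna))  # csv handles the quoted fields
next(reader)                             # skip the header row
lines = ["\t".join(row) for row in reader]
```

Doing the conversion this way, instead of re-calling genotypes, preserves every call present in the FTDNA file, which is presumably why the allele frequency matrix doubled.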
It's best to merge these 2. HO has more modern populations, but 1240K has all of the ancients and many important moderns, with many more SNPs.
Also check out the data here:
https://evolbio.ut.ee/
Last edited by vbnetkhio; 02-24-2021 at 05:45 PM.