(Reposted from Anthrogenica, because I want to hear comments from Lemminkäinen, Lucas, vbnetkhio, Zoro & co. (in alphabetical order).)
1. Download the 1240K+HO dataset: https://reich.hms.harvard.edu/allen-...cient-dna-data. Convert it to PLINK format.

Code:
wget https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.{anno,ind,snp,geno}
x=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $x.geno snpname:\ $x.snp indivname:\ $x.ind outputformat:\ PACKEDPED genotypeoutname:\ $x.bed snpoutname:\ $x.bim indivoutname:\ $x.fam)
2. Select modern populations with at least 20 samples, so that individual-level genetic variation is averaged out. Include only the 20 samples with the highest SNP count from each population, because otherwise the models would be biased towards preferring populations with a large number of samples as sources. In the anno file, field 2 is the sample name, field 6 is the age in years before present, field 8 is the population name, and field 15 is the SNP count.

Code:
x=test
awk -F\\t '$6==0&&++a[$8]==20{print$8}' v44.3_HO_public.anno|grep -Ev '\.DG|\.SDG|\.SG|\.WGA|Ignore|outlier'|awk -F\\t 'NR==FNR{a[$0];next}$8 in a&&++b[$8]<=20{print$2,$8}' - <(sort -t$'\t' -rnk15 v44.3_HO_public.anno)>$x.pick
plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
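Since the awk one-liner is fairly dense, here's a rough Python equivalent of the same selection step, purely for illustration (it assumes pandas, the tab-separated anno file, and the field positions listed above; the output name test.pick matches the shell version):

Code:
import pandas as pd

# read the tab-separated anno file and address columns by 0-based position,
# so sample name = column 1, age = column 5, population = column 7, SNP count = column 14
anno = pd.read_csv("v44.3_HO_public.anno", sep="\t", dtype=str)
anno.columns = range(len(anno.columns))

modern = anno[anno[5] == "0"]  # age 0 years before present = modern samples
modern = modern[~modern[7].str.contains(r"\.DG|\.SDG|\.SG|\.WGA|Ignore|outlier")]

counts = modern[7].value_counts()  # populations with at least 20 modern samples
keep = modern[modern[7].isin(counts[counts >= 20].index)]

picked = (keep.assign(snps=pd.to_numeric(keep[14], errors="coerce"))
              .sort_values("snps", ascending=False)
              .groupby(7).head(20))  # top 20 samples per population by SNP count

picked[[1, 7]].to_csv("test.pick", sep=" ", header=False, index=False)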
3. Perform aggressive LD pruning (`--indep-pairwise 50 10 .05`) and remove all SNPs with missing data (`--geno 0`) in order to get the total SNP count down to about 30,000. If this removes too many SNPs, you can use something like `--geno .001` instead.

Code:
plink --bfile $x --indep-pairwise 50 10 .05 --geno 0 --out $x
plink --allow-no-sex --bfile $x --extract $x.prune.in --make-bed --out $x.p
4. Convert the PLINK file to the EIGENSTRAT format. The eigenstratgeno file consists of one SNP per line with one number per individual, where the numbers 0-2 indicate the number of copies of the reference allele and 9 indicates missing data.

Code:
convertf -p <(printf %s\\n genotypename:\ $x.p.bed snpname:\ $x.p.bim indivname:\ $x.p.fam outputformat:\ EIGENSTRAT genotypeoutname:\ $x.eigenstratgeno snpoutname:\ $x.snp indivoutname:\ $x.ind)
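For reference, this is how the eigenstratgeno file can be read into a matrix in Python (a minimal sketch, not part of the pipeline; numpy assumed):

Code:
import numpy as np

# one line per SNP, one digit per individual:
# 0-2 = copies of the reference allele, 9 = missing data
with open("test.eigenstratgeno") as f:
    geno = np.array([[int(c) for c in line.strip()] for line in f], dtype=np.int8)

# mask the 9s, although --geno 0 in step 3 should have removed all missing data
geno = np.ma.masked_equal(geno, 9)
print(geno.shape)  # (number of SNPs, number of individuals)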
5. Convert the EIGENSTRAT file to a CSV format, and create another CSV file for average allele frequencies within populations.

Code:
awk -F '' '{for(i=1;i<=NF;i++)a[i,NR]=$i}END{for(j=1;j<=NF;j++)for(k=1;k<=NR;k++)printf(k==NR?"%s\n":"%s,"),a[j,k]}' $x.eigenstratgeno|tr -d 9|paste -d, <(awk 'NR==FNR{a[$1]=$2;next}{print a[$2]":"$2}' $x.pick $x.fam) ->$x.csv
sed 's/:[^,]*//' $x.csv|awk -F, '{n[$1]++;for(i=2;i<=NF;i++){a[$1,i]+=$i}}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o","a[i,j]/n[i];print o}}'>$x.ave.csv
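The same conversion in Python, as a sketch (it assumes the files produced by steps 2 and 4; the output name test.ave.py.csv is made up here to avoid clobbering the shell version):

Code:
import numpy as np
from collections import defaultdict

# genotype matrix from step 4: one row per SNP, one column per individual
geno = np.array([[int(c) for c in l.strip()] for l in open("test.eigenstratgeno")])

# test.pick maps sample names to populations; the .fam file gives the column order
pop = dict(l.split()[:2] for l in open("test.pick"))
cols = defaultdict(list)
for i, l in enumerate(open("test.fam")):
    cols[pop[l.split()[1]]].append(i)

# average the allele counts within each population, one output row per population
with open("test.ave.py.csv", "w") as out:
    for name, idx in cols.items():
        means = geno[:, idx].mean(axis=1)
        out.write(name + "," + ",".join(f"{m:g}" for m in means) + "\n")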
6. Install the CVXPY library and download my modified version of michal3141's convex optimization script: https://github.com/michal3141/g25. My version uses the same algorithm as the original script, but I changed the output format. I also changed the solver to CVXOPT, because it seems to deal better than the ECOS solver with an input matrix that has tens of thousands of columns.

Code:
pip3 install cvxpy
curl https://pastebin.com/raw/afaMiFSa|tr -d \\r>mix;chmod +x mix
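The underlying idea is a constrained least-squares fit: find nonnegative mixture proportions that sum to 1 and minimize the distance between the target vector and the weighted sum of the source vectors. Here's a minimal standalone CVXPY sketch of that setup (not the mix script itself, just an illustration):

Code:
import cvxpy as cp
import numpy as np

def fit(sources, target):
    # sources: one row per candidate population; target: one row, same SNP columns
    A = np.asarray(sources, dtype=float).T      # SNPs x populations
    b = np.asarray(target, dtype=float)
    x = cp.Variable(A.shape[1])
    problem = cp.Problem(cp.Minimize(cp.norm(A @ x - b)),
                         [x >= 0, cp.sum(x) == 1])
    problem.solve(solver=cp.CVXOPT)             # CVXOPT rather than ECOS; needs the cvxopt package
    return x.value

# toy usage: a target that is a 60/40 mix of the first two of five random "populations"
rng = np.random.default_rng(0)
S = rng.random((5, 30000))
print(fit(S, 0.6 * S[0] + 0.4 * S[1]).round(2))  # ~[0.6 0.4 0. 0. 0.]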
7. Try out some models. The models usually make less sense than G25 models, and there's often about 0-5% ancestry from multiple populations that are not actually related to the target population. But I guess it's a start.

Code:
$ t=Selkup;./mix <(grep -v ^$t, test.ave.csv) <(grep ^$t, test.ave.csv)
Selkup (31.204): 21% Nganasan 16% Bashkir 11% Tuvinian 9% Tubalar 7% Chukchi 7% Altaian 7% Yakut 7% Mordovian 5% Russian 3% Buryat 2% Tajik 2% Ulchi 2% Uzbek 1% Hungarian 1% French 1% Burusho 0% Mongol
$ t=Mordovian;./mix <(grep -v ^$t, test.ave.csv) <(grep ^$t, test.ave.csv)
Mordovian (23.311): 25% Russian 17% Hungarian 12% French 7% Bashkir 7% Spanish 5% Tajik 4% Italian_North 4% Basque 4% Greek 4% Uzbek 4% Selkup 3% Adygei 1% Tubalar 1% Burusho 1% Balochi 1% Altaian 0% Turkish 0% Chukchi
$ t=Nganasan;./mix <(grep -v ^$t, test.ave.csv) <(grep ^$t, test.ave.csv)
Nganasan (36.739): 22% Yakut 22% Selkup 19% Tuvinian 14% Ulchi 11% Buryat 11% Chukchi
I uploaded the CSV file for average allele frequencies within populations here: https://drive.google.com/file/d/1nuS1pZ-pZCCWfeA1JpMwhfqgAT-c9Oi8/view?usp=sharing. You can also use the CSV file with Vahaduo. When I tried to paste the contents of the file into the SOURCE tab, Vahaduo crashed in Safari but not in Chrome. Running a single model took about half a minute.
Here are PCA plots based on the CSV file, where I excluded Yoruba and Biaka from the second plot:
If you put the CSV file linked above into Vahaduo's TARGET tab and the CSV file produced by the code below into the SOURCE tab, you can create models similar to supervised ADMIXTURE:
The image below displays the models copied from Vahaduo's MULTI tab. The number after the population name is the distance fit. Some South Asian populations got bad fits, and they got a few percent of the SSA component, which may be a sign that I should've added a South Asian component to this run.

Code:
curl -Ls 'https://drive.google.com/uc?export=download&id=1nuS1pZ-pZCCWfeA1JpMwhfqgAT-c9Oi8'>test.ave.csv
tav(){ awk -F, '{n[$1]++;for(i=2;i<=NF;i++){a[$1,i]+=$i}}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS a[i,j]/n[i];print o}}' "FS=${1-$'\t'}";}
printf %s\\n Yoruba+Biaka Palestinian+Yemeni_Northwest+BedouinA Russian+Mordovian+Hungarian Nganasan+Chukchi+Yakut Han+Japanese+Qiang|
while read l;do tr + \\n<<<$l|awk -F, 'NR==FNR{a[$0];next}$1 in a' - test.ave.csv|cut -d, -f2-|sed s/^/$l,/|tav ,;done>source
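The component-averaging can also be written in Python if awk isn't your thing; a sketch that mirrors the shell code above (same component definitions, input, and output file):

Code:
import csv

# each source component is the average of a few population rows from test.ave.csv
components = {
    "Yoruba+Biaka": ["Yoruba", "Biaka"],
    "Palestinian+Yemeni_Northwest+BedouinA": ["Palestinian", "Yemeni_Northwest", "BedouinA"],
    "Russian+Mordovian+Hungarian": ["Russian", "Mordovian", "Hungarian"],
    "Nganasan+Chukchi+Yakut": ["Nganasan", "Chukchi", "Yakut"],
    "Han+Japanese+Qiang": ["Han", "Japanese", "Qiang"],
}

rows = {r[0]: [float(v) for v in r[1:]] for r in csv.reader(open("test.ave.csv"))}

with open("source", "w", newline="") as f:
    w = csv.writer(f)
    for name, pops in components.items():
        cols = zip(*(rows[p] for p in pops))                 # group values SNP by SNP
        w.writerow([name] + [sum(c) / len(c) for c in cols])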
Does anyone know a better way to make models like this that are based on SNP-level data? Is there something I could do to improve this workflow?