Homozygous by descend figures in Eurasia [Archive] - The Apricity Forum: A European Cultural Community

View Full Version : Homozygous by descend figures in Eurasia

Lemminkäinen

10-01-2021, 08:18 PM

Komintasavalta

10-01-2021, 11:55 PM

Can you post the commands to calculate it? I downloaded Beagle, but its manual or command line help doesn't mention anything about HBD.

I was able to run this example code (https://faculty.washington.edu/browning/beagle/beagle.html):

wget http://faculty.washington.edu/browning/beagle/{beagle.28Jun21.220.jar,bref3.28Jun21.220.jar,test .28Jun21.220.vcf.gz}
gzip -dc test.28Jun21.220.vcf.gz|cut -f-190|tr / \||gzip>ref.28Jun21.220.vcf.gz
gzip -dc test.28Jun21.220.vcf.gz|cut -f-190|tr / \||gzip>ref.28Jun21.220.vcf.gz
gzip -dc test.28Jun21.220.vcf.gz|cut -f-9,191-200|gzip>target.28Jun21.220.vcf.gz
java -jar beagle.28Jun21.220.jar gt=test.28Jun21.220.vcf.gz out=out.gt
java -jar beagle.28Jun21.220.jar ref=ref.28Jun21.220.vcf.gz gt=target.28Jun21.220.vcf.gz out=out.ref
java -jar bref3.28Jun21.220.jar ref.28Jun21.220.vcf.gz>ref.28Jun21.220.bref3
java -jar beagle.28Jun21.220.jar ref=ref.28Jun21.220.bref3 gt=target.28Jun21.220.vcf.gz out=out.bref3

And I converted a PLINK dataset to VCF like this:

plink2 --bfile test --recode vcf --out test

Komintasavalta

10-28-2021, 11:19 AM

Apparently the option for calculating HBD was removed in Beagle 5.0, but in version 4.1, you can use `ibd=true` to calculate HBD and IBD: https://faculty.washington.edu/browning/beagle/b4_1.html.

wget https://faculty.washington.edu/browning/beagle/beagle.27Jan18.7e1.jar
wget https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V50/V50.0/SHARE/public.dir/v50.0_HO_public.{anno,ind,snp,geno}
f=v50.0_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
printf %s\\n Besermyan Enets Estonian Finnish Hungarian Karelian Mansi Mordovian Nganasan Selkup Udmurt Veps|awk 'NR==FNR{a[$0];next}$3 in a&&++b[$3]<=2{print$1}' - v50.0_HO_public.ind|awk 'NR==FNR{a[$0];next}$2 in a' - v50.0_HO_public.fam|plink --bfile v50.0_HO_public --allow-no-sex --keep <(cat) --autosome --recode vcf --out ural
java -jar beagle.27Jan18.7e1.jar gt=ural.vcf ibd=true out=ural
gzip -dc ural.hbd.gz|cut -d_ -f2-|awk 'NR==FNR{a[$1]=$3;next}{print a[$1],$0}' v50.0_HO_public.ind -|awk '{a[$1]+=$9;n[$1]++}END{for(i in a)print a[i]/n[i],i}' OFMT=%.4f|sort -n

Here's the output which shows the average LOD score for each population:

12.3450 Hungarian
14.3075 Estonian
26.5012 Finnish
27.6500 Mordovian
29.4841 Enets
30.4700 Veps
31.1556 Karelian
31.6653 Nganasan
41.9403 Mansi
42.1547 Udmurt
43.1838 Besermyan
47.1471 Selkup

I only included two samples from each population above, so the output is definitely not representative, because when I included all samples, I got an error like this with some samples:

ERROR: Missing one or both alleles for a genotype:
23 24300228 rs186929726 G A . . PR GT 0
Exiting Program

Or this:

Refined IBD
model scale: 2.00
run time: 8 seconds

mean edges/level: 8 max edges/level: 21
mean edges/node: 1.222 mean count/edge: 45
Exception in thread "main" java.lang.IllegalArgumentException:
2 225834818 rs73081884 T C
3 1706284 rs76302465 G A
at ibd.IbdSegment.checkArguments(IbdSegment.java:85)
at ibd.IbdSegment.<init>(IbdSegment.java:68)
at main.WindowWriter.merge(WindowWriter.java:329)
at main.WindowWriter.printIbd(WindowWriter.java:307)
at main.WindowWriter.printIbd(WindowWriter.java:286)
at main.Main.printOutput(Main.java:211)
at main.Main.phaseData(Main.java:157)
at main.Main.main(Main.java:115)

Beagle 3.3 has an `estimatehbd=true` option which was removed in Beagle 4.0. It requires an additional markers file, but it gave me an error like this regardless of which format I tried to use for the markers file (http://faculty.washington.edu/browning/beagle/b3.html):

$ wget http://faculty.washington.edu/browning/beagle/beagle.jar # Beagle 3.3
$ sort -snk1,1 -nk3 ural.bim|cut -f2,3,5,6 >ural.markers # sort by chromosome and them by cM
$ # sort -nk4 ural.bim|cut -f2,3,5,6 >ural.markers # sort by integer position
$ # sort -nk4 ural.bim|cut -f2,4,5,6 >ural.markers # sort by integer position and use integer position instead of cM in the second field
$ java -jar beagle.jar unphased=ural.vcf estimatehbd=true markers=ural.markers out=ural missing=.
Beagle version 3.3.2 (31 Oct 2011)
Enter "java -jar beagle.jar" for summary of command line arguments.
Start time: 01:12 PM EEST on 28 Oct 2021

Command line: java -Xmx3641m -jar beagle.jar
unphased=ural11.vcf
estimatehbd=true
markers=ural11.markers
out=ural11
missing=.
Exception in thread "main" java.lang.IllegalArgumentException: Error: Marker positions must be nondecreasing: marker rs150294240
at phaser.ag.<init>(Unknown Source)
at phaser.ag.a(Unknown Source)
at phaser.v.<init>(Unknown Source)
at phaser.PhaseMain.<init>(Unknown Source)
at phaser.PhaseMain.main(Unknown Source)

The manual of Beagle 3.3 says that the marker position must be given as cM for HBD detection (https://faculty.washington.edu/browning/beagle/beagle_3.3.2_31Oct11.pdf):

Each line of the marker file will contain four or more white-space delimited fields. The first field is the marker identifier. The second field is the marker position. The remaining fields are the marker alleles. The markers must be given in chromosomal order. If you are performing HBD/IBD detection, the position of each marker must be given using the centiMorgan scale. If you are not performing HBD/IBD detection, the marker positions on a chromosome can be base positions, genetic positions, or sequential integers. The fastIBD algorithm does not require cM marker positions

In the Beagle file given in Example 1 above, the corresponding markers file when marker positions are given in NCBI Build 35 coordinates is

rs2289311 79235661 A G
rs1248628 79236165 C T
rs10762764 79236371 G T

Lucas

10-28-2021, 01:30 PM

I made a while ago a statistics showing homozygous by descend figures in Eurasia. It is made by Beagle, which supports HBD figures as well as IBD. I have today more populations and it would be interesting to see more Near Easterners, because Iranians were rather high on this test (cousin marriages?). HBD figures common homozygous ancestry in a similar way than IBD common ancestry and is a better way to report inbreeding than individual ROH's.

http://terheninenmaa.blogspot.com/2018/05/homozygosity-by-descent-figures-in.html?m=0

Of course Iranian position is good. I advise you to include Pakistanis, Afghans and more Indians where cousin marriages are still very popular. I bet they definietly will be more.

https://en.wikipedia.org/wiki/Cousin_marriage

The prevalence of first-cousin marriage in Western countries has declined since the late 19th century and early 20th century. In the Middle East and South Asia, cousin marriage is still strongly favoured.

Lemminkäinen

10-28-2021, 04:01 PM

@Komintasavalta

In later versions it is a separate program. First run phasing using Beagle or any other program, generate VCF and the run the Beagle IBD-HBD program. It is downloadable from the same directory. I recommend to not use imputing, but run first plink with missing parameter and select data optimizing sample and SNP amounts.