Quote:
Statistical phasing
It is not always possible to obtain trios for phasing and, even if it were, it is not economical or computationally feasible to phase large trio datasets. Sophisticated statistical algorithms have been developed which phase the data based on allele frequencies derived from reference populations. A number of programs are available such as Beagle and FastIBD. Phasing can be done with a high degree of accuracy if large enough reference cohorts are available which are representative of the populations being studied. However, with genotype data the current methodologies are not able to reliably phase small segments under 5 cM. One study reported a false positive rate of over 67% for 2-4 cM segments when compared with trios.[2]
Statistical or population-based phasing works because our DNA is all very similar and because it's passed on in chunks. Think of it like trying to read a sentence when some of the letters are missing. There are only so many combinations that will fit in the available spaces. If you saw these words:
R-d is my f-v--r-t- c-l--r
You would probably be able to work out that the sentence should read:
Red is my favourite colour
There are regional variations in the "sentences" but even if there were a couple of "deletions" you'd still be able to work it out:
Red is my favorite color
Difficulties arise when you have a short word without the context of a full sentence. R-d on its own could be red, rid, or rod.
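To make the analogy concrete, here is a toy sketch in Python. It is only an illustration of the missing-letters idea, not how Beagle or any real phasing program works. The word list REFERENCE_WORDS and the function names are made up for this example; the list stands in for a reference population.

```python
# Hypothetical mini "reference panel" of known words (an assumption for this example).
REFERENCE_WORDS = ["red", "rid", "rod", "is", "my", "favourite", "colour"]


def fits(pattern, word):
    """True if word matches pattern, where '-' marks a missing letter."""
    pattern = pattern.lower()
    return len(pattern) == len(word) and all(
        p == "-" or p == c for p, c in zip(pattern, word)
    )


def candidates(pattern, reference=REFERENCE_WORDS):
    """Return every reference word that fits the pattern."""
    return [word for word in reference if fits(pattern, word)]


# Long patterns leave little room for doubt; short ones stay ambiguous.
print(candidates("f-v--r-t-"))  # ['favourite']
print(candidates("c-l--r"))     # ['colour']
print(candidates("R-d"))        # ['red', 'rid', 'rod']
```

The long words have only one fit in the reference list, while the short one has three. That is the same reason short segments are hard to phase reliably.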
Quote:
Genetic genealogy companies
The raw genotype data generated by the Illumina microarray chips used for the autosomal DNA tests from the genetic genealogy companies is unphased and therefore does not distinguish the alleles on the maternal and paternal chromosomes. Customers who download their raw data file will observe that in the genotype column there are two DNA letters for each SNP. These letters are unsorted and could have come from either parent.
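To show what "unphased" means in practice, the sketch below lists every pair of parental strands that is consistent with a few unphased genotypes. The function name possible_haplotype_pairs and the three sample genotypes are invented for illustration and do not come from any company's file format or software.

```python
from itertools import product


def possible_haplotype_pairs(genotypes):
    """List every distinct pair of haplotypes consistent with unphased genotypes.

    Each genotype is a two-letter string such as "AG"; the order of the
    letters carries no information about which parent they came from.
    """
    pairs = set()
    # At each SNP either letter could sit on either chromosome.
    orderings = [[(g[0], g[1]), (g[1], g[0])] for g in genotypes]
    for choice in product(*orderings):
        hap1 = "".join(a for a, _ in choice)
        hap2 = "".join(b for _, b in choice)
        # Sort so that (hap1, hap2) and (hap2, hap1) count as the same pair.
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


# Three made-up SNPs: two heterozygous, one homozygous.
print(possible_haplotype_pairs(["AG", "CT", "TT"]))
# [('ACT', 'GTT'), ('ATT', 'GCT')]
```

With two heterozygous SNPs there are already two possible arrangements, and the number doubles with each additional heterozygous SNP. Phasing is the job of working out which arrangement is the real one.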
AncestryDNA and MyHeritage DNA are currently the only two companies which phase the data before assigning matches. Ancestry has developed its own phasing algorithm known as Underdog. The technical details are provided in the AncestryDNA Matching White Paper. They claim to have an error rate of under 1% and the error rate improves as the size of the training reference dataset increases. As of the beginning of 2016, AncestryDNA uses a reference panel of more than 300,000 genotypes. The details of MyHeritage DNA's phasing are given in their blog post on major updates and improvements to MyHeritage DNA matching. See also the presentation given by Yaniv Erlich, MyHeritage DNA's Chief Scientific Officer, at RootsTech 2018: MyHeritage DNA 101: from test to results.
Note, however, that if you download the raw data from AncestryDNA or MyHeritage to upload to third-party sites you will receive a file of unphased data.
The 23andMe test and the Family Finder test from Family Tree DNA do not phase the data before assigning matches. However, 23andMe uses statistical phasing for their Ancestry Composition. If one or both parents have been tested at 23andMe, Ancestry Composition can determine which ancestral segments have been inherited from each parent. For a detailed explanation see the 23andMe article on The phasing process.