The popularity of Oracles is nothing new. You take a test based on K clusters, and the Oracle then compares the distance between your percentages and those of a given ethnic group, or of admixtures of ethnic groups, basically giving an accurate result of your ancestry, right? WRONG!
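To make the distance comparison concrete, here is a minimal sketch of how an Oracle-style ranking could work. It assumes plain Euclidean distance between cluster percentages; the population names, the numbers, and the functions `distance` and `oracle` are all made up for illustration and are not taken from any real calculator.

```python
import math

# Hypothetical reference averages (percent per cluster); NOT real data.
REFERENCE = {
    "Population A": {"Amerindian": 60.0, "European": 35.0, "East Asian": 5.0},
    "Population B": {"Amerindian": 10.0, "European": 85.0, "East Asian": 5.0},
    "Population C": {"Amerindian": 5.0, "European": 5.0, "East Asian": 90.0},
}


def distance(a, b):
    """Euclidean distance between two cluster-percentage profiles."""
    clusters = set(a) | set(b)
    return math.sqrt(sum((a.get(c, 0.0) - b.get(c, 0.0)) ** 2 for c in clusters))


def oracle(result, reference=REFERENCE):
    """Rank reference populations from closest to farthest."""
    ranked = sorted((distance(result, pcts), name) for name, pcts in reference.items())
    return [(name, round(d, 2)) for d, name in ranked]


# Hypothetical test result for one person.
my_result = {"Amerindian": 55.0, "European": 40.0, "East Asian": 5.0}
ranking = oracle(my_result)
for name, d in ranking:
    print(name, d)
```

Note that the ranking only ever sees the final percentages, never the SNPs behind them.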
PART I: THE SNP PROBLEM
SNPs (single-nucleotide polymorphisms) are variations that occur at a specific position in our DNA and involve only a single nucleotide. Scientists have found millions of SNPs, but since sequencing your entire genome is really expensive, ancestry companies usually limit the number of SNPs they test, the average being between 600 and 800 thousand. This makes the test less accurate, but also more affordable.
Now, the real problem begins when you download your raw data and upload it to external calculators that use fewer SNPs than your raw data contains, meaning that a number that was already tiny compared to the total number of SNPs becomes even smaller. For example, at least in my experience, calculators from GEDmatch used approximately 50 thousand SNPs out of the 700 thousand in my raw data. That's roughly 14x fewer!
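The ratio can be checked directly from the approximate figures in my example (50 thousand SNPs used out of 700 thousand). The second part is a sketch of how you could count the overlap yourself, assuming you have the SNP IDs from your raw data and from a calculator's SNP list; the `coverage` function and the example IDs are hypothetical.

```python
raw_snps = 700_000        # approximate SNPs in the raw data file
calculator_snps = 50_000  # approximate SNPs the calculator actually used

fraction_used = calculator_snps / raw_snps  # share of the raw data that is used
reduction = raw_snps / calculator_snps      # how many times fewer SNPs

print(f"{fraction_used:.1%} of the raw data is used ({reduction:.0f}x fewer SNPs)")


def coverage(raw_ids, calculator_ids):
    """Fraction of the calculator's SNPs that are present in the raw data."""
    calculator_ids = set(calculator_ids)
    if not calculator_ids:
        return 0.0
    return len(calculator_ids & set(raw_ids)) / len(calculator_ids)


# Made-up IDs, only to show the call.
example = coverage(["snp1", "snp2", "snp3"], ["snp2", "snp3", "snp4", "snp5"])
```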
PART II: THE NUMBER PROBLEM
OK, ignoring the SNP problem, there's another huge problem with Oracles: the number problem. Going back to how the Oracle works, it compares the numbers your test gives you with the numbers that ethnic groups usually get from the same test, but the problem is that IT'S NOT ALWAYS THE SAME SNPS, JUST THE SAME CLUSTER.
Imagine we have two SNPs: one is called R and the other is S.
- R could be either (T)hymine or (A)denine, and so could S.
- R with (T) and S with (T) are both labeled as part of the "Amerindian" cluster.
- S with (T) is present in East Asian ethnic groups as well as Amerindian ethnic groups.
If we make a test using only 4 SNPs, for the sake of simplicity, and two of them are R and S, then an individual who has R with (T) will score the same 25% "Amerindian" as an individual who has S with (T), TOTALLY ignoring the fact that S with (T) is also present in East Asians.
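The toy example above can be sketched in a few lines. This mirrors my simplification (one allele per SNP, each SNP counting equally) and is not how a real admixture calculator assigns clusters; the filler SNPs U and V, the "Other" label, and the function `cluster_percentages` are made up for illustration.

```python
# Which allele at which SNP counts toward which cluster (toy labels).
CLUSTER_OF = {
    ("R", "T"): "Amerindian",
    ("S", "T"): "Amerindian",  # also found in East Asians, but the label hides that
}
PANEL = ["R", "S", "U", "V"]  # U and V are hypothetical filler SNPs


def cluster_percentages(genotype):
    """Percentage of panel SNPs assigned to each cluster for one individual."""
    counts = {}
    for snp in PANEL:
        cluster = CLUSTER_OF.get((snp, genotype[snp]), "Other")
        counts[cluster] = counts.get(cluster, 0) + 1
    return {cluster: 100 * n / len(PANEL) for cluster, n in counts.items()}


person_1 = {"R": "T", "S": "A", "U": "A", "V": "A"}  # has R with (T)
person_2 = {"R": "A", "S": "T", "U": "A", "V": "A"}  # has S with (T)

print(cluster_percentages(person_1))
print(cluster_percentages(person_2))
```

Both individuals end up with the same cluster percentages even though they differ at both R and S, and an Oracle that only sees those percentages cannot tell them apart.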
PART III: SOLUTION?
I'm coding a calculator that will not have those problems. There's no firm release date yet, as it is still in the initial phase, but I will post a thread when it becomes a reality.
SOURCES:
https://medlineplus.gov/genetics/und...cresearch/snp/
https://en.m.wikipedia.org/wiki/K-means_clustering
https://beholdgenealogy.com/blog/?p=2700
EDIT: I'm also going to make this thread a Dev Log for the calculator I'm developing.