The Oracle Question



Mont
10-15-2021, 10:27 PM
Oracles have been popular for a long time. You get the results of a test based on K clusters, and the Oracle then measures the distance between your percentages and those of a given ethnic group, or of a mixture of ethnic groups, which supposedly gives an accurate picture of your ancestry, right? WRONG!
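
To make the comparison step concrete, here is a minimal sketch of an Oracle-style ranking. It assumes a plain Euclidean distance over the cluster percentages; the population names and values are made up for illustration and are not taken from any real calculator.

```python
import math

def oracle_distance(sample, reference):
    """Euclidean distance between two equal-length lists of cluster percentages."""
    if len(sample) != len(reference):
        raise ValueError("both profiles must use the same K clusters")
    return math.sqrt(sum((s - r) ** 2 for s, r in zip(sample, reference)))

def rank_populations(sample, references):
    """Return (population, distance) pairs sorted from closest to farthest."""
    scored = [(name, oracle_distance(sample, profile))
              for name, profile in references.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Toy K=3 reference profiles (percentages). Hypothetical values only.
references = {
    "Population A": [80.0, 15.0, 5.0],
    "Population B": [40.0, 40.0, 20.0],
    "Population C": [10.0, 20.0, 70.0],
}
sample = [75.0, 20.0, 5.0]

ranking = rank_populations(sample, references)
for name, dist in ranking:
    print(f"{name}: {dist:.2f}")
```

The ranking only ever sees the percentages, so anything lost when the percentages were computed stays lost at this step.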

PART I: THE SNP PROBLEM
SNPs are variations at a single position in our DNA, each involving only one nucleotide. Scientists have found millions of them, but because sequencing your entire genome is really expensive, ancestry companies usually test only a limited set, typically between 600 and 800 thousand SNPs. This makes the test less accurate, but also more affordable.
Now, the real problem begins when you download your raw data and upload it to external calculators that use fewer SNPs than your raw data contains, so a number that was already tiny compared to the total number of SNPs becomes even smaller. For example, at least in my experience, calculators from GEDmatch used approximately 50 thousand of the 700 thousand SNPs in my raw data; that's roughly 14 times fewer!
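
A rough sketch of how the overlap could be measured, assuming both the raw file and the calculator's panel can be reduced to lists of SNP identifiers. The function name and the toy identifiers are hypothetical; a real raw file would need to be parsed first.

```python
def snp_coverage(raw_rsids, panel_rsids):
    """Return (shared_count, fraction_of_raw_file_used) for two collections of SNP IDs."""
    raw = set(raw_rsids)
    shared = raw & set(panel_rsids)
    fraction = len(shared) / len(raw) if raw else 0.0
    return len(shared), fraction

# Toy identifiers, not real rsIDs: 14 SNPs in the raw file, 1 of them in the panel.
raw_file = [f"rs{i}" for i in range(1, 15)]
panel = ["rs3", "rs900"]

shared, fraction = snp_coverage(raw_file, panel)
print(shared, round(fraction, 4))

# The figures from the post: about 50,000 SNPs used out of about 700,000.
reduction_factor = 700_000 / 50_000
print(reduction_factor)  # 14.0
```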

PART II: THE NUMBER PROBLEM
OK, setting the SNP problem aside, Oracles have another huge problem: the number problem. Recall how the Oracle works: it compares the numbers you get from the test with the numbers that ethnic groups usually get. The problem is that two people with the same number did not necessarily get it from the same SNPs: IT'S NOT ALWAYS THE SAME SNPS, JUST THE SAME CLUSTER.

Imagine we have two SNPs: one is called R and the other is S.
- R can be either (T)hymine or (A)denine, and so can S.
- R with (T) and S with (T) are both labeled as part of the "Amerindian" cluster.
- S with (T) is present in East Asian ethnic groups as well as Amerindian ethnic groups.

If, for the sake of simplicity, we build a test using only 4 SNPs, two of which are R and S, then an individual who has R with (T) will score the same 25% "Amerindian" as an individual who has S with (T), TOTALLY ignoring the fact that the (T) of S is also present in East Asians.
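
The toy example can be written out as a short sketch. It simplifies to one allele per SNP, and the two extra SNPs U and V are hypothetical placeholders added only to reach a panel of four.

```python
# Labelling from the example: T at R and T at S both count toward "Amerindian".
CLUSTER_ALLELES = {"R": "T", "S": "T"}
# U and V are hypothetical filler SNPs with no cluster-labelled allele.
PANEL = ["R", "S", "U", "V"]

def amerindian_score(genotype):
    """Percent of panel SNPs where the individual carries the cluster-labelled allele."""
    hits = sum(
        1 for snp in PANEL
        if snp in CLUSTER_ALLELES and genotype.get(snp) == CLUSTER_ALLELES[snp]
    )
    return 100.0 * hits / len(PANEL)

person_1 = {"R": "T", "S": "A", "U": "A", "V": "A"}  # carries T at R only
person_2 = {"R": "A", "S": "T", "U": "A", "V": "A"}  # carries T at S only

print(amerindian_score(person_1), amerindian_score(person_2))
```

Both individuals end up with the same score even though they match on different SNPs, and the score alone cannot tell which SNP produced it.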

PART III: SOLUTION?
I'm coding a calculator that will not have those problems. There is no firm release date yet, as it is still in the initial phase, but I will post a thread when it becomes a reality.

SOURCES:
https://medlineplus.gov/genetics/understanding/genomicresearch/snp/
https://en.m.wikipedia.org/wiki/K-means_clustering
https://beholdgenealogy.com/blog/?p=2700

EDIT: I'm also going to make this thread a Dev Log for the calculator I'm developing.

Gallop
10-15-2021, 10:40 PM
I sensed that something was wrong


Ok, keep us posted.

Mont
10-15-2021, 11:19 PM
I will.

Mont
10-17-2021, 11:04 PM
DEV LOG #1

- The calculator already has a name: "cHenry" (c for calculator, and Henry because I like that name). That's the current name, but I may change it in the future.

- According to PCA plots of genomes from ethnic groups around the world, plotting the first two eigenvectors gives a triangular shape. As a starting point, I will use that triangle as a reference and have the calculator compute your % between Caucasoid, Caucaso-Mongoloid, Mongoloid, Negro-Mongoloid, Negroid and Caucaso-Negroid.
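
One possible way to turn a position inside such a triangle into percentages is barycentric coordinates. This is only a sketch of that idea and not necessarily how the calculator will work; the vertex coordinates below are arbitrary placeholders rather than real PCA values.

```python
def barycentric_percentages(point, a, b, c):
    """Express a 2-D point as percentage weights of the triangle vertices a, b, c."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = point, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        raise ValueError("the three vertices are collinear")
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    wc = 1.0 - wa - wb
    return tuple(100.0 * w for w in (wa, wb, wc))

# Placeholder triangle in (PC1, PC2) space; real vertices would come from reference samples.
vertex_a, vertex_b, vertex_c = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)
sample_point = (0.25, 0.25)

weights = barycentric_percentages(sample_point, vertex_a, vertex_b, vertex_c)
print(weights)
```

The three weights always sum to 100. A point that falls outside the triangle gets a negative weight, which a real calculator would have to handle somehow.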

Lucas
10-18-2021, 09:15 AM
Now, the real problem begins when you download your raw data and upload it to external calculators that use fewer SNPs than your raw data contains, so a number that was already tiny compared to the total number of SNPs becomes even smaller. For example, at least in my experience, calculators from GEDmatch used approximately 50 thousand of the 700 thousand SNPs in my raw data; that's roughly 14 times fewer!


Lol. The problem is not GEDmatch but your new raw file: with every new chip version (23andMe v5, for example) there are fewer and fewer SNPs compatible with GEDmatch-based calcs. My old FTDNA raw file has between 150,000 and 200,000 compatible SNPs, depending on the calculator.

oszkar07
10-18-2021, 09:29 AM
Despite all that you have said, many users here still find that the free calculators that take raw data and show Oracles are often reasonably accurate with respect to people's known ancestry.

Many users feel they get more information and accuracy from these calculators than from some of the commercial companies.

The commercial companies' results can vary so much between updates that, for the same person, the ethnicity estimate sometimes changes significantly from one update to the next. Sometimes an update does not make any sense at all.
In theory your argument makes sense, but when we compare the actual results people get for their known ancestry from commercial companies and from online calculators, the online calcs are often better.

Mont
10-19-2021, 04:16 PM
Lol. The problem is not GEDmatch but your new raw file: with every new chip version (23andMe v5, for example) there are fewer and fewer SNPs compatible with GEDmatch-based calcs. My old FTDNA raw file has between 150,000 and 200,000 compatible SNPs, depending on the calculator.

That's why I said it was a problem I have; I didn't assume it was the same for other people.

Mont
10-19-2021, 04:21 PM
Despite all that you have said, many users here still find that the free calculators that take raw data and show Oracles are often reasonably accurate with respect to people's known ancestry.

Many users feel they get more information and accuracy from these calculators than from some of the commercial companies.

The commercial companies' results can vary so much between updates that, for the same person, the ethnicity estimate sometimes changes significantly from one update to the next. Sometimes an update does not make any sense at all.
In theory your argument makes sense, but when we compare the actual results people get for their known ancestry from commercial companies and from online calculators, the online calcs are often better.

My argument is not in favor of commercial calculators and against GEDmatch calculators; I'm only critiquing the GEDmatch calculators and pointing out the disadvantages I noticed.

SouthDutch7991
10-19-2021, 04:47 PM
This has been a big problem for me because of how different my two commercial tests were. In general, my AncestryDNA kit scores significantly differently from my FTDNA kit on GEDmatch-generated Oracles.

ANCESTRYk13,46.93,23.17,16.83,5.21,3.93,0.7,0,0.33,0.49,0.51,0.48,0,1.42
SNPs used in this evaluation: 170544.



FTDNAk13,46.32,23.3,16.17,4.01,5.89,1.19,0.00,0.73,0.23,0.86,0.00,0.00,1.3
SNPs used in this evaluation: 77936.

Mont
10-19-2021, 11:40 PM
If you want a "more accurate" score, I recommend comparing both raw data files, checking whether each one contains SNPs that the other lacks, and then taking a weighted average for a final score.
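
As a rough illustration of the weighted-average idea, the sketch below combines the two K13 profiles posted above. It weights each kit by the number of SNPs used in its evaluation, which is an assumption; weighting by the SNPs unique to each file would require the raw data itself.

```python
def weighted_average(profile_a, weight_a, profile_b, weight_b):
    """Component-wise weighted mean of two equal-length percentage profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same number of components")
    total = weight_a + weight_b
    return [(a * weight_a + b * weight_b) / total
            for a, b in zip(profile_a, profile_b)]

# Values and SNP counts copied from the two kits posted in the thread.
ancestry_k13 = [46.93, 23.17, 16.83, 5.21, 3.93, 0.7, 0, 0.33, 0.49, 0.51, 0.48, 0, 1.42]
ftdna_k13 = [46.32, 23.3, 16.17, 4.01, 5.89, 1.19, 0.00, 0.73, 0.23, 0.86, 0.00, 0.00, 1.3]
ancestry_snps = 170544
ftdna_snps = 77936

combined = weighted_average(ancestry_k13, ancestry_snps, ftdna_k13, ftdna_snps)
print(",".join(f"{value:.2f}" for value in combined))
```

Because the AncestryDNA evaluation used more than twice as many SNPs, the combined profile sits closer to that kit's values.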

Mont
11-14-2021, 03:26 PM
DEV LOG #2

- I have been very busy recently, so not much was achieved, other than switching from Windows to Linux, as it has more applications that are useful for creating calculators.

Mont
01-25-2022, 09:22 PM
DEV LOG #3

- The "cHenry" name sounded childish; it was a bad idea.

- The calculator will mainly focus on three components named Western, Eastern and Basal. They are basically the same as Caucasoid, Mongoloid and Negroid, respectively, but with less biased terminology.

- The process of choosing the datasets for the calculator has already started; we will publish it soon (yes, there is now a team working on the calculator).

Mont
01-25-2022, 09:29 PM
Also, from now on I will only update those who said they wanted updates.