G25 scaled vs unscaled [Archive] - The Apricity Forum: A European Cultural Community

marco

10-17-2019, 10:29 PM

Can someone explain the difference between scaled and unscaled coordinates? I’ve noticed I’m closer to certain things in my unscaled coordinates than in my scaled ones.

Calpurnius

10-18-2019, 12:06 AM

Well, one would need to delve a bit into the mathematics of it, and I'm not even exactly sure how Eurogenes runs the PCA in practice, what software he is using and what parameters he set, but essentially "scaling" means that you multiply each variable by the square root of its variance. What this means in practice is that when "unscaled", the variance in the first global dimension, that is Africa-Eurasia, is set to 1, and the same goes for the second global dimension, West-East Eurasia. This also means that the Euclidean distances computed by these programs from unscaled coordinates are not as realistic.
For a practical example, if you take the typical extreme poles of world variation on a typical global PCA (Yoruba, Sardinians and Han Chinese) and compute the unscaled distances, you get something like:
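To make the scaling step concrete, here is a minimal Python sketch of the idea described above. The variances and coordinates are made-up illustrative numbers, not real G25 values, and the function names are hypothetical; it only assumes that "scaling" is a per-dimension multiplication by the square root of the variance.

```python
import math

def scale_coords(coords, variances):
    """Multiply each dimension by the square root of its variance."""
    return [x * math.sqrt(v) for x, v in zip(coords, variances)]

def euclidean(a, b):
    """Plain Euclidean distance between two coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical values, NOT real G25 data:
# dimension 1 carries 9 times the variance of dimension 2.
variances = [9.0, 1.0]
origin = [0.0, 0.0]
along_dim1 = [0.1, 0.0]  # differs from origin only in dimension 1
along_dim2 = [0.0, 0.1]  # differs from origin only in dimension 2

# Unscaled: both dimensions count equally, so the two distances match.
unscaled_d1 = euclidean(along_dim1, origin)
unscaled_d2 = euclidean(along_dim2, origin)

# Scaled: the high-variance dimension is stretched, so it dominates.
scaled_origin = scale_coords(origin, variances)
scaled_d1 = euclidean(scale_coords(along_dim1, variances), scaled_origin)
scaled_d2 = euclidean(scale_coords(along_dim2, variances), scaled_origin)

print("unscaled:", unscaled_d1, unscaled_d2)
print("scaled:  ", scaled_d1, scaled_d2)
```

In this toy setup the same offset counts three times as much in the first dimension once scaled, which is the kind of reweighting the explanation above refers to.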

Distance to: Yoruba
0.10343781 Han
0.10482554 Sardinian

Distance to: Han
0.09992067 Sardinian
0.10343781 Yoruba

Distance to: Sardinian
0.09992067 Han
0.10482554 Yoruba

So as you can see, the distance between the two Eurasians is about the same as the distance each of them has from the African population (only a very small difference of ~0.0035 in the case of Han-Sardinian vs Han-Yoruba, and ~0.005 for Han-Sardinian vs Sardinian-Yoruba), which is not what you would expect from the known phylogeny of humans, i.e. pure Eurasians forming a clade and thus being closer to each other than they are to Africans.
Instead, if you use scaled coords, you get something more realistic:

Distance to: Yoruba
0.77636788 Sardinian
0.84430462 Han

Distance to: Han
0.64350770 Sardinian
0.84430462 Yoruba

Distance to: Sardinian
0.64350770 Han
0.77636788 Yoruba
That is, Eurasians are closer together (0.64) than either of them is to Yoruba (0.77 and 0.84), and the gap is more significant numerically in this case. What this also means is that when modelling with unscaled coords, the sources that you expect to be distant are not considered "as" distant as they should be. That is probably the reason why distal ancestry like Yoruba can show up even in northern Europe at non-trivial values when using unscaled values, which is something more direct methods like qpAdm or ADMIXTURE have never found.
For example, here are models for Finns(!) using unscaled and scaled coordinates, each without and with Yoruba added:
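As a quick check of the numbers above, this small Python sketch divides the Han-Sardinian distance by each of the two distances to Yoruba, using only the values quoted in this post. The dictionary layout and the function name are just illustrative choices.

```python
# Distances copied from the listings above.
unscaled = {
    ("Han", "Sardinian"): 0.09992067,
    ("Han", "Yoruba"): 0.10343781,
    ("Sardinian", "Yoruba"): 0.10482554,
}
scaled = {
    ("Han", "Sardinian"): 0.64350770,
    ("Han", "Yoruba"): 0.84430462,
    ("Sardinian", "Yoruba"): 0.77636788,
}

def eurasian_to_african_ratios(distances):
    """Han-Sardinian distance divided by each distance that involves Yoruba."""
    within_eurasia = distances[("Han", "Sardinian")]
    return {
        pair: within_eurasia / dist
        for pair, dist in distances.items()
        if "Yoruba" in pair
    }

unscaled_ratios = eurasian_to_african_ratios(unscaled)
scaled_ratios = eurasian_to_african_ratios(scaled)

for label, ratios in (("unscaled", unscaled_ratios), ("scaled", scaled_ratios)):
    for pair, ratio in ratios.items():
        print(label, pair, round(ratio, 3))
```

A ratio close to 1 means the two Eurasians are almost as far from each other as from Yoruba. The unscaled ratios come out at roughly 0.95 to 0.97, the scaled ones at roughly 0.76 to 0.83.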

Unscaled:

Target: Finnish(no Yoruba in the sources)
Distance: 3.5017% / 0.03501687
Aggregated
59.6 Yamnaya_RUS_Samara
30.6 Anatolia_Barcin_N
9.8 WHG

Target: Finnish(Yoruba in the sources)
Distance: 3.4784% / 0.03478432
Aggregated
57.2 Yamnaya_RUS_Samara
28.4 Anatolia_Barcin_N
9.6 WHG
4.8 Yoruba

Scaled:

Target: Finnish(no Yoruba in the sources)
Distance: 6.5811% / 0.06581063
Aggregated
58.4 Yamnaya_RUS_Samara
23.0 Anatolia_Barcin_N
18.6 WHG

Target: Finnish(Yoruba in the sources)
Distance: 6.5811% / 0.06581063
Aggregated
58.4 Yamnaya_RUS_Samara
23.0 Anatolia_Barcin_N
18.6 WHG

So basically, as one can see, in the unscaled coords distal sources may improve the fit even if realistically they aren't there, while in scaled coords they are penalized and won't appear. I guess this may also have a downside: if distal ancestries are indeed present, they may be underestimated in the scaled coords.
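One general point behind the models above: adding a source can never make the best achievable distance worse, because the fit can always give the extra source 0%. The Python sketch below illustrates this with a brute-force grid search over mixture weights. It is not the algorithm the actual tool uses, and the coordinates and names (`best_fit`, `source_a`, and so on) are made up for illustration.

```python
import math
from itertools import product

def mix(weights, sources):
    """Weighted average of the source coordinates."""
    dims = len(sources[0])
    return [sum(w * s[d] for w, s in zip(weights, sources)) for d in range(dims)]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_fit(target, sources, steps=20):
    """Try every weight combination in multiples of 1/steps that sums to 1."""
    best = None
    for combo in product(range(steps + 1), repeat=len(sources) - 1):
        rest = steps - sum(combo)
        if rest < 0:
            continue  # weights would exceed 100%
        weights = [c / steps for c in combo] + [rest / steps]
        d = distance(mix(weights, sources), target)
        if best is None or d < best[0]:
            best = (d, weights)
    return best

# Hypothetical 2D coordinates, not real populations.
target = [0.5, 0.2]
source_a = [0.0, 0.0]
source_b = [1.0, 0.0]
extra_source = [0.5, 1.0]

d_without, w_without = best_fit(target, [source_a, source_b])
d_with, w_with = best_fit(target, [source_a, source_b, extra_source])

print("without extra source:", d_without, w_without)
print("with extra source:   ", d_with, w_with)
```

So a slightly lower distance after adding a source, like the 3.5017% vs 3.4784% above, is not enough on its own to say the extra ancestry is real.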

Impaler

10-18-2019, 12:10 AM

How about the penalty? Default is better than pen=0?

Calpurnius

10-18-2019, 12:35 AM

As I understand it, the penalty is indeed supposed to penalize samples that have a big distance from the target, but that's sort of the problem: if the (Euclidean) distances themselves are unrealistic, then multiplying by a penalty isn't going to help much, I guess. Indeed, I just tried running the same model above with pen_def (0.001), pen_def*10, pen_def*100 and pen_def*1000, and it actually gets worse, because Yoruba is closer to Finns than it is to WHG(!), which is obviously just plain wrong.
It may help with the scaled coords though.

# penalty is squared distance of sample to target
# objective function =
# squared dist of batch mean to target + coef*penalty
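Read literally, the comments above could be sketched in Python like this. It is only my reading of those three comment lines: averaging the penalty over the batch is an assumption, the function names are hypothetical, and the coordinates are toy values.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def objective(batch, target, coef):
    """Squared distance of the batch mean to the target, plus coef * penalty.

    Assumption: the penalty is the squared distance of each sample to the
    target, averaged over the batch.
    """
    dims = len(target)
    mean = [sum(sample[d] for sample in batch) / len(batch) for d in range(dims)]
    penalty = sum(sq_dist(sample, target) for sample in batch) / len(batch)
    return sq_dist(mean, target) + coef * penalty

# Toy example: both batches average exactly onto the target,
# but one is built from near samples and the other from far ones.
target = [0.0, 0.0]
near_batch = [[0.1, 0.0], [-0.1, 0.0]]
far_batch = [[1.0, 0.0], [-1.0, 0.0]]

for coef in (0.0, 0.001):  # pen=0 and the default of 0.001 mentioned above
    print(coef, objective(near_batch, target, coef), objective(far_batch, target, coef))
```

With coef 0 the two batches score the same, while any positive coef favours the batch of nearby samples. That only helps if the distances themselves are meaningful, which is the concern with the unscaled coords.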

Calpurnius

10-18-2019, 12:36 AM

So I mean, in general using scaled coords seems preferable, though maybe in the reverse situation, when trying to model some target using very closely related sources, unscaled could be useful. I'm not sure.

Wend-Kruzek

07-18-2023, 10:07 PM

Hi.
For me, unscaled is good for research.
Scaled packs more things into one, and then the resolution is in... the... Yes, there are responses like this; this must be taken into account, in my mind.

Thanks, have a nice day.