cass
01-11-2026, 11:34 PM
PCA-Based Admixture Modeling in Vahaduo

Methodological description and statistical limitations


Abstract

PCA-based admixture tools such as Vahaduo (AdmixtureJS, https://vahaduo.github.io/vahaduo/) are widely used for exploratory analysis of genetic similarity. Although their mathematical procedures are valid, their results are often overinterpreted as literal reconstructions of ancestry. This article describes how Vahaduo operates from a statistical perspective and, drawing directly on established literature, outlines the fundamental limitations of PCA-based admixture models: non-identifiability, overfitting, and the neglect of chronological structure.

1. How Vahaduo (AdmixtureJS) operates

Vahaduo is a browser-based tool that operates exclusively on PCA-derived coordinates, most commonly Global25. Each population or individual is represented as a point in a multidimensional Euclidean space.

The algorithm approximates a target point as a weighted combination of reference points by minimizing numerical reconstruction error under non-negativity constraints. From a statistical standpoint, this procedure corresponds to constrained linear regression performed in PCA space. The method is purely geometric and does not model allele frequencies, demographic processes, or population history.
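The objective described above can be sketched in a few lines. This is only an illustration of the constrained least-squares problem, not Vahaduo's actual optimizer: the projected-gradient solver, the function names, the sum-to-one constraint on the weights, and the toy 2D coordinates are all assumptions made for the example.

```python
import math


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    running, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        running += ui
        t = (running - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def reconstruct(weights, sources):
    """Weighted combination of the source points."""
    dims = len(sources[0])
    return [sum(w * s[d] for w, s in zip(weights, sources)) for d in range(dims)]


def fit_mixture(target, sources, steps=2000):
    """Minimize ||reconstruction - target||^2 over non-negative weights summing to 1."""
    n = len(sources)
    weights = [1.0 / n] * n
    # Conservative step size: 1 / (sum of squared coordinates) bounds the curvature.
    step = 1.0 / sum(x * x for s in sources for x in s)
    for _ in range(steps):
        residual = [r - t for r, t in zip(reconstruct(weights, sources), target)]
        grad = [sum(r * x for r, x in zip(residual, s)) for s in sources]
        weights = project_to_simplex([w - step * g for w, g in zip(weights, grad)])
    return weights


# Toy example (hypothetical coordinates): the target is exactly 70% A + 30% B.
sources = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # A, B, C
target = [0.7, 0.3]
weights = fit_mixture(target, sources)
fit_error = math.dist(reconstruct(weights, sources), target)
print([round(w, 3) for w in weights], round(fit_error, 6))
```

Nothing in this objective knows what the points represent. Any set of coordinates that brings the reconstruction closer to the target is rewarded equally.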

Vahaduo reports both direct pairwise distances (“Distance to”) and admixture-based reconstruction error. Importantly, the optimization does not include intrinsic penalties for model complexity, genetic distance between sources and target, or chronological inconsistency.
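For comparison, the direct distance ranking needs no optimization at all. The sketch below assumes a plain Euclidean distance in PCA space; the reference names and coordinates are hypothetical and the function name is chosen only for this example.

```python
import math


def rank_by_distance(target, references):
    """Return (distance, name) pairs sorted from closest to farthest."""
    return sorted((math.dist(target, coords), name) for name, coords in references.items())


# Hypothetical two-dimensional coordinates, for illustration only.
references = {
    "ref_A": [0.10, 0.02],
    "ref_B": [0.12, 0.05],
    "ref_C": [0.30, -0.10],
}
target = [0.11, 0.03]

ranking = rank_by_distance(target, references)
for distance, name in ranking:
    print(f"{name}: {distance:.4f}")
```

Each distance depends only on the target and one reference, so adding or removing other references cannot change it. The same is not true of admixture weights.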

2. PCA as a descriptive, not historical, framework

Principal component analysis is a technique for dimensionality reduction and visualization, not a causal or historical model. As emphasized in the statistical literature:

“Principal component analysis is primarily a descriptive tool. Interpretation beyond data reduction and visualization must be approached with caution.”
(Jolliffe, Principal Component Analysis)

Accordingly, PCA space represents patterns of similarity rather than explicit population histories. Using PCA coordinates as inputs for admixture modeling therefore requires careful interpretive restraint.

3. Non-identifiability and overfitting in admixture models

Admixture models in PCA space inherit well-known statistical problems associated with high-dimensional correlated data. The authors of STRUCTURE explicitly note:

“Different combinations of population allele frequencies can explain the data almost equally well.”
(Pritchard et al., 2000)

This implies non-identifiability: multiple distinct solutions may fit the data similarly well. In the absence of regularization, the model may distribute weights across many reference populations, including distant ones, to marginally reduce error.
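A minimal geometric illustration of non-identifiability in PCA space, using made-up coordinates: four hypothetical sources sit at the corners of a square and the target sits at its center. Three quite different weight vectors reconstruct the target exactly.

```python
import math


def reconstruct(weights, sources):
    """Weighted combination of the source points."""
    dims = len(sources[0])
    return [sum(w * s[d] for w, s in zip(weights, sources)) for d in range(dims)]


# Hypothetical sources at the corners of a square; target at the center.
sources = [[1.0, 1.0], [1.0, -1.0], [-1.0, -1.0], [-1.0, 1.0]]
target = [0.0, 0.0]

candidate_models = {
    "diagonal 1": [0.5, 0.0, 0.5, 0.0],
    "diagonal 2": [0.0, 0.5, 0.0, 0.5],
    "all four": [0.25, 0.25, 0.25, 0.25],
}

errors = {
    label: math.dist(reconstruct(weights, sources), target)
    for label, weights in candidate_models.items()
}
for label, error in errors.items():
    print(f"{label}: error = {error}")
```

All three models have zero error, so the fit statistic alone cannot say which sources "really" contributed. With real, noisy coordinates the errors would differ slightly, and the optimizer would pick one solution for reasons unrelated to history.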

From a statistical learning perspective, lower error alone is not sufficient evidence of a better model:

“A model with lower training error is not necessarily better, as it may simply be fitting noise rather than signal.”
(Hastie, Tibshirani & Friedman, The Elements of Statistical Learning)

4. Why admixture components should not be read literally

The interpretation of admixture coefficients as direct ancestry proportions has been strongly criticized. Lawson et al. state unambiguously:

“ADMIXTURE bar plots do not represent literal ancestry proportions and should not be interpreted as such.”
(Lawson et al., Nature Communications, 2018)

They further emphasize:

“Population structure plots are statistical summaries of genetic variation, not direct representations of population history.”

In the context of Vahaduo, small or distant components often function as geometric correction terms in PCA space rather than biologically meaningful ancestry signals.

5. Chronology as a hidden source of bias

Although time is not explicitly represented in PCA coordinates, genetic data are inherently time-dependent. As noted by Pickrell and Pritchard:

“Genetic drift causes populations to move through allele frequency space over time, such that samples from different time periods cannot be treated as equivalent.”
(Pickrell & Pritchard, 2012)

Mixing samples from widely separated periods within a single admixture model implicitly assumes temporal equivalence, introducing structural bias and increasing variance. This further exacerbates overfitting and instability.
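One simple way to respect chronology is to restrict the reference panel to a single time slice before fitting. The sketch below assumes each reference carries an approximate mean date; the record layout, names, dates, and window width are hypothetical choices for illustration, not a feature of Vahaduo.

```python
def filter_by_period(references, target_year, max_gap_years):
    """Keep references whose mean date lies within max_gap_years of target_year."""
    return {
        name: record
        for name, record in references.items()
        if abs(record["year"] - target_year) <= max_gap_years
    }


# Hypothetical records: year is an approximate mean date (negative = BCE).
references = {
    "modern_ref": {"year": 2000, "coords": [0.10, 0.02]},
    "medieval_ref": {"year": 1100, "coords": [0.11, 0.03]},
    "bronze_age_ref": {"year": -1500, "coords": [0.09, 0.06]},
}

contemporary = filter_by_period(references, target_year=2000, max_gap_years=200)
print(sorted(contemporary))
```

The appropriate window depends on the question being asked. The point is only that the choice is made explicitly by the analyst rather than left to the optimizer.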

6. Methodological implications

Taken together, the statistical and population-genetic literature leads to consistent conclusions:

PCA-based admixture models are descriptive and exploratory, not inferential.

Low reconstruction error does not guarantee meaningful biological interpretation.

Lack of regularization promotes overfitting and fragmentation of signals.

Chronological structure must be explicitly respected.

Without these safeguards, admixture models risk transforming simple similarity patterns into artificial complexity driven by PCA geometry rather than by genuine biological history.




Case study:
In a modern Kashubian individual, direct PCA distance analysis consistently showed the strongest genetic similarity to Kashubian samples from Kartuzy and to immediately neighboring northern Polish populations, including other Kashubian subgroups and adjacent former German Pomeranian and Greater Poland regional samples. These groups had the smallest and most tightly clustered distance values. The distance-based similarities were stable across different reference panels and did not depend on any admixture modeling assumptions, indicating a clear and coherent local genetic signal. As emphasized by Jolliffe, “principal component analysis is primarily a descriptive tool”, and in this context direct distances provide the most straightforward statistical measure of genetic resemblance.

By contrast, an unrestricted PCA-based admixture model that ignored distance constraints decomposed this local Kashubian signal into a mixture dominated by Lithuanian reference populations and substantial French components, despite both groups exhibiting clearly larger direct distances to the target individual than the local populations. This outcome reflects the intrinsic non-identifiability of admixture models, since, as noted by Pritchard et al., “different combinations of population allele frequencies can explain the data almost equally well”. The model achieved a lower numerical reconstruction error by using Lithuanian and French populations as geometric correction points in PCA space rather than because of genuine genetic proximity.

When distance-aware constraints were applied, the admixture solution collapsed back toward Kashubian and immediately neighboring Polish populations, fully restoring consistency with the distance-based results. Crucially, this biologically coherent solution displayed a higher numerical reconstruction distance than the unrestricted model, demonstrating that in PCA-based admixture analyses the most meaningful biological interpretation often emerges only after rejecting the lowest-error, distance-blind fit. This case study illustrates that distance-based similarity captures true local genetic relationships, whereas admixture models that ignore distance can generate artificial long-range components driven by PCA geometry rather than by ancestry, consistent with the warning that “ADMIXTURE bar plots do not represent literal ancestry proportions” (Lawson et al.).
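The pattern in this case study can be reproduced with a toy example. The coordinates below are invented and are not the Kashubian data. The "distance-aware constraint" is implemented here as a simple cutoff on direct distance, which is an assumption, since the exact constraint used is not specified above. The block carries its own small projected-gradient solver so that it runs on its own.

```python
import math


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    running, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        running += ui
        t = (running - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def fit_mixture(target, refs, steps=3000):
    """Return (weights by name, fit distance) for a simplex-constrained least-squares fit."""
    names = list(refs)
    pts = [refs[n] for n in names]
    w = [1.0 / len(pts)] * len(pts)
    step = 1.0 / sum(x * x for p in pts for x in p)

    def recon(weights):
        return [sum(wi * p[d] for wi, p in zip(weights, pts)) for d in range(len(target))]

    for _ in range(steps):
        residual = [r - t for r, t in zip(recon(w), target)]
        grad = [sum(r * x for r, x in zip(residual, p)) for p in pts]
        w = project_to_simplex([wi - step * g for wi, g in zip(w, grad)])
    return dict(zip(names, w)), math.dist(recon(w), target)


# Hypothetical coordinates: two close "local" references and two distant ones.
refs = {
    "local_1": [0.0, 0.0],
    "local_2": [0.2, 0.0],
    "far_1": [1.0, 1.0],
    "far_2": [-1.0, 1.0],
}
target = [0.1, 0.05]

# Unrestricted fit: every reference is allowed.
free_w, free_err = fit_mixture(target, refs)

# Distance-aware fit: only references within a cutoff of the target (cutoff is an assumption).
cutoff = 0.5
nearby = {n: p for n, p in refs.items() if math.dist(p, target) <= cutoff}
local_w, local_err = fit_mixture(target, nearby)

print("unrestricted:", {n: round(v, 3) for n, v in free_w.items()}, round(free_err, 4))
print("distance-aware:", {n: round(v, 3) for n, v in local_w.items()}, round(local_err, 4))
```

In this toy setup the unrestricted model reaches a near-zero fit distance by assigning about 5% of the weight to the two distant references, while the distance-aware model uses only the local references and accepts a fit distance of about 0.05. The lower number comes from geometry alone, since the far points merely supply the small offset that the local points cannot.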