Ancestry comparison between non-Hispanic Whites, African Americans, and Hispanics (2020)
https://www.biorxiv.org/content/10.1...755v1.full.pdf
Population structure and pharmacogenomic risk stratification in the United States
Materials and Methods
Study Cohort
Self-identified race and ethnicity (SIRE) information and whole genome genotypes for Americans over the age of 50 and their spouses were collected as part of a nationally representative longitudinal panel study called the Health and Retirement Study (HRS) [33]. For the current study, only HRS participants with both SIRE and genotype information were considered (8,912 participants). The 284 participants who did not identify with one of the three largest racial/ethnic categories in the HRS data – non-Hispanic White (5,927), non-Hispanic Black (1,527), and Hispanic/Latino of any race (1,174) – were excluded from this analysis. This yielded a total of 8,628 individuals in our final analysis cohort.
Results
Self-identified race/ethnicity (SIRE) and Genetic Ancestry (GA) in the US
We compared SIRE to GA for a cohort of 8,628 individuals characterized as part of the Health and Retirement Study (HRS), for whom both SIRE information and whole genome genotypes were available (Table 1). HRS participants self-identified according to racial and ethnic labels defined by the US Government Office of Management and Budget (OMB). OMB defines five racial groups and two ethnic groups to assess disparities in health and environmental risks [45]. HRS participants were asked to select one or more race categories and a single ethnic designation as Hispanic/Latino or not. We considered the race and ethnicity selections together and focused on the three largest categories in the HRS cohort: non-Hispanic White (5,927; 68.7%), non-Hispanic Black (1,527; 17.7%), and Hispanic/Latino of any race (1,174; 13.6%). We refer to these three groups here as White, Black, and Hispanic. The percentages of each SIRE group in the HRS cohort resemble the demographics of the US: White=72.4%, Black=12.6%, and Hispanic=16.3% [45].
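The cohort arithmetic quoted above can be checked in a few lines. This is only a sanity check on the counts given in the text; the variable names are illustrative and not taken from the paper.

```python
# Counts as reported in the quoted text.
with_sire_and_genotypes = 8912
excluded = 284
sire_counts = {"White": 5927, "Black": 1527, "Hispanic": 1174}

# Participants remaining after the exclusion step.
final_cohort = with_sire_and_genotypes - excluded

# The three retained SIRE groups should account for the whole final cohort.
assert final_cohort == sum(sire_counts.values())

# Share of each SIRE group, rounded to one decimal place as in the text.
sire_percent = {
    group: round(100 * n / final_cohort, 1) for group, n in sire_counts.items()
}

print(final_cohort)
print(sire_percent)
```

The rounded shares agree with the 68.7%, 17.7%, and 13.6% figures reported in the text.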
Continental ancestry profiles were inferred for members of the HRS cohort by comparing their whole genome genotypes to whole genome sequence and genotype data for reference populations from Europe, Africa, and the Americas as described in the Materials and Methods. Each HRS participant was assigned European, African, and Native American ancestry proportions, and the resulting ancestry profiles were then clustered into three distinct (non-overlapping) GA groups using k-means clustering. GA groups were defined without reference to SIRE group labels, using unsupervised clustering on continental ancestry fractions alone, and the choice to cluster ancestry profiles into three groups was made to allow for direct comparison with the three SIRE groups and in light of known patterns of continental ancestry in the US [46]. Permutation analysis was used to confirm the stability of the resulting GA groups and their robustness to changes in sample size (Supplementary Figure 1). The distributions of continental ancestry fractions were compared for the three SIRE groups – White, Black, and Hispanic – and the three GA groups (Figure 1).
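To make the clustering step concrete, here is a minimal sketch of k-means on (European, African, Native American) ancestry fractions. It is not the authors' pipeline: the profiles are synthetic toy data, the cluster centers are loosely based on the group averages quoted in this post (with the unreported components filled in by assumption), and the deterministic farthest-first initialization is a choice made for this example. All function names are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two ancestry profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first point, then repeatedly add the point farthest from all chosen seeds."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm: assign to nearest centroid, then re-average."""
    centroids = farthest_first_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

def make_toy_profiles(center, n, rng, jitter=0.02):
    """Synthetic profiles near `center`, renormalized so fractions sum to 1."""
    profiles = []
    for _ in range(n):
        raw = [max(0.0, c + rng.uniform(-jitter, jitter)) for c in center]
        total = sum(raw)
        profiles.append(tuple(x / total for x in raw))
    return profiles

# Toy centers (European, African, Native American). Illustrative values only.
toy_centers = [(0.99, 0.005, 0.005), (0.17, 0.82, 0.01), (0.60, 0.03, 0.37)]

rng = random.Random(0)
profiles = []
for center in toy_centers:
    profiles.extend(make_toy_profiles(center, 50, rng))

# Cluster on ancestry fractions alone, with no group labels supplied.
labels, centroids = kmeans(profiles, k=3)
print(centroids)
```

On data this cleanly separated, each synthetic group lands in its own cluster. Real ancestry profiles are far more continuous, which is why the paper reports a permutation analysis of cluster stability.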
The three objectively defined GA groups appear to correspond well to the SIRE groups, with respect to the distributions of individuals’ continental ancestry fractions (Figure 1 – top row). GA groups 1, 2, and 3 correspond to the White, Black, and Hispanic SIRE groups, respectively. The distributions of continental ancestry fractions for the SIRE and their corresponding GA groups are compared in Supplementary Figure 2.
Despite the apparent similarity between SIRE and GA, ternary plots underscore the broader distribution of ancestry fractions within SIRE groups compared to the non-overlapping GA groups delineated by k-means clustering (Figure 1 – middle row). This is especially true for the Hispanic group, consistent with the fact that it may include individuals who identify as any race. Overall, SIRE and the GA groups show similar average continental ancestry percentages: White/Group 1 show ~99% European ancestry, Black/Group 2 show ~82% African ancestry, and Hispanic/Group 3 show predominantly European ancestry (~60%) with the highest levels of Native American ancestry (~37%) and the greatest variance in continental ancestry for any of the three groups. The correspondence between the SIRE and GA groups was quantified by characterizing the overlap of membership assignments across the two groupings (Supplementary Figure 3). Overall, individuals’ membership in the three SIRE and corresponding GA groups show 96.2% concordance. The highest concordance is seen for the White/Group 1 pair, followed by Black/Group 2, with Hispanic/Group 3 showing the lowest concordance. The levels of concordance vary according to which grouping system is taken as the reference for comparison. This distinction is most obvious for the Hispanic/Group 3 pairing: 96.6% of Group 3 members self-identify as Hispanic, while only 77.1% of self-identified Hispanics fall into Group 3.
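The point about concordance depending on the reference grouping is easy to see with a small cross-tabulation. The sketch below uses made-up toy counts, not the paper's data, and hypothetical function names. It only illustrates why "share of Group 3 that is Hispanic" and "share of Hispanics in Group 3" can differ.

```python
# SIRE-to-GA correspondence as stated in the text.
GA_FOR_SIRE = {"White": 1, "Black": 2, "Hispanic": 3}

# Toy cross-tabulation: rows are SIRE groups, columns are GA groups.
# These counts are invented for illustration only.
toy_counts = {
    "White": {1: 98, 2: 0, 3: 2},
    "Black": {1: 1, 2: 97, 3: 2},
    "Hispanic": {1: 20, 2: 4, 3: 76},
}

def overall_concordance(counts):
    """Fraction of all individuals whose SIRE and GA groups correspond."""
    total = sum(sum(row.values()) for row in counts.values())
    matched = sum(counts[s][GA_FOR_SIRE[s]] for s in counts)
    return matched / total

def share_of_sire_in_ga(counts, sire):
    """SIRE as reference: fraction of a SIRE group in its matching GA group."""
    return counts[sire][GA_FOR_SIRE[sire]] / sum(counts[sire].values())

def share_of_ga_in_sire(counts, sire):
    """GA as reference: fraction of the matching GA group with this SIRE label."""
    ga = GA_FOR_SIRE[sire]
    return counts[sire][ga] / sum(row[ga] for row in counts.values())

print(overall_concordance(toy_counts))
print(share_of_ga_in_sire(toy_counts, "Hispanic"))
print(share_of_sire_in_ga(toy_counts, "Hispanic"))
```

In this toy table, Group 3 is almost entirely Hispanic (76 of 80), yet only 76 of 100 self-identified Hispanics fall into Group 3, because the rest have ancestry profiles closer to Groups 1 or 2. That is the same kind of asymmetry as the 96.6% versus 77.1% figures quoted above.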