How large a sample size is needed to get accurate results for a population?



Magnolia
08-20-2017, 11:39 AM
Have you ever thought about that?
Every day we can see a lot of threads where you calculate some results and on that basis draw conclusions about yourself/your nation/other nations.
Do you realize your conclusions can be absolutely wrong?
If you have a sample of one person, two, or ten people, be realistic: it has no value.

Laberia
08-20-2017, 11:56 AM
Have you ever thought about that?
Every day we can see a lot of threads where you calculate some results and on that basis draw conclusions about yourself/your nation/other nations.
Do you realize your conclusions can be absolutely wrong?
If you have a sample of one person, two, or ten people, be realistic: it has no value.
Yes. This is a good question that I have put to people many times, and I would like to hear the answer of the experts here.

Rethel
08-20-2017, 12:00 PM
Support.

Albobalboa
08-20-2017, 12:01 PM
https://www.qualtrics.com/blog/determining-sample-size/

http://www.sjsu.edu/faculty/gerstman/StatPrimer/z-two-tails.pdf

Statistic formula. If you want to be correct.

Peterski
08-20-2017, 12:14 PM
We are not using sample sizes as small as one, but usually at least 20-50 or more. However in scientific publications they sometimes use samples of around 10 people.

As for reliability of small samples - if a person is typical for his or her region, then even one sample is OK. For example my sample size for Wielkopolska region (Greater Poland) is 23 people, but the rate of similarity between average from 23 people and my results is at least 90%. Which means that even using a sample size of one - just my own results - would be sufficient.

As for the Czech Republic - our samples are from Czech user Syky, and the sample size is around 40:

https://www.theapricity.com/forum/member.php?18010-Syky

Syky's samples are Bohemians and Moravians, with no Silesians.


Samples have to be taken randomly. Etc.

I'm using samples from Harvard's Human Origins Dataset, including Lusatian Sorbs, Poznan Poles, Lublin Poles, Sachsen Germans.

Other samples are also found randomly, via GEDmatch:

https://www.theapricity.com/forum/showthread.php?208626-Canadian-Inuits-in-Eurogenes-K36&p=4368219&viewfull=1#post4368219

https://www.theapricity.com/forum/showthread.php?216549-Latin-American-Gedmatch-Kits-(Random-Sample)&p=4542509&viewfull=1#post4542509

Laberia
08-20-2017, 12:21 PM
We are not using sample sizes as small as one, but usually 20-50 or more. However in scientific publications they sometimes use samples of around 10 people.

As for reliability of small samples - if a person is typical for his or her region, then even one sample is OK. For example my sample size for Wielkopolska region (Greater Poland) is 23 people, but the rate of similarity between average from 23 people and my results is at least 90%. Which means that using a sample size of one would be sufficient.

As for Czechs - our samples are from Czech user Syky, and the sample size is around 40:

https://www.theapricity.com/forum/member.php?18010-Syky

I have read in a study that they tested 26 Albanians, one for every region. Can we consider the result of this study acceptable?

Lotus Star
08-20-2017, 12:27 PM
I think a sample needs to be at least in the hundreds or thousands to be credible, scholastically.

Magnolia
08-20-2017, 12:28 PM
There are mathematical methods for sample size determination. It is not only about the sample size: all regions have to be represented in the correct proportions, samples have to be taken randomly, etc.

Only if these conditions are fulfilled can one draw credible conclusions. Otherwise - zero value. The aim is then to manipulate people, or e.g. to deal with one's own complexes.

Rethel
08-20-2017, 12:28 PM
I have read in a study that they tested 26 Albanians, one for every region. Can we consider the result of this study acceptable?

There should be at least 1000 from every region, taken with statistical accuracy.

Probably even the hg proportions are wrong, because they are coincidental.


There are mathematical methods for sample size determination. It is not only about the sample size: all regions have to be represented in the correct proportions, samples have to be taken randomly, etc.

Only if these conditions are fulfilled can one draw credible conclusions. Otherwise - zero value. The aim is then to manipulate people, or e.g. to deal with one's own complexes.

Exactly what I always meant.

Laberia
08-20-2017, 12:32 PM
I think a sample needs to be at least in the hundreds or thousands to be credible, scholastically.

Hundreds or thousands out of how many people in total: one hundred thousand, one million, etc.?

Peterski
08-20-2017, 12:37 PM
I have read in a study that they tested 26 Albanians, one for every region. Can we consider the result of this study acceptable?

In K36 Oracle, Albanian sample size is very large, a few hundred IIRC.

Peterski
08-20-2017, 12:40 PM
There are mathematical methods for sample size determination. It is not only about the sample size: all regions have to be represented in the correct proportions, samples have to be taken randomly, etc.

Only if these conditions are fulfilled can one draw credible conclusions. Otherwise - zero value. The aim is then to manipulate people, or e.g. to deal with one's own complexes.

I'm using samples from Harvard's Human Origins Dataset, including Lusatian Sorbs, Poznan Poles, Lublin Poles, Sachsen Germans.

Other samples are also found randomly, via GEDmatch:

https://www.theapricity.com/forum/showthread.php?208626-Canadian-Inuits-in-Eurogenes-K36&p=4368219&viewfull=1#post4368219

https://www.theapricity.com/forum/showthread.php?216549-Latin-American-Gedmatch-Kits-(Random-Sample)&p=4542509&viewfull=1#post4542509

Laberia
08-20-2017, 12:47 PM
In K36 Oracle, Albanian sample size is very large, a few hundred IIRC.

Let's simplify our discussion a little bit.
We have a region, region N. This region is inhabited by 1,000,000 people. You are the team leader of a group of geneticists and your duty is to determine the haplogroups present in the population of region N and the percentage of every haplogroup. The first way is testing the inhabitants one by one, but of course this is not possible for the moment, maybe in the future. So you have to start selecting some people to test. How many of the inhabitants of region N do you have to test, and how should they be selected, in order to get an acceptable result? I mean the minimum number of samples.

Magnolia
08-20-2017, 12:55 PM
The main issue with these samples is that they are taken from people who were tested of their own will. They wanted to be tested for a reason (e.g. there is a family story that their great-grandfather was a Jew and they want to confirm it by a test...).

Anyway, if something is written on the internet with no methodology mentioned, no information about the sample size, etc., and even the terminology is not accurate, everyone with a brain should know it is crap.

Peterski
08-20-2017, 12:56 PM
Let's simplify our discussion a little bit.
We have a region, region N. This region is inhabited by 1,000,000 people. You are the team leader of a group of geneticists and your duty is to determine the haplogroups present in the population of region N and the percentage of every haplogroup. The first way is testing the inhabitants one by one, but of course this is not possible for the moment, maybe in the future. So you have to start selecting some people to test. How many of the inhabitants of region N do you have to test, and how should they be selected, in order to get an acceptable result? I mean the minimum number of samples.

For haplogroups, the more the better. So I would want at least 150 people from region N.

But autosomal DNA is something different. For this purpose 50 people would be enough.
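
For the region N question, one textbook way to put numbers on "how many" is the sample size formula for estimating a proportion (here, the frequency of one haplogroup) from a simple random sample. The sketch below is not taken from any post in this thread: the 95% confidence level (z = 1.96), the worst-case frequency p = 0.5 and the function names are all assumptions chosen for illustration.

```python
import math


def required_sample_size(population, margin, p=0.5, z=1.96):
    """Sample size needed to estimate a proportion within +/- margin.

    Assumes simple random sampling; p=0.5 is the worst case and z=1.96
    corresponds to 95% confidence (both are illustrative assumptions).
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # size for an infinite population
    n = n0 / (1 + (n0 - 1) / population)        # finite population correction
    return math.ceil(n)


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the 95% interval for a proportion from n random samples."""
    return z * math.sqrt(p * (1 - p) / n)


# How the required size changes with the size of the region
for population in (100_000, 1_000_000, 10_000_000):
    print(population, required_sample_size(population, margin=0.05))

# What margin the sample sizes mentioned in the thread would give
for n in (50, 150, 1000):
    print(n, round(100 * margin_of_error(n), 1), "percentage points")
```

Under these assumptions a margin of +/-5 percentage points needs about 385 people for a region of 1,000,000, and almost the same number for 100,000 or 10,000,000, so the total population hardly matters. With 150 people the margin is about +/-8 points. None of this helps if the people are not picked at random, e.g. if they got tested of their own will.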

Magnolia
08-20-2017, 01:03 PM
lol. It is funny to see a person who has no clue that these methods exist to say how large samples should be...

Laberia
08-20-2017, 01:05 PM
For haplogroups, the more the better. So I would want at least 150 people from region N.

But autosomal DNA is something different. For this purpose 50 people would be enough.

Where can we read this? Is this written in some texts, or is it only your personal opinion?
And how do these persons have to be selected: Russian roulette, or is there a methodology taking into consideration the history of the region, etc.?

Rethel
08-20-2017, 01:13 PM
should know it is crap.

Maybe not crap, but we do not have anything better.

Tests should be made the same way as any statistic is made.
The sample should be divided according to population size by age, county, province, proportionally to the density and so on.

Obviously it is not, so the resulting percentages of hgs are +/- with big margins of error.

Autosomal - could be crap, especially since the tested groups are much smaller than
in the case of hgs, much more coincidental, and as we can see on this forum, different
analyses give different results. Litwin is a Slovakian one time, a Masovian another time,
and after another method he is a Greater Pole. And btw, autosomal analyses were started as
crap commercial testing for migrants in America who did not know where they are from.
They are happy like children when you tell them that they are 10% Zulu, 1% Maasai, 5% Finn,
15% Ainu, 9% Navajo - but it has no practical relevance, and tells nothing about someone's
provenance - if it is a trustworthy test at all. Even if it is, it is worthless anyway in a personal case.

Peterski
08-20-2017, 01:14 PM
Magnolia called me "amateur geneticist".

But I'm employed by a DNA testing company, and officially a genetics professional:

http://i.imgur.com/xo8fHcS.png

And who are you Magnolia? A toilet cleaner? :confused:

Magnolia
08-20-2017, 01:20 PM
lol You are a brigadier with no education in genetics, mathematics, statistics, not even in history. Unlike you I know exactly what I'm speaking about in this thread ; ). Btw. it says a lot about that company that they have a universal contract for everyone.

And dont be afraid Poles are very good in toilet cleaning, we dont have to do that, nobody in Europe have to. It is your specialization. Everybody knows that.

Rethel
08-20-2017, 01:22 PM
Can you both stop, and stick to the topic? :mad:

Peterski
08-20-2017, 01:23 PM
I'm done with this discussion.

Laberia
08-20-2017, 01:30 PM
Can you both stop, and stick to the topic? :mad:

Yes, I want to make clear that my interest is genuine and has nothing to do with this Czech-Polish crisis. I was thinking of opening a thread and asking some questions of the people who are experts in genetics. Unfortunately he left the discussion.

Magnolia
08-20-2017, 01:31 PM
I'm done with this discussion.

Of course you are done, because you have no clue what you are talking about...

Magnolia
08-20-2017, 01:32 PM
Yes, I want to make clear that my interest is genuine and has nothing to do with this Czech-Polish crisis. I was thinking of opening a thread and asking some questions of the people who are experts in genetics. Unfortunately he left the discussion.

He left the discussion because he doesn't know what he is talking about. He has no education in it. He only posts nonsense with no value all the time.
That's why he started an OT here, because he has nothing valuable to say.

Laberia
08-20-2017, 01:42 PM
He left the discussion because he doesn't know what he is talking about. He has no education in it. He only posts nonsense with no value all the time.
That's why he started an OT here, because he has nothing valuable to say.
Ok, but seriously I am interested in this topic.

Magnolia
08-20-2017, 01:44 PM
Thumbs down: that is his answer.
Why don't you tell us where you studied genetics, etc.?
Why do you manipulate people here again with a universal contract from a company?
Why do you want to make them believe you are an expert when you are not?

Magnolia
08-20-2017, 01:48 PM
Ok, but seriously I am interested in this topic.

Me too. But he is butthurt again, because this topic discredits his TA posts.

Peterski
08-20-2017, 02:05 PM
My time is precious, I don't waste it talking or arguing with people whom I don't like.

I'm not an internet crusader who has to win every argument against random anonymous idiots. I was like this back when I posted on Historum, so you can read my old posts there if you wish.

It is Sunday and I have guests over, I don't have time to write lengthy posts. Bye.

Magnolia
08-20-2017, 02:18 PM
You have no time to answer a simple question: why do you want to make people believe you are an expert in something you are not?

You have no time to say how it is, but you have time to write long blah blah blah nonsense.

You have no time for a short answer, but you have time to read my reputation comments, share my personal pictures and create collages to humiliate me.

Your words confirm your actions as usual.

This thread was meant seriously. I would appreciate it if you could stop posting your butthurt posts here and in general stop participating here, because as for the topic you have no clue what's going on.

Rethel
08-20-2017, 02:21 PM
http://lifementor.pl/wp-content/uploads/2011/05/klotnia.jpg

firemonkey
08-20-2017, 02:29 PM
@Magnolia What are your qualifications/skills in this area that legitimise your insistence in criticising Litvin ?

Laberia
08-20-2017, 02:31 PM
http://lifementor.pl/wp-content/uploads/2011/05/klotnia.jpg

Sometimes love shows some strange signs.

Seya
08-20-2017, 02:31 PM
I don't know how large the sample size is, but the results are pretty accurate... I mean, all of the people that have been tested here got results very close to the samples taken from their countries... I once read somewhere that they take as reference only people who know their ancestors very well...

Rethel
08-20-2017, 02:32 PM
I don't know how large the sample size is, but the results are pretty accurate... I mean, all of the people that have been tested here got results very close to the samples taken from their countries... I once read somewhere that they take as reference only people who know their ancestors very well...

Thanks for getting back to the topic :)

Dandelion
08-20-2017, 02:33 PM
The results brought up about Czech people are those by a Polish mad scientist twirling his moustache whose goal is 'proving' the similarity between Poles and Czechs.

Magnolia
08-20-2017, 02:36 PM
@Magnolia What are your qualifications/skills in this area that legitimise your insistence in criticising Litvin ?
I don't want to share my personal stuff here.

Petalpusher
08-20-2017, 02:42 PM
The number of samples doesn't matter much; it's how representative they are. Studies often don't use a lot of samples, but those that focus on a particular region/country make sure they are representative: they discard outliers and those who are obviously not ethnically local, which is easy to spot. So the sample set can appear quite small, but sometimes many more people were tested than it seems; they just keep the ethnic core of a certain group.

In real-world cases it's also easy to see how good a sample is: when every random guy who gets tested decently matches the sample, it's reasonable to conclude it's a good one. Even with a few members here we can usually verify how accurate they are, just by plotting a group of people of the same origin, let's say Albanians, as we have done here. As expected, they form their own little cluster around the average, which is kind of amazing when you think about it: a few drops of saliva blindly sent from all over the world all end up as Albanians. Of course there is some variability, but it means the average was correct to begin with. Millions of people have been tested around the world by now, so it's fair to say the vast majority of samples are accurate.

Laberia
08-20-2017, 02:45 PM
The methodology used by Gallup:

Gallup Daily Tracking Methodology
Gallup conducts 1,000 interviews per day, 350 days out of the year, among both landline and cell phones across the U.S. for its health and well-being survey[11] and political and economic survey. Gallup Daily tracking methodology relies on live interviewers, dual-frame random-digit-dial sampling (which includes landline as well as cellular telephone phone sampling to reach those in cell phone-only households), and uses a multi-call design to reach respondents not contacted on the initial attempt.
Gallup completes 500 cellphone surveys and 500 landline surveys daily, divided evenly between the two topical questionnaires.[12] The population of the U.S. that relies only on cell phones makes 34% of the population.[13]
The findings from Gallup's U.S. surveys are based on the organization's standard national telephone samples, consisting of list-assisted random-digit-dial (RDD) telephone samples using a proportionate, stratified sampling design. A computer randomly generates the phone numbers Gallup calls from all working phone exchanges (the first three numbers of your local phone number) and not-listed phone numbers; thus, Gallup is as likely to call unlisted phone numbers as well as listed phone numbers.
Within each contacted household reached via landline, an interview is sought with an adult 18 years of age or older living in the household who will have the next birthday. Gallup does not use the same respondent selection procedure when making calls to cell phones because they are typically associated with one individual rather than shared among several members of a household. Gallup Daily tracking includes Spanish-language interviews for Spanish-speaking respondents and interviews in Alaska and Hawaii.
When respondents to be interviewed are selected at random, every adult has an equal probability of falling into the sample. The typical sample size for a Gallup poll, either a traditional stand-alone poll or one night's interviewing from Gallup's Daily tracking, is 1,000 national adults with a margin of error of ±4 percentage points. Gallup's Daily tracking process now allows Gallup analysts to aggregate larger groups of interviews for more detailed subgroup analysis. But the accuracy of the estimates derived only marginally improves with larger sample sizes.
After Gallup collects and processes survey data, each respondent is assigned a weight so that the demographic characteristics of the total weighted sample of respondents match the latest estimates of the demographic characteristics of the adult population available from the U.S. Census Bureau. Gallup weights data to census estimates for gender, race, age, educational attainment, and region.[14]
The data are weighted daily by number of adults in the household and the respondents' reliance on cell phones, to adjust for any disproportion in selection probabilities. The data are then weighted to compensate for nonrandom nonresponse, using targets from the U.S. Census Bureau for age, region, gender, education, Hispanic ethnicity, and race. The resulting sample represents an estimated 95% of all U.S. households.[15][16]
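
The figures in the Gallup text can be checked with the usual margin-of-error formula for a proportion. This is only a rough sketch: it assumes simple random sampling, 95% confidence and a worst-case proportion of 0.5, and the "design effect" reading of the gap between the computed and the quoted margin is my assumption, not something the quoted text states.

```python
import math


def srs_margin(n, p=0.5, z=1.96):
    """95% margin of error for a proportion under simple random sampling."""
    return z * math.sqrt(p * (1 - p) / n)


def implied_design_effect(reported_margin, n, p=0.5, z=1.96):
    """Variance inflation that would turn the SRS margin into the reported one."""
    return (reported_margin / srs_margin(n, p, z)) ** 2


# Margin for 1,000 interviews if they were a plain random sample
print(round(100 * srs_margin(1000), 1), "percentage points")

# Inflation factor implied by the quoted +/-4 points (assumed interpretation)
print(round(implied_design_effect(0.04, 1000), 2))

# Larger samples help less and less: 4x the interviews halves the margin
for n in (1000, 2000, 4000, 16000):
    print(n, round(100 * srs_margin(n), 1))
```

A plain random sample of 1,000 would give about +/-3.1 points, so the quoted +/-4 corresponds to an inflation factor of roughly 1.67, presumably from the weighting. The last loop shows why accuracy "only marginally improves" with larger samples: the margin shrinks with the square root of the sample size.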

firemonkey
08-20-2017, 02:45 PM
@Magnolia- So basically you want to criticise Litvin whilst not giving any reason for us to believe your opinions carry any weight .
The whole "I don't want to share personal stuff" is a smokescreen . No one is asking you for your home address and telephone number .

Rethel
08-20-2017, 02:51 PM
Polish mad scientist twirling his moustache

:confused:

Magnolia
08-20-2017, 02:59 PM
@Magnolia- So basically you want to criticise Litvin whilst not giving any reason for us to believe your opinions carry any weight .
The whole "I don't want to share personal stuff" is a smokescreen . No one is asking you for your home address and telephone number .

Please, could you stop OT? It would be great. Litvin said he was an expert in that and I know for sure he was not. That was the only reason I spoke about it.
And no, I have no reason to reveal too much about myself here. I don't want to. And the reason is not that I have something to be ashamed of. The reason is, e.g., that everything personal I said to Litvin, even in private, he used against me in public.

Laberia
08-20-2017, 03:00 PM
The number of samples doesn't matter much; it's how representative they are. Studies often don't use a lot of samples, but those that focus on a particular region/country make sure they are representative: they discard outliers and those who are obviously not ethnically local, which is easy to spot. So the sample set can appear quite small, but sometimes many more people were tested than it seems; they just keep the ethnic core of a certain group.

In real-world cases it's also easy to see how good a sample is: when every random guy who gets tested decently matches the sample, it's reasonable to conclude it's a good one. Even with a few members here we can usually verify how accurate they are, just by plotting a group of people of the same origin, let's say Albanians, as we have done here. As expected, they form their own little cluster around the average, which is kind of amazing when you think about it: a few drops of saliva blindly sent from all over the world all end up as Albanians. Of course there is some variability, but it means the average was correct to begin with. Millions of people have been tested around the world by now, so it's fair to say the vast majority of samples are accurate.

According to you, genetics doesn't follow the rule that the higher the number of samples tested, the lower the possibility of a wrong result.
And how do you decide whether a person is representative of a region or not?

Dandelion
08-20-2017, 03:06 PM
:confused:

This blog is made by a Polish person who's widely respected in anthropology forums. He's a competent bioinformatician. While it's true he's not above Polish nationalism, he's still no idiot.

http://dienekes.blogspot.be/2010/10/more-detailed-analysis-of-eurasian.html

I believe Magnolia was referring to him when Litvin talked about Czech ethnic composition.

Rethel
08-20-2017, 03:08 PM
This blog is made by a Polish person who's widely respected in anthropology forums. He's a competent bioinformatician. While it's true he's not above Polish nationalism, he's still no idiot.

http://dienekes.blogspot.be/2010/10/more-detailed-analysis-of-eurasian.html

I believe Magnolia was referring to him when Litvin talked about Czech ethnic composition.

Dienekes is a Pole? :blink:

Dandelion
08-20-2017, 03:11 PM
Dienekes is a Pole? :blink:

He's a different person (a Greek), it appears. Meh. I post here and not on Anthrogenica (serious anthroboard) for a reason lol.

I meant this guy:
http://eurogenes.blogspot.be/

Laberia
08-20-2017, 03:20 PM
Dienekes is a Pole? :blink:

He's a different person (a Greek), it appears. Meh. I post here and not on Anthrogenica (serious anthroboard) for a reason lol.

I meant this guy:
http://eurogenes.blogspot.be/
So, we don't know who this person is (or these persons are), their academic background, etc., and we continue to discuss based on their work. This story of the Anonymous of the XXI century is unbelievable.

Rethel
08-20-2017, 03:36 PM
So, we don't know who this person is (or these persons are), their academic background, etc., and we continue to discuss based on their work. This story of the Anonymous of the XXI century is unbelievable.

It seems that Dienekes doesn't like Greeks, so idk how he can be one?

Petalpusher
08-20-2017, 03:47 PM
According to you, genetics doesn't follow the rule that the higher the number of samples tested, the lower the possibility of a wrong result.
And how do you decide whether a person is representative of a region or not?

What I explained is that they use a lot of samples first, when available, but only keep a few to make an average; there are technical reasons as well to keep a reasonably low sample count when you make admixture runs. They basically keep the central point of their set and possibly a few around it to illustrate some variability (if it exists).

Let's say you have 100 samples of country X and 85 gravitate very closely around a single point. There's no need to keep those 85 samples in the run; you just select about 10 and maybe put a few around them to make a representative average. I've seen this method explained in many studies, especially the regional and country-wide ones, and I mean the academic studies, not the bloggers who just re-use the sample set available everywhere. You will see just ten samples and wonder if that's enough, when in reality it's only the core of their set, which has already been carefully filtered.

A good average has two functions: to show whether people are close to it and can be deemed local, OR whether they diverge from it to a certain degree. There's no need for thousands of samples anyway, because that's precisely what the most basic principle of admixture is based on: the correct assumption that ethnic people of group X won't diverge much from each other, as they have been mixed together over and over through many generations, and that their admixture is totally stabilized by now, which is hopefully still the case, otherwise we would have nothing to compare.
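
The "keep the core of the set" idea can be sketched roughly like this. It is only an illustration, not how any particular study does it: the admixture proportions are made-up numbers, the function names are hypothetical, and ranking by plain Euclidean distance from the mean is an assumption.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def closest_to_centroid(points, k):
    """Return the k samples nearest to the centroid (the 'core' of the set)."""
    c = centroid(points)
    return sorted(points, key=lambda p: math.dist(p, c))[:k]


# Hypothetical admixture proportions (three components), NOT real data
samples = [
    (0.50, 0.30, 0.20),
    (0.52, 0.28, 0.20),
    (0.49, 0.31, 0.20),
    (0.51, 0.30, 0.19),
    (0.30, 0.30, 0.40),   # an obvious outlier
]

core = closest_to_centroid(samples, k=3)
print(core)
```

Note that the plain mean is itself pulled towards outliers, so with a heavily mixed set one would probably want a more robust centre (e.g. a median) before picking the core.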

Magnolia
08-20-2017, 04:21 PM
Petalpusher, you are speaking about a representative sample that is based on a research of many people from a region.
This is an absolutely correct method. But I was speaking about something else than a representative sample.
I was speaking about amateur geneticists on internet who are able to take non-representative sample and claim that the sample is representative for a whole population.

Alessio
08-20-2017, 04:32 PM
This blog is made by a Polish person who's widely respected in anthropology forums. He's a competent bioinformatician. While it's true he's not above Polish nationalism, he's still no idiot.

http://dienekes.blogspot.be/2010/10/more-detailed-analysis-of-eurasian.html

I believe Magnolia was referring to him when Litvin talked about Czech ethnic composition.

I think you meant this one:

http://eurogenes.blogspot.nl/

Petalpusher
08-20-2017, 05:02 PM
Petalpusher, you are speaking about a representative sample that is based on a research of many people from a region.
This is an absolutely correct method. But I was speaking about something else than a representative sample.
I was speaking about amateur geneticists on internet who are able to take non-representative sample and claim that the sample is representative for a whole population.

It's debatable when people make their own average. It can work if by chance they stumbled upon very average individuals, but I wouldn't trust them either because, as I've tried to explain, there's usually way more work involved in professional studies than just picking a few samples and calling it an average, even if it sometimes appears so.

Peterski
08-20-2017, 05:58 PM
We use mainly samples with GEDCOMs, which provide genealogical info about family tree.

So our samples are representative as they are people with ancestors from the same region.

Peterski
08-20-2017, 06:04 PM
there's usually way more work involved in professional studies than just picking a few samples and calling it an average, even if it sometimes appears so.

You are delusional and you overestimate the quality of these professional studies.

I do not even hesitate to say that professional studies actually have lower quality samples than our K36 Oracle samples. For example there are outliers among samples from Harvard's Human Origins dataset, because they are selected based on place of residence (for example current residents of Poznań and Lublin, respectively), not based on the origins of their ancestors. Among 15 Poles from Poznań, there are 4 outliers. Among 8 East Poles from Lublin, there are 3 outliers (people with recent German ancestry*).

Of course we filtered out these outliers and included only representative samples. When using GEDmatch, we have access to GEDCOMs which show full family trees. This is more reliable.

*Probably descended from these settlers:

http://cejsh.icm.edu.pl/cejsh/element/bwmeta1.element.5c1c6cc0-2bdf-3268-a6b6-2f79175fcaed

https://library.ndsu.edu/grhc/history_culture/history/files/Jerry%20Frank%20-%20The%20German%20Migration%20to%20the%20East.pdf

Magnolia
08-20-2017, 06:09 PM
You have no clue what representative samples are.

Peterski
08-20-2017, 06:14 PM
This is a GEDCOM pedigree chart; we divide our samples into regional groups based on such data:

http://i.imgur.com/xKwDQsV.png

For example this person is one of the 35-40 Northern Germans who make up our "North_German" average.

==============

Authors of "professional" studies often don't check ancestry, but rely only on places of residence.

Peterski
08-20-2017, 06:27 PM
Not all samples have GEDCOMs. In cases where GEDCOMs are not available, we create a PCA graph and see if samples with no GEDCOMs cluster close to samples with GEDCOMs. This is how we filter out outliers. If they don't cluster close to samples with confirmed ancestry (based on genealogical data), then it means that they have ancestors from another region - and are not representative for a given region.
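
A rough sketch of this kind of filter, under stated assumptions: the PCA coordinates are invented, the function names are hypothetical, and the cutoff (mean distance of the confirmed samples plus two standard deviations) is just one possible rule, not necessarily the one used for K36 Oracle.

```python
import math
import statistics


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def split_by_cluster(confirmed, candidates, n_sd=2.0):
    """Keep candidates that fall within the spread of the confirmed samples.

    The cutoff is the mean distance of confirmed samples from their own
    centroid plus n_sd standard deviations (an illustrative assumption).
    """
    c = centroid(confirmed)
    ref = [math.dist(p, c) for p in confirmed]
    cutoff = statistics.mean(ref) + n_sd * statistics.stdev(ref)
    kept = [p for p in candidates if math.dist(p, c) <= cutoff]
    dropped = [p for p in candidates if math.dist(p, c) > cutoff]
    return kept, dropped


# Hypothetical (PC1, PC2) coordinates, NOT real data
with_gedcom = [(0.010, 0.020), (0.012, 0.018), (0.008, 0.022), (0.011, 0.021)]
without_gedcom = [(0.009, 0.019), (0.050, -0.010)]

kept, dropped = split_by_cluster(with_gedcom, without_gedcom)
print("kept:", kept)
print("dropped:", dropped)
```

With so few confirmed samples the cutoff is very sensitive to each of them, so in practice the threshold would probably need checking by eye on the plot as well.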

Petalpusher
08-20-2017, 06:29 PM
A GEDCOM tells you nothing about the ethnic origin of someone, only a genealogical tree, just names. They could all be Jewish mulattos with German names for all we know. That's why studies don't rely on that; they just filter out from the beginning the most obviously non-ethnic with the 4 GP requirement, then they run and plot them to actually verify their admixture, which is what matters genetically, not a name or a place of birth/death.

You also fail to understand why they purposely kept those "outliers" on top of their baseline average: they are a significant portion of population X even if outlying compared to the majority, so they do need to stay in the average run in the right proportion.

Magnolia
08-20-2017, 06:35 PM
Petalpusher described what representative samples are quite well.
It is not about e.g. taking a person who has no foreigners in his family tree. Such a person doesn't have to be typical even for his village.
And stop, Litvin: you are able to take the results of one person and claim this person is typical not only for his region but for the whole country.

Peterski
08-20-2017, 06:36 PM
Gedcom tells you nothing about the ethnic origin of someone, only a genealogical tree, just names. They could all be Jewish mulattos with German names for all we know.

Nope, they can't be Jewish Mulattos without having "East Med" and "West African" admixtures.

We rely on both genealogical data, and on genetic data (PCA plots showing how they cluster).

One of 23 Wielkopolska Poles from our sample (I'm also part of this sample):

http://i.imgur.com/m6iXlQf.png


then they run and plot them to actually verify their admixture, which is what matters genetically, not a name or a place of birth/death. You also fail to understand why they purposely kept those "outliers" on top of their baseline average: they are a significant portion of population X even if outlying compared to the majority, so they do need to stay in the average run in the right proportion.

I don't fail to understand anything, I told you several times that I filter out outliers.

But it does not mean that I don't use them. When I create regional Polish averages, I don't include obvious outliers. But these outliers later go to "Poland_Mixed", I don't throw them away.

"Poland_Mixed" includes everyone who has ancestry from several regions of Poland.

Peterski
08-20-2017, 06:43 PM
You also fail to understand why they purposely kept those "outliers" on top of their baseline average: they are a significant portion of population X even if outlying compared to the majority, so they do need to stay in the average run in the right proportion.

NOBODY did regional Polish averages before us. All studies lump all Poles together.

K36 Oracle is the first project which divided Poland into smaller regional populations.

And we do include outliers - we include them as part of "Poland_Mixed" reference.

So what is your problem?

We do the same with French - outliers go to "France_average", not to "Bretagne", etc.

Peterski
08-20-2017, 06:58 PM
Let's say you have 100 samples of country X and 85 gravitate very closely around a single point. There's no need to keep all 85 samples in the run; you just select about 10 and maybe put a few around them to make a representative average. I've seen this method explained in many studies, especially the regional and country-wide ones, and I mean the academic studies, not the bloggers who just re-use the sample set available everywhere. You will see just ten samples and wonder if that's enough, when in reality it's only the core of their set that has been carefully filtered already.

This is indeed how it works.

Which is why I did not filter out outliers from Harvard's sample of Sorbs, which includes only 8 people, but which captures the genetic diversity of Lusatian Sorbs pretty well. And even though these 8 look diverse (at least 3 are outliers, they are different than the "core" of 5), the average admixture results of these 8 actually make perfect sense. So they did what you say - they selected about 10 samples from a larger group, but these 8 samples capture the whole diversity that they observed. And I understand this.

But what you don't understand is that K36 Oracle is a pioneering work when it comes to regional averages for many ethnic groups, e.g. Poles. There is no study with such comprehensive data on regional differences within Poland. So while for Lusatian Sorbs I did not filter outliers, for Polish regions I did.

And then I included outliers as "Poland_Mixed" reference. I did not throw them away.

Peterski
08-20-2017, 07:11 PM
As for these Lublin Poles, it was Davidski who advised me to filter out the ones with recent German ancestry. If you have 3 Germans among 8 samples, the average for these 8 will cluster with Western Poles, despite them being Eastern Poles. So if I want a representative sample for Lublin Region (which is in the central part of Eastern Poland), I have to filter out these Polonized Germans. There is no way around it.

And I really don't think that as many as 37.5% (3/8) of Eastern Poles have recent German ancestry. So this sample was not representative to begin with, despite being from Harvard's dataset.
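To put a number on how little 3 out of 8 can tell us, here is a rough sketch using the Wilson score interval for a proportion. It assumes the 8 samples were drawn at random (which is exactly what is in doubt here) and a 95% confidence level; the function name is just for illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives ~95% confidence)."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width

# 3 samples with recent German ancestry out of 8 (the Lublin case above)
low, high = wilson_interval(3, 8)
print(f"plausible range for the true share: {low:.0%} to {high:.0%}")
```

The range comes out at roughly 14% to 69%, so with only 8 samples the data alone cannot distinguish "a small minority" from "a majority".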

K36 Oracle also has samples for South-East Poland (Podkarpacie) and North-East Poland (Podlasie and Sudovia). These are samples from GEDmatch, with GEDCOMs. Not from Harvard.

Magnolia
08-20-2017, 07:17 PM
Once again, representative samples are not about picking people who have no foreign ancestors.
It is about finding a typical (average) person who has the most common mix of ancestors in a region/country.

Peterski
08-20-2017, 07:22 PM
When I asked Davidski (Polako) why these 3 Lublin Poles are different from the other 5:

http://i.imgur.com/SExNNWg.png


Once again, representative samples are not about picking people who have no foreign ancestors.
It is about finding a typical (average) person who has the most common mix of ancestors in a region/country.

I did not filter out Czechs with German ancestry. I included all Czechs that I got.

And I got these samples from Czech users (mainly from Syky).

Remember that for Czechs I made one average for the entire Republic. I did not make regional averages. So this average includes Bohemians and Moravians (but no Silesians and no foreigners).

For Poland, we made both regional averages and "Poland_Mixed" average.

Magnolia
08-20-2017, 07:52 PM
Stop being butthurt and read something about research methodology. For any research, its methodology and data are the alpha and omega. Samples have to be taken according to a plan, and yes, it takes time and money. "Somebody gave me samples" is nothing; it is important to know more about them. These samples could even be fake... needless to say, we can't then speak about a typical person or generalize to the whole population.

Dandelion
08-20-2017, 07:54 PM
When I asked Davidski (Polako) why these 3 Lublin Poles are different from the other 5:


How a Czech woman pictures Davidski with samples from Czech Silesia

http://clipart-library.com/images/5TRKpARXc.jpg

Graham
08-20-2017, 08:08 PM
I think a sample needs to be at least in the hundreds or thousands to be credible, scholastically.

"The central limit theorem is a statistical theory that states that given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population."

You could get away with a smaller number really. A sample size of say 25 if representative and similar wouldn't be too bad.



Have you ever thought about that?
Every day we can see a lot of threads where you calculate some results and on that basis you draw conclusions about yourself/your nation/other nations.
Do you realize your conclusions can be absolutely wrong?
If you have a sample of one person, two, ten people, be realistic: it has no value.

There are statistical tests that work out whether groups are related and do hypothesis testing: p-values, confidence intervals, etc. It's a bit complicated. Not sure what they use. Chi-squared testing or whatever.

But anyway, read into the central limit theorem to help with original question. :)


https://upload.wikimedia.org/wikipedia/commons/1/12/Central_Limit_Theorem.png
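If anyone wants to see the theorem at work, here is a small simulation sketch. The "population" is made up (a skewed exponential distribution with mean 1), and the function name, trial count and seed are arbitrary choices for illustration only.

```python
import random
import statistics

def spread_of_sample_means(sample_size, trials=2000, seed=0):
    """Standard deviation of the means of many random samples of one size."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Made-up skewed population: exponential with mean 1 and st. dev. 1
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return statistics.stdev(means)

spreads = {n: spread_of_sample_means(n) for n in (1, 5, 25, 100)}
for n, s in spreads.items():
    print(f"n={n:>3}: spread of sample means = {s:.3f}")
```

The spread of the sample means shrinks roughly like 1/sqrt(n), so going from 25 to 100 samples only halves the error. Note this only holds for samples drawn at random; a bigger sample does not fix a biased selection.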

Peterski
08-20-2017, 08:18 PM
These samples can be even fakes.

No, they are not fake, and here are some of their surnames:

Jiskrova
Dortová
Aubrechtova
Hromádko
Tousova
Lukyn
Hetmer
Bednarova
Švábek
Soucek
Hovorkova
Brtnik
Autrata
Cepelova
Cepela
Lett
Soldat
Sivan
Lettova
Hovorka
Baca - Vlach from Moravia

============================

Now for example here is GEDCOM pedigree of Ottomar Lett:

http://i.imgur.com/OjoWD3X.png

Magnolia
08-20-2017, 08:31 PM
How a Czech woman pictures Davidski with samples from Czech Silesia

http://clipart-library.com/images/5TRKpARXc.jpg
I don't care who that person is. If he doesn't know the rules of research, he is an amateur, not a researcher. Being very active doesn't mean being good at something.

Magnolia
08-20-2017, 08:35 PM
If representative ... yeah, that's what this is all about.

Peterski
08-20-2017, 08:42 PM
Does for example this Czech sample look fake?:

https://www.sendspace.com/file/ilz9yn

Graham
08-20-2017, 08:53 PM
If representative ... yeah, that's what this is all about.

If the spread in a population (a large sample size) has a wide range and separate clusters on both ends, then there's an argument for splitting it up into smaller groups (still OK-sized), for example. If a big group huddles near its median/mean, then keep it as one.

But a sample population should not have a large range; if it does, you have an error.

Magnolia
08-20-2017, 09:16 PM
It is about statistics, not about anything else.
There are statistical methods for sample size determination, so that we can get accurate information about a typical/average person, i.e. so that we can determine who the typical/average person is.

The approach of taking one random person and claiming that person is typical for a region is an error.
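One of those textbook methods, as a rough sketch: to estimate an average to within a margin e, with standard deviation sigma, you need about n = (z * sigma / e)^2 random samples. The sigma of 5 percentage points for an admixture component below is a made-up number for illustration, not taken from any real dataset, and the function name is hypothetical.

```python
import math

def sample_size_for_mean(sigma, margin, z=1.96):
    """Random samples needed so the mean is within +/- margin (z=1.96 ~ 95%)."""
    return math.ceil((z * sigma / margin) ** 2)

# Hypothetical: one admixture component varies with st. dev. 5 points
for margin in (2.0, 1.0, 0.5):
    n = sample_size_for_mean(sigma=5.0, margin=margin)
    print(f"+/- {margin} points needs about {n} samples")
```

Halving the margin of error quadruples the sample size, and a sample of one gives no estimate of sigma at all, so there is no way to say how typical that one person is.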

Dick
08-21-2017, 03:08 AM
Gedcom tells you nothing about the ethnic origin of someone, only a genealogical tree, just names. They could all be Jewish mulattos with German names for all we know. That's why studies don't rely on that, they just filter from the beginning the most oddly non ethnic with the 4 GP requirement, then they run and plot them to actually verify their admixture, which is what matters genetically, not a name or a place of birth/death.

You also fail to understand why they purposely kept those "outliers" on top of their baseline average: they are a significant portion of x pop even if outlying compared to the majority, so they do need to stay in the average run in the right proportion.

http://vignette3.wikia.nocookie.net/nintendo/images/f/f4/The_Mario_Bros..jpeg/revision/latest?cb=20140221224337&path-prefix=en

Karol Klačansky
08-21-2017, 09:39 AM
A sample size should be over 30 at least, and then the bigger the better.

Lotus Star
08-21-2017, 11:16 AM
Hundreds or thousands out of how many people in total: one hundred thousand, one million, etc.?

It depends on how large a population is. There is a ratio.

Laberia
08-21-2017, 11:22 AM
It depends on how large a population is. There is a ratio.

Let's simplify our discussion a little bit.
We have a region, region N, inhabited by 1,000,000 people. You are the team leader of a group of geneticists, and your duty is to determine the haplogroups present in the population of region N and the percentage for each haplogroup. One way is to test the inhabitants one by one, but of course this is not possible at the moment; maybe in the future. So you have to select some people to test. How many of the inhabitants of region N do you have to test, and how should they be selected, in order to get an acceptable result? I mean the minimum number of samples.
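A rough sketch of the standard textbook answer, under my own assumptions: simple random sampling, 95% confidence, a margin of +/- 5 points on one haplogroup's percentage, and the worst case p = 0.5. It uses the usual proportion formula with the finite population correction; the function name and the numbered-inhabitants list are hypothetical.

```python
import math
import random

def required_sample(population, margin=0.05, p=0.5, z=1.96):
    """Sample size for one proportion, with finite population correction."""
    n0 = z ** 2 * p * (1 - p) / margin ** 2   # size for an unlimited population
    return math.ceil(n0 / (1 + (n0 - 1) / population))

for size in (1_000, 10_000, 1_000_000):
    print(f"population {size:>9,}: test about {required_sample(size)} people")

# Selection has to be random: here every inhabitant has the same chance.
n = required_sample(1_000_000)
rng = random.Random(42)
chosen = rng.sample(range(1_000_000), n)      # hypothetical inhabitant numbers
```

Under these assumptions it comes to about 385 people for the region of 1,000,000 and about 370 for a population of 10,000, so the size of the population matters much less than how the people are picked. Rare haplogroups or a tighter margin would need noticeably more.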