What is a Representative Sample?
A representative sample is one that matches your population on some characteristic, usually the characteristic you’re targeting with your research. For example, if you’re conducting a survey about how women’s experiences as data scientists compare to men’s, you would want your sample to reflect the percentage of women in the data science workforce.
The purpose of sampling is to obtain a statistic that tells you something about a population. A statistic is representative if it closely reflects the corresponding parameter in the population. When the statistic does not reflect the population parameter, it is called unrepresentative. The bias that results from an unrepresentative sample is called selection bias.
Representative Doesn’t Mean Replication
Even if a sample is labeled “representative”, that doesn’t mean it reflects every aspect of the population. In quota sampling, for instance, you preserve the proportions of a chosen characteristic in the population: if your original population is 45% female and 55% male, your quota sample should reflect those percentages. However, that sample tells you nothing about age distribution, income distribution, or other metrics; it is representative only of the proportions of females and males.
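The limitation described above can be illustrated with a minimal sketch. The population, the age values, and the quota_sample helper are all hypothetical, invented only for illustration; the sketch assumes people are accepted in the order they appear until each quota is filled.

```python
# Hypothetical population: 45% female, 55% male, listed youngest first
# (ages 20-69). All values are invented for illustration.
population = [
    {"sex": "female" if i % 20 < 9 else "male", "age": 20 + i // 20}
    for i in range(1000)
]

def quota_sample(people, quotas, n):
    """Accept people in order until each group's quota is filled."""
    targets = {group: round(n * share) for group, share in quotas.items()}
    counts = {group: 0 for group in quotas}
    sample = []
    for person in people:
        group = person["sex"]
        if counts[group] < targets[group]:
            sample.append(person)
            counts[group] += 1
    return sample

def mean_age(people):
    """Average age of a list of person records."""
    return sum(p["age"] for p in people) / len(people)

sample = quota_sample(population, {"female": 0.45, "male": 0.55}, n=100)
female_share = sum(p["sex"] == "female" for p in sample) / len(sample)

print(f"female share in sample: {female_share:.2f}")
print(f"mean age, sample:     {mean_age(sample):.1f}")
print(f"mean age, population: {mean_age(population):.1f}")
```

Under these assumptions the sample matches the 45% female share exactly, yet its mean age (22.0) is far below the population’s (44.5): the quota controls only the characteristic it was built on.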
Obtaining Representative Samples
In general, random selection tends to produce a representative sample; as long as you have used a probability sampling method, you can make generalizations and predictions about the population.
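As a rough illustration, the sketch below draws a simple random sample from a hypothetical population and compares one characteristic in the sample with the population. The population size, the 25% female share, and the seed are assumptions chosen only for the example.

```python
import random

# Hypothetical population of 10,000 people, 25% of whom are female.
population = ["female"] * 2500 + ["male"] * 7500

rng = random.Random(42)  # fixed seed so the run is repeatable
sample = rng.sample(population, k=1000)  # simple random sample, no replacement

population_share = population.count("female") / len(population)
sample_share = sample.count("female") / len(sample)

print(f"female share, population: {population_share:.3f}")
print(f"female share, sample:     {sample_share:.3f}")
```

The sample share will usually land close to the population share, but not exactly on it; random selection limits systematic bias, while some chance variation remains.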
You should strive to avoid bias in your survey, trial, or experiment. Although randomly selecting from a population to get a representative sample may seem like a simple task, in practice it is full of pitfalls. For example, the results of your study might become skewed by factors you didn’t account for, such as knowledge of which patients are getting which treatments in a clinical trial, or poor data collection methods. As a researcher, you may delegate responsibility for parts of your study to other people or even to outside sources; you should make sure everyone involved follows your carefully planned procedure to the letter.
One way to avoid unrepresentative samples is to make sure you haven’t excluded certain population members, like minorities or people who work two jobs. If it isn’t possible to get a representative sample (perhaps because of the availability of participants), then you should adjust your results to reflect the population. One way to do this is with weighting factors, which are used to make a sample match the population. For example, let’s say your sample ends up 95% male and 5% female, while industry data tells you that the percentage of females may be as high as 25%. To make your results representative, you could give more “weight” to data from females, so that each group counts in proportion to its share of the population.
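The weighting in the example above can be sketched as follows. This assumes the population really is 25% female (the text only says it may be as high as that) and uses one common choice of weight, population share divided by sample share. The survey responses and the weighted_mean helper are hypothetical.

```python
# Shares from the example above: the sample is 95% male / 5% female,
# while the population is assumed to be 75% male / 25% female.
sample_shares = {"male": 0.95, "female": 0.05}
population_shares = {"male": 0.75, "female": 0.25}

# Weight for each group = population share / sample share.
weights = {g: population_shares[g] / sample_shares[g] for g in sample_shares}

def weighted_mean(values, groups, weights):
    """Mean of values, with each value weighted by its group's weight."""
    total = sum(weights[g] * v for v, g in zip(values, groups))
    return total / sum(weights[g] for g in groups)

# Hypothetical responses from 20 people (19 male, 1 female):
# 1.0 = "satisfied", 0.0 = "not satisfied". Invented for illustration.
groups = ["male"] * 19 + ["female"]
values = [1.0] * 19 + [0.0]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, groups, weights)

print(f"weights: male={weights['male']:.2f}, female={weights['female']:.2f}")
print(f"unweighted mean: {unweighted:.2f}")
print(f"weighted mean:   {weighted:.2f}")
```

Each female response counts about five times, and each male response about 0.79 times, so the weighted result reflects a 75/25 split rather than the 95/5 split in the raw sample. With only a handful of responses in the underrepresented group, the weighted estimate rests heavily on very few people, so it should be read with caution.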
On the other hand, some sampling methods are designed in such a way that they cannot be relied on to produce representative samples. However, they are useful for very specific situations, as long as you understand their limitations. For example, convenience sampling involves drawing a sample from a convenient, readily available population. Although the sample isn’t representative of the population, it can be useful for pilot testing.