Reliability and Validity
Outside of statistical research, reliability and validity are often used interchangeably. In research and testing, they mean different things. Reliability implies consistency: if you take the ACT five times, you should get roughly the same results every time. A test is valid if it measures what it’s supposed to measure.
Tests that are valid are also reliable. The ACT is valid (and reliable) because it measures what a student learned in high school. However, tests that are reliable aren’t always valid. For example, let’s say your thermometer was a degree off. It would be reliable (giving you the same results each time) but not valid (because the thermometer wasn’t recording the correct temperature).
Reliability is a measure of the stability or consistency of test scores. You can also think of it as the ability of a test or research finding to be repeated. For example, a reliable medical thermometer gives the same reading each time it is used under the same conditions. In the same way, a reliable math test will measure mathematical knowledge consistently for every student who takes it, and reliable research findings can be replicated over and over.
Of course, it’s not quite as simple as saying you think a test is reliable. There are many statistical tools you can use to measure reliability. For example:
- Kuder-Richardson 20: a measure of internal reliability for a binary test (i.e. one with right or wrong answers).
- Cronbach’s alpha: measures internal reliability for tests with multiple possible answers.
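As a rough illustration of the two statistics above, here is a minimal sketch using only the Python standard library. The function names (`cronbach_alpha`, `kr20`) and the sample data are hypothetical and exist only for this example. The sketch assumes each row holds one respondent's item scores and uses population variances throughout; under that assumption, KR-20 gives the same value as Cronbach's alpha on right/wrong (0/1) data.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row per respondent,
    one column per test item)."""
    k = len(scores[0])                       # number of items
    item_columns = list(zip(*scores))        # transpose rows into item columns
    item_var_sum = sum(pvariance(col) for col in item_columns)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


def kr20(scores):
    """Kuder-Richardson 20 for binary items (1 = right, 0 = wrong)."""
    n = len(scores)                          # number of respondents
    k = len(scores[0])                       # number of items
    p = [sum(col) / n for col in zip(*scores)]   # proportion correct per item
    pq_sum = sum(pi * (1 - pi) for pi in p)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - pq_sum / total_var)


# Hypothetical data: 5 students, 4 right/wrong items.
binary_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

# Hypothetical data: 5 respondents, 3 items rated on a 1-5 scale.
rating_scores = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 1, 2],
    [3, 4, 3],
]

print(round(kr20(binary_scores), 3))         # 0.8
print(round(cronbach_alpha(rating_scores), 3))
```

Both functions divide by the variance of the total scores, so they will fail if every respondent has the same total. Statistical packages may use sample rather than population variances; for alpha the choice does not change the result as long as it is applied consistently.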
Internal vs. External Reliability
Internal reliability, or internal consistency, is a measure of how consistently the items within your test measure the same thing. External reliability means that your test or measure can be generalized beyond what you’re using it for. For example, a claim that individual tutoring improves test scores should apply to more than one subject (e.g. to English as well as math). A test for depression should be able to detect depression in different age groups, in people of different socio-economic statuses, and in introverts as well as extroverts.
One specific type is parallel forms reliability, where two equivalent versions of a test are given to the same students a short time apart. If the forms are parallel, the two versions should produce roughly the same observed results.
A reliability coefficient is a measure of how consistently a test measures achievement. It is the proportion of variance in observed scores (i.e. scores on the test) attributable to true scores (the theoretical “real” score that a person would get if a perfect test existed). Because true scores can’t be observed directly, the coefficient has to be estimated. Common ways to estimate it include:
- Cronbach’s alpha: the most widely used internal-consistency coefficient.
- A simple correlation between two scores from the same person is one of the simplest ways to estimate a reliability coefficient. If the scores are taken at different times, this estimates test-retest reliability; if different forms of the test are given on the same day, it estimates parallel forms reliability.
- Pearson’s correlation can be used to estimate the theoretical reliability coefficient between parallel tests.
- The Spearman-Brown formula adjusts the correlation between the two halves of a split-half test to estimate the reliability of the full-length test.
- Cohen’s Kappa measures interrater reliability.
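Three of the estimates in the list above can be sketched in a few lines of standard-library Python. This is an illustration under simple assumptions rather than a reference implementation: the function names and the sample scores are hypothetical, `spearman_brown` uses the standard prophecy formula with a lengthening factor of 2 for split halves, and `cohen_kappa` handles two raters assigning one category per subject.

```python
from collections import Counter
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman_brown(r_half, factor=2):
    """Estimate full-test reliability from the correlation between halves.
    factor=2 corresponds to doubling the half-test length."""
    return factor * r_half / (1 + (factor - 1) * r_half)


def cohen_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical test-retest data: the same five students tested twice.
first_sitting = [10, 12, 15, 18, 20]
second_sitting = [11, 12, 14, 19, 20]
print(pearson_r(first_sitting, second_sitting))

# A split-half correlation of 0.5 stepped up to full-test length.
print(round(spearman_brown(0.5), 3))   # 0.667

# Hypothetical ratings from two raters on four essays.
print(cohen_kappa(["pass", "pass", "fail", "fail"],
                  ["pass", "fail", "fail", "fail"]))   # 0.5
```

Note that `cohen_kappa` is undefined (division by zero) when chance agreement is already 1, for example when both raters place every subject in the same single category.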
The reliability coefficient ranges from 0 to 1. Rules of thumb for preferred levels of the coefficient:
- For high stakes tests (e.g. college admissions), > 0.85. Some authors suggest this figure should be above 0.90.
- For low stakes tests (e.g. classroom assessment), > 0.70. Some authors suggest this figure should be above 0.80.
Related types of reliability and validity:
- Composite Reliability
- Concurrent Validity
- Content Validity
- Convergent Validity
- Consequential Validity
- Criterion Validity
- Curricular Validity and Instructional Validity
- Ecological Validity
- External Validity
- Face Validity
- Formative Validity & Summative Validity
- Incremental Validity
- Internal Validity
- Predictive Validity
- Sampling Validity
- Statistical Conclusion Validity
Validity is defined by how well a test measures what it’s supposed to measure. Curricular validity refers to how well test items reflect the actual curriculum (i.e. a test is supposed to be a measure of what’s on the curriculum). It usually refers to a specific, well-defined curriculum, like those provided by states to schools. McClung (1978) defines it as
“…a measure of how well test items represent the objectives of the curriculum”.
A similar term is instructional validity, which is how well the test items reflect what is actually taught. McClung defines instructional validity as “an actual measure of whether the schools are providing students with instruction in the knowledge and skills measured by the test.”
In an ideal educational world, there would be no need for a distinction between instructional and curricular validity: teachers would follow the curriculum, and students would learn what is on it from their teachers. However, it doesn’t always follow that a child will be taught what is on the curriculum. Many things can affect which parts of the curriculum are taught (or not taught), including:
- Inexperienced teachers.
- Substitute teachers.
- Poorly managed schools or a poor flow of information.
- Teachers choosing not to teach specific parts of the curriculum they don’t agree with (e.g. evolution or sex education).
- Teachers skipping over parts of the curriculum they don’t fully understand, such as mathematics. According to Ostashevsky (2016), elementary school teachers struggle with basic math concepts.
How to Measure Curricular Validity
Curricular validity is usually measured by a panel of curriculum experts. It’s not measured statistically, but rather by a rating of “valid” or “not valid.” A test that meets one definition of validity might not meet another. For example, a test might have curricular validity but not instructional validity, or vice versa.
References
McClung, M. S. (1978). Competency testing programs: Legal and educational issues. Fordham Law Review, 47, 651-712.
Ostashevsky, L. (2016). Elementary school teachers struggle with Common Core math standards.