# Sufficient Statistic & The Sufficiency Principle: Simple Definition, Example

## What is a Sufficient Statistic?

A sufficient statistic is a statistic that summarizes all of the information in a sample about a chosen parameter. For example, the sample mean, x̄, estimates the population mean, μ. x̄ is a sufficient statistic if it retains all of the information about the population mean that was contained in the original data points.

According to statistician Ronald Fisher,

“…no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter.”

In layman’s terms, a sufficient statistic is your best bet for summarizing your data; you can use it even if you don’t know any of the actual values in the sample.

## Sufficient Statistic Example

You can think of a sufficient statistic as a summary that allows you to estimate the population parameter as well as if you knew every data point in the sample.

For example, let’s say you have the simple data set 1, 2, 3, 4, 5, drawn from a normal population with known variance. You would calculate the sample mean as (1 + 2 + 3 + 4 + 5) / 5 = 3, which gives you an estimate of the population mean of 3. Now let’s assume you don’t know those values (1, 2, 3, 4, 5); you only know that the sample mean is 3. You would still estimate the population mean as 3, and that estimate would be just as good as the one made with the whole data set. For this model, the sample mean is a sufficient statistic for μ. To put this another way, if you have the sample mean, then knowing all of the individual data items makes no difference to how good your estimate is: it’s already “the best”.
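As a rough check of this idea, the sketch below assumes that the data are iid draws from a normal distribution with a known standard deviation of 1 (an assumption made for illustration). Under that model, how strongly the data favor one candidate mean over another depends on the data only through the sample mean. The two extra samples and all function names are hypothetical.

```python
import math

def normal_log_likelihood(data, mu, sigma=1.0):
    """Log-likelihood of iid Normal(mu, sigma^2) data; sigma is assumed known."""
    n = len(data)
    constant = -0.5 * n * math.log(2 * math.pi * sigma**2)
    return constant - sum((x - mu) ** 2 for x in data) / (2 * sigma**2)

def log_likelihood_ratio(data, mu_a, mu_b):
    """How much the data favor mu_a over mu_b, on the log scale."""
    return normal_log_likelihood(data, mu_a) - normal_log_likelihood(data, mu_b)

# Three different samples of size 5 that all have a sample mean of 3.
samples = {
    "original": [1, 2, 3, 4, 5],
    "constant": [3, 3, 3, 3, 3],
    "spread out": [-4, 0, 3, 6, 10],
}

# Compare the candidate means 2.0 and 3.5 on each sample.
ratios = {name: log_likelihood_ratio(data, 2.0, 3.5) for name, data in samples.items()}

for name, ratio in ratios.items():
    mean = sum(samples[name]) / len(samples[name])
    print(f"{name}: mean = {mean}, log-likelihood ratio = {ratio:.6f}")
```

All three samples give the same log-likelihood ratio, even though the individual values differ, because they share the same sample mean and sample size.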

Order statistics for iid samples are also sufficient statistics. This does not hold for data that isn’t iid, because only in iid samples can you reorder the data without losing meaning.
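One way to see why reordering is harmless for iid data: the joint density is a product of identical terms, so it does not change when the observations are permuted. The sketch below assumes an exponential model purely for illustration; the data values and function name are hypothetical.

```python
import math
import random

def exponential_log_likelihood(data, rate):
    """Joint log-density of iid Exponential(rate) observations."""
    return math.fsum(math.log(rate) - rate * x for x in data)

observed = [2.3, 0.4, 1.7, 0.9, 3.1]   # data in the order it was recorded
order_statistics = sorted(observed)    # the same values, smallest to largest

shuffled = observed[:]
random.Random(1).shuffle(shuffled)     # some other arbitrary ordering

# For every candidate rate, all three orderings give the same log-likelihood.
rates = (0.5, 1.0, 2.0)
orderings_agree = all(
    math.isclose(exponential_log_likelihood(observed, r),
                 exponential_log_likelihood(order_statistics, r))
    and math.isclose(exponential_log_likelihood(observed, r),
                     exponential_log_likelihood(shuffled, r))
    for r in rates
)
print(orderings_agree)
```

For data with dependence over time (a time series, for example), the joint density is not a product of identical terms, so this argument no longer applies.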

## Formal Definition of Sufficient Statistics

More formally, a statistic Y = T(X1, X2,…,Xn) is said to be sufficient for some parameter θ if the conditional distribution of the sample X1, X2,…,Xn, given the value of Y, doesn’t depend on θ. While this definition is fairly simple, actually finding the conditional distribution is the tough part and can be difficult in practice. A slightly easier way to check whether a statistic is sufficient is to use the Factorization Theorem.
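For a small discrete model, the conditional distribution in this definition can be computed by brute force. The sketch below assumes n iid Bernoulli(p) observations with T equal to their sum (an illustrative choice), and uses exact fractions so that no rounding is involved. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb

def conditional_given_sum(n, t, p):
    """P(X = x | sum(X) = t) for every 0/1 sequence x of length n whose sum is t."""
    def joint(x):
        k = sum(x)
        return p**k * (1 - p) ** (n - k)

    matching = [x for x in product((0, 1), repeat=n) if sum(x) == t]
    total = sum(joint(x) for x in matching)   # this is P(sum(X) = t)
    return {x: joint(x) / total for x in matching}

n, t = 4, 2
results = {
    p: conditional_given_sum(n, t, p)
    for p in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10))
}

for p, conditional in results.items():
    # Every sequence with two successes has conditional probability 1/6,
    # whichever value of p generated the data.
    print(p, set(conditional.values()))
```

The conditional probabilities equal 1 / C(n, t) for every p tried, which is what it means for the conditional distribution not to depend on the parameter.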

## Factorization Theorem

Suppose that you have a random sample X = (X1,…, Xn) from some distribution with parameter θ, and that f(x|θ) is the joint pdf of X. A statistic T(X) is sufficient for θ if and only if you can write the joint pdf in the following form for some functions g(t|θ) and h(x):

f(x|θ) = g(T(x)|θ)h(x)

Where:

• θ is the unknown parameter belonging to the parameter space Q,
• h(x) does not depend on θ,
• and the factorization holds for all values of x and all θ ∈ Q.
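As a worked illustration, assume the sample consists of n iid Bernoulli(θ) observations, so each x_i is 0 or 1 (this model is an assumption chosen for the example). The joint pmf then factors as:

$$
f(x \mid \theta)
= \prod_{i=1}^{n} \theta^{x_i} (1-\theta)^{1-x_i}
= \underbrace{\theta^{T(x)} (1-\theta)^{\,n - T(x)}}_{g(T(x) \mid \theta)} \cdot \underbrace{1}_{h(x)},
\qquad T(x) = \sum_{i=1}^{n} x_i .
$$

The first factor depends on the data only through T(x), and h(x) = 1 does not involve θ, so the number of successes is a sufficient statistic for θ in this model.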

## Complement of Sufficiency

An ancillary statistic is the complement of a sufficient statistic. While a sufficient statistic gives you all of the information in the sample about a parameter, an ancillary statistic on its own gives you none: its distribution doesn’t depend on the parameter.
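A standard illustration is the sample range in a location model. The sketch below assumes X_i = μ + Z_i with standard normal noise Z_i (an assumption for this example) and reuses the same noise for several values of μ. The range comes out the same for every μ, so its distribution cannot depend on μ, while the sample mean moves with μ. The seed, sample size, and function names are arbitrary.

```python
import math
import random

rng = random.Random(0)                      # fixed seed so the run is repeatable
noise = [rng.gauss(0.0, 1.0) for _ in range(10)]

def sample_range(data):
    return max(data) - min(data)

def sample_mean(data):
    return sum(data) / len(data)

summary = {}
for mu in (-5.0, 0.0, 5.0):
    sample = [mu + z for z in noise]        # same noise, shifted by mu
    summary[mu] = (sample_range(sample), sample_mean(sample))

for mu, (rng_value, mean_value) in summary.items():
    print(f"mu = {mu:+.1f}: range = {rng_value:.6f}, mean = {mean_value:.6f}")
```

The range is unchanged (up to floating-point rounding) as μ varies, which is the behavior of an ancillary statistic; the mean shifts by exactly the change in μ.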

## The Sufficiency Principle

The Sufficiency Principle, S (or Birnbaum’s S), allows us to potentially reduce our data footprint and eliminate extra, non-informative data. The data reduction method summarizes the data while retaining all the information about a particular parameter, θ.

Birnbaum (1962) was the first to outline the principle, which is defined as:

“In the presence of a sufficient statistic t(x) with statistical model E’, the inferences concerning θ from E and x should be the same as from E’ and t(x)” ~ (Fraser, 1962)

In other words, let’s say you have an observable variable x with a model E, and you also have a sufficient statistic t(x) with a model E’. Any inferences about a certain parameter from the first model should be the same as those made from the second model.

When collecting data, the sufficiency principle justifies ignoring certain pieces of information (Steel, 2007). For example, let’s say you were conducting an experiment recording the number of heads in a series of coin tosses. You could record every head and tail, along with their order: HTTHTTTHHH…. Or, you could just record the number of heads (e.g. 25 heads) out of the known number of tosses. For the purposes of a binomial experiment, the number of heads would be a sufficient statistic. Recording all of the tails, and their order, would give you no more information (assuming the variables are independent and identically distributed).
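The coin-toss example can be checked numerically. The sketch below assumes independent tosses with the same probability p of heads. It compares the likelihood of a fully recorded sequence with the binomial likelihood of its head count. The second sequence and the function names are hypothetical.

```python
from math import comb, isclose, prod

def sequence_likelihood(tosses, p):
    """Probability of one exact H/T sequence when P(heads) = p on each toss."""
    return prod(p if toss == "H" else 1 - p for toss in tosses)

def binomial_likelihood(heads, n, p):
    """Probability of observing `heads` heads in n tosses, order ignored."""
    return comb(n, heads) * p**heads * (1 - p) ** (n - heads)

seq_a = "HTTHTTTHHH"   # 5 heads, in the order recorded
seq_b = "HHHHHTTTTT"   # also 5 heads, in a different order

probabilities = (0.2, 0.5, 0.8)

# Two sequences with the same number of heads are equally likely for every p.
same_likelihood = all(
    isclose(sequence_likelihood(seq_a, p), sequence_likelihood(seq_b, p))
    for p in probabilities
)

# The count-based likelihood differs from the sequence-based one only by
# a factor that does not involve p (the number of possible orderings).
ratios = [binomial_likelihood(5, 10, p) / sequence_likelihood(seq_a, p)
          for p in probabilities]

print(same_likelihood, ratios)
```

Because the two likelihoods differ only by a constant that is free of p, they rank candidate values of p identically, which is why the order of the tosses can be discarded.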

## Viewing the Sufficiency Principle as Data Reduction

We know from the sufficiency principle that if we have a sufficient statistic Y = T(X) and a statistical model, the inferences we can make about θ from our model and X (the data set) must be the same as from that model and Y.

This makes sufficiency a very strong property: it is a way of reducing the data that condenses all the important information in our sample into the statistic.
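To make the data-reduction view concrete, the sketch below assumes an iid normal model with unknown mean and variance, for which the count, the sum, and the sum of squares together form a sufficient statistic. It keeps only those three running totals and recovers the same maximum likelihood estimates as the full data set. The class name, function name, and data values are hypothetical.

```python
import math

class NormalSummary:
    """Running sufficient statistics (n, sum x, sum x^2) for an assumed normal model."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        # Each observation updates three numbers and can then be discarded.
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mle(self):
        mean = self.total / self.n
        variance = self.total_sq / self.n - mean**2   # MLE divides by n, not n - 1
        return mean, variance

def mle_from_full_data(data):
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    return mean, variance

data = [4.0, 7.0, 1.0, 9.0, 4.0]

summary = NormalSummary()
for x in data:
    summary.add(x)

reduced_estimates = summary.mle()
full_estimates = mle_from_full_data(data)
print(reduced_estimates, full_estimates)
```

The three stored numbers take the place of the whole data set, however long it is. One practical caveat: the sum-of-squares shortcut can lose precision when the mean is very large relative to the spread, so this is a sketch of the idea rather than a numerically robust implementation.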

## The Sufficiency Principle Unpacked

Let’s consider for a moment Y, a sufficient statistic, and X, a set of observations. We want to look at the pair (X, Y). Since Y is a function of X, the pair (X, Y) will give us the same information about parameter θ that X does.

But since Y is sufficient, the conditional distribution of X given Y does not depend on θ.

What exactly does that mean?

Let X be your last statistics lecture and the video recording of it, and Y be the notes you took about it. Parameter θ is the information needed for question #7 on your class final. Y depends entirely on X, and the video plus your class notes include exactly the same information as the video alone; nothing is added. But if you’ve taken sufficient notes, the conditional distribution of the lecture given your notes is independent of that question #7 information. Conditional distribution here just means the probability distribution of the info in the lecture, given your notes. If info relevant to question #7 is in the lecture, it’s in your notes. Once you’ve checked your memorized notes, going back and listening to the lecture won’t help you solve question #7.

Now let’s go back to the generic sufficiency principle and mathematical statistics. Focusing on both pieces of information (data set X and statistic Y) does not give us any more information about θ than we’d have if we focused only on the statistic. And after looking at statistic Y, a look at X doesn’t give us any new information on θ: it won’t enlighten us on whether a particular value of θ is more or less likely than another.

So if Y is a sufficient statistic, we don’t need to consider data set X any more, after using it to calculate Y; it becomes redundant.

## References

Birnbaum, A. (1962). On the foundations of statistical inference. J. Am. Statist. Assoc. 57, 269-306.

Chen, H. (n.d.). Advanced Statistical Inference: Principles of Data Reduction. Retrieved December 12, 2017 from: http://www.math.ntu.edu.tw/~hchen/teaching/StatInference/notes/ch6.pdf

Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society A 222, 309–368.

Fraser, D. (1962). On the Sufficiency and Likelihood Principles. Technical Report No. 43. Retrieved December 29, 2017 from: https://statistics.stanford.edu/sites/default/files/CHE%20ONR%2043.pdf

Mezzetti, M. (n.d.). Principles of Data Reduction: The Sufficiency Principle. Universita Tor Vergata. Retrieved December 3, 2016 from: http://economia.uniroma2.it/master-science/financeandbanking/corso/asset/YTo0OntzOjI6ImlkIjtzOjM6IjI3NyI7czozOiJpZGEiO3M6NToiMjM2OTkiO3M6MjoiZW0iO047czoxOiJjIjtzOjU6ImNmY2QyIjt9

Ross, S. (2010). Introduction to Probability Models. 11th Edition. Elsevier.

Steel, D. (2007). Bayesian Confirmation Theory and the Likelihood Principle. Synthese 156: 53. Retrieved December 29, 2017 from: https://msu.edu/~steel/Bayes_and_LP.pdf