What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is an approach to analyzing data. It’s where the researcher takes a bird’s eye view of the data and tries to make some sense of it. It’s often the first step in data analysis, implemented before any formal statistical techniques are applied. Although specific statistical techniques can be used, like creating histograms or box plots, EDA is not a set of techniques or procedures; the Engineering Statistics Handbook calls EDA a “philosophy.” EDA is considered by some to be more of an art form than a science.
Exploratory data analysis is a complement to inferential statistics, which tends to be fairly rigid with rules and formulas. EDA involves the analyst trying to get a “feel” for the data set, often using their own judgment to determine what the most important elements in the data set are.
The purpose of exploratory data analysis is to:
- Check for missing data and other mistakes.
- Gain maximum insight into the data set and its underlying structure.
- Uncover a parsimonious model, one which explains the data with a minimum number of predictor variables.
- Check assumptions associated with any model fitting or hypothesis test.
- Create a list of outliers or other anomalies.
- Find parameter estimates and their associated confidence intervals or margins of error.
- Identify the most influential variables.
EDA can also yield other, more specific results, such as a ranked list of relevant factors. You need not include every item above in your data analysis, although you will likely want at least a few. Treat them as guidelines rather than rigid rules.
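As a rough illustration of two of the goals above (checking for missing data and listing outliers), here is a minimal Python sketch. The `ages` sample is hypothetical, and the 1.5 × IQR cutoff is only one common convention for flagging outliers; it is an assumption here, not something EDA prescribes.

```python
import statistics

# Hypothetical sample: ages with one missing entry (None) and one suspicious value.
ages = [23, 31, 27, None, 35, 29, 41, 26, 33, 140]

# Check for missing data.
missing_count = sum(1 for a in ages if a is None)
observed = [a for a in ages if a is not None]

# Flag outliers with the 1.5 * IQR rule (an assumed convention, not the only option).
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in observed if a < lower or a > upper]

print("missing:", missing_count)
print("outliers:", outliers)
```

Whether a flagged value is a data-entry mistake or a genuine extreme is still a judgment call for the analyst, which is in keeping with the exploratory spirit described above.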
Types of Exploratory Data Analysis
EDA falls into four main areas:
- Univariate non-graphical: looking at one variable of interest, such as age, height, or income level.
- Univariate graphical.
- Multivariate non-graphical: analysis of multiple variables at the same time.
- Multivariate graphical, for example multidimensional scaling, a visual representation of distances or similarities between sets of objects.
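The two non-graphical areas can be sketched with nothing more than Python's standard library. The `heights` and `weights` values below are hypothetical, and `summarize` and `pearson` are illustrative helper names rather than part of any particular package. The graphical areas would need a plotting library and are left out of this sketch.

```python
import statistics

# Hypothetical data: heights (cm) and weights (kg) for six people.
heights = [150, 160, 165, 170, 180, 185]
weights = [50, 56, 61, 66, 75, 82]


def summarize(values):
    """Univariate non-graphical: summary statistics for a single variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Multivariate non-graphical: Pearson correlation between two variables."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


height_summary = summarize(heights)
r = pearson(heights, weights)

print(height_summary)
print("correlation:", round(r, 3))
```

A correlation close to 1 here would suggest plotting the two variables against each other as a next step, which moves the analysis into the multivariate graphical area.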