Discretization is a first step in many types of analysis, because discrete functions and data are much easier to analyze than their continuous counterparts.
The Theory Behind Discretization
We can visualize the process of discretizing as:
- Analyzing the continuous values a variable takes on,
- Dividing them into segments,
- Grouping them into bins.
The last step involves two decisions: first, how many bins to use; and second, how wide to make them.
It is important to realize that any actual discretization introduces a certain amount of error. The main goal when choosing the number of bins and their width is to keep that error as small as possible. We can do this by increasing the number of intervals we divide our function or variable into, just as a pixelated photo made up of tiny squares becomes more true-to-life as we decrease the size of the squares. But the more intervals we use, the more unwieldy our discretization becomes, so we end up looking for a balance: what is the minimum number of intervals we can divide this function into and still get reasonably accurate results?
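The trade-off between the number of bins and the error can be seen in a small experiment. The following is a minimal Python sketch, not a standard routine: it assumes equal-width bins, represents each bin by its midpoint, and measures error as the mean absolute difference from the original values. The function names and the sine-wave sample data are chosen purely for illustration.

```python
import math

def discretize_equal_width(values, n_bins):
    """Replace each value with the midpoint of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    out = []
    for v in values:
        # Clamp so the maximum value lands in the last bin, not one past it.
        idx = min(int((v - lo) / width), n_bins - 1)
        out.append(lo + (idx + 0.5) * width)
    return out

def mean_abs_error(values, n_bins):
    """Average distance between each value and its bin midpoint."""
    approx = discretize_equal_width(values, n_bins)
    return sum(abs(a - v) for a, v in zip(approx, values)) / len(values)

# One period of a sine wave stands in for a continuous variable.
samples = [math.sin(2 * math.pi * i / 1000) for i in range(1000)]

for n_bins in (2, 4, 8, 16, 32):
    print(n_bins, round(mean_abs_error(samples, n_bins), 4))
```

No value can be further than half a bin width from its midpoint, so the error shrinks as the bins get narrower, at the cost of more bins to keep track of.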
Types of Discretization
When one variable at a time is discretized, it’s called static variable discretizing. This is the most common type of discretization.
Dynamic variable discretizing involves discretizing all the variables at once, or simultaneously. In dynamic variable discretization you have to keep track of, and deal with, any interdependencies (interactions) between the variables.
Unsupervised discretization algorithms are the simplest algorithms to make use of, because the only parameter you would specify is the number of intervals to use; or else, how many values should be included in each interval.
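The two parameter choices described above correspond to what are usually called equal-width and equal-frequency binning. Here is a minimal Python sketch of both, with illustrative function names and made-up data; it is one simple way to implement the idea, not a reference implementation.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index; every bin spans the same range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Clamp so the maximum value lands in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index; every bin holds about the same count."""
    # Rank the values, then cut the ranking into n_bins chunks.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_bins(data, 3))      # [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equal_frequency_bins(data, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

The single outlier (100) stretches the equal-width bins so that almost everything falls into the first one, while equal-frequency binning still spreads the values evenly. Note that this rank-based version can place tied values in different bins, which a production implementation would usually avoid.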
Supervised discretization algorithms make use of the class labels attached to the data. You don’t specify the number of bins; instead, the algorithm chooses the intervals using entropy- and purity-based calculations.
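To give a feel for an entropy-based calculation, here is a minimal Python sketch that finds the single cut point leaving the purest class labels on each side. It is a simplified illustration: the function names and the age/purchase data are invented, and real supervised methods typically apply a step like this recursively with a stopping rule.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (cut_point, weighted_entropy) for the purest binary split."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can sit between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Weight each side's entropy by the share of points it holds.
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cut, score)
    return best

# Hypothetical data: customer ages and whether they bought a product.
ages = [22, 25, 30, 35, 41, 48, 52, 60]
bought = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]

cut, score = best_split(ages, bought)
print(cut, score)
```

For this data the chosen cut is 38.0, halfway between 35 and 41, because that boundary separates the two classes perfectly and leaves zero entropy on both sides.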
Methods of Discretization
The Minimum Description Length principle (MDL) model for discretization is perhaps the most commonly used. It uses “dynamic repartitioning”, applying mutual information recursively to define the best intervals or bins. Other mechanisms include:
- Ameva: this algorithm uses chi-square statistics to maximize a contingency coefficient, generating the minimum number of discrete intervals.
- CACC: the Class-Attribute Contingency Coefficient algorithm is another supervised top-down discretization method that uses a contingency coefficient. It generates bins with a ‘greedy method’.
- CAIM: the Class-Attribute Interdependence Maximization method maximizes mutual class-attribute interdependence. The goal is to generate the smallest number of bins for a single continuous attribute.
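As an illustration of the kind of score these methods optimize, here is a minimal Python sketch of the CAIM criterion. It assumes the commonly cited form of the criterion: for each interval, square the count of the most frequent class, divide by the total count in that interval, then average over the intervals. The `quanta` table layout and the function name are assumptions made for this example, and the sketch only scores a given set of bins; it does not perform the greedy search for cut points.

```python
def caim(quanta):
    """CAIM score for a class-by-interval table of counts.

    quanta[c][r] is the number of examples of class c in interval r.
    """
    n_intervals = len(quanta[0])
    total = 0.0
    for r in range(n_intervals):
        column = [row[r] for row in quanta]
        col_sum = sum(column)
        if col_sum:  # skip empty intervals to avoid dividing by zero
            total += max(column) ** 2 / col_sum
    return total / n_intervals

# Two classes, two intervals: each interval holds only one class.
print(caim([[4, 0], [0, 4]]))  # 4.0

# Same data, but the classes are evenly mixed in both intervals.
print(caim([[2, 2], [2, 2]]))  # 1.0
```

A higher score means each interval is dominated by a single class, and dividing by the number of intervals favors discretizations with fewer bins.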
Dougherty, et al. (1995). Supervised and Unsupervised Discretization of Continuous Features. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, California, July 9–12, 1995. Retrieved January 6, 2018 from: https://doi.org/10.1016/B978-1-55860-377-6.50032-3
Juhola, M. (2016). Data Mining Advanced Study Course. Retrieved January 6, 2018 from: http://www.uta.fi/sis/tie/tl/index/Datamining6.pdf