Statistics How To

Latent Semantic Analysis

What is Latent Semantic Analysis?

Latent Semantic Analysis (LSA) is a way to analyze how words and groups of words are used in texts. It is used to answer questions like:

  • What is the underlying meaning of the text?
  • What effect do words have on the meaning of passages?
  • How does the average meaning of words in a passage relate to the overall meaning of a passage?

Language (especially the English language) is complex, in part because words have multiple meanings. For example, the word “hot” can mean a variety of things, including “near boiling,” “sexy,” or “priced to sell.” Which meaning applies depends on the context (i.e. the surrounding passage). Because “hot” in one text might mean something completely different in another, finding related words, passages, or entire texts is no easy task. LSA attempts to do this by mapping words to concepts like “temperature,” “sex,” or “business.” The words and their linked concepts are then compared to estimate the meaning of the text.

Latent semantic analysis is also called latent semantic indexing (LSI).

Method

Figure: A matrix where each element shows how often words appear in a text.

LSA uses an advanced matrix algebra method called Singular Value Decomposition (SVD) to factorize matrices. SVD is usually impractical to perform by hand for anything more than a small sample of text. In fact, it only became popular after the 1980s, once computers were widely available to handle the computation.
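
For reference, the factorization can be written out as follows. This is the standard textbook form of SVD rather than notation taken from this article; the symbols X, U, Σ, V and k are chosen here purely for illustration.

```latex
X = U \Sigma V^{\top}, \qquad
X_k = U_k \Sigma_k V_k^{\top}
```

Here X is assumed to be the word-by-passage count matrix, U and V have orthonormal columns, and Σ is a diagonal matrix of non-negative singular values sorted from largest to smallest. X_k keeps only the k largest singular values and the matching columns of U and V, which gives the closest rank-k approximation to X in the least-squares sense. LSA works with this reduced version, where k is a number of dimensions chosen by the analyst.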


The basic method is:

  • The text is converted into a matrix of words and passages. Each cell in the matrix contains the number of times a certain word appears in a certain passage.
  • The matrix is factorized so that every passage is represented as a vector. Each passage vector is the sum of the vectors representing its component words.
  • Dot products, cosines, or similar metrics are used to measure similarities between words and passages.
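
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: it assumes NumPy is installed, and the toy passages, the variable names, the choice of rows for words and columns for passages, and the choice of k = 2 dimensions are all made up for the example.

```python
import numpy as np

# Hypothetical toy corpus: two passages about temperature, two about prices.
passages = [
    "hot water boiling water",
    "boiling hot soup",
    "hot deal low price",
    "low price sale deal",
]

# Step 1: build the count matrix (assumed layout: rows = words, columns = passages).
vocab = sorted({word for p in passages for word in p.split()})
counts = np.array(
    [[p.split().count(word) for p in passages] for word in vocab],
    dtype=float,
)

# Step 2: factorize with SVD and keep only the k largest singular values.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2  # number of dimensions to keep; an arbitrary choice for this sketch
word_vecs = U[:, :k]  # one k-dimensional vector per word

# Each passage vector is the count-weighted sum of its words' vectors.
passage_vecs = counts.T @ word_vecs


# Step 3: compare vectors with the cosine of the angle between them.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Similarity of the first passage to every passage (including itself).
similarities = [cosine(passage_vecs[0], v) for v in passage_vecs]
print(dict(zip(passages, similarities)))
```

In this sketch, the first passage should score as more similar to the other temperature passage than to the final pricing passage, even though "hot" appears in both groups. Real applications usually use far larger collections of text, weight the raw counts before factorizing, and keep many more than two dimensions.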

The theory behind the algorithms used in SVD is beyond the scope of this article, but you can read more about it in this University of Victoria article.

Latent Semantic Analysis was last modified: October 12th, 2017 by Stephanie Glen