Not the Usual (Data) Suspects: Text Analysis Gets Interesting

UNLIMITED DATA | BY JAMES KULICH | 5 MIN READ

You know how frustrating it can feel when you are not understood? I recently overheard someone I know getting more and more agitated as she needed to tell her phone at least three times to set an alarm to go off in 20 minutes. The impudence of that phone!

Text analysis and language processing technologies have come a long way in a short time. That progress is remarkable given how difficult it is for software to make sense of spoken or written language.

Text Analysis Challenges

Normand Peladeau, CEO of Montreal-based Provalis Research and creator of the state-of-the-art text analytics tool Wordstat, gave us a sense of the challenges involved in developing tools for natural language processing during a workshop at Elmhurst University earlier this year.

Words have many forms. A collection of 32,000 hotel reviews would typically contain about 1.7 million words, of which only about 20,000 are truly distinct after combining words that differ only in form, such as singular versus plural.

One idea can be expressed via many words or phrases. What idea comes to mind from this list: “not interesting,” “dull,” “fell asleep,” “not fun”?

Boring!

Meanwhile, one word can have many meanings as contexts change. Consider this sentence: “This fall take a break from the cold; catch a plane and go south.” There are roughly 31.7 billion possible ways to combine the various meanings of the individual words in this sentence.
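A count like that comes from multiplication: if each word can be read in several ways, the number of combined readings is the product of the per-word counts. Here is a minimal sketch of that arithmetic in Python. The sense counts below are made-up placeholders for a handful of the words, not figures from the workshop or from any dictionary, so the total is only illustrative.

```python
import math

# Hypothetical number of senses per word (placeholders for illustration only).
sense_counts = {
    "fall": 3,
    "break": 4,
    "cold": 2,
    "catch": 5,
    "plane": 3,
}

# Each choice of sense for one word can pair with every choice for the others,
# so the number of combined readings is the product of the counts.
total_readings = math.prod(sense_counts.values())

print(total_readings)  # 3 * 4 * 2 * 5 * 3 = 360
```

Even with five words and small counts the total reaches the hundreds. A full sentence, with realistic dictionary counts for every word, grows far faster.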

How about misspellings? Peladeau supplied another example during the workshop: In a collection of 1.8 million student comments from course evaluations, there were 95 different spellings of “enthusiastic.”
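One common way to pull such variants together is approximate string matching. The sketch below uses `difflib` from Python's standard library. The misspellings, the helper name `find_variants`, and the 0.8 similarity cutoff are all assumptions chosen for illustration. They are not taken from the course evaluation data or from how any particular product works.

```python
import difflib

def find_variants(words, target, cutoff=0.8):
    """Return the distinct words whose similarity to `target` is at least `cutoff`."""
    vocabulary = sorted(set(words))
    # n caps how many matches come back; allow every word in the vocabulary.
    return difflib.get_close_matches(
        target, vocabulary, n=max(1, len(vocabulary)), cutoff=cutoff
    )

# Made-up sample of words as they might appear in student comments.
comment_words = [
    "enthusiastic", "enthusiatic", "enthousiastic",
    "enthusastic", "entusiastic", "energetic", "organized",
]

variants = find_variants(comment_words, "enthusiastic")
print(variants)
```

With this sample, the four misspellings and the correct spelling are grouped together, while "energetic" and "organized" fall below the cutoff. A lower cutoff catches more mangled spellings but also starts to pull in unrelated words.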

Assembling a Text Analytics Toolkit

Fortunately, there are some powerful tools for dealing with all of this. Systematic techniques such as stemming and lemmatization reduce word forms to a common base that can be used in text analytics.
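To make the idea concrete, here is a toy sketch in Python that strips a few common plural endings and compares vocabulary size before and after. The rules in `naive_base_form` and the three sample reviews are assumptions made for illustration. Production tools use much richer rule sets or dictionaries, and this is not a description of how Wordstat does it.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_base_form(word):
    """Toy rules (illustrative only): reduce a few common plural endings."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"            # amenities -> amenity
    if word.endswith(("sses", "xes", "ches", "shes")):
        return word[:-2]                  # boxes -> box, dishes -> dish
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]                  # rooms -> room
    return word

# Made-up sample reviews.
reviews = [
    "The rooms were clean and the room service was quick.",
    "Clean room, friendly staff, great amenities.",
    "One amenity we loved: the pool. Rooms facing the pool are best.",
]

tokens = [token for review in reviews for token in tokenize(review)]
raw_vocabulary = set(tokens)
base_vocabulary = {naive_base_form(token) for token in tokens}

print(len(tokens), len(raw_vocabulary), len(base_vocabulary))
```

The three printed numbers are the total word count, the count of distinct raw words, and the count of distinct base forms. The last is smaller because "rooms" and "room", and "amenities" and "amenity", now count as one word each.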

Some old ideas from social science research, such as factor analysis, turn out to be quite useful when extracting topics from collections of words. Some really new ideas, such as deep learning neural networks, allow us to make the jump from rather clumsy attempts at language processing to the tools commonly available on our phones—those we now expect to work on the first try!

One example we use in our courses involves examining 10,000 Yelp restaurant reviews. The goal is to determine steps you might take to get to five stars. Ten thousand reviews translate to about 1.3 million words. A tool like Wordstat can ingest and do a basic topical analysis of this data in seconds.

After a deeper dive, a path to five stars emerges: Serve good ice cream!

Why Use Text Analysis?

There is real business value to be gained from text analytics. Once digital, text data can be combined with all of the other forms of data commonly available in a business environment.

Suppose your goal is to get a sense of when the stock price of a technology company might be ready to change direction. Your initial instinct may be to dive into all of the market data available. Given that these sorts of predictions are really hard, you probably won’t see much of a result. But what if you could capture a sense of local or even world events that might affect your stock price? Text analysis, applied to readily available news headlines, will allow you to do just that.

Facts are great, but what about feelings? Can you capture the sentiments of people who might be active in your market? Once again, text analytics is the tool of choice.

A tool like Wordstat provides a basic sentiment dictionary, a way to score known text based on positive or negative sentiment. A more customized sentiment dictionary can be developed from this base and used to analyze new and relevant text. The result: another boost in your ability to predict a change in stock price.
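In its simplest form, dictionary-based scoring adds up the weights of the sentiment words found in a piece of text. The Python sketch below shows a small base dictionary extended with domain terms. Every word, weight, and headline here is a made-up assumption for illustration, and none of it reflects the contents of Wordstat's dictionary.

```python
import re

# Hypothetical base dictionary: word -> sentiment weight.
BASE_LEXICON = {
    "good": 1, "great": 1, "strong": 1,
    "bad": -1, "weak": -1, "poor": -1,
}

# Hypothetical customization for technology-company news.
CUSTOM_LEXICON = {**BASE_LEXICON, "upgrade": 1, "recall": -2, "lawsuit": -2}

def sentiment_score(text, lexicon):
    """Sum the weights of all lexicon words that appear in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)

# Made-up headlines.
headlines = [
    "Chipmaker posts strong quarter after analyst upgrade",
    "Phone maker faces lawsuit over battery recall",
]

for headline in headlines:
    base = sentiment_score(headline, BASE_LEXICON)
    custom = sentiment_score(headline, CUSTOM_LEXICON)
    print(f"{base:+d} {custom:+d}  {headline}")
```

The base dictionary scores the second headline as neutral because it knows neither "lawsuit" nor "recall". The customized dictionary scores it as clearly negative, which is the kind of gain a domain-specific dictionary is meant to deliver.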

None of this is perfect. Some see sentiment analysis as a magic bullet. It’s not. But when the usual data sources are combined with novel forms of data like text, progress can be made.

A key is recognizing that data points come in many forms, not just the lists of numbers to which we are accustomed. Sometimes the source of real business value is hidden in plain sight—everyday text data. We now have the tools we need to tap into what these so-called unstructured data sets offer.

At Elmhurst University, we emphasize the creation of value throughout all of our Master’s in Data Science and Analytics courses. A deep dive into the possibilities of text analytics is one way we do this.

About the Author

Jim Kulich is a professor in the Department of Computer Science and Information Systems at Elmhurst University. Jim directs Elmhurst’s master’s program in data science and analytics and teaches courses to graduate students who come to the program from a wide range of professional backgrounds.