One of the oldest and most common ways of storing information is text. Humans have used text to record data for thousands of years, long before the existence of Microsoft Excel or databases. Today, most information in an organization exists as text: emails, Google Docs, or even Post-it notes on the wall. That's where Natural Language Processing, or NLP, comes into play.
NLP is a branch of data science. Its main purpose is to turn text (and speech) into structured data from which we can obtain actionable insights. The ‘natural’ in NLP contrasts with the ‘unnatural’, formal languages used to program computers, such as C++, JavaScript, and Python.
NLP has several applications, which can be grouped into two main categories:
Both are highly useful to the largely text-based SEM industry. In SEM, you can use NLP for tasks such as search term reports and keyword expansion tools. Both tasks use NLP to analyze search queries, detect associated keywords, and suggest related ones. Other, more complex uses of NLP involve generating audiences based on search term content or forming a baseline bid for long-tail keywords.
Phrase Segmentation
The first step in an NLP process is phrase segmentation. Specifically, NLP breaks the text down into smaller segments, typically at full stops or commas.
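As a rough illustration of this step, the sketch below splits a piece of text at full stops, commas, and similar punctuation using only Python's standard library. The function name `segment_phrases` and the sample text are assumptions made for this example; production NLP libraries use more careful rules (for instance, to handle abbreviations).

```python
import re

def segment_phrases(text):
    """Split text into segments at full stops, commas, and similar marks."""
    # Split on punctuation only when it is followed by whitespace or the end
    # of the text, so that a decimal such as "3.5" is left intact.
    parts = re.split(r"[.,!?;]+(?:\s+|$)", text)
    # Drop empty strings left over from the split.
    return [p.strip() for p in parts if p.strip()]

segments = segment_phrases(
    "I need running shoes, preferably cheap. Do you ship to Canada?"
)
print(segments)
# ['I need running shoes', 'preferably cheap', 'Do you ship to Canada']
```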
Tokenization
After that comes tokenization, which is a fancy way of saying that we are going to “define the unit of the language”. In the case of English, that unit is the word, which is easy to separate because it sits between spaces. Then comes a first-grade refresher: deciding what type of word (verb, noun, or adjective) each token is, a step usually called part-of-speech tagging. This is easy for certain words such as car. However, it might need context when a word has multiple meanings, such as bitter or fair.
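A minimal sketch of word-level tokenization for an English search query might look like the following. The `tokenize` helper and the sample query are assumptions for illustration only, and the part-of-speech decision described above is not shown because it generally requires a trained model or a dictionary.

```python
import re

def tokenize(phrase):
    """Lowercase a phrase and split it into word tokens."""
    # A token is a run of letters or digits, optionally containing
    # internal apostrophes (so "men's" stays in one piece).
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", phrase.lower())

tokens = tokenize("Cheap men's running shoes, size 10")
print(tokens)
# ['cheap', "men's", 'running', 'shoes', 'size', '10']
```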
Lemmatization
After that, we go to lemmatization, which means reducing a word to its base form. Some forms look quite different from the base word, as geese does from goose; lemmatization essentially tracks down the base word associated with all of those variations.
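One simple way to picture lemmatization is as a dictionary lookup from each variation to its base word. The sketch below assumes a tiny, hand-written lookup table (`LEMMAS`, hypothetical and far from complete); real lemmatizers rely on a full vocabulary and on the word's part of speech.

```python
# Hypothetical lookup table for illustration; a real lemmatizer would use
# a complete dictionary rather than a handful of entries.
LEMMAS = {
    "geese": "goose",
    "mice": "mouse",
    "children": "child",
    "ran": "run",
    "running": "run",
    "shoes": "shoe",
}

def lemmatize(token):
    """Return the base form of a token, or the token itself if unknown."""
    token = token.lower()
    return LEMMAS.get(token, token)

print([lemmatize(t) for t in ["Geese", "goose", "shoes", "ran"]])
# ['goose', 'goose', 'shoe', 'run']
```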
Remove Stop Words
From there, we remove stop words, or filler words. These are stock words that don't really add any meaning to the phrase itself, such as and, is, and the. NLP removes these words to reduce the noise while interpreting the phrase.
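In code, stop word removal can be as simple as filtering tokens against a fixed list. The short `STOP_WORDS` set below is an assumption for this example; NLP libraries ship much longer lists, and the right list depends on the task.

```python
# A small, illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "is", "the", "of", "to", "in", "for"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

query = ["the", "best", "shoes", "for", "running", "in", "the", "rain"]
print(remove_stop_words(query))
# ['best', 'shoes', 'running', 'rain']
```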
Machine Learning
The following step is perhaps the most complex: using machine learning to understand how each component of the phrase relates to the others.
Noun Recognition
Once we have moved past the grammar portion of this process, the NLP funnel moves into noun recognition, which splits the phrase again, this time using nouns as the segments from which to extract information. It works like this: say our NLP system has detected nouns like “EU,” “Trump,” and “California.” Our Named Entity Recognition algorithm (as it is technically called) recognizes that California and the EU are geographical locations and that Trump is an American politician.
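To make the idea concrete, here is a toy version based on a lookup table of known names (often called a gazetteer). The `GAZETTEER` entries simply mirror the example above and are hypothetical; real Named Entity Recognition systems typically use statistical models trained on annotated text rather than a fixed list.

```python
# Hypothetical gazetteer: known names mapped to the entity types used in
# the example above.
GAZETTEER = {
    "EU": "geographical location",
    "California": "geographical location",
    "Trump": "American politician",
}

def tag_entities(tokens):
    """Return (token, entity type) pairs for tokens found in the gazetteer."""
    return [(t, GAZETTEER[t]) for t in tokens if t in GAZETTEER]

tokens = "Trump visited California after EU talks".split()
for name, entity_type in tag_entities(tokens):
    print(f"{name}: {entity_type}")
```

A lookup like this cannot handle names it has never seen or words that are ambiguous, which is one reason the machine learning step matters.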
Coreference Resolution
The final step of the NLP funnel is coreference resolution, which aims to understand pronouns. While humans can determine through context which noun a pronoun refers to, that is much trickier for a computer.
NLP is not without its challenges. Computer programming is based on understanding the literal meaning of structured languages. Moving from there to understanding unstructured natural language, with its contextual references, metaphors, spelling mistakes, and all the idiosyncrasies of our written and oral communication, is a huge leap.
Take, for example, the following headline from a major news publication: “Labor admits Brexit could lead UK to freefall.” A literal interpretation of this phrase is that the physical act of work (labor) has somehow gained consciousness and admits that, were Brexit to occur, the entire physical United Kingdom would somehow be sucked into space and dropped out of Earth's gravity. Of course, through context and social understanding, we immediately know the real meaning of the phrase. But we only acquire that understanding through years of practice interpreting and reading between the lines. Only context gives the phrase the meaning it really has: that a political party admitted to a potential economic impact were the UK to separate from the EU.
Here’s the problem: the computer not only has to understand the literal meaning of every word (even terms such as UK or Brexit, which are acronyms or coinages not found in standard English), but it also has to derive the potential contextual meaning from the combination of words in the phrase.
A whole spectrum of new developments is possible thanks to NLP. For example:
NLP also enables a new generation of search engines in which the user searches as they speak, eliminating keywords and topics (this is already present in Apple’s Siri and Amazon’s Alexa). The potential of NLP not only looks to the future but also to the past. Once NLP is sufficiently powerful, researchers can use it to analyze all the text data acquired from past activities, thus creating a sort of backfill for all the unstructured data that we have accumulated.
NLP is a branch of data science that uses a series of steps to segment and extract information from text and speech. It aims to overcome the limitations of software that understands only formal or structured data. Text or language is ‘unstructured,’ so converting it into ‘structured’ data allows us to turn a collection of information into actionable insights. And it’s becoming a valuable tool in SEM.
NLP tools are flooding the SEM industry for such functions as keyword expansion, search query analysis, predictive search, and voice search. And this is just the beginning of its many use cases. Going forward, we will leverage NLP in new, innovative ways, further bridging the gap between human language and computer data.