Introduction

Natural language processing is one of the most promising areas of machine learning and artificial intelligence today, and the one that is growing the fastest. There are lots of figures out there that try to predict the growth of the market, but I prefer not to mention them, as they are only predictions.

Natural language processing covers a wide range of topics, for example:

  • Speech recognition
  • Language translation
  • Virtual assistants
  • Text classification
  • Text summarization
  • Text generation

A number of libraries and cloud-based APIs are available today, and the choice is growing. This series of articles will dig a bit deeper into some of the available techniques and experiment with NLP libraries and APIs in the context of entity recognition in a given document, document similarity and document classification.

Using Google NLP API with R for entity detection

Setting up the project in Google Cloud Platform

Before anything else, you need a registered Google account.

As a first step, create a project on the Google Cloud Platform.

Once your project is created, go to APIs & Services and start enabling an API service.

In the list of available APIs, choose the Cloud Natural Language API.

Then enable the API by clicking Enable. Once that is done, you need to add credentials to the API and get the access and API keys.

This is done on the credentials screen: download your credentials in JSON format, since you will need the file on your machine when you want to run the experiment. This step is required.

Experimenting with R

To experiment with the Google API from R, we will use the googleLanguageR package:

install.packages('googleLanguageR')
library(googleLanguageR)

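You need to tell the library where to find the credentials you downloaded earlier. Either pass the JSON file name to gl_auth(), or set the GL_AUTH environment variable to the file path so that the package can pick it up when it is loaded:

# Option 1: authenticate explicitly with the downloaded JSON key
gl_auth("your file.json")
# Option 2: set GL_AUTH (for example in .Renviron) before library(googleLanguageR)
# Sys.setenv(GL_AUTH = "your file.json")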

In our experiment we will try to identify the major entities in two texts and find the entities they have in common. We will use a job description and a person's profile, and try to match one with the other. First let us load the two PDF files and do some cleaning:

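library(pdftools)
library(stringr)
# Person profile (CV): pdf_text() returns one string per page
text <- pdf_text("xxxxxx.pdf")
textCV <- paste(text, collapse = " ")
# Replace anything that is not a visible character (tabs, line breaks, ...) with a space
textCV <- str_replace_all(textCV, "[^[:graph:]]", " ")
# Job description: same steps
text <- pdf_text("yyyyyyy.pdf")
textJob <- paste(text, collapse = " ")
textJob <- str_replace_all(textJob, "[^[:graph:]]", " ")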
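Since the same steps are applied to both files, they could be wrapped in a small helper. The following is only a sketch in base R: the function name pages_to_text is made up for illustration, it takes the page vector that pdf_text() returns rather than a file name, and the extra step that squeezes repeated spaces is an addition that the original code does not perform.

```r
# Hypothetical helper: collapse a character vector of pages into one cleaned string.
pages_to_text <- function(pages) {
  # Join all pages into a single string
  text <- paste(pages, collapse = " ")
  # Replace every character that is not a visible character with a space
  text <- gsub("[^[:graph:]]", " ", text)
  # Assumption, not in the original: squeeze runs of spaces and trim the ends
  trimws(gsub(" +", " ", text))
}

# Toy input standing in for the output of pdf_text()
pages <- c("Data\tScientist\n", "R\nPython\n")
cleaned <- pages_to_text(pages)
print(cleaned)
```

With real files, the call would look like pages_to_text(pdf_text("xxxxxx.pdf")).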

Further text cleaning might be required, so let us do a minimum of cleaning on the text corpus: remove numbers and convert everything to lower case:

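library(tm)
library(stringi)
# Minimal cleaning: drop digits, then convert the text to lower case
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(stri_trans_tolower))
  return(corpus)
}
# Wrap each text in a one-document corpus and clean it
text_corpus_clean_CV <- clean_corpus(VCorpus(VectorSource(textCV)))
text_corpus_clean_Job <- clean_corpus(VCorpus(VectorSource(textJob)))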
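To see what this cleaning does to a piece of text, here is a rough base R equivalent applied to a plain string. It is a sketch only: clean_string is a hypothetical name, and it assumes that removing runs of digits and lower-casing is all the two tm transformations above do.

```r
# Hypothetical base R stand-in for the two cleaning steps
clean_string <- function(x) {
  # Drop every run of digits
  x <- gsub("[[:digit:]]+", "", x)
  # Convert to lower case
  tolower(x)
}

example <- clean_string("Senior Analyst, 10 Years")
print(example)
```

Note that removing the number leaves a double space behind, so a whitespace clean-up step may be worth adding if the downstream comparison is sensitive to it.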

Now we are ready to call the Google NLP API and detect entities in the document:

nlp <- gl_nlp(text_corpus_clean_CV[[1]]$content)

The result of the API call includes identified tokens, sentences, classification, language, sentiment and entities. We will focus on detected entities for now and try to find common entities between the job profile and the person profile.
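The comparison of entities between the two documents could be sketched as follows. This is an illustration, not the exact code used for the experiment: the two vectors are made-up entity names, standing in for the names returned by each API call (assumed here to be available as something like nlp$entities[[1]]$name, with one entry per mention).

```r
# Made-up entity names, one entry per mention in each document
cv_entities  <- c("python", "data", "team", "python", "projects", "sql")
job_entities <- c("team", "python", "sql", "sql", "budget", "project")

# Count how many times each entity name occurs in each document
cv_counts  <- table(cv_entities)
job_counts <- table(job_entities)

# Keep only the entity names found in both documents
common <- sort(intersect(names(cv_counts), names(job_counts)))

# Side-by-side counts for the shared entities
comparison <- data.frame(
  entity    = common,
  cv_count  = as.integer(cv_counts[common]),
  job_count = as.integer(job_counts[common]),
  stringsAsFactors = FALSE
)
print(comparison)
```

In this toy data, "projects" and "project" are not matched, because the comparison is done on exact names.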

25 common entities were detected, and for each of them we can see how many times it appears in one document compared to the other. One observation is that some entities are present in both singular and plural form, which might not add much information to our comparison.
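One possible way to merge singular and plural forms before comparing is to normalize the entity names first. The function below is a deliberately naive sketch with a hypothetical name; it only handles the regular English "-s" and "-ies" endings and will get irregular words wrong, so a proper stemmer or lemmatizer would be a more robust choice.

```r
# Hypothetical, naive singularization of entity names
normalize_entity <- function(x) {
  x <- tolower(trimws(x))
  # "-ies" becomes "-y" (for example "technologies" to "technology")
  x <- sub("ies$", "y", x)
  # Drop a final "s" unless the preceding letter is s, u or i
  # (keeps words such as "business", "status" and "analysis" intact)
  x <- sub("([^sui])s$", "\\1", x)
  x
}

entities   <- c("Projects", "project", "technologies", "business", "sql")
normalized <- normalize_entity(entities)
print(unique(normalized))
```

After normalization, "Projects" and "project" collapse into a single entity, so their counts can be added together before the two documents are compared.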
