Introduction

In the first article of the series we showed how to use the Google NLP API to extract words from a person's profile and compare them to a job description (https://smartlake.ch/natural-language-processing-experimenting-entity-recognition-part-1/). In this second article, let us run the same experiment using the AWS (Amazon Web Services) Comprehend API. We will be using R again, but the same can easily be done in Python, as we will show at the end of the article.

Using AWS Comprehend API with R for entity detection

Setting up the test user in AWS IAM Console

Before anything else, you need a registered AWS account. Since it is not recommended to use your root user to access AWS services, let us create a specific user for our experimentation in the IAM console.

During user creation, you will be asked to assign the user to a pre-defined group (or to create a specific group). For example, I have defined a group with full access to AWS Comprehend and will assign the test user to it.

Once the user is successfully created, you will have the ability to download the user's credentials. This information is critical for accessing the AWS API later on, so keep it somewhere safe.

Experimenting with R

To experiment with the AWS API in R, we will use the aws.comprehend package:

install.packages("aws.comprehend")
library(aws.comprehend)

Note that the library has a known issue in its detect entities function; if you want to use that function, check GitHub for the fix.

Now it is time to use the previously saved access key ID and secret access key. They should be set as environment variables so that the aws.comprehend library can find the credentials:

Sys.setenv("AWS_ACCESS_KEY_ID" = "xxxxxxxxxxxx",
           "AWS_SECRET_ACCESS_KEY" = "yyyyyyyyyyyyyyyyyyyyyyyyy",
           "AWS_DEFAULT_REGION" = "us-east-1")

Exactly as in the previous article (https://medium.com/@patrick.rotzetter/natural-language-processing-experimenting-entity-recognition-with-google-amazon-nltk-and-maybe-b1fe673efe46), we use the following code to read the PDF files and do some pre-processing:

# read PDF files using the pdftools library
library(pdftools)
library(stringr)
text <- pdf_text("xxxxxx.pdf")
textCV <- paste(text, collapse = " ")
# replace any non-printable characters with spaces
textCV <- str_replace_all(textCV, "[^[:graph:]]", " ")
text <- pdf_text("yyyyyyy.pdf")
textJob <- paste(text, collapse = " ")
textJob <- str_replace_all(textJob, "[^[:graph:]]", " ")

# clean the text: remove numbers and make everything lower case
library(tm)
library(stringi)
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(stri_trans_tolower))
  return(corpus)
}
text_corpus_clean_CV <- clean_corpus(VCorpus(VectorSource(textCV)))
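If you want to feed the cleaned text back to the Comprehend API, you first need to extract it from the corpus again. A minimal sketch, assuming a single-document corpus as created above:

# pull the cleaned character vector back out of the tm corpus
textCV_clean <- content(text_corpus_clean_CV[[1]])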

And now we are ready to call the AWS NLP API. In this case we will use the 'detect syntax' functionality: the text is analyzed and each word is classified as a noun, verb, adverb and so on.

awsJobSyntax <- detect_syntax(textJob)
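The result comes back as a data frame with one row per detected token, including the token text and its part-of-speech tag, so a quick look at the first few rows shows what we are working with:

head(awsJobSyntax)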

We will filter all nouns from the AWS result, group them by noun and count them (using the dplyr package):

library(dplyr)
awsDetectedNounsinJob <- awsJobSyntax %>%
  filter(PartOfSpeech.Tag == "NOUN") %>%
  group_by(Text) %>%
  summarise(count = n())

Compared to the Google API results, we can see significant differences in the number of times a word is identified: for example, the word 'projects' was found 144 times in the previous experiment and 14 times using AWS.
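To eyeball these differences, it helps to sort the counts. A minimal sketch using the same dplyr pipeline:

# show the ten most frequent nouns in the job description
awsDetectedNounsinJob %>% arrange(desc(count)) %>% head(10)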

Experimenting with Python

Setting up AWS command line interface

To be able to interact with the AWS Comprehend API, you will need to download the AWS command line interface (CLI) and configure it using your previously stored credentials (more on this at https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).
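The configuration itself is a single command; the CLI prompts for the values you saved earlier (the placeholders below stand for your own keys):

aws configure
AWS Access Key ID [None]: xxxxxxxxxxxx
AWS Secret Access Key [None]: yyyyyyyyyyyyyyyyyyyyyyyyy
Default region name [None]: us-east-1
Default output format [None]: json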

Using the Python API

The boto3 library is the Amazon Web Services (AWS) SDK for Python. So first import the library:

import boto3

Then instantiate a Comprehend client by passing 'comprehend' as the service name to the boto3.client call:

comprehend = boto3.client(service_name='comprehend', region_name='us-east-1')

You can read your PDF file using any popular PDF library, for example with PyPDF2:

import PyPDF2

def readPdfFileComprehend(filename):
    # extract the text of each page into a list of strings
    text = list()
    read_pdf = PyPDF2.PdfFileReader(filename)
    for i in range(read_pdf.getNumPages()):
        page = read_pdf.getPage(i)
        txt = page.extractText()
        text.append(txt)
    return text

Once you have the text you want to analyze, just call the detect_entities function and that's it:

entities = comprehend.detect_entities(Text=text, LanguageCode='en')
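Putting the pieces together: a minimal sketch that reuses the placeholder file name from above and joins the page list returned by readPdfFileComprehend into a single string before the call. Note that Comprehend enforces a size limit on the Text parameter per request, so very long documents may need to be split.

pages = readPdfFileComprehend('xxxxxx.pdf')
text = ' '.join(pages)  # detect_entities expects a single string, not a list
entities = comprehend.detect_entities(Text=text, LanguageCode='en')
# the response holds a list of entities with their type and confidence score
for entity in entities['Entities']:
    print(entity['Type'], entity['Text'], entity['Score'])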

In the next articles, we will experiment with other popular libraries such as NLTK or spaCy.
