Introduction

In the previous three articles we illustrated the use of the Google and AWS NLP APIs and experimented with the spaCy library to extract entities and nouns from different documents. We chose to work with personal profiles and job descriptions, as this is a common and easily understandable use case.

In this article we would like to go further and explore spaCy in more detail.

Named Entity Recognition

Let us use the Python library for this example, as it gives access to more features than the R library (at least as far as I understood). As usual, we need to install spaCy and download the models we want to use (more on this at https://spacy.io/usage/).
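The installation typically looks like this (the exact model names depend on your spaCy version; we will use the small and medium English models below):

```shell
pip install spacy
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
```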

Let us now parse a document using spacy and print an extract of the named entities found that correspond to ‘PRODUCT’:

import spacy

# Load the small English model and parse the profile text
nlp = spacy.load('en_core_web_sm')
docCV = nlp(textCV)
for ent in docCV.ents:
    # Print the entity text and its label
    if ent.label_ == 'PRODUCT':
        print(ent.text, ent.label_)

Agile PRODUCT
Tier 1 PRODUCT

The results are not impressive with the small English model, so let us see whether the medium model does better:

nlp = spacy.load('en_core_web_md')
# Re-parse the text with the medium model
docCV = nlp(textCV)
for ent in docCV.ents:
    # Print the entity text and its label
    if ent.label_ == 'PRODUCT':
        print(ent.text, ent.label_)

In fact, no entity of type PRODUCT was detected this time, which is quite surprising. The results depend heavily on which text corpus the model was trained on. If we apply the same model to another profile, we get the following results:

C++ PRODUCT
C++ PRODUCT
Solaris PRODUCT
C++ PRODUCT

spaCy detected C++ and Solaris as products, but not Java or JavaScript. So let us add our own detection rules to the model to get closer to what we want, i.e. identifying technical skills in people's profiles:

from spacy.pipeline import EntityRuler

patterns = [{"label": "PROG", "pattern": [{"lower": "java"}]},
            {"label": "PROG", "pattern": [{"lower": "javascript"}]}]
ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
nlp.add_pipe(ruler)
docCV = nlp(textCV)
for ent in docCV.ents:
    # Print the entity text and its label
    if ent.label_ == 'PROG':
        print(ent.text, ent.label_)
Java PROG
Java PROG
Java PROG
Java PROG
Java PROG

So we have been able to add rule-based entity recognition on top of the statistical model. This makes it quick to fine-tune a model for specific domains and needs.
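Conceptually, the rule-based pass boils down to scanning tokens and tagging those whose lowercase form matches a pattern. A toy illustration in plain Python (a simplification for intuition, not spaCy's actual implementation, which matches full token-attribute patterns inside the pipeline):

```python
# Map lowercase token forms to entity labels, mirroring the EntityRuler patterns
patterns = {"java": "PROG", "javascript": "PROG"}

def tag_entities(text, patterns):
    # Return (token, label) pairs for tokens whose lowercase form matches a pattern
    return [(tok, patterns[tok.lower()])
            for tok in text.split()
            if tok.lower() in patterns]

print(tag_entities("Senior Java and JavaScript developer", patterns))
# -> [('Java', 'PROG'), ('JavaScript', 'PROG')]
```

The real EntityRuler goes further: with overwrite_ents=True, its matches take precedence over overlapping entities proposed by the statistical model.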

Let us also analyze the job description and see what kind of entities are recognized using only the statistical model:

SLQ ORG
7 years DATE
Computer Science ORG
Java GPE
JavaScript ORG
Boosting, Trees ORG
SLQ ORG
Redshift ORG
S3, Spark PRODUCT
DigitalOcean ORG
3rd ORDINAL
Google Analytics ORG
Adwords ORG
Crimson Hexagon ORG
Map/Reduce ORG
Hadoop ORG
Gurobi GPE
MySQL GPE
Business Objects ORG

This is interesting, because in this case the model detected Java as a geopolitical entity and JavaScript as an organization. It did so automatically, without any rules coded for this specific document.

So in summary, entity recognition depends not only on the statistical model used but also on the structure of the document we are working on. This is more complex than anticipated.

Document similarities

A nice feature of spaCy is the ability to compare linguistic and semantic similarities between tokens (words), sentences and documents. To do so, we must use a spaCy model with word vectors loaded, which means loading either the medium or the large English web model. Let us download the medium model from the command line:

python -m spacy download en_core_web_md

Each word in the pre-trained model has a corresponding vector, and a document vector defaults to the average of the vectors of all the document's tokens (words). Similarity is then computed as the cosine similarity between the document vectors. The result is, in practice, a number between 0 and 1, with 1 being the highest score (the cosine of 0 is 1, meaning the vectors point in the same direction).
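The averaging-and-cosine computation described above can be sketched in plain Python (an illustration of the idea with made-up two-dimensional vectors, not spaCy's internals):

```python
import math

def mean_vector(token_vectors):
    # Average the token vectors to obtain a document vector
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc1 = mean_vector([[1.0, 0.0], [0.0, 1.0]])  # averages to [0.5, 0.5]
doc2 = mean_vector([[2.0, 2.0]])              # stays [2.0, 2.0]
print(round(cosine(doc1, doc2), 4))  # same direction -> 1.0
```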

Let us start with a generic data scientist job description and see what happens.

docJob.similarity(docCV)
Out[98]: 0.9235262880617061

docJob.similarity(docCV2)
Out[99]: 0.9415320577235222

As both profiles are technical and the job description also includes quite a number of technical terms, it is no surprise that the documents are quite similar on average. So let us try it with a different kind of document, a publication on cybersecurity from the European Parliament:

docCV.similarity(docDoc)
Out[106]: 0.8724268941129953

Unsurprisingly, the similarity comes down, even though the target document is also technical in nature.

That is all for now. Next time we will see how we can further fine-tune our document analysis using spaCy.

Patrick Rotzetter
