In the previous three articles, we illustrated the use of the Google and AWS NLP APIs and experimented with the spaCy library to extract entities and nouns from different documents. We chose to work with personal profiles and job descriptions, as this is a common and easily understandable use case.
In this article we would like to go further and explore spaCy in more detail.
Named Entity Recognition
Let us use the Python library for this example, as it gives access to more features than the R library (at least as far as I understood). As usual, we need to install the spaCy library and download the models we want to use (more on this at https://spacy.io/usage/).
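For reference, the setup can be sketched as follows (assuming a pip-based environment; the 'en' shorthand below is the spaCy 2.x alias for the small English model):

```shell
# Install spaCy and download the English models used in this article.
pip install spacy
python -m spacy download en              # small model (spaCy 2.x shorthand)
python -m spacy download en_core_web_md  # medium model, includes word vectors
```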
Let us now parse a document with spaCy and print an extract of the named entities found that are labelled 'PRODUCT':
import spacy

nlp = spacy.load('en')
docCV = nlp(textCV)
for ent in docCV.ents:
    # Print the entity text and its label
    if ent.label_ == 'PRODUCT':
        print(ent.text, ent.label_)

Agile PRODUCT
Tier 1 PRODUCT
The results are not impressive with the small English model; perhaps they will be different with the medium model:
nlp = spacy.load('en_core_web_md')
docCV = nlp(textCV)  # re-parse the text with the new pipeline
for ent in docCV.ents:
    # Print the entity text and its label
    if ent.label_ == 'PRODUCT':
        print(ent.text, ent.label_)
In fact, no entity of type PRODUCT was detected at all, which is quite surprising. The results depend heavily on the text corpus the model was trained on. If we apply the same model to another profile, we get the following results:
So we have been able to add rule-based entity recognition to the statistical model. This allows us to quickly fine-tune models and adapt them to specific domains and needs.
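A minimal sketch of such rule-based recognition uses spaCy's EntityRuler component; the snippet below assumes the spaCy 3 string-based add_pipe API (in spaCy 2.x you would construct EntityRuler(nlp) and add the object instead), and the patterns and sample sentence are hypothetical:

```python
import spacy

# A blank English pipeline is enough to demonstrate the rules themselves.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Patterns can be exact strings or token-level attribute matches.
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "Spark"},
    {"label": "PRODUCT", "pattern": [{"LOWER": "business"}, {"LOWER": "objects"}]},
])

doc = nlp("Experience with Spark and Business Objects required.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

In a full pipeline you would typically add the ruler before the statistical 'ner' component, so the rules take precedence for the spans they match.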
Let us also analyze the job description and see what kind of entities are recognized using only the statistical model:
7 years DATE
Computer Science ORG
Boosting, Trees ORG
S3, Spark PRODUCT
Google Analytics ORG
Crimson Hexagon ORG
Business Objects ORG
So in summary, we can say that entity recognition depends not only on the statistical model used but also on the structure of the document we are working on. This is more complex than anticipated.
Similarity
A nice feature of spaCy is the ability to compare linguistic and semantic similarities between tokens (words), sentences and documents. To do so, we must use a spaCy model with word vectors loaded: either the medium or the large English web model. Let us download the medium model from the command line:
python -m spacy download en_core_web_md
Each word in the pre-trained model has a corresponding vector, and a document vector defaults to the average of the vectors of the document's tokens (words). Similarity is computed as the cosine similarity between the document vectors. The result will be a number between 0 and 1, with 1 being the highest score (the cosine of 0° is 1, meaning the vectors point in the same direction).
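The average-then-cosine computation can be sketched without spaCy; the tiny 3-dimensional "word vectors" below are hypothetical, purely for illustration:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product normalised by the vector lengths.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy word vectors (hypothetical 3-d embeddings, not real spaCy vectors).
vectors = {
    "data":     np.array([1.0, 0.2, 0.0]),
    "science":  np.array([0.8, 0.4, 0.1]),
    "cyber":    np.array([0.1, 0.9, 0.3]),
    "security": np.array([0.0, 1.0, 0.4]),
}

def doc_vector(tokens):
    # spaCy's default document vector is the average of its token vectors.
    return np.mean([vectors[t] for t in tokens], axis=0)

sim = cosine(doc_vector(["data", "science"]), doc_vector(["cyber", "security"]))
print(round(sim, 3))
```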
Let us start with a generic data scientist job description and see what happens.
As both profiles are technical and the job description also contains quite a number of technical words, it is no surprise that the documents are, on average, quite similar. So let us try it with a publication on cybersecurity from the European Parliament:
Unsurprisingly, the similarity has come down, although the target document is also technical in nature.
That is all for now. Next time, we will see how we can further fine-tune our document analysis using spaCy.