In the first two articles of this series we showed how to use the Google NLP and the AWS Comprehend APIs to extract words from a person's profile and compare them to a job description. In this third article, let us run the same experiment using spaCy. spaCy is a free, open-source library for Natural Language Processing in Python. It features named entity recognition (NER), POS tagging, dependency parsing, word vectors and more. You can learn more on the spaCy website. spaCy is relatively new compared to NLTK, for example, and has the advantage of supporting word vectors, which NLTK does not.

We will again be using R for fast experimentation (just a matter of personal taste) through the spacyr package, a wrapper around the spaCy Python library, but the same can easily be done in Python.

Using spaCy with R

Setting up the spacyr package

Setting up spaCy for R can be quite cumbersome, and I had to go through a number of blog posts before finally being able to install it properly.

For the installation steps, make sure you run your R environment as an administrator (on Windows, right-click the RStudio icon (or R desktop icon) and select "Run as administrator" when launching R). You can follow the steps described in the spacyr documentation.

First, install (if needed) and load the spacyr package:

if (!require("spacyr")) {
  install.packages("spacyr")
  library("spacyr")
}

Then install and initialize spaCy with the default English model, for example. This model does not include word vectors.
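A minimal sketch of these two steps using spacyr's helper functions (`en_core_web_sm` is spaCy's small default English model):

```r
library(spacyr)

# Install spaCy itself in a self-contained environment (only needed once)
spacy_install()

# Initialize spaCy with the default small English model (no word vectors)
spacy_initialize(model = "en_core_web_sm")
```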


We can also load a different model, like "en_core_web_md", and initialize spaCy with it. This model is an English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Note that you need to download the model first, for example using pip.
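A possible sketch of this step; as an alternative to pip, spacyr provides `spacy_download_langmodel()` to fetch the model:

```r
library(spacyr)

# Download the medium English model (alternative to installing it with pip)
spacy_download_langmodel("en_core_web_md")

# If a session is already initialized, close it before switching models
spacy_finalize()

# Re-initialize spaCy with the model that includes GloVe word vectors
spacy_initialize(model = "en_core_web_md")
```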


Parse the documents and compare

Now we are ready to parse our previously loaded documents.

parsedJob <- spacy_parse(textJob, entity = TRUE)
parsedCV <- spacy_parse(textCV, entity = TRUE)

We could use the detected entities, but the result is too narrow for our objective with the standard model. An idea would be to extend the model and define business specific entities like programming language or skills. We will discuss this in a subsequent article.

So let us use only the nouns for now, and filter the part-of-speech tags from the parse result accordingly.
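A minimal sketch of that filter with dplyr; the `pos` column is part of `spacy_parse()`'s output, and the variable names are chosen to match those used in the grouping step below:

```r
library(dplyr)

# Keep only the tokens tagged as nouns in each parsed document
parsedJobWords <- parsedJob %>% filter(pos == "NOUN")
parsedCVWords  <- parsedCV  %>% filter(pos == "NOUN")
```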


To find common words, we will match on the lemma rather than the raw token; this collapses plural forms (for example "projects" and "project") and gives a more accurate picture.

groupEntitiesJob<-parsedJobWords %>% group_by(lemma) %>% summarise(count = n())
groupEntitiesCV<-parsedCVWords %>% group_by(lemma) %>% summarise(count = n())
groupedEntities <- merge(groupEntitiesJob, groupEntitiesCV, by.x = 'lemma', by.y = 'lemma', all.x = FALSE)

And finally, let us plot the pyramid graph to see the final comparison.
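One way to sketch such a pyramid chart with ggplot2, assuming the merged `groupedEntities` data frame has columns `count.x` (job) and `count.y` (CV), the default suffixes produced by `merge`:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Reshape to long format and negate one side so the bars point in
# opposite directions, producing the pyramid effect
pyramidData <- groupedEntities %>%
  pivot_longer(c(count.x, count.y),
               names_to = "document", values_to = "count") %>%
  mutate(document = ifelse(document == "count.x", "Job", "CV"),
         count = ifelse(document == "Job", -count, count))

ggplot(pyramidData, aes(x = lemma, y = count, fill = document)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = abs)
```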

Compared to previous results, we can see that we have eliminated plural words like projects and teams, which gives a cleaner result.

Analyzing the POS tags

An interesting comparison is also to see how POS (part-of-speech) tags are distributed in each document. Let us plot a bar chart showing the distribution of tags for each document.
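A minimal sketch of one such bar chart, here for the job description (the same applies to the CV):

```r
library(dplyr)
library(ggplot2)

# Count occurrences of each POS tag in the parsed job description
posJob <- parsedJob %>% group_by(pos) %>% summarise(count = n())

ggplot(posJob, aes(x = pos, y = count)) +
  geom_col() +
  labs(x = "POS tag", y = "Number of occurrences")
```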

Number of occurrences of each POS tag in the job description
Number of occurrences of each POS tag in the CV

We can notice a higher proportion of proper nouns in the profile, but that is expected in this type of document.

In a subsequent article, we will try to introduce rules to detect custom entities and enrich our model with specific technical terms. We will also use spaCy's document-similarity functionality to compare documents.
