Introduction

In the previous five articles, we illustrated the use of the Google and AWS NLP APIs. We also experimented with the spaCy library to extract entities and nouns from different documents, showed how to improve the model using spaCy's pattern-matching functionality, and finally trained the model with new entities.

Now we want to walk through the full implementation of a comparison engine that goes beyond simple keyword search and uses spaCy's functionality.

We have chosen personal profiles and job descriptions, as this is a common use case that anybody can easily understand. That said, we believe this kind of algorithm is far from sufficient on its own for matching candidates to jobs.

Matching used nouns

An interesting comparison is to look at the nouns used in the CV and in the job description. We use spaCy's ability to identify parts of speech automatically and then filter for nouns.

In essence, the code looks as follows:

      dfCV <- data_frame(parse_text(textCV))   # call to a Python function via reticulate
      dfCV <- subset(dfCV, POS == "NOUN")      # filter only nouns
      dfJob <- data_frame(parse_text(textJob)) # same parsing for the job description
      dfJob <- subset(dfJob, POS == "NOUN")
      groupEntitiesJob <- dfJob %>% group_by(lemma) %>% summarise(count = n())
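
For reference, here is a minimal sketch of what the Python side of parse_text could look like (the model name and column names are assumptions here; reticulate converts the returned pandas DataFrame into an R data frame):

import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")  # assumed English model

def parse_text(text):
    # Parse the text with spaCy and return one row per token,
    # keeping the attributes the R side needs (lemma and POS).
    doc = nlp(text)
    return pd.DataFrame(
        [(token.text, token.lemma_, token.pos_) for token in doc],
        columns=["text", "lemma", "POS"],
    )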

We can see the final result in the comparison graph below:

The graph was created using ggplot, and the application was implemented using Shiny (https://shiny.rstudio.com/).

Identifying skills

We need to go beyond simple word matching: identify the skills required by the job description and match them against the skills available in the CV. spaCy supports phrase matching, which we use for exactly this. We rely on a list of more than 6,000 possible skills downloaded from open-source datasets. These entries are detected automatically by spaCy and then compared across the job and CV texts. The matching itself is relatively easy with spaCy; the difficulty is gathering enough skills to cover the important ones across different domains.

This can be done simply in Python, assuming the skills are available in a CSV file:

import pandas as pd
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
patterns = []
skills = pd.read_csv("skills.csv")
for index, row in skills.iterrows():
    patterns.append(nlp(row["SKILLS"].lower()))  # one lower-cased pattern per skill
matcher.add("Skills", patterns)

The result works well: we can clearly identify both matching and missing skills.
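
As an illustration, here is a minimal sketch of how the matcher could be applied to both texts to derive the matching and missing skills (textCV and textJob are assumed to hold the raw document strings):

def extract_skills(text):
    # Lower-case the input so it lines up with the lower-cased patterns,
    # then collect the matched spans as a set of skill strings.
    doc = nlp(text.lower())
    return {doc[start:end].text for match_id, start, end in matcher(doc)}

skills_cv = extract_skills(textCV)
skills_job = extract_skills(textJob)
matching_skills = skills_job & skills_cv  # required and present in the CV
missing_skills = skills_job - skills_cv   # required but not found in the CV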

A Short Video

You can see a short demo of the system below.

The tech stack

For this version, we used the following stack:

  • R with reticulate, ggplot and dplyr as main libraries
  • Python to access spaCy functionality
  • Conda to deploy Python
  • Shiny for building the application
  • Docker to containerize the application
  • An Azure container instance to deploy the application

Patrick Rotzetter
