Introduction

In the previous four articles we illustrated the use of the Google and AWS NLP APIs. We also experimented with the spaCy library to extract entities and nouns from different documents, and showed how to improve the model using spaCy's pattern matching functions.

We would now like to go further and experiment with spaCy's advanced functions. spaCy allows us to update the statistical model and train it on new entities without resorting to 'hard-coded' matching rules. This makes it possible to fine-tune the model to our specific domain, which is quite useful for entity recognition or text classification.

We have chosen to use personal profiles and job descriptions, as this is a common use case that is easily understandable by anybody, although we think that this kind of algorithm is very far from being enough for matching candidates to jobs.

Using the pre-trained model

As shown in the previous article, let us parse the generic job description document with the pre-trained large English model and see what entities have been identified:

import spacy

# Load the pre-trained large English model
nlp = spacy.load('en_core_web_lg')
docJob = nlp(textJob)
for ent in docJob.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

SLQ ORG
7 years DATE
Statistics ORG
Mathematics ORG
C++ LANGUAGE
Java LOC
SLQ ORG
S3 PRODUCT
Spark ORG
DigitalOcean ORG
3rd ORDINAL
Google Analytics ORG
Site Catalyst ORG
Coremetrics ORG
Adwords ORG
Crimson Hexagon ORG
Facebook Insights ORG
Hadoop NORP
Hive ORG
Gurobi GPE
MySQL GPE
Business Objects ORG
Glassdoor ORG

Interestingly, C++ has been detected as a language, while Java has been labelled as a location, presumably because Java is an Indonesian island. JavaScript has not been identified at all. This obviously depends on the kind of corpus the model has been trained on, which was probably not specific to the computer science domain.
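When a label tag in the output is unclear, spaCy's built-in glossary can be queried with `spacy.explain` (this lookup snippet is our own addition to the walkthrough):

```python
import spacy

# spaCy ships a glossary mapping each entity label to a short description;
# spacy.explain() looks a label up in it.
for label in ('LOC', 'GPE', 'LANGUAGE', 'NORP', 'ORG'):
    print(label, '->', spacy.explain(label))
```

This makes it quick to check, for instance, that LOC and GPE are two distinct kinds of location labels.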

Previously we have shown that we can add our own rules to the model. So let us add our own detection rules to get closer to what we want, i.e. identify technical skills in people's profiles:

from spacy.pipeline import EntityRuler

# define patterns we want to recognize
patterns = [{"label": "PROG", "pattern": [{"lower": "java"}]},
            {"label": "PROG", "pattern": [{"lower": "javascript"}]}]
# define an entity ruler using the predefined patterns
ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
# add the ruler to the nlp pipeline
nlp.add_pipe(ruler)
# apply the pipeline to the job document
docJob = nlp(textJob)
for ent in docJob.ents:
    # Print the entity text and its label
    if ent.label_ == 'PROG':
        print(ent.text, ent.label_)
Java PROG
JavaScript PROG

Now Java and JavaScript are both identified as programming languages.

So we have been able to add rule based entity recognition to the statistical model. This allows for fine tuning models in a quick way to adapt to specific domains and needs.

This is interesting because, out of the box, the model had labelled Java as a location and had missed JavaScript entirely; with just two declarative patterns, both are now tagged correctly, without having to write code specific to this particular document.

So, in summary, entity recognition depends not only on the statistical model used but also, obviously, on the domain of the documents the model was trained on.

Training and updating the model

spaCy allows us to train the underlying neural network and update it with our specific domain knowledge. This is a cool feature, as it is exactly what we want to do. First, let us add some examples of the entities we want to detect in representative sentences:

trainData=[('Java is a programming language', {'entities': [(0, 4,'PROG')]}),
('I have 5 years experience in JavaScript', {'entities': [(29, 39,'PROG')]}),
('Extensive Java experience required', {'entities': [(10, 14,'PROG')]}),
('JavaScript is a programming language used mainly in front-end development', {'entities': [(0, 10, 'PROG')]}),
('Java is an object oriented programming language', {'entities': [(0, 4, 'PROG')]}),
('I have a long experience in project management', {'entities': []})]

It is also recommended to provide negative examples, i.e. sentences that do not contain any entity.
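Since these character offsets are easy to get wrong, a quick sanity check (our own addition, not part of the original pipeline) is to slice each sentence with its offsets and confirm the result matches the intended entity:

```python
# Each training example pairs a sentence with (start, end, label) character
# offsets. Slicing the sentence with those offsets must yield the entity text.
trainData = [
    ('Java is a programming language', {'entities': [(0, 4, 'PROG')]}),
    ('I have 5 years experience in JavaScript', {'entities': [(29, 39, 'PROG')]}),
    ('Extensive Java experience required', {'entities': [(10, 14, 'PROG')]}),
]

for text, annotation in trainData:
    for start, end, label in annotation['entities']:
        span = text[start:end]
        assert span in ('Java', 'JavaScript'), \
            f"offset mismatch in {text!r}: got {span!r}"
print("all offsets are consistent")
```

A misaligned span would otherwise be silently dropped or mislearned during training.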

Let us now train and update the model with these new entities and training examples:

import random
import spacy

# initialize a blank spaCy model
nlp = spacy.blank('en')
# Create a blank entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
# Add a new label for programming languages
ner.add_label('PROG')
# Start the training
nlp.begin_training()
# Train for 10 iterations
for itn in range(10):
    random.shuffle(trainData)
    # Divide the examples into batches
    for batch in spacy.util.minibatch(trainData, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)

Now we are ready to test our model on the same document as before.

docJob = nlp(textJob)
for ent in docJob.ents:
    # Print the entity text and its label
    if ent.label_ == 'PROG':
        print(ent.text, ent.label_)
Data PROG
Job PROG
Description PROG
Job PROG
Java PROG
JavaScript PROG
Site PROG
Map PROG

This is pretty cool: Java and JavaScript are now recognized by the spaCy neural network itself. Unfortunately, it has also classified a few other words as PROG, but this is a good start.
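One way to curb these false positives (a sketch of our own, not something the walkthrough above does) is to extend the training data with negative examples containing the very words that were misclassified, so the recognizer learns to treat them as non-entities, and then rerun the same training loop:

```python
# Existing positive examples (abridged)
trainData = [
    ('Java is a programming language', {'entities': [(0, 4, 'PROG')]}),
    ('JavaScript is used in front-end development', {'entities': [(0, 10, 'PROG')]}),
]

# Hypothetical negative examples built from the misclassified words
# ('Data', 'Job', 'Site', 'Map'); the empty entity list tells the
# recognizer these sentences contain no PROG entities.
negativeExamples = [
    ('The job description includes a site map', {'entities': []}),
    ('We collect data about each job posting', {'entities': []}),
]

# Extend the training data and retrain exactly as before
trainData = trainData + negativeExamples
```

With more iterations and a larger, more balanced data set, the model should stop tagging these generic words.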

Patrick Rotzetter
