In the previous four articles we illustrated the use of the Google and AWS NLP APIs. We also experimented with the spaCy library to extract entities and nouns from different documents, and showed how to improve the model using spaCy's pattern matching functions.
In this article we would like to go further and experiment with spaCy's more advanced functions. spaCy lets us update the statistical model and train it on new entities without 'hard coded' matching rules, which makes it possible to fine-tune the model to our specific domain. This is quite useful for entity recognition and text classification.
We have chosen to work with personal profiles and job descriptions, as this is a common use case that is easy for anybody to understand, although we believe this kind of algorithm alone is far from sufficient for matching candidates to jobs.
Using the pre-trained model
As shown in the previous article, let us parse the generic job description document with the pre-trained large English model and see which entities are identified:
```python
import spacy

nlp = spacy.load('en_core_web_lg')
docJob = nlp(textCV)
for ent in docJob.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
```

```
SLQ ORG
7 years DATE
Statistics ORG
Mathematics ORG
C++ LANGUAGE
Java LOC
SLQ ORG
S3 PRODUCT
Spark ORG
DigitalOcean ORG
3rd ORDINAL
Google Analytics ORG
Site Catalyst, ORG
Coremetrics ORG
Adwords ORG
Crimson Hexagon ORG
Facebook Insights ORG
Hadoop NORP
Hive ORG
Gurobi GPE
MySQL GPE
Business Objects ORG
Glassdoor ORG
```
Previously we showed that we can add our own rules to the model. So let us add our own detection rules to get closer to what we want, i.e. identifying technical skills in people's profiles:
So we have been able to add rule-based entity recognition on top of the statistical model. This is a quick way to fine-tune a model to specific domains and needs.
In summary, entity recognition depends not only on the statistical model used but also, obviously, on the domain of the documents the model was trained on.
Training and updating the model
spaCy allows us to train the underlying neural network and update it with our specific domain knowledge. This is a cool feature, as it is exactly what we want to do. First, let us add some examples of the entities we want to detect in representative sentences:
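A sketch of what such training examples can look like; the sentences and the `PROG` (programming language) annotations below are made up for illustration. Each example pairs a sentence with the character offsets of the entities it contains:

```python
# Hypothetical training examples: each pair holds a sentence and the
# character offsets (start, end, label) of the entities it contains
trainData = [
    ('I have been programming in Java for 5 years',
     {'entities': [(27, 31, 'PROG')]}),
    ('Experience with Python and C++ is required',
     {'entities': [(16, 22, 'PROG'), (27, 30, 'PROG')]}),
]

# Sanity check: the offsets really point at the words we mean
print(trainData[0][0][27:31])   # → Java
print(trainData[1][0][16:22])   # → Python
```

Getting these offsets right matters: spaCy silently skips entity spans that do not align with token boundaries, so a quick slice check like the one above avoids wasting training examples.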
It is also recommended to provide negative examples, i.e. sentences without any entity.
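A negative example is simply a sentence annotated with an empty entity list; a minimal sketch (the sentence is made up):

```python
trainData = []  # the annotated examples shown above

# A negative example: the empty entity list tells the model that this
# sentence contains nothing to annotate
trainData.append(
    ('We are looking for a motivated team player', {'entities': []})
)
```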
Let us now train and update the model with these new entities and training examples:
```python
import random

import spacy

# Initialize a blank English model
nlp = spacy.blank('en')
# Create a blank entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
# Add a new label for programming languages
ner.add_label('PROG')
# Start the training
nlp.begin_training()
# Train for 10 iterations
for itn in range(10):
    random.shuffle(trainData)
    # Divide the examples into batches
    for batch in spacy.util.minibatch(trainData, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)
```
Now we are ready to test our model on the same document as before.