Where Is My Last Name in CRM?

I have recently been writing about my tech hacks for solving day-to-day problems using different programming approaches. This blog post is about cleaning up the name field within a customer resource in a CRM.

I was given this challenge in the context of the Dutch language and asked whether there are remedies beyond the usual grep or split commands to derive the first name and the last name from a field that currently holds only the full name, with the last name field being blank!

It was an interesting problem to look at, as you could leverage many approaches, from deploying Mechanical Turk to specialized cleansing services, but I chose to go a different way.

I used the following Python packages to write a small piece of code to get my results:

  1. probablepeople – an open-source package maintained by DataMade.
  2. spaCy – an industrial-strength Natural Language Processing (NLP) package.

What is Probable People?

probablepeople is a Python library for parsing unstructured romanized name or company strings into components, using advanced NLP methods. It is based on usaddress, a Python library for parsing addresses.

For those who aren’t Python developers, you can try it out on the project’s web interface.

What this can do: Using a probabilistic model, it makes (very educated) guesses in identifying name or corporation components, even in tricky cases where rule-based parsers typically break down.

What this cannot do: It cannot identify components with perfect accuracy, nor can it verify that a given name/company is correct/valid.

probablepeople learns how to parse names/companies through a body of training data. If you have examples of names/companies that stump this parser, please send them over! By adding more examples to the training data, probablepeople can continue to learn and improve.
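As a quick illustration (the sample names below are made up, and the exact labels depend on probablepeople’s trained model), the library exposes two main calls:

#Quick illustration of the two main probablepeople calls (sample names are made up)
import probablepeople as pp

#parse() returns a list of (token, label) tuples
print(pp.parse("Mr. Jan van der Berg"))

#tag() returns a dict of components plus a type such as 'Person' or 'Corporation'
print(pp.tag("Acme Holding B.V."))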

What is SpaCy?

spaCy is an industrial-strength NLP library written in Python, and more can be found on its site; given its popularity, it is probably not worth me writing more about it here.

Whilst both packages provide powerful machine learning approaches to train, re-train and evaluate a model in the context of the problem, I took an OOTB (Out Of The Box) approach and directly ingested the data using the available corpus and probabilistic parser.

Approach

In terms of approach, I used a pipeline architecture where the same data is sent to both libraries and then reconciled for presentation in the output. In simple terms, I combined the CRF (Conditional Random Field) approach of probablepeople with Named Entity Recognition (NER) from spaCy to construct a pipeline that achieves my objective.
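As a rough sketch of that pipeline (it assumes probablepeople is imported as pp and the Dutch spaCy model is loaded as nlp, exactly as in the snippets below; the reconciliation rule in the comments is my own choice):

#Pipeline sketch: send the same string to both parsers and keep both views
def _pipeline(full_name):
    # CRF-based components and record type ('Person' or 'Corporation') from probablepeople
    components, record_type = pp.tag(full_name)

    # NER entities from the spaCy Dutch model
    entities = [(ent.label_, ent.text) for ent in nlp(full_name).ents]

    # reconcile downstream: prefer GivenName/Surname from the CRF parse,
    # fall back to person entities from NER when the two disagree
    return dict(components), record_type, entities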

Simple workflow for creating a structured output from the name parser

The following are some basic code snippets to help you understand the simple workings of the code and assemble your own output.

#Installation Commands 
pip install probablepeople
pip install spacy
pip install xlrd
pip install pandas
...
#Using dutch corpus for spacy
python -m spacy download nl_core_news_sm
...
#import
import probablepeople as pp
import pandas as pd
import xlrd
import csv
import os.path
import spacy
from spacy.matcher import Matcher
import nl_core_news_sm
...
#load corpus
nlp = nl_core_news_sm.load()
...
#Clean-up functions
def _removeNumbers(s):
    # strip numeric digits from the string using filter and join
    res = "".join(filter(lambda x: not x.isdigit(), s))
    return res

def _removePunctuation(s):
    # punctuation marks to strip
    punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

    # replace every punctuation mark in the string with the empty string
    for x in punctuations:
        if x in s:
            s = s.replace(x, "")

    # return the string without punctuation
    return s

def _removeNonAscii(s):
    # keep only ASCII characters (ordinal below 128)
    return "".join(i for i in s if ord(i) < 128)
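Chained together on a raw CRM value (the sample string below is made up for illustration), the helpers produce a string that is safe to hand to the parsers:

#Example: chaining the clean-up helpers on a made-up CRM value
raw_value = "Dhr. J. van den Berg (2)!!"
cleaned = _removeNonAscii(_removePunctuation(_removeNumbers(raw_value)))
print(cleaned)   # digits, punctuation and non-ASCII characters stripped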

...
#NER Functions
def _nerExtraction(s):
    # run the spaCy pipeline and collect one {label: text} dict per recognised entity
    doc = nlp(s)
    entity_collection = []
    for ent in doc.ents:
        entity = {ent.label_: ent.text}
        entity_collection.append(entity)

    return entity_collection
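A quick call to the NER helper (again with a made-up Dutch sentence; the exact labels depend on the nl_core_news_sm model) shows the structure it returns:

#Example call: returns a list of {label: text} dicts, e.g. person and organisation entities
print(_nerExtraction("Jan de Vries werkt bij Philips in Eindhoven"))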

#Parser Function Call
 try:
     ordered_text = pp.tag(value)
 except pp.RepeatedLabelError as e :
      .....
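The exception handling above is elided; as a sketch only (the fallback values and the print are my assumptions, not the original handling), one way to deal with names that probablepeople refuses to tag is:

#Sketch: one possible way to handle ambiguous names (assumed, not the original handling)
try:
    ordered_text = pp.tag(value)
except pp.RepeatedLabelError as e:
    # fall back to an empty component dict and flag the record for manual review
    ordered_text = ({}, 'Ambiguous')
    print('Could not tag:', value, e)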

Using a single field as input, we got one or many fields in a structured manner in a CSV file, with the columns listed below (a small sketch of writing such rows follows the list). During the exercise it was also interesting to see that not every name was a person; some ended up being company names!

    'ner_entity',
    'ner_type',
    'crf_type',
    'PrefixMarital',    
    'PrefixOther',
    'GivenName',
    'FirstInitial',
    'MiddleName',
    'MiddleInitial',
    'Surname',
    'LastInitial',
    'SuffixGenerational',
    'SuffixOther',
    'Nickname',
    'SecondGivenName',
    'SecondSurname',
    'And',
    'CorporationName',
    'CorporationNameOrganization',
    'CorporationLegalType',
    'CorporationNamePossessiveOf',
    'ShortForm',
    'ProxyFor',
    'AKA'
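Writing reconciled rows with these columns to the CSV is straightforward with csv.DictWriter; here is a minimal sketch (the file name, the column subset and the row values are assumptions for illustration):

#Sketch: write one reconciled record per input name to a CSV file
import csv

fieldnames = ['ner_entity', 'ner_type', 'crf_type',
              'GivenName', 'Surname', 'CorporationName']   # subset of the columns above

with open('parsed_names.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
    writer.writeheader()
    # row values here are purely illustrative
    writer.writerow({'crf_type': 'Person', 'GivenName': 'Jan', 'Surname': 'van den Berg'})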

Using the above flow, I was able to clean up the data and provide simple automation for a CRM flow; this can then be converted into an API and provide value using an open-source approach.

If you have any feedback or comments, do let me know via the post comments!



