
What comes to your mind when I say JUNK!

The blog focuses on how we are applying Exploratory Data Analysis (EDA) to establish insights from various initiatives at Recipe Dabba. We use Python libraries like pandas-profiling and D-Tale to generate some of these analyses. You will find it an interesting read, particularly on how things shifted during COVID times!
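To give a flavor of what that EDA looks like, here is a minimal sketch using pandas on a made-up temptation log (the column names and data are illustrative, not our real dataset):

```python
import pandas as pd

# Hypothetical log of temptations reported by participants (illustrative only;
# the real challenge data is not shown here).
temptations = pd.DataFrame({
    "kid_id": [1, 1, 2, 3, 3, 4],
    "item": ["chocolate", "chips", "cookies", "pizza", "chocolate", "ketchup"],
    "category": ["Readymade Snacks & Desserts"] * 3
                + ["Outside Food"]
                + ["Readymade Snacks & Desserts"] * 2,
})

# EDA starting point: which categories dominate the temptation log?
counts = temptations["category"].value_counts()
print(counts)

# pandas-profiling can expand such a frame into a full HTML report:
#   from pandas_profiling import ProfileReport
#   ProfileReport(temptations).to_file("eda.html")
```

A `value_counts` like this is the simplest version of the category breakdown discussed below; pandas-profiling and D-Tale automate the rest of the exploration.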

OK…as per the dictionary, the literal meaning is something which is discarded and has no or extremely low value.

Aha! Surprisingly, this definition fits well when it comes to a junkyard, but not so much when it comes to junk food.

Junk food, on the contrary, is much loved and a go-to food for many. Makes me wonder why? Why, despite a food shouting loud and clear, "Hey dude! I am not good for you," do we still go head over heels for it?

Anyway, I will not roast it any more, since you get the drift already.

Our work at Recipe Dabba is majorly focused on promoting cooking as a life skill and healthy eating as a lifestyle for kids. As part of this endeavor, we have been successfully running a specially curated program called the 21 Days No Junk Food Challenge since April 2019. In it, we work with a few nudges while educating kids (aged 5 to 12 years) about food choices, and with the help of parents we monitor their eating patterns for 3 weeks. The idea is to motivate children towards better food choices. So, having run some 6 seasons of this fun-based, habit-formation challenge, here are a few findings purely around the kinds of temptations participants had during the 21-day period.

  1. The majority of junk food temptations fell under Readymade Snacks and Desserts: things like chocolates, cookies, Oreos, ketchup, chips, candies, French fries etc.
  2. The COVID-19-time seasons (post February 2020) saw a major drop in outside junk food like pizzas, burgers, vada pavs, samosas etc.; however, packaged foods like chocolates, cookies etc. were still present in the list.
  3. Again, comparing pandemic and pre-pandemic times, there was a drop in the consumption of chocolates, as a major chunk of these used to come from birthday parties, schools, day cares, activity classes and other social gatherings, which stopped during the COVID-19 seasons.
  4. Temptations for carbonated drinks, creamy cakes and pastries were also much better controlled during the COVID-19-run seasons.

These findings, though drawn from a sample of only 100 urban kids, are powerful enough to infer that it is not impossible for kids to stick to homemade, healthy food as long as they are not pushed into temptations. These temptations are man-made and exist mostly because of our social structure. Also, many times foods like ketchup, cookies and the like are always present at our homes, and sometimes they are our foods of convenience over anything else. The 21 Days No Junk Food Challenge surely works in a controlled environment, but it does help us make parents aware of how small changes can make a larger impact in the long run. Controlling temptations is not easy even for adults, and these are not even teenagers we are referring to here.

Some of the ways the kids worked towards controlling these temptations were:

  1. Working and agreeing upon healthy alternatives which parents provided them with.
  2. Asking more questions about how to manage these temptations.
  3. Taking one day at a time with a promise to practice control.
  4. Working on small portions, if at all the temptations were stronger.

To sum up, changes aren’t easy, and when it comes to something we have long been calling our comfort food, it becomes all the more complicated. Young kids, who came into the world only looking for mother’s milk, are surely not to be blamed for the choices we got them into.

Junk or processed food can’t be removed 100% from our systems, courtesy of the lifestyle we have chosen. However, it certainly can be reduced to a great extent in order to increase the proportion of healthy food. After all, while everyone eats healthy food, what makes the difference is how much unhealthy food we are eating too.

Where Is My Last Name in CRM?

I have recently been talking about my tech hacks for solving day-to-day problems using different programming approaches. This blog post is about cleaning up the name field within a customer record in a CRM.

I was given this challenge in the context of the Dutch language and asked whether there are remedies beyond the usual grep or split commands to derive the first name versus the last name from a field that currently holds only the full name, with the last name field being blank!

It was an interesting problem to look at, as you could leverage many approaches, from deploying Mechanical Turk to specialized cleansing services, but I chose to go a different way.

I used the following packages in Python to write a small piece of code to get my results:

  1. probablepeople – an open-source package maintained by DataMade.
  2. spaCy – an industrial-strength Natural Language Processing (NLP) package.

What is probablepeople?

probablepeople is a python library for parsing unstructured romanized name or company strings into components, using advanced NLP methods. This is based off usaddress, a python library for parsing addresses.

For those who aren’t Python developers, probablepeople also offers a web interface where you can try it out.

What this can do: Using a probabilistic model, it makes (very educated) guesses in identifying name or corporation components, even in tricky cases where rule-based parsers typically break down.

What this cannot do: It cannot identify components with perfect accuracy, nor can it verify that a given name/company is correct/valid.

probablepeople learns how to parse names/companies through a body of training data. If you have examples of names/companies that stump this parser, please send them over! By adding more examples to the training data, probablepeople can continue to learn and improve.

What is SpaCy?

spaCy is an industrial-strength NLP library written in Python; more can be found on its site. Given its popularity, it is not worth me writing more about it here.

Whilst both packages provide powerful machine learning approaches to train, re-train and evaluate a model in the context of the problem, I have taken an OOTB (Out Of The Box) approach, directly ingesting data with the available corpus and probabilistic parser.

Approach

In terms of approach, I have used a pipeline architecture where the same data is sent to both libraries and then reconciled for presentation in the output. In simple terms, I have combined the CRF (Conditional Random Field) approach of probablepeople with Named Entity Recognition (NER) from spaCy to construct a pipeline that achieves my objective.
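To make the reconciliation step concrete, here is a minimal sketch of how the two outputs could be merged into one flat record (the function name and merge policy are my own, and the inputs are assumed to be pre-computed):

```python
def reconcile(crf_fields, crf_type, ner_entities):
    # Merge probablepeople (CRF) fields with spaCy (NER) entities into one
    # flat record. Inputs are assumed pre-computed: crf_fields is a dict of
    # component -> text from pp.tag(), crf_type is 'Person' or 'Corporation',
    # and ner_entities is a list of {label: text} dicts.
    record = {"crf_type": crf_type}
    record.update(crf_fields)
    # Keep the first entity spaCy found as a cross-check signal
    for ent in ner_entities:
        for label, text in ent.items():
            record.setdefault("ner_type", label)
            record.setdefault("ner_entity", text)
    return record

# Hypothetical pre-computed outputs for the name "Jan de Vries"
row = reconcile({"GivenName": "Jan", "Surname": "de Vries"}, "Person",
                [{"PER": "Jan de Vries"}])
print(row)
```

When the CRF type and the NER label disagree (say, `Person` vs `ORG`), keeping both fields side by side lets you flag the row for review instead of silently trusting one parser.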

Simple Workflow For Creating a Structured output for name parser

Following are some basic code snippets to help you understand simple workings within the code and assemble your own output.

#Installation Commands 
pip install probablepeople
pip install spacy
pip install xlrd
pip install pandas
...
#Using dutch corpus for spacy
python -m spacy download nl_core_news_sm
...
#import
import probablepeople as pp
import pandas as pd
import xlrd
import csv
import os.path
import spacy
from spacy.matcher import Matcher
import nl_core_news_sm
...
#load corpus
nlp = nl_core_news_sm.load()
...
#Clean-up functions
def _removeNumbers(s):
    # Drop numeric digits from the string
    return "".join(filter(lambda x: not x.isdigit(), s))

def _removePunctuation(s):
    # Strip punctuation marks from the string
    punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    for x in punctuations:
        s = s.replace(x, "")
    return s

def _removeNonAscii(s):
    # Keep only ASCII characters
    return "".join(i for i in s if ord(i) < 128)

...
#NER Functions
def _nerExtraction(s):
    # Run the Dutch pipeline and collect a {label: text} dict per entity
    doc = nlp(s)
    entity_collection = []
    for ent in doc.ents:
        entity_collection.append({ent.label_: ent.text})
    return entity_collection

#Parser Function Call
try:
    ordered_text = pp.tag(value)
except pp.RepeatedLabelError as e:
    .....

From a single input field we got one or more fields in a structured manner, written to a CSV file with the columns below! During the exercise it was also interesting to see that not every name was a person; some turned out to be company names!

    'ner_entity',
    'ner_type',
    'crf_type',
    'PrefixMarital',    
    'PrefixOther',
    'GivenName',
    'FirstInitial',
    'MiddleName',
    'MiddleInitial',
    'Surname',
    'LastInitial',
    'SuffixGenerational',
    'SuffixOther',
    'Nickname',
    'SecondGivenName',
    'SecondSurname',
    'And',
    'CorporationName',
    'CorporationNameOrganization',
    'CorporationLegalType',
    'CorporationNamePossessiveOf',
    'ShortForm',
    'ProxyFor',
    'AKA'
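For completeness, here is a sketch of writing such records out with the stdlib csv module (the fieldname list is abridged and the rows are hypothetical):

```python
import csv
import io

# Abridged subset of the output columns listed above
fieldnames = ["ner_entity", "ner_type", "crf_type",
              "GivenName", "Surname", "CorporationName"]

# Hypothetical reconciled records, one per input name
rows = [
    {"ner_entity": "Jan de Vries", "ner_type": "PER", "crf_type": "Person",
     "GivenName": "Jan", "Surname": "de Vries"},
    {"ner_entity": "Acme BV", "ner_type": "ORG", "crf_type": "Corporation",
     "CorporationName": "Acme BV"},
]

buf = io.StringIO()
# restval="" leaves columns blank for rows that lack that component
writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

`DictWriter` with `restval=""` is convenient here because a person row and a corporation row populate different subsets of the columns.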

Using the above flow, I was able to clean up the data and provide a simple automation for a CRM flow, which can then be converted into an API to provide value using an open-source approach.

If there is any feedback or comments, do let me know via the post comments!