Modifying the NYT ingredient tagger to a pure Python implementation

NOTE: This is a Jupyter notebook converted to markdown, so the formatting is rough. The original notebook can be seen here.

In [41]:
%load_ext watermark
%watermark


2016-05-05T15:37:46

CPython 2.7.11
IPython 4.0.3

compiler   : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
system     : Linux
release    : 3.19.0-58-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit


This notebook shows a modification of the original NYT Ingredient Phrase Tagger. Here is the article where they talk about it.

That GitHub repository contains The New York Times' tool for performing Named Entity Recognition via Conditional Random Fields on food recipes, extracting the ingredients used in those recipes as well as their quantities.

Their implementation uses CRF++ as the extractor.


Here I will use pycrfsuite instead of CRF++, the main reasons being:

  • by using a pure Python solution (even though pycrfsuite is just a wrapper around crfsuite) we can deploy the model more easily, and

  • installing CRF++ proved to be a challenge on Ubuntu 14.04

You can install pycrfsuite by doing:

pip install python-crfsuite


We load the train_file, whose features are produced by running the following command (as it appears in the README):

bin/generate_data --data-path=input.csv --count=180000 --offset=0 > tmp/train_file


In [1]:
import re
import json

from itertools import chain
import nltk
import pycrfsuite

from lib.training import utils


In [2]:
with open('tmp/train_file') as train_file:
    lines = train_file.readlines()

# Each row is tab-separated; keep only complete rows (5 features + 1 label).
items = [line.strip('\n').split('\t') for line in lines]
items = [item for item in items if len(item) == 6]


In [3]:
items[:10]


Out[3]:
[['1$1/4', 'I1', 'L20', 'NoCAP', 'NoPAREN', 'B-QTY'],
 ['cups', 'I2', 'L20', 'NoCAP', 'NoPAREN', 'B-UNIT'],
 ['cooked', 'I3', 'L20', 'NoCAP', 'NoPAREN', 'B-COMMENT'],
 ['and', 'I4', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT'],
 ['pureed', 'I5', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT'],
 ['fresh', 'I6', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT'],
 ['butternut', 'I7', 'L20', 'NoCAP', 'NoPAREN', 'B-NAME'],
 ['squash', 'I8', 'L20', 'NoCAP', 'NoPAREN', 'I-NAME'],
 [',', 'I9', 'L20', 'NoCAP', 'NoPAREN', 'OTHER'],
 ['or', 'I10', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT']]


As we can see, each line of the train_file follows the format:

  • token
  • position in the phrase (I1 is the first token, I2 the second, and so on)
  • LX, the length group of the phrase the token belongs to (defined by LengthGroup)
  • NoCAP or YesCAP, whether the token is capitalized or not
  • NoPAREN or YesPAREN, whether the token is inside parentheses or not
  • the label to predict (B-QTY, B-UNIT, I-NAME, OTHER, and so on)
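To make the column layout concrete, here is a minimal sketch that splits one tab-separated row into named fields. The `FeatureRow` container and its field names are made up for illustration; they are not part of the NYT code.

```python
from collections import namedtuple

# Hypothetical container; the field names are illustrative, not from the NYT repo.
FeatureRow = namedtuple(
    "FeatureRow", ["token", "position", "length_group", "cap", "paren", "label"]
)

def parse_feature_row(line):
    """Split one tab-separated line of the train file into named fields.

    Returns None for lines that do not have exactly six columns,
    mirroring the len(item) == 6 filter used when loading the file.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return None
    return FeatureRow(*fields)

row = parse_feature_row("cups\tI2\tL20\tNoCAP\tNoPAREN\tB-UNIT\n")
print("%s -> %s" % (row.token, row.label))
```

The first five fields are what the CRF sees as features; the last one is the tag it learns to predict.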


pycrfsuite expects each training example to be a sequence of per-token features together with the matching sequence of tags, so we process the items from the train file and bucket them into sentences.


In [4]:
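sentences = []

# A row whose position feature is 'I1' starts a new phrase.
sent = [items[0]]
for item in items[1:]:
    if 'I1' in item:
        sentences.append(sent)
        sent = [item]
    else:
        sent.append(item)
# Note: the final phrase is still held in `sent` here and is never appended.
len(sentences)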


Out[4]:
177029
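The loop above only closes a phrase when the next 'I1' row arrives, so the last phrase in the file stays in `sent`. A minimal sketch of the same grouping as a function that also flushes that last phrase is shown here; the function name and the toy rows are illustrative, and on the real data its count may differ by one from the output above.

```python
def group_into_sentences(items):
    """Group feature rows into phrases, starting a new phrase at each 'I1' row."""
    sentences = []
    current = []
    for item in items:
        # The position feature is the second column; 'I1' marks a phrase start.
        if item[1] == 'I1' and current:
            sentences.append(current)
            current = []
        current.append(item)
    if current:
        # Flush the last phrase, which has no following 'I1' row to trigger it.
        sentences.append(current)
    return sentences

# Made-up rows in the same six-column layout as the train file.
toy_items = [
    ['2', 'I1', 'L4', 'NoCAP', 'NoPAREN', 'B-QTY'],
    ['eggs', 'I2', 'L4', 'NoCAP', 'NoPAREN', 'B-NAME'],
    ['salt', 'I1', 'L4', 'NoCAP', 'NoPAREN', 'B-NAME'],
]
toy_sentences = group_into_sentences(toy_items)
print(len(toy_sentences))  # 2
```

Checking `item[1]` rather than `'I1' in item` also avoids splitting on a token that happens to be the literal string 'I1'.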


In [5]:
import random
random.shuffle(sentences)
test_size = 0.1
data_size = len(sentences)

test_data = sentences[:int(test_size*data_size)]
train_data = sentences[int(test_size*data_size):]
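Because `random.shuffle` is called without a seed, each run of the notebook produces a different split. If a reproducible split is wanted, one option is sketched here; the function name, the `seed` parameter and the toy input are assumptions made for illustration.

```python
import random

def split_train_test(sentences, test_size=0.1, seed=0):
    """Shuffle a copy of `sentences` and return (train, test) lists."""
    rng = random.Random(seed)   # private RNG, leaves the global random state alone
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n_test = int(test_size * len(shuffled))
    # Same convention as the notebook: the first slice is the test set.
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_train_test(list(range(20)), test_size=0.1, seed=42)
print("%d train, %d test" % (len(train), len(test)))
```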


In [31]:
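def sent2labels(sent):
    # the label is the last column of each row
    return [word[-1] for word in sent]

def sent2features(sent):
    # everything except the label is passed to the CRF as features
    return [word[:-1] for word in sent]

def sent2tokens(sent):
    # the raw token is the first column
    return [word[0] for word in sent]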

y_train = [sent2labels(s) for s in train_data]
X_train = [sent2features(s) for s in train_data]
X_train[1]


Out[31]:
[['Orange', 'I1', 'L8', 'YesCAP', 'NoPAREN'],
 ['peel', 'I2', 'L8', 'NoCAP', 'NoPAREN'],
 [',', 'I3', 'L8', 'NoCAP', 'NoPAREN'],
 ['sliced.', 'I4', 'L8', 'NoCAP', 'NoPAREN']]


We set up the CRF trainer. We will use the default values and include all the possible joint features.


In [32]:
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)


I obtained the following hyperparameters by running a GridSearchCV with the scikit-learn implementation of pycrfsuite.


In [33]:
trainer.set_params({
    'c1': 0.43,
    'c2': 0.012,
    'max_iterations': 100,
    'feature.possible_transitions': True,
    'feature.possible_states': True,
    'linesearch': 'StrongBacktracking'
})


We train the model (this might take a while).


In [34]:
trainer.train('tmp/trained_pycrfsuite')


Now we have a trained model that we can deploy.


In [35]:
tagger = pycrfsuite.Tagger()
tagger.open('tmp/trained_pycrfsuite')


Out[35]:
<contextlib.closing at 0x7f2984586990>
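The notebook holds out `test_data` in cell In [5] but does not score the model on it. A minimal sketch of a token-level accuracy check follows. It is written against any callable that maps a feature sequence to a tag sequence, so the demo uses a stand-in tagger and made-up rows rather than the trained model; `token_accuracy` and `always_other` are illustrative names.

```python
def token_accuracy(tag_fn, sentences):
    """Fraction of tokens whose predicted tag equals the gold tag.

    tag_fn: callable taking a list of per-token feature lists and
            returning a list of tags.
    sentences: list of phrases, each a list of rows whose last column
               is the gold label.
    """
    correct = 0
    total = 0
    for sent in sentences:
        features = [row[:-1] for row in sent]
        gold = [row[-1] for row in sent]
        predicted = tag_fn(features)
        correct += sum(1 for p, g in zip(predicted, gold) if p == g)
        total += len(gold)
    return float(correct) / total if total else 0.0

# Stand-in tagger for the demo: labels every token OTHER.
def always_other(features):
    return ['OTHER'] * len(features)

toy_test = [
    [['2', 'I1', 'L4', 'NoCAP', 'NoPAREN', 'B-QTY'],
     [',', 'I2', 'L4', 'NoCAP', 'NoPAREN', 'OTHER']],
]
print(token_accuracy(always_other, toy_test))  # 0.5
```

With the objects from this notebook, the call would be `token_accuracy(tagger.tag, test_data)`.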


Now we just add a wrapper around the logic found in lib/testing/convert_to_json.py and create a convenient way to parse an ingredient sentence.


In [40]:
import re
import json
from lib.training import utils
from string import punctuation

from nltk.tokenize import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()

def get_sentence_features(sent):
    """Gets  the features of the sentence"""
    sent_tokens = utils.tokenize(utils.cleanUnicodeFractions(sent))

    sent_features = []
    for i, token in enumerate(sent_tokens):
        token_features = [token]
        token_features.extend(utils.getFeatures(token, i+1, sent_tokens))
        sent_features.append(token_features)
    return sent_features

def format_ingredient_output(tagger_output):
    """Formats the tagger output into a more convenient dictionary"""
    data = [{}]
    display = [[]]
    prevTag = None

    for token, tag in tagger_output:
        # turn B-NAME/123 back into "name"
        tag = re.sub(r'^[BI]\-', "", tag).lower()

        # ---- DISPLAY ----
        # build a structure which groups each token by its tag, so we can
        # rebuild the original display name later.

        if prevTag != tag:
            display[-1].append((tag, [token]))
            prevTag = tag
        else:
            display[-1][-1][1].append(token)
            #               ^- token
            #            ^---- tag
            #        ^-------- ingredient

        # ---- DATA ----
        # build a dict grouping tokens by their tag

        # initialize this attribute if this is the first token of its kind
        if tag not in data[-1]:
            data[-1][tag] = []

        # HACK: If this token is a unit, singularize it so Scoop accepts it.
        if tag == "unit":
            token = utils.singularize(token)

        data[-1][tag].append(token)

    # reassemble the output into a list of dicts.
    output = [
        dict([(k, utils.smartJoin(tokens)) for k, tokens in ingredient.items()])
        for ingredient in data
        if len(ingredient)
    ]

    # Add the raw ingredient phrase
    for i, v in enumerate(output):
        output[i]["input"] = utils.smartJoin(
            [" ".join(tokens) for k, tokens in display[i]])

    return output

def parse_ingredient(sent):
    """ingredient parsing logic"""
    sentence_features = get_sentence_features(sent)
    tags = tagger.tag(sentence_features)
    tagger_output = zip(sent2tokens(sentence_features), tags)
    parsed_ingredient =  format_ingredient_output(tagger_output)
    if parsed_ingredient:
        parsed_ingredient[0]['name'] = parsed_ingredient[0].get('name','').strip('.')
    return parsed_ingredient

def parse_recipe_ingredients(ingredient_list):
    """Wrapper around parse_ingredient so we can call it on an ingredient list"""
    # split the text passed in (not the global q) into one sentence per ingredient
    sentences = tokenizer.tokenize(ingredient_list)
    sentences = [sent.strip('\n') for sent in sentences]
    ingredients = []
    for sent in sentences:
        ingredients.extend(parse_ingredient(sent))
    return ingredients
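The core of format_ingredient_output is stripping the B-/I- prefix from each tag and collecting tokens under the resulting key. A stripped-down sketch of just that step is shown here. It leaves out utils.singularize and utils.smartJoin, and the helper names and the tagged example are invented for illustration, not real tagger output.

```python
import re

def strip_bio_prefix(tag):
    """Turn 'B-NAME' or 'I-NAME' into 'name'; other tags are only lowercased."""
    return re.sub(r'^[BI]-', '', tag).lower()

def group_tokens_by_tag(tagged_tokens):
    """Collect tokens under their normalised tag, keeping token order."""
    grouped = {}
    for token, tag in tagged_tokens:
        grouped.setdefault(strip_bio_prefix(tag), []).append(token)
    return grouped

# Invented (token, tag) pairs in the shape produced by zipping tokens and tags.
tagged = [('2', 'B-QTY'), ('cups', 'B-UNIT'), ('all-purpose', 'B-NAME'),
          ('flour', 'I-NAME'), ('.', 'OTHER')]
grouped = group_tokens_by_tag(tagged)
print(grouped['name'])  # ['all-purpose', 'flour']
```

Joining each token list back into a string is what utils.smartJoin does in the full function.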


In [39]:
q = '''
2 1/4 cups all-purpose flour.
1/2 teaspoon baking soda.
1 cup (2 sticks) unsalted butter, room temperature.
1/2 cup granulated sugar.
1 cup packed light-brown sugar.
1 teaspoon salt.
2 teaspoons pure vanilla extract.
2 large eggs.
2 cups (about 12 ounces) semisweet and/or milk chocolate chips.
'''

parse_recipe_ingredients(q)


Out[39]:
[{'input': u'2$1/4 cups all-purpose flour.',
  'name': u'all-purpose flour',
  'qty': u'2$1/4',
  'unit': u'cup'},
 {'input': u'1/2 teaspoon baking soda.',
  'name': u'baking',
  'other': u'soda.',
  'qty': u'1/2',
  'unit': u'teaspoon'},
 {'comment': u'(2 sticks)',
  'input': u'1 cup (2 sticks) unsalted butter, room temperature.',
  'name': u'unsalted butter',
  'other': u', room temperature.',
  'qty': u'1',
  'unit': u'cup'},
 {'input': u'1/2 cup granulated sugar.',
  'name': u'granulated sugar',
  'qty': u'1/2',
  'unit': u'cup'},
 {'comment': u'packed',
  'input': u'1 cup packed light-brown sugar.',
  'name': '',
  'other': u'light-brown sugar.',
  'qty': u'1',
  'unit': u'cup'},
 {'input': u'1 teaspoon salt.',
  'name': '',
  'other': u'salt.',
  'qty': u'1',
  'unit': u'teaspoon'},
 {'comment': u'pure',
  'input': u'2 teaspoons pure vanilla extract.',
  'name': u'vanilla',
  'other': u'extract.',
  'qty': u'2',
  'unit': u'teaspoon'},
 {'comment': u'large',
  'input': u'2 large eggs.',
  'name': u'eggs',
  'qty': u'2'},
 {'comment': u'(about 12 ounces) semisweet and/or',
  'input': u'2 cups (about 12 ounces) semisweet and/or milk chocolate chips.',
  'name': u'milk chocolate chips',
  'qty': u'2',
  'unit': u'cup'}]

