By Manuel Garrido — May 5, 2016

Modifying the NYT ingredient tagger into a pure Python implementation

NOTE: This is a Jupyter notebook converted to Markdown, so the formatting is rough in places. The original notebook can be seen here.

In [41]:

    %load_ext watermark
    %watermark

    2016-05-05T15:37:46

    CPython 2.7.11
    IPython 4.0.3

    compiler   : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
    system     : Linux
    release    : 3.19.0-58-generic
    machine    : x86_64
    processor  : x86_64
    CPU cores  : 8
    interpreter: 64bit

This is a notebook showing a modification of the original NYT Ingredient Phrase tagger. Here is the article where they talk about it.

That GitHub repository contains The New York Times' tool for performing Named Entity Recognition via Conditional Random Fields on food recipes, extracting the ingredients used in those recipes as well as their quantities.

Their implementation uses CRF++ as the extractor.

Here I will use pycrfsuite instead of CRF++, for two main reasons:

- by using a full Python solution (even though pycrfsuite is just a wrapper around crfsuite) we can deploy the model more easily, and
- installing CRF++ proved to be a challenge on Ubuntu 14.04.

You can install pycrfsuite by doing:

pip install python-crfsuite

We load the train_file, whose features were produced by calling (as shown in the README):

bin/generate_data --data-path=input.csv --count=180000 --offset=0 > tmp/train_file

In [1]:

    import re
    import json
    from itertools import chain

    import nltk
    import pycrfsuite

    from lib.training import utils

In [2]:

    with open('tmp/train_file') as fname:
        lines = fname.readlines()

    items = [line.strip('\n').split('\t') for line in lines]
    items = [item for item in items if len(item)==6]

In [3]: items[:10]

Out[3]:

    [['1$1/4', 'I1', 'L20', 'NoCAP', 'NoPAREN', 'B-QTY'],
     ['cups', 'I2', 'L20', 'NoCAP', 'NoPAREN', 'B-UNIT'],
     ['cooked', 'I3', 'L20', 'NoCAP', 'NoPAREN', 'B-COMMENT'],
     ['and', 'I4', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT'],
     ['pureed', 'I5', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT'],
     ['fresh', 'I6', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT'],
     ['butternut', 'I7', 'L20', 'NoCAP', 'NoPAREN', 'B-NAME'],
     ['squash', 'I8', 'L20', 'NoCAP', 'NoPAREN', 'I-NAME'],
     [',', 'I9', 'L20', 'NoCAP', 'NoPAREN', 'OTHER'],
     ['or', 'I10', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT']]

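As we can see, each line of the train_file has six tab-separated fields:

- the token
- its position in the phrase (I1 is the first word, I2 the second, and so on)
- LX, the length group (defined by LengthGroup); every token of a phrase carries the same value in the output above, so it appears to describe the length of the phrase rather than of the individual token
- NoCAP or YesCAP, whether the token is capitalized or not
- NoPAREN or YesPAREN, whether the token is inside parentheses or not
- the tag to predict (for example B-QTY, I-NAME or OTHER)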
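To make the line format concrete, here is a minimal sketch that splits one line of the train file into named fields. The `TrainRow` container, its field names and `parse_train_line` are hypothetical helpers for illustration; they are not part of the NYT code or of pycrfsuite.

```python
from collections import namedtuple

# Hypothetical container for one row of the train file; the field names
# are illustrative and do not come from the NYT repository.
TrainRow = namedtuple("TrainRow", "token position length_group cap paren label")


def parse_train_line(line):
    """Split one tab-separated line into a TrainRow.

    Returns None for lines that do not have exactly six fields
    (for example the blank lines that separate phrases).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return None
    return TrainRow(*fields)


row = parse_train_line("cups\tI2\tL20\tNoCAP\tNoPAREN\tB-UNIT\n")
print("{} {} {}".format(row.token, row.position, row.label))
```

The first five fields are the features handed to the CRF, and the last one is the label it learns to predict.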

pycrfsuite expects each training example to be a sequence of structured items together with their respective tags, so we process the items from the train file and bucket them into sentences.

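In [4]:

    sentences = []

    sent = [items[0]]
    for item in items[1:]:
        # a row carrying the 'I1' position feature starts a new phrase
        if 'I1' in item:
            sentences.append(sent)
            sent = [item]
        else:
            sent.append(item)
    # note: the phrase still held in `sent` after the loop is not appended
    len(sentences)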

Out[4]: 177029
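The bucketing loop only appends a phrase when the next `I1` row arrives, so the last phrase of the file stays in `sent` and is never added. Below is a minimal sketch of a variant that also flushes that final phrase. The function name `group_into_sentences` and the toy rows are made up for illustration.

```python
def group_into_sentences(items):
    """Group rows into phrases.

    A row whose position feature (second field) is 'I1' starts a new
    phrase. The phrase being built when the input ends is flushed too.
    """
    sentences = []
    current = []
    for item in items:
        if item[1] == "I1" and current:
            sentences.append(current)
            current = []
        current.append(item)
    if current:
        sentences.append(current)
    return sentences


# Toy rows in the same six-field layout as the train file.
toy_items = [
    ["2", "I1", "L4", "NoCAP", "NoPAREN", "B-QTY"],
    ["eggs", "I2", "L4", "NoCAP", "NoPAREN", "B-NAME"],
    ["1", "I1", "L4", "NoCAP", "NoPAREN", "B-QTY"],
    ["cup", "I2", "L4", "NoCAP", "NoPAREN", "B-UNIT"],
    ["milk", "I3", "L4", "NoCAP", "NoPAREN", "B-NAME"],
]
grouped = group_into_sentences(toy_items)
print(len(grouped))  # 2
```

Comparing `item[1]` with `'I1'` instead of testing `'I1' in item` also avoids a false split if a token itself happens to be the string `I1`.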

In [5]:

    import random
    random.shuffle(sentences)
    test_size = 0.1
    data_size = len(sentences)

    test_data = sentences[:int(test_size*data_size)]
    train_data = sentences[int(test_size*data_size):]
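Because the shuffle is unseeded, each run produces a different train/test split. If reproducibility matters, one option is to shuffle a copy with a fixed seed. This is a sketch under that assumption; `split_sentences` and the seed value are illustrative and not part of the original notebook.

```python
import random


def split_sentences(sentences, test_size=0.1, seed=42):
    """Shuffle a copy of `sentences` with a fixed seed and return
    (train, test), where `test_size` is the held-out fraction."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = int(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


# Integers stand in for sentences here.
train, test = split_sentences(list(range(20)), test_size=0.1)
print("{} {}".format(len(train), len(test)))  # 18 2
```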

In [31]:

    def sent2labels(sent):
        return [word[-1] for word in sent]

    def sent2features(sent):
        return [word[:-1] for word in sent]

    def sent2tokens(sent):
        return [word[0] for word in sent]

    y_train = [sent2labels(s) for s in train_data]
    X_train = [sent2features(s) for s in train_data]
    X_train[1]

Out[31]:

    [['Orange', 'I1', 'L8', 'YesCAP', 'NoPAREN'],
     ['peel', 'I2', 'L8', 'NoCAP', 'NoPAREN'],
     [',', 'I3', 'L8', 'NoCAP', 'NoPAREN'],
     ['sliced.', 'I4', 'L8', 'NoCAP', 'NoPAREN']]

We set up the CRF trainer and feed it the training sequences. Apart from the hyperparameters set below, we keep the default values, and we include all the possible joint features.

In [32]:

    trainer = pycrfsuite.Trainer(verbose=False)

    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)

I obtained the following hyperparameters by running a GridSearchCV with the scikit-learn wrapper for pycrfsuite.

In [33]:

    trainer.set_params({
        'c1': 0.43,
        'c2': 0.012,
        'max_iterations': 100,
        'feature.possible_transitions': True,
        'feature.possible_states': True,
        'linesearch': 'StrongBacktracking'
    })
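The grid that was actually searched is not given here. Purely as an illustration of what a grid over the two regularization strengths looks like, the sketch below enumerates candidate parameter dictionaries in the same format that `trainer.set_params` accepts. The candidate values are assumptions chosen for the example, not the ones used for this model.

```python
from itertools import product

# Illustrative candidate values only; the grid actually searched is not
# stated in this post.
c1_values = [0.01, 0.1, 0.43, 1.0]   # L1 regularization strength
c2_values = [0.001, 0.012, 0.1]      # L2 regularization strength

# Settings held fixed across the grid, copied from the cell above.
base_params = {
    "max_iterations": 100,
    "feature.possible_transitions": True,
    "feature.possible_states": True,
}

param_grid = []
for c1, c2 in product(c1_values, c2_values):
    params = dict(base_params)
    params.update({"c1": c1, "c2": c2})
    param_grid.append(params)

print(len(param_grid))  # 12
```

Each dictionary could then be passed to a fresh trainer and the resulting model scored on held-out data, which is roughly what a cross-validated grid search automates.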

We train the model (this might take a while).

In [34]: trainer.train('tmp/trained_pycrfsuite')

Now we have a trained model, saved to disk, that we can load and deploy.

In [35]:

    tagger = pycrfsuite.Tagger()
    tagger.open('tmp/trained_pycrfsuite')

Out[35]: <contextlib.closing at 0x7f2984586990>
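The held-out `test_data` is not used further in this notebook. One simple way to put it to work is a token-level accuracy, sketched below on toy tag sequences. The helper `token_accuracy` is hypothetical, and this is a basic metric rather than a full evaluation.

```python
def token_accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted tag equals the gold tag.

    Both arguments are lists of tag sequences, one sequence per phrase.
    """
    correct = 0
    total = 0
    for gold_seq, pred_seq in zip(y_true, y_pred):
        for gold, pred in zip(gold_seq, pred_seq):
            correct += int(gold == pred)
            total += 1
    return float(correct) / total if total else 0.0


# Toy gold and predicted tags: one of the five tokens is mislabeled.
gold = [["B-QTY", "B-UNIT", "B-NAME"], ["B-QTY", "B-NAME"]]
pred = [["B-QTY", "B-UNIT", "B-COMMENT"], ["B-QTY", "B-NAME"]]
print(token_accuracy(gold, pred))  # 0.8
```

In the notebook, `y_true` could be built as `[sent2labels(s) for s in test_data]` and `y_pred` as `[tagger.tag(sent2features(s)) for s in test_data]`.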

Now we just add a wrapper function for the script found in lib/testing/convert_to_json.py and create a convenient way to parse an ingredient sentence.

In [40]:

    import re
    import json
    from lib.training import utils
    from string import punctuation

    from nltk.tokenize import PunktSentenceTokenizer

    tokenizer = PunktSentenceTokenizer()

    def get_sentence_features(sent):
        """Gets the features of the sentence"""
        sent_tokens = utils.tokenize(utils.cleanUnicodeFractions(sent))

        sent_features = []
        for i, token in enumerate(sent_tokens):
            token_features = [token]
            token_features.extend(utils.getFeatures(token, i+1, sent_tokens))
            sent_features.append(token_features)
        return sent_features

    def format_ingredient_output(tagger_output, display=False):
        """Formats the tagger output into a more convenient dictionary"""
        data = [{}]
        display = [[]]
        prevTag = None

        for token, tag in tagger_output:
            # turn B-NAME/123 back into "name"
            tag = re.sub(r'^[BI]\-', "", tag).lower()

            # ---- DISPLAY ----
            # build a structure which groups each token by its tag, so we can
            # rebuild the original display name later.

            if prevTag != tag:
                display[-1].append((tag, [token]))
                prevTag = tag
            else:
                display[-1][-1][1].append(token)
                #               ^- token
                #            ^---- tag
                #        ^-------- ingredient

            # ---- DATA ----
            # build a dict grouping tokens by their tag

            # initialize this attribute if this is the first token of its kind
            if tag not in data[-1]:
                data[-1][tag] = []

            # HACK: If this token is a unit, singularize it so Scoop accepts it.
            if tag == "unit":
                token = utils.singularize(token)

            data[-1][tag].append(token)

        # reassemble the output into a list of dicts.
        output = [
            dict([(k, utils.smartJoin(tokens)) for k, tokens in ingredient.iteritems()])
            for ingredient in data
            if len(ingredient)
        ]

        # Add the raw ingredient phrase
        for i, v in enumerate(output):
            output[i]["input"] = utils.smartJoin(
                [" ".join(tokens) for k, tokens in display[i]])

        return output

    def parse_ingredient(sent):
        """ingredient parsing logic"""
        sentence_features = get_sentence_features(sent)
        tags = tagger.tag(sentence_features)
        tagger_output = zip(sent2tokens(sentence_features), tags)
        parsed_ingredient = format_ingredient_output(tagger_output)
        if parsed_ingredient:
            parsed_ingredient[0]['name'] = parsed_ingredient[0].get('name','').strip('.')
        return parsed_ingredient

    def parse_recipe_ingredients(ingredient_list):
        """Wrapper around parse_ingredient so we can call it on an ingredient list"""
        # split the argument itself into sentences (not the global `q`)
        sentences = tokenizer.tokenize(ingredient_list)
        sentences = [sent.strip('\n') for sent in sentences]
        ingredients = []
        for sent in sentences:
            ingredients.extend(parse_ingredient(sent))
        return ingredients
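The core of `format_ingredient_output` is stripping the `B-`/`I-` prefix from each tag and collecting the tokens that share a tag. The sketch below isolates that step on toy data. It is a simplified stand-in: `group_tokens_by_tag` is a made-up name, tokens are joined with plain spaces instead of `utils.smartJoin`, and units are not singularized.

```python
import re


def group_tokens_by_tag(tagged_tokens):
    """Strip the B-/I- prefix from each tag, lowercase it, and join the
    tokens collected under each tag with single spaces."""
    grouped = {}
    for token, tag in tagged_tokens:
        key = re.sub(r"^[BI]-", "", tag).lower()
        grouped.setdefault(key, []).append(token)
    return {key: " ".join(tokens) for key, tokens in grouped.items()}


# Toy (token, tag) pairs in the shape that zip(tokens, tags) produces.
tagged = [
    ("2", "B-QTY"),
    ("cups", "B-UNIT"),
    ("all-purpose", "B-NAME"),
    ("flour", "I-NAME"),
]
result = group_tokens_by_tag(tagged)
print(result)
```

Tags without a prefix, such as `OTHER`, pass through the substitution unchanged and end up under the key `other`.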

In [39]:

    q = '''2 1/4 cups all-purpose flour.
    1/2 teaspoon baking soda.
    1 cup (2 sticks) unsalted butter, room temperature.
    1/2 cup granulated sugar.
    1 cup packed light-brown sugar.
    1 teaspoon salt.
    2 teaspoons pure vanilla extract.
    2 large eggs.
    2 cups (about 12 ounces) semisweet and/or milk chocolate chips.'''

    parse_recipe_ingredients(q)

Out[39]:

    [{'input': u'2$1/4 cups all-purpose flour.',
      'name': u'all-purpose flour',
      'qty': u'2$1/4',
      'unit': u'cup'},
     {'input': u'1/2 teaspoon baking soda.',
      'name': u'baking',
      'other': u'soda.',
      'qty': u'1/2',
      'unit': u'teaspoon'},
     {'comment': u'(2 sticks)',
      'input': u'1 cup (2 sticks) unsalted butter, room temperature.',
      'name': u'unsalted butter',
      'other': u', room temperature.',
      'qty': u'1',
      'unit': u'cup'},
     {'input': u'1/2 cup granulated sugar.',
      'name': u'granulated sugar',
      'qty': u'1/2',
      'unit': u'cup'},
     {'comment': u'packed',
      'input': u'1 cup packed light-brown sugar.',
      'name': '',
      'other': u'light-brown sugar.',
      'qty': u'1',
      'unit': u'cup'},
     {'input': u'1 teaspoon salt.',
      'name': '',
      'other': u'salt.',
      'qty': u'1',
      'unit': u'teaspoon'},
     {'comment': u'pure',
      'input': u'2 teaspoons pure vanilla extract.',
      'name': u'vanilla',
      'other': u'extract.',
      'qty': u'2',
      'unit': u'teaspoon'},
     {'comment': u'large',
      'input': u'2 large eggs.',
      'name': u'eggs',
      'qty': u'2'},
     {'comment': u'(about 12 ounces) semisweet and/or',
      'input': u'2 cups (about 12 ounces) semisweet and/or milk chocolate chips.',
      'name': u'milk chocolate chips',
      'qty': u'2',
      'unit': u'cup'}]