Handle missing categoricals with PMML

PMML, a markup language developed by the Data Mining Group, is, in my opinion, a much-needed standard in the Data Science ecosystem. PMML is basically an XML format for defining Machine Learning pipelines, which allows for (sort of) interoperability between different ML platforms.

In particular, I have been working lately with Openscoring, a wonderful piece of software that creates a web server with an easy-to-use REST API to deploy models and evaluate data with them. (While I am here, kudos to Villu Ruusmann for singlehandedly spearheading the Machine Learning standardization movement.) In my opinion, Openscoring fills a gap that is severely underserved: low-latency, real-time edge prediction. It can deploy Scikit-Learn, Spark and KNIME models and evaluate them with nearly sub-millisecond latency. As good as Openscoring is, I have been struggling to make it fit our needs at Tribe (Java is not a language I particularly know/like).

The thing is, PMML is not an easy markup language (here is a simple example). How could it be? There are dozens of different ML algorithms and preprocessors built into many different platforms, and developing a common language among them is not an easy task.

Here is an issue that, as common as it seems, I had trouble finding a solution to. I hope this article can help someone in the future.

How can you handle categorical variables when you don't know all the possible valid values?

For example, let's say you are analyzing web requests and using the browser version name from the HTTP request as a feature for modelling. Over time, new web browsers will keep popping up. If your model does not know how to handle those new values, evaluation can go wrong.

Some PMML implementations provide this functionality (here is an example using R's pmml package). However, Spark's support for handling missing values is unfortunately not that great.

Given that PMML models are basically XML, we can modify them as long as we follow DMG's PMML structure. The trick is in PMML's mining schema. Basically, to any MiningField (which is PMML's way of saying "field to use for evaluation"), we can add the attributes missingValueReplacement and invalidValueTreatment. For example, we can have a field like:

<MiningField name="browser" missingValueReplacement="NULL" invalidValueTreatment="asMissing"/>

This basically tells any PMML evaluator to treat any invalid value (for example, a float, or a value that does not exist in the field definition) as missing, and to replace any missing value with "NULL".
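To make that behaviour concrete, here is a minimal Python sketch of what the two attributes do to a single categorical input. It is only an illustration of the semantics described above, not how any particular evaluator is implemented; `prepare_value` and `known_values` are hypothetical names I made up for the example.

```python
# Illustrative only: mimics the combined effect of invalidValueTreatment="asMissing"
# and missingValueReplacement="NULL" on one categorical input.
# prepare_value and known_values are hypothetical, not part of any PMML library.

def prepare_value(value, known_values, replacement="NULL"):
    """Return the value the model would see for one categorical field."""
    # Invalid values (wrong type, or not among the declared categories) are treated as missing
    if not isinstance(value, str) or value not in known_values:
        value = None
    # Missing values are then substituted with the replacement
    if value is None:
        return replacement
    return value


known_browsers = {"Chrome", "Firefox", "NULL"}
print(prepare_value("Chrome", known_browsers))  # a known category passes through unchanged
print(prepare_value("Brave", known_browsers))   # an unseen category is replaced
print(prepare_value(None, known_browsers))      # a missing value is replaced
print(prepare_value(3.14, known_browsers))      # a float is invalid, so it is replaced
```

Note that "NULL" itself has to be one of the known categories, which is why the training data needs to contain it.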

For this to work, at training time you have to make sure that every categorical field (basically, strings) has at least one case with the default null value NULL in it. One way of ensuring the default null value appears in every field is to add an extra data point to your training data with all fields set to NULL.
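As a rough sketch of that extra data point, assuming the training data is a plain list of dicts: the helper name `add_null_row`, the column names and the choice to copy the non-categorical columns (numeric features, label) from the first row are all assumptions of mine, so adapt them to whatever your pipeline expects.

```python
# Hypothetical helper: appends one extra training row whose categorical
# columns all hold the placeholder "NULL".

def add_null_row(rows, categorical_columns, placeholder="NULL"):
    """Return a new list of rows with one extra 'all NULL' row at the end."""
    if not rows:
        return []
    # Assumption: non-categorical columns are simply copied from the first row
    null_row = dict(rows[0])
    for column in categorical_columns:
        null_row[column] = placeholder
    return rows + [null_row]


# Made-up example data
training = [
    {"browser": "Chrome", "os": "Linux", "latency_ms": 12.0, "label": 1},
    {"browser": "Firefox", "os": "macOS", "latency_ms": 30.5, "label": 0},
]
augmented = add_null_row(training, ["browser", "os"])
print(augmented[-1])
```

A single extra row should barely affect the model, but it guarantees that "NULL" ends up among the valid values of every categorical field in the exported PMML.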

After training your model(s) and exporting to PMML, the only thing left is to modify that XML and add the missing-data attributes I explained above.

Here is a simple snippet that reads a raw PMML model, adds the missing value replacements and writes out a processed model that is ready to redeploy.

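import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

# Load the original PMML model
with open("raw_model.xml") as fname:
    model = ET.fromstring(fname.read())

# The root tag looks like "{namespace}PMML"; pull out the namespace URI
ns = model.tag[1:model.tag.index("}")]
# Keep PMML as the default namespace so tags are not written with an "ns0:" prefix
ET.register_namespace("", ns)

# Map each field name to its data type (here read from the DataDictionary)
fields = {
    f.attrib["name"]: f.attrib["dataType"]
    for f in model.findall(".//{{{ns}}}DataField".format(ns=ns))
}

# Here is where we edit the MiningFields and add the desired attributes
for field in model.findall(".//{{{ns}}}MiningField".format(ns=ns)):
    if fields.get(field.attrib["name"]) == "string":
        field.set("invalidValueTreatment", "asMissing")
        field.set("missingValueReplacement", "NULL")

# Export the modified model
with open("processed_model.xml", "w") as fname:
    # Please someone fix the stdlib xml library so I don't have to do this
    model_string = unescape(ET.tostring(model, encoding="utf8", method="xml").decode())
    fname.write(model_string)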
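
Before redeploying, it can be worth checking that every string field actually received the attributes. The following is a small sketch of such a check; the function name `unprotected_string_fields` is hypothetical, and the sample document is a hand-written, heavily trimmed PMML skeleton (not a complete, deployable model) used only to exercise the function.

```python
import xml.etree.ElementTree as ET


def unprotected_string_fields(pmml_text):
    """List the string MiningFields that lack either of the two attributes."""
    root = ET.fromstring(pmml_text)
    # The namespace is taken from the root tag, e.g. "{http://www.dmg.org/PMML-4_2}PMML"
    prefix = root.tag[:root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    # Data types are declared once, in the DataDictionary
    types = {f.get("name"): f.get("dataType") for f in root.iter(prefix + "DataField")}
    unprotected = []
    for field in root.iter(prefix + "MiningField"):
        if types.get(field.get("name")) != "string":
            continue
        if (field.get("invalidValueTreatment") != "asMissing"
                or field.get("missingValueReplacement") is None):
            unprotected.append(field.get("name"))
    return unprotected


# Hand-written skeleton for illustration; a real model has much more content
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <DataDictionary>
    <DataField name="browser" optype="categorical" dataType="string"/>
    <DataField name="os" optype="categorical" dataType="string"/>
    <DataField name="latency" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TreeModel functionName="classification">
    <MiningSchema>
      <MiningField name="browser" invalidValueTreatment="asMissing" missingValueReplacement="NULL"/>
      <MiningField name="os"/>
      <MiningField name="latency"/>
    </MiningSchema>
  </TreeModel>
</PMML>"""

print(unprotected_string_fields(SAMPLE))
```

An empty list means every string field is covered; any names that come back are fields that would still fail on unseen values.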

Hope that helps, thanks for reading!