Handle missing categoricals with PMML

PMML, a markup language developed by the Data Mining Group, is, in my opinion, a much-needed standard in the data science ecosystem. PMML is basically an XML format for defining machine learning pipelines, which allows for (sort of) interoperability between different ML platforms. In particular, I have been working…

Video: Data Science Sessions in Murcia

On April 21, 2017, thanks to the support of Centic and Info de Murcia, some 80 people came to let me talk their ears off for 3 hours about everything related to data science. Here is the video. You can see the slides on SlideShare…

This is what a memory leak looks like

On the left side of this chart: VSZ (virtual memory) and RSS (RAM) over time (obtained via ps) for a process using a poor implementation of KafkaClient in Java that creates a new Kafka client per GET request. This is bad. On the right side of the chart: current performance once I fixed the previous developer's…

Note to self: Changing log level in Apache Spark

Very quick note for future reference. Please ignore. Change log level in Spark: easy peasy, you can do it programmatically in the application with spark.sparkContext.setLogLevel("WARN"). Change log level in YARN: this one took a while to find. You can run spark-submit after exporting this environment variable: export YARN…

How to reuse HTTP response body in Golang

Took me a while to figure it out, but it seems that in Go you can't re-read from an HTTP response. I found a way to solve it here. For debugging purposes, I had to be able to print the raw response as well as decode it to JSON, to…