Manugarri's blog

Things I like/dislike about GCP (when compared with AWS)

Manuel Garrido — Thu, 20 Nov 2025 15:04:42 GMT

I have been working with AWS (Amazon Web Services) over the past decade. At my current job at Semrush, I am lucky enough that I got to work with GCP (Google Cloud Platform) for a change.

After a year of experience with it, here are my very personal comments on GCP:

Things I DON'T like about GCP (when compared with AWS):

Regarding BigQuery, I find it a major design flaw that BigQuery supports a single database per project. This means you are forced to separate domains by schema, which is very ugly and leads to namespace separation like teamA_stage versus teamB_stage, etcetera, all in the same database. This lack of database selection within a project makes things much more complicated when you are dealing with tools that follow the normal convention that a project contains multiple databases, each with schemas and tables specific to that database's domain. Multi-project dbt is particularly tricky to implement elegantly with BigQuery, especially in monorepos where multiple teams push their own models.
Google Secret Manager is very similar to its AWS counterpart, with one major difference: by default, once a secret in GSM is deleted, it's gone forever. A DevOps engineer at Semrush once applied a stale Terraform configuration by mistake and dropped a few secrets for our platform. Even when we escalated the recovery request to the highest level, GCP tech support told us there was nothing to be done: there are no soft deletes and no grace period. This is very bad, because secrets are one of the things that usually don't have backups (because they are inherently security risks), and one wrong terraform apply can essentially destroy full data pipelines or applications.
Interacting with Google services in Python is an experience orders of magnitude worse than working with AWS. When you interact with AWS services in Python (a very common use case, since Python is now a major player in data engineering, among other disciplines), you just need to install a single package (boto3). You install boto3 in your environment, and everything works. With Google, each individual service requires its own package, and the documentation for these packages is severely lacking, to the point that I was sometimes not sure which package was the official Google one for a particular service.

On top of that, Google's Python packages are so strictly versioned (due to the protobuf APIs, I assume) that it is almost impossible to install certain dependencies. For example, when working with Apache Airflow, providers usually ship a single package with all of their specific code inside (these packages are usually named apache-airflow-providers-XXX, for example apache-airflow-providers-google). When working with AWS, package conflicts were rare, because AWS packages strike a good balance in their dependency requirements. However, the Airflow Google provider bundles every Google service (each with its own package and its own requirements), so you can sometimes end up in a place where you just can't find a set of package versions that installs.

Things I DO like about GCP (when compared with AWS):

Provisioning virtual machines with custom machine types is neat. That way you don't need to figure out which machine micro/macro/biggie/smallie matches your CPU/memory requirements; instead you can just provision a machine with machine type custom-6-20480 for a machine with 6 CPUs and 20 GB of memory.
The fact that Google Cloud Storage buckets have soft delete by default is a nice thing: it allows you to recover objects when you do an oopsie and delete some data. AWS S3 buckets do have versioning, but it is an opt-in setting, which means that if you don't enable it, by the time you need to recover some data it will be too late.
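
As a small illustration of the custom machine type naming mentioned above, here is a minimal sketch that builds the name from a CPU count and a memory size. The helper name `custom_machine_type` is made up for this example, and the sketch assumes the number in the name is the memory expressed in MB (20 GB = 20480 MB, as in custom-6-20480). GCP applies its own limits on valid CPU/memory combinations, which this sketch does not check.

```python
def custom_machine_type(vcpus: int, memory_gb: float) -> str:
    """Build a name such as 'custom-6-20480' from vCPUs and memory in GB."""
    # Assumption: the last number in the name is the memory in MB
    memory_mb = int(memory_gb * 1024)
    return f"custom-{vcpus}-{memory_mb}"


print(custom_machine_type(6, 20))  # custom-6-20480
```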

AI Agents are getting surprisingly easy to implement

Manuel Garrido — Fri, 03 Jan 2025 13:10:20 GMT

Like most people in the data world, I have been more and more exposed to LLM-based AI tools.
Things like Copilot or Perplexity have been adopted as new, better tools and have indeed made my regular workflow faster (to an extent).

Side note: I don't understand why we have spent over a decade saying 'Don't call it AI, call it Machine Learning, because AI is a much broader spectrum that does not necessarily rely on training data', but then OpenAI released ChatGPT a couple years ago and suddenly we are all forced to use the dreaded term again.

While the big AI companies are pushing towards bigger, smarter models (which makes sense, since bigger models are quite expensive to train and form a natural business moat that protects their dominant position), other voices are proposing something different. What if, instead of having massive models that can do anything (compose a song, then solve differential equations, then tell a joke), we had ensembles of small, focused LLM models cooperating with each other?

These LLM models are called AI Agents (or Agentic workflows). They usually have access to tools (aka functions) and can take a general user input and decide which tools to call, or which agents to call.

Up until recently, building these Agents has been quite messy, with different frameworks showing up.
The most popular framework for LLM work is Langchain, which is extremely convoluted and unintuitive from a software engineering point of view.

Alternatively, one can use specific provider packages directly (like the OpenAI Python client), which makes things easier to modify, at the cost of more verbose code.

However, just a few days ago, Hugging Face released smolagents, a library that dramatically simplifies Agent development. It's a batteries-included package that makes developing AI Agents a breeze.

Let's use an example to compare building an Agent from scratch with using smolagents.

Building a currency exchange Agent

We will blatantly copy the excellent tutorial at the SwirlAI newsletter and build an Agent that can take user queries, perform currency conversion, and return the converted amount.

Manual implementation

To build the Agent manually, the core thing we need to implement is the Tool. Remember, the main difference between an LLM powered chatbot and an LLM powered Agent, is that Agents have access to tools (which can be tools that call other agents).

We define a generic Tool class and a decorator that can turn a python function into an LLM compatible tool just by reading its docstring.

"""https://www.newsletter.swirlai.com/p/building-ai-agents-from-scratch-part"""
from dataclasses import dataclass
from typing import Dict, Any, Callable, get_type_hints, _GenericAlias, List
import inspect
import os
from openai import OpenAI
import json
import urllib.request  # "import urllib" alone does not load the request submodule

GITHUB_API_TOKEN = os.environ["GITHUB_API_TOKEN"]

def parse_docstring_params(docstring: str) -> Dict[str, str]:
    """Extract parameter descriptions from docstring."""
    if not docstring:
        return {}
    
    params = {}
    lines = docstring.split('\n')
    in_params = False
    current_param = None
    
    for line in lines:
        line = line.strip()
        if line.startswith('Parameters:'):
            in_params = True
        elif in_params:
            if line.startswith('-') or line.startswith('*'):
                # Split on the first colon only, so descriptions may contain colons
                name_part, _, desc_part = line.lstrip('- *').partition(':')
                current_param = name_part.strip()
                params[current_param] = desc_part.strip()
            elif current_param and line:
                params[current_param] += ' ' + line.strip()
            elif not line:
                in_params = False

    return params

def get_type_description(type_hint: Any) -> str:
    """Get a human-readable description of a type hint."""
    if isinstance(type_hint, _GenericAlias):
        if type_hint._name == 'Literal':
            return f"one of {type_hint.__args__}"
    return type_hint.__name__

@dataclass
class Tool:
    """Tool class that can produce valid Agent function calls from function docstrings"""
    name: str
    description: str
    func: Callable[..., str]
    parameters: Dict[str, Dict[str, str]]

    def __call__(self, *args, **kwargs) -> str:
        return self.func(*args, **kwargs)

def tool(name: str = None):
    def decorator(func: Callable[..., str]) -> Tool:
        tool_name = name or func.__name__
        description = inspect.getdoc(func) or "No description available"

        type_hints = get_type_hints(func)
        param_docs = parse_docstring_params(description)
        sig = inspect.signature(func)

        params = {}
        for param_name, param in sig.parameters.items():
            params[param_name] = {
                "type": get_type_description(type_hints.get(param_name, Any)),
                "description": param_docs.get(param_name, "No description available")
            }

        return Tool(
            name=tool_name,
            description=description.split('\n\n')[0],
            func=func,
            parameters=params
        )
    return decorator

This tool class takes a function docstring and generates json documentation of the expected inputs the tool will take. It will also add a description to the json that will be used in the System prompt to tell the LLM which tools it has access to and how to use them.
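
To make the introspection step more concrete, here is a minimal, stripped-down sketch of the same idea: reading a function's signature and type hints and turning them into a JSON-friendly dictionary. The names `describe_params` and `add` are made up for this illustration and are not part of the tutorial code above; docstring parsing is left out on purpose.

```python
import inspect
import json
from typing import Any, Dict, get_type_hints


def describe_params(func) -> Dict[str, Dict[str, str]]:
    """Map each parameter of func to the name of its annotated type."""
    hints = get_type_hints(func)
    described = {}
    for name in inspect.signature(func).parameters:
        hint = hints.get(name, Any)
        # Fall back to str() for hints that have no __name__ attribute
        described[name] = {"type": getattr(hint, "__name__", str(hint))}
    return described


def add(a: int, b: float) -> str:
    """Toy function used only for this illustration."""
    return str(a + b)


print(json.dumps(describe_params(add), indent=2))
```

This is the kind of structure that ends up embedded in the system prompt, so the LLM knows which arguments each tool expects.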

Now that we have the decorator, we can create the currency exchange function fairly easily.

@tool()
def convert_currency(amount: float, from_currency: str, to_currency: str) -> str:
    """Converts currency using latest exchange rates.
    
    Parameters:
        - amount: Amount to convert
        - from_currency: Source currency code (e.g., USD)
        - to_currency: Target currency code (e.g., EUR)
    """
    try:
        url = f"https://open.er-api.com/v6/latest/{from_currency.upper()}"
        with urllib.request.urlopen(url) as response:
            data = json.loads(response.read())

        if "rates" not in data:
            return "Error: Could not fetch exchange rates"

        rate = data["rates"].get(to_currency.upper())
        if not rate:
            return f"Error: No rate found for {to_currency}"

        converted = amount * rate
        return f"{amount} {from_currency.upper()} = {converted:.2f} {to_currency.upper()}"

    except Exception as e:
        return f"Error converting currency: {str(e)}"

Next we need to build the Agent class. This will be the class in charge of answering user queries, and it will have access to a list of tools.

class Agent:
    def __init__(self):
        """Initialize Agent with empty tool registry."""
        self.client = OpenAI(
              base_url="https://models.inference.ai.azure.com",
              api_key=os.environ["GITHUB_API_TOKEN"],
        )
        self.tools: Dict[str, Tool] = {}
    
    def add_tool(self, tool: Tool) -> None:
        """Register a new tool with the agent."""
        self.tools[tool.name] = tool
    
    def get_available_tools(self) -> List[str]:
        """Get list of available tool descriptions."""
        return [f"{tool.name}: {tool.description}" for tool in self.tools.values()]
    
    def use_tool(self, tool_name: str, **kwargs: Any) -> str:
        """Execute a specific tool with given arguments."""
        if tool_name not in self.tools:
            raise ValueError(f"Tool '{tool_name}' not found. Available tools: {list(self.tools.keys())}")
        
        tool = self.tools[tool_name]
        return tool.func(**kwargs)

    def create_system_prompt(self) -> str:
        """Create the system prompt for the LLM with available tools."""
        tools_json = {
            "role": "AI Assistant",
            "capabilities": [
                "Using provided tools to help users when necessary",
                "Responding directly without tools for questions that don't require tool usage",
                "Planning efficient tool usage sequences"
            ],
            "instructions": [
                "Use tools only when they are necessary for the task",
                "If a query can be answered directly, respond with a simple message instead of using tools",
                "When tools are needed, plan their usage efficiently to minimize tool calls"
            ],
            "tools": [
                {
                    "name": tool.name,
                    "description": tool.description,
                    "parameters": {
                        name: {
                            "type": info["type"],
                            "description": info["description"]
                        }
                        for name, info in tool.parameters.items()
                    }
                }
                for tool in self.tools.values()
            ],
            "response_format": {
                "type": "json",
                "schema": {
                    "requires_tools": {
                        "type": "boolean",
                        "description": "whether tools are needed for this query"
                    },
                    "direct_response": {
                        "type": "string",
                        "description": "response when no tools are needed",
                        "optional": True
                    },
                    "thought": {
                        "type": "string", 
                        "description": "reasoning about how to solve the task (when tools are needed)",
                        "optional": True
                    },
                    "plan": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "steps to solve the task (when tools are needed)",
                        "optional": True
                    },
                    "tool_calls": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "tool": {
                                    "type": "string",
                                    "description": "name of the tool"
                                },
                                "args": {
                                    "type": "object",
                                    "description": "parameters for the tool"
                                }
                            }
                        },
                        "description": "tools to call in sequence (when tools are needed)",
                        "optional": True
                    }
                },
                "examples": [
                    {
                        "query": "Convert 100 USD to EUR",
                        "response": {
                            "requires_tools": True,
                            "thought": "I need to use the currency conversion tool to convert USD to EUR",
                            "plan": [
                                "Use convert_currency tool to convert 100 USD to EUR",
                                "Return the conversion result"
                            ],
                            "tool_calls": [
                                {
                                    "tool": "convert_currency",
                                    "args": {
                                        "amount": 100,
                                        "from_currency": "USD", 
                                        "to_currency": "EUR"
                                    }
                                }
                            ]
                        }
                    },
                    {
                        "query": "What's 500 Japanese Yen in British Pounds?",
                        "response": {
                            "requires_tools": True,
                            "thought": "I need to convert JPY to GBP using the currency converter",
                            "plan": [
                                "Use convert_currency tool to convert 500 JPY to GBP",
                                "Return the conversion result"
                            ],
                            "tool_calls": [
                                {
                                    "tool": "convert_currency",
                                    "args": {
                                        "amount": 500,
                                        "from_currency": "JPY",
                                        "to_currency": "GBP"
                                    }
                                }
                            ]
                        }
                    },
                    {
                        "query": "What currency does Japan use?",
                        "response": {
                            "requires_tools": False,
                            "direct_response": "Japan uses the Japanese Yen (JPY) as its official currency. This is common knowledge that doesn't require using the currency conversion tool."
                        }
                    }
                ]
            }
        }
        
        return f"""You are an AI assistant that helps users by providing direct answers or using tools when necessary.
Configuration, instructions, and available tools are provided in JSON format below:

{json.dumps(tools_json, indent=2)}

Always respond with a JSON object following the response_format schema above. 
Remember to use tools only when they are actually needed for the task."""

    def plan(self, user_query: str) -> Dict:
        """Use LLM to create a plan for tool usage."""
        messages = [
            {"role": "system", "content": self.create_system_prompt()},
            {"role": "user", "content": user_query}
        ]
        
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            temperature=0
        )
        
        try:
            return json.loads(response.choices[0].message.content)
        except json.JSONDecodeError:
            raise ValueError("Failed to parse LLM response as JSON")

    def execute(self, user_query: str) -> str:
        """Execute the full pipeline: plan and execute tools."""
        try:
            plan = self.plan(user_query)
            
            if not plan.get("requires_tools", True):
                return plan["direct_response"]
            
            # Execute each tool in sequence
            results = []
            for tool_call in plan["tool_calls"]:
                tool_name = tool_call["tool"]
                tool_args = tool_call["args"]
                result = self.use_tool(tool_name, **tool_args)
                results.append(result)
            
            # Combine results
            return f"""Thought: {plan['thought']}
Plan: {'. '.join(plan['plan'])}
Results: {'. '.join(results)}"""
            
        except Exception as e:
            return f"Error executing plan: {str(e)}"

Most of the code in the Agent is the system prompt, and how to add the available tools as part of the prompt.

Notice how the Agent also has a plan step in which it decides if any tool has to be used, then all the required tools are executed sequentially.

Also, for the sake of simplicity, this Agent doesn't run a ReAct (Reason, Act) loop, meaning all tool calls are decided at the planning step and there is no chance of reevaluating the plan based on the tools' output. For example, if one tool returns an error, we won't have a way to adapt to that error.
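
For contrast, here is a rough sketch of what a ReAct-style loop could look like. It is not part of the Agent above: `react_loop`, `fake_decide` and `convert` are hypothetical names, and `fake_decide` is a hard-coded stand-in for the LLM call, so the example runs without any API access. The point is only that every tool result (including errors) is fed back before the next decision is made.

```python
from typing import Callable, Dict, List


def react_loop(
    decide: Callable[[List[str]], Dict],
    tools: Dict[str, Callable[..., str]],
    max_steps: int = 5,
) -> str:
    """Alternate between deciding and acting until a final answer is produced."""
    observations: List[str] = []
    for _ in range(max_steps):
        step = decide(observations)  # "Reason": choose the next action
        if "final_answer" in step:
            return step["final_answer"]
        try:
            result = tools[step["tool"]](**step["args"])  # "Act": run the tool
        except Exception as exc:
            # Errors are fed back too, so the next decision can adapt to them
            result = f"Error: {exc}"
        observations.append(result)
    return "Stopped: step limit reached"


def fake_decide(observations: List[str]) -> Dict:
    """Stand-in for the LLM: retries with a valid currency code after an error."""
    if not observations:
        return {"tool": "convert", "args": {"code": "XXX"}}
    if observations[-1].startswith("Error"):
        return {"tool": "convert", "args": {"code": "JPY"}}
    return {"final_answer": observations[-1]}


def convert(code: str) -> str:
    """Toy tool that only knows one currency."""
    if code != "JPY":
        raise ValueError(f"unknown currency {code}")
    return "converted to JPY"


print(react_loop(fake_decide, {"convert": convert}))
```

In a real agent, `decide` would send the conversation so far (plus the observations) to the model and parse its JSON answer, much like the `plan` method above does once.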

Now if we want to build an app with an Agent that can do currency conversion we just have to do this.

agent = Agent()
agent.add_tool(convert_currency)

query = "I am traveling to Japan from Serbia, I have 1500 of local currency, how much of Japanese currency will I be able to get?"

print(f"\nQuery: {query}")
result = agent.execute(query)
print(result)

And this is the output:

Magic!

Pretty neat right?

SmolAgents implementation

With smolagents, all of the class definitions are managed by the library. For convenience I will use HfApiModel (which uses the Hugging Face Inference service), but you can use OpenAI as well.

import json
import os
from typing import Optional
import urllib.request  # "import urllib" alone does not load the request submodule

from smolagents.agents import ToolCallingAgent
from smolagents import tool, HfApiModel, LiteLLMModel

GITHUB_API_TOKEN = os.environ["GITHUB_API_TOKEN"]


'''
# NOTE: I don't have a personal OpenAI account, and the Azure Inference API doesn't
# have access to tool-calling models, so I will use the HF Inference API model for this example.

model = LiteLLMModel(
    model_id="gpt-4o",
    api_base="https://models.inference.ai.azure.com",
    api_key=os.environ["GITHUB_API_TOKEN"],
)
'''
model = HfApiModel("Qwen/Qwen2.5-72B-Instruct")

@tool
def convert_currency(amount: float, from_currency: str, to_currency: str) -> str:
    """
    Converts currency using latest exchange rates.
    
    Args:
        amount: Amount to convert
        from_currency: Source currency code (e.g., USD)
        to_currency: Target currency code (e.g., EUR)
    """
    try:
        url = f"https://open.er-api.com/v6/latest/{from_currency.upper()}"
        with urllib.request.urlopen(url) as response:
            data = json.loads(response.read())

        if "rates" not in data:
            return "Error: Could not fetch exchange rates"

        rate = data["rates"].get(to_currency.upper())
        if not rate:
            return f"Error: No rate found for {to_currency}"

        converted = amount * rate
        return f"{amount} {from_currency.upper()} = {converted:.2f} {to_currency.upper()}"

    except Exception as e:
        return f"Error converting currency: {str(e)}"

agent = ToolCallingAgent(tools=[convert_currency], model=model)
query = "I am traveling to Japan from Serbia, I have 1500 of local currency, how much of Japanese currency will I be able to get?"

result = agent.run(query)
print(result)

As you can see, now we only need to define the custom tool we require. This is usually all we will need in production, as the business logic is typically enclosed in this kind of custom function.

Running this version we get the following output.

More magic!

Running the smolagents model takes significantly longer (even when using the OpenAI model). It seems like some processing is happening even before the inference is called.

However, the benefits are tremendous. As you can see, the results are nicer, since smolagents adds logging by default. And we can modify the behaviour in many ways (such as adding planning) with just one or two keywords.

I am very excited about this library, and can't recommend checking out their examples enough.

One day project: Wikiloc exporter

Manuel Garrido — Tue, 12 Nov 2024 14:36:41 GMT

Last weekend, I had some time to kill, and I thought I would tackle a minor pet peeve of mine.

My wife and I are trying to go on a hike at least once a month, and we like to try different trails around our area. In Spain, the dominant site for trail information is wikiloc. It's an awesome site, with hundreds of trails ranging from beginner trails to very hard trails.

However, the site also uses a freemium scheme that doesn't resonate well with me. See, you can see the trail on your browser for free, but for navigation functionalities, you have to pay X a month, which is a bit steep for a casual user like me. This means when you are walking the trail, you have to use your orientation skills to figure out where the hell you are on the map, or risk getting lost, which has happened to us more than once.

I thought, wouldn't it be great if I could extract the trail information into a map service with proper geolocation? How about MapHub, one of the simplest free map providers that exist?

Turns out it was easier than expected, and here is the code if you want to use it yourself.

Exploring the site

The first thing I did was explore wikiloc. It's a site chock-full of functionalities, most of them behind a paywall unfortunately.

Wikiloc Trail page

As we can see on the bottom left of the map, Wikiloc uses leaflet, an awesome OSS library for map building. I have used it a few times for personal projects (here is an example).

What that means is that somewhere on the javascript side, there is a geojson that is being used to generate the leaflet trail (as a polyline). Let's see how to do that.

Extracting the data

By inspecting the site's code, we can see that the waypoints (meaning, the individual markers we can see highlighting points of interest) are defined as JSON linked data inside the site.

json LD data for a particular waypoint

That is great, we can explore the site to see how to fetch these points. Chances are, there is a javascript variable somewhere that makes use of those points to display them on the leaflet map.

We can inspect the variables on the site inside the developer console. I always use this simple snippet to print the variables defined at the window scope level, because the website devs have probably used a reasonable, meaningful name to define the map variables.

```javascript
var variables = {};
for (var name in this) {
    variables[name] = this[name];
}
```

gimme all the vars

After running this snippet and printing the variable names, we find 2 objects that provide the information we need: mapData and trailMap.

bingo

mapData contains the waypoints with their coordinates, names, and so on. We will fetch that information and add it to our exported map.

trailMap contains the reference to leaflet itself (which is imported as a separate library). This means we can easily export the trail information (meaning, the trail course) as a geojson, since that's what Leaflet uses internally.

bingo 2

In particular, we can export the layers of the leaflet map with this simple JS snippet.

```javascript
var collection = {'type': 'FeatureCollection', 'features': []};
trailMap.eachLayer(function (layer) {
    if (typeof(layer.toGeoJSON) === 'function') {
        collection.features.push(layer.toGeoJSON());
    }
});
```

That snippet will export a variable with valid geojson representing the trail.

Exporting the data

Now that we know how to extract the data from the site, we just have to copy all the js code into a python script that will run the extraction for us.

Since we need to execute Javascript to extract the data, we will use a headless browser (playwright) to execute the js.

extracting the trail data using playwright
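
As a rough idea of what that extraction can look like in Python, here is a minimal sketch using Playwright's sync API. It is not the original script: the function names `extract_trail_geojson` and `fetch_trail` are made up for this example, and it assumes the page exposes `trailMap` as a global once it has loaded, as described above.

```python
# The JS snippet from above, wrapped in an arrow function so it returns the collection
EXPORT_LAYERS_JS = """
() => {
    const collection = {type: 'FeatureCollection', features: []};
    trailMap.eachLayer(function (layer) {
        if (typeof layer.toGeoJSON === 'function') {
            collection.features.push(layer.toGeoJSON());
        }
    });
    return collection;
}
"""


def extract_trail_geojson(page) -> dict:
    """Run the export snippet in an already loaded page and return the GeoJSON dict."""
    return page.evaluate(EXPORT_LAYERS_JS)


def fetch_trail(url: str) -> dict:
    """Open the trail page in a headless browser and export its Leaflet layers."""
    # Imported lazily so the helper above can be used without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Assumption: trailMap is defined on the window once the map is ready
        page.wait_for_function("typeof trailMap !== 'undefined'")
        geojson = extract_trail_geojson(page)
        browser.close()
    return geojson
```

Calling `fetch_trail` with a trail URL should then return a dictionary that can be serialized with `json.dumps` and sent to the map provider.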

Then we need to push the data into MapHub. This is super easy, as the only thing we need to do is create a MapHub account, make an API token, and use that token to submit HTTP requests to their API endpoint to create the map.

api request to create Maphub map from a geojson object

We can do some final aesthetic improvements too. For example, it's nice to have the initial waypoint of a trail colored in green, and we can do that through the point's properties: to change the color of the point we just have to set the property marker-color to a different hex color code.
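
Here is a minimal sketch of that tweak on a plain GeoJSON dictionary. The helper name `color_first_waypoint` and the sample data are made up for this example, and it assumes the first Point feature in the collection is the start of the trail, which may not hold for every export.

```python
import copy
from typing import Any, Dict


def color_first_waypoint(collection: Dict[str, Any], color: str = "#00ff00") -> Dict[str, Any]:
    """Return a copy of the collection with the first Point feature recolored."""
    result = copy.deepcopy(collection)
    for feature in result.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") == "Point":
            # GeoJSON allows "properties" to be null, so fall back to a new dict
            properties = feature.get("properties") or {}
            properties["marker-color"] = color
            feature["properties"] = properties
            break  # only the first waypoint is recolored
    return result


trail = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": {"type": "LineString", "coordinates": [[0, 0], [1, 1]]}, "properties": {}},
        {"type": "Feature", "geometry": {"type": "Point", "coordinates": [0, 0]}, "properties": {"name": "start"}},
        {"type": "Feature", "geometry": {"type": "Point", "coordinates": [1, 1]}, "properties": {"name": "end"}},
    ],
}
colored = color_first_waypoint(trail)
print(colored["features"][1]["properties"])
```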

And voila, here is the exported map in MapHub!

success!

Four lessons from managing a company wide Airflow plugin

Manuel Garrido — Tue, 23 Apr 2024 07:42:19 GMT

At my current company (Carpe Data, we are hiring!), one of my tasks is to maintain an internal Airflow plugin (common-airflow-utils) with Airflow-related utilities. The library is used by 5 different teams and powers a big part of our production workflows. It provides higher-level abstractions on top of the Airflow API, with the goals of standardizing Airflow practices and making DAG writing much easier, particularly for teams where Python expertise is lacking.

For example, for a regular dag owner (the person writing DAG code), instead of launching an emr job via an EmrCreateJobFlowOperator they can just call the common utilities function create_emr_job_flow(*args, **kwargs).

This library has been in production for a year, and there are a few things that I have learned from building and maintaining it:

1. Taskflow API is great, but not for internal library functions

When I joined Carpe, there was already some Airflow code dangling around. It wasn't great, but since my goal was to set up new Airflow libraries with (hopefully) better standards, I tried to keep existing code whenever possible.

This meant keeping some internal airflow plugin functions that followed the Airflow Taskflow API. In a nutshell, the Airflow Taskflow API is a new way (for Airflow 2.0.0 or higher) to define operators, using standard python functions instead of the class based operators that were used before Taskflow API was released.

At my previous company, we were using Airflow < 2.0.0, which meant I was not used to the Taskflow API. When I saw how easy it was to define operators using regular Python functions, I was hooked. So much easier to use! So elegant! It's just Python with a magical @task decorator!

So I released the first version of the common utils keeping some of the legacy code that used the taskflow API to build internal operators.

In retrospect, this wasn't a great idea.

A year has passed, and this library has grown, not only in the number of operators it contains, but also in the number of teams that have adopted it and the number of people contributing to it.

Recently, we have had some major issues with the library, and one of the main reasons is the choice of taskflow api for internal custom operators.

I will explain this with an example.

Let's assume we have a very simple DAG that performs the following steps:
- 1. Run a SQL query in Snowflake, via a library call to snowflake_operator
- 2. Decide, based on the output of that query, whether to run the next step
- 3. If the branch on 2) is true, then we want to print the output of the result of 1).

Here is what our snowflake_operator looks like inside the common-airflow-utils library. We use AWS ECS to run our operators inside containers (more on that later):

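from airflow.decorators import task
from airflow.utils.task_group import TaskGroup

# run_ecs_operator is another internal helper of the library (not shown here).


def snowflake_operator(
    task_id: str,
    sql_query: str,
    trigger_rule: str = "all_success",  # trigger rule applied to the command-building task
):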
    """
    Runs the ECS snowflake job
    """

    @task(multiple_outputs=True)
    def _setup_operator_args(
        sql_query,
    ):
        """
        Function that evaluates the lazy airflow parameters so they can be used as regular arguments downstream.
        """
        return {
            "sql_query": sql_query,
        }

    @task
    def _setup_ecs_command(
        sql_query,
    ):
        """
        Generates the full ECS docker command.
        """
        command = ["--sql_query", sql_query]
        return command

    with TaskGroup(group_id=task_id) as task_group:
        args = _setup_operator_args(
            sql_query=sql_query,
        )

        command = _setup_ecs_command.override(trigger_rule=trigger_rule)(
            sql_query=args['sql_query'],
        )

        run_ecs_operator.override(task_id='run_snowflake_query',
                                  )(
                                      task_id=task_id,
                                      container_name="snowflake",
                                      command=command,
                                  )

    return task_group

And here is what our DAG looks like when it uses the snowflake operator.


from airflow.decorators import task

from commons_airflow_utils.dag import DAG
from commons_airflow_utils.operators.snowflake.snowflake_secretsmanager import snowflake_operator

DAG_ID = "super_dag"

with DAG(
    dag_id=DAG_ID,
    doc_md=__doc__
) as dag:
    # this operator queries snowflake and returns randomly a 1 or a 2
    run_snowflake_1 = snowflake_operator(
        task_id='run_snowflake_1',
        sql_query="""
                WITH arr AS (SELECT array_construct(1, 2) arr),
                number_selection AS (SELECT arr[ABS(MOD(RANDOM(), array_size(arr)))] number FROM arr)
                SELECT number FROM number_selection;
        """
    )

    @task.branch
    def choose_run_next_step(number_selection):
        if number_selection == '2':
            return 'run_snowflake_2'
        else:
            return 'skip'

    @task
    def skip():
        print("SKIPPING")

    # snowflake python output has a weird format
    run_snowflake_2 = snowflake_operator(
        task_id='run_snowflake_2',
        sql_query="""
                SELECT {{ task_instance.xcom_pull(task_ids='run_snowflake_1')[0]['result'][0][0] }} + 1;
        """
    )

    @task
    def print_snowflake_2_result(result):
        print(result)

    run_snowflake_1 >> [skip(), run_snowflake_2] >> print_snowflake_2_result(run_snowflake_2)

The dag looks like this on airflow:

Very simple so far.

Now let's imagine that the business logic changes, and we decide we want to change the workflow a bit:

- 1. Run a SQL query in Snowflake, via a library call to snowflake_operator
- 2. Add one to the output of 1) (*product decision!*)
- 3. Decide, based on the output of that query, whether to run the next step
- 4. If the value checked in 3) is equal to `2`, print the output of the result of 2).

No problem, we just have to update our DAG to add the step:

    # this operator queries snowflake and returns randomly a 1 or a 2
    run_snowflake_1 = snowflake_operator(
        task_id='run_snowflake_1',
        sql_query="""
                WITH arr AS (SELECT array_construct(1, 2) arr),
                number_selection AS (SELECT arr[ABS(MOD(RANDOM(), array_size(arr)))] number FROM arr)
                SELECT number FROM number_selection;
        """
    )

    @task
    def add_one_to_snowflake_1_result(snowflake_result):
        print(snowflake_result)
        return snowflake_result[0]['result'][0][0] + 1

    snowflake_1_output_plus_one = add_one_to_snowflake_1_result(run_snowflake_1)

    @task.branch
    def choose_run_next_step(number_selection):
        if number_selection == '2':
            return 'run_snowflake_2'
        else:
            return 'skip'

    @task
    def skip():
        print("SKIPPING")



    @task
    def print_snowflake_result(result):
        print(result)


    run_snowflake_2 = snowflake_operator(
        task_id='run_snowflake_2',
        sql_query="""
                SELECT {{ task_instance.xcom_pull(task_ids='add_one_to_snowflake_1_result') }} + 1;
        """
    )

    print_snowflake_2_result = print_snowflake_result(run_snowflake_2)

    run_snowflake_1 >> snowflake_1_output_plus_one >> [skip(), run_snowflake_2] >> print_snowflake_2_result

We go to test the updated dag, and we get an error. PANIC!!

We see our dag failing, why?

Well, the error message is clear:

Sadness

The function snowflake_operator returns a task group, and task groups have no output because they are not tasks per se. Task groups are not lazily evaluated like tasks are, so you can't really use {{}} params or XComs with them (only with the tasks inside them).

No problem, you think, let's fix that. Instead of returning the task group, we can return the output of the last task inside it, since the operator essentially runs a Docker ECS command and its output is the only thing we care about. That way downstream tasks can easily interact with the output of the snowflake_operator.

Here is what the snowflake_operator looks like with that slight modification:

def snowflake_operator(
    task_id: str,
    sql_query: str,
    trigger_rule: str = "all_success",
):
    """
    Runs the ECS snowflake job
    """

    @task(multiple_outputs=True)
    def _setup_operator_args(
        sql_query,
    ):
        """
        Function that evaluates the lazy airflow parameters so they can be used as regular arguments downstream.
        """
        return {
            "sql_query": sql_query,
        }

    @task
    def _setup_ecs_command(
        sql_query,
    ):
        """
        Generates the full ECS docker command.
        """
        command = ["--sql_query", sql_query]
        return command

    with TaskGroup(group_id=task_id) as task_group:
        args = _setup_operator_args(
            sql_query=sql_query,
        )

        command = _setup_ecs_command.override(trigger_rule=trigger_rule)(
            sql_query=args['sql_query'],
        )

        ecs_output = run_ecs_operator.override(task_id='run_snowflake_query',
                                  )(
                                      task_id=task_id,
                                      container_name="snowflake",
                                      command=command,
                                  )

    return ecs_output

We run the dag with the new version of the snowflake operator and, with a few modifications, the dag works. Yaaay!

Wait a second, let's look at the dag structure:

Madness

The dependencies are all messed up! Now the dependencies are only pointing to the last step of the internal task group for snowflake_operator.

To make the dag work with this new operator change we have to change the branch operator to look like this:

    @task.branch
    def choose_run_next_step(number_selection):
        if number_selection == '2':
            return 'run_snowflake_2.run_snowflake_query'
        else:
            return 'skip'

Since we now have to reference the internal step inside the task group as the next step in the branch operator, the internal API of the common-airflow-utils library is now used by everyone and essentially becomes part of the public API (if you change it, things will break).

All of this effort for what? The benefits of writing vanilla Python functions to build TaskFlow operators are gone, since we are adding a lot of complexity just to manage the fact that task groups are not tasks, and thus are not lazily evaluated and cannot interact with the Airflow context by themselves.

Let's see now the snowflake_operator following the standard class based method to develop custom operators:

from airflow.models.baseoperator import BaseOperator
from common_airflow_utils.ecs import run_ecs_operator_function

class SnowflakeECsOperator(BaseOperator):
    """
    ECS Operator to run sql queries on Snowflake.
    """

    ui_color = "#abf0ff"
    template_fields = (
        "sql_query",

    )
    container_name = "snowflake"


    def __init__(
            self,
            sql_query: str,
            **kwargs
    ) -> None:
        super().__init__(**kwargs)
        self.sql_query = sql_query

    def _setup_ecs_command(self):
        command = ["--sql-query", self.sql_query]
        return command

    def execute(self, context):  # pylint: disable=unused-argument
        """
        Executes the ECS docker command and returns its output.
        """
        command = self._setup_ecs_command()

        # NOTE: task_version is not defined in this snippet; it is assumed to be
        # set on the operator elsewhere (for example as a class attribute).
        ecs_output = run_ecs_operator_function(
            task_id=self.task_id,
            container_name=self.container_name,
            task_version=self.task_version,
            command=command,
        )
        return ecs_output

def snowflake_operator(
    task_id: str,
    sql_query: str,
):
    """
    Runs the ECS snowflake job
    """
    return SnowflakeECsOperator(task_id=task_id, sql_query=sql_query)

This operator is a single Airflow task. This means branching and sequencing work out of the box, and the output can be easily accessed via the .output attribute. Since it inherits from BaseOperator, we can pass it all of the standard arguments that Airflow operators support (trigger rules, retries, hooks, and so on).
Since it's a single task instead of three, the scheduler has to track a third as many objects.

Here is what the dag graph looks like now that we are using class based operators:

Much simpler! And the thing is, from the point of view of the dag writer, they don't care about the internals of the snowflake operator, only that it receives a sql query and it runs it!

2.Airflow unit tests are hard, smoke tests are less hard

Unit tests are the first line of defense for software engineers; they check that all the individual parts of your codebase work as you expect.

Unit tests on Airflow are very tricky, for a couple of reasons.

First and foremost, Airflow dags and operators require an Airflow context to work. This means in production there is a separate process that takes care of triggering, computing, and state checking for the workflows.

If you read tutorials about Airflow testing, you can see that it's easy to test that the dags produce valid Airflow code (as seen on the official docs), or that the dags have the specifications you want (meaning that if you want task2 to run after task1, that is the case).

However, in practice most of the errors that happen with Airflow dags have nothing to do with those kinds of bugs. In my experience, errors usually happen because of dependencies between tasks. Those are hard to unit test on Airflow, and to be able to unit test them properly you are forced to modify the actual implementation of your dags (see an example here). Changing production code to make testing easier is a big no-no in my book (tests should adapt to production, not the other way around).

Another big set of issues has nothing to do with the code itself, but is related to the environment. Airflow, being a monolithic orchestrator, requires tons of secrets/variables/connections to work properly. Implementing all of that complexity in a pure unit test suite is hard, which is why I usually have very lightweight unit tests that help us avoid stupid mistakes (for example, if a dag is supposed to run daily, I want to make sure a dev doesn't remove the cron argument to work locally and then push to prod). But I mostly test dags via local runs (using the excellent AWS repo for local Airflow running).
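As a sketch of that kind of lightweight check, the snippet below statically scans DAG source code and reports every `DAG(...)` call that passes no schedule argument. It only uses the standard library `ast` module, so it needs no Airflow context at all. The helper name `find_dags_without_schedule`, the accepted keyword names and the sample source are assumptions for illustration, not part of our actual test suite.

```python
import ast
from typing import List

# Keyword names treated as "a schedule was set". Assumption: your DAGs use
# one of these (older Airflow versions use `schedule_interval`).
SCHEDULE_KEYWORDS = {"schedule", "schedule_interval"}


def find_dags_without_schedule(source: str) -> List[int]:
    """Return the line numbers of DAG(...) calls that pass no schedule keyword."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match both `DAG(...)` and `some_module.DAG(...)`.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "DAG":
            continue
        # Note: a call like DAG(**config) has no explicit keywords and is flagged too.
        keywords = {kw.arg for kw in node.keywords}
        if not keywords & SCHEDULE_KEYWORDS:
            missing.append(node.lineno)
    return sorted(missing)


SAMPLE = '''
with DAG(dag_id="daily_job", schedule="0 6 * * *") as dag:
    pass

with DAG(dag_id="forgot_schedule") as dag2:
    pass
'''

print(find_dags_without_schedule(SAMPLE))  # [5]
```

In a real test you would loop over the files in your dags folder and assert that the returned list is empty for each one.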

3.Airflow code is not standard Python code

Every framework is by definition based on a set of assumptions/modifications that make some things easier, at the cost of making other things harder, or plain impossible.

Airflow takes this to the next level, and every dev who deals with Airflow code has probably struggled to set up standard code quality tools in a way that does not raise false positives on DAG code.

At my current company, we use standard pre-commit plugins for code quality, plus Sonar for company wide static code analysis.

We have had to modify our pylint settings quite a bit in order to fit Airflow-specific quirks. One of the most common ones is how Airflow recommends setting up task dependencies via bitwise operators.

For example, if you want to define 2 tasks in a way that task2 starts after task1, the recommended way to implement that dependency is this:

task1 = DummyOperator(...)
task2 = DummyOperator(...)
task1 >> task2

Pylint will freak out at this with the very obvious error code pointless-statement: why would you do a bitwise operation if you are not going to assign the result to anything?
Sonar might also freak out depending on your company settings.

The way to silence these false alarms is to either disable the check project-wide (which means real issues won't be detected), or pepper your dag code with comments like these:

task1 = DummyOperator(...)
task2 = DummyOperator(...)
# this is sad
task1 >> task2  # pylint: disable=pointless-statement #NOSONAR
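To see why the linter complains and why the statement is not actually pointless, here is a toy sketch of the mechanism. It is not Airflow's implementation: `ToyTask` is a made-up class that overloads `>>` in the same general way, recording the dependency as a side effect and returning the right-hand task so that chains work.

```python
class ToyTask:
    """Minimal stand-in for an operator (hypothetical, not an Airflow class)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # The side effect is the whole point: register `other` as downstream.
        self.downstream.append(other)
        # Return the right-hand task so `a >> b >> c` chains left to right.
        return other


task1 = ToyTask("task1")
task2 = ToyTask("task2")
task3 = ToyTask("task3")

# A linter only sees an expression whose result is discarded,
# but the dependency graph has been mutated.
task1 >> task2 >> task3

print([t.task_id for t in task1.downstream])  # ['task2']
print([t.task_id for t in task2.downstream])  # ['task3']
```

Airflow operators also expose explicit methods such as set_downstream, which linters do not flag, at the cost of being less readable than the arrow syntax.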

Another one of Airflow's quirks that doesn't play well with vanilla code analysers is top-level imports. As stated in Airflow's Best Practices docs:

...if you have an import statement that takes a long time or the imported module itself executes code at the top-level, that can also impact the performance of the scheduler.

This goes against one of the most basic Python PEP8 guidelines (imports at the top of the file).

Pylint will rightly throw the error import-outside-toplevel (which, again, you can bypass with a pylint disable comment).
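A minimal sketch of what that recommendation looks like in practice: the import moves inside the function that needs it, so it only runs when the task executes and not every time the scheduler parses the file. The function name `summarize` and the use of the standard library `statistics` module as a stand-in for a heavy dependency are illustrative assumptions.

```python
def summarize(values):
    """Task-level callable; the import is deferred until the task actually runs."""
    # Stand-in for a slow or heavy import (think pandas or a cloud SDK).
    import statistics  # pylint: disable=import-outside-toplevel

    return {"mean": statistics.mean(values), "max": max(values)}


print(summarize([1, 2, 3, 6]))
```

The trade-off is that an import error now surfaces at task run time instead of at DAG parse time, so a local run of the dag is still worth doing.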

4.Containerized operators are great

My current job is the third one in which I am in charge of managing Airflow environments. This time, I knew I would do something different to avoid dependency conflicts.

One of the biggest caveats of Airflow, in my opinion, is that software dependencies (i.e. Python packages) are shared. This means that if an Airflow environment is used by multiple tenants, the environment needs to support the requirements of every single workflow of every single team using it. If one team requires requests<2.0.0 and another team requires requests>2.0.0, then both teams can't use the same environment.

Worse still, when updating the environment, no matter whether you are running Airflow on premises or using a service like AWS MWAA, the installation does not validate the compatibility between packages. A conflict between packages can bring the whole scheduler down, putting the admin in the uncomfortable position of knowing that for every minute the Airflow environment is broken, not a single job can run. I have been there a few times and trust me, it's quite stressful.

Fortunately, there is a very nice way to avoid dependency issues on Airflow: container operators!

What are container operators? Simply put, these are Airflow operators where the computation doesn't happen in the same environment as the Airflow scheduler or workers, but inside a container, whether a self-managed container (DockerOperator or KubernetesPodOperator) or a cloud-based container service (EcsRunTaskOperator for AWS or CloudRunExecuteJobOperator for Google Cloud).

Since these containers run Docker images, they can be tagged, versioned, and isolated. Even better, they enable Airflow to run workflows in any language! The best part, though, is that their compute requirements are isolated, meaning heavy tasks can't bring down the worker. We had a task at my current company that ran only occasionally but required a lot of memory, and it would sometimes bring the worker down. Moving it to ECS allowed us to define specific memory requirements for the task.

That's it, I hope you liked the article!

Things I learned migrating from Digital Ocean to AWS Lightsail

Manuel Garrido — Sat, 17 Jun 2023 15:14:45 GMT

It's been 9 years since I spun up my Digital Ocean instance. It has been hosting my personal site, blog, random personal projects, and even faced one or two Hacker News front page traffic spikes.

Why is it named dokku, you wonder, oh young reader? Well, back before kubernetes was a thing, and right when docker was starting to become popular, a tool came up that promised an easy way to maintain your own Platform as a Service: dokku. When I set up my machine I planned to use dokku for every project. Reality showed me that deploying to dokku was just too much effort when I owned the whole infrastructure (a single instance).

In practice, that meant my tiny Digital Ocean instance became a dumpster fire, full of random folders whose purpose I no longer knew, different database engines installed (with their daemons still kicking) and random broken Python virtual environments.

Additionally, since my domain ended up becoming semi popular for a while, it was somehow added to bot lists of domains. So it has been constantly under attack from bots, bringing down some of the services I personally use. Due to my incompetence when I set up the instance, things like https/cloudflare were not an option.

So my plan was to migrate to a new instance. In particular, I wanted to use AWS Lightsail, mostly because that's the cloud platform I'm most comfortable with and it opened the door to potentially more complex projects.

It's been a while since I had time to nerd out and work on some side projects (having kids is the ultimate time sink). But after postponing the migration for a while, I finally killed my trusty DO instance today.

Here are some random thoughts I wrote while I was painstakingly migrating the services and, hopefully, setting them up in a more structured way.

HTTPS is hard! Not extremely hard, but I could see how some less technical folks have a hard time getting it set up. Let's Encrypt seems to be the only free way to spin up certificates currently, and there are a few magical commands that need to be run in order to set up certificates the right way. Related to this:

- You need to open port 443 for HTTPS on Lightsail. I found no mention of this when googling 'lightsail setup https'. I understand most people that use Lightsail just want a prepackaged wordpress, but that is not always the case.

- certbot's automatic default nginx settings do not work if you use a custom subdomain (my blog is hosted at blog.manugarri.com). certbot is awesome nonetheless.

I use Ghost as my blogging platform. Ghost is tremendously unhelpful when you want to migrate content. I tried importing the content from the old blog and I just got the message:

"Please install Ghost 1.0, import the file and then update your blog to the latest Ghost version.\nVisit https://ghost.org/docs/update/ or ask for help in our https://forum.ghost.org."IncorrectUsageError: Detected unsupported file structure.

which I understand, but I fail to see how hard it would be to keep backwards compatibility for what is basically a simple JSON structure like this:

{
  "id": 2,
  "uuid": "de30db4d-fdde-48b8-8548-fd3c9804cfb0",
  "title": "How to easily set up Subdomain routing in Nginx",
  "slug": "how-to-easily-set-up-subdomain-routing-in-nginx",
  "markdown": "ARTICLE MARKDOWN HERE",
  "image": null,
  "featured": 0,
  "page": 0,
  "status": "published",
  "language": "en_US",
  "meta_title": null,
  "meta_description": null,
  "author_id": 1,
  "created_at": "2014-09-30 00:42:27",
  "created_by": 1,
  "updated_at": "2014-09-30 03:21:56",
  "updated_by": 1,
  "published_at": "2014-09-30 00:42:27",
  "published_by": 1,
  "visibility": "public",
  "mobiledoc": null,
  "amp": null
}

Like, which killer feature was so extremely awesome yet so critically different that it forced people to jump through the hoops of spinning up an old Ghost instance, fighting with the updates, dumping, and then migrating? The content is the same, for fucks sake.

I had to build my own shitty script to take a valid (empty) dump from my new blog instance, then make the old dump compatible by adding the missing fields:

import json
from itertools import islice


def batched(iterable, chunk_size):
    """Yield successive tuples of at most chunk_size items."""
    iterator = iter(iterable)
    while chunk := tuple(islice(iterator, chunk_size)):
        yield chunk


valid_file = "manugarris-blog.ghost.2023-05-27-15-13-43.pretty.json"
posts_file = "manuel-garridos-blog.ghost.2023-05-27.pretty.json"

with open(valid_file) as fname:
    valid_file_data = json.load(fname)

with open(posts_file) as fname:
    valid_posts_data = json.load(fname)

posts = valid_posts_data["db"][0]["data"]["posts"]

n_posts_per_batch = 10

# enumerate gives each batch its own index, so every batch is written to its own file
for i, posts_batch in enumerate(batched(posts, n_posts_per_batch)):
    posts_authors = [
          {
            "id": post["id"],
            "post_id": post["id"],
            "author_id": "1",
            "sort_order": 0
          }
          for post in posts_batch
    ]

    valid_file_data["db"][0]["data"]["posts"] = posts_batch
    valid_file_data["db"][0]["data"]["posts_authors"] = posts_authors
    with open(f"batch_dump.{i}.json", "w") as fname:
        print(f"batch_dump.{i}.json")
        json.dump(valid_file_data, fname)

Anyway, the migration took 3 Saturdays, so it wasn't the end of the world. I'm amazed that I have been able to run so many side projects/sites/blogs on a 5 USD/month instance for 9 years without updating it. Having my own machine allowed me to grow significantly as an engineer, and the cost was totally worth it.

NOTES: Setting up git after a fresh install.

Manuel Garrido — Mon, 30 May 2022 20:07:45 GMT

Recently I did a fresh install of Ubuntu 20.04 via WSL2 (which I don't love yet, but it's growing on me), and I had to do the following steps to set git up:

1.Install git (duh!)

I'm just putting it here for completeness' sake.

sudo apt-get update  
sudo apt-get install git

2. Add ssh keys

You have to add your desired keys to your ssh agent (I found this on Stack Overflow and many other places).

eval $(ssh-agent)  
ssh-add

These commands start an ssh agent for the current shell session and add your default ssh keys to it, so you won't be asked for the passphrase every time you clone a repository via git during that session.

3.Update git config.

Due to recent updates to GitHub's git protocol support (in effect as of January 11, 2022), it is not enough to add ssh keys (RSA and DSA are deprecated); you also have to change your local git configuration (nice explanation on Stack Overflow):

git config --global url."git@github.com:".insteadOf git://github.com/

Making a simple, better weather and traffic conditions map for Spain's roads

Manuel Garrido — Sat, 23 Jan 2021 10:42:29 GMT

TL;DR

I made a map displaying weather and traffic road conditions for Spain that is easier to use, nicer and faster than the official Spanish Government map.
As usual (1, 2), I keep being disappointed with the Spanish Government Open Data policies.

You can check out the map here. I also shared the required code on Github.

An introduction, and a bit of ranting.

It was January 2021, and I was spending the holidays with my family in my awesome hometown of Murcia, Spain.

For you future travelers, this year was the COVID pandemic year, so things were a bit weird and traveling wasn't as easy as you whippersnappers are used to. Me and my family (two kids at the moment of writing this) would be driving back from Murcia to Lisbon, Portugal, where I reside.

Additionally, this year saw record breaking snowstorms in Spain thanks to Storm Filomena.

These two reasons meant that going back home to Lisbon from my hometown required planning, since there was a real risk of getting stuck in the car with 2 crying babies (omg I'm shivering just thinking about it).

So a couple days before the travel day, I checked online to see any information on the roads.

The best resource I could find (and if there is a better resource, the fact that it can't be easily found defeats its purpose) was the official Spain Traffic Authority (DGT, Direccion General de Trafico) map. You can check it here.

Here is a screenshot of what the map looks like:

There are a few things that trigger me when I see this map:

Slow, this map seems to be an embedded map from an internal GIS system or something, plus it has a ton of features that make it pretty slow.
Confusing, there is no legend regarding event types.
Overall ugliness, you can tell icons just jam up next to each other, and they are mostly gray, with a tiny hint of color indicating the road circulation level.

But the worst thing of all: there is no navigation search! This means the official traffic map forces you to know the actual code of the road you are planning to drive on in order to see if there are any weather events affecting the road's state. I'm not a truck driver! I don't know these names!

What I needed at that uncertain time was to be able to find out if driving from point A to Point B would go through any road that was blocked for any reason. The only way to do so with the official map was to search in google maps for driving directions and then check the DGT map for any road event.

Building a better map

Here is my version (link)

You may notice a few differences between my map and DGT map:

fast, my map consists of a simple map, so the only loading consists of the map tiles themselves
easy to read, my map has an actual legend that indicates what each icon means. This is Dataviz 101
pretty, this is more of a personal opinion, but using brighter colors for the event icons makes the map more appealing, and visual appeal increases user engagement.

And most important of all, you can see the road conditions for the trip you are planning! Just type the origin and destination and click the button, and the map will plot the route.

Application Details

You can see the code powering the map on Github.

My initial idea was to implement the map as a 100% frontend solution, since that would keep the site from exploding if it becomes too popular, but due to CORS limitations I had to implement a simple backend to fetch the DGT road condition events.

The backend app is a FastAPI web application in charge of rendering the index page, fetching the road condition events, and geocoding the navigation search terms. It is the first time I have used FastAPI, and it's super easy to use and faster than other similar microframeworks (like Flask).

The map itself is a simple Leaflet map with overlaid events. When the user loads the map, an HTTP GET request fetches the official DGT road conditions data from this url, the same one the official map uses (you can check using the developer network tools on your browser).

I used httpx to perform the GET request, for no reason besides testing it; it's supposed to be the next Requests.

Leaflet provides basic icons out of the box, but since I wanted to display a few different event types, I used erikflower's awesome Weather icons. These icons not only are beautiful (particularly compared to DGT's ugly ones), but also render very fast since they are not bitmaps.

I found it a bit complicated how to add custom icons, but this doc explains it nicely.

Finally, I used OpenRouteService as a geocoding package to translate the navigation search terms into geocoordinates. It doesn't work as well as Google Maps, but it's open source and Google Maps has turned a bit evil in recent times. OpenRouteService has a nice Python package.

Notes

Leaflet keeps getting better and better! Now it's super easy to add custom tile providers. There are even a ton of tile providers now thanks to the awesome leaflet-providers project. For example, here is what the map looks like with a different tile provider (Stadia):

Again and again, I come up with a nice frontend project idea that I have to implement with a backend just because of CORS. There should be a way to disable CORS for well spirited applications. Maybe prompting the user to disable CORS for a site?

Airflow UI: How to trigger a DAG with custom parameters?

Manuel Garrido — Tue, 28 Jul 2020 09:33:07 GMT

Airflow is one of the most widely used schedulers currently in the tech industry. Initially developed at Airbnb, a few years ago it became an Apache Foundation project, quickly becoming one of the foundation's top projects.

It is a direct competitor of other schedulers such as Spotify's Luigi or newer solutions such as DigDag or Prefect (created by core Airflow developers, I'm keeping this one on my list for future projects when it matures a bit).

At my current company, Daltix, we are moving away from an older tool, Jenkins, a CI/CD tool we hacked so it can act as a job scheduler, to Airflow. The improvements we gained by using an actual job scheduler are great (dag visualization, dynamic dag setup, specific task triggering among others),

BUT

There is a feature that Jenkins has that most schedulers do not. I will explain with an example.

Let's say I have a DAG (we can call it a job) that performs some SQL queries to generate a Persistent Derived Table (PDT) for a customer.

This job will be a templated job, meaning that in order to run it we need to specify which customer database (as a parameter customer_code for example) to run it for. We can do so easily by passing configuration parameters when we trigger the airflow DAG.

Here is what the Airflow DAG (named navigator_pdt_supplier in this example) would look like:

So basically we have a first step where we parse the configuration parameters, then we run the actual PDT, and if something goes wrong, we get a Slack notification.

The first step, parse_job_args_task is a simple PythonOperator that parses the configuration parameter customer_code provided in the DAG run configuration (a DAG run is a specific trigger of the DAG):

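from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10 import path

dag = DAG(
    dag_id="navigator_pdt_supplier",
    tags=["data_augmentation"],
    schedule_interval=None,
)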

dag.trigger_arguments = {"customer_code": "string"} # these are the arguments we would like to trigger manually

def parse_job_args_fn(**kwargs):  
    dag_run_conf = kwargs["dag_run"].conf #  here we get the parameters we specify when triggering
    kwargs["ti"].xcom_push(key="customer_code", value=dag_run_conf["customer_code"]) # push it as an airflow xcom

parse_job_args_task = PythonOperator(  
    task_id="parse_job_args_task",
    python_callable=parse_job_args_fn,
    provide_context=True,
    dag=dag
)

After this step, we can reference the customer_code parameter in the PDT just by doing (this is an example):

run_pdt = SQLOperator(
    # plain string (no f prefix), so the Jinja braces reach Airflow untouched
    query="USE DATABASE {{ task_instance.xcom_pull(key='customer_code') }}",
)
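One detail worth checking whenever templated strings like this are built in Python: inside an f-string, doubled braces are an escape for a single literal brace, so the Jinja delimiters do not survive. The short demo below uses only plain strings (no Airflow involved) to show the difference.

```python
# Plain string: the Jinja delimiters are kept as-is for Airflow to render later.
plain = "USE DATABASE {{ task_instance.xcom_pull(key='customer_code') }}"

# f-string: '{{' and '}}' are escapes, so Python collapses them to '{' and '}'.
formatted = f"USE DATABASE {{ task_instance.xcom_pull(key='customer_code') }}"

print(plain)      # USE DATABASE {{ task_instance.xcom_pull(key='customer_code') }}
print(formatted)  # USE DATABASE { task_instance.xcom_pull(key='customer_code') }
```

If you really need both Python interpolation and a Jinja expression in the same string, the braces have to be quadrupled inside the f-string.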

Great! Only question though, how do we actually run this DAG? We can't run it on a cron basis, since we need to provide additional parameters to the DAG when we trigger it. We can't trigger it manually via the trigger dag UI button either.

We can trigger it via Airflow's API, with a simple call like this:

import requests

AIRFLOW_API_ENDPOINT = "http://.....//api/experimental"  
DAG_ID = "navigator_pdt_supplier" # dag to trigger

# these are the custom parameters
parameters = {"customer_code": "ACME"}

result = requests.post(f"{AIRFLOW_API_ENDPOINT}/dags/{DAG_ID}/dag_runs", json={"conf": parameters})

This is great, but it not only requires an additional security step (opening up the Airflow API), it also restricts Airflow usage to technical people who know how to make API calls.

Here comes Jenkins' killer feature: you can define parameters using a simple interface when triggering a job!

This is a feature that is not available on Airflow. Which brings us to the meat of this post:

How to add a "custom trigger" option on Airflow:

Airflow's interface and functionality can be expanded by the use of plugins. We can update or create Operators easily, and we can also create web views to add additional features.

Plugins need to be saved on the Airflow plugins folder, usually $AIRFLOW_HOME/plugins

Airflow UI can be run using 2 different Flask-based packages. By default it uses Flask-Admin to render the UI; however, if the new Role Based Access Control (RBAC) flag is enabled, Airflow uses Flask-AppBuilder to manage the UI.

We can create a plugin called trigger_view.py and save it in the Airflow plugins directory with the following contents:

from airflow.api.common.experimental.trigger_dag import trigger_dag  
from airflow import configuration as conf  
from airflow.plugins_manager import AirflowPlugin  
from airflow.models import DagBag  
from flask import render_template_string, request, Markup  
from airflow.utils import timezone


trigger_template = """  
  
  
    Home
      {% if messages %}
        
        {% for message in messages %}
          {{ message }}
        {% endfor %}
        
      {% endif %}
    Manual Trigger
    
       
          Select a dag:
          
          
              {%- for dag_id, dag_arguments in dag_data.items() %}
                  
                    {% if dag_arguments %}
                        Arguments to trigger dag {{dag_id}}:

                    {% endif %}
                    {% for dag_argument_name, _ in dag_arguments.items() %}
                        

                    {% endfor %}
                  
              {%- endfor %}
          
          

          
        {% if csrf_token %}
            
        {% endif %}
       
    
  
  
"""


def trigger(dag_id, trigger_dag_conf):  
    """Function that triggers the dag with the custom conf"""
    execution_date = timezone.utcnow()

    dagrun_job = {
        "dag_id": dag_id,
        "run_id": f"manual__{execution_date.isoformat()}",
        "execution_date": execution_date,
        "replace_microseconds": False,
        "conf": trigger_dag_conf
    }
    r = trigger_dag(**dagrun_job)
    return r


# if we dont have RBAC enabled, we setup a flask admin View
from flask_admin import BaseView, expose  
class FlaskAdminTriggerView(BaseView):  
    @expose("/", methods=["GET", "POST"])
    def list(self):
        if request.method == "POST":
            print(request.form)
            trigger_dag_id = request.form["dag"]
            trigger_dag_conf = {k.replace(trigger_dag_id, "").lstrip("-"): v for k, v in request.form.items() if k.startswith(trigger_dag_id)}
            dag_run = trigger(trigger_dag_id, trigger_dag_conf)
            messages = [f"Dag {trigger_dag_id} triggered with configuration: {trigger_dag_conf}"]
            dag_run_url = DAG_RUN_URL_TMPL.format(dag_id=dag_run.dag_id, run_id=dag_run.run_id)
            messages.append(Markup(f'<a href="{dag_run_url}">Dag Run url</a>'))
            dag_data = {dag.dag_id: getattr(dag, "trigger_arguments", {}) for dag in DagBag().dags.values()}
            return render_template_string(trigger_template, dag_data=dag_data, messages=messages)
        else:
            dag_data = {dag.dag_id: getattr(dag, "trigger_arguments", {}) for dag in DagBag().dags.values()}
            return render_template_string(trigger_template, dag_data=dag_data)
v = FlaskAdminTriggerView(category="Extra", name="Manual Trigger")



# If we have RBAC, airflow uses flask-appbuilder, if not it uses flask-admin
from flask_appbuilder import BaseView as AppBuilderBaseView, expose  
class AppBuilderTriggerView(AppBuilderBaseView):  
    @expose("/", methods=["GET", "POST"])
    def list(self):
        if request.method == "POST":
            print(request.form)
            trigger_dag_id = request.form["dag"]
            trigger_dag_conf = {k.replace(trigger_dag_id, "").lstrip("-"): v for k, v in request.form.items() if k.startswith(trigger_dag_id)}
            dag_run = trigger(trigger_dag_id, trigger_dag_conf)
            messages = [f"Dag {trigger_dag_id} triggered with configuration: {trigger_dag_conf}"]
            dag_run_url = DAG_RUN_URL_TMPL.format(dag_id=dag_run.dag_id, run_id=dag_run.run_id)
            messages.append(Markup(f'<a href="{dag_run_url}">Dag Run url</a>'))
            dag_data = {dag.dag_id: getattr(dag, "trigger_arguments", {}) for dag in DagBag().dags.values()}
            return render_template_string(trigger_template, dag_data=dag_data, messages=messages)
        else:
            dag_data = {dag.dag_id: getattr(dag, "trigger_arguments", {}) for dag in DagBag().dags.values()}
            return render_template_string(trigger_template, dag_data=dag_data)


v_appbuilder_view = AppBuilderTriggerView()  
v_appbuilder_package = {"name": "Manual Trigger",  
                        "category": "Extra",
                        "view": v_appbuilder_view}



# Defining the plugin class
class TriggerViewPlugin(AirflowPlugin):  
    name = "triggerview_plugin"
    admin_views = [v] # if we dont have RBAC we use this view and can comment the next line
    appbuilder_views = [v_appbuilder_package] # if we use RBAC we use this view and can comment the previous line

After setting up the plugin and restarting the Airflow UI, we get an additional menu link on the top bar. Clicking on it leads us to this glorious interface:

On this new menu we will be able to manually trigger a DAG, and if that DAG has the additional attribute trigger_arguments, the trigger menu will let us trigger it with those custom parameters!

After we select the customer_code parameter and click the trigger button, we get a confirmation message and a link to the specific dag run so we can monitor it.
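
To make the form handling concrete, here is a minimal sketch of how the view turns submitted form fields into the conf dictionary. It assumes that the form names each argument input "<dag_id>-<argument_name>" (which is what the dictionary comprehension in the view implies); extract_trigger_conf is a hypothetical helper name, and a plain dict stands in for request.form.

```python
def extract_trigger_conf(dag_id, form):
    """Collect the form fields that belong to dag_id, dropping the "<dag_id>-" prefix."""
    prefix = f"{dag_id}-"
    return {
        key[len(prefix):]: value
        for key, value in form.items()
        if key.startswith(prefix)
    }


# A plain dict stands in for Flask's request.form here.
submitted_form = {
    "dag": "navigator_pdt_supplier",
    "navigator_pdt_supplier-customer_code": "ACME",
}
trigger_conf = extract_trigger_conf(submitted_form["dag"], submitted_form)
print(trigger_conf)  # {'customer_code': 'ACME'}
```

Matching on the full "<dag_id>-" prefix, rather than on the DAG id alone, avoids picking up fields from another DAG whose id merely starts with the same characters.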

Neat, right? There are many ways to improve this simple plugin (adding an execution_date datepicker, or different UI forms depending on the argument type). I would love to hear how you would update it!

Airflow UI: How to trigger a DAG with custom parameters?

Manuel Garrido — Tue, 28 Jul 2020 09:33:07 GMT

BUT

There is a feature that Jenkins has that most schedulers do not. I will explain with an example.

Let's say I have a DAG (we can call it a job) that performs some SQL queries to generate a Persistent Derived Table (PDT) for a customer.

Here is what the Airflow DAG (named navigator_pdt_supplier in this example) would look like:

So basically we have a first step where we parse the configuration parameters, then we run the actual PDT, and if something goes wrong, we get a Slack notification.

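from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

dag = DAG(
    dag_id="navigator_pdt_supplier",
    tags=["data_augmentation"],
    schedule_interval=None,  # no schedule: this DAG only runs when triggered
)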

dag.trigger_arguments = {"customer_code": "string"} # these are the arguments we would like to trigger manually

def parse_job_args_fn(**kwargs):  
    dag_run_conf = kwargs["dag_run"].conf #  here we get the parameters we specify when triggering
    kwargs["ti"].xcom_push(key="customer_code", value=dag_run_conf["customer_code"]) # push it as an airflow xcom

parse_job_args_task = PythonOperator(  
    task_id="parse_job_args_task",
    python_callable=parse_job_args_fn,
    provide_context=True,
    dag=dag
)

After this step, we can reference the customer_code parameter in the PDT just by doing (this is an example):

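run_pdt = SQLOperator(  # SQLOperator stands in for whichever SQL operator you use
    # plain string, not an f-string: an f-string would collapse the Jinja double braces
    query="USE DATABASE {{ task_instance.xcom_pull(key='customer_code') }}",
)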
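
One detail worth checking when writing templated fields: Jinja needs the double curly braces to reach Airflow intact, and a Python f-string collapses them into single braces. The short sketch below only uses plain Python strings (no Airflow import) to show the difference; the variable names are illustrative.

```python
# An f-string treats "{{" and "}}" as escaped single braces.
as_fstring = f"USE DATABASE {{ task_instance.xcom_pull(key='customer_code') }}"

# A plain string keeps the double braces, so Jinja can render them later.
as_plain = "USE DATABASE {{ task_instance.xcom_pull(key='customer_code') }}"

print(as_fstring)  # USE DATABASE { task_instance.xcom_pull(key='customer_code') }
print(as_plain)    # USE DATABASE {{ task_instance.xcom_pull(key='customer_code') }}
```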

We can trigger it via Airflow's API, with a simple call like this:

import requests

AIRFLOW_API_ENDPOINT = "http://.....//api/experimental"  
DAG_ID = "navigator_pdt_supplier" # dag to trigger

# these are the custom parameters
parameters = {"customer_code": "ACME"}

result = requests.post(f"{AIRFLOW_API_ENDPOINT}/dags/{DAG_ID}/dag_runs", json={"conf": parameters})

This is great, but it not only requires an additional security step (opening up the Airflow API), it also restricts Airflow usage to technical people who know how to make API calls.

Here comes Jenkins' killer feature: you can define parameters using a simple interface when triggering a job!

This is a feature that is not available on Airflow. Which brings us to the meat of this post:

How to add a "custom trigger" option on Airflow:

Airflow's interface and functionality can be expanded by the use of plugins. We can update or create Operators easily, and we can also create web views to add additional features.

Plugins need to be saved in the Airflow plugins folder, usually $AIRFLOW_HOME/plugins.

We can create a plugin called trigger_view.py and save it in the Airflow plugins directory with the following contents:

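from airflow.api.common.experimental.trigger_dag import trigger_dag
from airflow import configuration as conf
from airflow.plugins_manager import AirflowPlugin
from airflow.models import DagBag
from flask import render_template_string, request, Markup
from airflow.utils import timezone

# Assumption: URL template for the page of a DAG run, used by the views below.
# The path is a placeholder; adjust it to the Airflow UI you are running.
DAG_RUN_URL_TMPL = "/admin/airflow/graph?dag_id={dag_id}&run_id={run_id}"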


trigger_template = """  
  
  
    Home
      {% if messages %}
        
        {% for message in messages %}
          {{ message }}
        {% endfor %}
        
      {% endif %}
    Manual Trigger
    
       
          Select a dag:
          
          
              {%- for dag_id, dag_arguments in dag_data.items() %}
                  
                    {% if dag_arguments %}
                        Arguments to trigger dag {{dag_id}}:

                    {% endif %}
                    {% for dag_argument_name, _ in dag_arguments.items() %}
                        

                    {% endfor %}
                  
              {%- endfor %}
          
          

          
        {% if csrf_token %}
            
        {% endif %}
       
    
  
  
"""


def trigger(dag_id, trigger_dag_conf):  
    """Function that triggers the dag with the custom conf"""
    execution_date = timezone.utcnow()

    dagrun_job = {
        "dag_id": dag_id,
        "run_id": f"manual__{execution_date.isoformat()}",
        "execution_date": execution_date,
        "replace_microseconds": False,
        "conf": trigger_dag_conf
    }
    r = trigger_dag(**dagrun_job)
    return r


# if we dont have RBAC enabled, we setup a flask admin View
from flask_admin import BaseView, expose  
class FlaskAdminTriggerView(BaseView):  
    @expose("/", methods=["GET", "POST"])
    def list(self):
        if request.method == "POST":
            print(request.form)
            trigger_dag_id = request.form["dag"]
            trigger_dag_conf = {k.replace(trigger_dag_id, "").lstrip("-"): v for k, v in request.form.items() if k.startswith(trigger_dag_id)}
            dag_run = trigger(trigger_dag_id, trigger_dag_conf)
            messages = [f"Dag {trigger_dag_id} triggered with configuration: {trigger_dag_conf}"]
            dag_run_url = DAG_RUN_URL_TMPL.format(dag_id=dag_run.dag_id, run_id=dag_run.run_id)
            messages.append(Markup(f'<a href="{dag_run_url}">Dag Run url</a>'))
            dag_data = {dag.dag_id: getattr(dag, "trigger_arguments", {}) for dag in DagBag().dags.values()}
            return render_template_string(trigger_template, dag_data=dag_data, messages=messages)
        else:
            dag_data = {dag.dag_id: getattr(dag, "trigger_arguments", {}) for dag in DagBag().dags.values()}
            return render_template_string(trigger_template, dag_data=dag_data)
v = FlaskAdminTriggerView(category="Extra", name="Manual Trigger")



# If we have RBAC, airflow uses flask-appbuilder, if not it uses flask-admin
from flask_appbuilder import BaseView as AppBuilderBaseView, expose  
class AppBuilderTriggerView(AppBuilderBaseView):  
    @expose("/", methods=["GET", "POST"])
    def list(self):
        if request.method == "POST":
            print(request.form)
            trigger_dag_id = request.form["dag"]
            trigger_dag_conf = {k.replace(trigger_dag_id, "").lstrip("-"): v for k, v in request.form.items() if k.startswith(trigger_dag_id)}
            dag_run = trigger(trigger_dag_id, trigger_dag_conf)
            messages = [f"Dag {trigger_dag_id} triggered with configuration: {trigger_dag_conf}"]
            dag_run_url = DAG_RUN_URL_TMPL.format(dag_id=dag_run.dag_id, run_id=dag_run.run_id)
            messages.append(Markup(f'<a href="{dag_run_url}">Dag Run url</a>'))
            dag_data = {dag.dag_id: getattr(dag, "trigger_arguments", {}) for dag in DagBag().dags.values()}
            return render_template_string(trigger_template, dag_data=dag_data, messages=messages)
        else:
            dag_data = {dag.dag_id: getattr(dag, "trigger_arguments", {}) for dag in DagBag().dags.values()}
            return render_template_string(trigger_template, dag_data=dag_data)


v_appbuilder_view = AppBuilderTriggerView()  
v_appbuilder_package = {"name": "Manual Trigger",  
                        "category": "Extra",
                        "view": v_appbuilder_view}



# Defining the plugin class
class TriggerViewPlugin(AirflowPlugin):  
    name = "triggerview_plugin"
    admin_views = [v] # if we dont have RBAC we use this view and can comment the next line
    appbuilder_views = [v_appbuilder_package] # if we use RBAC we use this view and can comment the previous line

After setting up the plugin and restarting the Airflow UI, we get an additional menu link on the top bar. Clicking on it leads us to this glorious interface:

After we select the customer_code parameter and click the trigger button, we get a confirmation message and a link to the specific dag run so we can monitor it.

Note to Self. Installing LightGBM in Ubuntu 18.04

Manuel Garrido — Fri, 13 Jul 2018 16:50:00 GMT

These are the steps I took to install Microsoft's cool Gradient Boosted Models library, LightGBM

Step 1. Install CUDA

I am not going to explain this step because it is easy to find.

Step 2. Install Boost

sudo apt-get install libboost-all-dev

Step 3. Clone LightGBM and build with GPU support enabled

git clone --recursive https://github.com/Microsoft/LightGBM && cd LightGBM  
export CXX=g++-7 CC=gcc-7  # replace 7 with version of gcc installed on your machine  
mkdir build && cd build  
cmake .. -DUSE_GPU=1  
make -j4

Step 4. Install python bindings

cd ..  
pip install setuptools numpy scipy scikit-learn -U  
cd python-package/  
python setup.py install --precompile

Now you just need to add the argument device="gpu" when creating your LightGBM model.

Note to self. Pyspark failing with "Error while instantiating ‘org.apache.spark.sql.hive.HiveSessionState’"

Manuel Garrido — Tue, 22 May 2018 10:12:48 GMT

If you run pyspark and see this error (it happens in scala-shell as well):

Error while instantiating ‘org.apache.spark.sql.hive.HiveSessionState’

The solution is easy, yet ridiculous.
1. Create the folder /tmp/hive
2. Open up its permissions: sudo chmod -R 777 /tmp/hive

Found here
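
The same two steps can also be scripted. Below is a minimal Python sketch; make_world_writable is a hypothetical helper name, and the script has to run as a user who is allowed to change the permissions (the shell version uses sudo for that).

```python
import os


def make_world_writable(path):
    """Create path if it is missing, then apply mode 777 to it and everything below it."""
    os.makedirs(path, exist_ok=True)
    for root, _dirs, files in os.walk(path):
        os.chmod(root, 0o777)
        for name in files:
            os.chmod(os.path.join(root, name), 0o777)


# Usage for the case in this post:
# make_world_writable("/tmp/hive")
```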

Note to self: Fixing encoding in Golang ascii85

Manuel Garrido — Thu, 19 Apr 2018 08:36:22 GMT

Yesterday I spent a few hours dealing with what I like to call "the edges of StackOverflow". By that I mean those situations in which you are trying to solve a programming problem (mostly a bug) and you have no idea why it's happening, and even worse, no amount of searching (on StackOverflow or GitHub) yields any information that seems even somewhat related to the issue.

I think this xkcd strip puts it quite clearly:

The issue in question was this. I am working on a project involving cookies. The standard procedure in programmatic media buying (i.e., online ads) is to encode the cookie data in ascii85 (or base85).

So I was implementing the encoding/decoding package using Golang's ascii85 package as follows:

package main

import (
    "encoding/ascii85"
    "encoding/json"
    "fmt"
)




type User struct {  
   Age int
   Interests []string
}

func decodeCookie(cookieValue string) string {  
    cookieEncodedBytes := []byte(cookieValue)
    cookieDecodedBytes := make([]byte, len(cookieEncodedBytes))
    nCookieDecodedBytes, _, _ := ascii85.Decode(cookieDecodedBytes, cookieEncodedBytes, true)
    cookieDecodedBytes = cookieDecodedBytes[:nCookieDecodedBytes]
    return string(cookieDecodedBytes)
}

func encodeCookie(cookieValue string) string {  
    cookieBytes := []byte(cookieValue)
    cookieEncodedb85Bytes := make([]byte, ascii85.MaxEncodedLen(len(cookieBytes)))
    _ = ascii85.Encode(cookieEncodedb85Bytes, cookieBytes)
    cookieEncodedString := string(cookieEncodedb85Bytes)
    return cookieEncodedString
}


func main() {  
    user := User{
          25, 
          []string{"music", "football"},
    }

    userJson, _ := json.Marshal(user) 
    fmt.Println("User as json", string(userJson))

    userB85Encoded := encodeCookie(string(userJson))
    fmt.Println("User as jsonB85", userB85Encoded)


    userB85Decoded := decodeCookie(userB85Encoded)
    fmt.Println("User as json", userB85Decoded)

    decodedUser := User{}
    err := json.Unmarshal([]byte(userB85Decoded), &decodedUser)
    if err != nil {
        fmt.Println("Error deserializing json bytes", err)
    }

   fmt.Println(fmt.Sprintf("Deserialized User:%v", decodedUser))
}

This code will print the following output :

User as json {"Age":25,"Interests":["music","football"]}  
User as jsonB85 HQkagAKj/j2(TqCDKKH1ATMs7,!&pPD09o6@j3HJAoDU0@UX(h,$fTs  
User as json {"Age":25,"Interests":["music","football"]}  
Error deserializing json bytes invalid character '\x00' after top-level value  
Deserialized User:{0 []}

So we see that, what we thought would be an easy encoding/decoding (easy encoding, HA!) implementation is failing for some reason. The error says:

Error deserializing json bytes invalid character '\x00' after top-level value

But where is that character? The character \x00 is the null byte, so when printed it does not show up in the output.

We can go further by checking the lengths of the original and decoded strings to see if there is a mismatch, by adding a few lines:

package main

import (
    "encoding/ascii85"
    "encoding/json"
    "fmt"
)




type User struct {  
   Age int
   Interests []string
}

func decodeCookie(cookieValue string) string {  
    cookieEncodedBytes := []byte(cookieValue)
    cookieDecodedBytes := make([]byte, len(cookieEncodedBytes))
    nCookieDecodedBytes, _, _ := ascii85.Decode(cookieDecodedBytes, cookieEncodedBytes, true)
    cookieDecodedBytes = cookieDecodedBytes[:nCookieDecodedBytes]
    return string(cookieDecodedBytes)
}

func encodeCookie(cookieValue string) string {  
    cookieBytes := []byte(cookieValue)
    cookieEncodedb85Bytes := make([]byte, ascii85.MaxEncodedLen(len(cookieBytes)))
    _ = ascii85.Encode(cookieEncodedb85Bytes, cookieBytes)
    cookieEncodedString := string(cookieEncodedb85Bytes)
    return cookieEncodedString
}


func main() {  
    user := User{
          25, 
          []string{"music", "football"},
    }

    userOriginalJson, _ := json.Marshal(user) 
    fmt.Println("User as json", string(userOriginalJson))

    userB85Encoded := encodeCookie(string(userOriginalJson))
    fmt.Println("User as jsonB85", userB85Encoded)


    userB85DecodedJson := decodeCookie(userB85Encoded)
    fmt.Println("User as json", userB85DecodedJson)

    decodedUser := User{}
    err := json.Unmarshal([]byte(userB85DecodedJson), &decodedUser)
    if err != nil {
        fmt.Println("Error deserializing json bytes", err)
    }

   fmt.Println(fmt.Sprintf("Deserialized User:%v", decodedUser))

   //NOW WE ADD THESE LINES

   fmt.Println("length of original json string", len(userOriginalJson))
   fmt.Println("length of decoded json string", len(userB85DecodedJson))
}

Now the two last lines of output will show:

length of original json string 43  
length of decoded json string 44

So we see that there is a difference between the original and the decoded string! How is that possible?

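The only hint I found about why this might be happening is in the ridiculously succinct (as usual) ascii85 Go documentation:

"[...] The encoding handles 4-byte chunks, using a special encoding for the last fragment [...]"

So what if the issue is that, because the length of the input to encodeCookie (the 43-byte JSON string) is not a multiple of 4, ascii85 pads the last chunk with null bytes up to the nearest multiple, turning a 43-byte array into a 44-byte array? That is close to what happens: ascii85.Encode returns the number of bytes it actually wrote (54 here), but encodeCookie ignores that value and keeps the whole MaxEncodedLen buffer (55 bytes), so the character that encodes the padding byte ends up in the cookie and decodes to a trailing \x00.

We can fix this by removing the null bytes from the output byte array, using the convenient bytes.Trim function (slicing the encoded buffer to the length returned by ascii85.Encode would avoid the extra byte in the first place):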

package main

import (  
    "fmt"
    "bytes"
    "encoding/json"
    "encoding/ascii85"
)




type User struct {  
   Age int
   Interests []string
}

func decodeCookie(cookieValue string) string {  
    cookieEncodedBytes := []byte(cookieValue)
    cookieDecodedBytes := make([]byte, len(cookieEncodedBytes))
    nCookieDecodedBytes, _, _ := ascii85.Decode(cookieDecodedBytes, cookieEncodedBytes, true)
    cookieDecodedBytes = cookieDecodedBytes[:nCookieDecodedBytes]

    // the decoded output ends with \x00 null bytes, so we strip them
    cookieDecodedBytes = bytes.Trim(cookieDecodedBytes, "\x00")
    return string(cookieDecodedBytes)
}

func encodeCookie(cookieValue string) string {  
    cookieBytes := []byte(cookieValue)
    cookieEncodedb85Bytes := make([]byte, ascii85.MaxEncodedLen(len(cookieBytes)))
    _ = ascii85.Encode(cookieEncodedb85Bytes, cookieBytes)
    cookieEncodedString := string(cookieEncodedb85Bytes)
    return cookieEncodedString
}

func main() {  
    user := User{
          25, 
          []string{"music", "football"},
    }

    userOriginalJson, _ := json.Marshal(user) 
    fmt.Println("User as json", string(userOriginalJson))

    userB85Encoded := encodeCookie(string(userOriginalJson))
    fmt.Println("User as jsonB85", userB85Encoded)


    userB85DecodedJson := decodeCookie(userB85Encoded)
    fmt.Println("User as json", userB85DecodedJson)

    decodedUser := User{}
    err := json.Unmarshal([]byte(userB85DecodedJson), &decodedUser)
    if err != nil {
        fmt.Println("Error deserializing json bytes", err)
    }

   fmt.Println(fmt.Sprintf("Deserialized User:%v", decodedUser))

   //NOW WE ADD THESE LINES

   fmt.Println("length of original json string", len(userOriginalJson))
   fmt.Println("length of decoded json string", len(userB85DecodedJson))
}

Here is a Go playground link to the code above.

Now the output is as expected:

User as json {"Age":25,"Interests":["music","football"]}  
User as jsonB85 HQkagAKj/j2(TqCDKKH1ATMs7,!&pPD09o6@j3HJAoDU0@UX(h,$fTs  
User as json {"Age":25,"Interests":["music","football"]}  
Deserialized User:{25 [music football]}  
length of original json string 43  
length of decoded json string 43

And that fixes the issue! I hope that in the future the Golang community will focus a bit more on documentation and examples.
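
For readers who want to poke at the length mismatch outside Go, the same 43 versus 44 byte effect can be reproduced with Python's standard base64 module. This is an analogy using a different library, not Go's ascii85 package: it assumes the extra byte comes from encoding a final chunk that was padded to four bytes and keeping all five of its characters.

```python
import base64

original = b'{"Age":25,"Interests":["music","football"]}'  # 43 bytes, not a multiple of 4

# Normal encoding: the last 3-byte fragment becomes 4 characters.
encoded = base64.a85encode(original)

# Encoding with the final chunk padded to 4 bytes yields one extra character.
encoded_padded = base64.a85encode(original + b"\x00")

print(len(encoded), len(encoded_padded))      # 54 55
print(len(base64.a85decode(encoded)))         # 43
print(len(base64.a85decode(encoded_padded)))  # 44, the last byte is a null byte
```

Trimming null bytes after decoding hides the symptom, but it would also strip null bytes that legitimately belong to the payload, so it is only safe for text data such as JSON.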

That's all, thanks for reading!

Note to self: Print statements not showing up on systemd logs? Do this

Manuel Garrido — Wed, 31 Jan 2018 14:24:33 GMT

Let's assume we have a service set up as follows:

[Unit]
Description=systemd_microservice

[Service]
User=USER  
Group=GROUP  
WorkingDirectory=systemd_working_directory  
ExecStart=/usr/bin/python python_scripts.py  
SuccessExitStatus=143  
TimeoutStopSec=10  
Restart=on-failure  
RestartSec=10

[Install]
WantedBy=multi-user.target

And inside python_scripts.py you have a bunch of print statements.

You set up your service and, to your surprise, when you run

sudo journalctl -f -u python_service.service

The logs don't show up!

The reason is that Python's stdout is buffered when it is redirected to the journal, so the output only shows up in blocks.

How to avoid this? Easy! Just set the ExecStart parameter in the service file like this:

[Unit]
Description=systemd_microservice, now with logs!

[Service]
User=USER  
Group=GROUP  
WorkingDirectory=systemd_working_directory  
ExecStart=/usr/bin/python -u python_scripts.py  
SuccessExitStatus=143  
TimeoutStopSec=10  
Restart=on-failure  
RestartSec=10

[Install]
WantedBy=multi-user.target

Did you notice? The -u parameter forces stdout and stderr to be unbuffered! Alternatively, you can set the environment variable PYTHONUNBUFFERED to any non-empty string, which has the same effect. You can see the rest of the options for the python command line interface here
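
If changing the unit file is not an option, the buffering can also be worked around from inside the script. The sketch below assumes Python 3.3 or later (where print accepts the flush argument); log is a hypothetical helper name.

```python
def log(message, stream=None):
    """Print a line and flush right away so it reaches the journal without delay."""
    # file=None makes print fall back to sys.stdout
    print(message, file=stream, flush=True)


log("service started")
```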

Note to self: Disable caps lock in Ubuntu 16.04

Manuel Garrido — Wed, 25 Oct 2017 09:08:11 GMT

Sources: here and here

This post shows how to disable the caps lock key and enable it only when both shift keys are pressed together.

1. Install DCONF

$ sudo apt-get install dconf-tools

2. Disable caps lock and re-enable it by pressing both shift keys at once:

$ setxkbmap -option "caps:none"
$ setxkbmap -option "shift:both_capslock"

What is it to work in a Startup - the good and bad

Manuel Garrido — Wed, 25 Oct 2017 09:05:16 GMT

Nowadays, everyone seems to be fascinated by startups. The media bombards us with success story after success story, displaying incredible offices featuring slides instead of stairs and in-house chefs preparing homemade dinners.

I started working in November 2013 at an NYC-based startup named Namely. I was the 18th employee to join the company. As of now, 4 years later, Namely has more than 300 employees.

Back when I joined, we had two offices. One was in Manhattan, where the Account Management team (now called Client Success), the Sales team (now called Inside Sales), and the Operations team worked. The other was in Greenpoint, Brooklyn, where the Engineering and Design teams were based. That was my office, but I would go to Manhattan from time to time for meetings.

I can say without a doubt that working at Namely is the best job I have ever had.

This post is a personal account of what it means to work in one of those startups. I will talk about the good things (there are a lot), but also about the bad things.

DISCLAIMER: I left Namely in 2015. All opinions written here are based on the 2013-2015 period.

The Good

Namely Labs Lounge

Perks

Perks are one of the aspects most representative of startups. Things like unlimited vacation (which, interestingly enough, means that people take less vacation than when they have a limited number of days off), a ping pong table, unlimited snacks and beer... Coming from the corporate world, where you need to pay to get a bottle of water, all these perks made me feel much more appreciated and more willing to put in extra effort.

Namely Labs Lounge

Growth

Since your tasks and deliverables won't be very clearly defined, and they will change as the company finds its own path, you will learn way more than you would if you were filling a slot in a corporation, where your position would be clearly defined and you would keep doing the same thing until you changed positions.

At Namely, every employee gets $3,000/year to spend on education, however they want. For example, I chose to go to Strata 2015, probably the most important data conference on the planet. Other employees prefer to enroll in online MOOCs.

It feels like a family

When a startup is small, everybody knows each other. More importantly, everyone helps each other, working long hours not because you have to, but because everyone is in the same boat. Thus, you get to know each other better than you would if you were part of a big team at a big company. You get beers together, celebrate birthdays, and have inside jokes, at a company level.

I remember the time I went to our Manhattan office in December 2014 and realized I didn't know everybody there! It felt like something had changed, something had been lost.

Daily Standup, this was ALL the Engineering/Design team back in 2013. Now there are more than 50 people in both teams.

You have an impact

One of the things I realized first when I started working at Namely was: "If I don't do something that I believe needs to be done, nobody will."

When you are working at a small company trying to become a successful, big company, everybody has a lot on their plate, and the list of things pending to be done is huge.

So what do you do? You build those things. And then those things become your baby, and if they have an impact in the company (sometimes they don't), that impact will have been because of you. Not because of some director somewhere thinking of strategy or other marketing buzzwords. It was you who built that. That feeling is priceless.

Namely Data Cleaner, my first web application

So that would be the good things. Now let's move to...

The Bad

Limited Resources

I remember when we were using Trello and the backlog list would have so many cards on it, it was painful to see.

In a successful startup, you realize very soon that time is the most scarce resource. It takes time to close a deal, it takes time to implement a feature, it takes time to wait for the Engineering team to deploy a feature that will allow you to get some metrics. If you are as impatient as I am, waiting for things beyond your control to happen so you can do other things can feel like torture.

Changes Changes

This one is a good/bad thing. Being on a small team at a young company means that culture can change very quickly (there is very little institutional knowledge), and also that teams embrace new tools and procedures all the time.

However, constant change sometimes means that there is no stability to complete long-term plans, and one can see the effort put into a specific project washed away when the need for that project disappears.

Politics rule

Being a small team means that everybody knows everybody.

And while that means that free riders are spotted very quickly, it also means that interpersonal relationships carry more weight when deciding what everyone is worth. That can affect career growth, and those who are either working remotely or just not good at making their voices heard can see other people's careers grow faster than theirs.

And most important of all... it won't last forever

This is the reason why I'm writing this article. ALL OF WHAT I WROTE DOES NOT APPLY TO THE COMPANY THAT NAMELY IS NOW.

By any measure, Namely is still a startup, but it's on its way to becoming a medium-sized business.

But most of those good and bad things I wrote about, the ones I loved and hated, are gone.

As we grew, the need to add more structure to processes and teams became clear.

Suddenly, you weren't able to just work on a project that was very crucial for your department. It had to be scoped and prioritized, meaning that for the majority of the time, you wouldn't do what you thought should be done, but what the teams agreed had to be done.

And all of those changes happened because, well, you just can't manage a 200-person company the same way you manage a 20-person company.

So, if you are a part of a small startup, remember:

Enjoy it as much as you can, because it won't last forever
If things go well, the company will grow and things will change
If things go badly, well, that will be the end of it