Things I like/dislike about GCP (when compared with AWS)
I have been working with AWS (Amazon Web Services) for the past decade. At my current job at Semrush, I was lucky enough to get to work with GCP (Google Cloud Platform) for a change.
After a year of experience with it, here are my very personal comments on GCP:
Things I DON'T like about GCP (when compared with AWS):
- Regarding BigQuery, I find it a major design flaw that BigQuery supports a single database per project. This forces you to separate domains by schema, which is very ugly and leads to namespaces like teamA_stage versus teamB_stage, etcetera, all in the same database. This lack of database selection within a project makes things much more complicated when you are dealing with tools that follow the usual convention that a project contains multiple databases, each with schemas and tables specific to that database's domain. dbt for multiple projects is particularly tricky to implement elegantly with BigQuery, especially in monorepos where multiple teams push their own models.
- Google Secret Manager is very similar to its AWS counterpart. The major difference is that, by default, once a secret in GSM is deleted, it's gone forever. A DevOps engineer at Semrush once applied a stale Terraform configuration by mistake and dropped a few secrets for our platform. Even when we escalated the recovery request to the highest level, GCP tech support told us there was nothing they could do: there is no soft delete or grace period. This is very bad, because secrets are one of the things that usually don't have backups (they are inherently security risks), and one wrong terraform apply can essentially destroy full data pipelines or applications.
- Interacting with Google services in Python is an experience orders of magnitude worse than working with AWS. When you interact with AWS services in Python (a very common use case, since Python is now a major player in data engineering, among other disciplines), you just need to install a single package (boto3). You install boto3 in your environment, and everything works. With Google, each individual service requires its own package, and the documentation for these packages is severely lacking, to the point that I was sometimes not sure which package was the official Google package for a particular service.
On top of that, Google's Python packages are so strictly versioned (due to the protobuf APIs, I assume) that it is almost impossible to install certain dependencies. For example, when working with Apache Airflow, providers usually ship a single package with all of their specific code inside (these packages are usually named apache-airflow-providers-XXX, for example apache-airflow-providers-google). When working with AWS, package conflicts were rare, because AWS packages strike a good balance in their dependency requirements. With Google as an Airflow provider, however, the provider package covers all Google services, each pulling in its own package with its own version requirements, so you can sometimes end up in a place where you just can't find a set of package versions that installs.
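To make the BigQuery namespace workaround from the first point concrete, here is a minimal sketch in Python. It assumes a team-plus-layer naming convention like teamA_stage; the helper name, the project ID and the table name are all hypothetical and only serve as illustration.

```python
def table_id(project: str, team: str, layer: str, table: str) -> str:
    """Build a fully qualified BigQuery table ID (project.dataset.table).

    Assumption: since there is no database level to select, the team and the
    layer are both folded into the dataset name, e.g. 'teamA_stage'.
    """
    dataset = f"{team}_{layer}"
    return f"{project}.{dataset}.{table}"


# Two teams sharing one (hypothetical) project end up side by side:
print(table_id("my-project", "teamA", "stage", "orders"))
print(table_id("my-project", "teamB", "stage", "orders"))
```

Every tool that expects a database level has to be taught this prefix convention, which is where the extra complexity comes from.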
Things I DO like about GCP (when compared with AWS):
- Provisioning virtual machines with custom machine types is neat. That way you don't need to figure out which machine micro/macro/biggie/smallie is the one that matches your CPU/memory requirements; instead you can just provision a machine with machine type custom-6-20480 for a machine with 6 CPUs and 20 GB of memory.
- The fact that Google Cloud Storage buckets have soft delete by default is a nice thing: it allows you to recover objects when you do an oopsie and delete some data. AWS S3 buckets do have versioning, but it is an opt-in setting, which means that if you don't enable it, by the time you need to recover some data it will be too late.
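As a small illustration of the custom machine type naming mentioned above, the sketch below builds the name from a CPU count and a memory size. The helper is hypothetical, and the rules it encodes (memory expressed in MB, memory as a multiple of 256 MB, an optional family prefix such as n2) are my understanding of the Compute Engine conventions, so check them against the current documentation before relying on them.

```python
def custom_machine_type(vcpus: int, memory_gb: float, family: str = "") -> str:
    """Return a custom machine type name such as 'custom-6-20480'.

    Assumptions: memory is encoded in MB and must be a multiple of 256 MB;
    an optional machine family (e.g. 'n2') is prepended as a prefix.
    """
    memory_mb = round(memory_gb * 1024)
    if memory_mb % 256 != 0:
        raise ValueError("memory must be a multiple of 256 MB")
    prefix = f"{family}-" if family else ""
    return f"{prefix}custom-{vcpus}-{memory_mb}"


print(custom_machine_type(6, 20))        # custom-6-20480
print(custom_machine_type(6, 20, "n2"))  # n2-custom-6-20480
```

The resulting string is what you would pass as the machine type when creating the instance.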