For the last four years I have been jumping back and forth between Python and Go.
The majority of the work I carry out as a Machine Learning Engineer at Octopus
Energy is focused on Python and the typical data science stack. We work on
medium- to large-scale data processing projects using all the usual suspects:
keras, scikit-learn, pandas, airflow, etc. But at the moment, I’m not doing any Go.
There are two major things that I miss from Go when working on a Python project:
static typing and interfaces. At Octopus Energy we use Python’s type
hinting to avoid silly mistakes (the ones I make most
often) with some success, but there are intrinsic limitations.
While dealing with data streams and especially with the io package in Python,
it becomes a burden trying to keep a flexible interface while reusing elements
of the standard library. Consider the following example:

import pandas as pd

def transform(input_file: str) -> pd.DataFrame:
    # index_col names the column to use as the DataFrame index.
    df = pd.read_csv(input_file, index_col="index")
    return _transform(df)  # _transform is defined elsewhere

This implementation has some drawbacks. Passing a file-path string for
pandas to load a CSV file makes testing difficult, as you would need to create an
actual CSV. Also, we restrict ourselves to local files for applying our data
transform, whereas you might want to load the stream from other sources, such as
HTTP connections or an S3 bucket.
You might be tempted to rewrite this function using io classes.
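
import io
import pandas as pd

def transform(input_io: io.TextIOBase) -> pd.DataFrame:
    # io.TextIOBase is the base class of the text streams in the io package.
    df = pd.read_csv(input_io, index_col="index")
    return _transform(df)  # _transform is defined elsewhere
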
This is more sensible: we’re using a class offered by the standard library, and
pandas will like it because it relies on an implicit reading protocol (more on
this soon). This means that as long as whatever you send down to pd.read_csv has
a method called read, pandas will be able to load the data. But it’s
difficult to extend the classes in the io package.
Many FTP or S3 readers won’t inherit from these classes,
as typing is not really big in the Python ecosystem. Also, building fake io
subclasses to test is an ordeal, as you need to create a bunch of empty stubs to
extend them.
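To make the inheritance problem concrete, here is a minimal sketch. HypotheticalS3Reader is an invented stand-in for a third-party reader (it is not a real library class). It offers read() but does not inherit from anything in io, so an inheritance-based check against io.TextIOBase rejects it even though it behaves like a stream.

```python
import io


class HypotheticalS3Reader:
    """Invented stand-in for a third-party reader; not a real library class."""

    def __init__(self, payload: str) -> None:
        self._buffer = payload

    def read(self, size: int = -1) -> str:
        # Return up to `size` characters, or everything left when size is negative.
        if size < 0:
            size = len(self._buffer)
        chunk, self._buffer = self._buffer[:size], self._buffer[size:]
        return chunk


stream = io.StringIO("index,value\n0,42\n")
reader = HypotheticalS3Reader("index,value\n0,42\n")

# StringIO is a real io subclass, so the inheritance-based check passes.
is_stream_textio = isinstance(stream, io.TextIOBase)
# The hypothetical reader has a working read() but no io ancestry.
is_reader_textio = isinstance(reader, io.TextIOBase)

print(is_stream_textio, is_reader_textio)
```

A function annotated with an io class would therefore need an adapter or a subclass for every reader of this kind.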
Python protocols
Let’s focus on the concept of protocol in Python. This is not something new: you
might know that calling len() on any object that defines the method __len__
will just work; __getitem__ and the pickling protocol are more examples.
Python, being dynamically typed, looks up __len__ at runtime when len() is
called and raises a TypeError if the method is missing.
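A small sketch of these informal protocols in action. Playlist and Opaque are hypothetical classes made up for illustration; neither inherits from anything special.

```python
class Playlist:
    """Hypothetical container that only implements two dunder methods."""

    def __init__(self, tracks):
        self._tracks = list(tracks)

    def __len__(self) -> int:
        return len(self._tracks)

    def __getitem__(self, position):
        return self._tracks[position]


class Opaque:
    """Defines no __len__, so len() has nothing to call."""


playlist = Playlist(["intro", "verse", "outro"])
size = len(playlist)    # len() finds and calls Playlist.__len__
first = playlist[0]     # indexing calls Playlist.__getitem__

try:
    len(Opaque())
    error_name = None
except TypeError as exc:
    # The missing method is only discovered at runtime.
    error_name = type(exc).__name__

print(size, first, error_name)
```

Nothing declares that Playlist "is a" sized object; the presence of the method is the whole contract.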
If you’ve written any Go code, this will sound familiar, but of course you would
formalise your protocol as an actual interface. In Go, there’s no concept of
classes or inheritance, just structs. Structs can have functions attached
to them called methods (sorry, the nomenclature makes the comparison a bit
messy). You define interfaces as contracts that your struct needs to meet, just
like protocols! But because Go is statically typed, this is enforced at compile
time.
Relying on type hinting to play along with protocols is known as structural
subtyping in Python.
You might be wondering how this is different from inheritance. Let’s have a look
at the following example where we will create a useful function to print the
length of different objects.

class LenMixin:
    def __len__(self) -> int:
        raise NotImplementedError()

class OneLengther(LenMixin):
    def __len__(self) -> int:
        return 1

def print_with_len(has_len: LenMixin):
    print(len(has_len))

print_with_len(OneLengther())  # yay
print_with_len(list([1, 2, 3]))  # nay! type error

Mypy will complain about the last line in this example. The reason is that
list is not a subclass of LenMixin. That’s unfortunate, as the list
class actually offers all the functionality we need. Of course, we could create
a MyList class that inherits from both list and LenMixin, but it adds tons of
boilerplate code.
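For illustration, the MyList workaround could look roughly like the sketch below. This is one possible shape of it, not a recommendation. LenMixin and print_with_len are repeated so that the snippet runs on its own.

```python
class LenMixin:
    def __len__(self) -> int:
        raise NotImplementedError()


class MyList(list, LenMixin):
    """Adapter whose only job is to make a list count as a LenMixin."""
    # list comes first in the bases, so list.__len__ is the one that gets called.


def print_with_len(has_len: LenMixin) -> None:
    print(len(has_len))


numbers = MyList([1, 2, 3])
print_with_len(numbers)
```

Every built-in or third-party type we want to pass in would need its own adapter class like this, and callers would have to remember to wrap their values first.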
The package typing_extensions allows us to use protocols in a statically typed
fashion (since Python 3.8, Protocol is also available directly from the typing
module). A Protocol defines a set of methods that a class needs to have in
order to be used as an argument to another function.

from typing_extensions import Protocol

class Sizable(Protocol):
    """A custom definition of https://mypy.readthedocs.io/en/latest/protocols.html#sized"""

    def __len__(self) -> int:
        pass

class MySizable:  # No superclass
    def __len__(self) -> int:
        return 1

def print_sizable(sizable: Sizable):
    print(len(sizable))

print_sizable(MySizable())  # yay
print_sizable(list([1, 2, 3]))  # yaaaay!

In this way we just need to have the __len__ method in our class (or function
in our module) in order to fulfill the print_sizable requirements.
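The checks discussed so far happen statically, when mypy runs. As a side note, a protocol can also be checked at runtime. The sketch below assumes Python 3.8+, where Protocol and runtime_checkable can be imported from typing; the runtime_checkable decorator is an addition for illustration and is not used elsewhere in this post.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Sizable(Protocol):
    def __len__(self) -> int:
        ...


class MySizable:  # no superclass
    def __len__(self) -> int:
        return 1


# isinstance() only checks that the protocol's methods are present,
# not that their signatures match.
checks = {
    "MySizable": isinstance(MySizable(), Sizable),
    "list": isinstance([1, 2, 3], Sizable),
    "int": isinstance(42, Sizable),
}
print(checks)
```

Both MySizable and list pass without naming Sizable anywhere in their definitions, while int fails because it has no __len__.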
The whole concept of dependency injection in Go revolves around the idea of
having interfaces as a contract. Your Python object or your Go struct needs
to fulfill this contract in order to be used as a parameter to a function. This
blog post by Christoph Berger explains it way better, and this video by the
great Francesc Campoy will give you deep insight into the concept.
Coming back to the reader example, here we define a protocol that allows reading
streams.

from abc import abstractmethod
from typing import Any
from typing_extensions import Protocol

class Reader(Protocol):
    """Reader protocol sets the contract to read streams."""

    @abstractmethod
    def __iter__(self):
        """Make the reader an iterable."""
        pass

    @abstractmethod
    def read(self, size: int = -1) -> Any:
        """Read the given amount of bytes."""
        pass

The reason why our reader needs to be iterable comes from pandas’ file
objects.
Now we can rewrite our transform function to use this protocol as an input.

from abc import abstractmethod
from typing import Any

import pandas as pd
from typing_extensions import Protocol

class Reader(Protocol):
    """Reader protocol sets the contract to read streams."""

    @abstractmethod
    def __iter__(self):
        """Make the reader an iterable."""
        pass

    @abstractmethod
    def read(self, size: int = -1) -> Any:
        """Read the given amount of bytes."""
        pass

def transform(input_reader: Reader) -> pd.DataFrame:
    # Anything that satisfies Reader is accepted; no io inheritance needed.
    df = pd.read_csv(input_reader, index_col="index")
    return _transform(df)  # _transform is defined elsewhere

From this implementation we obtain a function that accepts not only any
file-like object, such as files, io.StringIO or io.BytesIO, but also objects
from third-party libraries that stream from S3 buckets, FTP, sFTP or HTTP, as
long as they are iterables and have a read(size: int = -1) method. No inheritance
needed, no extra glue code to make this work with third-party libraries or
testing fakes.
Protocols and structural subtyping make injecting dependencies and creating fake
objects for testing easy and elegant.
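As a rough sketch of what such a testing fake could look like: FakeReader and load_rows below are hypothetical names, and load_rows uses the standard library's csv.DictReader as a stand-in for the pandas-based transform so that the snippet runs without third-party packages. It also assumes Python 3.8+, where Protocol can be imported from typing.

```python
import csv
from typing import Any, Dict, Iterator, List, Protocol


class Reader(Protocol):
    def __iter__(self) -> Iterator[Any]:
        ...

    def read(self, size: int = -1) -> Any:
        ...


class FakeReader:
    """Hypothetical in-memory test double; note that it inherits from nothing."""

    def __init__(self, lines: List[str]) -> None:
        self._lines = lines

    def __iter__(self) -> Iterator[str]:
        return iter(self._lines)

    def read(self, size: int = -1) -> str:
        text = "".join(self._lines)
        return text if size < 0 else text[:size]


def load_rows(input_reader: Reader) -> List[Dict[str, str]]:
    # Stand-in for transform(): csv.DictReader only needs to iterate over lines.
    return list(csv.DictReader(input_reader))


rows = load_rows(FakeReader(["index,value\n", "0,42\n", "1,7\n"]))
print(rows)
```

The fake needs only the two methods named by the protocol, and a type checker can verify the call to load_rows without FakeReader ever mentioning Reader.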
Source: Kraken Technologies