Mathieu De Coster

Modern Python for Data Scientists

18 May 2022

Recent versions of the Python programming language have gained many useful additions that can help you avoid stringly typed programming and, in combination with a properly set up development environment, can make your life a lot easier.

If you've worked with Python before, then you will be very familiar with dictionaries, lists, and tuples. These elementary data structures allow for fast iterations on small scripts, but are also useful in larger projects.

A common error when using dictionaries is a mistyped key. There is also no indication of which keys may be present in a dictionary until you actually run the program (or read through the code). In contrast, when classes are used, your IDE can provide suggestions for fields, and you know exactly which operations are supported.

Take the following statement for example:

x = d['a'][1] / 2

This small statement contains several assumptions:

  1. d is a dictionary
  2. d contains a key 'a'
  3. d['a'] is a list/tuple/array with at least 2 elements
  4. d['a'][1] is a number

This example shows how Python allows you to write code quickly, but that code can become error prone if you are not careful.
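To make the contrast concrete, here is a small sketch (with a hypothetical Config class and learning_rate field) of how a mistyped dictionary key only surfaces at runtime, while a mistyped attribute name can be caught by your IDE or a type checker before the program runs:

```python
# With a plain dictionary, a mistyped key only fails at runtime.
d = {'learning_rate': 0.01}
try:
  lr = d['learning_rte']  # typo: raises KeyError only when this line runs
except KeyError as e:
  print(f"KeyError: {e}")

# With a class, the field names are known statically, so an IDE or type
# checker can flag the typo before the program ever runs.
class Config:
  def __init__(self, learning_rate: float):
    self.learning_rate = learning_rate

config = Config(learning_rate=0.01)
lr = config.learning_rate  # autocomplete works; config.learning_rte would be flagged
```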

Furthermore, recall the adage, "Code is written only once, but read N times." Dynamically typed languages such as Python make it easy to iterate quickly, which is very useful in a research context. However, reading existing Python code can be daunting if it is badly documented or lacks type annotations.

Let's look at a simple artificial example of a function that computes precision and recall from a set of predictions and ground truth labels.

def metrics(y_pred, y_true):
  assert len(y_pred) == len(y_true), "Expected as many predictions as labels"

  tp = fp = fn = 0
  for i in range(len(y_pred)):
    pred = y_pred[i]
    label = y_true[i]
    match label, pred:
      case 1, 1:
        tp += 1
      case 1, 0:
        fn += 1
      case 0, 1:
        fp += 1

  precision = tp / (tp + fp) if tp + fp != 0 else None
  recall = tp / (tp + fn) if tp + fn != 0 else None

  return precision, recall

As mentioned, this function computes precision and recall, but it returns None for the precision if there are only negative predictions, and None for the recall if there are only negative labels.
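We can check these edge cases with a few hypothetical inputs. The sketch below uses a condensed reimplementation of the counting loop (without the match statement, which requires Python 3.10) so that the behavior is easy to verify in isolation:

```python
def metrics(y_pred, y_true):
  # Condensed version of the function above: count true positives,
  # false positives, and false negatives over (prediction, label) pairs.
  assert len(y_pred) == len(y_true), "Expected as many predictions as labels"
  tp = sum(1 for p, t in zip(y_pred, y_true) if (t, p) == (1, 1))
  fp = sum(1 for p, t in zip(y_pred, y_true) if (t, p) == (0, 1))
  fn = sum(1 for p, t in zip(y_pred, y_true) if (t, p) == (1, 0))
  precision = tp / (tp + fp) if tp + fp != 0 else None
  recall = tp / (tp + fn) if tp + fn != 0 else None
  return precision, recall

print(metrics([1, 0, 1], [1, 1, 0]))  # (0.5, 0.5)
print(metrics([0, 0], [1, 1]))        # only negative predictions: (None, 0.0)
print(metrics([0, 0], [0, 0]))        # only negative labels too: (None, None)
```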

This code may appear simple enough, but if we look closer, there are some caveats.

  1. There is no indication in the function signature that precision and recall can be None. Without reading the code, we might assume that NaN values may be returned, or that simply a ZeroDivisionError would be raised. We don't know how to handle edge cases if all we have is the function signature.
  2. Can we also pass NumPy arrays to this function? In this case, yes, but it isn't immediately clear.
  3. We need to remember the order in which precision and recall are returned.

We could add a docstring to the function to explain these things, like so:

def metrics(y_pred, y_true):
  """Compute precision and recall for a list of predictions and corresponding labels.

  :param y_pred: array_like, Binary predictions.
  :param y_true: array_like, Binary labels.
  :return: A tuple with precision and recall, or `None` for undefined precision or recall.
  """
  # ...
  return precision, recall

However, this does not allow our IDE to automatically check if we have violated this function's contract in any way. We still need to be very careful.

With some proper programming techniques and some tools presented to us by recent Python versions, we can make our code a lot cleaner, simpler to grasp, and less prone to bugs.

Type annotations

The first, and probably best-known, technique is to use type annotations. While they don't make Python a statically typed language, they can make code a lot easier to read. Let's add some type annotations to our function.

import numpy as np
from typing import Tuple, List, Union, Optional

def metrics(y_pred: Union[np.ndarray, List], y_true: Union[np.ndarray, List]) -> Tuple[Optional[float], Optional[float]]:
  # ...
  return precision, recall

Now, we immediately know that we can pass Python lists as well as NumPy arrays, and that our function returns a tuple with two values.

We can also create a type hint for an array-like parameter:

import numpy as np
from typing import Tuple, List, Union, Optional

ArrayLike = Union[np.ndarray, List]

def metrics(y_pred: ArrayLike, y_true: ArrayLike) -> Tuple[Optional[float], Optional[float]]:
  # ...
  return precision, recall

We have solved problems 1 and 2! But we still need to remember the order of the return values. And what if we want to return the F1 score as well? We'd have to update all of the code that references metrics()! We could return a dictionary, but then we need to carefully document which keys the dictionary will contain (or risk KeyErrors somewhere downstream).

A better solution would be to return an instance of a class. But then we would need to write an entire class just to hold some data, including __init__ and perhaps __repr__ and __str__ as well.

This is where data classes come in.

Data classes

Data classes are just like regular classes, but decorated with the @dataclass decorator from the dataclasses module. The decorator generates a ton of boilerplate for you, so that you no longer need to write __init__, __repr__, and other special methods manually.

We might define a data class for our function's return values as follows:

from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricsResult:
  precision: Optional[float]
  recall: Optional[float]

We can now easily construct an instance in our function:

import numpy as np
from typing import List, Union

ArrayLike = Union[np.ndarray, List]

def metrics(y_pred: ArrayLike, y_true: ArrayLike) -> MetricsResult:
  # ...
  return MetricsResult(precision, recall)
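A quick look at what the generated methods give us (reusing the MetricsResult class from above, with made-up metric values):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricsResult:
  precision: Optional[float]
  recall: Optional[float]

# The generated __init__ takes the fields in declaration order,
# and the generated __repr__ names each field explicitly.
result = MetricsResult(0.5, 0.75)
print(result)            # MetricsResult(precision=0.5, recall=0.75)
print(result.precision)  # 0.5

# The generated __eq__ compares field by field.
assert result == MetricsResult(0.5, 0.75)
```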

You can also customize individual fields, e.g., to set a default value or to exclude it from the automatically generated representation string.

from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricsResult:
  precision: Optional[float] = None
  recall: Optional[float] = None

Here is how you stop a field from showing up in __repr__:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MetricsResult:
  precision: Optional[float] = field(repr=False, default=None)
  recall: Optional[float] = field(repr=False, default=None)
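Putting the two customizations together, a small sketch of the resulting behavior: the default values make every argument optional, and fields with repr=False are hidden from the representation string.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MetricsResult:
  precision: Optional[float] = field(repr=False, default=None)
  recall: Optional[float] = field(repr=False, default=None)

# Thanks to the default values, we can omit any of the arguments...
result = MetricsResult(precision=0.5)
# ...and because both fields have repr=False, the representation is empty.
print(result)         # MetricsResult()
print(result.recall)  # None
```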

You can access the values simply using the dot notation, because MetricsResult is just a class. Data classes have many more useful properties, which you can investigate in the documentation.

"But," you might say, "a lot of Python functions expect dicts and tuples." You are correct: many Python libraries (including the standard library) expect dicts, lists, and tuples as arguments. Luckily, data classes can easily be converted into these using the asdict() and astuple() functions from the dataclasses module. This also means that data classes are trivial to serialize to JSON.
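A short sketch of both conversions, plus JSON serialization via the standard library (reusing the MetricsResult class with made-up values):

```python
import json
from dataclasses import dataclass, asdict, astuple
from typing import Optional

@dataclass
class MetricsResult:
  precision: Optional[float]
  recall: Optional[float]

result = MetricsResult(0.5, 0.75)

# asdict() and astuple() recursively convert a data class instance.
print(asdict(result))   # {'precision': 0.5, 'recall': 0.75}
print(astuple(result))  # (0.5, 0.75)

# Because asdict() produces a plain dictionary, JSON serialization is trivial.
print(json.dumps(asdict(result)))  # {"precision": 0.5, "recall": 0.75}
```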

In summary, you can get the best of both worlds using data classes: type checking in your IDE, and the flexibility of dictionaries.

Input and output types

This tip is not specific to Python; it is good practice in other programming languages as well. Note how we used a class instead of a tuple as a return value. Thanks to this approach, we can modify the function's return value by adding fields, without breaking existing code. For example, we can add a third field containing the F1 score.
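A sketch of such a non-breaking change: adding a hypothetical f1 field with a default value, so that existing call sites that only pass precision and recall keep working unchanged.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricsResult:
  precision: Optional[float]
  recall: Optional[float]
  # New field with a default value: existing code that constructs
  # MetricsResult(precision, recall) continues to work as before.
  f1: Optional[float] = None

old_style = MetricsResult(0.5, 0.75)       # existing call site, unchanged
new_style = MetricsResult(0.5, 0.75, 0.6)  # new code can also set the F1 score
print(old_style.f1)  # None
```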

It should be clear that being able to make non-breaking changes to your code can save you a lot of time and headaches.

Conclusion

Python is used in a lot of research and prototyping situations. While prototypes should be thrown away, they often grow organically into larger projects. By using the right programming techniques, your prototype code can be reusable and clear. And as we've seen, it doesn't take a lot of time!

In the European research project SignON, we are processing many datasets with a common Python code base. Thanks to data classes and type annotations, we are able to share and reuse code across international teams more easily and with fewer hiccups.