@willmcgugan did a code review. This is his feedback:
Kloppy
General observations
- Names like `utils.py` and `helpers.py` are best avoided. Best to come up with descriptive names, even if that means separating them into smaller files.
- Typing throughout, which is great. Consider enabling strict mode in Mypy to catch missing types.
- I think you will need to add a `py.typed` file to expose typing information to users of the library.
- I wonder if you can convert some dataclasses to namedtuples, particularly the small mathematical entities, like `Point` and `Dimensions`. They are immutable by default, and typically smaller / faster than dataclasses.
- Your install requirements are unbounded, for instance `lxml>=4.5.0`. If lxml follows semver then version 5.0.0 of lxml may have breaking changes, which could in turn break kloppy. An upper bound such as `lxml>=4.5.0,<5.0` would guard against that.
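The namedtuple suggestion above could be sketched like this (the field names are guesses at what `Point` and `Dimensions` contain, not kloppy's actual definitions):

```python
from typing import NamedTuple


class Point(NamedTuple):
    """Immutable 2D point; tuple-based, so smaller and faster than a dataclass."""

    x: float
    y: float


class Dimensions(NamedTuple):
    """Min/max extent along one pitch axis."""

    min: float
    max: float


point = Point(x=0.5, y=0.8)
# Attribute access works as with a dataclass, but instances are plain tuples.
assert point.x == 0.5
assert point == (0.5, 0.8)
```

Because instances are tuples, you also get structural equality, hashing and unpacking for free.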
Package structure
I see a fair number of `*` imports, which are generally frowned upon these days. The problem is that you import every symbol, including many that are unintended. It's better to explicitly import every symbol (as tedious as that can sometimes be).
You might also want to prefix module / file names that you don't expect to be imported directly with an underscore.
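For example (file names here are hypothetical):

```
kloppy/
    events.py          # public, intended for direct import
    _event_parser.py   # internal; the underscore signals "don't import me"
```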
Some of the module names are quite verbose (and difficult to remember). It's best to strive for as flat a module structure as makes sense.
For instance, I'm looking at `EventDataset` and the import is:

`kloppy.domain.models.event.EventDataset`

Consider discarding a few levels to simplify that. For instance, `domain.models` is probably irrelevant to the caller. Perhaps it could be simplified to the following:

`kloppy.events.EventDataset`
These changes will likely require restructuring the project to reflect how the caller would want to use the library, rather than the internal logic.
In general, the less you require the user of a library to remember, the more positively they will perceive it.
Some of your helper methods look like perhaps they should be methods of the Dataset object, for instance `transform` and `to_pandas`.
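A rough sketch of what that could look like; the `Dataset` here is a stand-in, and the method bodies use simple placeholders rather than kloppy's real helpers:

```python
class Dataset:
    def __init__(self, records):
        self.records = records  # list of (x, y) coordinate pairs

    def transform(self, flip_x: bool = False) -> "Dataset":
        # Stand-in for the real coordinate-transform helper: a method on the
        # object is easier to discover than a free function in a helpers module.
        if flip_x:
            return Dataset([(1 - x, y) for x, y in self.records])
        return self

    def to_pandas(self):
        # Stand-in for the real to_pandas helper; returns plain dicts here
        # to keep the sketch dependency-free.
        return [{"x": x, "y": y} for x, y in self.records]


dataset = Dataset([(0.25, 0.5)])
flipped = dataset.transform(flip_x=True)
```

The call site then reads as `dataset.transform(...)` rather than `transform(dataset, ...)`.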
Exceptions
There are a number of instances of `raise Exception`. Raising an exception like this makes it impossible to catch explicitly, as `except Exception` would catch everything. Better to create custom exceptions derived from `Exception`, so the caller can explicitly catch those.
There are also a few `ValueError`s raised. It is generally not a good idea to raise builtin exceptions, for similar reasons as above. Any number of bugs may raise `ValueError`s, and you risk hiding them by conflating your own errors with theirs. Better to use a custom exception for these as well.
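A minimal sketch; the exception and function names are made up for illustration, not kloppy's actual API:

```python
class KloppyError(Exception):
    """Base class for all errors raised by kloppy."""


class DeserializationError(KloppyError):
    """Raised when input data cannot be deserialized."""


def parse_period(value: str) -> int:
    if not value.isdigit():
        # A custom exception can be caught without also catching
        # unrelated ValueErrors raised by library bugs.
        raise DeserializationError(f"invalid period: {value!r}")
    return int(value)


try:
    parse_period("first")
except KloppyError as error:
    caught = str(error)
```

With a common base class, callers can catch `KloppyError` to handle anything the library raises, or a subclass to handle one failure mode.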
Serializers
I focused on serializers, since this seems to be a major feature of the library.
I see you have an abstract base class for serializers, `CodeDataSerializer`, which has `serialize` and `deserialize` methods. The `deserialize` method is problematic because of the `options` dict, which can take arbitrary parameters, and similarly `inputs`, which takes an arbitrary number of `Readable` objects. Passing in an object without any structure means that you lose out on a) typing and b) expressive method signatures.
When we spoke I suggested leaving these options out of the abstract methods and putting them in the constructor. With the options in the constructor they can be part of the signature rather than bundled in a dict.
Here's a (rough) example:
```python
class MetricaTrackingSerializer(EventDataSerializer):
    def __init__(
        self,
        raw_data_home: Readable,
        raw_data_away: Readable,
        sample_rate: float = 1 / 12,
    ):
        self.raw_data_home = raw_data_home
        self.raw_data_away = raw_data_away
        self.sample_rate = sample_rate

    def deserialize(self) -> EventDataset:
        ...

    def serialize(self) -> Tuple[str, str]:
        ...
```
It may be the case that you will only need those parameters for deserialize, which brings me to another suggestion: separate the two classes into a `Serializer` and a `Deserializer`. A class should ideally do only one thing, and your current serializer base class does two things. By separating them you can also encode any options in the constructor, which doesn't need to be abstract.
If you separate the serializers into two classes, the method names `deserialize` and `serialize` become redundant. You may as well make them callable by adding a `__call__` method.
```python
class MetricaTrackingDeserializer(Deserializer):
    def __init__(
        self,
        raw_data_home: Readable,
        raw_data_away: Readable,
        sample_rate: float = 1 / 12,
    ):
        ...

    def __call__(self) -> EventDataset:
        ...
```
That way you can use them like this:
```python
deserializer = MetricaTrackingDeserializer(foo, bar)
dataset = deserializer()
```
This brings me to another point. Do you need an abstract base class at all?
A base class is useful when you want to accept an object with a common interface. But as far as I can tell, you don't accept the base class as an argument anywhere. Which suggests to me that they could be functions and not classes.
How about a module for each format, with a `load` and `save` function in each (or `import` and `export` if you prefer)? These are somewhat friendlier terms than serializer, and easier to type.
Your deserializer could simply be:

```python
def load(raw_data_home: Readable, raw_data_away: Readable, sample_rate: float = 1 / 12):
    ...
```
You could structure your project to put these formats at the top level, and you could import them like this:

```python
from kloppy import metrica

my_dataset = metrica.load_tracking("home_file.csv", "away_file.csv")
```
This is not too dissimilar to your `load_` helper functions, but I think friendlier still, and there wouldn't need to be separate helper methods.
Just to labour the point a bit more, here's an example from the docs:
```python
from kloppy import StatsBombSerializer, Provider

serializer = StatsBombSerializer()

with open(
    f"{base_dir}/files/statsbomb_lineup.json", "rb"
) as lineup_data, open(
    f"{base_dir}/files/statsbomb_event.json", "rb"
) as event_data:
    dataset = serializer.deserialize(
        inputs={"lineup_data": lineup_data, "event_data": event_data},
        options={"coordinate_system": Provider.STATSBOMB},
    )
```
Consider this as an alternative:

```python
from kloppy import statsbomb

dataset = statsbomb.load(
    f"{base_dir}/files/statsbomb_lineup.json",
    f"{base_dir}/files/statsbomb_event.json",
    coordinates="statsbomb",
)
```
Note that rather than files, this accepts paths. I suspect this will be the most common requirement. If you also need to accept file-like objects you could accept a `Union[str, IO[str]]` and handle that within the `load` method.
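A sketch of that pattern, with a hypothetical `load` reduced to just reading the raw text:

```python
from typing import IO, Union


def load(data: Union[str, IO[str]]) -> str:
    """Accept either a file path or an open file-like object."""
    if isinstance(data, str):
        # Strings are treated as paths, which we open ourselves.
        with open(data) as source:
            return source.read()
    # Anything else is assumed to be file-like and already open.
    return data.read()
```

An `io.StringIO` works as a file-like stand-in, which also makes this branch easy to test.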