@willmcgugan did a code review. This is his feedback:
Kloppy
General observations
- Names like `utils.py` and `helpers.py` are best avoided. Best to come up with descriptive names, even if that means separating them into smaller files.
- Typing throughout, which is great. Consider enabling strict mode in Mypy to catch missing types.
- I think you will need to add a `py.typed` file to expose typing information to users of the library.
- I wonder if you can convert some dataclasses to namedtuples, particularly the small mathematical entities, like `Point` and `Dimensions`. They are immutable by default, and typically smaller / faster than dataclasses.
- Your install requirements are unbounded, for instance `lxml>=4.5.0`. If lxml follows semver then version 5.0.0 of lxml may have breaking changes, which could in turn break kloppy. An upper bound such as `lxml>=4.5.0,<5.0` would guard against that.
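The namedtuple suggestion above could be sketched like this (the field names are guesses at what `Point` and `Dimensions` contain, not kloppy's actual definitions):

```python
from typing import NamedTuple


class Point(NamedTuple):
    """Immutable 2D point; tuple-based, so smaller and faster than a dataclass."""

    x: float
    y: float


class Dimensions(NamedTuple):
    """Min/max extent along one pitch axis."""

    min: float
    max: float


point = Point(x=0.5, y=0.8)
# Attribute access works as with a dataclass, but instances are plain tuples.
assert point.x == 0.5
assert point == (0.5, 0.8)
```

Because instances are tuples, you also get structural equality, hashing and unpacking for free.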
Package structure
I see a fair number of `*` imports, which are generally frowned upon these days. The problem is that you import every symbol, including many that are unintended. It's better to explicitly import every symbol (as tedious as that can sometimes be).
You might also want to prefix module / file names that you don't expect to be imported directly with an underscore.
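For example (file names here are hypothetical):

```
kloppy/
    events.py          # public, intended for direct import
    _event_parser.py   # internal; the underscore signals "don't import me"
```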
Some of the module names are quite verbose (and difficult to remember). It's best to strive for as flat a module structure as makes sense.
For instance, I'm looking at `EventDataset` and the import is:

`kloppy.domain.models.event.EventDataset`

Consider discarding a few levels to simplify that. For instance, `domain.models` is probably irrelevant to the caller. Perhaps it could be simplified to the following:

`kloppy.events.EventDataset`
These changes will likely require restructuring the project to reflect how the caller would want to use the library, rather than the internal logic.
In general, the less you require the user of a library to remember, the more positively they will perceive it.
Some of your helper methods look like perhaps they should be methods of the Dataset object, for instance `transform` and `to_pandas`.
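A rough sketch of what that could look like; the `Dataset` here is a stand-in, and the method bodies use simple placeholders rather than kloppy's real helpers:

```python
class Dataset:
    def __init__(self, records):
        self.records = records  # list of (x, y) coordinate pairs

    def transform(self, flip_x: bool = False) -> "Dataset":
        # Stand-in for the real coordinate-transform helper: a method on the
        # object is easier to discover than a free function in a helpers module.
        if flip_x:
            return Dataset([(1 - x, y) for x, y in self.records])
        return self

    def to_pandas(self):
        # Stand-in for the real to_pandas helper; returns plain dicts here
        # to keep the sketch dependency-free.
        return [{"x": x, "y": y} for x, y in self.records]


dataset = Dataset([(0.25, 0.5)])
flipped = dataset.transform(flip_x=True)
```

The call site then reads as `dataset.transform(...)` rather than `transform(dataset, ...)`.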
Exceptions
There are a number of instances of `raise Exception`. Raising an exception like this makes it impossible to catch explicitly, as `except Exception` would catch everything. Better to create custom exceptions derived from `Exception`, so the caller can explicitly catch those.
There are also a few `ValueError`s raised. It is generally not a good idea to raise builtin exceptions, for similar reasons as above. Any number of bugs may raise `ValueError`s, and you risk hiding them by conflating your own errors with theirs. Better to use a custom exception for these as well.
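A minimal sketch; the exception and function names are made up for illustration, not kloppy's actual API:

```python
class KloppyError(Exception):
    """Base class for all errors raised by kloppy."""


class DeserializationError(KloppyError):
    """Raised when input data cannot be deserialized."""


def parse_period(value: str) -> int:
    if not value.isdigit():
        # A custom exception can be caught without also catching
        # unrelated ValueErrors raised by library bugs.
        raise DeserializationError(f"invalid period: {value!r}")
    return int(value)


try:
    parse_period("first")
except KloppyError as error:
    caught = str(error)
```

With a common base class, callers can catch `KloppyError` to handle anything the library raises, or a subclass to handle one failure mode.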
Serializers
I focused on serializers, since this seems to be a major feature of the library.
I see you have an abstract base class for serializers, `CodeDataSerializer`, which has `serialize` and `deserialize` methods. The `deserialize` method is problematic because of the `options` dict, which can take arbitrary parameters, and similarly `inputs`, which takes an arbitrary number of `Readable` objects. Passing in an object without any structure means that you lose out on a) typing and b) expressive method signatures.
When we spoke I suggested leaving these options out of the abstract methods and putting them in the constructor. With the options in the constructor they can be part of the signature rather than bundled in a dict.
Here's a (rough) example:
```python
class MetricaTrackingSerializer(EventDataSerializer):
    def __init__(
        self,
        raw_data_home: Readable,
        raw_data_away: Readable,
        sample_rate: float = 1 / 12,
    ):
        self.raw_data_home = raw_data_home
        self.raw_data_away = raw_data_away
        self.sample_rate = sample_rate

    def deserialize(self) -> EventDataset:
        ...

    def serialize(self) -> Tuple[str, str]:
        ...
```
It may be the case that you will only need those parameters for deserialize, which brings me to another suggestion: separate the two classes into a `Serializer` and a `Deserializer`. A class should ideally do only one thing, and your current serializer base class does two things. By separating them you can also encode any options in the constructor, which doesn't need to be abstract.
If you separate the serializers into two classes, the method names `deserialize` and `serialize` become redundant. You may as well make them callable by adding a `__call__` method.
```python
class MetricaTrackingDeserializer(Deserializer):
    def __init__(
        self,
        raw_data_home: Readable,
        raw_data_away: Readable,
        sample_rate: float = 1 / 12,
    ):
        ...

    def __call__(self) -> EventDataset:
        ...
```
That way you can use them like this:
```python
deserializer = MetricaTrackingDeserializer(foo, bar)
dataset = deserializer()
```
This brings me to another point. Do you need an abstract base class at all?
A base class is useful when you want to accept an object with a common interface. But as far as I can tell, you don't accept the base class as an argument anywhere. Which suggests to me that they could be functions and not classes.
How about a module for each format, with a `load` and `save` function in each (or `import` and `export` if you prefer)? These are somewhat friendlier terms than serializer, and easier to type.
Your deserializer could simply be:

```python
def load(raw_data_home: Readable, raw_data_away: Readable, sample_rate: float = 1 / 12):
    ...
```
You could structure your project to put these formats at the top level, and you could import them like this:

```python
from kloppy import metrica

my_dataset = metrica.load_tracking("home_file.csv", "away_file.csv")
```
This is not too dissimilar to your `load_` helper functions, but I think friendlier still, and there wouldn't need to be separate helper methods.
Just to labour the point a bit more, here's an example from the docs:
```python
from kloppy import StatsBombSerializer, Provider

serializer = StatsBombSerializer()

with open(
    f"{base_dir}/files/statsbomb_lineup.json", "rb"
) as lineup_data, open(
    f"{base_dir}/files/statsbomb_event.json", "rb"
) as event_data:
    dataset = serializer.deserialize(
        inputs={"lineup_data": lineup_data, "event_data": event_data},
        options={"coordinate_system": Provider.STATSBOMB},
    )
```
Consider this as an alternative:

```python
from kloppy import statsbomb

dataset = statsbomb.load(
    f"{base_dir}/files/statsbomb_lineup.json",
    f"{base_dir}/files/statsbomb_event.json",
    coordinates="statsbomb",
)
```
Note that rather than files, this accepts paths. I suspect this will be the most common requirement. If you also need to accept file-like objects you could accept a `Union[str, IO[str]]` and handle that within the `load` method.
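A sketch of that pattern, with a hypothetical `load` reduced to just reading the raw text:

```python
from typing import IO, Union


def load(data: Union[str, IO[str]]) -> str:
    """Accept either a file path or an open file-like object."""
    if isinstance(data, str):
        # Strings are treated as paths, which we open ourselves.
        with open(data) as source:
            return source.read()
    # Anything else is assumed to be file-like and already open.
    return data.read()
```

An `io.StringIO` works as a file-like stand-in, which also makes this branch easy to test.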