Adding a new data provider to kloppy
This document will outline the basics of how to get started on adding a new Event data provider or a new Tracking data provider.
General
Deserialization
Kloppy has two types of datasets, namely EventDataset and TrackingDataset. These datasets are generally constructed from two files: a file containing raw (event/tracking) data and a meta data file containing pitch dimensions, squad, match and player information etc.
The creation of these standardized datasets is called "deserialization".
"Deserialization is the process of converting a data structure or object state stored in a format like JSON, XML, or a binary format into a usable object in memory." $^1$
Because of the large amount of open data available for it, we'll use the SportecEventDeserializer as our guide for deserializing event data. For tracking data we'll use the SkillCornerDeserializer as the example, because it has open data available and because it is provided in the most common format for tracking data delivery (JSON).
File Structure
Adding a new provider requires the creation of at least four files:
- The deserializer file, located in kloppy/kloppy/infra/serializers/{event | tracking}/{provider_name}/deserializer.py. (Sportec Deserializer File)
- The loader file, located in kloppy/_providers/{provider_name}.py. (Sportec Loader File)
- The initialization file, located in kloppy/{provider_name}.py. (Sportec Initialization File)
- The unit test file, located in kloppy/tests/test_{provider_name}.py. (Sportec Unit Test File)
Deserializer File
The deserializer file contains the main {ProviderName}Deserializer class and an associated {ProviderName}Inputs class, as exemplified below:
Source: Sportec Deserializer File
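A minimal sketch of how the two classes in a deserializer file fit together. It uses only the standard library: EventDataDeserializerStandIn is a hypothetical placeholder for kloppy's EventDataDeserializer base class, and the returned dict stands in for an EventDataset. The ProviderName names follow the naming pattern above and are illustrative, not copied from the Sportec source.

```python
import io
import json
from abc import ABC, abstractmethod
from typing import IO, Any, Dict, NamedTuple


class ProviderNameInputs(NamedTuple):
    """Bundles the already-opened files the deserializer needs."""

    meta_data: IO[bytes]
    event_data: IO[bytes]


class EventDataDeserializerStandIn(ABC):
    """Hypothetical placeholder for kloppy's EventDataDeserializer base class."""

    @property
    @abstractmethod
    def provider(self) -> str:
        ...

    @abstractmethod
    def deserialize(self, inputs: Any) -> Dict[str, Any]:
        ...


class ProviderNameDeserializer(EventDataDeserializerStandIn):
    @property
    def provider(self) -> str:
        # In kloppy this would be a member of the Provider enum.
        return "provider_name"

    def deserialize(self, inputs: ProviderNameInputs) -> Dict[str, Any]:
        # Read both feeds; a real deserializer would build an EventDataset.
        metadata = json.load(inputs.meta_data)
        events = json.load(inputs.event_data)
        return {"metadata": metadata, "records": events}


# Usage with in-memory files instead of real provider data.
inputs = ProviderNameInputs(
    meta_data=io.BytesIO(b'{"home": "Team A", "away": "Team B"}'),
    event_data=io.BytesIO(b'[{"type": "pass"}, {"type": "shot"}]'),
)
dataset = ProviderNameDeserializer().deserialize(inputs)
```

Keeping the inputs in a small named container means the deserialize method has a single argument, regardless of how many files a provider delivers.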
Loader File
The loader file contains one or more loading functions, grouped by provider. For example, if a data provider provides both event and tracking data, as well as open data for both, this file will contain:
- load_event()
- load_tracking()
- load_open_event_data()
- load_open_tracking_data()
These functions look something like this:
Source: Sportec Loader File
Initialization File
To easily use kloppy (i.e. from kloppy import provider_name) each provider has this file. It should contain the following for each of the loading functions in the loader file:
Source: Sportec Initialization File
Unit Tests
Before finalizing your new provider deserializer, you'll have to add automated tests. These tests are meant to ensure correct behaviour and they should help catch any (breaking) changes in the future.
Kloppy uses pytest for its unit testing.
For example:
Source: Sportec Unit Test File
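A minimal sketch of what such a test module could look like. To keep it runnable on its own, load_event here is a hypothetical stand-in defined in the test file itself; in a real test you would import the loader from kloppy and point it at sample files stored with the tests. The raw event strings are invented for illustration.

```python
import io
import json
from typing import IO, Dict, List, Optional


def load_event(
    event_data: IO[bytes], event_types: Optional[List[str]] = None
) -> List[Dict]:
    """Hypothetical stand-in for a provider loader; returns plain dicts."""
    events = json.load(event_data)
    if event_types is not None:
        events = [e for e in events if e["type"] in event_types]
    return events


RAW_EVENTS = b'[{"type": "pass"}, {"type": "shot"}, {"type": "pass"}]'


class TestProviderName:
    """pytest collects classes named Test* and methods named test_*."""

    def test_loads_all_events(self):
        events = load_event(io.BytesIO(RAW_EVENTS))
        assert len(events) == 3

    def test_event_type_filter(self):
        events = load_event(io.BytesIO(RAW_EVENTS), event_types=["shot"])
        assert [e["type"] for e in events] == ["shot"]
```

Small, focused tests like these (one for the full load, one per filter or edge case) make it easier to see which behaviour broke when a later change fails the suite.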
Code Formatting
Kloppy uses the black code formatter to ensure all code conforms to a specified format. It is necessary to
format the code using black prior to committing. There are two ways to do this. One is to manually run the command
black . (to run black on all .py files) or black <filename.py> (to run it on a specific file).
Alternatively, it is possible to set up a Git hook to automatically run black upon a git commit command. To do this,
follow these instructions:
- Execute the command pre-commit install to install the Git hook.
- When you next run a git commit command on the repository, black will run and automatically format any changed files. Note: if black needs to re-format a file, the commit will fail, meaning you will then need to execute git add . and git commit again to commit the files updated by black.
Updating Documentation
Kloppy uses MkDocs to create documentation. The documentation primarily consists of Markdown files.
To install all documentation-related dependencies, run:
To open the documentation at http://127.0.0.1:8000/, run:
Document any changes you're submitting via Pull Request, so that the documentation stays up to date.
Creating Pull Request
To cleanly share the additions made to kloppy you need to make what is called a Pull Request (PR). To do this cleanly it is advised to do the following:
- If you start for the first time, create a "fork" of the kloppy repository.
- Clone the newly created fork locally.
- Create a new branch using something like git checkout -b <branch_name>
- After you have made your changes in this newly created branch:
  - Run pytest kloppy/tests or pytest kloppy/tests/test_{provider_name}.py to ensure all tests complete successfully.
  - Update the documentation with any and all changes.
  - Run black <filename> on all the new files.
  - Run git add <filename> on all the new files.
  - Run git commit -m "<some message>"
  - Push the changes to your remote repository with git push or git push --set-upstream origin if you're pushing for the first time.
- Now, go to kloppy > Pull Requests and click the green "New pull request" button.
  - Set base: master
  - Set compare: <branch_name>
  - Click "Create pull request"
- Write an exhaustive Pull Request message describing everything you've contributed.
- Finally, after the PR has been submitted, automated tests will run on GitHub to make sure everything (still) functions as expected.
Event Data
Files
See 1.2 File Structure for more information.
Loading File
The loading file should have a function load (or load_event if this data provider also provides tracking data).
This function takes at least:
- One or more FileLike objects, generally a file of event data and a meta data file.
- event_types, an optional list of strings to filter the dataset at load time (e.g. event_types can be ["pass", "shot"]). These string values relate to the EventType class.
- coordinates, an optional string that relates to Provider and their associated Coordinate Systems (e.g. coordinates can be "secondspectrum" or "statsbomb").
- event_factory, an optional EventFactory.
Within the function we instantiate the ProviderNameDeserializer that we import from kloppy.infra.serializers.event.{new_provider} alongside the ProviderNameInputs.
Note: The "opening" of the files is handled by FileLike and a with open_as_file() block.
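The loading steps described above can be sketched as follows. This is a self-contained approximation, not kloppy's actual code: open_as_file and FileLike are re-implemented here as simplified stand-ins for the helpers of the same name in kloppy, ProviderNameDeserializer and ProviderNameInputs are hypothetical, and the coordinates and event_factory parameters are left out for brevity.

```python
import io
import json
from contextlib import contextmanager
from typing import IO, Dict, Iterator, List, NamedTuple, Optional, Union

FileLike = Union[str, bytes, IO[bytes]]  # simplified stand-in for kloppy's FileLike


@contextmanager
def open_as_file(input_: FileLike) -> Iterator[IO[bytes]]:
    """Simplified stand-in: yield a readable binary file for several input kinds."""
    if isinstance(input_, str):
        # Treat strings as local paths and close the file afterwards.
        with open(input_, "rb") as fp:
            yield fp
    elif isinstance(input_, bytes):
        yield io.BytesIO(input_)
    else:
        yield input_  # already a file object


class ProviderNameInputs(NamedTuple):
    meta_data: IO[bytes]
    event_data: IO[bytes]


class ProviderNameDeserializer:
    def __init__(self, event_types: Optional[List[str]] = None):
        self.event_types = event_types

    def deserialize(self, inputs: ProviderNameInputs) -> Dict:
        metadata = json.load(inputs.meta_data)
        events = json.load(inputs.event_data)
        if self.event_types is not None:
            events = [e for e in events if e["type"] in self.event_types]
        return {"metadata": metadata, "records": events}


def load_event(
    event_data: FileLike,
    meta_data: FileLike,
    event_types: Optional[List[str]] = None,
) -> Dict:
    deserializer = ProviderNameDeserializer(event_types=event_types)
    # Both files stay open only for the duration of deserialization.
    with open_as_file(event_data) as event_fp, open_as_file(meta_data) as meta_fp:
        return deserializer.deserialize(
            ProviderNameInputs(meta_data=meta_fp, event_data=event_fp)
        )


dataset = load_event(
    event_data=b'[{"type": "pass"}, {"type": "shot"}]',
    meta_data=b'{"home": "Team A"}',
    event_types=["shot"],
)
```

Opening the files in the loader rather than in the deserializer keeps the deserializer independent of where the data comes from: it only ever sees readable file objects.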
Deserialization File
- Create a ProviderNameDataInputs class
- Create a ProviderNameEventDataDeserializer
- Add a new provider to the Provider class in kloppy.domain.models.common.py and set that new provider in the provider property in the ProviderNameEventDataDeserializer
Create a deserialize method that takes the ProviderNameDataInputs as inputs. Within the deserialize method we do two high-level actions:
- Parsing the metadata
- Parsing the events
These two will ultimately form the EventDataset that is returned from the deserialize method, i.e. EventDataset(metadata=metadata, records=events).
Parsing Metadata
Use the meta data and event data feeds to parse:
- teams as Team objects in a list [home_team, away_team]
- periods, a list of Period objects (don't forget about optional extra time).
  - Each period has an id (1, 2, 3, 4)
  - Each period has a start_timestamp and end_timestamp of type timedelta. This timedelta object describes the time elapsed since the start of the period in question.
- pitch_dimensions is a PitchDimensions
- orientation. Identify the direction of play (Orientation) (e.g. orientation = Orientation.ACTION_EXECUTING_TEAM)
- flags. Indicate if our dataset contains information on who the ball owning team is and/or if we know ball state.
  - For example: flags = DatasetFlag.BALL_STATE | DatasetFlag.BALL_OWNING_TEAM or flags = ~(DatasetFlag.BALL_STATE | DatasetFlag.BALL_OWNING_TEAM)
- provider. Update the Provider enum class and add the new provider.
- coordinate_system. A CoordinateSystem object that contains information like pitch_length, vertical_orientation etc. Create a new ProviderNameCoordinateSystem class that inherits from ProviderCoordinateSystem.
- Optional metadata such as:
  - score (Score)
  - frame_rate (float)
  - date (datetime)
  - etc.
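To make the period timestamps and the flags concrete, here is a small stand-alone sketch. Period and DatasetFlag are defined locally as simplified stand-ins that mimic the attributes named above; they are not kloppy's own classes, and the period lengths are made up.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Flag


@dataclass
class Period:
    """Simplified stand-in exposing the three attributes described above."""

    id: int
    start_timestamp: timedelta
    end_timestamp: timedelta


class DatasetFlag(Flag):
    """Simplified stand-in; members combine with | like any enum.Flag."""

    BALL_OWNING_TEAM = 1
    BALL_STATE = 2


# Timestamps are relative to the start of each period, so every period starts at 0.
periods = [
    Period(
        id=1,
        start_timestamp=timedelta(0),
        end_timestamp=timedelta(minutes=47, seconds=12),
    ),
    Period(
        id=2,
        start_timestamp=timedelta(0),
        end_timestamp=timedelta(minutes=50, seconds=3),
    ),
]

# The provider supplies both pieces of information.
flags = DatasetFlag.BALL_STATE | DatasetFlag.BALL_OWNING_TEAM
has_ball_state = bool(flags & DatasetFlag.BALL_STATE)

# The provider supplies neither: inverting the full set leaves no flags set.
no_flags = ~(DatasetFlag.BALL_STATE | DatasetFlag.BALL_OWNING_TEAM)
has_ball_state_without_flags = bool(no_flags & DatasetFlag.BALL_STATE)
```

Checking a flag with & keeps downstream code independent of which other flags happen to be set.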
Parsing Events
Before parsing the events, order them by their timestamp so that they are processed chronologically.
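As a small illustration of this ordering step, the sketch below sorts raw events given as dicts. The period_id and timestamp keys are hypothetical; the actual field names depend on the provider's feed. Sorting on the period first keeps events from different periods apart when timestamps restart at zero.

```python
from datetime import timedelta

raw_events = [
    {"id": "c", "period_id": 2, "timestamp": timedelta(seconds=5)},
    {"id": "b", "period_id": 1, "timestamp": timedelta(seconds=90)},
    {"id": "a", "period_id": 1, "timestamp": timedelta(seconds=12)},
]

# sorted() is stable, so events with identical keys keep their original feed order.
ordered_events = sorted(raw_events, key=lambda e: (e["period_id"], e["timestamp"]))
ordered_ids = [e["id"] for e in ordered_events]
```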
Now, for each possible EventType create an event by using the built in event factory. This EventFactory is inherited into the ProviderNameDeserializer through the EventDataDeserializer as described above.
Parsing each individual event type requires some generic_event_kwargs (dict) that contains information such as player, team (of event executing player) etc. Additionally, it also contains the full raw_event. This ensures that no information is actually lost while parsing an event.
Now, we combine these generic_event_kwargs and our event specific {someEvent}_event_kwargs and use self.event_factory.build_{someEvent} to consistently churn out events of the same structure.
Finally, each event is appended to the events list.
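The pattern of merging generic and event-specific keyword arguments can be sketched as below. EventFactoryStandIn and its build_pass and build_shot methods are hypothetical placeholders that return plain dicts; kloppy's EventFactory builds proper event objects, and the raw event fields shown are invented.

```python
from typing import Any, Dict, List


class EventFactoryStandIn:
    """Hypothetical placeholder; kloppy's EventFactory returns event objects."""

    def build_pass(self, **kwargs: Any) -> Dict[str, Any]:
        return {"event_type": "PASS", **kwargs}

    def build_shot(self, **kwargs: Any) -> Dict[str, Any]:
        return {"event_type": "SHOT", **kwargs}


event_factory = EventFactoryStandIn()
raw_events = [
    {"type": "Pass", "player": "p1", "team": "home", "outcome": "complete"},
    {"type": "Shot", "player": "p9", "team": "away", "outcome": "goal"},
]

events: List[Dict[str, Any]] = []
for raw_event in raw_events:
    # Shared by every event type; raw_event is kept so no information is lost.
    generic_event_kwargs = dict(
        player=raw_event["player"],
        team=raw_event["team"],
        raw_event=raw_event,
    )
    if raw_event["type"] == "Pass":
        pass_event_kwargs = dict(result=raw_event["outcome"])
        event = event_factory.build_pass(**pass_event_kwargs, **generic_event_kwargs)
    elif raw_event["type"] == "Shot":
        shot_event_kwargs = dict(result=raw_event["outcome"])
        event = event_factory.build_shot(**shot_event_kwargs, **generic_event_kwargs)
    else:
        continue  # unsupported types are skipped in this sketch
    events.append(event)
```

Building the generic kwargs once per raw event keeps the per-type branches short, since each branch only has to supply what is specific to that event type.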
Deserialization Checklist
- Make sure the FileLike objects are processed correctly in the deserializer. This means opening the files in the Loader File using open_as_file.
- Create variables for each string representation of events, to make the code less error prone, e.g. SPORTEC_EVENT_NAME_OWN_GOAL = "OwnGoal"
- Don't forget about different types (i.e. SetPieceType, CardType, PassType, BodyPartType, GoalKeeperActionType or DuelType)
- Don't forget about different result types (i.e. PassResult, ShotResult, TakeOnResult, CarryResult, DuelResult, InterceptionResult)
- Don't forget to include own goals, yellow and red cards, extra time, penalties etc.
- Map provider specific position labels to Kloppy standardized position labels.
  - When converting these position labels use position_types_mapping.get(provider_position_label, PositionType.Unknown). This ensures that, even if a position label is missing from the mapping, every position produced by our newly built deserializer is of type PositionType.
- deserialize returns an EventDataset(metadata=metadata, records=events)
- Period start_timestamp is of type timedelta. This time delta relates to the start of a period (i.e. each period starts at 0)
- Parse Substitutions separately from PlayerOn and PlayerOff (if this is provided by the provider).
  - Player Off / Player On events represent players (temporarily) leaving the pitch (e.g. injury treatment, red card)
  - Substitutions represent a one for one exchange of two players.
- Update the event-spec.yml file in the Documentation to cover:
  - parsed. If this event is now included in the event data parser.
  - not implemented. If this event is provided by the data provider, but is currently not included in kloppy.
  - not supported. If this event is not provided by the data provider.
  - unknown. If the status is unknown.
  - inferred. When an event is inferred from other events (e.g. Ball out events for some providers)
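Two of the checklist items above, named constants for provider event strings and a position mapping with a safe fallback, can be sketched as follows. PositionType is a trimmed-down local stand-in for kloppy's enum of the same name, and the provider labels in the mapping are invented for illustration.

```python
from enum import Enum

# Named constant for a raw provider string (this one is taken from the checklist).
SPORTEC_EVENT_NAME_OWN_GOAL = "OwnGoal"


class PositionType(Enum):
    """Trimmed-down stand-in; the real enum has many more members."""

    Unknown = "Unknown"
    Goalkeeper = "Goalkeeper"
    Striker = "Striker"


# Hypothetical provider labels mapped to standardized position types.
position_types_mapping = {
    "GK": PositionType.Goalkeeper,
    "ST": PositionType.Striker,
}


def to_position_type(provider_position_label: str) -> PositionType:
    # Unmapped labels fall back to Unknown instead of raising a KeyError.
    return position_types_mapping.get(provider_position_label, PositionType.Unknown)


known = to_position_type("GK")
fallback = to_position_type("not-a-real-label")
```

Using .get with an explicit default means a new or misspelled label in the provider feed degrades to PositionType.Unknown rather than aborting the whole deserialization.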