Adding a new data provider to kloppy
This document will outline the basics of how to get started on adding a new Event data provider or a new Tracking data provider.
General
Deserialization
Kloppy has two types of datasets, namely EventDataset and TrackingDataset. These datasets are generally constructed from two files: a file containing raw (event/tracking) data and a meta data file containing pitch dimensions, squad, match and player information etc.
The creation of these standardized datasets is called "deserialization".
"Deserialization is the process of converting a data structure or object state stored in a format like JSON, XML, or a binary format into a usable object in memory." $^1$
Because of the large amount of open data available for it, we'll use the SportecEventDeserializer as our guide for deserializing event data. For tracking data we'll use the SkillCornerDeserializer as our example, because it also has open data available and because its data is provided in the most common format for tracking data delivery (JSON).
File Structure
Adding a new provider requires the creation of at least four files:
- The deserializer file, located in kloppy/kloppy/infra/serializers/{event | tracking}/{provider_name}/deserializer.py. (Sportec Deserializer File)
- The loader file, located in kloppy/_providers/{provider_name}.py. (Sportec Loader File)
- The initialization file, located in kloppy/{provider_name}.py. (Sportec Initialization File)
- The unit test file, located in kloppy/tests/test_{provider_name}.py. (Sportec Unit Test File)
Deserializer File
The deserializer file contains the main {ProviderName}Deserializer class and an associated {ProviderName}Inputs class, as exemplified in the file referenced below.
Source: Sportec Deserializer File
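Since the referenced file is not reproduced here, the following is a minimal, self-contained sketch of that two-class layout. It uses only the standard library: ExampleProviderInputs, ExampleProviderDeserializer, the JSON feeds and the dictionary returned by deserialize are all hypothetical stand-ins. A real deserializer would subclass kloppy's own base class and return a proper dataset instead.

```python
import io
import json
from typing import IO, NamedTuple


class ExampleProviderInputs(NamedTuple):
    """Hypothetical inputs container: one field per file the provider delivers."""

    meta_data: IO[bytes]
    event_data: IO[bytes]


class ExampleProviderDeserializer:
    """Hypothetical stand-in for a provider deserializer (not kloppy's base class)."""

    @property
    def provider(self) -> str:
        # Assumption: in kloppy this would be a member of the Provider enum.
        return "example_provider"

    def deserialize(self, inputs: ExampleProviderInputs) -> dict:
        # Both feeds are assumed to be JSON purely for illustration.
        metadata = json.load(inputs.meta_data)
        events = json.load(inputs.event_data)
        # A real deserializer would build and return a dataset object here.
        return {"provider": self.provider, "metadata": metadata, "records": events}


inputs = ExampleProviderInputs(
    meta_data=io.BytesIO(b'{"home": "Team A", "away": "Team B"}'),
    event_data=io.BytesIO(b'[{"type": "pass"}, {"type": "shot"}]'),
)
dataset = ExampleProviderDeserializer().deserialize(inputs)
```

Keeping the inputs in a small typed container means the deserialize method always has a single argument, however many files a provider delivers.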
Loader File
The loader file contains one or more loading functions, grouped by provider. For example, if a data provider provides both event and tracking data, as well as open data for both, this file will contain:
load_event()
load_tracking()
load_open_event_data()
load_open_tracking_data()
These functions look something like this:
Source: Sportec Loader File
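As a rough illustration of the shape of such a loading function, the sketch below builds a deserializer, keeps the input files open only while deserializing, and returns the result. ExampleInputs, ExampleDeserializer and the one-event-per-line feed are hypothetical stand-ins chosen to keep the example runnable with the standard library; they are not kloppy's actual classes or any provider's format.

```python
import io
from typing import IO, List, NamedTuple, Optional


class ExampleInputs(NamedTuple):
    """Hypothetical inputs container passed to the deserializer."""

    meta_data: IO[bytes]
    event_data: IO[bytes]


class ExampleDeserializer:
    """Hypothetical deserializer that optionally filters by event type."""

    def __init__(self, event_types: Optional[List[str]] = None):
        self.event_types = event_types

    def deserialize(self, inputs: ExampleInputs) -> List[str]:
        # Assumed feed format: one event type per line, purely for illustration.
        events = inputs.event_data.read().decode().split()
        if self.event_types is None:
            return events
        return [event for event in events if event in self.event_types]


def load_event(
    event_data: IO[bytes],
    meta_data: IO[bytes],
    event_types: Optional[List[str]] = None,
) -> List[str]:
    deserializer = ExampleDeserializer(event_types=event_types)
    # Keep both files open only for the duration of the deserialization.
    with event_data as event_fp, meta_data as meta_fp:
        return deserializer.deserialize(
            ExampleInputs(meta_data=meta_fp, event_data=event_fp)
        )


events = load_event(
    io.BytesIO(b"pass\nshot\npass\nfoul"),
    io.BytesIO(b""),
    event_types=["pass", "shot"],
)
```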
Initialization File
To make a provider easy to import (i.e. from kloppy import provider_name), each provider has this file. It should contain the following for each of the loading functions in the loader file:
Source: Sportec Initialization File
Unit Tests
Before finalizing your new provider deserializer, you'll have to add automated tests. These tests are meant to ensure correct behaviour and they should help catch any (breaking) changes in the future.
Kloppy uses pytest for its unit testing.
For example:
Source: Sportec Unit Test File
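To give an idea of the style, here is a small pytest-style sketch. The parse_events function and the RAW feed are hypothetical stand-ins for a provider's loading function and its test fixture file; a real test would load the provider's sample files and assert on the resulting dataset.

```python
import json
from datetime import timedelta


def parse_events(raw: str) -> list:
    """Hypothetical function under test: parses raw events and orders them by time."""
    events = json.loads(raw)
    for event in events:
        event["timestamp"] = timedelta(seconds=event["timestamp"])
    return sorted(events, key=lambda event: event["timestamp"])


# Assumed sample feed, deliberately out of chronological order.
RAW = '[{"type": "shot", "timestamp": 12.5}, {"type": "pass", "timestamp": 3.0}]'


def test_events_are_chronological():
    events = parse_events(RAW)
    assert [event["type"] for event in events] == ["pass", "shot"]


def test_timestamps_are_timedeltas():
    events = parse_events(RAW)
    assert events[0]["timestamp"] == timedelta(seconds=3)
```

pytest collects functions whose names start with test_ from files named test_*.py, so no extra registration is needed beyond following that naming.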
Code Formatting
Kloppy uses the black code formatter to ensure all code conforms to a specified format. It is necessary to format the code using black prior to committing. There are two ways to do this. One is to manually run the command black . (to run black on all .py files) or black <filename.py> (to run it on a specific file).
Alternatively, it is possible to set up a Git hook to automatically run black upon a git commit command. To do this, follow these instructions:
- Execute the command pre-commit install to install the Git hook.
- When you next run a git commit command on the repository, black will run and automatically format any changed files. Note: if black needs to re-format a file, the commit will fail, meaning you will then need to execute git add . and git commit again to commit the files updated by black.
Updating Documentation
Kloppy uses MkDocs to create documentation. The documentation primarily consists of Markdown files.
To install all documentation related dependencies run:
To open the documentation at http://127.0.0.1:8000/ run:
Add any changes you're committing via Pull Request to the documentation to keep it up-to-date.
Creating Pull Request
To cleanly share the additions made to kloppy you need to make what is called a Pull Request (PR). To do this cleanly it is advised to do the following:
- If you start for the first time, create a "fork" of the kloppy repository.
- Clone the newly created fork locally.
- Create a new branch using something like git checkout -b <branch_name>
- After you have made your changes in this newly created branch:
    - Run pytest kloppy/tests or pytest kloppy/tests/test_{provider_name}.py to ensure all tests complete successfully.
    - Update the documentation with any and all changes.
    - Run black <filename> on all the new files.
    - Run git add <filename> on all the new files.
    - Run git commit -m "<some message>"
- Push the changes to your remote repository with git push or git push --set-upstream origin if you're pushing for the first time.
- Now, go to kloppy > Pull Requests and click the green "New pull request" button.
- Set base: master
- Set compare: <branch_name>
- Click "Create pull request"
- Write an exhaustive Pull Request message that describes everything you've contributed.
- Finally, after the PR has been submitted, automated tests will run on GitHub to make sure everything (still) functions as expected.
Event Data
Files
See 1.2 File Structure for more information.
Loading File
The loading file should have a function load (or load_event if this data provider also provides tracking data).
This function takes at least:
- One or more FileLike objects, generally a file of event data and a meta data file.
- event_types, an optional list of strings to filter the dataset at load time (e.g. event_types can be ["pass", "shot"]). These string values relate to the EventType class.
- coordinates, an optional string that relates to Provider and their associated Coordinate Systems (e.g. coordinates can be "secondspectrum" or "statsbomb").
- event_factory, an optional EventFactory.
Within the function we instantiate the ProviderNameDeserializer that we import from kloppy.infra.serializers.event.{new_provider} alongside the ProviderNameInputs.
Note: The "opening" of the file is handled by FileLike together with a with open_as_file() block.
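To illustrate the idea of accepting several kinds of file input and turning each into an open binary file, here is a simplified, hypothetical sketch. FileLikeSketch and open_as_file_sketch are illustrative names only: they are not kloppy's FileLike or open_as_file, and which input kinds kloppy actually accepts is an assumption here.

```python
import io
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import IO, Iterator, Union

# Hypothetical alias: the kinds of input this sketch accepts.
FileLikeSketch = Union[str, Path, bytes, IO[bytes]]


@contextmanager
def open_as_file_sketch(source: FileLikeSketch) -> Iterator[IO[bytes]]:
    if isinstance(source, bytes):
        # Raw bytes: wrap them in an in-memory buffer.
        yield io.BytesIO(source)
    elif isinstance(source, (str, Path)):
        # Path on disk: open in binary mode and close it afterwards.
        with open(source, "rb") as fp:
            yield fp
    else:
        # Already an open binary file object: hand it through unchanged.
        yield source


with open_as_file_sketch(b"<events/>") as fp:
    from_bytes = fp.read()

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "events.xml"
    path.write_bytes(b"<events/>")
    with open_as_file_sketch(path) as fp:
        from_path = fp.read()
```

Whatever the caller passes in, the deserializer only ever sees an object with a read() method, which keeps the parsing code free of file handling.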
Deserialization File
- Create a ProviderNameDataInputs class
- Create a ProviderNameEventDataDeserializer
- Add a new provider to the Provider class in kloppy.domain.models.common.py and set that new provider in the provider property in the ProviderNameEventDataDeserializer
Create a deserialize method that takes the ProviderNameDataInputs as inputs. Within the deserialize method we do two high level actions: parsing the metadata and parsing the events, each described in its own section below.
These two will ultimately form the EventDataset that is returned from the deserialize method, i.e. EventDataset(metadata=metadata, records=events).
Parsing Metadata
Use the meta data and event data feeds to parse:
- teams as Team objects in a list [home_team, away_team]
- periods, a list of Period objects (don't forget about optional extra time).
    - Each period has an id (1, 2, 3, 4)
    - Each period has a start_timestamp and end_timestamp of type timedelta. This timedelta object describes the time elapsed since the start of the period in question.
- pitch_dimensions, a PitchDimensions object
- orientation. Identify the direction of play (Orientation) (e.g. orientation = Orientation.ACTION_EXECUTING_TEAM)
- flags. Indicate if our dataset contains information on who the ball owning team is and/or if we know ball state.
    - For example: flags = DatasetFlag.BALL_STATE | DatasetFlag.BALL_OWNING_TEAM or flags = ~(DatasetFlag.BALL_STATE | DatasetFlag.BALL_OWNING_TEAM)
- provider. Update the Provider enum class and add the new provider.
- coordinate_system. A CoordinateSystem object that contains information like pitch_length, vertical_orientation etc. Create a new ProviderNameCoordinateSystem class that inherits from ProviderCoordinateSystem.
- Optional metadata such as:
    - score (Score)
    - frame_rate (float)
    - date (datetime)
    - etc.
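Two of the items above, the period timestamps and the flags, can be sketched with the standard library alone. PeriodSketch and DatasetFlagSketch are hypothetical stand-ins that only mirror the fields and flag names mentioned in the list; the period lengths are made-up values.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Flag


class DatasetFlagSketch(Flag):
    """Hypothetical stand-in mirroring the two flag names used above."""

    BALL_OWNING_TEAM = 1
    BALL_STATE = 2


@dataclass
class PeriodSketch:
    """Hypothetical stand-in for a period with relative timestamps."""

    id: int
    start_timestamp: timedelta
    end_timestamp: timedelta


# Timestamps are relative to the start of each period, so every period starts at 0.
periods = [
    PeriodSketch(
        id=1,
        start_timestamp=timedelta(0),
        end_timestamp=timedelta(minutes=47, seconds=12),
    ),
    PeriodSketch(
        id=2,
        start_timestamp=timedelta(0),
        end_timestamp=timedelta(minutes=50),
    ),
]

# Dataset that knows both the ball state and the ball owning team.
flags_both = DatasetFlagSketch.BALL_STATE | DatasetFlagSketch.BALL_OWNING_TEAM
# Dataset that knows neither: inverting the combination leaves no flag set.
flags_none = ~(DatasetFlagSketch.BALL_STATE | DatasetFlagSketch.BALL_OWNING_TEAM)

has_ball_state = DatasetFlagSketch.BALL_STATE in flags_both
lacks_ball_state = DatasetFlagSketch.BALL_STATE not in flags_none
```

Because the flags combine with bitwise operators, a single value can state exactly which pieces of information the dataset carries, and membership tests read naturally.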
Parsing Events
Before parsing the events, order them by their timestamp to create a chronological ordering.
Now, for each possible EventType, create an event by using the built-in event factory. This EventFactory is inherited into the ProviderNameDeserializer through the EventDataDeserializer as described above.
Parsing each individual event type requires some generic_event_kwargs (dict) that contains information such as player, team (of event executing player) etc. Additionally, it also contains the full raw_event. This ensures that no information is actually lost while parsing an event.
Now, we combine these generic_event_kwargs and our event specific {someEvent}_event_kwargs and use self.event_factory.build_{someEvent} to consistently churn out events of the same structure.
Finally, each event is appended to the events list.
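The parsing flow described above can be sketched as follows. EventFactorySketch, the raw event fields (type, t, player, team, outcome) and the dictionaries it returns are hypothetical stand-ins used to keep the example self-contained; a real deserializer would call the build methods on self.event_factory and receive proper event objects.

```python
from datetime import timedelta


class EventFactorySketch:
    """Hypothetical stand-in for the event factory: one build method per event type."""

    def build_pass(self, **kwargs) -> dict:
        return {"event_type": "PASS", **kwargs}

    def build_shot(self, **kwargs) -> dict:
        return {"event_type": "SHOT", **kwargs}


# Assumed raw feed, deliberately out of chronological order.
raw_events = [
    {"type": "Shot", "t": 64.2, "player": "p9", "team": "home", "outcome": "goal"},
    {"type": "Pass", "t": 61.0, "player": "p8", "team": "home", "outcome": "complete"},
]

event_factory = EventFactorySketch()
events = []

# Order the raw events chronologically before parsing them.
for raw_event in sorted(raw_events, key=lambda item: item["t"]):
    # Fields shared by every event type, including the untouched raw event.
    generic_event_kwargs = {
        "timestamp": timedelta(seconds=raw_event["t"]),
        "player": raw_event["player"],
        "team": raw_event["team"],
        "raw_event": raw_event,
    }
    if raw_event["type"] == "Pass":
        pass_event_kwargs = {"result": raw_event["outcome"]}
        event = event_factory.build_pass(**pass_event_kwargs, **generic_event_kwargs)
    elif raw_event["type"] == "Shot":
        shot_event_kwargs = {"result": raw_event["outcome"]}
        event = event_factory.build_shot(**shot_event_kwargs, **generic_event_kwargs)
    else:
        # Event types without a mapping are skipped in this sketch.
        continue
    events.append(event)
```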
Deserialization Checklist
- Make sure the FileLike objects are processed correctly in the deserializer. This means opening the files in the Loader File using open_as_file.
- Create variables for each string representation of events, to make the code less error prone. e.g. SPORTEC_EVENT_NAME_OWN_GOAL = "OwnGoal"
- Don't forget about different types (i.e. SetPieceType, CardType, PassType, BodyPartType, GoalKeeperActionType or DuelType)
- Don't forget about different result types (i.e. PassResult, ShotResult, TakeOnResult, CarryResult, DuelResult, InterceptionResult)
- Don't forget to include own goals, yellow and red cards, extra time, penalties etc.
- Map provider specific position labels to Kloppy standardized position labels.
    - When converting these position labels use position_types_mapping.get(provider_position_label, PositionType.Unknown). This ensures that, even if a position label is missing from the mapping, every position produced by our newly built deserializer is of type PositionType.
- deserialize returns an EventDataset(metadata=metadata, records=events)
- Period start_timestamp is of type timedelta. This time delta relates to the start of a period (i.e. each period starts at 0)
- Parse Substitutions separately from PlayerOn and PlayerOff (if this is provided by the provider).
    - Player Off / Player On events represent players (temporarily) leaving the pitch (e.g. injury treatment, red card)
    - Substitutions represent a one for one exchange of two players.
- Update the event-spec.yml file in the Documentation to cover:
    - parsed. If this event is now included in the event data parser.
    - not implemented. If this event is provided by the data provider, but is currently not included in kloppy.
    - not supported. If this event is not provided by the data provider.
    - unknown. If the status is unknown.
    - inferred. When an event is inferred from other events (e.g. Ball out events for some providers)
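The checklist items on named string constants and on position label mapping can be sketched together. PositionTypeSketch, the EXAMPLE_POSITION_* constants and the labels "GK", "CB" and "ST" are hypothetical; a real deserializer would use kloppy's PositionType and the labels found in the provider's feed.

```python
from enum import Enum


class PositionTypeSketch(Enum):
    """Hypothetical stand-in for a standardized position enum."""

    Unknown = "Unknown"
    Goalkeeper = "Goalkeeper"
    CenterBack = "Center Back"
    Striker = "Striker"


# Name each provider string once, so a typo fails loudly instead of silently.
EXAMPLE_POSITION_GOALKEEPER = "GK"
EXAMPLE_POSITION_CENTER_BACK = "CB"
EXAMPLE_POSITION_STRIKER = "ST"

position_types_mapping = {
    EXAMPLE_POSITION_GOALKEEPER: PositionTypeSketch.Goalkeeper,
    EXAMPLE_POSITION_CENTER_BACK: PositionTypeSketch.CenterBack,
    EXAMPLE_POSITION_STRIKER: PositionTypeSketch.Striker,
}


def to_position_type(provider_position_label: str) -> PositionTypeSketch:
    # Labels missing from the mapping fall back to Unknown instead of raising.
    return position_types_mapping.get(
        provider_position_label, PositionTypeSketch.Unknown
    )


known = to_position_type("GK")
unmapped = to_position_type("LWB")
```

Falling back to Unknown keeps a single unexpected label in a provider's feed from aborting the whole deserialization.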