kloppy.io

I/O utilities for reading raw data.

FileLike module-attribute

FileLike = Union[FileOrPath, Source]

Source dataclass

Source(data, optional=False, skip_if_missing=False)

A wrapper around a file-like object to enable optional inputs.

PARAMETER DESCRIPTION
data

The file-like object.

TYPE: FileLike

optional

Whether the file is optional. Defaults to False.

TYPE: bool DEFAULT: False

skip_if_missing

Whether to skip the file if it is missing. Defaults to False.

TYPE: bool DEFAULT: False

Example:

>>> open_as_file(Source.create("example.csv", optional=True))
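
Because an optional input that is not provided yields None instead of a stream, callers need to check for None inside the with block. The following is a minimal sketch of that pattern. It does not import kloppy: `open_optional` and `read_or_default` are hypothetical stand-ins that only mimic how `open_as_file` treats a `Source` whose data is None.

```python
import contextlib
from io import BytesIO
from typing import Optional


def open_optional(data: Optional[bytes], optional: bool = False):
    """Hypothetical stand-in for open_as_file(Source(...)); not kloppy's API."""
    if data is None:
        if optional:
            # Optional input that was not provided: the context yields None.
            return contextlib.nullcontext(None)
        raise ValueError("Input required but not provided.")
    return contextlib.nullcontext(BytesIO(data))


def read_or_default(data: Optional[bytes], default: bytes = b"") -> bytes:
    """Read the input if present, otherwise fall back to a default."""
    with open_optional(data, optional=True) as f:
        # f is None when the optional input is absent.
        return default if f is None else f.read()
```

With this pattern a loader can treat a missing optional file as "no data" rather than as an error, while a required input still fails loudly.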

data instance-attribute

data

optional class-attribute instance-attribute

optional = False

skip_if_missing class-attribute instance-attribute

skip_if_missing = False

create classmethod

create(input_, **kwargs)
Source code in kloppy/io.py
@classmethod
def create(cls, input_: Optional[FileOrPath], **kwargs):
    if isinstance(input_, Source):
        return replace(input_, **kwargs)
    return Source(data=input_, **kwargs)
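
The effect of the `isinstance` branch is that passing an existing `Source` to `create` copies it and overrides only the given flags, instead of nesting one wrapper inside another. A self-contained sketch of that behaviour follows; `SourceSketch` is a hypothetical stand-in that mirrors the fields shown above so the snippet runs without kloppy installed.

```python
from dataclasses import dataclass, replace
from typing import Any


@dataclass
class SourceSketch:
    """Hypothetical stand-in mirroring the fields of kloppy.io.Source."""

    data: Any
    optional: bool = False
    skip_if_missing: bool = False

    @classmethod
    def create(cls, input_, **kwargs):
        if isinstance(input_, SourceSketch):
            # Already wrapped: copy it and override only the given flags.
            return replace(input_, **kwargs)
        return cls(data=input_, **kwargs)


plain = SourceSketch.create("example.csv", optional=True)
rewrapped = SourceSketch.create(plain, skip_if_missing=True)
# rewrapped keeps data and optional from plain and adds skip_if_missing=True;
# plain itself is left unchanged because replace() returns a new object.
```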

open_as_file

open_as_file(input_, mode='rb')

Open a byte stream to/from the given input object.

The following input types are supported:
  • A string or pathlib.Path object representing a local file path.
  • A string representing a URL. It should start with 'http://' or 'https://'.
  • A string representing a path to a file in an Amazon S3 cloud storage bucket. It should start with 's3://'.
  • An XML or JSON string containing the data. The string should contain a '{' or '<' character. Otherwise, it will be treated as a file path.
  • A bytes object containing the data.
  • A buffered binary stream that inherits from io.BufferedIOBase.
  • A Source object that wraps any of the above input types.
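
The rule that separates inline data from file paths can be sketched as a small predicate. `is_inline_data` is a hypothetical helper written for illustration, not a kloppy function; it only restates the check described in the list above.

```python
def is_inline_data(value) -> bool:
    """Hypothetical helper: True if the input would be read as inline data."""
    if isinstance(value, bytes):
        # Bytes objects are always treated as the data itself.
        return True
    # Strings count as inline JSON/XML only if they contain '{' or '<'.
    return isinstance(value, str) and ("{" in value or "<" in value)


inline_json = is_inline_data('{"events": []}')    # True
inline_xml = is_inline_data("<match/>")           # True
plain_path = is_inline_data("data/match.json")    # False
odd_path = is_inline_data("data/{draft}.json")    # True
```

One consequence of this heuristic is that a file name containing '{' or '<' is read as inline data rather than opened as a path.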
PARAMETER DESCRIPTION
input_

The input/output object to be opened.

TYPE: FileLike

mode

File mode - 'rb' (read), 'wb' (write), or 'ab' (append). Defaults to 'rb'.

TYPE: str DEFAULT: 'rb'

RETURNS DESCRIPTION
BinaryIO

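A context manager that yields a binary stream to/from the input object, or None when an optional input is not provided or a missing input is skipped.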

TYPE: AbstractContextManager[Optional[BinaryIO]]

RAISES DESCRIPTION
ValueError

If the input is required but not provided, or if the mode is invalid.

InputNotFoundError

If the input file is not found and should not be skipped.

TypeError

If the input type is not supported.

NotImplementedError

If write mode is used with unsupported input types.

Example:

>>> # Reading
>>> with open_as_file("example.txt") as f:
...     contents = f.read()
>>>
>>> # Writing
>>> with open_as_file("output.txt", mode="wb") as f:
...     f.write(b"Hello, world!")
Note

To support reading data from other sources, see the Adapter class.

If the given file path or URL ends with '.gz', '.xz', or '.bz2', the file is automatically decompressed when reading and compressed when writing.

Write mode limitations:
  • HTTP/HTTPS URLs: Not supported
  • Inline strings/bytes: Not supported (invalid output destination)
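
Extension-based compression handling can be approximated with the standard library alone. The sketch below is an assumption about the general technique, not kloppy's implementation: `open_maybe_compressed` and the `_OPENERS` mapping are hypothetical names, and the file is written to a temporary directory so the snippet is self-contained.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Assumed mapping from extension to stdlib opener (illustrative only).
_OPENERS = {".gz": gzip.open, ".xz": lzma.open, ".bz2": bz2.open}


def open_maybe_compressed(path: str, mode: str = "rb"):
    """Hypothetical helper: choose an opener based on the file extension."""
    for ext, opener in _OPENERS.items():
        if path.endswith(ext):
            return opener(path, mode)
    return open(path, mode)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "events.json.gz")
    # Writing compresses transparently because of the '.gz' suffix.
    with open_maybe_compressed(target, "wb") as f:
        f.write(b'{"events": []}')
    # Reading decompresses transparently.
    with open_maybe_compressed(target, "rb") as f:
        round_trip = f.read()
    # The bytes on disk are gzip data, which starts with 0x1f 0x8b.
    with open(target, "rb") as f:
        magic = f.read(2)
```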

Source code in kloppy/io.py
def open_as_file(
    input_: FileLike,
    mode: str = "rb",
) -> AbstractContextManager[Optional[BinaryIO]]:
    """Open a byte stream to/from the given input object.

    The following input types are supported:
        - A string or `pathlib.Path` object representing a local file path.
        - A string representing a URL. It should start with 'http://' or
          'https://'.
        - A string representing a path to a file in an Amazon S3 cloud storage
          bucket. It should start with 's3://'.
        - An XML or JSON string containing the data. The string should contain
          a '{' or '<' character. Otherwise, it will be treated as a file path.
        - A bytes object containing the data.
        - A buffered binary stream that inherits from `io.BufferedIOBase`.
        - A [Source](`kloppy.io.Source`) object that wraps any of the above
          input types.

    Args:
        input_ (FileLike): The input/output object to be opened.
        mode (str): File mode - 'rb' (read), 'wb' (write), or 'ab' (append).
            Defaults to 'rb'.

    Returns:
        BinaryIO: A binary stream to/from the input object.

    Raises:
        ValueError: If the input is required but not provided, or if the mode is invalid.
        InputNotFoundError: If the input file is not found and should not be skipped.
        TypeError: If the input type is not supported.
        NotImplementedError: If write mode is used with unsupported input types.

    Example:

        >>> # Reading
        >>> with open_as_file("example.txt") as f:
        ...     contents = f.read()
        >>>
        >>> # Writing
        >>> with open_as_file("output.txt", mode="wb") as f:
        ...     f.write(b"Hello, world!")

    Note:
        To support reading data from other sources, see the
        [Adapter](`kloppy.io.adapters.Adapter`) class.

        If the given file path or URL ends with '.gz', '.xz', or '.bz2', the
        file will be automatically compressed/decompressed.

        Write mode limitations:
            - HTTP/HTTPS URLs: Not supported
            - Inline strings/bytes: Not supported (invalid output destination)
    """
    # 1. Handle Source wrapper logic first
    if isinstance(input_, Source):
        if input_.data is None:
            if input_.optional:
                return contextlib.nullcontext(None)
            raise ValueError("Input required but not provided.")

        try:
            return open_as_file(input_.data, mode=mode)
        except InputNotFoundError:
            if input_.skip_if_missing:
                logger.info(f"Input {input_.data} not found. Skipping")
                return contextlib.nullcontext(None)
            raise

    # 2. Validate input for Write Modes
    if mode in ("wb", "ab"):
        if isinstance(input_, str) and ("{" in input_ or "<" in input_):
            raise TypeError("Cannot write to inline JSON/XML string.")
        if isinstance(input_, bytes):
            raise TypeError(
                "Cannot write to bytes object. Use BytesIO instead."
            )

    # 3. Handle Inline Data (Read Mode)
    if mode == "rb":
        if isinstance(input_, str) and ("{" in input_ or "<" in input_):
            return contextlib.nullcontext(BytesIO(input_.encode("utf8")))
        if isinstance(input_, bytes):
            return contextlib.nullcontext(BytesIO(input_))

    # 4. Handle Adapter-based URIs/Paths
    # Check if input looks like a path or string URI
    if isinstance(input_, (str, os.PathLike)):
        uri = _filepath_from_path_or_filelike(input_)
        adapter = get_adapter(uri)

        if adapter:
            if mode == "rb":
                stream = BufferedStream()
                adapter.read_to_stream(uri, stream)
                stream.seek(0)
                return contextlib.nullcontext(stream)
            else:
                return _write_context_manager(uri, mode)

        # check if the uri is a string with adapter prefix
        elif isinstance(input_, str):
            prefix_match = re.match(r"^([a-zA-Z0-9+.-]+)://", input_)
            if prefix_match:
                raise AdapterError(
                    f"No adapter found for {prefix_match.group(1)}://"
                )

        # If no adapter found, fall through to standard _open (local file handling)

    # 5. Handle File Objects or Standard Local Files
    if (
        hasattr(input_, "readinto")
        or hasattr(input_, "write")
        or isinstance(input_, (str, os.PathLike))
    ):
        # --- Validation: Check mode compatibility for existing file objects ---
        if not isinstance(input_, (str, os.PathLike)):
            input_mode = getattr(input_, "mode", None)
            if input_mode and input_mode != mode:
                raise ValueError(
                    f"File opened in mode '{input_mode}' but '{mode}' requested"
                )

        # --- Processing: Open or wrap the input ---
        # _open handles:
        # 1. Opening paths
        # 2. Extracting binary buffers from TextIOWrapper
        # 3. Detecting compression (gzip, etc) and returning a Decompressor wrapper
        opened = _open(input_, mode)

        # --- Ownership: Decide if we should close the file on exit ---

        # Case A: We created a new wrapper (e.g. opened a path, or wrapped BytesIO in GzipFile)
        # We return the object directly so its __exit__ cleans up the wrapper.
        # Note: We check if `opened` is different from `input_` AND different from `input_.buffer`
        # (the latter handles the TextIOWrapper case where we don't want to close the wrapper).
        is_transformed = opened is not input_
        if hasattr(input_, "buffer"):
            is_transformed = is_transformed and opened is not input_.buffer

        if is_transformed:
            # Exception: If the original input was a file object, and _open returned a
            # compression wrapper (like GzipFile), closing GzipFile usually closes the
            # underlying file.
            return cast(AbstractContextManager, opened)

        # Case B: It is the exact same raw stream (e.g. plain BytesIO)
        # We wrap in nullcontext so we don't close the user's object.
        return contextlib.nullcontext(opened)

    raise TypeError(f"Unsupported input type: {type(input_)}")
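
The ownership rule in step 5 relies on how `contextlib.nullcontext` differs from using a stream as its own context manager. A short stand-alone sketch of that difference, using only the standard library:

```python
import contextlib
from io import BytesIO

buffer = BytesIO(b"payload")

# Caller-owned stream: nullcontext yields it and leaves it open on exit.
with contextlib.nullcontext(buffer) as f:
    first = f.read()
still_open = not buffer.closed

# Using the stream itself as the context manager closes it on exit.
with buffer as f:
    f.seek(0)
    second = f.read()
closed_after = buffer.closed
```

This is why a plain stream passed in by the caller remains usable after the with block, whereas a file that the function opened itself is closed when the block exits.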

get_file_extension

get_file_extension(file_or_path)

Determine the file extension of the given file-like object.

If the file has compression extensions such as '.gz', '.xz', or '.bz2', they will be stripped before determining the extension.

PARAMETER DESCRIPTION
file_or_path

The file-like object whose extension needs to be determined.

TYPE: FileLike

RETURNS DESCRIPTION
str

The file extension, including the leading dot ('.'), or an empty string if the path has no extension.

TYPE: str

RAISES DESCRIPTION
TypeError

If the extension cannot be determined.

Example:

>>> get_file_extension("example.xml.gz")
'.xml'
>>> get_file_extension(Path("example.txt"))
'.txt'
>>> get_file_extension(Source(data="example.csv"))
'.csv'
Source code in kloppy/io.py
def get_file_extension(file_or_path: FileLike) -> str:
    """Determine the file extension of the given file-like object.

    If the file has compression extensions such as '.gz', '.xz', or '.bz2',
    they will be stripped before determining the extension.

    Args:
        file_or_path (FileLike): The file-like object whose extension needs to be determined.

    Returns:
        str: The file extension, including the dot ('.') if present.

    Raises:
        TypeError: If the extension cannot be determined.

    Example:

        >>> get_file_extension("example.xml.gz")
        '.xml'
        >>> get_file_extension(Path("example.txt"))
        '.txt'
        >>> get_file_extension(Source(data="example.csv"))
        '.csv'
    """
    if isinstance(file_or_path, (str, bytes)) or hasattr(
        file_or_path, "__fspath__"
    ):
        path = os.fspath(file_or_path)  # type: ignore
        for ext in [".gz", ".xz", ".bz2"]:
            if path.endswith(ext):
                path = path[: -len(ext)]
        return os.path.splitext(path)[1]

    if isinstance(file_or_path, Source):
        return get_file_extension(file_or_path.data)

    raise TypeError(
        f"Could not determine extension for input type: {type(file_or_path)}"
    )
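
A few edge cases of the stripping loop are easier to see by running it. The sketch below repeats only the string branch of the function above in a hypothetical stand-in, `extension_sketch`, so it runs without kloppy.

```python
import os


def extension_sketch(path: str) -> str:
    """Stand-in repeating the string branch of get_file_extension."""
    for ext in (".gz", ".xz", ".bz2"):
        if path.endswith(ext):
            path = path[: -len(ext)]
    return os.path.splitext(path)[1]


typical = extension_sketch("events.json.gz")   # '.json'
only_gz = extension_sketch("archive.gz")       # '' (nothing left after stripping)
no_ext = extension_sketch("notes")             # ''
stacked = extension_sketch("data.gz.xz")       # '.gz'
```

Each suffix is checked once, in the order '.gz', '.xz', '.bz2', so a stacked name such as "data.gz.xz" loses only '.xz' and reports '.gz' as its extension.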