Output Types

geebeam supports four output formats, selected with the output_type argument. The right choice depends on your downstream use:

| Format | output_type | Best uses | Extra dependencies required? |
|---|---|---|---|
| GeoTIFF | "tiff" (default) | GIS tools (rasterio, QGIS) | No |
| WebDataset | "webdataset" | PyTorch pipelines | No |
| TensorFlow Datasets | "tfds" | TensorFlow pipelines | Yes (geebeam[tensorflow]) |
| TFRecord | "tfrecord" | Automatic dataset stats | Yes (geebeam[tensorflow]) |
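The dependency column can be checked before a long pipeline run starts. The sketch below is one way to do that, assuming the geebeam[tensorflow] extra installs an importable `tensorflow` module. The helper `check_output_type` and the `REQUIRED_EXTRA` mapping are hypothetical names for illustration and are not part of geebeam.

```python
import importlib.util

# output_type values from the table, mapped to the pip extra each one needs
# (None means no extra). Hypothetical mapping, not a geebeam API.
REQUIRED_EXTRA = {
    "tiff": None,
    "webdataset": None,
    "tfds": "tensorflow",
    "tfrecord": "tensorflow",
}


def check_output_type(output_type, module_available=None):
    """Return output_type, or raise early if it is unknown or its extra looks missing."""
    if output_type not in REQUIRED_EXTRA:
        raise ValueError(f"unknown output_type: {output_type!r}")
    extra = REQUIRED_EXTRA[output_type]
    if extra is None:
        return output_type
    if module_available is None:
        # Assumption: the extra provides a top-level module with the same name.
        module_available = importlib.util.find_spec(extra) is not None
    if not module_available:
        raise ImportError(
            f"output_type={output_type!r} needs: pip install geebeam[{extra}]"
        )
    return output_type
```

Calling `check_output_type("tfds")` before `sample_and_run_pipeline` would surface a missing install immediately rather than partway through a run.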

GeoTIFF (default)

Each patch is written as a multi-band GeoTIFF, with a Parquet file containing metadata for all patches.

geebeam.sample_and_run_pipeline(
    ...,
    output_type="tiff",   # this is the default
    output_path="./output/",
)

Output structure:

output/
├── train/
│   ├── 00000.tif
│   ├── 00001.tif
│   └── ...
├── validation/
│   └── ...
└── metadata-00000-of-00001.parquet

Each .tif contains all bands from all images in image_list, in the order they were passed. The Parquet file has one row per patch with columns id, x, y, x_topleft, y_topleft, split, image_path, and any columns from extra_metadata.
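Because bands are stacked in image_list order, you can recover each image's bands from the array returned by `ds.read()` if you know how many bands each image contributed. The sketch below shows the index arithmetic only; the helper `band_slices` and the image names and band counts are hypothetical, not something geebeam provides.

```python
def band_slices(band_counts):
    """Map each image name to the slice of band indices it occupies.

    band_counts: list of (image_name, n_bands) pairs, in image_list order.
    """
    slices = {}
    start = 0
    for name, n_bands in band_counts:
        slices[name] = slice(start, start + n_bands)
        start += n_bands
    return slices


# Hypothetical stack: a 4-band optical image followed by a 1-band elevation image.
slices = band_slices([("optical", 4), ("elevation", 1)])
# With data of shape (n_bands, height, width), data[slices["elevation"]]
# would select the elevation band(s).
```

Checking `ds.descriptions` against the computed slices is a cheap way to confirm that the assumed band counts match the file.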

Reading the output:

import rasterio
import pandas as pd

with rasterio.open("output/train/00000.tif") as ds:
    data = ds.read()          # shape: (n_bands, height, width)
    print(ds.descriptions)    # band names

df = pd.read_parquet("output/metadata-00000-of-00001.parquet")

GeoTIFF is the simplest format and often the best starting point.

WebDataset

Patches are written as sharded .tar archives in WebDataset format. Each sample inside a shard is a pair of files: a GeoTIFF ({id}.tif) and a JSON metadata file ({id}.json).

geebeam.sample_and_run_pipeline(
    ...,
    output_type="webdataset",
    output_path="./output/",
)

Output structure:

output/
├── train-<worker-id>-000000.tar
├── train-<worker-id>-000001.tar
└── validation-<worker-id>-000000.tar

Each .tar contains alternating .tif and .json entries. The worker ID in the filename keeps shard names unique when multiple Beam workers write in parallel.
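To check a shard without installing webdataset, the .tif/.json pairing can be reproduced with the standard library. This is a minimal sketch, assuming entries are named {id}.tif and {id}.json as described above. The helper `iter_samples` is hypothetical, and the demo shard holds placeholder bytes and made-up metadata fields rather than real geebeam output.

```python
import io
import json
import tarfile


def iter_samples(tar_file):
    """Yield (sample_id, tif_bytes, metadata) from an open tarfile.TarFile."""
    pending = {}
    for member in tar_file:
        if not member.isfile():
            continue
        # Split "00000.tif" into the sample id and the extension.
        key, _, ext = member.name.rpartition(".")
        parts = pending.setdefault(key, {})
        parts[ext] = tar_file.extractfile(member).read()
        if "tif" in parts and "json" in parts:
            del pending[key]
            yield key, parts["tif"], json.loads(parts["json"])


def _add(tar, name, payload):
    """Append an in-memory file to a tar archive."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


# Build a tiny in-memory shard for demonstration purposes.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    _add(tar, "00000.tif", b"placeholder bytes, not a real GeoTIFF")
    _add(tar, "00000.json", json.dumps({"id": "00000", "split": "train"}).encode())

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tar:
    samples = list(iter_samples(tar))
```

For a real shard, replace the in-memory buffer with `tarfile.open(path)` on one of the train-*.tar files.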

Reading the output:

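import glob
import io

import rasterio
import webdataset as wds

# WebDataset expands brace patterns but not "*" wildcards,
# so resolve the shard list with glob first.
shards = sorted(glob.glob("output/train-*.tar"))

dataset = (
    wds.WebDataset(shards)
    .decode()   # default handlers decode .json; .tif stays as raw bytes
)

for sample in dataset:
    tif_bytes = sample["tif"]
    metadata  = sample["json"]   # already decoded to a dict
    with rasterio.open(io.BytesIO(tif_bytes)) as ds:
        data = ds.read()         # shape: (n_bands, height, width)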

WebDataset works well with PyTorch DataLoader and is a good choice for large-scale training pipelines that stream data directly from Google Cloud Storage without downloading it locally.

TensorFlow Datasets

Patches are written as a TensorFlow Datasets (TFDS) custom dataset. This format integrates directly with tf.data and the TFDS catalogue.

Install:

pip install geebeam[tensorflow]

Usage:

geebeam.sample_and_run_pipeline(
    ...,
    output_type="tfds",
    output_path="./output/",
    dataset_name="my_dataset",     # used as the TFDS dataset name
    dataset_version="1.0.0",       # must be a valid semver string
)
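A malformed dataset_version is easy to catch before the pipeline starts. The sketch below checks for the plain MAJOR.MINOR.PATCH form used in the example above. It is deliberately stricter than full semver (no pre-release or build suffixes), and whether geebeam accepts those suffixes is not stated here. The helper name `is_valid_version` is hypothetical.

```python
import re

# Three dot-separated groups of digits, e.g. "1.0.0".
_VERSION_RE = re.compile(r"\d+\.\d+\.\d+")


def is_valid_version(version):
    """Return True if version has the plain MAJOR.MINOR.PATCH form."""
    return _VERSION_RE.fullmatch(version) is not None
```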

Reading the output:

import tensorflow_datasets as tfds

ds = tfds.load("my_dataset", data_dir="./output/", split="train")
for example in ds.take(1):
    print(example.keys())

The TFDS format is the best choice for TensorFlow-native training pipelines. Splits are addressed by name, and tfds.load returns a tf.data.Dataset, so shuffling, batching, and prefetching all go through the standard tf.data API.

TFRecord

Warning

TFRecord output is not recommended for most use cases. Use "tiff" or "tfds" instead. Choose TFRecord only if you specifically need to compute dataset statistics with TensorFlow Data Validation (TFDV), which is the only thing this format provides over "tfds". The "tfrecord" output_type has several drawbacks: there is no standard loading API, the files are harder to inspect, and parsing code must be coupled to the schema semi-manually.

Install:

pip install geebeam[tensorflow]

Usage:

geebeam.sample_and_run_pipeline(
    ...,
    output_type="tfrecord",
    output_path="./output/",
)

Output structure:

output/
├── train/
│   └── *.tfrecord
├── validation/
│   └── *.tfrecord
├── schema.json      ← feature names and types
└── stats.tfrecord   ← TFDV statistics (training split only)

The pipeline automatically computes TFDV statistics over the training split and writes them alongside the records. This is the main reason to choose this format: it fits when you want to validate feature distributions, detect anomalies, or generate a data schema for a TFX pipeline. If you do not need TFDV stats, "tfds" makes the data much easier to read and use after download.
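Since this format has no standard loading API, a quick look at the raw records can help when debugging. The sketch below walks the standard TFRecord framing (an 8-byte little-endian length, a 4-byte CRC, the payload, then another 4-byte CRC) using only the standard library. It assumes uncompressed files, skips CRC verification, and yields raw bytes without parsing them. The helper `iter_tfrecords` is hypothetical, and the demo data is synthetic.

```python
import io
import struct


def iter_tfrecords(stream):
    """Yield raw record payloads from a binary stream in TFRecord framing."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of file
        if len(header) < 8:
            raise ValueError("truncated record header")
        (length,) = struct.unpack("<Q", header)
        stream.read(4)  # masked CRC of the length, not verified here
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record payload")
        stream.read(4)  # masked CRC of the payload, not verified here
        yield payload


def _frame(payload):
    """Frame a payload for the demo; CRC fields are zero-filled placeholders."""
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


# Synthetic two-record stream standing in for a real .tfrecord file.
demo = io.BytesIO(_frame(b"first") + _frame(b"second"))
records = list(iter_tfrecords(demo))
```

For a real file, pass `open(path, "rb")` instead of the in-memory stream. Decoding each payload still requires the feature names and types recorded in schema.json.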