Output Types
geebeam supports four output formats, selected with the output_type
argument. The right choice depends on your downstream use:
| Format | output_type | Best uses | Extra dependencies required? |
|---|---|---|---|
| GeoTIFF | "tiff" | GIS tools (rasterio, QGIS) | No |
| WebDataset | "webdataset" | PyTorch pipelines | No |
| TensorFlow Datasets | "tfds" | TensorFlow pipelines | Yes (geebeam[tensorflow]) |
| TFRecord | "tfrecord" | Automatic dataset stats | Yes (geebeam[tensorflow]) |
GeoTIFF (default)
Each patch is written as a multi-band GeoTIFF, with a Parquet file containing metadata for all patches.
geebeam.sample_and_run_pipeline(
    ...,
    output_type="tiff",  # this is the default
    output_path="./output/",
)
Output structure:
output/
├── train/
│   ├── 00000.tif
│   ├── 00001.tif
│   └── ...
├── validation/
│   └── ...
└── metadata-00000-of-00001.parquet
Each .tif contains all bands from all images in image_list, in the
order they were passed. The Parquet file has one row per patch with columns
id, x, y, x_topleft, y_topleft, split,
image_path, and any columns from extra_metadata.
Reading the output:
import pandas as pd
import rasterio

with rasterio.open("output/train/00000.tif") as ds:
    data = ds.read()  # shape: (n_bands, height, width)
    print(ds.descriptions)  # band names

df = pd.read_parquet("output/metadata-00000-of-00001.parquet")
GeoTIFF is the simplest format and often the best starting point.
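Because the bands are stacked in the order the images were passed, the bands belonging to one image can be selected by position. The following is a minimal sketch using only the standard library. The helper `band_slices` and the image names are hypothetical, and the band counts are an assumption you would replace with the real counts for your `image_list`.

```python
from itertools import accumulate


def band_slices(band_counts):
    """Map each image name to the slice of band indices it occupies.

    `band_counts` is an ordered mapping of (hypothetical) image names to the
    number of bands each contributes, in the same order as `image_list`.
    """
    names = list(band_counts)
    ends = list(accumulate(band_counts[name] for name in names))
    starts = [0] + ends[:-1]
    return {name: slice(start, end) for name, start, end in zip(names, starts, ends)}


# Hypothetical example: a 4-band optical image followed by a 2-band radar image.
slices = band_slices({"optical": 4, "radar": 2})
```

With `data = ds.read()` as in the snippet above, `data[slices["radar"]]` would then select the last two bands of the stack.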
WebDataset
Patches are written as sharded .tar archives in
WebDataset format. Each sample
inside a shard is a pair of files: a GeoTIFF ({id}.tif) and a JSON
metadata file ({id}.json).
geebeam.sample_and_run_pipeline(
    ...,
    output_type="webdataset",
    output_path="./output/",
)
Output structure:
output/
├── train-<worker-id>-000000.tar
├── train-<worker-id>-000001.tar
└── validation-<worker-id>-000000.tar
Each .tar contains alternating .tif and .json entries, one pair per
sample. The worker ID in the filename keeps shard names unique when multiple
Beam workers write in parallel.
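A shard is an ordinary tar archive, so its pairing of entries can be checked with the standard library alone. The sketch below groups members by file name stem, which is assumed to equal the sample id. The helper names, the demo shard, and the JSON fields in it are placeholders for illustration, not real geebeam output.

```python
import io
import json
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath


def group_shard_entries(fileobj):
    """Group the members of one shard by sample id (the file name stem)."""
    samples = defaultdict(dict)
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            path = PurePosixPath(member.name)
            payload = tar.extractfile(member).read()
            # Key each payload by its extension, e.g. "tif" or "json".
            samples[path.stem][path.suffix.lstrip(".")] = payload
    return dict(samples)


def make_demo_shard():
    """Build a tiny in-memory shard with placeholder bytes, for illustration only."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in [
            ("00000.tif", b"not-a-real-tiff"),
            ("00000.json", json.dumps({"id": "00000", "split": "train"}).encode()),
        ]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


samples = group_shard_entries(make_demo_shard())
metadata = json.loads(samples["00000"]["json"])
```

For a real shard, open the file in binary mode and pass the handle to `group_shard_entries`. A sample that lacks either the "tif" or the "json" key would indicate an incomplete pair.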
Reading the output:
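import glob
import io

import rasterio
import webdataset as wds

# Expand the wildcard explicitly and pass a list of shard paths.
shards = sorted(glob.glob("output/train-*.tar"))
dataset = wds.WebDataset(shards).decode()

for sample in dataset:
    tif_bytes = sample["tif"]  # raw GeoTIFF bytes
    metadata = sample["json"]  # already decoded to a dict
    with rasterio.open(io.BytesIO(tif_bytes)) as ds:
        data = ds.read()  # shape: (n_bands, height, width)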
WebDataset works well with PyTorch DataLoader and is a good choice for
large-scale training pipelines that stream data directly from Google Cloud
Storage without downloading it locally.
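One way to stream shards from a bucket is WebDataset's `pipe:` URL scheme, which reads a shard from the standard output of a shell command. The sketch below only builds such URLs. The helper `gcs_pipe_urls`, the bucket name, and the worker ID are hypothetical, and it assumes `gsutil` is installed and authenticated on the machine that reads the data.

```python
import shlex


def gcs_pipe_urls(gcs_urls):
    """Wrap gs:// shard URLs in WebDataset's `pipe:` scheme.

    Each resulting URL asks WebDataset to read the shard from the standard
    output of `gsutil cat`, so the archive is streamed rather than saved.
    """
    return [f"pipe:gsutil cat {shlex.quote(url)}" for url in gcs_urls]


# Hypothetical bucket and worker ID; list your own shards first, for example
# with `gsutil ls gs://my-bucket/output/train-*.tar`.
urls = gcs_pipe_urls(["gs://my-bucket/output/train-w0-000000.tar"])
```

The resulting list can be passed to `wds.WebDataset(urls)` in place of local shard paths.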
TensorFlow Datasets
Patches are written as a
TensorFlow Datasets (TFDS) custom
dataset. This format integrates directly with tf.data and the TFDS
catalogue.
Install:
pip install geebeam[tensorflow]
Usage:
geebeam.sample_and_run_pipeline(
    ...,
    output_type="tfds",
    output_path="./output/",
    dataset_name="my_dataset",  # used as the TFDS dataset name
    dataset_version="1.0.0",  # must be a valid semver string
)
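Since `dataset_version` must be a valid version string, it can help to check the value before launching a long pipeline. The following is a conservative sketch, not geebeam's own validation (which is not shown here). The helper `is_plain_semver` is hypothetical and accepts only the plain MAJOR.MINOR.PATCH form, so it rejects pre-release suffixes that full semver would allow.

```python
import re

# Three dot-separated non-negative integers without leading zeros, e.g. "1.0.0".
_VERSION_PATTERN = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def is_plain_semver(version):
    """Return True if `version` looks like MAJOR.MINOR.PATCH."""
    return _VERSION_PATTERN.fullmatch(version) is not None


version_ok = is_plain_semver("1.0.0")
```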
Reading the output:
import tensorflow_datasets as tfds
ds = tfds.load("my_dataset", data_dir="./output/", split="train")
for example in ds.take(1):
    print(example.keys())
The TFDS format is the best choice for TensorFlow-native training pipelines.
It handles split management, shuffling, and prefetching automatically through
the standard tf.data API.
TFRecord
Warning
TFRecord output is not recommended for most use cases. Use
"tiff" or "tfds" instead. Choose TFRecord only if you
specifically need to compute dataset statistics with
TensorFlow Data Validation
(TFDV); that is the only thing this format provides over "tfds".
The "tfrecord" output type has several drawbacks: there is no standard
loading API, the files are harder to inspect, and your parsing code must be
kept in sync with the schema semi-manually.
Install:
pip install geebeam[tensorflow]
Usage:
geebeam.sample_and_run_pipeline(
    ...,
    output_type="tfrecord",
    output_path="./output/",
)
Output structure:
output/
├── train/
│   └── *.tfrecord
├── validation/
│   └── *.tfrecord
├── schema.json       ← feature names and types
└── stats.tfrecord    ← TFDV statistics (training split only)
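The layout above can be checked after a run without installing TensorFlow. This sketch lists the record files per split and reports whether the schema and statistics files are present. The helper `summarize_tfrecord_output` is hypothetical, and the demo directory uses a made-up record file name.

```python
import tempfile
from pathlib import Path


def summarize_tfrecord_output(root):
    """List .tfrecord files per split and flag the schema and stats files."""
    root = Path(root)
    splits = {
        split_dir.name: sorted(p.name for p in split_dir.glob("*.tfrecord"))
        for split_dir in sorted(root.iterdir())
        if split_dir.is_dir()
    }
    return {
        "splits": splits,
        "has_schema": (root / "schema.json").is_file(),
        "has_stats": (root / "stats.tfrecord").is_file(),
    }


# Demo on a throwaway directory that mimics part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "train").mkdir()
    (demo_root / "train" / "part-0.tfrecord").touch()  # hypothetical file name
    (demo_root / "schema.json").write_text("{}")
    summary = summarize_tfrecord_output(demo_root)
```

On real output, call `summarize_tfrecord_output("./output/")`. A missing `stats.tfrecord` would suggest the statistics step did not complete.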
The pipeline automatically computes TFDV statistics over the training split
and writes them alongside the records. This is the main reason to choose
this format: it suits workflows that validate feature distributions, detect
anomalies, or generate a data schema for a TFX pipeline. If you do not need
TFDV stats, "tfds" makes it much easier to read and use the data after download.