Getting Started
In this tutorial we will download a set of Landsat 8 image patches from a
region of central-west Brazil and save them as GeoTIFFs on your local machine.
By the end you will have real files you can open in QGIS, rasterio, or any GIS
tool — and everything you need to use geebeam for your work!
Note
This tutorial runs locally. We’ll keep the sample count small (10 patches) so everything finishes in under a minute or so. For info on running on DataFlow, see Scaling up with Dataflow.
What is geebeam?
geebeam builds and executes Apache Beam pipelines to download patches of
Earth Engine data. It automatically runs in parallel, can be run locally or
on Google Cloud Dataflow, and supports writing to various popular geospatial
machine learning/AI dataset formats, including GeoTIFFs, Tensorflow Datasets,
and WebDataset.
Prerequisites
Before starting you will need:
Python 3.10 or later
A Google account with Earth Engine access. Sign up at earthengine.google.com if you haven’t already. Approval is usually instant for existing Google accounts.
A Google Cloud project. Although this tutorial does not cost anything (as long as you meet Earth Engine’s Noncommercial/Research Use criteria), the Python Earth Engine API requires an active Google Cloud Project. You can create one at console.cloud.google.com.
Installation
pip install geebeam
To verify the install worked:
import geebeam
print(geebeam.__version__)
Authentication
geebeam uses the Earth Engine Python API, which needs a one-time authentication step:
earthengine authenticate
This opens a browser window to login and save credentials locally. You only need to do this once.
You also need to tell Earth Engine which Google Cloud project to use. The easiest way is to detect it from your environment:
import google.auth
PROJECT_ID = google.auth.default()[1] # reads from gcloud config
print(PROJECT_ID) # confirm it's the right project
Or just set it directly:
PROJECT_ID = "my-gcp-project"
Step 1: Define the image you want
geebeam downloads image “chips” — fixed-size pixel patches — from any
ee.Image you can build with the Earth Engine Python API. You tell geebeam
what to download; it handles the parallelism and file writing.
Here we’ll build a cloud-free Landsat 8 composite for 2023:
import ee
import geebeam
ee.Initialize(project=PROJECT_ID) # Uses PROJECT_ID from the last step
ls8_collection = ee.ImageCollection("LANDSAT/LC08/C02/T1_L2") \
.filterDate("2023-01-01", "2023-12-31")
.filter(ee.Filter.lt('CLOUD_COVER', 50))
ls8_composite = ls8_collection.median().select(
["SR_B2", "SR_B3", "SR_B4", "SR_B5", "SR_B6", "SR_B7"]
)
The image is not downloaded yet — this is just an EE graph definition (the “recipe” for
creating the image). Nothing leaves Google’s servers until the pipeline runs.
When that happens, geebeam will automatically serialize the graph definition and
send it to the workers to start downloading patches, all with one command!
Step 2: Run the pipeline
sample_and_run_pipeline() is the quickest way to get started. It
randomly samples n_sample points within a region, then downloads a patch
centered on each point.
geebeam.sample_and_run_pipeline(
image_list=[ls8_composite],
sampling_region=ee.Geometry.Rectangle(-55.0, -12.0, -50.0, -16.0),
n_sample=10,
patch_size=128, # 128 × 128 pixels per chip
scale=30, # 30 m/pixel (native Landsat resolution)
crs="EPSG:4326",
project=PROJECT_ID,
position='center',
output_path="./tutorial_output/",
validation_ratio=0.2, # 2 of the 10 patches go to the validation split
)
You will see some Apache Beam log lines scroll by. When it finishes the
tutorial_output/ directory will exist.
What do ``patch_size`` and ``scale`` mean?
Each chip covers 128 × 30 m = 3.84 km on a side. Increasing patch_size
gives more spatial context. Increasing scale also gives more spatial context
but at the cost of lower resolution/detail.
What does ``position`` do?
By default each sampling point is the center of its patch
(position="center"). You can change this to "top-left",
"top-right", "bottom-left", or "bottom-right" if your workflow
requires the coordinate to refer to a specific corner.
Why is ``ls8_composite`` wrapped in a list ``[]``?
image_list must be a Python list of ee.Image objects, even
when you only have one image. This is how geebeam knows how to split
bands across workers when needed (see EE’s 50 MB per-request limit). If you pass
a single ee.Image, geebeam will wrap it in a list and keep going
(with a warning). Currently, ee.ImageCollection objects are NOT supported.
Step 3: Inspect the output
tutorial_output/
├── train/
│ ├── 00000.tif
│ ├── 00001.tif
│ └── ... (8 patches)
├── validation/
│ ├── 00008.tif
│ └── 00009.tif (2 patches)
├── metadata-00000-of-#####.parquet
├── metadata-00001-of-#####.parquet
└── ...
Each .tif is a multi-band GeoTIFF “chip” containing all the bands from your image
list. The metadata Parquet files records the sampling location (x, y),
the patch origin (x_topleft, y_topleft), the split assignment, and the
file path for each chip. Each worker writes to its own separate parquet file,
but it’s easy to read them together later (see below).
Open a chip with rasterio to confirm it looks right (you may need to install matplotlib):
import rasterio
import matplotlib.pyplot as plt
import numpy as np
with rasterio.open("tutorial_output/train/00000.tif") as ds:
print("CRS:", ds.crs)
print("Bands:", ds.count)
print("Shape:", ds.height, "×", ds.width)
# Read RGB bands (B4=red, B3=green, B2=blue in Landsat 8 L2)
# Rasterio bands are 1-indexed (start at 1 instead of 0)
r, g, b = ds.read(3), ds.read(2), ds.read(1)
rgb = np.stack([r, g, b], axis=-1).astype(float)
rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min() + 1e-6) # stretch to [0, 1]
plt.imshow(rgb)
plt.title("Patch 00000 — RGB")
plt.axis("off")
plt.show()
You can also read the metadata table:
import pandas as pd
import glob
all_parquets = glob.glob('tutorial_output/metadata-*.parquet')
df = pd.concat([pd.read_parquet(p) for p in all_parquets])
print(df[["id", "x", "y", "x_topleft", "y_topleft", "split", "image_path"]])
Step 4. Adding a second image
With geebeam, downloading many datasets in a single pipeline is just
as easy as downloading one. The bands from every image in image_list
are stacked into the same chip/patch files:
# MapBiomas land-use / land-cover, same year
mb_lulc = (
ee.Image(
"projects/mapbiomas-public/assets/brazil/lulc/"
"collection10_1/mapbiomas_brazil_collection10_1_coverage_v1"
)
.select("classification_2023")
)
geebeam.sample_and_run_pipeline(
image_list=[ls8_composite, mb_lulc], # both images here
sampling_region=ee.Geometry.Rectangle(-55.0, -12.0, -50.0, -16.0),
n_sample=10,
patch_size=128,
scale=30,
crs="EPSG:4326",
project=PROJECT_ID,
output_path="./tutorial_output_with_lulc/",
validation_ratio=0.2,
)
The output TIFFs will now contain 7 bands (6 Landsat + 1 LULC).
What’s next
Other output formats — pass
output_type="webdataset"to produce a WebDataset tar oroutput_type="tfds"(requirespip install geebeam[tensorflow]) to write as a Tensorflow dataset. See Output Types for more info.Scaling up with DataFlow — geebeam makes it super easy to run large pipelines on Google Cloud Dataflow by providing pre-built Docker images. See Scaling up with Dataflow for more.
Custom and gridded sampling — use
run_pipeline()directly and pass your ownsampling_pointsas a Geopandas GeoDataFrame, Pandas DataFrame, or Earth Engine Feature Collection (but they must have point geometries). Or usegrid_and_run_pipeline()to sample on a regular grid rather than randomly for when you need complete spatial coverage without gaps. Seesamplerfor more info, including custom dataset splits (e.g. train/val/test).Split processing of large patches — Earth Engine limits single request sizes to 50 MB, but geebeam lets you split processing: EE’s 50 MB per-request limit
API reference — full parameter documentation for all three pipeline functions and the sampler module is on the
geebeampage.
Get Help / Contribute
If you run into a problem or have a question, please open an issue on GitHub. Check the existing issues first — your question may already be asked/answered.
Contributions are welcome. Some ways to help:
Report a bug or request a feature — open an issue describing what you encountered or what you’d like to see.
Add a new feature — browse the issue tracker for ideas, then submit a pull request against the main repository.
Write examples — new notebooks or scripts showing real-world usage are always appreciated.