This is such a big deal. Getting things done with geospatial data in native formats at big companies has been painful for a while, because a lot of them default to Databricks and the Delta format.
At least now there's a great open source tech stack combo with Trino as your query engine (which is decidedly a lot less annoying to run at scale than OSS Spark) and Iceberg as your storage format that a lot of your "Enterprise Architecture" types can be comfortable enough with.
Please note that not all query engines support the native geo type in Iceberg yet. The first one to support it is Apache Sedona, which works well with Spark.
Better. I still feel like some of the explanations there are nearly as opaque as "a monad is a monoid in the category of endofunctors",
but I can mostly forgive it.
Clearly the review is written for someone already familiar with GIS terms.
For example,
"The geometry type represents spatial objects in a planar space using Cartesian geometry, assuming all calculations, including distance and area measurements, are performed on a flat surface."
says nothing that couldn't be better stated as "this is for flat-projection maps, like the Mercator".
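To make the flat-vs-round distinction concrete, here is a rough sketch in plain Python. It is not taken from the Iceberg spec or any engine's implementation; the function names, the sample points, and the Earth radius constant are my own assumptions. It compares a planar (Cartesian) distance with a great-circle (haversine) distance for two pairs of points that are each one degree of longitude apart.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius for this sketch

def planar_distance(p, q):
    """Euclidean distance, treating (x, y) as flat Cartesian coordinates."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def great_circle_km(p, q):
    """Haversine distance in km, treating (lon, lat) in degrees as points on a sphere."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Two pairs of points, each one degree of longitude apart.
equator = ((0.0, 0.0), (1.0, 0.0))
north = ((0.0, 60.0), (1.0, 60.0))

# Planar math sees both pairs as exactly 1.0 unit apart.
print(planar_distance(*equator), planar_distance(*north))
# Spherical math gives roughly 111 km at the equator and roughly 56 km at 60°N.
print(great_circle_km(*equator), great_circle_km(*north))
```

The planar result is only meaningful in whatever units the projection uses, which is the practical reason the choice between a flat and a round model matters for distance and area.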
GeoParquet will still be used for a bit, since the native geo types in Parquet and GeoParquet have slight differences. But the ultimate goal is to shift all data and workloads to Parquet's native geo types in a couple of years!
I’m super excited about the planned support for multidimensional data. I know Zarr is there, but a long-term storage format other than NetCDF would be interesting, and maybe also something to replace GRIB, GeoTIFF, etc. for sharing files.
Besides Zarr, there are also efforts to support different types of raster (sort of multidimensional) data such as GeoTIFF and NetCDF. The Iceberg Geo spec was heavily influenced by the Havasu project proposed by Wherobots, which also supports that type of raster data. However, the Iceberg geo spec still only supports geometry for now.
Surely Zarr is already a long-term storage format for multidimensional data? It can even be mapped directly to NetCDF, GRIB and GeoTIFF via VirtualiZarr[0].
Also if you like Iceberg and you like arrays you will really like Icechunk[1], which is Version-controlled Zarr!
I know Icechunk and I’m a huge fan of Earthmover. But a common binary format like Parquet seems nice… with interop for e.g. DuckDB and geo queries, you could “just load” ERA5 and do something like: get wind direction/speed along the following path for the last 5 years, grouped by day, etc…
If you know the exact tensor shape of your data ahead of time, Zarr works well (we use it as the data format for our ML experiments). If you have dynamically growing data or irregular shapes, Zarr doesn't work as well.
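For what it's worth, the "wind direction/speed grouped by day" part of that query could look roughly like the plain-Python sketch below. Everything here is illustrative: the sample rows are made up, `daily_wind` is a hypothetical helper, and I'm assuming the wind arrives as eastward/northward components (u, v) in m/s. It averages the components per day first and only then derives speed and direction, since averaging compass angles directly goes wrong around north.

```python
import math
from collections import defaultdict
from datetime import datetime

# Made-up samples: (timestamp, u, v) wind components in m/s at points along a path.
samples = [
    (datetime(2024, 1, 1, 0), 3.0, 4.0),
    (datetime(2024, 1, 1, 12), 5.0, 0.0),
    (datetime(2024, 1, 2, 0), 0.0, -10.0),
]

def daily_wind(rows):
    """Average u and v per calendar day, then derive speed and direction."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # day -> [sum_u, sum_v, count]
    for ts, u, v in rows:
        acc = sums[ts.date()]
        acc[0] += u
        acc[1] += v
        acc[2] += 1
    result = {}
    for day, (su, sv, n) in sorted(sums.items()):
        u, v = su / n, sv / n
        speed = math.hypot(u, v)
        # Meteorological convention: the direction the wind blows FROM,
        # in degrees clockwise from north.
        direction = math.degrees(math.atan2(-u, -v)) % 360
        result[day] = (speed, direction)
    return result

for day, (speed, direction) in daily_wind(samples).items():
    print(day, round(speed, 2), round(direction, 1))
```

In an engine with geo support, the "along the following path" part would be a spatial filter applied before this aggregation; that step is left out here.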
> This is just the beginning of modernizing geospatial data storage. We’re already looking ahead to other types of geospatial data such as raster, point cloud, spatial indexes…