Apache Iceberg now supports geospatial data types natively

ZeroCool2u · 2025-02-15T14:44:48 1739630688

This is such a big deal. Trying to get stuff done with geospatial data in native formats in big companies has been somewhat painful for a while, because a lot of them tend to default to using Databricks and the Delta format.

At least now there's a great open source tech stack combo with Trino as your query engine (which is decidedly a lot less annoying to run at scale than OSS Spark) and Iceberg as your storage format that a lot of your "Enterprise Architecture" types can be comfortable enough with.

Mosarwat · 2025-02-16T01:08:47 1739668127

Please note that not all query engines support the native geo type in Iceberg yet. The first one to support it is Apache Sedona, which works well with Spark:

https://github.com/apache/sedona

However, the ultimate goal is to make more engines (e.g., Arrow, Trino...) support the geo type too.

tiems · 2025-02-15T15:10:20 1739632220

Since it’s getting added to the Parquet spec itself, it should hopefully make its way to Delta soon too.

ddkto · 2025-02-15T13:13:49 1739625229

Some more details: https://cloudnativegeo.org/blog/2025/02/geoparquet-2.0-going...

cratermoon · 2025-02-15T15:54:41 1739634881

This is almost what I want, as it mentions the data types GEOMETRY and GEOGRAPHY, but is there a pointer somewhere to the definitions of the types?

gloflo · 2025-02-15T16:10:36 1739635836

Click through the links on that page.

cratermoon · 2025-02-15T19:20:08 1739647208

I did. The links are giving me information about as useful as "a monad is a monoid in the category of endofunctors, what's the problem?".

In fact, your comment is about equally helpful. Ever heard the phrase "technically accurate but totally useless"? https://wiki.c2.com/?UselessTruth

dr-jia-yu · 2025-02-16T01:32:26 1739669546

Wherobots wrote a detailed tech review of the new types. You can find the definition and reasoning there: https://wherobots.com/iceberg-geo-technical-insights-and-imp...

cratermoon · 2025-02-16T15:00:50 1739718050

Better. I still feel like there's an element of explanations there nearly as opaque as "a monad is a monoid in the category of endofunctors", but I can mostly forgive it. Clearly the review is written for someone already familiar with GIS terms. For example, "The geometry type represents spatial objects in a planar space using Cartesian geometry, assuming all calculations, including distance and area measurements, are performed on a flat surface." says nothing that couldn't be better stated as "this is for flat-projection maps, like the Mercator".
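To make the planar-versus-curved distinction concrete, here is a minimal sketch in plain Python. It is not how Iceberg or any engine implements the types: the haversine formula and the mean Earth radius are simplifying assumptions chosen for illustration, and the function names are hypothetical. It only shows why the same pair of coordinates can give very different distances depending on whether they are treated as flat Cartesian points or as positions on a sphere.

```python
import math

# Assumption for this sketch: a spherical Earth with the mean radius in km.
EARTH_RADIUS_KM = 6371.0088


def planar_distance(p, q):
    """Euclidean distance in the coordinates' own units (flat-surface view)."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def spherical_distance_km(p, q):
    """Great-circle (haversine) distance in km between (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude, measured at the equator and at 60 degrees north.
equator = ((0.0, 0.0), (1.0, 0.0))
north = ((0.0, 60.0), (1.0, 60.0))

for name, (p, q) in (("equator", equator), ("60N", north)):
    print(name, planar_distance(p, q), round(spherical_distance_km(p, q), 1))
```

The planar view reports the same length (one "degree") for both segments, while the spherical view reports roughly half the distance at 60 degrees north, which is the kind of difference the two types exist to keep apart.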

dr-jia-yu · 2025-02-16T01:35:03 1739669703

A more detailed tech review (from Wherobots) of the two new geo types can be found here: Iceberg GEO: Technical Insights and Implementation Strategies

https://wherobots.com/iceberg-geo-technical-insights-and-imp...

bz_bz_bz · 2025-02-15T15:41:45 1739634105

Anyone know if this means Arrow could be next? I assume GeoParquet is no longer needed; I wonder what happens to GeoArrow.

Mosarwat · 2025-02-16T00:57:33 1739667453

GeoParquet will still be used for a bit, since the native geo types in Parquet and GeoParquet differ slightly. But the ultimate goal is to shift all data and workloads to the Parquet native geo types in a couple of years!

Mosarwat · 2025-02-16T00:59:28 1739667568

Now, Apache Sedona is the first engine that will support that native geo type in parquet, but Arrow will also support it very soon.

bitschubser_ · 2025-02-15T13:24:28 1739625868

I’m super excited about the planned support for multidimensional data. I know Zarr is there, but a long-term storage format other than NetCDF would be interesting, and maybe also something to replace GRIB, GeoTIFF, etc. for sharing files.

Mosarwat · 2025-02-16T01:02:40 1739667760

Besides Zarr, there are also efforts to support different types of raster (sort of multidimensional) data such as GeoTIFF and NetCDF. The Iceberg Geo spec was heavily influenced by the Havasu project proposed by Wherobots, which also supports that type of raster data. However, the Iceberg geo spec still supports only geometry for now.

https://wherobots.com/building-a-spatial-data-lakehouse/

tomnicholas1 · 2025-02-15T14:14:02 1739628842

Surely Zarr is already a long-term storage format for multidimensional data? It can even be mapped directly to netCDF, GRIB and geoTIFF via VirtualiZarr[0].

Also if you like Iceberg and you like arrays you will really like Icechunk[1], which is Version-controlled Zarr!

[0] https://github.com/zarr-developers/VirtualiZarr

[1] https://icechunk.io/en/latest/

bitschubser_ · 2025-02-15T15:10:19 1739632219

I know Icechunk and I’m a huge fan of Earthmover. But a common binary format like Parquet seems nice… with interop for e.g. DuckDB and geo queries, you can “just load” ERA5 and do something like get wind direction/speed along the following path for the last 5 years, grouped by day, etc.

lysecret · 2025-02-15T16:39:26 1739637566

If you know the exact tensor shape of your data ahead of time, Zarr works well (we use it as the data format for our ML experiments). If you have dynamically growing data or irregular shapes, Zarr doesn't work as well.

tomnicholas1 · 2025-02-15T19:53:32 1739649212

Icechunk can handle growing dimensions with ACID transactions!

For irregular shapes in some cases using multiple groups + xarray.DataTree can help you, but in general yeah ragged data is hard.
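A rough sketch in plain Python (deliberately not the Zarr or Icechunk API; `pad_ragged` and `chunk_index` are hypothetical helper names) of why regular shapes are the easy case. With a fixed rectangular chunk grid, the chunk holding any element follows from integer division alone. Ragged rows have no such grid, so one common workaround, assumed here, is to pad them to a rectangle with a fill value at the cost of storing empty cells.

```python
def pad_ragged(rows, fill=float("nan")):
    """Pad variable-length rows to a common width so they form a rectangle."""
    width = max((len(r) for r in rows), default=0)
    return [list(r) + [fill] * (width - len(r)) for r in rows]


def chunk_index(i, j, chunk_shape):
    """Locate the chunk containing element (i, j) on a regular 2-D chunk grid."""
    return (i // chunk_shape[0], j // chunk_shape[1])


ragged = [[1, 2, 3], [4], [5, 6]]
padded = pad_ragged(ragged, fill=0)
print(padded)
print(chunk_index(5, 7, (4, 4)))
```

Padding works when row lengths are similar; when they vary widely, most of the rectangle is fill, which is one reason ragged data stays awkward in array stores.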

Upitor · 2025-02-15T13:55:39 1739627739

Where do you see the upcoming support for multidim data? Link?

bitschubser_ · 2025-02-15T15:01:56 1739631716

> This is just the beginning of modernizing geospatial data storage. We’re already looking ahead to other types of geospatial data such as raster, point cloud, spatial indexes…

It's not far from raster to full multidimensional.