Apache Iceberg now supports geospatial data types natively (wherobots.com)
105 points by Mosarwat 10 months ago | 21 comments


This is such a big deal. Trying to get stuff done with geospatial data in native formats in big companies has been somewhat painful for a while, because a lot of them tend to default to Databricks and the Delta format.

At least now there's a great open source tech stack combo with Trino as your query engine (which is decidedly a lot less annoying to run at scale than OSS Spark) and Iceberg as your storage format that a lot of your "Enterprise Architecture" types can be comfortable enough with.


Please note that not all query engines support the native geo type in Iceberg yet. The first one to support it is Apache Sedona, which works well with Spark:

https://github.com/apache/sedona

However, the ultimate goal is for more engines (e.g., Arrow, Trino...) to support the geo type too.


Since it’s getting added to the Parquet spec itself, it should hopefully make its way to Delta soon too.



This is almost what I want, as it mentions the data types GEOMETRY and GEOGRAPHY, but is there a pointer somewhere to the definitions of the types?


Click through the links on that page.


I did. The links are giving me information about as useful as "a monad is a monoid in the category of endofunctors, what's the problem?".

In fact, your comment is about equally helpful. Ever heard the phrase "technically accurate but totally useless"? https://wiki.c2.com/?UselessTruth


Wherobots wrote a detailed tech review of the new types. You can find the definition and reasoning there: https://wherobots.com/iceberg-geo-technical-insights-and-imp...


Better. I still feel like there's an element of explanations there nearly as opaque as "a monad is a monoid in the category of endofunctors", but I can mostly forgive it. Clearly the review is written for someone already familiar with GIS terms. For example, "The geometry type represents spatial objects in a planar space using Cartesian geometry, assuming all calculations, including distance and area measurements, are performed on a flat surface." says nothing that couldn't be better stated as "this is for flat-projection maps, like the Mercator".
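To make the flat-versus-round distinction concrete, here is a minimal sketch in plain Python. It is not what Iceberg or any engine actually runs: the haversine formula and the mean Earth radius are stand-ins I picked to illustrate a "geography-style" calculation, and the function names are made up for this example. The point is only that the same pair of coordinates gives a different kind of answer depending on which model you assume.

```python
import math

# Assumption for this sketch: a spherical Earth with the mean radius below.
EARTH_RADIUS_KM = 6371.0088


def planar_distance(p, q):
    """GEOMETRY-style: plain Euclidean distance, in the coordinates' own units."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def great_circle_km(p, q):
    """GEOGRAPHY-style stand-in: haversine distance in km between (lon, lat) degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Two pairs of points, each 10 degrees of longitude apart.
equator = ((0.0, 0.0), (10.0, 0.0))
north = ((0.0, 60.0), (10.0, 60.0))

planar_equator = planar_distance(*equator)
planar_north = planar_distance(*north)
sphere_equator = great_circle_km(*equator)
sphere_north = great_circle_km(*north)

print(planar_equator, planar_north)   # identical on a flat plane
print(sphere_equator, sphere_north)   # the 60N pair is roughly half as far apart
```

On the flat plane both pairs are "10 units" apart, which is only meaningful if the coordinates are already in a projected system with real units. On the sphere, the pair at 60°N is about half the distance of the pair on the equator, which is the kind of difference the two types exist to keep straight.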


A more detailed tech review (from Wherobots) of the two new geo types can be found here: Iceberg GEO: Technical Insights and Implementation Strategies

https://wherobots.com/iceberg-geo-technical-insights-and-imp...


Anyone know if this means Arrow could be next? I assume GeoParquet is no longer needed; I wonder what happens to GeoArrow.


GeoParquet will still be used for a bit, since the native geo types in Parquet and GeoParquet have slight differences. But the ultimate goal is to shift all data and workloads to Parquet's native geo types in a couple of years!


For now, Apache Sedona is the first engine that will support the native geo type in Parquet, but Arrow will also support it very soon.


I’m super excited about the planned support for multidimensional data. I know Zarr is there, but a long-term storage format other than NetCDF would be interesting, and maybe also something to replace GRIB, GeoTIFF, etc. for sharing files.


Besides Zarr, there are also efforts to support different types of raster (sort of multidimensional) data such as GeoTIFF and NetCDF. The Iceberg Geo spec was heavily influenced by the Havasu project proposed by Wherobots, which also supports that type of raster data. However, the Iceberg geo spec still only supports geometry for now.

https://wherobots.com/building-a-spatial-data-lakehouse/


Surely Zarr is already a long-term storage format for multidimensional data? It can even be mapped directly to NetCDF, GRIB and GeoTIFF via VirtualiZarr[0].

Also, if you like Iceberg and you like arrays, you will really like Icechunk[1], which is version-controlled Zarr!

[0] https://github.com/zarr-developers/VirtualiZarr

[1] https://icechunk.io/en/latest/


I know Icechunk and I’m a huge fan of Earthmover. But a common binary format like Parquet seems nice… with interop for e.g. DuckDB and geo queries, you could “just load” ERA5 and do something like: get wind direction/speed along the following path for the last 5 years, grouped by day, etc.
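For a sense of what the "group by day" half of that would compute, here is a rough sketch in plain Python. Everything in it is hypothetical: the sample tuples stand in for rows you would get after filtering to points along the path, and I'm assuming the usual eastward/northward wind components (u, v) in m/s. The direction uses the meteorological convention (where the wind is coming from, in degrees clockwise from north).

```python
import math
from collections import defaultdict
from datetime import datetime


def wind_speed_dir(u, v):
    """Speed (m/s) and meteorological direction (degrees the wind blows FROM)."""
    speed = math.hypot(u, v)
    direction = (180.0 + math.degrees(math.atan2(u, v))) % 360.0
    return speed, direction


def daily_wind(samples):
    """Average u and v per calendar day, then derive speed/direction.

    Note this is the vector mean: it is not the same as averaging the speeds.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # day -> [sum_u, sum_v, count]
    for ts, u, v in samples:
        acc = sums[ts.date()]
        acc[0] += u
        acc[1] += v
        acc[2] += 1
    return {
        day: wind_speed_dir(su / n, sv / n)
        for day, (su, sv, n) in sorted(sums.items())
    }


# Hypothetical samples: (timestamp, u, v) already restricted to the path.
samples = [
    (datetime(2020, 1, 1, 0), 0.0, -5.0),   # blowing toward the south
    (datetime(2020, 1, 1, 12), 0.0, -3.0),
    (datetime(2020, 1, 2, 0), 4.0, 0.0),    # blowing toward the east
]

result = daily_wind(samples)
for day, (speed, direction) in result.items():
    print(day, round(speed, 2), round(direction, 1))
```

In a SQL engine this would be a spatial filter plus a GROUP BY on the date; the sketch only shows the arithmetic, not how any particular engine would run it.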


If you know the exact tensor shape of your data ahead of time, Zarr works well (we use it as the data format for our ML experiments). If you have dynamically growing data or irregular shapes, Zarr doesn't work as well.


Icechunk can handle growing dimensions with ACID transactions!

For irregular shapes, using multiple groups + xarray.DataTree can help in some cases, but in general, yeah, ragged data is hard.


Where do you see the upcoming support for multidim data? Link?


> This is just the beginning of modernizing geospatial data storage. We’re already looking ahead to other types of geospatial data such as raster, point cloud, spatial indexes…

It's not far from raster to full multidimensional.



