How to Build a Geospatial Lakehouse, Part 2

In Part 1 of this two-part series on how to build a Geospatial Lakehouse, we introduced a reference architecture and design principles to consider when building a Geospatial Lakehouse. The Lakehouse paradigm combines the best elements of data lakes and data warehouses. It simplifies and standardizes data engineering pipelines for the enterprise, based on the same design pattern. Structured, semi-structured, and unstructured data are managed under one system, effectively eliminating data silos.

In Part 2, we focus on the practical considerations and provide guidance to help you implement them. We present an example reference implementation with sample code to get you started.

Design Guidelines

To realize the benefits of the Databricks Geospatial Lakehouse for processing, analyzing, and visualizing geospatial data, you will need to:

  1. Define and break down your geospatial-driven problem. What problem are you solving? Are you analyzing and/or modeling in-situ location data (e.g., map vectors aggregated with satellite TIFFs) to aggregate with, for example, time-series data (weather, soil information)? Are you seeking insights into or modeling movement patterns across geolocations (e.g., device pings at points of interest between residential and commercial locations) or multi-party relationships between these? Depending on your workload, each use case will require different underlying geospatial libraries and tools to process, query/model and render your insights and predictions.
  2. Decide on data format standards. Databricks recommends the Delta Lake format, based on the open Apache Parquet format, for your geospatial data. Delta comes with data skipping and Z-ordering, which are particularly well suited for geospatial indexing (such as geohashing, hexagonal indexing), bounding-box min/max x/y generated columns, and geometries (such as those generated by Sedona, GeoMesa). A shortlist of these standards will help you better understand the minimal viable pipeline needed.
  3. Know and scope the volumes, timeframes and use cases required for:
    • raw data and data processing at the Bronze layer
    • analytics at the Silver and Gold layers
    • modeling at the Gold layers and beyond

    Geospatial analytics and modeling performance and scale depend greatly on format, transforms, indexing and metadata decoration. Data windowing can be applicable to geospatial and other use cases, where windowing and/or querying across broad timeframes overcomplicates your work without any analytics/modeling value and/or performance benefit. Geospatial data is rife with enough challenges around frequency, volume, and the lifecycle of formats throughout the data pipeline, without adding very expensive, grossly inefficient extractions across these.

  4. Select from a shortlist of recommended libraries, technologies and tools optimized for Apache Spark: those targeting your data format standards together with the defined problem set(s) to be solved. Consider whether the data volumes being processed in each stage and run of your data analytics and modeling can fit into memory. Consider what types of queries you will need to run (e.g., range, spatial join, kNN, kNN join, etc.) and what types of training and production algorithms you will need to execute, together with Databricks recommendations, to understand and choose how best to support these.
  5. Define, design and implement the logic to process your multi-hop pipeline. For example, with your Bronze tables for mobility and POI data, you can generate geometries from your raw data and decorate them with a first-order partitioning schema (such as a suitable "region" superset of postal code/district/US-county, subset of province/US-state) together with secondary/tertiary partitioning (such as a hexagonal index). With Silver tables, you can focus on further orders of partitioning, applying Z-ordering, and further optimizing with Delta OPTIMIZE + VACUUM. For Gold, you can consider data coalescing, windowing (where applicable, and across shorter, contiguous timeframes), and LOB segmentation together with further Delta optimizations specific to these tighter data sets. You also may find you need an additional post-processing layer for your Line of Business (LOB) or data science/ML users. With each layer, validate these optimizations and understand their applicability.
  6. Leverage Databricks SQL Analytics for the top-layer consumption of your Geospatial Lakehouse.
  7. Define the orchestration to drive your pipeline, with idempotency in mind. Start with a simple notebook that calls the notebooks implementing your raw data ingestion, Bronze=>Silver=>Gold layer processing, and any post-processing needed. Ensure that any component of your pipeline can be idempotently executed and debugged. Elaborate from there only as necessary. Integrate your orchestrations into your management, monitoring and CI/CD ecosystem as simply and minimally as possible.
  8. Apply the distributed programming observability paradigm – the Spark UI, MLflow experiments, Spark and MLflow logs, metrics, and still more logs – for troubleshooting issues. If you have applied the previous step correctly, this is a straightforward process. There is no "easy button" to magically solve issues in distributed processing; you need good old-fashioned distributed software debugging, reading logs, and using other observability tools. Databricks provides self-paced and instructor-led training to guide you if needed.
    From here, configure your end-to-end data and ML pipeline to monitor these logs, metrics, and other observability data, and to reflect on and report them. There is more depth on these topics available in the Databricks Machine Learning blog, including Drifting Away: Testing ML Models in Production and AutoML Toolkit – Deep Dive from the 2021 Data + AI Summit.
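As a minimal, library-agnostic sketch of the idempotent orchestration described in step 7 (the step functions and the completed-steps ledger here are hypothetical placeholders; on Databricks each step would typically be a notebook invoked from the driver notebook, and the ledger a Delta table or checkpoint file):

```python
completed = set()  # stand-in for a durable ledger (e.g., a Delta table)

def run_step(name, fn):
    """Run a pipeline step exactly once; re-invocation is a no-op."""
    if name in completed:
        return "skipped"
    fn()
    completed.add(name)
    return "ran"

# placeholder step bodies; in practice each would call a notebook
def ingest_raw():       pass  # raw -> Bronze
def bronze_to_silver(): pass  # Bronze -> Silver
def silver_to_gold():   pass  # Silver -> Gold

pipeline = [("ingest_raw", ingest_raw),
            ("bronze_to_silver", bronze_to_silver),
            ("silver_to_gold", silver_to_gold)]

first_run = [run_step(name, fn) for name, fn in pipeline]
second_run = [run_step(name, fn) for name, fn in pipeline]  # safe to retry
```

Because every component is safe to re-execute, a failed run can simply be retried end to end without duplicating work.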

Implementation considerations

Data pipeline

For your Geospatial Lakehouse, in the Bronze layer, we recommend landing raw data in its "original fidelity" format, then standardizing this data into the most workable format, cleansing and then decorating the data to best utilize Delta Lake's data skipping and compaction optimization capabilities. In the Silver layer, we then incrementally process pipelines that load and join high-cardinality data, apply multi-dimensional cluster and grid indexing, and decorate the data further with relevant metadata to support highly performant queries and effective data management. These are the prepared tables/views of effectively queryable geospatial data in a standard, agreed taxonomy. For Gold, we provide segmented, highly refined data sets from which data scientists develop and train their models and data analysts glean their insights, optimized specifically for their use cases. These tables carry LOB-specific data for purpose-built solutions in data science and analytics.

Putting this together for your Databricks Geospatial Lakehouse: there is a progression from raw, easily transportable formats to highly optimized, manageable, multidimensionally clustered and indexed formats that are most easily queryable and accessible to end users.


Given the plurality of business questions that geospatial data can answer, it is important that you choose the technologies and tools that best serve your requirements and use cases. To best inform these choices, you must evaluate the types of geospatial queries you plan to perform.

The principal geospatial query types include:

  • Range-search query
  • Spatial-join query
  • Spatial k-nearest-neighbor query (kNN query)
  • Spatial k-nearest-neighbor join query (kNN-join query)
  • Spatio-textual operations

Libraries such as GeoSpark/Sedona support range-search, spatial-join and kNN queries (with the help of UDFs), while GeoMesa (with Spark) and LocationSpark support range-search, spatial-join, kNN and kNN-join queries.
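To make the query types concrete, here is a library-agnostic sketch of what a spatial kNN query computes, in plain Python with a brute-force scan (the distributed libraries above answer the same question with spatial indexing rather than a full scan; the sample POI coordinates are illustrative):

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def knn(query, points, k):
    """Spatial kNN: the k points nearest to `query`."""
    return sorted(points, key=lambda p: haversine_km(query, p))[:k]

pois = [(40.7831, -73.9712),   # Upper West Side
        (40.7484, -73.9857),   # Empire State Building
        (38.8977, -77.0365)]   # White House
nearest = knn((40.7589, -73.9851), pois, k=2)  # query point near Times Square
```

A range-search query is the same scan with a distance threshold instead of a top-k cut; a kNN join repeats the kNN query for every row of a second data set.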


It is a well-established pattern that data is first queried coarsely to determine broader trends. This is followed by querying in a finer-grained manner so as to isolate everything from data hotspots to machine learning model features.
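A minimal sketch of this coarse-then-fine pattern, assuming a hypothetical precomputed cell index: a cheap first pass keeps only points whose grid cell overlaps the polygon of interest, and only those survivors pay for an exact point-in-polygon test (ray casting):

```python
def point_in_polygon(pt, polygon):
    """Exact ray-casting test; polygon is a list of (x, y) vertices."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Coarse pass: a hypothetical precomputed index maps each point to a grid
# cell; only points in cells overlapping the polygon are tested exactly.
points = [((0.5, 0.5), "cell_a"), ((5.0, 5.0), "cell_b"), ((0.9, 0.1), "cell_a")]
poly_cells = {"cell_a"}                        # cells overlapping the polygon
unit_square = [(0, 0), (1, 0), (1, 1), (0, 1)]

candidates = [p for p, cell in points if cell in poly_cells]           # cheap
matches = [p for p in candidates if point_in_polygon(p, unit_square)]  # exact
```

The coarse filter prunes the bulk of the data using only an index lookup, which is what makes the expensive exact pass tractable at scale.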

This pattern, applied to spatio-temporal data such as that generated by geographic information systems (GIS), presents several challenges. Firstly, the data volumes make it prohibitive to index broadly categorized data to a high resolution (see the next section for more details). Secondly, geospatial data defies uniform distribution regardless of its nature: geographies are clustered around the features analyzed, whether these are related to points of interest (clustered in denser metropolitan areas), mobility (similarly clustered for foot traffic, or clustered in transit channels per transportation mode), soil characteristics (clustered in specific ecological zones), and so on. Thirdly, certain geographies are demarcated by multiple timezones (such as Brazil, Canada, Russia and the US), and others (such as China, Continental Europe, and India) are not.

It is difficult to avoid data skew, given the lack of uniform distribution, unless leveraging specific techniques. Partitioning this data in a manner that reduces the standard deviation of data volumes across partitions ensures that it can be processed horizontally. We recommend first grid indexing (in our use case, geohashing) raw spatio-temporal data based on latitude and longitude coordinates, which groups the indexes based on data density rather than logical geographical definitions; then partitioning this data based on the lowest grouping that reflects the most evenly distributed data shape as an effective data-defined region, while still decorating the data with logical geographical definitions. Such regions are defined by the number of data points contained therein, and thus can represent everything from large, sparsely populated rural areas to smaller, densely populated districts within a city, serving as a partitioning scheme that distributes data more uniformly and avoids skew.
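The grid-indexing step can be illustrated with the standard geohash algorithm, implemented here from scratch in plain Python (the sample pings are illustrative). Grouping on a short prefix yields density-defined buckets: the prefix length is the knob that trades region size against evenness of the per-partition counts.

```python
from collections import Counter

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def geohash(lat, lon, precision=11):
    """Standard geohash encoding: interleave longitude/latitude bisection
    bits, emitting one base-32 character per 5 bits."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, ch, bit, even = [], 0, 0, True
    while len(chars) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1
            rng[0] = mid
        else:
            ch <<= 1
            rng[1] = mid
        even = not even
        bit += 1
        if bit == 5:
            chars.append(BASE32[ch])
            ch, bit = 0, 0
    return "".join(chars)

# Two nearby pings in Denmark share a 4-character cell; the NYC ping does not.
pings = [(57.64911, 10.40744), (57.64900, 10.40700), (40.7831, -73.9712)]
partitions = Counter(geohash(lat, lon, 4) for lat, lon in pings)
```

Counting records per prefix like this is exactly the statistic you would monitor (standard deviation across buckets) when choosing the data-defined regions described above.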

At the same time, Databricks is developing a library, called Mosaic, to standardize this approach; see our blog Efficient Point in Polygons via PySpark and BNG Geospatial Indexing, which covers the approach we used. An extension to the Apache Spark framework, Mosaic allows easy and fast processing of very large geospatial datasets, and includes built-in indexing applying the above patterns for performance and scalability.

Geolocation fidelity:

In general, the higher the geolocation fidelity (resolution) used for indexing geospatial datasets, the more unique index values will be generated. Consequently, the data volume itself post-indexing can increase dramatically, by orders of magnitude. For example, increasing resolution fidelity from 24000ft2 to 3500ft2 increases the number of possible unique indices from 240 billion to 1.6 trillion; from 3500ft2 to 475ft2 increases the number of possible unique indices from 1.6 trillion to 11.6 trillion.
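These orders-of-magnitude jumps follow directly from how hierarchical grid indexes subdivide. Taking H3 as the example (per its published layout: 122 base cells, 12 of them pentagons, with roughly 7 children per cell per resolution, giving exactly 2 + 120·7^r cells at resolution r), a quick calculation shows each added resolution level multiplies the possible indices by about seven:

```python
def h3_cell_count(r):
    """Exact number of H3 cells at resolution r (122 base cells,
    aperture-7 subdivision; pentagons account for the closed form)."""
    return 2 + 120 * 7 ** r

res11 = h3_cell_count(11)  # ~237 billion unique indices
res12 = h3_cell_count(12)  # ~1.66 trillion unique indices
growth = res12 / res11     # each extra resolution level: ~7x more indices
```

One step finer in resolution therefore costs roughly a 7x larger index space, which is why resolution should be justified per use case rather than maximized.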

We should always step back and question the necessity and value of high resolution, as its practical applications are really limited to highly specialized use cases. For example, consider POIs; on average these range from 1500-4000ft2 and can be sufficiently captured for analysis well below the highest resolution levels; analyzing traffic at higher resolutions (covering 400ft2, 60ft2 or 10ft2) will only require more cleanup (e.g., coalescing, rollup) of that traffic and exponentiates the unique index values to capture. With mobility + POI data analytics, you will in all likelihood never need resolutions beyond 3500ft2.

For another example, consider agricultural analytics, where relatively smaller land parcels are densely equipped with sensors to determine and understand fine-grained soil and climatic features. Here the logical zoom lends the use case to applying higher-resolution indexing, given that each point's significance will be uniform.

If a valid use case demands high geolocation fidelity, we recommend only applying higher resolutions to subsets of data filtered by specific, higher-level classifications, such as those partitioned uniformly by data-defined region (as discussed in the previous section). For example, if you find a particular POI to be a hotspot for your particular features at a resolution of 3500ft2, it may make sense to increase the resolution for that POI data subset to 400ft2, and likewise for similar hotspots in a manageable geolocation classification, while maintaining a relationship between the finer resolutions and the coarser ones on a case-by-case basis, all while broadly partitioning data by the region concept we discussed earlier.

Geospatial library architecture & optimization:

Geospatial libraries vary in their designs and implementations for running on Spark, and these design choices factor greatly into the performance, scalability and optimization of your geospatial solutions.

Given the commoditization of cloud infrastructure, such as on Amazon Web Services (AWS), Microsoft Azure Cloud (Azure), and Google Cloud Platform (GCP), geospatial frameworks may be designed to take advantage of scaled cluster memory, compute, and/or IO. Libraries such as GeoSpark/Apache Sedona are designed to favor cluster memory; using them naively, you may experience memory-bound behavior. These technologies may require data repartitioning, and can cause a large volume of data to be sent to the driver, leading to performance and stability issues. Running queries with these types of libraries is better suited for experimentation on smaller datasets (e.g., lower-fidelity data). Libraries such as GeoMesa are designed to favor cluster IO; they use multi-layered indices in persistence (e.g., Delta Lake) to efficiently answer geospatial queries, suit the Spark architecture well at scale, and allow for large-scale processing of higher-fidelity data. Libraries such as sf for R or GeoPandas for Python are optimized for a range of queries running on a single machine, better used for smaller-scale experimentation with even lower-fidelity data.

At the same time, Databricks is actively developing a library, called Mosaic, to standardize this approach. An extension to the Spark framework, Mosaic provides native integration for easy and fast processing of very large geospatial datasets. It includes built-in geo-indexing for high-performance queries and scalability, and encapsulates much of the data engineering needed to generate geometries from common data encodings, including the well-known text (WKT), well-known binary (WKB), and JTS Topology Suite (JTS) formats.

See our blog on Efficient Point in Polygons via PySpark and BNG Geospatial Indexing for more on the approach.


What data you plan to render and how you aim to render it will drive your choices of libraries/technologies. We must consider how well rendering libraries suit distributed processing and large data sets; and what input formats (GeoJSON, H3, Shapefiles, WKT), interactivity levels (from none to high), and animation methods (convert frames to mp4, native live animations) they support. Geovisualization libraries such as kepler.gl and plotly are well suited for rendering large datasets quickly and efficiently, while providing a high degree of interaction, native animation capabilities, and ease of embedding. Libraries such as folium can render large datasets with more limited interactivity.

Language and platform flexibility:

Your data science and machine learning teams may write code principally in Python, R, Scala or SQL; or with another language entirely. In selecting the libraries and technologies used for implementing a Geospatial Lakehouse, we need to think about the core language and platform competencies of our users. Libraries such as GeoSpark/Apache Sedona and GeoMesa support PySpark, Scala and SQL, while others such as GeoTrellis support Scala only; and there are a body of R and Python packages built upon the C Geospatial Data Abstraction Library (GDAL).

Example implementation using mobility and point-of-interest data


As presented in Part 1, the general architecture for this Geospatial Lakehouse example is as follows:

Example reference architecture for the Databricks Geospatial Lakehouse
Diagram 1

Applying this architectural design pattern to our previous example use case, we will implement a reference pipeline for ingesting two example geospatial datasets, point-of-interest (Safegraph) and mobile device pings (Veraset), into our Databricks Geospatial Lakehouse. We primarily focus on the three key stages – Bronze, Silver, and Gold.

A Databricks Geospatial Lakehouse detailed design for our example Pings + POI geospatial use case
Diagram 2

As per the aforementioned approach, architecture, and design principles, we used a mixture of Python, Scala and SQL in our example code.

We next walk through each stage of the architecture.

Raw Data Ingestion:

We start by loading a sample of raw geospatial point-of-interest (POI) data. This POI data can come in any number of formats. In our use case, it is CSV.

raw_df = (spark.read.format("csv")
          .schema(schema)
          .option("delimiter", ",")
          .option("quote", "\"")
          .option("escape", "\"")
          .option("header", "true")
          .load(raw_path))  # raw_path: location of the raw POI CSV files


Bronze Tables: Unstructured, proto-optimized 'semi-raw' data

For the Bronze tables, we transform raw data into geometries and then clean the geometry data. Our example use case includes pings (GPS, mobile-tower-triangulated device pings) with the raw data indexed by geohash values. We then apply UDFs to transform the WKTs into geometries, and index by geohash 'regions'.

@pandas_udf('string')
def poly_to_H3(wkts: pd.Series) -> pd.Series:
    # cover each WKT polygon with H3 cells at the chosen resolution,
    # returning the cell ids as a comma-joined string
    polys = geopandas.GeoSeries.from_wkt(wkts)
    return polys.apply(
        lambda p: ",".join(h3.polyfill(shapely.geometry.mapping(p), resolution, True)))

@pandas_udf('double')
def poly_area(wkts: pd.Series) -> pd.Series:
    # approximate polygon area (squared degrees), used below to filter
    # out implausibly large geometries
    polys = geopandas.GeoSeries.from_wkt(wkts)
    return polys.area


h3_df = (spark.table("geospatial_lakehouse_blog_db.raw_graph_poi")
         .select("placekey", "safegraph_place_id", "parent_placekey",
                 "parent_safegraph_place_id", "location_name", "brands",
                 "latitude", "longitude", "street_address", "city", "region",
                 "postal_code", "polygon_wkt")
         .withColumn("area", poly_area(col("polygon_wkt")))
         .filter(col("area") < 0.001)
         .withColumn("h3", poly_to_H3(col("polygon_wkt")))
         .withColumn("h3_array", split(col("h3"), ","))
         .withColumn("h3", explode("h3_array"))
         .drop("h3_array")
         .withColumn("h3_hex", hex("h3")))

Silver Tables: Optimized, structured & fixed-schema data

For the Silver tables, we recommend incrementally processing pipelines that load and join high-cardinality data, indexing and decorating the data further to support highly performant queries. In our example, we took pings from the Bronze tables above, then aggregated and transformed these with point-of-interest (POI) data and hex-indexed these data sets using H3 queries to write Silver tables using Delta Lake. These tables were then partitioned by region and postal code, and Z-ordered by the H3 indices.

We also processed US Census Block Group (CBG) data capturing US Census Bureau profiles, indexed by GEOID codes, to aggregate and transform these codes using GeoMesa to generate geometries, then hex-indexed these aggregates/transforms using H3 queries to write additional Silver tables using Delta Lake. These were then partitioned and Z-ordered similarly to the above.

These Silver tables were optimized to support fast queries such as "find all device pings for a given POI location within a particular time window," and "coalesce frequent pings from the same device + POI into a single record, within a time window."

# Silver-to-Gold H3 indexed queries
gold_h3_indexed_ad_ids_df = spark.sql("""
    SELECT ad_id, geo_hash_region, geo_hash, h3_index, utc_date_time
    FROM silver_tables.silver_h3_indexed
    ORDER BY geo_hash_region
""")

gold_h3_lag_df = spark.sql("""
    SELECT ad_id, geo_hash, h3_index, utc_date_time,
           row_number() OVER (PARTITION BY ad_id
                              ORDER BY utc_date_time ASC) AS rn,
           lag(geo_hash, 1) OVER (PARTITION BY ad_id
                                  ORDER BY utc_date_time ASC) AS prev_geo_hash
    FROM gold_h3_indexed_ad_ids
""")

gold_h3_coalesced_df = spark.sql("""
    SELECT ad_id, geo_hash, h3_index, utc_date_time AS ts, rn,
           coalesce(prev_geo_hash, geo_hash) AS prev_geo_hash
    FROM gold_h3_lag
""")

gold_h3_cleansed_poi_df = spark.sql("""
    SELECT ad_id, geo_hash, h3_index, ts,
           SUM(CASE WHEN geo_hash = prev_geo_hash THEN 0 ELSE 1 END)
               OVER (ORDER BY ad_id, rn) AS group_id
    FROM gold_h3_coalesced
""")

# write these out into Gold tables

Gold Tables: Highly optimized, structured data with evolving schema

For the Gold tables, respective to our use case, we effectively a) sub-queried and further coalesced frequent pings from the Silver tables to produce a next level of optimization; b) decorated coalesced pings from the Silver tables and windowed these with well-defined time intervals; and c) aggregated with the CBG Silver tables and transformed for modeling/querying on CBG/ACS statistical profiles in the United States. The resulting Gold tables were thus refined for the line-of-business queries to be performed daily, in addition to providing up-to-date training data for machine learning.

# KeplerGL rendering of Silver/Gold H3 queries
lat = 40.7831
lng = -73.9712
resolution = 6
parent_h3 = h3.geo_to_h3(lat, lng, resolution)
res11 = [Row(x) for x in list(h3.h3_to_children(parent_h3, 11))]

schema = StructType([
    StructField('hex_id', StringType(), True)
])

sdf = spark.createDataFrame(data=res11, schema=schema)

@udf
def getLat(h3_id):
  return h3.h3_to_geo(h3_id)[0]

@udf
def getLong(h3_id):
  return h3.h3_to_geo(h3_id)[1]

@udf
def getParent(h3_id, parent_res):
  return h3.h3_to_parent(h3_id, parent_res)

# Note that parent and children hexagonal indices may sometimes not
# perfectly align; as such this is not meant to be exhaustive,
# rather just to demonstrate one type of business question that
# a Geospatial Lakehouse can help to easily address
pdf = (sdf.withColumn("h3_res10", getParent("hex_id", lit(10)))
       .withColumn("h3_res9", getParent("hex_id", lit(9)))
       .withColumn("h3_res8", getParent("hex_id", lit(8)))
       .withColumn("h3_res7", getParent("hex_id", lit(7)))
       .withColumnRenamed('hex_id', "h3_res11")
       .toPandas())

example_1_html = create_kepler_html(data={"hex_data": pdf}, config=map_config, height=600)


For a practical example, we applied a use case ingesting, aggregating and transforming mobility data in the form of geolocation pings (providers include Veraset, Tamoco, Irys, inmarket, Factual) with point-of-interest (POI) data (providers include Safegraph, AirSage, Factual, Cuebiq, Predicio), and with US Census Block Group (CBG) and American Community Survey (ACS) data, to model POI features vis-a-vis traffic, demographics and residency.

Bronze Tables: Unstructured, proto-optimized 'semi-raw' data

We found that the sweet spot for loading and processing historical, raw mobility data (typically in the range of 1-10TB) is best achieved on large clusters (e.g., a dedicated 192-core cluster or larger) over a shorter elapsed time period (e.g., 8 hours or less). Sharing the cluster with other workloads is ill-advised, as loading Bronze tables is one of the most resource-intensive operations in any Geospatial Lakehouse. One can reduce DBU expenditure by a factor of 6x by dedicating a large cluster to this stage. Of course, results will vary depending upon the data being loaded and processed.

Silver Tables: Optimized, structured & fixed-schema data

While H3 indexing and querying performs and scales out much better than non-approximated point-in-polygon queries, it is often tempting to apply hex indexing at resolutions so fine that the costs overcome any gain. With mobility data, as used in our example use case, we found our "80/20" H3 resolutions to be 11 and 12 for effectively "zooming in" to the finest-grained activity. H3 resolution 11 captures an average hexagon area of 2150m2 (~23,140ft2); 12 captures an average hexagon area of 307m2/3305ft2. For reference regarding POIs, an average Starbucks coffeehouse has an area of 186m2/2000ft2; an average Dunkin' Donuts has an area of 242m2/2600ft2; and an average Wawa location has an area of 372m2/4000ft2. H3 resolution 11 captures up to 237 billion unique indices; 12 captures up to 1.6 trillion unique indices. Our findings indicated that the balance between H3 index data explosion and data fidelity was best found at resolutions 11 and 12.

Increasing the resolution level, say to 13 or 14 (with average hexagon areas of 44m2/472ft2 and 6.3m2/68ft2), one finds the exponentiation of H3 indices (to 11 trillion and 81 trillion, respectively) and the resultant storage burden plus performance degradation far outweigh the benefits of that level of fidelity.

Taking this approach has, from experience, led to total Silver table capacities in the 100 trillion records range, with disk footprints from 2-3 TB.

Gold Tables: Highly optimized, structured data with evolving schema

In our example use case, we found the pings data as bound (spatially joined) within POI geometries to be somewhat noisy, with what effectively were redundant or extraneous pings in certain time intervals at certain POIs. To remove the data skew these introduced, we aggregated pings within narrow time windows at the same POI and high-resolution geometries to reduce noise, decorating the datasets with additional partition schemes and thus providing further processing of these datasets for frequent queries and EDA. This approach reduces the capacity needed for Gold tables by 10-100x, depending on the specifics. While you may have a plurality of Gold tables to support your line-of-business queries, EDA or ML training, they will greatly reduce the processing times of these downstream activities and outweigh the incremental storage costs.
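The windowed coalescing described above can be sketched in plain Python (a toy stand-in for the Spark window functions used in the actual pipeline; the 900-second window and the ping tuples are illustrative):

```python
from collections import defaultdict

def coalesce_pings(pings, window_s=900):
    """Collapse pings from the same (device, poi) that fall within
    window_s seconds of the previously kept ping into one record."""
    by_key = defaultdict(list)
    for device, poi, ts in sorted(pings):
        kept = by_key[(device, poi)]
        if not kept or ts - kept[-1] > window_s:
            kept.append(ts)
    return dict(by_key)

pings = [("dev1", "poi_a", 0), ("dev1", "poi_a", 300),
         ("dev1", "poi_a", 1200), ("dev2", "poi_a", 50)]
coalesced = coalesce_pings(pings)  # redundant ping at t=300 dropped
```

Dropping the redundant in-window pings is what yields the 10-100x capacity reduction noted above, since most of the raw volume is repeated observations of the same visit.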

For visualizations, we rendered specific analytics and modeling queries from selected Gold tables to best reflect specific insights and/or features, using kepler.gl.

With kepler.gl, we can quickly render millions to billions of points and perform spatial aggregations on the fly, visualizing these with different layers together with a high degree of interactivity.

You can render multiple resolutions of data in a reductive manner: execute broader queries, such as those across regions, at a lower resolution.

Below are some examples of the renderings across different layers with kepler.gl:

Here we use a set of coordinates for NYC (The Alden by Central Park West) to produce a hex index at resolution 6. We can then find all the children of this hexagon at a fairly fine-grained resolution, in this case, resolution 11:

Rendering of H3 indexed data at resolution 6 overlaid with resolution 11 children centered at The Alden by Central Park in NYC
Diagram 3

Next, we query POI data for Washington DC postal code 20005 to demonstrate the relationship between polygons and H3 indices; here we capture the polygons for various POIs as together with the corresponding hex indices computed at resolution 13. Supporting data points include attributes such as the location name and street address:

Polygons for POI with corresponding H3 indices for Washington DC postal code 20005
Diagram 4

Zooming in at the location of the National Portrait Gallery in Washington, DC, with our associated polygon and overlapping hexagons at resolutions 11, 12 and 13, illustrates how to break out polygons from individual hex indexes to constrain the total volume of data used to render the map.

Zoom in at National Portrait Gallery in Washington, DC, displaying overlapping hexagons at resolutions 11, 12, and 13
Diagram 5

You can explore and validate your points, polygons, and hexagon grids on the map in a Databricks notebook, and create similarly useful maps with these.


For our example use cases, we used GeoPandas, GeoMesa, H3 and KeplerGL to produce our results. In general, you can expect to use a combination of either GeoPandas, with GeoSpark/Apache Sedona or GeoMesa, together with H3 + kepler.gl, plotly, folium; and for raster data, GeoTrellis + Rasterframes.

Below we provide a list of geospatial technologies integrated with Spark for your reference:

  • Ingestion
    • GeoPandas
      • Simple, easy to use and robust ingestion of formats from ESRI ArcSDE, PostGIS, Shapefiles through to WKBs/WKTs
      • Can scale out on Spark by ‘manually’ partitioning source data files and running more workers
    • GeoSpark/Apache Sedona
      • GeoSpark is the original Spark 2 library; Sedona (in incubation with the Apache Foundation as of this writing), the Spark 3 revision
      • GeoSpark ingestion is straightforward, well documented and works as advertised
      • Sedona ingestion is WIP and needs more real world examples and documentation
    • GeoMesa
      • Spark 2 & 3
      • GeoMesa ingestion is generalized for use cases beyond Spark, therefore it requires one to understand its architecture more comprehensively before applying to Spark. It is well documented and works as advertised.
    • Databricks Mosaic (to be released)
      • Spark 3
      • This project is currently under development. More details on its ingestion capabilities will be available upon release.
  • Geometry processing
    • GeoSpark/Apache Sedona
      • GeoSpark is the original Spark 2 library; Sedona (in incubation with the Apache Foundation as of this writing), the Spark 3 revision
      • As with ingestion, GeoSpark is well documented and robust
      • RDDs and Dataframes
      • Bi-level spatial indexing
      • Range joins, Spatial joins, KNN queries
      • Python, Scala and SQL APIs
    • GeoMesa
      • Spark 2 & 3
      • RDDs and Dataframes
      • Tri-level spatial indexing via global grid
      • Range joins, Spatial joins, KNN queries, KNN joins
      • Python, Scala and SQL APIs
    • Databricks Mosaic (to be released)
      • Spark 3
      • This project is currently under development. More details on its geometry processing capabilities will be available upon release.
  • Raster map processing
    • Geotrellis
      • Spark 2 & 3
      • RDDs
      • Cropping, Warping, Map Algebra
      • Scala APIs
    • Rasterframes
      • Spark 2, active Spark 3 branch
      • Dataframes
      • Map algebra, Masking, Tile aggregation, Time series, Raster joins
      • Python, Scala, and SQL APIs
  • Grid/Hexagonal indexing and querying
    • H3
      • Compatible with Spark 2, 3
      • C core
      • Scala/Java, Python APIs (along with bindings for JavaScript, R, Rust, Erlang and many other languages)
      • KNN queries, Radius queries
    • Databricks Mosaic (to be released)
      • Spark 3
      • This project is currently under development. More details on its indexing capabilities will be available upon release.
  • Visualization
    • KeplerGL, plotly, folium (as used in the examples above and the notebooks below)

We will continue to add to this list as these technologies develop.
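Several of the libraries above advertise KNN and radius queries; conceptually, each reduces to a distance computation plus a spatial index that prunes candidates. This brute-force, stdlib-only sketch shows the core computation that engines like GeoMesa and Sedona distribute and index at scale (the function names here are illustrative, not from any of the libraries):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of Earth's mean radius."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def knn(query, points, k):
    """Brute-force k-nearest-neighbours over (lat, lon) tuples.
    A spatial index (H3 cells, GeoMesa's global grid) avoids this full scan."""
    return sorted(points, key=lambda p: haversine_km(query[0], query[1], p[0], p[1]))[:k]

def radius_query(query, points, radius_km):
    """All points within radius_km of the query point."""
    return [p for p in points
            if haversine_km(query[0], query[1], p[0], p[1]) <= radius_km]
```

With an index, both queries first restrict the scan to the grid cells overlapping the search region, then apply the exact distance test only to those candidates.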

Downloadable notebooks

For your reference, you can download the following example notebooks:

  1. Raw to Bronze processing of geometries: demonstrates simple, incremental ETL of pings data from raw Parquet to a Bronze table, adding new columns including H3 indexes, and shows how to call Scala UDFs from Python; it then runs an incremental load from Bronze to Silver tables, indexing them with H3
  2. Silver processing of datasets with geohashing: shows example queries that can be run against the Silver tables, and the kinds of insights achievable at this layer
  3. Silver to Gold processing: shows example queries run against the Silver tables to produce useful Gold tables, from which line-of-business intelligence can be gleaned
  4. KeplerGL rendering: shows example queries run against the Gold tables and demonstrates rendering their results with the KeplerGL library; note that this differs slightly from using a Jupyter notebook as in the KeplerGL documentation examples
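The Bronze-to-Silver step in the first notebook boils down to attaching a spatial index to each ping and aggregating on it. This toy, stdlib-only sketch substitutes a crude lat/lon grid for H3 to show the shape of that rollup (the function names are illustrative, not taken from the notebooks, which perform the equivalent with Spark and real H3 cell ids):

```python
import math
from collections import Counter

def grid_cell(lat, lon, res=1.0):
    """Snap a point to a coarse lat/lon grid cell of `res` degrees.
    A real pipeline would return an H3 cell id at a chosen resolution."""
    return (math.floor(lat / res), math.floor(lon / res))

def aggregate_pings(pings, res=1.0):
    """Count device pings per cell -- the kind of Bronze-to-Silver rollup
    the notebook performs with H3 indexes at scale."""
    return Counter(grid_cell(lat, lon, res) for lat, lon in pings)
```

Once each ping carries a cell id column, the Silver-to-Gold queries become ordinary group-bys and joins on that column, which Delta's data skipping and Z-ordering can accelerate.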


The Databricks Geospatial Lakehouse can provide an optimal experience for geospatial data and workloads, affording you the following advantages: domain-driven design; the power of Delta Lake, Databricks SQL, and collaborative notebooks; data format standardization; distributed processing technologies integrated with Apache Spark for optimized, large-scale processing; and powerful, high-performance geovisualization libraries, all delivering a rich yet flexible platform experience for spatio-temporal analytics and machine learning.

There is no one-size-fits-all solution; rather, the architecture and platform enable your teams to customize and model according to your requirements and the demands of your problem set. The Databricks Geospatial Lakehouse supports static and dynamic datasets equally well, enables seamless spatio-temporal unification and cross-querying with tabular and raster-based data, and targets very large datasets, from hundreds of millions to trillions of rows. Together with the collateral we are sharing with this article, we provide a practical approach with real-world examples for the most challenging and varied spatio-temporal analyses and models. You can explore and visualize the full wealth of geospatial data easily, without gratuitous complexity, within Databricks SQL and notebooks.

Next Steps

Start with the aforementioned notebooks to begin your journey to highly available, performant, scalable and meaningful geospatial analytics, data science and machine learning today, and contact us to learn more about how we assist customers with geospatial use cases.

The above notebooks are not intended to be run in your environment as is. You will need access to geospatial data such as the POI and mobility datasets demonstrated in these notebooks. Live, ready-to-query data subscriptions from Veraset and SafeGraph are available seamlessly through Databricks Delta Sharing. Please reach out to us if you would like to gain access to this data.

