
Performance: Slow Dataset Uploads on Large GeoNode Instances #14194

@t-book

Description


Expected Behavior

Uploading a single shapefile dataset should complete in a reasonable time (30-60 seconds) regardless of how many datasets already exist in the GeoNode instance.

Actual Behavior

On GeoNode instances with thousands of existing datasets, uploading a single new dataset becomes very slow (e.g., 2 minutes 40 seconds for a simple shapefile).

GeoServer logs show repeated requests like:

geoserver4geonode | 05 May 11:57:23 INFO   [geoserver.filters] - 172.18.0.8 "GET /geoserver/rest/workspaces/geonode/coveragestores/Karabalgasun_2018_14_100m_optimized/coverages.xml" took 8ms
geoserver4geonode | 05 May 11:57:23 INFO   [geoserver.filters] - 172.18.0.8 "GET /geoserver/rest/workspaces/geonode/coveragestores/Karabalgasun_2018_15_100m_optimized/coverages.xml" took 8ms
geoserver4geonode | 05 May 11:57:23 INFO   [geoserver.filters] - 172.18.0.8 "GET /geoserver/rest/workspaces/geonode/coveragestores/Karabalgasun_2018_16_100m_optimized/coverages.xml" took 8ms
geoserver4geonode | 05 May 11:57:23 INFO   [geoserver.filters] - 172.18.0.8 "GET /geoserver/rest/workspaces/geonode/coveragestores/Karabalgasun_Arctron_2007_UTM48_merged.tif/coverages.xml" took 7ms
... (one [geoserver.filters] request per existing coveragestore)
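To quantify this, here is a small hypothetical helper that counts these per-coveragestore lookups in a GeoServer log file; the path pattern is taken from the excerpt above, and the log location/format are assumptions to adjust for your deployment:

import re
import sys

# Matches the per-coveragestore lookups seen in the log excerpt above
PATTERN = re.compile(r'GET /geoserver/rest/workspaces/\S+/coveragestores/([^/]+)/coverages\.xml')

def count_lookups(log_path):
    stores = set()
    total = 0
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = PATTERN.search(line)
            if match:
                total += 1
                stores.add(match.group(1))
    print(f"{total} coverages.xml requests across {len(stores)} distinct coveragestores")

if __name__ == "__main__":
    count_lookups(sys.argv[1])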

This looks similar to an issue we faced in GeoNode 3: #7618

Steps to Reproduce the Problem

  1. Set up a GeoNode instance with 1000+ existing datasets (i.e., many coveragestores)
  2. Attempt to upload a new shapefile via the "Upload Dataset" UI
  3. Observe the upload time

Root Cause

My guess is that the bottleneck is in the sanity_checks() method of geonode/upload/publisher.py (around line 176). When validating the SRID/projection, the code calls:

res = self.cat.get_resource(x, workspace=self.workspace)  # ← NO store parameter!

Without a store parameter, gsconfig.Catalog.get_resource() performs a broad search across ALL stores:

  • Calls get_stores(workspaces=workspace) → returns all 1000+ stores
  • Loops through each store and calls store.get_resources(name=name)
  • Result: 1000+ REST API calls for a single resource check

This occurs for each resource being validated.
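In pseudocode, the broad search effectively does the following (a simplified paraphrase of the behavior described above, not the actual gsconfig source):

# Simplified sketch of the broad search when no store is passed
def get_resource_broad(cat, name, workspace):
    # one REST call to enumerate every store in the workspace
    for store in cat.get_stores(workspaces=workspace):  # 1000+ stores
        # one REST call per store (the coverages.xml requests in the log above)
        found = store.get_resources(name=name)
        if found:
            return found
    return None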

Solution

Add the store= parameter to the get_resource() call in sanity_checks(). Since the resource was just published to self.store, we can search only there:

def sanity_checks(self, resources):
    for _resource in resources:
        # OPTIMIZATION: Add store= parameter to search only in the specific store
        res = self.cat.get_resource(
            _resource.get("name"),
            store=self.store,  # ← KEY FIX
            workspace=self.workspace
        )
        
        if not res or not res[0].projection:
            raise PublishResourceException(
                f"The SRID for the resource {_resource} is not correctly set. Please check Geoserver logs"
            )
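As a quick sanity check, the difference can be timed directly with geoserver-restconfig against a live instance; the URL, credentials, and store/layer names below are placeholders:

import time
from geoserver.catalog import Catalog

cat = Catalog("http://localhost:8080/geoserver/rest", "admin", "geoserver")
workspace = cat.get_workspace("geonode")
store = cat.get_store("my_store", workspace=workspace)

t0 = time.perf_counter()
cat.get_resource("my_layer", workspace=workspace)               # broad: walks every store
t1 = time.perf_counter()
cat.get_resource("my_layer", store=store, workspace=workspace)  # scoped: single store
t2 = time.perf_counter()

print(f"without store=: {t1 - t0:.2f}s, with store=: {t2 - t1:.2f}s")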

Performance Impact

  • Before: 2 minutes 40 seconds for a single shapefile upload
  • After: 44 seconds

Specifications

  • GeoNode version: 5.0.1
  • Installation type: geonode-project (iDAI.geoserver fork)
  • Installation method: docker
  • Platform: Ubuntu 24.04, GeoServer 2.27.5
  • Additional details:
    • Instance has 2500+ existing datasets
    • Affected file: geonode/upload/publisher.py
    • Root cause: gsconfig.Catalog.get_resource() behavior without store parameter
