Use `std::execution::par` in `dbEdgeProcessor` and `dbPolygonTools` sorts by nikosavola · Pull Request #2364 · KLayout/klayout

nikosavola · 2026-06-03T13:12:05Z

This PR introduces C++17 parallel sorting (std::execution::par) to the core database sorting operations within dbEdgeProcessor.cc and dbPolygonTools.cc.

Flat operations (boolean and sizing) on extremely dense shape regions require heavy sorting of large edge arrays (e.g., sorting by ymin before scanline sweeps). Moving these from sequential std::sort to C++17 parallel sorting (backed by Intel TBB on GCC/Linux) should yield some performance boost for flat layout manipulation.

Benchmark Results

To verify the performance improvement from the parallel sorts in dbEdgeProcessor.cc and dbPolygonTools.cc, I created a test script that generates a $1000 \times 1000$ grid of overlapping boxes on two layers and measures the execution time of a boolean AND, which relies heavily on edge processing and polygon tools.

Benchmark Setup

Compiler / Environment: GCC 14.2 with C++17, linked with Intel TBB
Test Command: taskset -c <cores> env LD_LIBRARY_PATH=./bin-release ./bin-release/klayout -b -r test_bool.rb
Benchmarking Tool: hyperfine (min 3 runs per thread count)
Thread Counts Swept: 1, 2, 4, 6, 8

Note: taskset was used to precisely limit the CPU cores available to TBB.
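
As a quick sanity check of the pinning, the sketch below reports how many cores a process may run on. It assumes Linux, where `os.sched_getaffinity` is available, and falls back to `os.cpu_count()` elsewhere. The helper name `visible_cores` is only illustrative. As far as I understand, TBB sizes its default thread pool from this same affinity mask, which is why `taskset` works as a thread-count knob here.

```python
import os


def visible_cores():
    """Return the number of CPU cores this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: respects the affinity mask set by taskset
        return len(os.sched_getaffinity(0))
    # Other platforms: no affinity API in the stdlib, report all cores
    return os.cpu_count() or 1


print(f"cores available to this process: {visible_cores()}")
```

Saved under a hypothetical name such as `check_cores.py`, running it as `taskset -c 0-3 python3 check_cores.py` on a machine with at least four cores should report 4.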

Results

The table below shows the execution times across different thread counts:

| Threads | Mean Time (s) | Std Dev (s) |
|---------|---------------|-------------|
| 1       | 10.62         | 0.08        |
| 2       | 10.28         | 0.02        |
| 4       | 10.02         | 0.09        |
| 6       | 9.96          | 0.05        |
| 8       | 9.95          | 0.05        |

While the run time decreases steadily with the thread count, the gains are modest: a ~7% speedup (10.62 s to 9.95 s) from 1 to 8 threads, most of which is already reached at 4 threads. The parallel sort via TBB accelerates the sorting step itself, but the other, sequential parts of the boolean operation still dominate the overall execution time.
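
To put a rough number on "sequential parts dominate", the sketch below inverts Amdahl's law, $S = 1 / ((1 - p) + p / n)$, using the mean times from the table. This is only a back-of-the-envelope estimate: it assumes the 1-thread run is a fair sequential baseline and ignores scheduling overhead, and the function name `amdahl_parallel_fraction` is illustrative.

```python
def amdahl_parallel_fraction(t1, tn, n):
    """Solve Amdahl's law S = 1 / ((1 - p) + p / n) for the parallel fraction p."""
    speedup = t1 / tn
    return (1 - 1 / speedup) / (1 - 1 / n)


# Mean times in seconds, copied from the results table
mean_time = {1: 10.62, 2: 10.28, 4: 10.02, 6: 9.96, 8: 9.95}

for n in (2, 4, 6, 8):
    p = amdahl_parallel_fraction(mean_time[1], mean_time[n], n)
    # 1 / (1 - p) is the speedup ceiling for an unlimited number of threads
    print(f"{n} threads: speedup {mean_time[1] / mean_time[n]:.3f}x, "
          f"implied parallel fraction {p:.1%}, ceiling {1 / (1 - p):.3f}x")
```

Under these assumptions the implied parallel fraction comes out at roughly 6 to 8% of the run time for every thread count, which is consistent with the sorts being a small slice of the whole boolean operation.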

Reproduction

The results were gathered using the following Python script, which runs hyperfine for each thread count and plots the results.

Benchmark Script (`benchmark.py`)

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "matplotlib",
#     "pandas",
#     "tabulate"
# ]
# ///
import json
import subprocess

import matplotlib.pyplot as plt
import pandas as pd


def run_benchmarks():
    threads = [1, 2, 4, 6, 8]
    commands = []
    for t in threads:
        # Limit to `t` cores using taskset
        cmd = f"taskset -c 0-{t-1} env LD_LIBRARY_PATH=./bin-release ./bin-release/klayout -b -r test_bool.rb"
        commands.append((t, cmd))

    # A single hyperfine invocation benchmarks every command and writes results.json
    hyperfine_cmd = ["hyperfine", "--export-json", "results.json", "--min-runs", "3"]
    for t, cmd in commands:
        hyperfine_cmd.extend(["-n", f"{t} Threads", cmd])

    print("Running hyperfine...")
    subprocess.run(hyperfine_cmd, check=True)

    with open("results.json") as f:
        data = json.load(f)

    # hyperfine reports results in the order the commands were given
    results = []
    for i, t in enumerate(threads):
        res = data["results"][i]
        results.append({
            "Threads": t,
            "Mean Time (s)": res["mean"],
            "Std Dev (s)": res["stddev"]
        })

    # DataFrame.to_markdown needs the `tabulate` dependency listed above
    df = pd.DataFrame(results)
    print(df.to_markdown(index=False))

    # Mean run time per thread count, with the standard deviation as error bars
    plt.figure(figsize=(8, 5))
    plt.errorbar(df["Threads"], df["Mean Time (s)"], yerr=df["Std Dev (s)"], marker='o', capsize=5)
    plt.xlabel("Number of Threads")
    plt.ylabel("Mean Execution Time (s)")
    plt.title("Parallel Edge Processor Benchmark")
    plt.grid(True)
    plt.xticks(threads)
    plt.savefig("benchmark_plot.png")
    print("Plot saved as benchmark_plot.png")


if __name__ == "__main__":
    run_benchmarks()
```

Test Case (`test_bool.rb`)

```ruby
ly = RBA::Layout::new
ly.dbu = 0.001
top = ly.add_cell("TOP")

l1 = ly.layer(1, 0)
l2 = ly.layer(2, 0)

# n x n grid: an 8x8 box on each layer per cell, pitch 10, layer 2 offset by (4, 4)
n = 1000
puts "Generating data..."
n.times do |x|
  n.times do |y|
    ly.cell(top).shapes(l1).insert(RBA::Box::new(x*10, y*10, x*10+8, y*10+8))
    ly.cell(top).shapes(l2).insert(RBA::Box::new(x*10+4, y*10+4, x*10+12, y*10+12))
  end
end

puts "Running boolean..."
# Not used below: the Region `&` operator runs the boolean itself
ep = RBA::EdgeProcessor::new
reg1 = RBA::Region::new(ly.cell(top).shapes(l1))
reg2 = RBA::Region::new(ly.cell(top).shapes(l2))

r = reg1 & reg2
puts "Result has #{r.size} polygons"
```
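
For a sense of the workload size, the box coordinates in the test case allow the input and output sizes to be worked out by hand. The sketch below brute-forces the number of layer 1 / layer 2 box pairs with a positive-area overlap on small grids and checks it against the closed form $(2n - 1)^2$. It assumes each such overlap ends up as one polygon in the AND result, which is plausible because the overlap rectangles are disjoint, but I have not verified it against a KLayout build. All function names are illustrative.

```python
def box_l1(x, y):
    # Layer 1 box of grid cell (x, y), same coordinates as in test_bool.rb
    return (x * 10, y * 10, x * 10 + 8, y * 10 + 8)


def box_l2(x, y):
    # Layer 2 box of grid cell (x, y), shifted by (4, 4)
    return (x * 10 + 4, y * 10 + 4, x * 10 + 12, y * 10 + 12)


def overlaps(a, b):
    """True if boxes (x1, y1, x2, y2) share a region of positive area."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width > 0 and height > 0


def brute_force_pieces(n):
    """Count overlapping layer 1 / layer 2 box pairs on an n x n grid."""
    cells = [(x, y) for x in range(n) for y in range(n)]
    return sum(overlaps(box_l1(*p), box_l2(*q)) for p in cells for q in cells)


def closed_form_pieces(n):
    # Each layer 2 box overlaps its own cell's layer 1 box plus up to three neighbours
    return (2 * n - 1) ** 2


for n in (1, 2, 3, 5):
    assert brute_force_pieces(n) == closed_form_pieces(n)

n = 1000
print(f"input edges: {2 * n * n * 4}")                    # two boxes per cell, four edges each
print(f"expected AND pieces: {closed_form_pieces(n)}")
```

For n = 1000 this gives 8,000,000 input edges and 3,996,001 overlap rectangles, so the sorts in question operate on arrays with millions of entries.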

You can run the benchmark with `uv run benchmark.py`, which installs the dependencies and then executes the benchmark. Run it from a folder whose `./bin-release` contains a KLayout build.

Compilation Notes

To enable TBB-backed parallel execution via `<execution>`, the build needs to specify at least `-std=c++17`.
I also had to add `LIBS += -ltbb` to link the Intel Threading Building Blocks library correctly. I'm not sure what a good general solution would be for standard libraries that implement the execution policies with a different backend.

Use std::execution::par in dbEdgeProcessor and dbPolygonTools sorts (b116fe4)

This was referenced Jun 3, 2026:

- Leverage execution policies inside dbAsIfFlatRegion object traversal #2365 (Draft)
- Leverage execution policies for LVS equivalence #2366 (Draft)
- Leverage execution policies in dbShapes and PLC decomposition #2367 (Draft)

nikosavola marked this pull request as ready for review June 4, 2026 12:40

Make TBB linking conditional and improve backward compatibility of execution header guards (7e6119c)

nikosavola wants to merge 2 commits into KLayout:master from nikosavola:nikosavola/push-rurmupwrtvlv
