Parallelize hierarchical DRC processing with OpenMP #2368

Open: nikosavola wants to merge 2 commits into KLayout:master from nikosavola:nikosavola/push-rxtuymzvtspt
@nikosavola
Contributor

This PR introduces OpenMP-based parallelization to KLayout's core hierarchical processing engine (dbHierProcessor.cc). By parallelizing bottom-up boolean and sizing operations (via db::DeepRegion and db::DeepShapeStore), we achieve a speedup of up to 1.65x at 8 threads on a highly nested layout with heavy overlap (see Benchmarking & Results).

Build Configuration

To ensure this PR does not break existing workflows, OpenMP is provided as an opt-in build feature. By default, builds remain unaffected. To enable the feature, compile with:

./build.sh -with-openmp

Note: The configuration in src/klayout.pri automatically routes the correct compiler and linker flags per OS.

Benchmarking & Results

Performance was validated by generating a deeply nested layout (1,000 unique cells, heavy intentional overlaps) to force the hierarchical engine to process complex geometry. We then measured the execution time of consecutive AND and SIZE operations using DeepShapeStore.

Benchmark Layout Generation

# /// script
# dependencies = ["klayout"]
# ///
import klayout.db as db

# Create a new layout
layout = db.Layout()

# Define layers
l1 = layout.layer(1, 0)
l2 = layout.layer(2, 0)

# Create the top cell
top = layout.create_cell("TOP")
pitch = 100000 # 100 um in dbu

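# Create 1000 unique cells; each of the 1000 inner iterations inserts 3 boxes
# (2 on layer 1/0, 1 on layer 2/0), i.e. 3000 rectangles per cell
for i in range(1000):
    cell = layout.create_cell(f"CELL_{i}")
    for j in range(1000):
        # Deterministic, heavily overlapping shapes: (x, y) repeats every 20 iterations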
        x = (j * 10 % 100) * 1000
        y = (j * 15 % 100) * 1000
        cell.shapes(l1).insert(db.Box(x, y, x + 20000, y + 20000))
        cell.shapes(l2).insert(db.Box(x + 5000, y + 5000, x + 25000, y + 25000))
        cell.shapes(l1).insert(db.Box(x + 2000, y + 2000, x + 12000, y + 12000))
    
    # Instance them all in TOP in a grid
    px = (i % 25) * pitch
    py = (i // 25) * pitch
    trans = db.Trans(db.Trans.R0, px, py)
    top.insert(db.CellInstArray(cell.cell_index(), trans))

# Flat noise to force hierarchy engine to process everything
for i in range(100):
    x = (i * 50) * 1000
    y = (i * 50) * 1000
    top.shapes(l2).insert(db.Box(x, y, x + 20000, y + 20000))

# Write the layout
options = db.SaveLayoutOptions()
layout.write("test_layout_many_cells.gds", options)
print("Generated test_layout_many_cells.gds using KLayout")
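
The placement arithmetic in the generator can be checked without KLayout. The following is a minimal sketch that replays only the inner-loop coordinate formula; the helper name box_origins is illustrative and not part of the PR. It suggests that the 1000 iterations land on just 20 distinct origins, so each cell is a dense stack of coincident boxes, and that a cell's extent exceeds the 100 um instance pitch, which is what makes neighboring instances overlap.

```python
# Pure-Python replay of the generator's coordinate formula (no klayout import).
PITCH = 100000  # instance pitch in dbu, as in the generator script


def box_origins(n=1000):
    """Return the (x, y) box origins produced by the generator's inner loop."""
    return [((j * 10 % 100) * 1000, (j * 15 % 100) * 1000) for j in range(n)]


origins = box_origins()
distinct = sorted(set(origins))

# The largest box offset in the generator is +25000 dbu (the layer 2/0 box).
max_x = max(x for x, _ in distinct) + 25000
max_y = max(y for _, y in distinct) + 25000

print(f"inner-loop iterations: {len(origins)}")
print(f"distinct origins:      {len(distinct)}")
print(f"boxes per cell:        {3 * len(origins)}")
print(f"cell extent (dbu):     {max_x} x {max_y}, pitch {PITCH}")
```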

Benchmark Execution Script

import sys
import time
import os
import platform
import subprocess

sys.path.append('bin-release/pymod')
import klayout.db as db

def get_sys_info():
    try:
        cpu = subprocess.check_output("lscpu | grep 'Model name' | cut -f 2 -d ':' | awk '{$1=$1}1'", shell=True).decode().strip()
    except Exception:
        cpu = platform.processor()
    try:
        mem_bytes = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')
        mem_gb = mem_bytes / (1024.**3)
        mem_str = f"{mem_gb:.1f} GB"
    except Exception:
        mem_str = "Unknown"
    return cpu, mem_str

threads = int(sys.argv[1]) if len(sys.argv) > 1 else 1

cpu_name, ram_size = get_sys_info()

layout = db.Layout()
layout.read("test_layout_many_cells.gds")

l1 = layout.find_layer(1, 0)
l2 = layout.find_layer(2, 0)

top = layout.top_cell()

dss = db.DeepShapeStore()
dss.threads = threads

r1 = db.Region(top.begin_shapes_rec(l1), dss)
r2 = db.Region(top.begin_shapes_rec(l2), dss)

# Burn-in
_ = r1.count()
_ = r2.count()

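# perf_counter() is monotonic and better suited to interval timing than time()
t0 = time.perf_counter()
r3 = r1 & r2
t1 = time.perf_counter()

# The sizing operation causes massive overlaps between neighboring shapes,
# so the hierarchical processor has to work hard here.
r4 = r1.sized(10)
t2 = time.perf_counter()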

print(f"System: {cpu_name} with {ram_size} RAM")
print(f"Threads: {threads}")
print(f"Result (AND): {r3.hier_count()}")
print(f"Result (SIZE): {r4.hier_count()}")
print(f"AND Duration: {t1 - t0:.3f} s")
print(f"SIZE Duration: {t2 - t1:.3f} s")
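
Since the script takes the thread count as its first argument, a sweep over thread counts can be driven from a small wrapper. The sketch below is a rough stand-in for a hyperfine run, not the measurement setup used for this PR; the function names and the script path passed to sweep are assumptions.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times; return (mean, stdev) of wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return statistics.mean(samples), stdev


def sweep(script, thread_counts=(1, 2, 4, 8), runs=3):
    """Time `script <threads>` for each thread count; return {threads: (mean, stdev)}."""
    return {
        n: time_command([sys.executable, script, str(n)], runs)
        for n in thread_counts
    }
```

For example, sweep("benchmark.py") would time the script above at 1, 2, 4 and 8 threads, where benchmark.py is a placeholder for wherever the script is saved.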

Test System: Intel Core Ultra 7 258V, 32 GB RAM
Measurement: hyperfine runs of the benchmark script at each thread count, reporting mean wall-clock time ± standard deviation.

| Threads | Mean Execution Time | Relative Speedup |
|---|---|---|
| 1 (Baseline) | 244.389 s ± 22.007 s | 1.0x |
| 2 | 203.122 s ± 19.400 s | 1.20x |
| 4 | 169.160 s ± 27.339 s | 1.44x |
| 8 | 147.937 s ± 20.087 s | 1.65x |

Conclusion: Running with 8 threads, the hierarchical layout operations completed 1.65x faster compared to the single-threaded baseline.
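
The speedup column can be reproduced from the mean times, and the same numbers give a rough idea of how much of the run is actually parallel. The sketch below computes speedup, parallel efficiency, and the parallel fraction implied by Amdahl's law; applying Amdahl's law here is an illustrative assumption rather than an analysis from the PR, and the large standard deviations make the estimate coarse.

```python
# Mean wall-clock times in seconds, copied from the results table.
mean_seconds = {1: 244.389, 2: 203.122, 4: 169.160, 8: 147.937}


def speedup(times, n):
    """Speedup relative to the single-threaded baseline."""
    return times[1] / times[n]


def efficiency(times, n):
    """Speedup divided by thread count (1.0 would be perfect scaling)."""
    return speedup(times, n) / n


def amdahl_parallel_fraction(s, n):
    """Solve S = 1 / ((1 - p) + p / n) for the parallel fraction p."""
    return (1 - 1 / s) / (1 - 1 / n)


for n in sorted(mean_seconds):
    s = speedup(mean_seconds, n)
    line = f"{n} threads: speedup {s:.2f}x, efficiency {efficiency(mean_seconds, n):.0%}"
    if n > 1:
        line += f", implied parallel fraction {amdahl_parallel_fraction(s, n):.2f}"
    print(line)
```

Under that model the implied parallel fraction comes out at roughly 0.34 to 0.45, and efficiency at 8 threads is about 21%, so a substantial part of the run still appears to execute serially.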

[Figure: benchmark_plot]

- Replace sequential task queues in dbHierProcessor and dbCompoundOperation with OpenMP tasking and parallel loops
- Add thread safety to CompoundRegionOperationCache using tl::Mutex
- Enable OpenMP compiler and linker flags in klayout.pri
- Update _drc_engine.rb documentation to explicitly state thread allocation for hierarchical mode
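
The cache change listed above follows a common pattern: guard lookups and insertions with a mutex so that concurrent workers never see a half-updated map. The sketch below is a Python analogy using threading.Lock, not the C++ code from this PR; the class LockedCache and its methods are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedCache:
    """Hypothetical cache whose lookups and inserts are guarded by one mutex."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}
        self.computations = 0

    def get_or_compute(self, key, compute):
        # Holding the lock across check-and-insert means each key is computed once.
        with self._lock:
            if key not in self._data:
                self._data[key] = compute(key)
                self.computations += 1
            return self._data[key]


cache = LockedCache()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(
        pool.map(lambda k: cache.get_or_compute(k % 4, lambda x: x * x), range(100))
    )

print(f"lookups: {len(results)}, computations: {cache.computations}")
```

Computing while holding the lock keeps the sketch simple but serializes all cache misses; whether that trade-off is acceptable depends on how expensive the cached operation is.
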
@klayoutmatthias
Collaborator

Hi @nikosavola,

thanks a lot for your efforts, the code updates and the benchmark data. It's a significant achievement and a nice piece of work and I really appreciate that.

So far a big "thank you"!

As the solution is opt-in, I can basically merge this PR, but I think it will need continuous maintenance. I can't do that myself, as I don't have the bandwidth to take care of another deployment axis or CI/CD integration. Also, my experience with OpenMP is limited, specifically regarding debugging and potential platform compatibility issues.

Are you willing to take over support of the OpenMP execution branch in the future?

Matthias

@nikosavola
Contributor Author

Hi Matthias,

thanks for the great project. I can definitely try providing support. For keeping track of whether anything fails, adding new CI pipelines that build and test with OpenMP across different platforms would make sense. Maybe even a regression benchmark.

I'm familiar with GitHub and GitLab CI but noticed you are mostly running on Azure. Is it okay to add new pipelines? Are there any cost limitations etc. or do you get the compute time free for being open-source? If cost is not a limit, I can try adding and 'maintaining' some pipelines. These things could probably be different PRs.

Side note

If you want to read up on parallel programming, I recommend the course material for the excellent Programming Parallel Computers course from Aalto University. It is freely available at https://ppc.cs.aalto.fi/ and the material is fast to read through. The length of the course comes from implementing the algorithms correctly yourself.

@nikosavola
Contributor Author

About the deployment, I'm not really familiar with production-ready C++ in this regard. I know you could do something like dynamic dispatch of dynamically linked libraries etc. or compile the OpenMP sections to a shared library. I think we may keep OpenMP as a feature a user needs to compile from source for now.

@klayoutmatthias
Collaborator

For now, it's okay to leave it as an opt-in feature for users who want to, and can, compile from source. It's just that I would need someone to help with debugging in case there are problems. CI comes next. Cost constraints don't apply as of now, but I am frequently hitting hard limits, so load is at the edge already. Some CI is therefore done on private hardware.

I have experience with parallelization, but at the level of explicit threads (pthread, QThread etc.). KLayout already utilizes parallelism in many places (drawing, tiled mode etc.), but the architecture is based on a worker/task paradigm. I've debugged many issues such as race conditions, thread synchronization problems and deadlocks, and I know what's ahead.

I understand that OpenMP is embedded much deeper into the compiler architecture, and I wonder what the implications are. I imagine that there are portability issues. I also frequently use KLayout myself in load-balancer environments, which demand predictable resource allocation, so high dynamic load peaks are not welcome there. I've read that OpenMP chooses the number of threads "automatically", so I don't know if that creates a problem.

I'm interested to see if OpenMP is able to mitigate the locking overhead problem. Locks (even atomic spin locks) impose a significant overhead. A mutex lock in my experience takes roughly 1 microsecond, which makes it extremely inefficient when you guard small pieces of code like a simple arithmetic instruction. Task formation and scheduling is a challenging topic, and I have had enough cases where the overhead alone entirely cancelled out the gain from parallelization. Maybe OpenMP does some magic here. I will try your code on some benchmarks of mine.

Best regards,

Matthias
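
The locking-overhead concern can be illustrated with a small measurement: time a trivial increment with and without an uncontended lock around it. This is a minimal Python sketch with a hypothetical helper name; CPython's interpreter overhead dominates both loops, so the absolute numbers say nothing about tl::Mutex or OpenMP and only show the shape of the comparison.

```python
import threading
import time


def measure_lock_overhead(iterations=200_000):
    """Return (plain, locked) seconds per increment, single-threaded and uncontended."""
    lock = threading.Lock()

    counter = 0
    start = time.perf_counter()
    for _ in range(iterations):
        counter += 1
    plain = (time.perf_counter() - start) / iterations

    counter = 0
    start = time.perf_counter()
    for _ in range(iterations):
        # Acquire and release the lock around a single arithmetic operation.
        with lock:
            counter += 1
    locked = (time.perf_counter() - start) / iterations

    return plain, locked


plain, locked = measure_lock_overhead()
print(f"plain increment:  {plain * 1e9:.0f} ns")
print(f"locked increment: {locked * 1e9:.0f} ns")
```

When the guarded work is this small, the cost of taking the lock tends to be comparable to or larger than the work itself, which is why coarser task granularity usually matters more than the locking primitive chosen.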
