Parallelize hierarchical DRC processing with OpenMP #2368
- Replace sequential task queues in dbHierProcessor and dbCompoundOperation with OpenMP tasking and parallel loops
- Add thread safety to CompoundRegionOperationCache using tl::Mutex
- Enable OpenMP compiler and linker flags in klayout.pri
- Update _drc_engine.rb documentation to explicitly state thread allocation for hierarchical mode
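The first two items can be illustrated with a minimal sketch. It is not the code from this PR: `ResultCache`, `process_cell` and `process_all` are hypothetical names, `std::mutex` stands in for tl::Mutex, and the per-cell work is a placeholder. It shows a sequential queue replaced by an OpenMP parallel loop, with a shared cache guarded by a lock.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a shared result cache.
// std::mutex is used here in place of tl::Mutex (assumption for illustration).
class ResultCache {
public:
  // Returns true and fills 'out' if 'key' is already cached.
  bool lookup(int key, long &out) const {
    std::lock_guard<std::mutex> guard(m_lock);
    auto it = m_data.find(key);
    if (it == m_data.end()) {
      return false;
    }
    out = it->second;
    return true;
  }

  void store(int key, long value) {
    std::lock_guard<std::mutex> guard(m_lock);
    m_data[key] = value;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> guard(m_lock);
    return m_data.size();
  }

private:
  mutable std::mutex m_lock;
  std::map<int, long> m_data;
};

// Placeholder for the real per-cell work (hypothetical).
long process_cell(int cell_id) {
  return static_cast<long>(cell_id) * cell_id;
}

// Processes all cells in a parallel loop instead of draining a sequential queue.
// Without OpenMP enabled the pragma is ignored and the loop runs serially.
long process_all(const std::vector<int> &cells, ResultCache &cache) {
  long total = 0;
  const long n = static_cast<long>(cells.size());
  #pragma omp parallel for schedule(dynamic) reduction(+:total)
  for (long i = 0; i < n; ++i) {
    long r = 0;
    if (!cache.lookup(cells[i], r)) {
      // Two threads may compute the same key concurrently; the result is
      // deterministic, so the duplicate store is harmless in this sketch.
      r = process_cell(cells[i]);
      cache.store(cells[i], r);
    }
    total += r;
  }
  return total;
}
```

The lock is held only around the map access, never around `process_cell`, so the expensive work stays outside the critical section.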
Hi @nikosavola, thanks a lot for your efforts, the code updates and the benchmark data. It's a significant achievement and a nice piece of work, and I really appreciate that. So far a big "thank you"!

As the solution is opt-in, I can basically merge this PR, but I think it will need continuous maintenance. I can't do that, as I don't have the bandwidth to take care of another deployment axis or CI/CD integration. My experience with OpenMP is also limited, specifically regarding debugging and potential platform compatibility issues.

Are you willing to take over support of the OpenMP execution branch in the future?

Matthias
Hi Matthias, thanks for the great project. I can definitely try providing support. For keeping track of whether anything fails, adding new CI pipelines that build and test with OpenMP across different platforms would make sense. Maybe even a regression benchmark. I'm familiar with GitHub and GitLab CI but noticed you are mostly running on Azure. Is it okay to add new pipelines? Are there any cost limitations etc., or do you get the compute time free for being open-source? If cost is not a limit, I can try adding and 'maintaining' some pipelines. These things could probably be different PRs.

Side note: If you want to read up on parallel programming, I recommend the course material for the excellent Programming Parallel Computers course from Aalto University. It is freely available at https://ppc.cs.aalto.fi/ and the material is fast to read through. The length of the course comes from implementing the algorithms correctly yourself.
About the deployment, I'm not really familiar with production-ready C++ in this regard. I know you could do something like dynamic dispatch of dynamically linked libraries etc. or compile the OpenMP sections to a shared library. I think we may keep OpenMP as a feature a user needs to compile from source for now.
For now, it's okay to leave it as an opt-in feature for users who want to, or are able to, compile from source. It's just that I would need someone to help with debugging in case there are problems. CI comes next. Cost constraints don't apply as of now, but I am frequently hitting hard limits, so the load is already at the edge. Some CI is therefore done on private hardware.

I have experience with parallelization, but at the level of explicit threads (pthread, QThread etc.). KLayout already utilizes parallelism in many places (drawing, tiled mode etc.), but the architecture is based on a worker/task paradigm. I've debugged many issues such as race conditions, thread synchronization problems and deadlocks, and I know what's ahead.

I understand that OpenMP is embedded much deeper into the compiler architecture, and I wonder what the implications are. I imagine that there are portability issues. I also frequently use KLayout myself in load-balancer environments, which demand predictable resource allocation, so high dynamic load peaks are not welcome there. I've read that OpenMP chooses the number of threads "automatically", so I don't know if that creates a problem.

I'm interested to see if OpenMP is able to mitigate the locking overhead problem. Locks (even atomic spin locks) impose a significant overhead. In my experience a mutex lock takes roughly 1 microsecond, which makes it extremely inefficient when you guard small pieces of code such as a simple arithmetic instruction. Task formation and scheduling is a challenging topic, and I have had enough cases where the overhead alone entirely cancelled out the gain from parallelization. Maybe OpenMP does some magic here.

I will try your code on some benchmarks of mine.

Best regards,
Matthias
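On the two concerns raised above (automatic thread counts and per-operation locking), here is a minimal sketch of the standard OpenMP mechanisms involved. It is not taken from this PR: `bounded_sum` and `threads_used` are hypothetical names, and the threshold of 10000 elements is an arbitrary illustration. The `num_threads` clause caps the team size for a region (the `OMP_NUM_THREADS` environment variable and `omp_set_num_threads()` set the default), the `if` clause keeps small workloads serial, and `reduction` gives each thread a private accumulator so no lock is taken inside the loop.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Sums 'values' with at most 'max_threads' threads and no lock in the loop body.
long bounded_sum(const std::vector<int> &values, int max_threads) {
  assert(max_threads >= 1);
  long total = 0;
  const long n = static_cast<long>(values.size());
  // num_threads: upper bound on the team size for this region.
  // if(...):     run serially when the workload is too small to pay off
  //              (the 10000 threshold is an arbitrary example value).
  // reduction:   per-thread partial sums, combined once at the end.
  #pragma omp parallel for num_threads(max_threads) reduction(+:total) if(n > 10000)
  for (long i = 0; i < n; ++i) {
    total += values[i];
  }
  return total;
}

// Reports how many threads a region capped at 'max_threads' actually gets.
// Returns 1 when the code is built without OpenMP support.
int threads_used(int max_threads) {
  int used = 1;
#ifdef _OPENMP
  #pragma omp parallel num_threads(max_threads)
  {
    #pragma omp single
    used = omp_get_num_threads();
  }
#else
  (void) max_threads;
#endif
  return used;
}
```

The runtime may still hand out fewer threads than requested, but not more, which is the property that matters for predictable resource allocation.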
This PR introduces OpenMP-based parallelization to KLayout's core hierarchical processing engine (dbHierProcessor.cc). By parallelizing bottom-up boolean and sizing operations (via db::DeepRegion and db::DeepShapeStore), we achieve speedups on highly nested layouts with heavy overlap.
Build Configuration
To ensure this PR does not break existing workflows, OpenMP is provided as an opt-in build feature. By default, builds remain unaffected. To enable the feature, compile with:
Note: The configuration in src/klayout.pri automatically routes the correct compiler and linker flags per OS.
Benchmarking & Results
Performance was validated by generating a deeply nested layout (1,000 unique cells, heavy intentional overlaps) to force the hierarchical engine to process complex geometry. We then measured the execution time of consecutive AND and SIZE operations using DeepShapeStore.
Benchmark Layout Generation
Benchmark Execution Script
Test System: Intel Core Ultra 7 258V, 32 GB RAM
Measurement: hyperfine execution measuring mean time and scaling across threads.
Conclusion: Running with 8 threads, the hierarchical layout operations completed 1.65x faster than the single-threaded baseline.
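As a rough reading of that number, Amdahl's law can be inverted to estimate what fraction of the runtime is effectively parallel. This is a back-of-the-envelope sketch, not a measurement: it assumes 8 identical cores and no scheduling overhead, which may not hold on the test system, and the function names are hypothetical. With a speedup of 1.65 on 8 threads it gives a parallel fraction of roughly 0.45 and a parallel efficiency of roughly 0.21.

```cpp
#include <cassert>

// Inverts Amdahl's law S = 1 / ((1 - p) + p / N) to solve for p,
// the fraction of the single-threaded runtime that scales with N threads.
double amdahl_parallel_fraction(double speedup, int threads) {
  assert(speedup >= 1.0 && threads > 1);
  return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / threads);
}

// Parallel efficiency: achieved speedup divided by the thread count.
double parallel_efficiency(double speedup, int threads) {
  assert(threads >= 1);
  return speedup / threads;
}
```

Under these assumptions, adding more threads would give diminishing returns until a larger share of the hierarchical processing runs in parallel.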