feat(cwl): integration of CWL job submission and execution into DiracX #877
Draft
ryuwd wants to merge 39 commits into DIRACGrid:main
ryuwd commented Apr 10, 2026
    Services: ServicesConfig = ServicesConfig()
    """Configuration for various DIRAC services."""
    - SoftwareDistModule: str = "LocalSoftwareDist"
    + SoftwareDistModule: str = ""
This default was causing errors in the Pilot. To be checked with @chaen.
ryuwd commented Apr 10, 2026
Comment on lines +75 to +81
    # TODO: Compute Adler32 checksum before upload
    # TODO: Extract POOL/ROOT GUID if applicable
    # TODO: Prefer local SEs (getSEsForSite) before remote ones
    # TODO: Implement retry with exponential backoff on transient failures
    # TODO: On complete failure, create a failover Request (RMS)
    #       for async recovery instead of raising immediately
    # TODO: Report upload progress via job status updates
This command is still untested in cert. StoreOutputData still needs fuller implementation, discussion, and testing.
Implements streaming interpolation-drop compression for prmon time-series data, porting the HSF/prmon algorithm to pure Python (no pandas) for use in DIRACOS2 environments. Includes full TDD test suite with 5 tests.
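As an illustration of the interpolation-drop idea, here is a minimal streaming sketch in pure Python. It is not the PR's `OnlineCompressor` and not the exact HSF/prmon algorithm: the function name, the single-metric `(timestamp, value)` sample shape, and the absolute `tolerance` parameter are all assumptions made for the example. A sample is dropped when linear interpolation between the last kept sample and the next sample predicts it within the tolerance. Timestamps are assumed to be strictly increasing.

```python
from typing import Iterable, Iterator

Sample = tuple[float, float]  # (timestamp, value); hypothetical shape


def interpolation_drop(samples: Iterable[Sample], tolerance: float) -> Iterator[Sample]:
    """Yield a thinned series, dropping samples that neighbours can predict.

    Assumes strictly increasing timestamps. First and last samples are
    always kept so the series endpoints survive compression.
    """
    it = iter(samples)
    try:
        kept = next(it)
    except StopIteration:
        return
    yield kept
    candidate = None  # most recent sample not yet emitted or dropped
    for new in it:
        if candidate is not None:
            (t0, v0), (t1, v1), (t2, v2) = kept, candidate, new
            # Value the candidate would have on the straight line kept -> new.
            predicted = v0 + (v2 - v0) * (t1 - t0) / (t2 - t0)
            if abs(predicted - v1) > tolerance:
                yield candidate
                kept = candidate
        candidate = new
    if candidate is not None:
        yield candidate
```

Because only three samples are held at a time, the generator can consume a live feed without buffering the whole time series.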
Per the CWL v1.2 spec, location is an IRI that identifies a file resource (supports custom URI schemes like LFN: and SB:), while path is a local filesystem path set after staging. Previously both URI schemes and local paths were placed in path, which breaks when cwltool normalises inputs via file_uri(). Readers now check location before path, writers place URI schemes in location, and validation rejects LFN:/SB: in the path field on both the client and server side.
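The writer and validation rules in this commit message can be sketched on plain CWL-style input dicts as follows. The function names are hypothetical and the real diracx validators may be structured differently; only the `LFN:`/`SB:` prefixes and the `location`/`path` split come from the text above.

```python
URI_SCHEMES = ("LFN:", "SB:")


def make_file_entry(reference: str) -> dict:
    """Writer side: scheme URIs go in `location`, local paths in `path`."""
    key = "location" if reference.startswith(URI_SCHEMES) else "path"
    return {"class": "File", key: reference}


def validate_file_entry(entry: dict) -> None:
    """Validation side: reject LFN:/SB: references placed in `path`."""
    path = entry.get("path")
    if isinstance(path, str) and path.startswith(URI_SCHEMES):
        raise ValueError(
            f"{path!r} is a URI reference and belongs in 'location', not 'path'"
        )
```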
Verify that DiracPathMapper produces correct target values (what cwltool assigns to the File path field at runtime) for different PFN types: file:// to local path, https:// and root:// passed through as URLs, and SB: resolved via replica map.
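The expected mapping can be summarised with a small stand-alone function. This is a sketch of the behaviour the tests describe, not `DiracPathMapper` itself: `target_for` is a hypothetical name, and the replica map is simplified to a plain dict from reference to a single PFN.

```python
from urllib.parse import unquote, urlparse


def target_for(location: str, replica_map: dict[str, str]) -> str:
    """Return the value a File `path` would get for a given location.

    `replica_map` is a simplified stand-in: reference -> one PFN string.
    """
    # LFN:/SB: references are looked up by their full key first.
    if location.startswith(("LFN:", "SB:")):
        pfn = replica_map[location]
    else:
        pfn = location
    parsed = urlparse(pfn)
    if parsed.scheme == "file":
        # file:// PFNs become plain local filesystem paths.
        return unquote(parsed.path)
    # https://, root:// and other remote schemes pass through unchanged.
    return pfn
```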
load_inputfile() converts input dicts into cwl_utils File objects where location="SB:..." and path=None. The extract methods only checked .path on objects, silently dropping SB: and LFN: references stored in .location. This caused empty replica maps and sandbox download failures on the worker.
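A reader that avoids this failure mode checks `location` before `path`. The sketch below uses a small dataclass as a stand-in for the `cwl_utils` File object, and the helper names are hypothetical; it only illustrates the lookup order and the recursive walk over job inputs.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FileStub:
    """Stand-in for a cwl_utils File object (illustration only)."""
    location: Optional[str] = None
    path: Optional[str] = None


def file_reference(obj: Any) -> Optional[str]:
    """Prefer `location`; fall back to `path` only when it is unset."""
    return getattr(obj, "location", None) or getattr(obj, "path", None)


def collect_references(value: Any, prefixes=("LFN:", "SB:")) -> list[str]:
    """Walk nested inputs and return every LFN:/SB: reference found."""
    refs: list[str] = []
    if isinstance(value, dict):
        for item in value.values():
            refs.extend(collect_references(item, prefixes))
    elif isinstance(value, (list, tuple)):
        for item in value:
            refs.extend(collect_references(item, prefixes))
    else:
        ref = file_reference(value)
        if isinstance(ref, str) and ref.startswith(prefixes):
            refs.append(ref)
    return refs
```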
…cwltool

cwltool ships mypyc-compiled wheels on Linux x86_64. Inside the compiled CommandLineTool.job(), `self.make_path_mapper(...)` is resolved as a direct C call when the override is a `@staticmethod`: Python's MRO is bypassed and DiracCommandLineTool's PathMapper override silently never runs.

Two changes restore the override:

* tool.py: drop @staticmethod and add `self`. Instance-method dispatch goes through the descriptor protocol and survives the compiled call site. Toil's CWL runner uses the same pattern (see toil's src/toil/cwl/cwltoil.py:1110).
* _mypyc_compat.py: rewrite as three explicit branches: pure-Python cwltool (skip), compiled+source (install meta-path finder forcing .py load for cwltool.command_line_tool), compiled-only (raise; the override cannot be restored without a .py source). The finder also raises on missing .py instead of silently falling back to .so. The active mode is logged and announced on stderr at install time so production behaviour is observable.

Tests: test_executor.py updates the make_path_mapper unit tests for the new instance-method signature; test_no_cwltool_import.py asserts the runtime invariant we care about (cwltool.command_line_tool is loaded from a .py source after the executor is imported) rather than the implementation detail of which finder is registered.
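For readers unfamiliar with meta-path finders, the sketch below shows one way a finder can force a single module to load from its `.py` source and fail loudly when no source exists. It is an illustration under assumptions, not the contents of `_mypyc_compat.py`: the class and function names are hypothetical, and detection of the pure-Python and compiled-only modes is left out.

```python
import importlib.abc
import importlib.machinery
import importlib.util
import sys
from pathlib import Path


class ForceSourceFinder(importlib.abc.MetaPathFinder):
    """Resolve one named module from its .py file, never from an extension."""

    def __init__(self, target: str) -> None:
        self.target = target

    def find_spec(self, fullname, path=None, target=None):
        if fullname != self.target:
            return None  # let the normal import machinery handle it
        leaf = fullname.rpartition(".")[2]
        # For submodules `path` is the parent package's __path__.
        for entry in (path if path is not None else sys.path):
            candidate = Path(entry) / f"{leaf}.py"
            if candidate.is_file():
                loader = importlib.machinery.SourceFileLoader(fullname, str(candidate))
                return importlib.util.spec_from_file_location(
                    fullname, str(candidate), loader=loader
                )
        # Refuse to fall back silently to a compiled extension module.
        raise ImportError(f"no .py source found for {fullname}")


def install(target: str) -> ForceSourceFinder:
    """Put the finder first on sys.meta_path so it wins over the defaults."""
    finder = ForceSourceFinder(target)
    sys.meta_path.insert(0, finder)
    return finder
```

Calling `install("cwltool.command_line_tool")` before that module is first imported would be the equivalent of the compiled+source branch described above. The finder has no effect on a module that is already in `sys.modules`.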
The standalone dirac-cwl package also exports a `dirac-cwl-run` console script (dirac_cwl.job.executor.__main__:cli). Once both packages coexist in the same environment (e.g. lb-dirac during the migration window), the two entry points collide: pip silently lets the second installer overwrite the first, and which `dirac-cwl-run` actually runs becomes non-deterministic across resolvers.

Rename the diracx version to `dirac-cwl-runner`, which:

* avoids the collision (different name from dirac-cwl's `-run`);
* matches the CWL-community convention of `<framework>-cwl-runner` (toil-cwl-runner, arvados-cwl-runner, cwl-tes…);
* aligns with the `dirac` CLI verb prefix used elsewhere in diracx.

Updates: pyproject entry-point declaration, JobWrapper.run_job subprocess invocation, logger names across the executor modules (so log filters track the binary), and the tests/docs that reference the old name.

lbaplocal can opt in via its existing `--executor` flag today; the default flips once dirac-cwl is removed from lb-dirac.
dirac:Job.input_sandbox sources are resolved at submission time by
looking up each source name in the supplied inputs dict. cwltool
applies workflow-level input defaults later, on the worker — too late
for sandbox upload. Without an inputs.yml or CLI override, defaults
like 'default: { class: File, path: ... }' silently fail to upload.
Merge the workflow's declared defaults into each job dict between
CLI/inputs.yml processing and sandbox grouping so a sandbox source
referencing a defaulted File input gets uploaded as expected.
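The merge step can be sketched as below. This is a simplified illustration, assuming workflow inputs are available as a list of dicts with short `id` values; the function name is hypothetical and the PR's actual code may handle more input shapes.

```python
import copy


def merge_workflow_defaults(workflow_inputs: list[dict], job: dict) -> dict:
    """Return a copy of `job` with missing inputs filled from CWL defaults.

    Values supplied via CLI or inputs.yml always win over defaults.
    """
    merged = dict(job)
    for declared in workflow_inputs:
        name = declared["id"]
        if "default" in declared and name not in merged:
            # Deep copy so later edits to one job never leak into another.
            merged[name] = copy.deepcopy(declared["default"])
    return merged
```

Running this before sandbox grouping means a `dirac:Job.input_sandbox` source that names a defaulted File input finds an entry in the job dict and is uploaded like any explicitly supplied file.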
Worker writes the fetched CWL YAML verbatim to disk and points dirac-cwl-runner at it. Drops JobModel/JobInputModel/BaseJobModel indirection and the cwl_utils.save() that re-serialised inline sub-workflow ids with a 'run/' prefix, breaking URI fragment resolution downstream.
Removes the RuntimeError raise — using exceptions for an expected control-flow path (cwltool exited non-zero) muddied the outer exception handler and produced misleading 'Failed to execute workflow' tracebacks for normal user-job failures. post_process now always logs cwltool's output JSON, parses what it can, attempts the output-sandbox upload (so partial outputs from pickValue=all_non_null reach the user even on permanentFail), and returns whether its own infrastructure ran cleanly. run_job decides DONE vs FAILED from cwltool's exit code separately.
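The separation of concerns described here can be sketched as follows. The signatures are hypothetical and far simpler than the real `JobWrapper`: `upload` stands in for the output-sandbox upload, and the tolerant JSON parsing is an assumption about what "parses what it can" might look like.

```python
import json
from typing import Callable


def parse_cwl_outputs(stdout: str) -> dict:
    """Return the first JSON object found in stdout, or {} if none parses."""
    decoder = json.JSONDecoder()
    start = stdout.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(stdout, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = stdout.find("{", start + 1)
    return {}


def post_process(stdout: str, upload: Callable[[dict], None]) -> bool:
    """Always attempt the upload; report only whether our own steps worked."""
    outputs = parse_cwl_outputs(stdout)
    try:
        upload(outputs)
    except Exception:
        return False
    return True


def run_job(exit_code: int, stdout: str, upload: Callable[[dict], None]) -> tuple[str, bool]:
    """Decide DONE/FAILED from the exit code, independently of post_process."""
    infrastructure_ok = post_process(stdout, upload)
    status = "DONE" if exit_code == 0 else "FAILED"
    return status, infrastructure_ok
```

No exception is raised for a non-zero exit code, so a normal user-job failure no longer produces an infrastructure-style traceback.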
End-to-end CWL job submission and execution for DiracX, from CLI to worker
node and back. This PR introduces the full subsystem: there is no prior CWL
path in diracx, so the components below are all new code unless noted
otherwise.
Follows the plan in #858. Goes with DIRACGrid/DIRAC#8506.
CLI (diracx-cli)

* `dirac job submit cwl <workflow> [inputs...]`: […] `--range` expansion. Workflow input defaults are merged into job dicts before sandbox grouping so `default: { class: File, path: ... }` entries are uploaded as expected.
* `dirac job submit cmd -- <command>`
* `dirac job search`
* `dirac job sandbox list|peek|get <job_id>`: `peek` pages through `$PAGER` with `LESS=-R` so ANSI colours render. Output sandboxes only.
* […] `parse_cli_args`.
* `dirac-cwl-runner`: […] `FsAccess` and `PathMapper` for LFN/SB resolution via replica maps. The `PathMapper` overrides cwltool's `mapper()` so that `SB:` keys carrying `#fragment` identifiers are looked up by their full key rather than fragment-stripped, which cwltool's default does for plain HTTP-style URIs. Name follows the CWL-community `<framework>-cwl-runner` convention (toil-cwl-runner, arvados-cwl-runner, cwl-tes).

Worker node (diracx-api)

* `JobWrapper`: […] `cwl_utils.save()`.
* `post_process` semantics: […] `bool` and runs on both success and failure paths. Cwltool exiting non-zero is treated as a normal control-flow outcome, so partial outputs from `pickValue: all_non_null` reach the user even on `permanentFail`. `JobMinorStatus.APP_ERRORS` is set on failure.
* `JobMonitor`: […] `Kill` command handling, and stall detection via a rolling CPU/wall ratio window. Sends one final heartbeat on exit. […] `PEEK_LINES` are hard-coded; should move to CS config (issue: TBD).
* `PrmonFifoReader` reads metrics from a named pipe (1 s sampling, `--fast-memmon`) instead of polling the on-disk TSV. `OnlineCompressor` ports the HSF/prmon interpolation-drop algorithm to pure Python so the time-series can be compressed in-process without pandas (DIRACOS2 has no pandas).
* […] (`[job echo-tool] completed success`, `[workflow ] starting step greet`) streamed as ApplicationStatus with rate-limited commits.
* […] `SB:<se>|<s3_path>#<filename>`. The `#fragment` identifies the file inside the tar archive; the `SB:` reference is preserved end-to-end and resolved to a presigned URL at extraction time.
* `JobReport`: […] `set_job_statuses` on commit. Heartbeats are sent through synchronously by `send_heartbeat`, returning any pending server commands (e.g. `Kill`) to the caller. Uses generated-client `HeartbeatData`/`JobCommand`.
* […] `PATH` so cwltool's JS evaluation finds Node and sandbox-staged binaries resolve.
* `StoreOutputData` command: […] `PostProcessCommand` on jobs with `output_data` hints, calls `DataManager.putAndRegister` over the configured SE list, and raises on total failure. None of the production-grade behaviour (Adler32 checksumming, GUID extraction, local-SE preference, retry/backoff, RMS failover Request, progress reporting) is implemented; all are marked TODO.

Server (diracx-logic, diracx-routers, diracx-core)

* `cwl_to_jdl()` extracts `dirac:Job` hints from CWL and emits a JDL string for the legacy matcher. Derives JobName, JobGroup, JobType, Priority, LogLevel, Site, CPUTime, MaxWallTime, Min/MaxNumberOfProcessors, processor tags (MultiProcessor, NProcessors, GPU), I/O sandboxes, InputData, OutputData, OutputPath, OutputSE. A `legacy_jdl` field on `JobHint` is merged last as a user override. `stdout:`/`stderr:` fields are added to `OutputSandbox` automatically.
* `#fragment` stripping: […] `SB:` references for server ownership checks; the full URI with `#filename` is preserved in CWL inputs for worker extraction.
* […] `--range`.
* `JobHint`, `IOSource`, `OutputDataEntry`, and the pre/post-process command framework. `ReplicaMap` is preexisting and was extended in this PR to accept `SB:` keys (validation passes them through with the prefix; LFN: keys still have the prefix stripped).
* `validate_requirements()` checks every CWL Requirement and Hint against a whitelist. Pass-through (no matcher impact): `InlineJavascriptRequirement`, `SchemaDefRequirement`, `InitialWorkDirRequirement`, `EnvVarRequirement`, `ShellCommandRequirement`, `LoadListingRequirement`, `InplaceUpdateRequirement`, `WorkReuse`, `NetworkAccess`, `SubworkflowFeatureRequirement`, `ScatterFeatureRequirement`, `MultipleInputFeatureRequirement`, `StepInputExpressionRequirement`. Rejected: `DockerRequirement`, `MPIRequirement`, `SoftwareRequirement`. Unknown requirements raise.
* […] `SoftwareDistModule` default `"LocalSoftwareDist"` to `""` to fix a Pilot error.

Client (diracx-client)

Key design decisions

* `location` vs `path` per CWL v1.2: […] (`LFN:`, `SB:`) live in `File.location`; `path` is reserved for local filesystem paths set after staging. Validation rejects scheme URIs in `path`. Readers check `location` before `path` on `cwl_utils` `File` objects.
* `InlineJavascriptRequirement`

Compatibility caveats

* `make_path_mapper` override under mypyc-compiled cwltool: […] `CommandLineTool.job()`, a `@staticmethod` override of `make_path_mapper` is bypassed because the call resolves as a direct C call rather than through Python's MRO. The PR ships two complementary fixes: (1) the override is declared as an instance method so dispatch goes through the descriptor protocol and survives the compiled call site; (2) `_mypyc_compat.py` installs a `sys.meta_path` finder, invoked at executor package import, that forces `.py`-source loading for `cwltool.command_line_tool` when both source and compiled modules are present. The finder skips itself for pure-Python cwltool installs and raises at import time if only the compiled extension is present, so a wheel that strips the `.py` source can never silently disable the override. The active mode is logged and printed to stderr at install. […] `_mypyc_compat.py` hack is necessary anymore.
* `dirac-cwl-runner` strips leading non-JSON noise from cwltool's stdout when prmon emits warnings before the JSON document.

Test coverage
20 new test files across the affected packages.
* `diracx-api`: `test_job_monitor.py`, `test_job_report.py`, `test_job_wrapper.py`, `test_job_wrapper_integration.py`, `test_job_wrapper_sandbox.py`, `test_prmon_compress.py`, `test_prmon_reader.py`
* `diracx-cli`: `test_cwl_submit.py`, `test_executor.py`, `test_executor_integration.py`, `test_fs_access.py`, `test_no_cwltool_import.py`, `test_submission_confirm.py`, `test_submission_inputs.py`, `test_submission_integration.py`, `test_submission_pipeline.py`, `test_submission_sandbox.py`, `test_submit_simple.py`
* `diracx-core`: `test_replica_map.py` (extended for SB: keys)
* `diracx-db`: `test_workflow_db.py`
* `diracx-logic`: `test_cwl_submission.py`

Cert testing on `diracx-cert.app.cern.ch` covers LHCb Simulation jobs end-to-end. See per-row "Tested in cert" columns above for component-level cert coverage.

Status

Under certification testing on `diracx-cert.app.cern.ch`. The follow-ups listed above are intentionally scoped out of this PR.
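
To make the `SB:<se>|<s3_path>#<filename>` reference format from the Worker node and Server sections concrete, here is a minimal parsing sketch. The helper names and the example values are hypothetical; only the overall shape of the reference, and the idea of stripping the `#fragment` for ownership checks while keeping the full URI for extraction, come from the description above.

```python
from typing import NamedTuple, Optional


class SandboxRef(NamedTuple):
    se: str                  # storage element name
    s3_path: str             # path of the sandbox archive
    filename: Optional[str]  # member inside the archive, if a fragment is given


def parse_sb_reference(ref: str) -> SandboxRef:
    """Split an SB: reference into its three parts."""
    if not ref.startswith("SB:"):
        raise ValueError(f"not a sandbox reference: {ref!r}")
    body, _, fragment = ref[len("SB:"):].partition("#")
    se, sep, s3_path = body.partition("|")
    if not sep:
        raise ValueError(f"missing '|' separator in {ref!r}")
    return SandboxRef(se, s3_path, fragment or None)


def strip_fragment(ref: str) -> str:
    """Reference without the #filename part, e.g. for an ownership check."""
    return ref.partition("#")[0]
```

The worker would keep the full reference and use `filename` to pick the member out of the archive, while the server side only needs the fragment-free form.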
cc @aldbr