feat: Add LogElements transform to Java SDK#38533
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request adds a convenient logging utility to the Apache Beam Java SDK, mirroring the existing functionality in the Python SDK. It enables developers to easily inspect PCollection elements during pipeline execution without altering the data flow, which is particularly useful for debugging and monitoring data processing pipelines. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a new LogElements PTransform to the Beam Java core SDK that logs each element of a PCollection (optionally with prefix, timestamp, window, and pane info) and passes elements through unchanged, along with unit tests.
Changes:
- Introduces
LogElements<T>PTransformwith fluent builders (trace/debug/info/warn/error/of,withPrefix,withTimestamp,withWindow,withPaneInfo) and display data. - Implements an internal
LoggingFnDoFnthat formats and logs each element at the configured SLF4J level, then outputs it unchanged. - Adds
LogElementsTestcovering element pass-through, format string composition, and display data.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/LogElements.java | New PTransform for logging PCollection elements at a configurable SLF4J level with optional metadata. |
| sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/LogElementsTest.java | Unit tests for element pass-through, log formatting, and display data. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| default: | ||
| LOG.error("{}", message); |
| @Test | ||
| public void testDisplayData() { | ||
| DisplayData displayData = | ||
| DisplayData.from( | ||
| LogElements.of(Level.WARN).withPrefix("row: ").withTimestamp().withWindow()); | ||
|
|
||
| assertThat(displayData, hasDisplayItem("level", "WARN")); | ||
| assertThat(displayData, hasDisplayItem("prefix", "row: ")); | ||
| assertThat(displayData, hasDisplayItem("withTimestamp", true)); | ||
| assertThat(displayData, hasDisplayItem("withWindow", true)); | ||
| } |
There was a problem hiding this comment.
Code Review
This pull request introduces the LogElements transform, which allows for logging elements in a PCollection at various log levels with optional metadata such as timestamps and window information. The review feedback highlights a performance concern where log messages are formatted regardless of whether the log level is enabled; it is recommended to wrap the logging logic in a level check to avoid unnecessary overhead. Additionally, the explicit call to setCoder should be removed to avoid potential runtime exceptions and rely on Beam's automatic coder propagation.
| log( | ||
| level, | ||
| formatForLogging( | ||
| element, | ||
| prefix, | ||
| withTimestamp, | ||
| withWindow, | ||
| withPaneInfo, | ||
| timestamp, | ||
| window, | ||
| paneInfo)); | ||
| receiver.output(element); | ||
| } |
There was a problem hiding this comment.
The formatForLogging method is called for every element before checking if the log level is enabled. This involves string concatenation and toString() calls (potentially on large objects, windows, or timestamps) which can be very expensive in high-throughput pipelines. You should check if the log level is enabled before constructing the log message.
if (isLoggingEnabled(level)) {
log(
level,
formatForLogging(
element,
prefix,
withTimestamp,
withWindow,
withPaneInfo,
timestamp,
window,
paneInfo));
}
receiver.output(element);
}
private boolean isLoggingEnabled(Level level) {
switch (level) {
case TRACE:
return LOG.isTraceEnabled();
case DEBUG:
return LOG.isDebugEnabled();
case INFO:
return LOG.isInfoEnabled();
case WARN:
return LOG.isWarnEnabled();
case ERROR:
default:
return LOG.isErrorEnabled();
}
}|
|
||
| @Override | ||
| public PCollection<T> expand(PCollection<T> input) { | ||
| return input.apply("Log", ParDo.of(new LoggingFn<>(this))).setCoder(input.getCoder()); |
There was a problem hiding this comment.
Calling setCoder(input.getCoder()) is generally redundant for a ParDo that returns the same type as its input, as Beam automatically propagates the coder. Additionally, input.getCoder() can throw an IllegalStateException if the coder hasn't been set yet and cannot be inferred at this stage of pipeline construction. It is safer to let the SDK handle coder propagation.
| return input.apply("Log", ParDo.of(new LoggingFn<>(this))).setCoder(input.getCoder()); | |
| return input.apply("Log", ParDo.of(new LoggingFn<>(this))); |
|
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment |
|
the failed checks seems unrelated to my commit. |
|
assign set of reviewers |
|
Assigning reviewers: R: @chamikaramj for label java. Note: If you would like to opt out of this review, comment Available commands:
The PR bot will only process comments in the main thread (not review comments). |
|
retest this please |
|
Reviewers are already assigned to this PR: @chamikaramj |
|
R: @ahmedabu98 please take a look. Thanks. |
|
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment |
ahmedabu98
left a comment
There was a problem hiding this comment.
The transform looks great, but I think testing could be better
|
Will merge when tests go green |
|
really appreciate the speedy review. |
|
Looks like the new test failed: |
|
so the issue is that the DirectRunner uses slf4j-simple which outputs to stderr, which ExpectedLogs can't capture(only works with JUL). solutions like changing slf4j-simple to slf4j-jdk14 has high blast radius. the next most obvious path is to capture stderr(trace and debug still won't appear) but can be flaky. (bad idea) Lmk if you have better idea. |
Please add a meaningful description for your change here
What does this PR do?
Added a Java SDK LogElements transform for logging each element of a PCollection while passing the elements through unchanged.
Supports SLF4J log levels plus optional prefix, timestamp, window, and pane info logging, mirroring the Python SDK’s LogElements convenience transform.
addresses #38528
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.