Skip to content

fix: ProcessingS3Output's s3_uri to be an optional field (5559)#5755

Draft
sagemaker-bot wants to merge 2 commits intoaws:masterfrom
sagemaker-bot:fix/processings3output-s-s3-uri-to-be-an-optional-5559
Draft

fix: ProcessingS3Output's s3_uri to be an optional field (5559)#5755
sagemaker-bot wants to merge 2 commits intoaws:masterfrom
sagemaker-bot:fix/processings3output-s-s3-uri-to-be-an-optional-5559

Conversation

@sagemaker-bot
Copy link
Copy Markdown
Collaborator

Description

In V3 (sagemaker-core), ProcessingS3Output's s3_uri field is mandatory, preventing users from leaving the output destination as None to let SageMaker auto-generate an S3 path (partitioned by job_name/step_name/output_name). In V2, ProcessingOutput.destination could be None and the SDK would auto-generate an S3 URI during normalization. The fix requires: (1) Making s3_uri optional (defaulting to None) on ProcessingS3Output in the shapes definition, (2) Updating _normalize_outputs() to handle None s3_uri by auto-generating a path, (3) Allowing ProcessingOutput to be created without s3_output entirely and having the normalizer create it, and (4) Updating _processing_output_to_request_dict to handle optional s3_uri.

Related Issue

Related issue: 5559

Changes Made

  • sagemaker-core/src/sagemaker/core/processing.py
  • sagemaker-core/tests/unit/test_processing.py

AI-Generated PR

This PR was automatically generated by the PySDK Issue Agent.

  • Confidence score: 85%
  • Classification: bug
  • SDK version target: V3

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit tests added/updated
  • Integration tests added (if applicable)
  • Documentation updated (if applicable)

Copy link
Copy Markdown
Collaborator Author

@sagemaker-bot sagemaker-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR fixes ProcessingS3Output's s3_uri to be optional, allowing SageMaker to auto-generate S3 paths. The logic changes look reasonable and tests are comprehensive. However, there are several issues: the PR description mentions changing the shapes definition but no change to shapes is included in the diff, the import line exceeds 100 characters, hardcoded defaults should be constants, and there's a potential issue with the default local_path when creating s3_output from scratch.

from sagemaker.core.local.local_session import LocalSession
from sagemaker.core.helper.session_helper import Session
from sagemaker.core.shapes import ProcessingInput, ProcessingOutput, ProcessingS3Input
from sagemaker.core.shapes import ProcessingInput, ProcessingOutput, ProcessingS3Input, ProcessingS3Output
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import line exceeds the 100-character line length limit. Please break it into multiple lines:

from sagemaker.core.shapes import (
    ProcessingInput,
    ProcessingOutput,
    ProcessingS3Input,
    ProcessingS3Output,
)

# Generate a name for the ProcessingOutput if it doesn't have one.
if output.output_name is None:
output.output_name = "output-{}".format(count)
# If s3_output is None, create a default one with None s3_uri
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing shapes definition change: The PR description states that s3_uri should be made optional (defaulting to None) on ProcessingS3Output in the shapes definition, but no change to the shapes file is included in this diff. Without that change, ProcessingS3Output(s3_uri=None, ...) or ProcessingS3Output(local_path=..., s3_upload_mode=...) (without s3_uri) will likely fail Pydantic validation if s3_uri is currently a required field. Please include the shapes definition change or confirm that s3_uri is already optional in the shapes.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$context sagemaker-core/src/sagemaker/core/shapes/shapes.py

@sagemaker-bot
Copy link
Copy Markdown
Collaborator Author

🤖 Iteration #1 — Review Comments Addressed

Description

Make ProcessingS3Output's s3_uri field optional (defaulting to None) so that users can delegate output storage path generation to SageMaker, matching the V2 SDK behavior.

Changes

sagemaker-core/src/sagemaker/core/shapes/shapes.py

  • Changed ProcessingS3Output.s3_uri from required (StrPipeVar) to optional (Optional[StrPipeVar] = None)

sagemaker-core/src/sagemaker/core/processing.py

  • Fixed import line to comply with 100-character line length limit by breaking into multi-line import
  • Added module-level constants DEFAULT_PROCESSING_LOCAL_OUTPUT_PATH and DEFAULT_S3_UPLOAD_MODE to replace hardcoded magic strings
  • Updated _normalize_outputs() to:
    • Create a default ProcessingS3Output with s3_uri=None when s3_output is None on a ProcessingOutput
    • Auto-generate an S3 URI when s3_uri is None (using job name or pipeline context)
    • Added clarifying comments about is_pipeline_variable(None) handling and the expected invariant
  • Updated _processing_output_to_request_dict() to omit S3Uri from the request dict when s3_uri is None, with a comment explaining this is a defensive guard since normalization should always populate it

sagemaker-core/tests/unit/test_processing.py

  • Added direct model-level regression test test_processing_s3_output_with_explicit_s3_uri_preserves_value to verify that ProcessingS3Output with an explicit s3_uri string preserves the value correctly

Behavior

  • Users can now create ProcessingS3Output without specifying s3_uri, or with s3_uri=None
  • Users can create ProcessingOutput without s3_output, and the normalizer will create it automatically
  • When s3_uri is None, _normalize_outputs() auto-generates an S3 path partitioned by job_name/output_name (or pipeline_name/execution_id/step_name/output_name in pipeline context)
  • Existing behavior with explicit s3_uri values is preserved (backward compatible)

Testing

  • All existing tests pass
  • New tests cover: ProcessingS3Output with None s3_uri, without s3_uri param, with explicit s3_uri, normalization with None s3_uri, normalization with None s3_output, pipeline context with None s3_uri, and serialization with None s3_uri

Comments reviewed: 9
Files modified: sagemaker-core/src/sagemaker/core/shapes/shapes.py, sagemaker-core/src/sagemaker/core/processing.py, sagemaker-core/tests/unit/test_processing.py

  • sagemaker-core/src/sagemaker/core/shapes/shapes.py: Make s3_uri optional (defaulting to None) on ProcessingS3Output to support auto-generated S3 paths
  • sagemaker-core/src/sagemaker/core/processing.py: Fix import line length, add module-level constants, handle None s3_uri in normalize_outputs and serialization
  • sagemaker-core/tests/unit/test_processing.py: Add model-level tests for ProcessingS3Output with explicit s3_uri and improve existing tests

Copy link
Copy Markdown
Collaborator Author

@sagemaker-bot sagemaker-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR makes ProcessingS3Output.s3_uri optional to allow SageMaker to auto-generate S3 paths, matching V2 behavior. The approach is sound and well-tested, but there are a few issues: the shapes file is auto-generated and shouldn't be manually edited, the Processor class doesn't use Pydantic BaseModel, and there's a line length violation.

s3_uri: Optional[StrPipeVar] = None
s3_upload_mode: StrPipeVar
local_path: Optional[StrPipeVar] = Unassigned()

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical concern: The shapes.py file in sagemaker-core is auto-generated from API definitions (as noted in the SDK architecture: "auto-generated from API definitions"). Manual edits to this file will be overwritten the next time the code generator runs.

The correct fix should be made in the shapes definition/generation configuration (e.g., a JSON/YAML model or codegen template) so that s3_uri is generated as Optional. If this is truly a bug in the API model where the SageMaker API actually accepts S3Uri as optional for ProcessingS3Output, the codegen input should be updated.

Please confirm whether this file is safe to edit manually or if the change needs to go into the code generation source.

s3_uri: Optional[StrPipeVar] = None
s3_upload_mode: StrPipeVar
local_path: Optional[StrPipeVar] = Unassigned()

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field ordering changed here: s3_uri (now optional with default) is listed before s3_upload_mode (required). In Pydantic, required fields must come before optional fields with defaults in the class body, or you'll get a validation error. Since s3_upload_mode is StrPipeVar (required, no default) and s3_uri now has = None, this ordering will cause a Pydantic error.

You need to reorder so that s3_upload_mode comes before s3_uri:

s3_upload_mode: StrPipeVar
s3_uri: Optional[StrPipeVar] = None
local_path: Optional[StrPipeVar] = Unassigned()

Alternatively, if using from __future__ import annotations or Pydantic v2 with appropriate config, this may not be an issue — but please verify.

if getattr(self.sagemaker_session, "local_mode", False) and parse_result.scheme == "file":
# If the output's s3_uri is None or not an s3_uri, create one.
if output.s3_output.s3_uri is None:
parse_result_scheme = ""
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line length likely exceeds 100 characters:

                    if getattr(self.sagemaker_session, "local_mode", False) and parse_result_scheme == "file":

This should be wrapped to stay within the 100-character limit.

if output.output_name is None:
output.output_name = "output-{}".format(count)
# If s3_output is None, create a default one with None s3_uri.
# The s3_uri will be auto-generated below based on job/pipeline context.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default local_path and s3_upload_mode values are hardcoded here. Consider whether these should match whatever defaults the user might expect from the ProcessingS3Output model itself. If ProcessingS3Output already has defaults for local_path and s3_upload_mode, you could simplify:

output.s3_output = ProcessingS3Output(
    s3_upload_mode=DEFAULT_S3_UPLOAD_MODE,
    local_path=DEFAULT_PROCESSING_LOCAL_OUTPUT_PATH,
)

Also, the local_path on the auto-created ProcessingS3Output may conflict with the user's actual ProcessingOutput.app_managed or other output configurations. Is /opt/ml/processing/output always the correct default?


logger = logging.getLogger(__name__)

# Default values used when creating a ProcessingS3Output for outputs
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: DEFAULT_S3_UPLOAD_MODE should use the enum or constant from the shapes/API model if one exists, rather than a raw string "EndOfJob". This avoids drift if the API model changes the accepted values.

assert "test-job" in result[0].s3_output.s3_uri
assert "my-output" in result[0].s3_output.s3_uri

def test_normalize_outputs_with_none_s3_uri_and_pipeline_config_generates_join(self, mock_session):
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line exceeds 100 characters. Please wrap this line:

    def test_normalize_outputs_with_none_s3_uri_and_pipeline_config_generates_join(
        self, mock_session
    ):

image_uri="test-image:latest",
instance_count=1,
instance_type="ml.m5.xlarge",
sagemaker_session=mock_session,
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good test coverage! However, the mock_session fixture is used but I don't see it defined in this diff. Please confirm it's defined elsewhere in the test file (e.g., as a conftest.py fixture or earlier in the file). If it's the fixture from the existing test class, note that these tests are in a new class and pytest fixtures defined inside another class won't be inherited.

def test_normalize_outputs_with_none_s3_output_generates_s3_path(self, mock_session):
"""When s3_output is None, _normalize_outputs should create s3_output and auto-generate URI."""
processor = Processor(
role="arn:aws:iam::123456789012:role/SageMakerRole",
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import of Join is inside the test function body. While this works, it's better practice to put imports at the top of the file. If Join is only needed for this assertion, consider using hasattr or checking the type name string instead, or move the import to the top of the file with the other imports.

# After _normalize_outputs, s3_uri should always be populated.
# If it is still None at serialization time, omit S3Uri from the dict
# rather than sending None to the API. This is a defensive guard;
# _normalize_outputs is expected to fill in s3_uri before we reach here.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The defensive guard to omit S3Uri when it's None is good, but the comment says "_normalize_outputs is expected to fill in s3_uri before we reach here." If that's the case, consider adding a logger.warning() when s3_uri is None at this point, since it indicates an unexpected code path. Silent omission could make debugging harder:

if processing_output.s3_output.s3_uri is not None:
    s3_output_dict["S3Uri"] = processing_output.s3_output.s3_uri
else:
    logger.warning(
        "s3_uri is None for output '%s' at serialization time; "
        "expected _normalize_outputs to have populated it.",
        processing_output.output_name,
    )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants