API

This part of the documentation covers all the interfaces of sagemaker_pyspark.

SageMakerModel

class SageMakerModel(endpointInstanceType, endpointInitialInstanceCount, requestRowSerializer, responseRowDeserializer, existingEndpointName=None, modelImage=None, modelPath=None, modelEnvironmentVariables=None, modelExecutionRoleARN=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, prependResultRows=True, namePolicy=<sagemaker_pyspark.NamePolicy.RandomNamePolicy object>, uid=None, javaObject=None)

Bases: sagemaker_pyspark.wrapper.SageMakerJavaWrapper, pyspark.ml.wrapper.JavaModel

A Model implementation which transforms a DataFrame by making requests to a SageMaker Endpoint. Manages the life cycle of all necessary SageMaker entities, including Model, EndpointConfig, and Endpoint.

This Model transforms one DataFrame to another by repeated, distributed SageMaker Endpoint invocation. Each invocation request body is formed by concatenating input DataFrame Rows serialized to Byte Arrays by the specified RequestRowSerializer. The invocation request content-type property is set from RequestRowSerializer.contentType. The invocation request accepts property is set from ResponseRowDeserializer.accepts.

The transformed DataFrame is produced by deserializing each invocation response body into a series of Rows. Row deserialization is delegated to the specified ResponseRowDeserializer. If prependResultRows is false, the transformed DataFrame will contain just these Rows. If prependResultRows is true, then each transformed Row is a concatenation of the input Row with its corresponding SageMaker invocation deserialized Row.

Each invocation of transform() passes the Dataset.schema of the input DataFrame to the requestRowSerializer by invoking RequestRowSerializer.setSchema().

The specified RequestRowSerializer also controls the validity of input Row Schemas for this Model. Schema validation is carried out on each call to transformSchema(), which invokes RequestRowSerializer.validateSchema().

Adapting this SageMaker model to the data format and type of a specific Endpoint is achieved by sub-classing RequestRowSerializer and ResponseRowDeserializer. Examples of a Serializer and Deserializer are LibSVMRequestRowSerializer and LibSVMResponseRowDeserializer, respectively.

Parameters:
  • endpointInstanceType (str) – The instance type used to run the model container
  • endpointInitialInstanceCount (int) – The initial number of instances used to host the model
  • requestRowSerializer (RequestRowSerializer) – Serializes a Row to an Array of Bytes
  • responseRowDeserializer (ResponseRowDeserializer) – Deserializes an Array of Bytes to a series of Rows
  • existingEndpointName (str) – The name of an existing endpoint that is currently in service.
  • modelImage (str) – A Docker image URI
  • modelPath (str) – An S3 location that a successfully completed SageMaker Training Job has stored its model output to.
  • modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
  • modelExecutionRoleARN (str) – The IAM Role used by SageMaker when running the hosted Model and to download model data from S3
  • endpointCreationPolicy (EndpointCreationPolicy) – Whether the endpoint is created upon SageMakerModel construction, transformation, or not at all.
  • sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateModel and CreateEndpoint requests.
  • prependResultRows (bool) – Whether the transformation result should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
  • namePolicy (NamePolicy) – The NamePolicy to use when naming SageMaker entities created during usage of this Model.
  • uid (str) – The unique identifier of this Model. Used to represent this stage in Spark ML pipelines.
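
For illustration, a minimal sketch of constructing a SageMakerModel directly and transforming a DataFrame. The image URI, S3 model path, role ARN, instance type, and input_df are placeholders, and the serializer imports assume the public module paths shown in the signature above:

    from sagemaker_pyspark import SageMakerModel
    from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
    from sagemaker_pyspark.transformation.deserializers import KMeansProtobufResponseRowDeserializer

    # Host model artifacts produced by a K-Means training job (all values below
    # are placeholders).
    model = SageMakerModel(
        endpointInstanceType="ml.m4.xlarge",
        endpointInitialInstanceCount=1,
        requestRowSerializer=ProtobufRequestRowSerializer(),
        responseRowDeserializer=KMeansProtobufResponseRowDeserializer(),
        modelImage="123456789012.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",
        modelPath="s3://my-bucket/kmeans/output/model.tar.gz",
        modelExecutionRoleARN="arn:aws:iam::123456789012:role/my-sagemaker-role")

    # transform() invokes the hosted endpoint; with prependResultRows=True (the
    # default) each output Row is the input Row plus the deserialized result Row.
    predictions = model.transform(input_df)
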
copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters:extra – Extra parameters to copy to the new instance
Returns:Copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:extra – extra param values
Returns:merged param map
classmethod fromEndpoint(endpointName, requestRowSerializer, responseRowDeserializer, modelEnvironmentVariables=None, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, prependResultRows=True, namePolicy=<sagemaker_pyspark.NamePolicy.RandomNamePolicy object>, uid='sagemaker')

Creates a JavaSageMakerModel from an existing SageMaker Endpoint that is in service.

The returned JavaSageMakerModel can be used to transform DataFrames.

Parameters:
  • endpointName (str) – The name of an endpoint that is currently in service.
  • requestRowSerializer (RequestRowSerializer) – Serializes a row to an array of bytes.
  • responseRowDeserializer (ResponseRowDeserializer) – Deserializes an array of bytes to a series of rows.
  • modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
  • sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
  • prependResultRows (bool) – Whether the transformation result should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
  • namePolicy (NamePolicy) – The NamePolicy to use when naming SageMaker entities created during usage of the returned model.
  • uid (str) – The unique identifier of the SageMakerModel. Used to represent the stage in Spark ML pipelines.
Returns:A JavaSageMakerModel that sends InvokeEndpoint requests to the given endpoint.
Return type:JavaSageMakerModel
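
A minimal sketch of attaching to an endpoint that is already in service; the endpoint name and input_df are placeholders:

    from sagemaker_pyspark import SageMakerModel
    from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
    from sagemaker_pyspark.transformation.deserializers import KMeansProtobufResponseRowDeserializer

    # Reuse an endpoint created elsewhere (here assumed to host a K-Means model).
    model = SageMakerModel.fromEndpoint(
        endpointName="my-existing-endpoint",
        requestRowSerializer=ProtobufRequestRowSerializer(),
        responseRowDeserializer=KMeansProtobufResponseRowDeserializer())

    predictions = model.transform(input_df)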

classmethod fromModelS3Path(modelPath, modelImage, modelExecutionRoleARN, endpointInstanceType, endpointInitialInstanceCount, requestRowSerializer, responseRowDeserializer, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, prependResultRows=True, namePolicy=<sagemaker_pyspark.NamePolicy.RandomNamePolicy object>, uid='sagemaker')

Creates a JavaSageMakerModel from existing model data in S3.

The returned JavaSageMakerModel can be used to transform DataFrames.

Parameters:
  • modelPath (str) – The S3 URI to the model data to host.
  • modelImage (str) – The URI of the image that will serve model inferences.
  • modelExecutionRoleARN (str) – The IAM Role used by SageMaker when running the hosted Model and to download model data from S3.
  • endpointInstanceType (str) – The instance type used to run the model container.
  • endpointInitialInstanceCount (int) – The initial number of instances used to host the model.
  • requestRowSerializer (RequestRowSerializer) – Serializes a row to an array of bytes.
  • responseRowDeserializer (ResponseRowDeserializer) – Deserializes an array of bytes to a series of rows.
  • modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
  • endpointCreationPolicy (EndpointCreationPolicy) – Whether the endpoint is created upon SageMakerModel construction, transformation, or not at all.
  • sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
  • prependResultRows (bool) – Whether the transformation result should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
  • namePolicy (NamePolicy) – The NamePolicy to use when naming SageMaker entities created during usage of the returned model.
  • uid (str) – The unique identifier of the SageMakerModel. Used to represent the stage in Spark ML pipelines.
Returns:A JavaSageMakerModel that sends InvokeEndpoint requests to an endpoint hosting the given model data.
Return type:JavaSageMakerModel
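
A minimal sketch of hosting model artifacts that already exist in S3; the URIs, role ARN, instance type, and input_df are placeholders:

    from sagemaker_pyspark import SageMakerModel
    from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
    from sagemaker_pyspark.transformation.deserializers import KMeansProtobufResponseRowDeserializer

    model = SageMakerModel.fromModelS3Path(
        modelPath="s3://my-bucket/kmeans/output/model.tar.gz",
        modelImage="123456789012.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",
        modelExecutionRoleARN="arn:aws:iam::123456789012:role/my-sagemaker-role",
        endpointInstanceType="ml.m4.xlarge",
        endpointInitialInstanceCount=1,
        requestRowSerializer=ProtobufRequestRowSerializer(),
        responseRowDeserializer=KMeansProtobufResponseRowDeserializer())

    predictions = model.transform(input_df)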

classmethod fromTrainingJob(trainingJobName, modelImage, modelExecutionRoleARN, endpointInstanceType, endpointInitialInstanceCount, requestRowSerializer, responseRowDeserializer, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, prependResultRows=True, namePolicy=<sagemaker_pyspark.NamePolicy.RandomNamePolicy object>, uid='sagemaker')

Creates a JavaSageMakerModel from a successfully completed training job name.

The returned JavaSageMakerModel can be used to transform DataFrames.

Parameters:
  • trainingJobName (str) – Name of the successfully completed training job.
  • modelImage (str) – URI of the image that will serve model inferences.
  • modelExecutionRoleARN (str) – The IAM Role used by SageMaker when running the hosted Model and to download model data from S3.
  • endpointInstanceType (str) – The instance type used to run the model container.
  • endpointInitialInstanceCount (int) – The initial number of instances used to host the model.
  • requestRowSerializer (RequestRowSerializer) – Serializes a row to an array of bytes.
  • responseRowDeserializer (ResponseRowDeserializer) – Deserializes an array of bytes to a series of rows.
  • modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
  • endpointCreationPolicy (EndpointCreationPolicy) – Whether the endpoint is created upon SageMakerModel construction, transformation, or not at all.
  • sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
  • prependResultRows (bool) – Whether the transformation result should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
  • namePolicy (NamePolicy) – The NamePolicy to use when naming SageMaker entities created during usage of the returned model.
  • uid (str) – The unique identifier of the SageMakerModel. Used to represent the stage in Spark ML pipelines.
Returns:A JavaSageMakerModel that sends InvokeEndpoint requests to an endpoint hosting the training job’s model.
Return type:JavaSageMakerModel
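
A minimal sketch of hosting the model produced by a completed training job; the job name, image URI, role ARN, instance type, and input_df are placeholders:

    from sagemaker_pyspark import SageMakerModel
    from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
    from sagemaker_pyspark.transformation.deserializers import KMeansProtobufResponseRowDeserializer

    model = SageMakerModel.fromTrainingJob(
        trainingJobName="my-completed-training-job",
        modelImage="123456789012.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",
        modelExecutionRoleARN="arn:aws:iam::123456789012:role/my-sagemaker-role",
        endpointInstanceType="ml.m4.xlarge",
        endpointInitialInstanceCount=1,
        requestRowSerializer=ProtobufRequestRowSerializer(),
        responseRowDeserializer=KMeansProtobufResponseRowDeserializer())

    predictions = model.transform(input_df)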

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters:
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  • params – an optional param map that overrides embedded params.
Returns:transformed dataset

New in version 1.3.0.

SageMakerEstimator

class SageMakerEstimator(trainingImage, modelImage, trainingInstanceType, trainingInstanceCount, endpointInstanceType, endpointInitialInstanceCount, requestRowSerializer, responseRowDeserializer, hyperParameters=None, trainingInputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingOutputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingInstanceVolumeSizeInGB=1024, trainingProjectedColumns=None, trainingChannelName='train', trainingContentType=None, trainingS3DataDistribution='ShardedByS3Key', trainingSparkDataFormat='sagemaker', trainingSparkDataFormatOptions=None, trainingInputMode='File', trainingCompressionCodec=None, trainingMaxRuntimeInSeconds=86400, trainingKmsKeyId=None, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, sagemakerRole=<sagemaker_pyspark.IAMRoleResource.IAMRoleFromConfig object>, s3Client=<sagemaker_pyspark.SageMakerClients.SageMakerClients._S3DefaultClient object>, stsClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._STSDefaultClient object>, modelPrependInputRowsToTransformationRows=True, deleteStagingDataAfterTraining=True, namePolicyFactory=<sagemaker_pyspark.NamePolicy.RandomNamePolicyFactory object>, uid=None)

Bases: sagemaker_pyspark.SageMakerEstimator.SageMakerEstimatorBase

Adapts a SageMaker learning Algorithm to a Spark Estimator.

Fits a SageMakerModel by running a SageMaker Training Job on a Spark Dataset. Each call to fit() submits a new SageMaker Training Job, creates a new SageMaker Model, and creates a new SageMaker Endpoint Config. A new Endpoint is either created by this call to fit(), or the returned SageMakerModel is configured to create an Endpoint on its first call to transform(), as controlled by endpointCreationPolicy.

On fit, the input Dataset is serialized with the specified trainingSparkDataFormat using the specified trainingSparkDataFormatOptions and uploaded to an S3 location specified by trainingInputS3DataPath. The serialized Dataset is compressed with trainingCompressionCodec, if not None.

trainingProjectedColumns can be used to control which columns on the input Dataset are transmitted to SageMaker. If not None, then only those column names will be serialized as input to the SageMaker Training Job.

A Training Job is created with the uploaded Dataset being input to the specified trainingChannelName, with the specified trainingInputMode. The algorithm is specified by trainingImage, a Docker image URI reference. The Training Job is created with trainingInstanceCount instances of type trainingInstanceType. The Training Job will time out after trainingMaxRuntimeInSeconds, if not None.

SageMaker Training Job hyperparameters are built from the params on this Estimator. Param objects with neither a default value nor a set value are ignored. If a Param is not set but has a default value, the default value will be used. Param values are converted to SageMaker hyperparameter String values.

SageMaker uses the IAM Role with ARN sagemakerRole to access the input and output S3 buckets and trainingImage if the image is hosted in ECR. SageMaker Training Job output is stored in a Training Job specific sub-prefix of trainingOutputS3DataPath. This contains the SageMaker Training Job output file as well as the SageMaker Training Job model file.

After the Training Job is created, this Estimator polls for success. Upon success, a SageMakerModel is created and returned from fit(). The SageMakerModel is created with a modelImage Docker image URI, which defines the SageMaker model primary container, and with modelEnvironmentVariables environment variables. Each SageMakerModel has a corresponding SageMaker hosting Endpoint. This Endpoint runs on at least endpointInitialInstanceCount instances of type endpointInstanceType. The Endpoint is created either during construction of the SageMakerModel or on the first call to transform(), controlled by endpointCreationPolicy. Each Endpoint instance runs with the sagemakerRole IAM Role.

The transform method on SageMakerModel uses requestRowSerializer to serialize Rows from the Dataset undergoing transformation, to requests on the hosted SageMaker Endpoint. The responseRowDeserializer is used to convert the response from the Endpoint to a series of Rows, forming the transformed Dataset. If modelPrependInputRowsToTransformationRows is true, then each transformed Row is also prepended with its corresponding input Row.

Parameters:
  • trainingImage (str) – A SageMaker Training Job Algorithm Specification Training Image Docker image URI.
  • modelImage (str) – A SageMaker Model hosting Docker image URI.
  • sagemakerRole (IAMRole) – The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR Resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
  • trainingInstanceType (str) – The SageMaker TrainingJob Instance Type to use.
  • trainingInstanceCount (int) – The number of instances of trainingInstanceType to run a SageMaker Training Job with.
  • endpointInstanceType (str) – The SageMaker Endpoint Config instance type.
  • endpointInitialInstanceCount (int) – The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage.
  • requestRowSerializer (RequestRowSerializer) – Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
  • responseRowDeserializer (ResponseRowDeserializer) – Deserializes an Endpoint response into a series of Rows.
  • hyperParameters (dict) – A dict from hyperParameter names to their respective values for training.
  • trainingInputS3DataPath (S3Resource) – An S3 location to upload SageMaker Training Job input data to.
  • trainingOutputS3DataPath (S3Resource) – An S3 location for SageMaker to store Training Job output data to.
  • trainingInstanceVolumeSizeInGB (int) – The EBS volume size in gigabytes of each instance.
  • trainingProjectedColumns (List) – The columns to project from the Dataset being fit before training. If None, no projection occurs and all columns are serialized.
  • trainingChannelName (str) – The SageMaker Channel name to which the serialized Dataset being fit is provided as training input.
  • trainingContentType (str) – The MIME type of the training data.
  • trainingS3DataDistribution (str) – The SageMaker Training Job S3 data distribution scheme.
  • trainingSparkDataFormat (str) – The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
  • trainingSparkDataFormatOptions (dict) – The Spark Data Format Options used during serialization of the Dataset being fit.
  • trainingInputMode (str) – The SageMaker Training Job Channel input mode.
  • trainingCompressionCodec (str) – The type of compression to use when serializing the Dataset being fit for input to SageMaker.
  • trainingMaxRuntimeInSeconds (int) – A SageMaker Training Job Termination Condition MaxRuntimeInSeconds.
  • trainingKmsKeyId (str) – A KMS key ID for the Output Data Source.
  • modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
  • endpointCreationPolicy (EndpointCreationPolicy) – Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
  • sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateModel and CreateEndpoint requests.
  • s3Client (AmazonS3) – Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
  • stsClient (AmazonSTS) – Used to resolve the account number when creating staging input / output buckets.
  • modelPrependInputRowsToTransformationRows (bool) – Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
  • deleteStagingDataAfterTraining (bool) – Whether to remove the training data from S3 after training completes or fails.
  • namePolicyFactory (NamePolicyFactory) – The NamePolicyFactory to use when naming SageMaker entities created during fit.
  • uid (str) – The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
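
A minimal fit-and-transform sketch for a custom algorithm container. The image URIs, role ARN, hyperparameter names, and the training_df/test_df DataFrames are placeholders and must match the container actually being used:

    from sagemaker_pyspark import SageMakerEstimator, IAMRole
    from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
    from sagemaker_pyspark.transformation.deserializers import KMeansProtobufResponseRowDeserializer

    estimator = SageMakerEstimator(
        trainingImage="123456789012.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",
        modelImage="123456789012.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",
        trainingInstanceType="ml.m4.xlarge",
        trainingInstanceCount=1,
        endpointInstanceType="ml.m4.xlarge",
        endpointInitialInstanceCount=1,
        requestRowSerializer=ProtobufRequestRowSerializer(),
        responseRowDeserializer=KMeansProtobufResponseRowDeserializer(),
        hyperParameters={"k": "10", "feature_dim": "784"},
        sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/my-sagemaker-role"))

    # fit() serializes training_df to S3, runs the Training Job, and returns a
    # SageMakerModel backed by a hosted Endpoint.
    model = estimator.fit(training_df)
    predictions = model.transform(test_df)
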
copy(extra)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters:extra – Extra parameters to copy to the new instance
Returns:Copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:extra – extra param values
Returns:merged param map
fit(dataset)

Fits a SageMakerModel on dataset by running a SageMaker training job.

Parameters:dataset (Dataset) – the dataset to use for the training job.
Returns:The Model created by the training job.
Return type:JavaSageMakerModel
getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

Algorithms

K Means

class KMeansSageMakerEstimator(trainingInstanceType, trainingInstanceCount, endpointInstanceType, endpointInitialInstanceCount, sagemakerRole=<sagemaker_pyspark.IAMRoleResource.IAMRoleFromConfig object>, requestRowSerializer=<sagemaker_pyspark.transformation.serializers.serializers.ProtobufRequestRowSerializer object>, responseRowDeserializer=<sagemaker_pyspark.transformation.deserializers.deserializers.KMeansProtobufResponseRowDeserializer object>, trainingInputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingOutputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingInstanceVolumeSizeInGB=1024, trainingProjectedColumns=None, trainingChannelName='train', trainingContentType=None, trainingS3DataDistribution='ShardedByS3Key', trainingSparkDataFormat='sagemaker', trainingSparkDataFormatOptions=None, trainingInputMode='File', trainingCompressionCodec=None, trainingMaxRuntimeInSeconds=86400, trainingKmsKeyId=None, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, region=None, s3Client=<sagemaker_pyspark.SageMakerClients.SageMakerClients._S3DefaultClient object>, stsClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._STSDefaultClient object>, modelPrependInputRowsToTransformationRows=True, deleteStagingDataAfterTraining=True, namePolicyFactory=<sagemaker_pyspark.NamePolicy.RandomNamePolicyFactory object>, uid=None)

Bases: sagemaker_pyspark.SageMakerEstimator.SageMakerEstimatorBase

A SageMakerEstimator that runs a KMeans training job on Amazon SageMaker upon a call to fit() and returns a SageMakerModel that can be used to transform a DataFrame using the hosted K-Means model. K-Means Clustering is useful for grouping similar examples in your dataset.

Amazon SageMaker K-Means clustering trains on RecordIO-encoded Amazon Record protobuf data. SageMaker pyspark writes a DataFrame to S3 by selecting a column of Vectors named “features” and, if present, a column of Doubles named “label”. These names are configurable by passing a dictionary with entries in trainingSparkDataFormatOptions with key “labelColumnName” or “featuresColumnName”, with values corresponding to the desired label and features columns.

For inference, the SageMakerModel returned by KMeansSageMakerEstimator.fit() uses ProtobufRequestRowSerializer to serialize Rows into RecordIO-encoded Amazon Record protobuf messages, by default selecting the column named “features”, which is expected to contain a Vector of Doubles.

Inferences made against an Endpoint hosting a K-Means model contain a “closest_cluster” field and a “distance_to_cluster” field, both appended to the input DataFrame as columns of Double.
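
A minimal sketch, assuming the K-Means hyperparameter setters setK and setFeatureDim; the role ARN and the training_df/test_df DataFrames are placeholders:

    from sagemaker_pyspark import IAMRole
    from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

    kmeans = KMeansSageMakerEstimator(
        trainingInstanceType="ml.m4.xlarge",
        trainingInstanceCount=1,
        endpointInstanceType="ml.m4.xlarge",
        endpointInitialInstanceCount=1,
        sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/my-sagemaker-role"))
    kmeans.setK(10)            # number of clusters
    kmeans.setFeatureDim(784)  # dimensionality of the "features" vectors

    model = kmeans.fit(training_df)       # expects a Vector column named "features"
    clustered = model.transform(test_df)  # adds "closest_cluster" and "distance_to_cluster"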

Parameters:
  • sagemakerRole (IAMRole) – The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR Resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
  • trainingInstanceType (str) – The SageMaker TrainingJob Instance Type to use.
  • trainingInstanceCount (int) – The number of instances of trainingInstanceType to run a SageMaker Training Job with.
  • endpointInstanceType (str) – The SageMaker Endpoint Config instance type.
  • endpointInitialInstanceCount (int) – The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage.
  • requestRowSerializer (RequestRowSerializer) – Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
  • responseRowDeserializer (ResponseRowDeserializer) – Deserializes an Endpoint response into a series of Rows.
  • trainingInputS3DataPath (S3Resource) – An S3 location to upload SageMaker Training Job input data to.
  • trainingOutputS3DataPath (S3Resource) – An S3 location for SageMaker to store Training Job output data to.
  • trainingInstanceVolumeSizeInGB (int) – The EBS volume size in gigabytes of each instance.
  • trainingProjectedColumns (List) – The columns to project from the Dataset being fit before training. If None, no projection occurs and all columns are serialized.
  • trainingChannelName (str) – The SageMaker Channel name to which the serialized Dataset being fit is provided as training input.
  • trainingContentType (str) – The MIME type of the training data.
  • trainingS3DataDistribution (str) – The SageMaker Training Job S3 data distribution scheme.
  • trainingSparkDataFormat (str) – The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
  • trainingSparkDataFormatOptions (dict) – The Spark Data Format Options used during serialization of the Dataset being fit.
  • trainingInputMode (str) – The SageMaker Training Job Channel input mode.
  • trainingCompressionCodec (str) – The type of compression to use when serializing the Dataset being fit for input to SageMaker.
  • trainingMaxRuntimeInSeconds (int) – A SageMaker Training Job Termination Condition MaxRuntimeInSeconds.
  • trainingKmsKeyId (str) – A KMS key ID for the Output Data Source.
  • modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
  • endpointCreationPolicy (EndpointCreationPolicy) – Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
  • sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateModel and CreateEndpoint requests.
  • region (str) – The region in which to run the algorithm. If not specified, gets the region from the DefaultAwsRegionProviderChain.
  • s3Client (AmazonS3) – Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
  • stsClient (AmazonSTS) – Used to resolve the account number when creating staging input / output buckets.
  • modelPrependInputRowsToTransformationRows (bool) – Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
  • deleteStagingDataAfterTraining (bool) – Whether to remove the training data from S3 after training completes or fails.
  • namePolicyFactory (NamePolicyFactory) – The NamePolicyFactory to use when naming SageMaker entities created during fit.
  • uid (str) – The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
copy(extra)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters:extra – Extra parameters to copy to the new instance
Returns:Copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:extra – extra param values
Returns:merged param map
fit(dataset)

Fits a SageMakerModel on dataset by running a SageMaker training job.

Parameters:dataset (Dataset) – the dataset to use for the training job.
Returns:The Model created by the training job.
Return type:JavaSageMakerModel
getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

Linear Learner Regressor

class LinearLearnerRegressor(trainingInstanceType, trainingInstanceCount, endpointInstanceType, endpointInitialInstanceCount, sagemakerRole=<sagemaker_pyspark.IAMRoleResource.IAMRoleFromConfig object>, requestRowSerializer=<sagemaker_pyspark.transformation.serializers.serializers.ProtobufRequestRowSerializer object>, responseRowDeserializer=<sagemaker_pyspark.transformation.deserializers.deserializers.LinearLearnerRegressorProtobufResponseRowDeserializer object>, trainingInputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingOutputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingInstanceVolumeSizeInGB=1024, trainingProjectedColumns=None, trainingChannelName='train', trainingContentType=None, trainingS3DataDistribution='ShardedByS3Key', trainingSparkDataFormat='sagemaker', trainingSparkDataFormatOptions=None, trainingInputMode='File', trainingCompressionCodec=None, trainingMaxRuntimeInSeconds=86400, trainingKmsKeyId=None, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, region=None, s3Client=<sagemaker_pyspark.SageMakerClients.SageMakerClients._S3DefaultClient object>, stsClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._STSDefaultClient object>, modelPrependInputRowsToTransformationRows=True, deleteStagingDataAfterTraining=True, namePolicyFactory=<sagemaker_pyspark.NamePolicy.RandomNamePolicyFactory object>, uid=None, javaObject=None)

Bases: sagemaker_pyspark.SageMakerEstimator.SageMakerEstimatorBase, sagemaker_pyspark.algorithms.LinearLearnerSageMakerEstimator.LinearLearnerParams

A SageMakerEstimator that runs a Linear Learner training job in “regressor” mode in SageMaker and returns a SageMakerModel that can be used to transform a DataFrame using the hosted Linear Learner model. The Linear Learner Regressor is useful for predicting a real-valued label from training examples.

Amazon SageMaker Linear Learner trains on RecordIO-encoded Amazon Record protobuf data. SageMaker pyspark writes a DataFrame to S3 by selecting a column of Vectors named “features” and, if present, a column of Doubles named “label”. These names are configurable by passing a dictionary with entries in trainingSparkDataFormatOptions with key “labelColumnName” or “featuresColumnName”, with values corresponding to the desired label and features columns.

For inference against a hosted Endpoint, the SageMakerModel returned by fit() uses ProtobufRequestRowSerializer to serialize Rows into RecordIO-encoded Amazon Record protobuf messages, by default selecting the column named “features”, which is expected to contain a Vector of Doubles.

Inferences made against an Endpoint hosting a Linear Learner Regressor model contain a “score” field appended to the input DataFrame as a Double.
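
A minimal sketch that trains on custom column names via trainingSparkDataFormatOptions; the role ARN and the training_df/test_df DataFrames are placeholders:

    from sagemaker_pyspark import IAMRole
    from sagemaker_pyspark.algorithms import LinearLearnerRegressor

    regressor = LinearLearnerRegressor(
        trainingInstanceType="ml.m4.xlarge",
        trainingInstanceCount=1,
        endpointInstanceType="ml.m4.xlarge",
        endpointInitialInstanceCount=1,
        sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/my-sagemaker-role"),
        trainingSparkDataFormatOptions={"featuresColumnName": "x",
                                        "labelColumnName": "y"})

    # Training reads columns "x" and "y"; at inference time the default
    # ProtobufRequestRowSerializer still selects a column named "features".
    model = regressor.fit(training_df)
    scored = model.transform(test_df)  # appends a "score" column of Doubles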

Parameters:
  • sagemakerRole (IAMRole) – The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR Resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
  • trainingInstanceType (str) – The SageMaker TrainingJob Instance Type to use.
  • trainingInstanceCount (int) – The number of instances of trainingInstanceType to run a SageMaker Training Job with.
  • endpointInstanceType (str) – The SageMaker Endpoint Config instance type.
  • endpointInitialInstanceCount (int) – The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage.
  • requestRowSerializer (RequestRowSerializer) – Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
  • responseRowDeserializer (ResponseRowDeserializer) – Deserializes an Endpoint response into a series of Rows.
  • trainingInputS3DataPath (S3Resource) – An S3 location to upload SageMaker Training Job input data to.
  • trainingOutputS3DataPath (S3Resource) – An S3 location for SageMaker to store Training Job output data to.
  • trainingInstanceVolumeSizeInGB (int) – The EBS volume size in gigabytes of each instance.
  • trainingProjectedColumns (List) – The columns to project from the Dataset being fit before training. If None, no projection occurs and all columns are serialized.
  • trainingChannelName (str) – The SageMaker Channel name to which the serialized Dataset being fit is provided as training input.
  • trainingContentType (str) – The MIME type of the training data.
  • trainingS3DataDistribution (str) – The SageMaker Training Job S3 data distribution scheme.
  • trainingSparkDataFormat (str) – The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
  • trainingSparkDataFormatOptions (dict) – The Spark Data Format Options used during serialization of the Dataset being fit.
  • trainingInputMode (str) – The SageMaker Training Job Channel input mode.
  • trainingCompressionCodec (str) – The type of compression to use when serializing the Dataset being fit for input to SageMaker.
  • trainingMaxRuntimeInSeconds (int) – A SageMaker Training Job Termination Condition MaxRuntimeInSeconds.
  • trainingKmsKeyId (str) – A KMS key ID for the Output Data Source.
  • modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
  • endpointCreationPolicy (EndpointCreationPolicy) – Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
  • sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateModel and CreateEndpoint requests.
  • region (str) – The region in which to run the algorithm. If not specified, gets the region from the DefaultAwsRegionProviderChain.
  • s3Client (AmazonS3) – Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
  • stsClient (AmazonSTS) – Used to resolve the account number when creating staging input / output buckets.
  • modelPrependInputRowsToTransformationRows (bool) – Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
  • deleteStagingDataAfterTraining (bool) – Whether to remove the training data from S3 after training completes or fails.
  • namePolicyFactory (NamePolicyFactory) – The NamePolicyFactory to use when naming SageMaker entities created during fit.
  • uid (str) – The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
copy(extra)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters:extra – Extra parameters to copy to the new instance
Returns:Copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:extra – extra param values
Returns:merged param map
fit(dataset)

Fits a SageMakerModel on dataset by running a SageMaker training job.

Parameters:dataset (Dataset) – the dataset to use for the training job.
Returns:The Model created by the training job.
Return type:JavaSageMakerModel
getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

Linear Learner Binary Classifier

class LinearLearnerBinaryClassifier(trainingInstanceType, trainingInstanceCount, endpointInstanceType, endpointInitialInstanceCount, sagemakerRole=<sagemaker_pyspark.IAMRoleResource.IAMRoleFromConfig object>, requestRowSerializer=<sagemaker_pyspark.transformation.serializers.serializers.ProtobufRequestRowSerializer object>, responseRowDeserializer=<sagemaker_pyspark.transformation.deserializers.deserializers.LinearLearnerBinaryClassifierProtobufResponseRowDeserializer object>, trainingInputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingOutputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingInstanceVolumeSizeInGB=1024, trainingProjectedColumns=None, trainingChannelName='train', trainingContentType=None, trainingS3DataDistribution='ShardedByS3Key', trainingSparkDataFormat='sagemaker', trainingSparkDataFormatOptions=None, trainingInputMode='File', trainingCompressionCodec=None, trainingMaxRuntimeInSeconds=86400, trainingKmsKeyId=None, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, region=None, s3Client=<sagemaker_pyspark.SageMakerClients.SageMakerClients._S3DefaultClient object>, stsClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._STSDefaultClient object>, modelPrependInputRowsToTransformationRows=True, deleteStagingDataAfterTraining=True, namePolicyFactory=<sagemaker_pyspark.NamePolicy.RandomNamePolicyFactory object>, uid=None, javaObject=None)

Bases: sagemaker_pyspark.SageMakerEstimator.SageMakerEstimatorBase, sagemaker_pyspark.algorithms.LinearLearnerSageMakerEstimator.LinearLearnerParams

A SageMakerEstimator that runs a Linear Learner training job in “binary classifier” mode in SageMaker and returns a SageMakerModel that can be used to transform a DataFrame using the hosted Linear Learner model. The Linear Learner Binary Classifier is useful for classifying examples into one of two classes.

Amazon SageMaker Linear Learner trains on RecordIO-encoded Amazon Record protobuf data. SageMaker pyspark writes a DataFrame to S3 by selecting a column of Vectors named “features” and, if present, a column of Doubles named “label”. These names are configurable by passing a dictionary with entries in trainingSparkDataFormatOptions with key “labelColumnName” or “featuresColumnName”, with values corresponding to the desired label and features columns.

Inferences made against an Endpoint hosting a Linear Learner Binary classifier model contain a “score” field and a “predicted_label” field, both appended to the input DataFrame as Doubles.
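
A minimal sketch that defers endpoint creation until the first transform() call, assuming the EndpointCreationPolicy.CREATE_ON_TRANSFORM constant; the role ARN and DataFrames are placeholders:

    from sagemaker_pyspark import IAMRole, EndpointCreationPolicy
    from sagemaker_pyspark.algorithms import LinearLearnerBinaryClassifier

    classifier = LinearLearnerBinaryClassifier(
        trainingInstanceType="ml.m4.xlarge",
        trainingInstanceCount=1,
        endpointInstanceType="ml.m4.xlarge",
        endpointInitialInstanceCount=1,
        sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/my-sagemaker-role"),
        endpointCreationPolicy=EndpointCreationPolicy.CREATE_ON_TRANSFORM)

    model = classifier.fit(training_df)  # no Endpoint is created yet
    labeled = model.transform(test_df)   # Endpoint is created here; adds "score" and "predicted_label"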

Parameters:
  • sagemakerRole (IAMRole) – The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR Resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
  • trainingInstanceType (str) – The SageMaker TrainingJob Instance Type to use.
  • trainingInstanceCount (int) – The number of instances of trainingInstanceType to run a SageMaker Training Job with.
  • endpointInstanceType (str) – The SageMaker Endpoint Config instance type.
  • endpointInitialInstanceCount (int) – The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage.
  • requestRowSerializer (RequestRowSerializer) – Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
  • responseRowDeserializer (ResponseRowDeserializer) – Deserializes an Endpoint response into a series of Rows.
  • trainingInputS3DataPath (S3Resource) – An S3 location to upload SageMaker Training Job input data to.
  • trainingOutputS3DataPath (S3Resource) – An S3 location for SageMaker to store Training Job output data to.
  • trainingInstanceVolumeSizeInGB (int) – The EBS volume size in gigabytes of each instance.
  • trainingProjectedColumns (List) – The columns to project from the Dataset being fit before training. If None, no projection occurs and all columns are serialized.
  • trainingChannelName (str) – The SageMaker Channel name to which the serialized Dataset being fit is provided as training input.
  • trainingContentType (str) – The MIME type of the training data.
  • trainingS3DataDistribution (str) – The SageMaker Training Job S3 data distribution scheme.
  • trainingSparkDataFormat (str) – The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
  • trainingSparkDataFormatOptions (dict) – The Spark Data Format Options used during serialization of the Dataset being fit.
  • trainingInputMode (str) – The SageMaker Training Job Channel input mode.
  • trainingCompressionCodec (str) – The type of compression to use when serializing the Dataset being fit for input to SageMaker.
  • trainingMaxRuntimeInSeconds (int) – A SageMaker Training Job Termination Condition MaxRuntimeInSeconds.
  • trainingKmsKeyId (str) – A KMS key ID for the Output Data Source.
  • modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
  • endpointCreationPolicy (EndpointCreationPolicy) – Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
  • sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateModel and CreateEndpoint requests.
  • region (str) – The region in which to run the algorithm. If not specified, gets the region from the DefaultAwsRegionProviderChain.
  • s3Client (AmazonS3) – Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
  • stsClient (AmazonSTS) – Used to resolve the account number when creating staging input / output buckets.
  • modelPrependInputRowsToTransformationRows (bool) – Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
  • deleteStagingDataAfterTraining (bool) – Whether to remove the training data from S3 after training completes or fails.
  • namePolicyFactory (NamePolicyFactory) – The NamePolicyFactory to use when naming SageMaker entities created during fit.
  • uid (str) – The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
copy(extra)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters:extra – Extra parameters to copy to the new instance
Returns:Copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:extra – extra param values
Returns:merged param map
fit(dataset)

Fits a SageMakerModel on dataset by running a SageMaker training job.

Parameters:dataset (Dataset) – the dataset to use for the training job.
Returns:The Model created by the training job.
Return type:JavaSageMakerModel
getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

PCA

class PCASageMakerEstimator(trainingInstanceType, trainingInstanceCount, endpointInstanceType, endpointInitialInstanceCount, sagemakerRole=<sagemaker_pyspark.IAMRoleResource.IAMRoleFromConfig object>, requestRowSerializer=<sagemaker_pyspark.transformation.serializers.serializers.ProtobufRequestRowSerializer object>, responseRowDeserializer=<sagemaker_pyspark.transformation.deserializers.deserializers.PCAProtobufResponseRowDeserializer object>, trainingInputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingOutputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingInstanceVolumeSizeInGB=1024, trainingProjectedColumns=None, trainingChannelName='train', trainingContentType=None, trainingS3DataDistribution='ShardedByS3Key', trainingSparkDataFormat='sagemaker', trainingSparkDataFormatOptions=None, trainingInputMode='File', trainingCompressionCodec=None, trainingMaxRuntimeInSeconds=86400, trainingKmsKeyId=None, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, region=None, s3Client=<sagemaker_pyspark.SageMakerClients.SageMakerClients._S3DefaultClient object>, stsClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._STSDefaultClient object>, modelPrependInputRowsToTransformationRows=True, deleteStagingDataAfterTraining=True, namePolicyFactory=<sagemaker_pyspark.NamePolicy.RandomNamePolicyFactory object>, uid=None)

Bases: sagemaker_pyspark.SageMakerEstimator.SageMakerEstimatorBase

A SageMakerEstimator that runs a PCA training job in SageMaker and returns a SageMakerModel that can be used to transform a DataFrame using the hosted PCA model. PCA, or Principal Component Analysis, is useful for reducing the dimensionality of data before training with another algorithm.

Amazon SageMaker PCA trains on RecordIO-encoded Amazon Record protobuf data. SageMaker pyspark writes a DataFrame to S3 by selecting a column of Vectors named “features” and, if present, a column of Doubles named “label”. These names are configurable by passing a dictionary with entries in trainingSparkDataFormatOptions with key “labelColumnName” or “featuresColumnName”, with values corresponding to the desired label and features columns.

PCASageMakerEstimator uses ProtobufRequestRowSerializer to serialize Rows into RecordIO-encoded Amazon Record protobuf messages for inference, by default selecting the column named “features”, which is expected to contain a Vector of Doubles.

Inferences made against an Endpoint hosting a PCA model contain a “projection” field appended to the input DataFrame as a Dense Vector of Doubles.
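
A minimal sketch, assuming the PCA hyperparameter setters setFeatureDim and setNumComponents; the role ARN and DataFrames are placeholders:

    from sagemaker_pyspark import IAMRole
    from sagemaker_pyspark.algorithms import PCASageMakerEstimator

    pca = PCASageMakerEstimator(
        trainingInstanceType="ml.m4.xlarge",
        trainingInstanceCount=1,
        endpointInstanceType="ml.m4.xlarge",
        endpointInitialInstanceCount=1,
        sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/my-sagemaker-role"))
    pca.setFeatureDim(784)    # assumed setter for the "feature_dim" hyperparameter
    pca.setNumComponents(10)  # assumed setter for the "num_components" hyperparameter

    model = pca.fit(training_df)
    reduced = model.transform(test_df)  # appends a "projection" vector column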

Parameters:
  • sagemakerRole (IAMRole) – The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR Resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
  • trainingInstanceType (str) – The SageMaker TrainingJob Instance Type to use.
  • trainingInstanceCount (int) – The number of instances of trainingInstanceType to run a SageMaker Training Job with.
  • endpointInstanceType (str) – The SageMaker Endpoint Config instance type.
  • endpointInitialInstanceCount (int) – The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage.
  • requestRowSerializer (RequestRowSerializer) – Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
  • responseRowDeserializer (ResponseRowDeserializer) – Deserializes an Endpoint response into a series of Rows.
  • trainingInputS3DataPath (S3Resource) – An S3 location to upload SageMaker Training Job input data to.
  • trainingOutputS3DataPath (S3Resource) – An S3 location for SageMaker to store Training Job output data to.
  • trainingInstanceVolumeSizeInGB (int) – The EBS volume size in gigabytes of each instance.
  • trainingProjectedColumns (List) – The columns to project from the Dataset being fit before training. If None, no projection occurs and all columns are serialized.
  • trainingChannelName (str) – The SageMaker Channel name to which the serialized Dataset being fit is provided as training input.
  • trainingContentType (str) – The MIME type of the training data.
  • trainingS3DataDistribution (str) – The SageMaker Training Job S3 data distribution scheme.
  • trainingSparkDataFormat (str) – The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
  • trainingSparkDataFormatOptions (dict) – The Spark Data Format Options used during serialization of the Dataset being fit.
  • trainingInputMode (str) – The SageMaker Training Job Channel input mode.
  • trainingCompressionCodec (str) – The type of compression to use when serializing the Dataset being fit for input to SageMaker.
  • trainingMaxRuntimeInSeconds (int) – A SageMaker Training Job Termination Condition MaxRuntimeInSeconds.
  • trainingKmsKeyId (str) – A KMS key ID for the Output Data Source.
  • modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
  • endpointCreationPolicy (EndpointCreationPolicy) – Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
  • sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateModel and CreateEndpoint requests.
  • region (str) – The region in which to run the algorithm. If not specified, gets the region from the DefaultAwsRegionProviderChain.
  • s3Client (AmazonS3) – Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
  • stsClient (AmazonSTS) – Used to resolve the account number when creating staging input / output buckets.
  • modelPrependInputRowsToTransformationRows (bool) – Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
  • deleteStagingDataAfterTraining (bool) – Whether to remove the training data from S3 after training completes or fails.
  • namePolicyFactory (NamePolicyFactory) – The NamePolicyFactory to use when naming SageMaker entities created during fit.
  • uid (str) – The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
copy(extra)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters:extra – Extra parameters to copy to the new instance
Returns:Copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:extra – extra param values
Returns:merged param map
fit(dataset)

Fits a SageMakerModel on dataset by running a SageMaker training job.

Parameters:dataset (Dataset) – the dataset to use for the training job.
Returns:The Model created by the training job.
Return type:JavaSageMakerModel
getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

XGBoost

class XGBoostSageMakerEstimator(trainingInstanceType, trainingInstanceCount, endpointInstanceType, endpointInitialInstanceCount, sagemakerRole=<sagemaker_pyspark.IAMRoleResource.IAMRoleFromConfig object>, requestRowSerializer=<sagemaker_pyspark.transformation.serializers.serializers.LibSVMRequestRowSerializer object>, responseRowDeserializer=<sagemaker_pyspark.transformation.deserializers.deserializers.XGBoostCSVRowDeserializer object>, trainingInputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingOutputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingInstanceVolumeSizeInGB=1024, trainingProjectedColumns=None, trainingChannelName='train', trainingContentType=None, trainingS3DataDistribution='ShardedByS3Key', trainingSparkDataFormat='libsvm', trainingSparkDataFormatOptions=None, trainingInputMode='File', trainingCompressionCodec=None, trainingMaxRuntimeInSeconds=86400, trainingKmsKeyId=None, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, region=None, s3Client=<sagemaker_pyspark.SageMakerClients.SageMakerClients._S3DefaultClient object>, stsClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._STSDefaultClient object>, modelPrependInputRowsToTransformationRows=True, deleteStagingDataAfterTraining=True, namePolicyFactory=<sagemaker_pyspark.NamePolicy.RandomNamePolicyFactory object>, uid=None)

Bases: sagemaker_pyspark.SageMakerEstimator.SageMakerEstimatorBase

A SageMakerEstimator that runs an XGBoost training job in Amazon SageMaker and returns a SageMakerModel that can be used to transform a DataFrame using the hosted XGBoost model. XGBoost is an open-source distributed gradient boosting library that Amazon SageMaker has adapted to run on its managed training and hosting infrastructure.

XGBoost trains and infers on LibSVM-formatted data. XGBoostSageMakerEstimator uses Spark’s LibSVMFileFormat to write the training DataFrame to S3, and serializes Rows to LibSVM for inference, selecting the column named “features” by default, expected to contain a Vector of Doubles.

Inferences made against an Endpoint hosting an XGBoost model contain a “prediction” field appended to the input DataFrame as a column of Doubles, containing the prediction corresponding to the given Vector of features.

See the XGBoost GitHub repository for more information on XGBoost.
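A minimal usage sketch. The role ARN, instance settings, and the setNumRound hyperparameter setter are illustrative assumptions, and the import path for the estimator is assumed from the library's algorithms module:

```python
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

# The sagemaker_pyspark jars must already be on the Spark classpath
# (see classpath_jars() under "Other Classes" below).
# Role ARN and instance settings are placeholders.
estimator = XGBoostSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"),
    trainingInstanceType="ml.m5.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m5.xlarge",
    endpointInitialInstanceCount=1)

# Assuming the usual XGBoost hyperparameter setters, e.g. the number of boosting rounds.
estimator.setNumRound(15)

# `trainingData` is a DataFrame with a Vector column named "features"
# and a Double column named "label" (LibSVM-style input).
# model = estimator.fit(trainingData)        # runs a SageMaker Training Job
# predictions = model.transform(testData)    # appends a "prediction" column of Doubles
```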

Parameters:
  • sagemakerRole (IAMRole) – The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR Resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
  • trainingInstanceType (str) – The SageMaker TrainingJob Instance Type to use.
  • trainingInstanceCount (int) – The number of instances of trainingInstanceType to run a SageMaker Training Job with.
  • endpointInstanceType (str) – The SageMaker Endpoint Config instance type.
  • endpointInitialInstanceCount (int) – The initial number of SageMaker Endpoint Config instances used to host the modelImage.
  • requestRowSerializer (RequestRowSerializer) – Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
  • responseRowDeserializer (ResponseRowDeserializer) – Deserializes an Endpoint response into a series of Rows.
  • trainingInputS3DataPath (S3Resource) – An S3 location to upload SageMaker Training Job input data to.
  • trainingOutputS3DataPath (S3Resource) – An S3 location for SageMaker to store Training Job output data to.
  • trainingInstanceVolumeSizeInGB (int) – The EBS volume size in gigabytes of each instance.
  • trainingProjectedColumns (List) – The columns to project from the Dataset being fit before training. If None is passed, no projection is applied and all columns are serialized.
  • trainingChannelName (str) – The SageMaker Channel name to input serialized Dataset fit input to.
  • trainingContentType (str) – The MIME type of the training data.
  • trainingS3DataDistribution (str) – The SageMaker Training Job S3 data distribution scheme.
  • trainingSparkDataFormat (str) – The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
  • trainingSparkDataFormatOptions (dict) – The Spark Data Format Options used during serialization of the Dataset being fit.
  • trainingInputMode (str) – The SageMaker Training Job Channel input mode.
  • trainingCompressionCodec (str) – The type of compression to use when serializing the Dataset being fit for input to SageMaker.
  • trainingMaxRuntimeInSeconds (int) – A SageMaker Training Job Termination Condition MaxRuntimeInSeconds.
  • trainingKmsKeyId (str) – A KMS key ID for the Output Data Source.
  • modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
  • endpointCreationPolicy (EndpointCreationPolicy) – Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
  • sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
  • region (str) – The region in which to run the algorithm. If not specified, gets the region from the DefaultAwsRegionProviderChain.
  • s3Client (AmazonS3) – Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
  • stsClient (AmazonSTS) – Used to resolve the account number when creating staging input / output buckets.
  • modelPrependInputRowsToTransformationRows (bool) – Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
  • deleteStagingDataAfterTraining (bool) – Whether to remove the training data from S3 after training completes or fails.
  • namePolicyFactory (NamePolicyFactory) – The NamePolicyFactory to use when naming SageMaker entities created during fit.
  • uid (str) – The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
copy(extra)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters:extra – Extra parameters to copy to the new instance
Returns:Copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:extra – extra param values
Returns:merged param map
fit(dataset)

Fits a SageMakerModel on dataset by running a SageMaker training job.

Parameters:dataset (Dataset) – the dataset to use for the training job.
Returns:The Model created by the training job.
Return type:JavaSageMakerModel
getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

Serializers

class RequestRowSerializer

Bases: sagemaker_pyspark.wrapper.SageMakerJavaWrapper

setSchema(schema)

Sets the rowSchema for this RequestRowSerializer.

Parameters:schema (StructType) – the schema that this RequestRowSerializer will use.
class UnlabeledCSVRequestRowSerializer(schema=None, featuresColumnName='features')

Bases: sagemaker_pyspark.transformation.serializers.serializers.RequestRowSerializer

Serializes the features column of each Row to an unlabeled CSV record, matching the current implementation of the scoring service.

Parameters:
  • schema (StructType) – The schema of Rows being serialized. This parameter is optional, as the schema may not be known when this serializer is constructed.
  • featuresColumnName (str) – name of the features column.
class ProtobufRequestRowSerializer(schema=None, featuresColumnName='features')

Bases: sagemaker_pyspark.transformation.serializers.serializers.RequestRowSerializer

A RequestRowSerializer for converting labeled rows to SageMaker Protobuf-in-recordio request data.

Parameters:schema (StructType) – The schema of Rows being serialized. This parameter is optional as the schema may not be known when this serializer is constructed.
class LibSVMRequestRowSerializer(schema=None, labelColumnName='label', featuresColumnName='features')

Bases: sagemaker_pyspark.transformation.serializers.serializers.RequestRowSerializer

Extracts a label column and features column from a Row and serializes as a LibSVM record.

Each Row must contain a Double column and a Vector column containing the label and features respectively. Row field indexes for the label and features are obtained by looking up the index of labelColumnName and featuresColumnName respectively in the specified schema.

A schema must be specified before this RequestRowSerializer can be used by a client. The schema is set either on instantiation of this RequestRowSerializer or by RequestRowSerializer.setSchema().

Parameters:
  • schema (StructType) – The schema of Rows being serialized. This parameter is optional as the schema may not be known when this serializer is constructed.
  • labelColumnName (str) – The name of the label column.
  • featuresColumnName (str) – The name of the features column.
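As a sketch of supplying the required schema at construction time (the import path is assumed from the module locations shown in the Bases lines above):

```python
from pyspark.ml.linalg import VectorUDT
from pyspark.sql.types import StructType, StructField, DoubleType
from sagemaker_pyspark.transformation.serializers import LibSVMRequestRowSerializer

# Rows carry a Double label and a Vector of features.
schema = StructType([
    StructField("label", DoubleType()),
    StructField("features", VectorUDT())])

serializer = LibSVMRequestRowSerializer(
    schema=schema, labelColumnName="label", featuresColumnName="features")

# Alternatively, the schema can be set later via serializer.setSchema(schema).
```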

Deserializers

class ResponseRowDeserializer

Bases: sagemaker_pyspark.wrapper.SageMakerJavaWrapper

class XGBoostCSVRowDeserializer(prediction_column_name='prediction')

Bases: sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer

A ResponseRowDeserializer for converting a comma-delimited string of predictions into Rows containing a Double prediction column.

Parameters:prediction_column_name (str) – the name of the output predictions column.
class ProtobufResponseRowDeserializer(schema, protobufKeys=None)

Bases: sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer

A ResponseRowDeserializer for converting SageMaker Protobuf-in-recordio response data to Spark rows.

Parameters:schema (StructType) – The schema of rows in the response.
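A brief construction sketch; the response field name and the import path are illustrative assumptions:

```python
from pyspark.ml.linalg import VectorUDT
from pyspark.sql.types import StructType, StructField
from sagemaker_pyspark.transformation.deserializers import ProtobufResponseRowDeserializer

# Assume each response record carries a Vector field named "projection" (placeholder name).
response_schema = StructType([StructField("projection", VectorUDT())])
deserializer = ProtobufResponseRowDeserializer(schema=response_schema)
```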
class PCAProtobufResponseRowDeserializer(projection_column_name='projection')

Bases: sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer

Deserializes a Protobuf response from the PCA model image to a Vector of Doubles containing the projection of the input vector.

Parameters:projection_column_name (str) – name of the column holding Vectors of Doubles representing the projected vectors.
class LDAProtobufResponseRowDeserializer(projection_column_name='topic_mixture')

Bases: sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer

Deserializes a Protobuf response from the LDA model image to a Vector of Doubles representing the topic mixture for the document represented by the input vector.

Parameters:projection_column_name (str) – name of the column holding Vectors of Doubles representing the topic mixtures for the documents.
class KMeansProtobufResponseRowDeserializer(distance_to_cluster_column_name='distance_to_cluster', closest_cluster_column_name='closest_cluster')

Bases: sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer

Deserializes a Protobuf response from the KMeans model image into Rows in a Spark DataFrame.

Parameters:
  • distance_to_cluster_column_name (str) – name of the column of doubles indicating the distance to the nearest cluster from the input record.
  • closest_cluster_column_name (str) – name of the column of doubles indicating the label of the closest cluster for the input record.
class LinearLearnerBinaryClassifierProtobufResponseRowDeserializer(score_column_name='score', predicted_label_column_name='predicted_label')

Bases: sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer

Deserializes a Protobuf response from the LinearLearner model image with predictorType “binary_classifier” into Rows in a Spark DataFrame.

Parameters:
  • score_column_name (str) – name of the column indicating the output score for the record.
  • predicted_label_column_name (str) – name of the column indicating the predicted label for the record.
class LinearLearnerMultiClassClassifierProtobufResponseRowDeserializer(score_column_name='score', predicted_label_column_name='predicted_label')

Bases: sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer

Deserializes a Protobuf response from the LinearLearner model image with predictorType “multiclass_classifier” into Rows in a Spark DataFrame.

Parameters:
  • score_column_name (str) – name of the column indicating the output score for the record.
  • predicted_label_column_name (str) – name of the column indicating the predicted label for the record.
class LinearLearnerRegressorProtobufResponseRowDeserializer(score_column_name='score')

Bases: sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer

Deserializes a Protobuf response from the LinearLearner model image with predictorType “regressor” into Rows in a Spark DataFrame.

Parameters:score_column_name (str) – name of the column of Doubles indicating the output score for the record.
class FactorizationMachinesBinaryClassifierDeserializer(score_column_name='score', predicted_label_column_name='predicted_label')

Bases: sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer

Deserializes a Protobuf response from the Factorization Machines model image with predictorType “binary_classifier” into Rows in a Spark DataFrame.

Parameters:
  • score_column_name (str) – name of the column indicating the output score for the record.
  • predicted_label_column_name (str) – name of the column indicating the predicted label for the record.
class FactorizationMachinesRegressorDeserializer(score_column_name='score')

Bases: sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer

Deserializes a Protobuf response from the Factorization Machines model image with predictorType “regressor” into Rows in a Spark DataFrame.

Parameters:score_column_name (str) – name of the column of Doubles indicating the output score for the record.
class LibSVMResponseRowDeserializer(dim, labelColumnName, featuresColumnName)

Bases: sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer

A ResponseRowDeserializer for converting LibSVM response data to labeled vectors.

Parameters:
  • dim (int) – The vector dimension
  • labelColumnName (str) – The name of the label column
  • featuresColumnName (str) – The name of the features column

Other Classes

Top-level module for sagemaker_pyspark

class SageMakerJavaWrapper

Bases: pyspark.ml.wrapper.JavaWrapper

class IAMRole(role)

Bases: sagemaker_pyspark.IAMRoleResource.IAMRoleResource

Specifies an IAM Role by ARN or Name.

Parameters:role (str) – IAM Role Name or ARN.
class IAMRoleFromConfig(configKey='com.amazonaws.services.sagemaker.sparksdk.sagemakerrole')

Bases: sagemaker_pyspark.IAMRoleResource.IAMRoleResource

Gets an IAM role from the Spark configuration.

Parameters:configKey (str) – key in Spark config corresponding to IAM Role ARN.
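A short sketch of both ways to supply a role (the ARN below is a placeholder):

```python
from sagemaker_pyspark import IAMRole, IAMRoleFromConfig

# Pass the role directly by ARN (placeholder ARN) ...
role = IAMRole("arn:aws:iam::123456789012:role/SageMakerRole")

# ... or resolve it from the Spark configuration under the default key.
role_from_config = IAMRoleFromConfig()
```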
class S3DataPath(bucket, objectPath)

Bases: sagemaker_pyspark.S3Resources.S3Resource, sagemaker_pyspark.wrapper.SageMakerJavaWrapper

Represents a location within an S3 Bucket.

Parameters:
  • bucket (str) – An S3 Bucket Name.
  • objectPath (str) – An S3 key or key prefix.
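For example, a specific bucket and key prefix can be targeted instead of an auto-created staging location (names are placeholders):

```python
from sagemaker_pyspark import S3DataPath

# Stage training input and output under a chosen bucket and prefix.
training_input = S3DataPath("my-training-bucket", "xgboost/train")
training_output = S3DataPath("my-training-bucket", "xgboost/output")
```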
class S3AutoCreatePath

Bases: sagemaker_pyspark.S3Resources.S3Resource, sagemaker_pyspark.wrapper.SageMakerJavaWrapper

Defines an S3 location that will be auto-created at runtime.

class S3Resource

Bases: sagemaker_pyspark.wrapper.SageMakerJavaWrapper

An S3 Resource for SageMaker to use.

class EndpointCreationPolicy

Bases: object

Determines whether and when to create the Endpoint and other Hosting resources.

CREATE_ON_CONSTRUCT

create the Endpoint upon creation of the SageMakerModel, at the end of fit()

CREATE_ON_TRANSFORM

create the Endpoint upon invocation of SageMakerModel.transform().

DO_NOT_CREATE

do not create the Endpoint.
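For example, the policy can be passed to an Estimator to defer Endpoint creation until the first transform() call; all other constructor values below are placeholders:

```python
from sagemaker_pyspark import EndpointCreationPolicy, IAMRole
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

# Defer Endpoint and EndpointConfig creation until SageMakerModel.transform() is first invoked.
estimator = XGBoostSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"),
    trainingInstanceType="ml.m5.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m5.xlarge",
    endpointInitialInstanceCount=1,
    endpointCreationPolicy=EndpointCreationPolicy.CREATE_ON_TRANSFORM)
```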

class Option(value)

Bases: sagemaker_pyspark.wrapper.SageMakerJavaWrapper

class RandomNamePolicy(prefix='')

Bases: sagemaker_pyspark.NamePolicy.NamePolicy

Provides random, unique SageMaker entity names that begin with the specified prefix.

Parameters:prefix (str) – The common name prefix for all SageMaker entities named with this NamePolicy.
class RandomNamePolicyFactory(prefix='')

Bases: sagemaker_pyspark.NamePolicy.NamePolicyFactory

Creates a RandomNamePolicy upon a call to createNamePolicy

Parameters:prefix (str) – The common name prefix for all SageMaker entities named with this NamePolicy.
class CustomNamePolicy(trainingJobName, modelName, endpointConfigName, endpointName)

Bases: sagemaker_pyspark.NamePolicy.NamePolicy

Provides custom SageMaker entity names.

Parameters:
  • trainingJobName (str) – The training job name of the SageMaker entity
  • modelName (str) – The model name of the SageMaker entity
  • endpointConfigName (str) – The endpoint config name of the SageMaker entity
  • endpointName (str) – The endpoint name of the SageMaker entity
class CustomNamePolicyFactory(trainingJobName, modelName, endpointConfigName, endpointName)

Bases: sagemaker_pyspark.NamePolicy.NamePolicyFactory

Creates a CustomNamePolicy upon a call to createNamePolicy

Parameters:
  • trainingJobName (str) – The training job name of the SageMaker entity with this NamePolicy.
  • modelName (str) – The model name of the SageMaker entity with this NamePolicy.
  • endpointConfigName (str) – The endpoint config name of the SageMaker entity with this NamePolicy.
  • endpointName (str) – The endpoint name of the SageMaker entity with this NamePolicy.
class CustomNamePolicyWithTimeStampSuffix(trainingJobName, modelName, endpointConfigName, endpointName)

Bases: sagemaker_pyspark.NamePolicy.NamePolicy

Provides custom SageMaker entity names with timestamp suffix.

Parameters:
  • trainingJobName (str) – The training job name of the SageMaker entity
  • modelName (str) – The model name of the SageMaker entity
  • endpointConfigName (str) – The endpoint config name of the SageMaker entity
  • endpointName (str) – The endpoint name of the SageMaker entity
class CustomNamePolicyWithTimeStampSuffixFactory(trainingJobName, modelName, endpointConfigName, endpointName)

Bases: sagemaker_pyspark.NamePolicy.NamePolicyFactory

Creates a CustomNamePolicyWithTimeStampSuffix upon a call to createNamePolicy

Parameters:
  • trainingJobName (str) – The training job name of the SageMaker entity with this NamePolicy.
  • modelName (str) – The model name of the SageMaker entity with this NamePolicy.
  • endpointConfigName (str) – The endpoint config name of the SageMaker entity with this NamePolicy.
  • endpointName (str) – The endpoint name of the SageMaker entity with this NamePolicy.
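A construction sketch; all names are placeholders, and the top-level import is assumed from this section's module layout:

```python
from sagemaker_pyspark import CustomNamePolicyWithTimeStampSuffixFactory

# Each generated SageMaker entity name gets a timestamp suffix appended.
name_policy_factory = CustomNamePolicyWithTimeStampSuffixFactory(
    "my-training-job", "my-model", "my-endpoint-config", "my-endpoint")

# The factory is typically passed to an Estimator as namePolicyFactory=name_policy_factory.
```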
classpath_jars()

Returns a list with the paths to the required jar files.

The sagemaker_pyspark library is mostly a wrapper around the Scala sagemaker-spark SDK and depends on a set of jar files to work correctly. This function retrieves the locations of these jars in the local installation.

Returns:List of absolute paths.
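A typical way to use this when building a SparkSession; a sketch using standard Spark classpath settings:

```python
from pyspark.sql import SparkSession
from sagemaker_pyspark import classpath_jars

# Put the bundled jars on the driver and executor classpaths.
jars = ":".join(classpath_jars())
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", jars)
         .config("spark.executor.extraClassPath", jars)
         .getOrCreate())
```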
class SageMakerResourceCleanup(sagemakerClient, java_object=None)

Bases: sagemaker_pyspark.wrapper.SageMakerJavaWrapper

class CreatedResources(model_name=None, endpoint_config_name=None, endpoint_name=None, java_object=None)

Bases: sagemaker_pyspark.wrapper.SageMakerJavaWrapper

Resources that may have been created during operation of the SageMaker Estimator and Model.

Parameters:
  • model_name (str) – Name of the SageMaker Model that was created, or None if it wasn’t created.
  • endpoint_config_name (str) – Name of the SageMaker EndpointConfig that was created, or None if it wasn’t created.
  • endpoint_name (str) – Name of the SageMaker Endpoint that was created, or None if it wasn’t created.
  • java_object (py4j.java_gateway.JavaObject, optional) – An existing CreatedResources Java instance. If provided, the other arguments are ignored.
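A cleanup sketch, assuming a SageMakerModel returned by an estimator's fit(); the deleteResources and getCreatedResources helpers follow the library's published usage examples and are assumptions of this sketch:

```python
from sagemaker_pyspark import SageMakerResourceCleanup

def delete_sagemaker_resources(model):
    """Delete the Model, EndpointConfig, and Endpoint created for `model`.

    `model` is assumed to be a SageMakerModel returned by fit(); the helper
    methods used below are assumed from the library's usage examples.
    """
    cleanup = SageMakerResourceCleanup(model.sagemakerClient)
    cleanup.deleteResources(model.getCreatedResources())
```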