API¶
This part of the documentation covers all the interfaces of sagemaker_pyspark.
SageMakerModel¶
- class SageMakerModel(endpointInstanceType, endpointInitialInstanceCount, requestRowSerializer, responseRowDeserializer, existingEndpointName=None, modelImage=None, modelPath=None, modelEnvironmentVariables=None, modelExecutionRoleARN=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, prependResultRows=True, namePolicy=<sagemaker_pyspark.NamePolicy.RandomNamePolicy object>, uid=None, javaObject=None)¶
  Bases: sagemaker_pyspark.wrapper.SageMakerJavaWrapper, pyspark.ml.wrapper.JavaModel
A Model implementation which transforms a DataFrame by making requests to a SageMaker Endpoint. Manages life cycle of all necessary SageMaker entities, including Model, EndpointConfig, and Endpoint.
This Model transforms one DataFrame to another by repeated, distributed SageMaker Endpoint invocation. Each invocation request body is formed by concatenating input DataFrame Rows serialized to Byte Arrays by the specified RequestRowSerializer. The invocation request content-type property is set from RequestRowSerializer.contentType. The invocation request accepts property is set from ResponseRowDeserializer.accepts.
The transformed DataFrame is produced by deserializing each invocation response body into a series of Rows. Row deserialization is delegated to the specified ResponseRowDeserializer. If prependResultRows is false, the transformed DataFrame will contain just these Rows. If prependResultRows is true, then each transformed Row is a concatenation of the input Row with its corresponding SageMaker invocation deserialized Row.
Each invocation of transform() passes the Dataset.schema of the input DataFrame to requestRowSerializer by invoking RequestRowSerializer.setSchema().
The specified RequestRowSerializer also controls the validity of input Row Schemas for this Model. Schema validation is carried out on each call to transformSchema(), which invokes RequestRowSerializer.validateSchema().
Adapting this SageMaker model to the data format and type of a specific Endpoint is achieved by sub-classing RequestRowSerializer and ResponseRowDeserializer. Examples of a Serializer and Deserializer are LibSVMRequestRowSerializer and LibSVMResponseRowDeserializer, respectively.
Parameters: - endpointInstanceType (str) – The instance type used to run the model container
- endpointInitialInstanceCount (int) – The initial number of instances used to host the model
- requestRowSerializer (RequestRowSerializer) – Serializes a Row to an Array of Bytes
- responseRowDeserializer (ResponseRowDeserializer) – Deserializes an Array of Bytes to a series of Rows
- existingEndpointName (str) – The name of an existing SageMaker Endpoint for this Model to use, rather than creating a new one
- modelImage (str) – A Docker image URI
- modelPath (str) – An S3 location that a successfully completed SageMaker Training Job has stored its model output to.
- modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
- modelExecutionRoleARN (str) – The IAM Role used by SageMaker when running the hosted Model and to download model data from S3
- endpointCreationPolicy (EndpointCreationPolicy) – Whether the endpoint is created upon SageMakerModel construction, transformation, or not at all.
- sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateModel and CreateEndpoint requests.
- prependResultRows (bool) – Whether the transformation result should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
- namePolicy (NamePolicy) – The NamePolicy to use when naming SageMaker entities created during usage of this Model.
- uid (str) – The unique identifier of this Model. Used to represent this stage in Spark ML pipelines.
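A sketch of wiring these parameters together (the image URI, S3 path, role ARN, and the serializer/deserializer choice below are illustrative placeholders; a live AWS environment is required to actually create the endpoint):

```python
from sagemaker_pyspark import SageMakerModel
from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
from sagemaker_pyspark.transformation.deserializers import KMeansProtobufResponseRowDeserializer

# Placeholder resources: substitute your own image URI, model artifact path,
# and IAM role ARN.
model = SageMakerModel(
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    requestRowSerializer=ProtobufRequestRowSerializer(),
    responseRowDeserializer=KMeansProtobufResponseRowDeserializer(),
    modelImage="123456789012.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",
    modelPath="s3://my-bucket/prefix/model.tar.gz",
    modelExecutionRoleARN="arn:aws:iam::123456789012:role/my-sagemaker-role",
)
```

With the default endpointCreationPolicy, the Model, EndpointConfig, and Endpoint are created at construction time.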
- copy(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
  Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
- explainParam(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams()¶ Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values Returns: merged param map
- classmethod fromEndpoint(endpointName, requestRowSerializer, responseRowDeserializer, modelEnvironmentVariables=None, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, prependResultRows=True, namePolicy=<sagemaker_pyspark.NamePolicy.RandomNamePolicy object>, uid='sagemaker')¶ Creates a JavaSageMakerModel that sends requests to an existing endpoint that is currently in service.
  The returned JavaSageMakerModel can be used to transform DataFrames.
Parameters: - endpointName (str) – The name of an endpoint that is currently in service.
- requestRowSerializer (RequestRowSerializer) – Serializes a row to an array of bytes.
- responseRowDeserializer (ResponseRowDeserializer) – Deserializes an array of bytes to a series of rows.
- modelEnvironmentVariables – The environment variables that SageMaker will set on the model container during execution.
- sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
- prependResultRows (bool) – Whether the transformation result should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
- namePolicy (NamePolicy) – The NamePolicy to use when naming SageMaker entities created during usage of the returned model.
- uid (String) – The unique identifier of the SageMakerModel. Used to represent the stage in Spark ML pipelines.
Returns: A JavaSageMakerModel that sends InvokeEndpoint requests to an endpoint hosting the training job’s model.
Return type: JavaSageMakerModel
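For example, attaching to an already in-service endpoint and transforming a DataFrame might look like this sketch (the endpoint name is a placeholder, and `df` stands for a DataFrame whose schema satisfies the serializer):

```python
from sagemaker_pyspark import SageMakerModel
from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
from sagemaker_pyspark.transformation.deserializers import KMeansProtobufResponseRowDeserializer

# "my-existing-endpoint" is a placeholder for an endpoint currently in service.
model = SageMakerModel.fromEndpoint(
    endpointName="my-existing-endpoint",
    requestRowSerializer=ProtobufRequestRowSerializer(),
    responseRowDeserializer=KMeansProtobufResponseRowDeserializer(),
)

# df is assumed to be a DataFrame with a "features" Vector column.
transformed = model.transform(df)
```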
- classmethod fromModelS3Path(modelPath, modelImage, modelExecutionRoleARN, endpointInstanceType, endpointInitialInstanceCount, requestRowSerializer, responseRowDeserializer, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, prependResultRows=True, namePolicy=<sagemaker_pyspark.NamePolicy.RandomNamePolicy object>, uid='sagemaker')¶ Creates a JavaSageMakerModel from existing model data in S3.
  The returned JavaSageMakerModel can be used to transform DataFrames.
Parameters: - modelPath (str) – The S3 URI to the model data to host.
- modelImage (str) – The URI of the image that will serve model inferences.
- modelExecutionRoleARN (str) – The IAM Role used by SageMaker when running the hosted Model and to download model data from S3.
- endpointInstanceType (str) – The instance type used to run the model container.
- endpointInitialInstanceCount (int) – The initial number of instances used to host the model.
- requestRowSerializer (RequestRowSerializer) – Serializes a row to an array of bytes.
- responseRowDeserializer (ResponseRowDeserializer) – Deserializes an array of bytes to a series of rows.
- modelEnvironmentVariables – The environment variables that SageMaker will set on the model container during execution.
- endpointCreationPolicy (EndpointCreationPolicy) – Whether the endpoint is created upon SageMakerModel construction, transformation, or not at all.
- sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
- prependResultRows (bool) – Whether the transformation result should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
- namePolicy (NamePolicy) – The NamePolicy to use when naming SageMaker entities created during usage of the returned model.
- uid (String) – The unique identifier of the SageMakerModel. Used to represent the stage in Spark ML pipelines.
Returns: A JavaSageMakerModel that sends InvokeEndpoint requests to an endpoint hosting the training job’s model.
Return type: JavaSageMakerModel
- classmethod fromTrainingJob(trainingJobName, modelImage, modelExecutionRoleARN, endpointInstanceType, endpointInitialInstanceCount, requestRowSerializer, responseRowDeserializer, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, prependResultRows=True, namePolicy=<sagemaker_pyspark.NamePolicy.RandomNamePolicy object>, uid='sagemaker')¶ Creates a JavaSageMakerModel from a successfully completed training job name.
  The returned JavaSageMakerModel can be used to transform DataFrames.
Parameters: - trainingJobName (str) – Name of the successfully completed training job.
- modelImage (str) – URI of the image that will serve model inferences.
- modelExecutionRoleARN (str) – The IAM Role used by SageMaker when running the hosted Model and to download model data from S3.
- endpointInstanceType (str) – The instance type used to run the model container.
- endpointInitialInstanceCount (int) – The initial number of instances used to host the model.
- requestRowSerializer (RequestRowSerializer) – Serializes a row to an array of bytes.
- responseRowDeserializer (ResponseRowDeserializer) – Deserializes an array of bytes to a series of rows.
- modelEnvironmentVariables – The environment variables that SageMaker will set on the model container during execution.
- endpointCreationPolicy (EndpointCreationPolicy) – Whether the endpoint is created upon SageMakerModel construction, transformation, or not at all.
- sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
- prependResultRows (bool) – Whether the transformation result should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
- namePolicy (NamePolicy) – The NamePolicy to use when naming SageMaker entities created during usage of the returned model.
- uid (String) – The unique identifier of the SageMakerModel. Used to represent the stage in Spark ML pipelines.
Returns: A JavaSageMakerModel that sends InvokeEndpoint requests to an endpoint hosting the training job’s model.
Return type: JavaSageMakerModel
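A sketch of hosting the output of a completed training job (all identifiers below are placeholders):

```python
from sagemaker_pyspark import SageMakerModel
from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
from sagemaker_pyspark.transformation.deserializers import KMeansProtobufResponseRowDeserializer

# Placeholder job name, image URI, and role ARN.
model = SageMakerModel.fromTrainingJob(
    trainingJobName="my-completed-training-job",
    modelImage="123456789012.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",
    modelExecutionRoleARN="arn:aws:iam::123456789012:role/my-sagemaker-role",
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    requestRowSerializer=ProtobufRequestRowSerializer(),
    responseRowDeserializer=KMeansProtobufResponseRowDeserializer(),
)
```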
- getOrDefault(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getParam(paramName)¶ Gets a param by its name.
- hasDefault(param)¶ Checks whether a param has a default value.
- hasParam(paramName)¶ Tests whether this instance contains a param with a given (string) name.
- isDefined(param)¶ Checks whether a param is explicitly set by user or has a default value.
- isSet(param)¶ Checks whether a param is explicitly set by user.
- params¶ Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- transform(dataset)¶ Transforms the input dataset with optional parameters.
  Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  - params – an optional param map that overrides embedded params.
  Returns: transformed dataset
  New in version 1.3.0.
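A minimal usage sketch, assuming `model` is a constructed SageMakerModel and `df` is a DataFrame whose schema is accepted by the model's RequestRowSerializer:

```python
# With prependResultRows=True (the default), the output contains the input
# columns followed by the columns produced by the responseRowDeserializer.
output = model.transform(df)
output.show()
```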
SageMakerEstimator¶
- class SageMakerEstimator(trainingImage, modelImage, trainingInstanceType, trainingInstanceCount, endpointInstanceType, endpointInitialInstanceCount, requestRowSerializer, responseRowDeserializer, hyperParameters=None, trainingInputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingOutputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingInstanceVolumeSizeInGB=1024, trainingProjectedColumns=None, trainingChannelName='train', trainingContentType=None, trainingS3DataDistribution='ShardedByS3Key', trainingSparkDataFormat='sagemaker', trainingSparkDataFormatOptions=None, trainingInputMode='File', trainingCompressionCodec=None, trainingMaxRuntimeInSeconds=86400, trainingKmsKeyId=None, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, sagemakerRole=<sagemaker_pyspark.IAMRoleResource.IAMRoleFromConfig object>, s3Client=<sagemaker_pyspark.SageMakerClients.SageMakerClients._S3DefaultClient object>, stsClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._STSDefaultClient object>, modelPrependInputRowsToTransformationRows=True, deleteStagingDataAfterTraining=True, namePolicyFactory=<sagemaker_pyspark.NamePolicy.RandomNamePolicyFactory object>, uid=None)¶
  Bases: sagemaker_pyspark.SageMakerEstimator.SageMakerEstimatorBase
Adapts a SageMaker learning Algorithm to a Spark Estimator.
Fits a SageMakerModel by running a SageMaker Training Job on a Spark Dataset. Each call to fit() submits a new SageMaker Training Job, creates a new SageMaker Model, and creates a new SageMaker Endpoint Config. A new Endpoint is either created during fit(), or the returned SageMakerModel is configured to create an Endpoint on transform, depending on endpointCreationPolicy.
On fit, the input Dataset is serialized with the specified trainingSparkDataFormat using the specified trainingSparkDataFormatOptions and uploaded to an S3 location specified by trainingInputS3DataPath. The serialized Dataset is compressed with trainingCompressionCodec, if not None. trainingProjectedColumns can be used to control which columns of the input Dataset are transmitted to SageMaker. If not None, then only those column names will be serialized as input to the SageMaker Training Job.
A Training Job is created with the uploaded Dataset being input to the specified trainingChannelName, with the specified trainingInputMode. The algorithm is specified by trainingImage, a Docker image URI reference. The Training Job is created with trainingInstanceCount instances of type trainingInstanceType. The Training Job will time out after trainingMaxRuntimeInSeconds, if not None.
SageMaker Training Job hyperparameters are built from the params on this Estimator. Param objects with neither a default value nor a set value are ignored. If a Param is not set but has a default value, the default value will be used. Param values are converted to SageMaker hyperparameter String values.
SageMaker uses the IAM Role with ARN sagemakerRole to access the input and output S3 buckets, and trainingImage if the image is hosted in ECR. SageMaker Training Job output is stored in a Training Job specific sub-prefix of trainingOutputS3DataPath. This contains the SageMaker Training Job output file as well as the SageMaker Training Job model file.
After the Training Job is created, this Estimator will poll for success. Upon success, a SageMakerModel is created and returned from fit. The SageMakerModel is created with a modelImage Docker image URI, defining the SageMaker model primary container, and with modelEnvironmentVariables environment variables. Each SageMakerModel has a corresponding SageMaker hosting Endpoint. This Endpoint runs on at least endpointInitialInstanceCount instances of type endpointInstanceType. The Endpoint is created either during construction of the SageMakerModel or on the first call to transform, controlled by endpointCreationPolicy. Each Endpoint instance runs with the sagemakerRole IAM Role.
The transform method on SageMakerModel uses requestRowSerializer to serialize Rows from the Dataset undergoing transformation into requests to the hosted SageMaker Endpoint. The responseRowDeserializer is used to convert the response from the Endpoint to a series of Rows, forming the transformed Dataset. If modelPrependInputRowsToTransformationRows is true, then each transformed Row is also prepended with its corresponding input Row.
Parameters: - trainingImage (String) – A SageMaker Training Job Algorithm Specification Training Image Docker image URI.
- modelImage (String) – A SageMaker Model hosting Docker image URI.
- sageMakerRole (IAMRole) – The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR Resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
- trainingInstanceType (str) – The SageMaker TrainingJob Instance Type to use.
- trainingInstanceCount (int) – The number of instances of trainingInstanceType to run a SageMaker Training Job with.
- endpointInstanceType (str) – The SageMaker Endpoint Config instance type.
- endpointInitialInstanceCount (int) – The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage.
- requestRowSerializer (RequestRowSerializer) – Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
- responseRowDeserializer (ResponseRowDeserializer) – Deserializes an Endpoint response into a series of Rows.
- hyperParameters (dict) – A dict from hyperParameter names to their respective values for training.
- trainingInputS3DataPath (S3Resource) – An S3 location to upload SageMaker Training Job input data to.
- trainingOutputS3DataPath (S3Resource) – An S3 location for SageMaker to store Training Job output data to.
- trainingInstanceVolumeSizeInGB (int) – The EBS volume size in gigabytes of each instance.
- trainingProjectedColumns (List) – The columns to project from the Dataset being fit before training. If None is passed then no specific projection will occur and all columns will be serialized.
- trainingChannelName (str) – The SageMaker Channel name to input serialized Dataset fit input to.
- trainingContentType (str) – The MIME type of the training data.
- trainingS3DataDistribution (str) – The SageMaker Training Job S3 data distribution scheme.
- trainingSparkDataFormat (str) – The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
- trainingSparkDataFormatOptions (dict) – The Spark Data Format Options used during serialization of the Dataset being fit.
- trainingInputMode (str) – The SageMaker Training Job Channel input mode.
- trainingCompressionCodec (str) – The type of compression to use when serializing the Dataset being fit for input to SageMaker.
- trainingMaxRuntimeInSeconds (int) – A SageMaker Training Job Termination Condition MaxRuntimeInSeconds.
- trainingKmsKeyId (str) – A KMS key ID for the Output Data Source.
- modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
- endpointCreationPolicy (EndpointCreationPolicy) – Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
- sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
- s3Client (AmazonS3) – Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
- stsClient (AmazonSTS) – Used to resolve the account number when creating staging input / output buckets.
- modelPrependInputRowsToTransformationRows (bool) – Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
- deleteStagingDataAfterTraining (bool) – Whether to remove the training data on S3 after training completes or fails.
- namePolicyFactory (NamePolicyFactory) – The NamePolicyFactory to use when naming SageMaker entities created during fit.
- uid (str) – The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
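An end-to-end sketch (image URIs, the role ARN, and `training_df` are illustrative; the hyperparameter names assume the container understands "k" and "feature_dim", as the SageMaker K-Means image does):

```python
from sagemaker_pyspark import SageMakerEstimator, IAMRole
from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
from sagemaker_pyspark.transformation.deserializers import KMeansProtobufResponseRowDeserializer

# Placeholder training/hosting images and IAM role.
estimator = SageMakerEstimator(
    trainingImage="123456789012.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",
    modelImage="123456789012.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    requestRowSerializer=ProtobufRequestRowSerializer(),
    responseRowDeserializer=KMeansProtobufResponseRowDeserializer(),
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/my-sagemaker-role"),
    hyperParameters={"k": "10", "feature_dim": "784"},
)

# training_df is assumed to be a DataFrame with a "features" (Vector) column
# and, optionally, a "label" (Double) column.
model = estimator.fit(training_df)
```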
- copy(extra)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
  Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
- explainParam(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams()¶ Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values Returns: merged param map
- fit(dataset)¶ Fits a SageMakerModel on dataset by running a SageMaker training job.
  Parameters: dataset (Dataset) – the dataset to use for the training job. Returns: The Model created by the training job. Return type: JavaSageMakerModel
- getOrDefault(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getParam(paramName)¶ Gets a param by its name.
- hasDefault(param)¶ Checks whether a param has a default value.
- hasParam(paramName)¶ Tests whether this instance contains a param with a given (string) name.
- isDefined(param)¶ Checks whether a param is explicitly set by user or has a default value.
- isSet(param)¶ Checks whether a param is explicitly set by user.
- params¶ Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Algorithms¶
K Means¶
- class KMeansSageMakerEstimator(trainingInstanceType, trainingInstanceCount, endpointInstanceType, endpointInitialInstanceCount, sagemakerRole=<sagemaker_pyspark.IAMRoleResource.IAMRoleFromConfig object>, requestRowSerializer=<sagemaker_pyspark.transformation.serializers.serializers.ProtobufRequestRowSerializer object>, responseRowDeserializer=<sagemaker_pyspark.transformation.deserializers.deserializers.KMeansProtobufResponseRowDeserializer object>, trainingInputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingOutputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingInstanceVolumeSizeInGB=1024, trainingProjectedColumns=None, trainingChannelName='train', trainingContentType=None, trainingS3DataDistribution='ShardedByS3Key', trainingSparkDataFormat='sagemaker', trainingSparkDataFormatOptions=None, trainingInputMode='File', trainingCompressionCodec=None, trainingMaxRuntimeInSeconds=86400, trainingKmsKeyId=None, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, region=None, s3Client=<sagemaker_pyspark.SageMakerClients.SageMakerClients._S3DefaultClient object>, stsClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._STSDefaultClient object>, modelPrependInputRowsToTransformationRows=True, deleteStagingDataAfterTraining=True, namePolicyFactory=<sagemaker_pyspark.NamePolicy.RandomNamePolicyFactory object>, uid=None)¶
  Bases: sagemaker_pyspark.SageMakerEstimator.SageMakerEstimatorBase
A SageMakerEstimator that runs a K-Means training job on Amazon SageMaker upon a call to fit() and returns a SageMakerModel that can be used to transform a DataFrame using the hosted K-Means model. K-Means Clustering is useful for grouping similar examples in your dataset.
Amazon SageMaker K-Means clustering trains on RecordIO-encoded Amazon Record protobuf data. SageMaker pyspark writes a DataFrame to S3 by selecting a column of Vectors named “features” and, if present, a column of Doubles named “label”. These names are configurable by passing a dictionary with entries in trainingSparkDataFormatOptions with key “labelColumnName” or “featuresColumnName”, with values corresponding to the desired label and features columns.
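The column-name overrides described above are plain dictionary entries; a minimal sketch (the column names "target" and "embedding" are illustrative):

```python
# Override the default "label"/"features" column selection when the
# DataFrame uses different column names.
format_options = {
    "labelColumnName": "target",
    "featuresColumnName": "embedding",
}

# This dict would then be passed to the estimator as
# trainingSparkDataFormatOptions=format_options.
```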
For inference, the SageMakerModel returned by fit() by the KMeansSageMakerEstimator uses ProtobufRequestRowSerializer to serialize Rows into RecordIO-encoded Amazon Record protobuf messages, by default selecting the column named “features”, expected to contain a Vector of Doubles.
Inferences made against an Endpoint hosting a K-Means model contain a “closest_cluster” field and a “distance_to_cluster” field, both appended to the input DataFrame as columns of Double.
Parameters: - sageMakerRole (IAMRole) – The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR Resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
- trainingInstanceType (str) – The SageMaker TrainingJob Instance Type to use.
- trainingInstanceCount (int) – The number of instances of trainingInstanceType to run a SageMaker Training Job with.
- endpointInstanceType (str) – The SageMaker Endpoint Config instance type.
- endpointInitialInstanceCount (int) – The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage.
- requestRowSerializer (RequestRowSerializer) – Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
- responseRowDeserializer (ResponseRowDeserializer) – Deserializes an Endpoint response into a series of Rows.
- trainingInputS3DataPath (S3Resource) – An S3 location to upload SageMaker Training Job input data to.
- trainingOutputS3DataPath (S3Resource) – An S3 location for SageMaker to store Training Job output data to.
- trainingInstanceVolumeSizeInGB (int) – The EBS volume size in gigabytes of each instance.
- trainingProjectedColumns (List) – The columns to project from the Dataset being fit before training. If None is passed then no specific projection will occur and all columns will be serialized.
- trainingChannelName (str) – The SageMaker Channel name to input serialized Dataset fit input to.
- trainingContentType (str) – The MIME type of the training data.
- trainingS3DataDistribution (str) – The SageMaker Training Job S3 data distribution scheme.
- trainingSparkDataFormat (str) – The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
- trainingSparkDataFormatOptions (dict) – The Spark Data Format Options used during serialization of the Dataset being fit.
- trainingInputMode (str) – The SageMaker Training Job Channel input mode.
- trainingCompressionCodec (str) – The type of compression to use when serializing the Dataset being fit for input to SageMaker.
- trainingMaxRuntimeInSeconds (int) – A SageMaker Training Job Termination Condition MaxRuntimeInSeconds.
- trainingKmsKeyId (str) – A KMS key ID for the Output Data Source.
- modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
- endpointCreationPolicy (EndpointCreationPolicy) – Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
- sagemakerClient (AmazonSageMaker) – Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
- region (str) – The region in which to run the algorithm. If not specified, gets the region from the DefaultAwsRegionProviderChain.
- s3Client (AmazonS3) – Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
- stsClient (AmazonSTS) – Used to resolve the account number when creating staging input / output buckets.
- modelPrependInputRowsToTransformationRows (bool) – Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
- deleteStagingDataAfterTraining (bool) – Whether to remove the training data on S3 after training completes or fails.
- namePolicyFactory (NamePolicyFactory) – The NamePolicyFactory to use when naming SageMaker entities created during fit.
- uid (str) – The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
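A short sketch of fitting and applying this estimator (the role ARN and the DataFrames are placeholders; the setK and setFeatureDim setters are assumed from the estimator's K-Means params):

```python
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

# Placeholder IAM role ARN.
estimator = KMeansSageMakerEstimator(
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/my-sagemaker-role"),
)
estimator.setK(10)
estimator.setFeatureDim(784)

# training_df / test_df are assumed DataFrames with a "features" Vector column.
model = estimator.fit(training_df)
clustered = model.transform(test_df)  # adds "closest_cluster" and "distance_to_cluster"
```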
- copy(extra)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
  Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
- explainParam(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams()¶ Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values Returns: merged param map
- fit(dataset)¶ Fits a SageMakerModel on dataset by running a SageMaker training job.
  Parameters: dataset (Dataset) – the dataset to use for the training job. Returns: The Model created by the training job. Return type: JavaSageMakerModel
- getOrDefault(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getParam(paramName)¶ Gets a param by its name.
- hasDefault(param)¶ Checks whether a param has a default value.
- hasParam(paramName)¶ Tests whether this instance contains a param with a given (string) name.
- isDefined(param)¶ Checks whether a param is explicitly set by user or has a default value.
- isSet(param)¶ Checks whether a param is explicitly set by user.
- params¶ Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Linear Learner Regressor¶
- class LinearLearnerRegressor(trainingInstanceType, trainingInstanceCount, endpointInstanceType, endpointInitialInstanceCount, sagemakerRole=<sagemaker_pyspark.IAMRoleResource.IAMRoleFromConfig object>, requestRowSerializer=<sagemaker_pyspark.transformation.serializers.serializers.ProtobufRequestRowSerializer object>, responseRowDeserializer=<sagemaker_pyspark.transformation.deserializers.deserializers.LinearLearnerRegressorProtobufResponseRowDeserializer object>, trainingInputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingOutputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingInstanceVolumeSizeInGB=1024, trainingProjectedColumns=None, trainingChannelName='train', trainingContentType=None, trainingS3DataDistribution='ShardedByS3Key', trainingSparkDataFormat='sagemaker', trainingSparkDataFormatOptions=None, trainingInputMode='File', trainingCompressionCodec=None, trainingMaxRuntimeInSeconds=86400, trainingKmsKeyId=None, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, region=None, s3Client=<sagemaker_pyspark.SageMakerClients.SageMakerClients._S3DefaultClient object>, stsClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._STSDefaultClient object>, modelPrependInputRowsToTransformationRows=True, deleteStagingDataAfterTraining=True, namePolicyFactory=<sagemaker_pyspark.NamePolicy.RandomNamePolicyFactory object>, uid=None, javaObject=None)¶
  Bases: sagemaker_pyspark.SageMakerEstimator.SageMakerEstimatorBase, sagemaker_pyspark.algorithms.LinearLearnerSageMakerEstimator.LinearLearnerParams
A
SageMakerEstimator
that runs a Linear Learner training job in “regressor” mode in SageMaker and returns a SageMakerModel
that can be used to transform a DataFrame using the hosted Linear Learner model. The Linear Learner Regressor is useful for predicting a real-valued label from training examples.
Amazon SageMaker Linear Learner trains on RecordIO-encoded Amazon Record protobuf data. SageMaker PySpark writes a DataFrame to S3 by selecting a column of Vectors named “features” and, if present, a column of Doubles named “label”. These names are configurable: pass a dictionary in trainingSparkDataFormatOptions with keys “labelColumnName” or “featuresColumnName”, whose values name the desired label and features columns.
For inference against a hosted Endpoint, the SageMakerModel returned by fit() uses
ProtobufRequestRowSerializer
to serialize Rows into RecordIO-encoded Amazon Record protobuf messages, by default selecting the column named “features”, which is expected to contain a Vector of Doubles.
Inferences made against an Endpoint hosting a Linear Learner Regressor model contain a “score” field, appended to the input DataFrame as a Double.
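A minimal usage sketch (hedged: the instance types, role ARN, and input DataFrame are hypothetical; running it requires a SparkSession with the sagemaker_pyspark JARs on its classpath and AWS credentials configured). The imports are deferred inside the function so the sketch loads without those dependencies:

```python
def train_and_score(training_df):
    """Sketch: fit a Linear Learner regressor and score a DataFrame.

    `training_df` is assumed to have a Vector column "features" and a
    Double column "label", per the serialization rules described above.
    """
    # Deferred imports: require the sagemaker_pyspark package and its JARs.
    from sagemaker_pyspark import IAMRole
    from sagemaker_pyspark.algorithms import LinearLearnerRegressor

    estimator = LinearLearnerRegressor(
        trainingInstanceType="ml.m4.xlarge",   # hypothetical
        trainingInstanceCount=1,
        endpointInstanceType="ml.m4.xlarge",   # hypothetical
        endpointInitialInstanceCount=1,
        sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/ExampleRole"),
    )
    # fit() serializes the DataFrame to S3, runs the training job, and
    # (under the default EndpointCreationPolicy) creates an Endpoint.
    model = estimator.fit(training_df)
    # transform() invokes the Endpoint; each output Row gains a "score" Double.
    return model.transform(training_df)
```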
Parameters: - sagemakerRole (IAMRole) – The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR Resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
- trainingInstanceType (str) – The SageMaker TrainingJob Instance Type to use.
- trainingInstanceCount (int) – The number of instances of trainingInstanceType to run the SageMaker Training Job with.
- endpointInstanceType (str) – The SageMaker Endpoint Config instance type.
- endpointInitialInstanceCount (int) – The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage.
- requestRowSerializer (RequestRowSerializer) – Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
- responseRowDeserializer (ResponseRowDeserializer) – Deserializes an Endpoint response into a series of Rows.
- trainingInputS3DataPath (S3Resource) – An S3 location to upload SageMaker Training Job input data to.
- trainingOutputS3DataPath (S3Resource) – An S3 location for SageMaker to store Training Job output data to.
- trainingInstanceVolumeSizeInGB (int) – The EBS volume size in gigabytes of each instance.
- trainingProjectedColumns (List) – The columns to project from the Dataset being fit before training. If None, no projection occurs and all columns are serialized.
- trainingChannelName (str) – The name of the SageMaker Channel to which the serialized Dataset is uploaded for training.
- trainingContentType (str) – The MIME type of the training data.
- trainingS3DataDistribution (str) – The SageMaker Training Job S3 data distribution scheme.
- trainingSparkDataFormat (str) – The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
- trainingSparkDataFormatOptions (dict) – The Spark Data Format Options used during serialization of the Dataset being fit.
- trainingInputMode (str) – The SageMaker Training Job Channel input mode.
- trainingCompressionCodec (str) – The type of compression to use when serializing the Dataset being fit for input to SageMaker.
- trainingMaxRuntimeInSeconds (int) – The SageMaker Training Job Termination Condition MaxRuntimeInSeconds: the maximum time, in seconds, that the training job may run.
- trainingKmsKeyId (str) – A KMS key ID for the Output Data Source.
- modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
- endpointCreationPolicy (EndpointCreationPolicy) – Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
- sagemakerClient (AmazonSageMaker) – The Amazon SageMaker client used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
- region (str) – The region in which to run the algorithm. If not specified, gets the region from the DefaultAwsRegionProviderChain.
- s3Client (AmazonS3) – Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
- stsClient (AmazonSTS) – Used to resolve the account number when creating staging input / output buckets.
- modelPrependInputRowsToTransformationRows (bool) – Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
- deleteStagingDataAfterTraining (bool) – Whether to remove the training data from S3 after training completes or fails.
- namePolicyFactory (NamePolicyFactory) – The NamePolicyFactory to use when naming SageMaker entities created during fit.
- uid (str) – The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
-
copy
(extra)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
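The precedence ordering can be pictured with plain dict merging (an illustration only, not the Spark ML implementation; real param maps key on Param objects rather than strings):

```python
# Later merges win, mirroring: default param values < user-supplied < extra.
defaults = {"trainingChannelName": "train"}
user_supplied = {"trainingChannelName": "validation"}
extra = {"trainingChannelName": "test"}

merged = {**defaults, **user_supplied, **extra}
# merged["trainingChannelName"] is taken from `extra`
```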
-
fit
(dataset)¶ Fits a SageMakerModel on dataset by running a SageMaker training job.
Parameters: dataset (Dataset) – the dataset to use for the training job. Returns: The Model created by the training job. Return type: JavaSageMakerModel
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Linear Learner Binary Classifier¶
-
class
LinearLearnerBinaryClassifier
(trainingInstanceType, trainingInstanceCount, endpointInstanceType, endpointInitialInstanceCount, sagemakerRole=<sagemaker_pyspark.IAMRoleResource.IAMRoleFromConfig object>, requestRowSerializer=<sagemaker_pyspark.transformation.serializers.serializers.ProtobufRequestRowSerializer object>, responseRowDeserializer=<sagemaker_pyspark.transformation.deserializers.deserializers.LinearLearnerBinaryClassifierProtobufResponseRowDeserializer object>, trainingInputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingOutputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingInstanceVolumeSizeInGB=1024, trainingProjectedColumns=None, trainingChannelName='train', trainingContentType=None, trainingS3DataDistribution='ShardedByS3Key', trainingSparkDataFormat='sagemaker', trainingSparkDataFormatOptions=None, trainingInputMode='File', trainingCompressionCodec=None, trainingMaxRuntimeInSeconds=86400, trainingKmsKeyId=None, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, region=None, s3Client=<sagemaker_pyspark.SageMakerClients.SageMakerClients._S3DefaultClient object>, stsClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._STSDefaultClient object>, modelPrependInputRowsToTransformationRows=True, deleteStagingDataAfterTraining=True, namePolicyFactory=<sagemaker_pyspark.NamePolicy.RandomNamePolicyFactory object>, uid=None, javaObject=None)¶ Bases:
sagemaker_pyspark.SageMakerEstimator.SageMakerEstimatorBase
,sagemaker_pyspark.algorithms.LinearLearnerSageMakerEstimator.LinearLearnerParams
A
SageMakerEstimator
that runs a Linear Learner training job in “binary classifier” mode in SageMaker and returns a SageMakerModel
that can be used to transform a DataFrame using the hosted Linear Learner model. The Linear Learner Binary Classifier is useful for classifying examples into one of two classes.
Amazon SageMaker Linear Learner trains on RecordIO-encoded Amazon Record protobuf data. SageMaker PySpark writes a DataFrame to S3 by selecting a column of Vectors named “features” and, if present, a column of Doubles named “label”. These names are configurable: pass a dictionary in trainingSparkDataFormatOptions with keys “labelColumnName” or “featuresColumnName”, whose values name the desired label and features columns.
Inferences made against an Endpoint hosting a Linear Learner Binary Classifier model contain a “score” field and a “predicted_label” field, both appended to the input DataFrame as Doubles.
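As noted above, the label and features column names can be remapped through trainingSparkDataFormatOptions. A hedged sketch (the column names, instance types, and role ARN are hypothetical; imports are deferred so the sketch loads without the sagemaker_pyspark JARs):

```python
# Documented option keys, mapped to hypothetical DataFrame column names.
format_options = {
    "labelColumnName": "clicked",
    "featuresColumnName": "page_features",
}

def build_classifier():
    # Deferred imports: require the sagemaker_pyspark package and its JARs.
    from sagemaker_pyspark import IAMRole
    from sagemaker_pyspark.algorithms import LinearLearnerBinaryClassifier

    return LinearLearnerBinaryClassifier(
        trainingInstanceType="ml.c4.xlarge",   # hypothetical
        trainingInstanceCount=1,
        endpointInstanceType="ml.c4.xlarge",   # hypothetical
        endpointInitialInstanceCount=1,
        sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/ExampleRole"),
        trainingSparkDataFormatOptions=format_options,
    )
```

After fit() and transform(), each output Row carries the “score” and “predicted_label” Doubles in addition to the input columns (with the default modelPrependInputRowsToTransformationRows=True).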
Parameters: - sagemakerRole (IAMRole) – The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR Resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
- trainingInstanceType (str) – The SageMaker TrainingJob Instance Type to use.
- trainingInstanceCount (int) – The number of instances of trainingInstanceType to run the SageMaker Training Job with.
- endpointInstanceType (str) – The SageMaker Endpoint Config instance type.
- endpointInitialInstanceCount (int) – The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage.
- requestRowSerializer (RequestRowSerializer) – Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
- responseRowDeserializer (ResponseRowDeserializer) – Deserializes an Endpoint response into a series of Rows.
- trainingInputS3DataPath (S3Resource) – An S3 location to upload SageMaker Training Job input data to.
- trainingOutputS3DataPath (S3Resource) – An S3 location for SageMaker to store Training Job output data to.
- trainingInstanceVolumeSizeInGB (int) – The EBS volume size in gigabytes of each instance.
- trainingProjectedColumns (List) – The columns to project from the Dataset being fit before training. If None, no projection occurs and all columns are serialized.
- trainingChannelName (str) – The name of the SageMaker Channel to which the serialized Dataset is uploaded for training.
- trainingContentType (str) – The MIME type of the training data.
- trainingS3DataDistribution (str) – The SageMaker Training Job S3 data distribution scheme.
- trainingSparkDataFormat (str) – The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
- trainingSparkDataFormatOptions (dict) – The Spark Data Format Options used during serialization of the Dataset being fit.
- trainingInputMode (str) – The SageMaker Training Job Channel input mode.
- trainingCompressionCodec (str) – The type of compression to use when serializing the Dataset being fit for input to SageMaker.
- trainingMaxRuntimeInSeconds (int) – The SageMaker Training Job Termination Condition MaxRuntimeInSeconds: the maximum time, in seconds, that the training job may run.
- trainingKmsKeyId (str) – A KMS key ID for the Output Data Source.
- modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
- endpointCreationPolicy (EndpointCreationPolicy) – Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
- sagemakerClient (AmazonSageMaker) – The Amazon SageMaker client used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
- region (str) – The region in which to run the algorithm. If not specified, gets the region from the DefaultAwsRegionProviderChain.
- s3Client (AmazonS3) – Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
- stsClient (AmazonSTS) – Used to resolve the account number when creating staging input / output buckets.
- modelPrependInputRowsToTransformationRows (bool) – Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
- deleteStagingDataAfterTraining (bool) – Whether to remove the training data from S3 after training completes or fails.
- namePolicyFactory (NamePolicyFactory) – The NamePolicyFactory to use when naming SageMaker entities created during fit.
- uid (str) – The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
-
copy
(extra)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
fit
(dataset)¶ Fits a SageMakerModel on dataset by running a SageMaker training job.
Parameters: dataset (Dataset) – the dataset to use for the training job. Returns: The Model created by the training job. Return type: JavaSageMakerModel
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
PCA¶
-
class
PCASageMakerEstimator
(trainingInstanceType, trainingInstanceCount, endpointInstanceType, endpointInitialInstanceCount, sagemakerRole=<sagemaker_pyspark.IAMRoleResource.IAMRoleFromConfig object>, requestRowSerializer=<sagemaker_pyspark.transformation.serializers.serializers.ProtobufRequestRowSerializer object>, responseRowDeserializer=<sagemaker_pyspark.transformation.deserializers.deserializers.PCAProtobufResponseRowDeserializer object>, trainingInputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingOutputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingInstanceVolumeSizeInGB=1024, trainingProjectedColumns=None, trainingChannelName='train', trainingContentType=None, trainingS3DataDistribution='ShardedByS3Key', trainingSparkDataFormat='sagemaker', trainingSparkDataFormatOptions=None, trainingInputMode='File', trainingCompressionCodec=None, trainingMaxRuntimeInSeconds=86400, trainingKmsKeyId=None, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, region=None, s3Client=<sagemaker_pyspark.SageMakerClients.SageMakerClients._S3DefaultClient object>, stsClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._STSDefaultClient object>, modelPrependInputRowsToTransformationRows=True, deleteStagingDataAfterTraining=True, namePolicyFactory=<sagemaker_pyspark.NamePolicy.RandomNamePolicyFactory object>, uid=None)¶ Bases:
sagemaker_pyspark.SageMakerEstimator.SageMakerEstimatorBase
A
SageMakerEstimator
that runs a PCA training job in SageMaker and returns a SageMakerModel
that can be used to transform a DataFrame using the hosted PCA model. PCA (Principal Component Analysis) is useful for reducing the dimensionality of data before training with another algorithm.
Amazon SageMaker PCA trains on RecordIO-encoded Amazon Record protobuf data. SageMaker PySpark writes a DataFrame to S3 by selecting a column of Vectors named “features” and, if present, a column of Doubles named “label”. These names are configurable: pass a dictionary in trainingSparkDataFormatOptions with keys “labelColumnName” or “featuresColumnName”, whose values name the desired label and features columns.
PCASageMakerEstimator uses
ProtobufRequestRowSerializer
to serialize Rows into RecordIO-encoded Amazon Record protobuf messages for inference, by default selecting the column named “features”, which is expected to contain a Vector of Doubles.
Inferences made against an Endpoint hosting a PCA model contain a “projection” field, appended to the input DataFrame as a Dense Vector of Doubles.
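A hedged sketch of using PCA as a pre-training dimensionality reduction step (hypothetical values; imports deferred as in the earlier sketches; PCA hyperparameters such as the number of components are set through the estimator's Params and are omitted here):

```python
def project_features(df):
    """Sketch: fit a SageMaker PCA model and append a "projection" column.

    `df` is assumed to have a Vector column "features".
    """
    # Deferred imports: require the sagemaker_pyspark package and its JARs.
    from sagemaker_pyspark import IAMRole
    from sagemaker_pyspark.algorithms import PCASageMakerEstimator

    estimator = PCASageMakerEstimator(
        trainingInstanceType="ml.m4.xlarge",   # hypothetical
        trainingInstanceCount=1,
        endpointInstanceType="ml.m4.xlarge",   # hypothetical
        endpointInitialInstanceCount=1,
        sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/ExampleRole"),
    )
    model = estimator.fit(df)
    # Each output Row gains a "projection" field: a Dense Vector of Doubles,
    # which can feed a downstream estimator in place of "features".
    return model.transform(df)
```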
Parameters: - sagemakerRole (IAMRole) – The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR Resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
- trainingInstanceType (str) – The SageMaker TrainingJob Instance Type to use.
- trainingInstanceCount (int) – The number of instances of trainingInstanceType to run the SageMaker Training Job with.
- endpointInstanceType (str) – The SageMaker Endpoint Config instance type.
- endpointInitialInstanceCount (int) – The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage.
- requestRowSerializer (RequestRowSerializer) – Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
- responseRowDeserializer (ResponseRowDeserializer) – Deserializes an Endpoint response into a series of Rows.
- trainingInputS3DataPath (S3Resource) – An S3 location to upload SageMaker Training Job input data to.
- trainingOutputS3DataPath (S3Resource) – An S3 location for SageMaker to store Training Job output data to.
- trainingInstanceVolumeSizeInGB (int) – The EBS volume size in gigabytes of each instance.
- trainingProjectedColumns (List) – The columns to project from the Dataset being fit before training. If None, no projection occurs and all columns are serialized.
- trainingChannelName (str) – The name of the SageMaker Channel to which the serialized Dataset is uploaded for training.
- trainingContentType (str) – The MIME type of the training data.
- trainingS3DataDistribution (str) – The SageMaker Training Job S3 data distribution scheme.
- trainingSparkDataFormat (str) – The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
- trainingSparkDataFormatOptions (dict) – The Spark Data Format Options used during serialization of the Dataset being fit.
- trainingInputMode (str) – The SageMaker Training Job Channel input mode.
- trainingCompressionCodec (str) – The type of compression to use when serializing the Dataset being fit for input to SageMaker.
- trainingMaxRuntimeInSeconds (int) – The SageMaker Training Job Termination Condition MaxRuntimeInSeconds: the maximum time, in seconds, that the training job may run.
- trainingKmsKeyId (str) – A KMS key ID for the Output Data Source.
- modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
- endpointCreationPolicy (EndpointCreationPolicy) – Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
- sagemakerClient (AmazonSageMaker) – The Amazon SageMaker client used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
- region (str) – The region in which to run the algorithm. If not specified, gets the region from the DefaultAwsRegionProviderChain.
- s3Client (AmazonS3) – Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
- stsClient (AmazonSTS) – Used to resolve the account number when creating staging input / output buckets.
- modelPrependInputRowsToTransformationRows (bool) – Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
- deleteStagingDataAfterTraining (bool) – Whether to remove the training data from S3 after training completes or fails.
- namePolicyFactory (NamePolicyFactory) – The NamePolicyFactory to use when naming SageMaker entities created during fit.
- uid (str) – The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
-
copy
(extra)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
fit
(dataset)¶ Fits a SageMakerModel on dataset by running a SageMaker training job.
Parameters: dataset (Dataset) – the dataset to use for the training job. Returns: The Model created by the training job. Return type: JavaSageMakerModel
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
XGBoost¶
-
class
XGBoostSageMakerEstimator
(trainingInstanceType, trainingInstanceCount, endpointInstanceType, endpointInitialInstanceCount, sagemakerRole=<sagemaker_pyspark.IAMRoleResource.IAMRoleFromConfig object>, requestRowSerializer=<sagemaker_pyspark.transformation.serializers.serializers.LibSVMRequestRowSerializer object>, responseRowDeserializer=<sagemaker_pyspark.transformation.deserializers.deserializers.XGBoostCSVRowDeserializer object>, trainingInputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingOutputS3DataPath=<sagemaker_pyspark.S3Resources.S3AutoCreatePath object>, trainingInstanceVolumeSizeInGB=1024, trainingProjectedColumns=None, trainingChannelName='train', trainingContentType=None, trainingS3DataDistribution='ShardedByS3Key', trainingSparkDataFormat='libsvm', trainingSparkDataFormatOptions=None, trainingInputMode='File', trainingCompressionCodec=None, trainingMaxRuntimeInSeconds=86400, trainingKmsKeyId=None, modelEnvironmentVariables=None, endpointCreationPolicy=<sagemaker_pyspark.SageMakerEstimator.EndpointCreationPolicy._CreateOnConstruct object>, sagemakerClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._SageMakerDefaultClient object>, region=None, s3Client=<sagemaker_pyspark.SageMakerClients.SageMakerClients._S3DefaultClient object>, stsClient=<sagemaker_pyspark.SageMakerClients.SageMakerClients._STSDefaultClient object>, modelPrependInputRowsToTransformationRows=True, deleteStagingDataAfterTraining=True, namePolicyFactory=<sagemaker_pyspark.NamePolicy.RandomNamePolicyFactory object>, uid=None)¶ Bases:
sagemaker_pyspark.SageMakerEstimator.SageMakerEstimatorBase
A
SageMakerEstimator
that runs an XGBoost training job in Amazon SageMaker and returns a SageMakerModel
that can be used to transform a DataFrame using the hosted XGBoost model. XGBoost is an open-source distributed gradient boosting library that Amazon SageMaker has adapted to run as a built-in algorithm.
XGBoost trains and infers on LibSVM-formatted data. XGBoostSageMakerEstimator uses Spark’s LibSVMFileFormat to write the training DataFrame to S3, and serializes Rows to LibSVM for inference, by default selecting the column named “features”, which is expected to contain a Vector of Doubles.
Inferences made against an Endpoint hosting an XGBoost model contain a “prediction” field appended to the input DataFrame as a column of Doubles, containing the prediction corresponding to the given Vector of features.
See the XGBoost GitHub repository for more information on XGBoost.
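A hedged usage sketch (hypothetical values; imports deferred as in the earlier sketches; the hyperparameter setter name is assumed from the XGBoost params mixin rather than confirmed here):

```python
def train_xgboost(df):
    """Sketch: train XGBoost on a DataFrame serialized to LibSVM.

    `df` is assumed to have a Double "label" column and a Vector
    "features" column, as required by Spark's LibSVM writer.
    """
    # Deferred imports: require the sagemaker_pyspark package and its JARs.
    from sagemaker_pyspark import IAMRole
    from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

    estimator = XGBoostSageMakerEstimator(
        trainingInstanceType="ml.m4.xlarge",   # hypothetical
        trainingInstanceCount=1,
        endpointInstanceType="ml.m4.xlarge",   # hypothetical
        endpointInitialInstanceCount=1,
        sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/ExampleRole"),
    )
    # Hyperparameter setters come from the XGBoost params mixin;
    # the setter name below is an assumption for illustration.
    estimator.setNumRound(25)
    model = estimator.fit(df)
    # Output Rows gain a "prediction" column of Doubles.
    return model.transform(df)
```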
Parameters: - sagemakerRole (IAMRole) – The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR Resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
- trainingInstanceType (str) – The SageMaker TrainingJob Instance Type to use.
- trainingInstanceCount (int) – The number of instances of trainingInstanceType to run the SageMaker Training Job with.
- endpointInstanceType (str) – The SageMaker Endpoint Config instance type.
- endpointInitialInstanceCount (int) – The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage.
- requestRowSerializer (RequestRowSerializer) – Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
- responseRowDeserializer (ResponseRowDeserializer) – Deserializes an Endpoint response into a series of Rows.
- trainingInputS3DataPath (S3Resource) – An S3 location to upload SageMaker Training Job input data to.
- trainingOutputS3DataPath (S3Resource) – An S3 location for SageMaker to store Training Job output data to.
- trainingInstanceVolumeSizeInGB (int) – The EBS volume size in gigabytes of each instance.
- trainingProjectedColumns (List) – The columns to project from the Dataset being fit before training. If None, no projection occurs and all columns are serialized.
- trainingChannelName (str) – The name of the SageMaker Channel to which the serialized Dataset is uploaded for training.
- trainingContentType (str) – The MIME type of the training data.
- trainingS3DataDistribution (str) – The SageMaker Training Job S3 data distribution scheme.
- trainingSparkDataFormat (str) – The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
- trainingSparkDataFormatOptions (dict) – The Spark Data Format Options used during serialization of the Dataset being fit.
- trainingInputMode (str) – The SageMaker Training Job Channel input mode.
- trainingCompressionCodec (str) – The type of compression to use when serializing the Dataset being fit for input to SageMaker.
- trainingMaxRuntimeInSeconds (int) – The SageMaker Training Job Termination Condition MaxRuntimeInSeconds: the maximum time, in seconds, that the training job may run.
- trainingKmsKeyId (str) – A KMS key ID for the Output Data Source.
- modelEnvironmentVariables (dict) – The environment variables that SageMaker will set on the model container during execution.
- endpointCreationPolicy (EndpointCreationPolicy) – Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
- sagemakerClient (AmazonSageMaker) – The Amazon SageMaker client used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
- region (str) – The region in which to run the algorithm. If not specified, gets the region from the DefaultAwsRegionProviderChain.
- s3Client (AmazonS3) – Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
- stsClient (AmazonSTS) – Used to resolve the account number when creating staging input / output buckets.
- modelPrependInputRowsToTransformationRows (bool) – Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
- deleteStagingDataAfterTraining (bool) – Whether to remove the training data from S3 after training completes or fails.
- namePolicyFactory (NamePolicyFactory) – The NamePolicyFactory to use when naming SageMaker entities created during fit.
- uid (str) – The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
-
copy
(extra)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
fit
(dataset)¶ Fits a SageMakerModel on dataset by running a SageMaker training job.
Parameters: dataset (Dataset) – the dataset to use for the training job. Returns: The Model created by the training job. Return type: JavaSageMakerModel
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Serializers¶
-
class
RequestRowSerializer
¶ Bases:
sagemaker_pyspark.wrapper.SageMakerJavaWrapper
-
setSchema
(schema)¶ Sets the rowSchema for this RequestRowSerializer.
Parameters: schema (StructType) – the schema that this RequestRowSerializer will use.
-
-
class
UnlabeledCSVRequestRowSerializer
(schema=None, featuresColumnName='features')¶ Bases:
sagemaker_pyspark.transformation.serializers.serializers.RequestRowSerializer
Serializes the features column of each Row as unlabeled CSV request data, according to the current implementation of the scoring service.
Parameters: - schema (StructType) – The schema of Rows being serialized. This parameter is optional as the schema may not be known when this serializer is constructed.
- featuresColumnName (str) – name of the features column.
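The CSV request body for one Row is just the feature values joined by commas. A minimal pure-Python sketch of that per-Row serialization (a hypothetical helper for illustration, not part of the sagemaker_pyspark API, which performs this inside the JVM):

```python
def serialize_features_csv(features):
    """Serialize a sequence of numeric feature values to one CSV record (bytes)."""
    # One input Row's features become one comma-delimited line of the request body.
    return ",".join(str(float(v)) for v in features).encode("utf-8")

record = serialize_features_csv([1.0, 0.5, 3.0])  # b"1.0,0.5,3.0"
```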
-
class
ProtobufRequestRowSerializer
(schema=None, featuresColumnName='features')¶ Bases:
sagemaker_pyspark.transformation.serializers.serializers.RequestRowSerializer
A RequestRowSerializer for converting labeled rows to SageMaker Protobuf-in-recordio request data.
Parameters: schema (StructType) – The schema of Rows being serialized. This parameter is optional as the schema may not be known when this serializer is constructed.
-
class
LibSVMRequestRowSerializer
(schema=None, labelColumnName='label', featuresColumnName='features')¶ Bases:
sagemaker_pyspark.transformation.serializers.serializers.RequestRowSerializer
Extracts a label column and features column from a Row and serializes as a LibSVM record.
Each Row must contain a Double column and a Vector column containing the label and features respectively. Row field indexes for the label and features are obtained by looking up the index of labelColumnName and featuresColumnName respectively in the specified schema.
A schema must be specified before this RequestRowSerializer can be used by a client. The schema is set either on instantiation of this RequestRowSerializer or by
RequestRowSerializer.setSchema()
.Parameters: - schema (StructType) – The schema of Rows being serialized. This parameter is optional as the schema may not be known when this serializer is constructed.
- labelColumnName (str) – The name of the label column.
- featuresColumnName (str) – The name of the features column.
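A LibSVM record is the label followed by sparse `index:value` pairs with 1-based indexes. A pure-Python sketch of that format for one (label, features) pair (a hypothetical helper; the library performs this serialization internally from the Row fields named by labelColumnName and featuresColumnName):

```python
def to_libsvm(label, features):
    """Format one label and dense feature list as a LibSVM record string."""
    # LibSVM uses 1-based feature indexes and omits zero-valued entries.
    pairs = " ".join(f"{i + 1}:{v}" for i, v in enumerate(features) if v != 0.0)
    return f"{label} {pairs}".strip()

to_libsvm(1.0, [0.0, 2.5, 0.0, 4.0])  # "1.0 2:2.5 4:4.0"
```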
Deserializers¶
-
class
ResponseRowDeserializer
¶ Bases:
sagemaker_pyspark.wrapper.SageMakerJavaWrapper
-
class
XGBoostCSVRowDeserializer
(prediction_column_name='prediction')¶ Bases:
sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer
A
ResponseRowDeserializer
for converting a comma-delimited string of predictions to labeled Vectors.Parameters: prediction_column_name (str) – the name of the output predictions column.
-
class
ProtobufResponseRowDeserializer
(schema, protobufKeys=None)¶ Bases:
sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer
A
ResponseRowDeserializer
for converting SageMaker Protobuf-in-recordio response data to Spark rows.Parameters: schema (StructType) – The schema of rows in the response.
-
class
PCAProtobufResponseRowDeserializer
(projection_column_name='projection')¶ Bases:
sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer
Deserializes a Protobuf response from the PCA model image to a Vector of Doubles containing the projection of the input vector.
Parameters: projection_column_name (str) – name of the column holding Vectors of Doubles representing the projected vectors.
-
class
LDAProtobufResponseRowDeserializer
(projection_column_name='topic_mixture')¶ Bases:
sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer
Deserializes a Protobuf response from the LDA model image to a Vector of Doubles representing the topic mixture for the document represented by the input vector.
Parameters: projection_column_name (str) – name of the column holding Vectors of Doubles representing the topic mixtures for the documents.
-
class
KMeansProtobufResponseRowDeserializer
(distance_to_cluster_column_name='distance_to_cluster', closest_cluster_column_name='closest_cluster')¶ Bases:
sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer
Deserializes a Protobuf response from the KMeans model image into Rows in a Spark DataFrame.
Parameters: - distance_to_cluster_column_name (str) – name of the column of doubles indicating the distance to the nearest cluster from the input record.
- closest_cluster_column_name (str) – name of the column of doubles indicating the label of the closest cluster for the input record.
-
class
LinearLearnerBinaryClassifierProtobufResponseRowDeserializer
(score_column_name='score', predicted_label_column_name='predicted_label')¶ Bases:
sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer
Deserializes a Protobuf response from the LinearLearner model image with predictorType “binary_classifier” into Rows in a Spark DataFrame.
Parameters: - score_column_name (str) – name of the column indicating the output score for the record.
- predicted_label_column_name (str) – name of the column indicating the predicted label for the record.
-
class
LinearLearnerMultiClassClassifierProtobufResponseRowDeserializer
(score_column_name='score', predicted_label_column_name='predicted_label')¶ Bases:
sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer
Deserializes a Protobuf response from the LinearLearner model image with predictorType “multiclass_classifier” into Rows in a Spark DataFrame.
Parameters: - score_column_name (str) – name of the column indicating the output score for the record.
- predicted_label_column_name (str) – name of the column indicating the predicted label for the record.
-
class
LinearLearnerRegressorProtobufResponseRowDeserializer
(score_column_name='score')¶ Bases:
sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer
Deserializes a Protobuf response from the LinearLearner model image with predictorType “regressor” into Rows in a Spark DataFrame.
Parameters: score_column_name (str) – name of the column of Doubles indicating the output score for the record.
-
class
FactorizationMachinesBinaryClassifierDeserializer
(score_column_name='score', predicted_label_column_name='predicted_label')¶ Bases:
sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer
Deserializes a Protobuf response from the Factorization Machines model image with predictorType “binary_classifier” into Rows in a Spark DataFrame.
Parameters: - score_column_name (str) – name of the column indicating the output score for the record.
- predicted_label_column_name (str) – name of the column indicating the predicted label for the record.
-
class
FactorizationMachinesRegressorDeserializer
(score_column_name='score')¶ Bases:
sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer
Deserializes a Protobuf response from the Factorization Machines model image with predictorType “regressor” into Rows in a Spark DataFrame.
Parameters: score_column_name (str) – name of the column of Doubles indicating the output score for the record.
-
class
LibSVMResponseRowDeserializer
(dim, labelColumnName, featuresColumnName)¶ Bases:
sagemaker_pyspark.transformation.deserializers.deserializers.ResponseRowDeserializer
A
ResponseRowDeserializer
for converting LibSVM response data to labeled vectors.Parameters: - dim (int) – The vector dimension
- labelColumnName (str) – The name of the label column
- featuresColumnName (str) – The name of the features column
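Deserializing a LibSVM record is the inverse of the serializer above: the label comes first, then sparse `index:value` pairs are expanded into a dense vector of length `dim`. A pure-Python sketch (a hypothetical helper; the deserializer emits Spark Rows with the configured label and features column names):

```python
def parse_libsvm(record, dim):
    """Parse one LibSVM record into (label, dense feature list of length dim)."""
    parts = record.split()
    label = float(parts[0])
    features = [0.0] * dim
    for pair in parts[1:]:
        idx, val = pair.split(":")
        features[int(idx) - 1] = float(val)  # LibSVM indexes are 1-based
    return label, features

parse_libsvm("1.0 2:2.5 4:4.0", 4)  # (1.0, [0.0, 2.5, 0.0, 4.0])
```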
Other Classes¶
Top-level module for sagemaker_pyspark
-
class
SageMakerJavaWrapper
¶ Bases:
pyspark.ml.wrapper.JavaWrapper
-
class
IAMRole
(role)¶ Bases:
sagemaker_pyspark.IAMRoleResource.IAMRoleResource
Specifies an IAM Role by ARN or Name.
Parameters: role (str) – IAM Role Name or ARN.
-
class
IAMRoleFromConfig
(configKey='com.amazonaws.services.sagemaker.sparksdk.sagemakerrole')¶ Bases:
sagemaker_pyspark.IAMRoleResource.IAMRoleResource
Gets an IAM role from the Spark config.
Parameters: configKey (str) – key in Spark config corresponding to IAM Role ARN.
-
class
S3DataPath
(bucket, objectPath)¶ Bases:
sagemaker_pyspark.S3Resources.S3Resource
,sagemaker_pyspark.wrapper.SageMakerJavaWrapper
Represents a location within an S3 Bucket.
Parameters: - bucket (str) – An S3 Bucket Name.
- objectPath (str) – An S3 key or key prefix.
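The bucket name and object path together identify a full S3 location. A small sketch of how the two pieces combine into an `s3://` URI (illustrative only; S3DataPath itself wraps the equivalent Scala class rather than building a string):

```python
def to_s3_uri(bucket, object_path):
    """Combine an S3 bucket name and key (or key prefix) into an s3:// URI."""
    return f"s3://{bucket}/{object_path.lstrip('/')}"

to_s3_uri("my-bucket", "training/data")  # "s3://my-bucket/training/data"
```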
-
class
S3AutoCreatePath
¶ Bases:
sagemaker_pyspark.S3Resources.S3Resource
,sagemaker_pyspark.wrapper.SageMakerJavaWrapper
Defines an S3 location that will be auto-created at runtime.
-
class
S3Resource
¶ Bases:
sagemaker_pyspark.wrapper.SageMakerJavaWrapper
An S3 Resource for SageMaker to use.
-
class
EndpointCreationPolicy
¶ Bases:
object
Determines whether and when to create the Endpoint and other Hosting resources.
-
CREATE_ON_CONSTRUCT
¶ create the Endpoint upon creation of the SageMakerModel, at the end of fit()
-
CREATE_ON_TRANSFORM
¶ create the Endpoint upon invocation of SageMakerModel.transform().
-
DO_NOT_CREATE
¶ do not create the Endpoint.
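The three policies differ only in which lifecycle phase, if any, triggers Endpoint creation. A pure-Python sketch of that decision (the string constants and helper below are hypothetical stand-ins; the real constants live on sagemaker_pyspark.EndpointCreationPolicy):

```python
# Stand-in constants mirroring the three documented policies.
CREATE_ON_CONSTRUCT = "create-on-construct"
CREATE_ON_TRANSFORM = "create-on-transform"
DO_NOT_CREATE = "do-not-create"

def endpoint_needed(policy, phase):
    """Return True if the given lifecycle phase ('construct' or 'transform')
    should create the Endpoint under this policy."""
    return (policy == CREATE_ON_CONSTRUCT and phase == "construct") or \
           (policy == CREATE_ON_TRANSFORM and phase == "transform")
```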
-
-
class
Option
(value)¶ Bases:
sagemaker_pyspark.wrapper.SageMakerJavaWrapper
-
class
RandomNamePolicy
(prefix='')¶ Bases:
sagemaker_pyspark.NamePolicy.NamePolicy
Provides random, unique SageMaker entity names that begin with the specified prefix.
Parameters: prefix (str) – The common name prefix for all SageMaker entities named with this NamePolicy.
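A sketch of generating a random, unique, prefixed entity name in the spirit of RandomNamePolicy (a hypothetical helper; the real policy is implemented in the Scala SDK and its exact name format may differ):

```python
import uuid

def random_name(prefix=""):
    """Return a unique entity name beginning with the given prefix."""
    # uuid4 gives a collision-resistant random suffix.
    return f"{prefix}{uuid.uuid4().hex}"

name = random_name("trainingJob-")  # e.g. "trainingJob-3f2a..." (random)
```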
-
class
RandomNamePolicyFactory
(prefix='')¶ Bases:
sagemaker_pyspark.NamePolicy.NamePolicyFactory
Creates a RandomNamePolicy upon a call to createNamePolicy.
Parameters: prefix (str) – The common name prefix for all SageMaker entities named with this NamePolicy.
-
class
CustomNamePolicy
(trainingJobName, modelName, endpointConfigName, endpointName)¶ Bases:
sagemaker_pyspark.NamePolicy.NamePolicy
Provides custom SageMaker entity names.
Parameters: - trainingJobName (str) – The training job name of the SageMaker entity
- modelName (str) – The model name of the SageMaker entity
- endpointConfigName (str) – The endpoint config name of the SageMaker entity
- endpointName (str) – The endpoint name of the SageMaker entity
-
class
CustomNamePolicyFactory
(trainingJobName, modelName, endpointConfigName, endpointName)¶ Bases:
sagemaker_pyspark.NamePolicy.NamePolicyFactory
Creates a CustomNamePolicy upon a call to createNamePolicy.
Parameters: - trainingJobName (str) – The training job name of the SageMaker entity with this NamePolicy.
- modelName (str) – The model name of the SageMaker entity with this NamePolicy.
- endpointConfigName (str) – The endpoint config name of the SageMaker entity with this NamePolicy.
- endpointName (str) – The endpoint name of the SageMaker entity with this NamePolicy.
-
class
CustomNamePolicyWithTimeStampSuffix
(trainingJobName, modelName, endpointConfigName, endpointName)¶ Bases:
sagemaker_pyspark.NamePolicy.NamePolicy
Provides custom SageMaker entity names with timestamp suffix.
Parameters: - trainingJobName (str) – The job name of the SageMaker entity
- modelName (str) – The model name of the SageMaker entity
- endpointConfigName (str) – The endpoint config name of the SageMaker entity
- endpointName (str) – The endpoint name of the SageMaker entity
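A sketch of appending a timestamp suffix to a fixed base name, in the spirit of CustomNamePolicyWithTimeStampSuffix (a hypothetical helper; the real suffix format used by the Scala SDK may differ):

```python
from datetime import datetime, timezone

def with_timestamp(base_name):
    """Append a UTC timestamp suffix to a fixed entity name."""
    suffix = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
    return f"{base_name}-{suffix}"

with_timestamp("myTrainingJob")  # e.g. "myTrainingJob-2024-01-01T12-00-00"
```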
-
class
CustomNamePolicyWithTimeStampSuffixFactory
(trainingJobName, modelName, endpointConfigName, endpointName)¶ Bases:
sagemaker_pyspark.NamePolicy.NamePolicyFactory
Creates a CustomNamePolicyWithTimeStampSuffix upon a call to createNamePolicy.
Parameters: - trainingJobName (str) – The training job name of the SageMaker entity with this NamePolicy.
- modelName (str) – The model name of the SageMaker entity with this NamePolicy.
- endpointConfigName (str) – The endpoint config name of the SageMaker entity with this NamePolicy.
- endpointName (str) – The endpoint name of the SageMaker entity with this NamePolicy.
-
classpath_jars
()¶ Returns a list with the paths to the required jar files.
The sagemaker_pyspark library is mostly a wrapper around the Scala sagemaker-spark SDK and depends on a set of jar files to work correctly. This function retrieves the locations of these jars in the local installation.
Returns: List of absolute paths.
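The returned paths are typically joined into a classpath string and handed to the Spark configuration. A sketch of that wiring (the jar list below is a stand-in so the snippet is self-contained; in practice it comes from sagemaker_pyspark.classpath_jars()):

```python
import os

# Stand-in for the list returned by sagemaker_pyspark.classpath_jars().
jars = ["/opt/libs/sagemaker-spark.jar", "/opt/libs/aws-java-sdk.jar"]

# Join the absolute jar paths with the platform classpath separator.
extra_classpath = os.pathsep.join(jars)

# Typical usage (requires pyspark):
#   SparkConf().set("spark.driver.extraClassPath", extra_classpath)
```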
-
class
SageMakerResourceCleanup
(sagemakerClient, java_object=None)¶ Bases:
sagemaker_pyspark.wrapper.SageMakerJavaWrapper
-
class
CreatedResources
(model_name=None, endpoint_config_name=None, endpoint_name=None, java_object=None)¶ Bases:
sagemaker_pyspark.wrapper.SageMakerJavaWrapper
Resources that may have been created during operation of the SageMaker Estimator and Model.
Parameters: - model_name (str) – Name of the SageMaker Model that was created, or None if it wasn’t created.
- endpoint_config_name (str) – Name of the SageMaker EndpointConfig that was created, or None if it wasn’t created.
- endpoint_name (str) – Name of the SageMaker Endpoint that was created, or None if it wasn’t created.
- java_object (py4j.java_gateway.JavaObject, optional) – an existing CreatedResources Java instance. If provided, the other arguments are ignored.
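A CreatedResources instance records each resource name only if that resource was actually created, so a cleanup pass deletes exactly what exists. A pure-Python sketch of that flow with a fake client standing in for the SageMaker client (hypothetical helpers for illustration; the real flow passes a CreatedResources to SageMakerResourceCleanup):

```python
class FakeSageMakerClient:
    """Stand-in client that records delete calls instead of hitting AWS."""
    def __init__(self):
        self.deleted = []

    def delete(self, kind, name):
        self.deleted.append((kind, name))

def delete_created(client, model_name=None, endpoint_config_name=None,
                   endpoint_name=None):
    # Mirror CreatedResources: a None name means the resource was never created.
    if endpoint_name is not None:
        client.delete("Endpoint", endpoint_name)
    if endpoint_config_name is not None:
        client.delete("EndpointConfig", endpoint_config_name)
    if model_name is not None:
        client.delete("Model", model_name)

client = FakeSageMakerClient()
delete_created(client, model_name="my-model", endpoint_name="my-endpoint")
# client.deleted == [("Endpoint", "my-endpoint"), ("Model", "my-model")]
```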