Feature Transformation
Categorical Feature Transformations
- class ml4ir.base.features.feature_fns.categorical.CategoricalEmbeddingWithHashBuckets(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)
Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp
Converts a string feature tensor into a categorical embedding. The string is first hashed into num_hash_buckets buckets, each of size hash_bucket_size; each hash bucket is then converted into a categorical embedding of dimension embedding_size. Finally, these embeddings are combined through a mean, sum, or concat operation, as configured in the feature_info, to generate the final embedding.
Initialize the layer to get categorical embedding with hash buckets
Parameters: - feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Notes
- Args under feature_layer_info:
- num_hash_buckets : int
- number of different hash buckets to convert the input string into
- hash_bucket_size : int
- the size of each hash bucket
- embedding_size : int
- dimension size of the categorical embedding
- merge_mode : str
- one of “mean”, “sum”, or “concat”: the mode used to combine the embeddings from the individual hash buckets
- LAYER_NAME = 'categorical_embedding_with_hash_buckets'
- NUM_HASH_BUCKETS = 'num_hash_buckets'
- HASH_BUCKET_SIZE = 'hash_bucket_size'
- EMBEDDING_SIZE = 'embedding_size'
- MERGE_MODE = 'merge_mode'
- call(inputs, training=None)
Defines the forward pass for the layer on the inputs tensor
Parameters: - inputs (tensor) – Input tensor on which the feature transforms are applied
- training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns: Resulting tensor after the forward pass through the feature transform layer
Return type: tf.Tensor
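The hash-bucket scheme described for CategoricalEmbeddingWithHashBuckets can be illustrated with a small pure-Python sketch. This is not the ml4ir implementation: the salted MD5 hashing, the random table initialisation, and the helper names hash_to_bucket, make_tables, and embed are all assumptions made for illustration. Only the roles of num_hash_buckets, hash_bucket_size, embedding_size, and merge_mode come from the description above.

```python
import hashlib
import random


def hash_to_bucket(text, salt, hash_bucket_size):
    """Map text to an id in [0, hash_bucket_size) using a salted hash (assumed scheme)."""
    digest = hashlib.md5(f"{salt}:{text}".encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


def make_tables(num_hash_buckets, hash_bucket_size, embedding_size, seed=0):
    """Build one randomly initialised embedding table per hash bucket."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(embedding_size)]
         for _ in range(hash_bucket_size)]
        for _ in range(num_hash_buckets)
    ]


def embed(text, tables, merge_mode="mean"):
    """Look up one vector per hash bucket, then merge the vectors."""
    vectors = [
        table[hash_to_bucket(text, salt, len(table))]
        for salt, table in enumerate(tables)
    ]
    if merge_mode == "concat":
        return [value for vector in vectors for value in vector]
    summed = [sum(values) for values in zip(*vectors)]
    if merge_mode == "sum":
        return summed
    if merge_mode == "mean":
        return [value / len(vectors) for value in summed]
    raise ValueError(f"unknown merge_mode: {merge_mode}")


tables = make_tables(num_hash_buckets=4, hash_bucket_size=64, embedding_size=8)
mean_vec = embed("query text", tables, merge_mode="mean")
concat_vec = embed("query text", tables, merge_mode="concat")
print(len(mean_vec), len(concat_vec))  # prints: 8 32
```

In this sketch, “mean” and “sum” keep the output at embedding_size, while “concat” grows it to num_hash_buckets * embedding_size.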
- class ml4ir.base.features.feature_fns.categorical.CategoricalEmbeddingWithIndices(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)
Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp
Converts an input integer tensor into a categorical embedding. Works by converting the categorical indices in the input feature_tensor, represented as integer values, into categorical embeddings based on the feature_info.
Initialize feature layer to convert categorical feature into embedding based on indices
Parameters: - feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Notes
- Args under feature_layer_info:
- num_buckets : int
- Maximum number of categorical values
- default_value : int
- default value to be assigned to indices out of the num_buckets range
- embedding_size : int
- dimension size of the categorical embedding
- LAYER_NAME = 'categorical_embedding_with_indices'
- NUM_BUCKETS = 'num_buckets'
- DEFAULT_VALUE = 'default_value'
- EMBEDDING_SIZE = 'embedding_size'
- call(inputs, training=None)
Defines the forward pass for the layer on the inputs tensor
Parameters: - inputs (tensor) – Input tensor on which the feature transforms are applied
- training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns: Resulting tensor after the forward pass through the feature transform layer
Return type: tf.Tensor
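A minimal sketch of how num_buckets, default_value, and embedding_size might interact for CategoricalEmbeddingWithIndices is shown below. It is pure Python, not the library code. The helper names are hypothetical, and treating negative indices as out of range is an assumption.

```python
import random


def make_embedding_table(num_buckets, embedding_size, seed=0):
    """Randomly initialised table with one row per categorical value."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(embedding_size)]
            for _ in range(num_buckets)]


def embed_indices(indices, table, default_value):
    """Replace out-of-range indices with default_value, then look up rows."""
    num_buckets = len(table)
    safe = [i if 0 <= i < num_buckets else default_value for i in indices]
    return [table[i] for i in safe]


table = make_embedding_table(num_buckets=5, embedding_size=3)
# Indices 7 and -1 fall outside [0, 5) and are mapped to default_value=0.
vectors = embed_indices([0, 4, 7, -1], table, default_value=0)
```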
- class ml4ir.base.features.feature_fns.categorical.CategoricalEmbeddingToEncodingBiLSTM(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)
Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp
Encodes a string tensor into a categorical embedding. Works by converting the string into a word sequence and then generating a categorical/char embedding for each word, based on the list of strings that forms the vocabulary set of categorical values, defined by the vocabulary_file argument. The char/byte embeddings are then combined using a biLSTM.
Initialize the layer to convert input string tensor into an encoding using categorical embeddings
Parameters: - feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Notes
- Args under feature_layer_info:
- vocabulary_file : string
- path to vocabulary CSV file for the input tensor containing the vocabulary to look-up. Uses the “key” named column as vocabulary, or the 1st column if no “key” column is present.
- max_length: int
- max number of rows to consider from the vocabulary file.
- if null, considers the entire file vocabulary.
- embedding_size : int
- dimension size of the embedding;
- if null, then the tensor is just converted to its one-hot representation
- encoding_size : int
- dimension size of the sequence encoding computed using a biLSTM
The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.
- LAYER_NAME = 'categorical_embedding_to_encoding_bilstm'
- VOCABULARY_FILE = 'vocabulary_file'
- MAX_LENGTH = 'max_length'
- EMBEDDING_SIZE = 'embedding_size'
- ENCODING_SIZE = 'encoding_size'
- LSTM_KERNEL_INITIALIZER = 'lstm_kernel_initializer'
- call(inputs, training=None)
Defines the forward pass for the layer on the inputs tensor
Parameters: - inputs (tensor) – Input tensor on which the feature transforms are applied
- training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns: Resulting tensor after the forward pass through the feature transform layer
Return type: tf.Tensor
- class ml4ir.base.features.feature_fns.categorical.CategoricalEmbeddingWithVocabularyFile(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)
Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp
Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info.
Initialize layer to define a categorical embedding using a vocabulary file
Parameters: - feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Notes
- Args under feature_layer_info:
- vocabulary_file : string
- path to vocabulary CSV file for the input tensor containing the vocabulary to look-up. Uses the “key” named column as vocabulary, or the 1st column if no “key” column is present.
- max_length : int
- max number of rows to consider from the vocabulary file.
- if null, considers the entire file vocabulary.
- num_oov_buckets : int
- number of out of vocabulary buckets/slots to be used to encode strings into categorical indices
- embedding_size : int
- dimension size of categorical embedding
The vocabulary CSV file is expected to contain two columns, key and id, where each key is mapped to one id. Several keys may share an id, resulting in a many-to-one vocabulary mapping. If the id field is absent, a unique whole number id is assigned to each key by default, resulting in a one-to-one mapping.
- LAYER_NAME = 'categorical_embedding_with_vocabulary_file'
- VOCABULARY_FILE = 'vocabulary_file'
- MAX_LENGTH = 'max_length'
- NUM_OOV_BUCKETS = 'num_oov_buckets'
- NUM_BUCKETS = 'num_buckets'
- EMBEDDING_SIZE = 'embedding_size'
- DEFAULT_VALUE = 'default_value'
- call(inputs, training=None)
Defines the forward pass for the layer on the inputs tensor
Parameters: - inputs (tensor) – Input tensor on which the feature transforms are applied
- training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns: Resulting tensor after the forward pass through the feature transform layer
Return type: tf.Tensor
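The key/id vocabulary mapping, the max_length cut-off, and the out-of-vocabulary buckets can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the ml4ir code. It assumes a “key” column is always present, that default ids start at 0 in file order, and that OOV ids are placed directly after the largest vocabulary id and chosen with a CRC32 hash. The helper names load_vocabulary and lookup are hypothetical.

```python
import csv
import io
import zlib


def load_vocabulary(csv_text, max_length=None):
    """Return a key -> id dict from CSV text with a 'key' and optional 'id' column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if max_length is not None:
        rows = rows[:max_length]  # keep only the first max_length rows
    if rows and "id" in rows[0]:
        return {row["key"]: int(row["id"]) for row in rows}
    # No id column: assign whole-number ids 0, 1, 2, ... in file order.
    return {row["key"]: position for position, row in enumerate(rows)}


def lookup(token, vocabulary, num_oov_buckets):
    """Return the vocabulary id, or an OOV bucket id (needs num_oov_buckets >= 1)."""
    if token in vocabulary:
        return vocabulary[token]
    first_oov_id = max(vocabulary.values()) + 1
    return first_oov_id + zlib.crc32(token.encode("utf-8")) % num_oov_buckets


# Two keys share id 0, giving a many-to-one mapping.
many_to_one = load_vocabulary("key,id\nnyc,0\nnew york,0\nboston,1\n")
# Without an id column each key gets its own id, giving a one-to-one mapping.
one_to_one = load_vocabulary("key\nnyc\nnew york\nboston\n")
oov_id = lookup("paris", many_to_one, num_oov_buckets=2)
```

The resulting integer ids would then index an embedding table of dimension embedding_size.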
- class ml4ir.base.features.feature_fns.categorical.CategoricalEmbeddingWithVocabularyFileAndDropout(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)
Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp
Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info. Also uses dropout to convert categorical indices to the OOV index of 0 at a rate of dropout_rate.
Parameters: - feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Notes
- Args under feature_layer_info:
- vocabulary_file : str
- path to vocabulary CSV file for the input tensor
- dropout_rate : float
- rate at which to convert categorical indices to OOV
- embedding_size : int
- dimension size of categorical embedding
The vocabulary CSV file is expected to contain two columns, key and id, where each key is mapped to one id. Several keys may share an id, resulting in a many-to-one vocabulary mapping. If the id field is absent, a unique natural number id is assigned to each key by default, resulting in a one-to-one mapping.
The OOV index is set to 0 and num_oov_buckets is set to 0.
- LAYER_NAME = 'categorical_embedding_with_vocabulary_file_and_dropout'
- VOCABULARY_FILE = 'vocabulary_file'
- DROPOUT_RATE = 'dropout_rate'
- EMBEDDING_SIZE = 'embedding_size'
- NUM_BUCKETS = 'num_buckets'
- DEFAULT_VALUE = 'default_value'
- call(inputs, training=None)
Defines the forward pass for the layer on the inputs tensor
Parameters: - inputs (tensor) – Input tensor on which the feature transforms are applied
- training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns: Resulting tensor after the forward pass through the feature transform layer
Return type: tf.Tensor
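The index-level dropout described for CategoricalEmbeddingWithVocabularyFileAndDropout can be pictured with the sketch below. It is a pure-Python illustration with a hypothetical helper name. Applying the dropout only when training is true is an assumption suggested by the training flag of call, not something the documentation states.

```python
import random


def apply_index_dropout(indices, dropout_rate, training, rng):
    """Replace each index with the OOV index 0 with probability dropout_rate."""
    if not training:
        # Assumed behaviour: indices pass through unchanged outside training.
        return list(indices)
    return [0 if rng.random() < dropout_rate else i for i in indices]


rng = random.Random(42)
indices = [3, 1, 4, 1, 5, 9, 2, 6]
train_out = apply_index_dropout(indices, dropout_rate=0.5, training=True, rng=rng)
infer_out = apply_index_dropout(indices, dropout_rate=0.5, training=False, rng=rng)
```

Because the OOV index is 0, this arrangement relies on vocabulary ids starting at 1, which is consistent with the natural number ids mentioned in the notes.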
- class ml4ir.base.features.feature_fns.categorical.CategoricalIndicatorWithVocabularyFile(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)
Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp
Converts a string tensor into a categorical one-hot representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into one-hot representation.
Parameters: - feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Notes
- Args under feature_layer_info:
- vocabulary_file : string
- path to vocabulary CSV file for the input tensor containing the vocabulary to look-up. Uses the “key” named column as vocabulary, or the 1st column if no “key” column is present.
- max_length : int
- max number of rows to consider from the vocabulary file. if null, considers the entire file vocabulary.
- num_oov_buckets : int, optional
- number of out of vocabulary buckets/slots to be used to encode strings into categorical indices. If not specified, the default is 1.
The vocabulary CSV file is expected to contain two columns, key and id, where each key is mapped to one id. Several keys may share an id, resulting in a many-to-one vocabulary mapping. If the id field is absent, a unique whole number id is assigned to each key by default, resulting in a one-to-one mapping.
- LAYER_NAME = 'categorical_indicator_with_vocabulary_file'
- VOCABULARY_FILE = 'vocabulary_file'
- MAX_LENGTH = 'max_length'
- NUM_OOV_BUCKETS = 'num_oov_buckets'
- call(inputs, training=None)
Defines the forward pass for the layer on the inputs tensor
Parameters: - inputs (tensor) – Input tensor on which the feature transforms are applied
- training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns: Resulting tensor after the forward pass through the feature transform layer
Return type: tf.Tensor
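As a rough illustration of the one-hot representation produced by CategoricalIndicatorWithVocabularyFile, the sketch below encodes tokens against a small in-memory vocabulary with the default of one OOV bucket. Placing the OOV slot after the vocabulary ids is an assumption. The vocabulary contents and the helper names are made up for the example.

```python
def one_hot(index, depth):
    """Vector of length depth with a single 1.0 at position index."""
    return [1.0 if position == index else 0.0 for position in range(depth)]


vocabulary = {"red": 0, "green": 1, "blue": 2}  # hypothetical vocabulary
num_oov_buckets = 1
depth = len(vocabulary) + num_oov_buckets


def encode(token):
    """Unknown tokens share the single OOV slot placed after the vocabulary ids."""
    index = vocabulary.get(token, len(vocabulary))
    return one_hot(index, depth)


print(encode("green"))   # prints: [0.0, 1.0, 0.0, 0.0]
print(encode("purple"))  # prints: [0.0, 0.0, 0.0, 1.0]
```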
Sequence Feature Transformations
- class ml4ir.base.features.feature_fns.sequence.BytesSequenceToEncodingBiLSTM(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)
Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp
Encodes a string tensor into an encoding. Works by converting the string into a bytes sequence and then generating a categorical/char embedding for each of the 256 possible byte values. The char/byte embeddings are then combined using a biLSTM.
Initialize a feature layer to convert string tensor to bytes encoding
Parameters: - feature_info (dict) – Dictionary representing the feature_config for the input feature
- file_io (FileIO object) – FileIO handler object for reading and writing
Notes
- Args under feature_layer_info:
- max_length : int
- max length of bytes sequence
- embedding_size : int
- dimension size of the embedding; if null, then the tensor is just converted to its one-hot representation
- encoding_size : int
- dimension size of the sequence encoding computed using a biLSTM
The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.
- LAYER_NAME = 'bytes_sequence_to_encoding_bilstm'
- MAX_LENGTH = 'max_length'
- EMBEDDING_SIZE = 'embedding_size'
- LSTM_KERNEL_INITIALIZER = 'lstm_kernel_initializer'
- ENCODING_SIZE = 'encoding_size'
- call(inputs, training=None)
Defines the forward pass for the layer on the inputs tensor
Parameters: - inputs (tensor) – Input tensor on which the feature transforms are applied
- training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns: Resulting tensor after the forward pass through the feature transform layer
Return type: tf.Tensor
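The reason the embedding input dimension is fixed to 256 can be seen from a small sketch of the string-to-bytes step. This covers only the preprocessing and the one-hot case where embedding_size is null; the biLSTM itself is not shown. UTF-8 encoding and right-padding with 0 up to max_length are assumptions, and the helper names are hypothetical.

```python
def to_byte_ids(text, max_length):
    """UTF-8 encode, truncate to max_length, and right-pad with 0 (assumed padding)."""
    byte_ids = list(text.encode("utf-8"))[:max_length]
    return byte_ids + [0] * (max_length - len(byte_ids))


def one_hot_bytes(byte_ids):
    """One row of width 256 per byte, since a byte takes values 0..255."""
    return [[1.0 if value == b else 0.0 for value in range(256)] for b in byte_ids]


ids = to_byte_ids("héllo", max_length=8)
print(ids)  # prints: [104, 195, 169, 108, 108, 111, 0, 0]
rows = one_hot_bytes(ids)
```

Note that the accented character occupies two bytes, so max_length counts bytes rather than characters.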
- class ml4ir.base.features.feature_fns.sequence.Global1dPooling(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)
Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp
Applies 1D pooling to reduce a variable-length sequence feature to a scalar value. Multiple pooling operations can optionally be specified to produce a fixed-dimensional feature vector instead.
Initialize a feature layer to apply global 1D pooling operation on input tensor
Parameters: - feature_info (dict) – Dictionary representing the feature_config for the input feature
- file_io (FileIO object) – FileIO handler object for reading and writing
Notes
- Args under feature_layer_info:
- fns : list of str
- List of pooling operations to apply, given as strings. Each must be one of [“sum”, “mean”, “max”, “min”, “count_nonzero”]
- padded_val : int/float
- Value to be ignored from the pooling operations.
- masked_max_val : int/float
- Value used to mask the padded values when computing the max and min pooling operations, so that padded values are ignored by those operations. For example, if all the values in the tensor are in [0., 1.], then a masked_max_val greater than 1. ensures that padded values are never picked in the min/max pooling ops. Default value: 2
- LAYER_NAME = 'global_1d_pooling'
- FNS = 'fns'
- PADDED_VAL = 'padded_val'
- MASKED_MAX_VAL = 'masked_max_val'
- DEFAULT_MASKED_MAX_VAL = 2.0
- call(inputs, training=None)
Defines the forward pass for the layer on the inputs tensor
Parameters: - inputs (tensor) – Input tensor on which the feature transforms are applied
- training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns: Resulting tensor after the forward pass through the feature transform layer
Return type: tf.Tensor
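The interplay of fns, padded_val, and masked_max_val can be sketched in pure Python for a single sequence. This is an illustration, not the library implementation. It assumes that sum, mean, and count_nonzero skip padded positions, that max replaces padded positions with -masked_max_val, and that min replaces them with +masked_max_val. It also assumes a non-empty sequence.

```python
def global_1d_pooling(sequence, fns, padded_val=None, masked_max_val=2.0):
    """Pool one sequence into a vector with one entry per name in fns."""
    is_real = [padded_val is None or v != padded_val for v in sequence]
    kept = [v for v, real in zip(sequence, is_real) if real]
    # For max/min, padded positions are swapped for a sentinel that cannot win,
    # provided every real value is smaller in magnitude than masked_max_val.
    for_max = [v if real else -masked_max_val for v, real in zip(sequence, is_real)]
    for_min = [v if real else masked_max_val for v, real in zip(sequence, is_real)]
    ops = {
        "sum": lambda: sum(kept),
        "mean": lambda: sum(kept) / len(kept) if kept else 0.0,
        "max": lambda: max(for_max),
        "min": lambda: min(for_min),
        "count_nonzero": lambda: float(sum(1 for v in kept if v != 0)),
    }
    return [ops[fn]() for fn in fns]


# Three real values in [0, 1] followed by two padded positions.
pooled = global_1d_pooling(
    [0.2, 0.8, 0.5, -1.0, -1.0],
    fns=["sum", "mean", "max", "min", "count_nonzero"],
    padded_val=-1.0,
)
```

With five entries in fns, the variable-length input is reduced to a fixed five-dimensional vector; here the min is 0.2 rather than the padding value of -1.0.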
TensorFlow Native Operations
- class ml4ir.base.features.feature_fns.tf_native.TFNativeOpLayer(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)
Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp
Runs a series of TensorFlow native operations on the input feature tensor. The functions are applied in the order in which they are specified.
Initialize the feature layer
Parameters: - feature_info (dict) – Dictionary representing the feature_config for the input feature
- file_io (FileIO object) – FileIO handler object for reading and writing
Notes
- Args under feature_layer_info:
- ops : list of dict
- List of function specifications with associated arguments
- Arguments under ops:
- fn : str
- TensorFlow native function name. Should start with “tf.”, for example tf.math.log or tf.clip_by_value
- args : dict
- Keyword arguments to be passed to the TensorFlow function
- LAYER_NAME = 'tf_native_op'
- OPS = 'ops'
- call(inputs, training=None)
Defines the forward pass for the layer on the inputs tensor
Parameters: - inputs (tensor) – Input tensor on which the feature transforms are applied
- training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns: Resulting tensor after the forward pass through the feature transform layer
Return type: tf.Tensor
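The ordered application of ops can be illustrated without TensorFlow by mapping the two function names mentioned above to pure-Python stand-ins. The STAND_INS registry and apply_ops helper are hypothetical and exist only for this sketch; the real layer is described as calling the TensorFlow functions themselves. The keyword names clip_value_min and clip_value_max follow the signature of tf.clip_by_value.

```python
import math

# Pure-Python stand-ins keyed by the TensorFlow function names they imitate.
STAND_INS = {
    "tf.clip_by_value": lambda t, clip_value_min, clip_value_max: [
        min(max(v, clip_value_min), clip_value_max) for v in t
    ],
    "tf.math.log": lambda x: [math.log(v) for v in x],
}


def apply_ops(inputs, ops):
    """Apply each op to the running result, in the order listed."""
    result = inputs
    for op in ops:
        fn_name = op["fn"]
        if not fn_name.startswith("tf."):
            raise ValueError(f"fn must start with 'tf.': {fn_name}")
        result = STAND_INS[fn_name](result, **op.get("args", {}))
    return result


ops = [
    {"fn": "tf.clip_by_value", "args": {"clip_value_min": 1.0, "clip_value_max": 10.0}},
    {"fn": "tf.math.log", "args": {}},
]
# Clipping first keeps every value >= 1.0, so the log is defined for all inputs.
output = apply_ops([-5.0, 0.0, 99.0], ops)
```

The order of the list matters in this sketch: clipping before taking the log is what keeps negative and zero inputs from reaching the logarithm.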