Feature Transformation

Categorical Feature Transformations

class ml4ir.base.features.feature_fns.categorical.CategoricalEmbeddingWithHashBuckets(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)

Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp

Converts a string feature tensor into a categorical embedding. Works by first converting the string into num_hash_buckets buckets each of size hash_bucket_size, then converting each hash bucket into a categorical embedding of dimension embedding_size. Finally, these embeddings are combined either through mean, sum or concat operations to generate the final embedding based on the feature_info.

Initialize the layer to get categorical embedding with hash buckets

Parameters:
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing

Notes

Args under feature_layer_info:
num_hash_buckets : int
number of different hash buckets to convert the input string into
hash_bucket_size : int
the size of each hash bucket
embedding_size : int
dimension size of the categorical embedding
merge_mode : str
can be one of “mean”, “sum”, “concat” representing the mode of combining embeddings from each categorical embedding
LAYER_NAME = 'categorical_embedding_with_hash_buckets'
NUM_HASH_BUCKETS = 'num_hash_buckets'
HASH_BUCKET_SIZE = 'hash_bucket_size'
EMBEDDING_SIZE = 'embedding_size'
MERGE_MODE = 'merge_mode'
call(inputs, training=None)

Defines the forward pass for the layer on the inputs tensor

Parameters:
  • inputs (tensor) – Input tensor on which the feature transforms are applied
  • training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns:

Resulting tensor after the forward pass through the feature transform layer

Return type:

tf.Tensor
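
The hash-then-embed-then-merge flow described for this layer can be sketched in plain Python. This is only an illustration under stated assumptions, not the ml4ir implementation: the helper names (hash_to_bucket, make_tables, embed), the salted MD5 hash, and the randomly initialised tables are all hypothetical stand-ins for the hashing and trainable embeddings the real layer uses.

```python
import hashlib
import random


def hash_to_bucket(value: str, salt: int, hash_bucket_size: int) -> int:
    """Map a string to a bucket id in [0, hash_bucket_size).

    Assumption: a salted MD5 stands in for whatever hash the layer uses.
    """
    digest = hashlib.md5(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


def make_tables(num_hash_buckets, hash_bucket_size, embedding_size, seed=0):
    """One (hash_bucket_size x embedding_size) table per hash bucket."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(embedding_size)]
         for _ in range(hash_bucket_size)]
        for _ in range(num_hash_buckets)
    ]


def embed(value: str, tables, merge_mode: str):
    """Look up one vector per table and merge them with mean, sum or concat."""
    hash_bucket_size = len(tables[0])
    vectors = [
        table[hash_to_bucket(value, salt, hash_bucket_size)]
        for salt, table in enumerate(tables)
    ]
    if merge_mode == "concat":
        return [x for vector in vectors for x in vector]
    summed = [sum(column) for column in zip(*vectors)]
    if merge_mode == "sum":
        return summed
    if merge_mode == "mean":
        return [x / len(vectors) for x in summed]
    raise ValueError(f"Unsupported merge_mode: {merge_mode}")


tables = make_tables(num_hash_buckets=4, hash_bucket_size=16, embedding_size=8)
concat_embedding = embed("running shoes", tables, "concat")  # length 4 * 8
mean_embedding = embed("running shoes", tables, "mean")      # length 8
```

Note that "concat" yields a vector of size num_hash_buckets * embedding_size, while "mean" and "sum" keep the size at embedding_size.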

class ml4ir.base.features.feature_fns.categorical.CategoricalEmbeddingWithIndices(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)

Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp

Converts an input integer tensor into a categorical embedding. Works by converting the categorical indices in the input feature_tensor, represented as integer values, into categorical embeddings based on the feature_info.

Initialize feature layer to convert categorical feature into embedding based on indices

Parameters:
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing

Notes

Args under feature_layer_info:
num_buckets : int
Maximum number of categorical values
default_value : int
default value to be assigned to indices out of the num_buckets range
embedding_size : int
dimension size of the categorical embedding
LAYER_NAME = 'categorical_embedding_with_indices'
NUM_BUCKETS = 'num_buckets'
DEFAULT_VALUE = 'default_value'
EMBEDDING_SIZE = 'embedding_size'
call(inputs, training=None)

Defines the forward pass for the layer on the inputs tensor

Parameters:
  • inputs (tensor) – Input tensor on which the feature transforms are applied
  • training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns:

Resulting tensor after the forward pass through the feature transform layer

Return type:

tf.Tensor
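
As a rough illustration of how num_buckets and default_value interact, the sketch below replaces out-of-range indices with default_value before the embedding lookup. The function name lookup_with_default and the toy table are hypothetical; the real layer operates on tensors and a trainable embedding matrix.

```python
def lookup_with_default(indices, table, default_value):
    """Embed integer indices, mapping out-of-range ones to default_value.

    `table` has one row per bucket, so num_buckets == len(table).
    """
    num_buckets = len(table)
    safe_indices = [
        index if 0 <= index < num_buckets else default_value
        for index in indices
    ]
    return [table[index] for index in safe_indices]


# Toy embedding table: num_buckets = 3, embedding_size = 2
table = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
embeddings = lookup_with_default([1, 7, -1], table, default_value=0)
```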

class ml4ir.base.features.feature_fns.categorical.CategoricalEmbeddingToEncodingBiLSTM(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)

Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp

Encodes a string tensor into a categorical embedding. Works by converting the string into a word sequence and then generating a categorical/char embedding for each word based on the list of strings that form the vocabulary set of categorical values, defined by the argument vocabulary_file. The char/byte embeddings are then combined using a biLSTM.

Initialize the layer to convert input string tensor into an encoding using categorical embeddings

Parameters:
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing

Notes

Args under feature_layer_info:
vocabulary_file : string
path to vocabulary CSV file for the input tensor containing the vocabulary to look-up. Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
max_length: int
max number of rows to consider from the vocabulary file.
if null, considers the entire file vocabulary.
embedding_size : int
dimension size of the embedding;
if null, then the tensor is just converted to its one-hot representation
encoding_size : int
dimension size of the sequence encoding computed using a biLSTM

The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.

LAYER_NAME = 'categorical_embedding_to_encoding_bilstm'
VOCABULARY_FILE = 'vocabulary_file'
MAX_LENGTH = 'max_length'
EMBEDDING_SIZE = 'embedding_size'
ENCODING_SIZE = 'encoding_size'
LSTM_KERNEL_INITIALIZER = 'lstm_kernel_initializer'
call(inputs, training=None)

Defines the forward pass for the layer on the inputs tensor

Parameters:
  • inputs (tensor) – Input tensor on which the feature transforms are applied
  • training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns:

Resulting tensor after the forward pass through the feature transform layer

Return type:

tf.Tensor

class ml4ir.base.features.feature_fns.categorical.CategoricalEmbeddingWithVocabularyFile(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)

Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp

Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info.

Initialize layer to define a categorical embedding using a vocabulary file

Parameters:
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing

Notes

Args under feature_layer_info:
vocabulary_file : string
path to vocabulary CSV file for the input tensor containing the vocabulary to look-up.
uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
max_length : int
max number of rows to consider from the vocabulary file.
if null, considers the entire file vocabulary.
num_oov_buckets : int
number of out of vocabulary buckets/slots to be used to
encode strings into categorical indices
embedding_size : int
dimension size of categorical embedding

The vocabulary CSV file must contain two columns - key, id, where the key is mapped to one id thereby resulting in a many-to-one vocabulary mapping. If id field is absent, a unique whole number id is assigned by default resulting in a one-to-one mapping

LAYER_NAME = 'categorical_embedding_with_vocabulary_file'
VOCABULARY_FILE = 'vocabulary_file'
MAX_LENGTH = 'max_length'
NUM_OOV_BUCKETS = 'num_oov_buckets'
NUM_BUCKETS = 'num_buckets'
EMBEDDING_SIZE = 'embedding_size'
DEFAULT_VALUE = 'default_value'
call(inputs, training=None)

Defines the forward pass for the layer on the inputs tensor

Parameters:
  • inputs (tensor) – Input tensor on which the feature transforms are applied
  • training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns:

Resulting tensor after the forward pass through the feature transform layer

Return type:

tf.Tensor
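
The vocabulary file rules above (key column selection, optional id column, max_length, and OOV buckets) can be sketched with the standard library. This is a simplified illustration and not the ml4ir loader: load_vocabulary and to_index are hypothetical names, the CRC32 hash is a stand-in, and placing the OOV slots directly after the largest vocabulary id is an assumption.

```python
import csv
import io
import zlib


def load_vocabulary(csv_text, max_length=None):
    """Build a key -> id mapping from vocabulary CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    if max_length is not None:
        rows = rows[:max_length]  # only consider the first max_length rows
    columns = reader.fieldnames or []
    # Use the "key" column if present, otherwise fall back to the 1st column
    key_column = "key" if "key" in columns else columns[0]
    if "id" in columns:
        # Several keys may share an id: many-to-one mapping
        return {row[key_column]: int(row["id"]) for row in rows}
    # No id column: assign 0, 1, 2, ... giving a one-to-one mapping
    return {row[key_column]: position for position, row in enumerate(rows)}


def to_index(value, vocabulary, num_oov_buckets):
    """Return the vocabulary id, or hash unknown strings into an OOV slot.

    Assumes num_oov_buckets >= 1 and that OOV slots follow the vocabulary ids.
    """
    if value in vocabulary:
        return vocabulary[value]
    num_ids = max(vocabulary.values()) + 1
    return num_ids + zlib.crc32(value.encode("utf-8")) % num_oov_buckets


vocabulary = load_vocabulary("key,id\nusa,0\nus,0\nuk,1\n")
oov_index = to_index("fr", vocabulary, num_oov_buckets=3)
```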

class ml4ir.base.features.feature_fns.categorical.CategoricalEmbeddingWithVocabularyFileAndDropout(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)

Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp

Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info. Also applies dropout, converting categorical indices to the OOV index of 0 at a rate of dropout_rate.

Parameters:
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing

Notes

Args under feature_layer_info:
vocabulary_file : str
path to vocabulary CSV file for the input tensor
dropout_rate : float
rate at which to convert categorical indices to OOV
embedding_size : int
dimension size of categorical embedding

The vocabulary CSV file must contain two columns - key, id, where the key is mapped to one id thereby resulting in a many-to-one vocabulary mapping. If id field is absent, a unique natural number id is assigned by default resulting in a one-to-one mapping

The OOV index will be set to 0, and num_oov_buckets will be 0.

LAYER_NAME = 'categorical_embedding_with_vocabulary_file_and_dropout'
VOCABULARY_FILE = 'vocabulary_file'
DROPOUT_RATE = 'dropout_rate'
EMBEDDING_SIZE = 'embedding_size'
NUM_BUCKETS = 'num_buckets'
DEFAULT_VALUE = 'default_value'
call(inputs, training=None)

Defines the forward pass for the layer on the inputs tensor

Parameters:
  • inputs (tensor) – Input tensor on which the feature transforms are applied
  • training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns:

Resulting tensor after the forward pass through the feature transform layer

Return type:

tf.Tensor
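
The index-level dropout described for this layer might look like the following sketch. It assumes that dropout is only active when training is true, which is the usual convention for dropout but is not stated above; the function name categorical_index_dropout is hypothetical.

```python
import random


def categorical_index_dropout(indices, dropout_rate, training, rng=None):
    """Randomly replace categorical indices with the OOV index 0.

    Assumption: indices pass through unchanged when training is False.
    """
    if not training:
        return list(indices)
    rng = rng or random.Random()
    return [0 if rng.random() < dropout_rate else index for index in indices]


indices = [3, 4, 5, 6]
inference_indices = categorical_index_dropout(indices, 0.5, training=False)
training_indices = categorical_index_dropout(
    indices, 0.5, training=True, rng=random.Random(13)
)
```

Exposing the embedding of index 0 to in-vocabulary examples during training is what lets the model learn a usable representation for unseen values.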

class ml4ir.base.features.feature_fns.categorical.CategoricalIndicatorWithVocabularyFile(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)

Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp

Converts a string tensor into a categorical one-hot representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into one-hot representation.

Parameters:
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing

Notes

Args under feature_layer_info:
vocabulary_file : string
path to vocabulary CSV file for the input tensor containing the vocabulary to look-up. Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
max_length : int
max number of rows to consider from the vocabulary file. if null, considers the entire file vocabulary.
num_oov_buckets : int, optional
number of out of vocabulary buckets/slots to be used to encode strings into categorical indices. If not specified, the default is 1.

The vocabulary CSV file must contain two columns - key, id, where the key is mapped to one id thereby resulting in a many-to-one vocabulary mapping. If id field is absent, a unique whole number id is assigned by default resulting in a one-to-one mapping

LAYER_NAME = 'categorical_indicator_with_vocabulary_file'
VOCABULARY_FILE = 'vocabulary_file'
MAX_LENGTH = 'max_length'
NUM_OOV_BUCKETS = 'num_oov_buckets'
call(inputs, training=None)

Defines the forward pass for the layer on the inputs tensor

Parameters:
  • inputs (tensor) – Input tensor on which the feature transforms are applied
  • training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns:

Resulting tensor after the forward pass through the feature transform layer

Return type:

tf.Tensor
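
A minimal sketch of the one-hot output, assuming the vector has one slot per vocabulary id followed by num_oov_buckets OOV slots. The name to_one_hot is hypothetical, and sending every unknown string to the first OOV slot is a simplification of this sketch.

```python
def to_one_hot(value, vocabulary, num_oov_buckets=1):
    """One-hot encode a string using a key -> id vocabulary mapping."""
    num_ids = max(vocabulary.values()) + 1
    depth = num_ids + num_oov_buckets
    # Simplification: all unknown strings share the first OOV slot.
    # With num_oov_buckets=0 an unknown string yields an all-zero vector.
    index = vocabulary.get(value, num_ids)
    return [1.0 if position == index else 0.0 for position in range(depth)]


vocabulary = {"a": 0, "b": 1}
known = to_one_hot("b", vocabulary)
unknown = to_one_hot("zzz", vocabulary)
```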

Sequence Feature Transformations

class ml4ir.base.features.feature_fns.sequence.BytesSequenceToEncodingBiLSTM(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)

Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp

Encodes a string tensor into an encoding. Works by converting the string into a bytes sequence and then generating a categorical/char embedding for each of the 256 bytes. The char/byte embeddings are then combined using a biLSTM.

Initialize a feature layer to convert string tensor to bytes encoding

Parameters:
  • feature_info (dict) – Dictionary representing the feature_config for the input feature
  • file_io (FileIO object) – FileIO handler object for reading and writing

Notes

Args under feature_layer_info:
max_length : int
max length of bytes sequence
embedding_size : int
dimension size of the embedding; if null, then the tensor is just converted to its one-hot representation
encoding_size : int
dimension size of the sequence encoding computed using a biLSTM

The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.

LAYER_NAME = 'bytes_sequence_to_encoding_bilstm'
MAX_LENGTH = 'max_length'
EMBEDDING_SIZE = 'embedding_size'
LSTM_KERNEL_INITIALIZER = 'lstm_kernel_initializer'
ENCODING_SIZE = 'encoding_size'
call(inputs, training=None)

Defines the forward pass for the layer on the inputs tensor

Parameters:
  • inputs (tensor) – Input tensor on which the feature transforms are applied
  • training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns:

Resulting tensor after the forward pass through the feature transform layer

Return type:

tf.Tensor
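
The first step of this layer, turning a string into a fixed-length sequence of byte values, can be sketched as below. UTF-8 encoding and zero padding are assumptions of this sketch, and to_byte_sequence is a hypothetical helper. Every resulting value lies in [0, 255], which is why the embedding input dimension is 256.

```python
def to_byte_sequence(text, max_length, pad_value=0):
    """Convert a string to byte values, truncated or padded to max_length."""
    byte_ids = list(text.encode("utf-8"))[:max_length]
    return byte_ids + [pad_value] * (max_length - len(byte_ids))


padded = to_byte_sequence("abc", max_length=5)
truncated = to_byte_sequence("héllo", max_length=3)
```

The embedding and biLSTM stages then operate on this integer sequence.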

class ml4ir.base.features.feature_fns.sequence.Global1dPooling(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)

Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp

1D pooling to reduce a variable length sequence feature into a scalar value. This method optionally allows users to add multiple such pooling operations to produce a fixed dimensional feature vector as well.

Initialize a feature layer to apply global 1D pooling operation on input tensor

Parameters:
  • feature_info (dict) – Dictionary representing the feature_config for the input feature
  • file_io (FileIO object) – FileIO handler object for reading and writing

Notes

Args under feature_layer_info:
fns : list of str
List of pooling operations to apply, given as strings. Each must be one of [“sum”, “mean”, “max”, “min”, “count_nonzero”]
padded_val : int/float
Value to be ignored from the pooling operations.
masked_max_val : int/float
Value used to mask the padded values when computing the max and min pooling operations, so that padded values are never selected. For example, if all the values in the tensor are in [0., 1.], then a masked_max_val greater than 1. ensures that padded values are not picked by the min/max pooling ops. Default value: 2
LAYER_NAME = 'global_1d_pooling'
FNS = 'fns'
PADDED_VAL = 'padded_val'
MASKED_MAX_VAL = 'masked_max_val'
DEFAULT_MASKED_MAX_VAL = 2.0
call(inputs, training=None)

Defines the forward pass for the layer on the inputs tensor

Parameters:
  • inputs (tensor) – Input tensor on which the feature transforms are applied
  • training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns:

Resulting tensor after the forward pass through the feature transform layer

Return type:

tf.Tensor
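
The role of padded_val and masked_max_val is easier to see on a single sequence. The sketch below is an interpretation of the description above, not the ml4ir code: it assumes padded positions are replaced by +masked_max_val for "min" and by -masked_max_val for "max", and that "sum", "mean" and "count_nonzero" simply skip padded positions. The function name global_1d_pooling is hypothetical.

```python
def global_1d_pooling(sequence, fns, padded_val, masked_max_val=2.0):
    """Pool one padded sequence into len(fns) values, ignoring padded_val."""
    valid = [x for x in sequence if x != padded_val]
    pooled = []
    for fn in fns:
        if fn == "sum":
            pooled.append(float(sum(valid)))
        elif fn == "mean":
            pooled.append(sum(valid) / len(valid) if valid else 0.0)
        elif fn == "max":
            # Padded positions become -masked_max_val so they never win
            pooled.append(max(-masked_max_val if x == padded_val else x
                              for x in sequence))
        elif fn == "min":
            # Padded positions become +masked_max_val so they never win
            pooled.append(min(masked_max_val if x == padded_val else x
                              for x in sequence))
        elif fn == "count_nonzero":
            pooled.append(float(sum(1 for x in valid if x != 0)))
        else:
            raise ValueError(f"Unsupported pooling fn: {fn}")
    return pooled


features = global_1d_pooling(
    [0.2, 0.8, -1.0, -1.0],
    fns=["sum", "mean", "max", "min", "count_nonzero"],
    padded_val=-1.0,
)
```

Listing several fns produces one value per function, which is how a variable-length sequence becomes a fixed-dimensional feature vector.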

TensorFlow Native Operations

class ml4ir.base.features.feature_fns.tf_native.TFNativeOpLayer(feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO, **kwargs)

Bases: ml4ir.base.features.feature_fns.base.BaseFeatureLayerOp

Runs a series of TensorFlow native operations on the input feature tensor. The functions are applied in the order in which they are specified.

Initialize the feature layer

Parameters:
  • feature_info (dict) – Dictionary representing the feature_config for the input feature
  • file_io (FileIO object) – FileIO handler object for reading and writing

Notes

Args under feature_layer_info:
ops: list of dict

List of function specifications with associated arguments

Arguments under ops:
fn : str
TensorFlow native function name. Should start with the tf prefix, for example tf.math.log or tf.clip_by_value
args : dict
Keyword arguments to be passed to the TensorFlow function
LAYER_NAME = 'tf_native_op'
OPS = 'ops'
call(inputs, training=None)

Defines the forward pass for the layer on the inputs tensor

Parameters:
  • inputs (tensor) – Input tensor on which the feature transforms are applied
  • training (boolean) – Boolean flag indicating if the layer is being used in training mode or not
Returns:

Resulting tensor after the forward pass through the feature transform layer

Return type:

tf.Tensor
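
The way an ops list could be resolved and applied in order is sketched below. To keep the example runnable without TensorFlow, a small stand-in namespace (fake_tf) mimics tf.math.log and tf.clip_by_value on plain floats; the helpers resolve and apply_ops are hypothetical, and passing the running value as the first positional argument is an assumption of this sketch.

```python
import math
from functools import reduce
from types import SimpleNamespace

# Stand-in for the real `tf` module, operating on floats instead of tensors
fake_tf = SimpleNamespace(
    math=SimpleNamespace(log=lambda x: math.log(x)),
    clip_by_value=lambda t, clip_value_min, clip_value_max: min(
        max(t, clip_value_min), clip_value_max
    ),
)


def resolve(fn_name, root):
    """Resolve a dotted name such as 'tf.math.log' against the root module."""
    parts = fn_name.split(".")
    if parts[0] != "tf":
        raise ValueError(f"Function name must start with tf: {fn_name}")
    return reduce(getattr, parts[1:], root)


def apply_ops(value, ops, root):
    """Apply each op in order, feeding the output of one into the next."""
    for op in ops:
        fn = resolve(op["fn"], root)
        value = fn(value, **(op.get("args") or {}))
    return value


ops = [
    {"fn": "tf.math.log"},
    {"fn": "tf.clip_by_value",
     "args": {"clip_value_min": 0.0, "clip_value_max": 1.0}},
]
result = apply_ops(100.0, ops, fake_tf)  # log(100) is about 4.6, clipped to 1.0
```

Swapping fake_tf for the imported tensorflow module would resolve the same names against the real library.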