Feature Transformation

Categorical Feature Transformations

ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_hash_buckets(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Converts a string feature tensor into a categorical embedding. Works by first converting the string into num_hash_buckets buckets, each of size hash_bucket_size, then converting each hash bucket into a categorical embedding of dimension embedding_size. Finally, these embeddings are combined through a mean, sum or concat operation, as specified in the feature_info, to generate the final embedding.

Parameters:
  • feature_tensor (Tensor) – String feature tensor
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

categorical embedding for the input feature_tensor

Return type:

Tensor object

Notes

Args under feature_layer_info:
num_hash_buckets : int
number of different hash buckets to convert the input string into
hash_bucket_size : int
the size of each hash bucket
embedding_size : int
dimension size of the categorical embedding
merge_mode : str
can be one of “mean”, “sum”, “concat” representing the mode of combining embeddings from each categorical embedding
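
The hashing and merging steps described above can be sketched in plain Python. This is a minimal illustration, not ml4ir's implementation: the salted MD5 hash, the randomly initialised tables, and the helper names hash_to_bucket, make_tables and embed are all assumptions made for this example.

```python
import hashlib
import random


def hash_to_bucket(text, salt, hash_bucket_size):
    """Map a string to a bucket id in [0, hash_bucket_size) with a salted hash (assumed scheme)."""
    digest = hashlib.md5(f"{salt}:{text}".encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


def make_tables(num_hash_buckets, hash_bucket_size, embedding_size, seed=0):
    """One embedding table per hash function; random values stand in for trained weights."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(embedding_size)]
         for _ in range(hash_bucket_size)]
        for _ in range(num_hash_buckets)
    ]


def embed(text, tables, merge_mode="mean"):
    """Look up one vector per hash function, then merge with mean, sum or concat."""
    hash_bucket_size = len(tables[0])
    vectors = [
        table[hash_to_bucket(text, salt, hash_bucket_size)]
        for salt, table in enumerate(tables)
    ]
    if merge_mode == "concat":
        return [value for vector in vectors for value in vector]
    summed = [sum(values) for values in zip(*vectors)]
    if merge_mode == "sum":
        return summed
    if merge_mode == "mean":
        return [value / len(vectors) for value in summed]
    raise ValueError(f"Unsupported merge_mode: {merge_mode}")


tables = make_tables(num_hash_buckets=3, hash_bucket_size=16, embedding_size=4)
concat_embedding = embed("query text", tables, merge_mode="concat")
mean_embedding = embed("query text", tables, merge_mode="mean")
```

With concat, the output dimension is num_hash_buckets * embedding_size; with mean or sum it stays at embedding_size.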

ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_indices(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Converts input integer tensor into categorical embedding. Works by converting the categorical indices in the input feature_tensor, represented as integer values, into categorical embeddings based on the feature_info.

Parameters:
  • feature_tensor (Tensor object) – int feature tensor
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

categorical embedding for the input feature_tensor

Return type:

Tensor object

Notes

Args under feature_layer_info:
num_buckets : int
Maximum number of categorical values
default_value : int
default value to be assigned to indices out of the num_buckets range
embedding_size : int
dimension size of the categorical embedding

String based categorical features should already be converted into numeric indices
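
As a rough illustration of how num_buckets, default_value and embedding_size interact, the following plain-Python sketch maps out-of-range indices to default_value before the table lookup. The helper names and the random table values are assumptions for this example, not the library's code.

```python
import random


def make_embedding_table(num_buckets, embedding_size, seed=0):
    """Random stand-in for a trained embedding matrix of shape [num_buckets, embedding_size]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(embedding_size)]
            for _ in range(num_buckets)]


def embed_indices(indices, table, default_value):
    """Replace indices outside [0, num_buckets) with default_value, then look up rows."""
    num_buckets = len(table)
    safe_indices = [
        index if 0 <= index < num_buckets else default_value
        for index in indices
    ]
    return [table[index] for index in safe_indices]


table = make_embedding_table(num_buckets=5, embedding_size=3)
# Index 7 is out of range, so it falls back to the row for default_value=0.
embeddings = embed_indices([2, 7], table, default_value=0)
```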

ml4ir.base.features.feature_fns.categorical.categorical_embedding_to_encoding_bilstm(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Encodes a string tensor into a categorical embedding. Works by converting the string into a word sequence and then generating a categorical/char embedding for each word, based on the list of strings that form the vocabulary set of categorical values, defined by the argument vocabulary_file. The char/byte embeddings are then combined using a biLSTM.

Parameters:
  • feature_tensor (Tensor object) – String feature tensor that is to be encoded
  • feature_info (dict) – Dictionary representing the feature_config for the input feature
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

Encoded feature tensor

Return type:

Tensor object

Notes

Args under feature_layer_info:
vocabulary_file : string
path to vocabulary CSV file for the input tensor containing the vocabulary to look up. Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
max_length : int
max number of rows to consider from the vocabulary file.
if null, considers the entire file vocabulary.
embedding_size : int
dimension size of the embedding;
if null, then the tensor is just converted to its one-hot representation
encoding_size : int
dimension size of the sequence encoding computed using a biLSTM

The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.

ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_vocabulary_file(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info.

Parameters:
  • feature_tensor (Tensor object) – String feature tensor
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

Categorical embedding representation of input feature_tensor

Return type:

Tensor object

Notes

Args under feature_layer_info:
vocabulary_file : string
path to vocabulary CSV file for the input tensor containing the vocabulary to look-up.
Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
max_length : int
max number of rows to consider from the vocabulary file.
if null, considers the entire file vocabulary.
num_oov_buckets : int
number of out of vocabulary buckets/slots to be used to
encode strings into categorical indices
embedding_size : int
dimension size of categorical embedding

The vocabulary CSV file is expected to contain two columns, key and id, where each key is mapped to one id; several keys may share an id, resulting in a many-to-one vocabulary mapping. If the id field is absent, a unique whole number id is assigned to each key by default, resulting in a one-to-one mapping.
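
The vocabulary file conventions above can be sketched with the standard csv module. This is an illustrative reader only: the function name read_vocabulary, the assumption of a header row, and the choice to number default ids from 0 are not taken from the library.

```python
import csv
import io


def read_vocabulary(csv_text, max_length=None):
    """Return (keys, ids) from vocabulary CSV text that starts with a header row."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, records = rows[0], rows[1:]
    if max_length is not None:
        records = records[:max_length]  # only consider the first max_length rows

    # Use the "key" column if present, otherwise fall back to the 1st column.
    key_column = header.index("key") if "key" in header else 0
    keys = [record[key_column] for record in records]

    if "id" in header:
        id_column = header.index("id")
        ids = [int(record[id_column]) for record in records]  # may be many-to-one
    else:
        ids = list(range(len(keys)))  # assumed default: one unique id per key
    return keys, ids


# Two keys share id 0, giving a many-to-one mapping.
many_to_one = read_vocabulary("key,id\nnyc,0\nnew york,0\nsf,1\n")
# No id column, so each key receives its own id.
one_to_one = read_vocabulary("key\nnyc\nsf\n")
```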

class ml4ir.base.features.feature_fns.categorical.CategoricalDropout(dropout_rate, seed=None, **kwargs)

Bases: tensorflow.python.keras.engine.base_layer.Layer

Custom Dropout class for categorical indices

Examples

>>> inputs: [[1, 2, 3], [4, 1, 2]]
>>> dropout_rate = 0.5
>>> When training, output: [[0, 0, 3], [0, 1, 2]]
>>> When testing, output: [[1, 2, 3], [4, 1, 2]]

Notes

At training time, indices are masked to 0 at a rate of dropout_rate.

This works similarly to tf.keras.layers.Dropout, but without the scaling. Ref: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout

Parameters:
  • dropout_rate (float) – fraction of units to drop, i.e., set to OOV token 0
  • seed (int) – random seed for sampling to mask/drop categorical labels

Notes

We define the OOV index to be 0 for this function. When dropout is applied, a dropout_rate fraction of the values is converted to 0 (the OOV index). This allows us to train a good average embedding for the OOV token.
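
The masking behaviour can be illustrated with a small plain-Python sketch. It is not the Keras layer itself: the function name categorical_dropout and the use of random.Random are assumptions, chosen to show that values are either kept unchanged or set to the OOV index, with no rescaling.

```python
import random

OOV_INDEX = 0  # the OOV index is defined to be 0


def categorical_dropout(batch, dropout_rate, training, seed=None):
    """Mask categorical indices to OOV_INDEX with probability dropout_rate when training."""
    if not training:
        # At inference time the indices pass through untouched.
        return [list(row) for row in batch]
    rng = random.Random(seed)
    return [
        [OOV_INDEX if rng.random() < dropout_rate else index for index in row]
        for row in batch
    ]


inputs = [[1, 2, 3], [4, 1, 2]]
train_output = categorical_dropout(inputs, dropout_rate=0.5, training=True, seed=42)
test_output = categorical_dropout(inputs, dropout_rate=0.5, training=False)
```

The training output varies with the seed; each position is either the original index or 0.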

get_config()

Get config for the CategoricalDropout tensorflow layer

Returns: Configuration dictionary for the tensorflow layer
Return type: dict

call(inputs, training=None)

Run the CategoricalDropout layer by masking input labels to OOV index 0 at dropout_rate

Parameters:
  • inputs (Tensor object) – int categorical index tensor to be masked
  • training (bool) – If the layer should be run as training or not
Returns:

Masked tensor object with values set to 0 at probability of dropout_rate

Return type:

Tensor object

ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_vocabulary_file_and_dropout(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info. Also applies dropout, converting categorical indices to the OOV index of 0 at a rate of dropout_rate.

Parameters:
  • feature_tensor (Tensor object) – String feature tensor
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

Categorical embedding representation of input feature_tensor

Return type:

Tensor object

Notes

Args under feature_layer_info:
vocabulary_file : str
path to vocabulary CSV file for the input tensor
dropout_rate : float
rate at which to convert categorical indices to OOV
embedding_size : int
dimension size of categorical embedding

The vocabulary CSV file is expected to contain two columns, key and id, where each key is mapped to one id; several keys may share an id, resulting in a many-to-one vocabulary mapping. If the id field is absent, a unique natural number id is assigned to each key by default, resulting in a one-to-one mapping.

The OOV index will be set to 0, and num_oov_buckets will be 0.

ml4ir.base.features.feature_fns.categorical.categorical_indicator_with_vocabulary_file(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Converts a string tensor into a categorical one-hot representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into one-hot representation.

Parameters:
  • feature_tensor (Tensor object) – String feature tensor
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

Categorical one-hot representation of input feature_tensor

Return type:

Tensor object

Notes

Args under feature_layer_info:
vocabulary_file : string
path to vocabulary CSV file for the input tensor containing the vocabulary to look up. Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
max_length : int
max number of rows to consider from the vocabulary file. if null, considers the entire file vocabulary.
num_oov_buckets : int, optional
number of out of vocabulary buckets/slots to be used to encode strings into categorical indices. If not specified, the default is 1.

The vocabulary CSV file is expected to contain two columns, key and id, where each key is mapped to one id; several keys may share an id, resulting in a many-to-one vocabulary mapping. If the id field is absent, a unique whole number id is assigned to each key by default, resulting in a one-to-one mapping.
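
A one-hot representation built from a vocabulary can be sketched as follows. The sketch assumes a one-to-one vocabulary and a single OOV bucket (the documented default for num_oov_buckets) placed after the vocabulary entries; the function name and the position of the OOV slot are assumptions for illustration.

```python
def one_hot_from_vocabulary(values, vocabulary_keys):
    """One-hot encode strings, with one extra trailing slot for out-of-vocabulary values."""
    key_to_index = {key: index for index, key in enumerate(vocabulary_keys)}
    oov_index = len(vocabulary_keys)   # assumed position of the single OOV bucket
    depth = len(vocabulary_keys) + 1   # vocabulary size + 1 OOV bucket
    encoded = []
    for value in values:
        index = key_to_index.get(value, oov_index)
        encoded.append([1.0 if position == index else 0.0 for position in range(depth)])
    return encoded


vocabulary = ["a", "b", "c"]
one_hot = one_hot_from_vocabulary(["b", "unseen"], vocabulary)
# "b" activates position 1; "unseen" activates the trailing OOV position 3.
```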

ml4ir.base.features.feature_fns.categorical.categorical_indices_from_vocabulary_file(feature_info, feature_tensor, file_io)

Extracts the vocabulary (encoding and values) from the vocabulary_file stated inside feature_info, and encodes the feature_tensor with that vocabulary.

Parameters:
  • feature_tensor (Tensor object) – String feature tensor
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

  • categorical_indices (Tensor object) – encoding of the feature_tensor based on the vocabulary.
  • num_oov_buckets (int) – applied num_oov_buckets
  • vocabulary_keys (list) – values of the vocabulary stated in the vocabulary_file.

class ml4ir.base.features.feature_fns.categorical.VocabLookup(vocabulary_keys, vocabulary_ids, num_oov_buckets: int = None, default_value: int = None, feature_name='categorical_variable')

Bases: tensorflow.python.keras.engine.base_layer.Layer

The class defines a keras layer wrapper around a tf lookup table using the given vocabulary list. Maps each entry of a vocabulary list into categorical indices.

vocabulary_list

List of strings that form the vocabulary set of categorical values

Type: list

num_oov_buckets

Number of buckets to be used for out of vocabulary strings

Type: int

default_value

Default value to be used for OOV values

Type: int

feature_name

Name of the input feature tensor

Type: str

lookup_table

Tensorflow look up table that maps strings to integer indices

Type: LookupTable object

Notes

There is a known issue [1] with using LookupTable with Keras symbolic tensors: the table expects eager tensors.

Ref: https://github.com/tensorflow/tensorflow/issues/38305

build(input_shape)

Defines a Lookup Table using a KeyValueTensorInitializer to map the keys to the IDs. Allows definition of two types of lookup tables based on whether the user specifies num_oov_buckets or the default_value
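
The two kinds of lookup described above can be approximated in plain Python. This is only a sketch of the idea: the function name build_lookup, the CRC32 hash for OOV strings, and placing OOV ids directly after the largest vocabulary id are assumptions, not the behaviour of the TensorFlow lookup tables.

```python
import zlib


def build_lookup(vocabulary_keys, vocabulary_ids, num_oov_buckets=None, default_value=None):
    """Return a function mapping a string to an id, using OOV buckets or a default value."""
    table = dict(zip(vocabulary_keys, vocabulary_ids))

    if num_oov_buckets:
        # Variant 1: unknown strings are hashed into ids placed after the vocabulary ids.
        first_oov_id = max(vocabulary_ids) + 1

        def lookup_with_buckets(key):
            if key in table:
                return table[key]
            return first_oov_id + zlib.crc32(key.encode("utf-8")) % num_oov_buckets

        return lookup_with_buckets

    # Variant 2: unknown strings all map to a single default value.
    def lookup_with_default(key):
        return table.get(key, default_value)

    return lookup_with_default


bucket_lookup = build_lookup(["nyc", "sf"], [0, 1], num_oov_buckets=3)
default_lookup = build_lookup(["nyc", "sf"], [0, 1], default_value=0)
```

Here bucket_lookup sends an unseen string to one of the ids 2, 3 or 4, while default_lookup sends every unseen string to 0.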

call(input_text)

Convert string tensors to numeric indices using lookup table

Parameters: input_text (Tensor object) – String categorical tensor
Returns: Numeric tensor object with corresponding lookup indices
Return type: Tensor object

get_config()

Get tensorflow configuration for the lookup table

Returns: Configuration dictionary for the lookup table layer
Return type: dict

ml4ir.base.features.feature_fns.categorical.get_vocabulary_info(feature_info, file_io)

Extract the vocabulary (encoding and values) from the stated vocabulary_file inside feature_info.

Parameters:
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing files
Returns:

  • vocabulary_keys (list) – values of the vocabulary stated in the vocabulary_file.
  • vocabulary_ids (list) – corresponding encoding ids (values of the vocabulary_keys).

Notes

Args under feature_layer_info:
vocabulary_file : str
path to vocabulary CSV file for the input tensor containing the vocabulary to look up. Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
max_length : int
max number of rows to consider from the vocabulary file. If null, considers the entire file vocabulary.
default_value : int
default value stated in the configuration, used to replace missing data points.

Sequence Feature Transformations

ml4ir.base.features.feature_fns.sequence.bytes_sequence_to_encoding_bilstm(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Encodes a string tensor into an encoding. Works by converting the string into a bytes sequence and then generating a categorical/char embedding for each of the 256 bytes. The char/byte embeddings are then combined using a biLSTM.

Parameters:
  • feature_tensor (Tensor object) – String feature tensor that is to be encoded
  • feature_info (dict) – Dictionary representing the feature_config for the input feature
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

Encoded feature tensor

Return type:

Tensor object

Notes

Args under feature_layer_info:
max_length : int
max length of bytes sequence
embedding_size : int
dimension size of the embedding; if null, then the tensor is just converted to its one-hot representation
encoding_size : int
dimension size of the sequence encoding computed using a biLSTM

The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.
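
The conversion from a string to a fixed-length bytes sequence can be sketched as below. UTF-8 encoding, truncation to max_length, and zero padding are assumptions made for this example; the point is that every element falls in the range 0 to 255, which is why the embedding input dimension is 256.

```python
def string_to_byte_ids(text, max_length, pad_value=0):
    """Encode text as byte values, truncated or right-padded to exactly max_length."""
    byte_ids = list(text.encode("utf-8"))[:max_length]
    return byte_ids + [pad_value] * (max_length - len(byte_ids))


short_sequence = string_to_byte_ids("abc", max_length=5)             # padded on the right
long_sequence = string_to_byte_ids("a longer query", max_length=5)   # truncated to 5 bytes
```

Each byte id would then index into an embedding table with 256 rows before the sequence is passed to the biLSTM.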

ml4ir.base.features.feature_fns.sequence.get_bilstm_encoding(embedding, lstm_units, kernel_initializer='glorot_uniform')

Convert sequence into encoding by passing through bidirectional LSTM

Parameters:
  • sequence_tensor (Tensor object) – Sequence tensor with representations for each time step
  • lstm_units (int) – Number of units in the LSTM
  • kernel_initializer (str) – Any supported tf.keras.initializers e.g., ‘ones’, ‘glorot_uniform’, ‘lecun_normal’ …
Returns:

Encoded feature tensor

Return type:

Tensor object

ml4ir.base.features.feature_fns.sequence.global_1d_pooling(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)

1D pooling to reduce a variable length sequence feature into a scalar value. This method optionally allows users to add multiple such pooling operations to produce a fixed dimensional feature vector as well.

Parameters:
  • feature_tensor (Tensor object) – String feature tensor that is to be aggregated/pooled. Dimensions: [batch_size, max_sequence_size, max_len]
  • feature_info (dict) – Dictionary representing the feature_config for the input feature
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

Global pooled/aggregated feature vector

Return type:

Tensor object

Notes

Args under feature_layer_info:
fns : list of str
List of pooling operations, given as strings, that should be applied. Each must be one of [“sum”, “mean”, “max”, “min”, “count_nonzero”]
padded_val : int/float
Value to be ignored from the pooling operations.
masked_max_val : int/float
Value used to mask the padded values for computing the max and min pooling operations. This allows us to ignore these values in the min and max pool operations. For example, if all the values in the tensor are in [0., 1.], then a masked_max_val of > 1. will make sure we do not pick padded values in the min/max pooling ops. Default value: 2
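
To make the roles of padded_val and masked_max_val concrete, here is a plain-Python sketch that pools a single padded sequence. It is not the tensor implementation: the function name is reused for readability only, and replacing padded entries with masked_max_val for min and with -masked_max_val for max is an assumption about how the masking could work.

```python
def global_1d_pooling(sequence, fns, padded_val=0.0, masked_max_val=2.0):
    """Apply each pooling op in fns to one sequence, ignoring entries equal to padded_val."""
    valid = [value for value in sequence if value != padded_val]
    pooled = []
    for fn in fns:
        if fn == "sum":
            pooled.append(sum(valid))
        elif fn == "mean":
            pooled.append(sum(valid) / len(valid) if valid else 0.0)
        elif fn == "count_nonzero":
            pooled.append(float(sum(1 for value in valid if value != 0)))
        elif fn == "min":
            # Padded slots become a large value so they can never be the minimum.
            masked = [masked_max_val if value == padded_val else value for value in sequence]
            pooled.append(min(masked))
        elif fn == "max":
            # Padded slots become a large negative value so they can never be the maximum.
            masked = [-masked_max_val if value == padded_val else value for value in sequence]
            pooled.append(max(masked))
        else:
            raise ValueError(f"Unsupported pooling fn: {fn}")
    return pooled


# Values lie in [0., 1.]; -1.0 marks padding, and masked_max_val=2.0 is safely above 1.
features = global_1d_pooling(
    [0.25, 0.75, -1.0, -1.0],
    fns=["sum", "mean", "max", "min", "count_nonzero"],
    padded_val=-1.0,
    masked_max_val=2.0,
)
```

Listing several entries in fns yields one value per operation, which is how a variable-length sequence becomes a fixed-dimensional feature vector.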

Tensorflow Native Operations

ml4ir.base.features.feature_fns.tf_native.tf_native_op(feature_tensor: tensorflow.python.framework.ops.Tensor, feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO)

Run a series of tensorflow native operations on the input feature tensor. The functions will be applied in the order they are specified.

Parameters:
  • feature_tensor (Tensor) – Input feature tensor
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

Modified feature tensor after applying all the specified ops

Return type:

Tensor object

Notes

Args under feature_layer_info:
ops: list of dict

List of function specifications with associated arguments

Arguments under ops:
fn : str
Tensorflow native function name. Should start with tf. Example: tf.math.log or tf.clip_by_value
args : dict
Keyword arguments to be passed to the tensorflow function
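
The ops specification and its in-order application can be sketched as follows. To keep the example runnable without TensorFlow, it uses a small stand-in namespace (fake_tf) that mimics tf.math.log and tf.clip_by_value on Python lists; fake_tf, resolve and apply_ops are hypothetical names for this illustration, not part of ml4ir.

```python
import math
from types import SimpleNamespace

# Stand-in for the tensorflow module, operating on plain lists (assumption for this sketch).
fake_tf = SimpleNamespace(
    math=SimpleNamespace(log=lambda x: [math.log(value) for value in x]),
    clip_by_value=lambda t, clip_value_min, clip_value_max: [
        min(max(value, clip_value_min), clip_value_max) for value in t
    ],
)


def resolve(fn_name, root):
    """Resolve a dotted name such as 'tf.math.log' against the given root namespace."""
    parts = fn_name.split(".")
    if parts[0] != "tf":
        raise ValueError(f"Function name should start with 'tf.': {fn_name}")
    target = root
    for part in parts[1:]:
        target = getattr(target, part)
    return target


def apply_ops(feature, ops, root):
    """Apply each op in the order listed, passing its args as keyword arguments."""
    for op in ops:
        fn = resolve(op["fn"], root)
        feature = fn(feature, **op.get("args", {}))
    return feature


ops = [
    {"fn": "tf.math.log", "args": {}},
    {"fn": "tf.clip_by_value", "args": {"clip_value_min": 0.0, "clip_value_max": 1.0}},
]
result = apply_ops([1.0, math.e, 100.0], ops, fake_tf)
```

Because the ops run in the listed order, swapping the two entries changes the result: clipping first and then taking the log produces different values than taking the log and then clipping.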