Feature Transformation¶

Categorical Feature Transformations¶

ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_hash_buckets(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶

Converts a string feature tensor into a categorical embedding. Works by first hashing the string into num_hash_buckets buckets, each of size hash_bucket_size, then converting each hash bucket into a categorical embedding of dimension embedding_size. Finally, these embeddings are combined through a mean, sum, or concat operation, as specified in the feature_info, to generate the final embedding.

Parameters:

feature_tensor (Tensor) – String feature tensor
feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
file_io (FileIO object) – FileIO handler object for reading and writing

Returns:	categorical embedding for the input feature_tensor
Return type:	Tensor object

Notes

Args under feature_layer_info:

num_hash_buckets : int: number of different hash buckets to convert the input string into
hash_bucket_size : int: the size of each hash bucket
embedding_size : int: dimension size of the categorical embedding
merge_mode : str: can be one of “mean”, “sum”, “concat” representing the mode of combining embeddings from each categorical embedding
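
The hashing and merging steps described above can be sketched in plain Python, so that it runs without TensorFlow. This is an illustration of the idea under stated assumptions, not the ml4ir implementation: the salted MD5 hash, the helper names hash_to_bucket and hashed_embedding, the random embedding tables, and the configuration values are all hypothetical.

```python
import hashlib
import random


def hash_to_bucket(value: str, salt: int, hash_bucket_size: int) -> int:
    """Deterministically map a string to a bucket id in [0, hash_bucket_size)."""
    digest = hashlib.md5(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


def hashed_embedding(value, tables, hash_bucket_size, merge_mode):
    """Look up one vector per hash table and merge them with merge_mode."""
    # Assumption: each table uses its position as the hash salt.
    vectors = [
        table[hash_to_bucket(value, salt, hash_bucket_size)]
        for salt, table in enumerate(tables)
    ]
    if merge_mode == "concat":
        return [x for vector in vectors for x in vector]
    summed = [sum(column) for column in zip(*vectors)]
    if merge_mode == "sum":
        return summed
    if merge_mode == "mean":
        return [x / len(vectors) for x in summed]
    raise ValueError(f"Unsupported merge_mode: {merge_mode}")


# Hypothetical configuration values, chosen only for illustration.
num_hash_buckets, hash_bucket_size, embedding_size = 4, 64, 8

# Random stand-ins for the trainable embedding tables (one per hash bucket).
rng = random.Random(0)
tables = [
    [[rng.uniform(-1, 1) for _ in range(embedding_size)]
     for _ in range(hash_bucket_size)]
    for _ in range(num_hash_buckets)
]

mean_vec = hashed_embedding("example.com", tables, hash_bucket_size, "mean")
concat_vec = hashed_embedding("example.com", tables, hash_bucket_size, "concat")
```

In this sketch, "mean" and "sum" keep the output at embedding_size dimensions, while "concat" grows it to num_hash_buckets * embedding_size.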

ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_indices(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶

Converts input integer tensor into categorical embedding. Works by converting the categorical indices in the input feature_tensor, represented as integer values, into categorical embeddings based on the feature_info.

Parameters:

feature_tensor (Tensor object) – int feature tensor
feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
file_io (FileIO object) – FileIO handler object for reading and writing

Returns:	categorical embedding for the input feature_tensor
Return type:	Tensor object

Notes

Args under feature_layer_info:

num_buckets : int: Maximum number of categorical values
default_value : int: default value to be assigned to indices out of the num_buckets range
embedding_size : int: dimension size of the categorical embedding

String based categorical features should already be converted into numeric indices
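
As a rough plain-Python illustration of the behaviour described above (not the ml4ir implementation), indices outside the num_buckets range are remapped to default_value before the embedding lookup. The helper name embed_indices, the random embedding table, and the configuration values are assumptions made for this sketch.

```python
import random


def embed_indices(indices, embedding_table, default_value):
    """Look up one embedding per index, remapping out-of-range indices."""
    num_buckets = len(embedding_table)
    safe = [i if 0 <= i < num_buckets else default_value for i in indices]
    return [embedding_table[i] for i in safe]


# Hypothetical configuration values, chosen only for illustration.
num_buckets, embedding_size, default_value = 5, 3, 0

# Random stand-in for a trainable embedding table.
rng = random.Random(0)
embedding_table = [
    [rng.random() for _ in range(embedding_size)] for _ in range(num_buckets)
]

# Index 9 is outside [0, num_buckets), so it falls back to default_value.
vectors = embed_indices([2, 4, 9], embedding_table, default_value)
```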

ml4ir.base.features.feature_fns.categorical.categorical_embedding_to_encoding_bilstm(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶

Encodes a string tensor into a categorical embedding. Works by converting the string into a word sequence and then generating a categorical/char embedding for each word based on the list of strings that form the vocabulary set of categorical values, defined by the argument vocabulary_file. The char/byte embeddings are then combined using a biLSTM.

Parameters:

feature_tensor (Tensor object) – String feature tensor that is to be encoded
feature_info (dict) – Dictionary representing the feature_config for the input feature
file_io (FileIO object) – FileIO handler object for reading and writing

Returns:	Encoded feature tensor
Return type:	Tensor object

Notes

Args under feature_layer_info:

vocabulary_file : string: path to vocabulary CSV file for the input tensor containing the vocabulary to look up. Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
max_length : int: max number of rows to consider from the vocabulary file. If null, considers the entire file vocabulary.
embedding_size : int: dimension size of the embedding. If null, the tensor is just converted to its one-hot representation.
encoding_size : int: dimension size of the sequence encoding computed using a biLSTM

The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.

ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_vocabulary_file(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶

Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info.

Parameters:

feature_tensor (Tensor object) – String feature tensor
feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
file_io (FileIO object) – FileIO handler object for reading and writing

Returns:	Categorical embedding representation of input feature_tensor
Return type:	Tensor object

Notes

Args under feature_layer_info:

vocabulary_file : string: path to vocabulary CSV file for the input tensor containing the vocabulary to look up. Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
max_length : int: max number of rows to consider from the vocabulary file. If null, considers the entire file vocabulary.
num_oov_buckets : int: number of out-of-vocabulary buckets/slots to be used to encode strings into categorical indices
embedding_size : int: dimension size of categorical embedding

The vocabulary CSV file must contain two columns, key and id, where each key is mapped to one id, which allows a many-to-one vocabulary mapping. If the id field is absent, a unique whole number id is assigned by default, resulting in a one-to-one mapping.
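
The vocabulary file format can be illustrated with a small plain-Python reader. This is a sketch under assumptions rather than ml4ir's own loader: the helper name load_vocabulary and the sample CSV contents are hypothetical, and default ids are assumed to start at 0.

```python
import csv
import io


def load_vocabulary(csv_text, max_length=None):
    """Return (keys, ids) parsed from vocabulary CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if max_length is not None:
        rows = rows[:max_length]  # keep only the first max_length rows
    if not rows:
        return [], []
    # Use the "key" column if present, otherwise fall back to the 1st column.
    key_column = "key" if "key" in rows[0] else next(iter(rows[0]))
    keys = [row[key_column] for row in rows]
    if "id" in rows[0]:
        ids = [int(row["id"]) for row in rows]  # explicit, possibly many-to-one
    else:
        ids = list(range(len(keys)))  # assumed default: 0, 1, 2, ... (one-to-one)
    return keys, ids


# Hypothetical vocabulary files, inlined as strings for illustration.
with_ids = "key,id\nnyc,0\nnew york,0\nboston,1\n"
without_ids = "key\nnyc\nboston\n"

keys, ids = load_vocabulary(with_ids)        # "nyc" and "new york" share id 0
default_keys, default_ids = load_vocabulary(without_ids)
```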

class ml4ir.base.features.feature_fns.categorical.CategoricalDropout(dropout_rate, seed=None, **kwargs)¶

Bases: tensorflow.python.keras.engine.base_layer.Layer

Custom Dropout class for categorical indices

Examples

>>> inputs = [[1, 2, 3], [4, 1, 2]]
>>> dropout_rate = 0.5

>>> # When training, one possible output: [[0, 0, 3], [0, 1, 2]]
>>> # When testing, output: [[1, 2, 3], [4, 1, 2]]

Notes

At training time, mask indices to 0 at dropout_rate

This works similarly to tf.keras.layers.Dropout, but without the scaling. Ref: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout

Parameters:

dropout_rate (float) – fraction of units to drop, i.e., set to OOV token 0
seed (int) – random seed for sampling to mask/drop categorical labels

Notes

We define the OOV index to be 0 for this function. When dropout is applied, it converts a fraction dropout_rate of the values to 0 (the OOV index). This allows us to train a good average embedding for the OOV token.
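
The masking behaviour described in the notes above can be sketched in plain Python. This is an illustration only, not the Keras layer itself: the function name categorical_dropout and the use of Python's random module are assumptions made for the sketch.

```python
import random


def categorical_dropout(indices, dropout_rate, training, seed=None):
    """Mask categorical indices to the OOV index 0 at rate dropout_rate."""
    if not training:
        # At inference time the indices pass through unchanged.
        return [list(row) for row in indices]
    rng = random.Random(seed)
    # Unlike regular dropout, kept values are NOT rescaled: they are ids.
    return [
        [0 if rng.random() < dropout_rate else value for value in row]
        for row in indices
    ]


inputs = [[1, 2, 3], [4, 1, 2]]
train_out = categorical_dropout(inputs, 0.5, training=True, seed=42)
test_out = categorical_dropout(inputs, 0.5, training=False)
```

Every value in train_out is either the original index or 0, which is why no rescaling is needed: the values are category ids, not activations.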

get_config()¶

Get config for the CategoricalDropout tensorflow layer

Returns:	Configuration dictionary for the tensorflow layer
Return type:	dict

call(inputs, training=None)¶

Run the CategoricalDropout layer by masking input labels to OOV index 0 at dropout_rate

Parameters:

inputs (Tensor object) – int categorical index tensor to be masked
training (bool) – If the layer should be run as training or not

Returns:	Masked tensor object with values set to 0 at probability of dropout_rate
Return type:	Tensor object

ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_vocabulary_file_and_dropout(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶

Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info. Also uses a dropout to convert categorical indices to the OOV index of 0 at a rate of dropout_rate

Parameters:

feature_tensor (Tensor object) – String feature tensor
feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
file_io (FileIO object) – FileIO handler object for reading and writing

Returns:	Categorical embedding representation of input feature_tensor
Return type:	Tensor object

Notes

Args under feature_layer_info:

vocabulary_file : str: path to vocabulary CSV file for the input tensor
dropout_rate : float: rate at which to convert categorical indices to OOV
embedding_size : int: dimension size of categorical embedding

The vocabulary CSV file must contain two columns, key and id, where each key is mapped to one id, which allows a many-to-one vocabulary mapping. If the id field is absent, a unique natural number id is assigned by default, resulting in a one-to-one mapping.

OOV index will be set to 0.

num_oov_buckets will be 0.

ml4ir.base.features.feature_fns.categorical.categorical_indicator_with_vocabulary_file(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶

Converts a string tensor into a categorical one-hot representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into one-hot representation.

Parameters:

feature_tensor (Tensor object) – String feature tensor
feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
file_io (FileIO object) – FileIO handler object for reading and writing

Returns:	Categorical one-hot representation of input feature_tensor
Return type:	Tensor object

Notes

Args under feature_layer_info:

vocabulary_file : string: path to vocabulary CSV file for the input tensor containing the vocabulary to look up. Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
max_length : int: max number of rows to consider from the vocabulary file. If null, considers the entire file vocabulary.
num_oov_buckets : int, optional: number of out of vocabulary buckets/slots to be used to encode strings into categorical indices. If not specified, the default is 1.

The vocabulary CSV file must contain two columns, key and id, where each key is mapped to one id, which allows a many-to-one vocabulary mapping. If the id field is absent, a unique whole number id is assigned by default, resulting in a one-to-one mapping.
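
A minimal plain-Python sketch of the one-hot step follows. It assumes that in-vocabulary ids come first and that the OOV slot follows them, so the one-hot depth is the vocabulary size plus num_oov_buckets. That layout, the helper names, and the sample vocabulary are assumptions for illustration, not a statement of what ml4ir does internally.

```python
def one_hot(index, depth):
    """Return a one-hot list of length depth with a 1.0 at position index."""
    return [1.0 if position == index else 0.0 for position in range(depth)]


# Hypothetical vocabulary and OOV setting.
vocabulary = ["red", "green", "blue"]
num_oov_buckets = 1  # the documented default when not specified
depth = len(vocabulary) + num_oov_buckets


def to_index(value):
    """Map a string to its vocabulary index, or to the single OOV slot."""
    # Assumption: the OOV slot sits right after the last vocabulary id.
    return vocabulary.index(value) if value in vocabulary else len(vocabulary)


encoded = [one_hot(to_index(value), depth) for value in ["green", "purple"]]
```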

ml4ir.base.features.feature_fns.categorical.categorical_indices_from_vocabulary_file(feature_info, feature_tensor, file_io)¶

Extracts the vocabulary (encoding and values) from the vocabulary_file stated inside feature_info and encodes the feature_tensor with that vocabulary.

Parameters:

feature_tensor (Tensor object) – String feature tensor
feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
file_io (FileIO object) – FileIO handler object for reading and writing

Returns:

categorical_indices (Tensor object) – encoding of the feature_tensor based on the vocabulary.
num_oov_buckets (int) – applied num_oov_buckets
vocabulary_keys (list) – values of the vocabulary stated in the vocabulary_file.

class ml4ir.base.features.feature_fns.categorical.VocabLookup(vocabulary_keys, vocabulary_ids, num_oov_buckets: int = None, default_value: int = None, feature_name='categorical_variable')¶

Bases: tensorflow.python.keras.engine.base_layer.Layer

The class defines a keras layer wrapper around a tf lookup table using the given vocabulary list. Maps each entry of a vocabulary list into categorical indices.

vocabulary_list¶

List of strings that form the vocabulary set of categorical values

Type:	list

num_oov_buckets¶

Number of buckets to be used for out of vocabulary strings

Type:	int

default_value¶

Default value to be used for OOV values

Type:	int

feature_name¶

Name of the input feature tensor

Type:	str

lookup_table¶

Tensorflow look up table that maps strings to integer indices

Type:	LookupTable object

Notes

Issue[1] with using LookupTable with keras symbolic tensors; expects eager tensors.

Ref: https://github.com/tensorflow/tensorflow/issues/38305

build(input_shape)¶

Defines a lookup table using a KeyValueTensorInitializer to map the keys to the IDs. Allows definition of two types of lookup tables, depending on whether the user specifies num_oov_buckets or default_value.

call(input_text)¶

Convert string tensors to numeric indices using lookup table

Parameters:	input_text (Tensor object) – String categorical tensor
Returns:	Numeric tensor object with corresponding lookup indices
Return type:	Tensor object

get_config()¶

Get tensorflow configuration for the lookup table

Returns:	Configuration dictionary for the lookup table layer
Return type:	dict
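
The two lookup behaviours that build distinguishes can be sketched in plain Python. This is a conceptual illustration, not the layer's code: the helper name build_lookup, the CRC32 hash, and the choice to place OOV buckets right after the largest vocabulary id are assumptions.

```python
import zlib


def build_lookup(vocabulary_keys, vocabulary_ids, num_oov_buckets=None,
                 default_value=None):
    """Return a function mapping a string key to an integer index."""
    table = dict(zip(vocabulary_keys, vocabulary_ids))

    if num_oov_buckets:
        # Mode 1: unknown keys are hashed into one of num_oov_buckets slots.
        # Assumption: OOV slots are placed after the largest vocabulary id.
        offset = max(vocabulary_ids) + 1

        def lookup(key):
            if key in table:
                return table[key]
            return offset + zlib.crc32(key.encode("utf-8")) % num_oov_buckets
    else:
        # Mode 2: unknown keys all map to a single default_value.
        def lookup(key):
            return table.get(key, default_value)

    return lookup


lookup_with_buckets = build_lookup(["a", "b"], [0, 1], num_oov_buckets=2)
lookup_with_default = build_lookup(["a", "b"], [0, 1], default_value=-1)
```

With OOV buckets, different unknown strings can land in different slots and so receive different downstream embeddings; with a default value, every unknown string collapses to the same index.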

ml4ir.base.features.feature_fns.categorical.get_vocabulary_info(feature_info, file_io)¶

Extract the vocabulary (encoding and values) from the stated vocabulary_file inside feature_info.

Parameters:

feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
file_io (FileIO object) – FileIO handler object for reading and writing files

Returns:

vocabulary_keys (list) – values of the vocabulary stated in the vocabulary_file.
vocabulary_ids (list) – corresponding encoding ids (values of the vocabulary_keys).

Notes

Args under feature_layer_info:

vocabulary_file : str: path to vocabulary CSV file for the input tensor containing the vocabulary to look up. Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
max_length : int: max number of rows to consider from the vocabulary file. If null, considers the entire file vocabulary.
default_value : int: default value stated in the configuration, used to replace missing data points.

Sequence Feature Transformations¶

ml4ir.base.features.feature_fns.sequence.bytes_sequence_to_encoding_bilstm(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶

Encode a string tensor into an encoding. Works by converting the string into a bytes sequence and then generating a categorical/char embedding for each of the 256 bytes. The char/byte embeddings are then combined using a biLSTM

Parameters:

feature_tensor (Tensor object) – String feature tensor that is to be encoded
feature_info (dict) – Dictionary representing the feature_config for the input feature
file_io (FileIO object) – FileIO handler object for reading and writing

Returns:	Encoded feature tensor
Return type:	Tensor object

Notes

Args under feature_layer_info:

max_length : int: max length of bytes sequence
embedding_size : int: dimension size of the embedding; if null, then the tensor is just converted to its one-hot representation
encoding_size : int: dimension size of the sequence encoding computed using a biLSTM

The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.
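
The preprocessing that precedes the biLSTM can be illustrated in plain Python. The sketch below only shows why the embedding input dimension is 256; the helper names, the UTF-8 encoding, and padding with 0 are assumptions, and the embedding and biLSTM steps are not reproduced here.

```python
def to_padded_bytes(text, max_length, pad_value=0):
    """Convert a string to a list of byte values of exactly max_length."""
    byte_values = list(text.encode("utf-8"))[:max_length]  # truncate
    # Assumption: short sequences are right-padded with pad_value.
    return byte_values + [pad_value] * (max_length - len(byte_values))


def one_hot_bytes(byte_values, depth=256):
    """One-hot encode each byte; a byte takes one of 256 possible values."""
    return [
        [1.0 if position == value else 0.0 for position in range(depth)]
        for value in byte_values
    ]


sequence = to_padded_bytes("ml4ir", max_length=8)
one_hot_sequence = one_hot_bytes(sequence)  # shape: [max_length, 256]
```

The one-hot form corresponds to the documented case where embedding_size is null; otherwise each byte value would index into a learned embedding table with 256 rows.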

ml4ir.base.features.feature_fns.sequence.get_bilstm_encoding(embedding, lstm_units, kernel_initializer='glorot_uniform')¶

Convert sequence into encoding by passing through bidirectional LSTM

Parameters:

embedding (Tensor object) – Sequence tensor with representations for each time step
lstm_units (int) – Number of units in the LSTM
kernel_initializer (str) – Any supported tf.keras.initializers e.g., ‘ones’, ‘glorot_uniform’, ‘lecun_normal’ …

Returns:	Encoded feature tensor
Return type:	Tensor object

ml4ir.base.features.feature_fns.sequence.global_1d_pooling(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶

1D pooling to reduce a variable length sequence feature into a scalar value. This method optionally allows users to add multiple such pooling operations to produce a fixed dimensional feature vector as well.

Parameters:

feature_tensor (Tensor object) – String feature tensor that is to be aggregated/pooled. Dimensions -> [batch_size, max_sequence_size, max_len]
feature_info (dict) – Dictionary representing the feature_config for the input feature
file_io (FileIO object) – FileIO handler object for reading and writing

Returns:	Global pooled/aggregated feature vector
Return type:	Tensor object

Notes

Args under feature_layer_info:

fns : list of str: List of string pooling operations that should be applied. Must be one of [“sum”, “mean”, “max”, “min”, “count_nonzero”]
padded_val : int/float: Value to be ignored from the pooling operations.
masked_max_val : int/float: Value used to mask the padded values for computing the max and min pooling operations. This allows us to ignore these values in the min and max pool operations. For example, if all the values in the tensor are in [0., 1.], then a masked_max_val of > 1. will make sure we do not pick padded values in the min/max pooling ops. Default value: 2
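
The role of padded_val and masked_max_val can be sketched for a single sequence in plain Python. This is one possible reading of the arguments above, not the ml4ir implementation: the function name pool_sequence, the padded_val of -1, and the use of -masked_max_val as the mask for max pooling are assumptions.

```python
def pool_sequence(sequence, fns, padded_val=-1, masked_max_val=2):
    """Pool one padded sequence into a vector with one value per fn."""
    valid = [v for v in sequence if v != padded_val]  # drop padded positions
    pooled = []
    for fn in fns:
        if fn == "sum":
            pooled.append(float(sum(valid)))
        elif fn == "mean":
            pooled.append(sum(valid) / len(valid) if valid else 0.0)
        elif fn == "max":
            # Padded positions become very small so they never win the max.
            masked = [-masked_max_val if v == padded_val else v for v in sequence]
            pooled.append(float(max(masked)))
        elif fn == "min":
            # Padded positions become very large so they never win the min.
            masked = [masked_max_val if v == padded_val else v for v in sequence]
            pooled.append(float(min(masked)))
        elif fn == "count_nonzero":
            pooled.append(float(sum(1 for v in valid if v != 0)))
        else:
            raise ValueError(f"Unsupported pooling fn: {fn}")
    return pooled


# Values lie in [0., 1.], so masked_max_val=2 is safely outside that range.
sequence = [0.2, 0.8, 0.0, -1, -1]  # the last two entries are padding
features = pool_sequence(sequence, ["sum", "mean", "max", "min", "count_nonzero"])
```

Listing several fns yields a fixed-length vector (here five values) regardless of the sequence length, which matches the "fixed dimensional feature vector" described above.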

Tensorflow Native Operations¶

ml4ir.base.features.feature_fns.tf_native.tf_native_op(feature_tensor: tensorflow.python.framework.ops.Tensor, feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO)¶

Run a series of tensorflow native operations on the input feature tensor. The functions will be applied in the order they are specified.

Parameters:

feature_tensor (Tensor) – Input feature tensor
feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
file_io (FileIO object) – FileIO handler object for reading and writing

Returns:	Modified feature tensor after applying all the specified ops
Return type:	Tensor object

Notes

Args under feature_layer_info:

ops: list of dict

List of function specifications with associated arguments

Arguments under ops:

fn : str: Tensorflow native function name. Should start with “tf.”, for example tf.math.log or tf.clip_by_value
args : dict: Keyword arguments to be passed to the tensorflow function
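
The way an ops list is applied in order can be sketched as follows. To keep the example runnable without TensorFlow, it resolves names against a tiny stand-in namespace (fake_tf) that mimics only tf.math.log and tf.clip_by_value on Python lists; the stand-in, the helper names resolve and apply_ops, and the sample values are all assumptions for illustration and not ml4ir's code.

```python
import math
from functools import reduce
from types import SimpleNamespace

# Stand-in for a tiny slice of the TensorFlow namespace. NOT the real tf module.
fake_tf = SimpleNamespace(
    math=SimpleNamespace(log=lambda x: [math.log(v) for v in x]),
    clip_by_value=lambda t, clip_value_min, clip_value_max: [
        min(max(v, clip_value_min), clip_value_max) for v in t
    ],
)


def resolve(fn_name, root):
    """Resolve a dotted name such as "tf.math.log" against root."""
    prefix, *attributes = fn_name.split(".")
    if prefix != "tf":
        raise ValueError(f"Function name should start with 'tf.': {fn_name}")
    return reduce(getattr, attributes, root)


def apply_ops(feature, ops, root=fake_tf):
    """Apply each op in the order given, passing args as keyword arguments."""
    for op in ops:
        feature = resolve(op["fn"], root)(feature, **op.get("args", {}))
    return feature


ops = [
    {"fn": "tf.clip_by_value",
     "args": {"clip_value_min": 1.0, "clip_value_max": 100.0}},
    {"fn": "tf.math.log", "args": {}},
]
result = apply_ops([0.5, math.e, 1000.0], ops)
```

Because the ops run in the order listed, clipping first guarantees that the log never receives a value below 1.0 in this sketch; reversing the list would change the result.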