Feature Transformation¶

Categorical Feature Transformations¶

ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_hash_buckets(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶

Converts a string feature tensor into a categorical embedding. Works by first converting the string into num_hash_buckets buckets, each of size hash_bucket_size, then converting each hash bucket into a categorical embedding of dimension embedding_size. Finally, these embeddings are combined through a mean, sum or concat operation, as specified in the feature_info, to generate the final embedding.

Parameters:

feature_tensor (Tensor) – String feature tensor
feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
file_io (FileIO object) – FileIO handler object for reading and writing
Returns:	categorical embedding for the input feature_tensor
Return type:	Tensor object

Notes

Args under feature_layer_info:

num_hash_buckets : int: number of different hash buckets to convert the input string into
hash_bucket_size : int: the size of each hash bucket
embedding_size : int: dimension size of the categorical embedding
merge_mode : str: can be one of “mean”, “sum”, “concat” representing the mode of combining embeddings from each categorical embedding
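
The hash-and-merge flow described for this function can be sketched in plain Python. This is a conceptual illustration only, not ml4ir's implementation: the salted MD5 hashing, the use of one randomly initialised embedding table per hash function, and the helper names hash_to_buckets and merge_embeddings are all assumptions made for the example.

```python
import hashlib
import random


def hash_to_buckets(value, num_hash_buckets, hash_bucket_size):
    """Map a string to one bucket id per hash function (salted MD5, an assumption)."""
    ids = []
    for salt in range(num_hash_buckets):
        digest = hashlib.md5(f"{salt}:{value}".encode("utf-8")).hexdigest()
        ids.append(int(digest, 16) % hash_bucket_size)
    return ids


def merge_embeddings(vectors, merge_mode):
    """Combine the per-bucket embeddings with mean, sum or concat."""
    if merge_mode == "concat":
        return [x for vec in vectors for x in vec]
    summed = [sum(column) for column in zip(*vectors)]
    if merge_mode == "sum":
        return summed
    if merge_mode == "mean":
        return [x / len(vectors) for x in summed]
    raise ValueError(f"Unsupported merge_mode: {merge_mode}")


# Hypothetical settings mirroring the args under feature_layer_info.
num_hash_buckets, hash_bucket_size, embedding_size = 3, 16, 4

# One embedding table per hash function; random here, learned during training in practice.
rng = random.Random(0)
tables = [
    [[rng.uniform(-1, 1) for _ in range(embedding_size)] for _ in range(hash_bucket_size)]
    for _ in range(num_hash_buckets)
]

bucket_ids = hash_to_buckets("example query", num_hash_buckets, hash_bucket_size)
vectors = [tables[i][bucket_id] for i, bucket_id in enumerate(bucket_ids)]

mean_embedding = merge_embeddings(vectors, "mean")      # length: embedding_size
concat_embedding = merge_embeddings(vectors, "concat")  # length: num_hash_buckets * embedding_size
```

Note that mean and sum keep the output at embedding_size dimensions, while concat grows it with the number of hash buckets.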

ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_indices(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶

Converts input integer tensor into categorical embedding. Works by converting the categorical indices in the input feature_tensor, represented as integer values, into categorical embeddings based on the feature_info.

Parameters:

feature_tensor (Tensor object) – int feature tensor
feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
file_io (FileIO object) – FileIO handler object for reading and writing
Returns:	categorical embedding for the input feature_tensor
Return type:	Tensor object

Notes

Args under feature_layer_info:

num_buckets : int: Maximum number of categorical values
default_value : int: default value to be assigned to indices out of the num_buckets range
embedding_size : int: dimension size of the categorical embedding

String based categorical features should already be converted into numeric indices
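
As a rough illustration of how num_buckets and default_value interact, the sketch below looks up an embedding row by index and falls back to default_value for indices outside the valid range. It is plain Python with a random table, not the ml4ir code; the helper name lookup_embedding is hypothetical.

```python
import random


def lookup_embedding(index, table, default_value):
    """Return the embedding row for index, using default_value when index is out of range."""
    num_buckets = len(table)
    if not 0 <= index < num_buckets:
        index = default_value
    return table[index]


num_buckets, embedding_size, default_value = 5, 3, 0

# Random embedding table standing in for learned weights.
rng = random.Random(42)
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(embedding_size)] for _ in range(num_buckets)
]

in_range = lookup_embedding(2, embedding_table, default_value)       # row 2
out_of_range = lookup_embedding(17, embedding_table, default_value)  # falls back to row 0
```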

ml4ir.base.features.feature_fns.categorical.categorical_embedding_to_encoding_bilstm(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶

Encodes a string tensor into a categorical embedding. Works by converting the string into a word sequence and then generating a categorical/char embedding for each word based on the list of strings that form the vocabulary set of categorical values, defined by the argument vocabulary_file. The char/byte embeddings are then combined using a biLSTM.

Parameters:

feature_tensor (Tensor object) – String feature tensor that is to be encoded
feature_info (dict) – Dictionary representing the feature_config for the input feature
file_io (FileIO object) – FileIO handler object for reading and writing
Returns:	Encoded feature tensor
Return type:	Tensor object

Notes

Args under feature_layer_info:

vocabulary_file : string

path to vocabulary CSV file for the input tensor containing the vocabulary to look-up. Uses the “key” named column as vocabulary, or the 1st column if no “key” column is present.

max_length : int

max number of rows to consider from the vocabulary file. If null, considers the entire file vocabulary.

embedding_size : int

dimension size of the embedding; if null, then the tensor is just converted to its one-hot representation

encoding_size : int

dimension size of the sequence encoding computed using a biLSTM

The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.

ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_vocabulary_file(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶

Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info.

Parameters:

feature_tensor (Tensor object) – String feature tensor
feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
file_io (FileIO object) – FileIO handler object for reading and writing
Returns:	Categorical embedding representation of input feature_tensor
Return type:	Tensor object

Notes

Args under feature_layer_info:

vocabulary_file : string

path to vocabulary CSV file for the input tensor containing the vocabulary to look-up. Uses the “key” named column as vocabulary, or the 1st column if no “key” column is present.

max_length : int

max number of rows to consider from the vocabulary file. If null, considers the entire file vocabulary.

num_oov_buckets : int

number of out of vocabulary buckets/slots to be used to encode strings into categorical indices

embedding_size : int

dimension size of categorical embedding

The vocabulary CSV file must contain two columns, key and id, where each key is mapped to one id, thereby resulting in a many-to-one vocabulary mapping. If the id field is absent, a unique whole number id is assigned by default, resulting in a one-to-one mapping.

class ml4ir.base.features.feature_fns.categorical.CategoricalDropout(dropout_rate, seed=None, **kwargs)¶

Bases: tensorflow.python.keras.engine.base_layer.Layer

Custom Dropout class for categorical indices

Examples

>>> inputs: [[1, 2, 3], [4, 1, 2]]
>>> dropout_rate = 0.5

>>> When training, output: [[0, 0, 3], [0, 1, 2]]
>>> When testing, output: [[1, 2, 3], [4, 1, 2]]

Notes

At training time, indices are masked to 0 at a rate of dropout_rate.

This works similarly to tf.keras.layers.Dropout, but without the scaling. Ref: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout

Parameters:

dropout_rate (float) – fraction of units to drop, i.e., set to OOV token 0
seed (int) – random seed for sampling to mask/drop categorical labels

Notes

We define the OOV index to be 0 for this function. When dropout is applied, it converts a fraction dropout_rate of the values to 0 (the OOV index). This allows us to train a good average embedding for the OOV token.

get_config()¶

Get config for the CategoricalDropout tensorflow layer

Returns:	Configuration dictionary for the tensorflow layer
Return type:	dict

call(inputs, training=None)¶

Run the CategoricalDropout layer by masking input labels to OOV index 0 at dropout_rate

Parameters:

inputs (Tensor object) – int categorical index tensor to be masked
training (bool) – If the layer should be run as training or not
Returns:	Masked tensor object with values set to 0 at probability of dropout_rate
Return type:	Tensor object
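
The masking behaviour of CategoricalDropout can be sketched in plain Python as follows. This is a minimal illustration of the idea, assuming each index is masked independently; it does not reproduce the Keras layer or its random number generation, and the function name categorical_dropout is hypothetical.

```python
import random


def categorical_dropout(indices, dropout_rate, training, seed=None):
    """Replace each index with the OOV index 0 with probability dropout_rate when training."""
    if not training:
        # At inference time the indices pass through unchanged.
        return [list(row) for row in indices]
    rng = random.Random(seed)
    return [[0 if rng.random() < dropout_rate else idx for idx in row] for row in indices]


inputs = [[1, 2, 3], [4, 1, 2]]
train_output = categorical_dropout(inputs, dropout_rate=0.5, training=True, seed=7)
test_output = categorical_dropout(inputs, dropout_rate=0.5, training=False)
```

Unlike standard dropout, the surviving values are not rescaled, since they are category ids rather than activations.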

ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_vocabulary_file_and_dropout(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶

Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info. Also applies dropout, converting categorical indices to the OOV index of 0 at a rate of dropout_rate.

Parameters:

feature_tensor (Tensor object) – String feature tensor
feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
file_io (FileIO object) – FileIO handler object for reading and writing
Returns:	Categorical embedding representation of input feature_tensor
Return type:	Tensor object

Notes

Args under feature_layer_info:

vocabulary_file : str: path to vocabulary CSV file for the input tensor
dropout_rate : float: rate at which to convert categorical indices to OOV
embedding_size : int: dimension size of categorical embedding

The vocabulary CSV file must contain two columns, key and id, where each key is mapped to one id, thereby resulting in a many-to-one vocabulary mapping. If the id field is absent, a unique natural number id is assigned by default, resulting in a one-to-one mapping.

The OOV index will be set to 0, and num_oov_buckets will be 0.

ml4ir.base.features.feature_fns.categorical.categorical_indicator_with_vocabulary_file(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶

Converts a string tensor into a categorical one-hot representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into one-hot representation.

Parameters:

feature_tensor (Tensor object) – String feature tensor
feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
file_io (FileIO object) – FileIO handler object for reading and writing
Returns:	Categorical one-hot representation of input feature_tensor
Return type:	Tensor object

Notes

Args under feature_layer_info:

vocabulary_file : string: path to vocabulary CSV file for the input tensor containing the vocabulary to look-up. Uses the “key” named column as vocabulary, or the 1st column if no “key” column is present.
max_length : int: max number of rows to consider from the vocabulary file. if null, considers the entire file vocabulary.
num_oov_buckets : int, optional: number of out of vocabulary buckets/slots to be used to encode strings into categorical indices. If not specified, the default is 1.

The vocabulary CSV file must contain two columns, key and id, where each key is mapped to one id, thereby resulting in a many-to-one vocabulary mapping. If the id field is absent, a unique whole number id is assigned by default, resulting in a one-to-one mapping.
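
A small sketch of the one-hot (indicator) representation, using the default of one OOV bucket. Placing the OOV slot after the vocabulary entries is an assumption made for this example, as are the helper names one_hot and indicator; the actual index layout in ml4ir may differ.

```python
def one_hot(index, depth):
    """Return a one-hot vector of length depth with a 1.0 at position index."""
    vector = [0.0] * depth
    vector[index] = 1.0
    return vector


vocabulary_keys = ["red", "green", "blue"]
num_oov_buckets = 1  # default when not specified
depth = len(vocabulary_keys) + num_oov_buckets


def indicator(token):
    """Map a string to its one-hot vector; unknown strings share the single OOV slot."""
    if token in vocabulary_keys:
        index = vocabulary_keys.index(token)
    else:
        index = len(vocabulary_keys)  # assumed position of the OOV slot
    return one_hot(index, depth)


green_vector = indicator("green")
unknown_vector = indicator("purple")
```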

ml4ir.base.features.feature_fns.categorical.categorical_indices_from_vocabulary_file(feature_info, feature_tensor, file_io)¶

Extracts the vocabulary (encoding and values) from the vocabulary_file stated inside feature_info, and encodes the feature_tensor with that vocabulary.

Parameters:

feature_tensor (Tensor object) – String feature tensor
feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
file_io (FileIO object) – FileIO handler object for reading and writing

Returns:

categorical_indices (Tensor object) – encoding of the feature_tensor based on the vocabulary.
num_oov_buckets (int) – applied num_oov_buckets
vocabulary_keys (list) – values of the vocabulary stated in the vocabulary_file.

class ml4ir.base.features.feature_fns.categorical.VocabLookup(vocabulary_keys, vocabulary_ids, num_oov_buckets: int = None, default_value: int = None, feature_name='categorical_variable')¶

Bases: tensorflow.python.keras.engine.base_layer.Layer

The class defines a keras layer wrapper around a tf lookup table using the given vocabulary list. Maps each entry of a vocabulary list into categorical indices.

vocabulary_list¶

List of strings that form the vocabulary set of categorical values

Type:	list

num_oov_buckets¶

Number of buckets to be used for out of vocabulary strings

Type:	int

default_value¶

Default value to be used for OOV values

Type:	int

feature_name¶

Name of the input feature tensor

Type:	str

lookup_table¶

Tensorflow look up table that maps strings to integer indices

Type:	LookupTable object

Notes

There is a known issue with using LookupTable with Keras symbolic tensors: it expects eager tensors.

Ref: https://github.com/tensorflow/tensorflow/issues/38305

build(input_shape)¶

Defines a lookup table using a KeyValueTensorInitializer to map the keys to the IDs. Allows definition of two types of lookup tables based on whether the user specifies num_oov_buckets or the default_value.

call(input_text)¶

Convert string tensors to numeric indices using lookup table

Parameters:	input_text (Tensor object) – String categorical tensor
Returns:	Numeric tensor object with corresponding lookup indices
Return type:	Tensor object
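
The two lookup modes described for this layer (OOV buckets versus a single default value) can be illustrated with a plain-Python stand-in. This is not the VocabLookup class and uses no TensorFlow lookup tables; the class name SimpleVocabLookup, the CRC32 hashing, and the placement of OOV slots after the largest known id are assumptions made for the sketch.

```python
import zlib


class SimpleVocabLookup:
    """Plain-Python stand-in for a vocabulary lookup layer (illustrative only)."""

    def __init__(self, vocabulary_keys, vocabulary_ids, num_oov_buckets=None, default_value=None):
        self.table = dict(zip(vocabulary_keys, vocabulary_ids))
        self.num_oov_buckets = num_oov_buckets
        self.default_value = default_value

    def __call__(self, tokens):
        return [self._lookup(token) for token in tokens]

    def _lookup(self, token):
        if token in self.table:
            return self.table[token]
        if self.num_oov_buckets:
            # Mode 1: hash unknown strings into slots after the largest known id.
            offset = max(self.table.values()) + 1
            return offset + zlib.crc32(token.encode("utf-8")) % self.num_oov_buckets
        # Mode 2: every unknown string maps to the same default value.
        return self.default_value


keys, ids = ["nyc", "new york", "boston"], [0, 0, 1]
with_buckets = SimpleVocabLookup(keys, ids, num_oov_buckets=2)
with_default = SimpleVocabLookup(keys, ids, default_value=-1)

bucket_indices = with_buckets(["nyc", "atlantis"])
default_indices = with_default(["new york", "atlantis"])
```

With OOV buckets, distinct unknown strings can land in different slots and so receive different downstream embeddings; with a default value they all collapse to one index.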

get_config()¶

Get tensorflow configuration for the lookup table

Returns:	Configuration dictionary for the lookup table layer
Return type:	dict

ml4ir.base.features.feature_fns.categorical.get_vocabulary_info(feature_info, file_io)¶

Extract the vocabulary (encoding and values) from the stated vocabulary_file inside feature_info.

Parameters:

feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
file_io (FileIO object) – FileIO handler object for reading and writing files

Returns:

vocabulary_keys (list) – values of the vocabulary stated in the vocabulary_file.
vocabulary_ids (list) – corresponding encoding ids (values of the vocabulary_keys).

Notes

Args under feature_layer_info

vocabulary_file : str: path to vocabulary CSV file for the input tensor containing the vocabulary to look-up. Uses the “key” named column as vocabulary, or the 1st column if no “key” column is present.
max_length : int: max number of rows to consider from the vocabulary file. If null, considers the entire file vocabulary.
default_value : int: default value stated in the configuration, used to replace missing data points.
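
The column-selection and max_length rules above can be sketched with the standard csv module. This is an approximation for illustration, not the ml4ir function: it assumes the CSV has a header row, reads from an in-memory string instead of a FileIO handler, and assigns ids starting at 0 when no id column is present. The helper name read_vocabulary is hypothetical.

```python
import csv
import io


def read_vocabulary(csv_text, max_length=None):
    """Parse vocabulary CSV text into (keys, ids) lists."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    if max_length is not None:
        # Only consider the first max_length rows of the vocabulary.
        body = body[:max_length]
    # Use the "key" column when present, otherwise fall back to the 1st column.
    key_col = header.index("key") if "key" in header else 0
    keys = [row[key_col] for row in body]
    if "id" in header:
        id_col = header.index("id")
        ids = [int(row[id_col]) for row in body]
    else:
        # No id column: assign a unique id per key (one-to-one mapping).
        ids = list(range(len(keys)))
    return keys, ids


keys, ids = read_vocabulary("key,id\nnyc,0\nnew york,0\nboston,1\n", max_length=2)
auto_keys, auto_ids = read_vocabulary("key\nred\ngreen\n")
```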

Sequence Feature Transformations¶

ml4ir.base.features.feature_fns.sequence.bytes_sequence_to_encoding_bilstm(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶

Encodes a string tensor into an encoding. Works by converting the string into a bytes sequence and then generating a categorical/char embedding for each of the 256 bytes. The char/byte embeddings are then combined using a biLSTM.

Parameters:

feature_tensor (Tensor object) – String feature tensor that is to be encoded
feature_info (dict) – Dictionary representing the feature_config for the input feature
file_io (FileIO object) – FileIO handler object for reading and writing
Returns:	Encoded feature tensor
Return type:	Tensor object

Notes

Args under feature_layer_info:

max_length : int: max length of bytes sequence
embedding_size : int: dimension size of the embedding; if null, then the tensor is just converted to its one-hot representation
encoding_size : int: dimension size of the sequence encoding computed using a biLSTM

The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.
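
To make the fixed input dimension of 256 concrete, the sketch below converts a string into a fixed-length sequence of byte values and one-hot encodes each byte. Truncating to max_length and padding with 0 are assumptions made for this example, and the biLSTM step is omitted; the helper names are hypothetical.

```python
def to_padded_bytes(text, max_length, pad_value=0):
    """Convert a string to a fixed-length list of byte values in [0, 255]."""
    byte_values = list(text.encode("utf-8"))[:max_length]
    return byte_values + [pad_value] * (max_length - len(byte_values))


def one_hot_bytes(byte_values):
    """One-hot encode each byte over the fixed input dimension of 256."""
    return [[1.0 if i == b else 0.0 for i in range(256)] for b in byte_values]


# "é" occupies two bytes in UTF-8, so the 5-character string yields 6 bytes.
byte_ids = to_padded_bytes("héllo", max_length=8)
encoded = one_hot_bytes(byte_ids)  # shape: [max_length, 256]
```

Because every byte value falls in the range 0 to 255, no vocabulary file is needed for this transformation.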

ml4ir.base.features.feature_fns.sequence.get_bilstm_encoding(embedding, lstm_units, kernel_initializer='glorot_uniform')¶

Convert sequence into encoding by passing through bidirectional LSTM

Parameters:

embedding (Tensor object) – Sequence tensor with representations for each time step
lstm_units (int) – Number of units in the LSTM
kernel_initializer (str) – Any supported tf.keras.initializers, e.g., ‘ones’, ‘glorot_uniform’, ‘lecun_normal’ …
Returns:	Encoded feature tensor
Return type:	Tensor object