Feature Transformation

Categorical Feature Transformations

ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_hash_buckets(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Converts a string feature tensor into a categorical embedding. Works by first hashing the string into num_hash_buckets hash buckets, each of size hash_bucket_size, then converting each hash bucket into a categorical embedding of dimension embedding_size. Finally, these embeddings are combined through a mean, sum or concat operation, as configured in the feature_info, to generate the final embedding.

Parameters:
  • feature_tensor (Tensor) – String feature tensor
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

categorical embedding for the input feature_tensor

Return type:

Tensor object

Notes

Args under feature_layer_info:
num_hash_buckets : int
number of different hash buckets to convert the input string into
hash_bucket_size : int
the size of each hash bucket
embedding_size : int
dimension size of the categorical embedding
merge_mode : str
can be one of “mean”, “sum”, “concat”; the mode used to combine the embeddings obtained from each hash bucket
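The hash-then-merge steps described above can be sketched in plain Python. This is an illustration only, not ml4ir's TensorFlow implementation: the helper names hash_to_buckets and merge_embeddings, the salted MD5 hashing, the use of one embedding table per hash function, and the random toy weights are all assumptions made for this example.

```python
import hashlib
import random


def hash_to_buckets(value, num_hash_buckets, hash_bucket_size):
    """Map a string to one bucket id per hash function (hypothetical helper)."""
    ids = []
    for i in range(num_hash_buckets):
        # Salt the input with the hash index so that each hash function differs
        digest = hashlib.md5(f"{i}:{value}".encode("utf-8")).hexdigest()
        ids.append(int(digest, 16) % hash_bucket_size)
    return ids


def merge_embeddings(vectors, merge_mode):
    """Combine the per-hash embeddings with mean, sum or concat."""
    if merge_mode == "concat":
        return [x for vec in vectors for x in vec]
    summed = [sum(dim) for dim in zip(*vectors)]
    if merge_mode == "sum":
        return summed
    if merge_mode == "mean":
        return [x / len(vectors) for x in summed]
    raise ValueError(f"Unsupported merge_mode: {merge_mode}")


# Toy configuration mirroring the args documented above
num_hash_buckets, hash_bucket_size, embedding_size = 4, 64, 8
rng = random.Random(0)
# Assumption: one embedding table per hash function; random values stand in
# for trained weights
tables = [
    [[rng.uniform(-1, 1) for _ in range(embedding_size)] for _ in range(hash_bucket_size)]
    for _ in range(num_hash_buckets)
]

bucket_ids = hash_to_buckets("running shoes", num_hash_buckets, hash_bucket_size)
vectors = [tables[i][b] for i, b in enumerate(bucket_ids)]
mean_embedding = merge_embeddings(vectors, "mean")
concat_embedding = merge_embeddings(vectors, "concat")
```

In this sketch, “mean” and “sum” keep the output at embedding_size dimensions, while “concat” produces num_hash_buckets * embedding_size dimensions.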
ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_indices(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Converts input integer tensor into categorical embedding. Works by converting the categorical indices in the input feature_tensor, represented as integer values, into categorical embeddings based on the feature_info.

Parameters:
  • feature_tensor (Tensor object) – int feature tensor
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

categorical embedding for the input feature_tensor

Return type:

Tensor object

Notes

Args under feature_layer_info:
num_buckets : int
Maximum number of categorical values
default_value : int
default value to be assigned to indices out of the num_buckets range
embedding_size : int
dimension size of the categorical embedding

String-based categorical features should already be converted into numeric indices.
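As a rough illustration of how num_buckets, default_value and embedding_size interact, the following plain-Python sketch replaces out-of-range indices with default_value before looking up a row in an embedding table. The helper name lookup_index_embedding and the random table are hypothetical; the library itself operates on tensors.

```python
import random


def lookup_index_embedding(indices, table, default_value):
    """Replace indices outside [0, num_buckets) with default_value, then look up rows."""
    num_buckets = len(table)
    safe = [i if 0 <= i < num_buckets else default_value for i in indices]
    return [table[i] for i in safe]


num_buckets, embedding_size, default_value = 5, 3, 0
rng = random.Random(42)
# Random values stand in for trained embedding weights
table = [[rng.random() for _ in range(embedding_size)] for _ in range(num_buckets)]

# Index 9 is out of range, so it falls back to the row for default_value
embeddings = lookup_index_embedding([1, 4, 9], table, default_value)
```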

ml4ir.base.features.feature_fns.categorical.categorical_embedding_to_encoding_bilstm(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Encodes a string tensor into a categorical embedding. Works by converting the string into a word sequence and then generating a categorical/char embedding for each word, based on the list of strings that forms the vocabulary set of categorical values, defined by the argument vocabulary_file. The resulting embeddings are then combined using a biLSTM.

Parameters:
  • feature_tensor (Tensor object) – String feature tensor that is to be encoded
  • feature_info (dict) – Dictionary representing the feature_config for the input feature
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

Encoded feature tensor

Return type:

Tensor object

Notes

Args under feature_layer_info:
vocabulary_file : string
path to vocabulary CSV file for the input tensor containing the vocabulary to look-up. Uses the “key” named column as vocabulary, or the 1st column if no “key” column is present.
max_length : int
max number of rows to consider from the vocabulary file.
if null, considers the entire file vocabulary.
embedding_size : int
dimension size of the embedding;
if null, then the tensor is just converted to its one-hot representation
encoding_size : int
dimension size of the sequence encoding computed using a biLSTM

The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.

ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_vocabulary_file(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info.

Parameters:
  • feature_tensor (Tensor object) – String feature tensor
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

Categorical embedding representation of input feature_tensor

Return type:

Tensor object

Notes

Args under feature_layer_info:
vocabulary_file : string
path to vocabulary CSV file for the input tensor containing the vocabulary to look-up.
Uses the “key” named column as vocabulary, or the 1st column if no “key” column is present.
max_length : int
max number of rows to consider from the vocabulary file.
if null, considers the entire file vocabulary.
num_oov_buckets : int
number of out of vocabulary buckets/slots to be used to
encode strings into categorical indices
embedding_size : int
dimension size of categorical embedding

The vocabulary CSV file is expected to contain two columns, key and id, where each key is mapped to one id. Several keys may share an id, which results in a many-to-one vocabulary mapping. If the id field is absent, a unique whole number id is assigned to each key by default, resulting in a one-to-one mapping.
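The vocabulary file conventions above can be illustrated with a small standard-library sketch. The helper read_vocabulary is hypothetical and is not the library's reader; in particular, numbering the default ids from 0 is an assumption.

```python
import csv
import io


def read_vocabulary(csv_text, max_length=None):
    """Return (keys, ids) parsed from vocabulary CSV text (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if max_length is not None:
        # Only consider the first max_length rows
        rows = rows[:max_length]
    if not rows:
        return [], []
    # Use the "key" column when present, otherwise fall back to the 1st column
    key_column = "key" if "key" in rows[0] else next(iter(rows[0]))
    keys = [row[key_column] for row in rows]
    if "id" in rows[0]:
        ids = [int(row["id"]) for row in rows]
    else:
        # Assumption: default ids are consecutive whole numbers starting at 0
        ids = list(range(len(keys)))
    return keys, ids


# Many-to-one mapping: two keys share id 1
keys, ids = read_vocabulary("key,id\nnyc,1\nnew york,1\nboston,2\n")
# One-to-one mapping: no id column, so ids are assigned by default
default_keys, default_ids = read_vocabulary("key\nnyc\nboston\n")
```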

class ml4ir.base.features.feature_fns.categorical.CategoricalDropout(dropout_rate, seed=None, **kwargs)

Bases: tensorflow.python.keras.engine.base_layer.Layer

Custom Dropout class for categorical indices

Examples

>>> inputs: [[1, 2, 3], [4, 1, 2]]
>>> dropout_rate = 0.5
>>> When training, output: [[0, 0, 3], [0, 1, 2]]
>>> When testing, output: [[1, 2, 3], [4, 1, 2]]

Notes

At training time, indices are masked to 0 at a rate of dropout_rate.

This works similarly to tf.keras.layers.Dropout, but without the scaling. Ref: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout

Parameters:
  • dropout_rate (float) – fraction of units to drop, i.e., set to OOV token 0
  • seed (int) – random seed for sampling to mask/drop categorical labels

Notes

The OOV index is defined to be 0 for this layer. When dropout is applied, a fraction dropout_rate of the values is converted to 0 (the OOV index). This allows the model to train a good average embedding for the OOV token.
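The masking behaviour can be sketched in plain Python as follows. This is a simplified stand-in for the Keras layer, with a hypothetical function name, and it uses Python's random module rather than TensorFlow's random ops, so the exact positions that get masked will differ from the layer's output.

```python
import random


def categorical_dropout(indices, dropout_rate, training, seed=None):
    """Mask each index to the OOV index 0 with probability dropout_rate."""
    if not training:
        # At inference time the inputs pass through unchanged
        return [list(row) for row in indices]
    rng = random.Random(seed)
    # Unlike standard dropout, the surviving values are not rescaled
    return [[0 if rng.random() < dropout_rate else i for i in row] for row in indices]


inputs = [[1, 2, 3], [4, 1, 2]]
train_out = categorical_dropout(inputs, dropout_rate=0.5, training=True, seed=7)
test_out = categorical_dropout(inputs, dropout_rate=0.5, training=False)
```

Every value in train_out is either the original index or 0, while test_out equals inputs.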

get_config()

Get config for the CategoricalDropout tensorflow layer

Returns: Configuration dictionary for the tensorflow layer
Return type: dict

call(inputs, training=None)

Run the CategoricalDropout layer by masking input labels to OOV index 0 at dropout_rate

Parameters:
  • inputs (Tensor object) – int categorical index tensor to be masked
  • training (bool) – If the layer should be run as training or not
Returns:

Masked tensor object with values set to 0 at probability of dropout_rate

Return type:

Tensor object

ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_vocabulary_file_and_dropout(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info. Also applies dropout, converting categorical indices to the OOV index 0 at a rate of dropout_rate.

Parameters:
  • feature_tensor (Tensor object) – String feature tensor
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

Categorical embedding representation of input feature_tensor

Return type:

Tensor object

Notes

Args under feature_layer_info:
vocabulary_file : str
path to vocabulary CSV file for the input tensor
dropout_rate : float
rate at which to convert categorical indices to OOV
embedding_size : int
dimension size of categorical embedding

The vocabulary CSV file is expected to contain two columns, key and id, where each key is mapped to one id. Several keys may share an id, which results in a many-to-one vocabulary mapping. If the id field is absent, a unique natural number id is assigned to each key by default, resulting in a one-to-one mapping.

The OOV index will be set to 0, and num_oov_buckets will be 0.

ml4ir.base.features.feature_fns.categorical.categorical_indicator_with_vocabulary_file(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Converts a string tensor into a categorical one-hot representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into one-hot representation.

Parameters:
  • feature_tensor (Tensor object) – String feature tensor
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

Categorical one-hot representation of input feature_tensor

Return type:

Tensor object

Notes

Args under feature_layer_info:
vocabulary_file : string
path to vocabulary CSV file for the input tensor containing the vocabulary to look-up. Uses the “key” named column as vocabulary, or the 1st column if no “key” column is present.
max_length : int
max number of rows to consider from the vocabulary file. If null, considers the entire file vocabulary.
num_oov_buckets : int, optional
number of out of vocabulary buckets/slots to be used to encode strings into categorical indices. If not specified, the default is 1.

The vocabulary CSV file is expected to contain two columns, key and id, where each key is mapped to one id. Several keys may share an id, which results in a many-to-one vocabulary mapping. If the id field is absent, a unique whole number id is assigned to each key by default, resulting in a one-to-one mapping.
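To make the one-hot representation concrete, here is a minimal plain-Python sketch with the default of one OOV bucket. The helper categorical_indicator and the choice to place the OOV slot after the vocabulary entries are assumptions for illustration, not a description of the library's internal index layout.

```python
def categorical_indicator(index, vocabulary_size, num_oov_buckets=1):
    """One-hot encode a categorical index over the vocabulary plus OOV slots."""
    depth = vocabulary_size + num_oov_buckets
    if not 0 <= index < depth:
        raise ValueError(f"index {index} is outside [0, {depth})")
    return [1.0 if i == index else 0.0 for i in range(depth)]


vocabulary_keys = ["red", "green", "blue"]
lookup = {key: i for i, key in enumerate(vocabulary_keys)}
# Assumption: the single OOV slot sits right after the vocabulary entries
oov_index = len(vocabulary_keys)

known = categorical_indicator(lookup.get("green", oov_index), len(vocabulary_keys))
# known -> [0.0, 1.0, 0.0, 0.0]
unknown = categorical_indicator(lookup.get("purple", oov_index), len(vocabulary_keys))
# unknown -> [0.0, 0.0, 0.0, 1.0]
```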

ml4ir.base.features.feature_fns.categorical.categorical_indices_from_vocabulary_file(feature_info, feature_tensor, file_io)

Extracts the vocabulary (encoding and values) from the vocabulary_file stated inside feature_info, and encodes the feature_tensor with that vocabulary.

Parameters:
  • feature_tensor (Tensor object) – String feature tensor
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

  • categorical_indices (Tensor object) – encoding of the feature_tensor based on the vocabulary.
  • num_oov_buckets (int) – applied num_oov_buckets
  • vocabulary_keys (list) – values of the vocabulary stated in the vocabulary_file.

class ml4ir.base.features.feature_fns.categorical.VocabLookup(vocabulary_keys, vocabulary_ids, num_oov_buckets: int = None, default_value: int = None, feature_name='categorical_variable')

Bases: tensorflow.python.keras.engine.base_layer.Layer

The class defines a keras layer wrapper around a tf lookup table using the given vocabulary list. Maps each entry of a vocabulary list into categorical indices.

vocabulary_list

List of strings that form the vocabulary set of categorical values

Type: list

num_oov_buckets

Number of buckets to be used for out of vocabulary strings

Type: int

default_value

Default value to be used for OOV values

Type: int

feature_name

Name of the input feature tensor

Type: str

lookup_table

Tensorflow look up table that maps strings to integer indices

Type: LookupTable object

Notes

There is a known issue with using LookupTable with Keras symbolic tensors: the lookup table expects eager tensors.

Ref: https://github.com/tensorflow/tensorflow/issues/38305

build(input_shape)

Defines a Lookup Table using a KeyValueTensorInitializer to map the keys to the IDs. Allows definition of two types of lookup tables, depending on whether the user specifies num_oov_buckets or default_value.
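The difference between the two lookup modes can be sketched without TensorFlow. The class SimpleVocabLookup below is a hypothetical pure-Python stand-in, not the Keras layer; the CRC32 hash and the placement of OOV buckets right after the largest known id are assumptions made for this example.

```python
import zlib


class SimpleVocabLookup:
    """Pure-Python stand-in for a string-to-index lookup table (hypothetical)."""

    def __init__(self, vocabulary_keys, vocabulary_ids, num_oov_buckets=None, default_value=None):
        self.table = dict(zip(vocabulary_keys, vocabulary_ids))
        self.num_oov_buckets = num_oov_buckets
        self.default_value = default_value
        # Assumption: OOV buckets start right after the largest known id
        self.oov_start = max(vocabulary_ids) + 1

    def __call__(self, text):
        if text in self.table:
            return self.table[text]
        if self.num_oov_buckets:
            # Mode 1: hash unknown strings into one of the OOV buckets
            bucket = zlib.crc32(text.encode("utf-8")) % self.num_oov_buckets
            return self.oov_start + bucket
        # Mode 2: map every unknown string to a single default value
        return self.default_value


keys, ids = ["red", "green", "blue"], [0, 1, 2]
with_buckets = SimpleVocabLookup(keys, ids, num_oov_buckets=2)
with_default = SimpleVocabLookup(keys, ids, default_value=-1)
```

With OOV buckets, different unknown strings can land in different slots and therefore learn different embeddings; with a default value, all unknown strings collapse to one index.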

call(input_text)

Convert string tensors to numeric indices using lookup table

Parameters: input_text (Tensor object) – String categorical tensor
Returns: Numeric tensor object with corresponding lookup indices
Return type: Tensor object

get_config()

Get tensorflow configuration for the lookup table

Returns: Configuration dictionary for the lookup table layer
Return type: dict

ml4ir.base.features.feature_fns.categorical.get_vocabulary_info(feature_info, file_io)

Extract the vocabulary (encoding and values) from the stated vocabulary_file inside feature_info.

Parameters:
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
  • file_io (FileIO object) – FileIO handler object for reading and writing files
Returns:

  • vocabulary_keys (list) – values of the vocabulary stated in the vocabulary_file.
  • vocabulary_ids (list) – corresponding encoding ids (values of the vocabulary_keys).

Notes

Args under feature_layer_info:
vocabulary_file : str
path to vocabulary CSV file for the input tensor containing the vocabulary to look-up. Uses the “key” named column as vocabulary, or the 1st column if no “key” column is present.
max_length : int
max number of rows to consider from the vocabulary file. If null, considers the entire file vocabulary.
default_value : int
default value stated in the configuration, used to replace missing data points.
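For orientation, a feature_info dictionary carrying these arguments might look like the sketch below. The exact nesting is an assumption: it presumes the arguments sit in an args dictionary under feature_layer_info. The feature name, the file path and the helper read_vocabulary_args are placeholders invented for this example.

```python
# Hypothetical feature_info; the nesting under "args" is an assumption
feature_info = {
    "name": "domain_name",  # placeholder feature name
    "feature_layer_info": {
        "args": {
            "vocabulary_file": "path/to/vocabulary.csv",  # placeholder path
            "max_length": None,  # null: consider the entire vocabulary file
            "default_value": 0,
        },
    },
}


def read_vocabulary_args(feature_info):
    """Collect the vocabulary arguments, treating missing optional ones as None."""
    args = feature_info["feature_layer_info"]["args"]
    return {
        "vocabulary_file": args["vocabulary_file"],
        "max_length": args.get("max_length"),
        "default_value": args.get("default_value"),
    }


vocabulary_args = read_vocabulary_args(feature_info)
```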

Sequence Feature Transformations

ml4ir.base.features.feature_fns.sequence.bytes_sequence_to_encoding_bilstm(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Encodes a string tensor into an encoding. Works by converting the string into a bytes sequence and then generating a categorical/char embedding for each of the 256 bytes. The char/byte embeddings are then combined using a biLSTM.

Parameters:
  • feature_tensor (Tensor object) – String feature tensor that is to be encoded
  • feature_info (dict) – Dictionary representing the feature_config for the input feature
  • file_io (FileIO object) – FileIO handler object for reading and writing
Returns:

Encoded feature tensor

Return type:

Tensor object

Notes

Args under feature_layer_info:
max_length : int
max length of bytes sequence
embedding_size : int
dimension size of the embedding; if null, then the tensor is just converted to its one-hot representation
encoding_size : int
dimension size of the sequence encoding computed using a biLSTM

The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.
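The reason for the fixed input dimension of 256 can be seen in a small plain-Python sketch of the bytes conversion. The helper names are hypothetical, and right-padding short inputs with zero bytes is an assumption; the sketch only covers the conversion and the one-hot fallback used when embedding_size is null, not the biLSTM itself.

```python
def string_to_byte_ids(text, max_length):
    """Convert a string into a fixed-length list of byte values in [0, 255]."""
    raw = text.encode("utf-8")[:max_length]  # truncate long inputs
    # Assumption: short inputs are right-padded with zero bytes
    return list(raw) + [0] * (max_length - len(raw))


def one_hot_bytes(byte_ids):
    """One-hot encode each byte over the fixed input dimension of 256."""
    return [[1.0 if value == b else 0.0 for value in range(256)] for b in byte_ids]


byte_ids = string_to_byte_ids("abc", max_length=5)
# byte_ids -> [97, 98, 99, 0, 0]
one_hot = one_hot_bytes(byte_ids)
```

Here one_hot has max_length rows of 256 values each, which is the shape that a per-byte embedding or the sequence encoder would consume.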

ml4ir.base.features.feature_fns.sequence.get_bilstm_encoding(embedding, lstm_units, kernel_initializer='glorot_uniform')

Convert sequence into encoding by passing through bidirectional LSTM

Parameters:
  • embedding (Tensor object) – Sequence tensor with representations for each time step
  • lstm_units (int) – Number of units in the LSTM
  • kernel_initializer (str) – Any initializer supported by tf.keras.initializers, e.g. ‘ones’, ‘glorot_uniform’, ‘lecun_normal’ …
Returns:

Encoded feature tensor

Return type:

Tensor object