Feature Transformation
Categorical Feature Transformations
-
ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_hash_buckets(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)
Converts a string feature tensor into a categorical embedding. Works by first converting the string into num_hash_buckets buckets, each of size hash_bucket_size, then converting each hash bucket into a categorical embedding of dimension embedding_size. Finally, these embeddings are combined through a mean, sum or concat operation, as specified in the feature_info, to generate the final embedding.
Parameters: - feature_tensor (Tensor) – String feature tensor
- feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: categorical embedding for the input feature_tensor
Return type: Tensor object
Notes
- Args under feature_layer_info:
- num_hash_buckets : int
- number of different hash buckets to convert the input string into
- hash_bucket_size : int
- the size of each hash bucket
- embedding_size : int
- dimension size of the categorical embedding
- merge_mode : str
- can be one of “mean”, “sum”, “concat” representing the mode of combining embeddings from each categorical embedding
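The multi-hash idea described above can be sketched in plain Python. This is only an illustration under stated assumptions, not ml4ir's implementation: the salted MD5 hash, the toy embedding tables, and the helper names (hash_to_bucket, merge_embeddings, embed_with_hash_buckets) are all hypothetical, and a real layer would learn the embedding values.

```python
import hashlib


def hash_to_bucket(value: str, salt: int, hash_bucket_size: int) -> int:
    """Map a string to a bucket id in [0, hash_bucket_size) using a salted hash."""
    # Assumption: a salted MD5 stands in for whatever hash the library uses
    digest = hashlib.md5(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


def merge_embeddings(vectors, merge_mode):
    """Combine per-bucket embedding vectors with mean, sum or concat."""
    if merge_mode == "concat":
        return [x for vec in vectors for x in vec]
    summed = [sum(col) for col in zip(*vectors)]
    if merge_mode == "sum":
        return summed
    if merge_mode == "mean":
        return [x / len(vectors) for x in summed]
    raise ValueError(f"Unsupported merge_mode: {merge_mode}")


def embed_with_hash_buckets(value, tables, hash_bucket_size, merge_mode):
    """Look up one embedding per hash function, then merge them."""
    vectors = [
        table[hash_to_bucket(value, salt, hash_bucket_size)]
        for salt, table in enumerate(tables)
    ]
    return merge_embeddings(vectors, merge_mode)


# Hypothetical settings mirroring the args documented above
num_hash_buckets, hash_bucket_size, embedding_size = 3, 8, 4

# Toy embedding tables with deterministic values; a real layer would learn these
tables = [
    [[float(t + b + d) for d in range(embedding_size)] for b in range(hash_bucket_size)]
    for t in range(num_hash_buckets)
]

concat_vec = embed_with_hash_buckets("query_text", tables, hash_bucket_size, "concat")
mean_vec = embed_with_hash_buckets("query_text", tables, hash_bucket_size, "mean")
```

With concat, the output dimension grows to num_hash_buckets * embedding_size, while mean and sum keep it at embedding_size.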
-
ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_indices(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)
Converts an input integer tensor into a categorical embedding. Works by converting the categorical indices in the input feature_tensor, represented as integer values, into categorical embeddings based on the feature_info.
Parameters: - feature_tensor (Tensor object) – int feature tensor
- feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: categorical embedding for the input feature_tensor
Return type: Tensor object
Notes
- Args under feature_layer_info:
- num_buckets : int
- Maximum number of categorical values
- default_value : int
- default value to be assigned to indices out of the num_buckets range
- embedding_size : int
- dimension size of the categorical embedding
String based categorical features should already be converted into numeric indices
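As a rough illustration of the index handling described above, the following plain-Python sketch replaces indices outside the num_buckets range with default_value before the embedding lookup. The helper name embed_indices and the toy embedding table are assumptions made for this example; the actual layer operates on tensors and learns its embeddings.

```python
def embed_indices(indices, embedding_table, num_buckets, default_value):
    """Replace out-of-range indices with default_value, then look up embeddings."""
    safe = [i if 0 <= i < num_buckets else default_value for i in indices]
    return [embedding_table[i] for i in safe]


# Hypothetical settings mirroring the args documented above
num_buckets, embedding_size, default_value = 4, 2, 0

# Toy embedding table: row r holds [r, 10 * r]
embedding_table = [[float(row), float(row) * 10] for row in range(num_buckets)]

# Indices 7 and -1 fall outside [0, num_buckets) and map to default_value
embedded = embed_indices([1, 3, 7, -1], embedding_table, num_buckets, default_value)
```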
-
ml4ir.base.features.feature_fns.categorical.categorical_embedding_to_encoding_bilstm(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)
Encodes a string tensor into a categorical embedding. Works by converting the string into a word sequence and then generating a categorical/char embedding for each word based on the list of strings that form the vocabulary set of categorical values, defined by the argument vocabulary_file. The char/byte embeddings are then combined using a biLSTM.
Parameters: - feature_tensor (Tensor object) – String feature tensor that is to be encoded
- feature_info (dict) – Dictionary representing the feature_config for the input feature
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: Encoded feature tensor
Return type: Tensor object
Notes
- Args under feature_layer_info:
- vocabulary_file : string
- path to vocabulary CSV file for the input tensor containing the vocabulary to look up. Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
- max_length: int
- max number of rows to consider from the vocabulary file.
- if null, considers the entire file vocabulary.
- embedding_size : int
- dimension size of the embedding;
- if null, then the tensor is just converted to its one-hot representation
- encoding_size : int
- dimension size of the sequence encoding computed using a biLSTM
The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.
-
ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_vocabulary_file(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)
Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info.
Parameters: - feature_tensor (Tensor object) – String feature tensor
- feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: Categorical embedding representation of input feature_tensor
Return type: Tensor object
Notes
- Args under feature_layer_info:
- vocabulary_file : string
- path to vocabulary CSV file for the input tensor containing the vocabulary to look-up.
- uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
- max_length : int
- max number of rows to consider from the vocabulary file.
- if null, considers the entire file vocabulary.
- num_oov_buckets : int
- number of out-of-vocabulary buckets/slots to be used to encode strings into categorical indices
- embedding_size : int
- dimension size of categorical embedding
The vocabulary CSV file must contain two columns - key, id, where the key is mapped to one id thereby resulting in a many-to-one vocabulary mapping. If id field is absent, a unique whole number id is assigned by default resulting in a one-to-one mapping
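The vocabulary file conventions above (a key column, an optional id column, and max_length) can be sketched with the standard csv module. This is a hedged illustration rather than the library's loader: load_vocabulary is a hypothetical helper, and it reads CSV text from a string instead of going through the FileIO handler.

```python
import csv
import io


def load_vocabulary(csv_text, max_length=None):
    """Return (keys, ids) from vocabulary CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Use the "key" column if present, otherwise fall back to the first column
    key_column = "key" if "key" in reader.fieldnames else reader.fieldnames[0]
    has_ids = "id" in reader.fieldnames
    rows = list(reader)
    if max_length is not None:
        rows = rows[:max_length]  # only consider the first max_length rows
    keys = [row[key_column] for row in rows]
    if has_ids:
        ids = [int(row["id"]) for row in rows]  # several keys may share one id
    else:
        ids = list(range(len(keys)))  # one-to-one: unique whole-number ids
    return keys, ids


# Many-to-one mapping: "shoes" and "sneakers" share id 0
vocab_csv = "key,id\nshoes,0\nsneakers,0\nshirts,1\nhats,2\n"
keys, ids = load_vocabulary(vocab_csv, max_length=3)
mapping = dict(zip(keys, ids))

# No id column: ids are assigned by row position
keys_only, default_ids = load_vocabulary("key\nshoes\nshirts\n")
```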
-
class ml4ir.base.features.feature_fns.categorical.CategoricalDropout(dropout_rate, seed=None, **kwargs)
Bases: tensorflow.python.keras.engine.base_layer.Layer
Custom Dropout class for categorical indices
Examples
>>> inputs: [[1, 2, 3], [4, 1, 2]]
>>> dropout_rate = 0.5
>>> When training, output: [[0, 0, 3], [0, 1, 2]]
>>> When testing, output: [[1, 2, 3], [4, 1, 2]]
Notes
At training time, mask indices to 0 at dropout_rate.
This works similarly to tf.keras.layers.Dropout, but without the scaling.
Ref: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout
Parameters: - dropout_rate (float) – fraction of units to drop, i.e., set to OOV token 0
- seed (int) – random seed for sampling to mask/drop categorical labels
Notes
We define the OOV index to be 0 for this function. When dropout is applied, it converts a dropout_rate fraction of the values to 0 (the OOV index). This allows us to train a good average embedding for the OOV token.
-
get_config()
Get config for the CategoricalDropout tensorflow layer
Returns: Configuration dictionary for the tensorflow layer
Return type: dict
-
call(inputs, training=None)
Run the CategoricalDropout layer by masking input labels to OOV index 0 at dropout_rate
Parameters: - inputs (Tensor object) – int categorical index tensor to be masked
- training (bool) – If the layer should be run as training or not
Returns: Masked tensor object with values set to 0 at probability of dropout_rate
Return type: Tensor object
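The masking behaviour of CategoricalDropout can be approximated in plain Python as follows. This is a minimal sketch, assuming independent per-element sampling with Python's random module; categorical_dropout is a hypothetical function, not the Keras layer, and its random draws will not match TensorFlow's for the same seed.

```python
import random


def categorical_dropout(indices, dropout_rate, training, seed=None):
    """Mask each index to the OOV index 0 with probability dropout_rate (training only)."""
    if not training:
        # At inference time the indices pass through unchanged
        return [list(row) for row in indices]
    rng = random.Random(seed)
    # No rescaling of kept values, unlike tf.keras.layers.Dropout
    return [[0 if rng.random() < dropout_rate else i for i in row] for row in indices]


inputs = [[1, 2, 3], [4, 1, 2]]
train_out = categorical_dropout(inputs, dropout_rate=0.5, training=True, seed=42)
test_out = categorical_dropout(inputs, dropout_rate=0.5, training=False)
```

Because the dropped positions all share index 0, the embedding at that index is trained on a random sample of all categories, which is what lets it act as an average embedding for unseen values.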
-
ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_vocabulary_file_and_dropout(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)
Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info. Also applies dropout, converting categorical indices to the OOV index of 0 at a rate of dropout_rate.
Parameters: - feature_tensor (Tensor object) – String feature tensor
- feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: Categorical embedding representation of input feature_tensor
Return type: Tensor object
Notes
- Args under feature_layer_info:
- vocabulary_file : str
- path to vocabulary CSV file for the input tensor
- dropout_rate : float
- rate at which to convert categorical indices to OOV
- embedding_size : int
- dimension size of categorical embedding
The vocabulary CSV file must contain two columns - key, id, where the key is mapped to one id thereby resulting in a many-to-one vocabulary mapping. If id field is absent, a unique natural number id is assigned by default resulting in a one-to-one mapping
OOV index will be set to 0. num_oov_buckets will be 0.
-
ml4ir.base.features.feature_fns.categorical.categorical_indicator_with_vocabulary_file(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)
Converts a string tensor into a categorical one-hot representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into one-hot representation.
Parameters: - feature_tensor (Tensor object) – String feature tensor
- feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: Categorical one-hot representation of input feature_tensor
Return type: Tensor object
Notes
- Args under feature_layer_info:
- vocabulary_file : string
- path to vocabulary CSV file for the input tensor containing the vocabulary to look up. Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
- max_length : int
- max number of rows to consider from the vocabulary file. if null, considers the entire file vocabulary.
- num_oov_buckets : int, optional
- number of out of vocabulary buckets/slots to be used to encode strings into categorical indices. If not specified, the default is 1.
The vocabulary CSV file must contain two columns - key, id, where the key is mapped to one id thereby resulting in a many-to-one vocabulary mapping. If id field is absent, a unique whole number id is assigned by default resulting in a one-to-one mapping
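A small plain-Python sketch of the one-hot step is shown below. It rests on two assumptions that the text above does not confirm: that the one-hot depth equals the number of vocabulary ids plus num_oov_buckets, and that OOV slots are placed after the largest vocabulary id. The helper one_hot_with_oov is hypothetical.

```python
def one_hot_with_oov(value, vocabulary, num_oov_buckets=1):
    """One-hot encode a string; unknown strings use the slot after the vocabulary ids."""
    # Assumption: OOV slots follow the largest vocabulary id
    vocab_size = max(vocabulary.values()) + 1
    depth = vocab_size + num_oov_buckets
    # This sketch always uses the first OOV slot, even if num_oov_buckets > 1
    index = vocabulary.get(value, vocab_size)
    return [1.0 if i == index else 0.0 for i in range(depth)]


vocabulary = {"shoes": 0, "shirts": 1, "hats": 2}
known = one_hot_with_oov("shirts", vocabulary)
unknown = one_hot_with_oov("gloves", vocabulary)
```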
-
ml4ir.base.features.feature_fns.categorical.categorical_indices_from_vocabulary_file(feature_info, feature_tensor, file_io)
Extracts the vocabulary (encoding and values) from the vocabulary_file specified in feature_info and encodes the feature_tensor with that vocabulary.
Parameters: - feature_tensor (Tensor object) – String feature tensor
- feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: - categorical_indices (Tensor object) – encoding of the feature_tensor based on the vocabulary.
- num_oov_buckets (int) – applied num_oov_buckets
- vocabulary_keys (list) – values of the vocabulary stated in the vocabulary_file.
-
class ml4ir.base.features.feature_fns.categorical.VocabLookup(vocabulary_keys, vocabulary_ids, num_oov_buckets: int = None, default_value: int = None, feature_name='categorical_variable')
Bases: tensorflow.python.keras.engine.base_layer.Layer
Defines a Keras layer wrapper around a TF lookup table using the given vocabulary list. Maps each entry of a vocabulary list into categorical indices.
-
vocabulary_list
List of strings that form the vocabulary set of categorical values
Type: list
-
num_oov_buckets
Number of buckets to be used for out of vocabulary strings
Type: int
-
default_value
Default value to be used for OOV values
Type: int
-
feature_name
Name of the input feature tensor
Type: str
-
lookup_table
Tensorflow lookup table that maps strings to integer indices
Type: LookupTable object
Notes
There is a known issue with using LookupTable with Keras symbolic tensors: it expects eager tensors.
Ref: https://github.com/tensorflow/tensorflow/issues/38305
-
build(input_shape)
Defines a Lookup Table using a KeyValueTensorInitializer to map the keys to the IDs. Allows definition of two types of lookup tables based on whether the user specifies num_oov_buckets or the default_value
-
call(input_text)
Convert string tensors to numeric indices using lookup table
Parameters: input_text (Tensor object) – String categorical tensor
Returns: Numeric tensor object with corresponding lookup indices
Return type: Tensor object
-
get_config()
Get tensorflow configuration for the lookup table
Returns: Configuration dictionary for the lookup table layer
Return type: dict
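The two lookup modes mentioned for build (OOV buckets versus a single default value) can be illustrated with a pure-Python stand-in. SimpleVocabLookup is a hypothetical class written for this example; the CRC32 hash and the choice to number OOV buckets right after the largest known id are assumptions, not the behaviour of the TensorFlow lookup table the layer wraps.

```python
import zlib


class SimpleVocabLookup:
    """Pure-Python stand-in for a string-to-index lookup with OOV handling."""

    def __init__(self, vocabulary_keys, vocabulary_ids, num_oov_buckets=None, default_value=None):
        self.table = dict(zip(vocabulary_keys, vocabulary_ids))
        self.num_oov_buckets = num_oov_buckets
        self.default_value = default_value
        # Assumption: OOV buckets are numbered right after the largest known id
        self.oov_offset = max(vocabulary_ids) + 1

    def __call__(self, text):
        if text in self.table:
            return self.table[text]
        if self.num_oov_buckets:
            # Mode 1: hash unknown strings into one of num_oov_buckets slots
            bucket = zlib.crc32(text.encode("utf-8")) % self.num_oov_buckets
            return self.oov_offset + bucket
        # Mode 2: every unknown string maps to the same default value
        return self.default_value


bucketed = SimpleVocabLookup(["a", "b"], [0, 1], num_oov_buckets=2)
defaulted = SimpleVocabLookup(["a", "b"], [0, 1], default_value=-1)

known_index = bucketed("a")
oov_bucket_index = bucketed("zzz")
oov_default_index = defaulted("zzz")
```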
-
-
ml4ir.base.features.feature_fns.categorical.get_vocabulary_info(feature_info, file_io)
Extract the vocabulary (encoding and values) from the stated vocabulary_file inside feature_info.
Parameters: - feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing files
Returns: - vocabulary_keys (list) – values of the vocabulary stated in the vocabulary_file.
- vocabulary_ids (list) – corresponding encoding ids (values of the vocabulary_keys).
Notes
- Args under feature_layer_info
- vocabulary_file : str
- path to vocabulary CSV file for the input tensor containing the vocabulary to look up. Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
- max_length : int
- max number of rows to consider from the vocabulary file. If null, considers the entire file vocabulary.
- default_value : int
- default value stated in the configuration, used to replace missing data points.
Sequence Feature Transformations
-
ml4ir.base.features.feature_fns.sequence.bytes_sequence_to_encoding_bilstm(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)
Encode a string tensor into an encoding. Works by converting the string into a bytes sequence and then generating a categorical/char embedding for each of the 256 bytes. The char/byte embeddings are then combined using a biLSTM.
Parameters: - feature_tensor (Tensor object) – String feature tensor that is to be encoded
- feature_info (dict) – Dictionary representing the feature_config for the input feature
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: Encoded feature tensor
Return type: Tensor object
Notes
- Args under feature_layer_info:
- max_length : int
- max length of bytes sequence
- embedding_size : int
- dimension size of the embedding; if null, then the tensor is just converted to its one-hot representation
- encoding_size : int
- dimension size of the sequence encoding computed using a biLSTM
The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.
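The string-to-bytes step, and the reason for the fixed input dimension of 256, can be sketched in plain Python. The helper string_to_byte_ids and the choice of 0 as the padding value are assumptions made for illustration; the biLSTM encoding itself is not shown.

```python
def string_to_byte_ids(text, max_length, pad_value=0):
    """Convert a string to a fixed-length list of byte values in [0, 255]."""
    raw = list(text.encode("utf-8"))[:max_length]  # truncate to max_length bytes
    # Assumption: shorter sequences are right-padded with pad_value
    return raw + [pad_value] * (max_length - len(raw))


def one_hot_bytes(byte_ids, depth=256):
    """One-hot representation used when no embedding_size is configured."""
    return [[1.0 if i == b else 0.0 for i in range(depth)] for b in byte_ids]


# "é" occupies two bytes in UTF-8, so 5 characters become 6 bytes
byte_ids = string_to_byte_ids("héllo", max_length=8)
one_hot = one_hot_bytes(byte_ids)
```

Every byte value falls in the range 0 to 255, so an embedding table with 256 rows covers any input string regardless of language or alphabet.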
-
ml4ir.base.features.feature_fns.sequence.get_bilstm_encoding(embedding, lstm_units, kernel_initializer='glorot_uniform')
Convert sequence into encoding by passing through bidirectional LSTM
Parameters: - embedding (Tensor object) – Sequence tensor with representations for each time step
- lstm_units (int) – Number of units in the LSTM
- kernel_initializer (str) – Any supported tf.keras.initializers e.g., ‘ones’, ‘glorot_uniform’, ‘lecun_normal’ …
Returns: Encoded feature tensor
Return type: Tensor object
-
ml4ir.base.features.feature_fns.sequence.global_1d_pooling(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)
1D pooling to reduce a variable length sequence feature into a scalar value. This method optionally allows users to add multiple such pooling operations to produce a fixed dimensional feature vector as well.
Parameters: - feature_tensor (Tensor object) – String feature tensor that is to be aggregated/pooled. Dimensions: [batch_size, max_sequence_size, max_len]
- feature_info (dict) – Dictionary representing the feature_config for the input feature
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: Global pooled/aggregated feature vector
Return type: Tensor object
Notes
- Args under feature_layer_info:
- fns : list of str
- List of string pooling operations that should be applied. Each must be one of [“sum”, “mean”, “max”, “min”, “count_nonzero”]
- padded_val : int/float
- Value to be ignored from the pooling operations.
- masked_max_val : int/float
- Value used to mask the padded values for computing the max and min pooling operations. This allows us to ignore these values in the min and max pool operations. For example, if all the values in the tensor are in [0., 1.], then a masked_max_val of > 1. will make sure we do not pick padded values in the min/max pooling ops. Default value: 2
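The role of padded_val and masked_max_val can be illustrated on a single 1-D sequence in plain Python. This is a sketch under assumptions: global_1d_pool is a hypothetical helper, padded positions are replaced by masked_max_val for min and by -masked_max_val for max, and an all-padded sequence is given a mean of 0.0. The actual function works on batched tensors.

```python
def global_1d_pool(values, fns, padded_val, masked_max_val=2):
    """Pool one padded sequence into a list with one value per pooling function."""
    valid = [v for v in values if v != padded_val]
    pooled = []
    for fn in fns:
        if fn == "sum":
            pooled.append(sum(valid))
        elif fn == "mean":
            pooled.append(sum(valid) / len(valid) if valid else 0.0)
        elif fn == "count_nonzero":
            pooled.append(sum(1 for v in valid if v != 0))
        elif fn == "min":
            # Padded positions become a large value so they never win the min
            pooled.append(min(masked_max_val if v == padded_val else v for v in values))
        elif fn == "max":
            # Assumption: padded positions become -masked_max_val for the max
            pooled.append(max(-masked_max_val if v == padded_val else v for v in values))
        else:
            raise ValueError(f"Unsupported pooling function: {fn}")
    return pooled


# Sequence of scores in [0, 1], right-padded with -1
sequence = [0.2, 0.8, 0.5, -1, -1]
fns = ["sum", "mean", "max", "min", "count_nonzero"]
pooled = global_1d_pool(sequence, fns, padded_val=-1, masked_max_val=2)
```

Listing several functions in fns yields a fixed-size vector with one entry per function, whatever the length of the input sequence.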
Tensorflow Native Operations
-
ml4ir.base.features.feature_fns.tf_native.tf_native_op(feature_tensor: tensorflow.python.framework.ops.Tensor, feature_info: dict, file_io: ml4ir.base.io.file_io.FileIO)
Run a series of tensorflow native operations on the input feature tensor. The functions will be applied in the order they are specified.
Parameters: - feature_tensor (Tensor) – Input feature tensor
- feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: Modified feature tensor after applying all the specified ops
Return type: Tensor object
Notes
- Args under feature_layer_info:
- ops: list of dict
List of function specifications with associated arguments
- Arguments under ops:
- fn : str
- Tensorflow native function name. Should start with tf. Example: tf.math.log or tf.clip_by_value
- args : dict
- Keyword arguments to be passed to the tensorflow function
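The ordered application of ops can be sketched as follows. To keep the example runnable without TensorFlow, the function names are resolved against a small dictionary of scalar stand-ins; apply_ops and namespace are hypothetical, and passing the running value as the first positional argument is an assumption about how the feature tensor reaches each function.

```python
import math


def apply_ops(value, ops, namespace):
    """Apply each configured op to the running value, in the order listed."""
    for op in ops:
        fn = namespace[op["fn"]]
        value = fn(value, **op.get("args", {}))
    return value


# Scalar stand-ins so the sketch runs without TensorFlow installed
namespace = {
    "tf.math.log": math.log,
    "tf.clip_by_value": lambda t, clip_value_min, clip_value_max: min(
        max(t, clip_value_min), clip_value_max
    ),
}

# Illustrative configuration using the documented "ops", "fn" and "args" keys
feature_layer_info = {
    "ops": [
        {"fn": "tf.math.log", "args": {}},
        {"fn": "tf.clip_by_value", "args": {"clip_value_min": 0.0, "clip_value_max": 2.0}},
    ]
}

log_then_clip = apply_ops(math.exp(3.0), feature_layer_info["ops"], namespace)
clip_then_log = apply_ops(
    math.exp(3.0), list(reversed(feature_layer_info["ops"])), namespace
)
```

The two results differ, which is why the order of entries under ops matters.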