Feature Transformation
Categorical Feature Transformations
-
ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_hash_buckets(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶ Converts a string feature tensor into a categorical embedding. Works by first converting the string into num_hash_buckets buckets, each of size hash_bucket_size, then converting each hash bucket into a categorical embedding of dimension embedding_size. Finally, these embeddings are combined through a mean, sum or concat operation, as specified in the feature_info, to generate the final embedding.
Parameters: - feature_tensor (Tensor) – String feature tensor
- feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: categorical embedding for the input feature_tensor
Return type: Tensor object
Notes
- Args under feature_layer_info:
- num_hash_buckets : int
- number of different hash buckets to convert the input string into
- hash_bucket_size : int
- the size of each hash bucket
- embedding_size : int
- dimension size of the categorical embedding
- merge_mode : str
- can be one of “mean”, “sum”, “concat” representing the mode of combining embeddings from each categorical embedding
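The steps described above can be sketched without TensorFlow. This is only an illustration, not the ml4ir implementation: the hash function (salted MD5), the use of one embedding table per hash bucket, and the helper names hash_to_buckets and merge_embeddings are all assumptions made for this example.

```python
import hashlib
import random


def hash_to_buckets(text, num_hash_buckets, hash_bucket_size):
    """Map one string to num_hash_buckets ids, each in [0, hash_bucket_size)."""
    bucket_ids = []
    for i in range(num_hash_buckets):
        # Assumption: salt the string with the bucket position so that each
        # of the num_hash_buckets hashes is different.
        digest = hashlib.md5(f"{i}:{text}".encode("utf-8")).hexdigest()
        bucket_ids.append(int(digest, 16) % hash_bucket_size)
    return bucket_ids


def merge_embeddings(vectors, merge_mode):
    """Combine per-bucket embeddings with "mean", "sum" or "concat"."""
    if merge_mode == "concat":
        return [value for vector in vectors for value in vector]
    summed = [sum(column) for column in zip(*vectors)]
    if merge_mode == "sum":
        return summed
    if merge_mode == "mean":
        return [value / len(vectors) for value in summed]
    raise ValueError(f"Unsupported merge_mode: {merge_mode}")


num_hash_buckets, hash_bucket_size, embedding_size = 4, 64, 8
rng = random.Random(0)
# Randomly initialised stand-ins for trainable embedding tables,
# one table per hash bucket (an assumption of this sketch).
tables = [
    [[rng.uniform(-1, 1) for _ in range(embedding_size)] for _ in range(hash_bucket_size)]
    for _ in range(num_hash_buckets)
]

bucket_ids = hash_to_buckets("example.com", num_hash_buckets, hash_bucket_size)
vectors = [tables[i][bucket_id] for i, bucket_id in enumerate(bucket_ids)]
mean_embedding = merge_embeddings(vectors, "mean")      # length embedding_size
concat_embedding = merge_embeddings(vectors, "concat")  # length num_hash_buckets * embedding_size
```

Note that "mean" and "sum" keep the output at embedding_size, while "concat" grows it to num_hash_buckets * embedding_size, which affects the input size of whatever layer consumes the feature.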
-
ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_indices(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶ Converts input integer tensor into categorical embedding. Works by converting the categorical indices in the input feature_tensor, represented as integer values, into categorical embeddings based on the feature_info.
Parameters: - feature_tensor (Tensor object) – int feature tensor
- feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: categorical embedding for the input feature_tensor
Return type: Tensor object
Notes
- Args under feature_layer_info:
- num_buckets : int
- Maximum number of categorical values
- default_value : int
- default value to be assigned to indices out of the num_buckets range
- embedding_size : int
- dimension size of the categorical embedding
String based categorical features should already be converted into numeric indices
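A minimal, framework-free sketch of the index-to-embedding step follows. It assumes that any index outside [0, num_buckets) is replaced by default_value before the lookup; the helper name embed_indices and the random table are hypothetical and stand in for a trained embedding layer.

```python
import random


def embed_indices(indices, embedding_table, default_value):
    """Look up one embedding row per index; out-of-range indices use default_value."""
    num_buckets = len(embedding_table)
    rows = []
    for index in indices:
        if not 0 <= index < num_buckets:
            # Index falls outside the num_buckets range, so fall back.
            index = default_value
        rows.append(embedding_table[index])
    return rows


num_buckets, embedding_size, default_value = 5, 3, 0
rng = random.Random(0)
# Randomly initialised stand-in for a trainable embedding matrix.
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(embedding_size)] for _ in range(num_buckets)
]

# Index 17 is out of range and is therefore embedded as default_value (0).
embedded = embed_indices([2, 4, 17], embedding_table, default_value)
```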
-
ml4ir.base.features.feature_fns.categorical.categorical_embedding_to_encoding_bilstm(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶ Encodes a string tensor into a categorical embedding. Works by converting the string into a word sequence and then generating a categorical/char embedding for each word, based on the list of strings that form the vocabulary set of categorical values, defined by the argument vocabulary_file. The char/byte embeddings are then combined using a biLSTM.
Parameters: - feature_tensor (Tensor object) – String feature tensor that is to be encoded
- feature_info (dict) – Dictionary representing the feature_config for the input feature
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: Encoded feature tensor
Return type: Tensor object
Notes
- Args under feature_layer_info:
- vocabulary_file : string
- path to vocabulary CSV file for the input tensor containing the vocabulary to look up. Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
- max_length: int
- max number of rows to consider from the vocabulary file.
- if null, considers the entire file vocabulary.
- embedding_size : int
- dimension size of the embedding;
- if null, then the tensor is just converted to its one-hot representation
- encoding_size : int
- dimension size of the sequence encoding computed using a biLSTM
The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.
-
ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_vocabulary_file(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶ Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info.
Parameters: - feature_tensor (Tensor object) – String feature tensor
- feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: Categorical embedding representation of input feature_tensor
Return type: Tensor object
Notes
- Args under feature_layer_info:
- vocabulary_file : string
- path to vocabulary CSV file for the input tensor containing the vocabulary to look-up.
- uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
- max_length : int
- max number of rows to consider from the vocabulary file.
- if null, considers the entire file vocabulary.
- num_oov_buckets : int
- number of out of vocabulary buckets/slots to be used to encode strings into categorical indices
- embedding_size : int
- dimension size of categorical embedding
The vocabulary CSV file must contain two columns - key, id, where the key is mapped to one id thereby resulting in a many-to-one vocabulary mapping. If id field is absent, a unique whole number id is assigned by default resulting in a one-to-one mapping
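The vocabulary file format can be illustrated with a small parser built on the standard csv module. This is a sketch under stated assumptions, not the ml4ir loader: load_vocabulary is a hypothetical helper, the CSV is read from an in-memory string instead of through FileIO, and default ids are assumed to start at 0.

```python
import csv
import io


def load_vocabulary(csv_text, max_length=None):
    """Return (keys, ids) parsed from vocabulary CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if max_length is not None:
        # Only consider the first max_length rows of the file.
        rows = rows[:max_length]
    if not rows:
        return [], []
    columns = list(rows[0].keys())
    # Use the "key" column if present, otherwise the first column.
    key_column = "key" if "key" in columns else columns[0]
    keys = [row[key_column] for row in rows]
    if "id" in columns:
        # Explicit ids may repeat, giving a many-to-one mapping.
        ids = [int(row["id"]) for row in rows]
    else:
        # Assumption: default ids are 0, 1, 2, ... giving a one-to-one mapping.
        ids = list(range(len(keys)))
    return keys, ids


many_to_one = load_vocabulary("key,id\nAA,0\naa,0\nBB,1\n")
# (['AA', 'aa', 'BB'], [0, 0, 1]): "AA" and "aa" share id 0
one_to_one = load_vocabulary("key\nAA\nBB\nCC\n")
# (['AA', 'BB', 'CC'], [0, 1, 2])
truncated = load_vocabulary("key\nAA\nBB\nCC\n", max_length=2)
# (['AA', 'BB'], [0, 1])
```

A many-to-one mapping is useful when several surface forms (for example, different casings) should share a single embedding row.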
-
class
ml4ir.base.features.feature_fns.categorical.CategoricalDropout(dropout_rate, seed=None, **kwargs)¶ Bases: tensorflow.python.keras.engine.base_layer.Layer
Custom Dropout class for categorical indices
Examples
>>> inputs: [[1, 2, 3], [4, 1, 2]]
>>> dropout_rate = 0.5
>>> When training, output: [[0, 0, 3], [0, 1, 2]]
>>> When testing, output: [[1, 2, 3], [4, 1, 2]]
Notes
At training time, indices are masked to 0 at a rate of dropout_rate.
This works similarly to tf.keras.layers.Dropout, but without the scaling.
Ref: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout
Parameters: - dropout_rate (float) – fraction of units to drop, i.e., set to OOV token 0
- seed (int) – random seed for sampling to mask/drop categorical labels
Notes
We define the OOV index to be 0 for this function. When dropout is applied, it converts p% of the values to 0 (which is the OOV index). This allows us to train a good average embedding for the OOV token.
-
get_config()¶ Get config for the CategoricalDropout tensorflow layer
Returns: Configuration dictionary for the tensorflow layer
Return type: dict
-
call(inputs, training=None)¶ Run the CategoricalDropout layer by masking input labels to OOV index 0 at dropout_rate
Parameters: - inputs (Tensor object) – int categorical index tensor to be masked
- training (bool) – If the layer should be run as training or not
Returns: Masked tensor object with values set to 0 at probability of dropout_rate
Return type: Tensor object
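The masking behaviour can be sketched in plain Python as follows. This is an illustration of the idea only, assuming each index is dropped independently with probability dropout_rate; categorical_dropout is a hypothetical helper and does not reproduce the Keras layer's random number generation.

```python
import random


def categorical_dropout(inputs, dropout_rate, training, seed=None):
    """Mask categorical indices to the OOV index 0 at training time."""
    if not training:
        # At inference time the indices pass through unchanged.
        return [list(row) for row in inputs]
    rng = random.Random(seed)
    # Each index is independently replaced by 0 with probability dropout_rate.
    # Unlike tf.keras.layers.Dropout, surviving values are not rescaled.
    return [
        [0 if rng.random() < dropout_rate else index for index in row]
        for row in inputs
    ]


inputs = [[1, 2, 3], [4, 1, 2]]
train_output = categorical_dropout(inputs, dropout_rate=0.5, training=True, seed=0)
test_output = categorical_dropout(inputs, dropout_rate=0.5, training=False)
```

Rescaling is skipped because the values are category ids rather than activations, so multiplying them by 1 / (1 - dropout_rate) would change which category they refer to.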
-
ml4ir.base.features.feature_fns.categorical.categorical_embedding_with_vocabulary_file_and_dropout(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶ Converts a string tensor into a categorical embedding representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into embeddings based on the feature_info. Also uses a dropout to convert categorical indices to the OOV index of 0 at a rate of dropout_rate
Parameters: - feature_tensor (Tensor object) – String feature tensor
- feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: Categorical embedding representation of input feature_tensor
Return type: Tensor object
Notes
- Args under feature_layer_info:
- vocabulary_file : str
- path to vocabulary CSV file for the input tensor
- dropout_rate : float
- rate at which to convert categorical indices to OOV
- embedding_size : int
- dimension size of categorical embedding
The vocabulary CSV file must contain two columns - key, id, where the key is mapped to one id thereby resulting in a many-to-one vocabulary mapping. If id field is absent, a unique natural number id is assigned by default resulting in a one-to-one mapping
The OOV index will be set to 0 and num_oov_buckets will be 0.
-
ml4ir.base.features.feature_fns.categorical.categorical_indicator_with_vocabulary_file(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶ Converts a string tensor into a categorical one-hot representation. Works by using a vocabulary file to convert the string tensor into categorical indices and then converting the categories into one-hot representation.
Parameters: - feature_tensor (Tensor object) – String feature tensor
- feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: Categorical one-hot representation of input feature_tensor
Return type: Tensor object
Notes
- Args under feature_layer_info:
- vocabulary_file : string
- path to vocabulary CSV file for the input tensor containing the vocabulary to look up. Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
- max_length : int
- max number of rows to consider from the vocabulary file. If null, considers the entire file vocabulary.
- num_oov_buckets : int, optional
- number of out of vocabulary buckets/slots to be used to encode strings into categorical indices. If not specified, the default is 1.
The vocabulary CSV file must contain two columns - key, id, where the key is mapped to one id thereby resulting in a many-to-one vocabulary mapping. If id field is absent, a unique whole number id is assigned by default resulting in a one-to-one mapping
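As a rough illustration of the one-hot output, the sketch below encodes strings against an in-memory vocabulary with the default of a single OOV slot. It assumes a one-to-one vocabulary whose ids follow list order and places the OOV slot after the last vocabulary index; one_hot_indicator is a hypothetical helper, not part of ml4ir.

```python
def one_hot_indicator(values, vocabulary_keys):
    """One-hot encode strings; unknown strings share a single OOV slot."""
    key_to_index = {key: index for index, key in enumerate(vocabulary_keys)}
    # Assumption: the single OOV slot sits right after the vocabulary.
    oov_index = len(vocabulary_keys)
    depth = len(vocabulary_keys) + 1
    encoded = []
    for value in values:
        vector = [0.0] * depth
        vector[key_to_index.get(value, oov_index)] = 1.0
        encoded.append(vector)
    return encoded


one_hot = one_hot_indicator(["BB", "ZZ"], ["AA", "BB", "CC"])
# one_hot[0] == [0.0, 1.0, 0.0, 0.0]  ("BB" is vocabulary index 1)
# one_hot[1] == [0.0, 0.0, 0.0, 1.0]  ("ZZ" falls in the OOV slot)
```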
-
ml4ir.base.features.feature_fns.categorical.categorical_indices_from_vocabulary_file(feature_info, feature_tensor, file_io)¶ Extracts the vocabulary (encoding and values) from the vocabulary_file stated in feature_info, and encodes the feature_tensor with the vocabulary.
Parameters: - feature_tensor (Tensor object) – String feature tensor
- feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: - categorical_indices (Tensor object) – encoding of the feature_tensor based on the vocabulary.
- num_oov_buckets (int) – applied num_oov_buckets
- vocabulary_keys (list) – values of the vocabulary stated in the vocabulary_file.
-
class
ml4ir.base.features.feature_fns.categorical.VocabLookup(vocabulary_keys, vocabulary_ids, num_oov_buckets: int = None, default_value: int = None, feature_name='categorical_variable')¶ Bases: tensorflow.python.keras.engine.base_layer.Layer
The class defines a keras layer wrapper around a tf lookup table using the given vocabulary list. Maps each entry of a vocabulary list into categorical indices.
-
vocabulary_list¶ List of strings that form the vocabulary set of categorical values
Type: list
-
num_oov_buckets¶ Number of buckets to be used for out of vocabulary strings
Type: int
-
default_value¶ Default value to be used for OOV values
Type: int
-
feature_name¶ Name of the input feature tensor
Type: str
-
lookup_table¶ Tensorflow look up table that maps strings to integer indices
Type: LookupTable object
Notes
Issue[1] with using LookupTable with keras symbolic tensors; expects eager tensors.
Ref: https://github.com/tensorflow/tensorflow/issues/38305
-
build(input_shape)¶ Defines a Lookup Table using a KeyValueTensorInitializer to map the keys to the IDs. Allows definition of two types of lookup tables based on whether the user specifies num_oov_buckets or the default_value
-
call(input_text)¶ Convert string tensors to numeric indices using lookup table
Parameters: input_text (Tensor object) – String categorical tensor
Returns: Numeric tensor object with corresponding lookup indices
Return type: Tensor object
-
get_config()¶ Get tensorflow configuration for the lookup table
Returns: Configuration dictionary for the lookup table layer
Return type: dict
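The two kinds of lookup table mentioned under build can be contrasted with a framework-free stand-in. Everything here is an assumption made for illustration: SimpleVocabLookup is a hypothetical class, OOV buckets are numbered right after the largest known id, and CRC32 is used as the hash; the actual layer relies on a TensorFlow lookup table.

```python
import zlib


class SimpleVocabLookup:
    """Plain-Python stand-in for a vocabulary lookup layer (illustrative only)."""

    def __init__(self, vocabulary_keys, vocabulary_ids, num_oov_buckets=None, default_value=None):
        self.table = dict(zip(vocabulary_keys, vocabulary_ids))
        self.num_oov_buckets = num_oov_buckets
        self.default_value = default_value
        # Assumption: OOV buckets are numbered right after the largest known id.
        self.oov_offset = max(vocabulary_ids) + 1

    def __call__(self, input_text):
        indices = []
        for text in input_text:
            if text in self.table:
                indices.append(self.table[text])
            elif self.num_oov_buckets:
                # Deterministic hash, so an unknown string always lands in the same bucket.
                bucket = zlib.crc32(text.encode("utf-8")) % self.num_oov_buckets
                indices.append(self.oov_offset + bucket)
            else:
                # No OOV buckets configured: every unknown string gets default_value.
                indices.append(self.default_value)
        return indices


keys, ids = ["AA", "aa", "BB"], [0, 0, 1]
with_buckets = SimpleVocabLookup(keys, ids, num_oov_buckets=2)
with_default = SimpleVocabLookup(keys, ids, default_value=-1)

bucket_indices = with_buckets(["aa", "BB", "unseen"])   # unknown string maps to 2 or 3
default_indices = with_default(["aa", "BB", "unseen"])  # [0, 1, -1]
```

With OOV buckets, different unknown strings can receive different indices and therefore different embeddings; with a default value, all unknown strings collapse onto one index.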
-
-
ml4ir.base.features.feature_fns.categorical.get_vocabulary_info(feature_info, file_io)¶ Extract the vocabulary (encoding and values) from the stated vocabulary_file inside feature_info.
Parameters: - feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig
- file_io (FileIO object) – FileIO handler object for reading and writing files
Returns: - vocabulary_keys (list) – values of the vocabulary stated in the vocabulary_file.
- vocabulary_ids (list) – corresponding encoding ids (values of the vocabulary_keys).
Notes
- Args under feature_layer_info
- vocabulary_file : str
- path to vocabulary CSV file for the input tensor containing the vocabulary to look up. Uses the column named “key” as the vocabulary, or the 1st column if no “key” column is present.
- max_length : int
- max number of rows to consider from the vocabulary file. If null, considers the entire file vocabulary.
- default_value : int
- default value stated in the configuration, used to replace missing data points.
Sequence Feature Transformations
-
ml4ir.base.features.feature_fns.sequence.bytes_sequence_to_encoding_bilstm(feature_tensor, feature_info, file_io: ml4ir.base.io.file_io.FileIO)¶ Encode a string tensor into an encoding. Works by converting the string into a bytes sequence and then generating a categorical/char embedding for each of the 256 bytes. The char/byte embeddings are then combined using a biLSTM
Parameters: - feature_tensor (Tensor object) – String feature tensor that is to be encoded
- feature_info (dict) – Dictionary representing the feature_config for the input feature
- file_io (FileIO object) – FileIO handler object for reading and writing
Returns: Encoded feature tensor
Return type: Tensor object
Notes
- Args under feature_layer_info:
- max_length : int
- max length of bytes sequence
- embedding_size : int
- dimension size of the embedding; if null, then the tensor is just converted to its one-hot representation
- encoding_size : int
- dimension size of the sequence encoding computed using a biLSTM
The input dimension for the embedding is fixed to 256 because the string is converted into a bytes sequence.
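The string-to-bytes step that precedes the embedding can be sketched as below. It assumes UTF-8 encoding, truncation to max_length and right-padding with 0; these details, and the helper name string_to_byte_indices, are assumptions of this example and are not confirmed by the text above.

```python
def string_to_byte_indices(text, max_length, pad_value=0):
    """Convert a string into a fixed-length list of byte values in [0, 255]."""
    # Truncate to max_length bytes, then right-pad with pad_value.
    raw = text.encode("utf-8")[:max_length]
    return list(raw) + [pad_value] * (max_length - len(raw))


short = string_to_byte_indices("abc", 5)          # [97, 98, 99, 0, 0]
truncated = string_to_byte_indices("abcdefgh", 5) # [97, 98, 99, 100, 101]
accented = string_to_byte_indices("é", 3)         # [195, 169, 0]: two UTF-8 bytes
```

Because every element is a byte value, 256 distinct inputs are possible, which matches the fixed embedding input dimension of 256.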
-
ml4ir.base.features.feature_fns.sequence.get_bilstm_encoding(embedding, lstm_units, kernel_initializer='glorot_uniform')¶ Convert sequence into encoding by passing through bidirectional LSTM
Parameters: - embedding (Tensor object) – Sequence tensor with representations for each time step
- lstm_units (int) – Number of units in the LSTM
- kernel_initializer (str) – Any supported tf.keras.initializers e.g., ‘ones’, ‘glorot_uniform’, ‘lecun_normal’ …
Returns: Encoded feature tensor
Return type: Tensor object