Feature Processing
ml4ir.base.features.preprocessing.preprocess_text
String preprocessing function that removes punctuation and converts strings to lower case based on the arguments.
Parameters: - feature_tensor (Tensor object) – Input feature tensor of type tf.string.
- remove_punctuation (bool) – Whether to remove punctuation characters from strings.
- to_lower (bool) – Whether to convert strings to lower case.
- punctuation (str) – Punctuation characters to replace (a single string containing the characters to remove).
- replace_with_whitespace (bool) – If True, punctuation is replaced by whitespace (i.e. used as a separator). Note that leading and trailing whitespace is then also removed, as well as consecutive whitespace characters.
Returns: Processed string tensor
Return type: Tensor object
Examples
- Input:
>>> feature_tensor = "ABCabc123,,,"
>>> remove_punctuation = True
>>> to_lower = True
- Output:
>>> "abcabc123"
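The behavior described above can be sketched on plain Python strings. This is a minimal illustration, not ml4ir's implementation: the real function operates on tf.string tensors, and the helper name `preprocess_text_sketch`, its default values (including the use of `string.punctuation`), and the reading of "consecutive whitespaces" as "collapse runs to one space" are assumptions made for this example.

```python
import re
import string


def preprocess_text_sketch(
    text,
    remove_punctuation=False,
    to_lower=False,
    punctuation=string.punctuation,  # assumed default, not taken from ml4ir
    replace_with_whitespace=False,
):
    """Plain-Python sketch of the documented behavior (hypothetical helper)."""
    if remove_punctuation and punctuation:
        # Build a character class from the single punctuation string.
        pattern = "[" + re.escape(punctuation) + "]"
        replacement = " " if replace_with_whitespace else ""
        text = re.sub(pattern, replacement, text)
        if replace_with_whitespace:
            # Assumption: collapse whitespace runs to one space, then trim.
            text = re.sub(r"\s+", " ", text).strip()
    if to_lower:
        text = text.lower()
    return text


print(preprocess_text_sketch("ABCabc123,,,", remove_punctuation=True, to_lower=True))
# prints: abcabc123
```

With `replace_with_whitespace=True`, the same sketch turns `"Hello,World!!"` into `"Hello World"`, which shows punctuation acting as a separator rather than being dropped.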
ml4ir.base.features.preprocessing.get_one_hot_label_vectorizer(feature_info, file_io: ml4ir.base.io.file_io.FileIO)
Returns a tf function to convert categorical string labels to a one hot encoding.
Parameters: - feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig. See categorical_indicator_with_vocabulary_file; here it is used to read a vocabulary file to create the one hot encoding.
- file_io (FileIO) – FileIO object required to load the vocabulary file.
Returns: Function that converts labels into one hot vectors
Return type: function
Examples
- Input:
>>> feature_tensor = ["abc", "xyz", "abc"]
>>> vocabulary file
>>> abc -> 0
>>> xyz -> 1
>>> def -> 2
- Output:
>>> [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
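The vocabulary lookup in the example above can be sketched in plain Python. This is an illustration under stated assumptions, not the ml4ir implementation: the vocabulary is passed as an in-memory dict instead of being read from a file through FileIO, the helper name `make_one_hot_vectorizer_sketch` is hypothetical, and out-of-vocabulary labels are not handled.

```python
def make_one_hot_vectorizer_sketch(vocabulary):
    """Return a function mapping labels to one-hot vectors (hypothetical helper).

    `vocabulary` maps each label string to its index. It stands in for the
    vocabulary file that the documented function loads via FileIO.
    """
    depth = len(vocabulary)

    def one_hot_vectorizer(labels):
        vectors = []
        for label in labels:
            vector = [0] * depth
            # A label missing from the vocabulary raises KeyError here;
            # how ml4ir treats such labels is not covered by this sketch.
            vector[vocabulary[label]] = 1
            vectors.append(vector)
        return vectors

    return one_hot_vectorizer


vectorize = make_one_hot_vectorizer_sketch({"abc": 0, "xyz": 1, "def": 2})
print(vectorize(["abc", "xyz", "abc"]))
# prints: [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

Like the documented function, the sketch returns a function rather than the vectors themselves, so the vocabulary is read once and reused for every batch of labels.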
ml4ir.base.features.preprocessing.split_and_pad_string
String preprocessing function that splits and pads a sequence based on the max_length.
Parameters: - feature_tensor (Tensor object) – Input feature tensor of type tf.string.
- split_char (str) – String separator to split the string input.
- max_length (int) – max length of the sequence produced after padding.
Returns: Processed string tensor
Return type: Tensor object
Examples
- Input:
>>> feature_tensor = "AAA,BBB,CCC"
>>> split_char = ","
>>> max_length = 5
- Output:
>>> ['AAA', 'BBB', 'CCC', '', '']
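The split-and-pad step can be sketched on a plain Python string. This is a minimal illustration rather than the ml4ir implementation, which works on tf.string tensors. The helper name `split_and_pad_string_sketch`, the default argument values, and the truncation of inputs longer than `max_length` are assumptions made for this example.

```python
def split_and_pad_string_sketch(text, split_char=",", max_length=20):
    """Split `text` on `split_char` and pad with empty strings (hypothetical helper)."""
    # Assumption: sequences longer than max_length are truncated.
    tokens = text.split(split_char)[:max_length]
    # Pad on the right with empty strings up to max_length.
    return tokens + [""] * (max_length - len(tokens))


print(split_and_pad_string_sketch("AAA,BBB,CCC", split_char=",", max_length=5))
# prints: ['AAA', 'BBB', 'CCC', '', '']
```

Padding to a fixed length gives every example the same sequence shape, which is what allows variable-length strings to be batched together.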
ml4ir.base.features.preprocessing.natural_log
Compute the natural log of the feature_tensor after adding shift to it element-wise.
Parameters: - feature_tensor (Tensor object) – input feature tensor of type tf.float32
- shift (int) – Shift that is added to the feature tensor element-wise before computing the natural log (used to handle 0 values).
Examples
- Input:
>>> feature_tensor = [10, 0]
>>> shift = 1
- Output:
>>> [2.40, 0.]
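The shifted logarithm can be checked with a plain-Python sketch. This is an illustration, not the ml4ir implementation, which operates on tf.float32 tensors. The helper name `natural_log_sketch` and the default `shift=1.0` are assumptions made for this example.

```python
import math


def natural_log_sketch(values, shift=1.0):
    """Compute ln(value + shift) for each value (hypothetical helper)."""
    # math.log raises ValueError when value + shift <= 0, so the shift
    # must be large enough to keep every shifted value positive.
    return [math.log(value + shift) for value in values]


result = natural_log_sketch([10, 0], shift=1)
print([f"{x:.3f}" for x in result])
# prints: ['2.398', '0.000']
```

With `shift = 1`, a zero input maps to ln(1) = 0 instead of negative infinity, which is the case the shift is meant to handle.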