Feature Processing

ml4ir.base.features.preprocessing.preprocess_text

String preprocessing function that removes punctuation and converts strings to lower case based on the arguments.

Parameters:
  • feature_tensor (Tensor object) – Input feature tensor of type tf.string.
  • remove_punctuation (bool) – Whether to remove punctuation characters from strings.
  • to_lower (bool) – Whether to convert strings to lower case.
  • punctuation (str) – Punctuation characters to remove, given as a single string containing all of the characters.
  • replace_with_whitespace (bool) – If True, punctuation is replaced by whitespace (i.e. used as a separator). In that case, leading and trailing whitespace is also removed and consecutive whitespace is collapsed.
Returns:

Processed string tensor

Return type:

Tensor object

Examples

Input:
>>> feature_tensor = "ABCabc123,,,"
>>> remove_punctuation = True
>>> to_lower = True
Output:
>>> "abcabc123"
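
The behaviour described above can be illustrated with a minimal plain-Python sketch. It works on an ordinary str rather than a tf.string tensor and is not the ml4ir implementation; the name preprocess_text_sketch and the default values (including string.punctuation) are assumptions made for illustration.

```python
import re
import string


def preprocess_text_sketch(
    text,
    remove_punctuation=False,
    to_lower=False,
    punctuation=string.punctuation,  # assumed default, not confirmed by this page
    replace_with_whitespace=False,
):
    """Plain-Python illustration of the documented behaviour (hypothetical helper)."""
    if remove_punctuation and punctuation:
        # Build a character class from the punctuation string, escaping specials.
        pattern = "[" + re.escape(punctuation) + "]"
        replacement = " " if replace_with_whitespace else ""
        text = re.sub(pattern, replacement, text)
        if replace_with_whitespace:
            # Collapse runs of whitespace and trim both ends.
            text = re.sub(r"\s+", " ", text).strip()
    if to_lower:
        text = text.lower()
    return text


print(preprocess_text_sketch("ABCabc123,,,", remove_punctuation=True, to_lower=True))
print(preprocess_text_sketch("Hello,World!!", remove_punctuation=True,
                             replace_with_whitespace=True))
```

The first call reproduces the example above; the second shows punctuation acting as a separator when replace_with_whitespace is True.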
ml4ir.base.features.preprocessing.get_one_hot_label_vectorizer(feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Returns a tf function to convert categorical string labels to a one hot encoding.

Parameters:
  • feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig. See categorical_indicator_with_vocabulary_file; here it is used to read a vocabulary file to create the one hot encoding.
  • file_io (FileIO) – FileIO object required to load the vocabulary file.
Returns:

Function that converts labels into one hot vectors

Return type:

function

Examples

Input:
>>> feature_tensor = ["abc", "xyz", "abc"]
>>> vocabulary file
>>>    abc -> 0
>>>    xyz -> 1
>>>    def -> 2
Output:
>>> [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
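
As a rough illustration of the pattern (a factory that returns a vectorizing function), here is a plain-Python sketch. It takes an in-memory dict in place of feature_info and file_io, so make_one_hot_vectorizer and its argument are hypothetical and do not reflect the ml4ir signature. Mapping unknown labels to an all-zero row is also an assumption; this page does not state how they are handled.

```python
def make_one_hot_vectorizer(vocabulary):
    """Return a function mapping string labels to one-hot rows (illustrative only)."""
    depth = len(vocabulary)

    def vectorize(labels):
        vectors = []
        for label in labels:
            row = [0] * depth
            # Assumption: labels missing from the vocabulary stay all zeros.
            index = vocabulary.get(label)
            if index is not None:
                row[index] = 1
            vectors.append(row)
        return vectors

    return vectorize


# Same vocabulary as in the example above.
vocabulary = {"abc": 0, "xyz": 1, "def": 2}
one_hot = make_one_hot_vectorizer(vocabulary)
print(one_hot(["abc", "xyz", "abc"]))  # [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
```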
ml4ir.base.features.preprocessing.split_and_pad_string

String preprocessing function that splits and pads a sequence based on the max_length.

Parameters:
  • feature_tensor (Tensor object) – Input feature tensor of type tf.string.
  • split_char (str) – String separator to split the string input.
  • max_length (int) – Maximum length of the sequence produced after padding.
Returns:

Processed string tensor

Return type:

Tensor object

Examples

Input:
>>> feature_tensor = "AAA,BBB,CCC"
>>> split_char = ","
>>> max_length = 5
Output:
>>> ['AAA', 'BBB', 'CCC', '', '']
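
The split-then-pad step can be sketched in plain Python as follows. This operates on a str rather than a tensor and is not the ml4ir implementation; split_and_pad_sketch is a hypothetical name. Truncating inputs that yield more than max_length tokens is an assumption, since this page only describes padding.

```python
def split_and_pad_sketch(text, split_char, max_length):
    """Split text on split_char and right-pad with empty strings (illustrative only)."""
    tokens = text.split(split_char)
    # Assumption: sequences longer than max_length are truncated.
    tokens = tokens[:max_length]
    # Pad on the right with empty strings up to max_length.
    return tokens + [""] * (max_length - len(tokens))


print(split_and_pad_sketch("AAA,BBB,CCC", ",", 5))  # ['AAA', 'BBB', 'CCC', '', '']
```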
ml4ir.base.features.preprocessing.natural_log

Computes the natural log of the feature_tensor after adding shift element-wise.

Parameters:
  • feature_tensor (Tensor object) – Input feature tensor of type tf.float32.
  • shift (float) – Floating point shift that is added to the feature tensor element-wise before computing the natural log (used to handle 0 values).

Examples

Input:
>>> feature_tensor = [10, 0]
>>> shift = 1
Output:
>>> [2.40, 0.]
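
The arithmetic is ln(x + shift) applied to each element, so with shift = 1 the input 10 maps to ln(11) and the input 0 maps to ln(1) = 0. A minimal plain-Python sketch, using a list in place of a tensor (natural_log_sketch is a hypothetical name, not the ml4ir function):

```python
import math


def natural_log_sketch(values, shift=1.0):
    """Element-wise ln(value + shift) over a list of numbers (illustrative only)."""
    return [math.log(value + shift) for value in values]


result = natural_log_sketch([10, 0], shift=1)
print([round(x, 3) for x in result])  # [2.398, 0.0]
```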