Feature Processing

ml4ir.base.features.preprocessing.preprocess_text

String preprocessing function that removes punctuation and converts strings to lower case based on the arguments.

Parameters:	feature_tensor (Tensor object) – Input feature tensor of type tf.string.
remove_punctuation (bool) – Whether to remove punctuation characters from strings.
to_lower (bool) – Whether to convert strings to lower case.
punctuation (str) – Punctuation characters to remove or replace (a single string containing all the characters to remove).
replace_with_whitespace (bool) – If True, punctuation is replaced by whitespace (i.e. used as a separator). Note that leading and trailing whitespace is then also removed, and runs of consecutive whitespace are collapsed.

Returns:	Processed string tensor
Return type:	Tensor object

Examples

Input:

>>> feature_tensor = "ABCabc123,,,"
>>> remove_punctuation = True
>>> to_lower = True

Output:

>>> "abcabc123"
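The behaviour described above can be illustrated with a minimal plain-Python sketch. This is not the ml4ir implementation, which operates on tf.string tensors. The name preprocess_text_sketch, the default of string.punctuation, and the collapsing of repeated whitespace into a single space are assumptions made for illustration.

```python
import re
import string


def preprocess_text_sketch(
    text,
    remove_punctuation=False,
    to_lower=False,
    punctuation=string.punctuation,  # assumed default, not taken from ml4ir
    replace_with_whitespace=False,
):
    """Plain-Python illustration of the documented preprocessing steps."""
    if remove_punctuation:
        # Either drop each punctuation character or turn it into a separator.
        replacement = " " if replace_with_whitespace else ""
        text = text.translate({ord(ch): replacement for ch in punctuation})
        if replace_with_whitespace:
            # Assumption: runs of whitespace collapse to one space,
            # and leading/trailing whitespace is stripped.
            text = re.sub(r"\s+", " ", text).strip()
    if to_lower:
        text = text.lower()
    return text


print(preprocess_text_sketch("ABCabc123,,,", remove_punctuation=True, to_lower=True))
# abcabc123
print(preprocess_text_sketch("Hello,World!!", remove_punctuation=True,
                             to_lower=True, replace_with_whitespace=True))
# hello world
```

With replace_with_whitespace=True the punctuation acts as a token separator, so "Hello,World!!" keeps its word boundary instead of becoming "helloworld".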

ml4ir.base.features.preprocessing.get_one_hot_label_vectorizer(feature_info, file_io: ml4ir.base.io.file_io.FileIO)

Returns a tf function to convert categorical string labels to a one hot encoding.

Parameters:	feature_info (dict) – Dictionary representing the configuration parameters for the specific feature from the FeatureConfig. See categorical_indicator_with_vocabulary_file; here it is used to read a vocabulary file to create the one hot encoding.
file_io (FileIO) – FileIO object required to load the vocabulary file.
Returns:	Function that converts labels into one hot vectors
Return type:	function

Examples

Input:

>>> feature_tensor = ["abc", "xyz", "abc"]
>>> vocabulary file
>>>    abc -> 0
>>>    xyz -> 1
>>>    def -> 2

Output:

>>> [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
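The mapping from labels to one hot vectors can be sketched in plain Python as a function that returns a function, mirroring the shape of the API above. This is only an illustration: the name make_one_hot_vectorizer is hypothetical, the vocabulary is passed as an in-memory dict rather than read from a file through FileIO, and labels missing from the vocabulary are not handled.

```python
def make_one_hot_vectorizer(vocabulary):
    """Return a function that one hot encodes labels.

    `vocabulary` maps each label to an index in 0..len(vocabulary)-1.
    (Hypothetical helper; ml4ir loads the vocabulary from a file instead.)
    """
    num_classes = len(vocabulary)

    def vectorize(labels):
        vectors = []
        for label in labels:
            row = [0] * num_classes
            row[vocabulary[label]] = 1  # raises KeyError for unknown labels
            vectors.append(row)
        return vectors

    return vectorize


vectorizer = make_one_hot_vectorizer({"abc": 0, "xyz": 1, "def": 2})
print(vectorizer(["abc", "xyz", "abc"]))
# [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

The width of each vector equals the vocabulary size, which is why the example output has three columns even though "def" never appears in the input.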

ml4ir.base.features.preprocessing.split_and_pad_string

String preprocessing function that splits and pads a sequence based on the max_length.

Parameters:	feature_tensor (Tensor object) – Input feature tensor of type tf.string.
split_char (str) – String separator used to split the string input.
max_length (int) – Maximum length of the sequence produced after padding.
Returns:	processed float tensor
Return type:	Tensor object

Examples

Input:

>>> feature_tensor = "AAA,BBB,CCC"
>>> split_char = ","
>>> max_length = 5

Output:

>>> ['AAA', 'BBB', 'CCC', '', '']
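A plain-Python sketch of the split-and-pad step is shown here for a single string. It is an illustration rather than the ml4ir implementation, which works on tensors. The name split_and_pad_sketch is hypothetical, and truncating inputs with more than max_length tokens is an assumption, since the text above only describes padding.

```python
def split_and_pad_sketch(text, split_char=",", max_length=5):
    """Split `text` on `split_char` and pad with empty strings to `max_length`."""
    # Assumption: sequences longer than max_length are truncated.
    tokens = text.split(split_char)[:max_length]
    padding = [""] * (max_length - len(tokens))
    return tokens + padding


print(split_and_pad_sketch("AAA,BBB,CCC", split_char=",", max_length=5))
# ['AAA', 'BBB', 'CCC', '', '']
```

Padding to a fixed length gives every example the same sequence shape, which is what allows the results to be batched together.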

ml4ir.base.features.preprocessing.natural_log

Computes the natural log of the feature_tensor after adding shift to it element-wise.

Parameters:	feature_tensor (Tensor object) – Input feature tensor of type tf.float32.
shift (int) – Floating point shift that is added to the feature tensor element-wise before computing the natural log (used to handle 0 values).

Examples

Input:

>>> feature_tensor = [10, 0]
>>> shift = 1

Output:

>>> [2.398, 0.]
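The arithmetic behind this example is ln(10 + 1) ≈ 2.3979 and ln(0 + 1) = 0, which is consistent with the output above up to rounding. A minimal plain-Python sketch follows, using math.log on a list instead of a tensor. The name natural_log_sketch is hypothetical.

```python
import math


def natural_log_sketch(values, shift=1.0):
    """Add `shift` to every element, then take the natural log."""
    return [math.log(v + shift) for v in values]


result = natural_log_sketch([10, 0], shift=1)
print([round(x, 4) for x in result])
# [2.3979, 0.0]
```

Without the shift, the 0 entry would be undefined under the logarithm; with shift = 1 it maps to exactly 0.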