Data Loaders and Helpers¶
RelevanceDataset¶
- class ml4ir.base.data.relevance_dataset.RelevanceDataset(data_dir: str, data_format: str, feature_config: ml4ir.base.features.feature_config.FeatureConfig, tfrecord_type: str, file_io: ml4ir.base.io.file_io.FileIO, max_sequence_size: int = 0, batch_size: int = 128, preprocessing_keys_to_fns: dict = {}, train_pcent_split: float = 0.8, val_pcent_split: float = -1, test_pcent_split: float = -1, use_part_files: bool = False, parse_tfrecord: bool = True, logger: Optional[logging.Logger] = None, keep_additional_info: int = 0, non_zero_features_only: int = 0, output_name: str = None)¶
Bases: object
Class to create/load a TFRecordDataset for the train, validation and test splits
Constructor method to instantiate a RelevanceDataset object; loads and creates the TFRecordDataset for the train, validation and test splits
Parameters: - data_dir (str) – path to the directory containing train, validation and test data
- data_format ({"tfrecord", "csv", "libsvm"}) – type of data files to be converted into TFRecords and loaded as a TFRecordDataset
- feature_config (FeatureConfig object) – FeatureConfig object that defines the features to be loaded in the dataset and the preprocessing functions to be applied to each of them
- tfrecord_type ({"example", "sequence_example"}) – Type of the TFRecord protobuf message to be used for TFRecordDataset
- file_io (FileIO object) – file I/O handler objects for reading and writing data
- max_sequence_size (int, optional) – maximum number of sequence records to be used in a single SequenceExample proto message; the data will be appropriately padded or clipped to fit the max value specified
- batch_size (int, optional) – size of each data batch
- preprocessing_keys_to_fns (dict of (str, function), optional) – dictionary of function names mapped to function definitions that can be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
- train_pcent_split (float, optional) – ratio of overall data to be used as training set
- val_pcent_split (float, optional) – ratio of overall data to be used as validation set
- test_pcent_split (float, optional) – ratio of overall data to be used as test set
- use_part_files (bool, optional) – load dataset from part files checked using “part-” prefix
- parse_tfrecord (bool, optional) – parse the TFRecord string from the dataset; returns strings as is otherwise
- logger (Logger, optional) – logging handler for status messages
- output_name (str) – The name of tensorflow's output node which carries the prediction score
Notes
- Currently supports CSV, TFRecord and Libsvm data formats
- Does not support automatically splitting train, validation and test
- data_dir should contain train, validation and test directories with files within them
- create_dataset(parse_tfrecord=True)¶ Loads and creates train, validation and test datasets
Parameters: parse_tfrecord (bool) – parse the TFRecord string from the dataset; returns strings as is otherwise
- balance_classes()¶ Balance class labels in the train dataset
- train_val_test_split()¶ Split the dataset into train, validation and test
tfrecord_reader¶
- class ml4ir.base.data.tfrecord_reader.TFRecordParser(feature_config: ml4ir.base.features.feature_config.FeatureConfig, preprocessing_map: ml4ir.base.features.preprocessing.PreprocessingMap, required_fields_only: Optional[bool] = False)¶
Bases: object
Base class for parsing TFRecord examples. This class consolidates the parsing and feature extraction pipeline for both Example and SequenceExample protobuf messages
Constructor method for instantiating a TFRecordParser object
Parameters: - feature_config (FeatureConfig) – FeatureConfig object defining context and sequence feature information
- preprocessing_map (PreprocessingMap object) – Object mapping preprocessing feature function names to their definitions
- required_fields_only (bool, optional) – Whether to only use required fields from the feature_config
- get_features_spec()¶ Define the features spec from the feature_config. The features spec will be used to parse the serialized TFRecord
Returns: feature specification dictionary that can be used to parse TFRecords Return type: dict
Notes
For SequenceExample messages, this method returns a pair of dictionaries, one each for context and sequence features.
- extract_features_from_proto(proto)¶ Parse the serialized proto string to extract features
Parameters: proto (tf.Tensor) – A scalar string tensor that is the serialized form of a TFRecord object Returns: Dictionary of features extracted from the proto as per the features_spec Return type: dict of Tensors
Notes
For SequenceExample proto messages, this function returns two dictionaries, one for context and another for sequence feature tensors. For Example proto messages, this function returns a single dictionary of feature tensors.
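The context/sequence split for SequenceExample messages can be illustrated with plain TensorFlow. This is a minimal sketch, not ml4ir's implementation: the feature names (`query_text`, `popularity`) are made up, and in ml4ir the specs come from the FeatureConfig.

```python
import tensorflow as tf

# Build a toy SequenceExample: one context feature and one sequence feature.
seq_example = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        "query_text": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[b"red shoes"]))
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        "popularity": tf.train.FeatureList(feature=[
            tf.train.Feature(float_list=tf.train.FloatList(value=[v]))
            for v in (0.3, 0.7, 0.1)
        ])
    }),
)

# Feature specs analogous to the pair of dictionaries that
# get_features_spec() returns for SequenceExample messages.
context_spec = {"query_text": tf.io.FixedLenFeature([], tf.string)}
sequence_spec = {"popularity": tf.io.FixedLenSequenceFeature([], tf.float32)}

# Parsing yields two dictionaries: context tensors and sequence tensors.
context, sequence = tf.io.parse_single_sequence_example(
    seq_example.SerializeToString(),
    context_features=context_spec,
    sequence_features=sequence_spec,
)
```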
- get_default_tensor(feature_info, sequence_size=0)¶ Get the default tensor for a given feature configuration
Parameters: - feature_info (dict) – Feature configuration information for the feature as specified in the feature_config
- sequence_size (int, optional) – Number of elements in the sequence of a SequenceExample
Returns: Tensor object that can be used as a default tensor if the expected feature is missing from the TFRecord
Return type: tf.Tensor
- get_feature(feature_info, extracted_features, sequence_size=0)¶ Fetch the feature from the feature dictionary of extracted features
Parameters: - feature_info (dict) – Feature configuration information for the feature as specified in the feature_config
- extracted_features (dict) – Dictionary of feature tensors extracted by parsing the serialized TFRecord
- sequence_size (int, optional) – Number of elements in the sequence of a SequenceExample
Returns: Feature tensor that is obtained from the extracted features for the given feature_info
Return type: tf.Tensor
- generate_and_add_mask(extracted_features, features_dict)¶ Create a mask to identify padded values
Parameters: - extracted_features (dict) – Dictionary of tensors extracted from the serialized TFRecord
- features_dict (dict) – Dictionary of tensors that will be used for model training/serving as inputs to the model
Returns: - features_dict (dict) – Dictionary of tensors that will be used for model training/serving updated with the mask tensor if applicable
- sequence_size (int) – Number of elements in the sequence of the TFRecord
- pad_feature(feature_tensor, feature_info)¶ Pad the feature to the max_sequence_size in order to create uniform data batches for training
Parameters: - feature_tensor (tf.Tensor) – Feature tensor to be padded
- feature_info (dict) – Feature configuration information for the feature as specified in the feature_config
Returns: Feature tensor padded to the max_sequence_size Return type: tf.Tensor
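The padding and masking behaviour can be sketched with plain TensorFlow. This is an illustration, not ml4ir's implementation; the tensor values and max_sequence_size here are made up.

```python
import tensorflow as tf

# A sequence feature with 3 records, to be padded to max_sequence_size=5
# so that all records in a batch share a uniform shape.
max_sequence_size = 5
feature_tensor = tf.constant([0.3, 0.7, 0.1])
pad_len = max_sequence_size - int(feature_tensor.shape[0])

# Right-pad the feature with zeros up to max_sequence_size.
padded = tf.pad(feature_tensor, [[0, pad_len]])

# Build a mask marking real positions (1) vs padded positions (0),
# analogous to the mask that generate_and_add_mask adds for training.
mask = tf.concat(
    [tf.ones(feature_tensor.shape[0]), tf.zeros(pad_len)], axis=0)
```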
- preprocess_feature(feature_tensor, feature_info)¶ Preprocess feature based on the feature configuration
Parameters: - feature_tensor (tf.Tensor) – input feature tensor to be preprocessed
- feature_info (dict) – Feature configuration information for the feature as specified in the feature_config
Returns: preprocessed tensor object
Return type: tf.Tensor
Notes
Only preprocessing functions that are part of the preprocessing_map can be used here for preprocessing at data loading time.
Custom preprocessing functions can be passed via the preprocessing_keys_to_fns argument when instantiating the RelevanceDataset object.
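A custom preprocessing function might look like the following sketch. The function name `preprocess_text` and its registration key are hypothetical; only the registration mechanism via preprocessing_keys_to_fns is from the docs above.

```python
import tensorflow as tf

def preprocess_text(feature_tensor):
    """Lowercase and strip whitespace from a string feature tensor."""
    return tf.strings.strip(tf.strings.lower(feature_tensor))

processed = preprocess_text(tf.constant("  Red SHOES "))
```

It would then be registered when instantiating the RelevanceDataset, e.g. `preprocessing_keys_to_fns={"preprocess_text": preprocess_text}`, and referenced by that key from the FeatureConfig.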
- get_parse_fn() → tensorflow.python.eager.def_function.function¶ Define a parsing function that will be used to load the TFRecordDataset and create input features for the model.
Returns: Parsing function that takes in a serialized TFRecord protobuf message and extracts a dictionary of feature tensors Return type: tf.function
Notes
This function will also be used with the TFRecord serving signature in the saved model.
- class ml4ir.base.data.tfrecord_reader.TFRecordExampleParser(feature_config: ml4ir.base.features.feature_config.FeatureConfig, preprocessing_map: ml4ir.base.features.preprocessing.PreprocessingMap, required_fields_only: Optional[bool] = False)¶
Bases: ml4ir.base.data.tfrecord_reader.TFRecordParser
Class for parsing Example TFRecord protobuf messages
Constructor method for instantiating a TFRecordExampleParser object
Parameters: - feature_config (FeatureConfig) – FeatureConfig object defining context and sequence feature information
- preprocessing_map (PreprocessingMap object) – Object mapping preprocessing feature function names to their definitions
- required_fields_only (bool, optional) – Whether to only use required fields from the feature_config
- get_features_spec()¶ Define the features spec from the feature_config. This will be used to parse the serialized TFRecord
Returns: feature specification dictionary that can be used to parse TFRecords Return type: dict
- extract_features_from_proto(serialized)¶ Parse the serialized proto string to extract features
Parameters: serialized (tf.Tensor) – A scalar string tensor that is the serialized form of a TFRecord object Returns: Dictionary of features extracted from the proto as per the features_spec Return type: dict of Tensors
- get_default_tensor(feature_info, sequence_size=0)¶ Get the default tensor for a given feature configuration
Parameters: - feature_info (dict) – Feature configuration information for the feature as specified in the feature_config
- sequence_size (int, optional) – Number of elements in the sequence of a SequenceExample
Returns: Tensor object that can be used as a default tensor if the expected feature is missing from the TFRecord
Return type: tf.Tensor
- get_feature(feature_info, extracted_features, sequence_size=0)¶ Fetch the feature from the feature dictionary of extracted features
Parameters: - feature_info (dict) – Feature configuration information for the feature as specified in the feature_config
- extracted_features (dict) – Dictionary of feature tensors extracted by parsing the serialized TFRecord
- sequence_size (int, optional) – Number of elements in the sequence of a SequenceExample
Returns: Feature tensor that is obtained from the extracted features for the given feature_info
Return type: tf.Tensor
- generate_and_add_mask(extracted_features, features_dict)¶ Create a mask to identify padded values
Parameters: - extracted_features (dict) – Dictionary of tensors extracted from the serialized TFRecord
- features_dict (dict) – Dictionary of tensors that will be used for model training/serving as inputs to the model
Returns: - features_dict (dict) – Dictionary of tensors that will be used for model training/serving updated with the mask tensor if applicable
- sequence_size (int) – Number of elements in the sequence of the TFRecord
- pad_feature(feature_tensor, feature_info)¶ Pad the feature to the max_sequence_size in order to create uniform data batches for training
Parameters: - feature_tensor (tf.Tensor) – Feature tensor to be padded
- feature_info (dict) – Feature configuration information for the feature as specified in the feature_config
Returns: Feature tensor padded to the max_sequence_size Return type: tf.Tensor
- class ml4ir.base.data.tfrecord_reader.TFRecordSequenceExampleParser(feature_config: ml4ir.base.features.feature_config.FeatureConfig, preprocessing_map: ml4ir.base.features.preprocessing.PreprocessingMap, required_fields_only: Optional[bool] = False, pad_sequence: Optional[bool] = True, max_sequence_size: Optional[int] = 25, output_name: Optional[str] = None)¶
Bases: ml4ir.base.data.tfrecord_reader.TFRecordParser
Class for parsing SequenceExample TFRecord protobuf messages
Constructor method for instantiating a TFRecordSequenceExampleParser object
Parameters: - feature_config (FeatureConfig) – FeatureConfig object defining context and sequence feature information
- preprocessing_map (PreprocessingMap object) – Object mapping preprocessing feature function names to their definitions
- required_fields_only (bool, optional) – Whether to only use required fields from the feature_config
- pad_sequence (bool, optional) – Whether to pad the sequence features
- max_sequence_size (int, optional) – Maximum number of sequence records per query. Used for padding
- output_name (str) – The name of tensorflow's output node which carries the prediction score
- get_features_spec()¶ Define the features spec from the feature_config. This will be used to parse the serialized TFRecord
Returns: - dict – Feature specification dictionary that can be used to parse Context features from the serialized SequenceExample
- dict – Feature specification dictionary that can be used to parse Sequence features (or feature lists) from the serialized SequenceExample
- extract_features_from_proto(serialized)¶ Parse the serialized proto string to extract features
Parameters: serialized (tf.Tensor) – A scalar string tensor that is the serialized form of a TFRecord object Returns: - dict of Tensors – Dictionary of context feature tensors extracted from the proto as per the features_spec
- dict of Tensors – Dictionary of sequence feature tensors extracted from the proto as per the features_spec
- get_default_tensor(feature_info, sequence_size)¶ Get the default tensor for a given feature configuration
Parameters: - feature_info (dict) – Feature configuration information for the feature as specified in the feature_config
- sequence_size (int) – Number of elements in the sequence of a SequenceExample
Returns: Tensor object that can be used as a default tensor if the expected feature is missing from the TFRecord Return type: tf.Tensor
- get_feature(feature_info, extracted_features, sequence_size)¶ Fetch the feature from the feature dictionary of extracted features
Parameters: - feature_info (dict) – Feature configuration information for the feature as specified in the feature_config
- extracted_features (dict) – Dictionary of feature tensors extracted by parsing the serialized TFRecord
- sequence_size (int) – Number of elements in the sequence of a SequenceExample
Returns: Feature tensor that is obtained from the extracted features for the given feature_info Return type: tf.Tensor
- generate_and_add_mask(extracted_features, features_dict)¶ Create a mask to identify padded values
Parameters: - extracted_features (dict) – Dictionary of tensors extracted from the serialized TFRecord
- features_dict (dict) – Dictionary of tensors that will be used for model training/serving as inputs to the model
Returns: - features_dict (dict) – Dictionary of tensors that will be used for model training/serving updated with the mask tensor if applicable
- sequence_size (int) – Number of elements in the sequence of the TFRecord
- pad_feature(feature_tensor, feature_info)¶ Pad the feature to the max_sequence_size in order to create uniform data batches for training
Parameters: - feature_tensor (tf.Tensor) – Feature tensor to be padded
- feature_info (dict) – Feature configuration information for the feature as specified in the feature_config
Returns: Feature tensor padded to the max_sequence_size Return type: tf.Tensor
- ml4ir.base.data.tfrecord_reader.get_parse_fn(tfrecord_type: str, feature_config: ml4ir.base.features.feature_config.FeatureConfig, preprocessing_keys_to_fns: dict, max_sequence_size: int = 0, required_fields_only: bool = False, pad_sequence: bool = True, output_name: str = None) → tensorflow.python.eager.def_function.function¶ Create a parsing function to extract features from serialized TFRecord data using the definition from the FeatureConfig
Parameters: - tfrecord_type ({"example", "sequence_example"}) – Type of TFRecord data to be loaded into a dataset
- feature_config (FeatureConfig object) – FeatureConfig object defining the features to be extracted
- preprocessing_keys_to_fns (dict of (str, function), optional) – dictionary of function names mapped to function definitions that can be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
- max_sequence_size (int) – Maximum number of sequence records per query. Used for padding
- required_fields_only (bool, optional) – Whether to only use required fields from the feature_config
- pad_sequence (bool) – Whether to pad the sequence features
- output_name (str) – The name of tensorflow's output node which carries the prediction score
Returns: Parsing function that takes in a serialized SequenceExample or Example message and extracts a dictionary of feature tensors
Return type: tf.function
- ml4ir.base.data.tfrecord_reader.read(data_dir: str, feature_config: ml4ir.base.features.feature_config.FeatureConfig, tfrecord_type: str, file_io: ml4ir.base.io.file_io.FileIO, max_sequence_size: int = 0, batch_size: int = 0, preprocessing_keys_to_fns: dict = {}, parse_tfrecord: bool = True, use_part_files: bool = False, logger: logging.Logger = None, **kwargs) → tensorflow.python.data.ops.readers.TFRecordDatasetV2¶ Extract features by reading and parsing TFRecord data and converting into a TFRecordDataset using the FeatureConfig
Parameters: - data_dir (str) – path to the directory containing train, validation and test data
- feature_config (FeatureConfig object) – FeatureConfig object that defines the features to be loaded in the dataset and the preprocessing functions to be applied to each of them
- tfrecord_type ({"example", "sequence_example"}) – Type of the TFRecord protobuf message to be used for TFRecordDataset
- file_io (FileIO object) – file I/O handler objects for reading and writing data
- max_sequence_size (int, optional) – maximum number of sequence records to be used in a single SequenceExample proto message; the data will be appropriately padded or clipped to fit the max value specified
- batch_size (int, optional) – size of each data batch
- preprocessing_keys_to_fns (dict of (str, function), optional) – dictionary of function names mapped to function definitions that can be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
- use_part_files (bool, optional) – load dataset from part files checked using “part-” prefix
- parse_tfrecord (bool, optional) – parse the TFRecord string from the dataset; returns strings as is otherwise
- logger (Logger, optional) – logging handler for status messages
Returns: TFRecordDataset loaded from the data_dir specified using the FeatureConfig
Return type: TFRecordDataset
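The pipeline that read() assembles can be approximated with plain TensorFlow. This is a sketch with a single made-up feature (`score`); ml4ir additionally drives the parsing function and preprocessing from the FeatureConfig.

```python
import os
import tempfile
import tensorflow as tf

# Write two toy Example protos to a .tfrecord file.
tfrecord_path = os.path.join(tempfile.mkdtemp(), "file_0.tfrecord")
with tf.io.TFRecordWriter(tfrecord_path) as writer:
    for score in (0.2, 0.9):
        example = tf.train.Example(features=tf.train.Features(feature={
            "score": tf.train.Feature(
                float_list=tf.train.FloatList(value=[score]))
        }))
        writer.write(example.SerializeToString())

# Read them back: TFRecordDataset -> parsing fn -> batching, the same
# shape of pipeline that read() builds from the FeatureConfig.
features_spec = {"score": tf.io.FixedLenFeature([], tf.float32)}

def parse_fn(proto):
    return tf.io.parse_single_example(proto, features_spec)

dataset = tf.data.TFRecordDataset(tfrecord_path).map(parse_fn).batch(2)
batch = next(iter(dataset))
```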
csv_reader¶
- ml4ir.base.data.csv_reader.read(data_dir: str, feature_config: ml4ir.base.features.feature_config.FeatureConfig, tfrecord_type: str, tfrecord_dir: str, file_io: ml4ir.base.io.file_io.FileIO, batch_size: int = 128, preprocessing_keys_to_fns: dict = {}, use_part_files: bool = False, max_sequence_size: int = 25, parse_tfrecord: bool = True, logger=None, **kwargs) → tensorflow.python.data.ops.readers.TFRecordDatasetV2¶ Create a TFRecordDataset from a directory of CSV files using the FeatureConfig
- Current execution plan:
- Load CSVs as pandas dataframes
- Convert each query into tf.train.SequenceExample protobufs
- Write the protobufs into a .tfrecord file
- Load .tfrecord file into a TFRecordDataset and parse the protobufs
Parameters: - data_dir (str) – Path to directory containing csv files to read
- feature_config (FeatureConfig object) – FeatureConfig object that defines the features to be loaded in the dataset and the preprocessing functions to be applied to each of them
- tfrecord_dir (str) – Path to directory where the serialized .tfrecord files will be stored
- batch_size (int) – value specifying the size of the data batch
- use_part_files (bool) – load dataset from part files checked using “part-” prefix
- max_sequence_size (int) – value specifying max number of records per query
- logger (Logger object) – logging handler to print and save status messages
Returns: tensorflow TFRecordDataset loaded from the CSV file
Return type: TFRecordDataset object
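The per-query conversion step of the execution plan above can be sketched with pandas and TensorFlow. The column names (`query_id`, `relevance`) are made up for illustration; in ml4ir they come from the FeatureConfig.

```python
import pandas as pd
import tensorflow as tf

# Toy data: two records for one query, as loaded from a CSV.
df = pd.DataFrame({
    "query_id": ["q1", "q1"],
    "relevance": [1.0, 0.0],
})

def df_to_sequence_example(group):
    """Convert one query group into a tf.train.SequenceExample:
    query-level columns become context features, record-level
    columns become feature lists."""
    context = tf.train.Features(feature={
        "query_id": tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[group["query_id"].iloc[0].encode()]))
    })
    feature_lists = tf.train.FeatureLists(feature_list={
        "relevance": tf.train.FeatureList(feature=[
            tf.train.Feature(float_list=tf.train.FloatList(value=[v]))
            for v in group["relevance"]
        ])
    })
    return tf.train.SequenceExample(context=context,
                                    feature_lists=feature_lists)

proto = df_to_sequence_example(df)
```

The resulting protos would then be serialized into a .tfrecord file and loaded back as a TFRecordDataset, per the remaining steps of the plan.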
tfrecord_writer¶
Writes data in Example or SequenceExample protobuf (tfrecords) format.
To use it as a standalone script, refer to the argument spec at the bottom
Notes
Setting --keep-single-files writes one tfrecord file for each CSV file (better performance). If not set, everything is joined into a single tfrecord file.
Examples
Syntax to convert a single or several CSVs:
>>> python ml4ir/base/data/tfrecord_writer.py \
... sequence_example|example \
... --csv-files <SPACE_SEPARATED_PATHS_TO_CSV_FILES> \
... --out-dir <PATH_TO_OUTPUT_DIR> \
... --feature-config <PATH_TO_YAML_FEATURE_CONFIG> \
... --keep-single-files
or to convert all CSV files in a dir
>>> python ml4ir/base/data/tfrecord_writer.py \
... sequence_example|example \
... --csv-dir <DIR_WITH_CSVs> \
... --out-dir <PATH_TO_OUTPUT_DIR> \
... --feature-config <PATH_TO_YAML_FEATURE_CONFIG> \
... --keep-single-files
Usage example:
>>> python ml4ir/base/data/tfrecord_writer.py \
... sequence_example \
... --csv-files /tmp/d.csv /tmp/d2.csv \
... --out-dir /tmp \
... --feature-config /tmp/fconfig.yaml \
... --keep-single-files
- ml4ir.base.data.tfrecord_writer.write_from_files(csv_files: List[str], tfrecord_file: str, feature_config: ml4ir.base.features.feature_config.FeatureConfig, tfrecord_type: str, file_io: ml4ir.base.io.file_io.FileIO, logger: logging.Logger = None)¶ Converts data from CSV files into tfrecord files
Parameters: - csv_files (list of str) – list of csv file paths to read data from
- tfrecord_file (str) – tfrecord file path to write the output
- feature_config (FeatureConfig) – FeatureConfig object that defines the features to be loaded in the dataset and the preprocessing functions to be applied to each of them
- tfrecord_type ({"example", "sequence_example"}) – Type of the TFRecord protobuf message to be used for TFRecordDataset
- file_io (FileIO object) – FileIO handler object for reading and writing files
- logger (Logger, optional) – logging handler for status messages
- ml4ir.base.data.tfrecord_writer.write_from_df(df: pandas.core.frame.DataFrame, tfrecord_file: str, feature_config: ml4ir.base.features.feature_config.FeatureConfig, tfrecord_type: str, logger: logging.Logger = None)¶ Converts data from a pandas DataFrame into tfrecord files
Parameters: - df (pd.DataFrame) – pandas DataFrame to be converted to a TFRecordDataset
- tfrecord_file (str) – tfrecord file path to write the output
- feature_config (FeatureConfig) – FeatureConfig object that defines the features to be loaded in the dataset and the preprocessing functions to be applied to each of them
- tfrecord_type ({"example", "sequence_example"}) – Type of the TFRecord protobuf message to be used for TFRecordDataset
- logger (Logger, optional) – logging handler for status messages