Data Loaders and Helpers

RelevanceDataset

class ml4ir.base.data.relevance_dataset.RelevanceDataset(data_dir: str, data_format: str, feature_config: ml4ir.base.features.feature_config.FeatureConfig, tfrecord_type: str, file_io: ml4ir.base.io.file_io.FileIO, max_sequence_size: int = 0, batch_size: int = 128, preprocessing_keys_to_fns: dict = {}, train_pcent_split: float = 0.8, val_pcent_split: float = -1, test_pcent_split: float = -1, use_part_files: bool = False, parse_tfrecord: bool = True, logger: Optional[logging.Logger] = None, keep_additional_info: int = 0, non_zero_features_only: int = 0)

Bases: object

Class to create or load a TFRecordDataset for the train, validation and test splits

Constructor method to instantiate a RelevanceDataset object. Loads and creates the TFRecordDataset for the train, validation and test splits.

Parameters:
  • data_dir (str) – path to the directory containing train, validation and test data
  • data_format ({"tfrecord", "csv", "libsvm"}) – type of data files to be converted into TFRecords and loaded as a TFRecordDataset
  • feature_config (FeatureConfig object) – FeatureConfig object that defines the features to be loaded in the dataset and the preprocessing functions to be applied to each of them
  • tfrecord_type ({"example", "sequence_example"}) – Type of the TFRecord protobuf message to be used for TFRecordDataset
  • file_io (FileIO object) – file I/O handler objects for reading and writing data
  • max_sequence_size (int, optional) – maximum number of records to be used with a single SequenceExample proto message. The data will be padded or clipped to fit the specified maximum
  • batch_size (int, optional) – size of each data batch
  • preprocessing_keys_to_fns (dict of (str, function), optional) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
  • train_pcent_split (float, optional) – ratio of overall data to be used as training set
  • val_pcent_split (float, optional) – ratio of overall data to be used as validation set
  • test_pcent_split (float, optional) – ratio of overall data to be used as test set
  • use_part_files (bool, optional) – load dataset from part files checked using “part-” prefix
  • parse_tfrecord (bool, optional) – parse the TFRecord string from the dataset; returns strings as is otherwise
  • logger (Logger, optional) – logging handler for status messages
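
The padding and clipping described for max_sequence_size can be pictured with a small pure-Python sketch. It is only an illustration of the idea on plain lists, not the library's implementation (which works on tensors); the helper name pad_or_clip and the pad_value argument are hypothetical.

```python
from typing import List, Sequence


def pad_or_clip(
    values: Sequence[float], max_sequence_size: int, pad_value: float = 0.0
) -> List[float]:
    """Return exactly max_sequence_size values.

    Longer inputs are clipped; shorter inputs are right-padded with pad_value.
    """
    clipped = list(values[:max_sequence_size])
    padding = [pad_value] * (max_sequence_size - len(clipped))
    return clipped + padding


# A query with 2 records is padded, a query with 5 records is clipped.
short = pad_or_clip([0.9, 0.4], max_sequence_size=4)
long = pad_or_clip([0.9, 0.4, 0.7, 0.1, 0.3], max_sequence_size=4)
```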

Notes

  • Currently supports CSV, TFRecord and Libsvm data formats
  • Does not support automatically splitting train, validation and test
  • data_dir should contain train, validation and test directories with files within them
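
As a rough illustration of the directory layout in the last note, the sketch below checks that a data_dir has one sub-directory per split and lists the files in each. The sub-directory names "train", "validation" and "test" and the helper list_split_files are assumptions made for this example, not names confirmed by this page.

```python
import os
import tempfile
from typing import Dict, List

# Assumed sub-directory names, one per split.
SPLITS = ("train", "validation", "test")


def list_split_files(data_dir: str) -> Dict[str, List[str]]:
    """Return the sorted file names found in each split directory."""
    files = {}
    for split in SPLITS:
        split_dir = os.path.join(data_dir, split)
        if not os.path.isdir(split_dir):
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        files[split] = sorted(os.listdir(split_dir))
    return files


# Build a throwaway data_dir with one empty file per split and inspect it.
with tempfile.TemporaryDirectory() as data_dir:
    for split in SPLITS:
        os.makedirs(os.path.join(data_dir, split))
        open(os.path.join(data_dir, split, "file_0.csv"), "w").close()
    layout = list_split_files(data_dir)
```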

create_dataset(parse_tfrecord=True)

Loads and creates train, validation and test datasets

Parameters:
  • parse_tfrecord (bool) – parse the TFRecord string from the dataset; returns strings as is otherwise

balance_classes()

Balance class labels in the train dataset
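
This page does not say how the balancing is done. One common approach, shown here purely as an assumption, is to downsample every class to the size of the smallest one. The helper downsample_to_balance and the (features, label) record shape are hypothetical.

```python
import random
from collections import defaultdict
from typing import Any, Hashable, List, Tuple

Record = Tuple[Any, Hashable]  # (features, label)


def downsample_to_balance(records: List[Record], seed: int = 0) -> List[Record]:
    """Keep the same number of records for every label by random downsampling."""
    by_label = defaultdict(list)
    for features, label in records:
        by_label[label].append((features, label))

    # Every class is reduced to the size of the rarest class.
    target = min(len(group) for group in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


records = [("a", 1), ("b", 1), ("c", 0), ("d", 0), ("e", 0), ("f", 0), ("g", 0)]
balanced = downsample_to_balance(records)
```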

train_val_test_split()

Split the dataset into train, validation and test
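
A minimal sketch of a ratio-based split follows, assuming all three ratios are given explicitly and sum to 1. How the library interprets the -1 defaults of val_pcent_split and test_pcent_split is not stated here, so the sketch does not handle them; the helper split_by_ratio is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def split_by_ratio(
    items: Sequence,
    train_pcent_split: float,
    val_pcent_split: float,
    test_pcent_split: float,
) -> Tuple[List, List, List]:
    """Split items into contiguous train, validation and test slices."""
    total = train_pcent_split + val_pcent_split + test_pcent_split
    if not math.isclose(total, 1.0):
        raise ValueError("split ratios must sum to 1")

    n = len(items)
    train_end = round(n * train_pcent_split)
    val_end = train_end + round(n * val_pcent_split)
    # The test slice takes whatever remains after train and validation.
    return list(items[:train_end]), list(items[train_end:val_end]), list(items[val_end:])


train, val, test = split_by_ratio(list(range(10)), 0.8, 0.1, 0.1)
```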

tfrecord_reader

ml4ir.base.data.tfrecord_reader.preprocess_feature(feature_tensor, feature_info, preprocessing_map)

Preprocess feature based on the feature configuration

Parameters:
  • feature_tensor (tf.Tensor) – input feature tensor to be preprocessed
  • feature_info (dict) – feature configuration for the feature being preprocessed
  • preprocessing_map (PreprocessingMap object) – map of preprocessing feature functions
Returns:

preprocessed tensor object

Return type:

tf.Tensor

Notes

Only preprocessing functions that are part of the preprocessing_map can be used in this function for preprocessing at data loading

Pass custom preprocessing functions while instantiating the RelevanceDataset object with preprocessing_keys_to_fns argument
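
The two notes above can be sketched in plain Python: a dictionary maps function names to callables, and only names present in that dictionary can be applied. The sketch operates on strings rather than tensors, and the feature_info keys "preprocessing_info", "fn" and "args", as well as the helper names, are assumptions made for illustration.

```python
from typing import Any, Callable, Dict


def to_lower(value: str) -> str:
    """Example custom preprocessing function."""
    return value.lower()


def remove_chars(value: str, chars: str = "") -> str:
    """Example custom preprocessing function with an argument."""
    return "".join(c for c in value if c not in chars)


# Same shape as the preprocessing_keys_to_fns argument: name -> function.
preprocessing_keys_to_fns: Dict[str, Callable[..., Any]] = {
    "to_lower": to_lower,
    "remove_chars": remove_chars,
}


def preprocess_value(value: Any, feature_info: dict, preprocessing_map: dict) -> Any:
    """Apply, in order, each preprocessing step listed for the feature."""
    for step in feature_info.get("preprocessing_info", []):
        fn_name = step["fn"]
        if fn_name not in preprocessing_map:
            raise KeyError(f"unknown preprocessing function: {fn_name}")
        value = preprocessing_map[fn_name](value, **step.get("args", {}))
    return value


feature_info = {
    "name": "query_text",
    "preprocessing_info": [
        {"fn": "to_lower"},
        {"fn": "remove_chars", "args": {"chars": "?!"}},
    ],
}
result = preprocess_value("What Is ML4IR?!", feature_info, preprocessing_keys_to_fns)
```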

ml4ir.base.data.tfrecord_reader.make_example_parse_fn(feature_config: ml4ir.base.features.feature_config.FeatureConfig, preprocessing_map: ml4ir.base.features.preprocessing.PreprocessingMap, required_fields_only: bool = False) → tensorflow.python.eager.def_function.function

Create a parse function using the Example features spec

Parameters:
  • feature_config (FeatureConfig) – FeatureConfig object defining context and sequence feature information
  • preprocessing_map (PreprocessingMap object) – map of preprocessing feature functions
  • required_fields_only (bool, optional) – Whether to only use required fields from the feature_config
Returns:

Parsing function that takes in a serialized Example message and extracts a feature dictionary

Return type:

tf.function

ml4ir.base.data.tfrecord_reader.make_sequence_example_parse_fn(feature_config: ml4ir.base.features.feature_config.FeatureConfig, preprocessing_map: ml4ir.base.features.preprocessing.PreprocessingMap, max_sequence_size: int = 25, required_fields_only: bool = False, pad_sequence: bool = True) → tensorflow.python.eager.def_function.function

Create a parse function using the SequenceExample features spec

Parameters:
  • feature_config (FeatureConfig) – FeatureConfig object defining context and sequence feature information
  • preprocessing_map (PreprocessingMap object) – map of preprocessing feature functions
  • max_sequence_size (int) – Maximum number of records per query. Used for padding
  • required_fields_only (bool, optional) – Whether to only use required fields from the feature_config
  • pad_sequence (bool) – Whether to pad sequence
Returns:

Parsing function that takes in a serialized SequenceExample message and extracts a feature dictionary for context and sequence features

Return type:

tf.function

ml4ir.base.data.tfrecord_reader.get_parse_fn(tfrecord_type: str, feature_config: ml4ir.base.features.feature_config.FeatureConfig, preprocessing_keys_to_fns: dict, max_sequence_size: int = 0, required_fields_only: bool = False, pad_sequence: bool = True)

Create a parsing function to extract features from serialized TFRecord data using the definition from the FeatureConfig

Parameters:
  • tfrecord_type ({"example", "sequence_example"}) – Type of TFRecord data to be loaded into a dataset
  • feature_config (FeatureConfig object) – FeatureConfig object defining the features to be extracted
  • preprocessing_keys_to_fns (dict of (str, function), optional) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
  • max_sequence_size (int) – Maximum number of records per query. Used for padding
  • required_fields_only (bool, optional) – Whether to only use required fields from the feature_config
  • pad_sequence (bool) – Whether to pad sequence
Returns:

Parsing function that takes in a serialized SequenceExample or Example message and extracts a dictionary of feature tensors

Return type:

tf.function
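
The dispatch on tfrecord_type can be illustrated without TensorFlow. In the sketch below, JSON strings stand in for serialized protobuf messages and a list of feature names stands in for the FeatureConfig; all helper names are hypothetical and the real parse functions operate on tensors.

```python
import json
from typing import Callable, Dict, List


def make_example_parser(feature_names: List[str]) -> Callable[[str], Dict]:
    """Parser for flat records: keeps only the configured features."""
    def parse(serialized: str) -> Dict:
        record = json.loads(serialized)
        return {name: record[name] for name in feature_names}
    return parse


def make_sequence_parser(
    feature_names: List[str], max_sequence_size: int
) -> Callable[[str], Dict]:
    """Parser for per-query sequences: also clips each feature list."""
    def parse(serialized: str) -> Dict:
        record = json.loads(serialized)
        return {name: record[name][:max_sequence_size] for name in feature_names}
    return parse


def get_parser(
    tfrecord_type: str, feature_names: List[str], max_sequence_size: int = 0
) -> Callable[[str], Dict]:
    """Pick the parser factory that matches tfrecord_type."""
    if tfrecord_type == "example":
        return make_example_parser(feature_names)
    if tfrecord_type == "sequence_example":
        return make_sequence_parser(feature_names, max_sequence_size)
    raise ValueError(f"unsupported tfrecord_type: {tfrecord_type}")


parse = get_parser("sequence_example", ["score"], max_sequence_size=2)
features = parse(json.dumps({"score": [0.9, 0.4, 0.7], "unused": [1, 2, 3]}))
```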

ml4ir.base.data.tfrecord_reader.read(data_dir: str, feature_config: ml4ir.base.features.feature_config.FeatureConfig, tfrecord_type: str, file_io: ml4ir.base.io.file_io.FileIO, max_sequence_size: int = 0, batch_size: int = 0, preprocessing_keys_to_fns: dict = {}, parse_tfrecord: bool = True, use_part_files: bool = False, logger: logging.Logger = None, **kwargs) → tensorflow.python.data.ops.readers.TFRecordDatasetV2

Extract features by reading and parsing TFRecord data and converting into a TFRecordDataset using the FeatureConfig

Parameters:
  • data_dir (str) – path to the directory containing train, validation and test data
  • feature_config (FeatureConfig object) – FeatureConfig object that defines the features to be loaded in the dataset and the preprocessing functions to be applied to each of them
  • tfrecord_type ({"example", "sequence_example"}) – Type of the TFRecord protobuf message to be used for TFRecordDataset
  • file_io (FileIO object) – file I/O handler objects for reading and writing data
  • max_sequence_size (int, optional) – maximum number of records to be used with a single SequenceExample proto message. The data will be padded or clipped to fit the specified maximum
  • batch_size (int, optional) – size of each data batch
  • preprocessing_keys_to_fns (dict of (str, function), optional) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
  • use_part_files (bool, optional) – load dataset from part files checked using “part-” prefix
  • parse_tfrecord (bool, optional) – parse the TFRecord string from the dataset; returns strings as is otherwise
  • logger (Logger, optional) – logging handler for status messages
Returns:

TFRecordDataset loaded from the data_dir specified using the FeatureConfig

Return type:

TFRecordDataset
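
The use_part_files behaviour, selecting files by the “part-” prefix, can be sketched with the standard library alone. The helper list_data_files, the ".tfrecord" extension filter used when use_part_files is False, and the sample file names are assumptions for this example; the library itself goes through its FileIO handler.

```python
import os
import tempfile
from typing import List


def list_data_files(
    data_dir: str, use_part_files: bool = False, extension: str = ".tfrecord"
) -> List[str]:
    """Return the data files in data_dir, optionally restricted to part files."""
    names = sorted(os.listdir(data_dir))
    if use_part_files:
        # Keep only files whose name starts with the "part-" prefix.
        return [name for name in names if name.startswith("part-")]
    return [name for name in names if name.endswith(extension)]


with tempfile.TemporaryDirectory() as data_dir:
    for name in ("part-00000.tfrecord", "part-00001.tfrecord", "data.tfrecord", "_SUCCESS"):
        open(os.path.join(data_dir, name), "w").close()
    part_files = list_data_files(data_dir, use_part_files=True)
    all_files = list_data_files(data_dir, use_part_files=False)
```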

csv_reader

ml4ir.base.data.csv_reader.read(data_dir: str, feature_config: ml4ir.base.features.feature_config.FeatureConfig, tfrecord_type: str, tfrecord_dir: str, file_io: ml4ir.base.io.file_io.FileIO, batch_size: int = 128, preprocessing_keys_to_fns: dict = {}, use_part_files: bool = False, max_sequence_size: int = 25, parse_tfrecord: bool = True, logger=None, **kwargs) → tensorflow.python.data.ops.readers.TFRecordDatasetV2

Create a TFRecordDataset from directory of CSV files using the FeatureConfig

Current execution plan:
  1. Load CSVs as pandas dataframes
  2. Convert each query into tf.train.SequenceExample protobufs
  3. Write the protobufs into a .tfrecord file
  4. Load .tfrecord file into a TFRecordDataset and parse the protobufs
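
Steps 1 and 2 of the plan above hinge on grouping CSV rows by query before each group becomes one protobuf. The sketch below shows only that grouping, using the standard csv module instead of pandas; the column names (query_id, record_id, score, clicked) and the helper group_rows_by_query are hypothetical, and the protobuf conversion and file writing are left out because they require TensorFlow.

```python
import csv
import io
from typing import Dict, List

# Tiny in-memory CSV standing in for a file in data_dir.
CSV_TEXT = """query_id,record_id,score,clicked
q1,d1,0.9,1
q1,d2,0.4,0
q2,d3,0.7,0
"""


def group_rows_by_query(csv_text: str, query_key: str = "query_id") -> Dict[str, List[dict]]:
    """Group CSV rows by query so each group can become one sequence record."""
    groups: Dict[str, List[dict]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups.setdefault(row[query_key], []).append(row)
    return groups


groups = group_rows_by_query(CSV_TEXT)
# Each value holds all records for one query, in file order.
records_per_query = {query: len(rows) for query, rows in groups.items()}
```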

Parameters:
  • data_dir (str) – path to the directory containing CSV files to read
  • feature_config (FeatureConfig object) – FeatureConfig object that defines the features to be loaded in the dataset and the preprocessing functions to be applied to each of them
  • tfrecord_dir (str) – path to the directory where the serialized .tfrecord files will be stored
  • batch_size (int) – size of each data batch
  • use_part_files (bool) – load dataset from part files checked using “part-” prefix
  • max_sequence_size (int) – maximum number of records per query
  • logger (Logger object) – logging handler to print and save status messages
Returns:

TFRecordDataset loaded from the CSV files

Return type:

TFRecordDataset

tfrecord_writer

Writes data in Example or SequenceExample protobuf (tfrecords) format.

To use it as a standalone script, refer to the argument spec at the bottom

Notes

Setting --keep-single-files writes one tfrecord file for each CSV file (better performance). If not set, joins everything to a single tfrecord file.

Examples

Syntax to convert a single or several CSVs:

>>> python ml4ir/base/data/tfrecord_writer.py \
... sequence_example|example \
... --csv-files <SPACE_SEPARATED_PATHS_TO_CSV_FILES> \
... --out-dir <PATH_TO_OUTPUT_DIR> \
... --feature-config <PATH_TO_YAML_FEATURE_CONFIG> \
... --keep-single-files

or to convert all CSV files in a dir

>>> python ml4ir/base/data/tfrecord_writer.py \
... sequence_example|example \
... --csv-dir <DIR_WITH_CSVs> \
... --out-dir <PATH_TO_OUTPUT_DIR> \
... --feature-config <PATH_TO_YAML_FEATURE_CONFIG> \
... --keep-single-files

Usage example:

>>> python ml4ir/base/data/tfrecord_writer.py \
... sequence_example \
... --csv-files /tmp/d.csv /tmp/d2.csv \
... --out-dir /tmp \
... --feature-config /tmp/fconfig.yaml \
... --keep-single-files
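
When the script is driven from Python rather than a shell, the same command line can be assembled as an argument list. This is a sketch under assumptions: the helper build_writer_command is hypothetical, only the flags shown in the examples above are used, and the flag spelling follows the usage example.

```python
from typing import List


def build_writer_command(
    tfrecord_type: str,
    csv_files: List[str],
    out_dir: str,
    feature_config: str,
    keep_single_files: bool = True,
) -> List[str]:
    """Assemble the argument list for the tfrecord_writer script."""
    if tfrecord_type not in ("example", "sequence_example"):
        raise ValueError(f"unsupported tfrecord_type: {tfrecord_type}")
    command = [
        "python",
        "ml4ir/base/data/tfrecord_writer.py",
        tfrecord_type,
        "--csv-files",
        *csv_files,
        "--out-dir",
        out_dir,
        "--feature-config",
        feature_config,
    ]
    if keep_single_files:
        command.append("--keep-single-files")
    return command


command = build_writer_command(
    "sequence_example", ["/tmp/d.csv", "/tmp/d2.csv"], "/tmp", "/tmp/fconfig.yaml"
)
```

The resulting list can be handed to subprocess.run(command, check=True) from the repository root; passing a list avoids shell quoting problems with paths that contain spaces.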

ml4ir.base.data.tfrecord_writer.write_from_files(csv_files: List[str], tfrecord_file: str, feature_config: ml4ir.base.features.feature_config.FeatureConfig, tfrecord_type: str, file_io: ml4ir.base.io.file_io.FileIO, logger: logging.Logger = None)

Converts data from CSV files into a tfrecord file

Parameters:
  • csv_files (list of str) – list of csv file paths to read data from
  • tfrecord_file (str) – tfrecord file path to write the output
  • feature_config (FeatureConfig) – FeatureConfig object that defines the features to be loaded in the dataset and the preprocessing functions to be applied to each of them
  • tfrecord_type ({"example", "sequence_example"}) – Type of the TFRecord protobuf message to be used for TFRecordDataset
  • file_io (FileIO object) – file I/O handler objects for reading and writing data
  • logger (Logger, optional) – logging handler for status messages

ml4ir.base.data.tfrecord_writer.write_from_df(df: pandas.core.frame.DataFrame, tfrecord_file: str, feature_config: ml4ir.base.features.feature_config.FeatureConfig, tfrecord_type: str, logger: logging.Logger = None)

Converts data from a pandas DataFrame into a tfrecord file

Parameters:
  • df (pd.DataFrame) – pandas DataFrame to be converted into TFRecords
  • tfrecord_file (str) – tfrecord file path to write the output
  • feature_config (FeatureConfig) – FeatureConfig object that defines the features to be loaded in the dataset and the preprocessing functions to be applied to each of them
  • tfrecord_type ({"example", "sequence_example"}) – Type of the TFRecord protobuf message to be used for TFRecordDataset
  • logger (Logger, optional) – logging handler for status messages