Relevance Models
RelevanceModel
-
class
ml4ir.base.model.relevance_model.RelevanceModel(feature_config: ml4ir.base.features.feature_config.FeatureConfig, tfrecord_type: str, file_io: ml4ir.base.io.file_io.FileIO, scorer: Optional[ml4ir.base.model.scoring.scoring_model.ScorerBase] = None, metrics: List[Union[str, Type[tensorflow.python.keras.metrics.Metric]]] = [], optimizer: Optional[tensorflow.python.keras.optimizer_v2.optimizer_v2.OptimizerV2] = None, model_file: Optional[str] = None, initialize_layers_dict: dict = {}, freeze_layers_list: list = [], compile_keras_model: bool = False, output_name: str = 'score', logger=None)
Bases: object
Constructor to instantiate a RelevanceModel that can be used for training and evaluating the search ML task
Parameters: - feature_config (FeatureConfig object) – FeatureConfig object that defines the features to be loaded in the dataset and the preprocessing functions to be applied to each of them
- tfrecord_type ({"example", "sequence_example"}) – Type of the TFRecord protobuf message used for TFRecordDataset
- file_io (FileIO object) – file I/O handler objects for reading and writing data
- scorer (ScorerBase object) – Scorer object that wraps an InteractionModel and converts input features into scores
- metrics (list) – List of keras Metric classes that will be used for evaluating the trained model
- optimizer (Optimizer) – Tensorflow keras optimizer to be used for training the model
- model_file (str, optional) – Path to pretrained model file to be loaded for evaluation or retraining
- initialize_layers_dict (dict, optional) – Dictionary of tensorflow layer names mapped to the path of pretrained weights. Use this for transfer learning with pretrained weights
- freeze_layers_list (list, optional) – List of model layer names to be frozen. Use this for freezing pretrained weights from other ml4ir models
- compile_keras_model (bool, optional) – Whether the keras model loaded from disk should be compiled with loss, metrics and an optimizer
- output_name (str, optional) – Name of the output tensorflow node that captures the score
- logger (Logger, optional) – logging handler for status messages
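As a rough illustration of what freeze_layers_list implies, the sketch below marks the named layers as non-trainable. It uses a hypothetical stand-in `Layer` class rather than real Keras layers (which expose the same `name` and `trainable` attributes), and the helper `freeze_layers` is an assumption for illustration, not ml4ir's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    """Hypothetical stand-in for a Keras layer (only name and trainable)."""
    name: str
    trainable: bool = True


def freeze_layers(layers: List[Layer], freeze_layers_list: List[str]) -> List[Layer]:
    """Mark every layer whose name appears in freeze_layers_list as non-trainable."""
    names_to_freeze = set(freeze_layers_list)
    for layer in layers:
        if layer.name in names_to_freeze:
            layer.trainable = False
    return layers


# Illustrative layer names; real names depend on the model architecture.
layers = [Layer("first_dense"), Layer("second_dense"), Layer("final_dense")]
freeze_layers(layers, freeze_layers_list=["second_dense"])
trainable_flags = [layer.trainable for layer in layers]
```

Frozen layers keep their pretrained weights during training, while the remaining layers continue to be updated.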
-
is_compiled = None
Specify inputs to the model.
Individual input nodes are defined for each feature. Each data point represents features for all records in a single query
-
classmethod
from_relevance_scorer(feature_config: ml4ir.base.features.feature_config.FeatureConfig, interaction_model: ml4ir.base.model.scoring.interaction_model.InteractionModel, model_config: dict, loss: ml4ir.base.model.losses.loss_base.RelevanceLossBase, metrics: List[Union[tensorflow.python.keras.metrics.Metric, str]], optimizer: tensorflow.python.keras.optimizer_v2.optimizer_v2.OptimizerV2, tfrecord_type: str, file_io: ml4ir.base.io.file_io.FileIO, model_file: Optional[str] = None, initialize_layers_dict: dict = {}, freeze_layers_list: list = [], compile_keras_model: bool = False, output_name: str = 'score', logger=None)
Create a RelevanceModel with a default Scorer function constructed from an InteractionModel
Parameters: - feature_config (FeatureConfig object) – FeatureConfig object that defines the features to be loaded in the dataset and the preprocessing functions to be applied to each of them
- tfrecord_type ({"example", "sequence_example"}) – Type of the TFRecord protobuf message used for TFRecordDataset
- file_io (FileIO object) – file I/O handler objects for reading and writing data
- interaction_model (InteractionModel object) – InteractionModel object that converts input features into a dense feature representation
- model_config (dict) – dictionary defining the dense model architecture
- loss (RelevanceLossBase object) – Loss object defining the final activation layer and the loss function
- metrics (list) – List of keras Metric classes that will be used for evaluating the trained model
- optimizer (Optimizer) – Tensorflow keras optimizer to be used for training the model
- model_file (str, optional) – Path to pretrained model file to be loaded for evaluation or retraining
- initialize_layers_dict (dict, optional) – Dictionary of tensorflow layer names mapped to the path of pretrained weights. Use this for transfer learning with pretrained weights
- freeze_layers_list (list, optional) – List of model layer names to be frozen. Use this for freezing pretrained weights from other ml4ir models
- compile_keras_model (bool, optional) – Whether the keras model loaded from disk should be compiled with loss, metrics and an optimizer
- output_name (str, optional) – Name of the output tensorflow node that captures the score
- logger (Logger, optional) – logging handler for status messages
Returns: RelevanceModel object with a default scorer built with a custom InteractionModel
Return type: RelevanceModel
-
classmethod
from_univariate_interaction_model(model_config, feature_config: ml4ir.base.features.feature_config.FeatureConfig, tfrecord_type: str, loss: ml4ir.base.model.losses.loss_base.RelevanceLossBase, metrics: List[Union[tensorflow.python.keras.metrics.Metric, str]], optimizer: tensorflow.python.keras.optimizer_v2.optimizer_v2.OptimizerV2, feature_layer_keys_to_fns: dict = {}, model_file: Optional[str] = None, initialize_layers_dict: dict = {}, freeze_layers_list: list = [], compile_keras_model: bool = False, output_name: str = 'score', max_sequence_size: int = 0, file_io: ml4ir.base.io.file_io.FileIO = None, logger=None)
Create a RelevanceModel with a default UnivariateInteractionModel
Parameters: - feature_config (FeatureConfig object) – FeatureConfig object that defines the features to be loaded in the dataset and the preprocessing functions to be applied to each of them
- model_config (dict) – dictionary defining the dense model architecture
- tfrecord_type ({"example", "sequence_example"}) – Type of the TFRecord protobuf message used for TFRecordDataset
- file_io (FileIO object) – file I/O handler objects for reading and writing data
- loss (RelevanceLossBase object) – Loss object defining the final activation layer and the loss function
- metrics (list) – List of keras Metric classes that will be used for evaluating the trained model
- optimizer (Optimizer) – Tensorflow keras optimizer to be used for training the model
- feature_layer_keys_to_fns (dict) – Dictionary of custom feature transformation functions to be applied on the input features as part of the InteractionModel
- model_file (str, optional) – Path to pretrained model file to be loaded for evaluation or retraining
- initialize_layers_dict (dict, optional) – Dictionary of tensorflow layer names mapped to the path of pretrained weights. Use this for transfer learning with pretrained weights
- freeze_layers_list (list, optional) – List of model layer names to be frozen. Use this for freezing pretrained weights from other ml4ir models
- compile_keras_model (bool, optional) – Whether the keras model loaded from disk should be compiled with loss, metrics and an optimizer
- output_name (str, optional) – Name of the output tensorflow node that captures the score
- max_sequence_size (int, optional) – Maximum length of the sequence to be used for SequenceExample protobuf objects
- logger (Logger, optional) – logging handler for status messages
Returns: RelevanceModel object with a UnivariateInteractionModel
Return type: RelevanceModel
-
fit(dataset: ml4ir.base.data.relevance_dataset.RelevanceDataset, num_epochs: int, models_dir: str, logs_dir: Optional[str] = None, logging_frequency: int = 25, monitor_metric: str = '', monitor_mode: str = '', patience=2)
Trains the model for the given number of epochs and returns the training and validation metrics as a dictionary
Parameters: - dataset (RelevanceDataset object) – RelevanceDataset object to be used for training and validation
- num_epochs (int) – Value specifying number of epochs to train for
- models_dir (str) – Directory to save model checkpoints
- logs_dir (str, optional) – Directory to save model logs. If set to False, no progress logs will be written
- logging_frequency (int, optional) – Number of batches between consecutive log messages
- monitor_metric (str, optional) – Name of the metric to monitor for early stopping, checkpointing
- monitor_mode ({"max", "min"}) – Whether to maximize or minimize the monitoring metric
- patience (int) – Number of epochs to wait before early stopping
Returns: train_metrics – Train and validation metrics in a single dictionary where key is metric name and value is floating point metric value. This dictionary will be used for experiment tracking for each ml4ir run
Return type: dict
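The interplay of monitor_metric, monitor_mode and patience can be illustrated with a small framework-free sketch. It assumes the usual Keras-style early-stopping rule (stop once the monitored value has failed to improve for `patience` consecutive epochs); whether ml4ir applies exactly this rule is an assumption, and the function name `simulate_early_stopping` and the sample values are hypothetical.

```python
from typing import List, Tuple


def simulate_early_stopping(
    history: List[float], monitor_mode: str = "max", patience: int = 2
) -> Tuple[int, int]:
    """Return (stop_epoch, best_epoch) as 0-based indices into history.

    history holds one value of the monitored metric per epoch.
    """
    if monitor_mode not in ("max", "min"):
        raise ValueError("monitor_mode must be 'max' or 'min'")

    best_value = None
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, value in enumerate(history):
        if best_value is None:
            improved = True
        elif monitor_mode == "max":
            improved = value > best_value
        else:
            improved = value < best_value

        if improved:
            best_value, best_epoch = value, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Training would stop here; the best checkpoint is best_epoch.
                return epoch, best_epoch

    # Ran through all epochs without triggering early stopping.
    return len(history) - 1, best_epoch


# Hypothetical validation metric values, one per epoch.
stop_epoch, best_epoch = simulate_early_stopping(
    [0.50, 0.55, 0.54, 0.53, 0.60], monitor_mode="max", patience=2
)
```

In this example the metric peaks at epoch 1 and then fails to improve for two epochs, so training stops at epoch 3 and the later value of 0.60 is never reached. A larger patience trades training time for robustness against such temporary dips.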
-
predict(test_dataset: tensorflow.python.data.ops.readers.TFRecordDatasetV2, inference_signature: str = 'serving_default', additional_features: dict = {}, logs_dir: Optional[str] = None, logging_frequency: int = 25)
Predict the scores on the test dataset using the trained model
Parameters: - test_dataset (Dataset object) – Dataset object for which predictions are to be made
- inference_signature (str, optional) – If using a SavedModel for prediction, specify the inference signature to be used for computing scores
- additional_features (dict, optional) – Dictionary containing new feature name and function definition to compute them. Use this to compute additional features from the scores. For example, converting ranking scores for each document into ranks for the query
- logs_dir (str, optional) – Path to directory to save logs
- logging_frequency (int) – Value representing how often (in batches) to log status
Returns: pandas DataFrame containing the predictions on the test dataset made with the RelevanceModel
Return type: pd.DataFrame
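The additional_features example mentioned above, converting the scores of the documents in a query into ranks, could look roughly like the following. This is a plain-Python sketch: the function name `scores_to_ranks` is hypothetical, and the exact callable signature that ml4ir expects for entries of additional_features (for instance, operating on tensors or DataFrame columns) is not specified here, so treat the dictionary shape as an assumption.

```python
from typing import List


def scores_to_ranks(scores: List[float]) -> List[int]:
    """Convert per-document scores for one query into 1-based ranks.

    The highest score gets rank 1. Ties keep their original order,
    because Python's sort is stable even with reverse=True.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, doc_index in enumerate(order, start=1):
        ranks[doc_index] = rank
    return ranks


# Assumed shape: new feature name mapped to the function that computes it.
additional_features = {"rank": scores_to_ranks}

query_scores = [0.2, 0.9, 0.5]  # hypothetical scores for three documents
computed = {name: fn(query_scores) for name, fn in additional_features.items()}
```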
-
evaluate(test_dataset: tensorflow.python.data.ops.readers.TFRecordDatasetV2, inference_signature: str = None, additional_features: dict = {}, group_metrics_min_queries: int = 50, logs_dir: Optional[str] = None, logging_frequency: int = 25)
Evaluate the RelevanceModel
Parameters: - test_dataset (an instance of tf.data.Dataset) – Dataset on which the model is evaluated
- inference_signature (str, optional) – If using a SavedModel for prediction, specify the inference signature to be used for computing scores
- additional_features (dict, optional) – Dictionary containing new feature name and function definition to compute them. Use this to compute additional features from the scores. For example, converting ranking scores for each document into ranks for the query
- group_metrics_min_queries (int, optional) – Minimum count threshold per group to be considered for computing groupwise metrics
- logs_dir (str, optional) – Path to directory to save logs
- logging_frequency (int) – Value representing how often (in batches) to log status
Returns: - df_overall_metrics (pd.DataFrame object) – pd.DataFrame containing overall metrics
- df_groupwise_metrics (pd.DataFrame object) – pd.DataFrame containing groupwise metrics if group_metric_keys are defined in the FeatureConfig
- metrics_dict (dict) – metrics as a dictionary of metric names mapping to values
Notes
You can call model.evaluate() directly only if the Keras model is compiled.
Override this method to implement your own evaluation metrics.
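The effect of group_metrics_min_queries can be sketched as a filter applied before groupwise aggregation. The helper below is hypothetical and works on plain dictionaries instead of the DataFrames that evaluate() returns. It also assumes the threshold is inclusive (a group with exactly the minimum number of queries is kept), which the description above does not state.

```python
from collections import defaultdict
from typing import Dict, List


def groupwise_mean(
    records: List[dict], group_key: str, metric_key: str, min_queries: int = 50
) -> Dict[str, float]:
    """Average metric_key per group, skipping groups with too few queries.

    Each record represents one query and carries its group label and
    the per-query metric value.
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record[group_key]].append(record[metric_key])

    return {
        group: sum(values) / len(values)
        for group, values in grouped.items()
        if len(values) >= min_queries  # assumed inclusive threshold
    }


# Hypothetical per-query metric values with a group label.
records = [
    {"locale": "en", "mrr": 1.0},
    {"locale": "en", "mrr": 0.5},
    {"locale": "en", "mrr": 0.0},
    {"locale": "fr", "mrr": 1.0},
]
groupwise = groupwise_mean(records, "locale", "mrr", min_queries=2)
```

Here the "fr" group is dropped because it contains a single query, which keeps noisy averages from very small groups out of the groupwise report.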
-
save(models_dir: str, preprocessing_keys_to_fns={}, postprocessing_fn=None, required_fields_only: bool = True, pad_sequence: bool = False)
Save the RelevanceModel as a tensorflow SavedModel to the models_dir
There are two different serving signatures currently used to save the model:
- default: default keras model without any pre/post processing wrapper
- tfrecord: serving signature that allows keras model to be served using TFRecord proto messages. Allows definition of custom pre/post processing logic
Additionally, each model layer is also saved as a separate numpy zipped array to enable transfer learning with other ml4ir models.
Parameters: - models_dir (str) – path to directory to save the model
- preprocessing_keys_to_fns (dict) – dictionary mapping function names to tf.functions that should be saved in the preprocessing step of the tfrecord serving signature
- postprocessing_fn (function) – custom tensorflow compatible postprocessing function to be used at serving time. Saved as part of the postprocessing layer of the tfrecord serving signature
- required_fields_only (bool) – boolean value defining if only required fields need to be added to the tfrecord parsing function at serving time
- pad_sequence (bool, optional) – Value defining if sequences should be padded for SequenceExample proto inputs at serving time. Set this to False if you do not want to handle padded scores.
Notes
All the functions passed under preprocessing_keys_to_fns here must be serializable tensor graph operations
-
load(model_file: str) → tensorflow.python.keras.engine.training.Model
Loads model from the SavedModel file specified
Parameters: model_file (str) – path to file with saved tf keras model
Returns: Tensorflow keras model loaded from file
Return type: tf.keras.Model
Notes
Retraining currently not supported! Would require compiling the model with the right loss and optimizer states
-
load_weights(model_file: str)
Load saved model with compile=False
Parameters: model_file (str) – path to file with saved tf keras model
RankingModel
-
class
ml4ir.applications.ranking.model.ranking_model.RankingModel(feature_config: ml4ir.base.features.feature_config.FeatureConfig, tfrecord_type: str, file_io: ml4ir.base.io.file_io.FileIO, scorer: Optional[ml4ir.base.model.scoring.scoring_model.ScorerBase] = None, metrics: List[Union[str, Type[tensorflow.python.keras.metrics.Metric]]] = [], optimizer: Optional[tensorflow.python.keras.optimizer_v2.optimizer_v2.OptimizerV2] = None, model_file: Optional[str] = None, initialize_layers_dict: dict = {}, freeze_layers_list: list = [], compile_keras_model: bool = False, output_name: str = 'score', logger=None)
Bases: ml4ir.base.model.relevance_model.RelevanceModel
Constructor to instantiate a RelevanceModel that can be used for training and evaluating the search ML task
Parameters: - feature_config (FeatureConfig object) – FeatureConfig object that defines the features to be loaded in the dataset and the preprocessing functions to be applied to each of them
- tfrecord_type ({"example", "sequence_example"}) – Type of the TFRecord protobuf message used for TFRecordDataset
- file_io (FileIO object) – file I/O handler objects for reading and writing data
- scorer (ScorerBase object) – Scorer object that wraps an InteractionModel and converts input features into scores
- metrics (list) – List of keras Metric classes that will be used for evaluating the trained model
- optimizer (Optimizer) – Tensorflow keras optimizer to be used for training the model
- model_file (str, optional) – Path to pretrained model file to be loaded for evaluation or retraining
- initialize_layers_dict (dict, optional) – Dictionary of tensorflow layer names mapped to the path of pretrained weights. Use this for transfer learning with pretrained weights
- freeze_layers_list (list, optional) – List of model layer names to be frozen. Use this for freezing pretrained weights from other ml4ir models
- compile_keras_model (bool, optional) – Whether the keras model loaded from disk should be compiled with loss, metrics and an optimizer
- output_name (str, optional) – Name of the output tensorflow node that captures the score
- logger (Logger, optional) – logging handler for status messages
-
predict(test_dataset: tensorflow.python.data.ops.readers.TFRecordDatasetV2, inference_signature: str = 'serving_default', additional_features: dict = {}, logs_dir: Optional[str] = None, logging_frequency: int = 25)
Predict the scores on the test dataset using the trained model
Parameters: - test_dataset (Dataset object) – Dataset object for which predictions are to be made
- inference_signature (str, optional) – If using a SavedModel for prediction, specify the inference signature to be used for computing scores
- additional_features (dict, optional) – Dictionary containing new feature name and function definition to compute them. Use this to compute additional features from the scores. For example, converting ranking scores for each document into ranks for the query
- logs_dir (str, optional) – Path to directory to save logs
- logging_frequency (int) – Value representing how often (in batches) to log status
Returns: pandas DataFrame containing the predictions on the test dataset made with the RelevanceModel
Return type: pd.DataFrame
-
evaluate(test_dataset: tensorflow.python.data.ops.readers.TFRecordDatasetV2, inference_signature: str = None, additional_features: dict = {}, group_metrics_min_queries: int = 50, logs_dir: Optional[str] = None, logging_frequency: int = 25)
Evaluate the RelevanceModel
Parameters: - test_dataset (an instance of tf.data.Dataset) – Dataset on which the model is evaluated
- inference_signature (str, optional) – If using a SavedModel for prediction, specify the inference signature to be used for computing scores
- additional_features (dict, optional) – Dictionary containing new feature name and function definition to compute them. Use this to compute additional features from the scores. For example, converting ranking scores for each document into ranks for the query
- group_metrics_min_queries (int, optional) – Minimum count threshold per group to be considered for computing groupwise metrics
- logs_dir (str, optional) – Path to directory to save logs
- logging_frequency (int) – Value representing how often (in batches) to log status
Returns: - df_overall_metrics (pd.DataFrame object) – pd.DataFrame containing overall metrics
- df_groupwise_metrics (pd.DataFrame object) – pd.DataFrame containing groupwise metrics if group_metric_keys are defined in the FeatureConfig
- metrics_dict (dict) – metrics as a dictionary of metric names mapping to values
Notes
You can call model.evaluate() directly only if the Keras model is compiled.
Override this method to implement your own evaluation metrics.
-
save(models_dir: str, preprocessing_keys_to_fns={}, postprocessing_fn=None, required_fields_only: bool = True, pad_sequence: bool = False)
Save the RelevanceModel as a tensorflow SavedModel to the models_dir. Additionally, sets the score for the padded records to 0
There are two different serving signatures currently used to save the model:
- default: default keras model without any pre/post processing wrapper
- tfrecord: serving signature that allows keras model to be served using TFRecord proto messages. Allows definition of custom pre/post processing logic
Additionally, each model layer is also saved as a separate numpy zipped array to enable transfer learning with other ml4ir models.
Parameters: - models_dir (str) – path to directory to save the model
- preprocessing_keys_to_fns (dict) – dictionary mapping function names to tf.functions that should be saved in the preprocessing step of the tfrecord serving signature
- postprocessing_fn (function) – custom tensorflow compatible postprocessing function to be used at serving time. Saved as part of the postprocessing layer of the tfrecord serving signature
- required_fields_only (bool) – boolean value defining if only required fields need to be added to the tfrecord parsing function at serving time
- pad_sequence (bool, optional) – Value defining if sequences should be padded for SequenceExample proto inputs at serving time. Set this to False if you do not want to handle padded scores.
Notes
All the functions passed under preprocessing_keys_to_fns here must be serializable tensor graph operations
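RankingModel.save is described above as setting the score for padded records to 0. The idea can be sketched in plain Python as a mask applied to the per-record scores of a query. The function `zero_padded_scores` and the mask representation (1 for a real record, 0 for padding) are assumptions for illustration; the actual postprocessing runs as tensorflow graph operations inside the saved model.

```python
from typing import List


def zero_padded_scores(scores: List[float], mask: List[int]) -> List[float]:
    """Keep scores of real records and set scores of padded records to 0.

    mask[i] is 1 when record i is a real record and 0 when it is padding.
    """
    if len(scores) != len(mask):
        raise ValueError("scores and mask must have the same length")
    return [score if is_real else 0.0 for score, is_real in zip(scores, mask)]


# A query with two real records padded to a sequence length of four.
masked = zero_padded_scores([0.7, 0.2, 0.9, 0.4], mask=[1, 1, 0, 0])
```

Without this step, a padded position could receive an arbitrary score and be mistaken for a real record by downstream consumers.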