

class ml4ir.base.pipeline.RelevancePipeline(args: argparse.Namespace)

Bases: object

Base class that defines a pipeline to train, evaluate and save a RelevanceModel using ml4ir

Constructor to create a RelevancePipeline object to train, evaluate and save a model on ml4ir. The constructor sets up the data, logs and models directories and the file handlers used. It also loads and sets up the FeatureConfig for the model training pipeline.

Parameters:args (argparse Namespace) – arguments to be used with the pipeline. Typically passed in from the command line
setup_logging() → logging.Logger

Set up the logging utilities for the training pipeline. Additionally, removes pre-existing job status files.


Set the random seeds for tensorflow and numpy in order to replicate results

Parameters:reset_graph (bool) – Whether to reset the tensorflow graph and clear the keras session
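The seeding behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the ml4ir implementation: the helper name `set_seeds_sketch` and the default seed are hypothetical, and numpy and tensorflow are only seeded if they happen to be installed.

```python
import random


def set_seeds_sketch(seed: int = 123, reset_graph: bool = True) -> None:
    """Hypothetical helper that seeds the common random number generators."""
    # Python's own generator is always available
    random.seed(seed)

    # Seed numpy only if it is installed
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass

    # Seed tensorflow only if it is installed; optionally clear the keras session first
    try:
        import tensorflow as tf

        if reset_graph:
            tf.keras.backend.clear_session()
        tf.random.set_seed(seed)
    except ImportError:
        pass


# Re-seeding with the same value replays the same random sequence
set_seeds_sketch(42)
first = [random.random() for _ in range(3)]
set_seeds_sketch(42)
second = [random.random() for _ in range(3)]
```

Seeding every generator from one place makes it easier to replicate a training run, which is the stated purpose of this method.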
get_relevance_dataset(preprocessing_keys_to_fns={}) → ml4ir.base.data.relevance_dataset.RelevanceDataset

Create RelevanceDataset object by loading train, test data as tensorflow datasets

Parameters:preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
Returns:RelevanceDataset object that can be used for training and evaluating the model
Return type:RelevanceDataset object


Override this method to create custom dataset objects
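As an illustration of the shape of preprocessing_keys_to_fns, the sketch below builds a dictionary of function names mapped to function definitions and applies them by name. It is only an assumption-laden example: the function names, the `apply_preprocessing` helper and the use of plain Python strings are hypothetical, whereas functions used while loading a TFRecordDataset would typically need to operate on tensors.

```python
from typing import Callable, Dict, List


def strip_punctuation(text: str) -> str:
    """Keep only alphanumeric characters and whitespace."""
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())


# Dictionary of function names mapped to function definitions
preprocessing_keys_to_fns: Dict[str, Callable[[str], str]] = {
    "to_lower": str.lower,
    "strip_punctuation": strip_punctuation,
}


def apply_preprocessing(
    value: str, fn_names: List[str], keys_to_fns: Dict[str, Callable[[str], str]]
) -> str:
    """Hypothetical helper: look up each function by name and apply it in order."""
    for name in fn_names:
        value = keys_to_fns[name](value)
    return value


cleaned = apply_preprocessing(
    "Hello, World!", ["to_lower", "strip_punctuation"], preprocessing_keys_to_fns
)
```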

get_kfold_relevance_dataset(num_folds, include_testset_in_kfold, read_data_sets, preprocessing_keys_to_fns={}) → ml4ir.base.data.relevance_dataset.RelevanceDataset

Create RelevanceDataset object by loading train, test data as tensorflow datasets

Parameters:
  • num_folds (int) – number of folds in kfold
  • include_testset_in_kfold (bool) – whether to include the testset in the folds
  • read_data_sets (bool) – whether to call create_dataset which reads data from files.
  • preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object

Returns:RelevanceDataset object that can be used for training and evaluating the model
Return type:KfoldRelevanceDataset object


Override this method to create custom dataset objects
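To make the roles of num_folds and include_testset_in_kfold concrete, here is a small sketch of one way records could be partitioned into folds. It is an assumption for illustration only: the `make_folds` helper is hypothetical, it works on plain Python sequences, and it does not reproduce how the kfold dataset object actually assigns train, validation and test data.

```python
from typing import List, Sequence, Tuple


def make_folds(
    train: Sequence[int],
    test: Sequence[int],
    num_folds: int,
    include_testset_in_kfold: bool,
) -> List[Tuple[List[int], List[int]]]:
    """Hypothetical helper returning (training, held_out) pairs, one per fold."""
    # Optionally merge the testset into the pool that gets split into folds
    pool = list(train) + (list(test) if include_testset_in_kfold else [])

    # Round-robin assignment of records to folds
    folds = [pool[i::num_folds] for i in range(num_folds)]

    splits = []
    for i in range(num_folds):
        held_out = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((training, held_out))
    return splits


# 8 training records plus 2 test records, split into 5 folds of 2 records each
splits = make_folds(range(8), range(8, 10), num_folds=5, include_testset_in_kfold=True)
```

Each record appears in exactly one held-out fold, so every fold's model is evaluated on data it did not train on.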


Fetch the class of the RelevanceModel to be used for the ml4ir pipeline

Return type:RelevanceModel class

Get the primary loss function to be used with the RelevanceModel

Return type:RelevanceLossBase object

Get the auxiliary loss function to be used with the RelevanceModel

Return type:RelevanceLossBase object
static get_metrics(metrics_keys: List[str]) → List[Union[keras.metrics.base_metric.Metric, str]]

Get the list of keras metrics to be used with the RelevanceModel

Parameters:metrics_keys (List of str) – List of strings indicating the metrics to instantiate and retrieve
Return type:list of keras Metric objects
get_relevance_model(feature_layer_keys_to_fns={}) → ml4ir.base.model.relevance_model.RelevanceModel

Creates a RankingModel that can be used for training and evaluating

Parameters:feature_layer_keys_to_fns (dict) – dictionary of function names mapped to tensorflow compatible function definitions that can now be used in the InteractionModel as a feature function to transform input features
Returns:RankingModel that can be used for training and evaluating a ranking model
Return type:RankingModel


Override this method to create custom loss, scorer, model objects


Run the pipeline to train, evaluate and save the model. It also runs the pipeline in kfold cross validation mode if specified.

Returns:Experiment tracking dictionary with metrics and metadata for the run. Used for model selection and hyperparameter optimization
Return type:dict


Also populates an experiment tracking dictionary containing the metadata, model architecture and metrics generated by the model


Run the pipeline to train, evaluate and save the model.

Parameters:relevance_dataset (RelevanceDataset) – RelevanceDataset used for running the pipeline. If none, the relevance dataset will be created.
Returns:Experiment tracking dictionary with metrics and metadata for the run. Used for model selection and hyperparameter optimization
Return type:dict


Also populates an experiment tracking dictionary containing the metadata, model architecture and metrics generated by the model


Performs arbitrary pre-processing steps, such as copying or transforming data, that the rest of the code cannot accommodate. It serves as a placeholder without an explicit implementation (returns self) in the base pipeline. Users are expected to extend it in their custom pipelines.


Performs arbitrary post-training steps, such as copying or transforming data, that the rest of the code cannot accommodate. It serves as a placeholder without an explicit implementation (returns self) in the base pipeline. Users are expected to extend it in their custom pipelines.
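The placeholder pattern described for the pre-processing and post-training steps can be sketched with a stand-in class. Everything here is hypothetical: `PipelineSketch` is not the ml4ir base class, and the hook names `pre_processing_step` and `post_training_step` are assumed for illustration since the signatures are not shown in this section.

```python
from typing import List


class PipelineSketch:
    """Stand-in for a base pipeline; not the ml4ir class."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def pre_processing_step(self) -> "PipelineSketch":
        # Placeholder without an explicit implementation
        return self

    def post_training_step(self) -> "PipelineSketch":
        # Placeholder without an explicit implementation
        return self

    def run(self) -> List[str]:
        self.pre_processing_step()
        self.events.append("train")
        self.post_training_step()
        return self.events


class CustomPipeline(PipelineSketch):
    """Custom pipeline that extends both hooks."""

    def pre_processing_step(self) -> "CustomPipeline":
        self.events.append("copy_data")
        return self

    def post_training_step(self) -> "CustomPipeline":
        self.events.append("export_artifacts")
        return self


base_events = PipelineSketch().run()
custom_events = CustomPipeline().run()
```

Because the base hooks do nothing, a custom pipeline only needs to override the steps it cares about.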

finish(job_status, job_info)

Wrap up the model training pipeline. Performs the following actions:

  • save a job status file as _SUCCESS or _FAILURE to indicate job status
  • delete temp data and models directories
  • if using spark IO, transfer models and logs directories to HDFS location from local directories
  • log overall run time of ml4ir job

Parameters:
  • job_status (str) – Tuple with first element _SUCCESS or _FAILURE second element
  • job_info (str) – for _SUCCESS, the experiment tracking metrics and metadata; for _FAILURE, the stack trace of the failure
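Two of the wrap-up actions listed above, saving a job status file and logging the overall run time, could look roughly like this. The sketch is an assumption, not the ml4ir code: `finish_sketch`, its `logs_dir` and `start_time` arguments, and the choice to write job_info into the status file are all hypothetical.

```python
import logging
import os
import tempfile
import time


def finish_sketch(logs_dir: str, job_status: str, job_info: str, start_time: float) -> str:
    """Hypothetical helper: write a status file and log the elapsed time."""
    # The file is named after the status, e.g. _SUCCESS or _FAILURE
    status_path = os.path.join(logs_dir, job_status)
    with open(status_path, "w") as status_file:
        # Assumption: store the tracking info or stack trace inside the file
        status_file.write(job_info)

    elapsed = time.time() - start_time
    logging.getLogger("pipeline_sketch").info("Job completed in %.2f seconds", elapsed)
    return status_path


start = time.time()
with tempfile.TemporaryDirectory() as logs_dir:
    path = finish_sketch(logs_dir, "_SUCCESS", '{"metric": "placeholder"}', start)
    status_written = os.path.exists(path)
    status_name = os.path.basename(path)
```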


class ml4ir.applications.ranking.pipeline.RankingPipeline(args: argparse.Namespace)

Bases: ml4ir.base.pipeline.RelevancePipeline

Base class that defines a pipeline to train, evaluate and save a RankingModel using ml4ir

Constructor to create a RelevancePipeline object to train, evaluate and save a model on ml4ir. This method sets up data, logs, models directories, file handlers used. The method also loads and sets up the FeatureConfig for the model training pipeline

Parameters:args (argparse Namespace) – arguments to be used with the pipeline. Typically, passed from command line arguments

Fetch the class of the RelevanceModel to be used for the ml4ir pipeline

Return type:RelevanceModel class


Get the primary loss function to be used with the RelevanceModel

Return type:RelevanceLossBase object

Get the auxiliary loss function to be used with the RelevanceModel

Return type:RelevanceLossBase object
static get_metrics(metrics_keys: List[str]) → List[Union[keras.metrics.base_metric.Metric, str]]

Get the list of keras metrics to be used with the RelevanceModel

Parameters:metrics_keys (List of str) – List of strings indicating the metrics to instantiate and retrieve
Return type:list of keras Metric objects

Validate the arguments to be used with RelevancePipeline


Create a RankingPipeline object used in running kfold cross validation.

kfold_analysis(base_logs_dir, run_id, num_folds, pvalue_threshold=0.1, metrics=None)

Aggregate the results of the k-fold runs and perform a t-test between the old (prod) model and the new model with respect to the specified metrics.

Parameters:
  • base_logs_dir (str) – path to the base logs directory
  • run_id (int) – current fold number
  • num_folds (int) – total number of folds
  • pvalue_threshold (float) – the p-value threshold used to assess significance
  • metrics (list) – list of metrics to include in the kfold analysis
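The comparison of per-fold metrics against pvalue_threshold can be illustrated with a small, dependency-free sketch. It assumes a paired, two-sided t-test on one metric per fold, which may differ from the test ml4ir actually runs; the helper names and the example scores are hypothetical. In practice a library routine such as `scipy.stats.ttest_rel` would normally be used instead of the hand-rolled integration below.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_pvalue(t_stat: float, df: int, steps: int = 2000) -> float:
    """Two-sided p-value, integrating the density on [0, |t|] with Simpson's rule."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)


def paired_t_test(old: Sequence[float], new: Sequence[float]) -> Tuple[float, float]:
    """Paired t-test on per-fold metric values; returns (t statistic, p-value)."""
    diffs = [n - o for o, n in zip(old, new)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all per-fold differences are identical")
    t_stat = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t_stat, two_sided_pvalue(t_stat, len(diffs) - 1)


# Made-up per-fold metric values for the old and new models (5 folds)
old_scores = [0.60, 0.62, 0.61, 0.59, 0.63]
new_scores = [0.65, 0.66, 0.66, 0.64, 0.69]
pvalue_threshold = 0.1

t_stat, p_value = paired_t_test(old_scores, new_scores)
is_significant = p_value < pvalue_threshold
```

A paired test is a natural fit when both models are evaluated on the same folds, since it compares the per-fold differences rather than the two sets of scores independently.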

run_kfold_analysis(logs_dir, run_id, num_folds, metrics)

Run the kfold analysis for ranking.

Parameters:
  • logs_dir (str) – path to logs directory
  • run_id (str) – string run_id
  • num_folds (int) – number of folds
  • metrics (list) – list of metrics to include in the kfold analysis

Returns:summary of the kfold analysis


class ml4ir.applications.classification.pipeline.ClassificationPipeline(args: argparse.Namespace)

Bases: ml4ir.base.pipeline.RelevancePipeline

Base class that defines a pipeline to train, evaluate and save a RelevanceModel for classification using ml4ir

Constructor to create a RelevancePipeline object to train, evaluate and save a model on ml4ir. This method sets up data, logs, models directories, file handlers used. The method also loads and sets up the FeatureConfig for the model training pipeline

Parameters:args (argparse Namespace) – arguments to be used with the pipeline. Typically, passed from command line arguments

Fetch the class of the RelevanceModel to be used for the ml4ir pipeline

Return type:RelevanceModel class

Get the primary loss function to be used with the RelevanceModel

Return type:RelevanceLossBase object
static get_metrics(metrics_keys: List[str]) → List[Union[keras.metrics.base_metric.Metric, str]]

Get the list of keras metrics to be used with the RelevanceModel

Parameters:metrics_keys (List of str) – List of strings indicating the metrics to instantiate and retrieve
Return type:list of keras Metric objects
get_relevance_dataset(parse_tfrecord=True, preprocessing_keys_to_fns={}) → ml4ir.base.data.relevance_dataset.RelevanceDataset

Create RelevanceDataset object by loading train, test data as tensorflow datasets. Defines a preprocessing feature function to one-hot vectorize classification labels.

Parameters:preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
Returns:RelevanceDataset object that can be used for training and evaluating the model
Return type:RelevanceDataset object


Override this method to create custom dataset objects
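One-hot vectorizing a classification label, as mentioned above, amounts to the following. This is a plain-Python sketch under stated assumptions: the `one_hot` helper and the label vocabulary are hypothetical, and a preprocessing function used while loading tensorflow datasets would operate on tensors instead.

```python
from typing import List


def one_hot(label: str, vocabulary: List[str]) -> List[float]:
    """Hypothetical helper: return a vector with 1.0 at the label's vocabulary index."""
    index = vocabulary.index(label)
    return [1.0 if i == index else 0.0 for i in range(len(vocabulary))]


# Made-up label vocabulary for illustration
vocabulary = ["sports", "politics", "tech"]
encoded = one_hot("politics", vocabulary)
```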

get_kfold_relevance_dataset(num_folds, include_testset_in_kfold, read_data_sets=False, parse_tfrecord=True, preprocessing_keys_to_fns={}) → ml4ir.base.data.kfold_relevance_dataset.KfoldRelevanceDataset

Create KfoldRelevanceDataset object by loading train, test data as tensorflow datasets. Defines a preprocessing feature function to one-hot vectorize classification labels.

Parameters:
  • num_folds (int) – Number of folds in kfold CV
  • include_testset_in_kfold (bool) – Whether to include testset in the folds
  • read_data_sets (bool) – Whether to read datasets from disk
  • preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the KfoldRelevanceDataset object

Returns:KfoldRelevanceDataset object that can be used for training and evaluating the model in a kfold cross validation mode.
Return type:KfoldRelevanceDataset object


Override this method to create custom dataset objects


Create a ClassificationPipeline object used in running kfold cross validation.

run_kfold_analysis(base_logs_dir, base_run_id, num_folds, metrics)