Pipelines

RelevancePipeline

class ml4ir.base.pipeline.RelevancePipeline(args: argparse.Namespace)

Bases: object

Base class that defines a pipeline to train, evaluate and save a RelevanceModel using ml4ir

Constructor to create a RelevancePipeline object to train, evaluate and save a model on ml4ir. This method sets up the data, logs and models directories and the file handlers to be used. It also loads and sets up the FeatureConfig for the model training pipeline.

Parameters:args (argparse Namespace) – arguments to be used with the pipeline. Typically, passed from command line arguments
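
A minimal usage sketch follows; the flag names below are illustrative assumptions, since the actual arguments are defined by ml4ir's argument parser for your application:

    import argparse

    from ml4ir.base.pipeline import RelevancePipeline

    # Hypothetical flags for illustration only; use the arguments your
    # ml4ir application actually defines
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir")
    parser.add_argument("--models_dir")
    args = parser.parse_args(["--data_dir", "data/", "--models_dir", "models/"])

    pipeline = RelevancePipeline(args)  # sets up directories, logging, FeatureConfig
    experiment_tracking = pipeline.run()  # train, evaluate and save the model
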
setup_logging() → logging.Logger

Set up the logging utilities for the training pipeline. Additionally, removes any pre-existing job status files.

set_seeds(reset_graph=True)

Set the random seeds for tensorflow and numpy in order to replicate results

Parameters:reset_graph (bool) – Whether to reset the tensorflow graph and clear the keras session
get_relevance_dataset(preprocessing_keys_to_fns={}) → ml4ir.base.data.relevance_dataset.RelevanceDataset

Create RelevanceDataset object by loading train and test data as tensorflow datasets

Parameters:preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
Returns:RelevanceDataset object that can be used for training and evaluating the model
Return type:RelevanceDataset object

Notes

Override this method to create custom dataset objects
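
For example, a custom preprocessing function can be passed in by name (a sketch assuming a pipeline object as constructed above; the function name and its use in the FeatureConfig are assumptions):

    import tensorflow as tf

    # Hypothetical preprocessing function; the key "lowercase_text" would be
    # referenced from the FeatureConfig entry of the corresponding feature
    def lowercase_text(feature_tensor):
        return tf.strings.lower(feature_tensor)

    relevance_dataset = pipeline.get_relevance_dataset(
        preprocessing_keys_to_fns={"lowercase_text": lowercase_text}
    )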

get_kfold_relevance_dataset(num_folds, include_testset_in_kfold, read_data_sets, preprocessing_keys_to_fns={}) → ml4ir.base.data.relevance_dataset.RelevanceDataset

Create KfoldRelevanceDataset object by loading train and test data as tensorflow datasets

Parameters:
  • num_folds (int) – number of folds in kfold
  • include_testset_in_kfold (bool) – whether to include the testset in the folds
  • read_data_sets (bool) – whether to call create_dataset which reads data from files.
  • preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
Returns:KfoldRelevanceDataset object that can be used for training and evaluating the model
Return type:KfoldRelevanceDataset object

Notes

Override this method to create custom dataset objects
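
For example, using the signature documented above (assuming a pipeline object as constructed earlier; values are illustrative):

    # Build a 5-fold dataset, reading the data from files
    kfold_dataset = pipeline.get_kfold_relevance_dataset(
        num_folds=5,
        include_testset_in_kfold=False,
        read_data_sets=True,
    )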

get_relevance_model_cls()

Fetch the class of the RelevanceModel to be used for the ml4ir pipeline

Returns:
Return type:RelevanceModel class
get_loss()

Get the primary loss function to be used with the RelevanceModel

Returns:
Return type:RelevanceLossBase object
get_aux_loss()

Get the auxiliary loss function to be used with the RelevanceModel

Returns:
Return type:RelevanceLossBase object
static get_metrics(metrics_keys: List[str]) → List[Union[keras.metrics.base_metric.Metric, str]]

Get the list of keras metrics to be used with the RelevanceModel

Parameters:metrics_keys (List of str) – List of strings indicating the metrics to instantiate and retrieve
Returns:
Return type:list of keras Metric objects
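
For example (assuming "categorical_accuracy" is among the metric keys supported by your application):

    from ml4ir.base.pipeline import RelevancePipeline

    # Instantiate keras metrics from their string keys
    metrics = RelevancePipeline.get_metrics(metrics_keys=["categorical_accuracy"])
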
get_relevance_model(feature_layer_keys_to_fns={}) → ml4ir.base.model.relevance_model.RelevanceModel

Creates a RelevanceModel that can be used for training and evaluating

Parameters:feature_layer_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to tensorflow compatible function definitions that can now be used in the InteractionModel as a feature function to transform input features
Returns:RelevanceModel that can be used for training and evaluating
Return type:RelevanceModel

Notes

Override this method to create custom loss, scorer, model objects
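
A sketch of such an override follows; the feature-function signature expected by the InteractionModel is an assumption here, and the function itself is hypothetical:

    from ml4ir.base.pipeline import RelevancePipeline

    class MyCustomPipeline(RelevancePipeline):
        def get_relevance_model(self, feature_layer_keys_to_fns={}):
            # Hypothetical feature function, referenced by name from the
            # FeatureConfig; shown only to illustrate the hook
            def scale_feature(feature_tensor, feature_info, file_io):
                return feature_tensor * 2.0

            return super().get_relevance_model(
                feature_layer_keys_to_fns={"scale_feature": scale_feature}
            )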

create_pipeline_for_kfold(args)

Create a pipeline object used for running kfold cross validation.

run()

Run the pipeline to train, evaluate and save the model. It also runs the pipeline in kfold cross validation mode if specified.

Returns:Experiment tracking dictionary with metrics and metadata for the run. Used for model selection and hyperparameter optimization
Return type:dict

Notes

Also populates an experiment tracking dictionary containing the metadata, model architecture and metrics generated by the model

run_pipeline(relevance_dataset=None)

Run the pipeline to train, evaluate and save the model.

Parameters:relevance_dataset (RelevanceDataset) – RelevanceDataset used for running the pipeline. If None, the relevance dataset will be created.
Returns:Experiment tracking dictionary with metrics and metadata for the run. Used for model selection and hyperparameter optimization
Return type:dict

Notes

Also populates an experiment tracking dictionary containing the metadata, model architecture and metrics generated by the model
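
For example, a prebuilt dataset can be reused across runs, e.g. for hyperparameter sweeps (a sketch assuming a pipeline object as constructed earlier):

    # Build the dataset once and reuse it for multiple pipeline runs
    dataset = pipeline.get_relevance_dataset()
    tracking_dict = pipeline.run_pipeline(relevance_dataset=dataset)
    print(tracking_dict)  # metrics and metadata for model selection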

pre_processing_step()

Performs arbitrary pre-processing steps, such as copying or transforming data, that the rest of the code cannot accommodate. It serves as a placeholder in the base pipeline, with no explicit implementation (it simply returns self); users are expected to extend it in their custom pipelines.

post_training_step()

Performs arbitrary post-training steps, such as copying or transforming data, that the rest of the code cannot accommodate. It serves as a placeholder in the base pipeline, with no explicit implementation (it simply returns self); users are expected to extend it in their custom pipelines.
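
A sketch of extending both hooks in a custom pipeline (the paths and operations are illustrative):

    import shutil

    from ml4ir.base.pipeline import RelevancePipeline

    class PipelineWithHooks(RelevancePipeline):
        def pre_processing_step(self):
            # e.g. copy raw data into the layout the pipeline expects
            shutil.copytree("raw_data/", "data/train/", dirs_exist_ok=True)
            return self

        def post_training_step(self):
            # e.g. archive artifacts the base pipeline does not handle
            shutil.make_archive("model_artifacts", "zip", "models/")
            return self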

finish(job_status, job_info)

Wrap up the model training pipeline. Performs the following actions

  • save a job status file as _SUCCESS or _FAILURE to indicate the job status
  • delete the temp data and models directories
  • if using spark IO, transfer the models and logs directories from the local directories to the HDFS location
  • log the overall run time of the ml4ir job
Parameters:
  • job_status (str) – _SUCCESS or _FAILURE, indicating the status of the job
  • job_info (str) – for _SUCCESS, the experiment tracking metrics and metadata; for _FAILURE, the stacktrace of the failure

RankingPipeline

class ml4ir.applications.ranking.pipeline.RankingPipeline(args: argparse.Namespace)

Bases: ml4ir.base.pipeline.RelevancePipeline

Base class that defines a pipeline to train, evaluate and save a RankingModel using ml4ir

Constructor to create a RelevancePipeline object to train, evaluate and save a model on ml4ir. This method sets up the data, logs and models directories and the file handlers to be used. It also loads and sets up the FeatureConfig for the model training pipeline.

Parameters:args (argparse Namespace) – arguments to be used with the pipeline. Typically, passed from command line arguments
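
Usage mirrors the base class (a sketch; args is an argparse Namespace parsed for your ranking application, as in the RelevancePipeline example above):

    from ml4ir.applications.ranking.pipeline import RankingPipeline

    pipeline = RankingPipeline(args)
    experiment_tracking = pipeline.run()
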
get_relevance_model_cls()

Fetch the class of the RelevanceModel to be used for the ml4ir pipeline

Returns:
Return type:RelevanceModel class

get_loss()

Get the primary loss function to be used with the RelevanceModel

Returns:
Return type:RelevanceLossBase object
get_aux_loss()

Get the auxiliary loss function to be used with the RelevanceModel

Returns:
Return type:RelevanceLossBase object
static get_metrics(metrics_keys: List[str]) → List[Union[keras.metrics.base_metric.Metric, str]]

Get the list of keras metrics to be used with the RelevanceModel

Parameters:metrics_keys (List of str) – List of strings indicating the metrics to instantiate and retrieve
Returns:
Return type:list of keras Metric objects
validate_args()

Validate the arguments to be used with RelevancePipeline

create_pipeline_for_kfold(args)

Create a RankingPipeline object used in running kfold cross validation.

kfold_analysis(base_logs_dir, run_id, num_folds, pvalue_threshold=0.1, metrics=None)

Aggregate the results of the k-fold runs and perform a t-test between the old (prod) model and the new model w.r.t. the specified metrics.

Parameters:
  • base_logs_dir (str) – path to the base logs directory
  • run_id (str) – current run ID
  • num_folds (int) – total number of folds
  • pvalue_threshold (float) – the threshold used for the p-value to assess significance
  • metrics (list) – list of metrics to include in the kfold analysis

run_kfold_analysis(logs_dir, run_id, num_folds, metrics)

Run the kfold analysis for ranking.

Parameters:
  • logs_dir (str) – path to the logs directory
  • run_id (str) – string run ID
  • num_folds (int) – number of folds
  • metrics (list) – list of metrics to include in the kfold analysis
Returns:summary of the kfold analysis

ClassificationPipeline

class ml4ir.applications.classification.pipeline.ClassificationPipeline(args: argparse.Namespace)

Bases: ml4ir.base.pipeline.RelevancePipeline

Base class that defines a pipeline to train, evaluate and save a RelevanceModel for classification using ml4ir

Constructor to create a RelevancePipeline object to train, evaluate and save a model on ml4ir. This method sets up the data, logs and models directories and the file handlers to be used. It also loads and sets up the FeatureConfig for the model training pipeline.

Parameters:args (argparse Namespace) – arguments to be used with the pipeline. Typically, passed from command line arguments
get_relevance_model_cls()

Fetch the class of the RelevanceModel to be used for the ml4ir pipeline

Returns:
Return type:RelevanceModel class
get_loss()

Get the primary loss function to be used with the RelevanceModel

Returns:
Return type:RelevanceLossBase object
static get_metrics(metrics_keys: List[str]) → List[Union[keras.metrics.base_metric.Metric, str]]

Get the list of keras metrics to be used with the RelevanceModel

Parameters:metrics_keys (List of str) – List of strings indicating the metrics to instantiate and retrieve
Returns:
Return type:list of keras Metric objects
get_relevance_dataset(parse_tfrecord=True, preprocessing_keys_to_fns={}) → ml4ir.base.data.relevance_dataset.RelevanceDataset

Create RelevanceDataset object by loading train and test data as tensorflow datasets. Defines a preprocessing feature function to one-hot vectorize classification labels.

Parameters:
  • parse_tfrecord (bool) – whether to parse the serialized TFRecord examples while loading the dataset
  • preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
Returns:RelevanceDataset object that can be used for training and evaluating the model
Return type:RelevanceDataset object

Notes

Override this method to create custom dataset objects
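
The built-in label preprocessing is roughly equivalent to a one-hot vectorization function like the following (a sketch of the idea, not ml4ir's actual implementation; the vocabulary and function name are illustrative):

    import tensorflow as tf

    # Hypothetical label vocabulary; in ml4ir this would come from the
    # FeatureConfig entry for the label feature
    label_vocabulary = ["class_a", "class_b", "class_c"]
    lookup = tf.keras.layers.StringLookup(
        vocabulary=label_vocabulary, num_oov_indices=0
    )

    def one_hot_vectorize_label(label_tensor):
        # Map the string label to an integer id, then one-hot encode it
        label_id = lookup(label_tensor)
        return tf.one_hot(label_id, depth=len(label_vocabulary))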

get_kfold_relevance_dataset(num_folds, include_testset_in_kfold, read_data_sets=False, parse_tfrecord=True, preprocessing_keys_to_fns={}) → ml4ir.base.data.kfold_relevance_dataset.KfoldRelevanceDataset

Create KfoldRelevanceDataset object by loading train and test data as tensorflow datasets. Defines a preprocessing feature function to one-hot vectorize classification labels.

Parameters:
  • num_folds (int) – Number of folds in kfold CV
  • include_testset_in_kfold (bool) – Whether to include testset in the folds
  • read_data_sets (bool) – Whether to read datasets from disk
  • preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the KfoldRelevanceDataset object
Returns:

KfoldRelevanceDataset object that can be used for training and evaluating the model in a kfold cross validation mode.

Return type:

KfoldRelevanceDataset object

Notes

Override this method to create custom dataset objects

create_pipeline_for_kfold(args)

Create a ClassificationPipeline object used in running kfold cross validation.

run_kfold_analysis(base_logs_dir, base_run_id, num_folds, metrics)

Run the kfold analysis for classification.