Pipelines

RelevancePipeline

class ml4ir.base.pipeline.RelevancePipeline(args: argparse.Namespace)

Bases: object

Base class that defines a pipeline to train, evaluate and save a RelevanceModel using ml4ir

Constructor to create a RelevancePipeline object to train, evaluate and save a model on ml4ir. This method sets up the data, logs and models directories and the file handlers to be used. It also loads and sets up the FeatureConfig for the model training pipeline.

Parameters:args (argparse Namespace) – arguments to be used with the pipeline. Typically, passed from command line arguments
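
A minimal usage sketch follows; the flag names below are illustrative assumptions, since the actual arguments are defined by ml4ir's argument parser for your application:

    import argparse

    from ml4ir.base.pipeline import RelevancePipeline

    # Hypothetical flags for illustration only; use the arguments your
    # ml4ir application actually defines
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir")
    parser.add_argument("--models_dir")
    args = parser.parse_args(["--data_dir", "data/", "--models_dir", "models/"])

    pipeline = RelevancePipeline(args)  # sets up directories, logging, FeatureConfig
    experiment_tracking = pipeline.run()  # train, evaluate and save the model
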
setup_logging() → logging.Logger

Set up the logging utilities for the training pipeline. Additionally, removes any pre-existing job status files.

set_seeds(reset_graph=True)

Set the random seeds for tensorflow and numpy in order to replicate results

Parameters:reset_graph (bool) – Whether to reset the tensorflow graph and clear the keras session
get_relevance_dataset(preprocessing_keys_to_fns={}) → ml4ir.base.data.relevance_dataset.RelevanceDataset

Create RelevanceDataset object by loading train and test data as tensorflow datasets

Parameters:preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
Returns:RelevanceDataset object that can be used for training and evaluating the model
Return type:RelevanceDataset object

Notes

Override this method to create custom dataset objects
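
For example, a custom preprocessing function can be passed in by name (a sketch assuming a pipeline object as constructed above; the function name and its use in the FeatureConfig are assumptions):

    import tensorflow as tf

    # Hypothetical preprocessing function; the key "lowercase_text" would be
    # referenced from the FeatureConfig entry of the corresponding feature
    def lowercase_text(feature_tensor):
        return tf.strings.lower(feature_tensor)

    relevance_dataset = pipeline.get_relevance_dataset(
        preprocessing_keys_to_fns={"lowercase_text": lowercase_text}
    )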

get_kfold_relevance_dataset(num_folds, include_testset_in_kfold, read_data_sets, preprocessing_keys_to_fns={}) → ml4ir.base.data.relevance_dataset.RelevanceDataset

Create KfoldRelevanceDataset object by loading train and test data as tensorflow datasets

Parameters:
  • num_folds (int) – number of folds in kfold
  • include_testset_in_kfold (bool) – whether to include the testset in the folds
  • read_data_sets (bool) – whether to call create_dataset which reads data from files.
  • preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
Returns:KfoldRelevanceDataset object that can be used for training and evaluating the model
Return type:KfoldRelevanceDataset object

Notes

Override this method to create custom dataset objects
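
For example, using the signature documented above (assuming a pipeline object as constructed earlier; values are illustrative):

    # Build a 5-fold dataset, reading the data from files
    kfold_dataset = pipeline.get_kfold_relevance_dataset(
        num_folds=5,
        include_testset_in_kfold=False,
        read_data_sets=True,
    )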

get_relevance_model_cls()

Fetch the class of the RelevanceModel to be used for the ml4ir pipeline

Returns:
Return type:RelevanceModel class
get_loss()

Get the primary loss function to be used with the RelevanceModel

Returns:
Return type:RelevanceLossBase object
get_aux_loss()

Get the auxiliary loss function to be used with the RelevanceModel

Returns:
Return type:RelevanceLossBase object
static get_metrics(metrics_keys: List[str]) → List[Union[keras.metrics.base_metric.Metric, str]]

Get the list of keras metrics to be used with the RelevanceModel

Parameters:metrics_keys (List of str) – List of strings indicating the metrics to instantiate and retrieve
Returns:
Return type:list of keras Metric objects
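
For example (assuming "categorical_accuracy" is among the metric keys supported by your application):

    from ml4ir.base.pipeline import RelevancePipeline

    # Instantiate keras metrics from their string keys
    metrics = RelevancePipeline.get_metrics(metrics_keys=["categorical_accuracy"])
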
get_relevance_model(feature_layer_keys_to_fns={}) → ml4ir.base.model.relevance_model.RelevanceModel

Creates a RelevanceModel that can be used for training and evaluating

Parameters:feature_layer_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to tensorflow compatible function definitions that can now be used in the InteractionModel as a feature function to transform input features
Returns:RelevanceModel that can be used for training and evaluating
Return type:RelevanceModel

Notes

Override this method to create custom loss, scorer, model objects
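
A sketch of such an override follows; the feature-function signature expected by the InteractionModel is an assumption here, and the function itself is hypothetical:

    from ml4ir.base.pipeline import RelevancePipeline

    class MyCustomPipeline(RelevancePipeline):
        def get_relevance_model(self, feature_layer_keys_to_fns={}):
            # Hypothetical feature function, referenced by name from the
            # FeatureConfig; shown only to illustrate the hook
            def scale_feature(feature_tensor, feature_info, file_io):
                return feature_tensor * 2.0

            return super().get_relevance_model(
                feature_layer_keys_to_fns={"scale_feature": scale_feature}
            )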

create_pipeline_for_kfold(args)

Create a pipeline object used for running kfold cross validation.

run()

Run the pipeline to train, evaluate and save the model. It also runs the pipeline in kfold cross validation mode if specified.

Returns:Experiment tracking dictionary with metrics and metadata for the run. Used for model selection and hyperparameter optimization
Return type:dict

Notes

Also populates an experiment tracking dictionary containing the metadata, model architecture and metrics generated by the model

run_pipeline(relevance_dataset=None)

Run the pipeline to train, evaluate and save the model.

Parameters:relevance_dataset (RelevanceDataset) – RelevanceDataset used for running the pipeline. If None, the relevance dataset will be created.
Returns:Experiment tracking dictionary with metrics and metadata for the run. Used for model selection and hyperparameter optimization
Return type:dict

Notes

Also populates an experiment tracking dictionary containing the metadata, model architecture and metrics generated by the model
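
For example, a prebuilt dataset can be reused across runs, e.g. for hyperparameter sweeps (a sketch assuming a pipeline object as constructed earlier):

    # Build the dataset once and reuse it for multiple pipeline runs
    dataset = pipeline.get_relevance_dataset()
    tracking_dict = pipeline.run_pipeline(relevance_dataset=dataset)
    print(tracking_dict)  # metrics and metadata for model selection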

pre_processing_step()

Performs arbitrary pre-processing steps, such as copying or transforming data, that the rest of the code cannot accommodate. It serves as a placeholder in the base pipeline, with no explicit implementation (it simply returns self); users are expected to extend it in their custom pipelines.

post_training_step()

Performs arbitrary post-training steps, such as copying or transforming data, that the rest of the code cannot accommodate. It serves as a placeholder in the base pipeline, with no explicit implementation (it simply returns self); users are expected to extend it in their custom pipelines.
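
A sketch of extending both hooks in a custom pipeline (the paths and operations are illustrative):

    import shutil

    from ml4ir.base.pipeline import RelevancePipeline

    class PipelineWithHooks(RelevancePipeline):
        def pre_processing_step(self):
            # e.g. copy raw data into the layout the pipeline expects
            shutil.copytree("raw_data/", "data/train/", dirs_exist_ok=True)
            return self

        def post_training_step(self):
            # e.g. archive artifacts the base pipeline does not handle
            shutil.make_archive("model_artifacts", "zip", "models/")
            return self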

finish(job_status, job_info)

Wrap up the model training pipeline. Performs the following actions

  • save a job status file as _SUCCESS or _FAILURE to indicate the job status
  • delete the temp data and models directories
  • if using spark IO, transfer the models and logs directories from the local directories to the HDFS location
  • log the overall run time of the ml4ir job
Parameters:
  • job_status (str) – _SUCCESS or _FAILURE, indicating the status of the job
  • job_info (str) – for _SUCCESS, the experiment tracking metrics and metadata; for _FAILURE, the stacktrace of the failure

RankingPipeline

class ml4ir.applications.ranking.pipeline.RankingPipeline(args: argparse.Namespace)

Bases: ml4ir.base.pipeline.RelevancePipeline

Base class that defines a pipeline to train, evaluate and save a RankingModel using ml4ir

Constructor to create a RelevancePipeline object to train, evaluate and save a model on ml4ir. This method sets up the data, logs and models directories and the file handlers to be used. It also loads and sets up the FeatureConfig for the model training pipeline.

Parameters:args (argparse Namespace) – arguments to be used with the pipeline. Typically, passed from command line arguments
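
Usage mirrors the base class (a sketch; args is an argparse Namespace parsed for your ranking application, as in the RelevancePipeline example above):

    from ml4ir.applications.ranking.pipeline import RankingPipeline

    pipeline = RankingPipeline(args)
    experiment_tracking = pipeline.run()
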
get_relevance_model_cls()

Fetch the class of the RelevanceModel to be used for the ml4ir pipeline

Returns:
Return type:RelevanceModel class

get_loss()

Get the primary loss function to be used with the RelevanceModel

Returns:
Return type:RelevanceLossBase object
get_aux_loss()

Get the auxiliary loss function to be used with the RelevanceModel

Returns:
Return type:RelevanceLossBase object
static get_metrics(metrics_keys: List[str]) → List[Union[keras.metrics.base_metric.Metric, str]]

Get the list of keras metrics to be used with the RelevanceModel

Parameters:metrics_keys (List of str) – List of strings indicating the metrics to instantiate and retrieve
Returns:
Return type:list of keras Metric objects
validate_args()

Validate the arguments to be used with RelevancePipeline

create_pipeline_for_kfold(args)

Create a RankingPipeline object used in running kfold cross validation.

kfold_analysis(base_logs_dir, run_id, num_folds, pvalue_threshold=0.1, metrics=None)

Aggregate the results of the k-fold runs and perform a t-test between the old (prod) model and the new model w.r.t. the specified metrics.

Parameters:
  • base_logs_dir (str) – path to the base logs directory
  • run_id (str) – current run ID
  • num_folds (int) – total number of folds
  • pvalue_threshold (float) – the threshold used for the p-value to assess significance
  • metrics (list) – list of metrics to include in the kfold analysis

run_kfold_analysis(logs_dir, run_id, num_folds, metrics)

Run the kfold analysis for ranking.

Parameters:
  • logs_dir (str) – path to the logs directory
  • run_id (str) – string run ID
  • num_folds (int) – number of folds
  • metrics (list) – list of metrics to include in the kfold analysis
Returns:summary of the kfold analysis

ClassificationPipeline

class ml4ir.applications.classification.pipeline.ClassificationPipeline(args: argparse.Namespace)

Bases: ml4ir.base.pipeline.RelevancePipeline

Base class that defines a pipeline to train, evaluate and save a RelevanceModel for classification using ml4ir

Constructor to create a RelevancePipeline object to train, evaluate and save a model on ml4ir. This method sets up the data, logs and models directories and the file handlers to be used. It also loads and sets up the FeatureConfig for the model training pipeline.

Parameters:args (argparse Namespace) – arguments to be used with the pipeline. Typically, passed from command line arguments
get_relevance_model_cls()

Fetch the class of the RelevanceModel to be used for the ml4ir pipeline

Returns:
Return type:RelevanceModel class
get_loss()

Get the primary loss function to be used with the RelevanceModel

Returns:
Return type:RelevanceLossBase object
static get_metrics(metrics_keys: List[str]) → List[Union[keras.metrics.base_metric.Metric, str]]

Get the list of keras metrics to be used with the RelevanceModel

Parameters:metrics_keys (List of str) – List of strings indicating the metrics to instantiate and retrieve
Returns:
Return type:list of keras Metric objects
get_relevance_dataset(parse_tfrecord=True, preprocessing_keys_to_fns={}) → ml4ir.base.data.relevance_dataset.RelevanceDataset

Create RelevanceDataset object by loading train and test data as tensorflow datasets. Defines a preprocessing feature function to one-hot vectorize classification labels.

Parameters:
  • parse_tfrecord (bool) – whether to parse the serialized TFRecord examples while loading the dataset
  • preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
Returns:RelevanceDataset object that can be used for training and evaluating the model
Return type:RelevanceDataset object

Notes

Override this method to create custom dataset objects
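
The built-in label preprocessing is roughly equivalent to a one-hot vectorization function like the following (a sketch of the idea, not ml4ir's actual implementation; the vocabulary and function name are illustrative):

    import tensorflow as tf

    # Hypothetical label vocabulary; in ml4ir this would come from the
    # FeatureConfig entry for the label feature
    label_vocabulary = ["class_a", "class_b", "class_c"]
    lookup = tf.keras.layers.StringLookup(
        vocabulary=label_vocabulary, num_oov_indices=0
    )

    def one_hot_vectorize_label(label_tensor):
        # Map the string label to an integer id, then one-hot encode it
        label_id = lookup(label_tensor)
        return tf.one_hot(label_id, depth=len(label_vocabulary))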

get_kfold_relevance_dataset(num_folds, include_testset_in_kfold, read_data_sets=False, parse_tfrecord=True, preprocessing_keys_to_fns={}) → ml4ir.base.data.kfold_relevance_dataset.KfoldRelevanceDataset

Create KfoldRelevanceDataset object by loading train and test data as tensorflow datasets. Defines a preprocessing feature function to one-hot vectorize classification labels.

Parameters:
  • num_folds (int) – Number of folds in kfold CV
  • include_testset_in_kfold (bool) – Whether to include testset in the folds
  • read_data_sets (bool) – Whether to read datasets from disk
  • preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the KfoldRelevanceDataset object
Returns:

KfoldRelevanceDataset object that can be used for training and evaluating the model in a kfold cross validation mode.

Return type:

KfoldRelevanceDataset object

Notes

Override this method to create custom dataset objects

create_pipeline_for_kfold(args)

Create a ClassificationPipeline object used in running kfold cross validation.

run_kfold_analysis(base_logs_dir, base_run_id, num_folds, metrics)

Run the kfold analysis for classification.