Pipelines

RelevancePipeline

class ml4ir.base.pipeline.RelevancePipeline(args: argparse.Namespace)

Bases: object

Base class that defines a pipeline to train, evaluate and save a RelevanceModel using ml4ir.

The constructor creates a RelevancePipeline object to train, evaluate and save a model on ml4ir. It sets up the data, logs and models directories and the file handlers used, and also loads and sets up the FeatureConfig for the model training pipeline.

Parameters: args (argparse.Namespace) – arguments to be used with the pipeline. Typically passed from command line arguments.
setup_logging() → logging.Logger

Set up the logging utilities for the training pipeline. Additionally, removes pre-existing job status files.
set_seeds(reset_graph=True)

Set the random seeds for tensorflow and numpy in order to replicate results.

Parameters: reset_graph (bool) – Reset the tensorflow graph and clear the keras session.
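The reproducibility contract behind set_seeds can be sketched with the standard library alone; ml4ir seeds numpy and tensorflow analogously (and, when reset_graph is True, resets the graph and clears the keras session):

```python
# Stdlib-only illustration of re-seeding for replicable runs. ml4ir's
# set_seeds applies the same idea to numpy and tensorflow.
import random

def set_seeds(seed=123):
    random.seed(seed)

set_seeds()
first_run = [random.random() for _ in range(3)]
set_seeds()
second_run = [random.random() for _ in range(3)]
assert first_run == second_run  # identical draws after re-seeding
```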
get_relevance_dataset(preprocessing_keys_to_fns={}) → ml4ir.base.data.relevance_dataset.RelevanceDataset

Create a RelevanceDataset object by loading train and test data as tensorflow datasets.

Parameters: preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object.

Returns: RelevanceDataset object that can be used for training and evaluating the model.

Return type: RelevanceDataset object

Notes

Override this method to create custom dataset objects.
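As an illustration of the preprocessing_keys_to_fns argument, a caller might build a mapping like the following; both function names here are hypothetical, not part of ml4ir:

```python
# Hypothetical preprocessing_keys_to_fns mapping. Keys are the names a
# FeatureConfig can reference; values are the callables applied to raw
# feature values while the TFRecordDataset is loaded.
def lowercase_text(text):
    return text.lower()

def clip_score(value, low=0.0, high=1.0):
    return max(low, min(high, value))

preprocessing_keys_to_fns = {
    "lowercase_text": lowercase_text,
    "clip_score": clip_score,
}

# A pipeline would forward this mapping, e.g.:
# dataset = pipeline.get_relevance_dataset(
#     preprocessing_keys_to_fns=preprocessing_keys_to_fns)
```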
get_kfold_relevance_dataset(num_folds, include_testset_in_kfold, read_data_sets, preprocessing_keys_to_fns={}) → ml4ir.base.data.relevance_dataset.RelevanceDataset

Create a KfoldRelevanceDataset object by loading train and test data as tensorflow datasets.

Parameters:
- num_folds (int) – number of folds in kfold
- include_testset_in_kfold (bool) – whether to include the testset in the folds
- read_data_sets (bool) – whether to call create_dataset, which reads data from files
- preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the KfoldRelevanceDataset object

Returns: KfoldRelevanceDataset object that can be used for training and evaluating the model.

Return type: KfoldRelevanceDataset object

Notes

Override this method to create custom dataset objects.
get_relevance_model_cls()

Fetch the class of the RelevanceModel to be used for the ml4ir pipeline.

Return type: RelevanceModel class
get_loss()

Get the primary loss function to be used with the RelevanceModel.

Return type: RelevanceLossBase object
get_aux_loss()

Get the auxiliary loss function to be used with the RelevanceModel.

Return type: RelevanceLossBase object
static get_metrics(metrics_keys: List[str]) → List[Union[keras.metrics.base_metric.Metric, str]]

Get the list of keras metrics to be used with the RelevanceModel.

Parameters: metrics_keys (List of str) – List of strings indicating the metrics to instantiate and retrieve.

Return type: list of keras Metric objects
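The lookup get_metrics performs can be sketched as follows, with a plain registry standing in for keras so the snippet runs without tensorflow; the registry contents and pass-through behavior for unknown keys are illustrative:

```python
# Key-to-metric lookup sketch. ml4ir returns keras Metric instances; here
# stand-in factories are used instead. Unknown keys fall through as plain
# strings, matching the List[Union[Metric, str]] return annotation.
METRIC_REGISTRY = {
    "categorical_accuracy": lambda: "CategoricalAccuracy()",   # stand-in
    "top_5_accuracy": lambda: "TopKCategoricalAccuracy(k=5)",  # stand-in
}

def get_metrics(metrics_keys):
    return [METRIC_REGISTRY[key]() if key in METRIC_REGISTRY else key
            for key in metrics_keys]

metrics = get_metrics(["categorical_accuracy", "MRR"])
```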
get_relevance_model(feature_layer_keys_to_fns={}) → ml4ir.base.model.relevance_model.RelevanceModel

Creates a RelevanceModel that can be used for training and evaluating.

Parameters: feature_layer_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to tensorflow compatible function definitions that can now be used in the InteractionModel as a feature function to transform input features.

Returns: RelevanceModel that can be used for training and evaluating.

Return type: RelevanceModel

Notes

Override this method to create custom loss, scorer, model objects.
create_pipeline_for_kfold(args)
run()

Run the pipeline to train, evaluate and save the model. Also runs the pipeline in kfold cross validation mode if specified.

Returns: Experiment tracking dictionary with metrics and metadata for the run. Used for model selection and hyperparameter optimization.

Return type: dict

Notes

Also populates an experiment tracking dictionary containing the metadata, model architecture and metrics generated by the model.
run_pipeline(relevance_dataset=None)

Run the pipeline to train, evaluate and save the model.

Parameters: relevance_dataset (RelevanceDataset) – RelevanceDataset used for running the pipeline. If None, the relevance dataset will be created.

Returns: Experiment tracking dictionary with metrics and metadata for the run. Used for model selection and hyperparameter optimization.

Return type: dict

Notes

Also populates an experiment tracking dictionary containing the metadata, model architecture and metrics generated by the model.
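The run()/run_pipeline() contract described above can be sketched with stand-in stubs (ml4ir is not imported; all class names and dictionary keys are illustrative):

```python
# Toy control-flow sketch: run() delegates to run_pipeline(), which
# creates the dataset when one is not supplied and returns the
# experiment tracking dict.
class ToyPipeline:
    def get_relevance_dataset(self):
        return ["fold-of-records"]          # stand-in for RelevanceDataset

    def run_pipeline(self, relevance_dataset=None):
        if relevance_dataset is None:
            relevance_dataset = self.get_relevance_dataset()
        # ... train, evaluate, save ...
        return {"train_loss": 0.42, "dataset_size": len(relevance_dataset)}

    def run(self):
        return self.run_pipeline()

experiment_tracking = ToyPipeline().run()
```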
pre_processing_step()

Performs arbitrary pre-processing steps, such as copying or transforming data, that the rest of the code cannot accommodate. It serves as a placeholder without an explicit implementation (returns self) in the base pipeline. Users can extend it in their custom pipelines.
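Extending this hook in a custom pipeline might look like the following, where BasePipeline stands in for ml4ir's RelevancePipeline and the override body is hypothetical:

```python
# The base hook is a no-op that returns self; an override can prepare
# data before training.
class BasePipeline:
    def pre_processing_step(self):
        return self

class MyPipeline(BasePipeline):
    def pre_processing_step(self):
        # e.g. copy or decompress raw data into the working directory here
        self.data_prepared = True
        return self

pipeline = MyPipeline().pre_processing_step()
```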
post_training_step()

Performs arbitrary post-training steps, such as copying or transforming data, that the rest of the code cannot accommodate. It serves as a placeholder without an explicit implementation (returns self) in the base pipeline. Users can extend it in their custom pipelines.
finish(job_status, job_info)

Wrap up the model training pipeline. Performs the following actions:
- saves a job status file named _SUCCESS or _FAILURE to indicate job status
- deletes the temp data and models directories
- if using spark IO, transfers the models and logs directories from local directories to the HDFS location
- logs the overall run time of the ml4ir job

Parameters:
- job_status (str) – _SUCCESS or _FAILURE, indicating the job status
- job_info (str) – for _SUCCESS, the experiment tracking metrics and metadata; for _FAILURE, the stacktrace of the failure
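The status-file convention can be sketched with the standard library; the helper name and payload below are illustrative, not ml4ir's actual implementation:

```python
# A file named _SUCCESS or _FAILURE is written to the job directory,
# with the tracking info (or stacktrace) as its contents.
import json
import tempfile
from pathlib import Path

def write_job_status(logs_dir, job_status, job_info):
    # job_status is "_SUCCESS" or "_FAILURE"; job_info is a string payload
    (Path(logs_dir) / job_status).write_text(job_info)

logs_dir = tempfile.mkdtemp()
write_job_status(logs_dir, "_SUCCESS", json.dumps({"MRR": 0.71}))
assert (Path(logs_dir) / "_SUCCESS").exists()
```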
RankingPipeline

class ml4ir.applications.ranking.pipeline.RankingPipeline(args: argparse.Namespace)

Bases: ml4ir.base.pipeline.RelevancePipeline

Base class that defines a pipeline to train, evaluate and save a RankingModel using ml4ir.

The constructor creates a RelevancePipeline object to train, evaluate and save a model on ml4ir. It sets up the data, logs and models directories and the file handlers used, and also loads and sets up the FeatureConfig for the model training pipeline.

Parameters: args (argparse.Namespace) – arguments to be used with the pipeline. Typically passed from command line arguments.
get_relevance_model_cls()

Fetch the class of the RelevanceModel to be used for the ml4ir pipeline.

Return type: RelevanceModel class
get_loss()

Get the primary loss function to be used with the RelevanceModel.

Return type: RelevanceLossBase object
get_aux_loss()

Get the auxiliary loss function to be used with the RelevanceModel.

Return type: RelevanceLossBase object
static get_metrics(metrics_keys: List[str]) → List[Union[keras.metrics.base_metric.Metric, str]]

Get the list of keras metrics to be used with the RelevanceModel.

Parameters: metrics_keys (List of str) – List of strings indicating the metrics to instantiate and retrieve.

Return type: list of keras Metric objects
validate_args()

Validate the arguments to be used with the RelevancePipeline.
create_pipeline_for_kfold(args)

Create a RankingPipeline object used in running kfold cross validation.
kfold_analysis(base_logs_dir, run_id, num_folds, pvalue_threshold=0.1, metrics=None)

Aggregate the results of the k-fold runs and perform a t-test between the old (prod) model's and the new model's results w.r.t. the specified metrics.

Parameters:
- base_logs_dir (str) – path to the base logs directory
- run_id (str) – identifier of the current run
- num_folds (int) – total number of folds
- pvalue_threshold (float) – the p-value threshold used to assess significance
- metrics (list) – list of metrics to include in the kfold analysis
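The per-metric comparison can be illustrated as a paired t-test over fold-wise metric values. This sketch computes only the t-statistic with the standard library (ml4ir's analysis also derives the p-value checked against pvalue_threshold), and the fold values are made up:

```python
# Paired t-test sketch: metric values from the old (prod) model and the
# new model are paired by fold; a large |t| suggests a consistent
# difference across folds.
from statistics import mean, stdev

def paired_t_statistic(old_scores, new_scores):
    diffs = [new - old for old, new in zip(old_scores, new_scores)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / n ** 0.5)

old_mrr = [0.610, 0.605, 0.615, 0.608, 0.612]   # per-fold prod metric
new_mrr = [0.630, 0.626, 0.641, 0.629, 0.633]   # per-fold new-model metric
t_stat = paired_t_statistic(old_mrr, new_mrr)   # positive: new model better
```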
run_kfold_analysis(logs_dir, run_id, num_folds, metrics)

Run the kfold analysis for ranking.

Parameters:
- logs_dir (str) – path to the logs directory
- run_id (str) – string run_id
- num_folds (int) – number of folds
- metrics (list) – list of metrics to include in the kfold analysis

Returns: summary of the kfold analysis
ClassificationPipeline

class ml4ir.applications.classification.pipeline.ClassificationPipeline(args: argparse.Namespace)

Bases: ml4ir.base.pipeline.RelevancePipeline

Base class that defines a pipeline to train, evaluate and save a RelevanceModel for classification using ml4ir.

The constructor creates a RelevancePipeline object to train, evaluate and save a model on ml4ir. It sets up the data, logs and models directories and the file handlers used, and also loads and sets up the FeatureConfig for the model training pipeline.

Parameters: args (argparse.Namespace) – arguments to be used with the pipeline. Typically passed from command line arguments.
get_relevance_model_cls()

Fetch the class of the RelevanceModel to be used for the ml4ir pipeline.

Return type: RelevanceModel class
get_loss()

Get the primary loss function to be used with the RelevanceModel.

Return type: RelevanceLossBase object
static get_metrics(metrics_keys: List[str]) → List[Union[keras.metrics.base_metric.Metric, str]]

Get the list of keras metrics to be used with the RelevanceModel.

Parameters: metrics_keys (List of str) – List of strings indicating the metrics to instantiate and retrieve.

Return type: list of keras Metric objects
get_relevance_dataset(parse_tfrecord=True, preprocessing_keys_to_fns={}) → ml4ir.base.data.relevance_dataset.RelevanceDataset

Create a RelevanceDataset object by loading train and test data as tensorflow datasets. Defines a preprocessing feature function to one-hot vectorize classification labels.

Parameters: preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object.

Returns: RelevanceDataset object that can be used for training and evaluating the model.

Return type: RelevanceDataset object

Notes

Override this method to create custom dataset objects.
get_kfold_relevance_dataset(num_folds, include_testset_in_kfold, read_data_sets=False, parse_tfrecord=True, preprocessing_keys_to_fns={}) → ml4ir.base.data.kfold_relevance_dataset.KfoldRelevanceDataset

Create a KfoldRelevanceDataset object by loading train and test data as tensorflow datasets. Defines a preprocessing feature function to one-hot vectorize classification labels.

Parameters:
- num_folds (int) – number of folds in kfold CV
- include_testset_in_kfold (bool) – whether to include the testset in the folds
- read_data_sets (bool) – whether to read datasets from disk
- preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the KfoldRelevanceDataset object

Returns: KfoldRelevanceDataset object that can be used for training and evaluating the model in kfold cross validation mode.

Return type: KfoldRelevanceDataset object

Notes

Override this method to create custom dataset objects.
create_pipeline_for_kfold(args)

Create a ClassificationPipeline object used in running kfold cross validation.
run_kfold_analysis(base_logs_dir, base_run_id, num_folds, metrics)