Pipelines¶
RelevancePipeline¶
-
class ml4ir.base.pipeline.RelevancePipeline(args: argparse.Namespace)¶
Bases: object
Base class that defines a pipeline to train, evaluate and save a RelevanceModel using ml4ir
Constructor to create a RelevancePipeline object to train, evaluate and save a model on ml4ir. This method sets up the data, logs and models directories and the file handlers used. It also loads and sets up the FeatureConfig for the model training pipeline.
Parameters: args (argparse Namespace) – arguments to be used with the pipeline, typically passed from the command line
-
setup_logging() → logging.Logger¶ Set up the logging utilities for the training pipeline. Additionally, removes pre-existing job status files.
-
set_seeds(reset_graph=True)¶ Set the random seeds for tensorflow and numpy in order to replicate results
Parameters: reset_graph (bool) – Reset the tensorflow graph and clear the keras session
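As a minimal sketch of why seeding helps replicate results, the snippet below uses Python's standard random module as a stand-in for the tensorflow and numpy generators that set_seeds configures. The function names set_seeds_sketch and draw_samples and the seed value 123 are assumptions for illustration, not part of ml4ir.

```python
import random


def set_seeds_sketch(seed: int = 123) -> None:
    """Hypothetical stand-in: seeds only the stdlib generator.

    The documented method seeds tensorflow and numpy; this sketch only
    demonstrates the reproducibility idea.
    """
    random.seed(seed)


def draw_samples(n: int = 3) -> list:
    """Draw n pseudo-random floats from the stdlib generator."""
    return [random.random() for _ in range(n)]


# Re-seeding with the same value replays the same sequence of draws.
set_seeds_sketch(123)
first_run = draw_samples()
set_seeds_sketch(123)
second_run = draw_samples()
print(first_run == second_run)
```

Without the second call to set_seeds_sketch, the two lists would differ, which is the situation the seed setup is meant to avoid between runs.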
-
get_relevance_dataset(preprocessing_keys_to_fns={}) → ml4ir.base.data.relevance_dataset.RelevanceDataset¶ Create RelevanceDataset object by loading train, test data as tensorflow datasets
Parameters: preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
Returns: RelevanceDataset object that can be used for training and evaluating the model
Return type: RelevanceDataset object
Notes
Override this method to create custom dataset objects
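The "dict of (str, function)" shape of preprocessing_keys_to_fns can be illustrated with a small, library-free sketch. The two functions, their names, and the apply_by_name helper are hypothetical; plain strings stand in for the tensors that real preprocessing functions would presumably receive, and how ml4ir looks the names up is not shown here.

```python
from typing import Callable, Dict, List


def strip_and_lower(text: str) -> str:
    """Hypothetical preprocessing function: trim whitespace and lowercase."""
    return text.strip().lower()


def collapse_spaces(text: str) -> str:
    """Hypothetical preprocessing function: collapse runs of whitespace."""
    return " ".join(text.split())


# Function names mapped to function definitions.
preprocessing_keys_to_fns: Dict[str, Callable[[str], str]] = {
    "strip_and_lower": strip_and_lower,
    "collapse_spaces": collapse_spaces,
}


def apply_by_name(fn_names: List[str], value: str,
                  registry: Dict[str, Callable[[str], str]]) -> str:
    """Look up each name in the registry and apply the functions in order."""
    for name in fn_names:
        value = registry[name](value)
    return value


cleaned = apply_by_name(
    ["strip_and_lower", "collapse_spaces"],
    "  Hello   World ",
    preprocessing_keys_to_fns,
)
print(cleaned)  # hello world
```

Keying the functions by name lets a configuration refer to a custom step as a string while the Python definition is supplied separately.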
-
get_kfold_relevance_dataset(num_folds, include_testset_in_kfold, read_data_sets, preprocessing_keys_to_fns={}) → ml4ir.base.data.relevance_dataset.RelevanceDataset¶ Create RelevanceDataset object by loading train, test data as tensorflow datasets
Parameters: - num_folds (int) – number of folds in kfold
- include_testset_in_kfold (bool) – whether to include the testset in the folds
- read_data_sets (bool) – whether to call create_dataset which reads data from files.
- preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
Returns: RelevanceDataset object that can be used for training and evaluating the model
Return type: KfoldRelevanceDataset object
Notes
Override this method to create custom dataset objects
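To make the num_folds and include_testset_in_kfold parameters concrete, here is a rough sketch of one way to build k-fold splits over in-memory lists. It is an assumption about the general technique, not ml4ir's implementation: the function kfold_splits, the round-robin fold assignment, and the choice to hold out exactly one fold for validation are all illustrative.

```python
def kfold_splits(train, test, num_folds, include_testset_in_kfold):
    """Return a list of (training, validation) pairs, one per fold.

    Hypothetical helper: when include_testset_in_kfold is True, the test
    examples are pooled with the training examples before folding.
    """
    if num_folds < 2:
        raise ValueError("num_folds must be at least 2")
    pool = list(train) + (list(test) if include_testset_in_kfold else [])
    # Round-robin assignment: example i goes to fold i % num_folds.
    folds = [pool[i::num_folds] for i in range(num_folds)]
    splits = []
    for held_out in range(num_folds):
        validation = folds[held_out]
        training = [
            example
            for i, fold in enumerate(folds)
            if i != held_out
            for example in fold
        ]
        splits.append((training, validation))
    return splits


train_examples = list(range(6))
test_examples = [6, 7]
splits = kfold_splits(
    train_examples, test_examples, num_folds=4, include_testset_in_kfold=True
)
for training, validation in splits:
    print(len(training), len(validation))
```

Each example appears in the validation part of exactly one split, so every example is used for evaluation once across the folds.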
-
get_relevance_model(feature_layer_keys_to_fns={}) → ml4ir.base.model.relevance_model.RelevanceModel¶ Creates RelevanceModel that can be used for training and evaluating
Parameters: feature_layer_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to tensorflow compatible function definitions that can now be used in the InteractionModel as a feature function to transform input features
Returns: RelevanceModel that can be used for training and evaluating
Return type: RelevanceModel
Notes
Override this method to create custom loss, scorer, model objects
-
create_pipeline_for_kfold(args)¶
-
run()¶ Run the pipeline to train, evaluate and save the model. It also runs the pipeline in kfold cross validation mode if specified.
Returns: Experiment tracking dictionary with metrics and metadata for the run. Used for model selection and hyperparameter optimization
Return type: dict
Notes
Also populates an experiment tracking dictionary containing the metadata, model architecture and metrics generated by the model
-
run_pipeline(relevance_dataset=None)¶ Run the pipeline to train, evaluate and save the model.
Parameters: relevance_dataset (RelevanceDataset) – RelevanceDataset used for running the pipeline. If None, the relevance dataset will be created.
Returns: Experiment tracking dictionary with metrics and metadata for the run. Used for model selection and hyperparameter optimization
Return type: dict
Notes
Also populates an experiment tracking dictionary containing the metadata, model architecture and metrics generated by the model
-
pre_processing_step()¶ Performs arbitrary pre-processing steps, such as copying or transforming data, that the rest of the code cannot accommodate. It serves as a placeholder without an explicit implementation (returns self) in the base pipeline. Users are expected to extend it in their custom pipelines.
-
post_training_step()¶ Performs arbitrary post-training steps, such as copying or transforming data, that the rest of the code cannot accommodate. It serves as a placeholder without an explicit implementation (returns self) in the base pipeline. Users are expected to extend it in their custom pipelines.
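The two placeholder hooks follow a familiar pattern: the base class provides no-op methods that return self, and a subclass overrides them. The sketch below shows that pattern with stand-in classes. PipelineSketch, CustomPipelineSketch, the train method, and the order in which run calls the hooks are all assumptions for illustration; they are not the ml4ir classes.

```python
class PipelineSketch:
    """Hypothetical stand-in for a base pipeline with no-op hooks."""

    def __init__(self):
        self.events = []

    def pre_processing_step(self):
        # Placeholder: no work in the base class, returns self.
        return self

    def post_training_step(self):
        # Placeholder: no work in the base class, returns self.
        return self

    def train(self):
        self.events.append("train")

    def run(self):
        # Assumed ordering: hook, training, hook.
        self.pre_processing_step()
        self.train()
        self.post_training_step()
        return self.events


class CustomPipelineSketch(PipelineSketch):
    """Custom pipeline that fills in both hooks."""

    def pre_processing_step(self):
        self.events.append("copy input data")
        return self

    def post_training_step(self):
        self.events.append("export artifacts")
        return self


events = CustomPipelineSketch().run()
print(events)  # ['copy input data', 'train', 'export artifacts']
```

Returning self from each hook keeps the overridden methods compatible with the base placeholders and allows call chaining.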
-
finish(job_status, job_info)¶ Wrap up the model training pipeline. Performs the following actions
- save a job status file as _SUCCESS or _FAILURE to indicate job status.
- delete temp data and models directories
- if using spark IO, transfers models and logs directories to HDFS location from local directories
- log overall run time of ml4ir job
Parameters: - job_status (str) – Tuple with first element _SUCCESS or _FAILURE second element
- job_info (str) – for _SUCCESS, is experiment tracking metrics and metadata for _FAILURE, is stacktrace of failure
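The status-file convention described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the helper write_job_status, the choice to write job_info as the file's contents, and the use of a temporary directory as the logs directory are hypothetical and may differ from what finish actually does.

```python
import os
import tempfile
import traceback


def write_job_status(logs_dir: str, job_status: str, job_info: str) -> str:
    """Hypothetical helper: write a file named _SUCCESS or _FAILURE."""
    if job_status not in ("_SUCCESS", "_FAILURE"):
        raise ValueError("job_status must be _SUCCESS or _FAILURE")
    path = os.path.join(logs_dir, job_status)
    with open(path, "w") as status_file:
        status_file.write(job_info)
    return path


with tempfile.TemporaryDirectory() as logs_dir:
    try:
        raise ValueError("simulated training failure")
    except ValueError:
        # On failure, the stack trace plays the role of job_info.
        status_path = write_job_status(
            logs_dir, "_FAILURE", traceback.format_exc()
        )
    status_name = os.path.basename(status_path)
    with open(status_path) as status_file:
        status_body = status_file.read()

print(status_name)
```

A marker file of this kind lets an external scheduler check the outcome of a job by testing for the file's existence rather than parsing logs.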
-
RankingPipeline¶
-
class ml4ir.applications.ranking.pipeline.RankingPipeline(args: argparse.Namespace)¶
Bases: ml4ir.base.pipeline.RelevancePipeline
Base class that defines a pipeline to train, evaluate and save a RankingModel using ml4ir
Constructor to create a RelevancePipeline object to train, evaluate and save a model on ml4ir. This method sets up the data, logs and models directories and the file handlers used. It also loads and sets up the FeatureConfig for the model training pipeline.
Parameters: args (argparse Namespace) – arguments to be used with the pipeline, typically passed from the command line
-
get_relevance_model(feature_layer_keys_to_fns={}) → ml4ir.base.model.relevance_model.RelevanceModel¶ Creates a RankingModel that can be used for training and evaluating
Parameters: feature_layer_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to tensorflow compatible function definitions that can now be used in the InteractionModel as a feature function to transform input features
Returns: RankingModel that can be used for training and evaluating a ranking model
Return type: RankingModel
Notes
Override this method to create custom loss, scorer, model objects
-
validate_args()¶ Validate the arguments to be used with RelevancePipeline
-
create_pipeline_for_kfold(args)¶ Create a RankingPipeline object used in running kfold cross validation.
-
kfold_analysis(base_logs_dir, run_id, num_folds, pvalue_threshold=0.1, metrics=None)¶ Aggregate the results of the k-fold runs and perform a t-test on the results between the old (prod) model and the new model with respect to the specified metrics.
Parameters: - base_logs_dir (int) – Total number of folds
- run_id (int) – current fold number
- num_folds (int) – Total number of folds
- pvalue_threshold (float) – the threshold used for pvalue to assess significance
- metrics (list) – List of metrics to include in the kfold analysis
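As a rough illustration of comparing an old and a new model across folds, the sketch below computes a paired t statistic from per-fold metric values using only the standard library. Whether kfold_analysis uses a paired or an independent t-test is not stated here, so the paired form is an assumption, as are the function name paired_t_statistic and the sample metric values. The p-value itself is not computed, because the standard library has no Student t distribution; a statistics package would be needed to compare it against pvalue_threshold.

```python
import math
import statistics


def paired_t_statistic(old_scores, new_scores):
    """Return (t statistic, degrees of freedom) for per-fold score pairs."""
    if len(old_scores) != len(new_scores):
        raise ValueError("both models need one score per fold")
    if len(old_scores) < 2:
        raise ValueError("at least two folds are required")
    diffs = [new - old for old, new in zip(old_scores, new_scores)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if std_diff == 0:
        raise ValueError("differences are constant; t statistic is undefined")
    t_stat = mean_diff / (std_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Hypothetical per-fold values of one metric for the old and new models.
old_metric = [0.60, 0.62, 0.61, 0.63]
new_metric = [0.62, 0.65, 0.62, 0.66]
t_stat, dof = paired_t_statistic(old_metric, new_metric)
print(round(t_stat, 3), dof)
```

Pairing the scores fold by fold removes the variation between folds from the comparison, which is why it is a common choice when both models are evaluated on the same splits.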
-
run_kfold_analysis(logs_dir, run_id, num_folds, metrics)¶ Running the kfold analysis for ranking.
Parameters: - logs_dir (str) – path to logs directory
- run_id (str) – string run_id
- num_folds (int) – number of folds
- metrics (list) – list of metrics to include in the kfold analysis
Returns: summary of the kfold analysis
-
ClassificationPipeline¶
-
class ml4ir.applications.classification.pipeline.ClassificationPipeline(args: argparse.Namespace)¶
Bases: ml4ir.base.pipeline.RelevancePipeline
Base class that defines a pipeline to train, evaluate and save a RelevanceModel for classification using ml4ir
Constructor to create a RelevancePipeline object to train, evaluate and save a model on ml4ir. This method sets up the data, logs and models directories and the file handlers used. It also loads and sets up the FeatureConfig for the model training pipeline.
Parameters: args (argparse Namespace) – arguments to be used with the pipeline, typically passed from the command line
-
get_relevance_model(feature_layer_keys_to_fns={}) → ml4ir.base.model.relevance_model.RelevanceModel¶ Creates a RelevanceModel that can be used for training and evaluating
Parameters: feature_layer_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to tensorflow compatible function definitions that can now be used in the InteractionModel as a feature function to transform input features
Returns: RelevanceModel that can be used for training and evaluating a classification model
Return type: RelevanceModel
Notes
Override this method to create custom loss, scorer, model objects
-
get_relevance_dataset(parse_tfrecord=True, preprocessing_keys_to_fns={}) → ml4ir.base.data.relevance_dataset.RelevanceDataset¶ Create RelevanceDataset object by loading train, test data as tensorflow datasets. Defines a preprocessing feature function to one-hot vectorize classification labels.
Parameters: preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the RelevanceDataset object
Returns: RelevanceDataset object that can be used for training and evaluating the model
Return type: RelevanceDataset object
Notes
Override this method to create custom dataset objects
-
get_kfold_relevance_dataset(num_folds, include_testset_in_kfold, read_data_sets=False, parse_tfrecord=True, preprocessing_keys_to_fns={}) → ml4ir.base.data.kfold_relevance_dataset.KfoldRelevanceDataset¶ Create KfoldRelevanceDataset object by loading train, test data as tensorflow datasets. Defines a preprocessing feature function to one-hot vectorize classification labels.
Parameters: - num_folds (int) – Number of folds in kfold CV
- include_testset_in_kfold (bool) – Whether to include testset in the folds
- read_data_sets (bool) – Whether to read datasets from disk
- preprocessing_keys_to_fns (dict of (str, function)) – dictionary of function names mapped to function definitions that can now be used for preprocessing while loading the TFRecordDataset to create the KfoldRelevanceDataset object
Returns: KfoldRelevanceDataset object that can be used for training and evaluating the model in a kfold cross validation mode.
Return type: KfoldRelevanceDataset object
Notes
Override this method to create custom dataset objects
-
create_pipeline_for_kfold(args)¶ Create a ClassificationPipeline object used in running kfold cross validation.
-
run_kfold_analysis(base_logs_dir, base_run_id, num_folds, metrics)¶
-