Defining the FeatureConfig
In this section, we describe how to define a feature configuration YAML file for your ml4ir application.
ml4ir supports two types of feature configs, `ExampleFeatureConfig` and `SequenceExampleFeatureConfig`, corresponding to the two supported TFRecord data formats for training and serving.
Main Keys
The feature config YAML file contains these main keys and their corresponding definitions:
`query_key`
: Feature used to uniquely identify each query (or data point).
`label`
: Feature to be used as the label.
`rank`
: Feature to identify the position of the sequence record in a `SequenceExample` proto. It does not need to be specified if using the `Example` data format.
`features`
: List of features that are used by the `RelevanceModel` for training and evaluation.
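As a rough sketch of how these four keys might sit together at the top level of the file: this assumes that `query_key`, `label`, and `rank` each hold a single feature definition, while `features` holds a list of them. The feature names `query_id`, `clicked`, `rank`, and `query_text` are hypothetical placeholders, and the per-feature keys are the ones described under Feature Information.

```yaml
# Illustrative skeleton only; every feature name here is a hypothetical placeholder.
query_key:
  name: query_id
  dtype: string
  trainable: false
label:
  name: clicked
  dtype: int64
  trainable: false
rank:                      # can be left out when using the Example data format
  name: rank
  dtype: int64
  trainable: false
features:
  - name: query_text
    dtype: string
    trainable: true
```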
Feature Information
For each of the features in the FeatureConfig, we define a corresponding feature information definition. The main keys that should be specified for each feature are:
name | str
Name of the feature in the input dataset (CSV, TFRecord, libsvm, etc.)
node_name | str | default=name
Name of the feature in the TensorFlow model. This will be the name of the feature in the input layer. The same input feature can be used with multiple node names and feature transformations, for example, using query text for both character and word embeddings.
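A minimal sketch of the character-and-word-embedding case, assuming it is expressed as two entries that share the same `name` but have different `node_name` values. The node names `query_text_chars` and `query_text_words` are hypothetical, and the `feature_layer_info` that each entry would need is omitted here.

```yaml
# Hypothetical: one input column feeding two separate model inputs.
- name: query_text
  node_name: query_text_chars   # would carry a character-level transformation
  dtype: string
  trainable: true
- name: query_text
  node_name: query_text_words   # would carry a word-level transformation
  dtype: string
  trainable: true
```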
dtype | str
TensorFlow data type of the feature. Can be `string`, `int64` or `float`.
trainable | bool | default=True
Value representing whether the feature is to be used by the scoring function. If set to False, the feature is considered a metadata feature that can be used to compute custom metrics and losses. Setting it to True makes the transformed feature available for scoring by default.
tfrecord_type | str
Type of the SequenceExample feature. Can be one of `sequence`, for features unique to each sequence record, or `context`, for features common to all sequence records.
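To illustrate the distinction, here is a small sketch with one feature of each kind. Both feature names are hypothetical: `query_text` stands for a value shared by every record of a query, and `title_score` for a value that differs from record to record.

```yaml
# Hypothetical features, shown only to contrast the two tfrecord_type values.
- name: query_text
  dtype: string
  tfrecord_type: context    # common to all sequence records
- name: title_score
  dtype: float
  tfrecord_type: sequence   # unique to each sequence record
```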
preprocessing_info | list of dicts | default=[]
List of preprocessing functions to be used on the feature. These functions will be applied in the data loading phase and will not be part of the TensorFlow model. ml4ir provides an option to persist preprocessing logic as part of the SavedModel if the preprocessing functions are TensorFlow-compatible and serializable.
For each preprocessing function, specify `fn`, the name of the function to be used, and `args`, a dictionary of values that are passed as arguments to the function. For example, to preprocess a text feature by removing punctuation and converting it to lower case, specify the preprocessing info as below:
```yaml
preprocessing_info:
  - fn: preprocess_text
    args:
      remove_punctuation: true
      to_lower: true
```
For more information on defining custom preprocessing functions and using them with ml4ir, check this guide.
feature_layer_info | dict
Definition of the feature transformation function to be applied to the feature in the model. Use this section to specify predefined or custom transformation functions for the model. Only TensorFlow-compatible functions can be used here, as the transformation functions will be part of the `RelevanceModel` and serialized when the model is saved.
To define a feature transformation, specify `fn`, the feature transformation function to be applied to the feature, and `args`, the key-value pairs to be passed as arguments to the transformation function. For example, to use a text feature to learn character embeddings and produce a sequence encoding with a bidirectional LSTM, define the feature layer as below:
```yaml
feature_layer_info:
  type: numeric
  fn: bytes_sequence_to_encoding_bilstm
  args:
    encoding_type: bilstm
    encoding_size: 128
    embedding_size: 128
    max_length: 20
```
For more information on defining custom feature transformation functions and using them with ml4ir, check this guide.
serving_info | dict
Definition of serving-time feature attributes that will be used for model inference in production. Three key attributes can be specified in this section: `name`, `default_value` and `required`. `name` captures the name of the feature in the production feature store that should be mapped to the model feature while constructing the input TFRecord proto. `default_value` captures the value used to fill the input feature tensor if the feature is absent in production. `required` is a boolean value representing whether the feature is required at inference; if it is not, the feature tensor is set to the default value when the feature is absent.
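As an illustration of an optional feature with a fallback value, consider the sketch below. The model feature `popularity` and the feature store name `docPopularity` are hypothetical; the intent is that a request missing this feature would be filled with `0.0` rather than rejected.

```yaml
# Hypothetical optional feature; both names are placeholders.
- name: popularity
  dtype: float
  serving_info:
    name: docPopularity     # name in the production feature store
    required: false         # absent values fall back to default_value
    default_value: 0.0
```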
log_at_inference | bool | default=False
Value representing if the feature should be logged when running `RelevanceModel.predict(...)`. Setting it to True returns the feature value when running inference. This can be used for error analysis on test examples and for computing more complex metrics in a post-processing job.
is_group_metric_key | bool | default=False
Value representing if the feature should be used for computing groupwise metrics when running `RelevanceModel.evaluate(...)`. The usage and implementation of the groupwise metrics is left to the user to customize. The Ranking models come prepackaged with groupwise MRR and ACR metrics.
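A sketch of a metadata feature that combines the flags above, assuming the goal is to break metrics down by a query category without letting that category influence scoring. The feature name `query_category` is a hypothetical placeholder.

```yaml
# Hypothetical metadata feature: not scored on, but logged and used for grouping.
- name: query_category
  dtype: string
  trainable: false
  log_at_inference: true
  is_group_metric_key: true
```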
is_aux_label | bool | default=False
Value representing if the feature is used as an auxiliary label to compute failure metrics and auxiliary loss. How the feature is used to compute the failure metrics is left to the user to customize. The Ranking models come prepackaged with failure metrics computation that can be used, for example, to compute the rate of clicks on documents without a match on the subject field.
In Ranking applications, an auxiliary (secondary) label is any feature or value that serves as a proxy relevance assessment that the user might want to measure on the dataset in addition to the primary click labels. For example, this could be used with an exact query match feature. In that case, the metric sheds light on scenarios where records with an exact match are ranked lower than those without. This gives the user complementary information (to typical click metrics such as MRR and ACR) about the model, which helps in making better trade-off decisions when selecting the best model.
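The exact query match case might be configured along these lines. The feature name `exact_query_match` is hypothetical, and setting `trainable: false` is an assumption made here so that the proxy label is not also used as a scoring input.

```yaml
# Hypothetical auxiliary label based on an exact query match signal.
- name: exact_query_match
  dtype: float
  trainable: false   # assumption: keep the proxy label out of the scoring function
  is_aux_label: true
```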
The `FeatureConfig` can be extended to support additional attributes as necessary.
Example
This is an example configuration for the `query_text` feature, which is first preprocessed to convert it to lower case, remove punctuation, etc. The feature is then transformed into a sequence encoding using a bidirectional LSTM. At serving time, the feature `qtext` from the production feature store is mapped to the `query_text` feature of the model.
```yaml
- name: query_text
  node_name: query_text
  trainable: true
  dtype: string
  log_at_inference: true
  feature_layer_info:
    fn: bytes_sequence_to_encoding_bilstm
    args:
      encoding_type: bilstm
      encoding_size: 128
      embedding_size: 128
      max_length: 20
  preprocessing_info:
    - fn: preprocess_text
      args:
        remove_punctuation: true
        to_lower: true
  serving_info:
    name: qtext
    required: true
    default_value: ""
```