Model Registration

Before DMM can monitor a model, the model must be registered with DMM. You can register a new model either through the DMM UI or through DMM’s public APIs. In both cases, the information about the model is captured in a Model Config JSON.

A few things to keep in mind while registering a model:

  1. A data column can be only one of these column types: feature, prediction, timestamp, row identifier.

  2. A data column can be only one of these value types: numerical, categorical, string.

  3. Not every column in the training data file needs to be declared. Undeclared columns are excluded from all analysis within DMM.

    1. In the Guided Registration flow, you can ignore columns by clearing the checkbox next to them.

    2. In the Model Config, you can ignore columns by omitting them from the dataColumns attribute.

  4. Feature and prediction columns can only be of value type numerical or categorical.

  5. The prediction column is optional, but when declared there can be only one (in the current release of DMM).

    1. When a prediction column (or row_identifier column) is not declared, Model Quality metrics cannot be monitored.

    2. Data drift can still be monitored for all feature columns.

  6. Timestamp and row_identifier column types are optional.

    1. When present, there can be only one timestamp column and only one row_identifier column.

    2. Both of these can only be of string value type.

    3. These two columns can be declared at the time of adding prediction data for the first time, but it is recommended that they be declared during model registration.

  7. Timestamp column should contain the date/time of when the prediction was made. It should be in a valid UTC format (ISO 8601).

  8. When the timestamp column is not declared, the ingestion time of the prediction dataset in DMM is substituted as the timestamp of prediction.

  9. Row_identifier is used to uniquely identify each prediction row. It is typically referred to as prediction ID, transaction ID, etc.

    1. The row_identifier values are used to match ground truth data to the predictions to calculate model quality metrics.

  10. When a row_identifier column (or prediction column) is not declared, Model Quality metrics cannot be monitored.
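
The timestamp rule above asks for ISO 8601 values in UTC. As a minimal sketch, a timestamp column could be prepared in Python as shown below. The helper name to_iso8601_utc is hypothetical, and the output layout (no offset suffix) is an assumption modeled on the "dateCreated" value in the sample Model Config, not a confirmed DMM requirement.

```python
from datetime import datetime, timedelta, timezone


def to_iso8601_utc(dt: datetime) -> str:
    """Format a datetime as an ISO 8601 string expressed in UTC.

    Hypothetical helper, not part of DMM.
    """
    if dt.tzinfo is None:
        # Assumption: naive datetimes are already expressed in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    # Convert any other offset to UTC before formatting.
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


# A prediction made at 13:41:12 in a +05:30 zone is 08:11:12 in UTC.
local = datetime(2020, 4, 17, 13, 41, 12,
                 tzinfo=timezone(timedelta(hours=5, minutes=30)))
print(to_iso8601_utc(local))
```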


Model Config

The Model Config JSON should capture all the information needed to register a model on DMM. The easiest way to generate a Model Config file for a model is to use the Guided Flow in the Register Model flow. In Step 4, you can download the Model Config for future reference, for sharing with colleagues, or for offline edits.

This section provides details about the structure of the Model Config. Let's review a sample Model Config JSON to get a good idea of it.

{
        "dataType": "tabular",
        "dataColumns": [
                {
                        "name": "age",
                        "valueType": "numerical",
                        "isFeature": true,
                        "binsNum": 8
                },
                {
                        "name": "job",
                        "valueType": "categorical",
                        "isFeature": true,
                        "binsCategories": ["blue-collar", "management", "technician", "admin"]
                },
                {
                        "name": "marital",
                        "valueType": "categorical",
                        "isFeature": true
                },
                {
                        "name": "cons.conf.idx",
                        "valueType": "numerical",
                        "isFeature": true
                },
                {
                        "name": "euribor3m",
                        "valueType": "numerical",
                        "isFeature": true,
                        "binsEdges": [0, 1, 2, 3, 4, 5]
                },
                {
                        "name": "y",
                        "valueType": "categorical",
                        "isPrediction": true
                },
                {
                        "name": "date",
                        "valueType": "string",
                        "isTimestamp": true
                },
                {
                        "name": "RowId",
                        "valueType": "string",
                        "isRowIdentifier": true
                }
        ],
        "dataLocation": "https://s3.ap-south-1.amazonaws.com/my-s3-bucket/TrainingData.csv",
        "modelType": "classification",
        "modelMetadata": {
                "name": "PropensityModel",
                "modelVersion": "1",
                "dateCreated": "2020-04-17T08:11:12",
                "dataset": "OpenSource-TrainingData",
                "description": "Predict Purchase Propensity",
                "author": "samit.demo@dominodatalab.com"
        }
}

The following are the key attributes in the Model Config JSON:

dataType - Only valid value is ‘tabular’ (for the current release)

dataColumns - Use it to declare all data columns that you want to analyze. Specify the ‘name’, the ‘valueType’, and the attribute identifying the column type. Optionally, use the binning-related attributes to control how bins are constructed (see the Supported Binning Methods section in this doc). Columns not declared here are excluded from all analysis. The following attributes are supported for identifying the column type:

  1. isFeature - Input feature of the model. Data drift will be calculated for this data column. Needs to be declared while registering the model (along with its training data). The column needs to be present in all training & prediction datasets registered with the model.

  2. isPrediction - Output prediction of the model. Data drift and model quality metrics are calculated for this data column. It is optional (model quality metrics will not be calculated for the model if it is omitted), but if declared, it has to be declared while registering the model (along with its training data).

  3. isTimestamp - Used to identify the column which contains the timestamp for the prediction made. If not declared, the ingestion time of the data in DMM is used as the timestamp of the prediction. Column values need to follow the ISO 8601 time format.

  4. isRowIdentifier - Used to uniquely identify each prediction. Is used for matching ground truth labels to their corresponding prediction values. Model quality metrics will not be calculated if this column is not present. If used, needs to be present in both prediction and ground truth datasets.

  5. isGroundTruth - Used to identify the column which contains the ground truth labels in the ground truth datasets.

  6. isPredictionProbability - Column which contains the probability value for the model’s prediction. Can be a single value (maps to the probability value of the positive class) or a list of values (in this case the length of the list has to match the number of unique prediction labels/classes present in the training dataset)

  7. isSampleWeight - Column which contains the weight to be associated with each prediction to calculate the Gini metric.

dataLocation - Use it to specify the full S3 URL link for the training data file.

modelType - Valid values are ‘classification’ and ‘regression’.

modelMetadata - Use it to capture metadata related to the model. Specify the ‘name’, ‘modelVersion’, ‘dataset’, ‘dateCreated’, ‘description’ and ‘author’ attributes. dateCreated needs to be in a valid UTC format (ISO 8601).
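
As an illustration of how the column rules from the Model Registration section map onto 'dataColumns', here is a minimal sketch of a pre-submission check. It is not part of DMM: the function name check_data_columns and the message wording are hypothetical, and it only covers the four column types listed for registration (isFeature, isPrediction, isTimestamp, isRowIdentifier), so columns that use other flags such as isGroundTruth are out of its scope.

```python
COLUMN_TYPE_FLAGS = ("isFeature", "isPrediction", "isTimestamp", "isRowIdentifier")
VALUE_TYPES = ("numerical", "categorical", "string")


def check_data_columns(columns):
    """Return a list of problems found in a 'dataColumns' list (hypothetical helper)."""
    problems = []
    counts = {flag: 0 for flag in COLUMN_TYPE_FLAGS}
    for col in columns:
        name = col.get("name", "<unnamed>")
        flags = [flag for flag in COLUMN_TYPE_FLAGS if col.get(flag)]
        if len(flags) != 1:
            problems.append(f"{name}: expected exactly one column type, got {flags}")
            continue
        flag = flags[0]
        counts[flag] += 1
        value_type = col.get("valueType")
        if value_type not in VALUE_TYPES:
            problems.append(f"{name}: unknown valueType {value_type!r}")
        elif flag in ("isFeature", "isPrediction") and value_type == "string":
            problems.append(f"{name}: {flag} columns must be numerical or categorical")
        elif flag in ("isTimestamp", "isRowIdentifier") and value_type != "string":
            problems.append(f"{name}: {flag} columns must be of string value type")
    # Prediction, timestamp and row identifier columns are each limited to one.
    for flag in ("isPrediction", "isTimestamp", "isRowIdentifier"):
        if counts[flag] > 1:
            problems.append(f"at most one {flag} column is allowed")
    return problems


columns = [
    {"name": "age", "valueType": "numerical", "isFeature": True},
    {"name": "y", "valueType": "categorical", "isPrediction": True},
    {"name": "date", "valueType": "numerical", "isTimestamp": True},
]
for problem in check_data_columns(columns):
    print(problem)
```

Running the check on a parsed Model Config (for example, json.load(...)["dataColumns"]) before registration can surface rule violations early; DMM's own validation remains the authority.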


Supported Binning Methods

DMM uses bins to calculate probability distributions and divergence values for data drift. The choice of bins affects not only the quality of the drift values but also the performance of the tool. A high number of bins (greater than 20) usually causes noise and false alarms, and considerably slows down the computation and the UI.

When no binning strategy is specified, DMM uses the Freedman-Diaconis estimator to calculate the number of bins for numerical variables. If the estimator returns a count higher than 20, the count is capped at 20. For numerical variables, DMM automatically adds one guard bin for values that fall outside the min-max range of the values present in the training data. For training data this guard bin has a zero count (unless the 'binsEdges' override strategy described below is used); for prediction data, values may fall into it, which indicates that the prediction data contains values outside the min-max range seen in the training data.
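
To make the default concrete, the following is a minimal sketch of the Freedman-Diaconis rule with the cap of 20 applied. It is not DMM's implementation: the function name default_bin_count is hypothetical, the quartile interpolation method and the handling of degenerate data (zero spread or zero interquartile range) are assumptions, and guard bins are not included in the returned count.

```python
import math
import statistics

MAX_BINS = 20  # cap described in the text above


def default_bin_count(values):
    """Estimate a bin count with the Freedman-Diaconis rule, capped at MAX_BINS."""
    n = len(values)
    # Assumption: quartiles use linear interpolation over the observed data.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    spread = max(values) - min(values)
    if iqr == 0 or spread == 0:
        # Assumption: degenerate data collapses to a single bin.
        return 1
    # Freedman-Diaconis bin width: 2 * IQR / n^(1/3).
    width = 2 * iqr / n ** (1 / 3)
    return min(MAX_BINS, math.ceil(spread / width))


print(default_bin_count(list(range(1, 101))))   # small, evenly spread sample
print(default_bin_count(list(range(10000))))    # large sample, hits the cap
```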

For categorical variables, the class values are used as bins. DMM will automatically add one guard bin 'Untrained Classes'. For training data this guard bin will have zero counts (unless the ‘binsCategories’ override strategy is used), however for Prediction data counts of all classes which were not present in the training data will fall in this bin. Users can potentially use this bin to detect new classes previously unseen during training.

Users can override these defaults and fine-tune the bin creation using the following attributes in the Model Config JSON.

Note: Changing bins after a model has been successfully registered is not currently supported in DMM.

For numerical data columns, users can use one of two approaches:

binsNum - This takes a positive integer greater than or equal to 2 and less than 20 as input. DMM creates that many equal-width bins for the numerical variable, using the min and max values in the training dataset to determine the bin widths. DMM adds two guard bands in addition to the user-defined bins.

Example of valid 'binsNum'

"binsNum": 10
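
The equal-width construction described for 'binsNum' can be sketched as follows. This is an illustration only: equal_width_edges is a hypothetical helper, the guard bands that DMM adds are not modeled, and the accepted range (2 up to but not including 20) is taken from the description above.

```python
def equal_width_edges(values, bins_num):
    """Return bins_num + 1 edges spanning the min and max of the training values."""
    if not isinstance(bins_num, int) or not 2 <= bins_num < 20:
        raise ValueError("binsNum must be an integer >= 2 and < 20")
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins_num
    # Use the exact max as the last edge to avoid floating-point drift.
    return [lo + i * step for i in range(bins_num)] + [hi]


print(equal_width_edges([0, 2.5, 10], 5))
```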

binsEdges - This takes an array of real numbers as input. These correspond to the actual bin edges. To create N user-defined bins, users need to provide N+1 bin edges. This is similar to the edges returned by NumPy's histogram_bin_edges function. DMM adds two guard bands in addition to the user-defined bins.

  1. Edges can be both positive and negative decimal numbers (except Infinity)

  2. A minimum of 3 and a maximum of 20 edges can be provided in the array.

  3. They should be monotonically increasing (lowest to highest) from start of the array to end of the array.

  4. All provided values should be unique. No duplicates.

Example of valid 'binsEdges'

"binsEdges": [-10, -4.5, -0.25, 0, 3.2, 5.11111]

Examples of invalid 'binsEdges'

"binsEdges": [-10, 4, -0.25, 0, 3.2, 5.11111] -> not monotonically increasing

"binsEdges": [-10, "XYZ", -0.25, 0, 3.2, 5.11111] -> string value present

"binsEdges": [1,2] -> less than 3 edges provided

"binsEdges": [1,2,2,4,6] -> duplicates present
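
The 'binsEdges' rules above can be checked before registration with a small helper. The sketch below is an assumption-laden illustration rather than DMM's validator: bins_edges_problem is a hypothetical name, the returned messages are invented for readability, and rejecting NaN alongside Infinity is an assumption.

```python
import math


def bins_edges_problem(edges):
    """Return a reason string if 'edges' breaks a rule, otherwise None."""
    if not 3 <= len(edges) <= 20:
        return "expected between 3 and 20 edges"
    for edge in edges:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(edge, bool) or not isinstance(edge, (int, float)):
            return "non-numeric value present"
        if not math.isfinite(edge):
            return "infinite or NaN value present"
    for left, right in zip(edges, edges[1:]):
        if left == right:
            return "duplicates present"
        if left > right:
            return "not monotonically increasing"
    return None


for candidate in (
    [-10, -4.5, -0.25, 0, 3.2, 5.11111],
    [-10, 4, -0.25, 0, 3.2, 5.11111],
    [-10, "XYZ", -0.25, 0, 3.2, 5.11111],
    [1, 2],
    [1, 2, 2, 4, 6],
):
    print(candidate, "->", bins_edges_problem(candidate) or "valid")
```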

For categorical data columns, users can use the following approach:

binsCategories - This takes an array of strings as input (length should be less than 100) and creates a bin for each of them. The values should ideally correspond to class values present in the data column in the training data or class values user expects to find in the prediction data. Counts of all other class values of the training and prediction data columns will fall in the 'Untrained Classes' guard bin. If the user has specified an 'Untrained Classes' bin as part of the 'binsCategories', then it will correspond to the internal 'Untrained Classes' bin.

Example of valid 'binsCategories'

"binsCategories": ["red", "blue", "green", "white", "yellow"]
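
To illustrate how values outside 'binsCategories' end up in the 'Untrained Classes' guard bin, here is a minimal counting sketch. It only mirrors the behavior described above and is not DMM code; categorical_bin_counts is a hypothetical helper.

```python
from collections import Counter

GUARD_BIN = "Untrained Classes"


def categorical_bin_counts(values, bins_categories):
    """Count values per declared category; everything else goes to the guard bin."""
    counts = {category: 0 for category in bins_categories}
    # If the user already listed the guard bin, reuse it instead of adding a second one.
    counts.setdefault(GUARD_BIN, 0)
    for value, n in Counter(values).items():
        key = value if value in counts else GUARD_BIN
        counts[key] += n
    return counts


print(categorical_bin_counts(["red", "blue", "red", "purple"], ["red", "blue", "green"]))
```

A non-zero count in the guard bin for prediction data is the signal that classes unseen at training time have appeared.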

Providing Pre-computed Distributions

There are situations in which a user might want to override the distribution computed from the training data. For example, when the training dataset was artificially balanced by either over or under sampling, and the distribution represented in it is not what would be realistically expected in the prediction data. This may cause false alarms for data drift checks.

In such cases, users can use the 'precomputedBinCounts' attribute to override the training data distribution for any column.

The 'precomputedBinCounts' attribute takes an array of non-negative integers (0 and above) as input. The input represents the count of values in each of the bins (excluding the guard bins). The first value corresponds to the count of the first bin. This attribute can only be used when the bin structure has been explicitly specified using 'binsEdges' for numerical columns or 'binsCategories' for categorical columns. The length of the array should be equal to the number of bins specified by the user (excluding the guard bins). DMM internally calculates the probability distribution from the counts.

Example of valid 'precomputedBinCounts'

"precomputedBinCounts": [100,250,400,50,200]
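
The conversion from counts to a probability distribution is a simple normalization. The sketch below shows one way it might look, under the assumption that each probability is the bin count divided by the total; counts_to_probabilities is a hypothetical helper, and DMM's internal handling (for example, of guard bins or all-zero counts) may differ.

```python
def counts_to_probabilities(counts, num_bins):
    """Normalize per-bin counts into probabilities that sum to 1."""
    if len(counts) != num_bins:
        raise ValueError("one count is required per user-defined bin")
    if any(isinstance(c, bool) or not isinstance(c, int) or c < 0 for c in counts):
        raise ValueError("counts must be non-negative integers")
    total = sum(counts)
    if total == 0:
        # Assumption: an all-zero array cannot describe a distribution.
        raise ValueError("at least one count must be greater than zero")
    return [c / total for c in counts]


# Five counts pair with five bins, for example six 'binsEdges' values.
print(counts_to_probabilities([100, 250, 400, 50, 200], 5))
```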

Note: Even when precomputed distributions are provided for any (or all) data columns, users still have to provide a training data file that validates successfully, and all the columns (including those for which precomputed distributions are provided) need to be present in the file.