MadLib._internal.ml_model.MLModel

class MadLib._internal.ml_model.MLModel

Bases: ABC

Abstract base class for machine learning models.

This class defines the interface that all machine learning models must implement, whether they are scikit-learn models or PySpark ML models. It provides methods for training, prediction, confidence estimation, and entropy calculation.
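As an illustration of what implementing this interface can look like, the sketch below defines a toy pandas-only subclass. It rests on several assumptions: `MLModelSketch` is a hypothetical stand-in that mirrors only the abstract member names documented on this page (real code would subclass `MLModel` itself and inherit `prep_fvs`), `MajorityClassModel` is an invented example model, and the `confidence` output column name is a guess, since this page does not say where `prediction_conf` writes its scores.

```python
from abc import ABC, abstractmethod
import math

import pandas as pd


class MLModelSketch(ABC):
    """Hypothetical stand-in mirroring the documented abstract members."""

    @property
    @abstractmethod
    def nan_fill(self): ...

    @property
    @abstractmethod
    def use_vectors(self): ...

    @property
    @abstractmethod
    def use_floats(self): ...

    @property
    @abstractmethod
    def trained_model(self): ...

    @abstractmethod
    def params_dict(self): ...

    @abstractmethod
    def train(self, df, vector_col, label_column): ...

    @abstractmethod
    def predict(self, df, vector_col, output_col): ...

    @abstractmethod
    def prediction_conf(self, df, vector_col, label_column): ...

    @abstractmethod
    def entropy(self, df, vector_col, output_col): ...


class MajorityClassModel(MLModelSketch):
    """Toy model: ignores the features and predicts the majority label."""

    def __init__(self):
        self._p = None  # fraction of training rows whose label is 1

    # Concrete properties replace the abstract ones declared above.
    nan_fill = property(lambda self: 0.0)
    use_vectors = property(lambda self: False)
    use_floats = property(lambda self: True)
    # For this toy model the "trained model" is just the learned fraction.
    trained_model = property(lambda self: self._p)

    def params_dict(self):
        return {"model": "majority_class"}

    def train(self, df, vector_col, label_column):
        self._p = float((df[label_column] == 1).mean())
        return self  # returning self allows chaining, as documented for train

    def predict(self, df, vector_col, output_col):
        out = df.copy()
        out[output_col] = int(self._p >= 0.5)
        return out

    def prediction_conf(self, df, vector_col, label_column):
        out = df.copy()
        # Assumed column name; the toy model does not use label_column.
        out["confidence"] = max(self._p, 1.0 - self._p)
        return out

    def entropy(self, df, vector_col, output_col):
        p = self._p
        h = 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
        out = df.copy()
        out[output_col] = h
        return out


train_df = pd.DataFrame(
    {"features": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]], "label": [1, 1, 1, 0]}
)
model = MajorityClassModel().train(train_df, "features", "label")
scored = model.predict(train_df, "features", "prediction")
```

Because every abstract member is overridden, `MajorityClassModel` can be instantiated, whereas instantiating the base class directly raises `TypeError`.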

nan_fill

Value to use for filling NaN values in feature vectors

Type:

float or None

use_vectors

Whether the model expects feature vectors in vector format

Type:

bool

use_floats

Whether the model uses float32 (True) or float64 (False) precision

Type:

bool

Methods

  • __init__()

  • entropy(df, vector_col, output_col) – Calculate entropy of predictions.

  • params_dict() – Get a dictionary of model parameters.

  • predict(df, vector_col, output_col) – Make predictions using the trained model.

  • prediction_conf(df, vector_col, label_column) – Calculate prediction confidence scores.

  • prep_fvs(fvs[, feature_col]) – Prepare feature vectors for model input.

  • train(df, vector_col, label_column) – Train the model on the given data.

Attributes

  • nan_fill – Value to use for filling NaN values in feature vectors.

  • trained_model – The trained ML model object.

  • use_floats – Whether the model uses float32 or float64 precision.

  • use_vectors – Whether the model expects feature vectors in vector format.

abstractmethod entropy(df: pandas.DataFrame | pyspark.sql.DataFrame, vector_col: str, output_col: str) → pandas.DataFrame | pyspark.sql.DataFrame

Calculate entropy of predictions.

Parameters:
  • df (pandas.DataFrame or pyspark.sql.DataFrame) – The DataFrame containing the feature vectors

  • vector_col (str) – Name of the column containing feature vectors

  • output_col (str) – Name of the column to store entropy values in

Returns:

The input DataFrame with entropy values added in the output_col

Return type:

pandas.DataFrame or pyspark.sql.DataFrame
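This page does not state which entropy definition implementations use. As a minimal sketch, assuming Shannon entropy over each row's predicted class probabilities, the per-row calculation could look like the following. The helper name `shannon_entropy` and the sample probability rows are hypothetical and are not part of MadLib.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in bits) of one predicted class distribution."""
    # Zero-probability classes contribute nothing: p * log2(p) -> 0 as p -> 0.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


# Hypothetical predicted class probabilities, one list per row.
rows = [[0.5, 0.5], [0.9, 0.1], [1.0, 0.0]]
entropies = [shannon_entropy(r) for r in rows]
```

Under this assumption, a row with evenly split probabilities gets the highest entropy and a row with all mass on one class gets zero, so higher values mark predictions the model is less sure about.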

abstract property nan_fill: float | None

Value to use for filling NaN values in feature vectors.

Returns:

The value to use for filling NaN values, or None if no filling is needed

Return type:

float or None

abstractmethod params_dict() → dict

Get a dictionary of model parameters.

Returns:

Dictionary containing model parameters and configuration

Return type:

dict

abstractmethod predict(df: pandas.DataFrame | pyspark.sql.DataFrame, vector_col: str, output_col: str) → pandas.DataFrame | pyspark.sql.DataFrame

Make predictions using the trained model.

Parameters:
  • df (pandas.DataFrame or pyspark.sql.DataFrame) – The DataFrame containing the feature vectors to predict on

  • vector_col (str) – Name of the column containing feature vectors

  • output_col (str) – Name of the column to store predictions in

Returns:

The input DataFrame with predictions added in the output_col

Return type:

pandas.DataFrame or pyspark.sql.DataFrame

abstractmethod prediction_conf(df: pandas.DataFrame | pyspark.sql.DataFrame, vector_col: str, label_column: str) → pandas.DataFrame | pyspark.sql.DataFrame

Calculate prediction confidence scores.

Parameters:
  • df (pandas.DataFrame or pyspark.sql.DataFrame) – The DataFrame containing the feature vectors

  • vector_col (str) – Name of the column containing feature vectors

  • label_column (str) – Name of the column containing true labels

Returns:

The input DataFrame with confidence scores added

Return type:

pandas.DataFrame or pyspark.sql.DataFrame

prep_fvs(fvs: pandas.DataFrame | pyspark.sql.DataFrame, feature_col: str = 'features') → pandas.DataFrame | pyspark.sql.DataFrame

Prepare feature vectors for model input.

This method handles NaN filling and conversion between vector and array formats based on the model’s requirements.

Parameters:
  • fvs (pandas.DataFrame or pyspark.sql.DataFrame) – DataFrame containing feature vectors

  • feature_col (str, optional) – Name of the column containing feature vectors

Returns:

DataFrame with prepared feature vectors

Return type:

pandas.DataFrame or pyspark.sql.DataFrame
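The NaN filling and precision handling described for `prep_fvs` can be sketched for a single feature vector as follows. This is an illustration under assumptions, not the library's implementation: the helper name `prep_array` is hypothetical, its `nan_fill` and `use_floats` arguments merely mirror the properties documented on this page, and the conversion to a Spark vector type is left out.

```python
import numpy as np


def prep_array(fv, nan_fill=None, use_floats=True):
    """Return a copy of one feature vector, NaN-filled and cast as requested."""
    # use_floats=True selects float32, otherwise float64 (as documented above).
    dtype = np.float32 if use_floats else np.float64
    arr = np.array(fv, dtype=dtype)  # np.array copies, so the input is untouched
    if nan_fill is not None:
        # In-place assignment keeps the chosen dtype.
        arr[np.isnan(arr)] = nan_fill
    return arr


filled = prep_array([1.0, float("nan"), 3.0], nan_fill=0.0, use_floats=True)
unfilled = prep_array([1.0, float("nan"), 3.0], nan_fill=None, use_floats=False)
```

When `nan_fill` is `None` the NaN values pass through unchanged, matching the documented meaning of `None` as "no filling is needed".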

abstractmethod train(df: pandas.DataFrame | pyspark.sql.DataFrame, vector_col: str, label_column: str)

Train the model on the given data.

Parameters:
  • df (pandas.DataFrame or pyspark.sql.DataFrame) – The DataFrame containing training data

  • vector_col (str) – Name of the column containing feature vectors

  • label_column (str) – Name of the column containing labels

Returns:

The trained model (self)

Return type:

MLModel

abstract property trained_model

The trained ML model object.

Returns:

The trained ML model object

Return type:

MLModel

abstract property use_floats: bool

Whether the model uses float32 or float64 precision.

Returns:

True if the model uses float32, False if it uses float64

Return type:

bool

abstract property use_vectors: bool

Whether the model expects feature vectors in vector format.

Returns:

True if the model expects vectors, False if it expects arrays

Return type:

bool