MadLib._internal.ml_model.MLModel

class MadLib._internal.ml_model.MLModel

Bases: ABC

Abstract base class for machine learning models.

This class defines the interface that all machine learning models must implement, whether they are scikit-learn models or PySpark ML models. It provides methods for training, prediction, confidence estimation, and entropy calculation.

nan_fill

Value to use for filling NaN values in feature vectors

Type: float or None

use_vectors

Whether the model expects feature vectors in vector format

Type: bool

use_floats

Whether the model uses float32 (True) or float64 (False) precision

Type: bool

__init__()

Methods

`__init__`()
`entropy`(df, vector_col, output_col)	Calculate entropy of predictions.
`params_dict`()	Get a dictionary of model parameters.
`predict`(df, vector_col, output_col)	Make predictions using the trained model.
`prediction_conf`(df, vector_col, label_column)	Calculate prediction confidence scores.
`prep_fvs`(fvs[, feature_col])	Prepare feature vectors for model input.
`train`(df, vector_col, label_column)	Train the model on the given data.

Attributes

`nan_fill`	Value to use for filling NaN values in feature vectors.
`trained_model`	The trained ML model object.
`use_floats`	Whether the model uses float32 or float64 precision.
`use_vectors`	Whether the model expects feature vectors in vector format.
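
The interface summarized above could be mirrored by a stand-in like the following. This is a minimal sketch and not the library's source: `MLModelSketch` and `ConstantModel` are hypothetical names, plain lists of dicts stand in for DataFrames to keep the example dependency-free, and only a subset of the documented members is reproduced. It shows how the abstract members block instantiation until a subclass supplies all of them.

```python
from abc import ABC, abstractmethod


class MLModelSketch(ABC):
    """Hypothetical stand-in mirroring part of the documented interface."""

    @property
    @abstractmethod
    def nan_fill(self):
        """Value used to fill NaNs, or None for no filling."""

    @property
    @abstractmethod
    def use_vectors(self):
        """True if the model expects vectors, False for arrays."""

    @property
    @abstractmethod
    def use_floats(self):
        """True for float32 precision, False for float64."""

    @abstractmethod
    def train(self, df, vector_col, label_column):
        """Fit the model and return self."""

    @abstractmethod
    def predict(self, df, vector_col, output_col):
        """Return the input rows with predictions added."""


class ConstantModel(MLModelSketch):
    """Toy model: always predicts the most frequent training label."""

    def __init__(self):
        self._label = None

    @property
    def nan_fill(self):
        return 0.0

    @property
    def use_vectors(self):
        return False

    @property
    def use_floats(self):
        return True

    def train(self, df, vector_col, label_column):
        labels = [row[label_column] for row in df]
        self._label = max(set(labels), key=labels.count)
        return self  # the docs state that train returns the model (self)

    def predict(self, df, vector_col, output_col):
        # Return copies of the rows with the prediction column added.
        return [{**row, output_col: self._label} for row in df]


# Lists of dicts are an assumption standing in for a DataFrame.
rows = [
    {"features": [0.1, 0.2], "label": 1},
    {"features": [0.3, 0.4], "label": 1},
    {"features": [0.5, 0.6], "label": 0},
]
model = ConstantModel().train(rows, "features", "label")
predictions = model.predict(rows, "features", "prediction")
```

Calling `MLModelSketch()` directly raises `TypeError`, because the abstract members have no implementation.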

abstractmethod entropy(df: pandas.DataFrame | pyspark.sql.DataFrame, vector_col: str, output_col: str) → pandas.DataFrame | pyspark.sql.DataFrame

Calculate entropy of predictions.

Parameters:

df (pandas.DataFrame or pyspark.sql.DataFrame) – The DataFrame containing the feature vectors
vector_col (str) – Name of the column containing feature vectors
output_col (str) – Name of the column to store entropy values in

Returns:

The input DataFrame with entropy values added in the output_col

Return type:

pandas.DataFrame or pyspark.sql.DataFrame
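
The description above does not say which entropy is computed or in which log base. As a minimal sketch, assuming Shannon entropy (natural log) over one row's predicted class probabilities, the per-row calculation might look like this. `shannon_entropy` is a hypothetical helper, not part of MadLib.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of one row's class probabilities."""
    # Terms with p == 0 are skipped: the limit of p * log(p) as p -> 0 is 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


confident = shannon_entropy([1.0, 0.0])  # model is certain
uncertain = shannon_entropy([0.5, 0.5])  # model is maximally unsure
```

Under this assumption, a certain prediction scores 0 and an evenly split two-class prediction scores ln 2, so larger values mark rows where the model is less sure.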

abstract property nan_fill: float | None

Value to use for filling NaN values in feature vectors.

Returns: The value to use for filling NaN values, or None if no filling is needed
Return type: float or None

abstractmethod params_dict() → dict

Get a dictionary of model parameters.

Returns: Dictionary containing model parameters and configuration
Return type: dict

abstractmethod predict(df: pandas.DataFrame | pyspark.sql.DataFrame, vector_col: str, output_col: str) → pandas.DataFrame | pyspark.sql.DataFrame

Make predictions using the trained model.

Parameters:

df (pandas.DataFrame or pyspark.sql.DataFrame) – The DataFrame containing the feature vectors to predict on
vector_col (str) – Name of the column containing feature vectors
output_col (str) – Name of the column to store predictions in

Returns:

The input DataFrame with predictions added in the output_col

Return type:

pandas.DataFrame or pyspark.sql.DataFrame

abstractmethod prediction_conf(df: pandas.DataFrame | pyspark.sql.DataFrame, vector_col: str, label_column: str) → pandas.DataFrame | pyspark.sql.DataFrame

Calculate prediction confidence scores.

Parameters:

df (pandas.DataFrame or pyspark.sql.DataFrame) – The DataFrame containing the feature vectors
vector_col (str) – Name of the column containing feature vectors
label_column (str) – Name of the column containing true labels

Returns:

The input DataFrame with confidence scores added

Return type:

pandas.DataFrame or pyspark.sql.DataFrame

prep_fvs(fvs: pandas.DataFrame | pyspark.sql.DataFrame, feature_col: str = 'features') → pandas.DataFrame | pyspark.sql.DataFrame

Prepare feature vectors for model input.

This method handles NaN filling and conversion between vector and array formats based on the model’s requirements.

Parameters:

fvs (pandas.DataFrame or pyspark.sql.DataFrame) – DataFrame containing feature vectors
feature_col (str, optional) – Name of the column containing feature vectors

Returns:

DataFrame with prepared feature vectors

Return type:

pandas.DataFrame or pyspark.sql.DataFrame
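
The NaN-filling and precision steps described above might look like the following for a single feature vector. This is a sketch under assumptions, not the library's implementation: `prepare_vector` is a hypothetical helper, a Python list stands in for one cell of the feature column, and the vector/array conversion controlled by `use_vectors` is omitted because it depends on the backend.

```python
import math
from array import array


def prepare_vector(values, nan_fill=None, use_floats=True):
    """Fill NaNs (if nan_fill is set) and pack values at the chosen precision."""
    if nan_fill is not None:
        values = [nan_fill if math.isnan(v) else v for v in values]
    # 'f' is a C float (float32 on common platforms), 'd' is a C double (float64).
    typecode = "f" if use_floats else "d"
    return array(typecode, values)


filled = prepare_vector([1.0, float("nan"), 2.5], nan_fill=0.0)
untouched = prepare_vector([1.0, float("nan")], nan_fill=None, use_floats=False)
```

With `nan_fill=None` the NaN is left in place, matching the documented meaning of `None` as "no filling is needed".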

abstractmethod train(df: pandas.DataFrame | pyspark.sql.DataFrame, vector_col: str, label_column: str)

Train the model on the given data.

Parameters:

df (pandas.DataFrame or pyspark.sql.DataFrame) – The DataFrame containing training data
vector_col (str) – Name of the column containing feature vectors
label_column (str) – Name of the column containing labels

Returns:

The trained model (self)

Return type:

MLModel

abstract property trained_model

The trained ML model object.

Returns: The trained ML model object
Return type: MLModel

abstract property use_floats: bool

Whether the model uses float32 or float64 precision.

Returns: True if the model uses float32, False if it uses float64
Return type: bool

abstract property use_vectors: bool

Whether the model expects feature vectors in vector format.

Returns: True if the model expects vectors, False if it expects arrays
Return type: bool