MadLib._internal.ml_model.MLModel
- class MadLib._internal.ml_model.MLModel
Bases: ABC
Abstract base class for machine learning models.
This class defines the interface that all machine learning models must implement, whether they wrap scikit-learn models or PySpark ML models. It declares abstract methods for training, prediction, confidence estimation, and entropy calculation, and provides a concrete prep_fvs helper for preparing feature vectors.
- nan_fill
Value to use for filling NaN values in feature vectors
- Type:
float or None
- use_vectors
Whether the model expects feature vectors in vector format
- Type:
bool
- use_floats
Whether the model uses float32 (True) or float64 (False) precision
- Type:
bool
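Because MLModel derives from ABC, a subclass can only be instantiated once every abstract member is overridden. The sketch below illustrates that contract with a hypothetical stand-in class, ModelInterface, which mirrors just two of the members listed on this page (nan_fill and params_dict). The stand-in, the two toy subclasses, and the can_instantiate helper are assumptions made for illustration and are not part of MadLib.

```python
from abc import ABC, abstractmethod


class ModelInterface(ABC):
    """Hypothetical stand-in for MLModel; only two members are mirrored."""

    @property
    @abstractmethod
    def nan_fill(self):
        """Value used to fill NaNs, or None if no filling is needed."""

    @abstractmethod
    def params_dict(self) -> dict:
        """Return model parameters and configuration."""


class Incomplete(ModelInterface):
    # Overrides nan_fill but leaves params_dict abstract.
    @property
    def nan_fill(self):
        return 0.0


class Complete(ModelInterface):
    # Overrides every abstract member, so it can be instantiated.
    @property
    def nan_fill(self):
        return 0.0

    def params_dict(self) -> dict:
        return {"model": "toy", "nan_fill": self.nan_fill}


def can_instantiate(cls) -> bool:
    """Return True if cls() succeeds, False if the ABC machinery rejects it."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

A real subclass of MLModel would need to override all of the abstract methods and properties documented below, not only these two.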
- __init__()
Methods
__init__()
entropy(df, vector_col, output_col): Calculate entropy of predictions.
params_dict(): Get a dictionary of model parameters.
predict(df, vector_col, output_col): Make predictions using the trained model.
prediction_conf(df, vector_col, label_column): Calculate prediction confidence scores.
prep_fvs(fvs[, feature_col]): Prepare feature vectors for model input.
train(df, vector_col, label_column): Train the model on the given data.
Attributes
nan_fill: Value to use for filling NaN values in feature vectors.
trained_model: The trained ML Model object.
use_floats: Whether the model uses float32 or float64 precision.
use_vectors: Whether the model expects feature vectors in vector format.
- abstractmethod entropy(df: pandas.DataFrame | pyspark.sql.DataFrame, vector_col: str, output_col: str) → pandas.DataFrame | pyspark.sql.DataFrame
Calculate entropy of predictions.
- Parameters:
df (pandas.DataFrame or pyspark.sql.DataFrame) – The DataFrame containing the feature vectors
vector_col (str) – Name of the column containing feature vectors
output_col (str) – Name of the column to store entropy values in
- Returns:
The input DataFrame with entropy values added in the output_col
- Return type:
pandas.DataFrame or pyspark.sql.DataFrame
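This page does not state which entropy definition implementations use, what logarithm base applies, or how class probabilities are obtained from the model. As a rough illustration only, the sketch below assumes Shannon entropy (in nats) over one row's predicted class probabilities. The function name shannon_entropy is hypothetical and is not part of MadLib.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of one probability vector.

    Zero-probability classes are skipped, following the convention
    that 0 * log(0) is treated as 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A confident prediction has low entropy, an uncertain one has high entropy.
confident = shannon_entropy([0.99, 0.01])
uncertain = shannon_entropy([0.5, 0.5])
```

Under this assumption, a two-class prediction reaches its maximum entropy of ln 2 when both classes are equally likely, which is why entropy is often used to pick the examples a model is least sure about.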
- abstract property nan_fill: float | None
Value to use for filling NaN values in feature vectors.
- Returns:
The value to use for filling NaN values, or None if no filling is needed
- Return type:
float or None
- abstractmethod params_dict() → dict
Get a dictionary of model parameters.
- Returns:
Dictionary containing model parameters and configuration
- Return type:
dict
- abstractmethod predict(df: pandas.DataFrame | pyspark.sql.DataFrame, vector_col: str, output_col: str) → pandas.DataFrame | pyspark.sql.DataFrame
Make predictions using the trained model.
- Parameters:
df (pandas.DataFrame or pyspark.sql.DataFrame) – The DataFrame containing the feature vectors to predict on
vector_col (str) – Name of the column containing feature vectors
output_col (str) – Name of the column to store predictions in
- Returns:
The input DataFrame with predictions added in the output_col
- Return type:
pandas.DataFrame or pyspark.sql.DataFrame
- abstractmethod prediction_conf(df: pandas.DataFrame | pyspark.sql.DataFrame, vector_col: str, label_column: str) → pandas.DataFrame | pyspark.sql.DataFrame
Calculate prediction confidence scores.
- Parameters:
df (pandas.DataFrame or pyspark.sql.DataFrame) – The DataFrame containing the feature vectors
vector_col (str) – Name of the column containing feature vectors
label_column (str) – Name of the column containing true labels
- Returns:
The input DataFrame with confidence scores added
- Return type:
pandas.DataFrame or pyspark.sql.DataFrame
- prep_fvs(fvs: pandas.DataFrame | pyspark.sql.DataFrame, feature_col: str = 'features') → pandas.DataFrame | pyspark.sql.DataFrame
Prepare feature vectors for model input.
This method handles NaN filling and conversion between vector and array formats based on the model’s requirements.
- Parameters:
fvs (pandas.DataFrame or pyspark.sql.DataFrame) – DataFrame containing feature vectors
feature_col (str, optional) – Name of the column containing feature vectors
- Returns:
DataFrame with prepared feature vectors
- Return type:
pandas.DataFrame or pyspark.sql.DataFrame
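To make the preparation step more concrete, here is a minimal sketch of how the nan_fill and use_floats settings could be applied to a single feature vector. It is not the library's implementation: it works on a plain Python list instead of a DataFrame column, it leaves out the vector/array conversion governed by use_vectors, and the function name prepare_vector is hypothetical.

```python
import math
from array import array
from typing import List, Optional


def prepare_vector(
    values: List[float], nan_fill: Optional[float], use_floats: bool
) -> array:
    """Fill NaNs (when nan_fill is not None) and pack the values.

    use_floats=True stores the values as C floats (typically 32-bit),
    use_floats=False stores them as C doubles (64-bit).
    """
    if nan_fill is not None:
        # Replace each NaN with the configured fill value.
        values = [nan_fill if math.isnan(v) else v for v in values]
    return array("f" if use_floats else "d", values)


# NaN is replaced by 0.0 and the result is stored in single precision.
prepared = prepare_vector([1.0, float("nan"), 2.5], nan_fill=0.0, use_floats=True)
```

When nan_fill is None the values pass through unchanged, matching the documented meaning that no filling is needed.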
- abstractmethod train(df: pandas.DataFrame | pyspark.sql.DataFrame, vector_col: str, label_column: str)
Train the model on the given data.
- Parameters:
df (pandas.DataFrame or pyspark.sql.DataFrame) – The DataFrame containing training data
vector_col (str) – Name of the column containing feature vectors
label_column (str) – Name of the column containing labels
- Returns:
The trained model (self)
- Return type:
MLModel
- abstract property trained_model
The trained ML Model object
- Returns:
The trained ML Model object
- Return type:
- abstract property use_floats: bool
Whether the model uses float32 or float64 precision.
- Returns:
True if the model uses float32, False if it uses float64
- Return type:
bool
- abstract property use_vectors: bool
Whether the model expects feature vectors in vector format.
- Returns:
True if the model expects vectors, False if it expects arrays
- Return type:
bool