MadLib.tools

Public API functions for MadLib.

This module provides the main functions that users will interact with. Implementation details are hidden in the _internal package.

Functions

apply_matcher(model, df, feature_col, output_col)

Apply a trained model to make predictions.

create_seeds(fvs, nseeds, labeler[, ...])

Create seeds to train a model.

down_sample(fvs, percent, search_id_column)

Down-sample by score_column to produce percent * fvs.count() rows.

label_data(model_spec, mode, labeler_spec, fvs)

Generate labeled data using active learning.

train_matcher(model_spec, labeled_data[, ...])

Train a matcher model on labeled data.

MadLib.tools.apply_matcher(model: MLModel | SKLearnModel | SparkMLModel, df: pandas.DataFrame | pyspark.sql.DataFrame, feature_col: str, output_col: str) → DataFrame[source]

Apply a trained model to make predictions.

Parameters:
  • model (Union[MLModel, SKLearnModel, SparkMLModel]) – Either a trained MLModel instance, or a trained scikit-learn or Spark model instance

  • df (pandas DataFrame) – The DataFrame to make predictions on

  • feature_col (str) – Name of the column containing feature vectors

  • output_col (str) – Name of the column to store predictions in

Returns:

The input DataFrame with predictions added

Return type:

pandas DataFrame
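
Example (a minimal sketch, assuming model is a trained model returned by train_matcher and fvs is a feature-vector DataFrame returned by featurize, both documented below; the column names 'features' and 'prediction' are illustrative):

    from MadLib.tools import apply_matcher

    # add a 'prediction' column computed from the 'features' column
    predictions = apply_matcher(model, fvs,
                                feature_col='features',
                                output_col='prediction')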

MadLib.tools.create_features(A: pandas.DataFrame | pyspark.sql.DataFrame, B: pandas.DataFrame | pyspark.sql.DataFrame, a_cols: List[str], b_cols: List[str], sim_functions: List[Callable[[...], Any]] | None = None, tokenizers: List[Callable[[...], Any]] | None = None, null_threshold: float = 0.5) → List[Callable][source]

Create the features that will be used to featurize your tuple pairs.

Parameters:
  • A (Union[pd.DataFrame, SparkDataFrame]) – the records of table A

  • B (Union[pd.DataFrame, SparkDataFrame]) – the records of table B

  • a_cols (list) – The names of the columns for DataFrame A that should have features generated

  • b_cols (list) – The names of the columns for DataFrame B that should have features generated

  • sim_functions (list of callables, optional) – similarity functions to apply (default: None)

  • tokenizers (list of callables, optional) – tokenizers to use (default: None)

  • null_threshold (float, optional) – the fraction of values in a column that must be null for the column to be dropped and excluded from feature generation (default: 0.5)

Returns:

a list containing initialized feature objects for columns in A, B

Return type:

List[Callable]
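
Example (a minimal sketch with toy pandas data; the '_id' and 'name' column names are illustrative, and the default sim_functions and tokenizers are used by leaving them as None):

    import pandas as pd
    from MadLib.tools import create_features

    A = pd.DataFrame({'_id': [1, 2], 'name': ['apple inc', 'acme corp']})
    B = pd.DataFrame({'_id': [10, 11], 'name': ['apple', 'acme corporation']})

    # build feature objects over the 'name' column of both tables
    features = create_features(A, B, a_cols=['name'], b_cols=['name'],
                               null_threshold=0.5)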

MadLib.tools.create_seeds(fvs: pandas.DataFrame | pyspark.sql.DataFrame, nseeds: int, labeler: Labeler | Dict, score_column: str = 'score') → DataFrame[source]

Create seeds to train a model.

Parameters:
  • fvs (pandas DataFrame) – the DataFrame with feature vectors that is your training data

  • nseeds (int) – the number of seeds you want to use to train an initial model

  • labeler (Union[Labeler, Dict]) – the labeler object (or a labeler_spec dict) you want to use to assign labels to rows

  • score_column (str) – the name of the score column in your fvs DataFrame

Returns:

A DataFrame of labeled seeds with schema (original schema of fvs, label), where the values in label are either 0.0 or 1.0

Return type:

pandas DataFrame
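
Example (a minimal sketch, assuming fvs was produced by featurize and already contains a 'score' column; the dict-style labeler follows the labeler_spec example shown under label_data below, with A and B the tables from the create_features example):

    from MadLib.tools import create_seeds

    labeler_spec = {'name': 'cli', 'a_df': A, 'b_df': B}
    seeds = create_seeds(fvs, nseeds=50, labeler=labeler_spec,
                         score_column='score')
    # 'seeds' has the schema of fvs plus a 'label' column of 0.0 / 1.0 values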

MadLib.tools.down_sample(fvs: pandas.DataFrame | pyspark.sql.DataFrame, percent: float, search_id_column: str, score_column: str = 'score', bucket_size: int = 1000) → pandas.DataFrame | pyspark.sql.DataFrame[source]

Down-sample by score_column to produce percent * fvs.count() rows.

Parameters:
  • fvs (Union[pd.DataFrame, SparkDataFrame]) – the feature vectors to be downsampled

  • percent (float) – the fraction of the vectors to output, in the interval (0.0, 1.0]

  • search_id_column (str) – the name of the column containing unique identifiers for each record

  • score_column (str) – the column that scored the vectors, should be positively correlated with the probability of the pair being a match

  • bucket_size (int, optional) – the size of the buckets used for partitioning (default: 1000)

Returns:

the down-sampled dataset, containing percent * fvs.count() rows and the same schema as fvs

Return type:

Union[pd.DataFrame, SparkDataFrame]
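
Example (a minimal sketch; 'search_id' is an assumed name for the unique-identifier column in fvs, and 'score' is the default score column):

    from MadLib.tools import down_sample

    # keep roughly 10% of the feature vectors, biased toward high scores
    sampled = down_sample(fvs, percent=0.10, search_id_column='search_id',
                          score_column='score', bucket_size=1000)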

MadLib.tools.featurize(features: List[Callable], A, B, candidates, output_col: str = 'features', fill_na: float = 0.0) → DataFrame[source]

Apply the features to the record pairs in candidates.

Parameters:
  • features (List[Callable]) – a list containing initialized feature objects for columns in A, B, as returned by create_features

  • A (Union[pd.DataFrame, SparkDataFrame]) – the records of table A

  • B (Union[pd.DataFrame, SparkDataFrame]) – the records of table B

  • candidates (Union[pd.DataFrame, SparkDataFrame]) – id pairs of A and B that are potential matches

  • output_col (str, optional) – the name of the column for the resulting feature vectors (default: 'features')

  • fill_na (float) – value to fill in for missing data, default 0.0

Returns:

DataFrame containing the created feature vectors, with schema (id2, id1, <output_col>, other columns from candidates)

Return type:

pandas DataFrame
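
Example (a minimal sketch continuing the create_features example above; the 'id1' and 'id2' column names for the candidate pairs are assumptions, chosen to match the output schema described above):

    import pandas as pd
    from MadLib.tools import featurize

    candidates = pd.DataFrame({'id1': [1, 2], 'id2': [10, 11]})
    fvs = featurize(features, A, B, candidates,
                    output_col='features', fill_na=0.0)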

MadLib.tools.get_base_sim_functions()[source]

Return the base similarity functions.

MadLib.tools.get_base_tokenizers()[source]

Return the base tokenizers.

MadLib.tools.get_extra_tokenizers()[source]

Return extra tokenizers beyond the base set.
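
Example (a minimal sketch; it is assumed, based on the names, that these helpers return lists of similarity functions and tokenizers that can be passed to create_features):

    from MadLib.tools import (create_features, get_base_sim_functions,
                              get_base_tokenizers, get_extra_tokenizers)

    sims = get_base_sim_functions()
    toks = get_base_tokenizers() + get_extra_tokenizers()
    features = create_features(A, B, a_cols=['name'], b_cols=['name'],
                               sim_functions=sims, tokenizers=toks)
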
MadLib.tools.label_data(model_spec: Dict | MLModel, mode: Literal['batch', 'continuous'], labeler_spec: Dict | Labeler, fvs: pandas.DataFrame | pyspark.sql.DataFrame, seeds: DataFrame | None = None, **learner_kwargs) → DataFrame[source]

Generate labeled data using active learning.

Parameters:
  • model_spec (Union[Dict, MLModel]) – Either a dict with model configuration (e.g. {'model_type': 'sklearn', 'model_class': XGBClassifier}) or an MLModel instance

  • mode (Literal["batch", "continuous"]) – Whether to use batch or continuous active learning

  • labeler_spec (Union[Dict, Labeler]) – Either a dict with labeler configuration (e.g. {'name': 'cli', 'a_df': df_a, 'b_df': df_b}) or a Labeler instance

  • fvs (pandas DataFrame) – The data that needs to be labeled

  • seeds (pandas DataFrame, optional) – Initial labeled examples to start with

  • **learner_kwargs – Additional keyword arguments to pass to the active learner constructor. For batch mode, see EntropyActiveLearner (e.g. batch_size, max_iter). For continuous mode, see ContinuousEntropyActiveLearner (e.g. queue_size, max_labeled, on_demand_stop).

Returns:

DataFrame with ids of potential matches and the corresponding label

Return type:

pandas DataFrame
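
Example (a minimal sketch of batch-mode active learning; fvs and seeds are assumed to come from featurize and create_seeds above, the model_spec and labeler_spec dicts follow the examples in the parameter descriptions, and batch_size is an illustrative EntropyActiveLearner keyword argument passed through **learner_kwargs):

    from xgboost import XGBClassifier
    from MadLib.tools import label_data

    model_spec = {'model_type': 'sklearn', 'model_class': XGBClassifier}
    labeler_spec = {'name': 'cli', 'a_df': A, 'b_df': B}

    labeled = label_data(model_spec, mode='batch', labeler_spec=labeler_spec,
                         fvs=fvs, seeds=seeds, batch_size=10)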

MadLib.tools.train_matcher(model_spec: Dict | MLModel, labeled_data: pandas.DataFrame | pyspark.sql.DataFrame, feature_col: str = 'features', label_col: str = 'label') → MLModel[source]

Train a matcher model on labeled data.

Parameters:
  • model_spec (Union[Dict, MLModel]) – Either a dict with model configuration (e.g. {'model_type': 'sklearn', 'model': XGBClassifier, 'model_args': {'max_depth': 6}}) or an MLModel instance

  • labeled_data (pandas DataFrame) – DataFrame containing the labeled data

  • feature_col (str, optional) – Name of the column containing feature vectors (default: 'features')

  • label_col (str, optional) – Name of the column containing labels (default: 'label')

Returns:

The trained model

Return type:

MLModel
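
Example (a minimal sketch using the dict-style model_spec from the parameter description above; labeled and fvs are assumed to come from the label_data and featurize examples, and 'features' and 'label' are the documented default column names):

    from xgboost import XGBClassifier
    from MadLib.tools import train_matcher, apply_matcher

    model_spec = {'model_type': 'sklearn', 'model': XGBClassifier,
                  'model_args': {'max_depth': 6}}
    model = train_matcher(model_spec, labeled,
                          feature_col='features', label_col='label')

    # the trained model can then be applied with apply_matcher (see above)
    predictions = apply_matcher(model, fvs, feature_col='features',
                                output_col='prediction')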