MadLib.tools

Public API functions for MadLib.

This module provides the main functions that users will interact with. Implementation details are hidden in the _internal package.

Functions

apply_matcher(model, df, feature_col, output_col)

Apply a trained model to make predictions.

create_seeds(fvs, nseeds, labeler[, ...])

Create seeds to train a model.

down_sample(fvs, percent, search_id_column)

Down-sample by score_column to produce percent * fvs.count() rows.

label_data(model_spec, mode, labeler_spec, fvs)

Generate labeled data using active learning.

train_matcher(model_spec, labeled_data[, ...])

Train a matcher model on labeled data.

MadLib.tools.apply_matcher(model: MLModel | SKLearnModel | SparkMLModel, df: pandas.DataFrame | pyspark.sql.DataFrame, feature_col: str, output_col: str) → DataFrame[source]

Apply a trained model to make predictions.

Parameters:
  • model (Union[MLModel, SKLearnModel, SparkMLModel]) – Either a trained MLModel instance, or a trained scikit-learn or Spark model instance

  • df (pandas DataFrame) – The DataFrame to make predictions on

  • feature_col (str) – Name of the column containing feature vectors

  • output_col (str) – Name of the column to store predictions in

Returns:

The input DataFrame with predictions added

Return type:

pandas DataFrame
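
Example (a minimal sketch, assuming model is a trained model returned by train_matcher and fvs is a feature-vector DataFrame returned by featurize, both documented below; the column names 'features' and 'prediction' are illustrative):

    from MadLib.tools import apply_matcher

    # add a 'prediction' column computed from the 'features' column
    predictions = apply_matcher(model, fvs,
                                feature_col='features',
                                output_col='prediction')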

MadLib.tools.create_features(A: pandas.DataFrame | pyspark.sql.DataFrame, B: pandas.DataFrame | pyspark.sql.DataFrame, a_cols: List[str], b_cols: List[str], sim_functions: List[Callable[[...], Any]] | None = None, tokenizers: List[Callable[[...], Any]] | None = None, null_threshold: float = 0.5) → List[Callable][source]

Create the features that will be used to featurize your tuple pairs.

Parameters:
  • A (Union[pd.DataFrame, SparkDataFrame]) – the records of table A

  • B (Union[pd.DataFrame, SparkDataFrame]) – the records of table B

  • a_cols (list) – The names of the columns for DataFrame A that should have features generated

  • b_cols (list) – The names of the columns for DataFrame B that should have features generated

  • sim_functions (list of callables, optional) – similarity functions to apply (default: None)

  • tokenizers (list of callables, optional) – tokenizers to use (default: None)

  • null_threshold (float, optional) – the fraction of values in a column that must be null for the column to be dropped and excluded from feature generation (default: 0.5)

Returns:

a list containing initialized feature objects for columns in A, B

Return type:

List[Callable]
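
Example (a minimal sketch with toy pandas data; the '_id' and 'name' column names are illustrative, and the default sim_functions and tokenizers are used by leaving them as None):

    import pandas as pd
    from MadLib.tools import create_features

    A = pd.DataFrame({'_id': [1, 2], 'name': ['apple inc', 'acme corp']})
    B = pd.DataFrame({'_id': [10, 11], 'name': ['apple', 'acme corporation']})

    # build feature objects over the 'name' column of both tables
    features = create_features(A, B, a_cols=['name'], b_cols=['name'],
                               null_threshold=0.5)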

MadLib.tools.create_seeds(fvs: pandas.DataFrame | pyspark.sql.DataFrame, nseeds: int, labeler: Labeler | Dict, score_column: str = 'score') → DataFrame[source]

Create seeds to train a model.

Parameters:
  • fvs (pandas DataFrame) – the DataFrame with feature vectors that is your training data

  • nseeds (int) – the number of seeds you want to use to train an initial model

  • labeler (Union[Labeler, Dict]) – the labeler object (or a labeler_spec dict) you want to use to assign labels to rows

  • score_column (str) – the name of the score column in your fvs DataFrame

Returns:

A DataFrame of labeled seeds with schema (original schema of fvs, label), where the values in label are either 0.0 or 1.0

Return type:

pandas DataFrame
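
Example (a minimal sketch, assuming fvs was produced by featurize and already contains a 'score' column; the dict-style labeler follows the labeler_spec example shown under label_data below, with A and B the tables from the create_features example):

    from MadLib.tools import create_seeds

    labeler_spec = {'name': 'cli', 'a_df': A, 'b_df': B}
    seeds = create_seeds(fvs, nseeds=50, labeler=labeler_spec,
                         score_column='score')
    # 'seeds' has the schema of fvs plus a 'label' column of 0.0 / 1.0 values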

MadLib.tools.down_sample(fvs: pandas.DataFrame | pyspark.sql.DataFrame, percent: float, search_id_column: str, score_column: str = 'score', bucket_size: int = 1000) → pandas.DataFrame | pyspark.sql.DataFrame[source]

Down-sample by score_column to produce percent * fvs.count() rows.

Parameters:
  • fvs (Union[pd.DataFrame, SparkDataFrame]) – the feature vectors to be downsampled

  • percent (float) – the fraction of the vectors to output, in the interval (0.0, 1.0]

  • search_id_column (str) – the name of the column containing unique identifiers for each record

  • score_column (str) – the column that scored the vectors, should be positively correlated with the probability of the pair being a match

  • bucket_size (int, optional) – the size of the buckets used for partitioning (default: 1000)

Returns:

the down-sampled dataset, containing percent * fvs.count() rows and the same schema as fvs

Return type:

Union[pd.DataFrame, SparkDataFrame]
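
Example (a minimal sketch; 'search_id' is an assumed name for the unique-identifier column in fvs, and 'score' is the default score column):

    from MadLib.tools import down_sample

    # keep roughly 10% of the feature vectors, biased toward high scores
    sampled = down_sample(fvs, percent=0.10, search_id_column='search_id',
                          score_column='score', bucket_size=1000)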

MadLib.tools.featurize(features: List[Callable], A, B, candidates, output_col: str = 'features', fill_na: float = 0.0) → DataFrame[source]

Apply the features to the record pairs in candidates.

Parameters:
  • features (List[Callable]) – a list containing initialized feature objects for columns in A, B, as returned by create_features

  • A (Union[pd.DataFrame, SparkDataFrame]) – the records of table A

  • B (Union[pd.DataFrame, SparkDataFrame]) – the records of table B

  • candidates (Union[pd.DataFrame, SparkDataFrame]) – id pairs of A and B that are potential matches

  • output_col (str, optional) – the name of the column for the resulting feature vectors (default: 'features')

  • fill_na (float) – value to fill in for missing data, default 0.0

Returns:

DataFrame containing the created feature vectors, with schema (id2, id1, <output_col>, other columns from candidates)

Return type:

pandas DataFrame
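
Example (a minimal sketch continuing the create_features example above; the 'id1' and 'id2' column names for the candidate pairs are assumptions, chosen to match the output schema described above):

    import pandas as pd
    from MadLib.tools import featurize

    candidates = pd.DataFrame({'id1': [1, 2], 'id2': [10, 11]})
    fvs = featurize(features, A, B, candidates,
                    output_col='features', fill_na=0.0)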

MadLib.tools.get_base_sim_functions()[source]

Return the base similarity functions.

MadLib.tools.get_base_tokenizers()[source]

Return the base tokenizers.

MadLib.tools.get_extra_tokenizers()[source]

Return extra tokenizers beyond the base set.
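
Example (a minimal sketch; it is assumed, based on the names, that these helpers return lists of similarity functions and tokenizers that can be passed to create_features):

    from MadLib.tools import (create_features, get_base_sim_functions,
                              get_base_tokenizers, get_extra_tokenizers)

    sims = get_base_sim_functions()
    toks = get_base_tokenizers() + get_extra_tokenizers()
    features = create_features(A, B, a_cols=['name'], b_cols=['name'],
                               sim_functions=sims, tokenizers=toks)
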
MadLib.tools.label_data(model_spec: Dict | MLModel, mode: Literal['batch', 'continuous'], labeler_spec: Dict | Labeler, fvs: pandas.DataFrame | pyspark.sql.DataFrame, seeds: DataFrame | None = None, **learner_kwargs) → DataFrame[source]

Generate labeled data using active learning.

Parameters:
  • model_spec (Union[Dict, MLModel]) – Either a dict with model configuration (e.g. {'model_type': 'sklearn', 'model_class': XGBClassifier}) or an MLModel instance

  • mode (Literal["batch", "continuous"]) – Whether to use batch or continuous active learning

  • labeler_spec (Union[Dict, Labeler]) – Either a dict with labeler configuration (e.g. {'name': 'cli', 'a_df': df_a, 'b_df': df_b}) or a Labeler instance

  • fvs (pandas DataFrame) – The data that needs to be labeled

  • seeds (pandas DataFrame, optional) – Initial labeled examples to start with

  • **learner_kwargs – Additional keyword arguments to pass to the active learner constructor. For batch mode, see EntropyActiveLearner (e.g. batch_size, max_iter). For continuous mode, see ContinuousEntropyActiveLearner (e.g. queue_size, max_labeled, on_demand_stop).

Returns:

DataFrame with ids of potential matches and the corresponding label

Return type:

pandas DataFrame
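
Example (a minimal sketch of batch-mode active learning; fvs and seeds are assumed to come from featurize and create_seeds above, the model_spec and labeler_spec dicts follow the examples in the parameter descriptions, and batch_size is an illustrative EntropyActiveLearner keyword argument passed through **learner_kwargs):

    from xgboost import XGBClassifier
    from MadLib.tools import label_data

    model_spec = {'model_type': 'sklearn', 'model_class': XGBClassifier}
    labeler_spec = {'name': 'cli', 'a_df': A, 'b_df': B}

    labeled = label_data(model_spec, mode='batch', labeler_spec=labeler_spec,
                         fvs=fvs, seeds=seeds, batch_size=10)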

MadLib.tools.train_matcher(model_spec: Dict | MLModel, labeled_data: pandas.DataFrame | pyspark.sql.DataFrame, feature_col: str = 'features', label_col: str = 'label') → MLModel[source]

Train a matcher model on labeled data.

Parameters:
  • model_spec (Union[Dict, MLModel]) – Either a dict with model configuration (e.g. {'model_type': 'sklearn', 'model': XGBClassifier, 'model_args': {'max_depth': 6}}) or an MLModel instance

  • labeled_data (pandas DataFrame) – DataFrame containing the labeled data

  • feature_col (str, optional) – Name of the column containing feature vectors (default: 'features')

  • label_col (str, optional) – Name of the column containing labels (default: 'label')

Returns:

The trained model

Return type:

MLModel
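
Example (a minimal sketch using the dict-style model_spec from the parameter description above; labeled and fvs are assumed to come from the label_data and featurize examples, and 'features' and 'label' are the documented default column names):

    from xgboost import XGBClassifier
    from MadLib.tools import train_matcher, apply_matcher

    model_spec = {'model_type': 'sklearn', 'model': XGBClassifier,
                  'model_args': {'max_depth': 6}}
    model = train_matcher(model_spec, labeled,
                          feature_col='features', label_col='label')

    # the trained model can then be applied with apply_matcher (see above)
    predictions = apply_matcher(model, fvs, feature_col='features',
                                output_col='prediction')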