MadLib.tools
Public API functions for MadLib.
This module provides the main functions that users will interact with. Implementation details are hidden in the _internal package.
Functions
apply_matcher | Apply a trained model to make predictions.
create_seeds | Create seeds to train a model.
down_sample | Down-sample by score_column to produce percent * fvs.count() rows.
label_data | Generate labeled data using active learning.
train_matcher | Train a matcher model on labeled data.
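One plausible way to chain the functions documented on this page is sketched below. The call order and argument positions follow the signatures in this reference, but the wrapper `run_matching_workflow`, its `tools` parameter (the imported `MadLib.tools` module, passed in explicitly), and the defaults for `percent` and `nseeds` are assumptions made for illustration. This reference also does not state whether the candidates carry the `score` column that `down_sample` and `create_seeds` expect, or whether the output of `label_data` includes the feature column that `train_matcher` reads, so treat the data flow as a sketch rather than a verified pipeline.

```python
def run_matching_workflow(tools, A, B, candidates, a_cols, b_cols,
                          labeler, model_spec, search_id_column,
                          percent=0.1, nseeds=50):
    """Chain the public functions in the order their inputs and outputs suggest.

    `tools` is expected to be the MadLib.tools module (assumption: passed in
    by the caller so this sketch does not depend on how MadLib is installed).
    """
    # 1. Build feature objects for the chosen columns of A and B.
    features = tools.create_features(A, B, a_cols, b_cols)
    # 2. Turn candidate id pairs into feature vectors.
    fvs = tools.featurize(features, A, B, candidates, output_col="features")
    # 3. Reduce the number of vectors before labeling (uses the 'score' column).
    sample = tools.down_sample(fvs, percent, search_id_column)
    # 4. Label a small seed set, then continue with active learning.
    seeds = tools.create_seeds(sample, nseeds, labeler)
    labeled = tools.label_data(model_spec, "batch", labeler, sample, seeds)
    # 5. Train on the labeled rows and predict on all feature vectors.
    model = tools.train_matcher(model_spec, labeled,
                                feature_col="features", label_col="label")
    return tools.apply_matcher(model, fvs, "features", "prediction")
```

A caller would import `MadLib.tools` and pass the module as the first argument, together with a `model_spec` and a labeler built as described in the entries for `label_data` and `train_matcher`.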
- MadLib.tools.apply_matcher(model: MLModel | SKLearnModel | SparkMLModel, df: DataFrame | DataFrame, feature_col: str, output_col: str) DataFrame [source]
Apply a trained model to make predictions.
- Parameters:
model (Union[MLModel, SKLearnModel, SparkMLModel]) – Either a trained MLModel instance, or a trained scikit-learn or Spark model instance
df (pandas DataFrame) – The DataFrame to make predictions on
feature_col (str) – Name of the column containing feature vectors
output_col (str) – Name of the column to store predictions in
- Returns:
The input DataFrame with predictions added
- Return type:
pandas DataFrame
- MadLib.tools.create_features(A: DataFrame | DataFrame, B: DataFrame | DataFrame, a_cols: List[str], b_cols: List[str], sim_functions: List[Callable[[...], Any]] | None = None, tokenizers: List[Callable[[...], Any]] | None = None, null_threshold: float = 0.5) List[Callable] [source]
Create the features that will be used to featurize your tuple pairs.
- Parameters:
A (Union[pd.DataFrame, SparkDataFrame]) – the records of table A
B (Union[pd.DataFrame, SparkDataFrame]) – the records of table B
a_cols (list) – The names of the columns for DataFrame A that should have features generated
b_cols (list) – The names of the columns for DataFrame B that should have features generated
sim_functions (list of callables, optional) – similarity functions to apply (default: None)
tokenizers (list of callables, optional) – tokenizers to use (default: None)
null_threshold (float) – the fraction of a column's values that must be null for the column to be dropped and excluded from feature generation (default: 0.5)
- Returns:
a list containing initialized feature objects for columns in A, B
- Return type:
List[Callable]
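To make the null_threshold parameter concrete, the sketch below computes the null fraction of each column and keeps only the columns that fall below the threshold. It is a plain-Python illustration, not MadLib's implementation: the helper name `columns_below_null_threshold`, the list-of-dicts row format, and the choice to drop a column when its null fraction is greater than or equal to the threshold are all assumptions.

```python
def columns_below_null_threshold(rows, cols, null_threshold=0.5):
    """Return the columns whose fraction of None values is below null_threshold."""
    kept = []
    for col in cols:
        nulls = sum(1 for row in rows if row.get(col) is None)
        null_fraction = nulls / len(rows) if rows else 0.0
        # Assumption: a column is dropped when null_fraction >= null_threshold.
        if null_fraction < null_threshold:
            kept.append(col)
    return kept


rows = [
    {"name": "Ann", "phone": None},
    {"name": "Bob", "phone": None},
    {"name": None, "phone": "555-0100"},
    {"name": "Dee", "phone": None},
]
# "name" is 25% null and is kept; "phone" is 75% null and is dropped.
kept = columns_below_null_threshold(rows, ["name", "phone"])
```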
- MadLib.tools.create_seeds(fvs: DataFrame | DataFrame, nseeds: int, labeler: Labeler | Dict, score_column: str = 'score') DataFrame [source]
Create seeds to train a model.
- Parameters:
fvs (pandas DataFrame) – the DataFrame with feature vectors that is your training data
nseeds (int) – the number of seeds you want to use to train an initial model
labeler (Union[Labeler, Dict]) – the labeler object (or a labeler_spec dict) you want to use to assign labels to rows
score_column (str) – the name of the score column in your fvs DataFrame
- Returns:
A DataFrame with labeled seeds, schema is (previous schema of fvs, label) where the values in label are either 0.0 or 1.0
- Return type:
pandas DataFrame
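The role of score_column in seeding can be illustrated with a small sketch. It assumes that seeds are drawn from both ends of the score ranking so that the seed set is likely to contain matches and non-matches; this reference does not describe the actual selection strategy. The helper `create_seeds_sketch`, the plain callable `label_fn` standing in for a Labeler, and the list-of-dicts row format are all hypothetical.

```python
def create_seeds_sketch(rows, nseeds, label_fn, score_key="score"):
    """Pick rows from the top and bottom of the score ranking and label them."""
    ranked = sorted(rows, key=lambda r: r[score_key], reverse=True)
    nseeds = min(nseeds, len(ranked))
    n_top = nseeds // 2
    n_bottom = nseeds - n_top
    # Assumption: half the seeds come from the highest scores, half from the lowest.
    picked = ranked[:n_top] + ranked[len(ranked) - n_bottom:]
    # Each seed keeps its original fields and gains a 0.0 / 1.0 label.
    return [dict(row, label=float(label_fn(row))) for row in picked]


def gold_label(row):
    # Hypothetical stand-in for a labeler: treat high-scoring pairs as matches.
    return row["score"] > 0.6


rows = [{"id1": i, "id2": i + 100, "score": s}
        for i, s in enumerate([0.9, 0.1, 0.8, 0.2, 0.5])]
seeds = create_seeds_sketch(rows, 4, gold_label)
```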
- MadLib.tools.down_sample(fvs: DataFrame | DataFrame, percent: float, search_id_column: str, score_column: str = 'score', bucket_size: int = 1000) DataFrame | DataFrame [source]
Down-sample by score_column to produce percent * fvs.count() rows.
- Parameters:
fvs (Union[pd.DataFrame, SparkDataFrame]) – the feature vectors to be downsampled
percent (float) – the fraction of the vectors to output, in the range (0.0, 1.0]
search_id_column (str) – the name of the column containing unique identifiers for each record
score_column (str) – the column that scored the vectors, should be positively correlated with the probability of the pair being a match
bucket_size (int) – the size of the buckets used for partitioning (default: 1000)
- Returns:
the down sampled dataset with percent * fvs.count() rows with the same schema as fvs
- Return type:
Union[pd.DataFrame, SparkDataFrame]
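The row arithmetic behind percent and bucket_size can be sketched as follows. This is an illustration only, under two assumptions that this reference does not confirm: that rows are split into consecutive buckets of bucket_size, and that the highest-scoring rows in each bucket are the ones retained. The helper name `down_sample_sketch` and the list-of-dicts row format are hypothetical.

```python
import math


def down_sample_sketch(rows, percent, score_key="score", bucket_size=1000):
    """Keep roughly percent * len(rows) rows, choosing the top scores per bucket."""
    if not 0.0 < percent <= 1.0:
        raise ValueError("percent must be in (0.0, 1.0]")
    sampled = []
    for start in range(0, len(rows), bucket_size):
        bucket = rows[start:start + bucket_size]
        keep = math.ceil(percent * len(bucket))
        # Assumption: within a bucket, the highest-scoring rows are retained.
        top = sorted(bucket, key=lambda r: r[score_key], reverse=True)[:keep]
        sampled.extend(top)
    return sampled


rows = [{"id": i, "score": i / 10} for i in range(10)]
# Two buckets of 5 rows each, 2 rows kept per bucket: 0.4 * 10 = 4 rows in total.
sample = down_sample_sketch(rows, percent=0.4, bucket_size=5)
```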
- MadLib.tools.featurize(features: List[Callable], A, B, candidates, output_col: str = 'features', fill_na: float = 0.0) DataFrame [source]
applies the featurizer to the record pairs in candidates
- Parameters:
features (List[Callable]) – a list of initialized feature objects for columns in A, B
A (Union[pd.DataFrame, SparkDataFrame]) – the records of table A
B (Union[pd.DataFrame, SparkDataFrame]) – the records of table B
candidates (Union[pd.DataFrame, SparkDataFrame]) – id pairs of A and B that are potential matches
output_col (str) – the name of the column for the resulting feature vectors, default 'features'
fill_na (float) – value to fill in for missing data, default 0.0
- Returns:
DataFrame with feature vectors created with the following schema: (id2, id1, fv, other columns from candidates)
- Return type:
pandas DataFrame
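What featurization and fill_na amount to can be shown with a small plain-Python sketch. It is not MadLib's implementation: the helper `featurize_sketch`, the assumption that each feature is a callable taking one record from A and one from B, and the dict-based tables keyed by id are all hypothetical.

```python
def featurize_sketch(features, A, B, candidates, fill_na=0.0):
    """Apply every feature to each candidate pair; replace missing values with fill_na."""
    rows = []
    for id1, id2 in candidates:
        record_a, record_b = A[id1], B[id2]
        vector = []
        for feature in features:
            value = feature(record_a, record_b)
            vector.append(fill_na if value is None else float(value))
        rows.append({"id1": id1, "id2": id2, "features": vector})
    return rows


def exact_match(col):
    """Build a feature that compares one column; it returns None if either side is null."""
    def feature(record_a, record_b):
        if record_a.get(col) is None or record_b.get(col) is None:
            return None
        return 1.0 if record_a[col] == record_b[col] else 0.0
    return feature


A = {1: {"name": "Ann", "city": "Madison"}}
B = {7: {"name": "Ann", "city": None}}
# The city comparison is undefined (null on one side), so it becomes fill_na.
fvs = featurize_sketch([exact_match("name"), exact_match("city")], A, B, [(1, 7)])
```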
- MadLib.tools.label_data(model_spec: Dict | MLModel, mode: Literal['batch', 'continuous'], labeler_spec: Dict | Labeler, fvs: DataFrame | DataFrame, seeds: DataFrame | None = None, **learner_kwargs) DataFrame [source]
Generate labeled data using active learning.
- Parameters:
model_spec (Union[Dict, MLModel]) – Either a dict with model configuration (e.g. {'model_type': 'sklearn', 'model_class': XGBClassifier}) or an MLModel instance
mode (Literal["batch", "continuous"]) – Whether to use batch or continuous active learning
labeler_spec (Union[Dict, Labeler]) – Either a dict with labeler configuration (e.g. {'name': 'cli', 'a_df': df_a, 'b_df': df_b}) or a Labeler instance
fvs (pandas DataFrame) – The data that needs to be labeled
seeds (pandas DataFrame, optional) – Initial labeled examples to start with
**learner_kwargs – Additional keyword arguments to pass to the active learner constructor. For batch mode, see EntropyActiveLearner (e.g. batch_size, max_iter). For continuous mode, see ContinuousEntropyActiveLearner (e.g. queue_size, max_labeled, on_demand_stop).
- Returns:
DataFrame with ids of potential matches and the corresponding label
- Return type:
pandas DataFrame
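The learner names EntropyActiveLearner and ContinuousEntropyActiveLearner suggest entropy-based uncertainty sampling. The sketch below shows that general technique for a binary matcher: compute the entropy of each predicted match probability and pick the batch_size most uncertain rows for labeling. It is an illustration of the idea, not the library's code; the helper names `binary_entropy` and `select_batch` are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli distribution with match probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_batch(probabilities, batch_size):
    """Return the indices of the batch_size rows the model is least certain about."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: binary_entropy(probabilities[i]),
                    reverse=True)
    return ranked[:batch_size]


# Predicted match probabilities for five unlabeled pairs.
probs = [0.02, 0.48, 0.91, 0.60, 0.99]
# The pairs closest to 0.5 (indices 1 and 3) are selected for labeling.
batch = select_batch(probs, batch_size=2)
```

In each round of batch active learning, the selected rows would be sent to the labeler, added to the labeled set, and the model retrained before the next selection.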
- MadLib.tools.train_matcher(model_spec: Dict | MLModel, labeled_data: DataFrame | DataFrame, feature_col: str = 'features', label_col: str = 'label') MLModel [source]
Train a matcher model on labeled data.
- Parameters:
model_spec (Union[Dict, MLModel]) – Either a dict with model configuration (e.g. {'model_type': 'sklearn', 'model': XGBClassifier, 'model_args': {'max_depth': 6}}) or an MLModel instance
labeled_data (pandas DataFrame) – DataFrame containing the labeled data
feature_col (str, optional) – Name of the column containing feature vectors
label_col (str, optional) – Name of the column containing labels
- Returns:
The trained model
- Return type:
MLModel