MadLib._internal.featurization.featurize

MadLib._internal.featurization.featurize(features: List[Callable], A, B, candidates, output_col: str = 'features', fill_na: float = 0.0) → DataFrame

Applies the given features to each record pair in candidates and returns the resulting feature vectors.

Parameters:

features (List[Callable]) – a list of initialized feature objects for columns in A, B
A (Union[pd.DataFrame, SparkDataFrame]) – the records of table A
B (Union[pd.DataFrame, SparkDataFrame]) – the records of table B
candidates (Union[pd.DataFrame, SparkDataFrame]) – id pairs of A and B that are potential matches
output_col (str) – the name of the column for the resulting feature vectors, default 'features'
fill_na (float) – value to fill in for missing data, default 0.0

Returns:

A DataFrame of feature vectors with the following schema: (id2, id1, fv, other columns from candidates)

Return type:

pandas DataFrame
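
Example:

The following is a minimal pandas-only sketch of what featurizing candidate pairs could look like. It is an illustration under stated assumptions, not the library's implementation, and it does not call MadLib. The helper featurize_sketch, the two feature functions, the `_id` key column in A and B, and the reading of `id1` as a key into A and `id2` as a key into B are all hypothetical; only the parameter names and defaults come from the signature above.

```python
from typing import Callable, List

import numpy as np
import pandas as pd


def featurize_sketch(
    features: List[Callable],
    A: pd.DataFrame,
    B: pd.DataFrame,
    candidates: pd.DataFrame,
    output_col: str = "features",
    fill_na: float = 0.0,
) -> pd.DataFrame:
    """Hypothetical stand-in for featurize; NOT the library code."""
    # Assumption: A and B are keyed by an `_id` column, and candidates
    # holds one (id2, id1) pair per row, with id1 -> A and id2 -> B.
    a_rows = A.set_index("_id")
    b_rows = B.set_index("_id")

    vectors = []
    for id1, id2 in zip(candidates["id1"], candidates["id2"]):
        rec_a, rec_b = a_rows.loc[id1], b_rows.loc[id2]
        # One score per feature; each feature is a callable on (rec_a, rec_b).
        raw = np.array([f(rec_a, rec_b) for f in features], dtype=float)
        # Replace missing scores (NaN) with fill_na.
        vectors.append(np.where(np.isnan(raw), fill_na, raw).tolist())

    out = candidates.copy()
    out[output_col] = pd.Series(vectors, index=out.index)
    return out


# Two hypothetical features over a `name` column.
def exact_name(a, b) -> float:
    return float(a["name"] == b["name"])


def name_len_diff(a, b) -> float:
    if pd.isna(a["name"]) or pd.isna(b["name"]):
        return float("nan")  # missing data, later replaced by fill_na
    return float(abs(len(a["name"]) - len(b["name"])))


A = pd.DataFrame({"_id": [1, 2], "name": ["alice", "bob"]})
B = pd.DataFrame({"_id": [10, 11], "name": ["alice", None]})
candidates = pd.DataFrame({"id2": [10, 11], "id1": [1, 2]})

result = featurize_sketch(
    [exact_name, name_len_diff], A, B, candidates, fill_na=-1.0
)
print(result)
```

In this sketch the second pair has a missing name in B, so its length-difference score comes out as NaN and is replaced by the fill_na value passed in the call.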