MadLib._internal.featurization.featurize

MadLib._internal.featurization.featurize(features: List[Callable], A, B, candidates, output_col: str = 'features', fill_na: float = 0.0) → DataFrame

Applies the features to each record pair in candidates and returns the resulting feature vectors.

Parameters:
  • features (List[Callable]) – a list of initialized feature objects for columns in A, B

  • A (Union[pd.DataFrame, SparkDataFrame]) – the records of table A

  • B (Union[pd.DataFrame, SparkDataFrame]) – the records of table B

  • candidates (Union[pd.DataFrame, SparkDataFrame]) – id pairs of A and B that are potential matches

  • output_col (str) – the name of the column for the resulting feature vectors, default 'features'

  • fill_na (float) – value to fill in for missing data, default 0.0

Returns:

DataFrame of feature vectors with the following schema: (id2, id1, fv, other columns from candidates)

Return type:

pandas DataFrame
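
The behaviour described above can be illustrated with a small pandas-only sketch. This is not the library's implementation: featurize_sketch, exact_name, and len_diff are hypothetical names, and the sketch assumes that A and B each carry an `id` column, that candidates holds one (id2, id1) row per pair with id1 referring to A and id2 to B, and that each feature is a callable taking the two records and returning a float (or NaN when a value is missing).

```python
from typing import Callable, List

import numpy as np
import pandas as pd


def featurize_sketch(features: List[Callable], A: pd.DataFrame, B: pd.DataFrame,
                     candidates: pd.DataFrame, output_col: str = "features",
                     fill_na: float = 0.0) -> pd.DataFrame:
    """Illustrative stand-in only; the real featurize may work differently."""
    # Assumption: both tables expose their record id in an "id" column.
    a_rows = A.set_index("id")
    b_rows = B.set_index("id")
    vectors = []
    # Assumption: id1 points into A and id2 points into B.
    for id1, id2 in zip(candidates["id1"], candidates["id2"]):
        rec_a, rec_b = a_rows.loc[id1], b_rows.loc[id2]
        fv = np.array([f(rec_a, rec_b) for f in features], dtype=float)
        # Replace missing feature values with fill_na.
        vectors.append(np.where(np.isnan(fv), fill_na, fv).tolist())
    out = candidates.copy()
    out[output_col] = pd.Series(vectors, index=out.index)
    return out


# Two hypothetical features over a "name" column.
def exact_name(a, b):
    if pd.isna(a["name"]) or pd.isna(b["name"]):
        return np.nan
    return float(a["name"] == b["name"])


def len_diff(a, b):
    if pd.isna(a["name"]) or pd.isna(b["name"]):
        return np.nan
    return float(abs(len(a["name"]) - len(b["name"])))


A = pd.DataFrame({"id": [1, 2], "name": ["anna", "bob"]})
B = pd.DataFrame({"id": [10, 11], "name": ["anna", None]})
candidates = pd.DataFrame({"id2": [10, 11], "id1": [1, 2]})

result = featurize_sketch([exact_name, len_diff], A, B, candidates,
                          output_col="fv", fill_na=-1.0)
print(result)
```

In this sketch the pair (id1=1, id2=10) yields [1.0, 0.0], while the pair with a missing name in B yields [-1.0, -1.0] because both features return NaN and are replaced by fill_na. The columns of candidates are carried through unchanged, with the feature vector column appended.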