MadLib._internal.featurization.create_features

MadLib._internal.featurization.create_features(A: DataFrame | DataFrame, B: DataFrame | DataFrame, a_cols: List[str], b_cols: List[str], sim_functions: List[Callable[[...], Any]] | None = None, tokenizers: List[Callable[[...], Any]] | None = None, null_threshold: float = 0.5) → List[Callable]

Creates the features that will be used to featurize your tuple pairs.

Parameters:
  • A (Union[pd.DataFrame, SparkDataFrame]) – The records of table A

  • B (Union[pd.DataFrame, SparkDataFrame]) – The records of table B

  • a_cols (list) – The names of the columns for DataFrame A that should have features generated

  • b_cols (list) – The names of the columns for DataFrame B that should have features generated

  • sim_functions (list of callables, optional) – Similarity functions to apply when generating features (default: None)

  • tokenizers (list of callables, optional) – Tokenizers to use when generating features (default: None)

  • null_threshold (float) – The fraction of values in a column that must be null for the column to be dropped and excluded from feature generation (default: 0.5)
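The effect of null_threshold can be illustrated with a minimal, library-independent sketch. The helper name columns_to_keep and the dict-of-lists table are hypothetical, and the choice to drop a column once its null fraction reaches the threshold (rather than strictly exceeds it) is an assumption; the documentation above does not say how the boundary case is handled.

```python
from typing import Dict, List, Optional


def columns_to_keep(
    table: Dict[str, List[Optional[str]]],
    cols: List[str],
    null_threshold: float = 0.5,
) -> List[str]:
    """Return the columns whose fraction of null values is below null_threshold."""
    kept = []
    for col in cols:
        values = table[col]
        # An empty column is treated as entirely null.
        null_fraction = (
            sum(v is None for v in values) / len(values) if values else 1.0
        )
        # Assumption: a column is dropped once its null fraction reaches the threshold.
        if null_fraction < null_threshold:
            kept.append(col)
    return kept


table_a = {
    "name": ["Ann", "Bob", None, "Dee"],      # 1 of 4 values is null
    "phone": [None, None, None, "555-0100"],  # 3 of 4 values are null
}

kept = columns_to_keep(table_a, ["name", "phone"], null_threshold=0.5)
```

With the default threshold of 0.5, only "name" survives in this sketch; raising the threshold above 0.75 would keep "phone" as well.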

Returns:

A list of initialized feature objects for the selected columns in A and B.

Return type:

List[Callable]
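To make the idea of a callable feature concrete, here is a hedged sketch of how a tokenizer and a similarity function might be combined into one feature for a pair of columns. It does not reproduce the library's feature classes: make_feature, whitespace_tokenizer, jaccard, and the column names are all hypothetical, and the way the real feature objects are invoked is not described on this page.

```python
from typing import Callable, Dict, Optional, Set


def whitespace_tokenizer(text: str) -> Set[str]:
    """Lowercase the text and split it on whitespace into a set of tokens."""
    return set(text.lower().split())


def jaccard(x: Set[str], y: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def make_feature(
    a_col: str,
    b_col: str,
    tokenizer: Callable[[str], Set[str]],
    sim: Callable[[Set[str], Set[str]], float],
) -> Callable[[Dict[str, Optional[str]], Dict[str, Optional[str]]], Optional[float]]:
    """Build a callable that scores one record from A against one record from B."""

    def feature(a_row, b_row):
        a_val, b_val = a_row.get(a_col), b_row.get(b_col)
        # Hypothetical convention: a missing value on either side yields no score.
        if a_val is None or b_val is None:
            return None
        return sim(tokenizer(a_val), tokenizer(b_val))

    return feature


# One feature per (column pair, tokenizer, similarity function) combination.
features = [make_feature("name", "full_name", whitespace_tokenizer, jaccard)]
score = features[0]({"name": "Ann Lee"}, {"full_name": "ann lee smith"})
```

In this sketch each feature closes over its column names, tokenizer, and similarity function, so applying the whole list to a tuple pair produces one value per feature.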