MadLib._internal.featurization.create_features
- MadLib._internal.featurization.create_features(A: DataFrame | DataFrame, B: DataFrame | DataFrame, a_cols: List[str], b_cols: List[str], sim_functions: List[Callable[[...], Any]] | None = None, tokenizers: List[Callable[[...], Any]] | None = None, null_threshold: float = 0.5) List[Callable] [source]
creates the features which will be used to featurize your tuple pairs
- Parameters:
A (Union[pd.DataFrame, SparkDataFrame]) – the records of table A
B (Union[pd.DataFrame, SparkDataFrame]) – the records of table B
a_cols (list) – The names of the columns for DataFrame A that should have features generated
b_cols (list) – The names of the columns for DataFrame B that should have features generated
sim_functions (list of callables, optional) – similarity functions to apply (default: None)
tokenizers (list of callables, optional) – tokenizers to use (default: None)
null_threshold (float) – the portion of values that must be null in order for the column to be dropped and not considered for feature generation
- Returns:
a list containing initialized feature objects for columns in A, B
- Return type:
List[Callable]