MadLib._internal.tokenizer.tokenizer.Tokenizer
- class MadLib._internal.tokenizer.tokenizer.Tokenizer
Bases: ABC
- __init__()
Methods
__init__()
out_col_name(input_col): the name of the output column from the tokenizer, e.g. for a 3gram tokenizer, the output column for the tokens from the name column could be "3gram(name)".
tokenize(s): convert the string into a bag of tokens (tokens are not deduplicated).
tokenize_set(s): tokenize the string and return a set, or None if tokenize returns None.
tokenize_spark(input_col): return a column expression that gives the same output as the tokenize method.
- out_col_name(input_col)
The name of the output column from the tokenizer, e.g. for a 3gram tokenizer, the output column for the tokens from the name column could be "3gram(name)".
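The sketch below illustrates how the documented methods might relate to one another. It does not import MadLib: `TokenizerSketch` is a stand-in base class that mirrors the method names listed above, and `ThreeGramTokenizerSketch` is a hypothetical character 3-gram subclass. Which methods are abstract in the real class, and how `tokenize_set` is actually implemented, are assumptions. `tokenize_spark` is left out so the example runs without Spark.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Set


class TokenizerSketch(ABC):
    """Stand-in mirroring the documented interface (not the MadLib class)."""

    @abstractmethod
    def tokenize(self, s: Optional[str]) -> Optional[List[str]]:
        """Return a bag of tokens (duplicates kept), or None."""

    @abstractmethod
    def out_col_name(self, input_col: str) -> str:
        """Return the name of the output column for input_col."""

    def tokenize_set(self, s: Optional[str]) -> Optional[Set[str]]:
        # Assumed behaviour: dedupe the bag, passing None through unchanged.
        tokens = self.tokenize(s)
        return None if tokens is None else set(tokens)


class ThreeGramTokenizerSketch(TokenizerSketch):
    """Hypothetical character 3-gram tokenizer, for illustration only."""

    def tokenize(self, s: Optional[str]) -> Optional[List[str]]:
        if s is None:
            return None
        # Slide a window of width 3 over the string; repeats are kept.
        return [s[i:i + 3] for i in range(len(s) - 2)]

    def out_col_name(self, input_col: str) -> str:
        return f"3gram({input_col})"


tok = ThreeGramTokenizerSketch()
bag = tok.tokenize("banana")            # "ana" appears twice in the bag
unique = tok.tokenize_set("banana")     # the set keeps it once
col = tok.out_col_name("name")          # "3gram(name)"
```

The bag keeps the repeated "ana" token while the set drops it, which is the distinction the `tokenize` and `tokenize_set` descriptions draw.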