MadLib._internal.tokenizer.tokenizer.Tokenizer

class MadLib._internal.tokenizer.tokenizer.Tokenizer

Bases: ABC

__init__()

Methods

__init__()

out_col_name(input_col)

The name of the output column produced by the tokenizer. For example, for a 3gram tokenizer applied to the name column, the output column could be "3gram(name)".

tokenize(s)

Convert the string into a bag of tokens (duplicate tokens are kept, not deduplicated).

tokenize_set(s)

Tokenize the string and return the tokens as a set, or None if tokenize returns None.

tokenize_spark(input_col)

Return a column expression that gives the same output as the tokenize method.

out_col_name(input_col)

The name of the output column produced by the tokenizer. For example, for a 3gram tokenizer applied to the name column, the output column could be “3gram(name)”.
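As a rough illustration of the naming convention described above, the sketch below wraps the input column name in a tokenizer label. The `ThreeGramNaming` class and its `NAME` attribute are hypothetical and not taken from MadLib; the library's actual format may differ.

```python
class ThreeGramNaming:
    # Hypothetical label for this tokenizer; an assumption, not MadLib's code.
    NAME = "3gram"

    def out_col_name(self, input_col: str) -> str:
        # Wrap the input column name in the tokenizer label.
        return f"{self.NAME}({input_col})"


col = ThreeGramNaming().out_col_name("name")
print(col)  # 3gram(name)
```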

abstractmethod tokenize(s)

Convert the string into a bag of tokens (duplicate tokens are kept, not deduplicated).
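A minimal sketch of what a concrete subclass might look like, assuming the bag is represented as a list and that a missing input yields None. `TokenizerSketch` and `ThreeGramSketch` are hypothetical stand-ins written for this example, not the library's classes.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class TokenizerSketch(ABC):
    """Hypothetical stand-in for the abstract base class; not MadLib's code."""

    @abstractmethod
    def tokenize(self, s: Optional[str]) -> Optional[List[str]]:
        """Return a bag (list) of tokens, or None for missing input."""


class ThreeGramSketch(TokenizerSketch):
    def tokenize(self, s: Optional[str]) -> Optional[List[str]]:
        if s is None:
            return None
        # Slide a window of width 3 over the string; repeated grams are kept.
        return [s[i:i + 3] for i in range(len(s) - 2)]


tokens = ThreeGramSketch().tokenize("aaaa")
print(tokens)  # ['aaa', 'aaa']
```

The repeated `'aaa'` shows the bag semantics: the same token appears once per occurrence in the input.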

tokenize_set(s)

Tokenize the string and return the tokens as a set, or None if tokenize returns None.
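The relationship between the bag and set forms can be sketched as follows. This is an illustration of the described behavior, not the library's implementation; `tokenize_whitespace` is a hypothetical tokenizer used only to have something to wrap.

```python
from typing import List, Optional, Set


def tokenize_whitespace(s: Optional[str]) -> Optional[List[str]]:
    # Hypothetical tokenizer: a bag (list) of tokens, or None for missing input.
    return None if s is None else s.split()


def tokenize_set(s: Optional[str]) -> Optional[Set[str]]:
    tokens = tokenize_whitespace(s)
    # Propagate None unchanged; otherwise drop duplicates by building a set.
    return None if tokens is None else set(tokens)


result = tokenize_set("new york new jersey")
print(sorted(result))  # ['jersey', 'new', 'york']
```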

tokenize_spark(input_col: Column)

Return a column expression that gives the same output as the tokenize method. Required for efficiency when building metadata for certain methods.