Fast Text Tokenization


Documentation for package ‘tok’ version 0.2.0

Help Pages

decoder_byte_level        Byte-level decoder
encoding                  Encoding
model_bpe                 BPE model
model_unigram             An implementation of the Unigram algorithm
model_wordpiece           An implementation of the WordPiece algorithm
normalizer_nfc            NFC normalizer
normalizer_nfkc           NFKC normalizer
pre_tokenizer             Generic class for pre-tokenizers
pre_tokenizer_byte_level  Byte-level pre-tokenizer
pre_tokenizer_whitespace  Pre-tokenizer that splits using the regex \w+|[^\w\s]+
processor_byte_level      Byte-level post-processor
tokenizer                 Tokenizer
tok_decoder               Generic class for decoders
tok_model                 Generic class for tokenization models
tok_normalizer            Generic class for normalizers
tok_processor             Generic class for processors
tok_trainer               Generic training class
trainer_bpe               BPE trainer
trainer_unigram           Unigram tokenizer trainer
trainer_wordpiece         WordPiece tokenizer trainer