OpenCV 5.0.0-pre
Open Source Computer Vision
cv::dnn::Tokenizer Class Reference

High-level tokenizer wrapper for DNN usage. More...

#include <opencv2/dnn/dnn.hpp>

Collaboration diagram for cv::dnn::Tokenizer:

Public Member Functions

 Tokenizer ()
 Construct a tokenizer with the default method (BPE). For the BPE method, you normally call Tokenizer::load() to initialize the model data.
 
std::string decode (const std::vector< int > &tokens)
 Decode token ids back to a UTF-8 string.
 
std::vector< int > encode (const std::string &text)
 Encode UTF-8 text to token ids (special tokens currently disabled).
 

Static Public Member Functions

static Tokenizer load (CV_WRAP_FILE_PATH const std::string &model_config)
 Load a tokenizer from a model directory.
 

Detailed Description

High-level tokenizer wrapper for DNN usage.

Provides a simple API to encode and decode tokens for LLMs. Models are loaded via Tokenizer::load().

using namespace cv::dnn;
Tokenizer tok = Tokenizer::load("/path/to/model/");
std::vector<int> ids = tok.encode("hello world");
std::string text = tok.decode(ids);

Constructor & Destructor Documentation

◆ Tokenizer()

cv::dnn::Tokenizer::Tokenizer ( )

Construct a tokenizer with the default method (BPE). For the BPE method, you normally call Tokenizer::load() to initialize the model data.

Member Function Documentation

◆ decode()

std::string cv::dnn::Tokenizer::decode ( const std::vector< int > & tokens)
Python:
cv.dnn.Tokenizer.decode(tokens) -> retval

Decode a sequence of token ids back to a UTF-8 string.

◆ encode()

std::vector< int > cv::dnn::Tokenizer::encode ( const std::string & text)
Python:
cv.dnn.Tokenizer.encode(text) -> retval

Encode UTF-8 text to token ids (special tokens currently disabled).

Calls the underlying CoreBPE::encode with an empty allowed-special set.

Parameters
text: UTF-8 input string.
Returns
Vector of token ids (32-bit ids narrowed to int for convenience).

◆ load()

static Tokenizer cv::dnn::Tokenizer::load ( CV_WRAP_FILE_PATH const std::string & model_config)
static
Python:
cv.dnn.Tokenizer.load(model_config) -> retval
cv.dnn.Tokenizer_load(model_config) -> retval

Load a tokenizer from a model directory.

Expects the directory to contain:

  • config.json with a model_type field whose value is "gpt2" or "gpt4".
  • tokenizer.json produced by the corresponding model family.
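As a sketch, a minimal config.json satisfying the first requirement could contain just the required field (real model configs typically carry additional fields, omitted here):

```json
{
  "model_type": "gpt2"
}
```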

The argument is a path prefix: the function concatenates file names onto it directly (e.g. model_config + "config.json"), so model_config must end with an appropriate path separator.

Parameters
model_config: Path to the model directory (used as a prefix for config.json and tokenizer.json).
Returns
A Tokenizer ready for use. Throws cv::Exception if files are missing or model_type is unsupported.

The documentation for this class was generated from the following file: opencv2/dnn/dnn.hpp