OpenCV 5.0.0-pre
Open Source Computer Vision
cv::dnn::Tokenizer Class Reference

High-level tokenizer wrapper for DNN usage. More...

#include <opencv2/dnn/dnn.hpp>

Collaboration diagram for cv::dnn::Tokenizer:

Public Member Functions

 Tokenizer ()
 Construct a tokenizer with the default method (BPE). For the BPE method, you normally call Tokenizer::load() to initialize the model data.
 
std::string decode (const std::vector< int > &tokens)
 Decode token ids back to a UTF-8 string.
 
std::vector< int > encode (const std::string &text)
 Encode UTF-8 text to token ids (special tokens currently disabled).
 

Static Public Member Functions

static Tokenizer load (CV_WRAP_FILE_PATH const std::string &model_config)
 Load a tokenizer from a model directory.
 

Detailed Description

High-level tokenizer wrapper for DNN usage.

Provides a simple API to encode and decode tokens for LLMs. Models are loaded via Tokenizer::load().

using namespace cv::dnn;
Tokenizer tok = Tokenizer::load("/path/to/model/");
std::vector<int> ids = tok.encode("hello world");
std::string text = tok.decode(ids);

Constructor & Destructor Documentation

◆ Tokenizer()

cv::dnn::Tokenizer::Tokenizer ( )

Construct a tokenizer with the default method (BPE). For the BPE method, you normally call Tokenizer::load() to initialize the model data.

Member Function Documentation

◆ decode()

std::string cv::dnn::Tokenizer::decode ( const std::vector< int > & tokens)
Python:
cv.dnn.Tokenizer.decode(tokens) -> retval

Decode a sequence of token ids back to a UTF-8 string.

◆ encode()

std::vector< int > cv::dnn::Tokenizer::encode ( const std::string & text)
Python:
cv.dnn.Tokenizer.encode(text) -> retval

Encode UTF-8 text to token ids (special tokens currently disabled).

Calls the underlying CoreBPE::encode with an empty allowed-special set.

Parameters
text: UTF-8 input string.
Returns
Vector of token ids (32-bit ids narrowed to int for convenience).

◆ load()

static Tokenizer cv::dnn::Tokenizer::load ( CV_WRAP_FILE_PATH const std::string & model_config)
static
Python:
cv.dnn.Tokenizer.load(model_config) -> retval
cv.dnn.Tokenizer_load(model_config) -> retval

Load a tokenizer from a model directory.

Expects the directory to contain:

  • config.json with a model_type field whose value is "gpt2" or "gpt4".
  • tokenizer.json produced by the corresponding model family.
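As a sketch, a minimal config.json satisfying the first requirement could contain just the required field (real model configs typically carry additional fields, omitted here):

```json
{
  "model_type": "gpt2"
}
```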

The argument is a path prefix: the function concatenates file names onto it directly (e.g. model_config + "config.json"), so model_config must end with an appropriate path separator.

Parameters
model_config: Path to the model directory (used as a prefix for config.json and tokenizer.json).
Returns
A Tokenizer ready for use. Throws cv::Exception if files are missing or model_type is unsupported.

The documentation for this class was generated from the following file: opencv2/dnn/dnn.hpp