High-level tokenizer wrapper for DNN usage.
#include <opencv2/dnn/dnn.hpp>
Public Member Functions

Tokenizer()
    Construct a tokenizer with a given method (default: BPE). For the BPE method you normally call Tokenizer::load() to initialize model data.

std::string decode(const std::vector<int>& tokens)

std::vector<int> encode(const std::string& text)
    Encode UTF-8 text to token ids (special tokens currently disabled).
Detailed Description

High-level tokenizer wrapper for DNN usage.

Provides a simple API to encode and decode tokens for LLMs. Models are loaded via Tokenizer::load().

    std::vector<int> ids = tok.encode("hello world");
    std::string text = tok.decode(ids);
◆ Tokenizer()

cv::dnn::Tokenizer::Tokenizer()

Construct a tokenizer with a given method (default: BPE). For the BPE method you normally call Tokenizer::load() to initialize model data.
◆ decode()

std::string cv::dnn::Tokenizer::decode(const std::vector<int>& tokens)

Python:
    cv.dnn.Tokenizer.decode(tokens) -> retval
◆ encode()

std::vector<int> cv::dnn::Tokenizer::encode(const std::string& text)

Python:
    cv.dnn.Tokenizer.encode(text) -> retval

Encode UTF-8 text to token ids (special tokens currently disabled).

Calls the underlying CoreBPE::encode with an empty allowed-special set.

Parameters
    text    UTF-8 text to encode.

Returns
    Vector of token ids (32-bit ids narrowed to int for convenience).
◆ load()

static Tokenizer cv::dnn::Tokenizer::load(CV_WRAP_FILE_PATH const std::string& model_config)

Python:
    cv.dnn.Tokenizer.load(model_config) -> retval
    cv.dnn.Tokenizer_load(model_config) -> retval

Load a tokenizer from a model directory.

Expects the directory to contain:
- config.json with a model_type field set to "gpt2" or "gpt4".
- tokenizer.json produced by the corresponding model family.

The argument is a path prefix; this function concatenates file names directly (e.g. model_dir + "config.json"), so model_dir must end with an appropriate path separator.

Parameters
    model_config    Path prefix of the model directory containing config.json.

Returns
    A Tokenizer ready for use. Throws cv::Exception if files are missing or model_type is unsupported.
The documentation for this class was generated from the following file: opencv2/dnn/dnn.hpp