Package org.opencv.dnn
Class Tokenizer
- java.lang.Object
-
- org.opencv.dnn.Tokenizer
-
public class Tokenizer extends java.lang.Object
High-level tokenizer wrapper for DNN usage. Provides a simple API to encode and decode tokens for LLMs. Models are loaded via Tokenizer::load(). Example (C++ API):
using namespace cv::dnn;
Tokenizer tok = Tokenizer::load("/path/to/model/");
std::vector<int> ids = tok.encode("hello world");
std::string text = tok.decode(ids);
-
-
Field Summary
Fields
Modifier and Type    Field    Description
protected long    nativeObj
-
Constructor Summary
Constructors
Modifier    Constructor    Description
protected    Tokenizer(long addr)
-
Method Summary
Modifier and Type    Method    Description
static Tokenizer    __fromPtr__(long addr)
java.lang.String    decode(MatOfInt tokens)
MatOfInt    encode(java.lang.String text)    Encode UTF-8 text to token ids (special tokens currently disabled).
long    getNativeObjAddr()
static Tokenizer    load(java.lang.String model_config)    Load a tokenizer from a model directory.
-
-
-
Method Detail
-
getNativeObjAddr
public long getNativeObjAddr()
-
__fromPtr__
public static Tokenizer __fromPtr__(long addr)
-
load
public static Tokenizer load(java.lang.String model_config)
Load a tokenizer from a model directory. Expects the directory to contain:
- config.json with field model_type with value "gpt2" or "gpt4".
- tokenizer.json produced by the corresponding model family.
The argument is a path prefix; this function concatenates file names directly (e.g. model_dir + "config.json"), so model_dir must end with an appropriate path separator.
Parameters:
model_config - Path to config.json for model.
Returns:
A Tokenizer ready for use. Throws cv::Exception if files are missing or model_type is unsupported.
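Because the description says file names are appended to the argument without inserting a separator, it can help to normalize the path and check for the two expected files before calling load. The following is a minimal sketch under that assumption; the class TokenizerPaths and its methods are hypothetical helpers, not part of OpenCV, and the sketch does not inspect model_type.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of org.opencv.dnn.
final class TokenizerPaths {
    // File names the load() description says the model directory must contain.
    static final String[] REQUIRED_FILES = {"config.json", "tokenizer.json"};

    /** Returns dir unchanged if it already ends with '/' or the platform separator; otherwise appends the separator. */
    static String withTrailingSeparator(String dir) {
        if (dir.endsWith("/") || dir.endsWith(File.separator)) {
            return dir;
        }
        return dir + File.separator;
    }

    /** Lists the required files that are absent when each name is appended directly to the prefix. */
    static List<String> missingFiles(String prefix) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED_FILES) {
            // Mirror the documented behaviour: plain concatenation, no separator inserted.
            if (!Files.isRegularFile(Paths.get(prefix + name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String prefix = withTrailingSeparator(args.length > 0 ? args[0] : ".");
        System.out.println("prefix:  " + prefix);
        System.out.println("missing: " + missingFiles(prefix));
    }
}
```

If missingFiles returns an empty list, the normalized prefix is a reasonable candidate to pass to load; a non-empty list points to the files whose absence would presumably trigger the documented cv::Exception.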
-
encode
public MatOfInt encode(java.lang.String text)
Encode UTF-8 text to token ids (special tokens currently disabled). Calls the underlying CoreBPE::encode with an empty allowed-special set.
Parameters:
text - UTF-8 input string.
Returns:
Vector of token ids (32-bit ids narrowed to int for convenience).
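The narrowing note only matters if an id exceeds Integer.MAX_VALUE, which would show up as a negative int in Java. Assuming the native ids are unsigned 32-bit values (this page does not say so explicitly), the sketch below shows one way to recover the unsigned value from an int array such as the one returned by MatOfInt.toArray(). The class TokenIds is a hypothetical helper and the sample ids are arbitrary.

```java
import java.util.Arrays;

// Hypothetical helper, not part of org.opencv.dnn.
final class TokenIds {
    /** Reinterprets each int as an unsigned 32-bit value and widens it to long. */
    static long[] toUnsigned(int[] ids) {
        long[] out = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            out[i] = Integer.toUnsignedLong(ids[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Arbitrary sample ids; -1 stands in for an id above Integer.MAX_VALUE.
        int[] ids = {0, 42, -1};
        System.out.println(Arrays.toString(toUnsigned(ids))); // prints [0, 42, 4294967295]
    }
}
```

Ids that fit in a signed int pass through unchanged, so the conversion is needed only when negative values appear.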
-
decode
public java.lang.String decode(MatOfInt tokens)
-
-