Class Tokenizer


  • public class Tokenizer
    extends java.lang.Object
    High-level tokenizer wrapper for DNN usage. Provides a simple API to encode and decode tokens for LLMs. Models are loaded via Tokenizer::load().

        using namespace cv::dnn;
        Tokenizer tok = Tokenizer::load("/path/to/model/");
        std::vector<int> ids = tok.encode("hello world");
        std::string text = tok.decode(ids);
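    The C++ snippet above translates to the Java API roughly as follows. This is a minimal sketch assuming the class lives in org.opencv.dnn like the other DNN bindings and that the OpenCV native library is available; the model path is a placeholder.

    ```java
    import org.opencv.core.Core;
    import org.opencv.core.MatOfInt;
    import org.opencv.dnn.Tokenizer;

    public class TokenizerUsage {
        public static void main(String[] args) {
            // The native library must be loaded before any binding is used.
            System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

            // Placeholder path; must end with a separator (see load() below).
            Tokenizer tok = Tokenizer.load("/path/to/model/");
            MatOfInt ids = tok.encode("hello world");
            String text = tok.decode(ids);
            System.out.println(text);
        }
    }
    ```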
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected long nativeObj  
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      protected Tokenizer(long addr)  
    • Field Detail

      • nativeObj

        protected final long nativeObj
    • Constructor Detail

      • Tokenizer

        protected Tokenizer(long addr)
    • Method Detail

      • getNativeObjAddr

        public long getNativeObjAddr()
      • __fromPtr__

        public static Tokenizer __fromPtr__(long addr)
      • load

        public static Tokenizer load(java.lang.String model_config)
        Load a tokenizer from a model directory. Expects the directory to contain:
        - config.json with a model_type field whose value is "gpt2" or "gpt4".
        - tokenizer.json produced by the corresponding model family.
        The argument is a path prefix; this function concatenates file names onto it directly (e.g. model_dir + "config.json"), so it must end with an appropriate path separator.
        Parameters:
        model_config - Path to the model directory, ending with a path separator.
        Returns:
        A Tokenizer ready for use. Throws cv::Exception if files are missing or model_type is unsupported.
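        The trailing-separator requirement follows from the plain string concatenation described above. A small illustration (the paths and the helper name are hypothetical, not part of the API):

        ```java
        public class PathPrefixDemo {
            // Mimics how load() builds file paths: the file name is appended
            // directly, no separator is inserted between prefix and name.
            static String configPath(String modelDir) {
                return modelDir + "config.json";
            }

            public static void main(String[] args) {
                // With the trailing separator the path is correct:
                System.out.println(configPath("/path/to/model/")); // /path/to/model/config.json
                // Without it, the last directory name and the file name fuse:
                System.out.println(configPath("/path/to/model"));  // /path/to/modelconfig.json
            }
        }
        ```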
      • encode

        public MatOfInt encode(java.lang.String text)
        Encode UTF-8 text to token ids (special tokens currently disabled). Calls the underlying CoreBPE::encode with an empty allowed-special set.
        Parameters:
        text - UTF-8 input string.
        Returns:
        Vector of token ids (32-bit ids narrowed to int for convenience).
      • decode

        public java.lang.String decode(MatOfInt tokens)
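        The contract shared by encode and decode is that decoding the ids produced by encoding recovers the original text. The real tokenizer uses BPE via CoreBPE; the toy word-level vocabulary below is illustrative only, a sketch of the round-trip contract rather than the actual algorithm.

        ```java
        import java.util.*;

        public class ToyTokenizer {
            private final Map<String, Integer> vocab = new HashMap<>();
            private final Map<Integer, String> inverse = new HashMap<>();

            // Assign ids on first sight: a stand-in for a trained vocabulary.
            private int idFor(String piece) {
                return vocab.computeIfAbsent(piece, p -> {
                    int id = vocab.size(); // next free id
                    inverse.put(id, p);
                    return id;
                });
            }

            public List<Integer> encode(String text) {
                List<Integer> ids = new ArrayList<>();
                for (String piece : text.split(" ")) ids.add(idFor(piece));
                return ids;
            }

            public String decode(List<Integer> ids) {
                StringJoiner sj = new StringJoiner(" ");
                for (int id : ids) sj.add(inverse.get(id));
                return sj.toString();
            }

            public static void main(String[] args) {
                ToyTokenizer tok = new ToyTokenizer();
                List<Integer> ids = tok.encode("hello world hello");
                System.out.println(ids);             // [0, 1, 0]
                System.out.println(tok.decode(ids)); // hello world hello
            }
        }
        ```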