Class Tokenizer


  • public class Tokenizer
    extends java.lang.Object
    High-level tokenizer wrapper for DNN usage. Provides a simple API to encode and decode tokens for LLMs. Models are loaded via Tokenizer::load().

        using namespace cv::dnn;
        Tokenizer tok = Tokenizer::load("/path/to/model/");
        std::vector<int> ids = tok.encode("hello world");
        std::string text = tok.decode(ids);
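    The C++ snippet above translates to the Java API roughly as follows. This is a minimal sketch assuming the class lives in org.opencv.dnn like the other DNN bindings and that the OpenCV native library is available; the model path is a placeholder.

    ```java
    import org.opencv.core.Core;
    import org.opencv.core.MatOfInt;
    import org.opencv.dnn.Tokenizer;

    public class TokenizerUsage {
        public static void main(String[] args) {
            // The native library must be loaded before any binding is used.
            System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

            // Placeholder path; must end with a separator (see load() below).
            Tokenizer tok = Tokenizer.load("/path/to/model/");
            MatOfInt ids = tok.encode("hello world");
            String text = tok.decode(ids);
            System.out.println(text);
        }
    }
    ```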
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected long nativeObj  
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      protected Tokenizer(long addr)  
    • Field Detail

      • nativeObj

        protected final long nativeObj
    • Constructor Detail

      • Tokenizer

        protected Tokenizer(long addr)
    • Method Detail

      • getNativeObjAddr

        public long getNativeObjAddr()
      • __fromPtr__

        public static Tokenizer __fromPtr__(long addr)
      • load

        public static Tokenizer load(java.lang.String model_config)
        Load a tokenizer from a model directory. Expects the directory to contain:
        - config.json with a model_type field whose value is "gpt2" or "gpt4".
        - tokenizer.json produced by the corresponding model family.
        The argument is a path prefix; this function concatenates file names onto it directly (e.g. model_dir + "config.json"), so it must end with an appropriate path separator.
        Parameters:
        model_config - Path to the model directory, ending with a path separator.
        Returns:
        A Tokenizer ready for use. Throws cv::Exception if files are missing or model_type is unsupported.
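        The trailing-separator requirement follows from the plain string concatenation described above. A small illustration (the paths and the helper name are hypothetical, not part of the API):

        ```java
        public class PathPrefixDemo {
            // Mimics how load() builds file paths: the file name is appended
            // directly, no separator is inserted between prefix and name.
            static String configPath(String modelDir) {
                return modelDir + "config.json";
            }

            public static void main(String[] args) {
                // With the trailing separator the path is correct:
                System.out.println(configPath("/path/to/model/")); // /path/to/model/config.json
                // Without it, the last directory name and the file name fuse:
                System.out.println(configPath("/path/to/model"));  // /path/to/modelconfig.json
            }
        }
        ```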
      • encode

        public MatOfInt encode(java.lang.String text)
        Encode UTF-8 text to token ids (special tokens currently disabled). Calls the underlying CoreBPE::encode with an empty allowed-special set.
        Parameters:
        text - UTF-8 input string.
        Returns:
        Vector of token ids (32-bit ids narrowed to int for convenience).
      • decode

        public java.lang.String decode(MatOfInt tokens)
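        The contract shared by encode and decode is that decoding the ids produced by encoding recovers the original text. The real tokenizer uses BPE via CoreBPE; the toy word-level vocabulary below is illustrative only, a sketch of the round-trip contract rather than the actual algorithm.

        ```java
        import java.util.*;

        public class ToyTokenizer {
            private final Map<String, Integer> vocab = new HashMap<>();
            private final Map<Integer, String> inverse = new HashMap<>();

            // Assign ids on first sight: a stand-in for a trained vocabulary.
            private int idFor(String piece) {
                return vocab.computeIfAbsent(piece, p -> {
                    int id = vocab.size(); // next free id
                    inverse.put(id, p);
                    return id;
                });
            }

            public List<Integer> encode(String text) {
                List<Integer> ids = new ArrayList<>();
                for (String piece : text.split(" ")) ids.add(idFor(piece));
                return ids;
            }

            public String decode(List<Integer> ids) {
                StringJoiner sj = new StringJoiner(" ");
                for (int id : ids) sj.add(inverse.get(id));
                return sj.toString();
            }

            public static void main(String[] args) {
                ToyTokenizer tok = new ToyTokenizer();
                List<Integer> ids = tok.encode("hello world hello");
                System.out.println(ids);             // [0, 1, 0]
                System.out.println(tok.decode(ids)); // hello world hello
            }
        }
        ```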