Python Byte Pair Encoding

LBPE: Long-token-first Tokenization to Improve Large Language Models

Abstract: The prevalent use of Byte Pair Encoding (BPE) in Large Language Models (LLMs) facilitates robust handling of subword units and avoids issues of out-of-vocabulary words. Despite its success, ...

GitHub

shtest-encoding.py breaks with python 3.14

The llvm-21-tools package ships a test file that contains non-UTF-8 characters without an encoding declaration, causing package installation to fail with Python 3.14. This occurs during package ...

IEEE

Tibetan Speech Recognition Based on Hierarchical Transfer Strategy and Subword Modeling

Abstract: For Tibetan speech recognition, the accuracy is often limited due to several challenges, including the heavy reliance of end-to-end systems on large amounts of annotated data, inconsistent ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results

LBPE: Long-token-first Tokenization to Improve Large Language Models

shtest-encoding.py breaks with python 3.14

Tibetan Speech Recognition Based on Hierarchical Transfer Strategy and Subword Modeling

Trending now