Byte pair encoding information


Byte pair encoding[1][2] (also known as digram coding)[3] is an algorithm, first described in 1994 by Philip Gage, for encoding strings of text into tabular form for use in downstream modeling.[4] A modified version is notable as a large language model tokenizer that can combine tokens encoding single characters (including single digits or single punctuation marks) with tokens encoding whole words (even the longest compound words).[5][6][7] In its first step, this modification treats all unique characters as an initial set of 1-character-long n-grams (i.e. the initial "tokens"). Then, successively, the most frequent pair of adjacent tokens is merged into a new, longer n-gram, and all instances of the pair are replaced by this new token. In the first merge, both members of the pair are single characters, so the new token is 2 characters long; later merges can join tokens that are themselves the result of earlier merges. This is repeated until a vocabulary of the prescribed size is obtained. Note that new words can always be constructed from final vocabulary tokens and initial-set characters.[8]
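
The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer of any particular model: the function names `train_bpe` and `merge_pair` are hypothetical, the input is treated as one undivided string (real tokenizers usually pre-split text into words first), overlapping pairs are counted naively, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def merge_pair(tokens, a, b):
    """Replace every adjacent occurrence of (a, b) with the joined token a + b."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2  # skip both members of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Merge the most frequent adjacent pair until vocab_size is reached.

    Returns (vocab, merges, tokens). All names are illustrative assumptions.
    """
    tokens = list(text)          # initial tokens: single characters
    vocab = sorted(set(tokens))  # initial set of 1-character tokens
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        # Ties are resolved in first-seen order by Counter.most_common.
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        if a + b not in vocab:
            vocab.append(a + b)
        tokens = merge_pair(tokens, a, b)
    return vocab, merges, tokens


if __name__ == "__main__":
    vocab, merges, tokens = train_bpe("aaabdaaabac", vocab_size=7)
    print(vocab)
    print(merges)
    print(tokens)
```

Because every merged token is simply the concatenation of its two parts, joining the final token list always reproduces the input string, which is why unseen words can still be spelled out from vocabulary tokens and initial characters.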

All the unique tokens found in a corpus are listed in a token vocabulary, the size of which, in the case of GPT-3.5 and GPT-4, is 100256.

The modified algorithm differs from the original in one respect: the original algorithm does not merge the most frequent pair of bytes into a longer token, but replaces each occurrence of the pair with a single new byte that was not contained in the initial dataset. A lookup table of the replacements is required to rebuild the initial dataset. The algorithm is effective for tokenization because it has low computational overhead and remains consistent and reliable.
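
The original compression scheme, with its lookup table, can be sketched as follows. This is a simplified illustration under stated assumptions rather than Gage's published implementation: the names `compress` and `decompress` are hypothetical, the whole input is processed as one block, the lowest unused byte value is chosen as the replacement, and the loop stops when no pair occurs at least twice or no unused byte value remains.

```python
from collections import Counter


def compress(data: bytes):
    """Repeatedly replace the most frequent byte pair with an unused byte.

    Returns (compressed_bytes, table), where table maps each new byte
    to the pair it stands for, in the order the replacements were made.
    """
    seq = list(data)
    used = set(seq)  # byte values present in the initial data
    table = {}
    while True:
        unused = next((b for b in range(256) if b not in used), None)
        pairs = Counter(zip(seq, seq[1:]))
        if unused is None or not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # replacing a pair seen once would not save space
        table[unused] = (a, b)
        used.add(unused)
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(unused)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return bytes(seq), table


def decompress(data: bytes, table):
    """Undo the replacements in reverse order using the lookup table."""
    seq = list(data)
    for new, (a, b) in reversed(list(table.items())):
        out = []
        for x in seq:
            if x == new:
                out.extend((a, b))
            else:
                out.append(x)
        seq = out
    return bytes(seq)


if __name__ == "__main__":
    packed, table = compress(b"aaabdaaabac")
    print(packed, table)
    print(decompress(packed, table))
```

The table must accompany the compressed data, since without it the replacement bytes cannot be expanded back into the pairs they stand for.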

  1. ^ Gage, Philip (1994). "A New Algorithm for Data Compression". The C Users Journal.
  2. ^ "A New Algorithm for Data Compression". Dr. Dobb's Journal. 1 February 1994. Retrieved 10 August 2020.
  3. ^ Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1994). Managing Gigabytes. New York: Van Nostrand Reinhold. ISBN 978-0-442-01863-4.
  4. ^ "Byte Pair Encoding". Archived from the original on 2016-03-26.
  5. ^ Sennrich, Rico; Birch, Alexandra; Haddow, Barry (2015-08-31). "Neural Machine Translation of Rare Words with Subword Units". arXiv:1508.07909 [cs.CL].
  6. ^ Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini (2020-06-04). "Language Models are Few-Shot Learners". arXiv:2005.14165 [cs.CL].
  7. ^ "google/sentencepiece". Google. 2021-03-02. Retrieved 2021-03-02.
  8. ^ Paaß, Gerhard; Giesselbach, Sven (2022). "Pre-trained Language Models". Foundation Models for Natural Language Processing. Artificial Intelligence: Foundations, Theory, and Algorithms. pp. 19–78. doi:10.1007/978-3-031-23190-2_2. ISBN 9783031231902. Retrieved 3 August 2023.
