Charset detection


Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique is recognised to be unreliable and is used only when specific metadata, such as an HTTP Content-Type: header, is either unavailable or assumed to be untrustworthy.

Detection usually involves statistical analysis of byte patterns, such as the frequency distribution of trigraphs (three-character sequences) of various languages encoded in each code page to be detected; the same statistical analysis can also be used to perform language detection. The process is not foolproof because it depends on statistical data, and short or atypical text may not match the expected distribution.
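The statistical approach described above can be sketched as follows. This is a minimal illustration, not how any particular detector is implemented: the function names, the one-sentence training text, and the overlap score are all assumptions made for the example, and real detectors build their reference profiles from large corpora.

```python
from collections import Counter
from typing import Mapping


def trigram_profile(data: bytes) -> Counter:
    """Count the overlapping 3-byte sequences that occur in data."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def similarity(profile: Counter, reference: Counter) -> float:
    """Fraction of trigram occurrences in profile that also occur in reference."""
    total = sum(profile.values())
    if total == 0:
        return 0.0
    shared = sum(n for tri, n in profile.items() if tri in reference)
    return shared / total


def guess_encoding(data: bytes, references: Mapping[str, Counter]) -> str:
    """Return the name of the reference profile that best matches data."""
    sample = trigram_profile(data)
    return max(references, key=lambda name: similarity(sample, references[name]))


# Toy training text (hypothetical); the same sentence is encoded once per
# candidate encoding to build one byte-trigram profile for each.
TRAINING_TEXT = "Die Straße in München ist schön, für Gäste geöffnet und größer als früher."
REFERENCES = {
    enc: trigram_profile(TRAINING_TEXT.encode(enc))
    for enc in ("iso-8859-1", "utf-8")
}

unknown = "schöne Grüße aus München".encode("iso-8859-1")
print(guess_encoding(unknown, REFERENCES))
```

With so little training data the score separates the two candidates only because the accented letters produce different byte trigrams in each encoding; pure ASCII input would score identically against both profiles.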

In general, incorrect charset detection leads to mojibake.

One of the few cases where charset detection works reliably is detecting UTF-8. This is because a large percentage of possible byte sequences are invalid in UTF-8, so text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test. However, badly written charset detection routines do not run the reliable UTF-8 test first, and may decide that UTF-8 text is in some other encoding. For example, it was common for web sites in UTF-8 containing the name of the German city München to be shown as MÃ¼nchen, because the code decided the text was in an ISO-8859 encoding before (or without) even testing whether it was UTF-8.
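A minimal sketch of running the UTF-8 validity test first is shown below. The helper names and the choice of ISO-8859-1 as the fallback are assumptions for illustration; the validity test itself is simply a strict decode using Python's built-in codec.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict decoding raises on any invalid sequence
    except UnicodeDecodeError:
        return False
    return True


def decode_utf8_first(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Try UTF-8 before falling back to a legacy encoding (assumed fallback)."""
    if is_valid_utf8(data):
        return data.decode("utf-8")
    return data.decode(fallback)


utf8_bytes = "München".encode("utf-8")

# Skipping the UTF-8 test and assuming ISO-8859-1 reproduces the mojibake.
mojibake = utf8_bytes.decode("iso-8859-1")  # 'MÃ¼nchen'

# Testing UTF-8 first recovers the intended text.
correct = decode_utf8_first(utf8_bytes)     # 'München'
```

The same function still handles genuine ISO-8859-1 input, because the single byte 0xFC used for "ü" in that encoding is never valid in UTF-8 and so fails the first test.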

UTF-16 is fairly reliable to detect because of the high number of newlines (U+000A) and spaces (U+0020) that should be found when dividing the data into 16-bit words, and the large number of NUL bytes, all at even or all at odd locations. Common characters must be checked for; relying only on a test that the text is valid UTF-16 fails. For example, the Windows operating system would mis-detect the phrase "Bush hid the facts" (without a newline) in ASCII as Chinese UTF-16LE, since all the byte pairs matched assigned Unicode characters in UTF-16LE.
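The difference between a validity test and a check for common characters can be sketched as follows. The scoring rule (count the 16-bit words equal to U+000A or U+0020, plus the NUL bytes in the high-byte position) and the function names are assumptions for this example, not the logic of the Windows function.

```python
from typing import Optional

COMMON = (0x000A, 0x0020)  # newline and space


def count_common(data: bytes, byteorder: str) -> int:
    """Count 16-bit words equal to U+000A or U+0020 for the given byte order."""
    words = (
        int.from_bytes(data[i:i + 2], byteorder)
        for i in range(0, len(data) - 1, 2)
    )
    return sum(1 for word in words if word in COMMON)


def guess_utf16(data: bytes) -> Optional[str]:
    """Return 'utf-16-le', 'utf-16-be', or None if there is no positive evidence."""
    # For text in the ASCII range, the NUL high byte sits at odd offsets in
    # little-endian data and at even offsets in big-endian data.
    le_score = count_common(data, "little") + data[1::2].count(0)
    be_score = count_common(data, "big") + data[0::2].count(0)
    if le_score == 0 and be_score == 0:
        return None
    return "utf-16-le" if le_score >= be_score else "utf-16-be"


ascii_bytes = b"Bush hid the facts"

# A validity test alone is fooled: the 18 ASCII bytes decode without error
# as 9 UTF-16LE code units.
as_utf16 = ascii_bytes.decode("utf-16-le")

# The common-character check finds no spaces, newlines, or NUL bytes.
print(guess_utf16(ascii_bytes))
print(guess_utf16("Bush hid the facts".encode("utf-16-le")))
```

Returning None when both scores are zero is the important design choice here: absence of evidence is reported as "not UTF-16" rather than resolved by the validity test.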

Charset detection is particularly unreliable in Europe, in an environment of mixed ISO-8859 encodings. These are closely related eight-bit encodings whose lower half overlaps with ASCII, and in which all arrangements of bytes are valid. There is no technical way to tell these encodings apart, so recognizing them relies on identifying language features, such as letter frequencies or spellings.
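A short illustration of why a validity test cannot separate these encodings: the same bytes decode without error under several ISO-8859 parts, each giving different text. The set of candidate encodings below is an arbitrary choice for the example.

```python
# Four ISO-8859 parts picked for illustration (an assumption of this sketch).
CANDIDATES = ("iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-15")


def decode_all(data: bytes) -> dict:
    """Decode data under every candidate; none of these decodes can fail."""
    return {enc: data.decode(enc) for enc in CANDIDATES}


# Two high-bit bytes produce different, equally "valid" text in each encoding.
readings = decode_all(b"\xe4\xa4")
for enc, text in readings.items():
    print(enc, [f"U+{ord(ch):04X}" for ch in text])

# Every possible byte value decodes under every candidate.
all_bytes = decode_all(bytes(range(256)))
```

Because no decode ever raises, the only remaining signal is whether the resulting text looks like plausible language, which is where letter frequencies and spellings come in.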

Due to the unreliability of heuristic detection, it is better to properly label datasets with the correct encoding (see "Specifying the document's character encoding" in Character encodings in HTML). Even though UTF-8 and UTF-16 are easy to detect, some systems require documents in UTF encodings to carry an explicit label in the form of a prefixed byte order mark (BOM).
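Reading such a label is straightforward. The sketch below checks for a BOM using the constants from Python's standard codecs module; the function names, the returned tuple, and the UTF-8 default are assumptions made for the example.

```python
import codecs
from typing import Optional, Tuple

# UTF-32 is checked before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data: bytes) -> Optional[Tuple[str, int]]:
    """Return (encoding name, BOM length) if data starts with a known BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode using the BOM if present, otherwise the assumed default."""
    hit = sniff_bom(data)
    if hit is None:
        return data.decode(default)
    name, skip = hit
    return data[skip:].decode(name)


labelled = codecs.BOM_UTF16_LE + "München".encode("utf-16-le")
print(sniff_bom(labelled))
print(decode_with_bom(labelled))
```

One known ambiguity remains in this ordering: UTF-16LE text whose first character after the BOM is U+0000 would be read as UTF-32LE.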
