Character Encoding

Almost all files and databases contain text. A specific encoding is used to represent text data. There are two kinds of encodings: Unicode and legacy encodings.

Unicode

The most common text encoding is Unicode (Wikipedia). Unicode contains well over 100,000 characters and more than 150 scripts. The Unicode standard defines the code point of each character. However, how these code points are written as bytes depends on the encoding used. There are four commonly used encodings, all of which Soluling supports.

| Encoding | Bytes per char | Byte order | Description |
|---|---|---|---|
| UTF-8 | 1-4 | - | This is an 8-bit variable-width encoding which maximizes compatibility with ASCII. Part of the Unicode standard. |
| UTF-16 | 2 or 4 | LE/BE | This is a 16-bit, variable-width encoding. Part of the Unicode standard. |
| UTF-32 | 4 | LE/BE | This is a 32-bit, fixed-width encoding. Part of the Unicode standard. |
| GB18030 | 1, 2 or 4 | - | This is an 8-bit variable-width encoding which maximizes compatibility with GB2312 (legacy Simplified Chinese encoding). Not part of the Unicode standard but widely used in the People's Republic of China. |
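To make the "bytes per char" column concrete, the following minimal Python sketch encodes a few sample characters with each of the four encodings and counts the resulting bytes. The sample characters and the helper name `byte_lengths` are chosen for illustration only and are not part of Soluling.

```python
# Encode the same character with each encoding from the table and count bytes.
# The explicit "-le" codecs are used so that Python does not prepend a BOM.
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le", "gb18030"]
SAMPLES = ["A", "é", "中", "😀"]  # illustrative sample characters

def byte_lengths(char):
    """Return {encoding: number of bytes} for a single character."""
    return {enc: len(char.encode(enc)) for enc in ENCODINGS}

for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

A plain ASCII letter takes one byte in UTF-8 and GB18030, while a character outside the Basic Multilingual Plane, such as an emoji, takes four bytes in all four encodings.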

In addition to the above Unicode encodings, there are a few other encodings that are very seldom used, such as UTF-7 and UTF-EBCDIC. They are not part of the Unicode standard, which is why Soluling does not support them.

Legacy encodings

Before Unicode was introduced, each encoding could handle only a limited number of characters, usually those of a single script. There were encodings for Western European languages, Eastern European languages, Chinese, Japanese, Korean, etc. These encodings are all based on code pages, where one encoding implements one code page and one code page supports one script. All code pages support plain ASCII. These legacy encodings are still sometimes used, and Soluling supports them. When writing localized files or databases, Soluling must sometimes change the encoding to match the script of the target language. Learn more about output encodings.
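The one-script-per-code-page limitation can be seen with a short Python sketch. It assumes Python's built-in `cp1252` (Western European) and `cp932` (Japanese) codecs as stand-ins for legacy code pages; the helper `can_encode` and the sample strings are hypothetical names used only for this illustration.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given code page."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Illustrative sample strings, one per language.
samples = {"English": "Hello", "French": "Été", "Japanese": "日本語"}

for code_page in ("cp1252", "cp932"):
    for name, text in samples.items():
        print(code_page, name, can_encode(text, code_page))
```

Plain ASCII text fits both code pages, but the Japanese sample can only be written with the Japanese code page. This is the situation in which the output encoding has to change to match the target language.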

Byte order mark

All Unicode encodings can have a byte order mark (BOM) at the beginning of the file (Wikipedia). The purpose of the BOM is to indicate which encoding the text uses. However, a BOM is not required: some files have one and some do not. In general, it is recommended to have a BOM in a Unicode file, but some tools and platforms cannot cope with a BOM, so there are files without BOMs. The following table lists the byte order marks used.

| Encoding | Representation | Bytes as code page 1252 characters |
|---|---|---|
| UTF-8 | EF BB BF | ï»¿ |
| UTF-16LE | FF FE | ÿþ |
| UTF-16BE | FE FF | þÿ |
| UTF-32LE | FF FE 00 00 | ÿþ<null><null> |
| UTF-32BE | 00 00 FE FF | <null><null>þÿ |
| GB18030 | 84 31 95 33 | „1•3 |

Soluling can read and write files with or without BOMs. If the original file does not contain a BOM, Soluling might detect the encoding incorrectly. In that case, you have to manually choose the encoding from the File sheet of the source dialog. Learn more about output BOMs.
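BOM-based detection can be sketched in a few lines of Python. This is only an illustration of the general technique, not how Soluling implements it; the names `sniff_bom` and `decode_with_bom` and the UTF-8 fallback are assumptions made for this example.

```python
import codecs

# Longer BOMs are checked first because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (b"\x84\x31\x95\x33", "gb18030"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if data has no BOM."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0

def decode_with_bom(data, fallback="utf-8"):
    """Decode bytes using the BOM if present, else the assumed fallback."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)

print(sniff_bom(b"\xef\xbb\xbfhello"))
print(decode_with_bom(b"\xff\xfeA\x00"))
```

When no BOM is found, the sketch simply falls back to a guess, which is why detection without a BOM can go wrong and a manual encoding choice is sometimes needed. A UTF-16LE file whose first character after the BOM is U+0000 would be misread as UTF-32LE by this check; that case is rare in text files.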

New line

Text files can have different newline marks (Wikipedia). Usually, a combination of the carriage return (CR, \r, 0x0D, 13 in decimal) and line feed (LF, \n, 0x0A, 10 in decimal) characters is used. Possible combinations are:

| New line | Description |
|---|---|
| CR+LF | Used in Windows, DOS and some other legacy operating systems |
| LF+CR | Hardly ever used |
| LF | Used in Unix, Linux, Android, macOS, and iOS |
| CR | Used in classic Mac OS (before Mac OS X) and some other legacy operating systems |

When Soluling creates a localized text file, it always uses the same newline characters as the original file. Unlike the character encoding and BOM, you cannot change the newline characters of the localized files. This is because the localized files are supposed to be used on the same system or platform as the original file, so it is important that the files use the same newline characters.
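The idea of carrying the original newline style over to the localized file can be sketched as follows. This is a simplified illustration rather than Soluling's actual logic: the functions `detect_newline` and `convert_newlines` and the sample strings are hypothetical, and the rare LF+CR combination is not handled.

```python
def detect_newline(text):
    """Return the most frequent newline style in text, or None if it has none."""
    crlf = text.count("\r\n")
    counts = {
        "\r\n": crlf,
        "\n": text.count("\n") - crlf,  # bare LF only
        "\r": text.count("\r") - crlf,  # bare CR only
    }
    newline, count = max(counts.items(), key=lambda item: item[1])
    return newline if count else None

def convert_newlines(text, newline):
    """Normalize every newline to LF, then switch to the requested style."""
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", newline)

# Illustrative data: the original uses CR+LF, the translation was produced with LF.
original = "key=value\r\nother=text\r\n"
translated = "key=arvo\nother=teksti\n"
localized = convert_newlines(translated, detect_newline(original) or "\n")
print(repr(localized))
```

When reading real files for this purpose, open them with `open(path, newline="")` so that Python does not translate the newline characters before they can be inspected.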

Text editors

The standard text editor of Windows, Notepad, can edit ANSI, UTF-8, UTF-16LE and UTF-16BE files. However, when working with ANSI files, Notepad always uses the current system code page of your Windows. For example, if you have English Windows, then Notepad uses code page 1252, so you cannot edit, for example, Japanese ANSI files. When working with Unicode files, Notepad always writes a BOM even if the file did not originally contain one. Notepad is suitable for Windows-related Unicode files that use a BOM (most files are like this).

However, if you need to edit UTF-32 or GB18030 files or want more control over BOMs, you are better off using BabelPad (Unicode Text Editor). It can edit almost all Unicode files with or without BOMs. BabelPad is free software.
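The code page limitation described for Notepad can be reproduced with a small Python sketch. It assumes `cp932` as the Japanese ANSI code page and `cp1252` as the code page of an English Windows; the variable names and the sample text are for illustration only.

```python
text = "日本語"  # illustrative Japanese sample text

# Bytes as they would be stored in a Japanese ANSI (code page 932) file.
ansi_bytes = text.encode("cp932")

# An editor that assumes code page 1252 interprets the same bytes differently.
# errors="replace" guards against bytes that code page 1252 leaves undefined.
as_western = ansi_bytes.decode("cp1252", errors="replace")

# Decoding with the code page the file was written in restores the text.
as_japanese = ansi_bytes.decode("cp932")

print(as_western)
print(as_japanese)
```

The bytes are identical in both cases; only the code page used to interpret them differs, which is why an ANSI file has to be opened with the code page it was written in.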