For a long time computers had a simple idea of a character: each octet (8-bit byte) of text contained one character. This meant an application could only use 256 characters at once. The first 128 characters (0 to 127) on Unix and similar systems usually corresponded to the ASCII character set, as they still do. So all other possibilities had to be crammed into the remaining 128. This was done by picking the appropriate character set for the use you were making. For example, ISO 8859 specified a set of extensions to ASCII for various alphabets.
This was fine for simple extensions and for alphabets close enough to the Latin one (with no more than a few dozen alphabetic characters), but useless for more complex writing systems. Also, having a different character set for each language is inconvenient: you have to start a new terminal to run the shell with each character set. So the character set had to be extended. To cut a long story short, the world has mostly standardised on a character set called Unicode, related to the international standard ISO 10646. The intention is that this will contain every single character used in all the languages of the world.
This has far too many characters to fit into a single octet. What's more, UNIX utilities such as zsh are so used to dealing with ASCII that removing it would cause no end of trouble. So what happens is this: the 128 ASCII characters are kept exactly the same (and they're the same as the first 128 characters of Unicode), but the remaining 128 characters are used to build up any other Unicode character by combining multiple octets together. The shell doesn't need to interpret these directly; it just needs to ask the system library how many octets form the next character, and if there's a valid character there at all. (It can also ask the system what width the character takes up on the screen, so that characters no longer need to be exactly one position wide.)
The way this is done is called UTF-8. Multibyte encodings of other character sets exist (you might encounter them for Asian character sets); zsh will be able to use any such encoding as long as it contains ASCII as a single-octet subset and the system can provide information about other characters. However, in the case of Unicode, UTF-8 is the only one you are likely to encounter that is useful in zsh.
(In case you're confused: Unicode is the character set, while UTF-8 is an encoding of it. You might hear about other encodings, such as UCS-2 and UCS-4, which are basically the character's index in the character set as a two-octet or four-octet integer respectively. You might see files encoded this way, for example on Windows, but the shell can't deal directly with text in those formats.)
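You can see the multi-octet encoding at work by counting octets with standard tools (a quick sketch; the octal escapes below are simply the UTF-8 octets of the characters named in the comments):

```shell
# ASCII characters occupy one octet each in UTF-8;
# other characters are built from several octets.
printf 'a' | wc -c             # 1 octet
printf '\303\251' | wc -c      # e with acute accent: 2 octets
printf '\342\202\254' | wc -c  # euro sign (U+20AC): 3 octets
```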
Until version 4.3, zsh didn't handle multibyte input properly at all. Each octet in a multibyte character would look to the shell like a separate character. If your terminal handled the character set, characters might appear correct on screen, but trying to edit them would cause all sorts of odd effects. (It was possible to edit in zsh using single-byte extensions of ASCII such as the ISO 8859 family, however.)
From version 4.3.4, multibyte input is handled in the line editor if zsh
has been compiled with the appropriate definitions, and is automatically
activated. This is indicated by the option
MULTIBYTE, which is
set by default on shells that support multibyte mode. Hence you
can test this with a standard option test: `[[ -o multibyte ]]'.
The MULTIBYTE option affects the entire shell: parameter expansion,
pattern matching, etc. treat each valid multibyte character sequence
as a single character. You can unset the option locally in a function
to revert to single-byte operation.
Note that if the shell is emulating a Bourne shell the
option is unset by default. This allows various POSIX modes to
work normally (POSIX does not deal with multibyte characters). If
you use a "sh" or "ksh" emulation interactively you should probably
turn the MULTIBYTE option back on.
The other option that affects multibyte support is COMBINING_CHARS,
new in version 4.3.9. When this is set, any zero-length punctuation
characters that follow an alphanumeric character (the base character) are
assumed to be modifications (accents etc.) to the base character and to
be displayed within the same screen area as the base character. As not
all terminals handle this, even if they correctly display the base
multibyte character, this option is not on by default. Recent versions
of the KDE and GNOME terminal emulators, konsole and
gnome-terminal, as well as
rxvt-unicode, and the Unicode version of xterm (run as
`xterm -u8', or via the front-end
uxterm), are known to handle combining characters.
The COMBINING_CHARS option only affects output; combining characters
may always be input, but when the option is off they will be displayed
specially. By default this is as a code point (the index of the
character in the character set) between angle brackets, usually
in inverse video. Highlighting of such special characters can
be modified using the new array parameter zle_highlight.
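For example, a line like the following in a startup file changes that highlighting to bold red text (a sketch: the `special' context of the zle_highlight array is the one controlling these characters; see the zshzle manual):

```shell
# Display otherwise unprintable special characters in bold red
# rather than the default inverse video.
zle_highlight=(special:fg=red,bold)
```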
Once you have a version of zsh with multibyte support, you need to ensure the environment is correct. We'll assume you're using UTF-8. Many modern systems may come set up correctly already. Try one of the editing widgets described in the next section to see.
There are basically three components.
The first is the environment variable LANG (there are others, but this is the one to start with). You need to find a locale whose name contains UTF-8. This will be a variant of your usual locale, which typically indicates the language and country; for example, mine is en_GB.UTF-8. Luckily, zsh can complete locale names, so if you have the new completion system loaded you can type `export LANG=' and attempt to complete a suitable locale. It's the locale that tells the shell to expect the right form of multibyte input. (However, there's no guarantee that the shell is actually going to get this input: for example, if you edit file names that have been created using a different character set it won't work properly.)
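As a sketch (assuming a GNU-style `locale' command; the locale name is the one from my system above, and yours may differ), you can list the candidates and set LANG in a startup file such as ~/.zshrc:

```shell
# List locales advertising UTF-8 support.
locale -a | grep -i 'utf-*8' || echo 'no UTF-8 locales found'
# Pick one of them; this tells zsh (and other programs) to expect UTF-8.
export LANG=en_GB.UTF-8
# If the locale took effect, this reports UTF-8.
locale charmap
```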
The second component is the terminal emulator. Modern terminal emulators, such as gnome-terminal, are likely to have extensive support for localization and may work correctly as soon as they know the locale. You can enable UTF-8 support for xterm in its application defaults file. The following are the relevant resources; you don't actually need all of them, as described below. If you use a ~/.Xresources file for setting resources, prefix each of the lines with xterm:
  *wideChars: true
  *locale: true
  *utf8: 1
  *vt100Graphics: true

This turns on support for wide characters (this is enabled by the utf8 resource, too); enables conversions to UTF-8 from other locales (this is the key resource and actually overrides utf8); turns on UTF-8 mode (this resource is mostly used to force use of UTF-8 characters if your locale system isn't up to it); and allows certain graphic characters to work even with UTF-8 enabled. (Thanks to Phil Pennock for suggestions.)
The third component is the font. You should use a font whose encoding is iso10646-1 (and not, for example, iso8859-1). Not all characters will be available in any font, and some fonts may have a more restricted range of Unicode characters than others.
As mentioned in the previous section,
bindkey -m now outputs
a warning message telling you that multibyte input from the terminal
is likely not to work. (See 3.5 if you don't know what
this feature does.) If your terminal doesn't have characters
that need to be input as multibyte, however, you can still use
the meta bindings and can ignore the warning message. Use
bindkey -m 2>/dev/null to suppress it.
You might also note that the latest version of the Cygwin environment
for Windows supports UTF-8. In previous versions, zsh was able
to compile with the
MULTIBYTE option enabled, but the system
didn't provide full support for it.
Two functions are provided with zsh that help you input characters.
As with all editing widgets implemented by functions, you need to
mark the function for autoload, create the widget, and, if you are
going to use it frequently, bind it to a key sequence. For example,
the following binds insert-composed-char to F5 on my keyboard:

  autoload -Uz insert-composed-char
  zle -N insert-composed-char
  bindkey '\e[15~' insert-composed-char
The two widgets are described in the zshcontrib manual
page, but here is a brief summary:
insert-composed-char is followed by two characters that
are a mnemonic for a multibyte character. For example,
a: is a with an Umlaut;
cH is the symbol for hearts on a playing
card. Various accented characters, European and related alphabets,
and punctuation and mathematical symbols are available. The
mnemonics are mostly those given by RFC 1345.
insert-unicode-char is used to input a Unicode character by
its hexadecimal number. This is the number given in the Unicode
character charts, see for example http://www.unicode.org/charts/.
You need to execute the function, then type the hexadecimal number
(you can omit any leading zeroes), then execute the function again.
Both functions can be used without multibyte mode, provided the locale is correct and the character selected exists in the current character set; however, using UTF-8 massively extends the number of valid characters that can be produced.
If you have a recent X Window System installation, you might find
the AltGr key helps you input accented Latin characters; for
example on my keyboard AltGr-; followed by e gives
e with an acute accent.
See also http://www.cl.cam.ac.uk/~mgk25/unicode.html#input
for general information on entering Unicode characters from a keyboard.