GitHub - manakai/perl-web-encodings: Web character encodings for Perl


=head1 NAME

Web::Encoding - Web Encodings APIs

=head1 SYNOPSIS

  use Web::Encoding;
  $bytes = encode_web_utf8 $chars;
  $chars = decode_web_utf8 $bytes;

=head1 DESCRIPTION

The C<Web::Encoding> module provides a set of functions to handle
Web-compatible character encodings.

The C<perl-web-encodings> repository also contains the following
modules:

=over 4

=item L<Web::Encoding::UnivCharDet>

The universalchardet (or universal detector) implementation in Perl,
which can be used to implement HTML parsers.

=item L<Web::Encoding::Normalization>

Implementation of Unicode's string normalization algorithms, i.e. NFC,
NFD, NFKC, and NFKD.

=item L<Web::Encoding::Preload>

Preloads encoding modules and data files.

=back

=head1 FUNCTIONS

Functions described in these subsections are exported by default.

=head2 Encoding labels and properties of encodings

The following functions handle encoding labels and return properties
of encodings:

=over 4

=item $boolean = is_web_encoding_label $label

Return whether the specified label identifies an encoding for the Web
or not.  It compares labels ASCII case-insensitively.

Unlike the C<encoding_label_to_name> function, however, this function
does not ignore spaces.

For backward compatibility, there is also the C<is_encoding_label>
function which is equivalent to this function.

=item $boolean = is_zip_encoding_label $label

Return whether the specified label identifies an encoding for the ZIP
file names or not.  It compares labels ASCII case-insensitively.

Unlike the C<encoding_label_to_name> function, however, this function
does not ignore spaces.

=item $boolean = is_all_encoding_label $label

Return whether the specified label identifies an encoding for any
context supported by this module, including but not limited to Web and
ZIP, or not.  It compares labels ASCII case-insensitively.

Unlike the C<encoding_label_to_name> function, however, this function
does not ignore spaces.

=item $key = encoding_label_to_name $label

Find the encoding identified by the specified label.

The function returns the encoding key (not a name), if found, or
C<undef>.

As does the "get an encoding" steps of the Encoding Standard, this
function ignores leading and trailing spaces of labels and compares
labels ASCII case-insensitively.

=item $key = fixup_html_meta_encoding_name $key

Replace an encoding key for the purpose of an HTML character encoding
declaration, as in the "prescan a byte stream to determine its
encoding" and "change the encoding" algorithms [HTML].  The argument
must be an encoding key (not a name or label).  The function returns
an encoding key.

=item $key = get_output_encoding_key $key

Return the result of applying the steps to "get an output encoding"
[ENCODING].  The argument must be an encoding key (not a name or
label).  The function returns an encoding key.

=item $name = encoding_name_to_compat_name $key

Convert an encoding key to its official name, as used in e.g. the
C<characterSet> or C<inputEncoding> attributes of the C<Document>
interface [ENCODING] [DOM].  The argument must be an encoding key (not
a name or label).  The function returns an encoding name.

=item $boolean = is_ascii_compat_encoding_name $key

Return whether the specified encoding is an ASCII-compatible character
encoding [ENCODING] or not.  The argument must be an encoding key (not
a name or label).

=item $key = web_locale_default_encoding_name $tag

Return the encoding key (not a name or label) of the default Web
character encoding for a locale [HTML].  If no explicit default is
known for the specified locale, C<undef> is returned.

The argument, which identifies the locale, must be either a BCP 47
language tag or the string C<*>.  The language tag must be either the
primary language tag only, or C<zh-TW>, C<zh-CN>, C<zh-HK>, C<zh-MO>,
C<zh-Hant>, C<zh-Hans>, C<sr-Latn>, or C<sr-Cyrl>; otherwise no data
is available.  The tags are ASCII case-insensitive.  If C<*> is
specified, the global default encoding that can be used when the
locale is not known or the locale has no default is returned.

For backward compatibility, there is also the
C<locale_default_encoding_name> function which is equivalent to this
function.

=item $key = zip_locale_default_encoding_name $tag

Return the encoding key (not a name or label) of the default ZIP
character encoding for a locale [HTML].  If no explicit default is
known for the specified locale, C<undef> is returned.

The argument, which identifies the locale, must be either a BCP 47
language tag or the string C<*>.  The language tag must be either the
primary language tag only, or C<zh-TW>, C<zh-CN>, C<zh-HK>, C<zh-MO>,
C<zh-Hant>, C<zh-Hans>, C<sr-Latn>, or C<sr-Cyrl>; otherwise no data
is available.  The tags are ASCII case-insensitive.  If C<*> is
specified, the global default encoding that can be used when the
locale is not known or the locale has no default is returned.

=back
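
Because the locale lookup returns C<undef> when no explicit default is
known, callers typically need a fallback chain.  The following is a
minimal sketch, not part of the module: the helper name
C<default_web_encoding_key> is hypothetical, and retrying with the
primary language subtag is an assumption about how a caller might
narrow a longer tag such as C<ja-JP>.

  use strict;
  use warnings;
  use Web::Encoding;

  # Hypothetical helper (not provided by Web::Encoding).
  sub default_web_encoding_key {
    my $tag = shift;

    # 1. Try the tag as given (e.g. "zh-TW", "sr-Latn", "ja").
    my $key = web_locale_default_encoding_name ($tag);
    return $key if defined $key;

    # 2. Assumption: retry with the primary language subtag only.
    my ($primary) = split /-/, $tag;
    if (defined $primary and $primary ne $tag) {
      $key = web_locale_default_encoding_name ($primary);
      return $key if defined $key;
    }

    # 3. Fall back to the global default, identified by "*".
    return web_locale_default_encoding_name ('*');
  }

  my $default_key = default_web_encoding_key ('ja-JP');

The same pattern should apply to C<zip_locale_default_encoding_name>,
which takes the same kind of argument.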

For the purpose of this module, the B<key> of the encoding is a short
string uniquely identifying the encoding.  It is a lowercased variant
of the encoding name.

Note that the encoding names in the Encoding Standard are not
compatible with Perl L<Encode> module's encoding names.
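
The functions in this subsection are typically chained: a label is
resolved to a key, and the key is then used to query properties.  The
sketch below illustrates that flow under stated assumptions; the helper
name C<describe_encoding_label> and the shape of the returned hash
reference are hypothetical and not part of the module.

  use strict;
  use warnings;
  use Web::Encoding;

  # Hypothetical helper (not provided by Web::Encoding).
  sub describe_encoding_label {
    my $label = shift;

    # Leading/trailing spaces and ASCII case are ignored by this lookup.
    my $key = encoding_label_to_name ($label);
    return undef unless defined $key;

    # All of the following take an encoding key, not a label or name.
    my $name       = encoding_name_to_compat_name ($key);
    my $ascii      = is_ascii_compat_encoding_name ($key);
    my $output_key = get_output_encoding_key ($key);
    my $meta_key   = fixup_html_meta_encoding_name ($key);

    return {
      key          => $key,
      name         => $name,
      ascii_compat => $ascii ? 1 : 0,
      output_key   => $output_key,
      meta_key     => $meta_key,
    };
  }

  my $info = describe_encoding_label (' Shift_JIS ');
  print "$info->{key} / $info->{name}\n" if defined $info;

Note that the C<is_*_encoding_label> functions do not ignore spaces,
so a label taken from untrusted input may need trimming before it is
passed to them.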

=head2 Encoders and decoders

The following functions encode and decode strings:

=over 4

=item $bytes = encode_web_utf8 $chars

Encode the character string in UTF-8 and return the encoded bytes.

This function can be used to implement the "UTF-8 encode" operation of
the Encoding Standard.

=item $chars = decode_web_utf8 $bytes

Decode the bytes as UTF-8 and return the decoded character string.
Bad bytes are replaced by U+FFFD characters; decoding never fails.

This function can be used to implement the "UTF-8 decode" operation of
the Encoding Standard.

=item $chars = decode_web_utf8_no_bom $bytes

Decode the bytes as UTF-8, not recognizing BOM, and return the
decoded character string.  Bad bytes are replaced by U+FFFD
characters; decoding never fails.

This function can be used to implement the "UTF-8 decode without BOM"
operation of the Encoding Standard.

=item $bytes = encode_web_charset $key, $chars

Encode the character string and return the encoded bytes.

The first argument must be the key of the encoding used to encode the
string.

Any character not representable in the encoding is converted to an
HTML decimal character reference for the character.

This function can be used to implement the "encode" operation with
error mode C<html> of the Encoding Standard and its obsolete UTF-16
encoder [ENCODING16].

=item $chars = decode_web_charset $key, $bytes

Decode the bytes and return the decoded character string.

The first argument must be the key of the encoding used to decode the
bytes.

Bad bytes are replaced by U+FFFD characters; decoding never fails.

This function is equivalent to the following code using
L<Web::Encoding::Decoder>:

  $decoder = Web::Encoding::Decoder->new_from_encoding_key ($key);
  $decoder->ignore_bom (1);
  return join '', @{$decoder->bytes ($bytes)}, @{$decoder->eof};

=item [$name, $name, ...] = encoding_names

Return the list of the encoding keys (i.e. the lowercase variants of
the encoding names), as an array reference.

=back
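
As an illustration of the functions above, the following sketch
round-trips a string through UTF-8, decodes a malformed byte string,
and encodes into a legacy encoding addressed by key.  The sample
strings are arbitrary, and C<windows-1252> is assumed to be the key of
the windows-1252 encoding (the lowercased encoding name).

  use strict;
  use warnings;
  use Web::Encoding;

  # A character string: "cafe" with U+00E9, a space, and U+3042.
  my $chars = "caf\x{E9} \x{3042}";

  # UTF-8 round trip.
  my $utf8 = encode_web_utf8 ($chars);
  my $back = decode_web_utf8 ($utf8);

  # Malformed input does not die; the bad byte comes back as U+FFFD.
  my $lossy = decode_web_utf8 ("abc\xFFdef");

  # Legacy encoding, addressed by key.  U+3042 is not representable in
  # windows-1252, so it should become a decimal character reference.
  my $legacy  = encode_web_charset ('windows-1252', $chars);
  my $decoded = decode_web_charset ('windows-1252', $legacy);

Because unrepresentable characters are turned into character
references, C<$decoded> is not expected to equal C<$chars> here; the
round trip is lossless only for characters the encoding can represent.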

=head1 SUPPORTED ENCODINGS

The following encodings from the Encoding Standard are supported:
UTF-8
IBM866
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-8-I
ISO-8859-10
ISO-8859-13
ISO-8859-14
ISO-8859-15
ISO-8859-16
KOI8-R
KOI8-U
macintosh
windows-874
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
x-mac-cyrillic
gb18030
GBK
Big5
EUC-JP
ISO-2022-JP
Shift_JIS
EUC-KR
x-user-defined
UTF-16BE
UTF-16LE
replacement

Please note that C<ascii> is a label of the windows-1252 encoding.

In addition, the following encodings required by Web and ZIP
compatibility are also supported:
armscii-8 georgian-academy georgian-ps viscii x-viet-vni x-viet-vps
x-viet-tcvn ibm437 ibm737 ibm775 ibm850 ibm852 ibm855 ibm857 ibm862
ibm865 x-mns4330

This module also adds labels to encodings defined by the Encoding
Standard.

These are willful violations of the Encoding Standard and the HTML
Standard, for compatibility with existing Web content.  See
<https://github.com/manakai/data-web-defs> for the augmented list of
encodings and labels.

The C<is_web_encoding_label> and C<is_zip_encoding_label> functions
can be used to distinguish Web-only, ZIP-only, and common encodings.
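
One way to make that distinction is sketched below.  The helper name
C<encoding_label_context> and the strings it returns are hypothetical;
only the three C<is_*_encoding_label> functions come from this module.

  use strict;
  use warnings;
  use Web::Encoding;

  # Hypothetical helper (not provided by Web::Encoding).
  sub encoding_label_context {
    my $label = shift;

    # These functions compare ASCII case-insensitively but do not
    # ignore surrounding spaces.
    my $web = is_web_encoding_label ($label);
    my $zip = is_zip_encoding_label ($label);

    return 'common'   if $web and $zip;
    return 'web-only' if $web;
    return 'zip-only' if $zip;

    # Known to the module in some other context, or not known at all.
    return is_all_encoding_label ($label) ? 'other' : 'unknown';
  }

  print encoding_label_context ('ibm437'), "\n";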

Though supported by L<Web::Encoding::UnivCharDet>, the following
encodings are not supported by this module at the moment:
tscii tab iso-2022-cn iso-2022-kr hz-gb-2312 x-euc-tw johab utf-32be
utf-32le x-iso-10646-ucs-4-3412 x-iso-10646-ucs-4-2143 tam x-mac-ce

=head1 SPECIFICATIONS

=over 4

=item ENCODING

Encoding Standard <https://encoding.spec.whatwg.org/>.

=item ENCODING16

UTF-16 encoder
<https://github.com/whatwg/encoding/commit/8360f775c8df145f649047c7d59c5ff733ade112>.

=item HTML

HTML Standard <https://html.spec.whatwg.org/>.

=item DOM

DOM Standard <https://dom.spec.whatwg.org/>.

=item ENCVALID

Encoding Validation
<https://wiki.suikawiki.org/n/Encoding%20Validation>.

=back

=head1 DEPENDENCY

The module requires Perl 5.8 or later.

=head1 AUTHOR

Wakaba <wakaba@suikawiki.org>.

=head1 LICENSE

Copyright 2011-2025 Wakaba <wakaba@suikawiki.org>.

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

=cut