UTF-8, Perl and You
By Rafael Almeria
Chapter 1: Introduction
1 - Introduction


This talk does not deal with the motivation for using UTF-8.

 This talk is about:
   Implementation details.
   Understanding UTF-8.
   Converting your data.
   Knowing how to fix common problems.

 Some assumptions:
   Language: Perl
   Operating system: Unix
   Input encoded as: ASCII, ISO-8859-1/Latin-1 or Windows-1252
   Output encoded as: UTF-8

 What we’ll cover in this talk:
   A primer on character encoding
   A simplifying principle
   UTF-8
   Perl & UTF-8
   Making the Browser Happy
   Encoding Hell
Chapter 2: A Very Brief Primer on Character Encoding
2 - A Very Brief Primer on Character Encoding



What is a character encoding?

It’s a specific way to represent the characters in a given character set.

A character set may have a numerical ordering on it for use with a given character encoding.

The number given to a specific character in an ordered character set is its code point.



Do not confuse the character’s code point with its representation!

The two may be the same for ASCII, ISO-8859-1 and Windows-1252, and they may be the same for 1-byte UTF-8, but they are definitely not the same for multi-byte UTF-8.

Mixing them up is a common problem. So don’t confuse them!
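
To make the distinction concrete, here is a minimal sketch that prints the code point of “ñ” next to its byte representation in three encodings. The helper name hex_bytes is an assumption made for this illustration; encode comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The character "ñ" has code point 0xF1.
my $char = "\x{f1}";

# hex_bytes is a hypothetical helper: it returns the encoded bytes of
# $char as a hex string such as "c3 b1".
sub hex_bytes {
    my ($encoding, $char) = @_;
    return join ' ', unpack '(H2)*', encode($encoding, $char);
}

printf "code point:   0x%02x\n", ord $char;                   # 0xf1
printf "ISO-8859-1:   %s\n", hex_bytes('iso-8859-1', $char);  # f1
printf "Windows-1252: %s\n", hex_bytes('cp1252', $char);      # f1
printf "UTF-8:        %s\n", hex_bytes('UTF-8', $char);       # c3 b1
```

In the two single-byte encodings the representation equals the code point; in UTF-8 the same code point becomes two different bytes.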
Chapter 3: A Simplifying Principle
3 - A Simplifying Principle
 If all of our data is encoded using only the following encodings (code point ranges are in parentheses):
   ASCII (0x00 - 0x7F)
   ISO-8859-1/Latin-1 (0x00 - 0xFF)
   Windows-1252 (0x00 - 0xFF)

and if we only care about printable content, then

ASCII ⊂ ISO-8859-1 ⊂ Windows-1252

We can treat everything as Windows-1252!

This should be OK if we are sure that every document uses one of these three encodings, even when we’re not sure which one a given document uses.
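
The principle can be checked with a short sketch. It assumes that “printable” means the byte ranges 0x20-0x7E and 0xA0-0xFF, and counts the bytes in those ranges that decode to different characters under ISO-8859-1 and Windows-1252.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Count printable ISO-8859-1 bytes that decode differently as Windows-1252.
# The two ranges below are this sketch's definition of "printable".
my $mismatches = 0;
for my $byte (0x20 .. 0x7E, 0xA0 .. 0xFF) {
    my $octet = chr $byte;
    $mismatches++ if decode('iso-8859-1', $octet) ne decode('cp1252', $octet);
}
print "mismatches: $mismatches\n";   # mismatches: 0

# The range 0x80-0x9F is where the two differ: Windows-1252 puts printable
# characters there, for example the left double quotation mark at 0x93.
printf "0x93 as Windows-1252: U+%04X\n", ord decode('cp1252', "\x93");   # U+201C
```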
Chapter 4: UTF-8. A Brave New World
4 - UTF-8. A Brave New World



UTF-8 supports every language you’ll probably ever need.

No need for Windows-1252 this and Windows-1253 that.

Its code point range is from 0x00 to 0x10FFFF.

It uses a variable-length encoding of 1 to 4 bytes per character.

1-byte UTF-8 is used for code points in the range 0x00 to 0x7F.

1-byte UTF-8 == ASCII
   MSBit is 0
   code point == representation

 Examples of 1-byte UTF-8:
   “A” -> 0100 0001
   “&” -> 0010 0110
   “5” -> 0011 0101

2-byte UTF-8 is used for code points in the range 0x0080 to 0x07FF.

2-byte UTF-8: code point != representation

The code point, written as an 11-bit number, is broken apart into two pieces. The five MSBits of the code point are assigned to the first byte and the six LSBits are assigned to the second byte.

 For the first byte of 2-byte UTF-8:
   The three MSBits are set to 110.
   The remaining bits are the five MSBits of the code point.

 For the second byte of 2-byte UTF-8:
   The two MSBits are set to 10.
   The remaining bits are the six LSBits of the code point.
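
The 2-byte rules above can be sketched with a few bit operations. The function name encode_2byte is an assumption made for this illustration; in real code Perl’s own encoding layers do this work.

```perl
use strict;
use warnings;

# encode_2byte is a hypothetical helper: it encodes a code point in the
# range 0x80..0x7FF as two UTF-8 bytes, returned as numbers.
sub encode_2byte {
    my ($cp) = @_;
    die "code point out of 2-byte range\n" if $cp < 0x80 || $cp > 0x7FF;
    my $first  = 0xC0 | ($cp >> 6);      # 110xxxxx: the five MSBits
    my $second = 0x80 | ($cp & 0x3F);    # 10xxxxxx: the six LSBits
    return ($first, $second);
}

printf "%02x %02x\n", encode_2byte(0xF1);   # c3 b1
```

So “ñ” (code point 0xF1) is represented by the two bytes C3 B1.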

3-byte UTF-8 is used for code points in the range 0x0800 to 0xFFFF.

3-byte UTF-8: code point != representation

 The code point, written as a 16-bit number, is broken apart into three pieces:
   The four MSBits of the code point are assigned to the first byte.
   The middle six bits are assigned to the second byte.
   The six LSBits are assigned to the third byte.

 For the first byte of 3-byte UTF-8:
   The four MSBits are set to 1110.
   The remaining bits are the four MSBits of the code point.

 For the second byte of 3-byte UTF-8:
   The two MSBits are set to 10.
   The remaining bits are the six middle bits of the code point.

 For the third byte of 3-byte UTF-8:
   The two MSBits are set to 10.
   The remaining bits are the six LSBits of the code point.
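
The same idea extends to three bytes. This is a sketch under the same assumptions as before (encode_3byte is a made-up name), and it does not reject the surrogate range 0xD800-0xDFFF, which a real encoder has to refuse.

```perl
use strict;
use warnings;

# encode_3byte is a hypothetical helper: it encodes a code point in the
# range 0x800..0xFFFF as three UTF-8 bytes, returned as numbers.
# Simplification: surrogates (0xD800..0xDFFF) are not rejected here.
sub encode_3byte {
    my ($cp) = @_;
    die "code point out of 3-byte range\n" if $cp < 0x800 || $cp > 0xFFFF;
    return (
        0xE0 | ($cp >> 12),             # 1110xxxx: the four MSBits
        0x80 | (($cp >> 6) & 0x3F),     # 10xxxxxx: the six middle bits
        0x80 | ($cp & 0x3F),            # 10xxxxxx: the six LSBits
    );
}

printf "%02x %02x %02x\n", encode_3byte(0x20AC);   # e2 82 ac
```

The euro sign (code point 0x20AC) therefore becomes the three bytes E2 82 AC.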

4-byte UTF-8 is used for code points in the range 0x10000 to 0x10FFFF.

4-byte UTF-8: code point != representation

 The code point, written as a 21-bit number, is broken apart into four pieces:
   The three MSBits of the code point are assigned to the first byte.
   The next six bits are assigned to the second byte.
   The six bits after those are assigned to the third byte.
   The six LSBits are assigned to the fourth byte.

 For the first byte of 4-byte UTF-8:
   The five MSBits are set to 11110.
   The remaining bits are the three MSBits of the code point.

 For the second byte of 4-byte UTF-8:
   The two MSBits are set to 10.
   The remaining bits are the next six bits of the code point.

 For the third byte of 4-byte UTF-8:
   The two MSBits are set to 10.
   The remaining bits are the six bits after those.

 For the fourth byte of 4-byte UTF-8:
   The two MSBits are set to 10.
   The remaining bits are the six LSBits of the code point.
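
As a final sketch, the 4-byte case can be written by hand and compared against Perl’s own encoder. The name encode_4byte and the sample code point 0x1F600 are choices made for this illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# encode_4byte is a hypothetical helper: it encodes a code point in the
# range 0x10000..0x10FFFF as four UTF-8 bytes, returned as numbers.
sub encode_4byte {
    my ($cp) = @_;
    die "code point out of 4-byte range\n" if $cp < 0x10000 || $cp > 0x10FFFF;
    return (
        0xF0 | ($cp >> 18),              # 11110xxx: the three MSBits
        0x80 | (($cp >> 12) & 0x3F),     # 10xxxxxx: the next six bits
        0x80 | (($cp >> 6) & 0x3F),      # 10xxxxxx: the six bits after those
        0x80 | ($cp & 0x3F),             # 10xxxxxx: the six LSBits
    );
}

my $cp    = 0x1F600;
my $mine  = pack 'C*', encode_4byte($cp);     # hand-made bytes
my $perls = encode('UTF-8', chr $cp);         # bytes from the Encode module

printf "%s\n", join ' ', unpack '(H2)*', $mine;              # f0 9f 98 80
print $mine eq $perls ? "matches Encode\n" : "differs\n";    # matches Encode
```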
Chapter 5: Perl & UTF-8
5 - Perl & UTF-8

If you want to create Unicode strings in your Perl code, all you have to do is use the following notation, where codepoint is the character’s code point in hexadecimal:

   \x{codepoint}

For example, to create the string “niño”:

   my $str = "ni\x{f1}o";

To write this string to STDOUT as UTF-8, you might do this:

   binmode STDOUT, ":utf8";
   print $str;

To undo it and go back to writing raw bytes, do this:

   binmode STDOUT;
   print $str;
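
The effect of the layer is easier to see when the output is captured instead of sent to a terminal. The sketch below prints the same string to an in-memory filehandle with and without the :utf8 layer; bytes_written is a hypothetical helper, and :raw stands in for what a bare binmode does.

```perl
use strict;
use warnings;

my $str = "ni\x{f1}o";

# bytes_written is a hypothetical helper: it prints $str through an
# in-memory filehandle opened with the given layer and returns the
# bytes that were written, as a hex string.
sub bytes_written {
    my ($layer, $str) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "Cannot open in-memory file: $!";
    print {$fh} $str;
    close($fh) or die "Cannot close in-memory file: $!";
    return join ' ', unpack '(H2)*', $buffer;
}

print bytes_written(':utf8', $str), "\n";   # 6e 69 c3 b1 6f
print bytes_written(':raw',  $str), "\n";   # 6e 69 f1 6f
```

With the layer, “ñ” goes out as the UTF-8 bytes C3 B1; without it, the single byte F1 is written.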

Or to write UTF-8 data to disk, you could do this:

   open(OFILE, ">:utf8", $filename) or die "Cannot open $filename: $!";
   print OFILE $str;
   close(OFILE);

To read UTF-8 data from disk, you could do this:

   open(IFILE, "<:utf8", $filename) or die "Cannot open $filename: $!";
   my $str = <IFILE>;
   close(IFILE);
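
A round trip through a temporary file shows both directions together. This sketch swaps in lexical filehandles and the :encoding(UTF-8) layer, which validates the data as it is read, in place of :utf8; the temporary file comes from the core File::Temp module.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file that is removed when the program exits.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

my $str = "ni\x{f1}o\n";

# Write the character string out as UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', $filename) or die "Cannot write $filename: $!";
print {$out} $str;
close($out) or die "Cannot close $filename: $!";

# Read the bytes back in and decode them into characters.
open(my $in, '<:encoding(UTF-8)', $filename) or die "Cannot read $filename: $!";
my $read_back = <$in>;
close($in);

print "round trip ok\n" if $read_back eq $str;
printf "characters: %d, bytes on disk: %d\n", length $read_back, -s $filename;
```

On a Unix system the string has 5 characters but occupies 6 bytes on disk, because “ñ” takes two bytes in UTF-8.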

To convert Windows-1252 to UTF-8, you could do something like this:

   use Text::Iconv;
   use Encode;
   # Convert the Windows-1252 bytes in $str to UTF-8 bytes.
   my $utf8_str = Text::Iconv->new("WINDOWS-1252", "UTF-8")->convert($str);
   # Flag the result as UTF-8 (an internal Encode function).
   Encode::_utf8_on($utf8_str);
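
An alternative sketch uses only the public Encode interface: decode turns Windows-1252 bytes into a Perl character string, and encode turns that string into UTF-8 bytes when they are needed. This is a different route from the Text::Iconv one above, not a description of it, and the sample input is made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample Windows-1252 input; byte 0x80 is the euro sign in that encoding.
my $bytes = "ni\xF1o \x80 5";

my $chars = decode('cp1252', $bytes);   # Perl character string
my $utf8  = encode('UTF-8', $chars);    # UTF-8 encoded bytes

printf "input bytes: %d\n", length $bytes;   # 8
printf "characters:  %d\n", length $chars;   # 8
printf "UTF-8 bytes: %d\n", length $utf8;    # 11
```

The character string is what the rest of the program should work with; the encoded bytes are only for output.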
Chapter 6: Making the Browser Happy
6 - Making the Browser Happy



All the efforts up to now will be for naught if the browser doesn’t understand how the page is encoded.

To make the browser aware of the encoding of the data, either add this HTTP response header:

   Content-type: text/html; charset=utf-8

or, if you want to tag each document…

For XML, add this declaration at the top of the document:

   <?xml version="1.0" encoding="utf-8" ?>

For HTML, add this declaration at the top of the <head> section of the document:

   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

For XHTML, add this declaration at the top of the <head> section of the document:

   <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
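
Putting the header and the HTML declaration together, a CGI-style script might look like the sketch below. The function name utf8_response is hypothetical, and the text passed in is assumed to be HTML-safe already; no escaping is done here.

```perl
use strict;
use warnings;
use Encode qw(encode);

# utf8_response is a hypothetical helper: it builds a complete CGI-style
# response (HTTP header, blank line, HTML body) as UTF-8 encoded bytes.
# Assumption: $text needs no HTML escaping.
sub utf8_response {
    my ($text) = @_;
    my $html = qq{<html><head>\n}
             . qq{<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n}
             . qq{</head><body><p>$text</p></body></html>\n};
    return "Content-type: text/html; charset=utf-8\r\n\r\n" . encode('UTF-8', $html);
}

binmode STDOUT;                      # the response is already encoded bytes
print utf8_response("ni\x{f1}o");
```

Encoding the body explicitly and writing raw bytes keeps the declared charset and the actual bytes in agreement.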
Chapter 7: Encoding Hell
7 - Encoding Hell



So now we think we understand UTF-8, and we think we understand how to process this data in Perl, but there is still SO MUCH OPPORTUNITY for things to go wrong!

The Byte Order Mark (code point 0xFEFF) is one of them. The intention is probably good but it can cause much grief.

The solution is to cut out the byte sequence EF BB BF from the beginning of the document.
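
Cutting out the byte order mark is a one-line substitution. The sketch below works on raw bytes, before any decoding; strip_bom is a name made up for this example.

```perl
use strict;
use warnings;

# strip_bom is a hypothetical helper: it removes a leading UTF-8 byte
# order mark (bytes EF BB BF) from a byte string and returns the rest.
sub strip_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

my $document = "\xEF\xBB\xBFhello";
printf "before: %d bytes, after: %d bytes\n",
    length $document, length strip_bom($document);   # before: 8 bytes, after: 5 bytes
```

If the data has already been decoded into characters, the equivalent pattern would match the single character \x{FEFF} at the start of the string instead.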

Encoded Gibberish. (It takes several forms.)

All Gibberish

If it’s all gibberish then maybe the data is OK but you’re looking at it using the wrong pair of glasses. Change the document encoding declaration, or try changing your browser’s or application’s encoding setting.

Partially Gibberish (Two Cases)

First Case: What does it look like?

   Niño vs Ni?o
   Niño vs Ni o

You likely have the dreaded “mixed encoding” nightmare. Probably someone has poured ISO-8859-1 or Windows-1252 into a UTF-8 document or vice versa. You will need to figure out which bytes are which and clean the document up to make it pure UTF-8.
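
One possible way to clean up a mixed document is sketched below. It is a heuristic, not a guaranteed fix: byte sequences that look like multi-byte UTF-8 are kept as they are, and every other byte above 0x7F is assumed to be Windows-1252 and converted. The name fix_mixed and the simplified pattern (which does not exclude overlong forms or surrogates) are assumptions of this sketch.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Simplified pattern for multi-byte UTF-8 sequences. It does not exclude
# overlong forms or surrogates.
my $utf8_seq = qr/[\xC2-\xDF][\x80-\xBF]
                |[\xE0-\xEF][\x80-\xBF]{2}
                |[\xF0-\xF4][\x80-\xBF]{3}/x;

# fix_mixed is a hypothetical helper: it keeps bytes that already look
# like UTF-8 and converts any other high byte from Windows-1252 to UTF-8.
sub fix_mixed {
    my ($bytes) = @_;
    $bytes =~ s{($utf8_seq)|([\x80-\xFF])}
               { defined $1 ? $1 : encode('UTF-8', decode('cp1252', $2)) }ge;
    return $bytes;
}

my $mixed = "ni\xF1o and ni\xC3\xB1o";   # Windows-1252 and UTF-8 in one string
print join(' ', unpack '(H2)*', fix_mixed($mixed)), "\n";
```

The heuristic can guess wrong when genuine Windows-1252 text happens to contain a byte pair that is also valid UTF-8, so the result still deserves a look.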

Second Case: What does it look like?

   niÃ±o (viewed in UTF-8 mode)
   niÃƒÂ±o (viewed in Windows-1252 mode)

You likely have the double encoding problem. Sometimes some of the data gets encoded as UTF-8 twice! Again, you’ll need to look at the bytes and fix it.
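
When the whole string is known to be double encoded, one round can be undone as in this sketch. It assumes the accidental middle step treated the UTF-8 bytes as ISO-8859-1; if it was Windows-1252 instead, 'cp1252' would be the name to try. The function name undo_double_utf8 is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# undo_double_utf8 is a hypothetical helper: it reverses one round of
# accidental double UTF-8 encoding. Assumption: the intermediate step
# misread the UTF-8 bytes as ISO-8859-1.
sub undo_double_utf8 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);     # one character per original byte
    return encode('iso-8859-1', $chars);     # back to the original UTF-8 bytes
}

my $double = "ni\xC3\x83\xC2\xB1o";          # "niño" encoded as UTF-8 twice
print join(' ', unpack '(H2)*', undo_double_utf8($double)), "\n";   # 6e 69 c3 b1 6f
```

Apply this only to data that really is double encoded; running it over correctly encoded UTF-8 would damage it.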

Now some odds and ends…

HTML::Entities::decode_entities doesn’t always do what you think. Sometimes it returns ISO-8859-1 instead of UTF-8. Caveat programmer!

Be careful if you’re using the encode or decode routines from Encode.pm: they may not set the string’s UTF-8 flag the way you expect.

 And as a checklist of sorts when you’re debugging, make sure that:
   The data has been encoded properly.
   The data has been flagged as UTF-8.
   It has been written out properly.
   The document has the appropriate encoding declaration.
   Your terminal or browser has been set to the correct encoding.
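
For the first two items of the checklist, a small inspection routine can help. The sketch below reports a string’s UTF-8 flag, its length and the value of each element; describe is a hypothetical helper, while utf8::is_utf8 is a built-in Perl function.

```perl
use strict;
use warnings;
use Encode qw(decode);

# describe is a hypothetical debugging helper: it reports the UTF-8 flag,
# the length and the hex value of each element of a string.
sub describe {
    my ($label, $str) = @_;
    my $values = join ' ', map { sprintf('%02x', ord($_)) } split //, $str;
    return sprintf '%s: flag=%s length=%d values=%s',
        $label, (utf8::is_utf8($str) ? 'on' : 'off'), length($str), $values;
}

my $bytes = "ni\xC3\xB1o";              # UTF-8 encoded bytes, as read from a raw file
my $chars = decode('UTF-8', $bytes);    # the same text as a character string

print describe('bytes', $bytes), "\n";  # bytes: flag=off length=5 values=6e 69 c3 b1 6f
print describe('chars', $chars), "\n";  # chars: flag=on length=4 values=6e 69 f1 6f
```

A string that shows C3 B1 where a single F1 was expected has not been decoded yet; one that shows C3 83 C2 B1 has been encoded twice.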
Conclusion
Navigating the transition from traditional encodings to UTF-8 is not easy, but with perseverance it is doable. We have illustrated the common encodings, how to process our information in this environment and how to tackle the common issues that might arise.

								