Unicode (and Java) - PowerPoint

					Unicode (and Java)
Brice Giesbrecht
Objective of Presentation
   The need for Unicode
   How it works
   Differentiate between encodings
   How to get your browser to work…
   See how Java consumes and
    produces data
Overview of Presentation
   Character Sets
   Unicode
   Encodings
   Unicode Support in Java
   Unicode Support in Databases (?)
   Demonstration (web app)
   Resources
   Door Prizes (for those still awake…)
Character Sets
   What is a character set?
   Code Page: a mapping in which a sequence of
    bits, usually a single octet representing integer
    values 0 through 255, is associated with a specific
    character (wikipedia)
   Most character sets are a direct mapping of a
    value to a number (7 bit / 8 bit)
   Character sets are NOT fonts!
   Encoding is usually a lookup in a table
   Most IBM and Microsoft code pages use ASCII as
    their base set of characters
   The English bias (compare to Indic languages)
Character Sets
   Issues within a single language
   Selectors to overcome 8 bit limitations (especially
    for CJK sets)
   Historical importance of platforms and hardware
   Compatibility (or more likely, lack thereof)
   ISCII as an example
   Issues outside a single language
   How do you produce content using multiple
    languages? (Or the characters from those
    languages?)
   http://en.wikipedia.org/wiki/Code_page_437
Character Sets
   Enter the standards
   ISO-646 (ASCII, still 7 bit)
       12 whole code points to play with!
       C0 Control Set (0x00 – 0x1F)
   ISO-8859-n
       0x00 – 0x7F ISO-646 IRV
       0x80 – 0xFF Different for each set (or part)
       ISO 8859-1 (Latin1)
       C1 Control Set (0x80 – 0x9F)
   ISO-2022
       Designed for transmission
       Non-Latin bases and multi-byte sets
Character Sets
   Enter Microsoft!
   Windows code pages
       http://www.microsoft.com/globaldev/reference/wincp.mspx
   Cp1252
       Based on ISO 8859-1
       C1 code points used for printable characters
       Often mislabeled as ISO-8859-1 due to their similarities
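The mislabeling matters only for bytes 0x80 – 0x9F. The following is a minimal sketch (not from the original slides) that decodes one such byte both ways. The class name `C1Demo` and the helper `decodeByte` are illustrative, and it assumes the JDK in use ships the `windows-1252` charset, which is common but not guaranteed by the Java specification.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class C1Demo {
    // Decode a single byte with the given charset and return the resulting code point.
    static int decodeByte(int b, Charset cs) {
        String s = new String(new byte[] { (byte) b }, cs);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available on this JDK.
        Charset cp1252 = Charset.forName("windows-1252");
        int b = 0x93; // a byte inside the C1 range 0x80 - 0x9F

        // ISO-8859-1 maps the byte to a C1 control character (U+0093).
        System.out.printf("ISO-8859-1:   U+%04X%n", decodeByte(b, StandardCharsets.ISO_8859_1));
        // Cp1252 maps the same byte to a printable left double quotation mark (U+201C).
        System.out.printf("windows-1252: U+%04X%n", decodeByte(b, cp1252));
    }
}
```

Text labeled ISO-8859-1 but actually written as Cp1252 typically shows up as missing or wrong "smart quotes" and dashes, because those characters live in this range.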
Unicode
What is Unicode?
Unicode provides a unique number for
  every character,
no matter what the platform,
no matter what the program,
no matter what the language.
Unicode
   ISO 10646 (1990)
   Merged with the Unicode Consortium
   Ties a character, name, and a code
    point together
   BMP – Basic Multilingual Plane (the
    first 65,536 code points)
   ISO and UC character repertoires are
    synchronized
   UCS (Universal Character Set)
Unicode
   Q: So are they the same thing?
    A: No. Although the character codes and
    encoding forms are synchronized between
    Unicode and ISO/IEC 10646, the Unicode
    Standard imposes additional constraints on
    implementations to ensure that they treat
    characters uniformly across platforms and
    applications. To this end, it supplies an
    extensive set of functional character
    specifications, character data, algorithms
    and substantial background material that is
    not in ISO/IEC 10646.
    (http://unicode.org/faq/unicode_iso.html)
Unicode
   The Unicode Standard includes a set of
    characters, names, and coded
    representations that are identical with
    those in ISO/IEC 10646:2003. It
    additionally provides details of character
    properties, processing algorithms, and
    definitions that are useful to implementers.
    [It] strengthens Unicode support for
    worldwide communication, software
    availability, and publishing.
    (http://www.iso.org)
Unicode
   UCS code space: 0x0 – 0x7FFFFFFF
       128 x 256 x 256 x 256 (GPRC: group, plane, row, cell)
       2,147,483,648 possible code points
   The Unicode Character Database
       http://unicode.org/Public/UNIDATA/UCD.html
       Main definition (UnicodeData.txt)
       Available online
       http://www.unicode.org/Public/UNIDATA/
   Unicode code space: 0x0 – 0x10FFFF
       17 x 256 x 256 = 1,114,112 code points
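The arithmetic above can be checked from Java. This is a small sketch, assuming nothing beyond the standard `Character` class; the class name `PlaneDemo` and the helper `planeOf` are made up for illustration.

```java
public class PlaneDemo {
    // The plane number is everything above the low 16 bits of the code point.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    // Size of the Unicode code space as Java sees it (0x0 through 0x10FFFF).
    static int codeSpaceSize() {
        return Character.MAX_CODE_POINT + 1;
    }

    public static void main(String[] args) {
        System.out.println("Code points: " + codeSpaceSize());          // 17 x 65,536
        System.out.println("Planes:      " + (planeOf(Character.MAX_CODE_POINT) + 1));
        System.out.println("U+4E8C  is in plane " + planeOf(0x4E8C));   // BMP
        System.out.println("U+10302 is in plane " + planeOf(0x10302));  // SMP
    }
}
```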
Unicode
   As of Unicode 5.0.0, 101,063 (9.1%) of
    these code points are assigned, with
    another 137,468 (12.3%) reserved for
    private use, leaving 875,441 (78.6%)
    unassigned. The number of assigned code
    points is made up as follows:

    98,884 graphemes
    140 formatting characters
    65 control characters
    2,048 surrogate code points
Unicode
   Plane 0 (0000-FFFF)
   Basic Multilingual Plane (BMP)
   Used for most of the alphabets
   Not all code points are used
   Allocated in areas/blocks
Unicode
   Plane 1 (10000-1FFFF):
   Supplementary Multilingual Plane
    (SMP)
   Historic scripts such as Linear B, but
    is also used for musical and
    mathematical symbols.
Unicode
   Plane 2 (20000-2FFFF)
   Supplementary Ideographic Plane
    (SIP)
   Used for about 40,000 rare Chinese
    characters that are mostly historic
Unicode
   Planes 3 to 13 (30000-DFFFF)
   Unassigned
Unicode
   Plane 14 (E0000-EFFFF)
   Supplementary Special-purpose Plane
    (SSP)
   Glyph (font) selection
   Code point + variation selector =
    variation sequence
   http://www.unicode.org/reports/tr37/tr37-3.html
    (Ideographic Variation Database)
Unicode
   Plane 15 (F0000-FFFFF)
   Plane 16 (100000-10FFFF)
   Plane 0 (E000-F8FF)
   Private Use Area (PUA)
       The use of the PUA was a concept inherited from certain
        Asian encoding systems. These systems had private use
        areas to encode Japanese Gaiji (rare personal name
        characters) in application-specific ways.
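Java exposes the private-use classification through the general category of a code point. The sketch below is an illustration only; `PuaDemo` and `isPrivateUse` are hypothetical names, and the result depends on the Unicode data bundled with the running JDK.

```java
public class PuaDemo {
    // True if the JDK's Unicode data classifies the code point as private use (category Co).
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        int[] samples = { 0x0041, 0xE000, 0xF8FF, 0xF0000, 0x100000 };
        for (int cp : samples) {
            System.out.printf("U+%04X private use? %b%n", cp, isPrivateUse(cp));
        }
    }
}
```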
Unicode
ConScript Unicode Registry
   The purpose of the ConScript Unicode Registry
    (CSUR) is to coordinate the assignment of blocks
    out of the Unicode Private Use Area (E000-F8FF
    and 000F0000-0010FFFF) to constructed/artificial
    scripts, including scripts for constructed/artificial
    languages.
   Cirth, Klingon, Tengwar, etc.
Encodings
Purpose of the following encodings is to
  get the Unicode value to you.
  Depending on the storage or
  transmission protocols, different
  encodings will need to be
  used. These are not different
  character sets, they are ways of
  representing the characters in
  Unicode.
Encodings
   Endianness
       0x1234
       LE 34 12
       BE 12 34
   Byte Order Mark - 0xFEFF
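       Helps determine endianness
       Since Unicode 3.2, U+2060 (WORD JOINER) takes over the other
        role of U+FEFF (zero width no-break space)
       0xFFFE is reserved (a noncharacter), so a byte-swapped BOM
        can be detected
       0xFEFF is set aside for the BOM
       Also used to declare the encoding (UTF-8)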
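The byte orders on this slide can be reproduced with the standard Java charsets. This is a minimal sketch; the class name `BomDemo` and the `hex` helper are illustrative and not part of the original presentation.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Render bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u1234"; // the single code point U+1234

        // Explicit byte order: no BOM is written.
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE))); // 12 34
        System.out.println("UTF-16LE: " + hex(s.getBytes(StandardCharsets.UTF_16LE))); // 34 12

        // Plain "UTF-16": Java's encoder writes a big-endian BOM first.
        System.out.println("UTF-16:   " + hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 12 34
    }
}
```

When decoding, Java's `UTF-16` charset uses a leading BOM to pick the byte order, which is why the BOM-less `UTF-16BE` and `UTF-16LE` variants exist for cases where the order is known in advance.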
Encodings
UTF-8
   Variable-length character encoding
   Can address all characters in the UCS but was
    limited by RFC 3629 to just address the Unicode
    code space.
   BOM – EF BB BF
   Format
    000000-00007F   0zzzzzzz
    000080-0007FF   110yyyyy 10zzzzzz
    000800-00FFFF   1110xxxx 10yyyyyy 10zzzzzz
    010000-10FFFF   11110www 10xxxxxx 10yyyyyy 10zzzzzz
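The format table translates almost directly into bit operations. The following is a hand-rolled sketch for illustration (real code should just use `String.getBytes(StandardCharsets.UTF_8)`); the class `Utf8Demo` and method `encode` are names invented here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Demo {
    // Encode one code point following the four rows of the format table.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (cp < 0x80) {                       // 0zzzzzzz
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {                      // 110yyyyy 10zzzzzz
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp < 0x10000) {                    // 1110xxxx 10yyyyyy 10zzzzzz
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {                    // 11110www 10xxxxxx 10yyyyyy 10zzzzzz
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0x4E8C, 0x10302 };
        for (int cp : samples) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X: %d byte(s), matches JDK: %b%n",
                    cp, mine.length, Arrays.equals(mine, jdk));
        }
    }
}
```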
Encodings
UTF-32/UCS-4
   Fixed-length character encoding
   UCS-4 uses 31 bits of each four-octet unit
   UCS-4 capable of addressing entire UCS, but was
    restricted to only cover the Unicode code space
   UTF-32 only covers the Unicode code space
   4E8C, 10302 = 00004E8C, 00010302
   BE BOM – 00 00 FE FF
   LE BOM – FF FE 00 00
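Because every code point occupies exactly four octets, UTF-32 can be produced by writing each code point as a 32-bit integer. The sketch below does this by hand with `ByteBuffer` rather than relying on a `UTF-32` charset being installed; `Utf32Demo` and `encode` are illustrative names, and no BOM is written.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Utf32Demo {
    // Write each code point of the string as one 32-bit integer in the given byte order.
    static byte[] encode(String s, ByteOrder order) {
        int count = s.codePointCount(0, s.length());
        ByteBuffer buf = ByteBuffer.allocate(4 * count).order(order);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            buf.putInt(cp);
            i += Character.charCount(cp); // skip 2 chars for a surrogate pair
        }
        return buf.array();
    }

    public static void main(String[] args) {
        // The two code points from the slide: U+4E8C and U+10302.
        String s = new StringBuilder().appendCodePoint(0x4E8C).appendCodePoint(0x10302).toString();
        for (byte b : encode(s, ByteOrder.BIG_ENDIAN)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // 00 00 4E 8C 00 01 03 02
    }
}
```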
Encodings
UCS-2
   Fixed-length encoding
   Two-octet
   It is NOT UTF-16!
   Only addresses BMP
   UCS-2BE, UCS-2LE
   Obsoleted by UTF-16
Encodings
UTF-16
   Variable-length encoding
   UTF-16BE, UTF-16LE
   BE BOM – FE FF
   LE BOM – FF FE
   Surrogates are used to address code points
    outside the BMP. (We will cover this later)
Encodings
UTF-16 Surrogate Pairs
   Needed for code points > 0xFFFF
   High surrogate 0xD800 – 0xDBFF (first 16-bit code unit)
   Low surrogate 0xDC00 – 0xDFFF (second 16-bit code unit)
   Algorithm:
       high = ((cp - 0x10000) >> 10) | 0xD800      (high 10 bits)
       low  = ((cp - 0x10000) & 0x3FF) | 0xDC00    (low 10 bits)
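The algorithm above can be written out in a few lines and checked against the JDK's own `Character.toChars`. This is a sketch for illustration; `SurrogateDemo` and `toSurrogates` are hypothetical names.

```java
import java.util.Arrays;

public class SurrogateDemo {
    // Split a supplementary code point into its high and low surrogate code units.
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = cp - 0x10000;                     // 20-bit value
        char high = (char) (0xD800 | (v >> 10));  // top 10 bits
        char low = (char) (0xDC00 | (v & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x10302;
        char[] pair = toSurrogates(cp);
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) pair[0], (int) pair[1]); // D800 DF02
        System.out.println("Matches Character.toChars: "
                + Arrays.equals(pair, Character.toChars(cp)));
    }
}
```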
Encodings
Which Encoding should you use?
   If dealing with CJK or Hindi (code points >= 0x0800), UTF-8
    requires 3 bytes per character whereas UTF-16 needs only 2
   UTF-8 is great for ASCII (1 byte per character) whereas UTF-16
    needs 2 bytes for it
   Java uses UTF-16
   Windows uses UTF-16LE internally
   UTF-32 is not used that much
   UTF-8 and UTF-16 are the most common
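The size trade-off can be measured directly. The sketch below compares encoded lengths for three short samples; the class name, helper, and sample strings are chosen here for illustration, and `UTF-16BE` is used so that no BOM inflates the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedSizeDemo {
    // Number of bytes the string occupies in the given encoding.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String[] labels = { "ASCII", "Hindi", "CJK" };
        String[] samples = {
            "hello",                                 // 5 ASCII letters
            "\u0939\u093F\u0928\u094D\u0926\u0940",  // 6 Devanagari code points
            "\u4E8C"                                 // 1 CJK ideograph
        };
        for (int i = 0; i < samples.length; i++) {
            System.out.printf("%-6s UTF-8: %2d bytes, UTF-16BE: %2d bytes%n",
                    labels[i],
                    byteLength(samples[i], StandardCharsets.UTF_8),
                    byteLength(samples[i], StandardCharsets.UTF_16BE));
        }
    }
}
```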
Java
   J2SE 1.5 supports Unicode 4.0
   J2SE 1.4 supports Unicode 3.0
   J2SE 1.3 supports Unicode 2.1
   Supplementary characters were introduced
    in Unicode 3.1
   Addressed in JSR 204
    (http://jcp.org/en/jsr/detail?id=204)
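One practical consequence of JSR 204 is that a `char` is a UTF-16 code unit, not a character, so `String.length()` and the code point count can differ. A minimal sketch, using the code point APIs added in J2SE 1.5 (class and helper names are illustrative):

```java
public class SupplementaryDemo {
    // Build a string holding exactly one code point.
    static String fromCodePoint(int cp) {
        return new String(Character.toChars(cp));
    }

    // Count code points rather than UTF-16 code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x10302); // a supplementary character
        System.out.println("length():        " + s.length());          // 2 (surrogate pair)
        System.out.println("code points:     " + codePointLength(s));  // 1
        System.out.println("supplementary?   " + Character.isSupplementaryCodePoint(s.codePointAt(0)));
    }
}
```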
Java
   Unicode characters are specified using
    \u escapes, such as \u0039 (the digit 9)
   Unicode can be used in source files
   file.encoding=Cp1252 on my machine
   You can change this, but beware…
   Java reads and writes using this
    encoding by default
   You can specify the character set to
    use for reading or writing
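A hedged sketch of what specifying the character set looks like in practice: the reader and writer constructors accept a `Charset`, so the platform default never comes into play. The class `CharsetIoDemo`, its helpers, and the sample text are assumptions for illustration, and in-memory streams stand in for files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetIoDemo {
    // Encode text through a Writer that uses an explicit charset.
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decode bytes through a Reader that uses an explicit charset.
    static String read(byte[] data, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("Platform default: " + Charset.defaultCharset());

        String text = "caf\u00E9";
        byte[] utf8 = write(text, StandardCharsets.UTF_8);

        // Same charset on both sides: the text round-trips.
        System.out.println(read(utf8, StandardCharsets.UTF_8).equals(text));
        // Wrong charset on the reading side: the accented letter becomes two characters.
        System.out.println(read(utf8, StandardCharsets.ISO_8859_1).length());
    }
}
```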
Java
Big5         IBM420        ISO-8859-4       x-eucJP-Open   x-IBM949
Big5-HKSCS   IBM424        ISO-8859-5       x-IBM1006      x-IBM949C
EUC-JP       IBM437        ISO-8859-6       x-IBM1025      x-IBM950
EUC-KR       IBM500        ISO-8859-7       x-IBM1046      x-IBM964
GB18030      IBM775        ISO-8859-8       x-IBM1097      x-IBM970
GB2312       IBM850        ISO-8859-9       x-IBM1098      x-ISCII91
GBK          IBM852        JIS_X0201        x-IBM1112      x-ISO-2022-CN-CNS
IBM-Thai     IBM855        JIS_X0212-1990   x-IBM1122      x-ISO-2022-CN-GB
IBM00858     IBM857        KOI8-R           x-IBM1123      x-iso-8859-11
IBM01140     IBM860        Shift_JIS        x-IBM1124      x-JIS0208
IBM01141     IBM861        TIS-620          x-IBM1381      x-JISAutoDetect
IBM01142     IBM862        US-ASCII         x-IBM1383      x-Johab
IBM01143     IBM863        UTF-16           x-IBM33722     x-MacArabic
IBM01144     IBM864        UTF-16BE         x-IBM737       x-MacCentralEurope
IBM01145     IBM865        UTF-16LE         x-IBM856       x-MacCroatian
IBM01146     IBM866        UTF-8            x-IBM874       x-MacCyrillic
IBM01147     IBM868        windows-1250     x-IBM875       x-MacDingbat
IBM01148     IBM869        windows-1251     x-IBM921       x-MacGreek
IBM01149     IBM870        windows-1252     x-IBM922       x-MacHebrew
IBM037       IBM871        windows-1253     x-IBM930       x-MacIceland
IBM1026      IBM918        windows-1254     x-IBM933       x-MacRoman
IBM1047      ISO-2022-CN   windows-1255     x-IBM935       x-MacRomania
IBM273       ISO-2022-JP   windows-1256     x-IBM937       x-MacSymbol
IBM277       ISO-2022-KR   windows-1257     x-IBM939       x-MacThai
IBM278       ISO-8859-1    windows-1258     x-IBM942       x-MacTurkish
IBM280       ISO-8859-13   windows-31j      x-IBM942C      x-MacUkraine
IBM284       ISO-8859-15   x-Big5-Solaris   x-IBM943       x-MS950-HKSCS
IBM285       ISO-8859-2    x-euc-jp-linux   x-IBM943C      x-mswin-936
IBM297       ISO-8859-3    x-EUC-TW         x-IBM948       x-PCK
                                                           x-windows-874
                                                           x-windows-949
                                                           x-windows-950
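A list like the one above can be generated on any machine with `Charset.availableCharsets()`. The exact names vary by JDK version and vendor, so the output of this sketch will not necessarily match the slide; `ListCharsets` and `all` are illustrative names.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

public class ListCharsets {
    // Canonical charset names mapped to Charset objects, sorted by name.
    static SortedMap<String, Charset> all() {
        return Charset.availableCharsets();
    }

    public static void main(String[] args) {
        for (String name : all().keySet()) {
            System.out.println(name);
        }
        // Individual names (or aliases) can be checked before use.
        System.out.println("Cp1252 supported? " + Charset.isSupported("Cp1252"));
    }
}
```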
Databases (Maybe)
   SQL 92 NATIONAL CHARACTER
       The <key word>s NATIONAL CHARACTER are used to specify a
        character string data type with a particular implementation-defined
        character repertoire. Special syntax (N'string') is provided for
        representing literals in that character repertoire.

   Collation
   Database Support
       MySQL
       Oracle
       SQL Server
       PostgreSQL
Demonstration
   Read/Write/Examine UTF-8/UTF-16/UTF-16LE
    encoded text (with hex editor)
   Show encoding settings in Eclipse and Java
   Show how Windows (and the Eclipse console)
    can/can't display some characters
   Web browser settings
   Chinese article on cracking of SHA-1
   Martin Fowler article on dependency
    injection
Resources
   The big ones:
       http://www.unicode.org/Public/UNIDATA/
       http://en.wikipedia.org/wiki/Unicode
       http://www.evertype.com/standards/csur
   The rest:
       http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp
       http://en.wikibooks.org/wiki/Unicode/Character_reference
       http://www.joelonsoftware.com/articles/Unicode.html
       http://www.cl.cam.ac.uk/~mgk25/unicode.html
       http://czyborra.com/charsets/iso646.html
       http://www.fileformat.info/ (GREAT resource)
   For fun:
       http://www.omniglot.com/
       http://en.wikipedia.org/wiki/Constructed_language
       http://talideon.com/concultures/wiki/