Toward Decent Text Encoding

Standards
Neville Holmes, University of Tasmania
Computer, August 1998, pp. 108-109

Text is composed of characters; we get different kinds of text from different kinds of characters. So character sets are very important. And if there are contending views about whether we are well served by our character-set standards, these views should be exposed and discussed.

It’s strange that the computing industry has for so long stuck to poor and impoverished character sets for text encoding. Now, without much public discussion or dispute, the computing industry seems to be moving to an equally poor but contrastingly obese character set called Unicode.

TRADITIONAL CHARACTER SETS
The development of writing technology (and, relatively recently, of print technology) has been more a story of the gradual development of standards than a story of the development of machinery. The widespread acceptance of the roman and italic forms of the Latin alphabet, which have become the dominant alphabetic forms in countries such as Germany and Turkey only within living memory, has added an important interlingual aspect to the use of character sets and to international use of print technology.

The early development of automatic data processing mainly in English-speaking countries led to English versions of the Latin alphabet being used in associated machinery, particularly in printers. In the 1950s, the typical line printer sported 26 letters (uppercase roman), 10 decimal digits, and a few special characters added mainly for commercial use. If Fortran programmers wanted their additions printed with the normal sign, they had to order their machines with a special feature to replace the ampersands with + signs. And they were forced to use the asterisk as a multiplication symbol.

In the 1960s, two expanded character sets came into wide use. When IBM introduced the 8-bit System/360 computers, it introduced an 8-bit character set called EBCDIC (Extended Binary Coded Decimal Interchange Code) to go with it. A particular desire for compatibility with prior punched-card codings gave a quite bizarre structure to this de facto standard.

At the same time, a formal effort resulted in a 7-bit standard character set called ASCII (American Standard Code for Information Interchange), which was particularly designed for the telegraphy of the time.

EBCDIC AND ASCII
Both EBCDIC and ASCII were put together with a great deal of thoughtful effort and are still widely used. They included upper- and lowercase Latin alphabets, the 10 decimal digits, a slightly enhanced but still inadequate set of special characters, and a set of noncharacters intended to be used for controlling recording machinery in various ways.

Having two character sets in concurrent widespread use has been problem enough, but there have been many other problems. Both EBCDIC and ASCII provided users with a + symbol as standard, but (with breathtaking arrogance) the developers of both sets refused to provide the traditional multiplication and division symbols. The control characters not only proved inadequate but were used inconsistently. For example, both Unix and IBM PC operating systems have traditionally used the ASCII character set, but in encoding text Unix has used a single line feed control character to separate lines of text, while IBM PCs have used a carriage return/line feed control character pair.

Both character sets became grossly distorted when they were adapted to encoding text in languages other than English. ASCII had a problem anyway in being a 7-bit coding typically running on 8-bit machinery, which led to peculiar and inconsistent uses of the eighth bit. But the way in which versions of ASCII were accreted for new languages was grotesque in the extreme.

UNICODE CHARACTER SET
Little wonder, therefore, that the computing industry should wish to replace EBCDIC and ASCII with a new improved character set called Unicode, particularly when computing has become so international. What is amazing is that it has taken this long. What is disappointing, if not tragic, is that the replacement is so unsuitable for text encoding.

There has been relatively little popular discussion of Unicode. A recent exception is the complementary proposal by Muhammad Mudawwar (“Multicode: A Truly Multilingual Approach to Text Encoding,” Computer, Apr. 1997, pp. 37-43). Unicode seems to be trying to provide a single character set to represent documents in any language or writing system or mixture thereof. A large part of the difficulty with Unicode, though, is that it is most suitable for (even aimed at) presenting text, not encoding it. But presentation of text is one technology, while the encoding, storage, and transmission of text is quite another.

Unicode is intended primarily to allow the computing and telecommunications industry to get by with only one character set for the entire world (http://www.unicode.org). One result is that everyone has to use 16 bits for every character. Surely it would be sadistic to suggest that the great redundancy involved in a 16-bit character set would allow the effective use of data compression techniques. Or to suggest that everyone’s equipment should support all the world’s writing systems, past and present, at the same time. But with Unicode it’s either that or back to proliferating versions.

Mudawwar’s Multicode aims to counter the 16-bit drawback and several others that he describes in some detail. But Multicode is essentially a compromise; Mudawwar’s article emphasizes in its very last sentence that “both approaches can coexist – Multicode for programming ease and Unicode to support unified fonts.” But in international communication, the necessary variety of Multicodes would be much more complex than the single Unicode.

ENCODING TEXT
Most traffic in text is raw text (messages, identifiers, business records) and the vast majority of this traffic is monolingual. Indeed, the vast majority of presented documents are also monolingual. Much of this monolingual text needs only an 8-bit encoding system to be encoded as plaintext.

Mudawwar’s Multicode scheme recognizes this and therefore provides for a separate character set for every “official language” (“Unicode Misunderstood,” Computer, June 1997). Most of these languages can be accommodated within an 8-bit coding scheme. In this case Multicode provides for great data compression, but in any case it separates languages from one another, which is no longer the way of the world, if it ever was.

There are two aspects of language interchange. First, languages borrow words and phrases from one another so that, for example, English uses French and German words and takes their diacritical marks with them. Readers are often better served when the markings are kept as they are, showing how words like “café” and “cliché” should be pronounced.

Second, in this international society it is important to be able to name people and organizations in their own language. Indeed, to many cultures it is insulting not to use their names properly. In Western text, Chinese names are stripped of the tone marks provided in their Pinyin spelling system, which would be equivalent to English usage stripping French or German names of all their vowels. Most unfriendly behavior.

The world is divided into writing system zones. For languages that use the same writing system (the system based on the Latin alphabet, for example) a good text-encoding standard would completely support the exchange of names. I should be able to read all Swedish names in plaintext e-mail messages, but at present many are garbled. On the other hand, writing system cultural zones expect to transliterate words and names from other zones, which seems to be a quite amenable approach, provided it can be done well.

NECESSARY STANDARDS
For text encoding, the world needs a standard for each writing system that suits each and every language using that system. These standards should be in accord with each other so that basic processing (such as distinguishing letters from punctuation and numerals from words) is the same for each system and also so that text using one writing system can be practically encoded or viewed on equipment designed for another writing system.

Text encoded by these systems could be marked up for presentation within the writing system that encodes it. Using markup would surely provide a logical, effective, and efficient separation of function and would make it easy to combine text from different writing systems.

Most of the world’s writing systems could probably be encoded using an 8-bit scheme. The one exception is the traditional Chinese writing system, which encompasses thousands of distinct characters. But it could be argued that this rich and time-honored character system is more properly regarded as a reading system, because it is more efficient for reading than the various alphabetic systems but much less effective for writing and encoding.

There is evidence that languages that use Chinese characters could be encoded under an 8-bit scheme. For example, two articles in Computer’s last special issue on such matters (Jan. 1985) proposed encoding the Chinese official spoken language, Pǔtōnghuà, using its Pinyin alphabetic system. As recently as last year, an 8-bit encoding system has been introduced in South Korea in which the Korean alphabet is used to encode the Chinese characters they use. This system is being adopted in many circles.

There is some reason to hope that a family of 8-bit text encoding standards could be designed to suit all the world’s writing systems, a family that would provide for cheap, efficient, and effective machinery for encoding, storing, and transmitting the world’s text. The most important gain from adopting text encoding standards for each writing system (concordant between writing systems) is that simple and effective equipment, and text processing software such as plaintext editors, e-mailers, and Internet search engines, could be largely independent of the language within any writing system.

Also, text could be compatibly handled across writing systems by such software even if the equipment wouldn’t display or print the text “correctly.” Words would still look like words, numerals like numerals, and punctuation like punctuation.

Neville Holmes is a senior lecturer in the School of Computing at the University of Tasmania. Contact him at neville.holmes@utas.edu.au.

Editor: Charles Severance, Michigan State University, Department of Computer Science, 1338 Engineering Bldg., East Lansing, MI 48824; voice (517) 353-2268; fax (517) 355-7516; crs@egr.msu.edu; http://www.egr.msu.edu/~crs
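
Two claims in the column lend themselves to a small check: that 16 bits per character doubles the size of text an 8-bit scheme could hold, and that accented names are garbled when 8-bit text written under one code table is read under another. The Python sketch below is only an illustration and is not part of the original column. Latin-1 and UTF-16 are assumed stand-ins for "an 8-bit scheme" and a 16-bit character set, code page 437 is an assumed example of a mismatched 8-bit table, and the Swedish-style name is invented.

```python
# Illustrative sketch only. The codecs are assumptions chosen for the demo:
# Latin-1 stands in for an 8-bit scheme, UTF-16 for a 16-bit character set,
# and code page 437 for a different, incompatible 8-bit table.
SAMPLES = ["café", "cliché", "Åsa Sjöberg"]  # hypothetical sample strings


def encoded_sizes(text):
    """Return (byte count in 8-bit Latin-1, byte count in 16-bit UTF-16)."""
    eight_bit = text.encode("latin-1")
    sixteen_bit = text.encode("utf-16-le")  # little-endian, no byte-order mark
    return len(eight_bit), len(sixteen_bit)


def garble(text, written_as="latin-1", read_as="cp437"):
    """Encode text under one 8-bit table and decode it under another."""
    return text.encode(written_as).decode(read_as)


if __name__ == "__main__":
    for sample in SAMPLES:
        narrow, wide = encoded_sizes(sample)
        print(f"{sample!r}: {narrow} bytes (8-bit), {wide} bytes (16-bit), "
              f"misread as {garble(sample)!r}")
```

Unaccented letters survive the mismatch because both 8-bit tables agree on the ASCII range; only the characters above it, such as the accented letters in names, come out wrong.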