Supporting Complex Scripts (such as Arabic and Hebrew) in your Windows 2000™ Application
F. Avery Bishop
Senior Program Manager, Microsoft Corporation
Agenda:
Overview of character encoding, Unicode
Guidelines for supporting complex scripts
Right-to-left layout of applications
Multilingual User Interface
Overview of Character Encoding and Unicode
Why do character set differences matter?
Historically, they fragmented code bases for both Windows and applications
  Single byte: European editions
  Double byte: Far East editions
  Bi-directional: Middle East editions
They make it difficult to share data
They make it difficult to develop multilingual applications
Example: Multiple Hebrew Character Encodings
8-bit Hebrew encodings still in use:
  Windows codepage 1255
  OEM (DOS) codepage 862
  Visual Hebrew encodings (many exist)
Example: Multiple Arabic Character Encodings
8-bit Arabic encodings supported in Internet Explorer 4.0/CS:
  ASMO-708
  DOS 720
  ISO 8859-6
  Windows codepage 1256
  Other proprietary encodings
Logical vs Visual Encoding
Logical:
  Storage order is same as typing order
  Allows natural text processing:
    Search
    Resizing (e.g., in web pages)
    IPC: Select, cut & paste
Visual:
  Storage order is same as display order
  Natural text processing difficult or impossible
  Cannot always map back to logical order
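A minimal sketch of why search is natural in logical order but not in visual order. ASCII letters stand in for Hebrew letters, and the function names logical_to_visual_rtl and contains_word are illustrative only. Real visual encodings also handle mixed-direction text, which this sketch ignores.

```c
#include <stddef.h>
#include <string.h>

/* Copy `logical` into `visual` in reverse order, the way a visual-order
   encoding would store a purely right-to-left run.
   `visual` must have room for strlen(logical) + 1 bytes. */
void logical_to_visual_rtl(const char *logical, char *visual)
{
    size_t n = strlen(logical);
    for (size_t i = 0; i < n; i++) {
        visual[i] = logical[n - 1 - i];
    }
    visual[n] = '\0';
}

/* Return 1 if `word`, given in typing order, occurs in `text`, else 0. */
int contains_word(const char *text, const char *word)
{
    return strstr(text, word) != NULL;
}
```

A word typed by the user is found in the logical buffer but not in the visual one, unless the search term is also reversed first.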
What is Unicode?
A 16-bit character encoding:
  A mapping of characters to numbers
  Syntax rules for display of complex scripts
  Not a font or glyph encoding!
  Not a sort algorithm!
Includes all characters in common use in modern scripts (and others)
Basis for the ISO 10646 character encoding standard
Native text encoding for Windows NT
[Figure: Unicode / ISO 10646, a 16-bit international character encoding. Code space from 0xFFFF down to 0x0000: Compatibility; Private use; Future use; Ideographs (Hanzi, Kanji, Hanja); Hangul; Kana; Symbols; Punctuation; Thai; Indian; Arabic, Hebrew; Greek; Latin; ASCII. Example code values: 0041 (A), 9662, FF96, 4F85, 0000 (null).]
Windows 2000 uses Unicode version 2.0
Relatives of Unicode
ISO/IEC 10646:
  32-bit ISO standard of 64K x 64K "planes"
  Unicode repertoire is plane 0
UTF-7:
  7-bit transformation format
  Not widely used
UTF-8:
  8-bit transformation format
  Used in web pages and some email
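As an illustration of the 8-bit transformation format, the sketch below encodes a single 16-bit code unit from plane 0 as UTF-8. The function name utf8_encode_bmp is made up for this example, and surrogate values are assumed not to occur in the input.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one 16-bit Unicode value (plane 0, not a surrogate) as UTF-8.
   Writes 1 to 3 bytes to `out` and returns the number of bytes written. */
size_t utf8_encode_bmp(uint16_t cp, unsigned char out[3])
{
    if (cp < 0x80) {                 /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

ASCII text is unchanged by this encoding, while Hebrew and Arabic letters take two bytes each and ideographs take three.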
Unicode in Win32: the W and A Entry Points
Two kinds of window classes: Unicode, ANSI
Win32 API has two versions of most functions:
  "W" (wide) version handles Unicode
  "A" (ANSI) version assumes the system default code page (character encoding)
Unicode in Win32 …
Macros resolve to W or A entry point
Example: Macro for RegisterClassEx

    #ifdef UNICODE
    #define RegisterClassEx RegisterClassExW
    #else
    #define RegisterClassEx RegisterClassExA
    #endif
To create Unicode application:
Compile with -DUNICODE, or
Use W routines explicitly
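The same pattern can be tried outside Windows. In the sketch below, TextLengthA, TextLengthW, TextLength and TXT are hypothetical stand-ins, not Win32 names. They only show how one source line binds to the W or the A entry point depending on whether UNICODE is defined at compile time.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical "A" entry point: text in the default code page. */
size_t TextLengthA(const char *s)    { return strlen(s); }

/* Hypothetical "W" entry point: text as wide (Unicode) characters. */
size_t TextLengthW(const wchar_t *s) { return wcslen(s); }

/* Generic function name and string-literal macro, chosen at compile time. */
#ifdef UNICODE
#define TextLength TextLengthW
#define TXT(s)     L##s
#else
#define TextLength TextLengthA
#define TXT(s)     s
#endif

/* This line compiles unchanged against either entry point. */
size_t greeting_length(void)
{
    return TextLength(TXT("shalom"));
}
```

Calling TextLengthW directly, instead of going through the macro, corresponds to the second option of using W routines explicitly.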
For Applications that Must Also Run on Windows 98…
Use Unicode everywhere with single binary, two code paths:
  On Windows NT, use W entry points
  On Windows 98, convert Unicode to ANSI, use A entry points
  See sample GLOBALDV for example
See the April 1999 Microsoft Systems Journal (MSJ) for details and other options
Summary: Use Unicode if you can!
Represent all text with one unambiguous encoding
Support multilingual text easily
Avoid special processing for variable byte-length characters
Use standard encoding recognized throughout the industry and the world
Support new scripts that are only supported through Unicode
Guidelines for Supporting Complex Scripts in Applications
1. Displaying Complex Scripts in Plain-text
In Win32 applications, use the standard edit control
Use standard Win32 API display functions:
  Win32 APIs: ExtTextOutW or DrawTextW
  ScriptString API in Uniscribe
Pitfalls in Enabling for Complex Scripts
When displaying typed text:
  Do not output characters one by one!
  Do save text in a buffer and display the whole string with Uniscribe or Win32 API
To measure line lengths:
  Do not sum cached character widths
  Do use a GetTextExtent function or Uniscribe
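A toy model of why summing cached widths goes wrong. The widths and the single lam-alef ligature rule are invented for illustration. A real shaping engine applies many more rules, which is the reason to ask GetTextExtent or Uniscribe for the width of the whole string.

```c
#include <stddef.h>

/* Hypothetical advance widths, in pixels, for a made-up font. */
enum { WIDTH_DEFAULT = 8, WIDTH_LAM_ALEF_LIGATURE = 10 };

#define LAM  0x0644u  /* ARABIC LETTER LAM  */
#define ALEF 0x0627u  /* ARABIC LETTER ALEF */

/* Naive measurement: one cached width per character, added up. */
int width_by_summing(const unsigned short *text, size_t len)
{
    (void)text;  /* every character has the same cached width here */
    return (int)len * WIDTH_DEFAULT;
}

/* Whole-string measurement: lam followed by alef is drawn as a single
   ligature glyph, so the pair is measured as one unit. */
int width_of_string(const unsigned short *text, size_t len)
{
    int total = 0;
    for (size_t i = 0; i < len; i++) {
        if (text[i] == LAM && i + 1 < len && text[i + 1] == ALEF) {
            total += WIDTH_LAM_ALEF_LIGATURE;
            i++;  /* the ligature consumes both characters */
        } else {
            total += WIDTH_DEFAULT;
        }
    }
    return total;
}
```

For a four-character string that contains lam followed by alef, the two functions disagree, so a line break computed from the summed widths would land in the wrong place.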
2. Displaying Complex Scripts in Simple Formatted Text
In Win32 applications, use the rich edit control
In web pages for Internet Explorer 5.0, use the Document Object Model
3. Displaying Complex Scripts in Text with Advanced Formatting and Layout
Use script APIs (“Uniscribe”)
See MSJ article of November 1998
Overview of Uniscribe
Background and Purpose of Uniscribe
Low level APIs
High level APIs
For details see November 1998 MSJ article
The Uniscribe DLL: USP10.DLL
Platforms
  Windows 2000
  Windows NT 4
  Windows 98
  Windows 95 (excluding Far East)
Single worldwide binary
Installs with Windows 2000, IE5, Office 2000
Hides language details
  Syllable structure (Indian, Thai)
  Contextual shaping (Arabic, Indic)
  Caret placement (all)
  Wordbreak (Thai)
  National digits (Arabic, Indic, Thai)
  Bidirectional layout (Arabic, Hebrew)
Hides Unicode OS details
  APIs are Unicode on all platforms
  Hides glyph codes
  Hides font differences:
    Shaping tables
    Fixed repertoire fonts
Uniscribe Structure
[Figure: Uniscribe structure.
Client: Display, Layout, Caret, Mouse, Justify. Measurer and Renderer. Entry points: Itemize; Shape, Place and TextOut; XtoCP & CPtoX.
Uniscribe: Unicode BiDi algorithm; Arabic, Hindi, Tamil and Thai shaping engines; Vietnamese shaping; Hebrew engine; CMAP & width tables; OpenType library.
GDI: GetCharABCWidthsI, GetGlyphOutline, ExtTextOut with ETO_GLYPH_INDEX.]
Shaping engines
  Per script
  Understand language rules
  Understand font features:
    OpenType provides full control
    Many older fixed layout fonts
[Figure: component layering: Application, LPK.DLL, USER, GDI, Uniscribe.]
Low level APIs
Support:
  Formatting text
  Style runs
  Measurement
  Paragraph filling
  Rendering
  Information needed for font fallback
Summary (Script… functions):
  ScriptItemize
  ScriptShape, ScriptPlace
  ScriptBreak, ScriptLayout
  ScriptTextOut
  ScriptCPtoX, ScriptXtoCP
High level APIs
Purpose
Analysis
Display
Font fallback
Purpose
For Windows 2000:
  ExtTextOut
  DrawText
  System edit control
Cross-platform Unicode plaintext display
Easier than low level APIs
Summary of ScriptString APIs:
ScriptString… functions:
  ScriptStringAnalyse
  … query analysis ...
  ScriptStringOut
  ScriptStringFree
Provides simple font fallback
Implementing Right-to-left Layout in Applications
Background On RTL Layout (“Mirroring”) For BiDi Localization
Localized Arabic and Hebrew Windows® is laid out from right to left
In the past this was done “ad hoc” or not at all
Windows 2000 and BiDi Windows 98 include mechanisms to “automatically” mirror shell and applications
Also helpful for multilingual user interface support
Mirroring in System Based on Coordinate Transformation
Origin (0,0) in upper RIGHT corner of window
X scale factor = -1, x values increase from right to left
[Figure: x-axis direction. Default (LTR) Window: x runs from 0 to 1, left to right. Mirrored (RTL) Window: x runs from 0 to 1, right to left.]
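The transformation can be sketched as a one-line mapping. The function name mirrored_to_physical_x is hypothetical, and the sketch assumes a window `width` pixels wide whose valid x values are 0 to width - 1. The system performs the equivalent mapping internally, so applications normally never compute it themselves.

```c
/* Map an x coordinate in a mirrored (RTL) window, where x = 0 is the
   rightmost column, to the physical column counted from the left edge.
   Assumes 0 <= logical_x < width. */
int mirrored_to_physical_x(int logical_x, int width)
{
    return (width - 1) - logical_x;
}
```

Applying the mapping twice returns the original value, which matches an x scale factor of -1 about the window's center.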
More Background on Mirroring…
Developers use programming interfaces and Windows style bits
Automatic inheritance of RTL property:
  Child window of RTL window defaults to RTL
  You can disable inheritance of RTL property
APIs provided to disable mirroring of bitmaps
Implementing Mirroring in Win32 Applications: Standard Windows
Use SetProcessDefaultLayout:
Affects all windows created thereafter

    SetProcessDefaultLayout(LAYOUT_RTL);
    SetProcessDefaultLayout(0);  // Reset to LTR
Or call CreateWindowEx:
Use extended style WS_EX_LAYOUTRTL
To inhibit mirroring in child windows, also set WS_EX_NOINHERITLAYOUT
Changing Layout of Existing Window
    BOOL IsRTLLayout;  // TRUE iff window is to be mirrored

    // ... Get new value of IsRTLLayout

    LONG lExStyles = GetWindowLongA(hWnd, GWL_EXSTYLE);

    // Check whether new layout is opposite current layout
    if (!!(IsRTLLayout) != !!(lExStyles & WS_EX_LAYOUTRTL)) {
        lExStyles ^= WS_EX_LAYOUTRTL;  // Toggle layout

        // Set extended styles to new value
        SetWindowLongA(hWnd, GWL_EXSTYLE, lExStyles);

        // Update client area
        InvalidateRect(hWnd, NULL, TRUE);
    }
Controlling Mirroring of a Device Context
SetLayout(HDC hDc, DWORD dwLayout)

    dwLayout = 0;  // will layout LTR
    dwLayout = LAYOUT_RTL;  // will layout RTL
    dwLayout = LAYOUT_RTL | LAYOUT_BITMAPORIENTATIONPRESERVED;  // will layout RTL, but not bitmaps

GetLayout(HDC hDc)
Returns the layout settings for an hDc
Mirroring in Win32 Applications: Dialogs
Set WS_EX_LAYOUTRTL in dialog template
Visual Studio 6 Dialog editor:
  Has option for RTL layout
  BUG in Visual Studio 6: Writes WS_EX_LAYOUT_RTL to RC file!
  Must correct RC file by hand to compile
  Will be fixed in future version
Mirroring in Win32 Applications: Message Boxes
Set MB_RTLLAYOUT option bit
Guidelines for using RTL Layout
Using coordinates
  Use GetWindowRect with care
  Use client, rather than screen coordinates
  Do not mix screen coordinates and client coordinates
  Use MapWindowPoints to map rectangles, instead of ClientToScreen and ScreenToClient
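A simplified model of why rectangles should be mapped as a unit. Rect, client_to_screen_x and both mapping functions are hypothetical stand-ins, not Win32 code. The sketch only shows that mapping the two corners independently in a mirrored window leaves left greater than right, which a rectangle-aware mapping such as MapWindowPoints with two points can correct by swapping.

```c
typedef struct { int left, top, right, bottom; } Rect;  /* stand-in for RECT */

/* Hypothetical mirrored mapping from client x to screen x.
   `screen_right` is the screen x of the client area's right edge. */
static int client_to_screen_x(int x, int screen_right)
{
    return screen_right - x;
}

/* Point-by-point mapping: in a mirrored window this yields left > right. */
Rect map_corners_separately(Rect r, int screen_right)
{
    r.left  = client_to_screen_x(r.left,  screen_right);
    r.right = client_to_screen_x(r.right, screen_right);
    return r;
}

/* Rectangle-aware mapping: swap the edges so that left <= right again. */
Rect map_as_rectangle(Rect r, int screen_right)
{
    Rect out = map_corners_separately(r, screen_right);
    if (out.left > out.right) {
        int tmp = out.left;
        out.left = out.right;
        out.right = tmp;
    }
    return out;
}
```

Code that later computes a width as right minus left gets a negative number from the first function and the expected positive width from the second.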
Windows 95 does not support mirroring!
Implementing Multi-language User Interface in Applications
Guidelines for Multilanguage User Interface
Initialize to current UI language
  Windows 2000: GetUserDefaultUILanguage()
  Others: Use the language of the OS
  See function InitUiLang in Globaldev sample code
Guidelines for Multilanguage User Interface
Allow user to select UI language
  Put language-dependent resources in resource DLLs
  Use naming convention, e.g., res.dll
  Find all resource DLLs, put up list box of choices
See module UPDTLANG.CPP in Globaldev Sample
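One possible naming convention, sketched under stated assumptions: the pattern of "res" followed by the language identifier as four hex digits is hypothetical and is not taken from the Globaldev sample. The function name resource_dll_name is likewise invented for this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Build a resource DLL file name from a 16-bit language identifier.
   The "res" + 4 hex digits + ".dll" pattern is a hypothetical convention.
   Returns 1 on success, 0 if `buf` is too small. */
int resource_dll_name(unsigned short lang_id, char *buf, size_t buf_size)
{
    int n = snprintf(buf, buf_size, "res%04x.dll", (unsigned)lang_id);
    return n > 0 && (size_t)n < buf_size;
}
```

With this convention, the language identifier 0x040d (Hebrew) gives res040d.dll, and an application can list the available UI languages by scanning its directory for files that match the pattern.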
Summary
Use Unicode to encode text if you can
Use controls to display text and accept user input
Use Uniscribe for advanced formatting
Use new RTL layout API for applications localized to RTL languages
Consider multilingual user interface
Further Information and Resources
http://www.microsoft.com/globaldev (Watch for updates!)
MSJ articles, e.g.,
  Uniscribe: http://www.microsoft.com/msj/1198/multilang/multilangtop.htm
  Multilingual UI: http://www.microsoft.com/msj/0499/multilangUnicode/multilangUnicodetop.htm
Send suggestions to
nlshelp@microsoft.com