Document Formats in the Division of Informatics


by Paul Anderson


Division of Informatics
University of Edinburgh

1    Summary
 * It is likely that there will continue to be several different
   document preparation systems in use within the Division, and
   collaborative document preparation may involve document exchange
   in various formats. However:

 * Documents should be distributed and published only in ASCII text,
   HTML, or PDF. Wherever possible, PDF should not be the only format
   available.

 * ASCII text is the preferred format for email, and HTML or document
   attachments should only be used where plain text is inadequate.

 * Care is required in converting to these formats so that the
   resulting documents conform to accepted standards.

2    Background
Traditionally, most academics within the Division have used a document
preparation system such as Latex. More recently, many documents are
being produced in proprietary formats, such as Microsoft Word,
especially documents originating outside the Division. The increasing
use of PCs by administrative staff, and the increasing number of
students familiar with Microsoft packages, mean that the number of
such documents is growing. Several other formats, such as Postscript,
DVI, PDF and HTML, have all been used to publish and exchange
documents. Many of these formats can only be handled by restricted
groups of users, on particular machines, and are unsuitable for
sharing documents throughout the Division and beyond. This paper
proposes a policy for the format of documents exchanged and published
within the Division.

3    Document Preparation
It appears unreasonable to attempt to standardise on a single
document preparation system:

Latex is freely available on all systems, and documents can easily be
exchanged in an editable form between different platforms. It is also
well suited to producing large, structured documents, and it has
incomparable features for handling mathematics. Latex does, however,
have a steep learning curve and is not really suitable for small
documents where tight user control of the format is required.

Microsoft Word is simple to use for small documents and offers the
user considerable WYSIWYG control over the layout. However, it is not
really appropriate for large documents, and it has a tendency to
produce variations in style which look unprofessional. The facilities
for mathematics are usually inadequate compared to Latex. Unless they
are converted to some other format, Word documents cannot be
processed automatically by other software for purposes such as
indexing and reformatting for Web use.

More importantly, however, Microsoft Word documents are extremely
difficult to exchange between systems. There is no good way of
reliably exchanging Word documents between Unix systems in such a way
that they can be co-operatively edited; they can be read and created
on a Unix system in several ways, but the format conversion is
invariably inadequate. Providing widespread general access to
Word-compatible software for people using Unix is probably not
feasible. Even between different Microsoft systems, Word documents
are incompatible if the version of Word or the platform is different.
Each individual user must pay to license the latest version of Word
to maintain compatibility. This also has implications for archiving
of documents; since Word formats rapidly become obsolete, there is a
good chance that such documents will be unreadable in several years'
time.

Some other packages, such as WordPerfect, Applixware and StarOffice,
are also in use. These provide some degree of compatibility between
PCs and Unix, but are not fully compatible with Microsoft documents.

4    Suggested Policy
We are not aware of any complete technical solution to this problem,
but a Division policy on the format of common documents would satisfy
many of the current [...]

Since there is no reliable two-way conversion between the common
document formats, different approaches are proposed for different
purposes:

4.1 Document Publishing
The simplest situation is where documents are published with the
intention that recipients will read, but not edit, them. This
includes Web publishing, and distribution of papers intended only for
printing or online reading. In this case, we simply require a format
that can be read by freely available software on all platforms. There
are three obvious candidates:

 * ASCII text is trivially readable on any system but supports very
   little in the way of document layout.

 * HTML can be created from most formats and easily read with any Web
   browser. This is convenient to read online but does not produce
   very good printed output, and does not allow fine control over the
   layout. Some complex documents may not translate adequately into
   HTML.

 * PDF can be created from most other formats [6.1]. It produces a
   faithful representation of the layout and can be read on most
   platforms using a freely available viewer. This is particularly
   suitable for producing printed documents, but somewhat slow and
   less convenient to read online.

This can be viewed as a sliding scale: as the user's control of
document style increases, the accessibility of the document to
potential readers decreases.

We propose that all three formats be acceptable for document
publication, the particular format being chosen to best suit each
individual document. As an example, this document is available in PDF
(documents.pdf), HTML (documents.html) and ASCII text
(documents.txt).

Care must be taken when converting documents into these formats to
ensure that the resulting document is as portable as possible [6].

4.1.1 Text Extraction: We consider it important to be able to extract
the text stream from any published document. Publishing only PDF does
not currently permit this; the documents cannot be read by text-based
systems such as web browsers for the visually impaired, or downloaded
into hand-held machines, nor can they be easily searched, indexed, or
otherwise processed. For this reason, we would recommend publication
in more than one format for such documents, wherever possible; for
example, a document published on the Web might be made available in
both PDF (for printing) and HTML (for online viewing and other
processing).

4.1.2 Email: The use of plain ASCII text is encouraged for email
wherever possible. This avoids the problem of email attachment
compatibility, makes it straightforward to include quotes in replies,
and permits keyword searching of mail archives. In particular the
HTML and rich text messages produced by some mailers, as well as the
notorious Word attachments are [...]

4.1.3 Some Unsupported Formats: Microsoft Word .DOC files and DVI in
particular are unreadable by many users and unsuitable for document
publishing. XML is being widely discussed, but is not yet mature
enough for general use.

5    Collaborative Documents
There is no good solution for documents which must be passed between
users for collaborative editing where different systems are being
used; none of the document conversion programs or "Word-compatible"
packages are adequate for handling Word documents on Unix [...]

In many cases, plain text or HTML is adequate. Latex can also be
transmitted as plain text and processed on many different systems.
Use of Latex by administrative staff would probably require specific
training.

In those cases where it is necessary to use a proprietary format,
care should be taken to convert documents into a more acceptable
format before publishing or distributing to other users.
Collaborative editing of Word documents probably requires all
participants to have access to machines running Microsoft software.

6    Implementation
For document exchange and publication, standards are extremely
important. The fact that a document can be viewed successfully on the
local machine does not mean that it is readable by other users. Care
is required when preparing and converting documents to ensure that
the result conforms to the appropriate standards.
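
For the plain-text format, one simple conformance check is that a file really contains only ASCII characters. The following Python sketch illustrates the idea; the function name non_ascii_lines is hypothetical and is not part of any tool mentioned in this paper.

```python
def non_ascii_lines(text):
    """Return (line_number, line) pairs for lines containing non-ASCII characters.

    Hypothetical helper for illustration: an empty result suggests the
    text is safe to distribute as plain ASCII.
    """
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if not line.isascii()
    ]
```

A document that returns an empty list here can be sent as plain text without further conversion; any reported lines would need to be rewritten or transliterated first.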

6.1 PDF
PDF can be generated from Postscript on Unix using ps2pdf. The
commercial Adobe Distiller package probably produces better quality
PDF and provides extra features such as thumbnails. Postscript can be
generated from all known formats, including Latex (using [...]).
Versions of Distiller are available for Windows and Macintosh, where
they can be installed and accessed easily from any package as a
printer driver.

Care is required when creating PDF to ensure that appropriate fonts
are used. In general, it is best to use standard Adobe fonts;
otherwise very large files can be generated, and the quality of the
resulting printout can be poor. This problem is common with the
bitmap CM fonts generated by Latex, and font packages such as times
usually produce better results.
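
As a rough illustration of the Unix route described above, the following Python sketch assembles the commands for a Latex to DVI to Postscript to PDF chain. It rests on assumptions: the use of dvips for the DVI-to-Postscript step is our choice for the example and is not stated in the text, and the function names pdf_commands and build_pdf are hypothetical.

```python
import subprocess


def pdf_commands(basename):
    """Return the commands for a Latex -> DVI -> Postscript -> PDF chain.

    Assumption: dvips performs the DVI-to-Postscript step; the text
    above names only ps2pdf for the final conversion.
    """
    return [
        ["latex", f"{basename}.tex"],
        ["dvips", "-o", f"{basename}.ps", f"{basename}.dvi"],
        ["ps2pdf", f"{basename}.ps", f"{basename}.pdf"],
    ]


def build_pdf(basename):
    """Run each step in order, stopping at the first command that fails."""
    for command in pdf_commands(basename):
        subprocess.run(command, check=True)
```

Calling build_pdf("paper") would process paper.tex in the current directory, provided that latex, dvips and ps2pdf are installed. Whether the result uses standard fonts still depends on the font packages selected in the Latex source.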

6.2 HTML
HTML can be generated in many different ways. The latex2html package
can automatically convert many Latex documents (such as this one).
Microsoft Word can export directly into HTML (note 4), although the
quality of the resulting HTML is frequently very poor.

Care should be taken to ensure that the resulting HTML is compliant
with the official standard; otherwise it may be unreadable or
corrupted on some browsers. The HTML Validator (note 5) can be used
to check [...]

The lynx Web browser can generate a reasonable
ASCII representation from many HTML documents.
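
Where lynx is not available, a similar though much cruder text stream can be extracted with a short script. The sketch below uses Python's standard html.parser module; it is only an approximation of what a text browser does (it ignores tables, links and character-set issues), and the names TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML document, skipping scripts and styles."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "br", "div", "li", "tr", "title",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of html, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

The output is suitable for indexing or searching, but for a readable ASCII version of a document a text browser such as lynx will normally give better layout.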

    4 Using latex2html and importing the resulting HTML into Word
      provides a better alternative than plain text for converting
      Latex documents into Word format.

