					      TOURISM AND TRAVEL LAW: ELECTRONIC RESOURCES FOR
     A CORPUS-BASED MULTILINGUAL GENERATION PROJECT*

                                 Gloria CORPAS PASTOR**


    1. Introduction

       The Internet offers a wealth of information on legal systems and documents, not only
in English as a lingua franca, but also in languages with a lesser web presence, such as
Spanish, German and Italian. Electronic legal resources are core to the TURICOR
project, as raw material for a corpus-based natural language generation (NLG) system
and as a reliable data reservoir for legal translation and comparative law. This
multidisciplinary R&D project is a joint effort of 22 researchers in Departments of
Translation and Interpreting, Documentation, Philology, History of Law and Legal
Institutions, Commercial Law, and Computing from three Spanish Universities ─
University of Málaga, the headquarters, University of Alcalá de Henares (Madrid) and
University Pablo de Olavide (Seville). The aim of this paper is to describe on-going
research with special reference to corpus building from electronic legal resources.


       2. The TURICOR Project – An Overview1

      In line with recent developments in corpus-based EBMT (example-based machine
translation), TM (translation memories) systems and other electronic tools (the
translator’s workbench), the TURICOR project sets out to explore the possibilities of
corpus linguistics for automatic text generation and specialised translation. Our final
objective is to develop a prototype NLG system for producing legal documents (tourism
contracts) in each of the four target languages2 in parallel. The starting point will not be
a source text in one language, but a language independent interlingua content
representation to be expressed by means of text sentences in any or all languages
selected. With this aim in mind, a multilingual corpus (both parallel and comparable)
will be compiled from tourism law websites on the Internet. A protocol will be laid out
for searching the WWW, and retrieving, encoding and storing (hyper)texts. The data
extracted from the corpus will provide researchers with a rich gamut of information
about tourism advertising strategies, restricted languages, terminology, specialised

*The research reported in this paper has been carried out in the framework of project TURICOR: A multilingual
corpus of tourism contracts (German, Spanish, English, Italian) for automatic text generation and legal
translation [TURICOR: compilación de un corpus de contratos turísticos (alemán, español, inglés, italiano) para
la generación textual multilingüe y la traducción jurídica]. (Spanish Ministry of Science and Technology, ref.
no. BBF2003-04616, 2003-2006).
** Senior lecturer in Translation and Interpreting (University of Málaga, Spain).
1
   This section follows closely G. Corpas Pastor’s paper “TURICOR: Compilación de un corpus de
contratos turísticos (alemán, español, inglés, italiano) para la generación textual multilingüe y la
traducción jurídica”, in E. Ortega Arjonilla et al. (eds.), Panorama actual de la investigación en
traducción e interpretación, Vol. II, Granada, Atrio, 2003, pp. 373-384.
2
  As legal systems vary in the case of transnational languages or even within the same country, language
varieties have also been diatopically restricted. Thus, the TURICOR project covers Spain, Germany, Italy,
the United Kingdom (England and Wales, Scotland, Northern Ireland), the Isle of Man, Eire and the United
States of America.
lexicography, comparative law, translationese, contrastive rhetoric and linguistics.
TURICOR will also prove an invaluable tool for translators’ training and the teaching of
languages for special purposes. In addition, the (e-commerce) tourism industry will
greatly benefit from both the NLG system and the knowledge and lexical databases to
be implemented within the project.

       2.1. Background and Hypothesis
       The Information Society3 has brought about dramatic changes not only in well-
established areas of scientific and technological development, but also in all aspects of
human life. In fact, there are plenty of websites and tutorials devoted to promoting ICT
(Information and Communication Technology) among students and ordinary citizens. A
representative example is the Internet for Information and Communication Technology
tutorial from the RDN Virtual Training Suite for further education of the University of
Bristol (UK)4 ─ a set of free, open-access online tutorials, initially designed to help
students, lecturers and researchers improve their Internet information literacy and IT
skills.
       The availability and pervasiveness of this technology worldwide has led
countries to place increasing emphasis on the opportunities and the fruits promised by
information, communication and multimedia/multilingual technology in the global
village. In this new world order that is driven by knowledge and exchange of
information and ideas, surviving in the information age therefore depends on access to
national and global information networks. Hence the EC’s latest policies and programmes
tend to be orientated towards e-learning, e-commerce and all sorts of technology
development. In particular, European scientific research programmes are clearly geared
towards the application of language engineering to the so-called “language industries”
in Europe’s emerging language and speech technology marketplace. Within the
framework of EC multilingual and multicultural research programmes, namely MLIS-
Multilingual Information Society (1996-1999), Human Language Technologies (1998-
2002) and e-Content (2001—), a number of projects5 have been undertaken on
automated processing and translation technologies, such as corpus-and-web-based
machine translation systems6, text summarising and multilingual text generation7.

3
  This report is available full text in electronic format, see eEurope: An Information Society for All.
Action Plan. Prepared by the Council and European Commission for the Feira European Council: 10-20
June 2000. http://europa.eu.int/information_society/eeurope/index_en.htm. (2 Dec. 2003).
4
  http://www.vts.rdn.ac.uk/ (1 Dec. 2003)
5
  Similar Spanish and European R&D projects are listed in Aguayo et al., “Traducción automática y
generación textual: herramientas, grupos y proyectos de investigación”, in G. Corpas Pastor (ed.),
Recursos documentales, terminológicos y tecnológicos para la traducción del discurso jurídico (español,
alemán, inglés, italiano, árabe), Granada, Comares, 2003, pp. 1-32. See also the HLT website:
http://www.hltcentral.org (2 Dec. 2003).
6
  Among the most relevant projects are INTERLEX (Developing General and Terminological Multilingual
Databases to be exploited in the Internet from Translation Dictionaries in Electronic Format), MULTEXT
(Multilingual Text Tools and Corpora), NL-TRANSLEX (Machine Translation for Dutch and
English/French/German), TRANSACCOUNT (Translation of Annual Account and Financial Reporting
Documents between IAS and FR Accounting System), TT2 (TransType2 - Computer-Assisted
Translation), MULTEXT-EAST (Multilingual Text Tools And Corpora For Central And Eastern European
Languages), METIS (Statistical Machine Translation UsIng Monolingual Corpora).
7
  Some outstanding examples are AGILE (Automatic Generation of Instructions in Languages of Eastern
Europe), APOLLO (An Open Workbench for Multinational Document Creation and Maintenance), GIST
(Generating Instructional Text), MABLE (Multilingual Authoring Of Business Letters), MANDES
(Integrated and Efficient Multilingual Document Management System with Translation and
      Tourism plays an increasingly important role not only in national economies and the
European Single Market, but also in the e-commerce of the global village8. Hence, the main
goal of the TURICOR project9 is to survey electronic resources and Internet-driven
e-contents for compiling a virtual, specialised multilingual corpus10 with a view to
implementing a prototype NLG system11 capable of generating tourism contracts (TCs)
in English, German, Italian and Spanish. We are fully aware that so-called tourism law12
is the epitome of an interdisciplinary field, as Pengilley13 rightly pointed out more
than a decade ago:
       Except in the case of specific regulatory legislation such as the licensing of travel agents,
for example, there is no such thing as the law of tourism and travel … The law speaks in terms
of general principle and one has to adapt such general principle to specific fact situations in the
travel and tourism industry. There is a law of competition. There is a law of contract. There is a
law of consumer protection. All apply to the travel and tourism industry.
      As in the case of travel agents licensing, some tourism contracts are governed by
specific regulations harmonised by International Laws and/or EC Directives, which
must therefore be transposed by all Member States in the form of specific laws, regulations
and administrative provisions. Within the scope of the TURICOR project, three main
types of so-called “tourism contracts” have been selected as objects of study, namely,
package travel contracts14, timesharing contracts15 and (air, sea, rail or road)



Layout/Editing Capabilities), METEO (Development and provision of multilingual information service),
MUSI (Multilingual Summarisation Tool for the Internet), etc.
8
   See Directive 2000/31/EC of the European Parliament and of the Council of 8 June 2000 on certain
legal aspects of information society services, in particular electronic commerce, in the Internal Market
('Directive on electronic commerce'). Official Journal L 178, 17/07/2000, pp. 1-16. File no. 32000L0031.
http://www.europa.eu.int/scadplus/leg/en/lvb/l24204.htm (2 Dec. 2003).
9
  For an overview of the project, see G. Corpas Pastor, “TURICOR: Compilación de un corpus de contratos
turísticos (alemán, español, inglés, italiano) para la generación textual multilingüe y la traducción
jurídica”, in E. Ortega Arjonilla et al. (eds.), Panorama actual de la investigación en traducción e
interpretación, Volume II, Granada, Atrio, 2003, pp. 373-384.
10
   According to the Expert Advisory Group on Language Engineering Standards (EAGLES), a corpus is “a
collection of pieces of language that are selected and ordered according to explicit linguistic criteria in
order to be used as a sample of the language” (“Text Corpora Working Group Reading Guide”, EAG-
TCWG-FR-2, 1996. http://www.ilc.cnr.it/EAGLES96/corpintr/corpintr.html (2 Dec. 2003).
11
    “Natural Language Generation (NLG) is the subfield of artificial intelligence and computational
linguistics that focuses on computer systems that can produce understandable texts in English or other
human languages. Typically starting from some nonlinguistic (sic) representation of information as input,
NLG systems use knowledge about language and the application domain to automatically produce
documents, reports, explanations, help messages and other kinds of texts”, in E. Reiter and M. Dale,
Building Natural Language Generation Systems, Cambridge, Cambridge University Press, 2000, p. 1.
12
    On tourism law and contracts, see, for example, A. Aurioles Martín, Introducción al Derecho
Turístico: Derecho Privado del Turismo, Madrid, Tecnos, 2002; R. Caballero Sánchez (ed.), Legislación
Sobre Turismo. Madrid, Mc Graw Hill, 2000; D. Grant and S. Mason, Holiday Law, London, Sweet &
Maxwell, 2003; and M. McDonald, European community tourism law and policy, Dublin, Blackhall,
2003.
13
   W. Pengilley, The Law of Travel and Tourism, London, Blackstone Press, 1990, p. 115.
14
   Also package travel, package holidays and package tours or just packages in accordance with Council
Directive 90/314/EEC of 13 June 1990. Official Journal L 158, 23/06/1990, pp. 59-64. File No.
31990L0314. http://www.europa.eu.int/scadplus/leg/en/lvb/l32019.htm (5 Nov. 2003).
15
    Also time-share contracts, timeshare contracts, timeshares or contracts relating to the purchase
of the right to use immovable properties on a timeshare basis, as in Directive 94/47/EC of
the European Parliament and the Council of 26 October 1994. Official Journal L 280, 29/10/1994, pp.
83-87. File No. 31994L0047. http://www.europa.eu.int/scadplus/leg/en/lvb/l32016.htm (5 Nov. 2003).
passenger transport contracts16. Other highly demanded contracts in the tourism
industry, subject to no specific regulations, such as travel insurance, hotel management,
or catering contracts, to name but a few, are also to be addressed in further stages of the
project.
       In close connection with the interdisciplinary nature of the TURICOR project and
its main goal, the following integrative, full-fledged hypothesis has been adopted as a
starting point: (i) it is possible to set up a protocol for compiling specialised corpora
which are representative of a given economic sector from just Internet electronic
resources; (ii) such specialised Internet-driven corpora could then be used to solve
pressing problems of natural language processing (NLP) research in order to improve
machine translation, translation memories, natural language generation and terminology
management systems; (iii) a corpus-based multilingual NLG system is expected to
significantly contribute to boosting the economic growth and rapid development of a
particular industry sector (e.g. tourism e-commerce and marketing); and (iv) specialised,
multilingual, Internet-driven electronic corpora are an added-value research tool for
spin-off studies on translation, terminology and text-linguistics, on the one hand, and
legal, economic, advertising or marketing issues, on the other hand.

      2.2. Objectives and Methods
      In the light of the fourfold hypothesis (see section 2.1.), our basic goal can be
further elaborated by stating three clear objectives, namely, (a) to build up a
multilingual macrocorpus (Turicor), composed of several parallel and comparable
subcorpora of tourism law documents derived from electronic resources available in the
WWW; (b) to design and implement a Turicor-based information-exchange
standardised computer programme for the automatic production of multilingual
documents; and, finally, (c) to study tourism law and the main textual forms as samples
of specialised communication in restricted registers in the four target languages.
      In order to meet our first objective ─ Internet-driven corpus compilation ─,
electronic resources for law and tourism in the WWW will be located and evaluated17,
according to a validation system developed within the framework of a previous R&D
research project 18, which draws on well-known standards19. Next, a protocolised work


16
   International private air transport law is most recently regulated by the Montreal Convention for the
Unification of Certain Rules for International Carriage by Air (28 May 1999), available in .html format
from http://tlc.unn.ac.uk/tlcpg.asp?pageID=5 (5 Dec. 2003). Former Conventions (Warsaw, 1929;
Geneve, 1948; Rome, 1952; Guadalajara, 1961; Montreal, 1978) and Protocols (The Hague, 1955;
Guatemala, 1971; Montreal, 1975 and 1978), as well as Chicago Acts and related Protocols can be
accessed as .PDF full-text bilingual version (English and French) from the Institute of Air & Space Law,
McGill University (Montreal, Canada) website: http://www.iasl.mcgill.ca/index2.htm (4 Dec. 2003).
17
   In a previous R&D project (Ref. No. PB98-1399, Spanish Ministry of Education, 1999-2002) it was
found that it is possible at least to find and file package travel general terms and conditions from various
Spanish and German travel agents and tour operators websites. At this stage, the WWW will be searched
for package travel contracts in English and Italian, plus other types of contracts greatly demanded by the
tourism industry (passenger transport, travel insurance, hotel management, catering, on-line bookings of
air fares, hotel rooms, rental cars, etc.) in the four languages involved in the project.
18
   For a detailed account of the PB98-1399 project, see the papers edited by G. Corpas Pastor (opus cit.)
and Mª E. Gómez Rojo’s review (in this volume).
19
    We refer to J. E. Alexander and M. A. Tate’s Web Wisdom: How to Evaluate and Create Information
Quality on the Web, Mahwah, New Jersey, Lawrence Erlbaum Associates, 1999; and also A. Cooke’s
book A guide to finding quality information on the Internet: selection and evaluation strategies, London,
Library Association Publishing, 1999.
procedure for ad hoc corpus creation will be established following the PB98-1399
project guidelines, including but not limited to: directions as to the selection of
appropriate information retrieval systems (I.R.S.) for downloading documents (legal
regulations, contract forms and samples) into the corpus database; detailed instructions
about the composition of the various subcorpora (diasystematic constraints of documents,
size, type and format considerations, number of languages, degree of communicative
specialisation, etc.); a set of coding tags (headers and DTD) following the TEI, as
developed in the previous project; a range of off-line web browsers to capture whole
web pages (contents and hypertextual structure) at once; and instructions about corpus
database management, alignment and concordancer tools.
       The second objective ─ implementing a corpus-driven multilingual prototype
NLG system ─ will require evaluation and corpus-validation of current state-of-the-art
NLG, EBMT and TM systems according to EAGLES standards20. For sentence
planning and generation, a domain interlingua ontology will be constructed upon a set
of general and tourism law concepts to be contrasted with any translation units obtained
after automatically aligning and segmenting bi-texts. Finally, multilingual NLG
software will be developed on the basis of the language-independent grammar and a
relational database, plus a combination of fuzzy-match algorithms and example-based
MT systems.
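
       By way of illustration only, the fuzzy-match component mentioned above can be
sketched in Python with the standard library. This is not the TURICOR implementation:
the function name, the 0.7 threshold and the English-Spanish segment pairs are all
assumptions made for the example, and character-level similarity stands in for whatever
matching metric the project eventually adopts.

```python
from difflib import SequenceMatcher


def fuzzy_matches(query, memory, threshold=0.7):
    """Return (score, source, target) tuples whose source resembles `query`.

    `memory` is a list of (source_segment, target_segment) pairs.
    Results are ordered from the best match to the worst.
    """
    scored = []
    for source, target in memory:
        # Character-level similarity between 0.0 and 1.0.
        score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if score >= threshold:
            scored.append((round(score, 2), source, target))
    return sorted(scored, reverse=True)


# Hypothetical aligned segments, invented for illustration only.
memory = [
    ("The organiser shall provide the consumer with a copy of the contract.",
     "El organizador facilitará al consumidor una copia del contrato."),
    ("The consumer may transfer the booking to another person.",
     "El consumidor podrá ceder la reserva a otra persona."),
]

hits = fuzzy_matches(
    "The organiser shall provide the consumer with a copy of the terms.",
    memory,
)
for score, source, target in hits:
    print(score, "|", source, "|", target)
```

A retrieved pair of this kind would serve as the example to be adapted, which is the
point at which fuzzy matching and example-based MT meet.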
       As a by-product of the two former objectives, the last objective involves
exploiting the data collected during the three-year project in various ways related to
discourse, communication and law. For example, corpus management will provide
invaluable data for terminological databases and formal text prototypes; the legal
discourse of the tourism industry will be finely characterised; national, Community
and international specific regulations governing tourism contracts will be reviewed and
compared; bilingual sub-corpora of original documents and their corresponding translated
(or target) texts will allow research into translationese, legal translation teaching and
transgenre; and even the prototype NLG system might serve as a basis for further
software development in the areas of Translation Technologies and Internet Access
Devices (IADs) for e-commerce and e-advertising (as it is rare to see an e-commerce
website without e-advertising!).


      3. ‘Package holidays’ regulations in Eire: a case study

      As described in the previous section, a major objective of the project is to mine the
Turicor macrocorpus from the World Wide Web automatically21. The Turicor
multilingual parallel subcorpus is a bi- or multilingual corpus of originals and their
translations into one or more languages. It will include strictly related ‘mirror’
documents in the project’s four target languages: (a) Community tourism and travel law
regulations; (b) any bi- or multilingual related websites retrieved from the Internet
(legislation, reference, forms and contracts); and (c) translations from professionals or

20
   As redefined by Hovy et al. (eds.). Multilingual Information Management: Current Levels and Future
Abilities. Report for National Science Foundation, 1999. http://www.cs.cmu.edu/~ref/mlim/index.html
(17 Sept. 2003), conformant to ISO/IEC 9126.
21
   Documents will be searched and retrieved from the Internet whenever possible. However, some
documents will have to be accessed from other electronic resources (CD-ROMs, for instance) or else
scanned. In addition, it should be pointed out that access to real samples of contracts can be extremely
difficult and time-consuming.
translation students. In its turn, the Turicor comparable subcorpus will encompass a
wide range of texts: (a) tourism and travel contract samples, (b) tourism and travel
legal forms, (c) relevant travel agency and tour operator websites, (d) domestic
tourism and travel regulations (Statutory Instruments, Acts of Parliament, Royal
Decrees, relevant judicial decisions, etc.). That is to say, original legal documents that
have been produced independently of each other in the four target languages, but that
are considered to be similar (therefore, comparable) in terms of text type, form and
function, topic, specialisation and so forth.
        An initial task in the plan will be, then, to search the web for eligible documents.
To illustrate the point, we will present a case study on automatic location and retrieval
of rules and regulations governing package holidays in the Republic of Ireland (Eire).

      3.1. Searching the Internet
      A reliable but expensive way to access legal information in the WWW is to
subscribe to commercial services such as Westlaw22, LexisNexis23 or Celex24. However,
the expense may not be worthwhile for simple research purposes, as basic
search skills enable users to access plenty of free electronic resources at a
mouse click.
      Any reliable search requires careful selection of relevant key words and
information retrieval systems. While indexation concepts are basic for global search
engines to automatically retrieve web page contents, an appropriate choice of I.R.S. can
be of paramount importance for more structured searches. For example, package
holidays, law and Republic of Ireland have been entered as key words for a first
Boolean search query using Google25 and All the Web26. However, the results obtained
are far from what was expected, as they contain a lot of noise and irrelevant information on travel
forums and chats, cheap flights and accommodation offers, advertising, tabloid news,
etc.
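
      The Boolean query and the noise problem just described can be sketched as follows.
This is a minimal Python illustration, not a tool used in the project: the helper
names, the list of noise words, the sample result titles and the placeholder URLs are
all hypothetical, and the exact operator syntax accepted by any given search engine is
an assumption.

```python
def boolean_query(terms):
    """Join key words with AND, quoting multi-word phrases."""
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return " AND ".join(quoted)


def drop_noise(results, noise_words):
    """Keep only the results whose title contains none of the noise words."""
    kept = []
    for title, url in results:
        lowered = title.lower()
        if not any(word in lowered for word in noise_words):
            kept.append((title, url))
    return kept


query = boolean_query(["package holidays", "law", "Republic of Ireland"])
print(query)

# Hypothetical result titles and placeholder URLs, invented for illustration.
results = [
    ("Cheap flights and package holidays to Ireland", "http://example.com/a"),
    ("Package Holidays and Travel Trade Act, 1995", "http://example.org/b"),
    ("Travel forum: holiday chat", "http://example.net/c"),
]
relevant = drop_noise(results, ["cheap", "forum", "chat", "offer"])
print(relevant)
```

A filter of this kind only removes the most obvious noise; it cannot supply relevant
documents that the global search engine failed to return in the first place.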




22
   http://web2.westlaw.com (5 Dec. 2003).
23
   http://www.lexisnexis.com/ (5 Dec. 2003).
24
   http://www.europa.eu.int/celex (5 Dec. 2003). Access to the “Expert Search” option requires a user
name and a password.
25
   http://www.google.com (5 Dec. 2003).
26
   http://www.alltheweb.com (5 Dec. 2003).
         Fig. 1. Global Search Query (All the Web).

      Attempts to narrow down the results by refining the key words tended to be
equally unsuccessful. This is partly because indexed key words do not seem to be the
real problem ─ any searches for Eire laws and regulations on package holidays have to
be redirected towards alternative information retrieval systems. A safer strategy would
be to resort to metasearch engines, such as Metacrawler27 and Highway6128, to find law
search engines:




27
     http://www.metacrawler.com (5 Dec. 2003).
28
     http://www.highway61.com (5 Dec. 2003).
      Fig. 2. Metasearch Query (Metacrawler).

      From there on, a next step will be locating websites devoted exclusively to the
Republic of Ireland legal system or just dealing with Eire legal resources as one of their
sections. A cursory look ended up with a good number of useful portals, gateways, legal
indices, resource guides and link pages, among which are the following:
AccessToLaw29, Carrow's Irish Law Links30, Legal-Island31, Lex Scripta: Legal
Megasites32, LLRX - Guide to European Legal Databases33, The Bar Council & Bar
Library of Northern Ireland34. According to Internet evaluation models, these top
quality websites would satisfy the criteria for efficient, valuable electronic information
resources, as they are updated on a regular basis (monthly, weekly or even daily), their
contents are logically ordered and accurate, identification details of webmasters, contact
experts or official bodies are systematically provided, graphic and multimedia design is
user-friendly, related links (e.g. law databases and e-journals) are carefully selected, etc.


29
   http://www.accesstolaw.com (5 Dec. 2003).
30
   http://www.carrow.com/linkirish.html (5 Dec. 2003).
31
   http://www.legal-island.com (5 Dec. 2003).
32
   http://www.lexscripta.com/legal/omnibus/megasites.html (5 Dec. 2003).
33
   http://www.llrx.com/features/europe.htm (5 Dec. 2003)
34
   http://www.barlibrary.com/links.htm (5 Dec. 2003).
Most websites are provided with an internal search engine for quick reference, while
some of them offer WWW search tutorials as yet another asset.
      In order to proceed with the search, we have chosen one of the aforementioned
added-value law directories: AccessToLaw. A well-structured gateway, it covers the United
Kingdom (England and Wales, Scotland, Northern Ireland), the Commonwealth
(Australia, Canada, Gibraltar, Malta, etc.) and other jurisdictions, such as the Channel
Islands, Isle of Man, Republic of Ireland, Europe and major World Law resources. As a
general resource, it provides links to legal search engines and gateways, learned legal
journals and reference books, law electronic libraries and publishers, on line solicitors
and barristers, professional organizations, etc. on a wide range of subject areas, such as
criminal law, ecclesiastical law, family law, international law, property or shipping law,
to name but a few.




      Fig. 3. Specific search (AccessToLaw - homepage).

      As regards the Republic of Ireland (Eire), AccessToLaw offers a wealth of
information within the “Other Jurisdictions” section. For instance, it includes an
electronic full-text version of the Irish Constitution of 1937, plus a list of amendments
effected since the Constitution was enacted in 1937 up to November 2002. Primary
sources can be mainly accessed via links to the Government of Ireland35 and
particularly to the Oireachtas36 and the Law Reform Commission. Acts, Instruments,
decisions, provisions etc. can also be found through the Irish Law Site hosted by
University College of Cork Law Faculty and its two database initiatives: BAILII
(British and Irish Legal Information Institute) and IRLII (Irish Legal Information
Initiative Site); the personal site of an independent member of the Irish Senate, Feargall
Quinn; and two directories for Northern Ireland and Eire law (the Legal Eagle Links
website of solicitor D. O' Reilly, and the Legal Island Site). Primary legislation (Acts of
the Oireachtas 1997 onwards) and secondary legislation (Statutory Instruments 1922
onwards) are contained in the Irish Statute Book (1922 onwards); courts and case law
are arranged by subject and alphabetically37 (Irish Supreme Court and Court of
Criminal Appeal Decisions 1997 onwards; Irish High Court Decisions 1996 onwards;
Irish Competition Authority Decisions 1991 onwards; Irish Information Commissioner's
Decisions 1998 onwards), whereas it is also possible to have access to other Irish law
materials (Irish Law Reform Commission Papers and Reports 1976 onwards, full text
Parliamentary Debates 1919, Bills and Explanatory Memoranda from the Houses of
Oireachtas, latest publications and annual reports issued by central government
departments, agencies and state sponsored bodies).
         Secondary sources are also well represented by numerous links to full-text
versions of electronic law journals, legal textbooks (e.g. D. Whelan’s Guide to Irish
Law, 2001), publications by government departments and state organisations,
University teaching materials, dictionaries and directories of legal professions, etc. For
instance, there is direct access to the EPPI (Enhanced British Parliamentary Papers on
Ireland 1801-1922) bibliography database, launched in 2003.

      3.2. Retrieving and processing the data
      Package holidays rules and regulations can be found in the Irish Statute Book
database (primary legislation, Acts of the Oireachtas). The Package Holidays and
Travel Trade Act, 199538 enables effect to be given to the Council Directive
90/314/EEC of 13 June 1990 of the European Communities on package travel, package
holidays and package tours. It amends39 The Transport (Tour Operators and Travel
Agents) Act, 198240. Both Acts can be cited together as The Transport (Travel Trade)
Acts, 1982 and 1995.
      Once the data have been located, the next step for corpus building is to retrieve
the corresponding documents. However, automatic downloading can be impaired by the
fractal, interactive, dynamic and graphical nature of WWW hypertexts. One major
problem is the one-by-one format access to single web pages/nodes, which can turn
downloading into a complex, time-consuming effort. For example, The Package
Holidays and Travel Trade Act, 1995 in HTML version consists of four parts ─ I.
Preliminary and General, II. Regulation of Travel Contract, III. Security, IV.
Amendment of Transport (Tour Operators and Travel Agents) Act, 1982 ─, divided
35
   An internal search engine locates information from all government sites.
36
   The Oireachtas (Parliament) consists of two Houses – the Dáil Éireann (the House of Representatives,
directly elected) and the Seanad Éireann (the Senate, indirectly elected).
37
   The “Courts Service: Ireland” Section includes information on the Irish courts system, court rules,
court offices and law terms, plus a legal diary, press releases, publications and legal links.
38
   http://www.irishstatutebook.ie/ZZA17Y1995S1.html (5 Dec. 2003).
39
   Also referred to: Companies Act, 1963 (No. 33), Hotel Proprietors Act, 1963 (No. 7), Petty Sessions
(Ireland) Act, 1851 (c. 93) and Public Offices Fee Act, 1879 (c. 58).
40
   http://www.irishstatutebook.ie/ZZA3Y1982.html (5 Dec. 2003).
into 34 sections (and subsequent subsections) plus a schedule. Downloading the whole
document means storing all its parts and sections one after the other, either as only one
document or else as 35 shorter documents! (In fact, single sections can be retrieved
individually from the WWW).




         Fig. 4. Package Holidays and Travel Trade Act, 1995 (Irish Statute Book Database).

      It should be pointed out, though, that global retrieval of WWW fragmented
content can be conveniently speeded up by offline browsers able to retrieve and store
whole websites (contents and navigation design), like GNU Wget41 or WebStripper42.
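
      As a hedged sketch of such whole-site retrieval, the Python fragment below assembles
a GNU Wget command line without running it. The options are those documented for
current GNU Wget releases; the start URL, the target directory and the one-second wait
are placeholders chosen for the example, not values taken from the project.

```python
import shlex


def build_wget_command(start_url, target_dir, wait_seconds=1):
    """Return the argument list for a recursive, offline-browsable download."""
    return [
        "wget",
        "--mirror",            # recursive retrieval with time-stamping
        "--no-parent",         # stay below the starting directory
        "--page-requisites",   # also fetch images and stylesheets
        "--convert-links",     # rewrite links so the copy works offline
        f"--wait={wait_seconds}",            # pause between requests
        f"--directory-prefix={target_dir}",  # where the copy is stored
        start_url,
    ]


# Placeholder URL and directory, used only to show the resulting command.
command = build_wget_command("http://example.org/act/", "corpus/raw")
print(shlex.join(command))
```

The list can be passed to subprocess.run(command, check=True) on a machine where Wget
is installed; building it separately keeps the retrieval settings easy to record in the
corpus documentation.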
      Further problems relate to the loss of meaningful parts of the hypertext, such as
graphic and multimedia components and bullets, logos, banners …, missing navigation
design and reading paths, relevant formal layout and format, etc. All that (and much
more) is lost in a plain text format pre-processed for corpus management purposes,
since the next stage in the project workflow involves conversion of the .HTML
document into .TXT format. For example, the following figure illustrates Part III,
section 25 (“Insurance”):




41
     Free software. http://www.gnu.org/software/wget/wget.html (5 Dec. 2003).
42
     http://www.webstripper.net (5 Dec. 2003).
Insurance. 25.-(1) The package provider shall have insurance under one or more
appropriate policies with an insurer authorised in respect of such business in a Member
State under which the insurer agrees to indemnify consumers (who shall be insured
persons under the policy), against- (a) the loss of all money paid over by them under
or in contemplation of contracts for relevant packages, and (b) where applicable to the
package concerned, the cost of repatriation of consumers based on administrative
arrangements established by the insurer to enable repatriation of such consumers, in
the event of insolvency of the package provider.(2) The package provider shall ensure
that it is a term of every contract with a consumer that the consumer acquires the
benefit of a policy of a kind mentioned in subsection (1) in the event of the
insolvency of the package provider.(3) In this section "appropriate policy" means one
which does not contain a condition which provides (in whatever terms) that no liability
shall arise under the policy, or that any liability so arising shall cease- (a) in the
event of some specified thing being done or omitted to be done after the happening of
the event giving rise to a claim under the policy, (b) in the event of the failure of
the policy holder to make payments to the insurer in connection with that policy or
with other policies, or (c) unless the policy holder keeps specified records or
provides the insurer with information therefrom.


       Fig. 5. Package Holidays and Travel Trade Act, 1995, section 25 (.TXT format).
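
       The kind of conversion that yields the plain text in Fig. 5 can be sketched with the
Python standard library. This is only an illustration under stated assumptions: the
class and function names are hypothetical, and the sample markup is invented around a
few words of section 25 rather than copied from the Irish Statute Book pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the content of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of `html` with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Invented markup; only the wording comes from section 25 of the Act.
sample = ("<html><body><h3>Insurance.</h3><p><b>25.</b>-(1) The package "
          "provider shall have insurance</p><img src='logo.gif'></body></html>")
text = html_to_text(sample)
print(text)
```

Headings, emphasis and the image all disappear in the output, which is exactly the loss
of layout and graphic information described above.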

         Documents retrieved from the Internet are stored in the corpus database both in
their original format (usually .HTML or .PDF) and in a plain format (.TXT) suitable for
corpus management. For each of them, a TEI-conformant DTD is provided. In this case
study, the “package travel” Act would belong to section (d) of the multilingual
comparable corpus43 (like other similar domestic legislation from the remaining
countries covered in the TURICOR project). In addition, it would be stored in both
.HTML and .TXT formats, it would be conveniently identified (DTD file) and it would
include pointers to (i) type of tourism contract [“packages”], (ii) language [“English”],
(iii) type of regulation [“domestic law”] and (iv) jurisdiction [“Eire”].
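
       The four pointers listed above could be recorded in a header along the following
lines. The sketch is loosely modelled on TEI header element names; since the project’s
actual DTD is not reproduced in this paper, the element layout, the "type" attribute
values and the function name are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_header(title, contract_type, language, regulation_type, jurisdiction):
    """Build a small TEI-like header carrying the four classification pointers."""
    header = ET.Element("teiHeader")

    # Bibliographic identification of the stored document.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title

    # Language and classification pointers.
    profile = ET.SubElement(header, "profileDesc")
    lang_usage = ET.SubElement(profile, "langUsage")
    ET.SubElement(lang_usage, "language").text = language
    text_class = ET.SubElement(profile, "textClass")
    keywords = ET.SubElement(text_class, "keywords")
    for kind, value in (("contract", contract_type),
                        ("regulation", regulation_type),
                        ("jurisdiction", jurisdiction)):
        ET.SubElement(keywords, "term", type=kind).text = value
    return header


header = build_header("Package Holidays and Travel Trade Act, 1995",
                      "packages", "English", "domestic law", "Eire")
xml_text = ET.tostring(header, encoding="unicode")
print(xml_text)
```

Keeping the pointers in a structured header means that subcorpora can later be selected
by contract type, language, regulation type or jurisdiction without reopening the texts.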


       4. Conclusion

       This paper has provided a brief summary of the TURICOR project, with a view to
corpus building from Internet electronic resources. A search methodology has been
illustrated by means of a case study on domestic package regulations in the Republic of
Ireland. This methodology comprises three main stages: (a) global Boolean search, (b)
law metasearch and (c) jurisdiction search. It could be successfully applied to all kinds
of legal searches, be they for domestic, international or Community laws. In short, the
TURICOR project is beginning to open new, exciting research avenues for comparative law,
legal translation, documentation and corpus-based NLP and NLG systems.

[Received 6 December 2003. Accepted for publication 14 December 2003]




43
  Similarly, Community regulations would belong, for instance, to the multilingual parallel corpus, as
they are translated into all official EC languages.

				