TOURISM AND TRAVEL LAW: ELECTRONIC RESOURCES FOR A CORPUS-BASED MULTILINGUAL GENERATION PROJECT*

Gloria CORPAS PASTOR**

1. Introduction

The Internet offers a wealth of information on legal systems and documents, not only in English as a lingua franca, but also in languages with a lesser web presence, such as Spanish, German and Italian. Electronic legal resources are core to the TURICOR project, both as raw material for a corpus-based natural language generation (NLG) system and as a reliable data reservoir for legal translation and comparative law. This multidisciplinary R&D project is a joint effort of 22 researchers in Departments of Translation and Interpreting, Documentation, Philology, History of Law and Legal Institutions, Commercial Law, and Computing from three Spanish universities: the University of Málaga (the headquarters), the University of Alcalá de Henares (Madrid) and Pablo de Olavide University (Seville). The aim of this paper is to describe ongoing research, with special reference to corpus building from electronic legal resources.

2. The TURICOR Project: An Overview1

In line with recent developments in corpus-based EBMT (example-based machine translation), TM (translation memory) systems and other electronic tools (the translator’s workbench), the TURICOR project sets out to explore the possibilities of corpus linguistics for automatic text generation and specialised translation. Our final objective is to develop a prototype NLG system for producing legal documents (tourism contracts) in each of the four target languages2 in parallel. The starting point will not be a source text in one language, but a language-independent interlingua content representation to be expressed by means of sentences in any or all of the languages selected. With this aim in mind, a multilingual corpus (both parallel and comparable) will be compiled from tourism law websites on the Internet. A protocol will be laid out for searching the WWW, and for retrieving, encoding and storing (hyper)texts.
The data extracted from the corpus will provide researchers with a rich gamut of information about tourism advertising strategies, restricted languages, terminology, specialised lexicography, comparative law, translationese, contrastive rhetoric and linguistics. TURICOR will also prove an invaluable tool for translator training and the teaching of languages for special purposes. In addition, the (e-commerce) tourism industry will greatly benefit from both the NLG system and the knowledge and lexical databases to be implemented within the project.

* The research reported in this paper has been carried out in the framework of project TURICOR: A multilingual corpus of tourism contracts (German, Spanish, English, Italian) for automatic text generation and legal translation [TURICOR: compilación de un corpus de contratos turísticos (alemán, español, inglés, italiano) para la generación textual multilingüe y la traducción jurídica] (Spanish Ministry of Science and Technology, ref. no. BBF2003-04616, 2003-2006).

** Senior lecturer in Translation and Interpreting (University of Málaga, Spain).

1 This section follows closely G. Corpas Pastor’s paper “TURICOR: Compilación de un corpus de contratos turísticos (alemán, español, inglés, italiano) para la generación textual multilingüe y la traducción jurídica”, in E. Ortega Arjonilla et al. (eds.), Panorama actual de la investigación en traducción e interpretación, Vol. II, Granada, Atrio, 2003, pp. 373-384.

2 As legal systems vary in the case of transnational languages or even within the same country, language varieties have also been diatopically restricted. Thus, the TURICOR project covers Spain, Germany, Italy, Great Britain (England and Wales, Scotland, Northern Ireland), the Isle of Man, Eire and the United States of America.

2.1.
Background and Hypothesis

The Information Society3 has brought about dramatic changes not only in well-established areas of scientific and technological development, but also in all aspects of human life. In fact, there are plenty of websites and tutorials devoted to promoting ICT (Information and Communication Technology) among students and ordinary citizens. A representative example is the Internet for Information and Communication Technology tutorial from the RDN Virtual Training Suite for further education of the University of Bristol (UK)4, a set of free, open-access online tutorials initially designed to help students, lecturers and researchers improve their Internet information literacy and IT skills. The worldwide availability and pervasiveness of this technology has led countries to place increasing emphasis on the opportunities promised by information, communication and multimedia/multilingual technology in the global village. In this new world order, driven by knowledge and the exchange of information and ideas, survival in the information age depends on access to national and global information networks. Hence the EC’s latest policies and programmes tend to be oriented towards e-learning, e-commerce and all sorts of technology development. In particular, European scientific research programmes are clearly geared towards the application of language engineering to the so-called “language industries” in Europe’s emerging language and speech technology marketplace. Within the framework of EC multilingual and multicultural research programmes, namely MLIS (Multilingual Information Society, 1996-1999), Human Language Technologies (1998-2002) and e-Content (2001 onwards), a number of projects5 have been undertaken on automated processing and translation technologies, such as corpus- and web-based machine translation systems6, text summarising and multilingual text generation7.
3 This report is available in full text in electronic format; see eEurope: An Information Society for All. Action Plan. Prepared by the Council and the European Commission for the Feira European Council: 19-20 June 2000. http://europa.eu.int/information_society/eeurope/index_en.htm (2 Dec. 2003).

4 http://www.vts.rdn.ac.uk/ (1 Dec. 2003).

5 Similar Spanish and European R&D projects are listed in Aguayo et al., “Traducción automática y generación textual: herramientas, grupos y proyectos de investigación”, in G. Corpas Pastor (ed.), Recursos documentales, terminológicos y tecnológicos para la traducción del discurso jurídico (español, alemán, inglés, italiano, árabe), Granada, Comares, 2003, pp. 1-32. See also the HLT website: http://www.hltcentral.org (2 Dec. 2003).

6 Among the most relevant projects are INTERLEX (Developing General and Terminological Multilingual Databases to be exploited in the Internet from Translation Dictionaries in Electronic Format), MULTEXT (Multilingual Text Tools and Corpora), NL-TRANSLEX (Machine Translation for Dutch and English/French/German), TRANSACCOUNT (Translation of Annual Account and Financial Reporting Documents between IAS and FR Accounting System), TT2 (TransType2 - Computer-Assisted Translation), MULTEXT-EAST (Multilingual Text Tools And Corpora For Central And Eastern European Languages) and METIS (Statistical Machine Translation UsIng Monolingual Corpora).

7 Some outstanding examples are AGILE (Automatic Generation of Instructions in Languages of Eastern Europe), APOLLO (An Open Workbench for Multinational Document Creation and Maintenance), GIST (Generating Instructional Text), MABLE (Multilingual Authoring Of Business Letters), MANDES (Integrated and Efficient Multilingual Document Management System with Translation and

Tourism plays an increasingly important role not only in national economies and the European Single Market, but also in the e-commerce of the global village8.
Hence, the main goal of the TURICOR project9 is to survey electronic resources and Internet-driven e-content for compiling a virtual, specialised multilingual corpus10 with a view to implementing a prototype NLG system11 capable of generating tourism contracts (TCs) in English, German, Italian and Spanish. We are fully aware that so-called tourism law12 is the epitome of an interdisciplinary field, as Pengilley13 rightly pointed out more than a decade ago:

Except in the case of specific regulatory legislation such as the licensing of travel agents, for example, there is no such thing as the law of tourism and travel … The law speaks in terms of general principle and one has to adapt such general principle to specific fact situations in the travel and tourism industry. There is a law of competition. There is a law of contract. There is a law of consumer protection. All apply to the travel and tourism industry.

As in the case of travel agents licensing, some tourism contracts are governed by specific regulations harmonised by international law and/or EC Directives, which must therefore be transposed by all Member States in the form of specific laws, regulations and administrative provisions. Within the scope of the TURICOR project, three main types of so-called “tourism contracts” have been selected as objects of study, namely, package travel contracts14, timesharing contracts15 and (air, sea, rail or road)

Layout/Editing Capabilities), METEO (Development and provision of multilingual information service), MUSI (Multilingual Summarisation Tool for the Internet), etc.

8 See Directive 2000/31/EC of the European Parliament and of the Council of 8 June 2000 on certain legal aspects of information society services, in particular electronic commerce, in the Internal Market ('Directive on electronic commerce'). Official Journal L 178, 17/07/2000, pp. 1-16. File no. 32000L0031. http://www.europa.eu.int/scadplus/leg/en/lvb/l24204.htm (2 Dec. 2003).
9 For an overview of the project, see G. Corpas Pastor, “TURICOR: Compilación de un corpus de contratos turísticos (alemán, español, inglés, italiano) para la generación textual multilingüe y la traducción jurídica”, in E. Ortega Arjonilla et al. (eds.), Panorama actual de la investigación en traducción e interpretación, Vol. II, Granada, Atrio, 2003, pp. 373-384.

10 According to the Expert Advisory Group on Language Engineering Standards (EAGLES), a corpus is “a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language” (“Text Corpora Working Group Reading Guide”, EAG-TCWG-FR-2, 1996. http://www.ilc.cnr.it/EAGLES96/corpintr/corpintr.html (2 Dec. 2003)).

11 “Natural Language Generation (NLG) is the subfield of artificial intelligence and computational linguistics that focuses on computer systems that can produce understandable texts in English or other human languages. Typically starting from some nonlinguistic (sic) representation of information as input, NLG systems use knowledge about language and the application domain to automatically produce documents, reports, explanations, help messages and other kinds of texts”, in E. Reiter and M. Dale, Building Natural Language Generation Systems, Cambridge, Cambridge University Press, 2000, p. 1.

12 On tourism law and contracts, see, for example, A. Aurioles Martín, Introducción al Derecho Turístico: Derecho Privado del Turismo, Madrid, Tecnos, 2002; R. Caballero Sánchez (ed.), Legislación Sobre Turismo, Madrid, McGraw-Hill, 2000; D. Grant and S. Mason, Holiday Law, London, Sweet & Maxwell, 2003; and M. McDonald, European Community Tourism Law and Policy, Dublin, Blackhall, 2003.

13 W. Pengilley, The Law of Travel and Tourism, London, Blackstone Press, 1990, p. 115.

14 Also package travel, package holidays and package tours or just packages in accordance with Council Directive 90/314/EEC of 13 June 1990.
Official Journal L 158, 23/06/1990, pp. 59-64. File No. 31990L0314. http://www.europa.eu.int/scadplus/leg/en/lvb/l32019.htm (5 Nov. 2003).

15 Also time-share contracts, timeshare contracts, timeshares or contracts relating to the purchase of the right to use immovable properties on a timeshare basis, as in Directive 94/47/EC of the European Parliament and of the Council of 26 October 1994. Official Journal L 280, 29/10/1994, pp. 83-87. File No. 31994L0047. http://www.europa.eu.int/scadplus/leg/en/lvb/l32016.htm (5 Nov. 2003).

passenger transport contracts16. Other contracts in high demand in the tourism industry that are subject to no specific regulations, such as travel insurance, hotel management or catering contracts, to name but a few, are also to be addressed in later stages of the project.

In close connection with the interdisciplinary nature of the TURICOR project and its main goal, the following integrative, fourfold hypothesis has been adopted as a starting point: (i) it is possible to set up a protocol for compiling specialised corpora which are representative of a given economic sector from Internet electronic resources alone; (ii) such specialised Internet-driven corpora could then be used to solve pressing problems of natural language processing (NLP) research in order to improve machine translation, translation memories, natural language generation and terminology management systems; (iii) a corpus-based multilingual NLG system is expected to contribute significantly to the economic growth and rapid development of a particular industry sector (e.g. tourism e-commerce and marketing); and (iv) specialised, multilingual, Internet-driven electronic corpora are an added-value research tool for spin-off studies on translation, terminology and text linguistics, on the one hand, and on legal, economic, advertising or marketing issues, on the other.

2.2.
Objectives and Methods

In the light of the fourfold hypothesis (see section 2.1.), our basic goal can be further elaborated by stating three clear objectives, namely, (a) to build up a multilingual macrocorpus (Turicor), composed of several parallel and comparable subcorpora of tourism law documents derived from electronic resources available on the WWW; (b) to design and implement a Turicor-based, standardised information-exchange computer programme for the automatic production of multilingual documents; and, finally, (c) to study tourism law and its main textual forms as samples of specialised communication in restricted registers in the four target languages.

In order to meet our first objective (Internet-driven corpus compilation), electronic resources for law and tourism on the WWW will be located and evaluated17 according to a validation system developed within the framework of a previous R&D project18, which draws on well-known standards19. Next, a protocolised work

16 International private air transport law is most recently regulated by the Montreal Convention for the Unification of Certain Rules for International Carriage by Air (28 May 1999), available in .html format from http://tlc.unn.ac.uk/tlcpg.asp?pageID=5 (5 Dec. 2003). Former Conventions (Warsaw, 1929; Geneva, 1948; Rome, 1952; Guadalajara, 1961; Montreal, 1978) and Protocols (The Hague, 1955; Guatemala, 1971; Montreal, 1975 and 1978), as well as the Chicago Acts and related Protocols, can be accessed as full-text bilingual .PDF versions (English and French) from the website of the Institute of Air & Space Law, McGill University (Montreal, Canada): http://www.iasl.mcgill.ca/index2.htm (4 Dec. 2003).

17 In a previous R&D project (Ref. No. PB98-1399, Spanish Ministry of Education, 1999-2002) it was found that it is at least possible to find and file package travel general terms and conditions from the websites of various Spanish and German travel agents and tour operators.
At this stage, the WWW will be searched for package travel contracts in English and Italian, plus other types of contracts in great demand in the tourism industry (passenger transport, travel insurance, hotel management, catering, online bookings of air fares, hotel rooms, rental cars, etc.) in the four languages involved in the project.

18 For a detailed account of the PB98-1399 project, see the papers edited by G. Corpas Pastor (op. cit.) and Mª E. Gómez Rojo’s review (in this volume).

19 We refer to J. E. Alexander and M. A. Tate’s Web Wisdom: How to Evaluate and Create Information Quality on the Web, Mahwah, New Jersey, Lawrence Erlbaum Associates, 1999; and also A. Cooke’s book A guide to finding quality information on the Internet: selection and evaluation strategies, London, Library Association Publishing, 1999.

procedure for ad hoc corpus creation will be established following the PB98-1399 project guidelines, including, but not limited to: directions as to the selection of appropriate information retrieval systems (I.R.S.) for downloading documents (legal regulations, contract forms and samples) into the corpus database; detailed instructions about the composition of the various subcorpora (diasystematic constraints on documents, size, type and format considerations, number of languages, degree of communicative specialisation, etc.); a set of coding tags (headers and DTD) following the TEI, as developed in the previous project; a range of offline web browsers to capture whole web pages (contents and hypertextual structure) at once; and instructions about corpus database management, alignment and concordancing tools. The second objective (implementing a corpus-driven multilingual prototype NLG system) will require evaluation and corpus-validation of current state-of-the-art NLG, EBMT and TM systems according to EAGLES standards20.
For sentence planning and generation, a domain interlingua ontology will be constructed upon a set of general and tourism law concepts, to be contrasted with the translation units obtained after automatically aligning and segmenting bi-texts. Finally, multilingual NLG software will be developed on the basis of the language-independent grammar and a relational database, plus a combination of fuzzy-matching algorithms and example-based MT systems.

As a by-product of the two former objectives, the last objective involves exploiting the data collected during the three-year project in various ways related to discourse, communication and law. For example, corpus management will provide invaluable data for terminological databases and formal text prototypes; the legal discourse of the tourism industry will be finely characterised; specific national, Community and international regulations governing tourism contracts will be reviewed and compared; bilingual subcorpora of original documents and their corresponding translated (or target) texts will allow research into translationese, legal translation teaching and transgenre; and even the prototype NLG system might serve as a basis for further software development in the areas of Translation Technologies and Internet Access Devices (IADs) for e-commerce and e-advertising (as it is rare to see an e-commerce website without e-advertising!).

3. ‘Package holidays’ regulations in Eire: a case study

As described in the previous section, a major objective of the project is to mine the Turicor macrocorpus from the World Wide Web automatically21. The Turicor multilingual parallel subcorpus is a bi- or multilingual corpus of originals and their translations into one or more languages.
It will include strictly related ‘mirror’ documents in the project’s four target languages: (a) Community tourism and travel law regulations; (b) any related bi- or multilingual websites retrieved from the Internet (legislation, reference, forms and contracts); and (c) translations by professionals or translation students. In its turn, the Turicor comparable subcorpus will encompass a wide range of texts: (a) tourism and travel contract samples; (b) tourism and travel legal forms; (c) relevant travel agency and tour operator websites; and (d) domestic tourism and travel regulations (Statutory Instruments, Acts of Parliament, Royal Decrees, relevant judicial decisions, etc.). That is to say, it will contain original legal documents that have been produced independently of each other in the four target languages, but that are considered to be similar (and therefore comparable) in terms of text type, form and function, topic, specialisation and so forth.

An initial task in the plan will therefore be to search the web for eligible documents. To illustrate the point, we will present a case study on the automatic location and retrieval of rules and regulations governing package holidays in the Republic of Ireland (Eire).

20 As redefined by Hovy et al. (eds.), Multilingual Information Management: Current Levels and Future Abilities. Report for the National Science Foundation, 1999. http://www.cs.cmu.edu/~ref/mlim/index.html (17 Sept. 2003), conformant to ISO/IEC 9126.

21 Documents will be searched and retrieved from the Internet whenever possible. However, some documents will have to be accessed from other electronic resources (CD-ROMs, for instance) or else scanned. In addition, it should be pointed out that access to real samples of contracts can be extremely difficult and time-consuming.

3.1. Searching the Internet

A reliable but expensive way to access legal information on the WWW is to subscribe to commercial services such as Westlaw22, LexisNexis23 or Celex24.
However, the expense may not be worthwhile for simple research purposes, as basic search skills enable users to access plenty of free electronic resources at the click of a mouse. Any reliable search requires careful selection of relevant key words and information retrieval systems. While indexing concepts are basic for global search engines to retrieve web page contents automatically, an appropriate choice of I.R.S. can be of paramount importance for more structured searches. For example, package holidays, law and Republic of Ireland were entered as key words for a first Boolean search query using Google25 and All the Web26. However, the results obtained were far from what was expected, as they contained a lot of noise and irrelevant information on travel forums and chats, cheap flights and accommodation offers, advertising, tabloid news, etc.

Fig. 1. Global Search Query (All the Web).

Attempts to narrow down the results by refining the key words tended to be equally unsuccessful. This suggests that the indexed key words are not the real problem: searches for Eire laws and regulations on package holidays have to be redirected towards alternative information retrieval systems. A safer strategy would be to resort to metasearch engines, such as Metacrawler27 and Highway6128, to find law search engines:

Fig. 2. Metasearch Query (Metacrawler).

22 http://web2.westlaw.com (5 Dec. 2003).

23 http://www.lexisnexis.com/ (5 Dec. 2003).

24 http://www.europa.eu.int/celex (5 Dec. 2003). Access to the “Expert Search” option requires a user name and a password.

25 http://www.google.com (5 Dec. 2003).

26 http://www.alltheweb.com (5 Dec. 2003).

27 http://www.metacrawler.com (5 Dec. 2003).

28 http://www.highway61.com (5 Dec. 2003).

From there, the next step is to locate websites devoted exclusively to the legal system of the Republic of Ireland, or dealing with Eire legal resources in one of their sections.
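The first-stage Boolean query described in this section can be made concrete with a short sketch. It only shows how the three key words of the case study might be combined into one query string and URL-encoded; the helper name `boolean_query` and the endpoint `search.example.org` are hypothetical, and real search engines differ in the operators and parameters they accept.

```python
from urllib.parse import urlencode


def boolean_query(terms, operator="AND"):
    """Join key words into one Boolean query, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return f" {operator} ".join(quoted)


# Key words used in the case study.
terms = ["package holidays", "law", "Republic of Ireland"]
query = boolean_query(terms)
print(query)

# Hypothetical endpoint, for illustration only (not a real search service).
url = "https://search.example.org/search?" + urlencode({"q": query})
print(url)
```

Quoting the multi-word key words keeps them together as phrases, which is one simple way to reduce the kind of noise reported for the global search.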
A cursory look yielded a good number of useful portals, gateways, legal indices, resource guides and link pages, among which are the following: AccessToLaw29, Carrow's Irish Law Links30, Legal-Island31, Lex Scripta: Legal Megasites32, LLRX - Guide to European Legal Databases33 and The Bar Council & Bar Library of Northern Ireland34. According to Internet evaluation models, these top-quality websites satisfy the criteria for efficient, valuable electronic information resources: they are updated on a regular basis (monthly, weekly or even daily), their contents are logically ordered and accurate, identification details of webmasters, contact experts or official bodies are systematically provided, graphic and multimedia design is user-friendly, related links (e.g. law databases and e-journals) are carefully selected, etc.

29 http://www.accesstolaw.com (5 Dec. 2003).

30 http://www.carrow.com/linkirish.html (5 Dec. 2003).

31 http://www.legal-island.com (5 Dec. 2003).

32 http://www.lexscripta.com/legal/omnibus/megasites.html (5 Dec. 2003).

33 http://www.llrx.com/features/europe.htm (5 Dec. 2003).

34 http://www.barlibrary.com/links.htm (5 Dec. 2003).

Most websites are provided with an internal search engine for quick reference, while some of them offer WWW search tutorials as yet another asset. In order to proceed with the search, we have chosen one of the aforementioned added-value law directories: AccessToLaw. A well-structured gateway, it covers the United Kingdom (England and Wales, Scotland, Northern Ireland), the Commonwealth (Australia, Canada, Gibraltar, Malta, etc.) and other jurisdictions, such as the Channel Islands, the Isle of Man, the Republic of Ireland, Europe and major World Law resources. As a general resource, it provides links to legal search engines and gateways, learned legal journals and reference books, electronic law libraries and publishers, online solicitors and barristers, professional organizations, etc.
on a wide range of subject areas, such as criminal law, ecclesiastical law, family law, international law, property or shipping law, to name but a few.

Fig. 3. Specific search (AccessToLaw - homepage).

As regards the Republic of Ireland (Eire), AccessToLaw offers a wealth of information within the “Other Jurisdictions” section. For instance, it includes an electronic full-text version of the Irish Constitution of 1937, plus a list of the amendments made from its enactment up to November 2002. Primary sources can mainly be accessed via links to the Government of Ireland35 and particularly to the Oireachtas36 and the Law Reform Commission. Acts, instruments, decisions, provisions, etc. can also be found through the Irish Law Site hosted by the University College Cork Law Faculty and its two database initiatives, BAILII (British and Irish Legal Information Institute) and IRLII (Irish Legal Information Initiative Site); the personal site of Feargal Quinn, an independent member of the Irish Senate; and two directories for Northern Ireland and Eire law (the Legal Eagle Links website of the solicitor D. O'Reilly, and the Legal Island Site). Primary legislation (Acts of the Oireachtas 1997 onwards) and secondary legislation (Statutory Instruments 1922 onwards) are contained in the Irish Statute Book (1922 onwards); court decisions and case law are arranged by subject and alphabetically37 (Irish Supreme Court and Court of Criminal Appeal Decisions 1997 onwards; Irish High Court Decisions 1996 onwards; Irish Competition Authority Decisions 1991 onwards; Irish Information Commissioner's Decisions 1998 onwards), and it is also possible to access other Irish law materials (Irish Law Reform Commission Papers and Reports 1976 onwards, full-text Parliamentary Debates 1919, Bills and Explanatory Memoranda from the Houses of the Oireachtas, and the latest publications and annual reports issued by central government departments, agencies and state-sponsored bodies).
Secondary sources are also well represented by numerous links to full-text versions of electronic law journals, legal textbooks (e.g. D. Whelan’s Guide to Irish Law, 2001), publications by government departments and state organisations, university teaching materials, dictionaries and directories of legal professions, etc. For instance, there is direct access to the EPPI (Enhanced British Parliamentary Papers on Ireland 1801-1922) bibliography database, launched in 2003.

3.2. Retrieving and processing the data

Package holidays rules and regulations can be found in the Irish Statute Book database (primary legislation, Acts of the Oireachtas). The Package Holidays and Travel Trade Act, 199538 gives effect to Council Directive 90/314/EEC of 13 June 1990 of the European Communities on package travel, package holidays and package tours. It amends39 The Transport (Tour Operators and Travel Agents) Act, 198240. Both Acts can be cited together as The Transport (Travel Trade) Acts, 1982 and 1995.

Once the data have been located, the next step in corpus building is to retrieve the corresponding documents. However, automatic downloading can be impaired by the fractal, interactive, dynamic and graphical nature of WWW hypertexts. One major problem is that single web pages/nodes have to be accessed one by one, which can turn downloading into a complex, time-consuming effort. For example, the HTML version of The Package Holidays and Travel Trade Act, 1995 consists of four parts (I. Preliminary and General; II. Regulation of Travel Contract; III. Security; IV. Amendment of Transport (Tour Operators and Travel Agents) Act, 1982), divided

35 An internal search engine locates information from all government sites.

36 The Oireachtas (Parliament) consists of two Houses: the Dáil Éireann (the House of Representatives, directly elected) and the Seanad Éireann (the Senate, indirectly elected).
37 The “Courts Service: Ireland” Section includes information on the Irish courts system, court rules, court offices and law terms, plus a legal diary, press releases, publications and legal links.

38 http://www.irishstatutebook.ie/ZZA17Y1995S1.html (5 Dec. 2003).

39 Also referred to: Companies Act, 1963 (No. 33), Hotel Proprietors Act, 1963 (No. 7), Petty Sessions (Ireland) Act, 1851 (c. 93) and Public Offices Fee Act, 1879 (c. 58).

40 http://www.irishstatutebook.ie/ZZA3Y1982.html (5 Dec. 2003).

into 34 sections (and subsequent subsections) plus a schedule. Downloading the whole document means storing all its parts and sections one after the other, either as a single document or else as 35 shorter documents! (In fact, single sections can be retrieved individually from the WWW.)

Fig. 4. Package Holidays and Travel Trade Act, 1995 (Irish Statute Book Database).

It should be pointed out, though, that global retrieval of fragmented WWW content can be conveniently sped up by offline browsers able to retrieve and store whole websites (contents and navigation design), like GNU Wget41 or WebStripper42. Further problems relate to the loss of meaningful parts of the hypertext, such as graphic and multimedia components (bullets, logos, banners, etc.), navigation design and reading paths, relevant formal layout and format, etc. All that (and much more) is lost in a plain text format pre-processed for corpus management purposes, since the next stage in the project workflow involves conversion of the .HTML document into .TXT format. For example, the following figure illustrates Part III, section 25 (“Insurance”):

41 Free software. http://www.gnu.org/software/wget/wget.html (5 Dec. 2003).

42 http://www.webstripper.net (5 Dec. 2003).

Insurance.
25.-(1) The package provider shall have insurance under one or more appropriate policies with an insurer authorised in respect of such business in a Member State under which the insurer agrees to indemnify consumers (who shall be insured persons under the policy), against- (a) the loss of all money paid over by them under or in contemplation of contracts for relevant packages, and (b) where applicable to the package concerned, the cost of repatriation of consumers based on administrative arrangements established by the insurer to enable repatriation of such consumers, in the event of insolvency of the package provider.(2) The package provider shall ensure that it is a term of every contract with a consumer that the consumer acquires the benefit of a policy of a kind mentioned in subsection (1) in the event of the insolvency of the package provider.(3) In this section "appropriate policy" means one which does not contain a condition which provides (in whatever terms) that no liability shall arise under the policy, or that any liability so arising shall cease- (a) in the event of some specified thing being done or omitted to be done after the happening of the event giving rise to a claim under the policy, (b) in the event of the failure of the policy holder to make payments to the insurer in connection with that policy or with other policies, or (c) unless the policy holder keeps specified records or provides the insurer with information therefrom.

Fig. 5. Package Holidays and Travel Trade Act, 1995, section 25 (.TXT format).

Documents retrieved from the Internet are stored in the corpus database both in their original format (usually .HTML or .PDF) and in a plain format (.TXT) suitable for corpus management. For each of them, a TEI-conformant DTD is provided.
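The conversion from .HTML to .TXT mentioned above can be illustrated with a minimal sketch that uses only the Python standard library. The paper does not name the conversion tool used in the project, so the class `TextExtractor`, the function `html_to_text` and the sample markup are all hypothetical; the sketch simply shows how tags, styles and layout disappear while the running text survives.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, dropping tags, scripts and styles."""

    SKIP = {"script", "style"}  # content that is never visible text
    BLOCK = {"p", "br", "div", "li", "tr", "h1", "h2", "h3", "h4", "title"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the plain text of an HTML string, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical markup wrapped around a fragment of section 25.
sample = (
    "<html><head><style>p {margin: 0}</style></head><body>"
    "<h3>Insurance.</h3>"
    "<p><b>25.</b>-(1) The package provider shall have insurance</p>"
    "</body></html>"
)
print(html_to_text(sample))
```

Bold type, headings and style rules leave no trace in the output, which is the loss of formal layout discussed in this section; keeping the original .HTML file next to the .TXT version, as the project does, preserves that information.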
In this case study, the “package travel” Act would belong to section (a) of the multilingual comparable corpus43 (as would similar domestic legislation from the remaining countries covered in the TURICOR project). In addition, it would be stored in both .HTML and .TXT formats, it would be conveniently identified (DTD file) and it would include pointers to (i) type of tourism contract [“packages”], (ii) language [“English”], (iii) type of regulation [“domestic law”] and (iv) jurisdiction [“Eire”].

4. Conclusion

This paper has provided a brief summary of the TURICOR project, with a focus on corpus building from Internet electronic resources. A search methodology has been illustrated by means of a case study on domestic package holiday regulations in the Republic of Ireland. This methodology comprises three main stages: (a) global Boolean search, (b) law metasearch and (c) jurisdiction search. It could be successfully applied to all kinds of legal searches, whether for domestic, international or Community law. In short, the TURICOR project is beginning to open new, exciting research avenues for comparative law, legal translation, documentation and corpus-based NLP and NLG systems.

[Received 6 December 2003. Accepted for publication 14 December 2003]

43 Similarly, Community regulations would belong, for instance, to the multilingual parallel corpus, as they are translated into all official EC languages.