FRANCE

The Bibliothèque nationale de France (BnF) has been active since 2000 in developing a
combined methodology including:
    Automatic large-scale domain crawls several times a year
    Continuous crawls of automatically selected sites (10% of the total)
    Deposit of deep web sites that cannot be harvested online
    Thematic event-based collections for very ephemeral sites (which crawlers would
       take too long to find)
In collaboration with INRIA (the French National Institute for Research in Computer Science
and Automatic Control), the BnF has been experimenting with Web collection techniques using
automatic harvesting and evaluation methods. A test of automatic selection parameters based
on in-link ranking (as used by Google) has been carried out.
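The idea behind such link-based selection can be sketched as follows: sites that many other sites point to are considered more worth collecting. This is a minimal, hypothetical illustration (the function name and the toy link graph are not from the BnF/INRIA experiments), counting in-links rather than computing full PageRank:

```python
from collections import defaultdict

def inlink_rank(links):
    """Rank sites by number of incoming links from other sites.

    A crude stand-in for the link-based ranking tested by BnF/INRIA;
    real systems would use PageRank-style iteration over a full crawl graph.
    """
    counts = defaultdict(int)
    for src, dst in links:
        if src != dst:          # ignore self-links
            counts[dst] += 1
    # Most-linked-to sites first: candidates for automatic selection.
    return sorted(counts, key=counts.get, reverse=True)

# Toy link graph: (linking site, linked-to site)
links = [("a.fr", "b.fr"), ("c.fr", "b.fr"), ("a.fr", "c.fr"),
         ("b.fr", "c.fr"), ("d.fr", "c.fr")]
print(inlink_rank(links))  # → ['c.fr', 'b.fr']
```

In-link counting is only a first approximation; the appeal of PageRank-like measures is that a link from a highly ranked site counts for more than a link from an obscure one.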

The initial phase of the program, which ran until June 2001, focused on content gathering,
involving the collection of 16 audiovisual sites with small robots to assess the limits of these tools.

In 2002 two harvests were made:
    a thematic collection of sites relating to the French elections (1900 sites)
    a comprehensive crawl of the .fr Web

In a second pilot on methods for archiving hidden Web sites, which ran from 2002 to June
2003, 100 selected Web owners were approached to deposit their Web sites in the BnF for
permanent archiving.

Legal Deposit
The Loi du 20 juin 1992 relative au dépôt légal (law of 20 June 1992 on legal deposit), which
revised the legal deposit legislation, came into force in 1994. It requires the legal deposit of
printed, graphic, photographic, sound, audiovisual and multimedia documents, whatever the
technical means of production, as soon as they are made accessible to the public through
publication on a physical carrier.
The legislation does not cover online electronic publications.
An extension of the French legal deposit law to networked digital material is expected to be
voted on by the French Parliament in 2004. It will allow the BnF and INA (Institut National de
l'Audiovisuel, responsible for TV and radio preservation) to harvest sites online, to request a
deposit from publishers when online harvesting is impossible, and to make the collection
publicly available on site.

There is no specific project at the moment.

A portable extraction tool (DeepArc) was developed to enable producers to extract databases
to XML in a simple way. This tool, currently being tested by IIPC members, will be released
as open source soon. It is envisaged that harvester technology may be employed as a
discovery tool to identify deep web sites of interest for collection, which would then
be transferred to the BnF after technical negotiation with site owners.
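Database-to-XML extraction of the kind DeepArc performs can be sketched in a few lines: each table row becomes an XML record whose child elements carry the column values. This is a simplified, hypothetical illustration (the function and table names are invented, and DeepArc's actual mapping rules are richer), not DeepArc itself:

```python
import sqlite3
import xml.etree.ElementTree as ET

def table_to_xml(conn, table):
    """Serialize one database table as an XML tree.

    A toy sketch of database-to-XML extraction; a real tool like DeepArc
    lets producers define the mapping rather than using a fixed one.
    """
    root = ET.Element(table)
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    for row in cur:
        rec = ET.SubElement(root, "record")
        for col, val in zip(cols, row):
            ET.SubElement(rec, col).text = str(val)
    return root

# Hypothetical example: a tiny in-memory database of site records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (id INTEGER, url TEXT)")
conn.execute("INSERT INTO sites VALUES (1, 'http://example.fr')")
xml = ET.tostring(table_to_xml(conn, "sites"), encoding="unicode")
print(xml)
# → <sites><record><id>1</id><url>http://example.fr</url></record></sites>
```

Exporting to XML in this way decouples the archived content from the producer's database software, which is the point of a deposit format for deep web material.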
Web crawling is based on techniques developed at INRIA for Xyleme, a project concerned
with the creation of large dynamic data warehouses of XML data.
Like other robots, the Xyleme crawler is designed to retrieve HTML and XML pages from
the web to a place where they can be monitored and stored. A key feature of Xyleme is that
certain pages can be refreshed regularly.
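Refreshing certain pages regularly amounts to a scheduling problem: each page has a refresh interval, and at each crawl cycle the crawler re-fetches whichever pages are due, within a fetch budget. The sketch below is a hypothetical toy model (the URLs, intervals, and function are not from Xyleme), using a priority queue ordered by next due time:

```python
import heapq

def schedule(pages, now, budget):
    """Return the pages due for re-fetching, earliest-due first.

    pages maps url -> (last_fetch_time, refresh_interval), in hours.
    A toy model of per-page refresh scheduling, not Xyleme's algorithm.
    """
    due = [(last + interval, url, interval)
           for url, (last, interval) in pages.items()]
    heapq.heapify(due)  # min-heap keyed on next due time
    fetched = []
    while due and len(fetched) < budget and due[0][0] <= now:
        _, url, interval = heapq.heappop(due)
        fetched.append(url)
        pages[url] = (now, interval)  # record the re-fetch
    return fetched

# Hypothetical pages with different refresh intervals (in hours).
pages = {
    "news.fr": (0, 1),    # refresh hourly
    "blog.fr": (0, 24),   # refresh daily
    "ref.fr":  (0, 720),  # refresh monthly
}
print(schedule(pages, now=24, budget=2))  # → ['news.fr', 'blog.fr']
```

Separating the fetch schedule from the fetch itself is what lets a crawler like Xyleme keep fast-changing pages fresh without re-crawling the stable bulk of its collection.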