"Here be Dragons"
Strategies for dealing with viruses in the web archive
The Nature of the Problem
• What are the Dragons?
– We talk about "computer viruses" but we’re
describing all forms of malware: viruses, trojans,

• Do we keep them out of the archive completely?
– No, because we consider them to be an important
part of the web as it exists today

Here be Dragons | 28 June 2011
Web capture at INA
[Diagram: The Internet, Proxy Server, Web crawlers]
Where are the Risks?
• Low Risk during capture
– Robots stream data for storage
– Captured data not executed in any way
• Medium Risk during testing/file verification
– Statistics, index creation
• High Risk during consultation
– Files are recovered and executed in their original
environment (albeit without a live element)

What do we need to Protect?
• Protection of the archive
– Can we be sure that viruses cannot damage the
archive itself?
• Protection of user workstations
– Currently the user environment is simply a Firefox
browser running under MacOS.
• Protection of the internal network
– The majority of our systems are Linux/Mac based
and are therefore less prone to viruses*; however,
the wider network has Windows infrastructure
Antivirus tools
What tools are available?
• ClamAV (Linux/Mac/Windows)
– Open source (GPL) anti-virus toolkit for UNIX
– Most popular
• Dazuko (Linux)
– File Access Control
– Kernel modifications required
– File access automatically blocked
• HAVP (Linux)
– HTTP Antivirus-proxy
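
As a rough illustration of how ClamAV results might be collected, the sketch below parses the report lines that `clamscan` prints for infected files (`<path>: <signature> FOUND`). The function name `parse_clamscan_report` and the sample paths are hypothetical; only the line format comes from ClamAV.

```python
def parse_clamscan_report(report: str) -> dict[str, str]:
    """Map each infected path to the signature name clamscan reported."""
    infected = {}
    for line in report.splitlines():
        line = line.strip()
        if not line.endswith(" FOUND"):
            continue  # skip "OK" lines, error lines and the summary
        # Split on the last ": " so paths that contain colons survive.
        path, _, signature = line[: -len(" FOUND")].rpartition(": ")
        if path:
            infected[path] = signature
    return infected


# Hypothetical report text, for illustration only.
sample = (
    "/tmp/extract/a.html: OK\n"
    "/tmp/extract/b.js: Eicar-Test-Signature FOUND\n"
)
print(parse_clamscan_report(sample))
```

The resulting mapping is the kind of data that could feed a list of files "at risk".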

• Protection of the archive
– Run antivirus checks at regular intervals
– Create a list of files ‘at risk’ - blocked from users
• Protection of user workstations/internal network
– Check all traffic that passes between the archive
and the user
– Allow a quarantine period between archival and consultation
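
The quarantine idea can be sketched as a simple date check: a capture is only released to users once enough time has passed for antivirus signatures to catch up. This is a minimal sketch, not INA's implementation; `is_released` and `QUARANTINE` are hypothetical names, and the two-week value is the minimum suggested in the closing slides.

```python
from datetime import date, timedelta

# Assumed value: the talk suggests a minimum of two weeks.
QUARANTINE = timedelta(weeks=2)


def is_released(captured_on: date, today: date,
                quarantine: timedelta = QUARANTINE) -> bool:
    """True once the capture has aged past the quarantine period."""
    return today - captured_on >= quarantine


print(is_released(date(2011, 6, 1), date(2011, 6, 14)))  # 13 days old
print(is_released(date(2011, 6, 1), date(2011, 6, 15)))  # 14 days old
```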

Background scanning

[Diagram: Storage, Archive File Extraction (File 1.. to File 4..), Antivirus proxy, Virus/malware index blacklist]
Proposed "Antivirus on the fly"
[Diagram: Storage, Antivirus proxy, index blacklist, Allow/Block, Error message; Firefox connects to proxy - port 81]
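
The allow/block step in front of the user's browser could look something like the sketch below. It only illustrates the blacklist lookup; `check_request`, the message text and the example URLs are hypothetical, and a real deployment would sit inside the antivirus proxy rather than in a standalone function.

```python
def check_request(url: str, blacklist: set[str]) -> tuple[str, str]:
    """Return ("block", message) for blacklisted URLs, else ("allow", "")."""
    if url in blacklist:
        return "block", f"Access to {url} is blocked: flagged by antivirus scan"
    return "allow", ""


# Hypothetical blacklist entries, for illustration only.
blacklist = {"http://example.org/infected.exe"}
print(check_request("http://example.org/infected.exe", blacklist))
print(check_request("http://example.org/index.html", blacklist))
```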
Test Results
• Full DLWeb Archive Test: 6.3 TB compressed:
– 106 million files verified ~ 20 million files per day
– 5 files greater than the configured limit of 2 GB
– 10 failed connections
– ClamAV reported 2458 viruses (28 unique) ~ 1 per
50,000 files
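
For reference, the arithmetic behind these figures can be checked in a few lines. The variable names are illustrative; the inputs are the numbers quoted on the slide, which rounds the detection rate to about 1 per 50,000.

```python
files_verified = 106_000_000   # files checked in the full archive test
files_per_day = 20_000_000     # approximate scanning throughput
detections = 2458              # viruses reported by ClamAV

days = files_verified / files_per_day
files_per_detection = files_verified / detections

print(f"Scan duration: about {days:.1f} days")
print(f"Roughly 1 detection per {files_per_detection:,.0f} files")
```

The computed rate comes out at roughly 1 per 43,000 files, the same order of magnitude as the rounded figure above.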

[Chart: Virus Results by individual classification]

[Chart: Virus Results by grouping]
Test Results - continued
• Full DLWeb Archive Test: 6.3 TB compressed:
– 2459 files re-verified with Windows AV (AVG)
– 2095 detected as viruses (8 unique)
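
The share of re-verified files that the second scanner also flagged follows directly from the two counts above. A small worked calculation, with illustrative variable names:

```python
reverified = 2459      # files re-checked with the Windows scanner (AVG)
confirmed = 2095       # of those, files AVG also detected as viruses

agreement = confirmed / reverified
print(f"Second scanner confirmed {agreement:.1%} of the flagged files")
print(f"{reverified - confirmed} files were flagged by ClamAV only")
```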

Test Results - comparison
• US Election 2000 (Library of Congress)
– 12 million files checked
– 362 viruses found by ClamAV (1 in 33,000 files)

Virus Checking Web Archives - Preservation Working Group
G. Jones, M. Ashenfelder, I. Garcia del Campo
Library of Congress

General Virus Statistics
• How do our test results compare?
– Exploit & Iframe take first place, followed by
backdoors and trojans

[Chart: Blocks by Malware Type – web browsing statistics, Q2 2009 - Global Threat]
Issues with ClamAV
• File limits
− Need to split up files to effectively test them
• Scalability
− How well does the antivirus proxy scale with
multiple user sessions?
• Connection failures
− Need retry capability
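
One common way to add retry capability is to wrap the scan call and back off between attempts. The sketch below is a generic pattern, not a ClamAV feature; `with_retries`, `flaky_scan` and the delay values are hypothetical.

```python
import time


def with_retries(operation, attempts=3, delay=1.0, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with doubling delays."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the failure
            sleep(delay * 2 ** (attempt - 1))


# Stand-in for a scan request that fails twice before succeeding.
calls = {"n": 0}


def flaky_scan():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("scanner unavailable")
    return "OK"


waits = []  # record the delays instead of actually sleeping
print(with_retries(flaky_scan, sleep=waits.append))
```

Passing `sleep` as a parameter keeps the example fast to run; in practice the default `time.sleep` would be used.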

Reducing the risk
• Simple steps
− Ensure that all processes accessing archival
material have the correct permissions, i.e. not root
• Choice of operating system
− Where possible, avoid Windows; MacOS and
Linux/UNIX have far lower rates of viruses and
malware – but will this be the case in 20 years?
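
The "not root" rule can be enforced at the start of any script that touches archival material. A minimal sketch for POSIX systems, where `ensure_unprivileged` is a hypothetical helper name:

```python
import os


def ensure_unprivileged(euid=None):
    """Raise PermissionError if the effective user is root (uid 0)."""
    if euid is None:
        euid = os.geteuid()  # POSIX only; not available on Windows
    if euid == 0:
        raise PermissionError("refusing to process archival material as root")
    return euid
```

Calling `ensure_unprivileged()` with no argument checks the current process; the optional `euid` parameter exists only to make the check easy to exercise.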

Disposable Sessions/Virtualisation
• Can we create temporary sessions?
– Virtual sessions which are deleted after each user
– Good level of protection for each user workstation
– Protection is focused on any files accessible by the
user (such as removable media)
– Could also be a solution in supporting legacy
applications required to read archived contents.
– Also moves us closer to a virtualised world (where a
lot of server architecture is heading)
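
A disposable session can be approximated, at the file level only, by giving each user a working directory that is deleted when the session ends. This is a loose analogy to the virtual sessions described above, not a substitute for real virtualisation; `disposable_session` and the file name are hypothetical.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def disposable_session(prefix="consultation-"):
    """Yield a scratch directory that is removed when the session ends."""
    with tempfile.TemporaryDirectory(prefix=prefix) as workdir:
        yield Path(workdir)


with disposable_session() as session_dir:
    # Anything the user's session writes lands in the scratch directory.
    (session_dir / "downloaded.bin").write_bytes(b"untrusted")
    kept = session_dir

print(kept.exists())  # the directory is gone once the session closes
```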

• Viruses/malware are relatively low risk
– The majority are redirections to sites
which should not exist in the archive
• Useful having two virus checks
– Blacklists give flexibility to conduct other tests
– Keeps track of viruses/malware – does the
detection rate change over time?

• The technology is capable of regular archive file scanning
– File extraction is quite efficient
• Quarantine period needed between collection and
consultation to “catch the latest viruses”
– Minimum 2 weeks (maybe more) to allow for
updates to antivirus software
• More testing required!
– Full test of antivirus proxy still required

