"Here be Dragons"
Strategies for dealing with viruses in the web archive
01 The Nature of the Problem
• What are the Dragons?
– We talk about "computer viruses" but we’re
describing all forms of malware: viruses, trojans,
worms

• Do we keep them out of the archive completely?
– No, because we consider them to be an important
part of the web as it exists today



Here be Dragons | 28 June 2011
Web capture at INA
[Diagram: The Internet, Proxy Server, Web crawlers, Storage]
Where are the Risks?
• Low risk during capture
– Robots stream data for storage
– Captured data is not executed in any way
• Medium risk during testing/file verification
– Statistics, index creation
• High risk during consultation
– Files are recovered and executed in their original
environment (albeit without a live element)


What do we need to Protect?
• Protection of the archive
– Can we be sure that viruses cannot damage the
archive itself?
• Protection of user workstations
– Currently the user environment is simply a Firefox
browser running under MacOS.
• Protection of the internal network
– The majority of our systems are Linux/Mac based
and are therefore less prone to viruses; however,
the wider network has Windows infrastructure
02 Antivirus tools
What tools are available?
• ClamAV (Linux/Mac/Windows)
– Open source (GPL) anti-virus toolkit for UNIX
– Most popular
• Dazuko (Linux)
– File Access Control
– Kernel modifications required
– File access automatically blocked
• HAVP (Linux)
– HTTP Antivirus-proxy
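
To make the ClamAV entry concrete, here is a minimal sketch of driving the `clamdscan` client from a script. It assumes a running `clamd` daemon and `clamdscan` on the PATH; the helper names `interpret_exit_code` and `scan_file` are hypothetical and do not come from the slides. The exit codes used (0 clean, 1 virus found, 2 error) are the ones ClamAV documents.

```python
import subprocess

# Exit codes documented for clamdscan: 0 = clean, 1 = virus found, 2 = error.
STATUS_BY_EXIT_CODE = {0: "clean", 1: "infected", 2: "error"}


def interpret_exit_code(code: int) -> str:
    """Map a clamdscan exit code to a status string (hypothetical helper)."""
    # Any unexpected code is treated as an error rather than as a clean file.
    return STATUS_BY_EXIT_CODE.get(code, "error")


def scan_file(path: str) -> str:
    """Scan one file with clamdscan; assumes a running clamd daemon."""
    result = subprocess.run(
        ["clamdscan", "--no-summary", path],
        capture_output=True,
        text=True,
    )
    return interpret_exit_code(result.returncode)
```

Treating unknown exit codes as errors is a deliberately cautious choice: a file is only reported clean when the scanner explicitly says so.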

Strategy
• Protection of the archive
– Run antivirus checks at regular intervals
– Create a list of files ‘at risk’, which are blocked from users
• Protection of user workstations/internal network
– Check all traffic that passes between the archive
and the user
– Allow a quarantine period between archival and
consultation


Background scanning
[Diagram: Storage, Archive File Extraction, Clamdscan (File 1.. to File 4..), Virus/malware index blacklist, Antivirus proxy]
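
One way the background scan could feed the virus/malware index blacklist is sketched below. It parses the text that `clamdscan` prints, where an infected file appears as `path: SignatureName FOUND`. The function name `parse_clamdscan_output`, the sample paths, and the choice of a path-to-signature dictionary are assumptions for illustration; the slides do not describe the blacklist format.

```python
from typing import Dict


def parse_clamdscan_output(output: str) -> Dict[str, str]:
    """Return {file path: signature name} for every line reporting FOUND."""
    blacklist: Dict[str, str] = {}
    suffix = " FOUND"
    for line in output.splitlines():
        line = line.strip()
        if not line.endswith(suffix):
            continue  # "OK" lines, errors and blank lines are skipped
        # Split on the last ": " so that paths containing colons survive.
        path, _, signature = line[: -len(suffix)].rpartition(": ")
        if path:
            blacklist[path] = signature
    return blacklist


# Hypothetical scanner output for two extracted files.
sample = (
    "/extract/file1.html: OK\n"
    "/extract/file2.exe: Eicar-Test-Signature FOUND\n"
)
blacklist = parse_clamdscan_output(sample)
```

Keeping the signature name alongside each path makes it possible to group results by malware type later.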
Proposed "Antivirus on the fly"
[Diagram: Storage, Virus/malware index blacklist (Allow/Block), Clamdscan (Allow/Block), Antivirus proxy, Error message, User workstation; Firefox connects to proxy - port 81]
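
The allow/block flow in the diagram can be read as two checks in sequence: the blacklist first, then a live scan. The sketch below is one interpretation under that assumption. The function `decide`, the `scan` callback, and the step that adds newly detected items to the blacklist are hypothetical and are not confirmed by the slides.

```python
from typing import Callable, Set


def decide(url: str, blacklist: Set[str], scan: Callable[[str], bool]) -> str:
    """Return "allow" or "block" for a requested archive URL.

    `scan` is expected to return True when the content is infected.
    """
    if url in blacklist:
        return "block"  # already known from background scanning
    if scan(url):
        blacklist.add(url)  # assumption: remember new detections
        return "block"
    return "allow"


# Stub scanner standing in for a real clamdscan call.
known_bad = {"http://example.org/bad.exe"}
stub_scan = lambda url: url.endswith(".scr")

first = decide("http://example.org/bad.exe", known_bad, stub_scan)
second = decide("http://example.org/page.html", known_bad, stub_scan)
third = decide("http://example.org/saver.scr", known_bad, stub_scan)
```

A "block" result corresponds to the error message returned to the user workstation in the diagram.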
03 Results
Test Results
• Full DLWeb Archive Test (6.3 TB compressed):
– 106 million files verified, at ~20 million files per day
– 5 files greater than the configured limit of 2 GB
– 10 failed connections
– ClamAV reported 2458 viruses (28 unique), ~1 per
50,000 files
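
The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative. Note that the exact ratio works out to roughly 1 detection per 43,000 files, so the slide's "~1 per 50,000" appears to be a coarser rounding.

```python
files_verified = 106_000_000   # files checked in the full archive test
files_per_day = 20_000_000     # approximate scanning throughput
detections = 2458              # viruses reported by ClamAV

# Time needed to scan the whole archive at the quoted throughput.
scan_days = files_verified / files_per_day

# Average number of files scanned per reported detection.
files_per_detection = files_verified / detections

print(f"Full scan takes about {scan_days:.1f} days")
print(f"About 1 detection per {files_per_detection:,.0f} files")
```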




Virus Results by individual classification
[Chart not reproduced]
Virus Results by grouping
[Chart not reproduced]
Test Results - continued
• Full DLWeb Archive Test (6.3 TB compressed):
– 2459 files re-verified with Windows AV (AVG)
– 2095 detected as viruses (8 unique)
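
As a quick worked example, the level of agreement between the two scanners follows directly from the two numbers on this slide. The variable names are illustrative only.

```python
reverified = 2459         # files re-checked with the Windows antivirus (AVG)
confirmed_by_avg = 2095   # of those, files AVG also flagged as viruses

agreement = confirmed_by_avg / reverified
not_confirmed = reverified - confirmed_by_avg

print(f"AVG confirmed {agreement:.1%} of the re-verified files")
print(f"{not_confirmed} files were flagged by ClamAV only")
```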




Test Results - comparison
• US Election 2000 (Library of Congress)
– 12 million files checked
– 362 viruses found by ClamAV (1 in 33,000 files)



Source: "Virus Checking Web Archives", G. Jones, M. Ashenfelder, I. Garcia del Campo (Library of Congress), netpreserve.org - Preservation Working Group
General Virus Statistics
• How do our test results compare?
– Exploit & Iframe take first place, followed by
backdoors and trojans

[Chart: Blocks by Malware Type. Source: Scansafe.com web browsing statistics, Q2 2009 Global Threat Report, http://www.scansafe.com/__data/assets/pdf_file/13546/Q209_GTR_FINAL.pdf]
04 Conclusions
Issues with ClamAV
• File limits
– Need to split up files to effectively test them
• Scalability
– How well does the antivirus proxy scale with
multiple user sessions?
• Connection failures
– Need retry capability
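
The retry capability mentioned above could take many forms. The following is a minimal sketch of one common approach, a wrapper that retries a failed call with exponential backoff. The name `with_retries`, the default of three attempts, and the use of `ConnectionError` as the failure signal are all assumptions for illustration, not part of ClamAV.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T], attempts: int = 3, delay_seconds: float = 1.0
) -> T:
    """Call `action`, retrying on ConnectionError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            # Wait 1x, 2x, 4x ... the base delay before the next try.
            time.sleep(delay_seconds * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


# Simulated scanner connection that fails twice, then succeeds.
calls = {"count": 0}


def flaky_scan() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated connection failure")
    return "scanned"


outcome = with_retries(flaky_scan, attempts=3, delay_seconds=0)
```

Re-raising after the final attempt keeps persistent failures visible, so a file is never silently treated as scanned.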



Reducing the risk
• Simple steps
– Ensure that all processes accessing archival
material have the correct permissions, i.e. not root
• Choice of operating system
– Where possible, avoid Windows; MacOS and
Linux/UNIX have far lower rates of viruses and
malware – but will this be the case in 20 years?
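
The "not root" rule can be enforced at the start of any script that touches archival material. The sketch below shows one possible guard; the helper names `is_privileged` and `refuse_root` are hypothetical, and `os.geteuid` is only available on Unix-like systems.

```python
import os
import sys


def is_privileged(euid: int) -> bool:
    """Return True when the effective user id belongs to root."""
    return euid == 0


def refuse_root() -> None:
    """Exit early when running as root (Unix only: uses os.geteuid)."""
    if is_privileged(os.geteuid()):
        sys.exit("Refusing to read archival material as root")
```

Calling `refuse_root()` before opening any archive file means a misconfigured job stops immediately instead of running with more privileges than it needs.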




Disposable Sessions/Virtualisation
• Can we create temporary sessions?
– Virtual sessions which are deleted after each user
– Good level of protection for each user workstation
– Protection is focused on any files accessible by the
user (such as removable media)
– Could also be a solution in supporting legacy
applications required to read archived contents.
– Also moves us closer to a virtualised world (where a
lot of server architecture is heading)
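
A much lighter-weight stand-in for the disposable sessions described above is a browser profile that lives in a temporary directory and is deleted when the session ends. This is not virtualisation, only a sketch of the "create, use, delete" pattern. The function names are hypothetical, and the sketch assumes Firefox's `-profile` and `-no-remote` command-line options.

```python
import tempfile
from typing import Callable, List


def firefox_command(profile_dir: str, url: str) -> List[str]:
    """Build a Firefox command line that uses a throwaway profile."""
    return ["firefox", "-no-remote", "-profile", profile_dir, url]


def run_disposable_session(url: str, launch: Callable[[List[str]], None]) -> str:
    """Run one consultation session, then delete everything it wrote.

    `launch` receives the command; a real caller might pass subprocess.run.
    Returns the profile path, which no longer exists on return.
    """
    with tempfile.TemporaryDirectory(prefix="session-") as profile_dir:
        launch(firefox_command(profile_dir, url))
    return profile_dir
```

Passing `launch` as a parameter keeps the sketch testable without a browser installed.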



Conclusions
• Viruses/malware are relatively low risk
– The majority are redirections to sites
which should not exist in the archive
• Useful having two virus checks
– Blacklists give flexibility to conduct other tests
– Keeps track of viruses/malware – does the
detection rate change over time?




Conclusions
• The technology is capable of regular archive file
scanning
– File extraction is quite efficient
• Quarantine period needed between collection and
consultation to “catch the latest viruses”
– Minimum 2 weeks (maybe more) to allow for
updates to antivirus software
• More testing required!
– Full test of antivirus proxy still required
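
The quarantine rule above amounts to a simple date comparison. A minimal sketch follows, using the two-week minimum from the slide as the default; the function name `available_for_consultation` and the example dates are assumptions for illustration.

```python
from datetime import date, timedelta

# The "minimum 2 weeks" suggested on the slide; longer may be appropriate.
QUARANTINE = timedelta(days=14)


def available_for_consultation(
    captured_on: date, today: date, quarantine: timedelta = QUARANTINE
) -> bool:
    """Return True once the quarantine period since capture has elapsed."""
    return today - captured_on >= quarantine


# Example: content captured two weeks before 28 June 2011 becomes available.
released = available_for_consultation(date(2011, 6, 14), date(2011, 6, 28))
held = available_for_consultation(date(2011, 6, 15), date(2011, 6, 28))
```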

05 Questions

				