"Here be Dragons"
Strategies for dealing with viruses in the web archive
01
The Nature of the Problem
• What are the Dragons?
– We talk about "computer viruses" but we’re
describing all forms of malware: viruses, trojans,
worms

• Do we keep them out of the archive completely?
– No, because we consider them to be an important
part of the web as it exists today



                                       Here be Dragons   | 8 août 2012   4
Web capture at INA
[Diagram: The Internet, Proxy Server, Web crawlers, Storage]


Where are the Risks?
•   Low Risk during capture
–   Robots stream data for storage
–   Captured data not executed in any way
•   Medium Risk during testing/file verification
–   Statistics, index creation
•   High Risk during consultation
–   Files are recovered and executed in their original
    environment (albeit without a live element)


What do we need to Protect?
• Protection of the archive
– Can we be sure that viruses cannot damage the
archive itself?
• Protection of user workstations
– Currently the user environment is simply a Firefox
browser running under MacOS.
• Protection of the internal network
– The majority of our systems are Linux/Mac based
and are therefore less prone to viruses*; however,
the wider network has Windows infrastructure
    02
Antivirus tools
What tools are available?
• ClamAV (Linux/Mac/Windows)
– Open source (GPL) anti-virus toolkit for UNIX
– Most popular
• Dazuko (Linux)
– File Access Control
– Kernel modifications required
– File access automatically blocked
• HAVP (Linux)
– HTTP Antivirus-proxy

Strategy
• Protection of the archive
– Run antivirus checks at regular intervals
– Create a list of files ‘at risk’ - blocked from users
• Protection of user workstations/internal network
– Check all traffic that passes between the archive
and the user
– Allow a quarantine period between archival and
consultation


Background scanning

[Diagram: Storage, Archive File Extraction, Clamdscan (File 1.. to File 4..), Virus/malware index blacklist, Antivirus proxy]

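The background scan could feed the blacklist roughly as follows. This is a minimal sketch, not the implementation used at INA: it assumes the scanner prints one "path: SignatureName FOUND" line per infected file (the usual clamdscan report format), and the file names and signature names in the sample are invented for illustration.

```python
def parse_found_lines(scan_output):
    """Map each infected path to the signature name reported for it."""
    blacklist = {}
    for line in scan_output.splitlines():
        line = line.strip()
        if not line.endswith(" FOUND"):
            continue  # skip "OK" lines, error lines and any summary text
        # Split on the last ": " so that paths containing ": " stay intact.
        path, sep, signature = line[: -len(" FOUND")].rpartition(": ")
        if sep:
            blacklist[path] = signature
    return blacklist


# Hypothetical scanner output for three extracted files.
sample_output = """\
extracted/file1.html: OK
extracted/file2.html: Example.Iframe-1 FOUND
extracted/file3.exe: Example.Trojan-2 FOUND
"""

blacklist = parse_found_lines(sample_output)
for path, signature in sorted(blacklist.items()):
    print(path, signature)
```

Keeping the signature name next to each path is a design choice here: it lets the blacklist be re-examined later, for example when a second scanner disagrees.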
Proposed "Antivirus on the fly"

[Diagram: Storage, Antivirus proxy, Virus/malware index blacklist (Allow/Block), Clamdscan (Allow/Block), Error message, User workstation. Firefox connects to proxy - port 81]

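The allow/block decision in the proposed proxy might look like the sketch below. It rests on assumptions that the slide does not state: the blacklist is consulted before the scanner, a fresh detection is added to the blacklist, and `scan` is a hypothetical callable that returns a signature name or None.

```python
def decide(path, blacklist, scan):
    """Return ("block", signature) or ("allow", None) for one requested file."""
    if path in blacklist:
        # Already flagged by an earlier scan: block without rescanning.
        return ("block", blacklist[path])
    signature = scan(path)  # hypothetical scanner hook
    if signature is not None:
        blacklist[path] = signature  # assumption: remember the new hit
        return ("block", signature)
    return ("allow", None)


def fake_scan(path):
    """Stand-in scanner for the example: flags only one invented file name."""
    return "Example.Exploit-3" if path == "bad.js" else None


known = {"old.exe": "Example.Trojan-2"}
print(decide("old.exe", known, fake_scan))
print(decide("bad.js", known, fake_scan))
print(decide("index.html", known, fake_scan))
```

A "block" result would correspond to the error message returned to the user workstation in place of the file.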
03
Results
Test Results
• Full DLWeb Archive Test: 6.3 TB compressed:
– 106 million files verified, ~20 million files per day
– 5 files greater than the configured limit of 2 GB
– 10 failed connections
– ClamAV reported 2458 viruses (28 unique), ~1 per
50,000 files
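The figures above can be checked with a few lines of arithmetic. The inputs are the numbers on the slide; the variable names are only illustrative. At the stated throughput the full scan takes a little over five days, and the detection count works out to roughly 1 per 43,000 files, which the slide rounds to 1 per 50,000.

```python
files_verified = 106_000_000   # files checked in the full archive test
files_per_day = 20_000_000     # approximate scanning throughput
detections = 2458              # files reported as infected by ClamAV

days_to_scan = files_verified / files_per_day
files_per_detection = files_verified / detections

print(f"Scan duration: about {days_to_scan:.1f} days")
print(f"One detection per {files_per_detection:,.0f} files")
```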




Virus Results by individual classification




Virus Results by grouping




Test Results - continued
• Full DLWeb Archive Test: 6.3 TB compressed:
– 2459 files re-verified with Windows AV (AVG)
– 2095 detected as viruses (8 unique)




Test Results - comparison
• US Election 2000 (Library of Congress)
– 12 million files checked
– 362 viruses found by ClamAV (1 in 33,000 files)



Source: Virus Checking Web Archives, G. Jones, M. Ashenfelder, I. Garcia del Campo (Library of Congress), netpreserve.org - Preservation Working Group
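To put the two collections on the same scale, the detection counts can be expressed per million files. This sketch uses only the figures quoted in these slides (2458 detections in 106 million files for DLWeb, 362 in 12 million for US Election 2000); the function name is illustrative.

```python
def per_million(detections, files_checked):
    """Detections per million files checked."""
    return detections / files_checked * 1_000_000


dlweb_rate = per_million(2458, 106_000_000)
election_rate = per_million(362, 12_000_000)

print(f"DLWeb archive:    {dlweb_rate:.1f} detections per million files")
print(f"US Election 2000: {election_rate:.1f} detections per million files")
```

The two rates are of the same order of magnitude, which is about as much as can be concluded given the different collections and signature databases.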

General Virus Statistics
• How do our test results compare?
– Exploit & Iframe take first place, followed by
backdoors and trojans

[Chart: Blocks by Malware Type. Source: Scansafe.com web browsing statistics, Q2 2009 Global Threat Report, http://www.scansafe.com/__data/assets/pdf_file/13546/Q209_GTR_FINAL.pdf]




 04
Conclusions
Issues with ClamAV
• File limits
− Need to split up files to effectively test them
• Scalability
− How well does the antivirus proxy scale with
multiple user sessions?
• Connection failures
− Need retry capability
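
The retry capability could be as simple as the wrapper sketched below. It is a hedged illustration rather than anything ClamAV provides: `operation` stands for whatever call submits a file to the scanner, and the attempt count and delays are arbitrary assumptions.

```python
import time


def with_retries(operation, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with a growing delay."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds * attempt)  # wait longer after each failure
    raise last_error


# Hypothetical flaky scanner call: fails twice, then succeeds.
calls = {"count": 0}

def flaky_scan():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("scanner connection refused")
    return "OK"


result = with_retries(flaky_scan, sleep=lambda seconds: None)
print(result, "after", calls["count"], "attempts")
```

Files that still fail after the last attempt are better recorded for a later pass than silently treated as clean.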



Reducing the risk
• Simple steps
− Ensure that all processes accessing archival
material have the correct permissions, i.e. do not run as root
• Choice of operating system
− Where possible, avoid Windows; MacOS and
Linux/UNIX have far lower rates of viruses and
malware – but will this be the case in 20 years?
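
The "not root" rule can be enforced at start-up of any process that touches archival material. A minimal sketch for Unix-like systems follows; the function names are illustrative, and the check relies on the convention that the root user has effective user id 0.

```python
import os


def is_root(euid):
    """True if the given effective user id belongs to root."""
    return euid == 0


def ensure_unprivileged(euid=None):
    """Raise PermissionError if the process would run with root privileges."""
    if euid is None:
        euid = os.geteuid()  # available on Unix-like systems only
    if is_root(euid):
        raise PermissionError("refusing to process archive material as root")


# Example with an explicit, ordinary user id (1000 is just a typical value).
ensure_unprivileged(1000)
```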




Disposable Sessions/Virtualisation
• Can we create temporary sessions?
– Virtual sessions which are deleted after each user
– Good level of protection for each user workstation
– Protection is focused on any files accessible by the
user (such as removable media)
– Could also be a solution in supporting legacy
applications required to read archived contents.
– Also moves us closer to a virtualised world (where a
lot of server architecture is heading)
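
The create, use, delete lifecycle of a disposable session can be illustrated on a much smaller scale with a throwaway browser profile. This is a sketch under stated assumptions and not the virtualised sessions discussed above: it protects far less than a virtual machine, the directory prefix is invented, and the browser is only described as a command, not started.

```python
import contextlib
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def disposable_profile():
    """Create a temporary profile directory and delete it afterwards."""
    profile_dir = Path(tempfile.mkdtemp(prefix="consultation-"))
    try:
        yield profile_dir
    finally:
        shutil.rmtree(profile_dir, ignore_errors=True)


def browser_command(profile_dir):
    """Firefox command line that uses the given profile directory."""
    # -profile and -no-remote are Firefox command-line options.
    return ["firefox", "-no-remote", "-profile", str(profile_dir)]


with disposable_profile() as profile:
    command = browser_command(profile)
    # subprocess.run(command) would start the user's session here.
    print(command)
# On leaving the block the profile directory has been removed.
```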



Conclusions
• Viruses/malware are relatively low risk
– Most are redirections to sites which
should not exist in the archive
• Having two virus checks is useful
– Blacklists give flexibility to conduct other tests
– Keeps track of viruses/malware – does the
detection rate change over time?




Conclusions - continued
• The technology is capable of regular archive file
scanning
– File extraction is quite efficient
• Quarantine period needed between collection and
consultation to “catch the latest viruses”
– Minimum 2 weeks (maybe more) to allow for
updates to antivirus software
• More testing required!
– Full test of antivirus proxy still required
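
The quarantine rule reduces to simple date arithmetic. The sketch below assumes the two-week minimum mentioned above and uses illustrative function names; the capture date in the example is arbitrary.

```python
from datetime import date, timedelta

QUARANTINE = timedelta(weeks=2)  # minimum period suggested above


def available_from(capture_date, quarantine=QUARANTINE):
    """First day on which a capture may be opened for consultation."""
    return capture_date + quarantine


def is_consultable(capture_date, today, quarantine=QUARANTINE):
    """True once the quarantine period since capture has elapsed."""
    return today >= available_from(capture_date, quarantine)


captured = date(2012, 8, 8)
print(available_from(captured))
print(is_consultable(captured, today=date(2012, 8, 21)))
print(is_consultable(captured, today=date(2012, 8, 22)))
```

Making the period a parameter keeps the "maybe more" option open without changing the calling code.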

05
Questions

								