• Caching uses faster hardware to save information
  (code or data) that you have used recently so that, if
  you need it again, it takes less time to access
   – for processing a program, caching takes place in cache
     memory, which is located either on the CPU or on the
     motherboard
      • storage is typically for a very brief time period (fractions of a
        second)
   – for secondary storage, caching is stored in a buffer on the
     hard disk
      • storage is typically until there are new hard disk accesses
   – for web access, caching is stored on the hard disk itself
      • storage is typically for about a month if the information being
        stored is static (dynamic web content is usually not cached)
Controlling Browser Caches from Apache
• Why wouldn’t you want your pages cached in the browsers
  of the users who visit your website?
   – if the content is being modified often
   – if the content is dynamic
      • typically browser caches will not cache dynamic content (e.g., if the file
        extension is .php, .dhtml, or .cgi), but this is not always the case
   – if the web page causes cookies to be created or set
   – if information is sent to a specific user, for instance a page
     created as a result of entering data into a form
• From Apache, you can control how long items are cached
  by using the mod_expires module, which generates the
  HTTP Expires header
   – if the content is being modified, you can set an expiration date
     relative to the last modification date
   – alternatively, you might want to set an expiration date relative
     to the date the item is sent (e.g., 3 days from now)
                      Expires Directives
• ExpiresActive – controls whether you will send an expiration in
  the header or not (values on or off)
   – just because you set this to on does not necessarily mean an
     expiration will be sent with the header, see below
• ExpiresByType type/encoding “<base> [plus] {<num> <type>}*”
   – this is an algorithm for computing expiration time for a particular file
• ExpiresDefault “<base> [plus] {<num> <type>}*”
   – this is an algorithm for computing the expiration time for all other
     file types
       •   <base> is either access or modification
       •   plus is merely a keyword, optional
       •   num is a number
       •   type is the type of time unit (e.g., month, week, day, hour)
   – examples:
       • ExpiresDefault “access plus 2 weeks”
        • ExpiresByType text/html “access plus 1 day”
        • ExpiresByType image/gif “access plus 1 week 3 days 6 hours”
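Combining the three directives, one plausible server-wide sketch (the content types and intervals are illustrative, not recommendations):

```apache
ExpiresActive On
# default: everything expires two weeks after it was accessed
ExpiresDefault "access plus 2 weeks"
# HTML pages change more often than images, so expire them sooner
ExpiresByType text/html "access plus 1 day"
ExpiresByType image/gif "access plus 1 week 3 days 6 hours"
# alternatively, tie expiration to the file's own modification time
ExpiresByType text/css "modification plus 3 days"
```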
                Expiration Contexts
• These directives can be placed in any context
   – in a container, the directives only impact files of that
     container (e.g., that directory)
   – in an .htaccess, the directives only impact files of that
     directory (or lower)
   – ExpiresByType and ExpiresDefault can override earlier
     definitions if placed in a lower context
      • e.g., if placed in <Directory /var/web/htdocs> and a later one is
        placed in <Directory /var/web/htdocs/pub> then the later one
        overrides the earlier one, and if there is an .htaccess file in
        /var/web/htdocs/pub/foo, it overrides the earlier ones
• If you do not use a default and a file does not match
  the given type in any ExpiresByType directive, then no
  expiration header would be sent
   – in order to ensure that a document is resent every time, set
     the expiration to 1 second (the least amount of time)
      • ExpiresDefault “access plus 1 second”
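The override behavior described above can be sketched as follows (the directory names are taken from the earlier example; the intervals are illustrative):

```apache
<Directory /var/web/htdocs>
    ExpiresActive On
    ExpiresDefault "access plus 2 weeks"
</Directory>

<Directory /var/web/htdocs/pub>
    # a lower (more specific) context overrides the earlier default
    ExpiresDefault "access plus 1 day"
</Directory>

# an .htaccess file in /var/web/htdocs/pub/foo could in turn override both,
# e.g., forcing documents there to be resent every time:
#   ExpiresDefault "access plus 1 second"
```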
                        SSI Caching
• Recall that SSI is used to generate and/or insert dynamic
  content into your web pages
   – Apache will not place a last modification date or content-
     length in the HTTP header of any SSI page because these are
     difficult for Apache to determine, so by default, there is no
     date to compare against with respect to whether a document
     has expired (and thus, it will be assumed to have expired)
• However, if you want to permit SSI caching, it is possible
  because SSI can also be used to create the outlines for a
  page (e.g., using #include to include the navigation bar
  and footer)
   – two ways to do this:
      • use the XBitHack Full directive which tells Apache to determine the
        last modification date of the SSI file, not any included files
      • use the mod_expires and set the directives of the previous slides to a
        specified time for these particular files using a <Directory> or <Files>
        or <Location> container
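Either approach can be sketched in a few lines of configuration (the file pattern and interval are illustrative; XBitHack applies to files whose user-execute bit is set):

```apache
# Option 1: base Last-Modified on the SSI file itself, not its includes
XBitHack Full

# Option 2: attach an explicit expiration to the SSI pages via mod_expires
<Files "*.shtml">
    ExpiresActive On
    ExpiresDefault "access plus 1 hour"
</Files>
```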
                     Proxy Caches
• Browser caches are useful for users who frequently
  view the same web sites over and over
  – but for an organization, the browser caches cannot help
     • that is, a browser cache is local to a single computer and not
       shared among multiple clients of the same site
     • so the organization needs a cache that can extend across multiple
       users so that users who view the same web sites can obtain pages
       from the shared cache instead of having to wait for content to
       come across the Internet
• A proxy cache is one that extends across the users of
  an organization
  – a proxy cache is part of a proxy server
  – the proxy server offers a cache for all users so that
    commonly accessed content can be retrieved across the
    Internet once and then shared, improving network usage
                      Proxy Servers
• A proxy server serves at least two functions
   – it offers an extended cache to the local users so that
     multiple users who access the same pages get a savings
   – it offers control over what material can be brought into the
     organization’s network and thus on to the clients
      • for instance, it can filter material for viruses
      • it can also filter material to disallow access to pornography, etc
   – other functions that it can serve include
      • an authentication server
      • performing SSL operations like encryption and decryption
      • collecting statistics on web traffic and usage
   – additionally, the proxy server can offer an added degree of
     anonymity in that it is the proxy server that places
     requests of remote hosts, not an individual’s computer
       • thus, the IP address sent to servers is that of the proxy server,
         not of the client
        Forward vs Reverse Proxies
• The typical form of proxy server is the forward proxy
   – a collection of browsers (on the same LAN, or within an
     organization) share the same proxy server
   – all client requests go to the proxy server
      • the server looks in its cache to see if the material is available
       • if not, the server makes sure that the request can be fulfilled
         (does not violate any access rules), and sends the request over
         the Internet
       • once a response is received, the server caches it and responds to
         the client
• A reverse proxy server is used at the server end of the
  communication
   – requests from the Internet come into the proxy server, which
     then determines which web server to route each request on to
   – this might be used to balance the load of many requests for a
     company that runs multiple servers
   – it also allows the proxy server to cache information and
     respond directly if the requested page is in its cache
      • we’ll consider reverse proxy servers in a bit
   Using Apache as a Forward Proxy Server
• You can use your Apache web server as a forward proxy
  server with little enhancement
   – you might do this if you already have a web server and your
     organization uses it often (say for 50% of all web traffic)
• Use the mod_proxy module and the <proxy> container
   – ProxyRequests on (or off – the default)
   – ProxyTimeout x (x is in seconds)
   – <Proxy URL>
       • access directives here such as Deny from all, Allow from, and/or
         authorization directives that require logins
   – </Proxy>
      • the URL can be a regular expression and matches based on the
        outgoing URL, use * for “all http requests”
• NOTE: to use Apache as a forward server, you must also
  configure your browsers to work with a proxy server
   – see pages 332-337
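A minimal forward-proxy sketch (module paths and the client subnet are illustrative; the access syntax is Apache 2.2 style, matching the Deny/Allow directives above):

```apache
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so

ProxyRequests On        # enable forward proxying (default is Off)
ProxyTimeout 300        # give up on a stalled request after 300 seconds

<Proxy *>
    # only our own clients may use the proxy
    Order deny,allow
    Deny from all
    Allow from 10.2.0.0/16
</Proxy>
```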
                       Other Directives
• NoProxy valuelist
    – This directive specifies items (domains, subnets, IP addresses,
      hostnames) that should be handled by this server directly rather
      than forwarded to a remote proxy
    – These items are separated by spaces
        • any URL destined for one of these locations is served directly;
          all other URLs are forwarded to the proxy named by a
          ProxyRemote directive
       • ProxyRemote *
• ProxyBlock valuelist
    – Unlike NoProxy, the valuelist here can also include words or *;
      if a listed word appears anywhere in the URL, the request is
      blocked
• ProxyVia (values: on, off – the default, full, block)
   – If on, then any request that is redirected by proxy will have a “Via:”
     line added to the header to indicate how the request was serviced
   – If full, then the server’s version is added to each Via line
   – If block, then remove all Via lines
       • this differs from off because off only controls this server, other servers
         whose ProxyVia is on will still insert their own Via line, if set to block, all
         via headers are removed by this server
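Taken together, these directives might appear in a forward-proxy configuration like this sketch (the host names and addresses are hypothetical):

```apache
ProxyRemote * http://upstream-proxy.example.com:8080  # forward misses here
NoProxy .example.com 192.168.0.0/16   # serve these directly, do not forward
ProxyBlock gambling .badsite.example  # refuse URLs containing these
ProxyVia On                           # record this hop in a Via: header
```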
               Additional Modules
• Aside from mod_proxy, you might use any of these
  – mod_proxy_ajp – AJP support for mod_proxy
     • AJP is the Apache Jserv Protocol which enhances performance
       by using a new binary format for packets, and adds SSL security
  – mod_proxy_balancer – extension for load balancing
     • Your Apache proxy server can issue requests to back-end servers
       (i.e., in a reverse proxy setting)
     • There are 3 load balancing algorithms:
         – Request Counting – each server gets an equal number of requests
         – Weighted Traffic Counting – requests are distributed based on byte
           count of the amount of work each server has recently handled
         – Pending Request Counting – distributed based on how many requests
           are currently waiting for each server
  – mod_proxy_connect – used for CONNECT request
    handling (connect is an HTTP method)
  – mod_proxy_ftp, mod_proxy_http – ftp and http support
    for the proxy server
  – mod_cache – as we discussed earlier in these notes
               Reverse Proxy Uses
• The reverse proxy works at the server end
   – One of its capabilities is to perform load balancing
• Additional features of the reverse proxy server are
   – Scrubbing – verification of incoming http requests to make
     sure that each request is syntactically valid
   – Fault tolerance – as part of load balancing, if a server goes
     down, the reverse proxy server can continue to maintain the
     incoming requests by reallocating the requests that the
     server was supposed to handle, and rebalancing the load to
     the remaining available servers
   – HTTPS support – if the back-end web servers do not have
     the capability
   – Redeployment – if a request requires a web application
     (e.g., execution of perl code), the request can be sent to a
     separate server that runs code
   – Central repository – to cache static data for quick response
The Reverse Proxy Server
  Apache as a Reverse Proxy Server
• By default, Apache is configured to serve as a
  reverse proxy server
  – the forward proxy server is controlled through the
    ProxyRequests directive
     • if you set it to off, Apache will function as a reverse proxy
       server but not a forward proxy server
  – the directives ProxyPass and ProxyPassReverse map
    incoming URLs to a new location no matter if that
    incoming URL is coming internally (from a client of
    this site that the Apache server is serving as a proxy)
    or externally (for a reverse proxy mapping)
     • ProxyPass /foo
     • ProxyPassReverse /foo
         – now, any request received by this server for anything under
           directory /foo is sent (redirected) to the location named in the
           ProxyPass directive
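Completing the fragmentary example above, a reverse-proxy mapping might look like this sketch (the back-end host internal.example.com is hypothetical):

```apache
ProxyRequests Off
ProxyPass        /foo http://internal.example.com/foo
ProxyPassReverse /foo http://internal.example.com/foo
# ProxyPassReverse rewrites Location: headers in responses so that
# redirects issued by the back end still point at this proxy
```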
• Apache is not the best proxy server
   – its main use is for a web server only
   – it does not provide the types of security and access control that
     Squid has
• Therefore, we will concentrate on Squid
   – note: the textbook also discussed Pound but we will skip that
• Squid is an open source proxy server
   – its main use is as a forward proxy server but it can also be set up as
     a reverse proxy server
   – its genesis is back with the original CERN HTTP server from 1994
     which had a caching module
    – the caching module was separated and it has evolved over time into
      Squid
• In these notes, we will look at installing and running a basic
  configuration for Squid along with setting up access control
  list directives to control access and content
   – we are going to skip over a lot of detail on Squid as there is not
     sufficient time to cover it
                Installing Squid
• Squid source code for Linux or Windows can be
  found on the Squid project website
  – installing from the source code allows you to
    configure Squid (much like we did with Apache)
• Before compiling, we will want to “tune” our
  operating system
  – recall that Linux limits the number of file descriptors
    available to your software (possibly to 64)
  – this wasn’t too critical in Apache unless we were
    going to use it to run a lot of virtual hosts
  – but it is important to Squid because Squid uses a file
    descriptor for each request, so we will want to
    increase the number of descriptors available to Squid
         Increasing File Descriptors
• In most Unix systems, it’s easy, just use ulimit -n
   – ulimit -n unlimited
   – ulimit -n 8192 (or some other number)
      • some Unix systems will use different commands or editing a
        config file
• In Linux it’s more complex
   – edit the file /usr/include/bits/typesizes.h and change the
     entry #define __FD_SETSIZE 1024 to a larger number
     such as 4096 or 8192
    – next, place that number in the file /proc/sys/fs/file-max
      (instead of editing that file, you could do echo 8192 >
      /proc/sys/fs/file-max)
    – now, you can use ulimit with -Hn as in ulimit -Hn 8192
      • make sure the number you use is consistent in all three operations
      • when done, you do not have to reboot Linux, now you can
        configure and compile Squid from source code
                 Configure Options
• Similar to Apache, you can change many defaults in
  Squid through the ./configure command
  – --prefix – same as in Apache
  – --sysconfdir, --localstatedir
     • change the location of the configuration (from prefix/etc) and var
       (from prefix/var) directories, the var directory stores Squid’s log
       files and disk cache
  – --enable-x
     • allows you to enable Squid-specific modules including
         – gnuregex
         – carp (Cache Array Routing Protocol useful for forwarding cache
           misses to an array)
         – pthreads
         – storeio (storage modules)
         – removal-policies
         – ssl, openssl
     • full list is available at http://wiki.squid-
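An illustrative build sequence follows; the chosen options are examples, not requirements, and should be checked against the option list for your Squid version:

```shell
./configure --prefix=/usr/local/squid \
            --sysconfdir=/usr/local/squid/etc \
            --localstatedir=/usr/local/squid/var \
            --enable-storeio=ufs,aufs \
            --enable-removal-policies=lru,heap
make
make install
```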
               Squid Configuration
• Once compiled and installed, running Squid is fairly
  simple if you don’t want to make any changes to the
  default configuration
  – the config file is squid.conf
     • like httpd.conf, the file contains comments and directives
     • directives are similar to httpd.conf directives in that they all start
       with the directive and are followed by zero or more arguments
       which can include for instance on/off, times (e.g., 2 minutes), IP
       addresses, filenames/paths, keywords such as deny, UNGET, etc
  – we will study some of the configuration directives later
  – as with Apache, changing the conf file requires that you
    restart Squid so that the file can be reread
     • although in Squid, you can keep Squid running and still have it
       reread this file
   – unlike Apache, Squid directives and values are case sensitive
                Initializing the Cache
• Before running Squid, and whenever you want to add a
  new cache directory, you must first initialize the cache
   – squid –z
    – this initializes all of the directories listed in cache_dir directives
• For this command to work successfully
   – you must make sure that the owner that Squid runs under
     (probably squid) has read and write permission for each of the
     directories under cache_dir
       • when these directories are created, make sure they are either owned by
         squid or that squid is in the same group as the owner
   – the name of the owner of these directories is established using
     the cache_effective_user directive in squid.conf
   – you should start squid using the command su – squid
       • this tells Squid to switch from root to squid as soon as it can (after
         dealing with root-only tasks)
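The pieces referenced above come together in squid.conf roughly like this (the path and sizes are illustrative):

```conf
# run as this user once root-only startup work is done
cache_effective_user squid
# ufs storage: 100 MB cache, 16 first-level and 256 second-level directories
cache_dir ufs /usr/local/squid/var/cache 100 16 256
```

Running squid -z then creates the subdirectory tree under that path.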
                         Running Squid
• You start Squid from the command line and control it much like
  apachectl, but there are a lot of possible options, here we look
  at the most important
   – -a port – start Squid but have it listen to the port supplied rather than
     the default port (3128), this also overrides any port specified in
     squid.conf using the http_port directive
   – -f file – specify alternative conf file
   – -k function – perform administrative function such as reconfigure,
     rotate, shutdown, debug or parse
       • parse causes Squid to read the conf file to test it for errors without reading it
         to configure itself, this is useful for debugging your conf file
   – -s – enables logging to the syslog daemon
   – -z – initializes cache directories
   – -D – disable initial DNS test
       • squid usually tests the DNS before starting
    – -N – keep Squid in the foreground instead of running it as a
      background process
       • you might do this when first testing Squid so that you can see immediate
         feedback printed to the terminal window, once debugged, kill Squid and
         rerun it without this option
• If you want to run Squid upon booting
    – you might add the start-up command to a script in rc.d or init.d
• Many people do not like running Squid in the main OS
  environment
    – for security purposes, just as you might not want to run Apache
      in the main OS environment, they instead create a chroot
      environment for Squid
   – this is a new root filesystem directory separate from the remainder of
     the filesystem
   – anyone who hacks into squid will not be able to damage your file
     system, only the chroot environment
• The safest way to shut down Squid is through
   – squid –k shutdown
       • do not use kill
• To reconfigure Squid after changing squid.conf
   – run squid –k reconfigure, this prevents you from having to
     stop/restart squid
• To rotate Squid log files, use squid –k rotate
   – put this in a crontab to rotate the files every so often (e.g., once a day)
                       ACLs in Squid
• Since Apache can be used as a proxy server, you might
  wonder: why use Squid?
   – squid allows you to define access control lists (acls) which in
     turn can then be used to specify rules for access
      • who should be able to access web pages via squid?
      • what pages should be accessible? are there restrictions based on file
        name? web server? web page content or size?
      • what pages should be cached?
      • what pages can be redirected?
   – such rules are defined in two portions
      • acl definition (similar to what we saw when defining accessors in bind)
      • followed by an access statement (allow or deny statements)
   – Squid offers a variety of acl definition types
      •   IP addresses
      •   IP aliases
      •   URLs
      •   User names (requiring authentication)
      •   file types
        Defining ACLs and Rules
• Define access in two steps
  – first, define your ACL statements
     • simple definitions of a name to a specification
         – such as calling a particular IP address “home” or using a regular
           expression to match against URLs and giving the matching set a
           name
      • each acl contains a type that specifies what type of
        information you are using as a comparison, e.g., IP address,
        IP alias, user name, filename, port number, regular
        expression
     • the rule will typically specify if this acl can or cannot gain
       access through squid, for instance, if foo is a previously
       defined acl, then the following allows access
        – http_access allow foo
  – you must define an acl before you use it in any rule
• The most common form of acl is to define and permit
  access to specific clients
   – we will define some src (source IP address) acls
      • typically with src, we define specific IP addresses or subnetworks
        (rather than IP aliases)
    – acl localhost src 127.0.0.1/32
       • here, we define the source acl “localhost” to be the IP address
         127.0.0.1 (note the order: acl name type value)
    – acl mynet src 10.2/16
       • this could also be written as 10.2.0.0/255.255.0.0
• Now we use our acls to allow and deny access
   – http_access allow localhost
   – http_access allow mynet
   – http_access deny all
      • here, we are allowing access only from localhost and those on “mynet”,
        everyone else is denied
      • order of the allow and deny statements is critical, we will explore this
        next time
                    Types of ACLs
• Aside from src, you can also specify ACLs based on
  – dst – the URL of the web server (destination)
  – srcdomain and dstdomain – same as src and dst except
    that these permit IP aliases
  – srcdom_regex and dstdom_regex – same as srcdomain
    and dstdomain except that the IP aliases can be denoted
    using regular expressions
  – time – specify the times and days of the week that the
    proxy server allows or denies access
   – port, method, proto – specify the port(s) that the proxy
     server permits access to, the HTTP methods allowable (or
     denied) and the protocol(s) allowable (or denied)
  – rep_mime_type – allow or deny access based on the type
    of file being returned
     • we will study these (and others) in detail next time
                 Types of ACLs
• src – the IP address of the user (client) whose
  requests are going from their browser to the squid
  proxy server
• srcdomain – the IP alias of the user
• dst – the IP address of the requested URL (i.e., of the
  web server)
• dstdomain – the IP alias of the requested URL
• myip – same as src, but it is the internal IP address
  rather than (possibly) an external IP address
• srcdom_regex, dstdom_regex – same as srcdomain
  and dstdomain except that regular expressions are
  permitted
• You can specify an IP alias using src or dst, but this
  requires that squid use a reverse DNS lookup
   – it is best to use srcdomain/dstdomain if you want to
     specify aliases instead of addresses
• When using srcdomain and dstdomain, you can
  specify part of the domain, such as .edu, instead of
  the full host name
   – this is not true if you specify IP aliases using src and dst
• If you use src/dst, then after doing the reverse
  lookup one time, the value is cached
   – if the IP address were to change, squid would not be able
     to find the computer in the future
                     More ACLs
• port – specify one or more port numbers
  – ranges separated by – as in 8000-8010
   – multiple ports are separated by spaces or placed on separate
     lines
  – typically, you will define “safe” ports and then disallow
    access to any port that is not safe, for example:
      • acl safe_ports port 80 443 8080 3128
      • http_access deny !safe_ports
• method – permissible HTTP method
   – squid also knows additional, non-standard methods, such as
     PURGE
      • acl allowable_method method GET HEAD OPTIONS
      • http_access deny !allowable_method
                    More ACLs
• proto – permissible protocol(s)
  – http, https, ftp, gopher, whois, urn and cache_object
      • ex: acl myprotos proto HTTP HTTPS FTP
• proxy_auth – requires user login and a
  file/database of username/passwords
   – you specify the allowable user names here, such as
      • acl legal_users proxy_auth foxr zappaf newellg
• maxconn – maximum connections
  – you can control access based on a maximum number
    of server connections
  – this limitation is per IP address, so for instance you
    could limit users to 25 accesses, once the number is
    exceeded, that particular IP address gets “shut out”
                        Time ACLs
• To control when users can access the proxy server, based
  on either days of the week, or times (or both)
   – S, M, T, W, H, F, A for Sunday – Saturday, D for weekdays
   – time specified as a range, hh:mm – hh:mm in military time
• The format is acl name time [day(s)] [hh:mm - hh:mm]
   – example: to specify weekdays from 9 am to 5 pm:
       • acl weekdays time D 09:00-17:00
   – example: to specify Saturday and Sunday:
      • acl weekend time SA
• The first time must be less than the second
   – if you want to indicate a time that wraps around midnight,
     such as 9:30 pm to 5:30 am, you have to divide this into two
     definitions (9:30 pm – 11:59 pm, and 12:00 am – 5:30 am)
    – if days have different times, you need to separate them into
      multiple statements; for instance, defining a time for M 3-7
      and W 3-8 requires two definitions
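Sketched in squid.conf syntax, the wrap-around window from the example above (9:30 pm to 5:30 am) and two per-day windows might look like this (the acl names are illustrative):

```conf
# 9:30 pm to 5:30 am must be split at midnight
acl NightEarly time 21:30-23:59
acl NightLate  time 00:00-05:30
http_access allow NightEarly
http_access allow NightLate

# different hours on different days require separate acls
acl MonWindow time M 15:00-19:00
acl WedWindow time W 15:00-20:00
```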
Regular Expressions and more ACLs
• As stated earlier, you can specify regular
  expressions in srcdom_regex and dstdom_regex
• There are also regex versions to build rules for
  the URL
  – url_regex and urlpath_regex
     • for the full URL and the path (directory) portion of the
       URL respectively
        – you might use this to find URLs that contain certain words, such
          as paths that include “bin”, or paths/filenames that include words
          like “porn”
  – ident_regex
     • to apply regular expressions to user names after the squid
       server performs authentication
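A small sketch of such rules (the patterns are illustrative; -i makes the match case-insensitive):

```conf
# match on the path portion only
acl BinPaths urlpath_regex -i /bin/
# match anywhere in the full URL
acl BadWords url_regex -i porn
http_access deny BadWords
http_access deny BinPaths
```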
     User Names & Authentication
• The ident acl can be used to match user names
• The proxy_auth acl can specify either REQUIRED or
  specific users by name that then require that a user log in
   – authentication requires that the user must perform a
     username/password authentication before Squid can continue
      • any request that must be authenticated is postponed until authentication
        can be completed
    – although authentication itself adds time, using ident or
      proxy_auth also adds time after authentication has taken place
      because Squid must still look up the user’s name among the
      authentication records to see if the name has been authorized
  mechanisms, so we have to add them as modules much
  like with apache
                 Other ACL Types
• req_mime_type and rep_mime_type
  – test content-type in either the request or response header
  – it only makes sense to use req_mime_type when
    uploading a file via POST or PUT
  – example: acl badImage rep_mime_type image/jpeg
• Browsers
  – restrict what type(s) of browser can make a request
• External ACLs
  – this allows Squid to sort of “pass the buck” by requesting
    that some outside process(es) get involved to determine if
    a request should be fulfilled or not
     • external ACLs can include factors such as cache access time,
       number of children processes available, login or ident name, and
       many of the ACLs we have already covered, but now handled by
       some other server
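The mechanism is the external_acl_type directive; a sketch follows, in which the helper program path and the group name are hypothetical:

```conf
# launch a helper that is handed the login name of each request
external_acl_type group_check %LOGIN /usr/local/squid/libexec/check_group.sh
# the acl is true when the helper approves the user for group "staff"
acl InStaffGroup external group_check staff
http_access allow InStaffGroup
```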
            Matching Against ACLs
• As we have seen, a single ACL can contain multiple items
  to match against
• ACL lists are “ORed” items – the ACL is true if there is a
  match among any item in the list
   – to establish if an ACL is true, Squid works down the list of
     items looking for the first match, or the end of the list
      • if a match is found, the ACL is established as true, otherwise that ACL
        is established as false
   – for example: acl Simpsons ident Lisa Bart Marge Homer
      • Squid will attempt to confirm that the user’s identity, as previously
        established via authentication, matches any one of the items
   – if you have a lot of ACLs and/or lengthy lists in ACLs, it is
     worthwhile ordering the entries based on most common to least
      • imagine that Homer is the most common user, then move Homer’s
        name to be first in the list, and if Bart is the least common user, move
        his name to the end
                       Types of Rules
• The most common rule is the http_access rule
   – access is either allow or deny
    – if allow and the acl matches, then you are allowing the client to
      have access; if deny and the acl matches, you are disallowing
      the client access
• You can also use http_reply_access
   – this allows the retrieved item to be let through the proxy server
     back to the client, again you can use allow or deny
      • this rule allows you to supply definitions that can disallow items being
        returned based on content (type, size, etc)
• You can control whether an item is cached or not using
  no_cache rules
    – here, deny in the rule means “do not cache”
       • it looks like a double negative: no_cache deny someACL
   – you would use this to ensure certain pages do not get cached
     (e.g., they have dynamic content, they aren’t worth caching,
     they are too large)
                    Matching Rules
• Imagine an access rule says
   – http_access allow A B C D
       • this means that all of A, B, C and D must be true for access to be
         allowed
       • Squid will stop searching this rule after the first mismatch, so
         again, you might order these (in this case from the least likely to
         the most likely) to be more efficient (if A is usually true but C is
         seldom true, put C first)
   – http_access deny A B C D
       • all must be true to deny access; if any are untrue, the rule is
         skipped
• To create OR access rules, list each access rule
  sequentially as in
   – http_access allow A
   – http_access allow B
      • now, if either A or B are true, access is allowed
              Allow vs Deny Order
• In Apache, you specified the order that allow and deny are
  enforced using the Order directive
   – in Squid, the order is based strictly on the order of the rules as
     they appear in your conf file
• In Apache, you would specify “deny from all” first and
  then override this with more specific “allow” statements
• In Squid, you do this in the opposite way
   – place an allow statement first, if the rule is true, then the
     remainder of the rules are skipped
   – add a deny all type statement at the end to act as a default or
     “fall through” case
   – you might define ALL to be everyone (e.g., IP address 0/0)
   – the deny all will look like this: http_access deny ALL
• You can specify multiple sets of rules, typically each set
  will contain allow statements and end with a deny ALL
                 Rule Organization
• You have to place your allow and deny rules in a logical
  manner for them to work
    – for instance, you would not do http_access deny ALL as the first
      rule because it would be true of everyone and no other rules
      would be checked
• You will want to organize rules generally like this:
    – specific denial rules
    – specific acceptance rules
    – http_access deny ALL
• In this way, if a particular situation fits both a denial and
  acceptance rule, the access is denied
   – for instance, a request may be acceptable because it has the
     proper src IP address, but it is during the wrong time of day, so
     it should ultimately be denied
    – by reversing the order of the denial and acceptance rules, the
      request would be fulfilled because as soon as it is accepted,
      access is allowed and no further rules are considered
                Common Scenarios
• Allowing only local clients
  –   acl ALL src 0/0
  –   acl MyNetwork src 172.31/16
  –   http_access allow MyNetwork
  –   http_access deny ALL
• Blocking a few clients (assume ALL and
  MyNetwork are as defined above)
   –   acl ProblemHosts src 172.31.4/24
  –   http_access deny ProblemHosts
  –   http_access allow MyNetwork
  –   http_access deny ALL
        • notice the ordering here: since MyNetwork is more general than
          ProblemHosts, we first deny anyone specifically in
          ProblemHosts, then we allow access to those in MyNetwork who
          were not in ProblemHosts
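The first-match behavior described above can be sketched in Python. The ACL names and networks mirror this scenario, but the evaluation loop is only an illustration of the idea, not Squid's actual code:

```python
# Sketch of first-match rule evaluation: rules are checked in order,
# and the first rule whose ACL matches decides the outcome --
# later rules are never reached.
from ipaddress import ip_address, ip_network

# ACLs from the blocked-clients scenario (172.31/16 written in full)
acls = {
    "ProblemHosts": ip_network("172.31.4.0/24"),
    "MyNetwork":    ip_network("172.31.0.0/16"),
    "ALL":          ip_network("0.0.0.0/0"),
}

rules = [  # (action, acl name) -- order matters
    ("deny",  "ProblemHosts"),
    ("allow", "MyNetwork"),
    ("deny",  "ALL"),        # default "fall through" case
]

def http_access(client_ip: str) -> str:
    for action, name in rules:
        if ip_address(client_ip) in acls[name]:
            return action    # first match wins; stop checking
    return "deny"            # default if nothing matches

print(http_access("172.31.4.9"))   # deny  (ProblemHosts checked first)
print(http_access("172.31.7.2"))   # allow (MyNetwork)
print(http_access("10.0.0.1"))     # deny  (falls through to ALL)
```

Swapping the first two rules would let 172.31.4.9 through, since it also matches the broader MyNetwork ACL.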
• Denying access to any URL that looks like it might
  contain pornography
   – acl PornSites url_regex -i porn nude sex [add more terms]
   – http_access deny PornSites
   – http_access allow ALL
        • here we allow anyone access if the URL does not include the list
          of PornSite words
• We might want to add to this a refusal to accept
  replies that contain movie or image files
   –   acl Movies rep_mime_type video/*
   –   acl Images rep_mime_type image/*
   –   http_reply_access deny Movies
   –   http_reply_access deny Images
   –   http_reply_access allow ALL
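The wildcard matching behind patterns like video/* can be illustrated with Python's fnmatch. This is a sketch of the matching idea only, not Squid's implementation:

```python
# Hedged sketch: patterns such as "video/*" behave like shell-style
# wildcards matched against the reply's Content-Type header.
from fnmatch import fnmatch

deny_patterns = ["video/*", "image/*"]   # the Movies and Images ACLs above

def reply_allowed(content_type: str) -> bool:
    # deny if any pattern matches; otherwise the final allow rule applies
    return not any(fnmatch(content_type, p) for p in deny_patterns)

print(reply_allowed("video/mp4"))   # False -- denied as a movie
print(reply_allowed("image/png"))   # False -- denied as an image
print(reply_allowed("text/html"))   # True  -- allowed by the final rule
```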
                        And More
• Here, we restrict access to working hours and to our
  own site (disallowing access to off-site URLs)
   –   acl WorkHours time D 08:30-17:30
   –   acl OurLocation dstdomain “/usr/local/squid/etc/ourURLS”
   –   http_access allow WorkHours OurLocation
   –   http_access deny ALL
• And here is an example to permit only specific ports
   – acl SafePorts port 80 21 443 563 70 210 280 488 591 777
   – acl SSLPorts port 443 563
   – acl CONNECT method CONNECT
   – http_access deny !SafePorts
   – http_access deny CONNECT !SSLPorts
   – http_access allow ALL
                        Redirectors
• A redirector is similar to the rewrite rules and redirection
  used by Apache
   – here, however, we are redirecting an internal request before it
     leaves the proxy server
      • in Apache, we redirect an external request to either a new internal
        location/file or to a new external resource
   – this permits
      •   access control
       •   the removal of advertisements
      •   local mirroring of resources
      •   working around browser bugs
   – with access control, you can even send the user to a page that
     explains why they were rerouted
• A redirector is just a program that reads a URI (along with
  other information) and creates a new URI as output
   – redirectors are often written in Perl or Python, or possibly C
           How to Use a Redirector
• To apply a redirector in Squid, issue one or more of
  the following directives
   – redirect_program specifies the external program to run
     (the redirector)
   – redirect_children specifies how many redirector processes
     Squid should start
   – redirect_rewrite_host_header will update a request header
     to specify that a redirector is being used
   – redirect_access allows you to specify rules that decide
     which requests to send to redirectors
      • without this, every request to Squid is sent to a redirector to
        check to see if it should be redirected
   – redirector_bypass will bypass a redirector if all spawned
     redirectors are currently busy, otherwise the requests
     begin to stack up and wait
           How to Write a Redirector
• A redirector receives four pieces of input
   –   request URI, including any query terms (after the ?)
   –   client IP address (and optionally domain name)
   –   user’s name (from ident or proxy authentication), if available
   –   HTTP request method
• Your redirector code will consist of rules that
   – investigate parts of the input to see if any of the redirector
     rules match the input
        • if a rule matches, then the redirector code will produce output
          which will be a new URI, redirecting the request
        • an example might be to search for any URL being sent to an IP
          address in China, and rewrite the query to a mirror site that exists
          in Taiwan
        • another example is to search the URI for certain “bad words” and
          if any are found, redirect the request to a local page that explains
          why any such requests are not being allowed
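A "bad words" redirector like the one just described might be sketched in Python as follows. The word list and the explanation-page URL (proxy.example.com) are made up for illustration, and the input line format follows the classic redirector interface described above:

```python
#!/usr/bin/env python3
# Minimal redirector sketch -- an illustration, not production code.
# Classic Squid redirectors read one request per line on stdin
# ("URI client-ip/fqdn user method") and write the rewritten URI
# (or an empty line meaning "leave it alone") to stdout.
import sys

BAD_WORDS = ("porn", "nude", "sex")                    # hypothetical list
EXPLAIN_URL = "http://proxy.example.com/blocked.html"  # hypothetical page

def rewrite(request_line: str) -> str:
    parts = request_line.split()
    if not parts:
        return "\n"
    uri = parts[0]
    if any(word in uri.lower() for word in BAD_WORDS):
        return EXPLAIN_URL + "\n"   # send the user to the explanation page
    return "\n"                     # empty line: do not redirect

if __name__ == "__main__":
    for line in sys.stdin:
        sys.stdout.write(rewrite(line))
        sys.stdout.flush()          # Squid waits for each reply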
                    Redirector Code
• Aside from building a new URI for the request, a redirector
  can also alter components of a response header
• The redirector code may involve
  –   database queries
  –   searching the URI for specified regular expressions
  –   complex computations
  –   invoking other programs
        • thus, a redirector can take a long time to respond, which
          would slow Squid’s processing down
       • this is one reason why the bypass directive is available, you
         don’t necessarily want to penalize everyone because of
         redirections taking too much time
• Redirector code is commonly written in Perl but it
  can be written in other scripting languages
           Authentication Helpers
• As with Apache, Squid does not have built-in
  mechanisms for handling password files
  – so Squid turns to authentication helpers
   – Squid supports three forms of authentication; the first two
     are similar to Apache’s
      • Basic, Digest, NTLM (this is an MS authentication protocol)
  – for each of these, you have to download the software and
    compile it and then configure Squid to use the helper
     • we already visited how to write ACL and http_access directives
       that use authentication, so we skip it here
  – basic authentication helpers come with Squid, you can
    use NCSA (simple), LDAP, MSNT (for MS NT
    databases), NTLM, PAM, SASL (which includes SSL),
    winbind or others
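To illustrate how a basic authentication helper works, here is a hedged Python sketch: Squid sends "username password" lines on stdin and expects OK or ERR on stdout. The in-memory user table is purely hypothetical — real helpers such as NCSA or LDAP consult an actual credential store:

```python
#!/usr/bin/env python3
# Hedged sketch of a Squid basic-authentication helper. One login
# attempt per stdin line ("username password"); reply "OK" or "ERR".
import sys

USERS = {"alice": "s3cret", "bob": "hunter2"}   # hypothetical accounts

def check(line: str) -> str:
    try:
        user, password = line.split(maxsplit=1)
    except ValueError:
        return "ERR"                 # malformed input: reject
    return "OK" if USERS.get(user) == password.rstrip("\n") else "ERR"

if __name__ == "__main__":
    for line in sys.stdin:
        print(check(line), flush=True)   # Squid blocks on each reply
```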
                         Log Files
• As with Apache, Squid uses log files to store
  messages of importance and to maintain access and
  error logs
   – however, one additional log that Squid has that Apache
     does not is the cache log (cache.log), which records
     Squid’s operational and cache-related messages
  – there are also optional log files available
     • useragent.log and referer.log which contain information about
       user agent headers and web referers for every access
      • swap.state and netdb_state store information regarding the disk
        and network performance of Squid
  – you can control the names of the log files and which of
    these optional log files are used through directives in your
    conf file
  – because there are so many logs and they can generate a lot
    of content, there are log rotation tools available just as
    with Apache
                          cache.log
• This log contains
   – configuration information
   – warnings about performance problems
   – errors
• Entries are of the form
   – date time | message
• Configuration messages might include such things as
   – process ID of a starting squid process
   – successful (or failed) tests to the DNS and the DNS IP
     address (as obtained from resolv.conf)
   – starting helper programs
• The remaining cache entries are made based on a
  specified debug level that dictates which types of
  operations should be logged here
   – normal information, warnings, errors, emergencies, etc
                         access.log
• Much like Apache’s access log, Squid’s access log will store
  every request received
   – each entry contains 10 pieces of information
       • timestamp
       • response time
       • client address
       • status code of request
       • size of file transferred
       • HTTP method
       • URI
       • client identity (if available)
       • how requests were fulfilled on a cache miss (that is, where we had to go to
         get the file)
       • content type
    – status codes differ from Apache’s, as they indicate cache results
      as well as server status codes, and include TCP_HIT, TCP_MISS,
      TCP_REFRESH_HIT, TCP_DENIED, and NONE
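A quick Python sketch shows how one native-format entry splits into the ten fields listed above; the sample log line is invented for illustration:

```python
# Split a native-format access.log entry into the ten fields above.
# Field names are descriptive labels, not official Squid terms.
FIELDS = ["timestamp", "response_ms", "client", "result_status",
          "bytes", "method", "uri", "ident", "hierarchy", "content_type"]

def parse_access_line(line: str) -> dict:
    # split on whitespace into at most ten pieces, in field order
    return dict(zip(FIELDS, line.split(None, 9)))

sample = ("1065442000.123 95 10.0.0.5 TCP_MISS/200 1024 GET "
          "http://example.com/ - DIRECT/93.184.216.34 text/html")
entry = parse_access_line(sample)
print(entry["result_status"])   # TCP_MISS/200
print(entry["method"], entry["uri"])
```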
           Directives for access.log
• log_icp_queries – default is enabled; allows you to control
  whether ICP (Internet Cache Protocol) requests are logged or not
• emulate_http_log – whether to use the same format as http
  server access logs (that is, match Apache’s server log) or use
  Squid’s native format which contains more information
• log_mime_hdrs – if set to on, Squid will add HTTP request and
  response headers to each log entry (this adds two more fields to
  each entry)
• log_fqdn – this toggles whether Squid records the client by
  IP address or by hostname – if hostname, then Squid has to do
  a reverse DNS lookup, which takes more time
• log_ip_on_direct – similar to the above, except it controls
  whether the destination (origin server) is logged by IP address
  or hostname when Squid goes direct
• strip_query_terms, uri_whitespace – whether to remove the
  query terms from an URL and whether to strip, chop, or encode
  white space in a URL (if any)
                          store.log
• The store.log file records decisions to store and remove
  objects from the Squid cache
   – if an object is cached, the entry includes where it was
     cached and when
   – if an object is uncacheable, then the entry indicates why
     the object was uncacheable
   – if a cache is full, a replacement strategy is used to decide
     what to remove, and any such action is logged here
• The store log contains the following fields:
   – timestamp, action (SWAPOUT, RELEASE, SO_FAIL),
     directory number (which cache), file number, cache key
     (the hash value of the object), status code, date,
     last_modified from the HTTP response header, expires,
     content-type, content-length/size, HTTP method and URI
