Deploying the Google Search Appliance


A Google white paper
November 2006
This guide covers the factors to consider when deciding how best to deploy the Google Search
Appliance to meet your enterprise search needs.


INTRODUCTION
ANALYZE REQUIREMENTS
     Scoping Index Capacity Needs
     Determining What to Index
     Determining How to Index
     Determining Query Scalability
     Scoping Availability Needs
DETERMINE DEPLOYMENT ARCHITECTURE
     Scenario 1 – Single Google Search Appliance GB-1001
     Scenario 2 – Multiple GB-1001 behind a load balancer
     Scenario 3 – Two clusters in a data center
     Scenario 4 – Multiple clusters in two or more data centers
OPERATIONS
     Deployment
     Customization
     Upgrade
     Migration
SUMMARY
Introduction
The Google Search Appliance is a full-featured enterprise search solution that brings Google’s
award-winning search technology to the Enterprise. Because it runs much of the same software
as Google’s own datacenter infrastructure, the Google Search Appliance provides high levels of
search relevancy, scalability, and redundancy to meet the ever-growing, mission critical
information access demands of the enterprise.
Unlike many enterprise applications, the Google Search Appliance is designed to be
self-sufficient: hardware, software, reliable networking, storage, and security support are built
in. Customers don't have to spend large amounts of time or money planning for quality of service,
backup, and restore. This document outlines the recommended methodology for deploying the
Google Search Appliance to meet Document Capacity, Scalability, and Redundancy needs.
Please consult your Google Enterprise Account Team to understand specific features and
capabilities of the appliance.


Analyze Requirements
Scoping Index Capacity Needs
Each appliance unit has been built and tested to index a specified document count. Three models
of the appliance scale by capacity: the GB-1001, GB-5005, and GB-8008. These units can
individually index up to 3M, 10M, and 30M documents respectively.
From a sizing perspective, Google recommends that organizations choose a base unit that
meets current document capacity as well as projected document growth needs for 2 years.
Google's appliances have been designed to be flexible: if a model upgrade is required at any
time, a migration plan allows the new unit to be installed and brought into service with no
service downtime. Running the appliance close to its document limit does not degrade
performance. However, because moving to a larger model requires a hardware change, if the needed
document capacity is close to the indexing limit of a given model, it is recommended to select
the next model up to simplify management of the solution over those 2 years. For example, if
2.5M documents need to be indexed today, the GB-5005 would be the recommended choice in order to
leave room for additional document growth.
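The sizing rule above can be sketched as a small helper. This is only an illustration, assuming
the per-model capacities quoted in this section. The function name `recommend_model` and the 80%
headroom threshold are hypothetical, chosen so that the 2.5M example lands on the GB-5005; they
are not Google guidance.

```python
# Per-unit document capacities quoted in this section.
MODEL_CAPACITY = [
    ("GB-1001", 3_000_000),
    ("GB-5005", 10_000_000),
    ("GB-8008", 30_000_000),
]

def recommend_model(current_docs, projected_growth_docs=0, headroom=0.8):
    """Return the smallest model that covers the need with some headroom.

    headroom is a hypothetical threshold: a model is skipped when the needed
    document count exceeds headroom * capacity.
    """
    needed = current_docs + projected_growth_docs
    for name, capacity in MODEL_CAPACITY:
        if needed <= headroom * capacity:
            return name
    return None  # beyond a single unit under this rule

print(recommend_model(2_500_000))  # GB-5005
```

A different headroom value shifts the cut-over point, so it should be set from your own growth
projections rather than taken from this sketch.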
Determining What to Index
Determining what to index might seem straightforward, but there can be surprises. The content
typically includes external web sites, intranet web applications, portals, file servers, and
enterprise applications. A file system might be much deeper than you think. A content or
document management system might contain more documents than originally thought. Each
database record is also counted as a single "document". Limiting index scope by adjusting
content acquisition rules is often a discovery process, and time for it should be planned.
The appliance provides detailed crawl logs on each document that has been crawled. It also
gives summary information on document types and sizes. These features allow administrators to
fine tune crawler rules.
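One way to gauge the effect of a rule change before applying it is to replay a list of discovered
URLs against candidate include and exclude rules. The sketch below is illustrative only: the
prefix and suffix matching is a simplification and is not the appliance's URL pattern syntax, and
all URLs and function names are hypothetical.

```python
def in_scope(url, include_prefixes, exclude_suffixes):
    """True if the URL starts with an included prefix and has no excluded suffix."""
    if not any(url.startswith(prefix) for prefix in include_prefixes):
        return False
    return not any(url.endswith(suffix) for suffix in exclude_suffixes)

def count_in_scope(urls, include_prefixes, exclude_suffixes):
    """Count how many of the given URLs the candidate rules would keep."""
    return sum(1 for url in urls if in_scope(url, include_prefixes, exclude_suffixes))

# Hypothetical sample of URLs taken from a crawl report.
urls = [
    "http://intranet.example.com/hr/policy.html",
    "http://intranet.example.com/hr/archive/2001.zip",
    "http://files.example.com/eng/spec.pdf",
]

print(count_in_scope(urls, ["http://intranet.example.com/"], [".zip"]))  # 1
```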
Determining How to Index
Determining how to index also needs careful consideration. For example, a file system can be
directly crawled, or it could be mounted as a web share and be crawled as web documents.
Content can also be pushed into the appliance through an XML feed API. The choice has
implications for manageability, security, and most importantly the end user experience.
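A pushed feed is an XML document that describes the records to index. The sketch below only
builds such a document and does not send it anywhere. The element names follow the `gsafeed`
layout used by the appliance's feed format and should be checked against the feeds documentation
for your software version; the data source name, URL, and helper name `build_feed` are
hypothetical.

```python
import xml.etree.ElementTree as ET

def build_feed(datasource, records, feedtype="incremental"):
    """Build a feed document from (url, mimetype, text) tuples."""
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(root, "group")
    for url, mimetype, text in records:
        # Each record becomes one "document" in the index.
        record = ET.SubElement(group, "record", url=url, mimetype=mimetype)
        ET.SubElement(record, "content").text = text
    return ET.tostring(root, encoding="unicode")

feed_xml = build_feed(
    "hr_db",
    [("http://db.example.com/emp/1", "text/plain", "Employee record 1")],
)
print(feed_xml)
```

Pushing content this way suits sources that cannot be crawled directly, such as database records,
at the cost of maintaining the code that generates the feed.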
Determining Query Scalability
Once the appropriate appliance model has been determined based on index capacity,
organizations can begin to scope their query throughput needs. Each model of the appliance has
been tested to meet a query volume of up to 25 queries per second (QPS) (see footnote 1). This
query volume will meet most internal, corporate network search requirements as well as most
public-facing site-search requirements.
However, some organizations will need to scale beyond this figure. In these cases, multiple
appliance units can be deployed in parallel to scale query volume linearly. For example, if it is
determined that during peak usage the search traffic on your external website may exceed 30 QPS,
two appliances can be deployed behind a load balancer to meet this excess load. The appliances
will be of the same model, as determined above based on index capacity.
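The arithmetic behind this example can be written out as a short sketch. It assumes the 25 QPS
per-unit figure quoted above and strictly linear scaling; the function name is hypothetical, and
real throughput depends on the factors listed in footnote 1.

```python
import math

QPS_PER_APPLIANCE = 25  # tested per-unit query volume quoted above

def appliances_for_qps(peak_qps, qps_per_unit=QPS_PER_APPLIANCE):
    """Units needed in parallel, assuming throughput scales linearly."""
    return max(1, math.ceil(peak_qps / qps_per_unit))

print(appliances_for_qps(30))  # 2
```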
Scoping Availability Needs
Once Index Capacity and Query Scalability needs are understood, an organization will need to
determine the availability requirements of the search solution.
Each appliance unit has a level of fault tolerance built into the system. For example, the GB-1001
is a single-node unit with a RAID architecture to provide tolerance against disk failures. If
additional levels of fault tolerance are required, multiple individual appliance units can be
deployed in parallel (similar to scaling for query volume) to meet even the most mission-critical
redundancy requirements (see footnote 2). Furthermore, the GB-5005 and GB-8008 are clustered
units that provide tolerance against entire node failures. There are a number of differences
between a GB-1001 model and a clustered model of the appliance:
     •     A cluster has a built-in hardware load balancer. It uses a round-robin approach to
           redirecting requests. This ensures service availability and scalability that would have to
           be provided via an external load balancer for the GB-1001. The cluster needs three IP
           addresses: one for crawl, one for serve and one for the internal switch. For GB-5005,
           one node can go down without affecting the operation of the appliance; for GB-8008, a
           maximum of three nodes could go down without affecting the operation.
     •     All services in the cluster are replicated on multiple nodes. For example, an internal
           DNS service manages internal IP address allocation and other networking issues.
     •     In a cluster, the hard disks are shared by all nodes. Data is replicated on multiple disks.
           This ensures data recovery and eliminates external data backup and restore needs.
     •     Multiple nodes in the cluster participate in the crawling. Each one crawls a subset of the
           URLs.
     •     Clusters have higher index capacities than the single server models.
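The round-robin behavior described in the first bullet can be pictured with a short sketch. It
is a generic illustration of round-robin dispatch that skips failed nodes, not the appliance's
actual load balancer; the class and node names are hypothetical.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Rotate requests across nodes, skipping nodes marked as down."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.down = set()
        self._ring = cycle(self.nodes)

    def mark_down(self, node):
        self.down.add(node)

    def next_node(self):
        # Try each node at most once per request before giving up.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node not in self.down:
                return node
        raise RuntimeError("no healthy nodes")

balancer = RoundRobinBalancer(["node1", "node2", "node3"])
balancer.mark_down("node2")
print([balancer.next_node() for _ in range(4)])
```

With one node marked as down, requests keep alternating between the remaining two, which is the
effect the cluster provides internally and an external load balancer provides for the GB-1001.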

1 A number of factors determine the actual query volume for a particular deployment, including
document mix, average document size, and overall corpus and index size. Because these factors are
highly dependent on the IT environment, organizations for which high-throughput query processing
is a critical requirement should consult a Google Technology Specialist to determine the optimal
architecture.

2 The Google Search Appliance has been designed to be deployed in parallel to scale for performance as well as
provide additional levels of system redundancy. In this deployment, each individual appliance unit will function as an
independent and autonomous unit with no communication with any other units in the deployment. In order to maintain
index parity between the multiple units in the architecture, a common configuration file can be imported into each unit
in the array. Because of how the crawling and indexing components have been designed, once the
initial discovery is complete and the crawl begins to stabilize, each unit will in effect hold a
mirror of the others' indexes. There will be minor variances between these indexes, but in most
cases these variances won’t significantly impact the user experience. An additional cost
consideration when deploying multiple appliances is the network components, such as a load
balancer, that will manage traffic flow and failover in this architecture.
For most organizations, it is common to deploy multiple GB-1001s in parallel to provide
additional levels of redundancy in the solution. The additional level of fault-tolerance built into the
clustered GB-5005/8008 units will provide adequate redundancy for many organizations.
However, those with globally redundant data centers may choose to deploy multiple
geographically dispersed clusters to meet stringent Disaster Recovery policies.


Determine Deployment Architecture
Once Document Capacity, Query Volume, and Redundancy requirements are determined,
designing an architecture to meet these needs can be fairly straightforward. The result can range
from an organizational search solution to a globally deployed enterprise search system.
There are four different scenarios for the deployment architecture of a Google Search Appliance:

Scenario 1 – Single Google Search Appliance GB-1001
A department of a large organization has determined current and projected document capacity to
be 2M documents over the course of 2 years, and does not expect query volume higher than
25 QPS. The department also considers some downtime tolerable. In this case, a single GB-1001 is
a good choice.
Scenario 2 – Multiple GB-1001 behind a load balancer
A customer has determined current and projected document capacity to be 2M documents over
the course of 2 years, and they do not expect query volume higher than 25 QPS. However, 24/7
service availability is critical to the business. The Google Search Appliance can attain high
availability using server load balancing technology. The organization may choose to deploy a
second GB-1001 in parallel behind a fault-detecting load balancer to provide an additional level
of redundancy and seamless service availability (Diagram 1.1). This deployment can also be
expanded to meet higher levels of query performance by simply deploying additional appliances in
parallel.
For Forms and Basic/NTLM Authentication, there is no session state, so the load balancer can be
configured so that any request can be serviced by any appliance behind it. If the Authentication
SPI is used, however, the appliance issues an encrypted session cookie, and the key is held only
by the appliance that generated the cookie. In that case the load balancer needs to be configured
to support sticky sessions, so that each session's requests return to the same appliance.
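Sticky sessions can be sketched as follows. This is a generic illustration of session affinity,
not the configuration of any particular load balancer product; the class name, appliance names,
and session identifiers are hypothetical.

```python
class StickyBalancer:
    """Round-robin for new sessions; known sessions return to the same appliance."""

    def __init__(self, appliances):
        self.appliances = list(appliances)
        self.sessions = {}  # session id -> appliance that issued the cookie
        self._next = 0

    def route(self, session_id=None):
        # A known session must go back to the appliance holding its key.
        if session_id in self.sessions:
            return self.sessions[session_id]
        appliance = self.appliances[self._next % len(self.appliances)]
        self._next += 1
        if session_id is not None:
            self.sessions[session_id] = appliance
        return appliance

balancer = StickyBalancer(["gsa-a", "gsa-b"])
first = balancer.route("session-1")
balancer.route("session-2")
print(balancer.route("session-1") == first)  # True
```

Requests without a session identifier are simply spread across the appliances, which matches the
stateless Forms and Basic/NTLM case.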
Scenario 3 – Two clusters in a data center
On the other end of the spectrum, if an organization determines that its needs for Document
Capacity are on the order of 3M+ documents and search is a mission-critical enterprise
application where High Availability is a key component of the overall IT strategy, then a cluster
deployment would meet such requirements. In this configuration, the clustered unit would
provide intra-data center fault tolerance. A cluster is always sold with a backup cluster.
Although a cluster provides much better redundancy than a single unit, it still has several
single points of failure, which is why it is good to have a hot backup appliance in case the
primary fails.
Scenario 4 – Multiple clusters in two or more data centers
In some large organizations, multiple geographically distributed data centers are in service to
provide shorter response time and higher redundancy and data safety. The appliance fits well in
such a scenario. One or two clusters can be deployed at a single data center, and all the
appliances at different data centers can be set up to crawl the same content. A long-distance
load balancer can be deployed to direct the traffic among the clusters.
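The four scenarios can be summarized as a simple decision sketch. It is a simplification under
stated assumptions: the thresholds are the 3M document and 25 QPS figures quoted in this paper,
the function and parameter names are hypothetical, and a real sizing exercise would weigh more
factors than these four inputs.

```python
def choose_scenario(total_docs, peak_qps, needs_high_availability, data_centers=1):
    """Map requirements to one of the four scenarios described above."""
    if data_centers > 1:
        return 4  # multiple clusters in two or more data centers
    if total_docs > 3_000_000:
        return 3  # clustered model with a backup cluster
    if needs_high_availability or peak_qps > 25:
        return 2  # multiple GB-1001 units behind a load balancer
    return 1      # single GB-1001

print(choose_scenario(2_000_000, 20, needs_high_availability=True))  # 2
```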
Operations
Google recommends that independent appliances be used at different stages of deployment. For
the customization and staging phase, Google recommends a separate appliance. For the production
system, please refer to the scenarios above.
Deployment
The GB-1001 is simple enough that customers can configure it themselves. Google provides
on-site deployment services for the GB-5005 and GB-8008 models. Deploying a cluster usually takes
only a few hours, by which point the cluster has been brought online and its basic configuration
completed. The actual time depends on how well the customer has prepared according to the
checklist that Google provides.
Customization
Web content is what most organizations choose to make searchable, but the Google Search
Appliance can go beyond that. Databases, content systems, and other enterprise applications such
as ERP and CRM can all be searched and served by the appliance. Google and its partners already
provide connectivity to search some popular systems, but sometimes work needs to be done to crawl
proprietary systems. Google provides APIs and SDKs to help customers: a feed API to allow content
to be pushed into the appliance; OneBox to deliver real-time information from enterprise
applications; and a security SPI to hook in proprietary security systems. Although Google doesn't
provide professional services, Google GEP Partners around the globe can work with customers to
bring search to these systems.
Upgrade
The appliance software can be upgraded without affecting the service. Two versions of the
software can run simultaneously on a single appliance, and the user can accept the new version
after making sure that the appliance is functioning correctly.
Migration
It's quite common for customers to migrate to a larger appliance model as their needs and their
enterprise content grow. There are two ways to migrate. In the first, the full product
configuration is exported from one appliance and loaded into another, thereby replicating the
configuration. The new appliance is then put into action acquiring content and building the
index. The constraint of this approach is that the target appliance must run the same or a newer
software version than the original appliance. The other option is to configure the new appliance
manually, which takes only a few minutes.


Summary
The Google Search Appliance is an enterprise-class search system that can serve the needs of
the most demanding business environment, yet its simple plug-and-play design also suits
departmental use. By following these deployment guidelines and recommendations, you can ensure
that your Google Search Appliance deployment will meet your Document Capacity, Scalability, and
Redundancy needs.

								