Lucene or Solr: Choosing the right search development platform

  The great improvements in the capabilities of Lucene and Solr open source search
technology have created rapidly growing interest in using them as alternatives to other
search applications. As is often the case with open-source technology, online
community documentation provides rich details on features and variations, but it
rarely provides explicit direction on which technology would be the best choice. So when
is Lucene preferable to Solr, and vice versa?
  There is no single answer, as Lucene and Solr are complementary technologies that
bring very similar underlying capabilities to bear on somewhat distinct problems. Solr
is a versatile, powerful, full-featured, production-ready search application server
that requires relatively little formal programming. Lucene presents a collection
of directly callable Java libraries, with fine-grained control of machine functions and
independence from higher-level protocols.
  The functions of Solr and Lucene are highly similar, if not identical. If you are
building an app for the enterprise sector, for instance, you will find Solr a
near-complete match to your business requirements: it comes ready to run in a servlet
container such as Tomcat or Jetty, and ready to scale in a production Java environment.
Its RESTful interfaces and XML-based configuration files can greatly accelerate
development and maintenance. In fact, Lucene programmers have often reported that
they find Solr to contain "the same features I was going to build myself as a
framework for Lucene, but already very well implemented." Once you start with
Solr and find yourself using a lot of the features it provides out of the box, you
will likely be better off using Solr's well-organized extension mechanisms instead
of starting from scratch with Apache Lucene.
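  As a rough illustration of what the RESTful interface means in practice, the
sketch below assembles the URL for a basic Solr search request. The host, the core
name "orders", and the field name "order_type" are assumptions made for this
example and do not come from the article; "q", "rows", and "wt" are standard Solr
query parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class SolrSelectUrl {

    // Builds <baseUrl>/<core>/select?<params>, URL-encoding every key and value.
    static String build(String baseUrl, String core, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(encode(e.getKey()) + "=" + encode(e.getValue()));
        }
        return baseUrl + "/" + core + "/select?" + query;
    }

    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical core and field names, used only for illustration.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "order_type:retail"); // the search query
        params.put("rows", "10");             // number of results to return
        params.put("wt", "json");             // response format
        System.out.println(build("http://localhost:8983/solr", "orders", params));
    }
}
```

  Any HTTP client can issue a GET against the resulting URL, which is why no
Java code is strictly required on the calling side.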
  Searching with Solr
  The data, once imported, was not very large: only 50GB overall. The size
could be managed further by adjusting the field types, whether data had to be stored or
not, and the amount of historical information to be imported. Now that the data was
available, searches could be executed against it.
  I also found the packaged Schema Browser very handy. Admittedly, the Schema
Browser has to process all the fields in the index, so if you have a lot of data
this can take a while. The benefit, however, is that it can answer some of
the more common questions that might be asked, such as: the number of documents
per value, which helps with groups of items such as types of orders; how many
documents actually have parent accounts; how orders are provided by various sending
systems; how many orders are for a given state or postal code; and so on. The data can also
yield additional insights through more advanced searches such as faceted searches, for
example which postal codes are responding to which advertising or product promotions,
which areas have the most activity for certain types of orders, or how many domains
are covered per type of account. And the list goes on.
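  A faceted search of the kind described above can be sketched as a Solr request
that returns only per-value document counts. The core URL and the field names
"postal_code" and "order_type" are hypothetical and stand in for whatever the
real schema defines; "facet", "facet.field", and "rows" are standard Solr
parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class SolrFacetUrl {

    // Builds a select URL that asks for facet counts on each given field.
    // rows=0 suppresses the matching documents so only the counts come back.
    static String build(String coreUrl, String query, List<String> facetFields) {
        StringBuilder url = new StringBuilder(coreUrl)
                .append("/select?q=").append(encode(query))
                .append("&rows=0&facet=true");
        for (String field : facetFields) {
            // facet.field may be repeated, once per field to count.
            url.append("&facet.field=").append(encode(field));
        }
        return url.toString();
    }

    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "*:*" matches every document; field names are assumptions.
        System.out.println(build("http://localhost:8983/solr/orders", "*:*",
                List.of("postal_code", "order_type")));
    }
}
```

  Replacing "*:*" with a narrower query, for instance one restricted to a single
promotion, would return counts per postal code for that promotion only.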
  Operationally speaking, the Solr instances were managed in one of two ways:
periodic updates from the main production instances, or continual updates in which
application code not only added data to the Oracle database but also inserted it into
the Solr index. Operations against the existing production instances
could therefore be managed to minimize impact and eliminate unnecessary processing.
  If, on the other hand, you don't want to make any calls via HTTP, and want to
have all of your resources controlled exclusively by Java API calls that you write,
Lucene may be a better choice. Lucene works best when you are constructing a
state-of-the-art search engine and embedding it directly inside a native Java
application. Some programmers set aside the convenience of Solr
in order to control the large set of sophisticated features more directly, with low-level
access to data and state, and choose Lucene instead, for example for
byte-level manipulation of segments or intervention in data I/O. Investment at the
lower level enables development of extremely sophisticated, cutting-edge text search
and retrieval capabilities.
  As for features, the latest version of Solr generally encapsulates the latest version of
Lucene. As the two are in many ways functional siblings, spending time on gaining a
solid understanding of how Lucene works internally can help you understand Apache
Solr and the ways it extends Lucene.
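  For readers new to those internals, the core idea is the inverted index: a map
from each term to the documents that contain it. The toy class below is only a
sketch of that idea in plain Java. It is not Lucene's implementation and uses no
Lucene classes; the class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class ToyInvertedIndex {

    // term -> sorted ids of the documents containing that term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Indexes one document and returns the id assigned to it.
    int add(String text) {
        int docId = nextDocId++;
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Returns ids of documents that contain every term in the query (AND).
    SortedSet<Integer> searchAll(String query) {
        SortedSet<Integer> result = null;
        for (String term : tokenize(query)) {
            SortedSet<Integer> hits = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(hits); // copy so postings stay untouched
            } else {
                result.retainAll(hits);       // intersect with the next term
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Lowercases and splits on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Solr wraps Lucene");
        index.add("Lucene is a Java library");
        System.out.println(index.searchAll("lucene"));
    }
}
```

  Lucene layers analysis, scoring, segment files, and much more on top of this
basic structure, and Solr in turn adds the server, schema, and HTTP interface.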
  With these new capabilities, answers to key questions can be found in seconds. Data
can be mined quickly, efficiently and flexibly without a lot of specialized training for
business users. Additionally, the indexes could be managed so that
additional data could be added to increase the scope of analysis, or subsets of data
could be indexed and searched for specific business reasons such as service outages or
legal requirements.
  In the end the users were quite happy with the new capabilities provided by Solr
which allowed them to address business needs much more quickly and explore new
patterns that had eluded them before, and operations was happy since the new
capabilities came with little additional hardware or operational costs.
  To learn more about Solr and Lucene, check out the Lucid Imagination website.
