rivet-alfresco-solr - Meetup

Integrating Apache Solr with Alfresco WCM for Faceted Search
and Navigation of Next-Generation Web Sites
Vagif Jalilov
Rivet Logic
                   About Rivet Logic
• Award-winning professional services focused on:
   – Enterprise Content Management
   – Web Content Management
   – Collaboration and Social Communities


• Using Leading Open Source Software
     Business Case for Alfresco & Solr
•   Large scale sites
•   Need for real-time updates
•   Full-text search
•   Faceted search
    Technical Challenges for Search
• Accurately index each page
  – Solution: Assembly of relevant content to index
• Targeted, real-time indexing
  – Solution: Trigger indexing from publishing
    mechanism
         Possible Index Solutions
• Spidering/Crawling
  – Follow navigational & cross-links
  – Parse HTML and fetch relevant content
  – Spider full (or partial) site each time
• Real-time Indexing
  – Triggered by FSR deployment
  – Process only change-set (incremental updates)
  – Assemble relevant page content
          Typical Web Application
Source Control         CMS (Alfresco)
• Source code & libs   • Binary Content
• View templates
• Site navigation
• Web content
“Managed” (Riveted) Web Application
Source Control         CMS (Alfresco)
• Source code & libs   • Binary Content
• (View templates)     • Web Content
                       • Site Navigation
                       • (View templates)
          Page Composition
[Diagram: a page composed from meta-content.xml, page-metadata.xml, section-html.xml, related-links.xml, and supporting-items.xml; two regions are labeled "dynamic"]
Content Delivery



           (http://crafterrivet.org)
Alfresco WCM Lifecycle
Indexing Architecture
             Solr Customizations
• Custom Solr
  – schema.xml
     • Fields (Type, Indexed/Stored)
     • Unique key
  – solrconfig.xml
     • “dismax” type request handler to define queried fields
     • ExtractingRequestHandler (indexing rich-text documents)
                     Custom Solr Schema
<fields>
  <field name="page_url" type="string" indexed="true" stored="true" required="true"/>
  <field name="page_title" type="text" indexed="true" stored="true"/>
  <field name="page_category" type="string" indexed="true" stored="true"/>
  <field name="page_type" type="string" indexed="true" stored="true"/>
  <field name="page_last_modified" type="date" indexed="true" stored="true"/>
  <field name="page_text" type="text" indexed="true" stored="true"/>
  <field name="page_file_size" type="int" indexed="false" stored="true"/>
</fields>

<uniqueKey>page_url</uniqueKey>
                ExtractingRequestHandler
<!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
 <requestHandler name="/update/extract"
     class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
  <lst name="defaults">
    <str name="fmap.content">page_text</str>
    <str name="fmap.title">page_title</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

<dynamicField name="ignored_*" type="ignored"/>

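// SolrJ client code (Solr 1.4-era API): post a file to the extract handler above.
import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.addFile(new File(filePath));   // filePath: the rich-text document to extract and index
SolrServer solrServer = new CommonsHttpSolrServer(solrServerUrl);
solrServer.request(up);
solrServer.commit();              // make the newly indexed document searchable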
                Custom RequestHandler
<!-- DisMaxRequestHandler allows easy searching across multiple fields
     for simple user-entered phrases. Its implementation is now
     just the standard SearchHandler with a default query type
     of "dismax".
     see http://wiki.apache.org/solr/DisMaxRequestHandler
 -->
<requestHandler name="solrDemoDismax" class="solr.SearchHandler" >
  <lst name="defaults">
   <str name="defType">dismax</str>
   <str name="qf">
      page_title^5.0 page_text^1.0
   </str>
  </lst>
</requestHandler>
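As a rough illustration of how a client might reach this handler: in Solr versions of that era, a handler whose name has no leading slash is typically selected by passing its name in the `qt` parameter of a `/select` request. The sketch below only builds such a URL. The base URL and the class and method names are hypothetical and not from the slides.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class DismaxQueryUrl {

    // Builds a /select URL that routes the query to the named request handler via "qt".
    static String build(String solrBaseUrl, String handlerName, String userQuery)
            throws UnsupportedEncodingException {
        return solrBaseUrl + "/select"
                + "?qt=" + URLEncoder.encode(handlerName, "UTF-8")
                + "&q=" + URLEncoder.encode(userQuery, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical base URL; the handler name matches the configuration above.
        System.out.println(
                build("http://localhost:8983/solr", "solrDemoDismax", "content management"));
    }
}
```

Because `qf` is set in the handler defaults, the client sends only the raw user phrase; the field list and the title boost stay in solrconfig.xml.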
                Compilation
• Compiler Engine processes all instructions
• Dispatches to appropriate Page Type Compiler
Content Deployment & Solr Update
                   Compiler Instructions
<updates deploy-root="/path/to/content/root">
   ...
   <update>/solutions/security/article.xml</update>
   <delete>/products/widget/top-section.xml</delete>
   ...
</updates>
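A minimal sketch of how such an instruction file could be read, using only the JDK's DOM parser. It assumes that each path is relative to `deploy-root`, which the slides do not state explicitly; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class CompilerInstructionReader {

    // Returns one "action path" entry per child element, in document order.
    static List<String> read(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element root = doc.getDocumentElement();
        // Assumption: paths in the instructions are relative to deploy-root.
        String deployRoot = root.getAttribute("deploy-root");
        List<String> ops = new ArrayList<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace and other text between elements
            }
            ops.add(n.getNodeName() + " " + deployRoot + n.getTextContent().trim());
        }
        return ops;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<updates deploy-root=\"/path/to/content/root\">"
                + "<update>/solutions/security/article.xml</update>"
                + "<delete>/products/widget/top-section.xml</delete>"
                + "</updates>";
        for (String op : read(xml)) {
            System.out.println(op);
        }
    }
}
```

Each entry can then be handed to whichever compiler handles that path, so only the change-set is processed.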
           Compilation Types
1. Web Pages (HTML)
2. Rich Text (PDF)
Web Page Compilation & Indexing
[Diagram: web page compilation producing Indexer Instructions]
              HTML Indexer Instruction
<?xml version="1.0" encoding="ISO-8859-1"?>
<add>
 <doc>
   <field name="page_url">/solutions/content-mgmt/overview.html</field>
   <field name="page_title">Increase productivity and streamline workflow
   throughout the enterprise</field>
   <field name="page_description">Commercial enterprises and government agencies
   face significant challenges as they strive to meet a rapidly growing need to
   manage thousands ...</field>
   <field name="page_category">Solutions</field>
   <field name="page_type">Web Page</field>
   <field name="page_last_modified">2009-12-18T15:03:57Z</field>
   <field name="page_text">Rivet Logic addresses many of today's workplace
   challenges with Enterprise Content Management (ECM) solutions that enable
   organizations to transform traditional content repositories and static
   intranets into dynamic, collaborative work environments through open source
   functionality. Through ...</field>
 </doc>
</add>
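An instruction like the one above could be generated from a map of field values. The sketch below is one possible way to do it, not the presenters' compiler code: it escapes XML special characters and emits a Solr `<add><doc>` message. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexerInstructionWriter {

    // Escapes characters that cannot appear literally in XML text or attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Renders the fields, in insertion order, as a Solr <add><doc> update message.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue())).append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("page_url", "/solutions/content-mgmt/overview.html");
        fields.put("page_category", "Solutions");
        fields.put("page_type", "Web Page");
        System.out.println(toAddXml(fields));
    }
}
```

Escaping matters here because assembled page text often contains ampersands and angle brackets that would otherwise make the update message malformed.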
Rich Text Compilation & Indexing
         Rich Text Indexer Instruction
<?xml version="1.0" encoding="ISO-8859-1"?>
<add>
 <doc>
   <field name="page_file">/docroot/static/about-us/press-releases/2010/rl_crafter_studio.pdf</field>
   <field name="page_url">/about-us/press-releases/2010/rl_crafter_studio.pdf</field>
   <field name="page_title">Rivet Logic launches Crafter Studio for
   user friendly Web content authoring and publishing.</field>
   <field name="page_category">News</field>
   <field name="page_type">Press Release</field>
   <field name="page_last_modified">2007-12-19T08:00:00Z</field>
   <field name="page_file_size">135</field>
 </doc>
</add>
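The slides do not show how these metadata fields reach Solr alongside the extracted PDF text. One plausible route, offered here only as an assumption, is the ExtractingRequestHandler's `literal.<fieldname>` request parameters, which supply field values directly. The sketch below builds such a parameter string; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractParams {

    // Turns metadata fields into "literal.<field>=<value>" request parameters.
    static String toQueryString(Map<String, String> metadata)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode("literal." + e.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("page_url", "/about-us/press-releases/2010/rl_crafter_studio.pdf");
        metadata.put("page_category", "News");
        metadata.put("page_type", "Press Release");
        System.out.println(toQueryString(metadata));
    }
}
```

Supplying `page_url` this way would also satisfy the schema's required unique key, which text extraction alone cannot provide.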
                          Compiler Configuration
<compiler-config>
    <page-types>
        <page-type
            name="Solution Page"
            compiler="com.rivetlogic.index.compile.ArticleCompiler">
            <uri-pattern pattern=".*/page-content/solutions/.*(article|page-metadata|meta-content).xml$" />
            <properties>
                <property field="page_type" value="Web Page"/>
                <property field="page_category" value="Solutions"/>
            </properties>
        </page-type>
        <page-type
            name="Press Release Page"
            compiler="com.paetec.index.model.compile.PressReleaseCompiler">
            <uri-pattern pattern=".*/press-releases/.*/(press-release|meta-content).xml$" />
            <properties>
                <property field="page_type" value="Press Release"/>
                <property field="page_category" value="News"/>
            </properties>
        </page-type>
    </page-types>
</compiler-config>
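The dispatch step driven by this configuration can be sketched as a lookup from URI patterns to page types. This is an illustration under assumptions: the slides do not say whether the real engine requires a full match or takes the first matching entry, and the class and method names are hypothetical. The two patterns are copied from the configuration above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class PageTypeDispatcher {

    // Registration order is preserved so that the first matching pattern wins.
    private final Map<Pattern, String> pageTypes = new LinkedHashMap<>();

    void register(String uriPattern, String pageTypeName) {
        pageTypes.put(Pattern.compile(uriPattern), pageTypeName);
    }

    // Returns the first page type whose pattern matches the whole path, or null if none does.
    String resolve(String path) {
        for (Map.Entry<Pattern, String> e : pageTypes.entrySet()) {
            if (e.getKey().matcher(path).matches()) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PageTypeDispatcher d = new PageTypeDispatcher();
        d.register(".*/page-content/solutions/.*(article|page-metadata|meta-content).xml$",
                "Solution Page");
        d.register(".*/press-releases/.*/(press-release|meta-content).xml$",
                "Press Release Page");
        // Hypothetical changed path taken from a compiler instruction.
        System.out.println(d.resolve("/site/page-content/solutions/security/article.xml"));
    }
}
```

A path that matches no pattern resolves to null, which a compiler engine could treat as "nothing to index" for that change.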
                    Search UI
•   Full text search
•   Faceted search on category & type
•   Pagination or search result clustering
•   Keyword highlighting in search results
•   Track user queries
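The UI features listed above map onto standard Solr request parameters: `facet.field` for facet counts, `fq` for drilling into a selected facet value, `hl` and `hl.fl` for highlighting, and `start` and `rows` for pagination. The sketch below assembles such a parameter string for the fields in the custom schema. The class and method names and the 1-based page numbering are assumptions, not taken from the slides.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchParams {

    // Builds request parameters for one page of faceted, highlighted results.
    // page is 1-based; selectedCategory may be null when no facet is selected.
    static String build(String query, int page, int pageSize, String selectedCategory)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        sb.append("q=").append(URLEncoder.encode(query, "UTF-8"));
        sb.append("&start=").append((page - 1) * pageSize); // offset of the first result
        sb.append("&rows=").append(pageSize);
        sb.append("&facet=true&facet.field=page_category&facet.field=page_type");
        sb.append("&hl=true&hl.fl=page_text");
        if (selectedCategory != null) {
            // Restrict results to the chosen facet value without affecting scoring.
            String filter = "page_category:\"" + selectedCategory + "\"";
            sb.append("&fq=").append(URLEncoder.encode(filter, "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build("content management", 2, 10, "Solutions"));
    }
}
```

Faceting works on `page_category` and `page_type` because the schema declares them as indexed `string` fields, so whole values such as "Press Release" are counted rather than individual tokens.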
Search Results Page
Clustered Results
                       Summary
• Requirements:
  – Real time updates
  – Full editorial control
  – Faceted search
• Solution
  –   Alfresco CMS
  –   Alfresco plugin for Solr indexing
  –   Compile updates & index
  –   Serve in UI (full-text search + facets)
                    Q&A
• Thank you for attending :-)
• Questions, comments…
Appendix
Search Model/API