
Shared by: ajizai
posted: 11/25/2011
language: English
pages: 363
101tec's open source projects.

Displaying 331 issues as at 25/Nov/11 12:52 PM.

Project Key Summary Issue Type Status Priority Resolution Assignee

webdav-servlet WDS-22 Office 2007 running on Windows XP opens an Office document read-only Bug Open Major UNRESOLVED Unassigned









webdav-servlet WDS-21 Add missing ITransaction to IMimeTyper Bug Open Blocker UNRESOLVED Unassigned










webdav-servlet WDS-20 Clear Unconsumed Inputstream Improvement Open Critical UNRESOLVED Unassigned









webdav-servlet WDS-19 PROPPATCH fails on locked object with correct lock token in IF Header Bug Open Critical UNRESOLVED Unassigned










webdav-servlet WDS-18 Send simple error response instead of multi-status if appropriate Improvement Open Critical UNRESOLVED Unassigned










webdav-servlet WDS-17 Allow none or empty LOCK owner token Improvement Open Major UNRESOLVED Unassigned










webdav-servlet WDS-16 Locking fails from davfs2 clients Bug Resolved Major Fixed Unassigned










webdav-servlet WDS-15 webdav on websphere 6.1 and 7 throws timeout exception while executing DoProp Method Bug Open Major UNRESOLVED Marko Bauhardt










webdav-servlet WDS-14 IWebdavStore: add destroy() life cycle method to deallocate resources Improvement Resolved Major Fixed Unassigned










webdav-servlet WDS-13 LockedObject prone to ArrayIndexOutOfBoundsException Bug Resolved Major Fixed Unassigned









webdav-servlet WDS-12 SimpleDateFormat usage is not thread safe Bug Resolved Major Fixed Unassigned










webdav-servlet WDS-11 Copy coll to non-existent path results in error 500 instead of 409 Bug Open Major UNRESOLVED Unassigned









webdav-servlet WDS-10 MKCOL with existing Content-Type should result in 415 Bug Open Major UNRESOLVED Unassigned









webdav-servlet WDS-9 MKCOL on nonexistent path gets 207 instead of 404 error Bug Resolved Major Fixed Unassigned










webdav-servlet WDS-8 Compilation for iJetty does not work Bug Resolved Major Fixed Marko Bauhardt









webdav-servlet WDS-7 improve the WebdavServlet/WebdavServletBean to register other implementations of IMethodExecutor Task Open Major UNRESOLVED Marko Bauhardt










webdav-servlet WDS-6 Problem when opening an accented file or folder Bug Open Minor UNRESOLVED Unassigned









webdav-servlet WDS-5 don't distribute the log4j.xml file within the jar Improvement Resolved Major Fixed Marko Bauhardt










webdav-servlet WDS-4 Enhance GET responses to return HTML+clickable links New Feature Resolved Minor Fixed Unassigned









webdav-servlet WDS-3 deploy webdav-servlet jar, javadoc, sources to maven2 repository if version 2.0 is released Task Closed Major Fixed Marko Bauhardt










webdav-servlet WDS-2 DoPut returns wrong content-length in response Bug Closed Major Fixed Marko Bauhardt









webdav-servlet WDS-1 npe while deleting files Bug Closed Major Fixed Marko Bauhardt










Nutch Gui NUTCHGUI-29 Nutch Gui not working - 127.0.0.1 returns 404 page not found Bug Open Major UNRESOLVED Marko Bauhardt









Nutch Gui NUTCHGUI-28 don't protect css, gfx and js folder Task Resolved Major Fixed Marko Bauhardt

Nutch Gui NUTCHGUI-27 scheduling plugin: validate form elements, for example: Pages per segment Bug Open Major UNRESOLVED Marko Bauhardt

Nutch Gui NUTCHGUI-26 the scheduling plugin should be configurable to refresh a crawl or create a new crawl Improvement Open Major UNRESOLVED Marko Bauhardt



Nutch Gui NUTCHGUI-25 add an error html for exceptions Task Open Major UNRESOLVED Marko Bauhardt

Nutch Gui NUTCHGUI-24 add error page in web.xml for the authentication error 403 (user is not in role) Bug Resolved Major Fixed Marko Bauhardt



Nutch Gui NUTCHGUI-23 login.html and logout.html should not be protected Bug Resolved Major Fixed Marko Bauhardt

Nutch Gui NUTCHGUI-22 delete LoginController from the src/java folder Improvement Resolved Major Fixed Marko Bauhardt



Nutch Gui NUTCHGUI-21 make the nutchgui.auth file configurable Improvement Resolved Major Fixed Marko Bauhardt





Nutch Gui NUTCHGUI-20 NutchGuirealm should use a better name for the LoginModule Improvement Resolved Major Fixed Marko Bauhardt





Nutch Gui NUTCHGUI-19 crawling went wrong if http.agent.name is not set Bug Resolved Major Fixed Marko Bauhardt



Nutch Gui NUTCHGUI-18 crawling failed if metadata.enabled = true but there are no metadata-urls uploaded Bug Resolved Major Fixed Marko Bauhardt



Nutch Gui NUTCHGUI-17 restart searcher if crawl folder is put/removed to/from search Task Resolved Major Fixed Marko Bauhardt




Nutch Gui NUTCHGUI-16 implement a simple admin job overview plugin Task Open Major UNRESOLVED Marko Bauhardt



Nutch Gui NUTCHGUI-15 implement a query filter that searches in fields from the metadata indexing plugin Task Resolved Major Fixed Marko Bauhardt

Nutch Gui NUTCHGUI-14 make the metadata indexing optional Task Resolved Major Fixed Marko Bauhardt

Nutch Gui NUTCHGUI-13 make the black white filtering optional Task Resolved Major Fixed Marko Bauhardt

Nutch Gui NUTCHGUI-12 implement nutch gui core classes, for example httpServer, GuiComponentDeployer etc Task Resolved Major Fixed Marko Bauhardt



Nutch Gui NUTCHGUI-11 apply metadata indexing patch NUTCH-747 Task Resolved Major Fixed Marko Bauhardt

Nutch Gui NUTCHGUI-10 apply black white filtering patch NUTCH-249 Task Resolved Major Fixed Marko Bauhardt



Nutch Gui NUTCHGUI-9 implement admin url upload plugin Task Resolved Major Fixed Marko Bauhardt



Nutch Gui NUTCHGUI-8 implement admin scheduling plugin Task Resolved Major Fixed Marko Bauhardt



Nutch Gui NUTCHGUI-7 implement admin crawl plugin Task Resolved Major Fixed Marko Bauhardt





Nutch Gui NUTCHGUI-6 implement admin configuration plugin Task Resolved Major Fixed Marko Bauhardt





Nutch Gui NUTCHGUI-5 implement admin system plugin Task Resolved Major Fixed Marko Bauhardt



Nutch Gui NUTCHGUI-4 implement admin instance plugin Task Resolved Major Fixed Marko Bauhardt

Nutch Gui NUTCHGUI-3 implement admin welcome plugin Task Resolved Major Fixed Marko Bauhardt



Nutch Gui NUTCHGUI-2 implement i18n translation mechanism Task Resolved Major Fixed Marko Bauhardt





Nutch Gui NUTCHGUI-1 implement role based security for login mechanism Task Resolved Major Fixed Marko Bauhardt










Katta KATTA-193 Implement query filters New Feature Resolved Major Fixed Johannes Zillmann









Katta KATTA-192 unstable master failover on session reconnect Bug Resolved Major Fixed Johannes Zillmann










Katta KATTA-191 property to set Lucene version New Feature Open Trivial UNRESOLVED Unassigned









Katta KATTA-190 make inner classes of LuceneServer protected Task Resolved Major Fixed Johannes Zillmann









Katta KATTA-189 improve multithreaded shard search by using ExecutorCompletionService Improvement Resolved Trivial Fixed Johannes Zillmann










Katta KATTA-188 katta startNode stuck after org.apache.zookeeper.KeeperException$NotEmptyException Bug Open Major UNRESOLVED Unassigned










Katta KATTA-187 add additional parameters to the Katta client to change the output format for easier scripting and parsing Improvement Resolved Trivial Fixed Unassigned









Katta KATTA-186 make message about document and term statistics more readable Improvement Resolved Trivial Fixed Unassigned







Katta KATTA-185 adding of lucene indices as single sharded katta indices Improvement Open Minor UNRESOLVED Johannes Zillmann









Katta KATTA-184 upgrade to zookeeper 3.3.2 (from 3.3.1) Improvement Resolved Major Fixed Johannes Zillmann







Katta KATTA-183 zkclient can get unresponsive through OOM Bug Resolved Major Fixed Johannes Zillmann










Katta KATTA-182 memory leak in client - ZooKeeper$ZkWatchManager.existWatches can grow huge Bug Resolved Major Fixed Johannes Zillmann










Katta KATTA-181 don't remove node-shard-mapping from client if proxy fails one time Improvement Resolved Major Fixed Johannes Zillmann









Katta KATTA-180 avoid search exceptions on index-removal Improvement Open Major UNRESOLVED Johannes Zillmann










Katta KATTA-179 undeployment of index can lead to NPE in BooleanQuery - improve exception message Bug Resolved Minor Fixed Johannes Zillmann










Katta KATTA-178 undeploying indices can leave empty shard-to-node paths Bug Resolved Major Fixed Johannes Zillmann










Katta KATTA-177 IndexDeployFuture not safe for quick undeployments Bug Resolved Major Fixed Johannes Zillmann










Katta KATTA-176 deploy-client misses index-state updates Bug Open Major UNRESOLVED Johannes Zillmann









Katta KATTA-175 make lucene-server thread pool parameters configurable Improvement Resolved Minor Fixed Johannes Zillmann









Katta KATTA-174 allow configuration of content-server through katta.node.properties New Feature Resolved Major Fixed Johannes Zillmann










Katta KATTA-173 upgrade to hadoop-0.20.2 Task Resolved Major Fixed Unassigned









Katta KATTA-172 update to lucene 3.0.3 Improvement Resolved Major Fixed Johannes Zillmann

Katta KATTA-171 use timeout from Client in the LuceneServer as well Improvement Resolved Major Fixed Johannes Zillmann










Katta KATTA-170 Negative range query broken Bug Open Critical UNRESOLVED Unassigned










Katta KATTA-169 results do not get closed in WorkQueue.getResults() if waitTime == 0 Bug Open Major UNRESOLVED Unassigned










Katta KATTA-168 katta deadlock after reconnecting upon child thread death Bug Open Major UNRESOLVED Unassigned










Katta KATTA-167 Katta runs out of memory Bug Open Major UNRESOLVED Unassigned









Katta KATTA-166 Searches that match > 2^31 documents are handled incorrectly Bug Open Major UNRESOLVED Unassigned










Katta KATTA-165 enabled throttling leads to IndexOutOfBoundsException when adding index Bug Resolved Major Fixed Johannes Zillmann










Katta KATTA-164 Unit test fails Bug Open Major UNRESOLVED Unassigned









Katta KATTA-163 don't exit node/master operation thread in case an unexpected exception occurs Improvement Resolved Major Fixed Johannes Zillmann










Katta KATTA-162 Allow LuceneClient to be extended more easily Improvement Resolved Major Fixed Johannes Zillmann










Katta KATTA-161 Katta nodes stop communicating with master, but don't exactly become "disconnected" Bug Resolved Major Fixed Johannes Zillmann










Katta KATTA-160 Stop trying to rebalance/replicate an index when the index could not be found in the file system any more Bug Resolved Major Fixed Johannes Zillmann









Katta KATTA-159 LuceneServerTest does not compile Bug Resolved Trivial Fixed Johannes Zillmann









Katta KATTA-158 remove a node from Katta New Feature Open Major UNRESOLVED Unassigned










Katta KATTA-157 set timeout on LuceneClient New Feature Resolved Major Fixed Johannes Zillmann





Katta KATTA-156 allow shard selection by regular expression New Feature Resolved Major Fixed Unassigned









Katta KATTA-155 Retrieving details of many hits is very slow Improvement Open Major UNRESOLVED Unassigned









Katta KATTA-154 HitsMapWritable readFields does not add hits optimally Improvement Resolved Minor Fixed Johannes Zillmann









Katta KATTA-153 LuceneServer loads all fields from index, even if only fewer are requested Improvement Resolved Major Fixed Johannes Zillmann









Katta KATTA-152 modify LuceneServer for easier sub-classing Improvement Resolved Minor Fixed Johannes Zillmann









Katta KATTA-151 use Hadoop 0.21 Task Open Major UNRESOLVED Unassigned





Katta KATTA-150 port parameter for startNode New Feature Resolved Trivial Fixed Johannes Zillmann










Katta KATTA-149 LuceneServer synchronizes on a ConcurrentHashMap Improvement Resolved Trivial Fixed Johannes Zillmann





Katta KATTA-148 deploy index fails if Debug logging is enabled at the master and the LowestShardCountDistributionPolicy is chosen Bug Resolved Blocker Fixed Unassigned









Katta KATTA-147 upgrade to zookeeper 3.3 Task Resolved Major Fixed Johannes Zillmann










Katta KATTA-146 java.util.ConcurrentModificationException when multiple LuceneClient objects are created simultaneously Bug Resolved Major Fixed Unassigned










Katta KATTA-145 java.lang.NullPointerException when a LuceneClient is created sometimes Bug Resolved Major Fixed Unassigned










Katta KATTA-144 impossible to resolve dependencies: java.io.FileNotFoundException Bug Open Major UNRESOLVED Unassigned










Katta KATTA-143 CLONE -cobertura.jar version mismatch when compiling /extras/indexing Bug Resolved Trivial Duplicate Unassigned










Katta KATTA-142 memory leak in client usage when adding and removing indices Bug Resolved Major Fixed Johannes Zillmann










Katta KATTA-141 ec2 scripts broken Bug Resolved Major Fixed Johannes Zillmann










Katta KATTA-140 inconsistent search errors during stress test Bug Resolved Major Fixed Johannes Zillmann










Katta KATTA-139 reconnecting node fails while deleting queue Bug Open Major UNRESOLVED Johannes Zillmann










Katta KATTA-138 Cluster can "hang" following a major change Bug Open Critical UNRESOLVED Unassigned










Katta KATTA-137 ZKClient does not compile against Zookeeper 3.3.1 Bug Resolved Major Fixed Unassigned










Katta KATTA-136 NPE in client on remove index event Bug Resolved Major Fixed Johannes Zillmann










Katta KATTA-135 NPE in AbstractIndexOperation.addRunningDeployments Bug Resolved Major Fixed Johannes Zillmann










Katta KATTA-134 cobertura.jar version mismatch when compiling /extras/indexing Bug Resolved Trivial Fixed Johannes Zillmann









Katta KATTA-133 upgrade zookeeper 3.2.2 to 3.3.0 Task Resolved Major Duplicate Unassigned










Katta KATTA-132 nDocs must be > 0 exception when query on many instances Bug Resolved Major Fixed Johannes Zillmann










Katta KATTA-131 missing sort fields in single documents leads to exception on sorting Bug Open Major UNRESOLVED Johannes Zillmann









Katta KATTA-130 Deploying from HDFS fails to unzip correctly Bug Resolved Minor Fixed Johannes Zillmann









Katta KATTA-129 Add a newly deployed shard (in hdfs, say) to an existing index New Feature Resolved Major Duplicate Unassigned










Katta KATTA-128 Only one Search executed at a time per node -> Increase RPC Server threads Improvement Resolved Major Fixed Johannes Zillmann









Katta KATTA-127 reverse deploy - copy a valid index from the katta system (+ all its shards) to a hdfs uri New Feature Open Major UNRESOLVED Unassigned









Katta KATTA-126 Genericity nits Improvement Resolved Major Fixed Johannes Zillmann








Katta KATTA-125 ConcurrentModificationException in net.sf.katta.protocol.InteractionProtocol Bug Resolved Critical Fixed Unassigned










Katta KATTA-124 imbalanced shard distribution with 'LowestShardCountDistributionPolicy' Bug Resolved Major Fixed Johannes Zillmann









Katta KATTA-123 Remove System.exit(1); from printUsageAndExit Bug Resolved Major Won't Fix Unassigned










Katta KATTA-122 removeIndex leaks file descriptors Bug Resolved Major Fixed Johannes Zillmann









Katta KATTA-121 bin/stop-all.sh not working (Sometimes nodes just hang and won't stop) Bug Open Major UNRESOLVED Unassigned










Katta KATTA-120 Adding wrong file path will cause listIndices to fail (and maybe other parts as well) Bug Resolved Major Fixed Johannes Zillmann









Katta KATTA-119 Reload/Refresh feature New Feature Open Major UNRESOLVED Unassigned










Katta KATTA-118 master should periodically balance indices New Feature Open Major UNRESOLVED Unassigned









Katta KATTA-117 add command line option to print stacktrace on error Improvement Resolved Major Fixed Johannes Zillmann










Katta KATTA-116 distribution of shards does not take currently deploying shards into account Bug Resolved Major Fixed Johannes Zillmann









Katta KATTA-115 zkclient: update log4j to newest version Task Open Trivial UNRESOLVED Unassigned



Katta KATTA-114 zkclient: property to disable ivy in zkclient build system Improvement Open Minor UNRESOLVED Unassigned

Katta KATTA-113 zkclient git repo does not include all dependencies in lib/ Bug Resolved Major Fixed Johannes Zillmann

Katta KATTA-112 ship build.xml and ivy.xml in tarballs Improvement Resolved Minor Fixed Johannes Zillmann










Katta KATTA-111 improve build system to ease packaging for Debian (and other distros) Task Open Minor UNRESOLVED Unassigned









Katta KATTA-110 use a released 0.1 version of zkclient instead of the snapshot Task Resolved Major Fixed Johannes Zillmann

Katta KATTA-109 split katta distribution into katta and katta.gui Task Resolved Major Fixed Johannes Zillmann





Katta KATTA-108 make loadtests more robust Improvement Open Major UNRESOLVED Unassigned










Katta KATTA-107 Katta master does not run on cygwin Bug Resolved Blocker Fixed Unassigned









Katta KATTA-106 improve katta's monitoring abilities Improvement Open Major UNRESOLVED Johannes Zillmann









Katta KATTA-105 throttle shard deployment New Feature Resolved Major Fixed Johannes Zillmann





Katta KATTA-104 upgrade to zookeeper 3.2.2 Task Resolved Major Fixed Johannes Zillmann

Katta KATTA-103 upgrade to lucene 3.0 Task Resolved Major Fixed Johannes Zillmann










Katta KATTA-102 node failover in Client is not safe for multithreaded use Bug Resolved Major Fixed Johannes Zillmann









Katta KATTA-101 refactor INodeManaged implementation into sub-packages Improvement Resolved Major Fixed Johannes Zillmann










Katta KATTA-100 ivy setup does not work for extras/indexing module Bug Resolved Major Fixed Johannes Zillmann









Katta KATTA-99 Access binary fields in search results New Feature Resolved Major Cannot Reproduce Unassigned










Katta KATTA-98 bin/start-all.sh script should start primary and secondary master for fail over support Improvement Open Major UNRESOLVED Unassigned









Katta KATTA-97 graceful shutdown of JmxMonitor Improvement Resolved Major Fixed Johannes Zillmann










Katta KATTA-96 upgrade mechanism for katta New Feature Resolved Major Fixed Johannes Zillmann









Katta KATTA-95 IndexDeployFuture.joinDeployment() seems to hang from time to time Bug Resolved Major Fixed Johannes Zillmann










Katta KATTA-94 refactor configuration management Improvement Open Major UNRESOLVED Unassigned










Katta KATTA-93 hits are (re-)sorted completely on client side Bug Resolved Major Fixed Johannes Zillmann









Katta KATTA-92 Enable querying the hadoop version for the katta cluster Improvement Open Minor UNRESOLVED Unassigned





Katta KATTA-91 Searching for index by wildcard only supports "all indexes" rather than "all matching this glob" Improvement Resolved Minor Duplicate Unassigned





Katta KATTA-90 Allow Katta to be installed as a Maven artifact. Improvement Open Minor UNRESOLVED Unassigned










Katta KATTA-89 Dependencies are checked in to git as jars Improvement Open Major UNRESOLVED Unassigned









Katta KATTA-88 Allow shard size deployment for LPT (Longest Processing Time) Improvement Open Trivial UNRESOLVED Unassigned







Katta KATTA-87 Katta.gui - it would be great to have a web-based gui to monitor and maybe administer katta New Feature Resolved Major Fixed Stefan Groschupf

Katta KATTA-86 source should be in jar as well Improvement Resolved Trivial Fixed Stefan Groschupf









Katta KATTA-85 upgrade to hadoop 20.1 jars Improvement Resolved Major Fixed Stefan Groschupf










Katta KATTA-84 Distribution Policy that picks the node with the fewest shards, and allow hadoop zip files to be stream unpacked instead of spooled to local disk first New Feature Resolved Major Fixed Unassigned










Katta KATTA-83 Build failure when building on fresh checkout Bug Resolved Blocker Cannot Reproduce Unassigned









Katta KATTA-82 Katta needs to be monitorable New Feature Resolved Major Fixed Stefan Groschupf







Katta KATTA-81 NodeInteraction: the max tryCount should not be hardcoded to 3 but equal to the replication level or configurable. Bug Resolved Major Fixed Johannes Zillmann

Katta KATTA-80 Configuration loading from files Improvement Resolved Minor Fixed Unassigned










Katta KATTA-79 Set the maximum shard size in bytes Improvement Resolved Minor Won't Fix Unassigned









Katta KATTA-78 Add basic Lucene Sort capabilities New Feature Resolved Major Fixed Johannes Zillmann









Katta KATTA-77 Merging indexes requires missing Commons HTTP Client Bug Resolved Major Fixed Unassigned










Katta KATTA-76 Attempting to add an index on a non-existent HDFS breaks listIndexes Bug Resolved Major Fixed Johannes Zillmann









Katta KATTA-75 tarbomb: Katta 0.5.1 release tarball expands in place Bug Resolved Major Fixed Peter Voss

Katta KATTA-74 Running "ant jar" in extras/indexing fails due to missing jets3t dependency Bug Resolved Major Fixed Stefan Groschupf



Katta KATTA-73 CDPATH environment variable causes bin scripts to fail. Bug Resolved Major Fixed Unassigned










Katta KATTA-72 NPE when searching an index that doesn't exist Bug Resolved Major Fixed Peter Voss









Katta KATTA-71 katta hangs during deployment if errors happen when pulling an index from s3. Bug Resolved Major Fixed Stefan Groschupf










Katta KATTA-70 Katta immediately terminates after startup when using external Zookeeper Bug Resolved Major Fixed Peter Voss









Katta KATTA-69 Katta is much too sensitive to recoverable KeeperExceptions Bug Resolved Major Duplicate Peter Voss










Katta KATTA-68 Need "fsck" for katta clusters New Feature Open Major UNRESOLVED Unassigned









Katta KATTA-67 Default namespace is not created when using external Katta Bug Resolved Major Fixed Stefan Groschupf










Katta KATTA-66 Update jets3t jar from version 0.5.0 to 0.6.1 Bug Resolved Major Fixed Unassigned









Katta KATTA-65 Katta needs two missing jars to be able to use the s3:// and s3n:// protocols for pulling indexes Bug Resolved Minor Won't Fix Unassigned










Katta KATTA-64 problems running with multiple zookeeper servers. Bug Resolved Major Won't Fix Johannes Zillmann









Katta KATTA-63 Use java which is on the PATH if JAVA_HOME is not set Improvement Resolved Major Fixed Peter Voss










Katta KATTA-62 Need to be able to rebalance shard assignments in a cluster Bug Open Major UNRESOLVED Unassigned










Katta KATTA-61 Katta needs a load balancing shard deployment policy Bug Open Major UNRESOLVED Unassigned









Katta KATTA-60 Allow user to specify katta root ZooKeeper path New Feature Resolved Major Fixed Unassigned










Katta KATTA-59 Katta should (optionally) allow partial results Improvement Resolved Major Fixed Unassigned










Katta KATTA-58 Change structure of data in Zookeeper to make all node data ephemeral on node connection Improvement Resolved Major Duplicate Unassigned










Katta KATTA-57 Permissions on scripts aren't set properly after deployment Bug Resolved Major Fixed Peter Voss









Katta KATTA-56 Documentation on configuring Katta has misleading info on hadoop Bug Resolved Minor Fixed Stefan Groschupf









Katta KATTA-55 The 0.5.1 release is missing the /extras directory with EC2 support Bug Resolved Major Fixed Stefan Groschupf



Katta KATTA-54 Generalize Katta so that Lucene is one use case (and mapfiles another). New Feature Resolved Major Fixed Stefan Groschupf










Katta KATTA-53 Need to fix spelling errors and other small issues in code Bug Resolved Major Fixed Unassigned









Katta KATTA-52 ZKClient.reconnect() should only be called on KeeperState.Expired events Bug Resolved Major Fixed Peter Voss









Katta KATTA-51 master hangs when reconnecting to zk during deployment of an index. Bug Resolved Major Fixed Stefan Groschupf








Katta KATTA-50 Intermittent failure of ClientFailoverTest Bug Resolved Major Cannot Reproduce Unassigned










Katta KATTA-49 Introduce a refreshIndex method Improvement Open Major UNRESOLVED Unassigned










Katta KATTA-48 Extend IndexMetaData to include index name Improvement Resolved Major Fixed Peter Voss









Katta KATTA-47 addIndex should not require the user to provide an Analyzer Improvement Resolved Major Fixed Unassigned










Katta KATTA-46 Make Katta Client(s) programmatically configurable Improvement Resolved Minor Fixed Stefan Groschupf










Katta KATTA-45 logical error on getting scoreDoc List in KattaMultiSearcher::search() Bug Resolved Major Fixed Peter Voss










Katta KATTA-44 Incorrect usage of org.apache.lucene.util.PriorityQueue Bug Resolved Major Fixed Peter Voss










Katta KATTA-43 Katta does not recover well from expired sessions Bug Resolved Major Fixed Johannes Zillmann









Katta KATTA-42 upgrade zkClient to use zookeeper 3.1.1 Improvement Resolved Major Fixed Peter Voss










Katta KATTA-41 SecondaryMaster cannot take over when firstMaster failed: Bug Resolved Critical Fixed Stefan Groschupf









Katta KATTA-40 cannot start as Quorum Zookeepers in katta-0.4 Bug Resolved Major Cannot Reproduce Stefan Groschupf










Katta KATTA-39 logical bug in calculating the DF Bug Resolved Major Fixed Unassigned









Katta KATTA-38 naive bug in finding a free port for RPCServer Bug Resolved Major Fixed Stefan Groschupf









Katta KATTA-37 undeploy one index will delete all data of other index Bug Resolved Blocker Fixed Stefan Groschupf










Katta KATTA-36 bin/katta showStructure throws an exception Bug Resolved Major Fixed Peter Voss









Katta KATTA-35 build.xml doesn't provide description for the 'eclipse' and 'dist' targets Improvement Resolved Trivial Fixed Peter Voss










Katta KATTA-34 Random failure of test net.sf.katta.zk.ZKClientTest Bug Resolved Minor Cannot Reproduce Unassigned









Katta KATTA-33 upgrade information in the readme file Bug Resolved Major Fixed Unassigned

Katta KATTA-32 in the release docs/reports is empty Bug Resolved Major Fixed Stefan Groschupf



Katta KATTA-31 don't rsync log or zookeeper folders Bug Resolved Major Fixed Stefan Groschupf

Katta KATTA-30 add checkstyle as part of the test Improvement Resolved Minor Fixed Unassigned










Katta KATTA-29 Exception while restarting zk client Bug Resolved Major Cannot Reproduce Unassigned










Katta KATTA-28 generalize client node interaction api Improvement Resolved Major Fixed Stefan Groschupf









Katta KATTA-27 queryParser is not thread-safe, though we use it in a setup that actually could be multithreaded. Bug Resolved Major Fixed Stefan Groschupf



Katta KATTA-26 identical shardName conflict with each other Bug Resolved Blocker Won't Fix Stefan Groschupf









Katta KATTA-25 upgrade to zookeeper 3.x Improvement Resolved Major Fixed Stefan Groschupf



Katta KATTA-24 upgrade to latest lucene Improvement Resolved Major Fixed Stefan Groschupf










Katta KATTA-23 Parallelize search result detail retrieval Improvement Resolved Minor Fixed Peter Voss









Katta KATTA-22 Avoid using com.sun.xml.internal.ws.util.ByteArrayBuffer (non standard Java) Bug Resolved Major Fixed Unassigned



Katta KATTA-21 'ant eclipse' doesn't set the execution environment to 1.6 Bug Resolved Major Fixed Peter Voss










Katta KATTA-20 Incorrect results returned when limiting the number of hits Bug Resolved Major Fixed Peter Voss









Katta KATTA-19 create a simple extras/webui that illustrates how to use the katta client Task Open Major UNRESOLVED Unassigned

Katta KATTA-18 upgrade to hadoop 0.19.0 Improvement Resolved Major Fixed Stefan Groschupf










Katta KATTA-17 create a load test running on ec2 Improvement Resolved Major Fixed Johannes Zillmann









Katta KATTA-16 KATTA-28 The katta manager should manage and the client should query multiple pools of different kinds of servers Sub-task Resolved Major Fixed Unassigned









Katta KATTA-15 Zookeeper server disconnects cause major problems Bug Resolved Major Fixed Stefan Groschupf









Katta KATTA-14 Katta should be able to use an external zookeeper cluster Bug Resolved Major Fixed Stefan Groschupf








Katta KATTA-13 run katta on ec2 Improvement Resolved Major Fixed Stefan Groschupf









Katta KATTA-12 katta-env.sh does not have KATTA_SLAVE_SLEEP. Improvement Resolved Minor Won't Fix Stefan Groschupf









Katta KATTA-11 introduce merging index size threshold Improvement Resolved Minor Won't Fix Unassigned









Katta KATTA-10 make shard folder dependent on node-port Improvement Resolved Major Fixed Stefan Groschupf










Katta KATTA-9 do not zip shards Improvement Resolved Major Won't Fix Unassigned







Katta KATTA-8 make metadata mapwritable Improvement Resolved Major Won't Fix Unassigned









Katta KATTA-7 KATTA-5 fix master failover Sub-task Resolved Major Duplicate Unassigned










Katta KATTA-6 KATTA-5 refactor cluster start/stop mechanism Sub-task Resolved Major Won't Fix Stefan Groschupf









Katta KATTA-5 [WRAPPER] improve zookeeper ephemeral node handling Bug Resolved Major Fixed Unassigned









Katta KATTA-4 concurrent search on node Improvement Resolved Major Fixed Stefan Groschupf










Katta KATTA-3 index speed improvement Improvement Closed Major Won't Fix Unassigned









Katta KATTA-2 check correct use of analyzer on searching Bug Resolved Major Fixed Stefan Groschupf









Katta KATTA-1 make lucene analyzer configurable for indexing Improvement Resolved Minor Won't Fix Unassigned









bixo BIXO-88 1232123 Bug Open Major UNRESOLVED Ken Krugler










bixo BIXO-87 Decouple use of ec2-api-tools from bixo Improvement Open Minor UNRESOLVED Ken Krugler









bixo BIXO-86 FetcherPolicy and crawlDelay=0: Divide By Zero Bug Open Trivial UNRESOLVED Ken Krugler









bixo BIXO-85 Add target languages to FetchPolicy, use during fetch Improvement Open Minor UNRESOLVED Unassigned



bixo BIXO-84 Specify gzip compression with HttpClient requests Improvement Open Minor UNRESOLVED Ken Krugler










bixo BIXO-83 Improve SimpleCrawlTool error messages Improvement Open Major UNRESOLVED Ken Krugler









bixo BIXO-82 Configure HttpClient request headers Improvement Open Critical UNRESOLVED Ken Krugler









bixo BIXO-81 Source should be in jars Improvement Open Minor UNRESOLVED Unassigned



bixo BIXO-80 Support issue linking in Jira Improvement Open Minor UNRESOLVED Marko Bauhardt



bixo BIXO-79 Add full set of licenses for all jars we include in the distribution build Improvement Open Minor UNRESOLVED Ken Krugler






bixo BIXO-78 Generate release tarball on TeamCity and copy to Nexus directly Improvement Open Critical UNRESOLVED Ken Krugler









bixo BIXO-77 Create Cascading Scheme for reading/writing WARC files New Feature Open Major UNRESOLVED Unassigned










bixo BIXO-76 Support Solr via new Cascading Scheme New Feature Open Major UNRESOLVED Unassigned









bixo BIXO-75 BIXO-73 Clean up DMOZ domain extraction code - remove OOM exceptions Sub-task Open Major UNRESOLVED Ken Krugler

bixo BIXO-74 BIXO-73 Update SimpleCrawlTool to use dmoz link data Sub-task Open Minor UNRESOLVED Ken Krugler







bixo BIXO-73 [Wrapper] Improve DMOZ data support Improvement Open Minor UNRESOLVED Ken Krugler










bixo BIXO-72 BIXO-68 Enable read-only rsync from Nexus releases directory Sub-task Open Major UNRESOLVED Marko Bauhardt









bixo BIXO-71 BIXO-68 Create documentation on how to deploy releases to Nexus repository Sub-task Closed Major Fixed Ken Krugler

bixo BIXO-70 BIXO-68 Submit Jira issue to Maven to pull from Nexus repository for central repo Sub-task Open Major UNRESOLVED Ken Krugler

bixo BIXO-69 BIXO-68 Modify TeamCity to deploy snapshots to Nexus repository manager Sub-task Closed Major Fixed Ken Krugler

bixo BIXO-68 Set up Bixo deployment to Maven repository Task Open Major UNRESOLVED Ken Krugler










bixo BIXO-67 BIXO-68 Set up Nexus repository manager on 101tec.com server Sub-task Resolved Major Fixed Frank Henze









bixo BIXO-66 Break bixo-core into bixo-core, bixo-parse, bixo-index Improvement Open Major UNRESOLVED Ken Krugler










bixo BIXO-65 Create Java-centric launch/deploy scripts for running Bixo in EC2 Improvement Open Major UNRESOLVED Ken Krugler









bixo BIXO-64 Allow Team City to send emails to the list Task Closed Minor Fixed Ken Krugler










bixo BIXO-63 Create AMI for Bixo Task Open Major UNRESOLVED Unassigned









bixo BIXO-62 Add "truncated" flag to FetchedDatum Improvement Open Minor UNRESOLVED Unassigned







bixo BIXO-61 Catch case of fetch being past time limit inside of FetcherRunnable Improvement Open Minor UNRESOLVED Ken Krugler









bixo BIXO-60 Create EC2 deployment script New Feature Closed Critical Fixed Ken Krugler










bixo BIXO-59 Fix up HttpClient handling of stale connections Bug Closed Critical Fixed Ken Krugler









bixo BIXO-58 Switch to using Maven for dependent jars where possible Task Closed Major Fixed Ken Krugler









bixo BIXO-57 UrlFilter should return boolean not string Task Closed Major Fixed Ken Krugler





bixo BIXO-56 Support relative path normalization and other transformations Improvement Open Minor UNRESOLVED Ken Krugler



bixo BIXO-55 Support building a Hadoop job jar Task Closed Major Fixed Stefan Groschupf

bixo BIXO-54 Set up Google or Yahoo mailing list for Bixo Task Resolved Major Fixed Ken Krugler

bixo BIXO-53 Separate out the Eclipse build classes from the Ant build classes Bug Resolved Major Fixed Stefan Groschupf









bixo BIXO-52 Add language detection operation New Feature Open Major UNRESOLVED Ken Krugler










bixo BIXO-51 Add stand-alone class for URL filtering New Feature Open Major UNRESOLVED Ken Krugler









bixo BIXO-50 Add stand-alone class for URL normalization New Feature Closed Critical Fixed Ken Krugler









bixo BIXO-48 making the indexing details configurable Improvement Resolved Major Fixed Stefan Groschupf

bixo BIXO-47 Add test for slow response from web server Improvement Resolved Major Fixed Ken Krugler




bixo BIXO-46 Add simple robots.txt fetch/parse to the FetchBuffer operation New Feature Closed Critical Fixed Ken Krugler





bixo BIXO-45 indexing anchor texts as well. Improvement Open Major UNRESOLVED Stefan Groschupf







bixo BIXO-44 two test data folders Improvement Closed Major Fixed Ken Krugler









bixo BIXO-43 Log progress using counters Improvement In Progress Major UNRESOLVED Ken Krugler










bixo BIXO-42 Make WebgraphWebServerTest a "long-running" test with improved auto-download Improvement Open Minor UNRESOLVED Stefan Groschupf









bixo BIXO-41 Update README with more info on how to run tests, use Eclipse Improvement Open Minor UNRESOLVED Ken Krugler










bixo BIXO-40 When building a fetch queue for a PLD, use IP addresses to sub-segment Improvement Closed Critical Fixed Ken Krugler









bixo BIXO-39 Clean up classes used in HTML parser code Improvement Closed Minor Won't Fix Ken Krugler









bixo BIXO-38 Clean up tuples used to exchange information between Cascading operations Improvement Closed Major Fixed Ken Krugler










bixo BIXO-37 Set up and use configuration settings in the fetcher Task Closed Major Fixed Ken Krugler









bixo BIXO-36 Explicitly set HttpClient redirect handling Improvement Closed Minor Fixed Ken Krugler

bixo BIXO-35 Support https protocol Improvement Closed Major Fixed Ken Krugler









bixo BIXO-34 Configure HttpClient for proper retry handling Improvement Closed Minor Fixed Ken Krugler









bixo BIXO-33 Configure HttpClient for max connections per server Improvement Open Minor UNRESOLVED Ken Krugler










bixo BIXO-32 Turn off stale connection check in HttpClient Improvement Closed Minor Fixed Ken Krugler









bixo BIXO-31 Switch to LOGGER from LOG everywhere Task Closed Minor Fixed Stefan Groschupf

bixo BIXO-30 Set up continuous integration build system Task Closed Major Fixed Stefan Groschupf



bixo BIXO-29 Set up way to run integration tests and long tests separately from unit tests Improvement Resolved Major Fixed Ken Krugler









bixo BIXO-28 Create a multi sink tap Task Closed Major Fixed Stefan Groschupf





bixo BIXO-27 create a sink tap that indexes the content Task Resolved Major Fixed Stefan Groschupf

bixo BIXO-26 Create separate integration directory and move appropriate tests there from src/test Improvement Closed Major Fixed Ken Krugler









bixo BIXO-25 Store a complete http header in the output of the fetcher Improvement Closed Major Fixed Ken Krugler








bixo BIXO-24 pre sort UrlWithScore Improvement Open Minor UNRESOLVED Stefan Groschupf









bixo BIXO-23 cascading fetcher Task Resolved Major Fixed Stefan Groschupf





bixo BIXO-22 http tests Task Resolved Major Fixed Stefan Groschupf









bixo BIXO-21 make sure we use traps for all pipes Task Open Major UNRESOLVED Stefan Groschupf

bixo BIXO-20 create a crawl simulation platform Task Resolved Major Fixed Stefan Groschupf









bixo BIXO-19 test servlet Task Resolved Major Fixed Stefan Groschupf





bixo BIXO-18 adding domain db Task Open Minor UNRESOLVED Stefan Groschupf



bixo BIXO-17 update url DB -> as cascading Task Open Critical UNRESOLVED Ken Krugler

bixo BIXO-16 create katta index Task Closed Major Fixed Stefan Groschupf

bixo BIXO-15 scrape the output -> as cascading Task Closed Major Fixed Stefan Groschupf

bixo BIXO-14 parse the output -> as cascading Task Resolved Major Fixed Stefan Groschupf

bixo BIXO-13 group urls by pld Task Resolved Major Fixed Stefan Groschupf

bixo BIXO-12 create a simple scoring based on the fetch time Task Resolved Major Fixed Stefan Groschupf



bixo BIXO-11 create a basic fetch loop Task Resolved Major Fixed Stefan Groschupf

bixo BIXO-10 create a basic url importer Task Resolved Major Fixed Stefan Groschupf

bixo BIXO-9 Create queue-based fetcher Task Resolved Major Fixed Ken Krugler










bixo BIXO-8 set flow connector properties Task Open Major UNRESOLVED Stefan Groschupf









bixo BIXO-7 do we need url normalisation? New Feature Resolved Major Fixed Stefan Groschupf









bixo BIXO-6 version in jar Task Open Minor UNRESOLVED Stefan Groschupf









bixo BIXO-5 conf files Task Resolved Major Fixed Ken Krugler









bixo BIXO-4 license.txt Task Open Major UNRESOLVED Stefan Groschupf

bixo BIXO-3 header Task Resolved Major Fixed Stefan Groschupf










bixo BIXO-2 Create PLD (paid level domain) extractor Task Resolved Major Fixed Ken Krugler









bixo BIXO-1 Set up ivy-based dependency management Task Resolved Major Fixed Stefan Groschupf









Generated at Fri Nov 25 12:52:39 UTC 2011 using JIRA Enterprise Edition, Version: 3.13.1-#333.










Reporter Created Updated Affects Version/s Fix Version/s Component/s Due Date Votes Images Work Ratio

Christian Kütbach 10/14/2011 8:06 10/14/2011 8:06 2.0.1 0









Sebastian Felis 8/9/2011 12:59 8/16/2011 7:33 2.1 0










Sebastian Felis 8/9/2011 12:55 8/16/2011 10:32 2.1 0









Sebastian Felis 12/30/2010 12:05 8/16/2011 10:36 0










Sebastian Felis 12/30/2010 10:30 8/16/2011 10:39 0










Sebastian Felis 12/30/2010 9:14 12/30/2010 9:15 0










André Schild 5/5/2010 6:38 5/6/2010 7:00 2.0.1 2.0.1 0

2.1










Marko Bauhardt 1/28/2010 8:37 1/28/2010 8:37 2 2.0.1 0










Bela Ban 1/4/2010 10:37 5/9/2010 18:50 2 2.1 0










Knut Forkalsrud 11/29/2009 3:20 5/9/2010 18:52 2 2.1 0









Thomas Fromm 8/6/2009 9:46 5/10/2010 7:50 2 2.1 0










Thomas Fromm 8/6/2009 9:41 8/12/2009 7:40 2 2.0.1 0









Thomas Fromm 8/6/2009 9:37 8/19/2009 15:27 2 2.0.1 0









Thomas Fromm 8/6/2009 9:32 8/12/2009 8:31 2 2.0.1 0










Jim Cortez 8/4/2009 16:41 8/5/2009 9:28 2.1 0









Marko Bauhardt 7/27/2009 7:26 8/6/2009 8:15 2 2.1 0










Miguel Ferreira 7/22/2009 14:17 5/6/2010 19:50 2 2.1 0









Marko Bauhardt 7/20/2009 8:37 8/12/2009 7:33 2 2.0.1 0










André Schild 5/18/2009 8:47 5/26/2009 8:20 2.1 0









Marko Bauhardt 4/24/2009 6:25 8/12/2009 8:52 2 2 0










André Schild 4/22/2009 7:41 9/23/2011 12:59 2 2 0









Marko Bauhardt 4/22/2009 7:11 8/12/2009 8:52 2 2 0










nicolas frances 12/29/2010 7:46 3/21/2011 13:23 0









Marko Bauhardt 10/2/2009 16:09 10/6/2009 14:29 0.2 0.2.1 0

Marko Bauhardt 10/2/2009 11:41 10/2/2009 11:41 0.2 0.2.1 0



Marko Bauhardt 10/2/2009 8:17 10/2/2009 8:17 0.2 0.2.1 0





Marko Bauhardt 9/30/2009 13:11 9/30/2009 13:11 0.2 0.2.1 0

Marko Bauhardt 9/30/2009 12:34 10/6/2009 14:29 0.2 0.2.1 0





Marko Bauhardt 9/30/2009 12:34 10/6/2009 14:29 0.2 0.2.1 0



Marko Bauhardt 9/30/2009 11:56 10/5/2009 14:05 0.2 0.2.1 0



Marko Bauhardt 9/29/2009 9:58 10/5/2009 13:53 0.2 0.2.1 0





Marko Bauhardt 9/29/2009 9:31 10/5/2009 13:59 0.2 0.2.1 0







Max Josef Ender 8/26/2009 10:21 9/23/2009 14:56 0.1 0.2 0



Max Josef Ender 8/26/2009 10:20 8/26/2009 16:48 0.1 0





Marko Bauhardt 8/24/2009 15:19 9/24/2009 9:09 0.2 0








Marko Bauhardt 8/24/2009 15:17 9/23/2009 14:57 0.3 0



Marko Bauhardt 8/14/2009 12:57 8/19/2009 15:24 0.1 0



Marko Bauhardt 8/13/2009 9:04 8/13/2009 9:04 0.1 0

Marko Bauhardt 8/13/2009 9:04 8/13/2009 9:04 0.1 0

Marko Bauhardt 8/13/2009 8:54 8/13/2009 8:54 0.1 0





Marko Bauhardt 8/13/2009 8:52 8/13/2009 8:59 0.1 0

Marko Bauhardt 8/13/2009 8:52 8/13/2009 9:00 0.1 0



Marko Bauhardt 8/13/2009 8:50 8/19/2009 15:24 0.1 0



Marko Bauhardt 8/13/2009 8:49 8/13/2009 8:55 0.1 0



Marko Bauhardt 8/13/2009 8:45 8/19/2009 15:24 0.1 0





Marko Bauhardt 8/13/2009 8:43 8/13/2009 8:59 0.1 0





Marko Bauhardt 8/13/2009 8:42 8/13/2009 8:55 0.1 0



Marko Bauhardt 8/13/2009 8:40 8/13/2009 8:55 0.1 0

Marko Bauhardt 8/13/2009 8:39 8/13/2009 8:55 0.1 0



Marko Bauhardt 8/13/2009 8:35 8/24/2009 15:17 0.1 0





Marko Bauhardt 8/13/2009 8:34 9/21/2009 14:52 0.2 0










Andrew 6/17/2011 20:56 9/26/2011 9:05 0.6.5 search 0









Johannes Zillmann 5/14/2011 15:31 5/14/2011 15:46 0.6.2 0.6.4 cluster 0










Mathias Walter 3/14/2011 15:14 3/14/2011 15:14 0.6.3 index search 0









Mathias Walter 3/14/2011 14:57 5/14/2011 16:24 0.6.3 0.6.4 search 0









Mathias Walter 3/14/2011 10:54 5/14/2011 16:44 0.6.3 0.6.4 search 0










Murali Krishna 3/13/2011 12:53 3/13/2011 12:53 0.6.2 infrastructure 0










Mathias Walter 3/11/2011 13:52 5/14/2011 16:22 0.6.3 0.6.4 0









Mathias Walter 3/10/2011 14:42 5/14/2011 16:44 0.6.3 0.6.4 infrastructure 0









Johannes Zillmann 3/7/2011 8:36 3/7/2011 8:36 0.6.5 index 0









Johannes Zillmann 2/14/2011 21:27 2/14/2011 21:34 0.6.3 0.6.4 cluster 0







Johannes Zillmann 2/14/2011 13:26 2/14/2011 13:29 0.6 0.6.4 0










Johannes Zillmann 2/14/2011 9:24 2/14/2011 13:05 0.6.3 0.6.4 0










Johannes Zillmann 2/2/2011 12:37 2/2/2011 18:30 0.6 0.6.4 0









Johannes Zillmann 2/2/2011 8:26 5/26/2011 7:26 0.6.5 cluster 0










Johannes Zillmann 2/1/2011 22:09 2/1/2011 22:35 0.6.4 0










Johannes Zillmann 2/1/2011 21:44 2/1/2011 22:31 0.6.4 index 0










Johannes Zillmann 2/1/2011 21:04 2/1/2011 21:09 0.6.4 cluster 0










Johannes Zillmann 2/1/2011 17:30 5/26/2011 7:26 0.6.2 0.6.5 cluster 0









Johannes Zillmann 2/1/2011 16:49 2/1/2011 17:20 0.6.3 0.6.4 cluster 0









Johannes Zillmann 2/1/2011 12:06 2/1/2011 16:50 0.6.4 cluster 0










Johannes Zillmann 1/26/2011 10:55 1/29/2011 14:11 0.6.3 0.6.4 0









Johannes Zillmann 1/26/2011 10:46 1/26/2011 12:04 0.6.4 search 0

Johannes Zillmann 1/26/2011 10:32 2/1/2011 16:50 0.6.4 search 0










Murali Krishna 1/10/2011 7:39 5/26/2011 7:26 0.6.2 0.6.5 search 0










Patrick Crenshaw 12/13/2010 15:56 5/26/2011 7:26 0.6.3 0.6.5 search 0










mg 12/2/2010 1:33 5/26/2011 7:26 0.6.3 0.6.5 0










mg 12/1/2010 22:19 5/26/2011 7:26 0.6, 0.6.1, 0.6.2, 0.6.3 0.6.5 0









mg 12/1/2010 19:38 5/26/2011 7:26 0.6.2, 0.6.3 0.6.5 0










mg 11/22/2010 8:44 11/24/2010 11:19 0.6.2 0.6.3 cluster 0










Patrick Crenshaw 11/17/2010 23:03 11/18/2010 8:37 0









Johannes Zillmann 10/25/2010 9:06 10/25/2010 9:25 0.6.2 0.6.3 cluster 0










Michael Small 10/21/2010 20:19 10/26/2010 7:46 0.6.2 0.6.3 1










mg 10/20/2010 18:03 9/26/2011 7:14 0.6.2 0.6.3 0










Hongchao Li 10/18/2010 16:41 10/25/2010 9:24 0.6.2 0.6.3 0









Mathias Walter 9/24/2010 11:00 9/26/2010 13:30 0.6.2 0.6.3 0









Mathias Walter 9/20/2010 7:00 5/26/2011 7:26 0.6.2 0.6.5 cluster 0










Mathias Walter 9/17/2010 14:14 10/4/2010 8:33 0.6.2 0.6.3 search 1





Mathias Walter 9/17/2010 13:57 10/25/2010 10:05 0.6.2 0.6.3 search 0









Mathias Walter 9/13/2010 13:07 5/26/2011 7:26 0.6.2 0.6.5 infrastructure 0









Mathias Walter 9/13/2010 12:22 9/27/2010 9:16 0.6.3 infrastructure 0









Mathias Walter 9/10/2010 12:05 9/26/2010 15:54 0.6.2 0.6.3 index 0









Mathias Walter 8/18/2010 7:42 9/26/2010 14:17 0.6.2 0.6.3 search 0









Mathias Walter 8/17/2010 14:30 9/10/2010 11:20 0.6.2 0





Mathias Walter 8/13/2010 12:08 8/17/2010 14:21 0.6.2 0.6.3 cluster 0










Mathias Walter 8/13/2010 11:35 8/17/2010 14:15 0.6.2 0.6.3 index 0







Mathias Walter 8/13/2010 11:25 8/17/2010 13:59 0.6.2 0.6.3 cluster 0









Johannes Zillmann 8/2/2010 8:08 8/2/2010 18:46 0.6.1 0.6.3 0










Hongchao Li 7/28/2010 19:01 8/2/2010 15:14 0.6.1 0.6.2 0










Hongchao Li 7/28/2010 18:49 8/2/2010 8:29 0.6.1 0.6.2 search 0










rafia taqdees 7/26/2010 6:58 7/26/2010 6:58 0










rafia taqdees 7/26/2010 6:54 9/26/2010 16:00 0.6.1 index 0










Johannes Zillmann 7/9/2010 7:42 7/9/2010 7:49 0.6, 0.6.1 0.6.2 search 0










Johannes Zillmann 7/2/2010 15:22 7/2/2010 15:25 0.6, 0.6.1 0.6.2 infrastructure 0










Hongchao Li 7/1/2010 15:56 7/2/2010 14:39 0.6.1 0.6.2 search 0










Johannes Zillmann 6/24/2010 12:30 5/26/2011 7:26 0.6.1 0.6.5 cluster 0










Eric McCoy 6/9/2010 12:58 10/27/2010 17:59 0.6.1 0










Thomas Koch 6/3/2010 11:18 8/2/2010 8:09 0










Hongchao Li 5/12/2010 7:51 7/2/2010 15:41 0.6 0.6.2 0










Eric McCoy 5/4/2010 17:11 5/12/2010 9:22 0.6.1 0.6.2 0










Rodney O'Donnell 4/30/2010 10:56 9/26/2010 15:59 0.6.1 0.6.3 index 0









Karthik K 4/26/2010 19:13 9/26/2010 15:51 cluster 0










Hongchao Li 4/25/2010 11:05 7/2/2010 14:39 0.6.2 0










Hongchao Li 4/25/2010 10:43 5/26/2011 7:26 0.6.1 0.6.5 0









David Buttler 4/23/2010 15:23 7/2/2010 14:38 0.6.1 0.6.2 infrastructure 0









Karthik K 4/19/2010 6:18 4/25/2010 10:53 index 0










thibaut 3/23/2010 23:11 7/2/2010 15:43 0.6.1 0.6.2 0









Karthik K 3/20/2010 1:19 1/26/2011 10:08 0.7 0









Karthik K 3/4/2010 8:13 7/2/2010 14:33 0.6.2 0








thibaut 3/3/2010 20:29 5/12/2010 9:09 0.6.1 0.6.2 0










Hongchao Li 2/24/2010 16:11 2/28/2010 11:33 0.6 0.6.1 cluster 0









thibaut 2/23/2010 12:32 2/24/2010 12:11 0










Neil Cohen 2/20/2010 9:43 2/28/2010 11:59 0.6 0.6.1 infrastructure 0









thibaut 2/17/2010 16:55 8/26/2011 13:16 0










thibaut 2/1/2010 23:01 2/2/2010 9:23 0.6 0.6 0









thibaut 2/1/2010 16:39 4/25/2010 10:57 0.7 0










Johannes Zillmann 2/1/2010 16:34 2/1/2010 16:39 0.7 0









Johannes Zillmann 2/1/2010 16:05 2/1/2010 16:16 0.6 0.6 0










Johannes Zillmann 1/31/2010 17:40 2/1/2010 16:08 0.6 0.6 cluster 0









Thomas Koch 1/20/2010 15:10 9/10/2010 9:23 1



Thomas Koch 1/14/2010 13:58 1/22/2010 9:46 0



Thomas Koch 1/14/2010 12:59 1/27/2010 10:25 0



Thomas Koch 1/14/2010 9:42 1/27/2010 10:03 0.6 0










Thomas Koch 1/14/2010 8:02 1/29/2010 8:23 0.6 0









Johannes Zillmann 1/14/2010 7:21 1/14/2010 8:57 0.6 0.6 0



Johannes Zillmann 1/12/2010 13:55 1/12/2010 15:28 0.6 0





Johannes Zillmann 1/7/2010 16:57 1/7/2010 16:57 0.6 0










thibaut 1/3/2010 22:14 6/22/2010 14:46 0.6 0.6 infrastructure 0









Johannes Zillmann 12/30/2009 20:14 12/31/2009 4:23 0.6 cluster 0









Johannes Zillmann 12/22/2009 11:25 1/12/2010 15:44 0.5.1 0.6 0





Johannes Zillmann 12/22/2009 11:24 1/5/2010 12:40 0.6 0.6 0

Johannes Zillmann 12/22/2009 11:23 12/31/2009 11:11 0.6 0.6 search 0










Johannes Zillmann 12/21/2009 12:01 12/28/2009 14:36 0.6 0.6 search 0









Johannes Zillmann 12/16/2009 14:50 12/30/2009 19:44 0.6 0.6 0










Johannes Zillmann 12/15/2009 18:06 12/15/2009 18:27 0.6 0.6 infrastructure 0









thibaut 12/14/2009 10:28 1/8/2010 8:24 0.6 search 0










Aseem Jain 12/8/2009 12:58 12/8/2009 12:58 0.6 cluster 0









Johannes Zillmann 12/8/2009 12:34 12/30/2009 19:43 0.6 0.6 0










Johannes Zillmann 12/8/2009 12:25 1/6/2010 15:52 0.5.1 0.6 cluster 0









Johannes Zillmann 12/1/2009 14:24 12/28/2009 14:35 0.6 0










Johannes Zillmann 11/19/2009 16:31 1/8/2010 7:55 0.7 cluster search 0










Johannes Zillmann 11/8/2009 23:03 9/26/2011 7:14 0.5.1 0.6 search 0









Yair Even-Zohar 10/22/2009 17:29 10/22/2009 17:40 0.5.1 cluster 0







Phil Hagelberg 10/20/2009 23:47 9/26/2010 14:29 0.5.1 search 0







Phil Hagelberg 10/16/2009 21:14 10/17/2009 0:45 0.6 0










Phil Hagelberg 10/16/2009 20:54 10/31/2009 17:31 0.6 0









Jason Rutherglen 10/13/2009 23:46 10/14/2009 4:17 0.5.1 cluster 0









Stefan Groschupf 10/8/2009 7:54 1/12/2010 15:51 0.5.1 0.6 0





Stefan Groschupf 10/8/2009 7:10 10/13/2009 18:59 0.5.1 0.6 0









Stefan Groschupf 10/7/2009 7:48 10/8/2009 1:18 0.5.1 0.6 0










Jason Venner 10/6/2009 16:16 10/13/2009 4:10 infrastructure 0










Imran M M Yousuf 10/5/2009 5:18 10/13/2009 6:55 0.5.1 infrastructure 0









Stefan Groschupf 10/5/2009 0:26 12/30/2009 20:15 0.6 0







Stefan Groschupf 9/30/2009 23:29 11/19/2009 16:45 0.6 0





Jason Rutherglen 9/28/2009 17:45 2/1/2011 17:20 0.5.1 0.6 0










Jason Rutherglen 8/5/2009 22:36 4/25/2010 10:59 0.5.1 index 0









Jonathan Gray 7/21/2009 22:27 11/25/2009 14:28 0.5.1 0.6 1









Phil Hagelberg 7/17/2009 18:58 10/13/2009 18:56 0.5.1 0.6 index 0










Phil Hagelberg 7/17/2009 0:00 6/24/2010 13:16 0.5.1 0.6 index 0









Phil Hagelberg 7/15/2009 23:23 10/13/2009 21:44 0.5.1 0.6 0



Phil Hagelberg 6/23/2009 21:25 10/31/2009 17:41 0.5.1 0.6 infrastructure 0





Phil Hagelberg 6/23/2009 21:21 10/14/2009 21:41 0.5.1 0.6 infrastructure 0










Peter Voss 6/15/2009 19:05 6/16/2009 7:23 0.5.1 0.6 search 0









Stefan Groschupf 6/5/2009 18:21 6/5/2009 18:28 0.5.1 0.6 0










Johannes Herr 6/5/2009 14:21 6/5/2009 17:00 0.6 0.6 0









Ted Dunning 6/4/2009 22:23 10/14/2009 21:44 0.6 0










Ted Dunning 6/4/2009 21:45 12/30/2009 20:17 0.7 0









Johannes Herr 6/4/2009 16:17 6/12/2009 18:11 0.6 0.6 0










Ken Krugler 6/4/2009 16:04 10/13/2009 19:00 0.5.1 0.6 0









Ken Krugler 6/4/2009 15:09 6/5/2009 19:26 0.5.1 0.6 0










Stefan Groschupf 6/2/2009 17:42 12/2/2009 13:47 0.5.1 0.6 0









Peter Voss 5/27/2009 12:20 5/27/2009 12:20 0.6 0










Ted Dunning 5/13/2009 17:27 5/13/2009 17:27 0.5.1 cluster 0










Ted Dunning 5/13/2009 17:19 1/8/2010 7:55 0.5.1 0.7 cluster 0









Andrew John 5/11/2009 0:24 10/4/2009 3:52 0.5.1 0.6 0










Ted Dunning 5/10/2009 2:19 10/4/2009 3:58 0.6 0










Ted Dunning 5/10/2009 2:03 12/7/2009 19:39 0.5.1 0










Ken Krugler 5/8/2009 23:16 9/29/2009 6:31 0.5.1 0.6 0









Ken Krugler 5/8/2009 23:14 10/13/2009 19:06 0.5.1 0.6 0









Ken Krugler 5/8/2009 23:09 10/13/2009 22:52 0.5.1 0.6 0





Andrew John 5/8/2009 21:27 1/12/2010 15:50 0.5.1 0.6 2










Ted Dunning 4/30/2009 21:48 5/1/2009 6:07 0.5.1 0.6 0









Peter Voss 4/27/2009 7:59 4/28/2009 12:22 0.6 0









Stefan Groschupf 4/25/2009 4:12 4/26/2009 18:46 0.5 0.6 0










Peter Voss 4/23/2009 15:35 1/8/2010 7:43 0










Erich Nachbar 4/22/2009 22:17 1/8/2010 7:55 0.5 0.7 2










Erich Nachbar 4/22/2009 22:02 9/29/2009 6:19 0.5 0.6 0









VM 4/22/2009 7:14 5/1/2009 8:00 0.6 0










Erich Nachbar 4/21/2009 23:08 4/23/2009 7:55 0.5.1 0










dengminwen 4/21/2009 12:20 5/4/2009 20:03 0.5.1 0.6 0










dengminwen 4/21/2009 12:07 9/29/2009 6:08 0.5.1 0.6 search 0










Ted Dunning 4/17/2009 16:17 12/28/2009 14:32 0.4 0.6 cluster 0









Stefan Groschupf 4/17/2009 7:11 9/28/2009 14:26 0.5 0.6 0










Stefan Groschupf 4/17/2009 4:23 4/17/2009 7:27 0.5 0.5.1 0









Stefan Groschupf 4/17/2009 4:22 5/4/2009 4:49 0.5 0.6 0










Stefan Groschupf 4/17/2009 4:21 4/17/2009 4:57 0.5 0.5.1 0









Stefan Groschupf 4/17/2009 4:21 4/17/2009 4:50 0.5 0.5.1 0









Stefan Groschupf 4/17/2009 4:20 4/17/2009 4:40 0.5 0.5.1 0










Peter Voss 4/16/2009 14:20 4/16/2009 14:22 0.5 0.6 0









VM 4/16/2009 5:11 4/16/2009 13:25 0.5 0.6 0










VM 4/13/2009 1:58 9/28/2009 14:31 0.6 0









Stefan Groschupf 4/9/2009 6:31 4/10/2009 6:15 0.6 0.6 0

Stefan Groschupf 4/9/2009 6:30 4/9/2009 8:01 0.6 0.6 0



Stefan Groschupf 4/8/2009 23:00 1/12/2010 15:45 0.6 0

Stefan Groschupf 4/3/2009 5:53 4/10/2009 6:19 0.5 0.6 0










Stefan Groschupf 4/1/2009 8:46 9/26/2011 7:13 0.5 0.6 0










Stefan Groschupf 4/1/2009 8:13 10/4/2009 3:57 0.5 0.6 0









Stefan Groschupf 4/1/2009 8:04 4/1/2009 8:07 0.5 0.5 0





Stefan Groschupf 4/1/2009 6:32 4/1/2009 6:36 0.5 0.5 0









Stefan Groschupf 3/28/2009 4:43 3/28/2009 4:51 0.5 0



Stefan Groschupf 3/25/2009 17:41 3/27/2009 5:44 0.5 0.4 0










Erich Nachbar 3/19/2009 22:05 4/4/2009 8:37 0.5 0.5 search 0









Peter Voss 3/19/2009 17:56 3/19/2009 23:21 0.5 0







Peter Voss 3/19/2009 17:47 3/25/2009 16:02 0.5 0.5 infrastructure 0










Peter Voss 3/19/2009 17:35 3/28/2009 16:43 0.5 0.5 search 0









Stefan Groschupf 3/17/2009 5:15 3/17/2009 5:17 0.5 0



Stefan Groschupf 2/19/2009 3:04 2/19/2009 3:05 0.5 0










Stefan Groschupf 1/8/2009 16:04 1/7/2010 18:55 0.6 0









Ted Dunning 1/7/2009 21:50 10/4/2009 3:57 0.6 0









Ted Dunning 1/7/2009 21:45 4/3/2009 4:15 0.5 0.5 0









Ted Dunning 1/7/2009 21:44 5/1/2009 17:55 0.6 0










Stefan Groschupf 12/18/2008 7:55 1/8/2009 15:19 0.5 0









Stefan Groschupf 12/18/2008 7:35 6/5/2009 3:36 0.6 0









Johannes Zillmann 12/12/2008 15:04 1/8/2010 7:51 0.4 cluster 0









Johannes Zillmann 12/12/2008 15:02 3/28/2009 5:43 0.4 0.5 cluster 0










Johannes Zillmann 12/12/2008 14:59 1/12/2010 15:49 0.4 index 0







Johannes Zillmann 12/12/2008 14:57 10/4/2009 3:55 0.4 cluster 0









Johannes Zillmann 12/12/2008 14:55 5/1/2009 7:09 0.4 cluster 0










Johannes Zillmann 12/12/2008 14:54 2/19/2010 16:28 0.4 0.6 cluster 0









Johannes Zillmann 12/12/2008 14:27 1/8/2010 7:39 0.4 0.6 cluster 0









Johannes Zillmann 12/12/2008 14:21 4/1/2009 16:53 0.4 0.5 search 0










Johannes Zillmann 12/12/2008 14:15 2/1/2010 16:19 0.4 index 0









Johannes Zillmann 12/12/2008 14:10 4/3/2009 7:29 0.4 0.5 search 0









Johannes Zillmann 12/12/2008 14:08 1/12/2010 15:47 0.4 index 0









remodel 2/28/2010 13:06 2/28/2010 13:06 0










Vivek Magotra 12/3/2009 5:21 12/3/2009 5:21 0









Fuad Efendi 11/30/2009 16:00 12/2/2009 2:05 0









Ken Krugler 10/30/2009 12:55 10/30/2009 12:55 0.4 0





Ken Krugler 10/30/2009 12:53 10/30/2009 12:53 0.4 0










Ken Krugler 10/30/2009 12:51 10/30/2009 12:51 0.4 0









Ken Krugler 10/30/2009 12:49 10/30/2009 12:49 0.4 0









Ken Krugler 10/30/2009 12:46 10/30/2009 12:46 0



Ken Krugler 10/30/2009 12:45 10/30/2009 12:45 0



Ken Krugler 10/30/2009 12:43 10/30/2009 12:43 0.4 0








Ken Krugler 10/30/2009 12:41 10/30/2009 12:41 0.4 0









Ken Krugler 10/27/2009 21:47 10/27/2009 21:47 0.4 0










Ken Krugler 10/27/2009 12:44 10/27/2009 12:44 0.4 0









Ken Krugler 10/1/2009 16:02 10/1/2009 16:02 0





Ken Krugler 10/1/2009 16:01 10/1/2009 16:01 0.4 0









Ken Krugler 10/1/2009 15:53 10/1/2009 15:53 0.4 0










Ken Krugler 9/30/2009 17:06 10/31/2009 17:36 0









Ken Krugler 9/17/2009 21:41 9/30/2009 16:54 0.4 0





Ken Krugler 9/17/2009 21:39 10/31/2009 17:42 0





Ken Krugler 9/17/2009 21:38 9/28/2009 13:11 0





Ken Krugler 9/17/2009 21:32 9/17/2009 21:32 0










Ken Krugler 9/17/2009 21:18 9/18/2009 15:18 0.4 0.5 0









Ken Krugler 8/14/2009 20:10 8/14/2009 20:10 0.4 0










Ken Krugler 8/14/2009 20:05 9/11/2009 22:49 0.4 0









Ken Krugler 8/14/2009 14:46 9/11/2009 22:28 0










Ken Krugler 8/14/2009 14:23 10/27/2009 16:57 0.4 0









Ken Krugler 8/14/2009 14:12 8/14/2009 14:12 0.4 0







Ken Krugler 8/14/2009 14:04 8/18/2009 2:36 0.3 0.5 0









Ken Krugler 8/8/2009 14:29 8/14/2009 20:03 0.4 0










Ken Krugler 7/31/2009 22:58 8/4/2009 22:34 0.3 0.4 0









Ken Krugler 7/31/2009 22:51 9/30/2009 16:55 0.3 0.5 0









Stefan Groschupf 6/15/2009 20:37 10/27/2009 16:52 0.4 0





Ken Krugler 5/1/2009 21:40 5/1/2009 21:40 0.3 0





Ken Krugler 4/30/2009 21:20 9/11/2009 22:34 0

Ken Krugler 4/30/2009 21:18 5/17/2009 23:36 0

Ken Krugler 4/30/2009 21:18 4/30/2009 23:10 0.4 0









Ken Krugler 4/29/2009 10:54 4/29/2009 10:54 0










Ken Krugler 4/29/2009 10:24 4/29/2009 10:24 0









Ken Krugler 4/29/2009 10:20 8/5/2009 18:41 0.4 0









Stefan Groschupf 4/24/2009 23:13 4/24/2009 23:13 0.4 0

Stefan Groschupf 4/24/2009 23:12 4/24/2009 23:12 0.4 0




Ken Krugler 4/24/2009 20:35 8/7/2009 23:09 0







Stefan Groschupf 4/22/2009 21:45 4/22/2009 21:45 0.5 0







Stefan Groschupf 4/22/2009 21:38 9/11/2009 22:34 0.3 0.5 0









Ken Krugler 4/22/2009 0:34 8/7/2009 23:13 0










Ken Krugler 4/18/2009 16:04 4/18/2009 16:04 0









Ken Krugler 4/18/2009 15:54 8/18/2009 2:36 0.5 0










Ken Krugler 4/18/2009 15:49 8/7/2009 23:08 0.4 0









Ken Krugler 4/18/2009 15:43 9/30/2009 17:08 0.5 0









Ken Krugler 4/18/2009 15:42 8/11/2009 16:29 0.4 0










Ken Krugler 4/18/2009 15:39 8/11/2009 16:31 0.4 0









Ken Krugler 4/18/2009 15:35 10/30/2009 12:29 0

Ken Krugler 4/18/2009 15:35 10/27/2009 16:52 0.4 0









Ken Krugler 4/18/2009 15:33 9/11/2009 22:46 0.4 0









Ken Krugler 4/18/2009 15:31 4/18/2009 15:31 0










Ken Krugler 4/18/2009 15:29 7/31/2009 22:39 0.4 0









Ken Krugler 4/18/2009 15:27 10/30/2009 12:34 0.5 0

Ken Krugler 4/18/2009 15:26 9/11/2009 22:31 0.5 0



Ken Krugler 4/18/2009 15:25 10/27/2009 23:49 0.5 0









Stefan Groschupf 4/18/2009 0:45 7/31/2009 22:40 0.3 0.4 0





Stefan Groschupf 4/18/2009 0:43 4/18/2009 0:46 0.3 0

Stefan Groschupf 4/11/2009 1:13 9/11/2009 22:38 0.3 0.5 0









Stefan Groschupf 4/10/2009 23:50 7/31/2009 22:41 0.3 0.4 0










Stefan Groschupf 4/10/2009 23:43 8/18/2009 2:36 0.3 0.5 0









Stefan Groschupf 4/10/2009 21:01 4/18/2009 0:39 0.3 0.3 0





Stefan Groschupf 4/10/2009 21:00 4/10/2009 21:00 0.2 0.2 0









Stefan Groschupf 4/9/2009 22:09 8/18/2009 2:36 0.2 0.5 0

Stefan Groschupf 4/9/2009 5:07 4/9/2009 5:09 0.2 0.2 0









Stefan Groschupf 4/4/2009 0:29 4/7/2009 1:02 0.2 0





Stefan Groschupf 4/4/2009 0:28 8/11/2009 16:32 0.2 0



Stefan Groschupf 4/4/2009 0:26 8/7/2009 23:12 0.2 0

Stefan Groschupf 4/4/2009 0:26 7/31/2009 22:40 0.2 0

Stefan Groschupf 4/4/2009 0:26 8/11/2009 16:33 0.4 0

Stefan Groschupf 4/4/2009 0:26 4/18/2009 0:40 0.3 0

Stefan Groschupf 4/3/2009 22:26 4/3/2009 22:32 0.1 0.1 0

Stefan Groschupf 4/3/2009 22:25 4/3/2009 22:32 0.1 0.1 0



Stefan Groschupf 4/3/2009 22:25 4/3/2009 22:32 0.1 0.1 0

Stefan Groschupf 4/3/2009 22:25 4/3/2009 22:32 0.1 0.1 0

Ken Krugler 4/2/2009 2:17 4/10/2009 21:02 0.2 0.2 0










Stefan Groschupf 4/2/2009 2:16 8/18/2009 2:36 0.2 0.5 0









Stefan Groschupf 4/2/2009 1:13 4/3/2009 22:32 0.1 0.1 0









Stefan Groschupf 4/1/2009 1:31 8/18/2009 2:36 0.2 0.5 0









Stefan Groschupf 4/1/2009 1:20 4/7/2009 1:11 0.2 0.2 0









Stefan Groschupf 4/1/2009 1:18 9/25/2009 18:14 0.2 0.5 0

Stefan Groschupf 4/1/2009 1:18 4/7/2009 2:00 0.2 0.2 0










Ken Krugler 3/30/2009 23:53 4/3/2009 22:33 0.1 0.1 0









Ken Krugler 3/30/2009 20:30 4/3/2009 22:33 0.1 0.1 0










Sub-Tasks Issue Links Environment Description Security Level

Office 2007, Webdav-Servlet 2.0.1, Windows XP

If one opens an Office document from a WebDAV share, the file is read-only.

The problem seems to be within the doLock method:

private void generateXMLReport(ITransaction transaction,
        HttpServletResponse resp, LockedObject lo)
[...]
generatedXML.writeElement("DAV::owner", XMLWriter.OPENING);
// encapsulating the owner with an href element will trigger the bug
// generatedXML.writeElement("DAV::href", XMLWriter.OPENING);
generatedXML.writeText(_lockOwner);
// generatedXML.writeElement("DAV::href", XMLWriter.CLOSING);

After I removed this element, all tested combinations of Windows versions and Office versions (2007 and 2010) worked fine.
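
The two response shapes discussed in this report can be compared outside the servlet. The following is only a minimal sketch, not the project's code: the class `LockOwnerXml`, the method `ownerElement`, and the `D:` prefix are names assumed for illustration, and the string is built directly instead of going through XMLWriter.

```java
public class LockOwnerXml {

    // Escapes the characters that would otherwise break element content.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds the owner element of a lock response.
    // wrapInHref = true mimics the variant reported to make Office open files read-only;
    // wrapInHref = false mimics the workaround described in the report.
    static String ownerElement(String lockOwner, boolean wrapInHref) {
        StringBuilder xml = new StringBuilder("<D:owner>");
        if (wrapInHref) {
            xml.append("<D:href>");
        }
        xml.append(escape(lockOwner));
        if (wrapInHref) {
            xml.append("</D:href>");
        }
        return xml.append("</D:owner>").toString();
    }

    public static void main(String[] args) {
        // Print both variants side by side for comparison.
        System.out.println(ownerElement("DOMAIN\\user", true));
        System.out.println(ownerElement("DOMAIN\\user", false));
    }
}
```

Whether a given client accepts one shape or the other would still need to be checked against that client; the sketch only makes the difference between the two outputs explicit.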







In the new release (r100), the implementation of IMimeTyper changed so that the object store is queried without the transaction. The Grails WebDAV plugin uses a transaction and fails on the MIME request with a NullPointerException.

The provided patch adds an ITransaction parameter to IMimeTyper.getMimeType() to fix this issue.










From the mailing list:

http://sourceforge.net/mailarchive/forum.php?thread_name=4D6FAF4F.2090403%40rewoo.com&forum_name=webdav-servlet-general

----------8





x.xxxxxx

@aarboard.ch












Exception = java.net.SocketTimeoutException
Source = com.ibm.ws.webcontainer.channel.WCCByteBufferInputStream
probeid = 102
Stack Dump = java.net.SocketTimeoutException: Async operation timed out
at com.ibm.ws.tcp.channel.impl.AioTCPReadRequestContextImpl.processSyncReadRequest(AioTCPReadRequestContextImpl.java:157)
at com.ibm.ws.tcp.channel.impl.TCPReadRequestContextImpl.read(TCPReadRequestContextImpl.java:109)
at com.ibm.ws.http.channel.impl.HttpServiceContextImpl.fillABuffer(HttpServiceContextImpl.java:4136)
at com.ibm.ws.http.channel.impl.HttpServiceContextImpl.readSingleBlock(HttpServiceContextImpl.java:3378)
at com.ibm.ws.http.channel.impl.HttpServiceContextImpl.readBodyBuffer(HttpServiceContextImpl.java:3483)
at com.ibm.ws.http.channel.inbound.impl.HttpInboundServiceContextImpl.getRequestBodyBuffer(Ht










An implementation of IWebdavStore might have to release resources created in its constructor. However, there is currently no destroy() method in IWebdavStore. The attached patch adds this method, implements it in LocalFilesystemStore, and calls destroy() on the store from the WebDAV servlet.

A further improvement would be to also add an init() method that is called by the servlet. This would replace calling the IWebdavStore constructor with a File argument directly.










Mac OS X

The method "removeLockedObjectOwner" in LockedObject.java seems to be prone to ArrayIndexOutOfBoundsExceptions. From what I can tell, the issue is just a matter of remembering that the array was shrunk by one when removing the lock owner.
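The pattern behind this kind of bug can be illustrated with a minimal sketch. This is not the actual LockedObject code; the class `OwnerArrays` and the method `removeOwner` are hypothetical names chosen for illustration. It shows one way to shrink an array by one element without ever indexing past the shorter array.

```java
// Hypothetical helper, not the actual LockedObject code: removes the first
// occurrence of `owner` and returns an array that is one element shorter.
class OwnerArrays {
    static String[] removeOwner(String[] owners, String owner) {
        for (int i = 0; i < owners.length; i++) {
            if (owners[i].equals(owner)) {
                String[] shrunk = new String[owners.length - 1];
                // Copy the elements before and after index i separately, so
                // no index ever exceeds the bounds of the shorter array.
                System.arraycopy(owners, 0, shrunk, 0, i);
                System.arraycopy(owners, i + 1, shrunk, i, owners.length - i - 1);
                return shrunk;
            }
        }
        return owners; // owner not present: nothing to remove
    }
}
```

Returning as soon as the element is removed avoids continuing a loop whose bound was computed for the old, longer array.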



http://webdav-servlet.svn.sourceforge.net/viewvc/webdav-servlet/trunk/src/main/java/net/sf/webdav/locking/LockedObject.java?annotate=54#l116





The following fix seems to resolve the problem in

my simple test setup.





infundibulum:webdav-servlet knut$ svn diff /Users/knut/src/webdav-servlet/src/main/java/net/sf/webdav/locking/LockedObject.java
Index: /Users/knut/src/webdav-servlet/src/main/java/net/sf/webdav/locking/LockedObject.java
===================================================================
--- /Users/knut/src/webdav-servlet/src/main/java/net/sf/webdav/locking/LockedObject.java (revision 87)
+++ /Users/knut/src/webdav-servlet/src/main/java/net/sf/webdav/locking/Loc

This bug report is not mine. I am currently copying my issues from the original SourceForge bug tracker to this Jira and came across this (IMHO valid) report:

In a multithreaded environment, SimpleDateFormat can produce correctly formatted output for a totally different date.

See e.g. net.sf.webdav.methods.CREATION_DATE_FORMAT
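One common way to avoid sharing a SimpleDateFormat between threads is to keep one instance per thread. The following is a minimal sketch, not the fix that was applied in the project; the class name `CreationDateFormatter` and the date pattern are assumptions made for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical wrapper: each thread gets its own SimpleDateFormat, so
// concurrent calls never touch the same mutable formatter state.
class CreationDateFormatter {
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        new ThreadLocal<SimpleDateFormat>() {
            @Override
            protected SimpleDateFormat initialValue() {
                // The pattern is an assumption for this sketch.
                SimpleDateFormat f =
                    new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.US);
                f.setTimeZone(TimeZone.getTimeZone("GMT"));
                return f;
            }
        };

    static String format(Date date) {
        return FORMAT.get().format(date);
    }
}
```

Alternatives with the same effect are creating a new SimpleDateFormat per call or synchronizing on a shared instance; the per-thread variant avoids both the allocation and the lock.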


When I try to copy a collection to a non-existent path, I get an error 500 instead of 409 (RFC 2518 8.8.5). The problem appears in DoCopy.copy at createResource.

I know the specification is a little unclear for this case, but I think error 409 fits best.

At the moment there is no handling of any specific content type. According to RFC 2518 8.3.1, I would suggest always returning a 415 error by default. Maybe the API can be extended later to handle specific content types.

According to RFC 2518 8.3.1, MKCOL must fail with a 409 error when one or more parent elements in the path do not exist. Currently I get a 207 Multi-Status containing a 404.

This can be fixed simply by adding the following to DoMkcol:

parentSo = _store.getStoredObject(transaction, parentPath);
if (parentSo == null) {
    // parent does not exist
    resp.sendError(WebdavStatus.SC_CONFLICT);
    return;
}










All environments

The current implementation for building an iJetty-compatible servlet has problems. The following patch fixes that:

Index: src/main/java/net/sf/webdav/methods/DoLock.java
===================================================================
--- src/main/java/net/sf/webdav/methods/DoLock.java (revision 82)
+++ src/main/java/net/sf/webdav/methods/DoLock.java (working copy)
@@ -417,7 +417,7 @@
 currentNode = childList.item(i);

 if (currentNode.getNodeType() == Node.ELEMENT_NODE) {
- _lockOwner = currentNode.getTextContent();
+ _lockOwner = currentNode.getNodeValue();
 }
 }
 }
Index: build.gradle
===================================================================
--- build.gradle (revision 82)
+++ build.gradle (working copy)
@@ -143,10 +143,16 @@

@@ -143,10 +143,16 @@










Server on Linux, all clients (Windows, Mac and Linux)

I have a folder on my server called "joão", and when I try to open it using Mac, Windows or Linux, the servlet returns an error. Inspecting the logs, I can see that the encodings of the GetObject request are wrong.

Let me see if I can explain the procedure so that you can replicate it:

1. create a folder or a file with an accent within the webdav storage area, e.g. "joão" or "luís.txt"
2. start tomcat with the webdav servlet application installed
3. mount the webdav disk on your favorite OS or webdav client
4. try to open the created folder or file on the client.

You'll notice that the client pops up an error every time one tries to open the folder or the file. I think the client makes the request to the webdav servlet in URL-encoded form, but the servlet is unable to decode the folder name and outputs an error.
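One possible cause (an assumption; the report does not confirm it) is that the percent-encoded path is decoded with the wrong charset. The sketch below uses the standard `java.net.URLDecoder` to show how the same encoded path yields different names depending on the charset. The class name `PathDecoding` is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper showing charset-dependent decoding of a request path.
class PathDecoding {
    static String decode(String rawPath, String charset) {
        try {
            // Note: URLDecoder also turns '+' into a space, which is form
            // decoding rather than strict path decoding.
            return URLDecoder.decode(rawPath, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }
}
```

Decoding `/jo%C3%A3o` as UTF-8 gives `/joão`, while decoding it as ISO-8859-1 gives `/joÃ£o`, a name that does not exist in the store.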



Some further info:










When a non-WebDAV client does a GET request for a folder resource, we currently return a very simple text document.

It would be nice to return a full HTML response page with the correct charset encoding and clickable links to navigate the folders and download the files from a standard web browser.

Currently the GET is handled in DoGet.java in the folderBody method:



if (so.isFolder()) {
    // TODO some folder response (for browsers, DAV tools
    // use propfind) in html?
    OutputStream out = resp.getOutputStream();
    String[] children = _store.getChildrenNames(transaction, path);
    children = children == null ? new String[] {} : children;
    StringBuffer childrenTemp = new StringBuffer();
    childrenTemp.append("Contents of this Folder:\n");
    for (String child : children) {
        childrenTemp.append(child);
        childrenTemp.append("\n");
    }
    out.write(childrenTemp.toString().getBytes());
}
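A minimal sketch of what such an HTML folder listing could look like is given here. It is not part of the servlet; the class `FolderListing` and its methods are hypothetical. It assumes the folder URL ends with a slash, so that relative links resolve to the children.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper that renders child names as an HTML list of links.
class FolderListing {
    // Escapes the characters that are special in HTML text and attributes.
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Percent-encodes a single path segment using java.net.URI.
    static String encodeSegment(String name) {
        try {
            return new URI(null, null, "/" + name, null).toASCIIString().substring(1);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    static String toHtml(String folderName, String[] children) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><meta http-equiv=\"Content-Type\" ")
            .append("content=\"text/html; charset=UTF-8\"><title>")
            .append(escapeHtml(folderName)).append("</title></head><body><ul>\n");
        for (String child : children) {
            html.append("<li><a href=\"").append(escapeHtml(encodeSegment(child)))
                .append("\">").append(escapeHtml(child)).append("</a></li>\n");
        }
        html.append("</ul></body></html>");
        return html.toString();
    }
}
```

A caller would write the result with `getBytes("UTF-8")` and set a matching Content-Type header, so that the declared charset and the bytes on the wire agree.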










Hello,



> In addition to this problem there seems to exist a problem with
> gvfs/1.0.2 on PUT (file creation). I'm trying to track in more detail
> where it fails. (The file is created, but somehow the gnome client
> seems to not evaluate the answer in the way it was intended by the
> webdav servlet and it retries to PUT the file...)

I tracked down the problem to the DoPut class, where the response length is set to the length of the uploaded file (line 169).

The effect is that gvfs tries to receive a response of "content-length" bytes, but the webdav servlet only sends a few bytes. So the request times out, gvfs retries the PUT, times out again and then drops the connection.

According to the code just above and its comments, something very similar seems to happen with Goliath. Just removing the content-length from the reply makes it work with gvfs.

Is there actually a webdav client REQUIRING the response length to be set to the resource size instead of the response length?

String[] getChildrenNames(ITransaction transaction, String folderUri);

If the URI points to a file, an NPE is thrown in DoPropFind on line 220.










0.4

Hello,

Maybe it is not clear enough for dummies like me, but when I launch bin/nutch admin /opt/nutch-gui-0.4/build/nutch-gui-0.4 50060, I get an error in the hadoop log:

2010-12-29 08:37:13,122 WARN servlet.PageNotFound - No mapping found for HTTP request with URI [/general/index.html] in DispatcherServlet with name 'springapp'









maybe use for example login.htm







maybe with a JVM system property like "-Djava.security.auth.login.config=path/nutchgui.auth" or via the nutch-default.xml

for example NutchGuiLogin

Or configure the name within the nutch-default.xml

validate and report an error if the user forgot to set http.agent.name before starting the crawl.

maybe the crawl should succeed anyway in this case and the user should get a warning.










this plugin allows us to crawl specified pages.

this plugin is for url uploading. the uploaded urls will be fetched.

this plugin can be used to run scheduled crawls

with this plugin a user can create new crawls and run them.

a host statistic should also be supported.

with this plugin it is possible to configure a nutch instance. the nutch-site.xml of an instance should be overwritten

this plugin should show memory/cpu usage and a logfile viewer

this plugin should create nutch instances

this plugin should be the welcome page for every instance

supported languages
+ german
+ english










This patch implements Lucene query filters.



Currently the LuceneServer class declares a few

methods which take a filter argument, but these

methods throw

UnsupportedOperationException(). This patch

provides support for passing a filter argument to

the search() and getResultCount() methods.



A FilterWritable class is included to allow passing

filter arguments between client and server.



Reported by Murali Krishna:

{noformat}

Hi,



I usually see operator thread getting stopped and restarted immediately. Is that expected? This particular instance, in one of the node it stopped for almost 2 hours and no index deployment happened during this time on this node. The 'listNodes' was showing the node connected though. I am using Katta 0.6.2.

2011-05-05 00:02:15,793 INFO net.sf.katta.master.OperatorThread:100 - operator thread stopped
2011-05-05 00:02:17,276 WARN org.I0Itec.zkclient.ZkEventThread:78 - Error handling event ZkEvent[State changed to SyncConnected sent to net.sf.katta.protocol.InteractionProtocol$1@64cbdef5]
org.I0Itec.zkclient.exception.ZkNodeExistsException: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /katta/master
at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:55)
at org.I0Itec.zkclient.ZkClient.retryUntilConnected(Z


Recently, the constant

{{Version.LUCENE_CURRENT}} was replaced by

{{Version.LUCENE_30}} in many files (mainly test

code) because LUCENE_CURRENT is deprecated.

If someone uses Lucene 3.1, which works straight

forward by the way, they have to update the

version again. I'm using Lucene 4.0 and also have

to change the version.

It would be much easier to have a global constant

or a property to change the underlying Lucene

version at a single point.

To extend LuceneServer more easily and to

implement custom or enhanced search methods

it is necessary to make the inner classes of

LuceneServer protected and make some of their

properties public or create getter/setter methods

for it.



The attached patch makes the inner classes

protected and some properties public.

The multithreaded shard search can be slightly improved by the use of CompletionService. Currently the multithreaded search is done by creating a SearchCall for every shard and submitting them to the thread pool. Later on, the search function waits for each submitted SearchCall in the order they were submitted rather than in the order they finish. In the worst case the first SearchCall takes much longer than the later ones. All finished SearchCalls remain unprocessed until the first SearchCall finishes. This can increase memory consumption (later gc) and decrease processing speed.

The attached patch fixes this by using [ExecutorCompletionService|http://download.oracle.com/javase/6/docs/api/java/util/concurrent/ExecutorCompletionService.html].
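The difference between waiting in submission order and waiting in completion order can be sketched as follows. This is an illustration of the standard `java.util.concurrent.ExecutorCompletionService` API, not the attached patch; the class `CompletionOrderDemo` and its two tasks are hypothetical stand-ins for SearchCalls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CompletionOrderDemo {
    // Submits all tasks and collects their results in completion order.
    static <T> List<T> collectAsCompleted(ExecutorService pool, List<Callable<T>> tasks)
            throws InterruptedException, ExecutionException {
        CompletionService<T> completion = new ExecutorCompletionService<T>(pool);
        for (Callable<T> task : tasks) {
            completion.submit(task);
        }
        List<T> results = new ArrayList<T>();
        for (int i = 0; i < tasks.size(); i++) {
            // take() blocks until the next task (whichever it is) has finished.
            results.add(completion.take().get());
        }
        return results;
    }

    // Hypothetical stand-ins for a slow and a fast shard search.
    static List<String> demo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Callable<String>> tasks = new ArrayList<Callable<String>>();
            tasks.add(new Callable<String>() {
                public String call() throws Exception {
                    Thread.sleep(300);
                    return "slow";
                }
            });
            tasks.add(new Callable<String>() {
                public String call() {
                    return "fast";
                }
            });
            return collectAsCompleted(pool, tasks);
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Here the fast task is processed first even though it was submitted second, so its result does not have to be held until the slow task finishes.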








Hi,

I am on 0.6.2 and I get this error during reconnect and katta node stops updating the log after this, but the process continues to run.

2011-03-05 14:04:33,500 INFO net.sf.katta.operation.node.AbstractShardOperation:75 - publish shard 'Table_inc_1299357600#part-r-00005'
2011-03-05 14:04:33,507 INFO net.sf.katta.operation.node.AbstractShardOperation:55 - redeploy shard 'Table_inc_1299357600#part-r-00003'
2011-03-05 14:04:33,507 INFO net.sf.katta.operation.node.AbstractShardOperation:75 - publish shard 'Table_inc_1299357600#part-r-00003'
2011-03-05 14:04:33,964 WARN org.I0Itec.zkclient.ZkEventThread:78 - Error handling event ZkEvent[State changed to SyncConnected sent to net.sf.katta.protocol.InteractionProtocol$1@2e3c5d8e]
org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /katta/work/node-queues/host1:20000
at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68)
at org.I0Itec.zkclient.ZkClient.retryUntilConnected(Z










The output tables produced by the katta commands {{listIndices}}, {{listNodes}} and {{check}} are not sorted and contain separator characters. The unsorted output is sometimes hard to read, and parsing would be easier without the extra separator characters. It would be useful to have options to suppress these separators, sort the table and optionally remove the table header.



The attached patch introduces three parameters:

* {{-b}} batch mode

* {{-n}} don't write column names

* {{-S}} sort the index/shard/node names



The DocumentFrequencyWritable.toString() method returns a string that is not easy to understand. The aim is to clarify the two returned numbers. The attached patch does this.



When adding a Lucene index:
{code}
bin/katta addIndex testIndex 3
{code}
the path needs to contain one or more Lucene indices. Adding fails if the given folder is itself a Lucene index. But we could add special treatment for this case and add the index as a single-shard katta index.









See https://issues.apache.org/jira/browse/ZOOKEEPER-795

Related to KATTA-182 and KATTA-167

The ZkEventThread catches only Exception in its run method. Any other Throwable will crash the thread, and the client event handling stops working. That way an OOM will crash the client.
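The distinction between catching Exception and catching Throwable can be shown with a minimal sketch. It is not the ZkEventThread code; `EventLoopSketch` and `runAll` are hypothetical names. Whether continuing after an Error such as an OOM is actually safe depends on the application, so this only illustrates the mechanism.

```java
// Hypothetical event loop: a failing event must not end the loop.
class EventLoopSketch {
    // Runs every event and returns how many of them failed.
    static int runAll(Iterable<Runnable> events) {
        int failures = 0;
        for (Runnable event : events) {
            try {
                event.run();
            } catch (Throwable t) {
                // catch (Exception e) would let Errors escape and end the
                // loop, so later events would never be handled.
                failures++;
            }
        }
        return failures;
    }
}
```

A real event thread would at least log the Throwable before moving on to the next event.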








Executed DeployUndeploySearchInLoop for a longer time (it deploys an index, searches in it, and undeploys it). This emulates the usage of a LuceneClient over a long time. Looking at heap dumps after several hours, I found the {{ZooKeeper$ZkWatchManager.existWatches}} map with 8400 entries: entries for paths like {{/katta/shard-to-nodes/index53#bIndex}} which belong to indices that were undeployed a long time ago.

A zookeeper watch is usually removed when an event for it is triggered. ZkClient immediately registers the watch again in case it has a listener for it. Now in Katta the Client removes its shard-to-nodes listeners if an index has been removed. Looking at the thread dump, it seems that sometimes {{ZooKeeper$ZkWatchManager.existWatches}} has been cleared of undeployed indices and sometimes not. I suspect that this depends on the sequence of events. If the client gets an index-removed event before the last shard-to-node change event, every index-related watch gets removed. If it is the other way around, some obsolete watches hang around forever.

Removing the watches explicitly isn't that easy: [ZOOKEEPER-










Right now a node is removed from the shard-to-node mappings in the Client if a proxy invocation fails. The node proxy itself, which gets removed as well, might be re-established if a new shard is added, but the removed shard-to-node mappings are never re-established for that client.

Now a failing proxy invocation does not necessarily mean that the proxy is corrupt, see KATTA-180 as an example.

So the following approach would look a bit safer:
- remove a node proxy only if x successive invocations have failed
- re-establish the proxy immediately
- if the re-established proxy fails, remove the shard-to-node mapping
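The first step of this approach, counting successive failures per node, could be sketched as follows. This is not Katta code; the class `ProxyFailureTracker` and its methods are hypothetical, and the threshold "x" from the list above is passed in as `maxSuccessiveFailures`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical tracker for successive proxy failures per node.
class ProxyFailureTracker {
    private final int maxSuccessiveFailures;
    private final Map<String, Integer> failures = new HashMap<String, Integer>();

    ProxyFailureTracker(int maxSuccessiveFailures) {
        this.maxSuccessiveFailures = maxSuccessiveFailures;
    }

    // Records a failed invocation. Returns true once the node has failed
    // maxSuccessiveFailures times in a row, i.e. its proxy should be removed.
    synchronized boolean recordFailure(String node) {
        Integer previous = failures.get(node);
        int count = (previous == null ? 0 : previous) + 1;
        failures.put(node, count);
        return count >= maxSuccessiveFailures;
    }

    // A successful invocation resets the streak for that node.
    synchronized void recordSuccess(String node) {
        failures.remove(node);
    }
}
```

With a threshold of 3, two failures followed by a success leave the proxy in place, and only three failures in a row trigger removal.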









Right now, if an index is removed, its shards are removed immediately from zookeeper (first from the master, then from the nodes) and then from the content server (on each node). The removal from the content server produces exceptions if search operations on these shards are running.

One way to avoid these exceptions would be to delay the physical deletion. The remaining queries should finish quickly, and no new search operations should be scheduled because of the virtual deletion.










Querying an index while it is undeployed resulted in the following exception:
{noformat}
11/02/01 19:51:16,911 ERROR client.NodeInteraction:166 - Error calling public abstract net.sf.katta.lib.lucene.DocumentFrequencyWritable net.sf.katta.lib.lucene.ILuceneServer.getDocFreqs(net.sf.katta.lib.lucene.QueryWritable,java.lang.String[]) throws java.io.IOException on eagle.local:20000 (try # 1 of 3) (id=0)
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at net.sf.katta.client.NodeInteraction.run(NodeInteraction.java:135)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)










Did a test where I deploy-undeploy-search in a loop. After running this for a while, the zk-filesystem looks like:

{noformat}

'-+shard-to-nodes

'-+index655#dIndex

'-+index0#bIndex

'-+eagle.local:20000

'-+index1013#bIndex

'-+index769#dIndex

'-+index735#dIndex

'-+index1090#cIndex

'-+index1135#dIndex

'-+index499#dIndex

'-+index1087#cIndex

'-+index849#cIndex

'-+index1010#dIndex

...

{noformat}



So sometime the remove seems to work -










Adding and removing an index can lead to the following exception in IndexDeployFuture:
{noformat}
11/02/01 22:01:36,617 WARN zkclient.ZkEventThread:78 - Error handling event ZkEvent[Data of /zk_testsystem/indicies/indexA changed sent to ZkDataListenerAdapter:/zk_testsystem/indicies/indexA]
java.lang.IllegalStateException
at net.sf.katta.client.IndexDeployFuture.handleDataDeleted(IndexDeployFuture.java:88)
at net.sf.katta.protocol.InteractionProtocol$ZkDataListenerAdapter.handleDataDeleted(InteractionProtocol.java:615)
at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:549)
at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:72)
{noformat}

Additionally, the IndexDeployFuture as a component does not seem to unregister itself in some cases.










1st Mail from Murali Krishna:

{noformat}

Hi,

I have a question on zkclient and deployclient's lifetime. Can we use the same deployClient throughout the process without recreating the interaction protocol or deployclient at the client side? Can this still work even if Zookeeper processes or katta nodes get restarted in between?

In our cluster, we keep deploying indices every 10 minutes with the same deploy client and we are seeing a problem where after a day or so, the deploy operation gets stuck. Essentially, the IIndexDeployFuture.getState() always returns IndexState.DEPLOYING. I am just wondering whether this is related to some problem with Deployclient and whether we should reinitialize zkclient, interactionprotocol and deployclient every time?

My code snippet to keep the deployClient in memory is as below.
ZkConfiguration zkConf = new ZkConfiguration();
ZkClient zkClient = new ZkClient(zkConf.getZKServers());
InteractionProtocol protocol = new InteractionProtocol(zkClient, zkConf);
deployClient = new DeployClient(protocol);

{noformat}

Currently LuceneServer uses a fixed-size thread pool of 100 threads for processing incoming search calls. Once KATTA-174 is done, this should be made configurable so people can tune it to their needs.
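A minimal sketch of how the pool size could be read from a properties object is shown below. It is not Katta code: the class `SearchPoolConfig` and the property key are hypothetical, and only the default of 100 comes from the description above.

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical configuration helper for the search thread pool.
class SearchPoolConfig {
    // Assumed key name, chosen for this sketch only.
    static final String POOL_SIZE_KEY = "lucene.searcher.threadpool.size";

    // Falls back to the current fixed size of 100 when the key is not set.
    static int poolSize(Properties props) {
        return Integer.parseInt(props.getProperty(POOL_SIZE_KEY, "100").trim());
    }

    static ExecutorService createPool(Properties props) {
        return Executors.newFixedThreadPool(poolSize(props));
    }
}
```

Keeping the old value as the default means existing deployments behave the same until the property is set.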





Needed by KATTA-171. This would allow storing custom properties for an implementation of IContentServer (e.g. LuceneServer) inside the katta.node.properties.






from 0.20.1

There are several RPC bugs fixed which could be relevant for katta:
http://hadoop.apache.org/common/docs/r0.20.2/releasenotes.html

from 3.0.0

From Hongchao, dev-list:

{noformat}

In katta-0.6.3, the LuceneClient class allows us to set up a timeout value for the search. We tried that function and found some interesting things. We did see the LuceneClient interrupted the slow queries although the actual used time might be a little longer than the timeout value. This is understandable. However, from what we saw, we suspect that the involved katta nodes still continue to work on the slow queries even after the LuceneClient times out the search. Could you please help us make sure whether it is the case? If it is, do you think there is an easy way to ask katta nodes to stop the work related to the slow query right, too, after the LuceneClient timed out the query?

{noformat}



Part of the response from Johannes:

{noformat}

So basically you want that when the timeout happens, not only the querying threads on the client but also the threads on the lucene-nodes stop their work, right?










RHEL 5, 0.6.2 katta, 2.9.2 lucene

I am using katta 0.6.2. The query results are not as expected when I query for "NOT range". For example, if I have a multi-valued indexed field and I query for NOT of a range, it still returns the doc. The issue happens with Lucene as well when we use MultiSearcher. The query rewrite changes [a - b*] to something like (a - b1 OR a - b2), where b1 is in index1 and b2 is in index2. It should be 'AND' in the case of a negative query.

MultiSearcher's rewritten query is wrong.

Katta also seems to inherit the bug from Lucene's Query combine method mentioned at https://issues.apache.org/jira/browse/LUCENE-2756.



test scenario:

index1: Has 1 shard which contains the document

A

index2: Has 2 shards, one of which has the

document A

index3 : Has 1 shard which doesn't contain A

index4 : same shard as index 1



for the query "B:0 NOT B:[1 TO 5]" :

index1 did not return A (correct).

index2 returned it (wrong)

index3 didn't return (correct)

index1 & index2 returns A twice (wrong)










The timeout code in WorkQueue.java (I removed some log lines from the code below for clarity) seems to have an issue if the waitTime is exactly 0. Looks like it will return the results without closing them if waitTime is exactly 0.

/**
 * Use a user-provided policy to decide how long to wait for and whether to
 * terminate the call.
 *
 * @param policy
 *          How to decide when to return and to terminate the call.
 * @return the results, which may or may not be complete and/or closed.
 */
public ClientResult getResults(IResultPolicy policy) {
  int callId = callCounter++;
  long start = 0;
  long waitTime = 0;
  while (true) {
    synchronized (results) {
      // Need to stay synchronized before waitTime() through wait() or we will










Presumably it's different from the similar-

sounding issues fixed in 0.6.3 because at least

some of those circumstances we think we've

been able to clearly confirm the 0.6.3 fixes

working... Attached is a force-jstack (normal

jstack fails due to deadlock), here's all I have from

the log:

2010-12-02 01:06:40,453 INFO net.sf.katta.operation.node.AbstractShardOperation:55 - redeploy shard 'index8#shard_0'
2010-12-02 01:06:40,455 INFO net.sf.katta.operation.node.AbstractShardOperation:75 - publish shard 'index8#shard_0'
2010-12-02 01:06:40,453 INFO org.apache.zookeeper.ClientCnxn:1157 - Client session timed out, have not heard from server in 45820ms for sessionid 0x32c9d05a0c704ce, closing socket connection and attempting reconnect
2010-12-02 01:06:40,558 INFO org.I0Itec.zkclient.ZkClient:449 - zookeeper state changed (Disconnected)
2010-12-02 01:07:10,971 INFO org.apache.zookeeper.ClientCnxn:1041 - Opening socket connection to server quorum2/10.1.1.2:2281
2010-12-02 01:07:10,973 INFO org.apache.zookeeper.ClientCnxn:1157 - Client session timed out, have not heard from server in 30414ms for sessionid 0x32c9d05a0c704ce, closing sock










x86_64, jdk 1.6r11, 16G heap limit, 23 nodes, iCMS GC with parallel young generation collection and an iot of 60%

During sustained operation with unknown cause (relatively low query load, hard limit on concurrent deployments managed centrally) katta runs out of memory. The retained size to garbage ratio varies pretty wildly and we collect more aggressively than default so it seems unlikely to be a config problem. We don't know if it's due to a leak or due to overtasking - if the latter, it would be nice to have katta limit the inbound requests according to current circumstances rather than just die. We're posting heap dumps for analysis but it'll be a few hours before they arrive (lot of heap).

As of 0.6.2 you can store indices with > 2^31 nDocs and katta will let you submit a search for them, but several core places in luceneclient still use the primitive int type (signed in Java, so limited to 2^31 - 1) to store and return counts, so the search results are incorrect and the API doesn't allow applications to even count them correctly by retyping: the client's .search, the hits list, .count, etc. For example, the docId field of the Hit object is an int; in creating a list of them, we sort calling .equals, and .equals relies on _docId, so we must have collisions whenever there are more than 2^31. For count, we often find negative results (overflow) with a large enough range.
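The negative counts mentioned above follow directly from Java's int arithmetic. The sketch below is not Katta code; `HitCountOverflow` and the per-shard counts are hypothetical. It sums the same counts once as int and once as long.

```java
// Hypothetical illustration of summing per-shard hit counts.
class HitCountOverflow {
    static int sumAsInt(int[] shardCounts) {
        int total = 0;
        for (int count : shardCounts) {
            total += count; // silently wraps around past Integer.MAX_VALUE
        }
        return total;
    }

    static long sumAsLong(int[] shardCounts) {
        long total = 0L;
        for (int count : shardCounts) {
            total += count; // widened to long, no wrap-around for these sizes
        }
        return total;
    }
}
```

For two shards with 1500000000 hits each, the int sum is -1294967296, while the long sum is the expected 3000000000.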



Possible solutions:

A) Make all such areas Generics, so folks can

choose their precision and we can avoid breaking

compatibility with existing apps using

luceneclient.

B) Add new implementations of the existing

methods where everything is a long instead of an

int (push the limit out to something beyond what

katta could otherwise handle). Compatible but

ugly to maintain.

C) Change both the private implementations (like

hit-equals) and the exposed interfaces (like

.search) from int to long. Easiest but makes it


We experience it while testing even with 0.6.2 which is getting in the way of our testing of the throttle and of the trunk version (we switched back to 0.6.2 to see if the errors encountered were unrelated to the katta changes):
{noformat}
2010-11-19 04:27:59,219 ERROR net.sf.katta.operation.node.AbstractShardOperation:59 - failed to deploy shard 'ipovw_031_101119042717#shard_1' on node 'srch02-lab:20000'
net.sf.katta.util.KattaException: Can not load shard: hdfs://slnamenode:9000/data/katta-deploy/ipovwtest/release/031_101119042717/shard_1
at net.sf.katta.node.ShardManager.installShard(ShardManager.java:144)
at net.sf.katta.node.ShardManager.installShard(ShardManager.java:66)
at net.sf.katta.operation.node.ShardDeployOperation.execute(ShardDeployOperation.java:36)
at net.sf.katta.operation.node.AbstractShardOperation.execute(AbstractShardOperation.java:56)
at net.sf.katta.operation.node.AbstractShardOperation.execute(AbstractShardOperation.java:27)
at










Ubuntu

While building the latest snapshot (katta-HEAD-02055b0) a unit test fails
"
[junit] Running net.sf.katta.node.ShardManagerTest
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.857 sec

BUILD FAILED
/home/patrick/Downloads/katta-HEAD-02055b0/src/build/ant/common-build.xml:99: Tests failed

Total time: 1 minute 38 seconds
"



Based on KATTA-161. It seems that some unexpected exceptions are getting swallowed and the threads silently stop working.










Currently LuceneClient and LuceneServer are not structured in such a way that they can be extended easily. For instance, LuceneClient declares its kattaClient instance variable as private and provides no accessor. To extend that class, the extender would have to create a separate Katta Client used solely within the subclass, which seems wasteful (when the parent has a perfectly good one).

As a use case, our company would love to use Katta because our Lucene indexes are quite large (combined they are well over 1TB), and the management of our current Solr deployment is becoming overwhelming (replication and sharding strategies are quite fixed in Solr once established). Unfortunately, we require faceting functionality, which is provided by Solr but not by Katta. Most of our faceting requirements are quite simple, so I'd love to extend LuceneClient to provide this facility. As stated above though, the current implementation of LuceneClient and LuceneServer makes this unnecessarily difficult.










Debian sid, Sun jdk 1.6.0u11

Over the course of extended runtime, for unknown reasons, the thread that takes instructions from the master appears to stop responding on one or more katta nodes. The evidence is the logs going really quiet: normally we'd have a variety of shard deployments, undeployments, etc. showing up in the node log several times per minute at least. They are almost always:
INFO net.sf.katta.node.Node
INFO net.sf.katta.operation.node.AbstractShardOperation
INFO net.sf.katta.node.ShardManager
INFO net.sf.katta.lib.lucene.LuceneServer
INFO net.sf.katta.operation.node.AbstractShardOperation

When this problem occurs, however, we see NO info-level messages from any of those classes for hours or even days on end (while all other nodes are exhibiting the normal behavior), but what we do start seeing is this sort of warning, always clustered together and during the time that the node doesn't receive updates:
2010-10-14 13:50:08,543 WARN org.apache.hadoop.ipc.Server:662 - IPC Server Responder, call getDetails([Ljava.lang.String;@4def22a3, 9) from 1.1.1.1:44457: output error

2010-10-14 13:51:08,544 WARN










Any

We ran into a problem recently. One of our katta nodes got disconnected for some reason and, consequently, a lot of rebalance operations were triggered in katta to replicate the missing shards. The indexes were originally deployed from HDFS, and some of these indexes had been deleted from HDFS by the time of the rebalancing. In this situation, the attempts to replicate these missing shards would fail since there were no copies of these shards in HDFS any more. The problem is that katta would never stop these rebalance attempts. As soon as its first rebalance attempt failed, katta immediately submitted another identical attempt. Thus, tons of such rebalance requests were queued up in Zookeeper and blocked any other normal index deployments.

So, I am wondering whether katta could make decisions based on the reasons for the rebalance failures. For example, if the failure is caused by a malfunctioning file system, such as HDFS not being responsive, katta could try the rebalance again later. However, if the reason is that the index files could not be found but the file system is healthy, katta should give up the rebalance efforts since the effort has no way to succeed.

JDK 1.6_21, JDK 1.5

The LuceneServerTest.java does not compile. The following error message occurs for lines 45 and 63:



{noformat}

qualified new of static class

{noformat}



The Katta command line tool provides the option

{{listNodes}}, but does not provide an option to

remove nodes which are not used anymore (i. e.

test nodes). One can remove these nodes with

the Zookeeper command line tool. But it would

be useful to have such an option with the Katta

command line tool.




There is no way to set the broadcast timeout in

LuceneClient. The default timeout of 12 seconds

is sometimes too small.

If a lot of different indices with different names

are deployed, but the search should be limited to

a subset of these indices only, all the subset index

names have to be given manually. It would be

easier if an index pattern could be used instead.
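
A minimal sketch of what such a pattern-based selection could look like, assuming the caller already has the list of deployed index names. `IndexPatternFilter` and `selectIndexes` are hypothetical names for illustration and not part of the Katta API; the pattern syntax here is plain `java.util.regex`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class IndexPatternFilter {

    // Returns the deployed index names that fully match the regular expression.
    static List<String> selectIndexes(List<String> deployedIndexes, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> selected = new ArrayList<String>();
        for (String name : deployedIndexes) {
            if (pattern.matcher(name).matches()) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> deployed = Arrays.asList("logs-2010-09", "logs-2010-10", "users");
        // Only the two "logs-" indexes are selected for the search.
        System.out.println(selectIndexes(deployed, "logs-.*"));
    }
}
```

The selected names could then be passed to the search call in place of a manually maintained list.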



The retrieval of many hit details is very slow
because they are requested individually. It would be
nice if they were requested in batches per

shard. That would improve the retrieval

performance dramatically.
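
The batching idea can be sketched as a grouping step: collect the document ids per shard first, then issue one request per shard. This is only an illustration; `HitRef` and `groupByShard` are hypothetical stand-ins, not Katta classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HitBatcher {

    // Hypothetical stand-in for a hit: the shard that produced it and its document id.
    static final class HitRef {
        final String shard;
        final int docId;

        HitRef(String shard, int docId) {
            this.shard = shard;
            this.docId = docId;
        }
    }

    // Groups document ids by shard so each shard can be asked once for all of its hits.
    static Map<String, List<Integer>> groupByShard(List<HitRef> hits) {
        Map<String, List<Integer>> docIdsByShard = new LinkedHashMap<String, List<Integer>>();
        for (HitRef hit : hits) {
            List<Integer> docIds = docIdsByShard.get(hit.shard);
            if (docIds == null) {
                docIds = new ArrayList<Integer>();
                docIdsByShard.put(hit.shard, docIds);
            }
            docIds.add(hit.docId);
        }
        return docIdsByShard;
    }

    public static void main(String[] args) {
        List<HitRef> hits = Arrays.asList(
            new HitRef("shardA", 1), new HitRef("shardB", 7), new HitRef("shardA", 3));
        // Two batches instead of three individual requests.
        System.out.println(groupByShard(hits));
    }
}
```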

In @HitsMapWritable.readFields@ the number of

hits to read is known, but the dimension of the

@_hits List@ is not adapted. If many hits are read

by a client, the @ArrayList@ will be expanded

multiple times. That decreases the performance.
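
A small sketch of the suggested change, assuming a count-prefixed wire layout similar to what the report describes. `PresizedReadExample` and its methods are illustrative only and do not reproduce the real `HitsMapWritable` code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PresizedReadExample {

    // Reads a hit count followed by that many ints; the list is sized once from the count.
    static List<Integer> readHits(DataInput in) throws IOException {
        int count = in.readInt();
        ArrayList<Integer> hits = new ArrayList<Integer>(count);
        for (int i = 0; i < count; i++) {
            hits.add(in.readInt());
        }
        return hits;
    }

    // Serializes the values in the same count-prefixed layout and reads them back.
    static List<Integer> roundTrip(int... values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(values.length);
            for (int value : values) {
                out.writeInt(value);
            }
            out.flush();
            return readHits(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(7, 8, 9));
    }
}
```

If the list already exists as a field and is an `ArrayList`, calling `ensureCapacity(count)` before the read loop has the same effect.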



{{LuceneServer}} loads the full document, even if

only a few fields are requested during a

{{getDetails()}} call. That is a performance issue if

a lot of document fields are indexed and stored

or if some fields are quite large and/or should be

lazy loaded.



The {{net.sf.katta.lib.lucene.LuceneServer}} class
should be modified so that sub-classing is easier and
code reuse is higher. The motivation is to make an
embedded Solr server as a sub-class of
{{LuceneServer}} to accept both Solr and Lucene
queries. I'll post an updated patch of
[SOLR-1395|https://issues.apache.org/jira/browse/SOLR-1395] too.

Update to the current version (0.21.0) of Hadoop.

This version is still an RC version, but Katta should

be adapted anyway.

Add a port parameter "-p" to the katta startNode

argument to run Katta nodes at different ports

locally. That makes debugging easier.








LuceneServer synchronizes on a

ConcurrentHashMap and queries

indexSearcher.maxDoc(), but does not use

_maxDoc. That can be avoided.

If KATTA_LOG_LEVEL=Debug is set, the master

fails to deploy indices with the following

exception:



ERROR 2010-08-13 12:58:09,006 [OperatorThread] net.sf.katta.operation.master.AbstractIndexOperation - failed to deploy index sen-00002
java.util.IllegalFormatConversionException: d != java.util.concurrent.atomic.AtomicInteger
at java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:3999)
at java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2709)
at java.util.Formatter$FormatSpecifier.print(Formatter.java:2661)
at java.util.Formatter.format(Formatter.java:2433)
at java.util.Formatter.format(Formatter.java:2367)
at java.lang.String.format(String.java:2769)
at net.sf.katta.master.LowestShardCountDistributionPolicy.chooseNewNodes(LowestShardCountDistributionPolicy.java:200)

at

net.sf.katta.master.LowestShardCountDistributio

nPolicy.createDistributionPlan(LowestShardCount
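
The exception message (`d != java.util.concurrent.atomic.AtomicInteger`) indicates that an `AtomicInteger` was passed directly to a `%d` conversion. The sketch below reproduces that behaviour in isolation and shows the usual remedy of unwrapping the value with `get()`. The class and method names are made up for the illustration; the actual format string in `LowestShardCountDistributionPolicy` is not shown in this report.

```java
import java.util.IllegalFormatConversionException;
import java.util.concurrent.atomic.AtomicInteger;

public class FormatAtomicInteger {

    // Unwraps the counter first; %d accepts the resulting (boxed) int.
    static String describe(AtomicInteger shardCount) {
        return String.format("shards: %d", shardCount.get());
    }

    // Returns true if passing the AtomicInteger itself to %d is rejected.
    static boolean rawFormatFails(AtomicInteger shardCount) {
        try {
            String.format("shards: %d", shardCount);
            return false;
        } catch (IllegalFormatConversionException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        AtomicInteger shardCount = new AtomicInteger(3);
        System.out.println(describe(shardCount));
        System.out.println("raw AtomicInteger rejected: " + rawFormatFails(shardCount));
    }
}
```

If the formatting call sits inside a debug-only log statement, that would explain why the failure shows up only with the debug log level, though the report does not confirm this.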

see http://github.com/sgroschupf/zkclient/issues/unreads#issue/11










linux We created LuceneClient instances on different machines (katta nodes).
I saw a java.util.ConcurrentModificationException a couple of times. Checking
the log, I found that multiple LuceneClients were being created from
different machines at the time the exception was thrown.



java.util.ConcurrentModificationException
at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
at java.util.AbstractList$Itr.next(AbstractList.java:343)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1028)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:979)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:404)
at net.sf.katta.client.Client.<init>(Client.java:102)
at net.sf.katta.client.Client.<init>(Client.java:93)
at net.sf.katta.client.Client.<init>(Client.java:88)
at net.sf.katta.lib.lucene.LuceneClient.<init>(LuceneClient.java:78)

at










linux If a LuceneClient is created at the time of dropping an index (or at
the time of adding an index), we get a java.lang.NullPointerException:

java.lang.NullPointerException
at net.sf.katta.client.Client.isIndexSearchable(Client.java:247)
at net.sf.katta.client.Client.addOrWatchNewIndexes(Client.java:180)
at net.sf.katta.client.Client.<init>(Client.java:122)
at net.sf.katta.client.Client.<init>(Client.java:87)
at net.sf.katta.client.Client.<init>(Client.java:82)
at net.sf.katta.lib.lucene.LuceneClient.<init>(LuceneClient.java:73)
at com.mcafee.titan.search.LuceneClientFactory.makeObject(LuceneClientFactory.java:36)

Line 36 of LuceneClientFactory.java is:
return new LuceneClient(new DefaultNodeSelectionPolicy(), zkConf);
where zkConf is an object of ZkConfiguration.










Administrator@Rafia /cygdrive/d/katta/katta-core-0.6.1
$ ant compile
Buildfile: D:\katta\katta-core-0.6.1\build.xml

compile:

check-ivy-available:

download-ivy:

install-ivy:
[echo] Ivy path

resolve:
[ivy:resolve] :: Ivy 2.0.0 - 20090108225011 :: http://ant.apache.org/ivy/ ::
:: loading settings :: file = D:\katta\katta-core-0.6.1\ivysettings.xml
[ivy:resolve] DEPRECATED: useOrigin option is deprecated when calling resolve, use useOrigin setting on the cache implementation instead
[ivy:resolve] :: resolving dependencies :: 101tec#katta;working@Rafia
[ivy:resolve] confs: [ant, eclipse, compile, test, instrument, checkstyle]
[ivy:resolve] found zkclient#zkclient;0.1.0 in libraries
[ivy:resolve] found zookeeper#zookeeper;3.2.2 in










all Repro: run "ant jar" from katta/extras/indexing

Fix: see attached (update cobertura.jar to 1.9.3





rodo@rodimus:~/patch/katta/extras/indexing$

ant jar



...



[ivy:resolve] :::: WARNINGS
[ivy:resolve] module not found: net.sourceforge.cobertura#cobertura;1.9.1
[ivy:resolve] ==== libraries: tried
[ivy:resolve] -- artifact net.sourceforge.cobertura#cobertura;1.9.1!cobertura.jar:
[ivy:resolve] /home/rodo/patch/katta/extras/indexing/../..//lib/cobertura-1.9.1.jar
[ivy:resolve] /home/rodo/patch/katta/extras/indexing/lib/cobertura-1.9.1.jar
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] :: UNRESOLVED DEPENDENCIES ::
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] :: net.sourceforge.cobertura#cobertura;1.9.1: not found

[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::

[ivy:resolve]

[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE










Hongchao reported

{noformat}

However, if we reuse the LuceneClient for queries,

as the time passed

(mostly we deploy/remove indexes from katta),

the memory utilized by

the LuceneClient is getting bigger and bigger and

can not be released.

It seems there is a strong tie between the

memory size and the number

of index deployment/removal during the life time

of the LuceneClient.

When the utilized memory reaches some level,

the LuceneClient gets the

following errors consistently:



2010-07-07 11:13:05,181 - org.I0Itec.zkclient.ZkEventThread - WARN - Error handling event ZkEvent[Children of /katta/indicies changed sent to net.sf.katta.protocol.InteractionProtocol$AddRemoveListenerAdapter@2c70f837]
java.lang.NullPointerException
at net.sf.katta.client.Client.removeIndex(Client.java:172)
at net.sf.katta.client.Client$1.removed(Client.java:108)

at










from Jack Key:

{noformat}

Hello, Everyone.



QUESTION 1:

Will there be an update to the "katta-images" S3

bucket that includes an AMI for Katta 0.6.1?

Unfortunately, the only publicly available image

in the S3 Bucket "katta-images" is ami-b3a84fda

(katta-images/katta-0.4.0-i386.manifest.xml).



QUESTION 2:

Has anyone had success building an AMI with

Katta 0.6.1?



I tried using the "create-image" script as shown in
http://katta.sourceforge.net/documentation/running-katta-on-ec2.
But it seems to produce AMIs that do not work
with "bin/katta-ec2 launch-cluster".

The launch spins up instances, the "katta-ec2

login" logs me into the master, but the master

seems unaware of any Nodes,

and there is no output (error messages or

otherwise) when i run katta listNodes.



INPUTS

in my katta-ec2-env.sh, I used
KATTA_VERSION=core-0.6.1 in the katta-ec2-env.sh script.










linux jdk 1.6.0_17

When we ran a stress test on katta, I got some
inconsistent errors about "can not reach xxxx
shard" and found the following logs inside:



--------------------------------------------------------------------

--------------------------------------------------------------------

--------------------------------------

2010-06-25 19:33:04,524 -

net.sf.katta.client.NodeInteraction - ERROR



{noformat}

Error calling public abstract net.sf.katta.lib.lucene.HitsMapWritable net.sf.katta.lib.lucene.ILuceneServer.search(net.sf.katta.lib.lucene.QueryWritable,net.sf.katta.lib.lucene.DocumentFrequencyWritable,java.lang.String[],int,net.sf.katta.lib.lucene.SortWritable) throws java.io.IOException on haystack008.scur.colo:20000 (try # 1 of 3) (id=2264969)
java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597










from Larry Lui:

{noformat}

2010-06-21 16:16:38,492 WARN org.I0Itec.zkclient.ZkEventThread:78 - Error handling event ZkEvent[State changed to SyncConnected sent to net.sf.katta.protocol.InteractionProtocol$1@523b2208]
org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /var/katta/work/node-queues/iguana:20000
at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68)
at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685)
at org.I0Itec.zkclient.ZkClient.delete(ZkClient.java:716)
at org.I0Itec.zkclient.ZkClient.deleteRecursive(ZkClient.java:516)
at net.sf.katta.protocol.InteractionProtocol.publishNode(InteractionProtocol.java:366)
at net.sf.katta.node.Node.init(Node.java:106)
at net.sf.katta.node.Node.reconnect(Node.java:120)










When a relatively large (appx. one third) number

of nodes go offline, are rebuilt, and then brought

back online with no data, Katta can sometimes

get stuck rebalancing the cluster. It attempts to

replicate the underreplicated shards -- we had

confirmed that, according to Katta anyway, we

had at least one copy of all shards -- and gets to

about 90% completion before it just... stops. The

cluster still seems to respond to search requests

just fine, but you can't perform any modifications

(removing indexes, adding new indexes). We see

messages like this in our master log files from

around the time that the problem started:



2010-06-09 02:40:40,332 INFO net.sf.katta.master.OperatorThread:125 - skipping operation 'BalanceIndexOperation:6d82733b:index_name_here'



No warnings or errors that I can see.



To troubleshoot, I first restarted the standalone

Zookeeper nodes one at a time. This had no

effect. Then I restarted the Katta nodes (only the

nodes -- the masters I left unchanged), also one

at a time. After restarting all the Katta nodes,

suddenly the cluster started working again.

Initially we got a large-ish number of indexes in

the ERROR state (about 88 out of 432), which was










Debian Linux unstable

See Debian Bug #584378
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=584378

> find src/main/java -name *.java -and -type f -print0 | xargs -0 /usr/lib/jvm/default-java/bin/javac -cp /usr/share/java/log4j-1.2.jar:/usr/share/java/zookeeper.jar:debian/_jh_build.zkclient -d debian/_jh_build.zkclient
> src/main/java/org/I0Itec/zkclient/ZkServer.java:127: cannot find symbol
> symbol : constructor Factory(int)
> location: class org.apache.zookeeper.server.NIOServerCnxn.Factory
> _nioFactory = new NIOServerCnxn.Factory(port);
> ^
> Note: src/main/java/org/I0Itec/zkclient/ZkClient.java uses unchecked or unsafe operations.
> Note: Recompile with -Xlint:unchecked for details.

The full build log is available from:
http://people.debian.org/~lucas/logs/2010/06/02/zkclient_0.1.0+dfsg1-1_lsid64.buildlog










From Hongchao Li:



{noformat}

I create an object of

net.sf.katta.lib.lucene.LuceneClient in my query

client and serve external queries with this object..

Also, I

periodically remove/add indexes when the query

client is in service.

Sometime, I got the following exceptions:



InteractionProtocol$AddRemoveListenerAdapter@7ae35bb7]
java.lang.NullPointerException
at net.sf.katta.client.Client.removeIndex(Client.java:172)
at net.sf.katta.client.Client$1.removed(Client.java:108)
at net.sf.katta.protocol.InteractionProtocol$AddRemoveListenerAdapter.handleChildChange(InteractionProtocol.java:529)
at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:568)
at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:72)

10/04/26 13:04:26 WARN










Periodically we get the following error, which is

roughly contemporaneous with the entire cluster

becoming unresponsive:



2010-05-04 15:00:01,121 ERROR net.sf.katta.operation.master.AbstractIndexOperation:59 - failed to deploy balance [index name]
java.lang.NullPointerException
at net.sf.katta.operation.master.AbstractIndexOperation.addRunningDeployments(AbstractIndexOperation.java:108)
at net.sf.katta.operation.master.AbstractIndexOperation.distributeIndexShards(AbstractIndexOperation.java:65)
at net.sf.katta.operation.master.BalanceIndexOperation.execute(BalanceIndexOperation.java:54)
at net.sf.katta.master.OperatorThread.executeOperation(OperatorThread.java:121)
at net.sf.katta.master.OperatorThread.run(OperatorThread.java:80)



We then get one of these about every 8 seconds
until the cluster is restarted. There are no errors
or warnings in the log file before this that I can
see, and several index balance operations before it all
succeed.










all Repro: run "ant jar" from katta/extras/indexing

Fix: see attached (update cobertura.jar to 1.9.3





rodo@rodimus:~/patch/katta/extras/indexing$

ant jar



...



[ivy:resolve] :::: WARNINGS
[ivy:resolve] module not found: net.sourceforge.cobertura#cobertura;1.9.1
[ivy:resolve] ==== libraries: tried
[ivy:resolve] -- artifact net.sourceforge.cobertura#cobertura;1.9.1!cobertura.jar:
[ivy:resolve] /home/rodo/patch/katta/extras/indexing/../..//lib/cobertura-1.9.1.jar
[ivy:resolve] /home/rodo/patch/katta/extras/indexing/lib/cobertura-1.9.1.jar
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] :: UNRESOLVED DEPENDENCIES ::
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] :: net.sourceforge.cobertura#cobertura;1.9.1: not found

[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::

[ivy:resolve]

[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE

We use other applications that use zookeeper as
well, that run on 3.3.0 (along with katta). Having
katta upgraded to 3.3.0 will be useful to maintain
a single zk quorum and library integration.










Hongchao reported:

{noformat}

we have a huge application including about 7

billion documents. It is

indexed it into 720 shards, i.e. about 10 million

documents per shard,

and the total size is about 0.5 tera bytes. Then,

we divide these

shards into 120 indexes with 6 shards per index

and deploy them into

our katta cluster as 120 instances with replication

2. our katta

cluster currently has 22 nodes and each node has

16 CPUs. When we

submit a query on more than 37 instances at the

same time. we got the

following exception:





10/03/26 13:53:29 ERROR client.NodeInteraction:166 - Error calling public abstract net.sf.katta.lib.lucene.HitsMapWritable net.sf.katta.lib.lucene.ILuceneServer.search(net.sf.katta.lib.lucene.QueryWritable,net.sf.katta.lib.lucene.DocumentFrequencyWritable,java.lang.String[],int) throws java.io.IOException on haystack016.scur.colo:20000 (try # 1 of 3) (id=21)
java.lang.reflect.InvocationTargetException










Reported by Hongchao:

{noformat}

Katta works well on sorting at most time.

However, it fails in one

case: when we do a query and some searched

records do not have the

fields to be sorted on or the fields are empty,

katta reports errors

like 'can not reach shard ...'. If some result

records miss the sorted

fields, a graceful way is to put at the very

beginning or the very end

of the result list. It should not report errors.

{noformat}



I think a good way to solve this is to first find out
how lucene handles those cases.

RHEL Linux 5; Java 1.6.0

When I try to deploy a solr shard from hdfs, it
fails because the unzip routine does not create
the directories before it unzips the file.

My fix is to add the following check into

net.sf.katta.util.FileUtil unzip method:

{noformat}

if (!targetFile.getParentFile().exists()) {

targetFile.getParentFile().mkdirs();

}

{noformat}
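
For context, a self-contained sketch of an unzip routine that applies this check, written against `java.util.zip` only. It is not the actual `net.sf.katta.util.FileUtil` code; `UnzipSketch`, `unzip` and `selfCheck` are names chosen for the illustration, and the guard against entries escaping the target directory is an addition that the report does not mention.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class UnzipSketch {

    // Extracts all entries below targetDir, creating missing parent directories first.
    static void unzip(InputStream zipped, File targetDir) throws IOException {
        String root = targetDir.getCanonicalPath() + File.separator;
        byte[] buffer = new byte[8192];
        try (ZipInputStream in = new ZipInputStream(zipped)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                File targetFile = new File(targetDir, entry.getName());
                // Refuse entries such as "../x" that would land outside targetDir.
                if (!targetFile.getCanonicalPath().startsWith(root)) {
                    throw new IOException("entry outside target dir: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    targetFile.mkdirs();
                    continue;
                }
                // The check proposed in the report: make sure the parent directory exists.
                File parent = targetFile.getParentFile();
                if (!parent.exists() && !parent.mkdirs()) {
                    throw new IOException("could not create " + parent);
                }
                try (OutputStream out = new FileOutputStream(targetFile)) {
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                }
            }
        }
    }

    // Zips one nested entry in memory, unzips it into a temp dir, and checks the result.
    static boolean selfCheck() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
                zip.putNextEntry(new ZipEntry("shard0/conf/schema.xml"));
                zip.write("hello".getBytes("UTF-8"));
                zip.closeEntry();
            }
            File targetDir = Files.createTempDirectory("unzip-sketch").toFile();
            unzip(new ByteArrayInputStream(bytes.toByteArray()), targetDir);
            File extracted = new File(targetDir, "shard0/conf/schema.xml");
            return extracted.isFile() && extracted.length() == 5;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("extracted nested entry: " + selfCheck());
    }
}
```

The archive in `selfCheck` deliberately contains no explicit directory entries, which is the situation described above: without the parent-directory check, opening the output file would fail.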





* When we deploy a new index (an optimized index) in hdfs and ask katta to
deploy it, as things currently stand we can only deploy a new index with a
new set of shards.

It would be useful to add a newly deployed lucene index (optimized) as a
shard to an existing katta index, so that it is entirely transparent to the
client while searching the index.

Would this involve changing some zk mapping etc.?








The following line in Node.java only opens one
RPC Server Thread per node, thus only one search
call will be executed per node at a time and all
the other requests queued...

_rpcServer = RPC.getServer(nodeManaged, "0.0.0.0", serverPort, new Configuration());

It should be changed to

int numthreads = 10;
_rpcServer = RPC.getServer(nodeManaged, "0.0.0.0", serverPort, numthreads, false, new Configuration());



and be configurable by the user.
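
One way the handler count could be made configurable is sketched below, assuming the node already loads a `java.util.Properties` object. The property key `node.rpc.handler-count` and the class `RpcHandlerConfig` are hypothetical and do not exist in Katta; the default of 10 is taken from the suggestion above.

```java
import java.util.Properties;

public class RpcHandlerConfig {

    // Hypothetical property key, chosen for this sketch only.
    static final String HANDLER_COUNT_KEY = "node.rpc.handler-count";
    static final int DEFAULT_HANDLER_COUNT = 10;

    // Returns the configured handler count, or the default when the key is absent or blank.
    static int handlerCount(Properties nodeProperties) {
        String raw = nodeProperties.getProperty(HANDLER_COUNT_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_HANDLER_COUNT;
        }
        int count = Integer.parseInt(raw.trim());
        if (count < 1) {
            throw new IllegalArgumentException(HANDLER_COUNT_KEY + " must be >= 1 but was " + count);
        }
        return count;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(handlerCount(props));
        props.setProperty(HANDLER_COUNT_KEY, "4");
        System.out.println(handlerCount(props));
    }
}
```

The returned value would take the place of the hard-coded `numthreads` in the `RPC.getServer` call quoted above.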



Reverse deploy: copy a valid index from the katta system (plus all its
shards) back to an hdfs uri (assuming the original hdfs uri backing the
shards has been deleted entirely, which is a possibility).

This is very useful for examining deployed indices separately.

Once in hdfs, the index can be copied using hdfs.copyToLocal and examined
using luke and others.

Once this is fully functional, it can be used to pull many such indices to
hdfs and then to the local disk of a box, where they can be merged and
deployed again. But this would be a useful stepping stone towards that.



First-cut patch in place - please review.



todo: more unit testing to be done on the

PriorityQueue (from Lucene) and some other

genericity fixes.






As recommended, I'm now running zookeeper in standalone mode.

I just got the following exception (0.6.1), and the master becomes
unresponsive afterwards, so I have to kill the master process.
listIndices will still work, but removing an index or adding an index
will not.









Probably an easy fix (just use a hashmap which is

threadsafe)



java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
at java.util.HashMap$KeyIterator.next(HashMap.java:828)
at net.sf.katta.protocol.InteractionProtocol$1.handleStateChanged(InteractionProtocol.java:87)
at org.I0Itec.zkclient.ZkClient$5.run(ZkClient.java:484)
at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:72)
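
The difference the proposed fix relies on can be shown in isolation. The stack trace points at an iteration over a `HashMap` key set; the sketch below modifies a map while iterating over its keys and compares `HashMap` with `ConcurrentHashMap`. `ListenerMapDemo` is an illustrative name, and which collection `InteractionProtocol` actually uses at line 87 is not visible in this report.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ListenerMapDemo {

    // Fills the map, then registers a new key while iterating over the key set.
    // Returns true if the iteration completes, false on ConcurrentModificationException.
    static boolean survivesPutDuringIteration(Map<String, Integer> listeners) {
        listeners.put("a", 1);
        listeners.put("b", 2);
        listeners.put("c", 3);
        try {
            for (String key : listeners.keySet()) {
                // Simulates another component registering a listener mid-iteration.
                listeners.put("registered-late", 0);
            }
            return true;
        } catch (ConcurrentModificationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("HashMap: "
            + survivesPutDuringIteration(new HashMap<String, Integer>()));
        System.out.println("ConcurrentHashMap: "
            + survivesPutDuringIteration(new ConcurrentHashMap<String, Integer>()));
    }
}
```

`ConcurrentHashMap` iterators are weakly consistent, so they tolerate such modifications; an alternative is to iterate over a copy of the key set taken under a lock.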










7-node katta cluster, it should have nothing to do with OS

With release of 0.6 and the setting of
'master.deploy.policy=net.sf.katta.master.LowestShardCountDistributionPolicy',
the shards can not be evenly distributed over all nodes. I
deployed my 42 indexes sequentially into katta.

Only one index has 7 shards and all others have 1

to 3 shards. And, 16 of them are with replication

level 2 and others with replication-level 1. The

policy worked perfectly for us with a trunk

version before. I think it should be easy to

reproduce this bug if you sequentially deploy

multiple indexes, each of which has different

number of shards and/or replication level.



The following is the output of 'katta check' of my

cluster. (To me, it seems that katta mistakenly

used old 'default deploy policy' to assign the

shards.):



--------------------------------------------------------------------------------------------------
| Node | Connected | Shard Status |
==================================================================================================
| haystack004.scur.colo:20000 | true | 2/2 ## |
--------------------------------------------------------------------------------------------------
| haystack005.scur.colo:20000 | true | 17/17

When trying to unpublish a non-existing katta index from an external java
program (through katta.main), Katta will kill the original java program
when no index is found. (in printUsageAndExit)



It should be better just to return.
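
A common way to achieve this is to let the library-level method report the problem with an exception and leave the decision to exit to the command-line `main`. The sketch below illustrates the pattern with made-up names (`CommandSketch`, `UsageException`, `removeIndex`); it does not mirror the real Katta command classes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CommandSketch {

    // Signals a usage problem to the caller instead of terminating the JVM.
    static final class UsageException extends RuntimeException {
        UsageException(String message) {
            super(message);
        }
    }

    // Library-style entry point: a missing index is reported by throwing, never by exiting.
    static void removeIndex(Set<String> deployedIndexes, String name) {
        if (!deployedIndexes.remove(name)) {
            throw new UsageException("index '" + name + "' does not exist");
        }
    }

    public static void main(String[] args) {
        Set<String> deployed = new HashSet<String>(Arrays.asList("index-a"));
        try {
            removeIndex(deployed, "index-b");
        } catch (UsageException e) {
            // A command-line wrapper may print usage and call System.exit(1) here;
            // an embedding program simply catches the exception and carries on.
            System.err.println("ERROR: " + e.getMessage());
        }
    }
}
```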










lsof shows file descriptors grow as we deploy new

indices and remove the old.



Here is a patch -



--- a/src/main/java/net/sf/katta/lib/lucene/LuceneServer.java
+++ b/src/main/java/net/sf/katta/lib/lucene/LuceneServer.java
@@ -124,7 +124,7 @@ public class LuceneServer implements IContentServer, ILucene
 public void removeShard(final String shardName) {
 LOG.info("LuceneServer " + _nodeName + " removing shard " + shardName);
 synchronized (_searcherByShard) {
- final Searchable remove = _searcherByShard.remove(shardName);
+ final IndexSearcher remove = _searcherByShard.remove(shardName);
 if (remove == null) {
 return; // nothing to do.
 }
@@ -133,6 +133,10 @@ public class LuceneServer implements IContentServer, ILucen
 } catch (final IOException e) {
 throw new RuntimeException("unable to retrive maxDocs from searchable")
 }

The process nodes seem to hang in a deadlock.
kill processid won't have any effect on them and I
have to kill them with kill -9.





I can reproduce this behaviour on windows

(cygwin) and on debian linux.










I added an index with a wrong path before

"file://c:/tmp/test", which failed to deploy.



listIndices won't work anymore:

ERROR: Wrong FS: file://c:/tmp/test, expected:

file:///



Maybe other parts of katta fail too (I can't test

right now)

In addition to the redeploy feature, it would be
nice to have a reload and a refresh feature.

Reload would reload all the shards on all the
nodes (closing and reopening the lucene indexes).
Refresh would load all new shards from the index
directory, and then drop those shards that are no
longer present in the index directory.

This would make frequent updates very easy, as it
would be possible to do only a "refresh" on
an index, which would load the new shards and
drop the non-existing shards, instead of having to
juggle between multiple indexes.










some input from Ted Dunning:

{noformat}

My own thought is that the master should do the

following balancing actions fairly often:



- to reassign shards that have not yet been

loaded by a node

- to add replicas of shards to nodes with a serious

underload of shards

- to tell a node to stop serving shards with excess

advertised replicas (largely due to the second

action)



As it stands, I think that the katta master is a bit

too laissez faire and should be a bit more

aggressive.

One thing to watch out for, however, in making

the master more aggressive is that when a node

disappears for a short time

it should re-advertise all of the shards it still finds

on its disk rather than delete them.

It should only delete them after a period of time

that allows the master to consider what should

be done.

Combined with a partial result policy, this can

make transient node failures much less of a

problem.

{noformat}










from Erich

{noformat}

I'm running 0.6-rc1 and have problems with

imbalanced shards.

When I deployed 62 indices consisting of 183

shards (replication 2)

across 5 nodes (all up during and after the

deployment) I get the

following distribution:



----------------------------------------------------------------------------------------------------------
| Node | Connected | Shard Status |
=====================================================
| grid0004:20000 | true | 62/62 ############################# |
----------------------------------------------------------------------------------------------------------
| grid0002:20000 | true | 62/62 ############################# |
----------------------------------------------------------------------------------------------------------
| grid0003:20000 | true | 62/62 ############################# |
----------------------------------------------------------------------------------------------------------
| grid0001:20001 | true | 0/0 |

--------------------------------------------------------------------

log4j is at 1.2.15 I think. It shouldn't cause any
problems to update from 1.2.14.
A patch will follow in a few moments.

Sorry for filing the bug here. I think it's the most
appropriate place.

People who would like to (re)build from the

tarballs need build.xml and ivy.xml. I think it's a

bug that they aren't included since you ship

extras/*/build.xml and extras/*/ivy.xml and it

also doesn't make sense to have the source

without the build system.




* release a source only tarball without

precompiled .jar files (and without the prototype

javascript library)

* make it possible to not use ivy in the build

system

* release zkclient

* make it possible to compile only selected extras



Maybe there are more things to do.







To reduce the size of the core katta distribution

we make a separate distribution for the web ui.



This is a follow up task from KATTA-17.



Following shortcomings:

- the results files are always to be found on the

master host, also it can be triggered from any

other host

- the test does not survive katta master change

- the nodes hold the result of each query in

memory (which could possibly lead to an OOM)



I think all this could be fixed relatively easily by using
[bookkeeper|http://hadoop.apache.org/zookeeper/docs/r3.2.2/bookkeeperOverview.html]










I'm trying to start katta with bin/katta

startMaster, but it fails with the following error

message:



$ bin/katta startMaster

ERROR: Path must start with / character

Usage:

startMaster Starts a local master





Katta '0.6-dev'

Git-Revision '775443f'

Based on KATTA-82.

Some initial thoughts for discussion:

- things we want to monitor ?

-- we have:

--- cpu, memory, garbage collection

-- we might want ?:

--- disc-io/allocation (support multiple discs first)

--- shard distribution, query distribution, cluster

balance (have this partly in zk)

- how do we publish those ?

-- currently a IMonitor publishes to zk

-- do we want to publish via jmx ?

-- publish to ganglia ?

- we might extend our IMonitor solution to run

multiple IMonitors at the same time

Some users reported problems with the cluster

stability when deploying a large index.










In NodeInteraction there are these lines

{code}

VersionedProtocol proxy = _shardManager.getProxy(_node);
if (proxy == null) {
  String msg = "No proxy for node: " + _node;
  LOG.debug(msg);
  _result.addError(new KattaException(msg), _shards);
  return;
}

{code}

In this case the normal "try another node" will
simply be skipped.
But this situation can only happen when the
Client is used multithreaded and another node
interaction failed on a node and so kicked out the

e.g. all lucene stuff should be in a package, etc

This might influence configuration.










from Richard Tang:

{noformat}

I have troubles in creating sample data as in online tutorial
http://katta.sourceforge.net/documentation/how-to-create-a-katta-index).
In particular, when I tried to exec 'ant jar', exceptions are thrown



[ivy:resolve] :::: WARNINGS

[ivy:resolve] module not found: junit#junit;3.8.1

[ivy:resolve] ==== libraries: tried

[ivy:resolve] -- artifact junit#junit;3.8.1!junit.jar:

[ivy:resolve] /root/katta-buildindex2/katta/extras/indexing/lib/junit-3.8.1.jar

[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::

[ivy:resolve] :: UNRESOLVED DEPENDENCIES ::

[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::

[ivy:resolve] :: junit#junit;3.8.1: not found

[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::

[ivy:resolve]

{noformat}



There should be a method to access binary fields in

results as well.



If I iterate over



List ids = client.getDetails(hits,

fetch)



I can only access String fields (eg

ids.get(i).get("stringfield")) and not binary fields

which are normally accessed with getBinaryValue

in lucene.










Linux To make fail over easier to use and configure,



Currently bin/start-all.sh starts primary master

and nodes, and does not start the secondary

master. There is an improvement required in the

script so that, as per the configuration of the

katta -cluster it should also start the secondary

master seamlessly.



Also attaching a fail over experimentation

documentation.

The JmxMonitor sometimes tries to write to zk
after the zkclient shutdown has already happened.

{noformat}

Exception in thread "Thread-1723" Exception in thread "Thread-1780" org.I0Itec.zkclient.exception.ZkNodeExistsException: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /katta/server-metrics/localhost:20002
at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:40)
at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:664)
at org.I0Itec.zkclient.ZkClient.create(ZkClient.java:241)
at org.I0Itec.zkclient.ZkClient.createEphemeral(ZkClient.java:277)
at net.sf.katta.protocol.InteractionProtocol.setMetric(InteractionProtocol.java:449)
at net.sf.katta.monitor.JmxMonitor$JmxMonitorThread.run(JmxMonitor.java:116)
Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for


KATTA-43 probably includes changes in the
internal zk file structure. There should be a simple
mechanism by which an existing 0.5 cluster can be
upgraded to 0.6 (and for future versions as well).



Over the last weeks I noticed that the test
suite hangs from time to time. I saw this with
MultiInstanceTest and with LuceneClientTest at
least. Not sure yet if it is a master or a client
problem.

Here is a stack trace from one hang:

{noformat}

[junit] "main" prio=5 tid=0x0000000101802800 nid=0x100401000 in Object.wait() [0x0000000100400000]
[junit] java.lang.Thread.State: TIMED_WAITING (on object monitor)
[junit] at java.lang.Object.wait(Native Method)
[junit] - waiting on (a net.sf.katta.client.IndexDeployFuture)
[junit] at net.sf.katta.client.IndexDeployFuture.joinDeployment(IndexDeployFuture.java:54)
[junit] - locked (a net.sf.katta.client.IndexDeployFuture)
[junit] at net.sf.katta.client.LuceneClientTest.onBeforeClass(LuceneClientTest.java:85)
[junit] at net.sf.katta.AbstractKattaTest.beforeClass(AbstractKattaTest.java:89)
[junit] at net.sf.katta.testutil.ExtendedTestCase.setUp(ExtendedTestCase.java:67)

[junit] at










This issue was created from the discussion I had with [~tdunning] in KATTA-81. The main idea is to use zk for centralized configuration management instead of having too many properties files.

The main points of the suggestion:
- one configuration file with all katta configuration values for master/node/client (at the master's location)
-- the master writes the files to zk on changes
-- the nodes & clients read those files from zk before starting
- the configuration values each component needs are:
-- zk server addresses
-- zk session timeout
-- zk directory where the cluster or config is stored by the master
- there should be a possibility to override the values for individual instances
- provide util methods/factories for the "connect to zk, read properties, construct client/node" thing

Since the above changes are likely to break some API, I would also refactor the Client class:
- move nodeSelectionPolicy into the configuration object
- pass zkClient instead of zkProperties into the constructor and make an additional connect










On a search query:
(1) the query is broadcast to each node with relevant shards
(2) each node creates a HitsMapWritable and sends it back to the client. This HitsMapWritable object can contain the results of several shards.
(3) on the client side each HitsMapWritable gets temporarily transformed into a Hits object, and from the list of those Hits objects a single Hits object is built. This final single Hits object is returned to the searcher.

I think different sortings take place in the following places:
(2a) when the shard is queried, the result list is already sorted by lucene
(2b) on combining the result lists of the different shards a PriorityQueue is used (KattaHitQueue), so I guess the resulting list is also sorted
(3a) when transforming the HitsMapWritable to a Hits object to a List, the internal sorting of the Hits class is automatically triggered and re-sorts the whole node sublist
(3b) when the final Hits object is built, the internal sorting of the Hits object is used to sort all hits after merging the sublists



I think the following undesired things happen:
- The 2a, 2b sorting is wasted the moment the hits get transferred from the hit queue into the

Having a mixture of Hadoop versions in the client / nodes yields errors. It would be a nice-to-have feature to be able to find the Hadoop version of a cluster.

It would be great to be able to issue searches for

indexes "foo-*" rather than having to enumerate

all indexes starting with foo.



For other projects to be able to depend on Katta,

it needs to be available as a maven artifact.



I am working on a patch for this.




Currently the git repository has all Katta's

dependencies checked into the lib/ directory,

which bloats the size of the repository. These

dependencies should be downloaded in an

automated fashion as part of the build process.



See the discussion in [KATTA-74].



This is uncommittable code; however, it's a placeholder until LPT can be integrated into Katta. Thanks to Kevin Peterson for writing this in Ruby first and suggesting the algorithm.









I recently very much enjoyed having the sources in that jar as well, since modern IDEs then give instant access to the javadoc and the source code. Having more people see our sources will definitely be good for us. For example, mockito includes the sources in the jar; we should do the same.










The default distribution policy tends to pile shards up on the nodes listed first in the conf/nodes file.
The LowestShardCountDistributionPolicy picks the node with the fewest shards to replicate to, and the node with the most shards to remove replicas from.
It is enabled by setting the configuration property master.deploy.policy=net.sf.katta.master.LowestShardCountDistributionPolicy in katta.master.properties.
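The selection rule described above can be sketched in a few lines. This is only an illustration, not the code of the actual LowestShardCountDistributionPolicy: the class name, the method names and the map of node name to shard count are assumptions made for the example. Ties go to the first node in the map's iteration order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating the rule; not Katta's real policy class.
class LowestShardCountSketch {

    // Node that currently holds the fewest shards (target for a new replica).
    static String nodeWithFewestShards(Map<String, Integer> shardCounts) {
        String best = null;
        for (Map.Entry<String, Integer> e : shardCounts.entrySet()) {
            if (best == null || e.getValue() < shardCounts.get(best)) {
                best = e.getKey();
            }
        }
        return best; // null if there are no nodes
    }

    // Node that currently holds the most shards (candidate for replica removal).
    static String nodeWithMostShards(Map<String, Integer> shardCounts) {
        String best = null;
        for (Map.Entry<String, Integer> e : shardCounts.entrySet()) {
            if (best == null || e.getValue() > shardCounts.get(best)) {
                best = e.getKey();
            }
        }
        return best; // null if there are no nodes
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("node1:20000", 3);
        counts.put("node2:20000", 1);
        counts.put("node3:20000", 5);
        System.out.println("replicate to:  " + nodeWithFewestShards(counts));
        System.out.println("remove from:   " + nodeWithMostShards(counts));
    }
}
```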



Also, zip files read from hadoop were stored on the local disk and then unpacked. This patch reads the zip file from hadoop and unpacks it as it is read.
The system property katta.spool.zip.shards may be set to the string true to force the zip file to be spooled to local disk first.
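A minimal sketch of "unpack as it is read", assuming the shard arrives as a plain java.io.InputStream (for example one opened on the Hadoop file system). The class and method names are made up for this example and are not the patch's code. Each entry is written to the target directory while the stream is consumed, so no intermediate zip file is spooled to disk.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper; not the code from the patch.
class StreamingUnzipSketch {

    // Unpacks the zip stream into targetDir and returns the number of files written.
    static int unzip(InputStream source, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        int files = 0;
        try (ZipInputStream zip = new ZipInputStream(source)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Refuse entries such as "../x" that would land outside the target directory.
                if (!out.startsWith(root)) {
                    throw new IOException("zip entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Reads until the end of the current entry, not the end of the archive.
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                    files++;
                }
            }
        }
        return files;
    }
}
```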



Minor cleanup in Node, to ensure that temporary

files and directories are deleted in the case of an

error exit.










$ uname -a
Linux imyousuf-laptop 2.6.28-15-generic #52-Ubuntu SMP Wed Sep 9 10:49:34 UTC 2009 i686 GNU/Linux

$ java -version
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) Server VM (build 11.3-b02, mixed mode)

$ ant -version
Apache Ant version 1.7.0 compiled on December 13 2006

$ git --version
git version 1.6.2.1

$ git remote show origin
* remote origin
URL: git://katta.git.sourceforge.net/gitroot/katta/katta
Remote branch merged with 'git pull' while on branch master
master
Remote branch merged with 'git pull' while on branch stable-0.5.1
stable-0.5.1
Tracked remote branches
branch-0.1

The mockito jar seems to be missing from the lib folder and it is not downloaded by Ivy either; as a result the 0.5.1 branch cannot be built.

$ ant clean dist
Buildfile: build.xml

clean:
[echo] cleaning katta-core

check-ivy-available:

download-ivy:

install-ivy:
[echo] Ivy path

resolve:
[ivy:resolve] :: Ivy 2.0.0 - 20090108225011 :: http://ant.apache.org/ivy/ ::
:: loading settings :: file = /media/unix-2/projects/katta/ivy/ivysettings.xml
[ivy:resolve] DEPRECATED: useOrigin option is deprecated when calling resolve, use useOrigin setting on the cache implementation instead
[ivy:resolve] :: resolving dependencies :: emi#event-processing/grid;working@imyousuf-laptop
[ivy:resolve] confs: [ant, eclipse, compile, test, instrument, checkstyle, job]
[ivy:resolve] found svnant#svnant;1.2.1 in

.. so we need one component that takes measurements (CPU, memory, etc.) and publishes them, and another component that subscribes to those and stores them.







When implementing SOLR-1395 I needed to load

configuration files from specific paths rather than

from the classpath.



The patch also includes a jar-test target.


When building shards, it would be useful to

consistently build

shards up to a given size so that they are all

roughly the same

size in bytes. Perhaps in Hadoop this is difficult

because the

number of jobs must be set ahead of time?

I am trying to get strictly time-ordered results

from Katta by passing a Sort.



Each of my shards represents a fixed time period,

so in addition to being able to pass a Sort, it will

also sort the shard names and process them in

order (according to the sort order, reverse or not,

in the primary SortField).

The instructions at http://katta.sourceforge.net/documentation/how-to-merge-indexes fail with Katta 0.5.1 due to Commons HTTP client being missing:

java.lang.NoClassDefFoundError: org/apache/commons/httpclient/HttpMethod



Copying its jar from Hadoop's lib directory to

Katta's lib fixes it.










In this example I've attempted to add an index,

but I've made a typo in the port number:



$ bin/katta addIndex index3 hdfs://localhost:80200/index org.apache.lucene.analysis.StandardAnalyzer 1
.not deployed index index3
$ bin/katta listIndexes
Exception in thread "main" java.io.IOException: Call to localhost/127.0.0.1:19000 failed on local exception: Connection refused



You should be able to list your indices even if one

of them is not available. That way you would be

able to know which index to remove.



In addition, the "not deployed index index3"

message you get upon trying to add the index in

the first place is unhelpful.







Expanding katta-core-0.5.1.tar.gz does not create

a subdirectory for katta as it should.

The jets3t library is not found when attempting to

build the example indexing tool. A log is attached.



Since the bin scripts in katta seem to be based on those from Hadoop, they're subject to the same bug regarding CDPATH: https://issues.apache.org/jira/browse/HADOOP-6101



The scripts expect cd to produce no output, but

this is not true in bash if the CDPATH

environment variable is set. A suggested fix would

be to unset CDPATH in the affected scripts.









> Exception in thread "main" java.lang.NullPointerException
> at java.util.ArrayList.addAll(ArrayList.java:472)
> at net.sf.katta.client.Client.getShardsToSearchIn(Client.java:296)
> at net.sf.katta.client.Client.getNode2ShardsMap(Client.java:278)
> at net.sf.katta.client.Client.search(Client.java:212)
> at net.sf.katta.client.Client.search(Client.java:205)
> at net.sf.katta.Katta.search(Katta.java:466)
> at net.sf.katta.Katta.main(Katta.java:100)










When trying to start a Katta master using an external Zookeeper via bin/katta startMaster, the process starts and immediately finishes without an error message.

As far as I understand the issue, the problem seems to be in Katta.startMaster(). After starting master, client etc., the following code is executed:



if (_zkServer != null) {
  _zkServer.join();
} else {
  // since we do not have a running zookeeper we need something to join
  // in...
  Thread thread = new Thread("keep a live");
  thread.setDaemon(true);
  thread.start();
  try {
    thread.join();
  } catch (InterruptedException e) {
  }
}



Since the _zkServer variable is null when using an external Zookeeper, the threading code is executed. The intention of this block seems to be to sleep indefinitely, which of course would make sense in this case, since the master is running its threads, doing work etc.
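A Thread constructed without a Runnable has an empty run() method, so start() followed by join() returns almost immediately, which would explain the process exiting right away. One way to block until shutdown is requested is a latch, sketched below. This is only an illustration of the idea, not the fix that went into Katta; the class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical keep-alive helper; not part of Katta.
class KeepAliveSketch {

    private final CountDownLatch shutdown = new CountDownLatch(1);

    // Blocks the calling thread until requestShutdown() is called.
    void awaitShutdown() throws InterruptedException {
        shutdown.await();
    }

    // Same as above with a timeout; returns true if shutdown was requested in time.
    boolean awaitShutdown(long timeout, TimeUnit unit) throws InterruptedException {
        return shutdown.await(timeout, unit);
    }

    // Releases every thread waiting in awaitShutdown().
    void requestShutdown() {
        shutdown.countDown();
    }

    public static void main(String[] args) throws InterruptedException {
        KeepAliveSketch keepAlive = new KeepAliveSketch();
        // Release the main thread when the JVM is asked to stop (e.g. Ctrl-C).
        Runtime.getRuntime().addShutdownHook(new Thread(keepAlive::requestShutdown));
        System.out.println("master running, waiting for shutdown...");
        keepAlive.awaitShutdown();
    }
}
```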

If you get a ConnectionLossException when trying

to deploy an index, the entire index is marked as

ERROR and can never recover.



At the least, Katta should handle these situations

more gracefully.



For instance, ZkClient.exists just blows out,

transforming the recoverable exception into a

non-recoverable KattaException.



It is dangerous to change too many of these, but

some can probably be fixed.


It seems to happen pretty often that a cluster

runs into an error of some kind and never resets

the status from ERROR back to OK.



As such, it would be really useful to have a utility

that would allow a cluster to be scanned for

correctness and then mark the cluster as up or

down.







When using the new feature 'external zookeeper' implemented in KATTA-14, the default namespace (rooted at "/katta") does not seem to be created. This namespace is usually created in ZkServer.startZooKeeperServer(), which is no longer called from Katta.startMaster() because of the new conditional:



public static void startMaster(ZkConfiguration conf) throws KattaException {
  ...
  if (conf.isEmbedded()) {
    _zkServer = new ZkServer(conf); // /.

The following jars are needed if you want to use

the s3 or s3n protocol (via jets3t) as the path to

indexes for the addIndex command.



* commons-httpclient-3.0.1.jar

* commons-codec-1.3.jar



These should be part of the standard Katta

release, in the lib/ sub-directory.










via mail from Johannes Herr:



Hi Stefan,



the patch with my changes is attached.



The main point is changing the port used in

ZKClient. ZKClient wraps a ZooKeeper object

instantiated in ZKClient.start() via:



{code:java}

_zk = new ZooKeeper(_servers, _timeOut, this);

{code}



The _servers variable is defined in the ZKClient

constructor via:



{code:java}

_servers = configuration.getZKServers();

{code}



That means it holds the value of the

zookeeper.servers line in katta.zk.properties. In

my case it is:



{noformat}

zookeeper.servers=hostA:2181,hostB:2181

{noformat}



If I understand the code correctly, this means that there will be a client connection to hostA:2181 or










(see KATTA-61 for a related issue)



If you add a new node to a katta cluster, it will never be assigned any shards unless another node goes down.



It would be nice to be able to rebalance a cluster

by moving shards from the node with the most

shards to the node with the least.



Rebalancing should, of course, only be done if

there is an imbalance worth correcting. My first

swag at that rule would be to only rebalance if

some node has more than 2 fewer shards than

the average number of shards computed by

dividing the total number of shards by the

number of nodes (using floating point, not

integer math).



It is also worth having a throttle that limits how

often shards are moved.



And, of course, it is important that shards not be

deleted from where they came from until they

are present on the destination node and possibly

not even for a while after that.
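The imbalance rule proposed above (rebalance only if some node has more than 2 fewer shards than the floating-point average) can be written down directly. The sketch below assumes the per-node shard counts are available as an int array; the class name, method name and the threshold parameter are illustrative only.

```java
// Hypothetical helper illustrating the proposed rule; not part of Katta.
class RebalanceRuleSketch {

    // True if some node has more than `threshold` fewer shards than the average.
    static boolean needsRebalance(int[] shardsPerNode, double threshold) {
        if (shardsPerNode.length == 0) {
            return false;
        }
        long total = 0;
        for (int count : shardsPerNode) {
            total += count;
        }
        // Floating point on purpose: integer division would hide small imbalances.
        double average = (double) total / shardsPerNode.length;
        for (int count : shardsPerNode) {
            if (average - count > threshold) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Three loaded nodes plus one freshly added, empty node.
        System.out.println(needsRebalance(new int[] {6, 6, 6, 0}, 2.0));
        // Nearly even distribution.
        System.out.println(needsRebalance(new int[] {3, 3, 2}, 2.0));
    }
}
```

A throttle and the "copy first, delete later" ordering mentioned above would sit around this check and are not shown here.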










If you have a scenario where you add a new node

to a cluster and take down an old node, then the

shards of the old node will be distributed evenly

across the entire cluster.



It would be preferable in many cases if all of the

shards from the old node were put on the

(empty) new node.



This should be relatively easy to do by building a deployment policy that is aware of the number of shards each node has and assigns new shards to the least occupied node.



A related issue is balancing a cluster after adding

a new node, but without taking down an old

node. I will file a separate issue on that and put a

link here. Having a cluster balancer would solve

this issue as well, but more slowly.



Allow user to specify a root path other than

/katta.

This allows multiple installations of Katta (master,

clients, nodes) to share one ZooKeeper server.

Also rename ZkPathes.java to ZkPaths.java.










It is sometimes important to proceed even if

some shards are not available. I suggest two

changes in semantics:



1) if a request to a node fails in the threaded request loop in the client, then the node will be marked as down and additional search requests will be created to do the same request on any other nodes that have the same shards. If no other nodes have the shard, then the request will be marked as failing.



2) once results from x% of the shards in the

original request have been collected, a deadline

will be set for t milliseconds in the future. If all

results arrive before the deadline, then the

search will proceed as normal. If the deadline

arrives before all results have been collected,

then if y% of the shards have results, the results

will be returned as complete. If the deadline

passes with less than y% of the shards having

results, then an error will be raised. The values of

x, y and t will be parameters of the search with

reasonable defaults (such as 70%, 90%, 500ms).



Note that it is important for x and y to be

separate so that x can be set low enough so that

the deadline will always be triggered while y is

still high enough to guarantee reasonable results

from all successful queries.
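The deadline semantics in point 2 can be sketched as a small decision function. Everything here is an assumption for illustration: the class, the enum and the method names do not exist in Katta, and x and y are taken as integer percentages to avoid floating-point rounding at the boundaries.

```java
// Hypothetical policy object for the x / y / t parameters; not part of Katta.
class PartialResultPolicySketch {

    enum Decision { WAIT, RETURN_RESULTS, FAIL }

    private final int armPercent;      // x: share of shards that starts the deadline
    private final int acceptPercent;   // y: share of shards needed to return results
    private final long deadlineMillis; // t: length of the deadline once started

    PartialResultPolicySketch(int armPercent, int acceptPercent, long deadlineMillis) {
        this.armPercent = armPercent;
        this.acceptPercent = acceptPercent;
        this.deadlineMillis = deadlineMillis;
    }

    long deadlineMillis() {
        return deadlineMillis;
    }

    // True once x% of the shards have reported, i.e. the deadline should start running.
    boolean shouldArmDeadline(int received, int total) {
        return (long) received * 100 >= (long) armPercent * total;
    }

    // What the client should do given the shards received so far.
    Decision decide(int received, int total, boolean deadlinePassed) {
        if (received >= total) {
            return Decision.RETURN_RESULTS; // everything arrived, proceed as normal
        }
        if (!deadlinePassed) {
            return Decision.WAIT;
        }
        boolean enough = (long) received * 100 >= (long) acceptPercent * total;
        return enough ? Decision.RETURN_RESULTS : Decision.FAIL;
    }
}
```

With the defaults suggested above (70%, 90%, 500ms) and 10 shards, the deadline starts after the 7th shard reports, and 9 shards at the deadline are enough to return results.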










The idea here is that all data in katta that

depends on a search node staying up should be

ephemeral and created by that node.



Currently, the structure is something like



node2shards//shard*



and



shard2node//



The files under shard2node disappear correctly

when the node disappears, but the files in the

first are created by the master and do not

necessarily go down. The master could be

extended to make this happen, but if the master

ever lost track, then the data would be corrupt.

This currently happens in EC2 if ZK connection

parameters are not set with long timeouts and it

causes the entire cluster to appear to be down.

Just making the shard files ephemeral does not

work because the node directory still exists.



What I propose is that the data about what

shards a node is serving be kept in an ordinary file

that lists the shards rather than as a directory

with a single entry per shard. This file can then be

node-wise ephemeral and will vanish correctly if

the node drops out. To make this work, it would










After downloading 0.5.1 and deploying/untarring

(with -p) on a cluster, none of the files in the bin/

subdirectory had their 'x' bit set.



Which means that commands would fail, due to

the reliance on this permission.



Some options for how to fix this:



* Permissions should be set correctly in the

.tar(.gz) release file.

* There should be an install script that sets these

after deployment

* The scripts should internally use "sh xxx" calls to

other scripts.



See http://katta.sourceforge.net/documentation/install-and-configure-katta

It has a section that says "# host:path where hadoop code should be rsync'd from. Unset by default."



Jira comment - maybe add a "documentation"

component?

As per the download from

http://katta.sourceforge.net/home/download



Make Katta usable for any task that requires

distributed shards.

Separate out shard management from Lucene

searching.

Lucene becomes one use case of Katta. Add a second use case which searches Hadoop MapFiles.










Here is a patch that does the following to the

node sub-directory.



If this is a good thing to do, I can continue with

other sub-directories.









Changed DocumentFrequenceWritable to

DocumentFrequencyWritable (to correct the

spelling)

Fixed grammar and spelling in a number of places

Made KattaHitQueue implement Iterable to

make code more readable

Got rid of some redundant boxing/unboxing



Removed unused handle method and the

apparently unused IRequestHandler interface



Added some javadoc comments

Deleted some goo.



Added several questions in the form of TODO's









Only when the session expires, we have to close

the current ZooKeeper instance and start another

one.

We also have to distinguish between those two

events and let the IZKStateListener know what

has happened, because there are different

actions that might have to take place on

"disconnect" and "expired". E.g. ephemeral nodes

only have to be recreated when the session

expired.
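The difference between the two events can be made explicit in code. The sketch below does not use the ZooKeeper client library; the enum and the action strings are stand-ins invented for this example, loosely modelled on the "disconnect" and "expired" events named above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; the State enum is a stand-in, not ZooKeeper's own type.
class SessionStateSketch {

    enum State { DISCONNECTED, EXPIRED }

    // Actions a client wrapper would take, recorded as strings so the flow is easy to inspect.
    static List<String> actionsFor(State state) {
        List<String> actions = new ArrayList<>();
        switch (state) {
            case DISCONNECTED:
                // The session may still be alive: keep the ZooKeeper instance.
                actions.add("notify listeners: disconnected");
                break;
            case EXPIRED:
                // The session is gone: a new ZooKeeper instance is required.
                actions.add("close current ZooKeeper instance");
                actions.add("start new ZooKeeper instance");
                actions.add("notify listeners: expired");
                actions.add("recreate ephemeral nodes");
                break;
        }
        return actions;
    }

    public static void main(String[] args) {
        for (State state : State.values()) {
            System.out.println(state + " -> " + actionsFor(state));
        }
    }
}
```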










The test ClientFailoverTest fails once in a while:



I see the following output:

09/04/23 17:29:17 WARN node.BaseNode:244 - Old node path '/katta/nodes/Senor-Vossi.local:20017' for this node detected, delete it...
09/04/23 17:29:17 INFO master.Master:159 - got node event: [Senor-Vossi.local:20016]
09/04/23 17:29:17 INFO node.BaseNode:253 - node 'Senor-Vossi.local:20017' announced
09/04/23 17:29:17 INFO node.BaseNode:260 - Start serving shards...
09/04/23 17:29:17 INFO master.Master:159 - got node event: [Senor-Vossi.local:20017, Senor-Vossi.local:20016]
09/04/23 17:29:17 INFO node.BaseNode:405 - announce shard 'index1_aIndex'
09/04/23 17:29:17 WARN node.BaseNode:409 - detected old shard-to-node entry - deleting it..
09/04/23 17:29:17 INFO node.BaseNode:405 - announce shard 'index1_bIndex'
09/04/23 17:29:17 WARN node.BaseNode:409 - detected old shard-to-node entry - deleting it..
09/04/23 17:29:17 INFO node.BaseNode:405 - announce shard 'index1_cIndex'
09/04/23 17:29:17 WARN node.BaseNode:409 - detected old shard-to-node entry - deleting it..
09/04/23 17:29:17 INFO master.DistributeShardsThread:131 - processing of update started...










N/A
The current redeploy method is a combination of remove and deploy, making the index unavailable until the deployment is finished.

It would be a great addition if an already existing index could be refreshed:
- Katta would examine the HDFS index directory to see which shards are there
- Depending on the shard directory name it could deduce whether the shard is already present on the nodes and doesn't need to be copied
- If there are new shards that are not already distributed, it would automatically start copying them over to the nodes
- After the deployment is done it switches all search traffic to the refreshed index automatically.



Assumptions:

- It is okay for refreshed indexes to have deleted

events not automatically removed (or the client

needs to filter those out through an add'l deleted

field)

- Shard names are unique and incremental

updates into the shard directory will have a

unique name to avoid collisions.
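Under the assumption that shard names are unique, deciding what a refresh has to copy reduces to a set difference between the shard directories found in the index directory and the shards already deployed. The sketch below only illustrates that step; the class and method names are hypothetical and no HDFS or Katta API is involved.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for the proposed refresh; not part of Katta.
class IndexRefreshSketch {

    // Shards present in the index directory that are not yet deployed on any node.
    static Set<String> shardsToCopy(Set<String> shardsInIndexDir, Set<String> deployedShards) {
        Set<String> missing = new TreeSet<>(shardsInIndexDir);
        missing.removeAll(deployedShards);
        return missing;
    }

    public static void main(String[] args) {
        Set<String> inDir = new TreeSet<>(Arrays.asList("shard-001", "shard-002", "shard-003"));
        Set<String> deployed = new TreeSet<>(Arrays.asList("shard-001", "shard-002"));
        System.out.println("to copy: " + shardsToCopy(inDir, deployed));
    }
}
```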



This would be a great enhancement considering

that it would cut down deployment time for large

indexes dramatically and simplify the overall

client code (the current API makes it necessary to










N/A
I'm currently using the IndexMetaData to programmatically inspect and rotate the deployed indexes. It appears to have everything I need except the actual index name.



Here is some sample code that illustrates the

problem:



{code:java}

DeployClient dc = new DeployClient(_zkconf);
List eventIndexes = new Vector();
for (IndexMetaData i : dc.getIndexes(IndexState.DEPLOYED)) {
  System.out.println("index name:" + i.getName());
  // this would be nice and logical to do
}

{code}



I looked at Katta.java and ZK paths are used to

get to the index names, which isn't really a good

way.









The old (deprecated) API is hard wired to use the

KeywordAnalyzer.

The new API expects that we get a lucene query.

So there is no reason for addIndex to take an

Analyzer as an input param.










N/A
Hi,

While using Katta in our development project it became clear that it would be a lot easier for us to use if it allowed programmatic configuration of the Client and Deploy Client instead of insisting on property files with a certain name in the classpath.



Something like:



Properties prop = new Properties();
prop.setProperty("zookeeper.servers", "localhost:2181");

DeployClient dc = new DeployClient(new ZkConfiguration(prop));
dc.addIndex("events", "hdfs://localhost/myIndex", StandardAnalyzer.class.getName(), 1);
dc.disconnect();



Ideally the same would also work for regular

Clients.

Thanks!










I am sorry: I found this bug long ago, but I forgot to report it.

The buggy code is in KattaMultiSearcher::search(), as below:



================================= below

================================

boolean working = true;

while (working) {

ScoreDoc scoreDoc = null;

for (int i = 0; i hitB.getDocId();

}

- return hitA.getScore() hitB.getScore();

}










We have been running a moderate sized cluster

in EC2 for some time now (300 shards). Due to

various reasons, we have seen a number of cases

where we had a Zookeeper session expire. This

leads to an inconsistent state for katta in ZK

which often prevents further use of the cluster.



What should happen when a search node has a

session expiration is



a) all of its own ephemeral files in ZK will disappear. (this seems to work correctly)

b) Any dependent files created by the master should be deleted as quickly as possible. It is preferable if all files that depend on the node being present be created as ephemeral by the node itself. This would include node2shards, shard2nodes and server entries. (this is probably where the discrepancy occurs)

c) the node should re-register with ZK from scratch, almost as if it were starting over. The one exception is that it should declare that it has whatever shards it has. (this definitely does not happen correctly)

d) the node should honor requests that come in during the disconnect time since these may come from clients who didn't get the memo about the node disappearing.

The ZKClient needs an update to reflect the changes of zookeeper 3.1.
For example there are much more granular states that an event now can have (e.g. Node Removed).










SecondaryMaster cannot take over when firstMaster failed.
This bug can be described as below:
(1) firstMaster failed
(2) zookeeper deletes the "/katta/master" zookeeper node
(3) zookeeper tells SecondaryMaster about the above event (in func processDataOrChildChange of class ZKClient.java)
(4) Set listeners = { SecondaryMaster's ZKClient instance }; (ZKClient.java +511)
(5) SecondaryMaster tries to resubscribe "/katta/master" (ZKClient.java +513); listeners = empty_Set after func resubscribeDataPath(event, path, listeners) was called.



I made a simple patch for this bug, code as below; if you have a better patch, please tell me.

in func resubscribeDataPath of class ZKClient.java (line 534 ~ 545)
======================= below ===========================================
private byte[] resubscribeDataPath(WatchedEvent event, final String path, final Set listeners) {

cannot start as Quorum Zookeepers in katta-0.4, but I didn't try katta-0.5; if you can do the job, please tell me.
Finally, I started the Quorum Zookeepers outside katta (that is, start 3 zookeepers first, and then start the katta master or katta nodes connected to the 3 zookeepers)








in Node.java +469 ~ 472 (func getDocFreqs of class Node)
BUG and Patch as below:
====================== below ==========================
- final java.util.Iterator termIterator = termSet.iterator();
int numDocs = 0;
for (final String shard : shards) {
+ final java.util.Iterator termIterator = termSet.iterator();
while (termIterator.hasNext()) {
====================== above ==========================
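The effect of the patch can be reproduced without Katta. An Iterator is exhausted after one pass, so when it is created outside the shard loop only the first shard sees any terms. The class below is a stand-alone demonstration with made-up names; it counts how many (shard, term) pairs each variant visits.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Stand-alone demonstration of the bug pattern; not Katta's Node.getDocFreqs.
class IteratorReuseSketch {

    // Buggy pattern: one iterator shared by all shards.
    static int visitsWithSharedIterator(List<String> shards, Set<String> terms) {
        Iterator<String> termIterator = terms.iterator();
        int visits = 0;
        for (String shard : shards) {
            while (termIterator.hasNext()) {
                termIterator.next();
                visits++;
            }
        }
        return visits;
    }

    // Patched pattern: a fresh iterator for every shard.
    static int visitsWithFreshIterator(List<String> shards, Set<String> terms) {
        int visits = 0;
        for (String shard : shards) {
            Iterator<String> termIterator = terms.iterator();
            while (termIterator.hasNext()) {
                termIterator.next();
                visits++;
            }
        }
        return visits;
    }

    public static void main(String[] args) {
        List<String> shards = Arrays.asList("shardA", "shardB", "shardC");
        Set<String> terms = new LinkedHashSet<>(Arrays.asList("foo", "bar"));
        System.out.println("shared iterator: " + visitsWithSharedIterator(shards, terms));
        System.out.println("fresh iterator:  " + visitsWithFreshIterator(shards, terms));
    }
}
```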

in Node.java +383 (func startRPCServer of class

Node)

BUG: if (configuration.getStartPort() - serverPort

0

at org.apache.zookeeper.ZooKeeper.validatePath(ZooKeeper.java:534)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1078)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1119)

When you do "ant -p" you do not see the 'eclipse' and 'dist' targets listed.
This is because of the missing descriptions for these targets in the build.xml file.










I have experienced a random test failure for

net.sf.katta.zk.ZKClientTest.

It might be a timing related issue.



I checked out the latest sources, and executed

the following steps :

ant clean

ant compile

ant test





In this folder should be a sub folder with our

coverage reports.



For now we only want to check that each java file

has the license header.










I'm seeing a lot of the following errors in the test logs. This might be related to the zookeeper update.
java.lang.RuntimeException: Exception while restarting zk client
at net.sf.katta.zk.ZKClient.reconnect(ZKClient.java:568)
at net.sf.katta.zk.ZKClient.processDisconnect(ZKClient.java:488)
at net.sf.katta.zk.ZKClient.process(ZKClient.java:466)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:366)
Caused by: net.sf.katta.util.KattaException: unable to create path '/katta/node-to-shard/jemez:20017' in ZK
at net.sf.katta.zk.ZKClient.create(ZKClient.java:296)
at net.sf.katta.zk.ZKClient.create(ZKClient.java:275)
at net.sf.katta.node.Node.announceNode(Node.java:183)
at net.sf.katta.node.Node.handleReconnect(Node.java:147)
at net.sf.katta.zk.ZKClient.reconnect(ZKClient.java:565)










KATTA-16
Right now the IClient-Node interaction is defined by the ISearch API, which is very limited. This API is very focused on search: it does not give users the ability to programmatically create Lucene Query objects, and we always expect a HitsMapWritable.
The idea is to generalize this API so it can be used for more than just search, e.g. content lookup (cached pages, facet search etc). We could have a simple Response request(Request) method and encode the kind of request into the request object. Since the lucene query object is serializable, it should be easy to write a writable request object that transports all sorts of requests, including a lucene query.
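A rough shape for the "Response request(Request)" idea is sketched below. All types here (Request, Response, Handler, Node) are hypothetical and deliberately minimal: the payload is a plain String where a real implementation would presumably carry a Writable, and the dispatch by request type is just one possible way to encode the kind of request.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a generic request API; none of these types exist in Katta.
class GenericRequestSketch {

    static final class Request {
        final String type;    // e.g. "search" or "details"
        final String payload; // stand-in for a serialized query or document id
        Request(String type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    static final class Response {
        final String body;
        Response(String body) {
            this.body = body;
        }
    }

    interface Handler {
        Response handle(Request request);
    }

    // A node exposes a single entry point and dispatches on the request type.
    static final class Node {
        private final Map<String, Handler> handlers = new HashMap<>();

        void register(String type, Handler handler) {
            handlers.put(type, handler);
        }

        Response request(Request request) {
            Handler handler = handlers.get(request.type);
            if (handler == null) {
                throw new IllegalArgumentException("unsupported request type: " + request.type);
            }
            return handler.handle(request);
        }
    }

    public static void main(String[] args) {
        Node node = new Node();
        node.register("search", r -> new Response("hits for " + r.payload));
        node.register("details", r -> new Response("details of " + r.payload));
        System.out.println(node.request(new Request("search", "foo:bar")).body);
    }
}
```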









It looks like we run into conflicts in our zookeeper structure if we deploy two different indexes that have shards with the same name.
Eg:
+ indexA
++ Shard1
+ indexB
++ Shard1
At least Shard1 will be a problem in the shard2Node path.
I suggest we just name the shards indexName_shardName since this would make it unique.

Zookeeper 2.x is known to have problems during reconnection.










all
Hi,

I did some tests with retrieving the search result details in parallel using a thread pool (i.e. executing details = client.getDetails(hit) in parallel).
For retrieving the first 20 results I saw a speed improvement on the order of 5-10 times compared to single-threaded retrieval.



The actual implementation is trivial especially

with the awesome Concurrent package

(ThreadPool and CountDownLatch).



IMHO, a simple and very rewarding

improvement.

-Erich
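A sketch of the parallel retrieval, under assumptions: the lookup is passed in as a plain function standing in for client.getDetails(hit), and Futures are used instead of the CountDownLatch mentioned above because they keep the results in hit order. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper; getDetails stands in for a call such as client.getDetails(hit).
class ParallelDetailsSketch {

    // Fetches the details of all hits on a fixed-size pool; results keep the order of the hits.
    static <H, D> List<D> fetchAll(List<H> hits, Function<H, D> getDetails, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<D>> futures = new ArrayList<>();
            for (H hit : hits) {
                Callable<D> task = () -> getDetails.apply(hit);
                futures.add(pool.submit(task));
            }
            List<D> details = new ArrayList<>();
            for (Future<D> future : futures) {
                details.add(future.get()); // blocks until this lookup has finished
            }
            return details;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Integer> hits = Arrays.asList(1, 2, 3);
        System.out.println(fetchAll(hits, hit -> "details-" + hit, 2));
    }
}
```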





This is used in

net.sf.katta.indexing.SequenceFileCreator and

there is an easy replacement for this:

java.io.ByteArrayOutputStream










I did the following tests from the command line



pvoss$ bin/katta search testIndex foo:bar 2
4 hits found in 0.011sec.

| Hit | Node | Shard | DocId | Score |
=======================================================================
| 0 | Senor-Vossi.local:20000 | testIndex_aIndex | 0 | 6.811141 |
-----------------------------------------------------------------------
| 1 | Senor-Vossi.local:20000 | testIndex_bIndex | 0 | 5.898621 |
-----------------------------------------------------------------------

$ bin/katta search testIndex foo:bar 3
4 hits found in 0.012sec.

| Hit | Node | Shard | DocId | Score |
=======================================================================
| 0 | Senor-Vossi.local:20000 | testIndex_cIndex | 0 | 6.811141 |
-----------------------------------------------------------------------
| 1 | Senor-Vossi.local:20000 | testIndex_aIndex | 0 | 6.811141 |
-----------------------------------------------------------------------










The idea is to run load tests on ec2 to make performance a first-class citizen.
+ write a script that creates an ec2 cluster
+ deploy the current sources on this cluster
+ write a class that generates a test index
+ start katta, deploy the test index
+ run a http://faban.sunsource.net test
+ graph the result
+ shut down the cluster.







It is common in large scale search deployments to separate (at least) raw search and content retrieval. The query types and volumes are very different in these engines, so the number of shards is different as well. The problems of index deployment and management are the same, however.



This indicates that katta should have the

following extensions:



- abstract out a small KattaMangeable interface

to allow manageable instances to be managed

- extend the configuration to allow multiple pools

to be managed

- extend the client software to allow different

pools to be queried



Zookeeper 2 assumes that a client will re-

establish all watches. katta doesn't do this so that

in environments like EC2 where disconnects are

pretty common, there are serious problems.



The simple solution is to move to ZK 3. We have

done this and should have a patch available

shortly.

It is nice to have an integrated zookeeper cluster,

but it should be possible to use an external

cluster for production deployments.






Make katta able to run on ec2.
The idea is that we port all hadoop ec2 scripts to katta.
We need a script to create an AMI and to manage katta scripts.
We need to investigate which ports we need to open to access katta nodes as a client.
I removed from the ec2 script the line: -e 's|# export KATTA_SLAVE_SLEEP=.*|export KATTA_SLAVE_SLEEP=1|' \
in section:

# Configure Katta
sed -i -e "s|# export JAVA_HOME=.*|export JAVA_HOME=/usr/local/jdk${JAVA_VERSION}|" \
-e 's|# export KATTA_LOG_DIR=.*|export KATTA_LOG_DIR=/mnt/katta/logs|' \
-e 's|# export KATTA_SLAVE_SLEEP=.*|export KATTA_SLAVE_SLEEP=1|' \
-e 's|# export KATTA_OPTS=.*|export KATTA_OPTS=-server|' \
/usr/local/katta-$KATTA_VERSION/conf/katta-env.sh





It makes quite some sense to have a slave sleep in big installations, so we might want to add it.







When merging is run from a cron job, it might make sense to skip indexes that exceed a certain size or document count.

With a configurable threshold, small indexes would be merged together and big indexes would remain untouched.
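The threshold idea could look roughly like the sketch below. It is only an illustration: the class name, the method, and the idea of passing document counts in as a map are assumptions, not existing katta code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class MergeCandidates {
    /**
     * Returns the names of indexes whose document count is below maxDocs,
     * in the iteration order of the given map. Hypothetical helper.
     */
    static List<String> select(Map<String, Long> docCountByIndex, long maxDocs) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Long> entry : docCountByIndex.entrySet()) {
            if (entry.getValue() < maxDocs) {
                candidates.add(entry.getKey());
            }
        }
        // A merge needs at least two inputs; otherwise there is nothing to do.
        return candidates.size() >= 2 ? candidates : new ArrayList<>();
    }
}
```

The same shape would work for a size-in-bytes threshold; which of the two is the better trigger is an open question here.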



Currently it's not possible to start more than one node out of the same katta distribution without reconfiguring the shard-folder in conf/katta.node.properties.

It might make sense for each node to store its shards in //nodeName.


The benefit of shipping the shards seems questionable. In particular, it makes the merging a lot harder.

Marko is the expert here and knows more.

Currently metadata in the zookeeper system is

stored in Writables (like IndexMetadata).

Does it make sense to replace these writables

with a MapWritable implementation ?

This should give more freedom to future changes.





I think the master-failover mechanism is broken because of invalid handling of the ephemerals.










A shutdown of the cluster (bin/stop-all.sh) leaves the ephemeral nodes (of master and nodes) behind.

On the next start, the ephemerals are still there (but not connected to any owner). The master and all nodes delete the unconnected ephemerals with their address and create new ones.

This leads to a lot of node connected / node disconnected events in the master log on startup, and it looks like something is wrong. Especially on a large cluster the startup log looks very confusing. But redeployments shouldn't happen, thanks to the safe mode.



I think there are several solutions.

It is possible to re-own unconnected ephemerals if the session id and password are known. So if, for example, a node persisted its ZooKeeper sessionId and password, it could re-own its ephemeral on startup if it still exists.



What is already there is a shutdown hook on the node side. This hook tries, among other things, to delete the node's ephemeral.

The problem with this is the stop-all script, since it stops the master with the zk process first and then the node. So when the hook executes, there is no zk system left to communicate with.

I think it would be a good move to decouple the master and ZooKeeper processes and stop the ZooKeeper process last in the stop-all.sh script.

KATTA-6, KATTA-7 The ephemeral node (http://hadoop.apache.org/zookeeper/docs/r3.0.1/zookeeperProgrammers.html#Ephemeral+Nodes) handling seems to be suboptimal in a lot of cases. Please see subtasks.



Nodes receive search requests for one or more shards at a time. The searches are executed sequentially but should rather be executed in parallel.
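A minimal sketch of the parallel variant, using only java.util.concurrent. The class name and the searchOneShard function are placeholders for whatever the node does per shard today; nothing here is taken from the katta code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelShardSearch {
    /** Runs searchOneShard for every shard on a thread pool and returns the results in shard order. */
    static <R> List<R> searchAll(List<String> shards, Function<String, R> searchOneShard, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<R>> tasks = new ArrayList<>();
            for (String shard : shards) {
                tasks.add(() -> searchOneShard.apply(shard));
            }
            List<R> results = new ArrayList<>();
            // invokeAll blocks until all tasks finish and keeps the task order.
            for (Future<R> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while searching shards", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("shard search failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A real node would probably keep one long-lived pool instead of creating one per request; the per-call pool just keeps the sketch self-contained.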






I think we don't utilize the full power of Lucene indexing. For example, we have our own flushing mechanism, while Lucene has and promotes a RAM-based flushing scheme. I already added a lot of TODOs to the index-related code that needs to be revised.

Taking a quick look at the search code of the node, it seems to me that there is one Lucene analyzer hardcoded.

Instead, the analyzer should depend on the index, since we allow an analyzer to be specified per index.

This should be checked!

The Lucene analyzer used for indexing is specified through IDocumentFactory.

I would think decoupling the analyzer from the factory makes sense. It should be easy to just specify the class name of the analyzer in katta.index.properties.










The bixo release includes a version of the ec2-api-tools (bin/ec2/support/ec2-api-tools-1.3-30349). This is at least a couple of versions older than the latest set of tools (http://developer.amazonwebservices.com/connect/entry.jspa?externalID=351)



While having the tools bundled with bixo removes the additional step of downloading them, it does mean that anyone who already has the tools now has to deal with multiple versions.



In order to decouple the tools, one could start by changing setenv.sh to honor the default EC2 shell variables (as per http://docs.amazonwebservices.com/AWSEC2/2008-12-01/GettingStartedGuide/index.html?setting-up-your-tools.html)







I noticed in a dry run for 10000 domains that some webmasters use crawl-delay = 0.

Existing FetcherPolicy has this code:

public FetchRequest getFetchRequest(int maxUrls) {
    int numUrls = Math.min(maxUrls, (int)(DEFAULT_FETCH_INTERVAL / _crawlDelay));
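With crawl-delay = 0, that division either throws (integer operands) or produces an unbounded count (floating-point operands). A minimal sketch of a guard follows; the class name, the constant's value, and the long-typed crawlDelay parameter are assumptions standing in for the real FetcherPolicy fields.

```java
final class CrawlDelayGuard {
    // Hypothetical value; stands in for the real DEFAULT_FETCH_INTERVAL.
    static final long DEFAULT_FETCH_INTERVAL = 60_000L;

    /** Returns how many URLs to request; a crawl delay of zero or less means the delay imposes no limit. */
    static int numUrls(int maxUrls, long crawlDelay) {
        if (crawlDelay <= 0) {
            // Avoids dividing by zero; only maxUrls bounds the request.
            return maxUrls;
        }
        long limitFromDelay = DEFAULT_FETCH_INTERVAL / crawlDelay;
        return (int) Math.min((long) maxUrls, limitFromDelay);
    }
}
```

Whether a zero crawl delay should mean "no limit" or should be replaced by some minimum delay is a policy decision that the sketch does not settle.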





Defaults to English



See [BIXO-82] for related issue.

Need to verify what has to happen in client to

properly handle decompression of response.
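As a starting point for that check, here is a sketch of decoding a gzip-encoded body with the JDK's GZIPInputStream. It assumes the body is already in memory as a byte array and that the Content-Encoding header value is passed in as a string; the class and method names are made up for illustration, and deflate is not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class ResponseDecoder {
    /** Returns the body unchanged unless contentEncoding is gzip or x-gzip. */
    static byte[] decode(byte[] body, String contentEncoding) {
        if (contentEncoding == null) {
            return body;
        }
        String encoding = contentEncoding.trim().toLowerCase(Locale.ROOT);
        if (!encoding.equals("gzip") && !encoding.equals("x-gzip")) {
            return body;
        }
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for trying the decoder without a server: gzip-compresses data. */
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```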



See [BIXO-82] for related issue.








Currently if you specify an existing empty

directory as the output dir for SimpleCrawlTool,

you get a confusing message that says something

about there not being any previous crawl data.



Better would be to either treat this as an initial

crawl by default, or specify in the error message

that an initial crawl must use a non-existing

directory for the output dir.



And changing the param name to "crawldir" from

"outputdir" would also make things clearer.



See Nutch Http.java:



{noformat}
// Set the User Agent in the header
headers.add(new Header("User-Agent", userAgent));
// prefer English
headers.add(new Header("Accept-Language", "en-us,en-gb,en;q=0.7,*;q=0.3"));
// prefer UTF-8
headers.add(new Header("Accept-Charset", "utf-8,ISO-8859-1;q=0.7,*;q=0.7"));
// prefer understandable formats
headers.add(new Header("Accept", "text/html,application/xml;q=0.9,application/xhtml+xml,text/xml;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"));
// accept gzipped content
headers.add(new Header("Accept-Encoding", "x-gzip, gzip, deflate"));
hostConf.getParams().setParameter("http.default-headers", headers);
{noformat}

See [KATTA-86] for similar issue that Stefan

resolved.

Enable linking in Jira, e.g. depends on, informs,

related to.

These should go into docs/licenses/



See [BIXO-4] for a related issue.




Currently we build the tarball (bixo-dist-

.tgz) in the tagged release branch, and

push this to GitHub.



But this bloats the size of the repo (> 500MB

currently) and the push takes a long time, as the

tarball is > 30MB and growing.



A better solution would be to have the dist target

copy the resulting tarball to the Nexus repository,

or to at least manually deploy it there for the

time being.



The doc/releasing.txt procedure document would

need to be updated. It should also include a step

where the Bixo web site gets updated to include a

link to the latest release on Nexus. See [http://bixo.101tec.com/wp-admin/page.php?action=edit&post=19&message=1]

Web ARChive files are enhanced versions of the

.arc files.



It would be great if crawl results could be

read/written using this format, via a Cascading

scheme.



For more info, see:



* http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
* [The WARC File Format (ISO 28500) - Information, Maintenance, Drafts|http://bibnum.bnf.fr/WARC/]
* [ISO 28500:2009|http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717]



Heritrix has support for WARC 1.0, so we should




Currently we have an IndexScheme class in Bixo

that extends the abstract Cascading Scheme. This

lets us write out Lucene indexes as part of a

Cascading flow, via a tap that uses this scheme.



It would be great to have something similar that

lets us create the Lucene index using a Solr-

defined schema. I believe this involves:



* Embedding Solr.

* Providing a mapping from tuples to Solr fields.



See [https://issues.apache.org/jira/browse/NUTCH-760] for a similar issue on the Nutch side of the fence, though we don't need to worry about search support.









Add option to extract N random URLs from the

compressed dmoz-links.zip file that the user

would need to download from the Nexus

repository.
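One way to pick N random URLs without loading the whole file is reservoir sampling. The sketch below works on any iterator of lines; reading the lines out of dmoz-links.zip is left out, and the assumption that the archive holds one URL per line is mine, not something the issue states.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

final class UrlSampler {
    /** Picks up to n items uniformly at random from a stream of unknown length (reservoir sampling). */
    static List<String> sample(Iterator<String> urls, int n, Random random) {
        List<String> reservoir = new ArrayList<>(n);
        long seen = 0;
        while (urls.hasNext()) {
            String url = urls.next();
            seen++;
            if (reservoir.size() < n) {
                // The first n items always go in.
                reservoir.add(url);
            } else {
                // Later items replace a random slot with probability n / seen.
                long slot = (long) (random.nextDouble() * seen);
                if (slot < n) {
                    reservoir.set((int) slot, url);
                }
            }
        }
        return reservoir;
    }
}
```

Passing the Random in makes a run repeatable with a fixed seed, which is handy for tests.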



BIXO-74, BIXO-75 Provide better support for using DMOZ data in

Bixo.










In order for Bixo to get auto-deployed to the

Maven central repository, we have to provide

rsync (preferably over ssh) access to the

repository.



References for how to do this:



* General info: http://maven.apache.org/guides/mini/guide-central-repository-upload.html
* Useful steps, bash script: http://vafer.org/blog/20081026142413
* Maven public key: http://www.ibiblio.org/maven/id_dsa.pub



I know, it's a pain in the butt...sorry about that.









Process described here: http://maven.apache.org/guides/mini/guide-central-repository-upload.html







BIXO-67, BIXO-69, BIXO-70, BIXO-71, BIXO-72 List of tasks required to make this happen










I want to get Bixo into the Maven central

repository, as that makes it very easy for people

using Ivy or Maven to grab the jar.



To do this, I need to be able to push Bixo to a

repository that I control, and then submit a

request in the Maven Jira system to set up for

auto-syncing from my repository to the Maven

central repository. See [http://maven.apache.org/guides/mini/guide-central-repository-upload.html] for full details.



Seems like Nexus is a good option for a small,

lightweight repository manager.



* Download here - [http://nexus.sonatype.org/download-nexus.html]
* Documentation here - [http://www.sonatype.com/books/nexus-book/reference/]







Currently the bixo-core jar includes things like

ICU4J, which is only used in the parse pipe. And

the lucene jar is only needed in the index pipe.



By breaking it up, the size of the job jar needed

for fetching gets much smaller.



Don't know if we'll need a bixo-test, currently

not.










We want to switch to a more pure Java solution,

versus all of the Hadoop Bash scripts that don't

provide very good error checking, and aren't very

efficient since they wind up creating a JVM one or

more times for each command, in order to run

code.



It's likely we'd still want to keep around the old

scripts, for things like creating an AMI, but use

new scripts as thin wrappers that run Java code

to create a cluster, proxy it, push Bixo, etc.



TeamCity (continuous integration build server)

wants to send emails when the build is broken.



But the bixo-dev Yahoo mailing list will only

accept emails from the verified email addresses

of registered members.



And since TeamCity is a bot, even creating a

"Team City" user and adding them to bixo-dev

(which I've done) still isn't sufficient, as the email

that Yahoo sends to oss-teamcity@101tec.com

for verification won't get answered.










* Target m1 (small) EC2 instance, so 32-bit FC6?

* Get recent version of Java installed.

* Install LZO support, and re-enable use of the

org.apache.hadoop.io.compress.LzoCodec code

in the list of io.compression.codecs, in the

hadoop-ec2-init-remote.sh script.

* Set thread stack size at boot time.

* Set large max open file limit at boot time (e.g.

65K)

* Make sure noatime is specified.

* Configure & auto-start nscd service

* Configure Hadoop



It would be great to have everything from the

hadoop-ec2-init-remote.sh script pre-configured

in the AMI. This is located in bin/ec2/hadoop-

aws/etc/









It's useful to know when the content has been truncated, especially for parsers that might not work properly with truncated (typically binary) content.

Currently the check for whether the fetch has run

past the target duration is only made when

requesting a FetchList, but once the list has been

handed off to a FetcherRunnable, it runs until

done.



So if a site is really slow (e.g. every URL will time

out), you can get a FetchList that will take a long

time to process.
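One possible fix is to also check the deadline between URLs while a list is being processed. The sketch below only illustrates that idea; the class, the injected clock, and the fetchOne callback are assumptions and do not mirror the real FetcherRunnable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

final class DeadlineAwareFetch {
    /**
     * Fetches URLs in order until the clock reaches deadlineMillis.
     * Returns the URLs that were skipped because time ran out.
     */
    static List<String> run(List<String> urls, Consumer<String> fetchOne,
                            long deadlineMillis, LongSupplier clock) {
        List<String> skipped = new ArrayList<>();
        for (String url : urls) {
            if (clock.getAsLong() >= deadlineMillis) {
                // Past the target duration: do not start another fetch.
                skipped.add(url);
            } else {
                fetchOne.accept(url);
            }
        }
        return skipped;
    }
}
```

This only stops new fetches from starting; a single request that hangs still needs its own socket timeout.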

We need a script for starting up an EC2 cluster

and deploying Bixo










We need to:



* Re-create problem seen during intentionally

slow fetches from Facebook

* Use HttpRequestRetryHandler. See [http://hc.apache.org/httpcomponents-client/tutorial/html/fundamentals.html#d4e246]

* Call abort if we don't read the entire result back







And then remove these jars from our lib/

directory.



For jars that we need to keep in our lib/ directory

(not in Maven) we should create an Ivy

dependency xml file.



We might also want to use a different Ivy cache,

to avoid the dreaded "Ivy created a fake

dependency file" problem.

Since we now have a URL normalizer interface as well as a URL filter, the filter should return a boolean value, not a string.

Examples of these are in Nutch's urlnormalizer-

regex plugin, or rather the regex-normalize.xml

file (see attachment).







Currently both Eclipse and Ant put their classes in

the same location, which confuses Eclipse

sometimes after an Ant build.



Better would be to have .../build/eclipse as a

directory for all Eclipse classes, and clean-eclipse

should delete this directory.





Use ICU, plus some Nutch code, to create a post-

fetch operation that adds language meta-data.










We need a class that can filter by URL, and by

content-type.



For URL filtering, common approaches are:



* By protocol (no https)

* By domain (only ibm.com, not blogger.com)

* By suffix (no .zip)

* By query (no query strings)



For content-type filtering, it can be



* By entire content type (only text/html or

text/xhtml)

* By main content type (only text/)
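The URL half of this could be sketched as below, covering the four approaches listed (protocol, domain, suffix, query). The class name and constructor parameters are hypothetical; content-type filtering is left out.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

final class SimpleUrlFilter {
    private final Set<String> allowedSchemes;
    private final Set<String> allowedDomains;
    private final Set<String> blockedSuffixes;
    private final boolean allowQuery;

    SimpleUrlFilter(Set<String> allowedSchemes, Set<String> allowedDomains,
                    Set<String> blockedSuffixes, boolean allowQuery) {
        this.allowedSchemes = allowedSchemes;
        this.allowedDomains = allowedDomains;
        this.blockedSuffixes = blockedSuffixes;
        this.allowQuery = allowQuery;
    }

    /** Returns true if the URL passes every rule; unparsable URLs are rejected. */
    boolean accept(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return false;
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return false;
        }
        if (!allowedSchemes.contains(scheme.toLowerCase(Locale.ROOT))) {
            return false;
        }
        host = host.toLowerCase(Locale.ROOT);
        boolean domainAllowed = false;
        for (String domain : allowedDomains) {
            // Matches the domain itself and any of its subdomains.
            if (host.equals(domain) || host.endsWith("." + domain)) {
                domainAllowed = true;
                break;
            }
        }
        if (!domainAllowed) {
            return false;
        }
        String path = uri.getPath() == null ? "" : uri.getPath().toLowerCase(Locale.ROOT);
        for (String suffix : blockedSuffixes) {
            if (path.endsWith(suffix)) {
                return false;
            }
        }
        return allowQuery || uri.getRawQuery() == null;
    }
}
```

All sets are expected to hold lower-case values, for example "http", "ibm.com" and ".zip".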



Something that can be used outside of Bixo if

needed. Normalization includes:



* Adding "http://" as default protocol if URL is

missing this.

* Adding "www." if URL hostname is just a PLD.

* Lower-case everything

* Get rid of default port

* Add trailing slash if no path

* Converting %hh to regular characters, if these

chars are valid in a URL

* Escaping characters that aren't valid in a URL

* Converting '+' to %20

* Get rid of anchors

* Remove trailing '?'



* Clean up paths that use ".." to go up in a

directory

* Clean up paths that have '//'

* Stripping out session ids



We could also try to detect URLs encoded using

8859-1/CP1252 and convert to UTF-8
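A few of these steps can be sketched with java.net.URI alone: default protocol, default port, trailing slash, anchors, and a trailing '?'. The class name is made up, and the sketch lower-cases only the scheme and host rather than everything; percent-decoding, escaping, "www." handling, path clean-up and session ids are not covered.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class SimpleUrlNormalizer {
    /** Applies a subset of the normalization steps; returns null if the URL cannot be parsed. */
    static String normalize(String url) {
        String candidate = url.trim();
        if (!candidate.contains("://")) {
            candidate = "http://" + candidate; // default protocol
        }
        URI uri;
        try {
            uri = new URI(candidate);
        } catch (URISyntaxException e) {
            return null;
        }
        if (uri.getScheme() == null || uri.getHost() == null) {
            return null;
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort();
        boolean defaultPort = port == -1
                || (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // trailing slash if there is no path
        }
        String query = uri.getRawQuery();
        StringBuilder out = new StringBuilder(scheme).append("://").append(host);
        if (!defaultPort) {
            out.append(':').append(port);
        }
        out.append(path);
        if (query != null && !query.isEmpty()) {
            out.append('?').append(query); // a bare trailing '?' is dropped
        }
        return out.toString(); // the fragment (anchor) is never copied over
    }
}
```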










Includes porting Nutch's robots.txt parser.



Don't worry about different hostnames ==

different robots.txt for now.

We might want to index the anchor text for a page as well, and not just the page content. I'm sure this will greatly improve the search result quality.



Currently we have src/test-data and src/test/resources, which both host test data.

We should only have one folder that hosts test data.

The idea of src/test/resources is that we store config files there, which is why it is on the classpath. We should not store test data there but move it into src/test-data.

See [http://www.jakobhoman.com/2007/11/quick-tour-of-hadoops-reporter-object.html] for details.



Examples of things we could/should be reporting:



* Increment a counter for fetching a page

* Increment a counter for queuing up a set of

URLs for a domain

* Increment a counter for each URL being queued

* Update status every 5 seconds or so with

number of active threads, throughput values?










The download script currently doesn't work on

my Mac, due to dependency on wget (which is

not part of a standard Mac install, IIRC).



And this is definitely a longer-running test,

especially with the need to fetch a large amount

of data.



But if the test data is auto-cached, then it's

appropriate to be run as one of the long tests

that automatically get executed on the CI build

machine.

* Note about needing Java 1.6, how to create a

build.sh and run.sh to handle this.

* Note about {{ant clean-eclipse eclipse}} to

create Eclipse project files.

* Note about how to bump memory in Eclipse to

run some of the tests.

* Note about using Ivy to handle dependencies,

and thus lib/ is just a container (now) for

potential jars that will be dynamically added.










While testing a fetch of a bunch of Apache.org

URLs, I noticed that keep-alive wasn't working as

well as I'd expected.



The problem was that xxx.apache.org would

often wind up using a different server, based on

the sub-domain. So we'd lose our keep-alive

connection whenever we switched between

servers.



What I need to do is convert hostnames to IP

addresses for the sub-set of top URLs we're going

to try to fetch, given the PLD/fetch

duration/remaining time constraints. Then use

this to group by same IP, and inside the

FetcherQueue implementation I don't want to

pass back a FetchList that spans IP addresses.



Note that only doing this IP processing for a sub-

set of URLs is going to be more efficient than

what Nutch does, where it converts all potential

URLs to IP addresses first. This places huge load

on the DNS system, and can cause a big slow-

down.
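The grouping step could be sketched as follows. The resolver is passed in as a function so that real DNS (for instance a wrapper around InetAddress.getByName) or a cache can be plugged in; the class and method names are hypothetical and not part of the FetcherQueue code.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class IpGrouper {
    /**
     * Groups URLs by the address their hostname resolves to.
     * Each hostname is resolved at most once; a resolver result of null means "unresolvable".
     */
    static Map<String, List<String>> groupByIp(List<String> urls, Function<String, String> resolver) {
        Map<String, String> hostToIp = new LinkedHashMap<>();
        Map<String, List<String>> urlsByIp = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // no usable hostname
            }
            String ip = hostToIp.computeIfAbsent(host, resolver);
            if (ip == null) {
                continue; // unresolvable host
            }
            urlsByIp.computeIfAbsent(ip, key -> new ArrayList<>()).add(url);
        }
        return urlsByIp;
    }
}
```

Since only the URLs passed in are resolved, calling this on the selected sub-set keeps the DNS load bounded, which is the point made above.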



This was originally forked from Nutch, and thus

there are classes we don't need, and other

changes we should make, to do a cleaner job of

integrating it into Bixo.



Also I need to fix up the copyright headers

(combo Apache/Bixo).

We need to do a review of all tuples used to pass

information between operations, to ensure



* Consistent naming

* Appropriate information



It would be good to get Chris's input on this as

well.








Currently there are lots of TODO notes in the

fetcher code, for places where settings should be

controlled via some configuration value.



In Nutch this was handled by passing a JobConf

everywhere, which always felt awkward.



We need to decide on our approach and then

implement it.



One idea would be to have a bean we use for

settings, and inject that into the Cascading conf.







Configure it to match fetcher settings.

See Nutch implementation for HttpClient 3.1

handling of this.



Note, though, that HttpClient 4 has a significantly

different (and better) approach to this.



* [http://hc.apache.org/httpcomponents-client/httpclient/apidocs/org/apache/http/impl/client/AbstractHttpClient.html#setHttpRequestRetryHandler(org.apache.http.client.HttpRequestRetryHandler)]
* [http://hc.apache.org/httpcomponents-client/httpclient/apidocs/org/apache/http/client/HttpRequestRetryHandler.html]

Currently it's the default, which is 2. So if

somebody runs with threads-per-server > 2, this

will cause failures.



See http://marc.info/?l=httpclient-users&m=123925647125933&w=2










See http://marc.info/?l=httpclient-users&m=123869610506345&w=2



But make sure we can handle connection failures

properly first. Though even with the stale

connection check, a failure could sometimes

happen if the server closes the connection in the

(small) window between the check and the

request.



PS - Oleg says that stale connection checking is evil :)

For consistency in naming.

This should run all three types of tests (unit,

integration, and long).

We'll have three classes of tests: unit, integration,

and long.



Unit tests should be run as part of the {{ant test}}

command. But the other two tests should be

{{ant test-integ}} and {{ant test-long}} targets.



I don't know how to do this with JUnit 4 (it's easy

with TestNG, using the @Test(groups = "xxx")

annotation).

Idea is that we can write status and content of a

fetch into separate folders to later on process

only the data we need.



Now that we have crawling simulation toolkit, we

should clean up our tests.



All tests that actually need an internet connection should go into an integration tests package. Maybe we should also move the tests that run very long into such a package.










We could actually sort the UrlWithScoreTuples, since the buffer is a reducer and we could use a Hadoop secondary sort there, which is supported by Cascading.

Right now the fetcher is a map reduce job, though for optimal flexibility we actually want to have a fully Cascading-based fetcher.

I created a localhost server class that can be

easily used to simulate all sorts of webserver

behavior.

It lets you provide an HttpHandler that has full access to the HTTP headers and response, so it is easy to simulate anything from content types to a slow server.
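To illustrate the idea, here is a sketch of a slow server built on the JDK's own com.sun.net.httpserver.HttpServer. This is a stand-in, not the server class described above; the class name and parameters are invented for the example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

final class SlowServerSketch {
    /** Starts a server on a free loopback port that waits delayMillis before answering each request. */
    static HttpServer start(long delayMillis, String contentType, String body) {
        try {
            HttpServer server = HttpServer.create(
                    new InetSocketAddress(InetAddress.getLoopbackAddress(), 0), 0);
            server.createContext("/", exchange -> {
                try {
                    Thread.sleep(delayMillis); // simulate a slow web server
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", contentType);
                exchange.sendResponseHeaders(200, bytes.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(bytes);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The port chosen by the system is available through server.getAddress().getPort(), and server.stop(0) shuts the server down after a test.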





The idea is to simulate web crawls without hitting websites.

We need a simulation that allows us to measure

all sorts of things to optimize crawler

performance etc.



We need a test servlet that allows us to simulate

all sorts of things like hanging connections,

redirects etc.

Here we store robots.txt and other general information for PLDs.









One that works well with the Cascading/Hadoop

model of getting an iterator of URLs for a given

PLD or IP address "key".










Make sure we set flow connector properties in all flows:

Map properties = getProperties();
FlowConnector.setApplicationJarClass(properties, CascadingClusterTest.class);
FlowConnector flowConnector = new FlowConnector(properties);

Actually, URL normalization does not matter at fetch time, because the server will return the right URL.

I think it might be important for link analysis, to make sure we address the same page even though the URL is different.

That said, in the context of duplicated pages, web spam, etc., we should consider using the MD5 of the page as the key and not the URL.

Given that, URL normalization makes no sense, or am I missing something?

We should have the source version as part of the jar. See katta -version for an example. We need to extract git information in Ant, though.

git-show might be a good starting point.



I don't think we need the following files in our conf folder. Should we delete them?

hadoop-default.xml

hadoop-site.xml

masters

slaves



Do we need a license file in the root folder?

What should the header look like? Does it need to contain license info?

We need the header.

Should we have an Ant task that checks that the header in *.java files is correct? I had this in another project, and it was quite handy.










Basic approach is:



# Split hostname of URL at '.'

# if pieces <=2 then return as-is

# If pieces == 4 && isValidIPv4 then return as-is

# If pieces == 6 && isValidIPv6 then return as-is

# if lowercase(last piece) == valid country code

then:

#* If lowercase(second to last piece) == valid

short TLD (e.g. "co") then return last three pieces

#* Else return last two pieces

# Else return last two pieces
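The steps above can be sketched as follows. The two sets are tiny, hypothetical samples (real lists are much longer), and the class and method names are invented. The IPv6 step is left out: an IPv6 literal is colon-separated, so it would need different splitting than the '.' split used here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class PaidLevelDomainSketch {
    // Hypothetical sample data, only for illustration.
    static final Set<String> COUNTRY_CODES = new HashSet<>(Arrays.asList("uk", "jp", "de", "au"));
    static final Set<String> SHORT_TLDS = new HashSet<>(Arrays.asList("co", "com", "org", "net", "ac"));

    static String paidLevelDomain(String hostname) {
        String[] pieces = hostname.split("\\.");
        int n = pieces.length;
        if (n <= 2) {
            return hostname;
        }
        if (n == 4 && isValidIPv4(pieces)) {
            return hostname;
        }
        String last = pieces[n - 1].toLowerCase(Locale.ROOT);
        String secondToLast = pieces[n - 2].toLowerCase(Locale.ROOT);
        if (COUNTRY_CODES.contains(last) && SHORT_TLDS.contains(secondToLast)) {
            // e.g. "bbc.co.uk": keep the last three pieces
            return pieces[n - 3] + "." + pieces[n - 2] + "." + pieces[n - 1];
        }
        return pieces[n - 2] + "." + pieces[n - 1];
    }

    /** True if every piece is a decimal number in the range 0..255. */
    static boolean isValidIPv4(String[] pieces) {
        for (String piece : pieces) {
            if (piece.isEmpty() || piece.length() > 3) {
                return false;
            }
            for (int i = 0; i < piece.length(); i++) {
                char c = piece.charAt(i);
                if (c < '0' || c > '9') {
                    return false;
                }
            }
            if (Integer.parseInt(piece) > 255) {
                return false;
            }
        }
        return true;
    }
}
```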









Currently bixo has jars in its /lib sub-directory,

and it assumes you have Cascading & Hadoop

project (with appropriate versions) located on

your local disk with build.properties edited to

specify their location.



It would be better to use Ivy to handle jar

dependency management.



I'd lean towards simpler XML at the cost of a

precondition like having the Ivy jar in the Ant

directory, for example.








