101tec's open source projects.
Displaying 331 issues as at 25/Nov/11 12:52 PM.
Project Key Summary Issue Type Status Priority Resolution Assignee
webdav-servlet WDS-22 Office 2007 running at a Windows XP opens an Office document read-only Bug Open Major UNRESOLVED Unassigned
webdav-servlet WDS-21 Add missing ITransaction to IMimeTyper Bug Open Blocker UNRESOLVED Unassigned
webdav-servlet WDS-20 Clear Unconsumed Inputstream Improvement Open Critical UNRESOLVED Unassigned
webdav-servlet WDS-19 PROPPATCH fails on locked object with correct lock token in IF Header Bug Open Critical UNRESOLVED Unassigned
webdav-servlet WDS-18 Send simple error response instead of multi status status if appropriate Improvement Open Critical UNRESOLVED Unassigned
webdav-servlet WDS-17 Allow none or empty LOCK owner token Improvement Open Major UNRESOLVED Unassigned
webdav-servlet WDS-16 Locking fails from davfs2 clients Bug Resolved Major Fixed Unassigned
webdav-servlet WDS-15 webdav on websphere 6.1 and 7 throws timeout exception while execute DoProp Method Bug Open Major UNRESOLVED Marko Bauhardt
webdav-servlet WDS-14 IWebdavStore: add destroy() life cycle method to deallocate resources Improvement Resolved Major Fixed Unassigned
webdav-servlet WDS-13 LockedObject prone to ArrayIndexOutOfBoundsException Bug Resolved Major Fixed Unassigned
webdav-servlet WDS-12 SimpleDateFormat usage is not thread safe Bug Resolved Major Fixed Unassigned
webdav-servlet WDS-11 Copy coll to non existent path results in error 500 instead of 409 Bug Open Major UNRESOLVED Unassigned
webdav-servlet WDS-10 MKCOL with existing Content-Type should result in 415 Bug Open Major UNRESOLVED Unassigned
webdav-servlet WDS-9 MKCOL on nonexistent path get 207 instead of 404 error Bug Resolved Major Fixed Unassigned
webdav-servlet WDS-8 Compilation for iJetty does not work Bug Resolved Major Fixed Marko Bauhardt
webdav-servlet WDS-7 improve the WebdavServlet/WebdavServletBean to register another implementations of IMethodExecutor Task Open Major UNRESOLVED Marko Bauhardt
webdav-servlet WDS-6 Problem when opening an accentuated file or folder Bug Open Minor UNRESOLVED Unassigned
webdav-servlet WDS-5 dont distribute the log4j.xml file within the jar Improvement Resolved Major Fixed Marko Bauhardt
webdav-servlet WDS-4 Enhance GET responses to return HTML+clickable links New Feature Resolved Minor Fixed Unassigned
webdav-servlet WDS-3 deploy webdav-servlet jar, javadoc. sources to maven2 repository if version 2.0 is released Task Closed Major Fixed Marko Bauhardt
webdav-servlet WDS-2 DoPut returns wrong content-length in response Bug Closed Major Fixed Marko Bauhardt
webdav-servlet WDS-1 npe while deleting files Bug Closed Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-29 Nutch Gui not working - 127.0.0.1 return 404 page not found Bug Open Major UNRESOLVED Marko Bauhardt
Nutch Gui NUTCHGUI-28 dont protect css, gfx and js folder Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-27 scheduling plugin: validate form elements, for example: Pages per segment Bug Open Major UNRESOLVED Marko Bauhardt
Nutch Gui NUTCHGUI-26 the scheduling plugin should be configurable to refresh a crawl or create a new crawl Improvement Open Major UNRESOLVED Marko Bauhardt
Nutch Gui NUTCHGUI-25 add a error html for exceptions Task Open Major UNRESOLVED Marko Bauhardt
Nutch Gui NUTCHGUI-24 add error page in web.xml for the authentication error 403 (user is not in role) Bug Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-23 login.html and logout.html should not be protected Bug Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-22 delete LoginController from the src/java folder Improvement Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-21 make the nutchgui.auth file configurable Improvement Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-20 NutchGuirealm should use a better name for the LoginModule Improvement Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-19 crawling went wrong if http.agent.name is not set Bug Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-18 crawling failed if metadata.enabled = true but there are no metadata-urls uploaded Bug Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-17 restart searcher if crawl folder is putted/removed to/from search Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-16 implement a simple admin job overview plugin Task Open Major UNRESOLVED Marko Bauhardt
Nutch Gui NUTCHGUI-15 implement a query filter that search in fields from the metadata indexing plugin Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-14 make the metadata indexing optional Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-13 make the black white filtering optional Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-12 implement nutch gui core classes, for example httpServer, GuiComponentDeployer etc Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-11 apply metadata indexing patch NUTCH-747 Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-10 apply black white filtering patch NUTCH-249 Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-9 implement admin url upload plugin Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-8 implement admin scheduling plugin Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-7 implement admin crawl plugin Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-6 implement admin configuration plugin Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-5 implement admin system plugin Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-4 implement admin instance plugin Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-3 implement admin welcome plugin Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-2 implement i18n translation mechanism Task Resolved Major Fixed Marko Bauhardt
Nutch Gui NUTCHGUI-1 implement role based security for login mechanism Task Resolved Major Fixed Marko Bauhardt
Katta KATTA-193 Implement query filters New Feature Resolved Major Fixed Johannes Zillmann
Katta KATTA-192 unstable master failover on session reconnect Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-191 property to set Lucene version New Feature Open Trivial UNRESOLVED Unassigned
Katta KATTA-190 make inner classes of LuceneServer protected Task Resolved Major Fixed Johannes Zillmann
Katta KATTA-189 improve multithreaded shard search by using ExecutorCompletionService Improvement Resolved Trivial Fixed Johannes Zillmann
Katta KATTA-188 katta startNode stuck after org.apache.zookeeper.KeeperException$NotEmptyException Bug Open Major UNRESOLVED Unassigned
Katta KATTA-187 add additional parameters to the Katta client to change the output format for easier scripting and parsing Improvement Resolved Trivial Fixed Unassigned
Katta KATTA-186 make message about document and term statistics more readable Improvement Resolved Trivial Fixed Unassigned
Katta KATTA-185 adding of lucene indices as single sharded katta indices Improvement Open Minor UNRESOLVED Johannes Zillmann
Katta KATTA-184 upgrade to zookeeper 3.3.2 (from 3.3.1) Improvement Resolved Major Fixed Johannes Zillmann
Katta KATTA-183 zkclient can get unresponsive through OOM Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-182 memory leak in client - ZooKeeper$ZkWatchManager.existWatches can grow huge Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-181 don't remove node-shard-mapping from client if proxy fails one time Improvement Resolved Major Fixed Johannes Zillmann
Katta KATTA-180 avoid search exceptions on index-removal Improvement Open Major UNRESOLVED Johannes Zillmann
Katta KATTA-179 undeployment of index can lead to NPE in BooleanQuery - improve exception message Bug Resolved Minor Fixed Johannes Zillmann
Katta KATTA-178 undeploying indices can leave empty shard-to-node paths Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-177 IndexDeployFuture not safe for quick undeployments Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-176 deploy-client misses index-state updates Bug Open Major UNRESOLVED Johannes Zillmann
Katta KATTA-175 make lucene-server thread pool parameters configurable Improvement Resolved Minor Fixed Johannes Zillmann
Katta KATTA-174 allow configuration of content-server through katta.node.properties New Feature Resolved Major Fixed Johannes Zillmann
Katta KATTA-173 upgrade to hadoop-0.20.2 Task Resolved Major Fixed Unassigned
Katta KATTA-172 update to lucene 3.0.3 Improvement Resolved Major Fixed Johannes Zillmann
Katta KATTA-171 use timeout from Client in the LuceneServer as well Improvement Resolved Major Fixed Johannes Zillmann
Katta KATTA-170 Negative range query broken Bug Open Critical UNRESOLVED Unassigned
Katta KATTA-169 results do not get closed in WorkQueue.getResults() if waitTime == 0 Bug Open Major UNRESOLVED Unassigned
Katta KATTA-168 katta deadlock after reconnecting upon child thread death Bug Open Major UNRESOLVED Unassigned
Katta KATTA-167 Katta runs out of memory Bug Open Major UNRESOLVED Unassigned
Katta KATTA-166 Searches that match > 2^31 documents are handled incorrectly Bug Open Major UNRESOLVED Unassigned
Katta KATTA-165 enabled throttling leads to IndexOutOfBoundsException when adding index Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-164 Unit test fails Bug Open Major UNRESOLVED Unassigned
Katta KATTA-163 don't exit node/master operation thread in case an unexpected exception occurs Improvement Resolved Major Fixed Johannes Zillmann
Katta KATTA-162 Allow LuceneClient to be extended more easily Improvement Resolved Major Fixed Johannes Zillmann
Katta KATTA-161 Katta nodes stop communicating with master, but don't exactly become "disconnected" Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-160 Stop trying to rebalance/replicate an index when the index could not be found in the file system any more Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-159 LuceneServerTest does not compile Bug Resolved Trivial Fixed Johannes Zillmann
Katta KATTA-158 remove a node from Katta New Feature Open Major UNRESOLVED Unassigned
Katta KATTA-157 set timeout on LuceneClient New Feature Resolved Major Fixed Johannes Zillmann
Katta KATTA-156 allow shard selection by regular expression New Feature Resolved Major Fixed Unassigned
Katta KATTA-155 Retrieving details of many hits is very slow Improvement Open Major UNRESOLVED Unassigned
Katta KATTA-154 HitsMapWritable readFields does not add hits optimally Improvement Resolved Minor Fixed Johannes Zillmann
Katta KATTA-153 LuceneServer loads all fields from index, even if only fewer are requested Improvement Resolved Major Fixed Johannes Zillmann
Katta KATTA-152 modify LuceneServer for easier sub-classing Improvement Resolved Minor Fixed Johannes Zillmann
Katta KATTA-151 use Hadoop 0.21 Task Open Major UNRESOLVED Unassigned
Katta KATTA-150 port parameter for startNode New Feature Resolved Trivial Fixed Johannes Zillmann
Katta KATTA-149 LuceneServer synchronizes on a ConcurrentHashMap Improvement Resolved Trivial Fixed Johannes Zillmann
Katta KATTA-148 deploy index fails if Debug logging is enabled at the master and the LowestShardCountDistributionPolicy is chosen Bug Resolved Blocker Fixed Unassigned
Katta KATTA-147 upgrade to zookeeper 3.3 Task Resolved Major Fixed Johannes Zillmann
Katta KATTA-146 java.util.ConcurrentModificationException when multiple LuceneClient objects are created simultaneously Bug Resolved Major Fixed Unassigned
Katta KATTA-145 java.lang.NullPointerException when a LuceneClient is created sometimes Bug Resolved Major Fixed Unassigned
Katta KATTA-144 impossible to resolve dependencies: java.io.FileNotFoundException Bug Open Major UNRESOLVED Unassigned
Katta KATTA-143 CLONE -cobertura.jar version mismatch when compiling /extras/indexing Bug Resolved Trivial Duplicate Unassigned
Katta KATTA-142 memory leak in client usage when adding and removing indices Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-141 ec2 scripts broken Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-140 inconsistent search errors during stress test Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-139 reconnecting node fails while deleting queue Bug Open Major UNRESOLVED Johannes Zillmann
Katta KATTA-138 Cluster can "hang" following a major change Bug Open Critical UNRESOLVED Unassigned
Katta KATTA-137 ZKClient does not compile against Zookeeper 3.3.1 Bug Resolved Major Fixed Unassigned
Katta KATTA-136 NPE in client on remove index event Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-135 NPE in AbstractIndexOperation.addRunningDeployments Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-134 cobertura.jar version mismatch when compiling /extras/indexing Bug Resolved Trivial Fixed Johannes Zillmann
Katta KATTA-133 upgrade zookeeper 3.2.2 to 3.3.0 Task Resolved Major Duplicate Unassigned
Katta KATTA-132 nDocs must be > 0 exception when query on many instances Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-131 missing sort fields in single documents leads to exception on sorting Bug Open Major UNRESOLVED Johannes Zillmann
Katta KATTA-130 Deploying from HDFS fails to unzip correctly Bug Resolved Minor Fixed Johannes Zillmann
Katta KATTA-129 Add a newly deployed shard (in hdfs, say) to an existing index New Feature Resolved Major Duplicate Unassigned
Katta KATTA-128 Only one Search executed at a time per node -> Increase RPC Server threads Improvement Resolved Major Fixed Johannes Zillmann
Katta KATTA-127 reverse deploy - copy a valid index from the katta system (+ all its shards) to a hdfs uri New Feature Open Major UNRESOLVED Unassigned
Katta KATTA-126 Genericity nits Improvement Resolved Major Fixed Johannes Zillmann
Katta KATTA-125 ConcurrentModificationException in net.sf.katta.protocol.InteractionProtocol Bug Resolved Critical Fixed Unassigned
Katta KATTA-124 imbalanced shard distribution with 'LowestShardCountDistributionPolicy' Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-123 Remove System.exit(1); from printUsageAndExit Bug Resolved Major Won't Fix Unassigned
Katta KATTA-122 removeIndex leaks file descriptors Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-121 bin/stop-all.sh not working (Sometimes nodes just hang and won't stop) Bug Open Major UNRESOLVED Unassigned
Katta KATTA-120 Adding wrong file path will cause listIndices to fail (and maybe other parts as well) Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-119 Reload/Refresh feature New Feature Open Major UNRESOLVED Unassigned
Katta KATTA-118 master should periodically balance indices New Feature Open Major UNRESOLVED Unassigned
Katta KATTA-117 add command line option to print stacktrace on error Improvement Resolved Major Fixed Johannes Zillmann
Katta KATTA-116 distribution of shards does not take currently deploying shards into account Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-115 zkclient: update log4j to newest version Task Open Trivial UNRESOLVED Unassigned
Katta KATTA-114 zkclient: property to disable ivy in zkclient build system Improvement Open Minor UNRESOLVED Unassigned
Katta KATTA-113 zkclient git repo does not include all dependencies in lib/ Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-112 ship build.xml and ivy.xml in tarballs Improvement Resolved Minor Fixed Johannes Zillmann
Katta KATTA-111 improve build system to ease packaging for Debian (and other distros) Task Open Minor UNRESOLVED Unassigned
Katta KATTA-110 use a released 0.1 version of zkclient instead of the snapshot Task Resolved Major Fixed Johannes Zillmann
Katta KATTA-109 split katta distribution into katta and katta.gui Task Resolved Major Fixed Johannes Zillmann
Katta KATTA-108 make loadtests more robust Improvement Open Major UNRESOLVED Unassigned
Katta KATTA-107 Katta master does not run on cygwin Bug Resolved Blocker Fixed Unassigned
Katta KATTA-106 improve katta's monitoring abilities Improvement Open Major UNRESOLVED Johannes Zillmann
Katta KATTA-105 throttle shard deployment New Feature Resolved Major Fixed Johannes Zillmann
Katta KATTA-104 upgrade to zookeeper 3.2.2 Task Resolved Major Fixed Johannes Zillmann
Katta KATTA-103 upgrade to lucene 3.0 Task Resolved Major Fixed Johannes Zillmann
Katta KATTA-102 node failover in Client is not safe for multithreaded use Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-101 refactor INodeManaged implementation into sub-packages Improvement Resolved Major Fixed Johannes Zillmann
Katta KATTA-100 ivy setup does not work for extras/indexing module Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-99 Access binary fields in search results New Feature Resolved Major Cannot Reproduce Unassigned
Katta KATTA-98 bin/start-all.sh script should start primary and secondary master for fail over support Improvement Open Major UNRESOLVED Unassigned
Katta KATTA-97 graceful shutdown of JmxMonitor Improvement Resolved Major Fixed Johannes Zillmann
Katta KATTA-96 upgrade mechanism for katta New Feature Resolved Major Fixed Johannes Zillmann
Katta KATTA-95 IndexDeployFuture.joinDeployment() seems to hang from time to time Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-94 refactor configuration management Improvement Open Major UNRESOLVED Unassigned
Katta KATTA-93 hits are (re-)sorted completely on client side Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-92 Enable query the hadoop version for the katta cluster Improvement Open Minor UNRESOLVED Unassigned
Katta KATTA-91 Searching for index by wildcard only supports "all indexes" rather than "all matching this glob" Improvement Resolved Minor Duplicate Unassigned
Katta KATTA-90 Allow Katta to be installed as a Maven artifact. Improvement Open Minor UNRESOLVED Unassigned
Katta KATTA-89 Dependencies are checked in to git as jars Improvement Open Major UNRESOLVED Unassigned
Katta KATTA-88 Allow shard size deployment for LPT (Longest Processing Time) Improvement Open Trivial UNRESOLVED Unassigned
Katta KATTA-87 Katta.gui - it would be great to have a web based gui allowing to monitor and maybe administrate katta New Feature Resolved Major Fixed Stefan Groschupf
Katta KATTA-86 source should be in jar as well Improvement Resolved Trivial Fixed Stefan Groschupf
Katta KATTA-85 upgrade to hadoop 20.1 jars Improvement Resolved Major Fixed Stefan Groschupf
Katta KATTA-84 Distribution Policy that picks the node with the fewest shards, and allow hadoop zip files to be stream unpacked instead of spooled to local disk first New Feature Resolved Major Fixed Unassigned
Katta KATTA-83 Build failure when building on fresh checkout Bug Resolved Blocker Cannot Reproduce Unassigned
Katta KATTA-82 Katta need to be monitor able New Feature Resolved Major Fixed Stefan Groschupf
Katta KATTA-81 NodeInteraction: the max tryCount should be not hardcoded to 3 but equal to the replication level or configurable. Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-80 Configuration loading from files Improvement Resolved Minor Fixed Unassigned
Katta KATTA-79 Set the maximum shard size in bytes Improvement Resolved Minor Won't Fix Unassigned
Katta KATTA-78 Add basic Lucene Sort capabilities New Feature Resolved Major Fixed Johannes Zillmann
Katta KATTA-77 Merging indexes requires missing Commons HTTP Client Bug Resolved Major Fixed Unassigned
Katta KATTA-76 Attempting to add an index on a non-existent HDFS breaks listIndexes Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-75 tarbomb: Katta 0.5.1 release tarball expands in place Bug Resolved Major Fixed Peter Voss
Katta KATTA-74 Running "ant jar" in extras/indexing fails due to missing jets3t dependency Bug Resolved Major Fixed Stefan Groschupf
Katta KATTA-73 CDPATH environment variable causes bin scripts to fail. Bug Resolved Major Fixed Unassigned
Katta KATTA-72 NPE when searching an index that doesn't exist Bug Resolved Major Fixed Peter Voss
Katta KATTA-71 katta hangs when deploying when errors happen when pulling a index from s3. Bug Resolved Major Fixed Stefan Groschupf
Katta KATTA-70 Katta immediately terminates after startup when using external Zookeeper Bug Resolved Major Fixed Peter Voss
Katta KATTA-69 Katta is much too sensitive to recoverable KeeperExceptions Bug Resolved Major Duplicate Peter Voss
Katta KATTA-68 Need "fsck" for katta clusters New Feature Open Major UNRESOLVED Unassigned
Katta KATTA-67 Default namespace is not created when using external Katta Bug Resolved Major Fixed Stefan Groschupf
Katta KATTA-66 Update jets3t jar from version 0.5.0 to 0.6.1 Bug Resolved Major Fixed Unassigned
Katta KATTA-65 Katta needs two missing jars to be able to use the s3:// and s3n:// protocols for pulling indexes Bug Resolved Minor Won't Fix Unassigned
Katta KATTA-64 problems running with multiple zookeeper servers. Bug Resolved Major Won't Fix Johannes Zillmann
Katta KATTA-63 Use java which is on the PATH if JAVA_HOME is not set Improvement Resolved Major Fixed Peter Voss
Katta KATTA-62 Need to be able to rebalance shard assignments in a cluster Bug Open Major UNRESOLVED Unassigned
Katta KATTA-61 Katta needs a load balancing shard deployment policy Bug Open Major UNRESOLVED Unassigned
Katta KATTA-60 Allow user to specify katta root ZooKeeper path New Feature Resolved Major Fixed Unassigned
Katta KATTA-59 Katta should (optionally) allow partial results Improvement Resolved Major Fixed Unassigned
Katta KATTA-58 Change structure of data in Zookeeper to make all node data ephemeral on node connection Improvement Resolved Major Duplicate Unassigned
Katta KATTA-57 Permissions on scripts aren't set properly after deployment Bug Resolved Major Fixed Peter Voss
Katta KATTA-56 Documentation on configuring Katta has misleading info on hadoop Bug Resolved Minor Fixed Stefan Groschupf
Katta KATTA-55 The 0.5.1 release is missing the /extras directory with EC2 support Bug Resolved Major Fixed Stefan Groschupf
Katta KATTA-54 Generalize Katta so that Lucene is one use case (and mapfiles another). New Feature Resolved Major Fixed Stefan Groschupf
Katta KATTA-53 Need to fix spelling errors and other small issues in code Bug Resolved Major Fixed Unassigned
Katta KATTA-52 ZKClient.reconnect() should only be called on KeeperState.Expired events Bug Resolved Major Fixed Peter Voss
Katta KATTA-51 master hangs when reconnecting to zk during deployment of a index. Bug Resolved Major Fixed Stefan Groschupf
Katta KATTA-50 Intermittent failure of ClientFailoverTest Bug Resolved Major Cannot Reproduce Unassigned
Katta KATTA-49 Introduce a refreshIndex method Improvement Open Major UNRESOLVED Unassigned
Katta KATTA-48 Extend IndexMetaData to include index name Improvement Resolved Major Fixed Peter Voss
Katta KATTA-47 addIndex should not require the user to provide an Analyzer Improvement Resolved Major Fixed Unassigned
Katta KATTA-46 Make Katta Client(s) programmatically configurable Improvement Resolved Minor Fixed Stefan Groschupf
Katta KATTA-45 logical error on getting scoreDoc List in KattaMultiSearcher::search() Bug Resolved Major Fixed Peter Voss
Katta KATTA-44 Incorrect usage of org.apache.lucene.util.PriorityQueue Bug Resolved Major Fixed Peter Voss
Katta KATTA-43 Katta does not recover well from expired sessions Bug Resolved Major Fixed Johannes Zillmann
Katta KATTA-42 upgrade zkClient to use zookeeper 3.1.1 Improvement Resolved Major Fixed Peter Voss
Katta KATTA-41 SecondaryMaster cannot take over when firstMaster failed: Bug Resolved Critical Fixed Stefan Groschupf
Katta KATTA-40 cannot start as Quorum Zookeepers in katta-0.4 Bug Resolved Major Cannot Reproduce Stefan Groschupf
Katta KATTA-39 logical bug in calculating the DF Bug Resolved Major Fixed Unassigned
Katta KATTA-38 naive bug in finding a free port for RPCServer Bug Resolved Major Fixed Stefan Groschupf
Katta KATTA-37 undeploy one index will delete all data of other index Bug Resolved Blocker Fixed Stefan Groschupf
Katta KATTA-36 bin/katta showStructure throws an exception Bug Resolved Major Fixed Peter Voss
Katta KATTA-35 build.xml doesn't provide description for the 'eclipse' and 'dist' targets Improvement Resolved Trivial Fixed Peter Voss
Katta KATTA-34 Random failure of test net.sf.katta.zk.ZKClientTest Bug Resolved Minor Cannot Reproduce Unassigned
Katta KATTA-33 upgrade information in the readme file Bug Resolved Major Fixed Unassigned
Katta KATTA-32 in the release docs/reports is empty Bug Resolved Major Fixed Stefan Groschupf
Katta KATTA-31 dont rsync log or zookeeper folders Bug Resolved Major Fixed Stefan Groschupf
Katta KATTA-30 add checkstyle as part of the test Improvement Resolved Minor Fixed Unassigned
Katta KATTA-29 Exception while restarting zk client Bug Resolved Major Cannot Reproduce Unassigned
Katta KATTA-28 generalize client node interaction api Improvement Resolved Major Fixed Stefan Groschupf
Katta KATTA-27 queryParser is not thread-safe, though we use in a setup that actually could be multithreaded. Bug Resolved Major Fixed Stefan Groschupf
Katta KATTA-26 identical shardName conflict with each other Bug Resolved Blocker Won't Fix Stefan Groschupf
Katta KATTA-25 upgrade to zookeeper 3.x Improvement Resolved Major Fixed Stefan Groschupf
Katta KATTA-24 upgrade to latest lucene Improvement Resolved Major Fixed Stefan Groschupf
Katta KATTA-23 Parallelize search result detail retrieval Improvement Resolved Minor Fixed Peter Voss
Katta KATTA-22 Avoid using com.sun.xml.internal.ws.util.ByteArrayBuffer (non standard Java) Bug Resolved Major Fixed Unassigned
Katta KATTA-21 'ant eclipse' doesn't set the execution environment to 1.6 Bug Resolved Major Fixed Peter Voss
Katta KATTA-20 Incorrect results returned when limiting the number of hits Bug Resolved Major Fixed Peter Voss
Katta KATTA-19 create a simple extras/webui that illustrates how to use the katta client Task Open Major UNRESOLVED Unassigned
Katta KATTA-18 upgrade to hadoop 0.19.0 Improvement Resolved Major Fixed Stefan Groschupf
Katta KATTA-17 create a load test running on ec2 Improvement Resolved Major Fixed Johannes Zillmann
Katta KATTA-16 KATTA-28 The katta manager should manage and the client should query multiple pools of different kinds of servers Sub-task Resolved Major Fixed Unassigned
Katta KATTA-15 Zookeeper server disconnects cause major problems Bug Resolved Major Fixed Stefan Groschupf
Katta KATTA-14 Katta should be able to use an external zookeeper cluster Bug Resolved Major Fixed Stefan Groschupf
Katta KATTA-13 run katta on ec2 Improvement Resolved Major Fixed Stefan Groschupf
Katta KATTA-12 katta-env.sh does not have KATTA_SLAVE_SLEEP. Improvement Resolved Minor Won't Fix Stefan Groschupf
Katta KATTA-11 introduce merging index size threshold Improvement Resolved Minor Won't Fix Unassigned
Katta KATTA-10 make shard folder dependent on node-port Improvement Resolved Major Fixed Stefan Groschupf
Katta KATTA-9 do not zip shards Improvement Resolved Major Won't Fix Unassigned
Katta KATTA-8 make metadata mapwritable Improvement Resolved Major Won't Fix Unassigned
Katta KATTA-7 KATTA-5 fix master failover Sub-task Resolved Major Duplicate Unassigned
Katta KATTA-6 KATTA-5 refactor cluster start/stop mechanism Sub-task Resolved Major Won't Fix Stefan Groschupf
Katta KATTA-5 [WRAPPER] improve zookeeper ephemeral node handling Bug Resolved Major Fixed Unassigned
Katta KATTA-4 concurrent search on node Improvement Resolved Major Fixed Stefan Groschupf
Katta KATTA-3 index speed improvement Improvement Closed Major Won't Fix Unassigned
Katta KATTA-2 check correct use of analyzer on searching Bug Resolved Major Fixed Stefan Groschupf
Katta KATTA-1 make lucene analyzer configurable for indexing Improvement Resolved Minor Won't Fix Unassigned
bixo BIXO-88 1232123 Bug Open Major UNRESOLVED Ken Krugler
bixo BIXO-87 Decouple use of ec2-api-tools from bixo Improvement Open Minor UNRESOLVED Ken Krugler
bixo BIXO-86 FetcherPolicy and crawlDelay=0: Divide By Zero Bug Open Trivial UNRESOLVED Ken Krugler
bixo BIXO-85 Add target languages to FetchPolicy, use during fetch Improvement Open Minor UNRESOLVED Unassigned
bixo BIXO-84 Specify gzip compression with HttpClient requests Improvement Open Minor UNRESOLVED Ken Krugler
bixo BIXO-83 Improve SimpleCrawlTool error messages Improvement Open Major UNRESOLVED Ken Krugler
bixo BIXO-82 Configure HttpClient request headers Improvement Open Critical UNRESOLVED Ken Krugler
bixo BIXO-81 Source should be in jars Improvement Open Minor UNRESOLVED Unassigned
bixo BIXO-80 Support issue linking in Jira Improvement Open Minor UNRESOLVED Marko Bauhardt
bixo BIXO-79 Add full set of licenses for all jars we include in the distribution build Improvement Open Minor UNRESOLVED Ken Krugler
bixo BIXO-78 Generate release tarball on TeamCity and copy to Nexus directly Improvement Open Critical UNRESOLVED Ken Krugler
bixo BIXO-77 Create Cascading Scheme for reading/writing WARC files New Feature Open Major UNRESOLVED Unassigned
bixo BIXO-76 Support Solr via new Cascading Scheme New Feature Open Major UNRESOLVED Unassigned
bixo BIXO-75 BIXO-73 Clean up DMOZ domain extraction code - remove OOM exceptions Sub-task Open Major UNRESOLVED Ken Krugler
bixo BIXO-74 BIXO-73 Update SimpleCrawlTool to use dmoz link data Sub-task Open Minor UNRESOLVED Ken Krugler
bixo BIXO-73 [Wrapper] Improve DMOZ data support Improvement Open Minor UNRESOLVED Ken Krugler
bixo BIXO-72 BIXO-68 Enable read-only rsync from Nexus releases directory Sub-task Open Major UNRESOLVED Marko Bauhardt
bixo BIXO-71 BIXO-68 Create documentation on how to deploy releases to Nexus repository Sub-task Closed Major Fixed Ken Krugler
bixo BIXO-70 BIXO-68 Submit Jira issue to Maven to pull from Nexus repository for central repo Sub-task Open Major UNRESOLVED Ken Krugler
bixo BIXO-69 BIXO-68 Modify TeamCity to deploy snapshots to Nexus repository manager Sub-task Closed Major Fixed Ken Krugler
bixo BIXO-68 Set up Bixo deployment to Maven repository Task Open Major UNRESOLVED Ken Krugler
bixo BIXO-67 BIXO-68 Set up Nexus repository manager on 101tec.com server Sub-task Resolved Major Fixed Frank Henze
bixo BIXO-66 Break bixo-core into bixo-core, bixo-parse, bixo-index Improvement Open Major UNRESOLVED Ken Krugler
bixo BIXO-65 Create Java-centric launch/deploy scripts for running Bixo in EC2 Improvement Open Major UNRESOLVED Ken Krugler
bixo BIXO-64 Allow Team City to send emails to the list Task Closed Minor Fixed Ken Krugler
bixo BIXO-63 Create AMI for Bixo Task Open Major UNRESOLVED Unassigned
bixo BIXO-62 Add "truncated" flag to FetchedDatum Improvement Open Minor UNRESOLVED Unassigned
bixo BIXO-61 Catch case of fetch being past time limit inside of FetcherRunnable Improvement Open Minor UNRESOLVED Ken Krugler
bixo BIXO-60 Create EC2 deployment script New Feature Closed Critical Fixed Ken Krugler
bixo BIXO-59 Fix up HttpClient handling of stale connections Bug Closed Critical Fixed Ken Krugler
bixo BIXO-58 Switch to using Maven for dependent jars where possible Task Closed Major Fixed Ken Krugler
bixo BIXO-57 UrlFilter should return boolean not string Task Closed Major Fixed Ken Krugler
bixo BIXO-56 Support relative path normalization and other transformations Improvement Open Minor UNRESOLVED Ken Krugler
bixo BIXO-55 Support building a Hadoop job jar Task Closed Major Fixed Stefan Groschupf
bixo BIXO-54 Set up Google or Yahoo mailing list for Bixo Task Resolved Major Fixed Ken Krugler
bixo BIXO-53 Separate out the Eclipse build classes from the Ant build classes Bug Resolved Major Fixed Stefan Groschupf
bixo BIXO-52 Add language detection operation New Feature Open Major UNRESOLVED Ken Krugler
bixo BIXO-51 Add stand-alone class for URL filtering New Feature Open Major UNRESOLVED Ken Krugler
bixo BIXO-50 Add stand-alone class for URL normalization New Feature Closed Critical Fixed Ken Krugler
bixo BIXO-48 making the indexing details configurable Improvement Resolved Major Fixed Stefan Groschupf
bixo BIXO-47 Add test for slow response from web server Improvement Resolved Major Fixed Ken Krugler
bixo BIXO-46 Add simple robots.txt fetch/parse to the FetchBuffer operation New Feature Closed Critical Fixed Ken Krugler
bixo BIXO-45 indexing anchor texts as well. Improvement Open Major UNRESOLVED Stefan Groschupf
bixo BIXO-44 two test data folders Improvement Closed Major Fixed Ken Krugler
bixo BIXO-43 Log progress using counters Improvement In Progress Major UNRESOLVED Ken Krugler
bixo BIXO-42 Make WebgraphWebServerTest a "long-running" test with improved auto-download Improvement Open Minor UNRESOLVED Stefan Groschupf
bixo BIXO-41 Update README with more info on how to run tests, use Eclipse Improvement Open Minor UNRESOLVED Ken Krugler
bixo BIXO-40 When building a fetch queue for a PLD, use IP addresses to sub-segment Improvement Closed Critical Fixed Ken Krugler
bixo BIXO-39 Clean up classes used in HTML parser code Improvement Closed Minor Won't Fix Ken Krugler
bixo BIXO-38 Clean up tuples used to exchange information between Cascading operations Improvement Closed Major Fixed Ken Krugler
bixo BIXO-37 Set up and use configuration settings in the fetcher Task Closed Major Fixed Ken Krugler
bixo BIXO-36 Explicitly set HttpClient redirect handling Improvement Closed Minor Fixed Ken Krugler
bixo BIXO-35 Support https protocol Improvement Closed Major Fixed Ken Krugler
bixo BIXO-34 Configure HttpClient for proper retry handling Improvement Closed Minor Fixed Ken Krugler
bixo BIXO-33 Configure HttpClient for max connections per server Improvement Open Minor UNRESOLVED Ken Krugler
bixo BIXO-32 Turn off stale connection check in HttpClient Improvement Closed Minor Fixed Ken Krugler
bixo BIXO-31 Switch to LOGGER from LOG everywhere Task Closed Minor Fixed Stefan Groschupf
bixo BIXO-30 Set up continuous integration build system Task Closed Major Fixed Stefan Groschupf
bixo BIXO-29 Set up way to run integration tests and long tests separately from unit tests Improvement Resolved Major Fixed Ken Krugler
bixo BIXO-28 Create a multi sink tap Task Closed Major Fixed Stefan Groschupf
bixo BIXO-27 create a sink tap that indexes the content Task Resolved Major Fixed Stefan Groschupf
bixo BIXO-26 Create separate integration directory and move appropriate tests there from src/test Improvement Closed Major Fixed Ken Krugler
bixo BIXO-25 Store a complete http header in the output of the fetcher Improvement Closed Major Fixed Ken Krugler
bixo BIXO-24 pre sort UrlWithScore Improvement Open Minor UNRESOLVED Stefan Groschupf
bixo BIXO-23 cascading fetcher Task Resolved Major Fixed Stefan Groschupf
bixo BIXO-22 http tests Task Resolved Major Fixed Stefan Groschupf
bixo BIXO-21 make sure we use traps for all pipes Task Open Major UNRESOLVED Stefan Groschupf
bixo BIXO-20 create a crawl simulation platform Task Resolved Major Fixed Stefan Groschupf
bixo BIXO-19 test servlet Task Resolved Major Fixed Stefan Groschupf
bixo BIXO-18 adding domain db Task Open Minor UNRESOLVED Stefan Groschupf
bixo BIXO-17 update url DB -> as cascading Task Open Critical UNRESOLVED Ken Krugler
bixo BIXO-16 create katta index Task Closed Major Fixed Stefan Groschupf
bixo BIXO-15 scrape the output -> as cascading Task Closed Major Fixed Stefan Groschupf
bixo BIXO-14 parse the output -> as cascading Task Resolved Major Fixed Stefan Groschupf
bixo BIXO-13 group urls by pld Task Resolved Major Fixed Stefan Groschupf
bixo BIXO-12 create a simple scoring based on the fetch time Task Resolved Major Fixed Stefan Groschupf
bixo BIXO-11 create a basic fetch loop Task Resolved Major Fixed Stefan Groschupf
bixo BIXO-10 create a basic url importer Task Resolved Major Fixed Stefan Groschupf
bixo BIXO-9 Create queue-based fetcher Task Resolved Major Fixed Ken Krugler
bixo BIXO-8 set flow connector properties Task Open Major UNRESOLVED Stefan Groschupf
bixo BIXO-7 do we need url normalisation? New Feature Resolved Major Fixed Stefan Groschupf
bixo BIXO-6 version in jar Task Open Minor UNRESOLVED Stefan Groschupf
bixo BIXO-5 conf files Task Resolved Major Fixed Ken Krugler
bixo BIXO-4 license.txt Task Open Major UNRESOLVED Stefan Groschupf
bixo BIXO-3 header Task Resolved Major Fixed Stefan Groschupf
bixo BIXO-2 Create PLD (paid level domain) extractor Task Resolved Major Fixed Ken Krugler
bixo BIXO-1 Set up ivy-based dependency management Task Resolved Major Fixed Stefan Groschupf
Generated at Fri Nov 25 12:52:39 UTC 2011 using JIRA Enterprise Edition, Version: 3.13.1-#333.
Reporter Created Updated Affects Version/s Fix Version/s Component/s Due Date Votes Images Work Ratio
Christian Kütbach 10/14/2011 8:06 10/14/2011 8:06 2.0.1 0
Sebastian Felis 8/9/2011 12:59 8/16/2011 7:33 2.1 0
Sebastian Felis 8/9/2011 12:55 8/16/2011 10:32 2.1 0
Sebastian Felis 12/30/2010 12:05 8/16/2011 10:36 0
Sebastian Felis 12/30/2010 10:30 8/16/2011 10:39 0
Sebastian Felis 12/30/2010 9:14 12/30/2010 9:15 0
André Schild 5/5/2010 6:38 5/6/2010 7:00 2.0.1 2.0.1 0
2.1
Marko Bauhardt 1/28/2010 8:37 1/28/2010 8:37 2 2.0.1 0
Bela Ban 1/4/2010 10:37 5/9/2010 18:50 2 2.1 0
Knut Forkalsrud 11/29/2009 3:20 5/9/2010 18:52 2 2.1 0
Thomas Fromm 8/6/2009 9:46 5/10/2010 7:50 2 2.1 0
Thomas Fromm 8/6/2009 9:41 8/12/2009 7:40 2 2.0.1 0
Thomas Fromm 8/6/2009 9:37 8/19/2009 15:27 2 2.0.1 0
Thomas Fromm 8/6/2009 9:32 8/12/2009 8:31 2 2.0.1 0
Jim Cortez 8/4/2009 16:41 8/5/2009 9:28 2.1 0
Marko Bauhardt 7/27/2009 7:26 8/6/2009 8:15 2 2.1 0
Miguel Ferreira 7/22/2009 14:17 5/6/2010 19:50 2 2.1 0
Marko Bauhardt 7/20/2009 8:37 8/12/2009 7:33 2 2.0.1 0
André Schild 5/18/2009 8:47 5/26/2009 8:20 2.1 0
Marko Bauhardt 4/24/2009 6:25 8/12/2009 8:52 2 2 0
André Schild 4/22/2009 7:41 9/23/2011 12:59 2 2 0
Marko Bauhardt 4/22/2009 7:11 8/12/2009 8:52 2 2 0
nicolas frances 12/29/2010 7:46 3/21/2011 13:23 0
Marko Bauhardt 10/2/2009 16:09 10/6/2009 14:29 0.2 0.2.1 0
Marko Bauhardt 10/2/2009 11:41 10/2/2009 11:41 0.2 0.2.1 0
Marko Bauhardt 10/2/2009 8:17 10/2/2009 8:17 0.2 0.2.1 0
Marko Bauhardt 9/30/2009 13:11 9/30/2009 13:11 0.2 0.2.1 0
Marko Bauhardt 9/30/2009 12:34 10/6/2009 14:29 0.2 0.2.1 0
Marko Bauhardt 9/30/2009 12:34 10/6/2009 14:29 0.2 0.2.1 0
Marko Bauhardt 9/30/2009 11:56 10/5/2009 14:05 0.2 0.2.1 0
Marko Bauhardt 9/29/2009 9:58 10/5/2009 13:53 0.2 0.2.1 0
Marko Bauhardt 9/29/2009 9:31 10/5/2009 13:59 0.2 0.2.1 0
Max Josef Ender 8/26/2009 10:21 9/23/2009 14:56 0.1 0.2 0
Max Josef Ender 8/26/2009 10:20 8/26/2009 16:48 0.1 0
Marko Bauhardt 8/24/2009 15:19 9/24/2009 9:09 0.2 0
Marko Bauhardt 8/24/2009 15:17 9/23/2009 14:57 0.3 0
Marko Bauhardt 8/14/2009 12:57 8/19/2009 15:24 0.1 0
Marko Bauhardt 8/13/2009 9:04 8/13/2009 9:04 0.1 0
Marko Bauhardt 8/13/2009 9:04 8/13/2009 9:04 0.1 0
Marko Bauhardt 8/13/2009 8:54 8/13/2009 8:54 0.1 0
Marko Bauhardt 8/13/2009 8:52 8/13/2009 8:59 0.1 0
Marko Bauhardt 8/13/2009 8:52 8/13/2009 9:00 0.1 0
Marko Bauhardt 8/13/2009 8:50 8/19/2009 15:24 0.1 0
Marko Bauhardt 8/13/2009 8:49 8/13/2009 8:55 0.1 0
Marko Bauhardt 8/13/2009 8:45 8/19/2009 15:24 0.1 0
Marko Bauhardt 8/13/2009 8:43 8/13/2009 8:59 0.1 0
Marko Bauhardt 8/13/2009 8:42 8/13/2009 8:55 0.1 0
Marko Bauhardt 8/13/2009 8:40 8/13/2009 8:55 0.1 0
Marko Bauhardt 8/13/2009 8:39 8/13/2009 8:55 0.1 0
Marko Bauhardt 8/13/2009 8:35 8/24/2009 15:17 0.1 0
Marko Bauhardt 8/13/2009 8:34 9/21/2009 14:52 0.2 0
Andrew 6/17/2011 20:56 9/26/2011 9:05 0.6.5 search 0
Johannes Zillmann 5/14/2011 15:31 5/14/2011 15:46 0.6.2 0.6.4 cluster 0
Mathias Walter 3/14/2011 15:14 3/14/2011 15:14 0.6.3 index search 0
Mathias Walter 3/14/2011 14:57 5/14/2011 16:24 0.6.3 0.6.4 search 0
Mathias Walter 3/14/2011 10:54 5/14/2011 16:44 0.6.3 0.6.4 search 0
Murali Krishna 3/13/2011 12:53 3/13/2011 12:53 0.6.2 infrastructure 0
Mathias Walter 3/11/2011 13:52 5/14/2011 16:22 0.6.3 0.6.4 0
Mathias Walter 3/10/2011 14:42 5/14/2011 16:44 0.6.3 0.6.4 infrastructure 0
Johannes Zillmann 3/7/2011 8:36 3/7/2011 8:36 0.6.5 index 0
Johannes Zillmann 2/14/2011 21:27 2/14/2011 21:34 0.6.3 0.6.4 cluster 0
Johannes Zillmann 2/14/2011 13:26 2/14/2011 13:29 0.6 0.6.4 0
Johannes Zillmann 2/14/2011 9:24 2/14/2011 13:05 0.6.3 0.6.4 0
Johannes Zillmann 2/2/2011 12:37 2/2/2011 18:30 0.6 0.6.4 0
Johannes Zillmann 2/2/2011 8:26 5/26/2011 7:26 0.6.5 cluster 0
Johannes Zillmann 2/1/2011 22:09 2/1/2011 22:35 0.6.4 0
Johannes Zillmann 2/1/2011 21:44 2/1/2011 22:31 0.6.4 index 0
Johannes Zillmann 2/1/2011 21:04 2/1/2011 21:09 0.6.4 cluster 0
Johannes Zillmann 2/1/2011 17:30 5/26/2011 7:26 0.6.2 0.6.5 cluster 0
Johannes Zillmann 2/1/2011 16:49 2/1/2011 17:20 0.6.3 0.6.4 cluster 0
Johannes Zillmann 2/1/2011 12:06 2/1/2011 16:50 0.6.4 cluster 0
Johannes Zillmann 1/26/2011 10:55 1/29/2011 14:11 0.6.3 0.6.4 0
Johannes Zillmann 1/26/2011 10:46 1/26/2011 12:04 0.6.4 search 0
Johannes Zillmann 1/26/2011 10:32 2/1/2011 16:50 0.6.4 search 0
Murali Krishna 1/10/2011 7:39 5/26/2011 7:26 0.6.2 0.6.5 search 0
Patrick Crenshaw 12/13/2010 15:56 5/26/2011 7:26 0.6.3 0.6.5 search 0
mg 12/2/2010 1:33 5/26/2011 7:26 0.6.3 0.6.5 0
mg 12/1/2010 22:19 5/26/2011 7:26 0.6 0.6.5 0
0.6.1
0.6.2
0.6.3
mg 12/1/2010 19:38 5/26/2011 7:26 0.6.2 0.6.5 0
0.6.3
mg 11/22/2010 8:44 11/24/2010 11:19 0.6.2 0.6.3 cluster 0
Patrick Crenshaw 11/17/2010 23:03 11/18/2010 8:37 0
Johannes Zillmann 10/25/2010 9:06 10/25/2010 9:25 0.6.2 0.6.3 cluster 0
Michael Small 10/21/2010 20:19 10/26/2010 7:46 0.6.2 0.6.3 1
mg 10/20/2010 18:03 9/26/2011 7:14 0.6.2 0.6.3 0
Hongchao Li 10/18/2010 16:41 10/25/2010 9:24 0.6.2 0.6.3 0
Mathias Walter 9/24/2010 11:00 9/26/2010 13:30 0.6.2 0.6.3 0
Mathias Walter 9/20/2010 7:00 5/26/2011 7:26 0.6.2 0.6.5 cluster 0
Mathias Walter 9/17/2010 14:14 10/4/2010 8:33 0.6.2 0.6.3 search 1
Mathias Walter 9/17/2010 13:57 10/25/2010 10:05 0.6.2 0.6.3 search 0
Mathias Walter 9/13/2010 13:07 5/26/2011 7:26 0.6.2 0.6.5 infrastructure 0
Mathias Walter 9/13/2010 12:22 9/27/2010 9:16 0.6.3 infrastructure 0
Mathias Walter 9/10/2010 12:05 9/26/2010 15:54 0.6.2 0.6.3 index 0
Mathias Walter 8/18/2010 7:42 9/26/2010 14:17 0.6.2 0.6.3 search 0
Mathias Walter 8/17/2010 14:30 9/10/2010 11:20 0.6.2 0
Mathias Walter 8/13/2010 12:08 8/17/2010 14:21 0.6.2 0.6.3 cluster 0
Mathias Walter 8/13/2010 11:35 8/17/2010 14:15 0.6.2 0.6.3 index 0
Mathias Walter 8/13/2010 11:25 8/17/2010 13:59 0.6.2 0.6.3 cluster 0
Johannes Zillmann 8/2/2010 8:08 8/2/2010 18:46 0.6.1 0.6.3 0
Hongchao Li 7/28/2010 19:01 8/2/2010 15:14 0.6.1 0.6.2 0
Hongchao Li 7/28/2010 18:49 8/2/2010 8:29 0.6.1 0.6.2 search 0
rafia taqdees 7/26/2010 6:58 7/26/2010 6:58 0
rafia taqdees 7/26/2010 6:54 9/26/2010 16:00 0.6.1 index 0
Johannes Zillmann 7/9/2010 7:42 7/9/2010 7:49 0.6 0.6.2 search 0
0.6.1
Johannes Zillmann 7/2/2010 15:22 7/2/2010 15:25 0.6 0.6.2 infrastructure 0
0.6.1
Hongchao Li 7/1/2010 15:56 7/2/2010 14:39 0.6.1 0.6.2 search 0
Johannes Zillmann 6/24/2010 12:30 5/26/2011 7:26 0.6.1 0.6.5 cluster 0
Eric McCoy 6/9/2010 12:58 10/27/2010 17:59 0.6.1 0
Thomas Koch 6/3/2010 11:18 8/2/2010 8:09 0
Hongchao Li 5/12/2010 7:51 7/2/2010 15:41 0.6 0.6.2 0
Eric McCoy 5/4/2010 17:11 5/12/2010 9:22 0.6.1 0.6.2 0
Rodney O'Donnell 4/30/2010 10:56 9/26/2010 15:59 0.6.1 0.6.3 index 0
Karthik K 4/26/2010 19:13 9/26/2010 15:51 cluster 0
Hongchao Li 4/25/2010 11:05 7/2/2010 14:39 0.6.2 0
Hongchao Li 4/25/2010 10:43 5/26/2011 7:26 0.6.1 0.6.5 0
David Buttler 4/23/2010 15:23 7/2/2010 14:38 0.6.1 0.6.2 infrastructure 0
Karthik K 4/19/2010 6:18 4/25/2010 10:53 index 0
thibaut 3/23/2010 23:11 7/2/2010 15:43 0.6.1 0.6.2 0
Karthik K 3/20/2010 1:19 1/26/2011 10:08 0.7 0
Karthik K 3/4/2010 8:13 7/2/2010 14:33 0.6.2 0
thibaut 3/3/2010 20:29 5/12/2010 9:09 0.6.1 0.6.2 0
Hongchao Li 2/24/2010 16:11 2/28/2010 11:33 0.6 0.6.1 cluster 0
thibaut 2/23/2010 12:32 2/24/2010 12:11 0
Neil Cohen 2/20/2010 9:43 2/28/2010 11:59 0.6 0.6.1 infrastructure 0
thibaut 2/17/2010 16:55 8/26/2011 13:16 0
thibaut 2/1/2010 23:01 2/2/2010 9:23 0.6 0.6 0
thibaut 2/1/2010 16:39 4/25/2010 10:57 0.7 0
Johannes Zillmann 2/1/2010 16:34 2/1/2010 16:39 0.7 0
Johannes Zillmann 2/1/2010 16:05 2/1/2010 16:16 0.6 0.6 0
Johannes Zillmann 1/31/2010 17:40 2/1/2010 16:08 0.6 0.6 cluster 0
Thomas Koch 1/20/2010 15:10 9/10/2010 9:23 1
Thomas Koch 1/14/2010 13:58 1/22/2010 9:46 0
Thomas Koch 1/14/2010 12:59 1/27/2010 10:25 0
Thomas Koch 1/14/2010 9:42 1/27/2010 10:03 0.6 0
Thomas Koch 1/14/2010 8:02 1/29/2010 8:23 0.6 0
Johannes Zillmann 1/14/2010 7:21 1/14/2010 8:57 0.6 0.6 0
Johannes Zillmann 1/12/2010 13:55 1/12/2010 15:28 0.6 0
Johannes Zillmann 1/7/2010 16:57 1/7/2010 16:57 0.6 0
thibaut 1/3/2010 22:14 6/22/2010 14:46 0.6 0.6 infrastructure 0
Johannes Zillmann 12/30/2009 20:14 12/31/2009 4:23 0.6 cluster 0
Johannes Zillmann 12/22/2009 11:25 1/12/2010 15:44 0.5.1 0.6 0
Johannes Zillmann 12/22/2009 11:24 1/5/2010 12:40 0.6 0.6 0
Johannes Zillmann 12/22/2009 11:23 12/31/2009 11:11 0.6 0.6 search 0
Johannes Zillmann 12/21/2009 12:01 12/28/2009 14:36 0.6 0.6 search 0
Johannes Zillmann 12/16/2009 14:50 12/30/2009 19:44 0.6 0.6 0
Johannes Zillmann 12/15/2009 18:06 12/15/2009 18:27 0.6 0.6 infrastructure 0
thibaut 12/14/2009 10:28 1/8/2010 8:24 0.6 search 0
Aseem Jain 12/8/2009 12:58 12/8/2009 12:58 0.6 cluster 0
Johannes Zillmann 12/8/2009 12:34 12/30/2009 19:43 0.6 0.6 0
Johannes Zillmann 12/8/2009 12:25 1/6/2010 15:52 0.5.1 0.6 cluster 0
Johannes Zillmann 12/1/2009 14:24 12/28/2009 14:35 0.6 0
Johannes Zillmann 11/19/2009 16:31 1/8/2010 7:55 0.7 cluster search 0
Johannes Zillmann 11/8/2009 23:03 9/26/2011 7:14 0.5.1 0.6 search 0
Yair Even-Zohar 10/22/2009 17:29 10/22/2009 17:40 0.5.1 cluster 0
Phil Hagelberg 10/20/2009 23:47 9/26/2010 14:29 0.5.1 search 0
Phil Hagelberg 10/16/2009 21:14 10/17/2009 0:45 0.6 0
Phil Hagelberg 10/16/2009 20:54 10/31/2009 17:31 0.6 0
Jason Rutherglen 10/13/2009 23:46 10/14/2009 4:17 0.5.1 cluster 0
Stefan Groschupf 10/8/2009 7:54 1/12/2010 15:51 0.5.1 0.6 0
Stefan Groschupf 10/8/2009 7:10 10/13/2009 18:59 0.5.1 0.6 0
Stefan Groschupf 10/7/2009 7:48 10/8/2009 1:18 0.5.1 0.6 0
Jason Venner 10/6/2009 16:16 10/13/2009 4:10 infrastructure 0
Imran M M Yousuf 10/5/2009 5:18 10/13/2009 6:55 0.5.1 infrastructure 0
Stefan Groschupf 10/5/2009 0:26 12/30/2009 20:15 0.6 0
Stefan Groschupf 9/30/2009 23:29 11/19/2009 16:45 0.6 0
Jason Rutherglen 9/28/2009 17:45 2/1/2011 17:20 0.5.1 0.6 0
Jason Rutherglen 8/5/2009 22:36 4/25/2010 10:59 0.5.1 index 0
Jonathan Gray 7/21/2009 22:27 11/25/2009 14:28 0.5.1 0.6 1
Phil Hagelberg 7/17/2009 18:58 10/13/2009 18:56 0.5.1 0.6 index 0
Phil Hagelberg 7/17/2009 0:00 6/24/2010 13:16 0.5.1 0.6 index 0
Phil Hagelberg 7/15/2009 23:23 10/13/2009 21:44 0.5.1 0.6 0
Phil Hagelberg 6/23/2009 21:25 10/31/2009 17:41 0.5.1 0.6 infrastructure 0
Phil Hagelberg 6/23/2009 21:21 10/14/2009 21:41 0.5.1 0.6 infrastructure 0
Peter Voss 6/15/2009 19:05 6/16/2009 7:23 0.5.1 0.6 search 0
Stefan Groschupf 6/5/2009 18:21 6/5/2009 18:28 0.5.1 0.6 0
Johannes Herr 6/5/2009 14:21 6/5/2009 17:00 0.6 0.6 0
Ted Dunning 6/4/2009 22:23 10/14/2009 21:44 0.6 0
Ted Dunning 6/4/2009 21:45 12/30/2009 20:17 0.7 0
Johannes Herr 6/4/2009 16:17 6/12/2009 18:11 0.6 0.6 0
Ken Krugler 6/4/2009 16:04 10/13/2009 19:00 0.5.1 0.6 0
Ken Krugler 6/4/2009 15:09 6/5/2009 19:26 0.5.1 0.6 0
Stefan Groschupf 6/2/2009 17:42 12/2/2009 13:47 0.5.1 0.6 0
Peter Voss 5/27/2009 12:20 5/27/2009 12:20 0.6 0
Ted Dunning 5/13/2009 17:27 5/13/2009 17:27 0.5.1 cluster 0
Ted Dunning 5/13/2009 17:19 1/8/2010 7:55 0.5.1 0.7 cluster 0
Andrew John 5/11/2009 0:24 10/4/2009 3:52 0.5.1 0.6 0
Ted Dunning 5/10/2009 2:19 10/4/2009 3:58 0.6 0
Ted Dunning 5/10/2009 2:03 12/7/2009 19:39 0.5.1 0
Ken Krugler 5/8/2009 23:16 9/29/2009 6:31 0.5.1 0.6 0
Ken Krugler 5/8/2009 23:14 10/13/2009 19:06 0.5.1 0.6 0
Ken Krugler 5/8/2009 23:09 10/13/2009 22:52 0.5.1 0.6 0
Andrew John 5/8/2009 21:27 1/12/2010 15:50 0.5.1 0.6 2
Ted Dunning 4/30/2009 21:48 5/1/2009 6:07 0.5.1 0.6 0
Peter Voss 4/27/2009 7:59 4/28/2009 12:22 0.6 0
Stefan Groschupf 4/25/2009 4:12 4/26/2009 18:46 0.5 0.6 0
Peter Voss 4/23/2009 15:35 1/8/2010 7:43 0
Erich Nachbar 4/22/2009 22:17 1/8/2010 7:55 0.5 0.7 2
Erich Nachbar 4/22/2009 22:02 9/29/2009 6:19 0.5 0.6 0
VM 4/22/2009 7:14 5/1/2009 8:00 0.6 0
Erich Nachbar 4/21/2009 23:08 4/23/2009 7:55 0.5.1 0
dengminwen 4/21/2009 12:20 5/4/2009 20:03 0.5.1 0.6 0
dengminwen 4/21/2009 12:07 9/29/2009 6:08 0.5.1 0.6 search 0
Ted Dunning 4/17/2009 16:17 12/28/2009 14:32 0.4 0.6 cluster 0
Stefan Groschupf 4/17/2009 7:11 9/28/2009 14:26 0.5 0.6 0
Stefan Groschupf 4/17/2009 4:23 4/17/2009 7:27 0.5 0.5.1 0
Stefan Groschupf 4/17/2009 4:22 5/4/2009 4:49 0.5 0.6 0
Stefan Groschupf 4/17/2009 4:21 4/17/2009 4:57 0.5 0.5.1 0
Stefan Groschupf 4/17/2009 4:21 4/17/2009 4:50 0.5 0.5.1 0
Stefan Groschupf 4/17/2009 4:20 4/17/2009 4:40 0.5 0.5.1 0
Peter Voss 4/16/2009 14:20 4/16/2009 14:22 0.5 0.6 0
VM 4/16/2009 5:11 4/16/2009 13:25 0.5 0.6 0
VM 4/13/2009 1:58 9/28/2009 14:31 0.6 0
Stefan Groschupf 4/9/2009 6:31 4/10/2009 6:15 0.6 0.6 0
Stefan Groschupf 4/9/2009 6:30 4/9/2009 8:01 0.6 0.6 0
Stefan Groschupf 4/8/2009 23:00 1/12/2010 15:45 0.6 0
Stefan Groschupf 4/3/2009 5:53 4/10/2009 6:19 0.5 0.6 0
Stefan Groschupf 4/1/2009 8:46 9/26/2011 7:13 0.5 0.6 0
Stefan Groschupf 4/1/2009 8:13 10/4/2009 3:57 0.5 0.6 0
Stefan Groschupf 4/1/2009 8:04 4/1/2009 8:07 0.5 0.5 0
Stefan Groschupf 4/1/2009 6:32 4/1/2009 6:36 0.5 0.5 0
Stefan Groschupf 3/28/2009 4:43 3/28/2009 4:51 0.5 0
Stefan Groschupf 3/25/2009 17:41 3/27/2009 5:44 0.5 0.4 0
Erich Nachbar 3/19/2009 22:05 4/4/2009 8:37 0.5 0.5 search 0
Peter Voss 3/19/2009 17:56 3/19/2009 23:21 0.5 0
Peter Voss 3/19/2009 17:47 3/25/2009 16:02 0.5 0.5 infrastructure 0
Peter Voss 3/19/2009 17:35 3/28/2009 16:43 0.5 0.5 search 0
Stefan Groschupf 3/17/2009 5:15 3/17/2009 5:17 0.5 0
Stefan Groschupf 2/19/2009 3:04 2/19/2009 3:05 0.5 0
Stefan Groschupf 1/8/2009 16:04 1/7/2010 18:55 0.6 0
Ted Dunning 1/7/2009 21:50 10/4/2009 3:57 0.6 0
Ted Dunning 1/7/2009 21:45 4/3/2009 4:15 0.5 0.5 0
Ted Dunning 1/7/2009 21:44 5/1/2009 17:55 0.6 0
Stefan Groschupf 12/18/2008 7:55 1/8/2009 15:19 0.5 0
Stefan Groschupf 12/18/2008 7:35 6/5/2009 3:36 0.6 0
Johannes Zillmann 12/12/2008 15:04 1/8/2010 7:51 0.4 cluster 0
Johannes Zillmann 12/12/2008 15:02 3/28/2009 5:43 0.4 0.5 cluster 0
Johannes Zillmann 12/12/2008 14:59 1/12/2010 15:49 0.4 index 0
Johannes Zillmann 12/12/2008 14:57 10/4/2009 3:55 0.4 cluster 0
Johannes Zillmann 12/12/2008 14:55 5/1/2009 7:09 0.4 cluster 0
Johannes Zillmann 12/12/2008 14:54 2/19/2010 16:28 0.4 0.6 cluster 0
Johannes Zillmann 12/12/2008 14:27 1/8/2010 7:39 0.4 0.6 cluster 0
Johannes Zillmann 12/12/2008 14:21 4/1/2009 16:53 0.4 0.5 search 0
Johannes Zillmann 12/12/2008 14:15 2/1/2010 16:19 0.4 index 0
Johannes Zillmann 12/12/2008 14:10 4/3/2009 7:29 0.4 0.5 search 0
Johannes Zillmann 12/12/2008 14:08 1/12/2010 15:47 0.4 index 0
remodel 2/28/2010 13:06 2/28/2010 13:06 0
Vivek Magotra 12/3/2009 5:21 12/3/2009 5:21 0
Fuad Efendi 11/30/2009 16:00 12/2/2009 2:05 0
Ken Krugler 10/30/2009 12:55 10/30/2009 12:55 0.4 0
Ken Krugler 10/30/2009 12:53 10/30/2009 12:53 0.4 0
Ken Krugler 10/30/2009 12:51 10/30/2009 12:51 0.4 0
Ken Krugler 10/30/2009 12:49 10/30/2009 12:49 0.4 0
Ken Krugler 10/30/2009 12:46 10/30/2009 12:46 0
Ken Krugler 10/30/2009 12:45 10/30/2009 12:45 0
Ken Krugler 10/30/2009 12:43 10/30/2009 12:43 0.4 0
Ken Krugler 10/30/2009 12:41 10/30/2009 12:41 0.4 0
Ken Krugler 10/27/2009 21:47 10/27/2009 21:47 0.4 0
Ken Krugler 10/27/2009 12:44 10/27/2009 12:44 0.4 0
Ken Krugler 10/1/2009 16:02 10/1/2009 16:02 0
Ken Krugler 10/1/2009 16:01 10/1/2009 16:01 0.4 0
Ken Krugler 10/1/2009 15:53 10/1/2009 15:53 0.4 0
Ken Krugler 9/30/2009 17:06 10/31/2009 17:36 0
Ken Krugler 9/17/2009 21:41 9/30/2009 16:54 0.4 0
Ken Krugler 9/17/2009 21:39 10/31/2009 17:42 0
Ken Krugler 9/17/2009 21:38 9/28/2009 13:11 0
Ken Krugler 9/17/2009 21:32 9/17/2009 21:32 0
Ken Krugler 9/17/2009 21:18 9/18/2009 15:18 0.4 0.5 0
Ken Krugler 8/14/2009 20:10 8/14/2009 20:10 0.4 0
Ken Krugler 8/14/2009 20:05 9/11/2009 22:49 0.4 0
Ken Krugler 8/14/2009 14:46 9/11/2009 22:28 0
Ken Krugler 8/14/2009 14:23 10/27/2009 16:57 0.4 0
Ken Krugler 8/14/2009 14:12 8/14/2009 14:12 0.4 0
Ken Krugler 8/14/2009 14:04 8/18/2009 2:36 0.3 0.5 0
Ken Krugler 8/8/2009 14:29 8/14/2009 20:03 0.4 0
Ken Krugler 7/31/2009 22:58 8/4/2009 22:34 0.3 0.4 0
Ken Krugler 7/31/2009 22:51 9/30/2009 16:55 0.3 0.5 0
Stefan Groschupf 6/15/2009 20:37 10/27/2009 16:52 0.4 0
Ken Krugler 5/1/2009 21:40 5/1/2009 21:40 0.3 0
Ken Krugler 4/30/2009 21:20 9/11/2009 22:34 0
Ken Krugler 4/30/2009 21:18 5/17/2009 23:36 0
Ken Krugler 4/30/2009 21:18 4/30/2009 23:10 0.4 0
Ken Krugler 4/29/2009 10:54 4/29/2009 10:54 0
Ken Krugler 4/29/2009 10:24 4/29/2009 10:24 0
Ken Krugler 4/29/2009 10:20 8/5/2009 18:41 0.4 0
Stefan Groschupf 4/24/2009 23:13 4/24/2009 23:13 0.4 0
Stefan Groschupf 4/24/2009 23:12 4/24/2009 23:12 0.4 0
Ken Krugler 4/24/2009 20:35 8/7/2009 23:09 0
Stefan Groschupf 4/22/2009 21:45 4/22/2009 21:45 0.5 0
Stefan Groschupf 4/22/2009 21:38 9/11/2009 22:34 0.3 0.5 0
Ken Krugler 4/22/2009 0:34 8/7/2009 23:13 0
Ken Krugler 4/18/2009 16:04 4/18/2009 16:04 0
Ken Krugler 4/18/2009 15:54 8/18/2009 2:36 0.5 0
Ken Krugler 4/18/2009 15:49 8/7/2009 23:08 0.4 0
Ken Krugler 4/18/2009 15:43 9/30/2009 17:08 0.5 0
Ken Krugler 4/18/2009 15:42 8/11/2009 16:29 0.4 0
Ken Krugler 4/18/2009 15:39 8/11/2009 16:31 0.4 0
Ken Krugler 4/18/2009 15:35 10/30/2009 12:29 0
Ken Krugler 4/18/2009 15:35 10/27/2009 16:52 0.4 0
Ken Krugler 4/18/2009 15:33 9/11/2009 22:46 0.4 0
Ken Krugler 4/18/2009 15:31 4/18/2009 15:31 0
Ken Krugler 4/18/2009 15:29 7/31/2009 22:39 0.4 0
Ken Krugler 4/18/2009 15:27 10/30/2009 12:34 0.5 0
Ken Krugler 4/18/2009 15:26 9/11/2009 22:31 0.5 0
Ken Krugler 4/18/2009 15:25 10/27/2009 23:49 0.5 0
Stefan Groschupf 4/18/2009 0:45 7/31/2009 22:40 0.3 0.4 0
Stefan Groschupf 4/18/2009 0:43 4/18/2009 0:46 0.3 0
Stefan Groschupf 4/11/2009 1:13 9/11/2009 22:38 0.3 0.5 0
Stefan Groschupf 4/10/2009 23:50 7/31/2009 22:41 0.3 0.4 0
Stefan Groschupf 4/10/2009 23:43 8/18/2009 2:36 0.3 0.5 0
Stefan Groschupf 4/10/2009 21:01 4/18/2009 0:39 0.3 0.3 0
Stefan Groschupf 4/10/2009 21:00 4/10/2009 21:00 0.2 0.2 0
Stefan Groschupf 4/9/2009 22:09 8/18/2009 2:36 0.2 0.5 0
Stefan Groschupf 4/9/2009 5:07 4/9/2009 5:09 0.2 0.2 0
Stefan Groschupf 4/4/2009 0:29 4/7/2009 1:02 0.2 0
Stefan Groschupf 4/4/2009 0:28 8/11/2009 16:32 0.2 0
Stefan Groschupf 4/4/2009 0:26 8/7/2009 23:12 0.2 0
Stefan Groschupf 4/4/2009 0:26 7/31/2009 22:40 0.2 0
Stefan Groschupf 4/4/2009 0:26 8/11/2009 16:33 0.4 0
Stefan Groschupf 4/4/2009 0:26 4/18/2009 0:40 0.3 0
Stefan Groschupf 4/3/2009 22:26 4/3/2009 22:32 0.1 0.1 0
Stefan Groschupf 4/3/2009 22:25 4/3/2009 22:32 0.1 0.1 0
Stefan Groschupf 4/3/2009 22:25 4/3/2009 22:32 0.1 0.1 0
Stefan Groschupf 4/3/2009 22:25 4/3/2009 22:32 0.1 0.1 0
Ken Krugler 4/2/2009 2:17 4/10/2009 21:02 0.2 0.2 0
Stefan Groschupf 4/2/2009 2:16 8/18/2009 2:36 0.2 0.5 0
Stefan Groschupf 4/2/2009 1:13 4/3/2009 22:32 0.1 0.1 0
Stefan Groschupf 4/1/2009 1:31 8/18/2009 2:36 0.2 0.5 0
Stefan Groschupf 4/1/2009 1:20 4/7/2009 1:11 0.2 0.2 0
Stefan Groschupf 4/1/2009 1:18 9/25/2009 18:14 0.2 0.5 0
Stefan Groschupf 4/1/2009 1:18 4/7/2009 2:00 0.2 0.2 0
Ken Krugler 3/30/2009 23:53 4/3/2009 22:33 0.1 0.1 0
Ken Krugler 3/30/2009 20:30 4/3/2009 22:33 0.1 0.1 0
Sub-Tasks Issue Links Environment Description Security Level
Office 2007, Webdav-Servlet 2.0.1, Windows XP If one opens an Office document from a WebDAV share, the file is read-only.
The problem seems to be within the doLock method:
private void generateXMLReport(ITransaction transaction,
        HttpServletResponse resp, LockedObject lo)
[...]
generatedXML.writeElement("DAV::owner", XMLWriter.OPENING);
// encapsulating the owner with an href element will trigger the bug
// generatedXML.writeElement("DAV::href", XMLWriter.OPENING);
generatedXML.writeText(_lockOwner);
// generatedXML.writeElement("DAV::href", XMLWriter.CLOSING);
Once I removed this element, all tested combinations of Windows versions and Office versions (2007 and 2010) worked fine.
In the new release (r100), the implementation of IMimeTyper changed so that the object store is requested without the transaction. The Grails WebDAV plugin uses a transaction and fails on the MIME request with an NPE.
The provided patch adds an ITransaction parameter to IMimeTyper.getMimeType() to fix this issue.
From the mailing list:
http://sourceforge.net/mailarchive/forum.php?thread_name=4D6FAF4F.2090403%40rewoo.com&forum_name=webdav-servlet-general
----------8
x.xxxxxx
@aarboard.ch
Exception = java.net.SocketTimeoutException
Source = com.ibm.ws.webcontainer.channel.WCCByteBufferInputStream
probeid = 102
Stack Dump = java.net.SocketTimeoutException: Async operation timed out
at com.ibm.ws.tcp.channel.impl.AioTCPReadRequestContextImpl.processSyncReadRequest(AioTCPReadRequestContextImpl.java:157)
at com.ibm.ws.tcp.channel.impl.TCPReadRequestContextImpl.read(TCPReadRequestContextImpl.java:109)
at com.ibm.ws.http.channel.impl.HttpServiceContextImpl.fillABuffer(HttpServiceContextImpl.java:4136)
at com.ibm.ws.http.channel.impl.HttpServiceContextImpl.readSingleBlock(HttpServiceContextImpl.java:3378)
at com.ibm.ws.http.channel.impl.HttpServiceContextImpl.readBodyBuffer(HttpServiceContextImpl.java:3483)
at com.ibm.ws.http.channel.inbound.impl.HttpInboundServiceContextImpl.getRequestBodyBuffer(Ht
An implementation of IWebdavStore might have
to destroy resources created in the constructor.
However, there is currently no destroy() method
in IWebdavStore. The attached patch creates this
method, implements it in LocalFilesystemStore
and calls destroy() on the store in the webdav
servlet.
A further improvement would be to also create an init() method that is called by the servlet. This would replace calling the constructor of the IWebdavStore directly with a File argument.
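The lifecycle idea in this report can be sketched roughly as follows. This is only an illustration under stated assumptions: `StoreLifecycle`, `DemoStore` and `DemoServlet` are hypothetical stand-ins, not the real `IWebdavStore`, `LocalFilesystemStore` or servlet classes, and the attached patch may look different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the store interface; the real IWebdavStore has many more methods.
interface StoreLifecycle {
    void init();     // suggested follow-up: called by the servlet instead of a File-arg constructor
    void destroy();  // releases whatever the store allocated
}

// Hypothetical store that records lifecycle calls so their order can be checked.
final class DemoStore implements StoreLifecycle {
    final List<String> events = new ArrayList<String>();
    private boolean open;

    @Override
    public void init() {
        open = true;
        events.add("init");
    }

    @Override
    public void destroy() {
        if (open) {          // idempotent: a second destroy() does nothing
            open = false;
            events.add("destroy");
        }
    }

    boolean isOpen() {
        return open;
    }
}

// Hypothetical servlet-like owner that forwards its own lifecycle to the store.
final class DemoServlet {
    private final StoreLifecycle store;

    DemoServlet(StoreLifecycle store) {
        this.store = store;
    }

    void init() {
        store.init();
    }

    void destroy() {
        store.destroy();
    }
}
```

Making destroy() idempotent is a design choice of this sketch; it keeps a double shutdown by the container harmless.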
Mac OS X The method "removeLockedObjectOwner" in
LockedObject.java seems to have a problem with
ArrayIndexOutOfBoundsExceptions.
From what I can tell the issue is just a matter of
remembering that the array was shrunk by one
when removing the lock owner.
http://webdav-servlet.svn.sourceforge.net/viewvc/webdav-servlet/trunk/src/main/java/net/sf/webdav/locking/LockedObject.java?annotate=54#l116
The following fix seems to resolve the problem in
my simple test setup.
infundibulum:webdav-servlet knut$ svn diff /Users/knut/src/webdav-servlet/src/main/java/net/sf/webdav/locking/LockedObject.java
Index: /Users/knut/src/webdav-servlet/src/main/java/net/sf/webdav/locking/LockedObject.java
===================================================================
--- /Users/knut/src/webdav-servlet/src/main/java/net/sf/webdav/locking/LockedObject.java (revision 87)
+++ /Users/knut/src/webdav-servlet/src/main/java/net/sf/webdav/locking/Loc
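The patch itself is cut off in this export. As an illustration only (this is not the patch from the report), the following sketch shows the kind of index arithmetic the report describes: when one owner is removed, the new array is one element shorter, so only `length - i - 1` elements remain to be copied after position `i`. The class `OwnerArrays` and the method `removeOwner` are hypothetical names.

```java
// Hypothetical helper; the real logic lives in LockedObject.removeLockedObjectOwner.
final class OwnerArrays {
    private OwnerArrays() {}

    // Returns a copy of owners without the first occurrence of owner,
    // or the original array if owner is not present.
    static String[] removeOwner(String[] owners, String owner) {
        if (owners == null) {
            return null;
        }
        for (int i = 0; i < owners.length; i++) {
            if (owners[i].equals(owner)) {
                String[] shrunk = new String[owners.length - 1];
                // Elements before index i keep their positions.
                System.arraycopy(owners, 0, shrunk, 0, i);
                // Elements after index i move down by one; there are
                // owners.length - i - 1 of them, not owners.length - i.
                System.arraycopy(owners, i + 1, shrunk, i, owners.length - i - 1);
                return shrunk;
            }
        }
        return owners;
    }
}
```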
This bug was not reported by me; I am currently copying my issues from the original SourceForge project bug tracker to this JIRA and came across this (IMHO valid) bug report:
In a multithreaded environment, SimpleDateFormat can produce correctly formatted output for a totally different date.
See e.g. net.sf.webdav.methods.CREATION_DATE_FORMAT
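One common way to avoid sharing a SimpleDateFormat between threads is to keep one instance per thread. The sketch below assumes this approach; the class `SafeDateFormat`, the method `formatCreationDate`, and the date pattern are assumptions for illustration and are not taken from the project's actual fix.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper; pattern and names are assumptions for illustration.
final class SafeDateFormat {
    private SafeDateFormat() {}

    // One SimpleDateFormat per thread, so no instance is ever used concurrently.
    private static final ThreadLocal<SimpleDateFormat> CREATION_DATE =
        new ThreadLocal<SimpleDateFormat>() {
            @Override
            protected SimpleDateFormat initialValue() {
                SimpleDateFormat format =
                    new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.US);
                format.setTimeZone(TimeZone.getTimeZone("GMT"));
                return format;
            }
        };

    static String formatCreationDate(Date date) {
        return CREATION_DATE.get().format(date);
    }
}
```

Creating a new SimpleDateFormat per call, or synchronizing on a shared one, would also work; the per-thread variant simply avoids both the allocation and the lock.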
When I try to copy a collection to a nonexistent path, I get an error 500 instead of 409 (RFC 2518 8.8.5).
The problem appears in DoCopy.copy at createResource.
I know the specification is a little unclear for this case, but I think error 409 fits best.
At the moment there is no handling of any specific content type.
According to RFC 2518 8.3.1, I would suggest always returning a 415 error by default.
Maybe at a later time the API can be extended to handle specific content types.
According to RFC 2518 8.3.1, MKCOL must fail with a 409 error when one or more parent elements in the path do not exist.
Currently I get a 207 Multi-Status containing a 404.
The fix can be done simply like this in DoMkcol:
parentSo = _store.getStoredObject(transaction, parentPath);
if (parentSo == null) {
    // parent does not exist
    resp.sendError(WebdavStatus.SC_CONFLICT);
    return;
}
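The MKCOL status rules discussed in the reports above (409 for a missing parent, 415 for a request with a Content-Type) can be summarized in a small decision function. This is a sketch under assumptions: `MkcolStatus` and `decide` are hypothetical names, the order of the checks is a choice made here, and the real DoMkcol also has to deal with locks and other cases.

```java
// Hypothetical summary of the MKCOL status codes named in RFC 2518, section 8.3.
final class MkcolStatus {
    private MkcolStatus() {}

    static final int SC_CREATED = 201;
    static final int SC_METHOD_NOT_ALLOWED = 405;
    static final int SC_CONFLICT = 409;
    static final int SC_UNSUPPORTED_MEDIA_TYPE = 415;

    static int decide(boolean parentExists, boolean targetExists, boolean hasContentType) {
        if (!parentExists) {
            return SC_CONFLICT;               // intermediate collections are missing
        }
        if (targetExists) {
            return SC_METHOD_NOT_ALLOWED;     // MKCOL only works on a non-existent resource
        }
        if (hasContentType) {
            return SC_UNSUPPORTED_MEDIA_TYPE; // no request body type is understood yet
        }
        return SC_CREATED;
    }
}
```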
All environments The current implementation for building an iJetty-compatible servlet has problems. The following patch fixes that:
Index: src/main/java/net/sf/webdav/methods/DoLock.java
===================================================================
--- src/main/java/net/sf/webdav/methods/DoLock.java (revision 82)
+++ src/main/java/net/sf/webdav/methods/DoLock.java (working copy)
@@ -417,7 +417,7 @@
 currentNode = childList.item(i);
 if (currentNode.getNodeType() == Node.ELEMENT_NODE) {
- _lockOwner = currentNode.getTextContent();
+ _lockOwner = currentNode.getNodeValue();
 }
 }
 }
Index: build.gradle
===================================================================
--- build.gradle (revision 82)
+++ build.gradle (working copy)
@@ -143,10 +143,16 @@
Server on Linux, all clients (Windows, Mac and Linux)
I have a folder on my server called "joão", and
when I try to open it using Mac, Windows or
Linux, the servlet returns an error. Inspecting the
logs, I can see that the encodings of the GetObject
request are wrong.
Let me see if I can explain the procedure so that
you can replicate it:
1. Create a folder or a file with an accent within
the webdav storage area, e.g. "joão" or "luís.txt".
2. Start Tomcat with the webdav servlet
application installed.
3. Mount the webdav disk on your favorite OS or
webdav client.
4. Try to open the created folder or file on the
client.
You'll notice that the client pops up an error
every time one tries to open the folder or the file.
I think the client makes the request to the webdav
servlet in URL-encoded form, but the servlet is
unable to decode the folder name and outputs an
error.
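To illustrate the suspected decoding problem, here is a minimal sketch using only the JDK. It assumes the client percent-encodes the UTF-8 bytes of the name, which this report does not confirm; the class name PathDecodingDemo is hypothetical and not part of webdav-servlet.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; not part of webdav-servlet.
final class PathDecodingDemo {

    // Decodes a percent-encoded path segment with an explicit charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumption: the client sends the UTF-8 bytes of the name, percent-encoded.
        String encoded = URLEncoder.encode("jo\u00e3o", "UTF-8");
        System.out.println("on the wire:        " + encoded);
        System.out.println("decoded as UTF-8:   " + decode(encoded, "UTF-8"));
        // Decoding with the wrong charset yields a different, garbled name,
        // so a lookup in the store would fail.
        System.out.println("decoded as Latin-1: " + decode(encoded, "ISO-8859-1"));
    }
}
```

If the servlet (or the container) decodes the path with a charset other than the one the client used, the resulting name no longer matches the folder on disk.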
Some further info:
When a non-webdav client does a GET request
for a folder resource we currently return a very
simple text document.
It would be nice if we returned a full HTML response
page with the correct charset encoding, and clickable
links to navigate the folders and download the
files from a standard web browser.
Currently the GET is handled in DoGet.java in the
folderBody method
if (so.isFolder()) {
    // TODO some folder response (for browsers, DAV tools
    // use propfind) in html?
    OutputStream out = resp.getOutputStream();
    String[] children = _store.getChildrenNames(transaction, path);
    children = children == null ? new String[] {} : children;
    StringBuffer childrenTemp = new StringBuffer();
    childrenTemp.append("Contents of this Folder:\n");
    for (String child : children) {
        childrenTemp.append(child);
        childrenTemp.append("\n");
    }
    out.write(childrenTemp.toString().getBytes());
}
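One possible shape for the requested HTML listing is sketched below. It is an illustration only, not the project's implementation: the class and method names are hypothetical, and it assumes the folder URL ends with a slash so that relative links resolve to the children.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; not part of webdav-servlet.
final class FolderListingSketch {

    // Escapes the characters that are special in HTML text and attributes.
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Percent-encodes one path segment as UTF-8 (URLEncoder writes spaces as '+').
    static String encodeSegment(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Builds an HTML page with one clickable, relative link per child.
    static String folderBody(String[] children) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><meta charset=\"UTF-8\">");
        html.append("<title>Contents of this Folder</title></head><body><ul>\n");
        for (String child : children) {
            html.append("<li><a href=\"").append(encodeSegment(child)).append("\">");
            html.append(escapeHtml(child)).append("</a></li>\n");
        }
        html.append("</ul></body></html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        System.out.print(folderBody(new String[] {"docs", "read me.txt"}));
    }
}
```

In the servlet, the page would be written with an explicit charset, for example by setting the content type to "text/html; charset=UTF-8" and calling getBytes with the same charset instead of the platform default.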
Hello,
> In addition to this problem there seems to exist a problem with
> gvfs/1.0.2 on PUT (file creation). I'm trying to track in more detail
> where it fails. (The file is created, but somehow the gnome client
> seems to not evaluate the answer in the way it was intended by the
> webdav servlet and it retries to PUT the file...)
I tracked the problem down to the DoPut class,
where the response length is set to the length of
the uploaded file (line 169).
The effect is that gvfs tries to receive a response
of "content-length" bytes, but the webdav servlet
only sends a few bytes. So the request times out
and gvfs retries the PUT, times out again and then
drops the connection.
According to the code just above and the
comments, it seems that something very similar
happens with Goliath.
Just removing the content-length from the reply
makes it work with gvfs.
Is there actually a webdav client REQUIRING the
response length to be set to the resource size
instead of the response length?
String[] getChildrenNames(ITransaction transaction, String folderUri);
If the URI points to a file, an NPE is thrown in
DoPropFind on line 220.
0.4 Hello,
Maybe it is because it is not clear enough for
dummies like me but when i launch bin/nutch
admin /opt/nutch-gui-0.4/build/nutch-gui-0.4
50060, i get an error in hadoop log :
2010-12-29 08:37:13,122 WARN
servlet.PageNotFound - No mapping found for
HTTP request with URI [/general/index.html] in
DispatcherServlet with name 'springapp'
maybe use for example login.htm
maybe with Environment variable like
"-Djava.security.auth.login.config=path/nutchgui.auth"
or over the nutch-default.xml
for example NutchGuiLogin
Or configure the name within the nutch-default.xml
validate and report an error if the user forgot to
set http.agent.name before starting to crawl.
maybe the crawling should succeed anyway in
this case and the user should get a warning.
this plugin allows us to crawl specified pages.
this plugin is for url uploading. the uploaded urls
will be fetched.
this plugin can be used to run scheduled crawls
with this plugin a user can create new crawls and
crawl these created crawls.
a host statistic should also be supported.
with this plugin it is possible to configure a nutch
instance. the nutch-site.xml of an instance should
be overwritten
this plugin should show memory/cpu usage and a
logfile viewer
this plugin should create nutch instances
this plugin should be the welcome page for every
instance
supported languages
+ german
+ english
This patch implements Lucene query filters.
Currently the LuceneServer class declares a few
methods which take a filter argument, but these
methods throw
UnsupportedOperationException(). This patch
provides support for passing a filter argument to
the search() and getResultCount() methods.
A FilterWritable class is included to allow passing
filter arguments between client and server.
Reported by Murali Krishna:
{noformat}
Hi,
I usually see operator thread getting stopped and
restarted immediately. Is that expected? This
particular instance, in one of the node it stopped
for almost 2 hours and no index deployment
happened during this time on this node. The
'listNodes' was showing the node connected
though. I am using Katta 0.6.2.
2011-05-05 00:02:15,793 INFO net.sf.katta.master.OperatorThread:100 - operator thread stopped
2011-05-05 00:02:17,276 WARN org.I0Itec.zkclient.ZkEventThread:78 - Error handling event ZkEvent[State changed to SyncConnected sent to net.sf.katta.protocol.InteractionProtocol$1@64cbdef5]
org.I0Itec.zkclient.exception.ZkNodeExistsException: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /katta/master
at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:55)
at org.I0Itec.zkclient.ZkClient.retryUntilConnected(Z
Recently, the constant
{{Version.LUCENE_CURRENT}} was replaced by
{{Version.LUCENE_30}} in many files (mainly test
code) because LUCENE_CURRENT is deprecated.
If someone uses Lucene 3.1, which works without
problems by the way, they have to update the
version again. I'm using Lucene 4.0 and also have
to change the version.
It would be much easier to have a global constant
or a property to change the underlying Lucene
version at a single point.
To extend LuceneServer more easily and to
implement custom or enhanced search methods
it is necessary to make the inner classes of
LuceneServer protected and make some of their
properties public or create getter/setter methods
for it.
The attached patch makes the inner classes
protected and some properties public.
The multithreaded shard search can be slightly
improved by the use of CompletionService.
Currently the multithreaded search is done by
creating a SearchCall for every shard and
submitting them to the thread pool. Later on, the
search function waits for each submitted
SearchCall in the order they were submitted rather
than in the order they finish. In the worst case
the first SearchCall takes much longer than the
later ones. All finished SearchCalls remain
unprocessed until the first SearchCall has
finished. This can increase the memory
consumption (later gc) and decrease processing
speed.
The attached patch fixes this by using
[ExecutorCompletionService|http://download.oracle.com/javase/6/docs/api/java/util/concurrent/ExecutorCompletionService.html].
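The idea can be sketched with plain java.util.concurrent classes. This is not the attached patch: the class and method names are hypothetical, and simple Callables stand in for SearchCall.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; Callables stand in for the per-shard SearchCall.
final class CompletionOrderSketch {

    // Submits all calls and collects results in the order the calls finish.
    static <T> List<T> collectInCompletionOrder(ExecutorService pool, List<Callable<T>> calls)
            throws InterruptedException, ExecutionException {
        CompletionService<T> completion = new ExecutorCompletionService<T>(pool);
        for (Callable<T> call : calls) {
            completion.submit(call);
        }
        List<T> results = new ArrayList<T>(calls.size());
        for (int i = 0; i < calls.size(); i++) {
            // take() blocks until the next call, whichever it is, has finished.
            results.add(completion.take().get());
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Callable<String>> calls = new ArrayList<Callable<String>>();
            calls.add(() -> { Thread.sleep(300); return "slow shard"; });
            calls.add(() -> "fast shard");
            System.out.println(collectInCompletionOrder(pool, calls));
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each result is consumed as soon as it is available, a slow first shard no longer delays the processing of results that are already finished.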
Hi,
I am on 0.6.2 and I get this error during reconnect
and katta node stops updating the log after this,
but the process continues to run.
2011-03-05 14:04:33,500 INFO net.sf.katta.operation.node.AbstractShardOperation:75 - publish shard 'Table_inc_1299357600#part-r-00005'
2011-03-05 14:04:33,507 INFO net.sf.katta.operation.node.AbstractShardOperation:55 - redeploy shard 'Table_inc_1299357600#part-r-00003'
2011-03-05 14:04:33,507 INFO net.sf.katta.operation.node.AbstractShardOperation:75 - publish shard 'Table_inc_1299357600#part-r-00003'
2011-03-05 14:04:33,964 WARN org.I0Itec.zkclient.ZkEventThread:78 - Error handling event ZkEvent[State changed to SyncConnected sent to net.sf.katta.protocol.InteractionProtocol$1@2e3c5d8e]
org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /katta/work/node-queues/host1:20000
at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68)
at org.I0Itec.zkclient.ZkClient.retryUntilConnected(Z
The output tables produced by the katta
commands {{listIndices}}, {{listNodes}} and
{{check}} are not sorted and contain separation
characters. Sometimes it is hard to read the
unsorted output, and parsing would be easier if
no extra separation characters were present.
It would be useful to have options to suppress
these separators, order the table and optionally
remove the table header.
The attached patch introduces three parameters:
* {{-b}} batch mode
* {{-n}} don't write column names
* {{-S}} sort the index/shard/node names
The DocumentFrequencyWritable.toString()
method returns a string that is not easy to
understand. The aim is to clarify the two returned
numbers. The attached patch does this.
When adding a Lucene index:
{code}
bin/katta addIndex testIndex 3
{code}
the path needs to contain one or multiple Lucene
indices. The adding fails if the given folder is a
Lucene index by itself.
But we could add a special treatment for this case
and add the index as a single-sharded katta index.
See
https://issues.apache.org/jira/browse/ZOOKEEPER-795
Related to KATTA-182 and KATTA-167
The ZkEventThread catches only Exception in its
run method. Other Throwables will crash the thread
and the client event handling stops working. That
way an OOM will crash the client.
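A minimal sketch of an event loop that survives Errors as well as Exceptions is shown below. It is not the ZkEventThread code; the class name, the queue of Runnables and the failure counter are all assumptions made for illustration. Whether continuing after an OutOfMemoryError is wise is a separate design question.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an event thread; not the zkclient implementation.
final class EventLoopSketch extends Thread {
    private final BlockingQueue<Runnable> events = new LinkedBlockingQueue<Runnable>();
    final AtomicInteger failures = new AtomicInteger();

    EventLoopSketch() {
        setDaemon(true);
    }

    void send(Runnable event) {
        events.add(event);
    }

    @Override
    public void run() {
        try {
            while (!isInterrupted()) {
                Runnable event = events.take();
                try {
                    event.run();
                } catch (Throwable t) {
                    // Catching Throwable (not only Exception) keeps the loop
                    // alive when a handler fails with an Error.
                    failures.incrementAndGet();
                }
            }
        } catch (InterruptedException e) {
            // An interrupt is the shutdown signal in this sketch.
        }
    }
}
```

A real implementation would log the Throwable instead of only counting it.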
Executed DeployUndeploySearchInLoop for a
longer time (which deploys an index, searches in
it, and undeploys it). This emulates the usage of a
LuceneClient for a long time. Looking at heap
dumps after several hours I found the
{{ZooKeeper$ZkWatchManager.existWatches}}
map with 8400 entries: entries for paths like
{{/katta/shard-to-nodes/index53#bIndex}}, which
belong to indices that were undeployed a long
time ago.
A zookeeper watch is usually removed when an
event for it is triggered. ZkClient immediately
registers the watch again in case it has a
listener for it.
Now in Katta the Client removes its shard-to-nodes
listeners if an index has been removed.
Looking at the thread dump it seems that
sometimes the
{{ZooKeeper$ZkWatchManager.existWatches}}
has been cleared of undeployed indices and
sometimes not. I suspect that this depends on the
sequence of events. If the client gets an
index-removed event before the last shard-to-node
change event, every index-related watch gets
removed. If it's the other way around, some
obsolete watches hang around forever.
Removing the watches explicitly isn't that easy:
[ZOOKEEPER-
Right now a node is removed from the shard-to-node
mappings in the Client if a proxy invocation
fails.
The node-proxy itself, which gets removed as
well, might be re-established if a new shard is
added, but the removed shard-to-node mappings
are never re-established for that client.
Now a failing proxy invocation does not
necessarily mean that the proxy is corrupt, see
KATTA-180 as an example.
So the following approach would look a bit safer:
- remove a node-proxy only if x successive
invocations have failed
- re-establish the proxy immediately
- if the re-established proxy fails - remove the
shard-to-node mapping
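The first point of this proposal (acting only after x successive failures) could be tracked roughly as follows. This is a sketch under assumptions: the class and method names are hypothetical, nodes are identified by a plain string, and a success racing with a concurrent failure is not handled specially.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper; not part of the Katta client.
final class FailureTrackerSketch {
    private final int maxSuccessiveFailures;
    private final ConcurrentMap<String, AtomicInteger> failures =
            new ConcurrentHashMap<String, AtomicInteger>();

    FailureTrackerSketch(int maxSuccessiveFailures) {
        this.maxSuccessiveFailures = maxSuccessiveFailures;
    }

    // Records a failed invocation; returns true once the node has reached
    // the configured number of successive failures.
    boolean recordFailure(String node) {
        AtomicInteger counter = failures.get(node);
        if (counter == null) {
            AtomicInteger fresh = new AtomicInteger();
            counter = failures.putIfAbsent(node, fresh);
            if (counter == null) {
                counter = fresh;
            }
        }
        return counter.incrementAndGet() >= maxSuccessiveFailures;
    }

    // A successful invocation ends the streak of failures.
    void recordSuccess(String node) {
        failures.remove(node);
    }
}
```

The caller would remove and re-establish the node-proxy only when recordFailure returns true, and drop the shard-to-node mapping if the fresh proxy fails as well.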
Right now, if an index is removed, its shards are
removed immediately from zookeeper (first from
the master, then from the nodes) and then from
the content-server (on each node). The removal
from the content-server produces exceptions if
search operations on these shards are running.
One way to avoid these exceptions would be to
delay the physical deletion. The remaining
queries should finish quickly and no new search
operations should be scheduled because of the
virtual deletion.
Querying an index while it is being undeployed
resulted in the following exception:
{noformat}
11/02/01 19:51:16,911 ERROR client.NodeInteraction:166 - Error calling public abstract net.sf.katta.lib.lucene.DocumentFrequencyWritable net.sf.katta.lib.lucene.ILuceneServer.getDocFreqs(net.sf.katta.lib.lucene.QueryWritable,java.lang.String[]) throws java.io.IOException on eagle.local:20000 (try # 1 of 3) (id=0)
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at net.sf.katta.client.NodeInteraction.run(NodeInteraction.java:135)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
Did a test where I do deploy-undeploy-search in a
loop. Having this running for a while, the
zk-filesystem looks like:
{noformat}
'-+shard-to-nodes
'-+index655#dIndex
'-+index0#bIndex
'-+eagle.local:20000
'-+index1013#bIndex
'-+index769#dIndex
'-+index735#dIndex
'-+index1090#cIndex
'-+index1135#dIndex
'-+index499#dIndex
'-+index1087#cIndex
'-+index849#cIndex
'-+index1010#dIndex
...
{noformat}
So sometime the remove seems to work -
Adding and removing an index can lead to the
following exception in IndexDeployFuture:
{noformat}
11/02/01 22:01:36,617 WARN zkclient.ZkEventThread:78 - Error handling event ZkEvent[Data of /zk_testsystem/indicies/indexA changed sent to ZkDataListenerAdapter:/zk_testsystem/indicies/indexA]
java.lang.IllegalStateException
at net.sf.katta.client.IndexDeployFuture.handleDataDeleted(IndexDeployFuture.java:88)
at net.sf.katta.protocol.InteractionProtocol$ZkDataListenerAdapter.handleDataDeleted(InteractionProtocol.java:615)
at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:549)
at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:72)
{noformat}
Additionally, the IndexDeployFuture as a component
does not seem to unregister itself in some cases.
1st Mail from Murali Krishna:
{noformat}
Hi,
I have a question on zkclient and deployclient's
lifetime. Can we use the same deployClient
throughout the process without recreating the
interaction protocol or deployclient at the client
side? Can this still work even if Zookeeper
processes or katta nodes get restarted in
between?
In our cluster, we keep deploying indices every 10
minutes with the same deploy client and we are
seeing a problem where after a day or so, the
deploy operation gets stuck. Essentially, the
IIndexDeployFuture.getState() always returns
IndexState.DEPLOYING. I am just wondering
whether this is related to some problem with
Deployclient and whether we should reinitialize
zkclient, interactionprotocol and deployclient
every time?
My code snippet to keep the deployClient in
memory is as below.
ZkConfiguration zkConf = new ZkConfiguration();
ZkClient zkClient = new ZkClient(zkConf.getZKServers());
InteractionProtocol protocol = new InteractionProtocol(zkClient, zkConf);
deployClient = new DeployClient(protocol);
{noformat}
Currently LuceneServer uses a fixed-size thread
pool of 100 for processing incoming search calls.
Once KATTA-174 is resolved, this should be made
configurable so people can tune it to their needs.
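A configurable pool size could look roughly like this. The property key below is hypothetical (it is not an existing Katta setting), and plain java.util.Properties stands in for whatever configuration class the node uses.

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; the property key is an assumption, not a Katta setting.
final class ThreadPoolConfigSketch {
    static final String POOL_SIZE_KEY = "lucene.searcher.threadpool.size";
    static final int DEFAULT_POOL_SIZE = 100; // the currently fixed value

    // Reads the pool size, falling back to the current default of 100.
    static int poolSize(Properties props) {
        String raw = props.getProperty(POOL_SIZE_KEY);
        if (raw == null) {
            return DEFAULT_POOL_SIZE;
        }
        int size = Integer.parseInt(raw.trim());
        if (size < 1) {
            throw new IllegalArgumentException(POOL_SIZE_KEY + " must be >= 1, was " + size);
        }
        return size;
    }

    static ExecutorService createPool(Properties props) {
        return Executors.newFixedThreadPool(poolSize(props));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(POOL_SIZE_KEY, "16");
        ExecutorService pool = createPool(props);
        System.out.println("pool size: " + poolSize(props));
        pool.shutdown();
    }
}
```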
Needed by KATTA-171. This would allow storing
custom properties for an implementation of
IContentServer (e.g. LuceneServer) inside the
katta.node.properties.
from 0.20.1
There are several RPC bugs fixed which could be
relevant for katta:
http://hadoop.apache.org/common/docs/r0.20.2/releasenotes.html
from 3.0.0
From Hongchao, dev-list:
{noformat}
In katta-0.6.3, the LuceneClient class allows us to set up a timeout
value for the search. We tried that function and found some
interesting things. We did see the LuceneClient interrupted the slow
queries although the actual used time might be a little longer than
the timeout value. This is understandable. However, from what we saw,
we suspect that the involved katta nodes still continue to work on the
slow queries even after the LuceneClient times out the search. Could
you please help us make sure whether it is the case? If it is, do you
think there is an easy way to ask katta nodes to stop the work related
to the slow query right, too, after the LuceneClient timed out the
query?
{noformat}
Part of the response from Johannes:
{noformat}
So basically you want that when the timeout
happens not only the querying threads on the
client, but also the threads on the lucene-nodes
stop their work, right?
RHEL 5, 0.6.2 katta, 2.9.2 lucene
I am using katta 0.6.2. The query results are not
as expected when I query for "NOT range". For
example, if I have a field indexed with multiple
values and I query for NOT of that range, it still
returns the doc.
The issue happens with lucene as well, when we
use MultiSearcher. The query rewrite changes the
[a - b*] to something like (a - b1 OR a - b2) where
b1 is in index1 and b2 in index2. It should be 'AND'
in case of a negative query.
MultiSearcher's rewritten query is wrong.
Katta also seems to inherit the bug from Lucene's
Query combine method mentioned at
https://issues.apache.org/jira/browse/LUCENE-2756.
test scenario:
index1: Has 1 shard which contains the document
A
index2: Has 2 shards, one of which has the
document A
index3 : Has 1 shard which doesnt contain A
index4 : same shard as index 1
for the query "B:0 NOT B:[1 TO 5]" :
index1 did not return A (correct).
index2 returned it (wrong)
index3 didn't return (correct)
index1 & index2 returns A twice (wrong)
The timeout code in WorkQueue.java (I removed some log lines from the
code below for clarity) seems to have an issue if the waitTime is
exactly 0. Looks like it will return the results without closing them if
waitTime is exactly 0.
/**
 * Use a user-provided policy to decide how long to wait for and
 * whether to terminate the call.
 *
 * @param policy
 *          How to decide when to return and to terminate the call.
 * @return the results, which may or may not be complete and/or closed.
 */
public ClientResult getResults(IResultPolicy policy) {
  int callId = callCounter++;
  long start = 0;
  long waitTime = 0;
  while (true) {
    synchronized (results) {
      // Need to stay synchronized before waitTime() through wait()
      // or we will
Presumably it's different from the similar-
sounding issues fixed in 0.6.3 because at least
some of those circumstances we think we've
been able to clearly confirm the 0.6.3 fixes
working... Attached is a force-jstack (normal
jstack fails due to deadlock), here's all I have from
the log:
2010-12-02 01:06:40,453 INFO net.sf.katta.operation.node.AbstractShardOperation:55 - redeploy shard 'index8#shard_0'
2010-12-02 01:06:40,455 INFO net.sf.katta.operation.node.AbstractShardOperation:75 - publish shard 'index8#shard_0'
2010-12-02 01:06:40,453 INFO org.apache.zookeeper.ClientCnxn:1157 - Client session timed out, have not heard from server in 45820ms for sessionid 0x32c9d05a0c704ce, closing socket connection and attempting reconnect
2010-12-02 01:06:40,558 INFO org.I0Itec.zkclient.ZkClient:449 - zookeeper state changed (Disconnected)
2010-12-02 01:07:10,971 INFO org.apache.zookeeper.ClientCnxn:1041 - Opening socket connection to server quorum2/10.1.1.2:2281
2010-12-02 01:07:10,973 INFO org.apache.zookeeper.ClientCnxn:1157 - Client session timed out, have not heard from server in 30414ms for sessionid 0x32c9d05a0c704ce, closing sock
x86_64, jdk 1.6r11, 16G heap limit, 23 nodes,
iCMS GC with parallel young generation collection
and an iot of 60%
During sustained operation with unknown cause
(relatively low query load, hard limit on
concurrent deployments managed centrally)
katta runs out of memory. The retained size to
garbage ratio varies pretty wildly and we collect
more aggressively than default so it seems
unlikely to be a config problem. We don't know if
it's due to a leak or due to overtasking - if the
latter, it would be nice to have katta limit the
inbound requests according to current
circumstances rather than just die. We're posting
heap dumps for analysis but it'll be a few hours
before they arrive (lot of heap).
As of 0.6.2 you can store indices with > 2^31
nDocs and katta will let you submit a search for
them, but several core places in luceneclient still
use the primitive int type (signed in java, so
limited to 2^31 results) to store and return
counts, so the search results are incorrect and the
api doesn't allow applications to even count them
correctly by retyping. This affects the client's
.search, the hits list, .count, etc. For example,
the docId field of the Hit object is an int; in
creating a list of them, we sort calling .equals,
and .equals relies on _docId, so we must have
collisions whenever there are more than 2^31. For
count, we often find negative results (overflow)
with a large enough range.
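The negative counts can be reproduced with plain integer arithmetic. The sketch below is illustrative only: the class name and the per-shard counts are made up and are not taken from luceneclient.

```java
// Hypothetical demo; the per-shard counts are invented for illustration.
final class CountOverflowSketch {

    // Sums per-shard counts in an int, as a signed 32-bit counter would.
    static int sumAsInt(int[] shardCounts) {
        int total = 0;
        for (int count : shardCounts) {
            total += count; // silently wraps around past Integer.MAX_VALUE
        }
        return total;
    }

    // Sums the same counts in a long, which has room for the real total.
    static long sumAsLong(int[] shardCounts) {
        long total = 0L;
        for (int count : shardCounts) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] shardCounts = {1_500_000_000, 1_500_000_000};
        System.out.println("int total:  " + sumAsInt(shardCounts));
        System.out.println("long total: " + sumAsLong(shardCounts));
    }
}
```

Each shard count fits into an int on its own; only the combined total overflows, which matches the observation that the problem shows up with a large enough range.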
Possible solutions:
A) Make all such areas Generics, so folks can
choose their precision and we can avoid breaking
compatibility with existing apps using
luceneclient.
B) Add new implementations of the existing
methods where everything is a long instead of an
int (push the limit out to something beyond what
katta could otherwise handle). Compatible but
ugly to maintain.
C) Change both the private implementations (like
hit-equals) and the exposed interfaces (like
.search) from int to long. Easiest but makes it
We experience it while testing even with 0.6.2
which is getting in the way of our testing of the
throttle and of the trunk version (we switched
back to 0.6.2 to see if the errors encountered
were unrelated to the katta changes):
{noformat}
2010-11-19 04:27:59,219 ERROR net.sf.katta.operation.node.AbstractShardOperation:59 - failed to deploy shard 'ipovw_031_101119042717#shard_1' on node 'srch02-lab:20000'
net.sf.katta.util.KattaException: Can not load shard: hdfs://slnamenode:9000/data/katta-deploy/ipovwtest/release/031_101119042717/shard_1
at net.sf.katta.node.ShardManager.installShard(ShardManager.java:144)
at net.sf.katta.node.ShardManager.installShard(ShardManager.java:66)
at net.sf.katta.operation.node.ShardDeployOperation.execute(ShardDeployOperation.java:36)
at net.sf.katta.operation.node.AbstractShardOperation.execute(AbstractShardOperation.java:56)
at net.sf.katta.operation.node.AbstractShardOperation.execute(AbstractShardOperation.java:27)
at
Ubuntu
While building the latest snapshot
(katta-HEAD-02055b0) a unit test fails
"
[junit] Running net.sf.katta.node.ShardManagerTest
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.857 sec
BUILD FAILED
/home/patrick/Downloads/katta-HEAD-02055b0/src/build/ant/common-build.xml:99: Tests failed
Total time: 1 minute 38 seconds
"
Based on KATTA-161. Seems like we have some
unexpected exceptions getting swallowed and
the threads silently stop working.
Currently LuceneClient and LuceneServer are not
structured in such a way that they can be
extended easily. For instance, LuceneClient
declares its kattaClient instance variable as
private and provides no accessor. To extend that
class, the extender would have to create a separate
Katta Client solely used within the sub-class, which
seems wasteful (when the parent has a perfectly
good one).
As a use case, our company would love to use
Katta because our Lucene indexes are quite large
(combined they are well over 1TB), and the
management of our current Solr deployment is
becoming overwhelming (replication and
sharding strategies are quite fixed in Solr once
established). Unfortunately, we require faceting
functionality which is provided by Solr but not by
Katta. Most of our faceting requirements are
quite simple, so I'd love to extend LuceneClient
to provide this facility. As stated above though,
the current implementation of LuceneClient and
LuceneServer makes this unnecessarily difficult.
Debian sid, Sun jdk 1.6.0u11 Over the course of extended runtime, for
unknown reasons, one or more katta nodes
appears to have its thread that takes instructions
from the master stop responding. The evidence is
in the logs going really quiet - normally we'd have
a variety of shard deployments, undeployments,
etc. showing up in the node log several times per
minute at least. They are almost always:
INFO net.sf.katta.node.Node
INFO net.sf.katta.operation.node.AbstractShardOperation
INFO net.sf.katta.node.ShardManager
INFO net.sf.katta.lib.lucene.LuceneServer
INFO net.sf.katta.operation.node.AbstractShardOperation
When this problem occurs, however, we see NO
info-level messages from any of those classes for
hours or even days on end (while all other nodes
are exhibiting the normal behavior), but what we
do start seeing is this sort of warning, always
clustered together and during the time that the
node doesn't receive updates:
2010-10-14 13:50:08,543 WARN org.apache.hadoop.ipc.Server:662 - IPC Server Responder, call getDetails([Ljava.lang.String;@4def22a3, 9) from 1.1.1.1:44457: output error
2010-10-14 13:51:08,544 WARN
Any
We ran into a problem recently. The scenario is
that one of our katta nodes got disconnected for
some reason and, consequently, a lot of
rebalance operations were triggered in katta to
replicate the missing shards. The indexes were
originally deployed from HDFS and some of these
indexes had been deleted from HDFS by the
time of rebalancing. In this situation, the attempts
to replicate these missing shards would fail since
there were no copies of these shards in HDFS
any more. The problem is that katta would never
stop such rebalance attempts. As soon as its
first rebalance attempt failed, katta immediately
submitted another identical attempt. Thus, tons of
such rebalance requests were queued up in
Zookeeper and blocked any other normal index
deployments.
So, I am wondering whether katta could make
decisions based on the reasons for the rebalance
failures. For example, if the failure is caused by a
malfunctioning file system, such as HDFS not being
responsive, katta could try the rebalance again
later. However, if the reason is that the index files
could not be found but the file system is healthy,
katta should give up the rebalance efforts since
the effort has no way to succeed.
JDK 1.6_21, JDK 1.5
The LuceneServerTest.java does not compile. The
following error message occurs for lines 45 and 63:
{noformat}
qualified new of static class
{noformat}
The Katta command line tool provides the option
{{listNodes}}, but does not provide an option to
remove nodes which are not used anymore (i. e.
test nodes). One can remove these nodes with
the Zookeeper command line tool. But it would
be useful to have such an option with the Katta
command line tool.
There is no way to set the broadcast timeout in
LuceneClient. The default timeout of 12 seconds
is sometimes too small.
If a lot of different indices with different names
are deployed, but the search should be limited to
a subset of these indices only, all the subset index
names have to be given manually. It would be
easier if an index pattern could be used instead.
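One way such a pattern could be applied is sketched below, assuming a regular expression is an acceptable pattern syntax (the issue does not say which syntax is intended). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; regular expressions are an assumed pattern syntax.
final class IndexPatternSketch {

    // Returns the deployed index names that fully match the given regex.
    static List<String> select(List<String> deployedIndices, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> selected = new ArrayList<String>();
        for (String name : deployedIndices) {
            if (pattern.matcher(name).matches()) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> deployed = Arrays.asList("news-2010", "news-2011", "blogs-2011");
        System.out.println(select(deployed, "news-.*"));
    }
}
```

The resulting list could then be passed wherever the explicit index names are given today.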
The retrieval of many hit details is very slow
because they are requested individually. It would
be nice if they were requested in batches per
shard. That would improve the retrieval
performance dramatically.
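The batching step itself amounts to grouping the requested hits by shard, so that one request per shard can be sent. A minimal sketch follows; HitRef is a hypothetical stand-in for Katta's Hit class and carries only a shard name and a document id.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; HitRef stands in for the real Hit class.
final class HitBatchingSketch {

    static final class HitRef {
        final String shard;
        final int docId;

        HitRef(String shard, int docId) {
            this.shard = shard;
            this.docId = docId;
        }
    }

    // Groups document ids by shard, keeping the order in which shards first appear.
    static Map<String, List<Integer>> groupByShard(List<HitRef> hits) {
        Map<String, List<Integer>> batches = new LinkedHashMap<String, List<Integer>>();
        for (HitRef hit : hits) {
            List<Integer> docIds = batches.get(hit.shard);
            if (docIds == null) {
                docIds = new ArrayList<Integer>();
                batches.put(hit.shard, docIds);
            }
            docIds.add(hit.docId);
        }
        return batches;
    }
}
```

With this grouping, the number of remote calls depends on the number of shards rather than on the number of hits.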
In @HitsMapWritable.readFields@ the number of
hits to read is known, but the dimension of the
@_hits List@ is not adapted. If many hits are read
by a client, the @ArrayList@ will be expanded
multiple times. That decreases the performance.
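The suggested change can be sketched with plain java.io types. This is not the HitsMapWritable code: the class name is hypothetical and integers stand in for the serialized hits.

```java
import java.io.DataInput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; integers stand in for the serialized hits.
final class PresizedReadSketch {

    // Reads a count followed by that many ints into a list sized up front.
    static List<Integer> readIds(DataInput in) throws IOException {
        int size = in.readInt();
        if (size < 0) {
            throw new IOException("negative element count: " + size);
        }
        // The count is known before reading, so the backing array is
        // allocated once instead of being grown repeatedly.
        List<Integer> ids = new ArrayList<Integer>(size);
        for (int i = 0; i < size; i++) {
            ids.add(in.readInt());
        }
        return ids;
    }
}
```

For an existing list field, calling ensureCapacity on the ArrayList before the read loop would have the same effect.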
{{LuceneServer}} loads the full document, even if
only a few fields are requested during a
{{getDetails()}} call. That is a performance issue if
a lot of document fields are indexed and stored
or if some fields are quite large and/or should be
lazy loaded.
The {{net.sf.katta.lib.lucene.LuceneServer}} class
should be modified so that sub-classing is easier
and code reuse higher. The motivation is to make an
embedded Solr server as a sub-class of
{{LuceneServer}} to accept both Solr and Lucene
queries. I'll post an updated patch of
[SOLR-1395|https://issues.apache.org/jira/browse/SOLR-1395] too.
Update to the current version (0.21.0) of Hadoop.
This version is still an RC version, but Katta should
be adapted anyway.
Add a port parameter "-p" to the katta startNode
argument to run Katta nodes at different ports
locally. That makes debugging easier.
LuceneServer synchronizes on a
ConcurrentHashMap and queries
indexSearcher.maxDoc(), but does not use
_maxDoc. That can be avoided.
If KATTA_LOG_LEVEL=Debug is set, the master
fails to deploy indices with the following
exception:
ERROR 2010-08-13 12:58:09,006 [OperatorThread] net.sf.katta.operation.master.AbstractIndexOperation - failed to deploy index sen-00002
java.util.IllegalFormatConversionException: d != java.util.concurrent.atomic.AtomicInteger
at java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:3999)
at java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2709)
at java.util.Formatter$FormatSpecifier.print(Formatter.java:2661)
at java.util.Formatter.format(Formatter.java:2433)
at java.util.Formatter.format(Formatter.java:2367)
at java.lang.String.format(String.java:2769)
at net.sf.katta.master.LowestShardCountDistributionPolicy.chooseNewNodes(LowestShardCountDistributionPolicy.java:200)
at net.sf.katta.master.LowestShardCountDistributionPolicy.createDistributionPlan(LowestShardCount
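The exception type in the trace above can be reproduced in isolation. The sketch below is an assumption about the cause, since the code at LowestShardCountDistributionPolicy.java:200 is not shown here: passing an AtomicInteger to a %d conversion fails, while passing its int value works. The class name and the message text are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reproduction; the real log statement is not shown in this report.
final class FormatSketch {

    // Throws IllegalFormatConversionException: %d does not accept AtomicInteger.
    static String broken(AtomicInteger count) {
        return String.format("shard count: %d", count);
    }

    // Works: get() returns an int, which is boxed to Integer for %d.
    static String fixed(AtomicInteger count) {
        return String.format("shard count: %d", count.get());
    }

    public static void main(String[] args) {
        AtomicInteger count = new AtomicInteger(3);
        System.out.println(fixed(count));
        try {
            broken(count);
        } catch (java.util.IllegalFormatConversionException e) {
            System.out.println("broken variant failed: " + e.getMessage());
        }
    }
}
```

This would also explain why the failure only appears with debug logging enabled, if the format call sits inside a debug-level log statement.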
see
http://github.com/sgroschupf/zkclient/issues/unreads#issue/11
linux
We created LuceneClient instances on different
machines (katta nodes). I found the
java.util.ConcurrentModificationException a couple
of times. By checking the log I found that multiple
LuceneClients were created from different
machines at the time the exception was thrown.
java.util.ConcurrentModificationException
at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
at java.util.AbstractList$Itr.next(AbstractList.java:343)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1028)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:979)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:404)
at net.sf.katta.client.Client.<init>(Client.java:102)
at net.sf.katta.client.Client.<init>(Client.java:93)
at net.sf.katta.client.Client.<init>(Client.java:88)
at net.sf.katta.lib.lucene.LuceneClient.<init>(LuceneClient.java:78)
at
linux
If a LuceneClient is created at the time of
dropping an index (or at the time of adding an
index), we get a java.lang.NullPointerException:
java.lang.NullPointerException
at net.sf.katta.client.Client.isIndexSearchable(Client.java:247)
at net.sf.katta.client.Client.addOrWatchNewIndexes(Client.java:180)
at net.sf.katta.client.Client.<init>(Client.java:122)
at net.sf.katta.client.Client.<init>(Client.java:87)
at net.sf.katta.client.Client.<init>(Client.java:82)
at net.sf.katta.lib.lucene.LuceneClient.<init>(LuceneClient.java:73)
at com.mcafee.titan.search.LuceneClientFactory.makeObject(LuceneClientFactory.java:36)
Line 36 of LuceneClientFactory.java is:
return new LuceneClient(new DefaultNodeSelectionPolicy(), zkConf);
where zkConf is an object of ZkConfiguration.
Administrator@Rafia /cygdrive/d/katta/katta-core-0.6.1
$ ant compile
Buildfile: D:\katta\katta-core-0.6.1\build.xml
compile:
check-ivy-available:
download-ivy:
install-ivy:
[echo] Ivy path
resolve:
[ivy:resolve] :: Ivy 2.0.0 - 20090108225011 :: http://ant.apache.org/ivy/ ::
:: loading settings :: file = D:\katta\katta-core-0.6.1\ivysettings.xml
[ivy:resolve] DEPRECATED: useOrigin option is deprecated when calling resolve, use useOrigin setting on the cache implementation instead
[ivy:resolve] :: resolving dependencies :: 101tec#katta;working@Rafia
[ivy:resolve] confs: [ant, eclipse, compile, test, instrument, checkstyle]
[ivy:resolve] found zkclient#zkclient;0.1.0 in libraries
[ivy:resolve] found zookeeper#zookeeper;3.2.2 in
all
Repro: run "ant jar" from katta/extras/indexing
Fix: see attached (update cobertura.jar to 1.9.3)
rodo@rodimus:~/patch/katta/extras/indexing$ ant jar
...
[ivy:resolve] :::: WARNINGS
[ivy:resolve] module not found: net.sourceforge.cobertura#cobertura;1.9.1
[ivy:resolve] ==== libraries: tried
[ivy:resolve] -- artifact net.sourceforge.cobertura#cobertura;1.9.1!cobertura.jar:
[ivy:resolve] /home/rodo/patch/katta/extras/indexing/../..//lib/cobertura-1.9.1.jar
[ivy:resolve] /home/rodo/patch/katta/extras/indexing/lib/cobertura-1.9.1.jar
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] :: UNRESOLVED DEPENDENCIES ::
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] :: net.sourceforge.cobertura#cobertura;1.9.1: not found
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve]
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE
Hongchao reported
{noformat}
However, if we reuse the LuceneClient for queries, as the time passed
(mostly we deploy/remove indexes from katta), the memory utilized by
the LuceneClient is getting bigger and bigger and can not be released.
It seems there is a strong tie between the memory size and the number
of index deployment/removal during the life time of the LuceneClient.
When the utilized memory reaches some level, the LuceneClient gets the
following errors consistently:
2010-07-07 11:13:05,181 - org.I0Itec.zkclient.ZkEventThread - WARN -
Error handling event ZkEvent[Children of /katta/indicies changed sent
to net.sf.katta.protocol.InteractionProtocol$AddRemoveListenerAdapter@2c70f837]
java.lang.NullPointerException
at net.sf.katta.client.Client.removeIndex(Client.java:172)
at net.sf.katta.client.Client$1.removed(Client.java:108)
at
from Jack Key:
{noformat}
Hello, Everyone.
QUESTION 1:
Will there be an update to the "katta-images" S3
bucket that includes an AMI for Katta 0.6.1?
Unfortunately, the only publicly available image
in the S3 Bucket "katta-images" is ami-b3a84fda
(katta-images/katta-0.4.0-i386.manifest.xml).
QUESTION 2:
Has anyone had success building an AMI with
Katta 0.6.1?
I tried using the "create-image" script as shown in
http://katta.sourceforge.net/documentation/running-katta-on-ec2.
But it seems to produce AMIs that do not work with "bin/katta-ec2 launch-cluster".
The launch spins up instances, the "katta-ec2 login" logs me into the master, but the master seems unaware of any Nodes,
and there is no output (error messages or otherwise) when I run katta listNodes.
INPUTS
In my katta-ec2-env.sh script, I used KATTA_VERSION=core-0.6.1.
linux, jdk 1.6.0_17
When we ran a stress test on katta, I got some inconsistent errors about "can not reach xxxx shard" and found the following logs inside:
--------------------------------------------------------------------
--------------------------------------------------------------------
--------------------------------------
2010-06-25 19:33:04,524 -
net.sf.katta.client.NodeInteraction - ERROR
{noformat}
Error calling public abstract
net.sf.katta.lib.lucene.HitsMapWritable
net.sf.katta.lib.lucene.ILuceneServer.search(net.sf
.katta.lib.luc
ene.QueryWritable,net.sf.katta.lib.lucene.Docum
entFrequencyWritable,java.lang.String[],int,net.sf
.katta.lib.lucene.SortWritable)
throws java.io.IOException on
haystack008.scur.colo:20000 (try # 1 of
3) (id=2
264969)
java.lang.reflect.InvocationTargetException
at
sun.reflect.GeneratedMethodAccessor36.invoke(
Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invok
e(DelegatingMethodAccessorImpl.java:25)
at
java.lang.reflect.Method.invoke(Method.java:597
from Larry Lui:
{noformat}
2010-06-21 16:16:38,492 WARN
org.I0Itec.zkclient.ZkEventThread:78 -
Error handling event ZkEvent[State changed to
SyncConnected sent to
net.sf.katta.protocol.InteractionProtocol$1@523
b2208]
org.I0Itec.zkclient.exception.ZkException:
org.apache.zookeeper.KeeperException$NotEmp
tyException: KeeperErrorCode
= Directory not empty for /var/katta/work/node-
queues/iguana:20000
at
org.I0Itec.zkclient.exception.ZkException.create(Z
kException.java:68)
at
org.I0Itec.zkclient.ZkClient.retryUntilConnected(Z
kClient.java:685)
at
org.I0Itec.zkclient.ZkClient.delete(ZkClient.java:71
6)
at
org.I0Itec.zkclient.ZkClient.deleteRecursive(ZkClie
nt.java:516)
at
net.sf.katta.protocol.InteractionProtocol.publish
Node(InteractionProtocol.java:366)
at net.sf.katta.node.Node.init(Node.java:106)
at
net.sf.katta.node.Node.reconnect(Node.java:120)
When a relatively large (appx. one third) number
of nodes go offline, are rebuilt, and then brought
back online with no data, Katta can sometimes
get stuck rebalancing the cluster. It attempts to
replicate the underreplicated shards -- we had
confirmed that, according to Katta anyway, we
had at least one copy of all shards -- and gets to
about 90% completion before it just... stops. The
cluster still seems to respond to search requests
just fine, but you can't perform any modifications
(removing indexes, adding new indexes). We see
messages like this in our master log files from
around the time that the problem started:
2010-06-09 02:40:40,332 INFO
net.sf.katta.master.OperatorThread:125 -
skipping operation
'BalanceIndexOperation:6d82733b:index_name_
here'
No warnings or errors that I can see.
To troubleshoot, I first restarted the standalone
Zookeeper nodes one at a time. This had no
effect. Then I restarted the Katta nodes (only the
nodes -- the masters I left unchanged), also one
at a time. After restarting all the Katta nodes,
suddenly the cluster started working again.
Initially we got a large-ish number of indexes in
the ERROR state (about 88 out of 432), which was
Debian Linux unstable
See Debian Bug #584378 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=584378
> find src/main/java -name *.java -and -type f -print0 | xargs -0 /usr/lib/jvm/default-java/bin/javac -cp /usr/share/java/log4j-1.2.jar:/usr/share/java/zookeeper.jar:debian/_jh_build.zkclient -d debian/_jh_build.zkclient
> src/main/java/org/I0Itec/zkclient/ZkServer.java:127: cannot find symbol
> symbol : constructor Factory(int)
> location: class org.apache.zookeeper.server.NIOServerCnxn.Factory
> _nioFactory = new NIOServerCnxn.Factory(port);
> ^
> Note: src/main/java/org/I0Itec/zkclient/ZkClient.java uses unchecked or unsafe operations.
> Note: Recompile with -Xlint:unchecked for details.
The full build log is available from:
http://people.debian.org/~lucas/logs/2010/06/02/zkclient_0.1.0+dfsg1-1_lsid64.buildlog
From Hongchao Li:
{noformat}
I create an object of
net.sf.katta.lib.lucene.LuceneClient in my query
client and serve external queries with this object..
Also, I
periodically remove/add indexes when the query
client is in service.
Sometime, I got the following exceptions:
InteractionProtocol$AddRemoveListenerAdapter
@7ae35bb7]
java.lang.NullPointerException
at
net.sf.katta.client.Client.removeIndex(Client.java:
172)
at
net.sf.katta.client.Client$1.removed(Client.java:1
08)
at
net.sf.katta.protocol.InteractionProtocol$AddRe
moveListenerAdapter.handleChildChange(Interac
tionProtocol.java:529)
at
org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:56
8)
at
org.I0Itec.zkclient.ZkEventThread.run(ZkEventThr
ead.java:72)
10/04/26 13:04:26 WARN
Periodically we get the following error, which is
roughly contemporaneous with the entire cluster
becoming unresponsive:
2010-05-04 15:00:01,121 ERROR net.sf.katta.operation.master.AbstractIndexOperation:59 - failed to deploy balance [index name]
java.lang.NullPointerException
at net.sf.katta.operation.master.AbstractIndexOperation.addRunningDeployments(AbstractIndexOperation.java:108)
at net.sf.katta.operation.master.AbstractIndexOperation.distributeIndexShards(AbstractIndexOperation.java:65)
at net.sf.katta.operation.master.BalanceIndexOperation.execute(BalanceIndexOperation.java:54)
at net.sf.katta.master.OperatorThread.executeOperation(OperatorThread.java:121)
at net.sf.katta.master.OperatorThread.run(OperatorThread.java:80)
We then get one of these about every 8 seconds until the cluster is restarted. There are no errors or warnings in the log file before this that I can see, and several index balance operations before it all succeed.
all
Repro: run "ant jar" from katta/extras/indexing
Fix: see attached (update cobertura.jar to 1.9.3)
rodo@rodimus:~/patch/katta/extras/indexing$
ant jar
...
[ivy:resolve] :::: WARNINGS
[ivy:resolve] module not found:
net.sourceforge.cobertura#cobertura;1.9.1
[ivy:resolve] ==== libraries: tried
[ivy:resolve] -- artifact
net.sourceforge.cobertura#cobertura;1.9.1!cober
tura.jar:
[ivy:resolve]
/home/rodo/patch/katta/extras/indexing/../..//li
b/cobertura-1.9.1.jar
[ivy:resolve]
/home/rodo/patch/katta/extras/indexing/lib/cob
ertura-1.9.1.jar
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] :: UNRESOLVED DEPENDENCIES ::
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] ::
net.sourceforge.cobertura#cobertura;1.9.1: not
found
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve]
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE
We use other applications that use zookeeper as well, that run on 3.3.0 (along with katta). Having katta upgraded to 3.3.0 will be useful to maintain a single zk quorum and library integration.
Hongchao reported:
{noformat}
we have a huge application including about 7
billion documents. It is
indexed it into 720 shards, i.e. about 10 million
documents per shard,
and the total size is about 0.5 tera bytes. Then,
we divide these
shards into 120 indexes with 6 shards per index
and deploy them into
our katta cluster as 120 instances with replication
2. our katta
cluster currently has 22 nodes and each node has
16 CPUs. When we
submit a query on more than 37 instances at the
same time. we got the
following exception:
10/03/26 13:53:29 ERROR
client.NodeInteraction:166 - Error calling
public abstract
net.sf.katta.lib.lucene.HitsMapWritable
net.sf.katta.lib.lucene.ILuceneServer.search(net.sf
.katta.lib.lucene.QueryWritable,net.sf.katta.lib.lu
cene.DocumentFrequencyWritable,java.lang.Strin
g[],int)
throws java.io.IOException on
haystack016.scur.colo:20000 (try # 1 of
3) (id=21)
java.lang.reflect.InvocationTargetException
Reported by Hongchao:
{noformat}
Katta works well on sorting at most time.
However, it fails in one
case: when we do a query and some searched
records do not have the
fields to be sorted on or the fields are empty,
katta reports errors
like 'can not reach shard ...'. If some result
records miss the sorted
fields, a graceful way is to put at the very
beginning or the very end
of the result list. It should not report errors.
{noformat}
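The graceful behaviour requested in the report (records without the sort field go to the very end instead of producing an error) can be sketched in plain Java. This is only an illustration of the ordering rule, not Katta or Lucene code: the `Map`-based record and the `MissingFieldSort` class are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class MissingFieldSort {
  // Returns a sorted copy of the records. Records whose sort field is absent
  // or empty are placed at the end instead of causing an error.
  static List<Map<String, String>> sortWithMissingLast(
      List<Map<String, String>> records, String field) {
    Comparator<Map<String, String>> byField = Comparator.comparing(
        (Map<String, String> record) -> normalize(record.get(field)),
        Comparator.nullsLast(Comparator.<String>naturalOrder()));
    List<Map<String, String>> sorted = new ArrayList<>(records);
    sorted.sort(byField); // stable: records without the field keep their order
    return sorted;
  }

  // Treats an empty value like a missing one.
  private static String normalize(String value) {
    return (value == null || value.isEmpty()) ? null : value;
  }
}
```

Swapping `nullsLast` for `nullsFirst` gives the other option named in the report, with the incomplete records at the very beginning.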
I think a good way to solve this is to first find out how lucene handle those cases.
RHEL Linux 5; Java 1.6.0
When I try to deploy a solr shard from hdfs, it fails because the unzip routine does not create the directories before it unzips the file.
My fix is to add the following check into
net.sf.katta.util.FileUtil unzip method:
{noformat}
if (!targetFile.getParentFile().exists()) {
targetFile.getParentFile().mkdirs();
}
{noformat}
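For context, here is a minimal sketch of an unzip routine with that check in place. It is written against `java.util.zip` only and is not Katta's actual `FileUtil` code; the class name `UnzipSketch` and the extra guard against entries escaping the target directory are assumptions added for the illustration.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class UnzipSketch {
  // Extracts a zip stream into targetDir, creating missing parent
  // directories for every entry before writing it.
  static void unzip(InputStream zipped, File targetDir) throws IOException {
    String root = targetDir.getCanonicalPath() + File.separator;
    try (ZipInputStream zip = new ZipInputStream(zipped)) {
      byte[] buffer = new byte[8192];
      for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
        File targetFile = new File(targetDir, entry.getName());
        // Refuse entries such as "../x" that would land outside targetDir.
        if (!targetFile.getCanonicalPath().startsWith(root)) {
          throw new IOException("entry outside target directory: " + entry.getName());
        }
        if (entry.isDirectory()) {
          targetFile.mkdirs();
          continue;
        }
        // The check from the report: zip files need not contain directory entries.
        File parent = targetFile.getParentFile();
        if (!parent.exists() && !parent.mkdirs()) {
          throw new IOException("could not create " + parent);
        }
        try (OutputStream out = new FileOutputStream(targetFile)) {
          int read;
          while ((read = zip.read(buffer)) != -1) {
            out.write(buffer, 0, read);
          }
        }
      }
    }
  }
}
```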
* When we deploy a new index (an optimized index) in hdfs and ask katta to deploy it, as things currently stand we can only deploy a new index with a new set of shards.
It would be useful to add a newly deployed lucene index (optimized) as a shard to an existing katta index, so that it is entirely transparent to the client while searching the index.
Would this involve changing some zk mapping etc.?
The following line in Node.java only opens one
RPC Server Thread per node, thus only one search
call will be executed per node at a time and all
the other requests queued...
_rpcServer = RPC.getServer(nodeManaged,
"0.0.0.0", serverPort, new Configuration());
It should be changed to
int numthreads = 10;
_rpcServer = RPC.getServer(nodeManaged,
"0.0.0.0", serverPort, numthreads, false, new
Configuration());
and be configurable by the user.
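A rough sketch of the "configurable by the user" part, assuming the value would come from a properties file. The property name `node.rpc.handler-count` and the class `RpcHandlerConfig` are made up for this illustration and are not existing Katta settings; the default of 10 is taken from the proposal above.

```java
import java.util.Properties;

final class RpcHandlerConfig {
  // Hypothetical property name; Katta does not necessarily define it.
  static final String HANDLER_COUNT_KEY = "node.rpc.handler-count";
  static final int DEFAULT_HANDLER_COUNT = 10;

  // Reads the handler count, falling back to the default for missing,
  // malformed, or non-positive values.
  static int handlerCount(Properties nodeProperties) {
    String raw = nodeProperties.getProperty(HANDLER_COUNT_KEY);
    if (raw == null) {
      return DEFAULT_HANDLER_COUNT;
    }
    try {
      int parsed = Integer.parseInt(raw.trim());
      return parsed > 0 ? parsed : DEFAULT_HANDLER_COUNT;
    } catch (NumberFormatException e) {
      return DEFAULT_HANDLER_COUNT;
    }
  }
}
```

The returned value would then take the place of the hard-coded `numthreads` in the `RPC.getServer(...)` call quoted above.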
reverse deploy - copy a valid index from the katta system (plus all its shards) back to a hdfs uri (assuming the original hdfs uri backing up the shards has been deleted entirely, which is a possibility).
This is very useful to examine deployed indices separately.
Once in hdfs, it can be copied using hdfs.copyToLocal and examined using luke and others.
Once this is fully functional, it can be used to pull many such indices to hdfs and then to the local disk of a box, to be merged and deployed again. This would be a useful stepping stone towards that.
First-cut patch in place - please review.
todo: more unit testing to be done on the
PriorityQueue (from Lucene) and some other
genericity fixes.
As recommended, I'm running zookeeper now in standalone mode.
I just got the following exception (0.6.1), and the master gets unresponsive afterwards and I have to kill the master process.
listIndices will still work, but removing an index or adding an index will not.
Probably an easy fix (just use a hashmap which is threadsafe)
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
at java.util.HashMap$KeyIterator.next(HashMap.java:828)
at net.sf.katta.protocol.InteractionProtocol$1.handleStateChanged(InteractionProtocol.java:87)
at org.I0Itec.zkclient.ZkClient$5.run(ZkClient.java:484)
at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:72)
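The trace shows a `HashMap` key iterator failing inside `handleStateChanged`, which is what happens when the map is modified while it is being iterated. The reporter's suggestion can be sketched as follows, assuming the map holds listeners keyed by path; `ListenerRegistry` is a hypothetical stand-in and not the actual `InteractionProtocol` code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ListenerRegistry {
  // ConcurrentHashMap iterators are weakly consistent: they never throw
  // ConcurrentModificationException, even if entries change mid-iteration.
  private final Map<String, Runnable> listeners = new ConcurrentHashMap<>();

  void register(String path, Runnable listener) {
    listeners.put(path, listener);
  }

  void unregister(String path) {
    listeners.remove(path);
  }

  // Runs the registered listeners and returns how many were run. Listeners
  // may register or unregister paths while this loop is in progress.
  int fireStateChanged() {
    int fired = 0;
    for (String path : listeners.keySet()) {
      Runnable listener = listeners.get(path);
      if (listener != null) { // may have been removed in the meantime
        listener.run();
        fired++;
      }
    }
    return fired;
  }
}
```

With this choice, an entry added during the loop may or may not be visited in the same pass, so code that needs an exact snapshot would have to copy the key set first.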
7-node katta cluster, it should have nothing to do with OS
With release of 0.6 and the setting of 'master.deploy.policy=net.sf.katta.master.LowestShardCountDistributionPolicy', the shards can not be evenly distributed over all nodes. I deployed my 42 indexes sequentially into katta.
Only one index has 7 shards and all others have 1
to 3 shards. And, 16 of them are with replication
level 2 and others with replication-level 1. The
policy worked perfectly for us with a trunk
version before. I think it should be easy to
reproduce this bug if you sequentially deploy
multiple indexes, each of which has different
number of shards and/or replication level.
The following is the output of 'katta check' of my
cluster. (To me, it seems that katta mistakenly
used old 'default deploy policy' to assign the
shards.):
--------------------------------------------------------------------------------------------------
| Node | Connected | Shard Status |
==================================================================================================
| haystack004.scur.colo:20000 | true | 2/2 ## |
--------------------------------------------------------------------------------------------------
| haystack005.scur.colo:20000 | true | 17/17
When trying to unpublish a non-existing katta index from an external java program (through katta.main), Katta will kill the original java program when no index is found (in printUsageAndExit).
It would be better just to return.
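One way to read "just return", sketched with hypothetical names (`CommandSketch`, `UsageException`, `run`): the operation reports the problem to its caller, and only the outermost command-line entry point decides whether to exit the JVM. This is an illustration of the idea, not the structure of Katta's actual `Katta` class.

```java
import java.util.Set;

final class CommandSketch {
  // Signals a usage problem to the caller instead of terminating the JVM.
  static final class UsageException extends Exception {
    UsageException(String message) {
      super(message);
    }
  }

  // Library-level operation: safe to call from another Java program.
  static void removeIndex(Set<String> deployedIndexes, String name) throws UsageException {
    if (!deployedIndexes.remove(name)) {
      throw new UsageException("index '" + name + "' does not exist");
    }
  }

  // Command-line layer: turns failures into an exit code but does not call
  // System.exit itself; only a real main() would do that.
  static int run(String[] args, Set<String> deployedIndexes) {
    if (args.length != 2 || !"removeIndex".equals(args[0])) {
      System.err.println("Usage: removeIndex <index name>");
      return 1;
    }
    try {
      removeIndex(deployedIndexes, args[1]);
      return 0;
    } catch (UsageException e) {
      System.err.println("ERROR: " + e.getMessage());
      return 1;
    }
  }
}
```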
lsof shows file descriptors grow as we deploy new
indices and remove the old.
Here is a patch -
---
a/src/main/java/net/sf/katta/lib/lucene/LuceneS
erver.java
+++
b/src/main/java/net/sf/katta/lib/lucene/LuceneS
erver.java
@@ -124,7 +124,7 @@ public class LuceneServer
implements IContentServer, ILucene
public void removeShard(final String shardName)
{
LOG.info("LuceneServer " + _nodeName + "
removing shard " + shardName);
synchronized (_searcherByShard) {
- final Searchable remove =
_searcherByShard.remove(shardName);
+ final IndexSearcher remove =
_searcherByShard.remove(shardName);
if (remove == null) {
return; // nothing to do.
}
@@ -133,6 +133,10 @@ public class
LuceneServer implements IContentServer, ILucen
} catch (final IOException e) {
throw new RuntimeException("unable to retrive
maxDocs from searchable")
}
The process nodes seem to hang in a deadlock.
kill processid won't have any effect on them and I have to kill them with kill -9.
I can reproduce this behaviour on windows
(cygwin) and on debian linux.
I added an index with a wrong path before
"file://c:/tmp/test", which failed to deploy.
listIndices won't work anymore:
ERROR: Wrong FS: file://c:/tmp/test, expected:
file:///
Maybe other parts of katta fail too (I can't test
right now)
In addition to the redeploy feature, it would be nice to have a reload and refresh feature.
Reload would reload all the shards on all the nodes (closing, opening lucene indexes).
Refresh would load all new shards from the index directory, and then drop those shards no longer present in the index directory.
This would make frequent updates very easy, as it would be possible to do only a "refresh" on an index, which would load the new shards and drop the non-existing shards, instead of having to juggle between multiple indexes.
some input from Ted Dunning:
{noformat}
My own thought is that the master should do the
following balancing actions fairly often:
- to reassign shards that have not yet been
loaded by a node
- to add replicas of shards to nodes with a serious
underload of shards
- to tell a node to stop serving shards with excess
advertised replicas (largely due to the second
action)
As it stands, I think that the katta master is a bit
too laissez faire and should be a bit more
aggressive.
One thing to watch out for, however, in making
the master more aggressive is that when a node
disappears for a short time
it should re-advertise all of the shards it still finds
on its disk rather than delete them.
It should only delete them after a period of time
that allows the master to consider what should
be done.
Combined with a partial result policy, this can
make transient node failures much less of a
problem.
{noformat}
from Erich
{noformat}
I'm running 0.6-rc1 and have problems with
imbalanced shards.
When I deployed 62 indices consisting of 183
shards (replication 2)
across 5 nodes (all up during and after the
deployment) I get the
following distribution:
----------------------------------------------------------------------------------------------------------
| Node | Connected | Shard Status |
=====================================================
| grid0004:20000 | true | 62/62 ############################# |
----------------------------------------------------------------------------------------------------------
| grid0002:20000 | true | 62/62 ############################# |
----------------------------------------------------------------------------------------------------------
| grid0003:20000 | true | 62/62 ############################# |
----------------------------------------------------------------------------------------------------------
| grid0001:20001 | true | 0/0 |
--------------------------------------------------------------------
log4j is at 1.2.15 I think. It shouldn't cause any problems to update from 1.2.14.
A patch will follow in a few moments.
Sorry for filing the bug here. I think it's the most appropriate place.
People who would like to (re)build from the
tarballs need build.xml and ivy.xml. I think it's a
bug that they aren't included since you ship
extras/*/build.xml and extras/*/ivy.xml and it
also doesn't make sense to have the source
without the build system.
* release a source only tarball without precompiled .jar files (and without the prototype javascript library)
* make it possible to not use ivy in the build
system
* release zkclient
* make it possible to compile only selected extras
Maybe there are more things to do.
To reduce the size of the core katta distribution
we make a separate distribution for the web ui.
This is a follow up task from KATTA-17.
It has the following shortcomings:
- the result files are always to be found on the master host, although the test can be triggered from any other host
- the test does not survive a katta master change
- the nodes hold the result of each query in memory (which could possibly lead to an OOM)
I think all this could be fixed relatively easily by using [bookkeeper|http://hadoop.apache.org/zookeeper/docs/r3.2.2/bookkeeperOverview.html]
I'm trying to start katta with bin/katta
startMaster, but it fails with the following error
message:
$ bin/katta startMaster
ERROR: Path must start with / character
Usage:
startMaster Starts a local master
Katta '0.6-dev'
Git-Revision '775443f'
Based on KATTA-82.
Some initial thoughts for discussion:
- things we want to monitor ?
-- we have:
--- cpu, memory, garbage collection
-- we might want ?:
--- disc-io/allocation (support multiple discs first)
--- shard distribution, query distribution, cluster
balance (have this partly in zk)
- how do we publish those ?
-- currently a IMonitor publishes to zk
-- do we want to publish via jmx ?
-- publish to ganglia ?
- we might extend our IMonitor solution to run
multiple IMonitors at the same time
Some users reported problems with the cluster stability when deploying a large index.
In NodeInteraction there are these lines
{code}
VersionedProtocol proxy =
_shardManager.getProxy(_node);
if (proxy == null) {
String msg = "No proxy for node: " + _node;
LOG.debug(msg);
_result.addError(new KattaException(msg),
_shards);
return;
}
{code}
In this case the normal "try another node" will simply be skipped.
But this situation can only happen when the Client is used multithreaded and another node interaction failed on a node and so kicked out the
e.g. all lucene stuff should be in a package, etc
This might influence configuration.
from Richard Tang:
{noformat}
I have troubles in creating sample data as in online tutorial
http://katta.sourceforge.net/documentation/how-to-create-a-katta-index).
In particular, when I tried to exec 'ant jar',
exceptions are thrown
[ivy:resolve] :::: WARNINGS
[ivy:resolve] module not found: junit#junit;3.8.1
[ivy:resolve] ==== libraries: tried
[ivy:resolve] -- artifact junit#junit;3.8.1!junit.jar:
[ivy:resolve] /root/katta-
buildindex2/katta/extras/indexing/lib/junit-
3.8.1.jar
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] :: UNRESOLVED DEPENDENCIES ::
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] :: junit#junit;3.8.1: not found
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve]
{noformat}
There should be a method to access binary fields in results as well.
If I iterate over
List ids = client.getDetails(hits,
fetch)
I can only access String fields (eg
ids.get(i).get("stringfield")) and not binary fields
which are normally accessed with getBinaryValue
in lucene.
Linux
To make fail over easier to use and configure:
Currently bin/start-all.sh starts the primary master and nodes, and does not start the secondary master. The script should be improved so that, as per the configuration of the katta cluster, it also starts the secondary master seamlessly.
Also attaching a fail over experimentation documentation.
The JmxMonitor sometimes tries to write to zk after the zkclient shutdown has already happened.
{noformat}
Exception in thread "Thread-1723" Exception in
thread "Thread-1780"
org.I0Itec.zkclient.exception.ZkNodeExistsExcepti
on:
org.apache.zookeeper.KeeperException$NodeExi
stsException: KeeperErrorCode = NodeExists for
/katta/server-metrics/localhost:20002
at
org.I0Itec.zkclient.exception.ZkException.create(Z
kException.java:40)
at
org.I0Itec.zkclient.ZkClient.retryUntilConnected(Z
kClient.java:664)
at
org.I0Itec.zkclient.ZkClient.create(ZkClient.java:24
1)
at
org.I0Itec.zkclient.ZkClient.createEphemeral(ZkCli
ent.java:277)
at
net.sf.katta.protocol.InteractionProtocol.setMetri
c(InteractionProtocol.java:449)
at
net.sf.katta.monitor.JmxMonitor$JmxMonitorThr
ead.run(JmxMonitor.java:116)
Caused by:
org.apache.zookeeper.KeeperException$NodeExi
stsException: KeeperErrorCode = NodeExists for
KATTA-43 probably includes changes in the internal zk file structure. There should be a simple mechanism by which an existing 0.5 cluster can be upgraded to 0.6 (and for future versions as well).
Over the last weeks I noticed that the test suite hangs from time to time. Saw this with MultiInstanceTest and with LuceneClientTest at least. Not sure yet if it is a master or a client problem.
Here is a stacktrace from one hang:
{noformat}
[junit] "main" prio=5 tid=0x0000000101802800
nid=0x100401000 in Object.wait()
[0x0000000100400000]
[junit] java.lang.Thread.State: TIMED_WAITING
(on object monitor)
[junit] at java.lang.Object.wait(Native Method)
[junit] - waiting on (a
net.sf.katta.client.IndexDeployFuture)
[junit] at
net.sf.katta.client.IndexDeployFuture.joinDeploy
ment(IndexDeployFuture.java:54)
[junit] - locked (a
net.sf.katta.client.IndexDeployFuture)
[junit] at
net.sf.katta.client.LuceneClientTest.onBeforeClass
(LuceneClientTest.java:85)
[junit] at
net.sf.katta.AbstractKattaTest.beforeClass(Abstra
ctKattaTest.java:89)
[junit] at
net.sf.katta.testutil.ExtendedTestCase.setUp(Exte
ndedTestCase.java:67)
[junit] at
This issue is created from the discussion I had with [~tdunning] in KATTA-81. The main idea is to use zk for centralized configuration management instead of having too many properties files.
The main points of the suggestion:
- one configuration file with all katta configuration values for master/node/client (on the master's location)
-- the master writes the file to zk on changes
-- the nodes & clients read that file from zk before starting
- the configuration values each component needs are:
-- zk server addresses
-- zk session timeout
-- zk directory where cluster or config is stored by the master
- there should be a possibility to override the values for individual instances
- provide util methods/factories for the "connect to zk, read properties, construct client/node" thing
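The "override the values for individual instances" point could look roughly like this, using the defaults mechanism of `java.util.Properties`. It is only a sketch of the layering: reading the shared values from zk is left out, and the class name `LayeredConfig` and the property keys in the usage line are invented for the example.

```java
import java.util.Properties;

final class LayeredConfig {
  // Builds the effective configuration for one instance: values in
  // `overrides` win, everything else falls through to the `shared` values.
  static Properties effective(Properties shared, Properties overrides) {
    Properties result = new Properties(shared); // `shared` acts as defaults
    for (String key : overrides.stringPropertyNames()) {
      result.setProperty(key, overrides.getProperty(key));
    }
    return result;
  }
}
```

For example, with a shared `zk.session.timeout=30000` and an instance override of `5000`, `effective(shared, overrides).getProperty("zk.session.timeout")` returns `5000`, while keys that are not overridden still resolve to the shared value.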
Since the above changes are likely to break some api I would also refactor the Client class:
- move nodeSelectionPolicy into configuration object
- pass zkClient instead of zkProperties into constructor and make an additional connect
On a search query:
(1) the query is broadcast to each node with relevant shards
(2) each node creates a HitsMapWritable and sends it back to the client. This HitsMapWritable object can contain the results of several shards.
(3) on the client side each HitsMapWritable gets temporarily transformed into a Hits object and with the List of each of those Hits objects a single Hits object is built. This final single Hits object is returned to the searcher.
I think different sortings take place in the following places:
(2a) when the shard is queried the result list is already sorted by lucene
(2b) on combining the result lists of the different shards a PriorityQueue is used (KattaHitQueue), so I guess the resulting list is also sorted
(3a) when transforming the HitsMapWritable to a Hits object to a List automatically the internal sorting of the Hits class is triggered and resorts the whole node-sublist
(3b) when the final Hits object is built, the internal sorting of the Hits object is used to sort all hits after merging the sub-lists
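If the per-node lists really arrive sorted, as (2a) and (2b) suggest, the client could merge them without re-sorting. The following is a generic k-way merge sketch under that assumption; `SortedMerge` is a hypothetical helper and does not use Katta's `Hits` or `KattaHitQueue` classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class SortedMerge {
  // Position inside one already-sorted sub-list.
  private static final class Cursor<T> {
    final List<T> list;
    int index;

    Cursor(List<T> list) {
      this.list = list;
    }

    T head() {
      return list.get(index);
    }
  }

  // Merges sub-lists that are each sorted by `order` into one sorted list of
  // at most `limit` elements, without re-sorting the sub-lists.
  static <T> List<T> merge(List<List<T>> sortedLists, Comparator<? super T> order, int limit) {
    PriorityQueue<Cursor<T>> queue =
        new PriorityQueue<>((a, b) -> order.compare(a.head(), b.head()));
    for (List<T> list : sortedLists) {
      if (!list.isEmpty()) {
        queue.add(new Cursor<>(list));
      }
    }
    List<T> merged = new ArrayList<>();
    while (!queue.isEmpty() && merged.size() < limit) {
      Cursor<T> cursor = queue.poll();
      merged.add(cursor.head());
      cursor.index++;
      if (cursor.index < cursor.list.size()) {
        queue.add(cursor); // re-insert with its next element as the new head
      }
    }
    return merged;
  }
}
```

The queue holds one cursor per sub-list, so producing the top `limit` hits costs on the order of `limit * log(number of nodes)` comparisons instead of a full sort of all hits.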
I think the following undesired things happen:
- The 2a, 2b sorting is wasted in the moment the hits get transferred from the hit queue into the
Having a mixture of Hadoop versions in the client / nodes yield errors. It will be a nice to have feature to be able and find the hadoop version of a cluster
It would be great to be able to issue searches for
indexes "foo-*" rather than having to enumerate
all indexes starting with foo.
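A small sketch of how such a pattern could be expanded on the client before the search is issued. Only a trailing `*` is handled here, and `IndexNamePattern` is a hypothetical helper rather than an existing Katta class.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexNamePattern {
  // Expands a name that may end in '*' against the deployed index names.
  // Only a trailing '*' is treated as a wildcard in this sketch.
  static List<String> expand(String pattern, List<String> deployedIndexes) {
    boolean prefixMatch = pattern.endsWith("*");
    String prefix = prefixMatch ? pattern.substring(0, pattern.length() - 1) : pattern;
    List<String> matches = new ArrayList<>();
    for (String name : deployedIndexes) {
      if (prefixMatch ? name.startsWith(prefix) : name.equals(prefix)) {
        matches.add(name);
      }
    }
    return matches;
  }
}
```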
For other projects to be able to depend on Katta,
it needs to be available as a maven artifact.
I am working on a patch for this.
Currently the git repository has all Katta's
dependencies checked into the lib/ directory,
which bloats the size of the repository. These
dependencies should be downloaded in an
automated fashion as part of the build process.
See the discussion in [KATTA-74].
This is uncommittable code, however it's a placeholder until LPT can be integrated into Katta. Thanks to Kevin Peterson for writing this in Ruby first and suggesting the algorithm.
I recently very much enjoyed having the sources in that jar as well, since with modern IDEs it gives instant access to javadoc and the source code. Having more people see our sources will be definitely good for us. For example mockito includes the sources in the jar, we should do the same.
The default distribution policy tends to pile
shards up on the nodes listed first in the
conf/nodes file.
The LowestShardCountDistributionPolicy picks the node with the fewest shards to replicate to, and the node with the most shards to remove replicas from.
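The selection rule described above can be illustrated in a few lines. This is a simplified sketch over a map of node name to shard count, not the actual `LowestShardCountDistributionPolicy` implementation; the class and method names are hypothetical, and ties go to whichever node the map iterates first.

```java
import java.util.Map;

final class ShardCountSelection {
  // Node with the fewest shards: the target for a new replica.
  // Returns null if there are no nodes.
  static String leastLoaded(Map<String, Integer> shardCountByNode) {
    String best = null;
    int bestCount = Integer.MAX_VALUE;
    for (Map.Entry<String, Integer> entry : shardCountByNode.entrySet()) {
      if (best == null || entry.getValue() < bestCount) {
        best = entry.getKey();
        bestCount = entry.getValue();
      }
    }
    return best;
  }

  // Node with the most shards: the candidate for removing a surplus replica.
  // Returns null if there are no nodes.
  static String mostLoaded(Map<String, Integer> shardCountByNode) {
    String best = null;
    int bestCount = Integer.MIN_VALUE;
    for (Map.Entry<String, Integer> entry : shardCountByNode.entrySet()) {
      if (best == null || entry.getValue() > bestCount) {
        best = entry.getKey();
        bestCount = entry.getValue();
      }
    }
    return best;
  }
}
```

A real policy would also have to skip nodes that already hold a replica of the shard being placed, which this sketch leaves out.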
Set by changing the configuration property master.deploy.policy=net.sf.katta.master.LowestShardCountDistributionPolicy in katta.master.properties.
Also, zip files read from hadoop were stored on
the local disk and then unpacked, this patch reads
the zip file from hadoop and unpacks it as it is
read.
The system property katta.spool.zip.shards may
be set to the string true, to force the zip file to be
spooled to local disk first.
Minor cleanup in Node, to ensure that temporary
files and directories are deleted in the case of an
error exit.
$ uname -a
Linux imyousuf-laptop 2.6.28-15-generic #52-Ubuntu SMP Wed Sep 9 10:49:34 UTC 2009 i686 GNU/Linux
$ java -version
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) Server VM (build 11.3-b02, mixed mode)
$ ant -version
Apache Ant version 1.7.0 compiled on December 13 2006
$ git --version
git version 1.6.2.1
$ git remote show origin
* remote origin
URL: git://katta.git.sourceforge.net/gitroot/katta/katta
Remote branch merged with 'git pull' while on branch master
master
Remote branch merged with 'git pull' while on branch stable-0.5.1
stable-0.5.1
Tracked remote branches
branch-0.1
The mockito jar seems to be missing from lib folder and its neither downloaded by Ivy, as a result can not build 0.5.1 branch.
$ ant clean dist
Buildfile: build.xml
clean:
[echo] cleaning katta-core
check-ivy-available:
download-ivy:
install-ivy:
[echo] Ivy path
resolve:
[ivy:resolve] :: Ivy 2.0.0 - 20090108225011 :: http://ant.apache.org/ivy/ ::
:: loading settings :: file = /media/unix-2/projects/katta/ivy/ivysettings.xml
[ivy:resolve] DEPRECATED: useOrigin option is deprecated when calling resolve, use useOrigin setting on the cache implementation instead
[ivy:resolve] :: resolving dependencies :: emi#event-processing/grid;working@imyousuf-laptop
[ivy:resolve] confs: [ant, eclipse, compile, test, instrument, checkstyle, job]
[ivy:resolve] found svnant#svnant;1.2.1 in
.. so we need something that takes measurements (cpu, memory, etc) and publishes them, and another component that subscribes to those and stores them.
When implementing SOLR-1395 I needed to load
configuration files from specific paths rather than
from the classpath.
The patch also includes a jar-test target.
When building shards, it would be useful to
consistently build
shards up to a given size so that they are all
roughly the same
size in bytes. Perhaps in Hadoop this is difficult
because the
number of jobs must be set ahead of time?
I am trying to get strictly time-ordered results
from Katta by passing a Sort.
Each of my shards represents a fixed time period,
so in addition to being able to pass a Sort, it will
also sort the shard names and process them in
order (according to the sort order, reverse or not,
in the primary SortField).
The instructions at http://katta.sourceforge.net/documentation/how-to-merge-indexes fail with Katta 0.5.1 due to Commons HTTP client being missing:
java.lang.NoClassDefFoundError:
org/apache/commons/httpclient/HttpMethod
Copying its jar from Hadoop's lib directory to
Katta's lib fixes it.
In this example I've attempted to add an index,
but I've made a typo in the port number:
$ bin/katta addIndex index3
hdfs://localhost:80200/index
org.apache.lucene.analysis.StandardAnalyzer 1
.not deployed index index3
$ bin/katta listIndexes
Exception in thread "main" java.io.IOException:
Call to localhost/127.0.0.1:19000 failed on local
exception: Connection refused
You should be able to list your indices even if one
of them is not available. That way you would be
able to know which index to remove.
In addition, the "not deployed index index3"
message you get upon trying to add the index in
the first place is unhelpful.
Expanding katta-core-0.5.1.tar.gz does not create
a subdirectory for katta as it should.
The jets3t library is not found when attempting to
build the example indexing tool. A log is attached.
Since the bin scripts in katta seem to be based on
those from Hadoop, they're subject to the same
bug regarding CDPATH:
https://issues.apache.org/jira/browse/HADOOP-6101
The scripts expect cd to produce no output, but
this is not true in bash if the CDPATH
environment variable is set. A suggested fix would
be to unset CDPATH in the affected scripts.
> Exception in thread "main" java.lang.NullPointerException
> at java.util.ArrayList.addAll(ArrayList.java:472)
> at net.sf.katta.client.Client.getShardsToSearchIn(Client.java:296)
> at net.sf.katta.client.Client.getNode2ShardsMap(Client.java:278)
> at net.sf.katta.client.Client.search(Client.java:212)
> at net.sf.katta.client.Client.search(Client.java:205)
> at net.sf.katta.Katta.search(Katta.java:466)
> at net.sf.katta.Katta.main(Katta.java:100)
When trying to start a Katta master using an
external Zookeeper via bin/katta startMaster, the
process starts and immediately finishes without
an error message.
As far as I understand the issue, the problem
seems to be in Katta.startMaster(). After starting
master, client etc., the following code is
executed:
if (_zkServer != null) {
_zkServer.join();
} else {
// since we do not have a running zookeeper we need something to join
// in...
Thread thread = new Thread("keep a live");
thread.setDaemon(true);
thread.start();
try {
thread.join();
} catch (InterruptedException e) {
}
}
When using an external Zookeeper the
_zkServer variable is null, so the threading code is
executed. The intention of this block seems to be
to sleep indefinitely, which of course would make
sense in this case, since the master is running its
threads, doing work etc.
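One likely reason the quoted block does not sleep: a Thread constructed with only a name has no Runnable, so it ends as soon as it is started and join() returns right away. A minimal sketch of blocking until shutdown instead, assuming a hypothetical KeepAlive helper (not part of Katta) built on a CountDownLatch:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of Katta: blocks a caller until released. */
public class KeepAlive {
    private final CountDownLatch latch = new CountDownLatch(1);

    /** Blocks until release() is called or the calling thread is interrupted. */
    public void await() throws InterruptedException {
        latch.await();
    }

    /** Waits at most the given time; returns true if released in time. */
    public boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        return latch.await(timeout, unit);
    }

    /** Lets every waiting caller continue, e.g. from a shutdown hook. */
    public void release() {
        latch.countDown();
    }

    /** Mirrors the reported code: a thread without a Runnable ends at once. */
    public static boolean emptyThreadFinishesImmediately() throws InterruptedException {
        Thread thread = new Thread("keep alive");
        thread.setDaemon(true);
        thread.start();
        thread.join(1000); // returns almost immediately, nothing to run
        return !thread.isAlive();
    }
}
```

In startMaster() the else branch could then call await() on such an object, and a shutdown hook could call release().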
If you get a ConnectionLossException when trying
to deploy an index, the entire index is marked as
ERROR and can never recover.
At the least, Katta should handle these situations
more gracefully.
For instance, ZkClient.exists just blows out,
transforming the recoverable exception into a
non-recoverable KattaException.
It is dangerous to change too many of these, but
some can probably be fixed.
It seems to happen pretty often that a cluster
runs into an error of some kind and never resets
the status from ERROR back to OK.
As such, it would be really useful to have a utility
that would allow a cluster to be scanned for
correctness and then mark the cluster as up or
down.
When using the new feature 'external zookeeper'
implemented in KATTA-14 the default namespace
(rooted at "/katta") seems to be not created. This
namespace usually is created in
ZkServer.startZooKeeperServer(), which is not
called from the Katta.startMaster() now because
of the new conditional:
public static void startMaster(ZkConfiguration
conf) throws KattaException {
...
if (conf.isEmbedded()) {
_zkServer = new ZkServer(conf); // /.
The following jars are needed if you want to use
the s3 or s3n protocol (via jets3t) as the path to
indexes for the addIndex command.
* commons-httpclient-3.0.1.jar
* commons-codec-1.3.jar
These should be part of the standard Katta
release, in the lib/ sub-directory.
via mail from Johannes Herr:
Hi Stefan,
the patch with my changes is attached.
The main point is changing the port used in
ZKClient. ZKClient wraps a ZooKeeper object
instantiated in ZKClient.start() via:
{code:java}
_zk = new ZooKeeper(_servers, _timeOut, this);
{code}
The _servers variable is defined in the ZKClient
constructor via:
{code:java}
_servers = configuration.getZKServers();
{code}
That means it holds the value of the
zookeeper.servers line in katta.zk.properties. In
my case it is:
{noformat}
zookeeper.servers=hostA:2181,hostB:2181
{noformat}
If I understand the code correctly this means, that
there will be a client connection to hostA:2181 or
(see KATTA-61 for a related issue)
If you add a new node to a katta cluster, it will
never be assigned any shards unless another node
goes down.
It would be nice to be able to rebalance a cluster
by moving shards from the node with the most
shards to the node with the least.
Rebalancing should, of course, only be done if
there is an imbalance worth correcting. My first
swag at that rule would be to only rebalance if
some node has more than 2 fewer shards than
the average number of shards computed by
dividing the total number of shards by the
number of nodes (using floating point, not
integer math).
It is also worth having a throttle that limits how
often shards are moved.
And, of course, it is important that shards not be
deleted from where they came from until they
are present on the destination node and possibly
not even for a while after that.
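The proposed rule can be sketched as follows. This is only an illustration of the threshold check described above; the class name, method name and the map of shard counts are assumptions, not Katta API:

```java
import java.util.Map;

/** Hypothetical sketch of the "imbalance worth correcting" rule. */
public class RebalanceRule {

    /**
     * Returns true if some node holds more than `threshold` fewer shards
     * than the floating point average of shards per node.
     */
    public static boolean needsRebalance(Map<String, Integer> shardsPerNode, double threshold) {
        if (shardsPerNode.isEmpty()) {
            return false;
        }
        int total = 0;
        for (int count : shardsPerNode.values()) {
            total += count;
        }
        // floating point division, not integer math
        double average = (double) total / shardsPerNode.size();
        for (int count : shardsPerNode.values()) {
            if (average - count > threshold) {
                return true;
            }
        }
        return false;
    }
}
```

With the threshold of 2 suggested above, a freshly added empty node triggers a rebalance as soon as the average exceeds 2 shards per node.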
If you have a scenario where you add a new node
to a cluster and take down an old node, then the
shards of the old node will be distributed evenly
across the entire cluster.
It would be preferable in many cases if all of the
shards from the old node were put on the
(empty) new node.
This should be relatively easy to do by building a
deployment policy that is aware of the number of
shards each node has and assigns new shards to
the least occupied node.
A related issue is balancing a cluster after adding
a new node, but without taking down an old
node. I will file a separate issue on that and put a
link here. Having a cluster balancer would solve
this issue as well, but more slowly.
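A minimal sketch of such a load-aware policy, assuming the current shard count per node is available as a map. The class and method names are made up for illustration and do not correspond to Katta's deploy policy interface:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical policy: each shard goes to the node with the fewest shards. */
public class LeastLoadedPolicy {

    /** Returns a plan mapping node name to the shards it should receive. */
    public static Map<String, List<String>> assign(List<String> shards,
                                                   Map<String, Integer> currentLoad) {
        Map<String, Integer> load = new TreeMap<>(currentLoad); // working copy
        Map<String, List<String>> plan = new TreeMap<>();
        for (String shard : shards) {
            String target = null;
            for (Map.Entry<String, Integer> entry : load.entrySet()) {
                if (target == null || entry.getValue() < load.get(target)) {
                    target = entry.getKey();
                }
            }
            if (target == null) {
                throw new IllegalStateException("no nodes available");
            }
            plan.computeIfAbsent(target, key -> new ArrayList<>()).add(shard);
            load.put(target, load.get(target) + 1); // account for the new shard
        }
        return plan;
    }
}
```

In the scenario above, three shards from a retired node would all land on the empty new node as long as the remaining nodes already hold three or more shards each.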
Allow user to specify a root path other than
/katta.
This allows multiple installations of Katta (master,
clients, nodes) to share one ZooKeeper server.
Also rename ZkPathes.java to ZkPaths.java.
It is sometimes important to proceed even if
some shards are not available. I suggest two
changes in semantics:
1) if a request to a node fails in the threaded
request loop in the client, then additional search
requests will be created to do the same request
on any other nodes that have the same shards as
well as marking the node as down. If no other
nodes have the shard, then request will be
marked as failing.
2) once results from x% of the shards in the
original request have been collected, a deadline
will be set for t milliseconds in the future. If all
results arrive before the deadline, then the
search will proceed as normal. If the deadline
arrives before all results have been collected,
then if y% of the shards have results, the results
will be returned as complete. If the deadline
passes with less than y% of the shards having
results, then an error will be raised. The values of
x, y and t will be parameters of the search with
reasonable defaults (such as 70%, 90%, 500ms).
Note that it is important for x and y to be
separate so that x can be set low enough so that
the deadline will always be triggered while y is
still high enough to guarantee reasonable results
from all successful queries.
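The deadline semantics from 2) can be sketched as a small state machine. This is only an illustration under assumptions: the class, the Outcome enum and the explicit clock argument are hypothetical and not part of the Katta client, and x and y are passed as fractions rather than percentages:

```java
/** Hypothetical sketch of the x / y / t partial result rule. */
public class PartialResultPolicy {
    public enum Outcome { WAIT, COMPLETE, ERROR }

    private final double x;      // fraction of shards that arms the deadline
    private final double y;      // fraction required to accept partial results
    private final long tMillis;  // how long to wait once the deadline is armed
    private long deadline = -1;  // -1 means "not armed yet"

    public PartialResultPolicy(double x, double y, long tMillis) {
        this.x = x;
        this.y = y;
        this.tMillis = tMillis;
    }

    /** Call whenever a shard result arrives or a timer tick occurs. */
    public Outcome onProgress(int received, int total, long nowMillis) {
        if (received >= total) {
            return Outcome.COMPLETE; // all results arrived, proceed as normal
        }
        double fraction = (double) received / total;
        if (deadline < 0 && fraction >= x) {
            deadline = nowMillis + tMillis; // arm the deadline once
        }
        if (deadline >= 0 && nowMillis >= deadline) {
            return fraction >= y ? Outcome.COMPLETE : Outcome.ERROR;
        }
        return Outcome.WAIT;
    }
}
```

Keeping x below y means the deadline is armed early, while y alone decides whether a partial result is good enough to return.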
The idea here is that all data in katta that
depends on a search node staying up should be
ephemeral and created by that node.
Currently, the structure is something like
node2shards//shard*
and
shard2node//
The files under shard2node disappear correctly
when the node disappears, but the files in the
first are created by the master and do not
necessarily go down. The master could be
extended to make this happen, but if the master
ever lost track, then the data would be corrupt.
This currently happens in EC2 if ZK connection
parameters are not set with long timeouts and it
causes the entire cluster to appear to be down.
Just making the shard files ephemeral does not
work because the node directory still exists.
What I propose is that the data about what
shards a node is serving be kept in an ordinary file
that lists the shards rather than as a directory
with a single entry per shard. This file can then be
node-wise ephemeral and will vanish correctly if
the node drops out. To make this work, it would
After downloading 0.5.1 and deploying/untarring
(with -p) on a cluster, none of the files in the bin/
subdirectory had their 'x' bit set.
Which means that commands would fail, due to
the reliance on this permission.
Some options for how to fix this:
* Permissions should be set correctly in the
.tar(.gz) release file.
* There should be an install script that sets these
after deployment
* The scripts should internally use "sh xxx" calls to
other scripts.
See
http://katta.sourceforge.net/documentation/inst
all-and-configure-katta
Has section that says "# host:path where hadoop
code should be rsync'd from. Unset by default."
Jira comment - maybe add a "documentation"
component?
As per the download from
http://katta.sourceforge.net/home/download
Make Katta usable for any task that requires
distributed shards.
Separate out shard management from Lucene
searching.
Lucene becomes one use case of Katta. Add a
second use case which searches Hadoop
MapFiles.
Here is a patch that does the following to the
node sub-directory.
If this is a good thing to do, I can continue with
other sub-directories.
Changed DocumentFrequenceWritable to
DocumentFrequencyWritable (to correct the
spelling)
Fixed grammar and spelling in a number of places
Made KattaHitQueue implement Iterable to
make code more readable
Got rid of some redundant boxing/unboxing
Removed unused handle method and the
apparently unused IRequestHandler interface
Added some javadoc comments
Deleted some goo.
Added several questions in the form of TODO's
Only when the session expires, we have to close
the current ZooKeeper instance and start another
one.
We also have to distinguish between those two
events and let the IZKStateListener know what
has happened, because there are different
actions that might have to take place on
"disconnect" and "expired". E.g. ephemeral nodes
only have to be recreated when the session
expired.
The test ClientFailoverTest fails once in a while:
I see the following output:
09/04/23 17:29:17 WARN node.BaseNode:244 - Old node path '/katta/nodes/Senor-Vossi.local:20017' for this node detected, delete it...
09/04/23 17:29:17 INFO master.Master:159 - got node event: [Senor-Vossi.local:20016]
09/04/23 17:29:17 INFO node.BaseNode:253 - node 'Senor-Vossi.local:20017' announced
09/04/23 17:29:17 INFO node.BaseNode:260 - Start serving shards...
09/04/23 17:29:17 INFO master.Master:159 - got node event: [Senor-Vossi.local:20017, Senor-Vossi.local:20016]
09/04/23 17:29:17 INFO node.BaseNode:405 - announce shard 'index1_aIndex'
09/04/23 17:29:17 WARN node.BaseNode:409 - detected old shard-to-node entry - deleting it..
09/04/23 17:29:17 INFO node.BaseNode:405 - announce shard 'index1_bIndex'
09/04/23 17:29:17 WARN node.BaseNode:409 - detected old shard-to-node entry - deleting it..
09/04/23 17:29:17 INFO node.BaseNode:405 - announce shard 'index1_cIndex'
09/04/23 17:29:17 WARN node.BaseNode:409 - detected old shard-to-node entry - deleting it..
09/04/23 17:29:17 INFO master.DistributeShardsThread:131 - processing of update started...
N/A The current redeploy method is a combination of
remove and deploy making the index unavailable
until the deployment is finished.
It would be a great addition if an already existing
index could be refreshed:
- Katta would examine the HDFS index directory
to see which shards are there
- Depending on the shard directory name it could
deduce if the shard is already present on the
nodes and doesn't need to be copied
- If there are new shards that are not already
distributed, it would automatically start copying
them over to the nodes.
- After the deployment is done it switches all
search traffic to the refreshed index
automatically.
Assumptions:
- It is okay for refreshed indexes to have deleted
events not automatically removed (or the client
needs to filter those out through an add'l deleted
field)
- Shard names are unique and incremental
updates into the shard directory will have a
unique name to avoid collisions.
This would be a great enhancement considering
that it would cut down deployment time for large
indexes dramatically and simplify the overall
client code (the current API makes it necessary to
N/A I'm currently using the IndexMetaData to
programmatically inspect and rotate the deployed
indexes. It appears to have everything I need except
the actual index name.
Here is some sample code that illustrates the
problem:
{code:java}
DeployClient dc = new DeployClient(_zkconf);
List eventIndexes = new Vector();
for (IndexMetaData i : dc.getIndexes(IndexState.DEPLOYED)) {
  // this would be nice and logical to do
  System.out.println("index name:" + i.getName());
}
{code}
I looked at Katta.java and ZK paths are used to
get to the index names, which isn't really a good
way.
The old (deprecated) API is hard wired to use the
KeywordAnalyzer.
The new API expects that we get a lucene query.
So there is no reason for addIndex to take an
Analyzer as an input param.
N/A Hi,
While using Katta in our development project it
became clear that it would be a lot easier for us
to use it if it allowed programmatic
configuration of the Client and Deploy Client
instead of insisting on property files in the
classpath with a certain name.
Something like:
Properties prop = new Properties();
prop.setProperty("zookeeper.servers", "localhost:2181");
DeployClient dc = new DeployClient(new ZkConfiguration(prop));
dc.addIndex("events", "hdfs://localhost/myIndex", StandardAnalyzer.class.getName(), 1);
dc.disconnect();
Ideally the same would also work for regular
Clients.
Thanks!
Sorry, I found this bug long ago but
forgot to report it.
The buggy code is in KattaMultiSearcher::search(),
as below:
================================= below
================================
boolean working = true;
while (working) {
ScoreDoc scoreDoc = null;
for (int i = 0; i hitB.getDocId();
}
- return hitA.getScore() hitB.getScore();
}
We have been running a moderate sized cluster
in EC2 for some time now (300 shards). Due to
various reasons, we have seen a number of cases
where we had a Zookeeper session expire. This
leads to an inconsistent state for katta in ZK
which often prevents further use of the cluster.
What should happen when a search node has a
session expiration is
a) all of its own ephemeral files in ZK will
disappear. (this seems to work correctly)
b) Any dependent files created by the master
should be deleted as quickly as possible. It is
preferable if all files that depend on the node
being present be created as ephemeral by the
node itself. This would include node2shards,
shard2nodes and server entries. (this is probably
where the discrepancy occurs)
c) the node should re-register with ZK from
scratch, almost as if it were starting over. The one
exception is that it should declare that it has
whatever shards it has. (this definitely does not
happen correctly)
d) the node should honor requests that come in
during the disconnect time since these may come
from clients who didn't get the memo about
the node disappearing.
The ZKClient needs an update to reflect the
changes of zookeeper 3.1.
For example there are much more granular states
that an event now can have (e.g. Node Removed).
SecondaryMaster cannot take over when
firstMaster failed:
this bug can be described as below:
(1) firstMaster fails
(2) zookeeper deletes the "/katta/master"
zookeeper node
(3) zookeeper tells SecondaryMaster about the above
event (in func processDataOrChildChange of
class ZKClient.java)
(4) Set listeners = {
SecondaryMaster's ZKClient instance };
(ZKClient.java +511)
(5) SecondaryMaster tries to resubscribe to
"/katta/master" (ZKClient.java +513), but listeners =
empty_Set after func
resubscribeDataPath(event, path, listeners) was
called.
I made a simple patch for this bug, code as below;
if you have a better patch, please tell me.
In func resubscribeDataPath of class ZKClient.java
(line 534 ~ 545)
======================= below
=========================================
==
private byte[]
resubscribeDataPath(WatchedEvent event, final
String path, final Set listeners) {
cannot start as Quorum Zookeepers in katta-0.4,
but i didn't try katta-0.5, if u can do the job, tell
me please.
finally, i start Quorum Zookeepers outside katta
(that is, start 3 zookeepers first, and then start
katta master or start katta node connect to the 3
zookeepers)
in Node.java +469 ~ 472 (func getDocFreqs of
class Node)
BUG and Patch as below:
====================== below
==========================
- final java.util.Iterator termIterator =
termSet.iterator();
int numDocs = 0;
for (final String shard : shards) {
+ final java.util.Iterator termIterator =
termSet.iterator();
while (termIterator.hasNext()) {
====================== above
==========================
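The effect of the patch above can be shown in isolation: an iterator created once outside the shard loop is exhausted after the first shard, so later shards see no terms. A minimal, self-contained sketch (the class and method names are made up and are not Katta code):

```java
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Hypothetical demo of the shared iterator bug and its fix. */
public class IteratorReuse {

    /** Buggy shape: one iterator shared across all shards. */
    public static int visitsWithSharedIterator(List<String> shards, Set<String> terms) {
        int visits = 0;
        Iterator<String> termIterator = terms.iterator();
        for (String shard : shards) {
            while (termIterator.hasNext()) { // already exhausted after shard 1
                termIterator.next();
                visits++;
            }
        }
        return visits;
    }

    /** Patched shape: a fresh iterator for every shard. */
    public static int visitsWithFreshIterator(List<String> shards, Set<String> terms) {
        int visits = 0;
        for (String shard : shards) {
            Iterator<String> termIterator = terms.iterator();
            while (termIterator.hasNext()) {
                termIterator.next();
                visits++;
            }
        }
        return visits;
    }
}
```

For 3 shards and 2 terms the shared variant visits 2 (shard, term) pairs, the patched variant all 6.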
in Node.java +383 (func startRPCServer of class
Node)
BUG: if (configuration.getStartPort() - serverPort
0
at org.apache.zookeeper.ZooKeeper.validatePath(ZooKeeper.java:534)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1078)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1119)
When you do "ant -p" you do not see the 'eclipse'
and 'dist' targets listed.
This is because of the missing descriptions for
these targets in the build.xml file.
I have experienced a random test failure for
net.sf.katta.zk.ZKClientTest.
It might be a timing related issue.
I checked out the latest sources, and executed
the following steps :
ant clean
ant compile
ant test
In this folder should be a sub folder with our
coverage reports.
For now we only want to check that each java file
has the license header.
I'm seeing a lot of the following errors in the test logs.
This might be related to the zookeeper update.
java.lang.RuntimeException: Exception while restarting zk client
at net.sf.katta.zk.ZKClient.reconnect(ZKClient.java:568)
at net.sf.katta.zk.ZKClient.processDisconnect(ZKClient.java:488)
at net.sf.katta.zk.ZKClient.process(ZKClient.java:466)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:366)
Caused by: net.sf.katta.util.KattaException: unable to create path '/katta/node-to-shard/jemez:20017' in ZK
at net.sf.katta.zk.ZKClient.create(ZKClient.java:296)
at net.sf.katta.zk.ZKClient.create(ZKClient.java:275)
at net.sf.katta.node.Node.announceNode(Node.java:183)
at net.sf.katta.node.Node.handleReconnect(Node.java:147)
at net.sf.katta.zk.ZKClient.reconnect(ZKClient.java:565)
KATTA-16 Right now the IClient-Node interaction is defined
by the ISearch API, which is very limited.
This API is very focused on search. It does not give
users the ability to programmatically create Lucene
Query objects, and we always expect
HitsMapWritable.
The idea is to generalize this API so it can be used
for more than just search, e.g. content lookup
(cached pages, facet search etc). We could have a
simple Response request(Request) method and
encode the kind of request into the request
object. Since the Lucene Query object is
serializable it should be easy to write a writable
request object that transports all sorts of requests, including a
Lucene query.
It looks like deploying two different
indexes that have shards with the same name
leads to conflicts in our zookeeper structure.
Eg:
+ indexA
++ Shard1
+ indexB
++ Shard1
At least Shard1 will be a problem in the
shard2Node path.
I suggest we just name the shards
indexName_shardName since this would make it
unique.
Zookeeper 2.x is known to have problems during
reconnection.
all Hi,
I did some tests with retrieving the search result
details in parallel using a thread pool (i.e.
executing details = client.getDetails(hit) in
parallel).
For retrieving the first 20 results I saw speed
improvement on the order of 5-10 times faster
compared to a single threaded retrieval.
The actual implementation is trivial especially
with the awesome Concurrent package
(ThreadPool and CountDownLatch).
IMHO, a simple and very rewarding
improvement.
-Erich
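A minimal sketch of what such a parallel retrieval could look like. It is not Erich's implementation: the helper below is hypothetical and takes the per-hit lookup as a function, so a call like client.getDetails(hit) would be passed in by the caller:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper: fetches details for all hits on a thread pool. */
public class ParallelDetails {

    /** Returns the details in the same order as the given hits. */
    public static <H, D> List<D> fetchAll(List<H> hits, Function<H, D> fetchDetails, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<D>> futures = new ArrayList<>();
            for (H hit : hits) {
                Callable<D> task = () -> fetchDetails.apply(hit);
                futures.add(pool.submit(task)); // lookups run concurrently
            }
            List<D> details = new ArrayList<>();
            for (Future<D> future : futures) {
                details.add(future.get()); // waits for each lookup in order
            }
            return details;
        } finally {
            pool.shutdown();
        }
    }
}
```

This variant collects Futures instead of using a CountDownLatch, which keeps result ordering and exception propagation simple.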
This is used in
net.sf.katta.indexing.SequenceFileCreator and
there is an easy replacement for this:
java.io.ByteArrayOutputStream
I did the following tests from the command line
pvoss$ bin/katta search testIndex foo:bar 2
4 hits found in 0.011sec.
| Hit | Node | Shard | DocId | Score |
=======================================================================
| 0 | Senor-Vossi.local:20000 | testIndex_aIndex | 0 | 6.811141 |
-----------------------------------------------------------------------
| 1 | Senor-Vossi.local:20000 | testIndex_bIndex | 0 | 5.898621 |
-----------------------------------------------------------------------
$ bin/katta search testIndex foo:bar 3
4 hits found in 0.012sec.
| Hit | Node | Shard | DocId | Score |
=======================================================================
| 0 | Senor-Vossi.local:20000 | testIndex_cIndex | 0 | 6.811141 |
-----------------------------------------------------------------------
| 1 | Senor-Vossi.local:20000 | testIndex_aIndex | 0 | 6.811141 |
-----------------------------------------------------------------------
The idea is to run load tests on ec2 to make
performance a first level citizen.
+ write some script that creates a ec2 cluster
+ deploy the current sources on this cluster
+ write a class that generates a test index
+ start katta, deploy the test index
+ run a http://faban.sunsource.net test
+ graph the result
+ shutdown the cluster.
It is common in large scale search deployments to
separate (at least) raw search and content
retrieval. The query types, volumes are very
different in these different engines, so the
number of shards are different as well. The
problems of index deployment and management
are the same, however.
This indicates that katta should have the
following extensions:
- abstract out a small KattaMangeable interface
to allow manageable instances to be managed
- extend the configuration to allow multiple pools
to be managed
- extend the client software to allow different
pools to be queried
Zookeeper 2 assumes that a client will re-
establish all watches. katta doesn't do this so that
in environments like EC2 where disconnects are
pretty common, there are serious problems.
The simple solution is to move to ZK 3. We have
done this and should have a patch available
shortly.
It is nice to have an integrated zookeeper cluster,
but it should be possible to use an external
cluster for production deployments.
Make katta able to run on ec2.
The idea is that we port all hadoop ec2 scripts to
katta.
We need a script to create an AMI and manage katta
scripts.
We need to investigate which ports we need to
open to access katta nodes as a client.
I removed the following line in the ec2 script:
-e 's|# export KATTA_SLAVE_SLEEP=.*|export KATTA_SLAVE_SLEEP=1|' \
in section:
# Configure Katta
sed -i -e "s|# export JAVA_HOME=.*|export JAVA_HOME=/usr/local/jdk${JAVA_VERSION}|" \
-e 's|# export KATTA_LOG_DIR=.*|export KATTA_LOG_DIR=/mnt/katta/logs|' \
-e 's|# export KATTA_SLAVE_SLEEP=.*|export KATTA_SLAVE_SLEEP=1|' \
-e 's|# export KATTA_OPTS=.*|export KATTA_OPTS=-server|' \
/usr/local/katta-$KATTA_VERSION/conf/katta-env.sh
It makes sense to have a slave sleep in big
installations. So we might want to add it.
When merging is used with a cronjob it might
make sense to not merge indexes which have reached a
certain size or document count.
With a configurable threshold it could be
achieved that small indexes will be merged
together and big indexes will remain untouched.
Currently it's not possible to start more than one
node out of the same katta distribution without
reconfiguring the shard-folder in
conf/katta.node.properties.
It might make sense that each node stores its
shards in //nodeName.
The profit of shipping the shards seems
questionable. It makes especially the merging a
lot harder.
Marko is expert here and knows more.
Currently metadata in the zookeeper system is
stored in Writables (like IndexMetadata).
Does it make sense to replace these writables
with a MapWritable implementation ?
This should give more freedom to future changes.
I think the master-failover mechanism is
broken because of invalid handling of the
ephemerals.
A shutdown of the cluster (bin/stop-all.sh) leaves
the ephemeral nodes (of master and nodes)
behind.
On the next start, the ephemerals are still there (but
not connected to any owner). The master and all
nodes delete unconnected ephemerals with their
address and create new ones.
This leads to a lot of node connected / node
disconnected events in the master log on startup
and it looks like there is something wrong.
Especially on a large cluster the startup log looks
very confusing. But redeployments shouldn't
happen, thanks to the safe mode.
I think there are several solutions.
It is possible to re-own unconnected ephemerals if
session id and password are known. So if, for example, a
node persisted its zookeeper sessionId and
password, it could re-own its ephemeral on startup
if it still exists.
What is already there is a shutdown hook on
the node side. This hook tries, among other things, to
delete the node's ephemeral.
The problem with this is the stop-all script, since
it stops the master with the zk process first and
then the node. So when the hook executes there is no
zk system left to communicate with.
I think it would be a good move to decouple
master and zookeeper process and stop the
zookeeper process last in the stop-all.sh script.
KATTA-6, KATTA-7 The ephemeral node
(http://hadoop.apache.org/zookeeper/docs/r3.0.1/zookeeperProgrammers.html#Ephemeral+Nodes)
handling seems to be suboptimal in a lot of
cases. Please see subtasks.
Nodes receive search requests for one or more
shards at a time. The searches are executed
sequentially but should rather be executed in
parallel.
I think we don't utilize the full power of lucene
indexing. For example we have our own flushing
mechanism. Lucene has and promotes a ram-
based flushing scheme. I already made a lot of
TODO's in the index related code that needs to be
revised.
Taking a quick look at the search code of the
node, it seems to me that there is one lucene
analyzer hardcoded.
Instead the analyzer should depend on the
index since we allow specifying an analyzer per
index.
This should be checked!
The lucene analyzer which is used for indexing is
specified through IDocumentFactory.
I would think decoupling the analyzer from the
factory makes sense. It should be easy to just
specify the class name of the analyzer in the
katta.index.properties.
The bixo release includes a version of the ec2-api-tools
(bin/ec2/support/ec2-api-tools-1.3-30349).
This is at least a couple of versions older than the
latest set of tools
(http://developer.amazonwebservices.com/connect/entry.jspa?externalID=351)
While having the tools bundled with bixo removes
the additional step of requiring one to download
these tools, it does mean that anyone who
already has the tools now has to deal with
multiple versions.
In order to decouple the tools one could just start
with making changes to setenv.sh by honoring
the default EC2 shell variables (as per
http://docs.amazonwebservices.com/AWSEC2/2008-12-01/GettingStartedGuide/index.html?setting-up-your-tools.html)
I noticed in a dry run for 10000 domains that some
webmasters use crawl-delay = 0.
Existing FetcherPolicy has this code:
public FetchRequest getFetchRequest(int maxUrls) {
int numUrls = Math.min(maxUrls, (int)(DEFAULT_FETCH_INTERVAL / _crawlDelay));
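With a crawl delay of 0, the division above either throws an ArithmeticException (if both operands are integer types) or produces an infinite value that the int cast turns into Integer.MAX_VALUE (if they are floating point); the field types are not shown here. One possible guard, sketched under the assumption that all values are in milliseconds and with a hypothetical minimum delay parameter that is not part of Bixo's FetcherPolicy:

```java
/** Hypothetical sketch: clamp the crawl delay before dividing by it. */
public class CrawlDelayGuard {

    /**
     * Number of URLs to request for one fetch interval, never dividing by
     * zero. minCrawlDelayMs is an assumed lower bound chosen by the crawler.
     */
    public static int maxUrlsForInterval(int maxUrls, long fetchIntervalMs,
                                         long crawlDelayMs, long minCrawlDelayMs) {
        if (minCrawlDelayMs <= 0) {
            throw new IllegalArgumentException("minCrawlDelayMs must be positive");
        }
        // a crawl-delay of 0 (or a negative value) falls back to the minimum
        long effectiveDelay = Math.max(crawlDelayMs, minCrawlDelayMs);
        long byDelay = fetchIntervalMs / effectiveDelay;
        return (int) Math.min((long) maxUrls, byDelay);
    }
}
```

Whether a site that asks for crawl-delay = 0 should get the crawler's minimum delay or its default delay is a policy decision this sketch does not settle.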
Defaults to English
See [BIXO-82] for related issue.
Need to verify what has to happen in client to
properly handle decompression of response.
See [BIXO-82] for related issue.
Currently if you specify an existing empty
directory as the output dir for SimpleCrawlTool,
you get a confusing message that says something
about there not being any previous crawl data.
Better would be to either treat this as an initial
crawl by default, or specify in the error message
that an initial crawl must use a non-existing
directory for the output dir.
And changing the param name to "crawldir" from
"outputdir" would also make things clearer.
See Nutch Http.java:
{noformat}
// Set the User Agent in the header
headers.add(new Header("User-Agent", userAgent));
// prefer English
headers.add(new Header("Accept-Language", "en-us,en-gb,en;q=0.7,*;q=0.3"));
// prefer UTF-8
headers.add(new Header("Accept-Charset", "utf-8,ISO-8859-1;q=0.7,*;q=0.7"));
// prefer understandable formats
headers.add(new Header("Accept", "text/html,application/xml;q=0.9,application/xhtml+xml,text/xml;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"));
// accept gzipped content
headers.add(new Header("Accept-Encoding", "x-gzip, gzip, deflate"));
hostConf.getParams().setParameter("http.default-headers", headers);
See [KATTA-86] for similar issue that Stefan
resolved.
Enable linking in Jira, e.g. depends on, informs,
related to.
These should go into docs/licenses/
See [BIXO-4] for a related issue.
Currently we build the tarball (bixo-dist-
.tgz) in the tagged release branch, and
push this to GitHub.
But this bloats the size of the repo (> 500MB
currently) and the push takes a long time, as the
tarball is > 30MB and growing.
A better solution would be to have the dist target
copy the resulting tarball to the Nexus repository,
or to at least manually deploy it there for the
time being.
The doc/releasing.txt procedure document would
need to be updated. It should also include a step
where the Bixo web site gets updated to include a
link to the latest release on Nexus. See
[http://bixo.101tec.com/wp-admin/page.php?action=edit&post=19&message=1]
Web ARChive files are enhanced versions of the
.arc files.
It would be great if crawl results could be
read/written using this format, via a Cascading
scheme.
For more info, see:
* http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
* [The WARC File Format (ISO 28500) - Information, Maintenance, Drafts|http://bibnum.bnf.fr/WARC/]
* [ISO 28500:2009|http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717]
Heritrix has support for WARC 1.0, so we should
Currently we have an IndexScheme class in Bixo
that extends the abstract Cascading Scheme. This
lets us write out Lucene indexes as part of a
Cascading flow, via a tap that uses this scheme.
It would be great to have something similar that
lets us create the Lucene index using a Solr-
defined schema. I believe this involves:
* Embedding Solr.
* Providing a mapping from tuples to Solr fields.
See
[https://issues.apache.org/jira/browse/NUTCH-760] for a similar issue on the Nutch side of the
fence, though we don't need to worry about
search support.
Add option to extract N random URLs from the
compressed dmoz-links.zip file that the user
would need to download from the Nexus
repository.
BIXO-74, BIXO-75 Provide better support for using DMOZ data in
Bixo.
In order for Bixo to get auto-deployed to the
Maven central repository, we have to provide
rsync (preferably over ssh) access to the
repository.
References for how to do this:
* General info: http://maven.apache.org/guides/mini/guide-central-repository-upload.html
* Useful steps, bash script: http://vafer.org/blog/20081026142413
* Maven public key: http://www.ibiblio.org/maven/id_dsa.pub
I know, it's a pain in the butt...sorry about that.
Process described here:
http://maven.apache.org/guides/mini/guide-central-repository-upload.html
BIXO-67, BIXO-69, BIXO-70, BIXO-71, BIXO-72 List of tasks required to make this happen
I want to get Bixo into the Maven central
repository, as that makes it very easy for people
using Ivy or Maven to grab the jar.
To do this, I need to be able to push Bixo to a
repository that I control, and then submit a
request in the Maven Jira system to set up for
auto-syncing from my repository to the Maven
central repository. See
[http://maven.apache.org/guides/mini/guide-central-repository-upload.html] for full details.
Seems like Nexus is a good option for a small,
lightweight repository manager.
* Download here - [http://nexus.sonatype.org/download-nexus.html]
* Documentation here - [http://www.sonatype.com/books/nexus-book/reference/]
Currently the bixo-core jar includes things like
ICU4J, which is only used in the parse pipe. And
the lucene jar is only needed in the index pipe.
By breaking it up, the size of the job jar needed
for fetching gets much smaller.
Don't know if we'll need a bixo-test, currently
not.
We want to switch to a more pure Java solution,
versus all of the Hadoop Bash scripts that don't
provide very good error checking, and aren't very
efficient since they wind up creating a JVM one or
more times for each command, in order to run
code.
It's likely we'd still want to keep around the old
scripts, for things like creating an AMI, but use
new scripts as thin wrappers that run Java code
to create a cluster, proxy it, push Bixo, etc.
TeamCity (continuous integration build server)
wants to send emails when the build is broken.
But the bixo-dev Yahoo mailing list will only
accept emails from the verified email addresses
of registered members.
And since TeamCity is a bot, even creating a
"Team City" user and adding them to bixo-dev
(which I've done) still isn't sufficient, as the email
that Yahoo sends to oss-teamcity@101tec.com
for verification won't get answered.
* Target m1 (small) EC2 instance, so 32-bit FC6?
* Get recent version of Java installed.
* Install LZO support, and re-enable use of the
org.apache.hadoop.io.compress.LzoCodec code
in the list of io.compression.codecs, in the
hadoop-ec2-init-remote.sh script.
* Set thread stack size at boot time.
* Set large max open file limit at boot time (e.g.
65K)
* Make sure noatime is specified.
* Configure & auto-start nscd service
* Configure Hadoop
It would be great to have everything from the
hadoop-ec2-init-remote.sh script pre-configured
in the AMI. This is located in
bin/ec2/hadoop-aws/etc/
It's useful to know when the content has been
truncated, especially for parsers that might not
work properly with truncated (typically binary)
content.
Currently the check for whether the fetch has run
past the target duration is only made when
requesting a FetchList, but once the list has been
handed off to a FetcherRunnable, it runs until
done.
So if a site is really slow (e.g. every URL will time
out), you can get a FetchList that will take a long
time to process.
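One way to picture a fix is to repeat the duration check before every URL inside the list, rather than only when the list is requested. The following is a minimal sketch under that assumption. The class DeadlineAwareFetcher, its Result type, and the injected clock and fetcher are hypothetical names used for illustration, not existing Bixo code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

/** Hypothetical sketch: checks the fetch deadline before each URL in a list. */
final class DeadlineAwareFetcher {
    /** URLs that were fetched versus skipped because time ran out. */
    static final class Result {
        final List<String> fetched = new ArrayList<String>();
        final List<String> skipped = new ArrayList<String>();
    }

    private DeadlineAwareFetcher() {}

    /**
     * Processes the list in order. The clock is injected so that the
     * behavior can be tested without waiting on real time.
     */
    static Result process(List<String> urls, long deadlineMillis,
                          LongSupplier clock, Consumer<String> fetcher) {
        Result result = new Result();
        for (String url : urls) {
            if (clock.getAsLong() >= deadlineMillis) {
                // Past the target duration: do not start another fetch.
                result.skipped.add(url);
                continue;
            }
            fetcher.accept(url);
            result.fetched.add(url);
        }
        return result;
    }
}
```

A fetch that is already in flight still runs to its own timeout, so this only bounds the overrun to roughly one request, not zero.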
We need a script for starting up an EC2 cluster
and deploying Bixo
We need to:
* Re-create problem seen during intentionally
slow fetches from Facebook
* Use HttpRequestRetryHandler. See
[http://hc.apache.org/httpcomponents-client/tutorial/html/fundamentals.html#d4e246]
* Call abort if we don't read the entire result back
And then remove these jars from our lib/
directory.
For jars that we need to keep in our lib/ directory
(not in Maven) we should create an Ivy
dependency xml file.
We might also want to use a different Ivy cache,
to avoid the dreaded "Ivy created a fake
dependency file" problem.
Since we now have a URL normalizer interface as
well as a URL filter, the filter should return a
boolean value, not a string.
Examples of these are in Nutch's
urlnormalizer-regex plugin, or rather the
regex-normalize.xml file (see attachment).
Currently both Eclipse and Ant put their classes in
the same location, which confuses Eclipse
sometimes after an Ant build.
Better would be to have .../build/eclipse as a
directory for all Eclipse classes, and clean-eclipse
should delete this directory.
Use ICU, plus some Nutch code, to create a
post-fetch operation that adds language meta-data.
We need a class that can filter by URL, and by
content-type.
For URL filtering, common approaches are:
* By protocol (no https)
* By domain (only ibm.com, not blogger.com)
* By suffix (no .zip)
* By query (no query strings)
For content-type filtering, it can be
* By entire content type (only text/html or
text/xhtml)
* By main content type (only text/)
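The filtering rules above could be sketched roughly as follows. This is only an illustration: SimpleFetchFilter and its method names are hypothetical, not an existing Bixo class, and the domain check assumes that "only ibm.com" also means sub-domains such as www.ibm.com.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical filter sketch; names are illustrative, not Bixo API. */
final class SimpleFetchFilter {
    private final Set<String> allowedProtocols;  // e.g. "http"
    private final Set<String> allowedDomains;    // empty means "any domain"
    private final Set<String> blockedSuffixes;   // e.g. ".zip"
    private final boolean allowQuery;

    SimpleFetchFilter(Collection<String> protocols, Collection<String> domains,
                      Collection<String> suffixes, boolean allowQuery) {
        this.allowedProtocols = new HashSet<String>(protocols);
        this.allowedDomains = new HashSet<String>(domains);
        this.blockedSuffixes = new HashSet<String>(suffixes);
        this.allowQuery = allowQuery;
    }

    /** Returns true if the URL passes every configured check. */
    boolean acceptUrl(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return false; // unparseable URLs are rejected
        }
        if (uri.getScheme() == null || uri.getHost() == null) {
            return false;
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        String path = uri.getPath() == null ? "" : uri.getPath().toLowerCase(Locale.ROOT);

        if (!allowedProtocols.contains(scheme)) return false;                  // by protocol
        if (!allowedDomains.isEmpty() && !inAllowedDomain(host)) return false; // by domain
        for (String suffix : blockedSuffixes) {                                // by suffix
            if (path.endsWith(suffix)) return false;
        }
        return allowQuery || uri.getRawQuery() == null;                        // by query
    }

    private boolean inAllowedDomain(String host) {
        for (String domain : allowedDomains) {
            if (host.equals(domain) || host.endsWith("." + domain)) return true;
        }
        return false;
    }

    /** Accepts a full type such as "text/html", or anything under a main type such as "text". */
    static boolean acceptContentType(String contentType, Collection<String> fullTypes,
                                     Collection<String> mainTypes) {
        if (contentType == null) return false;
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        int slash = type.indexOf('/');
        String mainType = slash < 0 ? type : type.substring(0, slash);
        return fullTypes.contains(type) || mainTypes.contains(mainType);
    }
}
```

Both checks return a plain boolean, so they can be chained with other filters without any string comparison on the caller's side.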
Something that can be used outside of Bixo if
needed. Normalization includes:
* Adding "http://" as default protocol if URL is
missing this.
* Adding "www." if URL hostname is just a PLD.
* Lower-case everything
* Get rid of default port
* Add trailing slash if no path
* Converting %hh to regular characters, if these
chars are valid in a URL
* Escaping characters that aren't valid in a URL
* Converting '+' to %20
* Get rid of anchors
* Remove trailing '?'
* Clean up paths that use ".." to go up in a
directory
* Clean up paths that have '//'
* Stripping out session ids
We could also try to detect URLs encoded using
8859-1/CP1252 and convert to UTF-8
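A few of the normalization steps listed above can be sketched with java.net.URI alone. This is a partial illustration, not a full normalizer: SimpleUrlNormalizer is a hypothetical name, it lower-cases only the scheme and hostname (an assumption made here because paths can be case-sensitive), and it does not attempt the "www." prefix, %hh decoding, '+' conversion, or session id stripping.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical sketch covering a subset of the normalization steps. */
final class SimpleUrlNormalizer {
    private SimpleUrlNormalizer() {}

    /** Throws IllegalArgumentException if the URL cannot be parsed or has no host. */
    static String normalize(String url) {
        String s = url.trim();
        // Add "http://" as the default protocol if none is given.
        if (!s.contains("://")) {
            s = "http://" + s;
        }
        // URI.normalize() resolves "." and ".." path segments.
        URI uri = URI.create(s).normalize();
        if (uri.getHost() == null) {
            throw new IllegalArgumentException("No hostname in URL: " + url);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // Get rid of the default port.
        int port = uri.getPort();
        boolean isDefaultPort = port == -1
                || (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);

        // Add a trailing slash if there is no path, and clean up '//'.
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        path = path.replaceAll("/{2,}", "/");

        StringBuilder result = new StringBuilder();
        result.append(scheme).append("://").append(host);
        if (!isDefaultPort) {
            result.append(':').append(port);
        }
        result.append(path);
        // An empty query means the URL ended in a bare '?', which is dropped.
        String query = uri.getRawQuery();
        if (query != null && !query.isEmpty()) {
            result.append('?').append(query);
        }
        // The fragment (anchor) is intentionally never appended.
        return result.toString();
    }
}
```

Rebuilding the URL from its parts, instead of editing the string in place, keeps each rule independent and makes it easy to add the remaining steps later.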
Includes porting Nutch's robots.txt parser.
Don't worry about different hostnames ==
different robots.txt for now.
We might want to index the anchor text for a page
as well, and not just the page content. I'm sure
this will greatly improve the search result quality.
Currently we have both src/test-data and
src/test/resources hosting test data.
We should only have one folder that hosts test
data.
The idea of src/test/resources is that we store
config files there, which is why it is in the
classpath. We should not store test data there,
but move it into src/test-data.
See
[http://www.jakobhoman.com/2007/11/quick-tour-of-hadoops-reporter-object.html]
for details.
Examples of things we could/should be reporting:
* Increment a counter for fetching a page
* Increment a counter for queuing up a set of
URLs for a domain
* Increment a counter for each URL being queued
* Update status every 5 seconds or so with
number of active threads, throughput values?
The download script currently doesn't work on
my Mac, due to dependency on wget (which is
not part of a standard Mac install, IIRC).
And this is definitely a longer-running test,
especially with the need to fetch a large amount
of data.
But if the test data is auto-cached, then it's
appropriate to be run as one of the long tests
that automatically get executed on the CI build
machine.
* Note about needing Java 1.6, how to create a
build.sh and run.sh to handle this.
* Note about {{ant clean-eclipse eclipse}} to
create Eclipse project files.
* Note about how to bump memory in Eclipse to
run some of the tests.
* Note about using Ivy to handle dependencies,
and thus lib/ is just a container (now) for
potential jars that will be dynamically added.
While testing a fetch of a bunch of Apache.org
URLs, I noticed that keep-alive wasn't working as
well as I'd expected.
The problem was that xxx.apache.org would
often wind up using a different server, based on
the sub-domain. So we'd lose our keep-alive
connection whenever we switched between
servers.
What I need to do is convert hostnames to IP
addresses for the sub-set of top URLs we're going
to try to fetch, given the PLD/fetch
duration/remaining time constraints. Then use
this to group by same IP, and inside the
FetcherQueue implementation I don't want to
pass back a FetchList that spans IP addresses.
Note that only doing this IP processing for a
sub-set of URLs is going to be more efficient than
what Nutch does, where it converts all potential
URLs to IP addresses first. This places huge load
on the DNS system, and can cause a big
slow-down.
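The grouping step could look something like the sketch below. IpGrouper, groupByIp, and the injected resolver are hypothetical names for illustration. The sketch assumes the input list is already ordered so that the top URLs come first, and it resolves each hostname at most once.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: resolve only the top URLs and group them by IP address. */
final class IpGrouper {
    /** Resolver doing a real DNS lookup; returns null for unknown hosts. */
    static final Function<String, String> DNS = host -> {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    };

    private IpGrouper() {}

    /**
     * Groups the first maxUrls URLs by resolved IP address.
     * URLs beyond maxUrls are never resolved, which limits DNS load.
     */
    static Map<String, List<String>> groupByIp(List<String> urls, int maxUrls,
                                               Function<String, String> resolver) {
        Map<String, List<String>> byIp = new LinkedHashMap<String, List<String>>();
        Map<String, String> cache = new HashMap<String, String>();
        int limit = Math.min(maxUrls, urls.size());
        for (String url : urls.subList(0, limit)) {
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // no hostname, nothing to resolve
            }
            // One lookup per hostname; failed lookups are not cached.
            String ip = cache.computeIfAbsent(host, resolver);
            if (ip == null) {
                continue; // unresolved hosts are skipped in this sketch
            }
            byIp.computeIfAbsent(ip, key -> new ArrayList<String>()).add(url);
        }
        return byIp;
    }
}
```

Each value in the returned map holds URLs for a single IP address, so a FetchList built from one entry never spans servers. Passing IpGrouper.DNS uses real lookups, while a map-backed resolver keeps tests offline.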
This was originally forked from Nutch, and thus
there are classes we don't need, and other
changes we should make, to do a cleaner job of
integrating it into Bixo.
Also I need to fix up the copyright headers
(combo Apache/Bixo).
We need to do a review of all tuples used to pass
information between operations, to ensure
* Consistent naming
* Appropriate information
It would be good to get Chris's input on this as
well.
Currently there are lots of TODO notes in the
fetcher code, for places where settings should be
controlled via some configuration value.
In Nutch this was handled by passing a JobConf
everywhere, which always felt awkward.
We need to decide on our approach and then
implement it.
One idea would be to have a bean we use for
settings, and inject that into the Cascading conf.
Configure it to match fetcher settings.
See Nutch implementation for HttpClient 3.1
handling of this.
Note, though, that HttpClient 4 has a significantly
different (and better) approach to this.
* [http://hc.apache.org/httpcomponents-client/httpclient/apidocs/org/apache/http/impl/client/AbstractHttpClient.html#setHttpRequestRetryHandler(org.apache.http.client.HttpRequestRetryHandler)]
* [http://hc.apache.org/httpcomponents-client/httpclient/apidocs/org/apache/http/client/HttpRequestRetryHandler.html]
Currently it's the default, which is 2. So if
somebody runs with threads-per-server > 2, this
will cause failures.
See
http://marc.info/?l=httpclient-users&m=123925647125933&w=2
See
http://marc.info/?l=httpclient-users&m=123869610506345&w=2
But make sure we can handle connection failures
properly first. Though even with the stale
connection check, a failure could sometimes
happen if the server closes the connection in the
(small) window between the check and the
request.
PS - Oleg says that stale connection checking is
evil :)
For consistency in naming.
This should run all three types of tests (unit,
integration, and long).
We'll have three classes of tests: unit, integration,
and long.
Unit tests should be run as part of the {{ant test}}
command. But the other two tests should be
{{ant test-integ}} and {{ant test-long}} targets.
I don't know how to do this with JUnit 4 (it's easy
with TestNG, using the @Test(groups = "xxx")
annotation).
Idea is that we can write status and content of a
fetch into separate folders to later on process
only the data we need.
Now that we have a crawling simulation toolkit,
we should clean up our tests.
All tests that actually need an internet connection
should go into an integration tests package. Maybe
we should also move tests that run very long into
such a package.
We actually could sort the UrlWithScoreTuples,
since the buffer is a reducer and we could use a
Hadoop secondary sort there, which is actually
supported by Cascading.
Right now the fetcher is a map reduce job,
though for optimal flexibility we actually want to
have a fully Cascading-based fetcher.
I created a localhost server class that can be
easily used to simulate all sorts of webserver
behavior.
It lets you provide an HttpHandler that has full
access to the HTTP headers and response, so it is
easy to simulate anything from content types to
slow servers, etc.
The idea is to simulate web crawls without
hitting real websites.
We need a simulation that allows us to measure
all sorts of things to optimize crawler
performance etc.
We need a test servlet that allows us to simulate
all sorts of things like hanging connections,
redirects etc.
Here we store robots.txt and other general
information for PLDs.
One that works well with the Cascading/Hadoop
model of getting an iterator of URLs for a given
PLD or IP address "key".
Make sure we set flow connector properties in all
flows:
Map properties = getProperties();
FlowConnector.setApplicationJarClass(properties, CascadingClusterTest.class);
FlowConnector flowConnector = new FlowConnector(properties);
Actually URL normalization does not matter at
fetch time, because the server will return the
right URL.
I think it might be important for link analysis,
to make sure that we address the same page even
when the URL is different.
That said, in the context of duplicated pages, web
spam, etc., we should consider using the MD5 of
the page as the key, and not the URL.
So does URL normalization make no sense, or am
I missing something?
We should have the source version as part of the
jar. See katta -version for an example. We need to
extract git information in Ant, though.
git-show might be a good starting point.
I don't think we need the following files in our
conf folder. Should we delete them?
hadoop-default.xml
hadoop-site.xml
masters
slaves
Do we need a license file in the root folder?
What should the header look like? Does it need to
contain license info?
We need the header.
Should we have an Ant task that checks that the
header in *.java files is correct? Had this in
another project, and it was quite handy.
Basic approach is:
# Split hostname of URL at '.'
# if pieces <=2 then return as-is
# If pieces == 4 && isValidIPv4 then return as-is
# If pieces == 6 && isValidIPv6 then return as-is
# if lowercase(last piece) == valid country code
then:
#* If lowercase(second to last piece) == valid
short TLD (e.g. "co") then return last three pieces
#* Else return last two pieces
# Else return last two pieces
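The steps above can be sketched in Java roughly as follows. PaidLevelDomainSketch and getPld are hypothetical names, and the two lookup sets are tiny illustrative samples, not complete lists. The sketch spots IPv6 literals by looking for ':' instead of counting six dot-separated pieces, since IPv6 addresses are written with colons. That is a deviation from the steps as listed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the paid-level domain (PLD) extraction steps. */
final class PaidLevelDomainSketch {
    // Illustrative samples only; real lists would be much longer.
    private static final Set<String> COUNTRY_CODES =
            new HashSet<String>(Arrays.asList("uk", "jp", "au", "de", "fr"));
    private static final Set<String> SHORT_TLDS =
            new HashSet<String>(Arrays.asList("co", "com", "org", "net", "ac", "gov", "edu"));

    private PaidLevelDomainSketch() {}

    static String getPld(String hostname) {
        // Assumption: an IPv6 literal contains ':' and is returned as-is.
        if (hostname.indexOf(':') >= 0) {
            return hostname;
        }
        String[] pieces = hostname.split("\\.");
        int n = pieces.length;
        if (n <= 2) {
            return hostname;
        }
        if (n == 4 && isValidIPv4(pieces)) {
            return hostname;
        }
        String last = pieces[n - 1].toLowerCase(Locale.ROOT);
        String secondToLast = pieces[n - 2].toLowerCase(Locale.ROOT);
        if (COUNTRY_CODES.contains(last) && SHORT_TLDS.contains(secondToLast)) {
            // e.g. "www.example.co.uk" keeps the last three pieces
            return pieces[n - 3] + "." + pieces[n - 2] + "." + pieces[n - 1];
        }
        // Everything else keeps the last two pieces.
        return pieces[n - 2] + "." + pieces[n - 1];
    }

    /** True if every piece is a decimal number in the range 0..255. */
    private static boolean isValidIPv4(String[] pieces) {
        for (String piece : pieces) {
            if (piece.isEmpty() || piece.length() > 3) {
                return false;
            }
            for (char c : piece.toCharArray()) {
                if (c < '0' || c > '9') {
                    return false;
                }
            }
            if (Integer.parseInt(piece) > 255) {
                return false;
            }
        }
        return true;
    }
}
```

With these sample sets, "www.example.co.uk" maps to "example.co.uk", "www.example.com" maps to "example.com", and "192.168.1.1" is returned unchanged.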
Currently bixo has jars in its /lib sub-directory,
and it assumes you have Cascading & Hadoop
project (with appropriate versions) located on
your local disk with build.properties edited to
specify their location.
It would be better to use Ivy to handle jar
dependency management.
I'd lean towards simpler XML at the cost of a
precondition like having the Ivy jar in the Ant
directory, for example.