IQPDF - A better PDF search tool - Ken B Smith, Auckland - June 2011
On the CD supplied you will see a number of files. Copy them all to a sub directory somewhere,
or onto a stick or even leave them on the CD (although that can be a little slow). Navigate to
that directory and run the file IQ.EXE or put a shortcut to it on your desktop. The program is
written so that it requires no installation or registry entries, so it can do no harm to your
system at all, and can be run in the most restrictive mode. This copy is valid until 31 Dec 2011.
This is not to be awkward, but it is a Beta and the file structure will not be maintained into the
final version.
OK, you run IQPDF.EXE and you have a selection of PDFs to play with: the King James Bible and
the Complete Shakespeare, plus some B777 technical manuals. The text box is a free text query.
Type anything you like and see what happens when you click GO.
IQPDF sends its output pages to SumatraPDF - an amazingly fast and lightweight PDF reader. The
pages will stack with the best search result on top. Use more tabs / pages to get a deeper
result.
Searching is all about speed and accuracy. The speed of IQPDF has to be tried to be believed.
The accuracy is down to you. The more words it can find on a page then the more accuracy you
will get. However more words can often bring up pages that you do not want - perhaps. But
that is part of the wonder of IQPDF. You find things and relationships in a search that you never
knew were there.
In practice the word order of your search has meaning. The first word carries more weight than
the last. The order of the words on the page is not important. Unlike conventional PDF searches,
a contiguous word spread is not required. If a word appears anywhere on a page - it will score that page.
The whole thing is quite intuitive - give it a try.
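For the curious, here is a rough sketch in Python of how that kind of scoring could work. It is not the IQPDF code. The function name, the page layout (one set of words per page) and the weights (first query word gets the biggest number, last word gets 1) are all assumptions made for illustration.

```python
def score_pages(query_words, pages):
    """Rank pages by a weighted count of the query words found on each page.

    query_words: list of search words, most important first.
    pages: list of sets of words, one set per page (hypothetical layout).
    Returns a list of (page_number, score), best page first.
    """
    n = len(query_words)
    # Assumed weighting: first word weighs n, last word weighs 1.
    weights = {}
    for i, word in enumerate(query_words):
        weights.setdefault(word, n - i)  # a repeated word keeps its first weight

    scored = []
    for page_no, words in enumerate(pages, start=1):
        # A word scores if it appears anywhere on the page; position on
        # the page and adjacency to other words do not matter.
        score = sum(wt for word, wt in weights.items() if word in words)
        if score > 0:
            scored.append((page_no, score))

    # Highest score first; ties keep page order.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


pages = [{"pale", "rider"}, {"pale", "horse"}, {"horse"}]
print(score_pages(["pale", "horse"], pages))
```

A page holding every query word comes out on top, and pages holding only the later words sink to the bottom of the stack.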
Because IQPDF is so fast - it is able to search many PDFs at once and rank pages from all the
PDFs for display. On average a 20MB - 25000 word PDF will take around 250ms (a quarter of a
second) to search. Believe it or not - that is all. Try it.
On the King James - try "Pale Horse" and a good start for Shakespeare would be "Scotch", or
better "band of brothers". Capitalization is ignored throughout as are hyphens and other
punctuation. Link words and single letters are discarded.
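As an illustration only, the clean-up described above might look something like this in Python. The list of link words is a guess (the real list is not given here), and treating hyphens and punctuation as word separators is also an assumption.

```python
import re

# Hypothetical list of link words; the real list used by IQPDF is not known.
LINK_WORDS = {"a", "an", "and", "the", "of", "to", "in", "or", "it", "is"}


def normalize_query(text):
    """Turn free text into a list of search words."""
    text = text.lower()                       # capitalization is ignored
    text = re.sub(r"[^a-z0-9]+", " ", text)   # hyphens and punctuation become spaces
    # Drop single letters and link words.
    return [w for w in text.split() if len(w) > 1 and w not in LINK_WORDS]


print(normalize_query("Band of Brothers"))
print(normalize_query("Pale-Horse!"))
```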
Whatever you search for - finding all your words on one page will get that page on the first PDF
page displayed, and as you move down the page stack, the word count may be less. A single
occurrence of a word in a document will get you just one page.
Searching a long and complex document takes a little thought or plain luck to get exactly what
you need first try. The progressive ranking in tabs helps no end. Use more pages / tabs to get
more depth of search.
The King James has 531431 words in 2444 pages and Shakespeare is showing 731211 words in
3066 pages. A search of multiple words in Shakespeare, such as "scotch the snake not killed it" will
take around 2 seconds depending upon your PC speed. The FOXIT tabbed output takes longer
than that to deploy, particularly as I have a 500ms delay on the tab creation for each page to
allow FOXIT to keep up with my program.
If you forget to close the FOXIT reader between searches, it will just create more tabs for
subsequent results. This is deliberate and useful. Likewise the "All Files" option will scan all files
in the current directory and combine results. This is useful for our work, but hardly likely to
produce sensible results on the mix of PDFs provided.
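Combining results from several files can be pictured as below. This is a sketch under assumptions, not the actual "All Files" code: the file names, the (page, score) pairs and the function name are invented for the example.

```python
def combine_results(per_file_results, limit=10):
    """Merge per-file page rankings into one ranked list.

    per_file_results: dict mapping a file name to a list of
    (page_number, score) pairs for that file (hypothetical layout).
    Returns up to `limit` tuples of (file_name, page_number, score).
    """
    merged = []
    for filename, hits in per_file_results.items():
        for page_no, score in hits:
            merged.append((filename, page_no, score))
    # Best score first; ties fall back to file name, then page number.
    merged.sort(key=lambda hit: (-hit[2], hit[0], hit[1]))
    return merged[:limit]


results = {
    "kjv.pdf": [(10, 3), (4, 1)],
    "shakespeare.pdf": [(7, 2)],
}
print(combine_results(results, limit=2))
```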
How does it work? Well, searching a conventional PDF is just too slow and awkward. Even trying
to work inside a PDF in real time is problematic. What IQ does is use a type of 3D binary
concordance (created by another module) and you can see this in the IDX files. The supplied
index files are effectively final format except that in this version I have left the keywords in plain
text to enable you to see some of the structure. I have been using this version to debug and
perform speed tests. Later versions use a full binary index and the ability to see these words is
removed as everything is in binary format.
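To give a feel for what a concordance is, here is a much simplified Python sketch: a plain word-to-pages lookup built once, then queried without touching the PDF. It is not the IDX format and says nothing about the 3D binary layout; the function names and the in-memory dictionary are assumptions for illustration.

```python
from collections import defaultdict


def build_concordance(pages):
    """Build a word -> set of page numbers lookup.

    pages: list of page texts, one string per page (pages numbered from 1).
    """
    index = defaultdict(set)
    for page_no, text in enumerate(pages, start=1):
        for word in text.lower().split():
            index[word].add(page_no)
    return index


def pages_containing(index, word):
    """Return the sorted page numbers on which the word appears."""
    return sorted(index.get(word.lower(), ()))


pages = ["behold a pale horse", "a horse a horse my kingdom"]
index = build_concordance(pages)
print(pages_containing(index, "horse"))
print(pages_containing(index, "Pale"))
```

The point of the design is that the slow work (reading the PDF text) is done once when the index is built, so each search is only a handful of lookups.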
The KJ Bible contains 13650 discrete words (that I found) and the Works of Shakespeare run to
24955 individual words, although there are a few double words in there due to typesetting faults.
This is interesting on several levels, not least of which is the sheer depth of Bill's vocabulary.
Enjoy
Ken