Searching the Web for Source Code
Raphael Hoffmann
Daniel Weld
October 30th, 2006
Why Search for Code?
How can I parse an XML document?
What are developers searching for?
“Most of the code is open source so you can reuse it.
But I don’t think that’s the primary use – it’s more
about how to learn about things and when you’re
building open-source packages, to make sure
you’re doing it the right way.”
Tom Stocky, Product Manager
Google Code Search
What are developers searching for?
What developers using the MSN search engine are interested in finding
Why not use web search engines?
Query: parse xml java
• Only compatible with new Java versions
• Requires installation of an external library, but provides no link
• Code on pages essentially the same
• Contains no code examples
Code Search Engines
• Index source code of open-source projects (from compressed archive files and CVS repositories)
• Code is parsed, and terms in type names, variable names, etc. are weighted differently.
import javax.xml.parsers.*;
import org.w3c.dom.*;

public class JAXPSample {
    public static void main(String[] args) {
        String filename = "sample.xml";
        try {
            // Obtain a DOM parser from the JAXP factory.
            DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();
            DocumentBuilder parser =
                factory.newDocumentBuilder();
            // Parse the file into an in-memory DOM tree.
            Document d = parser.parse(filename);
        } catch (Exception e) {
            System.err.println("Exception: " + e.getMessage());
        }
    }
}
Why not use code search engines?
Query: parse xml java
• Irrelevant (an Emacs Lisp file!?!)
• Code is complicated, contains no comments related to the query, and is more than 300(!) lines long
• Requires installation of an external library, but provides no link
• Code on pages essentially the same
Analysis
• Web search engines use structure on web pages, but ignore structure in code.
• Code search engines use structure in code, but ignore structure on web pages.
Our solution
• A novel hybrid search engine
• Index code snippets found on web pages
• Link them to required libraries and documentation
Assieme
[Screenshot of Assieme results with callouts: links to required libraries; links to pages with snippets; pages with similar snippets grouped together]
Assieme
• High precision
• Results contain documentation and simple examples
• Dependencies in code are shown along with links to download required libraries
• Page summaries show relevant packages/types being used
• Pages with similar snippets grouped together
Assieme Operation
Preprocessing
1. Crawling web pages and libraries
2. Indexing libraries
3. Extracting snippets from web pages
4. Finding the code referenced in snippets
Query Time
• Handling Queries
Assieme Operation
Crawling web pages and libraries (preprocessing)
• Web pages: tutorial and technical article sites, e.g. IBM DeveloperWorks, and pages with keywords
• Libraries: open-source sites, e.g. Sourceforge.net, Apache.org
Assieme Operation
Indexing libraries (preprocessing)
A language parser reads each library (here mailapi.jar) and writes its contents to the library index:
• packages: javax.mail.search, javax.mail.event, ...
• types: javax.mail.SecuritySupport12, javax.mail.Address, ...
• methods / fields: javax.mail.Flags.add(...), javax.mail.add(...), ...
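As a rough illustration of the library index, the following sketch derives package and type names from the entry paths found inside a JAR archive. It is only a minimal sketch under stated assumptions: the class name LibraryIndexSketch and its fields are hypothetical, the entry paths are passed in as plain strings rather than read from a real archive, and methods and fields (which would need a bytecode or source parser) are left out.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a library index; not Assieme's actual implementation.
class LibraryIndexSketch {
    final Set<String> packages = new TreeSet<>();
    final Set<String> types = new TreeSet<>();

    // Adds one archive entry path such as "javax/mail/Address.class".
    void addEntry(String entryName) {
        if (!entryName.endsWith(".class")) {
            return; // skip manifests, resources, and directories
        }
        // Turn the path into a qualified name: javax/mail/Address.class -> javax.mail.Address
        String qualified = entryName
            .substring(0, entryName.length() - ".class".length())
            .replace('/', '.');
        int lastDot = qualified.lastIndexOf('.');
        if (lastDot > 0) {
            packages.add(qualified.substring(0, lastDot));
        }
        types.add(qualified);
    }

    public static void main(String[] args) {
        LibraryIndexSketch index = new LibraryIndexSketch();
        List<String> entries = List.of(
            "javax/mail/Address.class",
            "javax/mail/search/SearchTerm.class",
            "META-INF/MANIFEST.MF");
        for (String entry : entries) {
            index.addEntry(entry);
        }
        System.out.println("packages: " + index.packages);
        System.out.println("types: " + index.types);
    }
}
```

In practice the entry names could come from java.util.jar.JarFile, and nested classes (entries containing '$') would probably need separate handling.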
Assieme Operation
Extracting snippets from web pages (preprocessing)
A classifier decides whether a region of a page is a snippet or text. It uses:
• HTML structure
• Terms/n-gram statistics
• Language parsers
Example page source:
<a name="listing1">
<b>Listing 1. MBean interface for a Web
server</b></a><br />
<table width="100%" cellpadding="0"
cellspacing="0" border="0">
<tr>
<td class="code-outline">
<pre class="displaycode">
public interface WebServerMBean {
public int getPort();
public String getLogLevel();
public void setLogLevel(String level);
public boolean isStarted();
public void stop();
public void start();
}
</pre>
</td>
</tr>
</table><br /> <p> Implementing the MBean class is
usually fairly straightforward, as the MBean
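The snippet extraction step could be approximated as follows. This is a minimal sketch, not the classifier from the talk: it uses only one HTML-structure cue (the contents of pre elements) and a crude line-ending heuristic in place of term/n-gram statistics and language parsers. The class and method names are hypothetical, and HTML entities are not decoded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: pull code-like <pre> blocks out of an HTML page.
class SnippetExtractorSketch {
    // Matches the body of each <pre ...>...</pre> element, across lines, ignoring case.
    private static final Pattern PRE = Pattern.compile(
        "<pre\\b[^>]*>(.*?)</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the trimmed bodies of all <pre> blocks that look like code.
    static List<String> extract(String html) {
        List<String> snippets = new ArrayList<>();
        Matcher m = PRE.matcher(html);
        while (m.find()) {
            String body = m.group(1).trim();
            if (looksLikeCode(body)) {
                snippets.add(body);
            }
        }
        return snippets;
    }

    // Crude stand-in for a classifier: at least half of the lines
    // must end in ';', '{' or '}'.
    static boolean looksLikeCode(String text) {
        String[] lines = text.split("\\R");
        int codeLike = 0;
        for (String line : lines) {
            String t = line.trim();
            if (t.endsWith(";") || t.endsWith("{") || t.endsWith("}")) {
                codeLike++;
            }
        }
        return codeLike * 2 >= lines.length;
    }

    public static void main(String[] args) {
        String html = "<p>Intro</p><pre class=\"displaycode\">\n"
            + "public interface WebServerMBean {\n"
            + "public int getPort();\n"
            + "}\n</pre><p>More text</p>";
        for (String snippet : extract(html)) {
            System.out.println(snippet);
        }
    }
}
```

A regular expression is enough for a sketch, but a real crawler would more likely use an HTML parser, since real pages contain nested and malformed markup.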
Assieme Operation
Finding the code referenced in snippets (preprocessing)
A language parser processes each snippet and matches it against the library index to find the referenced libraries, referenced types, and referenced methods and fields.
public class WebServer implements WebServerMBean
{ ... }
...
WebServer ws = new WebServer(...);
MBeanServer server =
    ManagementFactory.getPlatformMBeanServer();
server.registerMBean
    (ws, new ObjectName
    ("myapp:type=webserver,name=Port 8080"));
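Resolving snippet references against a library index might look roughly like this. The sketch assumes a hypothetical index that maps simple type names to fully qualified names, and it replaces the language parser with a regular expression that treats capitalized identifiers as candidate type names. A word inside a string literal (such as "Port" above) is also a candidate, so a real parser would be more precise.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: look up type names used in a snippet in a library index.
class ReferenceResolverSketch {
    // Whole identifiers that start with an upper-case letter.
    private static final Pattern TYPE_NAME =
        Pattern.compile("\\b[A-Z][A-Za-z0-9_]*\\b");

    // libraryIndex maps a simple type name to its fully qualified name.
    static Set<String> referencedTypes(String snippet, Map<String, String> libraryIndex) {
        Set<String> found = new TreeSet<>();
        Matcher m = TYPE_NAME.matcher(snippet);
        while (m.find()) {
            String qualified = libraryIndex.get(m.group());
            if (qualified != null) {
                found.add(qualified); // candidates not in the index are ignored
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, String> index = Map.of(
            "MBeanServer", "javax.management.MBeanServer",
            "ObjectName", "javax.management.ObjectName",
            "ManagementFactory", "java.lang.management.ManagementFactory");
        String snippet = "MBeanServer server = ManagementFactory.getPlatformMBeanServer();";
        System.out.println(referencedTypes(snippet, index));
    }
}
```

Mapping simple names to a single qualified name ignores name clashes between libraries; import statements in the snippet, when present, could help disambiguate.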
Assieme Operation
Handling Queries (query time)
Query → TF/IDF scoring → Cluster based on ref. code → add links to libraries
TF/IDF scoring uses weights for terms in text, url, title, used packages, types, methods, …
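Field-weighted TF/IDF scoring of this kind can be sketched as below. The weights, the field names chosen, and the smoothed IDF formula are all assumptions made for illustration; the talk does not give the actual values or formula. A document is modeled as a map from field name to its list of terms.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of TF/IDF scoring with per-field weights.
class WeightedTfIdfSketch {
    // Assumed weights: a match in the used types counts more than one in plain text.
    static final Map<String, Double> FIELD_WEIGHTS =
        Map.of("text", 1.0, "title", 2.0, "types", 3.0);

    // Sums weight * term frequency * IDF over all query terms and document fields.
    static double score(List<String> query,
                        Map<String, List<String>> doc,
                        List<Map<String, List<String>>> corpus) {
        double total = 0.0;
        for (String term : query) {
            double idf = idf(term, corpus);
            for (Map.Entry<String, List<String>> field : doc.entrySet()) {
                double weight = FIELD_WEIGHTS.getOrDefault(field.getKey(), 1.0);
                long tf = field.getValue().stream().filter(term::equals).count();
                total += weight * tf * idf;
            }
        }
        return total;
    }

    // Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
    static double idf(String term, List<Map<String, List<String>>> corpus) {
        long df = corpus.stream()
            .filter(d -> d.values().stream().anyMatch(terms -> terms.contains(term)))
            .count();
        return Math.log((1.0 + corpus.size()) / (1.0 + df)) + 1.0;
    }

    public static void main(String[] args) {
        Map<String, List<String>> pageA = Map.of(
            "text", List.of("tutorial"), "types", List.of("xml", "parser"));
        Map<String, List<String>> pageB = Map.of(
            "text", List.of("xml", "news"), "types", List.of());
        List<Map<String, List<String>>> corpus = List.of(pageA, pageB);
        List<String> query = List.of("xml");
        System.out.println("pageA: " + score(query, pageA, corpus));
        System.out.println("pageB: " + score(query, pageB, corpus));
    }
}
```

With these assumed weights, a page whose snippet uses a type matching the query term ranks above a page that only mentions the term in its text, which is the effect the per-field weights are meant to produce.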
Conclusion & Outlook
• To make code search work well, we need to exploit both structure on web pages and structure in code.
• Next, we want to better visualize and support library version information.
• Also, we are interested in linking troubleshooting information on the web, such as error messages, to code.