OneRiot Inner Workings of a Realtime Search Engine

Shared by: Courtney Walsh
Posted: 11/20/2009
The Inner Workings of a Realtime Search Engine:

Thoughts on realtime search, by the team at OneRiot

A OneRiot White Paper



First published June 2009. Updated November 2009.



For additional information, contact: Tobias Peggs, tobias@oneriot.com, twitter.com/tobiaspeggs



Users Want Realtime Search

Across all the major search engines, industry numbers indicate that 40% of users are performing search queries which display an intent that is best satisfied by realtime search results(1). Irrespective of industry numbers, in 2009, Iran – the election, the uprising and the search query – proved beyond doubt that there is huge demand for search results from the realtime web(2).

The question on everybody’s lips is: “What’s going on right now?” To answer that question, users need to find the news, stories and videos with the most social relevance right now. Realtime search results meet that need.

Every day hundreds of millions of search engine users(3) type something as heavyweight as “Obama,” or as entertaining as “Britney,” into a search box and expect to find out what’s going on right now for that topic. These types of searches are commonly called “Browse” searches, as people are “browsing” for information. They don’t have a particular URL in mind. They just want to know what’s going on right now – the source of information being less important than the information itself. Users performing Browse searches are best satisfied by search results from the realtime web.

Of the remaining 60% of searches, 20% are “Navigation” searches and 40% are specific “Informative” searches(1). An example of a Navigation search is when a user is trying to get to Sony.com, or Yahoo.com. They will enter a search query in an attempt to find a recognized home page. An example of an Informative search is when a user is trying to find a specific recipe for Cabbage Soup that is definitely “out there somewhere.” They enter a query in an attempt to find that specific information.

[Figure: search queries by type: Realtime Search, Specific Information Search, Navigational Search]

The best traditional search engines are very good at finding Navigation search results and specific information. The best realtime web search engines are very good at finding Browse search results – addressing fully 40% of the market. With 1% of the search market worth $1bn per year(4), 40% is a huge target to go after.



Traditional Search - A Broad Overview

Traditional search engines treat the web like a library. Web pages are crawled, and the content is stored in an index for efficient retrieval of information. Those web pages also build up a “Rank” over time (e.g. Google’s “PageRank”(5)). Pages with the highest Rank percolate to the top of the results.

A page’s Rank is constructed from many factors, but one of the most important is citation importance(5) – broadly, the number of inbound links to that web page. This approach tends to favor highly referenced resources like Wikipedia. For example, search for “Britney Spears” on a traditional search engine and the top result is likely to be a Wikipedia page. This approach produces dependable results, but results that are not necessarily reflective of why the user would be searching for Britney at any particular time (i.e. to find out what’s going on right now).

Additionally, a page’s Rank is relatively static. It changes periodically, but not at a pace to keep up with the realtime world of changing interests in a topic(6). A page with high Rank might have been tremendously relevant yesterday, but may not be tomorrow. A traditional search engine is only able to return yesterday’s relevant result. The implication is that traditional search engines will struggle to surface the hyper-fresh and socially relevant “realtime” results that satisfy users performing Browse searches.

OneRiot, a realtime search engine, is focused exclusively on solving that problem and addressing that 40% of the market. To do that, we have had to invent:

• New ways to index the web: by harnessing the power of the realtime social web.
• New ways to rank the content in that index: at search time, to deliver the most socially relevant result right now.

We will now consider each of these two innovations in turn.



New Ways to Index the Realtime Web

Traditional search engines crawl the web by methodically following links between billions of pages, then indexing the content on those pages(7). Broadly, they consider the link to be a signal to an important piece of content. OneRiot, in contrast, considers realtime activity on the social web when determining which pages to index. We consider the links people are tweeting, or digging, or sharing on other services, as a signal to an important piece of content.

In the last two years there has been an explosion in the number of links being shared across the realtime social web. This in part has been driven by the phenomenal growth of Twitter(8). But the realtime web is much wider than Twitter. Services like Digg and Delicious – whose user communities provide a wealth of explicit “social signals” to important pieces of content – also continue to grow(9). Meanwhile, the rise of sharing services like Shareaholic(10) has promoted additional realtime sharing of content on the web. URL shorteners like Bit.ly and TinyURL make this even easier for users. Facebook and other social networks make it easy to share links across users’ social graphs. Some of this information is publicly available, some is not(11). But there is a plethora of tools and services that have made sharing of links commonplace(12) among nearly 200 million US users of the internet and more than a billion internationally(13).

At OneRiot we aggregate that realtime activity across the social web, considering the links people are sharing right now. We then crawl the pages those links point to and index the content on those pages – and we do it fast. Currently we index the content of the page and make it ready to search in less than 0.8 seconds.

It’s a completely new way to index the web. Effectively, users of the social web are helping OneRiot curate the search index as they Tweet, Digg or share links on other services. Those pages inherently have social buzz, and implicitly reflect “what’s going on right now” for their subject matter. Meanwhile, we provide the infrastructure to keep it all up to date in “realtime.”
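To make the idea of share-driven indexing concrete, here is a minimal sketch in Python of how a stream of shared links might be turned into a list of pages to crawl. It is an illustration under stated assumptions, not a description of OneRiot's actual pipeline: the `ShareEvent` record, the `normalize` helper, the ten-minute window and the `min_sharers` threshold are all hypothetical names and values chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit


@dataclass(frozen=True)
class ShareEvent:
    """One link shared on some service (hypothetical record layout)."""
    url: str
    service: str      # e.g. "twitter" or "digg"
    user: str
    timestamp: float  # seconds since some epoch


def normalize(url):
    """Collapse trivially different spellings of the same URL."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, drop the fragment, keep the query string.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def crawl_candidates(events, now, window=600.0, min_sharers=2):
    """Return URLs shared by at least `min_sharers` distinct accounts
    within the last `window` seconds, most widely shared first."""
    sharers = defaultdict(set)
    for ev in events:
        if now - ev.timestamp <= window:
            sharers[normalize(ev.url)].add((ev.service, ev.user))
    ranked = sorted(sharers.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [url for url, who in ranked if len(who) >= min_sharers]
```

Counting distinct sharers rather than raw share events is a deliberate choice in this sketch: one account posting the same link repeatedly does not make the page look more important. A production system would also need to expand shortened URLs before normalizing, which is omitted here.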




In addition, OneRiot draws upon its own panel of users to help determine which web pages will be indexed. Similar to Compete.com(14) or other internet measurement services, OneRiot manages a significant panel of users (over 3 million have joined at the time of writing) who have opted in to pass back anonymous data about what pages are important to them as they surf the web. This aggregation of data from our own panel alongside realtime sharing activity on services like Twitter and Digg helps create a huge realtime index of the web.

While the volume of shared links on Twitter is exploding, those links account for only a fraction of the web pages in our index. This is important. There is no doubt that Twitter provides a tremendously valuable stream of data for us, but Harvard Business Review recently reported(15) that 10% of Twitter users create 90% of the content. If a search index is exclusively based on tweets, its results will be heavily biased towards the social activity of that subset of power users.

So OneRiot’s search index is constantly being updated with the web pages that are generating social buzz across the whole web right now, not just on one service. We index hyper-fresh, socially relevant pages: pages that perhaps haven’t been published long enough to start building up a traditional Rank in Google. In other words, our index is full of potential results for that 40% of users performing Browse queries. When the user wants to know “what’s going on right now,” we’ve got the pages indexed to help answer that question – powered by the social web, and a lot of realtime infrastructure.

Naturally there are some challenges to creating a realtime index of the social web. Chief among them is spam. Indeed, many observers think there’s a tsunami of spam heading for the social web – especially in the realtime conversations that often act as the platform for link sharing(16).

Undoubtedly, there is tremendous value in following the stream of realtime conversation on services like Twitter. But, at the same time, a user can simply tweet something like “Obama is awesome” followed by a link to a porn movie, and see that link show up in search results for “Obama” on any search engine that only indexes tweets.

At OneRiot, we’ve chosen to index the content behind the link – whether that link has been tweeted or dugg, or shared elsewhere. So our search results focus on the content that the social web is buzzing about, in addition to the conversation it is having. In the Obama example above, our crawler would go to the page behind the link that was tweeted, then we’d index and categorize the content on that page. A search on OneRiot for “Obama” would not return that porn movie. Our index is realtime, but also reliable.
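The difference between indexing the tweet and indexing the page behind the link can be shown with a toy inverted index. The Python sketch below is only an illustration of the principle: the `ContentIndex` class, the `index_shares` helper and the `fetch` callback are hypothetical names invented for this example, and the "categorize" step mentioned above is not modeled.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ContentIndex:
    """Tiny in-memory inverted index: term -> set of URLs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_page(self, url, page_text):
        # Only the fetched page content is indexed. The text of the
        # tweet or share that pointed here is deliberately ignored.
        for term in tokenize(page_text):
            self._postings[term].add(url)

    def search(self, query):
        """Return URLs whose page content contains every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        hits = [self._postings.get(t, set()) for t in terms]
        return set.intersection(*hits)


def index_shares(index, shares, fetch):
    """Index the page behind each shared link.

    `shares` is an iterable of (share_text, url) pairs and `fetch`
    maps a URL to the text of the page (a stand-in for a crawler).
    """
    for _share_text, url in shares:
        index.add_page(url, fetch(url))
```

With this arrangement, a share whose text says “Obama is awesome” but whose link leads to unrelated content never becomes a match for “Obama”, because the misleading share text is never added to the index.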



New Ways to Rank the Realtime Web – PulseRank

Now that you’ve got a realtime index of the web, full of content that users on the social web are buzzing about right now, how do you rank the pages within it? When you search, what results should be retrieved from the index and placed at the top of the search results page? In other words, what are the news, stories and videos with the most social relevance right now?

Firstly, being a realtime search engine, OneRiot ranks its results at search-time. That’s key. Realtime search results need to be ordered based on social relevance right now, not sometime in the past.

Secondly, we have invented a new ranking algorithm – PulseRank – to drive the realtime ordering of our search results. Think of PulseRank as PageRank for the realtime web. If PageRank reflects historical dependability of a web page, then PulseRank reflects its current social buzz. PulseRank is the ranking algorithm for the 40% of searches that traditional search engines struggle with today.

Our PulseRank algorithm looks at dozens of factors that give “weight” to certain results in realtime. These include, but are not limited to:

• Freshness: A story published two minutes ago is probably more interesting than one published two weeks ago, if the user is performing a Browse search. But the ranking algorithm also accounts for the fact that the most recently published content is not necessarily the most relevant. The realtime stream – aka the firehose – can be noisy and filled with spam(16).
• Domain Authority: Just because I’ve published a post on my own personal blog about Obama, should that be weighted more highly than a post from, say, The New York Times, on the same subject published at the same time? PulseRank considers factors like the number of links being shared from a particular domain right now, and in realtime increases the weight for links from currently popular domains.
• People Authority: PulseRank considers who shared the link on the social web. Known spammers tend to pummel their social graph with the same link many times a day. Links shared in this manner will get a lower weight in our system. More thoughtful social web users share links that tend to get retweeted and heavily dugg. Those links get a higher weight. A particular user’s score also updates in realtime in our system to account for fluctuations in their perceived authority.
• Acceleration: PulseRank considers whether a link is increasing in hotness or decreasing in hotness. For example, we assess whether more people are sharing the link right now than were sharing it 2 minutes ago. The algorithm is weighted to favor “emerging” web pages rather than popular ones that everyone already knows about.

These are just four of dozens of factors that combine at search-time to calculate a page’s PulseRank, which determines where the link sits on our search results page. The end result to the user is better search results. In short, the most socially relevant content on the web right now, related to your search query, should be the top result.
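The four factors named above can be combined in many ways, and the paper does not publish the PulseRank formula. The Python sketch below shows one plausible shape for such a search-time score, purely as an illustration: the `LinkStats` fields, the one-hour freshness half-life, the logistic squashing of acceleration and the weights are all assumptions made for this example, not OneRiot's values.

```python
import math
from dataclasses import dataclass


@dataclass
class LinkStats:
    """Signals for one indexed page (hypothetical field names)."""
    age_seconds: float       # time since the page was published
    domain_authority: float  # 0..1, current popularity of the domain
    sharer_authority: float  # 0..1, average authority of the sharers
    shares_now: int          # shares in the most recent window
    shares_before: int       # shares in the window before that


def freshness(age_seconds, half_life=3600.0):
    """Exponential decay: 1.0 when just published, 0.5 after one half-life."""
    return 0.5 ** (max(age_seconds, 0.0) / half_life)


def acceleration(shares_now, shares_before):
    """Relative change in sharing rate, squashed into the range (0, 1)."""
    growth = (shares_now - shares_before) / (shares_before + 1)
    return 1.0 / (1.0 + math.exp(-growth))


# Assumed weights for illustration; a real system would tune these.
WEIGHTS = {"freshness": 0.3, "domain": 0.2, "people": 0.2, "acceleration": 0.3}


def score(s):
    """Weighted sum of the four signals, computed at query time."""
    return (
        WEIGHTS["freshness"] * freshness(s.age_seconds)
        + WEIGHTS["domain"] * s.domain_authority
        + WEIGHTS["people"] * s.sharer_authority
        + WEIGHTS["acceleration"] * acceleration(s.shares_now, s.shares_before)
    )


def rank(candidates):
    """Order (url, LinkStats) pairs by score, best first."""
    ordered = sorted(candidates, key=lambda c: score(c[1]), reverse=True)
    return [url for url, _stats in ordered]
```

Because `score` is evaluated when the query arrives rather than when the page is indexed, the same page can move up or down between two searches a few minutes apart, which is the behavior the section describes. A weighted sum is the simplest choice; it also means a link pushed by low-authority sharers is penalized but not excluded.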



The Future: Monetizing Realtime Web Search

Clearly, contextual ads against search results pages are a proven model. This approach definitely has its place in realtime search. However, because search results from the realtime web keep updating, our internal studies have shown that users search many more times per day per query with OneRiot than they do on a traditional engine – because they want to stay on top of the latest buzz. That alone offers many more opportunities to monetize the same user using this well understood model.

Our belief, however, is that new realtime monetization models are needed to fully optimize yields while enhancing the experience for end-users and advertisers. This is why we have built RiotWise – a system specifically created to monetize the realtime web.

RiotWise is a way for publishers of quality content to reach users of realtime web applications – those featuring search as well as other forms of realtime data (such as streams). RiotWise is also a way for developers of those applications to monetize in a manner that adds direct value to their users.

The service is grounded on the premise that users of realtime web apps are trying to find out what’s going on right now for a particular topic. It is built on OneRiot’s realtime search technology. RiotWise serves up links to web pages from a content network that helps users find out what’s going on right now. It directly matches the user’s intent – in other words, it adds value to the users’ experience; it helps them do what they are trying to do.

In its initial rollout, we’ve seen high Click Through Rates (CTRs) on RiotWise. Additionally, the click-throughs are highly qualified and well targeted. The user clicking is trying to find out what’s going on right now for a particular topic, and they are being served links to content from quality publishers that exactly addresses that topic. This provides a great experience for the user, and a great audience for the content provider.

RiotWise is built from the ground up to be distributed across the realtime web. Any third party application builder can pull a relevant content feed from the OneRiot API, deliver it to users in realtime, and monetize with fresh, high quality, relevant content. These features align RiotWise with the needs of realtime web application users, allowing partners to monetize by serving content that adds value to their overall service.

Realtime search, and monetization of the realtime web, is just beginning. We are excited about the road ahead.




Delivering Realtime Web Search Results at Scale

Clearly, delivering search results at speed and scale is critical. Every new data stream that the system ingests adds a layer of complexity at scale. We’ve built some fantastic technology to be able to deal with that, including a highly optimized in-memory index to support super-fast indexing and retrieval of search results(17). Our technology also includes a robust search API that is powering many partners, helping to deliver realtime search results to their users. And Microsoft recently released a new version of Internet Explorer 8 bundled with OneRiot search(18). Being able to deliver realtime web search results at scale is key – we owe it to our users and our partners.






References

1. Analysis modeling user behavior in 2009, extrapolating from conclusions published in “A Taxonomy of Web Search” by Andrei Broder, 2002, http://www.sigir.org/forum/F2002/broder.pdf
2. “With Help from Media, Twitter Overtakes Google as Leading Search Engine for ‘Iran+Election’” by Alex Patriquin, 2009, http://blog.compete.com/2009/06/25/iran-election-twitter-google-search/
3. “comScore Releases May 2009 U.S. Search Engine Rankings” by Sarah Radwanick, 2009, http://comscore.com/Press_Events/Press_Releases/2009/6/comScore_Releases_May_2009_U.S._Search_Engine_Rankings
4. “Why 1% of search market share is worth over $1 Billion” by Don Dodge, 2007, http://dondodge.typepad.com/the_next_big_thing/2007/05/why_1_of_search.html
5. “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Sergey Brin and Lawrence Page, 1998, http://infolab.stanford.edu/~backrub/google.html
6. “Beyond PageRank: Machine Learning for Static Ranking” by Matthew Richardson, Amit Prakash and Eric Brill, 2006, http://research.microsoft.com/pubs/68149/staticrank.pdf
7. “Web crawler” published on Wikipedia, http://en.wikipedia.org/wiki/Web_crawler
8. “Twitter Now Growing at a Staggering 1,382 Percent” by Adam Ostrow, 2009, http://mashable.com/2009/03/16/twitter-growth-rate-versus-facebook/
9. “Twitter Surges Past Digg, LinkedIn, And NYTimes.com With 32 Million Global Visitors” by Erick Schonfeld, 2009, http://www.techcrunch.com/2009/05/20/twitter-surgespast-digg-linkedin-and-nytimescom-with-32-million-global-visitors/
10. “Shareaholic for Firefox crossed 1 million downloads today!” by The Shareaholic Team, 2009, http://www.facebook.com/shareaholic
11. “The Day Facebook Changed Forever: Messages to Become Public By Default” by Marshall Kirkpatrick, 2009, http://www.readwriteweb.com/archives/the_day_facebook_changed_messages_to_become_pulic.php
12. “All Hail the Blabbermouth: Gossip and the Realtime Web” by Carmel Hagen, 2009, http://blog.oneriot.com/content/2009/06/comprehensive-share-indexing-the-realtimeweb/
13. “ComScore: Internet Population Passes One Billion; Top 15 Countries” by Erick Schonfeld, 2009, http://www.techcrunch.com/2009/01/23/comscore-internet-populationpasses-one-billion-top-15-countries/
14. “Where does Compete’s data come from?” Compete.com corporate web site, http://www.compete.com/resources/methodology/
15. “New Twitter Research: Men Follow Men and Nobody Tweets” by Bill Heil and Mikolaj Piskorski, 2009, http://blogs.harvardbusiness.org/cs/2009/06/new_twitter_research_men_follo.html
16. “Twitter’s Real Time Spam Problem” by Danny Sullivan, 2009, http://searchengineland.com/twitters-real-time-spam-problem-20614
17. “The Technology Platform for Realtime Search” by Alessio Signorini, forthcoming
18. “Microsoft Launching Real-Time-Focused IE8 Bundled with OneRiot Search” by Jolie O’Dell, 2009, http://www.readwriteweb.com/archives/microsoft_launching_real_timefocused_ie8_bundled.php


