					                   Google Correlate Tutorial
                   Google Correlate is an experimental new tool on Google Labs which lets you use the same
                   methodology and data as Google Flu Trends.

                   What is Google Correlate?

                   Google Correlate is like Google Trends in reverse. With Google Trends, you type in a query and
                   get back a series of its frequency (over time, or in each US state). With Google Correlate, you
                   enter a data series (the target) and get back queries whose frequency follows a similar pattern.

                    Google Trends:     “mittens” → [frequency series]
                    Google Correlate:  [data series] → “mittens”

                   Correlated Queries

                   When you upload a data set (a time series, for instance), Google Correlate will compute the
                   Pearson Correlation Coefficient (r) between your time series and the frequency time series for
                   every query in our database. Correlation coefficients range from r=-1.0 to r=+1.0. The queries
                   that Google Correlate shows you are the ones with the highest correlation coefficient (i.e. closest
                   to r=1.0).
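
As a rough illustration of the ranking described above, the following Python sketch computes the Pearson correlation coefficient between a target series and several candidate series, then sorts the candidates from highest to lowest r. The function names (pearson_r, rank_queries) and the toy data are assumptions made for this example; they are not part of Google Correlate, whose actual implementation is not described here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

def rank_queries(target, query_series, top_n=10):
    """Return (r, query) pairs sorted from highest to lowest correlation."""
    scored = [(pearson_r(target, series), query)
              for query, series in query_series.items()]
    scored.sort(reverse=True)
    return scored[:top_n]

# Toy data, invented for illustration only.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "rises with target": [2.0, 4.0, 6.0, 8.0, 10.0],
    "falls with target": [5.0, 4.0, 3.0, 2.0, 1.0],
    "noisy": [1.0, 3.0, 2.0, 5.0, 4.0],
}
ranked = rank_queries(target, candidates)
for r, query in ranked:
    print(f"{r:.4f}\t{query}")
```

The series that moves in lockstep with the target ranks first (r=1.0), and the one that moves in the opposite direction ranks last (r=-1.0).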

                   For example, say your data set is a sine wave spanning 2003-2010, scaled so that it is 0.0 at
                   the summer solstice and 1.0 at the winter solstice.

                    Week                   Value
                    1/05/03                0.983
                    1/12/03                0.965
                    1/19/03                0.939
                    1/26/03                0.907
                    2/02/03                0.869
                    2/09/03                0.826
                    2/16/03                0.778

                   This could be generated in Excel using the SIN or COS function. You can download a complete
                   spreadsheet for this data set here.
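
The same series can also be generated outside a spreadsheet. The sketch below is one way to do it in Python, assuming the wave peaks at 1.0 on December 21, 2002 and repeats every 365.25 days. Both constants are assumptions chosen for this example, not values stated in this tutorial. Under those assumptions the first seven weekly values should agree with the table above to three decimal places.

```python
import math
from datetime import date, timedelta

WINTER_SOLSTICE = date(2002, 12, 21)  # assumed date at which the wave equals 1.0
DAYS_PER_YEAR = 365.25                # assumed period of the wave

def winter_wave(first_week, num_weeks):
    """Return (week_start, value) rows, one per week, with values in [0, 1]."""
    rows = []
    for i in range(num_weeks):
        week = first_week + timedelta(weeks=i)
        days = (week - WINTER_SOLSTICE).days
        value = 0.5 + 0.5 * math.cos(2 * math.pi * days / DAYS_PER_YEAR)
        rows.append((week, round(value, 3)))
    return rows

first_week = date(2003, 1, 5)  # the first Sunday of 2003
rows = winter_wave(first_week, 7)
for week, value in rows:
    # Same M/DD/YY layout as the table above, tab-separated for pasting.
    print(f"{week.month}/{week.day:02d}/{week.year % 100:02d}\t{value:.3f}")
```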

                   To get your data into Google Correlate, you have two main options:

                       1. Copy/paste from a spreadsheet program (Excel, OpenOffice, …)
                       2. Export CSV and upload it to Google Correlate.

                   We’ll use the copy/paste technique here. To get started, select all the cells of interest in your
                   spreadsheet program and copy them. Then click “Enter your own data” on Google Correlate and
                   switch to the “Time Series” tab. Click on the spreadsheet there and hit Control-V (Command-V
                   on a Mac) to paste in the data:

                    Left: Copy data from a spreadsheet program...
                    Right: ...and paste it into Google Correlate. Be sure to leave off the header row and give
                    your data series a name!

                   Then click “Search correlations” at the bottom to find queries whose time series are correlated.
                   Here are the top queries that come back when you search for this time series:

                   0.9483         alpine touring
                   0.9439         nordica
                   0.9381         volkl
                   0.9339         colds
                   0.9329         hockey arena
                   0.9329         fritschi
                   0.9290         obermeyer
                   0.9289         telemark boot
                   0.9270         wedding soup
                   0.9267         ski boot

                   The numbers are the Pearson Correlation Coefficients. r=0.9483 indicates a very good fit
                   between the target data and the query time series. Here’s what ‘alpine touring’ looks like next to
                   the time series that we uploaded:

                   A few things to note here:

                            Any query containing ‘alpine touring’ will contribute to the time series for ‘alpine touring’.
                            This includes queries like ‘alpine touring skis’, ‘alpine touring vacations’, etc.

                            The data is aggregated to weekly counts. Each week runs from Sunday through Saturday.
                            The point for 2006/01/01, for example, includes queries from the start of Sunday, January
                            1, 2006 to the end of Saturday, January 7, 2006. Google Correlate contains data starting
                            from January 5, 2003 (the first Sunday of 2003).

                            The vertical grid lines mark the beginning of each year.

                            The units on the y-axis are standard deviations away from the mean. Each time series is
                            normalized so that its mean is 0.0 and its standard deviation is 1.0. This puts all series
                            on the same scale so that they’re easier to compare. This also explains why the ‘Winter
                            Wave’ time series ranges from -1.4 to +1.4, even though the input series only ranged
                            from 0 to 1.
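
The normalization in the last point can be checked with a few lines of Python. This is a minimal sketch, assuming the population standard deviation is used (the tutorial does not say which variant Google Correlate applies); the function name normalize and the 52-point sample wave are made up for the example.

```python
import math

def normalize(series):
    """Rescale a series to mean 0.0 and (population) standard deviation 1.0."""
    mean = sum(series) / len(series)
    variance = sum((v - mean) ** 2 for v in series) / len(series)
    std = math.sqrt(variance)
    if std == 0:
        raise ValueError("cannot normalize a constant series")
    return [(v - mean) / std for v in series]

# One full cycle of a wave that ranges from 0 to 1, sampled at 52 points.
wave = [0.5 + 0.5 * math.cos(2 * math.pi * k / 52) for k in range(52)]
z = normalize(wave)
print(round(min(z), 3), round(max(z), 3))  # prints -1.414 1.414
```

A wave between 0 and 1 has a mean of 0.5 and a standard deviation of about 0.354, so its extremes sit about 1.4 standard deviations from the mean.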

                   Negative Correlations
                   Google Correlate only shows you positive correlations. But sometimes the negative correlations
                   can be just as interesting. If you want to see queries which are negatively correlated with your
                   data, just multiply your input data by -1 in your spreadsheet program before uploading it to
                   Google Correlate.

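                   Here are the queries most negatively correlated with the ‘Winter Wave’ time series. Because the
                   input was negated before uploading, the values shown are correlations with the negated series: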

                   0.9729         boat trailer
                   0.9664         trumpet vine
                   0.9630         golf course
                   0.9626         rotary mower
                   0.9618         gary fisher
                   0.9603         deck railing
                   0.9597         used bikes
                   0.9590         pig roast
                   0.9578         bike carrier
                   0.9577         course rating

                   So the time series for the query ‘boat trailer’ had a correlation of r=-0.9729 with the original
                   ‘Winter Wave’ time series. As you might expect, the queries which are negatively correlated with
                   winter are summer queries.
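
The negation trick works because multiplying one series by -1 flips the sign of the Pearson coefficient without changing its magnitude. The Python sketch below demonstrates this on invented toy data; pearson_r is a helper written for this example, and the two short series merely stand in for a winter-peaking target and a summer-peaking query.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data, invented for illustration only.
winter_target = [1.0, 0.8, 0.3, 0.0, 0.2, 0.9]
summer_query = [0.1, 0.2, 0.8, 1.0, 0.7, 0.0]

r_original = pearson_r(winter_target, summer_query)
r_negated = pearson_r([-v for v in winter_target], summer_query)

print(f"original target: r = {r_original:+.4f}")
print(f"negated target:  r = {r_negated:+.4f}")
```

A query that is strongly negatively correlated with the original target therefore shows up with a strongly positive r once the target is negated.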

                   Holdouts and Missing Data

                   Sometimes you don’t have a complete time series or would prefer to hold out a portion of your
                   data for testing. You can accomplish this in Google Correlate by leaving the corresponding
                   values blank in the data you upload.

                   For example, here is the Winter Wave time series with 2006 and 2007 withheld:

                   If you look closely, you can see that the blue line has a gap between the end of 2005 and the
                   start of 2008. When computing correlations, these weeks will be ignored in the time series for
                   candidate queries. This means that, if you build a model for your time series using query data,
                   you can use this held out portion of the time series as a test set.

                   Removing selected weeks from uploaded data sets is a general technique which can be used
                   for other purposes as well. For instance, if your uploaded data has a large spike over a small
                   time period, that spike may have a large (and unwanted) influence on the results. If you withhold
                   the spiking weeks from your data set, you can remove their influence entirely.
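
One way to picture how withheld weeks are ignored is to drop every week whose target value is blank before computing the correlation. The Python sketch below does this, representing blanks as None. The helper names and the toy numbers are assumptions for illustration and say nothing about how Google Correlate is implemented internally.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def pearson_r_ignoring_blanks(target, series):
    """Correlate only the weeks where the target has a value (not None)."""
    pairs = [(t, s) for t, s in zip(target, series) if t is not None]
    xs = [t for t, _ in pairs]
    ys = [s for _, s in pairs]
    return pearson_r(xs, ys)

# Toy data: the third week of the target contains a large spike.
target_with_spike = [1.0, 2.0, 30.0, 4.0, 5.0]
target_held_out = [1.0, 2.0, None, 4.0, 5.0]  # same data, spike withheld
query = [1.1, 1.9, 3.0, 4.2, 4.8]

r_spike = pearson_r(target_with_spike, query)
r_held = pearson_r_ignoring_blanks(target_held_out, query)
print(f"with spike: r = {r_spike:.3f}; spike withheld: r = {r_held:.3f}")
```

With the spike included, the single outlying week dominates and the correlation is weak. With that week withheld, the otherwise close match between the two series becomes visible.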

                   Building a Model with Query Data

                   Note: Statistical modeling is a fine art. This example is presented simply as a demonstration of
                   what’s possible, not as a demonstration of good modeling techniques.

                   Having found queries which are correlated with the winter, we can use them to build a model.
                   Using the Winter Wave with holdout, we get a list of queries whose time series is correlated with
                   the winter. If you click “Export data as CSV” on that page, you’ll get a CSV file containing weekly
                   time series for the top few results.

                   You can import this data into a spreadsheet or your favorite numerical analysis tool to do the
                   modeling. For example, in this spreadsheet, we built a very simple model by summing up the
                   time series for the 20 most highly-correlated queries. We then computed the Pearson
                   Correlation Coefficient between the target time series and the model estimates on the holdout
                   period (2006-2007), which was r=0.979. This indicates that the query data was able to predict
                   previously-unseen real-world data.
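
The summing model and its holdout evaluation can be sketched in Python as follows. Everything here is synthetic: three made-up "query" series stand in for the 20 exported ones, and the holdout is the middle third of a 36-point series rather than 2006-2007. The sketch shows only the mechanics and does not reproduce the r=0.979 result.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def sum_model(query_series):
    """Model estimate: the element-wise sum of the query time series."""
    return [sum(values) for values in zip(*query_series)]

def holdout_correlation(target, estimates, holdout_indices):
    """Pearson r between target and estimates, restricted to held-out weeks."""
    xs = [target[i] for i in holdout_indices]
    ys = [estimates[i] for i in holdout_indices]
    return pearson_r(xs, ys)

# Synthetic stand-ins for the full target and the exported query series.
weeks = range(36)
target = [0.5 + 0.5 * math.cos(2 * math.pi * i / 12) for i in weeks]
query_series = [
    [target[i] + 0.1 * math.sin(k * i) for i in weeks]  # target plus a wobble
    for k in (1, 2, 3)
]

holdout = range(12, 24)  # weeks that were left blank when searching
estimates = sum_model(query_series)
r_holdout = holdout_correlation(target, estimates, holdout)
print(f"r on holdout period = {r_holdout:.3f}")
```

Because the queries were selected without looking at the held-out weeks, a high r on those weeks is evidence that the model generalizes beyond the data it was chosen on.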

                   Of course, there are better ways to model whether it’s winter in the United States. But it is
                   interesting that we can do so exclusively with query data. A similar sequence of steps turned influenza
                   data from the CDC into Google Flu Trends and there are no doubt other time series which can
                   be modeled in a similar way.

                   Correlate by States

                   The examples thus far have worked exclusively with time series. Google Correlate can also find
                   queries whose popularity correlates with a data set across space rather than time.

                   As a simple example, let’s create a data set which is 1 for every state in New England but 0 for
                   all other states:

                   Here are the queries whose popularity is most highly-correlated with this New England data set:

                   0.9903         gorges grant hotel
                   0.9863         england basketball
                   0.9850         neasc
                   0.9846         boston dirt dog
                   0.9829         new england association of schools and colleges
                   0.9815         new england map
                   0.9805         hood ice cream
                   0.9800         map of new england
                   0.9799         new england inns
                   0.9794         new england recruiting report

                   As before, these are Pearson Correlation values. But what does it mean for a query to be
                   correlated with this US states data set? Let’s look at the maps for our New England set and the
                   query “map of new england” side-by-side:

                       Left: our “New England” data set. Right: the popularity of the query “map of new england”.

                   The maps indicate that the query ‘map of new england’ is popular in states where our data set
                   has a 1 and not popular in states where our data has a 0. Clicking the “Scatter plot” link on the
                   result makes this more explicit:

                   The six points on the top right are the six states in New England. The smattering of dots on the
                   lower left are the other 44 states and the District of Columbia. This makes it clear that the query
                   ‘map of new england’ is popular in the six states in New England and nowhere else.

                   For the New England data set, Google Correlate brings back queries which are characteristic of
                   the New England region. If you have a data set which can be broken down by state, uploading it
                   to Google Correlate may give you insight into some of the driving factors behind your data.
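
The New England indicator data set used in this section could be built along the following lines. This Python sketch keys the data by two-letter postal abbreviation purely for brevity; the exact state labels Google Correlate expects are not specified in this tutorial, so treat the keys as an assumption. The optional hold_out argument leaves a value blank (None), following the same blank-value convention used for time series.

```python
STATES = [
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY",
    "DC",  # the District of Columbia is included alongside the 50 states
]
NEW_ENGLAND = {"CT", "ME", "MA", "NH", "RI", "VT"}

def new_england_dataset(hold_out=()):
    """Map each state to 1 (New England), 0 (elsewhere), or None (held out)."""
    data = {}
    for state in STATES:
        if state in hold_out:
            data[state] = None
        else:
            data[state] = 1 if state in NEW_ENGLAND else 0
    return data

data = new_england_dataset()
held = new_england_dataset(hold_out=("DC",))
print(sum(1 for v in data.values() if v == 1), "states set to 1")
print(sum(1 for v in data.values() if v == 0), "entries set to 0")
```

The counts match the scatter plot description: six entries at 1 and 45 at 0 (the other 44 states plus the District of Columbia).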

                   The same techniques discussed for the time series examples also apply to states correlation. If
                   you leave a state’s value blank, that state will be held out. In particular, it is often useful to hold
                   out the District of Columbia, which is an outlier in many data sets.

                   Google Correlate makes an attempt to filter out queries which are unlikely to be interesting.

                   These include:

                            Queries with a low correlation value (less than r=0.6)
                            Misspelled queries
                            Pornographic queries
                            Rare queries
                            Queries which only correlate with a small portion of the time series

                   For more information about the filtering operations performed by Google Correlate, please refer
                   to the Google Correlate Whitepaper.

                   Protecting User Privacy
                   At Google, we are keenly aware of the trust our users place in us, and of our responsibility to
                   protect their privacy. Google Correlate can never be used to identify individual users because we
                   rely on anonymized, aggregated counts of how often certain search queries occur each week.
                   We rely on millions of search queries issued to Google over time, and the patterns we observe in
                   the data are only meaningful across large populations of Google search users. You can learn
                   more about how this data is used and how Google protects users' privacy at our Privacy Center.
