
```
Google Correlate Tutorial
Google Correlate is an experimental new tool on Google Labs which lets you use the same
methodology and data as Google Flu Trends.

Google Correlate is like Google Trends in reverse. With Google Trends, you type in a query and
get back a series of its frequency (over time, or in each US state). With Google Correlate, you
enter a data series (the target) and get back queries whose frequency follows a similar pattern.

[Figure: Google Trends compared with Google Correlate]

Correlated Queries

When you upload a data set (a time series, for instance), Google Correlate will compute the
Pearson Correlation Coefficient (r) between your time series and the frequency time series for
every query in our database. Correlation coefficients range from r=-1.0 to r=+1.0. The queries
that Google Correlate shows you are the ones with the highest correlation coefficient (i.e. closest
to r=1.0).

For example, say your data set was a sine wave covering 2003-2010, scaled so that its value is
0.0 at the summer solstice and 1.0 at the winter solstice.

Week                   Value
1/05/03                0.983
1/12/03                0.965
1/19/03                0.939
1/26/03                0.907
2/02/03                0.869
2/09/03                0.826
2/16/03                0.778
...

This could be generated in Excel using the SIN or COS function. You can download a complete
spreadsheet for this data set here.

To get your data into Google Correlate, you have two main options:

1. Copy/paste from a spreadsheet program (Excel, OpenOffice, …)

We’ll use the copy/paste technique here. To get started, select all the cells of interest in your
spreadsheet and copy them. Then, in Google Correlate, switch to the “Time Series” tab. Click on
the spreadsheet there and hit Control-V (Command-V on a Mac) to paste in the data:

Copy data from a spreadsheet program...

...and paste it into Google Correlate. Be sure to leave off the header row and give your data
series a name!

Then click “Search correlations” at the bottom to find queries whose time series are correlated.
Here are the top queries that come back when you search for this time series:

0.9483         alpine touring
0.9439         nordica
0.9381         volkl
0.9339         colds
0.9329         hockey arena
0.9329         fritschi
0.9290         obermeyer
0.9289         telemark boot
0.9270         wedding soup
0.9267         ski boot

The numbers are the Pearson Correlation Coefficients. r=0.9483 indicates a very good fit
between the target data and the query time series. Here’s what ‘alpine touring’ looks like next to
the time series that we uploaded:

A few things to note here:

Any query containing ‘alpine touring’ will contribute to the time series for ‘alpine touring’.
This includes queries like ‘alpine touring skis’, ‘alpine touring vacations’, etc.

The data is aggregated to weekly counts. Each week runs from Sunday through the following
Saturday. The point for 2006/01/01, for example, includes queries from the start of Sunday,
January 1, 2006 to the end of Saturday, January 7, 2006. Google Correlate contains data starting
from January 5, 2003 (the first Sunday of 2003).

The vertical grid lines mark the beginning of each year.

The units on the y-axis are standard deviations away from the mean. Each time series is
normalized so that its mean is 0.0 and its standard deviation is 1.0. This puts all series
on the same scale so that they’re easier to compare. This also explains why the ‘Winter
Wave’ time series ranges from -1.4 to +1.4, even though the input series only ranged
from 0 to 1.

Negative Correlations
Google Correlate only shows you positive correlations. But sometimes the negative correlations
can be just as interesting. If you want to see queries which are negatively correlated with your
data, multiply your data by -1 before uploading it.

Here are the negative correlations for the seasons time series:

0.9729         boat trailer
0.9664         trumpet vine
0.9630         golf course
0.9626         rotary mower
0.9618         gary fisher
0.9603         deck railing
0.9597         used bikes
0.9590         pig roast
0.9578         bike carrier
0.9577         course rating

So the time series for the query ‘boat trailer’ had a correlation of r=-0.9729 with the original
‘Winter Wave’ time series. As you might expect, the queries which are negatively correlated with
winter are summer queries.

Holdouts and Missing Data

Sometimes you don’t have a complete time series or would prefer to hold out a portion of your
data for testing. You can accomplish this in Google Correlate by putting blank values in your data
for the weeks you want to leave out.

For example, here is the Winter Wave time series with 2006 and 2007 withheld:

If you look closely, you can see that the blue line has a gap between the end of 2005 and the
start of 2008. When computing correlations, these weeks will be ignored in the time series for
candidate queries. This means that, if you build a model for your time series using query data,
you can use this held out portion of the time series as a test set.

Removing selected weeks from uploaded data sets is a general technique which can be used
for other purposes as well. For instance, if your uploaded data has a large spike over a small
time period, that spike may have a large (and unwanted) influence on the results. If you withhold
the spiking weeks from your data set, you can remove their influence entirely.

Building a Model with Query Data

Note: Statistical modeling is a fine art. This example is presented simply as a demonstration of
what’s possible, not as a demonstration of good modeling techniques.

Having found queries which are correlated with the winter, we can use them to build a model.
Using the Winter Wave with holdout, we get a list of queries whose time series is correlated with
the winter. If you click “Export data as CSV” on that page, you’ll get a CSV file containing weekly
time series for the top few results.

You can import this data into a spreadsheet or your favorite numerical analysis tool to do the
modeling. For example, in this spreadsheet, we built a very simple model by summing up the
time series for the 20 most highly-correlated queries. We then computed the Pearson
Correlation Coefficient between the target time series and the model estimates on the holdout
period (2006-2007), which was r=0.979. This indicates that the query data was able to predict
previously-unseen real-world data.

Of course, there are better ways to model whether it’s winter in the United States. But it is
interesting that we can do so exclusively with query data. A similar sequence of steps turned
influenza data from the CDC into Google Flu Trends, and there are no doubt other time series
which can be modeled in a similar way.

Correlate by States

The examples thus far have worked exclusively with time series. Google Correlate can also find
queries whose popularity correlates with a data set across space rather than time.

As a simple example, let’s create a data set which is 1 for every state in New England but 0 for
all other states:

Here are the queries whose popularity is most highly-correlated with this New England data set:

0.9903         gorges grant hotel
0.9850         neasc
0.9846         boston dirt dog
0.9829         new england association of schools and colleges
0.9815         new england map
0.9805         hood ice cream
0.9800         map of new england
0.9799         new england inns
0.9794         new england recruiting report

As before, these are Pearson Correlation values. But what does it mean for a query to be
correlated with this US states data set? Let’s look at the maps for our New England set and the
query “map of new england” side-by-side:

Left: our “New England” data set. Right: the popularity of the query “map of new england”.

The maps indicate that the query ‘map of new england’ is popular in states where our data set
has a 1 and not popular in states where our data has a 0. Clicking the “Scatter plot” link on the
result makes this more explicit:

The six points on the top right are the six states in New England. The smattering of dots on the
lower left are the other 44 states and the District of Columbia. This makes it clear that the query
‘map of new england’ is popular in the six states in New England and nowhere else.

For the New England data set, Google Correlate brings back queries which are characteristic of
the New England region. If you have a data set which can be broken down by state, uploading it
to Google Correlate may give you insight into some of the driving factors behind your data.

The same techniques discussed for the time series examples also apply to states correlation. If
you don’t specify a value for a state, that state will be held out. In particular, it is often useful
to hold out the District of Columbia, which is an outlier in many data sets.

Filtering
Google Correlate attempts to filter out queries which are unlikely to be interesting.

These include:

Queries with a low correlation value (less than r=0.6)
Misspelled queries
Pornographic queries
Rare queries
Queries which only correlate with a small portion of the time series

Protecting User Privacy
At Google, we are keenly aware of the trust our users place in us, and of our responsibility to
protect their privacy. Google Correlate can never be used to identify individual users because we
rely on anonymized, aggregated counts of how often certain search queries occur each week.
We rely on millions of search queries issued to Google over time, and the patterns we observe in
the data are only meaningful across large populations of Google search users. You can learn
more about how this data is used and how Google protects users' privacy at our Privacy Center.


```
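The tutorial's Winter Wave example, its normalization to mean 0.0 and standard deviation 1.0, and its treatment of blank (held-out) weeks can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google Correlate's implementation: the function names are hypothetical, the wave is assumed to peak on December 21, 2002 with a period of 365.25 days (values chosen because they reproduce the first rows of the tutorial's table), and the "summer" series is made up for the demonstration.

```python
import math
from datetime import date, timedelta

# Assumptions (not stated in the tutorial): phase reference and period.
WINTER_SOLSTICE = date(2002, 12, 21)
DAYS_PER_YEAR = 365.25


def winter_wave(start=date(2003, 1, 5), end=date(2010, 12, 26)):
    """Weekly (Sunday) samples: 1.0 at the winter solstice, 0.0 at the summer solstice."""
    weeks, values = [], []
    day = start
    while day <= end:
        phase = 2 * math.pi * (day - WINTER_SOLSTICE).days / DAYS_PER_YEAR
        weeks.append(day)
        values.append(0.5 + 0.5 * math.cos(phase))
        day += timedelta(weeks=1)
    return weeks, values


def normalize(values):
    """Rescale to mean 0.0 and standard deviation 1.0; None (blank) stays None."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    std = math.sqrt(sum((v - mean) ** 2 for v in present) / len(present))
    return [None if v is None else (v - mean) / std for v in values]


def pearson_r(xs, ys):
    """Pearson r over the positions where both series have a value."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


weeks, wave = winter_wave()
# Blank out 2006 and 2007, as in the holdout example.
held_out = [None if w.year in (2006, 2007) else v for w, v in zip(weeks, wave)]
# A made-up "summer" series that is the mirror image of the wave.
summer = [1.0 - v for v in wave]

print([round(v, 3) for v in wave[:4]])        # first rows of the wave
print(round(max(normalize(wave)), 2))         # peak of the normalized wave
print(round(pearson_r(held_out, summer), 4))  # correlation, ignoring blank weeks
```

Because the correlation of a series with its mirror image is the negative of its correlation with itself, the made-up summer series scores r = -1 against the wave, and the blank weeks simply drop out of the calculation. The normalized wave peaks near 1.4, which matches the range the tutorial reports for the 'Winter Wave' plot.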