Wednesday, May 25, 2011

Mining patterns in search data with Google Correlate

It all started with the flu. In 2008, we found that the activity of certain search terms is a good indicator of actual flu activity. Based on this finding, we launched Google Flu Trends to provide timely estimates of flu activity in 28 countries. Since then, we’ve seen a number of other researchers, including our own, use search activity data to estimate other real-world activities.

However, tools that provide access to search data, such as Google Trends or Google Insights for Search, weren’t designed with this type of research in mind. Those systems allow you to enter a search term and see its trend, but researchers told us they wanted to do the opposite: enter the trend of some real-world activity and see which search terms best match it. In other words, they wanted a system that was like Google Trends but in reverse.

This is now possible with Google Correlate, which we’re launching today on Google Labs. Using Correlate, you can upload your own data series and see a list of search terms whose popularity best corresponds with that real-world trend. In the example below, we uploaded official flu activity data from the U.S. CDC over the last several years and found that people search for terms like [cold or flu] in a pattern similar to actual flu rates. Finding these correlated terms is how we built Google Flu Trends:


You can also enter a search term such as [ribosome] and find other terms whose activity corresponds well over time with the one you’re interested in:


It turns out cell biology isn’t all that popular in the summertime (sorry, biologists!). What’s interesting is that the ups and downs of web search activity for cell biology terms are distinctive enough that searching on Correlate for [ribosome] brings up searches for other biology terms, such as [mitochondria]. Of course, correlation isn’t the same thing as causation, so we can’t explain why two terms follow the same pattern. But my guess in this case is that both terms are popular when schools teach these concepts.
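To make the idea of "terms whose activity corresponds well" concrete, here is a minimal sketch in Python. It assumes Pearson correlation as the similarity measure and ranks a handful of candidate series against a target series. The function names (`pearson`, `rank_terms`) and all of the numbers are hypothetical and purely illustrative; this is not Google Correlate's implementation, whose actual methodology is described in the white paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A constant series has no variation to correlate with; treat it as 0.
    return cov / norm if norm else 0.0


def rank_terms(target, term_series, top_n=3):
    """Return the top_n (term, r) pairs, highest correlation first."""
    scored = [(term, pearson(target, series))
              for term, series in term_series.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up weekly values standing in for an uploaded real-world series
# (for example, flu activity) and for per-term search activity.
target = [1.0, 2.0, 4.0, 3.0, 1.0, 0.5]
term_series = {
    "cold or flu": [10, 19, 41, 30, 11, 6],
    "sunscreen": [40, 30, 10, 15, 35, 45],
    "ribosome": [20, 20, 21, 19, 20, 20],
}

ranked = rank_terms(target, term_series)
for term, r in ranked:
    print(f"{term}: r = {r:.2f}")
```

A high coefficient only says that two series rise and fall together; as noted above, it says nothing about why. Scanning every candidate this way would also be far too slow across millions of queries, so a real system would need a much faster search than this brute-force loop.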

Search activity is an incredible source of data that may lead to advances in economics, health and other fields; but we need to handle that data with privacy controls in mind. With this system, we don’t care what any one person is searching for. In fact, we rely on millions of anonymized search queries issued to Google over time, and the patterns we observe in the data are only meaningful across large populations.

We encourage you to read our white paper describing the methodology behind Google Correlate. Or for lighter reading, check out our comic! We’ve enjoyed uploading different data sets to see fascinating and sometimes perplexing correlations. Plug in your data and let us know what you find.
