Tuesday, December 21, 2010

Ngrams: a problem of misinterpreted scans

A search for the term "Nintendo" in Google's new "Google Books Ngram Viewer" tool brought up an interesting result:

Apparently, for a short period of time, the term "Nintendo" was as visible in the expanse of literature as it would be in 1995! Oh, and not a single mention of "Nintendo" between the years 1871 and 1960, and then nothing again in any published book until 1989. (Side note: Nintendo was originally founded in Japan in 1889 as a playing-card manufacturer, becoming the electronics-games manufacturer with which we are most familiar starting in 1974.)

Now, these mentions come from published material that Google has scanned from printed books over the past several years; they do not reflect the popularity of the Nintendo site online (for that, check the Google Trends page for "Nintendo"). So, how can there be any mentions of a Japanese company in English printed material about 20 years prior to the founding of that company? Well, looking at the highlighted sources for "Nintendo" from 1870, I found that in most cases it was a mis-identification of "intendo" (i.e., the text-recognition software mistook a preceding letter or symbol for an "N", thus producing the (case-specific) results for "Nintendo" around 1870). For example:

There was also one result for "nintendo" that apparently was a footnote translation from Italian (although not modern Italian, since the phrases in it don't translate directly in the Google translator).

In addition, Nintendo - as a culturally important company in the US - didn't come into being until 1974, so what are the mentions for "Nintendo" in 1960? Clicking on the link for this time period, I was provided with four results. Two of these results were additional nintendo-as-Italian examples, and the other two were examples of mis-filing. One result referenced the film Chinatown, which didn't come out until 1974, and based on the snippet view might be talking about the generation of children who grew up with Nintendo and the film Chinatown. The other result showed a Singapore Airlines advertisement snippet from the Economist magazine, which means that it couldn't be from 1960, since Singapore Airlines (as an independent entity) didn't exist until it split from Malaysia-Singapore Airlines in 1972. Furthermore, the result shows that the reference comes from the 364th volume, issues 8280-8283 of The Economist, which (assuming 52 issues per year since its founding in 1843) means that the 8280th issue would have come out in 2002 (it's difficult to find issue numbers on The Economist website, but this article that appears to have been published there was written in November of 2002).
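The back-of-the-envelope dating of that Economist issue can be written out as a short Python sketch. The function name and its parameters are my own, and it assumes a constant 52 issues per year since 1843, which is only a rough approximation of the magazine's real publishing history:

```python
def estimate_issue_year(issue_number, founding_year=1843, issues_per_year=52):
    """Estimate the publication year of a numbered issue.

    Assumes a constant number of issues per year since the founding year,
    so the result is only a rough estimate.
    """
    return founding_year + issue_number / issues_per_year


# Issue 8280 is the first issue listed in the search result.
estimated = estimate_issue_year(8280)
print(round(estimated, 1))  # 2002.2
```

Even allowing for a few years of error in that assumption, the estimate lands nowhere near 1960.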

The term "Nintendo" does appear to refer to the video game company in all of the books that I looked at for all dates after 1988.

All this means that some technological problems will likely creep into any data analysis. It is important to filter the data prior to analysis (although one hopes that a lot of these problems won't be too serious even without all the filtering).
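As a minimal sketch of what such filtering might look like, the Python below splits yearly counts into those on or after an earliest plausible year and those that should be checked by hand. The function name, the cutoff, and the sample counts are all hypothetical; the numbers are placeholders invented for illustration, not values taken from the Ngram Viewer:

```python
def filter_implausible_years(counts_by_year, earliest_plausible_year):
    """Split yearly counts into (kept, flagged) dictionaries.

    Years before earliest_plausible_year are flagged for manual review
    rather than silently discarded.
    """
    kept = {}
    flagged = {}
    for year, count in counts_by_year.items():
        if year >= earliest_plausible_year:
            kept[year] = count
        else:
            flagged[year] = count
    return kept, flagged


# Placeholder counts, invented for illustration only (not real Ngram data).
sample_counts = {1870: 3, 1960: 4, 1989: 120, 1995: 300}

# 1974 is used as the cutoff because that is when Nintendo became the
# electronics-games company most readers know.
kept, flagged = filter_implausible_years(sample_counts, earliest_plausible_year=1974)
print(sorted(flagged))  # [1870, 1960]
```

A cutoff like this only catches hits that are too early; it would not catch a mis-scan or a mis-filed book that happens to fall after the cutoff.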

In addition to the technological problem of the scanning and optical character recognition software, there may also be problems in the usage of words, such as with the term "Sony" (which many of us no doubt associate with the Japanese company originally founded in 1946). When doing a search for "Sony" (recall: it is case-sensitive), one gets a lot of noise prior to the 1970s (when the company was introducing the Betamax video cassette and the Walkman).
