Google Ngram Viewer

The Google Ngram Viewer or Google Books Ngram Viewer is an online search engine that charts the frequencies of any set of comma-delimited search strings using a yearly count of n-grams found in sources printed between 1500 and 2008[1][2][3][4][5] in Google's text corpora in English, Chinese (simplified), French, German, Hebrew, Italian, Russian, or Spanish.[2][6] There are also some specialized English corpora, such as American English, British English, English Fiction, and English One Million; and the 2009 version of most corpora is also available.[7]

The program can search for a single word or a phrase, including misspellings or gibberish.[6] The n-grams are matched with the text within the selected corpus, optionally using case-sensitive spelling (which compares the exact use of uppercase letters),[3] and, if found in 40 or more books, are then plotted on a graph.[8]

The Google Ngram Viewer, as of January 2016, supports searches for parts of speech and wildcards.[7]

History

The program was developed by Jon Orwant and Will Brockman and released in mid-December 2010.[2][4] It was inspired by a prototype (called "Bookworm") created by Jean-Baptiste Michel and Erez Aiden from Harvard's Cultural Observatory, Yuan Shen from MIT, and Steven Pinker.[9]

The Ngram Viewer was initially based on the 2009 edition of the Google Books Ngram Corpus. As of January 2016, the program can search an individual language's corpus within the 2009 or the 2012 edition.

Operation and restrictions

Commas delimit user-entered search terms, with each comma-separated entry indicating a separate word or phrase to find.[8] The Ngram Viewer returns a plotted line chart within seconds of the user pressing the Enter key or the "Search" button on the screen.
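The comma-delimited input described above can be sketched in a few lines of Python. This is only an illustration of the splitting rule, not the Viewer's actual implementation; the function name split_query is hypothetical, and the real tool also understands operators and wildcards that this sketch ignores.

```python
def split_query(query: str) -> list[str]:
    """Split a comma-delimited query into individual search terms.

    Hypothetical helper: surrounding whitespace is trimmed and empty
    entries (for example from a trailing comma) are dropped.
    """
    return [term.strip() for term in query.split(",") if term.strip()]


if __name__ == "__main__":
    # Each comma-separated entry becomes one word or phrase to look up.
    print(split_query("abseiling, rappelling ,R M S,"))
```

A phrase such as "R M S" stays intact because only commas, not spaces, separate terms.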

Because more books were published in some years than in others, the data is normalized to a relative level by the number of books published in each year.[8]
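As a rough sketch of this kind of normalization, the Python below divides each year's raw count by a per-year total. The function name, the dictionaries and the numbers are all hypothetical, and the exact denominator the Viewer uses is an assumption here; the point is only that a raw count becomes a relative level.

```python
def relative_frequency(match_counts: dict[int, int],
                       yearly_totals: dict[int, int]) -> dict[int, float]:
    """Divide each year's raw count by that year's total.

    Years with a missing or zero total are skipped, so the function
    never divides by zero.
    """
    return {
        year: count / yearly_totals[year]
        for year, count in match_counts.items()
        if yearly_totals.get(year)
    }


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    raw = {1900: 50, 2000: 200}
    totals = {1900: 1_000, 2000: 10_000}
    print(relative_frequency(raw, totals))
```

In this made-up example the raw count is four times higher in 2000 than in 1900, yet the relative level is lower, because the total for 2000 is ten times larger.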

Google populated the database from over 5 million books published up to 2008. Accordingly, as of January 2016, no data will match beyond the year 2008, regardless of whether the corpus was generated in 2009 or 2012. Due to limitations on the size of the Ngram database, only matches found in at least 40 books are indexed in the database; otherwise the database could not have stored all possible combinations.[8]

Typically, search terms cannot end with punctuation, although a separate full stop (a period) can be searched.[8] Also, an ending question mark (as in "Why?") will cause a second search for the question mark separately.[8]

Omitting the periods in abbreviations will allow a form of matching, such as using "R M S" to search for "R.M.S." versus "RMS".

Corpora

The corpora used for the search are composed of total_counts, 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams files for each language. The file format of each of the files is tab-separated data. Each line has the following format:[10]

  • total_counts file
    year TAB match_count TAB page_count TAB volume_count NEWLINE
  • Version 1 ngram file (generated in July 2009)
    ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE
  • Version 2 ngram file (generated in July 2012)
    ngram TAB year TAB match_count TAB volume_count NEWLINE

The Google Ngram Viewer uses match_count to plot the graph.
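The line formats listed above can be read with a short Python sketch. The names NgramRecord and parse_ngram_line are hypothetical, and telling the two versions apart by counting tab-separated fields is an assumption made for illustration, not a documented rule.

```python
from typing import NamedTuple, Optional


class NgramRecord(NamedTuple):
    ngram: str
    year: int
    match_count: int
    page_count: Optional[int]  # present only in Version 1 files
    volume_count: int


def parse_ngram_line(line: str) -> NgramRecord:
    """Parse one tab-separated line from a Version 1 or Version 2 ngram file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 5:  # Version 1: includes page_count
        ngram, year, match, page, volume = fields
        return NgramRecord(ngram, int(year), int(match), int(page), int(volume))
    if len(fields) == 4:  # Version 2: no page_count
        ngram, year, match, volume = fields
        return NgramRecord(ngram, int(year), int(match), None, int(volume))
    raise ValueError(f"expected 4 or 5 tab-separated fields, got {len(fields)}")


if __name__ == "__main__":
    # A Version 2 style line; match_count is the value used for plotting.
    record = parse_ngram_line("Wikipedia\t2008\t33722\t6825\n")
    print(record.year, record.match_count)
```

Splitting on tabs rather than on whitespace keeps multi-word n-grams intact, since the words within an n-gram are separated by spaces.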

As an example, the word "Wikipedia" from the Version 2 file of the English 1-grams is stored as follows:[11]

ngram year match_count volume_count
Wikipedia 1904 1 1
Wikipedia 1912 11 1
Wikipedia 1924 1 1
Wikipedia 1925 11 1
Wikipedia 1929 11 1
Wikipedia 1943 11 1
Wikipedia 1946 11 1
Wikipedia 1947 11 1
Wikipedia 1949 11 1
Wikipedia 1951 11 1
Wikipedia 1953 22 2
Wikipedia 1955 11 1
Wikipedia 1958 1 1
Wikipedia 1961 22 2
Wikipedia 1964 22 2
Wikipedia 1965 11 1
Wikipedia 1966 15 2
Wikipedia 1969 33 3
Wikipedia 1970 129 4
Wikipedia 1971 44 4
Wikipedia 1972 22 2
Wikipedia 1973 1 1
Wikipedia 1974 2 1
Wikipedia 1975 33 3
Wikipedia 1976 11 1
Wikipedia 1977 13 3
Wikipedia 1978 11 1
Wikipedia 1979 112 12
Wikipedia 1980 13 4
Wikipedia 1982 11 1
Wikipedia 1983 3 2
Wikipedia 1984 48 3
Wikipedia 1985 37 3
Wikipedia 1986 6 4
Wikipedia 1987 13 2
Wikipedia 1988 14 3
Wikipedia 1990 12 2
Wikipedia 1991 8 5
Wikipedia 1992 1 1
Wikipedia 1993 1 1
Wikipedia 1994 23 3
Wikipedia 1995 4 1
Wikipedia 1996 23 3
Wikipedia 1997 6 1
Wikipedia 1998 32 10
Wikipedia 1999 39 11
Wikipedia 2000 43 12
Wikipedia 2001 59 14
Wikipedia 2002 105 19
Wikipedia 2003 149 53
Wikipedia 2004 803 285
Wikipedia 2005 2964 911
Wikipedia 2006 9818 2655
Wikipedia 2007 20017 5400
Wikipedia 2008 33722 6825

Criticism

The data set has been criticized for its reliance upon inaccurate OCR, for an overabundance of scientific literature, and for including large numbers of incorrectly dated and categorized texts.[12][13] Because of these errors, and because it is uncontrolled for bias[14] (such as the increasing amount of scientific literature, which causes other terms to appear to decline in popularity), it is risky to use this corpus to study language or test theories.[15] Since the data set does not include metadata, it may not reflect general linguistic or cultural change[16] and can only hint at such an effect.

Another issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not.[14]

OCR issues

Optical character recognition, or OCR, is not always reliable, and some characters may not be scanned correctly. In particular, systemic errors like the confusion of "s" and "f" in pre-19th century texts (due to the use of the long s, which was similar in appearance to "f") can cause systemic bias. Although Google Ngram Viewer claims that the results are reliable from 1800 onwards, poor OCR and insufficient data mean that frequencies given for languages such as Chinese may only be accurate from 1970 onward, with earlier parts of the corpus showing no results at all for common terms, and data for some years containing more than 50% noise.[17][18]

References

  1. ^ "Quantitative analysis of culture using millions of digitized books" JB Michel et al, Science 2011, DOI: 10.1126/science.1199644
  2. ^ a b c "Google Ngram Database Tracks Popularity Of 500 Billion Words" Huffington Post, 17 December 2010, webpage: HP8150.
  3. ^ a b "Google Ngram Viewer - Google Books", Books.Google.com, May 2012, webpage: G-Ngrams.
  4. ^ a b "Google's Ngram Viewer: A time machine for wordplay", Cnet.com, 17 December 2010, webpage: CN93.
  5. ^ "A Picture is Worth 500 Billion Words – By Rusty S. Thompson", HarrisburgMagazine.com, 20 September 2011, webpage: HBMag20.
  6. ^ a b "Google Books Ngram Viewer - University at Buffalo Libraries", Lib.Buffalo.edu, 22 August 2011, webpage: Buf497.
  7. ^ a b Google Books Ngram Viewer info page: https://books.google.com/ngrams/info
  8. ^ a b c d e f "Google Ngram Viewer - Google Books" (Information), Books.Google.com, December 16, 2010, webpage: G-Ngrams-info: notes bigrams and use of quotes for words with apostrophes.
  9. ^ The RSA (4 February 2010). "Steven Pinker - The Stuff of Thought: Language as a window into human nature" – via YouTube.
  10. ^ "Google Books Ngram Viewer". Google.
  11. ^ googlebooks-eng-all-1gram-20120701-w.gz at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
  12. ^ Google Ngrams: OCR and Metadata. ResourceShelf, 19 December 2010
  13. ^ Nunberg, Geoff (16 December 2010). "Humanities research with the Google Books corpus". Archived from the original on 10 March 2016.
  14. ^ a b Pechenick, Eitan Adam; Danforth, Christopher M.; Dodds, Peter Sheridan; Barrat, Alain (7 October 2015). "Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution". PLOS ONE. 10 (10): e0137041. doi:10.1371/journal.pone.0137041.
  15. ^ Zhang, Sarah. "The Pitfalls of Using Google Ngram to Study Language". WIRED. Retrieved 2017-05-24.
  16. ^ Koplenig, Alexander (2015-09-02). "The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram data sets—Reconstructing the composition of the German corpus in times of WWII". Digital Scholarship in the Humanities (published 2017-04-01). 32 (1): 169–188. doi:10.1093/llc/fqv037. ISSN 2055-7671.
  17. ^ Google n-grams and pre-modern Chinese. digitalsinology.org.
  18. ^ When n-grams go bad. digitalsinology.org.

Abseiling

Abseiling (/ˈæbseɪl/ or /ˈɑːpzaɪl/; from German abseilen, 'to rope down'), also known as rappelling (/ɹæˈpɛl/ or /ɹəˈpɛl/; from French rappeler, 'to recall' or 'to pull through'), is a controlled descent off a vertical drop, such as a rock face, using a rope.

This technique is used by climbers, mountaineers, cavers, canyoners, search and rescue and rope access technicians to descend cliffs or slopes when they are too steep or dangerous to descend without protection. Many climbers use this technique to protect established anchors from damage. Rope access technicians also use this as a method to access difficult-to-reach areas from above for various industrial applications like maintenance, construction, inspection and welding. To descend safely, abseilers use a variety of techniques to increase the friction on the rope to the point where it can be controlled comfortably. These techniques range from wrapping the rope around their body (e.g. the Dülfersitz) to using a custom-built device like a rack. Practitioners choose a technique based on speed, safety, weight and other circumstantial concerns.

In the United States, the term "rappelling" is used nearly exclusively. In the United Kingdom, both terms are understood, but "abseiling" is strongly preferred. In Australia, New Zealand and Canada, the two terms are used interchangeably. Globally, the term "rappelling" appears in books written in English more often than "abseiling".

Adviser

An adviser or advisor is normally a person with more and deeper knowledge in a specific area and usually also includes persons with cross-functional and multidisciplinary expertise. An adviser's role is that of a mentor or guide and differs categorically from that of a task-specific consultant. An adviser is typically part of the leadership, whereas consultants fulfill functional roles. The spellings adviser and advisor have both been in use since the sixteenth century. Adviser has always been the more usual spelling, though advisor has gained frequency in recent years and is a common alternative, especially in North America.

Ait

An ait (pronounced like eight) or eyot is a small island. It is especially used to refer to river islands found on the River Thames and its tributaries in England. Aits are typically formed by the deposit of sediment in the water, which accumulates over a period of time. An ait is characteristically long and narrow, and may become a permanent island should it become secured and protected by growing vegetation. However, aits may also be eroded: the resulting sediment is deposited further downstream and could result in another ait. A channel with numerous aits is called a braided channel.

Blockbuster (entertainment)

A blockbuster is a work of entertainment – especially a feature film, but also other media – that is highly popular and financially successful. The term has also come to refer to any large-budget production intended for "blockbuster" status, aimed at mass markets with associated merchandising, sometimes on a scale that meant the financial fortunes of a film studio or a distributor could depend on it.

Computational social science

Computational social science refers to the academic sub-disciplines concerned with computational approaches to the social sciences. This means that computers are used to model, simulate, and analyze social phenomena. Fields include computational economics, computational sociology, cliodynamics, culturomics, and the automated analysis of contents, in social and traditional media. It focuses on investigating social and behavioral relationships and interactions through social simulation, modeling, network analysis, and media analysis.

Contract of sale

A contract of sale, sales contract, sales order, or contract for sale is a legal contract for the purchase of assets (goods or property) by a buyer (or purchaser) from a seller (or vendor) for an agreed upon value in money (or money equivalent).

An obvious ancient practice of exchange, in many common law jurisdictions, it is now governed by statutory law. See commercial law.

Contracts of sale involving goods are governed by Article 2 of the Uniform Commercial Code in most jurisdictions in the United States and Canada. However, in Quebec, such contracts are governed by the Civil Code of Quebec as a nominate contract in the book on the law of obligations. In Muslim countries, they are governed by sharia (Islamic law).

A contract of sale lays out the terms of a transaction of goods or services, identifying the goods sold, listing delivery instructions, inspection period, any warranties and details of payment.

Culturomics

Culturomics is a form of computational lexicology that studies human behavior and cultural trends through the quantitative analysis of digitized texts. Researchers data mine large digital archives to investigate cultural phenomena reflected in language and word usage. The term is an American neologism first described in a 2010 Science article called Quantitative Analysis of Culture Using Millions of Digitized Books, co-authored by Harvard researchers Jean-Baptiste Michel and Erez Lieberman Aiden. Michel and Aiden helped create the Google Labs project Google Ngram Viewer, which uses n-grams to analyze the Google Books digital library for cultural patterns in language use over time.

Because the Google Ngram data set is not an unbiased sample, and does not include metadata, there are several pitfalls when using it to study language or the popularity of terms. Medical literature accounts for a large, but shifting, share of the corpus, which does not take into account how often the literature is printed, or read.

Demonym

A demonym (from Greek δῆμος, dêmos, "people, tribe" and όνομα, ónoma, "name") is a word that identifies residents or natives of a particular place and is derived from the name of the place. Examples of demonyms include Cochabambino, for a person from the city of Cochabamba; American, for a person from the country called the United States of America; and Swahili, for a person of the Swahili coast.

Demonyms do not always clearly distinguish place of origin or ethnicity from place of residence or citizenship, and many demonyms overlap with the ethnonym for the ethnically dominant group of a region. Thus a Thai may be any resident or citizen of Thailand of any ethnic group, or more narrowly a member of the Thai people.

Conversely, some groups of people may be associated with multiple demonyms. For example, a native of the United Kingdom may be called a British person, a Briton or, informally, a Brit. In some languages, a demonym may be borrowed from another language as a nickname or descriptive adjective for a group of people: for example, "Québécois(e)" is commonly used in English for a native of Quebec (though "Quebecker" is also available).

In English, demonyms are capitalized and are often the same as the adjectival form of the place, e.g. Egyptian, Japanese, or Greek. Significant exceptions exist; for instance, the adjectival form of Spain is "Spanish", but the demonym is "Spaniard".

English commonly uses national demonyms such as "Ethiopian" or "Guatemalan", while the usage of local demonyms such as "Chicagoan", "Okie", or "Parisian" is rare. Many local demonyms are rarely used and many places, especially smaller towns and cities, lack a commonly used and accepted demonym altogether.

French onion dip

French onion dip or California dip is an American dip typically made with a base of sour cream and flavored with minced onion, and usually served with potato chips as chips and dip.

Glittering generality

A glittering generality (also called glowing generality) is an emotionally appealing phrase so closely associated with highly valued concepts and beliefs that it carries conviction without supporting information or reason. Such highly valued concepts attract general approval and acclaim. Their appeal is to emotions such as love of country and home, and desire for peace, freedom, glory, and honor. They ask for approval without examination of the reason. They are typically used by politicians and propagandists.

Heartland (United States)

Heartland is an American political term referring to U.S. states that "don't touch an ocean," whether the Atlantic or Pacific, or to the Midwestern United States. The phrase not only refers to a tangible region but is also a cultural term connoting many ideas and values, such as hard work, rustic small town communities, rural heritage, simplicity, and honesty. Citizens of the Heartland, referred to simply as "Heartlanders", are often seen as blue-collar.

The Old Northwest, Louisiana (the former colony of France) and the Great Lakes region make up the traditional definition of the Midwest. The US Census Bureau counts 12 states as the Midwest: North Dakota, South Dakota, Illinois, Iowa, Kansas, Minnesota, Missouri, Nebraska, Michigan, Wisconsin, Indiana and Ohio. These are typically associated with the "Small Heartland". The "Large Heartland" means the "Small Heartland" plus Montana, Kentucky, Idaho, Colorado, Oklahoma, Nevada, West Virginia, Wyoming, Utah and the Southern states of Texas, Louisiana and Arkansas.

Jesus H. Christ

"Jesus H. Christ" is an expletive interjection referencing Jesus Christ. It is typically uttered in anger, surprise, or frustration, though sometimes also with humorous intent.

Linear park

A linear park is a park in an urban or suburban setting that is substantially longer than it is wide. Some are rail trails ("rails to trails"), that are disused railroad beds converted to recreational use, while others use strips of public land next to canals, streams, extended defensive walls, electrical lines, highways and shorelines. They are also often described as greenways. In Australia, a linear park along the coast is known as a foreshoreway.

Occupational medicine

Occupational medicine, until 1960 called industrial medicine, is the branch of medicine which is concerned with the maintenance of health in the workplace, including prevention and treatment of diseases and injuries, with secondary objectives of maintaining and increasing productivity and social adjustment in the workplace. It is, thus, the branch of clinical medicine active in the field of occupational health and safety. OM specialists work to ensure that the highest standards of occupational health and safety are achieved and maintained in the workplace. While OM may involve a wide number of disciplines, it centers on preventive medicine and the management of illness, injury, and disability related to the workplace. Occupational physicians must have a wide knowledge of clinical medicine and be competent in a number of important areas. They often advise international bodies, governmental and state agencies, organizations and trade unions. There are contextual links to physical medicine and rehabilitation and to insurance medicine.

Political economy

Political economy is the study of production and trade and their relations with law, custom and government; and with the distribution of national income and wealth. As a discipline, political economy originated in moral philosophy, in the 18th century, to explore the administration of states' wealth, with "political" signifying the Greek word polity and "economy" signifying the Greek word "okonomie" (household management). The earliest works of political economy are usually attributed to the British scholars Adam Smith, Thomas Malthus, and David Ricardo, although they were preceded by the work of the French physiocrats, such as François Quesnay (1694–1774) and Anne-Robert-Jacques Turgot (1727–1781). In the late 19th century, the term "economics" gradually began to replace the term "political economy" with the rise of mathematical modelling coinciding with the publication of an influential textbook by Alfred Marshall in 1890. Earlier, William Stanley Jevons, a proponent of mathematical methods applied to the subject, advocated economics for brevity and with the hope of the term becoming "the recognised name of a science". Citation measurement metrics from Google Ngram Viewer indicate that use of the term "economics" began to overshadow "political economy" around roughly 1910, becoming the preferred term for the discipline by 1920. Today, the term "economics" usually refers to the narrow study of the economy absent other political and social considerations while the term "political economy" represents a distinct and competing approach.

Political economy, where it is not used as a synonym for economics, may refer to very different things. From an academic standpoint, the term may reference Marxian economics, applied public choice approaches emanating from the Chicago school and the Virginia school. In common parlance, "political economy" may simply refer to the advice given by economists to the government or public on general economic policy or on specific economic proposals developed by political scientists. A rapidly growing mainstream literature from the 1970s has expanded beyond the model of economic policy in which planners maximize utility of a representative individual toward examining how political forces affect the choice of economic policies, especially as to distributional conflicts and political institutions. It is available as a stand-alone area of study in certain colleges and universities.

Smiley

A smiley (sometimes called a happy face or smiley face) is a stylized representation of a smiling humanoid face that is a part of popular culture worldwide. The classic form designed by Harvey Ball in 1963 comprises a yellow circle with two black dots representing eyes and a black arc representing the mouth. On the Internet and in other plain text communication channels, the emoticon form (sometimes also called the smiley-face emoticon) has traditionally been most popular, typically employing a colon and a right parenthesis to form sequences such as :-), :), or (: that resemble a smiling face when viewed after rotation through 90 degrees. "Smiley" is also sometimes used as a generic term for any emoticon. The smiley has been referenced in nearly all areas of Western culture including music, movies, and art. The smiley has also been associated with late 1980s and early 1990s rave culture. The plural form "smilies" is commonly used, but the variant spelling "smilie" is not as common as the "y" spelling.

Spanish orthography

Spanish orthography is the orthography used in the Spanish language. The alphabet uses the Latin script. The spelling is fairly phonemic, especially in comparison to more opaque orthographies like English and Irish, having a relatively consistent mapping of graphemes to phonemes; in other words, the pronunciation of a given Spanish-language word can largely be predicted from its spelling and to a slightly lesser extent vice versa. Notable features of Spanish punctuation include the lack of the serial comma and the inverted question and exclamation marks: ⟨¿⟩ ⟨¡⟩.

Spanish uses capital letters much less often than English; they are not used on adjectives derived from proper nouns (e.g. francés, español, israelí from Francia, España, and Israel, respectively) and book titles capitalize only the first word (e.g. La rebelión de las masas).

Spanish uses only the acute accent, over any vowel: ⟨á é í ó ú⟩. This accent is used to mark the tonic (stressed) syllable, though it may also be used occasionally to distinguish homophones such as si ('if') and sí ('yes'). The only other diacritics used are the tilde on the letter ⟨ñ⟩, which is considered a separate letter from ⟨n⟩, and the diaeresis used in the sequences ⟨güe⟩ and ⟨güi⟩—as in bilingüe ('bilingual')—to indicate that the ⟨u⟩ is pronounced, [w], rather than having the usual silent role that it plays in unmarked ⟨gue⟩ and ⟨gui⟩.

In contrast with English, Spanish has an official body that governs linguistic rules, orthography among them: the Royal Spanish Academy, which makes periodic changes to orthography. It is the policy of the Royal Spanish Academy that, when quoting older texts, one should update spelling to the current rules, except in discussions of the history of the Spanish language.

User experience

User experience (UX) refers to a person's emotions and attitudes about using a particular product, system or service. It includes the practical, experiential, affective, meaningful and valuable aspects of human–computer interaction and product ownership. Additionally, it includes a person’s perceptions of system aspects such as utility, ease of use and efficiency. User experience may be considered subjective in nature to the degree that it is about individual perception and thought with respect to the system. User experience is dynamic as it is constantly modified over time due to changing usage circumstances and changes to individual systems as well as the wider usage context in which they can be found. In the end, user experience is about how the user interacts with and experiences the product.

You can't have your cake and eat it

You can't have your cake and eat it (too) is a popular English idiomatic proverb or figure of speech. The proverb literally means "you cannot simultaneously retain your cake and eat it". Once the cake is eaten, it is gone. It can be used to say that one cannot or should not have or want more than one deserves or is reasonable, or that one cannot or should not try to have two incompatible things. The proverb's meaning is similar to the phrases "you can't have it both ways" and "you can't have the best of both worlds."

Many people are confused by the meaning of "have" and "eat" in the order as used here, although they still understand the proverb and its intent and use it in this form. Some people feel the above form of the proverb is incorrect and illogical and instead prefer: "You can't eat your cake and [then still] have it too", which is in fact closer to the original form of the proverb (see further explanations below) but uncommon today. Another variant uses "keep" instead of "have". Having to choose whether to have or eat your cake illustrates the concept of trade-offs or opportunity cost.

This page is based on a Wikipedia article written by authors (here).
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.