The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs in the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.
Google first introduced Sitemaps 0.84 in June 2005 so web developers could publish lists of links from across their sites. Google, MSN and Yahoo announced joint support for the Sitemaps protocol in November 2006. The schema version was changed to "Sitemap 0.90", but no other changes were made.
In April 2007, Ask.com and IBM announced support for Sitemaps. Google, Yahoo and Microsoft also announced auto-discovery for sitemaps through robots.txt. In May 2007, the state governments of Arizona, California, Utah and Virginia announced they would use Sitemaps on their web sites.
The Sitemaps protocol is based on ideas from "Crawler-friendly Web Servers," with improvements including auto-discovery through robots.txt and the ability to specify the priority and change frequency of pages.
Sitemaps are particularly beneficial on websites where:
The Sitemap Protocol format consists of XML tags. The file itself must be UTF-8 encoded. Sitemaps can also be just a plain text list of URLs. They can also be compressed in .gz format.
A sample Sitemap that contains just one URL and uses all optional tags is shown below.
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2006-11-18</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
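A file like the sample above can also be generated programmatically. The following is a minimal sketch in Python, assuming Python 3.8 or later and using only the standard library; the function name `build_sitemap` and the dictionary layout of the entries are illustrative choices, not part of the protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML (bytes) for an iterable of dicts.

    Each dict needs a "loc" key; "lastmod", "changefreq" and "priority"
    are optional, mirroring the optional tags of the protocol.
    (Hypothetical helper, shown for illustration only.)
    """
    # The namespace is written as a plain xmlns attribute on the root element.
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                # ElementTree escapes &, < and > in element text on output.
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


sitemap_xml = build_sitemap([
    {
        "loc": "http://example.com/",
        "lastmod": "2006-11-18",
        "changefreq": "daily",
        "priority": 0.8,
    }
])
print(sitemap_xml.decode("utf-8"))
```

The sketch omits the optional schema-location attributes shown in the sample, and it does not enforce the size limits of the protocol.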
The Sitemap XML protocol is also extended to provide a way of listing multiple Sitemaps in a 'Sitemap index' file. The maximum Sitemap size of 50 MiB or 50,000 URLs means this is necessary for large sites.
An example of a Sitemap index referencing one separate sitemap follows.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2014-10-01T18:23:17+00:00</lastmod>
  </sitemap>
</sitemapindex>
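For a large site, the URL list has to be split into files of at most 50,000 URLs each, and those files are then listed in an index. The sketch below illustrates that step in Python (3.8 or later); the helper names, the 120,000-page site and the `sitemapN.xml.gz` naming scheme are assumptions made for the example, and the 50 MiB size limit is not checked.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # URL limit per Sitemap file stated in the text


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive chunks of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_locations):
    """Return Sitemap index XML (bytes) listing the given sitemap URLs."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for location in sitemap_locations:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = location
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


# Hypothetical site with 120,000 pages: three Sitemap files are needed.
urls = [f"http://www.example.com/page{i}" for i in range(120_000)]
chunks = chunk_urls(urls)
locations = [
    f"http://www.example.com/sitemap{n}.xml.gz"
    for n in range(1, len(chunks) + 1)
]
index_xml = build_sitemap_index(locations)
print(len(chunks), "sitemap files listed in the index")
```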
The definitions for the elements are shown below:
|Element||Required?||Description|
|<urlset>||Yes||The document-level element for the Sitemap. The rest of the document after the '<?xml version>' element must be contained in this.|
|<url>||Yes||Parent element for each entry.|
|<sitemapindex>||Yes||The document-level element for the Sitemap index. The rest of the document after the '<?xml version>' element must be contained in this.|
|<sitemap>||Yes||Parent element for each entry in the index.|
|<loc>||Yes||Provides the full URL of the page or sitemap, including the protocol (e.g. http, https) and a trailing slash, if required by the site's hosting server. This value must be shorter than 2,048 characters. Note that ampersands in the URL need to be escaped as &amp;.|
|<lastmod>||No||The date that the file was last modified, in ISO 8601 format. This can display the full date and time or, if desired, may simply be the date in the format YYYY-MM-DD.|
|<changefreq>||No||How frequently the page may change: always, hourly, daily, weekly, monthly, yearly or never.
"Always" is used to denote documents that change each time that they are accessed. "Never" is used to denote archived URLs (i.e. files that will not be changed again).
This is used only as a guide for crawlers, and is not used to determine how frequently pages are indexed.
Does not apply to <sitemap> elements.|
|<priority>||No||The priority of that URL relative to other URLs on the site. This allows webmasters to suggest to crawlers which pages are considered more important.
The valid range is from 0.0 to 1.0, with 1.0 being the most important. The default value is 0.5.
Rating all pages on a site with a high priority does not affect search listings, as it is only used to suggest to the crawlers how important pages in the site are to one another.
Does not apply to <sitemap> elements.|
Support for the elements that are not required can vary from one search engine to another.
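The constraints in the element definitions can be checked before a Sitemap is published. The following Python sketch is one way to do so; `check_entry` is a hypothetical helper, and `datetime.fromisoformat` accepts only a subset of ISO 8601 (for example, it accepts `2006-11-18` and `2014-10-01T18:23:17+00:00` but not a bare year), so a production validator might need a more permissive date parser.

```python
from datetime import datetime

# The change-frequency values defined by the Sitemaps protocol.
VALID_CHANGEFREQ = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}


def check_entry(loc, lastmod=None, changefreq=None, priority=None):
    """Return a list of problems found in one <url> entry (empty if none)."""
    problems = []
    if len(loc) >= 2048:
        problems.append("loc must be shorter than 2,048 characters")
    if lastmod is not None:
        try:
            # Assumption: lastmod uses a form fromisoformat() understands.
            datetime.fromisoformat(lastmod)
        except ValueError:
            problems.append("lastmod is not a recognised ISO 8601 date")
    if changefreq is not None and changefreq not in VALID_CHANGEFREQ:
        problems.append("unknown changefreq value")
    if priority is not None and not 0.0 <= priority <= 1.0:
        problems.append("priority must be between 0.0 and 1.0")
    return problems


print(check_entry("http://example.com/", "2006-11-18", "daily", 0.8))
print(check_entry("http://example.com/", "18/11/2006", "sometimes", 1.5))
```

The first call returns an empty list; the second reports one problem each for the date, the change frequency and the priority.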
The Sitemaps protocol allows the Sitemap to be a simple list of URLs in a text file. The file specifications of XML Sitemaps apply to text Sitemaps as well: the file must be UTF-8 encoded and cannot be larger than 50 MiB or contain more than 50,000 URLs, but it can be compressed as a gzip file.
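A text Sitemap is simple enough to produce in a few lines. The sketch below, in Python, writes one URL per line, encodes the result as UTF-8 and compresses it with gzip; the function name and the example URLs are illustrative, and only the URL-count limit is enforced.

```python
import gzip


def text_sitemap_bytes(urls, max_urls=50_000):
    """Return a gzip-compressed, UTF-8 text Sitemap with one URL per line."""
    urls = list(urls)
    if len(urls) > max_urls:
        raise ValueError(f"a text Sitemap may list at most {max_urls} URLs")
    body = "\n".join(urls) + "\n"
    return gzip.compress(body.encode("utf-8"))


compressed = text_sitemap_bytes([
    "http://www.example.com/",
    "http://www.example.com/about",
])
# Round trip: decompressing gives back the plain-text list.
print(gzip.decompress(compressed).decode("utf-8"))
```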
A syndication feed is a permitted method of submitting URLs to crawlers; this is advised mainly for sites that already have syndication feeds. One stated drawback is that this method might only provide crawlers with more recently created URLs, but other URLs can still be discovered during normal crawling.
It can be beneficial to have a syndication feed as a delta update (containing only the newest content) to supplement a complete sitemap.
If Sitemaps are submitted directly to a search engine (pinged), it will return status information and any processing errors. The details involved with submission will vary with the different search engines. The location of the sitemap can also be included in the robots.txt file by adding the following line:
Sitemap: <sitemap_location>
<sitemap_location> should be the complete URL to the sitemap, such as http://www.example.com/sitemap1.xml.gz.
This directive is independent of the user-agent line, so it doesn't matter where it is placed in the file. If the website has several sitemaps, multiple "Sitemap:" records may be included in robots.txt, or the URL can simply point to the main sitemap index file.
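Sitemap records in robots.txt can be read back with Python's standard `urllib.robotparser` module, whose `site_maps()` method is available from Python 3.8. The robots.txt content below is a made-up example with two Sitemap records, given as a sketch of how a crawler might discover them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the URLs are placeholders.
robots_txt = """\
User-agent: *
Disallow: /private/

Sitemap: http://www.example.com/sitemap-index.xml
Sitemap: http://www.example.com/news-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns a list of URLs, or None if no Sitemap record exists.
sitemaps = parser.site_maps()
print(sitemaps)
```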
The following table lists the sitemap submission URLs for several major search engines:
|Search engine||Submission URL||Help page||Market|
|Baidu||https://zhanzhang.baidu.com/dashboard/index||Baidu Webmaster Dashboard||China, Hong Kong, Singapore|
|Bing (and Yahoo!)||https://www.bing.com/webmaster/ping.aspx?siteMap=||Bing Webmaster Tools||Global|
|Google||https://www.google.com/webmasters/tools/ping?sitemap=||Submitting a Sitemap||Global|
|Yandex||https://webmaster.yandex.com/site/map.xml||Sitemaps files||Russia, Ukraine, Belarus, Kazakhstan, Turkey|
Sitemaps supplement and do not replace the existing crawl-based mechanisms that search engines already use to discover URLs. Using this protocol does not guarantee that web pages will be included in search indexes, nor does it influence the way that pages are ranked in search results. Specific examples are provided below.
Sitemap files have a limit of 50,000 URLs and 50 MiB (52,428,800 bytes) per sitemap. Sitemaps can be compressed using gzip, reducing bandwidth consumption. Multiple sitemap files are supported, with a Sitemap index file serving as an entry point. Sitemap index files may not list more than 50,000 Sitemaps, must be no larger than 50 MiB, and can be compressed. A site can have more than one Sitemap index file.
As with all XML files, any data values (including URLs) must use entity escape codes for the characters ampersand (&), single quote ('), double quote ("), less than (<), and greater than (>).
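As an illustration of the escaping rule, the Python sketch below uses `xml.sax.saxutils.escape` from the standard library, which escapes `&`, `<` and `>` by default and takes a dictionary of additional replacements for the two quote characters. The wrapper name and the example URL are assumptions for the example.

```python
from xml.sax.saxutils import escape


def escape_for_sitemap(value):
    """Escape the five special characters for use in XML data values."""
    # escape() handles &, < and > itself; the quotes are passed as extras.
    return escape(value, {"'": "&apos;", '"': "&quot;"})


escaped = escape_for_sitemap("http://www.example.com/view?id=1&lang='en'")
print(escaped)  # http://www.example.com/view?id=1&amp;lang=&apos;en&apos;
```

XML libraries that build the document tree, rather than concatenating strings, normally perform this escaping automatically when the document is serialized.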
Best practice for optimising a sitemap index for search engine crawlability is to ensure the index refers only to sitemaps as opposed to other sitemap indexes. Nesting a sitemap index within a sitemap index is invalid according to Google.
A number of additional XML sitemap types outside of the scope of the sitemaps protocol are supported by Google to allow webmasters to provide additional data on the content of their websites. Video and image sitemaps are intended to improve the capability of websites to rank in image and video searches.
Video sitemaps indicate data related to embedding and autoplaying, preferred thumbnails to show in search results, publication date, video duration, and other metadata. Video sitemaps are also used to allow search engines to index videos that are embedded on a website, but that are hosted externally, such as on Vimeo or YouTube.
Image sitemaps are used to indicate image metadata, such as licensing information, geographic location, and an image's caption.
In December 2011, Google announced annotations for sites that want to target users in many languages and, optionally, countries. A few months later Google announced on its official blog that it was adding support for specifying the rel="alternate" and hreflang annotations in Sitemaps. Compared with HTML link elements, until then the only option, the Sitemaps approach offered advantages that included a smaller page size and easier deployment for some websites.
An example of a multilingual Sitemap follows. Suppose a site targets English-language users through http://www.example.com/en and Greek-language users through http://www.example.com/gr. Until then, the only option was to add the hreflang annotation either in the HTTP header or as HTML link elements on both URLs, like this (el is the ISO 639-1 language code for Greek):
<link rel="alternate" hreflang="en" href="http://www.example.com/en" >
<link rel="alternate" hreflang="el" href="http://www.example.com/gr" >
The following equivalent markup can now be used in Sitemaps instead:
<url>
  <loc>http://www.example.com/en</loc>
  <xhtml:link rel="alternate" hreflang="el" href="http://www.example.com/gr" />
  <xhtml:link rel="alternate" hreflang="en" href="http://www.example.com/en" />
</url>
<url>
  <loc>http://www.example.com/gr</loc>
  <xhtml:link rel="alternate" hreflang="el" href="http://www.example.com/gr" />
  <xhtml:link rel="alternate" hreflang="en" href="http://www.example.com/en" />
</url>
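Because every URL entry has to repeat the full set of alternates, this markup is convenient to generate. The Python sketch below builds such a Sitemap from a mapping of hreflang values to URLs; the function name is hypothetical, the `xhtml` prefix is bound to the XHTML namespace on the root element, and the hreflang value `el` (the ISO 639-1 code for Greek) is an assumption of the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_multilingual_sitemap(alternates):
    """Build a Sitemap in which every URL lists all language alternates.

    `alternates` maps an hreflang value to the URL for that language.
    """
    # Namespaces are declared as plain attributes on the root element, and
    # the prefixed tag name "xhtml:link" is written out as given.
    urlset = ET.Element(
        "urlset", {"xmlns": SITEMAP_NS, "xmlns:xhtml": XHTML_NS}
    )
    for loc in alternates.values():
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        for lang, href in alternates.items():
            ET.SubElement(
                url, "xhtml:link", rel="alternate", hreflang=lang, href=href
            )
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_multilingual_sitemap({
    "en": "http://www.example.com/en",
    "el": "http://www.example.com/gr",
})
print(xml_text)
```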
Android "Q" is the upcoming tenth major release and the 17th version of the Android mobile operating system. The first beta of Android Q was released on March 13, 2019 for all Google Pixel phones. The final release of Android Q is scheduled for the third quarter of 2019.
BigQuery
BigQuery is a RESTful web service that enables interactive analysis of massively large datasets working in conjunction with Google Storage. It is a serverless Platform as a Service (PaaS) that may be used complementarily with MapReduce.
Biositemap
A Biositemap is a way for a biomedical research institution or organisation to show how biological information is distributed throughout its information technology systems and networks. This information may be shared with other organisations and researchers.
The Biositemap enables web browsers, crawlers and robots to easily access and process the information to use in other systems, media and computational formats. Biositemaps protocols provide clues for the Biositemap web harvesters, allowing them to find resources and content across the whole interlink of the Biositemap system. This means that human or machine users can access any relevant information on any topic across all organisations throughout the Biositemap system and bring it to their own systems for assimilation or analysis.
Chromebit
The Chromebit is a dongle running Google's Chrome OS operating system. When placed in the HDMI port of a television or a monitor, this device turns that display into a personal computer. Chromebit allows adding a keyboard or mouse over Bluetooth or Wi-Fi. The device was announced in April 2015 and began shipping that November.
G Suite Marketplace
G Suite Marketplace (formerly Google Apps Marketplace) is a product of Google Inc. It is an online store for web applications that work with Google Apps (Gmail, Google Docs, Google Sites, Google Calendar, Google Contacts, etc.) and with third party software. Some Apps are free. Apps are based on Google APIs or on Google Apps Script.
Google Base
Google Base was a database provided by Google into which any user could add almost any type of content, such as text, images, and structured information in formats such as XML, PDF, Excel, RTF, or WordPerfect. As of September 2010, the product has been downgraded to Google Merchant Center. If Google found it relevant, submitted content could appear on its shopping search engine, Google Maps or even the web search. The piece of content could then be labeled with attributes like the ingredients for a recipe or the camera model for stock photography. Because information about the service was leaked before public release, it generated much interest in the information technology community prior to release. Google subsequently responded on their blog with an official statement:
"You may have seen stories today reporting on a new product that we're testing, and speculating about our plans. Here's what's really going on. We are testing a new way for content owners to submit their content to Google, which we hope will complement existing methods such as our web crawl and Google Sitemaps. We think it's an exciting product, and we'll let you know when there's more news."
Files can be uploaded to the Google Base servers by browsing your computer or the web, by various FTP methods, or by API coding. Online tools are provided to view the number of downloads of the user's files, and other performance measures.
On December 17, 2010, it was announced that Google Base's API is deprecated in favor of a set of new APIs known as Google Shopping APIs.
Google Behind the Screen
"Google: Behind the Screen" (Dutch: "Google: achter het scherm") is a 51-minute episode of the documentary television series Backlight about Google. The episode was first broadcast on 7 May 2006 by VPRO on Nederland 3. It was directed by IJsbrand van Veelen, produced by Nicoline Tania, and edited by Doke Romeijn and Frank Wiering.
Google Dataset Search
Google Dataset Search is a search engine from Google that helps researchers locate online data that is freely available for use. The company launched the service on September 5, 2018, and stated that the product was targeted at scientists and data journalists.
Google Dataset Search complements Google Scholar, the company's search engine for academic studies and reports.
Google Finance
Google Finance is a website focusing on business news and financial information hosted by Google.
Google Fit
Google Fit is a health-tracking platform developed by Google for the Android operating system and Wear OS. It is a single set of APIs that blends data from multiple apps and devices. Google Fit uses sensors in a user's activity tracker or mobile device to record physical fitness activities (such as walking or cycling), which are measured against the user's fitness goals to provide a comprehensive view of their fitness.
Google Forms
Google Forms is a survey administration app that is included in the Google Drive office suite along with Google Docs, Google Sheets, and Google Slides.
Forms features all of the collaboration and sharing features found in Docs, Sheets, and Slides.
Google Guice
Google Guice (pronounced "juice") is an open-source software framework for the Java platform released by Google under the Apache License. It provides support for dependency injection using annotations to configure Java objects. Dependency injection is a design pattern whose core principle is to separate behavior from dependency resolution.
Guice allows implementation classes to be bound programmatically to an interface, then injected into constructors, methods or fields using an @Inject annotation. When more than one implementation of the same interface is needed, the user can create custom annotations that identify an implementation, then use that annotation when injecting it.
Being the first generic framework for dependency injection using Java annotations in 2008, Guice won the 18th Jolt Award for best Library, Framework, or Component.
History of Google
The Google company was officially launched in 1998 by Larry Page and Sergey Brin to market Google Search, which has become the most used web-based search engine. Page and Brin, students at Stanford University in California, developed a search algorithm at first known as "BackRub" in 1996. The search engine soon proved successful and the expanding company moved several times, finally settling at Mountain View in 2003. This marked a phase of rapid growth, with the company making its initial public offering in 2004 and quickly becoming one of the world's largest media companies. The company launched Google News in 2002, Gmail in 2004, Google Maps in 2005, Google Chrome in 2008, and the social network known as Google+ in 2011, in addition to many other products. In 2015, Google became the main subsidiary of the holding company Alphabet Inc.
The search engine went through numerous updates in attempts to combat search engine optimization abuse, provide dynamic updating of results, and make the indexing system rapid and flexible. Search results started to be personalized in 2005, and later Google Suggest autocompletion was introduced. From 2007, Universal Search provided all types of content, not just text content, in search results.
Google has engaged in partnerships with NASA, AOL, Sun Microsystems, News Corporation, Sky UK, and others. The company set up a charitable offshoot, Google.org, in 2005. Google was involved in a 2006 legal dispute in the US over a court order to disclose URLs and search strings, and has been the subject of tax avoidance investigations in the UK.
The name Google is a variant of googol, chosen to suggest very large numbers.
Hreflang
The rel="alternate" hreflang="x" link attribute is an HTML meta element described in RFC 5988. Hreflang specifies the language and optional geographic restrictions for a document. Hreflang is interpreted by search engines and can be used by webmasters to clarify the lingual and geographical targeting of a website.
Narayanan Shivakumar
Narayanan Shivakumar is an entrepreneur who worked for Google between 2001 and 2010. He had the title of Distinguished Entrepreneur and was active at Google's Seattle-Kirkland R&D Center; earlier, he was an Engineering Director and launched AdSense, Sitemaps, Google Search Appliance and other key products. He gave a keynote at Google Developer Day in Beijing in June 2007.
Before he joined Google in its early days, he obtained his PhD in Computer Science from Stanford University. His advisor was Prof. Hector Garcia-Molina. Before Google, he cofounded Gigabeat.com, an online music startup acquired by Napster.
Resources of a Resource
Resources of a Resource (ROR) is an XML format for describing the content of an internet resource or website in a generic fashion so this content can be better understood by search engines, spiders, web applications, etc. The ROR format provides several pre-defined terms for describing objects like sitemaps, products, events, reviews, jobs, classifieds, etc. The format can be extended with custom terms.
RORweb.com is the official website of ROR; the ROR format was created by AddMe.com as a way to help search engines better understand content and meaning. Similar concepts, like Google Sitemaps and Google Base, have also been developed since the introduction of the ROR format.
ROR objects are placed in an ROR feed called ror.xml. This file is typically located in the root directory of the resource or website it describes. When a search engine like Google or Yahoo searches the web to determine how to categorize content, the ROR feed allows the search engine's "spider" to quickly identify all the content and attributes of the website.
This has three main benefits:
It allows the spider to correctly categorize the content of the website into its engine.
It allows the spider to extract very detailed information about the objects on a website (sitemaps, products, events, reviews, jobs, classifieds, etc.)
It allows the website owner to optimize his site for inclusion of its content into the search engines.
Robots exclusion standard
The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned. Robots are often used by search engines to categorize websites. Not all robots cooperate with the standard; email harvesters, spambots, malware and robots that scan for security vulnerabilities may even start with the portions of the website where they have been told to stay out. The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.
Site map
A site map (or sitemap) is a list of pages of a web site.
There are three primary kinds of site map:
Site maps used during the planning of a Web site by its designers.
Human-visible listings, typically hierarchical, of the pages on a site.
Structured listings intended for web crawlers such as search engines.
Yahoo! Site Explorer
Yahoo! Site Explorer (YSE) was a Yahoo! service which allowed users to view information on websites in Yahoo!'s search index. The service was closed on November 21, 2011 and merged with Bing Webmaster Tools, a tool similar to Google Search Console (previously Google Webmaster Tools). In particular, it was useful for finding information on backlinks pointing to a given webpage or domain because YSE offered full, timely backlink reports for any site. After merging with Bing Webmaster Tools, the service only offers full backlink reports to sites owned by the webmaster. Reports for sites not owned by the webmaster are limited to 1,000 links.
Webmasters who added a special authentication code to their websites were also allowed to:
See extra information on their sites
Submit up to 20 URL removal requests for their domains to Yahoo!.
Rewrite dynamic URLs from their site by either removing a dynamic parameter or by using a default value for a parameter.
Submit feeds for Yahoo Search Monkey
View Errors Yahoo encountered while crawling their web site