Sitemaps

The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs in the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.

History

Google first introduced Sitemaps 0.84 in June 2005 so web developers could publish lists of links from across their sites. Google, MSN and Yahoo announced joint support for the Sitemaps protocol in November 2006. The schema version was changed to "Sitemap 0.90", but no other changes were made.

In April 2007, Ask.com and IBM announced support for Sitemaps. Google, Yahoo and Microsoft also announced auto-discovery for sitemaps through robots.txt. In May 2007, the state governments of Arizona, California, Utah and Virginia announced they would use Sitemaps on their web sites.

The Sitemaps protocol is based on ideas[1] from "Crawler-friendly Web Servers,"[2] with improvements including auto-discovery through robots.txt and the ability to specify the priority and change frequency of pages.

Purpose

Sitemaps are particularly beneficial on websites where:

  • Some areas of the website are not available through the browsable interface
  • Webmasters use rich Ajax, Silverlight, or Flash content that is not normally processed by search engines
  • The site is very large and there is a chance for the web crawlers to overlook some of the new or recently updated content
  • A huge number of pages are isolated or not well linked together
  • There are few external links

File format

The Sitemap Protocol format consists of XML tags. The file itself must be UTF-8 encoded. Sitemaps can also be just a plain text list of URLs. They can also be compressed in .gz format.

A sample Sitemap that contains just one URL and uses all optional tags is shown below.

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>http://example.com/</loc>
        <lastmod>2006-11-18</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
    </url>
</urlset>
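
A document like the sample above can also be generated programmatically. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree; the function name build_sitemap and the dictionary layout of each entry are assumptions made for illustration and are not part of the protocol. The schemaLocation attributes of the sample are left out for brevity.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Return a Sitemap document as UTF-8 bytes (hypothetical helper).

    Each entry is a dict that must contain "loc"; "lastmod",
    "changefreq" and "priority" are optional.
    """
    # Setting xmlns as a plain attribute puts every child element
    # in the Sitemap namespace when the output is parsed.
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    # xml_declaration requires Python 3.8 or later.
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

ElementTree escapes characters such as ampersands in element text automatically, so URLs can be passed in unescaped.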

The Sitemap XML protocol is also extended to provide a way of listing multiple Sitemaps in a 'Sitemap index' file. The maximum Sitemap size of 50 MiB or 50,000 URLs[3] means this is necessary for large sites.

An example of a Sitemap index referencing one separate sitemap follows.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2014-10-01T18:23:17+00:00</lastmod>
   </sitemap>
</sitemapindex>
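
An index of this shape could be produced as follows. This is a sketch only: the helper name build_sitemap_index and the choice of (loc, lastmod) pairs as input are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap_index(sitemaps):
    """Return a Sitemap index as UTF-8 bytes (hypothetical helper).

    sitemaps is an iterable of (loc, lastmod) pairs; lastmod may be
    None, because the element is optional.
    """
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in sitemaps:
        node = ET.SubElement(index, "sitemap")
        ET.SubElement(node, "loc").text = loc
        if lastmod is not None:
            ET.SubElement(node, "lastmod").text = lastmod
    # xml_declaration requires Python 3.8 or later.
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)
```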

Element definitions

The definitions for the elements are shown below:[3]

Element | Required? | Description
<urlset> | Yes | The document-level element for the Sitemap. The rest of the document after the '<?xml version>' element must be contained in this.
<url> | Yes | Parent element for each entry.
<sitemapindex> | Yes | The document-level element for the Sitemap index. The rest of the document after the '<?xml version>' element must be contained in this.
<sitemap> | Yes | Parent element for each entry in the index.
<loc> | Yes | Provides the full URL of the page or sitemap, including the protocol (e.g. http, https) and a trailing slash, if required by the site's hosting server. This value must be shorter than 2,048 characters. Note that ampersands in the URL need to be escaped as &amp;.
<lastmod> | No | The date that the file was last modified, in ISO 8601 format. This can display the full date and time or, if desired, may simply be the date in the format YYYY-MM-DD.
<changefreq> | No | How frequently the page may change:
  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never

"Always" is used to denote documents that change each time that they are accessed. "Never" is used to denote archived URLs (i.e. files that will not be changed again).

This is used only as a guide for crawlers, and is not used to determine how frequently pages are indexed.

Does not apply to <sitemap> elements.

<priority> | No | The priority of that URL relative to other URLs on the site. This allows webmasters to suggest to crawlers which pages are considered more important.

The valid range is from 0.0 to 1.0, with 1.0 being the most important. The default value is 0.5.

Rating all pages on a site with a high priority does not affect search listings, as it is only used to suggest to the crawlers how important pages in the site are to one another.

Does not apply to <sitemap> elements.

Support for the elements that are not required can vary from one search engine to another.[3]
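
The constraints in the definitions above can be checked before a Sitemap is published. The sketch below is illustrative only: check_entry is a hypothetical helper, and testing for "://" is a simplifying assumption standing in for "includes the protocol".

```python
VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly",
                    "monthly", "yearly", "never"}

def check_entry(loc, changefreq=None, priority=None):
    """Return a list of problems found in one <url> entry (empty if none)."""
    problems = []
    if len(loc) >= 2048:
        problems.append("loc must be shorter than 2,048 characters")
    if "://" not in loc:
        # Assumption: a full URL always contains "scheme://".
        problems.append("loc must include the protocol")
    if changefreq is not None and changefreq not in VALID_CHANGEFREQ:
        problems.append("unknown changefreq: " + changefreq)
    if priority is not None and not 0.0 <= priority <= 1.0:
        problems.append("priority must be between 0.0 and 1.0")
    return problems
```

For example, check_entry("http://example.com/", "daily", 0.8) returns an empty list, while a changefreq of "sometimes" or a priority of 1.5 would each add one message.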

Other formats

Text file

The Sitemaps protocol allows the Sitemap to be a simple list of URLs in a text file. The file specifications of XML Sitemaps apply to text Sitemaps as well: the file must be UTF-8 encoded, must be no larger than 50 MiB, and must contain no more than 50,000 URLs,[4] but it can be compressed as a gzip file.[3]
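
A text Sitemap can be written with a few lines of code. The sketch below assumes one URL per line; the function names text_sitemap and gzip_sitemap are hypothetical, and only the URL count is checked, not the file size.

```python
import gzip

MAX_URLS = 50_000

def text_sitemap(urls):
    """Return a plain-text Sitemap as UTF-8 bytes, one URL per line."""
    urls = list(urls)
    if len(urls) > MAX_URLS:
        raise ValueError("a text Sitemap may list at most 50,000 URLs")
    return ("\n".join(urls) + "\n").encode("utf-8")

def gzip_sitemap(data):
    """Compress an encoded Sitemap so it can be served as a .gz file."""
    return gzip.compress(data)
```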

Syndication feed

A syndication feed is a permitted method of submitting URLs to crawlers; this is advised mainly for sites that already have syndication feeds. One stated drawback is this method might only provide crawlers with more recently created URLs, but other URLs can still be discovered during normal crawling.[3]

It can be beneficial to have a syndication feed as a delta update (containing only the newest content) to supplement a complete sitemap.

Search engine submission

If Sitemaps are submitted directly to a search engine (pinged), the search engine returns status information and any processing errors. The details of submission vary between search engines. The location of the sitemap can also be included in the robots.txt file by adding the following line:

Sitemap: <sitemap_location>

The <sitemap_location> should be the complete URL to the sitemap, such as:

https://www.example.org/sitemap.xml

This directive is independent of the user-agent line, so it doesn't matter where it is placed in the file. If the website has several sitemaps, multiple "Sitemap:" records may be included in robots.txt, or the URL can simply point to the main sitemap index file.
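
Because the directive does not depend on its position in the file, a crawler can collect every "Sitemap:" record with a single pass over robots.txt. The following sketch shows one way to do so; the function name sitemap_urls is hypothetical, and matching the field name case-insensitively and discarding "#" comments are assumptions about how lenient a parser should be.

```python
def sitemap_urls(robots_txt):
    """Collect the values of all "Sitemap:" records in a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Split only at the first colon, so the colon inside
        # "https://" stays part of the value.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found
```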

The following table lists the sitemap submission URLs for several major search engines:

Search engine | Submission URL | Help page | Market
Baidu | https://zhanzhang.baidu.com/dashboard/index | Baidu Webmaster Dashboard | China, Hong Kong, Singapore
Bing (and Yahoo!) | https://www.bing.com/webmaster/ping.aspx?siteMap= | Bing Webmaster Tools | Global
Google | https://www.google.com/webmasters/tools/ping?sitemap= | Submitting a Sitemap | Global
Yandex | https://webmaster.yandex.com/site/map.xml | Sitemaps files | Russia, Ukraine, Belarus, Kazakhstan, Turkey

Sitemap URLs submitted using the sitemap submission URLs need to be URL-encoded, for example: replacing : (colon) with %3A, / (slash) with %2F.[3]
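
The encoding step can be illustrated with Python's urllib.parse.quote. In this sketch the helper name ping_url is hypothetical; it only builds the request URL and sends nothing, so it does not show whether a given submission endpoint still accepts requests.

```python
from urllib.parse import quote

def ping_url(submission_prefix, sitemap_url):
    """Append a fully percent-encoded Sitemap URL to a submission prefix."""
    # safe="" makes quote() encode "/" as well; ":" is always encoded.
    return submission_prefix + quote(sitemap_url, safe="")
```

With this helper, https://www.example.org/sitemap.xml is appended as https%3A%2F%2Fwww.example.org%2Fsitemap.xml.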

Limitations for search engine indexing

Sitemaps supplement and do not replace the existing crawl-based mechanisms that search engines already use to discover URLs. Using this protocol does not guarantee that web pages will be included in search indexes, nor does it influence the way that pages are ranked in search results. Specific examples are provided below.

  • Google - Webmaster Support on Sitemaps: "Using a sitemap doesn't guarantee that all the items in your sitemap will be crawled and indexed, as Google processes rely on complex algorithms to schedule crawling. However, in most cases, your site will benefit from having a sitemap, and you'll never be penalized for having one."[5]
  • Bing - Bing uses the standard sitemaps.org protocol, and its handling is very similar to Google's, described above.
  • Yahoo - After the search deal between Yahoo! Inc. and Microsoft commenced, Yahoo! Site Explorer merged with Bing Webmaster Tools.

Sitemap limits

Sitemap files have a limit of 50,000 URLs and 50 MiB (52,428,800 bytes) per sitemap. Sitemaps can be compressed using gzip, reducing bandwidth consumption. Multiple sitemap files are supported, with a Sitemap index file serving as an entry point. Sitemap index files may not list more than 50,000 Sitemaps, must be no larger than 50 MiB, and can also be compressed. A site can have more than one Sitemap index file.[3]
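
For a large site, the URL-count limit can be respected by splitting the URL list before any files are written. The sketch below checks only the counts, not the byte-size limit, and the name split_into_sitemaps is an assumption for illustration.

```python
MAX_URLS_PER_SITEMAP = 50_000
MAX_SITEMAPS_PER_INDEX = 50_000

def split_into_sitemaps(urls, limit=MAX_URLS_PER_SITEMAP):
    """Split URLs into consecutive chunks of at most `limit` URLs each."""
    urls = list(urls)
    chunks = [urls[i:i + limit] for i in range(0, len(urls), limit)]
    if len(chunks) > MAX_SITEMAPS_PER_INDEX:
        raise ValueError("more Sitemaps than a single Sitemap index may list")
    return chunks
```

Each chunk would then become one Sitemap file, with all of the files listed in a Sitemap index.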

As with all XML files, any data values (including URLs) must use entity escape codes for the characters ampersand (&), single quote ('), double quote ("), less than (<), and greater than (>).
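
When Sitemap XML is assembled by hand rather than through an XML library, these five characters can be escaped with the standard library's xml.sax.saxutils.escape. This is a small sketch; the wrapper name escape_value is hypothetical.

```python
from xml.sax.saxutils import escape

def escape_value(value):
    """Escape &, <, > and both quote characters for use in XML data."""
    # escape() handles &, < and > itself; the mapping adds the quotes.
    return escape(value, {"'": "&apos;", '"': "&quot;"})
```

For instance, the URL http://www.example.com/view?a=1&b=2 becomes http://www.example.com/view?a=1&amp;b=2.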

Best practice for optimising a sitemap index for search engine crawlability is to ensure the index refers only to sitemaps as opposed to other sitemap indexes. Nesting a sitemap index within a sitemap index is invalid according to Google.

Additional Sitemap Types

A number of additional XML sitemap types outside of the scope of the sitemaps protocol are supported by Google to allow webmasters to provide additional data on the content of their websites. Video and image sitemaps are intended to improve the capability of websites to rank in image and video searches.[6][7]

Video Sitemaps

Video sitemaps indicate data related to embedding and autoplaying, preferred thumbnails to show in search results, publication date, video duration, and other metadata.[7] Video sitemaps are also used to allow search engines to index videos that are embedded on a website, but that are hosted externally, such as on Vimeo or YouTube.

Image Sitemaps

Image sitemaps are used to indicate image metadata, such as licensing information, geographic location, and an image's caption.[6]

Google News Sitemaps

Google supports a Google News sitemap type for facilitating quick indexing of time-sensitive news subjects.[8][9]

Multilingual and multinational Sitemaps

In December 2011, Google announced the annotations for sites that want to target users in many languages and, optionally, countries. A few months later Google announced, on their official blog,[10] that they were adding support for specifying the rel="alternate" and hreflang annotations in Sitemaps. Compared with HTML link elements, the option used until then, the Sitemaps option offered advantages that included a smaller page size and easier deployment for some websites.

For example, consider a site that targets English-language users through http://www.example.com/en and Greek-language users through http://www.example.com/gr. Until then, the only option was to add the hreflang annotation either in the HTTP header or as HTML link elements on both URLs, like this (hreflang takes an ISO 639-1 language code, which is el for Greek):

 <link rel="alternate" hreflang="en" href="http://www.example.com/en" >
 <link rel="alternate" hreflang="el" href="http://www.example.com/gr" >

The following equivalent markup can now be used in Sitemaps instead:

 <url>
   <loc>http://www.example.com/en</loc>
   <xhtml:link
     rel="alternate"
     hreflang="el"
     href="http://www.example.com/gr" />
   <xhtml:link
     rel="alternate"
     hreflang="en"
     href="http://www.example.com/en" />
 </url>
 <url>
   <loc>http://www.example.com/gr</loc>
   <xhtml:link
     rel="alternate"
     hreflang="el"
     href="http://www.example.com/gr" />
   <xhtml:link
     rel="alternate"
     hreflang="en"
     href="http://www.example.com/en" />
 </url>
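
Since every language version must list all of its alternates, including itself, the markup grows quickly and is a natural candidate for generation. The sketch below is one possible approach using xml.etree.ElementTree; the function name build_multilingual_sitemap and the dictionary input are assumptions for illustration, and the xhtml prefix is bound to the XHTML namespace http://www.w3.org/1999/xhtml.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"

def build_multilingual_sitemap(alternates):
    """Return Sitemap XML (str) with one <url> per language version.

    alternates maps an hreflang value to the URL of that version.
    """
    # Note: register_namespace changes module-wide serializer state.
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for loc in alternates.values():
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = loc
        # Every entry repeats the full set of alternates.
        for lang, href in alternates.items():
            ET.SubElement(url, "{%s}link" % XHTML_NS,
                          rel="alternate", hreflang=lang, href=href)
    return ET.tostring(urlset, encoding="unicode")
```

Called with a mapping of two language codes to their URLs, the function returns a urlset containing two url entries, each carrying both xhtml:link annotations.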

References

  1. ^ M.L. Nelson; J.A. Smith; del Campo; H. Van de Sompel; X. Liu (2006). "Efficient, Automated Web Resource Harvesting" (PDF). WIDM'06.
  2. ^ O. Brandman, J. Cho, Hector Garcia-Molina, and Narayanan Shivakumar (2000). "Crawler-friendly web servers". Proceedings of ACM SIGMETRICS Performance Evaluation Review, Volume 28, Issue 2. doi:10.1145/362883.362894.
  3. ^ a b c d e f g "Sitemaps XML format". Sitemaps.org. 2016-11-21. Retrieved 2016-12-01.
  4. ^ "Build and submit a sitemap - Search Console Help". Support.google.com. Retrieved 5 August 2018.
  5. ^ "About Google Sitemaps". 2016-12-01. Retrieved 2016-12-01.
  6. ^ a b "Image Sitemaps". Google Search Console. Retrieved 28 December 2018.
  7. ^ a b "Video Sitemaps". Google Search Console. Retrieved 28 December 2018.
  8. ^ Bigby, Garenne. "Why You should be using a Google News Sitemap". Dyno Mapper. Retrieved 28 December 2018.
  9. ^ "Google News Sitemaps". Google Search Console. Retrieved 28 December 2018.
  10. ^ "Multilingual and multinational site annotations in Sitemaps". Google Webmaster Central Blog. Pierre Far. May 24, 2012.

Android Q

Android "Q" is the upcoming tenth major release and the 17th version of the Android mobile operating system. The first beta of Android Q was released on March 13, 2019 for all Google Pixel phones. The final release of Android Q is scheduled to be released in the third quarter of 2019.

BigQuery

BigQuery is a RESTful web service that enables interactive analysis of massively large datasets working in conjunction with Google Storage. It is a serverless Platform as a Service (PaaS) that may be used complementarily with MapReduce.

Biositemap

A Biositemap is a way for a biomedical research institution or organisation to show how biological information is distributed throughout their Information Technology systems and networks. This information may be shared with other organisations and researchers.

The Biositemap enables web browsers, crawlers and robots to easily access and process the information to use in other systems, media and computational formats. Biositemaps protocols provide clues for the Biositemap web harvesters, allowing them to find resources and content across the whole interlink of the Biositemap system. This means that human or machine users can access any relevant information on any topic across all organisations throughout the Biositemap system and bring it to their own systems for assimilation or analysis.

Chromebit

The Chromebit is a dongle running Google's Chrome OS operating system. When placed in the HDMI port of a television or a monitor, this device turns that display into a personal computer. Chromebit allows adding a keyboard or mouse over Bluetooth or Wi-Fi. The device was announced in April 2015 and began shipping that November.

GData

GData (Google Data Protocol) provides a simple protocol for reading and writing data on the Internet, designed by Google. GData combines common XML-based syndication formats (Atom and RSS) with a feed-publishing system based on the Atom Publishing Protocol, plus some extensions for handling queries. It relies on XML or JSON as a data format.

Google provides GData client libraries for Java, JavaScript, .NET, PHP, Python, and Objective-C.

Gayglers

Gayglers is a term for the gay, lesbian, bisexual and transgender employees of Google. The term was first used for all LGBT employees at the company in 2006, and was conceived as a play on the word "Googler" (a colloquial term to describe all employees of Google). The term, first published openly by The New York Times in 2006 to describe some of the employees at the company's new Manhattan office, came into public awareness when Google began to participate as a corporate sponsor and float participant at several pride parades in San Francisco, New York, Dublin and Madrid during 2006. Google has since increased its public backing of LGBT-positive events and initiatives, including an announcement of opposition to Proposition 8.

Google Base

Google Base was a database provided by Google into which any user could add almost any type of content, such as text, images, and structured information in formats such as XML, PDF, Excel, RTF, or WordPerfect. In September 2010, the product was downgraded to Google Merchant Center. If Google found it relevant, submitted content could appear on its shopping search engine, Google Maps or even the web search. The piece of content could then be labeled with attributes like the ingredients for a recipe or the camera model for stock photography. Because information about the service was leaked before public release, it generated much interest in the information technology community prior to release. Google subsequently responded on their blog with an official statement:

"You may have seen stories today reporting on a new product that we're testing, and speculating about our plans. Here's what's really going on. We are testing a new way for content owners to submit their content to Google, which we hope will complement existing methods such as our web crawl and Google Sitemaps. We think it's an exciting product, and we'll let you know when there's more news."

Files could be uploaded to the Google Base servers by browsing the user's computer or the web, by various FTP methods, or by API coding. Online tools were provided to view the number of downloads of the user's files, and other performance measures.

On December 17, 2010, it was announced that Google Base's API is deprecated in favor of a set of new APIs known as Google Shopping APIs.

Google Dataset Search

Google Dataset Search is a search engine from Google that helps researchers locate online data that is freely available for use. The company launched the service on September 5, 2018, and stated that the product was targeted at scientists and data journalists.

Google Dataset Search complements Google Scholar, the company's search engine for academic studies and reports.

Google Finance

Google Finance is a website focusing on business news and financial information hosted by Google.

Google Fit

Google Fit is a health-tracking platform developed by Google for the Android operating system and Wear OS. It is a single set of APIs that blends data from multiple apps and devices. Google Fit uses sensors in a user's activity tracker or mobile device to record physical fitness activities (such as walking or cycling), which are measured against the user's fitness goals to provide a comprehensive view of their fitness.

Google Forms

Google Forms is a survey administration app that is included in the Google Drive office suite along with Google Docs, Google Sheets, and Google Slides.

Forms features all of the collaboration and sharing features found in Docs, Sheets, and Slides.

Google The Thinking Factory

Google: The Thinking Factory is a 2008 documentary film about Google Inc., written and directed by Gilles Cayatte.

Hreflang

The rel="alternate" hreflang="x" link attribute is an HTML meta element described in RFC 5988. Hreflang specifies the language and optional geographic restrictions for a document. Hreflang is interpreted by search engines and can be used by webmasters to clarify the lingual and geographical targeting of a website.

Jamboard

Jamboard is an interactive whiteboard developed by Google, as part of the G Suite family. It was officially announced on 25 October 2016. It has a 55" 4K touchscreen display, and will have compatibility for online collaboration through cross-platform support. The display can also be mounted onto a wall or be configured into a stand.

Narayanan Shivakumar

Narayanan Shivakumar is an entrepreneur who worked for Google between 2001 and 2010. He held the title of Distinguished Entrepreneur and was active at Google's Seattle-Kirkland R&D Center; earlier, he was an Engineering Director and launched AdSense, Sitemaps, Google Search Appliance and other key products. He gave a keynote at Google Developer Day in Beijing in June 2007.

Before he joined Google in its early days, he obtained his PhD in Computer Science from Stanford University. His advisor was Prof. Hector Garcia-Molina. Before Google, he cofounded Gigabeat.com, an online music startup acquired by Napster.

Resources of a Resource

Resources of a Resource (ROR) is an XML format for describing the content of an internet resource or website in a generic fashion so this content can be better understood by search engines, spiders, web applications, etc. The ROR format provides several pre-defined terms for describing objects like sitemaps, products, events, reviews, jobs, classifieds, etc. The format can be extended with custom terms.

RORweb.com is the official website of ROR; the ROR format was created by AddMe.com as a way to help search engines better understand content and meaning. Similar concepts, like Google Sitemaps and Google Base, have also been developed since the introduction of the ROR format.

ROR objects are placed in an ROR feed called ror.xml. This file is typically located in the root directory of the resource or website it describes. When a search engine like Google or Yahoo searches the web to determine how to categorize content, the ROR feed allows the search engine's "spider" to quickly identify all the content and attributes of the website.

This has three main benefits:

It allows the spider to correctly categorize the content of the website into its engine.

It allows the spider to extract very detailed information about the objects on a website (sitemaps, products, events, reviews, jobs, classifieds, etc.)

It allows the website owner to optimize his site for inclusion of its content into the search engines.

Robots exclusion standard

The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned. Robots are often used by search engines to categorize websites. Not all robots cooperate with the standard; email harvesters, spambots, malware and robots that scan for security vulnerabilities may even start with the portions of the website where they have been told to stay out. The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.

Site map

A site map (or sitemap) is a list of pages of a web site.

There are three primary kinds of site map:

Site maps used during the planning of a Web site by its designers.

Human-visible listings, typically hierarchical, of the pages on a site.

Structured listings intended for web crawlers such as search engines.

Yahoo! Site Explorer

Yahoo! Site Explorer (YSE) was a Yahoo! service which allowed users to view information on websites in Yahoo!'s search index. The service was closed on November 21, 2011 and merged with Bing Webmaster Tools, a tool similar to Google Search Console (previously Google Webmaster Tools). In particular, it was useful for finding information on backlinks pointing to a given webpage or domain because YSE offered full, timely backlink reports for any site. After merging with Bing Webmaster Tools, the service only offers full backlink reports to sites owned by the webmaster. Reports for sites not owned by the webmaster are limited to 1,000 links.

Webmasters who added a special authentication code to their websites were also allowed to:

See extra information on their sites

Submit Sitemaps

Submit up to 20 URL removal requests for their domains to Yahoo!.

Rewrite dynamic URLs from their site by either removing a dynamic parameter or by using a default value for a parameter.

Submit feeds for Yahoo Search Monkey

View Errors Yahoo encountered while crawling their web site

This page is based on a Wikipedia article written by authors (here).
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.