Apache Nutch

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Screenshot: Nutch web interface search
Developer(s): Apache Software Foundation
Stable release: 1.14 and 2.3.1 / December 23, 2017
Written in: Java
Operating system: Cross-platform
Type: Web crawler
License: Apache License 2.0
Website: nutch.apache.org

Features

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
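
As an illustration of the plug-in model, the sketch below shows what a simple URL-filtering plug-in can look like. It is a minimal sketch only, assuming Nutch 1.x's org.apache.nutch.net.URLFilter extension point and Hadoop's Configuration class; the class name and the HTTPS-only rule are made up for the example, and a real plug-in would additionally need a plugin.xml descriptor and an entry in the crawl configuration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilter;

    // Illustrative plug-in: keeps only HTTPS URLs in the crawl.
    public class HttpsOnlyUrlFilter implements URLFilter {

        private Configuration conf;

        @Override
        public String filter(String urlString) {
            // Returning the URL keeps it; returning null removes it from the crawl.
            return urlString.startsWith("https://") ? urlString : null;
        }

        public void setConf(Configuration conf) {
            this.conf = conf;
        }

        public Configuration getConf() {
            return conf;
        }
    }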

The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.

History

Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.

In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. The two facilities were later spun out into their own subproject, called Hadoop.

In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been considered an independent, top-level project of the Apache Software Foundation.[1]

In February 2014 the Common Crawl project adopted Nutch for its open, large-scale web crawl.[2]

While it was once a goal for the Nutch project to release a global large-scale web search engine, that is no longer the case.

Release history

Releases from the 1.x and 2.x branches are listed together in chronological order.

Version  Release date  Description
1.1 2010-06-06 This release includes several major upgrades of existing libraries (Hadoop, Solr, Tika, etc.) on which Nutch depends. Various bug fixes, and speedups (e.g., to Fetcher2) have also been included.
1.2 2010-10-24 This release includes several improvements (addition of parse-html as a selectable parser again, configurable per-field indexing), new features (including adding timing information to all Tool classes, and implementation of parser timeouts), and bug fixes (fixing an NPE in distributed search, fixing of XML formatting issues per Document fields).
1.3 2011-06-07 This release includes several improvements (improved RSS parsing support, tighter integration with Apache Tika, external parsing support, improved language identification and an order of magnitude smaller source release tarball—only about 2 MB).
1.4 2011-11-26 This release includes several improvements including allowing Parsers to declare support for multiple MIME types, configurable Fetcher Queue depth, Fetcher speed improvements, tighter Tika integration, and support for HTTP auth in Solr indexing.
1.5 2012-06-07 This release includes several improvements including upgrades of several major components including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements as well as a number of new plugins covering blacklisting, filtering and parsing to name a few.
2.0 2012-07-07 This release offers users an edition focused on large-scale crawling which builds on storage abstraction (via Apache Gora) for big data stores such as Apache Accumulo, Apache Avro, Apache Cassandra, Apache HBase, HDFS, an in-memory data store, and various high-profile SQL stores.
1.5.1 2012-07-10 This release is a maintenance release of the popular 1.5.X mainstream version of Nutch which has been widely adopted within the community.
2.1 2012-10-05 This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive which is growing in popularity amongst the community. As well as addressing ~20 bugs, this release also offers improved properties for better Solr configuration, upgrades to various Gora dependencies, and the introduction of the option to build indexes in Elasticsearch.
1.6 2012-12-06 This release includes over 20 bug fixes and as many improvements, as well as new functionality including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME type, and functional enhancements to the Indexer API including the normalization of URLs and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8.
2.2 2013-06-08 This release includes over 30 bug fixes and over 25 improvements representing the third release of increasingly popular 2.x Nutch series. This release features inclusion of Crawler-Commons which Nutch now utilizes for improved robots.txt parsing, library upgrades to Apache Hadoop 1.1.1, Apache Gora 0.3, Apache Tika 1.2 and Automaton 1.11-8.
1.7 2013-06-24 This release includes over 20 bug fixes and as many improvements, most notably a new pluggable indexing architecture which currently supports Apache Solr and Elasticsearch. As in the recent Nutch 2.2 release, parsing of robots.txt is now delegated to Crawler-Commons. Key library upgrades have been made to Apache Hadoop 1.2.0 and Apache Tika 1.3.
2.2.1 2013-07-02 This release includes library upgrades to Apache Hadoop 1.2.0 and Apache Tika 1.3; it is predominantly a bug fix release for NUTCH-1591 (incorrect conversion of ByteBuffer to String).
1.8 2014-03-17 Although this release includes library upgrades to Crawler Commons 0.3 and Apache Tika 1.5, it also provides over 30 bug fixes as well as 18 improvements.
2.3 2015-01-22 Nutch 2.3 release now comes packaged with a self-contained Apache Wicket-based Web Application. The SQL backend for Gora has been deprecated.[3]
1.10 2015-05-06 This release includes a library upgrade to Tika 1.6 and provides over 46 bug fixes, 37 improvements, and 12 new features.[4]
1.11 2015-12-07 This release includes library upgrades to Hadoop 2.x and Tika 1.11 and provides over 32 bug fixes, 35 improvements, and 14 new features.[5]
2.3.1 2016-01-21 This bug fix release addresses around 40 issues.
1.12 2016-06-18
1.13 2017-04-02
1.14 2017-12-23
1.15 2018-08-09

Advantages

Nutch has the following advantages over a simple fetcher:[6]

  • Highly scalable and relatively feature-rich crawler.
  • Politeness: obeys robots.txt rules and avoids overloading individual hosts (a minimal sketch of the idea follows this list).
  • Robust and scalable – Nutch can run on a cluster of up to 100 machines.
  • Quality – crawling can be biased to fetch "important" pages first.
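
The politeness point above is easiest to see in code. The following is a minimal, generic sketch of a per-host delay gate; it is not Nutch's actual fetcher logic (Nutch configures this behaviour through its fetcher properties) and only illustrates the idea of spacing out requests to the same host.

    import java.util.HashMap;
    import java.util.Map;

    // Enforces a fixed delay between successive fetches to the same host.
    public class PolitenessGate {
        private final long delayMillis;
        private final Map<String, Long> lastFetch = new HashMap<>();

        public PolitenessGate(long delayMillis) {
            this.delayMillis = delayMillis;
        }

        // Blocks until the configured delay for this host has elapsed.
        public synchronized void await(String host) throws InterruptedException {
            long now = System.currentTimeMillis();
            long earliest = lastFetch.getOrDefault(host, 0L) + delayMillis;
            if (earliest > now) {
                Thread.sleep(earliest - now);
            }
            lastFetch.put(host, System.currentTimeMillis());
        }
    }

A fetcher thread would call await(host) immediately before issuing each HTTP request to that host.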

Scalability

IBM Research studied the performance[7] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project.[8] Their findings were that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the POWER5.

The ClueWeb09 dataset (used in e.g. TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.[9]

Related projects

  • Hadoop – Java framework that supports distributed applications running on large clusters.

Search engines built with Nutch

Nutch has powered several public search services, including the Creative Commons search engine[10][11][12] and Wikia Search.[13][14]

References

  1. Nutch News
  2. "Common Crawl's Move to Nutch – Common Crawl – Blog". blog.commoncrawl.org. Retrieved 2015-10-14.
  3. "Nutch 2.3 Release". Apache Nutch News. The Apache Software Foundation. 22 January 2015. Retrieved 18 January 2016.
  4. "Nutch 1.10 Release Notes". ASF JIRA. The Apache Software Foundation. 6 May 2015. Retrieved 18 January 2016.
  5. "Nutch 1.11 Release Notes". ASF JIRA. The Apache Software Foundation. 7 December 2015. Retrieved 18 January 2016.
  6. Siren, Sami (9 March 2009). "Using Nutch with Solr". Lucidworks.com. Retrieved 18 January 2016.
  7. "Scalability of the Nutch search engine".
  8. "Base Operating System Provisioning and Bringup for a Commercial Supercomputer". Archived December 3, 2008, at the Wayback Machine.
  9. "The Sapphire Web Crawler - Crawl Statistics". Boston.lti.cs.cmu.edu (2008-10-01). Retrieved 2013-07-21.
  10. "Our Updated Search". Creative Commons. 2004-09-03.
  11. "Creative Commons Unique Search Tool Now Integrated into Firefox 1.0". Creative Commons. 2004-11-22. Archived from the original on 2010-01-07.
  12. "New CC search UI". Creative Commons. 2006-08-02.
  13. "Where can I get the source code for Wikia Search?"
  14. "Update on Wikia – doing more of what's working"

Apache Hadoop

Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Originally designed for computer clusters built from commodity hardware (still the common use), it has also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is a MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

The base Apache Hadoop framework is composed of the following modules:

  • Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
  • Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
  • Hadoop YARN – introduced in 2012, a platform responsible for managing computing resources in clusters and using them for scheduling users' applications;
  • Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.

The term Hadoop is often used for both the base modules and sub-modules and also the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.

Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on MapReduce and the Google File System. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with Hadoop Streaming to implement the map and reduce parts of the user's program. Other projects in the Hadoop ecosystem expose richer user interfaces.
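
To make the MapReduce programming model concrete, here is the classic word-count job written against the org.apache.hadoop.mapreduce API, a sketch in the spirit of the standard Hadoop example; the input and output paths and the job name are placeholders.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts collected for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation of map output
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The framework delivers mapper output to reducers grouped by key, so each reducer sees every count emitted for a given word; the combiner runs the same reduction locally on map output to reduce network traffic.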

Apache Lucene

Apache Lucene is a free and open-source information retrieval software library, originally written completely in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License.

Lucene has been ported to other programming languages including Object Pascal, Perl, C#, C++, Python, Ruby and PHP.
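
As a rough illustration of what the library provides, the sketch below indexes one document and runs a query against it. It assumes Lucene's core API around the 8.x line (StandardAnalyzer, IndexWriter, IndexSearcher, the classic QueryParser); the field names and index directory are arbitrary choices for the example.

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            Directory dir = FSDirectory.open(Paths.get("lucene-index"));

            // Index a single document with a stored title and an analyzed body field.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("title", "Apache Nutch", Field.Store.YES));
                doc.add(new TextField("body", "Nutch is an extensible, scalable web crawler.", Field.Store.NO));
                writer.addDocument(doc);
            }

            // Search the body field and print the stored titles of matching documents.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("body", analyzer).parse("crawler");
                TopDocs hits = searcher.search(query, 10);
                for (ScoreDoc hit : hits.scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("title"));
                }
            }
        }
    }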

Apache OODT

The Apache Object Oriented Data Technology (OODT) is an open source data management system framework that is managed by the Apache Software Foundation. OODT was originally developed at NASA Jet Propulsion Laboratory to support capturing, processing and sharing of data for NASA's scientific archives.

Apache Tika

Apache Tika is a content detection and analysis framework, written in Java, stewarded at the Apache Software Foundation. It detects and extracts metadata and text from over a thousand different file types and, as well as providing a Java library, has server and command-line editions suitable for use from other programming languages.
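
The two operations named above, type detection and text extraction, can be sketched with the org.apache.tika.Tika facade class, the library's simplest entry point; here the file to inspect is taken from the command line.

    import java.io.File;

    import org.apache.tika.Tika;

    public class TikaSketch {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            File input = new File(args[0]);

            // Media type detection, e.g. "application/pdf" or "image/png".
            String mediaType = tika.detect(input);

            // Text extraction: parses the file with the parser matching its type.
            String text = tika.parseToString(input);

            System.out.println("Detected type: " + mediaType);
            System.out.println(text);
        }
    }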

Chris Mattmann

Chris Mattmann is a Principal Data Scientist and Associate Chief Technology and Innovation Officer in the Office of the Chief Information Officer (OCIO) at the Jet Propulsion Laboratory (JPL) in Pasadena, California. He is also the manager of JPL's Open Source Applications office. Mattmann was formerly Chief Architect in the Instrument and Data Systems section at the laboratory. Mattmann graduated from the University of Southern California (USC) in 2007 with a PhD in Computer Science, studying with Dr. Nenad Medvidović, and went on to invent Apache Tika with Jérôme Charron. Apache Tika is a widely used software framework for content detection and analysis. Mattmann later wrote a book about the framework titled Tika in Action with Jukka Zitting, which is published by Manning Publications.

Chris Mattmann's work on Tika and other projects was heavily influenced by open source, both at NASA and within the academic community. After creating Tika and helping to create other projects, including Apache Nutch, an open source web crawler and the predecessor to the big data platform Apache Hadoop, Mattmann joined the Board of Directors of the Apache Software Foundation in May 2013, where he served until March 2018 and held roles including Treasurer, Vice Chairman, and Vice President of the Legal Affairs Committee.

During this time, Chris worked to apply open source principles to data management problems inspired by his work at NASA in Earth and Planetary science and in engineering. Mattmann maintained an affiliation with USC as an Adjunct Associate Professor and, in order to continue to do research on open source and data management, created the Information Retrieval and Data Science Group (IRDS). IRDS includes diverse students in the areas of data science, information retrieval and informatics, and the group exists within USC's Viterbi School of Engineering. The focus of the group is on cross-disciplinary data and content analysis work applied to the science, business, engineering and information technology (IT) domains.

At NASA, Mattmann's work has been applied to a number of space missions, including the Orbiting Carbon Observatory 1/2, NPP Sounder PEATE, and Soil Moisture Active Passive (SMAP) Earth science missions. Mattmann was also one of the principal developers of the Object Oriented Data Technology platform, an open source data management system framework originally developed by NASA JPL and then donated to the Apache Software Foundation. More recently, Chris has been focused on Dark Web and automated data processing technologies and has been leading research teams working with DARPA and NASA JPL on the Memex project, which involves data discovery and dissemination from the Dark Web.

Frontera (web crawling)

Frontera is an open-source web crawling framework implementing the crawl frontier component and providing scalability primitives for web crawler applications.

List of Java frameworks

Below is a list of Java programming language technologies (frameworks and libraries).

List of search engine software

Presented below is a list of search engine software.

Sematext

Sematext, a globally distributed organization, builds cloud and on-premises systems for application-performance monitoring, alerting and anomaly detection; centralized logging, log management and analytics; site-search analytics, and search enhancement. The company also provides search and Big Data consulting services and offers 24/7 production support and training for Solr and Elasticsearch to clients worldwide. The company markets its core products to engineers and DevOps, and its services to organizations using Elasticsearch, Solr, Lucene, Hadoop, HBase, Docker, Spark, Kafka, and many other platforms. Otis Gospodnetić (the co-author of Lucene in Action, the founder of Simpy, and committer on Lucene, Solr, Nutch, Apache Mahout, and Open Relevance projects) founded Sematext. Privately held, Sematext has its headquarters in Brooklyn, NY.

StormCrawler

StormCrawler is an open-source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under the Apache License and is written mostly in Java.

StormCrawler is modular and consists of a core module, which provides the basic building blocks of a web crawler such as fetching, parsing and URL filtering. Apart from the core components, the project also provides external resources, such as spouts and bolts for Elasticsearch and Apache Solr, or a ParserBolt which uses Apache Tika to parse various document formats.
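
The sketch below shows the general shape of such a topology: a spout emitting URLs wired to a downstream bolt. It assumes Apache Storm's 1.x core API, and the SeedSpout and LogBolt classes are hypothetical placeholders standing in for StormCrawler's real spouts and fetch/parse/index bolts.

    import java.util.Map;

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class CrawlTopologySketch {

        // Hypothetical seed spout: emits one URL. A real crawler spout would
        // read URLs from a queue, a status index, or another shared store.
        public static class SeedSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private boolean emitted;

            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
            }

            public void nextTuple() {
                if (!emitted) {
                    collector.emit(new Values("https://nutch.apache.org/"));
                    emitted = true;
                }
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("url"));
            }
        }

        // Hypothetical terminal bolt standing in for fetch/parse/index bolts.
        public static class LogBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                System.out.println("would fetch: " + input.getStringByField("url"));
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // terminal bolt: declares no output stream
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("seeds", new SeedSpout());
            builder.setBolt("fetch", new LogBolt()).shuffleGrouping("seeds");

            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("crawl-sketch", new Config(), builder.createTopology());
            Thread.sleep(5000);   // let the topology run briefly, then shut down
            cluster.shutdown();
        }
    }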

The project is used in production by various companies. Linux.com published a Q&A with the author of StormCrawler in October 2016, and InfoQ ran one in December 2016. A comparative benchmark with Apache Nutch was published on dzone.com in January 2017. Several research papers mentioned the use of StormCrawler in 2018, in particular:

  • The generation of a multi-million-page corpus for the Persian language.
  • The SIREN - Security Information Retrieval and Extraction eNgine.

The project wiki contains a list of videos and slides available online.

Web ARChive

The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations.

WARC is now recognised by most national library systems as the standard to follow for web archival.
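
To make the record structure tangible, here is an illustrative sketch that writes a single WARC/1.0-style resource record by hand: a block of named header fields, a blank line, the content block, and a trailing record separator. Real archives are normally produced with a dedicated WARC library; the URI and content here are placeholders.

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.time.Instant;
    import java.time.temporal.ChronoUnit;
    import java.util.UUID;

    public class WarcSketch {
        public static void main(String[] args) throws Exception {
            byte[] block = "<html><body>Hello, web archive.</body></html>"
                    .getBytes(StandardCharsets.UTF_8);

            // Header fields of one record; a WARC file is simply a concatenation
            // of such records (request, response, resource, metadata, ...).
            String header = "WARC/1.0\r\n"
                    + "WARC-Type: resource\r\n"
                    + "WARC-Target-URI: http://example.com/\r\n"
                    + "WARC-Date: " + Instant.now().truncatedTo(ChronoUnit.SECONDS) + "\r\n"
                    + "WARC-Record-ID: <urn:uuid:" + UUID.randomUUID() + ">\r\n"
                    + "Content-Type: text/html\r\n"
                    + "Content-Length: " + block.length + "\r\n"
                    + "\r\n";

            try (OutputStream out = new FileOutputStream("example.warc")) {
                out.write(header.getBytes(StandardCharsets.UTF_8));
                out.write(block);                                        // the content block
                out.write("\r\n\r\n".getBytes(StandardCharsets.UTF_8));  // record separator
            }
        }
    }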

Web crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).

Web search engines and some other sites use Web crawling or spidering software to update their web content or indices of other sites' web content. Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so users can search more efficiently.
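
A toy breadth-first crawler makes this fetch-extract-enqueue cycle concrete. The sketch below uses only the JDK (java.net.http, Java 11+); it deliberately omits robots.txt handling, politeness delays, and proper HTML parsing, so it illustrates the loop rather than serving as a usable crawler, and the seed URL and fetch limit are arbitrary.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TinyCrawler {
        // Crude link extraction; a real crawler would use an HTML parser.
        private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            Queue<String> frontier = new ArrayDeque<>();   // URLs waiting to be fetched
            Set<String> seen = new HashSet<>();            // URLs already processed
            frontier.add("https://nutch.apache.org/");

            int fetched = 0;
            while (!frontier.isEmpty() && fetched < 20) {
                String url = frontier.poll();
                if (!seen.add(url)) continue;              // skip duplicates

                HttpResponse<String> response = client.send(
                        HttpRequest.newBuilder(URI.create(url)).GET().build(),
                        HttpResponse.BodyHandlers.ofString());
                fetched++;
                System.out.println(response.statusCode() + " " + url);

                Matcher m = LINK.matcher(response.body());
                while (m.find()) {
                    frontier.add(m.group(1));              // enqueue discovered outlinks
                }
            }
        }
    }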

Crawlers consume resources on visited systems and often visit sites without approval. Issues of schedule, load, and "politeness" come into play when large collections of pages are accessed. Mechanisms exist for public sites not wishing to be crawled to make this known to the crawling agent. For example, including a robots.txt file can request bots to index only parts of a website, or nothing at all.

The number of Internet pages is extremely large; even the largest crawlers fall short of making a complete index. For this reason, search engines struggled to give relevant search results in the early years of the World Wide Web, before 2000. Today, relevant results are given almost instantly.

Crawlers can validate hyperlinks and HTML code. They can also be used for web scraping (see also data-driven programming).
