YaCy

YaCy (pronounced "ya see") is a free distributed search engine built on principles of peer-to-peer (P2P) networks.[2][3] Its core is a computer program written in Java, distributed across several hundred computers (as of September 2006), so-called YaCy peers. Each YaCy peer independently crawls the Internet, analyzes and indexes the web pages it finds, and stores the indexing results in a common database (the index), which is shared with other YaCy peers using P2P principles. It is free software that anyone can use to build a search portal for an intranet or to help search the public Internet.

Compared with semi-distributed search engines, the YaCy network has a fully decentralized architecture: all YaCy peers are equal and no central server exists. A peer can run either in crawling mode or as a local proxy server that indexes the web pages visited by the person running it; several mechanisms are provided to protect the user's privacy. Search functions are accessed through a locally running web server, which provides a search box for entering search terms and returns results in a format similar to that of other popular search engines.

YaCy was created in 2003 by Michael Christen[4] and is available for Windows, macOS, and Linux.

YaCy
Original author(s): Michael Christen
Developer(s): YaCy community
Stable release: 1.92 / 26 December 2016
Repository: github.com/yacy/yacy_search_server
Written in: Java
Operating system: Cross-platform
Type: Overlay network, search engine
License: GPLv2+
Alexa rank: 378,737 (as of 14 May 2019)[1]
Website: www.yacy.net/en

System components

The YaCy search engine is based on four elements:[5]

Crawler
A search robot that traverses the web from page to page and analyzes the content it finds.
Indexer
Creates a reverse word index (RWI): each word in the RWI has a list of relevant URLs together with ranking information (a minimal sketch follows this list). Words are stored in the form of word hashes.
Search and administration interface
A web interface provided by a local HTTP servlet engine.
Data storage
Stores the reverse word index database, which is distributed across peers by means of a distributed hash table.
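
To make the indexer's role concrete, here is a minimal sketch of a reverse word index in Java, YaCy's implementation language. The class name, method names, and the toy hash function are illustrative assumptions for this article, not YaCy's actual API or hash scheme.

```java
import java.util.*;

// Minimal sketch of a reverse word index (RWI): each word hash maps to the
// set of URLs whose documents contain that word. Names are illustrative only.
public class ReverseWordIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    // YaCy stores word hashes rather than plain words; a real implementation
    // uses a compact hash scheme, here hashCode stands in for brevity.
    private static String wordHash(String word) {
        return Integer.toHexString(word.toLowerCase().hashCode());
    }

    public void addDocument(String url, String text) {
        for (String word : text.split("\\W+")) {
            if (word.isEmpty()) continue;
            index.computeIfAbsent(wordHash(word), k -> new HashSet<>()).add(url);
        }
    }

    public Set<String> search(String word) {
        return index.getOrDefault(wordHash(word), Collections.emptySet());
    }

    public static void main(String[] args) {
        ReverseWordIndex rwi = new ReverseWordIndex();
        rwi.addDocument("http://example.org/a", "distributed search engine");
        rwi.addDocument("http://example.org/b", "peer to peer search");
        System.out.println(rwi.search("search")); // prints both URLs
    }
}
```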

Search-engine technology

  • YaCy is a complete search appliance with a user interface, index, administration, and monitoring.
  • YaCy harvests web pages with a web crawler. Documents are then parsed and indexed, and the search index is stored locally. If a peer is part of a peer network, its local search index is also merged into the shared index for that network.
  • When a search is started, the local index is combined with the global search index contributed by peers in the YaCy search network (a sketch of this merging follows the list).
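
The merging of local and remote results can be illustrated with a short sketch. The method names and the simple URL-based de-duplication are assumptions for illustration, not YaCy's actual merge logic, which also weighs ranking information.

```java
import java.util.*;

// Sketch: combine local results with results contributed by remote peers,
// de-duplicating by URL. Purely illustrative; real merging also ranks hits.
public class ResultMerge {
    public static List<String> merge(List<String> local, List<List<String>> peerResults) {
        LinkedHashSet<String> merged = new LinkedHashSet<>(local); // local results first
        for (List<String> remote : peerResults) merged.addAll(remote);
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<String> local = List.of("http://a.example", "http://b.example");
        List<List<String>> peers = List.of(
            List.of("http://b.example", "http://c.example"));
        System.out.println(merge(local, peers)); // [a.example, b.example, c.example]
    }
}
```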

YaCy platform architecture

Screenshot: a web search showing results from the different components YaCy uses

YaCy uses a combination of techniques for networking, administration, and maintenance of the search index, including blacklisting, moderation, and communication with the community. YaCy organizes these operations as follows:

  • Community components
    1. Web forum[6]
    2. Statistics
    3. XML API
  • Maintenance
    1. Web Server
    2. Indexing
    3. Crawler with Balancer
    4. Peer-to-Peer Server Communication
  • Content organization
    1. Blacklisting and Filtering (see the sketch after this list)
    2. Search interface
    3. Bookmarks
    4. Monitoring search results
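
As an illustration of the blacklisting step, the following sketch drops URLs whose host matches a wildcard pattern before they reach the index or the result page. The pattern syntax and class names are assumptions for this example and do not reproduce YaCy's actual blacklist format.

```java
import java.util.*;
import java.util.regex.*;

// Sketch of host-pattern blacklisting: URLs whose host matches a blacklisted
// pattern are rejected. "*" is treated as a wildcard; everything else is literal.
public class Blacklist {
    private final List<Pattern> patterns = new ArrayList<>();

    public void add(String hostPattern) {
        // Quote the whole pattern, then re-open quoting around each "*" wildcard.
        String regex = Pattern.quote(hostPattern).replace("*", "\\E.*\\Q");
        patterns.add(Pattern.compile(regex));
    }

    public boolean isBlacklisted(String url) {
        try {
            String host = new java.net.URI(url).getHost();
            if (host == null) return true; // malformed URL: reject
            for (Pattern p : patterns)
                if (p.matcher(host).matches()) return true;
            return false;
        } catch (java.net.URISyntaxException e) {
            return true; // unparsable URL: reject
        }
    }

    public static void main(String[] args) {
        Blacklist bl = new Blacklist();
        bl.add("*.tracker.example");
        System.out.println(bl.isBlacklisted("http://ads.tracker.example/x")); // true
        System.out.println(bl.isBlacklisted("http://yacy.net/en"));           // false
    }
}
```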

See also

  • Dooble – an open-source web browser with an integrated YaCy Search Engine Tool Widget

References

  1. ^ "yacy.net Site Info". Alexa Internet.
  2. ^ "YaCy takes on Google with open source search engine". The Register. 2011-11-29. Retrieved 2012-04-16.
  3. ^ "YaCy: It's About Freedom, Not Beating Google". PC World. 2011-12-03. Retrieved 2012-04-16.
  4. ^ "Ich entwickle eine P2P-basierende Suchmaschine. Wer macht mit?". Heise Online (in German). 2003-12-15. Retrieved 2018-05-09.
  5. ^ "YaCy Technology Architecture". YaCy.net. Retrieved 2012-02-14.
  6. ^ "forum.yacy.de". Retrieved 6 June 2017.


Comparison of web search engines

Search engines are listed in the tables below for comparison purposes. The first table lists the company behind each engine, its volume and ad support, and identifies whether its software is free or proprietary. The second table lists privacy aspects along with other technical parameters, such as whether the engine provides personalization (alternatively viewed as creating a filter bubble).

Defunct or acquired search engines are not listed here.

Distributed hash table

A distributed hash table (DHT) is a class of a decentralized distributed system that provides a lookup service similar to a hash table: (key, value) pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. Keys are unique identifiers which map to particular values, which in turn can be anything from addresses, to documents, to arbitrary data. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.
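
The following sketch illustrates the core idea using consistent hashing: node IDs and keys share one hash space, each key belongs to the next node clockwise on a ring, and removing a node relocates only the keys that node owned. Real DHTs such as Chord or Kademlia add routing tables, replication, and failure handling; everything here is a simplified illustration.

```java
import java.util.*;

// Sketch of DHT key placement via consistent hashing on a ring.
public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String nodeId)    { ring.put(hash(nodeId), nodeId); }
    public void removeNode(String nodeId) { ring.remove(hash(nodeId)); }

    // The node responsible for a key: the first node at or after the key's
    // hash, wrapping around to the first node on the ring if none follows.
    public String nodeFor(String key) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private static int hash(String s) { return s.hashCode() & 0x7fffffff; }

    public static void main(String[] args) {
        ConsistentHashRing dht = new ConsistentHashRing();
        for (String n : List.of("peer-a", "peer-b", "peer-c")) dht.addNode(n);
        System.out.println(dht.nodeFor("some-word-hash")); // owning peer
        dht.removeNode("peer-b"); // only keys owned by peer-b must move
        System.out.println(dht.nodeFor("some-word-hash"));
    }
}
```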

DHTs form an infrastructure that can be used to build more complex services, such as anycast, cooperative Web caching, distributed file systems, domain name services, instant messaging, multicast, and also peer-to-peer file sharing and content distribution systems. Notable distributed networks that use DHTs include BitTorrent's distributed tracker, the Coral Content Distribution Network, the Kad network, the Storm botnet, the Tox instant messenger, Freenet, the YaCy search engine, and the InterPlanetary File System.

Distributed search engine

A distributed search engine is a search engine where there is no central server. Unlike traditional centralized search engines, work such as crawling, data mining, indexing, and query processing is distributed among several peers in a decentralized manner where there is no single point of control.

Distributed web crawling

Distributed web crawling is a distributed computing technique whereby Internet search engines employ many computers to index the Internet via web crawling. Such systems may allow users to voluntarily offer their own computing and bandwidth resources towards crawling web pages. By spreading the load of these tasks across many computers, costs that would otherwise be spent on maintaining large computing clusters are avoided.
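
One common way to spread crawl work is to hash each URL's host to exactly one crawler node, so no two nodes fetch the same site and per-host politeness is easy to enforce. The node names and assignment rule in the sketch below are illustrative assumptions, not a description of any particular system.

```java
import java.util.*;

// Sketch: partition crawl work across nodes by hashing the URL's host,
// so each site is fetched by exactly one node.
public class CrawlPartitioner {
    private final List<String> nodes;

    public CrawlPartitioner(List<String> nodes) { this.nodes = nodes; }

    public String assign(String url) throws java.net.URISyntaxException {
        String host = new java.net.URI(url).getHost();
        int idx = Math.floorMod(host.hashCode(), nodes.size()); // stays in range
        return nodes.get(idx);
    }

    public static void main(String[] args) throws Exception {
        CrawlPartitioner p = new CrawlPartitioner(List.of("node-1", "node-2", "node-3"));
        for (String u : List.of("http://example.org/a", "http://example.org/b",
                                "http://yacy.net/en")) {
            System.out.println(u + " -> " + p.assign(u)); // same host, same node
        }
    }
}
```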

Dooble

Dooble is a free and open-source Web browser that was created to improve privacy. Currently, Dooble is available for FreeBSD, Linux, OS X, OS/2, and Windows. Dooble uses Qt for its user interface and abstraction from the operating system and processor architecture. As a result, Dooble should be portable to any system that supports OpenSSL, POSIX threads, Qt, SQLite, and other libraries.

Filter bubble

A filter bubble – a term coined by Internet activist Eli Pariser – is a state of intellectual isolation that allegedly can result from personalized searches, when a website algorithm selectively guesses what information a user would like to see based on information about the user, such as location, past click behavior, and search history. As a result, users become separated from information that disagrees with their viewpoints, effectively isolating them in their own cultural or ideological bubbles. The choices made by these algorithms are not transparent. Prime examples include Google Personalized Search results and Facebook's personalized news stream. According to Pariser, the bubble effect may have negative implications for civic discourse, but contrasting views regard the effect as minimal and addressable. The results of the 2016 U.S. presidential election have been associated with the influence of social media platforms such as Twitter and Facebook, calling into question the effects of the filter bubble on user exposure to fake news and echo chambers. This spurred new interest in the term, with many concerned that the phenomenon may harm democracy.

(Technology such as social media) “lets you go off with like-minded people, so you're not mixing and sharing and understanding other points of view ... It's super important. It's turned out to be more of a problem than I, or many others, would have expected.”

Internet privacy

Internet privacy involves the right or mandate of personal privacy concerning the storing, repurposing, provision to third parties, and displaying of information pertaining to oneself via the Internet. Internet privacy is a subset of data privacy. Privacy concerns have been articulated since the beginnings of large-scale computer sharing. Privacy can involve either personally identifying information (PII) or non-PII, such as a visitor's behavior on a website. PII refers to any information that can be used to identify an individual. For example, age and physical address alone could identify who an individual is without explicitly disclosing their name, as these two factors are typically unique enough to identify a specific person. Other forms of PII may soon include GPS tracking data used by apps, as daily commute and routine information can be enough to identify an individual.

Some experts, such as Steve Rambam, a private investigator specializing in Internet privacy cases, believe that privacy no longer exists, saying, "Privacy is dead – get over it". In fact, it has been suggested that the "appeal of online services is to broadcast personal information on purpose." On the other hand, in his essay The Value of Privacy, security expert Bruce Schneier says, "Privacy protects us from abuses by those in power, even if we're doing nothing wrong at the time of surveillance."

List of FLOSS Weekly episodes

The following lists all the FLOSS Weekly shows that have been produced. There is a public list of potential future guests, although the show is only scheduled two months out.

List of XML and HTML character entity references

In SGML, HTML and XML documents, the logical constructs known as character data and attribute values consist of sequences of characters, in which each character can manifest directly (representing itself), or can be represented by a series of characters called a character reference, of which there are two types: a numeric character reference and a character entity reference. This article lists the character entity references that are valid in HTML and XML documents.

A character entity reference refers to the content of a named entity. An entity declaration is created in a Document Type Definition (DTD) using the syntax <!ENTITY name "value">.
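
As a small illustration of the two kinds of character reference, the following sketch resolves a named entity reference and decimal and hexadecimal numeric references to the same character. The tiny entity table and method are assumptions for this example, not a full HTML or XML parser.

```java
import java.util.Map;

// Sketch: resolving character references. A named reference (&lt;) and
// numeric references (decimal &#60;, hexadecimal &#x3C;) all denote '<'.
public class CharRefDemo {
    // A handful of named entities; real HTML defines many more.
    static final Map<String, Character> NAMED =
        Map.of("lt", '<', "gt", '>', "amp", '&', "quot", '"');

    // ref is the reference body without the leading '&' and trailing ';'.
    // Assumes the reference is well-formed and, if named, known.
    static char resolve(String ref) {
        if (ref.startsWith("#x") || ref.startsWith("#X"))
            return (char) Integer.parseInt(ref.substring(2), 16); // hex numeric
        if (ref.startsWith("#"))
            return (char) Integer.parseInt(ref.substring(1));     // decimal numeric
        return NAMED.get(ref);                                    // named entity
    }

    public static void main(String[] args) {
        System.out.println(resolve("lt"));   // <
        System.out.println(resolve("#60"));  // <
        System.out.println(resolve("#x3C")); // <
    }
}
```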

List of free and open-source software packages

This is a list of free and open-source software packages: computer software licensed under free-software licenses and open-source licenses. Software that fits the Free Software Definition may be more appropriately called free software; the GNU project in particular objects to its works being referred to as open-source. For more information about the philosophical background of open-source software, see free software movement and Open Source Initiative. However, nearly all software meeting the Free Software Definition also meets the Open Source Definition, and vice versa. A small fraction of the software that meets either definition is listed here.

Some of the open-source applications are also the basis of commercial products, shown in the List of commercial open-source applications and services.

List of search engine software

Presented below is a list of search engine software.

List of search engines

This is a list of search engines, including web search engines, selection-based search engines, metasearch engines, desktop search tools, and web portals and vertical market websites that have a search facility for online databases. For a list of search engine software, see List of enterprise search vendors.

Peer-to-peer

Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads between peers. Peers are equally privileged, equipotent participants in the application. They are said to form a peer-to-peer network of nodes.

Peers make a portion of their resources, such as processing power, disk storage, or network bandwidth, directly available to other network participants, without the need for central coordination by servers or stable hosts. Peers are both suppliers and consumers of resources, in contrast to the traditional client-server model, in which the consumption and supply of resources are divided. Emerging collaborative P2P systems are going beyond the era of peers doing similar things while sharing resources, and are looking for diverse peers that can bring unique resources and capabilities to a virtual community, thereby empowering it to engage in greater tasks beyond those that can be accomplished by individual peers, yet that are beneficial to all the peers.

While P2P systems had previously been used in many application domains, the architecture was popularized by the file-sharing system Napster, originally released in 1999. The concept has inspired new structures and philosophies in many areas of human interaction. In such social contexts, peer-to-peer as a meme refers to the egalitarian social networking that has emerged throughout society, enabled by Internet technologies in general.

Seeks

Seeks is a free and open-source project licensed under the Affero General Public License version 3 (AGPLv3). It exists to create an alternative to the current market-leading search engines, driven by user concerns rather than corporate interests. The original manifesto was created by Emmanuel Benazera and Sylvio Drouin and published in October 2006. The project was under active development until April 2014, with both stable releases of the engine and revisions of the source code available for public use. In September 2011, Seeks won an innovation award at the Open World Forum Innovation Awards. The Seeks source code has not been updated since April 28, 2014 and no Seeks nodes have been usable since February 6, 2016.

Web crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).

Web search engines and some other sites use web crawling or spidering software to update their own web content or their indices of other sites' web content. Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so users can search more efficiently.

Crawlers consume resources on visited systems and often visit sites without approval. Issues of schedule, load, and "politeness" come into play when large collections of pages are accessed. Mechanisms exist for public sites not wishing to be crawled to make this known to the crawling agent. For example, including a robots.txt file can request bots to index only parts of a website, or nothing at all.
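
As a sketch of this politeness mechanism, the following minimal parser honors Disallow rules for the wildcard user-agent before a path is fetched. It is a deliberate simplification: real robots.txt processing also handles Allow rules, agent-specific groups, and crawl-delay.

```java
import java.util.*;

// Minimal sketch of robots.txt politeness: collect Disallow rules for the
// wildcard user-agent and check paths against them before fetching.
public class RobotsTxt {
    private final List<String> disallowed = new ArrayList<>();

    public RobotsTxt(String robotsTxt) {
        boolean applies = false;
        for (String line : robotsTxt.split("\\R")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("user-agent:"))
                applies = line.substring(11).trim().equals("*");
            else if (applies && line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring(9).trim();
                if (!path.isEmpty()) disallowed.add(path); // empty = allow all
            }
        }
    }

    public boolean isAllowed(String path) {
        for (String prefix : disallowed)
            if (path.startsWith(prefix)) return false;
        return true;
    }

    public static void main(String[] args) {
        RobotsTxt robots = new RobotsTxt("User-agent: *\nDisallow: /private/");
        System.out.println(robots.isAllowed("/index.html")); // true
        System.out.println(robots.isAllowed("/private/x"));  // false
    }
}
```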

The number of Internet pages is extremely large; even the largest crawlers fall short of making a complete index. For this reason, search engines struggled to give relevant search results in the early years of the World Wide Web, before 2000. Today, relevant results are given almost instantly.

Crawlers can validate hyperlinks and HTML code. They can also be used for web scraping (see also data-driven programming).

Web search engine

A web search engine or Internet search engine is a software system that is designed to carry out web search (Internet search), which means to search the World Wide Web in a systematic way for particular information specified in a web search query. The search results are generally presented in a line of results, often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, videos, infographics, articles, research papers and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler.

Internet content that is not capable of being searched by a web search engine is generally described as the deep web.

