Search Engines: Problems & Solution
An Alternative Approach
Table of Contents
- Search Engines: Problem 1: More than 99.9999% of the Web is Unreachable to Search Engines.
- Search Engines: Problem 2: More Than 99% of All Indexed Pages Bring No Visitors From Search Engines.
- Search Engines: Problem 3: Inability of Search Engines to Provide Quality Search Results.
- Paradox of Google Search Results.
- Explanation of Paradox of Google Search Results.
- Solution to Search Engine Problems.
- Speed of Bulk Processing of Internet, Sector-by-Sector.
- Three Purposes of Navigation Subsystems.
1. Search Engines: Problem 1:
Since the beginning of the Internet, many authors have predicted that, as the Internet continued to grow, it would outgrow the search engines' ability to index it. These predictions came true. Although both the indexed web and the non-indexed web continue to grow, the indexed web grows much more slowly, and as a result the gap between the two keeps widening.
More than 99.9999% of the Web is Unreachable to Search Engines.
As of March 2021, all top search engines together had indexed less than 0.0001% (1/1,000,000) of the web pages posted on the Internet. Most of the remaining 99.9999% of web pages are accessible to visitors but not to search engines. For example, among U.S. government pages there are hundreds of millions of pages accessible to visitors but not to search engines. The evidence is based on multiple studies and on information provided by search engines about their databases. Search engines constantly update and clean their databases. According to WorldWideWebSize.com, the estimated number of pages indexed by Google, Bing, and Yahoo Search together is around 5 billion most of the time, while the largest peak in the Google index was around 50 billion pages. The underlying calculations (based on Zipf's logarithmic frequency intervals) were published by Maurice de Kunder and his colleagues in the journal Scientometrics ("Estimating Search Engine Index Size Variability: a 9-year Longitudinal Study").
According to the Google webpage How Search Works, at the end of 2016 Google knew of 130 trillion web pages. 130 trillion is 26,000 times the 5 billion pages that Google, Bing, and Yahoo Search together index most of the time. Google later removed the statement (see screenshot). In fact, 130 trillion is just the tip of the iceberg. According to Google, in 2013 Google knew of 30 trillion pages (so far, this Google webpage remains in its online archive). By extrapolation, in 2021 Google should know of quadrillions of web pages, millions of times more than the indexed web. Moreover, the number of web pages known to Google is probably a small fraction of the Internet. The indexed web is clearly less than 0.0001% (possibly less than 0.00001%) of the Internet.
Almost all indexed web pages belong to the surface (visible) web. Most non-indexed web pages belong to the deep (invisible, hidden) web: limited-access, contextual, scripted, dynamic, unlinked, and non-HTML pages, software, etc. The deep web includes all categories of web content and all categories of web publishers.
The indexed web remains a microscopic fraction of all web pages posted on the Internet despite the whole combination of methods used by search engines: specialized crawling methods (such as the Sitemap protocol and OAI-PMH); specialized search engines (such as Ahmia, Deep Web Technologies, and DeepPeep); specialized software (such as Tor); and specialized query languages (such as DEQUEL).
2. Search Engines: Problem 2:
Google determined that up to 99.99999999% of all search results produced by its algorithm are useless. As will be shown later, Google has started taking steps to replace its algorithm with manual processing and to eliminate billions of search results.
More Than 99% of All Indexed Pages Bring No Visitors From Search Engines.
According to traffic studies, more than 90% of the webpages in Google's database receive no clicks from Google visitors in any search. Most of the remaining 10% receive just one click in a few days and thus prove almost useless to Google visitors. See, for example, a traffic study by ahrefs.com.
More than 95% of the webpages in Google's database either receive no clicks or receive just one click in a few days. Thus, less than 5% of Google's database receives more than one click in a few days.
More than 99% of the webpages in Google's database either bring no visitors willing to remain on a website or bring one such visitor in a few days. Since less than 20% of clicked links are relevant to the search terms, more than 80% of clicks do not result in visitors willing to remain on the website. Thus, less than 1% of Google's database brings any visitors willing to remain on a website listed in the search results. Accordingly, the active database of all search engines together is less than 50 million web pages (1% of 5 billion).
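The click-funnel estimates above can be reproduced with simple arithmetic. The percentages and the 5-billion-page index size are the article's own figures, not independently measured data:

```python
# Reproducing the article's click-funnel estimates.
# All percentages are the article's own figures, not measured data.
indexed_pages = 5_000_000_000          # combined index of top search engines

no_or_one_click = 0.95                 # share with at most one click in a few days
multi_click_pages = indexed_pages * (1 - no_or_one_click)

active_share = 0.01                    # share that brings a visitor who stays
active_pages = indexed_pages * active_share

print(f"{multi_click_pages:,.0f}")     # 250,000,000
print(f"{active_pages:,.0f}")          # 50,000,000
```

This is how the "active database of less than 50 million web pages" figure follows from the stated percentages.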
3. Search Engines: Problem 3: Inability of Search Engines to Provide Quality Search Results.
In 1997, Sergey Brin and Lawrence Page wrote in The Anatomy of a Large-Scale Hypertextual Web Search Engine that the ability of leading search engines to find relevant pages was so low that "as of November 1997, only one of the top four commercial search engines finds itself (returns its own search page in response to its name in the top ten results)." In 1997, the list of search engines included, among others, AltaVista, Ask Jeeves, Excite, Inktomi, Infoseek, LookSmart, Lycos, WebCrawler, and Yahoo.
In 2021 (about a quarter of a century later), Google has been forced to admit that its algorithm is unable to produce quality search results (even results in response to the name Google): it can neither find relevant web pages, nor exclude irrelevant pages from search results, nor correctly sort search results, and quality search results can be produced only manually. Google has started taking steps to replace its algorithm with manually prepared search results. So far, these steps are very limited (only 25% replacement even for the top 20 Google searches), but Google has already expanded its efforts to searches outside the top searches.
4. Paradox of Google Search Results.
Let us examine Google search results in response to its name:
- Search 1: site:Google.com Google returns about 500,000,000 results.
- Search 2: Google returns only about 64 results. Paradox: Search 2 should return more results than Search 1.
- Search 3: Google Search Engine returns about 1,700,000,000 results.
- Search 4: Google Search returns only about 76 results. Paradox: Search 4 should return more results than Search 3.
The same paradox also appears in some searches outside the top. For example, the most popular IRS Form 1040 returns only about 90 results. In contrast, the less popular IRS Form 1120 returns about 1,840,000 results.
Analysis of the search results for Search 2 and Search 4 shows that about 50% of the results are on the google.com domain and its subdomains. The other 50% are roughly equally distributed among some of the top websites: Facebook, Instagram, Twitter, YouTube, LinkedIn, Yahoo, Wikipedia, Merriam-Webster, Britannica, Apple, CNET, media outlets (CNN, NYTimes, CNBC, Bloomberg, Forbes), top universities, etc.
Google explains the above paradox at the end of search results:
"In order to show you the most relevant results, we have omitted some entries very similar to the 100 already displayed. If you like, you can repeat the search with the omitted results included."
If we follow Google's offer to repeat the search, Google produces about 9,110,000,000 results (instead of 64) for Search 2 above and about 6,580,000,000 results (instead of 76) for Search 4.
In other words, Google claims that (in Search 2 and Search 4) over 99.9999999% of search results are omitted because they are just "some entries very similar" to the less than 0.0000001% of search results displayed. This explanation defies logic. First, the same "similarities" (if any) should be present in Search 1 and Search 3, where no search results were omitted. Second, the same "similarities" (if any) should be present in billions of Google searches, yet Google declares such similarities only for a microscopically small fraction of them (and only for 25% even among the top 20 searches). Third, 64 results could be "very similar" to some of 9,110,000,000 results, but 9,110,000,000 results could not be "very similar" to 64 results (unless the 9,110,000,000 results are identical or almost identical to each other, which is not the case).
5. Explanation of Paradox of Google Search Results.
The real explanation of the paradox is very simple: the above search results were manually prepared in advance by Google employees, not by the Google algorithm. Google made a strategic decision to reduce the number of search results to fewer than 100 (thus excluding 99.99…99% of all search results as useless) for as many searches as possible. The reduction of search results is accompanied by a reduction of the websites appearing in search results to the top few thousand (social networks, knowledge bases, technology, media, government, organizations, universities, etc.).
The idea of manual preparation of search results is correct; the methodology used by Google is wrong. There are tens of billions of search terms: for example, the ahrefs.com index contains 11.6 billion of them. The calculations are simple. It takes at least 2 hours for a trained person to manually prepare search results for one search term, so one person can prepare search results for about 1,000 search terms a year (4 search terms a day, about 250 workdays a year). All 100 thousand Google employees and contractors working together (and doing nothing else) would be able to prepare search results for 100 million search terms a year. It could take them hundreds of years to manually prepare search results for all tens of billions of search terms.
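The labor estimate above can be checked with a back-of-the-envelope calculation. The inputs (2 hours per term, 250 workdays, 100 thousand workers, the 11.6-billion-term ahrefs.com index) are the article's own figures; only the arithmetic is new:

```python
# Back-of-the-envelope check of the manual-preparation estimate.
hours_per_term = 2
hours_per_workday = 8
workdays_per_year = 250
workforce = 100_000                    # Google employees and contractors
search_terms = 11_600_000_000          # ahrefs.com index size cited above

terms_per_person_day = hours_per_workday // hours_per_term        # 4
terms_per_person_year = terms_per_person_day * workdays_per_year  # 1,000
terms_per_year_total = terms_per_person_year * workforce          # 100,000,000
years_needed = search_terms / terms_per_year_total

print(years_needed)  # 116.0
```

For the 11.6-billion-term index alone this already exceeds a century; for "tens of billions" of search terms the figure grows into the hundreds of years, as the text states.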
6. Solution to Search Engine Problems.
The methods used by search engines to obtain information, create databases, and produce search results should be abandoned. Search engines can access only a microscopic fraction (less than 1/1,000,000, probably less than 1/10,000,000) of the Internet, and even for that microscopic fraction they are unable to produce quality search results: they can neither find relevant web pages, nor exclude irrelevant pages from search results, nor correctly sort search results.
The manual approach used by Google (as a replacement for its algorithm) to create search results should also be abandoned. It could take hundreds of years for all 100 thousand Google employees and contractors working together (and doing nothing else) to manually prepare search results for all tens of billions of search terms.
The solution is to create a large network of Internet navigation subsystems (covering the Internet sector by sector), followed by the creation of the main search system. This allows simultaneous (bulk) processing of all web pages belonging to the same sector of the Internet. Some subsystems may offer all three navigation tools: search buttons (typing), catalogs (scrolling), and directories (clicking). Other subsystems may offer only some of the three. The approach should take into consideration the structure and content of each sector.
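A per-sector subsystem with the three navigation tools could be sketched as a simple data structure. This is a hypothetical illustration, not a description of any existing system; all class names, field names, and URLs are invented for the example:

```python
# Hypothetical sketch of a per-sector navigation subsystem with the three
# navigation tools described above. All names and URLs are illustrative.
from dataclasses import dataclass, field

@dataclass
class NavigationSubsystem:
    sector: str
    # Search buttons (typing): pre-prepared results per search term.
    search_index: dict = field(default_factory=dict)
    # Catalog (scrolling): a flat, browsable list of the sector's items.
    catalog: list = field(default_factory=list)
    # Directory (clicking): items grouped under clickable categories.
    directory: dict = field(default_factory=dict)

    def search(self, term: str) -> list:
        """Return the pre-prepared results for a term in this sector."""
        return self.search_index.get(term.lower(), [])

irs = NavigationSubsystem(sector="IRS forms and instructions")
irs.search_index["irs form 1040 2020"] = ["https://example.gov/f1040-2020.pdf"]
irs.catalog = ["Form 1040", "Form 1120"]
irs.directory["Individual returns"] = ["Form 1040"]

print(irs.search("IRS Form 1040 2020"))  # ['https://example.gov/f1040-2020.pdf']
```

A subsystem that lacks, say, a directory would simply leave that field empty, matching the text's note that some subsystems may offer only some of the three tools.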
Users should have access to both the general search system and a set of specialized navigation subsystems. Navigation inside specialized subsystems is usually much faster and more productive.
Manual processing and automatic processing must go hand in hand. For example, automatic browsing of the Internet by a web crawler must be accompanied by manually visiting some web pages to obtain additional information about web pages and Internet sectors. Analysis of the information obtained by a web crawler may help determine the list of web pages that need to be visited manually; analysis of the information obtained by visiting web pages may help determine the list of web pages that the crawler should browse next. Processing of the information obtained by a web crawler (such as extracting data, sorting, and removing duplicates) should be done automatically, but further analysis should be done manually.
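The alternating manual/automatic cycle described above can be sketched as a loop. This is only an illustration of the workflow's shape, under invented names: fetching and human review are stubbed out, and no real crawler API is implied:

```python
# Hypothetical sketch of the manual/automatic loop: a crawler gathers
# pages automatically, some pages are flagged for a manual visit, and
# the findings feed the next crawl. All names are illustrative.

def crawl(urls):
    """Automatic step (stubbed): pretend-fetch each URL, return its links."""
    return {url: [f"{url}/sub{i}" for i in range(2)] for url in urls}

def needs_manual_visit(url, links):
    """Automatic analysis deciding which pages a human should inspect."""
    return len(links) == 0  # e.g. dead ends may hide deep-web entry points

def manual_review(url):
    """Manual step (stubbed): a human records extra pages to crawl next."""
    return []

frontier = ["https://example.gov/forms"]
for _ in range(2):  # two rounds of the crawl/review cycle
    crawled = crawl(frontier)
    flagged = [u for u, links in crawled.items() if needs_manual_visit(u, links)]
    extra = [page for u in flagged for page in manual_review(u)]
    # Deduplicate automatically, as the text suggests, before the next round.
    frontier = sorted(set(l for links in crawled.values() for l in links) | set(extra))

print(len(frontier))  # 4
```

The point of the sketch is the feedback loop: each round's automatic output determines the next round's manual work, and vice versa.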
Users should be able to search for both web pages and information. Currently, top search engines offer almost exclusively search for web pages: in general, a user needs to visit the web pages to find information relevant to the search terms. As an exception, search engines offer some information relevant to search terms directly (USPS tracking, currency conversions, etc.).
7. Speed of Bulk Processing of Internet, Sector-by-Sector.
Most sectors contain thousands of web pages; some contain millions. Compared to individual processing, bulk processing increases the speed of processing by thousands or millions of times. For example, out of about 30,000 IRS forms and instructions, Google manually processed search results for only one form (1040) and 6 of its schedules, only for 5 tax years (2016-2020), and only for 3 search terms per form.
Manual processing of the search terms associated with these 7 IRS forms for 5 tax years, 3 search terms each (7 x 5 x 3 = 105), took more than a month of work of one Google employee (about 4 search terms a day). In contrast, the simultaneous processing of all 30,000 IRS forms and instructions, with 30-40 search terms each (about a million search terms in total), takes just a day of work of one employee. The increase in speed (about 100 per month vs 1,000,000 per day) is astronomical (around 250,000 times).
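The speedup figure can be verified from the numbers in the paragraph above; the exact ratio depends on how many workdays one assumes per month (25 here), so "around 250,000 times" is an order-of-magnitude claim:

```python
# Checking the speed comparison above (inputs are the article's figures).
manual_terms, manual_workdays = 7 * 5 * 3, 25   # 105 terms in ~1 month
bulk_terms, bulk_workdays = 1_000_000, 1        # ~a million terms in 1 day

manual_rate = manual_terms / manual_workdays    # terms per day, manually
bulk_rate = bulk_terms / bulk_workdays          # terms per day, in bulk
speedup = bulk_rate / manual_rate

print(round(manual_rate, 1))  # 4.2
print(round(speedup))         # 238095
```

About 238,000x under these assumptions, consistent with the article's "around 250,000 times".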
Some sectors of the Internet can be processed at a speed of tens of millions of web pages (hundreds of millions of search terms) a day. Many sectors of the Internet are much larger than the sector of IRS forms and instructions; some include millions or tens of millions of web pages. The bulk processing of a sector containing 10 thousand web pages takes about the same time as the bulk processing of a sector containing 10 million web pages.
While bulk processing requires a much more qualified employee, a well-organized and trained 100-member team of such employees can process hundreds of millions of web pages (tens of billions of search terms) in a year, a few times more than the 50 million web pages that all top search engines together have in their active databases today. This should result in at least ten thousand navigation subsystems. The creation of these subsystems should be followed by the creation of a general search engine that automatically accumulates and processes the search results prepared for the subsystems.
This should provide high-quality search results and dramatically change the extremely inefficient two-step search process currently offered by top search engines, which involves typing, scrolling, and clicking (again and again). First, a search engine user needs to type search terms. The user may decide to use some search syntax or none; entering correct search terms may require a certain ability of expression and sometimes knowledge of search syntax. Second, the user needs to scroll through and manually search the list of search results to find the relevant web pages or to determine their absence from the list. For these purposes, the user needs to analyze the titles and descriptions of the listed webpages, click some links, wait for the web pages to open, and judge their relevance. The user may need to modify the search terms and repeat the process again and again. This trial-and-error method may become a few minutes of fun or hours of torture, depending on the search terms and the user's abilities. Such an extremely inefficient two-step search process is offered by modern search engines and accepted by most Internet users as the only option. In fact, in many cases the search can be done in seconds (instead of minutes or hours) by using a one-step process, without even entering any search terms.
Total preparation time of a subsystem: one business day by one person.
A 100-member team should be able to prepare at least 10,000 subsystems a year. These will include hundreds of millions of web pages (mostly from the deep web) and tens of billions of search terms, many times more than all top search engines together have in their active databases today.
Of course, to identify Internet sectors and to extract, process, and present information (especially information related to the deep web), you need a set of algorithms for each stage of the process, and a methodology to create new algorithms as needed to reflect the specific features of each sector of the Internet.
8. Three Purposes of Navigation Subsystems.
The following are the three main purposes of a network of navigation subsystems:
1. To index the deep web. Crawlers are unable to access the deep web; only humans can. The deep web is at least 99.9999% of the web. A network of 10,000 subsystems would cover not only the entire surface web but also hundreds of millions of web pages of the deep web (unreachable to top search engines).
2. To create quality search results millions of times faster than the individual creation of search results for each search term (the Google approach). As soon as a large network of specialized navigation subsystems is created, the extraction of search terms and the creation of search results for each search term can be done automatically, because the search terms belong to the same sector of the Internet.
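The bulk generation of search terms for a catalogued sector could look like the following sketch. The form list, term templates, and URL scheme are invented for the example; the point is only that results for every term in a sector can be produced in one pass once the sector's pages are known:

```python
# Hypothetical sketch of bulk search-term generation for one sector.
# Form names, term templates, and the URL scheme are illustrative only.
forms = ["Form 1040", "Form 1120"]
years = [2019, 2020]
templates = ["{form} {year}", "IRS {form} {year}", "{form} instructions {year}"]

results = {}
for form in forms:
    for year in years:
        # Stand-in URL; a real subsystem would use the catalogued page.
        url = f"https://example.gov/{form.replace(' ', '').lower()}/{year}"
        for t in templates:
            results[t.format(form=form, year=year).lower()] = url

print(len(results))  # 12
```

Two forms times two years times three templates yields 12 search terms with pre-prepared results in one pass; scaling the same loop to 30,000 forms and 30-40 templates yields the million-term-per-day figure the text describes.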
3. To expand the existing search system. The search results automatically extracted from the network of subsystems should be a part of the main search system. However, the creation of search results for the general search system is not the main purpose. Specialized navigation subsystems are to be created not to serve the main search system but to complement it; they must exist together with the main system. Specialized navigation subsystems have important advantages over the general search system because they reflect the specific features of each sector of the Internet.