In our society one thing is key: information. With the rise of the information age came a growing dependency on it, and in the last two decades much of our society has become completely dependent on it. Whole new markets have emerged solely from marketing information: platforms such as Facebook make their billions from collecting, sorting and selling it. The raw amount of data that exists today is simply mind-boggling, with almost 2.5 quintillion bytes of data being created every day in 2017 (10 Key Marketing Trends for 2017, 2017). Perhaps more surprising is the estimate that 90% of the data in existence was created in the last two years, which indicates exponential growth in the amount of data we hold. The IDC predicts that by 2025 there will be a massive 175 zettabytes of data in existence (Reinsel, Gantz and Ryding, 2018). All this data raises questions, such as how we sort and access it. Whilst much of it is private data used in big data processes, a large proportion is represented on webpages. An estimated 6.08 billion pages are indexed by various search engines (Worldwidewebsize.com, 2019), and it is through these tools that the general public accesses most of its information.
In exploring the scalability of search engines, it is important to have a strong understanding of what a search engine is. Put simply, a search engine is a database tool: a user enters a search term (the query) and the search engine cross-references it against entries inside its database, showing the user the most relevant results. Many more processes occur for modern search engines to work, such as language interpretation, but this paper will focus on scalability.
When an end-user thinks about a search engine, they often think of the search engine results page (SERP), the infamous little box that can provide an answer to almost anything. After a user enters a search term, they are presented with the SERP. This page is made up of a few elements that are important to understand. Whilst its main output is the raw search results the user requested, it also contains many other elements: related images, advertisements, videos, map results and all sorts of other information. Each of these individual items is, in its own right, a small search engine: the query produces the search results but also all of these other targeted pieces of information. And as a search engine's main source of revenue is advertising, this is one of the most important aspects for the companies involved. When discussing the scalability of search engines, it is important to remember that we are dealing not only with the pure search results but with all these related information sources.
To populate this database of websites, search engines utilize crawling. A crawler searches the internet for new webpages, updated webpages, videos, pictures and many other types of file. All the information it finds is then added to a database of discovered URLs (Muller, n.d.).
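The discovery loop described above can be sketched in a few lines. This is a minimal illustration only, assuming a hypothetical fetch_page() function that returns a page's HTML; real crawlers add politeness delays, robots.txt checks and distributed queues.

```python
# A minimal sketch of the crawl loop: breadth-first discovery of URLs.
# fetch_page is an assumed callable (url -> HTML string or None).
from collections import deque
from urllib.parse import urljoin
import re

def extract_links(html, base_url):
    """Pull href targets out of raw HTML (naive regex, for illustration only)."""
    return [urljoin(base_url, m) for m in re.findall(r'href="([^"]+)"', html)]

def crawl(seed_urls, fetch_page, limit=100):
    """Visit pages starting from the seeds, recording every URL found."""
    frontier = deque(seed_urls)
    discovered = set(seed_urls)
    while frontier and len(discovered) < limit:
        url = frontier.popleft()
        html = fetch_page(url)
        if html is None:
            continue
        for link in extract_links(html, url):
            if link not in discovered:
                discovered.add(link)
                frontier.append(link)
    return discovered
```

The set of discovered URLs returned here corresponds to the "discovered URLs database" that feeds the indexing stage.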
Once this process is complete, the content is indexed. This is where more meaning is derived from the simple URLs gathered in the previous phase: the search engine processes each URL it gathered, scanning it for keywords, file type and other related information, and puts the result into a structured format the search engine can utilize (Marsden, 2018).
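The structure this phase produces is commonly an inverted index: each keyword maps to the set of documents containing it, so a query term can be resolved without scanning every page. The sketch below is a toy version under that assumption; the names and format are illustrative, not any search engine's actual schema.

```python
# A toy inverted index: keyword -> set of URLs containing that keyword.
from collections import defaultdict

def build_index(documents):
    """documents: dict mapping URL -> page text."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def lookup(index, query):
    """Return the documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())  # intersect per-term postings
    return results
```

A lookup touches only the postings for the query's terms, which is why indexing up front makes querying scale.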
The Google Search index is a massive store of information that has been sorted to Google's specification. It contains "hundreds of billions of web pages and is well over 100,000,000 gigabytes in size" (Google.com, n.d.).
Finally, ranking takes place. When users enter their search term, they are not provided with a simple list of results from the database. Instead, they are presented with a carefully curated list of results that best anticipates their search needs.
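Ranking can be pictured as scoring each candidate document against the query and sorting best-first. Production ranking combines a very large number of signals; in this sketch, simple term frequency stands in for all of them.

```python
# A minimal ranking sketch: score candidates by query-term frequency,
# then return them best-first. Term frequency is a stand-in for the
# many signals a real search engine combines.
def rank(query, documents):
    """documents: dict of URL -> text. Returns URLs ordered by score."""
    terms = query.lower().split()

    def score(text):
        words = text.lower().split()
        return sum(words.count(t) for t in terms)

    return sorted(documents, key=lambda url: score(documents[url]), reverse=True)
```

The curated list the user sees on the SERP is the output of this final sorting step, not the raw database order.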
Now that we have a basic understanding of how a search engine works, we can explore where scalability issues arise. From this overview, three key resources are demanded most heavily: storage, bandwidth and compute power. As referenced earlier, Google has over 100 petabytes of webpage information in its index (Google.com, n.d.). All of this data needs to be stored, backed up and accessed in and from many different geographic locations. What's more, the size of this database is increasing at an exponential rate; the figure of 100 petabytes was most likely outdated at the time of its release, given the speed at which the internet is growing. Furthermore, Google does not just store this information, it processes it in a number of different ways, and this immense amount of data requires an equally impressive amount of computing power to manipulate it into a form useful to the search engine.
One simple, yet limited, solution to scaling is increasing the raw amount of resources available to the systems. To keep up with the massive expansion of the internet, more resources are required to gather, store and process it. This is facilitated by the many datacentres operated by search engine companies: Google owns and operates around 19 large main datacentres globally (Google Data Centres, n.d.) to provide the compute and storage a company of its size needs. These datacentres are not solely for the use of Google Search but are part of a larger overall 'Google Cloud' presence which fulfils both its own needs and those of other companies and individuals who rent resources from it. This type of infrastructure means that tasks can be performed on large clusters of machines; each node in a cluster can contain multiple multithreaded CPUs, large amounts of RAM and access to large network-based storage for databases, and can utilize the large amount of bandwidth a datacentre such as Google's can access. Most of the tasks required for search, such as web crawling, can be distributed across multiple nodes; in theory, this leaves near-unlimited resources available, provided the datacentre has enough capacity to scale.
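One common way such a crawl workload might be split across cluster nodes is to hash each URL's host so that every node owns a disjoint slice of the web. This is a sketch of the general technique, not a description of any company's actual scheme; the node count and function names are invented.

```python
# Sketch: partition crawl work across nodes by hashing each URL's host,
# so the same host always lands on the same worker node.
import hashlib
from urllib.parse import urlparse

def assign_node(url, num_nodes):
    """Deterministically map a URL's host to one of num_nodes workers."""
    host = urlparse(url).netloc
    digest = hashlib.sha256(host.encode()).hexdigest()
    return int(digest, 16) % num_nodes

def partition(urls, num_nodes):
    """Group URLs into per-node work lists."""
    buckets = {n: [] for n in range(num_nodes)}
    for url in urls:
        buckets[assign_node(url, num_nodes)].append(url)
    return buckets
```

Keeping one host on one node also makes politeness (rate-limiting requests to a single site) easy to enforce locally, a useful side effect of this design choice.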
As these datacentres don't just serve the needs of the search engine, plenty of resources are available for other tasks. This allows a high level of on-the-fly scalability, where more nodes can be added to a cluster from a reserve pool that is permanently available but not a cost implication due to its other uses. This type of scalability helps to ensure the most cost-effective use of resources possible by assigning only what is needed to the search processes.
However, the extent to which resource scaling is actually viable is limited. At the end of the day, search companies are businesses that want to keep costs as low as possible to remain profitable. Increasing the raw power available to search engines is costly: datacentres have huge overheads, with large costs coming from cooling and powering the hardware. As such, whilst it is important to scale up hardware, it is more important to make the processes they run as optimized as possible.
As mentioned before, a web crawler is an essential part of the operations of a search engine. It effectively creates a local cache of the internet so that other processes within the search engine have quick access to the information. The goal of an effective web crawler is to keep the difference between its local cache and the internet itself as small as possible, which is a challenge considering the rate at which the internet is growing and changing. There is no 'central database' of all the URLs on the internet; instead, the crawler has to go out and discover them. One way in which this has been made more scalable is by utilizing certain data sources. In particular, Google has accomplished this through the Chrome browser. As of October 2019, Chrome had a 64.92% market share of internet browsers (StatCounter Global Stats, 2019), making it by far the most popular. This is a massive data source that Google can exploit: through analytics, the browser can report back to Google the URLs it has visited. This user-generated data is perfect for discovering new URLs, particularly those used by humans on a regular basis. It can also inform the crawler how often a site should be "re-crawled" and what its priority within the search engine should be. This type of data gathering moves the process of URL collection from the search engine to the user, utilizing their hardware. As the number of websites on the internet increases, so does the number of users, and so this data source automatically scales with the spread of the web.
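The idea of letting visit data inform recrawl scheduling can be sketched as a priority queue: pages that are both popular and stale get fetched first. The weighting below (visits multiplied by staleness) is invented purely for illustration; real schedulers use far richer models.

```python
# Sketch: prioritise recrawling by combining reported visit counts with
# time since last crawl. The priority formula is an illustrative guess.
import heapq

def recrawl_order(pages, now):
    """pages: list of (url, visits, last_crawled) tuples.
    Returns URLs most-urgent-first."""
    heap = []
    for url, visits, last_crawled in pages:
        staleness = now - last_crawled
        priority = visits * staleness  # popular and stale => urgent
        heapq.heappush(heap, (-priority, url))  # max-heap via negation
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

Under this scheme the crawler's limited bandwidth is spent where the cache is most likely to be both out of date and actually consulted by users.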
As mentioned earlier, search engines display more than just search results; many other types of information, such as photos and videos, are also shown, each utilizing its own crawler specific to that datatype. There are also other crawlers running, such as experimental and non-production web crawlers. In a search engine company like Google, it would be preposterous to assume a single crawler is used. This poses a problem of its own: operating multiple web crawlers in an uncoordinated manner places extreme demands on infrastructure and could easily saturate the company's available network bandwidth. To combat this, companies employ one central web page fetching system. It acts as a gateway between the company's internal crawlers and the open web, reducing the risk of multiple requests for the same item.
[Figure: a two-level web page fetching system shared by multiple web crawlers (Cambazoglu and Baeza-Yates, 2015)]
The diagram above visualizes a two-level web page fetching system shared by different web crawlers. From this we can see that multiple web crawlers put no more strain on the external system than a single one would. All requests are processed so as to provide the best result with the fewest resources used. The high-level fetching system can utilize caching, so that if one crawler has recently requested the information from a URL, another making the same request can receive a cached version instead of the system using external bandwidth to fetch it again from the web.
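The caching behaviour of such a shared fetching layer can be sketched as follows. The class name, TTL policy and counters are invented for the sketch, assuming an external fetch function is supplied; the point is only that a second request within the cache lifetime never leaves the datacentre.

```python
# Sketch of a shared fetch gateway: crawlers ask the gateway for a URL;
# recently fetched pages are served from cache, saving external bandwidth.
import time

class FetchGateway:
    def __init__(self, fetch_fn, ttl=300):
        self._fetch = fetch_fn       # performs the real external request
        self._ttl = ttl              # seconds a cached copy stays valid
        self._cache = {}             # url -> (timestamp, content)
        self.external_fetches = 0    # counts requests that left the DC

    def get(self, url, now=None):
        now = time.time() if now is None else now
        entry = self._cache.get(url)
        if entry and now - entry[0] < self._ttl:
            return entry[1]          # cache hit: no external bandwidth used
        self.external_fetches += 1
        content = self._fetch(url)
        self._cache[url] = (now, content)
        return content
```

With N internal crawlers sharing one gateway, external traffic for a hot URL stays close to what a single crawler would generate, which is exactly the property the two-level design is after.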
As search engines are big business, there is plenty of competition. One way companies can stand out in their search offering is through proprietary software. One such example can be found within Google: Bigtable. Bigtable is a data storage system designed to handle large volumes of compressed data at high speed. It is built on the Google File System (GFS), another of Google's proprietary inventions. This is an example of scalability in which efficiency is key: the system was developed to reduce the raw performance required to store high volumes of data, something crucial in search. It is also designed to scale over thousands of nodes, ensuring that, even though it is efficient, it has the ability to utilize more hardware when demands increase. In search engines, the development of new technology is key to success.
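The core idea that lets a store of this kind spread over thousands of nodes is keeping rows sorted by key, so that a contiguous key range can be split off and handed to another node as data grows. The sketch below shows that sorted-range idea only; it is a gross simplification and bears no resemblance to Bigtable's actual API.

```python
# Sketch: a sorted key-value store. Because keys stay ordered, any
# contiguous range can be scanned cheaply or reassigned to another node.
import bisect

class SortedStore:
    def __init__(self):
        self._keys = []     # kept in sorted order at all times
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, end):
        """Return (key, value) pairs with start <= key < end."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]
```

Range scans over sorted keys are what make it cheap to read, say, all pages of one site in a row, and what make splitting the keyspace across machines a natural scaling move.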
To conclude, scalability is a massive issue for search engines. The rapid expansion of the internet means that search engines will have to deal with more information, at an increasing rate. However, it is evident that many search companies, such as Google, are prepared for this. They have developed technologies and infrastructure to manage this excess of data, and it is these tools that are the defining factor in a search engine's success. Proprietary tools such as GFS and Bigtable are what Google can attribute much of its success to, demonstrating how technological investment is key to becoming a market leader. In summary, although the internet has expanded rapidly, so has the technology supporting it, and it is the development of new tools in the field of computer science that allows our information age to expand yet remain accessible.
10 Key Marketing Trends for 2017. (2017). [whitepaper] IBM Marketing Cloud, p.3. Available at: http://comsense.consulting/wp-content/uploads/2017/03/10_Key_Marketing_Trends_for_2017_and_Ideas_for_Exceeding_Customer_Expectations.pdf [Accessed 28 Oct. 2019].
Cambazoglu, B. and Baeza-Yates, R. (2015). Scalability challenges in web search engines. Morgan & Claypool, pp.20-40.
Google Data Centers. (n.d.). Discover our data center locations. [online] Available at: https://www.google.com/about/datacenters/locations/ [Accessed 28 Oct. 2019].
Google.com. (n.d.). How Google Search works | Crawling & indexing. [online] Available at: https://www.google.com/search/howsearchworks/crawling-indexing/ [Accessed 3 Nov. 2019].
Hoff, T. (2008). Google Architecture – High Scalability. [online] Highscalability.com. Available at: http://highscalability.com/google-architecture [Accessed 3 Nov. 2019].
Marsden, S. (2018). What is Search Engine Indexing & How Does it Work? – DeepCrawl. [online] DeepCrawl. Available at:
Muller, B. (n.d.). How Search Engines Work: Crawling, Indexing, and Ranking | Beginner’s Guide to SEO. [online] Moz. Available at: https://moz.com/beginners-guide-to-seo/how-search-engines-operate [Accessed 3 Nov. 2019].
Reinsel, D., Gantz, J. and Ryding, J. (2018). The Digitization of the World. [whitepaper] IDC, p.3. Available at: https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf [Accessed 3 Nov. 2019].
StatCounter Global Stats. (2019). Browser Market Share Worldwide | StatCounter Global Stats. [online] Available at: https://gs.statcounter.com/browser-market-share [Accessed 28 Oct. 2019].
Worldwidewebsize.com. (2019). WorldWideWebSize.com | The size of the World Wide Web (The Internet). [online] Available at: https://www.worldwidewebsize.com/ [Accessed 28 Oct. 2019].