For many, Google is the internet. It’s the starting point for finding new sites and is arguably the most important invention since the internet itself. Without search engines, new web content would be inaccessible to the masses.
But do you know how search engines work? Every search engine has three main functions: crawling (to discover content), indexing (to track and store content), and retrieval (to fetch relevant content when users query the search engine).
How Do Search Engines Work?
Search engines perform several activities in order to deliver search results.
- Crawling – The process of fetching all the web pages linked to a website. This task is performed by software called a crawler or a spider (Googlebot, in Google's case).
- Indexing – The process of creating an index of all the fetched web pages and storing them in a giant database from which they can later be retrieved. Essentially, indexing means identifying the words and expressions that best describe a page and assigning the page to particular keywords.
- Processing – When a search request comes in, the search engine processes it, i.e. it compares the search string in the request with the indexed pages in the database.
- Calculating Relevancy – More than one page is likely to contain the search string, so the search engine calculates the relevancy of each of those pages to the search string.
- Retrieving Results – The last step is retrieving the best-matched results and displaying them in the browser. (A rough sketch of the data these steps pass around follows this list.)
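To make the pipeline a little more concrete, here is a purely illustrative Python sketch of the kind of data each stage might handle. The class names and fields are assumptions made up for this example, not how any real engine stores things.

```python
# Hypothetical data shapes for a toy search engine (illustration only).
from dataclasses import dataclass, field

@dataclass
class CrawledPage:
    """What the crawler hands to the indexer for one fetched page."""
    url: str
    title: str
    text: str
    outgoing_links: list[str] = field(default_factory=list)

# The index can be as simple as: keyword -> set of URLs containing it.
Index = dict[str, set[str]]

@dataclass
class SearchResult:
    """What processing and relevancy calculation produce for one query."""
    url: str
    relevance: float  # higher means a better match for the search string
```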
Search engines such as Google and Yahoo! often update their relevancy algorithms dozens of times per month. When you see changes in your rankings, it is usually due to an algorithmic shift or something else outside of your control.
Although the basic principle of operation is the same for all search engines, the minor differences between their relevancy algorithms can lead to major differences in the relevance of their results.
Crawling

Crawling is where it all begins: the acquisition of data about a website.
This involves scanning sites and collecting details about each page: titles, images, keywords, other linked pages, etc. Different crawlers may also look for different details, like page layouts, where advertisements are placed, whether links are crammed in, etc.
But how is a website crawled? An automated bot (called a “spider”) visits page after page as quickly as possible, using page links to find where to go next. Even in the earliest days, Google’s spiders could read several hundred pages per second. Nowadays, it’s in the thousands.
When a web crawler visits a page, it collects every link on the page and adds them to its list of next pages to visit. It goes to the next page in its list, collects the links on that page, and repeats. Web crawlers also revisit past pages once in a while to see if any changes happened.
This means any site that’s linked from an indexed site will eventually be crawled. Some sites are crawled more frequently, and some are crawled to greater depths, but sometimes a crawler may give up if a site’s page hierarchy is too complex.
One way to understand how a web crawler works is to build one yourself. There is a tutorial on creating a basic web crawler in PHP, so check that out if you have any programming experience.
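If PHP isn't your thing, the same idea fits in a few dozen lines of Python. The sketch below uses only the standard library and is a bare-bones breadth-first crawler; the start URL, page limit, and lack of robots.txt handling or rate limiting are simplifications for illustration, not how a production crawler behaves.

```python
# A minimal breadth-first crawler sketch (standard library only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    """Visit pages breadth-first, following the links found on each page."""
    queue = deque([start_url])
    seen = {start_url}
    pages = {}  # url -> raw HTML
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```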
Indexing

Indexing is when the data from a crawl is processed and placed in a database.
Imagine making a list of all the books you own, their publishers, their authors, their genres, their page counts, etc. Crawling is when you comb through each book, while indexing is when you log each one in your list.
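In code, that "list of books" usually becomes an inverted index: a map from each word to the pages that contain it. This toy version assumes `pages` is the {url: text} dictionary produced by a crawl like the one sketched above.

```python
# Build a toy inverted index: word -> set of URLs whose text contains it.
import re
from collections import defaultdict

def build_index(pages):
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index
```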
Retrieval and Ranking
Retrieval is when the search engine processes your search query and returns the most relevant pages that match your query.
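As a toy illustration of retrieval, the sketch below looks up each query term in the inverted index from the previous section and orders candidate pages by how many query terms they contain. Real engines combine far more signals, so treat the scoring here purely as a placeholder for the idea.

```python
# Toy retrieval: find pages matching the query terms and order them by a
# crude score (number of matching terms). Assumes `index` is the
# word -> {urls} map built earlier.
import re
from collections import Counter

def retrieve(query, index, top_n=10):
    terms = re.findall(r"[a-z0-9]+", query.lower())
    matches = Counter()
    for term in terms:
        for url in index.get(term, set()):
            matches[url] += 1  # one point per matching query term
    return [url for url, _ in matches.most_common(top_n)]
```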
Real ranking algorithms check your search query against billions of pages to determine each one's relevance. Companies guard these complex algorithms as closely held industry secrets, since a better algorithm translates to a better search experience.
They also don’t want web creators to game the system and unfairly climb to the tops of search results. If the internal methodology of a search engine ever got out, all kinds of people would surely exploit that knowledge to the detriment of searchers like you and me.
Search engine exploitation is possible, of course, but isn’t so easy anymore.
Originally, search engines ranked sites by how often keywords appeared on a page, which led to “keyword stuffing” — filling pages with keyword-heavy nonsense.
Then came the concept of link importance: search engines valued sites with lots of incoming links because they interpreted site popularity as relevance. But this led to link spamming all over the web. Nowadays, search engines weight links depending on the "authority" of the linking site, putting more value on a link from a government agency than on a link from a link directory.
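The best-known formulation of link authority is PageRank. The sketch below is a heavily simplified variant of that idea, where each page repeatedly passes a share of its score to the pages it links to, so links from high-scoring pages count for more. The link graph and constants here are assumptions for illustration only.

```python
# Simplified PageRank-style authority scores. `links` maps each URL to the
# list of URLs it links to; pages linked from high-scoring pages end up
# with higher scores themselves.
def authority_scores(links, damping=0.85, iterations=20):
    pages = list(links)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            targets = [t for t in outgoing if t in nxt]
            if not targets:
                continue
            share = damping * score[page] / len(targets)
            for target in targets:
                nxt[target] += share
        score = nxt
    return score

# Example: a tiny three-page web.
demo = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(authority_scores(demo))
```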
What’s Next for Search Engines?
The big push is toward semantic search: understanding what a query means, not just which keywords it contains. Here's the gist of it.
Right now, you can search for “gluten-free cookies,” but the results may not actually be gluten-free cookie recipes. Instead, you might find regular cookie recipes that say “This recipe is not gluten-free.” The page has the right keywords, but the wrong meaning.
With semantics, you can search for cookie recipes and then remove certain ingredients: flour, nuts, etc. You can also narrow down results to only recipes with prep times less than 30 minutes and review scores of 4/5 or greater. That would be cool, right? That’s where we’re heading!
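One hand-wavy way to picture that kind of semantic filtering: once recipes carry structured, machine-readable fields, narrowing by ingredient, prep time, and rating becomes a simple data query instead of a keyword match. The recipe data below is invented purely for illustration.

```python
# Toy "semantic" filtering over structured recipe data (all values invented).
recipes = [
    {"name": "Gluten-Free Oat Cookies",
     "ingredients": ["oats", "sugar", "butter"], "prep_minutes": 25, "rating": 4.6},
    {"name": "Classic Chocolate Chip Cookies",
     "ingredients": ["flour", "sugar", "chocolate"], "prep_minutes": 20, "rating": 4.8},
]

def find_recipes(recipes, exclude=(), max_prep=30, min_rating=4.0):
    """Return recipes without excluded ingredients, within prep-time and rating limits."""
    return [r for r in recipes
            if not set(exclude) & set(r["ingredients"])
            and r["prep_minutes"] <= max_prep
            and r["rating"] >= min_rating]

print(find_recipes(recipes, exclude=["flour", "nuts"]))
```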