Hang Ten, Beverly Rosenbaum
How to get the most out of your web surfing
Many people consider “Google” to be synonymous with web searching, implying that it is the only search engine available.
Indeed, practically everyone uses the word “Google” as a verb to describe searching the Internet, much like “Xeroxing” is intended to mean photocopying. But Google is only one of hundreds of search engines and search tools available. And, depending upon what information you're seeking, selecting the best one can improve the success of your search. This is especially true when searching for more specialized technical, legal, medical or scientific information. As an example, Google would be a poor choice when looking for job opportunities, while there are more than ten job search engines that would yield much better results.
So exactly how do search engines work, and why would you need them? They’re actually tools designed to retrieve content from Internet indexes based on criteria defined by you, the user. These databases contain information collected from billions of pages and documents that are on the Internet. Google claims an index of more than 3.3 billion pages, and Yahoo more than 3.1 billion! Think of a search engine as a card catalog in a huge library, to help you locate the information you need without having to examine every single book yourself.
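To make the card catalog idea concrete, here is a minimal sketch of how an index might answer a search. It is only an illustration: the three pages and their addresses are made up, and real search engines use far more elaborate indexes and ranking than this.

```python
from collections import defaultdict

# Hypothetical mini "web": page address -> page text (made up for illustration).
PAGES = {
    "example.org/jobs": "find job listings and resume tips",
    "example.org/library": "library card catalog search tips",
    "example.org/news": "local news and job fair dates",
}

def build_index(pages):
    """Map each word to the set of pages that contain it (the 'card catalog')."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages that contain every word in the query, sorted by address."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

index = build_index(PAGES)
print(search(index, "job"))          # pages that mention "job"
print(search(index, "search tips"))  # pages that mention both words
```

The point of the design is that the search never reads the pages themselves; it only looks up each word in the index, just as you would look up a subject card instead of walking the shelves.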
When you sit at your computer and submit a search, you are presented with a list of results almost immediately. The speed of this search varies from one engine to another, and the results are often different because each search engine uses a different ranking process. And you’re probably wondering how search engines can collect information from so many pages that are constantly changing. To do this, they use software programs called “robots” or “crawlers” or “spiders” to continually follow hyperlinks from one document to another all around the Web. When these programs discover new links, updated pages, or dead links, they send that information back to their main site to be indexed. Google’s Googlebots fetch not only titles and text, but also copies of the page contents, and return them to their index stored on a huge set of computers. That is how you’re able to view from a Google search result a “cached” copy of how a web page last appeared, when the site may be currently unavailable.
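The link-following idea can be sketched in a few lines. This is a toy, not how Googlebot or any real crawler is written: the page names and the LINKS table are invented stand-ins for the Web, and nothing is fetched over a network.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to.
LINKS = {
    "home": ["about", "journal"],
    "about": ["home"],
    "journal": ["feb", "missing"],   # "missing" has no page behind it
    "feb": ["home"],
    "orphan": ["home"],              # no page links here
}

def crawl(start, links):
    """Follow links breadth-first from start; return (pages found, dead links)."""
    found, dead = [], []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page not in links:        # the link points to a page that does not exist
            dead.append(page)
            continue
        found.append(page)
        for target in links[page]:
            if target not in seen:   # visit each page only once
                seen.add(target)
                queue.append(target)
    return found, dead

found, dead = crawl("home", LINKS)
print(found)
print(dead)
```

Notice that the page named "orphan" is never found, because the crawler can only reach pages that some other page links to.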
An estimated 30 billion web pages are linked to more than 100 million web sites, and every single page has a unique address or URL (Uniform Resource Locator) to specify its location. This address incorporates three components: 1) the protocol, 2) the domain name or IP address where the resource is located, and 3) the path and file name. The protocol identifier is separated from the resource by a colon and two forward slashes; for HTTP (Hypertext Transfer Protocol), the resource name often begins with “www,” although that is a convention rather than a requirement. The parts of the domain name or IP address are separated with periods, and single forward slashes separate the domain from the path of the files. So in this example -- http://www.hal-pc.org/journal/2009/09_feb/index.html -- the protocol is HTTP, the domain name is www.hal-pc.org, and journal/2009/09_feb/index.html is the path for the index page of the February 2009 issue of HAL-PC Magazine.
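If you are curious how a program takes an address apart, Python's standard urllib.parse module will do it for you. This short sketch splits the example address above into its three components; note that the module calls the protocol the "scheme" and reports the path with its leading slash.

```python
from urllib.parse import urlsplit

url = "http://www.hal-pc.org/journal/2009/09_feb/index.html"
parts = urlsplit(url)

print(parts.scheme)   # the protocol
print(parts.netloc)   # the domain name
print(parts.path)     # the path and file name, with a leading slash
```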
Site Maps Play an Important Role
To improve visibility and inform search engines about the pages on their site, webmasters create an outline of those links called a Site Map. This is a common inclusion for web sites, and is usually located at the root of the server. The URL for ours is www.hal-pc.org/sitemap.html. For search engines, the Site Map takes the form of an XML file that contains URLs for the site along with additional information about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site), enabling search engines to more intelligently “crawl” the site. So the information from Site Maps augments the data collected in the “crawling” process. In addition, visitors to specific web sites often seek out and use the Site Map to find the page they need more quickly. This “bird’s-eye view” of the site’s content shows the structure and layout, and allows one-click access to all the topics. Visually impaired users who employ text readers to help them surf the Internet are also able to navigate web sites much more easily with a good Site Map. Site Maps are as important for human visitors as for the automated indexing “crawlers.”
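Here is a rough sketch of what an XML Site Map looks like and how a program might read it. The two entries are invented for illustration (they are not taken from any real site), while the tag names and the namespace address follow the published Sitemap Protocol.

```python
import xml.etree.ElementTree as ET

# A made-up two-page Site Map; the URLs and dates are illustrative only.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.org/</loc>
    <lastmod>2009-02-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.org/journal/index.html</loc>
    <lastmod>2009-01-15</lastmod>
    <changefreq>yearly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>"""

# Namespace used by the Sitemap Protocol, mapped to a short prefix.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def read_sitemap(xml_text):
    """Return one dictionary per <url> entry in the Site Map."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
    return entries

for entry in read_sitemap(SITEMAP_XML):
    print(entry["loc"], entry["lastmod"], entry["changefreq"], entry["priority"])
```

Each entry gives a crawler the three hints mentioned above: when the page last changed, how often it tends to change, and how important it is compared with the site's other pages.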
You can imagine how single web pages that are not linked from any other page might never appear in the search engine results, because a “crawler” has no link to follow to reach them. Google first introduced the Sitemap Protocol in June 2005 so web developers could publish lists of links from across their sites. The next year Google, Yahoo, and MSN announced their joint support, followed by other search engines, and state governments were the first to announce that they would use Site Maps on their web sites.
How to Search
I’m sure you’ve often entered a search term and either gotten too many pages of results or nothing at all. Here are a few rules to remember: The most important thing to do is keep the search simple by describing what you are looking for in as few words as possible. If you’d like to search for an exact phrase, enclose the words within quotation marks. You can exclude a word by adding it after your search terms with a minus sign (-) immediately in front of it and a space before the minus sign. Placing a plus sign (+) immediately before a search term will yield only an exact match and no synonyms.
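For readers who like to see rules spelled out, here is a small sketch of how a program might interpret those operators. It is an assumption-laden toy, not how Google or any other engine actually processes a query: the function names are invented, and because this toy has no synonyms, a plus sign is simply removed and the word is matched exactly.

```python
import re

def parse_query(query):
    """Split a query into exact phrases, excluded words, and ordinary terms."""
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', query)]
    rest = re.sub(r'"[^"]+"', " ", query)      # remove the quoted phrases
    excluded, terms = [], []
    for token in rest.split():
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:].lower())  # -word means "leave this out"
        else:
            term = token.lstrip("+").lower()    # +word is matched exactly here
            if term:
                terms.append(term)
    return phrases, excluded, terms

def matches(text, query):
    """True if text has every phrase and term and none of the excluded words."""
    phrases, excluded, terms = parse_query(query)
    lowered = text.lower()
    words = set(re.findall(r"[a-z0-9']+", lowered))
    return (all(p in lowered for p in phrases)
            and all(t in words for t in terms)
            and not any(x in words for x in excluded))

print(parse_query('"sales job" resume -cars +tips'))
print(matches("How to write a resume for a sales job", '"sales job" resume'))
print(matches("How to write a resume for a sales job", "resume -sales"))
```

The first check passes because the page contains both the exact phrase and the extra word; the second fails because the page contains a word the query asked to exclude.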
The biggest general search engines include Google (www.google.com/), Yahoo (www.yahoo.com/), and Ask (www.ask.com). While Google and Yahoo process search terms similarly, Ask allows you to enter your search in the form of a natural question, such as “How do I make a resume?”
Both Yahoo and Ask also provide kid-oriented search engines at kids.yahoo.com and www.askkids.com.
In meta-search engines like Dogpile (www.dogpile.com/), Mamma (www.mamma.com/), Clusty (clusty.com), or Copernic (find.copernic.com/), the keywords you submit in the search box are transmitted simultaneously to several individual search engines and their databases of web pages. Within a few seconds, you are presented with the results from all the search engines queried. Meta-search engines do not have their own database of web pages; they search the indexes maintained by other search engine companies. Most send their queries to smaller, free search engines and directories, but Dogpile uses Google, Yahoo, Ask.com, and MSN Livesearch. Many search engines blend into the results any sites that have purchased ranking and inclusion, so you’ll see “sponsored” links below or beside the search results.
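The meta-search idea can be sketched as follows. Everything here is hypothetical: engine_a and engine_b are stand-ins that return canned lists instead of contacting real search engines, the queries run one after another rather than simultaneously, and the simple take-turns merge is just one possible way to combine results, not the method Dogpile or any other service is known to use.

```python
# Hypothetical stand-ins for real engines: each returns a ranked list of addresses.
def engine_a(query):
    return ["a.example/1", "shared.example/x", "a.example/2"]

def engine_b(query):
    return ["shared.example/x", "b.example/1"]

def meta_search(query, engines):
    """Send the query to every engine and interleave the results, skipping repeats."""
    result_lists = [engine(query) for engine in engines]
    merged, seen = [], set()
    longest = max((len(r) for r in result_lists), default=0)
    for rank in range(longest):            # take each engine's 1st hit, then 2nd, ...
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

print(meta_search("resume", [engine_a, engine_b]))
```

Because the meta-search function keeps no list of pages of its own, it is only as good as the engines it calls, which is exactly the situation described above.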
Waiting 17 Years for an Engine
A new search engine is due to come online later this year as a result of the Anti Car Theft Act of 1992 (Public Law 102-519). The National Motor Vehicle Title Information System (NMVTIS) will provide a searchable database of Vehicle Identification Numbers (VINs) to prevent fraudulent retitling of salvaged vehicles. This system will provide an electronic means to verify and exchange titling, brand, and theft data among motor vehicle administrators, law enforcement officials, prospective purchasers, and insurance carriers, and will allow state titling agencies to verify the validity of ownership documents before they issue new titles. It has thus far taken 17 years to carry out a piece of legislation that a majority of Congress obviously agreed would benefit the public. When available, it will be found at www.nmvtis.gov/.
At the end of 2008 Google controlled 72% of all searches in the US, Yahoo had 14%, and MSN 8%, for a total of 94%. So all the other search engines together have 6% market share. In future columns we’ll explore what you may be missing. Did you ever want more information about some of the people in the news? There are lots of reference engines, and even a pronunciation engine that provides both phonetic and audible assistance from 50 resources.
The Internet is a great place to find information on any topic by letting your fingers do the walking on your keyboard. In future columns, we’ll delve more deeply into search strategies and how to select the best tools for your needs. If you have any search queries, questions, or favorite search tools to share, send them to firstname.lastname@example.org.
Beverly Rosenbaum, a HAL-PC member, is a 1999 and 2000 Houston Press Club “Excellence in Journalism” award winner.