The Science of the Google Search Engine


Wise Young
11-16-2006, 04:40 AM
Many people are still amazed by the commercial success of Google, which has made billions for its inventors and owners. Sergey Brin and Lawrence Page wrote a paper about Google and its rationale, just before implementation. That paper suggests that these two really knew what they were doing and worked hard at achieving the results that are now, of course, well known. The design of a scalable search engine is not a trivial task.

http://infolab.stanford.edu/~backrub/google.html

In this paper, they addressed several critical questions:
1. Crawling the web. The difficulty of knowing what is worthwhile crawling and what is not looms large.
2. Indexing the web. Establishing, limiting, and handling the lexicon of words for the index turned out to be a critical decision.
3. Improving the quality of searches. Computers have a real ability to find junk. How to get rid of the junk is a challenge.
4. Ranking the hits. It is amazing what Google uses for its ranking. According to this paper, they even included such information as position, font, and capitalization. This is reasonable because where HTML pages place information, the font they use, and the capitalization of the text give an indication of what the writer of the HTML wants to emphasize to the potential audience.
5. Search word proximity. They place a high emphasis on search word proximity. Thus, if you search for "Bill Clinton", it does not provide search results for other Clintons.
6. Advertising. This was probably the most interesting part of the paper. They address the question of advertising and payment for rankings subverting the integrity of search engines.
7. Scalability and cost. The last sentence of the last appendix of the paper is perhaps the most insightful statement of all.
"Because humans can only type or speak a finite amount, and as computers continue improving, text indexing will scale even better than it does now. Of course there could be an infinite amount of machine generated content, but just indexing huge amounts of human generated content seems tremendously useful. So we are optimistic that our centralized web search engine architecture will improve in its ability to cover the pertinent text information over time and that there is a bright future for search."
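
To make points 4 and 5 a little more concrete, here is a rough Python sketch of how a ranking could combine where a word appears, whether it is capitalized, and how close the search words sit to each other. This is only an illustration and not Google's actual code: the weights, the function names (build_index, score, rank), and the three sample pages are all made up, and the paper does not publish the real values. Note also that this sketch only ranks pages about other Clintons lower; it does not exclude them.

```python
import re
from collections import defaultdict

# Hypothetical weights, chosen only for illustration.
TITLE_WEIGHT = 3.0    # word appears in the page title
CAPITAL_WEIGHT = 1.5  # capitalized word in the body
PLAIN_WEIGHT = 1.0    # any other body word


def build_index(docs):
    """Map each lowercased word to {doc_id: [(position, weight), ...]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, (title, body) in docs.items():
        position = 0
        for text, in_title in ((title, True), (body, False)):
            for word in re.findall(r"[A-Za-z]+", text):
                if in_title:
                    weight = TITLE_WEIGHT
                elif word[0].isupper():
                    weight = CAPITAL_WEIGHT
                else:
                    weight = PLAIN_WEIGHT
                index[word.lower()][doc_id].append((position, weight))
                position += 1
            position += 50  # keep title and body words from looking adjacent
    return index


def score(index, query, doc_id):
    """Sum the hit weights, then add a bonus when query words are close."""
    words = query.lower().split()
    hit_lists = [index.get(w, {}).get(doc_id, []) for w in words]
    if any(not hits for hits in hit_lists):
        return 0.0  # every query word must appear on the page
    total = sum(weight for hits in hit_lists for _, weight in hits)
    bonus = 0.0
    for left, right in zip(hit_lists, hit_lists[1:]):
        # Smallest distance between occurrences of two neighboring query words.
        gap = min(abs(p2 - p1) for p1, _ in left for p2, _ in right)
        bonus += 10.0 / max(gap, 1)
    return total + bonus


def rank(index, query, doc_ids):
    """Return matching doc ids, best score first."""
    scored = [(score(index, query, d), d) for d in doc_ids]
    return [d for s, d in sorted(scored, reverse=True) if s > 0]


# Made-up sample pages: doc_id -> (title, body).
DOCS = {
    "a": ("Bill Clinton biography", "Bill Clinton served as president."),
    "b": ("Senate news", "Hillary Clinton spoke after the bill passed the Senate."),
    "c": ("Gardening tips", "Water tomatoes in the morning."),
}

if __name__ == "__main__":
    print(rank(build_index(DOCS), "bill clinton", DOCS))
```

Page "a" comes out on top because both words appear in its title and sit right next to each other; page "b" mentions both words but several words apart, so it gets only a small proximity bonus.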

Very interesting.

Wise.

MiamiProjectJames
12-02-2006, 02:34 AM
Dr. Young you are very interesting!

I'm a "webmaster" and never expected to find the keys to Google criteria on a SCI blog site.

Thank you,
James

Wise Young
12-02-2006, 03:45 AM
While doing searches for SCI information, I often run across interesting web sites. But the subject of how Google prioritizes its information has fascinated me ever since I started using it nearly a decade ago. It is hard to believe that Larry Page and Sergey Brin met only in 1995. According to the Google web site (http://www.google.com/corporate/history.html), Larry Page was a University of Michigan graduate on a weekend visit to Stanford and Sergey Brin had been assigned to show him around. The two apparently argued a lot and then started to work on their first search engine, called "BackRub". In 1998, after failing to interest potential investors, they decided to start the company themselves, brought in about $1 million of investment, and opened the company's doors in September 1998. By September 1999, they had removed the beta label from the Google search engine.

Recently, I have been using StumbleUpon (http://www.stumbleupon.com/) as a secondary web search engine for what Google failed to find. Nearly 1.6 million web searchers are looking through the web and rating sites. These ratings provide a collective consensus concerning web sites. One of the problems with Google is that it doesn't have a way of rating the sites for content other than when people click on the blurb that Google extracts and presents on the listed sites.

StumbleUpon installs as a plugin for Firefox. You can select whatever topic or keyword you want and it will show the websites that the most people have given a thumbs up.

For example, if you type in "stem cells", you get the following sites:
http://www.berkeley.edu/news/media/releases/2006/10/23_stretch.shtml
http://louisville.edu/medschool/anatomy/1-news/departmentnews_stem%20cell%20030406.htm
http://onthescene.msnbc.com/baghdad/2006/11/other_peoples_s.html
http://video.google.com/videoplay?docid=-2576465041891813206
http://www.newsvine.com/_news/2006/09/21/369931-stem-cells-put-a-stop-to-macular-degeneration
http://www.newstarget.com/020935.html
http://learn.genetics.utah.edu/units/stemcells/index.cfm
http://ksjtracker.mit.edu/?p=1283
http://www.mentalfloss.com/blogs/archives/1279
http://13gb.com/media.php?id=1709

Only two of the StumbleUpon sites show up among the top ten hits of a Google search:
http://stemcells.alphamedpress.org/
*http://en.wikipedia.org/wiki/Stem_cell
http://stemcells.nih.gov/index.asp
*http://learn.genetics.utah.edu/units/stemcells/
http://www.cnn.com/SPECIALS/2001/stemcell/
http://www.time.com/time/2001/stemcells/
http://www.stemcells.ca/
http://www.news.wisc.edu/packages/stemcells/
http://www.aaas.org/spp/sfrl/projects/stem/index.shtml
http://www.pbs.org/wgbh/nova/sciencenow/3209/04.html

Granted, not all the StumbleUpon sites are relevant, but they present a different picture of stem cells than Google does. The biggest difference between StumbleUpon and Google is that the former is more up-to-date. By the way, not many people have been using StumbleUpon to look for spinal cord injury sites, because it presents only two sites for "spinal cord injury". But if all the people on CareCure were to start recommending to StumbleUpon the sites that they post here, it would soon be a great resource for anybody who wants to see the latest and best spinal cord injury information.

Wise.

rollin64
12-02-2006, 08:09 PM
i installed stumbleupon after reading this post. i like it a lot. it's a fun and useful add-on to firefox, IMO.

cvelusc
12-02-2006, 08:31 PM
Apparently Google has a human-powered "eval" site that does additional verification of links. Here's an excerpt:

It's one of the best kept secrets of Google. It's a mystery (http://www.searchbistro.com/exit.php?url_id=368&entry_id=19) on Webmasterworld. Also in Europe (http://www.searchbistro.com/exit.php?url_id=369&entry_id=19) (France) they don't know what to expect from that odd URL http://eval.google.com (http://www.searchbistro.com/exit.php?url_id=370&entry_id=19). Click it and you get ...nothing. The site reveals itself only if you have the proper login and if you use a network known by Google. Residues of Eval.google are found (http://www.searchbistro.com/exit.php?url_id=371&entry_id=19) on the web, but the full content of the mystery site has never been published before. Here it is: the real story about Eval.Google. They use... humans! Welcome to the first entry in Search Bistro.

Continue for the source article (http://www.searchbistro.com/index.php?/archives/19-Google-Secret-Lab,-Prelude.html).