Wise Young
11-16-2006, 04:40 AM
Many people are still amazed by the commercial success of Google, which has made billionaires of its inventors and owners. Sergey Brin and Lawrence Page wrote a paper about Google and its rationale while it was still a research prototype. That paper suggests that these two really knew what they were doing and worked hard at achieving the results that are now, of course, well known. The design of a scalable search engine is not a trivial task.
http://infolab.stanford.edu/~backrub/google.html
In this paper, they addressed several critical questions:
1. Crawling the web. The difficulty of knowing what is worth crawling and what is not looms large.
2. Indexing the web. Establishing, limiting, and handling the lexicon of words for the index turned out to be a critical decision.
3. Improving the quality of searches. Computers have a real ability to find junk. How to get rid of the junk is a challenge.
4. Ranking the hits. It is amazing what Google uses for its ranking. According to this paper, they even record each word's position, font, and capitalization. This is reasonable because where an HTML page places information, the font it uses, and the capitalization of the text all give an indication of what the writer of the page wants to emphasize to the potential audience.
5. Search word proximity. They place a high emphasis on how close together the search words appear on a page. Thus, if you search for "Bill Clinton", pages where the two words appear together rank ahead of pages that merely mention other Clintons.
6. Advertising. This was probably the most interesting part of the paper. They address how advertising and payment for rankings can subvert the integrity of search engines.
7. Scalability and cost. The closing lines of the last appendix of the paper are perhaps the most insightful statement of all:
"Because humans can only type or speak a finite amount, and as computers continue improving, text indexing will scale even better than it does now. Of course there could be an infinite amount of machine generated content, but just indexing huge amounts of human generated content seems tremendously useful. So we are optimistic that our centralized web search engine architecture will improve in its ability to cover the pertinent text information over time and that there is a bright future for search."
Very interesting.
Wise.
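P.S. To make points 4 and 5 above more concrete, here is a toy sketch in Python of how position, font, capitalization, and word proximity could feed into a score. This is only my illustration, not the code or formula from the paper: the Hit fields, the weights in TYPE_WEIGHTS, and the functions hit_weight, proximity_bonus, and score are all made-up names and numbers.

```python
from dataclasses import dataclass

# Hypothetical weights for where a word occurs; the real values are not public.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "plain": 1.0}


@dataclass
class Hit:
    """One occurrence of a word on a page (assumed structure)."""
    word: str
    position: int            # word offset within the page
    kind: str = "plain"      # "title", "anchor", or "plain"
    font_size: int = 1       # relative size, 1 = normal text
    capitalized: bool = False


def hit_weight(hit):
    """Weight a single hit by its type, font size, and capitalization."""
    weight = TYPE_WEIGHTS.get(hit.kind, 1.0)
    weight *= 1.0 + 0.25 * (hit.font_size - 1)   # larger font counts more
    if hit.capitalized:
        weight *= 1.1                            # small boost for capitals
    return weight


def proximity_bonus(hits_a, hits_b):
    """Return 1/gap for the closest pair of hits (1.0 when adjacent)."""
    if not hits_a or not hits_b:
        return 0.0
    gap = min(abs(a.position - b.position) for a in hits_a for b in hits_b)
    return 1.0 / gap if gap > 0 else 0.0


def score(page_hits, query):
    """Score a page for a list of query words; 0.0 if any word is missing."""
    per_word = {w: [h for h in page_hits if h.word == w] for w in query}
    if any(not hits for hits in per_word.values()):
        return 0.0
    base = sum(hit_weight(h) for hits in per_word.values() for h in hits)
    bonus = sum(
        proximity_bonus(per_word[first], per_word[second])
        for first, second in zip(query, query[1:])
    )
    return base * (1.0 + bonus)


query = ["bill", "clinton"]
adjacent = [Hit("bill", 0), Hit("clinton", 1)]
far_apart = [Hit("bill", 0), Hit("clinton", 10)]
other_clinton = [Hit("hillary", 0), Hit("clinton", 1)]
in_title = [Hit("bill", 0, kind="title"), Hit("clinton", 1, kind="title")]

for name, page in [("adjacent", adjacent), ("far apart", far_apart),
                   ("other Clinton", other_clinton), ("in title", in_title)]:
    print(name, score(page, query))
```

In this sketch a page with "Bill" right next to "Clinton" beats a page where the two words are ten words apart, a page about some other Clinton scores zero because "bill" never appears, and the same words in a title beat the same words in body text.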