Return-Path: <sentto-279987-5191-1029431755-fc=all.net@returns.groups.yahoo.com>
Delivered-To: fc@all.net
Received: from 204.181.12.215 [204.181.12.215] by localhost with POP3 (fetchmail-5.7.4) for fc@localhost (single-drop); Thu, 15 Aug 2002 10:19:09 -0700 (PDT)
Received: (qmail 10116 invoked by uid 510); 15 Aug 2002 17:14:25 -0000
Received: from n30.grp.scd.yahoo.com (66.218.66.87) by all.net with SMTP; 15 Aug 2002 17:14:25 -0000
X-eGroups-Return: sentto-279987-5191-1029431755-fc=all.net@returns.groups.yahoo.com
Received: from [66.218.66.94] by n30.grp.scd.yahoo.com with NNFMP; 15 Aug 2002 17:15:55 -0000
X-Sender: fc@red.all.net
X-Apparently-To: iwar@onelist.com
Received: (EGP: mail-8_0_7_4); 15 Aug 2002 17:15:54 -0000
Received: (qmail 17789 invoked from network); 15 Aug 2002 17:15:54 -0000
Received: from unknown (66.218.66.216) by m1.grp.scd.yahoo.com with QMQP; 15 Aug 2002 17:15:54 -0000
Received: from unknown (HELO red.all.net) (12.232.72.152) by mta1.grp.scd.yahoo.com with SMTP; 15 Aug 2002 17:15:53 -0000
Received: (from fc@localhost) by red.all.net (8.11.2/8.11.2) id g7FHGkg03386 for iwar@onelist.com; Thu, 15 Aug 2002 10:16:46 -0700
Message-Id: <200208151716.g7FHGkg03386@red.all.net>
To: iwar@onelist.com (Information Warfare Mailing List)
Organization: I'm not allowed to say
X-Mailer: don't even ask
X-Mailer: ELM [version 2.5 PL3]
From: Fred Cohen <fc@all.net>
X-Yahoo-Profile: fcallnet
Mailing-List: list iwar@yahoogroups.com; contact iwar-owner@yahoogroups.com
Delivered-To: mailing list iwar@yahoogroups.com
Precedence: bulk
List-Unsubscribe: <mailto:iwar-unsubscribe@yahoogroups.com>
Date: Thu, 15 Aug 2002 10:16:46 -0700 (PDT)
Subject: [iwar] [fc:Anchor.Votes]
Reply-To: iwar@yahoogroups.com
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, hits=0.1 required=5.0 tests=WEIRD_PORT,DIFFERENT_REPLY_TO version=2.20
X-Spam-Level:

Anchor Votes

Dennis W. Forbes <http://www.yafla.com/%7Edforbes/index.htm> - August 13th, 2002

Foreword

If you're like many, Google <http://www.google.com/> has become your primary search engine, either directly or indirectly via a business relationship with another site (for example, Yahoo <http://www.yahoo.com/> uses Google to search the web). For me, Google has become my home page: virtually every session begins with a Google search. Maybe I'm looking for some good documentation <http://www.dannyg.com/javascript/quickref/JSB4RefBooklet.pdf> for Netscape Navigator 4.7's obsolete and incomplete DOM, or a site showing the biggest skyscrapers <http://www.skyscraperpage.com/diagrams/index.php>. Whatever it is, Google usually gets me to my destination quickly.

Before I raise the ire of the hordes of fanatical Google fans, I should start this paper off by saying that I am a tremendous fan of Google, and I have yet to see any search engine which has it beat, but that doesn't mean that Google can't be made better. I am also concerned that Google's ranking technology is shrouded in tight-lipped secrecy. This is classic security by obscurity, and while I do believe that there is some merit to that in certain circumstances, the easy cause/effect analysis of Google renders such obscurity transparent to those who want to manipulate Google rankings for their own gain.

A Guess At The Technology Behind Google Rankings

*DISCLAIMER: I do not have access to Google's page ranking technology, and apart from some partial details on their site, they keep their ranking techniques secret to avoid intentional rank manipulation. As such, everything I say in this article is purely speculative, based upon analysis of search results for various terms and phrases. Please also note that I browse the web using Opera with pop-ups disabled, so follow any link at your own discretion.
Lately I've been fascinated by the techniques <http://www.google.com/technology/index.html> that Google uses to rank the search results, as obviously this dictates the usability of the results, and in turn the value of a website to businesses that are trying to get eyeballs to see their products and services. Indeed, a case could arguably be made that Google search positioning is becoming one of the most important "real estate value" elements of any web page (more important than acquiring a good domain name, although I note later that the domain name plus directory structure may well remain critically important if it's contextual for the good or service that you're selling). If Google were ever to sell page rankings, which they currently do not do, they would reap a windfall as every company rushed to make sure that it obtained the most eyeball potential.

The Google ranking technique, in a nutshell, is that every link provided to a site is a vote for the site, with the weighting of the vote being determined by the number of votes that the voting site itself has received (another scenario is that each site indirectly promotes each subpage through internal linking, though effectively this results in the same thing for any aggregate site which provides an index). I've highlighted "site" for an important reason: a vote from anywhere within Slashdot garners the approximate voting power of Slashdot as a whole, a site which is one of the most linked sites on the Internet. The same holds true for the other conversation sites such as www.plastic.com <http://www.plastic.com/> or www.kuro5hin.org <http://www.kuro5hin.org/>. The flip side is true as well: not only does the link vote apply to the destination page, but also to the site as a whole.
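The voting idea described above can be sketched in a few lines of Python. This is only an illustration of weighted link voting by repeated redistribution (a simplified PageRank-style iteration); the toy graph, the node names, the damping value of 0.85, the round count and the function name link_votes are all assumptions made for the example, not details of Google's actual system.

```python
def link_votes(links, damping=0.85, rounds=50):
    """Score nodes so that each outgoing link is a vote weighted by the voter's own score.

    links maps a node name to the list of nodes it links to (a hypothetical toy graph).
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(rounds):
        # Every node keeps a small baseline so unlinked nodes do not drop to zero.
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for voter in nodes:
            targets = links.get(voter, [])
            if targets:
                # The voter splits its (damped) score evenly among its links.
                share = damping * score[voter] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # A node with no outgoing links spreads its vote evenly over all nodes.
                share = damping * score[voter] / len(nodes)
                for n in nodes:
                    nxt[n] += share
        score = nxt
    return score


# Hypothetical graph: three blogs link to a big site, which links to one hosted page.
# The custom domain receives no links at all.
toy_graph = {
    "blog_a": ["big_site"],
    "blog_b": ["big_site"],
    "blog_c": ["big_site"],
    "big_site": ["hosted_page"],
    "custom_domain": [],
}
scores = link_votes(toy_graph)
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name:14s} {scores[name]:.3f}")
```

In this toy graph the hosted page inherits most of the big site's weight through a single internal link, while the unlinked custom domain is left with only the baseline share, which is the effect the aggregate-site examples describe.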
By their very nature, aggregate sites like GeoCities or Angelfire will get a lot of votes because they contain tens of thousands of pages, and by extrapolation every hosted page itself starts quite high in the rankings, regardless of its own merit. You can prove this yourself by browsing through the GeoCities pages <http://pages.yahoo.com/> and looking for various papers covering a specific topic, for instance this page <http://www.geocities.com/jlhorse7/mantid.html> on praying mantids (I randomly picked one as an example). Do a search on Google for mantids <http://www.google.com/search?q=mantids&hl=en&lr=&ie=UTF-8&oe=UTF-8&start=230&sa=N> (note: either `mantis' or `mantids' is correct), and there it is in the #7 position (you can repeat this with virtually any page on an aggregate site). To put that into perspective, there are some 750 pages dealing with mantids that are linked from Google, and that limit exists simply because that's the maximum number of results that Google will return for that particular search term. A quick check (using the link: search criteria) confirms that the page in question is linked by no other sites but itself.

Another example of a megalinked aggregate site is the members.aol.com domain (apparently now hometown.aol.com), where AOL members can post webpages. Search for "Ford transmission" <http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=ford+transmission&btnG=Google+Search> (perhaps you're having transmission problems and want to get them fixed, or you're thinking of purchasing a Ford and want to make sure the transmission is of good quality; something of this nature could correlate with billions of dollars in sales and service), and the #1 result is http://hometown.aol.com/MKBradley/index.html.
Again, to put this into perspective, the site in question is linked a lowly 11 times directly (2 times by itself), yet his/her site has become the #1 voice regarding Ford transmissions (a product in millions of cars), again because it seems to have indirectly acquired the "voting power" of the entire members.aol.com site. Is it really a democracy when every page on these megalinked aggregate sites becomes a premier voice on its topic? Is it valid that this page would be ranked much more favourably if I hosted it on GeoCities or aol.com?

Not only does Google rank pages based upon the gross number of links multiplied by the various weighting factors, and then sort them based upon whether the search criteria appear in the pages in question, but it also compares the text used in the links themselves with the search criteria. For instance, the following link, premier Greater Toronto Area software development and consulting company <http://www.yafla.com/>, gives www.yafla.com <http://www.yafla.com/> some bonus points for anyone searching for any of those anchored words. I have no beef with that, and it actually makes a lot of sense, barring tampering (which is inevitable in any tamperable system). This particular ranking method came to the forefront <http://www.wired.com/news/technology/0,1282,41401,00.html> a few years back in a rather hilarious circumstance.

It's clear from analyzing Google's results that not only do votes accumulate for pages via anchor "democracy", but additionally Google gives a heavy bonus to any page which includes one or more of the search words in the domain name or subdirectory. For instance, writing a page about fixing Ford transmissions would likely get you a far better ranking as http://www.fixingfordtransmissions.com/fix/transmission/fixit.html than it would as http://www.bobthemechanic.com/tipsandtricks/tip27.html.

So, anyways, that's a thoroughly amateur and largely obvious analysis of Google's page ranking techniques.
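The two bonuses speculated about above (query words appearing in inbound anchor text, and query words appearing in the domain name or path) can be illustrated with a small Python sketch. The function name keyword_bonus, the weights, and the simple substring matching are hypothetical choices made for this example; nothing here is known to reflect how Google actually scores pages.

```python
import re
from urllib.parse import urlparse


def tokens(text):
    """Return the lowercase alphanumeric word tokens of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_bonus(query, url, anchor_texts, url_weight=2.0, anchor_weight=1.0):
    """Hypothetical bonus: query words found in the URL and in inbound anchor text."""
    words = set(tokens(query))
    parsed = urlparse(url)
    url_text = (parsed.netloc + parsed.path).lower()
    # Substring matching, because domain names run words together.
    url_hits = sum(1 for w in words if w in url_text)
    anchor_words = set()
    for text in anchor_texts:
        anchor_words.update(tokens(text))
    anchor_hits = len(words & anchor_words)
    return url_weight * url_hits + anchor_weight * anchor_hits


# The two URLs from the Ford transmission example, with no inbound anchor text.
descriptive = keyword_bonus(
    "ford transmission",
    "http://www.fixingfordtransmissions.com/fix/transmission/fixit.html",
    [],
)
plain = keyword_bonus(
    "ford transmission",
    "http://www.bobthemechanic.com/tipsandtricks/tip27.html",
    [],
)
# A URL with no matching words, but matching words in one inbound link's text.
anchored = keyword_bonus(
    "toronto software",
    "http://www.yafla.com/",
    ["premier Greater Toronto Area software development and consulting company"],
)
print(descriptive, plain, anchored)  # 4.0 0.0 2.0
```

Under these assumed weights the descriptive URL earns a bonus from both query words, the plain URL earns none, and the yafla.com page earns its bonus purely from the words other sites use when linking to it.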
Google appears to rank sites not just by the number of anchor tag "votes" multiplied by a site's weighting factor (it does not seem to be a page-specific weighting, but rather a site weighting, i.e. an obscure, unlinked, and unvisited page in the wilderness of AOL's members pages appears to be given the weight of the entire site, and conversely garners the votes of the entire site), but additionally by domain name matches.

A Mystery Is Afoot

Of course, then there's the perplexing. A search for "Britney Spears" <http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=%22britney+spears%22&btnG=Google+Search> gives the expected sites with Britney Spears in the URL or with heavily linked Britney Spears content, but then coming in at #9 is a hit for "Shavlik Technologies" <http://www.shavlik.com/> (a company which recently earned some fame by having their hotfix checking tool endorsed and distributed by Microsoft <http://support.microsoft.com/default.aspx?scid=KB;EN-US;Q303215&>). Clearly something is afoot, as the page in question has no information whatsoever about Britney Spears, not even spicy pictures, nor does the URL have anything relating to Britney Spears in it.

The first step in determining why Shavlik's website was earning such a high ranking for an unrelated search was to look for any sites which linked to it, easily done with a quick link search <http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=link%3Awww.shavlik.com&btnG=Google+Search>. Among the various sites purportedly linking to Shavlik.com are quite a few whose current and cached versions contain no links whatsoever to the network security company; instead they link to Britney Spears content. One common element, at least at a cursory glance, was that they all linked to a now-defunct "britneyspearsnow.com" website.
Conversely, do a link check for sites linking to www.britneyspearsnow.com <http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=link%3Awww.britneyspearsnow.com&btnG=Google+Search> and, strangely, the very first hit is the Shavlik page. Clearly, either intentionally or unintentionally, Google is confused between the two sites, and Shavlik has ended up with an inflated page ranking because of it. I can't comment on what technically is going wrong without knowing how Google is determining links. However, I did do some hash checks to determine whether this is a very rare case where hash results collide, but none of the variants (MD4, MD5, SHA1, MD160) seemed to give common results for variations of the two URLs with various prefixes and suffixes (I didn't expect that they would, given the entropy involved), though my tests were far from exhaustive. To add fuel to the suspicion that there was intentional manipulation of search results, searching for "Shavlik Britney Spears" brings up a couple of pages that list Britney Spears fan pages, but sitting in the middle is a which-one-of-these-links-doesn't-belong Shavlik link.

How To Promote Your Own Site

Clearly there is some awareness out there as to how to manipulate the search rankings, and following are a few methods that I think are common:

* Create a site on a megalinked aggregate site, and if you still want to have your own domain then at least point to your domain from one or more aggregate sites. This is appropriately shunned by Google <http://www.google.com/webmasters/2.html#A1> (I believe they call it a "doorway" page). There are services that automate cross-linking for search engine rank boosting.
* If you decide to start from scratch and go it alone with your own domain, without the bonus of being an underling of a megasite and earning the instant bonus points, then encapsulate the content of your site in the URL. Don't worry too much about the length of the URL, as most people get there via search engines or links anyways. You'd probably be better off with http://www.how_to_burn_your_own_divx_mpeg4_collection.com <http://www.how_to_burn_your_own_divx_mpeg4_collection.com/> than http://www.htbyodmp4c.com <http://www.htbyodmp4c.com/>. Additionally, name each subdirectory and document after the content, e.g. /media_convergence_device/mpeg4/divx/build_your_own.html. As my disclaimer states, I cannot say with certainty that Google increases the ranking of sites with the search terms in the URL; however, empirical evidence via test searches seems to bear this out.
* Give yourself some freebies by using the signature line or link-to address on discussion boards to point to your own site. Throw your opinion into every discussion regardless of your experience or lack thereof.

In no way am I promoting any method that encourages false search rank increases, but the next time you look at a search page ranking, realize that many of them were achieved via these methods.

Why It Matters, and the Future

Page rankings on Google are tremendously important, to the point that one could make a case that they supersede the relevance of the various DNS authorities (indeed, DNS is largely becoming irrelevant): whether your business is on page 1 or page 30 can be the difference between prosperity and failure. Some studies <http://www.slis.ualberta.ca/cais2000/wolfram.htm> have indicated that the mean number of result pages viewed for a given query is approximately 1.8. If your result isn't on the first two pages, then the majority of users will never even see that your site exists, much less visit it.
Of course not every site can be on the front page, but for a given search phrase there has got to be a better way than simply promoting anyone who hosts on an aggregate site, or pays a search `optimizing' company to cross-link them hundreds or thousands of times.

It's an important point for the future because of a shift in the net. In the early days the net was largely populated with personal pages that could best be described as online bookmark lists: everyone put a site up basically linking to all of their friends' sites, and among this giant recursive network a couple of neat links could be found. Sites truly could be ranked based upon the "votes" that they received. Very few people actually do that anymore; instead, cross-linking is mostly the domain of search ranking manipulation sites. Among legitimate pages, many actually avoid linking at all, as every link represents a loss of a certain percentage of readers: if I was concerned about whether people would make it this far, I might be concerned that I lost 1.7% of readers who went off to read about praying mantids, or to download the latest hotfix checker, etc. Most sites now intentionally avoid linking anywhere outside of their own little world.

Anchor `voting' has largely become the victim of rank manipulations, and has proven itself to be a flawed technique for search rankings. Some other techniques are fledgling, such as the Alexa <http://www.alexa.com/> technique of monitoring users' browsing and formulating a "most popular" listing based upon that (and alternately monitoring similar sites that the user visited in a session for correlations), and apart from the privacy issues it may prove to be a practical approach in the future. Other approaches include representative users voting for pages, and so long as the votes apply to specific topical areas (i.e. "computer stores in the Halton region"), and are not judged against a site that is heavily visited due to its Britney Spears content, then that may be a viable solution. Of course, in such a case vote stuffing and tampering are again very likely.

Anyways, just some meanderings about search engine technology.

Cheers.

Re: Slashdot Posting 2002-08-14

Well, it looks like this got linked from Slashdot, as the referral logs expanded quickly. In any case, browsing through the postings <http://slashdot.org/article.pl?sid=02/08/14/1546233>, a couple of quick clarifications seem to be in order:

* I am not a Google expert, nor do I claim to be. I'm just a schmoe <http://www.dcn.davis.ca.us/%7Ebrandi/LeBrun/monkey.jpg> who became curious why some search results were superb, while others were utter garbage (and I am finding an increasing trend in the number of garbage results; not to be overly doomsdayish, but I truly do believe that the quality of Google results has been declining). I also noticed the trend towards any site that has a keyword in the URL being heavily promoted, and became curious whether it was documented. This is a personal interest simply because I use Google as a productivity tool regularly, and it really matters to me that it isn't destroyed.
* As I mentioned above, whether it is page-by-page rankings or page-site-page rankings, the results are largely the same: some guy with a page on GeoCities, whose own page is linked to by the GeoCities index, which itself has a heavy bonus for being the heavily linked GeoCities, is given a heavy bonus just for putting his site up on GeoCities (phew), while another guy whose page is on a custom domain starts at zero. The same thing holds true for a page on Slashdot: the main page, which is heavily bonused for being widely linked, links to a story, heavily promoting every link in that story out into the wild.
There are some angry individuals nitpicking about whether it really is "site-to-site"; however, I use that term to simplify (because the results are largely the same, as I detailed). The empirical evidence that it effectively becomes site-to-site for indexed sites is painfully easy to find.
* Others have pointed out that PageRank is a heavily documented and understood algorithm. It is only one of many factors that Google uses to order search results, and it is fundamentally different from "page rank" as described above, which is a composite of several algorithms. Note that I never used the term "PageRank"(TM), nor was I ever talking about that algorithm alone. One kind sir sent me a link to a patent <http://www.delphion.com/details?pn=US06285999__>, while another, not so kind, sir, in his posting pointed to a research paper <http://dbpubs.stanford.edu:8090/pub/1999-66>. One could search countless <http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=google+pagerank&btnG=Google+Search> excellent <http://pr.efactory.de/e-pagerank-algorithm.shtml> papers detailing algorithms for the PageRank technique. The result is that "pages vote for each other", as I detailed above. These are important facts, and they clearly give information regarding the PageRank algorithm; however, PageRank is apparently only one of many algorithms used to order results, and my concern is with the whole, not with one single element. Supposedly, in Google's master search algorithm the weighting of PageRank is at an all-time low because of some of the problems that I encountered.

------------------
http://all.net/

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
This archive was generated by hypermail 2.1.2 : 2002-10-01 06:44:32 PDT