[iwar] [fc:Anchor.Votes]

From: Fred Cohen (fc@all.net)
Date: 2002-08-15 10:16:46


To: iwar@onelist.com (Information Warfare Mailing List)

Anchor Votes


Dennis W. Forbes <http://www.yafla.com/%7Edforbes/index.htm> - August 13th, 2002


Foreword


If you're like many, Google <http://www.google.com/> has become your primary search
engine, either directly or indirectly via a business relationship with another site
(for example, Yahoo <http://www.yahoo.com/> uses Google to search the web). For me,
Google has become my home page: Virtually every session begins with a Google search.
Maybe I'm looking for some good documentation
<http://www.dannyg.com/javascript/quickref/JSB4RefBooklet.pdf> for Netscape Navigator
4.7's obsolete and incomplete DOM, or a site showing the biggest skyscrapers
<http://www.skyscraperpage.com/diagrams/index.php>. Whatever it is, Google usually
gets me to my destination quickly. Before I raise the ire of the hordes of fanatical
Google fans, I should start this paper off by saying that I am a tremendous fan of
Google, and I have yet to see any search engine that has it beat, but that doesn't
mean that Google can't be made better. I am also concerned that Google's ranking
technology is shrouded in a tight-lipped aura: This is classic security by obscurity,
and while I do believe that there is some merit to that in certain circumstances, the
easy cause/effect analysis of Google renders such obscurity transparent to those who
want to manipulate Google rankings for their own gain.


A Guess At The Technology Behind Google Rankings


*DISCLAIMER: I do not have access to Google's page ranking technology, and apart
from some partial details on their site, they remain tight-lipped about their ranking
techniques to avoid intentional rank manipulation. As such, everything I say in this
article is purely speculative, based upon analysis of search results for various terms
and phrases. Please also note that I browse the web using Opera with pop-ups disabled,
so follow any link at your own discretion.

Lately I've been fascinated by the techniques <http://www.google.com/technology/index.html>
that Google uses to rank search results, as this obviously dictates the usability of
the results and, in turn, the value of a website to businesses that are trying to get
eyeballs to see their products and services. Indeed, a case could be made that Google
search positioning is becoming one of the most important "real estate value" elements
of any web page (more important than acquiring a good domain name, although I note
later that the domain name plus directory structure may well remain critically
important if it's contextual for the good or service that you're selling). If Google
were ever to sell page rankings, which they currently do not do, they would reap a
windfall as every company rushed to make sure that it obtained the most eyeball
potential.

The Google ranking technique, in a nutshell, is that every link provided to a site
is a vote for the site, with the weighting of the vote being determined by the number
of votes that the voting site itself has received (another scenario is that each site
indirectly promotes each subpage through internal linking, though effectively this
results in the same thing for any aggregate site which provides an index). I stress
"site" for an important reason: A vote from anywhere within Slashdot garners the
approximate voting power of Slashdot as a whole, a site which is one of the most
linked sites on the Internet. The same holds true for other conversation sites such
as www.plastic.com <http://www.plastic.com/> or www.kuro5hin.org
<http://www.kuro5hin.org/>.

The flip side is true as well: Not only does the link vote apply to the destination
page, but also to the site as a whole. By their very nature, aggregate sites like
GeoCities or Angelfire will get a lot of votes because they contain tens of thousands
of pages, and by extrapolation every hosted page itself starts quite high in the
rankings, regardless of its own merit. You can prove this yourself by browsing through
the GeoCities pages <http://pages.yahoo.com/> and looking for various papers covering
a specific topic, for instance this page <http://www.geocities.com/jlhorse7/mantid.html>
on praying mantids (I randomly picked one as an example). Do a search on Google for
mantids <http://www.google.com/search?q=mantids&hl=en&lr=&ie=UTF-8&oe=UTF-8&start=230&sa=N>
(note: either `mantis' or `mantids' is correct), and there it is in the #7 position
(you can repeat this with virtually any page on an aggregate site). To put that into
perspective, there are some 750 pages dealing with mantids that are linked from Google,
and that limit is simply because that's the maximum number of results that Google will
return for that particular search term. A quick check (using the link: search criteria)
confirms that the page in question is linked by no site other than itself.

Another example of a megalinked aggregate site is the members.aol.com domain
(apparently now hometown.aol.com), where AOL members can post webpages. Search for
"Ford transmission"
<http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=ford+transmission&btnG=Google+Search>
(perhaps you're having transmission problems and want to get them fixed, or you're
thinking of purchasing a Ford and want to make sure the transmission is of good
quality; something of this nature could correlate with billions of dollars in sales
and service) and the #1 result is http://hometown.aol.com/MKBradley/index.html .
Again, to put this into perspective, the site in question is linked a lowly 11 times
directly (2 times by itself), yet it has become the #1 voice regarding Ford
transmissions (a product in millions of cars), again because it seems to have
indirectly acquired the "voting power" of the entire members.aol.com site. Is it
really a democracy when every page on these megalinked aggregate sites becomes a
premier voice on its topic? Is it valid that this page would be ranked much more
favourably if I hosted it on GeoCities or aol.com?
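The site-level guess above can be made concrete with a small sketch. This is only an illustration of the speculation in this article, not Google's method: the function names (host_of, site_scores, page_score), the two-pass weighting, and the "+1" floor for unlinked voters are all assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase host part of a URL."""
    return urlsplit(url).netloc.lower()


def site_scores(links):
    """Score every host from a list of (source_url, target_url) links.

    Pass 1: a host's raw weight is the number of links it receives
    from other hosts.
    Pass 2: each cross-host link is a vote worth 1 plus the raw weight
    of the voting host (the +1 is an assumption so that unlinked
    voters still count for something).
    """
    raw = defaultdict(int)
    for src, dst in links:
        if host_of(src) != host_of(dst):
            raw[host_of(dst)] += 1

    score = defaultdict(int)
    for src, dst in links:
        if host_of(src) != host_of(dst):
            score[host_of(dst)] += 1 + raw[host_of(src)]
    return dict(score)


def page_score(url, scores):
    """Under the site-level guess, a page simply inherits its host's score."""
    return scores.get(host_of(url), 0)
```

Under these assumptions, a page that nobody links to directly still scores as high as its host, so an unlinked page on a heavily linked aggregate host can outrank a page on its own domain that has several direct links.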

Google appears not only to rank pages by the gross number of links multiplied by the
various weighting factors, and then sort them by how the search criteria appear in the
pages in question, but also to compare the text used in the links themselves with the
search criteria. For instance, the following link, "premiere Greater Toronto Area
software development and consulting company" <http://www.yafla.com/>, gives
www.yafla.com <http://www.yafla.com/> some bonus points for anyone searching for any
of those anchored words. I have no beef with that, and it actually makes a lot of
sense, barring tampering (which is inevitable in any tamperable system). This
particular ranking method came to the forefront
<http://www.wired.com/news/technology/0,1282,41401,00.html> a few years back in a
rather hilarious circumstance.
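As a rough sketch of the anchor-text comparison described above, the snippet below counts how many inbound links carry anchor text that shares at least one word with the query. The helper names (words, anchor_bonus) and the simple word-overlap rule are assumptions for illustration; how Google actually weights anchor text is not known to me.

```python
import re


def words(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def anchor_bonus(query, inbound_anchors):
    """Count inbound links whose anchor text shares a word with the query."""
    query_words = words(query)
    return sum(1 for anchor in inbound_anchors if query_words & words(anchor))
```

With the anchor text quoted above as the only inbound link, a query such as "Toronto software consulting" would earn one bonus point under this rule, while a query about praying mantids would earn none.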

It's clear by analyzing Google's results that not only do votes accumulate for pages 
via anchor "democracy", but additionally Google gives a heavy bonus for any page 
which includes one or more of the search words in the domain name or subdirectory. 
For instance, writing a page about fixing Ford transmissions would likely get you 
a far better ranking as http://www.fixingfordtransmissions.com/fix/transmission/fixit.html 
than it likely would as http://www.bobthemechanic.com/tipsandtricks/tip27.html. 
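To illustrate the URL bonus guessed at above, here is a minimal sketch that counts how many query words appear somewhere in the host name or path of a URL. The function name url_keyword_hits and the plain substring test are assumptions chosen for the example, not a known part of any search engine.

```python
import re
from urllib.parse import urlsplit


def url_keyword_hits(query, url):
    """Count query words that appear as substrings of the host or path."""
    parts = urlsplit(url.lower())
    haystack = parts.netloc + parts.path
    terms = re.findall(r"[a-z0-9]+", query.lower())
    return sum(1 for term in terms if term in haystack)
```

For the query "fixing ford transmissions", the first example URL above matches all three words, while the bobthemechanic.com URL matches none.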


So, anyways, that's a thoroughly amateur and largely obvious analysis of Google's
page ranking techniques. Google appears to rank sites not just by the number of anchor
tag "votes", multiplied by a site's weighting factor (it does not seem to be a
page-specific weighting, but rather a site weighting, i.e. an obscure, unlinked, and
unvisited page in the wilderness of AOL's members pages appears to be given the weight
of the entire site, and conversely garners the votes of the entire site), but
additionally by domain name matches.


A Mystery Is Afoot


Of course, then there's the perplexing. A search for "Britney Spears"
<http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=%22britney+spears%22&btnG=Google+Search>
gives the expected sites with Britney Spears in the URL or with heavily linked
Britney Spears content, but then coming in at #9 is a hit for "Shavlik Technologies"
<http://www.shavlik.com/> (a company which recently earned some fame by having their
hotfix checking tool endorsed
<http://support.microsoft.com/default.aspx?scid=KB;EN-US;Q303215&> and distributed by
Microsoft). Clearly something is afoot, as the page in question has no information
whatsoever about Britney Spears, not even spicy pictures, nor does the URL have
anything relating to Britney Spears in it.

The first step in determining why Shavlik's website was earning such a high ranking
for an unrelated search was to search for any sites which linked to their site, easily
facilitated by a quick link search
<http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=link%3Awww.shavlik.com&btnG=Google+Search>.
Among the various sites purportedly linking to Shavlik.com are quite a few whose
current and cached versions have no links whatsoever to the network security company,
but instead link to Britney Spears content. One common element, at least at a cursory
glance, was that they all linked to a now-defunct "britneyspearsnow.com" website.
Conversely, do a link check for sites linking to www.britneyspearsnow.com
<http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=link%3Awww.britneyspearsnow.com&btnG=Google+Search>
and, strangely, the very first hit is the Shavlik page. Clearly, either intentionally
or unintentionally, Google is confused between the two sites, and Shavlik has ended up
with an inflated page ranking because of it.

I can't comment on what technically is going wrong without knowing how Google is
determining links; however, I did do some hash checks to determine whether this is a
very rare case where hash results collide, but none of the variants (MD4, MD5, SHA1,
MD160) seemed to give common results for variations of the two URLs with various
prefixes and suffixes (I didn't expect that they would, given the entropy involved),
though my tests were far from exhaustive. To add fuel to the suspicion that there was
intentional manipulation of search results, searching for "Shavlik Britney Spears"
brings up a couple of pages that list Britney Spears fan pages, but sitting in the
middle is a which-one-of-these-links-doesn't-belong Shavlik link.
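A hash check of the kind mentioned above might look like the following sketch. It is not the script actually used for this article: the prefix and suffix lists and the function names (url_variants, shared_digests) are assumptions, and only MD5 and SHA-1 are tried by default because MD4 and RIPEMD-160 (presumably what "MD160" refers to) are not available in every Python build.

```python
import hashlib
from itertools import product


def url_variants(host):
    """Generate a few prefix/suffix variants of a host name (illustrative only)."""
    prefixes = ["", "http://", "http://www."]
    suffixes = ["", "/", "/index.html"]
    return [p + host + s for p, s in product(prefixes, suffixes)]


def shared_digests(host_a, host_b, algorithms=("md5", "sha1")):
    """Return (algorithm, variant_a, variant_b) triples whose full digests match."""
    hits = []
    for name in algorithms:
        if name not in hashlib.algorithms_available:
            continue  # e.g. md4 or ripemd160 may be missing from this build
        seen = {
            hashlib.new(name, v.encode()).digest(): v
            for v in url_variants(host_a)
        }
        for v in url_variants(host_b):
            digest = hashlib.new(name, v.encode()).digest()
            if digest in seen:
                hits.append((name, seen[digest], v))
    return hits
```

Calling shared_digests("shavlik.com", "britneyspearsnow.com") compares every variant of one host against every variant of the other. A search engine could plausibly key on a truncated digest instead of the full one, which this sketch does not test.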


How To Promote Your Own Site


Clearly there is some awareness out there as to how to manipulate the search rankings,
and the following are a few methods that I think are common:

*	Create a site on a megalinked aggregate site, and if you still want to have your
own domain then at least point to your domain from one or more aggregate sites. This
is appropriately shunned by Google <http://www.google.com/webmasters/2.html#A1> (I
believe they call it a "doorway" page). There are services that automate
cross-linking for search engine rank boosting.
*	If you decide to start from scratch and go it alone with your own domain, without
the bonus of being an underling of a megasite and earning the instant bonus points,
then encapsulate the content of your site in the URL: Don't worry too much about
the length of the URL as most people get there via search engines or links anyways.
You'd probably be better off with
http://www.how_to_burn_your_own_divx_mpeg4_collection.com/ than
http://www.htbyodmp4c.com/ . Additionally, name each subdirectory and document after
the content, e.g. /media_convergence_device/mpeg4/divx/build_your_own.html. As my
disclaimer states, I cannot say with certainty that Google increases the ranking of
sites with the search terms in the URL; however, empirical evidence via test searches
seems to bear this out.
*	Give yourself some freebies by using the signature line or link to address on 
discussion boards to point to your own site. Throw your opinion into every discussion 
regardless of your experience or lack thereof.

In no way am I promoting any method that encourages false search rank increases, 
but the next time you look at a search page ranking, realize that many of them were 
achieved via these methods.


Why It Matters, and the Future


Page rankings on Google are tremendously important, to the point that one could
make a case that they supersede the relevance of the various DNS authorities (indeed,
DNS is largely becoming irrelevant): Whether your business is on page 1 or page 30
can be the difference between prosperity and failure. Some studies
<http://www.slis.ualberta.ca/cais2000/wolfram.htm> have indicated that the mean number
of result pages viewed for a given query is approximately 1.8: If your result isn't on
the first two pages, then the majority of users will never even see that your site
exists, much less visit it. Of course not every site can be on the front page, but for
a given search phrase there has got to be a better way than simply promoting anyone
who hosts on an aggregate site, or pays a search `optimizing' company to cross-link
them hundreds or thousands of times.

It's an important point for the future because of a shift in the net: In the early 
days the net was largely populated with personal pages that could best be described 
as online bookmark lists: Everyone put a site up basically linking to all of their 
friends' sites, and among this giant recursive network a couple of neat links could 
be found. Sites truly could be ranked based upon the "votes" that they received. 
Very few people actually do that anymore, but instead cross-linking is mostly the 
domain of search ranking manipulation sites. Among legitimate pages, many actually 
avoid linking at all as every link represents a loss of a certain percentage of your 
readers: If I were concerned about whether people would make it this far, I might
be concerned that I lost 1.7% of readers who went off to read about praying mantids, 
or to download the latest hotfix checker, etc. Most sites intentionally avoid linking 
anywhere outside of their own little world anymore. 

Anchor `voting' has largely become the victim of rank manipulation, and has proven
itself to be a flawed technique for search rankings. Some other techniques are
fledgling, such as the Alexa <http://www.alexa.com/> technique of monitoring users'
browsing and formulating a "most popular" listing based upon that (and additionally
monitoring similar sites that the user visited in a session for correlations), and
apart from the privacy issues it may prove to be a practical approach in the future.
Other approaches include representative users voting for pages, and so long as the
votes apply to specific topical areas (e.g. "computer stores in the Halton region"),
and are not judged against a site that is heavily visited due to its Britney Spears
content, then that may be a viable solution. Of course, in such a case vote stuffing
and tampering are again very likely.
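The browsing-based approach described above can be sketched roughly as follows. This is not Alexa's actual method, which I have no knowledge of: the function names (popularity, co_visits) and the representation of a session as a plain list of site names are assumptions for the example.

```python
from collections import Counter
from itertools import combinations


def popularity(sessions):
    """Count the number of sessions in which each site was visited."""
    counts = Counter()
    for session in sessions:
        counts.update(set(session))  # count a site once per session
    return counts


def co_visits(sessions):
    """Count how often each pair of sites appears in the same session."""
    counts = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return counts
```

popularity(sessions).most_common() would give the "most popular" listing, and the highest co-visit counts hint at which sites are related. Both are as open to stuffing as link votes are if anyone can submit sessions.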

Anyways, just some meanderings about search engine technology.

Cheers.

Re: Slashdot Posting  2002-08-14  

Well, it looks like this got linked from Slashdot, as the referral logs expanded
quickly. In any case, browsing through the postings
<http://slashdot.org/article.pl?sid=02/08/14/1546233>, a couple of quick clarifications
seem to be in order:

*	I am not a Google expert, nor do I claim to be. I'm just a schmoe
<http://www.dcn.davis.ca.us/%7Ebrandi/LeBrun/monkey.jpg> who became curious why some
search results were superb, while others were utter garbage (and I am finding an
increasing number of garbage results. Not to be overly doomsdayish, but I truly do
believe that the quality of Google results has been declining). I also noticed the
trend towards any site that has a keyword in the URL being heavily promoted and became
curious whether it was documented. This is a personal interest simply because I use
Google as a productivity tool regularly, and it really matters to me that it isn't
destroyed.
*	As I mentioned above, whether it is page-by-page rankings, or page-site-page rankings, 
the results are largely the same: Some guy with a page on GeoCities, whose own page 
is linked to by the GeoCities index, which itself has a heavy bonus for being the 
heavily linked GeoCities, is given a heavy bonus just for putting his site up on 
GeoCities (phew), while another guy whose page is a custom domain starts at zero. 
The same thing holds true for a page on Slashdot: The main page, which is heavily 
bonused for being widely linked, links to a story, heavily promoting every link in 
that story out into the wild. There are some angry individuals nitpicking about
whether it really is "site-to-site", however I use that term to simplify (because 
the results are largely the same, as I detailed). The empirical evidence that effectively 
it becomes site to site for indexed sites is painfully easy to find. 
*	Others have pointed out that PageRank, which is only one of many factors that
Google uses to order search results (and is fundamentally different from "page rank"
as described above, which is a composite of several algorithms. Note that I never
used the term "PageRank"(TM), nor was I ever talking about that algorithm alone),
is a heavily documented and understood algorithm. One kind sir sent me a link to
a patent <http://www.delphion.com/details?pn=US06285999__>, while another, not so
kind, sir, in his posting pointed to a research paper
<http://dbpubs.stanford.edu:8090/pub/1999-66>. One could search countless
<http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=google+pagerank&btnG=Google+Search>
excellent papers <http://pr.efactory.de/e-pagerank-algorithm.shtml> detailing
algorithms for the PageRank technique. The result is that "pages vote for each other",
as I detailed above. These are important facts, and clearly give information regarding
the PageRank algorithm; however, PageRank is apparently only one of many algorithms
used to order results, and my concern is with the whole, not with one single element.
Supposedly in Google's master search algorithm, the weighting of PageRank is at an
all-time low because of some of the problems that I encountered.
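For readers who have not seen the published algorithm, the "pages vote for each other" idea can be sketched as a simple power iteration. This is a minimal sketch of the textbook formulation, not Google's production code: the function name pagerank, the damping value of 0.85 (a commonly cited choice), the fixed iteration count, and the even spreading of rank from pages with no outgoing links are all assumptions of this example.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Estimate PageRank for a graph given as {page: [pages it links to]}."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = graph.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its vote over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

Note that this works strictly page by page. The site-level effect discussed in this article would only emerge from it indirectly, through a heavily linked index page passing its rank down to the pages it links to.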


------------------
http://all.net/ 
