It’s amazing how much junk there is on the Internet. You let the spider loose and it just wanders so deep into a site that you never find it again.
I first made a system with levels to prevent the spider from going more than a fixed number of link jumps (3 or 4) from a site’s homepage. This wasn’t enough though.
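To make the level idea concrete, here is a minimal sketch of a breadth-first walk that stops following links once a page is a given number of jumps from the homepage. It is not the spider’s actual code: the names `links`, `crawl_with_depth_limit` and `max_depth` are made up for illustration, and the link graph is assumed to be already available as a dictionary, whereas a real spider would fetch and parse each page.

```python
from collections import deque


def crawl_with_depth_limit(links, homepage, max_depth=3):
    """Return {page: level} for every page within max_depth jumps of homepage.

    `links` is a hypothetical in-memory link graph: {page: [linked pages]}.
    """
    level = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        # Pages already at the limit are kept, but their links are not followed.
        if level[page] == max_depth:
            continue
        for target in links.get(page, []):
            if target not in level:  # first visit gives the shortest distance
                level[target] = level[page] + 1
                queue.append(target)
    return level


if __name__ == "__main__":
    site = {"/": ["/a"], "/a": ["/b"], "/b": ["/c"], "/c": ["/d"]}
    print(crawl_with_depth_limit(site, "/", max_depth=3))
```

Because the walk is breadth-first, each page gets the smallest number of jumps needed to reach it, so a page is not dropped just because one long path to it exists.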
As I have a really small index, I don’t want many pages from each domain; I want as many different domains as possible. So I invented something I call CashRank, which is a somewhat controversial way to rank pages.
Each page gets a dollar value indicating how much money has to be paid annually to keep that page online. This includes domain registration fees, hosting fees, the cost of the IP address, advertising costs, etc.
A page is only included in the index if it’s worth at least $1/year. This effectively limits auto-generated content that has no value, because no one is paying for its upkeep.
The CashRank is also propagated through links: a page keeps $1 for its own upkeep and then sends $1 through each of its links, in order, until it has used up all of its cash.
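This means that pages with many inbound links receive more cash and can therefore bring more of the pages they link to into the index. It also means that only the first links on a page are counted, because the cash runs out before the later ones are reached.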
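The propagation rule can be sketched roughly as follows. This is only an illustration under several assumptions that the description above does not settle: the function names, the `links` dictionary and the `seed_cash` starting values are hypothetical, cash moves in whole dollars, a page forwards money only while it can still keep $1 for itself, and any cash left over after the last link has been paid simply stays with the page.

```python
from collections import deque


def propagate_cashrank(links, seed_cash):
    """Spread cash through links and return {page: dollars the page ends up with}.

    links:     hypothetical link graph, {page: [outbound links in page order]}
    seed_cash: hypothetical starting values, e.g. {homepage: annual cost in $}
    """
    cash = {}       # dollars currently held by each page
    next_link = {}  # index of the next outbound link that has not been paid yet
    queue = deque(seed_cash.items())  # (page, dollars arriving at that page)
    while queue:
        page, dollars = queue.popleft()
        cash[page] = cash.get(page, 0) + dollars
        outbound = links.get(page, [])
        i = next_link.get(page, 0)
        # Keep $1 for upkeep, then pay $1 per link, first links first.
        while cash[page] >= 2 and i < len(outbound):
            cash[page] -= 1
            queue.append((outbound[i], 1))
            i += 1
        next_link[page] = i
    return cash


def indexed_pages(cash, threshold=1):
    """Pages worth at least `threshold` dollars per year make it into the index."""
    return {page for page, dollars in cash.items() if dollars >= threshold}


if __name__ == "__main__":
    links = {"home": ["a", "b", "c", "d"], "a": ["a1"]}
    cash = propagate_cashrank(links, {"home": 3})
    print(sorted(indexed_pages(cash)))
```

In this example the homepage starts with $3, keeps $1 and pays its first two links, so `c` and `d` never receive anything and stay out of the index. Each forwarded dollar uses up one link, so the loop ends even when pages link to each other in a cycle.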
Please give me your opinion on CashRank by leaving a comment below.