Making a good search algorithm that can’t be gamed by greedy marketers and other SEO scumbags like myself is not easy. Many doubt it’s even possible.
Just look at Google, the biggest search engine on the planet. Many of the top results are there because they were SEOed to be there. If Google can’t make unbiased search results, then who can?
I just read a great blog post by Jonathan Leger in which he asks his readers for ideas on how to make a better search algorithm. I'm going to list some of the ideas I think might be useful, plus a few additional ones I came up with while reading.
Votes from Web 2.0 Sites
Using the votes from social sites like Digg, Twitter or StumbleUpon to help rank search results was suggested, and I think it's a good idea. It's the vote of the people, and it reflects popularity.
The likely problem I see is that you can only use stories with hundreds of votes, as lower vote counts are often manipulated by marketing people.
Also, the people participating in the voting are not always the same people you are making search results for. I, for one, do very little voting and a lot of searching, so the votes cast might not be representative of what I want to find.
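Just to make the threshold idea concrete, here's a rough sketch (the cut-off, the damping and the function names are all made up for illustration) of how social votes could be folded into a ranking score only once they pass a trust threshold:

```python
import math

VOTE_THRESHOLD = 200  # assumed cut-off; lower counts are too easy to game

def social_boost(votes: int) -> float:
    """Ranking boost from social votes, or 0 if below the trust threshold."""
    if votes < VOTE_THRESHOLD:
        return 0.0
    # Logarithmic damping so 10,000 votes doesn't swamp every other signal.
    return math.log10(votes)

def rank_score(base_relevance: float, votes: int) -> float:
    # base_relevance would come from the normal keyword/link scoring.
    return base_relevance * (1.0 + 0.1 * social_boost(votes))
```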
Bounce Rate
The bounce rate is the percentage of visitors who click on a search result and then immediately hit the back button when they realize the site was not what they were looking for.
This is a factor Google already uses, and I think it's a good indicator of whether a search result is actually useful to the people searching.
Then again, this is nothing new, but it's something to keep in mind when I make further algorithm changes.
Michael from Better ClickBank Analytics noted that the bounce rate can be manipulated using botnets running automated scripts that do searches and click on results. That would require a significant number of IP addresses to work, though, since it's easy to just count every IP address once.
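For what it's worth, here's a minimal sketch of computing bounce rates from a click log while counting each IP address only once per result; the log format is just an assumption for illustration:

```python
from collections import defaultdict

def bounce_rates(click_log):
    """click_log: iterable of (result_url, ip_address, bounced) tuples."""
    seen = set()                      # (url, ip) pairs already counted
    clicks = defaultdict(int)
    bounces = defaultdict(int)
    for url, ip, bounced in click_log:
        if (url, ip) in seen:
            continue                  # count every IP address only once per result
        seen.add((url, ip))
        clicks[url] += 1
        if bounced:
            bounces[url] += 1
    return {url: bounces[url] / clicks[url] for url in clicks}
```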
Great Content
A lot of people suggested that great content should be the determining factor, and I agree. I just don't know of any way to determine whether content is great or not; it's really a matter of taste and of matching content to the right demographic.
Bounce rate, inbound links and social votes are all ways to find great content without having a computer that can actually review a site and determine whether it's good or not.
Measuring Link Clicks
I don't know if the big engines are doing this yet, but what if you let the browser toolbar measure how often the links on a specific web page actually get clicked? That way you could rank different links according to how prominently they are placed on the page and how relevant they are to the page's theme.
If the footer is stuffed with keyword-rich links, nobody is going to see them or click on them, so they will be discounted. The same goes for hidden links and off-topic ads.
I use a variation of this in my CashRank algorithm, where I only count the first x links on a page, with x depending on how much CashRank the page has. Usually the most important links are placed first, so you get a coarse emulation of actual link popularity.
A better way would be to actually render the page in a browser and measure where on the page each link is and how big it is, to estimate the likelihood of it actually being clicked.
The key theme here is to improve classic link relevance by giving different links on the same page different weights based on the value of the screen and page real estate they occupy.
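Here's a small sketch of both variants: the "first x links" rule and a prominence weight from a rendered page. The numbers and the CashRank-to-x mapping are invented for illustration, not the real formula:

```python
def links_that_count(links, cashrank: float):
    """links: outgoing link URLs in the order they appear in the page source."""
    x = max(3, int(cashrank * 10))   # assumed: higher CashRank -> more links counted
    return links[:x]

def link_weight(area_px: float, pixels_from_top: float) -> float:
    # Bigger links placed higher on the rendered page are assumed more likely
    # to be clicked; footer-stuffed or hidden (zero-area) links get almost no weight.
    return area_px / (1.0 + pixels_from_top)
```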
Human Review
Naturally, having a real human review a site will give the best results as far as removing plain spam goes; the only problem is that I can't afford to hire 100,000 people to review all the search results.
I do think, however, that there is an idea here: have people review random sites and label them SPAM/NOT SPAM or COMMERCIAL/NON-COMMERCIAL, then feed the human-reviewed results into a self-learning filter and get far more pages classified than are actually reviewed.
This is definitely something I will look into some day; it could work well with Mechanical Turk.
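As a sketch of the self-learning part, a simple naive Bayes text classifier trained on the human SPAM/NOT SPAM labels might look something like this (a real filter would use far more features than raw words):

```python
from collections import defaultdict
import math

class SpamFilter:
    def __init__(self):
        self.word_counts = {"SPAM": defaultdict(int), "NOT SPAM": defaultdict(int)}
        self.page_counts = {"SPAM": 0, "NOT SPAM": 0}

    def train(self, page_text: str, label: str):
        """label is the human reviewer's verdict: SPAM or NOT SPAM."""
        self.page_counts[label] += 1
        for word in page_text.lower().split():
            self.word_counts[label][word] += 1

    def classify(self, page_text: str) -> str:
        """Guess a label for a page no human has reviewed."""
        total_pages = sum(self.page_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            # Log prior plus log likelihood, with add-one smoothing.
            score = math.log((self.page_counts[label] + 1) / (total_pages + 2))
            total_words = sum(counts.values())
            vocab = len(counts) + 1
            for word in page_text.lower().split():
                score += math.log((counts.get(word, 0) + 1) / (total_words + vocab))
            scores[label] = score
        return max(scores, key=scores.get)
```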
Deep Digging Tools / Categorization
This is something I'm working on, and I think it's one of the core issues with search: finding search terms related to the one searched for, and finding pages on the exact same topic that don't contain the exact search term but are still relevant to it.
Categorizing search terms and learning what they mean, or at least how they relate to other search terms, would allow the search engine to provide better search results and better tools for refining them.
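One crude way to learn how terms relate, sketched below, is to count how often terms co-occur in the same documents or user sessions and treat the strongest co-occurrences as related searches; the input format is assumed:

```python
from collections import defaultdict
from itertools import combinations

def related_terms(documents, top_n=5):
    """documents: iterable of sets of terms, one set per page (or per user session)."""
    cooc = defaultdict(lambda: defaultdict(int))
    for terms in documents:
        for a, b in combinations(sorted(terms), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    # For each term, keep the terms it co-occurs with most often.
    return {
        term: sorted(neighbours, key=neighbours.get, reverse=True)[:top_n]
        for term, neighbours in cooc.items()
    }
```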
Tools to Help The User Search
That's a good one. People search in different ways: some type “dog food”, some “food for dogs”, some “dogs food” and some “food dogs”. Having tools that help the user search in a way that gives relevant answers, some sort of easy, interactive refinement tool, would be nice to try out.
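As a toy example of the “dog food” problem, a first step could be normalizing query variants onto one canonical form; the stopword list and the crude singularization below are placeholders, not what any real engine does:

```python
STOPWORDS = {"for", "the", "a", "of", "to"}

def normalize_query(query: str) -> tuple:
    words = []
    for word in query.lower().split():
        if word in STOPWORDS:
            continue
        # Very rough singularization so "dogs" and "dog" match.
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]
        words.append(word)
    return tuple(sorted(words))

# normalize_query("food for dogs") == normalize_query("dog food") == ("dog", "food")
```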
Email and IM content
Using links embedded in instant messages and emails is another way you could boost links as they happen, but there are privacy concerns, of course, and I don't run an email service.
I'm going to investigate the Twitter API, though, to see if it could be used to find out what people are tweeting about.
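Assuming the tweets have already been fetched (however the Twitter API ends up exposing them), a sketch of pulling out and counting the links people are sharing might look like this:

```python
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+")

def trending_links(tweets):
    """tweets: iterable of tweet text strings, fetched elsewhere."""
    counts = Counter()
    for text in tweets:
        counts.update(URL_PATTERN.findall(text))
    return counts.most_common(10)
```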