Content Filters

By using strong content filters that remove adult, gambling, and sexually explicit websites, we can reduce the index size and provide a better experience for our users.

The filters used during parsing are fairly simple: each filter is a list of trigger keyphrases plus a set of threshold levels for nofollow, noindex, and delete.
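
As an illustration, here is a minimal sketch in Python of what such a filter definition could look like. The structure and field names (keyphrases, nofollow_level, noindex_level, delete_level) are assumptions made for this example, not the actual data model used by the engine.

    from dataclasses import dataclass

    @dataclass
    class ContentFilter:
        # Hypothetical structure; the field names are illustrative only.
        name: str
        keyphrases: set[str]     # trigger keyphrases for this filter
        nofollow_level: float    # above this, links are not followed
        noindex_level: float     # above this, the page is de-indexed
        delete_level: float      # above this, the page is deleted

    adult_filter = ContentFilter(
        name="adult",
        keyphrases={"trigger phrase one", "trigger phrase two"},
        nofollow_level=5.0,
        noindex_level=10.0,
        delete_level=20.0,
    )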

First the page is parsed and every keyphrase on the page is assigned a score according to the standard parsing algorithm. The scores of all keyphrases that match an entry in the filter list are then added together.
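
The aggregation step might look roughly like this, assuming the parser has already produced a keyphrase-to-score mapping for the page; the scoring algorithm itself is not shown and the numbers below are made up.

    def filter_score(page_scores: dict[str, float],
                     filter_keyphrases: set[str]) -> float:
        # Sum the scores of the page keyphrases that appear in the filter list.
        return sum(score for phrase, score in page_scores.items()
                   if phrase in filter_keyphrases)

    # Example: two of the three parsed phrases match the filter list.
    page_scores = {"trigger phrase one": 4.0,
                   "trigger phrase two": 3.5,
                   "harmless phrase": 2.0}
    print(filter_score(page_scores,
                       {"trigger phrase one", "trigger phrase two"}))  # 7.5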

If the calculated score is above the nofollow level, the links on the page are not followed. If the score is above the noindex level, the page is not included in the index and you will not find it while searching. The page is not deleted yet, so it can be reintroduced into the index if the filtered content is no longer present on the next run.

If the score is above the delete level, the page is both de-indexed and deleted from the database, and the URL is marked as violating the content filters, which prevents us from spidering the page again later.
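
Putting the three levels together, the decision could be sketched like this. The FilterAction names are invented for the example; checking the strictest level first ensures that a very high score maps to delete rather than just nofollow.

    from enum import Enum

    class FilterAction(Enum):
        NONE = "none"          # below all levels: index the page normally
        NOFOLLOW = "nofollow"  # do not follow links on the page
        NOINDEX = "noindex"    # drop from the index, keep in the database
        DELETE = "delete"      # de-index, delete, block future spidering

    def decide_action(score: float, nofollow_level: float,
                      noindex_level: float, delete_level: float) -> FilterAction:
        # Check the strictest level first.
        if score > delete_level:
            return FilterAction.DELETE
        if score > noindex_level:
            return FilterAction.NOINDEX
        if score > nofollow_level:
            return FilterAction.NOFOLLOW
        return FilterAction.NONE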

The fact that a page deleted by the content filters is never spidered again can be considered a bug in the algorithm: in some cases a page is filtered by mistake, and it would be good to have it re-indexed once the mistake has been corrected.

Simon,
Secret Search Engine Labs

