Search Engine Software

Search Engine in Pure PHP

The first version of our search engine software was written in a weekend, using PHP for rapid application development. It was a fun hobby project to boost my programmer's ego. Since then the code has grown, over countless hours of development, from the initial 1000 lines to over 5000 lines of PHP, SQL and HTML.

Made for Shared Hosting

From the beginning, this software was made to run easily from a shared hosting account with PHP, MySQL and crontab support.

At the time, that was quite an unusual approach to building a general Internet search engine. Most such software was written in C++ for the perceived performance benefit.

The Feature List

  • Uses about 3GB storage to index 250,000 pages and links
  • Can run from a shared hosting account with PHP and MySQL
  • Efficient keyword-based index for fast queries
  • Supports keyphrases up to 4 words long
  • Includes a spider that crawls the web for pages
  • Finds keywords in title, description, anchor text, page contents and URL
  • Configurable content filters can remove sexually explicit content, spam and other unwanted pages
  • Keeps track of recent human-generated searches
  • Option to manually add URLs
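To illustrate the keyword index and the four-word keyphrase limit from the list above, here is a minimal sketch in Python. It is not the engine's actual PHP code. The function names, the in-memory dictionary (the real software stores its index in MySQL) and the sample pages are all assumptions made for the example.

```python
from collections import defaultdict
import re

MAX_PHRASE_WORDS = 4  # the feature list says keyphrases are up to 4 words long


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyphrases(words, max_words=MAX_PHRASE_WORDS):
    """Yield every run of 1..max_words consecutive words as one phrase."""
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])


def build_index(pages):
    """Map each keyphrase to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for phrase in keyphrases(tokenize(text)):
            index[phrase].add(url)
    return index


def search(index, query):
    """Look up the whole query as one phrase; over-long queries match nothing."""
    phrase = " ".join(tokenize(query))
    return sorted(index.get(phrase, set()))


# Hypothetical sample pages, for illustration only.
PAGES = {
    "http://example.com/a": "Search engine software in pure PHP",
    "http://example.com/b": "A site search engine written in Perl",
}
INDEX = build_index(PAGES)
```

A call such as search(INDEX, "pure php") is a single dictionary lookup, which is why a precomputed phrase index gives fast queries. The price is storage: every page contributes up to four index entries per word position.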

The current development target is to make the software modular enough to power other search engines.

I'm targeting niche search engines and site search engines, and I'm looking into releasing the code under the GPL so that others can use it as a basis for development.

List of Search Engine Software

While I'm considering how to release the code in a way you can use, you can see if any of these search engine software packages will work for your project.

Apache Lucene is (according to its website) a high-performance, full-featured text search engine library written entirely in Java. It's an open source project and thus free to use. Lucene needs to be fed documents locally, so you will also need another component called Nutch if you intend to crawl the web for pages. Lucene

Entireweb Datafeed is a completely free XML datafeed of search results from Entireweb, a large search engine based in Sweden with an index of over 100 million pages. Entireweb Datafeed

Exalead CloudView is an advanced search solution for big companies. It's based on the search technology of the Exalead Internet search engine and can thus handle at least 8 billion documents. Exalead CloudView

Fluid Dynamics Search Engine is a site search engine in Perl with free and shareware ($40) versions available. Fluid Dynamics

FM SiteSearch Pro is a site search engine in Perl from Focalmedia.net. It seems to be targeted at smaller sites of fewer than 1000 pages and supports both text file and MySQL storage. FM SiteSearch Pro

FreeFind is a search service that lets you add search to your website using the FreeFind index. This means easy setup and no need for spiders and indexes on your own server. There's a free version with ads and a paid version from $19/month if you don't want the ads. FreeFind

Google Custom Search is a service by Google where you can make a search engine that includes only the sites you specify. The free version has Google ads, with the revenue split between you and Google. There is also a corporate paid version where you can turn ads off, starting at $100/year for 1000 documents. As the custom search uses Google's database, it only works as long as your website is accessible from the Internet. For Intranets you need another solution. Google Custom Search

Google Search Appliance is a service by Google where they put a computer (the search appliance) in your corporate network so you can search the Intranet that is not accessible from the Internet. The capacity is up to a million documents with pricing depending on the number of documents indexed. Google Search Appliance

Google Mini is basically the same thing but indexes fewer pages and costs less. Google Mini

ht://Dig is search engine software used for site search or for vertical search engines. It was developed at San Diego State University to search the campus networks. It's written in C and can be run on a Linux/Unix box or on Windows with Cygwin. It's free (LGPL) and indexes web pages, not databases. ht://Dig

IBM OmniFind Yahoo! Edition is a free enterprise search engine for up to 500,000 documents. There are more advanced editions available from IBM with a $20,000 price tag. OmniFind Yahoo! Edition

Inout is a PHP meta search engine script you can buy from Inout Scripts for $249. It uses the search results from 12 popular search engines to compile its own set of results and includes functions to monetize the search engine with ads. It seems to have a lot of features, according to the marketing materials on the website. Inout

LEXST is a search application that scales to index up to 2 billion web pages using 1120 nodes (servers). There's a free version that works on up to three nodes, and the 1120 node version will cost you $800,000. This software can handle really big intranets or even be used to make a full-sized Internet search engine. LEXST

Open Web Spider is open source search engine software written in C# that seems to be targeted at smaller installations. I haven't found any hard numbers, though. Open Web Spider

Sphider is a PHP and MySQL based site search engine. It's open source and has a good admin interface, which probably makes it a good candidate for site search on a smaller site. Sphider

Sphinx is a free open-source SQL full-text search engine. Sphinx

Swish-e is an open source search engine usable up to a million documents but more commonly used for tens of thousands of documents. It's written in C, is fast and flexible but requires some programming/assembling work to integrate with your website. Swish-e

Webinator is search software from Thunderstone that spiders and indexes documents. With a capacity of up to 200,000 pages it can take on larger sites, and the software runs on Windows and Unix. There's a free version for up to 10,000 documents and paid versions from $700 to $5800. Webinator

Zoom Search Engine is a search engine from Wrensoft that is slightly geared towards searching CDs and DVDs, but it also has intranet and Internet search functionality. There's a free version that can index 50 pages (that's not much) and paid versions from $49 to $299 that index up to a million pages. Zoom Search

Zoom Master Node is scalable search software from Wrensoft that runs on multiple servers and is able to index over a million web pages. Master Node can take OpenSearch responses from other search servers to integrate searching of several different databases and indexes in a single interface. Master Node is free, with the option to buy support for $250. Zoom Master Node

Make Your Own Search Engine Software

This is a collection of libraries and howtos that will help you make your own database search engine (easy), site search engine (medium) or Internet search engine (not easy).
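To give an idea of why the database case counts as easy, here is a minimal sketch in Python using the standard sqlite3 module. It assumes a hypothetical table named pages with title and body columns. The table layout and function names are made up for the example and do not come from any of the packages listed here.

```python
import sqlite3


def make_database(rows):
    """Create an in-memory SQLite database holding (title, body) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.executemany("INSERT INTO pages (title, body) VALUES (?, ?)", rows)
    return conn


def search(conn, term):
    """Return the titles of rows whose title or body contains the term."""
    # Escape the LIKE wildcards so the user's input is matched literally.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    pattern = f"%{escaped}%"
    cursor = conn.execute(
        "SELECT title FROM pages "
        "WHERE title LIKE ? ESCAPE '\\' OR body LIKE ? ESCAPE '\\' "
        "ORDER BY id",
        (pattern, pattern),
    )
    return [row[0] for row in cursor]
```

A substring match like this scans every row, so it only suits small tables. A site or Internet search engine needs a real index plus a spider, which is where the difficulty rises.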

The Porter Stemming Algorithm reduces English words to a common stem, so that different forms of a word match each other in a search. Connect, connected, connecting and connection all become connect after stemming. The stem is not always a real word: large and largely both become larg.

Written by Martin Porter, the algorithm has become popular in all kinds of search engine software. The best part is that Martin provides the code free of charge under a do-what-you-want open source license in ANSI C, Java, Perl, Python, C#, Lisp, Ruby, VB6, VB7, Delphi, JavaScript, PHP, Prolog, Haskell, T-SQL, MATLAB, Tcl, D, Erlang, REBOL, and Scala.

The Porter Stemming Algorithm
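To show the basic idea of stemming, here is a toy suffix stripper in Python. It is deliberately not the Porter algorithm, which applies several ordered rule steps with conditions on the shape of the stem. The suffix list and the minimum stem length below are assumptions chosen for the example. For real use, take one of Martin Porter's published implementations.

```python
# Suffixes are tried in order, longest first. This is a toy rule set, NOT
# the Porter algorithm, which has several ordered steps and extra conditions.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")
MIN_STEM_LENGTH = 3  # assumption: never strip a word below three letters


def toy_stem(word):
    """Strip the first matching suffix if the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def same_stem(first, second):
    """Report whether two words reduce to the same stem."""
    return toy_stem(first) == toy_stem(second)
```

A search engine would apply the same stemming function to the words it indexes and to the words in each query, so that a search for connecting also finds pages that only mention connected.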

Snowball is a new stemming library by Martin Porter that allows you to create stemming functions for other languages using a script-like language. Stemmers are available for French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Russian, Finnish, Hungarian and Turkish.

Snowball

Simon,
Secret Search Engine Labs