Better website search with Python Whoosh

Structure

The top level is the index. The schema defines the structure of each document in the index. The document is the equivalent of a row entry in a database, having all the fields defined by the schema.

Field types available

whoosh.fields.TEXT Plain text / Allow search for phrases

whoosh.fields.KEYWORD A comma or space separated list of keywords (to store the value of the field, use stored=True)

whoosh.fields.ID The whole value is indexed as a single unit instead of being split into words. Useful for paths, URLs and identifiers.

whoosh.fields.STORED For reference only when reading the document, no search against it.

whoosh.fields.NUMERIC All kinds of numbers: int, float and long.

whoosh.fields.DATETIME A date in a compact and sortable format.

whoosh.fields.BOOLEAN yes / no. true / false. 1/0. You get it.

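INFO: stored means that the value is saved with the document, so you get it back when loading the document from a search result. Storing and indexing are separate: a field declared as STORED is saved but cannot be searched, while an indexed field that is not stored can be searched but is not returned.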

Indexing

Whoosh offers shortcuts for creating and opening indexes. Here’s an example using a storage object:

from whoosh.filedb.filestore import FileStorage
storage = FileStorage("indexdir")
# Create an index
ix = storage.create_index(schema)
# Open an existing index
ix = storage.open_index()

Multiple indexes

If you need to keep multiple indexes, you will need to name each index using the indexname parameter.

# Using the convenience functions
from whoosh import index
ix = index.create_in("indexdir", schema=schema, indexname="usages")
ix = index.open_dir("indexdir", indexname="usages")
# Using the Storage object
ix = storage.create_index(schema, indexname="usages")
ix = storage.open_index(indexname="usages")

Schema

from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
schema = Schema(id=ID(stored=True, unique=True),
            name=TEXT(stored=True),
            body=TEXT,
            tags=KEYWORD)

You will note the id is stored (retrievable on read) and unique (used to find the existing document when updating).

Document management

Add

# The examples below assume a schema with 'path' and 'content' fields
writer = ix.writer()
writer.add_document(path=u"/a", content=u"The first document")
writer.add_document(path=u"/b", content=u"The second document")
writer.commit()

Update

writer = ix.writer()
writer.update_document(path=u"/a", content=u"Replacement for the first document")
writer.commit()

Note that if the document doesn’t exist, update will act as add.

Delete

writer = ix.writer()
# Delete document by its path; this field must be indexed
writer.delete_by_term('path', u'/a/b/c')
# Save the deletion to disk
writer.commit()

Stemming

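The stemming analyzer reduces each word to its root by stripping suffixes, so different forms of the same word are indexed as the same term (folding accents requires a separate character filter). This increases the chance of a match, but it also returns results which are only similar to the query rather than exact.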

from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer
schema = Schema(from_addr=ID(stored=True),
            to_addr=ID(stored=True),
            subject=TEXT(stored=True),
            body=TEXT(analyzer=StemmingAnalyzer()),
            tags=KEYWORD)

Searching

First, we need to open the index:

myindex = index.open_dir(settings.SEARCH_INDEX_PATH, indexname="myindexname")

Then we need to create a query object, which specifies which field to search against:

from whoosh.qparser import QueryParser
qp = QueryParser("fieldname", schema=myindex.schema)
q = qp.parse(u'%s' % term)

The query string can combine several terms with operators such as AND, OR and NOT:

q = qp.parse(u'guillaume OR bob')

This will match fieldname with either guillaume or bob.

Finally, we open a searcher using ‘with’, so that it is closed once we are done with it:

with myindex.searcher() as searcher:
    results = searcher.search(q, limit=None)
    print("How many results: %d" % results.scored_length())
    matching_ids = []
    for result in results:
        # Only stored fields, such as 'id', are available on a hit
        matching_ids.append(result['id'])

Each row of the ‘results’ object behaves like a dictionary for one matched document, holding all of its stored fields. In this example, we collect the ‘id’ of each document.

Filtering

You may need to restrict several fields to specific terms. Although whoosh.qparser.MultifieldParser() can search against multiple fields, it matches every term against every field. In my scenario, I want specific terms matched only against specific fields, like a filter.

This can be done by building query objects manually, then passing them to the search as ‘filter’ (keep only documents that match) or ‘mask’ (exclude documents that match).

from whoosh.qparser import QueryParser
from whoosh.query import Or, Term
qp = QueryParser("fieldname", schema=myindex.schema)
q = qp.parse(u'%s' % term)
# Keep only documents matching at least one of these terms
filter_q = Or([Term("name", u"guillaume"), Term("tags", u"developer")])
# Exclude documents matching any of these terms
mask_q = Or([Term("name", u"bob"), Term("tags", u"designer")])
with myindex.searcher() as searcher:
    results = searcher.search(q, filter=filter_q, mask=mask_q, limit=None)
    print("How many results: %d" % results.scored_length())
    matching_ids = []
    for result in results:
        matching_ids.append(result['id'])

That covers the essentials. For a complete reference, here's the documentation: http://whoosh.readthedocs.org/en/latest
