Better website search with Python Whoosh
The top level is the index. The schema defines the structure of each document in the index. The document is the equivalent of a row entry in a database, having all the fields defined by the schema.
Field types available
whoosh.fields.TEXT Plain text / Allow search for phrases
whoosh.fields.KEYWORD A comma or space separated list of keywords (to store the value of the field, use stored=True)
whoosh.fields.ID A single unit value, whether text or number; the whole value is indexed as one term rather than being split into words.
whoosh.fields.STORED For reference only when reading the document, no search against it.
whoosh.fields.NUMERIC All kinds of numbers: int, float and long.
whoosh.fields.DATETIME A date in a compact and sortable format.
whoosh.fields.BOOLEAN yes / no. true / false. 1/0. You get it.
INFO: stored means that the value is saved with the document, so you will get it back when loading the document. A field that is only stored (STORED) cannot be searched against.
Whoosh offers shortcuts for using indexes, here’s an example:
from whoosh.filedb.filestore import FileStorage
storage = FileStorage("indexdir")
# Create an index
ix = storage.create_index(schema)
# Open an existing index
storage.open_index()
If you need to keep multiple indexes in the same directory, you will need to name each index using the indexname keyword argument:
from whoosh import index
# Using the convenience functions
ix = index.create_in("indexdir", schema=schema, indexname="usages")
ix = index.open_dir("indexdir", indexname="usages")
# Using the Storage object
ix = storage.create_index(schema, indexname="usages")
ix = storage.open_index(indexname="usages")
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
schema = Schema(id=ID(stored=True, unique=True),
                name=TEXT(stored=True),
                body=TEXT,
                tags=KEYWORD)
You will note that id is stored (retrievable on read) and unique (will be used for updating).
writer = ix.writer()
writer.add_document(path=u"/a", content=u"The first document")
writer.add_document(path=u"/b", content=u"The second document")
writer.commit()
writer = ix.writer()
writer.update_document(path=u"/a", content=u"Replacement for the first document")
writer.commit()
Note that if the document doesn’t exist, update_document will act as add_document.
# Delete document by its path -- this field must be indexed
writer = ix.writer()
writer.delete_by_term('path', u'/a/b/c')
# Save the deletion to disk
writer.commit()
The stemming analyser cuts down word complexity such as suffixes and accents, keeping the root of the word. That allows a higher matching potential, but it also returns results which are only similar to the query.
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer
schema = Schema(from_addr=ID(stored=True),
                to_addr=ID(stored=True),
                subject=TEXT(stored=True),
                body=TEXT(analyzer=StemmingAnalyzer()),
                tags=KEYWORD)
First, we need to open the index:
myindex = index.open_dir(settings.SEARCH_INDEX_PATH, indexname="myindexname")
Then we need to create a query object, which specifies which field to search against:
from whoosh.qparser import QueryParser
qp = QueryParser("fieldname", schema=myindex.schema)
q = qp.parse(u'%s' % term)
term can be a set of terms and comparators:
q = qp.parse(u'guillaume OR bob')
This will match documents whose fieldname contains either guillaume or bob.
Finally, we open the searcher using ‘with’ so that it is closed once we are done with it:
with myindex.searcher() as searcher:
    results = searcher.search(q, limit=None)
    print("How many results:", results.scored_length())
    # Collect the stored 'id' of every matching document
    matching_ids = []
    for result in results:
        matching_ids.append(result['id'])
Each row of the ‘results’ object behaves like a dictionary for one matched document, holding all of its stored fields. In this example, we are reading the ‘id’ of each document.
It is possible that you would need to filter multiple terms across multiple fields. Although you can use whoosh.qparser.MultifieldParser() to search against multiple fields, it matches every term against every field. In my scenario, I want to match specific terms exclusively against specific fields, like a filter.
It can be done by building query objects manually, then passing them to the ‘filter’ or ‘mask’ argument of the search: ‘filter’ keeps only the documents that match it, while ‘mask’ excludes the documents that match it.
from whoosh.qparser import QueryParser
from whoosh.query import And, Or, Term
qp = QueryParser("fieldname", schema=myindex.schema)
q = qp.parse(u'%s' % term)
# Only keep documents matching one of these terms
filter_q = Or([Term("name", u"guillaume"), Term("tags", u"developer")])
# Always exclude documents matching one of these terms
mask_q = Or([Term("name", u"bob"), Term("tags", u"designer")])
with myindex.searcher() as searcher:
    results = searcher.search(q, filter=filter_q, mask=mask_q, limit=None)
    print("How many results:", results.scored_length())
    matching_ids = []
    for result in results:
        matching_ids.append(result['id'])
Those are the essentials. For a complete reference, see the documentation: http://whoosh.readthedocs.org/en/latest