Summary
I provide basic indexing and retrieval code using the PyLucene 3.0 API. Lucene In Action (2nd Ed) covers Lucene 3.0, but only the Java code samples have been updated for the 3.0 API; the PyLucene ones have not. Unfortunately, there is currently little (no?) example PyLucene code in the blogosphere. If you have links to more Lucene 3.0 tutorials and samples, please share them in the comments.
Update 20100810: In light of discussions with others, this post has been substantially rewritten since it was first posted.
Background
Historically, I have found it easy to write basic PyLucene 2.4 (or 2.9?) code. PyLucene includes Lucene In Action code samples ported from Java to Python, and these code samples are correct and easy to adapt. I was recently developing a new project based upon Lucene (biased-text-sample), and I decided to try PyLucene 3.0.2-1. I was surprised to find that the PyLucene code samples in samples/LuceneInAction/ are out of date and still use the 2.x API. (Note: The code in samples/*.py appears to be updated to the 3.0 API.)
I was able to find no Lucene 3.0 tutorials or code samples on the web, except for this one article:
If you have links to more Lucene 3.0 tutorials and samples, please share them in the comments.
Sample PyLucene 3.0 code
In the spirit of LingPipe’s Lucene 2.4 in 60 seconds, here are the relevant PyLucene 3.0 code snippets for indexing and retrieval, taken from my biased-text-sample project.
Indexing
import sys

import lucene
from lucene import \
    SimpleFSDirectory, File, \
    Document, Field, StandardAnalyzer, IndexWriter, Version

if __name__ == "__main__":
    # Start the JVM; this must happen before any other Lucene call.
    lucene.initVM()
    indexDir = "/Tmp/REMOVEME.index-dir"
    dir = SimpleFSDirectory(File(indexDir))
    analyzer = StandardAnalyzer(Version.LUCENE_30)
    # True: create a new index, replacing any existing one in indexDir.
    # MaxFieldLength(512): index at most the first 512 terms of each field.
    writer = IndexWriter(dir, analyzer, True, IndexWriter.MaxFieldLength(512))
    print >> sys.stderr, "Currently there are %d documents in the index..." % writer.numDocs()
    print >> sys.stderr, "Reading lines from sys.stdin..."
    # Each line of stdin becomes one document with a stored, analyzed "text" field.
    for l in sys.stdin:
        doc = Document()
        doc.add(Field("text", l, Field.Store.YES, Field.Index.ANALYZED))
        writer.addDocument(doc)
    print >> sys.stderr, "Indexed lines from stdin (%d documents in index)" % (writer.numDocs())
    print >> sys.stderr, "About to optimize index of %d documents..." % writer.numDocs()
    writer.optimize()
    print >> sys.stderr, "...done optimizing index of %d documents" % writer.numDocs()
    # Read the count before closing, so the writer is not queried after close().
    numDocs = writer.numDocs()
    print >> sys.stderr, "Closing index of %d documents..." % numDocs
    writer.close()
    print >> sys.stderr, "...done closing index of %d documents" % numDocs
Retrieval
import lucene
from lucene import \
    SimpleFSDirectory, File, \
    StandardAnalyzer, IndexSearcher, Version, QueryParser

if __name__ == "__main__":
    # Start the JVM; this must happen before any other Lucene call.
    lucene.initVM()
    indexDir = "/Tmp/REMOVEME.index-dir"
    dir = SimpleFSDirectory(File(indexDir))
    # Use the same analyzer as at indexing time, so query terms match indexed terms.
    analyzer = StandardAnalyzer(Version.LUCENE_30)
    searcher = IndexSearcher(dir)
    # "text" is the default field searched when the query names no field.
    query = QueryParser(Version.LUCENE_30, "text", analyzer).parse("Find this sentence please")
    MAX = 1000
    # Retrieve at most MAX top-scoring documents.
    hits = searcher.search(query, MAX)
    print "Found %d document(s) that matched query '%s':" % (hits.totalHits, query)
    for hit in hits.scoreDocs:
        print hit.score, hit.doc, hit.toString()
        # Load the stored fields of the matching document.
        doc = searcher.doc(hit.doc)
        print doc.get("text").encode("utf-8")
    searcher.close()
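If you want a feel for what the two scripts above do without installing PyLucene, here is a toy, pure-Python sketch of the same index-then-search flow. It is not how Lucene is implemented. The names `ToyIndex`, `analyze` and `STOP_WORDS` are made up for illustration, the stop-word list is a tiny assumed subset, and the score is simply the number of query terms a document matches, whereas Lucene's real scoring is more involved. It only mimics, as far as I understand them, two behaviours of the code above: StandardAnalyzer lowercases text and drops common English stop words such as "this", and QueryParser ORs the remaining query terms together by default.

```python
import re

# Tiny illustrative stop-word subset (an assumption, not Lucene's actual list).
STOP_WORDS = frozenset(["a", "an", "and", "is", "of", "the", "this", "to"])


def analyze(text):
    """Lowercase, split on non-alphanumeric characters, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


class ToyIndex(object):
    """Hypothetical in-memory inverted index: term -> set of document ids."""

    def __init__(self):
        self.docs = []      # stored text, loosely analogous to Field.Store.YES
        self.postings = {}  # analyzed terms, loosely analogous to Field.Index.ANALYZED

    def add_document(self, text):
        doc_id = len(self.docs)
        self.docs.append(text)
        for term in analyze(text):
            self.postings.setdefault(term, set()).add(doc_id)
        return doc_id

    def num_docs(self):
        return len(self.docs)

    def search(self, query, max_hits=1000):
        """OR-match the query terms; score = number of distinct terms matched."""
        scores = {}
        for term in set(analyze(query)):
            for doc_id in self.postings.get(term, ()):
                scores[doc_id] = scores.get(doc_id, 0) + 1
        # Highest score first; ties broken by document id.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        return ranked[:max_hits]


index = ToyIndex()
for line in ["Find this sentence please",
             "Another sentence entirely",
             "Nothing relevant here"]:
    index.add_document(line)

hits = index.search("Find this sentence please")
print("Found %d document(s):" % len(hits))
for doc_id, score in hits:
    print("%d %d %s" % (score, doc_id, index.docs[doc_id]))
```

The first line matches all three surviving query terms and the second matches only "sentence", so both come back, with the exact match ranked first. The third line shares no terms with the query and is not returned.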