New users of Lucene are advised to use this version for new developments, because it has a clean, type safe new API.

–http://lucene.apache.org/java/docs/index.html

PyLucene 3.0 in 60 seconds — Tutorial sample code for the 3.0 API

Summary

I provide basic indexing and retrieval code using the PyLucene 3.0 API. Lucene In Action (2nd Ed) covers Lucene 3.0, but only its Java code samples have been updated for the 3.0 API; the PyLucene ones have not. Unfortunately, there is currently little (no?) example PyLucene code in the blogosphere. If you have links to more Lucene 3.0 tutorials and samples, please share them in the comments.

Update 20100810: In light of discussions with others, this post has been substantially rewritten since it was first posted.


Background

Historically, I have found it easy to write basic PyLucene 2.4 (or 2.9?) code. PyLucene includes Lucene In Action code samples ported from Java to Python, and these code samples are correct and easy to adapt. I was recently developing a new project based upon Lucene (biased-text-sample), and I decided to try PyLucene 3.0.2-1. I was surprised to find that the PyLucene code samples in samples/LuceneInAction/ are out of date and use the 2.x API. (Note: The code in samples/*.py appears to be updated to the 3.0 API.)

I found no Lucene 3.0 tutorials or code samples on the web, except for this one article:

If you have links to more Lucene 3.0 tutorials and samples, please share them in the comments.


Sample PyLucene 3.0 code

In the spirit of Lingpipe’s Lucene 2.4 in 60 seconds, here are the relevant PyLucene 3.0 code snippets for indexing and retrieval, taken from my biased-text-sample project.

Indexing

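import sys

import lucene
from lucene import \
    SimpleFSDirectory, File, \
    Document, Field, StandardAnalyzer, IndexWriter, Version

if __name__ == "__main__":
    # The JVM must be started before any Lucene class is used.
    lucene.initVM()
    indexDir = "/Tmp/REMOVEME.index-dir"
    dir = SimpleFSDirectory(File(indexDir))
    analyzer = StandardAnalyzer(Version.LUCENE_30)
    # True = create a new index, overwriting any existing one.
    # MaxFieldLength(512) = index at most the first 512 terms of each field.
    writer = IndexWriter(dir, analyzer, True, IndexWriter.MaxFieldLength(512))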

    print >> sys.stderr, "Currently there are %d documents in the index..." % writer.numDocs()

    print >> sys.stderr, "Reading lines from sys.stdin..."
    # One document per input line; the line is both stored and analyzed in the "text" field.
    for l in sys.stdin:
        doc = Document()
        doc.add(Field("text", l, Field.Store.YES, Field.Index.ANALYZED))
        writer.addDocument(doc)

    print >> sys.stderr, "Indexed lines from stdin (%d documents in index)" % (writer.numDocs())
    print >> sys.stderr, "About to optimize index of %d documents..." % writer.numDocs()
    writer.optimize()
    print >> sys.stderr, "...done optimizing index of %d documents" % writer.numDocs()
    print >> sys.stderr, "Closing index of %d documents..." % writer.numDocs()
    writer.close()
    print >> sys.stderr, "...done closing index of %d documents" % writer.numDocs()

Retrieval

import lucene
from lucene import \
    SimpleFSDirectory, File, \
    StandardAnalyzer, IndexSearcher, Version, QueryParser

if __name__ == "__main__":
    lucene.initVM()
    indexDir = "/Tmp/REMOVEME.index-dir"
    dir = SimpleFSDirectory(File(indexDir))
    analyzer = StandardAnalyzer(Version.LUCENE_30)
    searcher = IndexSearcher(dir)

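    # Parse the query against the "text" field, using the same analyzer as at indexing time.
    query = QueryParser(Version.LUCENE_30, "text", analyzer).parse("Find this sentence please")
    MAX = 1000  # retrieve at most this many top-scoring hits
    hits = searcher.search(query, MAX)

    print "Found %d document(s) that matched query '%s':" % (hits.totalHits, query)

    for hit in hits.scoreDocs:
        print hit.score, hit.doc, hit.toString()
        # Load the stored document to read back its "text" field.
        doc = searcher.doc(hit.doc)
        print doc.get("text").encode("utf-8")

    searcher.close()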
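
As a rough mental model of what the two scripts above do (one document per line, analyzed into terms, then a query matched against those terms), here is a toy in-memory sketch in plain Python. It is not PyLucene and not how Lucene is implemented: the names ToyIndex, analyze and STOP_WORDS are hypothetical, the analyzer is a crude stand-in for StandardAnalyzer, and the score is simply the number of distinct query terms a document contains, whereas Lucene uses a more elaborate relevance formula. It does assume, as I believe is the QueryParser default, that query terms are combined with OR.

```python
import re
from collections import defaultdict

# Assumption: a tiny stand-in for StandardAnalyzer (lowercase, split on
# non-alphanumerics, drop a few stop words). The real analyzer does more.
STOP_WORDS = frozenset(["a", "an", "the", "this", "is", "of", "to"])


def analyze(text):
    """Return the list of index terms for a piece of text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


class ToyIndex(object):
    """Hypothetical in-memory index: one document per added line."""

    def __init__(self):
        self.docs = []                    # doc id -> stored text
        self.postings = defaultdict(set)  # term -> set of doc ids

    def add_document(self, text):
        """Store the text and record which terms occur in it."""
        doc_id = len(self.docs)
        self.docs.append(text)
        for term in analyze(text):
            self.postings[term].add(doc_id)
        return doc_id

    def search(self, query, max_hits):
        """OR the query terms; rank by number of distinct terms matched."""
        scores = defaultdict(int)
        for term in set(analyze(query)):
            for doc_id in self.postings.get(term, ()):
                scores[doc_id] += 1
        # Highest score first; ties broken by doc id for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [(score, doc_id) for doc_id, score in ranked[:max_hits]]
```

Adding a few lines with add_document and then calling search("Find this sentence please", 1000) returns (score, doc id) pairs, loosely mirroring hit.score and hit.doc in the retrieval script; the stored text is then looked up by doc id, much as searcher.doc(hit.doc) does.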

  • rcmuir

    Your javadoc link is wrong; it points not to the queryparser but to a contributed module.

    As you can see, the queryparser is indeed documented… it seems the problem is not Lucene but actually Python, as it’s hiding Java’s packaging system from you.

    http://lucene.apache.org/java/3_0_2/api/core/or

  • http://metaoptimize.com Joseph Turian

    Thank you for the pointer. I landed on the wrong page through Google. I have updated my post in light of your comment.

  • Mike McCandless

    When I look in the samples dir for PyLucene 3.0.2, they look current. E.g., I see samples/IndexFiles.py and samples/SearchFiles.py, both of which look like they are using the 3.0 APIs (I think?).

  • http://metaoptimize.com Joseph Turian

    @Mike: Indeed you are correct, I missed that. samples/IndexFiles.py and samples/SearchFiles.py look like they use the 3.0 API, and they provide useful tutorial code. It is just the code in samples/LuceneInAction/ that uses the 2.x API.

  • Mike McCandless

    Actually, samples/LuceneInAction/* also seem (mostly?) current? E.g., I see many places where QueryParser is instantiated with Version.LUCENE_CURRENT as the first arg. And lia/extsearch/collector/BookLinkCollector has been cut over to the “new” (as of 2.9) Collector API.

  • http://metaoptimize.com Joseph Turian

    I am not sure how current the LIA code samples are, or what percent of the code has been ported to the 3.0 API.

    In lia/meetlucene/Indexer.py, for example, the code does not work under the 3.0 API: IndexWriter(indexDir, StandardAnalyzer(), True)

  • Mike McCandless

    You’re right! Some (most?) of the code under LuceneInAction hasn’t been updated for 3.0.

  • Cerin

    Thanks for the examples. Took me a while to figure out how to build PyLucene on Ubuntu, but after that these examples worked perfectly.

  • http://twitter.com/rogueleaderr George London

    Thanks for providing this. Makes jumping into PyLucene way easier.
