Category Archives: Uncategorized

Free consultation on data strategy (NLP, ML, business intelligence, etc.)

Summary
Email me your pitch and how you need help monetizing data.
If I like your pitch, I’ll give you a free consultation on data strategy (NLP, ML, business intelligence, etc.)
Afterwards, if we both think that I can add value to your business, we can talk about a longer-term relationship.
You should forward this blog post to any friend who could use

KEA Keyphrase Extraction as an XML-RPC service (code release)

Summary
We release code written by Ali Afshar, which turns the KEA keyphrase extractor into an XML-RPC service. This allows you to use KEA as a service, calling it from a variety of different programming languages. The code is released under the New BSD License.

Background
Keyphrase extraction (AKA terminology mining, term extraction, term recognition, or glossary extraction) is the

PyLucene 3.0 in 60 seconds — Tutorial sample code for the 3.0 API

Until there is better documentation for Lucene 3.0, I recommend you use Lucene 2.4 or 2.9. Nonetheless, I provide basic indexing and retrieval code using the PyLucene 3.0 API, perhaps the first such example code on the web.

Perhaps job hopping is a good thing?

Summary
I speculate that job hopping, if it becomes a widespread phenomenon, might actually lead to improved business efficiency. In this way, the “Gen Y” job hopping phenomenon could ultimately prove beneficial.

Background

Mark Suster begins the debate by writing: “[Job Hoppers] Make Terrible Employees”.
Paul Dix responds that job hopping is not correlated with employee quality and there are

Code maintainability, and the joy of outsourcing

Summary
According to common wisdom, the best code is developed in-house. I am beginning to believe this is only true when the code must be tightly coupled, or there are realistic security concerns. These scenarios are less common than managers like to believe.
For run-of-the-mill development projects, outsourcing might have advantages above-and-beyond cost savings. If your code effort

Lean Startup, and The Stooges

Okay, I’m ready.
After reading a handful of articles making tenuous connections between entrepreneurship and music, including:

The Notorious CEO: Ten Startup Commandments from Biggie Smalls
Being like The Sex Pistols can help your startup?

I’ve decided to come out and share my favorite startup music.
Dirt, by The Stooges, is a proto-punk cut that sprawls for seven minutes, brooding and smoldering. It

Constitution for Governance of Open-Source Projects (v20100227)

Summary
I propose a default “Constitution for Governance of Open-Source Projects”.

Background
I recently got involved in the OSQA project, which is a fork of CNPROG, which in turn is a clone of the StackExchange Q&A forum software.
Note that the OSQA project has no formal “homepage”, or instructions on how to get involved. I only discovered by chance that there is a mailing-list

Why can’t you pickle generators in Python? A pattern for saving training state

Summary

A pattern for persisting generators is to turn them into pickle-able class objects. This is useful when you use generators for streaming training examples.
I would also try generator_tools, which might be a more convenient alternative to the pattern I describe. I haven’t used it yet.
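
To make the pattern concrete, here is a minimal sketch in Python 3. It is not the code from the full post: the class name ExampleStream and its fields are hypothetical, and it assumes the training examples fit in a list, so the only state worth saving is the current position.

```python
import pickle


class ExampleStream:
    """Iterator over training examples whose position survives pickling.

    Hypothetical illustration: unlike a generator, this object keeps its
    state in ordinary attributes, which pickle can serialize.
    """

    def __init__(self, examples):
        self.examples = list(examples)
        self.position = 0  # index of the next example to yield

    def __iter__(self):
        return self

    def __next__(self):
        if self.position >= len(self.examples):
            raise StopIteration
        example = self.examples[self.position]
        self.position += 1
        return example


# A plain generator cannot be pickled.
try:
    pickle.dumps(x for x in range(3))
    generator_pickled = True
except TypeError:
    generator_pickled = False

# The class-based stream can be saved mid-stream and resumed later.
stream = ExampleStream(["a", "b", "c", "d"])
first = next(stream)
second = next(stream)
saved = pickle.dumps(stream)

restored = pickle.loads(saved)
remaining = list(restored)  # picks up where the original stream stopped
```

The restored object continues from the third example, which is the behavior you want when checkpointing a training run partway through an epoch.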

Generators for streaming training examples
For machine learning, Python generators are a simple idiom that makes it

Use flag --xml when you run mysqldump

Summary:

If you have text data (like a web scrape) stored in a MySQL database, and you want to share the data, mysqldump to XML using the --xml flag.

When fields are unlikely to contain tabs, an even simpler format is a tab-separated file, created using the --tab=path flag to mysqldump. path must be owned by the MySQL database user.
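
For whoever receives the XML dump, a rough sketch of reading it back in Python follows. The sample document imitates the row/field layout that mysqldump --xml writes (mysqldump, database, table_data, row, field elements), but the database name, table name, and column names are made up for illustration, and the helper rows_from_dump is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of "mysqldump --xml" output.
# The names "scrape", "pages", "id", and "body" are assumptions.
SAMPLE_DUMP = """<mysqldump>
<database name="scrape">
  <table_data name="pages">
    <row>
      <field name="id">1</field>
      <field name="body">text with a\ttab and a
newline</field>
    </row>
    <row>
      <field name="id">2</field>
      <field name="body">plain text</field>
    </row>
  </table_data>
</database>
</mysqldump>"""


def rows_from_dump(xml_text, table):
    """Return the rows of one table as a list of {column: text} dicts."""
    root = ET.fromstring(xml_text)
    rows = []
    for table_data in root.iter("table_data"):
        if table_data.get("name") != table:
            continue
        for row in table_data.iter("row"):
            rows.append(
                {field.get("name"): field.text for field in row.iter("field")}
            )
    return rows


rows = rows_from_dump(SAMPLE_DUMP, "pages")
```

Because each value sits inside its own field element, tabs and newlines in scraped text come through intact, which is the case where a tab-separated dump becomes awkward.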

The Problem

Automatically sorting graph curves

A script for automatically sorting graph curves, e.g. for gnuplot.