4 Machine Learning Sessions at Structure:Data that Shouldn’t Be Missed

Here is my shortlist of the sessions at GigaOm Structure:Data that I am most excited about. The fact that they are clustered together at the beginning of Wednesday, March 21 is purely coincidental. For the curious, here is the full lineup of speakers.

STRUCTURING DECISIONS FROM UNSTRUCTURED DATA (8:40 AM), with Seth Grimes, Ron Avnur, Paul Speciale and Staffan Truve.

The first long session of the conference is about the general problem of inducing structure in data. Although the topic is quite broad, I hope to see Seth Grimes lead the discussion toward non-obvious and forward-thinking business applications, particularly of text mining.

MACHINE LEARNING’S IMPACT ON BUSINESS MODELS AND INDUSTRY STRUCTURES (9:10 AM), with George Gilbert, Currie Boyle, Alexander Gray, Mok Oh, and Amarnath Thombre.

Chris Dixon has written on the struggle to develop effective machine learning business models, arguing that ML is “too hot” to be marketed in a B2B setting. I would like to hear the speakers’ insights into ML services as a B2B business model, as opposed to internal use of ML.

PUZZLING (12:05 PM), with Jeff Jonas.

I’ve been meaning to see Jeff Jonas for a while, ever since my friend Todd Huffman (@odd) spoke glowingly of him. Jeff’s talk appears to extend an idea I’ve mentioned in a recent talk: the next step in predictive analytics is using joins on machine-extracted data sets to extract higher-level information.
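To make the join idea concrete, here is a minimal sketch. The record sets, field names, and entities are all invented for illustration; the point is only that joining two separately extracted data sets can surface a higher-level fact (person → city) that neither extractor produced on its own.

```python
from collections import defaultdict

# Two hypothetical machine-extracted data sets.
people = [
    {"name": "Alice", "employer": "Acme"},
    {"name": "Bob", "employer": "Initech"},
]
companies = [
    {"company": "Acme", "city": "Boston"},
    {"company": "Initech", "city": "Austin"},
]

def join_on(left, right, left_key, right_key):
    """Hash join: index the right side, then probe with the left side."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    joined = []
    for row in left:
        for match in index[row[left_key]]:
            joined.append({**row, **match})
    return joined

# Each joined row now links a person to a city via their employer.
result = join_on(people, companies, "employer", "company")
```

At scale the interesting problems are fuzzy key matching and entity resolution, which is exactly the territory Jonas works in.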

UNDERWRITING FOR THE UNDERBANKED THROUGH DATA MINING (3:00 PM), with Mathew Ingram and Douglass Merrill.

I’ve been interested in the use of ML for assessing credit more accurately since reading Pando Daily’s taxonomy of lending and learning about startups in that space. Niche areas in lending are growing; consider, for example, in vitro loans, and the fact that credit scores were historically difficult to estimate in Brazil.

Disclosure: MetaOptimize is a media partner for GigaOm Structure:Data, which means that I get a free pass in exchange for covering the event. It also means you get a 20% discount if you buy a ticket through this link.

Discussion 2.0: Personalization

[The following post is my submission to the Knight-Mozilla “Beyond Comment Threads” challenge.]
The following are the core problems with current discussion systems:

Trolls, acrimonious people, and low-quality commentary can drown out thoughtful discussion and destroy a good community.
Bias towards seniority: Deep insight is penalized if it comes from a new, unknown, or anonymous voice. For example, on

Fat Free CRM in five minutes on a fresh Amazon EC2 micro instance

Would you like to get Fat Free CRM up and running, but spend only five minutes on deployment?
I am not a Rails hacker, so getting Fat Free CRM installed and running is non-trivial for me.
fatfreecrm-ec2 will automatically deploy Fat Free CRM on a fresh Amazon EC2 micro instance. I have also tested it on a fresh Ubuntu Linode slice.
Caveat: The five min­utes will

NLP Challenge: Find semantically related terms over a large vocabulary (>1M)?

In the spirit of shared tasks and NLP “bake-offs”, I hereby announce the first MetaOptimize Challenge. It’s an open problem, and I am interested in involving practitioners who want to demo their style, as well as people who want to learn some large-scale IR/NLP. Hopefully, we’ll all learn something about various real-world approaches.
Join the announcement list
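One naive baseline for the task, sketched below with a toy corpus: represent each term by its sentence-level co-occurrence counts and rank other terms by cosine similarity. This is purely illustrative — a real submission would need approximate methods to scale to a vocabulary of over a million terms.

```python
import math
from collections import Counter, defaultdict

# Toy corpus; a real entry would use a large corpus and vocabulary.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "stocks rose as markets rallied",
    "markets fell and stocks dropped",
]

# Represent each term by the counts of terms co-occurring in the
# same sentence.
vectors = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for t in tokens:
        for u in tokens:
            if t != u:
                vectors[t][u] += 1

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def related(term, k=3):
    """Rank other terms by similarity of their co-occurrence vectors."""
    scores = [(cosine(vectors[term], vectors[u]), u)
              for u in vectors if u != term]
    return [u for _, u in sorted(scores, reverse=True)[:k]]
```

Scaling this to the challenge size is where the interesting engineering begins: dimensionality reduction, locality-sensitive hashing, or distributional word representations.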

Information Organization: A case study in music recommendations

I introduce “information organization”, an approach I have been exploring for several years. As a case study, I examine music recommendations, which existing applications organize poorly. I discuss issues with current applications and describe features that address these issues.

Free consultation on data strategy (NLP, ML, business intelligence, etc.)

Email me your pitch and how you need help monetizing data.
If I like your pitch, I’ll give you a free consultation on data strategy (NLP, ML, business intelligence, etc.).
Afterwards, if we both think that I can add value to your business, we can talk about a longer-term relationship.
You should forward this blog post to any friend who could use

KEA Keyphrase Extraction as an XML-RPC service (code release)

We release code written by Ali Afshar that turns the KEA keyphrase extractor into an XML-RPC service. This allows you to use KEA as a service, calling it from a variety of different programming languages. The code is released under the New BSD License.

Keyphrase extraction (AKA terminology mining, term extraction, term recognition, or glossary extraction) is the
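The general pattern of wrapping an extractor behind XML-RPC can be sketched with Python’s standard library. This is not the released code: the trivial frequency-based extractor below merely stands in for KEA, and the function name, port handling, and example text are all invented for illustration.

```python
import threading
from collections import Counter
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def extract_keyphrases(text, max_phrases=5):
    """Toy stand-in for KEA: rank longer words by frequency."""
    words = [w.strip(".,").lower() for w in text.split()]
    common = Counter(w for w in words if len(w) > 4)
    return [w for w, _ in common.most_common(max_phrases)]

# Expose the extractor over XML-RPC; port 0 lets the OS pick a free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(extract_keyphrases, "extract_keyphrases")
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Any XML-RPC-capable language can now call the extractor remotely.
client = ServerProxy(f"http://127.0.0.1:{port}")
phrases = client.extract_keyphrases(
    "Keyphrase extraction finds salient keyphrase candidates.", 3)
```

The released service follows the same shape, with the real KEA model behind the RPC boundary instead of this toy ranker.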

PyLucene 3.0 in 60 seconds — Tutorial sample code for the 3.0 API

Until there is better documentation for Lucene 3.0, I recommend you use Lucene 2.4 or 2.9. Nonetheless, I provide basic indexing and retrieval code using the PyLucene 3.0 API, perhaps the first such example code on the web.

Perhaps job hopping is a good thing?

I speculate that job hopping, if it becomes a widespread phenomenon, might actually lead to improved business efficiency. In this way, the “Gen Y” job-hopping phenomenon could ultimately prove beneficial.

Mark Suster begins the debate by writing: “[Job Hoppers] Make Terrible Employees”.
Paul Dix responds that job hopping is not correlated with employee quality and there are

Code maintainability, and the joy of outsourcing

According to common wisdom, the best code is developed in-house. I am beginning to believe this is only true when the code must be tightly coupled, or when there are realistic security concerns. These scenarios are less common than managers like to believe.
For run-of-the-mill development projects, outsourcing might have advantages above and beyond cost savings. If your code effort