<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MetaOptimize</title>
	<atom:link href="http://metaoptimize.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://metaoptimize.com/blog</link>
	<description>building machine learning and natural language processing tools</description>
	<lastBuildDate>Sat, 17 Mar 2012 21:25:50 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=abc</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>4 Machine Learning Sessions at Structure:Data that Shouldn’t Be Missed</title>
		<link>http://metaoptimize.com/blog/2012/03/17/4-machine-learning-sessions-at-structuredata-that-shouldnt-be-missed/</link>
		<comments>http://metaoptimize.com/blog/2012/03/17/4-machine-learning-sessions-at-structuredata-that-shouldnt-be-missed/#comments</comments>
		<pubDate>Sat, 17 Mar 2012 21:25:50 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=256</guid>
		<description><![CDATA[
Here is my shortlist of the sessions at GigaOm Structure:Data that I am most excited about. The fact that they are clustered together at the beginning of Wednesday, March 21 is purely coincidental. For the curious, here is the full lineup of speakers.

STRUCTURING DECISIONS FROM UNSTRUCTURED DATA (8:40 AM), with Seth Grimes, Ron Avnur, Paul Speciale [...]]]></description>
			<content:encoded><![CDATA[
<p>Here is my shortlist of the sessions at <a href="http://event.gigaom.com/structuredata/">GigaOm Structure:Data</a> that I am most excited about. The fact that they are clustered together at the beginning of Wednesday, March 21 is purely coincidental. For the curious, here is the <a href="http://event.gigaom.com/structuredata/speakers/">full lineup of speakers</a>.</p>
<hr />
<p>STRUCTURING DECISIONS FROM UNSTRUCTURED DATA (8:40 AM), with Seth Grimes, Ron Avnur, Paul Speciale and Staffan Truve.</p>
<p>The first long session of the conference is about the general problem of inducing structure in data. Although the topic is quite broad, I hope to see Seth Grimes lead the discussion to non-obvious and forward-thinking business applications, particularly of text mining.</p>
<hr />
<p>MACHINE LEARNING’S IMPACT ON BUSINESS MODELS AND INDUSTRY STRUCTURES (9:10 AM), with George Gilbert, Currie Boyle, Alexander Gray, Mok Oh, and Amarnath Thombre.</p>
<p>Chris Dixon has written on the struggle for developing <a href="http://cdixon.org/2011/11/28/business-development-the-goldilocks-principle/">effective machine learning business models</a>, arguing that ML is “too hot” to be marketed in a B2B setting. I would like to see speaker insight into ML services as a B2B business model, as opposed to internal use of ML.</p>
<hr />
<p>PUZZLING (12:05 PM), with Jeff Jonas.</p>
<p>I’ve been meaning to see <a href="http://jeffjonas.typepad.com/">Jeff Jonas</a> for a while, ever since my friend <a href="https://twitter.com/odd">Todd Huffman (@odd)</a> spoke glowingly of him. Jeff’s talk appears to extend an idea I’ve <a href="http://files.meetup.com/1542972/20120202-more-data-same-models-STUDY-SLIDES.pdf">mentioned in a recent talk</a>: The next step in predictive analytics is using joins on machine-extracted data sets to extract higher-level information.</p>
<hr />
<p>UNDERWRITING FOR THE UNDERBANKED THROUGH DATA MINING (3:00 PM), with Mathew Ingram and Douglass Merrill.</p>
<p>I’ve been interested in the use of ML for assessing credit more accurately since reading Pando Daily’s <a href="http://pandodaily.com/2012/02/27/big-data-machine-learning-scared-banks/">taxonomy of lending</a> and learning about startups in that space. Niche areas in lending are growing; consider, for example, <a href="http://online.wsj.com/article/SB10001424052970203960804577241270123249832-email.html">in vitro loans</a>, and the fact that credit scores were historically <a href="http://techcrunch.com/2010/03/07/brazil-the-new-home-of-financial-innovation/">difficult to estimate in Brazil</a>.</p>
<hr />
<p>Disclosure: MetaOptimize is a media partner for GigaOm Structure:Data, which means that I get a free pass in exchange for covering the event. It also means you get a discount of 20% if you buy a ticket through <a href="http://structuredata2012-meta.eventbrite.com/?discount=METAOPTIMIZE">this link</a>.</p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2012/03/17/4-machine-learning-sessions-at-structuredata-that-shouldnt-be-missed/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Discussion 2.0: Personalization</title>
		<link>http://metaoptimize.com/blog/2011/05/22/discussion-2-0-personalization/</link>
		<comments>http://metaoptimize.com/blog/2011/05/22/discussion-2-0-personalization/#comments</comments>
		<pubDate>Mon, 23 May 2011 01:16:49 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=248</guid>
		<description><![CDATA[
[The following post is my submission to the Knight-Mozilla “Beyond Comment Threads” challenge.]
The following are the core problems with current discussion systems:

Trolls, acrimonious people, and low quality commentary can drown out thoughtful discussion and destroy a good community.
Bias towards seniority: Deep insight is penalized if it comes from a new, unknown, or anonymous voice. For [...]]]></description>
			<content:encoded><![CDATA[
<p>[The following post is my submission to the <a href="https://drumbeat.org/en-US/challenges/beyond-comment-threads/submission/186/">Knight-Mozilla “Beyond Comment Threads” challenge</a>.]</p>
<p>The following are the core problems with current discussion systems:</p>
<ol>
<li>Trolls, acrimonious people, and low quality commentary can drown out thoughtful discussion and destroy a good community.</li>
<li>Bias towards seniority: Deep insight is penalized if it comes from a new, unknown, or anonymous voice. For example, on Quora, answering a question one month faster than someone else can lead to a rich-get-richer phenomenon where the old answer gets more upvotes because it is always shown as the top answer, and hence has more visibility merely because it is older. Nepotism creates artificial friction and a barrier to entry, because it is an effective technique for enforcing community standards. But it has the downside that it discriminates, gently or extremely, against insightful new commentators and anonymity.</li>
<li>Voting systems can be gamed by voting rings and ballot stuffing.</li>
<li>Voting systems can lead to “mob rule”.</li>
</ol>
<p>How do we address all these core problems in traditional commenting systems?<br />
How can we create an engaging system that most effectively promotes discussion? Can we avoid nepotism and bias against new, unknown, and anonymous commentators? How can we defend against basic trolling and voting rings?<br />
A next generation discussion system must address these core problems.</p>
<p>The core value of a discussion system is to <em>encourage stimulating and engaging discussion</em>. We want a system that is frictionless to participate in: You can lurk for years and then jump in when you have something great and insightful to say, and your voice is heard loud and clear. This is true democratization of discussion. </p>
<p>The solution is <strong>personalization</strong>. A next generation discussion system is personalized. Personalization makes discussion more stimulating and engaging. Each user that reads and participates gets a comment thread that is sorted by <em>personal</em> relevancy. Irrelevant comments are hidden by default, but can optionally be viewed. Personalization is tuned to promote discussion that the user finds <em>stimulating and engaging</em>, and to hide discussion that the user finds off-topic, spammy, excessively or insufficiently detailed, etc. </p>
<p>The beauty of personalization is its flexibility. It does not force a particular style of discussion. If the user enjoys:</p>
<ul>
<li>heated back-and-forth discussion,</li>
<li>calm discussion with well-reasoned but concise arguments,</li>
<li>in-depth academic discourse,</li>
<li>tabloid-like ad-hominem, or</li>
<li>trolling and hate speech</li>
</ul>
<p>then the user gets what they want.</p>
<p>Additionally, personalization is adapted on a per-topic basis. One particular user might enjoy a heated discussion about abortion, a calm well-reasoned discussion about NoSQL databases, and low-brow discussion about celebrity romance. Per-topic personalization can satisfy all these user needs.</p>
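<p>As a minimal sketch of per-user, per-topic personalization: the class below keeps one preference weight per (user, topic, comment style), sorts a thread by that weight, and hides comments that score below a threshold. The <tt>Personalizer</tt> class, the style labels, and the additive feedback rule are illustrative assumptions, not a description of an existing system.</p>

```python
from collections import defaultdict

# Hypothetical per-user, per-topic preference weights over comment styles.
# The style labels and the additive update rule are illustrative assumptions.
class Personalizer:
    def __init__(self, hide_below=0.0):
        self.hide_below = hide_below
        # weights[(user, topic)][style] -> learned preference score
        self.weights = defaultdict(lambda: defaultdict(float))

    def record_feedback(self, user, topic, style, reward):
        """Nudge the user's preference for a style within one topic."""
        self.weights[(user, topic)][style] += reward

    def rank(self, user, topic, comments):
        """Return (visible, hidden); visible is sorted by personal relevance."""
        prefs = self.weights[(user, topic)]
        scored = [(prefs[c["style"]], c) for c in comments]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        visible = [c for score, c in scored if score >= self.hide_below]
        hidden = [c for score, c in scored if not score >= self.hide_below]
        return visible, hidden


p = Personalizer()
p.record_feedback("alice", "nosql", "calm", 1.0)
p.record_feedback("alice", "nosql", "trolling", -1.0)
thread = [{"id": 1, "style": "trolling"}, {"id": 2, "style": "calm"}]
visible, hidden = p.rank("alice", "nosql", thread)
```

<p>Because the weights are keyed by topic, the same user can have the trolling comment hidden in a NoSQL thread and shown in a thread on another topic. Hidden comments are returned rather than dropped, so they can still be viewed on request.</p>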
<p>I can discuss more details about this approach, including how to:</p>
<ul>
<li>capture personalization information through user interaction with the discussion board.</li>
<li>incorporate atomic commenting.</li>
<li>federate discussion across multiple sites and liberate discussion from a single site.</li>
</ul>
<p>I can also discuss possible objections to personalization, and my response to them.<br />
Due to space limitations (500 words), I omit these details for now, and focus on personalization, which I believe addresses the core problems of traditional discussion systems.</p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2011/05/22/discussion-2-0-personalization/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
		<item>
		<title>Fat Free CRM in five minutes on a fresh Amazon EC2 micro instance</title>
		<link>http://metaoptimize.com/blog/2010/12/29/fat-free-crm-in-five-minutes-on-a-fresh-amazon-ec2-micro-instance/</link>
		<comments>http://metaoptimize.com/blog/2010/12/29/fat-free-crm-in-five-minutes-on-a-fresh-amazon-ec2-micro-instance/#comments</comments>
		<pubDate>Wed, 29 Dec 2010 23:29:31 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=244</guid>
		<description><![CDATA[
Would you like to get Fat Free CRM up-and-running, but spend only five minutes on deployment?
I am not a Rails hacker, so getting Fat Free CRM installed and running is non-trivial for me.
fatfreecrm-ec2 will automatically deploy Fat Free CRM on a fresh Amazon EC2 micro instance. I have also tested it on a fresh Ubuntu [...]]]></description>
			<content:encoded><![CDATA[
<p>Would you like to get <a href="http://www.fatfreecrm.com/">Fat Free CRM</a> up-and-running, but spend only five minutes on deployment?</p>
<p>I am not a Rails hacker, so getting Fat Free CRM installed and running is non-trivial for me.</p>
<p><a href="https://github.com/turian/fatfreecrm-ec2">fatfreecrm-ec2</a> will automatically deploy Fat Free CRM on a fresh Amazon EC2 micro instance. I have also tested it on a fresh Ubuntu Linode slice.</p>
<p>Caveat: The five minutes will probably be spent spinning up the EC2 instance. The script should only take about thirty seconds.</p>
<p>You can also try <a href="http://ryanwood.com/past/2010/1/21/fat-free-crm-on-heroku" class="broken_link">Fat Free CRM on Heroku</a>. The last time I tried this recipe, I had problems I couldn’t resolve, which I discuss in the comment thread on that blog post. However, it looks like the author has recently updated his code, so his script might be a good alternative.</p>
<p>Enjoy!</p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/12/29/fat-free-crm-in-five-minutes-on-a-fresh-amazon-ec2-micro-instance/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>NLP Challenge: Find semantically related terms over a large vocabulary (&gt;1M)?</title>
		<link>http://metaoptimize.com/blog/2010/11/05/nlp-challenge-find-semantically-related-terms-over-a-large-vocabulary-1m/</link>
		<comments>http://metaoptimize.com/blog/2010/11/05/nlp-challenge-find-semantically-related-terms-over-a-large-vocabulary-1m/#comments</comments>
		<pubDate>Sat, 06 Nov 2010 00:49:57 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=220</guid>
		<description><![CDATA[
Summary
In the spirit of shared tasks and NLP “bake offs”, I hereby announce the first MetaOptimize Challenge. It’s an open problem, and I am interested in involving practitioners who want to demo their style, as well as people who want to learn some large-scale IR/NLP. Hopefully, we’ll all learn something about various real-world approaches.
Join the [...]]]></description>
			<content:encoded><![CDATA[
<h2>Summary</h2>
<p>In the spirit of shared tasks and NLP “bake offs”, I hereby announce the first MetaOptimize Challenge. It’s an open problem, and I am interested in involving practitioners who want to demo their style, as well as people who want to learn some large-scale IR/NLP. Hopefully, we’ll all learn something about various real-world approaches.</p>
<p>Join the <a href="http://groups.google.com/group/metaoptimize-challenge-announce">announcement list</a> to hear about any developments or important announcements.</p>
<p>Join the <a href="http://groups.google.com/group/metaoptimize-challenge-discuss">discuss list</a> to chat about techniques and approaches.</p>
<p>I also have an <a href="#ulterior-motive">ulterior motive</a>.</p>
<hr />
<h2>The Problem</h2>
<p>Let’s say I have tens or hundreds of millions of documents, which are very short (only a few words). There are several million word types in the vocabulary. What is the fastest way to find the top-k (say k=10) semantically related words for each word in the vocabulary?</p>
<p>“Semantically related” is purposefully left vague.</p>
<p>When I say fastest, I mean that it should take under a week of computation time, and as little human time as possible. So use of existing implementations is encouraged. Single machine or righteously parallel solutions will both be considered, <b>as long as your approach works and you demo it</b>, preferably in the next two weeks.</p>
<hr />
<h2>Background</h2>
<p>I brought up this question on MetaOptimize Q+A: <a href="http://metaoptimize.com/qa/questions/3230/">Find semantically related terms over a large vocabulary (&gt;1M)?</a> I had some ideas in mind. But I wanted to hear about other ideas.</p>
<p><a href="http://metaoptimize.com/qa/users/27/ogrisel/">Olivier Grisel</a> and <a href="http://metaoptimize.com/qa/users/363/andrew-rosenberg/">Andrew Rosenberg</a> commented on my question, suggesting I post this as a public challenge. So here goes. I hope people participate.</p>
<hr />
<h2>Why is this cool?</h2>
<p>Here is one potential application:<br />
Increased insight into emerging topics, trends, and new products. Run this on social media updates (Facebook posts, Tweets) after collecting sufficient mentions of a topic, trend or product, and gain more insight into what is being discussed.</p>
<p>Coming up with other applications is left as an exercise for the reader.</p>
<hr />
<h2>Problem Details</h2>
<p>Here is a <a href="http://metaoptimize.s3.amazonaws.com/ukwac-uniqmultiwordterms.SAMPLE.txt.gz">sample dataset</a>, for development (6.7 million documents, 40 MB gzipped). There is one document per line. Each word is separated by a space:</p>
<pre>
abbey seal
abbey seekers
abbey series
</pre>
<p>Here is the <a href="http://metaoptimize.s3.amazonaws.com/ukwac-vocabulary.SAMPLE.txt.gz">sample vocabulary file</a>, in decreasing order of frequency (1.1 million word types, 4.5 MB gzipped).<br />
The first column is the frequency and the second column is the word.<br />
There might be words in the dataset that are <i>not</i> in the vocabulary:</p>
<pre>
  32972 group
  31998 research
  30820 information
  30090 uk
  29721 10
  29665 london
</pre>
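<p>A minimal sketch of reading the two file formats described above. The function names are hypothetical, and the code assumes UTF-8 text and that every vocabulary line has exactly two whitespace-separated fields.</p>

```python
import gzip

def read_lines(path):
    """Yield lines from a gzipped text file without their trailing newline."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")

def parse_document(line):
    """Split one dataset line into its space-separated words."""
    return line.strip().split(" ")

def parse_vocab_line(line):
    """Split one vocabulary line into (frequency, word)."""
    freq, word = line.split()
    return int(freq), word

sample_vocab = ["  32972 group", "  31998 research"]
vocab = [parse_vocab_line(line) for line in sample_vocab]
doc = parse_document("abbey seal")
```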
<p>I will soon post a larger dataset and vocabulary file, and announce it on the -announce mailing list.</p>
<p>The desired output you produce is a file with eleven columns.<br />
The first column should be identical to the second column of the vocabulary file. There will be as many lines as there are in the vocabulary file. The next ten columns should be the ten most related words, in descending order of relevance:</p>
<pre>
group groups working research support pm steering ltd pvc advisory age
research researchers centre group researcher project | programme institute unit council
</pre>
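<p>To make the output format concrete, here is a small sketch that builds one output line and checks a whole file against the vocabulary. The helper names are hypothetical, and the check assumes columns are separated by single spaces, as in the sample above.</p>

```python
def format_row(word, related, k=10):
    """Build one output line: the word followed by exactly k related words."""
    if len(related) != k:
        raise ValueError("expected %d related words for %r" % (k, word))
    return " ".join([word] + list(related))

def check_output(lines, vocab_words, k=10):
    """Return True if there is one (k+1)-column line per vocabulary word, in order."""
    if len(lines) != len(vocab_words):
        return False
    rows = (line.split(" ") for line in lines)
    return all(
        len(cols) == k + 1 and cols[0] == word
        for cols, word in zip(rows, vocab_words)
    )

related = "groups working research support pm steering ltd pvc advisory age".split()
row = format_row("group", related)
```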
<p>The challenge is, within two weeks, to post a full output file for the larger dataset. By “full” I mean there is one line for every vocabulary word.</p>
<hr />
<h2>How will you evaluate it?</h2>
<p>You should explain why your solution is correct. There is no “right” answer. Specifying evaluation pretty much determines the solution, as <a href="http://metaoptimize.com/qa/users/33/alexandre-passos/">Alexandre Passos</a> says (p.c.).</p>
<p>Honestly, being able to define the problem and justify your answer is half the puzzle.</p>
<p>Edit: I will post each entry’s 10 related terms for a random subset of vocab words, and then ask people to vote blind. This is a reasonable technique for quantitative evaluation.</p>
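<p>One way the blind vote could be prepared, as a sketch only: sample vocabulary words at random, then shuffle the order of the entries for each word so that voters cannot tell which submission produced which list. The function name, the ballot layout, and the fixed seed are assumptions made for illustration.</p>

```python
import random

def blind_ballots(submissions, vocab_words, sample_size, seed=0):
    """Sample vocab words and shuffle each word's entries to hide their origin."""
    rng = random.Random(seed)
    words = rng.sample(vocab_words, sample_size)
    ballots = []
    for word in words:
        entries = [(name, rows[word]) for name, rows in submissions.items()]
        rng.shuffle(entries)
        # Voters see only "lists"; "key" maps each position back to an entrant.
        ballots.append({
            "word": word,
            "lists": [terms for _, terms in entries],
            "key": [name for name, _ in entries],
        })
    return ballots

submissions = {
    "entry_a": {"group": ["groups"], "uk": ["london"]},
    "entry_b": {"group": ["team"], "uk": ["england"]},
}
ballots = blind_ballots(submissions, ["group", "uk"], sample_size=2)
```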
<hr />
<h2>Why is it hard?</h2>
<p>First you need to take each word, and define a <b>similarity</b> measure between words, depending upon their usage. You need to define this similarity measure over an appropriate feature vector for each word, and choosing a good feature vector is not necessarily obvious.</p>
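<p>As one possible instance of such a similarity measure (an assumption for illustration, not a recommended answer to the challenge), each word can be represented by counts of the other words it shares a document with, and two words compared by the cosine of their count vectors:</p>

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(documents):
    """Map each word to a Counter of the other words it shares a document with."""
    vectors = defaultdict(Counter)
    for doc in documents:
        words = doc.split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = ["abbey seal", "abbey series", "cathedral seal", "cathedral series"]
vectors = cooccurrence_vectors(docs)
similarity = cosine(vectors["abbey"], vectors["cathedral"])
```

<p>In this toy data, “abbey” and “cathedral” never occur together but appear with the same neighbours, so their vectors point in the same direction. Raw counts are the simplest choice of feature; a reweighting such as TF-IDF or PMI is a common alternative.</p>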
<p>Second of all, you need to do fast <b>retrieval</b> of the ten most similar words. But if you look at all 1M * 1M pairs, that’s 1 trillion comparisons.</p>
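<p>One common way to avoid scoring all 10<sup>12</sup> pairs is an inverted index from features to words, so that a word is only compared against words that share at least one feature with it. The sketch below uses a plain sparse dot product as the score. The function name and the choice of score are assumptions, and very frequent features would still produce long candidate lists that need pruning in practice.</p>

```python
import heapq
from collections import Counter, defaultdict

def top_k_related(vectors, k=10):
    """For each word, return the k words with the largest sparse dot product."""
    # Inverted index: feature -> list of (word, weight) that have the feature.
    index = defaultdict(list)
    for word, features in vectors.items():
        for feature, weight in features.items():
            index[feature].append((word, weight))

    related = {}
    for word, features in vectors.items():
        scores = Counter()
        for feature, weight in features.items():
            for other, other_weight in index[feature]:
                if other != word:
                    scores[other] += weight * other_weight
        # Highest score first; ties broken alphabetically for determinism.
        best = heapq.nsmallest(k, scores.items(), key=lambda kv: (-kv[1], kv[0]))
        related[word] = [other for other, _ in best]
    return related

toy_vectors = {
    "abbey": {"seal": 1, "series": 1},
    "cathedral": {"seal": 1, "series": 1},
    "dog": {"seal": 1},
    "cat": {"food": 1},
}
neighbours = top_k_related(toy_vectors, k=2)
```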
<hr />
<h2>Challenge Details</h2>
<p>In a few days, I’m going to write a small post discussing the problem, and possible approaches. I will also point to existing open-source code that can perhaps solve the problem, so that skilled engineers have enough information to put together a working implementation, even if they have no background in NLP/IR. If you write up a good solution on your own blog or on the <a href="http://metaoptimize.com/qa/questions/3230/">MetaOptimize Q+A forum</a>, I’ll mention it in my blog post.</p>
<p>If you have a solution, please share it within, say, two weeks (Friday, November 19th). Share your full result file, or send it to me and I’ll put it on s3.</p>
<p>Join the <a href="http://groups.google.com/group/metaoptimize-challenge-announce">announcement list</a> to hear about any developments or important announcements.</p>
<p>Join the <a href="http://groups.google.com/group/metaoptimize-challenge-discuss">discuss list</a> to chat about techniques and approaches.</p>
<hr />
<h2>Data Set</h2>
<p>The data set is unique terms that occur in a crawl of .uk.</p>
<p>I took the <a href="http://wacky.sslmit.unibo.it/doku.php?id=corpora">UKWAC web-as-corpus crawl</a> (2 billion words, crawled in 2008), ran it through the <a href="http://code.google.com/p/splitta/">splitta</a> sentence splitter, removed all funny characters, ran the <a href="http://www.cis.upenn.edu/~treebank/tokenizer.sed">Penn treebank word tokenizer</a>, and performed term extraction with <a href="http://pypi.python.org/pypi/topia.termextract/">topia.termextract</a>, discarding terms that are single words:</p>
<p><tt><br />
./sentencesplit.py | remove-nonascii-characters.pl | ~/dev/common-scripts/tokenizer.sed | ./topiaterms.py | gzip -c &gt; ukwac-allmultiwordterms.txt.gz<br />
</tt></p>
<p>I then lowercased the terms, sorted them, and uniqued them, to give the <b>dataset</b>:</p>
<p><tt><br />
zcat ukwac-allmultiwordterms.txt.gz | remove-nonascii-characters.pl | perl -ne 'print lc($_);' | sort | uniq | gzip -c &gt; ukwac-uniqmultiwordterms.txt.gz<br />
</tt></p>
<p>Finally, I constructed the vocabulary from the unique terms, to give the <b>vocabulary</b>:<br />
<tt><br />
zcat ukwac-uniqmultiwordterms.txt.gz | perl -ne 's/ /\n/g; print' | sort | uniq -c | sort -rn | gzip -c &gt; ukwac-vocabulary.txt.gz<br />
</tt></p>
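<p>For readers who prefer Python to shell, the last two steps can be approximated in memory as below. This is a sketch with hypothetical function names, not the code used to build the files. It assumes the terms fit in memory, it sorts by code point rather than by the shell locale, and it breaks frequency ties alphabetically, which may differ from the order <tt>sort -rn</tt> gives.</p>

```python
from collections import Counter

def unique_terms(terms):
    """Lowercase, de-duplicate, and sort terms (the lowercase, sort, uniq step)."""
    return sorted({term.lower() for term in terms})

def vocabulary(terms):
    """Count words across terms, most frequent first (the uniq -c, sort -rn step)."""
    counts = Counter(word for term in terms for word in term.split(" "))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

raw_terms = ["Abbey Seal", "abbey seal", "abbey series"]
dataset = unique_terms(raw_terms)
vocab_counts = vocabulary(dataset)
```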
<hr />
<p><a name="ulterior-motive"></a></p>
<h2>Ulterior motive</h2>
<p>If you do this, I have more exciting project work for you, and can pay. This is very similar to the style of interview question I ask, and it’s also very similar to the sort of work I do. So if you can hack it, you’re basically my ideal choice for a collaborator right now.</p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/11/05/nlp-challenge-find-semantically-related-terms-over-a-large-vocabulary-1m/feed/</wfw:commentRss>
		<slash:comments>98</slash:comments>
		</item>
		<item>
		<title>Information Organization: A case study in music recommendations</title>
		<link>http://metaoptimize.com/blog/2010/09/15/information-organization-a-case-study-in-music-recommendations/</link>
		<comments>http://metaoptimize.com/blog/2010/09/15/information-organization-a-case-study-in-music-recommendations/#comments</comments>
		<pubDate>Wed, 15 Sep 2010 17:29:49 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[business intelligence]]></category>
		<category><![CDATA[information organization]]></category>
		<category><![CDATA[ir]]></category>
		<category><![CDATA[minimum viable product]]></category>
		<category><![CDATA[ML]]></category>
		<category><![CDATA[music]]></category>
		<category><![CDATA[music recommendation]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[recommendation]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=204</guid>
		<description><![CDATA[I introduce "information organization", an approach which I have been exploring for several years. As a case study, music recommendations should be organized, but existing applications currently organize music recommendations poorly. I discuss issues with current applications, and discuss features that address these issues.]]></description>
			<content:encoded><![CDATA[
<h2>Summary</h2>
<p>I introduce “information organization”, an approach which I have been exploring for several years. As a case study, music recommendations should be organized, but existing applications currently organize music recommendations poorly. I discuss issues with current applications, and discuss features that address these issues.</p>
<hr />
<h2>Background</h2>
<p>Information organization is basically a family of patterns for collecting, presenting, and navigating information that is structured and/or textual.  These patterns draw upon ideas from NLP, ML, IR, UX, and viz, and in general the patterns can be implemented as loosely-coupled components. The combination of the patterns leads to potent cross-interactions that add up to more than the sum of the individual patterns. This will be clearer when we dive into the case study. These patterns are non-trivial to implement, and require some savvy in NLP or IR. For people knowledgeable in these arts, there is an opportunity to differentiate your offering and create unique value by implementing these patterns.</p>
<p>As a family of patterns, information organization is not tied to any particular application. Rather, it admits a variety of related applications that can build upon implemented patterns. In this case study, I’ll be talking about an application to organizing music recommendations.</p>
<p>I have been developing this concept of information organization for several years. Some of it has been implemented, but much of it has been designed and not yet implemented.</p>
<hr />
<h2>The problem</h2>
<p>People enjoy sharing music recommendations, but lack an effective platform for sharing these recommendations. Here we are discussing the sharing of music preferences in numerical form (scores) and textual form (reviews). I ignore the question of sharing the audio itself.</p>
<p>Here are some current approaches:</p>
<ul>
<li>Directly make a recommendation to a friend, online or off. Problem: Inherently transient and non-archival form of sharing. Also, there is no way for me to get recommendations from new people.</li>
<li>Listen to the radio, online or off. Problem: Not social. Inherently transient and non-archival form of sharing.</li>
<li>I write a blog article for myself or a larger publisher (e.g. Pitchfork) with my review of the music. Problem: Social features are limited. Discussion of particular songs is fragmented across publishers, which means that recommendations are not being effectively shared.</li>
<li>I share a Youtube link on Facebook, and discussion ensues. Problem: The discussion is circumscribed purely by my social circle, and I don’t have a mechanism for connecting with people outside my circle with whom I would nonetheless like to share music recommendations. Also, the historical archive is not accessible, and previous recommendations are not searchable.</li>
</ul>
<p>So what we’re getting at is a music recommendation system that has social, archival, and recommendation features.</p>
<hr />
<h2>Approach</h2>
<p><a href="http://metaoptimize.com/blog/wp-content/uploads/2010/09/information_organization_for_music.mockup.png"><img src="http://metaoptimize.com/blog/wp-content/uploads/2010/09/information_organization_for_music.mockup.small_.png" alt="" /></a></p>
<p>Here’s an application that could solve these problems. Click on the mockup image for a larger view of it.</p>
<p>Note that I am going to talk about many possible features for this application. If you are going to implement something, you should focus on core features. I talk about a variety of possible features to illustrate information organization patterns that are technically feasible but not yet commonplace.</p>
<p>At its core, there are two kinds of user activity:</p>
<ul>
<li>Navigating recommendations.</li>
<li>Adding recommendations.</li>
</ul>
<p>I’ll focus on navigation, since navigation suggests many of the most important features.</p>
<p>When navigating information, consider viewing the recommendations for a particular song. This page will contain different recommendations for the song. These recommendations will be summarized as expandable text snippets, and are presented in a ranked order. For example, if you have reviewed this song, your recommendation will rank at the top.  Recommendations from your friends have higher rank, as do recommendations from people with similar taste to you. (Some users care more about the taste of their friends, and the ranking should reflect that. Some users care more about the taste of people with similar tastes, and for them the ranking should reflect that.) Less important, but nonetheless useful, is the objective “authority” of the source. For example, recommendations by well-respected critics like Pitchfork have higher rank than recommendations by unknown critics, if there is not enough social or personal information to rank the recommendations.</p>
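<p>The ranking described above can be sketched as a weighted score. The field names, the default weights, and the linear combination are all assumptions chosen for illustration. The small authority weight is one way to make authority matter mainly when there is little social or taste information.</p>

```python
def rank_recommendations(recs, viewer, friends, taste_similarity,
                         friend_weight=1.0, taste_weight=1.0,
                         authority_weight=0.1):
    """Order recommendations: the viewer's own first, then by weighted score."""
    def score(rec):
        author = rec["author"]
        if author == viewer:
            return float("inf")  # the viewer's own review always ranks first
        social = friend_weight if author in friends else 0.0
        taste = taste_weight * taste_similarity.get(author, 0.0)
        authority = authority_weight * rec.get("authority", 0.0)
        return social + taste + authority
    return sorted(recs, key=score, reverse=True)

recs = [
    {"author": "pitchfork", "authority": 1.0},
    {"author": "bob"},
    {"author": "me"},
    {"author": "carol"},
]
ranked = rank_recommendations(
    recs, viewer="me", friends={"bob"}, taste_similarity={"carol": 0.5}
)
order = [rec["author"] for rec in ranked]
```

<p>Raising <tt>taste_weight</tt> relative to <tt>friend_weight</tt> for a given user moves people with similar taste above friends, which corresponds to the per-user preference mentioned in the parenthetical above.</p>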
<p>Another aspect of navigating is search. Search should implement auto-complete and auto-suggest. Content should be auto-tagged based upon existing music meta-data as well as reviews, so that searching for “bounce” will find all bounce tracks, even if the term “bounce” is not explicitly mentioned in any review of a particular track.  Auto-tagging can be smoothed across different tracks by the same artist, as well as different tracks that have reviews that contain the same keywords.</p>
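<p>A minimal sketch of smoothing auto-tags across tracks by the same artist: each track inherits a discounted average of the tag weights of its sibling tracks. The function name, the data layout, and the 0.5 discount are assumptions made for illustration.</p>

```python
from collections import Counter, defaultdict

def smooth_tags_by_artist(track_tags, track_artist, artist_weight=0.5):
    """Give each track a share of the tag weights of its artist's other tracks."""
    by_artist = defaultdict(list)
    for track, artist in track_artist.items():
        by_artist[artist].append(track)

    smoothed = {}
    for track, artist in track_artist.items():
        scores = Counter(track_tags.get(track, {}))
        siblings = [t for t in by_artist[artist] if t != track]
        for sibling in siblings:
            for tag, weight in track_tags.get(sibling, {}).items():
                scores[tag] += artist_weight * weight / len(siblings)
        smoothed[track] = dict(scores)
    return smoothed

track_tags = {"track1": {"bounce": 1.0}, "track2": {}}
track_artist = {"track1": "artist_a", "track2": "artist_a", "track3": "artist_b"}
smoothed = smooth_tags_by_artist(track_tags, track_artist)
```

<p>Here the second track picks up a weaker “bounce” tag from its sibling even though no review of it mentions the term, so a search for “bounce” can still find it.</p>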
<p>Another aspect of navigating is finding related entities (entity = song, musician, genre, tag, etc.). Besides seeing popular songs by the same musician, it is also useful to see popular songs in the same genre, related tags, etc. Auto-tagging helps again here to figure out how related two entities are.</p>
<p>There is also the issue that there is no portable open data format for recording numerical preferences about some entity (AFAIK). Simply formalizing the exchange of preference information (not just for music, but any type of entity) would be a big deal.</p>
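<p>To illustrate what a portable preference record might look like, here is a purely hypothetical sketch serialized as JSON. The field names and the 0-to-1 score normalization are assumptions, not an existing standard.</p>

```python
import json

def preference_record(user, entity_type, entity_id, score, scale_max, review=None):
    """Build one hypothetical preference record as a plain dict."""
    record = {
        "user": user,
        "entity": {"type": entity_type, "id": entity_id},
        # Normalize to the range 0..1 so that different rating scales compare.
        "score": score / scale_max,
    }
    if review is not None:
        record["review"] = review
    return record

record = preference_record("alice", "song", "example-song-id", 4, 5)
encoded = json.dumps(record, sort_keys=True)
decoded = json.loads(encoded)
```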
<hr />
<h2>Recap</h2>
<p>We have discussed a handful of different components (ranking based upon social graph, navigating based upon related music, etc.). These features are non-trivial to implement, and require some NLP or IR savvy. The challenge in implementing these features poses an opportunity to those who can. The more features implemented, the more value is created based upon their interactions, so the application can phase shift to a higher echelon of quality.  But there is clearly value in a music recommendation system that has only a partial feature set.  Which features are the easiest to implement that create the most value upfront? I believe that social features, and integration with Facebook and/or Twitter can add a lot of value in terms of creating engagement.</p>
<p>What is the <a href="http://venturehacks.com/articles/minimum-viable-product">minimum viable product</a> for this task?<br />
Possible answers:</p>
<ul>
<li>You log in with twitter, and type a review of a song or a band.  This will be auto-tweeted, but also added to the recommendation page for this song or band.</li>
<li>Scrape web reviews and recommendations. Extract a summary text snippet for each. (Note that this is a non-trivial feature.) Aggregate these snippets on the site.</li>
</ul>
<p>A benefit of the first approach is that it is inherently social. A benefit of the second approach is that it combats the “<a class="zem_slink" title="Cold start" rel="wikipedia" href="http://en.wikipedia.org/wiki/Cold_start">cold start</a>” problem, i.e. it immediately populates the database with useful information.</p>
<p>I am curious what you think is a good minimum viable product. Is personalization of the view part of the core feature set?</p>
<hr />
<h2>About the author</h2>
<p>I consult on <a href="http://metaoptimize.com/blog/2010/08/20/free-consultation-on-data-strategy-nlp-ml-business-intelligence-etc/">data strategy (NLP, ML, business intelligence, etc.)</a><br />
If you are interested in building out any of these ideas, get in touch with me and I can help. In particular, I can <a href="http://metaoptimize.com/blog/2010/08/20/free-consultation-on-data-strategy-nlp-ml-business-intelligence-etc/">advise on how to build the ML + NLP components</a>. I’ll help you get a practical prototype up-and-running really quick, and show you how to refine and improve the components as necessary. Part of the art in implementing information organization is identifying the components that add the most immediate value, and quickly implementing a solid baseline.</p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/09/15/information-organization-a-case-study-in-music-recommendations/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>Free consultation on data strategy (NLP, ML, business intelligence, etc.)</title>
		<link>http://metaoptimize.com/blog/2010/08/20/free-consultation-on-data-strategy-nlp-ml-business-intelligence-etc/</link>
		<comments>http://metaoptimize.com/blog/2010/08/20/free-consultation-on-data-strategy-nlp-ml-business-intelligence-etc/#comments</comments>
		<pubDate>Fri, 20 Aug 2010 18:22:23 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[artificial intelligence]]></category>
		<category><![CDATA[BI]]></category>
		<category><![CDATA[business intelligence]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[large datasets]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[ML]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[statistical modeling]]></category>
		<category><![CDATA[text analysis]]></category>
		<category><![CDATA[web as corpus]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=183</guid>
		<description><![CDATA[
Summary
Email me your pitch and how you need help monetizing data.
If I like your pitch, I’ll give you a free consultation on data strategy (NLP, ML, business intelligence, etc.)
Afterwards, if we both think that I can add value to your business, we can talk about a longer-term relationship.
You should forward this blog post to any [...]]]></description>
			<content:encoded><![CDATA[
<h2>Summary</h2>
<p><a href="mailto:joseph at metaoptimize dot com">Email me</a> your pitch and how you need help monetizing data.<br />
If I like your pitch, I’ll give you a free consultation on data strategy (NLP, ML, business intelligence, etc.)<br />
Afterwards, if we both think that I can add value to your business, we can talk about a longer-term relationship.</p>
<p>You should forward this blog post to any friend who could use this information.</p>
<hr />
<h2>What is data strategy?</h2>
<p>Do you know how to monetize the data you have? How can you improve monetization using other data available to you? How do you transform your data into actionable business intelligence?</p>
<p>I can help you shape your <b>data strategy</b>, your long-term plan for how your business will capture, process, and monetize data.  For example, data strategy can help you in the following circumstances:</p>
<ul>
<li>You don’t know who your individual users are or what they want, so you can’t effectively target ads.</li>
<li>You don’t know what user behavior on your site to track.</li>
<li>You don’t know what information you should start scraping from the web, information which you could use months or years down the line.</li>
</ul>
<p>Besides working backwards from your business goals and business assets to a viable data strategy, I can also help you with more concrete challenges in NLP and machine learning:</p>
<ul>
<li>How do I improve my search engine so that users don’t miss out on relevant results?</li>
<li>How do I add or improve recommendation, to connect users with what they want?</li>
<li>How do I scale this ML algorithm to billions of examples with millions of features?</li>
<li>How do I improve the accuracy of this NLP or ML tool?</li>
</ul>
<hr />
<h2><a name="whoami"></a>Who am I?</h2>
<p>My name is Joseph Turian, and I head MetaOptimize LLC. We consult on NLP, ML, and data strategy. We also run the <a href="http://metaoptimize.com/qa/">MetaOptimize Q&amp;A site</a>, where ML and NLP experts share their knowledge.</p>
<ul>
<li>I am a data expert, holding a Ph.D. in natural language processing and machine learning. I have a decade of experience in these topics. I specialize in <b>large data sets</b>.</li>
<li>I’m <b>business-minded</b>, so I focus on business goals and the most direct path of execution to achieve these goals.</li>
<li>I am also a <b>technology generalist</b> who has been hacking since age 10 and has programmed competitively at a world-class level.</li>
</ul>
<p>References from clients past and present available upon request.</p>
<hr />
<h2>What is the offer?</h2>
<p>You send me information about what you’re doing and why you think I can help you.<br />
<i>Bonus points</i> if you send me your deck, so I can understand your entire business picture. You are asking me to invest valuable expertise and potentially IP in your company, so appeal to me as a potential investor.<br />
<i>Demerits</i> if you send me an NDA prematurely. Uptight companies who think what they are doing isn’t protected by good execution are a turn-off. But if you must be all James Bond about it, I’ll still consider you.</p>
<p>If I like what you’re doing and I can budget time, we schedule a meeting (in person or over Skype) and I’ll give you a free consultation on what you’re doing.</p>
<p>If the initial meeting goes well, and we both see how I can add value to your business, we can decide to continue working together. I can continue to help you by:</p>
<ul>
<li>Advising you periodically about your data strategy.</li>
<li>Building you new tools to use in your product.</li>
<li>Licensing to you existing tools I’ve already built.</li>
<li>Training your smart tech geeks on NLP and ML technology for you to build in-house.</li>
</ul>
<p>Compensation accepted in the form of cash or equity or a mix of both. Pro-bono if you’re an awesome non-profit.</p>
<hr />
<h2>Why am I doing this?</h2>
<ul>
<li>More deals are always good.</li>
<li>I am a social hacker, and enjoy connecting and sharing with other entrepreneurs. I want to meet some more excellent people.</li>
<li>I would like to improve my understanding of widespread challenges and pain points in data strategy. That way, I can build a product that is useful for many people.</li>
<li>This is an interesting social business experiment.</li>
</ul>
<hr />
<h2>Who is this offer for?</h2>
<ul>
<li>Open-source projects looking to use NLP + ML to improve their users’ experience.</li>
<li>Unfunded startups with a promising team, product, and market.</li>
<li>Funded startups.</li>
<li>Established companies.</li>
</ul>
<hr />
<h2>What are you waiting for?</h2>
<p><a href="mailto:joseph at metaoptimize dot com">Email me</a> your pitch and how you need help monetizing data.<br />
Or forward this blog post to a friend who could use this information.</p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/08/20/free-consultation-on-data-strategy-nlp-ml-business-intelligence-etc/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>KEA Keyphrase Extraction as an XML-RPC service (code release)</title>
		<link>http://metaoptimize.com/blog/2010/08/18/kea-keyphrase-extraction-as-an-xml-rpc-service/</link>
		<comments>http://metaoptimize.com/blog/2010/08/18/kea-keyphrase-extraction-as-an-xml-rpc-service/#comments</comments>
		<pubDate>Thu, 19 Aug 2010 03:38:38 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[KEA keyphrase extractor]]></category>
		<category><![CDATA[Remote procedure call]]></category>
		<category><![CDATA[term extractor]]></category>
		<category><![CDATA[Terminology extraction]]></category>
		<category><![CDATA[terminology mining]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[XML-RPC]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=169</guid>
		<description><![CDATA[
Summary
We release code written by Ali Afshar, which turns the KEA keyphrase extractor into an XML-RPC service. This allows you to use KEA as a service, calling it from a variety of different programming languages. The code is released under the New BSD License.

Background
Keyphrase extraction (AKA terminology mining, term extraction, term recognition, or glossary extraction) [...]]]></description>
			<content:encoded><![CDATA[
<h2>Summary</h2>
<p>We release <a href="http://github.com/turian/kea-service">code</a> written by Ali Afshar, which turns the <a href="http://www.nzdl.org/Kea/">KEA</a> keyphrase extractor into an XML-RPC service. This allows you to use KEA as a service, calling it from a variety of different programming languages. The code is released under the <a href="http://en.wikipedia.org/wiki/BSD_licenses#3-clause_license_.28.22New_BSD_License.22.29">New BSD License</a>.</p>
<hr />
<h2>Background</h2>
<p>Keyphrase extraction (AKA terminology mining, term extraction, term recognition, or glossary extraction) is the process of extracting multi-word phrases that summarize the meaning of a text passage.</p>
<p>For example, in <a href="http://www.fao.org/docrep/Article/ejade/ae228e/ae228e00.htm">this document</a> entitled “The Growing Global Obesity Problem: Some Policy Options to Address It”, the keyphrases might be: [“developing countries”, “food consumption”, “overweight”, “taxes”, “prices”, “price policies”, “fiscal policies”, “feeding habits”, “nutritional requirements”, “diet”, “nutrition policies”, and “food intake”.]</p>
<p>These keyphrases are useful for summarizing the topic of the text. They are also useful in later NLP processing steps, and are sometimes more informative and less ambiguous than the individual word tokens in the text.</p>
<p><a href="http://www.nzdl.org/Kea/">KEA</a> is a great keyphrase extraction implementation. It is useful because it is open-source, backed by solid research, comes with some annotated training data, and because it can extract keyphrases over unrestricted text, without needing a vocabulary of possible keyphrases.</p>
<p>Other implementations of keyphrase extraction include:</p>
<ul>
<li><a href="http://code.google.com/p/maui-indexer/">Maui</a>, a topic extractor from the same people that wrote KEA.</li>
<li><a href="http://pypi.python.org/pypi/topia.termextract/">topia.termextract</a> is a Python term extractor. It is relatively noisy and proposes many bogus keywords, but it is simple to use. This is my recommendation if you want something quick-and-dirty that works immediately out of the box.</li>
</ul>
<p>API implementations include:</p>
<ul>
<li><a href="http://www.nactem.ac.uk/software/termine/">Termine</a> by NaCTeM, a permissive term extractor I’ve used in the past. They will give you bulk access for research purposes. It is a UK web service that is also relatively noisy, and proposes many bogus keywords. However, it appears to me to be slightly more accurate than topia.termextract. YMMV.</li>
<li><a href="http://www.alchemyapi.com/api/keyword/">Alchemy’s</a> term extractor.</li>
<li><a href="http://developer.yahoo.com/search/content/V1/termExtraction.html">The Yahoo term extraction API</a>, which is <a href="http://developer.yahoo.net/blog/archives/2010/08/api_updates_and_changes.html">now only available through YQL</a>. It is low recall but high precision. In other words, it gives you a small number of high quality terms, but misses many of the terms in your documents.</li>
<li><a href="http://fivefilters.org/term-extraction/">Five Filters</a>, a web service version of topia’s term extractor (see above).</li>
<li><a href="http://maui-indexer.appspot.com/">Maui on Appspot</a>.</li>
</ul>
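<p>To make the precision/recall trade-off mentioned above concrete, here is a minimal sketch in Python 3. The reference keyphrases are the ones listed earlier for the obesity document; the <tt>returned</tt> list is a made-up example of what a conservative extractor might output, not the actual output of any of the services above, and <tt>precision_recall</tt> is a helper name introduced only for this illustration.</p>
<pre>
def precision_recall(returned, gold):
    """Return (precision, recall) of extracted terms against reference terms."""
    returned, gold = set(returned), set(gold)
    correct = returned.intersection(gold)
    # Precision: what fraction of the returned terms are correct?
    precision = len(correct) / len(returned) if returned else 0.0
    # Recall: what fraction of the reference terms were found?
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Reference keyphrases from the example document above.
gold = [
    "developing countries", "food consumption", "overweight", "taxes",
    "prices", "price policies", "fiscal policies", "feeding habits",
    "nutritional requirements", "diet", "nutrition policies", "food intake",
]
# Hypothetical extractor output: few terms, all of them correct.
returned = ["developing countries", "food consumption", "taxes"]

precision, recall = precision_recall(returned, gold)
print("precision=%.2f recall=%.2f" % (precision, recall))
</pre>
<p>Here every returned term is correct (high precision), but only 3 of the 12 reference terms are found (low recall).</p>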
<p>Peter Turney has done a lot of research on keyphrase extraction, and <a href="http://www.extractor.com/about.aspx">licenses his implementation</a>.</p>
<p>There is a wide academic literature on term extraction, which I won’t summarize here. The best introductions to the techniques are the papers by Park, who is now at IBM:<br />
<a href="http://portal.acm.org/citation.cfm?id=1072370">“Automatic glossary extraction: beyond terminology identification”</a> and<br />
“Glossary extraction and utilization in the information search and delivery system for IBM technical support”. You can read more about how to roll your own <a href="http://stackoverflow.com/questions/1575246/how-do-i-extract-keywords-used-in-text/1575345#1575345">termex implementation here</a>.</p>
<p>More information about the topic is available on the <a href="http://maui-indexer.blogspot.com/">Maui blog</a>.</p>
<hr />
<h2>Code</h2>
<p>When running KEA, instead of a standalone program which reads input from disk, for speed one might want a resident service that keeps the model in memory. Additionally, one might want to call this service from non-Java languages. XML-RPC is a widely supported standard for implementing remote services.</p>
<p>We hereby release <a href="http://github.com/turian/kea-service">KEA service</a> written by Ali Afshar, which turns the <a href="http://www.nzdl.org/Kea/">KEA</a> keyphrase extractor into an XML-RPC service. This allows you to use KEA as a service, calling it from a variety of different programming languages. The code is released under the <a href="http://en.wikipedia.org/wiki/BSD_licenses#3-clause_license_.28.22New_BSD_License.22.29">New BSD License</a>.</p>
<p>Also included in <a href="http://github.com/turian/kea-service/blob/master/README">the documentation</a> is a description of how this Java program was converted into an XML-RPC service.</p>
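<p>The resident-service idea can be sketched with nothing but the Python 3 standard library. This is an illustration of the pattern, not the KEA service itself: <tt>ToyExtractor</tt>, its vocabulary lookup, and the method name <tt>extract</tt> are all assumptions made up for this example, so check the KEA service README for the real method names and port.</p>
<pre>
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class ToyExtractor:
    """Stand-in for a model that is expensive to load (NOT the real KEA model)."""

    def __init__(self, vocabulary):
        # Loaded once, when the service starts, and then kept in memory.
        self.vocabulary = [phrase.lower() for phrase in vocabulary]

    def extract(self, text):
        # Hypothetical method: return the known phrases that occur in the text.
        lowered = text.lower()
        return [phrase for phrase in self.vocabulary if phrase in lowered]


def start_service(extractor, host="127.0.0.1"):
    # Port 0 asks the operating system for any free port.
    server = SimpleXMLRPCServer((host, 0), logRequests=False)
    server.register_function(extractor.extract, "extract")
    # Serve requests in a background thread so the client below can run.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


server = start_service(ToyExtractor(["food consumption", "price policies"]))
port = server.server_address[1]

# Any language with an XML-RPC client library could make this same call.
client = ServerProxy("http://127.0.0.1:%d/" % port)
phrases = client.extract("Price policies can change food consumption.")
print(phrases)

server.shutdown()
server.server_close()
</pre>
<p>The model is built once in <tt>start_service</tt> and every later call reuses it, which is the speed benefit described above. The client only needs the URL and the method name.</p>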

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/08/18/kea-keyphrase-extraction-as-an-xml-rpc-service/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>PyLucene 3.0 in 60 seconds — Tutorial sample code for the 3.0 API</title>
		<link>http://metaoptimize.com/blog/2010/08/09/pylucene-3-0-in-60-seconds-tutorial-sample-code-for-the-3-0-api/</link>
		<comments>http://metaoptimize.com/blog/2010/08/09/pylucene-3-0-in-60-seconds-tutorial-sample-code-for-the-3-0-api/#comments</comments>
		<pubDate>Mon, 09 Aug 2010 16:08:00 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[API]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=135</guid>
		<description><![CDATA[Until there is better documentation for Lucene 3.0, I recommend you use Lucene 2.4 or 2.9. Nonetheless, I provide basic indexing and retrieval code using the PyLucene 3.0 API, perhaps the first such example code on the web.]]></description>
			<content:encoded><![CDATA[
<h2>Summary</h2>
<p>I provide basic indexing and retrieval code using the PyLucene 3.0 API. <a href="http://manning.com/lucene">Lucene In Action (2nd Ed)</a> covers Lucene 3.0, but only the Java code samples have been updated for the 3.0 API, not the PyLucene ones. Unfortunately, there is currently little (no?) example PyLucene code in the blogosphere. If you have links to more Lucene 3.0 tutorials and samples, please share them in the comments.</p>
<p><em>Update 20100810: In light of discussions with others, this post has been substantially rewritten since it was first posted.</em></p>
<hr />
<h2>Background</h2>
<p>Historically, I have found it easy to write basic PyLucene 2.4 (or 2.9?) code. PyLucene includes Lucene In Action code samples ported from Java to Python, and these code samples are correct and easy to adapt. I recently was developing a new project based upon Lucene (<a href="http://github.com/turian/biased-text-sample">biased-text-sample</a>), and I decided to try PyLucene 3.0.2–1. I was surprised to find that PyLucene code samples in <a href="http://svn.apache.org/viewvc/lucene/pylucene/trunk/samples/LuceneInAction/" class="broken_link"><tt>samples/LuceneInAction/</tt></a> are out-of-date, and use the 2.x API. (Note: The code in <a href="http://svn.apache.org/viewvc/lucene/pylucene/trunk/samples/"><tt>samples/*.py</tt></a> appears to be updated to the 3.0 API.)</p>
<p>I was able to find no Lucene 3.0 tutorials or code samples on the web, except for this one article:</p>
<ul>
<li>
<a href="http://ikaisays.com/2010/04/24/lucene-in-memory-search-example-now-updated-for-lucene-3-0-1/">Lucene In-Memory Search Example: Now updated for Lucene 3.0.1</a></li>
</ul>
<p>If you have links to more Lucene 3.0 tutorials and samples, please share them in the comments.</p>
<hr />
<h2>Sample PyLucene 3.0 code</h2>
<p>In the spirit of Lingpipe’s <a href="http://lingpipe-blog.com/2009/02/18/lucene-24-in-60-seconds/">Lucene 2.4 in 60 seconds</a>, here are relevant PyLucene 3.0 code snippets from my <a href="http://github.com/turian/biased-text-sample">biased-text-sample</a> project, for indexing and retrieval. </p>
<h3>Indexing</h3>
<pre>import sys

import lucene
from lucene import \
    SimpleFSDirectory, File, \
    Document, Field, StandardAnalyzer, IndexWriter, Version

if __name__ == "__main__":
    # Start the JVM before touching any Lucene class.
    lucene.initVM()
    indexDir = "/Tmp/REMOVEME.index-dir"
    dir = SimpleFSDirectory(File(indexDir))
    analyzer = StandardAnalyzer(Version.LUCENE_30)
    # True = create a new index, replacing any existing one.
    # MaxFieldLength(512) indexes at most the first 512 terms of each field.
    writer = IndexWriter(dir, analyzer, True, IndexWriter.MaxFieldLength(512))

    print >> sys.stderr, "Currently there are %d documents in the index..." % writer.numDocs()

    print >> sys.stderr, "Reading lines from sys.stdin..."
    # Index each input line as its own document, storing the raw text.
    for l in sys.stdin:
        doc = Document()
        doc.add(Field("text", l, Field.Store.YES, Field.Index.ANALYZED))
        writer.addDocument(doc)

    print >> sys.stderr, "Indexed lines from stdin (%d documents in index)" % (writer.numDocs())
    print >> sys.stderr, "About to optimize index of %d documents..." % writer.numDocs()
    writer.optimize()
    print >> sys.stderr, "...done optimizing index of %d documents" % writer.numDocs()
    print >> sys.stderr, "Closing index of %d documents..." % writer.numDocs()
    writer.close()
    print >> sys.stderr, "...done closing index of %d documents" % writer.numDocs()
</pre>
<h3>Retrieval</h3>
<pre>
import lucene
from lucene import \
    SimpleFSDirectory, File, \
    StandardAnalyzer, IndexSearcher, Version, QueryParser

if __name__ == "__main__":
    # Start the JVM before touching any Lucene class.
    lucene.initVM()
    indexDir = "/Tmp/REMOVEME.index-dir"
    dir = SimpleFSDirectory(File(indexDir))
    # Use the same analyzer for querying as was used for indexing.
    analyzer = StandardAnalyzer(Version.LUCENE_30)
    searcher = IndexSearcher(dir)

    # Parse the query string against the "text" field.
    query = QueryParser(Version.LUCENE_30, "text", analyzer).parse("Find this sentence please")
    MAX = 1000  # maximum number of hits to return
    hits = searcher.search(query, MAX)

    print "Found %d document(s) that matched query '%s':" % (hits.totalHits, query)

    for hit in hits.scoreDocs:
        print hit.score, hit.doc, hit.toString()
        # Fetch the stored document to recover its original text.
        doc = searcher.doc(hit.doc)
        print doc.get("text").encode("utf-8")
</pre>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/08/09/pylucene-3-0-in-60-seconds-tutorial-sample-code-for-the-3-0-api/feed/</wfw:commentRss>
		<slash:comments>43</slash:comments>
		</item>
		<item>
		<title>Perhaps job hopping is a good thing?</title>
		<link>http://metaoptimize.com/blog/2010/04/27/perhaps-job-hopping-is-a-good-thing/</link>
		<comments>http://metaoptimize.com/blog/2010/04/27/perhaps-job-hopping-is-a-good-thing/#comments</comments>
		<pubDate>Tue, 27 Apr 2010 22:37:11 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[gen y]]></category>
		<category><![CDATA[social shift]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=118</guid>
		<description><![CDATA[
Summary
I speculate that job hopping, if it becomes a widespread phenomenon, might actually lead to improved business efficiency. In this way, the “Gen Y” job hopping phenomenon could ultimately prove beneficial.

Background

Mark Suster begins the debate by writing: “[Job Hoppers] Make Terrible Employees”.
Paul Dix responds that job hopping is not correlated with employee quality and there [...]]]></description>
			<content:encoded><![CDATA[
<h1>Summary</h1>
<p>I speculate that job hopping, if it becomes a widespread phenomenon, might actually lead to improved business efficiency. In this way, the “Gen Y” job hopping phenomenon could ultimately prove beneficial.</p>
<hr />
<h1>Background</h1>
<ul>
<li>Mark Suster begins the debate by writing: <a href="http://www.bothsidesofthetable.com/2010/04/22/never-hire-job-hoppers-never-they-make-terrible-employees/">“[Job Hoppers] Make Terrible Employees”</a>.</li>
<li>Paul Dix responds that <a href="http://www.pauldix.net/2010/04/why-mark-suster-is-wrong-about-not-hiring-job-hoppers.html">job hopping is not correlated with employee quality</a> and there are many better ways to assess the value of an individual employee than the length of their previous jobs.</li>
<li>Penelope Trunk thinks that <a href="http://blogs.bnet.com/career-advice/?p=811&#038;tag=nl.e713">job hoppers make the best employees</a>, because they are <strong>more</strong> qualified and loyal.</li>
<li>Mark Suster <a href="http://www.bothsidesofthetable.com/2010/04/25/job-hoppers-redux-an-employees-perspective/">replies</a> to Paul Dix, clarifying and defending his original arguments.</li>
<li>Jason Calacanis looks at the overall trend of job hopping, and argues that it is a <a href="http://calacanis.com/2010/04/27/red-jackson-gen-y-loyalty/">negative trait of Gen Y</a>.</li>
<li>Andrew Warner argues that <a href="http://mixergy.com/lets-admit-why-there-are-so-many-job-hoppers-in-startupland/">startup employers might simply be mismanaging expectations</a>.</li>
</ul>
<p>I was discussing these articles today with <a href="http://www.chriskenton.com/">Chris Kenton</a> of <a href="http://www.socialrep.com/">SocialRep</a>.</p>
<hr />
<h1>Issues with Jason Calacanis’s piece</h1>
<p>Jason’s piece has a very crotchety tone, with a “the kids these days are driving society to hell in a handbasket” sort of feel. To wit:</p>
<ul>
<li> “the majority of them seem to lack killer instinct but have excel at entitlement”</li>
<li> “It’s so obvious to me why our country is spiraling like a regional jet piloted by a $9 an hour, 20 year-old pilot with under 1,000 hours of flight time.”</li>
</ul>
<p>These all sound like the sort of criticisms every older generation lobs at younger ones, which makes me immediately skeptical.</p>
<p>Obviously, Jason has had negative experiences with employees who leave after one year. I’m not saying these employees were good. But I think he draws the wrong generalizations and I suspect that the trend of job hopping might ultimately lead to societal and economic good.</p>
<hr />
<h1>Could Job Hopping be beneficial?</h1>
<p>I think there is definitely a social shift that is occurring, but I think this concept of discrete “generations” is a red herring, since the shift is occurring gradually, not as a step function.</p>
<p>Here I’m just going to speculate a bit: If Jason’s prediction is true, and ten years down the road it is not uncommon that most people job hop every year until they find a good relationship, it might not be as grim as the old guard predicts. In fact, it could ultimately have beneficial effects. I can understand how this idea is scary to conventional businesses, but since I don’t have extensive industry experience, I have the luxury of having little enough bias to use my imagination about how this might ultimately be beneficial. <img src='http://metaoptimize.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>The real problem with job hopping is the expensive initial cost of integrating a new employee into your organization. Job hopping would not really be so problematic if businesses (and new employees) were set up for people to contribute value immediately. Perhaps employers and employees alike would benefit from businesses restructuring their processes to be more modular and self-contained. This is similar to how it seems initially expensive to design your code so that components are loosely coupled, but ultimately this discipline leads to greater flexibility and easier maintainability. Similarly, structuring your organization and processes in such a way that you can easily add (or remove!) talent can ultimately lead to efficiency. (I make similar comments about <a href="http://metaoptimize.com/blog/2010/03/11/code-maintainability-and-the-joy-of-outsourcing/">outsourcing your code</a>.)</p>
<p>As I said, this idea on my part is purely creative speculation, and I can’t claim I have enough experience to know whether this is true or not. So when it comes to whether job hopping is good (as Paul Dix says) or bad (as Mark Suster and Jason Calacanis say), I have to abstain.</p>
<p>The idea of blind loyalty is an artifact of situations in which the party to which you are loyal (a large corporation, an Army, etc.) is far too large to have a relationship with you. When an actual relationship is possible, that is far preferable to some impersonal loyalty.</p>
<p>Alignment of interests and clear communication are the best way to make any sort of relationship work.</p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/04/27/perhaps-job-hopping-is-a-good-thing/feed/</wfw:commentRss>
		<slash:comments>24</slash:comments>
		</item>
		<item>
		<title>Code maintainability, and the joy of outsourcing</title>
		<link>http://metaoptimize.com/blog/2010/03/11/code-maintainability-and-the-joy-of-outsourcing/</link>
		<comments>http://metaoptimize.com/blog/2010/03/11/code-maintainability-and-the-joy-of-outsourcing/#comments</comments>
		<pubDate>Thu, 11 Mar 2010 21:26:54 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Outsourcing]]></category>
		<category><![CDATA[project manager]]></category>
		<category><![CDATA[Refactoring]]></category>
		<category><![CDATA[Software engineering]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=106</guid>
		<description><![CDATA[
Summary
According to common wisdom, the best code is developed in-house. I am beginning to believe this is only true when the code must be tightly coupled, or there are realistic security concerns. These scenarios are less common than managers like to believe.
For run-of-the-mill development projects, outsourcing might have advantages above-and-beyond cost savings. If your code [...]]]></description>
			<content:encoded><![CDATA[
<h1>Summary</h1>
<p>According to common wisdom, the best code is developed in-house. I am beginning to believe this is only true when the code must be tightly coupled, or there are realistic security concerns. These scenarios are less common than managers like to believe.</p>
<p>For run-of-the-mill development projects, outsourcing might have advantages above-and-beyond cost savings. <em>If your code effort can be outsourced, you should try it</em>. Not only will it be cheaper, but the final code will be easier to maintain.</p>
<hr />
<h1>Background</h1>
<p>KSplice recently wrote about <a href="http://blog.ksplice.com/2010/03/quadruple-productivity-with-an-intern-army">the best way to manage interns</a>. The take-home point is: <em>“Divide tasks to be as loosely-coupled as possible.”</em></p>
<p>Recently, a commenter on <a href="http://thefunded.com/funds/item/6799">thefunded.com asked</a>:</p>
<blockquote><p>I’ve been working on a deal in which a larger software company would give me a platform they developed so we can build a business around it. The larger company has given up on it.</p>
<p>The key developer of the platform was to be included in the deal. But he’s apparently disgruntled and, literally, has gone postal. (There are serious issues; getting him back isn’t really an option now.)</p>
<p>So we have a platform, without documentation, and without the guy who built it. But it has been launched in public applications and is perfectly functional. Basically we would just be reskinning it and adding in a few new features when we relaunch under the new business.</p></blockquote>
<p>My advice? Try outsourcing.</p>
<hr />
<h1>Try outsourcing</h1>
<p>Here was my advice to this person:</p>
<p>Your goal is to improve the maintainability of your code, so that you can easily find new developers to jump in on your project. Your goal is also to have the code at a point that you are no longer beholden to any developers, and you can easily fire a developer without feeling like you are locked in to them.</p>
<p>My advice is that you find a good project manager to document the code and, more importantly, refactor the codebase to make the components more loosely coupled. This project manager should break the code into pieces and delegate to a handful of <em>independent remote subcontractors who don’t communicate with each other</em>. If independent remote workers can refactor and clean up the code, without communicating with each other, then it means the final code will be easy to maintain. It then follows that an in-house development team should be able to easily jump into the codebase. Or, you could outsource further improvements. Your choice.</p>
<p>Consider that the approach of independent remote developers with little communication is the same approach taken by many open-source projects.</p>
<p>If the project is hard to break into pieces, this is why you need a good project manager. He or she will understand the overall architecture, and see along what lines it is best to create division of responsibilities in the code.</p>
<p>You could choose a single tightly-knit dev team who are in constant communication, but the risk is that they will understand aspects of the code that they don’t document, and that there will be communal wisdom passed around by oral communication. In this case, you are bound to these developers.</p>
<p>What you want is everything written down and easy to pick up by the next guy. So you should force that to be the case in your refactoring process.</p>
<p>Although it might take independent remote developers more time to refactor the codebase than a single tightly-knit development team, the final product will be easier to maintain in the long run. And even though the independent remote coders will incur two or three times as many billable hours as the tightly-knit team, foreign programmers’ hourly rates are a quarter to a fifth of domestic rates, so the total bill comes to roughly 40% to 75% of the domestic cost. So I think it’s a win in terms of both cost and final results.</p>
<p>Even though I am a hardcore developer myself, I have recently been dabbling in subcontracting to independent developers in Eastern Europe, and have been amazed with the results. It allows me to develop much faster, and it makes my code easier to maintain, because it is impossible to subcontract work unless your code has good separation of concerns and is loosely coupled. I now have built some good relationships with sharp coders who I trust to understand my directions and deliver clean code on time.</p>
<hr />
<p>I sense that I am going to get pushback on this from defensive domestic coders, because it goes against the common wisdom, but I think it is an option worth considering.</p>
<p>Would you share your experiences, positive and negative, with outsourcing?</p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/03/11/code-maintainability-and-the-joy-of-outsourcing/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
	</channel>
</rss>
