<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MetaOptimize</title>
	<atom:link href="http://metaoptimize.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://metaoptimize.com/blog</link>
	<description>building machine learning and natural language processing tools</description>
	<lastBuildDate>Fri, 20 Aug 2010 19:11:16 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=abc</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Free consultation on data strategy (NLP, ML, business intelligence, etc.)</title>
		<link>http://metaoptimize.com/blog/2010/08/20/free-consultation-on-data-strategy-nlp-ml-business-intelligence-etc/</link>
		<comments>http://metaoptimize.com/blog/2010/08/20/free-consultation-on-data-strategy-nlp-ml-business-intelligence-etc/#comments</comments>
		<pubDate>Fri, 20 Aug 2010 18:22:23 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[artificial intelligence]]></category>
		<category><![CDATA[BI]]></category>
		<category><![CDATA[business intelligence]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[large datasets]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[ML]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[statistical modeling]]></category>
		<category><![CDATA[text analysis]]></category>
		<category><![CDATA[web as corpus]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=183</guid>
		<description><![CDATA[
Summary
Email me your pitch and how you need help monetizing data.
If I like your pitch, I’ll give you a free consultation on data strategy (NLP, ML, business intelligence, etc.)
Afterwards, if we both think that I can add value to your business, we can talk about a longer-term relationship.
You should forward this blog post to any [...]]]></description>
			<content:encoded><![CDATA[
<h2>Summary</h2>
<p><a href="mailto:joseph at metaoptimize dot com">Email me</a> your pitch and how you need help monetizing data.<br />
If I like your pitch, I’ll give you a free consultation on data strategy (NLP, ML, business intelligence, etc.)<br />
Afterwards, if we both think that I can add value to your business, we can talk about a longer-term relationship.</p>
<p>You should forward this blog post to any friend who could use this information.</p>
<hr />
<h2>What is data strategy?</h2>
<p>Do you know how to monetize the data you have? How can you improve monetization using other data available to you? How do you transform your data into actionable business intelligence?</p>
<p>I can help you shape your <b>data strategy</b>, your long-term plan for how your business will capture, process, and monetize data.  For example, data strategy can help you in the following circumstances:</p>
<ul>
<li>You don’t know who your individual users are or what they want, so you can’t effectively target ads.</li>
<li>You don’t know what user behavior on your site to track.</li>
<li>You don’t know what information you should start scraping from the web, information which you could use months or years down the line.</li>
</ul>
<p>Besides working backwards from your business goals and business assets to a viable data strategy, I can also help you with more concrete challenges in NLP and machine learning:</p>
<ul>
<li>How do I improve my search engine so that users don’t miss out on relevant results?</li>
<li>How do I add or improve recommendation, to connect users with what they want?</li>
<li>How do I scale this ML algorithm to billions of examples with millions of features?</li>
<li>How do I improve the accuracy of this NLP or ML tool?</li>
</ul>
<h2>Who am I?</h2>
<p>My name is Joseph Turian, and I head MetaOptimize LLC. We consult on NLP, ML, and data strategy. We also run the <a href="http://metaoptimize.com/qa/">MetaOptimize Q&amp;A site</a>, where ML and NLP experts share their knowledge.</p>
<ul>
<li>I am a data expert, holding a Ph.D. in natural language processing and machine learning. I have a decade of experience in these topics. I specialize in <b>large data sets</b>.</li>
<li>I’m <b>business-minded</b>, so I focus on business goals and the most direct path of execution to achieve these goals.</li>
<li>I am also a <b>technology generalist</b> who has been hacking since age 10 and has programmed competitively at a world-class level.</li>
</ul>
<p>References from clients past and present available upon request.</p>
<hr />
<h2>What is the offer?</h2>
<p>You send me information about what you’re doing and why you think I can help you.<br />
<i>Bonus points</i> if you send me your deck, so I can understand your entire business picture. You are asking me to invest valuable expertise and potentially IP in your company, so appeal to me as a potential investor.<br />
<i>Demerits</i> if you send me an NDA prematurely. Uptight companies who think what they are doing isn’t protected by good execution are a turn-off. But if you must be all James Bond about it, I’ll still consider you.</p>
<p>If I like what you’re doing and I can budget time, we schedule a meeting (in person or over Skype) and I’ll give you a free consultation on what you’re doing.</p>
<p>If the initial meeting goes well, and we both see how I can add value to your business, we can decide to continue working together. I can continue to help you in any of the following ways:</p>
<ul>
<li>Advising you periodically about your data strategy.</li>
<li>Building you new tools to use in your product.</li>
<li>Licensing you tools that I’ve already built.</li>
<li>Training your smart tech geeks on NLP and ML technology for you to build in-house.</li>
</ul>
<p>Compensation accepted in the form of cash or equity or a mix of both. Pro-bono if you’re an awesome non-profit.</p>
<hr />
<h2>Why am I doing this?</h2>
<ul>
<li>More deals are always good.</li>
<li>I am a social hacker, and enjoy connecting and sharing with other entrepreneurs. I want to meet some more excellent people.</li>
<li>I would like to improve my understanding of widespread challenges and pain points in data strategy. That way, I can build a product that is useful for many people.</li>
<li>This is an interesting social business experiment.</li>
</ul>
<hr />
<h2>Who is this offer for?</h2>
<ul>
<li>Open-source projects looking to use NLP + ML to improve their users’ experience.</li>
<li>Unfunded startups with a promising team, product, and market.</li>
<li>Funded startups.</li>
<li>Established companies.</li>
</ul>
<hr />
<h2>What are you waiting for?</h2>
<p><a href="mailto:joseph at metaoptimize dot com">Email me</a> your pitch and how you need help monetizing data.<br />
Or forward this blog post to a friend who could use this information.</p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/08/20/free-consultation-on-data-strategy-nlp-ml-business-intelligence-etc/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>KEA Keyphrase Extraction as an XML-RPC service (code release)</title>
		<link>http://metaoptimize.com/blog/2010/08/18/kea-keyphrase-extraction-as-an-xml-rpc-service/</link>
		<comments>http://metaoptimize.com/blog/2010/08/18/kea-keyphrase-extraction-as-an-xml-rpc-service/#comments</comments>
		<pubDate>Thu, 19 Aug 2010 03:38:38 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[KEA keyphrase extractor]]></category>
		<category><![CDATA[Remote procedure call]]></category>
		<category><![CDATA[term extractor]]></category>
		<category><![CDATA[Terminology extraction]]></category>
		<category><![CDATA[terminology mining]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[XML-RPC]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=169</guid>
		<description><![CDATA[
Summary
We release code written by Ali Afshar, which turns the KEA keyphrase extractor into an XML-RPC service. This allows you to use KEA as a service, calling it from a variety of different programming languages. The code is released under the New BSD License.

Background
Keyphrase extraction (AKA terminology mining, term extraction, term recognition, or glossary extraction) [...]]]></description>
			<content:encoded><![CDATA[
<h2>Summary</h2>
<p>We release <a href="http://github.com/turian/kea-service">code</a> written by Ali Afshar, which turns the <a href="http://www.nzdl.org/Kea/">KEA</a> keyphrase extractor into an XML-RPC service. This allows you to use KEA as a service, calling it from a variety of different programming languages. The code is released under the <a href="http://en.wikipedia.org/wiki/BSD_licenses#3-clause_license_.28.22New_BSD_License.22.29">New BSD License</a>.</p>
<hr />
<h2>Background</h2>
<p>Keyphrase extraction (AKA terminology mining, term extraction, term recognition, or glossary extraction) is the process of extracting multi-word phrases that summarize the meaning of a text passage.</p>
<p>For example, in <a href="http://www.fao.org/docrep/Article/ejade/ae228e/ae228e00.htm">this document</a> entitled “The Growing Global Obesity Problem: Some Policy Options to Address It”, the keyphrases might be: “developing countries”, “food consumption”, “overweight”, “taxes”, “prices”, “price policies”, “fiscal policies”, “feeding habits”, “nutritional requirements”, “diet”, “nutrition policies”, and “food intake”.</p>
<p>These keyphrases are useful for summarizing the topic of the text. They are also useful in later NLP processing steps, where they are sometimes more informative and less ambiguous than the individual word tokens in the text.</p>
<p><a href="http://www.nzdl.org/Kea/">KEA</a> is a great keyphrase extraction implementation. It is useful because it is open-source, backed by solid research, comes with some annotated training data, and because it can extract keyphrases over unrestricted text, without needing a vocabulary of possible keyphrases.</p>
<p>Other implementations of keyphrase extraction include:</p>
<ul>
<li><a href="http://code.google.com/p/maui-indexer/">Maui</a>, a topic extractor from the same people that wrote KEA.</li>
<li><a href="http://pypi.python.org/pypi/topia.termextract/">topia.termextract</a>, a Python term extractor that is relatively noisy and proposes many bogus keywords, but is simple to use. This is my recommendation if you want something quick-and-dirty that works out of the box.</li>
</ul>
<p>API implementations include:</p>
<ul>
<li><a href="http://www.nactem.ac.uk/software/termine/">Termine</a> by NaCTeM, a permissive term extractor I’ve used in the past. They will give you bulk access for research purposes. It is a UK web service that is also relatively noisy and proposes many bogus keywords. However, it appears to me to be slightly more accurate than topia.termextract. YMMV.</li>
<li><a href="http://www.alchemyapi.com/api/keyword/">Alchemy’s</a> term extractor.</li>
<li><a href="http://developer.yahoo.com/search/content/V1/termExtraction.html">The Yahoo term extraction API</a>, which is <a href="http://developer.yahoo.net/blog/archives/2010/08/api_updates_and_changes.html">now only available through YQL</a>. It is low recall but high precision. In other words, it gives you a small number of high quality terms, but misses many of the terms in your documents.</li>
<li><a href="http://fivefilters.org/term-extraction/">Five Filters</a>, a web service version of topia’s term extractor (see above).</li>
<li><a href="http://maui-indexer.appspot.com/">Maui on Appspot</a>.</li>
</ul>
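<p>To make the precision/recall trade-off mentioned above concrete, here is a minimal Python 3 sketch. The gold-standard and predicted keyphrase sets are made up for illustration; they are not the output of any of the services listed above.</p>
<pre>
```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of a predicted keyphrase set against a gold set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold-standard keyphrases for one document.
gold = {"developing countries", "food consumption", "overweight", "taxes",
        "prices", "price policies", "fiscal policies", "diet"}

# A high-precision, low-recall extractor: few terms, all of them correct.
cautious = {"developing countries", "taxes"}

# A noisy extractor: it finds more of the gold terms, but also bogus ones.
noisy = {"developing countries", "food consumption", "taxes", "prices",
         "growing", "some policy", "address", "options"}

if __name__ == "__main__":
    for name, predicted in [("cautious", cautious), ("noisy", noisy)]:
        p, r = precision_recall(predicted, gold)
        print("%-8s precision=%.2f recall=%.2f" % (name, p, r))
```
</pre>
<p>In this toy data, the cautious extractor returns only correct terms but misses most of the gold set, while the noisy one recovers more terms at the price of bogus ones.</p>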
<p>Peter Turney has done a lot of research on keyphrase extraction, and <a href="http://www.extractor.com/about.aspx">licenses his implementation</a>.</p>
<p>There is a wide academic literature on term extraction, which I won’t summarize here. The best introductory papers are by Park, who is now at IBM:<br />
<a href="http://portal.acm.org/citation.cfm?id=1072370">“Automatic glossary extraction: beyond terminology identification”</a> and<br />
“Glossary extraction and utilization in the information search and delivery system for IBM technical support”. You can read more about how to roll your own <a href="http://stackoverflow.com/questions/1575246/how-do-i-extract-keywords-used-in-text/1575345#1575345">termex implementation here</a>.</p>
<p>More information about the topic is available on the <a href="http://maui-indexer.blogspot.com/">Maui blog</a>.</p>
<hr />
<h2>Code</h2>
<p>Instead of running KEA as a standalone program that reads its input from disk, one might want, for speed, a resident service that keeps the model in memory. Additionally, one might want to call this service from non-Java languages. XML-RPC is a widely supported standard for implementing remote services.</p>
<p>We hereby release <a href="http://github.com/turian/kea-service">KEA service</a> written by Ali Afshar, which turns the <a href="http://www.nzdl.org/Kea/">KEA</a> keyphrase extractor into an XML-RPC service. This allows you to use KEA as a service, calling it from a variety of different programming languages. The code is released under the <a href="http://en.wikipedia.org/wiki/BSD_licenses#3-clause_license_.28.22New_BSD_License.22.29">New BSD License</a>.</p>
<p>Also included in <a href="http://github.com/turian/kea-service/blob/master/README">the documentation</a> is a description of how this Java program was converted into an XML-RPC service.</p>
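<p>As a rough illustration of the resident-service pattern, here is a self-contained Python 3 sketch that uses only the standard library. It does not talk to the KEA service: the method name <tt>extract_keyphrases</tt>, its signature, and the toy extractor behind it are all assumptions made for this example. Consult the README for the methods the real service exposes.</p>
<pre>
```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def extract_keyphrases(text):
    """Hypothetical stand-in for a real extractor: distinct words longer than 8 letters."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    return sorted({w for w in words if len(w) > 8})


def start_service(host="127.0.0.1", port=0):
    """Start the stub XML-RPC service in a background thread; return (server, url)."""
    # port=0 asks the operating system for any free port.
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    # The exposed method name is an assumption, not the KEA service's actual API.
    server.register_function(extract_keyphrases, "extract_keyphrases")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, "http://%s:%d/" % (host, server.server_address[1])


if __name__ == "__main__":
    server, url = start_service()
    try:
        # Any language with an XML-RPC client could make this same call.
        client = ServerProxy(url)
        print(client.extract_keyphrases("Fiscal policies can influence nutrition."))
    finally:
        server.shutdown()
        server.server_close()
```
</pre>
<p>The server process stays resident, so anything expensive to load (such as a trained model) is loaded once, and each client call only pays for the request itself.</p>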

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/08/18/kea-keyphrase-extraction-as-an-xml-rpc-service/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>PyLucene 3.0 in 60 seconds — Tutorial sample code for the 3.0 API</title>
		<link>http://metaoptimize.com/blog/2010/08/09/pylucene-3-0-in-60-seconds-tutorial-sample-code-for-the-3-0-api/</link>
		<comments>http://metaoptimize.com/blog/2010/08/09/pylucene-3-0-in-60-seconds-tutorial-sample-code-for-the-3-0-api/#comments</comments>
		<pubDate>Mon, 09 Aug 2010 16:08:00 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[API]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=135</guid>
		<description><![CDATA[Until there is better documentation for Lucene 3.0, I recommend you use Lucene 2.4 or 2.9. Nonetheless, I provide basic indexing and retrieval code using the PyLucene 3.0 API, perhaps the first such example code on the web.]]></description>
			<content:encoded><![CDATA[
<h2>Summary</h2>
<p>I provide basic indexing and retrieval code using the PyLucene 3.0 API. <a href="http://manning.com/lucene">Lucene In Action (2nd Ed)</a> covers Lucene 3.0, but only its Java code samples have been updated for the 3.0 API, not the PyLucene ones. Unfortunately, there is currently little (no?) example PyLucene code in the blogosphere. If you have links to more Lucene 3.0 tutorials and samples, please share them in the comments.</p>
<p><em>Update 20100810: In light of discussions with others, this post has been substantially rewritten since it was first posted.</em></p>
<hr />
<h2>Background</h2>
<p>Historically, I have found it easy to write basic PyLucene 2.4 (or 2.9?) code. PyLucene includes Lucene In Action code samples ported from Java to Python, and these code samples are correct and easy to adapt. I recently was developing a new project based upon Lucene (<a href="http://github.com/turian/biased-text-sample">biased-text-sample</a>), and I decided to try PyLucene 3.0.2-1. I was surprised to find that the PyLucene code samples in <a href="http://svn.apache.org/viewvc/lucene/pylucene/trunk/samples/LuceneInAction/"><tt>samples/LuceneInAction/</tt></a> are out-of-date, and use the 2.x API. (Note: The code in <a href="http://svn.apache.org/viewvc/lucene/pylucene/trunk/samples/"><tt>samples/*.py</tt></a> appears to be updated to the 3.0 API.)</p>
<p>I could not find any Lucene 3.0 tutorials or code samples on the web, except for this one article:</p>
<ul>
<li>
<a href="http://ikaisays.com/2010/04/24/lucene-in-memory-search-example-now-updated-for-lucene-3-0-1/">Lucene In-Memory Search Example: Now updated for Lucene 3.0.1</a></li>
</ul>
<p>If you have links to more Lucene 3.0 tutorials and samples, please share them in the comments.</p>
<hr />
<h2>Sample PyLucene 3.0 code</h2>
<p>In the spirit of Lingpipe’s <a href="http://lingpipe-blog.com/2009/02/18/lucene-24-in-60-seconds/">Lucene 2.4 in 60 seconds</a>, here are relevant PyLucene 3.0 code snippets from my <a href="http://github.com/turian/biased-text-sample">biased-text-sample</a> project, for indexing and retrieval. </p>
<h3>Indexing</h3>
<pre>import sys

import lucene
from lucene import \
    SimpleFSDirectory, File, \
    Document, Field, StandardAnalyzer, IndexWriter, Version

if __name__ == "__main__":
    # The JVM must be started before any Lucene class is used.
    lucene.initVM()
    indexDir = "/Tmp/REMOVEME.index-dir"
    dir = SimpleFSDirectory(File(indexDir))
    analyzer = StandardAnalyzer(Version.LUCENE_30)
    # create=True starts a fresh index, replacing any existing one in indexDir.
    writer = IndexWriter(dir, analyzer, True, IndexWriter.MaxFieldLength(512))

    print >> sys.stderr, "Currently there are %d documents in the index..." % writer.numDocs()

    # Index each line of stdin as its own document, storing the raw text for retrieval.
    print >> sys.stderr, "Reading lines from sys.stdin..."
    for l in sys.stdin:
        doc = Document()
        doc.add(Field("text", l, Field.Store.YES, Field.Index.ANALYZED))
        writer.addDocument(doc)

    print >> sys.stderr, "Indexed lines from stdin (%d documents in index)" % (writer.numDocs())
    print >> sys.stderr, "About to optimize index of %d documents..." % writer.numDocs()
    writer.optimize()
    print >> sys.stderr, "...done optimizing index of %d documents" % writer.numDocs()
    # Read the count before closing; the writer should not be used after close().
    numDocs = writer.numDocs()
    print >> sys.stderr, "Closing index of %d documents..." % numDocs
    writer.close()
    print >> sys.stderr, "...done closing index of %d documents" % numDocs
</pre>
<h3>Retrieval</h3>
<pre>
import lucene
from lucene import \
    SimpleFSDirectory, File, \
    StandardAnalyzer, IndexSearcher, Version, QueryParser

if __name__ == "__main__":
    lucene.initVM()
    indexDir = "/Tmp/REMOVEME.index-dir"
    dir = SimpleFSDirectory(File(indexDir))
    # Use the same analyzer as at indexing time, so query terms match indexed terms.
    analyzer = StandardAnalyzer(Version.LUCENE_30)
    searcher = IndexSearcher(dir)

    query = QueryParser(Version.LUCENE_30, "text", analyzer).parse("Find this sentence please")
    MAX = 1000  # return at most this many top-scoring hits
    hits = searcher.search(query, MAX)

    print "Found %d document(s) that matched query '%s':" % (hits.totalHits, query)

    for hit in hits.scoreDocs:
        print hit.score, hit.doc, hit.toString()
        # Fetch the stored document by its internal id to get the original text.
        doc = searcher.doc(hit.doc)
        print doc.get("text").encode("utf-8")

    searcher.close()
</pre>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/08/09/pylucene-3-0-in-60-seconds-tutorial-sample-code-for-the-3-0-api/feed/</wfw:commentRss>
		<slash:comments>33</slash:comments>
		</item>
		<item>
		<title>Perhaps job hopping is a good thing?</title>
		<link>http://metaoptimize.com/blog/2010/04/27/perhaps-job-hopping-is-a-good-thing/</link>
		<comments>http://metaoptimize.com/blog/2010/04/27/perhaps-job-hopping-is-a-good-thing/#comments</comments>
		<pubDate>Tue, 27 Apr 2010 22:37:11 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[gen y]]></category>
		<category><![CDATA[social shift]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=118</guid>
		<description><![CDATA[
Summary
I speculate that job hopping, if it becomes a widespread phenomenon, might actually lead to improved business efficiency. In this way, the “Gen Y” job hopping phenomenon could ultimately prove beneficial.

Background

Mark Suster begins the debate by writing: “[Job Hoppers] Make Terrible Employees”.
Paul Dix responds that job hopping is not correlated with employee quality and there [...]]]></description>
			<content:encoded><![CDATA[
<h1>Summary</h1>
<p>I speculate that job hopping, if it becomes a widespread phenomenon, might actually lead to improved business efficiency. In this way, the “Gen Y” job hopping phenomenon could ultimately prove beneficial.</p>
<hr />
<h1>Background</h1>
<ul>
<li>Mark Suster begins the debate by writing: <a href="http://www.bothsidesofthetable.com/2010/04/22/never-hire-job-hoppers-never-they-make-terrible-employees/">“[Job Hoppers] Make Terrible Employees”</a>.</li>
<li>Paul Dix responds that <a href="http://www.pauldix.net/2010/04/why-mark-suster-is-wrong-about-not-hiring-job-hoppers.html">job hopping is not correlated with employee quality</a> and there are many better ways to assess the value of an individual employee than the length of their previous jobs.</li>
<li>Penelope Trunk thinks that <a href="http://blogs.bnet.com/career-advice/?p=811&#038;tag=nl.e713">job hoppers make the best employees</a>, because they are <strong>more</strong> qualified and loyal.</li>
<li>Mark Suster <a href="http://www.bothsidesofthetable.com/2010/04/25/job-hoppers-redux-an-employees-perspective/">replies</a> to Paul Dix, clarifying and defending his original arguments.</li>
<li>Jason Calacanis looks at the overall trend of job hopping, and argues that it is a <a href="http://calacanis.com/2010/04/27/red-jackson-gen-y-loyalty/">negative trait of Gen Y</a>.</li>
<li>Andrew Warner argues that <a href="http://mixergy.com/lets-admit-why-there-are-so-many-job-hoppers-in-startupland/">startup employers might simply be mismanaging expectations</a>.</li>
</ul>
<p>I was discussing these articles today with <a href="http://www.chriskenton.com/">Chris Kenton</a> of <a href="http://www.socialrep.com/">SocialRep</a>.</p>
<hr />
<h1>Issues with Jason Calacanis’s piece</h1>
<p>Jason’s piece has a very crotchety tone, with a “the kids these days are driving society to hell in a handbasket” sort of feel. To wit:</p>
<ul>
<li>“the majority of them seem to lack killer instinct but have excel at entitlement”</li>
<li>“It’s so obvious to me why our country is spiraling like a regional jet piloted by a $9 an hour, 20 year-old pilot with under 1,000 hours of flight time.”</li>
</ul>
<p>These all sound like the sort of criticisms every older generation lobs at younger ones, which makes me immediately skeptical.</p>
<p>Obviously, Jason has had negative experiences with employees who leave after one year. I’m not saying these employees were good. But I think he draws the wrong generalizations and I suspect that the trend of job hopping might ultimately lead to societal and economic good.</p>
<hr />
<h1>Could Job Hopping be beneficial?</h1>
<p>I think there is definitely a social shift that is occurring, but I think this concept of discrete “generations” is a red herring, since the shift is occurring gradually, not as a step function.</p>
<p>Here I’m just going to speculate a bit: If Jason’s prediction is true, and ten years down the road it is common for people to job hop every year until they find a good relationship, it might not be as grim as the old guard predicts. In fact, it could ultimately have beneficial effects. I can understand how this idea is scary to conventional businesses, but since I don’t have extensive industry experience, I have the luxury of having little enough bias to use my imagination about how this might ultimately be beneficial. <img src='http://metaoptimize.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>The real problem with job hopping is the expensive initial cost of integrating a new employee into your organization. Job hopping would not really be so problematic if businesses (and new employees) were set up for people to contribute value immediately. Perhaps employers and employees alike would benefit from businesses restructuring their processes to be more modular and self-contained. This is similar to how it seems initially expensive to design your code so that components are loosely coupled, but ultimately this discipline leads to greater flexibility and easier maintainability. Similarly, structuring your organization and processes in such a way that you can easily add (or remove!) talent can ultimately lead to efficiency. (I make similar comments about <a href="http://metaoptimize.com/blog/2010/03/11/code-maintainability-and-the-joy-of-outsourcing/">outsourcing your code</a>.)</p>
<p>As I said, this idea on my part is purely creative speculation, and I can’t claim I have enough experience to know whether this is true or not. So when it comes to whether job hopping is good (as Paul Dix says) or bad (as Mark Suster and Jason Calacanis say), I have to abstain.</p>
<p>The idea of blind loyalty is an artifact of situations in which the party to which you are loyal (a large corporation, an Army, etc.) is far too large to have a relationship with you. When an actual relationship is possible, that is far preferable to some impersonal loyalty.</p>
<p>Alignment of interests and clear communication are the best way to make any sort of relationship work.</p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/04/27/perhaps-job-hopping-is-a-good-thing/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Code maintainability, and the joy of outsourcing</title>
		<link>http://metaoptimize.com/blog/2010/03/11/code-maintainability-and-the-joy-of-outsourcing/</link>
		<comments>http://metaoptimize.com/blog/2010/03/11/code-maintainability-and-the-joy-of-outsourcing/#comments</comments>
		<pubDate>Thu, 11 Mar 2010 21:26:54 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Outsourcing]]></category>
		<category><![CDATA[project manager]]></category>
		<category><![CDATA[Refactoring]]></category>
		<category><![CDATA[Software engineering]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=106</guid>
		<description><![CDATA[
Summary
According to common wisdom, the best code is developed in-house. I am beginning to believe this is only true when the code must be tightly coupled, or there are realistic security concerns. These scenarios are less common than managers like to believe.
For run-of-the-mill development projects, outsourcing might have advantages above-and-beyond cost savings. If your code [...]]]></description>
			<content:encoded><![CDATA[
<h1>Summary</h1>
<p>According to common wisdom, the best code is developed in-house. I am beginning to believe this is only true when the code must be tightly coupled, or there are realistic security concerns. These scenarios are less common than managers like to believe.</p>
<p>For run-of-the-mill development projects, outsourcing might have advantages above-and-beyond cost savings. <em>If your code effort can be outsourced, you should try it</em>. Not only will it be cheaper, but the final code will be easier to maintain.</p>
<hr />
<h1>Background</h1>
<p>KSplice recently wrote about <a href="http://blog.ksplice.com/2010/03/quadruple-productivity-with-an-intern-army">the best way to manage interns</a>. The take-home point is: <em>“Divide tasks to be as loosely-coupled as possible.”</em></p>
<p>Recently, a commentator on <a href="http://thefunded.com/funds/item/6799">thefunded.com asked</a>:</p>
<blockquote><p>I’ve been working on a deal in which a larger software company would give me a platform they developed so we can build a business around it. The larger company has given up on it.</p>
<p>The key developer of the platform was to be included in the deal. But he’s apparently disgruntled and, literally, has gone postal. (There are serious issues; getting him back isn’t really an option now.)</p>
<p>So we have a platform, without documentation, and without the guy who built it. But it has been launched in public applications and is perfectly functional. Basically we would just be reskinning it and adding in a few new features when we relaunch under the new business.</p></blockquote>
<p>My advice? Try outsourcing.</p>
<hr />
<h1>Try outsourcing</h1>
<p>Here was my advice to this person:</p>
<p>Your goal is to improve the maintainability of your code, so that you can easily find new developers to jump in on your project. Your goal is also to get the code to a point where you are no longer beholden to any developers, and you can easily fire a developer without feeling like you are locked in to them.</p>
<p>My advice is that you find a good project manager to document the code and, more importantly, refactor the codebase to make the components more loosely coupled. This project manager should break the code into pieces and delegate to a handful of <em>independent remote subcontractors who don’t communicate with each other</em>. If independent remote workers can refactor and clean up the code, without communicating with each other, then it means the final code will be easy to maintain. It then follows that an in-house development team should be able to easily jump into the codebase. Or, you could outsource further improvements. Your choice.</p>
<p>Consider that the approach of independent remote developers with little communication is the same approach taken by many open-source projects.</p>
<p>If the project is hard to break into pieces, this is why you need a good project manager. He or she will understand the overall architecture, and see along what lines it is best to create division of responsibilities in the code.</p>
<p>You could choose a single tightly-knit dev team who are in constant communication, but the risk is that they will understand aspects of the code that they don’t document, and that there will be communal wisdom passed around by oral communication. In this case, you are bound to these developers.</p>
<p>What you want is everything written down and easy to pick up by the next guy. So you should force that to be the case in your refactoring process.</p>
<p>Although it might take independent remote developers more time to refactor the codebase than a single tightly-knit development team, the final product will be easier to maintain in the long run. And even though the independent remote coders will incur two or three times as many billable hours as the tightly-knit team, foreign programmers’ hourly rates are one-quarter to one-fifth of domestic rates. So I think it’s a win in terms of both cost and final results.</p>
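<p>To spell out the arithmetic behind that cost claim, here is a small Python 3 sketch. It uses only the rough ranges from the paragraph above (two to three times the hours, at one-quarter to one-fifth of the hourly rate); these are ballpark estimates, not measured data, and the function name is just for illustration.</p>
<pre>
```python
def relative_cost(hours_multiplier, rate_divisor):
    """Cost of the remote team relative to the in-house team (1.0 means equal cost)."""
    return hours_multiplier / rate_divisor


if __name__ == "__main__":
    # Ranges from the paragraph above: 2-3x the hours, at 1/4 to 1/5 of the hourly rate.
    for hours_multiplier in (2, 3):
        for rate_divisor in (4, 5):
            cost = relative_cost(hours_multiplier, rate_divisor)
            print("%dx hours at 1/%d of the rate: %.0f%% of the in-house cost"
                  % (hours_multiplier, rate_divisor, 100 * cost))
```
</pre>
<p>Under these assumptions, the remote option costs between 40% (twice the hours at one-fifth of the rate) and 75% (three times the hours at one-quarter of the rate) of the tightly-knit team.</p>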
<p>Even though I am a hardcore developer myself, I have recently been dabbling in subcontracting to independent developers in Eastern Europe, and have been amazed with the results. It allows me to develop much faster, and it makes my code easier to maintain, because it is impossible to subcontract work unless your code has good separation of concerns and is loosely coupled. I now have built some good relationships with sharp coders who I trust to understand my directions and deliver clean code on time.</p>
<hr />
<p>I sense that I am going to get pushback on this from defensive domestic coders, because it goes against the common wisdom, but I think it is an option worth considering.</p>
<p>Would you share your experiences, positive and negative, with outsourcing?</p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/03/11/code-maintainability-and-the-joy-of-outsourcing/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Lean Startup, and The Stooges</title>
		<link>http://metaoptimize.com/blog/2010/03/10/lean-startup-and-the-stooges/</link>
		<comments>http://metaoptimize.com/blog/2010/03/10/lean-startup-and-the-stooges/#comments</comments>
		<pubDate>Wed, 10 Mar 2010 17:08:46 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=93</guid>
		<description><![CDATA[
Okay, I’m ready.
After reading a handful of articles making tenuous connections between entrepreneurship and music, including:

The Notorious CEO: Ten Startup Commandments from Biggie Smalls
Being like The Sex Pistols can help your startup?

I’ve decided to come out and share my favorite startup music.
Dirt, by The Stooges, is a proto-punk cut that sprawls for seven minutes, brooding [...]]]></description>
			<content:encoded><![CDATA[
<p>Okay, I’m ready.</p>
<p>After reading a handful of articles making tenuous connections between entrepreneurship and music, including:</p>
<ul>
<li><a href="http://themetricsystem.rjmetrics.com/2009/08/10/the-notorious-ceo-ten-startup-commandments-from-biggie-smalls/">The Notorious CEO: Ten Startup Commandments from Biggie Smalls</a></li>
<li><a href="http://blog.smartupz.com/2010/03/being-like-sex-pistols-can-help-your.html">Being like The Sex Pistols can help your startup?</a></li>
</ul>
<p>I’ve decided to come out and share my favorite startup music.</p>
<p>Dirt, by <a href="http://en.wikipedia.org/wiki/The_Stooges">The Stooges</a>, is a <a href="http://www.allmusic.com/cg/amg.dll?p=amg&#038;sql=77:2698">proto-punk</a> cut that sprawls for seven minutes, brooding and smoldering. It never climaxes or burns out; it just persists and drives forward.</p>
<p>Anyway, I believe this song should be the mantra for bootstrappers, in particular those who practice the <a href="http://www.startuplessonslearned.com/">lean</a> <a href="http://groups.google.com/group/lean-startup-circle?pli=1">startup</a> <a href="http://leanstartup.pbworks.com/">methodology</a>.</p>
<blockquote>
<p><i>Ooh, I been dirt / And I don’t care / Cause I’m burning inside / I’m just a yearning inside / And I’m the fire o’ life.</i></p>
</blockquote>
<p>Without further ado, <b>DIRT</b>:</p>
<p><object width="425" height="344"><param name="movie" value="http://www.youtube.com/v/zxYXV2RrwIs&#038;hl=en_US&#038;fs=1&#038;"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/zxYXV2RrwIs&#038;hl=en_US&#038;fs=1&#038;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"></embed></object></p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/03/10/lean-startup-and-the-stooges/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Constitution for Governance of Open-Source Projects (v20100227)</title>
		<link>http://metaoptimize.com/blog/2010/02/27/constitution-for-governance-of-open-source-projects-v20100227/</link>
		<comments>http://metaoptimize.com/blog/2010/02/27/constitution-for-governance-of-open-source-projects-v20100227/#comments</comments>
		<pubDate>Sun, 28 Feb 2010 01:08:09 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Free software]]></category>
		<category><![CDATA[Governance]]></category>
		<category><![CDATA[Open source]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=89</guid>
		<description><![CDATA[
Summary
I propose a default “Constitution for Governance of Open-Source Projects”.

Background
I recently got involved in the OSQA project, which is a fork of CNPROG, which in turn is a clone of the StackExchange Q&#38;A forum software.
Note that the OSQA project has no formal “homepage”, or instructions on how to get involved. I only discovered by chance [...]]]></description>
			<content:encoded><![CDATA[
<h1>Summary</h1>
<p>I propose a default “Constitution for Governance of Open-Source Projects”.</p>
<hr />
<h1>Background</h1>
<p>I recently got involved in the <a href="http://osqa.net/question/2/where-can-i-get-the-source-code-for-osqa" class="broken_link">OSQA</a> project, which is a fork of <a href="http://github.com/cnprog/CNPROG">CNPROG</a>, which in turn is a clone of the <a href="http://stackexchange.com/">StackExchange</a> Q&amp;A forum software.</p>
<p>Note that the OSQA project has no formal “homepage”, or instructions on how to get involved. I only discovered by chance that there is a mailing-list (unarchived) and developer chat room. Nor was it immediately clear which OSQA GitHub fork one should use.</p>
<p>This is because OSQA grew organically from one contributor to a handful, and developer involvement was an afterthought in this project. Not that there is anything wrong with that.<br />
However, now that a handful of people are involved in the project, and <a href="http://osqa.net/questions/unanswered/" class="broken_link">more people are trying to get involved</a>, we have begun discussing governance and decision-making policies on the mailing list. In fact,<br />
<a href="http://nmrwiki.org/">Evgeny Fadeev</a> poses this very question on <a href="http://stackoverflow.com/questions/2328631/how-to-achieve-effective-democratic-governance-for-an-open-source-project">StackOverflow</a>, and proposes some potential answers.</p>
<p>I believe that, by default, there are some simple but clear principles that should be enunciated. I hereby propose my</p>
<h1>Constitution for Governance of Open-Source Projects (v20100227)</h1>
<p>Let it be affirmed that the primary goal in instituting governance of an open-source project be to ensure the long-term health of the project.</p>
<p>Accordingly, the default bias should be towards openness and inclusiveness.<br />
However, policy should be changed as issues present themselves, in order to maintain the long-term health of the project.</p>
<p>For the model of decision making, we favor a “do-ocracy”.<br />
The people who contribute the most generally command the respect of the community.<br />
Alienating them is the best way to derail the project.</p>
<p>The repository should be open to committers, given that commits can easily be reverted and commit-access easily revoked. This is preferable to alienating potential committers.</p>
<p>To ensure transparency for developers new and old, and allow them to decide their involvement in a project based upon the history of the project, there should be transparency and openness in the inner workings of the project. For example, the email archive should be public.</p>
<p>Lastly, let us remember that too much red-tape gets in the way of progress. So red-tape and other barriers to contribution should be avoided, and only added as issues present themselves.</p>
<p>This Constitution can and should be amended as issues present themselves.</p>
<p>Therefore be it resolved.</p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/02/27/constitution-for-governance-of-open-source-projects-v20100227/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Why can’t you pickle generators in Python? A pattern for saving training state</title>
		<link>http://metaoptimize.com/blog/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/</link>
		<comments>http://metaoptimize.com/blog/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/#comments</comments>
		<pubDate>Tue, 22 Dec 2009 08:52:13 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[experimental control]]></category>
		<category><![CDATA[Generator]]></category>
		<category><![CDATA[generators]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[persistance]]></category>
		<category><![CDATA[pickling]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[serialization]]></category>
		<category><![CDATA[training state]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=72</guid>
		<description><![CDATA[
Summary

A pattern for persisting generators is to turn them into pickle-able class objects. This is useful when you use generators for streaming training examples.
I would also try generator_tools, which might be a more convenient alternative to the pattern I describe. I haven’t used it yet.

Generators for streaming training examples
For machine learning, python generators are a [...]]]></description>
			<content:encoded><![CDATA[
<h1>Summary</h1>
<p><a href="http://flickr.com/photos/28402283@N07/3186143355" title="Moon Rise behind the San Gorgonio Pass Wind Farm"><img align=right src="http://farm4.static.flickr.com/3118/3186143355_4840fb7620_t.jpg" /></a></p>
<p>A pattern for persisting generators is to turn them into pickle-able class objects. This is useful when you use generators for streaming training examples.</p>
<p>I would also try <a href="http://www.fiber-space.de/generator_tools/doc/generator_tools.html">generator_tools</a>, which might be a more convenient alternative to the pattern I describe. I haven’t used it yet.</p>
<hr />
<h2>Generators for streaming training examples</h2>
<p>For machine learning, Python <a href="http://www.ibm.com/developerworks/library/l-pycon.html">generators</a> are a simple idiom that makes it easy to generate a stream of training examples. Moreover, you can nest generators:</p>
<ul>
<li>The inner generator can be used to read one example at a time.</li>
<li>The outer generator can be used to read examples from the inner generator until you have a full minibatch, and then yield this minibatch.</li>
</ul>
<p>Here is some example code:</p>
<p>[Update: The example holds without the ALL CAPS magic variable names, “HYPERPARAMETERS”. However, I include HYPERPARAMETERS because I am including the actual code I am using. Hyperparameters are global, read-only variables that specify the particular experimental condition being tested. I can’t say that I have the best solution to this particular aspect of experimental control (hyperparameters). I might write a blog post about it in the future, to solicit feedback on improved methods. However, I have refined my current approach over several years, and I can assure you that it is far less painful than a handful of more “clean” approaches.]</p>
<pre>import common.hyperparameters
# myopen is the author's file-opening helper; this import path is an assumption.
from common.file import myopen

def get_train_example():
    """Yield one training example (a window of word ids) at a time."""
    HYPERPARAMETERS = common.hyperparameters.read("language-model")

    from vocabulary import wordmap
    for l in myopen(HYPERPARAMETERS["TRAIN_SENTENCES"]):
        prevwords = []
        for w in l.split():
            if wordmap.exists(w):
                prevwords.append(wordmap.id(w))
                if len(prevwords) >= HYPERPARAMETERS["WINDOW_SIZE"]:
                    yield prevwords[-HYPERPARAMETERS["WINDOW_SIZE"]:]
            else:
                # An unknown word breaks the window, so start over.
                prevwords = []

def get_train_minibatch():
    """Collect examples from get_train_example() into full minibatches."""
    HYPERPARAMETERS = common.hyperparameters.read("language-model")
    minibatch = []
    for e in get_train_example():
        minibatch.append(e)
        if len(minibatch) >= HYPERPARAMETERS["MINIBATCH SIZE"]:
            assert len(minibatch) == HYPERPARAMETERS["MINIBATCH SIZE"]
            yield minibatch
            minibatch = []
</pre>
<h2>You can’t persist training state by pickling your generators</h2>
<p>However, generators become problematic when you want to persist your experiment’s state in order to later restart training at the same place. Unfortunately, <a href="http://bugs.python.org/issue1092962">you can’t pickle generators in Python</a>. And it can be a bit of a <a href="http://en.wiktionary.org/wiki/pain_in_the_ass">PITA</a> to work around this, in order to save the training state.</p>
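<p>A minimal sketch of the failure, using a hypothetical toy generator <tt>count_up</tt> in place of a real training stream. The helper <tt>try_pickle</tt> is also just for illustration:</p>
<pre>
import pickle

def count_up(n):
    """A trivial generator, standing in for a stream of training examples."""
    for i in range(n):
        yield i

def try_pickle(obj):
    """Return the pickled bytes, or the TypeError raised while pickling."""
    try:
        return pickle.dumps(obj)
    except TypeError as e:
        return e

stream = count_up(3)
next(stream)                   # advance the generator by one example
result = try_pickle(stream)
print(type(result).__name__)   # the name of whatever try_pickle returned
</pre>
<p>On CPython this prints <tt>TypeError</tt>, because the generator’s suspended position cannot be serialized.</p>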
<h2>Pattern to work around this annoyance</h2>
<p>Following useful discussion on <a href="http://groups.google.com/group/pylearn-dev/browse_thread/thread/c4e4dd3496bbbf08">pylearn-dev</a> and stackoverflow <a href="http://stackoverflow.com/questions/1942328/add-a-member-variable-method-to-a-python-generator">[1]</a> <a href="http://stackoverflow.com/questions/1939015/singleton-python-generator-or-pickle-a-python-generator">[2]</a>, I propose the following pattern for converting generators to pickle-able class objects:</p>
<ol>
<li>Convert the generator to a class in which the generator code is the <a href="http://stackoverflow.com/questions/1942328/add-a-member-variable-method-to-a-python-generator/1942387#1942387">__iter__</a> method</li>
<li>Add <a href="http://docs.python.org/library/pickle.html#object.__getstate__">__getstate__</a> and <a href="http://docs.python.org/library/pickle.html#object.__setstate__">__setstate__</a> methods to the class, to handle pickling. Remember that you can’t pickle file objects. So the restored object will have to re-open files, as necessary.</li>
</ol>
<p>Here is the updated code, after applying this pattern:</p>
<pre>
import sys

import common.hyperparameters
# myopen is the author's file-opening helper; this import path is an assumption.
from common.file import myopen

class TrainingExampleStream(object):
    def __init__(self):
        # Set the state variables, in case pickling happens before __iter__ is called.
        self.filename = None
        self.count = 0
        # Number of examples to skip when iteration starts (set by __setstate__).
        self.skip = 0

    def __iter__(self):
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        from vocabulary import wordmap
        self.filename = HYPERPARAMETERS["TRAIN_SENTENCES"]
        # After unpickling, silently replay the examples that were already consumed.
        skip, self.skip = self.skip, 0
        self.count = 0
        for l in myopen(self.filename):
            prevwords = []
            for w in l.split():
                if wordmap.exists(w):
                    prevwords.append(wordmap.id(w))
                    if len(prevwords) >= HYPERPARAMETERS["WINDOW_SIZE"]:
                        self.count += 1
                        if self.count > skip:
                            yield prevwords[-HYPERPARAMETERS["WINDOW_SIZE"]:]
                else:
                    prevwords = []

    def __getstate__(self):
        return self.filename, self.count

    def __setstate__(self, state):
        """
        @warning: We ignore the filename.  If we wanted
        to be really fastidious, we would assume that
        HYPERPARAMETERS["TRAIN_SENTENCES"] might change.  The only
        problem is that if we change filesystems, the filename
        might change just because the base file is in a different
        path. So we issue a warning if the filename is different from what is expected.
        """
        filename, count = state
        sys.stderr.write("__setstate__(%r)...\n" % (state,))
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        expected = HYPERPARAMETERS["TRAIN_SENTENCES"]
        if filename is not None and filename != expected:
            sys.stderr.write("TRAIN_SENTENCES %s != filename given to __setstate__ %s\n" % (expected, filename))
        # Unpickling does not call __init__, so every attribute is set here.
        self.filename = filename
        self.count = count
        # The next __iter__ re-opens the file and skips this many examples.
        self.skip = count
        sys.stderr.write("...__setstate__(%r)\n" % (state,))

class TrainingMinibatchStream(object):
    def __init__(self):
        # Create the example stream up front, so that pickling works before __iter__ is called.
        self.get_train_example = TrainingExampleStream()

    def __iter__(self):
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        minibatch = []
        # Reuse the existing stream, so that a position restored by __setstate__ is kept.
        for e in self.get_train_example:
            minibatch.append(e)
            if len(minibatch) >= HYPERPARAMETERS["MINIBATCH SIZE"]:
                assert len(minibatch) == HYPERPARAMETERS["MINIBATCH SIZE"]
                yield minibatch
                minibatch = []

    def __getstate__(self):
        return (self.get_train_example.__getstate__(),)

    def __setstate__(self, state):
        """
        @warning: We ignore the filename.
        """
        self.get_train_example = TrainingExampleStream()
        self.get_train_example.__setstate__(state[0])
</pre>
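<p>Here is the same pattern in a self-contained form that runs without my hyperparameter and vocabulary modules. This is only a sketch: <tt>ExampleStream</tt> and its in-memory list of items are hypothetical stand-ins for the training file, and the restored stream resumes by skipping the items that were consumed before pickling.</p>
<pre>
import pickle

class ExampleStream(object):
    """Picklable stand-in for a generator over a list of items."""
    def __init__(self, items):
        self.items = items
        self.count = 0   # number of items yielded so far
        self.skip = 0    # number of items to skip when iteration starts

    def __iter__(self):
        skip, self.skip = self.skip, 0
        self.count = 0
        for item in self.items:
            self.count += 1
            if self.count > skip:
                yield item

    def __getstate__(self):
        return self.items, self.count

    def __setstate__(self, state):
        # Unpickling does not call __init__, so set every attribute here.
        self.items, self.count = state
        self.skip = self.count

stream = ExampleStream(["a", "b", "c", "d"])
it = iter(stream)
first_two = [next(it), next(it)]

# Save the state mid-stream, restore it, and keep reading.
restored = pickle.loads(pickle.dumps(stream))
rest = list(restored)
print(first_two, rest)
</pre>
<p>The restored stream continues with the third item, so nothing is repeated and nothing is lost. The cost is that skipping takes time proportional to the number of examples already consumed.</p>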

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Use flag --xml when you run mysqldump</title>
		<link>http://metaoptimize.com/blog/2009/10/14/use-flag-xml-when-you-run-mysqldump/</link>
		<comments>http://metaoptimize.com/blog/2009/10/14/use-flag-xml-when-you-run-mysqldump/#comments</comments>
		<pubDate>Wed, 14 Oct 2009 22:40:17 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[JSON]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[mysqldump]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=60</guid>
		<description><![CDATA[
Summary:

If you have text data (like a web scrape) stored in a MySQL database, and you want to share the data, mysqldump to XML using the --xml flag.

When fields are unlikely to contain tabs, an even simpler format is a tab-separated file, created using the --tab=path flag to mysqldump. path must be owned by the [...]]]></description>
			<content:encoded><![CDATA[
<h1>Summary:</h1>
<p><a href="http://flickr.com/photos/24030756@N05/2649856228" title="psychogenic womb memory-gemini project"><img src="http://farm4.static.flickr.com/3223/2649856228_61b5405cfa_t.jpg" align="right"></a></p>
<p>If you have text data (like a <a class="zem_slink" href="http://en.wikipedia.org/wiki/Screen_scraping" title="Screen scraping" rel="wikipedia">web scrape</a>) stored in a <a class="zem_slink" href="http://www.mysql.com" title="MySQL" rel="homepage">MySQL</a> database, and you want to share the data, mysqldump to <a class="zem_slink" href="http://en.wikipedia.org/wiki/XML" title="XML" rel="wikipedia">XML</a> using the <tt>--xml</tt> flag.</p>
<p>When fields are unlikely to contain tabs, an even simpler format is a tab-separated file, created using the <tt>--tab=path</tt> flag to mysqldump. <tt>path</tt> must be owned by the MySQL database user.
</p>
<h1>The Problem with the standard MySQL dump format</h1>
<p>The standard MySQL dump looks as follows:</p>
<pre><code>INSERT INTO `sources` VALUES (1,'2009-03-07 22:06:36','"You\'ve got to be kidding me"', ...
</code></pre>
<p>The problem is that the standard dump format is difficult to interact with programmatically.</p>
<p>It is difficult to parse using <a class="zem_slink" href="http://en.wikipedia.org/wiki/Regular_expression" title="Regular expression" rel="wikipedia">regular expressions</a> because you cannot merely search for single quotes. You have to search for single quotes that are not preceded by a <a href="http://en.wikipedia.org/wiki/Backslash">backslash</a> (unless, perhaps, that backslash is preceded by a backslash).</p>
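<p>To illustrate, here is a rough sketch of a regular expression that does respect backslash escapes. The names <tt>SQL_STRING</tt>, <tt>ESCAPES</tt>, <tt>unescape</tt> and <tt>sql_strings</tt> are mine, and the table of escapes is an assumed subset, so treat this as a demonstration of the problem rather than a dump parser:</p>
<pre><code>import re

# A single-quoted SQL string: a run of ordinary characters or backslash escapes.
SQL_STRING = re.compile(r"'((?:[^'\\]|\\.)*)'", re.DOTALL)

# Backslash escapes that stand for a different character (assumed subset).
ESCAPES = {"0": "\0", "n": "\n", "r": "\r", "t": "\t", "Z": "\x1a"}

def unescape(text):
    """Replace each backslash escape with the character it stands for."""
    return re.sub(r"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(1)),
                  text, flags=re.DOTALL)

def sql_strings(statement):
    """Return the unescaped contents of every single-quoted string in statement."""
    return [unescape(s) for s in SQL_STRING.findall(statement)]

line = r"""INSERT INTO `sources` VALUES (1,'2009-03-07 22:06:36','"You\'ve got to be kidding me"');"""
for value in sql_strings(line):
    print(value)
</code></pre>
<p>This recovers the two quoted values from the example line, but it says nothing about numbers, NULLs, or where one row ends and the next begins, which is why I would rather not parse this format by hand.</p>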
<p>Also, there are no libraries for reading the standard dump format, nor scripts for converting it into a standard format like <a class="zem_slink" href="http://en.wikipedia.org/wiki/JSON" title="JSON" rel="wikipedia">JSON</a> or XML. I asked <a href="http://www.google.com/search?q=mysql+dump+library&amp;hl=en">the oracle</a> as well as <a href="http://stackoverflow.com/questions/1568838/library-to-read-a-mysql-dump">stackoverflow</a>.</p>
<p>So if you receive a MySQL dump in the standard format, you might have to install MySQL and import the dump to get at your data.</p>
<h1>The tabbed MySQL dump format</h1>
<p>You can create a directory with one file per table, and the table will be one-row-per-line, with <a class="zem_slink" href="http://en.wikipedia.org/wiki/Delimiter-separated_values" title="Delimiter-separated values" rel="wikipedia">tab-separated values</a>:</p>
<pre><code>mysqldump --tab=path database</code></pre>
<p>Here is some example output:</p>
<pre><code>1	2009-03-07 22:06:36	"You've got to be kidding me"</code></pre>
<p>If you get an error of the following form when you issue the mysqldump command:</p>
<pre><code>mysqldump: Got error: 1: Can't create/write to file 'path/database.txt' (Errcode: 13) when executing 'SELECT INTO OUTFILE'</code></pre>
<p>You can resolve this complaint by making sure that the output directory (<tt>path</tt> above) is owned by the mysql user (and also writeable by the current Unix user). Thanks <a href="http://forums.mysql.com/read.php?35,172714,172766#msg-172766">JinRong Ye</a>!</p>
<p>This format is convenient if none of your data contains tabs. In <a class="zem_slink" href="http://en.wikipedia.org/wiki/Natural_language_processing" title="Natural language processing" rel="wikipedia">NLP</a>, however, it is quite possible that your text will contain tabs.</p>
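<p>Reading such a file back takes only a few lines. A minimal sketch, assuming the data contains no tabs; the function name <tt>read_tab_dump</tt> is hypothetical, and backslash escapes in the data are not decoded:</p>
<pre><code>import csv

def read_tab_dump(lines):
    """Yield each row of a tab-separated dump as a list of strings."""
    # QUOTE_NONE: double quotes in the data are ordinary characters, not field quoting.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row

sample = ['1\t2009-03-07 22:06:36\t"You\'ve got to be kidding me"\n']
for row in read_tab_dump(sample):
    print(row)
</code></pre>
<p>In practice you would pass an open file for one of the per-table files instead of the <tt>sample</tt> list.</p>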
<h1>The XML MySQL dump format</h1>
<p>Enter the XML MySQL dump format:</p>
<pre><code>        &lt;table_data name="sources"&gt;
        &lt;row&gt;
                &lt;field name="id"&gt;1&lt;/field&gt;
                &lt;field name="created_at"&gt;2009-03-07 22:06:36&lt;/field&gt;
                &lt;field name="text"&gt;&amp;quot;You've got to be kidding me&amp;quot;&lt;/field&gt;
</code></pre>
<p>Ah… pure bliss. You can get the XML dump format as follows:</p>
<pre><code>mysqldump --xml database</code></pre>
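<p>The XML dump can then be read with a stock XML parser. Below is a sketch using Python’s <tt>xml.etree.ElementTree</tt>. The function name <tt>iter_rows</tt> is hypothetical, and the code relies only on the <tt>table_data</tt>, <tt>row</tt> and <tt>field</tt> elements shown above:</p>
<pre><code>import xml.etree.ElementTree as ET

def iter_rows(source):
    """Yield (table name, {field name: text}) for each row of an XML dump.

    source is a filename or a file object.
    """
    table = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "table_data":
            # Remember which table the following rows belong to.
            table = elem.get("name")
        elif event == "end" and elem.tag == "row":
            fields = dict((f.get("name"), f.text) for f in elem.findall("field"))
            yield table, fields
            elem.clear()   # release the row, so that large dumps stay cheap
</code></pre>
<p>For example, <tt>for table, row in iter_rows("dump.xml")</tt> walks every row of a dump saved under the (hypothetical) name <tt>dump.xml</tt>. The parser undoes the XML escaping, so quotes and tabs in the text come back intact.</p>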

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2009/10/14/use-flag-xml-when-you-run-mysqldump/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Automatically sorting graph curves</title>
		<link>http://metaoptimize.com/blog/2009/09/17/automatically-sorting-graph-curves/</link>
		<comments>http://metaoptimize.com/blog/2009/09/17/automatically-sorting-graph-curves/#comments</comments>
		<pubDate>Thu, 17 Sep 2009 22:16:09 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[gnuplot]]></category>
		<category><![CDATA[Heuristics]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=29</guid>
		<description><![CDATA[A script for automatically sorting graph curves, e.g. for gnuplot.]]></description>
			<content:encoded><![CDATA[
<h1>Summary</h1>
<p>A script for automatically sorting graph curves, e.g. for <a class="zem_slink" href="http://www.gnuplot.info/" title="Gnuplot" rel="homepage">gnuplot</a>.</p>
<h1>Problem</h1>
<p>When you have a bunch of curves, and you plot them in an arbitrary order, you might get the following:</p>
<p><img src="http://blog.metaoptimize.com/wp-content/uploads/2009/09/example1-unsorted.png"></p>
<p>Typically, you want to sort the graphs in what appears to be visually descending order, as follows:</p>
<p><img src="http://blog.metaoptimize.com/wp-content/uploads/2009/09/example1-sorted.png"></p>
<p>Sorting the curves is usually done manually, by eyeballing the curves. However, manual sorting of graph curves can become tedious. And when some curves don’t go out as far on the x-axis, it can be even trickier to place these short curves. (Some curves might be short if their experimental runs train more slowly.)</p>
<h1>Heuristic approach</h1>
<p>An automatic heuristic sorting approach is as follows:</p>
<ul>
<li>We maintain a sorted list of curves, from highest to lowest. The sorted list is initialized to empty.
</li>
<li>At each iteration, we find the curve that goes the furthest out on the x-axis, but is not yet in the sorted list. We then will choose where to insert it into the sorted list.
<ul>
<li>For this curve and all curves in the sorted list, we want an estimate of the curve value at the current curve’s furthest x-value. We compute this estimate using a <a class="zem_slink" href="http://en.wikipedia.org/wiki/Moving_average" title="Moving average" rel="wikipedia">moving average</a>. (For this reason, all curves should have aligned x-axis steps, and should have equidistant x-axis steps.)</li>
<li>We place this curve into the sorted list, to minimize the number of rank errors of curve estimates at this x-value.</li>
</ul>
</li>
</ul>
<p>And that’s it!</p>
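<p>The heuristic above can be sketched in a few lines of Python. This is an illustration and not the script itself: the names <tt>trailing_mean</tt> and <tt>sort_curves</tt> are hypothetical, a plain mean over the last few points stands in for the moving average, and all curves are assumed to start at the same x-value with aligned steps.</p>
<pre><code>def trailing_mean(curve, x_max, window=3):
    """Mean of the last `window` y-values at or before x_max."""
    ys = [y for x, y in curve if x_max >= x]
    recent = ys[-window:]
    return sum(recent) / float(len(recent))

def sort_curves(curves, window=3):
    """curves maps a name to a list of (x, y) pairs sorted by x.

    Returns the names ordered from visually highest to lowest.
    """
    # Consider the curves that go furthest out on the x-axis first.
    pending = sorted(curves, key=lambda name: curves[name][-1][0], reverse=True)
    ordered = []
    for name in pending:
        x_max = curves[name][-1][0]
        value = trailing_mean(curves[name], x_max, window)
        placed = [trailing_mean(curves[n], x_max, window) for n in ordered]

        def rank_errors(pos):
            # Curves ranked above this one that are lower, plus
            # curves ranked below this one that are higher.
            above_but_lower = sum(1 for v in placed[:pos] if value > v)
            below_but_higher = sum(1 for v in placed[pos:] if v > value)
            return above_but_lower + below_but_higher

        best = min(range(len(ordered) + 1), key=rank_errors)
        ordered.insert(best, name)
    return ordered

curves = {
    "a": [(1, 5.0), (2, 5.0), (3, 5.0)],
    "b": [(1, 1.0), (2, 1.0), (3, 1.0)],
    "c": [(1, 3.0), (2, 3.0)],          # a short curve
}
print(sort_curves(curves))
</code></pre>
<p>Here the short curve <tt>c</tt> is compared against the other two at its own last x-value, so it lands between them.</p>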
<h1>Example output</h1>
<p>Here is the sorted output of a larger, more difficult example, sorted using the above heuristic. Click on this image to get a larger version you can inspect:<br />
<a href="http://blog.metaoptimize.com/wp-content/uploads/2009/09/example2-sorted.png"><img src="http://blog.metaoptimize.com/wp-content/uploads/2009/09/example2-sorted-small.png"></a><br />
A few of the decisions aren’t good. For example, why is curve 15 placed above curve 6? But most of the decisions are reasonable. For example, curve 13 is placed at the bottom, because it is very low compared to the other curves for the short duration that curve 13 is present.</p>
<h1>Code</h1>
<p>I have written a script implementing the heuristic above.</p>
<p>Here is the latest version of <a href="http://github.com/turian/common-scripts/blob/master/sort-curves.py">sort-curves.py</a>.<br />
You will also need <a href="http://github.com/turian/common/blob/master/movingaverage.py">movingaverage.py</a> from my <a class="zem_slink" href="http://www.python.org/" title="Python (programming language)" rel="homepage">Python</a> common library.</p>
<p>USAGE:</p>
<pre><code>./sort-curves.py *.dat
</code></pre>
<p>where every *.dat is in standard (gnuplot) two-column-per-line format:</p>
<pre><code>xvalue yvalue
</code></pre>
<p>Overall, I find this script a useful timesaver.</p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2009/09/17/automatically-sorting-graph-curves/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
