<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MetaOptimize &#187; machine learning</title>
	<atom:link href="http://metaoptimize.com/blog/tag/machine-learning/feed/" rel="self" type="application/rss+xml" />
	<link>http://metaoptimize.com/blog</link>
	<description>building machine learning and natural language processing tools</description>
	<lastBuildDate>Wed, 08 Sep 2010 07:40:21 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=abc</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Free consultation on data strategy (NLP, ML, business intelligence, etc.)</title>
		<link>http://metaoptimize.com/blog/2010/08/20/free-consultation-on-data-strategy-nlp-ml-business-intelligence-etc/</link>
		<comments>http://metaoptimize.com/blog/2010/08/20/free-consultation-on-data-strategy-nlp-ml-business-intelligence-etc/#comments</comments>
		<pubDate>Fri, 20 Aug 2010 18:22:23 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[artificial intelligence]]></category>
		<category><![CDATA[BI]]></category>
		<category><![CDATA[business intelligence]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[large datasets]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[ML]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[statistical modeling]]></category>
		<category><![CDATA[text analysis]]></category>
		<category><![CDATA[web as corpus]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=183</guid>
		<description><![CDATA[

Summary
Email me your pitch and how you need help monetizing data.
If I like your pitch, I’ll give you a free consultation on data strategy (NLP, ML, business intelligence, etc.)
Afterwards, if we both think that I can add value to your business, we can talk about a longer-term relationship.
You should forward this blog post to any [...]]]></description>
			<content:encoded><![CDATA[
<h2>Summary</h2>
<p><a href="mailto:joseph at metaoptimize dot com">Email me</a> your pitch and how you need help monetizing data.<br />
If I like your pitch, I’ll give you a free consultation on data strategy (NLP, ML, business intelligence, etc.)<br />
Afterwards, if we both think that I can add value to your business, we can talk about a longer-term relationship.</p>
<p>You should forward this blog post to any friend who could use this information.</p>
<hr />
<h2>What is data strategy?</h2>
<p>Do you know how to monetize the data you have? How can you improve monetization using other data available to you? How do you transform your data into actionable business intelligence?</p>
<p>I can help you shape your <b>data strategy</b>, your long-term plan for how your business will capture, process, and monetize data.  For example, data strategy can help you in the following circumstances:</p>
<ul>
<li>You don’t know who your individual users are or what they want, so you can’t effectively target ads.</li>
<li>You don’t know what user behavior on your site to track.</li>
<li>You don’t know what information you should start scraping from the web, information which you could use months or years down the line.</li>
</ul>
<p>Besides working backwards from your business goals and business assets to a viable data strategy, I can also help you with more concrete challenges in NLP and machine learning:</p>
<ul>
<li>How do I improve my search engine so that users don’t miss out on relevant results?</li>
<li>How do I add or improve recommendations, to connect users with what they want?</li>
<li>How do I scale this ML algorithm to billions of examples with millions of features?</li>
<li>How do I improve the accuracy of this NLP or ML tool?</li>
</ul>
<hr />
<h2><a name="whoami"></a>Who am I?</h2>
<p>My name is Joseph Turian, and I head MetaOptimize LLC. We consult on NLP, ML, and data strategy. We also run the <a href="http://metaoptimize.com/qa/">MetaOptimize Q&amp;A site</a>, where ML and NLP experts share their knowledge.</p>
<ul>
<li>I am a data expert, holding a Ph.D. in natural language processing and machine learning. I have a decade of experience in these topics. I specialize in <b>large data sets</b>.</li>
<li>I’m <b>business-minded</b>, so I focus on business goals and the most direct path of execution to achieve these goals.</li>
<li>I am also a <b>technology generalist</b> who has been hacking since age 10 and has programmed competitively at a world-class level.</li>
</ul>
<p>References from clients past and present available upon request.</p>
<hr />
<h2>What is the offer?</h2>
<p>You send me information about what you’re doing and why you think I can help you.<br />
<i>Bonus points</i> if you send me your deck, so I can understand your entire business picture. You are asking me to invest valuable expertise and potentially IP in your company, so appeal to me as a potential investor.<br />
<i>Demerits</i> if you send me an NDA prematurely. Uptight companies who think what they are doing isn’t protected by good execution are a turn-off. But if you must be all James Bond about it, I’ll still consider you.</p>
<p>If I like what you’re doing and I can budget time, we schedule a meeting (in person or over Skype) and I’ll give you a free consultation on what you’re doing.</p>
<p>If the initial meeting goes well, and we both see how I can add value to your business, we can decide to continue working together. I can continue to help you in any of these ways:</p>
<ul>
<li>Advising you periodically about your data strategy.</li>
<li>Building you new tools to use in your product.</li>
<li>Licensing you tools I’ve already built.</li>
<li>Training your smart tech geeks on NLP and ML technology, so that you can build in-house.</li>
</ul>
<p>Compensation accepted in the form of cash, equity, or a mix of both. Pro bono if you’re an awesome non-profit.</p>
<hr />
<h2>Why am I doing this?</h2>
<ul>
<li>More deals are always good.</li>
<li>I am a social hacker, and enjoy connecting and sharing with other entrepreneurs. I want to meet some more excellent people.</li>
<li>I would like to improve my understanding of widespread challenges and pain points in data strategy. That way, I can build a product that is useful for many people.</li>
<li>This is an interesting social business experiment.</li>
</ul>
<hr />
<h2>Who is this offer for?</h2>
<ul>
<li>Open-source projects looking to use NLP + ML to improve their users’ experience.</li>
<li>Unfunded startups with a promising team, product, and market.</li>
<li>Funded startups.</li>
<li>Established companies.</li>
</ul>
<hr />
<h2>What are you waiting for?</h2>
<p><a href="mailto:joseph at metaoptimize dot com">Email me</a> your pitch and how you need help monetizing data.<br />
Or forward this blog post to a friend who could use this information.</p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/08/20/free-consultation-on-data-strategy-nlp-ml-business-intelligence-etc/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Why can’t you pickle generators in Python? A pattern for saving training state</title>
		<link>http://metaoptimize.com/blog/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/</link>
		<comments>http://metaoptimize.com/blog/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/#comments</comments>
		<pubDate>Tue, 22 Dec 2009 08:52:13 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[experimental control]]></category>
		<category><![CDATA[Generator]]></category>
		<category><![CDATA[generators]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[persistance]]></category>
		<category><![CDATA[pickling]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[serialization]]></category>
		<category><![CDATA[training state]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=72</guid>
		<description><![CDATA[

Summary

A pattern for persisting generators is to turn them into pickle-able class objects. This is useful when you use generators for streaming training examples.
I would also try generator_tools, which might be a more convenient alternative to the pattern I describe. I haven’t used it yet.

Generators for streaming training examples
For machine learning, python generators are a [...]]]></description>
			<content:encoded><![CDATA[
<h2>Summary</h2>
<p><a href="http://flickr.com/photos/28402283@N07/3186143355" title="Moon Rise behind the San Gorgonio Pass Wind Farm"><img align=right src="http://farm4.static.flickr.com/3118/3186143355_4840fb7620_t.jpg" /></a></p>
<p>A pattern for persisting generators is to turn them into pickle-able class objects. This is useful when you use generators for streaming training examples.</p>
<p>I would also try <a href="http://www.fiber-space.de/generator_tools/doc/generator_tools.html">generator_tools</a>, which might be a more convenient alternative to the pattern I describe. I haven’t used it yet.</p>
<hr />
<h2>Generators for streaming training examples</h2>
<p>For machine learning, Python <a href="http://www.ibm.com/developerworks/library/l-pycon.html">generators</a> are a simple idiom that makes it easy to generate a stream of training examples. Moreover, you can nest generators:</p>
<ul>
<li>The inner generator can be used to read one example at a time.</li>
<li>The outer generator can be used to read examples from the inner generator until you have a full minibatch, and then yield this minibatch.</li>
</ul>
<p>Here is some example code:</p>
<p>[Update: The example holds without the ALL CAPS magic variable names, “HYPERPARAMETERS”. However, I include HYPERPARAMETERS because I am including the actual code I am using. Hyperparameters are global, read-only variables that specify the particular experimental condition being tested. I can’t say that I have the best solution to this particular aspect of experimental control (hyperparameters). I might write a blog post about it in the future, to solicit feedback on improved methods. However, I have refined my current approach over several years, and I can assure you that it is far less painful than a handful of more “clean” approaches.]</p>
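<pre>import common.hyperparameters    # my own module (not shown in this post)
# myopen is assumed to be a file-opening helper; its import is not shown in this post.

def get_train_example():
    """Inner generator: yield one window of word ids at a time."""
    HYPERPARAMETERS = common.hyperparameters.read("language-model")

    from vocabulary import wordmap
    for l in myopen(HYPERPARAMETERS["TRAIN_SENTENCES"]):
        prevwords = []
        for w in l.split():
            if wordmap.exists(w):
                prevwords.append(wordmap.id(w))
                if len(prevwords) >= HYPERPARAMETERS["WINDOW_SIZE"]:
                    yield prevwords[-HYPERPARAMETERS["WINDOW_SIZE"]:]
            else:
                # An out-of-vocabulary word breaks the context window.
                prevwords = []

def get_train_minibatch():
    """Outer generator: collect examples until a minibatch is full, then yield it."""
    HYPERPARAMETERS = common.hyperparameters.read("language-model")
    minibatch = []
    for e in get_train_example():
        minibatch.append(e)
        if len(minibatch) >= HYPERPARAMETERS["MINIBATCH SIZE"]:
            assert len(minibatch) == HYPERPARAMETERS["MINIBATCH SIZE"]
            yield minibatch
            minibatch = []
    # A trailing partial minibatch, if any, is dropped.
</pre>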
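<p>The nesting can also be tried in isolation. The following is a minimal, self-contained sketch that assumes an in-memory list of sentences and a plain dict as the vocabulary. The names <code>examples</code>, <code>minibatches</code>, <code>sentences</code>, and <code>vocab</code> are hypothetical stand-ins for the file, wordmap, and hyperparameters used above:</p>

```python
def examples(sentences, known_words, window_size):
    """Inner generator: yield one window of word ids at a time."""
    for sentence in sentences:
        prevwords = []
        for w in sentence.split():
            if w in known_words:
                prevwords.append(known_words[w])
                if len(prevwords) >= window_size:
                    yield prevwords[-window_size:]
            else:
                # An unknown word breaks the context window.
                prevwords = []


def minibatches(example_stream, minibatch_size):
    """Outer generator: group examples into full minibatches."""
    minibatch = []
    for e in example_stream:
        minibatch.append(e)
        if len(minibatch) >= minibatch_size:
            yield minibatch
            minibatch = []
    # A trailing partial minibatch is dropped, as in the code above.


# Toy data, made up for illustration only.
sentences = ["the cat sat on the mat", "the dog sat"]
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

batches = list(minibatches(examples(sentences, vocab, window_size=2), minibatch_size=2))
for batch in batches:
    print(batch)
```

<p>Because both layers are lazy, the outer generator only pulls as many examples from the inner one as it needs to fill the next minibatch.</p>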
<h2>You can’t persist training state by pickling your generators</h2>
<p>However, generators become problematic when you want to persist your experiment’s state in order to later restart training at the same place. Unfortunately, <a href="http://bugs.python.org/issue1092962">you can’t pickle generators in Python</a>. And it can be a bit of a <a href="http://en.wiktionary.org/wiki/pain_in_the_ass">PITA</a> to work around this, in order to save the training state.</p>
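<p>A quick way to see the failure, as a sketch that assumes CPython (the generator <code>count_up</code> is a made-up example): asking pickle to serialize a live generator raises a <code>TypeError</code>.</p>

```python
import pickle


def count_up():
    """A trivial generator with internal state (the current value of n)."""
    n = 0
    while True:
        yield n
        n += 1


gen = count_up()
next(gen)  # advance once, so the generator is suspended mid-execution

try:
    pickle.dumps(gen)
    pickled_ok = True
except TypeError as err:
    # CPython refuses to serialize the suspended frame.
    pickled_ok = False
    print("pickle refused the generator: %s" % err)
```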
<h2>Pattern to work around this annoyance</h2>
<p>Following useful discussion on <a href="http://groups.google.com/group/pylearn-dev/browse_thread/thread/c4e4dd3496bbbf08">pylearn-dev</a> and Stack Overflow <a href="http://stackoverflow.com/questions/1942328/add-a-member-variable-method-to-a-python-generator">[1]</a> <a href="http://stackoverflow.com/questions/1939015/singleton-python-generator-or-pickle-a-python-generator">[2]</a>, I propose the following pattern for converting generators to pickle-able class objects:</p>
<ol>
<li>Convert the generator to a class in which the generator code is the <a href="http://stackoverflow.com/questions/1942328/add-a-member-variable-method-to-a-python-generator/1942387#1942387">__iter__</a> method</li>
<li>Add <a href="http://docs.python.org/library/pickle.html#object.__getstate__">__getstate__</a> and <a href="http://docs.python.org/library/pickle.html#object.__setstate__">__setstate__</a> methods to the class, to handle pickling. Remember that you can’t pickle file objects. So the restored object will have to re-open files, as necessary.</li>
</ol>
<p>Here is the updated code, after applying this pattern:</p>
<pre>
import sys
import common.hyperparameters    # my own module (not shown in this post)
# myopen is assumed to be a file-opening helper; its import is not shown in this post.

class TrainingExampleStream(object):
    def __init__(self):
        # Set the state variables, in case pickling happens before __iter__ is called.
        self.filename = None
        self.count = 0    # examples produced so far in the current pass
        self.skip = 0     # examples to fast-forward past after unpickling

    def __iter__(self):
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        from vocabulary import wordmap
        self.filename = HYPERPARAMETERS["TRAIN_SENTENCES"]
        # Consume the pending fast-forward, so that later passes start from the top.
        skip, self.skip = self.skip, 0
        self.count = 0
        for l in myopen(self.filename):
            prevwords = []
            for w in l.split():
                if wordmap.exists(w):
                    prevwords.append(wordmap.id(w))
                    if len(prevwords) >= HYPERPARAMETERS["WINDOW_SIZE"]:
                        self.count += 1
                        # Replay silently until we are past the saved position.
                        if self.count > skip:
                            yield prevwords[-HYPERPARAMETERS["WINDOW_SIZE"]:]
                else:
                    prevwords = []

    def __getstate__(self):
        return self.filename, self.count

    def __setstate__(self, state):
        """
        @warning: We ignore the filename.  If we wanted
        to be really fastidious, we would assume that
        HYPERPARAMETERS["TRAIN_SENTENCES"] might change.  The only
        problem is that if we change filesystems, the filename
        might change just because the base file is in a different
        path. So we issue a warning if the filename is different from what is expected.
        """
        filename, count = state
        # HYPERPARAMETERS must be read here too: it is not a global in this scope.
        expected = common.hyperparameters.read("language-model")["TRAIN_SENTENCES"]
        if filename is not None and filename != expected:
            sys.stderr.write("filename given to __setstate__ %s != expected filename %s\n"
                             % (filename, expected))
        # pickle does not call __init__ when loading, so set every attribute here.
        self.filename = expected
        self.count = count
        # The file is re-opened and fast-forwarded by the next call to __iter__.
        self.skip = count

class TrainingMinibatchStream(object):
    def __init__(self):
        # Create the inner stream up front, so that __getstate__ works before __iter__ is called.
        self.get_train_example = TrainingExampleStream()

    def __iter__(self):
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        minibatch = []
        # Reuse the inner stream: after unpickling, it resumes at the saved position.
        for e in self.get_train_example:
            minibatch.append(e)
            if len(minibatch) >= HYPERPARAMETERS["MINIBATCH SIZE"]:
                assert len(minibatch) == HYPERPARAMETERS["MINIBATCH SIZE"]
                yield minibatch
                minibatch = []

    def __getstate__(self):
        return (self.get_train_example.__getstate__(),)

    def __setstate__(self, state):
        """
        @warning: We ignore the filename.
        """
        self.get_train_example = TrainingExampleStream()
        self.get_train_example.__setstate__(state[0])
</pre>
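<p>To check the round trip without any training data, here is a minimal, self-contained sketch of the same idea. It assumes an in-memory list instead of a file; the class <code>CountingStream</code> and its <code>skip</code> attribute are hypothetical and exist only for illustration. The object records how many items it has produced, and a restored copy skips that many before yielding again:</p>

```python
import pickle


class CountingStream(object):
    """Hypothetical stand-in for a training stream over an in-memory list."""

    def __init__(self, items):
        self.items = items
        self.count = 0   # items produced so far in the current pass
        self.skip = 0    # items to fast-forward past after unpickling

    def __iter__(self):
        # Consume the pending fast-forward, so later passes start from the top.
        skip, self.skip = self.skip, 0
        self.count = 0
        for item in self.items:
            self.count += 1
            if self.count > skip:
                yield item

    def __getstate__(self):
        # Only plain data is saved; the running generator itself is not.
        return self.items, self.count

    def __setstate__(self, state):
        self.items, self.count = state
        self.skip = self.count


stream = CountingStream(["a", "b", "c", "d", "e"])
it = iter(stream)
seen = [next(it), next(it)]                     # consume "a" and "b"

restored = pickle.loads(pickle.dumps(stream))   # pickle the object, not the generator
rest = list(restored)
print(seen, rest)
```

<p>The price of this design is that resuming replays the stream from the beginning up to the saved position, which can be slow for a large training file. In exchange, the pickled state stays tiny and contains no file handles.</p>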

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
