<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MetaOptimize &#187; training state</title>
	<atom:link href="http://metaoptimize.com/blog/tag/training-state/feed/" rel="self" type="application/rss+xml" />
	<link>http://metaoptimize.com/blog</link>
	<description>building machine learning and natural language processing tools</description>
	<lastBuildDate>Mon, 23 May 2011 01:16:49 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=abc</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Why can’t you pickle generators in Python? A pattern for saving training state</title>
		<link>http://metaoptimize.com/blog/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/</link>
		<comments>http://metaoptimize.com/blog/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/#comments</comments>
		<pubDate>Tue, 22 Dec 2009 08:52:13 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[experimental control]]></category>
		<category><![CDATA[Generator]]></category>
		<category><![CDATA[generators]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[persistance]]></category>
		<category><![CDATA[pickling]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[serialization]]></category>
		<category><![CDATA[training state]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=72</guid>
		<description><![CDATA[

Summary

A pattern for persisting generators is to turn them into pickle-able class objects. This is useful when you use generators for streaming training examples.
I would also try generator_tools, which might be a more convenient alternative to the pattern I describe. I haven’t used it yet.

Generators for streaming training examples
For machine learning, python generators are a [...]]]></description>
			<content:encoded><![CDATA[
<h1>Summary</h1>
<p><a href="http://flickr.com/photos/28402283@N07/3186143355" title="Moon Rise behind the San Gorgonio Pass Wind Farm"><img align=right src="http://farm4.static.flickr.com/3118/3186143355_4840fb7620_t.jpg" /></a></p>
<p>A pattern for persisting generators is to turn them into pickle-able class objects. This is useful when you use generators for streaming training examples.</p>
<p>I would also try <a href="http://www.fiber-space.de/generator_tools/doc/generator_tools.html">generator_tools</a>, which might be a more convenient alternative to the pattern I describe. I haven’t used it yet.</p>
<hr />
<h2>Generators for streaming training examples</h2>
<p>For machine learning, Python <a href="http://www.ibm.com/developerworks/library/l-pycon.html">generators</a> are a simple idiom that makes it easy to produce a stream of training examples. Moreover, you can nest generators:</p>
<ul>
<li>The inner generator can be used to read one example at a time.</li>
<li>The outer generator can be used to read examples from the inner generator until you have a full minibatch, and then yield this minibatch.</li>
</ul>
<p>Here is some example code:</p>
<p>[Update: The example works just as well without the ALL CAPS magic variable, “HYPERPARAMETERS”. I include it because this is the actual code I am using. Hyperparameters are global, read-only variables that specify the particular experimental condition being tested. I can’t say that I have the best solution to this aspect of experimental control. I might write a blog post about it in the future, to solicit feedback on improved methods. However, I have refined my current approach over several years, and I can assure you that it is far less painful than a handful of “cleaner” approaches.]</p>
<pre>def get_train_example():
    """Inner generator: yield one window of word ids at a time."""
    # common, myopen, and vocabulary come from the surrounding project and are not shown here.
    HYPERPARAMETERS = common.hyperparameters.read("language-model")

    from vocabulary import wordmap
    for l in myopen(HYPERPARAMETERS["TRAIN_SENTENCES"]):
        prevwords = []
        # split() with no arguments already strips surrounding whitespace.
        for w in l.split():
            if wordmap.exists(w):
                prevwords.append(wordmap.id(w))
                if len(prevwords) >= HYPERPARAMETERS["WINDOW_SIZE"]:
                    yield prevwords[-HYPERPARAMETERS["WINDOW_SIZE"]:]
            else:
                # Unknown word: start a fresh window.
                prevwords = []

def get_train_minibatch():
    """Outer generator: collect examples into minibatches."""
    HYPERPARAMETERS = common.hyperparameters.read("language-model")
    minibatch = []
    for e in get_train_example():
        minibatch.append(e)
        if len(minibatch) >= HYPERPARAMETERS["MINIBATCH SIZE"]:
            assert len(minibatch) == HYPERPARAMETERS["MINIBATCH SIZE"]
            yield minibatch
            minibatch = []
</pre>
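<p>The code above depends on project modules, so here is a self-contained sketch of the same nesting that runs on its own. The function names, the toy sentences, and the use of raw words instead of word ids are assumptions made for illustration.</p>

```python
def examples(sentences, window_size):
    """Inner generator: yield one window of words at a time."""
    for sentence in sentences:
        words = sentence.split()
        for end in range(window_size, len(words) + 1):
            yield words[end - window_size:end]


def minibatches(example_iter, minibatch_size):
    """Outer generator: group examples into fixed-size minibatches."""
    minibatch = []
    for example in example_iter:
        minibatch.append(example)
        if len(minibatch) == minibatch_size:
            yield minibatch
            minibatch = []
    # As in the code above, a trailing partial minibatch is dropped.


# Toy data (hypothetical): five windows in total, so two full minibatches of two.
sentences = ["the cat sat on the mat", "a dog barked"]
batches = list(minibatches(examples(sentences, window_size=3), minibatch_size=2))
```

<p>Nothing is read until the outer generator asks for it, so the same structure works for a training file that does not fit in memory.</p>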
<h2>You can’t persist training state by pickling your generators</h2>
<p>However, generators become problematic when you want to persist your experiment’s state in order to later restart training at the same place. Unfortunately, <a href="http://bugs.python.org/issue1092962">you can’t pickle generators in Python</a>, and it can be a bit of a <a href="http://en.wiktionary.org/wiki/pain_in_the_ass">PITA</a> to work around this in order to save the training state.</p>
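<p>A quick way to see the problem for yourself is the following sketch. The generator <em>count_up</em> is a hypothetical stand-in for a stream of training examples.</p>

```python
import pickle


def count_up(n):
    """A trivial generator, standing in for a stream of training examples."""
    for i in range(n):
        yield i


gen = count_up(3)
first = next(gen)  # advance the generator so it has some state worth saving

try:
    pickle.dumps(gen)
    pickled_ok = True
except TypeError as err:
    # The generator's position lives in a suspended frame, which pickle will not serialize.
    pickled_ok = False
    error_message = str(err)
```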
<h2>Pattern to work around this annoyance</h2>
<p>Following useful discussion on <a href="http://groups.google.com/group/pylearn-dev/browse_thread/thread/c4e4dd3496bbbf08">pylearn-dev</a> and stackoverflow <a href="http://stackoverflow.com/questions/1942328/add-a-member-variable-method-to-a-python-generator">[1]</a> <a href="http://stackoverflow.com/questions/1939015/singleton-python-generator-or-pickle-a-python-generator">[2]</a>, I propose the following pattern for converting generators to pickle-able class objects:</p>
<ol>
<li>Convert the generator to a class in which the generator code is the <a href="http://stackoverflow.com/questions/1942328/add-a-member-variable-method-to-a-python-generator/1942387#1942387">__iter__</a> method</li>
<li>Add <a href="http://docs.python.org/library/pickle.html#object.__getstate__">__getstate__</a> and <a href="http://docs.python.org/library/pickle.html#object.__setstate__">__setstate__</a> methods to the class, to handle pickling. Remember that you can’t pickle file objects. So the saved state should be plain data, such as a filename and a position, and the file has to be re-opened after unpickling, as necessary.</li>
</ol>
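<p>The two steps above can be sketched on a self-contained toy class. <em>LineStream</em>, its <em>skip</em> attribute, and the temporary file are hypothetical names for illustration. This is one possible way to apply the pattern, assuming it is acceptable to re-read and discard the lines that were already consumed.</p>

```python
import os
import pickle
import tempfile


class LineStream(object):
    """Hypothetical example: streams lines from a text file and can be pickled."""

    def __init__(self, filename):
        self.filename = filename
        self.count = 0  # lines produced so far in the current pass
        self.skip = 0   # lines to skip on the next pass (set when unpickling)

    def __iter__(self):
        # Step 1: the former generator body lives in __iter__.
        skip, self.skip = self.skip, 0
        self.count = 0
        with open(self.filename) as f:  # the file is re-opened here, never pickled
            for line in f:
                self.count += 1
                if self.count > skip:
                    yield line.rstrip("\n")

    def __getstate__(self):
        # Step 2: save only plain data, not the open file.
        return self.filename, self.count

    def __setstate__(self, state):
        self.filename, self.count = state
        self.skip = self.count


# Write a small file, read two lines, checkpoint, then resume from the checkpoint.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nb\nc\nd\n")

stream = LineStream(tmp.name)
it = iter(stream)
seen = [next(it), next(it)]
checkpoint = pickle.dumps(stream)

restored = pickle.loads(checkpoint)
rest = list(restored)

it.close()  # close the suspended generator so its file handle is released
os.remove(tmp.name)
```

<p>The restored object picks up at the third line. The cost is that resuming re-reads the skipped part of the file, which is simple but linear in the amount of data already consumed.</p>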
<p>Here is the updated code, after applying this pattern:</p>
<pre>
import sys

class TrainingExampleStream(object):
    def __init__(self):
        # Set the state variables, in case pickling happens before __iter__ is called.
        self.filename = None
        self.count = 0  # examples produced so far in the current pass
        self.skip = 0   # examples to skip on the next pass; set by __setstate__

    def __iter__(self):
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        from vocabulary import wordmap
        self.filename = HYPERPARAMETERS["TRAIN_SENTENCES"]
        # When resuming, replay and discard the examples that were already consumed.
        skip, self.skip = self.skip, 0
        self.count = 0
        for l in myopen(self.filename):
            prevwords = []
            for w in l.split():
                if wordmap.exists(w):
                    prevwords.append(wordmap.id(w))
                    if len(prevwords) >= HYPERPARAMETERS["WINDOW_SIZE"]:
                        self.count += 1
                        if self.count > skip:
                            yield prevwords[-HYPERPARAMETERS["WINDOW_SIZE"]:]
                else:
                    prevwords = []

    def __getstate__(self):
        return self.filename, self.count

    def __setstate__(self, state):
        """
        @warning: We ignore the filename.  If we wanted
        to be really fastidious, we would assume that
        HYPERPARAMETERS["TRAIN_SENTENCES"] might change.  The only
        problem is that if we change filesystems, the filename
        might change just because the base file is in a different
        path. So we issue a warning if the filename is different from what is expected.
        """
        filename, count = state
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        self.filename = HYPERPARAMETERS["TRAIN_SENTENCES"]
        if filename is not None and self.filename != filename:
            sys.stderr.write("self.filename %s != filename given to __setstate__ %s\n" % (self.filename, filename))
        # The file is re-opened, and the position replayed, the next time __iter__ runs.
        self.count = count
        self.skip = count

class TrainingMinibatchStream(object):
    def __init__(self):
        # Create the inner stream here, so that __getstate__ works before __iter__ is called.
        self.get_train_example = TrainingExampleStream()

    def __iter__(self):
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        minibatch = []
        # Iterate over the existing inner stream, so a position restored by __setstate__ is kept.
        for e in self.get_train_example:
            minibatch.append(e)
            if len(minibatch) >= HYPERPARAMETERS["MINIBATCH SIZE"]:
                assert len(minibatch) == HYPERPARAMETERS["MINIBATCH SIZE"]
                yield minibatch
                minibatch = []

    def __getstate__(self):
        return (self.get_train_example.__getstate__(),)

    def __setstate__(self, state):
        """
        @warning: We ignore the filename.
        """
        self.get_train_example = TrainingExampleStream()
        self.get_train_example.__setstate__(state[0])
</pre>
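<p>To see how the outer stream hands its state to the inner one, here is a self-contained sketch of a checkpoint taken between two minibatches. <em>ExampleStream</em>, <em>MinibatchStream</em>, and the in-memory list of integers are hypothetical stand-ins for the classes and training file above, not the code I actually run.</p>

```python
import pickle


class ExampleStream(object):
    """Hypothetical inner stream over an in-memory list of examples."""

    def __init__(self, data):
        self.data = data
        self.count = 0  # examples produced so far
        self.skip = 0   # examples to skip on the next pass (set when unpickling)

    def __iter__(self):
        skip, self.skip = self.skip, 0
        self.count = skip
        # An in-memory list can be sliced, so no replay is needed here.
        for example in self.data[skip:]:
            self.count += 1
            yield example

    def __getstate__(self):
        return self.data, self.count

    def __setstate__(self, state):
        self.data, self.count = state
        self.skip = self.count


class MinibatchStream(object):
    """Hypothetical outer stream: its position is just the inner stream's position."""

    def __init__(self, examples, size):
        self.examples = examples
        self.size = size

    def __iter__(self):
        minibatch = []
        for example in self.examples:
            minibatch.append(example)
            if len(minibatch) == self.size:
                yield minibatch
                minibatch = []

    def __getstate__(self):
        return self.examples.__getstate__(), self.size

    def __setstate__(self, state):
        inner_state, self.size = state
        self.examples = ExampleStream.__new__(ExampleStream)
        self.examples.__setstate__(inner_state)


stream = MinibatchStream(ExampleStream(list(range(10))), size=3)
it = iter(stream)
first = next(it)                   # consumes examples 0, 1, 2
checkpoint = pickle.dumps(stream)  # taken between two minibatches
resumed = list(pickle.loads(checkpoint))
```

<p>Because the checkpoint is taken while the outer generator is paused at its yield, the partial minibatch is empty and does not need to be saved. A checkpoint taken at any other moment would also have to store the examples collected so far.</p>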

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

