<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MetaOptimize &#187; xml</title>
	<atom:link href="http://metaoptimize.com/blog/tag/xml/feed/" rel="self" type="application/rss+xml" />
	<link>http://metaoptimize.com/blog</link>
	<description>building machine learning and natural language processing tools</description>
	<lastBuildDate>Wed, 08 Sep 2010 07:40:21 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=abc</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>KEA Keyphrase Extraction as an XML-RPC service (code release)</title>
		<link>http://metaoptimize.com/blog/2010/08/18/kea-keyphrase-extraction-as-an-xml-rpc-service/</link>
		<comments>http://metaoptimize.com/blog/2010/08/18/kea-keyphrase-extraction-as-an-xml-rpc-service/#comments</comments>
		<pubDate>Thu, 19 Aug 2010 03:38:38 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[KEA keyphrase extractor]]></category>
		<category><![CDATA[Remote procedure call]]></category>
		<category><![CDATA[term extractor]]></category>
		<category><![CDATA[Terminology extraction]]></category>
		<category><![CDATA[terminology mining]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[XML-RPC]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=169</guid>
		<description><![CDATA[

Summary
We release code written by Ali Afshar, which turns the KEA keyphrase extractor into an XML-RPC service. This allows you to use KEA as a service, calling it from a variety of different programming languages. The code is released under the New BSD License.

Background
Keyphrase extraction (AKA terminology mining, term extraction, term recognition, or glossary extraction) [...]]]></description>
			<content:encoded><![CDATA[
<h2>Summary</h2>
<p>We release <a href="http://github.com/turian/kea-service">code</a> written by Ali Afshar, which turns the <a href="http://www.nzdl.org/Kea/">KEA</a> keyphrase extractor into an XML-RPC service. This allows you to use KEA as a service, calling it from a variety of different programming languages. The code is released under the <a href="http://en.wikipedia.org/wiki/BSD_licenses#3-clause_license_.28.22New_BSD_License.22.29">New BSD License</a>.</p>
<hr />
<h2>Background</h2>
<p>Keyphrase extraction (AKA terminology mining, term extraction, term recognition, or glossary extraction) is the process of extracting multi-word phrases that summarize the meaning of a text passage.</p>
<p>For example, in <a href="http://www.fao.org/docrep/Article/ejade/ae228e/ae228e00.htm">this document</a> entitled “The Growing Global Obesity Problem: Some Policy Options to Address It”, the keyphrases might be: “developing countries”, “food consumption”, “overweight”, “taxes”, “prices”, “price policies”, “fiscal policies”, “feeding habits”, “nutritional requirements”, “diet”, “nutrition policies”, and “food intake”.</p>
<p>These keyphrases are useful for summarizing the topic of the text. They are also useful in later NLP processing steps, where they are sometimes more informative and less ambiguous than the individual word tokens in the text.</p>
<p><a href="http://www.nzdl.org/Kea/">KEA</a> is a great keyphrase extraction implementation. It is useful because it is open-source, backed by solid research, comes with some annotated training data, and because it can extract keyphrases over unrestricted text, without needing a vocabulary of possible keyphrases.</p>
<p>Other implementations of keyphrase extraction include:</p>
<ul>
<li><a href="http://code.google.com/p/maui-indexer/">Maui</a>, a topic extractor from the same people that wrote KEA.</li>
<li><a href="http://pypi.python.org/pypi/topia.termextract/">topia.termextract</a> is a Python term extractor. It is relatively noisy and proposes many bogus keywords, but it is simple to use. This is my recommendation if you want something quick and dirty that works immediately out of the box.</li>
</ul>
<p>API implementations include:</p>
<ul>
<li><a href="http://www.nactem.ac.uk/software/termine/">Termine</a> by NaCTeM, a permissive term extractor I’ve used in the past. They will give you bulk access for research purposes. It is a UK web service that is also relatively noisy and proposes many bogus keywords. However, it appears to me to be slightly more accurate than topia.termextract. YMMV.</li>
<li><a href="http://www.alchemyapi.com/api/keyword/">Alchemy’s</a> term extractor.</li>
<li><a href="http://developer.yahoo.com/search/content/V1/termExtraction.html">The Yahoo term extraction API</a>, which is <a href="http://developer.yahoo.net/blog/archives/2010/08/api_updates_and_changes.html">now only available through YQL</a>. It is low recall but high precision. In other words, it gives you a small number of high quality terms, but misses many of the terms in your documents.</li>
<li><a href="http://fivefilters.org/term-extraction/">Five Filters</a>, a web service version of topia’s term extractor (see above).</li>
<li><a href="http://maui-indexer.appspot.com/">Maui on Appspot</a>.</li>
</ul>
<p>Peter Turney has done a lot of research on keyphrase extraction, and <a href="http://www.extractor.com/about.aspx">licenses his implementation</a>.</p>
<p>There is a wide academic literature on term extraction, which I won’t summarize here. The best introductory papers are by Park, who is now at IBM:<br />
<a href="http://portal.acm.org/citation.cfm?id=1072370">“Automatic glossary extraction: beyond terminology identification”</a> and<br />
“Glossary extraction and utilization in the information search and delivery system for IBM technical support”. You can read more about how to roll your own <a href="http://stackoverflow.com/questions/1575246/how-do-i-extract-keywords-used-in-text/1575345#1575345">termex implementation here</a>.</p>
<p>More information about the topic is available on the <a href="http://maui-indexer.blogspot.com/">Maui blog</a>.</p>
<hr />
<h2>Code</h2>
<p>Out of the box, KEA runs as a standalone program that reads its input from disk. For speed, one might instead want a resident service that keeps the model in memory. Additionally, one might want to call this service from non-Java languages. XML-RPC is a widely supported standard for implementing remote services.</p>
<p>We hereby release <a href="http://github.com/turian/kea-service">KEA service</a> written by Ali Afshar, which turns the <a href="http://www.nzdl.org/Kea/">KEA</a> keyphrase extractor into an XML-RPC service. This allows you to use KEA as a service, calling it from a variety of different programming languages. The code is released under the <a href="http://en.wikipedia.org/wiki/BSD_licenses#3-clause_license_.28.22New_BSD_License.22.29">New BSD License</a>.</p>
<p>Also included in <a href="http://github.com/turian/kea-service/blob/master/README">the documentation</a> is a description of how this Java program was converted into an XML-RPC service.</p>
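<p>To make the resident-service idea concrete, here is a minimal sketch in Python rather than the released Java code. It is a stand-in, not KEA: <tt>load_model</tt>, <tt>extract_keyphrases</tt>, and the toy stopword filter are hypothetical names made up for illustration, and the methods that KEA service actually exposes are described in its README. The point is only the pattern: the server loads its model once, and any XML-RPC client can then call it.</p>

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def load_model():
    # Hypothetical stand-in for an expensive model load (the real KEA model lives in Java).
    return {"stopwords": {"the", "a", "of", "and", "is", "to", "in"}}


def make_server(host="127.0.0.1", port=0):
    model = load_model()  # loaded once, then kept in memory for every request

    def extract_keyphrases(text):
        # Toy extractor, NOT KEA's algorithm: distinct non-stopword tokens in order.
        seen = []
        for token in text.lower().split():
            word = token.strip(".,;:!?\"'()")
            if word and word not in model["stopwords"] and word not in seen:
                seen.append(word)
        return seen

    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(extract_keyphrases, "extract_keyphrases")
    return server


def call_service(url, text):
    # Any language with an XML-RPC client library could make this same call.
    with ServerProxy(url) as proxy:
        return proxy.extract_keyphrases(text)


def demo():
    server = make_server()  # port 0 asks the OS for any free port
    host, port = server.server_address
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return call_service(f"http://{host}:{port}/", "The price of food and the price of diet")
    finally:
        server.shutdown()
        server.server_close()
```

<p>Calling <tt>demo()</tt> starts the toy server on a free local port, makes one XML-RPC call, and shuts the server down again. A client written in another language would only need the URL and the method name.</p>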

<div style="float:left;margin:0px 0px 0px 0px;"></div>]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/08/18/kea-keyphrase-extraction-as-an-xml-rpc-service/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Use flag --xml when you run mysqldump</title>
		<link>http://metaoptimize.com/blog/2009/10/14/use-flag-xml-when-you-run-mysqldump/</link>
		<comments>http://metaoptimize.com/blog/2009/10/14/use-flag-xml-when-you-run-mysqldump/#comments</comments>
		<pubDate>Wed, 14 Oct 2009 22:40:17 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[JSON]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[mysqldump]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=60</guid>
		<description><![CDATA[

Summary:

If you have text data (like a web scrape) stored in a MySQL database, and you want to share the data, mysqldump to XML using the --xml flag.

When fields are unlikely to contain tabs, an even simpler format is a tab-separated file, created using the --tab=path flag to mysqldump. path must be owned by the [...]]]></description>
			<content:encoded><![CDATA[
<h1>Summary:</h1>
<p><a href="http://flickr.com/photos/24030756@N05/2649856228" title="psychogenic womb memory-gemini project"><img src="http://farm4.static.flickr.com/3223/2649856228_61b5405cfa_t.jpg" align="right"></a></p>
<p>If you have text data (like a <a class="zem_slink" href="http://en.wikipedia.org/wiki/Screen_scraping" title="Screen scraping" rel="wikipedia">web scrape</a>) stored in a <a class="zem_slink" href="http://www.mysql.com" title="MySQL" rel="homepage">MySQL</a> database, and you want to share the data, mysqldump to <a class="zem_slink" href="http://en.wikipedia.org/wiki/XML" title="XML" rel="wikipedia">XML</a> using the <tt>--xml</tt> flag.</p>
<p>When fields are unlikely to contain tabs, an even simpler format is a tab-separated file, created using the <tt>--tab=path</tt> flag to mysqldump. <tt>path</tt> must be owned by the MySQL database user.</p>
<h1>The Problem with the standard MySQL dump format</h1>
<p>The standard MySQL dump looks as follows:</p>
<pre><code>INSERT INTO `sources` VALUES (1,'2009-03-07 22:06:36','"You\'ve got to be kidding me"', ...
</code></pre>
<p>The problem is that the standard dump format is difficult to interact with programmatically.</p>
<p>It is difficult to parse using <a class="zem_slink" href="http://en.wikipedia.org/wiki/Regular_expression" title="Regular expression" rel="wikipedia">regular expressions</a> because you cannot merely search for single quotes. You have to search for single quotes that are not preceded by a <a href="http://en.wikipedia.org/wiki/Backslash">backslash</a> (unless, perhaps, that backslash is preceded by a backslash).</p>
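<p>For what it is worth, a regular expression can cope with the backslash escapes if it consumes each backslash together with the character that follows it. The following is a rough Python sketch under that assumption. <tt>quoted_values</tt> is a hypothetical helper name, and the pattern does not try to handle everything a dump can contain (for example, quotes escaped by doubling them).</p>

```python
import re

# A single-quoted literal: an opening quote, then any mix of
#   - characters that are neither a quote nor a backslash, or
#   - a backslash plus whatever single character follows it,
# then the closing quote. Consuming escapes in pairs is what makes
# an escaped quote stay inside the value while an escaped backslash does not hide the closing quote.
SQL_STRING = re.compile(r"'((?:[^'\\]|\\.)*)'", re.DOTALL)


def quoted_values(line):
    """Return the raw (still escaped) contents of each single-quoted value."""
    return [match.group(1) for match in SQL_STRING.finditer(line)]
```

<p>Applied to the <tt>INSERT</tt> line above, this keeps the quoted sentence as one value, whereas a naive search for single quotes would split it at the apostrophe. The values still contain their backslashes, so unescaping remains a separate step.</p>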
<p>Also, there are no libraries for reading the standard dump format, nor scripts for converting it into a standard format like <a class="zem_slink" href="http://en.wikipedia.org/wiki/JSON" title="JSON" rel="wikipedia">JSON</a> or XML. I asked <a href="http://www.google.com/search?q=mysql+dump+library&amp;hl=en">the oracle</a> as well as <a href="http://stackoverflow.com/questions/1568838/library-to-read-a-mysql-dump">stackoverflow</a>.</p>
<p>So if you receive a MySQL dump in the standard format, you might have to install MySQL and import the dump to get at your data.</p>
<h1>The tabbed MySQL dump format</h1>
<p>You can create a directory with one file per table, and the table will be one-row-per-line, with <a class="zem_slink" href="http://en.wikipedia.org/wiki/Delimiter-separated_values" title="Delimiter-separated values" rel="wikipedia">tab-separated values</a>:</p>
<pre><code>mysqldump --tab=path database</code></pre>
<p>Here is some example output:</p>
<pre><code>1	2009-03-07 22:06:36	"You've got to be kidding me"</code></pre>
<p>If you get an error of the following form when you issue the mysqldump command:</p>
<pre><code>mysqldump: Got error: 1: Can't create/write to file 'path/database.txt' (Errcode: 13) when executing 'SELECT INTO OUTFILE'</code></pre>
<p>You can resolve this error by making sure that the directory you pass as <tt>path</tt> (for example, <tt>/tmp/path</tt>) is owned by the mysql user and is also writable by the current Unix user. Thanks <a href="http://forums.mysql.com/read.php?35,172714,172766#msg-172766">JinRong Ye</a>!</p>
<p>This format is convenient if none of your data contains tabs. In <a class="zem_slink" href="http://en.wikipedia.org/wiki/Natural_language_processing" title="Natural language processing" rel="wikipedia">NLP</a>, however, it is quite possible that your text will contain tabs.</p>
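<p>Reading such a file back is straightforward. Here is a small Python sketch using the standard <tt>csv</tt> module. <tt>read_tab_dump</tt> is a hypothetical helper name, and the sketch assumes that no field contains a tab or a newline.</p>

```python
import csv
import io


def read_tab_dump(handle):
    """Return each line of a tab-separated dump as a list of field strings."""
    # QUOTE_NONE treats double quotes as ordinary data, not as CSV quoting.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# The example row from above, held in memory instead of a file on disk.
sample = io.StringIO("1\t2009-03-07 22:06:36\t\"You've got to be kidding me\"\n")
rows = read_tab_dump(sample)
```

<p>For a real dump you would pass an open file handle for one of the per-table files instead of the in-memory sample. Every field comes back as a string, so numeric columns need converting yourself.</p>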
<h1>The XML MySQL dump format</h1>
<p>Enter the XML MySQL dump format:</p>
<pre><code>        &lt;table_data name="sources"&gt;
        &lt;row&gt;
                &lt;field name="id"&gt;1&lt;/field&gt;
                &lt;field name="created_at"&gt;2009-03-07 22:06:36&lt;/field&gt;
                &lt;field name="text"&gt;&amp;quot;You've got to be kidding me&amp;quot;&lt;/field&gt;
</code></pre>
<p>Ah… pure bliss. You can get the XML dump format as follows:</p>
<pre><code>mysqldump --xml database</code></pre>
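<p>Once the data is in this form, any XML library can read it. As an illustration, here is a minimal Python sketch using the standard <tt>xml.etree.ElementTree</tt> module. <tt>rows_from_xml_dump</tt> is a hypothetical helper name, and the sketch loads the whole dump into memory, so it is only suitable for dumps of modest size.</p>

```python
import xml.etree.ElementTree as ET


def rows_from_xml_dump(source, table):
    """Return the rows of one table as a list of {column name: text} dicts."""
    rows = []
    tree = ET.parse(source)  # source: a filename or an open file object
    for table_data in tree.iter("table_data"):
        if table_data.get("name") != table:
            continue
        for row in table_data.iter("row"):
            # Character entities in the dump are already decoded by the parser.
            rows.append({field.get("name"): field.text for field in row.iter("field")})
    return rows
```

<p>For the excerpt above, you would call <tt>rows_from_xml_dump(dump_file, "sources")</tt>, where <tt>dump_file</tt> is wherever you saved the output of mysqldump. Each row becomes a dict keyed by column name, with the quotes in the text field restored, and no regular expressions are required.</p>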

<div style="float:left;margin:0px 0px 0px 0px;"></div>]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2009/10/14/use-flag-xml-when-you-run-mysqldump/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
