Summary:
If you have text data (like a web scrape) stored in a MySQL database, and you want to share the data, mysqldump to XML using the –xml flag.
When fields are unlikely to contain tabs, an even simpler format is a tab-separated file, created using the –tab=path flag to mysqldump. path must be owned by the MySQL database user.
The Problem with the standard MySQL dump format
The standard MySQL dump looks as follows
INSERT INTO `sources` VALUES (1,'2009-03-07 22:06:36','"You\'ve got to be kidding me"', ...
The problem is that the standard dump format is difficult to interact with programmatically.
It is difficult to parse using regular expressions because you cannot merely search for single quotes. You have to search for single quotes that are not preceded by a backslash (unless, perhaps, that backslash is preceded by a backslash).
Also, there are no libraries for reading the standard dump format, nor scripts for converting it into a standard format like JSON or XML. I asked the oracle as well as stackoverflow.
So if you receive a MySQL dump in the standard format, you might have to install MySQL and import the dump to get at your data.
The tabbed MySQL dump format
You can create a directory with one file per table, and the table will be one-row-per-line, with tab-separated values:
mysqldump --tab=path database
Here is some example output:
1 2009-03-07 22:06:36 "You've got to be kidding me"
If you get an error of the following form when you issue the mysqldump command:
mysqldump: Got error: 1: Can't create/write to file 'path/database.txt' (Errcode: 13) when executing 'SELECT INTO OUTFILE'
You can resolve this complaint by making sure that /tmp/path is owned by the mysql user (and also writeable by the current Unix user). Thanks JinRong Ye!
This format is convenient if none of your data contains tabs. In NLP, however, it is quite possible that your text will contain tabs.
The XML MySQL dump format
Enter the XML MySQL dump format:
<table_data name="sources">
<row>
<field name="id">1</field>
<field name="created_at">2009-03-07 22:06:36</field>
<field name="text">"You've got to be kidding me"</field>
Ah… pure bliss. You can get the XML dump format as follows:
mysqldump --xml database

![Reblog this post [with Zemanta]](http://img.zemanta.com/reblog_e.png?x-id=41468938-de30-448c-ac95-b381457c48c8)