Use flag –xml when you run mysqldump

Sum­mary:

If you have text data (like a web scrape) stored in a MySQL data­base, and you want to share the data, mysql­dump to XML using the –xml flag.

When fields are unlikely to con­tain tabs, an even sim­pler for­mat is a tab-separated file, cre­ated using the –tab=path flag to mysql­dump. path must be owned by the MySQL data­base user.

The Prob­lem with the stan­dard MySQL dump format

The stan­dard MySQL dump looks as follows

INSERT INTO `sources` VALUES (1,'2009-03-07 22:06:36','"You\'ve got to be kidding me"', ...

The prob­lem is that the stan­dard dump for­mat is dif­fi­cult to inter­act with programmatically.

It is dif­fi­cult to parse using reg­u­lar expres­sions because you can­not merely search for sin­gle quotes. You have to search for sin­gle quotes that are not pre­ceded by a back­slash (unless, per­haps, that back­slash is pre­ceded by a backslash).

Also, there are no libraries for read­ing the stan­dard dump for­mat, nor scripts for con­vert­ing it into a stan­dard for­mat like JSON or XML. I asked the ora­cle as well as stack­over­flow.

So if you receive a MySQL dump in the stan­dard for­mat, you might have to install MySQL and import the dump to get at your data.

The tabbed MySQL dump format

You can cre­ate a direc­tory with one file per table, and the table will be one-row-per-line, with tab-separated val­ues:

mysqldump --tab=path database

Here is some exam­ple output:

1	2009-03-07 22:06:36	"You've got to be kidding me"

If you get an error of the fol­low­ing form when you issue the mysql­dump command:

mysqldump: Got error: 1: Can't create/write to file 'path/database.txt' (Errcode: 13) when executing 'SELECT INTO OUTFILE'

You can resolve this com­plaint by mak­ing sure that /tmp/path is owned by the mysql user (and also write­able by the cur­rent Unix user). Thanks Jin­Rong Ye!

This for­mat is con­ve­nient if none of your data con­tains tabs. In NLP, how­ever, it is quite pos­si­ble that your text will con­tain tabs.

The XML MySQL dump format

Enter the XML MySQL dump format:

        <table_data name="sources">
        <row>
                <field name="id">1</field>
                <field name="created_at">2009-03-07 22:06:36</field>
                <field name="text">&quot;You've got to be kidding me&quot;</field>

Ah… pure bliss. You can get the XML dump for­mat as follows:

mysqldump --xml database
Reblog this post [with Zemanta]
  • http://josephturian.blogspot.com/ Joseph Turian

    In response to Joshua Reich:

    Let me answer your last ques­tion first:

    > 4. Why aren’t you using postgres ?

    I was get­ting data from some­one that uses MySQL.
    Know­ing what I know now, I believe should should have advised him to use the –xml flag.

    > 1. I am friends with awk & pals, and strip­ping out INSERT .. VALUE from
    > mysql dumps that I get from peo­ple is no biggy

    I am friends with perl, and you can­not sim­ply split using /,/ to get your fields. The comma might be right in the mid­dle of a string.

    With XML, though, it is sim­ple to grep for <row>, because you know 100% that < will only be in the markup.

    > 2. I’m pretty sure MySQL sup­ports dump­ing table data as CSV (SELECT
    > OUTHOUSE ‘/tmp/file.csv’ …)

    Same prob­lems as above.

    > 3. For big data, XML is just silly big.

    Why? It gzips eas­ily.
    Not being able to load it all into mem­ory is less of an issue if it is easy to split the data using reg­u­lar expressions.

  • http://www.dataspora.com/blog Michael E Driscoll

    XML has its place (some­where), but in this programmer’s hum­ble opin­ion, export­ing tab­u­lar data is not one of them.

    http://www.dataspora.com/blog/xml-and-big-data/

blog comments powered by Disqus