I'm looking for a way to automatically generate a parser and a translator from a corpus of code sources and their translations into another computer language.

  • The corpus is wikipedia.
  • The source language is mediawiki markup; the target language is HTML.
  • The AST I'm interested in is the one extracted from text written in mediawiki markup.

Background story:

  • I am setting myself this exercise as a way to dive into "machine learning"
  • As it happens, I can find no mediawiki markup parser written in Scheme

In particular:

  • Is it possible to come up with an algorithm that can generate a parser, a translator, or both, without giving the algorithm hints about the source and target languages? A hint might be, for example, «what the tokens are in the source and target languages». If yes, what does this algorithm look like? I'm interested in both modes, with and without hints.
  • Can the same generated program perform the opposite operation, target→source?
  • Is it possible to understand how the generated program computes its result?
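For the "with hints" mode, here is one very small illustration of learning from aligned pairs. If the hint is that inline markup consists of matched delimiters, simple (delimiter → tag) rules can be induced from a toy aligned corpus. This is only a sketch under strong assumptions (hand-picked pairs, single-word content, no nesting) and not a general grammar-induction algorithm; the toy pairs and function names are mine:

```python
import re

# Toy aligned corpus (assumption: hand-picked pairs, not real Wikipedia data).
pairs = [
    ("'''bold'''", "<b>bold</b>"),
    ("''italic''", "<i>italic</i>"),
]

def induce_rules(pairs):
    """Induce (delimiter -> tag) rules from pairs that share the same inner text."""
    rules = {}
    for src, tgt in pairs:
        m_src = re.fullmatch(r"(\W+)(\w+)\1", src)          # delimiter, word, delimiter
        m_tgt = re.fullmatch(r"<(\w+)>(\w+)</\1>", tgt)      # tag, word, matching close tag
        if m_src and m_tgt and m_src.group(2) == m_tgt.group(2):
            rules[m_src.group(1)] = m_tgt.group(1)
    return rules

def translate(text, rules):
    # Apply longer delimiters first so ''' wins over ''.
    for delim, tag in sorted(rules.items(), key=lambda kv: -len(kv[0])):
        pattern = re.escape(delim) + r"(.+?)" + re.escape(delim)
        text = re.sub(pattern, rf"<{tag}>\1</{tag}>", text)
    return text

rules = induce_rules(pairs)
print(rules)                                 # {"'''": 'b', "''": 'i'}
print(translate("say ''hi'' loud", rules))   # say <i>hi</i> loud
```

The induced rules generalize to unseen inner text, which is the minimal version of what you are asking for; the hard part of the real problem is inducing the recursive, context-sensitive parts of the grammar rather than flat delimiter pairs.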

I'm not looking for ready-to-consume code, but if it exists, please share it.

I'm also interested in knowing how the machine-learning algorithms/techniques involved (if any) can be applied to other problems and domains.

My preferred way to model data is as a graph, but if that doesn't make sense here, don't push it too hard.

I don't need the program to understand the underlying knowledge represented in the source and the target; it just has to learn how to go from source to target. This is different from NLP, as I understand it. The closest thing seems to be "controlled language machine translation". But here I think both the source and target languages have specific properties (like a known grammar) that make the problem different from, and simpler than, natural language machine translation.
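On the target→source question: when the learned translator is a table of structural rules rather than an opaque model, the same table can often be run in both directions. A hedged sketch, assuming a hypothetical rule table mapping wiki delimiters to HTML tags (how such a table is learned is left aside here):

```python
import re

# Hypothetical rule table, assumed to have been learned from a corpus.
rules = {"'''": "b", "''": "i"}

def to_html(text, rules):
    # Apply longer delimiters first so ''' wins over ''.
    for delim, tag in sorted(rules.items(), key=lambda kv: -len(kv[0])):
        text = re.sub(re.escape(delim) + r"(.+?)" + re.escape(delim),
                      rf"<{tag}>\1</{tag}>", text)
    return text

def to_wiki(text, rules):
    # Invert the same table: tag -> delimiter.
    for delim, tag in rules.items():
        text = re.sub(rf"<{tag}>(.+?)</{tag}>", delim + r"\1" + delim, text)
    return text

wiki = "a '''very''' good ''day''"
html = to_html(wiki, rules)
print(html)                  # a <b>very</b> good <i>day</i>
print(to_wiki(html, rules))  # a '''very''' good ''day''
```

Invertibility holds here because each rule is a bijection on the substrings it matches; a statistical model with many-to-one mappings (e.g. several markups rendering to the same HTML) would not be invertible without extra information.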

I was told that this was impossible and that it has been the subject of research by many brilliant minds, if not the best. I have not found any articles dealing with this specific subject.

If this problem is AI-complete, I'd like an explanation and references that explain why.

If there is a solution, I'd like to know what it is, even if I don't have the math background to fully understand it, and even if the solution requires being able to solve an NP-complete problem.

Corpus:

  • wikipedia dumps via torrent (markup): http://meta.wikimedia.org/wiki/Data_dump_torrents
  • wikipedia dumps (markup): http://dumps.wikimedia.org/
  • wikipedia dumps (html): these no longer seem to be available except by making HTTP requests. It may be better to use a readily available mediawiki-markup→HTML converter: http://www.mediawiki.org/wiki/Alternative_parsers
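As a starting point for building a parallel corpus from the markup dumps: the pages-articles dumps are XML with <page>/<revision>/<text> elements, so (title, wikitext) pairs can be extracted with the standard library. A minimal sketch over a miniature stand-in string; note that real dumps declare an XML namespace (so tag lookups need the namespace prefix) and are large enough that you would use iterparse rather than loading the whole file:

```python
import xml.etree.ElementTree as ET

# Miniature stand-in for a pages-articles dump: real dumps use the same
# <page>/<revision>/<text> structure, plus a namespace and much more metadata.
dump = """<mediawiki>
  <page>
    <title>Example</title>
    <revision><text>'''Example''' is a page.</text></revision>
  </page>
</mediawiki>"""

def iter_pages(xml_text):
    """Yield (title, wikitext) pairs from a dump-shaped XML string."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        yield page.findtext("title"), page.findtext("./revision/text")

page_pairs = list(iter_pages(dump))
print(page_pairs)  # [('Example', "'''Example''' is a page.")]
```

Each extracted wikitext can then be run through one of the converters listed above to produce the HTML side of the aligned corpus.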

asked Mar 26 '14 at 16:13

amz3

I also asked this question at http://stackoverflow.com/questions/22621164/how-to-automatically-generate-a-parser-code-to-code-translator-from-a-corpus

(Apr 03 '14 at 14:01) amz3