|
What are good visualization tools for exploring large data sets (say 10 GB)? I am mainly looking for a tool that lets me do some exploratory data analysis and visualization on simple tables dumped from databases. Update: I received comments asking what kind of data I want to visualize. My question was more general, but let's assume that I am interested in having Excel's graphing functionality for large data sets. Any kind of dashboarding tool for big data would help. An example of what would be of interest is this gallery for large datasets.
|
|
I use ParaView (http://www.paraview.org/) and have visualized more than 1 TB of data, doing analysis and filtering operations on the full dataset in real time. There are plugins that allow analysis and visualization via simple database queries, if that's what you are looking for. It depends on how much work you want to put in at the pre-visualization stage. Plain text data and many standard DBs are terrible for reading and/or modifying in parallel, so if you truly want to scale beyond the local memory of your machine you'll need to start thinking about those issues. Also, there's a project spun off from VTK (which underlies ParaView) called Titan (http://titan.sandia.gov) that does informatics. It has recently incorporated Protovis, which was linked in the question.
(Jul 07 '10 at 15:50)
Nathan
Nathan, thanks for the pointer; I was not familiar with that project. On the subject of VTK, VisIt is another solution built on VTK and is very similar in feature set to ParaView.
(Jul 09 '10 at 15:18)
corbett
|
|
Why not use R and the wonderful ggplot2 package (http://had.co.nz/ggplot2/)? R with ggplot2 is a step up from Excel in terms of dataset size, but the datasets I deal with on a regular basis (and which Excel has issues with once sub-calculations and formulas are added) are really only about 1 MB of CSV data. R on a normal computer might get into trouble with a 10 GB dataset, I think, but there are workarounds such as those described here: http://yusung.blogspot.com/2007/09/dealing-with-large-data-set-in-r.html
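For example, a minimal sketch of the Excel-style charts this replaces (the file and column names below are made up; assume a plain CSV dump from the database):

    library(ggplot2)

    # Hypothetical CSV dump with 'date' and 'amount' columns
    dat <- read.csv("table_dump.csv")

    # Histogram of one variable (ggplot2 picks a default bin width and warns about it)
    ggplot(dat, aes(x = amount)) + geom_histogram()

    # Scatter plot of two variables, the usual Excel-style chart
    ggplot(dat, aes(x = date, y = amount)) + geom_point()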
(Jul 07 '10 at 17:52)
Tov Are Jacobsen
ggplot2 is terribly slow on large datasets
(Jul 07 '10 at 20:17)
Mark Alen
|
|
These are commercial applications, but both are worth evaluating for exploring large datasets in-memory: I use QlikView extensively for this purpose and it can handle 10 GB of data efficiently. A similar application is Tableau, although in my experience it doesn't handle large data as well. |
|
I'm actually trying to put Protovis on HBase for exactly this purpose. |
|
I use PowerPivot quite a bit: it has reasonably good capabilities for handling large datasets. Ideally you want to keep your data in the database, and all the visualizations are translated into SQL queries that are executed on the DB server. As long as your queries aggregate enough and you're not pulling in the whole DB, PowerPivot can do some neat things. The PivotChart in particular is of interest. |
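That server-side aggregation idea is not specific to PowerPivot; as a rough sketch, here is the same pattern from R with DBI (the file, table, and column names are made up for illustration):

    library(DBI)

    # Connect to the database holding the dumped tables (an SQLite file here, just for illustration)
    con <- dbConnect(RSQLite::SQLite(), "dump.sqlite")

    # Aggregate on the server instead of pulling the whole table into memory
    daily <- dbGetQuery(con, "
      SELECT day, COUNT(*) AS n, AVG(amount) AS mean_amount
      FROM   orders
      GROUP  BY day
    ")

    dbDisconnect(con)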
|
Not quite what you asked, but one way to solve your problem is to filter and summarise your data first. For example, with many graphs, you only need one or two variables at a time. You can also try taking a random sample of your dataset, rather than visualising the full thing (just for the exploratory phase). Finally, try visualising means (or another statistic) instead of visualising each record. |
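A rough sketch of that sample-then-summarise approach in R (the file and column names are made up):

    library(ggplot2)

    dat <- read.csv("table_dump.csv")

    # Random sample of rows for the exploratory phase (assumes at least 10,000 rows)
    sampled <- dat[sample(nrow(dat), 10000), ]
    ggplot(sampled, aes(x = x, y = y)) + geom_point()

    # Plot group means instead of every record
    means <- aggregate(y ~ group, data = dat, FUN = mean)
    ggplot(means, aes(x = group, y = y)) + geom_point()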
|
If your data has many features (and not just a lot of observations), GGobi (ggobi.org) is a fantastic way of exploring higher-dimensional associations. Mondrian is similar to GGobi: http://rosuda.org/mondrian/
(Jul 13 '10 at 08:12)
Richie Cotton
|

What kind of data? Textual? If so, what kind of exploration do you want to see? Most frequent n-grams? A few more details would help in giving you a useful answer. Thanks!