Recently, I did a project mining the behavior of car buyers. While cleaning the data, I ran into this problem: there are several attributes that can be derived from other attributes with some simple calculation. Some of my colleagues insisted on deleting those "redundant" attributes; others held the opposite opinion.
My question is: is it still necessary to follow 3NF when constructing the database for ML/DM tasks? When I was taking data warehousing classes, the textbooks and my professor both told me that there is no need to follow 3NF; in some scenarios, deliberately violating 3NF is itself a necessity.
Furthermore, with the price of storage dropping and computing devices getting faster, is there any need to follow 3NF anymore?
asked Nov 03 '10 at 03:09
Well, these days it's the key-value pair model that is gaining much prominence. It's completely denormalized. I really liked this article if you want to know about the tradeoffs, in terms of time and space, between normalized and denormalized databases. People look very upbeat about it, but it isn't right in every situation, as the article demonstrates.
If you have features that are non-linear combinations of others and that make sense for human analysis (e.g. ratios such as the price/earnings ratio of a stock), then I think you should keep them instead of keeping only the atomic features (e.g. price and earnings in that case).
Depending on whether your machine learning model can build non-linear features by itself (e.g. an SVM with an RBF kernel) or not (e.g. penalized linear regression), those handcrafted non-linear features might be very good predictors that are critical to the predictive accuracy of your model.
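To make this concrete, here's a small sketch with made-up data (hypothetical `price`/`earnings` columns, target driven purely by their ratio): a plain least-squares fit on the raw features can't capture the ratio, while adding the handcrafted ratio feature lets the linear model fit it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
price = rng.uniform(10, 100, n)
earnings = rng.uniform(1, 10, n)
# Hypothetical target: driven by the ratio, not by either raw feature.
y = 3.0 * (price / earnings)

def fit_rmse(X, y):
    """Least-squares fit with an intercept; return root-mean-square error."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return np.sqrt(np.mean(resid ** 2))

rmse_raw = fit_rmse(np.column_stack([price, earnings]), y)
rmse_ratio = fit_rmse(np.column_stack([price, earnings, price / earnings]), y)

print(f"RMSE with raw features only: {rmse_raw:.3f}")
print(f"RMSE with the ratio added:   {rmse_ratio:.6f}")
```

The second fit drives the error to essentially zero, which is the point: a linear learner can't invent the ratio on its own.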
answered Nov 03 '10 at 06:50
3NF is never necessary per se. Even in databases, normal forms are a tradeoff between, on one side, saving space and cutting the time to modify records (hence ensuring consistency) and, on the other side, having faster queries over the complete data.
Data warehouses in general don't bother with normalization because, once you accept that records will never be modified and you have plenty of space, you can afford to keep redundant copies of everything.
However, I don't see how this is relevant for machine learning: you can trivially run a SELECT query to get data and features from both normalized and unnormalized databases, and feed the results of that query to any learner (that is, machine learning is representation-agnostic).
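A quick sketch of that point, using an in-memory SQLite database and a toy, hypothetical schema: whether the data lives in two normalized tables or one pre-flattened table, the rows handed to the learner are identical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized: two tables, joined at query time (toy, hypothetical columns).
cur.execute("CREATE TABLE buyers (id INTEGER PRIMARY KEY, age INTEGER)")
cur.execute("CREATE TABLE cars (buyer_id INTEGER, price REAL)")
cur.executemany("INSERT INTO buyers VALUES (?, ?)", [(1, 34), (2, 51)])
cur.executemany("INSERT INTO cars VALUES (?, ?)", [(1, 18000.0), (2, 42000.0)])

# Denormalized: one wide table carrying the same information.
cur.execute("CREATE TABLE flat (age INTEGER, price REAL)")
cur.executemany("INSERT INTO flat VALUES (?, ?)", [(34, 18000.0), (51, 42000.0)])

# Either way, the learner only ever sees flat (feature, feature, ...) rows.
joined = cur.execute(
    "SELECT b.age, c.price FROM buyers b JOIN cars c ON c.buyer_id = b.id "
    "ORDER BY b.id").fetchall()
flat = cur.execute("SELECT age, price FROM flat ORDER BY age").fetchall()

print(joined == flat)  # the schema choice is invisible downstream
```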
answered Nov 03 '10 at 07:21
Alexandre Passos ♦
Nothing makes the need to de-normalize a database more apparent than doing data warehousing work. Data warehouses almost always have, as a primary theme, the need to process many rows at a very rapid pace. Joins make this more time-consuming because the code doing all the work is comparing/branching more often.
Oftentimes it even makes sense to go so far as to generate rows with a lot of duplicated information just to avoid joins; that way, the only comparisons performed can be indexed for a further speed improvement.
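As a sketch of that pre-joining idea, using SQLite and an invented star-schema fragment (the table and column names are hypothetical): join once to build a wide table with the dimension attributes duplicated into every fact row, then index and scan it without ever joining again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized fact/dimension pair (toy star-schema fragment).
cur.execute("CREATE TABLE sales (day TEXT, model_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE models (id INTEGER PRIMARY KEY, name TEXT, brand TEXT)")
cur.executemany("INSERT INTO models VALUES (?, ?, ?)",
                [(1, "Civic", "Honda"), (2, "Golf", "VW")])
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("2010-11-01", 1, 19000.0), ("2010-11-01", 2, 21000.0),
                 ("2010-11-02", 1, 18500.0)])

# Pre-join once, duplicating name/brand into every sales row, so later
# scans never pay for the join again -- and the duplicates are indexable.
cur.execute("""CREATE TABLE sales_wide AS
               SELECT s.day, s.amount, m.name, m.brand
               FROM sales s JOIN models m ON m.id = s.model_id""")
cur.execute("CREATE INDEX idx_brand ON sales_wide (brand)")

honda_total = cur.execute(
    "SELECT SUM(amount) FROM sales_wide WHERE brand = 'Honda'").fetchone()[0]
print(honda_total)
```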
While I know your question is about machine learning, I gave the answer I did because I believe there's a lot of common ground between data warehousing and data mining in this respect.
In direct answer to your colleagues' concerns: if you're building a data warehouse, they're wrong to believe you should get rid of duplicate data -- provided the duplicates aren't creating a huge performance burden or generating a lot of overhead.
Pick up a copy of The Data Warehouse Toolkit (ISBN 0471200247). It'll give you a good step in the right direction toward understanding why 3NF is not appropriate for data-warehousing-style tasks.
The rule of thumb is essentially this: if your database is transactional, some sort of 3NF schema is more than likely appropriate. If it is subject-oriented (e.g., you're seeking to glean information from or manipulate the data), then you should be more creative about your schema design and worry far less about the "good" practices they taught you in college.
Furthermore, don't mix the two. If you need a subject-oriented database, don't try to shoehorn a transactional database to suit your needs. Just make a new database schema and create an ETL task that does your data loading, periodically updating the new DB.
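A minimal sketch of such an ETL refresh, with two in-memory SQLite databases standing in for the transactional source and the subject-oriented target (the `orders`/`orders_flat` schema and the incremental-by-max-id strategy are both invented for illustration):

```python
import sqlite3

# Hypothetical source (transactional) and target (subject-oriented) DBs.
oltp = sqlite3.connect(":memory:")
dw = sqlite3.connect(":memory:")

oltp.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, buyer TEXT, total REAL)")
oltp.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "alice", 120.0), (2, "bob", 75.5)])
oltp.commit()

dw.execute("CREATE TABLE orders_flat (order_id INTEGER, buyer TEXT, total REAL)")

def etl_refresh(src, dst):
    """Toy ETL pass: extract rows newer than the warehouse high-water mark
    and load them. A real job would also transform, batch, and log."""
    last = dst.execute(
        "SELECT COALESCE(MAX(order_id), 0) FROM orders_flat").fetchone()[0]
    rows = src.execute(
        "SELECT id, buyer, total FROM orders WHERE id > ?", (last,)).fetchall()
    dst.executemany("INSERT INTO orders_flat VALUES (?, ?, ?)", rows)
    dst.commit()
    return len(rows)

n1 = etl_refresh(oltp, dw)   # initial load picks up every existing row
oltp.execute("INSERT INTO orders VALUES (3, 'carol', 210.0)")
n2 = etl_refresh(oltp, dw)   # periodic run picks up only the new row
print(n1, n2)
```

The key design point is that the transactional schema never has to bend to analytic needs; the warehouse side is free to flatten however it likes.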