I am looking for some pointers here; I hope this complies with the community guidelines and that you are able to help.

I have been writing an algorithm in VBA (behind Excel) since last September. It is a huge piece of work and it is nearing its end, but I am stuck! I am in the process of turning a clustered data set into NLP rules using regex, with the clusters defined by a number of classifiers. The data sets I have used for this are absolutely massive.

The problem I am experiencing is turning the data into the shortest possible set of rules. If I have a table with columns A-Z and rows A-Z (the same set of n-grams on both axes), it produces a table like this:

         A    B    C    D
    A    0    0    0    0
    B    Y    0    0    0
    C    Y    Y    0    0
    D    X    Y    Y    0
    E    Y    Y    X    Y
    F    Y    Y    Y    X

The Y's should end up in the rules, the 0's are duplicates, and the X's are null values that are to be ignored.

The first and most inefficient way of writing the rules is just to take each combination, i.e.

    IF A & B then
    IF A & C then
    IF A & E then
    IF A & F then

and then repeat for B and so on. Clearly that's not the way forward.

The next method is fairly easy to see:

    IF A & (B or C or E or F) then
    IF B & (C or D or E or F) then

repeating for each column. Again, very inefficient.

The next method is the one I can see but need help figuring out: combining the rules.

    IF (A or B) & (C or E or F) then
    IF A & B then
    IF B & D then

While this hardly seems worth it at this scale, once you get up to the size of the data sets I have (1.45 trillion word combinations), the saving on word duplication, and therefore on the size of the rules, would be huge.

But I am completely lost. On one level it feels like a combinatorial maths/calculus problem; on another it feels like a basic algebra problem which, with a little ingenuity, I could just solve. But I cannot see the wood for the trees. Any suggestions would be very welcome.
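To make the "combining the rules" idea concrete, here is a minimal sketch of one possible greedy heuristic, written in Python rather than VBA purely for brevity. It is not a known-optimal method, only an illustration: it repeatedly picks the two columns that share the most Y rows, emits one combined rule for the shared rows, and leaves the leftovers as per-column rules. The names `TABLE`, `greedy_rules`, `format_rule` and the `min_shared` threshold are all hypothetical, and the data is just the four columns of the example table. Problems of this shape appear to be related to what is usually called biclique cover or Boolean matrix factorisation, which may be useful search terms.

```python
from itertools import combinations

# Example table: column -> set of rows marked Y.
# Cells marked 0 (duplicate) or X (null) are simply left out.
TABLE = {
    "A": {"B", "C", "E", "F"},
    "B": {"C", "D", "E", "F"},
    "C": {"D", "F"},
    "D": {"E"},
}


def greedy_rules(table, min_shared=2):
    """Greedy heuristic (an assumption, not an optimal algorithm).

    Returns a list of (columns, rows) rules, each meaning
    "IF (any of columns) & (any of rows) then".
    """
    # Work on a copy so the caller's table is not modified.
    remaining = {col: set(rows) for col, rows in table.items() if rows}
    rules = []
    while remaining:
        # Find the pair of columns with the largest shared set of rows.
        best = None
        for a, b in combinations(sorted(remaining), 2):
            shared = remaining[a] & remaining[b]
            if len(shared) >= min_shared and (
                best is None or len(shared) > len(best[1])
            ):
                best = ({a, b}, shared)
        if best is None:
            # Nothing left worth merging: one rule per remaining column.
            for col in sorted(remaining):
                rules.append(({col}, remaining[col]))
            break
        cols, rows = best
        # Pull in any other column that also contains every shared row.
        for col in sorted(remaining):
            if col not in cols and rows <= remaining[col]:
                cols.add(col)
        rules.append((cols, rows))
        # Remove the pairs this rule now covers.
        for col in cols:
            remaining[col] -= rows
            if not remaining[col]:
                del remaining[col]
    return rules


def format_rule(cols, rows):
    """Render a (columns, rows) rule in the IF ... & ... notation."""
    left = " or ".join(sorted(cols))
    right = " or ".join(sorted(rows))
    if len(cols) > 1:
        left = f"({left})"
    if len(rows) > 1:
        right = f"({right})"
    return f"IF {left} & {right} then"


if __name__ == "__main__":
    for cols, rows in greedy_rules(TABLE):
        print(format_rule(cols, rows))
```

On the example table, the first rule this produces is `IF (A or B) & (C or E or F) then`, followed by the leftovers `IF A & B then` and `IF B & D then`, plus the rules for columns C and D. Whether a greedy pass like this comes close to the shortest possible rule set on the full data is an open question; it is only a starting point.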
I am a complete amateur at this and am doing it as a side project at work, so your help is really appreciated. Even just pointing me in the direction of some articles or similar problem/solution guides would be very helpful. Many thanks in advance for even reading through this!