|
I want to implement a similarity function that can accurately identify similar log files. So far, I have been unable to find a suitable similarity metric for my problem. I have log files generated from several PCs (around 300), where each file contains the IP addresses visited each day. I want to measure similarity by comparing the visited IP addresses day by day; that is, I want to compare day 1 of PC1 with day 1 of PC2, and so on. For example (assume each log file contains only 4 days of data; if nothing was visited on a particular day, that row is left blank):
My similarity score between PC1 and PC2 would be:
For this problem, I could use the Jaccard similarity index (treating each day as a set of IP addresses), but I am not sure whether that is a suitable metric. For finding similar documents, I have seen people apply the Jaccard index to the whole document, but that is not what I am looking for. In my case, I want to apply the Jaccard index to each day and sum the results to get the final similarity value. Is this approach technically sound? Thank you.
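To make the per-day idea concrete, here is a minimal sketch of what I have in mind. The function names, the sample IP addresses, and the choice to score a day as 0 when both PCs visited nothing are all assumptions for illustration; the optional `weights` argument is a hypothetical hook in case the days should not count equally.

```python
from typing import List, Optional, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |A & B| / |A | B| of two sets of IP addresses."""
    union = a | b
    if not union:
        # Assumption: two blank days contribute 0 rather than 1.
        return 0.0
    return len(a & b) / len(union)


def daywise_similarity(
    pc1: Sequence[Set[str]],
    pc2: Sequence[Set[str]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Sum the per-day Jaccard indices, optionally weighted per day."""
    if len(pc1) != len(pc2):
        raise ValueError("both logs must cover the same number of days")
    if weights is None:
        weights = [1.0] * len(pc1)  # unweighted sum by default
    if len(weights) != len(pc1):
        raise ValueError("need exactly one weight per day")
    return sum(w * jaccard(a, b) for w, a, b in zip(weights, pc1, pc2))


# Hypothetical 4-day logs; an empty set stands for a blank row.
pc1: List[Set[str]] = [{"192.0.2.1", "192.0.2.2"}, {"192.0.2.3"}, set(), {"192.0.2.4"}]
pc2: List[Set[str]] = [{"192.0.2.1"}, {"192.0.2.3"}, set(), {"192.0.2.5"}]

print(daywise_similarity(pc1, pc2))  # per-day scores 0.5, 1.0, 0.0, 0.0
```

Dividing the result by the number of days (or by the sum of the weights) would keep the score between 0 and 1, which might make pairs of PCs easier to compare.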
We have around 1000 IP addresses, and we want to monitor the browsing pattern (over these 1000 IP addresses), where each PC is used by the same person. The study is conducted over 5 working days, and we log the visited IP addresses. If any of these IP addresses is visited on Monday, it has the highest weight, while if it is visited on Friday, it has the lowest weight. Weights for Tuesday, Wednesday and Thursday are normalized accordingly. This is why I am more interested in day-wise similarity, while my ultimate objective is to find the people who have similar browsing patterns (considering all 5 days). This study is kind of weird, but I am doing it for a project. |
|
What you need is an evaluation measure. Once you have defined a suitable evaluation measure, the correct similarity measure will follow naturally. |
What you described seems to track "the use of different PCs by the same person on the same day"; it does not track "the use of the same PC by the same person on different days".