|
On a social networking site like Facebook or Twitter, we can get an approximate idea of an individual's location from their profile information. With all the geo-tagging information coming in, it is becoming more accurate than ever.

Is there an approximate algorithm that can help find the location referenced in a web page, say on the basis of the verbs, adjectives and nouns used in the content? For example:

Any word which follows "Location:"
Any word which follows "I am from"

I mean, there could be a set of rules that approximately tells the location referenced in the document. However, even if we do arrive at such rules, we still do not have an exhaustive database of locations worldwide to match against. Still, 60-70% accuracy wouldn't be bad. Let me know your ideas.
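To make the two example rules concrete, here is a minimal Python sketch. It assumes a location is a run of capitalized words (optionally separated by commas) right after the cue phrase; the names `CUE_PATTERNS` and `extract_locations` are made up for illustration, and real pages would need many more cues.

```python
import re

# Assumption: a location is a run of capitalized words, optionally comma-separated.
LOCATION_RUN = r"(?P<loc>[A-Z]\w*(?:[ ,]+[A-Z]\w*)*)"

# The two cue phrases from the question; any other cues would have to be added by hand.
CUE_PATTERNS = [
    re.compile(r"\bLocation:\s*" + LOCATION_RUN),
    re.compile(r"\bI am from\s+" + LOCATION_RUN),
]

def extract_locations(text):
    """Return candidate location strings that directly follow a known cue phrase."""
    found = []
    for pattern in CUE_PATTERNS:
        for match in pattern.finditer(text):
            found.append(match.group("loc"))
    return found
```

Calling `extract_locations("I am from New York, NY and I like it.")` would pick up the capitalized run after the cue. Anything not introduced by one of the listed cues is missed, which is the limitation the question already anticipates.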
|
Alexandre and yoavg are completely right that this is a standard Named Entity Recognition problem. If all you're asking is to find location references, then your best bet is to use an NER system. For instance, the Stanford NER system is available...

Your original post describes something like a bootstrapping system. Following the Riloff approach, you would start with some seed patterns like
"I am from LOCATION"
"I'll never go back to LOCATION"
and/or some location seeds, like
LOCATION = {"New York, NY", "Tokyo"}

Starting with the seed patterns, find a set of other location terms that fit the patterns. Then, using the location seeds, find other patterns where location terms appear. Repeat until you're happy. This approach is intuitive and incredibly standard.

Here's the Abney paper on bootstrapping generally, and a handful on different styles of bootstrapping for NER. And for good measure, a pretty comprehensive survey of NER techniques through 2006.

Thanks. Nice information. Got enough to think about now.
(Jul 29 '10 at 11:44)
ArchieIndian
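The bootstrapping loop described in the answer above can be sketched roughly as follows. This is a toy illustration, not Riloff's actual algorithm: it assumes a location is a run of capitalized words, treats the entire left context of a known location as a new pattern, and does no scoring or filtering of patterns, which a real system would need. The function name `bootstrap` and the `LOC` regex are assumptions.

```python
import re

# Assumption: a crude stand-in for "a location mention" (capitalized word run).
LOC = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"

def bootstrap(corpus, seed_patterns, seed_locations, rounds=2):
    """Alternate between harvesting locations from patterns and patterns from locations."""
    patterns = set(seed_patterns)
    locations = set(seed_locations)
    for _ in range(rounds):
        # Step 1: use the known patterns to find new location terms.
        for pattern in list(patterns):
            regex = re.escape(pattern).replace("LOCATION", LOC)
            for sentence in corpus:
                locations.update(re.findall(regex, sentence))
        # Step 2: use the known locations to find new patterns
        # (here, naively, everything to the left of the location).
        for loc in list(locations):
            for sentence in corpus:
                idx = sentence.find(loc)
                if idx > 0:
                    left = sentence[:idx].strip()
                    if left:
                        patterns.add(left + " LOCATION")
    return patterns, locations
```

Without a scoring step, one bad pattern quickly pulls in non-locations, so in practice each round would keep only the patterns and terms that are well supported by the current seed sets.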
|
|
I don't know of any such system, but Google Maps' parser is really good at extracting locations. If you have free access to it (which is hard, but assume you do), you can search each post for the longest contiguous word sequences that it can parse (under the assumption that longest = most specific; with access to the parser I presume you could do better than that). Gmail supposedly has a lab that does this for emails, but it doesn't work for any address in Brazil I type (even if the email has the exact address string returned by Google Maps).

Also, if you just want to extract addresses, you could probably do better than that F1 if you used CRFs + BIO encoding + lots of place name features (extracted from maps, maybe) + lots of morphological features.

@Alexandre Is there someone you know of who is working on something similar?
(Jul 29 '10 at 02:17)
ArchieIndian
Not on this specific problem, but there's a lot of work on information extraction/named entity recognition, and this is an instance of that sort of problem.
(Jul 29 '10 at 02:18)
Alexandre Passos ♦
Can you refer me to a few? I will try to go through them to see what helps me.
(Jul 29 '10 at 03:33)
ArchieIndian
This is a seminal paper in information extraction http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.43.351&rep=rep1&type=pdf , and http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.584&rep=rep1&type=pdf is a more modern approach. For NER, this http://acl.ldc.upenn.edu/W/W03/W03-0430.pdf should do for a start.
(Jul 29 '10 at 05:36)
Alexandre Passos ♦
Thanks a lot Alexandre
(Jul 29 '10 at 05:56)
ArchieIndian
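For readers unfamiliar with the "CRFs + BIO encoding + place name features" recipe from the answer above, here is a small Python sketch of the data preparation side only. It assumes token-level location spans are already annotated and that a set of lowercase place-name tokens is available as a gazetteer; `bio_encode`, `token_features` and the specific feature names are illustrative choices, and training the CRF itself is left to whatever toolkit is used.

```python
def bio_encode(tokens, spans):
    """Label tokens B-LOC / I-LOC / O given (start, end) token spans, end exclusive."""
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = "B-LOC"          # first token of the location
        for i in range(start + 1, end):
            labels[i] = "I-LOC"          # remaining tokens inside the location
    return labels

def token_features(tokens, i, gazetteer):
    """Per-token features of the kind a CRF tagger could consume."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),                 # simple morphological cue
        "is_digit": tok.isdigit(),                 # helps with ZIP-like tokens
        "suffix3": tok[-3:],
        "in_gazetteer": tok.lower() in gazetteer,  # place name feature
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<s>",
    }
```

For the tokens `["I", "am", "from", "New", "York", "."]` with the span `(3, 5)`, "New" is tagged `B-LOC` and "York" `I-LOC`; the tagger then learns to predict these labels from the feature dictionaries.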
|
I was expecting some more responses to this one..
Could you elaborate on your question? It is not clear to me whether you are looking for names of places mentioned in the message (as Alexandre Passos is suggesting), or for an indication that the person who wrote the message is actually from the given place.
Yes, it is like getting to know the place a person is posting from, based on what they wrote. The approach is to find a list of nouns, verbs and other words that are used just before a location is written. For example, for http://www.complaints.com/2007/december/19/WALMART.COM_156912.htm it should be able to extract "Bentonville, AK 72716-8611 US". It need not be foolproof; even 70% accuracy won't be bad.
I still don't understand your goal. Imagine the message says: "I heard that Bentonville, AK, is a terrible place to be in and I never want to go there". Do you still want to extract Bentonville, AK?
If you do, then it is probably not very hard to reach the 70% range: use some off-the-shelf NER system and enrich it with some strong gazetteers and maybe some heuristic edit-distance matching of place names.
If you don't, then the problem becomes interesting.
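As a rough illustration of the gazetteer-plus-fuzzy-matching idea, here is a Python sketch using the standard library's `difflib`. Note that `difflib` scores similarity with a sequence-matching ratio rather than a true edit distance, so this is a stand-in for the heuristic, not the exact technique. The tiny `GAZETTEER` list, the `match_place` name and the `0.8` cutoff are all assumptions.

```python
import difflib

# Hypothetical toy gazetteer; a real one would hold many thousands of place names.
GAZETTEER = ["New York", "Bentonville", "Tokyo", "San Francisco"]

def match_place(candidate, gazetteer=GAZETTEER, cutoff=0.8):
    """Return the gazetteer entry closest to `candidate`, or None if nothing is close enough."""
    # Compare case-insensitively, but return the canonical spelling.
    lowered = {name.lower(): name for name in gazetteer}
    hits = difflib.get_close_matches(candidate.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None
```

This would let a misspelled or run-together mention such as "Newyork" or "Bentonvill" be normalized to its gazetteer entry, at the cost of occasional false matches when the cutoff is set too low.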
Something like organizing the Web by "what was posted from where?". Say AT&T wants to know what people in New York are saying about them in blogs (not microblogs: on Twitter, either geotagging or profile information does the job).
Say someone writes "I am from Newyork. The AT&T connectivity problems do not seem to cease here". My aim is to be able to extract that someone talked about poor AT&T connectivity in "New York". I have "AT&T" as well as "connectivity" passed as parameters.