[rdfweb-dev] Smushing the Semantic Web / Invalid data in IFP's
perry at coders.net
Mon Mar 22 11:12:49 UTC 2004
For the last two weeks I've been smushing JibberJim's Scutter of FOAF
data. Most of that time went on tinkering as I went along, and
another large chunk was lost to treating foaf:mbox_sha1sum as case
insensitive (I suspect a bug in my optimisations). I believe that if
I were to start again tomorrow, I'd get it done in about a week.
I also noticed, peering into the scroll, that there is a lot of plainly
invalid data in IFPs (inverse functional properties), for instance in
foaf:jabberID:
* "What the hell is a jabber?" - 48 unique nicks
* "What the fuck is Jabber?" - 217 unique nicks
* "wtf is jabber?" - 125 unique nicks
* "What the FUCK is a jabber?" - 16 unique nicks
* "fuck jabber" - 61 unique nicks
Assuming these all come from the LJ (LiveJournal) data, and given that
LJ has one nick per person, this shows that smushing on IFPs in the
presence of bad data is practically useless: each of these values would
merge dozens or hundreds of distinct people into one. (People who don't
know what a Jabber ID is just won't add one to a hand-crafted FOAF file.)
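
To make the failure concrete, here is a rough sketch of smushing on a
single IFP. It is not the code I actually ran; the node ids, the sample
values and the smush() helper are all invented for illustration. Nodes
that share a (property, value) pair are merged with a small union-find.

```python
from collections import defaultdict


def smush(records):
    """Group node ids that share any (property, value) pair.

    `records` maps a hypothetical node id to a list of (property, value)
    pairs for properties treated as inverse functional.
    """
    parent = {node: node for node in records}

    def find(node):
        # Walk up to the representative, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    first_seen = {}
    for node, pairs in records.items():
        for pair in pairs:
            if pair in first_seen:
                # Same IFP value seen before: merge the two groups.
                parent[find(node)] = find(first_seen[pair])
            else:
                first_seen[pair] = node

    groups = defaultdict(set)
    for node in records:
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


# Invented sample data: two descriptions of one person, plus two
# unrelated people who typed the same junk into the Jabber field.
records = {
    "alice": [("foaf:jabberID", "alice@example.org")],
    "alice2": [("foaf:jabberID", "alice@example.org")],
    "bob": [("foaf:jabberID", "wtf is jabber?")],
    "carol": [("foaf:jabberID", "wtf is jabber?")],
}
merged = smush(records)
```

Here "alice" and "alice2" are merged correctly, but "bob" and "carol"
end up as one person purely because of the shared junk value.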
One option is, of course, to do some preliminary analysis on the data.
For instance, Jabber IDs always contain an "@". Applying even such a
simple filter removes a surprising amount of rubbish. Is there
an ontology where you can specify a regex that a property must match to
: My humblest apologies for the coarse language, I hope it doesn't
trip anyone's filters.
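
The "@" filter mentioned above could be sketched roughly as follows.
The pattern is my own assumption about what a plausible Jabber ID looks
like (something, an "@", something, no whitespace); it is a rough
sanity check rather than a full validation, and the names and sample
values are made up.

```python
import re

# Assumed shape: non-empty text, one "@", non-empty text, no whitespace.
JABBER_ID_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+$")


def plausible_jabber_id(value):
    """Return True if the value at least looks like user@host."""
    return bool(JABBER_ID_SHAPE.match(value.strip()))


# Invented sample values, one sane and two of the junk kind.
samples = [
    "someone@jabber.example",
    "wtf is jabber?",
    "What the hell is a jabber?",
]
kept = [value for value in samples if plausible_jabber_id(value)]
```

Only values that pass the check would then be used as smushing keys;
the rest can be dropped or kept as plain literals.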
<mailto:perry at coders.net> foaf:name "Perry Lorier" ;
foaf:nick "Isomer","IsoosI","Remosi" .