[rdfweb-dev] Smushing the Semantic Web / Invalid data in IFP's

Perry Lorier perry at coders.net
Mon Mar 22 11:12:49 UTC 2004

For the last two weeks I've been smushing JibberJim's scutter of FOAF 
data.  Most of that time went on my tinkering as I went along, and 
another large chunk was due to treating foaf:mbox_sha1sum as case 
insensitive (I suspect a bug in my optimisations).  I believe that if 
I were to start again tomorrow, I'd get it done in about a week.

I also noticed, peering into the scroll, that there is a lot of 
invalid data in IFPs (inverse functional properties), for instance in 
foaf:jabberID:[1]

* "What the hell is a jabber?" - 48 unique nicks
* "What the fuck is Jabber?" - 217 unique nicks
* "wtf is jabber?" - 125 unique nicks
* "What the FUCK is a jabber?" - 16 unique nicks
* "fuck jabber" - 61 unique nicks

Since I assume these all come from the LJ data, and LJ has one nick per 
person, this shows that smushing on IFPs in the presence of bad data is 
practically useless: each of these strings would merge dozens or 
hundreds of distinct people into one.  (People who don't know what a 
Jabber ID is simply won't add one to a hand-crafted FOAF file.)
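
To make the failure mode concrete, here is a minimal sketch of smushing 
on a single IFP, assuming the data has already been flattened into 
(node, property, value) tuples.  The function name `smush`, the tuple 
layout and the example nicks are all made up for illustration; this is 
not the code I actually ran.

```python
from collections import defaultdict


def smush(records):
    """Group node ids that share a value for the same property.

    `records` is an iterable of (node_id, property, value) tuples.
    The layout is an assumption made for this sketch.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # (property, value) -> first node carrying it
    for node, prop, value in records:
        find(node)  # register the node even if it never merges
        key = (prop, value)
        if key in first_seen:
            union(first_seen[key], node)
        else:
            first_seen[key] = node

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical input: two different people with the same junk value.
records = [
    ("alice", "foaf:jabberID", "wtf is jabber?"),
    ("bob", "foaf:jabberID", "wtf is jabber?"),
    ("carol", "foaf:jabberID", "carol@jabber.example.org"),
]
merged = smush(records)
```

With this input, "alice" and "bob" end up in the same group purely 
because they typed the same junk string, while "carol" stays on her own.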

One option is, of course, to do some preliminary analysis on the data. 
For instance, Jabber IDs always contain an "@".  Applying even such a 
simple filter dramatically reduces the amount of rubbish.  Is there an 
ontology where you can specify a regex that a property must match to 
be valid?
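
As a rough sketch of that kind of filter, assuming the values are plain 
strings: the check below only asks for a non-empty, space-free part on 
each side of an "@", ignoring anything after a "/".  It is a cheap 
sanity test, not a full Jabber ID validator, and the names 
`looks_like_jabber_id` and `filter_jabber_ids` are invented for this 
example.

```python
def looks_like_jabber_id(value):
    """Cheap sanity check for a foaf:jabberID value.

    Only tests for something@something with no spaces; it does not
    implement the real Jabber ID grammar.
    """
    local, sep, rest = value.strip().partition("@")
    domain = rest.split("/", 1)[0]  # ignore an optional resource part
    if not (sep and local and domain):
        return False
    return " " not in local and " " not in domain


def filter_jabber_ids(pairs):
    """Keep only the (nick, value) pairs whose value passes the check."""
    return [(nick, value) for nick, value in pairs
            if looks_like_jabber_id(value)]


# Hypothetical input, not taken from the real data set.
sample = [
    ("someone", "What the hell is a jabber?"),
    ("isomer", "isomer@jabber.example.org"),
    ("other", "wtf is jabber?"),
]
kept = filter_jabber_ids(sample)
```

Running the filter before smushing means junk values never get the 
chance to act as a shared key in the first place.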

[1]: My humblest apologies for the coarse language, I hope it doesn't 
trip anyone's filters.

<mailto:perry at coders.net> foaf:name "Perry Lorier" ;
   foaf:nick "Isomer","IsoosI","Remosi" .
