[rdfweb-dev] Smushing the Semantic Web / Invalid data in IFP's

Perry Lorier perry at coders.net
Wed Mar 24 01:01:06 UTC 2004

Morten Frederiksen wrote:

>Hi all,
>On Monday 22 March 2004 12:12, Perry Lorier wrote:
>>For the last two weeks I've been smushing JibberJims Scutter of FOAF
>>data.  Most of this is due to my tinkering as I'm going along, and
>>another large chunk was due to treating foaf:mbox_sha1 as case
>>insensitive (I suspect a bug with my optimisations).  I believe that if
>>I was to start again tomorrow, I'd get it done in about a week.
>Hmm, methinks that sounds like a long time...
Yeah, my program's pretty inefficient.  I imagine I can get it to run a
lot faster, and 16 hours sounds about the right ballpark for what I was
thinking.  The other thing was that I stopped and edited my code a lot,
then restarted the smush.  You also use a much saner schema than I do.

Do you try and do a lot of it in memory?  I tried to keep my memory 
usage very low so I could use the machine while it was smushing.  This 
however means that it does a lot of I/O.

>Yep, that's a problem, not sure how to handle that in an intelligent and 
>manageable fashion.
>However, I disagree with your numbers. The highest count for a value of 
>foaf:jabberID is 15 for "wocky", with "Wocky" in second place with 8 
>occurrences (and danbri tied for third with "wtf?" with 7 occurrences!).
>I've confirmed these numbers with grep, cut, sort and uniq on the original 
>input file. How did you arrive at your multi-hundred numbers?
Probably because of two things:

A) There's still a lot of duplicate data in my smush; I've been meaning 
to go through and prune it all out.  My original duplicate removal during 
smushing was buggy and didn't work, so about half my smush was completed 
before this bug was fixed.  I wouldn't have thought that it would have 
been that pronounced, however.

B) I treat some fields (such as jabberID/mbox_sha1sum) as case 
insensitive, thus "wocky" and "Wocky" are the same thing.
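
As a rough illustration of point B, here is a minimal Python sketch of how 
folding case changes the tallies.  The helper name count_values and the 
sample list are hypothetical; only the counts (15, 8 and 7) are taken from 
the numbers quoted above, and this is not the code my smusher actually runs.

```python
from collections import Counter


def count_values(values, case_insensitive=False):
    """Count occurrences of each value, optionally folding case first."""
    if case_insensitive:
        values = (v.lower() for v in values)
    return Counter(values)


# Hypothetical sample rebuilt from the foaf:jabberID counts quoted above.
jabber_ids = ["wocky"] * 15 + ["Wocky"] * 8 + ["wtf?"] * 7

exact = count_values(jabber_ids)                           # case-sensitive
folded = count_values(jabber_ids, case_insensitive=True)   # case-insensitive

print(exact.most_common(2))   # [('wocky', 15), ('Wocky', 8)]
print(folded.most_common(1))  # [('wocky', 23)]
```

On these figures, case folding alone only lifts the top count from 15 to 
23, so it would not by itself account for counts in the hundreds.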

I think that having a schema like 
http://coders.meta.net.nz/xmlns/2004/03/val and then 
http://coders.meta.net.nz/xmlns/2004/03/vfoaf would mean you can very 
easily remove the bogus data.

