[rdfweb-dev] Smushing the Semantic Web / Invalid data in IFP's
perry at coders.net
Wed Mar 24 01:01:06 UTC 2004
Morten Frederiksen wrote:
>On Monday 22 March 2004 12:12, Perry Lorier wrote:
>>For the last two weeks I've been smushing JibberJim's Scutter of FOAF
>>data. Most of this is due to my tinkering as I'm going along, and
>>another large chunk was due to treating foaf:mbox_sha1sum as case
>>insensitive (I suspect a bug with my optimisations). I believe that if
>>I was to start again tomorrow, I'd get it done in about a week.
>Hmm, methinks that sounds like a long time...
Yeah, my program's pretty inefficient. I imagine I can get it to run a lot
faster, and 16 hours sounds about the right ballpark for what I was
thinking. The other thing was that I stopped and edited my code a lot,
then restarted the smush. You also use a much saner schema than I do.
Do you try and do a lot of it in memory? I tried to keep my memory
usage very low so I could use the machine while it was smushing. This
however means that it does a lot of I/O.
>Yep, that's a problem, not sure how to handle that in an intelligent and
>However, I disagree with your numbers. The highest count for a value of
>foaf:jabberID is 15 for "wocky", with "Wocky" in second place with 8
>occurrences (and danbri tied for third with "wtf?" with 7 occurrences!).
>I've confirmed these numbers with grep, cut, sort and uniq on the original
>input file. How did you arrive at your multi-hundred numbers?
A) There's still a lot of duplicate data in my smush; I've been meaning
to go through and prune it all out. My original duplicate removal during
smushing was buggy and didn't work, so about half my smush was completed
before this bug was fixed. I wouldn't have thought the effect would be
that pronounced, however.
B) I treat some fields (such as jabberID/mbox_sha1sum) as case
insensitive, thus "wocky" and "Wocky" are the same thing.
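
As a rough illustration of B (a sketch only, not the code used for the
smush; the sample statements, the CASE_INSENSITIVE set and the function
names are all made up for the example), folding case before counting
merges the "wocky" and "Wocky" tallies into one:

```python
from collections import Counter, defaultdict

# Hypothetical (node, property, value) statements, NOT real scutter data.
STATEMENTS = [
    ("_:a", "foaf:jabberID", "wocky"),
    ("_:b", "foaf:jabberID", "Wocky"),
    ("_:c", "foaf:jabberID", "wocky"),
    ("_:d", "foaf:jabberID", "wtf?"),
]

# Assumption: these properties are compared case-insensitively (as in B).
CASE_INSENSITIVE = {"foaf:jabberID", "foaf:mbox_sha1sum"}


def normalise(prop, value):
    """Fold case for properties treated as case-insensitive."""
    return value.casefold() if prop in CASE_INSENSITIVE else value


def count_values(statements, prop, fold=True):
    """Count occurrences of each value of `prop`, optionally case-folded."""
    counts = Counter()
    for _node, p, value in statements:
        if p == prop:
            counts[normalise(p, value) if fold else value] += 1
    return counts


def nodes_by_value(statements, prop):
    """Group nodes that share a (normalised) value of `prop`."""
    groups = defaultdict(set)
    for node, p, value in statements:
        if p == prop:
            groups[normalise(p, value)].add(node)
    return groups


if __name__ == "__main__":
    print(count_values(STATEMENTS, "foaf:jabberID", fold=False))
    print(count_values(STATEMENTS, "foaf:jabberID", fold=True))
    print(dict(nodes_by_value(STATEMENTS, "foaf:jabberID")))
```

With fold=False this is the same kind of count as the sort/uniq
pipeline; with fold=True the two spellings become a single key, so the
nodes carrying either spelling end up as candidates for the same smush.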
I think that having a schema like
http://coders.meta.net.nz/xmlns/2004/03/val and then
http://coders.meta.net.nz/xmlns/2004/03/vfoaf would mean you can very
easily remove the bogus data.