Righteous Wrath Online Community

General => Lobby => Topic started by: Darren Dirt on August 29, 2006, 10:41:18 AM

Title: Semantic data extractor (HTML webpage data miner from W3.org)
Post by: Darren Dirt on August 29, 2006, 10:41:18 AM
http://www.w3.org/2003/12/semantic-extractor

Quote
Semantic data extractor
This tool, implemented using an XSLT stylesheet, tries to extract some information from a semantically rich HTML document. It only uses information available through good use of the semantics defined in HTML.

The aim is to show that providing semantically rich HTML gives much more value to your code: it allows better use of CSS and makes your HTML intelligible to a wider range of user agents (especially search engine bots).

As an aside, it can give user agent developers clues about some hooks that could be interesting to add to their products.
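
To make "information available through the semantics defined in HTML" a bit more concrete, here is a rough Python sketch that pulls the title and the heading outline out of a page. It is not the W3C stylesheet (that one is XSLT and extracts more than this); the names `OutlineParser` and `extract_outline` are made up for illustration, and it only looks at `<title>` and `<h1>` to `<h6>`.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collects the <title> text and the h1-h6 headings of a page.

    Hypothetical stand-in for illustration only, not the W3C XSLT tool.
    """

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []    # list of (level, text) tuples, in document order
        self._current = None  # tag currently being collected, if any
        self._buffer = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        # Only semantic markup counts: a styled <div> is ignored.
        if tag == "title" or tag in self.HEADINGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return  # e.g. the </em> of an <em> nested inside a heading
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if tag == "title":
            self.title = text
        else:
            self.headings.append((int(tag[1]), text))
        self._current = None


def extract_outline(html):
    """Return (title, [(level, heading_text), ...]) for an HTML string."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.title, parser.headings
```

A page that marks its headings up with real heading elements gives a usable outline here; a page that fakes them with big bold `<div>`s gives nothing, which is pretty much the point the quote is making.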

Examples I tried:

itself (http://www.w3.org/2005/08/online_xslt/xslt?xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy-if%3FdocAddr%3Dhttp%3A%2F%2Fwww.w3.org%2F2003%2F12%2Fsemantic-extractor&xslfile=http%3A%2F%2Fwww.w3.org%2F2002%2F08%2Fextract-semantic.xsl)

Mozilla.com (http://www.w3.org/2005/08/online_xslt/xslt?xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy-if%3FdocAddr%3Dhttp%3A%2F%2Fwww.mozilla.com&xslfile=http%3A%2F%2Fwww.w3.org%2F2002%2F08%2Fextract-semantic.xsl)

a typical page on "Ropin' The Web" (http://www.w3.org/2005/08/online_xslt/xslt?xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy-if%3FdocAddr%3Dhttp%3A%2F%2Fwww1.agric.gov.ab.ca%2F%24department%2Fnewslett.nsf%2Fall%2Fcotl9879&xslfile=http%3A%2F%2Fwww.w3.org%2F2002%2F08%2Fextract-semantic.xsl)

"http://en.wikipedia.org/wiki/Semantic_web" (http://www.w3.org/2005/08/online_xslt/xslt?xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy-if%3FdocAddr%3Dhttp%3A%2F%2Fen.wikipedia.org%2Fwiki%2FSemantic_Web&xslfile=http%3A%2F%2Fwww.w3.org%2F2002%2F08%2Fextract-semantic.xsl) (strangely, it seems that Wikipedia uses "HTML tidy service" when serving out ANY page!?)
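
For anyone who wants to try other pages: each link above follows the same pattern. The page address is appended to the tidy-if service as `docAddr`, and that whole address is percent-encoded into the `xmlfile` parameter of the online XSLT service, with extract-semantic.xsl as `xslfile`. That also suggests the tidy step comes from the extractor link itself rather than from the target site. Below is a minimal Python sketch of the pattern, inferred only from the URLs in this post; the function name `extractor_url` is made up for illustration, and nothing here checks that the W3C services still respond.

```python
from urllib.parse import quote

# Service addresses copied from the links in this post.
XSLT_SERVICE = "http://www.w3.org/2005/08/online_xslt/xslt"
TIDY_SERVICE = "http://cgi.w3.org/cgi-bin/tidy-if"
EXTRACTOR_XSL = "http://www.w3.org/2002/08/extract-semantic.xsl"


def extractor_url(page_url):
    """Build a semantic-extractor link for page_url (hypothetical helper).

    The page is first wrapped in the tidy service, then the wrapped
    address and the stylesheet address are each percent-encoded once.
    """
    tidy_url = TIDY_SERVICE + "?docAddr=" + page_url
    return (
        XSLT_SERVICE
        + "?xmlfile=" + quote(tidy_url, safe="")
        + "&xslfile=" + quote(EXTRACTOR_XSL, safe="")
    )


if __name__ == "__main__":
    print(extractor_url("http://www.mozilla.com"))
```

Passing `http://www.mozilla.com` reproduces the Mozilla.com link above character for character.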