μRaptor: A DOM-based system with appetite for hCard elements
Emir Muñoz, Luca Costabello, Pierre-Yves Vandenbussche

<!DOCTYPE html>
<html class="no-js" lang="en">
<head>
<title>Search Results for… </title>
<meta name="description" content="Results for…">
<link rel="shortcut icon" href="/local/images/favicon.ico" type="image/ico" />
<style> … </style>
</head>
<body class="find-a-physician page-app physicians unified interior">
<div id="page-wrap" itemscope itemtype="http://schema.org/Place" itemid="http://schema.org/Hospital">
<!--[if lte IE 7]> <div id="lt-ie-8-warning">
<h1>You are using an outdated browser</h1> …
</div>
<script> function hide_message(){ … } </script> <![endif]-->
<div id="page" class="clearfix">
…
</div>
<!-- /page -->
<div id="footer-wrap" class="clearfix">
<footer>
<section class="row site-info content-inline clearfix style-1">
<div class="vcard author">
<a href="/home/">
<img src="/contentAsset/raw-data/6a390782-1ca3-44c3-b463-068816ba0fc5/knockOutLogo"
alt="Alaska Regional Hospital" class="logo">
</a>
<meta itemprop="url" content="http://alaskaregional.com" />
<header itemprop="name" class="fn org">Alaska Regional Hospital</header>
<div class="contact-info">
<div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress" class="adr">
<span class="address1 street-address" itemprop="streetAddress">2801 DeBarr Road</span>
<span class="city locality" itemprop="addressLocality">Anchorage</span>,
<span class="state region" itemprop="addressRegion">AK</span>
<span class="zip postal-code" itemprop="postalCode">99508</span>
</div>
<div class="contacts">
<span class="tel" itemprop="telephone">Phone: (907) 276-1131</span><br />
<a href="/about/contact.dot">Contact Us</a>
</div>
</div>
</div>
</section>
</footer>
</div>
<!-- /footer-wrap -->
</div>
<!-- end page-wrap -->
<script type="text/javascript" src="//www.google.com/jsapi"></script>
</body>
</html>
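In the page above, hCard class names (`street-address`, `locality`, `region`, `postal-code`) share a `class` attribute with site-specific names (`address1`, `city`, `state`, `zip`). A minimal sketch of how such class co-occurrences could be counted is given here; the snippet is the address block copied from the page, `ClassPairCounter` is a hypothetical helper name, and this is an illustration rather than the authors' implementation.

```python
from collections import Counter
from html.parser import HTMLParser
from itertools import combinations

# Address block copied from the example page above.
SNIPPET = """
<div class="adr">
<span class="address1 street-address">2801 DeBarr Road</span>
<span class="city locality">Anchorage</span>,
<span class="state region">AK</span>
<span class="zip postal-code">99508</span>
</div>
"""


class ClassPairCounter(HTMLParser):
    """Hypothetical helper: counts pairs of CSS classes that appear together
    in the same class attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = Counter()

    def handle_starttag(self, tag, attrs):
        # Split the class attribute into tokens; sort so each pair has one canonical order.
        classes = sorted(set((dict(attrs).get("class") or "").split()))
        for pair in combinations(classes, 2):
            self.pairs[pair] += 1


counter = ClassPairCounter()
counter.feed(SNIPPET)
for pair, count in counter.pairs.items():
    print(pair, count)
```

Run over many pages, counts like these would indicate which non-hCard classes tend to accompany a given hCard class.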
μRaptor in Action
Training Phase:
0) Clean the HTML.
1) hCard DOM sub-trees extraction.
2) CSS class co-occurrence.
3) DOM patterns to CSS selectors.
4) hCard element value constraints.
ZIP code: ^[0-9]{5}(-[0-9]{4})?$
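The value constraint in step 4 can be illustrated with a short check. This is a sketch that applies the ZIP code pattern shown above; the function name `is_valid_zip` and the whitespace stripping are assumptions made for the example.

```python
import re

# Pattern taken verbatim from the value constraint above:
# five digits, optionally followed by a hyphen and four digits.
ZIP_CODE = re.compile(r"^[0-9]{5}(-[0-9]{4})?$")


def is_valid_zip(value: str) -> bool:
    """Return True if the candidate value satisfies the ZIP code constraint."""
    # Stripping surrounding whitespace is an assumption of this sketch.
    return ZIP_CODE.match(value.strip()) is not None


print(is_valid_zip("99508"))       # value from the example page
print(is_valid_zip("99508-1234"))  # ZIP+4 form
print(is_valid_zip("Co. Galway"))  # not a ZIP code
```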
<!DOCTYPE html> <!-- NEW BOOK STORE WEBPAGE -->
<html lang="en">
<head>
<title>Irish News… </title>
<meta name="description" content="News from Ireland…">
<link rel="shortcut icon" href="img/favicon.ico" type="image/ico" />
<style> … </style>
</head>
<body class="page-app home">
<div id="page-wrap">
<div id="page" class="clearfix">
…
</div>
<div id="footer-wrap" class="clearfix">
<footer>
<section class="row site-info content-inline clearfix style-1">
<div class="vcard author">
<a href="/home/"><img src="Logo.png" alt="Book Store" class="logo"></a>
<a class="url" href="http://booksgalway.ie">Website</a>
<header class="fn org">Book Store Galway</header>
<div class="contact-info">
<div class="adr">
<span class="address1 street-address">28 Shop Street</span>
<span class="city locality">Galway</span>,
<span class="state region">Co. Galway</span>
</div>
<div class="contacts">
<span class="tel">Phone: +(353) 091-1234569</span><br />
<a href="/contact-us.dot">Contact Us</a>
</div>
</div>
</div>
<div id="about_sub" class="accordion-body collapse">
<div class="accordion-inner">
<ul class="submenu">
<li>
<a href="/contact-us.dot">Contact Us</a>
</li> …
</ul>
</div>
<!-- / .submenu -->
</div>
</section>
</footer>
</div>
<!-- /footer-wrap -->
</div>
<script type="text/javascript" src="//www.google.com/jsapi"></script>
</body>
</html>
Extraction Phase:
5) hCard DOM pattern detection.
div.author > div > div > span
6) hCard elements qualification.
7) System validation.
RDF models comparison using Apache Any23.
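Step 5 can be illustrated by applying the selector `div.author > div > div > span` to the book store footer. The sketch below assumes a trimmed, well-formed copy of that footer and a deliberately small matcher that only understands `tag`, an optional single `.class`, and the child combinator `>`; the helper names are hypothetical and this is not the μRaptor implementation.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the book store footer (assumption of this sketch).
FRAGMENT = """
<div class="vcard author">
  <header class="fn org">Book Store Galway</header>
  <div class="contact-info">
    <div class="adr">
      <span class="address1 street-address">28 Shop Street</span>
      <span class="city locality">Galway</span>,
      <span class="state region">Co. Galway</span>
    </div>
    <div class="contacts">
      <span class="tel">Phone: +(353) 091-1234569</span>
    </div>
  </div>
</div>
"""


def parse_step(step):
    """Split 'div.author' into ('div', 'author'); 'span' becomes ('span', None)."""
    tag, _, cls = step.partition(".")
    return tag, cls or None


def step_matches(elem, step):
    """Check the tag name and, if given, that the class is one of the element's class tokens."""
    tag, cls = step
    return elem.tag == tag and (cls is None or cls in elem.get("class", "").split())


def select_child_chain(root, selector):
    """Evaluate a selector made only of simple steps joined by the child combinator '>'."""
    steps = [parse_step(s.strip()) for s in selector.split(">")]
    # The first step may match anywhere in the tree, including the root itself.
    current = [e for e in root.iter() if step_matches(e, steps[0])]
    # Each later step must match a direct child of the previous matches.
    for step in steps[1:]:
        current = [c for parent in current for c in parent if step_matches(c, step)]
    return current


root = ET.fromstring(FRAGMENT)
matches = select_child_chain(root, "div.author > div > div > span")
for m in matches:
    print(m.get("class"), "->", m.text)
```

The selector returns candidate elements only; deciding which hCard property each one carries is the job of the qualification step.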
Conclusions
hCard in the wild
- widely used
- but often not used properly
- long-tail distribution
- often incomplete
Evaluation
$A=\{\text{gold standard}\}$, $B=\{\mu\text{Raptor output}\}$
- $P=\frac{|A\cap B|}{|B|}$
- $R=\frac{|A\cap B|}{|A|}$
- $F_1=\frac{2 PR}{P+R}$
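The three measures can be computed directly when $A$ (gold standard) and $B$ (μRaptor output) are treated as sets. This is a minimal sketch; the function name and the toy items are made up for illustration and are not data from the evaluation.

```python
def precision_recall_f1(gold: set, predicted: set):
    """Set-based precision, recall and F1; A = gold, B = predicted."""
    overlap = len(gold & predicted)                    # |A ∩ B|
    p = overlap / len(predicted) if predicted else 0.0
    r = overlap / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Toy (property, value) pairs, hypothetical and for illustration only.
gold = {("fn", "Book Store Galway"), ("locality", "Galway"),
        ("region", "Co. Galway"), ("street-address", "28 Shop Street")}
predicted = {("fn", "Book Store Galway"), ("locality", "Galway"),
             ("tel", "Contact Us")}

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```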
μRaptor Results
- with only 30 rules:
- Precision = 0.94
- Recall = 0.7
- $F_1$ score = 0.8
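As a consistency check, assuming the reported precision and recall are rounded to the digits shown, substituting them into the $F_1$ definition reproduces the reported score:
$F_1=\frac{2\times 0.94\times 0.7}{0.94+0.7}=\frac{1.316}{1.64}\approx 0.80$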
Contact us: @Emir, @Luca, @Pierre-Yves