μRaptor: A DOM-based system with appetite for hCard elements

Emir Muñoz, Luca Costabello, Pierre-Yves Vandenbussche

GitHub

vCard example
<!DOCTYPE html>
<html class="no-js" lang="en">
  <head>
     <title>Search Results for… </title>
     <meta name="description" content="Results for…">
     <link rel="shortcut icon" href="/local/images/favicon.ico" type="image/ico" />
     <style> … </style>
  </head>
  <body class="find-a-physician page-app physicians unified interior">
     <div id="page-wrap" itemscope itemtype="http://schema.org/Place" itemid="http://schema.org/Hospital">
        <!--[if lte IE 7]> <div id="lt-ie-8-warning">
           <h1>You are using an outdated browser</h1> …
        </div>
        <script> function hide_message(){ … } </script> <![endif]-->
        <div id="page" class="clearfix">
           …
        </div>
        <!-- /page -->
        <div id="footer-wrap" class="clearfix">
           <footer>
              <section class="row site-info content-inline clearfix style-1">
                 <div class="vcard author">
                    <a href="/home/">
                    <img src="/contentAsset/raw-data/6a390782-1ca3-44c3-b463-068816ba0fc5/knockOutLogo" 
                         alt="Alaska Regional Hospital" class="logo">
                    </a>
                    <meta itemprop="url" content="http://alaskaregional.com" />
                    <header itemprop="name" class="fn org">Alaska Regional Hospital</header>
                    <div class="contact-info">
                       <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress" class="adr">
                          <span class="address1 street-address" itemprop="streetAddress">2801 DeBarr Road</span>
                          <span class="city locality" itemprop="addressLocality">Anchorage</span>,
                          <span class="state region" itemprop="addressRegion">AK</span>
                          <span class="zip postal-code" itemprop="postalCode">99508</span>
                       </div>
                       <div class="contacts">
                          <span class="tel" itemprop="telephone">Phone: (907) 276-1131</span><br />
                          <a href="/about/contact.dot">Contact Us</a>
                       </div>
                    </div>
                 </div>
              </section>
           </footer>
        </div>
        <!-- /footer-wrap -->
     </div>
     <!-- end page-wrap -->
     <script type="text/javascript" src="//www.google.com/jsapi"></script>
  </body>
</html>

μRaptor in Action

Training Phase:

0) Clean the HTML.

1) hCard DOM sub-trees extraction.

2) CSS class co-occurrence.

3) DOM patterns to CSS selectors.

4) hCard element value constraints.
ZIP code:
^[0-9]{5}(-[0-9]{4})?$
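
The training steps above can be sketched with Python's standard library. This is a minimal illustration, not μRaptor's actual code: the class `HCardTrainer`, the property subset in `HCARD_PROPS`, and the choice to keep only the root's non-hCard class in the selector are assumptions made here so that the output resembles the pattern style `div.author > div > div > span`. The input is a trimmed copy of the vCard example, and the HTML is assumed to be already cleaned (step 0).

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Elements without an end tag; they are never pushed on the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Assumption: only the hCard class names that appear in the example page.
HCARD_PROPS = {"fn", "org", "adr", "street-address", "locality",
               "region", "postal-code", "tel", "url"}
ZIP_RE = re.compile(r"^[0-9]{5}(-[0-9]{4})?$")  # value constraint from step 4


class HCardTrainer(HTMLParser):
    """Hypothetical trainer: walks .vcard sub-trees and records rules."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # open elements as (tag, classes, text_parts)
        self.root_index = None           # stack position of the current .vcard root
        self.rules = defaultdict(set)    # hCard property -> CSS selectors (step 3)
        self.values = defaultdict(list)  # hCard property -> observed text (step 4)
        self.cooccurrence = Counter()    # (property, other class) -> count (step 2)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, classes, []))
        if self.root_index is None:      # step 1: look for a sub-tree root
            if "vcard" in classes:
                self.root_index = len(self.stack) - 1
            return
        root_tag, root_classes, _ = self.stack[self.root_index]
        # Assumption: the root keeps its non-hCard classes, descendants only tags.
        root = root_tag + "".join("." + c for c in root_classes if c != "vcard")
        path = [root] + [e[0] for e in self.stack[self.root_index + 1:]]
        for prop in HCARD_PROPS.intersection(classes):
            self.rules[prop].add(" > ".join(path))
            for other in classes:
                if other not in HCARD_PROPS:
                    self.cooccurrence[(prop, other)] += 1

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or all(e[0] != tag for e in self.stack):
            return
        while True:
            name, classes, parts = self.stack.pop()
            raw = "".join(parts)
            index = len(self.stack)      # position the popped element had
            if self.root_index is not None and index > self.root_index:
                for prop in HCARD_PROPS.intersection(classes):
                    self.values[prop].append(" ".join(raw.split()))
            if index == self.root_index:
                self.root_index = None   # left the .vcard sub-tree
            if self.stack:
                self.stack[-1][2].append(raw)  # text bubbles up to the parent
            if name == tag:
                break


SAMPLE = """
<div class="vcard author">
  <a href="/home/"><img src="/logo" alt="Alaska Regional Hospital" class="logo"></a>
  <header class="fn org">Alaska Regional Hospital</header>
  <div class="contact-info">
    <div class="adr">
      <span class="address1 street-address">2801 DeBarr Road</span>
      <span class="city locality">Anchorage</span>,
      <span class="state region">AK</span>
      <span class="zip postal-code">99508</span>
    </div>
    <div class="contacts">
      <span class="tel">Phone: (907) 276-1131</span><br />
    </div>
  </div>
</div>
"""

trainer = HCardTrainer()
trainer.feed(SAMPLE)
trainer.close()
for prop in sorted(trainer.rules):
    print(prop, sorted(trainer.rules[prop]), trainer.values[prop])
print("ZIP values valid:", all(ZIP_RE.match(v) for v in trainer.values["postal-code"]))
```

On this input, the address and telephone spans all yield the same selector, which is why a single DOM pattern can cover several hCard elements.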

<!DOCTYPE html>    <!-- NEW BOOK STORE WEBPAGE -->
<html lang="en">
   <head>
      <title>Irish News… </title>
      <meta name="description" content="News from Ireland…">
      <link rel="shortcut icon" href="img/favicon.ico" type="image/ico" />
      <style> … </style>
   </head>
   <body class="page-app home">
      <div id="page-wrap">
         <div id="page" class="clearfix">
            …
         </div>
         <div id="footer-wrap" class="clearfix">
            <footer>
               <section class="row site-info content-inline clearfix style-1">
                  <div class="vcard author">
                     <a href="/home/"><img src="Logo.png" alt="Book Store" class="logo"></a>
                     <a class="url" href="http://booksgalway.ie">Website</a>
                     <header class="fn org">Book Store Galway</header>
                     <div class="contact-info">
                        <div class="adr">
                           <span class="address1 street-address">28 Shop Street</span>
                           <span class="city locality">Galway</span>,
                           <span class="state region">Co. Galway</span>
                        </div>
                        <div class="contacts">
                           <span class="tel">Phone: +(353) 091-1234569</span><br />
                           <a href="/contact-us.dot">Contact Us</a>
                        </div>
                     </div>
                  </div>
                  <div id="about_sub" class="accordion-body collapse">
                     <div class="accordion-inner">
                        <ul class="submenu">
                           <li>
                              <a href="/contact-us.dot">Contact Us</a>
                           </li> …
                        </ul>
                     </div>
                     <!-- / .submenu -->
                  </div>
               </section>
            </footer>
         </div>
         <!-- /footer-wrap -->
      </div>
      <script type="text/javascript" src="//www.google.com/jsapi"></script>
   </body>
</html>

Extraction Phase:

5) hCard DOM pattern detection.
div.author > div > div > span

6) hCard elements qualification.

7) System validation.
RDF models comparison using Apache Any23.
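
Steps 5 and 6 can be sketched as follows, again with the standard library only. This is an illustration under stated assumptions: `PatternMatcher`, `parse_selector`, and `qualify` are hypothetical names, the matcher supports only chains of `tag.class` steps joined by `>`, and the `tel` regex is invented here for the example (only the ZIP regex comes from the slides). The input is a trimmed copy of the book store page.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

CONSTRAINTS = {
    "postal-code": re.compile(r"^[0-9]{5}(-[0-9]{4})?$"),  # from the slides
    "tel": re.compile(r"^(Phone:\s*)?\+?[0-9() -]{7,}$"),   # assumption, not from the slides
}


def parse_selector(selector):
    """Split 'div.author > div > span' into [(tag, {classes}), ...]."""
    steps = []
    for step in selector.split(">"):
        tag, *classes = step.strip().split(".")
        steps.append((tag, set(classes)))
    return steps


def qualify(text):
    """Return every hCard property whose value constraint accepts the text."""
    return [prop for prop, rx in CONSTRAINTS.items() if rx.match(text)]


class PatternMatcher(HTMLParser):
    """Collects the text of elements matched by a child-combinator selector."""

    def __init__(self, selector):
        super().__init__()
        self.steps = parse_selector(selector)
        self.stack = []    # open elements as (tag, classes, text_parts, matched)
        self.matches = []  # normalized text of each matched element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        chain = [(e[0], e[1]) for e in self.stack] + [(tag, classes)]
        tail = chain[-len(self.steps):]
        # The element and its direct ancestors must match the steps one by one.
        matched = len(tail) == len(self.steps) and all(
            want_tag == have_tag and want_classes <= have_classes
            for (want_tag, want_classes), (have_tag, have_classes)
            in zip(self.steps, tail))
        self.stack.append((tag, classes, [], matched))

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or all(e[0] != tag for e in self.stack):
            return
        while True:
            name, _, parts, matched = self.stack.pop()
            raw = "".join(parts)
            if matched:
                self.matches.append(" ".join(raw.split()))
            if self.stack:
                self.stack[-1][2].append(raw)
            if name == tag:
                break


NEW_PAGE = """
<footer><section class="row site-info">
  <div class="vcard author">
    <a href="/home/"><img src="Logo.png" alt="Book Store" class="logo"></a>
    <header class="fn org">Book Store Galway</header>
    <div class="contact-info">
      <div class="adr">
        <span class="address1 street-address">28 Shop Street</span>
        <span class="city locality">Galway</span>,
        <span class="state region">Co. Galway</span>
      </div>
      <div class="contacts">
        <span class="tel">Phone: +(353) 091-1234569</span><br />
        <a href="/contact-us.dot">Contact Us</a>
      </div>
    </div>
  </div>
  <div id="about_sub" class="accordion-body collapse">
    <div class="accordion-inner"><ul class="submenu">
      <li><a href="/contact-us.dot">Contact Us</a></li>
    </ul></div>
  </div>
</section></footer>
"""

matcher = PatternMatcher("div.author > div > div > span")
matcher.feed(NEW_PAGE)
matcher.close()
for text in matcher.matches:
    print(text, "->", qualify(text) or "unqualified")
```

Value constraints alone leave the address spans unqualified in this sketch; a fuller qualification step would presumably also use the class co-occurrence statistics from training.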

Conclusions

hCard in the wild

  • they are widely used
  • but often not used properly
  • long-tail distribution
  • incompleteness

Evaluation

$A=\{\text{gold standard}\}$, $B=\{\mu Raptor\}$

  • $P=\frac{|A\cap B|}{|B|}$
  • $R=\frac{|A\cap B|}{|A|}$
  • $F_1=\frac{2 PR}{P+R}$
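
A minimal sketch of these set-based metrics in Python. The function name `precision_recall_f1` and the toy statements are hypothetical; in the actual evaluation, A and B would hold the statements from the gold standard and from μRaptor's output.

```python
def precision_recall_f1(gold, predicted):
    """Set-based precision, recall and F1 (A = gold, B = predicted)."""
    gold, predicted = set(gold), set(predicted)
    overlap = len(gold & predicted)  # |A ∩ B|
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data, invented for illustration only.
gold = {"fn=Book Store Galway", "street-address=28 Shop Street",
        "locality=Galway", "tel=+(353) 091-1234569"}
predicted = {"fn=Book Store Galway", "locality=Galway", "region=Galway"}

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Two of the three predicted statements are in the gold set and two of the four gold statements are found, so the toy example gives P = 2/3 and R = 1/2.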

μRaptor Results

  • with only 30 rules:
  • Precision = 0.94
  • Recall = 0.7
  • $F_1$ score = 0.8

Contact us: @Emir, @Luca, @Pierre-Yves
