

dportabella commented on August 29, 2024

This code:

import requests
from mdr import MDR
from lxml import etree
from pathlib2 import Path
import sys

mdr = MDR()
# text = requests.get('http://www.yelp.co.uk/biz/the-ledbury-london').text.encode('utf8')
text = Path('/tmp/mdr/tests/samples/htmlpage0.html').read_text(encoding='utf8')

candidates, doc = mdr.list_candidates(text)
# Print the XPath of every candidate region found on the page.
for c in candidates:
    print doc.getpath(c)

def toString(tree):
    return etree.tostring(tree, pretty_print=True).replace("\n", " ")

for candidate in candidates:
    print "+++ candidate: " + doc.getpath(candidate)
    try:
        seed_records, mappings = mdr.extract(candidate)
        if (seed_records is not None):
            for seed_record in seed_records:
                print "   +++ seed_record: " + toString(seed_record)
                # todo: how do we print the tree grammar corresponding to this candidate+seed_record?
                for record, mapping in mappings.iteritems():
                    print "      +++ record: " + toString(record)
                    for k in mapping:
                        print "      +++ mapping: " + toString(k)
                        # todo: how do we extract the data from mapping?
    except Exception:
        # Report the failure for this candidate and move on to the next one.
        print "Unexpected error: %s" % sys.exc_info()[0]

print "+++ END"
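
On the second todo, one possible direction is sketched below. It is only a sketch and not MDR's documented API. It assumes that each mapping behaves like a dict whose keys are elements of the seed record and whose values are the aligned elements of another record, and that those elements expose the ElementTree methods `itertext()` and `get()`, as lxml elements do. The helper names `element_text` and `mapping_to_rows` are hypothetical, and the stdlib `xml.etree.ElementTree` stands in for lxml so the example is self-contained.

```python
import xml.etree.ElementTree as ET


def element_text(element):
    """Return all text inside the element, with whitespace runs collapsed."""
    return " ".join("".join(element.itertext()).split())


def mapping_to_rows(mapping):
    """Flatten an assumed {seed_element: record_element} dict into tuples of
    (tag, class attribute, seed text, record text)."""
    rows = []
    for seed_el, record_el in mapping.items():
        rows.append((
            seed_el.tag,
            seed_el.get("class"),
            element_text(seed_el),
            element_text(record_el),
        ))
    return rows


# Hypothetical stand-in data shaped like the spans in the sample page.
seed = ET.fromstring('<span class="offscreen">Lobster - good (4 stars)</span>')
other = ET.fromstring('<span class="offscreen">  Seared sea\n scallops </span>')

rows = mapping_to_rows({seed: other})
print(rows)
```

If the assumption about the shape of mapping holds, calling `mapping_to_rows(mapping)` inside the inner loop would give one row per aligned element instead of the serialized HTML.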

Running the MDR script on htmlpage0.html produces (output truncated):

/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[3]/div[1]/div[2]/ul
/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]
/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]
/html/body/div[2]
/html/body/div[2]/div[3]/div[1]/div/div[3]/div[2]/div/div[3]
/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]/div[1]/table/tbody
...

+++ candidate: /html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[3]/div[1]/div[2]/ul
   +++ seed_record: <li>                 <div class="review review-with-no-actions" data-review-id="8spqL77wsNZWYeUbz_FRbg" itemprop="review" itemscope="" itemtype="http://schema.org/Review">             <meta itemprop="author" content="Raquel Y."/>             <div class="review-sidebar">         <div class="review-sidebar-content">                 <div class="ypassport media-block clearfix">         <div class="media-avatar">        <div class="photo-box pb-60s" data-hovercard-id="COTAMr3EZ2Ib_mab4t52Ew">                 <a href="/user_details?userid=qUUMmANmVRuQxagzH8irhA">       <img alt="Raquel Y." class="photo-box-img" height="60" src="//s3-media3.fl.yelpcdn.com/photo/mdr6efq3Ut1R5hu2FKNLuQ/60s.jpg" width="60"/>          </a>      </div>            </div>         <div class="media-story">                 <ul class="user-passport-info">         <li class="user-name">                                 <a class="user-display-name" href="/user_details?userid=qUUMmANmVRuQxagzH8irhA" data-hovercard-id="COTAMr3EZ2Ib_mab4t52Ew">Raquel Y.</a>          </li>         <li class="user-location">             <b>San Leandro, CA</b>         </li>     </ul>                  <ul class="user-passport-stats">             <li class="is-elite">                 <a href="/elite">Elite &#8217;14</a>             </li>         <li class="friend-count">             <span class="i-wrap ig-wrap-common i-friends-orange-common-wrap"><i class="i ig-common i-friends-orange-common"/> <b>653</b> friends</span>         </li>         <li class="review-count">             <span class="i-wrap ig-wrap-common i-star-orange-common-wrap"><i class="i ig-common i-star-orange-common"/> <b>434</b> reviews</span>         </li>     </ul>          </div>     </div>                      <ul class="iconed-list action-link-list">                  <li class="iconed-list-item">                          <a class="action-link send-to-friend" data-pop-uri="/send_to_friend/review/8spqL77wsNZWYeUbz_FRbg" 
href="/biz_share?bizid=WavvLdfdP6g8aZTtbBQHTw&amp;return_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco&amp;reviewid=8spqL77wsNZWYeUbz_FRbg">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-share-common"/>         </div>         <div class="iconed-list-story">             Share review         </div>     </a>                  </li>                  <li class="iconed-list-item">                          <a class="action-link send-compliment" href="/thanx?complimentable_id=8spqL77wsNZWYeUbz_FRbg&amp;complimentable_type=REVIEW&amp;previous_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco&amp;user_id=qUUMmANmVRuQxagzH8irhA">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-compliment-common"/>         </div>         <div class="iconed-list-story">             Compliment         </div>     </a>                  </li>                  <li class="iconed-list-item">                          <a class="action-link send-pm" href="/mail?action_send_form=1&amp;dst=qUUMmANmVRuQxagzH8irhA&amp;return_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco" rel="Raquel Y.">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-message-common"/>         </div>         <div class="iconed-list-story">             Send message         </div>     </a>                  </li>                  <li class="iconed-list-item manage-follow-container">                           <a class="action-link manage-following-add" href="/following_user/add?dst_user_id=qUUMmANmVRuQxagzH8irhA&amp;previous_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco" rel="qUUMmANmVRuQxagzH8irhA" style="">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-follow-common"/>         </div>         <div class="iconed-list-story">             Follow Raquel Y.         
</div>     </a>                            <a class="action-link manage-following-remove" href="/following_user/remove?dst_user_id=qUUMmANmVRuQxagzH8irhA&amp;previous_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco" rel="qUUMmANmVRuQxagzH8irhA" style="display: none;">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-unfollow-common"/>         </div>         <div class="iconed-list-story">             Stop following Raquel Y.         </div>     </a>                  </li>     </ul>           </div>      </div>          <div class="review-wrapper">                <div class="review-content">             <div class="biz-rating biz-rating-very-large clearfix">         <div itemprop="reviewRating" itemscope="" itemtype="http://schema.org/Rating">          <div class="rating-very-large">         <i class="star-img stars_5" title="5.0 star rating">             <img alt="5.0 star rating" class="offscreen" height="303" src="http://s3-media3.fl.yelpcdn.com/assets/2/www/img/c2252a4cd43e/ico/stars/v2/stars_map.png" width="84"/>         </i>             <meta itemprop="ratingValue" content="5.0"/>     </div>           </div>             <span class="rating-qualifier">             <meta itemprop="datePublished" content="2014-06-09"/>         6/9/2014     </span>      </div>                  <span class="review-tags">                         <span class="i-wrap ig-wrap-common i-checkin-burst-blue-small-common-wrap badge checkin checkin-irregular"><i class="i ig-common i-checkin-burst-blue-small-common"/> 1 check-in here</span>                          <div class="review-favorites-lists">             <span class="i-wrap ig-wrap-common i-list-common-wrap badge"><i class="i ig-common i-list-common"/>     Listed in <a href="/list/best-of-sf-san-francisco-17" title="Must do, see, and eat! Lived here for 25 years... trust me :)">Best of SF</a>, <a href="/list/fine-dining-in-the-bay-area-san-francisco" title="Ouu la la! 
Every girl likes some wine and dine.. ;)">Fine Dining in the Bay Area</a> </span>         </div>          </span>                  <p class="review_comment ieSucks" itemprop="description" lang="en">Dear Gary Danko,<br/><br/>You are an amazing restaurant. &#160;From the fabulous presentation and taste of the food, to the elegant restaurant and incredible hospitality of the servers, you've given us a wonderful evening that made us wanting to come back for more. &#160;<br/><br/>What makes Gary Danko special: Other than the incredible food, ambiance, decor, and service, I like Gary Danko's system where you can order 3, 4, or 5 dishes of any category for a fixed price. &#160;Since I am not much of a sweet tooth and I love seafood, I was able to order 3 seafood entrees! &#160;Though everyone raved about the lobster which I have to disagree since it was probably my least favorite of the 3. &#160;The sauce wasn't as dynamic as you would expect a place like Gary Danko to have, though it was still good. &#160;<br/><br/>Yelpers, I have to say that you MUST order the scallops!!! I LOVE scallops, and Gary Danko's scallops were PERFECT. &#160;They cooked it perfectly to make the texture of the scallops just right, and the sauce complemented them amazingly that felt heaven in my mouth. &#160;If I come to Gary Danko again, I am definitely ordering the scallops again. &#160;It was my favorite dish of the night. &#160;My boyfriend on the other hand, thought the oysters was the best dish. &#160;They were very fresh and rich, topped with caviar. 
&#160;I wish I had more than one piece, but my boyfriend looked like he was enjoying it too much ;) I'd recommend ordering both!<br/><br/>My ratings for the dishes we had: <br/>Scallops 5/5<br/>Oysters 5/5<br/>Pork 4.5/5<br/>Sea bass 4.5/5<br/>Lamb 4/5<br/>Lobster 4/5<br/><br/>Free: bison bite, unlimited bread, mini dessert samples, birthday cake for my boyfriend, and banana cake to take home!<br/><br/>Also, the bathroom at Gary Danko is also beautiful. &#160;They use nice soft towels, not napkins. Can you believe that?! <br/><br/>Thank you for a wonderful and amazing night. &#160;I would have to say this is one of the best restaurants in San Francisco. &#160;I hope to come again!<br/><br/>Sincerely,<br/>Raquel Y.</p>                     <ul class="photo-box-grid clearfix js-content-expandable">                 <li>                         <div class="photo-box has-overlay">      <img alt="Great complimentary assorted dessert samplers!" class="photo-box-img" height="348" src="//s3-media1.fl.yelpcdn.com/bphoto/x44jM-xFds2Pjk1MhpoNbA/348s.jpg" width="348"/>                <div class="photo-box-overlay js-overlay">                 Great complimentary assorted dessert samplers!             
</div>        <a class="biz-shim" href="/biz_photos/gary-danko-san-francisco?filter_by_userid=True&amp;select=x44jM-xFds2Pjk1MhpoNbA&amp;userid=qUUMmANmVRuQxagzH8irhA">             <span class="offscreen">Great complimentary assorted dessert samplers!</span>     </a>      </div>                  </li>                 <li>                         <div class="photo-box has-overlay">      <img alt="Lobster - good (4 stars)" class="photo-box-img" height="168" src="//s3-media4.fl.yelpcdn.com/bphoto/KSPgZmw-PVa9cbtj6rQ2hw/168s.jpg" width="168"/>                <div class="photo-box-overlay js-overlay">                 Lobster - good (4 stars)             </div>        <a class="biz-shim" href="/biz_photos/gary-danko-san-francisco?filter_by_userid=True&amp;select=KSPgZmw-PVa9cbtj6rQ2hw&amp;userid=qUUMmANmVRuQxagzH8irhA">             <span class="offscreen">Lobster - good (4 stars)</span>     </a>      </div>                  </li>                 <li>                         <div class="photo-box has-overlay">      <img alt="Scallops - AMAZING! (5 stars)" class="photo-box-img" height="168" src="//s3-media2.fl.yelpcdn.com/bphoto/O-DhZGNzVKTY8nDup0KYrg/168s.jpg" width="168"/>                <div class="photo-box-overlay js-overlay">                 Scallops - AMAZING! (5 stars)             </div>        <a class="biz-shim" href="/biz_photos/gary-danko-san-francisco?filter_by_userid=True&amp;select=O-DhZGNzVKTY8nDup0KYrg&amp;userid=qUUMmANmVRuQxagzH8irhA">             <span class="offscreen">Scallops - AMAZING! 
(5 stars)</span>     </a>      </div>                  </li>                  <li class="more-review-photos">                     <a href="/biz_photos/gary-danko-san-francisco?filter_by_userid=True&amp;select=x44jM-xFds2Pjk1MhpoNbA&amp;userid=qUUMmANmVRuQxagzH8irhA" class="js-content-expander">See all photos from Raquel Y.</a>                 </li>         </ul>      </div>     <div class="review-footer clearfix">                     <div class="rateReview ufc-feedback clearfix" data-review-id="8spqL77wsNZWYeUbz_FRbg">                     <p class="review-intro review-message">         Was this review &#8230;?     </p>                  <ul class="big-ufc" data-csrf-token="e105d849c006a5933a2d66fd65206dc9e0da76a63f541552529a3b1f88a94cfb">        <li class="ufc-stat inline-block ytype">         <a href="javascript:;" rel="useful" class="ybtn ybtn-small useful ytype"><span class="i-wrap ig-wrap-common i-big-ufc-useful-common-wrap button-content"><i class="i ig-common i-big-ufc-useful-common"/>     <span class="vote-type">Useful</span>     <span class="count">6</span></span></a>     </li>         <li class="ufc-stat inline-block ytype">         <a href="javascript:;" rel="funny" class="ybtn ybtn-small funny ytype"><span class="i-wrap ig-wrap-common i-big-ufc-funny-common-wrap button-content"><i class="i ig-common i-big-ufc-funny-common"/>     <span class="vote-type">Funny</span>     <span class="count">2</span></span></a>     </li>         <li class="ufc-stat inline-block ytype">         <a href="javascript:;" rel="cool" class="ybtn ybtn-small cool ytype"><span class="i-wrap ig-wrap-common i-big-ufc-cool-common-wrap button-content"><i class="i ig-common i-big-ufc-cool-common"/>     <span class="vote-type">Cool</span>     <span class="count">6</span></span></a>     </li>                  </ul>         </div>      </div>            </div>     </div>          </li>          
      +++ record: <span class="offscreen">Lobster - good (4 stars)</span>      
      +++ mapping: <span class="offscreen">Cheesecake</span>      
      +++ mapping: <span class="offscreen">Seared sea scallops</span>      
      +++ mapping: <span class="offscreen">I almost felt like I was in the movie Scarface - it reminded me of a smaller Mr. Chows- NYC - uptown location.</span>      
      +++ mapping: <span class="offscreen">Filet of beef with caramelized onion</span>      
      +++ mapping: <span class="offscreen">Chocolate souffl&#233; with duo of sauces</span>      
      +++ mapping: <span class="offscreen">Lemon souffle</span>      
      +++ mapping: <span class="offscreen">Scallops</span>      
...
      +++ record: <li>                 <div class="review review-with-no-actions" data-review-id="8spqL77wsNZWYeUbz_FRbg" itemprop="review" itemscope="" itemtype="http://schema.org/Review">             <meta itemprop="author" content="Raquel Y."/>             <div class="review-sidebar">         <div class="review-sidebar-content">                 <div class="ypassport media-block clearfix">         <div class="media-avatar">        <div class="photo-box pb-60s" data-hovercard-id="COTAMr3EZ2Ib_mab4t52Ew">                 <a href="/user_details?userid=qUUMmANmVRuQxagzH8irhA">       <img alt="Raquel Y." class="photo-box-img" height="60" src="//s3-media3.fl.yelpcdn.com/photo/mdr6efq3Ut1R5hu2FKNLuQ/60s.jpg" width="60"/>          </a>      </div>            </div>         <div class="media-story">                 <ul class="user-passport-info">         <li class="user-name">                                 <a class="user-display-name" href="/user_details?userid=qUUMmANmVRuQxagzH8irhA" data-hovercard-id="COTAMr3EZ2Ib_mab4t52Ew">Raquel Y.</a>          </li>         <li class="user-location">             <b>San Leandro, CA</b>         </li>     </ul>                  <ul class="user-passport-stats">             <li class="is-elite">                 <a href="/elite">Elite &#8217;14</a>             </li>         <li class="friend-count">             <span class="i-wrap ig-wrap-common i-friends-orange-common-wrap"><i class="i ig-common i-friends-orange-common"/> <b>653</b> friends</span>         </li>         <li class="review-count">             <span class="i-wrap ig-wrap-common i-star-orange-common-wrap"><i class="i ig-common i-star-orange-common"/> <b>434</b> reviews</span>         </li>     </ul>          </div>     </div>                      <ul class="iconed-list action-link-list">                  <li class="iconed-list-item">                          <a class="action-link send-to-friend" data-pop-uri="/send_to_friend/review/8spqL77wsNZWYeUbz_FRbg" 
href="/biz_share?bizid=WavvLdfdP6g8aZTtbBQHTw&amp;return_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco&amp;reviewid=8spqL77wsNZWYeUbz_FRbg">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-share-common"/>         </div>         <div class="iconed-list-story">             Share review         </div>     </a>                  </li>                  <li class="iconed-list-item">                          <a class="action-link send-compliment" href="/thanx?complimentable_id=8spqL77wsNZWYeUbz_FRbg&amp;complimentable_type=REVIEW&amp;previous_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco&amp;user_id=qUUMmANmVRuQxagzH8irhA">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-compliment-common"/>         </div>         <div class="iconed-list-story">             Compliment         </div>     </a>                  </li>                  <li class="iconed-list-item">                          <a class="action-link send-pm" href="/mail?action_send_form=1&amp;dst=qUUMmANmVRuQxagzH8irhA&amp;return_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco" rel="Raquel Y.">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-message-common"/>         </div>         <div class="iconed-list-story">             Send message         </div>     </a>                  </li>                  <li class="iconed-list-item manage-follow-container">                           <a class="action-link manage-following-add" href="/following_user/add?dst_user_id=qUUMmANmVRuQxagzH8irhA&amp;previous_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco" rel="qUUMmANmVRuQxagzH8irhA" style="">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-follow-common"/>         </div>         <div class="iconed-list-story">             Follow Raquel Y.         
</div>     </a>                            <a class="action-link manage-following-remove" href="/following_user/remove?dst_user_id=qUUMmANmVRuQxagzH8irhA&amp;previous_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco" rel="qUUMmANmVRuQxagzH8irhA" style="display: none;">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-unfollow-common"/>         </div>         <div class="iconed-list-story">             Stop following Raquel Y.         </div>     </a>                  </li>     </ul>           </div>      </div>          <div class="review-wrapper">                <div class="review-content">             <div class="biz-rating biz-rating-very-large clearfix">         <div itemprop="reviewRating" itemscope="" itemtype="http://schema.org/Rating">          <div class="rating-very-large">         <i class="star-img stars_5" title="5.0 star rating">             <img alt="5.0 star rating" class="offscreen" height="303" src="http://s3-media3.fl.yelpcdn.com/assets/2/www/img/c2252a4cd43e/ico/stars/v2/stars_map.png" width="84"/>         </i>             <meta itemprop="ratingValue" content="5.0"/>     </div>           </div>             <span class="rating-qualifier">             <meta itemprop="datePublished" content="2014-06-09"/>         6/9/2014     </span>      </div>                  <span class="review-tags">                         <span class="i-wrap ig-wrap-common i-checkin-burst-blue-small-common-wrap badge checkin checkin-irregular"><i class="i ig-common i-checkin-burst-blue-small-common"/> 1 check-in here</span>                          <div class="review-favorites-lists">             <span class="i-wrap ig-wrap-common i-list-common-wrap badge"><i class="i ig-common i-list-common"/>     Listed in <a href="/list/best-of-sf-san-francisco-17" title="Must do, see, and eat! Lived here for 25 years... trust me :)">Best of SF</a>, <a href="/list/fine-dining-in-the-bay-area-san-francisco" title="Ouu la la! 
Every girl likes some wine and dine.. ;)">Fine Dining in the Bay Area</a> </span>         </div>          </span>                  <p class="review_comment ieSucks" itemprop="description" lang="en">Dear Gary Danko,<br/><br/>You are an amazing restaurant. &#160;From the fabulous presentation and taste of the food, to the elegant restaurant and incredible hospitality of the servers, you've given us a wonderful evening that made us wanting to come back for more. &#160;<br/><br/>What makes Gary Danko special: Other than the incredible food, ambiance, decor, and service, I like Gary Danko's system where you can order 3, 4, or 5 dishes of any category for a fixed price. &#160;Since I am not much of a sweet tooth and I love seafood, I was able to order 3 seafood entrees! &#160;Though everyone raved about the lobster which I have to disagree since it was probably my least favorite of the 3. &#160;The sauce wasn't as dynamic as you would expect a place like Gary Danko to have, though it was still good. &#160;<br/><br/>Yelpers, I have to say that you MUST order the scallops!!! I LOVE scallops, and Gary Danko's scallops were PERFECT. &#160;They cooked it perfectly to make the texture of the scallops just right, and the sauce complemented them amazingly that felt heaven in my mouth. &#160;If I come to Gary Danko again, I am definitely ordering the scallops again. &#160;It was my favorite dish of the night. &#160;My boyfriend on the other hand, thought the oysters was the best dish. &#160;They were very fresh and rich, topped with caviar. 
&#160;I wish I had more than one piece, but my boyfriend looked like he was enjoying it too much ;) I'd recommend ordering both!<br/><br/>My ratings for the dishes we had: <br/>Scallops 5/5<br/>Oysters 5/5<br/>Pork 4.5/5<br/>Sea bass 4.5/5<br/>Lamb 4/5<br/>Lobster 4/5<br/><br/>Free: bison bite, unlimited bread, mini dessert samples, birthday cake for my boyfriend, and banana cake to take home!<br/><br/>Also, the bathroom at Gary Danko is also beautiful. &#160;They use nice soft towels, not napkins. Can you believe that?! <br/><br/>Thank you for a wonderful and amazing night. &#160;I would have to say this is one of the best restaurants in San Francisco. &#160;I hope to come again!<br/><br/>Sincerely,<br/>Raquel Y.</p>                     <ul class="photo-box-grid clearfix js-content-expandable">                 <li>                         <div class="photo-box has-overlay">      <img alt="Great complimentary assorted dessert samplers!" class="photo-box-img" height="348" src="//s3-media1.fl.yelpcdn.com/bphoto/x44jM-xFds2Pjk1MhpoNbA/348s.jpg" width="348"/>                <div class="photo-box-overlay js-overlay">                 Great complimentary assorted dessert samplers!             
</div>        <a class="biz-shim" href="/biz_photos/gary-danko-san-francisco?filter_by_userid=True&amp;select=x44jM-xFds2Pjk1MhpoNbA&amp;userid=qUUMmANmVRuQxagzH8irhA">             <span class="offscreen">Great complimentary assorted dessert samplers!</span>     </a>      </div>                  </li>                 <li>                         <div class="photo-box has-overlay">      <img alt="Lobster - good (4 stars)" class="photo-box-img" height="168" src="//s3-media4.fl.yelpcdn.com/bphoto/KSPgZmw-PVa9cbtj6rQ2hw/168s.jpg" width="168"/>                <div class="photo-box-overlay js-overlay">                 Lobster - good (4 stars)             </div>        <a class="biz-shim" href="/biz_photos/gary-danko-san-francisco?filter_by_userid=True&amp;select=KSPgZmw-PVa9cbtj6rQ2hw&amp;userid=qUUMmANmVRuQxagzH8irhA">             <span class="offscreen">Lobster - good (4 stars)</span>     </a>      </div>                  </li>                 <li>                         <div class="photo-box has-overlay">      <img alt="Scallops - AMAZING! (5 stars)" class="photo-box-img" height="168" src="//s3-media2.fl.yelpcdn.com/bphoto/O-DhZGNzVKTY8nDup0KYrg/168s.jpg" width="168"/>                <div class="photo-box-overlay js-overlay">                 Scallops - AMAZING! (5 stars)             </div>        <a class="biz-shim" href="/biz_photos/gary-danko-san-francisco?filter_by_userid=True&amp;select=O-DhZGNzVKTY8nDup0KYrg&amp;userid=qUUMmANmVRuQxagzH8irhA">             <span class="offscreen">Scallops - AMAZING! 
(5 stars)</span>     </a>      </div>                  </li>                  <li class="more-review-photos">                     <a href="/biz_photos/gary-danko-san-francisco?filter_by_userid=True&amp;select=x44jM-xFds2Pjk1MhpoNbA&amp;userid=qUUMmANmVRuQxagzH8irhA" class="js-content-expander">See all photos from Raquel Y.</a>                 </li>         </ul>      </div>     <div class="review-footer clearfix">                     <div class="rateReview ufc-feedback clearfix" data-review-id="8spqL77wsNZWYeUbz_FRbg">                     <p class="review-intro review-message">         Was this review &#8230;?     </p>                  <ul class="big-ufc" data-csrf-token="e105d849c006a5933a2d66fd65206dc9e0da76a63f541552529a3b1f88a94cfb">        <li class="ufc-stat inline-block ytype">         <a href="javascript:;" rel="useful" class="ybtn ybtn-small useful ytype"><span class="i-wrap ig-wrap-common i-big-ufc-useful-common-wrap button-content"><i class="i ig-common i-big-ufc-useful-common"/>     <span class="vote-type">Useful</span>     <span class="count">6</span></span></a>     </li>         <li class="ufc-stat inline-block ytype">         <a href="javascript:;" rel="funny" class="ybtn ybtn-small funny ytype"><span class="i-wrap ig-wrap-common i-big-ufc-funny-common-wrap button-content"><i class="i ig-common i-big-ufc-funny-common"/>     <span class="vote-type">Funny</span>     <span class="count">2</span></span></a>     </li>         <li class="ufc-stat inline-block ytype">         <a href="javascript:;" rel="cool" class="ybtn ybtn-small cool ytype"><span class="i-wrap ig-wrap-common i-big-ufc-cool-common-wrap button-content"><i class="i ig-common i-big-ufc-cool-common"/>     <span class="vote-type">Cool</span>     <span class="count">6</span></span></a>     </li>                  </ul>         </div>      </div>            </div>     </div>          </li>          
      +++ mapping: <li>                 <div class="review review-with-no-actions" data-review-id="Yx2OGU9pYACbTDTQTekoVw" itemprop="review" itemscope="" itemtype="http://schema.org/Review">             <meta itemprop="author" content="Jim G."/>             <div class="review-sidebar">         <div class="review-sidebar-content">                 <div class="ypassport media-block clearfix">         <div class="media-avatar">        <div class="photo-box pb-60s" data-hovercard-id="-WSpJjdedoEVIXRjbasrQg">                 <a href="/user_details?userid=0QdwQLVxZpgy9Qb2Qakflw">       <img alt="Jim G." class="photo-box-img" height="60" src="//s3-media4.fl.yelpcdn.com/photo/A5Ujjhu_lX9IYQsVhdd87w/60s.jpg" width="60"/>          </a>      </div>            </div>         <div class="media-story">                 <ul class="user-passport-info">         <li class="user-name">                                 <a class="user-display-name" href="/user_details?userid=0QdwQLVxZpgy9Qb2Qakflw" data-hovercard-id="-WSpJjdedoEVIXRjbasrQg">Jim G.</a>          </li>         <li class="user-location">             <b>West Hills, Los Angeles, CA</b>         </li>     </ul>                  <ul class="user-passport-stats">             <li class="is-elite">                 <a href="/elite">Elite &#8217;14</a>             </li>         <li class="friend-count">             <span class="i-wrap ig-wrap-common i-friends-orange-common-wrap"><i class="i ig-common i-friends-orange-common"/> <b>376</b> friends</span>         </li>         <li class="review-count">             <span class="i-wrap ig-wrap-common i-star-orange-common-wrap"><i class="i ig-common i-star-orange-common"/> <b>393</b> reviews</span>         </li>     </ul>          </div>     </div>                      <ul class="iconed-list action-link-list">                  <li class="iconed-list-item">                          <a class="action-link send-to-friend" data-pop-uri="/send_to_friend/review/Yx2OGU9pYACbTDTQTekoVw" 
href="/biz_share?bizid=WavvLdfdP6g8aZTtbBQHTw&amp;return_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco&amp;reviewid=Yx2OGU9pYACbTDTQTekoVw">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-share-common"/>         </div>         <div class="iconed-list-story">             Share review         </div>     </a>                  </li>                  <li class="iconed-list-item">                          <a class="action-link send-compliment" href="/thanx?complimentable_id=Yx2OGU9pYACbTDTQTekoVw&amp;complimentable_type=REVIEW&amp;previous_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco&amp;user_id=0QdwQLVxZpgy9Qb2Qakflw">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-compliment-common"/>         </div>         <div class="iconed-list-story">             Compliment         </div>     </a>                  </li>                  <li class="iconed-list-item">                          <a class="action-link send-pm" href="/mail?action_send_form=1&amp;dst=0QdwQLVxZpgy9Qb2Qakflw&amp;return_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco" rel="Jim G.">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-message-common"/>         </div>         <div class="iconed-list-story">             Send message         </div>     </a>                  </li>                  <li class="iconed-list-item manage-follow-container">                           <a class="action-link manage-following-add" href="/following_user/add?dst_user_id=0QdwQLVxZpgy9Qb2Qakflw&amp;previous_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco" rel="0QdwQLVxZpgy9Qb2Qakflw" style="">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-follow-common"/>         </div>         <div class="iconed-list-story">             Follow Jim G.         
</div>     </a>                            <a class="action-link manage-following-remove" href="/following_user/remove?dst_user_id=0QdwQLVxZpgy9Qb2Qakflw&amp;previous_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco" rel="0QdwQLVxZpgy9Qb2Qakflw" style="display: none;">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-unfollow-common"/>         </div>         <div class="iconed-list-story">             Stop following Jim G.         </div>     </a>                  </li>     </ul>           </div>      </div>          <div class="review-wrapper">                <div class="review-content">             <div class="biz-rating biz-rating-very-large clearfix">         <div itemprop="reviewRating" itemscope="" itemtype="http://schema.org/Rating">          <div class="rating-very-large">         <i class="star-img stars_5" title="5.0 star rating">             <img alt="5.0 star rating" class="offscreen" height="303" src="http://s3-media3.fl.yelpcdn.com/assets/2/www/img/c2252a4cd43e/ico/stars/v2/stars_map.png" width="84"/>         </i>             <meta itemprop="ratingValue" content="5.0"/>     </div>           </div>             <span class="rating-qualifier">             <meta itemprop="datePublished" content="2014-06-06"/>         6/6/2014     </span>      </div>                  <span class="review-tags">                         <span class="i-wrap ig-wrap-common i-checkin-burst-blue-small-common-wrap badge checkin checkin-irregular"><i class="i ig-common i-checkin-burst-blue-small-common"/> 1 check-in here</span>          </span>                  <p class="review_comment ieSucks" itemprop="description" lang="en">No doubt the food is good to great, but the service is literally the best we ever experienced.<br/><br/>Make sure to get the tasting menu including the wine tasting. &#160;The sommelier has some really amazing wines and was extremely knowledgeable. 
Other than that sit back, relax, and enjoy being pampered for the night.<br/><br/>My wife asked to taste a side that was on a dish we were not getting. We ended up forgetting about it and we were getting in a cab when one of our waiters came sprinting out to give it to us to go. Over the top service with great food too. Loved it!</p>      </div>     <div class="review-footer clearfix">                     <div class="rateReview ufc-feedback clearfix" data-review-id="Yx2OGU9pYACbTDTQTekoVw">                     <p class="review-intro review-message">         Was this review &#8230;?     </p>                  <ul class="big-ufc" data-csrf-token="e105d849c006a5933a2d66fd65206dc9e0da76a63f541552529a3b1f88a94cfb">        <li class="ufc-stat inline-block ytype">         <a href="javascript:;" rel="useful" class="ybtn ybtn-small useful ytype"><span class="i-wrap ig-wrap-common i-big-ufc-useful-common-wrap button-content"><i class="i ig-common i-big-ufc-useful-common"/>     <span class="vote-type">Useful</span>     <span class="count">3</span></span></a>     </li>         <li class="ufc-stat inline-block ytype">         <a href="javascript:;" rel="funny" class="ybtn ybtn-small funny ytype"><span class="i-wrap ig-wrap-common i-big-ufc-funny-common-wrap button-content"><i class="i ig-common i-big-ufc-funny-common"/>     <span class="vote-type">Funny</span>     <span class="count">2</span></span></a>     </li>         <li class="ufc-stat inline-block ytype">         <a href="javascript:;" rel="cool" class="ybtn ybtn-small cool ytype"><span class="i-wrap ig-wrap-common i-big-ufc-cool-common-wrap button-content"><i class="i ig-common i-big-ufc-cool-common"/>     <span class="vote-type">Cool</span>     <span class="count">2</span></span></a>     </li>                  </ul>         </div>      </div>            </div>     </div>          </li>          
      +++ mapping: <li>                 <div class="review review-with-no-actions" data-review-id="kdmPnZ2DLpuIyelV0etLrQ" itemprop="review" itemscope="" itemtype="http://schema.org/Review">             <meta itemprop="author" content="Crystal B."/>             <div class="review-sidebar">         <div class="review-sidebar-content">                 <div class="ypassport media-block clearfix">         <div class="media-avatar">        <div class="photo-box pb-60s" data-hovercard-id="FtEce1tn2kYfNIWdptv9XQ">                 <a href="/user_details?userid=3tiRQSIu8xU8Cf-mPvLxig">       <img alt="Crystal B." class="photo-box-img" height="60" src="//s3-media4.fl.yelpcdn.com/photo/spZn_0oF4ul6xsqxck-j2Q/60s.jpg" width="60"/>          </a>      </div>            </div>         <div class="media-story">                 <ul class="user-passport-info">         <li class="user-name">                                 <a class="user-display-name" href="/user_details?userid=3tiRQSIu8xU8Cf-mPvLxig" data-hovercard-id="FtEce1tn2kYfNIWdptv9XQ">Crystal B.</a>          </li>         <li class="user-location">             <b>Brisbane, CA</b>         </li>     </ul>                  <ul class="user-passport-stats">         <li class="friend-count">             <span class="i-wrap ig-wrap-common i-friends-orange-common-wrap"><i class="i ig-common i-friends-orange-common"/> <b>0</b> friends</span>         </li>         <li class="review-count">             <span class="i-wrap ig-wrap-common i-star-orange-common-wrap"><i class="i ig-common i-star-orange-common"/> <b>4</b> reviews</span>         </li>     </ul>          </div>     </div>                      <ul class="iconed-list action-link-list">                  <li class="iconed-list-item">                          <a class="action-link send-to-friend" data-pop-uri="/send_to_friend/review/kdmPnZ2DLpuIyelV0etLrQ" 
href="/biz_share?bizid=WavvLdfdP6g8aZTtbBQHTw&amp;return_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco&amp;reviewid=kdmPnZ2DLpuIyelV0etLrQ">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-share-common"/>         </div>         <div class="iconed-list-story">             Share review         </div>     </a>                  </li>                  <li class="iconed-list-item">                          <a class="action-link send-compliment" href="/thanx?complimentable_id=kdmPnZ2DLpuIyelV0etLrQ&amp;complimentable_type=REVIEW&amp;previous_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco&amp;user_id=3tiRQSIu8xU8Cf-mPvLxig">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-compliment-common"/>         </div>         <div class="iconed-list-story">             Compliment         </div>     </a>                  </li>                  <li class="iconed-list-item">                          <a class="action-link send-pm" href="/mail?action_send_form=1&amp;dst=3tiRQSIu8xU8Cf-mPvLxig&amp;return_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco" rel="Crystal B.">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-message-common"/>         </div>         <div class="iconed-list-story">             Send message         </div>     </a>                  </li>                  <li class="iconed-list-item manage-follow-container">                           <a class="action-link manage-following-add" href="/following_user/add?dst_user_id=3tiRQSIu8xU8Cf-mPvLxig&amp;previous_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco" rel="3tiRQSIu8xU8Cf-mPvLxig" style="">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-follow-common"/>         </div>         <div class="iconed-list-story">             Follow Crystal B.         
</div>     </a>                            <a class="action-link manage-following-remove" href="/following_user/remove?dst_user_id=3tiRQSIu8xU8Cf-mPvLxig&amp;previous_url=http%3A%2F%2Fwww.yelp.com%2Fbiz%2Fgary-danko-san-francisco" rel="3tiRQSIu8xU8Cf-mPvLxig" style="display: none;">         <div class="iconed-list-avatar">             <i class="i ig-common i-action-unfollow-common"/>         </div>         <div class="iconed-list-story">             Stop following Crystal B.         </div>     </a>                  </li>     </ul>           </div>      </div>          <div class="review-wrapper">                <div class="review-content">             <div class="biz-rating biz-rating-very-large clearfix">         <div itemprop="reviewRating" itemscope="" itemtype="http://schema.org/Rating">          <div class="rating-very-large">         <i class="star-img stars_3" title="3.0 star rating">             <img alt="3.0 star rating" class="offscreen" height="303" src="http://s3-media3.fl.yelpcdn.com/assets/2/www/img/c2252a4cd43e/ico/stars/v2/stars_map.png" width="84"/>         </i>             <meta itemprop="ratingValue" content="3.0"/>     </div>           </div>             <span class="rating-qualifier">             <meta itemprop="datePublished" content="2014-07-03"/>         7/3/2014     </span>      </div>                   <p class="review_comment ieSucks" itemprop="description" lang="en">Went to Gary Dankos for my 1 year wedding anniversary. I called about 2 months in advance to get in. They were completely booked except for the 9pm slot. These guys are serious. You can also get put on a waiting list to see if they have any cancellations to get in.<br/><br/>We picked to celebrate on a Monday since it's located right by Ghiradelli square. I didn't want to fight for street parking on the weekends. 
The restaurant was so unassuming and I would have walked right past it of the doorman wasn't there.<br/><br/>The atmosphere was beautiful, fresh flowers, pillars, and a lovely bar. Upon entry, The smell of the restaurant threw me off for a second, but I suppose it was mixture of the cheese, and the food. <br/><br/>We did the tasting menu and chose a wine to pair. It was a 5 course menu and since they sat us late, they threw in soup, some bubbly, and dessert. The food was okay, nothing I would go back for. Service was on point. For the $$$$ I expected to be blown away, but sadly I was not. I guess if you want service above food, this spot is for you. <br/><br/>I paid around $400.00 total. Won't be coming back, but at least now I know what the hype is all about. Which is not the food.</p>      </div>     <div class="review-footer clearfix">                     <div class="rateReview ufc-feedback clearfix" data-review-id="kdmPnZ2DLpuIyelV0etLrQ">                     <p class="review-intro review-message">         Was this review &#8230;?     
</p>                  <ul class="big-ufc" data-csrf-token="e105d849c006a5933a2d66fd65206dc9e0da76a63f541552529a3b1f88a94cfb">        <li class="ufc-stat inline-block ytype">         <a href="javascript:;" rel="useful" class="ybtn ybtn-small useful ytype"><span class="i-wrap ig-wrap-common i-big-ufc-useful-common-wrap button-content"><i class="i ig-common i-big-ufc-useful-common"/>     <span class="vote-type">Useful</span>     <span class="count"/></span></a>     </li>         <li class="ufc-stat inline-block ytype">         <a href="javascript:;" rel="funny" class="ybtn ybtn-small funny ytype"><span class="i-wrap ig-wrap-common i-big-ufc-funny-common-wrap button-content"><i class="i ig-common i-big-ufc-funny-common"/>     <span class="vote-type">Funny</span>     <span class="count"/></span></a>     </li>         <li class="ufc-stat inline-block ytype">         <a href="javascript:;" rel="cool" class="ybtn ybtn-small cool ytype"><span class="i-wrap ig-wrap-common i-big-ufc-cool-common-wrap button-content"><i class="i ig-common i-big-ufc-cool-common"/>     <span class="vote-type">Cool</span>     <span class="count">1</span></span></a>     </li>                  </ul>         </div>      </div>            </div>     </div>          </li>          
...


+++ candidate: /html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]
   +++ seed_record: <div class="ysection related-businesses">             <h3>People also viewed</h3>             <ul class="ylist">                     <li>                             <div class="media-block media-block-large clearfix biz-listing-medium">         <div class="media-avatar">        <div class="photo-box pb-60s">                 <a href="/biz/kokkari-estiatorio-san-francisco">       <img alt="Kokkari Estiatorio" class="photo-box-img" height="60" src="//s3-media1.fl.yelpcdn.com/bphoto/mHP1ehDdMW6sB8De4ks3bA/60s.jpg" width="60"/>          </a>      </div>            </div>         <div class="media-story">             <div class="media-title clearfix">                     <a class="biz-name" href="/biz/kokkari-estiatorio-san-francisco" data-hovercard-id="et2GLdh_XeynrwHekTpKNw">Kokkari Estiatorio</a>              </div>                       <div class="biz-rating biz-rating-medium clearfix">          <div class="rating">         <i class="star-img stars_4_half" title="4.5 star rating">             <img alt="4.5 star rating" class="offscreen" height="303" src="http://s3-media3.fl.yelpcdn.com/assets/2/www/img/c2252a4cd43e/ico/stars/v2/stars_map.png" width="84"/>         </i>     </div>                   <span class="review-count rating-qualifier">             <span itemprop="reviewCount">2936</span> reviews     </span>          </div>               <q>                 Zucchini Cakes - with cucumber and mint yogurt dressing.             
</q>         </div>     </div>                      </li>                     <li>                             <div class="media-block media-block-large clearfix biz-listing-medium">         <div class="media-avatar">        <div class="photo-box pb-60s">                 <a href="/biz/lazy-bear-san-francisco">       <img alt="Lazy Bear" class="photo-box-img" height="60" src="//s3-media1.fl.yelpcdn.com/bphoto/As4wCRKRYRmvi4p5SNg9Zw/60s.jpg" width="60"/>          </a>      </div>            </div>         <div class="media-story">             <div class="media-title clearfix">                     <a class="biz-name" href="/biz/lazy-bear-san-francisco" data-hovercard-id="jVUh7IdfHBvsmoSJPlJtuA">Lazy Bear</a>              </div>                       <div class="biz-rating biz-rating-medium clearfix">          <div class="rating">         <i class="star-img stars_5" title="5.0 star rating">             <img alt="5.0 star rating" class="offscreen" height="303" src="http://s3-media3.fl.yelpcdn.com/assets/2/www/img/c2252a4cd43e/ico/stars/v2/stars_map.png" width="84"/>         </i>     </div>                   <span class="review-count rating-qualifier">             <span itemprop="reviewCount">115</span> reviews     </span>          </div>               <q>                 Chef David explains each course as it is brought out.             
</q>         </div>     </div>                      </li>                     <li>                             <div class="media-block media-block-large clearfix biz-listing-medium">         <div class="media-avatar">        <div class="photo-box pb-60s">                 <a href="/biz/lous-cafe-san-francisco">       <img alt="Lou's Cafe" class="photo-box-img" height="60" src="//s3-media3.fl.yelpcdn.com/bphoto/qeBKIkfI4RLmoT4KuSRb3g/60s.jpg" width="60"/>          </a>      </div>            </div>         <div class="media-story">             <div class="media-title clearfix">                     <a class="biz-name" href="/biz/lous-cafe-san-francisco" data-hovercard-id="bYCiEUqD0jMxV0zN5XATrw">Lou&#8217;s Cafe</a>              </div>                       <div class="biz-rating biz-rating-medium clearfix">          <div class="rating">         <i class="star-img stars_4_half" title="4.5 star rating">             <img alt="4.5 star rating" class="offscreen" height="303" src="http://s3-media3.fl.yelpcdn.com/assets/2/www/img/c2252a4cd43e/ico/stars/v2/stars_map.png" width="84"/>         </i>     </div>                   <span class="review-count rating-qualifier">             <span itemprop="reviewCount">832</span> reviews     </span>          </div>               <q>                 Lou's makes the most amazing Veggie sandwich on Dutch Crunch.             </q>         </div>     </div>                      </li>             </ul>         </div>                                       
      +++ record: <div class="media-avatar">         <div class="photo-box pb-60s">                 <a href="/list/my-best-list-glendale-2">       <img alt="Jenny K." class="photo-box-img" height="60" src="//s3-media3.fl.yelpcdn.com/photo/EuBiWUGjLiMq8PtgeLTvIA/60s.jpg" width="60"/>          </a>      </div>           </div>          
      +++ record: <span itemprop="reviewCount">115</span> reviews      
...

Among all this, we see that the first candidate, /html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[3]/div[1]/div[2]/ul,
has one seed record: <li> <div class="review review-with-no-actions" data-review-id="8spqL77wsNZWYeUbz_FRbg"...

This corresponds to the entries of interest (one complete review per mapping).
One of the mappings is:

<li>
  <div class="review review-with-no-actions" data-review-id="Yx2OGU9pYACbTDTQTekoVw"...
     ...
     <a href="/user_details?userid=0QdwQLVxZpgy9Qb2Qakflw" data-hovercard-id="-WSpJjdedoEVIXRjbasrQg">Jim G.</a>
     ...
     <p class="review_comment ieSucks" itemprop="description" lang="en">No doubt the food is good to great, b...
     ...

Here we see the information of interest (author: "Jim G.", review text: "No doubt the food is good"...).

So, how do we extract this data from the mapping?

And how can we see the tree grammar that has been inferred?
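
On the first question, one option that does not rely on the mapping at all is to query each record element directly by its attributes. The sketch below is only an illustration: `SAMPLE` is a hypothetical, heavily trimmed stand-in for one review `<li>`, `extract_review` is a made-up helper name, and the standard library's `xml.etree.ElementTree` is used so the snippet is self-contained. The elements MDR returns are lxml elements, which offer the same `find`, `get` and `itertext` calls.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for one review <li> from the output above.
SAMPLE = """
<li>
  <div class="review" itemprop="review">
    <meta itemprop="author" content="Jim G."/>
    <meta itemprop="ratingValue" content="5.0"/>
    <p itemprop="description" lang="en">No doubt the food is good to great.<br/><br/>Loved it!</p>
  </div>
</li>
"""

def extract_review(record):
    """Pull a few fields out of one review element via its itemprop attributes."""
    author = record.find(".//meta[@itemprop='author']")
    rating = record.find(".//meta[@itemprop='ratingValue']")
    body = record.find(".//p[@itemprop='description']")
    text = None
    if body is not None:
        # itertext() also yields the text that follows each <br/> element.
        text = " ".join(t.strip() for t in body.itertext() if t.strip())
    return {
        "author": author.get("content") if author is not None else None,
        "rating": rating.get("content") if rating is not None else None,
        "text": text,
    }

review = extract_review(ET.fromstring(SAMPLE))
print(review)
```

This depends on the page's schema.org markup rather than on anything MDR infers, so it answers only the "get the data out" part, not the question about the tree grammar.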

from mdr.

dportabella commented on August 29, 2024

I've simplified the example as follows:

# Not a valid HTML page; however, the algorithm fails if we use <td> for the name, price and quantity.
text = """
<html>
    <body>
        <h1>title</h1>
        <table>
            <product><name>p1</name><price>$10</price><quantity>111</quantity></product>
            <product><name>p2</name><price>$20</price></product>
            <product><name>p3</name><price>$30</price><quantity>333</quantity></product>
        </table>
    </body>
</html>
"""

from mdr import MDR
from lxml import etree
mdr = MDR()

candidates, doc = mdr.list_candidates(text)

def xmlToString(tree):
    return etree.tostring(tree, pretty_print=True).replace("\n", " ")

print "+++ candidate: " + doc.getpath(candidates[0])
seed_records, mappings = mdr.extract(candidates[0])
print "   +++ seed_record: " + xmlToString(seed_records[0])
for record, mapping in mappings.iteritems():
    print "      +++ record. elements: " + " , ".join([xmlToString(tree) for tree in record])
    for k in mapping:
        print "          +++ mapping: " + xmlToString(k)

this produces:

+++ candidate: /html/body/table
   +++ seed_record: <product>   <name>p1</name>   <price>$10</price>   <quantity>111</quantity> </product>              
      +++ record. elements: <product>   <name>p1</name>   <price>$10</price>   <quantity>111</quantity> </product>              
          +++ mapping: <price>$10</price> 
          +++ mapping: <quantity>111</quantity> 
          +++ mapping: <name>p1</name> 
          +++ mapping: <product>   <name>p1</name>   <price>$10</price>   <quantity>111</quantity> </product>              
      +++ record. elements: <product>   <name>p2</name>   <price>$20</price> </product>              
          +++ mapping: <price>$10</price> 
          +++ mapping: <name>p1</name> 
          +++ mapping: <product>   <name>p1</name>   <price>$10</price>   <quantity>111</quantity> </product>              
      +++ record. elements: <product>   <name>p3</name>   <price>$30</price>   <quantity>333</quantity> </product>          
          +++ mapping: <price>$10</price> 
          +++ mapping: <quantity>111</quantity> 
          +++ mapping: <name>p1</name> 
          +++ mapping: <product>   <name>p1</name>   <price>$10</price>   <quantity>111</quantity> </product>              

From here, how can we extract the values (p1, $10, 111), (p2, $20, null), (p3, $30, 333)?


dportabella commented on August 29, 2024

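Solved: each mapping is a dict from an element of the seed record to the matching element of the record, so we iterate over its items: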

text = """
<html>
    <body>
        <h1>title</h1>
        <table>
            <product><name>p1</name><price>$10</price><quantity>111</quantity></product>
            <product><name>p2</name><price>$20</price></product>
            <product><name>p3</name><price>$30</price><quantity>333</quantity></product>
        </table>
    </body>
</html>
"""

from mdr import MDR
from lxml import etree
mdr = MDR()

candidates, doc = mdr.list_candidates(text)

def xmlToString(tree):
    return etree.tostring(tree, pretty_print=True).replace("\n", " ")

print "+++ candidate: " + doc.getpath(candidates[0])
seed_records, mappings = mdr.extract(candidates[0])
print "   +++ seed_record: " + xmlToString(seed_records[0])
for record, mapping in mappings.iteritems():
    print "      +++ record. elements: " + " , ".join([xmlToString(tree) for tree in record])
    for s,t in mapping.iteritems():
        print "          +++ mapping: " + xmlToString(s) + " -> " + xmlToString(t)

results in:

$ python test2.py
+++ candidate: /html/body/table
   +++ seed_record: <product>   <name>p1</name>   <price>$10</price>   <quantity>111</quantity> </product>
      +++ record. elements: <product>   <name>p1</name>   <price>$10</price>   <quantity>111</quantity> </product>
          +++ mapping: <name>p1</name>  -> <name>p1</name>
          +++ mapping: <price>$10</price>  -> <price>$10</price>
          +++ mapping: <product>   <name>p1</name>   <price>$10</price>   <quantity>111</quantity> </product>               -> <product>   <name>p1</name>   <price>$10</price>   <quantity>111</quantity> </product>
          +++ mapping: <quantity>111</quantity>  -> <quantity>111</quantity>
      +++ record. elements: <product>   <name>p2</name>   <price>$20</price> </product>
          +++ mapping: <name>p1</name>  -> <name>p2</name>
          +++ mapping: <product>   <name>p1</name>   <price>$10</price>   <quantity>111</quantity> </product>               -> <product>   <name>p2</name>   <price>$20</price> </product>
          +++ mapping: <price>$10</price>  -> <price>$20</price>
      +++ record. elements: <product>   <name>p3</name>   <price>$30</price>   <quantity>333</quantity> </product>
          +++ mapping: <price>$10</price>  -> <price>$30</price>
          +++ mapping: <name>p1</name>  -> <name>p3</name>
          +++ mapping: <product>   <name>p1</name>   <price>$10</price>   <quantity>111</quantity> </product>               -> <product>   <name>p3</name>   <price>$30</price>   <quantity>333</quantity> </product>
          +++ mapping: <quantity>111</quantity>  -> <quantity>333</quantity>
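
To get from these seed -> record pairs to the tuples asked for earlier, a minimal sketch could look like the following. It assumes only what the output above shows: each mapping is a dict keyed by seed-record elements, and a seed element with no counterpart (such as `<quantity>` for p2) is simply absent from the dict. `rows_from_mappings` and `fake_mapping` are hypothetical names; `fake_mapping` is a crude stand-in for MDR's alignment so the example runs without the library, and the standard library's `xml.etree.ElementTree` replaces lxml for the same reason.

```python
import xml.etree.ElementTree as ET

def rows_from_mappings(seed_record, mappings):
    """Build one tuple per record, with one column per leaf element of the seed."""
    # Leaf elements of the seed record define the columns, in document order.
    fields = [el for el in seed_record.iter() if len(el) == 0]
    rows = []
    for mapping in mappings.values():
        # A seed element missing from the mapping has no counterpart in this record.
        rows.append(tuple(
            mapping[f].text if f in mapping else None
            for f in fields
        ))
    return rows

# --- Stand-in data so the sketch runs without mdr (assumption, not MDR output) ---
table = ET.fromstring(
    "<table>"
    "<product><name>p1</name><price>$10</price><quantity>111</quantity></product>"
    "<product><name>p2</name><price>$20</price></product>"
    "<product><name>p3</name><price>$30</price><quantity>333</quantity></product>"
    "</table>"
)
seed = table[0]

def fake_mapping(seed, record):
    # Pair each seed child with the same-tag child of the record, if there is one.
    pairs = {seed: record}
    for s in seed:
        t = record.find(s.tag)
        if t is not None:
            pairs[s] = t
    return pairs

mappings = {(rec,): fake_mapping(seed, rec) for rec in table}
rows = rows_from_mappings(seed, mappings)
print(rows)
```

With the real library, `seed_records[0]` and the `mappings` returned by `mdr.extract(candidates[0])` would take the place of `seed` and the hand-built `mappings`.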
