ageitgey / node-unfluff

Automatically extract body content (and other cool stuff) from an html document

License: Apache License 2.0

node-unfluff's Introduction

unfluff

An automatic web page content extractor for Node.js!

Automatically grab the main text out of a webpage like this:

const extractor = require('unfluff');
const data = extractor(my_html_data); // my_html_data: the page's HTML as a string
console.log(data.text);

In other words, it turns pretty webpages into boring plain text/json data.

This might be useful for:

Writing your own Instapaper clone
Easily building ML data sets from web pages
Reading your favorite articles from the console?

Please don't use this for:

Stealing other people's web pages
Making crappy spam sites with stolen content from other sites
Being a jerk

Credits / Thanks

This library is largely based on python-goose by Xavier Grangier, which is in turn based on goose by Gravity Labs. However, it's not an exact port, so it may behave differently on some pages, and the feature set is a little bit different. If you are looking for a Python or Scala/Java/JVM solution, check out those libraries!

Install

To install the command-line unfluff utility:

npm install -g unfluff

To install the unfluff module for use in your Node.js project:

npm install --save unfluff

Usage

You can use unfluff from node or right on the command line!

Extracted data elements

This is what unfluff will try to grab from a web page:

title - The document's title (from the <title> tag)
softTitle - A version of title with less truncation
date - The document's publication date
copyright - The document's copyright line, if present
author - The document's author
publisher - The document's publisher (website name)
text - The main text of the document with all the junk thrown away
image - The main image for the document (what's used by facebook, etc.)
videos - An array of videos that were embedded in the article. Each video has src, width and height.
tags - Any tags or keywords that could be found by checking links marked rel="tag" or by looking at href urls.
canonicalLink - The canonical url of the document, if given.
lang - The language of the document, either detected or supplied by you.
description - The description of the document, from <meta> tags
favicon - The url of the document's favicon.
links - An array of links embedded within the article text. (text and href for each)

This is returned as a simple json object.

Command line interface

You can pass a webpage to unfluff and it will try to parse out the interesting bits.

You can either pass in a file name:

unfluff my_file.html

Or you can pipe it in:

curl -s "http://somesite.com/page" | unfluff

You can easily chain this together with other Unix commands to do cool stuff. For example, you can download a web page, parse it, and then use jq to print just the body text:

curl -s "https://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u" | unfluff | jq -r .text

And here's how to find the top 10 most common words in an article:

curl -s "https://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u" | unfluff | tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -10
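If you'd rather do the word count from Node, here is a rough sketch of the same idea. The `topWords` helper is hypothetical (it is not part of unfluff), and unlike the shell pipeline it lowercases words and only splits on ASCII letters and digits. In a real script the input string would be `data.text` from the extractor.

```javascript
// Hypothetical helper, not part of unfluff: count the most frequent words
// in a string, roughly like `tr | sort | uniq -c | sort -nr | head`.
function topWords(text, limit = 10) {
  const counts = new Map();
  // Split on anything that is not an ASCII letter or digit.
  for (const word of text.toLowerCase().split(/[^a-z0-9]+/)) {
    if (word === '') continue;
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  // Sort by count (descending), then alphabetically to keep ties stable.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit);
}

// Stand-in for `data.text`; a real script would pass the extracted text.
console.log(topWords('the knight and the shovel and the castle', 2));
```

Each entry in the result is a `[word, count]` pair, most frequent first.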

Module Interface

`extractor(html, language)`

html: The html you want to parse

language (optional): The document's two-letter language code. This will be auto-detected as best as possible, but there might be cases where you want to override it.

The extraction algorithm depends heavily on the language, so it probably won't work if you have the language set incorrectly.

const extractor = require('unfluff');

const data = extractor(my_html_data);

Or supply the language code yourself:

const extractor = require('unfluff');

const data = extractor(my_html_data, 'en');

data will then be a json object that looks like this:

{
  "title": "Shovel Knight review",
  "softTitle": "Shovel Knight review: rewrite history",
  "date": "2014-06-26T13:00:03Z",
  "copyright": "2016 Vox Media Inc Designed in house",
  "author": [
    "Griffin McElroy"
  ],
  "publisher": "Polygon",
  "text": "Shovel Knight is inspired by the past in all the right ways — but it's far from stuck in it. [.. snip ..]",
  "image": "http://cdn2.vox-cdn.com/uploads/chorus_image/image/34834129/jellyfish_hero.0_cinema_1280.0.png",  
  "tags": [],
  "videos": [],
  "canonicalLink": "http://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u",
  "lang": "en",
  "description": "Shovel Knight is inspired by the past in all the right ways — but it's far from stuck in it.",
  "favicon": "http://cdn1.vox-cdn.com/community_logos/42931/favicon.ico",
  "links": [
    { "text": "Six Thirty", "href": "http://www.sixthirty.co/" }
  ]
}
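As a small illustration of working with that object, the sketch below turns it into a plain-text summary. `formatArticle` is a hypothetical helper, not something unfluff provides, and it assumes only the field shapes shown above (author as an array, links as text/href pairs). The sample object is a trimmed copy of the output above.

```javascript
// Hypothetical helper, not part of unfluff: build a plain-text summary
// from an extraction result shaped like the JSON shown above.
function formatArticle(data) {
  // author is an array of names; fall back when it is missing or empty.
  const byline = (data.author || []).join(', ') || 'unknown author';
  const lines = [
    data.softTitle || data.title || '(untitled)',
    `${byline} | ${data.publisher || 'unknown publisher'}`,
    '',
    data.text || '',
  ];
  // Each link has a text and an href.
  for (const link of data.links || []) {
    lines.push(`- ${link.text}: ${link.href}`);
  }
  return lines.join('\n');
}

// Trimmed copy of the sample output, standing in for a real result.
const sample = {
  title: 'Shovel Knight review',
  softTitle: 'Shovel Knight review: rewrite history',
  author: ['Griffin McElroy'],
  publisher: 'Polygon',
  text: 'Shovel Knight is inspired by the past in all the right ways.',
  links: [{ text: 'Six Thirty', href: 'http://www.sixthirty.co/' }],
};
console.log(formatArticle(sample));
```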

`extractor.lazy(html, language)`

Lazy version of extractor(html, language).

The text extraction algorithm can be somewhat slow on large documents. If you only need access to elements like title or image, you can use the lazy extractor to get them more quickly without running the full processing pipeline.

This returns an object just like the regular extractor, except that all fields are replaced by functions and evaluation is only done when you call those functions.

const extractor = require('unfluff');

const data = extractor.lazy(my_html_data, 'en');

// Access whichever data elements you need directly.
console.log(data.title());
console.log(data.softTitle());
console.log(data.date());
console.log(data.copyright());
console.log(data.author());
console.log(data.publisher());
console.log(data.text());
console.log(data.image());
console.log(data.tags());
console.log(data.videos());
console.log(data.canonicalLink());
console.log(data.lang());
console.log(data.description());
console.log(data.favicon());

Some of these data elements require calculating intermediate representations of the html document. Everything is cached, so looking up multiple data elements, or the same element multiple times, should be as fast as possible.
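To make the lazy-plus-cached idea concrete, here is a minimal sketch of the general pattern. It is not unfluff's actual implementation. `lazyFields` is a hypothetical name, and the compute functions are stand-ins for real extraction work.

```javascript
// Hypothetical sketch of lazy, cached accessors (not unfluff's real code).
// Each compute function runs at most once; later calls reuse the result.
function lazyFields(computeFns) {
  const cache = new Map();
  const fields = {};
  for (const [name, compute] of Object.entries(computeFns)) {
    fields[name] = () => {
      if (!cache.has(name)) cache.set(name, compute());
      return cache.get(name);
    };
  }
  return fields;
}

// Count how often the (pretend) expensive work actually runs.
let computations = 0;
const doc = lazyFields({
  title: () => { computations += 1; return 'Shovel Knight review'; },
  text: () => { computations += 1; return 'long body text'; },
});

doc.title();
doc.title(); // served from the cache
console.log(computations); // text() was never called, title() ran once
```

Nothing is computed for a field until its function is called, which is why asking only for the title can skip the slower text extraction.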

Demo

The easiest way to try out unfluff is to just install it:

$ npm install -g unfluff
$ curl -s "http://www.cnn.com/2014/07/07/world/americas/mexico-earthquake/index.html" | unfluff

But if you can't be bothered, you can check out fetch text. It's a site by Andy Jiang that uses unfluff. You send an email with a url and it emails back with the cleaned content of that url. It should give you a good idea of how unfluff handles different urls.

What is broken

Parsing web pages in languages other than English is poorly tested and is probably buggy right now.
This definitely won't work yet for languages like Chinese / Arabic / Korean / etc. that need smarter word tokenization.
This has only been tested on a limited set of web pages. There are probably lots of lurking bugs with web pages that haven't been tested yet.
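Given those limits, you may want to check a page's declared language before trusting the extracted text. The sketch below is one possible pre-check, under assumptions. `declaredLanguage` and `languageSupport` are hypothetical helpers, the regex only looks at a lang attribute on the html tag, and the language set only covers the three examples named above (the "etc." is not filled in).

```javascript
// Languages named above as needing smarter word tokenization.
// Deliberately incomplete: only the three examples from the list.
const NEEDS_TOKENIZER = new Set(['zh', 'ar', 'ko']);

// Hypothetical helper: read the two-letter code from <html lang="...">.
// Returns null when the html tag declares no language.
function declaredLanguage(html) {
  const match = /<html\b[^>]*\blang\s*=\s*["']?([a-zA-Z]{2})/i.exec(html);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: map the declared language to a rough support level.
function languageSupport(html) {
  const lang = declaredLanguage(html);
  if (lang === null) return 'unknown';
  if (NEEDS_TOKENIZER.has(lang)) return 'unsupported';
  return lang === 'en' ? 'tested' : 'poorly-tested';
}

console.log(languageSupport('<html lang="ko"><body></body></html>'));
```

If you do know the language, you can pass the two-letter code as the second argument to the extractor rather than relying on auto-detection.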

node-unfluff's Issues

vietnamese stop words

This is a quick one: I'm using node-unfluff on Vietnamese language. There is currently no stop words file for Vietnamese. I took this one

https://github.com/stopwords/vietnamese-stopwords/blob/master/vietnamese-stopwords.txt

and dropped it into my data directory and it seems to be working. So you could add this to your distributed files. Cheers.
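For context on why a per-language file matters: goose-style extractors generally judge whether a block of text looks like real prose by counting the stop words in it. The sketch below shows that idea only. It is not unfluff's actual code. `parseStopWords` and `countStopWords` are hypothetical names, and the one-word-per-line file format is an assumption.

```javascript
// Hypothetical sketch (not unfluff's real code): load a stop word list and
// count how many words of a text appear in it.

// Assumes one stop word per line; blank lines are ignored.
function parseStopWords(fileContents) {
  return new Set(
    fileContents
      .split(/\r?\n/)
      .map((word) => word.trim().toLowerCase())
      .filter((word) => word !== '')
  );
}

// Splits on whitespace only, so multi-word stop word entries
// (which some lists contain) will never match in this sketch.
function countStopWords(text, stopWords) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => stopWords.has(word)).length;
}

// Tiny illustrative list; a real one would be read from the data directory.
const vietnamese = parseStopWords('và\nlà\ncủa\n');
console.log(countStopWords('Hà Nội là thủ đô của Việt Nam', vietnamese));
```

With no list for a language, a count like this is always zero, so nothing would score as article text.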

calculateBestNode claims no nodesWithText on facebook developer page

I was testing out unfluff on the url https://developers.facebook.com/docs/facebook-login/access-tokens and realized that no article text extraction is actually happening. It successfully pulled an image, description, and title, but the text appears blank.
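Until cases like this are handled, callers can at least detect an empty result. This is a minimal sketch of such a guard. `textOrFallback` is a hypothetical helper, and falling back to the description is just one possible choice, since a description is usually far shorter than the article.

```javascript
// Hypothetical guard, not part of unfluff: report which field supplied the
// text, falling back to the description when the main text came back blank.
function textOrFallback(data) {
  const text = (data.text || '').trim();
  if (text !== '') return { source: 'text', value: text };

  const description = (data.description || '').trim();
  if (description !== '') return { source: 'description', value: description };

  return { source: 'none', value: '' };
}

// Stand-in for a result like the one described above: metadata was found,
// but the text field is blank. The strings here are placeholders.
const result = textOrFallback({
  title: 'Example title',
  description: 'Example description',
  text: '',
});
console.log(result.source);
```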

Extract not all text

From this html page

<HTML xmlns="http://www.w3.org/1999/xhtml"><HEAD><TITLE>Sales Associate</TITLE>
<META content=text/javascript http-equiv=Content-Script-Type>
<META content=text/css http-equiv=Content-Style-Type>
<META content="text/html; charset=UTF-8" http-equiv=Content-Type>
<META content=IE=EmulateIE7 http-equiv=X-UA-Compatible>
<SCRIPT type=text/javascript>
    var deviceMode="desktop";
</SCRIPT>
<LINK rel=icon type=image/x-icon href="https://wfa.kronostm.com/static/core/images/favicon_blnk.ico"><LINK rel="shortcut icon" type=image/x-icon href="https://wfa.kronostm.com/static/core/images/favicon_blnk.ico"><!-- Dependencies -->
<SCRIPT type=text/javascript src="https://wfa.kronostm.com/common/jsutils/yui-2.7.0/build/yahoo/yahoo-min.js"></SCRIPT>

<SCRIPT type=text/javascript src="https://wfa.kronostm.com/common/jsutils/yui-2.7.0/build/event/event-min.js"></SCRIPT>

<SCRIPT type=text/javascript src="https://wfa.kronostm.com/common/jsutils/yui-2.7.0/build/dom/dom-min.js"></SCRIPT>

<SCRIPT type=text/javascript src="https://wfa.kronostm.com/common/jsutils/yui-2.7.0/build/logger/logger-min.js"></SCRIPT>

<SCRIPT type=text/javascript src="https://wfa.kronostm.com/common/jsutils/yui-2.7.0/build/element/element-min.js"></SCRIPT>

<SCRIPT type=text/javascript src="https://wfa.kronostm.com/common/jsutils/yui-2.7.0/build/cookie/cookie-min.js"></SCRIPT>

<SCRIPT type=text/javascript src="https://wfa.kronostm.com/scripts/combined-4740157.js"></SCRIPT>

<SCRIPT type=text/javascript src="https://wfa.kronostm.com/scripts/validation_en_US-min-4740157.js"></SCRIPT>

<SCRIPT type=text/javascript src="https://wfa.kronostm.com/common/jsutils/WebKitDetect-min.js"></SCRIPT>

<SCRIPT type=text/javascript>
    var tbCloseLabel="Close";
</SCRIPT>
<LINK rel=stylesheet type=text/css href="https://wfa.kronostm.com/common/jsutils/thickbox/thickbox-min.css">
<SCRIPT type=text/javascript src="https://wfa.kronostm.com/common/jsutils/thickbox/jquery-1.4.4.min.js"></SCRIPT>

<SCRIPT type=text/javascript src="https://wfa.kronostm.com/common/jsutils/thickbox/thickbox-min.js"></SCRIPT>

<SCRIPT type=text/javascript>

    if ( (BrowserDetect.browser=="Opera") ||
    (BrowserDetect.browser=="Netscape" && BrowserDetect.version<"7.2")
    || (BrowserDetect.browser=="Safari" && BrowserDetect.version>"48" && BrowserDetect.version<"420")
    )
    {
    window.location = "browserError.jsp"
    }

 var ataoDebug = false;
    function initLogger() {};


    Deploy.events.clientEventData = {"SLOT_0_3_3_10":{"ISACTIVELOCATION":true,"ISACTIVEPOSTING":true},"SLOT_0_3_3_14_2":{"ISACTIVELOCATION":true,"APPLYLINK":"?LOCATION_ID=51516916603&locale=en_US&applicationName=SpecialtyRetailersNonReqExt&SEQ=returningMemberLoginOrRegister&POSTING_ID=46224061705","NAVIGATEREFERJOB":"?LOCATION_ID=51516916603&locale=en_US&applicationName=SpecialtyRetailersNonReqExt&SEQ=jobReferral&POSTING_ID=46224061705&sourceSeq=postingLocationDetails","ISACTIVEPOSTING":true},"SLOT_0_3_3_1_0":{"NAVIGATERETURNING":"?FROMJAF=false&INDEX=0&locale=en_US&applicationName=SpecialtyRetailersNonReqExt&SEQ=returningMembers","HMCHELPPREFIX":"/","NAVIGATEREGISTRATION":"?FROMJAF=false&INDEX=0&locale=en_US&applicationName=SpecialtyRetailersNonReqExt&SEQ=registration","SHOWEXPLICITREGISTRATIONMODE":true,"NAVIGATELOGOUT":"?logout=1&applicationName=SpecialtyRetailersNonReqExt&locale=en_US","ISGUEST":true},"SLOT_0_3_3_12_2":{"LOCATIONPOSTINGDETAILS":"true","NAVIGATETOLOCATIONJOBS":"?LOCATION_ID=51516916603&locale=en_US&applicationName=SpecialtyRetailersNonReqExt&SEQ=locationDetails&APPLYALLJOBS=true","POSTINGEXISTS":"true","NAVIGATETOANYJOBAPPLY":"?LOCATION_ID=51516916603&locale=en_US&applicationName=SpecialtyRetailersNonReqExt&SEQ=returningMemberLoginOrRegister&APPLYALLJOBS=true","OPENINGSENABLED":"false"}};

    var initCompleteEvent = new YAHOO.util.CustomEvent("initCompleteEvent"); 

    var compiledInit = function () {
    initLogger();

    YAHOO.util.Event.addListener(document.body, 'click', Deploy.eventHandler);


 initializeThickBox();

    if (initCompleteEvent != null) {initCompleteEvent.fire();}
    };
    YAHOO.util.Event.addListener(window, "load", compiledInit);
    Deploy.globalText = {};
    YAHOO.util.Event.onAvailable("messageContainer", populateMessages);


    //-->
</SCRIPT>
<LINK rel=STYLESHEET type=text/css href="https://wfa.kronostm.com/styles/combined-4740157.css"><LINK rel=STYLESHEET type=text/css href="https://wfa.kronostm.com/styles/style_en_US-min-4740157.css"><LINK rel=STYLESHEET type=text/css href="https://wfa.kronostm.com/assets/SpecialtyRetailersNonReqExt_CIBranding/css/customer.css"><LINK rel=stylesheet type=text/css href="https://wfa.kronostm.com/styles/print-min-4740157.css" media=print><LINK rel=STYLESHEET type=text/css href="https://wfa.kronostm.com/styles/ie-4740157.css">
<SCRIPT language=JavaScript type=text/javascript>
    var windowBeforeUnloadMsg = '';
    setHookOnWindowBeforeUnload(false);

    <!-- Begin Pre-loading the page *************
function clearPreloadDiv() { //DOM
 if (document.getElementById('prePageDiv')) {
  document.getElementById('prePageDiv').style.display='none';
 }
}
if(/MSIE/.test(navigator.userAgent)){
 YAHOO.util.Event.onDOMReady(clearPreloadDiv);
}
// End Pre-loading of the page ****************-->
</SCRIPT>
</HEAD>
<BODY id=DeployMainBody class=yui-skin-sam>
<DIV id=displayWait></DIV>
<DIV id=prePageDiv class=preLoadDiv style="DISPLAY: none">  </DIV><NOSCRIPT></NOSCRIPT>
<DIV id=header><!-- view div[jspviews/customerHeader.jsp] --><IFRAME id=customerHeader title="Customer Header" src="https://wfa.kronostm.com/static/SpecialtyRetailers/NonReqExt/Stage_Stores_Header.html" frameBorder=0 scrolling=no>
</IFRAME><!-- 
<img id="printLogo" src="https://wfa.kronostm.com/static/core/images/print_kronos.gif" alt="Logo for print version" />
--></DIV>
<DIV id=wrapper class=noFluid>
<DIV id=bodyContainer>
<DIV id=sidenav>
<DIV id=Slot_0_3_3_0_0><!-- view div[jspviews/customerSideNav.jsp] --><IFRAME id=customerLeftNav title="Customer Side Navigation" src="https://wfa.kronostm.com/static/core/core_leftnav.htm" frameBorder=0 scrolling=no>
</IFRAME></DIV></DIV>
<DIV id=navGroup class=fullCol>
<DIV id=loginnav><!-- view div[jspviews/loginNav.jsp] -->
<DIV id=member><A id=Div0 class=nav href="https://wfa.kronostm.com/?FROMJAF=false&INDEX=0&locale=en_US&applicationName=SpecialtyRetailersNonReqExt&SEQ=returningMembers"><SPAN><STRONG>Sign In</STRONG></SPAN></A> | Not a Member? <A id=Div2 class=nav href="https://wfa.kronostm.com/?FROMJAF=false&INDEX=0&locale=en_US&applicationName=SpecialtyRetailersNonReqExt&SEQ=registration"><SPAN>Join Now!</SPAN></A> </DIV></DIV>
<DIV id=nav><!-- view div[jspviews/nav.jsp] --><IFRAME id=shimIFrame title="Nav Menu Helper" style="HEIGHT: 22px; WIDTH: 120px; POSITION: absolute; LEFT: 0px; Z-INDEX: 1; DISPLAY: none; TOP: 0px" src="https://wfa.kronostm.com/common/IEFrameWarningBypass.htm" frameBorder=0 scrolling=no></IFRAME>
<DIV id=menuBar>
<DIV class=menuItemNoSub><A id=Div3 class=nav href="https://wfa.kronostm.com/?seq=home&applicationName=SpecialtyRetailersNonReqExt&locale=en_US"><SPAN>Home</SPAN></A> </DIV>
<DIV class=menuItemNoSub><A id=Div4 class=nav href="https://wfa.kronostm.com/?seq=allOpenJobs&applicationName=SpecialtyRetailersNonReqExt&locale=en_US&allOpenJobs=true"><SPAN>All Open Jobs</SPAN></A> </DIV>
<DIV class=menuItemNoSub><A id=Div5 class=nav href="https://wfa.kronostm.com/?seq=allLocations&applicationName=SpecialtyRetailersNonReqExt&locale=en_US&showAllLocations=true&EVENT=com.deploy.application.hourly.plugin.LocationSearch.doSearch"><SPAN>Jobs by Location</SPAN></A> </DIV>
<DIV class=menuItemNoSub><A id=Div6 class=nav href="https://wfa.kronostm.com/?seq=grand_opening&applicationName=SpecialtyRetailersNonReqExt&locale=en_US"><SPAN>Grand Openings</SPAN></A> </DIV></DIV></DIV></DIV>
<DIV id=Slot_0_3_3_10 class=fullCol><!-- view div[jspviews/positionTitle.jsp] -->
<H1>Sales Associate</H1>Location: <STRONG>Menomonie, WI (1501 N Broadway, Ste 1590)</STRONG> 
<DIV id=messageContainer></DIV></DIV>
<DIV id=caSidebar>
<DIV id=Slot_0_3_3_12_2><!-- view div[jspviews/locationInfo.jsp] -->
<DIV class=sidebar>
<H3>Location Details </H3>
<DIV style="MARGIN: 4px"><LABEL class=inline><STRONG>Stage Stores (Bealls, Goody's, Palais Royal, Peebles & Stage)</STRONG> </LABEL><BR><SPAN class="field readOnly">1501 N Broadway, Ste 1590</SPAN><BR><SPAN class="field readOnly">Menomonie</SPAN>, <SPAN class="field readOnly">WI</SPAN>  <SPAN class="field readOnly">54751</SPAN><BR>
<DIV class="emphasized label inline">P:</DIV><SPAN class="field readOnly">715-233-2038</SPAN> </DIV>
<P><SPAN class=pColor>»</SPAN> <A id=Div10 class="field readOnly" href="https://wfa.kronostm.com/?LOCATION_ID=51516916603&locale=en_US&applicationName=SpecialtyRetailersNonReqExt&SEQ=locationDetails&APPLYALLJOBS=true"><SPAN>See all jobs at this location</SPAN></A> </P></DIV></DIV>
<DIV id=Slot_0_3_3_12_4><!-- view div[jspviews/applyNow.jsp] --></DIV></DIV>
<DIV id=caMain>
<DIV id=Slot_0_3_3_14_2><!-- view div[jspviews/positionLocationDetails.jsp] -->
<DIV id=fullCol>
<DIV class=nonform>
<DIV>
<H4>
<DIV class=formRow>
<DIV class="h2Label inline"><LABEL class="h2Label inline">Job Description<IMG style="DISPLAY: none" src="https://pixel.appcast.io/kronost-te8/a31.png?e=366&t=1413902884" width=1 height=1> <IMG style="DISPLAY: none" src="https://pixel.appcast.io/kronost-te8/a31.png?e=394&t=1415396593" width=1 height=1> <IMG style="DISPLAY: none" src="https://pixel.appcast.io/kronost-te8/a31.png?e=393&t=1415396329" width=1 height=1></LABEL> </DIV> </DIV></H4>
<DIV class=formattedContent>
<P><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">REPORTS TO: </SPAN></SPAN><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Store Manager/Assistant Manager</SPAN></SPAN></P>
<P><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">OBJECTIVE: </SPAN></SPAN></P>
<P><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">To greet and assist customers with the selection of merchandise</SPAN></SPAN></P>
<P><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">RESPONSIBILITIES:</SPAN></SPAN></P>
<P><STRONG><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Customer Service</SPAN></SPAN></STRONG></P>
<UL>
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Smile and greet each customer promptly</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Interrupt store tasks to greet, assist, and answer questions for customers</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Suggest additional merchandise to customers on sales floor, at wrap stations, and in fitting rooms</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Follow-up on customers in fitting rooms to see if they need additional service</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Thank the customer by name and invite them to come back at the close of each sale</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Handle all returns/ exchanges according to company policies/ procedures, including Always Say Yes</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Maintain the store and all wrap stations in a clean, neat, and organized manner</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Opt-In (telephone and email capture)</SPAN></SPAN> </LI></UL>
<P><STRONG><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Personal Productivity</SPAN></SPAN></STRONG></P>
<UL>
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Sales</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Volume Per Hour (VPH)</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Units Per Transaction (UPT)</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">New Accounts</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Club 50</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Magazines</SPAN></SPAN> 
<LI>Opt-In (telephone and email capture) </LI></UL>
<P><STRONG><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Loss Prevention</SPAN></SPAN></STRONG></P>
<UL>
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Provide customer service</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Alert management of suspicious situations</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Follow store procedures</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Work with integrity</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Follow sensor tagging program (if applicable)</SPAN></SPAN> </LI></UL>
<P><STRONG><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Credit</SPAN></SPAN></STRONG></P>
<UL>
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Ask every customer to open a new account </SPAN></SPAN>
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Open and activate a minimum of (1) one new account per month </SPAN></SPAN>
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Always suggest purchase using company charge card</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Communicate the Premier Rewards Program</SPAN></SPAN> </LI></UL>
<P><STRONG><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Merchandise Presentation</SPAN></SPAN></STRONG></P>
<UL>
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Hang, Fold, Size, Sensor (where applicable), Sign and Steam (where applicable)</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Implement merchandising Floor Plans/guidelines</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Ensure featured merchandise is appropriately layered to suggest possible wardrobe ideas to the customer</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Follow "Folded Merchandise Guidelines"</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Create and maintain full, exciting displays</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Present clearance merchandise correctly</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">"Close to Open"</SPAN></SPAN> </LI></UL>
<P><STRONG><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Store Operations</SPAN></SPAN></STRONG></P>
<UL>
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Set-up Sales Event prior to sale start date (to include signing)</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Complete Price Changes (PCA's) by close of business on the effective date</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Complete Transfers by close of business on the effective date</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Complete Damages by close of business on the effective date</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Merchandise counts </SPAN></SPAN>
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Process in-coming freight including hanging, folding, sensor-tagging, etc</SPAN></SPAN> 
<LI><SPAN style="FONT-FAMILY: arial, sans-serif"><SPAN style="FONT-SIZE: 10pt">Housekeeping responsibilities including vacuuming, dusting, restrooms, etc</SPAN></SPAN> </LI></UL></DIV></DIV>
<DIV class=formRow>
<DIV class="emphasized label h2Label">Job Preview Video</DIV>
<DIV class=formattedContent><EMBED type=application/x-shockwave-flash height=390 width=640 src=https://www.youtube.com/v/22WFbDceaUM allowScriptAccess="always" allowfullscreen="true"> </DIV></DIV></DIV>
<DIV class="controlRow centered"><A id=Div11 class=largeButton href="https://wfa.kronostm.com/?LOCATION_ID=51516916603&locale=en_US&applicationName=SpecialtyRetailersNonReqExt&SEQ=returningMemberLoginOrRegister&POSTING_ID=46224061705"><SPAN>Apply Now</SPAN></A> </DIV>
<DIV class=controlRow><A id=Div12 class="iconMiniEmail icon-16 noline" href="https://wfa.kronostm.com/?LOCATION_ID=51516916603&locale=en_US&applicationName=SpecialtyRetailersNonReqExt&SEQ=jobReferral&POSTING_ID=46224061705&sourceSeq=postingLocationDetails"><SPAN>Refer This Job</SPAN></A> </DIV></DIV></DIV>
<DIV id=Slot_0_3_3_14_4><!-- view div[jspviews/applyNow.jsp] --></DIV></DIV>
<DIV id=caFooterGroup class=fullCol>
<DIV id=caFooter><!-- view div[jspviews/caFooter.jsp] -->
<DIV class=copyrightText><SPAN style="VERTICAL-ALIGN: top">Copyright © 2000 - 2015</SPAN> by Kronos Incorporated. All rights reserved. <SPAN style="VERTICAL-ALIGN: top"> |  <A href="http://www.kronos.com/Privacy.htm" data-role="none">Privacy Policy</A></SPAN> </DIV>
<DIV class=patentsText><SPAN class=patents>U. S. Patents 7,080,057; 7,310,626; 7,558,767; 7,562,059;</SPAN> 7,472,097; 7,606,778; 8,086,558 and 8,046,251. </DIV>
<SCRIPT type=text/javascript>
              var KronosUTMDomain = "kronostm.com";   // default SysVar value: "auto"
              var KronosUTMGifPath = "https://wfsa-img.kronostm.com/__utm.gif"; // default SysVar value: "common/jsutils/urchin/__utm.gif"
          </SCRIPT>

<SCRIPT type=text/javascript src="https://wfa.kronostm.com/common/jsutils/urchin/urchin.js"></SCRIPT>

<SCRIPT type=text/javascript>
              // Urchin tracking (UTM)
              urchinTracker();
          </SCRIPT>
</DIV></DIV></DIV></DIV>
<DIV id=footer><!-- view div[jspviews/customerFooter.jsp] --><IFRAME id=customerFooter title="Customer Footer" src="https://wfa.kronostm.com/static/core/core_footer.htm" frameBorder=0 scrolling=no>
</IFRAME></DIV>
<DIV id=footer>
<SCRIPT type=text/javascript src="https://wfa.kronostm.com/scripts/tools-min-4740157.js"></SCRIPT>
</DIV>
<SCRIPT>window.alert = function mozendaDoNothing() {}</SCRIPT>

<SCRIPT>window.confirm = function mozendaReturnTrue() {return true;}</SCRIPT>

<SCRIPT> 

    if (!window.console) 
     window.console = {};

    // union of Chrome, FF, IE, and Safari console methods
    var m = 
    [
     'log', 'info', 'warn', 'error', 'debug', 'trace', 'dir', 'group',
     'groupCollapsed', 'groupEnd', 'time', 'timeEnd', 'profile', 'profileEnd',
     'dirxml', 'assert', 'count', 'markTimeline', 'timeStamp', 'clear'
    ];

    // define undefined methods to prevent errors
    for (var i = 0; i < m.length; i++) 
    {
     if (!window.console[m[i]]) 
      window.console[m[i]] = function() {};
    } 

    window.console.log = function (debugStatement) { window.external.Log(debugStatement); }
   </SCRIPT>

<SCRIPT>window.open = function mozendaWindowOpen(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>

<SCRIPT>window.showModalDialog = function mozendaWindowShowModalDialog(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>

<SCRIPT>window.showModelessDialog = function mozendaWindowShowModlessDialog(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>

<SCRIPT>document.open = function mozendaDocumentOpen(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>
</BODY></HTML>
<HTML><HEAD><TITLE>Welcome to Stage Stores</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1"></HEAD>
<BODY style="MARGIN: 0px"><IMG alt="Speciality Retailers" src="https://wfa.kronostm.com/images/Stage" Stores_CE.gif;pv99535325afb22125"> 
<SCRIPT>window.alert = function mozendaDoNothing() {}</SCRIPT>

<SCRIPT>window.confirm = function mozendaReturnTrue() {return true;}</SCRIPT>

<SCRIPT> 

    if (!window.console) 
     window.console = {};

    // union of Chrome, FF, IE, and Safari console methods
    var m = 
    [
     'log', 'info', 'warn', 'error', 'debug', 'trace', 'dir', 'group',
     'groupCollapsed', 'groupEnd', 'time', 'timeEnd', 'profile', 'profileEnd',
     'dirxml', 'assert', 'count', 'markTimeline', 'timeStamp', 'clear'
    ];

    // define undefined methods to prevent errors
    for (var i = 0; i < m.length; i++) 
    {
     if (!window.console[m[i]]) 
      window.console[m[i]] = function() {};
    } 

    window.console.log = function (debugStatement) { window.external.Log(debugStatement); }
   </SCRIPT>

<SCRIPT>window.open = function mozendaWindowOpen(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>

<SCRIPT>window.showModalDialog = function mozendaWindowShowModalDialog(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>

<SCRIPT>window.showModelessDialog = function mozendaWindowShowModlessDialog(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>

<SCRIPT>document.open = function mozendaDocumentOpen(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>
</BODY></HTML>
<HTML><HEAD><TITLE>Untitled Document</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1"></HEAD>
<BODY>Core Sidenav 
<SCRIPT>window.alert = function mozendaDoNothing() {}</SCRIPT>

<SCRIPT>window.confirm = function mozendaReturnTrue() {return true;}</SCRIPT>

<SCRIPT> 

    if (!window.console) 
     window.console = {};

    // union of Chrome, FF, IE, and Safari console methods
    var m = 
    [
     'log', 'info', 'warn', 'error', 'debug', 'trace', 'dir', 'group',
     'groupCollapsed', 'groupEnd', 'time', 'timeEnd', 'profile', 'profileEnd',
     'dirxml', 'assert', 'count', 'markTimeline', 'timeStamp', 'clear'
    ];

    // define undefined methods to prevent errors
    for (var i = 0; i < m.length; i++) 
    {
     if (!window.console[m[i]]) 
      window.console[m[i]] = function() {};
    } 

    window.console.log = function (debugStatement) { window.external.Log(debugStatement); }
   </SCRIPT>

<SCRIPT>window.open = function mozendaWindowOpen(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>

<SCRIPT>window.showModalDialog = function mozendaWindowShowModalDialog(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>

<SCRIPT>window.showModelessDialog = function mozendaWindowShowModlessDialog(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>

<SCRIPT>document.open = function mozendaDocumentOpen(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>
</BODY></HTML>
<HTML xmlns="http://www.w3.org/1999/xhtml"><HEAD></HEAD>
<BODY>
<SCRIPT>window.alert = function mozendaDoNothing() {}</SCRIPT>

<SCRIPT>window.confirm = function mozendaReturnTrue() {return true;}</SCRIPT>

<SCRIPT> 

    if (!window.console) 
     window.console = {};

    // union of Chrome, FF, IE, and Safari console methods
    var m = 
    [
     'log', 'info', 'warn', 'error', 'debug', 'trace', 'dir', 'group',
     'groupCollapsed', 'groupEnd', 'time', 'timeEnd', 'profile', 'profileEnd',
     'dirxml', 'assert', 'count', 'markTimeline', 'timeStamp', 'clear'
    ];

    // define undefined methods to prevent errors
    for (var i = 0; i < m.length; i++) 
    {
     if (!window.console[m[i]]) 
      window.console[m[i]] = function() {};
    } 

    window.console.log = function (debugStatement) { window.external.Log(debugStatement); }
   </SCRIPT>

<SCRIPT>window.open = function mozendaWindowOpen(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>

<SCRIPT>window.showModalDialog = function mozendaWindowShowModalDialog(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>

<SCRIPT>window.showModelessDialog = function mozendaWindowShowModlessDialog(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>

<SCRIPT>document.open = function mozendaDocumentOpen(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>
</BODY></HTML>
<HTML><HEAD><TITLE>Untitled Document</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1"></HEAD>
<BODY style="PADDING-BOTTOM: 10px; PADDING-TOP: 10px; PADDING-LEFT: 10px; MARGIN: 0px; PADDING-RIGHT: 10px; BACKGROUND-COLOR: #eee">Core Footer 
<SCRIPT>window.alert = function mozendaDoNothing() {}</SCRIPT>

<SCRIPT>window.confirm = function mozendaReturnTrue() {return true;}</SCRIPT>

<SCRIPT> 

    if (!window.console) 
     window.console = {};

    // union of Chrome, FF, IE, and Safari console methods
    var m = 
    [
     'log', 'info', 'warn', 'error', 'debug', 'trace', 'dir', 'group',
     'groupCollapsed', 'groupEnd', 'time', 'timeEnd', 'profile', 'profileEnd',
     'dirxml', 'assert', 'count', 'markTimeline', 'timeStamp', 'clear'
    ];

    // define undefined methods to prevent errors
    for (var i = 0; i < m.length; i++) 
    {
     if (!window.console[m[i]]) 
      window.console[m[i]] = function() {};
    } 

    window.console.log = function (debugStatement) { window.external.Log(debugStatement); }
   </SCRIPT>

<SCRIPT>window.open = function mozendaWindowOpen(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>

<SCRIPT>window.showModalDialog = function mozendaWindowShowModalDialog(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>

<SCRIPT>window.showModelessDialog = function mozendaWindowShowModlessDialog(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>

<SCRIPT>document.open = function mozendaDocumentOpen(param1, param2, param3) { window.name = param2; if( param1 != null && param1 != "") window.location = param1; }</SCRIPT>
</BODY></HTML>

I'm getting only part of the text:

     | Not a Member?

» See all jobs at this location

To greet and assist customers with the selection of merchandise

Interrupt store tasks to greet, assist, and answer questions for customers

Suggest additional merchandise to customers on sales floor, at wrap stations, and in fitting rooms

Follow-up on customers in fitting rooms to see if they need additional service

Thank the customer by name and invite them to come back at the close of each sale

Handle all returns/ exchanges according to company policies/ procedures, including Always Say Yes

Maintain the store and all wrap stations in a clean, neat, and organized manner

Ask every customer to open a new account

Open and activate a minimum of (1) one new account per month

Hang, Fold, Size, Sensor (where applicable), Sign and Steam (where applicable)

Ensure featured merchandise is appropriately layered to suggest possible wardrobe ideas to the customer

Complete Price Changes (PCA's) by close of business on the effective date

Complete Transfers by close of business on the effective date

Complete Damages by close of business on the effective date

Could you please look into it.

Can't get iframe video from HTML

What's wrong with the HTML/code below?

<title>Title</title> <iframe width="560" height="315" src="https://www.youtube.com/embed/4G5xhIv_-bA" frameborder="0" allowfullscreen></iframe>

let data = extractor.lazy(fs.readFileSync('index.html'));
let video = data.videos();
console.log(video);

output: []

Extract text with line breaks

How do I get the returned 'text' node to contain line breaks?
I have created a CodePen to demonstrate.
http://codepen.io/adrianparr/full/bpxKgo/

Here is my server-side JS (app.js) ...

var express = require('express'),
    http = require('http'),
    bodyParser = require('body-parser'),
    url = require('url'),
    extractor = require('unfluff');

var app = express();

app.use(bodyParser.json());
app.use(bodyParser.urlencoded({ extended: true }));
app.use(function(req, res, next) {
  res.header("Access-Control-Allow-Origin", "*");
  res.header("Access-Control-Allow-Headers", "Origin, X-Requested-With, Content-Type, Accept");
  next();
});

app.set('port', process.env.PORT || 3000);

app.get('/', function (req, res) {
    res.send("Please visit '/extract' and supply a 'url' query string.");
});

app.get('/extract', function (req, res) {
    loadAndExtract(req.query.url, res);
});

function loadAndExtract(passedUrl, originalRes) {
    var unfluffOptions = {
        host: url.parse(passedUrl).hostname,
        path: url.parse(passedUrl).pathname
    }
    var request = http.request(unfluffOptions, function (res) {
        var data = '';
        res.on('data', function (chunk) {
            data += chunk;
        });
        res.on('end', function () {
            var extractedData = extractor(data);
            console.log(extractedData.title);
            originalRes.setHeader('Content-Type', 'application/json');
            originalRes.send(extractedData);
        });
    });
    request.on('error', function (e) {
        console.log(e.message);
    });
    request.end();
}

var server = app.listen(3000, function () {
    var host = server.address().address;
    var port = server.address().port;
    console.log('Express server listening on http://%s:%s', host, port);
});

And here is my package.json file ...

{
  "name": "node-unfluff-demo",
  "version": "1.0.0",
  "description": "",
  "main": "app.js",
  "scripts": {
    "start": "node app.js"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "body-parser": "^1.15.0",
    "express": "^4.13.4",
    "unfluff": "^1.0.0"
  }
}

Ignores unordered/ordered lists in body

This is a real pain where someone is enumerating examples, etc. in an article.

Twitter status (tweet) as article?

In my project I found a need to mine images from tweets as "articles". People sometimes tweet links to articles, sometimes they tweet pictures from an event.

I added a little hack to do this that looks at the url to see if it's a tweet, then looks for a image reference:

    var rgxTwitterStatus = /https:\/\/twitter\.com\/[^\/]+\/status/,
        rgxTwitterImage = /<img src="(https:\/\/pbs\.twimg\.com\/media\/[^"]+)/,
        m;

    // Only override the image when the URL is a tweet
    if (url && rgxTwitterStatus.test(url)) {
        m = html.match(rgxTwitterImage);
        if (m) {
            article.image = m[1];
        }
    }

I'm just posting it since I thought it might be generally useful, and since I don't know CoffeeScript or the unfluff codebase well enough to try to submit a pull request.

Typo in extractor#isHighlinkDensity ?

Line 378 of extractor.coffee is currently:

linkText = sb.join('')

which doesn't really make sense... shouldn't it be this?:

linkText = sb.join(' ')

Changing it to this, however, causes the polygon_video test to break. Not sure how to address this, or, really, how to make complete sense of the isHighlinkDensity function. The idea behind the function is perfectly clear, but how the function is "tuned" doesn't really make sense:

linkDivisor = numberOfLinkWords / wordsNumber
score = linkDivisor * numberOfLinks

score >= 1.0

Maybe a bad question to ask about a heuristic, but why is this the heuristic?
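To make the arithmetic quoted above concrete, here is a minimal JavaScript sketch. The function names and the sample numbers are made up for illustration; this is not the code in extractor.coffee, only the formula as quoted in this issue.

```javascript
// Sketch of the scoring arithmetic quoted above (names are illustrative).
function linkDensityScore(numberOfLinkWords, wordsNumber, numberOfLinks) {
  // Assumption: a node with no words is treated as not link-heavy.
  if (wordsNumber === 0) return 0;
  const linkDivisor = numberOfLinkWords / wordsNumber;
  return linkDivisor * numberOfLinks;
}

function isHighLinkDensity(numberOfLinkWords, wordsNumber, numberOfLinks) {
  return linkDensityScore(numberOfLinkWords, wordsNumber, numberOfLinks) >= 1.0;
}

// A 100-word paragraph with 2 links covering 10 words: 0.1 * 2 = 0.2
console.log(isHighLinkDensity(10, 100, 2)); // false
// A 20-word nav block with 5 links covering 10 words: 0.5 * 5 = 2.5
console.log(isHighLinkDensity(10, 20, 5)); // true
```

Because the fraction of link words is multiplied by the number of links, a node with many short links crosses the threshold quickly. If link words are counted by splitting the joined link text on whitespace (an assumption about the counting), joining with '' instead of ' ' fuses adjacent link texts and lowers the count, which may be why changing the separator shifts a test result.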

Purpose of various cleaner functions such as cleaner.cleanEmTags?

First, thanks for making this public, it's a really useful tool.
Apologies in advance if I have misunderstood the code.

A number of methods inside cleaner.coffee don't make sense to me.
A good example is cleanEmTags. Which sites have an <img> under an <em>?

I noticed that a lot of the cleaning operations in cleaner.coffee were in the original commit. What documents did you use for the initial version?

If I get a better understanding I'd be happy to add comments to make it clearer.

shouldn't remove code blocks

Try curl -s https://github.com/ageitgey/node-unfluff | unfluff. You'll notice that the code blocks are removed. Ideally, code would stay.

Why doesn't this support Chinese? What's the difficult part?

FYI

Display --help output if no arguments passed to unfluff CLI

I installed using npm i unfluff -g and accidentally typed % unfluffed on the CLI and it seems like unfluff just sits there waiting for some sort of input.

Not sure if you want to check if any input or arguments were passed to the unfluff command and just show the help/usage output.
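One way such a check could look, as a sketch only: the helper name `shouldShowHelp` and the usage text are placeholders, not part of the unfluff CLI. It relies on Node's `process.stdin.isTTY`, which is true when stdin is attached to a terminal rather than a pipe or file.

```javascript
// Hypothetical helper: decide whether a CLI should print usage
// instead of silently waiting for input on stdin.
function shouldShowHelp(args, stdinIsTTY) {
  // No arguments and stdin attached to a terminal means nothing was piped in.
  return args.length === 0 && Boolean(stdinIsTTY);
}

// In a real CLI the inputs would come from the process itself.
const cliArgs = process.argv.slice(2);
if (shouldShowHelp(cliArgs, process.stdin.isTTY)) {
  console.log('usage: (placeholder) pass a file name or pipe HTML on stdin');
}
```

Piped input (for example from curl) leaves `isTTY` undefined, so the normal read-from-stdin path would be unaffected.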

Change the console.error to an opt-in/opt-out verbose mode.

For example, https://github.com/ageitgey/node-unfluff/blob/master/src/stopwords.coffee#L16 logs to my console on certain webpages. It's not necessarily a major error, so I would rather not have it fill up my console.
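A sketch of what an opt-in verbose mode could look like. The `verbose` option and the `createLogger` helper are hypothetical; unfluff does not necessarily expose either.

```javascript
// Hypothetical opt-in logger; `verbose` is not an existing unfluff setting.
function createLogger(options) {
  const verbose = Boolean(options && options.verbose);
  return {
    warn(message) {
      if (verbose) console.error(message);
      return verbose; // report whether the message was actually emitted
    },
  };
}

const quiet = createLogger();
const loud = createLogger({ verbose: true });
quiet.warn('stopwords file not found'); // suppressed
loud.warn('stopwords file not found');  // written to stderr
```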

How to get HTML content of the text?

Is there a way to return the content of the text property with all the HTML markup intact?
Thanks

What coffee does unfluff drink?

Seeing this error when running "make":

Error: unrecognized option: --js
at OptionParser.exports.OptionParser.OptionParser.parse (.../node-unfluff/node_modules/coffee-script/lib/coffee-script/optparse.js:51:19)

There are many kinds of coffee out there on the internet. Which coffee does unfluff use? Mine is from http://coffeescript.org/

This patch seems to fix it:

diff --git a/Makefile b/Makefile
index 557d91d..b63a69e 100644
--- a/Makefile
+++ b/Makefile
@@ -10,7 +10,7 @@ SRC = $(shell find "$(SRCDIR)" -name "*.coffee" -type f | sort)
LIB = $(SRC:$(SRCDIR)/%.coffee=$(LIBDIR)/%.js)
TEST = $(shell find "$(TESTDIR)" -name "*.coffee" -type f | sort)

-COFFEE=node_modules/.bin/coffee --js
+COFFEE=node_modules/.bin/coffee -c
MOCHA=node_modules/.bin/mocha --compilers coffee:coffee-script-redux/register -r test-setup.coffee -u tdd -R dot
CJSIFY=node_modules/.bin/cjsify --minify
SEMVER=node_modules/.bin/semver

Finally, there aren't any instructions for running make. I thought it would be nice to have some. Thanks for the awesome coffee!

Doesn't seem to work for sites that use <div> tags instead of <p>

I tried this with a CNN.com article and it didn't work because they don't use paragraphs. Any suggestions for a work-around?

ENOENT exception trying to open stopwords-en.txt when running as lambda

Hi, thanks for this module, it's great!

I've been migrating more and more things to serverless, and ran into an issue with it.

When I call unfluff() from a lambda, it fails and an exception is thrown:

{
    "errno": -2,
    "code": "ENOENT",
    "syscall": "open",
    "path": "/var/task/user/api/content/data/stopwords/stopwords-en.txt"
  }

I don't have additional information right now but thought I'd log it as an issue.

This is running on the now.sh platform, and it's possible it's a weird artefact of their build process.

If anyone is using this library on Lambda in AWS, I'd appreciate knowing that so I can close this off and raise it over there instead.

How can I use this script inside an html page?

Do you have a sample where I can use javascript or jquery to use it?

Do I have to bundle this package in order to work as a standalone webpage ?

Cannot use client-side with React Native

After installing the module into my react native app, I get:

I tried manually adding fs to package.json to make sure it's there in the module map, but same error. Is it possible to use this library client-side in an RN app?

Problem with New York Times stories

The text field is empty when running unfluff on the html from a New York Times story. For example, if I request a story from nytimes.com in the node console and then pass the page html to unfluff, the returned text field is empty:

request({uri: 'https://www.nytimes.com/2017/06/01/climate/trump-paris-climate-agreement.html', jar: true}, function(e, r, b) {
  console.log(unfluff(b));
});

Result:

{ title: 'Trump Will Withdraw U.S. From Paris Climate Agreement', softTitle: 'Trump Will Withdraw U.S. From Paris Climate Agreement', date: '2017-06-01T14:48:08-04:00', author: [ 'Michael D. Shear', 'https://www.nytimes.com/by/michael-d-shear' ], publisher: undefined, copyright: '2017 The New York Times Company', favicon: 'https://static01.nyt.com/favicon.ico', description: 'The withdrawal process could take four years to complete, meaning a final decision would be up to the American voters in the next presidential election.', keywords: 'United Nations Framework Convention on Climate Change,Trump Donald J,United States Politics and Government,Global Warming', lang: 'en', canonicalLink: 'https://www.nytimes.com/2017/06/01/climate/trump-paris-climate-agreement.html', tags: [], image: 'https://static01.nyt.com/images/2017/06/02/us/02climatesub-alpha1/02climatesub-alpha1-facebookJumbo.jpg', videos: [], links: [], text: '' }

I've tried a couple of different Times urls and ensured that the request method is indeed passing the correct page html to the callback.

Extract author

Find first valid value (trimmed length in interval 0..100) of:

meta[name="author"]
[rel="author"]
[class="author"], [class="writer"], [class="writtenby"]
[id="author"], [id="writer"], [id="writtenby"]
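The lookup described above can be sketched as a small function over candidate strings, supplied in the priority order of the selectors. The name `firstValidAuthor` is illustrative, and reading "0..100" as "non-empty and at most 100 characters" is an assumption.

```javascript
// Candidates would be the values found by the selectors above, in priority order.
function firstValidAuthor(candidates) {
  for (const raw of candidates) {
    if (typeof raw !== 'string') continue; // selector matched nothing
    const value = raw.trim();
    // Assumption: "0..100" means non-empty and at most 100 characters.
    if (value.length > 0 && value.length <= 100) return value;
  }
  return null;
}

const author = firstValidAuthor([undefined, '   ', 'x'.repeat(150), '  Jane Doe  ']);
console.log(author); // "Jane Doe"
```

The length cap is what rejects matches that accidentally grab a whole byline block or bio rather than a name.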

Deprecated modules

Seeing the following warnings when installing unfluff:
npm WARN deprecated [email protected]: the module is now available as 'css-select'
npm WARN deprecated [email protected]: the module is now available as 'css-what'

Please update.

Grabbing sidebar content

While parsing the URL below, I noticed that sidebar content sometimes gets drawn into the article content.

http://news.forexlive.com/!/anz-on-gbp-mkts-need-something-fresh-to-trade-off-if-gbp-is-to-go-lower-in-near-term-20170117

I'll be looking into the code here, but if anyone more familiar with it can beat me to it, that would be much appreciated.

If someone is interested in optimizing the content extractor for a bunch of URLs I'm commonly parsing, I'd be interested in paying a freelance rate. I'm not looking for overfitting, but hoping to improve this repo's general capacity to handle varying content schemas.

Purpose of removing periods immediately followed by letters?

This line has been giving me unexpected issues and I'm thinking of removing it: https://github.com/ageitgey/node-unfluff/blob/master/src/formatter.coffee#L71 (txt = txt.replace(/(\w+\.)([A-Z]+)/, '$1 $2')). What is its purpose? What side-effects might I get from removing it?

The issue it gives me right now is with initialisms, like C.R.T. - they get separated into words and I get "C.", "R.", and "T.". Not sure how else to solve this issue.
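If the purpose of that line is to repair a missing space between sentences (an assumption; the source does not say), a narrower pattern might avoid touching initialisms. The sketch below is a possible workaround, not the library's code, and `separateSentences` is a made-up name.

```javascript
// Assumed intent of the original rule: repair "end.Next" style run-together sentences.
// This narrower pattern only fires after a lowercase word of two or more letters,
// so dotted initialisms such as C.R.T. are left alone.
function separateSentences(txt) {
  return txt.replace(/([a-z]{2,}\.)([A-Z])/g, '$1 $2');
}

console.log(separateSentences('It was the end.Next came the sequel.'));
// It was the end. Next came the sequel.
console.log(separateSentences('An old C.R.T. monitor'));
// An old C.R.T. monitor
```

The trade-off is that sentences ending in a digit or an uppercase word ("in 2017.Then") are no longer repaired, and mixed-case hostnames such as "www.Example" would still be split.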

Title should prefer meta[property="og:title"]

Would be nice if title() preferred the opengraph title (e.g. meta[property="og:title"]), as this title is usually more descriptive than the page title.

try to take div itemprop="articleBody" into account

Hello,
thanks for your module, it is working nicely.

I've had just a little issue with text extraction.
Your calculateBestNode() function doesn't take div or article into account and it will not check for schema.org itemprop="articleBody". But nodes marked with this itemprop are pretty good candidates.

Example:
http://www.lemonde.fr/election-presidentielle-2017/article/2016/12/02/et-hollande-renonca-a-se-representer_5042285_4854003.html
Your module extracts the parent.parent of the article and so takes the content-menu as text.

Thanks
Hector

Did I hear correctly? unfluff will be able to work on the browser?

That would be REALLY AMAZING....

IF that takes place, would you have some sample code for that matter?

Thanks

Hugo Barbosa

Text missing

Hi!
When I look at the page source of an HTML page, I can see that there's text in some span or p tags. But this text doesn't show up in the result that unfluff returns when scraping.

Question: Why not? What can I do to extract all the text from the HTML document? Any configuration of the filtering that is applied?

Include image url extraction?

It would be great if it could also extract the URLs of the images in the page.

Extracted Date is Wrong

I tried to extract this article: Apple Seeds Eleventh Beta of iOS 12 to Developers [Update: Public Beta Available]

It was able to extract a date, but it was the date of the first top-rated comment, not the article's date. The response is below.

{
  title: 'Apple Seeds Eleventh Beta of iOS 12 to Developers [Update',
  softTitle: 'Apple Seeds Eleventh Beta of iOS 12 to Developers [Update: Public Beta Available]',
  date: '8 hours ago at 10:09 am',
  author: ['Monday August 27, 2018 10:05 am PDT by Juli Clover'],
  publisher: null,
  copyright: '2000-document',
  favicon: '//cdn.macrumors.com/images-new/favicon.ico',
  description: 'Apple today seeded the eleventh beta of an upcoming iOS 12 update to developers for testing purposes, just a few days after seeding the tenth beta...',
  keywords: 'iOS 12',
  lang: 'en',
  canonicalLink: 'https://www.macrumors.com/2018/08/27/apple-seeds-ios-12-beta-11-to-developers/',
  tags: [],
  image: 'https://cdn.macrumors.com/article-new/2018/06/iOS-12-Memoji-800x775.jpg?retina',
  videos: [],
  links: [{
    text: 'Advertise on MacRumors',
    href: '//www.macrumors.com/contact.php'
  }],
  text: 'MacRumors attracts a broad audience         of both consumers and professionals interested in         the latest technologies and products. We also boast an active community focused on purchasing decisions and technical aspects of the iPhone, iPod, iPad, and Mac platforms.\n\nAdvertise on MacRumors'
}

Trim whitespace from tags?

I was trying to scrape a random page on theverge.com and noticed that some of the generated tags (which seem to be coming from some flyout sub-menus) aren't getting trimmed so have newlines and spaces from nested HTML.

Not sure if it makes sense to add a .trim() to somewhere like /src/extractor.coffee:122:

      tag = el.text().trim()
      if tag && tag.length > 0
        tags.push(tag)
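The same idea as a plain JavaScript sketch, run against strings like the ones in the output shown in this report. The `cleanTags` name is made up, and dropping duplicates is an extra assumption that goes beyond the trim proposed above.

```javascript
// Sketch: normalize tag strings collected from the page.
function cleanTags(rawTags) {
  const seen = new Set();
  const tags = [];
  for (const raw of rawTags) {
    const tag = raw.trim();
    // Skip whitespace-only entries and (assumption) repeated tags.
    if (tag.length > 0 && !seen.has(tag)) {
      seen.add(tag);
      tags.push(tag);
    }
  }
  return tags;
}

console.log(cleanTags(['\n      \n    Architecture\n  ', 'itunes', '   ', 'itunes']));
// [ 'Architecture', 'itunes' ]
```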

Input:

$ curl -s http://www.theverge.com/2015/4/6/8357987/star-wars-digital-hd-collection-itunes-google-play | unfluff | json

Output:

{
  "title": "All six Star Wars movies are coming to iTunes, Google Play, and other video services",
  "favicon": "https://cdn0.vox-cdn.com/images/verge/favicon.vc44a54f.ico",
  "description": "The Star Wars movies are coming to smartphones. All six movies — yes, even The Phantom Menace, unfortunately — will be launched on digital video services such as iTunes, Google Play, Amazon Instant...",
  "lang": "en",
  "canonicalLink": "http://www.theverge.com/2015/4/6/8357987/star-wars-digital-hd-collection-itunes-google-play",
  "tags": [
    "\n      \n    Architecture\n  ",
    "\n      \n    Typography\n  ",
    "\n      \n    Concepts\n  ",
    "\n      \n    Politics\n  ",
    "\n      \n    National Security\n  ",
    "itunes",
    "google",
    "star wars",
    "microsoft",
    "movie",
    "film",
    "apple",
    "google play",
    "xbox video"
  ],
  "image": "https://cdn3.vox-cdn.com/thumbor/-su7_oi9qhXAY92kcBUg09sIaCU=/0x0:1536x864/1600x900/cdn0.vox-cdn.com/uploads/chorus_image/image/46061342/starwars-digital.0.0.jpg",
  "videos": [],
  "text": "The Star Wars movies are coming to smartphones. All six movies — yes, even The Phantom Menace, unfortunately — will be launched on digital video services such as iTunes, Google Play, Amazon Instant Video, and Xbox Video around the world on April 10th, Disney and Lucasfilm announced today. The launch will allow fans to buy Digital HD versions of the movies individually, or get them all at once as part of the Star Wars Digital Movie Collection.\n\nThe six movies each come with bonus features, including documentaries, interviews with production staff, deleted scenes, and closer looks at the films' models and sets. Some digital retailers are also offering extra incentives to buy the movies from their marketplace — get the entire collection from Xbox Video and you'll earn a digital R2-D2 to accompany your creepy, blank-faced Xbox Live avatar, a pinball table for the free-to-play Pinball FX 2, and access to an Xbox-only featurette.\n\nBuy the collection on Xbox Video to earn an imaginary R2-D2\n\nA number of digital retailers have yet to specify the price for the collection, but the whole bundle is available for $89 on the Google Play store, with individual movies going for $19.99. There were rumors last year that Disney was planning to re-release the original Star Wars movies on Blu-ray without George Lucas' CGI spot-welding, but viewers don't appear to be able to choose which version of the movie they want to watch with the upcoming Digital HD editions. Purchasers of the new versions will have to cope with unconvincing Dewbacks, Han having the gall to step on Jabba's tail, and the disconcerting sight of a moody Hayden Christensen as Vader's ghost at the Endor feast. At least they'll be in high resolution."
}

Where the semi-related HTML markup seems to be:

  <h2>Design</h2>
  <ul class="m-nav__menu">

<li class="design">
  <a href="/design" class="has-icon" >
      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 43.978 46" class="icon" data-svg-fallback="/v/verge/scripts/../images/hubs/design.png" data-svg-fallback-override=""><g fill="#fff"><path d="M32.53 45.281v-32.559l-32.559 32.559h32.559zm-6.005-6.006h-12.436l12.436-12.435v12.435z" class="path1"/><path d="M43.993 18.966h-.618l-.809-3.644h-.72l-.809 3.644h-.619v25.104h.57v1.93h2.435v-1.93h.57z" class="path2"/><path d="M36.391 13.88v1.442h-1.262v26.819h.57v3.859h2.434v-3.859h.571v-37.107l-3.575 8.134z" class="path3"/></g></svg>
    All Design
  </a>
</li>

<li class="design">
  <a href="/tag/architecture" class="has-icon" >
      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 43.978 46" class="icon" data-svg-fallback="/v/verge/scripts/../images/hubs/design.png" data-svg-fallback-override=""><g fill="#fff"><path d="M32.53 45.281v-32.559l-32.559 32.559h32.559zm-6.005-6.006h-12.436l12.436-12.435v12.435z" class="path1"/><path d="M43.993 18.966h-.618l-.809-3.644h-.72l-.809 3.644h-.619v25.104h.57v1.93h2.435v-1.93h.57z" class="path2"/><path d="M36.391 13.88v1.442h-1.262v26.819h.57v3.859h2.434v-3.859h.571v-37.107l-3.575 8.134z" class="path3"/></g></svg>
    Architecture
  </a>
</li>

<li class="design">
  <a href="/tag/typography" class="has-icon" >
      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 43.978 46" class="icon" data-svg-fallback="/v/verge/scripts/../images/hubs/design.png" data-svg-fallback-override=""><g fill="#fff"><path d="M32.53 45.281v-32.559l-32.559 32.559h32.559zm-6.005-6.006h-12.436l12.436-12.435v12.435z" class="path1"/><path d="M43.993 18.966h-.618l-.809-3.644h-.72l-.809 3.644h-.619v25.104h.57v1.93h2.435v-1.93h.57z" class="path2"/><path d="M36.391 13.88v1.442h-1.262v26.819h.57v3.859h2.434v-3.859h.571v-37.107l-3.575 8.134z" class="path3"/></g></svg>
    Typography
  </a>
</li>

<li class="design">
  <a href="/tag/concepts" class="has-icon" >
      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 43.978 46" class="icon" data-svg-fallback="/v/verge/scripts/../images/hubs/design.png" data-svg-fallback-override=""><g fill="#fff"><path d="M32.53 45.281v-32.559l-32.559 32.559h32.559zm-6.005-6.006h-12.436l12.436-12.435v12.435z" class="path1"/><path d="M43.993 18.966h-.618l-.809-3.644h-.72l-.809 3.644h-.619v25.104h.57v1.93h2.435v-1.93h.57z" class="path2"/><path d="M36.391 13.88v1.442h-1.262v26.819h.57v3.859h2.434v-3.859h.571v-37.107l-3.575 8.134z" class="path3"/></g></svg>
    Concepts
  </a>
</li>
  </ul>

Ignore Social Buttons

USA Today places social buttons at the top of its articles, and the text from the buttons is imported by unfluff. Is there any way to skip this and all other buttons?

ref: http://www.usatoday.com/story/news/nation/2014/10/02/liberia-ebola-patient-thomas-duncan-airport-screening/16591753/

unfluff in ionic app

In my Ionic app, I imported @types/node so that I can require and use unfluff. However, I get an error when execution reaches the extractor call.

Error:
{"__zone_symbol__currentTask":{"type":"microTask","state":"notScheduled","source":"Promise.then","zone":"","cancelFn":null,"runCount":0}}

Code:
import {} from "@types/node";

let extractor = require('unfluff');
let extractData = extractor(content, 'en');

Get parts of content independently

It'd be nice to be able to get title, image, and description for a page without getting the full text. Parsing the text can be very slow for long pages (e.g. http://en.wikipedia.org/wiki/Apple_Inc takes 2 seconds on my macbook).

Perhaps, something like:

var extractor = require('unfluff');
extractor(my_html_data, {
    lang: 'en', // Optional language
    text: false // Don't fetch text
});

or perhaps just change the API to expose the functions separately on the exports.
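A lazy entry point (`extractor.lazy`, with per-field methods such as `videos()`) appears in another report in this listing. The sketch below only illustrates the general design of computing each field on demand and caching it; `lazyExtract` and the two toy extractor functions are stand-ins, not unfluff internals.

```javascript
// Sketch of lazy, memoized field extraction.
function lazyExtract(html, extractors) {
  const cache = {};
  const api = {};
  for (const [field, extract] of Object.entries(extractors)) {
    api[field] = () => {
      // Run each extractor at most once, and only when asked for.
      if (!(field in cache)) cache[field] = extract(html);
      return cache[field];
    };
  }
  return api;
}

let textRuns = 0;
const page = lazyExtract('<title>Hello</title><p>Body</p>', {
  title: (html) => (html.match(/<title>(.*?)<\/title>/) || [])[1] || null,
  text: (html) => { textRuns += 1; return html.replace(/<[^>]+>/g, ' ').trim(); },
});

console.log(page.title()); // "Hello"
console.log(textRuns);     // 0, because text() was never called
```

With this shape, a caller who only needs the title, image, and description never pays for the expensive text pass.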

Bad lazy author extraction

https://medium.com/@pimterry/host-your-node-app-on-dokku-digitalocean-1cb97e3ab041

Pulls out
<link rel="author" href="https://medium.com/@pimterry">
instead of
<meta property="article:author" content="Tim Perry">

Consecutive newlines in HTML text should be converted to spaces instead of '\n\n'

If you curl -s "http://www.sec.gov/Archives/edgar/data/1232524/000119312514258343/d752412dex991.htm" | unfluff, part of the result is “This\n\ntransaction supports our mission”. Here is the source:

&#147;This  
transaction supports our mission

This obscures real paragraph breaks, as you can see.
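As a sketch of the requested behaviour: a line break inside a run of text can be folded into a single space before paragraphs are joined. `collapseSoftBreaks` is a hypothetical helper, and where it would hook into the formatter is not something this issue establishes.

```javascript
// Sketch: treat a line break inside a text node as ordinary whitespace,
// the way a browser would when rendering the paragraph.
function collapseSoftBreaks(text) {
  return text.replace(/[ \t]*\r?\n[ \t]*/g, ' ');
}

console.log(JSON.stringify(collapseSoftBreaks('This  \ntransaction supports our mission')));
// "This transaction supports our mission"
```

This would have to run on each text node on its own, so that the '\n\n' separators added between real paragraphs are not collapsed as well.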

How can I restore the content with the extracted metadata

Hi, I tried this tool and it is awesome but I have a question here:

Say that I want to extract the main page content (the article) and show a restyled version of it (like the Pocket text view). But unfluff just extracts the whole text, with images and videos split out, into the JSON object. How can I restore the content? I don't even know how many paragraphs it has or where the images are positioned in the article. Thanks!

How can I manage this case?

Hello,

The following HTML code produces output with a missing space.
var content = extractor("<html><body><p>Conditions d'utilisation<b class='hideforacc'>du Service de livraison internationale - la page s'ouvre dans une nouvelle fenêtre ou un nouvel onglet</b></p></body></html>", "fr").text; console.log(content);

<b class='hideforacc'> should be replaced by a space

How can I manage this case without modifying the HTML code?

Thanks
Christophe
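One possible workaround, sketched under the assumption that preprocessing the HTML string in memory is acceptable (the page itself stays untouched): pad inline tags with a space before calling the extractor. `padInlineTags` and its tag list are made up for illustration.

```javascript
// Workaround sketch: insert a space before inline opening tags so the words
// on either side of the tag do not run together once the tag is stripped.
// The tag list is an assumption; extend it as needed.
function padInlineTags(html) {
  return html.replace(/<(b|i|em|strong|span)\b/gi, ' <$1');
}

const input = "<p>Conditions d'utilisation<b class='hideforacc'>du Service</b></p>";
console.log(padInlineTags(input));
// <p>Conditions d'utilisation <b class='hideforacc'>du Service</b></p>
```

The word boundary keeps tags such as `<body>`, `<br>`, and `<img>` from matching. The padded string would then be passed to the extractor in place of the original.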

Convert to front-end friendly, remove 'fs'

This module could pretty easily be converted to be front end friendly by removing the need for 'fs'. Since the stopwords files are the only things being accessed with 'fs', this could be solved pretty easily.

I created my own solution, but I'm not sure it was particularly elegant. My current thought is to create an object with a key of the language code and a value of the stopwords array. Do you have other thoughts about more elegant solutions?
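The object-keyed-by-language idea described above could be sketched like this. The word lists are tiny placeholders rather than the library's real stopword files, and the English fallback is an assumed policy.

```javascript
// Sketch: stopwords bundled as data instead of read from disk with fs.
const STOPWORDS = {
  en: ['a', 'and', 'the', 'of'],
  fr: ['le', 'la', 'et', 'de'],
};

function getStopwords(lang) {
  // Fall back to English for unknown languages (an assumed policy).
  const known = Object.prototype.hasOwnProperty.call(STOPWORDS, lang);
  return new Set(known ? STOPWORDS[lang] : STOPWORDS.en);
}

function countStopwords(text, lang) {
  const stopwords = getStopwords(lang);
  return text.toLowerCase().split(/\s+/).filter((w) => stopwords.has(w)).length;
}

console.log(countStopwords('The cat and the hat', 'en')); // 3
```

Bundling every language up front grows the payload, so a browser build might prefer one module per language that a bundler can load on demand.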

Date isn't always ISO format?

Is the date attribute supposed to consistently be an ISO date string? Or is it normal for it to sometimes be (an attempt at) a human-readable string, e.g. "August 2, 2017"?

{
    "title": "What Made the Moon? New Ideas Try to Rescue a Troubled Theory",
    "softTitle": "What Made the Moon? New Ideas Try to Rescue a Troubled Theory",
    "date": "ByRebecca BoyleAugust 2, 2017",
    "author": [
      "Rebecca Boyle"
    ],
    "publisher": "Quanta Magazine",
    "copyright": "2017",
    "description": "Textbooks say that the moon was formed after a Mars-size mass smashed the young Earth. But new evidence has cast doubt on that story, leaving researchers to",
    "keywords": "chemistry,geochemistry,geophysics,physics,planetary science",
    "lang": null,
    "canonicalLink": "https://www.quantamagazine.org/what-made-the-moon-new-ideas-try-to-rescue-a-troubled-theory-20170802/",
    "tags": [
      "planetary science",
      "chemistry",
      "geochemistry",
      "geophysics",
      "physics"
    ],
    "image": "https://d2r55xnwy6nx47.cloudfront.net/uploads/2017/08/Synestia_520x2921.jpg",
    "videos": [],
    "links": [
      {
        "text": "published a paper on the physics of synestias",
        "href": "http://onlinelibrary.wiley.com/doi/10.1002/2016JE005239/full"
      },
      {
        "text": "captured asteroids",
        "href": "http://aasnova.org/2016/09/23/explaining-the-birth-of-the-martian-moons/"
      },
      {
        "text": "others argue formed from Martian impacts",
        "href": "http://onlinelibrary.wiley.com/doi/10.1002/2017GL074002/abstract"
      },
      {
        "text": "she argued that Earth&#x2019;s moon is not the original moon",
        "href": "http://www.nature.com/ngeo/journal/v10/n2/abs/ngeo2866.html"
      },
      {
        "text": "Robin Canup",
        "href": "https://www.boulder.swri.edu/~robin/"
      }
    ],
    "text": "Conditions in this structure are indescribably hellish; there is no surface, but instead clouds of molten rock, with every region of the cloud forming molten-rock raindrops. The moon grew inside this vapor, Lock said, before the vapor eventually cooled and left in its wake the Earth-moon system.\n\nGiven the structure’s unusual characteristics, Lock and Stewart thought it deserved a new name. They tried several versions before coining synestia, which uses the Greek prefix syn-, meaning together, and the goddess Hestia, who represents the home, hearth and architecture. The word means “connected structure,” Stewart said.\n\n“These bodies aren’t what you think they are. They don’t look like what you thought they did,” she said.\n\nIn May, Lock and Stewart published a paper on the physics of synestias; their paper arguing for a synestia lunar origin is still in review. They presented the work at planetary science conferences in the winter and spring and say their fellow researchers were intrigued but hardly sold on the idea. That may be because synestias are still just an idea; unlike ringed planets, which are common in our solar system, and protoplanetary disks, which are common across the universe, no one has ever seen one.\n\n“But this is certainly an interesting pathway that could explain the features of our moon and get us over this hump that we’re in, where we have this model that doesn’t seem to work,” Lock said.\n\nAmong natural satellites in the solar system, Earth’s moon may be most striking for its solitude. Mercury and Venus lack natural satellites, in part because of their nearness to the sun, whose gravitational interactions would make their moons’ orbits unstable. Mars has tiny Phobos and Deimos, which some argue are captured asteroids and others argue formed from Martian impacts. 
And the gas giants are chockablock with moons, some rocky, some watery, some both.\n\nIn contrast to these moons, Earth’s satellite also stands out for its size and the physical burden it carries. The moon is about 1 percent the mass of Earth, while the combined mass of the outer planets’ satellites is less than one-tenth of 1 percent of their parents. Even more important, the moon contains 80 percent of the angular momentum of the Earth-moon system. That is to say, the moon is responsible for 80 percent of the motion of the system as a whole. For the outer planets, this value is less than 1 percent.\n\nThe moon may not have carried all this weight the whole time, however. The face of the moon bears witness to its lifelong bombardment; why should we assume that just one rock was responsible for carving it out of Earth? It’s possible that multiple impacts made the moon, said Raluca Rufu, a planetary scientist at the Weizmann Institute of Science in Rehovot, Israel.\n\nIn a paper published last winter, she argued that Earth’s moon is not the original moon. It is instead a compendium of creation by a thousand cuts — or at the very least, a dozen, according to her simulations. Projectiles coming in from multiple angles and at multiple speeds would hit Earth and form disks, which coalesce into “moonlets,” essentially crumbs that are smaller than Earth’s current moon. Interactions between moonlets of different ages cause them to merge, eventually forming the moon we know today.\n\nPlanetary scientists were receptive when her paper was published last year; Robin Canup, a lunar scientist at the Southwest Research Institute and a dean of moon-formation theories, said it was worth considering. More testing remains, however. Rufu is not sure whether the moonlets would have been locked in their orbital positions, similar to how Earth’s moon constantly faces the same direction; if so, she is not sure how they could have merged. 
“That’s what we are trying to figure out next,” Rufu said.\n\nMeanwhile, others have turned to another explanation for the similarity of Earth and the moon, one that might have a very simple answer. From synestias to moonlets, new physical models — and new physics — may be moot. It’s possible that the moon looks just like Earth because Theia did, too.",
  }

Incorrect video extractions

unfluff returns the following video for http://www.theverge.com/2016/2/21/11077616/lg-g5-announced-specs-release-date-price-mwc-2016:

  "videos": [
    {
      "src": "//www.googletagmanager.com/ns.html?id=GTM-5XTZVB",
      "height": "0",
      "width": "0"
    }
  ],

Will update as I dig a little deeper.
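
Until the extractor itself is fixed, one possible workaround is to filter the `videos` array after extraction. The sketch below is not part of unfluff: `dropZeroSizedVideos` is a hypothetical helper, and it assumes that an entry reporting a width or height of 0 (like the Google Tag Manager iframe above) is never a real embedded video.

```javascript
// Hypothetical post-processing helper, not part of the unfluff API.
// Assumption: an explicit width or height of 0 marks a tracking iframe.
function dropZeroSizedVideos(videos) {
  return (videos || []).filter((video) => {
    const width = parseInt(video.width, 10);
    const height = parseInt(video.height, 10);
    // Entries with a missing or unparseable size are kept; only explicit zeros are dropped.
    return !(width === 0 || height === 0);
  });
}
```

It would be applied to the result of a normal extraction, for example `data.videos = dropZeroSizedVideos(data.videos);`.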

Do you support Open Graph?

Hello,

Do you support Open Graph? If not, do you have any plans to support it?

Thanks

303 See Other

When I tested the link 'https://www.nature.com/articles/d41586-019-01252-0', I got "303 See Other". Am I missing some configuration? Is there a way to tell unfluff to follow the 303 redirect and keep fetching?

400 Bad Request

For some URLs I get a 400 Bad Request result:

podsavethepeople.com, www.trumptaxscam.org

Description should try to get meta[property="og:description"]

This issue is similar to the og:title issue that was fixed some months ago.

A Twitter status URL is one example where the extraction gives poor results.

_Example URL:_ https://twitter.com/github/status/609116267891580928

_Result:_

{
title: 'GitHub on Twitter',
favicon: '//abs.twimg.com/favicons/favicon.ico',
description: undefined,
keywords: undefined,
lang: 'en',
canonicalLink: 'https://twitter.com/github/status/609116267891580928',
tags: [],
image: 'https://pbs.twimg.com/profile_images/426158315781881856/sBsvBbjY_400x400.png',
videos: [],
text: 'Log in\n\nTo bring you Twitter, we, and our partners, use cookies on our and other websites. Cookies help personalise Twitter content, tailor Twitter Ads (.......)'
}

_og:description of that page:_ “Check out the full list of @CoDeConf sessions and grab your ticket now! See you in Nashville in 2 weeks: http://t.co/9mynYt0VPc”
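
As a stopgap, the og:description value can be read from the raw HTML and used when `description` comes back undefined. This is only a sketch under stated assumptions: `ogDescription` and `descriptionWithFallback` are hypothetical helpers, not part of unfluff, and the regex-based parsing assumes quoted attribute values with no `>` inside them.

```javascript
// Hypothetical helper, not part of unfluff: find <meta property="og:description">.
// Assumption: attribute values are quoted and contain no '>' character.
function ogDescription(html) {
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of metaTags) {
    const attrs = {};
    const attrPattern = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
    let match;
    // Collect every name="value" pair so attribute order does not matter.
    while ((match = attrPattern.exec(tag)) !== null) {
      attrs[match[1].toLowerCase()] = match[2] !== undefined ? match[2] : match[3];
    }
    if (attrs.property === 'og:description' && attrs.content) {
      return attrs.content;
    }
  }
  return undefined;
}

// Prefer the description unfluff extracted; fall back to og:description.
function descriptionWithFallback(data, html) {
  return data.description || ogDescription(html);
}
```

With the `data` object returned by the extractor and the same HTML string that was passed in, `descriptionWithFallback(data, html)` would return the Open Graph text whenever the extracted description is missing.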

"Some features may not work without JavaScript. Please try enabling it if you encounter problems."

The message "Some features may not work without JavaScript. Please try enabling it if you encounter problems." replaced the text in the JSON output, and I'm pretty sure it is not what is on the web page I tested. Can anyone explain this to me?

Thanks

Links and images

Is it possible to get an array of the images embedded in the content, along with their positions? And the same for links?

TypeError: this.lang is not a function

Any idea why I am getting this error from calling:

const meta = yield extractor.lazy(content, 'en')

lodash needs update

npm audit reports unfluff as insecure because it uses an old version of lodash.
unfluff needs lodash version 4.17.15 or later.

Handle Asian scripts better

I'm using unfluff as an easy way to grab the first few paragraphs of Wikipedia articles to describe media. When I print the text returned from https://en.wikipedia.org/wiki/Now_and_Then,_Here_and_There I get:

Now and Then, Here and There (

Now and Then, Here and There follows a young boy named Shuzo "Shu" Matsutani who, in an attempt to save an unknown girl, is transported to another world which is possibly the Earth in the far future. The world is desolate and militarized, and water is a scarce commodity.

That is the start of the output, whereas the actual article begins:

Now and Then, Here and There (今、そこにいる僕 Ima, Soko ni Iru Boku?) is a thirteen episode anime series directed by Akitaro Daichi and written by Hideyuki Kurata. The story was originally conceived by director Daichi. It premiered in Japan on the WOWOW television station on October 14, 1999 and ran until January 20, 2000. It was licensed for Region 1 DVD English language release by Central Park Media under the US Manga Corps banner. Following the 2009 bankruptcy and liquidation of Central Park Media, ADV Films picked up the series for a release on July 7, 2009.[1] As of Sept. 1, 2009, the series is licensed by ADV's successor, AEsir Holdings, with distribution from Section23 Films.[2]

Now and Then, Here and There follows a young boy named Shuzo "Shu" Matsutani who, in an attempt to save an unknown girl, is transported to another world which is possibly the Earth in the far future. The world is desolate and militarized, and water is a scarce commodity.

The problem is almost certainly with the 今 character. I understand that Asian text is known not to work very well. However, in this instance I'm losing a large portion of English text. A simple fix for now would be to remove the offending character from the output or replace it with the Unicode replacement character (U+FFFD).