Extracted output is not clean, it contains escape characters like /n/n and so on (node-html-to-text, CLOSED)

coolb0y commented on August 17, 2024

The extracted output is not clean: it contains escape characters like /n/n and so on.

from node-html-to-text.

Comments (3)

KillyMXI commented on August 17, 2024

What is the input HTML for this output?

coolb0y commented on August 17, 2024

https://github.com/coolb0y/ngo-file/blob/main/bookstore.html is the file I am trying to scrape right now.

import fs from 'node:fs/promises';
import { convert } from 'html-to-text';

const html = await fs.readFile(filePath, 'utf-8');
const text = convert(html, options);

This is the code I am using in Node.js to scrape the content from the HTML file.

KillyMXI commented on August 17, 2024

You should probably start with empty options. The options you provided contain a lot of nonsense. Not all of those options do what you hope they do, and not everything you might want to do is the responsibility of html-to-text.

That being said, there is still a lot that can be achieved before you have to resort to text post-processing.

  {
    wordwrap: false,
    encodeCharacters: { '/': '' }, // this wasn't the purpose of this option, but if it works - it works. You might have to use post-processing instead if you want to avoid double spaces as well.
    selectors: [
      { selector: 'img', format: 'skip' },
      { selector: 'hr', format: 'skip' },
      { selector: 'br', format: 'skip' },
      { selector: 'form', format: 'skip' },
      { selector: 'a', options: { ignoreHref: true } }, // link text can be a part of content, you might only need to avoid href when indexing
      { selector: 'a[href="#TOP"]', format: 'skip' }, // not something you'd search for
      { selector: 'p', format: 'inline' }, // since you don't care about block formatting for search indexing, you can mark everything inline
      { selector: 'table', format: 'inline' },
      { selector: 'h1', format: 'inline' },
    ]
  }

You are also incorrect about \n. html-to-text simply outputs line breaks. What you show is the output after something downstream of html-to-text escaped it to make it a valid JS string literal.
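
To illustrate the point about line breaks, here is a minimal sketch in plain Node.js. It does not call html-to-text at all; the `text` value is a made-up stand-in for converter output. The string holds real newline characters, and the backslash sequences only show up once something serializes it, for example `JSON.stringify` or a REPL echoing the value.

```javascript
// Hypothetical stand-in for what convert() might return: real line breaks.
const text = 'Bookstore\n\nNew arrivals';

// Printing the string shows actual line breaks, not backslash sequences.
console.log(text);

// Serializing it (as JSON.stringify or a REPL echo would) produces the
// escaped form, with a literal backslash followed by "n".
const escaped = JSON.stringify(text);
console.log(escaped);

// The original string contains no backslash characters at all.
const hasBackslash = text.includes('\\');
console.log(hasBackslash);
```

If the escaped form is what ends up in a file or an index, the place to look is likely the step that writes the value out, not the conversion itself.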

If there is anything more you'd like to do with the text, you don't have to do it all in one step with html-to-text; you can do it afterwards by post-processing the text with regular expressions.
Instead of encodeCharacters: { '/': '' }, I would rather do text.replace(/\s+\/\s+/g, ' ').
If you want to guarantee the output to be a string with no line breaks - that's also best achieved with a regex replace. encodeCharacters does not affect characters that are inserted as a part of layout.
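
The post-processing suggested above could be sketched roughly as follows. `flattenForIndexing` is a hypothetical helper name, not part of html-to-text, and the sample string is invented for illustration; the first regex is the one quoted in the comment, the second is an assumed way to collapse the remaining line breaks and whitespace.

```javascript
// Hypothetical helper (not part of html-to-text): flatten converter output
// into a single line suitable for search indexing.
function flattenForIndexing(text) {
  return text
    .replace(/\s+\/\s+/g, ' ') // drop standalone " / " separators
    .replace(/\s+/g, ' ')      // collapse line breaks and whitespace runs
    .trim();
}

// Invented sample resembling converter output with separators and line breaks.
const sample = 'Home / Books\n\nNew  arrivals / Sale\n';
const flat = flattenForIndexing(sample);
console.log(flat);
```

Because the first pattern requires whitespace on both sides of the slash, a slash inside a token such as `a/b` is left alone, which is one reason to prefer it over removing every `/` through `encodeCharacters`.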
