
Comments (3)

KillyMXI commented on August 17, 2024

What is the input HTML for this output?

from node-html-to-text.

coolb0y commented on August 17, 2024

https://github.com/coolb0y/ngo-file/blob/main/bookstore.html is the file I am trying to scrape right now.

import fs from 'node:fs/promises';
import { convert } from 'html-to-text';
const html = await fs.readFile(filePath, 'utf-8');
const text = convert(html, options);

This is the code I am using in Node.js to scrape the content from the HTML file.


KillyMXI commented on August 17, 2024
  1. You should probably start with empty options. The options you provided contain a lot of nonsense, and not all of them do what you hope they do.
  2. Not everything you might want to do is the responsibility of html-to-text.

That being said, there is still a lot that can be achieved before you have to resort to text post-processing.

  {
    wordwrap: false,
    encodeCharacters: { '/': '' }, // this wasn't the purpose of this option, but if it works - it works. You might have to use post-processing instead if you want to avoid double spaces as well.
    selectors: [
      { selector: 'img', format: 'skip' },
      { selector: 'hr', format: 'skip' },
      { selector: 'br', format: 'skip' },
      { selector: 'form', format: 'skip' },
      { selector: 'a', options: { ignoreHref: true } }, // link text can be a part of content, you might only need to avoid href when indexing
      { selector: 'a[href="#TOP"]', format: 'skip' }, // not something you'd search for
      { selector: 'p', format: 'inline' }, // since you don't care about block formatting for search indexing, you can mark everything inline
      { selector: 'table', format: 'inline' },
      { selector: 'h1', format: 'inline' },
    ]
  }

You are also incorrect about \n. html-to-text simply outputs line breaks. What you show is the output after it has been escaped, downstream of html-to-text, to form a valid JS string literal.
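
To illustrate the point about \n, here is a minimal sketch that uses only built-in JavaScript. The sample string is made up and stands in for html-to-text output; JSON.stringify is used here as one example of a step that escapes line breaks, and it is an assumption that something similar happened in your pipeline.

```javascript
// Made-up sample: a string with two real line breaks,
// standing in for what html-to-text returns.
const text = 'Bookstore\nOpen daily\nContact us';

// Printing the string directly shows three separate lines.
console.log(text);

// Serializing it (here with JSON.stringify) escapes each line break
// as the two characters "\" and "n" and wraps the result in quotes.
const escaped = JSON.stringify(text);
console.log(escaped);

// The raw string contains no backslash at all; only the escaped form does.
const rawHasBackslash = text.includes('\\');
const escapedBreaks = escaped.split('\\n').length - 1;
console.log(rawHasBackslash, escapedBreaks);
```

If the \n sequences only appear when the value is logged as JSON or echoed in a REPL, the string itself is fine and contains ordinary line breaks.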

If there is anything more you'd like to do with the text, you don't have to do it all in one step with html-to-text. You can post-process the text afterwards with regular expressions.
Instead of encodeCharacters: { '/': '' }, I would rather use text.replace(/\s+\/\s+/g, ' ').
If you want to guarantee that the output is a string with no line breaks, that is also best achieved with a regex replace. encodeCharacters does not affect characters that are inserted as part of the layout.
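
The post-processing described above could be sketched roughly as follows. The helper name normalizeForIndex and the sample string are hypothetical, not part of html-to-text; the first replace is the regex given above, and the second is one possible way to remove line breaks.

```javascript
// Hypothetical helper (not part of html-to-text): cleans up converted
// text so it becomes a single line suitable for search indexing.
function normalizeForIndex(text) {
  return text
    // Drop slashes that are surrounded by whitespace, e.g. "Home / Books".
    .replace(/\s+\/\s+/g, ' ')
    // Collapse line breaks and any repeated whitespace into one space.
    .replace(/\s+/g, ' ')
    .trim();
}

// Made-up sample standing in for html-to-text output.
const sample = 'Home / Books / Contact\n\nWelcome to the\nbookstore';
console.log(normalizeForIndex(sample));
// prints: Home Books Contact Welcome to the bookstore
```

Because the first pattern requires whitespace on both sides of the slash, a slash inside a word or path such as a/b is left alone, and no double spaces are left behind.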
