hyntax's Introduction

Hyntax

Straightforward HTML parser for JavaScript. Live Demo.

  • Simple. API is straightforward, output is clear.
  • Forgiving. Just like a browser, it parses invalid HTML without failing.
  • Supports streaming. Can process HTML while it's still being loaded.
  • No dependencies.

Usage

npm install hyntax
const { tokenize, constructTree } = require('hyntax')
const util = require('util')

const inputHTML = `
<html>
  <body>
      <input type="text" placeholder="Don't type">
      <button>Don't press</button>
  </body>
</html>
`

const { tokens } = tokenize(inputHTML)
const { ast } = constructTree(tokens)

console.log(JSON.stringify(tokens, null, 2))
console.log(util.inspect(ast, { showHidden: false, depth: null }))

TypeScript Typings

Hyntax is written in JavaScript but ships with integrated TypeScript typings to help you navigate its data structures. There is also a Types Reference that covers the most common types.

Streaming

Use the StreamTokenizer and StreamTreeConstructor classes to parse HTML chunk by chunk while it is still being loaded from the network or read from disk.

const { StreamTokenizer, StreamTreeConstructor } = require('hyntax')
const http = require('http')
const util = require('util')

http.get('http://info.cern.ch', (res) => {
  const streamTokenizer = new StreamTokenizer()
  const streamTreeConstructor = new StreamTreeConstructor()

  let resultTokens = []
  let resultAst

  res.pipe(streamTokenizer).pipe(streamTreeConstructor)

  streamTokenizer
    .on('data', (tokens) => {
      resultTokens = resultTokens.concat(tokens)
    })
    .on('end', () => {
      console.log(JSON.stringify(resultTokens, null, 2))
    })

  streamTreeConstructor
    .on('data', (ast) => {
      resultAst = ast
    })
    .on('end', () => {
      console.log(util.inspect(resultAst, { showHidden: false, depth: null }))
    })
}).on('error', (err) => {
  throw err;
})

Tokens

These are all the kinds of tokens that Hyntax extracts from an HTML string.

Overview of all possible tokens

Each token conforms to the Tokenizer.Token interface.

AST Format

The resulting syntax tree always has one top-level Document node, with optional child nodes nested inside it.

{
  nodeType: TreeConstructor.NodeTypes.Document,
  content: {
    children: [
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {}
      },
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {}
      }
    ]
  }
}

The content of each node is specific to the node's type. All of them are described in the AST Node Types reference.
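
As a rough illustration of this shape, the tree can be traversed with a small recursive helper. The sketch below relies only on what this section states: every node has nodeType and content, and nested nodes live in content.children. The walk helper and the hand-built sampleAst are hypothetical and not part of Hyntax; the 'document' and 'tag' strings mirror the AST dump quoted in the issues section.

```javascript
// Hypothetical helper, not part of Hyntax: depth-first walk over an AST.
function walk(node, visit, depth = 0) {
  visit(node, depth)
  // Only Document and Tag contents carry children; everything else is a leaf.
  const children = (node.content && node.content.children) || []
  for (const child of children) {
    walk(child, visit, depth + 1)
  }
}

// Hand-built tree in the documented shape. In real code this would be
// the `ast` returned by constructTree(tokens). Token fields are omitted.
const sampleAst = {
  nodeType: 'document',
  content: {
    children: [
      {
        nodeType: 'tag',
        content: {
          name: 'body',
          selfClosing: false,
          children: [
            { nodeType: 'tag', content: { name: 'input', selfClosing: true } },
            { nodeType: 'tag', content: { name: 'button', selfClosing: false, children: [] } }
          ]
        }
      }
    ]
  }
}

// Collect an indented outline of the tree.
const visited = []
walk(sampleAst, (node, depth) => {
  const name = node.content.name ? ':' + node.content.name : ''
  visited.push('  '.repeat(depth) + node.nodeType + name)
})
console.log(visited.join('\n'))
```

The helper only follows content.children, so extra fields on real nodes (such as the parentRef visible in the AST dump in the issues section) are never traversed.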

API Reference

Tokenizer

Hyntax has its tokenizer as a separate module. You can use generated tokens on their own or pass them further to a tree constructor to build an AST.

Interface

tokenize(html: string): Tokenizer.Result

Arguments

  • html
    HTML string to process
    Required.
    Type: string.

Tree Constructor

Once you have an array of tokens, you can pass it to the tree constructor to build an AST.

Interface

constructTree(tokens: Tokenizer.AnyToken[]): TreeConstructor.Result

Arguments

  • tokens
    Array of tokens received from the tokenizer.
    Required.
    Type: Tokenizer.AnyToken[].

Types Reference

Tokenizer.Result

interface Result {
  state: Tokenizer.State
  tokens: Tokenizer.AnyToken[]
}
  • state
    The current state of tokenizer. It can be persisted and passed to the next tokenizer call if the input is coming in chunks.
  • tokens
    Array of resulting tokens.
    Type: Tokenizer.AnyToken[]

TreeConstructor.Result

interface Result {
  state: State
  ast: AST
}
  • state
    The current state of the tree constructor. It can be persisted and passed to the next tree constructor call when tokens arrive in chunks.

  • ast
    Resulting AST.
    Type: TreeConstructor.AST

Tokenizer.Token

Generic Token. Other interfaces use it to create a specific Token type.

interface Token<T extends TokenTypes.AnyTokenType> {
  type: T
  content: string
  startPosition: number
  endPosition: number
}
  • type
    One of the Token types.

  • content
    Piece of original HTML string which was recognized as a token.

  • startPosition
    Index of a character in the input HTML string where the token starts.

  • endPosition
    Index of the last character of the token in the input HTML string (inclusive, judging by the sample tokenizer output in the issues section).
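
To make the position fields concrete, here is a minimal sketch that recovers a token's text from the original input. It assumes endPosition is inclusive, which is what the tokenizer output quoted in the issues section suggests. The sliceToken helper is hypothetical; the input and token values are copied from that issue.

```javascript
// Hypothetical helper: recover a token's text from the original input.
// Assumes endPosition is inclusive, hence the "+ 1".
function sliceToken(html, token) {
  return html.slice(token.startPosition, token.endPosition + 1)
}

// Input from the "nested script tags" issue.
const html = '\n<script>\n  console.log("<script></script>")\n</script>\n'

// Three tokens copied from the output quoted in that issue.
const tokens = [
  { type: 'token:open-tag-start-script', content: '<script', startPosition: 1, endPosition: 7 },
  { type: 'token:script-tag-content', content: '\n  console.log("<script>', startPosition: 9, endPosition: 32 },
  { type: 'token:close-tag-script', content: '</script>', startPosition: 33, endPosition: 41 }
]

// Every token's content should equal the matching slice of the input.
const allMatch = tokens.every((token) => sliceToken(html, token) === token.content)
console.log(allMatch)
```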

Tokenizer.TokenTypes.AnyTokenType

Shortcut type of all possible tokens.

type AnyTokenType =
  | Text
  | OpenTagStart
  | AttributeKey
  | AttributeAssigment
  | AttributeValueWrapperStart
  | AttributeValue
  | AttributeValueWrapperEnd
  | OpenTagEnd
  | CloseTag
  | OpenTagStartScript
  | ScriptTagContent
  | OpenTagEndScript
  | CloseTagScript
  | OpenTagStartStyle
  | StyleTagContent
  | OpenTagEndStyle
  | CloseTagStyle
  | DoctypeStart
  | DoctypeEnd
  | DoctypeAttributeWrapperStart
  | DoctypeAttribute
  | DoctypeAttributeWrapperEnd
  | CommentStart
  | CommentContent
  | CommentEnd

Tokenizer.AnyToken

Shortcut to reference any possible token.

type AnyToken = Token<TokenTypes.AnyTokenType>

TreeConstructor.AST

An alias for DocumentNode. The AST always has one top-level DocumentNode. See AST Node Types.

type AST = TreeConstructor.DocumentNode

AST Node Types

There are 7 possible types of Node. Each type has a specific content.

type DocumentNode = Node<NodeTypes.Document, NodeContents.Document>	
type DoctypeNode = Node<NodeTypes.Doctype, NodeContents.Doctype>
type TextNode = Node<NodeTypes.Text, NodeContents.Text>
type TagNode = Node<NodeTypes.Tag, NodeContents.Tag>
type CommentNode = Node<NodeTypes.Comment, NodeContents.Comment>
type ScriptNode = Node<NodeTypes.Script, NodeContents.Script>
type StyleNode = Node<NodeTypes.Style, NodeContents.Style>

Interfaces for each content type are described under Node Content Types.

TreeConstructor.Node

Generic Node. Other interfaces use it to create specific Nodes by providing the type of the Node and the type of the content inside it.

interface Node<T extends NodeTypes.AnyNodeType, C extends NodeContents.AnyNodeContent> {
  nodeType: T
  content: C
}

TreeConstructor.NodeTypes.AnyNodeType

Shortcut type of all possible Node types.

type AnyNodeType =
  | Document
  | Doctype
  | Tag
  | Text
  | Comment
  | Script
  | Style

Node Content Types

TreeConstructor.NodeContents.AnyNodeContent

Shortcut type of all possible types of content inside a Node.

type AnyNodeContent =
  | Document
  | Doctype
  | Text
  | Tag
  | Comment
  | Script
  | Style

TreeConstructor.NodeContents.Document

interface Document {
  children: AnyNode[]
}

TreeConstructor.NodeContents.Doctype

interface Doctype {
  start: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeStart>
  attributes?: DoctypeAttribute[]
  end: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeEnd>
}

TreeConstructor.NodeContents.Text

interface Text {
  value: Tokenizer.Token<Tokenizer.TokenTypes.Text>
}

TreeConstructor.NodeContents.Tag

interface Tag {
  name: string
  selfClosing: boolean
  openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStart>
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEnd>
  children?: AnyNode[]
  close?: Tokenizer.Token<Tokenizer.TokenTypes.CloseTag>
}
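
Because attributes are stored as token objects rather than plain strings, reading them usually takes a small adapter. The sketch below flattens a tag node's attributes into a key/value object using only the fields documented here (attributes, key, value, and each token's content). The attributesToObject helper and the sample node are hypothetical, and the sample omits token positions for brevity.

```javascript
// Hypothetical helper: flatten a tag node's attributes into { key: value }.
function attributesToObject(tagNode) {
  const result = {}
  for (const attribute of tagNode.content.attributes || []) {
    // `key` is optional in TagAttribute; keyless entries are skipped here.
    if (!attribute.key) continue
    // Attributes without a value (e.g. `disabled`) map to an empty string.
    result[attribute.key.content] = attribute.value ? attribute.value.content : ''
  }
  return result
}

// Hand-built node shaped like NodeContents.Tag (positions omitted).
const inputNode = {
  nodeType: 'tag',
  content: {
    name: 'input',
    selfClosing: true,
    attributes: [
      {
        key: { type: 'token:attribute-key', content: 'type' },
        value: { type: 'token:attribute-value', content: 'text' }
      },
      { key: { type: 'token:attribute-key', content: 'disabled' } }
    ]
  }
}

console.log(attributesToObject(inputNode))
```

Whether key tokens ever include surrounding whitespace is not stated in this document, so real code may want to trim the key before using it.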

TreeConstructor.NodeContents.Comment

interface Comment {
  start: Tokenizer.Token<Tokenizer.TokenTypes.CommentStart>
  value: Tokenizer.Token<Tokenizer.TokenTypes.CommentContent>
  end: Tokenizer.Token<Tokenizer.TokenTypes.CommentEnd>
}

TreeConstructor.NodeContents.Script

interface Script {
  openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStartScript>
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEndScript>
  value: Tokenizer.Token<Tokenizer.TokenTypes.ScriptTagContent>
  close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTagScript>
}

TreeConstructor.NodeContents.Style

interface Style {
  openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStartStyle>,
  attributes?: TagAttribute[],
  openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEndStyle>,
  value: Tokenizer.Token<Tokenizer.TokenTypes.StyleTagContent>,
  close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTagStyle>
}

TreeConstructor.DoctypeAttribute

interface DoctypeAttribute {
  startWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttributeWrapperStart>,
  value: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttribute>,
  endWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttributeWrapperEnd>
}

TreeConstructor.TagAttribute

interface TagAttribute {
  key?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeKey>,
  startWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValueWrapperStart>,
  value?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValue>,
  endWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValueWrapperEnd>
}

hyntax's People

Contributors

cfenzo, dependabot[bot], mykolaharmash

hyntax's Issues

AST → HTML Serializer

Add a serializer for the Hyntax AST. It should support some basic customization options, such as indentation size.
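
One possible starting point, sketched under the assumption that the node contents match the interfaces in the Types Reference: rebuild each node from the content strings of its tokens. The serialize and serializeAttribute functions are hypothetical and not part of Hyntax. The sketch dispatches on the shape of content rather than on nodeType, it re-inserts the "=" between an attribute's key and value because TagAttribute does not carry that token, and it ignores indentation options entirely.

```javascript
// Hypothetical sketch of an AST -> HTML serializer; not part of Hyntax.
const text = (token) => (token ? token.content : '')

// Works for TagAttribute and DoctypeAttribute (which has no key).
function serializeAttribute(attribute) {
  const key = text(attribute.key).trim()
  if (!attribute.value) return key
  const value = text(attribute.startWrapper) + attribute.value.content + text(attribute.endWrapper)
  return key ? key + '=' + value : value
}

function serialize(node) {
  const c = node.content
  const attributes = (c.attributes || []).map(serializeAttribute)
  if (c.openStart) {
    // Tag, Script and Style contents all carry openStart / openEnd tokens.
    const open = [c.openStart.content, ...attributes].join(' ') + c.openEnd.content
    const inner = c.value ? c.value.content : (c.children || []).map(serialize).join('')
    return open + inner + text(c.close) // `close` is optional on tags
  }
  if (c.start) {
    // Comment (start, value, end) or Doctype (start, attributes, end).
    if (c.value) return c.start.content + c.value.content + c.end.content
    return [c.start.content, ...attributes].join(' ') + c.end.content
  }
  if (c.value) return c.value.content // Text
  return (c.children || []).map(serialize).join('') // Document
}

// Hand-built tree; token positions are omitted for brevity.
const tree = {
  nodeType: 'document',
  content: {
    children: [{
      nodeType: 'tag',
      content: {
        name: 'div',
        selfClosing: false,
        openStart: { content: '<div' },
        attributes: [{
          key: { content: 'class' },
          startWrapper: { content: '"' },
          value: { content: 'a' },
          endWrapper: { content: '"' }
        }],
        openEnd: { content: '>' },
        children: [
          // 'text' as a nodeType string is an assumption of this sample.
          { nodeType: 'text', content: { value: { content: 'hi' } } },
          {
            nodeType: 'tag',
            content: { name: 'br', selfClosing: true, openStart: { content: '<br' }, openEnd: { content: '/>' } }
          }
        ],
        close: { content: '</div>' }
      }
    }]
  }
}

console.log(serialize(tree))
```

This approach does not reproduce the original whitespace inside tags, so it is a normalizing serializer rather than an exact round trip.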

Support SVGs within HTML

The main issue with parsing SVG is that tags inside it, such as path, can be self-closing in one place and have children in another. This is not the case in HTML, where the list of self-closing tags is known upfront. We need to figure out how to accommodate this duality.

SVG 2 Specs

Example of tag duality from the animateMotion tag spec. The last path has children.

<svg width="5cm" height="3cm"  viewBox="0 0 500 300"
     xmlns="http://www.w3.org/2000/svg">
  <desc>Example animMotion01 - demonstrate motion animation computations</desc>
  <rect x="1" y="1" width="498" height="298"
        fill="none" stroke="blue" stroke-width="2" />
  <path id="path1" d="M100,250 C 100,50 400,50 400,250"
        fill="none" stroke="blue" stroke-width="7.06"  />
  <circle cx="100" cy="250" r="17.64" fill="blue"  />
  <path d="M-25,-12.5 L25,-12.5 L 0,-87.5 z"
        fill="yellow" stroke="red" stroke-width="7.06"  >
    <animateMotion dur="6s" repeatCount="indefinite" rotate="auto" >
       <mpath href="#path1"/>
    </animateMotion>
  </path>
</svg>

Unclear on usage

So if I just want to load an HTML string and have it look for .someClassName or #someId to get the text inside the matching elements, how do I use this?
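
Nothing in this README describes a built-in selector API, so a lookup like this would have to walk the AST by hand. The sketch below shows one way to do that using the documented node shapes. All helper names (getAttribute, findAll, byId, byClass, textOf) are hypothetical, and the sample tree is hand-built; in real code it would come from constructTree(tokens).

```javascript
// Hypothetical helpers, not part of Hyntax.
function getAttribute(node, name) {
  const attributes = (node.content && node.content.attributes) || []
  const match = attributes.find((a) => a.key && a.key.content === name)
  return match && match.value ? match.value.content : undefined
}

// Depth-first search collecting every node that satisfies the predicate.
function findAll(node, predicate, found = []) {
  if (predicate(node)) found.push(node)
  const children = (node.content && node.content.children) || []
  children.forEach((child) => findAll(child, predicate, found))
  return found
}

const byId = (id) => (node) => getAttribute(node, 'id') === id
const byClass = (name) => (node) =>
  (getAttribute(node, 'class') || '').split(/\s+/).includes(name)

// Concatenate the text of all descendant text nodes. A text node is
// recognized by shape: its content has `value` but no tag or comment tokens.
function textOf(node) {
  const { value, openStart, start, children = [] } = node.content
  if (value && !openStart && !start) return value.content
  return children.map(textOf).join('')
}

// Fixture builders for the hand-built sample (positions omitted).
const token = (content) => ({ content })
const attr = (key, value) => ({ key: token(key), value: token(value) })

const sampleAst = {
  nodeType: 'document',
  content: {
    children: [{
      nodeType: 'tag',
      content: {
        name: 'div',
        attributes: [attr('id', 'main'), attr('class', 'box big')],
        children: [
          { nodeType: 'text', content: { value: token('Hello ') } },
          {
            nodeType: 'tag',
            content: {
              name: 'span',
              attributes: [attr('class', 'box')],
              children: [{ nodeType: 'text', content: { value: token('world') } }]
            }
          }
        ]
      }
    }]
  }
}

const main = findAll(sampleAst, byId('main'))[0]
console.log(textOf(main))
console.log(findAll(sampleAst, byClass('box')).map((n) => n.content.name))
```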

Tokenizer breaks at nested script tags

The tokenizer breaks on script tags when a </script> occurs somewhere in the script. For example:

const { tokens } = tokenize(`
<script>
  console.log("<script></script>")
</script>
`);
console.log(JSON.stringify(tokens, undefined, 2));

I would expect the tokenizer to return a single script-content token whose content is console.log("<script></script>"). Instead, it ends the script's content at the </script> inside the JS string.

Full output:

[
  {
    "type": "token:text",
    "content": "\n",
    "startPosition": 0,
    "endPosition": 0
  },
  {
    "type": "token:open-tag-start-script",
    "content": "<script",
    "startPosition": 1,
    "endPosition": 7
  },
  {
    "type": "token:open-tag-end-script",
    "content": ">",
    "startPosition": 8,
    "endPosition": 8
  },
  {
    "type": "token:script-tag-content",
    "content": "\n  console.log(\"<script>",
    "startPosition": 9,
    "endPosition": 32
  },
  {
    "type": "token:close-tag-script",
    "content": "</script>",
    "startPosition": 33,
    "endPosition": 41
  },
  {
    "type": "token:text",
    "content": "\")\n",
    "startPosition": 42,
    "endPosition": 44
  },
  {
    "type": "token:close-tag",
    "content": "</script>",
    "startPosition": 45,
    "endPosition": 53
  },
  {
    "type": "token:text",
    "content": "\n",
    "startPosition": 54,
    "endPosition": 54
  }
]

Wrong parsing in children node of escapable raw text elements

Tags such as <textarea> and <title> cannot contain other tags, because both are escapable raw text (RCDATA) elements. When such a tag is parsed, the resulting node should not have child nodes with nodeType "tag".

Try code in HTML like the following:
<title>test<img></title>

In Hyntax's parsed result, it is displayed like this:
[image]

Tree-shaking does not work for Hyntax modules

See this issue for more context.

Parcel and other bundlers seem to bundle polyfills for Node.js streams even when Hyntax stream modules are not imported into the user's code. For example, this code:

const { tokenize, constructTree } = require('hyntax')

will result in Stream polyfill in the bundle.

As a workaround, users can import from individual files:

const tokenize = require('hyntax/lib/tokenize')
const constructTree = require('hyntax/lib/construct-tree')

TreeConstructor.NodeContents.Tag.close is actually optional

README:

interface Tag {
  name: string
  selfClosing: boolean
  openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStart>
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEnd>
  children?: AnyNode[]
  close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTag>
}

TypeScript declarations:

close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTag>

Actual AST (https://astexplorer.net/#/gist/a0e72b2be44f5c23214c4c238093510c/39df950f9fcd76b7df249427dbbca2f4809cf235):

{
  "nodeType": "document",
  "content": {
    "children": [
      {
        "nodeType": "tag",
        "parentRef": "[Circular ~]",
        "content": {
          "openStart": {
            "type": "token:open-tag-start",
            "content": "<br",
            "startPosition": 0,
            "endPosition": 2
          },
          "name": "br",
          "openEnd": {
            "type": "token:open-tag-end",
            "content": "/>",
            "startPosition": 4,
            "endPosition": 5
          },
          "selfClosing": true
        }
      }
    ]
  }
}

Fix coverage reporting

Right now the pipeline that reports coverage to coveralls.io is broken. We need to find a good, lightweight pipeline runner and resurrect the coverage reporting.

Add TypeScript typings

Add index.d.ts with all the Hyntax types and refactor the README to use TypeScript type conventions.

Generate HTML from the AST?

Hey, thanks for putting this together. Unlike most of the other parsers, it doesn't shift tokens around.

Just wondering if you wrote the other side of this library yet? Generating HTML from a modified AST?
