HTML-Table-to-JSON

Author: Colin Tremblay

Date: Saturday, 16th November, 2013

About

Easily scrape and parse a table stored on a web page.

This project is still in ALPHA, meaning it is not fully functional!

The current version is 0.7.

The project currently works on most HTML tables. Nested tables and tables containing malicious input are not yet handled.

Features similar to those in the JavaScript version (https://github.com/lightswitch05/table-to-json) by @lightswitch05 are being added incrementally.

Usage

To get the parser, download the four PHP files in 'src'.

To use it, include HTMLTable2JSON.php in your PHP file, create a new HTMLTable2JSON object, and call tableToJSON($url).
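
As a minimal sketch of those three steps, assuming the files from 'src' were copied into a src directory next to your script (the path and the example URL are placeholders, not part of the project):

```php
<?php
// Assumed location of the library; adjust to wherever you copied the 'src' files.
$library = __DIR__ . '/src/HTMLTable2JSON.php';

// Placeholder URL: replace with a page that contains an HTML table.
$url = 'https://example.com/page-with-table.html';

if (is_file($library)) {
    require_once $library;

    $parser = new HTMLTable2JSON();
    // Every optional argument keeps its default here, including printJSON = TRUE,
    // which per the option list below creates a JSON file.
    $parser->tableToJSON($url);
} else {
    echo "HTMLTable2JSON.php not found at $library\n";
}
```

The is_file() guard only keeps the sketch from failing when the library is missing; in a real script a plain require_once is enough.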

Optional Arguments

  • firstColIsRowName
    • Boolean indicating whether the first column in the table should be parsed as the title for all values in the row.
    • Default: TRUE
  • tableID
    • String containing the ID of the table. Lets the user select a particular table. By default, the first table encountered on the page is used.
    • Default: ''
  • ignoreColumns
    • Array of column indexes to ignore.
    • Format: array(0 => firstColToIgnore, 1 => secondColToIgnore) OR array(firstIndex, secondIndex).
    • Default: NULL
  • headers
    • Array of header names.
    • Format: array(colNum1 => header1, colNum2 => header2).
    • Default: NULL
  • firstRowIsData
    • Boolean indicating whether the first row contains data (not headers).
    • Choosing TRUE treats the first row as data regardless of <th> tags. DO NOT choose this if there are headers in the first row that you want to override.
    • Default: FALSE
  • onlyColumns
    • Array of column indexes to include; all others are ignored.
    • Format: array(0 => firstColToInclude, 1 => secondColToInclude) OR array(firstIndex, secondIndex).
    • Default: NULL
  • arrangeByRows
    • Choosing FALSE treats cells as discrete objects. Cells are arranged in arrays by column, where each cell has properties of name, column title, row title, span number, and URL (if applicable).
    • Choosing TRUE treats each cell as a value for the attribute indicated in the column header. With this option, rows are arranged in an array, with column_title : cell_title pairs as attributes.
    • Default: FALSE
  • ignoreHidden
    • Boolean indicating whether rows tagged with style="display: none;" should be left out of the output.
    • Setting TRUE suppresses hidden rows.
    • Default: FALSE
  • printJSON
    • Boolean indicating whether the program should create a JSON file or simply return the output to the caller.
    • Setting TRUE creates a JSON file. Setting FALSE leaves the output in the hands of the caller.
    • Default: TRUE
  • testingTable
    • String representing an HTML table. Allows user to manually input a table for conversion, instead of scraping from a webpage.
    • When set, whatever value is in url is ignored.
    • Default: NULL
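
To make the row-oriented shape described for arrangeByRows = TRUE more concrete, here is a standalone sketch built on PHP's DOMDocument. It does not call HTMLTable2JSON and is not the project's implementation: rowsFromTable() is a hypothetical helper, and the exact JSON the library emits may differ.

```php
<?php
// Standalone illustration: pairs each cell with its column title, one array per row.
// rowsFromTable() is a hypothetical helper, not part of HTMLTable2JSON.
function rowsFromTable($html)
{
    $previous = libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $headers = array();
    $rows = array();
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = array();
        $hasTh = false;
        foreach ($tr->childNodes as $node) {
            $tag = strtolower($node->nodeName);
            if ($tag !== 'th' && $tag !== 'td') {
                continue; // skip whitespace text nodes and other children
            }
            $hasTh = $hasTh || $tag === 'th';
            $cells[] = trim($node->textContent);
        }
        if (count($cells) === 0) {
            continue; // nothing to record for an empty row
        }
        if ($hasTh && count($headers) === 0) {
            $headers = $cells; // first row containing <th> supplies the column titles
            continue;
        }
        $row = array();
        foreach ($cells as $i => $value) {
            // Fall back to the column index when no title is known.
            $key = isset($headers[$i]) ? $headers[$i] : $i;
            $row[$key] = $value;
        }
        $rows[] = $row;
    }
    return $rows;
}

$html = '<table>'
      . '<tr><th>Name</th><th>Score</th></tr>'
      . '<tr><td>Ada</td><td>42</td></tr>'
      . '<tr><td>Linus</td><td>37</td></tr>'
      . '</table>';

echo json_encode(rowsFromTable($html)), "\n";
// prints [{"Name":"Ada","Score":"42"},{"Name":"Linus","Score":"37"}]
```

With arrangeByRows = FALSE the description above instead calls for one object per cell, grouped by column, which this sketch does not attempt.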

Note about PHP and optional arguments: the arguments are positional, so if you wish to set an argument lower on the list, you must still supply a value for every argument above it. To leave an option unchanged, pass NULL for it.
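
For instance, changing only arrangeByRows might look like the sketch below. It assumes the positional order of tableToJSON matches the list above; the library path and the URL are placeholders.

```php
<?php
$library = __DIR__ . '/src/HTMLTable2JSON.php';      // assumed path to the library
$url = 'https://example.com/page-with-table.html';   // placeholder URL

if (is_file($library)) {
    require_once $library;

    $parser = new HTMLTable2JSON();
    $parser->tableToJSON(
        $url,
        NULL,  // firstColIsRowName: left unchanged
        NULL,  // tableID: left unchanged (first table on the page)
        NULL,  // ignoreColumns: left unchanged
        NULL,  // headers: left unchanged
        NULL,  // firstRowIsData: left unchanged
        NULL,  // onlyColumns: left unchanged
        TRUE   // arrangeByRows: the one option being changed
    );
}
```

Arguments after the last one you change (here ignoreHidden, printJSON and testingTable) can simply be omitted.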

sample.php has examples of the correct usage.

TODO

  • Override cell names/data
  • Handle more HTML tags within a cell (links are already handled).
  • Do something with the style and/or class of a given cell, row, or column. This information could be useful in some cases and shouldn't be hard to access.
  • Code cleanup/refactor

For support, feedback, suggestions etc. please email [email protected]

License

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.
