table_extractor's People

Contributors

zjensen262

table_extractor's Issues

Can't find model 'en_core_web_sm'

Thank you very much for your continued maintenance of this tool. However, when I ran it with Python 3.7.12, I still got the same error that other users reported earlier: [E050] Can't find model 'en_core_web_sm'. I tried this on July 12, 2022. Is it because the Python version is too high, or is there another reason?
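
One way to narrow this down, as a minimal sketch: spaCy models installed through pip are ordinary Python packages, so the standard library can tell whether `en_core_web_sm` is importable from the interpreter that runs the tool. The helper names `model_is_installed` and `download_command` are made up for illustration; the command they build is spaCy's own `python -m spacy download` CLI.

```python
import importlib.util
import sys


def model_is_installed(model_name: str) -> bool:
    """Return True if a top-level package called model_name is importable."""
    return importlib.util.find_spec(model_name) is not None


def download_command(model_name: str) -> str:
    """Build the spaCy download call for the interpreter that is running now."""
    return f'"{sys.executable}" -m spacy download {model_name}'


if __name__ == "__main__":
    name = "en_core_web_sm"
    if model_is_installed(name):
        print(f"{name} is installed for {sys.executable}")
    else:
        print(f"{name} is missing; try: {download_command(name)}")
```

If the model turns out to be installed for a different interpreter than the one Jupyter uses, that alone could explain the E050 error, independent of the Python version.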

tutorial fails with example html files, out of bounds error

Dear Jensen

A colleague wants to make a database of zeolites and their synthesis conditions. I therefore tried to run this table extractor to extract data about zeolite synthesis conditions.

However, when I run the table extractor tutorial in Jupyter Notebook on the file 101021acscgd8b00078.html, I get the error shown below.

To get the script running, I installed scikit-learn version 0.19.1 instead of sklearn.
When I installed sklearn, I got a warning about a version mismatch, and I could not find the correct version of sklearn.

Furthermore, I had to replace 'en' on line 27 with 'en_core_web_sm', because 'en' was not recognized.

Perhaps this is also because of a different package version.
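
To make version mismatches like these easier to report, the installed versions of the relevant packages can be printed in one go. This is only a sketch: the function name `installed_versions` is hypothetical, the package list is taken from the packages mentioned in this issue, and `importlib.metadata` requires Python 3.8 or newer (on older interpreters, `pip freeze` gives the same information).

```python
from importlib import metadata  # standard library from Python 3.8 onward


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(["scikit-learn", "gensim", "spacy"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```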

Do you still remember which package versions you used, and perhaps how to solve this issue?
Thank you.

`Extracting Tables (HTML) from: 10.1021/acs.cgd.8b00078
3 3 3


IndexError Traceback (most recent call last)
in ()
1 # Supported domain names = zeolites, geopolymers, steel, titanium, aluminum, alloys
2 te = TableExtractor(domain_name='zeolites')
----> 3 tables = te.extract_and_save_all_tables(files, dois)

/home/johan/ownCloud/Documents/postdoc Amsterdam/zeoliteextroctor for ilse/table_extractor-master/tableextractor/table_extractor.py in extract_and_save_all_tables(self, files, dois)
52 print(len(tables), len(captions), len(footers))
53 cols, rows, col_inds, row_inds = self.get_headers(tables)
---> 54 pred_cols, pred_rows = self.classify_table_headers(cols, rows)
55 orients = []
56 composition_flags = []

/home/johan/ownCloud/Documents/postdoc Amsterdam/zeoliteextroctor for ilse/table_extractor-master/tableextractor/table_extractor.py in classify_table_headers(self, cols, rows)
828 vect_rows = []
829 for col, row in zip(cols, rows):
--> 830 vect_c, label = self.vectorize_words(col, [0]*len(col))
831 vect_r, label = self.vectorize_words(row, [0]*len(row))
832 vect_cols.append(self.clf.predict(vect_c))

/home/johan/ownCloud/Documents/postdoc Amsterdam/zeoliteextroctor for ilse/table_extractor-master/tableextractor/table_extractor.py in vectorize_words(self, words, labels)
817 for word, label in zip(words, labels):
818 if str(word) in self.embeddings:
--> 819 emb_vector.append(self.embeddings[str(word)])
820 label_vector.append(label)
821 else:

/home/johan/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py in __getitem__(self, entities)
351 if isinstance(entities, string_types):
352 # allow calls like trained_model['office'], as a shorthand for trained_model[['office']]
--> 353 return self.get_vector(entities)
354
355 return vstack([self.get_vector(entity) for entity in entities])

/home/johan/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py in get_vector(self, word)
469
470 def get_vector(self, word):
--> 471 return self.word_vec(word)
472
473 def words_closer_than(self, w1, w2):

/home/johan/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
2132 return word_vec
2133 for nh in ngram_hashes:
-> 2134 word_vec += ngram_weights[nh]
2135 return word_vec / len(ngram_hashes)
2136

IndexError: index 1998876 is out of bounds for axis 0 with size 1990189

`
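
The traceback shows the IndexError being raised inside the `self.embeddings[str(word)]` lookup, after the `in self.embeddings` check has already passed. A possible workaround, sketched under the assumption that skipping such words is acceptable, is to catch the error per word. `vectorize_words_safely` and the `FakeEmbeddings` stand-in are hypothetical names, not part of table_extractor or gensim, and the original `else:` branch, which is not visible in the traceback, is not reproduced. This does not address the underlying cause, which may be a mismatch between the installed gensim version and the one used to save the embeddings.

```python
def vectorize_words_safely(embeddings, words, labels):
    """Collect vectors for words whose lookup succeeds; skip the rest."""
    emb_vector, label_vector, skipped = [], [], []
    for word, label in zip(words, labels):
        try:
            vector = embeddings[str(word)]
        except (KeyError, IndexError):
            # Lookup failed (unknown word or broken n-gram index): skip it.
            skipped.append(str(word))
            continue
        emb_vector.append(vector)
        label_vector.append(label)
    return emb_vector, label_vector, skipped


class FakeEmbeddings:
    """Stand-in for the real embeddings object, used only for this demo."""

    def __init__(self, vectors, broken):
        self.vectors = vectors
        self.broken = set(broken)

    def __getitem__(self, word):
        if word in self.broken:
            raise IndexError("index out of bounds")
        return self.vectors[word]  # raises KeyError for unknown words


if __name__ == "__main__":
    emb = FakeEmbeddings({"Si/Al": [0.1, 0.2]}, broken={"gel"})
    print(vectorize_words_safely(emb, ["Si/Al", "gel", "unknown"], [0, 0, 0]))
```

Returning the skipped words makes it possible to see how much of each header is being dropped before trusting the classifier output.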

When I ran the script on the other HTML file, 101016jmicromeso201806043.html,
I got the following result:

Extracting Tables (HTML) from: 10.1016/j.micromeso.2018.06.043
0 0 0
Finished Extracting all Papers
Number Attempted: 1
Number Successful: 0
Number Failed: 0
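
According to the traceback above, the `0 0 0` line comes from `print(len(tables), len(captions), len(footers))`, so no tables were found in this file at all. A quick, independent check is to count the `<table>` start tags in the HTML with the standard library. This is a sketch only: `TableCounter` and `count_tables` are hypothetical names, and it assumes the publisher marks tables up as plain `<table>` elements, which may not hold if they are loaded by script or rendered as images.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> start tags while the document is parsed."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "table":
            self.count += 1


def count_tables(html_text: str) -> int:
    """Return the number of <table> elements opened in html_text."""
    parser = TableCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count


if __name__ == "__main__":
    sample = "<div><table><tr><td>1</td></tr></table><p>text</p></div>"
    print(count_tables(sample))
```

If the count for the saved file is also zero, the tables are probably missing from the downloaded HTML rather than being missed by the extractor.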

Only part of the code is open source

Hi,
I read your paper https://pubs.acs.org/doi/abs/10.1021/acscentsci.9b00193
and it is very similar to our work. It is exciting work, but I noticed that you only open-sourced the table extraction part of the code.
The text extraction part is very important as well, and I would like to know how you extracted the information from the tables as shown in zeolite_data/xlsx or csv.
Would you be willing to open-source the entire code?
This is very important to me and would greatly advance our work.
Thanks
