The table_extractor from olivettigroup

Can't find model 'en_core_web_sm'

Thank you very much for your continued maintenance of this tool. When I ran it with Python 3.7.12, I still got the error that other users have reported before: [E050] Can't find model 'en_core_web_sm'. I tried this on July 12, 2022. Is this because the Python version is too high, or is there another reason?
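
spaCy ships its language models as separate Python packages, so E050 usually means that the model package is not installed in the environment the interpreter is actually using, rather than that the Python version is too high. A minimal sketch of a check for this, using only the standard library, is given here; the helper names `model_is_installed` and `install_hint` are hypothetical and are not part of table_extractor or spaCy.

```python
import importlib.util

MODEL_NAME = "en_core_web_sm"


def model_is_installed(name):
    """Return True if a top-level package called `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


def install_hint(name):
    """Return the spaCy command that downloads a model package."""
    return "python -m spacy download " + name


if __name__ == "__main__":
    if model_is_installed(MODEL_NAME):
        print(MODEL_NAME, "is installed in this environment")
    else:
        print(MODEL_NAME, "is missing; try:", install_hint(MODEL_NAME))
```

Run the check with the same interpreter (or Jupyter kernel) that runs the tutorial, since a model installed for a different Python installation will not be found.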

tutorial fails with example html files, out of bounds error

Dear Jensen

A colleague wants to make a database of zeolites and their synthesis conditions. I therefore tried to run this table extractor to extract data about zeolite synthesis conditions.

However, when I try to run the table extractor tutorial in Jupyter Notebook on the file 101021acscgd8b00078.html, I get the error shown below.

To get the script running, I installed scikit-learn version 0.19.1 instead of sklearn.
When I installed sklearn, I got a warning about a version mismatch, and I could not find the correct version of sklearn.

Furthermore, I had to replace 'en' on line 27 with 'en_core_web_sm', as it did not recognize 'en'.

Perhaps this is also because of a different package version.

Do you still remember which package versions you used, and perhaps how to solve this issue?
Thank you.

Extracting Tables (HTML) from: 10.1021/acs.cgd.8b00078
3 3 3

IndexError Traceback (most recent call last)
in ()
1 # Supported domain names = zeolites, geopolymers, steel, titanium, aluminum, alloys
2 te = TableExtractor(domain_name='zeolites')
----> 3 tables = te.extract_and_save_all_tables(files, dois)

/home/johan/ownCloud/Documents/postdoc Amsterdam/zeoliteextroctor for ilse/table_extractor-master/tableextractor/table_extractor.py in extract_and_save_all_tables(self, files, dois)
52 print(len(tables), len(captions), len(footers))
53 cols, rows, col_inds, row_inds = self.get_headers(tables)
---> 54 pred_cols, pred_rows = self.classify_table_headers(cols, rows)
55 orients = []
56 composition_flags = []

/home/johan/ownCloud/Documents/postdoc Amsterdam/zeoliteextroctor for ilse/table_extractor-master/tableextractor/table_extractor.py in classify_table_headers(self, cols, rows)
828 vect_rows = []
829 for col, row in zip(cols, rows):
--> 830 vect_c, label = self.vectorize_words(col, [0]*len(col))
831 vect_r, label = self.vectorize_words(row, [0]*len(row))
832 vect_cols.append(self.clf.predict(vect_c))

/home/johan/ownCloud/Documents/postdoc Amsterdam/zeoliteextroctor for ilse/table_extractor-master/tableextractor/table_extractor.py in vectorize_words(self, words, labels)
817 for word, label in zip(words, labels):
818 if str(word) in self.embeddings:
--> 819 emb_vector.append(self.embeddings[str(word)])
820 label_vector.append(label)
821 else:

/home/johan/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py in __getitem__(self, entities)
351 if isinstance(entities, string_types):
352 # allow calls like trained_model['office'], as a shorthand for trained_model[['office']]
--> 353 return self.get_vector(entities)
354
355 return vstack([self.get_vector(entity) for entity in entities])

/home/johan/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py in get_vector(self, word)
469
470 def get_vector(self, word):
--> 471 return self.word_vec(word)
472
473 def words_closer_than(self, w1, w2):

/home/johan/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
2132 return word_vec
2133 for nh in ngram_hashes:
-> 2134 word_vec += ngram_weights[nh]
2135 return word_vec / len(ngram_hashes)
2136

IndexError: index 1998876 is out of bounds for axis 0 with size 1990189

When I ran the script on the other HTML file, 101016jmicromeso201806043.html, I got the following result:

Extracting Tables (HTML) from: 10.1016/j.micromeso.2018.06.043
0 0 0
Finished Extracting all Papers
Number Attempted: 1
Number Successful: 0
Number Failed: 0
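
In the traceback for the first file, the IndexError is raised inside gensim's n-gram lookup after `str(word) in self.embeddings` has already returned True. One possible stopgap, assuming the cause is a mismatch between the installed gensim version and the one used to save the embeddings (this is a guess, not confirmed by the maintainers), is to treat such words as out-of-vocabulary instead of letting the whole run fail. The sketch below is not the project's code: `FakeEmbeddings` and `vectorize_words_safely` are hypothetical names, and skipping unknown words is an assumption, since the original `else` branch is not shown in the traceback.

```python
class FakeEmbeddings:
    """Hypothetical stand-in for the real embeddings object (illustration only)."""

    def __init__(self, vectors, broken_words):
        self.vectors = vectors
        self.broken_words = set(broken_words)

    def __contains__(self, word):
        # Mimics the reported behaviour: the membership test succeeds ...
        return word in self.vectors or word in self.broken_words

    def __getitem__(self, word):
        # ... but the actual lookup fails for some words.
        if word in self.broken_words:
            raise IndexError("index out of bounds for axis 0")
        return self.vectors[word]


def vectorize_words_safely(embeddings, words, labels):
    """Collect vectors for known words; record words whose lookup fails."""
    emb_vector, label_vector, skipped = [], [], []
    for word, label in zip(words, labels):
        key = str(word)
        try:
            if key in embeddings:
                emb_vector.append(embeddings[key])
                label_vector.append(label)
            else:
                skipped.append(key)  # assumption: unknown words are skipped
        except (KeyError, IndexError):
            skipped.append(key)  # lookup failed, treat as out-of-vocabulary
    return emb_vector, label_vector, skipped


if __name__ == "__main__":
    emb = FakeEmbeddings({"Si": [0.1, 0.2]}, broken_words=["SiO2/Al2O3"])
    print(vectorize_words_safely(emb, ["Si", "SiO2/Al2O3", "xyz"], [0, 0, 0]))
```

This only hides the symptom. If the version-mismatch guess is right, installing the gensim version the embeddings were saved with would be the cleaner fix, and the returned `skipped` list shows how many header words were lost in the meantime.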

Just opensource a part of code

Hi,
I read your paper https://pubs.acs.org/doi/abs/10.1021/acscentsci.9b00193, and it is very similar to our work. It is exciting work, but I noticed that you have open-sourced only the table extraction part of the code.
Text extraction is also very important, and I would like to know how you extracted the information from the tables, as shown in zeolite_data/xlsx or csv.
Could you open-source the entire code?
This is very important to me and would greatly advance our work.
Thanks
