oleehyo / texteller Goto Github PK

TexTeller converts images to LaTeX formulas (image2latex, LaTeX OCR) with higher accuracy and stronger generalization, enabling it to cover most usage scenarios.

Python 99.71% Shell 0.14% Batchfile 0.15%

image2text latex-ocr

texteller's Introduction


𝚃𝚎𝚡𝚃𝚎𝚕𝚕𝚎𝚛


TexTeller is an end-to-end formula recognition model based on ViT, capable of converting images into corresponding LaTeX formulas.

TexTeller was trained on 7.5M image-formula pairs (dataset available here). Compared to LaTeX-OCR, which used a 100K dataset, TexTeller has stronger generalization ability and higher accuracy, covering most use cases (except for scanned images and handwritten formulas).

🔄 Change Log

📮[2024-03-25] TexTeller 2.0 released! The training data for TexTeller 2.0 has been increased to 7.5M samples (about 15 times more than TexTeller 1.0, with improved data quality). TexTeller 2.0 demonstrates superior performance on the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.

There are more test images here and a horizontal comparison of recognition models from different companies.
📮[2024-04-12] Trained a formula detection model, thereby enhancing the capability to detect and recognize formulas in entire documents (whole-image inference)!
📮[2024-05-02] Added support for mixed Chinese-English formula recognition (Beta).

🔑 Prerequisites

python=3.10

pytorch

Only CUDA versions >= 12.0 have been fully tested, so it is recommended to use CUDA version >= 12.0

🚀 Getting Started

Clone the repository:

git clone https://github.com/OleehyO/TexTeller

Install pytorch
Install the project's dependencies:
```
pip install -r requirements.txt
```

Enter the TexTeller/src directory and run the following command in the terminal to start inference:

python inference.py -img "/path/to/image.{jpg,png}" 
# use --inference-mode option to enable GPU(cuda or mps) inference
#+e.g. python inference.py -img "./img.jpg" --inference-mode cuda
# use -mix option to enable mixed text and formula recognition
#+e.g. python inference.py -img "./img.jpg" -mix -lang "en"

The first time you run it, the required checkpoints will be downloaded from Hugging Face.

Important

If you use mixed text and formula recognition, you must also download the formula detection model weights.
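
If you would rather call the script from Python than type the command by hand, the flags shown above can be assembled programmatically. The following is a minimal sketch and not part of the repository: `build_inference_command` and `run_inference` are hypothetical helpers, and they rely only on the flags documented above (`-img`, `--inference-mode`, `-mix`, `-lang`).

```python
import subprocess
import sys
from typing import List, Optional


def build_inference_command(
    img_path: str,
    inference_mode: Optional[str] = None,  # "cuda" or "mps"; None leaves the flag out
    mix: bool = False,                     # enable mixed text and formula recognition
    lang: Optional[str] = None,            # e.g. "en"; only passed together with -mix
) -> List[str]:
    """Hypothetical helper: assemble the argument list for inference.py."""
    cmd = [sys.executable, "inference.py", "-img", img_path]
    if inference_mode is not None:
        cmd += ["--inference-mode", inference_mode]
    if mix:
        cmd.append("-mix")
        if lang is not None:
            cmd += ["-lang", lang]
    return cmd


def run_inference(img_path: str, **options) -> str:
    """Run inference.py (expects the TexTeller/src working directory) and return stdout."""
    result = subprocess.run(
        build_inference_command(img_path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script exits with an error
    )
    return result.stdout
```

For example, `run_inference("./img.jpg", inference_mode="cuda")` would correspond to the second command above.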

🌐 Web Demo

Go to the TexTeller/src directory and run the following command:

./start_web.sh

Enter http://localhost:8501 in a browser to view the web demo.

Note

If you are a Windows user, please run the start_web.bat file instead.

🧠 Full Image Inference

TexTeller also supports formula detection and recognition on full images, allowing for the detection of formulas throughout the image, followed by batch recognition of the formulas.

Download Weights

Download the model weights from this link and place them in src/models/det_model/model.

TexTeller's formula detection model was trained on a total of 11,867 images, consisting of 3,415 images from Chinese textbooks (over 130 layouts) and 8,272 images from the IBEM dataset.

Formula Detection

Run the following command in the TexTeller/src directory:

python infer_det.py

This detects all formulas in the full image and saves the results in TexTeller/src/subimages.

Batch Formula Recognition

After formula detection, run the following command in the TexTeller/src directory:

python rec_infer_from_crop_imgs.py

This will use the results of the previous formula detection to perform batch recognition on all cropped formulas, saving the recognition results as txt files in TexTeller/src/results.
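
Once both steps have finished, the recognized formulas can be gathered for further processing. The sketch below is an illustration only: `collect_results` is a hypothetical helper, and it assumes each recognition result is a UTF-8 `.txt` file sitting directly in the results directory, which the description above suggests but does not spell out.

```python
from pathlib import Path
from typing import Dict


def collect_results(results_dir: str) -> Dict[str, str]:
    """Map each result file's stem to the recognized LaTeX string it contains."""
    results = {}
    # Sort so the output order is stable across runs and platforms.
    for txt_path in sorted(Path(results_dir).glob("*.txt")):
        results[txt_path.stem] = txt_path.read_text(encoding="utf-8").strip()
    return results
```

Run from the TexTeller/src directory, `collect_results("results")` would return a dictionary keyed by file name (without the extension).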

📡 API Usage

We use Ray Serve to provide an API for TexTeller, allowing you to integrate TexTeller into your own projects. To start the server, enter the TexTeller/src directory and run the following command:

python server.py  # default settings

| Parameter | Description |
| --- | --- |
| `-ckpt` | The path to the weights file, default is TexTeller's pretrained weights. |
| `-tknz` | The path to the tokenizer, default is TexTeller's tokenizer. |
| `-port` | The server's service port, default is 8000. |
| `--inference-mode` | Whether to use GPU (cuda or mps) for inference, default is CPU. |
| `--num_beams` | The number of beams for beam search, default is 1. |
| `--num_replicas` | The number of service replicas to run on the server, default is 1 replica. You can use more replicas to achieve greater throughput. |
| `--ncpu_per_replica` | The number of CPU cores used per service replica, default is 1. |
| `--ngpu_per_replica` | The number of GPUs used per service replica, default is 1. You can set this value between 0 and 1 to run multiple service replicas on one GPU to share the GPU, thereby improving GPU utilization. (Note: if `--num_replicas` is 2 and `--ngpu_per_replica` is 0.7, then 2 GPUs must be available.) |
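
The GPU requirement in the last row follows from simple arithmetic: a replica cannot be split across two GPUs, so with 0.7 GPU per replica only one replica fits on each device. The sketch below works this out for other settings. `gpus_required` is a hypothetical helper written for illustration, and the packing rule it encodes is an assumption about how fractional GPU requests are scheduled.

```python
import math


def gpus_required(num_replicas: int, ngpu_per_replica: float) -> int:
    """Estimate the number of GPUs needed, assuming a replica never spans two GPUs."""
    if num_replicas < 1:
        raise ValueError("num_replicas must be at least 1")
    if ngpu_per_replica == 0:
        return 0  # CPU-only replicas need no GPU
    if not 0 < ngpu_per_replica <= 1:
        raise ValueError("this sketch only covers fractions in the range (0, 1]")
    # How many replicas fit on a single GPU; the epsilon guards against float rounding.
    replicas_per_gpu = math.floor(1 / ngpu_per_replica + 1e-9)
    return math.ceil(num_replicas / replicas_per_gpu)


print(gpus_required(2, 0.7))  # the case from the table: one replica per GPU
print(gpus_required(2, 0.5))  # two replicas share a single GPU
```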

Note

A client demo can be found at TexTeller/client/demo.py. Refer to demo.py for how to send requests to the server.

🏋️‍♂️ Training

Dataset

We provide an example dataset in the TexTeller/src/models/ocr_model/train/dataset directory. You can place your own images in the images directory and annotate each image with its corresponding formula in formulas.jsonl.

After preparing your dataset, change the DIR_URL variable in .../dataset/loader.py to your own dataset's path.
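
Writing formulas.jsonl by hand gets tedious for larger datasets, so a small script can help. The sketch below is only an illustration: `write_formulas_jsonl` is a hypothetical helper, and the key names `file_name` and `latex_formula` are assumptions. Check the example dataset shipped with the repository for the keys it actually uses before relying on this.

```python
import json
from typing import Iterable, Tuple


def write_formulas_jsonl(pairs: Iterable[Tuple[str, str]], out_path: str) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for file_name, formula in pairs:
            # Key names are assumed; adjust them to match the example dataset.
            record = {"file_name": file_name, "latex_formula": formula}
            # ensure_ascii=False keeps any non-ASCII text in formulas readable.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Each line of the resulting file then annotates exactly one image in the images directory.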

Retraining the Tokenizer

If you are using a different dataset, you might need to retrain the tokenizer to obtain a different dictionary. After configuring your dataset, you can train your own tokenizer with the following steps:

In TexTeller/src/models/tokenizer/train.py, change new_tokenizer.save_pretrained('./your_dir_name') to your custom output directory

If you want to use a different dictionary size (default is 10k tokens), you need to change the VOCAB_SIZE variable in TexTeller/src/models/globals.py
In the TexTeller/src directory, run the following command:
```
python -m models.tokenizer.train
```

Training the Model

Modify num_processes in src/train_config.yaml to match the number of GPUs available for training (default is 1).

In the TexTeller/src directory, run the following command:

accelerate launch --config_file ./train_config.yaml -m models.ocr_model.train.train

You can set your own tokenizer and checkpoint paths in TexTeller/src/models/ocr_model/train/train.py (refer to train.py for more information). If you are using the same architecture and dictionary as TexTeller, you can also fine-tune TexTeller's default weights with your own dataset.

In TexTeller/src/globals.py and TexTeller/src/models/ocr_model/train/train_args.py, you can change the model's architecture and training hyperparameters.

Note

Our training scripts use the Hugging Face Transformers library, so you can refer to their documentation for more details and configurations on training parameters.

🚧 Limitations

Does not support scanned images and PDF document recognition
Does not support handwritten formulas

📅 Plans

~~Train the model with a larger dataset (7.5M samples, coming soon)~~
Recognition of scanned images
PDF document recognition + Support for English and Chinese scenarios
Inference acceleration
...

⭐️ Stargazers over time

💖 Acknowledgments

Thanks to LaTeX-OCR which has brought me a lot of inspiration, and im2latex-100K which enriches our dataset.

👥 Contributors


texteller's Issues

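Can this model run on llama.cpp?

For production deployment, can the TexTeller model be run with a pure C++ framework such as llama.cpp or fastllm.cpp? It would be helpful to have easy-to-deploy q4/q8 models and deployment files such as actions or a Dockerfile.

First of all, thanks for sharing. Could you tell us about the pretrained model, or share how to train from scratch? Also, could you add support for mps?

Could you support recognizing images pasted with Ctrl+V?

On Windows, a file has to be uploaded before it can be recognized, which is rather cumbersome.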

latex detection dataset

https://github.com/microsoft/ArxivFormula

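Question about the mixed Chinese-English formula recognition in the May 2 update

What specific feature or recognition capability does "[2024-05-02] Support mixed Chinese English formula recognition." refer to? In my tests, formulas that contain Chinese characters are not recognized correctly, and the tokenizer_config.json and vocab.json of the HF model have not been updated with Chinese tokens either.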

Failed prediction. Is the picture quality not high enough?

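Support pasting images in addition to drag and drop

The Windows screenshot tool and Snipaste do not support direct drag and drop; they only support pasting images. I previously tried modifying web.py so that the client supports both drag and drop and pasting, but after the change drag and drop stopped working, so I could only refine it further to support pasting well. I hope both drag-and-drop upload and pasting can be supported.

My local changes to web.py are shown below. They only implement pasting an image from the clipboard: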

import base64
import io
import os
import tempfile

from streamlit_paste_button import paste_image_button

# Render a paste button; image_data holds the pasted image once one is available.
paste_result = paste_image_button(
    label="📋 Paste an image",
    background_color="#FF0000",
    hover_background_color="#380909",
    errors="raise",
)

if paste_result.image_data is not None:
    img = paste_result.image_data

    # Save the pasted image as a temporary PNG file.
    temp_dir = tempfile.mkdtemp()
    png_file_path = os.path.join(temp_dir, "image.png")
    img.save(png_file_path, format="PNG")

    # Read the file back and encode it as base64.
    with open(png_file_path, "rb") as file:
        buffered = io.BytesIO(file.read())
    img_base64 = base64.b64encode(buffered.getvalue()).decode()

On Windows 11, I set up the environment in Anaconda Navigator and used CPU mode. An error occurs during recognition. How should I fix it?

The error message is as follows:

RuntimeError: stack expects a non-empty TensorList
Traceback:
File "E:\anaconda\envs\TeXTeller310\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 535, in _run_script
    exec(code, module.__dict__)
File "C:\Users\啦啦啦\TexTeller\src\web.py", line 180, in <module>
    TexTeller_result = inference(
File "C:\Users\啦啦啦\TexTeller\src\models\ocr_model\utils\inference.py", line 27, in inference
    pixel_values = torch.stack(imgs)

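Nice work, but there seem to be some problems with the recognition rate

I tried recognizing formulas in a scanned PDF, and the accuracy was not great.

Thanks for sharing; local recognition results are abnormal

First, many thanks for sharing this open-source project and model. During local debugging, I found that some of my own test formula images are recognized incorrectly: the output keeps repeating itself. Could you help analyze this? (Tests on part of the training dataset provided by the project work fine.)

The recognition result is: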
[\begin{array}{c}\Delta\underline{L}_{\text{\tiny{\it{j}}}\text{\tiny{\it{i}}} \text{\tiny{\it{j}}}\text{\tiny{\it{i}}}\text{\tiny{\it{j}}}\text{\tiny{\it{i}}} \text{\tiny{\it{j}}}\text{\tiny{\it{i}}}\text{\tiny{\it{j}}}\text{\tiny{\it{i}}} \text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}} \text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}} \text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}} \text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}} \text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}} \text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}} \text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}} \text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}} \text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}} \text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}} \text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{j}}}\text{\tiny{\it{j}}}\text{\it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{j}}}\text{\tiny{\it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{ \it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{ \it{j}}}\text{\it{ \it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{ \it{j}}}\text{\tiny{\it{j}}}\text{\it{ \it{ \it{j}}}\text{\it{ \it{ \it{j}}}\text{\tiny{\it{ \it{ \it{j}}}\text{\it{ \it{j}}}\text{\it{ \it{ \it{ \it{j}}}\text{\it{ 
\it{ \it{ }}\text{ \it{ \it{ }}\text{ \it{ \it{ }}\text{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \it{ \bm{ \it}}}\text{ \bm{ \it{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \bm{ \

The recognition result is:
[\bar{b}\bar{=}\frac{\Delta\bar{T_{i}}}{\bar{k_{i}}\bar{=}\bar{k_{i}}\bar{=} \bar{1}\bar{1}\bar{1}\bar{1}\bar{0}\bar{0}\bar{0}\bar{0}\bar{+}\bar{1}\bar{1} \bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{1}\bar{1} \bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0} \bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0} \bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0} \bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0} \bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0} \bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0} \bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\bar{0}\

Also, could you confirm whether the model version currently provided on Hugging Face is already TexTeller 2.0 (trained on 7.5M samples)?
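
Outputs like the two above, where a short fragment repeats until the sequence is cut off, can be flagged automatically by the caller. The following is a rough sketch and not something the project provides: `has_degenerate_repetition` is a hypothetical helper, and the unit length (3 to 40 characters) and the threshold of 8 consecutive copies are arbitrary assumptions that would need tuning.

```python
import re


def has_degenerate_repetition(latex: str, min_repeats: int = 8) -> bool:
    """Return True if a short fragment repeats back-to-back at least min_repeats times."""
    # Drop whitespace so line wrapping inside the output does not hide a repeat.
    compact = re.sub(r"\s+", "", latex)
    # A unit of 3 to 40 characters followed by at least (min_repeats - 1) exact copies.
    pattern = re.compile(r"(.{3,40}?)\1{%d,}" % (min_repeats - 1))
    return pattern.search(compact) is not None


print(has_degenerate_repetition(r"\bar{0}" * 20))
print(has_degenerate_repetition(r"\frac{a}{b} + \sqrt{x^2 + y^2}"))
```

A check like this only detects the symptom; it does not explain why the model starts repeating.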

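How to save the weights after training

Hello, when training my own model, I was able to run python -m models.tokenizer.train and save my own tokenizer.
I then ran python -m models.ocr_model.train.train without adding your pretrained model, training on some of the datasets you provided.
After training, I do not know how to save the weights in the form you have on Hugging Face: https://huggingface.co/OleehyO/TexTeller/tree/main.
I want to load the saved weights into the web demo you wrote and reuse them.
What do I need to do to achieve this?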

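Low recognition rate on multi-line formulas, plus hallucination problems

Thank you very much for your work! I tried several formulas from https://gregorygundersen.com/blog/2019/12/23/random-fourier-features/. Single-line formulas are generally fine, but for multi-line formulas roughly 30-40% of the predictions have syntax errors or missing lines. When I split the multi-line formulas into single lines and took a screenshot of each, the results were much better. In addition, even single-line formulas sometimes suffer from hallucination, for example

$\mathbb R^D$ became $\mathbb S^2$. I used NUM_BEAM=4.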

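Render the output with KaTeX

The LaTeX engine I use is tectonic. For users who do not plan to install TeX Live, could the web demo render the output with KaTeX?

Collaboration on datasets

Hello, first of all, thank you very much for your contributions to LaTeX OCR. I downloaded the 550K dataset you provided. After data cleaning, it has about 1,000 tokens, most of which I feel make the dataset redundant and might be better removed, although that is not easy. I think cleaning the data through the Mathpix API may be a feasible approach, and I am currently doing this for the 550K dataset. Judging from the results, 600 tokens are enough for full coverage. However, I still lack enough high-quality, multi-line, complex printed-formula data like the 550K set (I am also building handwritten and scanned datasets). Perhaps we could collaborate: I can provide the Mathpix-cleaned data (my Mathpix quota is about 6 million). If you are interested, feel free to contact me.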
