
chinese-pdf-ocr's Introduction


chinese-pdf-ocr

Runs OCR on Chinese PDF files, using the OCR model from DayBreak-u/chineseocr_lite.

Screenshots: assets/demo.png, assets/demo_web.png

Project layout

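  • chineseocr_lite/
    A lightweight Chinese OCR model, taken from DayBreak-u/chineseocr_lite.
  • pdfocr.py
    The core logic for running OCR on PDF files. It first runs OCR on a single page of the PDF, then applies a graphics algorithm to the recognition results to split that page into paragraphs, and finally groups the OCR results by paragraph.
  • requirements.txt
    Lists the Python packages required by chineseocr_lite and pdfocr.py.
  • demo_gui/
    A small desktop program. It runs OCR on a range of pages of a given PDF, prints the results to the terminal, and visualizes the OCR result of the current page in a new window.
  • demo_web/
    A web application that runs in the browser. You can open a PDF on the page and run OCR on it; clicking a recognized region copies its text to the clipboard.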
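
The paragraph-grouping step described for pdfocr.py can be illustrated with a minimal sketch. This is not the project's actual algorithm, which the README does not specify. It assumes each OCR result is a hypothetical `(x, y, width, height, text)` tuple and starts a new paragraph whenever the vertical gap between consecutive lines exceeds a fraction of the line height (`gap_ratio` is an illustrative parameter).

```python
from typing import List, Tuple

# Hypothetical OCR result: (x, y, width, height, text), origin at top-left.
Box = Tuple[int, int, int, int, str]


def group_into_paragraphs(boxes: List[Box], gap_ratio: float = 0.8) -> List[str]:
    """Merge text lines into paragraphs using the vertical gap between lines.

    A new paragraph starts when the gap between the bottom of the previous
    line and the top of the current one exceeds gap_ratio * previous height.
    """
    paragraphs: List[List[str]] = []
    prev = None
    # Read top to bottom, then left to right.
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        x, y, w, h, text = box
        if prev is None or y - (prev[1] + prev[3]) > gap_ratio * prev[3]:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
        prev = box
    return ["".join(lines) for lines in paragraphs]


boxes = [
    (10, 10, 200, 20, "第一段第一行"),
    (10, 32, 200, 20, "第一段第二行"),
    (10, 90, 200, 20, "第二段"),
]
print(group_into_paragraphs(boxes))
```

A gap-based rule like this only handles single-column pages; multi-column layouts would need the horizontal positions as well.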

Installing the base dependencies

The requirements.txt file in the project root lists the Python packages required by chineseocr_lite and pdfocr.py. Install them with:

pip3 install -r requirements.txt

Running demo_gui

Change directory

cd demo_gui/

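Install poppler

poppler converts PDF pages to images and is used by the Python package pdf2image. See the installation instructions for your platform.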
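
Before running the demo, it can help to confirm that poppler is actually reachable. The sketch below uses only the standard library and assumes that the `pdftoppm` and `pdfinfo` executables, which ship with poppler, must be on `PATH`; the function name is made up for this example.

```python
import shutil


def find_poppler_tools(names=("pdftoppm", "pdfinfo")):
    """Return a dict mapping each poppler tool name to its path, or None if missing."""
    return {name: shutil.which(name) for name in names}


tools = find_poppler_tools()
missing = [name for name, path in tools.items() if path is None]
if missing:
    print("poppler tools not found on PATH:", ", ".join(missing))
else:
    print("poppler looks installed:", tools)
```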

Install the additional dependencies

demo_gui/requirements.txt lists the additional Python packages required by demo_gui/. Install them with:

pip3 install -r requirements.txt

Run the main program

python3 main.py --file <path to PDF file> --start <first page to OCR> --end <last page to OCR>

📘 Example
Run OCR on the file 1.pdf in the current directory, from page 150 through page 155.

python3 main.py --file ./1.pdf --start 150 --end 155
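
For readers curious how such a command line might be parsed, here is a minimal `argparse` sketch. It is an illustration, not the contents of demo_gui/main.py: treating all three options as required and `--end` as inclusive are assumptions based on the example above.

```python
import argparse


def parse_args(argv=None):
    """Parse --file, --start and --end; the page range is treated as inclusive."""
    parser = argparse.ArgumentParser(description="Run OCR on a range of PDF pages.")
    parser.add_argument("--file", required=True, help="path to the PDF file")
    parser.add_argument("--start", type=int, required=True, help="first page to OCR")
    parser.add_argument("--end", type=int, required=True, help="last page to OCR")
    args = parser.parse_args(argv)
    if args.start > args.end:
        parser.error("--start must not be greater than --end")
    return args


args = parse_args(["--file", "./1.pdf", "--start", "150", "--end", "155"])
print(args.file, list(range(args.start, args.end + 1)))
```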

Screenshot

Click the image showing the recognition result, then press any key to run OCR on the next page.

Running demo_web

Change directory

cd demo_web/

Install the additional dependencies

This demo uses the Flask package for its Python web backend.

pip3 install -r requirements.txt

Run the main program

python3 main.py

Open the web page

To access the service, enter the following address in your browser (no internet connection is required):

http://127.0.0.1:5000

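By default, the service is reachable only from the local machine, at address 127.0.0.1 on port 5000. To let other devices on the LAN access the page, or to use a different port, change the last line of demo_web/main.py to:

app.run(host='0.0.0.0', port=<port number>)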
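
If editing the source each time is inconvenient, the host and port could instead be read from the environment. The helper below is a sketch under that assumption; `OCR_WEB_HOST` and `OCR_WEB_PORT` are hypothetical variable names that the project does not define.

```python
import os


def flask_run_options(environ=None):
    """Build keyword arguments for Flask's app.run() from environment variables.

    OCR_WEB_HOST and OCR_WEB_PORT are hypothetical names chosen for this sketch.
    """
    environ = os.environ if environ is None else environ
    host = environ.get("OCR_WEB_HOST", "127.0.0.1")  # loopback only by default
    port = int(environ.get("OCR_WEB_PORT", "5000"))
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return {"host": host, "port": port}


# The last line of demo_web/main.py could then read:
#     app.run(**flask_run_options())
print(flask_run_options({"OCR_WEB_HOST": "0.0.0.0", "OCR_WEB_PORT": "8080"}))
```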

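⚠️ Note:
This service uses Flask's built-in web server, which is intended for development only and must not be used in production. To publish the service on the public internet, see my other project NJUST_HomeworkCollector.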

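Screenshot

After opening the page, first click the Upload PDF button in the top-left corner to load a PDF file into the local browser. Then use the Previous and Next buttons to move to the previous or next page of the PDF. Finally, click the OCR button in the top-right corner to run OCR on the current page. Recognized text is outlined with red boxes; click a box to copy the text inside it. Double-click the current page number after Page: to edit it and jump to a specific page.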

chinese-pdf-ocr's People

Contributors

newcomer00


chinese-pdf-ocr's Issues

pip3 install -r requirements.txt fails with an error

(venv) apple@MacBook-Pro chinese-pdf-ocr % pip3 install -r requirements.txt 
Collecting opencv-python==4.5.3.56 (from -r requirements.txt (line 1))
  Using cached opencv-python-4.5.3.56.tar.gz (89.2 MB)
  Installing build dependencies ... error
  error: subprocess-exited-with-error
  
  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 2
  ╰─> [117 lines of output]
      Ignoring numpy: markers 'python_version == "3.6" and platform_machine != "aarch64" and platform_machine != "arm64"' don't match your environment
      Ignoring numpy: markers 'python_version >= "3.6" and sys_platform == "linux" and platform_machine == "aarch64"' don't match your environment
      Ignoring numpy: markers 'python_version >= "3.6" and sys_platform == "darwin" and platform_machine == "arm64"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.7" and platform_machine != "aarch64" and platform_machine != "arm64"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.8" and platform_machine != "aarch64" and platform_machine != "arm64"' don't match your environment
      Collecting setuptools
        Using cached setuptools-69.5.1-py3-none-any.whl.metadata (6.2 kB)
      Collecting wheel
        Using cached wheel-0.43.0-py3-none-any.whl.metadata (2.2 kB)
      Collecting scikit-build
        Using cached scikit_build-0.17.6-py3-none-any.whl.metadata (14 kB)
      Collecting cmake
        Using cached cmake-3.29.2-py3-none-macosx_10_10_universal2.macosx_10_10_x86_64.macosx_11_0_arm64.macosx_11_0_universal2.whl.metadata (6.1 kB)
      Collecting pip
        Using cached pip-24.0-py3-none-any.whl.metadata (3.6 kB)
      Collecting numpy==1.19.3
        Using cached numpy-1.19.3.zip (7.3 MB)
        Installing build dependencies: started
        Installing build dependencies: finished with status 'done'
        Getting requirements to build wheel: started
        Getting requirements to build wheel: finished with status 'done'
      ERROR: Exception:
      Traceback (most recent call last):
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/cli/base_command.py", line 180, in exc_logging_wrapper
          status = run_func(*args)
                   ^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/cli/req_command.py", line 245, in wrapper
          return func(self, options, args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/commands/install.py", line 377, in run
          requirement_set = resolver.resolve(
                            ^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 95, in resolve
          result = self._result = resolver.resolve(
                                  ^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_vendor/resolvelib/resolvers.py", line 546, in resolve
          state = resolution.resolve(requirements, max_rounds=max_rounds)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_vendor/resolvelib/resolvers.py", line 397, in resolve
          self._add_to_criteria(self.state.criteria, r, parent=None)
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_vendor/resolvelib/resolvers.py", line 173, in _add_to_criteria
          if not criterion.candidates:
                 ^^^^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_vendor/resolvelib/structs.py", line 156, in __bool__
          return bool(self._sequence)
                 ^^^^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in __bool__
          return any(self)
                 ^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in <genexpr>
          return (c for c in iterator if id(c) not in self._incompatible_ids)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 47, in _iter_built
          candidate = func()
                      ^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 182, in _make_candidate_from_link
          base: Optional[BaseCandidate] = self._make_base_candidate_from_link(
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 228, in _make_base_candidate_from_link
          self._link_candidate_cache[link] = LinkCandidate(
                                             ^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 290, in __init__
          super().__init__(
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 156, in __init__
          self.dist = self._prepare()
                      ^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 222, in _prepare
          dist = self._prepare_distribution()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 301, in _prepare_distribution
          return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/operations/prepare.py", line 525, in prepare_linked_requirement
          return self._prepare_linked_requirement(req, parallel_builds)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/operations/prepare.py", line 640, in _prepare_linked_requirement
          dist = _get_prepared_distribution(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/operations/prepare.py", line 71, in _get_prepared_distribution
          abstract_dist.prepare_distribution_metadata(
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/distributions/sdist.py", line 54, in prepare_distribution_metadata
          self._install_build_reqs(finder)
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/distributions/sdist.py", line 124, in _install_build_reqs
          build_reqs = self._get_build_requires_wheel()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/distributions/sdist.py", line 101, in _get_build_requires_wheel
          return backend.get_requires_for_build_wheel()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_internal/utils/misc.py", line 745, in get_requires_for_build_wheel
          return super().get_requires_for_build_wheel(config_settings=cs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_impl.py", line 166, in get_requires_for_build_wheel
          return self._call_hook('get_requires_for_build_wheel', {
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_impl.py", line 321, in _call_hook
          raise BackendUnavailable(data.get('traceback', ''))
      pip._vendor.pyproject_hooks._impl.BackendUnavailable: Traceback (most recent call last):
        File "/Users/apple/Documents/docs/tools/chinese-pdf-ocr/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 77, in _build_backend
          obj = import_module(mod_path)
                ^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/local/Cellar/python@3.12/3.12.2_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/importlib/__init__.py", line 90, in import_module
          return _bootstrap._gcd_import(name[level:], package, level)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
        File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
        File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
        File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
        File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
        File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
        File "<frozen importlib._bootstrap_external>", line 995, in exec_module
        File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
        File "/private/var/folders/hx/mwmbxj2n04b0_ncsr9mzqdph0000gn/T/pip-build-env-iafmtrpx/overlay/lib/python3.12/site-packages/setuptools/__init__.py", line 9, in <module>
          import distutils.core
      ModuleNotFoundError: No module named 'distutils'
      
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 2
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
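
The log ends with `ModuleNotFoundError: No module named 'distutils'`, raised from `setuptools/__init__.py` inside pip's build environment under Python 3.12. distutils was removed from the standard library in Python 3.12, and pip is building opencv-python 4.5.3.56 and numpy 1.19.3 from source archives here rather than installing prebuilt wheels. A small diagnostic sketch (the function name is made up for illustration) shows whether the active interpreter is affected:

```python
import importlib.util
import sys


def distutils_status():
    """Report the interpreter version and whether `distutils` can be imported."""
    available = importlib.util.find_spec("distutils") is not None
    return {"python": sys.version_info[:2], "distutils_available": available}


status = distutils_status()
print(status)
if status["python"] >= (3, 12) and not status["distutils_available"]:
    print("distutils is missing; source builds that import it will fail")
```

If the interpreter is affected, one option is to create the virtual environment with an older Python release for which PyPI offers a prebuilt wheel of the pinned opencv-python version. Which releases qualify should be checked on PyPI rather than assumed.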
