sergio11 / document_search_engine_architecture

📄🚀 Unleash a powerful Document Search Engine with Apache NiFi for lightning-fast, comprehensive text indexing and search.

Home Page: https://sanchezsanchezsergio418.medium.com/an-architectural-approach-to-implement-a-large-scale-document-search-engine-based-on-apache-nifi-430cbe91065f?source=your_stories_page-------------------------------------

License: MIT License

Ruby 1.73% Shell 5.80% Dockerfile 3.66% Java 88.81%
kafka nifi nifi-templates docker tika tika-server hdfs elasticsearch logstash mongodb

document_search_engine_architecture's Introduction

📚 SearchForge: Crafting Powerful Document Searches with NiFi 🚀

🚀 This project presents an architectural approach to implementing a powerful document search engine, with Apache NiFi at the core of the system. 📚

In response to the growing demand for efficient document retrieval and analysis, the approach builds on the capabilities of Apache NiFi. This adaptable framework drives an Extract, Transform, Load (ETL) process that extracts metadata and content from a wide range of file formats. The result is a document search engine designed to meet the expectations of modern information retrieval systems. ✨

💡 The architecture also aims for scalability, flexibility, and performance. By combining technologies such as Apache Kafka, Docker, JWT, MongoDB, Spring, Spring Boot, Swagger, and Elasticsearch, the project provides a comprehensive and streamlined document management ecosystem that supports not just search, but a richer exploration of the information within documents. 🚀🔗💬

More Details

For comprehensive information about this project, check out this Medium article.

Main Components 🔧

  • ETL Process: Our ETL (Extract, Transform, Load) process is designed based on Apache NiFi's flow-based programming model, making it efficient at extracting metadata and content from various file formats.
  • Microservice Architecture: We've implemented a robust microservice architecture to interact with the platform, enabling tasks such as retrieving specific file metadata, initiating file processing, and executing complex searches with ease.
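To give a sense of what "executing complex searches" can involve, a search microservice typically turns user input into an Elasticsearch query body. The following is a minimal sketch and not the project's actual code: the class name and the field names `content` and `metadata.*` are assumptions made for illustration, while `multi_match` is a standard Elasticsearch query type.

```java
final class SearchQuerySketch {
    private SearchQuerySketch() {}

    /** Escapes characters that must not appear raw inside a JSON string. */
    static String escapeJson(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become hex escapes.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    /** Builds a multi_match query body over two hypothetical index fields. */
    static String buildQuery(String userInput) {
        return "{\"query\":{\"multi_match\":{\"query\":\""
                + escapeJson(userInput)
                + "\",\"fields\":[\"content\",\"metadata.*\"]}}}";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("quarterly \"sales\" report"));
    }
}
```

Escaping the user input before embedding it keeps quotes or backslashes in a search phrase from producing an invalid request body. A real service would more likely rely on a JSON library or an Elasticsearch client than on string concatenation.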

Main Goals 🎯

  • Fast & Efficient Search: Our search engine is optimized for speed and efficiency, providing a user experience comparable to other leading search engines.
  • Comprehensive Indexing: We extract and index all text within documents, covering both their metadata and their content.
  • Scalability: The architecture is designed to scale effortlessly, leveraging modern data movement technologies.
  • Diverse File Handling: It's capable of handling a large number of files in various formats, including very large ones.
  • High Availability: We've optimized the system to store vast amounts of data, maintaining multiple copies to ensure high availability and fault tolerance.
  • Integration Capabilities: The project is flexible, allowing seamless integration with external systems for complex tasks and platform usage scenarios.

Architecture Overview 🏛️

Several critical components underpin our project, including:

  • 📂 HDFS Cluster: We use a 3-datanode HDFS cluster to store original files for processing.
  • 🌟 Apache Tika: We utilize two versions of Apache Tika servers, one of which has OCR capabilities for content extraction from images and scanned PDFs.
  • 📤 SFTP Server: This serves as the entry point for the NiFi ETL process. A microservice uploads files to a shared directory, while a NiFi processor continuously polls for new additions.
  • 🔄 ETL Process: The NiFi ETL process moves files to the HDFS directory, determines their MIME type, and makes HTTP requests to the appropriate Apache Tika server for metadata and text content extraction. The data is then stored in a MongoDB collection, with process state updates published to Kafka.
  • Elasticsearch Integration: Complex searches are made possible by syncing data to Elasticsearch via a Logstash pipeline, as MongoDB lacks advanced search capabilities.
  • 📊 Data Exploration Tools: MongoDB Express and Kibana are employed to explore and visualize indexed data.
  • Microservice Coordination: A Consul agent continuously monitors service availability and network locations.
  • Authentication & Authorization: All exposed services require authentication and authorization, facilitated by obtaining identity from the SSO Keycloak Server through the API Gateway Service.
  • 🌉 API Gateway: The API Gateway microservice unifies all APIs into a single point of entry using Spring Cloud Gateway.
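The step in which the ETL process picks "the appropriate Apache Tika server" from a file's MIME type can be pictured with a small sketch. In the project this decision lives inside the NiFi flow, so the Java below only illustrates the idea. The host names, the class name, and the rule that images and PDFs go to the OCR-enabled server are all assumptions.

```java
import java.net.URI;
import java.util.Locale;

final class TikaRouter {
    // Hypothetical endpoints: real host names and ports depend on the deployment.
    static final URI PLAIN_TIKA = URI.create("http://tika-server:9998");
    static final URI OCR_TIKA = URI.create("http://tika-server-ocr:9998");

    private TikaRouter() {}

    /** Picks a Tika server for a MIME type (assumed rule: images and PDFs may need OCR). */
    static URI chooseServer(String mimeType) {
        String type = mimeType == null ? "" : mimeType.trim().toLowerCase(Locale.ROOT);
        // Drop parameters such as "; charset=utf-8" before comparing.
        int semicolon = type.indexOf(';');
        if (semicolon >= 0) {
            type = type.substring(0, semicolon).trim();
        }
        boolean needsOcr = type.startsWith("image/") || type.equals("application/pdf");
        return needsOcr ? OCR_TIKA : PLAIN_TIKA;
    }

    public static void main(String[] args) {
        for (String mime : new String[] {"image/png", "application/pdf", "text/plain"}) {
            System.out.println(mime + " -> " + chooseServer(mime));
        }
    }
}
```

Sending every PDF to the OCR server is a deliberate simplification: a MIME type alone cannot tell a scanned PDF from a text-based one, so a real flow might inspect the file further before routing.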

Technologies Used

  • Spring Boot 2.3.5 / Apache Maven 3.6.3.
  • Spring Boot Starter Actuator.
  • Spring Cloud Stream.
  • Spring Cloud Gateway.
  • Spring Cloud Starter Consul Discovery.
  • Spring Cloud Starter OpenFeign.
  • Springdoc OpenAPI.
  • Spring Boot Starter Security.
  • Spring Security OAuth2.
  • Elasticsearch - Logstash - Kibana (ELK Stack).
  • MongoDB.
  • MongoDB Express (Web-based MongoDB admin interface, written with Node.js and express).
  • Consul Server.
  • SSO Keycloak Server.
  • Hadoop HDFS.
  • Apache NiFi.
  • Apache Tika Server.
  • RabbitMQ / STOMP protocol.
  • Apache Kafka.
  • Kafka Rest Proxy.

Running Applications as Docker containers.

Rake Tasks

The available tasks are detailed below (rake --tasks)

| Task | Description |
|---|---|
| check_deployment_file_task | Check Deployment File |
| check_docker_task | Check Docker and Docker Compose Task |
| cleaning_environment_task | Cleaning Environment Task |
| deploy | Deploys the Document Search Engine architecture and laun... |
| login | Authenticating with existing credentials |
| start | Start Containers |
| status | Status Containers |
| stop | Stop Containers |
| undeploy | UnDeploy Document Search Engine architecture |

To start the platform, make sure you have Ruby installed, go to the root directory of the project, and run the rake deploy task. This task carries out a series of preliminary checks, discards images and volumes that are no longer needed, downloads all the images, and initializes the containers.

Containers Ports

The table below lists the port assigned to each service, so you can reach the web tools and other interfaces used to monitor the flow.

| Container | Port |
|---|---|
| Apache NiFi Dashboard UI | localhost:8080 |
| Hadoop Resource Manager | localhost:8081 |
| Kafka Topics UI | localhost:8082 |
| MongoDB Express | localhost:8083 |
| Kibana | localhost:8084 |
| Keycloak PGAdmin | localhost:8085 |
| Keycloak Admin UI | localhost:8086 |
| Consul Dashboard | localhost:8087 |
| RabbitMQ - Stomp Dashboard | localhost:8088 |
| Hadoop NameNode Dashboard | localhost:8089 |
| API Gateway SSH | localhost:2223 |
| SFTP Server | localhost:2222 |
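Once the containers are up, one quick way to confirm that the published ports respond is to attempt a TCP connection to each of them. The following is a minimal sketch that assumes the services are published on localhost exactly as listed above. The class and method names are illustrative, and an open port only shows that something is listening, not that the service behind it is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.LinkedHashMap;
import java.util.Map;

final class PortCheck {
    // Service names and ports copied from the table above.
    static final Map<String, Integer> PORTS = new LinkedHashMap<>();
    static {
        PORTS.put("Apache NiFi Dashboard UI", 8080);
        PORTS.put("Hadoop Resource Manager", 8081);
        PORTS.put("Kafka Topics UI", 8082);
        PORTS.put("MongoDB Express", 8083);
        PORTS.put("Kibana", 8084);
        PORTS.put("Keycloak PGAdmin", 8085);
        PORTS.put("Keycloak Admin UI", 8086);
        PORTS.put("Consul Dashboard", 8087);
        PORTS.put("RabbitMQ - Stomp Dashboard", 8088);
        PORTS.put("Hadoop NameNode Dashboard", 8089);
        PORTS.put("API Gateway SSH", 2223);
        PORTS.put("SFTP Server", 2222);
    }

    private PortCheck() {}

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        PORTS.forEach((name, port) ->
                System.out.printf("%-28s %5d  %s%n", name, port,
                        isOpen("localhost", port, 500) ? "open" : "closed"));
    }
}
```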

Some Videos

An architectural approach to implement a large-scale document search engine based on Apache Nifi

Microservice architecture to interact with the platform

Some screenshots

Below are some images that help illustrate how each part of the system works.

ETL Flow based on Apache NiFi

Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.

Events System based on Apache Kafka

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Apache Hadoop HDFS to store the files to be processed

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.

MongoDB to store the metadata and content of the files that have been processed.

Consul to coordinate microservices architecture.

SSO Keycloak Server

The entry point to the architecture.

ELK Stack

Visitors Count

Please Share & Star the repository to keep me motivated.

document_search_engine_architecture's Issues

Failing to launch some Services on Docker-compose

Hi Sergio11,

Awesome project you have and thanks for documenting it all out so well!

I have these 2 issues when running the docker-compose.yml

  1. Nifi service is failing to start when running the docker-compose file.
     2021-08-31 21:20:54,733 INFO [main] org.apache.nifi.NiFi Launching NiFi...
     2021-08-31 21:21:04,127 INFO [main] o.a.n.p.AbstractBootstrapPropertiesLoader Determined default application properties path to be '/opt/nifi/nifi-current/./conf/nifi.properties'
     2021-08-31 21:21:04,653 INFO [main] o.a.nifi.properties.NiFiPropertiesLoader Loaded 202 properties from /opt/nifi/nifi-current/./conf/nifi.properties
     2021-08-31 21:21:05,449 ERROR [main] o.a.nifi.properties.NiFiPropertiesLoader Clustered Configuration Found: Shared Sensitive Properties Key [nifi.sensitive.props.key] required for cluster nodes
     2021-08-31 21:21:05,479 ERROR [main] org.apache.nifi.NiFi Failure to launch NiFi due to java.lang.IllegalArgumentException: There was an issue decrypting protected properties
     java.lang.IllegalArgumentException: There was an issue decrypting protected properties
         at org.apache.nifi.NiFi.initializeProperties(NiFi.java:346)
         at org.apache.nifi.NiFi.convertArgumentsToValidatedNiFiProperties(NiFi.java:314)
         at org.apache.nifi.NiFi.convertArgumentsToValidatedNiFiProperties(NiFi.java:310)
         at org.apache.nifi.NiFi.main(NiFi.java:302)
     Caused by: org.apache.nifi.properties.SensitivePropertyProtectionException: Sensitive Properties Key [nifi.sensitive.props.key] not found: See Admin Guide section [Updating the Sensitive Properties Key]
         at org.apache.nifi.properties.NiFiPropertiesLoader.getDefaultProperties(NiFiPropertiesLoader.java:220)
         at org.apache.nifi.properties.NiFiPropertiesLoader.get(NiFiPropertiesLoader.java:209)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
         at java.lang.reflect.Method.invoke(Method.java:498)
         at org.apache.nifi.NiFi.initializeProperties(NiFi.java:341)
         ... 3 common frames omitted
  2. activiti/rabbitmq-stomp is archived and cannot be pulled when running the docker-compose. Is it okay to replace with an alternative like jorgeacetozi/rabbitmq-stomp?

Thanks!
