Hadoop Job Jar

This project includes tools to build a Hadoop job jar that can index documents from HDFS to Solr.

Features

  • Uses MapReduce to process large content sets.

  • Supports several types of content including SequenceFiles and CSV files.

  • Fully SolrCloud compatible.

Tested versions:

  • Solr 6.x and higher

  • Hadoop 2.4 and higher, 3.1 and higher

See the wiki page Tested Distros for more details on Hadoop distributions tested with this job jar.

Build Jars

Warning
You must use Java 8 or higher to build the .jar.

This repository uses Gradle to build the job jar. You do not need to install Gradle before starting the build; the build uses the Gradle wrapper (./gradlew) included in the repository.

The compiled job jar relies on code in a separate repository that is included with hadoop-solr as a submodule named solr-hadoop-common.

After cloning hadoop-solr, but before building the job jar, you must first initialize the solr-hadoop-common submodule by running the following commands from the top level of your hadoop-solr clone:

  • git submodule init to initialize your local configuration file for the submodule.

  • git submodule update to fetch all the data from the remote repository and check out the appropriate commit listed in the super-project.

  • If you are building from a branch, make sure that solr-hadoop-common points to the correct SHA (see https://github.com/blog/2104-working-with-submodules for more details):

   hadoop-solr $ git checkout <branch-name>
   hadoop-solr $ cd solr-hadoop-common
   hadoop-solr/solr-hadoop-common $ git checkout <SHA>
   hadoop-solr/solr-hadoop-common $ cd ..
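
For a typical build, the submodule setup is just the two commands described above, run from the top level of the clone; the SHA checkout shown above is only needed when building from a branch:

   hadoop-solr $ git submodule init
   hadoop-solr $ git submodule update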

Once the submodule has been pulled, you can build the job jar with this command:

./gradlew clean shadowJar --info

This produces the job jar, named solr-hadoop-job-4.0.0.jar, located in the solr-hadoop-job/build/libs directory.
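
As a quick sanity check (not part of the documented build steps), you can list the output directory to confirm the jar was created; the version number in the filename may differ depending on the release you built:

   ls solr-hadoop-job/build/libs/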

If you rebuild the job jar at a later time (to take advantage of updates), remember to update the submodule with git submodule update.

Note
The Gradle task will also produce solr-hadoop-tika-4.0.0.jar, located in the solr-hadoop-common/solr-hadoop-tika/build/libs directory. This jar is needed if you use the -Dlw.tika.process parameter described below. Using Tika is recommended when using the DirectoryIngestMapper or the ZipIngestMapper.

Troubleshooting

If SSH access to GitHub is not configured, the following error will be thrown:

    Cloning into 'solr-hadoop-common'...
    Permission denied (publickey).
    fatal: Could not read from remote repository.

    Please make sure you have the correct access rights
    and the repository exists.
    fatal: clone of 'git@github.com:LucidWorks/solr-hadoop-common.git' into submodule path 'solr-hadoop-common' failed

You can use the Generating an SSH Key tutorial to fix the problem.
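
To confirm that SSH access is working before re-running the submodule commands, you can test the connection to GitHub:

   ssh -T git@github.com

A successful setup responds with a greeting confirming that you have authenticated (GitHub does not provide shell access, so the command then exits).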

How it Works

The job jar runs a series of MapReduce-enabled jobs to convert raw content into documents for indexing to Solr.

Warning
You must run the job jar as a user that has permission to write to hadoop.tmp.dir. The /tmp directory in HDFS must also be writable.
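
One way to check this ahead of time (a quick sketch, not a required step) is to look at the permissions on /tmp in HDFS; if it is not writable for the job's user, an HDFS administrator can open it up:

   hadoop fs -ls -d /tmp
   hadoop fs -chmod 1777 /tmp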

The Hadoop job jar works in three stages designed to take in raw content and output results to Solr. These stages are:

  1. Create one or more SequenceFiles from the raw content. This is done in one of two ways:

    1. If the source files are available in a shared Hadoop filesystem, prepare a list of source files and their locations as a SequenceFile. The raw contents of each file are not processed until step 2.

    2. If the source files are not available, prepare a list of source files and the raw content. This process is done sequentially and can take a significant amount of time if there are a large number of documents and/or if they are very large.

  2. Run a MapReduce job to extract text and metadata from the raw content.

    • This process uses ingest mappers to parse documents and prepare them to be added to a Solr index. The section Ingest Mappers below provides a list of available mappers.

    • Apache Tika can also be used for additional document parsing and metadata extraction.

  3. Run a MapReduce job to send the extracted content from HDFS to Solr using the SolrJ client. This implementation works with SolrJ’s CloudServer Java client which is aware of where Solr is running via Zookeeper.

Note
Incremental indexing, where only changes to a document or directory are processed on successive job jar runs, is not supported. All three steps will be completed each time the job jar is run, regardless of whether the original content has changed.

The first step of the process converts the input content into a SequenceFile. In order to do this, the entire contents of each file must be read into memory so that it can be written out as an LWDocument in the SequenceFile. Thus, you should be careful to ensure that the system does not load into memory a file that is larger than the Java heap size of the process.

Ingest Mappers

Ingest mappers in the job jar parse documents and prepare them for indexing to Solr.

There are several available ingest mappers:

  • CSVIngestMapper

  • DirectoryIngestMapper

  • GrokIngestMapper

  • RegexIngestMapper

  • SequenceFileIngestMapper

  • SolrXMLIngestMapper

  • XMLIngestMapper

  • WarcIngestMapper

  • ZipIngestMapper

The ingest mapper is added to the job arguments with the -cls parameter. However, many mappers require additional arguments; refer to the wiki page Ingest Mapper Arguments for each mapper's required and optional arguments.
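
For example, selecting the CSV mapper means passing its fully qualified class name with -cls (the class name here is taken from the example job later in this document):

   -cls com.lucidworks.hadoop.ingest.CSVIngestMapper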

Using the Job Jar

The job jar allows you to index many different types of content stored in HDFS to Solr. It uses MapReduce to leverage the scaling qualities of Apache Hadoop while indexing content to Solr.

To use the job jar, you will need to initiate a job in your Hadoop cluster (using the hadoop jar command). Additional parameters (arguments) will be required. These arguments define the location of your data, how to parse your content, and the location of your Solr instance for indexing.

The job jar takes three types of arguments. These must be defined in the proper order, as shown below:

  • the main class

  • system and mapper-specific arguments

  • key-value pair arguments

These are discussed in more detail in the section Job Jar Arguments below.
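
Putting the three groups together, every invocation follows the same general shape (a schematic only; a concrete command appears in the Example Job section below):

   bin/hadoop jar /path/to/solr-hadoop-job-4.0.0.jar \
      com.lucidworks.hadoop.ingest.IngestJob \
      -Dsystem.or.mapper.argument=value ... \
      -argument value ...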

Important

The job jar can be run from any location, but requires a Hadoop client if used on a server where Hadoop (bin/hadoop) is not installed. A properly configured client allows the job jar to be submitted to Hadoop to run the job.

The specific client you need will vary depending on the Hadoop distribution vendor. Speak to your vendor for more information about how to download and configure a client for your distribution.

Job Jar Arguments

The job jar arguments allow you to define the type of content in HDFS, choose the ingest mappers appropriate for that content, and set other job parameters as needed.

There are three main sections to the job jar arguments:

  • the main class

  • system and mapper-specific arguments

  • key-value pair arguments

Warning
The arguments must be supplied in the above order.

The available arguments and parameters are described in the following sections.

Main Class

The main class must be specified. For all of the mappers available, it is always defined as com.lucidworks.hadoop.ingest.IngestJob.

System and Mapper-specific Arguments

System or mapper-specific arguments, defined with the pattern -Dargument=value, are supplied after the class name. In many cases, the arguments to use depend on the ingest mapper, which is defined later in the argument string.

The order of system-level or mapper-specific arguments does not matter, but they must be after the class name and before the key-value pair arguments.

For available system arguments, see System Arguments.

For ingest mapper arguments, see Ingest Mapper Arguments.

Other arguments not described in this repo’s documentation (such as Hadoop-specific system arguments) can be supplied as needed and they will be added to the Hadoop configuration. These arguments should be defined with the -Dargument=value syntax.
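
For instance, the example job later in this document sets the commit behavior and the CSV delimiter with two such -D arguments:

   -Dlww.commit.on.close=true -DcsvDelimiter=|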

Key-Value Pair Arguments

Key-value pair arguments apply to the ingest job generally. These arguments are expressed as -argument value. They are the last arguments in the command.

For more information see Key-Value Pair Arguments.
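
In the example job below, the key-value pairs select the ingest mapper class, the Solr collection, the input path, the output format, and the Solr location (as the values in that example suggest):

   -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c gettingstarted -i /data/CSV \
      -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -s http://localhost:8888/solr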

Example Job

This is a simple job request to index a CSV file which demonstrates the order of the arguments:

bin/hadoop jar /path/to/solr-hadoop-job-4.0.0.jar    (1)
   com.lucidworks.hadoop.ingest.IngestJob    (2)
   -Dlww.commit.on.close=true -DcsvDelimiter=|    (3)
   -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c gettingstarted -i /data/CSV -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -s http://localhost:8888/solr    (4)

We can summarize the proper order as follows:

  1. The Hadoop command to run a job. This includes the path to the job jar (as necessary).

  2. The main ingest class.

  3. Mapper arguments, which vary depending on the Mapper class chosen, in the format of -Dargument=value.

  4. Key-value arguments, which include the ingest mapper, Solr collection name, and other parameters, in the format of -argument value.

How to Contribute

  1. Fork this repo, i.e., <username|organization>/hadoop-solr, following the fork a repo tutorial. Then, clone the forked repo on your local machine:

    $ git clone https://github.com/<username|organization>/hadoop-solr.git
  2. Configure remotes with the configuring remotes tutorial.

  3. Create a new branch:

    $ git checkout -b new_branch
    $ git push origin new_branch

    Use the creating branches tutorial to create the branch from the GitHub UI if you prefer.

  4. Develop on the new_branch branch only; do not merge new_branch into your master. Commit changes to new_branch as often as you like:

    $ git add <filename>
    $ git commit -m 'commit message'
  5. Push your changes to GitHub.

    $ git push origin new_branch
  6. Repeat the commit & push steps until your development is complete.

  7. Before submitting a pull request, fetch upstream changes that were done by other contributors:

    $ git fetch upstream
  8. And update master locally:

    $ git checkout master
    $ git pull upstream master
  9. Merge master branch into new_branch in order to avoid conflicts:

    $ git checkout new_branch
    $ git merge master
  10. If conflicts happen, use the resolving merge conflicts tutorial to fix them.

  11. Push the merged changes to the new_branch branch:

    $ git push origin new_branch
  12. Add JUnit tests, as appropriate, to cover your changes.

  13. When all testing is done, use the create a pull request tutorial to submit your change to the repo.

Note

Please be sure that your pull request includes only your changes and no others. Check this using the command:

git diff new_branch upstream/master
