Apache Nutch README

For the latest information about Nutch, please visit our website at:

https://nutch.apache.org/

and our wiki, at:

https://cwiki.apache.org/confluence/display/NUTCH/Home

To get started using Nutch, read the tutorial:

https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial

Contributing

To contribute a patch, follow these instructions (note that installing Hub is not strictly required, but is recommended).

0. Download and install hub.github.com
1. File a JIRA issue for your fix at https://issues.apache.org/jira/projects/NUTCH/issues
- you will get an issue ID of the form NUTCH-xxx, where xxx is the issue number.
2. git clone https://github.com/apache/nutch.git
3. cd nutch
4. git checkout -b NUTCH-xxx
5. edit files (please try to include a test case if possible)
6. git status (make sure it shows what files you expected to edit)
7. Make sure that your code complies with the [Nutch code formatting template](https://raw.githubusercontent.com/apache/nutch/master/eclipse-codeformat.xml), which is basically two-space indents
8. git add <files>
9. git commit -m "fix for NUTCH-xxx contributed by <your username>"
10. git fork
11. git push -u <your git username> NUTCH-xxx
12. git pull-request
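
If Hub is not installed, steps 10 to 12 can be approximated with plain git once you have forked apache/nutch through the GitHub web interface. The sketch below only prints the commands rather than running them. The helper name `print_fork_commands`, the remote name, and the fork URL pattern (a fork that keeps the repository name `nutch`) are illustrative assumptions, and the pull request itself would still be opened in the browser.

```shell
#!/bin/sh
# Hypothetical helper: print plain-git equivalents of steps 10-12.
# Assumes the fork already exists on GitHub under the given username
# and kept the default repository name "nutch".
print_fork_commands() {
  user="$1"    # your GitHub username
  issue="$2"   # JIRA issue id used as the branch name, e.g. NUTCH-xxx
  echo "git remote add $user https://github.com/$user/nutch.git"
  echo "git push -u $user $issue"
  echo "# then open a pull request for branch $issue on github.com"
}

print_fork_commands "<your git username>" "NUTCH-xxx"
```

Printing instead of executing keeps the sketch safe to run anywhere; copy the output lines into your shell once the placeholders are filled in.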

IDE setup

Eclipse

Generate Eclipse project files

ant eclipse

and follow the instructions in Importing existing projects.

You must configure nutch-site.xml before running. Make sure you've added the http.agent.name and plugin.folders properties. plugin.folders normally points to <project_root>/build/plugins.
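
As a rough illustration of the two properties mentioned above, the following sketch writes a minimal nutch-site.xml to a scratch location. The helper name `write_nutch_site`, the agent name `MyNutchCrawler`, and the output path are placeholders chosen for this example, not values taken from the Nutch project.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal nutch-site.xml.
# $1 = output file, $2 = value for http.agent.name, $3 = value for plugin.folders
write_nutch_site() {
  cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$2</value>
  </property>
  <property>
    <name>plugin.folders</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# Write to a scratch file first so an existing conf/nutch-site.xml is not overwritten.
OUT="${TMPDIR:-/tmp}/nutch-site.xml.example"
write_nutch_site "$OUT" "MyNutchCrawler" "$PWD/build/plugins"
cat "$OUT"
```

Run it from the project root so that `$PWD/build/plugins` resolves to <project_root>/build/plugins, then review the result and copy it into your conf directory.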

Now create a Java Application configuration, choose org.apache.nutch.crawl.Injector as the main class, and add two paths as arguments. The first is the crawldb directory; the second is the URL directory from which the injector reads URLs. Now run your configuration.
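
The two Injector arguments can be prepared ahead of time. This is a minimal sketch: the `WORK_DIR` variable, the directory layout under it, and the file name `seed.txt` are assumptions made for the example.

```shell
#!/bin/sh
# Prepare the two Injector arguments: a crawldb path and a seed URL directory.
# WORK_DIR and the file name seed.txt are illustrative choices.
WORK_DIR="${WORK_DIR:-$(mktemp -d)}"
CRAWLDB="$WORK_DIR/crawl/crawldb"
URL_DIR="$WORK_DIR/urls"

mkdir -p "$URL_DIR"
# Seed list with one URL per line.
printf '%s\n' "https://nutch.apache.org/" > "$URL_DIR/seed.txt"

# Paste this line into the program arguments field of the run configuration.
echo "$CRAWLDB $URL_DIR"
```

Set `WORK_DIR` to a directory inside your checkout if you want the crawl data to persist between runs.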

If you still see the message No plugins found on paths of property plugin.folders="plugins", updating plugin.folders in nutch-default.xml works as a quick fix, but it is not recommended.
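
Before editing any configuration, it may help to confirm that the directory plugin.folders points to actually exists and is populated. The check below is a sketch, and the helper name `check_plugins_dir` and the `PLUGIN_DIR` variable are hypothetical. It assumes the default location <project_root>/build/plugins mentioned earlier.

```shell
#!/bin/sh
# Hypothetical check: does the directory named by plugin.folders contain anything?
check_plugins_dir() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  if [ -z "$(ls -A "$dir")" ]; then
    echo "empty: $dir"
    return 1
  fi
  echo "ok: $dir"
}

# PLUGIN_DIR is an assumed variable; default to the location named in this README.
check_plugins_dir "${PLUGIN_DIR:-$PWD/build/plugins}" \
  || echo "run the Ant build or correct plugin.folders in nutch-site.xml"
```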

Intellij IDEA

First install the IvyIDEA plugin, then run ant eclipse. This creates the .classpath and .project files that Intellij needs to import the project in the next step.

In Intellij IDEA, select File > New > Project from Existing Sources. Select the nutch home directory and click "Open".

On the "Import Project" screen select the "Import project from external model" radio button and select "Eclipse". Click "Create". On the next screen the "Eclipse projects directory" should be already set to the nutch folder. Leave the "Create module files near .classpath files" radio button selected. Click "Next" on the next screens. On the project SDK screen select Java 11 and click "Create". N.B. For anyone on a Mac with a homebrew-installed openjdk, you need to use the directory under libexec: <openjdk11_directory>/libexec/openjdk.jdk/Contents/Home.

Once the project is imported, you will see a pop-up saying "Ant build scripts found", "Frameworks detected - IvyIDEA Framework detected". Click "Import". If the pop-up does not appear, go through the steps again; this happens from time to time. A second Ant pop-up asks you to configure the project. Do NOT click "Configure".

To import the code style, go to Intellij IDEA > Preferences > Editor > Code Style > Java.

For the Scheme dropdown select "Project". Click the gear icon and select "Import Scheme" > "Eclipse XML file".

Select the eclipse-codeformat.xml file and click "Open". On the next screen, check the "Current Scheme" checkbox and click OK.

Running in Intellij IDEA

  • Open Run/Debug Configurations
  • Select "+" to create a new configuration and select "Application"
  • For "Main Class" enter a class with a main function (e.g. org.apache.nutch.indexer.IndexingJob).
  • For "Program Arguments" add the arguments needed for the class. You can get these by running the crawl executable for your job. Use fully qualified paths (e.g. /Users/kamil/workspace/external/nutch/crawl/crawldb /Users/kamil/workspace/external/nutch/crawl/segments/20221222160141 -deleteGone).
  • For "Working Directory" enter "/Users/kamil/workspace/external/nutch/runtime/local".
  • Select "Modify options" > "Modify Classpath" and add the config directory belonging to the "Working Directory" from the previous step (e.g. /Users/kamil/workspace/external/nutch/runtime/local/conf). This will allow the resource loader to load that configuration.
  • Select "Modify options" > "Add VM Options". Add the VM options needed. You can get these by running the crawl executable for your job (e.g. -Xmx4096m -Dhadoop.log.dir=/Users/kamil/workspace/external/nutch/runtime/local/logs -Dhadoop.log.file=hadoop.log -Dmapreduce.job.reduces=2 -Dmapreduce.reduce.speculative=false -Dmapreduce.map.speculative=false -Dmapreduce.map.output.compress=true)
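
The paths in the list above are specific to one machine. As a sketch, the same values can be derived from a single root so they stay consistent with each other. `NUTCH_RUNTIME` is an assumed variable name, and only a subset of the example VM options is shown.

```shell
#!/bin/sh
# Derive the run configuration values from one root instead of hard-coding paths.
# NUTCH_RUNTIME is an assumed variable name; point it at <nutch>/runtime/local.
NUTCH_RUNTIME="${NUTCH_RUNTIME:-$PWD/runtime/local}"

# Subset of the VM options from the example above, with the log path derived.
VM_OPTS="-Xmx4096m -Dhadoop.log.dir=$NUTCH_RUNTIME/logs -Dhadoop.log.file=hadoop.log"

echo "Working directory: $NUTCH_RUNTIME"
echo "Classpath entry:   $NUTCH_RUNTIME/conf"
echo "VM options:        $VM_OPTS"
```

Run it from the nutch home directory and paste each printed value into the matching field of the run configuration.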

Note: You will need to manually trigger a build through Ant to pick up your latest changes when running, because the Ant build system is separate from the Intellij one.
