
crawl_r's Introduction


VersionEye

This is the source code for the web application VersionEye.

Start the backend services for VersionEye

This project contains a docker-compose.yml file which describes the backend services of VersionEye. You can start the backend services like this:

docker-compose up -d

That will start:

  • MongoDB
  • RabbitMQ
  • ElasticSearch
  • Memcached

For persistence you should uncomment and adjust the mounted volumes in docker-compose.yml for MongoDB and ElasticSearch. If you are not interested in persisting the data on your host, you can leave it untouched.

Shutting down the backend services works like this:

docker-compose down

Configuration

All important configuration values are read from environment variables. Before you start VersioneyeCore.new you should adjust the values in script/set_vars_for_dev.sh and load them like this:

source ./script/set_vars_for_dev.sh

The most important environment variables are the ones for the backend services, which point to MongoDB, ElasticSearch, RabbitMQ and Memcached.
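As a minimal sketch of how such settings could be consumed, the snippet below builds a config hash from the environment with local defaults. The variable names (MONGO_HOST and so on) are assumptions for illustration, not necessarily the project's real names; check script/set_vars_for_dev.sh for the actual ones.

```ruby
# Read backend service settings from the environment, falling back to
# local defaults. Variable names here are illustrative assumptions.
def backend_config(env = ENV)
  {
    mongo_host:     env.fetch('MONGO_HOST', 'localhost'),
    mongo_port:     Integer(env.fetch('MONGO_PORT', '27017')),
    rabbit_host:    env.fetch('RABBITMQ_ADDR', 'localhost'),
    elastic_host:   env.fetch('ELASTICSEARCH_ADDR', 'localhost'),
    memcached_host: env.fetch('MEMCACHED_ADDR', 'localhost')
  }
end
```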

Install dependencies

If the backend services are all up and running and the environment variables are set correctly you can install the dependencies with bundler. If bundler is not installed on your machine run this command to install it:

gem install bundler

Then you can install the dependencies like this:

bundle install

Rails Server

If the dependencies are installed correctly you can start the Rails server like this:

rails s

Now the application should be available at http://localhost:3000.

Amazon S3

You can use the fake-s3 gem to simulate S3 offline: https://github.com/jubos/fake-s3. You can start the fake-s3 service like this:

fakes3 -r /tmp -p 4567

React.JS

Some parts of VersionEye are implemented in ReactJS. To auto-compile the JSX files, use the jsx Node package:

jsx --watch src/ build/

jsx can be installed via NPM with this command:

sudo npm install -g react-tools

Tests

The tests for this project run after each git push on CircleCI! For more details take a look at the circle.yml file in the root directory!

If the Docker containers for the backend services are running locally, the tests can be executed locally with this command:

./script/runtests_local.sh

Make sure that you followed the steps in the configuration section, before you run the tests!

Support

For commercial support send a message to [email protected].

License

VersionEye-Core is licensed under the MIT license!

Copyright (c) 2016 VersionEye GmbH

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

crawl_r's People

Contributors

reiz, timgluz


crawl_r's Issues

Elixir crawlers with API Key

The Hex API is rate limited. We need an API key, and the Hex crawler needs to be refactored so that it uses the API key for every request to the Hex API. If possible, the crawler should check how many requests remain and wait for N minutes if the rate limit is used up.
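The waiting logic can be kept separate from the HTTP code. Below is a small sketch that decides how long to sleep, assuming the API reports the remaining quota and a reset time (whether Hex exposes these, and under which header names, is an assumption to verify):

```ruby
# Returns how many seconds the crawler should sleep before the next
# request, given the remaining request quota and the epoch second at
# which the quota resets. Assumes the API exposes both values.
def seconds_to_wait(remaining, reset_epoch, now = Time.now.to_i)
  return 0 if remaining > 0
  wait = reset_epoch - now
  wait.positive? ? wait : 0
end
```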

Ruby Gem ICalendar license not recognized

It uses a different file name that is not included inside the crawler: COPYING

I suggested renaming the file to current GitHub standards but got no response so far.
icalendar/icalendar#170

Maybe the crawler should also look for COPYING file names.

The Github search also points out a lot more file names that could be helpful.
https://github.com/search?utf8=%E2%9C%93&q=COPYING+in%3Apath&type=Code&ref=searchresults

  • All extensions for the current filenames, not only .txt and .md, but also .gpl, ...
  • Sometimes the license is inside gpl.txt, gpl.md, apache.txt, ... ({license name}.ending)
  • It is also often inside subfolders named copying
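The suggestions above could be folded into a broader filename matcher. This is only a sketch; the set of base names and extensions below is illustrative, not exhaustive:

```ruby
# Matches common license file names: LICENSE/LICENCE/COPYING and
# well-known license names like gpl.txt, with or without an extension,
# case-insensitively. The lists here are a starting point only.
LICENSE_FILE = /\A(license|licence|copying|gpl|lgpl|apache|mit|bsd)(\.(txt|md|gpl))?\z/i

def license_file?(filename)
  !!(File.basename(filename) =~ LICENSE_FILE)
end
```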

It would also be interesting to know how many of the crawled repositories are currently missing a license.

License recognition for CSharp

Many packages on Nuget only provide a link to the license but no license name. During crawling we should check whether the license link matches a well-known source. For example, I found 64K license objects in our db where the URL was http://www.apache.org/licenses/LICENSE-2.0.html and the license name was nil. I updated them all and set the name to Apache-2.0. This logic should be part of the Nuget crawler. Check at least for these URLs:

http://www.apache.org/licenses/LICENSE-2.0.html
https://www.apache.org/licenses/LICENSE-2.0.html
https://opensource.org/licenses/MIT
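A sketch of the lookup, normalizing scheme and trailing slash so the http and https variants hit the same table entry. The table below covers only the URLs from this issue and would need to grow:

```ruby
# Map well-known license URLs to SPDX identifiers. Keys are stored
# without scheme and in lowercase so URL variants normalize to one entry.
LICENSE_URLS = {
  'www.apache.org/licenses/license-2.0.html' => 'Apache-2.0',
  'opensource.org/licenses/mit'              => 'MIT'
}.freeze

def spdx_for_url(url)
  key = url.to_s.downcase.sub(%r{\Ahttps?://}, '').chomp('/')
  LICENSE_URLS[key]
end
```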

License field has been moved for Rust Crates

I got an email notification for a Rust package I follow and noticed it is now missing license details.

I dug into the problem and noticed that the response model from the API has changed - they moved all the licenses onto versions:

curl https://crates.io/api/v1/crates/tokio-core?api_key=XYZ | jq
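Adapting to the new shape might look like the sketch below, which reads the license off the first (latest) entry in the versions array. The exact field names are assumptions based on this issue's description of the changed response:

```ruby
require 'json'

# Read the license from a crates.io API response where the license now
# lives on each version instead of on the crate itself.
def latest_license(api_response)
  data = JSON.parse(api_response)
  versions = data['versions'] || []
  versions.first && versions.first['license']
end
```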

No Cpan dependencies

I refactored the Cpan crawler: removed all threads and used RabbitMQ directly. In general everything seems to work, but somehow there is not a single Perl dependency in the db. Not even after 2000 packages have been crawled. Maybe a bug?

Move LicenseMatcher into own project

It's one of the reasons why the specs are slow - the token index must be rebuilt each time, and that takes time.

The fastest solution is to move everything into its own repo, but that doesn't speed up the creation of the index.

A better approach: use proper tools and refactor it into two processes, training and match, where the training process builds and saves the model, and the match process reuses the already existing model.
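The proposed split could be sketched as below, with Marshal used for persistence purely for brevity (the real project might prefer a safer serialization format, and the token index here is a toy stand-in for LicenseMatcher's real one):

```ruby
# Training step: build an inverted token index over a corpus of
# { id => text } pairs and persist it to disk.
def train(corpus, model_path)
  index = {}
  corpus.each do |id, text|
    text.downcase.scan(/\w+/).uniq.each do |token|
      (index[token] ||= []) << id
    end
  end
  File.binwrite(model_path, Marshal.dump(index))
end

# Match step: load the saved model instead of rebuilding it, then
# return the id with the highest token overlap.
def match(text, model_path)
  index = Marshal.load(File.binread(model_path))
  scores = Hash.new(0)
  text.downcase.scan(/\w+/).uniq.each do |token|
    (index[token] || []).each { |id| scores[id] += 1 }
  end
  scores.max_by { |_, score| score }&.first
end
```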

Add cargo links

@timgluz Every Rust package is available through a unique URL.

https://crates.io/crates/<PACKAGE>

For example:

https://crates.io/crates/aseprite

The crawler should create these links automatically so that we can display them on the VersionEye pages.
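A minimal sketch of building that link, escaping the package name so unusual characters cannot break the URL:

```ruby
require 'erb'

# Build the crates.io page URL for a Rust package, URL-encoding the name.
def crates_url(package_name)
  "https://crates.io/crates/#{ERB::Util.url_encode(package_name)}"
end
```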

Some Perl URLs can't be fetched

In the logs of the cpan crawler I can see stuff like this:

[wtf] Failed to parse JSON response from https://fastapi.metacpan.org/v1/release/FLORA/Catalyst-Runtime-5.80009?join=author - execution expired (pid:197)

But if I try the URL in the browser I get back valid JSON in less than 1 second. Maybe a redirect issue?
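If the cause is a redirect plus a tight timeout, a more forgiving fetch might look like this sketch: explicit open and read timeouts and manual redirect following. This is an illustration, not the crawler's actual fetch code:

```ruby
require 'net/http'
require 'json'

# Fetch a URL and parse the body as JSON, following up to
# redirect_limit redirects, with explicit timeouts.
def fetch_json(url, redirect_limit = 5, open_timeout: 10, read_timeout: 30)
  raise 'too many redirects' if redirect_limit.zero?

  uri = URI(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = uri.scheme == 'https'
  http.open_timeout = open_timeout
  http.read_timeout = read_timeout
  res = http.request(Net::HTTP::Get.new(uri))
  case res
  when Net::HTTPRedirection
    fetch_json(res['location'], redirect_limit - 1,
               open_timeout: open_timeout, read_timeout: read_timeout)
  when Net::HTTPSuccess
    JSON.parse(res.body)
  else
    raise "HTTP #{res.code} for #{url}"
  end
end
```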

Add Bitbucket crawler for Golang dependencies

According to go-search, there are about 2000 packages (low priority!) hosted on Bitbucket, and it would be nice if we could pull Go dependencies from there too, like we're doing for GitHub.

Nokogiri License is not recognized but it still includes valid License.txt

License file: https://github.com/sparklemotion/nokogiri/blob/master/LICENSE.txt
Somehow the algorithm has problems with the file contents.

Possibly caused by line breaks:

to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,

https://github.com/versioneye/crawl_r/blob/master/lib/versioneye/crawlers/license_crawler.rb, line 370

The matcher should be made more robust to line breaks.
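One way to make it robust is to normalize all whitespace runs to single spaces in both the reference text and the candidate before comparing, as in this sketch:

```ruby
# Collapse line breaks and repeated whitespace so two copies of the same
# license text compare equal regardless of how the lines were wrapped.
def normalize_license_text(text)
  text.gsub(/\s+/, ' ').strip.downcase
end
```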

On-demand crawling

Hello,

Here's my current use case: I use versioneye to track the dependencies of internal projects (mostly PHP-related). Obviously we do not use all the packages available on packagist. We would also like not to depend on VersionEye public API to obtain packages information. However, crawling all of packagist while we're probably interested in 500-1000 packages seems a bit excessive.

Thus, it would probably make sense to have tracked projects dictate which packages should be crawled. Then, the crawler/producer would only have to walk through these packages in order to fetch a minimal list of metadata.
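The core of that producer change could be as simple as the sketch below: collect the dependency names referenced by tracked projects and crawl only those. The data shape is hypothetical:

```ruby
# Given tracked projects (each with a :dependencies list of package
# names), return the de-duplicated set of packages worth crawling.
def packages_to_crawl(tracked_projects)
  tracked_projects.flat_map { |project| project[:dependencies] }.uniq
end
```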

Tests are slow & breaking

The complete test suite is quite slow and some tests are failing. I guess the tests for bower are slow. Please review and refactor them. Make the tests work again and, if possible, improve the speed.

Tests are failing

@timgluz One of the newly added tests is failing:

  2) PythonLicenseDetector detecting standard spdx license texts ignores popular non-sense
     Failure/Error: expect(spdx_id).to eq('GNU')

       expected: "GNU"
            got: nil

       (compared using ==)
     # ./spec/versioneye/utils/python_license_detector_spec.rb:58:in `block (3 levels) in <top (required)>'
