
crawl_r's Introduction


VersionEye

This is the source code for the web application VersionEye.

Start the backend services for VersionEye

This project contains a docker-compose.yml file which describes the backend services of VersionEye. You can start the backend services like this:

docker-compose up -d

That will start:

  • MongoDB
  • RabbitMQ
  • ElasticSearch
  • Memcached

For persistence you should uncomment and adjust the mounted volumes in docker-compose.yml for MongoDB and ElasticSearch. If you are not interested in persisting the data on your host, you can leave it untouched.

Shutting down the backend services works like this:

docker-compose down

Configuration

All important configuration values are read from environment variables. Before you start VersioneyeCore.new you should adjust the values in script/set_vars_for_dev.sh and load them like this:

source ./script/set_vars_for_dev.sh

The most important environment variables are the ones for the backend services, which point to MongoDB, ElasticSearch, RabbitMQ and Memcached.
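As a minimal sketch of how such settings could be consumed, the snippet below builds a config hash from the environment with local defaults. The variable names (MONGO_HOST and so on) are assumptions for illustration, not necessarily the project's real names; check script/set_vars_for_dev.sh for the actual ones.

```ruby
# Read backend service settings from the environment, falling back to
# local defaults. Variable names here are illustrative assumptions.
def backend_config(env = ENV)
  {
    mongo_host:     env.fetch('MONGO_HOST', 'localhost'),
    mongo_port:     Integer(env.fetch('MONGO_PORT', '27017')),
    rabbit_host:    env.fetch('RABBITMQ_ADDR', 'localhost'),
    elastic_host:   env.fetch('ELASTICSEARCH_ADDR', 'localhost'),
    memcached_host: env.fetch('MEMCACHED_ADDR', 'localhost')
  }
end
```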

Install dependencies

If the backend services are all up and running and the environment variables are set correctly you can install the dependencies with bundler. If bundler is not installed on your machine run this command to install it:

gem install bundler

Then you can install the dependencies like this:

bundle install

Rails Server

If the dependencies are installed correctly you can start the Rails server like this:

rails s

Now the application should be available at http://localhost:3000.

Amazon S3

You can use the fake-s3 gem to simulate S3 offline: https://github.com/jubos/fake-s3. You can start the fake-s3 service like this:

fakes3 -r /tmp -p 4567

React.JS

Some parts of VersionEye are implemented in ReactJS. To auto-compile the JSX files, use the jsx Node package:

jsx --watch src/ build/

jsx can be installed via NPM with this command:

sudo npm install -g react-tools

Tests

The tests for this project run after each git push on CircleCI! For more details take a look at the circle.yml file in the root directory!

If the Docker containers for the backend services are running locally, the tests can be executed locally with this command:

./script/runtests_local.sh

Make sure that you followed the steps in the configuration section, before you run the tests!

Support

For commercial support send a message to [email protected].

License

VersionEye-Core is licensed under the MIT license!

Copyright (c) 2016 VersionEye GmbH

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

crawl_r's People

Contributors

reiz, timgluz


crawl_r's Issues

Elixir crawlers with API Key

The Hex API is rate limited. We need an API key, and the Hex crawler needs to be refactored so that it uses the API key for every request to the Hex API. If possible, the crawler should check how many requests remain and wait for N minutes if the rate limit is used up.
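The waiting logic can be kept separate from the HTTP code. Below is a small sketch that decides how long to sleep, assuming the API reports the remaining quota and a reset time (whether Hex exposes these, and under which header names, is an assumption to verify):

```ruby
# Returns how many seconds the crawler should sleep before the next
# request, given the remaining request quota and the epoch second at
# which the quota resets. Assumes the API exposes both values.
def seconds_to_wait(remaining, reset_epoch, now = Time.now.to_i)
  return 0 if remaining > 0
  wait = reset_epoch - now
  wait.positive? ? wait : 0
end
```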

Ruby Gem ICalendar license not recognized

It uses a different file name that is not included inside the crawler: COPYING

I suggested renaming the file to current GitHub standards but got no response so far.
icalendar/icalendar#170

Maybe the crawler should also look for COPYING file names.

The Github search also points out a lot more file names that could be helpful.
https://github.com/search?utf8=%E2%9C%93&q=COPYING+in%3Apath&type=Code&ref=searchresults

  • All extensions for the current filenames, not only .txt and .md, but also .gpl, ...
  • Sometimes the license is inside gpl.txt, gpl.md, apache.txt, ... ({license name}.ending)
  • It is also often inside subfolders named copying
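The suggestions above could be folded into a broader filename matcher. This is only a sketch; the set of base names and extensions below is illustrative, not exhaustive:

```ruby
# Matches common license file names: LICENSE/LICENCE/COPYING and
# well-known license names like gpl.txt, with or without an extension,
# case-insensitively. The lists here are a starting point only.
LICENSE_FILE = /\A(license|licence|copying|gpl|lgpl|apache|mit|bsd)(\.(txt|md|gpl))?\z/i

def license_file?(filename)
  !!(File.basename(filename) =~ LICENSE_FILE)
end
```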

It would also be interesting to know how many of the crawled repositories are currently missing a license.

License recognition for CSharp

Many packages on Nuget only provide a link to the license but no license name. During crawling we should check whether the license link matches a well-known source. For example, I found 64K license objects in our db where the URL was http://www.apache.org/licenses/LICENSE-2.0.html and the license name was nil. I updated them all and set the name to Apache-2.0. This logic should be part of the Nuget crawler. Check at least for these URLs:

http://www.apache.org/licenses/LICENSE-2.0.html
https://www.apache.org/licenses/LICENSE-2.0.html
https://opensource.org/licenses/MIT
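A sketch of the lookup, normalizing scheme and trailing slash so the http and https variants hit the same table entry. The table below covers only the URLs from this issue and would need to grow:

```ruby
# Map well-known license URLs to SPDX identifiers. Keys are stored
# without scheme and in lowercase so URL variants normalize to one entry.
LICENSE_URLS = {
  'www.apache.org/licenses/license-2.0.html' => 'Apache-2.0',
  'opensource.org/licenses/mit'              => 'MIT'
}.freeze

def spdx_for_url(url)
  key = url.to_s.downcase.sub(%r{\Ahttps?://}, '').chomp('/')
  LICENSE_URLS[key]
end
```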

License field has been moved for Rust Crates

I got an email notification for a Rust package I follow and noticed it is now missing license details.

I dug into the problem and noticed that the response model from the API has changed - they moved all the licenses onto versions:

curl https://crates.io/api/v1/crates/tokio-core?api_key=XYZ | jq
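Adapting to the new shape might look like the sketch below, which reads the license off the first (latest) entry in the versions array. The exact field names are assumptions based on this issue's description of the changed response:

```ruby
require 'json'

# Read the license from a crates.io API response where the license now
# lives on each version instead of on the crate itself.
def latest_license(api_response)
  data = JSON.parse(api_response)
  versions = data['versions'] || []
  versions.first && versions.first['license']
end
```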

No Cpan dependencies

I refactored the Cpan crawler: removed all threads and used RabbitMQ directly. In general everything seems to work, but somehow there is not a single Perl dependency in the db. Not even after 2000 packages have been crawled. Maybe a bug?

Move LicenseMatcher into own project

It's one of the reasons why the specs are slow - the token index must be rebuilt each time, and that takes time.

The fastest solution is to move everything into its own repo, but that doesn't speed up the creation of the index.

A better approach: use proper tools and refactor it into two processes, training and match, where the training process builds and saves the model, and the match process reuses the already existing model.
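The proposed split could be sketched as below, with Marshal used for persistence purely for brevity (the real project might prefer a safer serialization format, and the token index here is a toy stand-in for LicenseMatcher's real one):

```ruby
# Training step: build an inverted token index over a corpus of
# { id => text } pairs and persist it to disk.
def train(corpus, model_path)
  index = {}
  corpus.each do |id, text|
    text.downcase.scan(/\w+/).uniq.each do |token|
      (index[token] ||= []) << id
    end
  end
  File.binwrite(model_path, Marshal.dump(index))
end

# Match step: load the saved model instead of rebuilding it, then
# return the id with the highest token overlap.
def match(text, model_path)
  index = Marshal.load(File.binread(model_path))
  scores = Hash.new(0)
  text.downcase.scan(/\w+/).uniq.each do |token|
    (index[token] || []).each { |id| scores[id] += 1 }
  end
  scores.max_by { |_, score| score }&.first
end
```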

Add cargo links

@timgluz Every Rust package is available through a unique URL.

https://crates.io/crates/<PACKAGE>

For example:

https://crates.io/crates/aseprite

The crawler should create these links automatically so that we can display them on the VersionEye pages.
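A minimal sketch of building that link, escaping the package name so unusual characters cannot break the URL:

```ruby
require 'erb'

# Build the crates.io page URL for a Rust package, URL-encoding the name.
def crates_url(package_name)
  "https://crates.io/crates/#{ERB::Util.url_encode(package_name)}"
end
```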

Some Perl URLs can't be fetched

In the logs of the cpan crawler I can see stuff like this:

[wtf] Failed to parse JSON response from https://fastapi.metacpan.org/v1/release/FLORA/Catalyst-Runtime-5.80009?join=author - execution expired (pid:197)

But if I try the URL in the browser I get back valid JSON in less than 1 second. Maybe a redirect issue?
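If the cause is a redirect plus a tight timeout, a more forgiving fetch might look like this sketch: explicit open and read timeouts and manual redirect following. This is an illustration, not the crawler's actual fetch code:

```ruby
require 'net/http'
require 'json'

# Fetch a URL and parse the body as JSON, following up to
# redirect_limit redirects, with explicit timeouts.
def fetch_json(url, redirect_limit = 5, open_timeout: 10, read_timeout: 30)
  raise 'too many redirects' if redirect_limit.zero?

  uri = URI(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = uri.scheme == 'https'
  http.open_timeout = open_timeout
  http.read_timeout = read_timeout
  res = http.request(Net::HTTP::Get.new(uri))
  case res
  when Net::HTTPRedirection
    fetch_json(res['location'], redirect_limit - 1,
               open_timeout: open_timeout, read_timeout: read_timeout)
  when Net::HTTPSuccess
    JSON.parse(res.body)
  else
    raise "HTTP #{res.code} for #{url}"
  end
end
```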

Add Bitbucket crawler for Golang dependencies

According to go-search, there are about 2000 packages (low priority!) hosted on Bitbucket, and it would be nice if we could pull Go dependencies from there too, like we're doing for GitHub.

Nokogiri License is not recognized but it still includes valid License.txt

License file: https://github.com/sparklemotion/nokogiri/blob/master/LICENSE.txt
Somehow the algorithm has problems with the file contents.

Possibly caused by line breaks:

to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,

https://github.com/versioneye/crawl_r/blob/master/lib/versioneye/crawlers/license_crawler.rb, line 370

The matcher should be made more robust to line breaks.
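One way to make it robust is to normalize all whitespace runs to single spaces in both the reference text and the candidate before comparing, as in this sketch:

```ruby
# Collapse line breaks and repeated whitespace so two copies of the same
# license text compare equal regardless of how the lines were wrapped.
def normalize_license_text(text)
  text.gsub(/\s+/, ' ').strip.downcase
end
```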

On-demand crawling

Hello,

Here's my current use case: I use versioneye to track the dependencies of internal projects (mostly PHP-related). Obviously we do not use all the packages available on packagist. We would also like not to depend on VersionEye public API to obtain packages information. However, crawling all of packagist while we're probably interested in 500-1000 packages seems a bit excessive.

Thus, it would probably make sense to have tracked projects dictate which packages should be crawled. Then, the crawler/producer would only have to walk through these packages in order to fetch a minimal list of metadata.
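The core of that producer change could be as simple as the sketch below: collect the dependency names referenced by tracked projects and crawl only those. The data shape is hypothetical:

```ruby
# Given tracked projects (each with a :dependencies list of package
# names), return the de-duplicated set of packages worth crawling.
def packages_to_crawl(tracked_projects)
  tracked_projects.flat_map { |project| project[:dependencies] }.uniq
end
```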

Tests are slow & breaking

The complete test suite is quite slow and some tests are failing. I guess the tests for bower are slow. Please review and refactor them. Make the tests work again and, if possible, improve the speed.

Tests are failing

@timgluz One of the newly added tests is failing:

  2) PythonLicenseDetector detecting standard spdx license texts ignores popular non-sense
     Failure/Error: expect(spdx_id).to eq('GNU')

       expected: "GNU"
            got: nil

       (compared using ==)
     # ./spec/versioneye/utils/python_license_detector_spec.rb:58:in `block (3 levels) in <top (required)>'
